Crazy at Errors in Netbackup 3.4 on Solaris 7

NMT_screen · ‎02-26-2009

Hi Gents,

I need your help with some serious problems on a NetBackup 3.4 server running on Solaris 7 with an Overland tape library. I have one NetBackup 3.4 server and a few media servers running under it. Backups to the tapes under one of the media servers were failing with status code 83. I suspected media errors and replaced the tapes for that media server group, but tapes under another media server then also returned status code 83, along with a lot of other errors, shown below. Please help me out of this trouble; it is driving me crazy.

/var/adm/messages

Feb 26 09:55:52 netbkpserver002 tldcd[22074]: TLD(2) key = 0x0, asc = 0x0, ascq = 0x0, NO ADDITIONAL SENSE INFORMATION

( Is this due to the tape library? Please advise. )
Feb 26 09:55:52 netbkpserver002 tldcd[22074]: TLD(2) Move_medium error: CHECK CONDITION

( Is this really due to a media error? )
Feb 26 10:02:25 netbkpserver002 tldcd[22386]: TLD(2) cannot dismount drive 2, slot 91 already is full
Feb 26 10:02:25 netbkpserver002 tldcd[22386]: TLD(2) key = 0x0, asc = 0x0, ascq = 0x0, NO ADDITIONAL SENSE INFORMATION
Feb 26 10:02:25 netbkpserver002 tldcd[22386]: TLD(2) Move_medium error: CHECK CONDITION

Feb 26 10:19:55 netbkpserver002 tldcd[22962]: TLD(2) expected barcode (media03 ) in slot 85, found barcode (--------)
Feb 26 10:23:42 netbkpserver002 tldcd[23000]: TLD(2) expected barcode (media03 ) in slot 85, found barcode (media01 )
Feb 26 10:27:28 netbkpserver002 tldcd[23034]: TLD(2) expected barcode (media03 ) in slot 85, found barcode (media01 )
Feb 26 10:28:13 netbkpserver002 tldcd[23055]: TLD(2) expected barcode (media03 ) in slot 85, found barcode (media01 )

( The volume database is not updated with the physical media locations. This happens by itself. Please advise. )

Feb 26 10:30:19 netbkpserver002 tldcd[23222]: TLD(2) cannot dismount drive 2, slot 91 already is full

Feb 26 10:31:48 netbkpserver002 ID[RICHPse.monlog.2000]: Disk state amber entered, Action: Move load from busy disks to idle disks
Feb 26 10:33:48 netbkpserver002 ID[RICHPse.monlog.2000]: Disk state white entered, Action: No activity

( I have no idea what the disk states amber and white mean. )

Feb 26 20:14:50 netbkpserver002 tldd[2402]: TLD(2) drive 1 (device 0) is being DOWNED, status: Robotic mount failure

( The LTO tape drives keep going down every 5 to 10 minutes even though I changed the drives in the tape library. Why? )

/usr/openv/netbackup/db/error

1235615750 1 4 16 production 26131 0 0 bkpwindowsATC001 bpsched backup of client bkpwindowsATC001 exited with status 83 (media open error)
1235615750 1 4 16 netbkpserver002 0 0 0 *NULL* bpsched scheduler exiting - media open error (83)

The robot inventory / media database is not in sync with the physical slots in the tape library.

1235651678 1 4 4 netbkpserver002 0 0 0 netbkpserver002 bpsched skipping backup of client netbkpserver002, class bkp_web2, schedule incrbkps because it has exceeded the configured number of tries
1235651678 1 4 4 netbkpserver002 0 0 0 bkpunix006 bpsched skipping backup of client bkpunix006, class bkp_web3, schedule incrbkps because it has exceeded the configured number of tries

( Why can so many tapes not be mounted? Is it a drive error or a media error? )

1235650486 1 132 16 netbkpserver002 26153 0 0 bkpClass bptm FREEZING media id bkp02, it is unmountable and cannot be used for backups
1235650488 1 132 16 netbkpserver002 26153 0 0 bkpClass bptm FREEZING media id bkp07, it is unmountable and cannot be used for backups

( The tapes get frozen every now and then. Is this due to media errors? )

************************************************************************************************

bpmedialist

Feb 26 20:14:52 netbkpserver002 vmd[2385]: media ID bkp04 has expired
Feb 26 20:14:52 netbkpserver002 vmd[2385]: media ID bkp05 has expired
Feb 26 20:14:58 netbkpserver002 vmd[2385]: media ID bkp04 has expired
Feb 26 20:14:58 netbkpserver002 vmd[2385]: media ID bkp05 has expired

( How can I put these media back into use for normal backups? )

I really need your help, gentlemen, since I am having a lot of problems with this old NetBackup 3.4 server. Please help me out.

Thanks

Andy_Welburn · ‎02-26-2009

Looks like there are a LOT (sorry, had to put a bit of red there!) of mismatches between what the library knows/thinks is in its slots & what NetBackup thinks is there.

Can you get the library to re-register what tapes are where - power off/on? May have to manually unload any tapes that are in any drives (robtest).

Then do an inventory of the library on NetBackup with update configuration (I presume that's possible in 3.4?????) to try & get the two in sync again.

There are some disk errors in your logs but not sure if this is a NetBackup issue.

There may still be some issues to sort once the library & NetBackup are back in sync methinks! Tapes frozen probably due to these mis-matches so probably ok just to unfreeze them. Don't know how 3.4 coped with swapping of tape drives but there may be something to look at there also.
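To get a feel for how many slots are out of step before re-running the inventory, something like this could pull the mismatches out of the syslog. It is only a sketch: `barcode_mismatches` is a made-up helper name, not a NetBackup command, and it assumes the tldcd message format pasted above ("expected barcode (X) in slot N, found barcode (Y)").

```shell
#!/bin/sh
# Hypothetical helper, not part of NetBackup: lists each distinct
# barcode mismatch that tldcd reported in a syslog file.
# $1 = a syslog file such as /var/adm/messages
# Assumes the message wording shown in this thread.
barcode_mismatches() {
    sed -n 's/.*expected barcode (\([^ )]*\) *) in slot \([0-9][0-9]*\), found barcode (\([^ )]*\) *).*/slot \2: expected \1, found \3/p' "$1" | sort -u
}
```

Run it as `barcode_mismatches /var/adm/messages`. If the list is empty after the inventory update, the library and the volume database at least agree on barcodes again.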

NMT_screen · ‎02-26-2009

Andy, thanks for your quick response. In fact, I have already killed all the backup jobs and rebooted the tape library a few times. After each reboot I also have to go into the GUI and update the volume configuration. This is under media management -> robot -> use inventory to use volume configuration -> update volume configuration. After that, all tapes are in sync with the media database again.

How can I get this disk error resolved with the help of the tape library vendor? Whenever they come, they say the tape drives are OK and blame NetBackup. I unfreeze the tapes with "bpmedia -unfreeze -ev ( media id ) -h ( media server id )", but they freeze again quite often (more than 10 times in a single day). I really appreciate your comments.

Regards

Andy_Welburn · ‎02-26-2009

If the same tapes are freezing again & again then I would suspect 'faulty' tapes & I would get them replaced.

If it is different tapes each time but the same tape drive then I would suggest the tape drive(s) at fault.

If it is any tape & any drive then I would suspect some NetBackup configuration issue, but I know nothing of 3.4 (started at 4.5 & now running 6.5.1 so things have changed dramatically since then!).

May be an idea to keep a log of all tape issues & what drives then you can use this as evidence to get tapes or drives replaced.
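A rough way to build that evidence from the logs you already have might look like the sketch below. The function names are made up for illustration, and the patterns assume the exact "FREEZING media id" and "is being DOWNED" wording from the entries posted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helpers, not part of NetBackup.

# Count "FREEZING media id" entries per media ID, most frequent first.
# $1 = a file holding NetBackup error log entries like the ones posted above
count_freezes() {
    sed -n 's/.*FREEZING media id \([^,]*\),.*/\1/p' "$1" | sort | uniq -c | sort -rn
}

# Count "is being DOWNED" entries per drive number, most frequent first.
# $1 = a syslog file such as /var/adm/messages
count_downed_drives() {
    sed -n 's/.*drive \([0-9][0-9]*\) (device [0-9][0-9]*) is being DOWNED.*/drive \1/p' "$1" | sort | uniq -c | sort -rn
}
```

If one or two media IDs dominate the first list, that points at the tapes; if one drive dominates the second, that points at the drive. An even spread across both would fit a configuration problem better.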

As far as the disk errors are concerned I would imagine that this is more of a hardware issue but I must admit that I don't recognise the error messages. I would hazard a guess at a monitoring application (RICHPse or SE Toolkit) that's running on your media server, checking hardware - nothing to do with NetBackup.

NMT_screen · ‎02-26-2009

Hi Andy,

Thanks. As I mentioned, I changed some tapes under the specific media server that was having problems. I suspect that the tape drives could also be the issue, but the vendor just comes and changes them again and again, and blames the NetBackup software and a configuration error.

Can the hardware monitoring toolkit you mention (RICHPse or SE Toolkit) be used with the Overland tape library 4000 series?

Appreciate your comments,

Regards,

Andy_Welburn · ‎02-26-2009

Maybe someone has a bit more knowledge about 3.4 that could guide you through a few checks for your tape drive config as it may need looking at.... (Bob?)

As far as the SE Toolkit is concerned I've never used it so couldn't comment on its interaction with either NetBackup or your tape library - but I don't see it as being a problem, it only appears to be notifying you of a possible issue?

Sorry I can't help any further.

Stumpr2 · ‎02-26-2009

I would start by checking the tape drives outside of NetBackup using the mt and tar commands.

Mount a tape, write using tar, unmount the tape, mount the tape again, read it back with tar, unmount the tape.

I would do that for every drive on every media server.

Prove that the tape drives, device files, connections and media are good at the OS level.

Then once I know I got a good base to work with, then I would troubleshoot the library.

Again, outside of NetBackup. Do manual manipulations of inventory, moving tapes, mounting tapes, etc.

Then I would jump to Netbackup and run robtest.

That's a start :)
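The OS-level write/read check could be sketched roughly like this. Assumptions: the tape is already loaded in the drive, the device is a no-rewind path such as /dev/rmt/0cbn, and the tape is a scratch tape, because the tar write starts at the beginning and overwrites whatever is on it. `tape_roundtrip` and `run` are made-up names; by default the script only prints the commands (DRY_RUN=1) so you can read them before running anything.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) commands are printed, not run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: write a small tar archive to tape and list it back.
# $1 = no-rewind tape device (assumed, e.g. /dev/rmt/0cbn)
# $2 = small file or directory to write
# WARNING: overwrites the loaded tape from the beginning. Use a scratch tape.
tape_roundtrip() {
    dev=$1
    src=$2
    run mt -f "$dev" status   || return 1   # drive should report a loaded tape
    run mt -f "$dev" rewind   || return 1
    run tar cvf "$dev" "$src" || return 1   # write a small archive
    run mt -f "$dev" rewind   || return 1   # no-rewind device: rewind before reading
    run tar tvf "$dev"        || return 1   # list the archive back
    run mt -f "$dev" rewind
}
```

For example, `tape_roundtrip /dev/rmt/0cbn /etc/hosts` prints the six commands; set DRY_RUN=0 to actually run them. Repeat for each drive on each media server, as described above.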

NMT_screen · ‎02-26-2009

Hi Andy,

Thanks for the advice. In fact, I think versions 3.4 and 4.xx are not that different. Do you feel that something is wrong with the NetBackup configuration? Please advise.

Thanks.

Yasuhisa_Ishika · ‎02-26-2009

Can you show us the syslog and NetBackup logs recorded after you fixed the inventory?
The freezing might have a different cause.

Stumpr2 · ‎02-27-2009

@NMT screen wrote:
Do you feel that something is wrong with the NetBackup configuration? Please advise.
Thanks.

Yes. You most definitely have a configuration problem and that is exactly why I would go through the painful steps of proving the environment outside of NetBackup with the simple OS level commands. That is what I would do.

Steps to verify device configuration using robtest.

http://support.veritas.com/docs/264193

NMT_screen · ‎02-27-2009

Hi,

Sorry for my late reply. I replaced the tapes that returned status code 83 with new tapes. Since then, whenever I run a backup to these tapes (I swapped them in the tape library without killing the jobs), the mount takes a very long time and the backup never runs for any of the new tapes.

02/27/2009 00:14:40 - connecting

02/27/2009 00:14:40 - connected; connect

02/27/2009 00:14:40 - mounting media01

started : 02/27/09 00:14:40

Elapsed : 021:06:40

Ended :

/var/log/syslog

, relay=bkpserver001.pureIT.net. [203.166.10 .131], dsn=2.0.0, stat=Sent (n1RDXdm19900 Message accepted for delivery)

Feb 27 21:32:40 bkpserver002 sendmail[21261]: n1RDWao21259: to=bkmail@pureIT.net, ctladdr=root (0/1), delay=00:00:04, xdelay=00:00:04, mailer=relay, pri=120320, relay=bkpserver001.pureIT.net. [203.166.10 .131], dsn=2.0.0, stat=Sent (n1RDXdm19901 Message accepted for delivery)

Feb 27 21:32:41 bkpserver002 sendmail[21290]: n1RDWfH21290: from=root, size=320, class=0, nrcpts=1, msg id=<200902271332.n1RDWfH21290@bkpserver002.>, relay=root@localhost

Feb 27 21:32:42 bkpserver002 sendmail[21292]: n1RDWfH21290: to=bkmail@pureIT.net, ctladdr=root (0/1), delay=00:00:01, xdelay=00:00:01, mailer=relay, pri=120320, relay=bkpserver001.pureIT.net. [203.166.10 .131], dsn=2.0.0, stat=Sent (n1RDXfm19907 Message accepted for delivery)

Feb 27 21:45:01 bkpserver002 sendmail[21788]: n1RDj1421788: from=root, size=271, class=0, nrcpts=1, msg id=<200902271345.n1RDj1421788@bkpserver002.>, relay=root@localhost

Feb 27 21:45:02 bkpserver002 sendmail[21790]: n1RDj1421788: to=root, ctladdr=root (0/1), delay=00:00:01 , xdelay=00:00:00, mailer=local, pri=120271, relay=local, dsn=2.0.0, stat=Sent

/var/adm/messages

Feb 27 21:51:27 bkpserver002 tldcd[21943]: TLD(2) key = 0x0, asc = 0x0, ascq = 0x0, NO ADDITIONAL SENSE INFORMATION

Feb 27 21:51:27 bkpserver002 tldcd[21943]: TLD(2) Move_medium error: CHECK CONDITION

Feb 27 21:52:10 bkpserver002 ID[RICHPse.monlog.2000]: Disk state amber entered, Action: Move load from busy disks to idle disks

Feb 27 21:53:38 bkpserver002 tldcd[22020]: TLD(2) cannot dismount drive 1, slot 77 already is full

Feb 27 21:53:38 bkpserver002 tldcd[22020]: TLD(2) key = 0x0, asc = 0x0, ascq = 0x0, NO ADDITIONAL SENSE INFORMATION

/usr/openv/netbackup/logs/bptm/logs

21:57:18 [22110] <2> bptm: EXITING with status 0 <----------

21:58:20 [22166] <2> bptm: INITIATING: -U

21:58:20 [22166] <2> bptm: EXITING with status 0 <----------

21:58:21 [22169] <2> bptm: INITIATING: -U

21:58:21 [22169] <2> bptm: EXITING with status 0 <----------

21:59:00 [22187] <2> bptm: INITIATING: -count -cmd -rt 8 -rn 2

21:59:00 [22187] <2> bptm: EXITING with status 0 <----------

22:00:03 [22332] <2> bptm: INITIATING: -mlist -cmd

22:00:03 [22332] <2> bptm: EXITING with status 0 <----------

22:01:01 [22385] <2> bptm: INITIATING: -count -cmd -rt 8 -rn 2

22:01:01 [22385] <2> bptm: EXITING with status 0 <----------

22:01:34 [22403] <2> bptm: INITIATING: -count -cmd -rt 8 -rn 2 -stunit LT0-DNS2 -den 6 -mt 2 -masterversion 340000

22:01:34 [22403] <2> bptm: EXITING with status 0 <----------

22:01:35 [22405] <2> bptm: INITIATING: -count -cmd -rt 8 -rn 2 -stunit LTO-bkpserver002 -den 6 -mt 2 -masterversion 340000

22:01:35 [22405] <2> bptm: EXITING with status 0 <----------

22:01:36 [22407] <2> bptm: INITIATING: -count -cmd -rt 8 -rn 2 -stunit LTO-iaccssportal -den 6 -mt 2 -masterversion 340000

22:01:36 [22407] <2> bptm: EXITING with status 0 <----------

22:02:23 [22434] <2> bptm: INITIATING: -count -cmd -rt 8 -rn 0 -stunit DLT -den 13 -mt 2 -masterversion 340000

22:02:23 [22434] <2> bptm: EXITING with status 0 <----------

/usr/openv/netbackup/logs/bpdbm/logs

22:03:45 [22455] <2> image_db: Q_IMAGE_ADD_FRAGMENT (locking)

22:03:45 [22455] <4> bpdbm: request complete: exit status 0

22:03:45 [22456] <4> connected_peer: Connection from host bkpmediaserver01, 203.166.10.100, on non-reserved port 44299

22:03:45 [22456] <2> error_db: Q_ERRADD

22:03:45 [22456] <4> bpdbm: request complete: exit status 0

22:03:46 [22457] <4> connected_peer: Connection from host bkpmediaserver01, 203.166.10.100, on non-reserved port 44300

22:03:46 [22457] <2> image_db: Q_IMAGE_VALIDATE

22:03:46 [22457] <16> Default Retention: No user retention file

22:03:46 [22457] <4> bpdbm: request complete: exit status 0

22:03:47 [22458] <4> connected_peer: Connection from host bkpserver002, 203.166.10.100, on non-reserved port 58603

22:03:47 [22458] <2> error_db: Q_ERRADD

22:03:48 [22458] <4> bpdbm: request complete: exit status 0

Please let me know if you want any logs from the following.

/usr/openv/netbackup/logs
bkpserver002# ls
bpbkar/    bpcd/      bphdb/     bpsched/   dbbackup/
bpbrm/     bpdbm/     bprd/      bptm/      user_ops/

Regards,

Andy_Welburn · ‎02-27-2009

@NMT screen wrote:
Hi,
Sorry for my late reply. I replaced the tapes that returned status code 83 with new tapes.

This certainly appears to indicate a config issue - I would follow Bob's suggestion to use robtest & try & identify these mis-configurations.

Also, as a matter of interest how do you load tapes into your library? Do you use the individual 'mail slots' & inventory with "empty media access port" selected or do you manually fill the empty slots?

If the latter, you must take care as not all empty slots are 'empty' - the tape(s) may be loaded into one of your drives for a backup & so if these slots are filled with 'new' tapes the ones in the drives can no longer be returned to their original locations which could explain:

@NMT screen wrote:

/var/adm/messages
Feb 27 21:51:27 bkpserver002 tldcd[21943]: TLD(2) key = 0x0, asc = 0x0, ascq = 0x0, NO ADDITIONAL SENSE INFORMATION
Feb 27 21:51:27 bkpserver002 tldcd[21943]: TLD(2) Move_medium error: CHECK CONDITION
Feb 27 21:53:38 bkpserver002 tldcd[22020]: TLD(2) cannot dismount drive 1, slot 77 already is full
Feb 27 21:53:38 bkpserver002 tldcd[22020]: TLD(2) key = 0x0, asc = 0x0, ascq = 0x0, NO ADDITIONAL SENSE INFORMATION

Our operators did this once (& only once!!) & it caused no end of issues.
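If you want to see which slots this has happened to, a quick filter over the syslog could list them. This is just a sketch with a made-up function name, and it assumes the "cannot dismount drive N, slot M already is full" wording quoted above.

```shell
#!/bin/sh
# Hypothetical helper, not part of NetBackup: lists each distinct
# drive/slot pair where a dismount failed because the home slot was occupied.
# $1 = a syslog file such as /var/adm/messages
full_slot_events() {
    sed -n 's/.*cannot dismount drive \([0-9][0-9]*\), slot \([0-9][0-9]*\) already is full.*/drive \1 slot \2/p' "$1" | sort -u
}
```

Each line it prints is a tape stuck in a drive whose home slot now holds a different tape, so one of the two has to be moved by hand before the inventory can be trusted.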

NMT_screen · ‎02-27-2009

Hi Andy,

I got the status code for the tapes under MediaServer01, so I manually removed those tapes and replaced them with new tapes carrying the same labels. Since then, whenever I run a backup to these tapes (I swapped them in the tape library without killing the jobs), the mount takes a very long time and the backup never runs for any of the new tapes. I see the following messages in the job status.
02/27/2009 00:14:40 - connecting
02/27/2009 00:14:40 - connected; connect
02/27/2009 00:14:40 - mounting media01
started : 02/27/09 00:14:40
Elapsed : 021:06:40
Ended :
The tape stays pending at the mounting stage and never moves at all.
Should I use the command "vmquery -deassignbyid MEDIA_ID 4 0"?
Regards,

NMT_screen · ‎05-05-2009

Hi ,
I still have this mounting issue. The message in the job status is shown below. "TAPE1" has been mounting for more than 3 hours but is never mounted. TAPE1 is not the only problem tape: all tapes under MediaServer1 have the same problem, while tapes under the other media servers run normally.

"Job xxxx"                                                                                              Started : 05/05/2009 17:05:30
Storage Unit :                LTO- MediaServer1                                               Elapsed : 03:23:00
Media Server :                MediaServer1 Ended :
Status : 05/05/2009 17:38:20 - Connecting
05/05/2009 17:38:10 - Connected: connect
05/05/2009 17:38:10 - mounting TAPE1

Current 0
K Bytes Written

Current
Files Written 1

Estimated completing Previous backup data is not available.
*************************************************************************************************************
Please compare this with a tape under a different media server. This one backs up normally.

"Job xxxx"                                                   Started : 05/05/2009 17:05:30
Storage Unit : LTO- MediaServer2                   Elapsed : 17:23:24
Media Server : MediaServer2                          Ended : 05/05/2009 16:22:30

Status : 05/05/2009 17:38:20 - Connecting
05/05/2009 17:38:10 - Connected: connect
05/05/2009 17:38:10 - mounting TAPE2
05/05/2009 17:38:10 - mounted; mount time: 000:01:01
05/05/2009 17:38:10 - positioning to file 34
05/05/2009 17:38:10 - positioned: position time 000:17:01
05/05/2009 17:38:10 - begin writing
05/05/2009 18:50:10 - end writing ; writing time 000:15:00

Current 0
K Bytes Written 6693088

Current 93342
Files Written 1

Estimated %
completing 100 %

*****************************************************************************************
These jobs run on the NetBackup 3.4 server, and the tapes were replaced online (without shutting down the tape library). TAPE1 under media server 1 stays in the mounting state forever and is never mounted or written. Is this due to a media problem or a NetBackup 3.4 problem? Kindly advise, since I have had this problem for a long time.

J_H_Is_gone · ‎05-05-2009

It sounds like you might not have the tape drives configured correctly.

Example with 2 drives:

On the server you have rmt0 and rmt1.

You configure the drives in NetBackup so that
robot drive 1 is rmt1 and robot drive 2 is rmt0.

So the robot mounts a tape in robot drive 1, and the media server is looking for a tape to show up in rmt1....

but....

robot drive 1 is NOT REALLY rmt1, it really is rmt0....
so the media server NEVER sees the tape mount in rmt1.

Try to compare the serial numbers of the drives in your media server to the drive locations in the library and make sure they match.

In my example robot drive 1 should be rmt0 and robot drive 2 should be rmt1.
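To lay out what NetBackup currently believes, you could boil the `tpconfig -d` listing down to one line per robot drive. This is a sketch under an assumption about the output layout: two lines per drive, first "index name path type ..." and then a "TLD(n) Definition DRIVE=n" line. `drive_map` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: reads "tpconfig -d" output on stdin and prints
# "robot drive N -> device path (drive name)" for each configured drive.
# Assumes each drive line is followed by its "Definition DRIVE=N" line.
drive_map() {
    awk '
        # Drive line: numeric index, drive name, device path starting with "/"
        $1 ~ /^[0-9]+$/ && $3 ~ /^\// { name = $2; path = $3 }
        # Robot definition line for the drive seen just before it
        /Definition/ && /DRIVE=/ {
            n = $NF
            sub(/^DRIVE=/, "", n)
            print "robot drive " n " -> " path " (" name ")"
        }
    '
}
```

Usage would be `tpconfig -d | drive_map` on the problem media server. Then load a tape into robot drive 1 and run `mt -f <path> status` against each device path in turn: the path that reports a loaded tape is the real one for that robot drive, and it should be the same path the listing shows.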

Stumpr2 · ‎05-06-2009

Andy Welburn - would suspect some NetBackup configuration issue
Andy Welburn - I would follow Bob's suggestion to use robtest & try & identify these mis-configurations.
Stumpr - You most definitely have a configuration problem
J. Hinchcliffe - It sounds like you might not have the tape drives configured correctly.

Steps to verify device configuration using robtest.

http://support.veritas.com/docs/264193

NMT_screen · ‎05-06-2009

Thanks for your advice, gentlemen. I will try it and let you know the result. The current tpconfig -d output is below.

netbackupserver# tpconfig -d
Index  DriveName  DrivePath      Type   Multihost  Status
*****  *********  **********     ****   *********  ******
0      LTODrv1    /dev/rmt/0cbn  hcart  Yes        UP
       TLD(2) Definition DRIVE=1
1      LTODrv2    /dev/rmt/1cbn  hcart  Yes        UP
       TLD(2) Definition DRIVE=2
2      LTODrv3    /dev/rmt/2cbn  hcart  Yes        UP
       TLD(2) Definition DRIVE=3
3      LTODrv4    /dev/rmt/3cbn  hcart  Yes        UP
       TLD(2) Definition DRIVE=4

After following step 4 of http://seer.entsupport.symantec.com/docs/264193.htm (move the tape to drive 1), I got the following.

netbackupserver# mt -f /dev/rmt/0cbn status
HP Ultrium tape drive:
sense key(0x0)= No Additional Sense residual= 0 retries= 0
file no= 0 block no= 0

Is this normal? Kindly advise.

NMT_screen · ‎05-06-2009

Gentlemen,
I have 4 media servers, and the clients under 3 of them are able to mount and write. Only the clients under one of the 4 media servers have the mounting/writing issue. Could it still be a configuration issue? Kindly advise. I have very little knowledge of Solaris and NetBackup 3.4.

Thx.
