03-25-2013 12:00 AM
We are using NetBackup version 7.5.0.4 running on RedHat 6.1, with media servers on multiple platforms. The tape library is an HP ESL322. At first the only media servers were HP-UX and Windows. Then we added several LTO5 drives to the same tape library, with a new media server running on Solaris (5.10 sparc).
We added the 3 LTO5 drives to this Solaris server and ran a test backup. The test backup looked fine, but after some time the tape drives went down. We can bring them up and run several backups, and then they go down again. It happened to all 3 new LTO5 drives. We can always bring them up, but after several backups they all go down again.
What went wrong? I don't think the tape drives have problems, since it happened to all 3 new LTO5 drives (?). Is there some parameter that needs to be added on the Solaris side?
Thank you,
Iwan Tamimi
03-25-2013 01:16 AM
Please enable logging on the Solaris media server as follows:
Create bptm folder in /usr/openv/netbackup/logs
Add VERBOSE entry to /usr/openv/volmgr/vm.conf file and restart NBU.
NBU tape manager errors will be logged to bptm and device errors will be logged to /var/adm/messages (along with reason for drives to be DOWN'ed).
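The two logging steps above could be scripted roughly as follows. This is only a sketch: the function name and the `root` parameter are made up here so that it can be tried against a scratch directory first, and it does not restart NBU, which still has to be done separately.

```python
from pathlib import Path


def enable_media_server_logging(root="/"):
    """Create the bptm log folder and add VERBOSE to vm.conf if it is missing.

    `root` is a hypothetical parameter for trying this on a scratch copy;
    on a real media server it would stay at "/".
    """
    root = Path(root)
    bptm_dir = root / "usr/openv/netbackup/logs/bptm"
    vm_conf = root / "usr/openv/volmgr/vm.conf"

    # Step 1: bptm only writes a log if its folder exists.
    bptm_dir.mkdir(parents=True, exist_ok=True)

    # Step 2: append VERBOSE once; leave any existing entries untouched.
    lines = vm_conf.read_text().splitlines() if vm_conf.exists() else []
    if not any(line.strip() == "VERBOSE" for line in lines):
        lines.append("VERBOSE")
        vm_conf.write_text("\n".join(lines) + "\n")

    return bptm_dir, vm_conf
```

Calling it twice is harmless, since the VERBOSE entry is only added when it is not already there.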
In the meantime, the following will help us to get an idea:
Post contents of /usr/openv/netbackup/db/media/error file on media server
Run 'Tape logs' report in the GUI, select Solaris media server, specify date range during which errors were seen and run the report.
If lots of info is displayed, please filter report to exclude 'Info'. Export the report to a .txt file and post as File attachment.
03-25-2013 01:55 AM
You should check whether MPxIO is enabled for the HBAs that connect the tape drives. MPxIO on Solaris 10 does not support tape devices, and enabling it can make tape devices unstable.
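One way to check this is to look at the `mpxio-disable` entries in the fp driver configuration. The sketch below assumes the settings live in /kernel/drv/fp.conf (the usual place on Solaris 10, but please confirm on your system), and the sample text, including the device path, is hypothetical. Note that `mpxio-disable="no"` means MPxIO is enabled.

```python
import re

# Patterns for the pieces of an fp.conf entry we care about.
SETTING = re.compile(r'mpxio-disable\s*=\s*"(yes|no)"')
PARENT = re.compile(r'parent\s*=\s*"([^"]+)"')
PORT = re.compile(r'port\s*=\s*(\d+)')


def mpxio_settings(conf_text):
    """Return (scope, value) pairs for every uncommented mpxio-disable entry."""
    results = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        setting = SETTING.search(stripped)
        if not setting:
            continue
        parent = PARENT.search(stripped)
        port = PORT.search(stripped)
        if parent and port:
            scope = f"{parent.group(1)} port {port.group(1)}"
        else:
            scope = "global"
        results.append((scope, setting.group(1)))
    return results


# Hypothetical sample, not taken from the poster's server.
sample = '''
# global setting
mpxio-disable="no";
name="fp" parent="/pci@8,600000/SUNW,qlc@2" port=0 mpxio-disable="yes";
'''

for scope, value in mpxio_settings(sample):
    state = "disabled" if value == "yes" else "enabled"
    print(f"{scope}: MPxIO {state}")
```

On a real server the text would come from reading the file instead of `sample`. The per-port form is useful when disks on one HBA port still need multipathing while the tape port does not.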
03-25-2013 06:28 AM
03-26-2013 12:47 AM
Hi All,
Thanks for the support.
Marianne, I will try as you suggested and post the logs/error report.
Yasuhisa, I will check with my colleague who knows Solaris better than I do.
Regards,
Iwan Tamimi
03-26-2013 12:52 AM
As per my previous post, you can in the meantime do the following:
Post contents of /usr/openv/netbackup/db/media/error file on media server
Run 'Tape logs' report in the GUI, select Solaris media server, specify date range during which errors were seen and run the report.
If lots of info is displayed, please filter report to exclude 'Info'. Export the report to a .txt file and post as File attachment.
03-27-2013 06:37 PM
Marianne,
Thank you.
This is the content of /usr/openv/netbackup/db/media/errors
root@ebs12-bck # cat errors
03/14/13 12:06:56 600118 0 WRITE_ERROR ESL125_Drive4
03/25/13 16:19:11 600064 3 WRITE_ERROR ESL125_Drive1
03/26/13 00:06:00 600072 3 WRITE_ERROR ESL125_Drive1
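For a longer errors file, a small script can tally the entries per drive. This is a sketch that assumes each entry has six whitespace-separated fields (date, time, media ID, drive index, error type, drive name); the column meanings are inferred from the lines above rather than taken from documentation.

```python
from collections import Counter


def parse_media_errors(text):
    """Parse lines of the media errors file into dictionaries.

    Lines that do not have exactly six fields (such as a shell prompt)
    are skipped.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        date, time, media_id, drive_index, error_type, drive_name = fields
        entries.append({
            "date": date,
            "time": time,
            "media_id": media_id,
            "drive_index": drive_index,  # assumed meaning of the 4th column
            "error_type": error_type,
            "drive_name": drive_name,
        })
    return entries


def errors_per_drive(entries):
    """Count how many errors were logged against each drive name."""
    return Counter(entry["drive_name"] for entry in entries)


# The three entries posted above.
sample = """\
03/14/13 12:06:56 600118 0 WRITE_ERROR ESL125_Drive4
03/25/13 16:19:11 600064 3 WRITE_ERROR ESL125_Drive1
03/26/13 00:06:00 600072 3 WRITE_ERROR ESL125_Drive1
"""

for drive, count in errors_per_drive(parse_media_errors(sample)).items():
    print(drive, count)
```

Errors spread over different media but concentrated on the same drives would suggest looking at the drive or its path rather than at the tapes.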
I have also attached the Tape Logs.
Below I have put the errors from the backup policy, taken from the Java GUI.
BTW some facts:
o The tape library is an HP ESL322E and contains both LTO4 and LTO5 drives.
o One HP-UX media server connected to the same tape library (meaning it shares the same robot) has been running fine for years.
o The Solaris server is a new addition.
Thank you.
Iwan Tamimi
PS:
Error from one failed policy in the Java GUI:
03/26/2013 02:33:44 - Info nbjm (pid=8143) starting backup job (jobid=70316) for client ebs12-bck, policy EBS12_FS_OS, schedule Daily_Incre
03/26/2013 02:33:44 - Info nbjm (pid=8143) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=70316, request id:{8623779A-957A-11E2-8449-9411033F729C})
03/26/2013 02:33:44 - requesting resource ESLM5_EBS12_MPX
03/26/2013 02:33:44 - requesting resource ebsbck.NBU_CLIENT.MAXJOBS.ebs12-bck
03/26/2013 02:33:44 - requesting resource ebsbck.NBU_POLICY.MAXJOBS.EBS12_FS_OS
03/26/2013 02:33:44 - granted resource ebsbck.NBU_CLIENT.MAXJOBS.ebs12-bck
03/26/2013 02:33:44 - granted resource ebsbck.NBU_POLICY.MAXJOBS.EBS12_FS_OS
03/26/2013 02:33:44 - granted resource 600192
03/26/2013 02:33:44 - granted resource ESL125_Drive1
03/26/2013 02:33:44 - granted resource ESLM5_EBS12_MPX
03/26/2013 02:33:44 - estimated 636801 kbytes needed
03/26/2013 02:33:44 - Info nbjm (pid=8143) started backup (backupid=ebs12-bck_1364236424) job for client ebs12-bck, policy EBS12_FS_OS, schedule Daily_Incre on storage unit ESLM5_EBS12_MPX
03/26/2013 02:33:46 - started process bpbrm (pid=20211)
03/26/2013 02:33:51 - Info bpbrm (pid=20211) starting bptm
03/26/2013 02:33:52 - Info bpbrm (pid=20211) Started media manager using bpcd successfully
03/26/2013 02:34:00 - Info bpbrm (pid=20211) ebs12-bck is the host to backup data from
03/26/2013 02:34:00 - Info bpbrm (pid=20211) telling media manager to start backup on client
03/26/2013 02:34:00 - Info bptm (pid=20214) using 65536 data buffer size
03/26/2013 02:34:00 - connecting
03/26/2013 02:34:00 - connected; connect time: 0:00:00
03/26/2013 02:34:01 - Info bptm (pid=20214) using 12 data buffers
03/26/2013 02:34:02 - mounting 600192
03/26/2013 02:34:03 - Info bpbrm (pid=20211) spawning a brm child process
03/26/2013 02:34:03 - Info bpbrm (pid=20211) child pid: 20282
03/26/2013 02:34:04 - Info bpbrm (pid=20211) sending bpsched msg: CONNECTING TO CLIENT FOR ebs12-bck_1364236424
03/26/2013 02:34:05 - Info bpbrm (pid=20211) start bpbkar on client
03/26/2013 02:34:05 - Info bpbkar (pid=20287) Backup started
03/26/2013 02:34:05 - Info bpbrm (pid=20211) Sending the file list to the client
03/26/2013 02:34:05 - Info bptm (pid=20214) start backup
03/26/2013 02:34:06 - Info bptm (pid=20214) Waiting for mount of media id 600192 (copy 1) on server ebs12-bck.
03/26/2013 02:34:07 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/ors008/oraarch] is in a different file system from [/]. Skipping
03/26/2013 02:34:07 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/osm009/oraarch] is in a different file system from [/]. Skipping
03/26/2013 02:34:08 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/appspool] is in a different file system from [/]. Skipping
03/26/2013 02:34:12 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/bpm008/oraarch] is in a different file system from [/]. Skipping
03/26/2013 02:34:13 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/tmp] is in a different file system from [/]. Skipping
03/26/2013 02:34:14 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/bpmadm] is in a different file system from [/]. Skipping
03/26/2013 02:34:15 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/orsadm] is in a different file system from [/]. Skipping
03/26/2013 02:34:16 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/proc] is on file system type PROC. Skipping
03/26/2013 02:34:17 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/devices] is in a different file system from [/]. Skipping
03/26/2013 02:34:17 - Info bpbrm (pid=20282) from client ebs12-bck: TRV - [/backup-after] is in a different file system from [/]. Skipping
03/26/2013 02:49:27 - current media 600192 complete, requesting next media Any
03/26/2013 02:49:30 - Error bptm (pid=20214) error requesting media, TpErrno = Robot operation failed
03/26/2013 02:49:30 - Warning bptm (pid=20214) media id 600192 load operation reported an error
03/26/2013 02:49:57 - end writing
03/26/2013 02:50:01 - Error bptm (pid=20214) NBJM returned an extended error status: All compatible drive paths are down but media is available (2009)
03/26/2013 02:50:01 - Info bpbrm (pid=20211) got ERROR 252 from media manager
03/26/2013 02:50:01 - Info bpbrm (pid=20211) terminating bpbrm child 20282 jobid=70316
An extended error status has been encountered, check detailed status (252)
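Most of a job details listing like the one above is Info lines. A short filter can pull out just the Error and Warning entries. This is a sketch that assumes the "MM/DD/YYYY HH:MM:SS - Severity process (pid=N) message" layout seen in this listing; the sample lines are copied from it.

```python
import re

# Matches only lines with an Error or Warning severity and a (pid=N) tag.
LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) - "
    r"(Error|Warning) (\w+) \(pid=(\d+)\) (.*)$"
)


def problems(job_details):
    """Return (timestamp, severity, process, message) for Error/Warning lines."""
    found = []
    for line in job_details.splitlines():
        match = LINE.match(line.strip())
        if match:
            timestamp, severity, process, _pid, message = match.groups()
            found.append((timestamp, severity, process, message))
    return found


sample = """\
03/26/2013 02:49:27 - current media 600192 complete, requesting next media Any
03/26/2013 02:49:30 - Error bptm (pid=20214) error requesting media, TpErrno = Robot operation failed
03/26/2013 02:49:30 - Warning bptm (pid=20214) media id 600192 load operation reported an error
03/26/2013 02:49:57 - end writing
03/26/2013 02:50:01 - Error bptm (pid=20214) NBJM returned an extended error status: All compatible drive paths are down but media is available (2009)
03/26/2013 02:50:01 - Info bpbrm (pid=20211) got ERROR 252 from media manager
"""

for timestamp, severity, process, message in problems(sample):
    print(timestamp, severity, process, message)
```

Info lines that merely mention an error code, like the last sample line, are deliberately left out because the filter keys on the severity field only.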
03-28-2013 01:36 AM
03-28-2013 02:08 AM
Which media server is configured as robot control host?
Are any robot load operation errors logged on the control host?
We see this in job details:
03/26/2013 02:49:30 - Error bptm (pid=20214) error requesting media, TpErrno = Robot operation failed
03/26/2013 02:49:30 - Warning bptm (pid=20214) media id 600192 load operation reported an error
Then there is this error in Media logs report:
incorrect media found in drive index 3, expected 600073, found 600054, FREEZING 600073
This points to incorrect device mapping: either the devices were incorrectly configured initially, or there is no Persistent Binding in place, which causes device names to change when the server is rebooted.
If the problem is only with Solaris media server, start troubleshooting there.
Find out the make/model of the HBA used for tapes in this media server, then go to the HBA manufacturer's web site and look for Persistent Binding info. There are normally HBA tools that can be downloaded and used for this purpose.
Once Persistent Binding is correctly configured, delete all devices for this media server in NBU and OS.
Recreate devices at OS level with 'devfsadm'.
Ensure all devices are correctly seen by OS and NBU with sgscan command, then run Device Config Wizard again. Select robot control host and Solaris media server. Allow the wizard to restart NBU on the media server.
Let us know how it goes.
04-07-2013 06:23 PM
Thanks Marianne for the explanation.
After I disabled MPxIO the problem seems to have gone away (I am also new to Solaris, but you can read this for reference: http://saifulaziz.com/2009/12/10/enabling-or-disabling-mpxio-multipathing-per-port/ )
We also have new AIX media servers. Similar things happened there, and after we disabled multipathing the problem also went away.
Thanks all for the support.
Regards,
Iwan Tamimi