
NBU 6.5 RESTORE SOCKET READ ERROR 1 Master 4 SAN Media Servers

alwindc
Level 2

Hello Gentlemen, I have a problem with NBU restores on one of my M5000 Enterprise NBU SAN media servers, and I hope you may have a solution for it.

The file systems for the restores to this media server are Veritas managed (vxvm and vxfs 5.0). I tried changing settings in different combinations in bp.conf and in the BUFFERS files, but the restores always fail.

The SATA disks we are using on the media server (tst-mediasrv) for the file systems in question come from an EMC CX-300 SAN and are attached to the host through MPxIO multipathing with the Leadville Emulex SAN drivers. All the backups of these file systems complete without a problem.

If we use CX-300 fibre disk from the same SAN, the restores are successful. Other hosts have been, and continue to be, restored just fine using these SATA disks in the same environment.
 
The restores fail with the following error.
--------------------------------------------------------
09/16/2009 13:33:14 - begin reading
09/16/2009 14:36:14 - Error bpbrm (pid=12506) socket read failed: errno = 62 - Timer expired
09/16/2009 14:36:16 - Error bptm (pid=12520) media manager terminated by parent process
09/16/2009 14:36:47 - Error bpbrm (pid=12506) client restore EXIT STATUS 13: file read failed
09/16/2009 14:34:53 - restored from image epicprd2_1253088007; restore time: 1:03:54
09/16/2009 14:34:53 - Warning bprd (pid=18160) Restore must be resumed prior to first image expiration on Wed Sep 30 03:00:07 2009
09/16/2009 14:34:53 - end Restore; elapsed time 1:03:55
the restore failed to recover the requested files (5)
--------------------------------------------------------
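
For reference, a minimal Python sketch that measures how long the restore was reading before the socket read failed, using two lines copied from the log above. The function names are hypothetical and not part of NetBackup. It only does the timestamp arithmetic and does not say which timeout actually fired.

```python
from datetime import datetime

# Two lines copied verbatim from the job log above.
LOG = """\
09/16/2009 13:33:14 - begin reading
09/16/2009 14:36:14 - Error bpbrm (pid=12506) socket read failed: errno = 62 - Timer expired
"""

def parse_ts(line):
    """Return the timestamp at the start of a job-log line (MM/DD/YYYY HH:MM:SS)."""
    return datetime.strptime(line[:19], "%m/%d/%Y %H:%M:%S")

def seconds_until_failure(log_text):
    """Seconds between 'begin reading' and the first 'socket read failed'."""
    start = fail = None
    for line in log_text.splitlines():
        if "begin reading" in line and start is None:
            start = parse_ts(line)
        elif "socket read failed" in line and fail is None:
            fail = parse_ts(line)
    if start is None or fail is None:
        raise ValueError("log does not contain both events")
    return int((fail - start).total_seconds())

elapsed = seconds_until_failure(LOG)
print(elapsed)  # 3780 seconds, i.e. 1 hour 3 minutes
```

The result, 3780 seconds, is longer than either CLIENT_READ_TIMEOUT value listed in the bp.conf files below (500 on the master, 3000 on the media server).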
If these SATA disks are deported and imported on another host, the restores complete just fine there.

Here are some of the settings that may be relevant:
________________________________________________
MASTER (T5220 SPARC) 

OS 10 update 7 Kernel 141414-09; NBU 6.5.2A

Memory=32640 Megabytes; (swap -s =8,388,788,224); swap slice=7.81 GB

[/usr/openv/netbackup] # more bp.conf
SERVER = eb1
SERVER = tst-mediasrv
BPSTART_TIMEOUT = 600
BPEND_TIMEOUT = 600
SERVER_CONNECT_TIMEOUT = 120
VERBOSE = 5
BPTM_VERBOSE = 5
BPDBM_VERBOSE = 0
SERVER_SENDS_MAIL = YES
MEDIA_SERVER = tst-mediasrv
FORCE_RESTORE_MEDIA_SERVER = prd2-mediasrv tst-mediasrv
EMMSERVER = eb1
VXDBMS_NB_DATA = /usr/openv/db/data
CLIENT_READ_TIMEOUT = 500
MPX_RESTORE_DELAY = 60
KEEP_VAULT_SESSIONS_DAYS = 30

/usr/openv/netbackup/
NET_BUFFER_SZ = 65536
NON_MPX_RESTORE

/usr/openv/netbackup/db/config
ENABLE_SCSI_RESERVE
NUMBER_DATA_BUFFERS = 32
NUMBER_DATA_BUFFERS_RESTORE = 32
SIZE_DATA_BUFFERS = 262144
SIZE_DATA_BUFFERS_RESTORE  = 262144
_____________________________________________________
MEDIA SERVER  (M5000 SPARC)

OS 10 update 7 Kernel 141414-09; NBU 6.5.2A

Memory=49152 Megabytes;  (swap -s =17,182,941,184); swap slice=16 GB

[/usr/openv/netbackup] # more bp.conf
SERVER = eb1
SERVER = tst-mediasrv
CLIENT_NAME = tst-mediasrv
VERBOSE = 5
SERVER_CONNECT_TIMEOUT = 120
BPSTART_TIMEOUT = 600
BPEND_TIMEOUT = 600
EMMSERVER = master
CLIENT_READ_TIMEOUT = 3000
CLIENT_CONNECT_TIMEOUT = 3000
DO_NOT_RESET_FILE_ACCESS_TIME
IGNORE_XATTR = YES
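
The two bp.conf listings above do not use the same timeout values. A minimal Python sketch, assuming the simple "KEY = VALUE" layout shown here and treating "#" as a comment marker (an assumption), that lists the timeout entries that differ between the master and the media server. The helper names are hypothetical.

```python
# Timeout-related entries copied from the two bp.conf listings above.
MASTER_BP_CONF = """\
SERVER_CONNECT_TIMEOUT = 120
BPSTART_TIMEOUT = 600
BPEND_TIMEOUT = 600
CLIENT_READ_TIMEOUT = 500
"""
MEDIA_BP_CONF = """\
SERVER_CONNECT_TIMEOUT = 120
BPSTART_TIMEOUT = 600
BPEND_TIMEOUT = 600
CLIENT_READ_TIMEOUT = 3000
CLIENT_CONNECT_TIMEOUT = 3000
DO_NOT_RESET_FILE_ACCESS_TIME
"""

def parse_bp_conf(text):
    """Map each 'KEY = VALUE' entry to its value; bare keywords map to None.
    Repeated keys (such as SERVER) keep only the last value in this sketch."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        entries[key.strip()] = value.strip() if sep else None
    return entries

def timeout_differences(a, b):
    """Return {key: (value_in_a, value_in_b)} for *_TIMEOUT keys that differ."""
    keys = sorted(k for k in set(a) | set(b) if k.endswith("_TIMEOUT"))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

diff = timeout_differences(parse_bp_conf(MASTER_BP_CONF),
                           parse_bp_conf(MEDIA_BP_CONF))
for key, (master, media) in diff.items():
    print(f"{key}: master={master} media={media}")
```

With the values above, this reports that CLIENT_CONNECT_TIMEOUT is set only on the media server and that CLIENT_READ_TIMEOUT is 500 on the master and 3000 on the media server.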


The devices used for the backups/restores are virtual tapes (EMC CDL). Backups and restores to and from other hosts in our environment work just fine.

The data we are restoring was backed up by a third NBU media server called prd2-mediasrv; we are forcing the restore to run from tst-mediasrv to use fibre interconnectivity/speed.

We also tried restoring directly from the master (via the network) but the restores always fail.

It seems like restores of files under 50 GB may complete, but almost every CACHE.DAT file is larger than that.

The restores start at 60 to 80 MB/s. After about 30 minutes at that rate they drop to about 20 MB/s, and then the restore finally fails.

Do you have any suggestions to try? Is it worth looking into setting up a NetBackup project to control shared memory through the Resource Controls Facility?
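
To put a number on the shared memory in question, here is a rough Python sketch using the restore buffer values listed above. It assumes the commonly cited estimate of buffers x buffer size x drives x multiplexing level. The drive count of 4 is purely hypothetical (the number of drives is not stated in this post), and the function name is made up for illustration.

```python
# Values copied from the touch files listed above.
NUMBER_DATA_BUFFERS_RESTORE = 32
SIZE_DATA_BUFFERS_RESTORE = 262144  # bytes

def shared_memory_bytes(num_buffers, buffer_size, drives=1, mpx=1):
    """Estimated shared memory for data buffers.
    Assumption: buffers * size * drives * multiplexing level."""
    return num_buffers * buffer_size * drives * mpx

# One restore stream from one drive, no multiplexing.
per_stream = shared_memory_bytes(NUMBER_DATA_BUFFERS_RESTORE,
                                 SIZE_DATA_BUFFERS_RESTORE)
print(per_stream, per_stream // 2**20)  # 8388608 bytes, 8 MiB

# Hypothetical: 4 drives in use at once (drive count is an assumption).
four_drives = shared_memory_bytes(NUMBER_DATA_BUFFERS_RESTORE,
                                  SIZE_DATA_BUFFERS_RESTORE, drives=4)
print(four_drives // 2**20)  # 32 MiB
```

Under these assumptions the buffers need about 8 MiB per stream, which is small next to the 48 GB of memory on the media server.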

Please let me know if you require more information,

Thanks

Antonio

8 REPLIES

Joe_Despres
Level 6
Partner
Master = solaris 9 NBU 6.5.3 LTO2 (2)
San Media server = solaris 10 NBU 6.5.4 LTO3 (2)

Attempting to perform a test restore on the New San Media server: 

Approx 4.2 TB....

Getting the following error:

[error 5] socket read failed: errno = 62 - Timer expired

I ran a verify on the image I want to restore from and it came up clean..


Backup took approx 9 hrs...  attempted restore was working on 30 hrs!

Joe Despres

Stumpr2
Level 6
Create a local hosts entry and force the client and server to use it.

Joe_Despres
Level 6
Partner
 Yeah....  I wish that was the fix!

Already have the hosts files populated...

Thanks....

Joe Despres

mph999
Level 6
Employee Accredited
Try commenting out the FORCE_RESTORE_MEDIA_SERVER entry in bp.conf.

Martin

Joe_Despres
Level 6
Partner
 Fixed the issue with CLIENT_READ_TIMEOUT......

But it's still an issue.... It will recover files for 20 to 30 mins, then wait 20 mins with no activity...

Then start right up!

Joe Despres

sdo
Moderator
Partner    VIP    Certified
This isn't the solution, just brief notes about other issues with Solaris+Leadville+Emulex...

We had issues with SAN drivers too... whereby if we rebooted either or both of the EMC EDL 4206 and the Quantum Scalar i2000, then the Solaris 10 Sun V445 master cluster and the Solaris 10 Sun M5000 media servers would randomly lose visibility of zero, one, several, or lots of target WWPNs. The temporary solution was, following reboots of the tape storage, to close the SAN switch ports for the initiators (NetBackup servers) in the tape zones, and re-open them after 60 seconds. We've recently applied Solaris 10 update 6, and this appears to have resolved the issue. I see that you're already on update 7.

I believe we required patch 120222-29, details:
http://sunsolve.sun.com/search/advsearch.do?collection=PATCH&type=collections&max=50&language=en&que...

sdo
Moderator
Partner    VIP    Certified
I should add that we also had some very, very strange behaviour (and I'm not overstressing this, it was truly odd) until we tracked down a noisy SFP and replaced it.
Maybe you could try zeroing the counters on the SAN switch ports across your entire backup estate - and then as soon as you get the problem look for errors.

Double-check the light levels (Brocade sfpshow) to check your Rx/Tx power levels too. Does one of them have significant dB loss? Near zero even?

alwindc
Level 2

Hi sdw303, thanks for your suggestions,

Would you say that if the problem were a noisy SFP GBIC, we would still complete restores to the same server via CX300 and CX380 FIBRE LUNS? The switch we use is an EMC DS 200B. It seems to me that when we use CX300 SATA, the speed at which data is transmitted between the server and the SAN is too high, and somehow it reaches a point of saturation and the socket connection freezes. Could that be the case?

I tried configuring the same SATA LUNS from the CX300, managed by VXVM, by creating UFS file systems on them. The restores actually completed, though over a much longer time frame, and the same happened with the backups. The throughput averaged about 20 MB/s.
 
I then removed the drives from the control of VXVM and created ZFS file systems, and both restores and backups complete. The restore and backup times were pretty decent for both. The throughput for the restores fluctuated constantly between 80 and 10 MB/s.

As I mentioned before when it comes to using CX300 and CX380 FIBRE all restores complete as expected, the restore issues only occur with CX300 SATA LUNS.

If VXVM/UFS file system restores are completing and VXVM/VXFS restores are not, what could be tuned in the VXFS file system to make this work?

Also, why do all of the other servers running Solaris 9 restore just fine using the CX300 SATA LUNS under VXVM/VXFS?

I really do not want to switch to ZFS because we manage everything through VXVM.

By the way I am compliant with the Emulex-sun driver patch you suggested...
Patch: 139608-05 Obsoletes:  120222-31

Antonio