09-16-2009 05:39 PM
Hello gentlemen, I have a problem with one of my M5000 Enterprise NBU SAN media servers regarding NBU restores that I hope you may have a solution for.
The file systems for the restores on this media server are Veritas-managed (VxVM and VxFS 5.0). I have tried changing settings in different combinations in bp.conf and the buffer touch files, but the restores always fail.
The SATA disks we are using on the media server (tst-mediasrv) for the file systems in question come from an EMC CX-300 SAN, multipathed with MPxIO and attached to the host via the Leadville Emulex SAN drivers. All backups of these file systems complete without a problem.
If we use CX-300 fibre disks from the same SAN the restores are successful, and other hosts have been, and continue to be, restored just fine using these SATA disks in the same environment.
The restores fail with the following error:
--------------------------------------------------------
09/16/2009 13:33:14 - begin reading
09/16/2009 14:36:14 - Error bpbrm (pid=12506) socket read failed: errno = 62 - Timer expired
09/16/2009 14:36:16 - Error bptm (pid=12520) media manager terminated by parent process
09/16/2009 14:36:47 - Error bpbrm (pid=12506) client restore EXIT STATUS 13: file read failed
09/16/2009 14:34:53 - restored from image epicprd2_1253088007; restore time: 1:03:54
09/16/2009 14:34:53 - Warning bprd (pid=18160) Restore must be resumed prior to first image expiration on Wed Sep 30 03:00:07 2009
09/16/2009 14:34:53 - end Restore; elapsed time 1:03:55
the restore failed to recover the requested files (5)
--------------------------------------------------------
If these SATA disks are deported and imported to another host, the restores complete just fine there.
Here are some of the settings that may be relevant:
________________________________________________
MASTER (T5220 SPARC)
Solaris 10 Update 7, kernel patch 141414-09; NBU 6.5.2A
Memory=32640 Megabytes; (swap -s =8,388,788,224); swap slice=7.81 GB
[/usr/openv/netbackup] # more bp.conf
SERVER = eb1
SERVER = tst-mediasrv
BPSTART_TIMEOUT = 600
BPEND_TIMEOUT = 600
SERVER_CONNECT_TIMEOUT = 120
VERBOSE = 5
BPTM_VERBOSE = 5
BPDBM_VERBOSE = 0
SERVER_SENDS_MAIL = YES
MEDIA_SERVER = tst-mediasrv
FORCE_RESTORE_MEDIA_SERVER = prd2-mediasrv tst-mediasrv
EMMSERVER = eb1
VXDBMS_NB_DATA = /usr/openv/db/data
CLIENT_READ_TIMEOUT = 500
MPX_RESTORE_DELAY = 60
KEEP_VAULT_SESSIONS_DAYS = 30
Touch files in /usr/openv/netbackup/:
NET_BUFFER_SZ = 65536
NON_MPX_RESTORE
Touch files in /usr/openv/netbackup/db/config:
ENABLE_SCSI_RESERVE
NUMBER_DATA_BUFFERS = 32
NUMBER_DATA_BUFFERS_RESTORE = 32
SIZE_DATA_BUFFERS = 262144
SIZE_DATA_BUFFERS_RESTORE = 262144
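For anyone reproducing this setup: these buffer settings are plain touch files whose contents bptm reads at job start, so no daemon restart should be needed. A minimal sketch of (re)creating them follows; the directory is parameterized so it can be dry-run against a scratch directory (on a real media server it would be /usr/openv/netbackup/db/config), and the values are the ones listed above, not a verified fix.

```shell
#!/bin/sh
# Sketch: (re)create the NBU restore buffer touch files.
# NBU_DB_CONFIG defaults to a scratch directory for a dry run; on a
# real media server it would be /usr/openv/netbackup/db/config.
NBU_DB_CONFIG="${NBU_DB_CONFIG:-$(mktemp -d)}"

echo 262144 > "$NBU_DB_CONFIG/SIZE_DATA_BUFFERS_RESTORE"
echo 32     > "$NBU_DB_CONFIG/NUMBER_DATA_BUFFERS_RESTORE"

# Verify what bptm will pick up on the next job:
for f in SIZE_DATA_BUFFERS_RESTORE NUMBER_DATA_BUFFERS_RESTORE; do
    printf '%s = %s\n' "$f" "$(cat "$NBU_DB_CONFIG/$f")"
done
```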
_____________________________________________________
MEDIA SERVER (M5000 SPARC)
Solaris 10 Update 7, kernel patch 141414-09; NBU 6.5.2A
Memory=49152 Megabytes; (swap -s =17,182,941,184); swap slice=16 GB
[/usr/openv/netbackup] # more bp.conf
SERVER = eb1
SERVER = tst-mediasrv
CLIENT_NAME = tst-mediasrv
VERBOSE = 5
SERVER_CONNECT_TIMEOUT = 120
BPSTART_TIMEOUT = 600
BPEND_TIMEOUT = 600
EMMSERVER = master
CLIENT_READ_TIMEOUT = 3000
CLIENT_CONNECT_TIMEOUT = 3000
DO_NOT_RESET_FILE_ACCESS_TIME
IGNORE_XATTR = YES
The devices used for the backups/restores are virtual tapes (EMC CDL). Backups and restores to and from other hosts in our environment work just fine.
The data we are restoring comes from a third NBU media server called prd2-mediasrv; we are forcing the restore through tst-mediasrv to take advantage of the fibre interconnect speed.
We also tried restoring directly from the master (over the network), but the restores always fail.
It seems that restores of files under 50 GB may complete, but almost every CACHE.DAT file is larger than that.
The restores start out at transfer rates of 60 to 80 MB/s; after about 30 minutes the rate drops to about 20 MB/s, and the restore finally fails.
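Some back-of-the-envelope arithmetic (assuming the rates above are MB/s) shows why the 50 GB threshold lines up with the failures:

```shell
#!/bin/sh
# Rough transfer-time arithmetic for a 50 GB CACHE.DAT, using the
# observed rates above (MB/s assumed).
SIZE_MB=$((50 * 1024))
for rate in 80 20; do
    secs=$((SIZE_MB / rate))
    echo "50 GB at ${rate} MB/s: ${secs} s (~$((secs / 60)) min)"
done
# At 20 MB/s the file alone takes ~42 min; any further stall on top of
# that can blow past a socket read timeout, which matches the restore
# failing roughly an hour after "begin reading".
```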
Do you have any suggestions to try? Is it worth looking into setting up a NetBackup project to control shared memory through the Solaris Resource Controls facility?
Please let me know if you require more information,
Thanks
Antonio
09-28-2009 02:26 PM
Hi sdw303, thanks for your suggestions.
Would you say that if the problem were a noisy SFP/GBIC we would still complete restores to the same server via CX-300 and CX-380 fibre LUNs? The switch we use is an EMC DS-200B. It seems to me that when we use CX-300 SATA, the rate at which data moves between the server and the SAN is too high; somehow it reaches a point of saturation and the socket connection freezes.
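On the Leadville stack, a degrading SFP usually shows up as climbing link error counters on the HBA port, which would give a more direct answer than inferring from LUN type. A hedged sketch, using the Solaris fcinfo tool and guarded so it degrades gracefully on hosts without it:

```shell
#!/bin/sh
# Sketch: check FC link-error counters to help rule a bad SFP/GBIC in
# or out. fcinfo ships with the Solaris 10 Leadville FC stack; the
# guard lets the script run harmlessly where it is absent.
if command -v fcinfo >/dev/null 2>&1; then
    # -l prints link statistics (invalid CRC count, loss of sync, etc.)
    fcinfo hba-port -l
else
    echo "fcinfo not available on this host"
fi
```

Rising invalid-CRC or loss-of-sync counts while a SATA-LUN restore runs would point at the physical layer; flat counters would point back at the host or file system.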
I tried configuring the same SATA LUNs from the CX-300, still managed by VxVM, but with UFS file systems on them, and the restores actually completed, though in a much longer time frame; the same happened with the backups. The throughput averaged about 20 MB/s.
I then removed the drives from the control of VxVM and created ZFS file systems, and both restores and backups complete. The restore/backup times were pretty decent for both, with restore throughput fluctuating constantly between 10 and 80 MB/s.
As I mentioned before, when using CX-300 and CX-380 fibre LUNs all restores complete as expected; the restore issues only occur with CX-300 SATA LUNs.
If restores to VxVM/UFS file systems complete but restores to VxVM/VxFS do not, what could be tuned in the VxFS file system to make this work?
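To make the question concrete: the per-filesystem VxFS tunables that govern large sequential reads (read_pref_io, read_nstream, discovered_direct_iosz) seem like the place to start comparing against the working Solaris 9 hosts. A hedged sketch of inspecting them with vxtunefs, guarded for hosts without VxFS; /restore_fs is a placeholder mount point, and any changed values would be an experiment, not a verified fix:

```shell
#!/bin/sh
# Sketch: print the current VxFS tunables for a restore file system.
# MNT is a placeholder; vxtunefs is part of the VxFS installation, so
# the guard lets this degrade gracefully elsewhere.
MNT="${1:-/restore_fs}"
if command -v vxtunefs >/dev/null 2>&1; then
    vxtunefs -p "$MNT"       # -p prints the current tunable values
    # Example change to experiment with (assumption, not a verified fix):
    # vxtunefs -o read_pref_io=262144,read_nstream=4 "$MNT"
else
    echo "vxtunefs not available on this host"
fi
```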
Also, why do all of the other servers, which run Solaris 9, restore just fine using the CX-300 SATA LUNs under VxVM/VxFS?
I really do not want to switch to ZFS, because we manage everything through VxVM.
By the way, I am compliant with the Emulex-Sun driver patch you suggested:
Patch: 139608-05 Obsoletes: 120222-31
Antonio