
Restore speed is very slow!

Joe_Despres
Level 6
Partner
I was getting the following error on a restore on my Solaris SAN media server:


[error 5] socket read failed: errno = 62 - Timer expired

I ran a verify on the image I want to restore from, and it came up clean.

The backup took approx. 9 hrs...  the attempted restore was going on 30 hrs!

I fixed the timeout error by adding the following to the SAN media server's bp.conf:

CLIENT_READ_TIMEOUT = 3600
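
Side note: to confirm the entry is actually being picked up, bpgetconfig can echo it back (the path below assumes a default install):

# /usr/openv/netbackup/bin/admincmd/bpgetconfig CLIENT_READ_TIMEOUT
CLIENT_READ_TIMEOUT = 3600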

The restore would run for approx. 20-30 mins, then wait 20 mins...

The restore eventually completed... but it took a very long time!

Any suggestions on a possible fix?

Master:  V240, Solaris 9, NBU 6.5.3
SAN media server:  M3000, Solaris 10, NBU 6.5.3, LTO3

Thanks....

Joe Despres

7 Replies

sdo
Moderator
Partner    VIP    Certified
Joe,

Firstly, performance questions are notoriously difficult to nail down...

1) Was the SAN media server restoring to itself or to a client?
(i.e. Tape - SAN - HBA - Server - local disk (or SAN disk))
(...or Tape - SAN - HBA - Server - NIC - LAN - NIC - Client - Local disk)

2) How many files were being restored?
3) How much data (in KB) was being restored?
4) Were you restoring from multiple images?  (i.e. either across a selection of date ranges, or a selection of different mount-points and/or different folders)?
5) How long did the original backup take?
6) The backups that you are restoring from, are they multiplexed?
7) If they are multiplexed, how many images were potentially multiplexed together?

There will be more questions depending upon how you answer the above.  It will likely take several rounds of Q/A to get to the bottom of this.

Dave.

Joe_Despres
Level 6
Partner
 1) Was the SAN media server restoring to itself or to a client?
(i.e. Tape - SAN - HBA - Server - local disk (or SAN disk))
(...or Tape - SAN - HBA - Server - NIC - LAN - NIC - Client - Local disk)
 
Tape -->  HBA -->  Server
LTO3 -->  HBA --> M3000
 
2) How many files were being restored?
 
160851
 
3) How much data (in KB) was being restored?
 
4758838030
 
4) Were you restoring from multiple images?  (i.e. either across a selection of date ranges, or a selection of different mount-points and/or different folders)?
 
Single image.....
 
5) How long did the original backup take?

9hrs and 35 mins
 
6) The backups that you are restoring from, are they multiplexed?
 
No
 
7) If they are multiplexed, how many images were potentially multiplexed together?
 

sdo
Moderator
Partner    VIP    Certified
1) Can you tell us whether these files exist on the server, and if so, what values they contain?  (A quick check is sketched after this list.)
/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_RESTORE
/usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS

2) Do you have another server to compare to?  (i.e. is there another SAN media server of the same spec that does perform a restore of this size, 4.7 TB, well?)

3) Was the server doing anything else during the restore?  (i.e. doing more backups)

4) SAN speed? 1/2/4/8/10 Gb?

5) You were restoring to local disk.  Can you describe the disk layout and the interface the disks live on?  SAS or FC-AL?  If FC-AL, single or double loop, and loop speed 2 or 4 Gb/s?  7200 rpm or 15 krpm spindles?  Spindle size?  RAID set?  Which form?  Stripe width?

6) Fabric?  Any SAN switches in between the LTO3 and your HBA?  Or a direct connection?

7) If on a fabric, are there any other targets in different zones on the same HBA?
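
For question 1, a quick way to check, assuming a default install path (a missing file simply means the compiled-in default is in use):

# cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
# cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_RESTORE
# cat /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS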


Rough calcs on backup:

4,758,838,030 KB = 4,647,303 MB = 4,538 GB ≈ 4.4 TB
elapsed 09:35 = 34,500 secs
≈ 135 MB/s


Rough calcs on restore:

4,758,838,030 KB = 4,647,303 MB = 4,538 GB ≈ 4.4 TB
elapsed 30:30 = 109,800 secs
≈ 42 MB/s
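
Those MB/s figures are just total MB divided by elapsed seconds; bc reproduces them if anyone wants to check the arithmetic:

# echo "scale=1; 4647303 / 34500" | bc
134.7
# echo "scale=1; 4647303 / 109800" | bc
42.3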


135 MB/s for the backup, but only 42 MB/s for the restore to the same location...

It was exactly the same location, correct?

Dion
Level 6
Certified
Another thing to check is whether you have multiplexing configured for your backups.  It improves backup speeds in certain environments, but it does not bode well when it comes to restores.  With multiplexing, you can easily expect the restore to take 3x as long as the backup, because the tape has to shoe-shine back and forth between the interleaved backup segments on the media.  One quick way to check is sketched below.
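
Assuming a default install path, bpstulist prints the configured storage unit settings; a maximum MPX value of 1 means backups written to that unit are not interleaved (the exact field label varies a little between versions):

# /usr/openv/netbackup/bin/admincmd/bpstulist -U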

Joe_Despres
Level 6
Partner
 1) Can you tell us whether these files exist on the server, and if so, what values they contain?
/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS   64
/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_RESTORE   64
/usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS  262144
 
2) Do you have another server to compare to?  (i.e. is there another SAN media server of the same spec that does perform a restore of this size, 4.7 TB, well?)
 
Nope
 
3) Was the server doing anything else during the restore?  (i.e. doing more backups)
 
I did get errors when an incremental backup started...

a::  status 50 (client process aborted)
b::  media manager terminated by parent process
 
 
4) SAN speed? 1/2/4/8/10 Gb?
 
4 Gb
 
5) You were restoring to local disk.  Can you describe the disk layout and the interface the disks live on?  SAS or FC-AL?  If FC-AL, single or double loop, and loop speed 2 or 4 Gb/s?  7200 rpm or 15 krpm spindles?  Spindle size?  RAID set?  Which form?  Stripe width?

Not sure about this...  Looking into it...

6) Fabric?  Any SAN switches in between the LTO3 and your HBA?  Or a direct connection?
 
Direct Connection
 
7) If on a fabric, are there any other targets in different zones on the same HBA?
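
(Side note on the buffer values at the top of this reply: with a restore crawling at 42 MB/s from LTO3, NUMBER_DATA_BUFFERS_RESTORE is the usual first tuning knob.  A common experiment, not a guaranteed fix, is to raise it from 64 and re-test; the new value should be picked up by the next restore job.)

# echo 512 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_RESTORE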

Joe_Despres
Level 6
Partner
 Found a possible reason......

http://seer.entsupport.symantec.com/docs/280168.htm

A Quick I/O license was purchased... but it was never installed...
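
If you want to confirm which Veritas licenses are actually installed on a host, vxlicrep prints a report (this assumes the VRTSvlic licensing package is present); Quick I/O should show up among the VxFS features if the key was ever applied:

# vxlicrep | grep -i quick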

Thanks....

Joe Despres

David_McMullin
Level 6
For Solaris 10, there is a system parameter that can slow restores - TCP_FUSION

From this link:

http://seer.entsupport.symantec.com/docs/306694.htm

1. Via mdb (takes effect immediately with no reboot, but does not persist across a reboot).

a. When there are no active backup or restore operations, run the following command:
# echo 'do_tcp_fusion/W 0' | mdb -kw

b. The NetBackup processes will also need to be restarted:
# cd /usr/openv/netbackup/bin/goodies
# ./netbackup stop
# ./netbackup start

2. Via the /etc/system file.
This option has less potential for disrupting the system, but does require a system reboot.

Add the following line to the /etc/system file:
set ip:do_tcp_fusion = 0
Once the /etc/system file is updated, it will be necessary to restart the system before the workaround will take effect.
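
With either method, the current value can be read back from the running kernel to confirm the change took effect (0 means TCP fusion is disabled):

# echo 'do_tcp_fusion/D' | mdb -k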

We saw restore speed go from 5 MB/s to 60 MB/s after making this change...