Intermittent status 24 and 41 when backing up multiple terabytes of data

mtimbol · ‎08-15-2014

One environment we manage runs NetBackup 7.5.0.6. The master server is on Windows Server 2008 R2 Enterprise 64-bit.

We are getting intermittent status 24 or 41 errors when backing up Windows servers with several terabytes of data. The servers have multiple drives. The policy allows multiple data streams, and the backup selection is "All local Drives", so the policy creates one stream per drive. However, the backups for the big drives run for hours or even days before failing with status 24 or 41.

If we try to rerun, it will eventually fail again.

As a workaround, we broke the policy up into several. We created separate policies for the drives with the most data. For one server with over 7 TB of data, we created 5 policies just to back up the contents of R:\global.

On the first policy, we created a Backup Selection like this:

R:\global\0*\
R:\global\1*\
.
.
.
R:\global\a*\
R:\global\b*\

On the second

R:\global\c*\
... and so forth.

This creates one stream for each R:\global subfolder. The jobs are smaller and we are getting better success.
But there are drawbacks. If someone creates a folder starting with a character I did not anticipate, it will not be backed up.
And if no folder matches one of the patterns in the selection list, the stream for that pattern fails with status 71.
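
Both drawbacks can be checked before the backup window. Below is a minimal sketch in Python, assuming the top-level folder names under R:\global have already been collected (for example with os.listdir on the client). The function name check_coverage and the sample folder names are hypothetical, and the case-insensitive fnmatch comparison only approximates how NetBackup expands wildcards.

```python
from fnmatch import fnmatchcase

def check_coverage(folder_names, patterns):
    """Return (folders matched by no pattern, patterns matching no folder)."""
    # Lower-case both sides because Windows paths are case-insensitive.
    names = [n.lower() for n in folder_names]
    pats = [p.lower() for p in patterns]
    unmatched = [orig for orig, n in zip(folder_names, names)
                 if not any(fnmatchcase(n, p) for p in pats)]
    empty = [orig for orig, p in zip(patterns, pats)
             if not any(fnmatchcase(n, p) for n in names)]
    return unmatched, empty

# Hypothetical folder names and first-character patterns.
folders = ["0001_archive", "alpha", "Bravo", "_staging"]
patterns = ["0*", "1*", "a*", "b*"]
unmatched, empty = check_coverage(folders, patterns)
print("not covered by any pattern:", unmatched)
print("patterns matching nothing:", empty)
```

Folders in the first list would be silently skipped, and patterns in the second list are the ones likely to end with status 71.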

CPU and Network utilization on both the clients and server appear to be ok when I checked. Each time we ask the Network guys to check, they are unable to find anything wrong.

Has anyone encountered similar issues, and if so, how did you fix them?

RonCaplinger · ‎08-15-2014

Looks like you are on the right path by dividing up the drive by directory. And I'm assuming you are running the full backup in those policies on separate days...?

Some suggestions on the status 24's & 41's:

If you have IPv6 enabled on the media server and clients, try disabling it.
If you have TCP offloading/TCP Chimney enabled on the NIC settings, disable it on both the media server and client.
If these things don't work, be aware that the problem may lie in your networking equipment or configuration: NetBackup is not the source of these errors; they are symptoms of a networking issue.
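
On Windows, the current TCP Chimney state shows up in the output of netsh int tcp show global. Here is a rough Python sketch that parses that output and flags a non-disabled Chimney state. The sample text is illustrative rather than captured from this environment, the label "Chimney Offload State" is assumed to match your Windows version, and the function names are made up for this example.

```python
def parse_tcp_globals(text):
    """Parse 'name : value' lines from `netsh int tcp show global` output."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # skip headings and separator lines
            settings[name.strip()] = value.strip()
    return settings

def offload_warnings(settings):
    """Return a warning for each offload setting that is not disabled."""
    warnings = []
    state = settings.get("Chimney Offload State", "").lower()
    if state and state != "disabled":
        warnings.append("TCP Chimney offload is '%s'; consider disabling it" % state)
    return warnings

# Illustrative sample, not real output from the poster's servers.
SAMPLE = """\
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
"""
settings = parse_tcp_globals(SAMPLE)
for warning in offload_warnings(settings):
    print(warning)
```

On a live host the text could come from subprocess.run(["netsh", "int", "tcp", "show", "global"], capture_output=True, text=True).stdout. The setting itself is changed with netsh int tcp set global chimney=disabled from an elevated prompt; test on one client before rolling it out.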

If you still want to keep the current multi-policy/multi-backup selection setup in place, you can use a combination of backup selection/exclude/include (add another policy for ALL_LOCAL_DRIVES, and in the client properties you can exclude each of the backup selections specified in the other policies as exclusions for this policy).
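
The exclude list for that catch-all policy can be derived from the other policies instead of being maintained by hand. This is a sketch only: the policy names and the dictionary layout are hypothetical, and it assumes the exclude list accepts the same path patterns as the backup selections, which you should confirm for your NetBackup version.

```python
# Directives that are not paths and must not end up in an exclude list.
DIRECTIVES = {"NEW_STREAM", "ALL_LOCAL_DRIVES"}

def catch_all_excludes(policy_selections):
    """Collect every path from the other policies, once, in original order."""
    seen = set()
    excludes = []
    for entries in policy_selections.values():
        for entry in entries:
            entry = entry.strip()
            key = entry.lower()  # Windows paths are case-insensitive
            if not entry or entry.upper() in DIRECTIVES or key in seen:
                continue
            seen.add(key)
            excludes.append(entry)
    return excludes

# Hypothetical policy names mapped to their backup selections.
policies = {
    "global_part1": ["R:\\global\\0*\\", "R:\\global\\a*\\"],
    "global_part2": ["NEW_STREAM", "R:\\global\\c*\\", "r:\\global\\a*\\"],
}
for path in catch_all_excludes(policies):
    print(path)
```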

Alternatives:

FlashBackup - assuming you use tape and can achieve an acceptable throughput from the client to keep a tape drive streaming, this will back up the entire drive in blocks instead of file-by-file, but still give you the ability to recover individual files
NBU Accelerator/NBU client side Dedupe - reduces the amount of data to go through to back up the drive in both full and incremental schedules, could eliminate the long waits for some of the data to be scanned/sent, if that is a cause of the problem.
If the data on this drive lives on a SAN, Replication Director could manage SAN-based snapshots.

INT_RND · ‎08-15-2014

I would like to point out that you don't need multiple policies. I would suggest using one policy for this client.

Under backup selection use the NEW_STREAM directive like this:

R:\global\a*\

NEW_STREAM

R:\global\b*\

More information found here:

http://www.symantec.com/business/support/index?page=content&id=HOWTO33690
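
For long lists, the selection can be generated rather than typed. A small Python sketch, assuming you decide yourself which paths share a stream; build_selection is a made-up helper name. It writes NEW_STREAM ahead of every group, including the first, because my understanding of the administrator's guide is that the directive is only honored when its first occurrence is the first line of the list. Please verify that against the guide for your version.

```python
def build_selection(groups):
    """Build backup selection lines, one NEW_STREAM directive per group of paths."""
    lines = []
    for group in groups:
        if not group:
            continue  # an empty group would create a directive with no paths
        lines.append("NEW_STREAM")
        lines.extend(group)
    return lines

# Hypothetical grouping: one stream for a*, one shared stream for b* and c*.
groups = [
    ["R:\\global\\a*\\"],
    ["R:\\global\\b*\\", "R:\\global\\c*\\"],
]
selection = build_selection(groups)
print("\n".join(selection))
```

Grouping several small folders into one stream keeps the number of concurrent jobs on the same disk down.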

RiaanBadenhorst · ‎08-15-2014

Have you checked the disk performance? You're now kicking off multiple jobs that use the same drive.

Other options to explore are Accelerator (requires data optimization license) and FlashBackup for Windows.

Michael_G_Ander · ‎08-16-2014

On clients with big drives, increasing CLIENT_READ_TIMEOUT from the default of 300 seconds can also mitigate some of these errors.

Another thing worth investigating is the use of the journal, so the backup does not have to scan the whole disk.

The standard questions. Have you checked: 1) what has changed, 2) the manual, 3) whether there are any tech notes or VOX posts regarding the issue?

mtimbol · ‎08-16-2014

Thank you for your comments. I really learn a lot from this forum.

We've tried using NEW_STREAM before. Our backup selection looked something like this.

R:\folder1

NEW_STREAM

R:\folder2

NEW_STREAM

R:\global\

But the job that backs up R:\global eventually fails.

Using a wildcard triggers the creation of multiple streams even without putting in a "NEW_STREAM" directive.

We've tried changing the client read and browse timeouts. Right now it is 9600 for both. I haven't tried the other suggestion yet but will look into it.

I failed to mention that this server, and others we are having similar problems with, are on HyperV. Another team manages the HyperV environment, which makes it more difficult since we don't really know enough about this Microsoft virtualization technology yet.

RiaanBadenhorst · ‎08-16-2014

Do these servers use pass-through disks (for R:\Global)? If not, why not just back them up using the HyperV agent?

Gergely_B_ · ‎08-27-2014

Do you use Cisco UCS servers?

We've seen tons of errors (same situation), and tuning the UCS settings helped. Since then we have had no errors.

(The network guys couldn't see errors because they happened at the UCS level.)

mtimbol · ‎09-01-2014

The NetBackup servers are HP blade servers. The clients are on HyperV. We have HyperV 2008 and 2012. We had some problems running HyperV image backups on HyperV 2008 once before on some other servers. I'm not sure if these are already on HyperV 2012, but the HyperV administrators are hesitant to use HyperV backups, which is why we back them up like physical machines.

Gergely_B_ · ‎10-30-2014

These are the changes that we made on the UCS side:

A new Ethernet Adapter Policy is created with the parameters below and assigned to the NetBackup master/media servers:

1. Increase “Receive Queues” from existing 1 to 4

2. Increase “Ring Size” from existing 512 to 4096

3. Change “Receive Side Scaling (RSS)” from disabled to enabled
