
NDMP Backup Failures with Netbackup 5220

BJ0101
Level 2

Hi All,

I have recently set up two remote NetBackup 5220 appliances at offshore sites 1 and 2. They are in the same NBU domain as our onshore master server but geographically separate.

In total we have 4 offshore IBM N-series NAS heads: 2 at site 1 (nas1 and nas2) and 2 at site 2 (nas3 and nas4).

The NAS heads are SnapMirrored copies of each other, such that nas3 is a copy of nas1 and nas4 is a copy of nas2.

Only the CIFS shares of each NAS are backed up, via NDMP over a separate, non-routable private backup network, to a PureDisk dedupe disk pool on the closest 5220. All the backups are working swimmingly apart from the nas4 backup, which runs incrementals just fine, but full backups fail with:

Error bpbrm (pid=27683) db_FLISTsend failed: network write failed (44)
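For anyone chasing the same error: the master server can expand a status code into its full explanation and recommended action. A quick example (path assumes a standard UNIX install):

```
/usr/openv/netbackup/bin/admincmd/bperror -statuscode 44 -recommendation
```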

The reason the same data is backed up at both sites is so that we have an offsite copy of all data, both on the NAS heads and on the appliances.

I am at a loss to understand why nas3 backs up just fine while nas4 fails. We are seeing good speeds of around 60 MB/s until the failure occurs, which is normally towards the end of the backup.

The backup itself occurs over two dedicated NIC ports on the appliance, which are bonded with 802.3ad LACP.
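Since the failure is a network write error, it may be worth ruling out a flapping LACP member before digging deeper. On a Linux-based appliance the bonding driver exposes its state under /proc; a minimal check from the elevated shell, assuming the bond interface is named bond0 (the real name may differ on the 5220):

```
cat /proc/net/bonding/bond0                        # Bonding Mode, Aggregator ID, per-slave state
grep -c "MII Status: up" /proc/net/bonding/bond0   # how many links report healthy
```

If one slave drops out mid-job, 802.3ad will rehash flows and a long-lived NDMP stream can die with exactly this kind of write failure.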

Please help!

Thanks.

Ben

 

6 REPLIES

chashock
Level 6
Employee Accredited Certified

I'm assuming you've done the whole "compare everything that works to what doesn't down to the smallest detail" troubleshooting steps.  :)

Does it matter when you run the full backup?  How long does the incremental take versus the full?

I've seen something like this happen before when, near the end of a backup job, the NAS (in my case it was an EMC NAS) was suddenly getting hammered by a batch process that wrote and read a ton of data on the same filesystem. We knew nothing about the batch process until the combined I/O of both activities basically brought the system to its knees.

BJ0101
Level 2

Yes, I've done lots of comparing of log files between a working backup and a failing one. I've also run TCP traces, which didn't turn up anything useful.

One of my suppositions was that SnapMirror was running at the same time as the backup and causing it to time out due to a lack of resources on the NAS. But SnapMirror runs every hour, so if that were the case the backup would fail sometime during the first hour while the transfer runs; instead it runs for much longer than that before failing.
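One way to test that supposition rather than argue it away: line the recorded failure times up against the hourly SnapMirror window. A small sketch, where the timestamps are made-up placeholders rather than values from the real logs:

```python
from datetime import datetime

# Hypothetical failure timestamps pulled from bpbrm logs (hh:mm).
failure_times = ["02:13", "03:58", "05:02"]

def minutes_from_hour_boundary(hhmm):
    """Distance in minutes from the nearest top-of-the-hour."""
    t = datetime.strptime(hhmm, "%H:%M")
    return min(t.minute, 60 - t.minute)

for ts in failure_times:
    print(ts, "->", minutes_from_hour_boundary(ts), "min from an hour boundary")
```

If the failures consistently land near an hour boundary, the hourly transfer is back in play even though the job survives the first hour: a transfer can run long, queue behind others, or hit a volume that is busiest on a later cycle.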

Can anyone shed some light on how dedupe would work in this scenario? I've heard the NAS rehydrates the data before it gets backed up to the appliance, at which point it gets deduped again. That seems like a resource-intensive operation.

chashock
Level 6
Employee Accredited Certified

Dedupe on the file system can cause performance issues during a backup; I've seen that happen. But if you're running dedupe on NAS3 the same way as on NAS4, I'm not sure that is the issue.

Do you have logging enabled for bpbrm, since it is the process that is reporting the error? Can you share the contents of that log?

Douglas_A
Level 6
Partner Accredited Certified

Can you break the FS backup down to more granular levels and see if it's a single directory or the entire FS causing the issue?
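As a sketch of what that might look like in the NDMP policy's backup selections (the paths below are placeholders for whatever the volume layout actually is): with "Allow multiple data streams" enabled, each selection runs as its own job, so the failing subtree identifies itself.

```
/vol/cifs_vol/dept_a
/vol/cifs_vol/dept_b
/vol/cifs_vol/dept_c
```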

 

I have seen this, as stated above, with NetApp dedupe enabled on a volume that has many thousands or millions of files.

 

Remember the NDMP protocol has a limit of 71 million files per backup before it kicks the bucket.

BJ0101
Level 2

Chashock, can you tell me how to enable logging for bpbrm? I take it this is done on the appliance?

Douglas - I will certainly try to run some smaller backups and see if I get the same issue. The strange thing is that nas4 is backing up (or should be backing up) exactly the same dataset as nas2, which works fine, but I'm willing to try anything at this stage.

Thanks guys for your suggestions.

chashock
Level 6
Employee Accredited Certified

bpbrm logging isn't enabled by default, so you have to create the log directory. There may be a better way to do this, but this is how I do it. I generally don't recommend going into elevated mode, but for this it's the only way I know of to ensure it works.

Log into the CLISH

Support - > Maintenance - > elevate

cd /usr/openv/netbackup/logs

There should be a script there called mkdirlog.org; execute it with the command 'sh mkdirlog.org', which will create several log directories.

You should probably increase the logging level for bpbrm, at least while troubleshooting. You can do this in the host properties on the master server.
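If you prefer the command line to Host Properties, the same verbosity change can be pushed from the master with bpsetconfig, which reads settings from stdin. A hedged example, where nbu5220 stands in for the appliance's hostname:

```
echo "VERBOSE = 5" | /usr/openv/netbackup/bin/admincmd/bpsetconfig -h nbu5220
```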

Restart the NBU services.

Then when you go to Support -> Logs -> Browse, you should have a bpbrm directory in the NBU/netbackup directory with a log file in it that you can use to start troubleshooting. It might take more than this log to figure out precisely what's going on, though.
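Once the bpbrm log exists, something like the following can pull the lines surrounding the failure for a quick first pass. A minimal sketch that assumes the legacy flat-text log format; the sample excerpt is made up, not from a real appliance:

```python
def failure_context(lines, needle="db_FLISTsend failed", radius=3):
    """Return each occurrence of `needle` with `radius` lines of context."""
    hits = [i for i, ln in enumerate(lines) if needle in ln]
    out = []
    for i in hits:
        out.extend(lines[max(0, i - radius): i + radius + 1])
    return out

# Hypothetical log excerpt for illustration only.
sample = [
    "01:02:03 bpbrm write_data: begin",
    "01:59:58 bpbrm db_FLISTsend failed: network write failed (44)",
    "01:59:59 bpbrm cleanup: exiting",
]
for line in failure_context(sample):
    print(line)
```

Whether the lines just before the failure show the socket stalling or the NAS going quiet should help split "appliance-side network problem" from "NAS ran out of steam".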