04-23-2020 12:52 AM
Hello dear Veritas VOX community, we have been reading your topics for a long time. Many thanks!
Some of our duplications are failing with an OST plugin error: the network exchange breaks after a precise amount of time (something we only discovered recently).
Duplication runs over IP (WAN). On the NetBackup side we have:
<16>:bptm:ddp_filecopy_status() failed, start_offset, Err: 5009-filecopy operation failed (nfs: I/O error)
On the Data Domain side:
DDErrNo = 5009 (I/O error) DDErrNo = 5057 (File handle is stale)
All the failing duplications break at the same point: the network side of the duplication drops after about 131 minutes (2 h 11 min 30 s).
The vendor proposed setting an NFS timeout value of 600000. The plugin was updated on the media server that performs the duplication over the WAN (one of 10 media servers for this master).
For future reference, we have also seen some status codes other than the parent job's status 191:
84: media write error
190 / 191: the child job codes and the plugin error are shown below
Error bpduplicate backup id optimized duplication failed, client process aborted (50).
Error bpduplicate Duplicate of backupid failed, client process aborted (50).
Error bpduplicate Status = no images were successfully processed.
2060046: plugin error
The vendor suggests it could be a network break caused by an IPS or another security-enhancing feature.
Best regards and thanks in advance for all your answers.
04-23-2020 01:02 AM
Those DD errors need to be resolved by EMC and your network team. NetBackup only tells the DD to replicate/duplicate and then waits for the DD to say when it's done. In your case it's not getting done; it's complaining about NFS I/O. Once that is resolved, you'll see no issue in NetBackup.
04-23-2020 02:32 AM - edited 04-24-2020 03:21 AM
Hello, we found that 131 minutes and 15 s corresponds to these TCP keepalive settings:
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
We found that the recommended value appears to be 900 in the Symantec NetBackup Backup Planning and Performance Tuning Guide, and applied it on the master and media servers:
net.ipv4.tcp_keepalive_time = 900
The issue seems to be solved. We will work further with our network team to look at tcp_keepalive_time with respect to our WAN link.
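The 131 minutes 15 s figure follows from the three defaults quoted above. On Linux, an idle connection waits tcp_keepalive_time before the first probe, then sends up to tcp_keepalive_probes probes spaced tcp_keepalive_intvl apart before the kernel gives up on a peer that never answers. Here is a minimal sketch of that arithmetic; the function names are only illustrative, and it assumes the connection was completely idle and that every probe was lost:

```python
def keepalive_give_up_seconds(time_s, intvl_s, probes):
    """Idle seconds before the kernel drops a connection whose
    keepalive probes all go unanswered (simplified model)."""
    return time_s + intvl_s * probes


def fmt(seconds):
    """Format a number of seconds as 'M min S s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"


# Defaults quoted in this thread: 7200 / 75 / 9
default_total = keepalive_give_up_seconds(7200, 75, 9)
# Value applied from the tuning guide: 900 (intvl and probes unchanged)
tuned_total = keepalive_give_up_seconds(900, 75, 9)

print(fmt(default_total))  # 131 min 15 s
print(fmt(tuned_total))    # 26 min 15 s
```

With tcp_keepalive_time = 900 the first probe goes out after 15 minutes of idle time, which is presumably short enough to keep any idle-timeout device on the path from dropping the session.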
04-24-2020 12:27 AM - edited 04-24-2020 12:29 AM
The tcp_keepalive_time setting will only make a difference if there is a firewall between the two Data Domains.
TCP keepalive sends a probe packet to the destination once the connection has been idle for the configured time, which prevents a firewall from closing the idle connection.
You should ask the network admin what idle timeout is configured on the firewall in order to find the right tcp_keepalive_time.
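To illustrate what keepalive does at the socket level, here is a small Python sketch that turns it on for a single TCP socket. This is only an illustration: it is an assumption that the application in question enables SO_KEEPALIVE at all, the helper name is hypothetical, and the sysctl values discussed in this thread are the system-wide defaults that such a socket would inherit when no per-socket override is set.

```python
import socket


def enable_keepalive(sock, idle_s=900, intvl_s=75, probes=9):
    """Enable TCP keepalive on one socket (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the sysctl defaults. These option names
    # come from Linux; any that the platform lacks are skipped.
    overrides = (
        ("TCP_KEEPIDLE", idle_s),    # idle time before the first probe
        ("TCP_KEEPINTVL", intvl_s),  # gap between unanswered probes
        ("TCP_KEEPCNT", probes),     # probes sent before giving up
    )
    for name, value in overrides:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_on)
sock.close()
```

The idle value should stay below the firewall's idle timeout, which is why the firewall configuration needs to be known first.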
04-24-2020 12:39 AM
Altering the TCP parameters on the NetBackup side will have little effect on the data transfer between the Data Domain devices. You are trying to solve the problem in the wrong spot.
As @RiaanBadenhorst has stated, this is really a problem for EMC to resolve. This may require altering tcp parameters on the DataDomain system along the lines of what @Nicolai has suggested.
Given the traffic is traversing a WAN link, there will almost certainly be some form of gateway routers and/or firewalls involved. You will need to discuss the configuration of these with your network/security admins to determine whether one of these devices is interrupting the session.
04-25-2020 06:05 PM
Thinking about this some more - the error you have shown as reported by the DD is around a stale file handle. If you look at this in a general sense stale NFS file handles have this cause: "A filehandle becomes stale whenever the file or directory referenced by the handle is removed by another host, while your client still holds an active reference to the object. A typical example occurs when the current directory of a process, running on your client, is removed on the server (either by a process running on the server or on another client)".
Has anything happened, or is anything happening, on either of the DDs involved that changes the underlying storage/volume presented to the other side via NFS? Is there some kind of housekeeping task that runs at the same time as the backups and may change the file structure?
Something else to investigate anyway....
04-30-2020 05:03 AM - edited 04-30-2020 05:04 AM
I had this issue with my DD990 - but only VERY sporadically. It would happen only once every few months, and I could not cause it to happen. My duplications would run to the end, then fail. If that is what you are seeing, try this to verify:
Once the duplication fails with 191, cancel the SLP process after noting the image details.
Try manually duplicating, either from catalog or command line.
I found that even manual duplications would not work; somehow the DD, either as it deduped the image or through normal cleaning/system maintenance, made the image unstable.
I had to rerun the backup to get an image that I could duplicate.
This issue went away with my next upgrade/patch of my DD - you might try that if you are having this very sporadic issue.
Since I could not replicate the process, I was not able to get any joy from DD support.