asg2ki
4 years ago

NetBackup SQL job failures - network connection timed out 41

Hi All,

This is not a request for help but a note on the solution to a rather pesky problem: NetBackup SQL jobs failing with status 41, "network connection timed out". It took me some time to find the root cause, and in the end it had nothing to do with the usual suspects such as DNS/hosts resolution or network bottlenecks. I checked every aspect of every component involved (NBU Master, Media, Client, SQL DB, networking, etc.), but nothing useful appeared in any of the logs.

The most misleading part was the error messages I was tracing, such as:

ERR - Error in VxBSACreateObject: 3.
CONTINUATION: - System detected error, operation aborted.
ERR - Error in GetCommand: 0x80770004.
DBMS MSG - ODBC return code <-1>, SQL State <37000>, SQL Message <3202><[Microsoft][SQL Server Native Client 11.0][SQL Server]Write on "VNBU0-9280-5168-1609373275" failed: 995(The I/O operation has been aborted because of either a thread exit or an application request.)>.
CONTINUATION: - An abort request is preventing anything except termination actions.

INFO Server Status: Communication with the server has not been initiated or the server status has not been retrieved from the server.
INFO Error in VxBSACreateObject: 3.
INFO System detected error, operation aborted.
INFO Error in GetCommand: 0x80770004.
INFO An abort request is preventing anything except termination actions.
INFO ODBC return code <-1>, SQL State <37000>, SQL Message <3202><[Microsoft][SQL Server Native Client 11.0][SQL Server]Write on "VNBU0-8380-13048-1609410056" failed: 995(The I/O operation has been aborted because of either a thread exit or an application request.)>.

...and also SQL errors like:

2020-12-30 23:52:19.11 Backup BackupIoRequest::ReportIoError: write failure on backup device 'VNBU0-9808-5248-1609367774'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2020-12-30 23:52:19.11 Backup Error: 3041, Severity: 16, State: 1.
2020-12-30 23:52:19.11 Backup BACKUP failed to complete the command BACKUP DATABASE DB01. Check the backup application log for detailed messages.
2020-12-30 23:52:19.11 spid56 Error: 18210, Severity: 16, State: 1.
2020-12-30 23:52:19.11 spid56 BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device 'VNBU0-9808-5248-1609367774'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2020-12-31 00:00:19.10 spid31s This instance of SQL Server has been using a process ID of 8844 since 12/30/2020 10:23:10 PM (local) 12/30/2020 9:23:10 PM (UTC). This is an informational message only; no user action is required.
2020-12-31 00:10:03.71 Backup Error: 3041, Severity: 16, State: 1.
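If you want to check quickly whether a SQL Server error log shows this same pattern, something like the rough Python sketch below could help. It only looks for the "write failure" / "Flush failure" lines with an operating system error code, as quoted above. The function name and the regular expression are my own illustration, not anything shipped with NetBackup or SQL Server, and it assumes you have already read the log into a list of text lines.

```python
import re

# Pattern modelled on the two log lines quoted above; it is an assumption
# that other SQL Server versions word these messages the same way.
DEVICE_ERROR = re.compile(
    r"(?:write|flush) failure on backup device '(?P<device>[^']+)'\. "
    r"Operating system error (?P<code>\d+)",
    re.IGNORECASE,
)


def find_vdi_aborts(log_lines):
    """Return (timestamp, device, os_error_code) for each matching line."""
    hits = []
    for line in log_lines:
        match = DEVICE_ERROR.search(line)
        if match:
            # The first two whitespace-separated fields are date and time.
            timestamp = " ".join(line.split()[:2])
            hits.append((timestamp, match.group("device"), int(match.group("code"))))
    return hits
```

Entries with code 995 on a "VNBU0-..." device are the ones that matched my case; other codes would point somewhere else.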

Unfortunately neither the "dbclient" nor the "bprd" logs gave any useful indication of what was going on, while at the same time native SQL backups taken via SQL Management Studio worked just fine.

So, long story short: if you encounter such messages and you are running SQL instances in a VMware virtualized environment, make sure to also check your "VSS application quiescing" settings. If this is set to "disabled", it could very well be the source of the issue, as it was in my case.

Under normal circumstances quiescing shouldn't be disabled at all. However, I've seen many discussions over the years where users suggest doing so to work around a long-standing problem with failures of generic VM backups, over which the vendors keep pointing fingers at each other while no effective long-term solution has been made available.

The process of disabling VSS application-quiesced snapshots is outlined at the link below. In my case I had a legacy VM to which the "tools.conf" option had been applied.

https://kb.vmware.com/s/article/2146204

After removing the "vss.disableAppQuiescing = true" line from the "tools.conf" file, the backups started working immediately.
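For anyone who has to check several VMs, here is a minimal Python sketch that tests whether the text of a "tools.conf" file still carries that setting. The helper name is made up for illustration. It also assumes the file uses simple "key = value" lines with "#" comments and "[section]" headers, which matches what I saw but is not something I verified against VMware's documentation.

```python
def app_quiescing_disabled(conf_text):
    """Return True if the text sets vss.disableAppQuiescing to true.

    Assumes simple 'key = value' lines; the last occurrence wins.
    """
    disabled = False
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and section headers such as [vmbackup].
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, separator, value = line.partition("=")
        if separator and key.strip().lower() == "vss.disableappquiescing":
            disabled = value.strip().lower() == "true"
    return disabled
```

You would read the file from the guest yourself and pass its contents in. On Windows guests it typically sits under "C:\ProgramData\VMware\VMware Tools\", but treat that path as an assumption and confirm it for your VMware Tools version.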

I didn't test whether disabling the disk UUID would produce the same result, but if I get a spare moment I may check and report back here. Perhaps this whole scenario should be tested by NetBackup engineers, and if they can confirm the behavior, a side note at the link below would be appropriate:

https://www.veritas.com/content/support/en_US/doc/44037985-142651971-0/v15097395-142651971

Cheers
