NetBackup - some jobs fail with Status 156 or Status 150

Colin_North · ‎10-13-2010

I am having a problem backing up a couple of Windows servers, both of which generate either a Status 156 or a Status 150 error. The backup job fails on a volume that contains between 400 and 600GB of data, and one of the servers has a folder that contains millions of small files.

Below are some server details:

Master and Media Server = Windows 2003 Server Standard SP2 / NetBackup 6.5.6
Client 1 = Windows 2003 Server Standard SP2 / NetBackup 6.0 MP4
Client 2 = Windows 2003 Server Enterprise SP2 / NetBackup 6.5.4

On my Master Server under Host Properties > Master Servers > Client Attributes if I tick "Enable Windows Open File Backups for this client" and select "Use VSS" the backup of the affected volume will normally fail with a Status 156 error for both servers.

If I untick "Enable Windows Open File Backups for this client" the backup will run, but will eventually fail at the end of the backup window with a Status 150 error. Neither I nor anyone else here has cancelled the backup.

The following is an extract from the "bpbkar" log of one of the clients:

<32> TransporterRemote::write[2](): FTL - SocketWriteException: send() call failed, could not write data to the socket, possible broken connection.
05:00:14.355: [4500.4244] <16> NBUException::traceException(): (
An Exception of type [Symantec::NetBackup::Ncf::OperationFailedException] was thrown. Details about the exception follow...:
Error code  = (-1008).
Src file    = (D:\654\src\cl\clientpc\util\tar_tfi.cpp).
Src Line    = (275).
Description = (%s getBuffer operation failed).
Operation type=().
)
05:00:14.355: [4500.4244] <16> NBUException::traceException(): (
An Exception of type [Symantec::NetBackup::Ncf::SocketWriteException] was thrown. Details about the exception follow...:
Error code  = (-1027).
Src file    = (TransporterRemote.cpp).
Src Line    = (310).
Description = (send() call failed, could not write data to the socket, possible broken connection).
Local IP=(). Remote IP=(). Remote Port No.=(0).
No. of bytes to write=(32768) while No. of bytes written=(0).
)
05:00:14.355: [4500.4244] <2> tar_base::V_vTarMsgW: FTL - socket write failed
05:00:14.355: [4500.4244] <16> dtcp_write: TCP - failure: send socket (1856) (TCP 10054: Connection reset by peer)
05:00:14.355: [4500.4244] <16> dtcp_write: TCP - failure: attempted to send 26 bytes
05:00:14.355: [4500.4244] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
05:00:14.355: [4500.4244] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
05:00:14.355: [4500.5292] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
05:00:14.355: [4500.4244] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
05:00:14.355: [4500.4244] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
05:00:14.355: [4500.4244] <16> dtcp_write: TCP - failure: send socket (1856) (TCP 10054: Connection reset by peer)
05:00:14.355: [4500.4244] <16> dtcp_write: TCP - failure: attempted to send 42 bytes
05:00:14.355: [4500.4244] <4> tar_backup::backup_done_state: INF - Not waiting for server status
05:00:14.355: [4500.4244] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
05:00:14.355: [4500.4244] <4> ov_log::OVLoop: INF - Cycling log file
05:00:14.355: [4500.4244] <4> ov_log::OVClose: INF - Closing log file: C:\Program Files\VERITAS\NetBackup\logs\BPBKAR\101210.LOG
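
The failure pattern in an extract like this one (the Winsock error code and the final exit status) can be pulled out of a larger bpbkar log with a short script. What follows is a minimal Python sketch, not a NetBackup tool: the function name `summarize` and the regular expressions are assumptions based only on the line format shown above, and the sample lines are copied from this extract.

```python
import re
from collections import Counter

# Bracketed number that appears on each log entry, e.g. "<16>".
SEVERITY = re.compile(r"<(\d+)>")
# Winsock error reported as "TCP 10054: Connection reset by peer".
WINSOCK = re.compile(r"TCP (\d{5}): ([^)]+)")
# Final status line, e.g. "EXIT STATUS 24: socket write failed".
EXIT_STATUS = re.compile(r"EXIT STATUS (\d+): (.+)")


def summarize(lines):
    """Count bracketed levels and Winsock errors; keep the last exit status."""
    levels = Counter()
    winsock = Counter()
    exit_status = None
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            levels[int(match.group(1))] += 1
        for code, text in WINSOCK.findall(line):
            winsock[(int(code), text.strip())] += 1
        match = EXIT_STATUS.search(line)
        if match:
            exit_status = (int(match.group(1)), match.group(2).strip())
    return {"levels": levels, "winsock": winsock, "exit_status": exit_status}


# Three lines copied from the extract above.
SAMPLE = [
    "05:00:14.355: [4500.4244] <2> tar_base::V_vTarMsgW: FTL - socket write failed",
    "05:00:14.355: [4500.4244] <16> dtcp_write: TCP - failure: send socket (1856) "
    "(TCP 10054: Connection reset by peer)",
    "05:00:14.355: [4500.4244] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: "
    "socket write failed",
]

if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print(summary["exit_status"])
    print(dict(summary["winsock"]))
```

To run it against a real log, pass the open file instead of `SAMPLE`, for example `summarize(open(path, errors="replace"))`, where `path` is the bpbkar log file on the client.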

I know that the 156 error relates to snapshot file creation and issues with that. But the 150 error doesn't really help me - there's next to no information on the net about this error. It's one of those NetBackup errors that means nothing.

Anyone had similar issues?

MOHAMED_PATEL · ‎10-13-2010

Snapshot errors (Status 156) are commonly related to timeout issues, so increase the timeouts to 3600 seconds (1 hour) or higher.

The other issue is a network connection problem, seen many times on Windows servers, where the connection is broken, as the log shows: TCP 10054: Connection reset by peer

Make sure that the NIC and the switch are both set to full duplex if possible.

The basic commands below change the network settings on the server. The TCP features they control (auto-tuning and chimney offload) have been found to cause unexpected socket drops under network load during backups.

- Run this command to check if those features are enabled
netsh int tcp show global

- Disable the features by running these commands
netsh int tcp set global autotuning=disabled
netsh int tcp set global chimney=disabled
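
The check step can also be scripted so that the result is recorded before and after the change. The following is a rough Python sketch that parses saved output of `netsh int tcp show global` and lists which of the two features are not yet disabled. The function names and the sample output are assumptions for illustration; the real output layout may differ between Windows versions, and the `netsh int tcp` context may not exist on older releases.

```python
def parse_tcp_globals(text):
    """Turn 'Name : value' lines into a dict, skipping headers and rules."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def features_still_enabled(settings):
    """Return the chimney / auto-tuning settings whose value is not 'disabled'."""
    flagged = []
    for name, value in settings.items():
        lowered = name.lower()
        is_target = "chimney" in lowered or "auto-tuning" in lowered
        if is_target and value.lower() != "disabled":
            flagged.append(name)
    return flagged


# Illustrative output only (an assumption, not captured from the servers
# discussed in this thread).
SAMPLE_OUTPUT = """\
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
Receive Window Auto-Tuning Level    : normal
"""

if __name__ == "__main__":
    for name in features_still_enabled(parse_tcp_globals(SAMPLE_OUTPUT)):
        print("still enabled:", name)
```

On the server itself, the text could come from `subprocess.run(["netsh", "int", "tcp", "show", "global"], capture_output=True, text=True).stdout` instead of the sample string.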

Then run the backup again.

The relevant technote is:

http://www.symantec.com/docs/TECH60844

Colin_North · ‎10-14-2010

Hi Mohamed,

Thanks for your answer. I increased the Client Read Timeout values on the two affected clients to 3600 (1 hour), but last night the backup jobs failed again. I haven't changed any of the NIC settings on the client servers, because they are Production boxes and this would require a Change Request to be approved. But I do know that most, if not all, of our servers are set to "Link Speed and Duplex = Auto Detect" and that all of the "Offload" options will be enabled.

Just to clean things up, I rebooted both of my NetBackup servers this morning. I am fairly sure that something network related is going on, so I don't expect the reboot to help, but a reboot is always good.

After looking through the performance tuning manual, I discovered that high network/CPU utilisation on the master server can be fixed by changing the Master Server Firewall properties - see screenshot below. I have no experience in changing this, so I don't know what effect this change could have on all of my backups.

If I cannot fix this in the next day or so, I will have to contact Support.

MOHAMED_PATEL · ‎10-14-2010

Looking at the screenshot, remove the 'localhost' entry from the NetBackup firewall configuration.

What was meant was the Windows Firewall service:

Disable Windows Firewall from the Services configuration.

Increase the 'BPSTART_notify' timeout on the Master server; this should alleviate the Status 156 errors.

Also, instead of relying on DNS, update the hosts file on all the servers with the relevant hostnames and IP addresses.
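
As a rough illustration of the hosts file suggestion, the Python sketch below builds the hosts entries from one table and reports any name that currently resolves to a different address. The hostnames and the 192.0.2.x addresses are hypothetical placeholders, and the function names are assumptions made for this example. On Windows the hosts file is normally %SystemRoot%\System32\drivers\etc\hosts.

```python
import socket

# Hypothetical names and documentation-range addresses; replace with real ones.
HOSTS = {
    "nbu-master": "192.0.2.10",
    "client1": "192.0.2.21",
    "client2": "192.0.2.22",
}


def format_hosts_entries(hosts):
    """Return 'IP  hostname' lines, sorted by hostname, ready for a hosts file."""
    width = max(len(ip) for ip in hosts.values())
    return [f"{ip.ljust(width)}  {name}" for name, ip in sorted(hosts.items())]


def mismatches(hosts, resolve=socket.gethostbyname):
    """List (name, expected, actual) where resolution disagrees with the table."""
    problems = []
    for name, expected in sorted(hosts.items()):
        try:
            actual = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append((name, expected, f"lookup failed: {exc}"))
            continue
        if actual != expected:
            problems.append((name, expected, actual))
    return problems


if __name__ == "__main__":
    print("\n".join(format_hosts_entries(HOSTS)))
    for name, expected, actual in mismatches(HOSTS):
        print(f"{name}: expected {expected}, got {actual}")
```

Running the same check on the master, the media server, and each client shows whether every machine resolves the others consistently once the hosts files are updated.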

...Good Luck
