12-04-2015 06:30 AM
Hi Team,
Most of our clients (Windows 2008/2012) acting as file servers fail with status code 13 or 24.
Only specific large drives are affected, across all the servers; every time, we need to resume the job to take it to completion.
12/04/2015 00:11:00 - Info bpbrm (pid=27793) starting bpbkar on client
12/04/2015 00:11:00 - connected; connect time: 0:00:00
12/04/2015 00:11:40 - Info bpbkar (pid=7312) Backup started
12/04/2015 00:11:40 - Info bpbrm (pid=27793) bptm pid: 27850
12/04/2015 00:11:40 - Info bpbkar (pid=7312) change time comparison:<disabled>
12/04/2015 00:11:40 - Info bpbkar (pid=7312) archive bit processing:<enabled>
12/04/2015 00:11:41 - Info bptm (pid=27850) start
12/04/2015 00:11:41 - Info bptm (pid=27850) using 262144 data buffer size
12/04/2015 00:11:41 - Info bpbkar (pid=7312) not using change journal data for <M:\>: not enabled
12/04/2015 00:11:41 - Info bptm (pid=27850) using 30 data buffers
12/04/2015 00:11:43 - Info bptm (pid=27850) start backup
12/04/2015 00:11:44 - Info bptm (pid=27850) backup child process is pid 27886
12/04/2015 00:11:44 - begin writing
12/04/2015 03:40:20 - Critical bpbrm (pid=27793) from client xxxxxxx-bk: FTL - socket write failed
12/04/2015 03:41:14 - Error bptm (pid=27850) media manager terminated by parent process
12/04/2015 03:41:40 - Info bpbkar (pid=7312) done. status: 24: socket write failed
12/04/2015 03:41:40 - end writing; write time: 3:29:56
socket write failed (24)
-------------------------------
12/03/2015 16:28:11 - Warning bpbrm (pid=10947) from client xxxxxx: WRN - can't open file: M:\Merchandising\Toys & General Merch\Raj\Template\~$Zone Tempelate.xlsx (WIN32 32: The process cannot access the file because it is being used by another process. )
12/03/2015 16:33:26 - Error bpbrm (pid=10947) socket read failed: errno = 104 - Connection reset by peer
12/03/2015 16:33:26 - Error bptm (pid=11099) system call failed - Connection reset by peer (at child.c.1306)
12/03/2015 16:33:49 - Error bptm (pid=11099) unable to perform read from client socket, connection may have been broken
12/03/2015 16:33:49 - Error bptm (pid=11077) media manager terminated by parent process
12/03/2015 16:36:20 - Error bpbrm (pid=10947) could not send server status message
12/03/2015 16:36:50 - Info bpbkar (pid=12148) done. status: 13: file read failed
12/03/2015 16:36:50 - end writing; write time: 21:32:39
file read failed (13)
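For context on the bptm lines above ("using 262144 data buffer size" / "using 30 data buffers"): those values normally come from the NetBackup buffer touch files. A minimal sketch of how they are set, using a scratch directory instead of the real `/usr/openv/netbackup/db/config` so it is safe to run anywhere:

```shell
# The bptm values "262144 data buffer size" and "30 data buffers" reflect the
# SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS touch files. The real location is
# /usr/openv/netbackup/db/config; a scratch directory is used here for safety.
CFG=$(mktemp -d)
echo 262144 > "$CFG/SIZE_DATA_BUFFERS"    # bytes per buffer, matching the log
echo 30     > "$CFG/NUMBER_DATA_BUFFERS"  # buffer count, matching the log
cat "$CFG/SIZE_DATA_BUFFERS" "$CFG/NUMBER_DATA_BUFFERS"
```

The values shown are the ones from the log above, not a tuning recommendation.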
12-04-2015 08:03 PM
Hi Marianne,
The NBU master, media server, and client version is 7.6.0.1.
The master is running on Solaris 10.
The media servers are running on RHEL 6.1.
Checkpoint restart is enabled on all policies, with a 30-minute interval.
All policies are configured with multistreaming.
I will enable the logs.
12-05-2015 05:10 AM
Hello rVikas,
This is strange, of course. It reminds me of a similar issue, with read and write errors, that in the end turned out to be related to high catalog activity at the time the backups ran.
In my case it was an Oracle DB administrator doing crosschecks (with the deletion option) on Monday mornings, which caused NetBackup to delete its images from the NB database manually, without taking into account the retention set in NetBackup.
This high activity during specific hours on Monday mornings made the NetBackup database inaccessible and the backups fail. You should check for peaks on the master server when the failures happen; of course, you should always have a minimum of 3 GB of free disk space in the NetBackup directory to avoid DB issues.
best regards,
Cruisen
12-05-2015 09:16 PM
Hi Cruisen,
I am not sure if it is because of a DB issue.
I am facing the issue only with Windows file servers, and only on large drives bigger than 1 TB.
All backups fail with the same errors, 13/24, and the message "socket write failed".
12-06-2015 12:54 AM
Hi rVikas,
When an update of the NetBackup database takes very long (for a 1 TB backup), it can exceed the client-read timeout setting in NetBackup, because the modification of the DB takes so long that the connection to the client gets reset.
So it is both: the database activity, and also the fact that large servers trigger it.
The CLIENT_READ_TIMEOUT option specifies the number of seconds to use for the client-read timeout.
If the master server does not get a response from a client within the CLIENT_READ_TIMEOUT period, the backup or restore operation, for example, fails.
https://www.veritas.com/support/en_US/article.000043984
best regards,
Cruisen
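If timeouts are suspected, one low-risk sketch is to check and raise CLIENT_READ_TIMEOUT on the media server. On a real server this would mean editing `/usr/openv/netbackup/bp.conf` (or using `bpgetconfig`/`bpsetconfig`); the demo below works against a scratch copy so the commands are safe to run anywhere, and the 1800-second value is an example, not a recommendation:

```shell
# Sketch: raising CLIENT_READ_TIMEOUT. Real file: /usr/openv/netbackup/bp.conf;
# a scratch copy is used here so this runs without a NetBackup install.
BPCONF=$(mktemp)
echo "CLIENT_READ_TIMEOUT = 300" > "$BPCONF"            # 300 s is the default
sed -i 's/^CLIENT_READ_TIMEOUT = .*/CLIENT_READ_TIMEOUT = 1800/' "$BPCONF"
grep '^CLIENT_READ_TIMEOUT' "$BPCONF"                   # now 1800 (30 minutes)
```

Note that the socket errors in this thread are resets and aborts rather than clean timeouts, so raising the timeout may only mask an underlying network problem.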
12-09-2015 02:39 AM
Hi Cruisen,
Attached are my master server timeout properties.
I am using the same values for the client timeout properties as well.
12-09-2015 02:44 AM
Hi Marianne,
We have NetBackup 7.6.0.1 installed on the clients, master, and media servers.
Should I still disable multistreaming and checkpoint restart in order to test the backup?
12-09-2015 02:48 AM
Not sure what is causing the error; for every file server I am getting the error below:
12/09/2015 04:40:34 - Error bpbrm (pid=24221) socket read failed: errno = 104 - Connection reset by peer
12/09/2015 04:40:34 - Error bptm (pid=24421) system call failed - Connection reset by peer (at child.c.1306)
12/09/2015 04:40:34 - Error bptm (pid=24421) unable to perform read from client socket, connection may have been broken
12/09/2015 04:40:34 - Error bptm (pid=24385) media manager terminated by parent process
12/09/2015 04:40:36 - Error bpbrm (pid=24221) could not send server status message
12/09/2015 04:40:40 - Info bpbkar (pid=2588) done. status: 13: file read failed
12/09/2015 04:40:40 - end writing; write time: 4:14:20
Symantec has asked me to run the SAS utility; however, my media server is Linux 6.1, 64-bit, and I am not able to run this 32-bit utility on it.
The backup does not fail at a particular time: sometimes it fails after writing 500 GB, sometimes after 100 GB.
12-09-2015 09:01 AM
How much free space do you have on the NetBackup repository? I asked you this before!
Best regards
Cruisen
12-14-2015 08:40 AM
Hello Cruisen,
Below is the file system status:
[root@xxxx tmp]# df -kh /usr/openv
Filesystem size used avail capacity Mounted on
dpool/openv 65G 40G 25G 63% /usr/openv
[root@xxxx tmp]# df -kh /openv - This path holds the NetBackup database.
Filesystem size used avail capacity Mounted on
dpool/db 280G 150G 130G 54% /openv
[root@xxxx tmp]# df -kh /usr/openv/NBUlogs
Filesystem size used avail capacity Mounted on
dpool/NBUlogs 200G 103G 97G 52% /usr/openv/NBUlogs
12-14-2015 09:05 AM
Incremental backups complete without any issue; only the full ones fail.
Am I missing any NetBackup configuration parameters for backing up large file systems?
12-14-2015 09:20 AM
This is what I am seeing in the bpbkar log:
19:38:43.546 [12996.10384] <2> tar_base::V_vTarMsgW: INF - Excluded: F:\CorpServ\Corporate Operations\FS_Retail_Learning\Resources\2014 Projects\PeopleSoft\Simulations\SMARTForms\FRench\PSoft_SMART_FR_FS Raw\PSoft_SMART_SIM_FR_FS_10_MngTermination.cp.bak
19:38:43.827 [12996.10384] <2> tar_base::V_vTarMsgW: INF - Excluded: F:\CorpServ\Corporate Operations\FS_Retail_Learning\Resources\2014 Projects\PeopleSoft\Simulations\SMARTForms\FRench\PSoft_SMART_FR_FS Raw\PSoft_SMART_SIM_FR_FS_11_MngAddressChange.cp.bak
19:38:44.295 [12996.10384] <2> tar_base::V_vTarMsgW: INF - Excluded: F:\CorpServ\Corporate Operations\FS_Retail_Learning\Resources\2014 Projects\PeopleSoft\Simulations\SMARTForms\FRench\PSoft_SMART_FR_FS Raw\PSoft_SMART_SIM_FR_FS_12_MngLOA.cp.bak
19:38:44.685 [12996.10384] <2> tar_base::V_vTarMsgW: INF - Excluded: F:\CorpServ\Corporate Operations\FS_Retail_Learning\Resources\2014 Projects\PeopleSoft\Simulations\SMARTForms\FRench\PSoft_SMART_FR_FS Raw\PSoft_SMART_SIM_FR_FS_13_MngLocTransfer.cp.bak
19:45:56.839 [12996.10384] <16> tar_tfi::processException:
An Exception of type [SocketWriteException] has occured at:
Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.55 $ , Function: TransporterRemote::write[2](), Line: 338
Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.91 $ , Function: Packer::getBuffer(), Line: 652
Module: tar_tfi::getBuffer, Function: D:\NB\NB_7.6.0.1\src\cl\clientpc\util\tar_tfi.cpp, Line: 311
Local Address: [::]:0
Remote Address: [::]:0
OS Error: 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
)
Expected bytes: 1049600
19:45:56.870 [12996.10384] <2> tar_base::V_vTarMsgW: FTL - socket write failed
19:45:56.870 [12996.10384] <16> dtcp_write: TCP - failure: send socket (504) (TCP 10053: Software caused connection abort)
19:45:56.870 [12996.10384] <16> dtcp_write: TCP - failure: attempted to send 26 bytes
19:45:56.870 [12996.10384] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
19:45:56.870 [12996.10384] <4> tar_backup::backup_done_state: INF - number of file directives found: 1
19:45:56.886 [12996.12128] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
19:45:56.901 [12996.10384] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
19:45:56.933 [12996.10384] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
19:45:56.933 [12996.10384] <16> dtcp_write: TCP - failure: send socket (504) (TCP 10053: Software caused connection abort)
19:45:56.933 [12996.10384] <16> dtcp_write: TCP - failure: attempted to send 42 bytes
19:45:56.933 [12996.10384] <4> tar_backup::backup_done_state: INF - Not waiting for server status
19:45:58.149 [12996.10384] <4> ov_log::OVLoop: Timestamp
19:45:58.149 [12996.10384] <4> OVStopCmd: INF - EXIT - status = 0
19:45:58.149 [12996.10384] <2> tar_base::V_Close: closing...
19:45:58.149 [12996.10384] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
19:45:59.241 [12996.10384] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000101
19:45:59.241 [12996.10384] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): ubs specifics: 0x001d0000
19:46:00.942 [12996.10384] <16> dtcp_read: TCP - failure: recv socket (568) (TCP 10053: Software caused connection abort)
19:46:00.942 [12996.10384] <16> dtcp_read: TCP - failure: recv socket (504) (TCP 10053: Software caused connection abort)
19:46:00.942 [12996.10384] <4> OVShutdown: INF - Finished process
19:46:00.957 [12996.10384] <4> WinMain: INF - Exiting C:\program files\Veritas\NetBackup\bin\bpbkar32.exe
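The telling details in a log like the one above are the Winsock error codes: 10060 (the peer did not respond in time) on the initial write failure, then 10053 (software caused connection abort) on every subsequent socket call. A quick sketch for tallying those codes across a bpbkar log; a small sample log is created inline so the command is runnable as-is (point the `grep` at the real log file in practice):

```shell
# Tally Winsock error codes (100xx range) in a bpbkar-style log to see whether
# failures are timeouts (10060) or local connection aborts (10053).
LOG=$(mktemp)
printf '%s\n' \
  'OS Error: 10060 (A connection attempt failed ...)' \
  'dtcp_write: TCP - failure: send socket (504) (TCP 10053: ...)' \
  'dtcp_read: TCP - failure: recv socket (568) (TCP 10053: ...)' > "$LOG"
grep -oE '100[0-9]{2}' "$LOG" | sort | uniq -c
```

In the sample this prints a count of 2 for 10053 and 1 for 10060; a real log dominated by 10060 points at something on the path silently dropping the connection.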
12-15-2015 01:17 AM
Please post the full set of logs as attachments (.txt preferably). Snippets without the media server logs that cover the same period do not help.
Please confirm the timeouts on the media server.
Timeouts occur on the media server, not the master or client.
12-15-2015 12:43 PM
Are you using the Accelerator feature, and deduplication?
It sounds like the full schedule needs a rescan.
On the policy schedules you will find an option called Accelerator forced rescan, which needs to be checked.
In other words, can you tell us more about your architecture and the features you are using within NetBackup?
best regards,
Cruisen
12-15-2015 11:01 PM
Hi rVikas,
Are those client servers VMs or physical servers?
If they are VMs, try changing the host.
12-16-2015 01:56 PM
One observation:
As I said, most of the file system backups with a huge number of files fail with socket write failed (code 24).
All of those media servers have bonding configured, e.g. xxxxxx1-bond0, xxxxxx2-bond0, xxxxxx3-bond0, xxxxxx4-bond0.
Below is the bonding status of one of the media servers:
[root@xxxxxx-2~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 33
Partner Key: 221
Partner Mac Address: 00:23:04:ee:c1:85
Slave Interface: eth4
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1e:27:57:1j:7g
Aggregator ID: 1
Slave queue ID: 0
Slave Interface: eth5
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1e:27:57:1j:7g
Aggregator ID: 1
Slave queue ID: 0
=================================================================
I tried one file system backup with one of my media servers that was the only one in its storage unit; that media server does not have bonding configured, and its link speed is 1000 Mb/s.
To my surprise, the backup completed successfully, though it took 46 hours to complete 1.3 TB of data at a speed of 10 Mbps, and without any retry.
======================================================================
So it seems the bonding configuration is the cause of the intermittent socket-write-failed failures.
Has anyone come across this kind of issue, and if so, what is the best bonding mode to configure?
Fault-Tolerance, for example?
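For a comparison test, one option is to switch a media server's bond from 802.3ad to active-backup (pure fault tolerance: a single active slave, so no transmit-hash distribution is involved) and re-run a failing full. A sketch of the relevant RHEL 6 config fragment; the file path is the standard RHEL 6 location, and the values are illustrative rather than a tuned recommendation:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (RHEL 6; illustrative values)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
# mode=active-backup (mode 1): only one slave carries traffic at a time, so
# the 802.3ad layer2+3 hashing seen in the /proc output above is taken out of
# the picture; miimon=100 keeps the 100 ms link monitoring interval.
BONDING_OPTS="mode=active-backup miimon=100"
```

After changing the mode, the bond has to be restarted (e.g. `service network restart`) and the switch ports taken out of the LACP channel group, so this is best tried in a maintenance window.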