Solved: Status 13 (File read failed) and 40 on HP-UX Clien...

Fr_d_ric_RIGAL · ‎01-19-2012

Hi all,

Initial configuration : Master / Media Server NBU 7.1 under Win 2k3

Client HP-UX 11.23 under version 7.1.

Storage Device : VTL Quantum DXI 7500

Backup OK (Full and Incr) with bpstart and bpend notify to close and restart Oracle Databases.

New configuration : Master VM NBU 7.1.0.3 under Win 2k8 64 - Media Server phys. NBU 7.1.0.3 under Win 2k3 (old Master)

Client HP-UX 11.23 under version 7.1.

Storage Device : VTL Quantum DXI 6700

Backup Failed with status code 13 (File read failed) after approximatively 150 Gb.

Here is NBU log extract :

bpbkar (client) :

13:04:34.931 [5111] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE

13:04:34.931 [5111] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE sig

nals

13:04:34.934 [5111] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 40: netwo

rk connection broken

13:04:34.934 [5111] <4> bpbkar Exit: INF - EXIT STATUS 40: network connection br

oken

13:04:34.934 [5111] <2> bpbkar Exit: INF - Close of stdout complete

13:04:37.994 [5111] <16> bpbkar Exit: ERR - read server exit status = 13: file r

ead failed

bpbrm (media) :

13:04:35.354 [7876.6020] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2963 TO 172.30.64.115.1556 fd = 1480
13:04:35.385 [7876.6020] <2> db_end: Need to collect reply
13:04:35.385 [7876.6020] <16> bpbrm readline: socket read failed, Une connexion existante a dû être fermée par l'hôte distant. (10054)
13:04:35.385 [7876.6020] <2> ConnectionCache::connectAndCache: Acquiring new connection for host fr-vpt-nbu-01.fr.guerbet-group.com, query type 1
13:04:35.401 [7876.6020] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:04:35.401 [7876.6020] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2964 TO 172.30.64.115.1556 fd = 1500
13:04:35.448 [7876.6020] <2> db_end: Need to collect reply
13:04:35.448 [7876.6020] <2> bpbrm kill_child_process_Ex: start
13:04:38.151 [7876.6020] <2> bpbrm wait_for_child: start
13:04:38.151 [7876.6020] <2> bpbrm wait_for_child: child exit_status = 82
13:04:38.151 [7876.6020] <2> inform_client_of_status: INF - Server status = 13
13:04:38.276 [7876.6020] <2> bpbrm Exit: attempting to send mail to root on neftys
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+0,1,2,0,1,0+neftys.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+veritas_pbx,1,0,2,1,0+neftys.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\0c9\b992c6c9+0,1,2,0,1,0+172.16.16.112.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+vnetd,1,0,2,1,0+neftys.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+bpcd,1,0,2,1,0+neftys.txt
13:04:38.291 [7876.6020] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:04:38.291 [7876.6020] <2> logconnections: BPCD CONNECT FROM 172.16.16.147.2969 TO 172.16.16.112.1556 fd = 1512
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\0c9\b992c6c9+veritas_pbx,1,0,2,1,0+172.16.16.112.txt
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\0c9\b992c6c9+vnetd,1,0,2,1,0+172.16.16.112.txt
13:04:38.307 [7876.6020] <2> vnet_pbxConnect: pbxConnectEx Succeeded

bptm (media) :

13:04:35.994 [6860.3264] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2966 TO 172.30.64.115.1556 fd = 832
13:04:36.041 [6860.3264] <2> db_end: Need to collect reply
13:04:36.104 [6860.3264] <2> ConnectionCache::connectAndCache: Acquiring new connection for host fr-vpt-nbu-01.fr.guerbet-group.com, query type 1
13:04:36.119 [6860.3264] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:04:36.119 [6860.3264] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2967 TO 172.30.64.115.1556 fd = 916
13:04:36.151 [6860.3264] <2> db_end: Need to collect reply
13:04:36.151 [6860.3264] <4> write_backup: successfully wrote backup id neftys_1326968740, copy 1, fragment 97, 1390552 Kbytes at 16015.018 Kbytes/sec
13:04:38.151 [6860.3264] <2> io_close: closing D:\Veritas\NetBackup\db\media\tpreq\drive_HP.ULTRIUM2-SCSI.010, from bptm.c.16367
13:04:38.151 [6860.3264] <2> drivename_write: Called with mode 1
13:04:38.151 [6860.3264] <2> drivename_unlock: unlocked
13:04:38.151 [6860.3264] <2> drivename_checklock: Called
13:04:38.151 [6860.3264] <2> drivename_lock: lock established
13:04:38.151 [6860.3264] <2> drivename_unlock: unlocked
13:04:38.151 [6860.3264] <2> drivename_close: Called for file HP.ULTRIUM2-SCSI.010
13:04:38.151 [6860.3264] <2> send_MDS_msg: MEDIA_DONE 0 -1326896691 0 VP0150 4006089 0 {8DB9E77A-378A-43FC-87BF-6FA5FF426CAC}
13:04:38.151 [6860.3264] <2> packageBptmResourceDoneMsg: msg (MEDIA_DONE 0 -1326896691 0 VP0150 4006089 0 {8DB9E77A-378A-43FC-87BF-6FA5FF426CAC})
13:04:38.151 [6860.3264] <2> packageBptmResourceDoneMsg: keyword MEDIA_DONE version 0 jobid -1326896691 copyNum 0 mediaId VP0150 mediaKey 4006089 unloadDelay 0 allocId {8DB9E77A-378A-43FC-87BF-6FA5FF426CAC}
13:04:38.151 [6860.3264] <2> packageBptmResourceDoneMsg: returns 0
13:04:38.151 [6860.3264] <2> JobInst::sendIrmMsg: returning
13:04:38.151 [6860.3264] <2> mm_terminate: EXITING with status 82
13:04:38.151 [6860.3264] <2> cleanup: Detached from BPBRM shared memory
13:04:52.339 [7012.4824] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn HP.ULTRIUM2-SCSI.010 -dp {3,0,0,9} -dk 2000812 -m VP0150 -mk 4006089 -mds 0 -alocid 739797 -jobid -1326896773 -jm
13:04:52.339 [7012.4824] <2> vnet_same_host: ../../libvlibs/vnet_addrinfo.c.2928: 0: name2 is empty: 0 0x00000000
13:04:52.339 [7012.4824] <4> bptm: emmserver_name = fr-vpt-nbu-01.fr.guerbet-group.com
13:04:52.339 [7012.4824] <4> bptm: emmserver_port = 1556

I've tried to upgrade client NBU version in 7.1.0.3 : idem

I've set CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT on Media Server to 7200 : idem.

I've set BACKUP_START_NOTIFY and END_NOTIFY TIMEOUT on Media Server to 3600 : idem.

Thank you for your help.

Best regards.

Mark_Solutions · ‎01-20-2012

Thanks - that is much better

You have already said that your client read timeouts have been set to good values but something happens here that is in the range of 5 to 6 minutes - which does indicate a timeout.

The backup runs fine until it starts hitting areas that cannot be backed up and it starts skipping files.

Just over 5 minutes into skipping files it crashed - as if the lack of data being written has cause the failure - which does make sense apart from your timeouts apparently being set high.

Set the timeouts on the client host properties timeouts as well - but it is normally good enough to have them set on the Media Server.

Just in case it is an O/S thing then please try the following:

On the Media Server set the following registry key:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

Create a new DWORD named TcpTimedWaitDelay with a Decimal Value of 30

Then from a command line (run as administrator) netsh int ipv4 set dynamicport tcp start=10000 num=50000

Reboot the Media Server for the registry entry to take effect

This will stop the ports being locked out in a Timed_Wait state which could cause an issue and also increase the number of available ports.

On the client please run the following and let me know the values returned

# cat /proc/sys/net/ipv4/tcp_keepalive_time

# cat /proc/sys/net/ipv4/tcp_keepalive_intvl

# cat /proc/sys/net/ipv4/tcp_keepalive_probes

One other alternative is to take a look at the files (or folders) being skipped and exclude them for this client to prevent the delay in the first place

Hope this helps

View solution in original post

Fr_d_ric_RIGAL · ‎01-19-2012

Here is an extract of Activity Monitor :

9/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13187] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13190] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13219] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client1322] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13227] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13228] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13229] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13230] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13231] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13232] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13233] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13238] is a socket special file. Skipping
19/01/2012 16:39:33 - Error bpbrm(pid=6544) socket read failed, Une connexion existante a dû être fermée par l'hôte distant. (10054)
19/01/2012 16:39:35 - Error bpbrm(pid=6544) cannot send mail to netbackup@enterprise.com
19/01/2012 16:39:36 - end writing; write time: 00:20:38
file read failed(13)
19/01/2012 16:39:41 - Info bpbkar32(pid=18165) done. status: 13: file read failed

Mark_Solutions · ‎01-19-2012

Are they from different jobs or the same job? Big time difference!

Could do with seeing a whole job detail for start and end times - perhaps the whole of the activity monitor entry

Fr_d_ric_RIGAL · ‎01-19-2012

Sorry

The last log (Activity Monitor) does not match the previous logs. It's just an exemple of this issue.

Mark_Solutions · ‎01-19-2012

So could we see the full text please, especially important for 10054 errors (network error / timeout)

Thanks

Fr_d_ric_RIGAL · ‎01-20-2012

Hi Mark

here is an extract from bpbrm logs from this night :

00:31:44.602 [7780.8160] <2> verify_client: bpbrm.c.34953: db_getCLIENT failed: 227 227 0x000000e3

00:31:44.618 [7780.8160] <2> vnet_pbxConnect: ../../libvlibs/vnet_pbx.c.666: pbxSetAddrEx/pbxConnectEx return error 10054:Une connexion existante a dû être fermée par l'hôte distant.

00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1718: 0: vnet_pbxConnect() failed: 18 0x00000012

00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1719: 0: save_errno: 10054 0x00002746

00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1720: 0: use_vnetd: 0 0x00000000

00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1721: 0: cr->vcr_service: bpcd

00:31:44.618 [7780.8160] <2> vnet_async_connect: ../../libvlibs/vnet_connect.c.1353: 0: do_service failed: 18 0x00000012

00:31:44.634 [7780.8160] <2> vnet_vnetd_service_socket: ../../libvlibs/vnet_vnetd.c.1957: 0: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006

00:31:44.634 [7780.8160] <2> vnet_vnetd_service_socket: ../../libvlibs/vnet_vnetd.c.1971: 0: service: bpcd

00:31:44.634 [7780.8160] <2> logconnections: BPCD CONNECT FROM 172.16.16.147.3562 TO 172.16.16.137.13724 fd = 1584

00:31:44.665 [7780.8160] <2> vnet_pbxConnect: ../../libvlibs/vnet_pbx.c.666: pbxSetAddrEx/pbxConnectEx return error 10054:Une connexion existante a dû être fermée par l'hôte distant.

00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1718: 0: vnet_pbxConnect() failed: 18 0x00000012

00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1719: 0: save_errno: 10054 0x00002746

00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1720: 0: use_vnetd: 1 0x00000001

00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1721: 0: cr->vcr_service: vnetd

00:31:44.665 [7780.8160] <2> vnet_async_connect: ../../libvlibs/vnet_connect.c.1353: 0: do_service failed: 18 0x00000012

00:31:44.680 [7780.8160] <2> do_vnetd_service: ../../libvlibs/vnet_connect.c.1675: 0: connect: VNETD CONNECT FROM 172.16.16.147.3564 TO 172.16.16.137.13724 fd = 1572

00:31:44.680 [7780.8160] <2> vnet_vnetd_connect_forward_socket_begin: ../../libvlibs/vnet_vnetd.c.445: 0: VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a

00:31:44.680 [7780.8160] <2> vnet_vnetd_connect_forward_socket_begin: ../../libvlibs/vnet_vnetd.c.462: 0: ipc_string: 1734

00:31:47.134 [7780.8160] <2> bpbrm start_bpcd_stat: DATA_SOCK from bpcr = 1584

00:31:47.134 [7780.8160] <2> bpbrm start_bpcd_stat: NAME_SOCK from bpcr = 1572

00:31:47.134 [7780.8160] <2> ConnectionCache::connectAndCache: Acquiring new connection for host fr-vpt-nbu-01.fr.guerbet-group.com, query type 223

00:31:47.134 [7780.8160] <2> vnet_pbxConnect: pbxConnectEx Succeeded

00:31:47.134 [7780.8160] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.3569 TO 172.30.64.115.1556 fd = 1576

00:31:47.165 [7780.8160] <2> db_CLIENTsend: reset client protocol version from 0 to 7

00:31:47.181 [7780.8160] <2> db_getCLIENT: db_CLIENTreceive: no entity was found 227

00:31:47.196 [7780.8160] <2> vnet_pbxConnect: ../../libvlibs/vnet_pbx.c.666: pbxSetAddrEx/pbxConnectEx return error 10054:Une connexion existante a dû être fermée par l'hôte distant.

Mark_Solutions · ‎01-20-2012

So did this fail after 3 seconds?

We need to see it all, not an extract, otherwise we cannot see the full picture.

Paste the who job activity details into a text file and attach it - much easier on here if it is an attachment rather than all the text pasted into the field

Thanks

Fr_d_ric_RIGAL · ‎01-20-2012

Mark,

here is the file you ask, but this file doesn't match with bpbkar, bptm and bpbrm logs posted yesterday.

these logs have been deleted by customer.

Hope you will find something.

Thank you

Mark_Solutions · ‎01-20-2012

Thanks - that is much better

You have already said that your client read timeouts have been set to good values but something happens here that is in the range of 5 to 6 minutes - which does indicate a timeout.

The backup runs fine until it starts hitting areas that cannot be backed up and it starts skipping files.

Just over 5 minutes into skipping files it crashed - as if the lack of data being written has cause the failure - which does make sense apart from your timeouts apparently being set high.

Set the timeouts on the client host properties timeouts as well - but it is normally good enough to have them set on the Media Server.

Just in case it is an O/S thing then please try the following:

On the Media Server set the following registry key:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

Create a new DWORD named TcpTimedWaitDelay with a Decimal Value of 30

Then from a command line (run as administrator) netsh int ipv4 set dynamicport tcp start=10000 num=50000

Reboot the Media Server for the registry entry to take effect

This will stop the ports being locked out in a Timed_Wait state which could cause an issue and also increase the number of available ports.

On the client please run the following and let me know the values returned

# cat /proc/sys/net/ipv4/tcp_keepalive_time

# cat /proc/sys/net/ipv4/tcp_keepalive_intvl

# cat /proc/sys/net/ipv4/tcp_keepalive_probes

One other alternative is to take a look at the files (or folders) being skipped and exclude them for this client to prevent the delay in the first place

Hope this helps

Fr_d_ric_RIGAL · ‎01-20-2012

Mark,

thank you very much for your help!

At this point, we only have performed a test by excluding /var/spool/sockets and the backup is OK.

We will wait the issue of the full backup of this weekend before modifiyng anything else.

I would let you informed next week.

Best regards

Fr_d_ric_RIGAL · ‎01-23-2012

All is working fine now by excluding /var/spool/sockets.

Thank you very much.

VOX

Status 13 (File read failed) and 40 on HP-UX Client