01-19-2012 07:44 AM
Hi all,
Initial configuration : Master / Media Server NBU 7.1 under Win 2k3
Client HP-UX 11.23 under version 7.1.
Storage Device : VTL Quantum DXI 7500
Backup OK (Full and Incr) with bpstart and bpend notify to close and restart Oracle Databases.
New configuration : Master VM NBU 7.1.0.3 under Win 2k8 64 - Media Server phys. NBU 7.1.0.3 under Win 2k3 (old Master)
Client HP-UX 11.23 under version 7.1.
Storage Device : VTL Quantum DXI 6700
Backup Failed with status code 13 (File read failed) after approximatively 150 Gb.
Here is NBU log extract :
bpbkar (client) :
13:04:34.931 [5111] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE
13:04:34.931 [5111] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE sig
nals
13:04:34.934 [5111] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 40: netwo
rk connection broken
13:04:34.934 [5111] <4> bpbkar Exit: INF - EXIT STATUS 40: network connection br
oken
13:04:34.934 [5111] <2> bpbkar Exit: INF - Close of stdout complete
13:04:37.994 [5111] <16> bpbkar Exit: ERR - read server exit status = 13: file r
ead failed
bpbrm (media) :
13:04:35.354 [7876.6020] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2963 TO 172.30.64.115.1556 fd = 1480
13:04:35.385 [7876.6020] <2> db_end: Need to collect reply
13:04:35.385 [7876.6020] <16> bpbrm readline: socket read failed, Une connexion existante a dû être fermée par l'hôte distant. (10054)
13:04:35.385 [7876.6020] <2> ConnectionCache::connectAndCache: Acquiring new connection for host fr-vpt-nbu-01.fr.guerbet-group.com, query type 1
13:04:35.401 [7876.6020] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:04:35.401 [7876.6020] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2964 TO 172.30.64.115.1556 fd = 1500
13:04:35.448 [7876.6020] <2> db_end: Need to collect reply
13:04:35.448 [7876.6020] <2> bpbrm kill_child_process_Ex: start
13:04:38.151 [7876.6020] <2> bpbrm wait_for_child: start
13:04:38.151 [7876.6020] <2> bpbrm wait_for_child: child exit_status = 82
13:04:38.151 [7876.6020] <2> inform_client_of_status: INF - Server status = 13
13:04:38.276 [7876.6020] <2> bpbrm Exit: attempting to send mail to root on neftys
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+0,1,2,0,1,0+neftys.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+veritas_pbx,1,0,2,1,0+neftys.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\0c9\b992c6c9+0,1,2,0,1,0+172.16.16.112.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+vnetd,1,0,2,1,0+neftys.txt
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.276 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\160\f93ddb60+bpcd,1,0,2,1,0+neftys.txt
13:04:38.291 [7876.6020] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:04:38.291 [7876.6020] <2> logconnections: BPCD CONNECT FROM 172.16.16.147.2969 TO 172.16.16.112.1556 fd = 1512
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\0c9\b992c6c9+veritas_pbx,1,0,2,1,0+172.16.16.112.txt
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002
13:04:38.291 [7876.6020] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: D:\Veritas\NetBackup\var\host_cache\0c9\b992c6c9+vnetd,1,0,2,1,0+172.16.16.112.txt
13:04:38.307 [7876.6020] <2> vnet_pbxConnect: pbxConnectEx Succeeded
bptm (media) :
13:04:35.994 [6860.3264] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2966 TO 172.30.64.115.1556 fd = 832
13:04:36.041 [6860.3264] <2> db_end: Need to collect reply
13:04:36.104 [6860.3264] <2> ConnectionCache::connectAndCache: Acquiring new connection for host fr-vpt-nbu-01.fr.guerbet-group.com, query type 1
13:04:36.119 [6860.3264] <2> vnet_pbxConnect: pbxConnectEx Succeeded
13:04:36.119 [6860.3264] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.2967 TO 172.30.64.115.1556 fd = 916
13:04:36.151 [6860.3264] <2> db_end: Need to collect reply
13:04:36.151 [6860.3264] <4> write_backup: successfully wrote backup id neftys_1326968740, copy 1, fragment 97, 1390552 Kbytes at 16015.018 Kbytes/sec
13:04:38.151 [6860.3264] <2> io_close: closing D:\Veritas\NetBackup\db\media\tpreq\drive_HP.ULTRIUM2-SCSI.010, from bptm.c.16367
13:04:38.151 [6860.3264] <2> drivename_write: Called with mode 1
13:04:38.151 [6860.3264] <2> drivename_unlock: unlocked
13:04:38.151 [6860.3264] <2> drivename_checklock: Called
13:04:38.151 [6860.3264] <2> drivename_lock: lock established
13:04:38.151 [6860.3264] <2> drivename_unlock: unlocked
13:04:38.151 [6860.3264] <2> drivename_close: Called for file HP.ULTRIUM2-SCSI.010
13:04:38.151 [6860.3264] <2> send_MDS_msg: MEDIA_DONE 0 -1326896691 0 VP0150 4006089 0 {8DB9E77A-378A-43FC-87BF-6FA5FF426CAC}
13:04:38.151 [6860.3264] <2> packageBptmResourceDoneMsg: msg (MEDIA_DONE 0 -1326896691 0 VP0150 4006089 0 {8DB9E77A-378A-43FC-87BF-6FA5FF426CAC})
13:04:38.151 [6860.3264] <2> packageBptmResourceDoneMsg: keyword MEDIA_DONE version 0 jobid -1326896691 copyNum 0 mediaId VP0150 mediaKey 4006089 unloadDelay 0 allocId {8DB9E77A-378A-43FC-87BF-6FA5FF426CAC}
13:04:38.151 [6860.3264] <2> packageBptmResourceDoneMsg: returns 0
13:04:38.151 [6860.3264] <2> JobInst::sendIrmMsg: returning
13:04:38.151 [6860.3264] <2> mm_terminate: EXITING with status 82
13:04:38.151 [6860.3264] <2> cleanup: Detached from BPBRM shared memory
13:04:52.339 [7012.4824] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn HP.ULTRIUM2-SCSI.010 -dp {3,0,0,9} -dk 2000812 -m VP0150 -mk 4006089 -mds 0 -alocid 739797 -jobid -1326896773 -jm
13:04:52.339 [7012.4824] <2> vnet_same_host: ../../libvlibs/vnet_addrinfo.c.2928: 0: name2 is empty: 0 0x00000000
13:04:52.339 [7012.4824] <4> bptm: emmserver_name = fr-vpt-nbu-01.fr.guerbet-group.com
13:04:52.339 [7012.4824] <4> bptm: emmserver_port = 1556
I've tried to upgrade client NBU version in 7.1.0.3 : idem
I've set CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT on Media Server to 7200 : idem.
I've set BACKUP_START_NOTIFY and END_NOTIFY TIMEOUT on Media Server to 3600 : idem.
Thank you for your help.
Best regards.
Solved! Go to Solution.
01-20-2012 02:51 AM
Thanks - that is much better
You have already said that your client read timeouts have been set to good values but something happens here that is in the range of 5 to 6 minutes - which does indicate a timeout.
The backup runs fine until it starts hitting areas that cannot be backed up and it starts skipping files.
Just over 5 minutes into skipping files it crashed - as if the lack of data being written has cause the failure - which does make sense apart from your timeouts apparently being set high.
Set the timeouts on the client host properties timeouts as well - but it is normally good enough to have them set on the Media Server.
Just in case it is an O/S thing then please try the following:
On the Media Server set the following registry key:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
Create a new DWORD named TcpTimedWaitDelay with a Decimal Value of 30
Then from a command line (run as administrator) netsh int ipv4 set dynamicport tcp start=10000 num=50000
Reboot the Media Server for the registry entry to take effect
This will stop the ports being locked out in a Timed_Wait state which could cause an issue and also increase the number of available ports.
On the client please run the following and let me know the values returned
# cat /proc/sys/net/ipv4/tcp_keepalive_time
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
One other alternative is to take a look at the files (or folders) being skipped and exclude them for this client to prevent the delay in the first place
Hope this helps
01-19-2012 07:53 AM
Here is an extract of Activity Monitor :
9/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13187] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13190] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13219] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client1322] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13227] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13228] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13229] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13230] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13231] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13232] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13233] is a socket special file. Skipping
19/01/2012 16:39:33 - Info bpbrm(pid=6544) from client neftys: TRV - [/var/spool/sockets/pwgr/client13238] is a socket special file. Skipping
19/01/2012 16:39:33 - Error bpbrm(pid=6544) socket read failed, Une connexion existante a dû être fermée par l'hôte distant. (10054)
19/01/2012 16:39:35 - Error bpbrm(pid=6544) cannot send mail to netbackup@enterprise.com
19/01/2012 16:39:36 - end writing; write time: 00:20:38
file read failed(13)
19/01/2012 16:39:41 - Info bpbkar32(pid=18165) done. status: 13: file read failed
01-19-2012 08:17 AM
Are they from different jobs or the same job? Big time difference!
Could do with seeing a whole job detail for start and end times - perhaps the whole of the activity monitor entry
01-19-2012 08:20 AM
Sorry
The last log (Activity Monitor) does not match the previous logs. It's just an exemple of this issue.
01-19-2012 08:43 AM
So could we see the full text please, especially important for 10054 errors (network error / timeout)
Thanks
01-20-2012 01:25 AM
Hi Mark
here is an extract from bpbrm logs from this night :
00:31:44.602 [7780.8160] <2> verify_client: bpbrm.c.34953: db_getCLIENT failed: 227 227 0x000000e3
00:31:44.618 [7780.8160] <2> vnet_pbxConnect: ../../libvlibs/vnet_pbx.c.666: pbxSetAddrEx/pbxConnectEx return error 10054:Une connexion existante a dû être fermée par l'hôte distant.
00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1718: 0: vnet_pbxConnect() failed: 18 0x00000012
00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1719: 0: save_errno: 10054 0x00002746
00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1720: 0: use_vnetd: 0 0x00000000
00:31:44.618 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1721: 0: cr->vcr_service: bpcd
00:31:44.618 [7780.8160] <2> vnet_async_connect: ../../libvlibs/vnet_connect.c.1353: 0: do_service failed: 18 0x00000012
00:31:44.634 [7780.8160] <2> vnet_vnetd_service_socket: ../../libvlibs/vnet_vnetd.c.1957: 0: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
00:31:44.634 [7780.8160] <2> vnet_vnetd_service_socket: ../../libvlibs/vnet_vnetd.c.1971: 0: service: bpcd
00:31:44.634 [7780.8160] <2> logconnections: BPCD CONNECT FROM 172.16.16.147.3562 TO 172.16.16.137.13724 fd = 1584
00:31:44.665 [7780.8160] <2> vnet_pbxConnect: ../../libvlibs/vnet_pbx.c.666: pbxSetAddrEx/pbxConnectEx return error 10054:Une connexion existante a dû être fermée par l'hôte distant.
00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1718: 0: vnet_pbxConnect() failed: 18 0x00000012
00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1719: 0: save_errno: 10054 0x00002746
00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1720: 0: use_vnetd: 1 0x00000001
00:31:44.665 [7780.8160] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1721: 0: cr->vcr_service: vnetd
00:31:44.665 [7780.8160] <2> vnet_async_connect: ../../libvlibs/vnet_connect.c.1353: 0: do_service failed: 18 0x00000012
00:31:44.680 [7780.8160] <2> do_vnetd_service: ../../libvlibs/vnet_connect.c.1675: 0: connect: VNETD CONNECT FROM 172.16.16.147.3564 TO 172.16.16.137.13724 fd = 1572
00:31:44.680 [7780.8160] <2> vnet_vnetd_connect_forward_socket_begin: ../../libvlibs/vnet_vnetd.c.445: 0: VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a
00:31:44.680 [7780.8160] <2> vnet_vnetd_connect_forward_socket_begin: ../../libvlibs/vnet_vnetd.c.462: 0: ipc_string: 1734
00:31:47.134 [7780.8160] <2> bpbrm start_bpcd_stat: DATA_SOCK from bpcr = 1584
00:31:47.134 [7780.8160] <2> bpbrm start_bpcd_stat: NAME_SOCK from bpcr = 1572
00:31:47.134 [7780.8160] <2> ConnectionCache::connectAndCache: Acquiring new connection for host fr-vpt-nbu-01.fr.guerbet-group.com, query type 223
00:31:47.134 [7780.8160] <2> vnet_pbxConnect: pbxConnectEx Succeeded
00:31:47.134 [7780.8160] <2> logconnections: BPDBM CONNECT FROM 172.16.16.147.3569 TO 172.30.64.115.1556 fd = 1576
00:31:47.165 [7780.8160] <2> db_CLIENTsend: reset client protocol version from 0 to 7
00:31:47.181 [7780.8160] <2> db_getCLIENT: db_CLIENTreceive: no entity was found 227
00:31:47.196 [7780.8160] <2> vnet_pbxConnect: ../../libvlibs/vnet_pbx.c.666: pbxSetAddrEx/pbxConnectEx return error 10054:Une connexion existante a dû être fermée par l'hôte distant.
01-20-2012 01:34 AM
So did this fail after 3 seconds?
We need to see it all, not an extract, otherwise we cannot see the full picture.
Paste the who job activity details into a text file and attach it - much easier on here if it is an attachment rather than all the text pasted into the field
Thanks
01-20-2012 02:16 AM
Mark,
here is the file you ask, but this file doesn't match with bpbkar, bptm and bpbrm logs posted yesterday.
these logs have been deleted by customer.
Hope you will find something.
Thank you
01-20-2012 02:51 AM
Thanks - that is much better
You have already said that your client read timeouts have been set to good values but something happens here that is in the range of 5 to 6 minutes - which does indicate a timeout.
The backup runs fine until it starts hitting areas that cannot be backed up and it starts skipping files.
Just over 5 minutes into skipping files it crashed - as if the lack of data being written has cause the failure - which does make sense apart from your timeouts apparently being set high.
Set the timeouts on the client host properties timeouts as well - but it is normally good enough to have them set on the Media Server.
Just in case it is an O/S thing then please try the following:
On the Media Server set the following registry key:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
Create a new DWORD named TcpTimedWaitDelay with a Decimal Value of 30
Then from a command line (run as administrator) netsh int ipv4 set dynamicport tcp start=10000 num=50000
Reboot the Media Server for the registry entry to take effect
This will stop the ports being locked out in a Timed_Wait state which could cause an issue and also increase the number of available ports.
On the client please run the following and let me know the values returned
# cat /proc/sys/net/ipv4/tcp_keepalive_time
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
One other alternative is to take a look at the files (or folders) being skipped and exclude them for this client to prevent the delay in the first place
Hope this helps
01-20-2012 05:14 AM
Mark,
thank you very much for your help!
At this point, we only have performed a test by excluding /var/spool/sockets and the backup is OK.
We will wait the issue of the full backup of this weekend before modifiyng anything else.
I would let you informed next week.
Best regards
01-23-2012 12:57 AM
All is working fine now by excluding /var/spool/sockets.
Thank you very much.