03-14-2012 09:24 PM
Hi All,
All the servers in the data center are failing with status code 44 (network write failed):
bpbkar
20:16:34.323 [9858] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE
20:16:34.323 [9858] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE signals
20:16:34.323 [9858] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 40: network connection broken
20:16:34.323 [9858] <4> bpbkar Exit: INF - EXIT STATUS 40: network connection broken
20:16:34.359 [9858] <2> bpbkar Exit: INF - Close of stdout complete
20:16:34.359 [9858] <4> bpbkar Exit: INF - setenv FINISHED=0
bptm
20:16:29.192 [8665] <2> read_brm_msg: STOP BACKUP lcllxotd01_1331420558
20:16:29.193 [8665] <2> send_brm_msg: EXIT lcllxotd01_1331420558 150
20:16:29.193 [8665] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 17078 (tmmpx.c:3435)
20:16:29.193 [8665] <2> wait_for_sigcld: waiting for child to exit, timeout is 3000
20:16:29.193 [8665] <2> Media_siginfo_print: 10: delay 664 signo SIGUSR1:10 code 0 pid 8659
20:16:29.193 [8665] <2> Media_siginfo_print: 11: delay 594 signo SIGUSR1:10 code 0 pid 8659
20:16:29.193 [8665] <2> Media_siginfo_print: 12: delay 587 signo SIGUSR1:10 code 0 pid 8659
20:16:29.193 [8665] <2> Media_siginfo_print: 13: delay 6 signo SIGUSR1:10 code 0 pid 8659
20:16:29.193 [8665] <2> Media_siginfo_print: 14: delay 1 signo SIGCHLD:17 code 1 pid 17078
20:16:29.193 [8665] <2> child_wait: SIGCHLD: exit=0, signo=0 core=no, pid=17078 (tmcommon.c:5639)
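One way to read the two excerpts together is to merge them by timestamp, which shows which process logged its message first. The following is only a rough sketch: it assumes the legacy log layout seen above (`HH:MM:SS.mmm [pid] <severity> message`), that both excerpts are from the same day, and that the clocks on the two hosts agree. `parse_line` and `merge_logs` are hypothetical helper names, not NetBackup tools.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above: time, [pid], <severity>, message.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (.*)$")

def parse_line(source, line):
    """Return (time, source, pid, severity, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f").time()
    return (stamp, source, int(m.group(2)), int(m.group(3)), m.group(4))

def merge_logs(logs):
    """Merge {source: [lines]} into one list sorted by timestamp."""
    entries = []
    for source, lines in logs.items():
        for line in lines:
            parsed = parse_line(source, line)
            if parsed is not None:
                entries.append(parsed)
    return sorted(entries, key=lambda entry: entry[0])

# One line from each excerpt above, used as sample input.
logs = {
    "bpbkar": ["20:16:34.323 [9858] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE"],
    "bptm": ["20:16:29.192 [8665] <2> read_brm_msg: STOP BACKUP lcllxotd01_1331420558"],
}
for stamp, source, pid, severity, message in merge_logs(logs):
    print(stamp, source, message)
```

On these two sample lines, the bptm STOP BACKUP entry sorts about five seconds ahead of the bpbkar SIGPIPE entry. If the clocks really are in sync and both lines belong to the same job, that would suggest bpbkar's status 40 is a consequence of the job being stopped rather than the first failure.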
strace
write(2, "INF - Estimate:-1 -1\n", 21INF - Estimate:-1 -1
) = 21
read(0,
"\n", 14) = 1
write(2, "ERR - CONTINUE BACKUP message no"..., 43ERR - CONTINUE BACKUP message not received
) = 43
chdir("/") = 0
rt_sigaction(SIGPIPE, {SIG_IGN}, {0x8072d9c, [PIPE], SA_RESTORER|SA_RESTART, 0xb6957ee8}, 8) = 0
write(2, "INF - EXIT STATUS 66: client bac"..., 82INF - EXIT STATUS 66: client backup failed to receive the CONTINUE BACKUP message
) = 82
close(1) = 0
chdir("/sys7/netbackup/bin") = 0
access("/usr/openv/netbackup/bin/bpend_notify.unknown.unknown", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/openv/netbackup/bin/bpend_notify.unknown", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/openv/netbackup/bin/bpend_notify", F_OK) = -1 ENOENT (No such file or directory)
exit_group(66) = ?
Process 14974 detached
I have not been able to determine the cause of the failure yet. Please help me resolve the issue.
client: linux 2.4 and windows 2003
media server linux 2.4
version 6.5.6
03-14-2012 10:08 PM
03-14-2012 10:16 PM
2.4 is the kernel version. What is the actual OS and version of the Linux server and clients?
What is in bpbrm log on media server for the same period?
03-14-2012 10:35 PM
Sadly, the bpbrm log had not been created; I have created it now. There was some network activity that led to this, but exactly what changed during that period is not clear. I joined the company after the change happened, and no one knows what change was made to the network. I am still gathering those details.
uname -a output:
2.4.21-40.ELsmp #1 SMP Thu Feb 2 22:22:39 EST 2006 i686 i686 i386 GNU/Linux
media server shows:
Thu Mar 15 01:34:24 EDT 2012
client shows:
Thu Mar 15 01:34:52 EDT 2012
03-14-2012 11:07 PM
Is it failing immediately or after some time?
Kindly post the complete log from the detailed status in the Activity Monitor.
03-14-2012 11:14 PM
Mar 12, 2012 1:29:37 AM - requesting resource lcllxotms1-hcart-robot-tld-26
Mar 12, 2012 1:29:37 AM - requesting resource nbprodcl1.NBU_CLIENT.MAXJOBS.lcllxotd03
Mar 12, 2012 1:29:37 AM - requesting resource nbprodcl1.NBU_POLICY.MAXJOBS.OTT_PR_AIX_FS_T2
Mar 12, 2012 1:29:39 AM - granted resource nbprodcl1.NBU_CLIENT.MAXJOBS.lcllxotd03
Mar 12, 2012 1:29:39 AM - granted resource nbprodcl1.NBU_POLICY.MAXJOBS.OTT_PR_AIX_FS_T2
Mar 12, 2012 1:29:39 AM - granted resource OT1033
Mar 12, 2012 1:29:39 AM - granted resource lcllxotms1_tld26_d1
Mar 12, 2012 1:29:39 AM - granted resource lcllxotms1-hcart-robot-tld-26
Mar 12, 2012 1:29:40 AM - estimated 8828853 kbytes needed
Mar 12, 2012 1:29:41 AM - started process bpbrm (pid=23394)
Mar 12, 2012 1:29:57 AM - connecting
Mar 12, 2012 1:29:58 AM - connected; connect time: 0:00:00
Mar 12, 2012 1:30:00 AM - mounting OT1033
Mar 12, 2012 1:31:24 AM - mounted OT1033; mount time: 0:01:24
Mar 12, 2012 1:31:24 AM - positioning OT1033 to file 25
Mar 12, 2012 1:32:06 AM - positioned OT1033; position time: 0:00:42
Mar 12, 2012 1:32:06 AM - begin writing
Mar 12, 2012 2:01:29 AM - Error bpbrm (pid=23407) db_FLISTsend failed: network write failed (44)
Mar 12, 2012 2:02:22 AM - end writing; write time: 0:30:16
network write failed (44)
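To answer the "immediately or after some time" question from these details, the gap between "begin writing" and the error can be computed from the timestamps. This is a minimal sketch that assumes the timestamp format shown above and an English locale for the month name and AM/PM marker; `parse_detail` and `elapsed_between` are hypothetical helper names.

```python
from datetime import datetime

# Format assumed from the job details above, e.g. "Mar 12, 2012 1:32:06 AM".
STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"

def parse_detail(line):
    """Split a job-detail line into (datetime, message); None if it has no timestamp."""
    stamp, sep, message = line.partition(" - ")
    if not sep:
        return None
    try:
        return datetime.strptime(stamp.strip(), STAMP_FORMAT), message.strip()
    except ValueError:
        return None

def elapsed_between(lines, start_text, end_text):
    """Time from the first line containing start_text to the next line containing end_text."""
    start = None
    for line in lines:
        parsed = parse_detail(line)
        if parsed is None:
            continue
        when, message = parsed
        if start is None and start_text in message:
            start = when
        elif start is not None and end_text in message:
            return when - start
    return None

# Lines copied from the job details above.
details = [
    "Mar 12, 2012 1:32:06 AM - begin writing",
    "Mar 12, 2012 2:01:29 AM - Error bpbrm (pid=23407) db_FLISTsend failed: network write failed (44)",
    "Mar 12, 2012 2:02:22 AM - end writing; write time: 0:30:16",
    "network write failed (44)",
]
print(elapsed_between(details, "begin writing", "network write failed"))  # 0:29:23
```

For this job the error appears roughly 29 minutes after writing began, so the backup was moving data for a while before the connection failed.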
03-14-2012 11:25 PM
Hi,
Does the backup start and move some data to tape/disk, or does it start and then die without moving any data?
I would start by pinging and tracing from the backup servers to the clients and vice versa. If that works fine, I would then check that connections to the right ports are working:
bptestbpcd -client backup_client -debug
I think the problem lies in the network.
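If bptestbpcd is not convenient to run from every host, a plain TCP connection test gives a rough first answer about port reachability. The sketch below is not a replacement for bptestbpcd. The port numbers are the commonly documented NetBackup defaults and should be treated as assumptions to verify against your own configuration; `port_open` and `check_host` are hypothetical helper names.

```python
import socket

# Commonly documented NetBackup default ports (assumption: your site may differ).
DEFAULT_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

def check_host(host, ports=None, timeout=5.0):
    """Return {service name: reachable?} for each port in the mapping."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

Calling `check_host("backup_client")` from the media server, and the equivalent from the client toward the media server, returns a small dictionary of which ports answered. A successful connect only shows the port is reachable at that moment; it says nothing about drops that happen mid-backup.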
Regards
-Henrik
03-14-2012 11:36 PM
In fact, if we restart the backup it completes fine. Yes, it's the network, but what is the issue?
03-14-2012 11:38 PM
03-14-2012 11:45 PM
The master is on a different site, and the media server and the clients are on the same site. Yes, I increased the timeout to 7200.
03-15-2012 12:46 AM
The following points to some comms error between bpbrm on the media server and bpdbm on the master:
"Error bpbrm (pid=23407) db_FLISTsend failed: network write failed (44)"
With the master and media server on different sites, the problem is probably an unreliable connection between the sites. ANY drop in the connection between the sites will result in a "broken connection". NetBackup is merely reporting the error. You need to ask your network admins to monitor the network connection between the sites and take steps to ensure a reliable connection.
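While waiting on the network team, a small watcher run on the media server can at least record when the link to the master drops and recovers, so the times can be matched against failed jobs. This is only a sketch under assumptions: `watch` is a hypothetical helper, and the probe is left as a callable so that any check (for example, a TCP connection attempt to a port the master listens on) can be plugged in. The sample run uses a simulated probe instead of a real network check.

```python
import time
from datetime import datetime

def watch(probe, checks, interval=0.0, log=print):
    """Call probe() `checks` times and record every up/down transition.

    probe must return True when the link is up and False when it is down.
    """
    transitions = []
    previous = None
    for _ in range(checks):
        state = probe()
        if previous is not None and state != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            direction = "up" if state else "down"
            transitions.append((stamp, direction))
            log(f"{stamp} link went {direction}")
        previous = state
        time.sleep(interval)
    return transitions

# Simulated probe results standing in for real connection attempts.
samples = iter([True, True, False, False, True])
watch(lambda: next(samples), checks=5)
```

In real use the interval would be a few seconds and the output would go to a file. Even a crude log like this gives the network admins concrete timestamps to compare against their own monitoring.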
03-15-2012 01:13 AM
Adding on to Marianne's excellent and informative post ... I am pleased to announce ...
"Martin's top tip of the week ..."
"Just because NetBackup reports an error, DOES NOT mean that NetBackup caused the error."
Regards,
Martin