bpbkar FATAL exit status = 40: network connection ... - VOX

Certified

Hi All,

All the servers in the DC are failing with EC 44:

bpbkar

20:16:34.323 [9858] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE

20:16:34.323 [9858] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE signals

20:16:34.323 [9858] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 40: network connection broken

20:16:34.323 [9858] <4> bpbkar Exit: INF - EXIT STATUS 40: network connection broken

20:16:34.359 [9858] <2> bpbkar Exit: INF - Close of stdout complete

20:16:34.359 [9858] <4> bpbkar Exit: INF - setenv FINISHED=0

bptm

20:16:29.192 [8665] <2> read_brm_msg: STOP BACKUP lcllxotd01_1331420558

20:16:29.193 [8665] <2> send_brm_msg: EXIT lcllxotd01_1331420558 150

20:16:29.193 [8665] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 17078 (tmmpx.c:3435)

20:16:29.193 [8665] <2> wait_for_sigcld: waiting for child to exit, timeout is 3000

20:16:29.193 [8665] <2> Media_siginfo_print: 10: delay 664 signo SIGUSR1:10 code 0 pid 8659

20:16:29.193 [8665] <2> Media_siginfo_print: 11: delay 594 signo SIGUSR1:10 code 0 pid 8659

20:16:29.193 [8665] <2> Media_siginfo_print: 12: delay 587 signo SIGUSR1:10 code 0 pid 8659

20:16:29.193 [8665] <2> Media_siginfo_print: 13: delay 6 signo SIGUSR1:10 code 0 pid 8659

20:16:29.193 [8665] <2> Media_siginfo_print: 14: delay 1 signo SIGCHLD:17 code 1 pid 17078

20:16:29.193 [8665] <2> child_wait: SIGCHLD: exit=0, signo=0 core=no, pid=17078 (tmcommon.c:5639)

strace

write(2, "INF - Estimate:-1 -1\n", 21INF - Estimate:-1 -1
) = 21
read(0,
"\n", 14)                       = 1
write(2, "ERR - CONTINUE BACKUP message no"..., 43ERR - CONTINUE BACKUP message not received
) = 43
chdir("/")                              = 0
rt_sigaction(SIGPIPE, {SIG_IGN}, {0x8072d9c, [PIPE], SA_RESTORER|SA_RESTART, 0xb6957ee8}, 8) = 0
write(2, "INF - EXIT STATUS 66: client bac"..., 82INF - EXIT STATUS 66: client backup failed to receive the CONTINUE BACKUP message
) = 82
close(1)                                = 0
chdir("/sys7/netbackup/bin")            = 0
access("/usr/openv/netbackup/bin/bpend_notify.unknown.unknown", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/openv/netbackup/bin/bpend_notify.unknown", F_OK) = -1 ENOENT (No such file or directory)
access("/usr/openv/netbackup/bin/bpend_notify", F_OK) = -1 ENOENT (No such file or directory)
exit_group(66)                          = ?
Process 14974 detached

Unable to determine the cause of the failure yet. Please help me resolve the issue.

client: linux 2.4 and windows 2003

media server linux 2.4

version 6.5.6

Partner Accredited Certified

Is time synchronized between these hosts? If so, the backup was aborted from the server side, and we cannot determine the cause of the abort from these logs. We need the logs leading up to the "STOP BACKUP" message. By the way, is this a new setup or an existing environment? Did it occur just once? Is your network OK? I suspect the backup sessions were caught up in some network trouble; failures on multiple hosts and a sudden disconnect suggest that to me.
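
To make the ordering argument concrete, here is a minimal sketch that parses two of the log lines quoted in the original post and measures the gap between them. The regular expression is an assumption based only on the line layout visible in this thread (time, pid in brackets, severity in angle brackets, message), and the result is only meaningful if the clocks on the two hosts agree.

```python
import re
from datetime import datetime

# Layout assumed from the lines quoted in this thread:
# HH:MM:SS.mmm [pid] <severity> message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (.*)$")

def parse_line(line):
    """Return (time, pid, severity, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
    return stamp, int(m.group(2)), int(m.group(3)), m.group(4)

# Both lines are copied from the bptm and bpbkar excerpts in the first post.
bptm_stop = parse_line(
    "20:16:29.192 [8665] <2> read_brm_msg: STOP BACKUP lcllxotd01_1331420558")
bpbkar_pipe = parse_line(
    "20:16:34.323 [9858] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE")

gap_seconds = (bpbkar_pipe[0] - bptm_stop[0]).total_seconds()
print(f"bpbkar SIGPIPE logged {gap_seconds:.3f} s after bptm STOP BACKUP")
```

If the host clocks are in sync, bptm receiving STOP BACKUP about five seconds before bpbkar is killed by SIGPIPE fits the reading that the server side ended the job first and the client then lost its pipe.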

Partner VIP Accredited Certified

2.4 is the kernel version. What is the actual OS and version of the Linux server and clients?

What is in bpbrm log on media server for the same period?

Certified

Sadly, the bpbrm log directory had not been created; I have created it now. There was some network activity that led to this, but it is not clear what changed during that period. I joined the company after the change happened, and no one knows what change was made to the network. I am still getting those details.

uname -a output:

2.4.21-40.ELsmp #1 SMP Thu Feb 2 22:22:39 EST 2006 i686 i686 i386 GNU/Linux

media server shows:

Thu Mar 15 01:34:24 EDT 2012

client shows:

Thu Mar 15 01:34:52 EDT 2012
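
As a rough check on the two timestamps above, the following sketch computes the apparent offset between them. It assumes the two `date` commands were run at nearly the same moment, which the post does not state, so the figure includes any delay between running them. The helper name `parse_date_output` is made up for this example.

```python
from datetime import datetime

def parse_date_output(text):
    """Parse `date` output such as 'Thu Mar 15 01:34:24 EDT 2012'.

    The timezone token is dropped rather than parsed; this assumes both
    hosts report the same zone, as they do in the post above.
    """
    _dow, mon, day, clock, _tz, year = text.split()
    return datetime.strptime(f"{mon} {day} {clock} {year}", "%b %d %H:%M:%S %Y")

media = parse_date_output("Thu Mar 15 01:34:24 EDT 2012")
client = parse_date_output("Thu Mar 15 01:34:52 EDT 2012")

skew = (client - media).total_seconds()
print(f"apparent offset, client minus media server: {skew:.0f} s")
```

An offset of this size matters when lining up bptm and bpbkar timestamps that are only a few seconds apart, so it is worth confirming with NTP or a single command that queries both hosts together.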

Accredited

Is it failing immediately or after some time?

Kindly post the complete log from the detailed status in Activity Monitor.

Certified

Mar 12, 2012 1:29:37 AM - requesting resource lcllxotms1-hcart-robot-tld-26
Mar 12, 2012 1:29:37 AM - requesting resource nbprodcl1.NBU_CLIENT.MAXJOBS.lcllxotd03
Mar 12, 2012 1:29:37 AM - requesting resource nbprodcl1.NBU_POLICY.MAXJOBS.OTT_PR_AIX_FS_T2
Mar 12, 2012 1:29:39 AM - granted resource nbprodcl1.NBU_CLIENT.MAXJOBS.lcllxotd03
Mar 12, 2012 1:29:39 AM - granted resource nbprodcl1.NBU_POLICY.MAXJOBS.OTT_PR_AIX_FS_T2
Mar 12, 2012 1:29:39 AM - granted resource OT1033
Mar 12, 2012 1:29:39 AM - granted resource lcllxotms1_tld26_d1
Mar 12, 2012 1:29:39 AM - granted resource lcllxotms1-hcart-robot-tld-26
Mar 12, 2012 1:29:40 AM - estimated 8828853 kbytes needed
Mar 12, 2012 1:29:41 AM - started process bpbrm (pid=23394)
Mar 12, 2012 1:29:57 AM - connecting
Mar 12, 2012 1:29:58 AM - connected; connect time: 0:00:00
Mar 12, 2012 1:30:00 AM - mounting OT1033
Mar 12, 2012 1:31:24 AM - mounted OT1033; mount time: 0:01:24
Mar 12, 2012 1:31:24 AM - positioning OT1033 to file 25
Mar 12, 2012 1:32:06 AM - positioned OT1033; position time: 0:00:42
Mar 12, 2012 1:32:06 AM - begin writing
Mar 12, 2012 2:01:29 AM - Error bpbrm (pid=23407) db_FLISTsend failed: network write failed (44)
Mar 12, 2012 2:02:22 AM - end writing; write time: 0:30:16
network write failed (44)
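
For anyone reading the job details above, here is a small sketch that works out how long the job had been writing before the first error appeared. It uses three lines copied from the log above; the timestamp format string is an assumption based on how those lines look, and parsing the month name and AM/PM marker depends on an English locale.

```python
from datetime import datetime

def parse_event(line):
    """Split 'Mar 12, 2012 1:29:37 AM - text' into (datetime, text)."""
    stamp, _, text = line.partition(" - ")
    return datetime.strptime(stamp, "%b %d, %Y %I:%M:%S %p"), text

# Lines copied from the job details posted above.
lines = [
    "Mar 12, 2012 1:32:06 AM - begin writing",
    "Mar 12, 2012 2:01:29 AM - Error bpbrm (pid=23407) db_FLISTsend failed: network write failed (44)",
    "Mar 12, 2012 2:02:22 AM - end writing; write time: 0:30:16",
]
events = [parse_event(line) for line in lines]

begin = next(t for t, text in events if text == "begin writing")
error = next(t for t, text in events if text.startswith("Error"))
end = next(t for t, text in events if text.startswith("end writing"))

print("time from begin writing to first error:", error - begin)
print("total write time:", end - begin)
```

The total matches the 0:30:16 write time reported by the job itself. The error arrives roughly half an hour into writing, so the job did move data before the failure.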

Partner Accredited

Hi,

Does the backup start and move some data to tape/disk? Or does it just start and then die without moving any data?

I would start by pinging and tracing from the backup servers to the clients and vice versa. If that works fine, I would then check that connections to the right ports are working:

bptestbpcd -client backup_client -debug

I think the problem lies in the network.
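
Alongside ping and bptestbpcd, a plain TCP connect test can show whether the ports are reachable at all. The sketch below is not a replacement for bptestbpcd. The port numbers are the usual NetBackup defaults (PBX 1556, vnetd 13724, bpcd 13782) and are an assumption about this site, and the names `NBU_PORTS`, `can_connect` and `check_client` are made up for this example.

```python
import socket

# Usual NetBackup default ports (assumed; a site may have changed them).
NBU_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

def check_client(host):
    """Try each assumed NetBackup port on host and report which ones open."""
    return {name: can_connect(host, port) for name, port in NBU_PORTS.items()}
```

Calling `check_client("backup_client")` from the media server, and the equivalent from the client towards the media server, returns a dictionary of port name to True/False. A successful connect only proves the port was reachable at that instant; it says nothing about a connection that drops half an hour into a backup.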

Regards

-Henrik

Certified

In fact, if we restart the backup it completes fine. Yes, it's the network, but what is the issue?

Partner Accredited Certified

It seems to be a communication issue between the master and media server. Where does your media server (lcllxotms1) reside? At the same site as the master server (nbprodcl1), or at a separate site connected via WAN? Have you tried increasing the Client Read Timeout on both servers?

Certified

The master is at a different site, and the media server and the clients are at the same site. Yes, I increased the timeout to 7200.

Partner VIP Accredited Certified

The following points to some comms error between bpbrm on the media server and bpdbm on the master:

"Error bpbrm (pid=23407) db_FLISTsend failed: network write failed (44)"

With the master and media server on different sites, the problem is probably an unreliable connection between the sites. ANY drop in the connection between the sites will result in a "broken connection". NetBackup is merely reporting the error. You need to ask your network admins to monitor the connection between the sites and take steps to make it reliable.
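
If a periodic reachability probe is run between the sites, its results can be collapsed into outage windows and compared with the times of the failed jobs. The sketch below shows only that post-processing step. The function name `outage_windows` and the sample data are hypothetical and are not taken from this environment.

```python
def outage_windows(samples):
    """Collapse (timestamp, ok) probe samples into (start, end) failure windows.

    start is the first failed probe in a run; end is the first successful
    probe after it, or None if the run is still failing at the last sample.
    """
    windows = []
    start = None
    for stamp, ok in samples:
        if not ok and start is None:
            start = stamp
        elif ok and start is not None:
            windows.append((start, stamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows

# Hypothetical probe results: (seconds since monitoring began, probe succeeded).
samples = [(0, True), (10, True), (20, False), (30, False), (40, True), (50, False)]
print(outage_windows(samples))
```

A window that overlaps the time of a status 44 failure would support the unreliable-link explanation. No overlap would suggest looking elsewhere, for example at firewall idle timeouts.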

Employee Accredited

Adding on to Marianne's excellent and informative post ... I am pleased to announce ...

"Martin's top tip of the week ..."

"Just because NetBackup reports an error, it DOES NOT mean that NetBackup caused the error."

Regards,

Martin
