Solved: I am getting same error 2826 while restoring

RajkumarBaniwal · ‎04-05-2012

Hi

Reference: https://www-secure.symantec.com/connect/forums/mailbox-restore-fails-invalid-error-number2826

I am getting same error 2826 while restoring on Linux client .It is dispalying the error

4/5/2012 12:24:37 PM - Error bptm(pid=13460) cannot write data to socket, Broken pipe
4/5/2012 12:24:37 PM - Error bptm(pid=13460) The following files/folders were not restored:
4/5/2012 12:24:37 PM - Error bptm(pid=13460) UTF - /dcs/arch01/ARC/ERI/Mar_2012/MSEMR03201203130957551503.dat.gz
4/5/2012 12:24:38 PM - Info bpbrm(pid=13451) media manager for backup id gnnmed25_1333259894 exited with status 24: socket write failed
4/5/2012 12:24:38 PM - Info bpbrm(pid=13451) terminating bpbrm child 13463 jobid=2014589
4/5/2012 12:24:38 PM - restored image gnnmed25_1333259894 - (socket write failed(24)); restore time 01:18:28
4/5/2012 12:24:39 PM - end Restore; elapsed time: 01:19:12
invalid error number(2826).

please help on the same

Rajkumar Baniwal

backup support

***** Moved to new discussion by MvdB ******

Regards,
Rajkumar Baniwal

RajkumarBaniwal · ‎04-25-2012

It is resolved after changing the same client as SAN media server and restoring through the same media server

Regards,
Rajkumar Baniwal

View solution in original post

Marianne · ‎04-05-2012

You have probably noticed in the discussion where you originally posted that we have asked for additional info but never received anything.
The 'real' error in your instance seems to be status 24.
Please share all relevant info for your environment:

OS on Master server and media server
OS on Client (Which Linux 'flavour' and version)
NBU version (and patch level) on master, media server and client

Any firewall between master and media server or between media server and client?
We can see that the restore failed after 1 hour. We sometimes see that firewall timeouts will terminate connection. Was any data actually written to client?

Please share all text in Activity Monitor details as well as the info in 'View Status' tab of the restore GUI.

Also post following logs as attachments (if they do not exist, please create them and retry the restore):

On master: bprd (restart NBU after creating this log)
On media server: bpbrm and bptm
On client : bpcd and tar

Please rename logs to match process name and post them as attachments.

Handy NetBackup Links

RajkumarBaniwal · ‎04-06-2012

PFB

OS on Master server and media server           SUN Solaris OS 5.10
OS on Client (Which Linux 'flavour' and version)    Linux 2.6.18-164.el5
NBU version (and patch level) on master, media server and client   7.1.0.3

Any firewall between master and media server or between media server and client? No
Was any data actually written to client? No

Alttached all logs fo reference.

Thanks and regards

Rajkumar Baniwal

Regards,
Rajkumar Baniwal

mph999 · ‎04-06-2012

I'm a bit confused,

The restore starts at around 11:06 (from job details we can match the PID)

From media server bpbrm log

11:06:10.726 [13451] <2> logparams: -restore -mpx_restore -S gnnbkp01 -mt 2 -to 32767 -mud 180 -mediasvr gnnbkp01 -masterversion 710000

From client tar log the first line is 11.06

11:06:13 (2014589.001) INF - TAR STARTED 15158

But bpcd log from client, the first line is

17:58:07.449 [13799] <2> setup_debug_log: switched debug log file for bpcd

... unless I have make a mistake, the bpcd log is not covering the time of the restore.

Anyhow looking at what we have ...

bptm seems to start reading the tape, suspect it could be an mpx 'backup' as it reads for some time ...

11:08:12.148 [13456] <2> send_brm_msg: CURRENT POSITION UN0125 3

11:08:12.148 [13456] <2> process_brm_msg: no pending message from bpbrm

11:08:13.148 [13456] <2> read_brm_msg: CONTINUE RESTORE

11:08:13.148 [13456] <2> send_brm_msg: MEDIA READY

11:08:13.148 [13456] <2> mpx_read_data: begin reading data from media id UN0125, file num 3

11:08:13.148 [13456] <2> mpx_read_data: Total Kbytes transferred 0

11:08:13.153 [13456] <2> read_position: read position on drive index 4, is LBA 429134

11:28:13.097 [13456] <4> report_throughput: VBRT 1 13456 0 1 HP.ULTRIUM4-SCSI.002 UN0125 0 1 0 141796352 141796352 (mpxrestore.c.1715)

11:28:13.097 [13456] <2> mpx_read_data: Total Kbytes transferred 141796352

11:48:13.104 [13456] <4> report_throughput: VBRT 1 13456 0 1 HP.ULTRIUM4-SCSI.002 UN0125 0 1 0 136298496 136298496 (mpxrestore.c.1715)

11:48:13.104 [13456] <2> mpx_read_data: Total Kbytes transferred 278094848

12:08:13.130 [13456] <4> report_throughput: VBRT 1 13456 0 1 HP.ULTRIUM4-SCSI.002 UN0125 0 1 0 129108992 129108992 (mpxrestore.c.1715)

12:08:13.130 [13456] <2> mpx_read_data: Total Kbytes transferred 407203840

12:16:28.434 [13460] <2> filter_image: [13456] Min_records before frag switch is 90000, locate available = 0

12:24:37.873 [13460] <2> get_next_file: [13456] received string: (EXIT STATUS 0)

12:24:37.921 [13456] <2> Media_dispatch_signal: calling child_wait (mpxrestore.c:2881) delay 0 seconds

12:24:37.921 [13456] <2> child_wait: SIGCHLD: exit=24, signo=0 core=no, pid=13460

We see the last line is child_wait: SIGCHLD: exit=24

There is a child bptm process involved PID [13460], this is what returns the status 24.

11:06:13.219 [13460] <2> read_config_file: using 262144 value from /usr/openv/netbackup/NET_BUFFER_SZ

11:06:13.219 [13460] <2> io_set_sendbuf: setting send network buffer to 262144 bytes

11:06:13.220 [13460] <2> vnet_pbxConnect: pbxConnectEx Succeeded

11:06:13.227 [13460] <2> do_pbx_service: ../../libvlibs/vnet_connect.c.1776: 0: via PBX: VNETD CONNECT FROM 172.30.27.1.54741 TO 172.30.27.1.1556 fd = 8

11:06:13.227 [13460] <2> vnet_vnetd_daemon_socket: ../../libvlibs/vnet_vnetd.c.383: 0: VN_REQUEST_DAEMON_SOCKET: 2 0x00000002

11:06:13.227 [13460] <2> vnet_vnetd_daemon_socket: ../../libvlibs/vnet_vnetd.c.397: 0: ipc_string: /tmp/vnet-13376333604170772171000000188-kmaaiA

11:06:13.295 [13460] <2> connect_bprd_socket: [13456] data socket to bprd is connected port = 0, fd = 8, DATAPORT=0, duping to stdin

12:16:28.434 [13460] <2> filter_image: [13456] Min_records before frag switch is 90000, locate available = 0

12:24:37.344 [13460] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6669: 0: fopen() failed: 2 0x00000002

12:24:37.344 [13460] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.6670: 0: fopen() failed: /usr/openv/var/host_cache/1f7/260a25f7+bpjobd,1,4,2,2,0+gnnbkp01.txt

12:24:37.349 [13460] <2> vnet_pbxConnect: pbxConnectEx Succeeded

12:24:37.349 [13460] <2> job_connect: SO_KEEPALIVE set on socket 2 for client gnnbkp01

12:24:37.349 [13460] <2> logconnections: BPJOBD CONNECT FROM 172.30.27.1.61865 TO 172.30.27.1.1556 fd = 2

12:24:37.349 [13460] <2> job_authenticate_connection: ignoring VxSS authentication check for now...

12:24:37.353 [13460] <2> job_connect: Connected to the host gnnbkp01 contype 10 jobid <2014589> socket <2>

12:24:37.353 [13460] <2> job_connect: Connected on port 61865

12:24:37.421 [13460] <2> job_monitoring_exex: ACK disconnect

12:24:37.422 [13460] <2> job_disconnect: Disconnected

12:24:37.422 [13460] <16> write_to_out: cannot write data to socket, Broken pipe

Think we need the bpcd log ...

Omar_Villa · ‎04-07-2012

Have you check if the iptables (firewall) service is on? most of the socket errors I see in linux are because iptables are filtering big blocks by default, the rest of your applications can work perfectly but because netbackup needs big blocks is the one that most of the time fails.

try: service iptables status (if is on)

try: service iptables stop (run a backup)

Regards.

RajkumarBaniwal · ‎04-11-2012

Firewall is blocked. but we are getting same error

Regards,
Rajkumar Baniwal

Marianne · ‎04-12-2012

6 April:
Any firewall between master and media server or between media server and client? No
12 April:
Firewall is blocked.

Nothing is going to work with blocked firewall.

Handy NetBackup Links

mph999 · ‎04-12-2012

Gone through the logs again, doesn't really tell us anything to be honest, but since I spent time doing it, I figured I would post it up. Perhaps this will help show one way of going through logs, and a method of finding where the problem is. In this part, there are x4 PIDs inivolved across two logs (bpbrm and bptm) - as these two processes are on the same machine, it's easy to justy look at the time - which every processes shows the status 24 first, it the one having (but not necessarily causing) the issue.

There are two bptm and two bpbrm processes - parent/ child

bptm

Parent PID [13456]

Child PID [13460]

bpbrm

Parent PID [13451]

Child PID [13463]

Looking in the logs to see where the status 24 occurs first (is it bptm of bpbrm)

From bprd

12:24:38.144 [29918] <2> read_bpbrm_msg: (pid 13451) EXIT gnnmed25_1333259894 24

From bpbrm PID 13463 - think we can discount this - the 24 appears in bpbrm PID 13451

11:06:13.218 [13463] <2> bpbrm handle_restore: forking tar

11:06:13.218 [13463] <2> bpbrm handle_restore: restore command = /usr/openv/netbackup/bin/tar tar -x -v -Y -p -P -I 1333592417 -U 0 -E /usr/openv/netbackup/.rename.13463 -j -Q -J clnt_lc_messages=C -J clnt_lc_time=C -J clnt_lc_ctype=C -J clnt_lc_collate=C -J clnt_lc_numeric=C -J restoreid=2014589.001 -J job_total=1 -J client=gnnmed25 -J requesting_client=gnnbkp01 -J browse_client=gnnmed25 -J backup_time=1333259894 -L /tmp/logmedERI -f -

11:06:13.218 [13463] <2> bpbrm handle_restore: received bpcd success message

11:06:13.301 [13463] <2> bpbrm handle_restore: read tar start message from gnnmed25

11:08:13.196 [13463] <2> bpbrm mm_sig: received ready signal from media manager

12:24:37.986 [13463] <2> bpbrm mm_sig: received not ready signal from media manager

12:24:38.037 [13463] <2> bpbrm check_for_terminate: process killed by signal 1

12:24:38.049 [13463] <2> BpConfigHolder::~BpConfigHolder: +++ ENTERING +++ : obj = 1005bed78 (../BpConf.cpp:40)

12:24:38.049 [13463] <2> BpConfigHolder::free: +++ ENTERING +++ : obj = 1005bed78 (../BpConf.cpp:60)

12:24:38.049 [13463] <2> BpConfigHolder::free: --- EXITING --- : obj = 1005bed78 (../BpConf.cpp:60)

12:24:38.049 [13463] <2> BpConfigHolder::~BpConfigHolder: --- EXITING --- : obj = 1005bed78 (../BpConf.cpp:40)

12:24:38.051 [13463] <2> NBIORInterceptor::destroy: Destroying Interceptor: EMMlib_Orb_EMMlib_Orb(../NBIORInterceptor.cpp:60)

12:24:38.053 [13463] <2> InsecureEndpointSelectorFactory::~InsecureEndpointSelectorFactory: +++ ENTERING +++ : obj = 1005d7970 (../Endpoint_Selector.cpp:1864)

12:24:38.053 [13463] <2> InsecureEndpointSelectorFactory::~InsecureEndpointSelectorFactory: --- EXITING --- : obj = 1005d7970 (../Endpoint_Selector.cpp:1864)

12:24:38.053 [13463] <2> EndpointSelectorFactory::~EndpointSelectorFactory: +++ ENTERING +++ : obj = 1005d7870 (../Endpoint_Selector.cpp:1769)

12:24:38.053 [13463] <2> EndpointSelectorFactory::~EndpointSelectorFactory: --- EXITING --- : obj = 1005d7870 (../Endpoint_Selector.cpp:1769)

From bpbrm PID 13451

12:23:28.005 [13451] <2> bpbrm send_parent_msg: KEEP_ALIVE 15

12:23:28.055 [13451] <2> bpbrm read_parent_msg: read from parent ACK_KEEP_ALIVE

12:24:37.986 [13451] <2> bpbrm read_media_msg: read from media manager: MEDIA NOT READY

12:24:37.986 [13451] <2> bpbrm signal_bpbrm_child: sending SIGUSR1 to bpbrm child 13463

* 12:24:38.037 [13451] <2> bpbrm read_media_msg: read from media manager: EXIT gnnmed25_1333259894 24

12:24:38.037 [13451] <2> bpbrm process_media_msg: media manager for backup id gnnmed25_1333259894 exited with status 24: socket write failed

12:24:38.037 [13451] <2> bpbrm kill_bpbrm_child: terminating bpbrm child 13463 jobid=2014589

12:24:38.071 [13451] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm

12:24:38.071 [13451] <2> bpbrm brm_sigcld: bpbrm child 13463 exit_status = 150, signal_status = 0

12:24:38.071 [13451] <2> bpbrm brm_sigcld: child 13463 exited with status 150: termination requested by administrator

12:24:38.071 [13451] <2> bpbrm send_status_to_parent: EXIT gnnmed25_1333259894 24 sent to parent process.

12:24:38.171 [13451] <2> bpbrm read_parent_msg: read from parent TERMINATE

12:24:38.171 [13451] <2> bpbrm tell_mm: sending media manager msg: TERMINATE

From bptm PID 13460

* 12:24:37.422 [13460] <16> write_to_out: cannot write data to socket, Broken pipe

From bptm PID 13456

11:08:12.148 [13456] <4> mpx_issue_begin_msg: begin reading backup id gnnmed25_1333259894 (restore), copy 1, fragment 2, from media id UN0125 on drive HP.ULTRIUM4-SCSI.002 (index 4)