Solved: Re: client backup daily error (status code41)

Carawain · ‎02-17-2017

Hello,

I need help because one client got daily backup error (status 41).

Impacted policies are standard policy which backup oracle database folder and system policy with full schedule so when the amount of data to backup is high.

the bpbkar process hung when it arrive to "begin writing state".

sometimes there a trace of backup flow like 32 Kilobytes so less performance but the backup is so long and also canceled.

i tried to redirect the backup policies on several other media server available but the issue is the same so i know that the issue is on client side.

i tried many policy/client configuration found on the foum topic:

-increse client timeout

-activate the multiple datastream

-as the database policies have to be encrypted i tried to disable the encryption option and check the server processor load but nothing monopolize cpu ressource.

-i tried to disable client/side deduplication too but no more emprouvement.

so bpbkar stay indefinitely active on the server until a timeout occurs with status code 41.

Master and media server are in 7.6.1.2 version

client is an AIX 5 3 in 7.5.0.7 version

If anyone have an idea to reveal the issue?

Thanks.

Marianne · ‎02-21-2017

Snippets of logs do not help when full data transfer path needs to be followed.

If you cannot upload the full logs here (as .txt files), then best to log a Support call with Veritas where you can submit/upload complete log files.

Handy NetBackup Links

View solution in original post

Michael_G_Ander · ‎02-17-2017

First of, a file backup of a running Oracle database is not a good idea as the database will be inconsistent in a restore situation.

My suggestion is exclude the database files (usually called .dbf) and use either agent based or rman dump to backup the oracle database. Confer with your DBA about which solution is best.

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

Genericus · ‎02-17-2017

Mark that as a solution! Michael is absolutely correct. In fact, I would suggest that for simplicity, you seperate the database and log files into their own directories for simplicity in excluding them.

excluding /oracle/db and /oracle/logs will be easier than exluding *.dbf, since it will have to scan for them...

NetBackup 9.1.0.1 on Solaris 11, writing to Data Domain 9800 7.7.4.0
duplicating via SLP to LTO5 & LTO8 in SL8500 via ACSLS

Carawain · ‎02-20-2017

Hello

thank for all reply.

I'll check with dba administrator if we can improve this backup but we have a lot of database backup policies implemented as this one with status code 0.

I've found this related technote: https://www.veritas.com/support/en_US/article.000009698

I've launch the below custom bpbkar command to simulate the backup :

./bpbkar -dt 0 -r 888 -nocont /u07/flash_recovery_area/DBUVC170_ORAVAL15/

Here the result of bpcd and bpbkar output with verbosity = 5:

bpcd

18:00:48.932 [1290352] <2> setup_debug_log: switched debug log file for bpcd
18:00:48.939 [1290352] <2> bpcd main: VERBOSE = 5
18:00:48.939 [1290352] <2> logparams: /usr/openv/netbackup/bin/bpcd -terminate 
18:00:48.939 [823324] <16> daemon_select_and_accept: got terminate request: 0
18:00:48.939 [823324] <2> bpcd main: got terminate request
18:00:48.973 [1290352] <2> init_resilient_cache: [vnet_nbrntd.c:869] Initialize resilient cache. 0 0x0
18:00:48.981 [1290352] <4> bpcd main: bpcd terminated
18:00:49.021 [823324] <4> bpcd main: EXIT status = 0
18:00:59.512 [1290374] <2> setup_debug_log: switched debug log file for bpcd
18:00:59.512 [1290374] <2> bpcd main: VERBOSE = 5
18:00:59.512 [1290374] <2> logparams: /usr/openv/netbackup/bin/bpcd -standalone 
18:00:59.512 [1290374] <2> SetMaxDataLimit: maximum data size: current=9223372036854775807 max=9223372036854775807
18:00:59.522 [823326] <2> setup_debug_log: switched debug log file for bpcd
18:00:59.573 [823326] <2> vnet_pcache_init_table: [vnet_private.c:235] starting cache size 200 0xc8
18:00:59.582 [823326] <8> file_to_cache_item: [vnet_addrinfo.c:6577] fopen() failed ERRNO=2 FILE=/usr/openv/var/host_cache/1ff/ffffffff+bpcd,1,2,0,1,0+.txt
18:00:59.601 [823326] <2> retry_getaddrinfo: [vnet_addrinfo.c:908] using SVC=bpcd PORT=13782
18:00:59.601 [823326] <2> retry_getaddrinfo: [vnet_addrinfo.c:912] adjusted service name 13782
18:00:59.601 [823326] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1632] found via getaddrinfo NAME=NULL SVC=bpcd
18:00:59.604 [823326] <2> daemon_startup_listeners: 4 listening for legacy service bpcd
18:00:59.615 [823326] <2> daemon_startup_listeners: listening for vnetd service bpcd
18:00:59.822 [823326] <2> vnet_registerPBXServer: ../../libvlibs/vnet_pbx.c.132: pbxRegisterEx successful at XXX.X.112.141:1556/bpcd, returns with 0 alt_addrs
18:00:59.822 [823326] <2> daemon_resource_checker: last_brk = 0, new_brk = 110073440
18:00:59.822 [823326] <2> daemon_resource_checker: last max_fd = 0, new_fd = 10
18:00:59.823 [823326] <16> resync_host_cache: bpclntcmd fork pid = 1822740
18:00:59.842 [823326] <2> daemon_check_for_zombies: waited for 1 child processes including: 1822740

bpbkar

18:01:39.303 [1822748] <2> logparams: ./bpbkar -dt 0 -r 888 -nocont /u07/flash_recovery_area/DBUVC170_ORAVAL15/ 
18:01:39.303 [1822748] <2> bpbkar main: hole_granularity = 256.
...
18:01:39.352 [1822748] <2> bpbkar SelectFile: INF - cwd = NULL
18:01:39.352 [1822748] <2> bpbkar SelectFile: INF - path = /u07/flash_recovery_area/DBUVC170_ORAVAL15
18:01:39.352 [1822748] <2> bpbkar resolve_path: INF - Actual mount point of /u07/flash_recovery_area is /u07/flash_recovery_area
18:01:39.352 [1822748] <2> bpbkar SelectFile: INF - Resolved_path = /u07/flash_recovery_area/DBUVC170_ORAVAL15
18:01:39.353 [1822748] <2> bpbkar resolve_path: INF - Actual mount point of /u07/flash_recovery_area/DBUVC170_ORAVAL15 is /u07/flash_recovery_area/DBUVC170_ORAVAL15
18:01:39.361 [1822748] <2> bpbkar SelectFile: INF - cwd = /u07/flash_recovery_area/DBUVC170_ORAVAL15
18:01:39.361 [1822748] <2> bpbkar SelectFile: INF - path = autobackup.tar
18:01:39.362 [1822748] <4> check_file_sparseness: Device changing from 0 to 5
...
19:25:32.086 [1822748] <2> bpbkar resolve_path: INF - Actual mount point of / is /
19:25:32.086 [1822748] <4> bpbkar expand_wildcards: end backup for filelist /u07/flash_recovery_area/DBUVC170_ORAVAL15/
19:25:32.087 [1822748] <4> bpbkar main: INF - Client completed sending data for backup

19:25:32.087 [1822748] <2> bpbkar main: INF - Total Size:710314321 + 333GB
19:25:32.087 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/hardlink_info for prefix = hardlinks_ and older than 30 days
19:25:32.087 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/hardlink_info/root for prefix = hardlinks_ and older than 30 days
19:25:32.096 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/logs/user_ops for prefix = jbp and older than 3 days
19:25:32.096 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/logs/user_ops/nbjlogs for prefix = jbp and older than 3 days
19:25:32.096 [1822748] <4> bpbkar Exit: INF - bpbkar exit normal
19:25:32.097 [1822748] <4> bpbkar Exit: INF - EXIT STATUS 0: the requested operation was successfully completed

So when we simulate bpbkar on local client no issue the backup get successfully in 1,5 hours but when we launch the backup over the network something is stuck on bpbkar side.

I need to understand if there's network error or system/client issue.

Carawain · ‎02-20-2017

Here the bpcd over the network

logs with job over the network:20:08:00.687 [823326] <16> resync_host_cache: bpclntcmd fork pid = 1814602
20:08:00.706 [823326] <2> daemon_check_for_zombies: waited for 1 child processes including: 1814602
20:08:19.518 [823326] <2> vnet_pbxAcceptSocket: Accepted sock[10] from 44.54.100.10:60096
...
20:08:19.577 [1818652] <8> file_to_cache_item: [vnet_addrinfo.c:6577] fopen() failed ERRNO=2 
...
FILE=/usr/openv/var/host_cache/18d/73d56d8d+0,1,41,0,1,0+media001.txt
20:08:34.076 [1900704] <2> bpcd exit_bpcd: exit status 0  ----------->exiting

bpbkar logs with job over the network:

20:08:35.656 [1900706] <4> bpbkar main: real locales <C C C C C C>
20:08:35.656 [1900706] <4> bpbkar main: standardized locales - mnt_lc_messages <C> mnt_lc_ctype <C> mnt_lc_time <C> mnt_lc_collate <C> mnt_lc_numeric <C>
20:08:35.757 [1900706] <2> init_resilient_cache: [vnet_nbrntd.c:869] Initialize resilient cache. 0 0x0
20:08:35.757 [1900706] <2> vnet_check_resilient_socket: [vnet_nbrntd.c:808] the socket is  0 0x0
20:08:35.757 [1900706] <2> vnet_check_resilient_socket: [vnet_nbrntd.c:808] the socket is  1 0x1
20:08:35.757 [1900706] <8> vnet_get_peer_sock_names: [vnet_nbrntd.c:263] getsockname() failed 57 0x39
20:08:35.796 [1900706] <2> logparams: bpbkar -r 1209600 -ru root -dt 0 -to 0 -bpstart_time 0 -clnt oraval15 -class ux-v-db-dbuvg931-ab-std -sched full -st INCR -bpstart_to 300 -bpend_to 300 -read_to 1800 -ckpt_time 900 -blks_per_buffer 127 -stream_count 1 -stream_number 1 -jobgrpid 1822942 -use_otm -use_ofb -b oraval15_1487358495 -kl 28 -fso -F encipher 
...
20:08:36.046 [1900706] <2> bpbkar main: receive filelist:</u07/flash_recovery_area/DBUVG931_ORAVAL15> 
20:08:36.046 [1900706] <2> bpbkar add_to_filelist: starting sizeof(filelistrec) <128>
20:08:36.046 [1900706] <2> bpbkar main: receive filelist:<CONTINUE> 
20:08:36.046 [1900706] <4> bpbkar expand_wildcards: start to backup filelist /u07/flash_recovery_area/DBUVG931_ORAVAL15 ,nb_fscp_enabled is 0
20:08:36.046 [1900706] <4> bpbkar expand_wildcards: INF - Processing /u07/flash_recovery_area/DBUVG931_ORAVAL15
20:08:36.066 [1900706] <2> bpbkar cd: remap name would return=/u07/flash_recovery_area/DBUVG931_ORAVAL15
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd4 on /
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd2 on /usr
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd9var on /var
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd3 on /tmp
20:08:36.067 [1900706] <2> mount resolve_mount_path: INF - Actual mount point of /tmp is /tmp
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd1 on /local/home
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (procfs) /proc on /proc
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd10opt on /opt
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/localoptlv on /local/opt
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u01_lv on /u01
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u02_lv on /u02
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u03_lv on /u03
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u04_lv on /u04
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u07_lv on /u07
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/redo1_lv on /u05
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/redo2_lv on /u06
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing vmt_gfstype (18) as (NFS) /staging_area_DMUTRv3 on /tmp/staging_area_DMUTRv3
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (autofs) /etc/auto_home on /home
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /local/opt/ctmagent on /home/ctmagent
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /local/home/oracle on /home/oracle
20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /local/home/oradba on /home/oradba
20:08:36.067 [1900706] <2> bpbkar SelectFile: INF - cwd = NULL
20:08:36.067 [1900706] <2> bpbkar SelectFile: INF - path = /u07/flash_recovery_area/DBUVG931_ORAVAL15
...
20:08:36.116 [1900706] <2> bpbkar SelectFile: INF - cwd = /u07/flash_recovery_area/DBUVG931_ORAVAL15/autobackup/2016_07_26
20:08:36.116 [1900706] <2> bpbkar SelectFile: INF - path = o1_mf_s_918226028_csgql5mp_.bkp
22:21:11.062 [1900706] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE
22:21:11.062 [1900706] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE signals
22:21:11.062 [1900706] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 40: network connection broken

Marianne · ‎02-20-2017

You need corresponding bpbrm and bptm logs (level 3) on the media server to see what happened w.r.t. file transfer to media server. (Please save/copy logs to .txt files and upload as attachments.)
Have you tried to look into KeepAlive settings on master, media and Oracle client?

What is the size of o1_mf_s_918226028_csgql5mp_.bkp?
What transfer rate do you see when you ftp the file to the media server?

If I may ask - why are you using normal file-level backup to backup Oracle Flash Recovery Area?
It seems that Oracle's recommendation is that rman should be used:

https://docs.oracle.com/cd/B19306_01/backup.102/b14192/bkscenar002.htm

Handy NetBackup Links

Carawain · ‎02-21-2017

Hello,

On bpbrm I have these logs:

20:09:38.650 [15348.14048] <2> bpbrm read_media_msg: read from media manager: BEGIN_BACKUP oraval15_1487358495 B02594
20:09:38.650 [15348.14048] <2> bpbrm send_parent_msg: BEGIN_BACKUP oraval15_1487358495 B02594
...
22:21:11.473 [15348.14048] <2> bpbrm read_media_msg: read from media manager: EXIT oraval15_1487358495 150
22:21:11.473 [15348.14048] <2> bpbrm process_media_msg: media manager for backup id oraval15_1487358495 exited with status 150: termination requested by administrator
22:21:11.473 [15348.14048] <2> bpbrm send_status_to_parent: EXIT oraval15_1487358495 41 sent to parent process for jobid = 1823243.

On bptm I have the below logs:

20:08:26.249 [14760.11128] <2> read_brm_msg: START BACKUP -b oraval15_1487358495 -bt 1487358495 -c oraval15 -cl ux-v-db-dbuvg931-ab-m01-std -st 1 -rl 1 -rp 1209600 -sl full -ct 0 -hostname oraval15 -rclnt oraval15 -rclnthostname oraval15 -bclnt oraval15 -bclnthostname oraval15 -F encipher -ru root -no_callback -connect_options 0x01010100 -jobid 1823243 -jobgrpid 1822942

20:08:26.390 [14760.11128] <4> report_client: VBRC 2 14760 1 oraval15_1487358495 0 ux-v-db-dbuvg931-ab-std 1 full 0 1 1

20:08:26.873 [14760.11128] <2> mpx_start_child: backup child for oraval15_1487358495 is pid 15148.1828, CINDEX = 0

20:08:26.873 [14760.11128] <2> send_brm_msg: NCB_DATASOCKET oraval15_1487358495 /tmp/vnet-14608487358506886816000000003-d2Ifad;ae3096c6a558f9431e76204946afbe1f;11;1800

20:08:26.920 [15148.1828] <2> bptm: INITIATING (VERBOSE = 0): -pid 15348 -den 20 -rt 1 -rn 2 -cj 4 -mpx 8 -reqid -1487143804 -jm -brm -p CRYPT -stunit STUmedia007_P_LTO4 -eari 0 -maxfrag 1048575 -masterversion 760000 -mediasvr media007 -bpbrm_shm_id Global\NetBackup_BPBRM_SHM_Path_1981033517_15348_14048 -blks_per_buffer 128 -b oraval15_1487358495 -cl ux-v-db-dbuvg931-ab-std -c oraval15 -hostname oraval15 -bclnt oraval15 -bclnthostname oraval15 -ru root -rclnt oraval15 -rclnthostname oraval15 -sl full -mmfill 14760 0 65536 12 10 1981033705 0 1487358495 1 ux-v-db-dbuvg931-ab-std oraval15 oraval15_1487358495 

20:09:38.853 [14760.11128] <4> write_backup: begin writing backup id oraval15_1487358495, copy 1, fragment 1, to media id B02594 on drive LTO4_2_2_1_2 (index 13)

20:10:28.664 [14760.11128] <2> io_write_back_header: drive index 13, oraval15_1487358495, file num = 1, mpx_headers = 1, copy 1

22:20:59.648 [14760.11128] <2> read_brm_msg: STOP BACKUP oraval15_1487358495

if i'm not wrong it appears there no more data transfering on media server just trace of:

file num = 1

and then the stop of backup.

The file named o1_mf_s_918226028_csgql5mp_.bkp has data size to 8 megabytes.

The scp transfert from client to media server is fast and complete.

We use normal file-level backup because if dba admin have not sufficient retention on airman backup available we can deliver them the db data files.

I will have a look on KeepAlive settings

Thanks to all

Marianne · ‎02-21-2017

Snippets of logs do not help when full data transfer path needs to be followed.

If you cannot upload the full logs here (as .txt files), then best to log a Support call with Veritas where you can submit/upload complete log files.

Handy NetBackup Links

Carawain · ‎02-21-2017

master server tcp_keep parameter in second

[root@ ~]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@ ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@ ~]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@ ~]#

Client tcp_keep parameter in seconds
{oraval15}/u07 --> #no -a | grep tcp_keep
              tcp_keepcnt = 8
             tcp_keepidle = 14400
             tcp_keepinit = 150
            tcp_keepintvl = 150

windows media server tcp_keep parameter in millisecondKeepAliveTime REG_DWORD 0x000493e0 (300000)

300000 ms = 5 minutes

Carawain · ‎02-21-2017

Hello Marianne,

I can't post complete logs for security reason.

I'll open a support case.

Thank you all for replies.

Marianne · ‎02-21-2017

So, KeepAliveTime on client is 5 minutes and 2 hours on master/media server?

Best if these settings are the same.

Is it not possible to edit the copied .txt files and change all hostnames and IPs to something generic? e.g.
replace master hostname with MASTER and master IP with 10.10.10.1
replace media server hostname (if not master) with MEDIA1 and IP with 10.10.10.2
replace client hostname with CLIENT1 and IP with 10.10.10.3

Other than those references, there should be nothing else that cannot be posted here.

Handy NetBackup Links