02-17-2017 05:36 AM
Hello,
I need help because one client got daily backup error (status 41).
Impacted policies are standard policy which backup oracle database folder and system policy with full schedule so when the amount of data to backup is high.
the bpbkar process hung when it arrive to "begin writing state".
sometimes there a trace of backup flow like 32 Kilobytes so less performance but the backup is so long and also canceled.
i tried to redirect the backup policies on several other media server available but the issue is the same so i know that the issue is on client side.
i tried many policy/client configuration found on the foum topic:
-increse client timeout
-activate the multiple datastream
-as the database policies have to be encrypted i tried to disable the encryption option and check the server processor load but nothing monopolize cpu ressource.
-i tried to disable client/side deduplication too but no more emprouvement.
so bpbkar stay indefinitely active on the server until a timeout occurs with status code 41.
Master and media server are in 7.6.1.2 version
client is an AIX 5 3 in 7.5.0.7 version
If anyone have an idea to reveal the issue?
Thanks.
Solved! Go to Solution.
02-21-2017 05:51 AM
Snippets of logs do not help when full data transfer path needs to be followed.
If you cannot upload the full logs here (as .txt files), then best to log a Support call with Veritas where you can submit/upload complete log files.
02-17-2017 05:49 AM
First of, a file backup of a running Oracle database is not a good idea as the database will be inconsistent in a restore situation.
My suggestion is exclude the database files (usually called .dbf) and use either agent based or rman dump to backup the oracle database. Confer with your DBA about which solution is best.
02-17-2017 08:25 AM
Mark that as a solution! Michael is absolutely correct. In fact, I would suggest that for simplicity, you seperate the database and log files into their own directories for simplicity in excluding them.
excluding /oracle/db and /oracle/logs will be easier than exluding *.dbf, since it will have to scan for them...
02-20-2017 03:34 AM - edited 02-20-2017 03:36 AM
Hello
thank for all reply.
I'll check with dba administrator if we can improve this backup but we have a lot of database backup policies implemented as this one with status code 0.
I've found this related technote: https://www.veritas.com/support/en_US/article.000009698
I've launch the below custom bpbkar command to simulate the backup :
./bpbkar -dt 0 -r 888 -nocont /u07/flash_recovery_area/DBUVC170_ORAVAL15/
Here the result of bpcd and bpbkar output with verbosity = 5:
bpcd
18:00:48.932 [1290352] <2> setup_debug_log: switched debug log file for bpcd 18:00:48.939 [1290352] <2> bpcd main: VERBOSE = 5 18:00:48.939 [1290352] <2> logparams: /usr/openv/netbackup/bin/bpcd -terminate 18:00:48.939 [823324] <16> daemon_select_and_accept: got terminate request: 0 18:00:48.939 [823324] <2> bpcd main: got terminate request 18:00:48.973 [1290352] <2> init_resilient_cache: [vnet_nbrntd.c:869] Initialize resilient cache. 0 0x0 18:00:48.981 [1290352] <4> bpcd main: bpcd terminated 18:00:49.021 [823324] <4> bpcd main: EXIT status = 0 18:00:59.512 [1290374] <2> setup_debug_log: switched debug log file for bpcd 18:00:59.512 [1290374] <2> bpcd main: VERBOSE = 5 18:00:59.512 [1290374] <2> logparams: /usr/openv/netbackup/bin/bpcd -standalone 18:00:59.512 [1290374] <2> SetMaxDataLimit: maximum data size: current=9223372036854775807 max=9223372036854775807 18:00:59.522 [823326] <2> setup_debug_log: switched debug log file for bpcd 18:00:59.573 [823326] <2> vnet_pcache_init_table: [vnet_private.c:235] starting cache size 200 0xc8 18:00:59.582 [823326] <8> file_to_cache_item: [vnet_addrinfo.c:6577] fopen() failed ERRNO=2 FILE=/usr/openv/var/host_cache/1ff/ffffffff+bpcd,1,2,0,1,0+.txt 18:00:59.601 [823326] <2> retry_getaddrinfo: [vnet_addrinfo.c:908] using SVC=bpcd PORT=13782 18:00:59.601 [823326] <2> retry_getaddrinfo: [vnet_addrinfo.c:912] adjusted service name 13782 18:00:59.601 [823326] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1632] found via getaddrinfo NAME=NULL SVC=bpcd 18:00:59.604 [823326] <2> daemon_startup_listeners: 4 listening for legacy service bpcd 18:00:59.615 [823326] <2> daemon_startup_listeners: listening for vnetd service bpcd 18:00:59.822 [823326] <2> vnet_registerPBXServer: ../../libvlibs/vnet_pbx.c.132: pbxRegisterEx successful at XXX.X.112.141:1556/bpcd, returns with 0 alt_addrs 18:00:59.822 [823326] <2> daemon_resource_checker: last_brk = 0, new_brk = 110073440 18:00:59.822 [823326] <2> daemon_resource_checker: last max_fd = 0, new_fd = 10 18:00:59.823 [823326] <16> resync_host_cache: bpclntcmd fork pid = 1822740 18:00:59.842 [823326] <2> daemon_check_for_zombies: waited for 1 child processes including: 1822740
bpbkar
18:01:39.303 [1822748] <2> logparams: ./bpbkar -dt 0 -r 888 -nocont /u07/flash_recovery_area/DBUVC170_ORAVAL15/ 18:01:39.303 [1822748] <2> bpbkar main: hole_granularity = 256. ... 18:01:39.352 [1822748] <2> bpbkar SelectFile: INF - cwd = NULL 18:01:39.352 [1822748] <2> bpbkar SelectFile: INF - path = /u07/flash_recovery_area/DBUVC170_ORAVAL15 18:01:39.352 [1822748] <2> bpbkar resolve_path: INF - Actual mount point of /u07/flash_recovery_area is /u07/flash_recovery_area 18:01:39.352 [1822748] <2> bpbkar SelectFile: INF - Resolved_path = /u07/flash_recovery_area/DBUVC170_ORAVAL15 18:01:39.353 [1822748] <2> bpbkar resolve_path: INF - Actual mount point of /u07/flash_recovery_area/DBUVC170_ORAVAL15 is /u07/flash_recovery_area/DBUVC170_ORAVAL15 18:01:39.361 [1822748] <2> bpbkar SelectFile: INF - cwd = /u07/flash_recovery_area/DBUVC170_ORAVAL15 18:01:39.361 [1822748] <2> bpbkar SelectFile: INF - path = autobackup.tar 18:01:39.362 [1822748] <4> check_file_sparseness: Device changing from 0 to 5 ... 19:25:32.086 [1822748] <2> bpbkar resolve_path: INF - Actual mount point of / is / 19:25:32.086 [1822748] <4> bpbkar expand_wildcards: end backup for filelist /u07/flash_recovery_area/DBUVC170_ORAVAL15/ 19:25:32.087 [1822748] <4> bpbkar main: INF - Client completed sending data for backup 19:25:32.087 [1822748] <2> bpbkar main: INF - Total Size:710314321 + 333GB 19:25:32.087 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/hardlink_info for prefix = hardlinks_ and older than 30 days 19:25:32.087 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/hardlink_info/root for prefix = hardlinks_ and older than 30 days 19:25:32.096 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/logs/user_ops for prefix = jbp and older than 3 days 19:25:32.096 [1822748] <2> bpbkar delete_old_files_recur: INF - checking files in directory /usr/openv/netbackup/logs/user_ops/nbjlogs for prefix = jbp and older than 3 days 19:25:32.096 [1822748] <4> bpbkar Exit: INF - bpbkar exit normal 19:25:32.097 [1822748] <4> bpbkar Exit: INF - EXIT STATUS 0: the requested operation was successfully completed
So when we simulate bpbkar on local client no issue the backup get successfully in 1,5 hours but when we launch the backup over the network something is stuck on bpbkar side.
I need to understand if there's network error or system/client issue.
02-20-2017 03:51 AM
Here the bpcd over the network
logs with job over the network:20:08:00.687 [823326] <16> resync_host_cache: bpclntcmd fork pid = 1814602 20:08:00.706 [823326] <2> daemon_check_for_zombies: waited for 1 child processes including: 1814602 20:08:19.518 [823326] <2> vnet_pbxAcceptSocket: Accepted sock[10] from 44.54.100.10:60096 ... 20:08:19.577 [1818652] <8> file_to_cache_item: [vnet_addrinfo.c:6577] fopen() failed ERRNO=2 ... FILE=/usr/openv/var/host_cache/18d/73d56d8d+0,1,41,0,1,0+media001.txt 20:08:34.076 [1900704] <2> bpcd exit_bpcd: exit status 0 ----------->exiting
bpbkar logs with job over the network:
20:08:35.656 [1900706] <4> bpbkar main: real locales <C C C C C C> 20:08:35.656 [1900706] <4> bpbkar main: standardized locales - mnt_lc_messages <C> mnt_lc_ctype <C> mnt_lc_time <C> mnt_lc_collate <C> mnt_lc_numeric <C> 20:08:35.757 [1900706] <2> init_resilient_cache: [vnet_nbrntd.c:869] Initialize resilient cache. 0 0x0 20:08:35.757 [1900706] <2> vnet_check_resilient_socket: [vnet_nbrntd.c:808] the socket is 0 0x0 20:08:35.757 [1900706] <2> vnet_check_resilient_socket: [vnet_nbrntd.c:808] the socket is 1 0x1 20:08:35.757 [1900706] <8> vnet_get_peer_sock_names: [vnet_nbrntd.c:263] getsockname() failed 57 0x39 20:08:35.796 [1900706] <2> logparams: bpbkar -r 1209600 -ru root -dt 0 -to 0 -bpstart_time 0 -clnt oraval15 -class ux-v-db-dbuvg931-ab-std -sched full -st INCR -bpstart_to 300 -bpend_to 300 -read_to 1800 -ckpt_time 900 -blks_per_buffer 127 -stream_count 1 -stream_number 1 -jobgrpid 1822942 -use_otm -use_ofb -b oraval15_1487358495 -kl 28 -fso -F encipher ... 20:08:36.046 [1900706] <2> bpbkar main: receive filelist:</u07/flash_recovery_area/DBUVG931_ORAVAL15> 20:08:36.046 [1900706] <2> bpbkar add_to_filelist: starting sizeof(filelistrec) <128> 20:08:36.046 [1900706] <2> bpbkar main: receive filelist:<CONTINUE> 20:08:36.046 [1900706] <4> bpbkar expand_wildcards: start to backup filelist /u07/flash_recovery_area/DBUVG931_ORAVAL15 ,nb_fscp_enabled is 0 20:08:36.046 [1900706] <4> bpbkar expand_wildcards: INF - Processing /u07/flash_recovery_area/DBUVG931_ORAVAL15 20:08:36.066 [1900706] <2> bpbkar cd: remap name would return=/u07/flash_recovery_area/DBUVG931_ORAVAL15 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd4 on / 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd2 on /usr 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd9var on /var 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd3 on /tmp 20:08:36.067 [1900706] <2> mount resolve_mount_path: INF - Actual mount point of /tmp is /tmp 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd1 on /local/home 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (procfs) /proc on /proc 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/hd10opt on /opt 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/localoptlv on /local/opt 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u01_lv on /u01 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u02_lv on /u02 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u03_lv on /u03 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u04_lv on /u04 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/u07_lv on /u07 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/redo1_lv on /u05 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /dev/redo2_lv on /u06 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing vmt_gfstype (18) as (NFS) /staging_area_DMUTRv3 on /tmp/staging_area_DMUTRv3 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (autofs) /etc/auto_home on /home 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /local/opt/ctmagent on /home/ctmagent 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /local/home/oracle on /home/oracle 20:08:36.067 [1900706] <2> mount build_mount_list: INF - Processing (local) /local/home/oradba on /home/oradba 20:08:36.067 [1900706] <2> bpbkar SelectFile: INF - cwd = NULL 20:08:36.067 [1900706] <2> bpbkar SelectFile: INF - path = /u07/flash_recovery_area/DBUVG931_ORAVAL15 ... 20:08:36.116 [1900706] <2> bpbkar SelectFile: INF - cwd = /u07/flash_recovery_area/DBUVG931_ORAVAL15/autobackup/2016_07_26 20:08:36.116 [1900706] <2> bpbkar SelectFile: INF - path = o1_mf_s_918226028_csgql5mp_.bkp 22:21:11.062 [1900706] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE 22:21:11.062 [1900706] <2> bpbkar sighandler: INF - ignoring additional SIGPIPE signals 22:21:11.062 [1900706] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 40: network connection broken
02-20-2017 04:14 AM
You need corresponding bpbrm and bptm logs (level 3) on the media server to see what happened w.r.t. file transfer to media server. (Please save/copy logs to .txt files and upload as attachments.)
Have you tried to look into KeepAlive settings on master, media and Oracle client?
What is the size of o1_mf_s_918226028_csgql5mp_.bkp?
What transfer rate do you see when you ftp the file to the media server?
If I may ask - why are you using normal file-level backup to backup Oracle Flash Recovery Area?
It seems that Oracle's recommendation is that rman should be used:
https://docs.oracle.com/cd/B19306_01/backup.102/b14192/bkscenar002.htm
02-21-2017 05:32 AM
Hello,
On bpbrm I have these logs:
20:09:38.650 [15348.14048] <2> bpbrm read_media_msg: read from media manager: BEGIN_BACKUP oraval15_1487358495 B02594 20:09:38.650 [15348.14048] <2> bpbrm send_parent_msg: BEGIN_BACKUP oraval15_1487358495 B02594 ... 22:21:11.473 [15348.14048] <2> bpbrm read_media_msg: read from media manager: EXIT oraval15_1487358495 150 22:21:11.473 [15348.14048] <2> bpbrm process_media_msg: media manager for backup id oraval15_1487358495 exited with status 150: termination requested by administrator 22:21:11.473 [15348.14048] <2> bpbrm send_status_to_parent: EXIT oraval15_1487358495 41 sent to parent process for jobid = 1823243.
On bptm I have the below logs:
20:08:26.249 [14760.11128] <2> read_brm_msg: START BACKUP -b oraval15_1487358495 -bt 1487358495 -c oraval15 -cl ux-v-db-dbuvg931-ab-m01-std -st 1 -rl 1 -rp 1209600 -sl full -ct 0 -hostname oraval15 -rclnt oraval15 -rclnthostname oraval15 -bclnt oraval15 -bclnthostname oraval15 -F encipher -ru root -no_callback -connect_options 0x01010100 -jobid 1823243 -jobgrpid 1822942 20:08:26.390 [14760.11128] <4> report_client: VBRC 2 14760 1 oraval15_1487358495 0 ux-v-db-dbuvg931-ab-std 1 full 0 1 1 20:08:26.873 [14760.11128] <2> mpx_start_child: backup child for oraval15_1487358495 is pid 15148.1828, CINDEX = 0 20:08:26.873 [14760.11128] <2> send_brm_msg: NCB_DATASOCKET oraval15_1487358495 /tmp/vnet-14608487358506886816000000003-d2Ifad;ae3096c6a558f9431e76204946afbe1f;11;1800 20:08:26.920 [15148.1828] <2> bptm: INITIATING (VERBOSE = 0): -pid 15348 -den 20 -rt 1 -rn 2 -cj 4 -mpx 8 -reqid -1487143804 -jm -brm -p CRYPT -stunit STUmedia007_P_LTO4 -eari 0 -maxfrag 1048575 -masterversion 760000 -mediasvr media007 -bpbrm_shm_id Global\NetBackup_BPBRM_SHM_Path_1981033517_15348_14048 -blks_per_buffer 128 -b oraval15_1487358495 -cl ux-v-db-dbuvg931-ab-std -c oraval15 -hostname oraval15 -bclnt oraval15 -bclnthostname oraval15 -ru root -rclnt oraval15 -rclnthostname oraval15 -sl full -mmfill 14760 0 65536 12 10 1981033705 0 1487358495 1 ux-v-db-dbuvg931-ab-std oraval15 oraval15_1487358495 20:09:38.853 [14760.11128] <4> write_backup: begin writing backup id oraval15_1487358495, copy 1, fragment 1, to media id B02594 on drive LTO4_2_2_1_2 (index 13) 20:10:28.664 [14760.11128] <2> io_write_back_header: drive index 13, oraval15_1487358495, file num = 1, mpx_headers = 1, copy 1 22:20:59.648 [14760.11128] <2> read_brm_msg: STOP BACKUP oraval15_1487358495
if i'm not wrong it appears there no more data transfering on media server just trace of:
file num = 1
and then the stop of backup.
The file named o1_mf_s_918226028_csgql5mp_.bkp has data size to 8 megabytes.
The scp transfert from client to media server is fast and complete.
We use normal file-level backup because if dba admin have not sufficient retention on airman backup available we can deliver them the db data files.
I will have a look on KeepAlive settings
Thanks to all
02-21-2017 05:51 AM
Snippets of logs do not help when full data transfer path needs to be followed.
If you cannot upload the full logs here (as .txt files), then best to log a Support call with Veritas where you can submit/upload complete log files.
02-21-2017 07:14 AM
master server tcp_keep parameter in second
[root@ ~]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@ ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@ ~]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@ ~]#
Client tcp_keep parameter in seconds
{oraval15}/u07 --> #no -a | grep tcp_keep
tcp_keepcnt = 8
tcp_keepidle = 14400
tcp_keepinit = 150
tcp_keepintvl = 150
windows media server tcp_keep parameter in millisecondKeepAliveTime REG_DWORD 0x000493e0 (300000)
300000 ms = 5 minutes
02-21-2017 07:25 AM - edited 02-21-2017 07:27 AM
02-21-2017 11:23 PM
So, KeepAliveTime on client is 5 minutes and 2 hours on master/media server?
Best if these settings are the same.
Is it not possible to edit the copied .txt files and change all hostnames and IPs to something generic? e.g.
replace master hostname with MASTER and master IP with 10.10.10.1
replace media server hostname (if not master) with MEDIA1 and IP with 10.10.10.2
replace client hostname with CLIENT1 and IP with 10.10.10.3
Other than those references, there should be nothing else that cannot be posted here.