01-18-2016 01:41 AM
We are seeing high CPU usage (reaching 100%) on a NetBackup 7.5.0.7 master server, and because of it major backups are failing.
First step: I changed /opt/SYMCOpsCenterServer/db/conf/server.conf twice, from:
-n OPSCENTER_masterserverBZ -x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 -c 256M -ch 1024M -cl 256M -zl -os 1M -m
To:
-n OPSCENTER_masterserverBZ -x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 -c 512M -ch 2048M -cl 256M -zl -os 1M -m
and:
-n OPSCENTER_masterserverBZ -x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gtc 4 -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 -c 1G -ch 2048M -cl 1G -zl -os 1M -m
I restarted the OpsCenter server each time:
/opt/SYMCOpsCenterServer/bin/opsadmin.sh stop/start
But this didn't fix the issue.
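To compare the cache options across edits like these, a small script helps. This is only a sketch: it assumes that -c, -ch and -cl are the initial, maximum and minimum cache sizes of the database server, and that the K/M/G suffixes are binary units. The function names are made up for illustration.

```python
import re

# Size suffixes as they appear in the option strings; assumed to be binary units.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '256M' or '1G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit, 1)


def cache_settings(option_line):
    """Return the -c, -ch and -cl values (in bytes) found in one option line."""
    tokens = option_line.split()
    found = {}
    # Each flag is followed by its value as the next whitespace-separated token.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-c", "-ch", "-cl"):
            found[flag] = parse_size(value)
    return found


def cache_is_consistent(settings):
    """True when minimum (-cl) <= initial (-c) <= maximum (-ch)."""
    return settings["-cl"] <= settings["-c"] <= settings["-ch"]


# The third variant of the OpsCenter server.conf shown above.
line = ("-n OPSCENTER_masterserverBZ "
        "-x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) "
        "-gtc 4 -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 "
        "-c 1G -ch 2048M -cl 1G -zl -os 1M -m")
settings = cache_settings(line)
print(settings, cache_is_consistent(settings))
```

Running it against each variant of the line makes it easy to see which cache values actually changed between restarts.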
Second step: I changed the file below:
cat /usr/openv/var/global/server.conf
-n NB_masterserverBZ
-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA -gl DBA -ti 0 -c 100M -ch 2048M -cl 100M -zl -os 1M -o /usr/openv/db//log/server.log -ud -m
I changed it to this:
-n NB_masterserverBZ
-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA -gl DBA -gn 32 -ti 0 -c 100M -ch 3G -cl 100M -zl -os 1M -o /usr/openv/db//log/server.log -ud -m
Still the issue is there.
Third step: I backed up NBDB and ran a defrag:
/opt/SYMCOpsCenterServer/bin/dbbackup.sh /my_db_backup_dir
/opt/SYMCOpsCenterServer/bin/./dbdefrag.sh
Fourth step:
1- CREATED the file (missing) -- /usr/openv/var/global/emm.conf
Added contents--
NUM_DB_BROWSE_CONNECTIONS=20
NUM_DB_CONNECTIONS=21
NUM_ORB_THREADS=35
2- CREATED the file (missing)-- /usr/openv/var/global/nbrb.conf
Added contents--
SECONDS_FOR_EVAL_LOOP_RELEASE = 180
RESPECT_REQUEST_PRIORITY = 0
DO_INTERMITTENT_UNLOADS = 1
# /usr/bin/ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
vmemory(kbytes) unlimited
ls -l /opt/SYMCOpsCenterServer/db/data/
-rw------- 1 root root 181116928 Jan 18 09:58 symcOpscache.db
-rw------- 1 root root 139264 Jan 13 10:53 symcopsscratchdb.db
-rw------- 1 root root 1089536 Jan 13 10:53 symcsearchdb.db
drwxr-xr-x 2 root root 512 Jan 13 10:19 temp
-rw------- 1 root root 1048797184 Jan 18 10:19 vxpmdb.db
-rw-r--r-- 1 root root 589824 Jan 18 10:24 vxpmdb.log
Attached are the outputs of the following commands, captured while the issue was happening:
/bin/vmstat 30 1
/bin/prstat -a 5 2
/bin/iostat -xntcz
/bin/ps -e -o pcpu,pid,ppid,args | /bin/sort -rn | /bin/head -50
/opt/openv/netbackup/bin/bpps -x
Finally, here is some information taken from the NetBackup master server:
cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
128
cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK
128
ndd -get /dev/tcp tcp_max_buf
16777216
ndd -get /dev/tcp tcp_xmit_hiwat
2097152
# ndd -get /dev/tcp tcp_recv_hiwat
2097152
# ndd -get /dev/tcp tcp_wscale_always
1
Below is an excerpt from bptm.log:
04:00:03.399 [7398] <2> read_config_file: using 262144 value from /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK
04:00:03.399 [7398] <2> io_init: using 262144 data buffer size
04:00:03.399 [7398] <2> set_job_details: Tfile (718113): LOG 1452826803 4 bptm 7398 using 262144 data buffer size
.........
04:00:03.399 [7398] <2> read_config_file: using 0 value from /usr/openv/netbackup/NET_BUFFER_SZ
04:00:03.399 [7398] <2> io_set_recvbuf: NOT doing setsockopt() to set network buffer size
04:00:03.399 [7398] <2> io_set_recvbuf: receive network buffer is 2102260 bytes
04:00:03.405 [7398] <2> read_config_file: using 128 value from /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
04:00:03.417 [7398] <2> read_config_file: using 128 value from /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK
04:00:03.417 [7398] <2> io_init: using 128 data buffers
04:00:03.417 [7398] <2> set_job_details: Tfile (718113): LOG 1452826803 4 bptm 7398 using 128 data buffers
..........
04:00:08.088 [7438] <2> bptm: INITIATING (VERBOSE = 5): -w -c masterserverBZ -dpath master_su -stunit dd7200_master_su -cl sdp1a -bt 1452826806 -b masterserverBZ_1452826806 -st 0 -cj 1 -reqid -1452605272 -jm -brm -hostname masterserverBZ -ru root -rclnt masterserverBZ -rclnthostname masterserverBZ -rl 0 -rp 604800 -sl sdp1a -ct 0 -maxfrag 524288 -eari 0 -mediasvr masterserverBZ -no_callback -connect_options 0x01010100 -jobid 718114 -jobgrpid 718114 -masterversion 750000 -bpbrm_shm_id 1174405198 -blks_per_buffer 512 -shm
04:00:08.089 [7438] <2> main: bptm.c.1601: maximum fragment size is 524288000 Kbytes
........
04:00:03.417 [7398] <2> io_init: child delay = 10, parent delay = 15 (milliseconds)
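For reference, the buffer values in the log excerpt above translate into a shared memory footprint per bptm process. This is a rough sketch only: it assumes the usual rule of thumb that each stream allocates buffer size times number of buffers, and the `streams` parameter is a hypothetical multiplier for concurrent streams, not something read from the log.

```python
def shared_memory_bytes(size_data_buffers, number_data_buffers, streams=1):
    """Estimate shared memory used for data buffers.

    Assumes each stream allocates size_data_buffers * number_data_buffers bytes.
    """
    return size_data_buffers * number_data_buffers * streams


# Values reported by bptm above: 262144-byte buffers, 128 buffers.
per_stream = shared_memory_bytes(262144, 128)
print(per_stream // (1024 * 1024), "MiB per stream")  # 32 MiB per stream

# Hypothetical example: ten concurrent streams.
print(shared_memory_bytes(262144, 128, streams=10) // (1024 * 1024), "MiB for 10 streams")
```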
Thanks for your help.
01-18-2016 09:21 AM
Hi,
You didn't mention your server specs.
You're running OpsCenter on your master server. You should migrate it to a dedicated server if you're facing performance issues.
You're also running verbose logging. I don't have figures on the performance impact, but one can assume minimal logging will use less CPU than verbose logging, or at least less memory.
01-18-2016 11:54 AM
Although this says you can:
http://eval.symantec.com/mktginfo/enterprise/white_papers/b-nbu_7_opscenter_analytics_WP.en-us.pdf
...I would take Riaan's advice, and the advice from here:
https://www.veritas.com/community/forums/running-opscenter-media-server
In practice I would never run OpsCenter Server on any NetBackup Server (master or media), even in a small virtual test environment.
01-19-2016 01:15 AM
OpsCenter is very CPU greedy.
However, CPU may be an issue:
load averages: 4.94, 1.83, 1.05
load averages: 6.37, 2.19, 1.17
How many cores do you have?
The number of cores influences how the load average should be interpreted.
01-19-2016 05:34 AM
@Nicolai,
I have four cores:
==================================== CPUs ====================================
CPU CPU Run L2$ CPU CPU
LSB Chip ID MHz MB Impl. Mask
--- ---- ---------------------------------------- ---- --- ----- ----
00 0 0, 1, 2, 3 2750 5.0 7 161
@Riaan,
I have increased the logging level just to troubleshoot the issue.
@sdo,
I will check the links.
01-20-2016 12:25 AM
load averages: 4.94, 1.83, 1.05
load averages: 6.37, 2.19, 1.17
With 4 cores, a load average of 4 means all CPUs are 100% utilized. This means that during the 1-minute samples above the CPUs were oversubscribed by 0.94 and 2.37 respectively, i.e. by 94% and 237% of a single core.
There is no problem in the 5- and 15-minute samples, but please note these are averages, so you could still have CPU spikes.
As a workaround, I would suspend OpsCenter data collection during the time window when backup load is at its highest.
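The load-average arithmetic can be sketched as follows. It is a simplification that treats "load average divided by core count" as a utilization ratio, where anything above 1.0 means more runnable work than cores. The function name is made up for illustration.

```python
def load_per_core(load_averages, cores):
    """Divide each load average (1, 5, 15 minute) by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return [load / cores for load in load_averages]


# The two samples quoted above, on a 4-core machine.
samples = [(4.94, 1.83, 1.05), (6.37, 2.19, 1.17)]
for sample in samples:
    ratios = load_per_core(sample, cores=4)
    # A ratio above 1.0 means the run queue exceeded the available cores.
    flags = ["overloaded" if r > 1.0 else "ok" for r in ratios]
    print([f"{r:.1%}" for r in ratios], flags)
```

Only the 1-minute value exceeds 1.0 per core in both samples, which matches the reading that the 5- and 15-minute averages are fine.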
01-20-2016 01:01 AM
@Nicolai,
I have stopped the OpsCenter server with:
/opt/SYMCOpsCenterServer/bin/opsadmin.sh stop
But still, I got the CPU spikes.
01-20-2016 01:03 AM
CPU spikes can't be avoided. Are they still causing issues?
What are the error messages when the Oracle backup fails?
Even if a system is totally hammered, it should not cause failures.
01-20-2016 06:01 AM
@Nicolai,
No Oracle backups have failed since the very first time the CPU spikes occurred.
You are right, even if a system is totally hammered, it should not cause failures.
What is the difference between "suspend OpsCenter data collection" and stopping OpsCenter as I did?
BR
01-20-2016 06:44 AM
OpsCenter would be online but not collecting data from the master.
01-20-2016 06:44 AM
Suspend collection = the OpsCenter Server, app server, web server all stay up and accessible - so you can still query reports etc, but OpsCenter Server will cease polling any NetBackup Master servers.
OpsCenter down = no polling, no reports, no GUI, no web server.
01-20-2016 07:27 AM
@Riaan, @sdo,
I meant by my last comment the following:
I stopped the OpsCenter server and the issue is still there, so there is no need to suspend OpsCenter data collection now to lower the CPU load.
Hope I'm clear.
I also read somewhere that if the number of backup jobs run per day is greater than 5000, a separate server is recommended for OpsCenter.
How can I find out the number of jobs run in one day?
01-20-2016 07:45 AM
Well, if OpsCenter is stopped and you still have the issue, it is not related to OpsCenter at all (and there is no reason to stop OpsCenter).
Something else must be interfering with Oracle backups.
OpsCenter can tell you how many jobs were run, or you can use this command line:
# bpimagelist -hoursago 24 -idonly | wc -l
The command counts all backups from the 24 hours before you run it.
Alternative:
bpimagelist -d 01/01/2016 -e 01/20/2016 -idonly | wc -l
Divide the value by 20 to get a daily average.
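The division can be sketched like this, assuming both the start and end dates are counted (01/01 through 01/20 is then 20 days). The total of 60000 is an invented number for illustration, not output from any real system.

```python
from datetime import date


def daily_average(total, start, end):
    """Average per day over an inclusive date range."""
    days = (end - start).days + 1  # +1 because both end points are counted
    if days <= 0:
        raise ValueError("end date must not be before start date")
    return total / days


# Hypothetical line count from the bpimagelist command for 01/01 to 01/20.
total_images = 60000
average = daily_average(total_images, date(2016, 1, 1), date(2016, 1, 20))
print(average)         # 3000.0
print(average > 5000)  # False
```

The last line compares the result with the 5000-per-day figure mentioned earlier in the thread.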