Solved: Re: CPU at 100% during oracle database backup

eomaber · ‎01-18-2016

We are seeing high CPU usage (it reaches 100%) on NetBackup version 7.5.0.7, and as a result major backups are failing.

First step: I changed /opt/SYMCOpsCenterServer/db/conf/server.conf twice, from:

-n OPSCENTER_masterserverBZ -x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 -c 256M -ch 1024M -cl 256M -zl -os 1M -m

To:

-n OPSCENTER_masterserverBZ -x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 -c 512M -ch 2048M -cl 256M -zl -os 1M -m

and:

-n OPSCENTER_masterserverBZ -x tcpip(LocalOnly=YES;BROADCASTLISTENER=0;DOBROADCAST=NO;ServerPort=13786;) -gtc 4 -gd DBA -gk DBA -gl DBA -gp 8192 -ti 0 -c 1G -ch 2048M -cl 1G -zl -os 1M -m

I restarted the OpsCenter server after each change:
/opt/SYMCOpsCenterServer/bin/opsadmin.sh stop
/opt/SYMCOpsCenterServer/bin/opsadmin.sh start

But this didn't fix the issue.
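The only values that differ between the three server.conf lines above are the cache options (plus the added -gtc 4). In SQL Anywhere, -c is the initial cache size, -ch the maximum and -cl the minimum. A minimal sketch for pulling those three values out of a server.conf line so the variants can be compared side by side; the `cache_flags` helper name is hypothetical, and the sample input is a fragment of the first line above:

```shell
#!/bin/sh
# Print the -c, -ch and -cl values found in a server.conf line read from stdin.
cache_flags() {
    tr ' ' '\n' | awk '
        prev == "-c" || prev == "-ch" || prev == "-cl" { printf "%s=%s\n", prev, $0 }
        { prev = $0 }'
}

# Fragment of the original OpsCenter server.conf line from this post.
printf '%s\n' '-gp 8192 -ti 0 -c 256M -ch 1024M -cl 256M -zl -os 1M -m' | cache_flags
```

On a live system the function could be fed with `cache_flags < /opt/SYMCOpsCenterServer/db/conf/server.conf` instead of the sample fragment.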

Second step: I changed the file below:
cat /usr/openv/var/global/server.conf
-n NB_masterserverBZ
-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA -gl DBA -ti 0 -c 100M -ch 2048M -cl 100M -zl -os 1M -o /usr/openv/db//log/server.log -ud -m

Then to this:

-n NB_masterserverBZ
-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA -gl DBA -gn 32 -ti 0 -c 100M -ch 3G -cl 100M -zl -os 1M -o /usr/openv/db//log/server.log
-ud -m

The issue is still there.

Third step: I backed up and defragmented the OpsCenter database:

/opt/SYMCOpsCenterServer/bin/dbbackup.sh /my_db_backup_dir

/opt/SYMCOpsCenterServer/bin/dbdefrag.sh

Fourth step:

1- Created the missing file /usr/openv/var/global/emm.conf with the following contents:

NUM_DB_BROWSE_CONNECTIONS=20

NUM_DB_CONNECTIONS=21

NUM_ORB_THREADS=35

2- Created the missing file /usr/openv/var/global/nbrb.conf with the following contents:

SECONDS_FOR_EVAL_LOOP_RELEASE = 180

RESPECT_REQUEST_PRIORITY = 0

DO_INTERMITTENT_UNLOADS = 1

# /usr/bin/ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
nofiles(descriptors) 256
vmemory(kbytes)      unlimited

ls -l /opt/SYMCOpsCenterServer/db/data/

-rw-------   1 root     root     181116928 Jan 18 09:58 symcOpscache.db
-rw-------   1 root     root      139264 Jan 13 10:53 symcopsscratchdb.db
-rw-------   1 root     root     1089536 Jan 13 10:53 symcsearchdb.db
drwxr-xr-x   2 root     root         512 Jan 13 10:19 temp
-rw-------   1 root     root     1048797184 Jan 18 10:19 vxpmdb.db
-rw-r--r--   1 root     root      589824 Jan 18 10:24 vxpmdb.log

Attached are the outputs of the following commands, captured while the issue was occurring:

/bin/vmstat 30 1

/bin/prstat -a 5 2

/bin/iostat -xntcz

/bin/ps -e -o pcpu,pid,ppid,args | /bin/sort -rn | /bin/head -50

/opt/openv/netbackup/bin/bpps -x
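
A minimal sketch of how these outputs could be captured in one pass while the CPU is busy, assuming a POSIX shell. The `capture` and `collect_all` helper names and the default output directory are hypothetical; the commands and paths are the ones listed above, unchanged.

```shell
#!/bin/sh
# Hypothetical output directory; export OUTDIR beforehand to override it.
OUTDIR=${OUTDIR:-/tmp/cpu_diag}

# Run a command and store its stdout and stderr in $OUTDIR/<label>.txt.
capture() {
    label=$1
    shift
    mkdir -p "$OUTDIR"
    "$@" > "$OUTDIR/$label.txt" 2>&1
}

# Collect every diagnostic listed in the post (Solaris paths as given there).
collect_all() {
    capture vmstat /bin/vmstat 30 1
    capture prstat /bin/prstat -a 5 2
    capture iostat /bin/iostat -xntcz
    capture top_cpu sh -c '/bin/ps -e -o pcpu,pid,ppid,args | /bin/sort -rn | /bin/head -50'
    capture bpps /opt/openv/netbackup/bin/bpps -x
}
```

Calling `collect_all` during a spike leaves one text file per command in the output directory, which makes it easier to attach a consistent set of outputs to the thread.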

Finally, here is some information taken from the NetBackup master server:

cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
128
cat /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK
128

ndd -get /dev/tcp tcp_max_buf
16777216
ndd -get /dev/tcp tcp_xmit_hiwat
2097152
ndd -get /dev/tcp tcp_recv_hiwat
2097152
ndd -get /dev/tcp tcp_wscale_always
1

Below are excerpts from bptm.log:

04:00:03.399 [7398] <2> read_config_file: using 262144 value from /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK
04:00:03.399 [7398] <2> io_init: using 262144 data buffer size
04:00:03.399 [7398] <2> set_job_details: Tfile (718113): LOG 1452826803 4 bptm 7398 using 262144 data buffer size

.........

04:00:03.399 [7398] <2> read_config_file: using 0 value from /usr/openv/netbackup/NET_BUFFER_SZ
04:00:03.399 [7398] <2> io_set_recvbuf: NOT doing setsockopt() to set network buffer size
04:00:03.399 [7398] <2> io_set_recvbuf: receive network buffer is 2102260 bytes
04:00:03.405 [7398] <2> read_config_file: using 128 value from /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
04:00:03.417 [7398] <2> read_config_file: using 128 value from /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK
04:00:03.417 [7398] <2> io_init: using 128 data buffers
04:00:03.417 [7398] <2> set_job_details: Tfile (718113): LOG 1452826803 4 bptm 7398 using 128 data buffers

..........

04:00:08.088 [7438] <2> bptm: INITIATING (VERBOSE = 5): -w -c masterserverBZ -dpath master_su -stunit dd7200_master_su -cl sdp1a -bt 1452826806 -b masterserverBZ_1452826806 -st 0 -cj 1 -reqid -1452605272 -jm -brm -hostname masterserverBZ -ru root -rclnt masterserverBZ -rclnthostname masterserverBZ -rl 0 -rp 604800 -sl sdp1a -ct 0 -maxfrag 524288 -eari 0 -mediasvr masterserverBZ -no_callback -connect_options 0x01010100 -jobid 718114 -jobgrpid 718114 -masterversion 750000 -bpbrm_shm_id 1174405198 -blks_per_buffer 512 -shm
04:00:08.089 [7438] <2> main: bptm.c.1601: maximum fragment size is 524288000 Kbytes

........

04:00:03.417 [7398] <2> io_init: child delay = 10, parent delay = 15 (milliseconds)

Thanks for your help.

RiaanBadenhorst · ‎01-18-2016

Hi,

You didn't mention your server specs.

You're running OpsCenter on your master server. If you're facing performance issues, you should migrate it to a dedicated server.

You're also running verbose logging. I don't have figures on the performance impact, but one can assume minimal logging uses less CPU than verbose logging, or at least less memory.

sdo · ‎01-18-2016

Although this says you can:

http://eval.symantec.com/mktginfo/enterprise/white_papers/b-nbu_7_opscenter_analytics_WP.en-us.pdf

...I would take Riaan's advice, and the advice from here:

https://www.veritas.com/community/forums/running-opscenter-media-server

In practice I would never run OpsCenter Server on any NetBackup Server (master or media), even in a small virtual test environment.

Nicolai · ‎01-19-2016

OpsCenter is very CPU greedy.

The attachment shows plenty of free memory and almost no "wait for I/O".

However, CPU may be an issue:

load averages: 4.94, 1.83, 1.05
load averages: 6.37, 2.19, 1.17

How many cores do you have?

The number of cores influences how the load average should be interpreted.

eomaber · ‎01-19-2016

@Nicolai,

I have four cores:

==================================== CPUs ====================================

      CPU                 CPU                         Run    L2$    CPU   CPU
LSB   Chip                 ID                         MHz     MB    Impl. Mask
---   ---- ---------------------------------------- ----   ---    ----- ----
00     0      0,   1,   2,   3                       2750   5.0        7 161

@Riaan,

I have increased the logging level just to troubleshoot the issue.

@sdo,

I will check the links.

Nicolai · ‎01-20-2016

load averages: 4.94, 1.83, 1.05
load averages: 6.37, 2.19, 1.17

With 4 cores, a load average of 4 means all CPUs are 100% utilized. This means that during the 1-minute samples the CPUs were overloaded by 0.94 and 2.37 cores' worth of work respectively (94% and 237% of a single core).

There is no problem in the 5- and 15-minute samples. Please note these are averages, so you could still have CPU spikes.
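
The same arithmetic can also be expressed relative to total capacity by dividing the load average by the number of cores. A minimal sketch, assuming a POSIX shell with awk; the `load_pct` helper name is hypothetical, and the inputs are the two 1-minute samples and the 4 cores reported in this thread:

```shell
#!/bin/sh
# Express a load average as a percentage of total CPU capacity.
# Usage: load_pct LOAD CORES
load_pct() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores * 100 }'
}

load_pct 4.94 4   # first 1-minute sample  -> 123.50
load_pct 6.37 4   # second 1-minute sample -> 159.25
```

Anything above 100 means more runnable work than the 4 cores can serve at once. The 5- and 15-minute figures (1.83 and 1.05, 2.19 and 1.17) all come out below 50.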

As a workaround, I would suspend OpsCenter data collection during the time frame when backup load is at its highest.

eomaber · ‎01-20-2016

@Nicolai,

I have stopped the OpsCenter server with:

/opt/SYMCOpsCenterServer/bin/opsadmin.sh stop

But still, I got the CPU spikes.

Nicolai · ‎01-20-2016

CPU spikes can't be avoided. Do they still cause issues?

What are the error messages when the Oracle backup fails?

Even if a system is totally hammered, it should not cause failures.

eomaber · ‎01-20-2016

@Nicolai,

No Oracle backups have failed since the very beginning, when the first CPU spikes occurred.

You are right, even if a system is totally hammered, it should not cause failures.

What is the difference between "suspend OpsCenter data collection" and stopping OpsCenter as I did?

BR

RiaanBadenhorst · ‎01-20-2016

OpsCenter would be online but not collecting data from the master.

sdo · ‎01-20-2016

Suspend collection = the OpsCenter Server, app server, web server all stay up and accessible - so you can still query reports etc, but OpsCenter Server will cease polling any NetBackup Master servers.

OpsCenter down = no polling, no reports, no GUI, no web server.

eomaber · ‎01-20-2016

@Riaan, @sdo,

What I meant by my last comment is the following:

I stopped the OpsCenter server and the issue is still there, so there is no need to suspend OpsCenter data collection now to lower the CPU load.

Hope I'm clear.

I also read somewhere that if more than 5000 backup jobs run per day, a separate server is recommended for OpsCenter.

How can I find out the number of jobs run in one day?

Nicolai · ‎01-20-2016

Well - if OpsCenter is stopped and you still have the issue, it is not related to OpsCenter at all (and there is no reason to stop OpsCenter).

Something else must be interfering with Oracle backups.

OpsCenter can tell you how many jobs ran, or you can use this command line:

# bpimagelist -hoursago 24 -idonly | wc -l

The command lists all backups from the 24 hours before you run it.

Alternative:

bpimagelist -d 01/01/2016 -e 01/20/2016 -idonly | wc -l

Divide the value by 20 to get a daily average.
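
A minimal sketch of that division, assuming a POSIX shell with awk. The `avg_per_day` helper name and the sample total are hypothetical. Note that the count is of backup images, so it only approximates the number of jobs.

```shell
#!/bin/sh
# Average a total count over a number of days.
# Usage: avg_per_day TOTAL DAYS
avg_per_day() {
    awk -v total="$1" -v days="$2" 'BEGIN { printf "%.1f\n", total / days }'
}

# On the master server the real count would come from the command above, e.g.:
#   total=$(bpimagelist -d 01/01/2016 -e 01/20/2016 -idonly | wc -l)
#   avg_per_day "$total" 20
avg_per_day 100000 20   # hypothetical total of 100000 images -> 5000.0
```

A result around or above 5000 would be at the threshold mentioned earlier in the thread for moving OpsCenter to a separate server.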
