Solved: Heavy load on EMM db, high CPU utilization

mrmadej · ‎02-26-2013

Hello,

I would like to ask for suggestions regarding heavy load on the EMM database after switching to the second cluster site using VCS GCO, with catalog replication by VVR.

The physical hostname and the virtual IP address of the master server have changed. The virtual hostname is the same. In bp.conf I have the parameter ANY_CLUSTER_INTERFACE = 1.

NBU version 7.0.1 on Linux RedHat 5.6

Yesterday I switched NBU to the backup site and everything looked good except for high CPU utilization by the NB_dbsrv process. The utilization sits permanently at 20% CPU and reaches 100% in peaks. I thought this would last only a few minutes, but it is still the same.

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12426 root      18   0 6647m 463m 7344 S 20.5  1.4 304:27.37 NB_dbsrv
13148 root      16   0  572m 174m  27m S  5.9  0.5  81:56.29 nbemm

In the nbemm log I have a huge number of entries like the ones below:

0,51216,111,111,1894847,1361811878412,13148,1089464640,0:,65:Preallocated <5> elements in curViewSeq for <c-sapbwp_1361345386>,22:ImageObject::fetchView,1
0,51216,111,111,1894848,1361811878413,13148,1089464640,0:,12:retval - <0>,36:ImageCatalogImpl::getImageWithCopies,1
1,51216,111,111,1894849,1361811878413,13148,1106905408,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
1,51216,111,111,1894850,1361811878421,13148,1106905408,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1062|u32:2020006|)
1,51216,111,111,1894851,1361811878422,13148,47815214316672,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
0,51216,111,111,1894852,1361811878428,13148,47815214316672,0:,12:retval - <0>,37:ImageCatalogImpl::getImageCopyDetails,1
1,51216,111,111,1894853,1361811878429,13148,1089464640,0:,0:,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
0,51216,111,111,1894854,1361811878429,13148,1089464640,0:,35:<Need to send unique media sequnce>,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1
0,51216,111,111,1894855,1361811878435,13148,1089464640,0:,12:retval - <0>,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1
1,51216,111,111,1894856,1361811878517,13148,1106905408,0:,0:,36:ImageCatalogImpl::getImageWithCopies,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
0,51216,111,111,1894857,1361811878524,13148,1106905408,0:,68:Preallocated <5> elements in curViewSeq for <ichmura-025_1361345428>,22:ImageObject::fetchView,1
0,51216,111,111,1894858,1361811878525,13148,1106905408,0:,12:retval - <0>,36:ImageCatalogImpl::getImageWithCopies,1
1,51216,111,111,1894859,1361811878525,13148,47815214316672,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv>
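
A minimal sketch for summarizing an excerpt like the one above, so it is easier to see which caller and which calls dominate. It assumes, based only on the lines shown here, that the caller appears as `APP=<name>` and that the emitting method is written as `<length>:<Class::method>`; the helper name `tally` is made up for illustration and is not a NetBackup tool.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt: the calling process is embedded
# as APP=<name>, and the method is written as <length>:<Class::method>.
APP_RE = re.compile(r"APP=<([^>]+)>")
METHOD_RE = re.compile(r"\d+:(\w+::\w+)")

def tally(lines):
    """Count log entries per calling application and per method name."""
    apps, methods = Counter(), Counter()
    for line in lines:
        app = APP_RE.search(line)
        if app:
            apps[app.group(1)] += 1
        method = METHOD_RE.search(line)
        if method:
            methods[method.group(1)] += 1
    return apps, methods

# Three lines copied from the excerpt above.
sample = [
    "0,51216,111,111,1894848,1361811878413,13148,1089464640,0:,12:retval - <0>,36:ImageCatalogImpl::getImageWithCopies,1",
    "1,51216,111,111,1894849,1361811878413,13148,1106905408,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)",
    "1,51216,111,111,1894853,1361811878429,13148,1089464640,0:,0:,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)",
]
apps, methods = tally(sample)
print(apps.most_common())
print(methods.most_common())
```

Run against a full log file, the same counters would show whether a single caller (here every tagged line names nbstserv) accounts for most of the traffic.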

It is difficult to log in to the Java GUI, but after a few minutes the GUI starts responding. However, it is impossible to open Device Monitor, Devices and other tabs where I can monitor devices. This is a big inconvenience for administrators.

Most policies run and work well, and restores can also be started. Unfortunately, not all media servers are working as expected, because some of them are active only for disk when they should be active for disk and tape. When I try to activate the host via "vmoprcmd -activate_host -h <host>" I receive a "network protocol error (39)" message.

I can query the EMM db using the nbemmcmd command (e.g. nbemmcmd -listhosts). I can also change some settings using this command. EMM responds immediately.

Is it possible that the database has to validate itself? The NBU database is about 1.1TB. I am afraid to restart NBU, as this is a very critical system for other production databases.

Thanks for any suggestions.

Regards

Madej

Mark_Solutions · ‎02-28-2013

Just wondering if the NetBackup logs folder has not failed over with the rest of the cluster, or has a write issue, so that NetBackup is desperately trying and failing to write its log files, causing a blockage.

Check everything out on the node you are on, in case that node has a configuration file or similar which is pointing the logs to a non-existent or write-protected area.
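
A minimal sketch of the check suggested above, assuming Python is available on the node. The function name `check_log_dir` is hypothetical. Of the two directories tried, `/var/log/VRTSpbx` appears elsewhere in this thread, while `/usr/openv/netbackup/logs` is an assumed location that should be replaced with wherever the logs are configured to go.

```python
import os
import tempfile

def check_log_dir(path):
    """Return 'ok', 'missing', or 'not writable' for a log directory."""
    if not os.path.isdir(path):
        return "missing"
    try:
        # Actually create and remove a file: permission bits alone can look
        # fine on a read-only or stale mount.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return "not writable"
    return "ok"

# /usr/openv/netbackup/logs is an assumed path; adjust to the local setup.
for directory in ("/usr/openv/netbackup/logs", "/var/log/VRTSpbx"):
    print(directory, "->", check_log_dir(directory))
```

Writing a real temporary file is deliberate, since a volume that did not fail over can still show a directory with normal-looking permissions.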

Nicolai · ‎02-26-2013

Try running /usr/openv/db/bin/dbadm (default password nbusql), then select 2) Database Space and Memory Management, 4) Adjust Memory Settings, 3) Large.

A restart of NetBackup is required (yes, I know, you will need to schedule it).

But you may have connection-oriented errors as well.

mrmadej · ‎02-26-2013

Thanks for response.

I have cache size as follows:

 (Setting)  (Initial)  (Minimum)  (Maximum)
   Current       500M       500M      6000M
     Small        25M        25M       500M
    Medium       200M       200M       750M
     Large       500M       500M         1G

The cache size is not the problem. With the same settings NBU worked well on the primary site. This was a support recommendation from a few months ago.

How can I diagnose what type of activity is causing this strange behavior? Sometimes it is impossible to run the vxlogview command. It returns:

]~# /usr/openv/netbackup/bin/vxlogview -o 111 -t 00:10:00 > /tmp/111.txt 
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists

I am looking for confirmation that it is safe to restart NBU, clear the cache and restart PBX. I am afraid that database inconsistency could occur after this operation.

Regards

Madej

mrmadej · ‎02-28-2013

The problem looks serious, and even technical support has no idea what is going on.

NBU and PBX have been restarted and the cache has been cleared, but the symptoms are still the same.

When there are no running jobs, CPU utilization is 20-30%. But when jobs are running (~200 active and ~100 queued), CPU utilization is about 140% - 180% (4 CPUs).

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12426 root      18   0 6859m 1.1g 8940 S 170.5  3.4   2225:35 NB_dbsrv
13148 root      15   0  640m 186m  29m S 56.8  0.6 684:40.26 nbemm
13337 root      15   0  422m 161m  16m S  9.5  0.5  41:16.39 nbjm
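
To put these figures in proportion on a 4-CPU host, the per-process values can be divided by the CPU count. This is only a sketch: it assumes top is in its default mode, where 100 in the %CPU column means one full core, and that the columns are in the order shown above. The function name `machine_share` is made up for illustration.

```python
TOP_LINES = """\
12426 root      18   0 6859m 1.1g 8940 S 170.5  3.4   2225:35 NB_dbsrv
13148 root      15   0  640m 186m  29m S 56.8  0.6 684:40.26 nbemm
13337 root      15   0  422m 161m  16m S  9.5  0.5  41:16.39 nbjm
"""

def machine_share(top_output, cpu_count):
    """Map command name -> share of total CPU capacity, in percent.

    Assumes %CPU is the 9th column and COMMAND the 12th, and that a value
    of 100 in %CPU corresponds to one fully used core.
    """
    shares = {}
    for line in top_output.splitlines():
        fields = line.split()
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip the header and blank lines
        shares[fields[11]] = float(fields[8]) / cpu_count
    return shares

for command, share in machine_share(TOP_LINES, cpu_count=4).items():
    print(f"{command}: {share:.1f}% of total CPU capacity")
```

Under those assumptions, NB_dbsrv at 170.5 corresponds to a little under half of the machine, with nbemm taking a further share on top.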

In addition, I have found that logs are not being created. On the primary site the logs were growing very quickly.

Support will analyze the NBSU logs. I hope this helps.

Regards

Madej

revarooo · ‎02-28-2013

cd /usr/openv/db/data

ls -l (post the output)

mph999 · ‎02-28-2013

Yikes nasty one ...

I would add USE_HASH=1 to /usr/openv/var/global/emm.conf

(create the file if it is not already there).

1. Stop NBU
2. Start just the DB - /usr/openv/db/bin/nbdbms_start_server
3. nbdb_unload -rebuild
4. nbdb_admin -reorganize
5. nbdb_unload -rebuild

Think of this as a 'defrag' on the NBDB.

(Not sure how long this will take. As an idea, I ran this a while back on a 15GB DB (the size of the SQL db, not the image db) and it took about 2.5 hours, but this was on a powerful machine with lots of memory (it runs in memory, so the more the better).)
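
The order of steps 2 to 5 can be written down as a small dry-run helper, so the sequence is reviewed before anything is executed. This is a sketch only: the helper name `run_steps` is hypothetical, the commands are copied from the list above as written, and it assumes they are on the PATH of whoever runs them. Step 1 is left out because the post does not name a command for it.

```python
import shlex
import subprocess

# Steps 2-5 as quoted in the post above. Step 1 (stopping NetBackup) is
# omitted because the post does not name a command for it.
STEPS = [
    "/usr/openv/db/bin/nbdbms_start_server",
    "nbdb_unload -rebuild",
    "nbdb_admin -reorganize",
    "nbdb_unload -rebuild",
]

def run_steps(steps, dry_run=True):
    """Process each step in order, stopping at the first failure.

    With dry_run=True (the default) nothing is executed; the commands are
    only printed. Returns the list of commands that were processed.
    """
    done = []
    for cmd in steps:
        print(("would run: " if dry_run else "running: ") + cmd)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps are not attempted after a failure.
            subprocess.run(shlex.split(cmd), check=True)
        done.append(cmd)
    return done

run_steps(STEPS)
```

Keeping the default at dry-run is a deliberate choice, given that the rebuild requires NetBackup to be stopped and may run for hours.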

Martin

e_m_p_har · ‎02-28-2013

Some quick questions come up. What is the size of NBDB? If you look at the contents of /usr/openv/db/data (or its appropriate link), does it require the cache values set in server.conf (-c/-cl/-ch)? Have you added -gn <value>? (I normally recommend setting this to 30 for initial testing.) Can you check how the system's semaphore values are set? The following command can be used to check them:

sysctl -a | grep kernel.sem

Examine the current settings, and if they are under the recommended values, you can increase them. This process is described in http://www.symantec.com/docs/TECH203066. The other nice thing about changing semaphore values is that it can be done without a restart.

If you see that EMM_DATA.db is excessively large (multiple GB or higher) then you will need to perform the rebuild/reorganize steps referenced by mph999. If you end up having to use rebuild/reorganize, you may want to run it multiple times. If a rebuild/reorg run doesn't complete processing, *.dbR files will be left over. If that happens, you will want to run another rebuild/reorganize.
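
The two things to look for in the data directory (a very large EMM_DATA.db and leftover *.dbR files) can be checked with a short script. This is a sketch under stated assumptions: the function name `inspect_data_dir` is made up, and the 2 GiB default is an arbitrary stand-in for "multiple GB", not a vendor threshold.

```python
import os

def inspect_data_dir(path, big_bytes=2 * 1024**3):
    """Report leftover *.dbR files and whether EMM_DATA.db looks oversized.

    big_bytes is an assumed cut-off for "multiple GB"; adjust as needed.
    """
    leftovers = sorted(n for n in os.listdir(path) if n.endswith(".dbR"))
    emm = os.path.join(path, "EMM_DATA.db")
    emm_size = os.path.getsize(emm) if os.path.exists(emm) else None
    return {
        "leftover_dbR": leftovers,
        "emm_data_bytes": emm_size,
        "emm_data_large": emm_size is not None and emm_size >= big_bytes,
    }

DATA_DIR = "/usr/openv/db/data"  # path mentioned earlier in this thread
if os.path.isdir(DATA_DIR):
    print(inspect_data_dir(DATA_DIR))
else:
    print(DATA_DIR, "not found on this host")
```

For reference, the listing posted further down in this thread shows EMM_DATA.db at 4837781504 bytes (about 4.5 GiB) alongside several .dbR files dated May 8 2012.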

One last thing I would recommend looking at is pack.summary from both sites. In NetBackup 7.0.1 there were some issues regarding EMM growth and deadlocks that required EEBs to fix. The NetBackup Late Breaking News article may be of use in identifying specific failure points.

mrmadej · ‎03-01-2013

 ll /opt/netbck/db/data                                                                
total 7265632
-rw------- 1 root root   26218496 May  9  2011 DARS_DATA.db
-r-------- 1 root root     135168 May  8  2012 DARS_DATA.dbR
-rw------- 1 root root   26218496 May  9  2011 DARS_INDEX.db
-r-------- 1 root root      36864 May  8  2012 DARS_INDEX.dbR
-rw------- 1 root root  314798080 Mar  1 11:18 DBM_DATA.db
-r-------- 1 root root  216436736 May  8  2012 DBM_DATA.dbR
-rw------- 1 root root   35803136 Mar  1 11:18 DBM_INDEX.db
-r-------- 1 root root   16977920 May  8  2012 DBM_INDEX.dbR
-rw------- 1 root root 4837781504 Mar  1 11:18 EMM_DATA.db
-r-------- 1 root root   97587200 May  8  2012 EMM_DATA.dbR
-rw------- 1 root root   26218496 Mar  1 11:18 EMM_INDEX.db
-r-------- 1 root root     724992 May  8  2012 EMM_INDEX.dbR
-rw------- 1 root root   36724736 Mar  1 11:16 NBDB.db
-r--r--r-- 1 root root   39100416 May  8  2012 NBDB.dbR
-rw------- 1 root root 1752170496 Mar  1 11:18 NBDB.log
-r-------- 1 root root     327680 May  8  2012 NBDB.logR
-rw------- 1 root root        460 Mar  1 00:01 vxdbms.conf

Linked to the shared volume.

Regards

Madej

mrmadej · ‎03-01-2013

I have found that the log in the nbdb directory is old and no new one is being created.

Other logs are in place.

The volume for logs is not in the cluster configuration. It is standalone and is mounted administratively. The directory structure is correct; it was created by the mklogdir script.

I have to analyze the nbsu logs to find differences in configuration between the primary and secondary servers.

Regards

Madej

mrmadej · ‎03-01-2013

Database is not very big (without images):

du -sk /opt/netbck/db/data/*
25608   /opt/netbck/db/data/DARS_DATA.db
136     /opt/netbck/db/data/DARS_DATA.dbR
25608   /opt/netbck/db/data/DARS_INDEX.db
40      /opt/netbck/db/data/DARS_INDEX.dbR
307432  /opt/netbck/db/data/DBM_DATA.db
211376  /opt/netbck/db/data/DBM_DATA.dbR
34968   /opt/netbck/db/data/DBM_INDEX.db
16584   /opt/netbck/db/data/DBM_INDEX.dbR
4724408 /opt/netbck/db/data/EMM_DATA.db
95304   /opt/netbck/db/data/EMM_DATA.dbR
25608   /opt/netbck/db/data/EMM_INDEX.db
712     /opt/netbck/db/data/EMM_INDEX.dbR
46912   /opt/netbck/db/data/NBDB.db
38184   /opt/netbck/db/data/NBDB.dbR
1728808 /opt/netbck/db/data/NBDB.log
320     /opt/netbck/db/data/NBDB.logR
8       /opt/netbck/db/data/vxdbms.conf

Semaphores are set the same as on the primary site:

sysctl -a | grep kernel.sem
kernel.sem = 250        32000   32      128
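
For reading that line: on Linux the four kernel.sem fields are, in order, SEMMSL, SEMMNS, SEMOPM and SEMMNI. The sketch below parses the line and lists any field that is under a target. The target numbers in it are placeholders invented for the example; they are NOT the values from TECH203066 and should be replaced with the ones from that article.

```python
def parse_kernel_sem(line):
    """Parse 'kernel.sem = SEMMSL SEMMNS SEMOPM SEMMNI' into a dict."""
    _, _, values = line.partition("=")
    semmsl, semmns, semopm, semmni = (int(v) for v in values.split())
    return {"SEMMSL": semmsl, "SEMMNS": semmns, "SEMOPM": semopm, "SEMMNI": semmni}

def below_target(current, target):
    """Return {name: (current, target)} for every field under its target."""
    return {k: (current[k], v) for k, v in target.items() if current[k] < v}

current = parse_kernel_sem("kernel.sem = 250        32000   32      128")

# PLACEHOLDER targets for illustration only; substitute the vendor values.
example_target = {"SEMMSL": 250, "SEMMNS": 64000, "SEMOPM": 32, "SEMMNI": 256}

print(current)
print(below_target(current, example_target))
```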

Pack summary looks as follows:

# DO NOT EDIT THIS FILE !
# * means installed patch was preceded by this patch.
# + means that the installed patch installed this patch as a dependency.
NB_CLT_7.0.1 installed. +NB_7.0.1 +NB_JAV_7.0.1
NB_7.0.1 installed. *NB_CLT_7.0.1
NB_JAV_7.0.1 installed. *NB_CLT_7.0.1
EEB_NetBackup_7.0.1_PET2140767_SET2140749_EEB2
EEB_NetBackup_7.0.1_PET2201350_SET2200966_EEB2

It is identical to the primary site.

I have to shut down NBU to rebuild/reorganize the database. I need to discuss this with the administrator.

Regards

Madej

sebaquadri · ‎08-15-2013

Hi all,

I'm experiencing the same issue as you, with NBU 7.1.0.4 on an HP-UX master server: a lot of jobs start but go to the queued state showing "Waiting in NetBackup scheduler work queue on server ...". I noticed that NB_dbsrv is consuming 100% of CPU. I do not have a big EMM_DATA.db file, as you can see below:

ebrbsnp05 >> find /nbdb_catalog -name EMM_DATA.db
/nbdb_catalog/data/EMM_DATA.db
ebrbsnp05 >> ls -l /nbdb_catalog/data/EMM_DATA.db
-rw------- 1 root sys 51531776 Aug 15 12:26 /nbdb_catalog/data/EMM_DATA.db
ebrbsnp05 >> du -k /nbdb_catalog/data/EMM_DATA.db
50324 /nbdb_catalog/data/EMM_DATA.db
ebrbsnp05 >>

But I'm having 100 active jobs and 140 queued... all of the queued jobs are "Waiting in NetBackup scheduler work queue on server <master server name>"

I have opened a case with Symantec, but it looks like they do not have a clear understanding of the issue.

Did you finally get a solution for this case? How does the nbdb rebuild/reorganize that mph999 suggested above work? Did it help? Have you tried it?

Let me mention that I tried to perform this rebuild/reorganize yesterday, but I got an error while trying to take a backup of the NBDB before doing the first rebuild (I tried to take a backup of the NBDB just to be safe)... see below what I got...

I'm receiving the error "Segmentation fault (core dumped)" when trying to perform a backup of the NBDB... could you please give us a hand?

Check below please…

ebrbsnp05 >> /usr/openv/netbackup/bin/nbdbms_start_stop start
ebrbsnp05 >> ../bpps -x
NB Processes
------------
root 11363 1 0 11:22:32 ? 0:00 /usr/openv/db//bin/NB_dbsrv @/usr/openv/var/global/server.conf @/usr/openv/var/global/databases.conf -hn 7

MM Processes
------------

Shared Symantec Processes
-------------------------
root 11324 1 0 11:22:27 ? 0:00 /opt/VRTSpbx/bin/pbx_exchange
ebrbsnp05 >> /usr/openv/db/bin/nbdb_ping
Database [NBDB] is alive and well on server [NB_ebrvmsnp01].
ebrbsnp05 >> mkdir /nbdb_catalog/backup
ebrbsnp05 >> /usr/openv/db/bin/nbdb_backup -dbn NBDB -online /nbdb_catalog/backup/backup1
Segmentation fault (core dumped)
ebrbsnp05 >>

Symantec is looking into some logs now... could anybody please help? Thanks in advance...

Let me add that I have also rebooted the master server yesterday, but the issue is still there... the reboot did not help. :(

Seba.

sebaquadri · ‎08-15-2013

Hi my friend... just a little detail... if you look at my email, it is @hp.com, so... I work at HP, and my management will not like your suggestion LOL :)

Changing from HP to IBM is not an option for me... anyway, thank you for your advice... I'll keep looking...

StefanosM · ‎08-15-2013

@ Madej.

We had similar problems and found that the cause was bad communication between the master and some media servers.
I suspect that you have "network" problems inside your DR master server.

What I cannot understand is the lack of logs.

@ Seba

:) I can not see your email.

We are doing a lot of work with HP. And you can always change to Linux.

mph999 · ‎08-15-2013

Not sure this is going to help, but it is a recommended setting. http://www.symantec.com/business/support/index?page=content&id=TECH178102 M

Marianne · ‎08-15-2013

I have moved this new issue to a new discussion.

The original thread is almost 6 months old and the OP stopped responding.

Can I ask everyone who replied to this new post to retype their replies in the new discussion?

https://www-secure.symantec.com/connect/forums/nbdbsrv-consuming-100-cpu

sebaquadri · ‎08-16-2013

@mph999 I've already modified some files, per Symantec's advice, and the issue was gone for almost 3 weeks... but last Monday the issue came back, and today I have 300 queued jobs and 150 active... I'm already using USE_HASH=1 as you mentioned...

The changes I made 3 weeks ago, when the issue was "fixed" (for 3 weeks), are these:

============begin===============

Install the following EEB and add USE_HASH=1 in the /usr/openv/var/global/emm.conf

=====================================

NetBackup_7.1.0.4 2762882

Problem Description

Due to the way Sybase processes the query involving backup-id field of EMM_ImageCopy and

EMM_Image tables, the query takes a long time to execute.

Installed Files

/usr/openv/netbackup/bin/nbemm

DOWNLOAD LINK:

ftp://iosupport:M3Q9r*SI0di7@ftp.entsupport.symantec.com/pub/support/outgoing/04757956/eebinstaller.2762882.1.hpia64

=================================================================================

(1) NBU Config Tuning

=================================================================================

1.A.)

Reduce Master / Media Server socket usage

Move NBU internal VNETD socket connections on master servers to server loopback interface instead of using VNETD daemon --Add the following line to /usr/openv/netbackup/bp.conf

CONNECT_OPTIONS = localhost 1 0 2

No restart needed

=================================================================================

1.B.)

Master Server

Add more connections to the EMM database if the environment has 10+ Media Servers and additional remote admin consoles / other increased backup activity.

STOP NBU on the Master

CREATE file--

UNIX: /usr/openv/var/global/emm.conf

Add contents--

NUM_DB_BROWSE_CONNECTIONS=20

NUM_DB_CONNECTIONS=21

NUM_ORB_THREADS=35

USE_HASH=1 (This line Not needed for 7.5, 7.1 needs EEB's)

REFERENCE: http://www.symantec.com/docs/TECH57277

----------------------------------------------------------------------------------------------

Add more memory and threads to EMM DB due to large environment (35 media servers).

To change , edit file:

Unix: /usr/openv/var/global/server.conf

Make a backup copy of the file before modifying

CURRENT file settings

--------------------------------

# cat -s /usr/openv/var/global/server.conf -n NB_ebrvmsnp01

-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA -gl DBA -ti 0 -c 25M -ch 500M -cl 25M -zl -os 1M -o /usr/openv/db//log/server.log -ud

MAKE CHANGES

---------------------------------

Update the following parameters in the file

-ch 500M to new value -ch 3G

-gn 32 ( Add missing parameter - add ' -gn 32 ' after ' -gl DBA ' for more DB threads )

-m ( Add missing parameter - add ' -m ' after ' -ud ' )

-m helps to trim the NBDB.log file; this will be the default in NBU 7.5

AFTER CHANGES

---------------------------------

# cat -s /usr/openv/var/global/server.conf -n NB_ebrvmsnp01

-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA -gl DBA -gn 32 -ti 0 -c 25M -ch 3G -cl 25M -zl -os 1M -o /usr/openv/db//log/server.log -ud -m

Restart services on the master

REFERENCE:

http://www.symantec.com/docs/HOTO67149
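
To confirm that an edited server.conf differs from the backup copy only in the intended options, the two option strings can be compared. A minimal sketch: the helper names `parse_options` and `diff_options` are hypothetical, and the parser simply assumes that a token not starting with "-" is the value of the option before it, which holds for the lines quoted above.

```python
def parse_options(text):
    """Split a server.conf option string into {option: value or True}."""
    tokens = text.split()
    opts, i = {}, 0
    while i < len(tokens):
        name = tokens[i]
        # Assumption: a following token without a leading "-" is this
        # option's value; otherwise the option is a bare switch.
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            opts[name] = tokens[i + 1]
            i += 2
        else:
            opts[name] = True
            i += 1
    return opts

def diff_options(before, after):
    """Return {option: (old, new)} for options that were added or changed."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        if before.get(key) != after.get(key):
            changes[key] = (before.get(key), after.get(key))
    return changes

BEFORE = ("-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA "
          "-gl DBA -ti 0 -c 25M -ch 500M -cl 25M -zl -os 1M "
          "-o /usr/openv/db//log/server.log -ud")
AFTER = ("-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -gd DBA -gk DBA "
         "-gl DBA -gn 32 -ti 0 -c 25M -ch 3G -cl 25M -zl -os 1M "
         "-o /usr/openv/db//log/server.log -ud -m")

for option, (old, new) in diff_options(parse_options(BEFORE), parse_options(AFTER)).items():
    print(option, old, "->", new)
```

With the two lines quoted above, only -ch, -gn and -m show up as changed, matching the three edits listed in the advice.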

=================================================================================

1.C.)

Verify minimum MASTER NBRB tuning file is in place

CREATE file if missing--

UNIX: /usr/openv/var/global/nbrb.conf

Windows: <install_path>\Veritas\NetBackup\var\global\nbrb.conf

Add contents--

SECONDS_FOR_EVAL_LOOP_RELEASE = 180

RESPECT_REQUEST_PRIORITY = 0

DO_INTERMITTENT_UNLOADS = 1

If the file exists, do not adjust its values to the ones noted above, as they may have been customized for your environment.

REFERENCE: http://www.symantec.com/docs/TECH57942
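
The "create if missing, otherwise leave alone" rule for nbrb.conf can be sketched as follows. The function name `ensure_conf` is made up for the example, and the demonstration writes to a temporary directory rather than to /usr/openv/var/global.

```python
import os
import tempfile

DEFAULTS = (
    "SECONDS_FOR_EVAL_LOOP_RELEASE = 180\n"
    "RESPECT_REQUEST_PRIORITY = 0\n"
    "DO_INTERMITTENT_UNLOADS = 1\n"
)

def ensure_conf(path, contents=DEFAULTS):
    """Create the file with the given contents only if it does not exist.

    Returns True if the file was created, False if an existing file was
    left untouched.
    """
    try:
        # Mode "x" fails if the file already exists, so an existing
        # (possibly customized) file is never overwritten.
        with open(path, "x") as fh:
            fh.write(contents)
    except FileExistsError:
        return False
    return True

# Demonstrated on a temporary directory, not on the real config location.
with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, "nbrb.conf")
    print("created:", ensure_conf(target))
    print("created again:", ensure_conf(target))
```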

=================================================================================

OS TUNING

=================================================================================

2)

OS TUNING FOR NETBACKUP RESOURCES - NBU 6.X / 7.X

-------------------------------------

2.A)

Default SYSTEM File Descriptors too low for NetBackup Master

This is a very critical setting for all masters on all OS's

# /usr/bin/ulimit -a

time(seconds) unlimited

file(blocks) unlimited

data(kbytes) 1048576

stack(kbytes) 8192

memory(kbytes) unlimited

coredump(blocks) 4194303

nofiles(descriptors) 2048 <<<<<<<<<<<<< Set to 8192 minimum

Reference

Minimum O/S ulimit settings on UNIX platforms

http://www.symantec.com/docs/TECH75332

Insufficient system file descriptors can cause the EMM_DATA.db file to grow very large.

http://www.symantec.com/docs/TECH168846
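
The descriptor limit can also be read programmatically with Python's standard `resource` module. This is a sketch: the 8192 figure is the minimum quoted above, the function name `nofile_ok` is hypothetical, and the result reflects the limit of the process running the script, which may differ from the limit the NetBackup daemons were started with.

```python
import resource

MIN_NOFILE = 8192  # minimum quoted in the advice above

def nofile_ok(soft_limit, minimum=MIN_NOFILE):
    """True if the soft file-descriptor limit meets the minimum."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= minimum

# Limits of the current process only (inherited from the invoking shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard} ok={nofile_ok(soft)}")
```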

=======end====================

Those settings helped, as I said before... but after 3 weeks the issue is back... :(

My friends, I'll continue on the new post --> https://www-secure.symantec.com/connect/forums/nbdbsrv-consuming-100-cpu

Please continue on this new post.

Thanks in advance!

Seba.

Marianne · ‎08-16-2013

Please continue discussion in this new thread:

https://www-secure.symantec.com/connect/forums/nbdbsrv-consuming-100-cpu
