mrmadej

Level 4

12 years ago

Heavy load on EMM db, high CPU utilization

Hello,

I would like to ask for suggestions regarding heavy load on the EMM database after switching to the second cluster site using VCS GCO, with catalog replication by VVR.

The physical hostname and the virtual IP address of the master server have changed. The virtual hostname is the same. In bp.conf I have the parameter ANY_CLUSTER_INTERFACE = 1.

NBU version 7.0.1 on Red Hat Linux 5.6.

Yesterday I switched NBU to the backup site and everything looked good except for high CPU utilization by the NB_dbsrv process. The utilization is constantly around 20% CPU and peaks at 100%. I thought this would last only a few minutes, but it is still the same.

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12426 root      18   0 6647m 463m 7344 S 20.5  1.4 304:27.37 NB_dbsrv
13148 root      16   0  572m 174m  27m S  5.9  0.5  81:56.29 nbemm

In the nbemm log I have a huge number of entries like the ones below:

0,51216,111,111,1894847,1361811878412,13148,1089464640,0:,65:Preallocated <5> elements in curViewSeq for <c-sapbwp_1361345386>,22:ImageObject::fetchView,1
0,51216,111,111,1894848,1361811878413,13148,1089464640,0:,12:retval - <0>,36:ImageCatalogImpl::getImageWithCopies,1
1,51216,111,111,1894849,1361811878413,13148,1106905408,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
1,51216,111,111,1894850,1361811878421,13148,1106905408,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1062|u32:2020006|)
1,51216,111,111,1894851,1361811878422,13148,47815214316672,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
0,51216,111,111,1894852,1361811878428,13148,47815214316672,0:,12:retval - <0>,37:ImageCatalogImpl::getImageCopyDetails,1
1,51216,111,111,1894853,1361811878429,13148,1089464640,0:,0:,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
0,51216,111,111,1894854,1361811878429,13148,1089464640,0:,35:<Need to send unique media sequnce>,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1
0,51216,111,111,1894855,1361811878435,13148,1089464640,0:,12:retval - <0>,41:ImageCatalogImpl::getTotalSizeAndMediaIds,1
1,51216,111,111,1894856,1361811878517,13148,1106905408,0:,0:,36:ImageCatalogImpl::getImageWithCopies,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv> PID=<13415> |)
0,51216,111,111,1894857,1361811878524,13148,1106905408,0:,68:Preallocated <5> elements in curViewSeq for <ichmura-025_1361345428>,22:ImageObject::fetchView,1
0,51216,111,111,1894858,1361811878525,13148,1106905408,0:,12:retval - <0>,36:ImageCatalogImpl::getImageWithCopies,1
1,51216,111,111,1894859,1361811878525,13148,47815214316672,0:,0:,37:ImageCatalogImpl::getImageCopyDetails,1,(1061|A65:HOST=<tygrys.xxxx.xx> VER=<700000> APP=<nbstserv>
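
To see which caller is generating the bulk of these entries, the raw lines can be tallied by the `APP=<...>` tag and by method name. This is only a minimal sketch: the field layout is inferred from the excerpt above, and the function names are made up for illustration.

```shell
#!/bin/sh
# Tally raw nbemm log lines read from stdin.
# The line layout is inferred from the excerpt in this thread, not from vendor documentation.

# Count entries per APP=<...> tag (the process that issued the request).
count_by_app() {
    grep -o 'APP=<[^>]*>' | sed 's/^APP=<//; s/>$//' | sort | uniq -c | sort -rn
}

# Count entries per Class::method token.
count_by_method() {
    grep -o '[A-Za-z_][A-Za-z_0-9]*::[A-Za-z_][A-Za-z_0-9]*' | sort | uniq -c | sort -rn
}
```

Usage would be something like `count_by_app < raw_nbemm.log`, where `raw_nbemm.log` is a placeholder for the raw log file. If one application dominates the counts, that is probably the place to look first.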

It is difficult to log in to the Java GUI, but after a few minutes the GUI starts responding. However, it is impossible to open Device Monitor, Devices and other tabs where I can monitor devices. This is a big inconvenience for administrators.

Most policies run and work well, and restores can also be started. Unfortunately, not all media servers are working as expected, because some of them are active only for disk when they should be active for disk and tape. When I try to activate the host via "vmoprcmd -activate_host -h <host>" I receive a "network protocol error (39)" message.

I can query the EMM database using the nbemmcmd command (e.g. nbemmcmd -listhosts). I can also change some settings using this command. EMM responds immediately.

Is it possible that the database has to validate itself? The NBU database is about 1.1TB. I am afraid to restart NBU, as this is a very critical system for other production databases.

Thanks for any suggestions.

Regards

Madej

Nicolai
Moderator
12 years ago
Try running /usr/openv/db/bin/dbadm (default password nbusql), then select 2) Database Space and Memory Management, then 4) Adjust Memory Settings, then 3) Large.

A restart of NetBackup is required (yes, I know, you will need to schedule it).

You may also have connection-oriented errors as well.

mrmadej

Level 4

12 years ago

Thanks for the response.

I have the cache size set as follows:

 (Setting) (Initial) (Minimum) (Maximum)
   Current      500M      500M     6000M
     Small       25M       25M      500M
    Medium      200M      200M      750M
     Large      500M      500M        1G

The cache size is not the problem. With the same settings NBU worked well on the primary site. These settings were recommended by support a few months ago.

How can I diagnose what type of activity is causing this strange behavior? Sometimes it is impossible to run the vxlogview command. It returns:

]~# /usr/openv/netbackup/bin/vxlogview -o 111 -t 00:10:00 > /tmp/111.txt 
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists
/var/log/VRTSpbx/50936-103-8323328-130226-0000000201.log does not exists

I am looking for confirmation that it is safe to restart NBU, clear the cache and restart PBX. I am afraid that a database inconsistency could occur after this operation.

Regards

Madej

mrmadej
Level 4
12 years ago
The problem looks serious and even technical support has no idea what is going on.

NBU and PBX have been restarted, the cache has been cleared, and the symptoms are still the same.

When there are no running jobs, CPU utilization is 20-30%. But when jobs are running (~200 active and ~100 queued), CPU utilization is about 140%-180% (4 CPUs).

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12426 root      18   0 6859m 1.1g 8940 S 170.5  3.4   2225:35 NB_dbsrv
13148 root      15   0  640m 186m  29m S  56.8  0.6 684:40.26 nbemm
13337 root      15   0  422m 161m  16m S   9.5  0.5  41:16.39 nbjm

In addition, I have found that logs are not being created. On the primary site the logs were growing very quickly.

Support will analyze the NBSU logs. I hope this helps.

Regards

Madej
revarooo
Level 6
12 years ago
cd /usr/openv/db/data

ls -l (post the output)
Mark_Solutions
Level 6
12 years ago
Just wondering if the NetBackup logs folder has not failed over with the rest of the cluster, or has a write issue, so that NetBackup is desperately trying and failing to write its log files, causing a blockage.

Check everything out on the node you are on, in case that node has a configuration file or similar which is pointing the logs to a non-existent or write-protected area.
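
A quick way to run that check on the active node is to probe each log directory with a real write. The sketch below is an illustration only; the function name is made up, and the directories you pass in are whatever your configuration points at.

```shell
#!/bin/sh
# Report whether a directory exists and accepts a new file.
# Creates and removes a small probe file, which exercises the same
# operation a logging daemon needs.

check_log_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "MISSING $dir"
        return 1
    fi
    probe="$dir/.write_probe.$$"
    # Subshell so a failed redirection cannot abort the calling shell.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "OK $dir"
        return 0
    fi
    echo "NOT_WRITABLE $dir"
    return 2
}
```

For example, `check_log_dir /var/log/VRTSpbx` uses the PBX log path that appears in the vxlogview output earlier in the thread. Run it as the same user the daemons run as, otherwise the result may not reflect what they see.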
mph999
Level 6
12 years ago
Yikes nasty one ...

I would add USE_HAS=1 to /usr/openv/var/global/nbemm.conf (create the file if it is not already there).

1. Stop NBU
2. Start just the DB - /usr/openv/db/bin/nbdbms_start_server
3. nbdb_unload -rebuild
4. nbdb_admin -reorganize
5. nbdb_unload -rebuild

Think of this as a 'defrag' on the NBDB.

(Not sure how long this will take. As an idea, I ran this a while back on a 15GB DB (the size of the SQL DB, not the image DB) and it took about 2.5 hours, but that was on a powerful machine with lots of memory (it runs in memory, so the more the better).)
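
Steps 2 to 5 above could be wrapped in a small dry-run script so the sequence can be reviewed before a maintenance window. This is a rough sketch only: `DB_BIN` is an assumed location for `nbdb_unload` and `nbdb_admin` (the thread gives a full path only for `nbdbms_start_server`), and `run_step` and `rebuild_nbdb` are made-up helper names. Nothing is executed unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run wrapper for the rebuild/reorganize sequence listed in this post.
# Prints each command; executes it only when RUN=1 is set in the environment.
# DB_BIN is an assumption; adjust it to wherever the binaries live on your system.

DB_BIN=${DB_BIN:-/usr/openv/db/bin}

run_step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

rebuild_nbdb() {
    # Step 1 (stopping NBU) is deliberately left to the operator or the cluster software.
    run_step "$DB_BIN/nbdbms_start_server" &&
    run_step "$DB_BIN/nbdb_unload" -rebuild &&
    run_step "$DB_BIN/nbdb_admin" -reorganize &&
    run_step "$DB_BIN/nbdb_unload" -rebuild
}
```

Calling `rebuild_nbdb` with `RUN` unset only prints the four commands. The chain stops at the first failing step, so a failed rebuild is not followed by a reorganize.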

Martin
e_m_p_har
12 years ago
Some quick questions come up. What is the size of NBDB? If you look at the contents of /usr/openv/db/data (or its appropriate link), does it require the cache values set in server.conf (-c/-cl/-ch)? Have you added -gn <value>? (I normally recommend setting this to 30 for initial testing.) Can you check how the system's semaphore values are set? The following command shows the current values:

sysctl -a | grep kernel.sem

Examine current settings, and if they are under the recommended values, you can increase them. This process is described in (http://www.symantec.com/docs/TECH203066). The other nice thing about changing semaphore values is that it can be done without a restart.
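
If you want to script the comparison, something like the following could work. It is a sketch under assumptions: the function name is made up, and the four minimums are supplied by the caller because the recommended values live in the tech note and are not repeated here.

```shell
#!/bin/sh
# Compare a kernel.sem line (SEMMSL SEMMNS SEMOPM SEMMNI) against minimums.
# Usage: sysctl -n kernel.sem | check_sem MIN_MSL MIN_NS MIN_OPM MIN_MNI
# The minimums are caller-supplied; take them from the vendor note, not from this sketch.

check_sem() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" '
    {
        sub(/^kernel\.sem *= */, "")     # also accept "sysctl -a" style lines
        split("SEMMSL SEMMNS SEMOPM SEMMNI", name, " ")
        min[1] = a; min[2] = b; min[3] = c; min[4] = d
        bad = 0
        for (i = 1; i <= 4; i++) {
            if ($i + 0 < min[i] + 0) {
                printf "%s=%s is below %s\n", name[i], $i, min[i]
                bad = 1
            }
        }
        if (!bad) print "all values meet the given minimums"
    }
    END { exit bad }'
}
```

The function returns non-zero when any value is below its minimum, so it can be used in a pre-check script.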

If you see that EMM_DATA.db is excessively large (multiple GB or more), then you will need to perform the rebuild/reorganize steps referenced by mph999. If you end up having to use rebuild/reorganize, you may want to run it multiple times. If a rebuild/reorganize run doesn't complete processing, *.dbR files will be left over. If that happens, you will want to run another rebuild/reorganize.
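
A small helper along these lines could list leftover *.dbR files and flag large *.db files in one pass. It is just a sketch: the function name is made up, and the default threshold of 1024 MB is an arbitrary example, not a vendor figure.

```shell
#!/bin/sh
# List leftover *.dbR files and flag *.db files at or above a size threshold.
# Usage: scan_db_dir DIR [LIMIT_MB]   (LIMIT_MB defaults to 1024, an arbitrary example)

scan_db_dir() {
    dir=$1
    limit_mb=${2:-1024}
    for f in "$dir"/*.dbR; do
        if [ -e "$f" ]; then
            echo "LEFTOVER $(basename "$f")"
        fi
    done
    for f in "$dir"/*.db; do
        [ -e "$f" ] || continue
        # Size in whole megabytes, rounded down.
        size_mb=$(( $(wc -c < "$f") / 1048576 ))
        if [ "$size_mb" -ge "$limit_mb" ]; then
            echo "LARGE $(basename "$f") ${size_mb}MB"
        fi
    done
    return 0
}
```

For example, `scan_db_dir /usr/openv/db/data 1024` would report any .db file of 1 GB or more together with any .dbR files still present in that directory.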

One last thing I would recommend looking at is pack.summary from both sites. In NetBackup 7.0.1 there were some issues regarding EMM growth and deadlocks that required EEBs to fix. The NetBackup Late Breaking News article may be of use in identifying specific failure points.

mrmadej

Level 4

12 years ago

 ll /opt/netbck/db/data                                                                
total 7265632
-rw------- 1 root root   26218496 May  9  2011 DARS_DATA.db
-r-------- 1 root root     135168 May  8  2012 DARS_DATA.dbR
-rw------- 1 root root   26218496 May  9  2011 DARS_INDEX.db
-r-------- 1 root root      36864 May  8  2012 DARS_INDEX.dbR
-rw------- 1 root root  314798080 Mar  1 11:18 DBM_DATA.db
-r-------- 1 root root  216436736 May  8  2012 DBM_DATA.dbR
-rw------- 1 root root   35803136 Mar  1 11:18 DBM_INDEX.db
-r-------- 1 root root   16977920 May  8  2012 DBM_INDEX.dbR
-rw------- 1 root root 4837781504 Mar  1 11:18 EMM_DATA.db
-r-------- 1 root root   97587200 May  8  2012 EMM_DATA.dbR
-rw------- 1 root root   26218496 Mar  1 11:18 EMM_INDEX.db
-r-------- 1 root root     724992 May  8  2012 EMM_INDEX.dbR
-rw------- 1 root root   36724736 Mar  1 11:16 NBDB.db
-r--r--r-- 1 root root   39100416 May  8  2012 NBDB.dbR
-rw------- 1 root root 1752170496 Mar  1 11:18 NBDB.log
-r-------- 1 root root     327680 May  8  2012 NBDB.logR
-rw------- 1 root root        460 Mar  1 00:01 vxdbms.conf

Linked to the shared volume.

Regards

Madej

mrmadej
Level 4
12 years ago
I have found that the log in the nbdb directory is old and no new one is being created.

Other logs are in place.

The volume for logs is not in the cluster configuration. It is standalone and is mounted administratively. The directory structure is correct and was created by the mklogdir script.

I have to analyze the nbsu logs to find any differences in configuration between the primary and secondary server.
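
One way to spot other log directories that have stopped receiving new files, like the nbdb one mentioned above, is to look for subdirectories with no recently modified file. This is a sketch only: the function name is made up, the 60-minute default is arbitrary, and it relies on `find -mmin`, which GNU find on Linux provides.

```shell
#!/bin/sh
# Print each log subdirectory that has no file modified in the last N minutes.
# Usage: stale_log_dirs LOG_ROOT [MINUTES]   (MINUTES defaults to 60, an arbitrary example)

stale_log_dirs() {
    root=$1
    minutes=${2:-60}
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        # First file (if any) changed within the window; empty means stale.
        recent=$(find "$d" -type f -mmin "-$minutes" | head -n 1)
        if [ -z "$recent" ]; then
            echo "STALE ${d%/}"
        fi
    done
}
```

Pass the root of the log tree as the first argument. An empty directory is also reported as stale, which may or may not be interesting depending on which daemons are expected to log.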

Regards

Madej

mrmadej

Level 4

12 years ago

Database is not very big (without images):

du -sk /opt/netbck/db/data/*
25608   /opt/netbck/db/data/DARS_DATA.db
136     /opt/netbck/db/data/DARS_DATA.dbR
25608   /opt/netbck/db/data/DARS_INDEX.db
40      /opt/netbck/db/data/DARS_INDEX.dbR
307432  /opt/netbck/db/data/DBM_DATA.db
211376  /opt/netbck/db/data/DBM_DATA.dbR
34968   /opt/netbck/db/data/DBM_INDEX.db
16584   /opt/netbck/db/data/DBM_INDEX.dbR
4724408 /opt/netbck/db/data/EMM_DATA.db
95304   /opt/netbck/db/data/EMM_DATA.dbR
25608   /opt/netbck/db/data/EMM_INDEX.db
712     /opt/netbck/db/data/EMM_INDEX.dbR
46912   /opt/netbck/db/data/NBDB.db
38184   /opt/netbck/db/data/NBDB.dbR
1728808 /opt/netbck/db/data/NBDB.log
320     /opt/netbck/db/data/NBDB.logR
8       /opt/netbck/db/data/vxdbms.conf

Semaphores are set the same as on the primary site:

sysctl -a | grep kernel.sem
kernel.sem = 250        32000   32      128

Pack summary looks as follows:

# DO NOT EDIT THIS FILE !
# * means installed patch was preceded by this patch.
# + means that the installed patch installed this patch as a dependency.
NB_CLT_7.0.1 installed. +NB_7.0.1 +NB_JAV_7.0.1
NB_7.0.1 installed. *NB_CLT_7.0.1
NB_JAV_7.0.1 installed. *NB_CLT_7.0.1
EEB_NetBackup_7.0.1_PET2140767_SET2140749_EEB2
EEB_NetBackup_7.0.1_PET2201350_SET2200966_EEB2

It is identical to the primary site.

I have to shut down NBU to rebuild/reorganize the database. I need to discuss this with the administrator.

Regards

Madej
