OpsCenter 7.5.0.6 database maxes out

MilanPalian-A · ‎09-26-2013

The OpsCenter database (dbsrv11.exe) maxes out the CPU. I have allocated 6 cores just to the database, and they are running 100% continuously.

I have done all the obvious stuff: defrag disk, defrag database, increase cache (32GB), add processors. The only thing that quiets the database is disabling data collection for 4 of my 7 appliances; the other masters do not seem to affect performance as much.

I already opened a case, but support seems clueless about the database. Has anyone had this problem on 7.5.0.6?

Dave_High · ‎09-26-2013

Hi Milan,

How large is your database and how many jobs per day are you pulling into OpsCenter? 7.5.0.6 had some DB and performance enhancements that even our very large customers are liking. I am worried that six cores simply isn't enough. Is this running on a VM? That is not really something we recommend, as it requires a lot more resources just to do a basic job. The fact that performance improves when you disable some of your appliances tells me you may be overloading the system.

Thanks,

dave

MilanPalian-A · ‎09-27-2013

Hello Dave,

It is a standalone system, octacore, 48GB, dedicated to OpsCenter, no VM, with a 30GB database and 25,000 jobs per day.

Thanks for your input, that is new info for me, as I was under the impression that we had a decently configured machine for OpsCenter purposes. We are going to scale up soon, so I am hoping this will help. However, what sort of system would we need if we added, say, a hundred appliances? It is interesting that the other NBU Masters do not seem to have as much of an impact as the Appliances, but maybe that is due to the nature of the job mix that we have on the Appliances.

Milan

MilanPalian-A · ‎09-27-2013

Dave,

I should also add that it is not just overload from too many queries; the processors are running at 100% continuously. It looks like what you would expect from a long-running query or a full scan of the database.

Regards,

Milan

Dave_High · ‎09-27-2013

Ya, that is a monster system for only 25k jobs per day. Keep the support case open and force it into backline. All our other customers who have upgraded have been very pleased with the performance. This is the first I have heard of issues.

Let me ping my technical guys on the Appliance question. I know that if you have 100 Masters running 1000 jobs each, that isn't a big deal (100,000 jobs total), so I am not sure how an Appliance Master would change anything. Would all the appliances be Masters in your scenario?

Thanks,

dave

MilanPalian-A · ‎09-27-2013

Hello Dave,

Yes, thanks for the advice. I read about such a problem in 7.5.0.5, and we had an EEB to deal with it. However, that was folded into 7.5.0.6. I am also thinking that maybe all the upgrades we have done have left some garbage in the database that is causing this behaviour. It would be nice to do a clean install and try it out, but we need the history.

Have a nice weekend!

Milan

tom_sprouse · ‎09-27-2013

MilanPalian,

Curious as to the contents of your server.conf file for the cache values on the database, in addition to the scl.conf file settings.

If you can post them, great.

Also, check the purge-status.log and see how many jobs / media / slp / etc... are purging daily.

In addition, what are your current purge settings? (In WebUI, Settings / Configuration / Purge)

Cycle OpsCenter and review the bottom of the ./db/log/server.log - are there any performance warnings and what is the most recent fragment count on vxpmdb.db?

In addition, check Anti-Virus on the host and ensure A/V is not scanning the database on read / write operations. In fact, I would exclude the entire database directory and sub-folders:

./installpath/OpsCenter/Server/db/data/

There have been known issues with different A/V applications causing this behavior.

If you have any questions, please let us know.

--Tom
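A rough way to do the server.log check Tom describes is sketched below in Python. It only looks at the tail of the file and keeps lines that mention a keyword. The keywords ("warning", "fragment") and the 200-line window are assumptions, not something the thread or the product documentation specifies, so adjust them to whatever your log actually contains.

```python
from collections import deque

# Assumed search terms; the thread does not give the exact wording
# of the performance warnings written to server.log.
KEYWORDS = ("warning", "fragment")


def tail_matches(lines, keywords=KEYWORDS, last_n=200):
    """Return the lines among the last `last_n` that mention any keyword.

    Matching is case-insensitive; trailing newlines are stripped.
    """
    recent = deque(lines, maxlen=last_n)  # keeps only the final last_n lines
    return [
        line.rstrip("\n")
        for line in recent
        if any(word in line.lower() for word in keywords)
    ]


def scan_log(path, **kwargs):
    """Open a log file and apply tail_matches to it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tail_matches(handle, **kwargs)


# Example call (path as written in the post, relative to the install directory):
# for hit in scan_log("./db/log/server.log"):
#     print(hit)
```

Run it right after cycling OpsCenter so that the tail of the log reflects the most recent startup.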

MilanPalian-A · ‎10-08-2013

Hello Tom,

Thank you for your hints.

- I tried various cache sizes, exaggerating in both directions, without much effect. It is currently 32GB; memory usage for the whole system usually stabilizes around 26GB.

- scl.conf sort of grew on us organically, as issues developed and were fixed by support. I cannot review it, as I was told that scl.conf parameters are a great mystery that should not be revealed to the uninitiated:

nbu.scl.collector.serviceSyncPeriodInSecs=120
nbu.scl.collector.imageSyncPeriodInSecs=600
nbu.scl.collector.jobSyncPeriodInSecs=180
nbu.scl.collector.serviceThresholdEventCount=20

nbu.scl.collector.imageThresholdEventCount=1000
nbu.scl.collector.nonDoneJobSyncPeriodInSecs=3600

nbu.scl.agent.imageAgentTimeout=360000000000

- Purge config is:

Backup Job, Alert, Audit Trail: 420 days

SLP images: 90 days

the rest: 31 days

Time is 8:00, it is enabled, expired image purge disabled.

- Data is purged, as can be seen in reporting, but it is interesting that we completely lack a "purge-status.log" file. It appeared once upon a time, but seems to have gone underground again. Initially, we only experienced performance problems during the hour or two that it took for purge to complete, but this is no longer so.

- Performance warnings do arise due to fragmentation, so I defrag the database and the disk. This helps a little bit.

- Thanks for the A/V advice, I will try that.

Regards,

Milan
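For anyone comparing their own scl.conf against the values Milan posted, the small Python sketch below parses the key=value lines and converts the keys ending in "PeriodInSecs" to minutes. The helper names are made up for illustration. The unit of imageAgentTimeout is not stated in the thread, so it is parsed but deliberately left unconverted.

```python
# Values copied from the post above.
SCL_CONF = """\
nbu.scl.collector.serviceSyncPeriodInSecs=120
nbu.scl.collector.imageSyncPeriodInSecs=600
nbu.scl.collector.jobSyncPeriodInSecs=180
nbu.scl.collector.serviceThresholdEventCount=20
nbu.scl.collector.imageThresholdEventCount=1000
nbu.scl.collector.nonDoneJobSyncPeriodInSecs=3600
nbu.scl.agent.imageAgentTimeout=360000000000
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def sync_periods_minutes(props):
    """Map the short name of each '...PeriodInSecs' key to its value in minutes."""
    periods = {}
    for key, value in props.items():
        if key.endswith("PeriodInSecs"):
            short_name = key.rsplit(".", 1)[-1]
            periods[short_name] = int(value) / 60
    return periods


if __name__ == "__main__":
    for name, minutes in sync_periods_minutes(parse_properties(SCL_CONF)).items():
        print(f"{name}: every {minutes:g} min")
```

With the posted values this gives sync periods of 2, 10, 3 and 60 minutes for service, image, job and non-done-job collection respectively.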

MilanPalian-A · ‎10-09-2013

Hello Tom,

The A/V was indeed scanning the database, but disabling it did not resolve the issue.

Milan

Michael_G_Ander · ‎10-10-2013

Hi Milan

Have been told that http://www.symantec.com/business/support/index?page=content&id=TECH190207 can help on 7.5.0.6 performance too

Maybe there is a statistics option for appliances too

Regards

Michael

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

MilanPalian-A · ‎10-11-2013

Hi Michael,

Yes, that is good advice. I tried it previously but was unable to understand it in sufficient detail. I believe I should go back to analyzing the database stats.

Best,

Milan

MilanPalian-A · ‎10-17-2013

Just for closure ...

We ended up upgrading to OpsCenter 7.6 and significantly beefing up the hardware. The problem went away and surprisingly, CPU usage is now lower. Instead of many maxed out cores, there are only two and the rest of the load is spread around more thinly. It seems it was a case of resource starvation, causing an even bigger resource grab that OpsCenter could not handle gracefully.

OpsCenter 7.6 also has a newer version of SQL Anywhere (v12), which is also more responsive. I would recommend upgrading. There is an upgrade hitch that is documented in Technote TECH211070; read it before upgrading.

- Milan

D_Thomas · ‎01-24-2014

If you have this issue, just upgrade to 7.6

I had the same issue with 7.5.0.7 (after I had the problem with 7.5.0.6). I would just reboot the VM every other day to get around it, but the box would just consume any processors assigned to it for no reason that I could see.

I upgraded to 7.6.0.1 a few days ago, which is backwards compatible with the NetBackup servers running 7.5.0.6, and it's been great ever since.

MilanPalian-A · ‎01-27-2014

@DThomas,

Yes, we are also running 7.6 and it is much better, due to the newer database software.

However, we still had instances of cores stuck at 100% until someone from engineering who was participating in the WebEx manually regenerated the statistics on one of the database tables.
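The thread does not say which table was involved or which command engineering ran. As a sketch only: SQL Anywhere provides a CREATE STATISTICS statement that rebuilds optimizer statistics for a table, and the Python helper below just builds such statements as text so they can be reviewed before anyone runs them. The table and owner names in the example are hypothetical placeholders, and the helper names are invented for illustration. Running anything like this against the OpsCenter database is best done with support on the line, as it was here.

```python
def quote_ident(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_statistics_sql(tables, owner=None):
    """Build one CREATE STATISTICS statement per table name.

    Only text is produced; nothing is executed against a database.
    """
    statements = []
    for table in tables:
        target = quote_ident(table)
        if owner is not None:
            target = quote_ident(owner) + "." + target
        statements.append(f"CREATE STATISTICS {target};")
    return statements


if __name__ == "__main__":
    # "example_table" is a placeholder, not a real OpsCenter table name.
    for stmt in create_statistics_sql(["example_table"]):
        print(stmt)
```

Generating the statements as text first keeps the risky step (execution) separate and lets support confirm the table list.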