09-26-2013 12:30 AM
The OpsCenter database (dbsrv11.exe) maxes out the CPU. I have allocated 6 cores just to the database, and they are running 100% continuously.
I have done all the obvious stuff: defrag disk, defrag database, increase cache (32GB), add processors. The only thing that quiets the database is disabling data collection for 4 of my 7 appliances, the other masters do not seem to affect performance as much.
I already opened a case, but support seems clueless about the database. Has anyone had this problem on 7.5.0.6?
09-26-2013 07:26 AM
Hi Milan,
How large is your database and how many jobs per day are you pulling into OpsCenter? 7.5.0.6 had some DB and performance enhancements that even our very large customers are liking. I am worried that six cores simply isn't enough. Is this running on a VM? That is not really something we recommend, as it requires a lot more resources just to do a basic job. The fact that performance improves when you disable some of your appliances tells me you may be overloading the system.
Thanks,
dave
09-27-2013 12:09 AM
Hello Dave,
It is a standalone system, octa-core, 48GB, dedicated to OpsCenter, no VM, with a 30GB database and 25,000 jobs per day.
Thanks for your input, that is new info for me, as I was under the impression that we had a decently configured machine for OpsCenter purposes. We are going to scale up soon, so I am hoping this will help. However, what sort of system would we need if we added, say, a hundred appliances? It is interesting that the other NBU Masters do not seem to have as much impact as the Appliances, but maybe that is due to the nature of the job mix that we have on the Appliances.
Milan
09-27-2013 12:12 AM
Dave,
I should also add that it is not just overload from too many queries: the processors are running at 100% continuously. It looks like what you would expect from a long-running query or a full scan of the database.
Regards,
Milan
09-27-2013 07:12 AM
Ya, that is a monster system for only 25k jobs per day. Keep the support case open and force it into backline. All our other customers who have upgraded have been very pleased with the performance. This is the first I have heard of issues.
Let me ping my technical guys on the Appliance question. I know that if you have 100 Masters running 1000 jobs each, that isn't a big deal (100,000 jobs total), so I am not sure how an Appliance Master would change anything. Would all the appliances be Masters in your scenario?
Thanks,
dave
09-27-2013 07:36 AM
Hello Dave,
Yes, thanks for the advice. I read about such a problem in 7.5.0.5, and we had an EEB to deal with it. However, that was folded into 7.5.0.6. I am also thinking that maybe all the upgrades we have done have left so much garbage in the database that it is causing this behaviour. It would be nice to do a clean install and try it out, but we need the history.
Wish you a nice weekend !
Milan
09-27-2013 06:57 PM
MilanPallan,
Curious as to the contents of your server.conf file for the cache values on the database, in addition to the scl.conf file settings.
If you can post them great....
Also, check the purge-status.log and see how many jobs / media / slp / etc... are purging daily.
In addition, what are your current purge settings? (In WebUI, Settings / Configuration / Purge)
Cycle OpsCenter and review the bottom of the ./db/log/server.log - are there any performance warnings and what is the most recent fragment count on vxpmdb.db?
In addition, check the anti-virus on the host and ensure A/V is not scanning the database on read / write operations. In fact, I would exclude the entire database directory and sub-folders....
./installpath/OpsCenter/Server/db/data/
There have been known issues with different A/V applications causing this behavior.
If you have any questions, please let us know.
--Tom
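For anyone following Tom's checklist, here is a rough sketch of how the server.log check could be scripted. It is only an illustration: the keywords "performance warning" and "fragment" are my assumption about what the relevant lines contain, and the actual wording in server.log may differ, so adjust the pattern to what you see in your own log.

```python
import re
from typing import Iterable, List

# Assumed keywords; the real wording of server.log messages may differ.
PATTERN = re.compile(r"performance warning|fragment", re.IGNORECASE)


def find_warnings(lines: Iterable[str]) -> List[str]:
    """Return stripped lines that mention a performance warning or fragments."""
    return [line.strip() for line in lines if PATTERN.search(line)]
```

To use it, open the log (Tom's post refers to ./db/log/server.log under the OpsCenter server directory) and pass the file object to find_warnings(); the most recent entries are at the end of the returned list.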
10-08-2013 06:43 AM
Hello Tom,
Thank you for your hints.
- I tried various cache sizes, exaggerating in both directions, without much effect. It is currently 32GB, and memory usage for the whole system usually stabilizes around 26GB.
- scl.conf sort of grew on us organically, as issues developed and were fixed by support. I cannot review it, as I was told that scl.conf parameters are a great mystery that should not be revealed to the uninitiated:
nbu.scl.collector.serviceSyncPeriodInSecs=120
nbu.scl.collector.imageSyncPeriodInSecs=600
nbu.scl.collector.jobSyncPeriodInSecs=180
nbu.scl.collector.serviceThresholdEventCount=20
nbu.scl.collector.imageThresholdEventCount=1000
nbu.scl.collector.nonDoneJobSyncPeriodInSecs=3600
nbu.scl.agent.imageAgentTimeout=360000000000
- Purge config is:
Backup Job, Alert, Audit Trail: 420 days
SLP images: 90 days
the rest: 31 days
Time is 8:00, it is enabled, expired image purge disabled.
- Data is purged, as can be seen in reporting, but it is interesting that we completely lack a "purge-status.log" file. It appeared once upon a time, but seems to have gone underground again. Initially, we only experienced performance problems during the hour or two that it took for purge to complete, but this is no longer so.
- Performance warnings do arise due to fragmentation, so I defrag the database and the disk. This helps a little bit.
- Thanks for the A/V advice, I will try that.
Regards,
Milan
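To make the scl.conf values posted above easier to read, here is a minimal sketch that parses the key=value lines and converts the sync periods to minutes. The values are copied from the post; treating "#" lines as comments is my assumption, and the unit of imageAgentTimeout is not stated in the thread, so it is left unconverted.

```python
from typing import Dict

# Values copied verbatim from the post above.
SCL_CONF = """\
nbu.scl.collector.serviceSyncPeriodInSecs=120
nbu.scl.collector.imageSyncPeriodInSecs=600
nbu.scl.collector.jobSyncPeriodInSecs=180
nbu.scl.collector.serviceThresholdEventCount=20
nbu.scl.collector.imageThresholdEventCount=1000
nbu.scl.collector.nonDoneJobSyncPeriodInSecs=3600
nbu.scl.agent.imageAgentTimeout=360000000000
"""


def parse_conf(text: str) -> Dict[str, int]:
    """Parse key=value lines into integers, skipping blank lines.

    Lines starting with '#' are skipped too (an assumption about the format).
    """
    settings: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def sync_periods_in_minutes(settings: Dict[str, int]) -> Dict[str, float]:
    """Convert every setting whose name ends in 'PeriodInSecs' to minutes."""
    return {k: v / 60 for k, v in settings.items() if k.endswith("PeriodInSecs")}


if __name__ == "__main__":
    for key, minutes in sync_periods_in_minutes(parse_conf(SCL_CONF)).items():
        print(f"{key}: {minutes:g} min")
```

Read this way, services sync every 2 minutes, jobs every 3 minutes, images every 10 minutes, and non-done jobs every 60 minutes.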
10-09-2013 01:36 AM
Hello Tom,
The A/V was indeed scanning the database, but disabling it did not resolve the issue.
Milan
10-10-2013 07:23 AM
Hi Milan
Have been told that http://www.symantec.com/business/support/index?page=content&id=TECH190207 can help on 7.5.0.6 performance too
Maybe there is a statistics option for appliances too
Regards
Michael
10-11-2013 12:31 AM
Hi Michael,
Yes, that is good advice. I have already tried it previously, but was unable to understand it in sufficient detail. I believe I should go back to analyzing the database stats.
Best,
Milan
10-17-2013 12:11 AM
Just for closure ...
We ended up upgrading to OpsCenter 7.6 and significantly beefing up the hardware. The problem went away and surprisingly, CPU usage is now lower. Instead of many maxed out cores, there are only two and the rest of the load is spread around more thinly. It seems it was a case of resource starvation, causing an even bigger resource grab that OpsCenter could not handle gracefully.
OpsCenter 7.6 also has a newer version of SQL Anywhere (v12), which is also more responsive. I would recommend upgrading. There is an upgrade hitch that is documented in Technote TECH211070; read it before upgrading.
- Milan
01-24-2014 07:09 AM
If you have this issue, just upgrade to 7.6
I had the same issue with 7.5.0.7 (after I had the problem with 7.5.0.6). I would just reboot the VM every other day to get around it, but the box would just consume any processors assigned to it for no reason that I could see.
I upgraded to 7.6.0.1 a few days ago, which is backwards compatible with the NetBackup servers running 7.5.0.6, and it's been great ever since.
01-27-2014 12:50 AM
@DThomas,
Yes, we are also running 7.6 and it is much better, due to the newer database software.
However, we still had instances of cores stuck at 100% until someone from engineering who was participating in the WebEx manually regenerated the statistics on one of the database tables.
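The post does not say which table was involved or which command engineering ran, so the following is only a guess at what that step might look like. SQL Anywhere has a CREATE STATISTICS statement that rebuilds the optimizer's column statistics for a table; the helper below just builds that statement as text. The function name and the table name "example_table" are hypothetical placeholders, and I would check with support before running anything against the OpsCenter database.

```python
def create_statistics_sql(table: str) -> str:
    """Build a SQL Anywhere CREATE STATISTICS statement for one table.

    The table name is wrapped in double quotes, with embedded quotes doubled.
    """
    if not table:
        raise ValueError("table name must not be empty")
    quoted = '"' + table.replace('"', '""') + '"'
    return f"CREATE STATISTICS {quoted};"


if __name__ == "__main__":
    # "example_table" is a placeholder, not a real OpsCenter table name.
    print(create_statistics_sql("example_table"))
```

The resulting statement would then be run from whatever SQL client support recommends for the OpsCenter database.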