05-06-2014 03:52 AM
I have a 5220 ver 2.5.3 running NetBackup 7.5.0.6. I want to know if there is a way I can monitor my appliance to tell whether I am reaching its limits with regard to the number of policies run at one time. Some of my jobs fail, and I have been trying to run them at different times to find out if I have a bottleneck. But if there is a tool to monitor performance of my 5220 and tell me I'm running too hot, that would be great.
Pat
05-06-2014 04:16 AM
I believe that there are tools / health check wizards built into the latest version (2.6.1 / 7.6.0.1) but not in your version.
One of the most telling things on appliances (apart from actually having failures) is how the process queue is coping with things: the bigger it gets and the further behind it falls, the worse your performance will get. You can check it with:
/usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
/usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
As with any NetBackup installation, there is also tuning that can be done to optimise an appliance's usage (data buffers, WorkerThreads, etc.)
05-06-2014 04:47 AM
Does this output look normal? Not sure what a normal Queue size might be?
thanks
Pat
crcontrol --processqueueinfo
Busy : yes
Pending: no
crcontrol --queueinfo
total queue size : 4850922227
creation date of oldest tlog : Mon May 5 12:19:51 2014
05-06-2014 05:46 AM
It is pretty big, and the oldest tlog is from yesterday, but I don't know where you are (which time zone and what time it is there).
Two things to help me:
1. Give me the current time and date where you are so that I can work out our time difference from the time of your post
2. Can you show me the output of the following:
/opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -getbbustatus -a0
05-06-2014 07:44 AM
I am in the EST time zone. I put the output of the command below.
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 4064 mV
Current: 0 mA
Temperature: 32 C
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
Battery state:
GasGuageStatus:
Fully Discharged : No
Fully Charged : Yes
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Remaining Capacity Alarm: No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 99 %
Charger System State: 49168
Charger System Ctrl: 0
Charging current: 0 mA
Absolute state of charge: 102 %
Max Error: 2 %
05-06-2014 08:07 AM
OK - so when you ran the processqueue command it was about 7am?
And it says the queue processing was still running from midnight
That means it was still busy after 7 hours which is not great.
Your RAID Cache battery is fine so it is not that causing an issue and I assume you have no active alerts from the appliance (as a disk being down will also degrade performance)
Which means that it is just down to processing... perhaps you only have a short retention period for your images or just a lot of jobs running but it is worth running the queue processing manually a few times now and again to keep it running leanly.
Things get better at 2.6 / 7.6.0.1 but for now you may need to keep running it to get the size right down and then pop on every couple of days to run it manually too:
/usr/openv/pdde/pdcr/bin/crcontrol --processqueue
What does crcontrol --queueinfo look like at the moment?
Do you do accelerator backups? If so what is the WorkerThreads value in the /disk/etc/puredisk/contentrouter.cfg file? It should be 128 (default is 64)
Tell us about your system, retention periods used, how much free space your disk pool currently has etc. to give us a good view of your system
05-06-2014 09:50 AM
I do have jobs running and queued this morning.
What does running the crcontrol --processqueue command do? What do you mean by "running leanly"?
I ran /usr/openv/pdde/pdcr/bin/crcontrol --processqueue and it came back with "OK"
wmcnbuel01:/home/maintenance # /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
total queue size : 7912802257
creation date of oldest tlog : Mon May 5 12:19:51 2014
I am running some (10) Accelerator backups
Can you tell me what folder the /disk/etc/puredisk/contentrouter.cfg file is under?
I have two 5220s, one at my datacenter and one at my DR site. They are doing AIR (Auto Image Replication).
I have 190 policies
490 schedules
307 clients (80% VMware.) (90% Windows)
10 Exchange Servers
50 SQL Servers
05-07-2014 01:42 AM
Running the --processqueue command sets a queue process into operation.
By default this only runs at midnight and midday, and it has a limit on how much it actually processes. (The queue holds all of the internal de-duplication processes and transactions, and the queue process commits them.)
On a busy system it can get behind so the queue just gets bigger and bigger which slows the system down.
Keeping it trimmed down to a small size will help the system run well
When it said "OK", it had either started a run or added one to its process queue.
The more times it gets run, the smaller the queue gets. Usually you have one running and one queued, and once both have finished you check how it is doing. The size is in bytes (or is it KB?), so it is nice to have a queue of no more than 5 digits.
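The "no more than 5 digits" rule of thumb can be checked with a few lines of shell. This is only a sketch: the function names queue_size and queue_is_lean are made up for illustration, and the parsing assumes the output keeps the "total queue size : <number>" shape shown earlier in this thread.

```shell
#!/bin/sh
# Sketch only: parses text shaped like the `crcontrol --queueinfo` output
# posted in this thread. Function names are illustrative, not part of NetBackup.

# Print the numeric value of the "total queue size" line read from stdin.
queue_size() {
    awk -F':' '/total queue size/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Succeed (exit 0) when the queue size read from stdin has at most
# max_digits digits. max_digits defaults to 5, the rule of thumb above.
queue_is_lean() {
    max_digits="${1:-5}"
    size="$(queue_size)"
    [ -n "$size" ] && [ "${#size}" -le "$max_digits" ]
}
```

On the appliance this could be used as `/usr/openv/pdde/pdcr/bin/crcontrol --queueinfo | queue_is_lean && echo "queue is lean"`.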
Yours looks to be struggling: when you first showed the command output, the oldest tlog was Mon May 5 12:19:51 2014, and 5 hours later it had not got any newer, so it was still processing.
Once that run finished, it would start the next one due, or the one you fired off manually. It should have done the midday and midnight ones since then too, so let's see what your queue looks like this morning, whether it is catching up, and whether it has any waiting (so run --processqueueinfo and --queueinfo)
The /disk/etc/puredisk/contentrouter.cfg is exactly where it says it is ... /disk/etc/puredisk/contentrouter.cfg
If you are doing accelerator backups you really should use 128 for the WorkerThreads value.
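A quick way to read the current setting without opening the file is sketched below. It assumes the file holds a line of the form "WorkerThreads=<number>" (spaces around the "=" are tolerated); the function names worker_threads and worker_threads_ok are illustrative only.

```shell
#!/bin/sh
# Sketch only: assumes contentrouter.cfg contains a "WorkerThreads=<number>" line.
# Function names are illustrative, not part of NetBackup.

# Print the WorkerThreads value from the given file
# (defaults to the path mentioned in this thread).
worker_threads() {
    cfg="${1:-/disk/etc/puredisk/contentrouter.cfg}"
    sed -n 's/^[[:space:]]*WorkerThreads[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$cfg" | head -n 1
}

# Succeed when the configured value is at least the wanted one (128 by default).
worker_threads_ok() {
    want="${2:-128}"
    have="$(worker_threads "$1")"
    [ -n "$have" ] && [ "$have" -ge "$want" ]
}
```

Running `worker_threads` with no argument prints the value from the default path, and `worker_threads_ok || echo "WorkerThreads is below 128"` flags a low setting.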
It looks like you are doing a lot to one appliance - how is the disk space doing? Show us the output of:
/usr/openv/pdde/pdcr/bin/crcontrol --dsstat
and also tell us what the sizes show under the disk pool section of the admin console
Once a de-dupe pool gets above 80% full, performance starts to degrade.
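The 80% threshold can be checked from the --dsstat output with a small filter. This is a sketch under one assumption: that the output contains a summary line whose last field is a whole-number percentage (the Use% column). The function names pool_use_percent and pool_is_too_full are illustrative.

```shell
#!/bin/sh
# Sketch only: reads `crcontrol --dsstat` style output on stdin.
# Function names are illustrative, not part of NetBackup.

# Print the Use% figure without the % sign. Takes the first line whose
# last field is a whole-number percentage, so "4.2%" style lines are skipped.
pool_use_percent() {
    awk '$NF ~ /^[0-9]+%$/ { sub(/%$/, "", $NF); print $NF; exit }'
}

# Succeed when usage is above the limit (80 by default).
pool_is_too_full() {
    limit="${1:-80}"
    pct="$(pool_use_percent)"
    [ -n "$pct" ] && [ "$pct" -gt "$limit" ]
}
```

For example, `/usr/openv/pdde/pdcr/bin/crcontrol --dsstat | pool_is_too_full && echo "pool above 80%"` prints a warning only when the pool has crossed the threshold.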
05-07-2014 04:03 AM
Thank you for the responses. I've responded below.
Pat
Found WorkerThreads in contentrouter.cfg; it is set to 64.
Output of /usr/openv/pdde/pdcr/bin/crcontrol --dsstat:
************ Data Store statistics ************
Data storage    Raw     Size    Used    Avail   Use%
                60.0T   57.5T   37.3T   20.2T   65%
Number of containers : 245760
Average container size : 185092929 bytes (176.52MB)
Space allocated for containers : 45488438260640 bytes (41.37TB)
Space used within containers : 44200272176182 bytes (40.20TB)
Space available within containers: 1288166084458 bytes (1.17TB)
Space needs compaction : 6296580493192 bytes (5.73TB)
Reserved space : 2749427482624 bytes (2.50TB)
Reserved space percentage : 4.2%
Records marked for compaction : 166007501
Active records : 620009877
Total records : 786017378
Disk Pool
Raw Size 57.5
Used 37.3
Available 20.2
05-07-2014 04:09 AM
OK thanks...
Used space looks good
A fair bit is ready for compaction, but it should do that automatically once a month. If you have short retention periods, you may need to do it yourself more regularly.
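To put a number on "a fair bit", the share of allocated container space that is waiting for compaction can be worked out from the two byte counts in the --dsstat output. Treating this ratio as an indicator is an assumption made here for illustration, and compaction_share is a made-up name.

```shell
#!/bin/sh
# Sketch only: reads `crcontrol --dsstat` style output on stdin and prints
# "Space needs compaction" as a percentage of "Space allocated for containers".
# The function name is illustrative, not part of NetBackup.
compaction_share() {
    awk -F':' '
        /Space allocated for containers/ { split($2, a, " "); allocated = a[1] }
        /Space needs compaction/         { split($2, b, " "); needs = b[1] }
        END { if (allocated > 0) printf "%.1f\n", 100 * needs / allocated }
    '
}
```

With the figures posted above (6296580493192 bytes needing compaction out of 45488438260640 bytes allocated), this prints 13.8.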
You should increase the WorkerThreads value. It is usually best to reboot after making the change, so choose a quiet period; your large queue means that the reboot may take a while. What is the queue at the moment? (--processqueueinfo / --queueinfo)
05-07-2014 05:22 AM
My retention periods range from 1 month to 6 months for disk.
processqueue = ok
wmcnbuel01:/usr/openv/pdde/pdcr/bin # ./crcontrol --queueinfo
total queue size : 19302538730
creation date of oldest tlog : Wed May 7 00:20:10 2014
05-07-2014 05:26 AM
The queue is still really big. I guess that, as you have fired another off, it is running at the moment, but do show what --processqueueinfo reports to make sure one is running and one is queued.
You need to just keep on top of this: if one is running but one is not queued, use --processqueue to fire another off. Keep doing that until the queue comes down.
Are you sure all of your disks are OK? Based on the size, it doesn't look like it is running very fast.
Maybe it needs the contentrouter.cfg change and a reboot to speed things up anyway?
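The "fire another off when nothing is queued" routine can be sketched as a small script. It assumes --processqueueinfo keeps printing "Busy" and "Pending" lines with yes/no values, as in the output posted earlier; no_run_pending and top_up_queue are made-up names, and only the two crcontrol options already used in this thread are called.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of NetBackup.

# Read `crcontrol --processqueueinfo` output on stdin and succeed only
# when it reports "Pending: no". Missing or unexpected output fails,
# so nothing is fired off when the state is unknown.
no_run_pending() {
    awk -F':' '
        /Pending/ { gsub(/[ \t\r]/, "", $2); pending = tolower($2) }
        END { if (pending == "no") exit 0; exit 1 }
    '
}

# Queue one more processing run if none is waiting.
top_up_queue() {
    crcontrol=/usr/openv/pdde/pdcr/bin/crcontrol
    if "$crcontrol" --processqueueinfo | no_run_pending; then
        "$crcontrol" --processqueue
    fi
}
```

Calling `top_up_queue` by hand every so often does the same thing as checking --processqueueinfo and running --processqueue yourself; whether to schedule it is a judgment call for your environment.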
05-07-2014 05:37 AM
Thanks. It has been my thought that I have gradually overloaded this box. I am in the process of setting up another to offload some of the jobs. I appreciate all the help.
05-07-2014 05:41 AM
Not a problem at all. I would do the contentrouter.cfg change, reboot, and then keep firing off the processqueue until you see the queue come down to 4 or 5 digits. Then just keep an eye on it every few days.
If you keep it trimmed down it will run faster. Also double-check its status in the web GUI to ensure it is not reporting any issues; a disk down or power supply down can have an adverse effect.