03-11-2013 07:37 PM
My SLP processing between two MSDP units has gone slow and I cannot find a cause.
It seemed to start the weekend before last; I noticed my priority SLP processing was still running on Sunday when normally it completes Saturday evening. Since then I've struggled to get SLP processing completed in a timely manner. Even the twice-daily tlog processing seems slow.
Both servers look normal from a CPU, IO and memory perspective. There appear to be no hardware errors that I can see, and network paths look good. I have checked for anti-virus and disk indexing and ensured both are disabled.
I have looked at memory usage; I see in both spoold logs "Current Memory usage exceeds the Info threshold value (current value: 73, allowed value: 70)." I also see on the target node entries in storaged.log "WARNING [0000000006385AF0]: -1: Can not preallocate 2372 cache nodes, will try to replace existing cache nodes."
Even "cancelling" an SLP duplication job takes time (read hours!).
Symantec are looking at this now but I thought I'd throw it out to the community.
Any assistance greatly appreciated.
Solved! Go to Solution.
03-11-2013 07:42 PM
Sorry, I should also have said:
Master Server is RHEL v6.3, NBU v7.1.0.4
MSDP Servers are Windows 2008 (R2 or SP1), NBU v7.1.0.4
03-11-2013 11:48 PM
You should check for Optimized duplication. If NetBackup is not using Optimized duplication then it will take more time to complete the duplication job.
03-11-2013 11:55 PM
That was my first thought too, but yes, all SLP jobs contain the magic wording:
03/12/2013 12:30:15 - Info Duplicate (pid=20027) Initiating optimized duplication from @aaaaa to @aaaab
Based on previous experience, network utilisation would be much higher if it wasn't using optimised dedupe.
03-12-2013 04:09 AM
I can take a look at the logs .... if you have enabled them.
Master server: admin, job details for a duplication job
Media server: bptm, bpdm
03-12-2013 08:14 PM
Have logs and would love to post them but ... there's too much site info in them for me to put on a public forum.
What I can post is some job info. As you can see, between 09:10:12 and 13:14:17 it's doing whatever optimised dedupe does, just over 4 hours for 15GB of data. The admin log shows lots of heartbeats received.
03/13/2013 09:09:55 - ended process 0 (pid=13893)
03/13/2013 09:09:56 - Info bpdm (pid=1512) started
03/13/2013 09:09:56 - started process bpdm (pid=1512)
03/13/2013 09:09:56 - Info bpdm (pid=1512) requesting nbjm for media
03/13/2013 09:10:11 - begin writing
03/13/2013 09:10:12 - end writing; write time: 0:00:01
03/13/2013 13:14:17 - Info bpdm (pid=1512) EXITING with status 0
03/13/2013 13:14:17 - Info targetmediasrv (pid=1512) StorageServer=PureDisk:srcmediasrv; Report=PDDO Stats for (srcmediasrv): scanned: 15832037 KB, stream rate: 1.34 MB/sec, CR sent: 2677565 KB, dedup: 83.1%
03/13/2013 13:14:17 - end Duplicate; elapsed time 4:04:23
My bpdm logs for pid 1512 shows:
09:10:23.951 [1512.772] <2> construct_sts_isid: master_server mastersrv, client CLIENT123, backup_time 1362914108, copy_number 2, stream_number 563, fragment_number 0, resume_number 1, spl_name NULL
13:14:13.666 [1512.772] <2> send_end_repl_marker: bp_sts_copy_extent returned 0:
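Worth noting: the dedup percentage in that PDDO stats line is consistent with 1 - (CR sent / scanned), which is a quick way to sanity-check that optimised duplication really is deduplicating. A small sketch, with the figures copied from the log line above:

```shell
scanned_kb=15832037   # "scanned: 15832037 KB" from the PDDO stats line
cr_sent_kb=2677565    # "CR sent: 2677565 KB", data actually shipped
dedup=$(awk -v s="$scanned_kb" -v c="$cr_sent_kb" \
    'BEGIN { printf "%.1f", (1 - c / s) * 100 }')
echo "computed dedup rate: ${dedup}%"
```

This reproduces the reported 83.1%, so the job is deduplicating; the problem is the rate at which it does so, not whether it is working.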
03-12-2013 10:12 PM
For logs --- no problem.
You can perform a small test.
Create a new backup policy and SLP, and then note the duplication speed.
Reason being: there is a performance issue with MSDP when the mini catalog contains a lot of fragment files for a single policy.
Example from my machine: "pwd" shows the client name, and "ls -al" shows all the policy names which send backups to MSDP for this client. If there are a lot of fragment files under a policy folder it can degrade performance.
The same is true for a Windows MSDP storage: just replace "/" with "\" and change "/storage" to your actual storage path (e.g. "e:\storage"); the rest of the path stays the same.
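A hypothetical sketch of that check: count catalog files per client/policy to spot a policy with an unusually large mini catalog. The layout `<storage>/databases/catalog/2/<client>/<policy>` is an assumption for this 7.x version; adjust it to whatever "pwd" showed on your own system.

```shell
# List the ten policy folders holding the most mini-catalog files.
# The catalog path layout is an assumption; pass your storage path.
top_policies() {
    storage=${1:-/storage}
    for d in "$storage"/databases/catalog/2/*/*/; do
        [ -d "$d" ] || continue
        # one line per policy folder: "<file count> <path>"
        printf '%s %s\n' "$(ls "$d" | wc -l)" "$d"
    done | sort -rn | head
}
```

For example, `top_policies /storage` prints the busiest policy folders first; on Windows, do the equivalent directory count under your storage path (e.g. e:\storage).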
03-13-2013 10:27 PM
Sorry I'm a bit confused; what do you mean by fragment files in this context? Your example above doesn't seem to show many files at all.
My issue is around an unexplained drop in SLP performance although it doesn't seem to be related to all clients and dedupe processing. Even my tlog queue processing seems to be slow (i.e. not completing in a few hours).
I am narrowing in on maybe a lack of physical memory. Using W2008 resource monitor I see spoold process hard faulting on the source MSDP server although only at < 50/s. I see no hard faults on the destination.
Just wondering what others see?
03-19-2013 07:02 PM
After much investigation it would appear this was caused by my destination MSDP Windows pagefile being too small. It hadn't changed, but it was only 16GB, whereas I had 24GB RAM and my spoold process was using 13-14GB of memory. It looks like I had just drifted over some threshold which, despite no hard faulting going on, was resulting in slowness on the MSDP destination server.
I spotted it by seeing pagefile IO during NetBackup spoold startup.
I had plenty of space so was able to simply increase the pagefile space and so far, problem solved.
03-24-2013 04:58 PM
...and unfortunately that wasn't the problem.
My SLP processing is still slow, and it looks to be related to MSDP write performance.
03-25-2013 06:38 AM
If you are suffering performance problems and are limited on memory, I would make sure that the system is nice and neat, as it will improve performance.
First take a look at your queue processing to see how far behind it is:
crcontrol --queueinfo
If the number looks very large (in the millions) or the last processed date is a long time ago, you will be getting impacted. Then see if queue processing is running (crcontrol --processqueueinfo).
If it is not running then fire it off (crcontrol --processqueue)
Do this two or three times; keep doing it until the queue comes right down.
If you still have issues you may want to compact things (crcontrol -v -m +1,+2).
Finally for now, check that the system has rebasing turned on; this is effectively dedupe defrag to make things run quicker (crcollect --rebasestate).
If it is off then turn it on (crcollect --rebaseon) and run queue processing another couple of times, then wait a day for the rebasing to run.
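The drain-the-queue sequence above can be sketched as a small wrapper. Assumptions: crcontrol lives at the usual Linux/appliance path (override CRCONTROL for a Windows MSDP server), and --processqueueinfo reports a "Busy : yes/no" line on this version.

```shell
# Assumed default location; point CRCONTROL elsewhere on Windows MSDP.
CRCONTROL=${CRCONTROL:-/usr/openv/pdde/pdcr/bin/crcontrol}

drain_queue() {
    for pass in 1 2 3; do                  # "do this two or three times"
        echo "pass $pass: starting queue processing"
        "$CRCONTROL" --processqueue
        # wait for this pass to finish before kicking off the next
        while "$CRCONTROL" --processqueueinfo | grep -qi 'busy.*yes'; do
            sleep 60
        done
    done
    "$CRCONTROL" --queueinfo               # confirm the backlog shrank
}
```

This is only a convenience around the manual steps; watching --queueinfo between passes tells you whether the backlog is actually coming down.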
Hope this helps
03-25-2013 07:42 PM
Have checked all the above; queue processing is happening, albeit about 24 hours behind. Rebasing is on. The sched_GarbageCollection.log file seems to show a weekly start of the compaction.
Another piece of the puzzle is the time it takes even to cancel an active duplication; right-click and cancel from the Java GUI and it can take tens of minutes to hours before the job aborts.
I have located some extra memory which I will install in the morning and see if that improves things.
03-26-2013 05:18 PM
I upgraded the server memory from 24GB to 32GB this morning, and now my spoold process is using 2GB more memory than before. My IO profile while SLP processes are running is completely different: significantly lower read IO, and the disk queue on the MSDP unit is very low.
I would suggest that, despite the Symantec recommendation of 1GB RAM per 1TB of MSDP space, the content router's max cache size of 51% of memory is also a factor (this would equate to 13GB or thereabouts). It seems that even though I had spare memory, spoold was constrained by this cache parameter and not using it, and was thus resorting to disk IO to service requests.
I'll know more over the weekend, when my slow SLP processing is more noticeable.
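The arithmetic behind that 51% figure, under the assumption that the content-router cache ceiling (MaxCacheSize in contentrouter.cfg) defaults to 51% of RAM on this version:

```shell
# Content-router cache ceiling for a given amount of RAM; the 51%
# default is the poster's figure, so check contentrouter.cfg to confirm.
cr_cache_gb() { awk -v r="$1" 'BEGIN { printf "%.2f", r * 0.51 }'; }
echo "24GB RAM -> ~$(cr_cache_gb 24)GB cache ceiling"
echo "32GB RAM -> ~$(cr_cache_gb 32)GB cache ceiling"
```

At 24GB of RAM the ceiling works out to roughly 12.2GB, matching the "13GB or thereabouts" and the observed 13-14GB spoold footprint; the upgrade to 32GB raises it to about 16.3GB, which fits the ~2GB growth seen after the upgrade.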