cancel
Showing results for 
Search instead for 
Did you mean: 

SLP processing gone super slow

thesanman
Level 6

My SLP processing between two MSDP units has gone slow and I cannot find a cause.

It seemed to start the weekend before last; I noticed my priority SLP processing was still running on Sunday when normally it completes Saturday evening.  Since then I've struggled to get SLP processing completed in a timely manner.  Even the twice daily tlog processing seems slow.

Both servers look normal from a CPU, IO and memory perspective.  There appears to be no hardware errors that I can see and network paths look good.  I have checked (and ensured disabled) any form of anti-virus and disk indexing.

I have looked at memory usage; I see in both spoold logs "Current Memory usage exceeds the Info threshold value (current value: 73, allowed value: 70)."  I also see on the target node entries in storaged.log "WARNING [0000000006385AF0]: -1: Can not preallocate 2372 cache nodes, will try to replace existing cache nodes."

Even "cancelling" an SLP duplication job takes time (read hours!).

Symantec are looking at this now but I thought I'd throw it out to the community.

Any assistance greatly appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions

thesanman
Level 6

I have upgraded the server memory from 24Gb to 32GB this morning and now my spoold process is using 2GB memory more than before and my IO profile whilst SLP processes are running is completely different; significantly lower read IO and disk queue on MSDP unit very low.

I would suggest that despite the Symantec recommendation of 1GB RAM per 1TB MSDP space the max cache size of 51% memory usage for content router is also a factor (this would equate to 13GB or thereabouts) and it would seem like even though I had spare memory spoold was constrained by this cache parameter and not using it and thus was resorting to disk IO to service requests.

I'll know more over the weekend where my slow SLP processing is more noticable.

View solution in original post

14 REPLIES 14

thesanman
Level 6

Sorry should also have said.

Master Server is RHEL v6.3, NBU v7.1.0.4

MSDP Servers are Windows 2008 (R2 or SP1), NBU v7.1.0.4

Anshu_Pathak
Level 5

You should check for Optimized duplication. If NetBackup is not using Optimized duplication then it will take more time to complete the duplication job.

thesanman
Level 6

My first thought but yes; all SLP jobs containing the magic wording:

03/12/2013 12:30:15 - Info Duplicate (pid=20027) Initiating optimized duplication from @aaaaa to @aaaab

Based on previous experience; network utilisation would be much hgher if if wasn't using optmised dedupe.

Anshu_Pathak
Level 5

I can take a look at the logs .... if you have enbaled them. 

Master server: admin, job deatils for a duplication job

Media server: bptm, bpdm

thesanman
Level 6

Have logs and would love to post them but ... there's too much site info in them for me to put on a public forum.

What I can post is some job info.  As you can see between 09:10:12 and 13:14:17 it's doing whatever optimised dedupe does for just over 4 hours for 15GB of data.  Admin logs shows lots of hearbeats received.

03/13/2013 09:09:55 - ended process 0 (pid=13893)
03/13/2013 09:09:56 - Info bpdm (pid=1512) started
03/13/2013 09:09:56 - started process bpdm (pid=1512)
03/13/2013 09:09:56 - Info bpdm (pid=1512) requesting nbjm for media
03/13/2013 09:10:11 - begin writing
03/13/2013 09:10:12 - end writing; write time: 0:00:01
03/13/2013 13:14:17 - Info bpdm (pid=1512) EXITING with status 0
03/13/2013 13:14:17 - Info targetmediasrv (pid=1512) StorageServer=PureDisk:srcmediasrv; Report=PDDO Stats for (srcmediasrv): scanned: 15832037 KB, stream rate: 1.34 MB/sec, CR sent: 2677565 KB, dedup: 83.1%
03/13/2013 13:14:17 - end Duplicate; elapsed time 4:04:23

My bpdm logs for pid 1512 shows:

09:10:23.951 [1512.772] <2> construct_sts_isid: master_server mastersrv, client CLIENT123, backup_time 1362914108, copy_number 2, stream_number 563, fragment_number 0, resume_number 1, spl_name NULL
13:14:13.666 [1512.772] <2> send_end_repl_marker: bp_sts_copy_extent returned 0:

Anshu_Pathak
Level 5

For logs --- no problem.

You can perform a small test.

create a new backup policy and SLP --- and then notice the duplication speed. 

Reason being: There is an performance issue with MSDP when mini catalog contains lot of fragment files for single policy.

Example -- from my machine: Shows "pwd" as client name and "ls -al" command shows all the policy name which send backup to MSDP for this client. If there are lot of fragment files under policy folder then it can degrade the performance

Same is true for a windows MSDP storage --- just replace "/" with "\" and change the "/storage" with your actual storage path -- example "e:\storage" rest of the path will remain same.

 

 

Client name:
[root@rhel6 rhel6]# pwd
/storage/databases/catalog/2/rhel6
 
Policy name:
[root@rhel6 rhel6]# ls -al
total 24
drwx------. 5 root root 4096 Mar 12 16:57 .
drwx------. 3 root root 4096 Mar  8 01:56 ..
drwx------. 2 root root 4096 Mar 12 17:09 Accelerator
drwx------. 2 root root 4096 Mar 11 18:57 catalog
drwx------. 2 root root 4096 Mar  8 13:48 slptest
 
Fragment Information:
[root@rhel6 rhel6]# cd Accelerator/
[root@rhel6 Accelerator]# ls -al
total 84
drwx------. 2 root root 4096 Mar 12 17:09 .
drwx------. 5 root root 4096 Mar 12 16:57 ..
-rw-r-----. 1 root root   76 Mar 12 16:57 __dirpo__
-rw-r-----. 1 root root  169 Mar 12 16:57 rhel6_1363087675_C1_F1.fmk
-rw-r-----. 1 root root  172 Mar 12 16:57 rhel6_1363087675_C1_F1.hdr
-rw-r-----. 1 root root  174 Mar 12 16:57 rhel6_1363087675_C1_F1.img
-rw-r-----. 1 root root  170 Mar 12 16:57 rhel6_1363087675_C1_F1.info
-rw-r-----. 1 root root  171 Mar 12 16:57 rhel6_1363087675_C1_F1.map
-rw-r-----. 1 root root  170 Mar 12 16:57 rhel6_1363087675_C1_HDR.img
-rw-r-----. 1 root root  170 Mar 12 16:57 rhel6_1363087675_C1_HDR.info
-rw-r-----. 1 root root  172 Mar 12 16:58 rhel6_1363087675_C1_TIR.img
-rw-r-----. 1 root root  171 Mar 12 16:58 rhel6_1363087675_C1_TIR.info
-rw-r-----. 1 root root  160 Mar 12 17:09 rhel6_1363088360_C1_F1.fmk
-rw-r-----. 1 root root  163 Mar 12 17:09 rhel6_1363088360_C1_F1.hdr
-rw-r-----. 1 root root  165 Mar 12 17:09 rhel6_1363088360_C1_F1.img
-rw-r-----. 1 root root  161 Mar 12 17:09 rhel6_1363088360_C1_F1.info
-rw-r-----. 1 root root  162 Mar 12 17:09 rhel6_1363088360_C1_F1.map
-rw-r-----. 1 root root  161 Mar 12 17:09 rhel6_1363088360_C1_HDR.img
-rw-r-----. 1 root root  161 Mar 12 17:09 rhel6_1363088360_C1_HDR.info
-rw-r-----. 1 root root  163 Mar 12 17:09 rhel6_1363088360_C1_TIR.img
-rw-r-----. 1 root root  162 Mar 12 17:09 rhel6_1363088360_C1_TIR.info
 
 
 

thesanman
Level 6

Sorry I'm a bit confused; what do you mean by fragment files in this context?  Your example above doesn't seem to show many files at all.

My issue is around an unexplained drop in SLP performance although it doesn't seem to be related to all clients and dedupe processing.  Even my tlog queue processing seems to be slow (i.e. not completing in a few hours).

I am narrowing in on maybe a lack of physical memory.  Using W2008 resource monitor I see spoold process hard faulting on the source MSDP server although only at < 50/s.  I see no hard faults on the destination.

Just wondering what others see?

thesanman
Level 6

After much investigation it would appear this was caused by my destination MSDP Windows pagefile being too small.  It hasn't changed but was only 16GB whereas I had 24GB RAM and my spool process was using 13-14GB or memory.  Looks like I had just drifted over some threshold which despite no hard faulting going on was resulting in slowness on the MSDP destination server.

I spotted it by seeing pagefile IO during NetBackup spoold startup. 

I had plenty of space so was able to simply increase the pagefile space and so far, problem solved.

 

thesanman
Level 6

and unfortunately that wasn't the problem  sad

My SLP processing still slow and looks to be MSDP write performance related.

Mark_Solutions
Level 6
Partner Accredited Certified

If you are suffering with performance and are limited on memory I would make sure that the system is nice and neat as it will improve performance

First take a look at you queue processing to see how far behind it is:
crcontrol --queueinfo

If the number looks very large (in the millions) or the last processed date is a long time agi you will be getting impacted - then see if it is running (crcontrol --processqueueinfo)
If it is not running then fire it off (crcontrol --processqueue)
Do this two or three times - keep doing it until the queue comes right down

If you still have issues you may want to compact things (crcontrol -v -m +1,+2)
Finally for now, check that the system has rebasing turned on - this is effectively de-dupe defrag to make things run quicker (crcollect --rebasestate).
If it is off then turn it on (crcollect --rebaseon) and run queue processing another couple of times - then wait for a day for it to run the rebasing

Hope this helps

thesanman
Level 6

Have checked all the above; queue processing is happening albeit about 24 hours behind.  Rebasing is on.  Looking at the sched_GarbageCollection.log file seems to show a weekly start of the compact.

Anothing piece to the puzzle is the time it takes to even cancel an active duplication; right click and cancel from the java gui and it can take 10s of minutes to hours before the job aborts.

I have located some extra memory which I will install in the morning and see if that improves things.

thesanman
Level 6

I have upgraded the server memory from 24Gb to 32GB this morning and now my spoold process is using 2GB memory more than before and my IO profile whilst SLP processes are running is completely different; significantly lower read IO and disk queue on MSDP unit very low.

I would suggest that despite the Symantec recommendation of 1GB RAM per 1TB MSDP space the max cache size of 51% memory usage for content router is also a factor (this would equate to 13GB or thereabouts) and it would seem like even though I had spare memory spoold was constrained by this cache parameter and not using it and thus was resorting to disk IO to service requests.

I'll know more over the weekend where my slow SLP processing is more noticable.

Mark_Solutions
Level 6
Partner Accredited Certified
I agree - I have never reccomended a customer follows those figures - having seen that the 40TB NetBackup Appliances come with 92GB RAM it speaks for itself - although they do expand to 64TB, but even that is 1.5GB per TB and is still 92GB after all! Glad it is looking better

thesanman
Level 6
Just wish it would show up in a log file somewhere; something indicating WARNING insufficient cache memory available! Symantec support had really only had once good session at investigating but lack of physical RAM never came into the conversation.