
Is your CR Queue processing taking days? 300mil+ tlogs? Rehydration slow? Read Me.

bartman10
Level 4

Hope you don't back up a lot of VMware machines with 7.0. If you do, get ready for 200-300 million t-logs that need to be processed daily and CR Queue processing that takes 3 days for one run to finish.

I don't think I have that large of a VM environment (3-4TB) and I really don't see how Symantec missed this issue for so long. I've spent months on the phone with tech support trying to figure out what is wrong with my PD environment. Finally I was put in touch with Sr. developers of PD and they knew the problem right off the bat.

Flash backups can cause the default 128KB dedupe chunk size to drop to 4KB. VMware API backups use Flash Backups, so my VM backups were causing 32x more t-log transactions than they should.
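To put the 32x figure in perspective, here is a rough sketch of the arithmetic. It assumes, purely for illustration, that every chunk produces one transaction and that the data set is 3TB (the low end of the 3-4TB mentioned above). The function names are made up for this example, and real t-log counts will depend on how much data is new on each run.

```python
KB = 1024
TB = 1024 ** 4


def chunk_count(data_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to cover data_bytes (ceiling division)."""
    return -(-data_bytes // chunk_bytes)


def blowup_factor(default_chunk_kb: int = 128, degraded_chunk_kb: int = 4) -> float:
    """How many times more chunks the smaller chunk size produces."""
    return default_chunk_kb / degraded_chunk_kb


if __name__ == "__main__":
    data = 3 * TB  # hypothetical data set size
    at_default = chunk_count(data, 128 * KB)
    at_degraded = chunk_count(data, 4 * KB)
    print(f"128KB chunks: {at_default:,}")
    print(f"4KB chunks:   {at_degraded:,}")
    print(f"blow-up:      {blowup_factor():.0f}x")
```

The ratio is the only number that comes from the post itself; the absolute counts are just what the assumed 3TB works out to.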

There is an EEB for 7.1.0.2 and a fix in 7.1.0.3 but it sounds like it's not fully fixed until 7.5.

They are also finally working to fix the rehydration speed issues! I only get about 7-25MB/s on 4 streams rehydrating to an Advanced Disk pool that then copies to tape. It takes FOREVER to get my weekly/monthly full images moved to tape for longer retention. I'm actually losing ground as I'm ingesting more data than can be duplicated out. I have over 800 jobs queued just for SLP duplications trying to go to tape!
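A quick way to see what "losing ground" means in numbers is to compare daily ingest against what the duplication rate can drain. This is only a sketch: the 3000GB/day ingest figure is hypothetical, and the 7-25MB/s range is treated as aggregate throughput, which is an assumption since it could also be per stream.

```python
def hours_to_duplicate(image_gb: float, mb_per_s: float) -> float:
    """Hours needed to move image_gb at a sustained rate of mb_per_s."""
    return image_gb * 1024 / mb_per_s / 3600


def backlog_growth_gb_per_day(ingest_gb_per_day: float, drain_mb_per_s: float) -> float:
    """Positive result means the duplication queue grows every day."""
    drained_gb_per_day = drain_mb_per_s * 86400 / 1024
    return ingest_gb_per_day - drained_gb_per_day


if __name__ == "__main__":
    for rate in (7, 25):
        print(f"1TB at {rate}MB/s: {hours_to_duplicate(1024, rate):.1f} hours")
    # Hypothetical ingest of 3000GB/day against the best-case 25MB/s drain rate.
    print(f"backlog growth: {backlog_growth_gb_per_day(3000, 25):.1f} GB/day")
```

If the second function returns a positive number, the SLP queue can never catch up no matter how long it runs.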

Again, I've worked with support on this for months.. my PD servers are monsters and NO bottlenecks can be found. CPU use is almost non existent, IOSTAT shows not much disk load, 17% FREE ram... and yet slow rehydration.

After working with the developers they also knew the issues right away.

1. Setting compression=1 in pd.conf on 7.0 media servers uses an older compression method that is limited to about 25MB/s per stream on ingestion. It also affects rehydration. 7.1 uses LZO and is hundreds of MB/s faster.

2. co-locating data in the PureDisk storage chunks. 7.1.0.3 will try to co-locate data so rehydration will not need as much random IO. They are saying 2-6x performance increase... I only hope this is enough...
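If you want to check whether a media server has the compression setting turned on, something like the following sketch could be used. It assumes pd.conf is a plain text file of `KEY = value` lines with `#` comments; the sample content is hypothetical, and the key match is case-insensitive because the exact spelling in your file may differ from the one written above.

```python
def parse_pd_conf(text: str) -> dict:
    """Parse KEY = value lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def compression_enabled(text: str) -> bool:
    """True when the file sets compression to 1."""
    return parse_pd_conf(text).get("compression") == "1"


if __name__ == "__main__":
    # Hypothetical file content, not copied from a real server.
    sample = "# client-side settings\nCOMPRESSION = 1\nLOGLEVEL = 0\n"
    print(compression_enabled(sample))
```

To use it against a real server, read the contents of your own pd.conf into a string and pass that to `compression_enabled`.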

Sounds like more is slated for PD 6.6.3 but no release date is set for that yet...

12 REPLIES

Lee_C
Level 5

Hi Bartman10,

Thanks for providing feedback that you received from support.  We too have poor rehydration performance and have also had challenges getting past the lower levels of support, so gave up and reverted our main backups to tape. 

Can you tell us how you got hold of  7.1.0.2, and which EEB for 7.1.0.2 did support provide?

Any feedback on the improvements would be appreciated.

 

Thanks.

thesanman
Level 6

I have given up doing my "Full" backups to my Dedupe pool in my v7.1.0.1 environment even with latest EEB; I write straight to tape.  The dedupe is just too slow.

Lee_C
Level 5

I can see 7.1.0.2 was released yesterday and I know 7.5 will be a future release, so I'm curious about the benefits regarding 7.1.0.3 mentioned in the original post in the interim.

bartman10
Level 4

I had to wait for 7.1.0.2 like everyone else... now that it's out I can't install it yet because we are in a hard freeze for an Epic upgrade. Epic is the huge software thing that basically runs the entire hospital.. but I don't know much more than that about it and don't want to! :) All I need to know is how much SAN space you need and how fast it needs to be.

 

I don't know if 7.1.0.2 will need EEB's to do all this new magic. But you may want to ask support about some hidden features I've been told about. I don't have all the facts yet but it sounds like there are some hidden performance tuning settings and some kind of ability to re-structure previously backed up image fragments so they will benefit from the rehydration speed increases.

I'll have more info after I can get this thing installed and work with the devs to tweak it. I'll keep you posted.

 

Last night I had to move all backups to my off site PureDisk pool because my primary site's pool has finally ground to a total halt! It's over 7 days behind in CR Queue processing, it's totally full because all the expired junk is not being physically removed and now can't even process incoming data from backups.... FAN-TASTIC!

It's stuff like this that makes me want to yell at the company talking heads that post all the fanboy stuff about how great this stuff is in these forums and blogs... hello! I'm here on the street using this stuff and it does not work as designed. Great.. you just yesterday released an update that is supposed to fix this... what about the past 8 months I've been suffering and working with low level support while they tell me there is nothing wrong and can't figure out why it's running like crap?

And after they started hawking Symantec boxes the conversations would start to turn to "well we now have this great hardware appliance and your hardware must be built wrong". BzzzzT, wrong. My PureDisk environment was built and vetted BY Symantec during a PS engagement before your boxes had hit the street. And guess what guys... your boxes have the exact same issue. I only wish I had let them install a test appliance like they had offered.. I would have loved to hear what they had to say after I ran my backups to it and the transaction logs started taking days to process... We now know why.. but this is only recent news and many low level support people still don't know about it!

I've sent emails to all the low level PureDisk support people I've worked with to let them know about the Flash Backup t-log issue so maybe others won't have to go through this... as it seems Symantec is keeping this info a secret from support...

 

Sorry if I'm a little blunt and rough... but this info needs to get out to users who are suffering with these issues. It's also hurting Symantec as we are quite close to ripping out the PureDisk environment because it simply does not work as designed. Sure, there are some great things about it: it duplicates the images to the offsite PD pool crazy fast, and it stores a ton of stuff in a small amount of physical disk.. but it does not work if I can't get the info OUT of it, and it can't run its internal housekeeping processes, so it fills up and blows up..

bartman10
Level 4

I'd like to hear what you guys are running and issues you see.

 

My onsite and offsite hardware is the same, only the offsite is 2 servers attached to 64TB.

 

Onsite:

Servers- 3x DL385 G7- 64GB, 2 Opteron CPUs (16 cores total), fiber attached, 4x 1GB nic LACP team.(moving to 10g)

Storage- Nexsan SATABeast with E60X expansion, 2 controllers, 4x4gb SAN ports Active/Active- 2TB drives, 3 RAID6 groups with 19 spindles, presented to servers as 16 2TB luns per server. ( 16th LUN is not 2TB, it's 1020mb) =34TB raw to each server, Storage Foundation used to Concat and format 16 LUNS into 31TB usable /Storage mount.

SPA: /Storage/databases mounted on HP EVA RAID1 500GB LUN, 160x15k spindles in EVA disk group. 2nd HBA in SPA to connect to EVA...

All servers HBA use both ports 4gb, 2 SAN paths (sanA-sanB)

 

If you're thinking of building a PureDisk deployment, the best practice guide is WAY off on the size you need for the /Storage/databases mount... In my config my database is currently only 125GB used! Symantec says 10% of total storage and it's totally off. WAY too much. Also, PureDisk is really bad at trying to use multiple threads.. most of the cores in my 16 core boxes have 0% work to do at all times. 6 cores would be enough... really... even with 6 I would have 2-3 cores with nothing to do... no reason to spend $ for more cores... just get as high MHz as you can get because they can't thread well to take advantage of modern processors... Hopefully they work to fix this as it's quite clear where the CPU makers are headed on this front.. more cores... but at the same time... I would need 2 CPUs to have enough RAM slots unless I wanted to pay crazy $$ for 8 or 16GB sticks of RAM... so it may be cheaper to get 2 CPUs and populate more RAM slots with 4GB sticks...

At least this is my EXP with the system thus far...

Lee_C
Level 5

Hi Bartman10,

I read your post, and it's deja-vu.  We have two servers for Puredisk with 20TB DAS storage.  We purchased new servers 12 months ago each with two hex core intel i7 and 24GB of RAM.  Backup speeds are good, and restoration of small amounts of data is OK, but restoring anything else is extremely poor.

Our hardware doesn't break into a sweat!  When there is an issue with Puredisk, you sort of hope it's with your kit, since you can do something about it - you hold no hope with the software.

When we do run into an issue, we always wonder, is this affecting us only? Well, that's the impression you get when you speak to the lower levels of support: it's always the same log gathering, and when the next agent takes over, it's "can you give us the (same) logs again". Surely Symantec should be able to keep these agents aware of the big issues!

You've been lucky to speak to developers.  There are clearly some good engineers at Symantec, but getting access to them is tough.  That may be an oversight since we don't have BCS, which is touted as having access to more experienced engineers.

dedupe
Level 3

Hi guys,

You all use Puredisk; I have a NB5200 appliance. Problems with the appliances are exactly the same as the Puredisk ones. We are monitoring the appliance's hardware resource utilisation: CPU usage averages 8%, with 32GB in use all the time. The appliance's disk speed is 220MB/s with a dd test. So the problem is with the code, not with the hardware. My case has been escalated to the top level thanks to our account manager, and I received the final say about the fix\EEB\release. The fix was implemented in Puredisk first, then the Netbackup media server, then on to appliances. It takes longer to release the code to the appliances, so the fix will be available to us at the end of October at the earliest! But it is already released for you. Apart from the timing issue, I wonder how much the speed will be improved with the new code. Also do not forget that this is affecting the restore speed, not just tape out. Any feedback is appreciated.

Burak

 

 

bartman10
Level 4

I have upgraded to 7.1.0.2 and installed the rehydration EEB on the PureDisk servers. It's too early to tell what's fixed/improved. I did notice clients with fast disk and Ethernet show 60mb/s+ where they used to be 30's.

It sounds like there are some other things that need to be done to take advantage of the rehydration EEB... something called "re-staging". You have to mark specific clients in the pd.conf file, they will take up new space on the disk pool that will be optimized then rehydration of said clients will be faster... I'm still not sure what this means to the old data that is still referenced by other clients... and after talking to a developer about how this process works I get the feeling it will never correctly stage all the data in the disk pool for faster rehydration and most of it seems quite manual and convoluted.

bartman10
Level 4

FYI, they have the rehydration EEB for the appliances also.

grlazy
Level 3
I am having the same issue at work. We finally managed to upgrade our two servers to 2x 10-core CPUs and 196GB RAM each. We added quite fast storage and we thought it would be good for rehydration speeds. Well, the truth is somewhere in the middle. After upgrading to 7.1.0.3 we thought we might get the rehydration speeds we needed, but it seems that we can't get acceptable speeds from SLPs when we try to rehydrate to tape. So I ran some tests and I was able to get 50-70MB/sec on a fiber 3584 L3 drive. It is not setting the fiber cable on fire, but it will do. What I did was make duplicates from the MSDP storage to tape, either from the Catalog menu or via the bpduplicate command.

f25
Level 4

Hi,

IMHO the appliance is exactly the same stuff as standard PureDisk. Just "packaged" otherwise.

The 32GB RAM usage is somewhat normal. I think that your monitoring tool is insufficient/incorrect. Indeed, when you use # free -m you see that only a few MB are free, but then you see that all the rest is used for caching.
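The same check can be scripted. The sketch below parses text in the Linux /proc/meminfo layout and adds buffers and page cache back onto the free figure, which is what the "-/+ buffers/cache" line of older `free` versions reports. The sample numbers are hypothetical and the function names are made up for this example.

```python
def parse_meminfo(text: str) -> dict:
    """Return {field: value_in_kB} from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def free_including_cache_mb(fields: dict) -> int:
    """Free memory plus buffers and page cache, in MB."""
    kb = fields.get("MemFree", 0) + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return kb // 1024


if __name__ == "__main__":
    # Hypothetical values for a 32GB box that looks "full" at first glance.
    sample = (
        "MemTotal:       32768000 kB\n"
        "MemFree:          204800 kB\n"
        "Buffers:          102400 kB\n"
        "Cached:         28672000 kB\n"
    )
    info = parse_meminfo(sample)
    print("MemFree only:      ", info["MemFree"] // 1024, "MB")
    print("Free incl. caching:", free_including_cache_mb(info), "MB")
```

On a live Linux server you would pass in the contents of /proc/meminfo instead of the sample string.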

In our company we use MS SCOM for monitoring MEM/CPU/HDD utilisation and the numbers are quite ok and seem to reflect the real utilisation.

I thought that the appliance disk speed would be faster. 220 MB/s is just above the recommendation. This has the most impact on backup/restore over the LAN. We are used to having it around 500 MB/s, reaching 700 MB/s (depending on the RAID/disk/volume group configuration) on direct attached storage.

Michał

f25
Level 4

Hi,

I do not understand. The rehydration bottleneck is on the PureDisk Storage Pool side. More than 32 GB is not going to be used by the PureDisk server, nor more than 8 threads (4 cores).

The speed-up would be achieved by a decrease in disk latency (HDD more like RAM) and higher overall disk R/W speed. Rehydration is like putting together a GB/TB jigsaw from 128KB (or smaller) pieces.
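The jigsaw picture can be turned into a back-of-the-envelope model: if every piece costs one random read, throughput is roughly reads per second times piece size. This is a deliberately simplified sketch that ignores caching, read-ahead and parallel streams, and the 200 reads/s figure is a hypothetical value, not a measurement from any pool in this thread.

```python
def pieces_in_image(image_gb: float, chunk_kb: float) -> int:
    """How many chunks make up an image of image_gb."""
    return int(image_gb * 1024 * 1024 / chunk_kb)


def rehydration_mb_per_s(random_reads_per_s: float, chunk_kb: float) -> float:
    """Throughput if each chunk costs exactly one random read."""
    return random_reads_per_s * chunk_kb / 1024


if __name__ == "__main__":
    reads_per_s = 200  # hypothetical random-read rate of the disk pool
    for chunk_kb in (128, 4):
        print(
            f"{chunk_kb}KB pieces: {pieces_in_image(1, chunk_kb):,} per GB, "
            f"about {rehydration_mb_per_s(reads_per_s, chunk_kb):.2f} MB/s"
        )
```

Under this model, lowering latency raises the reads per second directly, which is why disk latency matters more here than CPU or RAM.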

This may have an advantage. Theoretically, parallel "restoring" of two VMs should not be much slower than "restoring" one VM. Anybody experimented?

Michał