Solved: Backups/Restores work, Duplications fail

dukbtr · ‎04-19-2013

Master and Media Server:

NB 7.5.0.4

RHEL 6.3

Data Domain DD880 (VTL)

STK L700e w/ IBM LTO3 drives

When we do a backup or restore to VTL or physical tape, both procedures work fine. When we try to do a duplication, the dupe hangs or may take 3-4 days to dupe what was a 30 minute backup. This occurs with dupes from VTL->VTL, VTL->Physical Tape, and Physical->Physical. It also doesn't matter whether we use Vault, SLP, or dupe from the catalog image manually; the same thing happens.

Any ideas? We have been working with a Symantec tech for several months on this, and I thought I would finally put the question out here.

Thanks in advance for any replies

dukbtr · ‎04-24-2013

Hmmm, I guess nobody has run into this before.

Marianne · ‎04-24-2013

Nope...

I have worked with many customers over the years using different make/model VTLs on different OS's.

Never seen duplication issues.

I guess Symantec Support engineers have been requesting loads of level 5 logs?

dukbtr · ‎04-24-2013

Yes, a ton of high level logs have been sent

mph999 · ‎04-24-2013

Firstly, thank you for the clear explanation of the issue. I wish EVERYBODY (yes guys, that's a big hint) would provide sufficient details in their opening post so we at least have an idea of what's going on.

I think the best thing I can do is ask you for the case number. I will then see what I can do to get the case escalated to BL. If it is already with BL then it must be a difficult one.

A few questions:

1. When did the problem start, any known changes at this time

2. Has it ever worked

3. Whereabouts in the process does it hang

4. Is the system busy when the duplication(s) are running

5. Is it duplicating over the LAN, or is it handled by just one media server

Martin

dukbtr · ‎04-25-2013

I sent an email with the case number.

1. When did the problem start, any known changes at this time

The problem started about a week after we added the 2 drives from the STK L700e library. The physical drives are used for offsite duplication purposes only. Over this past weekend though, I ran a few more tests like multiple copy and the issue showed up there also. The backup that normally takes about an hour to run was at almost 9 hours doing nothing. Once I killed it and changed it back to only going to the VTL, it was fine.

2. Has it ever worked

Yes, it is random. We might have 2-3 duplications run perfectly, then the next one or more stalls. It is hit or miss when we see it, though I would say that it is more miss lately.

3. Whereabouts in the process does it hang

It varies. It may stall on the mount, or the tape gets mounted, does nothing, then writes about 600 MB every hour or so.

4. Is the system busy when the duplication(s) are running

No, NetBackup is the only thing on the Media Server and no other jobs are running

5. Is it duplicating over the LAN, or is it handled by just one media server

One media server, it is connected to both the DD880 VTL and the STK L700. 2 four port fiber cards. 2 fiber channels to the VTL and 2 to the Physical Tape drives, all connected into a switch for the TAN.

mph999 · ‎04-25-2013

Got it thx, will take a look.

martin

dukbtr · ‎05-02-2013

So I have been looking at compatibility matrices and found this:

This is what we are currently using as the changer:

Robot driver = D_DOMAINRESTORER_L180 5110, EMULATING ULTRIUM_3 TAPES

According to the NetBackup HW compatibility matrix for a DD880 we can use:

D_DOMAINRESTORER_L180,
IBM^^^^^03584L32,
ADIC^^^^Scalar^i2000

According to the EMC DD880 compatibility matrix for dd_firmware 5.1 and RHEL 6, only the following is listed under changers:
IBM System Storage TS3500

On EMC's compatibility matrix for D_DOMAINRESTORER_L180, only Windows servers are listed.

Could this be contributing to the issues we are seeing? Thoughts?

Mark_Solutions · ‎05-02-2013

Not sure on the L180 question... but does this affect all backups?

I am wondering about two things that can have a real effect on duplication, especially when de-dupe is involved:

1. Do any of these use GRT? If so, it processes all of the GRT information by mounting the image before it actually starts running the duplication, so this could be a part of it.

2. (And this really comes into play if the answer to 1. above is yes.) What fragment size do you use on your disk storage unit for the DD? This can have a big effect on duplication performance when it starts to read a large image file which gets re-hydrated via de-dupe. I try to use a 5000MB fragment size when writing to de-dupe, and the same size for all types of disk when using GRT in any backups.

Hope this helps. It may be worth a try anyway to see if a smaller fragment size sorts things out for you?

dukbtr · ‎05-02-2013

No, we don't use GRT. Yes, this affects all backups. We even see the issue when doing multiple copies for a backup.

Mark_Solutions · ‎05-03-2013

OK - if possible try some backups using a smaller fragment size for the DD storage unit (though you didn't actually say what it was set to) and see if that improves things

dukbtr · ‎05-03-2013

I will try that to see if it makes any difference. We are currently using the default 1TB fragment size.

dukbtr · ‎05-03-2013

So I have figured out a common link between the backup images/tape numbers that get hung during a duplication. All the tapes have a backup that has "fragment=2" if I look at Images on tape. The tapes/images that duplicate fine all have "fragment=1".

So now, any ideas on what this means or what to look for next? I'm stumped as far as that goes.

I don't think I want to decrease the fragment size. Wouldn't decreasing it just create more fragments for the backups, compounding the problem even further?

mph999 · ‎05-03-2013

Nice one. So to confirm: if the tape has one fragment in the image it works, and if it has 2 fragments in a given image it fails.

So, if you decrease the fragment size, you will get more fragments. It is worth trying because:

1. If it is a number of fragments issue, you will compound the problem, correct

2. If it is a backup image size issue, you won't

It is worth, if not vital, finding out which of these two is 'true'.

Unfortunately, with some issues, playing about and trying things out can be the most powerful troubleshooting technique there is. The logs don't always give answers, and sometimes even when they do, the issue can be very time-consuming and difficult to find, and testing can help reduce this.
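To make the test above concrete, here is a minimal sketch of how the fragment count changes with the fragment size. It assumes an image is simply cut every "maximum fragment size" megabytes, which is a simplification (an image can also pick up an extra fragment when it spans tapes). The image names and sizes are hypothetical and chosen only for illustration.

```python
import math


def fragment_count(image_mb: int, max_fragment_mb: int) -> int:
    """Fragments an image splits into, assuming a cut every max_fragment_mb."""
    return max(1, math.ceil(image_mb / max_fragment_mb))


# Hypothetical image sizes in MB (not taken from the thread).
IMAGES_MB = {"small_image": 20_000, "large_image": 300_000}

# 1,048,576 MB is 1 TB; the smaller values are example test settings.
for max_fragment_mb in (1_048_576, 50_000, 5_000):
    counts = {
        name: fragment_count(size, max_fragment_mb)
        for name, size in IMAGES_MB.items()
    }
    print(max_fragment_mb, counts)
```

If small images that used to duplicate fine start hanging once they have more than one fragment, that points at the number of fragments. If the same large images fail regardless of the setting, that points at image size.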

martin

watsons · ‎05-03-2013

Just to add a few more points..

A DebugLevel=6 unified log (vxul) and a verbose 5 bptm log of those duplication jobs should provide more insight into where the bottleneck or slowness occurs: whether it's the mount/unmount of the tape resources, the allocation/deallocation of resources (nbrb process), or the duplication write simply taking longer. Did it compete against a backup job for resources, given a lower job priority?

The configuration of the duplication needs to be checked first, along with how it interacts with the backup.

dukbtr · ‎05-07-2013

Tried varying fragment sizes, all resulted in same issue.

Today, though, we lost all the VTL drives for a few minutes. We're not sure why, and neither the sys admins nor the SAN admin can tell us. Here is what /var/log/messages had in it:

May 7 09:30:23 rhesutil02 tldcd[26537]: TLD(0) mode_sense ioctl() failed: Success

May 7 09:30:23 rhesutil02 tldd[5544]: TLD(0) going to DOWN state, status: Unable to sense robotic device

May 7 09:30:24 rhesutil02 ltid[5326]: Request for media ID LT3291 is being rejected because mount requests are disabled (reason = robotic daemon going to DOWN state)

May 7 09:32:29 rhesutil02 tldd[5544]: TLD(0) going to UP state

Thinking it may be some sort of communications issue between the media server and the VTL?
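For anyone wanting to see how often and for how long the robot drops out, the DOWN/UP pairs in messages like the ones above can be extracted with a short script. This is only a sketch: the regular expression is written against the four lines quoted in this post, the year is an assumption (syslog timestamps do not carry one), and the function name is made up for the example.

```python
import re
from datetime import datetime

# Matches tldd state changes such as:
#   May 7 09:30:23 host tldd[5544]: TLD(0) going to DOWN state, status: ...
STATE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+tldd\[\d+\]:\s+"
    r"TLD\((?P<robot>\d+)\) going to (?P<state>UP|DOWN) state"
)


def robot_outages(lines, year=2013):
    """Return (robot number, time it went DOWN, seconds until UP) tuples."""
    down_since = {}
    outages = []
    for line in lines:
        match = STATE_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        robot = int(match["robot"])
        if match["state"] == "DOWN":
            # Keep the first DOWN time if several are logged in a row.
            down_since.setdefault(robot, ts)
        elif robot in down_since:
            start = down_since.pop(robot)
            outages.append((robot, start, (ts - start).total_seconds()))
    return outages


SAMPLE = [
    "May 7 09:30:23 rhesutil02 tldcd[26537]: TLD(0) mode_sense ioctl() failed: Success",
    "May 7 09:30:23 rhesutil02 tldd[5544]: TLD(0) going to DOWN state, status: Unable to sense robotic device",
    "May 7 09:30:24 rhesutil02 ltid[5326]: Request for media ID LT3291 is being rejected because mount requests are disabled (reason = robotic daemon going to DOWN state)",
    "May 7 09:32:29 rhesutil02 tldd[5544]: TLD(0) going to UP state",
]

for robot, start, seconds in robot_outages(SAMPLE):
    print(f"TLD({robot}) down at {start} for {seconds:.0f} s")
```

Running the same function over the whole of /var/log/messages and comparing the outage times with the start times of the stalled duplications would show whether the two line up.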

dukbtr · ‎05-17-2013

We have solved the issue. It was one of the ports on the fiber patch panel above the server. Only 1 of the 2 fibers in the pair was bad: the one that receives. This is why all backups (writes) worked perfectly and we only saw issues during duplications. It was also only 1 port, the one that affected port 4a on the DD880, which is why the problem seemed random. A duplication or restore would run fine as long as it was using a drive on port 4b on the VTL. If it switched tapes and used a 4a drive, it would fail.

I narrowed it down to 4a and we switched the port on the patch panel; everything is running perfectly.
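The narrowing-down step amounts to grouping job outcomes by the VTL port behind the drive that was used. A minimal sketch of that tally follows; the drive names, the drive-to-port mapping, and the job records are all hypothetical stand-ins, not data from this environment.

```python
from collections import Counter, defaultdict

# Hypothetical mapping of VTL drive names to DD880 ports.
DRIVE_PORT = {
    "vtl_drive_01": "4a",
    "vtl_drive_02": "4a",
    "vtl_drive_03": "4b",
    "vtl_drive_04": "4b",
}

# Hypothetical job history: (job id, drive used for the read, outcome).
JOBS = [
    (101, "vtl_drive_01", "hung"),
    (102, "vtl_drive_03", "ok"),
    (103, "vtl_drive_04", "ok"),
    (104, "vtl_drive_02", "hung"),
    (105, "vtl_drive_03", "ok"),
]


def outcomes_by_port(jobs, drive_port):
    """Count job outcomes per port; drives missing from the map go to 'unknown'."""
    tally = defaultdict(Counter)
    for _job_id, drive, outcome in jobs:
        tally[drive_port.get(drive, "unknown")][outcome] += 1
    return {port: dict(counts) for port, counts in tally.items()}


print(outcomes_by_port(JOBS, DRIVE_PORT))
```

With real job records, a port whose reads almost always hang while the other port stays clean is a strong hint that the fault is in the path to that port rather than in NetBackup.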

Jeez, this was a rough one!!!

Thanks for the help

Marianne · ‎05-17-2013

Thanks for sharing!
