Solved: Backing up Exchange 2010 DAG - Multiplexing Issue

CraigL · ‎05-04-2012

Hello everyone -

Having an issue with a brand new Exchange 2010 policy that I'm hoping someone out there can shed some light on. Here is the situation:

The policy runs just fine backing up the entire DAG with multiplexing set at 1, but it loads tapes into both drives on the master/media server and writes to both. This is kind of silly, since all of the data will easily fit onto one LTO3 tape. So, we changed the multiplexing value to 2 in hopes of getting everything onto one tape, and now many of the child jobs fail with error 25.

That said, we use multiplexing on many other policies without issue, so I'm inclined to think this isn't the network/connectivity issue that the description for error 25 suggests. Besides, we know it's not basic connectivity, since the policy works perfectly with multiplexing set to 1.

Vitals on our setup and the policy:

We're running Netbackup v7.1.01 on a common master/media server with Windows 2003 Standard R2 w/ SP2 as the OS.

The policy type is (obviously) MS-Exchange-Server, the Exchange attribute is set to 'Passive copy and if not available the active copy', and the single backup selection is 'Microsoft Exchange Database Availability Groups:\'. 'Perform snapshot backups' is also checked. Under the schedule options, 'Media Multiplexing' is set to 1. Type of backup is Full.

The DAG has two servers (mailbox role only) in it -- both Windows 2008 R2 Enterprise w/ SP1. Exchange is version 2010 SP2 w/ RU2.

I did open a ticket with Symantec on this issue and, unfortunately, was given little assistance. The person we worked with just said it must be a network problem, but also had to look at the documentation to see what multiplexing even WAS. Sigh.

This seems like it should be a lay-up to do. Does anyone have any ideas or suggestions as to what may be going on here?

Thanks a ton in advance.

-Craig

CraigL · ‎06-06-2012

Everyone -

I'm happy to share that this issue has been resolved. The fix, ultimately, was simple:

Port 1556 needed to be opened on the Windows firewall on both of our DAG nodes. I didn't see this in the NBU Exchange Admin guide, but maybe I missed it. :)
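For anyone who wants to check this before digging through job logs, here is a minimal sketch of a TCP reachability test, meant to be run from the master/media server against each DAG node. The function name `port_is_open` and the default timeout are just illustrative choices, not anything from NetBackup itself, and a successful connect only shows the port is reachable in that one direction.

```python
import socket


def port_is_open(host: str, port: int = 1556, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `port_is_open("server-01")` and `port_is_open("server-02")` (substitute your real node names) should return True for both nodes once the firewall rule is in place.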

A huge thanks goes out to Steve @ Symantec for working through this issue with me!

Hopefully this helps someone else out there!

-Craig


J_H_Is_gone · ‎05-04-2012

you want to go to only one tape.

you can have many many storage units set up - and the total number of drives in all of your storage units can be more than the number of tape drives you have.

Just create a storage unit with 1 tape drive and set the exchange policy to use it.

--------------------------

as to the other issue - you said you have 2 exchange servers - are there passive copies on each one?

what I have observed

parent job dag - goes and sees what it is supposed to back up

child job for serverA - sees that there are mailstores on serverA it needs to backup so it makes a parent job just for serverA

grandchild job - for mailstore A

grandchild job - for mailstore B

child job for serverB - sees that there are mailstores on serverB it needs to backup so it makes a parent job just for serverB

grandchild job - for mailstore C

grandchild job - for mailstore D

so this could be part of what is causing 2 tapes or contention when multistreaming.

try putting each mailstore in the backup selection

it could also be that if you try to run too many streams at once the exchange server just cannot handle it.

has anybody monitored the exchange servers when backups are running to see if you are killing them?

are both exchange servers listed in some other policy? Meaning if you go to the client list do you see both exchange physical server names listed?

and do you have both servers in the master server properties under distributed application restore mapping?

CraigL · ‎05-07-2012

Hi there -

First of all, thank you for taking the time to reply.

Creating the new storage unit with only one drive is an interesting idea. I will give that a shot and report back.

Yes, there are passive copies on each of the DAG members. The machines are, more or less, doing absolutely nothing when the backups run. They aren't even in production yet and thus only have a very small handful of mailboxes at this time.

Performance utilization is very, very low during the backups. We know they aren't being crushed.

Per the documentation, we have only the DAG name set up as the Client. The person at Symantec we worked with said this was correct, but as I stated above, he also didn't know what multiplexing was before he looked it up.

The client list ONLY shows the DAG name.

Negative on both servers being listed under the 'Distributed Application Restore Mapping' option. I will have to look into this one.

-Craig

J_H_Is_gone · ‎05-07-2012

you need to have that info in the DARM - I am not too sure how well it works without it.

second - I found when it goes to enumerate that dag to check the mailstores - it checks both the passive and active copies to see what the state is.

I had issues because the physical server name was not in my client list - so it was like the master did not know it was allowed to talk to them.

so as a test create a policy that only backs up the C drive and the system state of the physical exchange servers. This will get the server names into the client list. But I don't see how this would affect multiplexing.

CraigL · ‎05-07-2012

I took your suggestion and made a Storage Unit with only one drive. While it only did use one drive, it still picked up two different tapes from the scratch pool.

The backup IS finishing successfully, which is obviously a good thing, but the fact that it's using 2 tapes to store ~5GB of data is frustrating.

Every other policy we have writes to one tape, and when it fills up, it grabs a new one. It's just this one that seems to have a difficult time with it. Ugh.

-Craig

J_H_Is_gone · ‎05-07-2012

did it take two tapes from the very start?

in the gui run the images on tape report for each tape

do they have different images on the tape or is it the same backup on both tapes.

I am wondering if by chance you have Multiple copies checked on the policy?

can you please post the output of the job details.

CraigL · ‎05-09-2012

OK.. I think we may be dealing with multiple issues here. I spent some time today testing out lots of scenarios and have found the following:

Multiple copies is not set for the policy. That is confirmed :)

If I run the job with no multiplexing at all, it puts all the data from one server onto one tape, and data from the other server on another. I have a feeling this is by design.

If I move all active database copies to one server, everything backs up fine to a single tape without issue.

If I move any database over to the other DAG member and enable multiplexing, the second child job always fails with the error 25. The first one finishes without issue.

Backing up individual mail databases on each machine on their own works fine, so I know there are no connectivity issues.

Here's the detail status from the job that bombs on 25.

5/9/2012 3:46:51 PM - Info nbjm(pid=6380) starting backup job (jobid=21333) for client server-01, policy Test_ExchangeDAG, schedule TestFull
5/9/2012 3:46:51 PM - estimated 345676 Kbytes needed
5/9/2012 3:46:51 PM - Info nbjm(pid=6380) started backup job for client server-01, policy Test_ExchangeDAG, schedule TestFull on storage unit Local_Tape_Drives
5/9/2012 3:48:53 PM - mounting CEN508
5/9/2012 3:49:39 PM - Info bpbrm(pid=6792) telling media manager to start backup on client
5/9/2012 3:49:39 PM - Info bptm(pid=6724) using 131072 data buffer size
5/9/2012 3:49:39 PM - Info bptm(pid=6724) using 32 data buffers
5/9/2012 3:49:39 PM - mounted; mount time: 00:00:46
5/9/2012 3:49:39 PM - positioning CEN508 to file 10
5/9/2012 3:51:45 PM - positioned CEN508; position time: 00:02:06
5/9/2012 3:51:46 PM - Info bpbrm(pid=6792) child done, status 0
5/9/2012 3:52:07 PM - Info bpbrm(pid=4788) sending bpsched msg: CONNECTING TO CLIENT FOR DAG-01_1336592811
5/9/2012 3:52:07 PM - connecting
5/9/2012 3:52:49 PM - Error bpbrm(pid=4788) cannot create data socket, The operation completed successfully. (0)
5/9/2012 3:52:49 PM - Info bpbrm(pid=6792) child done, status 25
5/9/2012 3:52:49 PM - Info bpbrm(pid=6792) sending media manager msg: STOP BACKUP DAG-01_1336592811
5/9/2012 3:52:55 PM - Info bpbrm(pid=6792) media manager for backup id DAG-01_1336592811 exited with status 150: termination requested by administrator
5/9/2012 3:52:55 PM - end writing
cannot connect on socket(25)
5/9/2012 3:53:18 PM - Info bpbrm(pid=6792) sending media manager msg: TERMINATE

Yes, that's me cancelling it after I see it hit error 25 :)

The selections for this job look like this:

NEW_STREAM

DB1 (active on node 1)

NEW_STREAM

DB2 (active on node 2)

I also came across a scenario where I can actually crash the NetBackup Job Manager service with a certain combo of streams and mail databases. If I can continue to reproduce this at will, I'll open another ticket w/ Symantec.

What do ya think?

-Craig

rawilde · ‎05-09-2012

Multiplexing with a DAG backup certainly should work. I would try the support case route again.

Also, that Job Manager crash is a known issue and should be addressed in 7.5.0.3.

J_H_Is_gone · ‎05-09-2012

by chance are your exchange servers also media servers?

Look at the jobs that each ran and check which media server did the backup.

CraigL · ‎05-10-2012

Yeah, I'm going to try and open another case and hope to get someone better.

Thanks for the tip on the Job Manager crash!

-Craig

CraigL · ‎05-10-2012

Negative, we simply have one master/media server, and it absolutely does not run Exchange :)

srk · ‎05-11-2012

Hi Craig.

Can you post the job details from a successful job? The reason I ask is that with mounting and positioning, we’ve got almost 3 minutes worth of tape expense above. That is:

5/9/2012 3:49:39 PM - mounted; mount time: 00:00:46
5/9/2012 3:49:39 PM - positioning CEN508 to file 10
5/9/2012 3:51:45 PM - positioned CEN508; position time: 00:02:06

It could be that when you don't use MPX, a different tape is selected and it only has to position to tape mark 2 or 3 (instead of 10)

There's not a lot to go on here, but there's an outside chance that we're exceeding a client read timeout (default is 5 minutes.) That's kind of a stretch, but the info is limited.

I would create a new volume pool, move a single tape (that has no current images) to that pool and then configure the policy to use just that one pool. That would remove the 2 minute positioning issue from the equation.

CraigL · ‎05-14-2012

Hi -

Do you want a log from one of the successful jobs from when the job is run with multiplexing enabled or disabled?

There appear to be a decent number of Exchange-related fixes between where we are at now (7.1) and 7.1.0.4, so we're going to upgrade to that later today and see if the issues disappear. Will certainly try your suggestion and report back, though. Thank you for the suggestion.

-Craig

jim_dalton · ‎05-14-2012

The good news, Craig, is it will work: I have the same setup, give or take, and MPX works fine.

No need to faff with streams, that's undesirable even if it does work.

All we have to do is figure out why yours doesn't.

You need to give us a full policy listing of your policy and a breakdown of the storage unit in the policy as well.

The stunit needs MPX enabled too, as well as the backup policy, along with the MPX limit. Is this the case?
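To make the two-settings point concrete, here is a rough sketch of how the schedule value and the storage unit value interact. It assumes (my understanding, not something confirmed in this thread) that the lower of the two limits is what actually applies, and that a storage unit without multiplexing enabled behaves as a limit of 1. The function and parameter names are made up for illustration.

```python
def effective_mpx(schedule_mpx: int, stu_mpx_enabled: bool, stu_max_streams: int) -> int:
    """Estimate how many streams can share one drive for a given schedule.

    Assumption: the lower of the schedule's 'Media multiplexing' value and the
    storage unit's 'Maximum streams per drive' applies, and a storage unit
    with multiplexing disabled behaves like a limit of 1.
    """
    if not stu_mpx_enabled:
        return 1
    return max(1, min(schedule_mpx, stu_max_streams))
```

So a schedule set to 4 against a storage unit that allows only 2 streams per drive would come out as 2 under this assumption.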

Thanks, Jim

CraigL · ‎05-14-2012

As a follow-up.. upgrading to 7.1.0.4 (both master/media server + clients) did NOT solve the issue.

Bummer :)

-Craig

CraigL · ‎05-14-2012

Hi Jim -

That's good. What version of NBU are you running?

Here are the details of my test policy:

Policy Type - MS-Exchange-Server

Policy Storage - Local_Tape_Drives (will detail this below)

Job Priority - 0

Media Owner - Any

Perform Snapshot Backups is checked. Under options, nothing is set.

Disable client-side deduplication is checked

Database backup source is Passive copy and if not available active

Nothing is set under Preferred server list

Under Schedule:

Type of backup is full

Media multiplexing is set at 4

Multiple copies and all of the override options are NOT checked

Under clients:

Just our DAG name

Under Backup Selections:

NEW_STREAM

DAG -> Mail DB1 (live on node1)

NEW_STREAM

DAG -> Mail DB2 (live on node2)

Storage Unit Info:

Name: Local_Tape_Drives

Maximum concurrent write drives - 2

Reduce fragment size - not checked/default

Enable Multiplexing is checked

Maximum streams per drive - 4

The issue is this: if all mail databases are live on ONE node, everything works fine. If they're not, that's when we see the error 25s. Multiplexing works fine in other policies, even using older clients!

-Craig

CraigL · ‎05-14-2012

Tried your suggestion regarding making a volume pool with just one tape (no images) and configured the policy to use it. No dice -- same outcome. The backup from one node worked fine, the other did not. Here is the detail output from the two:

WORKING:

5/14/2012 1:33:32 PM - Info nbjm(pid=4192) starting backup job (jobid=21426) for client Server-02, policy Test_ExchangeDAG, schedule TestFull
5/14/2012 1:33:32 PM - estimated 0 Kbytes needed
5/14/2012 1:33:32 PM - Info nbjm(pid=4192) started backup job for client Server-02, policy Test_ExchangeDAG, schedule TestFull on storage unit Local_Tape_Drives
5/14/2012 1:33:33 PM - started process bpbrm (8872)
5/14/2012 1:33:57 PM - Info bpbrm(pid=8872) starting bptm
5/14/2012 1:33:57 PM - Info bpbrm(pid=8872) Started media manager using bpcd successfully
5/14/2012 1:34:20 PM - Info bpbrm(pid=8872) Server-02 is the host to backup data from
5/14/2012 1:34:20 PM - Info bpbrm(pid=8872) telling media manager to start backup on client
5/14/2012 1:34:20 PM - Info bptm(pid=6828) using 131072 data buffer size
5/14/2012 1:34:20 PM - Info bptm(pid=6828) using 32 data buffers
5/14/2012 1:34:31 PM - Info bptm(pid=6828) start backup
5/14/2012 1:34:31 PM - Info bptm(pid=6828) Waiting for mount of media id CEN508 (copy 1) on server mds-netbackup01.
5/14/2012 1:35:15 PM - Info bptm(pid=6828) media id CEN508 mounted on drive index 1, drivepath {4,0,4,0}, drivename tape_drive2hcart3, copy 1
5/14/2012 1:35:39 PM - mounting CEN508
5/14/2012 1:35:39 PM - mounted; mount time: 00:00:00
5/14/2012 1:35:39 PM - positioning CEN508 to file 1
5/14/2012 1:36:02 PM - Info bpbrm(pid=9416) sending bpsched msg: CONNECTING TO CLIENT FOR DAG-01_1337016812
5/14/2012 1:36:02 PM - connecting
5/14/2012 1:36:24 PM - Info bptm(pid=1824) setting receive network buffer to 525312 bytes
5/14/2012 1:36:24 PM - Info bpbrm(pid=9416) start bpbkar32 on client
5/14/2012 1:36:24 PM - connected; connect time: 00:00:22
5/14/2012 1:36:24 PM - positioned CEN508; position time: 00:00:45
5/14/2012 1:36:24 PM - begin writing
5/14/2012 1:36:45 PM - Info bpbkar32(pid=7356) Backup started
5/14/2012 1:36:45 PM - Info bpbrm(pid=9416) Sending the file list to the client
5/14/2012 1:37:07 PM - Info bpbrm(pid=9416) DB_BACKUP_STATUS is 0
5/14/2012 1:37:19 PM - Info bpbrm(pid=8872) media manager for backup id DAG-01_1337016812 exited with status 0: the requested operation was successfully completed
5/14/2012 1:39:25 PM - Info bpbrm(pid=8872) child done, status 0
5/14/2012 1:39:25 PM - end writing; write time: 00:03:01
the requested operation was successfully completed(0)

NOT WORKING:

5/14/2012 1:36:33 PM - Info nbjm(pid=4192) starting backup job (jobid=21427) for client Server-01, policy Test_ExchangeDAG, schedule TestFull
5/14/2012 1:36:33 PM - estimated 0 Kbytes needed
5/14/2012 1:36:33 PM - Info nbjm(pid=4192) started backup job for client Server-01, policy Test_ExchangeDAG, schedule TestFull on storage unit Local_Tape_Drives
5/14/2012 1:36:55 PM - Info bpbrm(pid=8872) Server-01 is the host to backup data from
5/14/2012 1:37:19 PM - Info bpbrm(pid=8872) telling media manager to start backup on client
5/14/2012 1:37:20 PM - Info bptm(pid=6828) using 131072 data buffer size
5/14/2012 1:37:20 PM - Info bptm(pid=6828) using 32 data buffers
5/14/2012 1:37:30 PM - Info bptm(pid=6828) start backup
5/14/2012 1:39:47 PM - Info bpbrm(pid=7812) sending bpsched msg: CONNECTING TO CLIENT FOR DAG-01_1337016993
5/14/2012 1:39:47 PM - connecting
5/14/2012 1:40:29 PM - Error bpbrm(pid=7812) cannot create data socket, The operation completed successfully. (0)
5/14/2012 1:40:30 PM - Info bpbrm(pid=8872) child done, status 25
5/14/2012 1:40:30 PM - Info bpbrm(pid=8872) sending message to media manager: STOP BACKUP DAG-01_1337016993
5/14/2012 1:40:35 PM - Info bpbrm(pid=8872) media manager for backup id DAG-01_1337016993 exited with status 150: termination requested by administrator
5/14/2012 1:40:35 PM - end writing
cannot connect on socket(25)

Is this what you were asking for?
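When comparing a working and a failing job side by side, it can help to boil each detail listing down to the connect step and the final child status. The following is a small sketch that does that for text in the format pasted above; the helper name `summarize` and the regular expressions are my own and only cover the handful of line shapes seen in this thread.

```python
import re

# Matches lines such as "5/14/2012 1:39:47 PM - connecting".
LINE_RE = re.compile(r"^(\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) - (.*)$")
STATUS_RE = re.compile(r"child done, status (\d+)")


def summarize(job_details: str) -> dict:
    """Pull out the connect timestamps and the last child status seen."""
    summary = {"connecting": None, "connected": None, "child_status": None}
    for line in job_details.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. the trailing "cannot connect on socket(25)" line
        stamp, message = match.groups()
        if message == "connecting":
            summary["connecting"] = stamp
        elif message.startswith("connected;"):
            summary["connected"] = stamp
        else:
            status = STATUS_RE.search(message)
            if status:
                summary["child_status"] = int(status.group(1))
    return summary


# Three lines copied from the failing job above.
failing = """\
5/14/2012 1:39:47 PM - connecting
5/14/2012 1:40:29 PM - Error bpbrm(pid=7812) cannot create data socket, The operation completed successfully. (0)
5/14/2012 1:40:30 PM - Info bpbrm(pid=8872) child done, status 25
"""
result = summarize(failing)
```

For the failing job, `result` shows a "connecting" timestamp, no "connected" timestamp, and a child status of 25, which narrows the failure to the connect step itself.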

srk · ‎05-14-2012

Hi Craig.

Yes, that's what I was looking for. Do you have either of the following set?

'Limit jobs per policy' (configured in the policy)

'Maximum jobs per client' (configured in the Host Properties of the master server)

-Steve

CraigL · ‎05-14-2012

Hi Steve -

Limit jobs per policy is NOT configured in the policy.

Maximum jobs per client is set at 4.

-Craig

srk · ‎05-14-2012

Okay.

One thing I think may be worth a shot would be to increase the Client Read Timeout. The default is 5 minutes, but I'm not sure exactly when the clock starts ticking on that setting (especially with this whole DAG/passive/active part of the equation.)

The rationale is this:

Let's say we use 1:33:32 as the "start time" for Client Read:

5/14/2012 1:33:32 PM - Info nbjm(pid=4192) starting backup job (jobid=21426) for client Server-02, policy Test_ExchangeDAG, schedule TestFull

The first job connects to the client (after tape mount, position, etc.) here, at 1:36:02

5/14/2012 1:36:02 PM - Info bpbrm(pid=9416) sending bpsched msg: CONNECTING TO CLIENT FOR DAG-01_1337016812
5/14/2012 1:36:02 PM - connecting

The second job connects to the client here, at 1:39:47

5/14/2012 1:39:47 PM - Info bpbrm(pid=7812) sending bpsched msg: CONNECTING TO CLIENT FOR DAG-01_1337016993
5/14/2012 1:39:47 PM - connecting

So, theoretically, if Client Read started 'ticking' at 1:33:32, the 5 minute default timeout has expired by the time we see the 2nd 'connecting' at 1:39:47. That's where it bombs.
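The arithmetic behind that can be checked with a few lines of Python. This is only a sketch of the timing argument: the timestamps are copied from the job details above, the 5 minute figure is the default quoted in this thread, and treating 1:33:32 as the moment the clock starts is the same guess made above.

```python
from datetime import datetime, timedelta

FMT = "%m/%d/%Y %I:%M:%S %p"
CLIENT_READ_TIMEOUT = timedelta(minutes=5)  # default value quoted in the thread


def elapsed(start: str, end: str) -> timedelta:
    """Time between two job-detail timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


job_start = "5/14/2012 1:33:32 PM"       # assumed start of the timeout clock
first_connect = "5/14/2012 1:36:02 PM"   # first child job "connecting"
second_connect = "5/14/2012 1:39:47 PM"  # second child job "connecting"

first_wait = elapsed(job_start, first_connect)    # 2 min 30 s
second_wait = elapsed(job_start, second_connect)  # 6 min 15 s

print(first_wait, first_wait > CLIENT_READ_TIMEOUT)
print(second_wait, second_wait > CLIENT_READ_TIMEOUT)
```

The first connect lands 2 minutes 30 seconds after the assumed start, inside the default; the second lands at 6 minutes 15 seconds, past it.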

Try bumping up the Client Read Timeout for both nodes (configured in the Host Properties -> Timeouts section in the client properties) to maybe 10 or 15 minutes and see if that has any impact.

I think it's worth a shot based on above.

-Steve
