Exchange 2013 Backup fails code error 24

Hamza_H
Moderator
   VIP   

Hi Everyone,

We have had a backup issue on Exchange Server 2013 for a long time.

The configuration is as follows:

24 databases to back up on one cluster (yes, I know it's not recommended).

The problem is that the backups of some databases fail, and it's not always the same database: every time, a different backup fails.

After checking the application logs on the cluster, I found event IDs 401 and 403 at the same time that error 24 appears in the backup log, and yes, we know it's a Microsoft issue, as this TechNote explains:

 https://www.veritas.com/support/en_US/article.TECH197008

 

However, what we can't understand is why the weekend backup (a full backup) works just fine, with no event ID 401 or 403 in the cluster application logs and no backup error, while during the week (also full backups) there are issues.

Is this related to the Microsoft KB? We have KB3000850 installed; do we need to install KB3000853?

Thanks in advance.

 

HHA.

 

 

11 REPLIES 11

Lowell_Palecek
Level 6
Employee

The consistency check is optional if you are backing up a DAG. You can turn it off in your policy, or you can specify that the backup continue if the consistency check fails.

Having said that, here's a guess from other customer experience.

Have you configured enough space for your VSS snapshots? If not, you will get exactly this result. NetBackup will make a snapshot and start running a consistency check while backing up the data files. If the snapshot runs out of space, the OS deletes it out from under bpbkar32. The first thing you see in the logs is that the consistency check fails. If you configure the policy to keep going, the next thing you will see is that the backup of the data file (usually the .edb file) backs up too few bytes. (I think it fails because it knows how much data to expect, but I'm not sure.)

Hamza_H

Hello @Lowell_Palecek,

 

Thank you for your reply and your tip,

Unless I'm mistaken, the option "Perform consistency check before backup with VSS" is under the client properties > Exchange, not in the policy (snapshot attached).

I have already configured the VSS space as unlimited, but it didn't resolve the problem.

I've just turned off the option. I will wait for today's run to see if that resolves it.

By the way, we don't have an MS-Windows policy configured to back up the data (.edb), so there is no harm in turning off the consistency check option in this case, right?

 

Thanks a lot.

 

Hamza.

 

You are correct. The consistency check options (yes, no, maybe) are in the client host properties.

MS-Windows policies exclude Exchange data files in order to prevent backing up multiple times, so no, your MS-Windows policy is not backing up your .edb files.

The consistency check is a Microsoft requirement in non-DAG environments for them to support VSS backup and restore. They waived the requirement for DAG backups.

With the option that failing the consistency check fails the backup, is the backup job (bpbkar32) retried from the same snapshot job at least once? If the OS drops the snapshot, the second failure will differ from the first. The first bpbkar and Event log will report that the consistency check fails part way through. The retry will report that the check fails to read the header information.

With the other two options, check to make sure you are getting the whole .edb file. If not, I think you get status 1, but I'm not sure. Check the bpbkar log to make sure.

You can run out of VSS space even if you tell vssadmin it's unlimited, if you provision it on a volume that's too small. If snapshot space is your problem you may see the results vary depending on how busy the Exchange system is, and how long it takes from the snapshot creation to the end of the backup. VSS system provider snapshots are copy-on-write.
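
To make the copy-on-write point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the worst case, where every write during the backup window hits a block that has not been copied yet, so the snapshot's diff area grows at the full write rate. The write rates, the 30-minute window, the 9 GiB of free space, and all function names are illustrative assumptions, not measurements from any real system.

```python
import shutil

GIB = 1024 ** 3
MIB = 1024 ** 2

def worst_case_diff_area_bytes(write_rate_bytes_per_s, backup_seconds):
    """Upper bound on copy-on-write growth: every write hits a not-yet-copied block."""
    return write_rate_bytes_per_s * backup_seconds

def snapshot_fits(free_bytes, write_rate_bytes_per_s, backup_seconds):
    """True if the free space covers the worst-case diff area growth."""
    needed = worst_case_diff_area_bytes(write_rate_bytes_per_s, backup_seconds)
    return needed <= free_bytes

def free_bytes_on(path):
    """Free bytes on the volume that holds `path` (for example, the snapshot target volume)."""
    return shutil.disk_usage(path).free

# Hypothetical busy weekday: 10 MiB/s of changes during a 30-minute backup, 9 GiB free.
busy = snapshot_fits(9 * GIB, 10 * MIB, 30 * 60)
# Hypothetical quiet weekend: 1 MiB/s of changes, same window and free space.
quiet = snapshot_fits(9 * GIB, 1 * MIB, 30 * 60)
print(busy, quiet)
```

With these made-up numbers the busy case needs about 17.6 GiB and does not fit, while the quiet case needs under 2 GiB and does. Real growth is usually lower than this bound because a block is copied only the first time it is overwritten, so treat the result as a pessimistic estimate.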

The TechNote you found, https://www.veritas.com/support/en_US/article.TECH197008, does not say it's a Microsoft bug. In fact, it describes the first bpbkar failure before retry when the snapshot runs out of space. I'm disappointed that the TechNote doesn't say more. It's dated late 2013, which may be before I figured out the cause. I will see about getting it updated.

I found the customer cases that I had. The first case was around the time of the TechNote. The TechNote probably came out of that case. We didn't figure out the cause because the customer refused to provide us any backup logs. They had a DAG environment configured to either not do the consistency check, or to continue if it failed. Most of their backups were good, but their backups taken during especially busy times were all bad. They should have been getting status 1 on their backups, but they wouldn't tell us. They had OpsCenter configured to treat status 1 as success.

The second case came just a couple of months later. They had Exchange 2007, so a failed consistency check failed the backup. In this case I saw that they had only a few hundred megabytes provided for VSS snapshots of a large database. The backup failed during a very busy time.

In the retry, there is no snapshot for bpbkar to back up. This had been a status 1 situation unless the consistency check was done and failure was considered fatal. After the second case we changed NetBackup (7.6.0.2) to treat a missing data file on the snapshot as failure.

Hamza_H

Hello Again @Lowell_Palecek,

Thank you for your research, and especially for your explanation and the time you took to answer.

Yesterday's run didn't bring us good news: status code 24 is still there, but event IDs 401 and 403 no longer appear in the application logs, so I re-enabled the option.

To answer your question: Is the backup job (bpbkar32) retried from the same snapshot job at least once?

I can assure you that for some jobs there is always an attempt 2. Even when all jobs are OK during the week (which is really rare), there is always an attempt 2 for some databases. The exception is the weekend, when all jobs succeed on the first attempt, and that's why I'm really confused.

So yes, bpbkar32 always tries to get the snapshot on the second attempt, but the message is the same as on the first one. The application logs have no errors related to these NetBackup messages.

The snapshot job is always successful. As you mentioned, you already saw this with that customer: "Most of their backups were good, but their backups taken during especially busy times were all bad." I think we have the same problem here because, as I said before, backups are always fine during the weekend, without a single error message or second attempt, but during the week 2 or 3 of the 27 jobs always fail with error 24:

4 Dec 2018 20:57:11 - Critical bpbrm (pid=21799) from client dag-002: FTL - socket write failed
4 Dec 2018 20:57:13 - Error bptm (pid=22019) media manager terminated by parent process

 

Any other explanation?

 

Thanks in Advance,

 

HamzaH.

So far, we've had only descriptions of the problem, not data. It could be that my guess of the cause is correct - or not. There are other reasons for things to go better during quieter times.

1. Regarding the snapshot conjecture, I have assumed you are using the "system" or "auto" VSS provider. Is that correct? How many disk volumes are involved in the databases being backed up? You say they are configured for unlimited VSS snapshot size. The configuration also specifies the volume where the snapshot copy-on-write data is kept. Do the target volumes have sufficient space? Does the result change if you target the snapshots to a big empty disk? How much time passes from when the snapshot is taken until the backup job fails?

2. You have now given two lines from the job details. Please provide about the last 10 lines before the error. Find the line in the bpbrm log that matches the error in the job details. Look for the context for the error in that bpbrm pid. What was going on before the error? Are there any other signs of trouble?

3. Whenever bpbrm reports a problem "from client" find the error in the client log. I'm assuming that the client process is bpbkar32. Make sure that's true. As in the case of the bpbrm log, look for the context for the error in that pid. What was the process doing? Is there an error before the socket failure? I'm surprised that the error is a socket write failure. That does not fit blaming the snapshot. Are there any large time gaps in the log for the pid that fails?
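
As a rough aid for steps 2 and 3, the sketch below pulls the last few log lines for one pid that precede the first line containing an error string. It is only a sketch: it assumes each log line carries a `[pid.tid]` field, as in the bpbrm excerpts quoted in this thread, and the function name and parameters are made up for illustration.

```python
import re
from collections import deque

# Assumed line layout: "HH:MM:SS.mmm [pid.tid] <severity> message"
PID_RE = re.compile(r"\[(\d+)\.\d+\]")

def context_before(lines, pid, needle, n=10):
    """Return (previous_lines, error_line) for the first line of `pid` containing `needle`.

    previous_lines holds up to `n` earlier lines logged by the same pid.
    error_line is None if the needle never appears for that pid.
    """
    recent = deque(maxlen=n)
    for line in lines:
        match = PID_RE.search(line)
        if match is None or int(match.group(1)) != pid:
            continue  # skip other processes and lines without a pid field
        if needle in line:
            return list(recent), line
        recent.append(line)
    return list(recent), None

# Usage (the file name is hypothetical):
# with open("bpbrm.log", encoding="utf-8", errors="replace") as log:
#     before, error = context_before(log, pid=180589, needle="socket write failed")
```

The same function can be pointed at the client-side bpbkar log with the client pid to see what that process was doing just before the failure.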

Can you open a case with Support? A support engineer can work with you more efficiently to troubleshoot your environment. I had one conjecture - that the OS is dropping your snapshots due to insufficient space. Support can help you better than I can with other issues.

Hamza_H

Thanks again @Lowell_Palecek,

I will open a case with support to find the real reason why we are having these problems.

Just one quick question: I'm not an Exchange expert, so yesterday I configured the VSS space as unlimited on both the DAG and the mailbox server. Can you please tell me whether I should configure that only on the DAG, only on the mailbox server, or on both?

I ask because the DAG has 20 GB free out of 136 GB, while the mailbox server has only 9 GB free out of 136 GB.

The bpbrm logs for the PID involved are attached.

Thanks.

Your DAG is a virtual cluster name. NetBackup resolves it to a mailbox server. I think from the bpbrm log that the snapshot was taken on a host named MAILBOX02.

I also see that the first "from client" message after bpbrm passed the backup directives was the error. It occurred almost exactly 30 minutes after the start of the job, and 29 minutes after bpbrm started bptm.

19:58:40.800 [180589.180589] <2> bpbrm mm_sig: received ready signal from media manager
20:27:12.543 [180589.180589] <32> non_mpx_backup_archive_verify_import: from client DAG: FTL - socket write failed

The only two sockets bpbkar32 writes to are to bpbrm (we see that works), and bptm (the media manager).

If you have a verbose bpbkar log on MAILBOX02, I think you will find that it failed while trying to write the .edb file to bptm. That suggests that the problem is not with reading the snapshot but rather with writing to the media manager. This takes it out of the scope of my expertise.
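
For anyone who wants to check the timing between the two bpbrm lines quoted above, here is a minimal Python sketch. It assumes the `HH:MM:SS.mmm [pid.tid]` layout seen in those lines, and it treats a negative difference as a job that ran past midnight; the helper names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: a time stamp followed by "[pid.tid]"; any leading characters are ignored.
STAMP_RE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\.\d+\]")

def stamp(line):
    """Parse the time of day from one log line."""
    match = STAMP_RE.search(line)
    if match is None:
        raise ValueError("no time stamp found in line: " + line)
    return datetime.strptime(match.group(1), "%H:%M:%S.%f")

def elapsed(first_line, second_line):
    """Time between two log lines, assuming they are less than 24 hours apart."""
    delta = stamp(second_line) - stamp(first_line)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # the log has no date, so handle a midnight rollover
    return delta

ready = "19:58:40.800 [180589.180589] <2> bpbrm mm_sig: received ready signal from media manager"
failed = "20:27:12.543 [180589.180589] <32> non_mpx_backup_archive_verify_import: from client DAG: FTL - socket write failed"
print(elapsed(ready, failed))  # prints 0:28:31.743000
```

That is about 28.5 minutes between the media manager's ready signal and the socket write failure. Running the same calculation over consecutive lines of the verbose bpbkar log for the failing pid is one way to spot large time gaps.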

Hamza_H

Thanks again @Lowell_Palecek for your help.

I am trying to contact Support to investigate this issue further.

For now, I've marked your last reply as the solution because it is the one leading us toward the "true" cause. As soon as I get an answer or solution from technical support, I will update the post to let you know.

HamzaH.

 

I updated the TechNote at the top of this thread and got it re-published. The link has changed. Please use https://www.veritas.com/support/en_US/article.100027680.html.