Solved: We upgraded to 7.5.0.5 and

Juasiepo · ‎03-04-2013

Hi,

Last year we upgraded to 7.5.0.4 + deduplication, since then the backup system is not completely stable.

Randomly our windows backups started to fail with error statuses:

Media write error (84)
File write failed (14)
File read failed (13)

In the beginning only a few of them failed and as a workaround I created for them a special policy without the accelerator and they started to finish successfully again.

BUT this Friday all full windows backups that never failed before failed at once. Again as a workaround I re-launched them without the accelerator and finished ok.

This Friday all backups failed with error statuses 13 & 14. Inside the Detailed Status tab another status error can be seen:
"
03/03/2013 11:27:15 - Info bpfis(pid=0) done. status: 1542: An existing snapshot is no longer valid and cannot be mounted for subsequent operations

“

And the the full log says something related to the snapshot not being deleted:

"
03/03/2013 11:24:06 - Info bpbkar(pid=4844) NOT using change journal data for <D:\>: unable to validate change journal usage <reason=snapshot needed to flush open files to databases>
03/03/2013 11:25:47 - Error bptm(pid=7258) image copy failed: error 2060018: file not found
03/03/2013 11:25:47 - Info bpbkar(pid=4844) accelerator sent 270046720 bytes out of 5562443264 bytes to server, optimization 95.1%
03/03/2013 11:25:47 - Info bpbkar(pid=4844) bpbkar waited 30 times for empty buffer, delayed 5773 times.
03/03/2013 11:26:50 - Info ochlp-bak002(pid=7258) StorageServer=PureDisk:ochlp-bak002; Report=PDDO Stats for (ochlp-bak002): scanned: 5360324 KB, CR sent: 11367 KB, CR sent over FC: 0 KB, dedup: 99.8%, cache hits: 0 (0.0%)
03/03/2013 11:26:53 - Error bptm(pid=7258) media manager terminated by parent process
03/03/2013 11:26:53 - Error bpbrm(pid=5326) could not send server status message
03/03/2013 11:26:58 - Info bpbkar(pid=4844) done. status: 14: file write failed
03/03/2013 11:26:58 - end writing; write time: 00:06:49
03/03/2013 11:27:03 - Info bpbrm(pid=9186) Starting delete snapshot processing
03/03/2013 11:27:03 - Info bpfis(pid=0) Snapshot will not be deleted
file write failed(14)
03/03/2013 11:27:15 - Info bpfis(pid=280) Backup started
03/03/2013 11:27:15 - Critical bpbrm(pid=9186) from client svpdom11: cannot open C:\Program Files\Veritas\NetBackup\online_util\fi_cntl\bpfis.fim.svpdom11_1362305533.1.0
03/03/2013 11:27:15 - Info bpfis(pid=280) done. status: 1542
03/03/2013 11:27:15 - end operation
03/03/2013 11:27:15 - Info bpfis(pid=0) done. status: 1542: An existing snapshot is no longer valid and cannot be mounted for subsequent operations
"

I am sure that the problem has something to do with the accelerator as backup jobs only fail when the accelerator is enabled. No matter if is the client or the server which deduplcates the data, change journal enabled, etc.

My Solaris or Linux clients work perfectly without problems.

My environment:

1 Master 7.5.0.4 Linux
1 Media 7.5.0.4 Linux
Clients Windows/Solaris/linux 7.5.0.4

Thank for your time

Best Regards

Juasiepo · ‎04-15-2013

We upgraded to 7.5.0.5 and the problem hasn't apperared anymore. So I think it is not necessary to wait till 7.6 release.

View solution in original post

Fabrice_P_ · ‎03-04-2013

You may have old vss snapshots on your Windows machine. Can you please try to list if there is existing snapshots using either "diskshadow / list all shadows" on Windows 2008 or the "vssadmin" tool for Windows 2003.

cds171612 · ‎03-04-2013

Finally..somone else has noticed this! For me the workaround is to delete the Track file that Accelerator creates and that usually solves the issue.

However..as 90% of my clients are remote backups...deleting the track log basically means I am reseeding the client and if its a 2TB dataset I might have to let it run a week or two which is not good.

One thing I have noticed is that when I get that error I am able to run a incremental just fine..but of course I will need to run a full at some point..so that just buys me time.

Client Dedupe Enabled

Accelerator Enabled

Happens to most clients at some point, some worse than others..on any giving weekend out of 150 + clients I will have 2-3 that this happens to.

Have not tried 7.5.0.5 yet...but I am on 7.5.0.4.

cds171612 · ‎03-04-2013

OH also here is an example of a log file from a job with the same issue.

3/2/2013 11:34:45 AM - Info bpbkar32(pid=10932) change journal NOT enabled for <F:\>
3/2/2013 3:08:59 PM - Error bpbrm(pid=2220) db_FLISTsend failed: file read failed (13)
3/2/2013 3:14:15 PM - Info bpbkar32(pid=0) accelerator sent 124464640 bytes out of 656514811392 bytes to server, optimization 100.0%
3/2/2013 3:14:15 PM - Info bpbkar32(pid=0) bpbkar waited 117 times for empty buffer, delayed 93351 times.
3/2/2013 3:14:15 PM - Critical bpbrm(pid=2220) unexpected termination of client xxx-fs1.global.com
3/2/2013 3:14:20 PM - Info bpbkar32(pid=0) done. status: 13: file read faile

Juasiepo · ‎03-05-2013

Hi,

You were right, there were many snapshots in many server, some of them very old. here you are the extract of my test server:

DISKSHADOW> list shadows all

Querying all shadow copies on the computer ...

* Shadow copy ID = {18474126-9473-4333-bac7-d941f05f12d0}

- Shadow copy set: {2af66dec-3c4e-4f5d-8d3b-dba1ddc75a77}

- Original count of shadow copies = 1

- Original volume name: \\?\Volume{11647d85-7a3f-11e1-97ff-806e6

f6e6963}\ [C:\]

- Creation time: 12/05/2012 20:13:45

- Shadow copy device name: \\?\GLOBALROOT\Device\HarddiskVolumeS

hadowCopy1

- Originating machine: FQDN.servername

- Service machine: FQDN.servername

- Not exposed

- Provider ID: {b5946137-7b9f-4925-af80-51abd60b20d5}

- Attributes: No_Auto_Release Persistent Differential

Number of shadow copies listed: 1

DISKSHADOW> delete shadows all

Deleting shadow copy {18474126-9473-4333-bac7-d941f05f12d0} on volume \\?\Volume

{11647d85-7a3f-11e1-97ff-806e6f6e6963}\ from provider {b5946137-7b9f-4925-af80-5

1abd60b20d5} [Attributes: 0x00020009]...

Number of shadow copies deleted: 1

After deleting the shadow copy I have launched the backup again:

1º time without accelerator it failed with the following error:

05/03/2013 09:36:19 - Error bptm(pid=32565) image copy failed: error 2060018: file not found
05/03/2013 09:36:19 - Error bpbrm(pid=32296) db_FLISTsend failed: file read failed (13)

2º time without accelerator it finished ok
3º time with accelerator it failed wit the following error:

05/03/2013 12:01:32 - Critical bpbrm(pid=3159) from client FQDN.servername: cannot open C:\Program Files\VERITAS\NetBackup\online_util\fi_cntl\bpfis.fim.oasv222_1362480955.1.0
05/03/2013 12:01:32 - Info bpfis(pid=22752) done. status: 1542
05/03/2013 12:01:32 - end operation
05/03/2013 12:01:32 - Info bpfis(pid=0) done. status: 1542: An existing snapshot is no longer valid and cannot be mounted for subsequent operations

Then I have deleted the track log (as commented by cds171612) and launched a full backups with accelerator. This one has has finished ok (creating again the track log) and a second attempt has finished ok using the brand new accelerator's track log.

So It seems the workaround works BUT the cost is to high as it is necessary to re-create the tracklog again.

Any othet idea?

Thank you for your time.

Best regards

Mark_Solutions · ‎03-05-2013

3/2/2013 3:14:15 PM - Critical bpbrm(pid=2220) unexpected termination of client xxx-fs1.global.com
3/2/2013 3:14:20 PM - Info bpbkar32(pid=0) done. status: 13: file read faile

That looks like exactly 5 minutes to me ...

Do you still have the default client read and client connect timeouts set on your Media Servers?

It may just need increasing so that it does not shut the connections before the accellerator has done it work and prepared its file list for the client?

Hope this helps

cds171612 · ‎03-05-2013

Thanks for the comment, I checked and they are at 15 mins so thats not it. Also that job had been running since at least 11:34am(first line of my log) and the 5 mins you are talking about is when it first spit out an error.

This can happen 6 hours in on alarge client to 30 mins in on a remote client.. seems random.

However as Juasiepo mentioned although my workaround seems to resolve the issue for awhile..the cost is very high for remote backups and nothing stpos it from happeing again.

Welcome any more feedback or comments...this must be happening to others that use Accelerator?

Juasiepo · ‎03-06-2013

Just to isolate the problem. Do you have similar problem in other clients OS?

I have a mixed environment and this problem only happens in my windows clients. In Solaris or Linux all is working OK

Marianne · ‎03-06-2013

Do you perhaps have mixed case client names in policies for these clients? Or maybe had in the past?

We have seen cases where NBU got extremely 'confused' where Client names exist with lower case in one policy and upper case in another.

Handy NetBackup Links

Juasiepo · ‎03-06-2013

I don't think it is realated to client names as if I only disable the accelerator, backup runs ok

No matters if server is physical or virtual. Also the majority of backup were running with accelerator till last Friday.

Thank you for your time

Juasiepo · ‎03-06-2013

There is another post with similar problem.

https://www-secure.symantec.com/connect/forums/status-84-media-write-error-when-changing-policy-acce...

The solution proposed was:

"I copied the policy to a new policy and ran the job. It is creating a full backup image, but should complete successfully."

Moving The failing clients to a different policy is similar to deleting the track directory. Because with new policy a new track directory is generate as track directory path is policy name dependant.

But the question is... Is this a final solution or just a temporary workaround as problem can appear again after sometime.

Best regards

cds171612 · ‎03-06-2013

I only have windows servers, but what I can say is I have a large enviroment going over 2 countries with 2 completely seperate masters and I get the exact same issues on both.

If I am not mistaken, this might be related to the bug I reported in 7.5.0.1 that resulted in the Track file growing till it used up all disk space! They fixed that in 7.5.0.4, however perhaps part of the issue still remains..in that it no longer takes down your server but it still doesn't let you complete a full.

It doesn't happen to every server, and some more than others. In some cases just the 1 time.

I have tried changing every setting I can think of, but it still seems to exist in each scenario. It might be related to checkpoints..but with remote backups that needs to be enabled. That said will run a full accel with them disabled and then report back if by some miracle that works.

cds171612 · ‎03-06-2013

Nope same issue when checkpoints are disabled. So only way now to resolve is delete track log.

I did have a ticket open with symantec at one point..but it got sidetracked with some other issues and now my critical support has expired..might try and submit it again.

I_like_3_14 · ‎04-15-2013

I'm on 7.5.0.4, currently, and recently had this problem pop up on one of my Windows backup policies using accellerator. I just tested turning off accellerator and re-running the job, and it completed successfully.

The odd part is that all the incremental backups worked fine. It was only the FULL jobs that bombed.

Does anyone know if this will be fixed in the upcoming 7.6 release?

Juasiepo · ‎04-15-2013

We upgraded to 7.5.0.5 and the problem hasn't apperared anymore. So I think it is not necessary to wait till 7.6 release.

VOX

Backups failing with error 13, 14, 84, 1542 on windows clients with the accelerator enabled