
Fileserver backup with too many 'Unknown Error'

Marianne
Moderator
Partner    VIP    Accredited Certified

My turn to ask for advice...

We have a new NetBackup installation with a large fileserver to be backed up.

Because of status 1 exit codes, Accelerator and Change Journal are not used on subsequent backups, resulting in poor backup performance.

We see way too many files being skipped because of (WIN32 32: Unknown error).

Trying to add these files to the Exclude List is nearly impossible: there are new ones every day.

I looked for previous posts about this error, but some of the URLs no longer exist,
e.g. this solution from @RiaanBadenhorst: https://vox.veritas.com/t5/NetBackup/Job-Status-1-on-many-polisys/td-p/719014

Any idea on how to troubleshoot / solve this issue?


EthanH
Level 4

It feels so odd to provide tips to the Oracle of NetBackup...

Are you using DFSR? 

I changed the link in Riaan's post to veritas.com/support/en_US/article.HOWTO65638.html and it redirected to the article on configuring DFSR backups.

Any errors in Event Viewer? Maybe a Windows admin has snapshots on the server that are causing NBU issues?

Michael_G_Ander
Level 6
Certified

Could something other than NetBackup be scanning the file server?

I have seen these Unknown Errors, and they are always a pain.

In my experience it is good to start by talking with the antivirus person(s): on-access scanning or quarantining of files can produce this Unknown Error.

It can also be users leaving their documents open, so that the files are locked by another process.

If you haven't already, and it is allowed, excluding ~* might help, as it removes all the temporary files that Office products create.
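To see which names a `~*` entry catches, here is a minimal sketch that uses Python's `fnmatch` as a stand-in. It is only an approximation: NetBackup's Exclude List has its own wildcard rules, and the file names below are made-up examples of Office owner/temp files.

```python
from fnmatch import fnmatchcase

# Stand-in for the Exclude List entry discussed above. NetBackup's own
# wildcard rules may differ from Python's fnmatch.
PATTERNS = ["~*"]


def is_excluded(filename: str) -> bool:
    """Return True if the bare file name matches any pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in PATTERNS)


# Made-up names: Office owner/temp files versus ordinary documents.
for name in ["~$Budget.xlsx", "~WRL0003.tmp", "Budget.xlsx", "notes~.txt"]:
    print(f"{name:15} excluded={is_excluded(name)}")
```

Note that the pattern only matches names that *start* with `~`, so a file like `notes~.txt` is still backed up.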

If the file server contains profiles, things like Temporary Internet Files or other caches (depending on the products used) are worth excluding, if allowed.

Check that no dumps are running at the same time, even though those most often give status 156.

Make sure that the shadow storage area on the file server is big enough for snapshots of all the open files. Unfortunately, VSS has become less informative over the years.

A change of the backup window can sometimes help.

I also think there are settings for how NetBackup handles open files, such as how many times it tries to snapshot an open file and how long it waits for the file(s) to be snapshotted.

The standard questions: have you checked 1) what has changed, 2) the manual, 3) whether there are any tech notes or VOX posts regarding the issue?

jnardello
Moderator
   VIP    Certified

Just to check: I'm assuming you've already set up several policies for the fileserver, backing up the OS drives/registry and the actual fileserver drives separately, right? And that the regular OS policies run fine with Accelerator, and it's just the giant fileserver drive(s) having trouble completing?

I would recommend sticking with the goal you want to hit (Accelerator and Change Journal) and working towards it, as otherwise you're going to be skipping in-use files outright. Status 1s are usually survivable. Not preferred, but given the choice between a successful 3 TB fileserver backup with 5 skipped Word docs and a completely failed backup, I'll take the successful one.

Once you've set the goal, you can start with window tuning, VSS tuning (which I assume is needed for the large drive snapshot), policy tuning to break the giant problem directories into more manageable chunks, and schedule tuning (e.g. grandfather/father/son). Initial setups of fileserver policies are painful, and it can take more than a few runs before you find a workable solution (gotta love those overnight backups). Best of luck!

Marianne
Moderator
Partner    VIP    Accredited Certified

Thanks for all the replies Gents!

@EthanH 
This is not DFSR, so, the DFSR HOWTO article won't help.
I have asked the customer to check for Event Viewer errors.
Will report back on this.

@Michael_G_Ander 
Thanks! We have sent the AV team all of the Veritas recommendations. Will ask the customer to check all your recommendations.

@jnardello 
There are 3 large drives - 37TB, 10TB, 19TB.
Multiple policies, broken down into multiple streams and staggered to run on different days.
Our biggest problem with status 1 is that following backups will not apply Accelerator and Change Journal.
Backups then run for days instead of a few hours.
The 'Unknown Error' is bugging me - we can deal with 'File in use' or 'Permission Denied' or any other error that makes sense and can be addressed. I don't know where to start or how to eliminate this. Different files every day.
We will look into VSS tuning, thanks.

I wonder if anyone is backing up fileservers without VSS snapshots (disabling it in Client Attributes)?

Marianne
Moderator
Partner    VIP    Accredited Certified

Just a short update:

The Veritas Support Engineer shared this TN:
https://www.veritas.com/content/support/en_US/article.100032194

So, this and many other wildcard items have been added to Exclude List.
Backups now complete successfully, but backups are still very slow... (The biggest complaint is that Data Protector backups were MUCH faster.)
I see in the bpbkar log that a LOT of time is spent evaluating the Exclude List (catch-22?).

I have also spent a bit of time to look for 'VSS tuning' info.
In a 10-year-old forum post, a user claimed that non-VSS backups are considerably faster.

Our customer confirmed this morning that they were doing non-VSS backups with Data Protector and asked if we could disable snapshot backups in NBU.

My hesitation with disabling WOFB (Windows Open File Backup) in Client Attributes is that we could see more skipped files than before, again resulting in status 1, with Change Journal and Accelerator not being used.

Curious to know what other backup admins are doing on their physical, non-DFSR fileservers w.r.t. VSS tuning?
Does it make a major difference when a separate drive letter is assigned for snapshots?
(I have realised that the default is to use the same drive with limited space.)

Thanks again for everyone's assistance.

Lowell_Palecek

I'm interested in the exclude list time. Is it the chief reason your backups are slow? Please post a short snippet here to get me oriented in the code. There are ways to build lists that don't scale well. I'd like to make sure that's not a factor here.
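To illustrate the scaling concern: if every file is tested against every wildcard in turn, the cost grows with files × patterns. Below is a minimal sketch of one common alternative, using a hypothetical `ExcludeMatcher` class (this is not how bpbkar works internally). Simple `*.ext` patterns go into a set, so each file costs one hash lookup plus a scan of only the remaining patterns.

```python
from fnmatch import fnmatchcase


class ExcludeMatcher:
    """Hypothetical case-insensitive exclude matcher (not bpbkar's code).

    '*.ext' patterns become set members, so checking them costs one hash
    lookup per file no matter how many extensions are listed. All other
    patterns are still tested one at a time.
    """

    def __init__(self, patterns):
        self.extensions = set()
        self.others = []
        for pattern in (p.lower() for p in patterns):
            ext = pattern[2:]
            if pattern.startswith("*.") and ext and not any(c in ext for c in "*?[."):
                self.extensions.add(ext)
            else:
                self.others.append(pattern)

    def excluded(self, filename):
        name = filename.lower()
        _, dot, ext = name.rpartition(".")
        if dot and ext in self.extensions:
            return True
        return any(fnmatchcase(name, p) for p in self.others)


matcher = ExcludeMatcher(["*.jpg", "*.bmp", "*.log", "~*"])
print(matcher.excluded("FileA.JPG"), matcher.excluded("report.docx"))
```

The trade-off is that only plain `*.ext` entries benefit; entries with directories or several wildcards still cost a full pass over the remaining list.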

I wonder whether you can use vssadmin to test your ideas for speeding up the snapshot.

It seems what you want is a configuration option to not disable Accelerator on status 1, or at least not on busy files. You could propose that to Veritas program management. It feels a bit scary.

quebek
Moderator
   VIP    Certified

Hello

Are you running Full backups all the time, or are some CINC or DINC schedules configured too?

What is the setting for this client under Archive Bit/Incrementals: based on archive bit or on timestamp? Did you check this? Is Use Change Journal selected?

 

Marianne
Moderator
Partner    VIP    Accredited Certified

@Lowell_Palecek wrote:

I'm interested in the exclude list time. Is it the chief reason your backups are slow? Please post a short snippet here to get me oriented in the code. There are ways to build lists that don't scale well. I'd like to make sure that's not a factor here.


Hi @Lowell_Palecek 
Thanks for taking interest.
I am confident that the way bpbkar handles the Exclude List is contributing to the lengthy backup time.
The Exclude List has, for example, *.jpg, *.bmp and *.log; the result in the bpbkar (level 0) log is literally hundreds of thousands of lines (807615 lines) with Excluded: entries over the 4 days that the backup ran.
I have extracted all entries with the same PID into a single TextPad file - the size is 153MB
(All bpbkar logs uploaded to Case 201007-000445)

22:07:58.106 [4456.5372] <2> tar_base::V_vTarMsgW: INF - Excluded: Path\FileA.JPG
....
....
All the way up to midnight:
23:59:59.879 [4456.5372] <2> tar_base::V_vTarMsgW: INF - Excluded: Path\fileBB.bmp
Carrying on into the next morning:
00:00:49.098 [4456.5372] <2> tar_base::V_vTarMsgW: INF - Excluded: Path\Log\xxxxx.log

Like I said - 807615 lines containing 'Excluded' over 4 days across multiple bpbkar files.
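Tallying those entries by extension shows which patterns do the most work. Here is a minimal sketch, assuming plain-text legacy bpbkar logs in the line format shown above. `count_exclusions` and `count_files` are hypothetical helpers, not NetBackup tools.

```python
import re
from collections import Counter

# Line format taken from the excerpts above, e.g.
# 22:07:58.106 [4456.5372] <2> tar_base::V_vTarMsgW: INF - Excluded: Path\FileA.JPG
EXCLUDED_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d{3} \[(?P<pid>\d+)\.\d+\] <\d+> "
    r"tar_base::V_vTarMsgW: INF - Excluded: (?P<path>.+)$"
)


def count_exclusions(lines, pid=None):
    """Count 'Excluded:' entries by lower-case file extension."""
    by_ext = Counter()
    for line in lines:
        m = EXCLUDED_RE.match(line.rstrip("\r\n"))
        if not m or (pid is not None and m["pid"] != pid):
            continue
        name = m["path"].rsplit("\\", 1)[-1]
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
        by_ext[ext] += 1
    return by_ext


def count_files(paths, pid=None):
    """Sum the counts over several bpbkar log files."""
    total = Counter()
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            total.update(count_exclusions(f, pid))
    return total
```

Pass `count_files` the bpbkar log files for the backup window, with `pid="4456"`, and call `.most_common(10)` on the result to see the top excluded extensions.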

An hour before the backup completed on day 4, we see this entry:

05:42:22.781 [4456.5372] <16> file_access::V_OpenForRead: ERR - CreateFile() failed: \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy21\User\ABCD\! File.scr (WIN32 5: Access is denied. )
05:42:22.797 [4456.5372] <2> tar_base::V_vTarMsgW: WRN - can't open file: Path\User\ABCD\! File.scr (WIN32 5: Unknown error)
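The two lines above describe the same failure: the ERR line carries the real text ("Access is denied") while the WRN line prints "Unknown error" for the same code 5. That suggests the Win32 number is the reliable part of these warnings. By the same table, WIN32 32 from the original question is ERROR_SHARING_VIOLATION, meaning the file is in use by another process. Here is a minimal sketch that groups the "can't open file" warnings by code, assuming the WRN line format shown above. `open_failures` is a hypothetical helper.

```python
import re
from collections import Counter

# A few Win32 system error codes (winerror.h) seen when opening files.
WIN32_NAMES = {
    2: "ERROR_FILE_NOT_FOUND",
    3: "ERROR_PATH_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    32: "ERROR_SHARING_VIOLATION",
    33: "ERROR_LOCK_VIOLATION",
}

# Matches the WRN line format shown above; the message text after the code
# is ignored because it may just say "Unknown error".
CANT_OPEN_RE = re.compile(r"WRN - can't open file: (?P<path>.+?) \(WIN32 (?P<code>\d+):")


def open_failures(lines):
    """Tally 'can't open file' warnings by Win32 error name."""
    tally = Counter()
    for line in lines:
        m = CANT_OPEN_RE.search(line)
        if m:
            code = int(m["code"])
            tally[WIN32_NAMES.get(code, f"WIN32 {code}")] += 1
    return tally
```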

Backup stats on day 4:

06:50:32.343 [4456.5372] <2> tar_base::backup_finish: TAR - backup: 5537496 files
06:50:32.343 [4456.5372] <2> tar_base::backup_finish: TAR - backup: file data: 892189095 bytes 8265 gigabytes
06:50:32.343 [4456.5372] <2> tar_base::backup_finish: TAR - backup: image data: 387741696 bytes 8288 gigabytes
06:50:32.343 [4456.5372] <2> tar_base::backup_finish: TAR - backup: elapsed time: 291038 secs 30578687 bps
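The day-4 totals can be cross-checked. Assume each TAR counter is whole gigabytes (GiB) plus a remainder in bytes. Under that assumption, image data divided by elapsed time reproduces the logged rate exactly. This suggests the "bps" value is bytes per second: about 29 MiB/s, sustained for roughly 81 hours.

```python
GIB = 1024 ** 3


def counter_bytes(gigabytes: int, remainder_bytes: int) -> int:
    """Assumed encoding: whole GiB plus leftover bytes, as in the TAR lines."""
    return gigabytes * GIB + remainder_bytes


elapsed_s = 291038
image_bytes = counter_bytes(8288, 387741696)
file_bytes = counter_bytes(8265, 892189095)

rate = image_bytes // elapsed_s
print(rate)                               # 30578687, same as the "bps" figure
print(round(rate / 1024 ** 2, 1))         # 29.2 (MiB per second)
print(round(elapsed_s / 3600, 1))         # 80.8 (hours)
print(round(5537496 / elapsed_s, 1))      # 19.0 (files per second)
```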

 


@Lowell_Palecek wrote:

I wonder whether you can use vssadmin to test your ideas for speeding up the snapshot.



I have asked the customer if a separate drive/volume could be provisioned where snapshots can be directed to.
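To test the vssadmin idea, the built-in `vssadmin list shadowstorage` and `vssadmin add shadowstorage` commands can show and change where a volume's shadow copy storage lives. Here is a sketch that only builds the command lines, and runs them optionally. Several details are assumptions: the drive letters E: and S: and the 500GB size are placeholders, `add shadowstorage` is available on Windows Server only, and an existing association for the volume may have to be removed or resized first.

```python
import subprocess
import sys


def shadowstorage_cmd(action, for_vol, on_vol=None, max_size=None):
    """Build a 'vssadmin <action> shadowstorage' command line as a list."""
    cmd = ["vssadmin", action, "shadowstorage", f"/For={for_vol}"]
    if on_vol:
        cmd.append(f"/On={on_vol}")
    if max_size:
        cmd.append(f"/MaxSize={max_size}")
    return cmd


def run(cmd, dry_run=True):
    """Print the command; only execute it on Windows when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run and sys.platform == "win32":
        # Needs an elevated (administrator) session on the file server.
        subprocess.run(cmd, check=True)


# Hypothetical volumes: E: holds the data, S: is a dedicated snapshot volume.
run(shadowstorage_cmd("list", "E:"))
run(shadowstorage_cmd("add", "E:", on_vol="S:", max_size="500GB"))
```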


@Lowell_Palecek wrote:

It seems what you want is a configuration option to not disable Accelerator on status 1, or at least not on busy files. You could propose that to Veritas program management. It feels a bit scary.


Yes! A request for enhancement will be logged with PM via our local SE.
NOT for the faint-hearted!

 

Lowell_Palecek

The Windows API CreateFile is badly named. It just means open the file, with options for whether to create it if it doesn't exist. I'll bet what happened is that the copy-on-write snapshot ran out of space to save updated blocks. When that happens, the OS deletes the snapshot without telling any process such as bpbkar32 that's using it.
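The mechanism can be made concrete with a toy model. This is a simplified sketch of copy-on-write behaviour, not the real volsnap driver: `CowSnapshot` and `diff_area_blocks` are hypothetical, and a real diff area can grow up to its configured maximum before the snapshot is dropped.

```python
class SnapshotDeleted(Exception):
    pass


class CowSnapshot:
    """Toy copy-on-write snapshot with a fixed-size diff area."""

    def __init__(self, volume, diff_area_blocks):
        self.volume = volume              # live volume: list of block contents
        self.capacity = diff_area_blocks  # how many original blocks fit
        self.saved = {}                   # block index -> original content
        self.alive = True

    def write_live(self, index, data):
        # First write to a block after the snapshot: preserve the old content.
        if self.alive and index not in self.saved:
            if len(self.saved) >= self.capacity:
                # Diff area full: the snapshot is discarded; readers are not told.
                self.alive = False
                self.saved.clear()
            else:
                self.saved[index] = self.volume[index]
        self.volume[index] = data

    def read_snapshot(self, index):
        if not self.alive:
            raise SnapshotDeleted("shadow copy no longer exists")
        return self.saved.get(index, self.volume[index])


volume = ["a", "b", "c", "d"]
snap = CowSnapshot(volume, diff_area_blocks=2)
snap.write_live(0, "A")
print(snap.read_snapshot(0))   # a  (original preserved in the diff area)
snap.write_live(1, "B")
snap.write_live(2, "C")        # third distinct block: diff area is full
try:
    snap.read_snapshot(3)
except SnapshotDeleted as exc:
    print("backup read fails:", exc)
```

In this model the backup reader only notices when its next read fails. That matches the sudden CreateFile() failure on the HarddiskVolumeShadowCopy path above.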

Marianne
Moderator
Partner    VIP    Accredited Certified

@quebek wrote:

Hello

Are you running FULL all the time or there are some CINC or DINC configured too?

What is the setting for this client for Archive Bit/Incrementals? based on archive bit or timestamp? did you check this out? is Use Change Journal selected? 

 


Hi @quebek 

Backups are configured for Full followed by DINC.
Incrementals are configured for timestamp and Change Journal.

The problem is with Status 1, then Accelerator and Change Journal are not considered when subsequent backups run.

Info bpbkar32 (pid=4456) not using change journal data for <PATH>: unable to validate change journal usage <reason=previous backup wasn't a successful backup>
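The rule in that message can be modelled in a few lines. This is a simplified sketch inferred only from the log text: `PreviousBackup` and `change_journal_usable` are hypothetical, and the real bpbkar32 checks are likely more involved. It shows why a single status 1 forces the next run back to a full file system walk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PreviousBackup:
    status: int  # NetBackup exit status: 0 = successful, 1 = partially successful


def change_journal_usable(prev: Optional[PreviousBackup],
                          journal_active: bool) -> Tuple[bool, str]:
    """Simplified rule inferred from the bpbkar32 message above."""
    if not journal_active:
        return False, "change journal not active on the volume"
    if prev is None:
        return False, "no previous backup to compare against"
    if prev.status != 0:
        # Status 1 counts as "not successful" here, so in this model the
        # next run falls back to walking the whole file system.
        return False, "previous backup wasn't a successful backup"
    return True, "using change journal data"


print(change_journal_usable(PreviousBackup(status=1), journal_active=True))
```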

 

Marianne
Moderator
Partner    VIP    Accredited Certified

@Lowell_Palecek wrote:

...... I'll bet what happened is that the copy-on-write snapshot ran out of space to save updated blocks. When that happens, the OS deletes the snapshot without telling any process such as bpbkar32 that's using it.


I wish there was a way to find evidence...

Lowell_Palecek

There may be something in the Event log when the OS removes a snapshot for lack of space. Otherwise the only evidence is that suddenly bpbkar32 can't access it.

I looked into exclude list processing. I don't see a major thing that I could improve. I found two minor things...

1. Check the "Use case sensitive exclude list" option in the client host properties. Otherwise NetBackup converts every file name to upper case before looking for it in the exclude list. (You will then have to be careful to get the case right in your list, or make 2 entries for some.)

2. Whoever is handling your case could build an EEB that doesn't log all 800,000 exclusions. It could log the first 100 and then every 1000th one. This probably isn't worth it: 800,000 log lines don't take hours to write.
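Both points can be sketched briefly. `make_matcher` and `should_log` are hypothetical helpers, and Python's `fnmatch` stands in for NetBackup's matching. The first helper shows why case-sensitive mode needs two entries for `*.jpg` and `*.JPG`. The second shows how much the sampled logging would cut.

```python
from fnmatch import fnmatchcase


def make_matcher(patterns, case_sensitive):
    """Hypothetical sketch of the two exclude-list modes described above."""
    if case_sensitive:
        pats = list(patterns)

        def match(name):
            return any(fnmatchcase(name, p) for p in pats)
    else:
        pats = [p.upper() for p in patterns]

        def match(name):
            upper = name.upper()  # extra conversion for every file examined
            return any(fnmatchcase(upper, p) for p in pats)
    return match


def should_log(n, first=100, every=1000):
    """Log exclusion number n (1-based): the first `first`, then every `every`-th."""
    return n <= first or n % every == 0


logged = sum(should_log(n) for n in range(1, 807616))
print(logged)  # 907 lines instead of 807615
```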

Marianne
Moderator
Partner    VIP    Accredited Certified

@Lowell_Palecek 

Thank you.

Our case engineer is on training for 2 days.
Will hopefully be able to speak to him tomorrow.

quebek
Moderator
   VIP    Certified

Hello Marianne

I asked this just to double check; sometimes I find myself forgetting basic things, so double checking won't harm.

About the journal: how was it created on this box? By the Wintel admin or by NBU itself?

Maybe there are also too many changes and it is being overwritten, even though the log indicates reason=previous backup wasn't a successful backup.

Can you verify the output of

fsutil usn queryjournal <drive_letter>:

for each drive letter?

Also, I am not sure what this means in the output from the above:

Write range tracking: Disabled

If it were enabled, what could make use of it? Do you have any idea?
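To act on the overwrite concern, the `fsutil usn queryjournal` output can be parsed and compared with a saved position. Here is a minimal sketch under stated assumptions. The `SAMPLE` text is illustrative: the field names follow recent Windows versions and the values are made up. `journal_usable`, `saved_journal_id` and `saved_usn` are hypothetical, since where NetBackup keeps its own last USN is not covered in this thread.

```python
# Illustrative output only: field names follow recent Windows versions and
# the values are made up. Older builds print fewer fields.
SAMPLE = """\
Usn Journal ID   : 0x01d2f0e4a5b6c7d8
First Usn        : 0x0000000000800000
Next Usn         : 0x0000000001a2b3c0
Lowest Valid Usn : 0x0000000000000000
Max Usn          : 0x7fffffffffff0000
Maximum Size     : 0x0000000020000000 (512.0 MB)
Allocation Delta : 0x0000000000400000 (4.0 MB)
Minimum record version supported : 2
Maximum record version supported : 4
Write range tracking: Disabled
"""


def parse_queryjournal(text):
    """Turn 'Key : value' lines into a dict; hex values become ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        token = value.split()[0] if value else ""
        fields[key.strip()] = int(token, 16) if token.lower().startswith("0x") else value
    return fields


def journal_usable(fields, saved_journal_id, saved_usn):
    """Hypothetical check of a saved position against the live journal."""
    if "First Usn" not in fields:
        return False, "no active journal (fsutil reported an error?)"
    if fields.get("Usn Journal ID") != saved_journal_id:
        return False, "journal was deleted and recreated"
    if saved_usn < fields["First Usn"]:
        return False, "journal wrapped: records since the saved USN were purged"
    return True, "ok"


fields = parse_queryjournal(SAMPLE)
print(fields["Maximum Size"] // 2**20, "MiB;", "Write range tracking:", fields["Write range tracking"])
```

A saved USN that is older than `First Usn` is the "being overwritten" case. On a 37TB volume with heavy churn, that points to a journal whose maximum size is too small.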

quebek
Moderator
   VIP    Certified

One more idea...

Did you implement this? https://www.veritas.com/support/en_US/article.100004864

That is, excluding the NBU processes from the antivirus software.

Marianne
Moderator
Partner    VIP    Accredited Certified

@quebek 

Thanks for your assistance.
To be 100% honest, I never asked about Change Journal existence.
I somehow assumed that it is there by default (..blush..) and simply went with NBU-side configuration as per Admin Guide I.
(In my defence - I am more of a Unix admin.)
I will ask the customer to run the fsutil command against all drive letters to check and confirm.

The AntiVirus exclusions were done right in the beginning, thanks.

Marianne
Moderator
Partner    VIP    Accredited Certified

Just an update about this issue:

The Veritas engineer told us the following:

If NetBackup is changed to use the Change Journal, and it is not enabled on the volume, then Windows enables it.

Note: when "Use Change Journal" is enabled from NetBackup for a Windows client, it is enabled for all volumes on that client. If the NTFS Change Journal is already enabled for an NTFS volume, NetBackup will not modify that instance of the NTFS Change Journal. If the NTFS Change Journal is not enabled for an NTFS volume, NetBackup will create an instance of the NTFS Change Journal for that volume. At the same time, NetBackup will start monitoring those NTFS Change Journal instances.

If "Use Change Journal" is disabled from NetBackup, the NTFS Change Journal instances are not disabled on the client, as they may be used outside of NetBackup. However, NetBackup will stop monitoring the NTFS Change Journal instances.
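The asymmetry in that note (enabling may create journals, but disabling never removes them) can be written as a toy state model. `set_use_change_journal` and the volume dictionary are hypothetical and only restate the quoted behaviour.

```python
def set_use_change_journal(volumes, enabled):
    """Toy model of the quoted note; volumes maps a drive to its state."""
    for state in volumes.values():
        if enabled:
            if not state["journal"]:
                # No journal on this volume yet: NetBackup creates one.
                state["journal"] = True
            # An existing journal is not modified; either way it is monitored.
            state["monitored"] = True
        else:
            # Journals are left in place (other tools may use them);
            # NetBackup only stops monitoring them.
            state["monitored"] = False
    return volumes


vols = {"C:": {"journal": True, "monitored": False},
        "E:": {"journal": False, "monitored": False}}
set_use_change_journal(vols, enabled=True)
set_use_change_journal(vols, enabled=False)
print(vols)
```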

So, after battling for weeks to get constructive feedback from the customer, he emailed this morning:

"The backups seem to be running a lot better now

You can proceed to close this case "

So, if you ask me what exactly made the difference?
I don't know. Probably Change Journal and updated Exclude List.
Just happy that our customer is 'happy'!