DFSR backup jobs that fail with status code 1 show an NBU error in BAR

Sergio_Millan
Level 4

Hi,

I'm running a few DFSR share backups using a backup selection such as Shadow Copy Components:\User Data\Distributed File System Replication\DfsrReplicatedFolders\HomeShare\Home2. These always go to a disk target (OpenStorage) and use Accelerator.

Backups of the DFSR share usually run clean and finish with status 0, but sometimes they complete with status 1. I can usually rerun the backup the next morning and it's happy.

The peculiar thing is that if the job ends with status 1 and the logs don't show any actual files skipped, BAR shows an error when I try to browse the image for recovery:

"Unable to obtain list of files using the specified search criteria"

The "skipped" files seem to be the Shadow Copy Components themselves, resulting in a non-recoverable image. The logs for this scenario are as follows:

11/30/2016 5:00:00 PM - Info nbjm(pid=4396) starting backup job (jobid=2147876) for client cadc3dfsp004, policy DFSR_HomeShare_Home2, schedule Daily_DiffInc_Disk
11/30/2016 5:00:00 PM - Info nbjm(pid=4396) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=2147876, request id:{E5CDA389-208A-444D-820C-23AB30B45037})
11/30/2016 5:00:00 PM - requesting resource CACALDD002_PrimarySTU
11/30/2016 5:00:00 PM - requesting resource bkpnbmaster1.nal.local.NBU_CLIENT.MAXJOBS.cadc3dfsp004
11/30/2016 5:00:00 PM - requesting resource bkpnbmaster1.nal.local.NBU_POLICY.MAXJOBS.DFSR_HomeShare_Home2
11/30/2016 5:00:00 PM - granted resource bkpnbmaster1.nal.local.NBU_CLIENT.MAXJOBS.cadc3dfsp004
11/30/2016 5:00:00 PM - granted resource bkpnbmaster1.nal.local.NBU_POLICY.MAXJOBS.DFSR_HomeShare_Home2
11/30/2016 5:00:00 PM - granted resource MediaID=@aaaaX;DiskVolume=CACALDD002_STU;DiskPool=CACALDD002_Production;Path=CACALDD002_STU;StorageServer=CACALDD002.nal.local;MediaServer=cacalnbu005.nal.local
11/30/2016 5:00:00 PM - granted resource CACALDD002_PrimarySTU
11/30/2016 5:00:02 PM - estimated 79986824 Kbytes needed
11/30/2016 5:00:02 PM - Info nbjm(pid=4396) started backup (backupid=cadc3dfsp004_1480550402) job for client cadc3dfsp004, policy DFSR_HomeShare_Home2, schedule Daily_DiffInc_Disk on storage unit CACALDD002_PrimarySTU
11/30/2016 5:00:04 PM - started process bpbrm (1768)
11/30/2016 5:00:10 PM - Info bpbrm(pid=1768) cadc3dfsp004 is the host to backup data from
11/30/2016 5:00:10 PM - Info bpbrm(pid=1768) reading file list for client
11/30/2016 5:00:11 PM - Info bpbrm(pid=1768) accelerator enabled
11/30/2016 5:00:17 PM - connecting
11/30/2016 5:00:20 PM - Info bpbrm(pid=1768) starting bpbkar32 on client
11/30/2016 5:00:20 PM - connected; connect time: 0:00:03
11/30/2016 5:00:22 PM - Info bpbkar32(pid=3000) Backup started
11/30/2016 5:00:22 PM - Info bpbkar32(pid=3000) change time comparison:<enabled>
11/30/2016 5:00:22 PM - Info bpbkar32(pid=3000) accelerator enabled backup, archive bit processing:<disabled>
11/30/2016 5:00:22 PM - Info bptm(pid=2280) start
11/30/2016 5:00:22 PM - Info bptm(pid=2280) using 262144 data buffer size
11/30/2016 5:00:22 PM - Info bptm(pid=2280) setting receive network buffer to 1049600 bytes
11/30/2016 5:00:22 PM - Info bptm(pid=2280) using 30 data buffers
11/30/2016 5:00:25 PM - Info bptm(pid=2280) start backup
11/30/2016 5:00:25 PM - Info bptm(pid=2280) backup child process is pid 6572.2260
11/30/2016 5:00:25 PM - Info bptm(pid=6572) start
11/30/2016 5:00:25 PM - begin writing
11/30/2016 5:00:27 PM - Info bpbkar32(pid=3000) not using change journal data for <Shadow Copy Components:\User Data\Distributed File System Replication\DfsrReplicatedFolders\HomeShare\Home2>: not supported for non-local volumes / file systems
11/30/2016 5:41:51 PM - Warning bpbrm(pid=1768) from client cadc3dfsp004: WRN - can't open object: Shadow Copy Components: (BEDS 0xE000FECB: A failure occurred accessing the backup component document.)
11/30/2016 5:41:51 PM - Warning bpbrm(pid=1768) from client cadc3dfsp004: WRN - can't open object: Shadow Copy Components:\User Data\Distributed File System Replication\DfsrReplicatedFolders\HomeShare (BEDS 0xE000FEDD: A failure occurred accessing the object list.)
11/30/2016 5:47:33 PM - Info bptm(pid=2280) waited for full buffer 1226 times, delayed 178286 times
11/30/2016 5:47:33 PM - Info bpbkar32(pid=3000) accelerator sent 1808810496 bytes out of 8746298368 bytes to server, optimization 79.3%
11/30/2016 5:47:34 PM - Info bptm(pid=2280) EXITING with status 0 <----------
11/30/2016 5:47:35 PM - Info bpbrm(pid=1768) validating image for client cadc3dfsp004
11/30/2016 5:47:37 PM - Info bpbkar32(pid=3000) done. status: 1: the requested operation was partially successful
11/30/2016 5:47:37 PM - end writing; write time: 0:47:12
the requested operation was partially successful(1)

The job was successfully completed, but some files may have been
busy or inaccessible. See the problems report or the client's logs for more details.
12/1/2016 8:23:46 AM - job 2147876 was restarted as job 2148876
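As a quick sanity check on the accelerator line in the job details above, the 79.3% figure can be reproduced from the two byte counts, assuming "optimization" means the share of examined data that Accelerator did not have to send. It also shows that about 1.8 billion bytes did reach the storage unit, so the image is not empty even though BAR can't list it.

```python
# Byte counts copied from the bpbkar32 accelerator line in the job details.
sent_bytes = 1_808_810_496      # bytes actually sent to the server
scanned_bytes = 8_746_298_368   # total bytes reported for the selection

# Assumption: optimization = fraction of the data Accelerator did NOT send.
optimization = 1 - sent_bytes / scanned_bytes
print(f"optimization {optimization:.1%}")  # optimization 79.3%
```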

 

Any clue why this might be happening, and why it isn't reported as a full failure ("backup bad, not usable") rather than a status 1?

Thank you,

Accepted Solution

Marianne
Moderator
Partner    VIP    Accredited Certified
I have learned never to ignore a status 1. Under certain circumstances it actually means a failure, as in this case: nothing was backed up.

Make sure you have the bpbkar log enabled on the client so you can investigate.


10 Replies

Lowell_Palecek
Level 6
Employee

If the job gets far enough to call bpbkar, you can get status 1 with no actual data files, as long as something got sent to the backup storage.

Bpbkar backs up one file at a time; it doesn't see the big picture. If it fails to back up a particular file simply because it can't find it or can't access it, it sets status 1: not all the files were backed up. Systemic errors, such as communication failures or data corruption, do produce hard errors.

You should always check the reason for a status 1 result. If the job details don't tell you enough to understand what was missed, check the bpbkar log. As the job message says, "See the problems report or the client's logs for more details."
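The per-file rule described above can be modelled in a few lines. This is a simplified, hypothetical sketch (the names ItemResult, SystemicError and job_status are invented for illustration and are not bpbkar's actual code): each object is judged on its own, so a walk in which every data-bearing object was skipped still comes out as status 1, while a systemic error escapes as an exception instead of being folded into the status.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ItemResult:
    """Outcome for one object in the backup selection (hypothetical model)."""
    path: str
    backed_up: bool  # False when the object was not found or could not be opened


class SystemicError(Exception):
    """Hypothetical stand-in for a communication failure or data corruption."""


def job_status(results: Iterable[ItemResult]) -> int:
    """Apply the per-object rule described above; not bpbkar's actual code.

    Each object is judged on its own: any missed object downgrades the job
    to status 1, and nothing checks whether any data file made it through.
    A SystemicError raised while iterating is deliberately not caught, so it
    surfaces as a hard failure rather than a status 1.
    """
    status = 0
    for item in results:
        if not item.backed_up:
            status = 1  # "not all the files were backed up"
    return status


# Every data-bearing object in this walk was skipped, yet the rule yields 1.
walk = [
    ItemResult("Shadow Copy Components:", backed_up=False),
    ItemResult(
        r"Shadow Copy Components:\User Data\Distributed File System Replication"
        r"\DfsrReplicatedFolders\HomeShare",
        backed_up=False,
    ),
]
print(job_status(walk))  # 1
```

The point of the model is the missing "big picture" check: status 1 only says that at least one object was missed, not how much of the selection was actually protected.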

 

Hi,

Thank you for the reply and suggestions. I'm a bit perplexed by the fact that it doesn't happen all the time, and I can rerun the job shortly after the status 1 and it's happy. Whether it shows status 1 or 0, the KB and file counts do increment; it's just BAR that can't or won't read the result. This is different from other scenarios where I get a list of the locked/skipped files by path, resulting in a status 1; in those cases BAR reads the image and shows its contents.

Can I ask, then, how you would interpret the following problem log details:

Under Description:

from client xyz: WRN - can't open object: Shadow copy components: (BEDS 0xe000fecb: A failure occurred accessing the backup component document)

from client xyz: WRN - can't open object: Shadow copy components:\User Data\Distributed File System Replication\DfsrReplicatedFolders\HomeShare (BEDS 0xe000fedd: A failure occurred accessing the object list.)

backup of client xyz exited with status 1

**********

Thank you very much for your time and feedback,

"From client XYZ" messages are posted by bpbrm from various client processes such as bpbkar32.exe. For more information about the failure, check your bpbkar log.

The BAR interface makes an internal BPLIST call to get the files it shows you. You can try two things to find out why it doesn't get anything. First, run your own bplist from the command line. Second, find the BPLIST request in the bprd log and match it with the Q_IMAGE_LIST_FILES queries in the bpdbm log. In the bprd log, the work centers on the "listfiles" entries; in the bpdbm log, it centers on the ImageListFiles::executeQuery entries.
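Once you have collected the bprd and bpdbm logs, a small filter can pull out just the lines around the BAR file-list request. This sketch only searches for the marker strings named above ("listfiles" in bprd, Q_IMAGE_LIST_FILES and ImageListFiles::executeQuery in bpdbm); the names MARKERS, matching_lines and scan_log are hypothetical, and it makes no assumption about where the logs live or how they are named.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# Marker strings taken from the post; the file you pass in is up to you.
MARKERS = {
    "bprd": ("listfiles",),
    "bpdbm": ("Q_IMAGE_LIST_FILES", "ImageListFiles::executeQuery"),
}


def matching_lines(lines: Iterable[str], markers: Sequence[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for each line containing any of the markers."""
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in markers):
            yield number, line.rstrip("\n")


def scan_log(path: str, kind: str) -> List[Tuple[int, str]]:
    """Return matching lines from a bprd or bpdbm log file you have collected."""
    # errors="replace" keeps the scan going if the log holds stray non-UTF-8 bytes.
    with open(path, encoding="utf-8", errors="replace") as log:
        return list(matching_lines(log, MARKERS[kind]))
```

For example, scan_log(path_to_bprd_log, "bprd") returns the numbered "listfiles" lines, which you can then line up by timestamp with the output of scan_log(path_to_bpdbm_log, "bpdbm").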

I suggest starting with the bpbkar log before trying to decipher the bprd and bpdbm logs.

I don't know enough about DFSR processing to speculate on the error. What versions of NetBackup and Windows are you running?

sdo
Moderator
Partner    VIP    Certified

I spoke with Veritas Support about this "BEDS" error, which results in a partial status 1. They said it was by design. I tried very hard to convince them that it was an error, but they wouldn't budge.

My advice, like Marianne's: always check your status 1 jobs, especially any status 1 on DFSR backups. If you see a status 1 and a BEDS error, my suspicion is that the data will be unrecoverable.

IMO this is semi-silent backup data loss, but as I said, Veritas Support disagreed with me on that point.
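Following that advice, a rough triage helper can flag jobs whose details combine a status 1 with a BEDS "can't open object" warning. This is a heuristic sketch, not a Veritas tool: the regular expressions are modelled on the job-details lines quoted in the original post (the bpbkar32 "done. status:" line and the bpbrm WRN lines), and the names Triage and triage are hypothetical. A flagged job still needs a test restore to confirm whether the image is usable.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed line shapes, copied from the job details in the original post:
#   "... bpbkar32(pid=3000) done. status: 1: the requested operation ..."
#   "... WRN - can't open object: <object> (BEDS 0xE000FECB: <message>)"
STATUS_RE = re.compile(r"done\. status: (\d+)")
BEDS_RE = re.compile(r"can't open object: (.*?) \(BEDS (0x[0-9A-Fa-f]+)")


@dataclass
class Triage:
    status: Optional[int] = None                          # bpbkar32 exit status, if found
    beds_errors: List[str] = field(default_factory=list)  # "code on object" entries

    @property
    def suspect(self) -> bool:
        """Heuristic: a status 1 together with any BEDS error deserves a test restore."""
        return self.status == 1 and bool(self.beds_errors)


def triage(job_details: str) -> Triage:
    """Scan pasted job-details text for the bpbkar32 status and BEDS warnings."""
    result = Triage()
    for line in job_details.splitlines():
        status_match = STATUS_RE.search(line)
        if status_match:
            result.status = int(status_match.group(1))
            continue
        beds_match = BEDS_RE.search(line)
        if beds_match:
            obj, code = beds_match.groups()
            result.beds_errors.append(f"{code} on {obj}")
    return result
```

Paste a job's details into a string and check triage(text).suspect. For the job in the original post, the two BEDS warnings and the bpbkar32 status 1 would set it, while the bptm "EXITING with status 0" line is ignored because it does not match the "done. status:" pattern.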

manatee
Level 6

Backing up DFSR partitions is tricky, and sometimes the errors are misleading. In my case, I get a lot of errors from Shadow Copy Components but no VSS writer errors. So to find out whether I actually had a backup, I did a random restore. It worked fine.

"BEDS" stands for Backup Exec Data Store. The Backup Exec name appears because NetBackup Windows clients share code with Backup Exec for some applications: Exchange, SharePoint, Windows file systems, VSS snapshots other than hardware providers, BMR, and VMware and Hyper-V processing.

Whether a BEDS error in the log is a real error or a design quirk depends on the situation; we would have to see the relevant logs to evaluate it. A blanket statement that BEDS error logs are by design is not true, but some errors are just responses to NetBackup queries. For example, during initialization BEDS tries to create a list of objects ("DLEs") for every possible "file system" type, so on an Exchange server you'll see "errors" about not finding Oracle.

Very often, the error cited in a support request isn't the real problem. Either the root error shows earlier in the log, or the cited "error" is just an internal report like the one described above and a real failure occurs later. I would have to see the full log to offer specific advice. In this particular case, I may or may not be able to help, depending on the error. I know the code around the edges of DFSR processing, but not much about DFSR details.

Sorry for the delayed reply. This is on NBU 7.6.0.2 on Windows Server 2008 R2.

Thank you all for the feedback. I agree that the only real option is to test the image with a restore and verify it.

Cheers

sdo
Moderator
Partner    VIP    Certified

Apologies, I didn't mean to imply that I was told "BEDS" errors are by design. What I meant was that I was told the backup job's error handling, by design, can treat a BEDS error as equivalent to a partial status 1 for the backup job.

I cannot see how a backup job that completely gives up partway through can consider itself a partial status 1. The code cannot know whether the error occurred at the end, in the middle, or at the very start of the backup. These BEDS errors mean that the backup job did not finish walking the file system or data set. IMO this is an error, not a status 1; what we have instead are genuine errors masquerading as partials. Yes, I know a partial can mean anything from one file to all files missed, so there is an argument that they are potentially the same thing. But they're not: one type finishes walking the file system, whereas the BEDS-error partials do not, and so, IMO, these should be treated as failures.