Bexec 15 - Win2k8 R2 - System files backup problem

Metatron5196
Level 3

Good day,

 

We are experiencing a strange problem with only ONE Windows 2008 R2 Domain controller.

 

Current setup:

Backing up 14 physical servers and 12 VMs (Windows 2012 Hyper-V cluster with 2 x physical nodes).

Primary backup is to de-dup: 1 x weekly full on Fridays and differentials on the other weekdays. As a secondary backup, the de-dup backup set is duplicated to LTO 5 as part of the same job as soon as it has completed.

Backup server is a standalone Windows 2012 Std 64-bit running Bexec 15 with a dedicated 14.4TB de-dup drive on a HP 1040 SAN (Dual FC 8 controllers). Server has 96GB RAM, 2 x quad-core Xeon 2.4GHz CPUs and more than enough local OS and pagefile drive space.

All backups run perfectly and we seem to have good performance on the de-dup drive. We run 6 x concurrent processes on de-dup and all other settings are default. Total backup data size is approx 4.5TB for all backup sets.

All jobs are set for client-side de-dup and automatic VSS selection.

As part of the above, we back up 2 x Domain controllers. Both are identical physical servers and both are GC servers; one runs the primary roles.

 

The problem occurs only on the DC running the primary roles and is a new problem. We recently applied the latest FP via LiveUpdate and updated all agents. All servers are on the latest MS patches via in-house WSUS servers.

 

Problem description:

The problem did not appear directly after the FP update; backups worked for some time. The full backup on Friday 12 Feb failed, even though the preceding differentials ran through fine. The job log reports error no. E000848F, described as the backup drive running out of disk space. (Unlikely, as the de-dup drive has only 5.35TB of 14.4TB in use.)

If the job is run, it gets stuck 14.3GB into the backup while busy with System Files. The job reports: Source: System State. Current File: System Files. It shows as "Active-Loading Media" and runs for approx 24 h before finally failing with the error. Interestingly, differential backups run through fine.

Monitoring the beremote process on the DC during the job run, the agent appears to be busy with a file in the \winsxs folder called win32k.sys. It reads the file for a bit, transfers nothing for a bit, then reads the file again. Beremote and the agent show 0% CPU use during the process. So although the agent and the server both report that the job is running, it seems to be stuck in a loop.

Resolution steps tried:

1) Restarted both the DC and the BEXEC server.

2) Uninstalled agent completely from DC and reinstalled and rebooted.

3) Expired all backup sets related ONLY to the affected server on the de-dup drive, and manually ran --processqueue etc. to remove them from storage. I suspected a corrupt backup set (mostly due to the job error description) and/or corruption in the PostgreSQL database, which the de-dup cleanup process should have sorted out if any was found.

4) Deleted the job and recreated it from scratch.

5) Removed the server from Bexec. Uninstalled agent again from DC and added server back into Bexec after reboot, allowing Bexec to initiate push install.

6) Suspected VSS, so manually set AOFO in the job to use MS VSS.

7) Still suspected VSS, so reset and re-registered VSS writers on DC and rebooted.

8) Removed everything related to the affected DC from BEXEC. Then ran a DB repair on the BEXEC database, rebuilt indices and verified DB consistency. Then rebooted, re-added the server and re-installed the agent again via push install.

After all of the above was done yesterday, I started a full backup Job. It has been running for 14 hours and is stuck "Loading Media" at 14.3GB same as above. It has not yet generated an error.

Behaviour stays the same no matter what we try. Interestingly, the server is performing optimally in its DC, GC, DNS and DHCP roles. No disk errors or system errors in the event log. The DCs are replicating OK and all services are working 100%.

 

The other DC is backing up just fine. The job reports success and backs up 33.9GB at approx 1GB per min. Job settings for both servers are identical.

All other VM and Physical backups are fine. Tested Restores etc and agents all performing as expected. Backup speed is good and no errors on de-dup system.

 

Any help would be highly appreciated.

 

 

7 REPLIES

VJware
Level 6
Employee Accredited Certified

Does a backup using the native Windows Backup utility back up this server successfully ?

Also, are the "Shadow Copies" for the C:\ volume set to No Limit ?

Metatron5196
Level 3

Hi VJware,

 

Thanks for the reply.

 

Yes, the Shadow Copies are set to No Limit (on all servers being backed up).

I cancelled the current BEXEC job and all VSS writers on the affected server immediately returned to 'stable'.

Also, only MS VSS is listed under providers (ver 1.0.0.7).
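
For anyone repeating the writer check described above, here is a minimal sketch of how it could be scripted. It assumes the usual text layout of `vssadmin list writers` ("Writer name:", "State:", "Last error:" lines) and an elevated prompt on the DC; the function names are hypothetical and this is not part of Backup Exec.

```python
import re
import subprocess


def parse_vss_writers(text):
    """Return (writer_name, state, last_error) tuples parsed from
    the text output of `vssadmin list writers` (layout assumed)."""
    writers = []
    name = state = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Writer name: '(.*)'", line)
        if m:
            name, state = m.group(1), None
            continue
        m = re.match(r"State: \[\d+\] (.*)", line)
        if m:
            state = m.group(1)
            continue
        m = re.match(r"Last error: (.*)", line)
        if m and name is not None:
            writers.append((name, state, m.group(1)))
            name = None
    return writers


def unstable_writers(text):
    """Keep only writers that are not Stable or that report an error."""
    return [w for w in parse_vss_writers(text)
            if w[1] != "Stable" or w[2] != "No error"]


if __name__ == "__main__":
    # Must be run from an elevated prompt on the server being checked.
    out = subprocess.run(["vssadmin", "list", "writers"],
                         capture_output=True, text=True).stdout
    for name, state, err in unstable_writers(out):
        print(f"{name}: state={state}, last error={err}")
```

Running it while the job is stuck and again after cancelling would show which writers, if any, are held in a non-stable state by the job.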

I am now running a native windows full backup and will advise the result.

 

EDIT: To pre-empt the question, we also use SEP AV, latest version (12.1.6 MP3). I made an exclusion on the affected server for the beremote and pdvfs applications, and also excluded the '\windows\winsxs' folder from both SONAR and security risk scans. This made no difference. I then uninstalled the AV client, rebooted and repeated the test, and the BEXEC job behaviour was still exactly the same. AV is now re-installed with the above exclusions in place.

 

Metatron5196
Level 3

VJware,

 

Native backup completes successfully.

 

Also, both DCs had Windows updates on the 12th, when the problem started (see attached image). Both DCs are identical in hardware and OS, and the WSUS update history is the same.

 

One works, one doesn't, so it does not seem likely that an MS update caused the problem.

pkh
Moderator
VIP Certified

For any OS above Server 2003, you should not re-register the VSS DLLs. In the worst case, it might leave your server unbootable.

Metatron5196
Level 3

pkh,

 

We are aware of this fact now, but thanks for the info. We realised this after the re-registration was done. The decision to re-register VSS was based on a Symantec tech article that did not specify OS level. Investigating the problem further, we discovered that the specific TechDoc was recalled some time ago.

 

However, VSS writers are working post re-registration, native backup runs flawlessly and the server is 100% operational, so thankfully the re-registration did no harm.

 

In the meantime I am scheduling a native backup of the server to a shared folder on the BEXEC server's local drive, and this folder is backed up along with the local BEXEC backup of the server. So we have some sort of fallback in case of DR.

Not a long-term solution though, and we need the BEXEC full backup to run without errors.

PS: We are planning installation of new DC servers, pending approval. These new servers would be installed and rolled out as additional DCs, the primary roles would be moved to the new servers, and the old servers would be demoted from the domain via DcPromo. Problem solved. But that is a few months away...

 

Metatron5196
Level 3

Hi VJware,

 

Seeing as native backup is working, I have to assume the issue is related to Bexec / the client agent.

 

Any further advice / troubleshooting steps you can recommend?

 

Thx.

 

VJware
Level 6
Employee Accredited Certified

I would recommend enabling SGMon debugging on the BE media server and the remote server, and then running per-resource backups.

For example, first run a backup of only System State. If this succeeds, then run a backup of C:\ (excluding the winsxs dir). If they fail, PM me the job log and the debug logs. Thanks.