Backup Exec Job Engine high CPU usage

ABS_IT
Level 3

Our environment

  • Windows Server 2012 fully patched
  • Backup Exec 15 FP4

I just noticed that the Backup Exec Job Engine (bengine.exe) periodically runs at 100% CPU. Here is a summary of my research so far:

  • High CPU episodes happen exactly every 5 minutes
  • Each episode lasts approximately 1 minute
  • During this minute there is only disk read activity but no disk write activity
  • Procmon (from Sysinternals) shows that the disk read activity consists entirely of reading Backup Exec catalog files
  • Most catalog files are read multiple times, some even 100 times or more
  • The phenomenon seems to have started after I installed FP4 (not 100% certain here)
  • Backups run successfully every day
  • Backup Exec is not logging any errors or warnings that indicate any type of problem, either during backups or while idle
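The "every 5 minutes, for about 1 minute" pattern can be checked against logged CPU samples rather than by eye. The following is a minimal sketch, assuming you already have (timestamp in seconds, CPU percent) samples for bengine.exe from some monitoring tool; the function names and the 90% threshold are illustrative assumptions, not anything Backup Exec provides.

```python
from typing import List, Tuple


def find_episodes(samples: List[Tuple[float, float]],
                  threshold: float = 90.0) -> List[Tuple[float, float]]:
    """Return (first_high_sample, last_high_sample) times of each high-CPU run.

    samples: (timestamp_seconds, cpu_percent) pairs in ascending time order.
    threshold: CPU percentage treated as "high" (an assumed cut-off).
    """
    episodes = []
    start = None
    last = None
    for t, cpu in samples:
        if cpu >= threshold:
            if start is None:
                start = t          # a new episode begins here
            last = t               # extend the current episode
        elif start is not None:
            episodes.append((start, last))
            start = None
    if start is not None:          # samples ended while CPU was still high
        episodes.append((start, last))
    return episodes


def intervals(episodes: List[Tuple[float, float]]) -> List[float]:
    """Seconds between the starts of consecutive episodes."""
    starts = [s for s, _ in episodes]
    return [b - a for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Synthetic data: one sample every 10 s, high for the first minute of every 5.
    demo = [(t, 100.0 if t % 300 < 60 else 3.0) for t in range(0, 900, 10)]
    eps = find_episodes(demo)
    print(eps)
    print(intervals(eps))
```

If the intervals come out at a constant 300 seconds, that points to a timer-driven task inside the process rather than to load from backup jobs.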

I have a number of questions:

  • Is anybody else experiencing the same phenomenon?
  • If so, did it start with FP4?
  • What is the Job Engine doing every 5 minutes? (If it is determining which backup sets have expired, the interval seems a bit aggressive, to put it mildly.)
  • Reading a catalog file more than once does not seem right. Am I correct here?
  • Does it make sense to rebuild the catalogs from the backup media?
  • If so, what is the correct procedure to do so for Backup Exec 15?

Thanks in advance,
-Roger

Accepted Solutions

The case has been resolved after slightly over 2 months of working with Veritas Support. This involved many e-mails, half a dozen telephone calls (I explicitly preferred e-mail) as well as making several debug traces and process dumps.

The case started to make really meaningful progress when it was forwarded to Backline Support (after 4 weeks). Several weeks of in-depth investigation and tight cooperation later we managed to crack the case: the culprit was a bug in an SQL stored procedure. Once Engineering produced a bug fix, my Job Engine stopped experiencing high CPU episodes. Trumpets!

A few closing remarks:

  • This bug applies to all BEX installations that use B2D storage. However, it will only cause noticeable performance issues if you have many media files because you chose a small maximum media file size (in my case I had 2000+ 1 GB media files in my B2D storage; I had good reasons to choose that 1 GB file size, but it did backfire).
  • I expect Veritas to include the fix in a future release (it is not nearly serious enough for a hotfix).
  • Working with Backline Support was rewarding. I was fortunate to work with a very competent engineer (thank you, M.!) who in turn worked closely with Engineering.
  • I wish I could say the same about Frontline Support. The primary KPI there seems to be 'mean time to case closure'. This resulted in several quite unhelpful suggestions. Things improved after I managed to get them out of emergency mode.
  • Veritas Support seems to have improved quite a bit since the split from Symantec (my past experience was anything but positive: all cases were closed without resolution). I think this is good news for customers.

Regards,
-Roger

 

8 REPLIES

Gurvinder
Moderator
Employee Accredited Certified

Stop all BE services and run catrebuildindex.exe -R from the BE install path (at a command prompt). When it completes, start all BE services and see whether the issue recurs. Also, let me know whether you renamed the BE server recently and started seeing this afterwards. In the catalog folder, do you see any files other than the .xml and .fh files? I have not yet seen a tech support case for this issue since FP4.

CraigV
Moderator
Partner    VIP    Accredited

Exclude the Catalogs folder from any AV scan and see if this sorts out the issue.

Thanks!

ABS_IT
Level 3

First of all thank you for the responses!

> Stop all BE services and run catrebuildindex.exe -R from BE Install Path (command prompt). When it completes, start all BE services. See if this Issue repeats.

Done: the issue persists. AFAIK catrebuildindex -R rebuilds the catalog in the database from the catalog files. If the issue is caused by something in the catalog files themselves, I'd expect the issue to persist.

> Also let me know if you renamed the BE server recently by any chance post which you have started seeing this ?

The server has never been renamed.

> In the catalog folder do you see any other files apart from the xml and fh files.

The 'Catalogs\<ServerName>' folder contains
- 169 .fh files
- 169 .xml files
- 18 .sdr files
- 2 folders ('CatalogProcessTemporaryFolder' and 'catstore', both empty)

IIRC the Procmon trace showed that only the .xml files were being read during a high CPU episode.
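For anyone who wants to produce the same inventory of their own catalog folder, here is a minimal sketch that counts files by extension. The folder path in the usage comment is a placeholder; substitute the 'Catalogs\<ServerName>' folder of your own installation.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder):
    """Count regular files directly inside `folder`, keyed by lowercase suffix.

    Subfolders are not descended into and are not counted.
    """
    return Counter(
        p.suffix.lower()
        for p in Path(folder).iterdir()
        if p.is_file()
    )


# Usage (placeholder path, adjust to your installation):
# for ext, n in sorted(count_by_extension(r"<BE Install Path>\Catalogs\<ServerName>").items()):
#     print(n, ext)
```

A matching number of .fh and .xml files, as in the listing above, is what one would hope to see; a mismatch would be worth mentioning to support.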

> I have not yet noticed a case in tech support post FP4 for above Issue.

OK.

> Exclude the Catalogs folder from any AV scan and see if this sorts out the issue.

It was already excluded.

Regards,
-Roger
 

ABS_IT
Level 3

Since catrebuildindex did not resolve the issue, I decided to recatalog. This failed:

  • Completed status: Failed
  • Final error: 0xe0000900 - The requested media is not listed in the media index and could not be mounted. To add the media's catalog information to the disk-based catalogs, run an inventory operation on the media and resubmit the Catalog operation.
  • Final error category: Backup Media Errors
  • For additional information regarding this error refer to link V-79-57344-2304

I then decided to rebuild the catalogs from scratch, as outlined in https://www.veritas.com/support/en_US/article.TECH45473

  • Hold backup jobs
  • Stop BE services
  • Rename Catalogs folder
  • Start BE services
  • Inventory backup media
  • Catalog backup media
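The "Rename Catalogs folder" step can be scripted so that the old catalogs are set aside under a timestamped name and remain available for rollback. This is only a sketch of that single step: the function name is hypothetical, it must only be run while the BE services are stopped, and holding jobs, stopping and starting services, inventory, and cataloging are still done as described in the article.

```python
from datetime import datetime
from pathlib import Path


def set_aside(folder, now=None):
    """Rename `folder` to '<name>.old-YYYYMMDD-HHMMSS' in the same parent.

    Returns the new path. Raises FileExistsError if the target already exists,
    so an earlier copy is never overwritten.
    """
    src = Path(folder)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.old-{stamp}")
    if dst.exists():
        raise FileExistsError(dst)
    src.rename(dst)
    return dst


# Usage (placeholder path; run only with the BE services stopped):
# print(set_aside(r"<BE Install Path>\Catalogs"))
```

Keeping the renamed folder until the new catalogs have been verified makes it possible to revert by stopping the services again and renaming it back.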

BE quickly created the initial catalog and then started to read from media. After 16 hours this failed:

  • Completed status: Failed
  • Final error: 0xe00084ca - The data being read from the media is inconsistent.
  • Final error category: Backup Media Errors
  • For additional information regarding this error refer to link V-79-57344-33994

Interestingly, the media set that failed was from one of the last backups (written by BE 15 FP4).

The Windows Application Event Log has a lot of these entries:

  • Error Backup Exec 57620: A format inconsistency was encountered while attempting to access or process tape based catalogs on device "Backup Disk 1", each followed by
  • Error Backup Exec 57608: A format inconsistency was encountered while positioning to begin a tape read operation on device "Backup Disk 1", finally failing with
  • Error Backup Exec 34113: The job failed with the following error: The data being read from the media is inconsistent.

The final observation is that both (failed) catalog jobs report a whopping 269 TB byte count. The total size on the backup media is just under 2 TB...

And, by the way, the Backup Exec Job Engine still jumps to 100% CPU every 5 minutes. I am inclined to think that FP4 is seriously broken.

Regards,
-Roger

 

Gurvinder
Moderator
Employee Accredited Certified

Open a case with tech support to get this looked at.

ABS_IT
Level 3

I just did :) The case number is 21402123.

Over the weekend I manually verified all backup sets because I did not like the "format inconsistency" error. I found one faulty backup set. After I expired the faulty backup set, all backup sets would verify OK. Interestingly the faulty backup set was not the one where I initially observed the "format inconsistency" error. Apparently the place where the error appeared in the job log was misleading.

Finally I inventoried and cataloged my backup sets again and now this went smoothly. Unfortunately the original issue persisted.

Regards,
-Roger

ABS_IT
Level 3

I just found out something really interesting:

  • During a high CPU episode bengine.exe is reading catalog files (see the opening post)
  • Almost every catalog file is read multiple times
  • The number of times each .XML catalog file is read is exactly equal to the number of B2D .BKF media files in the backup

I suspect that BE is doing some kind of maintenance every 5 minutes. Reading each catalog once makes sense; reading it multiple times does not. I suspect a bug here. Also, doing maintenance every 5 minutes seems to be a bit over the top.
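The per-file counts can be reproduced from a Procmon trace saved as CSV. The sketch below assumes the default Procmon CSV columns ("Process Name", "Operation", "Path") and, as a further assumption, counts CreateFile events as a proxy for "one pass over the file", because a single pass usually produces many ReadFile events. The function name is illustrative.

```python
import csv
from collections import Counter


def opens_per_file(csv_lines, process="bengine.exe",
                   operation="CreateFile", suffix=".xml"):
    """Count events of `operation` per path for one process in a Procmon CSV.

    csv_lines: any iterable of text lines, e.g. an open file object.
    Column names are assumed to match Procmon's default CSV export.
    """
    counts = Counter()
    for row in csv.DictReader(csv_lines):
        if (row["Process Name"].lower() == process.lower()
                and row["Operation"] == operation
                and row["Path"].lower().endswith(suffix)):
            counts[row["Path"]] += 1
    return counts


# Usage (hypothetical file name; utf-8-sig strips a byte order mark if present):
# with open("bengine_trace.csv", newline="", encoding="utf-8-sig") as f:
#     for path, n in opens_per_file(f).most_common(10):
#         print(n, path)
```

Comparing each count with the number of .BKF media files belonging to the same backup is then a manual step, since the trace itself does not say which media files belong to which catalog.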

Regards,
-Roger

 
