NBU 6.5.4 Enterprise - Image Cleanup jobs sometimes fail, error code 23

Niall_Porter
Level 4

Hi all,

I have an NBU master/media server in our Norway office which sometimes (about once or twice a week) fails Image Cleanup jobs with status/error code 23, "socket read failed".  Below is a log from one such job:

08/17/2010 19:07:21 - Info bpdbm (pid=12851) image catalog cleanup
08/17/2010 19:07:21 - Info bpdbm (pid=12851) deleting images which expire before Tue Aug 17 20:07:21 2010 (1282068441)
08/17/2010 19:07:21 - Info bpdbm (pid=12851) processing client STADC1
08/17/2010 19:07:21 - Info bpdbm (pid=12851) processing client STAEXCH1
08/17/2010 19:07:22 - Info bpdbm (pid=12851) processing client STANA1
08/17/2010 19:07:23 - Info bpdbm (pid=12851) processing client STANTS1
08/17/2010 19:07:23 - Info bpdbm (pid=12851) processing client STANTS2
08/17/2010 19:07:23 - Info bpdbm (pid=12851) processing client STANTS3
08/17/2010 19:07:24 - Info bpdbm (pid=12851) processing client STASSV01
08/17/2010 19:07:24 - Info bpdbm (pid=12851) processing client STASSV02
08/17/2010 19:07:25 - Info bpdbm (pid=12851) processing client stassv01
08/17/2010 19:07:25 - Info bpdbm (pid=12851) deleted 44 expired records, compressed 0, tir removed 0, deleted 0 expired copies
socket read failed (23)


Can anyone suggest where to start with this?  There's plenty online about error code 23, but nothing relating to Image Cleanup jobs - mostly just backups...

Thanks in advance,
Niall

Marianne
Level 6
Partner    VIP    Accredited Certified
Ensure you have admin and bpdbm log directories on the master (bpdbm needs a restart of NetBackup to activate its log).
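
Something along these lines on the master should do it (paths assume a default Unix install - adjust to suit):

# create the debug log directories that NetBackup looks for
mkdir /usr/openv/netbackup/logs/admin
mkdir /usr/openv/netbackup/logs/bpdbm

# bpdbm only starts writing to its directory after a restart
/usr/openv/netbackup/bin/bp.kill_all
/usr/openv/netbackup/bin/bp.start_all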

Have a look at these logs for clues.

Andy_Welburn
Level 6
GENERAL ERROR: intermittent status 47 and 23 image cleanup jobs
http://seer.entsupport.symantec.com/docs/315686.htm

Niall_Porter
Level 4
Thanks for that.  I already have an admin log directory; I've added bpdbm, and it seems to be logging in there now even without a restart...

Anyway, admin log had something that looked like it might be relevant:

20:07:26.138 [12861] <2> EndpointSelector::select_endpoint: performing call with the only endpt available!(Endpoint_Selector.cpp:431)
20:07:26.160 [12861] <4> nbdelete: main: Connected to EMM server
20:07:26.165 [12861] <16> emmlib_GetDiskMediaWithImagesInState: (0) queryImagesWithDistinctVolumeInfo failed, emmError = 4005006, nbError = 0
20:07:26.165 [12861] <16> nbdelete: GetMediaList: Error executing emmlib_GetDiskMediaWithImagesInState
20:07:26.165 [12861] <16> nbdelete: (-) Translating EMM_ERROR_DBServerDown(4005006) to 23 in the Media context
20:07:26.165 [12861] <2> nbdelete: EXIT status = 23
20:07:26.172 [12861] <4> nbdelete: unlockOperation: Released lock on: /usr/openv/var/nbdelete/all.lock
20:07:26.172 [12861] <4> nbdelete: myexit: Released lock on: /usr/openv/var/nbdelete/process1.lock


To me that suggests it can't communicate with the EMM server?  There's no reason the EMM server should have been down at that time; where would I check that out?
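
I'm guessing something like this would at least show me whether the EMM daemon is running now, and what it's bound to (assuming the default install paths)?

# show all running NetBackup processes - nbemm should be in the list
/usr/openv/netbackup/bin/bpps -a

# ask EMM which server it believes it is
/usr/openv/netbackup/bin/admincmd/nbemmcmd -getemmserver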

Marianne
Level 6
Partner    VIP    Accredited Certified

Seems connection to EMM was successful, but a particular Disk Image is unavailable:

20:07:26.160 [12861] <4> nbdelete: main: Connected to EMM server
20:07:26.165 [12861] <16> emmlib_GetDiskMediaWithImagesInState: (0)

I was hoping that the admin log would actually list the name of the media server that the disk image belongs to.
Hopefully you'll see more in bpdbm...

Andy's TechNote about decommissioned media server(s) seems very relevant.

Niall_Porter
Level 4

That server (stassv01) is a combined master/media server; it's the only one that's ever been in that environment.  There's never been any disk-based media used for backups there either - it's only ever had an ADIC autoloader with LTO3 drives.

Regarding Andy's post (thanks Andy), I can't see how the points in the SOLUTION/WORKAROUND section can be relevant. The second one mentions an "improperly decommissioned media server", but as above we've never had any other media server out there to decommission. And as for the first point, if the connection to the media server was successful then I'm not sure how that applies either.

Guess I'll need to wait until this happens again to see if anything more specific is logged in the bpdbm logs?
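
In the meantime, I suppose I could at least double-check what hosts EMM actually has registered, in case there's a stale entry in there - something like this, if I've got the syntax right:

# list every host EMM knows about - only stassv01 should appear
/usr/openv/netbackup/bin/admincmd/nbemmcmd -listhosts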

Marianne
Level 6
Partner    VIP    Accredited Certified
Thanks for sharing that info with us. Seems we're going to need the bpdbm log.

Which O/S?
NBU version?

Andy_Welburn
Level 6

Do you still have offline cold catalog backups configured?

In a lot of instances image cleanup and cold catalogs would be scheduled after a session of scheduled backups. If cold catalogs are still configured they will take the d/b down & prevent the image cleanup from working correctly.

As I say, just a thought as you mentioned the possibility of the EMM being down - but you probably kicked the cold catalogs into touch when you upgraded to 6.5?

Niall_Porter
Level 4
Sorry, should have thought about the obvious stuff :)

Solaris 10, 6/06 on a Sun Fire V445, 2x UltraSPARC-IIIi @ 1.6GHz
NetBackup 6.5.4; I think it's the Enterprise version.

Niall_Porter
Level 4

No cold catalog backups any more, no.  However, you might be onto something here.  According to the Activity Monitor, the Image Cleanup job which failed (I can only see one due to the short history) started and finished while there was still a Catalog Backup running.  Looking at an older Image Cleanup job, it started 1 second after the preceding Catalog Backup job.

Could this be because the Image Cleanup tried to run while the Catalog Backup was still happening?  Would this be a problem since we're not using cold/offline catalog backups?  If so, how can I stop this from happening?

Andy_Welburn
Level 6

I would've thought that this would only be a problem with cold as opposed to hot catalog backups.

Maybe worth monitoring to see if all your failures coincide with a catalog backup. If they do, the easiest thing to change would be the scheduling of the catalog as the image cleanup will either run after each session of backups or after a preset period (e.g. 12 hours) if there is no break in backup sessions:

The Image cleanup property specifies the maximum interval that can elapse before an image cleanup is run. Image cleanup is run after every successful backup session (that is, a session in which at least one backup runs successfully). If a backup session exceeds this maximum interval, an image cleanup is initiated.

Plus, you can limit it such that there is a minimum period between image cleanups:

The Catalog cleanup wait time property specifies the minimum interval that can elapse before an image cleanup is run. Image cleanup is not run after a successful backup session until this minimum interval has elapsed since the previous image cleanup.
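
If you'd rather check those values from the command line than dig through the GUI, I believe bpconfig lists them among the master's global attributes:

# display the master's global attributes, which should include the
# image cleanup interval and the catalog cleanup wait time
/usr/openv/netbackup/bin/admincmd/bpconfig -U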

Niall_Porter
Level 4
Just checked: in the Master Server properties under "Clean-up", the Image Cleanup interval is set to "after 12 hours" and the Catalog cleanup wait time is set to "After 60 minutes".  So it doesn't look as though the two should coincide, or even (as is normally the case) run one immediately after the other.

I've never seen an image cleanup happen while there's still a catalog backup running.  We've done a fair bit of work on our other NBU environments (I manage three separate ones at the moment, looking to consolidate into one soon), and after running a couple of test jobs I usually see a Catalog Backup start, then an Image Cleanup job once that's finished, but I've never seen the two running at the same time before.

Will see what happens tomorrow, thanks for the help so far.  Of course, Murphy's law would say that now I'm watching this it'll never happen again :)

Andy_Welburn
Level 6

So the image cleanup will happen after each session, or after 12 hours have elapsed, and will not run within 60 minutes of the last one running.

"Murphy's law would say that now I'm watching this it'll never happen again" - you just know it don't you! wink


Anything in the bpdbm logs that Marianne mentioned?

Niall_Porter
Level 4
Nothing yet, but then I only enabled that this morning; I'll see what it shows after tonight's backups.  If it looks fine tomorrow, I'll try running a few test jobs to hopefully trigger a catalog backup and an image cleanup as well.
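
Failing that, I suppose I could force the issue rather than wait - something like the below should kick both off by hand (the catalog policy name is just a placeholder; I'd use whatever our NBU-Catalog policy is actually called):

# run an image cleanup for all clients straight away
/usr/openv/netbackup/bin/admincmd/bpimage -cleanup -allclients

# start a manual hot catalog backup (policy name is a placeholder)
/usr/openv/netbackup/bin/bpbackup -i -p CatalogBackup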