Media Servers going offline

After upgrading all my media servers from 6.0MP4 to 7.1.0.2, I'm seeing some of my remote media servers going offline.  When I check, the services are running, and a right-click Activate brings them right back up again.  Let me fill in some details.

The master and (8) media servers are all Linux, both RHEL 4 and 5.  Yes, I know they are old.  There are plans to replace them, but things around here move slower than syrup uphill in the Arctic winter.  Of the 8, 3 have small tape libraries and these seem to stay online.  The other five only have basic disk storage units on NFS-mounted DataDomain storage.  These are the remote media servers and they have the most trouble, though it is not always the same one.  When I log into the media server, it looks OK.  The services are up and running.

I know the network infrastructure for these is not the best.  From what I've been able to find out, we are only using 100 Mbps links from the master to these remote servers.  No backups are coming across these links, just the metadata to the catalog and the master/media handshaking.

These are only being used with basic disk groups, so I don't have ltid running that would do heartbeat.  I have not seen anything that talks about a media server with just vmd running as the MM process.  I checked the EMM logs for missing heartbeat messages, and did not see any.  I did see a bunch of "Heartbeat received from host abc123" messages.  So I checked for each media server, and found I'm not getting heartbeats from all of them: three of the media servers are not providing any heartbeats, and two of them are.  This is really strange.
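For anyone who wants to repeat the per-server heartbeat check without grepping by hand, here is a minimal sketch.  It assumes only the message text quoted above ("Heartbeat received from host ..."); the sample lines and the host names media01 to media03 are made up for illustration, and the real log may use short names or FQDNs, so the expected list has to match whatever the log prints.

```python
import re

# Matches the message text quoted in the post; the rest of the log line is ignored.
HEARTBEAT_RE = re.compile(r"Heartbeat received from host (\S+)")


def hosts_without_heartbeat(log_lines, expected_hosts):
    """Return the expected hosts that never appear in a heartbeat message."""
    seen = set()
    for line in log_lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            seen.add(match.group(1))
    return sorted(set(expected_hosts) - seen)


# Hypothetical sample data, not taken from a real EMM log.
sample_log = [
    "Heartbeat received from host media01",
    "some unrelated line",
    "Heartbeat received from host media02",
]
silent = hosts_without_heartbeat(sample_log, ["media01", "media02", "media03"])
print(silent)
```

To run it against a real log, pass an open file object as log_lines in place of the sample list.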

The backups are all working every night, so the media servers must be in some sort of communication with the master.  I'm going to dig some more and see what I can find out, and will post any further info here as I discover it.  If anyone has any clues or suggestions, please let me know.

Accepted Solution!

On the master and media servers it is useful to add the following touch file:

/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

with a value of 800 in it

Try this and see if it helps

If this does not resolve it you can also add:

/usr/openv/netbackup/db/config/DPS_PROXYNOEXPIRE

but I would just try the first one before doing this
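As a rough sketch of the step above: the file names, the directory, and the value 800 come from this post, while the helper name write_touch_file and its parameters are made up for illustration.  The demo at the bottom writes into a scratch directory so it can be tried without touching a live server.  Whether any services need a restart afterwards is not covered here.

```python
import tempfile
from pathlib import Path

# Directory named in the post; pass a different one to experiment safely.
CONFIG_DIR = Path("/usr/openv/netbackup/db/config")


def write_touch_file(name, value="", config_dir=CONFIG_DIR):
    """Create the touch file `name`; write `value` plus a newline if one is given."""
    path = Path(config_dir) / name
    path.write_text(f"{value}\n" if value != "" else "")
    return path


# Dry run in a temporary directory instead of the real config directory.
with tempfile.TemporaryDirectory() as scratch:
    demo = write_touch_file("DPS_PROXYDEFAULTRECVTMO", 800, config_dir=scratch)
    demo_content = demo.read_text()
    empty = write_touch_file("DPS_PROXYNOEXPIRE", config_dir=scratch)
    empty_content = empty.read_text()

print(repr(demo_content))
```

On a real master or media server the call would be write_touch_file("DPS_PROXYDEFAULTRECVTMO", 800), run as a user that can write to that directory.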

More Info

I found this for one of the media servers that was not reporting a heartbeat message in the EMM log.

02/16/2012 07:05:24.510 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:05:54.550 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:05:54.982 [RemTimer::doDPSUpdate] DPS host timed out: wmtocpvms01
02/16/2012 07:05:54.996 [Info] V-219-2 Server wmtocpvms01's disk active state set to DOWN
02/16/2012 07:06:04.669 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:04.689 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:04.709 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:04.729 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:19.800 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:19.820 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:19.840 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:19.860 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:06:24.591 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:06:24.606 [Info] V-219-2 Server wmtocpvms01's disk active state set to UP
02/16/2012 07:06:24.619 [ResourceEventMgr_i::needConfig ] Media server wmtocpvms01 processed ADM license check
02/16/2012 07:06:54.657 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:07:24.698 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:07:54.738 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:08:24.777 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:08:39.061 [RemTimer::doDPSUpdate] DPS host timed out: wmtocpvms01
02/16/2012 07:08:39.074 [Info] V-219-2 Server wmtocpvms01's disk active state set to DOWN
02/16/2012 07:08:44.577 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:08:44.597 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:08:44.617 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:08:44.637 [ResourceEventMgr_i::resolveServer] Host wmtocpvms01is not in a cluser but is not up; state: 8
02/16/2012 07:08:54.816 [ResourceEventMgr_i::needConfig ] needConfig request: wmtocpvms01.em.entergy.com
02/16/2012 07:08:54.829 [Info] V-219-2 Server wmtocpvms01's disk active state set to UP

It looks like it sets the status up and down over and over again.  I could not find anything with a Google search for the ResourceEventMgr message or "DPS host timed out".  As I suspected, it looks like I need to adjust some tuning to make these more resilient to network slowness.
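To put numbers on the flapping, here is a small sketch that pairs each "set to DOWN" line with the next "set to UP" line for the same host and reports how long the host was marked down.  It assumes the timestamp and message layout shown in the excerpt above; the function name find_flaps is made up, and the sample is four lines copied from that excerpt.

```python
import re
from datetime import datetime

# Matches the V-219-2 state-change lines as they appear in the excerpt.
STATE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \[Info\] V-219-2 "
    r"Server (\S+?)'s disk active state set to (UP|DOWN)"
)


def find_flaps(lines):
    """Return (host, down_time, up_time, seconds_down) for each DOWN -> UP pair."""
    down_since = {}
    flaps = []
    for line in lines:
        match = STATE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S.%f")
        host, state = match.group(2), match.group(3)
        if state == "DOWN":
            down_since[host] = stamp
        elif host in down_since:
            start = down_since.pop(host)
            flaps.append((host, start, stamp, (stamp - start).total_seconds()))
    return flaps


sample = [
    "02/16/2012 07:05:54.996 [Info] V-219-2 Server wmtocpvms01's disk active state set to DOWN",
    "02/16/2012 07:06:24.606 [Info] V-219-2 Server wmtocpvms01's disk active state set to UP",
    "02/16/2012 07:08:39.074 [Info] V-219-2 Server wmtocpvms01's disk active state set to DOWN",
    "02/16/2012 07:08:54.829 [Info] V-219-2 Server wmtocpvms01's disk active state set to UP",
]
flaps = find_flaps(sample)
for host, down, up, seconds in flaps:
    print(f"{host}: down {down:%H:%M:%S} -> up {up:%H:%M:%S} ({seconds:.3f} s)")
```

Run over a full day's log, this gives a quick view of which media servers flap and for how long, which is handy for comparing before and after a tuning change.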

Thanks Mark.

I had never heard of those two.  I looked them up and understand what's going on.  I had one go offline this morning.  I'm going to do that one for now and see how it goes.  It looks very promising.

Thanks again Mark.  It's been almost a week and the servers are staying online.

Fantastic! - Glad to have helped