
Drives AVR mode: Manually UPping a robot

Supra_James
Level 3

Hi All

I have two media servers on the same LAN segment and they each have connectivity to 2 VTL tape libraries. Each media server is the robotic control host of one of the VTLs.

I quite often get a situation where half the drives on one media server go into AVR mode because, for some reason, the daemon on the other side cannot be contacted. This is strange in itself, as there should be no reason for it to happen - but that's not really the point of my question.

My problem is that to resolve these AVR drives the only thing that can be done is to stop all backups on the media server and restart the daemons (ltid processes). This causes a massive issue because it can impact backups that are running perfectly well on the other robot.

Has anyone found a way to manually kick NetBackup into action and "UP" the robot that it thinks has gone down, without restarting the daemons? This would be a much better solution than having to cancel backups and restart the daemons.

Media servers are running 7.5.0.6 on Solaris against a master server of the same version on Linux.

1 ACCEPTED SOLUTION


mph999
Level 6
Employee Accredited

The process is fairly simple. I can also explain why robtest still works.

On a media server that has the robot and drives 'local' two processes run (for tld type robot).

tldd and tldcd

tldcd talks directly to the robot

tldd talks to tldcd to get things done, eg, mount / unmount.

If the media server cannot communicate with its local robot, its drives will go to AVR.  Also, if tldd cannot talk to tldcd, drives will go AVR - not impossible, but less likely a cause, as the two processes run on the same box.

 

For a media server that has a remote robot but local drives, only tldd will run.  This talks to tldcd on the remote robot control host to get tapes mounted/unmounted, etc.  If tldd cannot talk with tldcd due to, say, a network issue, then the drives will go AVR on the non-robot-control-host media server, but everything will be fine on the robot control host - hence why robtest will still work.

In your case, both media servers have robots, so both will have tldcd running, but this process only talks to the local robot.  tldd will talk with tldcd on both the local and the remote server, as explained above.

So in this case, the most likely cause, based on what I know, is some intermittent network issue where the two servers cannot talk to one another.  It could be a one-way issue, which would explain why the drives only go into AVR mode on one of the servers.

The most common reasons for drives in AVR mode:

RCH has lost connectivity to the robot

tldd cannot talk to tldcd

Config is wrong

If the config was wrong, the drives would stay in AVR, so we can pretty much rule this out.
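A quick triage along those lines can be scripted. This is a sketch only: the /usr/openv paths are the default install locations, and "robot-host" is a placeholder for your remote robot control host's name - adjust both for your environment.

```shell
#!/bin/sh
# Quick triage for AVR drives, mapped to the causes listed above.
# Sketch only: default /usr/openv paths are assumed, and "robot-host"
# is a placeholder for the remote robot control host's name.

# 1) tldd must be running locally, or drives cannot leave AVR.
TLDD_STATE=$(ps -ef | grep '[t]ldd' >/dev/null 2>&1 && echo running || echo "not running")
echo "tldd: $TLDD_STATE"

# 2) Basic reachability of the remote robot control host; a one-way
#    failure here matches drives going AVR on one side only.
ROBOT_HOST=robot-host
REACH=$(ping -c 1 "$ROBOT_HOST" >/dev/null 2>&1 && echo reachable || echo unreachable)
echo "$ROBOT_HOST: $REACH"

# 3) On the robot control host itself, robtest confirms the robot answers:
#    /usr/openv/volmgr/bin/robtest    # interactive; "s s" shows slot status
```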


8 REPLIES

mph999
Level 6
Employee Accredited

The only way I know, is to restart the ltid process unfortunately.

I presume the drives that go AVR are all the drives in the library controlled by the other media server. If the other media server is working fine at that point, my first guess would be some network connectivity issue between the two - though, as you say, on the same network that does sound odd. Certainly not impossible, though.

Could be worth a peep in the robot's log.

On the media server that has the issue, is the tldd process still running, and can you run robtest successfully against its own robot?

Martin

Supra_James
Level 3
Thanks Martin.

Yes, the drives that go AVR are the drives without local robotic control. The other drives are fine, including all drives on the other media server. robtest still works fine and tldd is running.

I don't know how the mechanism works that should bring the drives back under TLD control, but it is plainly not working for us. When we have the issue, /var/adm/messages says that the robot is going down - unable to sense robotic device.

We have a project ongoing to replace these media servers with Linux boxes, so fingers crossed the issue disappears :)

RadovanTuran
Level 4

Strange.

If robtest is able to connect, then AVR mode is confusing.

Anyway, you can restart ltid without affecting currently running backups.
Just don't use the GUI to restart the "Media management daemon".
From the Unix command line, run "vmps", note the full path of the ltid daemon, and kill all processes
listed by "vmps". Don't worry - backups will continue; only tape management requests will fail.
Once all these daemons are killed (check with "vmps"), run ltid (with its full path) from the command line.

It will put itself into daemon mode (your terminal will not be occupied by a foreground process) and will start all the other necessary daemons (vnetd, tldd, tldcd, ...).

If there are only a few seconds between killing and starting, nothing bad will happen. I've been using this method for years.
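That procedure can be sketched as a script. This is a sketch only, assuming the default /usr/openv install paths - check your own vmps output before killing anything, and confirm the PID really is the second column on your system.

```shell
#!/bin/sh
# Sketch of the kill-and-restart procedure described above.
# Assumes default /usr/openv paths; verify PIDs against your own
# vmps output before killing anything.
VMPS=/usr/openv/volmgr/bin/vmps
LTID=/usr/openv/volmgr/bin/ltid

if [ -x "$VMPS" ]; then
    $VMPS                              # note the daemons and their PIDs
    $VMPS | awk '{print $2}' | while read pid; do
        kill "$pid" 2>/dev/null        # running backups are unaffected;
    done                               # only tape mount requests fail
    sleep 5                            # give the daemons a few seconds to exit
    $VMPS                              # confirm nothing is left
    $LTID                              # daemonizes itself and restarts tldd/tldcd
    RESULT="restarted"
else
    RESULT="volmgr tools not found"
fi
echo "$RESULT"
```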


Marianne
Level 6
Partner    VIP    Accredited Certified

Excellent explanation, Martin!

In short - it looks like a network comms issue between the media server and the robot control host.

If there is network comms issue, I cannot see how new Linux media servers will solve the issue....

Add VERBOSE entry to vm.conf on all media servers and restart ltid to increase Media Manager logging to /var/adm/messages.
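For example, the entry is a single keyword on its own line (default path shown; adjust for your install):

```
# /usr/openv/volmgr/vm.conf
VERBOSE
```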

Supra_James
Level 3

Thank you for this!

I will put some more logging in place to try and capture the issue next time it happens. The reason I am looking forward to the Linux boxes going in is that this will of course replace the hardware, NICs, cables and ports on the switches so everything will be new and hopefully no more network issue.

From your list of causes it would seem the most likely one for us is the connection between tldd and tldcd failing for some reason.

Cheers

James

Supra_James
Level 3

I will try this next time - but I do suspect it will cause problems, because since NBU 6.0 I have seen ltid fail to start due to resources already being assigned on the host, specifically where the media server processes have not been stopped cleanly.

When that happens, you need to run nbrbutil -resetMediaServer to clear all allocations from that media server, which does impact running backups - they all fail with a resource allocation message.
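For reference, that reset looks like this. Sketch only: the default admincmd path is assumed, "mediaserver1" is a placeholder host name, and the command is run on the master server - note it fails any jobs holding allocations on that host.

```shell
#!/bin/sh
# Sketch only: clears ALL allocations for the named media server,
# which fails any jobs using them. "mediaserver1" is a placeholder.
NBRB=/usr/openv/netbackup/bin/admincmd/nbrbutil
if [ -x "$NBRB" ]; then
    "$NBRB" -resetMediaServer mediaserver1
    RESULT="reset issued"
else
    RESULT="nbrbutil not found"
fi
echo "$RESULT"
```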

Will give it a go though, can't hurt if I have to restart anyway :)

wan2live
Level 3
Certified

Are you using the latest:

1) Firmware on your HBAs

2) Drivers for your libraries

3) Drivers for your NICs

I often had the same challenges, and it usually turned out to be a driver config issue - an update/upgrade corrected it.

G/L