11-04-2013 01:37 PM
Hi All
I have two media servers on the same LAN segment and they each have connectivity to 2 VTL tape libraries. Each media server is the robotic control host of one of the VTLs.
I quite often get a situation where half the drives on one media server go into AVR mode because, for some reason, the daemon on the other side cannot be contacted. This is strange in itself, as there should be no reason for it to happen, but it's not really the point of my question.
My problem is that to resolve these AVR drives the only thing that can be done is to stop all backups on the media server and restart the daemons (ltid processes). This causes a massive issue because it can impact backups that are running perfectly well on the other robot.
Has anyone found a way to manually kick NetBackup into action and "UP" the robot that it thinks has gone down, without restarting the daemons? This would be a much better solution than having to cancel backups and restart the daemons.
Media servers are running 7.5.0.6 on Solaris against a master server of the same version on Linux.
11-05-2013 02:54 PM
The process is fairly simple. I can also explain why robtest still works.
On a media server that has the robot and drives 'local', two processes run (for a TLD-type robot):
tldd and tldcd
tldcd talks directly to the robot.
tldd talks to tldcd to get things done, e.g. mount / unmount.
If the media server cannot communicate with its local robot, its drives will go to AVR. Also, if tldd cannot talk to tldcd, the drives will go AVR. That is not impossible, but it is less likely as a cause because both processes run on the same box.
For a media server that has a remote robot but local drives, only tldd will run. It talks to tldcd on the remote robot control host to get tapes mounted, unmounted, etc. If tldd cannot talk to tldcd due to, say, a network issue, then the drives will go AVR on the non-robot-control-host media server, but everything will be fine on the robot control host - hence why robtest will still work.
In your case, both media servers have robots, so both will have tldcd running, but this process only talks to the local robot. tldd will talk to tldcd on both the local and the remote server, as explained above.
So in this case, the most likely cause, based on what I know, is some intermittent network issue where the two servers cannot talk to one another. It could be a one-way issue, which would explain why the drives only go into AVR mode on one of the servers.
The most common reasons for drives in AVR mode:
The robot control host (RCH) has lost connectivity to the robot
tldd cannot talk to tldcd
Config is wrong
If the config were wrong, the drives would stay in AVR permanently rather than come and go, so we can pretty much rule this out.
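One way to test the network theory is to leave a small probe running on the media server whose drives go AVR and note each time the robot control host stops answering. The Python sketch below is only an illustration and rests on assumptions: the host name `rch-media02` is hypothetical, and ports 13711 (tldcd) and 13724 (vnetd) are the usual /etc/services entries, so check your own /etc/services, since the connection may be routed through vnetd. A successful TCP connect only shows the port is reachable, not that tldcd itself is healthy.

```python
import socket
import time
from datetime import datetime


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe(host, ports, interval=30.0, rounds=None):
    """Print a timestamped line whenever a port changes between up and down.

    rounds=None means run until interrupted.
    """
    last = {}
    done = 0
    while rounds is None or done < rounds:
        for port in ports:
            up = can_connect(host, port)
            if last.get(port) != up:
                state = "reachable" if up else "UNREACHABLE"
                print(f"{datetime.now().isoformat()} {host}:{port} {state}", flush=True)
                last[port] = up
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling `probe("rch-media02", [13711, 13724])` on each media server, pointed at the other one, would show whether the drop-outs are one-way and whether they line up with the times the drives go AVR.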
11-04-2013 01:59 PM
Unfortunately, the only way I know is to restart the ltid process.
I presume the drives that go AVR are all the drives in the library controlled by the other media server. If the other media server is working fine at that point, my first guess would be a network connectivity problem between the two - as you say, on the same network this does sound odd, but it is certainly not impossible.
It could be worth a peep in the robot's log.
On the media server that has the issue, is the tldd process still running, and can you run robtest successfully against its own robot?
Martin
11-04-2013 02:12 PM
11-05-2013 02:23 PM
Strange.
If robtest is able to connect, then AVR mode is confusing.
Anyway, you can restart ltid without affecting currently running backups.
Just don't use the GUI to restart the "Media management daemon".
From the Unix command line run "vmps", note the full path of the ltid daemon, and kill all the processes listed by "vmps". Don't worry: backups will continue, and only tape management requests will fail.
Once all these daemons have been killed (check with "vmps"), run ltid (with its full path) from the command line.
It will put itself into daemon mode (your terminal will not be tied up by a foreground process) and will start all the other necessary daemons (vnetd, tldd, tldcd,...).
If there are just a few seconds between killing and starting, nothing bad happens. I have been doing this for years.
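For anyone who wants to script those steps, here is a rough Python sketch that only builds and prints the list of commands rather than running anything. It assumes "vmps" prints `ps -ef`-style lines with the PID in the second column, which should be checked on your own system. The function names and the sample listing are made up for illustration, the daemon list is a subset for a TLD robot, and /usr/openv/volmgr/bin/ltid is the default install path, not necessarily yours.

```python
import os

# Illustrative subset of Media Manager daemon names for a TLD robot.
DAEMONS = {"ltid", "vmd", "avrd", "tldd", "tldcd"}


def parse_process_lines(text):
    """Extract (pid, command) pairs for known daemons from ps -ef style output."""
    procs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip headers and blank lines
        for field in fields[2:]:
            if os.path.basename(field) in DAEMONS:
                procs.append((int(fields[1]), field))
                break
    return procs


def plan_restart(procs, default_ltid="/usr/openv/volmgr/bin/ltid"):
    """Return the commands in order: kill each listed daemon, then start ltid."""
    ltid_path = default_ltid
    for _, command in procs:
        if command.startswith("/") and os.path.basename(command) == "ltid":
            ltid_path = command  # reuse the full path seen in the listing
            break
    plan = [["kill", str(pid)] for pid, _ in procs]
    plan.append([ltid_path])
    return plan


# Made-up sample listing, not captured from a real server.
sample = """\
UID    PID  PPID  C    STIME TTY  TIME CMD
root  1201     1  0 09:10:01 ?    0:02 /usr/openv/volmgr/bin/ltid
root  1207  1201  0 09:10:02 ?    0:01 tldd
root  1211  1201  0 09:10:02 ?    0:00 tldcd
"""
for command in plan_restart(parse_process_lines(sample)):
    print(" ".join(command))
```

Keeping it as a dry run is deliberate: the plan can be reviewed before anything is killed, and the gap between the last kill and starting ltid stays short if the printed commands are then run back to back.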
11-05-2013 09:06 PM
Excellent explanation, Martin!
In short - it looks like a network comms issue between the media server and the robot control host.
If there is a network comms issue, I cannot see how new Linux media servers will solve it....
Add a VERBOSE entry to vm.conf on all media servers and restart ltid to increase Media Manager logging to /var/adm/messages.
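As a small illustration of the vm.conf change, the Python sketch below appends a VERBOSE line only if one is not already there. The path /usr/openv/volmgr/vm.conf is the default location and the function name is invented for the example; ltid still has to be restarted afterwards for the entry to take effect.

```python
import os


def ensure_verbose(vm_conf="/usr/openv/volmgr/vm.conf"):
    """Append a VERBOSE entry to vm.conf if missing.

    Returns True if the file was changed, False if VERBOSE was already present.
    """
    lines = []
    if os.path.exists(vm_conf):
        with open(vm_conf) as fh:
            lines = fh.read().splitlines()
    if any(line.strip() == "VERBOSE" for line in lines):
        return False
    lines.append("VERBOSE")
    with open(vm_conf, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return True
```

Running it twice is harmless, since the second call finds the entry and leaves the file alone. Remember to remove the entry once the problem has been captured, as the extra logging can be noisy.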
11-06-2013 01:00 AM
Thank you for this!
I will put some more logging in place to try to capture the issue next time it happens. The reason I am looking forward to the Linux boxes going in is that this will of course replace the hardware, NICs, cables and switch ports, so everything will be new and hopefully there will be no more network issues.
From your list of causes, it would seem the most likely one for us is the connection between tldd and tldcd failing for some reason.
Cheers
James
11-06-2013 01:05 AM
I will try this next time, but I do suspect it will cause problems. Since NBU 6.0 I have seen cases where ltid will not start because resources are already assigned on the host, specifically where the media server processes have not been stopped cleanly.
When that happens you need to run nbrbutil -resetMediaServer to clear all allocations from that media server, which does impact running backups. They all fail with a resource allocation message.
Will give it a go though, can't hurt if I have to restart anyway :)
01-31-2014 02:16 PM
Are you using the latest:
1) Firmware on your HBAs
2) Drivers for your libraries
3) Drivers for your NICs
I often had the same challenges. It was usually a driver or config problem, and an update/upgrade corrected the issue.
G/L