Cluster Failover
Hello Experts,
We were experiencing an issue on one of our master servers (Windows 2008), so we moved the NetBackup services to Node1. Backups ran fine for 2-3 hours, but then started failing with error code 98 on Node1. We finally moved the NetBackup services to Node2, which was the original node, and backups were fine again.
(1) The following events were logged on Node1:
TLD(1) [7808] unable to read exit status from tldcd: Error number: (10060), stat = -3
TLD(1) [12200] Drive 7 (device 5) has not become ready. Last status: The requested resource is in use.
TLD(1) [13832] Drive 2 (device 2) has not become ready. Last status: The requested resource is in use.
Request for media ID L00087 is being rejected because mount requests are disabled (reason = robotic daemon going to DOWN state)
Please help.
(2) Whenever we move the NetBackup services from Node1 to Node2 or vice versa, we lose the Activity Monitor job logs. For example, if 10 jobs are running on Node1 and we have to move the NetBackup services to Node2, on Node2 we see only the new jobs, not the 10 jobs that were running on Node1. The same thing happens on our other server as well.
Please help.
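To see at a glance which drives are affected, the "has not become ready" events above can be pulled apart with a short script. This is only a minimal sketch: the regular expression is written against the four lines quoted in this post, and the function name `drives_not_ready` is made up for illustration, not part of NetBackup.

```python
import re

# Matches lines such as:
#   TLD(1) [12200] Drive 7 (device 5) has not become ready. Last status: ...
# The pattern is an assumption based only on the lines quoted in this thread.
NOT_READY = re.compile(
    r"TLD\((?P<robot>\d+)\) \[\d+\] "
    r"Drive (?P<drive>\d+) \(device (?P<device>\d+)\) "
    r"has not become ready\. Last status: (?P<status>.+)"
)

def drives_not_ready(lines):
    """Return (robot, drive, device, status) for each 'not ready' event."""
    found = []
    for line in lines:
        m = NOT_READY.search(line)
        if m:
            found.append((
                int(m.group("robot")),
                int(m.group("drive")),
                int(m.group("device")),
                m.group("status").strip(),
            ))
    return found

# The event log lines quoted in the post.
sample = [
    "TLD(1) [7808] unable to read exit status from tldcd: Error number: (10060), stat = -3",
    "TLD(1) [12200] Drive 7 (device 5) has not become ready. Last status: The requested resource is in use.",
    "TLD(1) [13832] Drive 2 (device 2) has not become ready. Last status: The requested resource is in use.",
    "Request for media ID L00087 is being rejected because mount requests are disabled (reason = robotic daemon going to DOWN state)",
]

for robot, drive, device, status in drives_not_ready(sample):
    print(f"robot {robot}: drive {drive} (device {device}) -> {status}")
```

Both matching events report "The requested resource is in use", which points at something else holding the drives rather than a fault in the drives themselves.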
As per your previous thread about db/media/errors - you had reservation issues - something else was using the drive.
nbrbutil -dump will confirm if NetBackup has a reservation on the drive or tapes already.
If not, power cycle the drives, and also check that no other servers that are not running NetBackup have the drives zoned to them.
As for your jobs not carrying across the nodes, what version of NBU is this?
Is netbackup\db\jobs on the shared disk that floats between the nodes?
The entire netbackup\db should be on the shared disk!!!!
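A quick way to sanity-check this on each node is to compare the drive letter of the jobs directory against the drive letter of the cluster's shared disk. The sketch below assumes the shared disk is presented as a plain drive letter; the paths and the `S:` letter are hypothetical examples, and `on_shared_disk` is an illustrative helper, not a NetBackup tool. It does not detect mount points or junctions.

```python
import ntpath  # Windows path rules, usable on any platform

def on_shared_disk(path, shared_drive):
    """Return True if `path` lives on the drive letter `shared_drive` (e.g. 'S:')."""
    drive, _ = ntpath.splitdrive(path)
    return drive.lower() == shared_drive.lower()

# Hypothetical install locations, for illustration only.
shared = "S:"
candidates = [
    r"S:\VERITAS\NetBackup\db\jobs",
    r"C:\Program Files\VERITAS\NetBackup\db\jobs",
]

for path in candidates:
    where = "shared disk" if on_shared_disk(path, shared) else "LOCAL disk"
    print(f"{path} -> {where}")
```

If the jobs directory turns out to be on a local disk, each node keeps its own copy of the job data, which would explain why the running jobs disappear from view after a failover.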
- You need to troubleshoot the device errors while the master is on node 2. It sounds to me like a lack of SCSI-3 reservation, which is needed in a cluster environment. Node 1 is probably still holding reservations if backups were active during the failover. Read up on SCSI-3 reservation in Admin Guide I. You can also check the vmoprcmd command for how to release a reservation from the command line. Another possibility is incorrect device mappings due to a lack of persistent binding.
- I agree with revaroo. The jobs database should be on shared storage that fails over along with all the other databases.
Sounds like you need a health check on your cluster install.