NBU and DNS - a true story

quebek · 07-31-2009

Recently I had an issue with one of my NBU master servers running on AIX. The backup_exit_notify script notified me that all jobs on this server had started to fail with exit code 219 (the required storage unit is unavailable). It looked like an easy issue to solve: probably a tape was stuck in the drive, or the source slot of the tape currently in the drive had been occupied by the tape manager (it happens from time to time at that site).
So I started investigating. I logged on to the affected server via SSH, because the Java GUI launched from my notebook was not working. It loaded, but the only window I got was the one we normally see on a client: the ordinary BAR application. No policies, no storage units, none of the other things related to NBU servers. I even tried running the GUI from the server itself, exporting the DISPLAY variable with my notebook IP:0.0.
NBU_SRV:/ > export DISPLAY=MYNOTEBOOKIP:0.0
On my notebook I started the X server (Cygwin command startx) and allowed any host to open windows on my notebook by executing this command in a Cygwin xterm window:
$ xhost +
access control disabled, clients can connect from any host
$
On the server side I issued this command: /usr/openv/netbackup/bin/jnbSA &
I got the same window as when launching the Java GUI directly from my PC. It looked like the window you get from running /usr/openv/netbackup/bin/jbpSA &
Strange, right?
I started to troubleshoot from the command line on the server.
The first step was to run the vmoprcmd command. Its output said that the tape drive was up: no DOWN-TLD state under Control.
NBU_SRV:/ > vmoprcmd

                                PENDING REQUESTS

                                     <NONE>

                                  DRIVE STATUS

Drv Type   Control User      Label RecMID ExtMID Ready   Wr.Enbl. ReqId
0 hcart3   TLD                -                     No       -        -

                             ADDITIONAL DRIVE STATUS

Drv DriveName            Shared    Assigned        Comment
0 Drive0                No       -

Strange: given the exit code, the drive should have been in a down state. On the other hand, the command took a long time to complete. The output from tpconfig -d also confirmed that the drive was UP, and that command too took abnormally long to complete.
NBU_SRV:/root> tpconfig -d
Index DriveName              DrivePath                Type    Shared   Status
***** *********              **********               ****    ******   ******
0   Drive0                 /dev/rmt0.1              hcart3   No       UP
        TLD(0) Definition       DRIVE=1

Currently defined robotics are:
TLD(0)     robotic path = /dev/ovpass0,
             volume database host = NBU_SRV

I ran bpdbjobs -summary to see how many jobs were in the queue. Yet another strange thing: the command produced no output. After a few seconds I just got the command prompt back, nothing else. Not even the header!
NBU_SRV:/root> bpdbjobs -summary
NBU_SRV:/root>
Hmm, are the bpjobd and bpdbm daemons running?
I checked with bpps -a: all the needed daemons were up and running.
So I ran bpdbjobs again, this time without any switches. The result was of course the same: no output, just the command prompt. To rule out that someone had messed with my server, I ran last|more to see who had logged on and when. Answer: no one except me had accessed this server for the last couple of days. OK, let's troubleshoot further.
I have seen cases where bpjobd was hung and bpdbjobs then gave no output because it hung too, which is similar to what I was facing. My idea was to restart all NBU services. The bp.kill_all command stopped all NBU daemons, and bpps -a confirmed this.
NBU_SRV:/> bpps -a
NB Processes
------------

MM Processes
------------
NBU_SRV:/>
I started NBU again with rc.veritas_aix start. Again it took longer than it used to. The issue persisted: bpdbjobs gave no output and jobs were still failing. So I changed the current working directory to /usr/openv/netbackup/logs and ran this command:
NBU_SRV:/usr/openv/netbackup/logs> grep -R "<16>" *
For the bpcd daemon I saw entries like this:
<16> bpcd main: Couldn't connect to address 10.xx.xx.xxx port 898: Connection refused
Also strange, since the bpps -a output said that all the needed daemons were up and running.
The bprd daemon showed entries like this:
<16> accept_new_connection: Failed to get init frame type
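When the grep above returns a lot of lines, a per-file count can show which daemons are complaining the most. What follows is only a minimal sketch: count_error_lines is a hypothetical helper name (not an NBU command), and it assumes the logs are plain text files under one directory tree, as they are under /usr/openv/netbackup/logs.

```shell
# Hypothetical helper (not part of NBU): count lines containing the "<16>"
# error marker in every file under a log directory.
# Prints "count path" pairs, highest count first; files without matches are skipped.
count_error_lines() {
    log_root=$1
    find "$log_root" -type f | while IFS= read -r f; do
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
        n=$(grep -c '<16>' "$f" || true)
        if [ "$n" -gt 0 ]; then
            printf '%s %s\n' "$n" "$f"
        fi
    done | sort -rn
}

# Example (commented out so that sourcing this file has no side effects):
#   count_error_lines /usr/openv/netbackup/logs
```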
Wow, I had never seen that bprd error before. I started to think about opening a ticket with Symantec support. But then I thought: let's wait, maybe I can solve this myself. Not more than 10 minutes had passed since the initial ticket about failing jobs. Since these errors are network related, my very first thought was to check name resolution. So I ran bpclntcmd back and forth with all its switches (-ip, -sv, -self, -hn, etc.), and all the replies were just fine. Hmm, this is really strange. The only odd thing was that I had to wait for the replies longer than I used to. Of course all new jobs were still failing with the same exit code, 219. Let's try nslookup against my backup server name. After a 'while' I got a proper response. But why did it take so long, and how long exactly? I reran nslookup, this time preceded by timex.
NBU_SRV:/ > timex nslookup NBU_SRV
*** Can't find server name for address 170.64.xxx.xxx:No response from server
Server: dns.mydomain.com
Address: 170.64.xxx.yyy

Name:    NBU_SRV.mydomain.com
Address: 10.xx.xx.xx

real 75.05
user 0.00
sys 0.00
Wow, it took 75 seconds to get an answer from the DNS servers. Next, what did I have in /etc/netsvc.conf? The setting was hosts=local,bind4, so the AIX server consults /etc/hosts first and the DNS servers only after that. Next step: the contents of /etc/resolv.conf. There were three entries for three different DNS servers. After pinging all of them, it turned out that the very first server was unreachable. But why did AIX take so long to switch to the next DNS server? 75 seconds for a single nslookup is, imho, way too long. I started to wonder whether /etc/hosts had any entry for my NBU master server; probably not, since a few months ago we had started to trust DNS. This was confirmed with grep -i NBU_srv_name /etc/hosts. I added the proper entry to /etc/hosts, and guess what happened? That's right, everything started to run smoothly. bpdbjobs returned output and jobs were no longer failing! Issue solved, and it took no more than 20 minutes to sort out.
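The two checks that cracked this case (which DNS servers are configured, and whether the master server has a local hosts entry) can be scripted so they are quick to repeat next time. This is only a sketch: list_nameservers and has_hosts_entry are hypothetical helper names, and the default paths and the NBU_SRV host name are simply the ones from this story.

```shell
# Hypothetical helpers (not NBU or AIX commands).

# List the nameserver addresses from a resolv.conf-style file, in file order.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Succeed if a hosts-style file has a non-comment entry whose host name or
# alias equals the given name (case-insensitive).
has_hosts_entry() {
    host=$1
    hosts_file=${2:-/etc/hosts}
    awk -v h="$host" '
        BEGIN { h = tolower(h); found = 0 }
        { sub(/#.*/, "") }    # drop comments before looking at the fields
        { for (i = 2; i <= NF; i++) if (tolower($i) == h) found = 1 }
        END { exit !found }
    ' "$hosts_file"
}

# Example (commented out so that sourcing this file has no side effects):
#   # time one lookup per configured DNS server, so a dead one stands out
#   for ns in $(list_nameservers); do timex nslookup NBU_SRV "$ns"; done
#   has_hosts_entry NBU_SRV || echo "NBU_SRV is missing from /etc/hosts"
```

Timing each server separately, rather than running one plain nslookup, shows directly which entry in /etc/resolv.conf is the slow one.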
The conclusion is that DNS, or more precisely name resolution in general, has to work properly for NBU to work. So please make sure these services are always available and respond quickly, to avoid the issues I faced! After solving this, I asked my DNS admins to take care of their server.
