NetBackup 6.5.1 restarts daemons - all running jobs failed with status 150

frena · ‎06-05-2009

Hello guys,

For the last few weeks we have been seeing a really strange issue.
Jobs were failing with status 150 from time to time. At first we thought somebody was stopping them manually, but then we noticed that all running jobs are killed at the same time.
After investigation I found only these messages in /var/adm/messages:

35332 Jun 2 18:43:15 xxx last message repeated 1 time
35333 Jun 2 18:46:45 xxx tldcd[9247]: [ID 682742 daemon.notice] Processing UNMOUNT, TLD(412) drive 13, slot 102, barcode M01612L1 , vsn M01612
35334 Jun 2 18:46:45 xxx tldcd[8842]: [ID 932272 daemon.notice] TLD(412) opening robotic path /dev/sg/c0t4l1
35335 Jun 2 18:46:45 xxx tldcd[8842]: [ID 325794 daemon.notice] inquiry() function processing library IBM 03584L32 7050:
35336 Jun 2 18:46:45 xxx tldcd[8842]: [ID 195861 daemon.notice] TLD(412) initiating MOVE_MEDIUM from addr 269 to addr 1126
35337 Jun 2 18:46:52 xxx tldcd[8842]: [ID 860284 daemon.notice] TLD(412) closing/unlocking robotic path
35338 Jun 2 18:46:53 xxx tldcd[9247]: [ID 405217 daemon.notice] inquiry() function processing library IBM 03584L32 5360:
35339 Jun 2 18:46:53 xxx tldcd[9247]: [ID 325794 daemon.notice] inquiry() function processing library IBM 03584L32 7050:
35340 Jun 2 18:46:53 xxx last message repeated 1 time
35341 Jun 2 18:48:53 xxx ltid[7776]: [ID 265732 daemon.warning] Sending shutdown to tldcd daemon...
35342 Jun 2 18:48:53 xxx tldcd[9247]: [ID 891058 daemon.notice] TERMINATION requested from cp8055
35343 Jun 2 18:48:55 xxx vmd[8134]: [ID 631293 daemon.notice] terminating - successful (0)
35344 Jun 2 18:48:55 xxx vmd[8134]: [ID 715111 daemon.error] volume daemon terminating because it received a signal (15)
35345 Jun 2 18:48:55 xxxx vmd[8134]: [ID 164182 daemon.error] terminating - daemon terminated (7)
35346 Jun 2 18:49:19 xxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
35347 Jun 2 18:49:24 xxx login: [ID 446407 auth.crit] REPEATED LOGIN FAILURES ON /dev/console
35348 Jun 2 18:49:28 xxx vmd[6567]: [ID 617826 daemon.notice] ready for connections
35349 Jun 2 18:49:40 xxx tldd[7843]: [ID 918256 daemon.notice] Device=2, TLD=100, DRIVE=1

May 27 03:26:56 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 27 14:17:36 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 29 02:21:14 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 29 08:15:11 xxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 30 01:12:02 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
Jun 2 05:14:07 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
Jun 2 18:49:19 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.

There is nothing special before or after.
Also, the jobs that are running at that time are killed 3-4 minutes before this error appears in /var/adm/messages.
So, for example, the status 150 errors happen at 18:46 and the error is logged at 18:49.

To me it looks like a cluster issue, because it wasn't happening when we ran on a single node. The errors started after we enabled failover and the cluster.
In the NetBackup logs I didn't find anything special.
Any ideas?

FrSchind · ‎06-05-2009

Does it stop when you disable the failover? If so, you should investigate your cluster configuration further.

And of course: open a call with Symantec for this.

Deepak_W · ‎06-05-2009

SIGKILL forcibly terminates a process; the process cannot catch or ignore it. So the program being stopped did not exit through its custom stop command and is being killed instead. The reason may be found in the error log. Have you followed the right procedure to put NBU in a cluster? Check the link below for the NBU HA documentation: http://seer.entsupport.symantec.com/docs/284550.htm
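To illustrate the difference between the two signals in the log (vmd reports signal 15, which is SIGTERM, while the cluster agent falls back to SIGKILL), here is a minimal Python sketch for a POSIX system. The child program and the `stop_child` helper are made up for the demonstration and have nothing to do with how NetBackup or the cluster agent is actually implemented.

```python
import signal
import subprocess
import sys

# Hypothetical child program: installs a SIGTERM handler, then idles.
CHILD = r"""
import signal, sys, time
def on_term(signum, frame):
    sys.exit(0)          # clean shutdown path
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""

def stop_child(sig):
    """Start the child, send it `sig`, and return its exit status."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()      # wait until the handler is installed
    proc.send_signal(sig)
    proc.wait(timeout=10)
    proc.stdout.close()
    return proc.returncode

if __name__ == "__main__":
    # SIGTERM is handled, so the child exits cleanly with status 0.
    print("SIGTERM ->", stop_child(signal.SIGTERM))
    # SIGKILL cannot be handled; Python reports a negative signal number.
    print("SIGKILL ->", stop_child(signal.SIGKILL))
```

A process that gets SIGKILL has no chance to clean up, which would fit running jobs ending abruptly rather than finishing.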

frena · ‎06-05-2009

We are on Solaris Cluster.
Our Solaris team will take a look.
And in which log do you mean I should look for error messages?

Deepak_W · ‎06-05-2009

Check the Solaris error logs.

Log a case with Symantec for this.

Also, as Freak said, try disabling the cluster; it will narrow down the issue.
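When going through the Solaris logs, it may help to list what was logged in the few minutes before each "Failed to stop scnb" message, since the jobs die 3-4 minutes earlier. Below is a rough Python sketch, not a tested tool. The function names, the 5-minute window and the year 2009 are assumptions (syslog lines carry no year), and it expects each message on one unwrapped line, with or without a leading line number as in the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Optional leading line number, then "Mon D HH:MM:SS host message".
LINE_RE = re.compile(
    r"^(?:\d+\s+)?([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)
MARKER = "Failed to stop scnb"

def parse(line, year=2009):
    """Return (timestamp, message), or None if the line is not a syslog entry."""
    m = LINE_RE.match(line)
    if not m:
        return None
    mon, day, clock, _host, msg = m.groups()
    try:
        ts = datetime.strptime(f"{year} {mon} {day} {clock}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
    return ts, msg

def events_before_failed_stop(lines, window_minutes=5, year=2009):
    """For each failed-stop entry, collect the entries logged in the window before it."""
    entries = [e for e in (parse(l, year) for l in lines) if e]
    window = timedelta(minutes=window_minutes)
    result = []
    for ts, msg in entries:
        if MARKER in msg:
            before = [(t, m) for t, m in entries if ts - window <= t < ts]
            result.append((ts, before))
    return result
```

It could be fed with `open("/var/adm/messages").read().splitlines()`. If the same kind of entry shows up before every failed stop, that is probably the place to start looking in the cluster configuration.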
