Hello guys,
Over the last few weeks we have noticed a really strange issue.
Jobs were failing with status 150 from time to time. At first we thought that somebody had stopped them manually, but then we noticed that all running jobs are killed at the same time.
After investigating, I found only these messages in /var/adm/messages:
35332 Jun 2 18:43:15 xxx last message repeated 1 time
35333 Jun 2 18:46:45 xxx tldcd[9247]: [ID 682742 daemon.notice] Processing UNMOUNT, TLD(412) drive 13, slot 102, barcode M01612L1 , vsn M01612
35334 Jun 2 18:46:45 xxx tldcd[8842]: [ID 932272 daemon.notice] TLD(412) opening robotic path /dev/sg/c0t4l1
35335 Jun 2 18:46:45 xxx tldcd[8842]: [ID 325794 daemon.notice] inquiry() function processing library IBM 03584L32 7050:
35336 Jun 2 18:46:45 xxx tldcd[8842]: [ID 195861 daemon.notice] TLD(412) initiating MOVE_MEDIUM from addr 269 to addr 1126
35337 Jun 2 18:46:52 xxx tldcd[8842]: [ID 860284 daemon.notice] TLD(412) closing/unlocking robotic path
35338 Jun 2 18:46:53 xxx tldcd[9247]: [ID 405217 daemon.notice] inquiry() function processing library IBM 03584L32 5360:
35339 Jun 2 18:46:53 xxx tldcd[9247]: [ID 325794 daemon.notice] inquiry() function processing library IBM 03584L32 7050:
35340 Jun 2 18:46:53 xxx last message repeated 1 time
35341 Jun 2 18:48:53 xxx ltid[7776]: [ID 265732 daemon.warning] Sending shutdown to tldcd daemon...
35342 Jun 2 18:48:53 xxx tldcd[9247]: [ID 891058 daemon.notice] TERMINATION requested from cp8055
35343 Jun 2 18:48:55 xxx vmd[8134]: [ID 631293 daemon.notice] terminating - successful (0)
35344 Jun 2 18:48:55 xxx vmd[8134]: [ID 715111 daemon.error] volume daemon terminating because it received a signal (15)
35345 Jun 2 18:48:55 xxxx vmd[8134]: [ID 164182 daemon.error] terminating - daemon terminated (7)
35346 Jun 2 18:49:19 xxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
35347 Jun 2 18:49:24 xxx login: [ID 446407 auth.crit] REPEATED LOGIN FAILURES ON /dev/console
35348 Jun 2 18:49:28 xxx vmd[6567]: [ID 617826 daemon.notice] ready for connections
35349 Jun 2 18:49:40 xxx tldd[7843]: [ID 918256 daemon.notice] Device=2, TLD=100, DRIVE=1
May 27 03:26:56 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 27 14:17:36 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 29 02:21:14 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 29 08:15:11 xxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
May 30 01:12:02 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
Jun 2 05:14:07 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
Jun 2 18:49:19 xxxx [VRTS.scnb,scnb-harg,scnb-hars]: [ID 702911 daemon.error] Failed to stop scnb using the custom stop command; trying SIGKILL now.
There is nothing special before or after.
Also, the jobs that are running at that time are killed 3-4 minutes before this error appears in /var/adm/messages.
So, for example, the 150 errors happen at 18:46 and the error is logged at 18:49.
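To line these stop failures up against the job end times, a rough sketch like the one below could pull their timestamps out of the messages file. It assumes the syslog layout shown above (an optional leading line number, then month, day and time). The function name `stop_failures` and the `year` argument are made up for illustration; syslog lines carry no year, so it has to be supplied.

```python
import re
from datetime import datetime

# Optional leading line number, then "Mon D HH:MM:SS" as in the excerpts above.
STAMP = re.compile(r"^(?:\d+\s+)?([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")

def stop_failures(lines, year=2000, needle="Failed to stop scnb"):
    """Return the datetimes of syslog lines that contain `needle`.

    `year` is an assumption supplied by the caller, because the
    log lines themselves do not include one.
    """
    hits = []
    for line in lines:
        if needle not in line:
            continue
        match = STAMP.match(line)
        if match is None:
            continue  # e.g. the wrapped second half of a long line
        mon, day, clock = match.groups()
        hits.append(datetime.strptime(f"{year} {mon} {day} {clock}",
                                      "%Y %b %d %H:%M:%S"))
    return hits
```

Calling it as `stop_failures(open("/var/adm/messages"), year=...)` gives a list that can be compared with the times the status 150 jobs ended, to see whether the 3-4 minute gap holds for every occurrence.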
To me it seems like a cluster issue, because it wasn't happening when we ran on a single node. The errors started after we enabled failover and clustering.
In the NetBackup logs I didn't find anything special.
Any ideas?