
Error codes for the problematic tldd/tldcd processes

liuyl
Level 6

What do “Interrupted system call (4)”, “stat = -2” and “error 24” mean in the following debug/robot logs?

20:07:59.674 [17491] <3> send_command: TLD(1) [17491] unable to read ack from tldcd, Interrupted system call (4), stat = -2
20:07:59.674 [17491] <2> init_resilient_cache: [vnet_nbrntd.c:880] Initialize resilient cache. 0 0x0
20:07:59.674 [17491] <16> TldChildExit: Child process terminated abnormally with error 24
20:07:59.681 [10713] <5> GetResponseStatus: DecodeQuery() Actual status: Control daemon connect or protocol error
20:07:59.681 [10713] <3> DecodeQuery: TLD(1) unavailable: initialization failed: Control daemon connect or protocol error

Note: if anyone needs more information from the logs of these processes, please ask!

27 REPLIES

mph999
Level 6
Employee Accredited

System calls are interactions between a program and the kernel, e.g. open / close / read / write, plus many others.

tldcd is the process that actually talks to the robot, so if you want a tape loaded, for example, tldcd is the process that gets that done.

So, it would appear you have a library issue ... that could include a comms issue between server and library.

unable to read ack from tldcd - some command was sent to the library, and we didn't get a response back

I would start by running robtest from the robot control host ...

Once it starts, it will run the SCSI 'mode' command, which queries the library to find out its status, number of drives, slots etc ...

If this completes, it should leave you at the robtest prompt.

Try 

s s  (show slots)

s p  (show ports; this is what we call the map)

s d (show drives)

m s1 d1  (move tape from slot 1 to drive 1)

m d1 s1  (move tape back again)

For the last two commands (move) pick a drive that is empty and a slot that has a tape (from s s and s d outputs if these work).

If that lot works, we'll have to think again ...

I presume something in NBU is failing when you get this error? Unless it is intermittent and retries successfully.

Yes, everything works fine in robtest! But almost all my backup jobs hung or exited with a failure!
So I need to know the definitions or explanations of the numeric parts of the "Interrupted system call", "stat" and "error" messages!

mph999
Level 6
Employee Accredited

We don't define the values, they come from the operating system ...

All I can tell, at the moment, is that we don't get a response back, or at least not within a specified time.

When did this issue start ...

Marianne
Moderator
Partner    VIP    Accredited Certified

You cannot look at these log snippets in isolation.

A full understanding of the backup infrastructure is needed, along with all relevant Media Manager logs on the robot control host and media server(s). 

Without having any insight into your environment, my guess is that there are intermittent network issues between the robot control host and this media server.

I have now uploaded my NBSU log as an attachment!

And the tldd/tldcd fault appears to have started with the following job log record:
08/08/2018 17:00:04 jcbak jdweb error requesting media, TpErrno = No robot daemon or robotics are unavailable

So what mph999 said, that tldcd stopped responding, is right.

mph999
Level 6
Employee Accredited

It depends:

If the media server running the job is NOT the robot control host, then it could well be that there was/is some network issue between the media server and the robot control host.

At a high level, to get tapes loaded/unloaded etc ...

tldd talks to tldcd which talks directly to the robot

tldd Runs on every media server that has tape drives in the robot

tldcd Runs ONLY on the Robot Control Host

So, for tldd on a media server to talk to tldcd on the robot control host, it has to communicate over the network.

If the job that fails is running on the robot control host itself (e.g. a backup job), tldd still has to talk to tldcd ...

I suspect here, as Marianne highlighted, that the media servers where you see errors are separate from the robot control host.
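
Since tldd on a media server has to reach tldcd on the robot control host over the network, a quick TCP reachability check run from the media server can help rule the network in or out. The following is only a minimal Python sketch, not a NetBackup tool: the function name `can_connect` is made up for illustration, and the host and port you pass in are your own values.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers name resolution failures, refused connections and timeouts
        return False
```

For example, `can_connect("rch.example.com", 13711)` would test a placeholder host name on the port commonly listed for tldcd in /etc/services; both values are assumptions, so check the entries on your own systems. A successful connect only shows that the port was reachable at that moment, not that the daemon is answering requests.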

I think the focus should be on the RCH itself, not on the remote media servers or the network communication!


For some other reason, tldd on the media servers could not get responses from tldcd on the RCH!

Marianne
Moderator
Partner    VIP    Accredited Certified

Interrupted system call (4), stat = -2 is what it says: a system call (OS) error. Not NBU.
Maybe you could search OS forums for an explanation....

error 24  is a network error. Connection was established initially, but something happened in the network stack that dropped communication. 

Now the strangest puzzle is the "stat = -2" code!
I cannot find any clue about it, so perhaps I have to ask about it on an OS forum!

Hello,


Have you tried to reboot the tape library and media server?

It would be interesting to know what happens on the library end once the request from the media server reaches the library (I mean in the library system log).

It seems to be something related to the driver or firmware.

Regards,

No, we just rebooted the RCH, and then everything was fine!

Marianne
Moderator
Partner    VIP    Accredited Certified

Rebooting RCH would refresh/reload OS-level resources - memory, cpu, FC. 
Please enable verbose volmgr/debug logs for future troubleshooting. (Info in NBU logging reference guide.)
If everything in this environment is as old and out of support as your NBU software, then the problem is going to return sooner or later....

mph999
Level 6
Employee Accredited

Yes, I couldn't find that easily - it'll be out there somewhere I suspect ...

 

mph999
Level 6
Employee Accredited

A perfect opportunity to repost this ...

I describe the 23/24/25 status codes as follows:

RC=23: Server A sent an IP packet to valid server B, and is waiting for a response packet. It fails to get the response packet within the TIMEOUT window and raises rc=23.

RC=25: Server A tried to send an IP packet to invalid server B. No connection is made, so Server A sets rc=25.

RC=24: Server A sends a packet to server B and gets a response within the TIMEOUT window. But something happens that drops the connection between them.

I make an analogy of this communication environment using phone calls:

Person on Phone A calls phone number B, which connects, and leaves a voice mail asking for a call back. They wait for a call back that does not come and, after a specified time, they quit. RC=23.

Person A calls the phone number for what he thinks is a valid Phone B. The call does not go through and they hear the message "The number you have dialed is not a working number". RC=25.

Person A calls Person B, the call is picked up, but the line connection somehow gets dropped unexpectedly while communication is in progress. RC=24.

All of these are communication errors of some kind.

For RC=25, the source server may have the wrong target server name in its environment or an invalid/wrong IP address for the target server.

For RC=23, A can talk to B but B cannot talk to A. Could be a source server it does not recognize or it is using the wrong IP address to respond to. Possible bad host name to IP resolution.

RC 24: The toughest of the bunch. A and B know each other correctly. They just can't keep the call going.

I would give credit to my colleague for the excellent analogy, if only I could remember who it was ...
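
To make the three cases a little more concrete, here is a rough Python sketch that sorts ordinary socket exceptions into the same three buckets. This is only an illustration of the analogy: the function name `classify` and the mapping itself are assumptions, and NetBackup's own logic for choosing 23, 24 or 25 is not shown anywhere in this thread.

```python
import socket

def classify(exc: BaseException) -> int:
    """Map a socket-level exception to the status code it most resembles.

    Illustrative mapping only, not NetBackup's actual decision logic.
    """
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return 23  # peer is valid, but no response arrived within the timeout
    if isinstance(exc, (ConnectionResetError, ConnectionAbortedError, BrokenPipeError)):
        return 24  # the conversation started, then the connection was dropped
    if isinstance(exc, (socket.gaierror, ConnectionRefusedError)):
        return 25  # the target could not be resolved or would not accept a connection
    raise ValueError(f"not a recognised communication error: {exc!r}")
```

The order of the checks matters only in that the timeout case is tested first, since on some Python versions `socket.timeout` is a general `OSError` subclass.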

That is a wonderful analogy!

But I think the error code 24 in my issue is not the same as the common status code 24 for network-related problems!
"Child process terminated abnormally with error 24"
It seems that this error 24 is also an OS-level code, like “Interrupted system call (4)” and “stat = -2”.

Marianne
Moderator
Partner    VIP    Accredited Certified

@mph999 wrote:

I would give credit to my colleague for the excellent analogy, if only I could remember who it was ...


I bookmarked it: 

Jaime Vazques : https://vox.veritas.com/t5/NetBackup/Error-23/m-p/738836#M201891

Sadly one of the casualties during the Symantec/Veritas split ....

Yes.. :(

After such a long time, I almost forgot to post the answer that I eventually found at the OS layer!

1)  /usr/include/asm-generic/errno-base.h:

#define EINTR            4 /* Interrupted system call */

#define EMFILE        24 /* Too many open files */

 

2)  /usr/include/scsi/sg.h:

/* Use negative values to flag difference from original sg_header structure. */
#define SG_DXFER_NONE               -1 /* e.g. a SCSI Test Unit Ready command */
#define SG_DXFER_TO_DEV           -2 /* e.g. a SCSI WRITE command */
#define SG_DXFER_FROM_DEV     -3 /* e.g. a SCSI READ command */
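
The errno part of this lookup can also be done without opening the header files. The sketch below uses Python's standard `errno` and `os` modules; the helper name `describe_errno` is made up for illustration. Errno numbers are platform-specific, so it should be run on the same OS as the host that wrote the log.

```python
import errno
import os

def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an OS errno value on the current platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"

# The two numeric codes that appear in the tldd/tldcd log lines
for code in (4, 24):
    print(code, describe_errno(code))
```

This only covers the errno values; the "stat = -2" value is not an errno and is not decoded here.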

 

mph999
Level 6
Employee Accredited

Well done finding that information.

20:07:59.674 [17491] <3> send_command: TLD(1) [17491] unable to read ack from tldcd, Interrupted system call (4), stat = -2

From the error and your findings, it suggests that the interrupted system call happens when we try to send something to the robot, as opposed to reading some response ... e.g. SCSI mode sense or SCSI move medium
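
When looking for how often this happens, it can help to pull the numeric fields out of lines like the one above across a whole log. This is a minimal Python sketch that assumes every line has the same shape as the ones quoted in this thread (time, [pid], <level>, function, message); the names `LINE_RE`, `ERR_RE` and `parse_line` are made up for illustration.

```python
import re

# HH:MM:SS.mmm [pid] <level> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[(?P<pid>\d+)\] "
    r"<(?P<level>\d+)> (?P<func>\w+): (?P<msg>.*)$"
)
# ", <error text> (<errno>), stat = <n>" at the tail of a message
ERR_RE = re.compile(r", (?P<text>[^,]+) \((?P<errno>\d+)\), stat = (?P<stat>-?\d+)")

def parse_line(line: str):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = {
        "time": m["time"],
        "pid": int(m["pid"]),
        "level": int(m["level"]),
        "func": m["func"],
        "msg": m["msg"],
    }
    # errno/stat fields are added only when the message carries them
    e = ERR_RE.search(m["msg"])
    if e:
        rec.update(text=e["text"], errno=int(e["errno"]), stat=int(e["stat"]))
    return rec
```

Lines without an errno and stat pair, such as the TldChildExit line, still parse, but the returned dict has no `errno` or `stat` keys.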

So, it is interesting, but not groundbreaking in terms of what is wrong.

Is the issue intermittent, or happening 100% of the time?
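
If error 24 here really is EMFILE (too many open files), as found in the header above, one thing worth watching on the RCH before the next failure is how many file descriptors a process has open compared with its limit. Below is a minimal Python sketch for Linux only, assuming /proc is mounted; the function name `fd_usage` is made up for illustration, and you would need to supply the PID of the process you care about (tldcd, for example).

```python
import os

def fd_usage(pid: int):
    """Return (open_fd_count, soft_limit) for a Linux process, read from /proc.

    soft_limit is None when the limit is 'unlimited' or the line is not found.
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units
                value = line.split()[3]
                soft_limit = int(value) if value.isdigit() else None
                break
    return open_fds, soft_limit
```

Reading /proc/<pid>/fd for another user's process normally requires root. A count that keeps climbing toward the soft limit over days would fit with a problem that clears after a reboot, but that is only a hypothesis to check, not a confirmed diagnosis.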