Forum Discussion

liuyl
Level 6
6 years ago

Error codes from the problematic tldd/tldcd processes

What do “Interrupted system call (4)”, “stat = -2” and “error 24” mean in the following debug/robot logs?

20:07:59.674 [17491] <3> send_command: TLD(1) [17491] unable to read ack from tldcd, Interrupted system call (4), stat = -2
20:07:59.674 [17491] <2> init_resilient_cache: [vnet_nbrntd.c:880] Initialize resilient cache. 0 0x0
20:07:59.674 [17491] <16> TldChildExit: Child process terminated abnormally with error 24
20:07:59.681 [10713] <5> GetResponseStatus: DecodeQuery() Actual status: Control daemon connect or protocol error
20:07:59.681 [10713] <3> DecodeQuery: TLD(1) unavailable: initialization failed: Control daemon connect or protocol error

Note: if anyone needs more information from the logs of these problematic processes, please ask!

27 Replies

  • System calls are interactions between a program and the kernel, e.g. open / close / read / write, plus many others.

    tldcd is the process that actually talks to the robot, so if you want a tape loading for example, tldcd is the process that gets that done.

    So, it would appear you have a library issue ... that could include a comms issue between server / library.

    unable to read ack from tldcd - some command was sent to the library, we didn't get a response back

    I would start by running robtest from the robot control host ...

    Once it starts, it will run the SCSI 'mode' command, which queries the library to find out its status / number of drives / slots etc ...

    If this completes, it should leave you at the robtest prompt.

    Try 

    s s  (show slots)

    s p  (show ports; this is what we call the map)

    s d (show drives)

    m s1 d1  (move tape from slot 1 to drive 1)

    m d1 s1  (move tape back again)

    For the last two commands (move) pick a drive that is empty and a slot that has a tape (from s s and s d outputs if these work).

    If that lot works, we'll have to think again ...

    I presume something in NBU is failing when you get this error? Unless it is intermittent and retries successfully.

    • liuyl
      Level 6

      Yes, everything works fine in robtest! But almost all my backup jobs hang or exit with a failure!
      So I need the definitions or explanations of the numeric parts of the Interrupted system call, stat and error values!

      • mph999
        Level 6

        We don't define the values; they come from the operating system ...

        All I can tell, at the moment, is that we don't get a response back, or at least not within a specified time.

        When did this issue start ...

  • You cannot look at these log snippets in isolation.

    A full understanding of the backup infrastructure is needed, along with all relevant Media Manager logs on the robot control host and media server(s). 

    Without having any insight into your environment, my guess is that there are intermittent network issues between the robot control host and this media server.
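
    To line up entries from several hosts' logs, it can help to split each line into fields first. The sketch below assumes the layout seen in the lines quoted in the question (time, [pid], <severity>, function, message); the names LINE_RE, LogEntry and parse_line are made up for illustration, and other NetBackup logs may use a different layout.

```python
import re
from typing import NamedTuple, Optional

# Layout inferred from the quoted log lines (an assumption, not a documented format):
#   HH:MM:SS.mmm [pid] <severity> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\] "
    r"<(?P<severity>\d+)> "
    r"(?P<function>[^:]+): "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    time: str
    pid: int
    severity: int
    function: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        m["time"], int(m["pid"]), int(m["severity"]), m["function"], m["message"]
    )
```

    Sorting the parsed entries from the robot control host and the media server by time, and grouping them by pid, makes it easier to see which process reported what, and in which order.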

    • liuyl
      Level 6

      I have now uploaded my NBSU log as an attachment!

      And the tldd/tldcd fault seems to have started with the following job log record:
      08/08/2018 17:00:04 jcbak jdweb error requesting media, TpErrno = No robot daemon or robotics are unavailable

      So what mph999 said, that tldcd stopped responding, is right.

      • mph999
        Level 6

        It depends :

        If the media server running the job is NOT the robot control host, then it could well be that there was / is some network issue between the media server and robot control host.

        At a high level, to get tapes loaded / unloaded etc ...

        tldd talks to tldcd which talks directly to the robot

        tldd runs on every media server that has tape drives in the robot

        tldcd runs ONLY on the Robot Control Host

        So, for tldd on a media server to talk to tldcd on the robot control host, it has to communicate over the network.

        Even if the job that fails is running on the robot control host itself, tldd still has to talk to tldcd ...

        I suspect here, as Marianne highlighted, that the media servers where you see errors are separate from the robot control host.
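
        Since tldd has to reach tldcd over the network, a quick first check from the media server is whether a TCP connection to the robot control host can be opened at all. This is only a rough sketch: can_connect is a made-up helper, and the host and port you pass in are placeholders for whatever your environment actually uses.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within the timeout.

    Name resolution failures, refused connections and timeouts are all
    subclasses of OSError, so any of them results in False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

        A call would look like can_connect("robot-control-host.example.com", port), with the port of the daemon you want to reach. A True result only shows that the connection can be set up; it says nothing about a connection that drops later, so an intermittent problem needs the check repeated over time.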

  • Interrupted system call (4), stat = -2  is what it says - a system call (OS) error. Not NBU.
    Maybe you could search OS forums for an explanation....

    error 24  is a network error. Connection was established initially, but something happened in the network stack that dropped communication. 
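
    The number in parentheses after "Interrupted system call" is the operating system's errno value, and it can be looked up on the host itself rather than on a forum. A minimal sketch (describe_errno is a made-up helper; the exact text comes from the OS and can differ between platforms):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "unknown")
    return f"{code} = {name}: {os.strerror(code)}"


# The value seen in the log; on Linux, 4 corresponds to EINTR.
print(describe_errno(4))
```

    This only works for the errno part. The "stat = -2" value cannot be looked up this way, because errno values are positive; it is presumably a return status internal to the program that logged it.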

    • mph999
      Level 6

      A perfect opportunity to repost this ...

      I describe the 23/24/25 status codes as follows:

      RC=23: Server A sent an IP packet to valid server B and is waiting for a response packet. It fails to get the response packet within the TIMEOUT window and raises rc=23.

      RC=25: Server A tried to send an IP packet to invalid server B. No connection is made, so Server A sets rc=25.

      RC=24: Server A sends a packet to server B and gets a response within the TIMEOUT window. But something happens that drops the connection between them.

      I make an analogy of this communication environment using phone calls:

      Person on Phone A calls phone number B, which connects, and leaves a voice mail asking for a call back. They wait for a call back that does not come and, after a specified time, they quit. RC=23.

      Person A calls the phone number for what they think is a valid Phone B. The call does not go through and they hear the message "The number you have dialed is not a working number". RC=25.

      Person A calls Person B, the call is picked up, but the line connection somehow gets dropped unexpectedly while communication is in progress. RC=24.

      All of these are communication errors of some kind.

      For RC=25, the source server may have the wrong target server name in its environment or an invalid/wrong IP address for the target server.

      For RC=23, A can talk to B but B cannot talk to A. B may not recognize the source server, or it may be using the wrong IP address to respond to. Possibly bad host name to IP resolution.

      RC 24: The toughest of the bunch. A and B know each other correctly. They just can't keep the call going.

      I would give credit to my colleague for the excellent analogy, if only I could remember who it was ...
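
      For anyone who thinks in code rather than phone calls, the same three cases can be loosely mirrored with ordinary socket exceptions. This is an analogy only, not how NetBackup assigns its status codes, and analogous_status is a made-up name:

```python
import socket
from typing import Optional


def analogous_status(exc: BaseException) -> Optional[int]:
    """Map a socket-level exception to the status code it most resembles."""
    # No answer within the timeout window: the unanswered voice mail.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return 23
    # Peer cannot be reached at all (refused, or the name does not resolve):
    # "the number you have dialed is not a working number".
    if isinstance(exc, (ConnectionRefusedError, socket.gaierror)):
        return 25
    # Connection was established and then dropped mid-conversation.
    if isinstance(exc, (ConnectionResetError, ConnectionAbortedError, BrokenPipeError)):
        return 24
    return None
```

      The order of the checks matters only in that the most specific exception types are tested before falling through to None for anything unrelated to these three cases.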

    • liuyl
      Level 6

      Now the strangest puzzle is the error code "stat=-2"!
      I cannot find any clue about it, so perhaps I have to ask on some OS forum!

      • Tousif
        Level 6

        Hello,


        Have you tried to reboot the tape library and media server?

        It would be interesting to know what happens on the library end once the request from the media server reaches the library (I mean the library system log).

        It seems to be something related to the driver or firmware.

        Regards,