
The error code for the problematic tldd/tldcd processes!

liuyl
Level 6

What do “Interrupted system call (4)”, “stat = -2” and “error 24” mean in the following debug/robot logs?

20:07:59.674 [17491] <3> send_command: TLD(1) [17491] unable to read ack from tldcd, Interrupted system call (4), stat = -2
20:07:59.674 [17491] <2> init_resilient_cache: [vnet_nbrntd.c:880] Initialize resilient cache. 0 0x0
20:07:59.674 [17491] <16> TldChildExit: Child process terminated abnormally with error 24
20:07:59.681 [10713] <5> GetResponseStatus: DecodeQuery() Actual status: Control daemon connect or protocol error
20:07:59.681 [10713] <3> DecodeQuery: TLD(1) unavailable: initialization failed: Control daemon connect or protocol error

Note: if anyone needs more information about the logs of the problematic processes above, please ask.

27 REPLIES

I think there must indeed be many tldcd child processes for the same robot, opened in response to requests from the corresponding media servers while the robot was inactive, and this in turn caused the command sends of those tldcd child processes to fail.

 

mph999
Level 6
Employee Accredited

No, there should only be a tldcd process on one media server, that is, the robot control host (RCH). Other media servers should not be zoned to, or have visibility of, the robot.

Any media server that has drives in a robot, even if the robot is controlled by a different machine, will have a tldd process.

Only the robot control host should have a tldcd process. If the RCH also has drives, it will also have a tldd process.

I may have been misunderstood.

Certainly there is one tldcd process on one media server, but when many bptm requests come from remote media servers to the same robot within a short time interval, that would result in many corresponding tldcd child processes on the RCH.

Note: see my other post, "The working mechanism about tldd/tldcd".

And if the robot was also inactive for some reason at the same time, the above problem with EINTR/EMFILE/SG_DXFER_TO_DEV would occur.
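
For reference, the numbers 4 and 24 in the log lines can be looked up as standard Unix errno values. This is only a minimal sketch, and it assumes (as this thread does) that the "error 24" reported by TldChildExit is an errno value. `describe_errno` is a hypothetical helper name, not part of NBU.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value."""
    # errno.errorcode maps numeric values to names such as 'EINTR'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} = {name}: {os.strerror(code)}"


# The two codes that appear in the robot debug log above.
for code in (4, 24):
    print(describe_errno(code))
```

On Linux these resolve to EINTR and EMFILE, which matches the "Interrupted system call (4)" text already present in the log line.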

 

mph999
Level 6
Employee Accredited

OK, so yes - if you have many requests, they are just handled one at a time - that should not cause the issue you see.

I am pretty certain the issue is outside NBU.

Since you say this issue is outside NBU, can I assume that it is caused by the stuck robot?

mph999
Level 6
Employee Accredited

I don't know if it's the library, or the OS / Server causing the problem.

Do you have multiple media servers? If so, is it possible to move the RCH to a different server?

mph999
Level 6
Employee Accredited

Could you check the nofiles value (ulimit -a) - we saw it complaining about too many open files, so worth checking this.  It should be at least 8192.

By chance, I am working this week with a colleague who is a bit of an expert in pretty much everything. We just had a chat, and it definitely seems that the error happens when we try to send a command, at which point the sg driver throws a fit.

Could you confirm whether this happens 100% of the time or is intermittent? I think intermittent, as robtest worked, but it could perhaps be related to load on the machine.

 

It does not happen often, and mostly when some of the TL's hardware parts, such as the MCP/IOB or a TD, were in trouble.
The nofiles parameter is set to 8000, so I wonder whether the number of open files could exceed this value when many NBU jobs hang because of the stuck TL?
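
One way to check this on the RCH is to compare a process's open-descriptor count with its own limit. The sketch below assumes a Linux host with /proc mounted; `nofile_limit_of` and `open_fd_count` are hypothetical helper names. The limit a running daemon actually has can differ from what `ulimit -a` shows in a login shell, so it is read from /proc/<pid>/limits rather than from the shell.

```python
import os


def nofile_limit_of(pid="self"):
    """Return (soft, hard) 'Max open files' limits read from /proc/<pid>/limits.

    A value of None means the limit is reported as 'unlimited'.
    """
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> files
                soft, hard = line.split()[3:5]
                return tuple(int(v) if v.isdigit() else None for v in (soft, hard))
    raise RuntimeError("no 'Max open files' entry found")


def open_fd_count(pid="self"):
    """Count the descriptors currently open in the given process."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    pid = "self"  # assumption: replace with the PID of the process to inspect
    soft, hard = nofile_limit_of(pid)
    print(f"open descriptors: {open_fd_count(pid)}, soft limit: {soft}, hard limit: {hard}")
```

Inspecting another user's process this way normally requires root. Running it while jobs are hung would show whether the count is anywhere near the configured 8000.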