10-20-2018 03:36 AM
What do “Interrupted system call (4)”, “stat = -2”, and “error 24” mean in the following debug/robot log entries?
20:07:59.674 [17491] <3> send_command: TLD(1) [17491] unable to read ack from tldcd, Interrupted system call (4), stat = -2
20:07:59.674 [17491] <2> init_resilient_cache: [vnet_nbrntd.c:880] Initialize resilient cache. 0 0x0
20:07:59.674 [17491] <16> TldChildExit: Child process terminated abnormally with error 24
20:07:59.681 [10713] <5> GetResponseStatus: DecodeQuery() Actual status: Control daemon connect or protocol error
20:07:59.681 [10713] <3> DecodeQuery: TLD(1) unavailable: initialization failed: Control daemon connect or protocol error
Note: if anyone needs more information from the logs of the problematic processes above, please ask!
02-26-2019 07:52 PM
I think there should indeed be many tldcd child processes for the same robot, opened in response to requests from the corresponding media servers while the robot was inactive, and that this is also what caused the command-send failures in those tldcd child processes!
02-26-2019 11:13 PM
No, there should only be a tldcd process on one media server, namely the robot control host. Other media servers should not be zoned to, or have visibility of, the robot.
Any media server that has drives in a robot will have a tldd process, even if the robot is controlled by a different machine.
Only the robot control host should have a tldcd process. If the RCH also has drives, it will have a tldd process as well.
02-27-2019 12:27 AM - edited 02-27-2019 12:36 AM
I may have been misunderstood!
Certainly there is one tldcd process per media server, but when many bptm requests arrive from remote media servers for the same robot within a short interval, that results in many corresponding tldcd child processes on the RCH!
Note: refer to my other post, "The working mechanism about tldd/tldcd".
And if the robot was also inactive (stuck) for some reason in the meantime, then the above problem with EINTR/EMFILE/SG_DXFER_TO_DEV would occur!
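As an aside, a robust daemon normally retries a read() that fails with EINTR rather than treating it as a hard error; a minimal sketch of such a retry loop (read_retry is a hypothetical helper for illustration, not NetBackup source):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* read_retry: retry read() whenever a signal interrupts it
 * (errno == EINTR), the usual way a daemon avoids treating
 * an interrupted system call as a failure. Hypothetical helper. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}
```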
02-27-2019 04:35 AM
OK, so yes: if you have many requests, they are simply handled one at a time, which should not cause the issue you see.
I am pretty certain the issue is outside NBU.
02-27-2019 06:08 AM - edited 02-27-2019 06:12 AM
As you said this issue is outside of NBU; could I assume it is due to the stuck robot?
02-28-2019 10:53 PM
I don't know if it's the library, or the OS / Server causing the problem.
Do you have multiple media servers? Is it possible to move the RCH to a different server?
02-28-2019 11:12 PM
Could you check the nofiles value (ulimit -a)? We saw the logs complaining about too many open files, so this is worth checking. It should be at least 8192.
By chance, I am working this week with a colleague who is a bit of an expert in pretty much everything... We just had a chat, and it definitely seems that the error happens when we try to send a command, at which point the sg driver throws a fit.
Could you confirm whether this happens 100% of the time or is intermittent? I think intermittent, since robtest worked, but it could perhaps be related to load on the machine...
03-01-2019 07:16 PM - edited 03-06-2019 06:06 AM
It should not happen too often, and mostly when the tape library's hardware parts were in trouble, such as the MCP/IOB or a TD!
The nofiles parameter is set to 8000, so I wonder whether the number of open files might have exceeded this value when many NBU jobs hung due to the stuck tape library!?
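If the soft limit were ever reached, open() would start failing with EMFILE, which matches the "error 24" in the original log. A sketch that demonstrates this by deliberately shrinking the soft limit (exhaust_fd_limit is a made-up name for illustration):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Lower the soft fd limit, then open /dev/null until open() fails,
 * returning the errno of the failure. On Linux this is EMFILE (24),
 * the "error 24" seen in the robot log. Illustration only. */
int exhaust_fd_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = 32;                 /* shrink the soft limit only */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    while (open("/dev/null", O_RDONLY) != -1)
        ;                             /* fds leak on purpose here */
    return errno;
}
```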