I think there would indeed be many tldcd child processes for the same robot, spawned by requests from the corresponding media servers while the robot is unresponsive, and those child processes would then fail when they try to send commands!
No, there should be a tldcd process on only one media server, the robot control host (RCH). Other media servers should not be zoned to, or have visibility of, the robot.
Any media server that has drives in a robot, even if the robot is controlled by a different machine, will have a tldd process.
Only the robot control host should have a tldcd process. If the RCH also has drives, it will also have a tldd process.
I may have been misunderstood!
Certainly there is one tldcd process on one media server, but when many bptm requests arrive from remote media servers for the same robot within a short interval, that results in many corresponding tldcd child processes on the RCH!
Note: see my other post, "The working mechanism about tldd/tldcd".
And if the robot were also unresponsive for some reason at the same time, the above problem with EINTR/EMFILE/SG_DXFER_TO_DEV would occur!
OK, so yes - if you have many requests, they are just handled one at a time, which should not cause the issue you see.
I am pretty certain the issue is outside NBU.
I don't know if it's the library, or the OS / Server causing the problem.
Do you have multiple media servers? If so, is it possible to move the RCH to a different server?
Could you check the nofiles value (ulimit -a)? We saw it complaining about too many open files, so this is worth checking. It should be at least 8192.
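As a rough sketch of that check: ulimit -a reports the limit of the shell it is run in, which is not necessarily the limit a running daemon inherited. Assuming a Linux host (the sg driver suggests one), the kernel exposes each process's limits in /proc/<pid>/limits. The helper names below are made up for illustration and are not part of NBU.

```python
import os

def parse_max_open_files(limits_text):
    """Return (soft, hard) as strings from /proc/<pid>/limits text, or None."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units
            fields = line.split()
            return fields[3], fields[4]
    return None

def read_max_open_files(pid="self", proc_root="/proc"):
    """Read the open-files limit of a running process (assumes Linux /proc)."""
    with open(os.path.join(proc_root, str(pid), "limits")) as fh:
        return parse_max_open_files(fh.read())
```

Calling read_max_open_files(pid) with the PID of the tldcd process (found with ps) shows what the daemon is actually running with. The values come back as strings because a limit can be "unlimited".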
By chance, I am working this week with a colleague who is a bit of an expert in pretty much everything ... We just had a chat, and it definitely seems that the error happens when we try to send a command, at which point the sg driver throws a fit.
Could you confirm whether this happens 100% of the time or is intermittent? I think intermittent, as robtest worked, but it could perhaps be related to load on the machine ...
It should not be very often, and it mostly happens when one of the TL's hardware parts, such as the MCP/IOB or a TD, is in trouble!
The nofiles parameter is set to 8000, so I wonder whether the number of open files might exceed this value when many NBU jobs hang because of the stuck TL!?
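One way to test this doubt, as a rough sketch assuming a Linux RCH: each open descriptor of a process appears as an entry under /proc/<pid>/fd, so counting those entries for tldcd and its children while the TL is stuck would show how close they get to 8000. The function names and the 80% threshold are my own assumptions, not anything from NBU.

```python
import os

def count_open_fds(pid="self", proc_root="/proc"):
    """Count descriptors currently open in one process (assumes Linux /proc)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))

def near_limit(open_count, soft_limit, threshold=0.8):
    """True when usage is at or above `threshold` (assumed 80%) of the limit."""
    return open_count >= threshold * soft_limit
```

nofiles is a per-process limit, so each PID (taken from ps) has to be compared against it separately instead of summing the counts of all child processes.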