We have 28 media servers with 45 drives total, all run through SSO.
All media servers are seeing a loss of robotic communication, at seemingly random times.
The most recent changes, introduced in an attempt to address this problem, are as follows:
1. One large tape SAN zone (a single zone with 28 initiators and 45 targets) was changed to single-initiator/target zones.
2. project.max-file-descriptor=(priv,8000,deny) was added to the project file for NetBackup.
The media servers are a mix of HP-UX and RHEL.
Network interface monitoring indicates no packet loss or collisions.
Sample messages from syslog:
Nov 4 16:17:26 $MASTER_SERVER acsd: [ID 157546 daemon.error] ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 16:19:34 $MASTER_SERVER acsd: [ID 964518 daemon.notice] ACS(0) going to UP state
Nov 4 15:26:33 $MEDIA_SERVER1 acsd: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 15:28:41 $MEDIA_SERVER1 acsd: ACS(0) going to UP state
Nov 3 19:21:31 $MEDIA_SERVER2 acsd: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 3 19:24:01 $MEDIA_SERVER2 acsd: ACS(0) going to UP state
Those messages represent the last three ACSD down/up sequences on three different media servers.
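To compare how long each outage lasted, the DOWN/UP pairs above can be matched up with a short script. This is only a sketch: the function name `outage_durations` and the `year` parameter are made up for illustration (syslog timestamps carry no year, so one has to be assumed), and it only understands the acsd message format shown in the samples.

```python
import re
from datetime import datetime

# Matches the acsd lines quoted above, with or without the "[ID ... daemon.x]" part.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+acsd:.*?"
    r"ACS\((?P<robot>\d+)\) going to (?P<state>DOWN|UP) state"
)


def outage_durations(lines, year=2000):
    """Return (host, down_time, seconds_down) for each DOWN/UP pair.

    `year` is an assumption, needed only because syslog omits it.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp = " ".join(m["ts"].split())  # collapse padded day, e.g. "Nov  4"
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["robot"])
        if m["state"] == "DOWN":
            down_since[key] = when
        elif key in down_since:
            start = down_since.pop(key)
            outages.append((m["host"], start, int((when - start).total_seconds())))
    return outages


SAMPLE = """\
Nov 4 16:17:26 $MASTER_SERVER acsd: [ID 157546 daemon.error] ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 16:19:34 $MASTER_SERVER acsd: [ID 964518 daemon.notice] ACS(0) going to UP state
Nov 4 15:26:33 $MEDIA_SERVER1 acsd: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 15:28:41 $MEDIA_SERVER1 acsd: ACS(0) going to UP state
Nov 3 19:21:31 $MEDIA_SERVER2 acsd: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 3 19:24:01 $MEDIA_SERVER2 acsd: ACS(0) going to UP state
""".splitlines()

for host, start, seconds in outage_durations(SAMPLE):
    print(f"{host}: down at {start:%b %d %H:%M:%S} for {seconds}s")
```

Run against the six sample lines, this reports outages of 128, 128 and 150 seconds. Feeding it the full syslog from every media server would show whether the outage lengths cluster around a fixed timeout value.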
Any pointers would be much appreciated. The biggest recommendation from most of those involved to date is, "My God, you have 1224 LUN paths in an SSO environment."
Experienced engineers at this location believe that this arrangement of resources is acceptable and has worked in the past.
acsss_stats is null and the event log has never shown a loss of connectivity.
We are seeing many SCSI reservation conflicts from avrd, which we think is the root of the problem.
We are running multiple OS types and versions, which complicates some of the configuration.
I still feel ACS up/down is a network comms issue...
It is important to have ALL logs (OS, NBU and ACSLS) available when the issue is seen.
Do you have a VERBOSE entry in vm.conf on all media servers, with volmgr/debug logs enabled?
All of these (as per Martin's article: https://www-secure.symantec.com/connect/articles/quick-guide-setting-logs-netbackup) :(
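With 28 media servers, checking each vm.conf by hand is tedious, so the VERBOSE question above could be scripted. The following is a rough sketch rather than a supported tool: `has_verbose_entry` is a hypothetical helper, it assumes the entry is a line containing only the word VERBOSE, and it assumes the usual /usr/openv/volmgr/vm.conf location.

```python
from pathlib import Path

# Assumed default location of vm.conf on a UNIX/Linux media server.
VM_CONF = Path("/usr/openv/volmgr/vm.conf")


def has_verbose_entry(conf_text):
    """Return True if any line of the vm.conf text is exactly VERBOSE."""
    return any(line.strip() == "VERBOSE" for line in conf_text.splitlines())


if __name__ == "__main__":
    if not VM_CONF.exists():
        print(f"{VM_CONF} not found on this host")
    elif has_verbose_entry(VM_CONF.read_text()):
        print("VERBOSE is set in vm.conf")
    else:
        print("VERBOSE is NOT set in vm.conf")
```

Run it on each media server (for example over ssh) and compare the results. It only inspects the file and does not restart any daemons.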
We are sharing 52 tape drives among 28 non-dedicated media servers (VM hosts, Oracle RAC, Oracle), for a grand total of 1456 resources. We do use STUGs across the board, but this doesn't seem to help much.
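The resource count quoted above is simply servers multiplied by drives, assuming every media server can see every shared drive. A trivial sketch of that arithmetic (the function name `shared_drive_paths` is made up for illustration):

```python
def shared_drive_paths(media_servers, drives):
    """Drive paths in an SSO setup where every server sees every drive."""
    return media_servers * drives


print(shared_drive_paths(28, 52))  # 1456, the figure quoted above
print(shared_drive_paths(28, 45))  # 1260, using the 45 drives from the first post
```

Under the same assumption, the 28 servers and 45 drives from the original post give 1260 paths rather than the 1224 quoted there, so that environment may not be fully meshed or one of the counts may be off.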
Here is an example of a wrapper script
Stop media manager processes (stopltid)
mkdir /usr/openv/volmgr/debug (if not already there ...)