11-04-2013 02:18 PM
We have 28 media servers with 45 drives total, all run through SSO.
All media servers are seeing a loss of robotic communication, at seemingly random times.
The most recent changes, made in an attempt to address this problem, are as follows:
1. One large tape SAN zone (a single zone with 28 initiators and 45 targets) was changed to single-initiator/target zones.
2. project.max-file-descriptor=(priv,8000,deny) was added to the project file for NetBackup.
The media servers are a mix of HP-UX and RHEL.
Network interface monitoring indicates no packet loss or collisions.
Sample messages from syslog:
Nov 4 16:17:26 $MASTER_SERVER acsd[16629]: [ID 157546 daemon.error] ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 16:19:34 $MASTER_SERVER acsd[16629]: [ID 964518 daemon.notice] ACS(0) going to UP state
Nov 4 15:26:33 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 15:28:41 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to UP state
Nov 3 19:21:31 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 3 19:24:01 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to UP state
Those messages represent the last three ACSD down/up sequences on three different media servers.
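To compare the outages across hosts, the down/up pairs above can be turned into durations. This is only a rough Python sketch: the function name `outages`, the regular expression, and the assumed year (2013, since syslog timestamps carry none) are illustrative choices, not part of any NetBackup tooling.

```python
import re
from datetime import datetime

# The six sample syslog lines quoted above, verbatim.
SAMPLE_LOG = """\
Nov 4 16:17:26 $MASTER_SERVER acsd[16629]: [ID 157546 daemon.error] ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 16:19:34 $MASTER_SERVER acsd[16629]: [ID 964518 daemon.notice] ACS(0) going to UP state
Nov 4 15:26:33 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 15:28:41 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to UP state
Nov 3 19:21:31 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 3 19:24:01 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to UP state
"""

# Matches acsd state changes with or without the "[ID ...]" tag.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+acsd\[\d+\]:.*"
    r"ACS\((?P<robot>\d+)\) going to (?P<state>UP|DOWN) state"
)


def outages(log_text, year=2013):
    """Return (host, down_time, seconds_down) for each DOWN/UP pair.

    The year is an assumption; syslog lines do not include it.
    """
    down_since = {}
    results = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        stamp = " ".join(m["ts"].split())  # collapse padded day fields
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["robot"])
        if m["state"] == "DOWN":
            down_since[key] = ts
        elif key in down_since:
            start = down_since.pop(key)
            results.append((m["host"], start, (ts - start).total_seconds()))
    return results


if __name__ == "__main__":
    for host, start, seconds in outages(SAMPLE_LOG):
        print(f"{host}: down at {start:%b %d %H:%M:%S} for {seconds:.0f} s")
```

Run against full syslog files from every media server, the same pairing would show whether the outage lengths cluster around a fixed value, which may hint at a timeout setting rather than random failures.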
Any pointers would be much appreciated. The biggest reaction from most of those involved to date has been, "My God you have 1224 LUN paths in an SSO environment."
Experienced engineers at this location believe that this arrangement of resources is acceptable and has worked in the past.
Cheers,
Chris Oliver
11-07-2013 11:27 PM
ACSD going up and down is all about network connectivity to the ACSLS server - nothing to do with SAN/SSO config.
You will need to check logs on ACSLS server to see if all is well over there.
I would start with acsss_stats.log and acsss_event.log.
11-21-2013 08:46 AM
Marianne,
acsss_stats is empty and the event log has never shown a loss of connectivity.
We are seeing many SCSI reservation conflicts from avrd, which we think is the root of the problem.
We are running multiple OS types and versions, which complicates some of the configuration.
Thanks,
Chris
11-21-2013 12:22 PM
I still feel ACS up/down is a network comms issue...
Important to have ALL logs (OS, NBU and ACSLS) available when issue is seen.
Do you have VERBOSE entry in vm.conf on all media servers with volmgr/debug logs enabled?
All of these, as per Martin's article: https://www-secure.symantec.com/connect/articles/quick-guide-setting-logs-netbackup
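A quick way to audit the VERBOSE question across many media servers is to check each copy of vm.conf for the entry. The sketch below is a minimal Python illustration: the helper name `has_verbose_entry` is hypothetical, and it assumes the entry appears as the single keyword VERBOSE on its own line in uppercase.

```python
def has_verbose_entry(vm_conf_text):
    """Return True if the text contains a line that is exactly VERBOSE.

    Assumes the entry is written in uppercase on a line of its own.
    """
    return any(line.strip() == "VERBOSE" for line in vm_conf_text.splitlines())


if __name__ == "__main__":
    # Usual location on UNIX media servers; adjust if installed elsewhere.
    path = "/usr/openv/volmgr/vm.conf"
    try:
        with open(path) as handle:
            found = has_verbose_entry(handle.read())
        print(f"{path}: VERBOSE {'present' if found else 'missing'}")
    except OSError as err:
        print(f"could not read {path}: {err}")
```

Because the function takes the file contents as a string, the same check could be run over copies of vm.conf collected from all 28 servers.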
11-21-2013 12:50 PM
Marianne,
Not according to Symantec. They do not see this as a networking problem. We've sent all the pertinent logs from both NBU and ACSLS.
Thanks,
Chris
P.S. Do you know anyone hiring?
11-21-2013 03:56 PM
11-26-2013 10:40 AM
We are sharing 52 tape drives to 28 non-dedicated media servers (VM hosts, Oracle RAC, Oracle), for a grand total of 1456 resources. We do use STUGs across the board, but this doesn't seem to help much.
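The 1456 figure is simply servers multiplied by drives. A minimal Python sketch of that arithmetic follows; the function name `shared_drive_paths` is made up for illustration, and it assumes every media server sees every drive through exactly one path.

```python
def shared_drive_paths(media_servers, drives):
    """Server-to-drive pairings when every server shares every drive.

    Assumes one path per server per drive (no multipathing).
    """
    return media_servers * drives


if __name__ == "__main__":
    print(shared_drive_paths(28, 52))  # the counts given in this post
    print(shared_drive_paths(28, 45))  # the counts given in the first post
```

With 28 servers and 52 drives this gives 1456, matching the total above. The first post's 28 servers and 45 drives give 1260 under the same assumption.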
11-26-2013 01:20 PM
11-27-2013 12:29 PM
Here is an example of a wrapper script
Stop media manager processes (stopltid)
mkdir /usr/openv/volmgr/debug (if not already there ...)
11-29-2013 06:43 AM
Would be interesting to see the truss output, but just for the record - any specific reason for the max file descriptor set to 8000?