
cgoliver · ‎11-04-2013

We have 28 media servers with 45 drives total, all run through SSO.

All media servers are seeing a loss of robotic communication, at seemingly random times.

The most recent changes, introduced in an attempt to address this problem, are as follows:

1. One large tape SAN zone (a single zone with 28 initiators and 45 targets) was changed to single-initiator/target zones.

2. project.max-file-descriptor=(priv,8000,deny) was added to the project file for NetBackup.

The media servers are a mix of HP-UX and RHEL.

Network interface monitoring indicates no packet loss or collisions.

Sample messages from syslog:

Nov 4 16:17:26 $MASTER_SERVER acsd[16629]: [ID 157546 daemon.error] ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 16:19:34 $MASTER_SERVER acsd[16629]: [ID 964518 daemon.notice] ACS(0) going to UP state

Nov 4 15:26:33 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 4 15:28:41 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to UP state

Nov 3 19:21:31 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov 3 19:24:01 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to UP state

Those messages represent the last three ACSD down/up sequences on three different media servers.
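To compare outages across servers, the DOWN/UP pairs above can be pulled out of syslog and timed. This is a minimal sketch, assuming the standard syslog field order shown in the samples (month, day, time, host, process); the function name acs_flaps is made up for illustration, and the optional Solaris "[ID ...]" tag is simply ignored.

```shell
#!/bin/sh
# Summarize acsd DOWN/UP transitions from syslog text read on stdin.
# Prints one line per DOWN/UP pair: host, DOWN time, UP time, outage length.
# Assumes field 3 is HH:MM:SS and field 4 is the host, as in the samples.
acs_flaps() {
    awk '
    /acsd\[[0-9]+\]:.*going to DOWN state/ { down[$4] = $3; next }
    /acsd\[[0-9]+\]:.*going to UP state/ {
        if ($4 in down) {
            split(down[$4], a, ":"); split($3, b, ":")
            secs = (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
            if (secs < 0) secs += 86400   # outage crossed midnight
            printf "%s down %s up %s outage %ds\n", $4, down[$4], $3, secs
            delete down[$4]
        }
    }'
}
```

Usage would be something like "acs_flaps < /var/adm/messages" (the log path depends on the OS). Run against the samples above, the first two outages come out at 128 seconds each and the third at 150 seconds.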

Any pointers would be much appreciated. The biggest recommendation from most of those involved to date is, "My God, you have 1224 LUN paths in an SSO environment."

Experienced engineers at this location believe that this arrangement of resources is acceptable and has worked in the past.

Cheers,

Chris Oliver

Marianne · ‎11-07-2013

ACSD going up and down is all about network connectivity to the ACSLS server - nothing to do with SAN/SSO config.

You will need to check logs on ACSLS server to see if all is well over there.

I would start with acsss_stats.log and acsss_event.log.

Handy NetBackup Links

cgoliver · ‎11-21-2013

Marianne,

acsss_stats is null and the event log has never shown a loss of connectivity.

We are seeing many SCSI reservation conflicts from avrd, which we think is the root of the problem.

We are running multiple OS types and versions, which complicates some of the configuration.

Thanks,

Chris

Marianne · ‎11-21-2013

I still feel ACS up/down is a network comms issue...

Important to have ALL logs (OS, NBU and ACSLS) available when issue is seen.

Do you have VERBOSE entry in vm.conf on all media servers with volmgr/debug logs enabled?
All of these (as per Martin's article: https://www-secure.symantec.com/connect/articles/quick-guide-setting-logs-netbackup ):

mkdir /usr/openv/volmgr/debug/acsssi

mkdir /usr/openv/volmgr/debug/acsd

mkdir /usr/openv/volmgr/debug/robots

mkdir /usr/openv/volmgr/debug/daemon

mkdir /usr/openv/volmgr/debug/ltid

mkdir /usr/openv/volmgr/debug/oprd

mkdir /usr/openv/volmgr/debug/reqlib

mkdir /usr/openv/volmgr/debug/tpcommand

I remember reqlib being helpful when I had to troubleshoot ACSLS issues in the past...
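The directory list above can be created in one pass on each media server. A minimal sketch: the function name and its optional base-directory argument are hypothetical (the argument is there only so the loop can be tried against a scratch directory first), and the ACS SSI directory is spelled acsssi here to match the daemon name, which is worth checking against the logging guide.

```shell
#!/bin/sh
# Create the Media Manager debug log directories from the list above.
# $1 (optional, hypothetical) overrides the base directory for a dry run.
make_volmgr_debug_dirs() {
    base=${1:-/usr/openv/volmgr/debug}
    for d in acsssi acsd robots daemon ltid oprd reqlib tpcommand; do
        # -p also creates the base directory if it is missing
        mkdir -p "$base/$d" || return 1
    done
}
```

Calling make_volmgr_debug_dirs with no argument targets /usr/openv/volmgr/debug; the VERBOSE entry in vm.conf still has to be added separately.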


cgoliver · ‎11-21-2013

Marianne,

Not according to Symantec. They do not see this as a networking problem. We've sent all the pertinent logs from both NBU and ACSLS.

Thanks,

Chris

P.S. Do you know anyone hiring?

mph999 · ‎11-21-2013

I'd consider a truss on acsd, probably via a wrapper script if it goes down very quickly after being started. It might give some clues.

I'd need to check, but I thought acsssi communicated with the ACS server and acsd was the equivalent of tldd. So, for example:

tldd > tldcd > library
acsd > acsssi > ACS server

... which would mean acsd is one step away from the network.

Am I right in understanding that you have the same 28 drives shared across all 45 media servers? Guess what I am thinking ... I can't remember the numbers offhand, but my colleague will remind me of the number of drives (x) that should be shared between (y) media servers, e.g. the max value of (drives shared x media servers shared across).

cgoliver · ‎11-26-2013

We are sharing 52 tape drives to 28 non-dedicated media servers (VM hosts, Oracle RAC, Oracle), for a grand total of 1456 resources. We do use STUGs across the board, but this doesn't seem to help much.

mph999 · ‎11-26-2013

The max number of resources for a 'set' of drives is suggested as 150 - that was the value I couldn't remember before. I don't know of anywhere this value is documented, but I heard it from my colleague, who was told by Engineering. So, for example, 10 drives shared across 15 media servers.

Would this cause acsd to go down? It's not really connected, well, not directly.

For STUGs, there should be no more than 5 STUs in each. This is documented, I think, in one of the best practice guides.

You haven't mentioned how the call is progressing, but I would suggest a wrapper script that takes the place of acsd and sticks a truss in front of it - I'll dig out an example later.

Martin
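For reference, the resource count being discussed is just drives multiplied by media servers. A small sketch of the arithmetic: the function names are made up, and the 150 default is the undocumented, second-hand guideline quoted above rather than an official limit.

```shell
#!/bin/sh
# Shared-drive "resources" = drives shared x media servers sharing them.
shared_resources() {
    # $1 = number of shared drives, $2 = number of media servers
    echo $(( $1 * $2 ))
}

# Succeeds (exit 0) when the count exceeds the guideline.
# $3 is optional; 150 is the hearsay figure from this thread.
over_guideline() {
    limit=${3:-150}
    [ "$(shared_resources "$1" "$2")" -gt "$limit" ]
}
```

With the figures in this thread, "shared_resources 52 28" gives 1456, roughly ten times the suggested value, while the 10-drive, 15-server example gives exactly 150.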

mph999 · ‎11-27-2013

Here is an example of a wrapper script

Stop media manager processes (stopltid)

mkdir /usr/openv/volmgr/debug (if not already there ...)

mv /usr/openv/volmgr/bin/acsd /usr/openv/volmgr/bin/acsd.bin

vi /usr/openv/volmgr/bin/acsd (new file; this should contain ...)

#!/usr/bin/sh

/usr/bin/truss -l -f -a -d -o /usr/openv/volmgr/debug/truss.`date +%m%d%y:%H:%M:%S` /usr/openv/volmgr/bin/acsd.bin "$@"

Save the file

Then set the permissions on the new wrapper.

chmod 555 /usr/openv/volmgr/bin/acsd

Start ltid (ltid -v)
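truss is a Solaris tool, so the wrapper above only works as written on a Solaris host. Since the media servers here are HP-UX and RHEL, a hedged variant could pick the tracer by OS. In this sketch the function name tracer_prefix is hypothetical, the truss flags are copied from the example above, and strace with -f (follow children), -tt (timestamps) and -o (output file) is assumed to be installed on the RHEL hosts. HP-UX is deliberately left unhandled here.

```shell
#!/bin/sh
# Print the trace command prefix for a given OS name and output file.
# $1 = OS name as reported by "uname -s", $2 = trace output file.
tracer_prefix() {
    os=$1 out=$2
    case $os in
        SunOS) echo "/usr/bin/truss -l -f -a -d -o $out" ;;
        Linux) echo "strace -f -tt -o $out" ;;
        *)     echo "no tracer chosen for $os" >&2; return 1 ;;
    esac
}
```

Inside the wrapper, the result would be placed in front of the renamed binary, for example: $(tracer_prefix "$(uname -s)" /usr/openv/volmgr/debug/trace.out) /usr/openv/volmgr/bin/acsd.bin "$@"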

bp_kill_all · ‎11-29-2013

Would be interesting to see the truss output, but just for the record - any specific reason for the max file descriptor set to 8000?


ACSD process going down/up