
ACSD process going down/up

cgoliver
Level 5

We have 28 media servers with 45 drives total, all run through SSO.

All media servers are seeing a loss of robotic communication, at seemingly random times.

The most recent changes, introduced in an attempt to address this problem, are as follows:

 

1. One large tape SAN (single zone with 28 initiators and 45 targets), changed to single initiator, target zones.

2. project.max-file-descriptor=(priv,8000,deny) added to the project file for NetBackup.

 

The media servers are a mix of HP-UX and RHEL.

Network interface monitoring indicates no packet loss or collisions.

 

Sample messages from syslog:

 

Nov  4 16:17:26 $MASTER_SERVER acsd[16629]: [ID 157546 daemon.error] ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov  4 16:19:34 $MASTER_SERVER acsd[16629]: [ID 964518 daemon.notice] ACS(0) going to UP state

 

Nov  4 15:26:33 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov  4 15:28:41 $MEDIA_SERVER1 acsd[21215]: ACS(0) going to UP state
 

Nov  3 19:21:31 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to DOWN state, status: Timeout waiting for robotic command
Nov  3 19:24:01 $MEDIA_SERVER2 acsd[25538]: ACS(0) going to UP state

 

Those messages represent the last three ACSD down/up sequences on three different media servers.
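For anyone wanting to compare outage lengths across servers, here is a minimal sketch that pairs each DOWN line with the next UP line from the same host and prints the gap in seconds. The function name acsd_outages is made up for illustration, and it assumes both lines of a pair fall on the same calendar day.

```shell
#!/bin/sh
# acsd_outages (hypothetical helper): read syslog lines on stdin and print
# "<host> <down-time> <seconds>" for each acsd DOWN/UP pair.
# Assumption: the DOWN and UP lines of a pair share the same calendar day.
acsd_outages() {
    awk '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
    # Field 3 is the time, field 4 the host name.
    /acsd\[[0-9]+\]:.*going to DOWN state/ { down[$4] = $3 }
    /acsd\[[0-9]+\]:.*going to UP state/ {
        if ($4 in down) {
            print $4, down[$4], secs($3) - secs(down[$4])
            delete down[$4]
        }
    }'
}
```

Usage would be something like acsd_outages < /var/log/messages on RHEL (the log path differs per OS; on Solaris use nawk or /usr/xpg4/bin/awk, since the old awk has no user functions). Run against the six sample lines above, the first two pairs work out to 128 seconds each and the third to 150 seconds.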

 

Any pointers would be much appreciated. The biggest recommendation from those most involved to date is, "My God, you have 1224 LUN paths in an SSO environment."

Experienced engineers at this location believe that this arrangement of resources is acceptable and has worked in the past.


Cheers,

Chris Oliver

 

 

 

 

9 REPLIES 9

Marianne
Moderator
Partner    VIP    Accredited Certified

ACSD going up and down is all about network connectivity to the ACSLS server - nothing to do with SAN/SSO config.

You will need to check logs on ACSLS server to see if all is well over there.

I would start with acsss_stats.log and acsss_event.log.

 

cgoliver
Level 5

Marianne,

 

acsss_stats.log is empty and the event log has never shown a loss of connectivity.

We are seeing many SCSI reservation conflicts from avrd, which we think is the root of the problem.

 

We are running multiple OS types and versions, which complicates some of the configuration.

 

Thanks,

Chris

 

Marianne
Moderator
Partner    VIP    Accredited Certified

I still feel ACS up/down is a network comms issue...

Important to have ALL logs (OS, NBU and ACSLS) available when issue is seen.

Do you have VERBOSE entry in vm.conf on all media servers with volmgr/debug logs enabled?
All of these (as per Martin's article: https://www-secure.symantec.com/connect/articles/quick-guide-setting-logs-netbackup ):

mkdir /usr/openv/volmgr/debug/acsssi
mkdir /usr/openv/volmgr/debug/acsd
mkdir /usr/openv/volmgr/debug/robots
mkdir /usr/openv/volmgr/debug/daemon
mkdir /usr/openv/volmgr/debug/ltid
mkdir /usr/openv/volmgr/debug/oprd
mkdir /usr/openv/volmgr/debug/reqlib
mkdir /usr/openv/volmgr/debug/tpcommand
 
I remember reqlib being helpful when I had to troubleshoot ACSLS issues in the past...
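A rough sketch for checking a media server against the list above before the next outage. The function name check_volmgr_logging is invented for this example. The directory names follow the mkdir list, with the ACS SSI directory spelled acsssi after the process name; treat that spelling as an assumption and verify it against the logging article.

```shell
#!/bin/sh
# check_volmgr_logging (hypothetical helper): report missing legacy debug
# directories and a missing VERBOSE entry in vm.conf.
# Optional argument: the volmgr base directory (default /usr/openv/volmgr).
check_volmgr_logging() {
    base=${1:-/usr/openv/volmgr}
    for d in acsssi acsd robots daemon ltid oprd reqlib tpcommand; do
        [ -d "$base/debug/$d" ] || echo "missing: $base/debug/$d"
    done
    # VERBOSE must be present in vm.conf for the extra detail to be logged.
    grep -q '^VERBOSE' "$base/vm.conf" 2>/dev/null ||
        echo "no VERBOSE entry in $base/vm.conf"
}
```

Running check_volmgr_logging with no arguments on each media server prints nothing when everything is in place.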

cgoliver
Level 5

Marianne,

Not according to Symantec. They do not see this as a networking problem. We've sent all the pertinent logs from both NBU and ACSLS.

 

Thanks,

Chris

P.S. Do you know anyone hiring?

mph999
Level 6
Employee Accredited
I'd consider a truss on acsd, probably via a wrapper script if it goes down really quickly after being started. It might give some clues.

I'd need to check, but I thought acsssi communicated with the ACS server and acsd was like the equivalent of tldd. So, e.g.:

tldd > tldcd > library
acsd > acsssi > ACS server

... which would mean acsd is one step away from the network.

Am I right in understanding you have the same 28 drives shared across all 45 media servers? Guess what I am thinking ... I can't remember the numbers offhand, but my colleague will remind me of the number of drives (x) that should be shared between (y) media servers, e.g. a max value of (drives shared x media servers shared across).

cgoliver
Level 5

We are sharing 52 tape drives to 28 non-dedicated media servers (VM hosts, Oracle RAC, Oracle), for a grand total of 1456 resources. We do use STUGs across the board, but this doesn't seem to help much.

mph999
Level 6
Employee Accredited
The max number of resources for a 'set' of drives is suggested as 150; that was the value I couldn't remember before. I don't know of anywhere this value is documented, but I heard it from my colleague, who was told by Engineering. So, for example, 10 drives shared across 15 media servers.

Would this cause acsd to go down? It's not really connected, well, not directly.

For STUGs, there should be no more than 5 STUs in each. This is documented, I think, in one of the best practice guides.

You haven't mentioned how the call is progressing, but I would suggest a wrapper script that takes the place of acsd and sticks a truss in front of it. I'll dig out an example later.

Martin
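To put numbers on that, a small sketch of the drives x media servers arithmetic. The function name sso_check is made up, and the default limit of 150 is the suggested (undocumented) figure from this post, not an enforced product limit.

```shell
#!/bin/sh
# sso_check DRIVES SERVERS [LIMIT] (hypothetical helper)
# Multiplies shared drives by media servers and compares the result with
# the suggested maximum (default 150, per the post above).
sso_check() {
    limit=${3:-150}
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$limit" ]; then
        echo "$total resources: over the suggested $limit"
    else
        echo "$total resources: within the suggested $limit"
    fi
}
```

sso_check 10 15 gives 150, which sits right on the suggested figure, while sso_check 52 28 gives 1456, roughly ten times over it.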

mph999
Level 6
Employee Accredited

Here is an example of a wrapper script

Stop media manager processes (stopltid)

mkdir /usr/openv/volmgr/debug (if not already there ...)

mv /usr/openv/volmgr/bin/acsd /usr/openv/volmgr/bin/acsd.bin
vi /usr/openv/volmgr/bin/acsd (new file, this should contain ...)
 
#!/usr/bin/sh
# Run the real acsd binary under truss, writing one timestamped trace file per start
/usr/bin/truss -l -f -a -d -o /usr/openv/volmgr/debug/truss.`date +%m%d%y:%H:%M:%S` /usr/openv/volmgr/bin/acsd.bin "$@"
 
Save the file.
Then set the permissions:
chmod 555 /usr/openv/volmgr/bin/acsd
Start ltid (ltid -v)
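truss is the Solaris tool, and the media servers in this thread are HP-UX and RHEL, so a wrapper there would need a different tracer. Below is a hedged sketch of picking one by OS name. The function name tracer_prefix is invented; strace (Linux) and tusc (HP-UX) are assumed to be installed and on the PATH, which is not a given, and the flag sets are only a rough equivalent of the truss options above.

```shell
#!/bin/sh
# tracer_prefix OS OUTFILE (hypothetical helper)
# Prints the trace command to put in front of acsd.bin for the given OS
# name (as reported by uname -s). Returns non-zero for an unknown OS.
tracer_prefix() {
    os=${1:-$(uname -s)}
    out=$2
    case "$os" in
        SunOS) echo "truss -l -f -a -d -o $out" ;;   # as in the wrapper above
        Linux) echo "strace -f -tt -o $out" ;;       # follow forks, timestamps
        HP-UX) echo "tusc -f -o $out" ;;             # assumes tusc is installed
        *)     return 1 ;;
    esac
}
```

Inside the wrapper this could be used as: $(tracer_prefix "$(uname -s)" "$trace_file") /usr/openv/volmgr/bin/acsd.bin "$@", where trace_file is whatever path you want the trace written to.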
 
 

bp_kill_all
Level 3
Partner Accredited

Would be interesting to see the truss output, but just for the record - any specific reason for the max file descriptor set to 8000?