
SAPNW04 behavior on failover of an Enqueue Replication Server

Roger_Zimmerman
Level 4

We currently have a problem (or at least an unclear situation) with the configuration of an enqueue replication server and an enqueue server in an SAP R/3 installation.

 

Situation:

- a 3-node VCS cluster is installed

- an enqueue instance named ASCS is defined

- an enqueue replication instance named ERS is defined

- ASCS and ERS can switch to any node

- "polling the HA software" as described in the SAP installation guide is not configured; the VCS agent is used instead

- the default configuration runs ASCS on node 2 and ERS on node 3

- default failover scenarios, such as the ASCS following the ERS, work fine

 

The unclear situation now is:

- the preonline script from the SAPNW04 Agent (from 4Q08 agent pack) does not suppress the start of the ERS instance on ASCS node

- failover of ERS and switchover of ERS are possible to the same node as the ASCS is running on

 

Our understanding of the enqueue replication server is that it must not run on the same node as the enqueue server. We saw that if both instances are running on the same node and we switch the ASCS instance away, the lock table is empty after the switch. We also saw that, in the same situation, the lock table is not empty after a switchover of the enqueue replication server.

 

We need information on the following points:

- is it allowed to run the enqueue server and the enqueue replication server on the same node while replication is in use? (Even if it makes little sense: does it destroy data?)

- is the "loss of entries in the lock table" normal when an enqueue server is switched from a node where both are running to another node? Or is it a real loss?

- if it is not allowed, can we configure the enqueue replication instance resource not to come up on an enqueue node without changing the preonline script?

 

From our point of view it is not possible to make the enqueue replication instance resource depend on the enqueue instance resource with an offline local dependency, because of the "hand over the lock table" process: both service groups must live in parallel on the same node for a short time.

 

A case with SAP is open, but their answers have not been very helpful so far; they only point to their own documents, where these questions are not answered.

 

Again, we badly need information.

 

Best regards and many thanks in advance

Roger

Message Edited by Roger Zimmermann on 01-16-2009 06:41 PM

M__Braun
Level 5
Employee

Hi Roger,

 

I guess your main.cf for both SAP Enqueue instances looks similar to the following example (PreOnline enabled for both groups):

 

 

group SAP70-ASCS (
    SystemList = { blade1000 = 0, blade1000-2 = 1, blade1000-3 = 2,
         blade1000-4 = 3 }
   
    AutoStartList = { blade1000 }
    PreOnline = 1
    )

    DiskGroup SAP70-ASCS_dg (
        DiskGroup = sap70scs
        )

    IP SAP70-ASCS_ip (
        Device = qfe0
        Address = "1.1.1.151"
        NetMask = "255.255.255.0"
        )

    Mount SAP70-ASCS_ASCS00_mnt (
        MountPoint = "/usr/sap/W70/ASCS00"
        BlockDevice = "/dev/vx/dsk/sap70scs/ASCS00"
        FSType = vxfs
        FsckOpt = "-y"
        )

    Proxy SAP70-ASCS_NIC_proxy (
        TargetResName = NIC_LAN_nic
        )

    SAPNW04 SAP70-ASCS_ASCS00_sap (
        EnvFile = "/home/w70adm/env.vcs"
        InstName = ASCS00
        InstType = AENQUEUE
        ProcMon = "en ms"
        SAPAdmin = w70adm
        SAPMonHome = "/usr/sap/W70/SYS/exe/run"
        SAPSID = W70
        StartProfile = "/usr/sap/W70/SYS/profile/START_ASCS00_sap70scs"
        )

    requires group SAP70-DB online global soft
    SAP70-ASCS_ip requires SAP70-ASCS_NIC_proxy
    SAP70-ASCS_ASCS00_mnt requires SAP70-ASCS_dg
    SAP70-ASCS_ASCS00_sap requires SAP70-ASCS_ASCS00_mnt
    SAP70-ASCS_ASCS00_sap requires SAP70-ASCS_ip


group SAP70-REP (
    SystemList = { blade1000 = 0, blade1000-2 = 1, blade1000-3 = 2,
         blade1000-4 = 3 }
    AutoStartList = { blade1000-2 }
    PreOnline = 1
    )

    DiskGroup SAP70-REP_dg (
        DiskGroup = sap70rep
        )

    IP SAP70-REP_ip (
        Device = qfe0
        Address = "1.1.1.152"
        NetMask = "255.255.255.0"
        )

    Mount SAP70-REP_REP02_mnt (
        MountPoint = "/usr/sap/W70/REP02"
        BlockDevice = "/dev/vx/dsk/sap70rep/REP02"
        FSType = vxfs
        FsckOpt = "-y"
        )

    Proxy SAP70-REP_NIC_proxy (
        TargetResName = NIC_LAN_nic
        )

    SAPNW04 SAP70-REP_REP02_sap (
        EnqSrvResName = SAP70-ASCS_ASCS00_sap
        EnvFile = "/home/w70adm/env.vcs"
        InstName = REP02
        InstType = AENQREP
        ProcMon = enr
        SAPAdmin = w70adm
        SAPMonHome = "/usr/sap/W70/SYS/exe/run"
        SAPSID = W70
        StartProfile = "/usr/sap/W70/SYS/profile/START_REP02_sap70rep"
        )

    requires group SAP70-DB online global soft
    SAP70-REP_ip requires SAP70-REP_NIC_proxy
    SAP70-REP_REP02_mnt requires SAP70-REP_dg
    SAP70-REP_REP02_sap requires SAP70-REP_REP02_mnt
    SAP70-REP_REP02_sap requires SAP70-REP_ip

> The unclear situation now is:

> - the preonline script from the SAPNW04 Agent (from 4Q08 agent pack) does not suppress the start of the ERS instance on ASCS node

> - failover of ERS and switchover of ERS are possible to the same node as the ASCS is running on

 

What do you mean by start and failover? The preonline logic only applies to "real" failover scenarios recognized by VCS. Manual VCS operations like "Switch to" or online/offline are still allowed, as the assumption is that administrators know what they are doing.

 

Regards

 

Manuel

 

 

 

Roger_Zimmerman
Level 4

Hi, Manuel,

 

Thanks for the answer.

 

Yes, my main.cf looks very similar to the one you posted. Everything runs well as long as we have enough valid nodes for the service groups SAP70-ASCS and SAP70-REP. But when all other nodes have failed and only one node is left, both instances may run on the same node.

 

In other words, when SAP70-ASCS fails over to the last remaining node, SAP70-ERS faults on this node after SAP70-ASCS has started and taken over the lock table (as designed). But when SAP70-ASCS is running on the last remaining node and SAP70-ERS fails over to this node, SAP70-ERS starts on this node in parallel with SAP70-ASCS. Then both instances run on the same node, and this is not suppressed by the standard preonline script delivered with the SAPNW04 agent.

 

I cannot get any information on whether this situation is allowed by the SAP instances themselves or whether it can lead to loss of the lock table and so on.

 

The preonline script runs in every situation, including a manual switchover; preonline is not restricted to "real failover" scenarios. In the manual switchover case, too, the preonline script should suppress the start of the ERS instance on the ASCS node if SAP does not allow it. (But, again, I get no answer from the SAP people :( )

 

I know it is a very rare situation that the cluster melts down to one node, but it is nevertheless a possible one, and we have to handle it correctly too. For that I need to know whether such a "problematic" situation is allowed :)

 

Best regards

Roger

M__Braun
Level 5
Employee

Hi Roger,

 

> In other words, when SAP70-ASCS is failing over to the last remaining node, then SAP70-ERS will fault on this node after SAP70-ASCS is started and has taken over the lock table (as it is designed).

 

So far the behavior is correct. If this node is the last remaining node, SAP70-ERS should fault and then stay offline, as the preonline script should recognize the interdependence via the EnqSrvResName info.

 

> But when SAP70-ASCS is running on the last remaining node and SAP70-ERS is failing over to this node, then SAP70-ERS is starting on this node in parallel of SAP70-ASCS.

 

This should not happen. Did you check that both service groups have PreOnline activated? Does the EnqSrvResName attribute of the Enqueue Replication resource contain the VCS resource name of the Enqueue instance (e.g. SAP70-ASCS_ASCS00_sap)?

 

> And then both Instances run on the same node. And this is not suppressed by the standard preonline script delivered with the SAPNW04 agent.

 

It should suppress the described scenario.

 

> The preonline script is running in ervery situation, also if there is a manual switch over, preonlining is not restricted to "real failover" scenarios.

 

That's true. But the code does the following:

 

$sSys                System where the group is asked to be made online
 $sGrp                Group for which preonline has been invoked
 $sReason             Contains the string MANUAL or FAULT
 $sFaultedNode        Cluster node where the fault occurred due to FAULT

 Returns : boolean value in the set of {0,1} where

           0 false
           1 true
 -------------------------------------------------------------------------

<...>

  #---------------------
  # Validate arguments..
  #---------------------
  if ( $sReason eq 'MANUAL' ) {
    &logInfo ( 20240,
    "<%s> Invoked during a normal switch-over or online. Exiting", $sMyName );
    $iReturnCode = 0; # Do nothing..
  }
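
Read as a decision function, the excerpt above says that a MANUAL invocation is logged and returns 0 without any placement check. A minimal Python sketch of just that branch follows. The function name `preonline_placement_applies` is hypothetical, and the assumption that a FAULT invocation goes on to the node search is taken from the surrounding discussion, since the excerpt elides that part of the script.

```python
def preonline_placement_applies(reason: str) -> bool:
    """Return True when the placement check would run (sketch, not the agent's code)."""
    if reason == "MANUAL":
        # Mirrors the excerpt: normal switch-over or online, nothing special is done.
        return False
    if reason == "FAULT":
        # Assumption: only a fault-driven online reaches the node search.
        return True
    raise ValueError(f"unexpected reason: {reason!r}")


for reason in ("MANUAL", "FAULT"):
    print(reason, preonline_placement_applies(reason))
```

Under this reading, a manual "Switch to" of the replication group onto the enqueue node is never blocked by the script, which matches the earlier remark that manual operations stay allowed.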

 

Could you please provide the main.cf configuration of your Enqueue instances?

 

Regards

 

Manuel

Roger_Zimmerman
Level 4

Hi, Manuel,

 

Sorry for the delay; we were busy going live over the last few days.

 

Yes, PreOnline is enabled for both groups and the EnqSrvResName attribute is set as well. Nevertheless, the failover to the last remaining node takes place: the automatic one, after a failure on the ERS node. And we both think that should be suppressed.

 

O.k., here is a fragment of the main.cf with only the two instances; the rest is rather long...


include "types.cf"
include "Db2udbTypes.cf"
include "SAPNW04Types.cf"

cluster CLUSTER (
    UserNames = { admin = "" }
    ClusterAddress = "11.11.11.11"
    Administrators = { admin }
    UseFence = SCSI3
    )

system node01 ()
system node02 ()
system node03 ()

....

group db2 (
    SystemList = { node01 = 0, node03 = 1 }
    AutoStartList = { node01, node03 }
    )

    Db2udb db2_db2_database (
        DB2InstOwner = db2p11
        DB2InstHome = "/db2/db2p11"
        DatabaseName = P11
        StartUpOpt = ACTIVATEDB
        Encoding = ISO8859-1
        )

....

group sapascs (
    SystemList = { node02 = 0, node03 = 1, node01 = 2 }
    AutoStartList = { node02, node03, node01 }
    PreOnline = 1
    )

....

    SAPNW04 sapascs_sap_ascs (
        EnqSrvResName = sapascs_sap_ascs
        EnvFile = "/home/p11adm/envfile"
        InstName = ASCS11
        InstType = AENQUEUE
        ProcMon = "en ms"
        SAPAdmin = p11adm
        SAPMonHome = "/usr/sap/P11/SYS/exe/run"
        SAPSID = P11
        StartProfile = "/usr/sap/P11/SYS/profile/START_ASCS11_node-ascs"
        )

....

    requires group db2 online global soft

....

group sapers (
    SystemList = { node03 = 0, node02 = 2, node01 = 1 }
    AutoStartList = { node03, node01, node02 }
    PreOnline = 1
    )

....

    SAPNW04 sapers_sap_ers (
        EnqSrvResName = sapascs_sap_ascs
        EnvFile = "/home/p11adm/envfile"
        InstName = ERS22
        InstType = AENQREP
        ProcMon = er
        SAPAdmin = p11adm
        SAPMonHome = "/usr/sap/P11/SYS/exe/run"
        SAPSID = P11
        StartProfile = "/usr/sap/P11/SYS/profile/START_ERS22_node-ers"
        )

....

    requires group sapascs online global soft

 


From my point of view this looks correct. BTW, the agents are from the 4Q08 HA agent disc, so they are quite new.

 

Ahm, I also found this early exit for manual actions, but for automatic actions the suppression should work...

 

What do you think? Is there perhaps a situation where the instance can go online accidentally? (My Perl knowledge is not THAT good... and the modules are deeply nested. I lost track somewhere...)

 

We have now written a wrapper preonline script to get the cluster into production, but it is not standard and not very convenient for the next consultant who has to do maintenance or reconfiguration. I am not very comfortable with this solution, although it works.

 

But the most important question (whether SAP allows it, even if it makes no sense and the agent preonline script should prevent it) is still open. So far SAP has not made a clear statement (always: it should not... etc. etc.) on whether it is possible as far as SAP's own processing is concerned. Strange... And maybe I am worrying for nothing...

 

Best regards

a worried Roger

M__Braun
Level 5
Employee

Hi Roger,

 

SAPNW04 sapascs_sap_ascs (
EnqSrvResName = sapascs_sap_ascs

 

It seems you also set the EnqSrvResName attribute on the ASCS instance. This error screws up the preonline node search algorithm. Here is the EnqSrvResName attribute description from the Agent ICG:

 

EnqSrvResName

The name of the standalone ENQUEUE server cluster resource. This attribute is used only by an Enqueue Replication Server. Using this attribute the replication server queries the ENQUEUE server resource state while searching for a fail over target and vice a versa.
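
The rule in this description can be checked mechanically. The following Python sketch assumes the resource attributes have already been extracted from main.cf into a dictionary; the function name `check_enq_srv_res_name` and the dictionary layout are hypothetical and not part of the agent.

```python
def check_enq_srv_res_name(resources):
    """Return a list of problems with EnqSrvResName usage (illustrative sketch)."""
    problems = []
    for name, attrs in resources.items():
        inst_type = attrs.get("InstType")
        target = attrs.get("EnqSrvResName")
        if inst_type == "AENQREP":
            # A replication server must name its enqueue server resource.
            if not target:
                problems.append(f"{name}: AENQREP resource without EnqSrvResName")
            elif resources.get(target, {}).get("InstType") != "AENQUEUE":
                problems.append(
                    f"{name}: EnqSrvResName {target!r} is not an AENQUEUE resource"
                )
        elif target:
            # Per the description, only a replication server uses the attribute.
            problems.append(
                f"{name}: EnqSrvResName set although InstType is {inst_type}"
            )
    return problems


# Attribute values as posted earlier in this thread.
resources = {
    "sapascs_sap_ascs": {"InstType": "AENQUEUE", "EnqSrvResName": "sapascs_sap_ascs"},
    "sapers_sap_ers": {"InstType": "AENQREP", "EnqSrvResName": "sapascs_sap_ascs"},
}
problems = check_enq_srv_res_name(resources)
for problem in problems:
    print(problem)
```

With the values from the posted fragment, the check flags only the ASCS resource; once the attribute is removed there, the list of problems is empty.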

 

> But, the most important question (if it is allowed by SAP, even it makes no sense and the agent preonline script should avoid it) is not clear.

 

Sorry, I cannot speak for SAP. But I have never experienced an error when the EnqRep instance started on the Enqueue node, as the replica enqueue table is in a separate shared memory segment.

 

Regards

 

Manuel

Roger_Zimmerman
Level 4

Hi, Manuel,

 

sorry for the delay, was busy with some other installation.

 

Two things. First, it may well be that the EnqSrvResName attribute should not be set in the ASCS resource; I will correct this. On the other hand, I was not able to find the code fragment that goes wrong because of it (it would be very nice if you could give me a hint where to look). Anyway, I will remove this entry from the customer's configuration as soon as possible (the next time I get on the machines; until then the workaround has to do the job). I think we will have to test this again afterwards, so we have to wait for a planned downtime.

 

Second, I finally got a message from SAP:

 

"the situation you described does not bare any problem at all. The case the replication server is running on the same host as the enqueue server is uncritical as well as the situation that there is no replication server running at all. 

Both situation off course only offer no high availability, but both do not bear the risk of lossing the enqueue table or any other problem."

 

Sounds good. So we really worried for nothing.

 

Best regards

Roger

M__Braun
Level 5
Employee

Hi Roger,

 

> First it is possible that the attribute EnqSrvResName in the ASCS ressource should not be set, I will correct this. But on the other hand I was not able to find the code fragment what is going wild from this

 

The described Agent requirement for the EnqSrvResName attribute is not negotiable. ;)

 

In summary, the agent relies on a strict pair relationship between ENQ and ENQREP. Using the attribute in the wrong way will screw up the node determination algorithm, which is used for ENQ failover (locating the corresponding ENQREP) and ENQREP failover (locating only nodes that are not running the matching ENQ).
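
As a rough illustration of that pairing rule, here is a Python sketch of the two searches. It is not the agent's Perl code: the function names, the candidate list given in priority order, and the fallback to the first candidate when the ENQREP node is unavailable are all assumptions made for the example.

```python
def enq_failover_target(candidates, enqrep_node):
    """ENQ failover: go to the node of the matching ENQREP if it is a candidate."""
    if enqrep_node in candidates:
        return enqrep_node
    # Assumption: without a usable ENQREP node, take the first candidate.
    return candidates[0] if candidates else None


def enqrep_failover_target(candidates, enq_node):
    """ENQREP failover: only nodes not running the matching ENQ are eligible."""
    eligible = [node for node in candidates if node != enq_node]
    # None means no eligible node is left, so the ENQREP group stays offline.
    return eligible[0] if eligible else None


print(enq_failover_target(["node03", "node01"], enqrep_node="node03"))
print(enqrep_failover_target(["node01", "node02"], enq_node="node02"))
print(enqrep_failover_target(["node02"], enq_node="node02"))
```

The last call models the single-surviving-node case discussed in this thread: with the ENQ already on the only remaining node, the search finds no target for the ENQREP.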

 

> Both situation off course only offer no high availability, but both do not bear the risk of lossing the enqueue table or any other problem."

 

Fine.

 

Regards

 

Manuel