cnewtonne · ‎05-07-2012

I am running into a situation where a 3-node VCS cluster ...
- contains 2 Oracle databases. Node 1 runs a db with a SID of 'love', which I cloned onto node 3, also using a SID of 'love'.
- Even though the clone db on node 3 is NOT under VCS control, the instance keeps terminating. We found out from the agent logs that VCS is terminating it.

....
Agent is calling clean for resource(ORA_****
....

This link http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vcs_oracle_agent.pdf talks about "Best practices for multiple Oracle instance configurations in a VCS environment", but it does NOT mention whether duplicate SIDs are allowed. It is either too obvious to mention or it can be done.

My questions are ...

- Will a VCS cluster allow 2 databases with identical SIDs on separate nodes? I do not know if such
- Why is VCS terminating the cloned instance even though this clone db is NOT under its control?

Any insights will be much appreciated.

Thank you.

Gaurav_S · ‎05-07-2012

Hello,

to answer

- Will a VCS cluster allow 2 databases with identical SIDs on separate nodes? I do not know if such

---- VCS will not allow this. In parallel clustering, you can run the same instance with different instance IDs, for example as in Oracle RAC.

- Why is VCS terminating the cloned instance even though this clone db is NOT under its control?

--- Can you share the config file and the logs from the time when the error is received?

Also, what OS version and VCS version are you using?

Gaurav

cnewtonne · ‎05-07-2012

Thanks for the reply ...

Let me be more exact in stating my question. Can you have 2 Oracle databases within a VCS cluster where both use an IDENTICAL value for $ORACLE_SID?

This is part of the agent log showing termination of instance.

2012/05/05 22:55:12 VCS ERROR V-16-2-13064 (######) Agent is calling clean for resource(VCS_RG) because the resource is up even after offline completed.
2012/05/05 22:55:12 VCS ERROR V-16-20002-1 (######) Oracle:VCS_RG:clean:Oracle home directory /ora_home/.../11.2.0 does not exist
2012/05/05 22:55:12 VCS ERROR V-16-20002-2 (######) Oracle:VCS_RG:clean:sqlplus/svrmgrl not found in /ora_home/.../11.2.0/bin
2012/05/05 22:55:12 VCS WARNING V-16-20002-8 (######) Oracle:VCS_RG:clean: File /ora_home/.../my_db.env is not a valid text file
2012/05/05 22:55:12 VCS NOTICE V-16-20002-28 (######) Oracle:VCS_RG:clean:shutdown abort left processes (26269, 26271, 26275, 26277, 26279, 26281, 26283, 26285, 26287, 26289, 26291, 26293, 26295, 26297, 26299, 26301, 26303, 26305, 26307, 26309, 26311, 26313, 26315, 26317, 26355, 26357, 26359, 26361)
2012/05/05 22:55:12 VCS NOTICE V-16-20002-26 (######) Oracle:VCS_RG:clean:Oracle(my_db) kill TERM 26269, 26271, 26275, 26277, 26279, 26281, 26283, 26285, 26287, 26289, 26291, 26293, 26295, 26297, 26299, 26301, 26303, 26305, 26307, 26309, 26311, 26313, 26315, 26317, 26355, 26357, 26359, 26361
2012/05/05 22:55:22 VCS NOTICE V-16-20002-27 (######) Oracle:VCS_RG:clean:Oracle(my_db) kill KILL 26269, 26271, 26275, 26277, 26279, 26281, 26283, 26285, 26287, 26289, 26291, 26293, 26295, 26297, 26299, 26301, 26303, 26305, 26307, 26309, 26311, 26313, 26315, 26317, 26355, 26357, 26359, 26361
2012/05/05 22:55:43 VCS INFO V-16-2-13001 (######) Resource(VCS_RG): Output of the completed operation (clean)
Use of uninitialized value $DBAMGR in string ne at /opt/VRTSagents/ha/bin/Oracle/clean line 167.
2012/05/05 22:55:43 VCS INFO V-16-2-13068 (######) Resource(VCS_RG) - clean completed successfully.
2012/05/05 22:55:44 VCS INFO V-16-1-10305 Resource VCS_RG (Owner: unknown, Group: SG_my_db) is offline on ###### (VCS initiated)
2012/05/05 22:55:44 VCS NOTICE V-16-1-10446 Group SG_my_db is offline on system ######
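
For anyone wanting to scan a longer agent log for the same pattern, here is a minimal sketch that pulls out the "Agent is calling clean" messages. It assumes the line layout seen in the excerpt above (date, time, `VCS`, severity, message ID, text). The function name `find_clean_calls` is hypothetical, and the regular expression is inferred from this excerpt rather than from any official log grammar.

```python
import re

# Layout inferred from the excerpt above (an assumption, not an official format):
# "<date> <time> VCS <SEVERITY> <V-message-id> <message text>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"VCS (?P<severity>[A-Z]+) (?P<msg_id>V-\d+-\d+-\d+) (?P<text>.*)$"
)
CLEAN_RE = re.compile(r"Agent is calling clean for resource\((?P<res>[^)]+)\)")


def find_clean_calls(lines):
    """Return (timestamp, resource) pairs for every 'calling clean' message."""
    hits = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # e.g. raw output of the agent script, not a VCS log line
        called = CLEAN_RE.search(parsed.group("text"))
        if called:
            stamp = f"{parsed.group('date')} {parsed.group('time')}"
            hits.append((stamp, called.group("res")))
    return hits


sample = [
    "2012/05/05 22:55:12 VCS ERROR V-16-2-13064 (######) Agent is calling clean "
    "for resource(VCS_RG) because the resource is up even after offline completed.",
    "2012/05/05 22:55:43 VCS INFO V-16-2-13068 (######) Resource(VCS_RG) - clean completed successfully.",
]
print(find_clean_calls(sample))
```

Each hit gives the time at which the agent decided to clean the resource, which is the moment to compare against when the clone instance died.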

mikebounds · ‎05-07-2012

This should work if you use the FireDrill attribute, but I think you will have to put the clone database under VCS control.

The FireDrill attribute is used for the sort of thing you are doing. Let me explain further how this attribute is normally used:

If you have a failover Oracle service group which runs on node 1 and the storage is replicated to node 3, then while Oracle is online on node 1, VCS still checks (by default every 5 minutes, as determined by OfflineMonitorInterval) whether Oracle is online on node 3. If it is, VCS offlines Oracle there, because this is a concurrency violation: oracle_sg is a FAILOVER service group, so it is not allowed to be online on both nodes. I think this is what you are seeing, so you should see a "Concurrency violation" error message in your logs.

When replicating Oracle, some users want to run a fire drill, which means taking a clone of the replicated data on node 3 and bringing up Oracle on this clone to check that Oracle works on the "DR" node. For this, the FireDrill attribute is set to 1 on the Oracle type (hatype -modify Oracle FireDrill 1) and a second Oracle "clone" service group is created to online the clone Oracle database. If you online the clone using the "clone" service group, the concurrency violation does not occur, but I think that if you online the clone database outside of VCS control, the concurrency violation will still occur.

So if you create a separate clone service group with a SystemList of node3 and set the Oracle FireDrill attribute, then I think this should work.

You may also want to look at Veritas Storage Checkpoints, as this is an automated process where the Veritas software clones the database for you, changes the SID and starts it on the same node. See the "Using Database Storage Checkpoints and Storage Rollback" section in https://sort.symantec.com/public/documents/sfha/5.1sp1/solaris/productguides/html/sf_adv_ora/

Mike

Satish_K__Pagar · ‎05-07-2012

The resource will be OFFLINE'd in VCS and cleaned on the 3rd VCS node only in case of a concurrency violation. Please check your logs to see if there was a concurrency violation. If you don't want to put the cloned DB on the 3rd node under VCS control, you must specify a SystemList for the Oracle service group that is limited to the 1st and 2nd VCS nodes only.

e.g. # hagrp -modify SG_my_db SystemList node1 0 node2 1

So in short you may run two different Oracle resources with identical SIDs in VCS, provided you appropriately modify the SystemList for each of those groups.
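
As an illustration of that rule, here is a minimal sketch that checks whether two groups using the same SID could ever land on the same node. It only compares node lists that you type in by hand and does not talk to VCS. The function name `sid_conflicts` and the group name `SG_clone` are hypothetical, while `SG_my_db` and the SID `love` come from the posts above.

```python
from collections import defaultdict


def sid_conflicts(groups):
    """Find nodes where more than one group with the same SID may run.

    groups: iterable of (group_name, oracle_sid, system_list_nodes).
    Returns {(sid, node): [group names]} for every overlap found.
    """
    seen = defaultdict(list)
    for name, sid, nodes in groups:
        for node in nodes:
            seen[(sid, node)].append(name)
    return {key: names for key, names in seen.items() if len(names) > 1}


# SystemList limited as suggested above: no node is shared, so no conflict.
separated = [
    ("SG_my_db", "love", ["node1", "node2"]),
    ("SG_clone", "love", ["node3"]),  # hypothetical clone group
]
print(sid_conflicts(separated))

# Original layout: node3 is in both lists, so the two 'love' instances can collide.
shared = [
    ("SG_my_db", "love", ["node1", "node2", "node3"]),
    ("SG_clone", "love", ["node3"]),
]
print(sid_conflicts(shared))
```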

arangari · ‎05-07-2012

The 'FireDrill' concept is used only for 'Disaster Recovery' solutions. Although it is suggested here, I feel it is overkill. Also, one shouldn't rely on this mechanism, as this solution may stop working in future if the product changes in relation to FireDrill.

mikebounds · ‎05-08-2012

Amit, can you please elaborate on your response, as this is very worrying. In a typical firedrill you:

1. Replicate data to a node in a cluster (RDC) or across clusters (GCO)
2. Take a snapshot (clone) of the data using a VCS snapshot application
3. Bring up an identical database in VCS and use FireDrill to stop concurrency violations

What this discussion is trying to achieve is nearly identical, except that the data is not being replicated. The agents I have seen that take snapshots do not rely on replication (for example, RVGSnapshot uses vxsnap as opposed to "vxrvg snapshot"), but in any case the firedrill concept would still work without VCS taking the clone, as it is the FireDrill attribute that stops the concurrency violations.

My suggested solution only requires the "FireDrill" attribute, as clones can be taken manually without having to use a VCS agent. This attribute was brought out 8 years ago in VCS 4.0 and has not changed since, and although I have not used 6.0, from the VCS 6.0 Admin guide it still looks the same. I have used the FireDrill attribute for many customers, so if Symantec is planning to remove this attribute or change how it works, then please let me know, as I am still using it in new clusters I am implementing.

Mike

arangari · ‎05-08-2012

Mike,

Although the use of the FireDrill attribute will achieve what the user is planning to do here, the solution given by Satish P should also be sufficient.

The 'FireDrill' attribute is always thought of in relation to DR (RDC/GCO), and the connotation is very strong there. The attribute has not changed in 8 years and may not change in future. However, that does not mean I should use it for my normal operations. Here the user has not yet given a clear picture of his use case.

Please read my comments only in context of question raised by user and not as our product roadmap.

Regards,

Amit Rangari

mikebounds · ‎05-08-2012

I have just done a short test, and you do NOT need to put the clone database under VCS control to use the FireDrill attribute. All I did in the test was create a resource, online it outside of VCS control and then probe the resource, and as expected VCS detected the resource as online. I then repeated this after setting the FireDrill attribute to 1 for the resource type, and VCS did not detect that the resource was online.

Therefore to achieve what you are doing you simply need to run:

hatype -modify Oracle FireDrill 1

You could set this attribute on and off every time you start and stop the clone database, or you could just leave it set to 1 all the time if you never want VCS to shut down a database that is started outside of VCS control while an identical database is already online on another node.

I would not recommend removing node3 from the SystemList, as this reduces the high availability of your database: if node2 is not available, your database will stay down if you lose node1.

However, you need to make sure that the real Oracle database does not try to run on the same node as the clone database, as this will likely cause problems. So you want node3 to be the last node chosen when failing over, and if node3 is the only available node while the clone DB is running, then either the clone needs to be shut down first or the real database must be prevented from going online.

You can use a preonline script to shut down the clone database, or put the clone DB under VCS control and use an "offline local" service group dependency.
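
The result of that test can be pictured with a small toy model. This is only a sketch of the behaviour as described in this thread (a failover resource seen online on a second node gets taken offline there, unless FireDrill is set to 1 on its type). The function `nodes_to_offline` and its arguments are hypothetical and say nothing about how VCS implements the check internally.

```python
def nodes_to_offline(active_node, nodes_seen_online, fire_drill=False):
    """Toy rule: on which nodes would the resource be taken offline?

    active_node: node where the failover group is legitimately online.
    nodes_seen_online: nodes where monitoring finds the resource running.
    fire_drill: True models FireDrill = 1 on the resource type.
    """
    if fire_drill:
        # As observed in the test above: no concurrency violation is raised.
        return []
    return sorted(n for n in nodes_seen_online if n != active_node)


# Real database on node1, clone with the same SID started by hand on node3.
print(nodes_to_offline("node1", {"node1", "node3"}))
print(nodes_to_offline("node1", {"node1", "node3"}, fire_drill=True))
```

The first call reports node3, which matches the clone being terminated, and the second reports nothing, which matches the clone being left alone.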
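
The ordering described above can be sketched as a small planning function. It assumes priority-based failover where the lowest number in the SystemList is preferred, and it only returns a list of human-readable steps; it is not a real preonline trigger and does not call any VCS command. The name `plan_failover` and its arguments are hypothetical.

```python
def plan_failover(system_list, available, clone_running_on=()):
    """Pick a failover target and list the steps needed before onlining.

    system_list: dict of node -> priority number (assumed: lower is preferred).
    available: nodes that are currently able to host the real database.
    clone_running_on: nodes where the clone database is running.
    Returns (target, steps); target is None when no node is available.
    """
    candidates = [node for node in available if node in system_list]
    if not candidates:
        return None, []
    target = min(candidates, key=lambda node: system_list[node])
    steps = []
    if target in clone_running_on:
        # The job a preonline script (or an "offline local" dependency) would do.
        steps.append(f"shut down clone database on {target}")
    steps.append(f"online real database on {target}")
    return target, steps


priorities = {"node1": 0, "node2": 1, "node3": 2}  # node3 is chosen last

# node1 lost, node2 still available: the clone on node3 is untouched.
print(plan_failover(priorities, {"node2", "node3"}, clone_running_on={"node3"}))

# Only node3 left: the clone has to go before the real database comes up.
print(plan_failover(priorities, {"node3"}, clone_running_on={"node3"}))
```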

Mike

cnewtonne · ‎05-09-2012

This is an excellent exchange of expertise and I thank all who contributed.

My usage case basically involved cloning a database that is running on node 1 to node 2 (all in same VCS cluster). The cloned db will run using an identical $ORACLE_SID. Even though node 2 was NOT under VCS control, VCS kept terminating the clone database. When we changed the $ORACLE_SID for the cloned db, the issue was resolved and the db was NO longer being terminated.

We wanted to keep the same SID but had to do this, since we thought that within a VCS cluster you cannot run 2 databases using the same $ORACLE_SID, even on different nodes.

Now we know that it can be done by modifying the SystemList or the FireDrill attribute.

One last thought is that I do not understand Amit's argument. If the feature is officially introduced and supported by the vendor, and if it is used within the scope of its purpose, why not? Who says that using it as such violates the feature's scope and purpose because it is not meant to be used for 'my normal operations'?

Thank you all, again.

mikebounds · ‎05-09-2012

I don't understand Amit's argument either. I was a consultant at Symantec for 10 years and I frequently used features of VCS in a different way from how they were intended. Relevant to this post, I have set up quite a few firedrill service groups (a service group that starts the service on a snapshot/clone of the data and makes use of the FireDrill attribute), and customers did not just use the firedrill service group for DR testing. They would use it to test patches and upgrades and to run reports, so to suggest the service group should not be used in this way does not make much sense to me.

What you do need to be aware of is that if the real service group needs to fail over to the node running the clone, then you will have a longer outage, as you have to shut down the clone before bringing up the real database. But having a 2nd failover target that takes longer is better than having no 2nd failover target at all, which is what Satish and Amit are suggesting by removing this node from the SystemList. Also, with 3 nodes, if you lost node 1 or 2 while the clone wasn't online, you could hold off starting it until node 1 or 2 was fixed, or if the clone was already started you could stop it, so that the 3rd node was available without delay if another node went down.

Mike

arangari · ‎05-09-2012

Consider a main.cf as below. This is derived from whatever limited information we have about customer env from above comments.

group G1 (
    SystemList = {S1, S2, S3}
    )

    FileOnOff R1 (
        PathName = "/tmp/R1"
        )

group G1_C ( // clone SG
    SystemList = {S3}
    )

    FileOnOff R1_C ( // clone of R1
        PathName = "/tmp/R1"
        )

group G2 (
    SystemList = {S1, S2, S3}
    )

    FileOnOff R2 (
        PathName = "/tmp/R2"
        )

For user, the G1 and G1_C are independent groups with respect to operations.

Situation 1:

1. G1 is online on S1.

2. To bring G1_C (the clone group online on S3), user fires 'hatype -modify FileOnOff FireDrill 1'.

3. G1_C is brought online on S3.

4. G1 faults on S1 - plans to failover to S3 as S2 is say in FAULTED state.

5. G1 goes online on S3. - VCS would not bring G1_C offline as it is not a 'FireDrill' service group of G1 defined through 'offline local' dependency. Thus now both G1 and G1_C are online on S3.

6. G1_C is brought offline on S3 - G1 will be seen as 'FAULTED'.
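
Steps 3 to 6 above can be replayed with a tiny simulation. This is a sketch under stated assumptions: it models a FileOnOff resource as "online creates the file, offline removes it, monitor checks that it exists", and it models each node as an in-memory set of paths. The `Node` class and the helper functions are hypothetical stand-ins, not VCS code.

```python
class Node:
    """A cluster node reduced to the set of files that exist on it."""

    def __init__(self, name):
        self.name = name
        self.files = set()


def online(node, path):
    """FileOnOff-style online: create the file."""
    node.files.add(path)


def offline(node, path):
    """FileOnOff-style offline: remove the file."""
    node.files.discard(path)


def monitor(node, path):
    """FileOnOff-style monitor: online if and only if the file exists."""
    return path in node.files


s3 = Node("S3")
online(s3, "/tmp/R1")                    # step 3: G1_C (R1_C) online on S3
online(s3, "/tmp/R1")                    # step 5: G1 (R1) fails over to S3
both_online = monitor(s3, "/tmp/R1")     # R1 and R1_C both report online
offline(s3, "/tmp/R1")                   # step 6: G1_C taken offline on S3
r1_still_online = monitor(s3, "/tmp/R1") # R1 now sees its file gone

print(both_online, r1_still_online)
```

Because R1 and R1_C share the same PathName, taking the clone offline removes the file that R1 depends on, which is why G1 would be seen as faulted in step 6.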

Situation 2:

1. G2 is online on S3

2. R2 suddenly goes online on S1 - this will not be detected as a 'Concurrency Violation'.

About using the attribute in an unintended way: if its intention or use changes, there is no guarantee that the unintended ways will be supported in future. So this is certainly a decision to take with caution.

Open to discuss further...

mikebounds · ‎05-10-2012

Regarding situation 1:

Point 5: I already said in an earlier post "You use preonline script to shutdown clone database or put clone DB under VCS control and use "offline local" service group dependency.", so I don't understand this point.

Point 6: The service groups won't both be online if you implement preonline or an offline local dependency, but even if you didn't, I think your scenario is very unlikely to occur, because the online or monitor would likely fail, so G1 would probably never successfully come online.

Regarding situation 2, this is no different from a "real" firedrill: whilst the FireDrill attribute is set, you have no protection against concurrency violations for the resource types that have FireDrill set to 1. But in reality this is not a problem for the Oracle resource type unless you are using CFS, as you wouldn't be able to start the real database because you wouldn't be able to mount the filesystems concurrently. So the most likely cause of a concurrency violation for G2 would be someone bringing up a clone, and in this instance you would want the clone to stay up. Note also that the Symantec-supplied script "fdsetup" sets the FireDrill attribute to 1 but never unsets it, so many customers have FireDrill set to 1 permanently, as fdsetup does not inform you to unset it. In fact, it doesn't even inform you that it changes the FireDrill attribute to 1.

In essence, any argument you make against setting FireDrill for a non-firedrill use can also be made against setting FireDrill for actual firedrill use, as from a VCS point of view it makes no difference whether you are using your clone/snapshot to test DR or for some other purpose like reporting or whatever cnewtonne wants to use the clone for.

And regarding removing S3 from the SystemList: in situation 1 you have an outage for G1, as there is no failover target left, whereas if you use the FireDrill attribute instead, you just have a delay whilst the clone shuts down. If you take my advice from the last post and shut the clone down once S2 has failed, then you wouldn't even have a delay (and since S1 and S2 are independent servers, it is unlikely S1 would fail at about the same time as S2).

Mike

Can a VCS cluster run 2 databases with identical ORACLE_SID?