
Zahid_Haseeb
Moderator
12 years ago

Link between primary site and DR site disconnected

SFHA/DR = 5.1

rhel = uname -a
Linux xxxxxxxxxxxxxxxxx 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
 

Hello all

I hope you are all well. I want to investigate what happened at 12:30 am on 30th September 2012. One of my clients manually switched the service group over from the primary site to the DR site. While the switchover was running, his hagui session (which he had opened from the primary site) was disconnected, so he could not see what was happening at the DR site. Since he could not see anything at the DR site, he brought the service group online at the primary site again so that his critical application would be available, and the service group came up successfully there. Once it was up, he told me he could again open a hagui session to the DR site via the public IP, but replication between the primary and DR sites had stopped. (Please note: before the switchover from primary to DR, repstatus showed replication as connected and up-to-date, not behind.)

That is case one.

Case two:

This evening, between 5:00 pm and 6:00 pm on the same day, when I reached my client's primary site, I saw that the replication service group at the DR site was offline, which made me think this was why replication had stopped. When I brought the DR site service group online, a red exclamation mark appeared on the primary node. I checked the replication status with the repstatus command, which reported a "primary - primary" configuration, so I ran the fbsync command from the DR site command prompt, which also failed with an error. I stopped the replication service group again and rebooted node1 at the primary site, and the exclamation mark disappeared. I then simply brought the service group online on node2 at the primary site.


0.) Kindly share your expert opinion on what actually happened, and please also see the questions below.
1.) Why were all sessions disconnected when my client switched over to the DR site? (case one)
2.) Why did the primary - primary situation occur? (case two)

I have uploaded the engine logs of the DR site node for review.

  • To delete replication objects, using the low-level commands "stop, disassociate, remove" is an alternative to delsec and delpri, as delsec and delpri only work for clean states.

    If replication is disconnected, this means there is a problem between the replication IPs (or a problem on the node, such as the replication service group being offline or a problem with the VVR daemons).  The cluster gets its information using the cluster IPs, which should be different IPs from the replication IPs, so replication being disconnected should not affect communication between clusters.

    I suspect you have misconfigured the network somewhere, or your script that removes IPs on the Application IP is breaking things; for example, it may be corrupting the routing table so that communication between the replication IPs stops working.

    Mike
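
The reply above separates two network paths: the replication IPs and the cluster IPs. One rough way to see which path is broken is to test TCP reachability of each remote address from the local node. The following is only a minimal sketch: the helper names `tcp_reachable` and `check_links` are hypothetical, and the addresses and ports in the comment are placeholders (8199 and 14155 are assumed here to be the usual vradmind and wac ports; verify them against your own configuration).

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable addresses.
        return False


def check_links(links, timeout=3.0):
    """Map each label to the reachability of its (host, port) pair."""
    return {
        label: tcp_reachable(host, port, timeout)
        for label, (host, port) in links.items()
    }


# Example call (placeholder addresses and assumed ports, adjust before use):
# check_links({
#     "replication IP (vradmind)": ("192.0.2.10", 8199),
#     "cluster IP (wac)": ("192.0.2.20", 14155),
# })
```

If the cluster address is reachable but the replication address is not, that points at the replication IPs or their routing rather than at the inter-cluster link. A successful TCP connection only shows that something is listening on that port; it says nothing about the replication state itself.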

29 Replies