11-22-2012 04:39 AM
Environment
OS = RHEL 5
SFHA/DR version = 5.0 MP3 RP3
Primary Site = Two Nodes
DR Site = Single Node
Disk Group = one (with four SAN Disks)
Work Performed
Two of the four SAN disks at my Primary Site, which are shared between both nodes, failed. My application went down at the Primary Site, so I brought it up at the DR Site successfully. At the Primary Site I then replaced both bad disks with two new disks using options 4 and 5 of the vxdiskadm command. After that, from the DR Site, I ran the fbsync command, which started successfully.
My question :
Was running fbsync the right thing to do? This is not a partial sync: my Primary Site has fresh disks, so a complete sync is needed, and I think fbsync does an incremental sync. Or can fbsync also sync the complete data from the DR Site to the Primary in my case?
Comments required, please.
11-22-2012 05:45 AM
I am surprised the fbsync command worked, but I would think there is a good chance your primary data is corrupt - I can see 2 possibilities:
I would run a space-optimised snapshot on the primary site and mount the data to check; it probably won't mount if the data is corrupt. You could also try to mount the volumes read-only ("-o ro" I think) on the Primary. This is not supported, because if you try to read the files they could be changing underneath you via VVR, and these changes won't be in the primary node's cache. The volumes should still mount, though, and if they don't mount, then your data is almost certainly corrupt.
If your data is corrupt, then just run a "vradmin -f stoprep" and "vradmin -a startrep" to resync the data.
Mike
11-22-2012 06:12 AM
First, thanks for your reply, Mike.
One question comes to mind. As I said, I brought my application up at the DR Site (I did not do a switchover), and while it was coming up at the DR Site I believe my client also restarted the Primary Site machine. That means this might have been a takeover, because before executing the fbsync command I checked repstatus and saw a Primary - Primary configuration.
So let me think about what you said:
If your data is corrupt, then just run a "vradmin -f stoprep" and "vradmin -a startrep" to resync the data.
Wouldn't the above commands fail, since the configuration was Primary - Primary?
11-22-2012 11:58 AM
Supposed Environment:
SFHA/DR setup
Primary Site = Two Cluster Nodes
DR Site = One Cluster Node
What would the steps be if, say, both Primary Site servers shut down abnormally due to a power outage and we start the application at the DR Site (via the DR Site Cluster Java Console)?
When power is restored and the Primary Site servers are up, we find that the Primary Site central storage/SAN LUNs shared by both nodes are dead/corrupted/faulty. So we replace the SAN LUNs using options 4 and 5 of vxdiskadm. Our next activity is to synchronize the Primary Site.
When we check the replication status with the repstatus command, we see a Primary - Primary configuration. In this situation, what is the roadmap?
Everyone's comments will be appreciated.
11-23-2012 04:16 AM
Zahid,
Suggested reading:
Veritas Volume Replicator 5.0MP3 Linux Administrator's guide -> Transferring the Primary
https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_admin/ch08.htm
or https://d1mj3xqaoh14j0.cloudfront.net/public/documents/sf/5.0MP3/linux/pdf/vvr_admin.pdf (p239)
Particularly the sections:
Taking over from an original Primary (p247)
https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_admin/ch08s03.htm
--------------------
The takeover procedure involves transferring the Primary role from an original Primary to a Secondary. When the original Primary fails or is destroyed because of a disaster, the takeover procedure enables you to convert a consistent Secondary to a Primary. The takeover of a Primary role by a Secondary is useful when the Primary experiences unscheduled downtimes or is destroyed because of a disaster.
--------------------
Failing back to the original Primary (p255)
https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_admin/ch08s04.htm
--------------------
After an unexpected failure, a failed Primary host might start up to find that one of its Secondaries has been promoted to a Primary by a takeover. This happens when a Secondary of this Primary has taken over the Primary role because of the unexpected outage on this Primary. The process of transferring the role of the Primary back to this original Primary is called failback.
--------------------
Also: Veritas Cluster Server Agents for Veritas Volume Replicator Configuration Guide (5.0MP3 Linux)
https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_agents_config/index.html
https://d1mj3xqaoh14j0.cloudfront.net/public/documents/sf/5.0MP3/linux/pdf/vvr_agents_config.pdf
(probably start at Overview of how to configure VVR in a VCS environment, and work from there)
For other versions/platforms, as always, look for the relevant documents on http://sort.symantec.com/documents - the concepts / overall procedures are basically the same though.
11-23-2012 02:16 PM
It is OK to run fbsync, if vradmin lets you run fbsync after you have replaced some LUNs, because as you say you need to sort out the Primary-Primary config. But as I said earlier, I think there is a good chance that fbsync won't sync all the data on the replaced LUNs.
Before you run fbsync, I believe that if you run vxrlink status at the old primary (where you replaced the LUNs), it should tell you how many bytes are outstanding on the DCM. For fbsync to work, the DCM would have to cover at least the size of the replaced LUNs; if it doesn't, then vradmin fbsync is not that intelligent, as in point 2 of my first post. I guess vradmin MIGHT mark all data on the replaced LUNs as dirty in the DCM as part of running fbsync. So you could also run vxrlink status on DR before running fbsync; after fbsync is run, the 2 DCMs are merged, so you can then check whether the dirty data for the replaced LUNs has been added, but I very much doubt it. If you verify that fbsync has not taken the replaced volumes into consideration, then there is no point letting fbsync finish; just run "vradmin -f stoprep" and "vradmin -a startrep" to sync from scratch.
Mike
11-26-2012 01:40 AM
@ Mike
Do you see any complication if the Mount resource is already online at the DR Site and we run the mount command at the Primary Site with the read-only option?
At both the Primary and Secondary Sites the Mount resource is the parent of the RVG-PRI resource. So will the Mount resource be probed automatically in any case? And if yes, won't the child resource (RVG-PRI in our case) be brought online automatically once the parent (Mount resource) is online?
I just want to make sure that in every case RVG-PRI stays offline at the Primary Site, as it is already online at the DR Site.
11-26-2012 06:28 AM
You could mount using a temporary mount point so VCS shouldn't recognise the resource as online. In theory the worst that could happen is that you get a concurrency violation and VCS will unmount the read-only mount (it won't online dependent resources), but I would freeze the application service groups on both sides for a "belts and braces" approach.
Mike
11-26-2012 09:44 AM
Hmm, let me share this with my OS team (my client can't afford extra downtime from a concurrency violation, as it has around 100,000 users). We will do it this way and I will share the result:
- Freeze the Application service group on both sites (Primary and DR).
- Try to mount the volume at a temporary location such as /mnt, for example:
#mount -t vxfs -o ro /dev/vx/dsk/DG/Volume /mnt
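The two steps above could be laid out as the following dry-run sketch, which only prints the commands rather than running them. The service group name AppSG, the mount point /mnt/rocheck and the function name are placeholders invented for illustration, and the filesystem type is passed in as a parameter because it has to match what is actually on the volume.

```shell
#!/bin/sh
# Dry-run sketch: prints the planned commands instead of executing them.
# AppSG, /mnt/rocheck and plan_ro_check are hypothetical names.
plan_ro_check() {
    group=$1; device=$2; fstype=$3; mnt=$4
    # Freeze the application service group (repeat on the other cluster).
    echo "hagrp -freeze $group"
    # Use a temporary mount point that no VCS Mount resource monitors.
    echo "mkdir -p $mnt"
    echo "mount -t $fstype -o ro $device $mnt"
    # After inspecting the data, undo everything in reverse order.
    echo "umount $mnt"
    echo "hagrp -unfreeze $group"
}

plan_ro_check AppSG /dev/vx/dsk/DG/Volume vxfs /mnt/rocheck
```

Printing first and running by hand afterwards keeps the order of operations visible, which matters when a mistake could trigger a concurrency violation on a live system.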
11-28-2012 04:04 AM
You can't run fsck on a read-only filesystem. It could be that you just need to specify the filesystem type, or it may be that the inode table was being updated on the primary while you were trying to mount on the secondary (this is why this method is not officially supported). If you have an Enterprise license and a little free space in the diskgroup, then the better method is to run a space-optimised snapshot and mount this.
Mike
11-28-2012 04:07 AM
I just did stoprep and then startrep, and when the sync completed I tried to run the mount command read-only, but there seems to be an issue. See below for reference:
[root@xxxxxx ~]# mount -o ro /dev/vx/dsk/DG/Volume /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/vx/dsk/DG/Volume,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
#tail -f /var/log/messages
Nov 28 15:43:06 xxxxxx kernel: lost page write due to I/O error on VxVM65533
Nov 28 15:43:06 xxxxxx kernel: JBD: recovery failed
Nov 28 15:43:06 xxxxxx kernel: EXT3-fs: error loading journal.
I think we need to run fsck, given that I have an ext3 filesystem. Any suggestion on what the fsck syntax would be in my situation? Something like:
fsck -o full -y /dev/vx/dsk/DG/volume
OR
fsck -f -y /dev/vx/dsk/DG/volume
11-28-2012 04:23 AM
So you mean a filesystem problem may not be the reason it is not mounting (there could be other reasons)? Thanks.
Would you kindly share the commands for a space-optimized snapshot, and how much free space should I allow for it?
I tried pauserep (so the inodes are definitely not being updated at the secondary/real Primary) and then ran the mount command, but got the same messages.
Thanks
11-28-2012 05:53 AM
The amount of space you need free is:
About the same size as the inode table, for pointers to real or COW (copy-on-write) data, so about 2%
+
Amount of changes that occur on the primary while snapshot is mounted
+
Any changes you MAY make to filesystem when it is mounted as a space optimized snapshot.
If this is just a test to check the filesystem is ok, then 5% should be plenty
The commands to create the snapshot are in the section "Space-optimized instant snapshots" of the VxVM admin guide - examples from the guide are shown below:
Make the cache object - this is where the space described above is used, so make it 5% of the size of the filesystem:
vxassist -g mydg make cachevol 1g
vxmake -g mydg cache cobjmydg cachevolname=cachevol
vxcache -g mydg start cobjmydg
vxsnap -g mydg prepare myvol
vxsnap -g mydg make source=myvol/newvol=snap3myvol/cache=cobjmydg
mount /dev/vx/dsk/mydg/snap3myvol /mnt
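As a rough sketch of the 5% rule of thumb above, the shell function below estimates a cache volume size from the size of the data volume. The function name cache_size_mb and the 150 GB example figure are illustrative assumptions, not something taken from the VxVM guide.

```shell
#!/bin/sh
# Hypothetical helper: estimate the cache volume size for a
# space-optimized snapshot as a percentage of the data volume size.
# Usage: cache_size_mb <volume_size_mb> [percent]   (percent defaults to 5)
cache_size_mb() {
    vol_mb=$1
    pct=${2:-5}
    # Integer arithmetic, rounded up so the cache is never undersized.
    echo $(( (vol_mb * pct + 99) / 100 ))
}

# Example: a 150 GB (153600 MB) data volume at the default 5%.
cache_size_mb 153600   # prints 7680 (MB), i.e. 7.5 GB
```

The result would replace the 1g in the vxassist example; if the snapshot stays mounted for a long time on a busy volume, a larger percentage is the safer choice.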
11-28-2012 06:22 AM
Many thanks first for your detailed reply, Mike.
I have two disks/SAN LUNs of 100 GB each.
I have two volumes on these two disks/LUNs. One is the 150 GB data volume, which is being replicated, and the other is the SRL volume of approx. 50 GB.
(I don't have free space available in the disk group, so I think I need to add a disk of approx. 10 GB to create the volume named cachevol as per your suggestion above.) Correct?
======
vxsnap -g mydg make source=myvol/newvol=snap3myvol/cache=cobjmydg
Is "source=myvol/newvol=snap3myvol/cache=cobjmydg" a single argument? I am not able to understand this.
11-28-2012 06:46 AM
Yes, you need to add a disk to the diskgroup, as the free space needs to be in the diskgroup.
"source=myvol/newvol=snap3myvol/cache=cobjmydg"
This is what Symantec calls a tuple - it specifies 3 things, which you separate with "/" as shown, with no spaces, so you have:
source=myvol - The volume you are taking a snapshot of
newvol=snap3myvol - Name of the new volume that is the snapshot of your volume
cache=cobjmydg - The name of your cache object
Note that you can use the same cache object for many volumes, so if you had more than one volume you wouldn't need a separate cache object; you would just repeat the "vxsnap prepare" and "vxsnap make" for the second volume.
Mike
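To make the tuple structure concrete, here is a small shell sketch that splits such a string on "/" and prints each attribute with its value. It does not call vxsnap at all; the function name parse_tuple is made up for this illustration.

```shell
#!/bin/sh
# Illustrative only: split a vxsnap-style tuple into attribute/value pairs.
tuple="source=myvol/newvol=snap3myvol/cache=cobjmydg"

parse_tuple() {
    # Temporarily split words on "/" instead of whitespace.
    old_ifs=$IFS
    IFS=/
    for part in $1; do
        # Text before the first "=" is the attribute, the rest is the value.
        printf '%s -> %s\n' "${part%%=*}" "${part#*=}"
    done
    IFS=$old_ifs
}

parse_tuple "$tuple"
# prints:
# source -> myvol
# newvol -> snap3myvol
# cache -> cobjmydg
```

Because the shell passes the whole tuple to vxsnap as one word, any space inside it would split it into separate arguments, which is why the three parts must be written without spaces.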
11-29-2012 01:17 AM
Regarding the messages mentioned above (repeated below for reference), what factors could be causing them? I am used to mounting the filesystem read-only at the DR Site and I have never faced this problem.
[root@xxxxxx ~]# mount -o ro /dev/vx/dsk/DG/Volume /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/vx/dsk/DG/Volume,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
#tail -f /var/log/messages
Nov 28 15:43:06 xxxxxx kernel: lost page write due to I/O error on VxVM65533
Nov 28 15:43:06 xxxxxx kernel: JBD: recovery failed
Nov 28 15:43:06 xxxxxx kernel: EXT3-fs: error loading journal.
11-29-2012 02:09 AM
Is this an ext3 filesystem? If it is not ext3, then you need to specify the mount type when you mount it.
Mike
11-29-2012 04:52 AM
Yes, this is an ext3 filesystem, but I am still facing the same error.
One more thing I have found:
At the Primary Site (new Secondary) we stopped replication and then stopped VCS, then tried to mount the volume again; I was not able to mount it and got the same error. I created a new filesystem on this volume (the one that would not mount), but I still faced the same problem. I then dissociated the volume from the RVG, created a filesystem, and mounted it. It mounted successfully, but as soon as I associated the volume back with the RVG I got the same error when mounting it.
11-29-2012 07:20 AM
The original point of doing a space-optimised snapshot or mounting read-only was to see if fbsync synced the volumes on the replaced disks, but if you have stopped and started replication, then the fact that you replaced disks is irrelevant, as you have synced everything from scratch. I have mounted a VVR volume on the secondary on Solaris with vxfs fine, as Solaris just marks a mounted flag, but I have never tried on Linux with ext3, so Linux MAY do something different: it might try to write to the inode table, as the error says "lost page WRITE", and this won't work as the whole volume is read-only. But if you stop replication so that the secondary volumes are writable, then this should work, and if it doesn't then you probably did something wrong earlier.
If you try all this again, I would use a space-optimised snapshot, and if it doesn't work, then send all the commands you are running (in the order you run them), including associating volumes, stopping and starting replication, snapshot commands and the mount of the snapshot.
Mike
11-29-2012 09:34 PM
Thanks to all participants, especially Mike, for the kind words on this post.
All the volumes shown as associated in the vxprint output are started.
An example of starting a disabled volume:
(vxvol -g DG start VolumeName OR vxvol -g DG startall )
Example of starting and stopping replication:
(vradmin -g DG -a startrep RVGname and vradmin -g DG -f stoprep RVGname)
===================
Even at the Primary Site, where the application was not live, I created new volumes/filesystems. They mount when they are dissociated from the RVG, but once I associate the volumes with the RVG I cannot mount them, even with replication stopped. If a freshly formatted volume with a new filesystem cannot be mounted under the RVG, how can I expect a snapshot of this volume to mount? And even if it does mount, what is the benefit when the real volume cannot be mounted? In that situation, what happens during a switchover/failover, i.e. how can the volume be mounted? I am so depressed with Support at this time.
Now my Final PLAN
- Finally I broke the GCO, which isolated the two clusters (Primary Site and DR Site), and removed the links between the service groups (Application and Replication service groups). The application was still running at the DR Site.
- Then I created a new service group, disk group and new volumes with new names, for the application only, and just ran the createpri command at the Primary Site.
After doing all this I copied my application from the DR Site to the Primary Site this morning and brought the application up at the Primary Site.
Plan for today
Phase-I
At the DR Site I will now create a new service group, disk group and new volumes with the same names and sizes as the ones newly created at the Primary Site.
Phase-II
I will run the addsec command from the Primary Site, which will start the replication between the Primary and DR Sites.
Phase-III
We need to create the RVG resource in the Replication service group.
1.) At the Primary Site we remove the DiskGroup resource from the Application service group and create the DiskGroup resource under the Replication service group. (Can we do that without any downtime? When we remove the DiskGroup resource from the Application service group, this may deport the disk group at the Primary Site and bring the application down.) How can we move the disk group from the Application service group to the Replication service group seamlessly, without any impact on the live application?
2.) Create the RVG resource in the Replication service group (at the Primary Site).
3.) Create the RVG-Primary resource in the Application service group (at the Primary Site).
4.) Create an online local hard link between the Application and Replication service groups (select the Application service group and then the Replication service group) (at the Primary Site).
We also need to perform the above four activities at the DR Site, which I don't think is a real worry; the point I am concerned about is point # 1. (These four steps put the replication under Veritas Cluster control.)
5.) Add the Remote Cluster
6.) Make the Application service group a global service group.
If you have any concerns about the activities under Plan for today, please share your comments. I need an urgent and quick response, as I have lost all Symantec credibility at my client site: the Severity-1 case has taken a week and is still not complete. I would really appreciate a proper resolution for this.