Can fbsync do a complete sync if the Primary Site disks fail?

Zahid_Haseeb
Moderator
Partner    VIP    Accredited

Environment

OS = RHEL 5

SFHA/DR version = 5.0 MP3 RP3

Primary Site = Two Nodes

DR Site = Single Node

Disk Group = one (with four SAN Disks)

Work Performed

Two of the four SAN disks at my Primary Site, which were shared between both nodes, failed. My application went down at the Primary Site, so I brought it up at the DR Site successfully. At the Primary Site I then replaced both bad disks with two new disks using options 4 and 5 of the vxdiskadm command. After that I ran the fbsync command from the DR Site, and it started successfully.

My question:

Was I right or wrong to run the fbsync command? This is not a partial sync: a complete sync is needed because my Primary Site has fresh disks, and I think fbsync does an incremental sync. Or can fbsync also sync the complete data from the DR Site to the Primary in my case?

Comments required, please.


mikebounds
Level 6
Partner Accredited

I am surprised the fbsync command worked, and I think there is a good chance your primary data is corrupt. I can see 2 possibilities:

  1. VVR is intelligent enough to know that all data on the replaced disks (assuming it is not mirrored) needs to be synced, so it marks the DCM map completely dirty for these subdisks and all of this data is synced
  2. VVR is not this intelligent, so your new disks contain only the data changes that were written at the DR site, and the data is therefore corrupt

I would take a space-optimised snapshot on the primary site and mount it to check the data; it probably won't mount if the data is corrupt. You could also try to mount the volumes read-only ("-o ro" I think) on the Primary. This is not supported, because if you read the files they could be changing under VVR and those changes won't be in the primary node's cache, but the volumes should mount, and if they don't, your data is almost certainly corrupt.

If your data is corrupt, then just run "vradmin -f stoprep" and "vradmin -a startrep" to resync the data.

Mike

 

Zahid_Haseeb

First, thanks for your reply, Mike.

One question comes to mind. As I said, I brought my application up at the DR Site (I did not do a switchover), and I believe my client also restarted the Primary Site machine while I was bringing the application up. That means this may have been a takeover, because before executing the fbsync command I checked repstatus and saw a Primary-Primary configuration.

So let me come back to what you said:

If your data is corrupt, then just run "vradmin -f stoprep" and "vradmin -a startrep" to resync the data.

Wouldn't the above commands fail, given that the configuration was Primary-Primary?

Zahid_Haseeb

Supposed environment:

SFHA/DR setup

Primary Site = Two cluster nodes

DR Site = One cluster node

What would the steps be if both Primary Site servers shut down abnormally due to a power outage and we start the application at the DR Site (via the DR Site Cluster Java Console)?

When power is restored and the Primary Site servers are up, we find that the Primary Site central storage/SAN LUNs shared by both nodes are dead/corrupted/faulty. So we replace the SAN LUNs using options 4 and 5 of vxdiskadm. Our next activity is to synchronize the Primary Site.

When we check the replication status with the repstatus command, we see a Primary-Primary configuration. In this situation, what is the roadmap?

Everyone's comments will be appreciated.

g_lee
Level 6

Zahid,

Suggested reading:

Veritas Volume Replicator 5.0MP3 Linux Administrator's guide -> Transferring the Primary

https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_admin/ch08.htm

or https://d1mj3xqaoh14j0.cloudfront.net/public/documents/sf/5.0MP3/linux/pdf/vvr_admin.pdf (p239)

Particularly the sections:

Taking over from an original Primary (p247)

https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_admin/ch08s03.htm
--------------------
The takeover procedure involves transferring the Primary role from an original Primary to a Secondary. When the original Primary fails or is destroyed because of a disaster, the takeover procedure enables you to convert a consistent Secondary to a Primary. The takeover of a Primary role by a Secondary is useful when the Primary experiences unscheduled downtimes or is destroyed because of a disaster.
--------------------

Failing back to the original Primary (p255)

https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_admin/ch08s04.htm
--------------------
After an unexpected failure, a failed Primary host might start up to find that one of its Secondaries has been promoted to a Primary by a takeover. This happens when a Secondary of this Primary has taken over the Primary role because of the unexpected outage on this Primary. The process of transferring the role of the Primary back to this original Primary is called failback.
--------------------

Also: Veritas Cluster Server Agents for Veritas Volume Replicator Configuration Guide (5.0MP3 Linux)

https://sort.symantec.com/public/documents/sf/5.0MP3/linux/html/vvr_agents_config/index.html

https://d1mj3xqaoh14j0.cloudfront.net/public/documents/sf/5.0MP3/linux/pdf/vvr_agents_config.pdf

(probably start at Overview of how to configure VVR in a VCS environment, and work from there)

For other versions/platforms, as always, look for the relevant documents on http://sort.symantec.com/documents - the concepts / overall procedures are basically the same though.

mikebounds

It is OK to run fbsync, if vradmin lets you run fbsync after you have replaced some LUNs, because as you say you need to sort out the Primary-Primary config. But as I said earlier, I think there is a good chance that fbsync won't sync all the data on the replaced LUNs.

Before you run fbsync, I believe that if you run vxrlink status at the old primary (where you replaced the LUNs), it should tell you how many bytes are outstanding on the DCM. For fbsync to work, the DCM would have to cover at least the size of the replaced LUNs; if it doesn't, then vradmin fbsync is not that intelligent, as in point 2 of my first post. I guess vradmin MIGHT mark all data on the replaced LUNs dirty in the DCM as part of running fbsync, so you could also run vxrlink status on DR before running fbsync. After fbsync is run, the 2 DCMs are merged, so you can then check whether the dirty data for the replaced LUNs has been added, but I very much doubt it. If you verify that fbsync has not taken the replaced volumes into consideration, then there is no point letting fbsync finish; just run "vradmin -f stoprep" and "vradmin -a startrep" to sync from scratch.
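The comparison described above can be sketched in shell. This is only an illustration: the function name dcm_covers_replaced and both figures are hypothetical, and the two numbers have to be read by hand from the vxrlink status output and from the LUN sizes, since the exact output format is not shown in this thread.

```shell
# Hypothetical check: succeeds (exit 0) only if the amount marked dirty
# in the DCM is at least as large as the total size of the replaced LUNs.
# Both arguments are plain integers in the same unit (KB here).
dcm_covers_replaced() {
    dcm_dirty_kb=$1
    replaced_kb=$2
    [ "$dcm_dirty_kb" -ge "$replaced_kb" ]
}

# Assumed example figures: 1 GB dirty in the DCM, 200 GB of replaced LUNs.
if dcm_covers_replaced 1048576 209715200; then
    echo "DCM covers the replaced LUNs; fbsync may be enough"
else
    echo "DCM does not cover the replaced LUNs; a full resync is needed"
fi
```

With these assumed figures the check fails, which corresponds to the case where fbsync would be stopped and replication restarted from scratch.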

Mike

Zahid_Haseeb

@ Mike

Do you see any complication if the Mount resource is already online at the DR Site and we run the mount command at the Primary Site with the read-only option?

At both the Primary and Secondary Sites, the Mount resource is the parent of the RVG-PRI resource. So will the Mount resource be probed automatically in any case? And if yes, won't the child resource (RVG-PRI in our case) come online automatically once the parent (Mount resource) is online?

I just want to make sure that in any case RVG-PRI stays offline at the Primary Site, as it is already online at the DR Site.

mikebounds

You could mount using a temporary mount point, so VCS shouldn't recognise the resource as online. In theory the worst that could happen is that you get a concurrency violation and VCS unmounts the read-only mount (it won't online dependent resources), but I would freeze the application service groups on both sides for a "belts and braces" approach.

Mike

Zahid_Haseeb

Hmm, let me share this with my OS team (my client can't afford extra downtime from a concurrency violation, as it has around 100,000 users). We will do it this way and share the result:

- Freeze the application service group on both sites (Primary and DR).

- Try to mount the volume at a temporary location such as /mnt, for example:

  #mount -t vxfs -o ro /dev/vx/dsk/DG/Volume /mnt
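The two steps above could be wrapped in a small dry-run script along these lines. It is a sketch under assumptions: the service group name app_sg is a placeholder, the script only prints the commands unless DRY_RUN is set to 0, and the freeze has to be run once in each cluster (Primary and DR), since each site has its own VCS cluster.

```shell
# Dry run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Freeze one service group, then mount the volume read-only on a
# temporary mount point. Arguments: group, fs type, device, mount point.
freeze_and_mount_ro() {
    group=$1 fstype=$2 device=$3 mnt=$4
    run hagrp -freeze "$group"
    run mount -t "$fstype" -o ro "$device" "$mnt"
}

# app_sg is a hypothetical group name; device and mount point are from the post.
freeze_and_mount_ro app_sg vxfs /dev/vx/dsk/DG/Volume /mnt
```

The filesystem type is passed as a parameter so that the same sketch can be used if the volume turns out not to be vxfs.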

mikebounds

You can't run fsck on a read-only filesystem. It could be that you just need to specify the filesystem type, or it may be that the inode table was being updated on the primary while you were trying to mount on the secondary (this is why this method is not officially supported). If you have an Enterprise license and a little free space in the disk group, then the better method is to take a space-optimised snapshot and mount that.

Mike

Zahid_Haseeb

I just did stoprep and then startrep, and once the sync completed I tried to run the mount command read-only, but there seems to be an issue. See below for reference:

[root@xxxxxx ~]# mount -o ro /dev/vx/dsk/DG/Volume /mnt/

mount: wrong fs type, bad option, bad superblock on /dev/vx/dsk/DG/Volume,

       missing codepage or other error

       In some cases useful info is found in syslog - try

       dmesg | tail  or so

 

#tail -f /var/log/messages

Nov 28 15:43:06 xxxxxx kernel: lost page write due to I/O error on VxVM65533

Nov 28 15:43:06 xxxxxx kernel: JBD: recovery failed

Nov 28 15:43:06 xxxxxx kernel: EXT3-fs: error loading journal.

 

I think we need to run fsck, but my filesystem is ext3. Any suggestion on what the fsck syntax would be in my situation? For example:

fsck -o full -y /dev/vx/dsk/DG/volume

OR

fsck -f -y /dev/vx/dsk/DG/volume

Zahid_Haseeb

So you mean the filesystem itself may not be the reason it is not mounting (there could be other reasons). Thanks.

Would you kindly share the commands for a space-optimized snapshot, and how much free space I should allow for?

I tried pauserep (so the inodes are definitely not being updated by the secondary/real Primary at this point) and then ran the mount command, but got the same messages.

Thanks

mikebounds

The amount of space you need free is:

About the same size as the inode table, for pointers to real or COW (copy on write) data, so about 2%

+

The amount of changes that occur on the primary while the snapshot is mounted

+

Any changes you MAY make to the filesystem while it is mounted as a space-optimized snapshot.

If this is just a test to check the filesystem is OK, then 5% should be plenty
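As a rough worked example of the sizing rule above, here is a minimal shell sketch. The helper name cache_size_mb and the 150 GB (153600 MB) filesystem size are illustrative assumptions, not values from the guide; substitute your own volume size.

```shell
# Hypothetical helper: size of the cache volume in MB, given the
# filesystem size in MB and a percentage. Integer arithmetic, rounded up.
cache_size_mb() {
    fs_mb=$1
    pct=$2
    echo $(( (fs_mb * pct + 99) / 100 ))
}

# Assumed example: 150 GB filesystem, 5% cache.
cache_size_mb 153600 5
```

For a 150 GB filesystem this prints 7680, i.e. about 7.5 GB for the cache volume.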

The commands to take the snapshot are in the section "Space-optimized instant snapshots" of the VxVM admin guide. Examples from the guide are shown below:

Make the cache object. This is where the space described above is stored, so make it 5% of the size of the filesystem:

 

vxassist -g mydg make cachevol 1g
vxmake -g mydg cache cobjmydg cachevolname=cachevol
vxcache -g mydg start cobjmydg
 
Prepare volume for snapshot (create DCO and turn on Fastresync)
vxsnap -g mydg prepare myvol
Then take snapshot
vxsnap -g mydg make source=myvol/newvol=snap3myvol/cache=cobjmydg
 
Then you can mount (and you can fsck this first if you want):
mount /dev/vx/dsk/mydg/snap3myvol /mnt
 
Note you may HAVE to fsck first. In essence the snapshot contains a filesystem that has not been unmounted, because the snapshot is taken while the filesystem is mounted at the VVR primary. Some O/Ss just complain but mount anyway, but I THINK some require fsck to clear the "mount" flag.
 
Mike

 

 

 

 

Zahid_Haseeb

Many thanks for your detailed reply, Mike.

I have two disks/SAN LUNs of 100 GB each.

I have two volumes on these two disks/LUNs. One is the 150 GB data volume, which is being replicated, and the other, approx. 50 GB, is the SRL volume.

(I don't have free space available in the disk group, so I think I need to add a disk of approx. 10 GB to create the volume named cachevol, as per your suggestion above.) Correct?

 

======

One question here :
 
Then take snapshot
vxsnap -g mydg make source=myvol/newvol=snap3myvol/cache=cobjmydg

Is "source=myvol/newvol=snap3myvol/cache=cobjmydg" a single argument? I don't understand this part.

 

mikebounds

Yes, you need to add a disk to the disk group, as the free space needs to be in the disk group.

"source=myvol/newvol=snap3myvol/cache=cobjmydg"

This is what Symantec calls a tuple. It specifies 3 things, which you separate with "/" as shown, with no spaces, so you have:

source=myvol    - The volume you are taking a snapshot of

newvol=snap3myvol   - Name of the new volume that is the snapshot of your volume

cache=cobjmydg    - The name of your cache object

Note you can use the same cache object for many volumes, so if you had more than one volume, you don't need a separate cache object and so you would just repeat the "vxsnap prepare" and "vxsnap make" for a second volume if you had one.
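To make the tuple format concrete, here is a small shell sketch that builds the single argument from its three parts. The helper name snap_tuple is hypothetical; the volume, snapshot, and cache object names are the ones from the example in the guide, and the script only prints the resulting command line rather than running vxsnap.

```shell
# Build the vxsnap tuple: three name=value attributes joined by "/"
# with no spaces, so the shell passes them as one argument.
snap_tuple() {
    printf 'source=%s/newvol=%s/cache=%s\n' "$1" "$2" "$3"
}

tuple=$(snap_tuple myvol snap3myvol cobjmydg)

# Print the command that would be run (not executed here).
echo "vxsnap -g mydg make $tuple"
```

For a second volume sharing the same cache object, only the first two arguments to snap_tuple would change.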

Mike

Zahid_Haseeb

As per the messages mentioned above (repeated below for reference), what factors could be causing these messages? I am used to mounting the filesystem read-only at the DR Site and have never faced this problem.

 

[root@xxxxxx ~]# mount -o ro /dev/vx/dsk/DG/Volume /mnt/

mount: wrong fs type, bad option, bad superblock on /dev/vx/dsk/DG/Volume,

       missing codepage or other error

       In some cases useful info is found in syslog - try

       dmesg | tail  or so

 

#tail -f /var/log/messages

Nov 28 15:43:06 xxxxxx kernel: lost page write due to I/O error on VxVM65533

Nov 28 15:43:06 xxxxxx kernel: JBD: recovery failed

Nov 28 15:43:06 xxxxxx kernel: EXT3-fs: error loading journal.

mikebounds

Is this an ext3 filesystem?  If it is not ext3, then you need to specify the mount type when you mount it.

Mike

Zahid_Haseeb

Yes, this is an ext3 filesystem, but I am still facing the same error.

One more thing I have found:

At the Primary Site (new Secondary) we stopped replication, then stopped VCS, and then tried to mount the volume again. I was not able to mount it, and got the same error. I created a new filesystem on this volume (the one that would not mount), but I still faced the same problem. I then dissociated the volume from the RVG, created a filesystem, and mounted it. It mounted successfully, but as soon as I associated the volume back with the RVG I got the same error when mounting it.

mikebounds

The original point of taking a space-optimised snapshot or mounting read-only was to see whether fbsync had synced the volumes on the replaced disks. But if you have stopped and started replication, then the fact that you replaced disks is irrelevant, as you have synced everything from scratch. I have mounted a VVR volume on the secondary with Solaris vxfs fine, as Solaris just marks a mounted flag, but I have never tried it on Linux ext3. Linux MAY do something different: it might try to write to the inode table, as the error says "lost page WRITE", and this won't work because the whole volume is read-only. But if you stop replication so that the secondary volumes are writable, then this should work, and if it doesn't, you probably did something wrong earlier.

If you try all this again, I would use a space-optimised snapshot, and if it doesn't work, send all the commands you are running (in the order you run them), including associating volumes, stopping and starting replication, the snapshot commands, and the mount of the snapshot.

Mike

Zahid_Haseeb

Thanks to all participants, especially Mike, for the kind replies on this post.

All volumes associated with the RVG are started, as shown by the vxprint command.

An example of starting a disabled volume:

(vxvol -g DG start VolumeName OR vxvol -g DG startall)

Start and stop replication example:

(vradmin -g DG -a startrep RVGname and vradmin -g DG -f stoprep RVGname)

===================

Even at the Primary Site, where the application was not live, I created new volumes/filesystems. They mount when they are dissociated from the RVG, but once I associate the volumes with the RVG I cannot mount them, even with replication stopped. If a freshly formatted volume with a new filesystem cannot be mounted under the RVG, how can I expect a snapshot of this volume to mount? And even if it does mount, what is the benefit when the real volume cannot be mounted? In this situation, what happens during a switchover/failover, that is, how can the volume be mounted? I am very disappointed with Support at this point.

Now my final plan

- Finally I broke the GCO, which isolated the two clusters (Primary Site and DR Site), and removed the links between the service groups (Application and Replication service groups). The application was still running at the DR Site.

- Then I created a new service group, disk group, and new volumes with new names for the application only, and ran the createpri command at the Primary Site.
After doing all this I copied my application from the DR Site to the Primary Site this morning and brought the application up at the Primary Site.

Plan for today
Phase-I
At the DR Site I will now create a new service group, disk group, and new volumes with the same names and sizes as those newly created at the Primary Site.

Phase-II
I will run the addsec command from the Primary Site, which will start replication between the Primary and DR Sites.

Phase-III
We need to create the RVG resource in the Replication service group.

1.) At the Primary Site, remove the DiskGroup resource from the Application service group and create the DiskGroup resource under the Replication service group. (Can we do that without any downtime? Removing the DiskGroup resource from the Application service group may deport the disk group at the Primary Site and bring the application down.) How can we move the disk group from the Application service group to the Replication service group seamlessly, without any impact on the live application?

2.) Create the RVG resource in the Replication service group (at the Primary Site).

3.) Create the RVG-Primary resource in the Application service group (at the Primary Site).

4.) Create an online local hard link between the Application and Replication service groups (select the Application service group first and then the Replication service group) (at the Primary Site).

We also need to perform the above four activities at the DR Site, which I don't think is a real concern; the point I am concerned about is point 1. (These four steps put replication under Veritas Cluster control.)

5.) Add the remote cluster.

6.) Make the Application service group a global service group.
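Steps 2 to 4 of Phase-III might look roughly like the following from the command line. This is a dry-run sketch under assumptions, not a tested procedure: every resource, group, disk group, and RVG name is a placeholder, the attribute names should be checked against the VCS Agents for VVR Configuration Guide for your version, and step 1 (moving the DiskGroup resource) is deliberately left out because it is the open question here. The script only prints the commands unless DRY_RUN is set to 0.

```shell
# Dry run by default: print each VCS command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Steps 2-4 of the plan for one site. All names are hypothetical.
# Arguments: application group, replication group, disk group, RVG name.
put_rvg_under_vcs() {
    app_sg=$1 rep_sg=$2 dg=$3 rvg=$4
    run haconf -makerw
    # Step 2: RVG resource in the Replication service group.
    run hares -add rvg_res RVG "$rep_sg"
    run hares -modify rvg_res RVG "$rvg"
    run hares -modify rvg_res DiskGroup "$dg"
    # Step 3: RVGPrimary resource in the Application service group.
    run hares -add rvgpri_res RVGPrimary "$app_sg"
    run hares -modify rvgpri_res RvgResourceName rvg_res
    # Step 4: Application group (parent) depends on Replication group (child).
    run hagrp -link "$app_sg" "$rep_sg" online local hard
    run haconf -dump -makero
}

put_rvg_under_vcs app_sg rep_sg newdg newrvg
```

Printing the commands first gives the OS team something to review before anything is changed on the live cluster.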


If you have any concerns about the activities under Plan for today, please share your comments. I need an urgent response, as I have lost all Symantec credibility at my client site: the Severity-1 case has run for a week and still has not been completed. I would really appreciate a proper resolution for this.