Solved: Netbackup Shadow Copy component backup is too slow compared to VSS

gopi_enovate · ‎04-10-2015

Hi Everyone,
I have been reading a lot of forum discussions and documents to understand the behaviour of Netbackup for shadow copy components.

We are using shadow copy components for our DFS and file servers. Backups run for hours before they complete.

As per my understanding, shadow copy components backs up system files, the registry and all other system-related data, and shows it under System State in BAR.
All user data, ADAM and other related data is structured within user data.

There are different ways of creating a shadow copy using VSS.

1. VSS full clone
2. VSS copy on write (Differential copy)
3. Redirect on write

Does Netbackup use VSS full clone for shadow copy component?

A regular shadow copy from Windows is completed within a few minutes.
However, a backup of the same data takes hours to complete using NB. The file backup completes quickly.

Kindly help.

sdo · ‎04-10-2015

I believe that NetBackup 'bpbkar32 -> bpfis -> VSS' uses 'copy on write (differential)' VSS methods - i.e. maintains a discardable delta change log.

AFAIK NetBackup does not use VSS full clone, nor 'redirect on write' (but I myself have not looked into what this 'redirect on write' actually is - so anything that you can briefly share with me would be appreciated).

When you say "A regular shadow copy from windows is completed within few minutes." I have to ask what type of VSS shadow did you use, i.e. what type of VSS shadow does 'regular' mean in this case? I think your use of the term 'completed' is a bit misleading, and what you probably mean is that "a regular shadow copy from within Windows is 'established' within a few minutes", i.e. established does not necessarily mean completed. A shadow is only ever 'completed' either when it is broken off (possibly after being fully copied for separate (re-)mounting), or when it is discarded/deleted. And I very much doubt whether a 'full clone' of even a typical 16GB bare-bones fresh Windows OS install on a C: drive can be completed within a few minutes. So if a 'full clone' appears to complete within a few minutes, I think MS is combining 'starting a copy' with establishing a (hidden from us) copy-on-write, so that a full copy appears to be ready for use before the copy has actually fully completed.

.

Anyway, when we think about performance of VSS shadows with NetBackup, let's first remind ourselves what a 'copy on write / differential' shadow actually is. It is a delta change log, i.e. a list of copies of the original contents of all the blocks that have been overwritten during the lifetime of the copy-on-write differential shadow. i.e. once a copy-on-write differential shadow is established, when a write (one I/O) occurs on the 'protected' (aka shadowed) volume, VSS/Windows will intercept the write, and before it allows the file system driver to actually complete the write, VSS will read (one I/O) the original block contents that are just about to be overwritten, and then write (one I/O) this copy to the shadow storage - before finally allowing the original write to the protected volume to complete. And there is a fourth IO - when backup IO is intercepted by VSS and redirected to the copy of the original block in the VSS shadow space. So, whilst a 'copy-on-write differential' VSS shadow exists, all write IOs actually incur a minimum of three I/Os - but there is always the potential for the fourth IO (read) for the backup.
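The I/O accounting described above can be sketched with a toy model. This is only an illustration, not how VSS is implemented: the class name `CowVolume` and its methods are hypothetical, blocks are plain Python values, and the model assumes (a simplification not stated in the post) that only the first overwrite of a block since the shadow was established triggers the extra read and copy.

```python
class CowVolume:
    """Toy model of a volume protected by a copy-on-write shadow."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # the live ("protected") volume
        self.shadow = {}            # block index -> original contents
        self.io_count = 0           # I/Os issued so far

    def write(self, index, data):
        # Assumption: only the first overwrite of a block needs the
        # original contents to be preserved in shadow storage.
        if index not in self.shadow:
            original = self.blocks[index]
            self.io_count += 1      # read the original block
            self.shadow[index] = original
            self.io_count += 1      # write the copy to shadow storage
        self.blocks[index] = data
        self.io_count += 1          # the write the application asked for

    def snapshot_read(self, index):
        # A backup read sees the volume as it was when the shadow was taken.
        self.io_count += 1
        if index in self.shadow:
            return self.shadow[index]
        return self.blocks[index]


vol = CowVolume(["a", "b", "c", "d"])
vol.write(1, "B")                   # first overwrite of block 1
print(vol.io_count)                 # 3
print([vol.snapshot_read(i) for i in range(4)])  # ['a', 'b', 'c', 'd']
```

The single application write costs three I/Os in this model, and the backup still reads the original contents of block 1 from the shadow copy rather than from the live volume.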

If we extrapolate this to what occurs when a backup is running, then VSS also has to intercept all shadow reads (being performed by the backup application) and check whether the block being read has since been overwritten. If it has been overwritten by an application (e.g. a database) or the O/S, then VSS will have to go and fetch the original contents from the VSS shadow copy and pretend to the backup application that it has just read the source (protected/shadowed) volume. So, it is quite obvious that backups of shadowed volumes incur more CPU per IO, and that this backup IO becomes increasingly random as more writes occur during the lifetime of the shadow copy. And increasingly random small IOs typically mean higher latency (i.e. the backup gets slower over time), simply because the backup becomes less and less sequential.
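To make the "less and less sequential" point concrete, here is a minimal sketch, assuming a backup that reads blocks in order and a known set of blocks overwritten since the shadow was established. The function names are hypothetical, and counting switches between the live volume and shadow storage is only a crude stand-in for real seek behaviour.

```python
def backup_read_plan(num_blocks, overwritten):
    """For each block, say where a backup read would be served from."""
    plan = []
    for index in range(num_blocks):
        source = "shadow" if index in overwritten else "volume"
        plan.append((index, source))
    return plan


def count_source_switches(plan):
    """Count how often consecutive reads alternate between the two sources."""
    return sum(1 for prev, cur in zip(plan, plan[1:]) if prev[1] != cur[1])


for overwritten in (set(), {2, 5}, {1, 3, 5}):
    plan = backup_read_plan(8, overwritten)
    print(sorted(overwritten), count_source_switches(plan))
```

With no overwritten blocks the read stream never leaves the live volume; every extra overwritten block adds detours to shadow storage, which is the effect described above.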

One of the great benefits of VSS 'copy-on-write differential' shadows is that they can simply be discarded / deleted when the backup completes - because the shadow storage only contains copies of the old contents of blocks that have since been overwritten, and these are no longer relevant as soon as the backup completes.

.

Now, we also need to consider exactly where VSS decides to place the 'shadow copy', and why a 'regular' Windows VSS shadow appears to 'complete' within a few minutes. Remember, this regular Windows VSS shadow does not have to contend with being intensively read by a backup application. So, I think, you are comparing apples with oranges, and not quite being fair. Anyway, this is the tricky point... coming up next...

VSS is effectively a nested structure of '1:n' relationships, i.e. one-to-many relationships... top to bottom...

- one 'writer' can make use of one or more 'providers'

- one 'provider' can make use of one or more 'volumes'

- one 'volume' can contain one or more 'shadowstorage' spaces

- one 'shadowstorage' space can contain one or more 'shadows'.

...

Hence, when we think we have VSS issues, an admin should *always* run all five VSS list commands, to check for errors, and to check that all five layers/levels/components of VSS are working correctly and responding and do not 'hang'...

...so a good 'admin' will always issue these five commands, even if for no other reason than to be sure that the VSS command line interface does not hang during its system calls to the VSS kernel API routines...

vssadmin list writers

vssadmin list providers

vssadmin list volumes

vssadmin list shadowstorage

vssadmin list shadows

(...and I dare say that one can learn a lot from experimenting with the list/show type commands in the more advanced 'diskshadow' utility...)
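For anyone who wants to script the five list commands with a hang check, a minimal sketch follows. It assumes it is run from an elevated prompt on the Windows host; the function names and the 60 second timeout are illustrative choices, and a non-zero exit code is only reported, not interpreted, so the command output still needs to be read by a person.

```python
import subprocess

VSS_CHECKS = ["writers", "providers", "volumes", "shadowstorage", "shadows"]


def run_check(command, timeout_seconds=60):
    """Run one command and classify the outcome without interpreting it."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return "hung", ""            # no answer within the timeout
    except FileNotFoundError:
        return "missing", ""         # e.g. not a Windows host
    status = "ok" if result.returncode == 0 else "nonzero-exit"
    return status, result.stdout


def check_vss(timeout_seconds=60):
    """Run all five 'vssadmin list ...' commands and collect their status."""
    report = {}
    for item in VSS_CHECKS:
        status, _output = run_check(["vssadmin", "list", item], timeout_seconds)
        report[item] = status
    return report


if __name__ == "__main__":
    for item, status in check_vss().items():
        print(f"vssadmin list {item}: {status}")
```

Any entry reported as "hung" points at the layer that is not responding, which is exactly what running the commands by hand is meant to reveal.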

...anyway, what I wanted to get to was... 'shadows' have to live somewhere, and if a Windows administrator has not pre-defined which volume is to be used as shadowstorage space for any given volume, then Windows will do its best to pick somewhere - and here it gets fun: Microsoft have so constantly and repeatedly changed the default behaviour of VSS across different versions of Windows, and different releases and different service packs and different Windows updates, that it is not easily possible to predict which volume Windows / VSS will select as shadowstorage space.

When a VSS copy-on-write differential (delta change log based) shadow copy is created for us by Windows when NetBackup bpfis calls VSS - then we have to ask ourselves... where is this delta change log created? (And remember this delta change log is essentially nothing more than a file (albeit hidden) on disk.) Windows will sometimes place this delta change log on the very volume being protected/shadowed - and so the backup's IO load is compounded by VSS's own IO on the very same volume!

Whereas I suspect that Windows is smart enough to place the shadow for a 'full clone' on a separate volume. So, once again we're comparing apples with oranges... i.e. the two different VSS 'methods' will (I suspect, but unconfirmed) select different volumes - i.e. the VSS full clone will effectively choose an alternate volume (and thus probably be more performant), whereas a copy-on-write shadow will probably select the same volume as the source data (and thus be less performant), and this is probably simply because it is expected to be discarded very soon (i.e. as soon as the backup completes).

Good Windows and backup admins will usually have permanent separate volumes on large file servers - simply for use as 'shadow storage' space during backups - and never (ever!) use this 'apparently' spare capacity for anything else. They also manually configure VSS on each volume to use these specific 'other' volumes as shadow space - to ease and distribute IO loads (remember one write IO actually incurs three IOs, with the ever-present possible demand for a later read for a backup).
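As a hedged sketch of that manual configuration: on Windows Server, `vssadmin add shadowstorage /for=<volume> /on=<volume> /maxsize=<size>` associates a source volume with a shadow storage volume. The drive letters (`D:`, `E:`, `S:`) and the `20GB` limit below are purely hypothetical, and the helper only builds and prints the command lines so they can be reviewed before anyone runs them from an elevated prompt.

```python
def shadowstorage_commands(mapping, max_size="UNBOUNDED"):
    """Build one 'vssadmin add shadowstorage' command per source volume.

    mapping: source volume -> dedicated shadow storage volume
             (hypothetical drive letters, adjust for the real server).
    """
    commands = []
    for source, target in sorted(mapping.items()):
        commands.append([
            "vssadmin", "add", "shadowstorage",
            f"/for={source}",
            f"/on={target}",
            f"/maxsize={max_size}",
        ])
    return commands


# Hypothetical layout: data volumes D: and E: both use S: as shadow storage.
for command in shadowstorage_commands({"D:": "S:", "E:": "S:"}, max_size="20GB"):
    print(" ".join(command))
```

Afterwards, 'vssadmin list shadowstorage' should show which volume each association actually points at.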

HTH.

sdo · ‎04-10-2015

The other thing to remember is that any slowness of a VSS backup probably has nothing to do with NetBackup per se... because NetBackup simply leverages Microsoft's own APIs which 'backup' the Shadow Copy Components (SCC:) and/or System State (SS:). i.e. every backup software vendor will leverage the same MS API routines, even MS' own backup products.

To be fair, VSS backups of SCC: and SS: should be fairly fast, so if you find that yours are not very fast... then I have to ask, have you tried using MS' own backup tools to take a backup of the SCC: and/or SS: to see how fast it is outside of NetBackup? Also, a typical cause of slow backups is not having disabled AV scan-on-backup or scan-on-read for NetBackup processes. Remember, AV targets IO being performed by 'executables' (i.e. .exe programs) being run by processes (and services). So, it is best practice to disable A/V scanning of IO being performed by typical NetBackup Client related executables such as bpbkar32, bpfis, bpbrm, bpcd, bpinetd, vnetd, mtstrmd, nbostpxy, tar32 etc.

VOX

Netbackup Shadow Copy component backup is too slow compared to VSS