Vault Duplication

rsm_gbg
Level 5

Hi,

I started using Vault Duplication to offsite tapes (all LTO4 tapes).
It seems it either doesn't compress the data onto the duplication tape, or for some other reason doesn't fill it and just continues onto another offsite tape.

I run a normal backup to local tape during the night; it's about 1 TB.

Vault does a duplication of today's data.
It starts duplicating onto one tape, but instead of filling it, it starts another duplication onto a new tape.
It usually writes about 500-600 GB and then starts a new duplication to a new tape.

They all have the same retention as the daily tape.

Why isn't it filling up the tape?

I went from inline copy to using Vault duplication, and with inline copy there wasn't any problem.

- Roland

28 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified

What is the status of media with only 500-600 GB written to it?

Check with 'bpmedialist -m <media-id>'

To force NBU to fill media before selecting a new tape, change the Maximum Partially Full setting on the Offsite pool to 1.
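To eyeball how full each tape really is, the kbytes column from bpmedialist can be converted to GB with a short awk. This is just a sketch that assumes the summary-line layout shown in the listings in this thread; it runs here against a sample line rather than calling bpmedialist itself.

```shell
# Sketch: convert the kbytes column of a bpmedialist summary line to GB.
# Feed it real output with:  bpmedialist -m DRT120 | awk '...'
# The sample line below is copied from the output posted in this thread.
awk '/^[A-Z][A-Z][A-Z][0-9]/ {                 # media summary line only
        printf "%s: %.1f GB\n", $1, $(NF-1) / 1024 / 1024
     }' <<'EOF'
DRT120  10     90   11/04/2013 15:01  11/04/2013 23:52   hcart   791137984     0
EOF
```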

 

rsm_gbg
Level 5

Hi,

Seems like it is not compressing at all!

DLT017 is the daily local tape, duplicated by Vault to DRT120 and DRT122, starting with DRT020.

The next day's backups continued on the DLT017 tape, so the dates are a bit misleading.

root@ipndms:/nbu/bin# ./admincmd/bpmedialist -m DRT120
Server Host = pnms01
 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
DRT120  10     90   11/04/2013 15:01  11/04/2013 23:52   hcart   791137984     0
               90   12/02/2013 09:57        N/A         FULL SUSPENDED
root@ipndms:/nbu/bin# ./admincmd/bpmedialist -m DRT122
Server Host = pnms01
 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
DRT122  10     23   11/04/2013 23:52  11/05/2013 00:47   hcart    81293600     0
               23   12/02/2013 08:12        N/A         SUSPENDED
root@ipndms:/nbu/bin# ./admincmd/bpmedialist -m DLT017
Server Host = pnms01
 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
DLT017  10    407   10/31/2013 08:59  11/05/2013 01:10   hcart  1239776352     0
              407   12/03/2013 01:10  11/05/2013 16:34  FULL
root@ipndms:/nbu/bin#
 

I have now set the Max Partially Full setting to 1.

- Roland

mph999
Level 6
Employee Accredited
The drives should be using hardware compression. This is controlled by the drive and by the type of the OS path to the drive. For example, on Solaris the drive path might be /dev/rmt/0cbn - the 'c' bit 'tells' the drive to compress. Provided NBU is configured with the correct drive path (and it should be by default), it should compress. NBU actually has no control over hardware compression; if it's not working, it's either a driver, firmware, or drive fault.

NBU also has no control over when a tape is marked as full. As data is written to the tape, the drive detects the 'end' of the tape approaching, and the drive firmware sets a 'tape full' flag in the driver (therefore at OS level). When NBU sends the next block of data to the OS (which actually writes it to tape, not NBU), the tape driver informs NBU to mark the tape as full and load a new one. So setting Max Partially Full won't make any difference to the amount of data on the tape when the tape is marked full.

The other thing to consider is the type of data - some data cannot be compressed well (if at all). An easy example is .jpg (as .jpg is already a compressed format), whereas a .txt file would compress well. So if you filled one tape with .jpg and one with .txt, when both tapes become full, the .txt tape will show as holding more data.
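As a quick sanity check on that 'c' flag, a sketch like this flags device paths that don't request compression. The paths here are hard-coded samples, not probed from the system; substitute the paths NBU is actually configured with (visible in 'tpconfig -d' output).

```shell
# Sketch: on Solaris, the letters after the drive number in /dev/rmt/<n>...
# select behaviour: 'c' = hardware compression, 'b' = BSD semantics,
# 'n' = no rewind on close.  Sample paths below - replace with your own.
for dev in /dev/rmt/0cbn /dev/rmt/1bn; do
    node=${dev##*/}                       # e.g. "0cbn"
    case $node in
        *c*) echo "$dev: compression requested in path" ;;
        *)   echo "$dev: no compression in path" ;;
    esac
done
```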

rsm_gbg
Level 5

I understand that, but that doesn't explain why DLTxxx compresses data to 1.2 TB and DRTxxx can't.

It is the same drives, same robot, same driver, same OS.

It is duplicating DLTxxx to DRTxxx tapes using the same everything.
Shouldn't it manage to cram 1.2 TB onto the DRTxxx tapes?

Using inline copy did this without any problems; duplicating is another method though...

mph999
Level 6
Employee Accredited
You make a good point... ITC makes two exact copies (providing you start with empty tapes) - the exact same data is sent from the buffer to the two tape drives. When you make a duplication, it's a 'back-to-back' restore then backup: bptm reads the data off media A and then backs it up to media B - so, pretty much from the media manager level, it's the same.

From memory, I can't remember if Vault has a software compression option - if it has, and it is enabled, then funny things can happen if you try to compress data twice (it generally gets bigger) - that's the only thing I can think of at the moment. Providing this isn't the case, then I have no answer at the moment. As mentioned, the data is read into the buffer, then sent to the OS to be written to tape - we don't add an extra ...

I'll have a look tomorrow at past cases / Etracks to see if this has been reported previously. The thing that is confusing me at the moment is that NBU can't mark a tape as full on its own; it has to be told to do so by the tape driver. In fact, I'll stick my neck out a bit and say it's impossible for NBU to mark a tape as full by itself. I'll report back...

Marianne
Level 6
Partner    VIP    Accredited Certified

mph999
Level 6
Employee Accredited
Good find Marianne - another possible reason...

I checked a load of Etracks at lunch - nothing even close, that I could find. I will look again using different search terms, in case I missed something. I also looked through previous cases. Many were Vault ejecting 'not full' tapes, but nothing that I could find where only Vault was filling the tapes with less than the expected amount of data. Quite a few cases where the tapes were full with low amounts of data, but none specific to Vault that I could see. I didn't look at all of them, just a few at random, and in all cases the cause was drive firmware or faulty drives.

As a matter of interest, what is the OS?

I'll have a think as to how to investigate. I think we would need a very planned test: get a new Vault profile to duplicate a couple of full tapes onto two empty tapes, so that they should end up as a mirror copy, and hope they don't (i.e. the fault happens). The catalog .f file would probably then be required for the images on the tapes, and maybe read the tapes back with tar (hopefully a unix server) - would need to make sure the tapes are not multiplexed. Would also need to use the same controlled data to fill tapes using ITC. I'll have a think about the details.

watsons
Level 6

There is a way to find out if it's really specific to vault duplication.

Try the same data using SLP duplication; my guess is it would have the same behavior. I don't really think the Vault feature carries a different kind of duplication mechanism, or at least I have not seen a technote about that. If there is indeed some difference, most of us would very much like to know from Symantec.

I am more inclined to see this as something in the tape drive firmware - are we talking about the same tape drive for all the tests carried out?

mph999
Level 6
Employee Accredited

SLP and Vault both use bpduplicate - so there is no difference.

rsm_gbg
Level 5

Sorry, been busy with other stuff.

I will do some tests next week; I get more data over the weekend as we do a full backup of an RMAN directory.

Interestingly, we got a new drive with lower firmware than the other, which is on the latest.

Solaris 10 and LTO4s.

I'll let you know

rsm_gbg
Level 5

Over the weekend we got about 1 TB/day of backup.
Now it compressed a bit: it filled the first tape with 935 GB and then another 87 GB on the next tape.
The original DLT009 got 1.2 TB.

I have changed this parameter as suggested by Marianne:
change Max Partially Full setting on Offsite Pool to 1

It seems DLT009 was written on drive 001, the one with the newer firmware - a normal single-tape backup.
The duplication to DRT125 was written on that same drive, while DLT009 was being read in the drive with the older firmware.

root@ipndms:/nbu/bin# ./admincmd/bpmedialist -m DRT125
Server Host = pnms01
 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
DRT125  10     17   11/09/2013 15:00  11/10/2013 00:00   hcart   935192576     0
               17   12/07/2013 05:04        N/A         FULL SUSPENDED
root@ipndms:/nbu/bin# ./admincmd/bpmedialist -m DRT128
Server Host = pnms01
 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
DRT128  10      2   11/10/2013 00:00  11/10/2013 00:16   hcart    87324064     0
                2   12/07/2013 05:45        N/A         SUSPENDED
root@ipndms:/nbu/bin# ./admincmd/bpmedialist -m DLT009
Server Host = pnms01
 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
DLT009  10    209   11/09/2013 00:05  11/11/2013 04:01   hcart  1206396032     0
              209   12/09/2013 04:01  11/10/2013 16:21
root@ipndms:/nbu/bin#
 

mph999
Level 6
Employee Accredited

Do you have lines like these in the bptm log? Just interested, no worries if not:

00:11:54.634 [31429] <2> write_backup: block position check: actual 22922, expected 22922
00:11:54.634 [31429] <2> write_backup: EOM encountered --- Fragmenting, TWIN_INDEX 0
 
 
Can you run 'bpmedialist -mcontents -m <media id>' on each of the three tapes - grab the output and attach it here, just to prove that what is on the duplicate tapes is exactly what is on the one original tape.
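A sketch of that comparison, assuming the -mcontents output carries one backup id per image (worth verifying the exact label on your NBU version). The ids and /tmp filenames below are made up for illustration; in practice the here-docs would be replaced by the ids extracted from the three tapes.

```shell
# Sketch: verify every image on the original tape landed on a duplicate.
# Replace the here-docs with backup ids pulled from real
# 'bpmedialist -mcontents -m <id>' output.
cat <<'EOF' | sort > /tmp/orig.ids
clientA_1383544800
clientB_1383548400
EOF
cat <<'EOF' | sort > /tmp/dup.ids
clientA_1383544800
EOF
# ids present on the original but missing from the duplicates:
comm -23 /tmp/orig.ids /tmp/dup.ids
```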
 
 

Jim-90
Level 6

I wonder if "preserve multiplexing" in Vault would have any influence?

I'm careful about changing "partially filled" tapes. I've seen backups queued because of this.

 

rsm_gbg
Level 5

Hi Guys and Girls,

I've been busy with other stuff and haven't been able to pay the backups enough attention.
I cleared up the other issues and can now work more closely on this.

My main issue with the backups is that they are too slow, so I run out of time in my 24-hour day.
It could be caused by several issues; I have aired a few here on the forum and got really good help.
My backup strategy is as follows.
I got 2 tape labels:
DLTxxx, which are local tapes sitting in the robot at all times.
DRTxxx, which are remote tapes going offsite every day.
Daily backups start at midnight and, when they are quick, end about 6am.
At 3pm I start my Vault session, which duplicates the daily tapes onto remote tapes that get vaulted.
Sometimes the daily backups are just so slow that they don't finish until after 3pm; then my backup window is reached, which results in failed backups (196: client backup was not attempted because backup window closed).
Sometimes duplication is so slow that it runs past midnight and starts interfering with the normal daily backups.

I got a Sun SL48 robot with 2x LTO4 SCSI drives.
Both drives have the latest firmware, and so does the robot.
One of the drives, /dev/rmt/1, has been replaced; I have a case open to get the other drive replaced, but need some evidence that it needs to be replaced.
I have replaced all tapes with new ones, so the oldest is about 3 months old.
I got a master server running in a Solaris LDom and a separate media server running Solaris 10 x86.
Both are well equipped in terms of CPU and memory.

I managed to get the backups to go to both drives without each client multiplexing, so one client goes to one drive and another client to the other at the same time.
I can't see any real difference in speed going to one drive or the other.

I've tested the network and it shouldn't be a problem.

Outstanding issues with my backups.

1. It seems duplication from local tapes to remote tapes doesn't compress as well as the daily backups do.
When the daily tapes are marked full they hold about 1.3 TB;
remote tapes only get about 800-900 GB before being marked full.

2. Duplication is very slow some days; my backup data is about 800 GB, 1.2 TB on Saturdays.
On a good day the duplication takes about 3 hours, on a bad day 10 hours.
The worst I've seen is a 16-hour duplication.

3. From another thread:
https://www-secure.symantec.com/connect/forums/how-calculate-sizedatabuffers
I still have big wait and delay times:
11/28/2013 00:42:22 - Info bptm (pid=28893) waited for full buffer 84863 times, delayed 88347 times
But it varies; some days are worse than others, and clients can be quick one day and slow the next.

I think it boils down to the unreplaced drive; because everything is so random, it's the only explanation.
But I need some proof that it needs to be replaced (errors in a log).
I'm going to do some manual tar tests to the drive to see if I can get it to complain.
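On the buffer waits in point 3, one thing I'm considering is the data-buffer touch files on the media server. A sketch with common LTO4 starting values - these numbers are an assumption on my part, not something from this thread, and a larger SIZE_DATA_BUFFERS also changes the tape block size, so test before and after:

```shell
# Sketch: tune bptm data buffers via NBU touch files on the media server.
# Values are common LTO4 starting points, not a recommendation.
CFG=${CFG:-/usr/openv/netbackup/db/config}     # NBU config dir (override CFG to dry-run)
mkdir -p "$CFG"
echo 262144 > "$CFG/SIZE_DATA_BUFFERS"         # 256 KB tape block size
echo 64     > "$CFG/NUMBER_DATA_BUFFERS"       # buffers per drive
cat "$CFG/SIZE_DATA_BUFFERS" "$CFG/NUMBER_DATA_BUFFERS"
```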

- Roland

rsm_gbg
Level 5

Yes I have!

1562 lines with "block position check":
log.112813:04:35:51.762 [5595] <2> write_data: block position check: actual 3471627, expected 3471627
log.112813:04:37:07.163 [5595] <2> io_terminate_tape: block position check: actual 3476424, expected 3476424
log.112813:04:39:16.271 [5711] <2> write_data: block position check: actual 3448452, expected 3448452
log.112813:04:44:54.055 [5711] <2> write_data: block position check: actual 3467490, expected 3467490
log.112813:04:52:14.965 [5711] <2> write_data: block position check: actual 3495477, expected 3495477
log.112813:04:57:30.724 [5711] <2> write_data: block position check: actual 3514438, expected 3514438
log.112813:05:02:33.793 [5711] <2> write_data: block position check: actual 3526295, expected 3526295
log.112813:05:07:28.153 [5711] <2> write_data: block position check: actual 3537221, expected 3537221
log.112813:05:12:35.046 [5711] <2> write_data: block position check: actual 3543504, expected 3543504
log.112813:05:17:30.839 [5711] <2> io_terminate_tape: block position check: actual 3555096, expected 3555096

root@pnms01:/usr/openv/netbackup/logs/bptm# grep 'EOM encountered' *
log.111913:04:29:42.061 [9351] <2> write_backup: EOM encountered --- Fragmenting, TWIN_INDEX 0
log.112113:03:01:04.783 [29439] <2> write_backup: EOM encountered --- Fragmenting, TWIN_INDEX 0
log.112313:00:20:47.103 [20016] <2> write_backup: EOM encountered --- Fragmenting, TWIN_INDEX 0
log.112313:20:52:34.948 [21085] <2> write_backup: EOM encountered --- Fragmenting, TWIN_INDEX 0
log.112513:05:12:27.015 [22901] <2> write_backup: EOM encountered --- Fragmenting, TWIN_INDEX 0
root@pnms01:/usr/openv/netbackup/logs/bptm#
 

mph999
Level 6
Employee Accredited
Hi Roland,

Thanks for the update - I've been meaning to chase up this thread...

You're not going to get errors in a log for the drive without specialist software, which you're not likely to have as it's very expensive (and if you had it, you would have said so). The reason is that there are no errors - or at least none that what we have (OS + NetBackup) is going to detect. As I mentioned previously, it is the tape firmware that flags the tape as full - nothing else. This should be enough to get the drive swapped; I just can't think of any other possibility. What I will do is run this past one of our really, really 'all things tape' guys and see if they can think of anything else.

NBU doesn't have the ability to mark a tape full itself - or, put another way, if it doesn't receive the 'tape full' flag from the tape driver (which gets it from the firmware) then, believe it or not, it would write to the same tape forever.

I'll have a think about the comment Jim made about mpx - I don't see this making a big difference (if any). The backup headers are in a different 'place' but they are still 1k each (as far as I know).

M

rsm_gbg
Level 5

We occasionally get OS errors from the tape as well as NBU log errors.

The last OS error last week pointed to a specific tape, so I was told to exchange that tape first before they consider replacing the drive; I'm just waiting for a new error.
From /var/adm/messages, NBU errors logged by the OS:
Nov 25 01:42:32 pnms01 bptm[19833]: [ID 895065 daemon.warning] TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive HP.ULTRIUM4-SCSI.001 (index 1), Media Id DLT003
Nov 25 01:42:32 pnms01 bptm[19833]: [ID 356200 daemon.crit] TapeAlert Code: 0x04, Type: Critical, Flag: MEDIA, from drive HP.ULTRIUM4-SCSI.001 (index 1), Media Id DLT003
Nov 25 01:42:32 pnms01 bptm[19833]: [ID 436935 daemon.crit] TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive HP.ULTRIUM4-SCSI.001 (index 1), Media Id DLT003
Nov 25 01:42:32 pnms01 bptm[19833]: [ID 185535 daemon.crit] TapeAlert Code: 0x14, Type: Critical, Flag: CLEAN NOW, from drive HP.ULTRIUM4-SCSI.001 (index 1), Media Id DLT003
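Those TapeAlert lines can be tallied per drive straight from /var/adm/messages with a short awk; a sketch, run here against two of the sample lines above (point it at the real file to count everything):

```shell
# Sketch: count TapeAlert events per drive from /var/adm/messages-style lines.
# Run against the live file with:  awk '...' /var/adm/messages
awk '/TapeAlert Code/ {
        for (i = 1; i <= NF; i++)
            if ($i == "drive") drive = $(i+1)   # token after "drive"
        count[drive]++
     }
     END { for (d in count) print d, count[d] }' <<'EOF'
Nov 25 01:42:32 pnms01 bptm[19833]: [ID 895065 daemon.warning] TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive HP.ULTRIUM4-SCSI.001 (index 1), Media Id DLT003
Nov 25 01:42:32 pnms01 bptm[19833]: [ID 436935 daemon.crit] TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive HP.ULTRIUM4-SCSI.001 (index 1), Media Id DLT003
EOF
```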

NBU errors are steadily increasing, as seen with your tperr.sh script.

HP.ULTRIUM4-SCSI.000 is /dev/rmt/1, which has been replaced.
HP.ULTRIUM4-SCSI.001 is /dev/rmt/0, which is the unreplaced drive.

Errors File exists ....
DLT010 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT001 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT011 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DLT002 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DLT012 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DLT003 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT014 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
DLT005 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
DRT010 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT001 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DLT016 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 5)
SCR004 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT110 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT008 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 18)
DRT021 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT018 has had errors in 2 different drives   (Total occurrences (errors) of this volume is 4)
DRT031 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 5)
DRT113 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
DRT014 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
NLT002 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT115 has had errors in 2 different drives   (Total occurrences (errors) of this volume is 3)
DRT106 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT016 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
NRT001 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
NRT005 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
MRT006 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
NRT006 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
NRT008 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
NRT009 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)


HP.ULTRIUM4-SCSI.000 has had errors with 13 different tapes   (Total occurrences (errors) for this drive is 33)
HP.ULTRIUM4-SCSI.001 has had errors with 18 different tapes   (Total occurrences (errors) for this drive is 54)

The drive with increasing errors is, not surprisingly, HP.ULTRIUM4-SCSI.001.

- Roland

mph999
Level 6
Employee Accredited
Ahh OK, we have errors - that is good... I hope you found tperr.sh useful; I'd forgotten about that, not used it for a while.

Some of the errors may be 'cleaning alerts' - I edit those out as I don't class them as real errors... that may bring the numbers down a little. As a general rule, and as tperr is based simply on stats, I usually start with 2 errors per tape per drive as average... it depends on exactly what the errors are, but you have to start somewhere. I then compare the number of errors per drive against the other drives, and the same for tapes, to see if anything stands out - e.g. DLT008 doesn't seem too happy. In this case, as you have already mentioned, I'd look for increases in the errors, so monitor over a period of time.

I spoke to one of my colleagues today as promised. First response was 'the drive determines the end-of-media', as I expected it would be. We still cannot come up with a reason there is a difference between the dup types (inline copy and vault) - very odd, but still the overriding factor is that NBU cannot set a tape Full itself.
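For monitoring over a period, a sketch that pulls the outlier tapes out of tperr.sh-style output; the threshold of 4 is arbitrary, and the sample lines are from the listing above:

```shell
# Sketch: list tapes whose total error count exceeds a threshold,
# from lines shaped like the tperr.sh output posted above.
awk '/Total occurrences/ {
        n = $NF; sub(/\)/, "", n)          # strip the trailing ")"
        if (n + 0 > 4) print $1, n         # threshold of 4 is arbitrary
     }' <<'EOF'
DLT016 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 5)
DLT008 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 18)
DLT003 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
EOF
```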

rsm_gbg
Level 5

I'm trying to find a way of getting some stats.

To me, the only thing we can see on a regular basis is that duplication is slow or backups are slow.
We also see that duplication is not compressing well enough.

This theory holds IF we see the slow duplication or slow backups when we use the non-replaced drive.
All of this could be derived using logs and bpdbjobs, but it is very cumbersome finding a way of seeing what I want.

Do you have an easy way of getting these columns:

Policy Type Schedule Copy DSTMEDIA tapedrive
i.e.
"IPND_papp02" "Backup or Duplication" "Daily" "1" "DRT101" "/dev/rmt/0cbn"

Then I could maybe see some data pointing to /dev/rmt/0 as being slow at reading or writing or something.
But as I see it, it's not easy getting this output, especially the tape drive.
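In the meantime, one rough proxy I can compute is per-drive write speed from the "block position check" lines already in the bptm log: the position counter advance divided by the wall-clock gap gives an approximate rate. A sketch, assuming the counter is in 64 KB tape blocks (adjust if SIZE_DATA_BUFFERS is set) and using two of the lines posted earlier, with the grep filename prefix stripped (e.g. via grep -h):

```shell
# Sketch: rough MB/s between consecutive 'block position check' bptm lines.
# Assumes 64 KB blocks and timestamps on the same day.
awk -F'[ :,.]+' '/block position check/ {
        t = $1*3600 + $2*60 + $3          # HH:MM:SS -> seconds
        b = $12                           # "actual" block position
        if (have) printf "%.1f MB/s\n", (b - pb) * 64 / 1024 / (t - pt)
        have = 1; pt = t; pb = b
     }' <<'EOF'
04:39:16.271 [5711] <2> write_data: block position check: actual 3448452, expected 3448452
04:44:54.055 [5711] <2> write_data: block position check: actual 3467490, expected 3467490
EOF
```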

 

Also, what should I do to reset the stats that tperr shows, so I start again from 0 errors?