I have a query around Vault, I'm sure this has been asked several times already, but no harm in asking again...
We have configured our backups of various SQL Server databases to be written to a disk storage unit. We then use Vault to duplicate these backups to tape. Many of the backups, particularly of transaction logs, are very small, typically 1-2 KB, and on a typical day there can be around 2000 of these small backups.
When the Vault job runs, I've noticed in the Activity Monitor that it does a lot of processing before reading each backup and copying it to tape, and this processing takes several seconds to complete each time. This gives us 2 problems:
Is there any way to speed this up? Can the processing steps the Vault script is running be removed or simplified?
More info please... The reason for my questions is that there may be other, more efficient ways to configure the duplication and ensure better throughput.
Vault creates batches based on the retention levels used in the images to be duplicated.
There may be other ways to create these batches, depending on whether too many or too few images are grouped into a batch and on the number of tape drives available.
What type of Disk STU?
Basic? Advanced? MSDP? OST?
How many tape drives in the tape (Media Manager) STU?
How many tape drives in use during Vault duplication?
Are the disk STU and tape STU attached to the same or different media servers?
If the same media server, are disk and tape attached to the same or different HBAs?
What are the buffer settings on the media server(s)?
Do you have a bptm log on the media server(s) to see the result of the buffer config?
Can you share a recent vault session log (upload as .txt file) to see how batches are created?
One thing that is almost always worth looking at is the file system handling all these small image files and, of course, the underlying hardware.
Things like fragmentation and file system read cache/prefetch are usually good places to start.
I'm pretty sure there are some technotes about improving the performance of disk storage units in general and some about handling many small images. Maybe something like what was called MAX_ADD_FILES could help if the problem is updating the NetBackup catalog.
This behavior that you mentioned:
"it does a lot of processing before reading each backup and copying it to tape"
Does it show in the job details like this?
09/05/2016 10:36:27 - Info bpbkar (pid=23589) 15000 entries sent to bpdbm
09/05/2016 10:37:10 - Info bpbkar (pid=23589) 20000 entries sent to bpdbm
If so, try using this touch file to improve it:
Thank you all for your replies. Here is the info requested:
1. The original backup images are being written to an MSDP storage pool on a NetBackup 5220 Appliance. The appliance uses the 2 disk shelves that are shipped with it, so they are directly connected to the appliance using SAS.
2. The tape Storage Unit is configured to use 1 concurrent write drive. The other storage units are all configured to use 1 fewer tape drive than the number in the robot. This always ensures that there is 1 tape drive available at all times for Vault.
3. The Duplication settings in the Vault profile are set to use 1 write drive. I've tried changing this to 2 write drives and to also change the Storage Unit to use 2 concurrent write drives, however, vault only uses 1 drive. I guess this is because the original images are on disk and the disk is considered to be 1 read drive.
4. The tape robot and tape drives are directly plugged into the NetBackup 5220 appliance using Fibre Channel (no SAN switch). The tape drives are all LTO 7, but use LTO 6 tape cartridges.
I've copied here a typical trace seen in the Activity Monitor:
09/09/2016 20:00:57 - Info bpdm (pid=9707) completed reading backup image
09/09/2016 20:01:03 - Info bpdm (pid=9707) using 64 data buffers
09/09/2016 20:01:07 - begin reading
09/09/2016 20:01:08 - Info bptm (pid=9690) waited for full buffer 0 times, delayed 0 times
09/09/2016 20:01:10 - end reading; read time: 0:00:03
09/09/2016 20:01:11 - Info bpdm (pid=9707) completed reading backup image
So we can see:
When you have 4000 images to duplicate, a wait of up to 10 seconds before duplicating each image is substantial: 4000 x 10 = 40,000 seconds, which is about 667 minutes or roughly 11 hours.
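To make that estimate easy to repeat with other numbers, here is a minimal Python sketch that reads the trace lines posted above, works out the gap before each step, and then projects the total. It assumes the timestamp format is MM/DD/YYYY HH:MM:SS; the function names are made up for illustration and are not part of NetBackup.

```python
from datetime import datetime

# Trace lines copied from the Activity Monitor excerpt in this thread.
TRACE = """\
09/09/2016 20:00:57 - Info bpdm (pid=9707) completed reading backup image
09/09/2016 20:01:03 - Info bpdm (pid=9707) using 64 data buffers
09/09/2016 20:01:07 - begin reading
09/09/2016 20:01:08 - Info bptm (pid=9690) waited for full buffer 0 times, delayed 0 times
09/09/2016 20:01:10 - end reading; read time: 0:00:03
09/09/2016 20:01:11 - Info bpdm (pid=9707) completed reading backup image
"""


def parse_trace(text):
    """Return (timestamp, message) pairs. Assumes 'MM/DD/YYYY HH:MM:SS - message'."""
    events = []
    for line in text.splitlines():
        if " - " not in line:
            continue  # skip anything that does not look like a trace line
        stamp, message = line.split(" - ", 1)
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
        events.append((when, message.strip()))
    return events


def gaps(events):
    """Seconds elapsed between each event and the one before it."""
    return [
        (int((cur[0] - prev[0]).total_seconds()), cur[1])
        for prev, cur in zip(events, events[1:])
    ]


def projected_hours(image_count, seconds_per_image):
    """Total time in hours if every image costs the same number of seconds."""
    return image_count * seconds_per_image / 3600


events = parse_trace(TRACE)
step_gaps = gaps(events)
# Time from one "completed reading backup image" line to the next.
cycle_seconds = int((events[-1][0] - events[0][0]).total_seconds())

for seconds, message in step_gaps:
    print(f"+{seconds:>2}s  {message}")
print(f"one image cycle: {cycle_seconds}s")
print(f"4000 images at 10s each: {projected_hours(4000, 10):.1f} hours")
```

Running the same script against a longer job log would show whether the delay is consistent across images or only affects some of them.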
Is there anything that can be done to speed this up? Would using SLP be better?
SLP may be better. There are various settings that can be changed to alter the way images get batched together. This can, for example, prevent many small duplication jobs from running, and instead run one job containing many images. It's not always the simplest thing to tune and can take a bit of trial and error. I don't believe this is possible with Vault.
However, SLP only works going forwards: you cannot transfer already-written images to SLP, only new ones that are written to the SLP.
You could use a combination, SLP to make the duplications, and then Vault to eject the tapes.
Well.. I would question one thing:
Where did the 4000 images come from? Are they backlogs (due to slow duplication) or do you really have 4000 images to be vaulted every week or so?
If they're a backlog, the duplication has been running too slowly, so you need to improve the I/O to the tape storage. Rehydration from MSDP could be one key factor; you might need to open a support case to pinpoint where you can improve.
If it's indeed a weekly routine, see if you can break it down into 2 vaults (each with its own profile). Two vaults can run in parallel (but two profiles in the same vault cannot), provided you have sufficient tape drives. Each vault chooses different images based on its "select backup" criteria, and if performance is an issue, consider running them in different windows.
SLP is better in terms of handling the batches and is more automated. It just does not "mark the tape offsite" for you, but this is trivial: you can simply get SLP to duplicate, and then use a vault profile just to do the "eject" to take it offsite.
Watsons makes an excellent point (as usual): if the actual duplication performance is slow, even using SLP is not going to improve things.
Duplicating from MSDP isn't always the fastest way of doing things, simply because MSDP has to rehydrate the data, and there is a limit to how fast this can happen.
Perhaps you could give some idea of how fast a duplication is happening in MB/s?
One more question - what is the software version on the Appliance?
The 5220 hardware is quite old. Have NBU/Appliance patches and version upgrades been kept up to date?
There have been vast MSDP improvements over the years with each new NBU/Appliance version.
Thank you for your replies and help. Here's the information you asked for:
I've been monitoring it and noticed that poor performance from the LUN containing the catalog was not helping matters, so I moved the catalog LUN to an all-flash SSD storage device. This has sped Vault up a bit, but not sufficiently. We're still seeing it take around 15 seconds end to end to duplicate a 1 or 2 KB SQL transaction log to tape. There are no other jobs running on the NetBackup Appliance at the time Vault is run.
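To put that in the MB/s terms asked about earlier, here is a rough Python sketch using only figures from this thread: a 2 KB image, about 15 seconds per image, and around 2000 small backups a day. It assumes every image takes the full 15 seconds, which is a simplification, and the function names are purely illustrative.

```python
def effective_throughput_mb_s(image_bytes, seconds_per_image):
    """Effective rate in MB/s (1 MB = 1024 * 1024 bytes), overhead included."""
    return image_bytes / (1024 * 1024) / seconds_per_image


def daily_duplication_hours(images_per_day, seconds_per_image):
    """Hours needed if every image costs the same number of seconds."""
    return images_per_day * seconds_per_image / 3600


IMAGE_BYTES = 2 * 1024   # a 2 KB transaction log backup
SECONDS_PER_IMAGE = 15   # observed end-to-end time per image
IMAGES_PER_DAY = 2000    # approximate daily count of small backups

rate = effective_throughput_mb_s(IMAGE_BYTES, SECONDS_PER_IMAGE)
hours = daily_duplication_hours(IMAGES_PER_DAY, SECONDS_PER_IMAGE)

print(f"effective throughput: {rate:.6f} MB/s")
print(f"time for {IMAGES_PER_DAY} images: {hours:.1f} hours")
```

At these sizes the data transfer itself is negligible, so almost all of the elapsed time is fixed per-image overhead rather than disk or tape speed.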
I know the 5220s are now quite old; however, as there is nothing else running, I don't think it is short of processing power.
I'm struggling a bit now to see how else this could be sped up. Or is this a more fundamental problem with how the Vault code has been written?