05-01-2017 10:49 PM
NBU 7.7.3 / SL500 tape library
how does one isolate where the slowness is when it comes to backing up to tape?
05-01-2017 11:32 PM
Probably neither of the two.
NetBackup sends data to the tape drive as fast as bptm receives data from the client (backup) or bptm read process (duplication).
Tape drive writes as fast as it receives data. Modern tape drives can write in excess of 100MB/sec.
To troubleshoot performance, you firstly need to test the speed at which data can be read from disk (backup or duplication).
Then test the transport from disk to media server - network in case of backup, SAN in case of local duplication on media server.
The Performance Tuning Guide explains methods to find the bottleneck.
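As a first step in that method, a rough sequential read test with dd is often enough to rule the disk in or out. A minimal sketch; the scratch file is created here only so the example is self-contained, and on a real system you would instead read a large existing file from the disk under test:

```shell
# Rough sequential read-speed test.
# A scratch file is created so this sketch runs anywhere; on a real
# system, read a large existing file on the disk being tested instead.
# Note: the page cache will inflate the result for a freshly written
# file - on a real test, use a file larger than RAM or drop caches first.
TESTFILE=$(mktemp)
dd if=/dev/zero of="$TESTFILE" bs=1M count=256 2>/dev/null
sync
# Read it back and discard the data; GNU dd prints throughput on stderr.
dd if="$TESTFILE" of=/dev/null bs=1M
rm -f "$TESTFILE"
```

If this raw read speed is already well below what the tape drive can stream, no amount of NetBackup tuning will help.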
Please bear in mind that duplication from dedupe disk to tape is a whole different ball game, with data that needs to be rehydrated. There are various Veritas TNs, articles, etc. about this topic, as well as requirements for media server dedupe disk read/write speed in the MSDP Admin Guide.
05-02-2017 12:09 AM
Hello,
When discussing performance issues in NetBackup: in my experience, 90% of issues end up being resource problems (NIC, switch, disk, tape library, etc.). NetBackup itself does not have much control over performance.
Regarding your query, from a NetBackup point of view:
The job details or the bptm log show the buffer output if there is any waiting/delay.
Example:
Waited for empty buffer = the client is sending data at the expected speed, but the storage attached to the media server is not able to write the data at the expected speed and flush the buffers. (Tape library issue)
Waited for full buffer = the media server is able to write data at the expected speed, but the client is sending data at low throughput. (Client server / network issue)
Reference : Article URL:http://www.veritas.com/docs/000010401
Regards,
05-02-2017 12:25 AM
hi,
the link says error 404.
anyway, on that "waited for full buffer": i have such an entry, but the image is already on the master/media server waiting to be duplicated to tape. my entries are many. would that indicate that the tape drive is waiting on the master/media server all the time, especially since i see it waited some 57,000 times?
05/02/2017 07:44:43 - requesting resource LCM_master-hcart2-robot-tld-0
05/02/2017 07:44:43 - Info nbrb (pid=3437) Limit has been reached for the logical resource LCM_master-hcart2-robot-tld-0
05/02/2017 09:33:12 - begin Duplicate
05/02/2017 09:33:12 - granted resource LCM_master-hcart2-robot-tld-0
05/02/2017 09:33:12 - started process RUNCMD (pid=13496)
05/02/2017 09:33:13 - requesting resource master-hcart2-robot-tld-0
05/02/2017 09:33:13 - requesting resource @aaaan
05/02/2017 09:33:13 - reserving resource @aaaan
05/02/2017 09:33:13 - awaiting resource master-hcart2-robot-tld-0. No drives are available.
05/02/2017 09:33:13 - ended process 0 (pid=13496)
05/02/2017 09:43:11 - Info bpduplicate (pid=13496) window close behavior: Suspend
05/02/2017 09:43:11 - resource @aaaan reserved
05/02/2017 09:43:11 - granted resource DR0617
05/02/2017 09:43:11 - granted resource HP.ULTRIUM5-SCSI.000
05/02/2017 09:43:11 - granted resource master-hcart2-robot-tld-0
05/02/2017 09:43:11 - granted resource MediaID=@aaaan;DiskVolume=PureDiskVolume;DiskPool=NBU-PureDisk-1;Path=PureDiskVolume;StorageServer=master;MediaServer=master
05/02/2017 09:43:11 - Info bptm (pid=14668) start
05/02/2017 09:43:11 - started process bptm (pid=14668)
05/02/2017 09:43:11 - Info bptm (pid=14668) start backup
05/02/2017 09:43:12 - Info bpdm (pid=14682) started
05/02/2017 09:43:12 - started process bpdm (pid=14682)
05/02/2017 09:43:12 - Info bpdm (pid=14682) reading backup image
05/02/2017 09:43:12 - Info bpdm (pid=14682) using 32 data buffers
05/02/2017 09:43:12 - Info bpdm (pid=14682) requesting nbjm for media
05/02/2017 09:43:12 - Info bptm (pid=14668) media id DR0617 mounted on drive index 0, drivepath /dev/nst1, drivename HP.ULTRIUM5-SCSI.000, copy 2
05/02/2017 09:43:12 - Info bptm (pid=14668) INF - Waiting for positioning of media id DR0617 on server master for writing.
05/02/2017 09:43:12 - begin reading
05/02/2017 10:13:55 - Info bptm (pid=14668) waited for full buffer 57787 times, delayed 115509 times
05/02/2017 10:13:59 - end reading; read time: 0:30:47
05/02/2017 10:13:59 - Info bpdm (pid=14682) completed reading backup image
05/02/2017 10:14:00 - Info bpdm (pid=14682) using 32 data buffers
05/02/2017 10:14:00 - begin reading
05-02-2017 12:36 AM
bptm (pid=14668) waited for full buffer 57787 times, delayed 115509 times
bptm waiting for a full buffer means that, for the backup, the client wasn't sending the data quickly enough. This could be for a variety of reasons: disk read speed, network, etc.
The actual number we're interested in is the number of delays, 115509. With each delay being (I think, from memory) 10ms, we have a total delay of 1155090 ms = 1155 seconds = about 19 mins.
If the backup is 'small' and should have completed in, say, an hour (with no delays), then 19 mins is significant. If the backup was massive and would take a week to complete, then 19 mins is less significant.
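That arithmetic can be scripted against the bptm log line. A minimal sketch; the 10 ms figure is the assumed child-delay default, so confirm it against the "io_init: child delay = ..., parent delay = ..." line in your own bptm log:

```shell
# Estimate total delay from a bptm "waited for full buffer" line.
# DELAY_MS=10 is an assumption (the default child delay); check the
# io_init line in your bptm log for the real value on your system.
DELAY_MS=10
LOG_LINE="waited for full buffer 57787 times, delayed 115509 times"
echo "$LOG_LINE" | awk -v d="$DELAY_MS" \
  '{ printf "total delay: %.0f seconds (~%.0f minutes)\n", $(NF-1)*d/1000, $(NF-1)*d/60000 }'
```

For the numbers above this prints roughly 1155 seconds, matching the ~19 minute estimate.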
05-02-2017 12:45 AM
that's what's confusing me. the backup per se takes around 7 hours to complete, overnight.
this post comes from the duplication activity details, not from a backup. so the master/media server is not sending data fast enough to the tape drives?
05-02-2017 12:46 AM
Hello,
Thanks for the update.
I am able to open that article. I just pasted the article content over here for your reference.
Please refer to mph999's suggestion; it refers to the same thing as mentioned in the article.
So in your case it's a client-side performance issue. To isolate the issue, perform the following tests:
bpbkar null test = disk I/O test
SAS test = network performance test
NIC = check the NIC properties on the client.
If it is impacting all clients connected to that media server, then concentrate on the media server network.
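For reference, the bpbkar null test mentioned above is typically run like this. A sketch only: the backup path is a placeholder, and the exact flags should be checked against your NetBackup version's commands reference:

```shell
# bpbkar null test: reads data the way a backup would, but discards it,
# so it measures client-side disk read speed including NetBackup overhead.
# /oracle/backup is a placeholder; substitute a path on the disk under test.
/usr/openv/netbackup/bin/bpbkar -nocont -dt 0 -nofileinfo -nokeepalives \
    /oracle/backup > /dev/null 2> /tmp/bpbkar_nulltest.log
```

Time the run against the size of the data read; if this is slow, the bottleneck is on the client/disk side before the network or tape is even involved.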
Article: 000010401
Problem
Understanding waits and delays for full and empty buffers message produced in NetBackup logs, and also presented in the Activity Monitor for NetBackup 7.1 and later.
Error Message
From bptm log:
08:18:26.087 [5312.6256] <2> write_data: waited for full buffer 24552 times, delayed 1221273 times
From bpbkar log:
08:18:26.072 AM: [5416.6628] <4> BufferManagerLegacySharedMemory::~BufferManagerLegacySharedMemory(): INF - bpbkar waited 64737 times for empty buffer, delayed 66780 times.
Solution
To calculate how much delay was experienced by bptm, we will need the following two lines from bptm log:
08:18:26.087 [5312.6256] <2> write_data: waited for full buffer 24552 times, delayed 1221273 times
00:49:29.035 [5312.6256] <2> io_init: child delay = 10, parent delay = 15 (milliseconds)
Delay and Wait counters will help to determine which process has to wait more often: Data Producer or Data Consumer side of the read/write operation.
During a backup, if the client and the media server are the same system, the bpbkar process on the client that reads the data from the storage (disk) and places it into shared memory segments is the Data Producer. The single bptm process that reads this data from the shared memory segments and writes it to the storage (tape or disk), is the Data Consumer.
If the client and media server are different systems, then the Data Producer and Data Consumer are different. The bpbkar process on the client reads the data from the disk and sends it to the network. The bptm child process on the media server, which reads that data coming from the client and places it into shared memory segments on the media server, is the Data Producer, whilst the bptm parent process, which reads the data from the shared memory segments on the media server and writes the data to the destination storage (tape or disk), is the Data Consumer.
The Data Producer needs an empty buffer, while the Data Consumer needs a full buffer. So, the log events above are from the Data Consumer bptm side.
If the Data Consumer has to wait for a full buffer, it increments the wait and delay counters to indicate that it had to wait for a full buffer. Then the Data Consumer waits the defined number of milliseconds ( which for the Data Consumer is the value from the bptm log output => parent delay = 15 (milliseconds)) before it checks again for a full buffer. If a full buffer is still not available, the Data Consumer increments the delay counter to indicate that it had to wait (delay) again for a full buffer, and waits the prescribed number of milliseconds ( parent delay = 15 (milliseconds)) before rechecking the buffer. The Data Consumer repeats the delay and full buffer check steps until a full buffer is available.
Analysis of the wait and delay counter values that are recorded in the bptm log and also seen in the activity monitor events for the job (in NetBackup 7.1 only and later versions), indicates which process, Data Producer or Data Consumer, has had to wait most often and for how long.
Below is a calculation of how much delay is experienced by the bptm process shown in the bptm output example above:
Total delay = how many times bptm was delayed x parent delay (milliseconds) = 1221273 x 0.015 = 18319 secs = 305 mins = over 5 hours
To find out how to address this problem using the following configuration files, and how these processes (Producer and Consumer) work for backups and restores, please read the NetBackup Planning and Performance Tuning Guide for additional information:
SIZE_DATA_BUFFERS
NUMBER_DATA_BUFFERS
PARENT_DELAY
CHILD_DELAY
05-02-2017 12:57 AM
VOX added a space to the end of the URL.
Try this: http://www.veritas.com/docs/000010401
Or else Google (or use Handy NBU links in my signature) to locate the Performance Tuning Guide.
As I have said previously - duplication from MSDP is a whole different ball game.
You firstly need to check that the disk used for MSDP conforms to I/O requirements (as per MSDP guide), ensure that tape and disk are zoned/attached to different HBAs, preferably on different PCI cards, and that you read every available article about MSDP rehydrate and duplication and SLP tuning.
05-02-2017 12:57 AM
Hello,
What is size of data in backup selection?
How many small files and sub folders in backup selection?
In Job detail, how much average bandwidth are you getting?
Please sent the print screen of job detail from NBU console.
Regards,
05-02-2017 01:09 AM
hi,
during backup, the backup jobsize is around 2.5TB
i wouldn't know the number of files or folders since it's an RMAN backup. since we got Veeam, NBU is now backing up Oracle databases only.
the duplication job doesn't show any transfer rate.
the job detail i have already posted earlier. that's from the duplication job.
05-02-2017 01:18 AM
Over here you have seen how the disk storage can affect read or write speed:
https://vox.veritas.com/t5/NetBackup/restore-speed-question/m-p/829771#M225831
05-02-2017 01:21 AM
Hello,
Thanks for sharing this information.
As mentioned by Marianne, when it comes to MSDP duplication to tape, it becomes a different story.
MSDP rehydrates the data and then sends it to the target media server.
Following article may help you to understand and resolve the problem.
https://www.veritas.com/support/en_US/article.000016160
https://www.veritas.com/support/en_US/article.000043063
https://www.veritas.com/support/en_US/article.000082458
Regards,
05-02-2017 01:21 AM
Hi,
the 2.5 TB backup was written to MSDP? How long did it take?
Please tell us:
- OS of master/media/storageserver
- Filesystem of the MSDP pool and its settings. What percentage of the FS is used?
- RAID configuration (if exist)
ciao
tunix2k
05-02-2017 01:32 AM
that's during backup, when reading from the zfs storage.
my problem now is the duplication, where the disk image resides on the master/media server, not in the zfs storage.
05-02-2017 01:36 AM
that 2.5TB took 7 hours to backup.
master/media is RHEL 6.3
the MSDP is using xfs and 41% is used (total is 9TB). this partition is on a SAN storage. raid 5
05-02-2017 01:37 AM
Info bpdm (pid=14682) using 32 data buffers
Increase this value to 256. This is done in [INSTALL_PATH]/netbackup/db/config:
create a file NUMBER_DATA_BUFFERS (no file extension) and write 256 in the text file. This will not solve the issue by itself, but it provides an overall improvement in tape operations.
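The steps above can be sketched as follows. A temp directory is used here only so the example runs anywhere; on a real media server the directory is the real [INSTALL_PATH]/netbackup/db/config:

```shell
# On a real media server this would be /usr/openv/netbackup/db/config;
# a temp directory stands in for it so the sketch is self-contained.
CONFIG_DIR=$(mktemp -d)
# The file has no extension and contains only the number.
echo 256 > "$CONFIG_DIR/NUMBER_DATA_BUFFERS"
cat "$CONFIG_DIR/NUMBER_DATA_BUFFERS"
```

The touch files are read when a job starts, so newly started jobs pick up the change.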
05-02-2017 01:51 AM
I did not imply that the issue is with zfs specifically.
Please read again what I have said:
" .... you have seen how the disk storage can affect read or write speed ..."
I have also referred to MSDP I/O requirements in another comment.
Have you checked and verified?
In older discussions I have pointed you to Performance Tuning of Oracle Backups with MSDP.
Have you investigated and implemented that?
There are also lots of Oracle documentation (and forum posts) w.r.t RMAN performance tuning.
You have now received a LOT of good pointers.
I suggest that you take note of all suggestions, take your time to read up, investigate, and then implement performance tuning advice - one change at a time. Each time documenting the change and results.
05-02-2017 02:22 AM
Hi,
MSDP storage looks fine. As @Marianne said: You have now received a LOT of good pointers.
Additional links
https://www.veritas.com/support/en_US/article.000004178 Perf-tuning-guid NBU 7.5/7.6
https://www.veritas.com/support/en_US/article.TECH167095 Kernel settings Linux Master
https://www.veritas.com/support/en_US/article.TECH1724 Buffer settings
https://www.veritas.com/support/en_US/article.000076306 shared memory recommendations
https://www.veritas.com/support/en_US/article.TECH203066.html example shared memory settings
good luck
tunix2k
05-02-2017 04:09 AM
This will almost certainly be re-hydration from the MSDP pool?
Duplicate the same image from a basic disk and see what the timings are?
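One way to run that comparison from the command line. A sketch only, assuming a copy of the image exists on basic disk; the backup ID and storage unit name are placeholders to be replaced with real values from your environment:

```shell
# Duplicate a single image to the tape storage unit and time it,
# to compare against the MSDP-sourced duplication.
# client1_1493708000 and master-hcart2-robot-tld-0 are placeholders;
# list real backup IDs with e.g.: bpimagelist -hoursago 24
time /usr/openv/netbackup/bin/admincmd/bpduplicate \
    -backupid client1_1493708000 \
    -dstunit master-hcart2-robot-tld-0
```

If the basic-disk duplication streams at full tape speed while the MSDP one crawls, rehydration is confirmed as the bottleneck.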
05-02-2017 05:44 AM
One thing Marianne did NOT mention for speed is logging. NetBackup can generate a ton of logs on both media and master server. I got my catalog and logs onto flash disk and saw an increase in throughput of about 10%, so my 18 hour backup completed 1.8 hours earlier. Your mileage may vary.
I often have multiple jobs running at the same time, and I write to a Data Domain and duplicate to tape at the same time (this two-way data stream is NOT recommended, by the way), so my backup times can vary due to other loads. But over time, I have seen a definite benefit from faster disks for logging.
BUT - I would agree with Marianne - isolate the steps the data flows, and check each one. Find the link that is overloaded or limiting you, and resolve it.