bptm process - causing performance bottleneck

errrlog
Level 4

Noticed a performance drop when a single bptm process spawns multiple duplication jobs.

Problem Description:

ServerA is backed up using the "_Test_XYZ1Day" SLP. As per the SLP config, the backup goes to a disk staging pool (AdvancedDisk pool XYZStagingpool1) and is then duplicated (3 copies) to tape. Throughput of each duplication job to its tape drive drops in direct proportion to the number of duplications per SLP. (NOTE: each duplication job uses a separate tape drive.)

Test Results

  1. SLP _Test_XYZ1Day set with 1x duplication job. The job uses 1 tape drive and runs at 120 - 140MB/s, with 120MB/sec sustained throughput most of the time.
  2. SLP _Test_XYZ1Day set with 2x duplication jobs. The jobs use 2 tape drives and each runs at 80 - 85MB/s, with 83MB/sec sustained throughput per drive most of the time.
  3. SLP _Test_XYZ1Day set with 3x duplication jobs. The jobs use 3 tape drives and each runs at 53 - 58MB/s, with 54MB/sec sustained throughput per drive most of the time.

NOTE: Combined throughput of all the duplication jobs seems limited to 160 - 170MB/sec.

Test Conditions: All of the above test statistics were gathered under the following conditions:

  • Dedicated staging array used as an AdvancedDisk pool (270+ MB/sec sustained read throughput).
  • No other reads/writes on the array.
  • All duplications go to the same tape library (STK SL500 with 5 LTO4 drives).
  • Two single-port HBAs connected to two different fabrics at 2Gbit speeds (3 tape drives on one fabric; 2 drives and the robotic controller on the other).

 

Storage lifecycle policy details:

#nbstl _Test_XYZ1Day -L
                                Name: _Test_XYZ1Day
                 Data Classification: (none specified)
            Duplication job priority: 0
                               State: active
                             Version: 5
Destination  1              Use for: backup
                        Storage Unit: XYZStagingPool1
                         Volume Pool: (none specified)
                        Server Group: (none specified)
                      Retention Type: Capacity Managed
                     Retention Level: 10 (1 day)
               Alternate Read Server: (none specified)
               Preserve Multiplexing: false
                               State: active
                              Source: (client)
                      Destination ID: (none specified)
Destination  2              Use for: duplication
                        Storage Unit: media3-hcart-tld-2
                         Volume Pool: temp_cp_test1
                        Server Group: Any
                      Retention Type: Fixed
                     Retention Level: 10 (1 day)
               Alternate Read Server: (none specified)
               Preserve Multiplexing: false
                               State: active
                              Source: (primary)
                      Destination ID: (none specified)
Destination  3              Use for: duplication
                        Storage Unit: media3-hcart-tld-2
                         Volume Pool: temp_cp_test1
                        Server Group: Any
                      Retention Type: Fixed
                     Retention Level: 10 (1 day)
               Alternate Read Server: (none specified)
               Preserve Multiplexing: false
                               State: active
                              Source: (primary)
                      Destination ID: (none specified)
Destination  4              Use for: duplication
                        Storage Unit: media3-hcart-tld-2
                         Volume Pool: temp_cp_test2
                        Server Group: Any
                      Retention Type: Fixed
                     Retention Level: 10 (1 day)
               Alternate Read Server: (none specified)
               Preserve Multiplexing: false
                               State: active
                              Source: (primary)
                      Destination ID: (none specified)

 

Any ideas would be appreciated.

Thanks.


12 REPLIES

AAlmroth
Level 6
Partner Accredited

Well, the figures you mention seem to correlate well with 2Gbit FC, approx 170-180MB/s.

I think your problem may be that you use all three drives on one HBA port. A simple test here would be to "down" 1-2 drives on the first port and see if the same behaviour occurs when the jobs are forced to run to different ports. If it is an HBA port bandwidth problem, then in this case you should run two duplication jobs, each going to drives on separate ports, and you should get full speed. If not, then perhaps the I/O bus on the media server is the limiting factor.
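
For what it's worth, a minimal sketch of downing and re-upping drives from the command line during such a test (this assumes the standard /usr/openv/volmgr/bin path on the media server; the drive names below are placeholders, so check yours with tpconfig -d first):

# List configured drives and note their names/indexes
/usr/openv/volmgr/bin/tpconfig -d

# Down two of the three drives on the busy HBA port (names are examples only)
/usr/openv/volmgr/bin/vmoprcmd -downbyname HP.ULTRIUM4-SCSI.000
/usr/openv/volmgr/bin/vmoprcmd -downbyname HP.ULTRIUM4-SCSI.001

# ... run the duplication test ...

# Bring them back up afterwards
/usr/openv/volmgr/bin/vmoprcmd -upbyname HP.ULTRIUM4-SCSI.000
/usr/openv/volmgr/bin/vmoprcmd -upbyname HP.ULTRIUM4-SCSI.001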

Also, you don't mention how the disk system is connected. Is it connected using the same HBA ports, or dedicated ones? Even though the I/O direction is different (read from disk, write to tape), there is overhead in handling all the I/O on the HBA, the device driver and the OS kernel. In that case it may help to use a larger I/O block size; the exact configuration varies depending on the OS.

/A

Mark_Solutions
Level 6
Partner Accredited Certified

Hi

If this is not all running over a single HBA (and hopefully it isn't), then when you do more than one duplication at a time (inline copies) you can find that the bptm process handles the data through its buffers in such a way that this behaviour has been seen to occur.

Tuning your data buffers can help - just try different numbers.

As you are using LTO4 drives, the size for best performance in SIZE_DATA_BUFFERS is 262144.

If you are not already using this value, you will need to bplabel any previously used (BUT EXPIRED!) tapes, otherwise NBU will just read the header and change the buffer size to match the old setting.

Then try 16, 32 and 64 in the NUMBER_DATA_BUFFERS file to see what gives you the best performance.

All the buffer files go in netbackup/db/config/ under the NetBackup install path.

32 is usually good but this will vary when using inline copy

Of course you can also tune your DISK buffers which may improve things further
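
As an illustration, a minimal sketch of creating those touch files on a UNIX media server (this assumes the default install path /usr/openv; the values are the ones suggested above):

cd /usr/openv/netbackup/db/config

# Tape buffer size for LTO4 (256KB) and a starting buffer count to test
echo 262144 > SIZE_DATA_BUFFERS
echo 32 > NUMBER_DATA_BUFFERS          # also try 16 and 64

# Disk-side buffer touch files mentioned above
echo 1048576 > SIZE_DATA_BUFFERS_DISK
echo 32 > NUMBER_DATA_BUFFERS_DISK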

Hope this helps

Omar_Villa
Level 6
Employee
Just from seeing that behaviour, it sounds like every drive is under the same HBA. Confirm the zoning or binding of your drives and you will know where you are.

errrlog
Level 4

Thanks guys for the comments.

I want to clarify a few more things:

- 2Gbit port: I am able to get 193MB/sec sustained throughput on a 2Gbit port.

- Most of the tests were done with 2x duplication jobs. I started testing 3x duplication to correlate with the 160 - 170MB/sec bptm limitation.

- During the 2x duplication job tests, I made sure to use tape drives on different HBAs, connected to different fabric switches.

ALL TESTED SCENARIOS:

  • 1x duplication job going to 1x tape drive on HBA1 or HBA2 - throughput 120 - 140MB/sec
  • 2x duplication jobs going to 2x tape drives on different HBA ports - throughput 80 - 85MB/sec on each drive
  • 2x duplication jobs going to 2x tape drives on the same HBA port - throughput 80 - 85MB/sec on each drive
  • 3x duplication jobs going to 2x tape drives on HBA1 and 1x tape drive on HBA2 - throughput 53 - 58MB/sec on each drive
  • 3x duplication jobs going to 3x tape drives on HBA1 - throughput 53 - 58MB/sec on each drive

- The disk is a storage LUN from a StorageTek SAN array dedicated to backup staging. The array is connected using another set of two single-port HBAs. (A bpbkar-to-null test gave 270MB/sec sustained read throughput; see the read-check sketch after this list.)

- All HBAs are PCI-E (64-bit), so we are nowhere near saturating the bus.

- Disk buffers: tested with 128k, 256k, 512k and 1024k. 1024k seems slightly better, but 512k balances out the delayed-buffer waits.

- Number of data buffers: 256 (the server has 32GB RAM, and no more than 200 concurrent jobs run at the same time).

- Tuning: the point to note is that we are getting 120MB/sec sustained to an LTO4 drive, which is an optimal value for LTO4; anything over 120MB/sec is just due to compression.
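
As a side note, a minimal sketch of independently sanity-checking the staging array's raw read rate with plain dd (this only approximates the bpbkar-to-null test mentioned above; /staging/largebackupimage is a placeholder path on the staging filesystem):

# Read ~10GB from a large file on the staging filesystem and time it;
# MB/sec = (count x bs in MB) / elapsed seconds.
time dd if=/staging/largebackupimage of=/dev/null bs=1024k count=10240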

SCENARIO (1): 1x duplication job going to 1x tape drive on HBA1 or HBA2 (processes involved: bptm controlling 1x tape drive, bpdm reading from disk).

I get 120 - 140MB/sec throughput.

SCENARIO (2): same as (1), except 2x duplication copies are forked from a single bptm, writing to different tape drives, on different HBA ports, connected to different fabric switches (processes involved: bptm controlling 2x tape drives, bpdm reading from disk).

I get 80 - 85MB/sec throughput. Ideally I should get the same throughput as in Scenario (1), i.e. 120 - 140MB/sec.

The only change between SCENARIO (1) and SCENARIO (2) is the NUMBER OF DUPLICATION JOBS.

This clearly looks like some kind of bptm process contention, not resource contention.

Mark_Solutions
Level 6
Partner Accredited Certified

Hi

Your disk buffers look quite small and your number of buffers quite high

An example of what I have found works really well in multi-drive LTO4 or LTO5 environments is as follows:

SIZE_DATA_BUFFERS 2162144

NUMBER_DATA_BUFFERS 32

SIZE_DATA_BUFFERS_DISK 1048576

NUMBER_DATA_BUFFERS_DISK 32

Do bear in mind that you need to test this using new tapes, or any available expired tapes will need relabelling to overwrite the header with the new block size.
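
A hedged sketch of what that relabel looks like, plus a quick way to see buffer waits afterwards (A00001 is a placeholder media ID; the hcart density matches the storage units in this thread; the bptm log path assumes the legacy debug log directory already exists, and the log date is a placeholder):

# Relabel an EXPIRED tape so the new block size is written to its header
bplabel -m A00001 -d hcart -o

# After a test duplication, check the bptm debug log for buffer wait/delay counts
grep -i "waited for" /usr/openv/netbackup/logs/bptm/log.041512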

Hope this helps

Omar_Villa
Level 6
Employee
What is the OS of your media server?

errrlog
Level 4

Hi Mark,

I guess you have mistyped that; SIZE_DATA_BUFFERS should presumably be 262144.

I mentioned my figures in kilobytes (you might have missed the trailing 'k'). The value you mention is already among the block sizes I tested (262144 = 256k), and our current SIZE_DATA_BUFFERS is 512k = 512 x 1024 = 524288.

Regarding NUMBER_DATA_BUFFERS: yes, it is high, because we have sufficient RAM in relation to the number of concurrent processes.

Thanks for the suggestions, but I did extensive testing before settling on those numbers.

THE POINTS TO BE FOCUSED ON ARE:

  1. I DO NOT have any PERFORMANCE issues doing BACKUPS, RESTORES or 1x DUPLICATION. In fact we are getting optimal figures.
  2. THE ISSUE only happens when we start using 2 or more duplication jobs.

OS Version: Solaris 10

Server Hardware: SPARC-Enterprise-T5220.

Netbackup Version: 6.5.5
 

sdo
Moderator
Partner    VIP    Certified

This looks like an interesting problem.  Can you clarify (somehow in words) your HBA and SAN and storage topology and zoning?

Omar_Villa
Level 6
Employee

Share the OS you have and we can help you get the proper zoning and binding of the box, just to confirm how your drives are set up. With that we can confirm the drives really are split across the ports; guessing or thinking they might be split is not enough.

 

Regards.

errrlog
Level 4

FABRIC:

  • Two fabrics consisting of two switches (SwitchA and SwitchB).
  • StorageTek 6160 storage array (dedicated to staging) with 2x controllers and 4x ports on each. Each controller runs two connections to each switch.
  • The media server has 4x single-port QLogic 2462 HBAs, 2 connected to SwitchA and 2 connected to SwitchB. One of the connections to each switch is zoned to the tape drives, and the other is zoned to the storage.
  • Tape library with 5x LTO4 drives: 3 drives connected to SwitchA, and 2 drives plus the robotic controller connected to SwitchB. (Library/drives are not shared; we do not use SSO.)

ZONING: we use soft zoning (WWN-based).

OS Version: Solaris 10

Server Hardware: SPARC-Enterprise-T5220.

Netbackup Version: 6.5.5

 

THE POINTS TO BE FOCUSED ON ARE:

IF THERE WERE A PROBLEM WITH THE CONFIG, I WOULD SEE THE ISSUE DURING BACKUPS, RESTORES AND 1x DUPLICATION.

BUT

1. I DO NOT have any PERFORMANCE issues doing BACKUPS, RESTORES or 1x DUPLICATION. In fact we are getting optimal figures.

2. THE ISSUE only happens when we start using 2 or more duplication jobs.

For Example:

MediaServer03 -> hba01 -> SwitchA -> (Tape3, Tape4, Tape5)

MediaServer03 -> hba03 -> SwitchB -> (Tape1, Tape2, Robot Control)
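
For reference, a minimal sketch of how that drive-to-HBA mapping can be double-checked on a Solaris 10 media server (the controller numbers and device names in the output will obviously be specific to your box):

# NetBackup's view of the attached tape devices and their device paths
/usr/openv/volmgr/bin/sgscan tape

# Solaris' view of which controller (and hence which HBA) each drive sits behind
cfgadm -al | grep -i tape
ls -l /dev/rmt/*cbn     # the symlink targets show the physical HBA path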

 

Scenario 1: (2x duplication jobs running at the same time from 2x SLPs)

SLP1 configured with 1x duplication job -> writing to Tape1: I get 120+ MB/sec

Processes involved: (1x bpdm reading from disk at 120MB/sec, 1x bptm writing to tape at 120MB/sec)

SLP2 configured with 1x duplication job -> writing to Tape3: I get 120+ MB/sec

Processes involved: (1x bpdm reading from disk at 120MB/sec, 1x bptm writing to tape at 120MB/sec)

 

Scenario 2: (2x duplication jobs running at the same time from 1x SLP)

SLP1 configured with 2x duplication jobs

Duplication job 1 -> writing to Tape1: I get 80 - 85MB/sec

Duplication job 2 -> writing to Tape3: I get 80 - 85MB/sec

Processes involved: (1x bpdm reading from disk at 80MB/sec, 1x bptm writing to 2x tape drives at 80MB/sec each)

NOTE: bpdm slows down because bptm cannot go beyond 80 - 85MB/sec, and hence bpdm waits for empty buffers.

Mark_Solutions (Accepted Solution)
Level 6
Partner Accredited Certified

I am fully with you on your setup and tuning tests now.

The issue here is that you are using Inline Copy, and the way the bptm process handles it is a known problem.

This tech note explains it better: http://www.symantec.com/docs/HOWTO56160

It may be worth reviewing that to see whether there is a bottleneck anywhere else that could improve things, but the issue stems from using Inline Copy.

Hope this helps

errrlog
Level 4

Thanks Mark for the pointer. 

"Inline Copy (multiple copies) takes one stream of data that the bptm buffers receive and writes the data to two or more destinations sequentially"

Well that explains the issue.
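
As a rough sanity check of that explanation against the numbers above: if a single bptm pushes roughly 160 - 170MB/sec of buffers in total and writes each buffer to its N inline-copy destinations one after another, the per-drive rate is roughly that total divided by N (the 1-copy case is capped by the drive itself instead). A tiny shell illustration (POSIX sh/ksh/bash; 165MB/sec is just the observed aggregate from this thread):

# Rough model only -- per-drive rate when one bptm feeds N inline copies
TOTAL=165
for N in 1 2 3; do
    echo "copies=$N  per-drive ~ $((TOTAL / N)) MB/sec"
done
# copies=2 gives ~82 and copies=3 gives ~55, matching the measured
# 80 - 85 and 53 - 58MB/sec ranges reported earlier in the thread.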