11-02-2012 09:46 AM
I have just installed NetBackup 7.5 on a Solaris 10 server, using NDMP to communicate with a NetApp (running ONTAP 8.1) that is connected to an Overland NEO 2000 series tape library via a SANbox 1400 (I'm waiting for a replacement tape drive, so I only have one tape drive in the library right now). My basic problem is that the backups take an extreme amount of time to complete. The backups seem to start OK (from the NetBackup server's perspective, as seen in the Activity Monitor of the Admin Console), but throughput is only about 19-20MB/s, and most backups fail because the backup windows close before they can finish. When I log into the NetApp and observe the tape writes (using the command "sysstat -x 1"), I see values anywhere from around 35MB/s to 165MB/s, but after several seconds of writes it stops for 14 - 20 seconds before beginning to write to tape again. The following is an example (I apologize for the alignment... I hope it is readable enough to see what I'm seeing):
NetApp> sysstat -x 1
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
3% 1002 0 0 1002 2966 3469 2648 0 0 0 23s 86% 0% - 32% 0 0 0 0 0 0 0
3% 1106 0 0 1112 3141 2580 2492 0 0 0 23s 88% 0% - 31% 6 0 0 0 0 0 0
2% 803 0 0 803 3937 1132 752 0 0 0 23s 93% 0% - 5% 0 0 0 0 0 0 0
16% 895 0 0 895 4769 2705 66792 0 0 50397 39s 85% 0% - 58% 0 0 0 0 0 0 0
19% 976 0 0 976 2808 1951 62504 0 0 80806 39s 89% 0% - 60% 0 0 0 0 0 0 0
22% 1066 0 0 1066 3079 3116 103624 0 0 98763 25s 92% 0% - 65% 0 0 0 0 0 0 0
20% 976 0 0 983 4003 2760 78588 0 0 68944 24s 85% 0% - 66% 7 0 0 0 0 0 0
14% 1037 0 0 1038 2941 1825 33412 0 0 33554 24s 82% 0% - 56% 1 0 0 0 0 0 0
3% 1003 0 0 1003 3897 1796 1100 0 0 0 24s 86% 0% - 20% 0 0 0 0 0 0 0
9% 956 0 0 956 3295 1503 16568 5648 0 0 24s 97% 46% Tf 46% 0 0 0 0 0 0 0
6% 1057 0 0 1057 3027 2202 6724 1886 0 0 24s 89% 100% :f 37% 0 0 0 0 0 0 0
4% 875 0 0 887 2887 2936 2608 14052 0 0 24s 84% 62% : 29% 12 0 0 0 0 0 0
3% 1174 0 0 1177 3373 1843 1660 0 0 0 24s 89% 0% - 18% 3 0 0 0 0 0 0
3% 967 0 0 967 3833 3634 2640 0 0 0 24s 80% 0% - 31% 0 0 0 0 0 0 0
3% 969 0 0 969 3581 1743 1756 0 0 0 24s 90% 0% - 27% 0 0 0 0 0 0 0
3% 1101 0 0 1101 4565 2676 2424 0 0 0 24s 87% 0% - 36% 0 0 0 0 0 0 0
2% 785 0 0 793 2628 1364 1388 0 0 0 24s 90% 0% - 18% 8 0 0 0 0 0 0
2% 931 0 0 931 2438 1817 1836 0 0 0 24s 90% 0% - 23% 0 0 0 0 0 0 0
3% 984 0 0 984 3009 2223 1560 0 0 0 24s 85% 0% - 29% 0 0 0 0 0 0 0
3% 908 0 0 908 2748 2247 2488 0 0 0 24s 89% 0% - 30% 0 0 0 0 0 0 0
4% 1114 0 0 1114 5121 3814 3712 0 0 0 24s 93% 0% - 26% 0 0 0 0 0 0 0
9% 781 0 0 790 2216 2308 18016 8462 0 0 39s 96% 38% Tf 49% 9 0 0 0 0 0 0
5% 896 0 0 896 2315 1418 4248 18888 0 0 39s 91% 100% :f 39% 0 0 0 0 0 0 0
5% 897 0 0 899 4179 3082 4120 11368 0 1114 39s 79% 51% : 22% 2 0 0 0 0 0 0
23% 810 0 0 816 4347 1195 107412 0 0 103416 41s 89% 0% - 75% 6 0 0 0 0 0 0
16% 925 0 0 1002 2840 2742 53688 0 0 54002 41s 85% 0% - 63% 77 0 0 0 0 0 0
18% 999 0 0 1006 3969 2644 69408 0 0 60817 42s 87% 0% - 75% 7 0 0 0 0 0 0
30% 841 0 0 843 2619 2409 168996 0 0 165806 41s 96% 0% - 48% 2 0 0 0 0 0 0
24% 1233 0 0 1233 2909 1819 121868 0 0 118424 41s 91% 0% - 55% 0 0 0 0 0 0 0
16% 973 0 0 973 2706 2364 54840 0 0 49873 25s 83% 0% - 67% 0 0 0 0 0 0 0
6% 1036 0 0 1036 2768 1811 9832 0 0 7209 25s 84% 0% - 22% 0 0 0 0 0 0 0
4% 886 0 0 893 2863 2827 2584 0 0 0 25s 85% 0% - 27% 7 0 0 0 0 0 0
It continues with this pattern: tape writes stop for 16 - 20 seconds, then resume at respectable throughput for a single drive. Can someone please help... this is a real head banger. Can someone please give me some advice or a direction to take at this point?
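To put a number on the stall, one trick is to capture a stretch of the sysstat output into a file and average the tape-write column with awk. The snippet below is a sketch using three rows pasted from the output above; the field position (11) is an assumption that should be checked against your own header line.

```shell
# Three representative rows pasted from the sysstat -x output above
# (one idle tape sample, two active ones):
cat > sysstat.log <<'EOF'
3% 1002 0 0 1002 2966 3469 2648 0 0 0 23s 86% 0% - 32%
16% 895 0 0 895 4769 2705 66792 0 0 50397 39s 85% 0% - 58%
19% 976 0 0 976 2808 1951 62504 0 0 80806 39s 89% 0% - 60%
EOF

# Field 11 is the tape-write kB/s column in this layout (column
# positions can shift between ONTAP releases -- verify against your
# own header line before trusting the numbers).
awk '$1 ~ /%$/ {
    total += $11; n++
    if ($11 == 0) idle++
} END {
    printf "samples=%d idle=%d avg_tape_write_kBs=%.0f\n", n, idle, total / n
}' sysstat.log
```

Collecting a few minutes of samples this way gives a count of zero-write intervals and an effective average rate, which is easier to compare before and after a change than eyeballing the scrolling output.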
Thanks,
DRH
11-02-2012 12:17 PM
This is not a NetBackup problem. It's either a bad tape drive or an I/O issue on the NetApp filer. A bad tape drive usually performs a lot of rewrites (corrected write errors), and that reduces write speed a lot. If that's not the issue, you need to take a look at the filer.
Hint: add the sysstat output as an attachment instead.
11-02-2012 01:57 PM
Does vmstat show anything interesting (i.e. non-zero values for):
r the number of kernel threads in run queue
b the number of blocked kernel threads that are waiting for resources I/O, paging, and so forth
w the number of swapped out lightweight processes (LWPs) that are waiting for processing resources to finish.
sr pages scanned by clock algorithm
How about the USR, SYS and LAT columns of "prstat -mL"?
If you have the DTraceToolkit installed, pfilestat <pid> can show why a process is waiting/delayed.
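A small sketch of how this vmstat check can be automated against captured output ("vmstat 1 60 > vmstat.log" on the master server). The sample rows here are fabricated for illustration, and the assumption that r/b/w are fields 1-3 and sr is field 12 should be verified against the Solaris vmstat header on your system.

```shell
# Fabricated vmstat-style samples: first row is all quiet, second has
# kernel threads queued/blocked (r=2, b=1) and a non-zero scan rate.
cat > vmstat.log <<'EOF'
 0 0 0 2097152 524288 5 20 0 0 0 0 0 0 0 0 0 400 300 200 2 1 97
 2 1 0 2097152 524288 5 20 0 0 0 0 15 0 0 0 0 400 300 200 2 1 97
EOF

# Print only the samples where any of r, b, w or sr is non-zero,
# so a long capture reduces to just the interesting intervals.
awk '$1+$2+$3+$12 > 0 {
    printf "sample %d: r=%s b=%s w=%s sr=%s\n", NR, $1, $2, $3, $12
}' vmstat.log
```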
have a good weekend,
GlenG
11-07-2012 06:08 AM
Thank you Nicolai and GlenG for your comments, tips and help. I do not consider myself an expert on the NetApp in any way (in fact, I feel I have just enough knowledge to be dangerous), so please be gentle... Is there a way to determine, on the NetApp, the number of retries or rewrites it is experiencing on the tape drives? This may just be a suspect tape drive: it has been actively used for 3 - 4 years now, and the second tape drive went south just a few weeks ago. I now need to justify (through statistics, if possible) the purchase of yet another tape drive. I executed the vmstat command on the NetBackup Master Server while backups were active and found that r, b and w are basically 0, with r hitting a 1 or 2 occasionally; sr is pegged at 0. I ran vmstat at 1-second intervals and observed for several minutes. GlenG, I do not have the DTraceToolkit. Can you explain what I might look for in the output of the prstat command you so graciously provided?
Again, thank you for your help!
DRH
11-07-2012 09:25 PM
I am seeing a similar thing - I have an HP MSL4048 library with NDMP coming from a NetApp 3240. Two drives are quick (110MB/sec) and two are slow, at half that rate.
Looking at the sysstat output on the NetApp: for the fast drives it is constantly writing; on the slow drives it writes for a few seconds at good speed, then goes to 0 for a few seconds...
It's not clear to me yet whether this is a tape drive, cable, library, NetApp port or FC switch issue... still investigating...
11-09-2012 02:10 AM
Just to close this out from my end: I had a faulty FC SFP on the NetApp connection. I replaced it, the 0's for seconds at a time have gone away, and all my drives are pumping at full speed now.
11-13-2012 05:58 AM
Thank you, Modena. I replaced the FC SFP on the NetApp, but this made no change in the problem. The backups still toggle between 40MB/s - 140MB/s and then "0" for about 10 seconds, then continue and repeat... It almost appears as though some kind of buffering is taking place. From the NetBackup server, a SunFire V210 running Solaris 10, I trussed the ndmpagent process while running "sysstat -x 1" on the NetApp. The truss output would stop (and sleep) when the tape writes went to "0" and start again when the writes picked up. I also trussed the bpbrm process and noticed that when the tape writes go to "0", a message is written to stdout, "db_end : Need to collect reply", then there is an open("/usr/openv/netbackup/bin/DBMto", O_RDONLY) Err#2 ENOENT. Again, this behavior is the same as with ndmpagent: the process "goes to sleep" at the same time the tape writes go to "0". Is there anything else I can do to track down the problem? So far I've swapped fibre cables and replaced the FC SFP on the NetApp.
Any help would be greatly appreciated.
DRH
11-13-2012 08:24 AM
Can you bypass NetBackup and write directly to tape via a NetApp command? This will help you isolate the issue.
11-14-2012 02:19 AM
Yes, Nicolai's suggestion to use the NetApp "dump" command to back up directly to tape would be a good way to isolate the issue. Details in the URL below:
http://www.wafl.co.uk/dump/
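A sketch of what such a test run might look like when driven from the master server. The filer name, device and volume path are placeholders, and the command is only echoed here so it can be inspected before being run for real against the filer.

```shell
# Placeholders -- substitute your own filer, device and volume.
# In "dump 0fb": 0 = level-0 dump, f = dump to the named tape device,
# b = use the given blocking factor.
FILER=NetApp
DEVICE=nrst0a          # no-rewind tape device on the filer
BFACTOR=128            # blocking factor to experiment with
VOLPATH=/vol/new_vol   # hypothetical volume path

CMD="ssh -l root $FILER dump 0fb $DEVICE $BFACTOR $VOLPATH"
echo "$CMD"            # run it by hand once it looks right
```

If the direct dump sustains good throughput while the NetBackup-driven NDMP backup stalls, the problem is on the NetBackup/NDMP side rather than the drive or fabric.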
If you would like to see whether there is any I/O progress on the NetBackup side, enable the ndmpagent (unified log OID=134) and ndmp (unified log OID=151) logs, plus the bpbrm and bptm logs, on the media server.
On the netapp end, you can look at ndmpd log.
It may not tell you why it's slow, but you may be able to tell how slow it is, or whether there is any connection timeout or buffer overflow (which can cause slow I/O).
11-14-2012 08:55 AM
watsons, Nicolai,
Thank you for your suggestions... As a first test, I ran a remote dump from the Master Server without any options and observed little improvement: around 20MB/s - 30MB/s, with ~10-second pauses in the writes (after several seconds of writing). I then ran a remote dump with a blocking factor of 128. Now, after the first couple of minutes, throughput is fairly steady at between 45MB/s - 65MB/s, with very few breaks in the writes. I believe this throughput is enough for our backups to complete within the 24-hour period. So, how do I change the blocking factor in NetBackup? Also, on a side note, does anyone know how to increase the number of days of jobs shown in the Activity Monitor? Currently only 3 days are visible, but I would like to see a couple of weeks. Is there a downside to increasing this?
Thank you all,
DRH
11-14-2012 09:07 AM
Regarding the latter question (logs in the Activity Monitor):
http://www.symantec.com/business/support/index?pag...
Thanks to Andy for the tip here:
https://www-secure.symantec.com/connect/forums/activity-monitor-five-days-ago#comment-7295151
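For what it's worth, the linked tip comes down to two bp.conf entries on the master server (the same entries that appear in the MasterServer2 config quoted later in this thread); 336 hours would keep two weeks of jobs. A sketch, with the value as an example only:

```
KEEP_JOBS_HOURS = 336
KEEP_JOBS_SUCCESSFUL_HOURS = 336
```

The main trade-off is more job records for the Activity Monitor to load, and KEEP_JOBS_SUCCESSFUL_HOURS should not exceed KEEP_JOBS_HOURS.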
11-14-2012 07:12 PM
I'm not sure SIZE_DATA_BUFFERS_NDMP in NetBackup is equivalent to the blocking factor, but it is worth a try. Have a look at the "Backup Planning and Performance Tuning Guide" for more detail.
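A sketch of how the buffer file might be set up. The real file lives under /usr/openv/netbackup/db/config on the media server; a demo directory is used here so the snippet is harmless to run, and the mapping from blocking factor to bytes assumes the dump blocking factor counts 1 KB blocks (so a factor of 256 would mean 256 KB records) -- check the tuning guide before relying on that.

```shell
# Demo path only -- on a real media server this would be
# /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_NDMP.
CONFIG_DIR=./demo_config
mkdir -p "$CONFIG_DIR"

# Assuming 1 KB units for the dump blocking factor, a factor of 256
# corresponds to 262144-byte buffers:
echo $((256 * 1024)) > "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
cat "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
```

The Detailed Status window for the next NDMP job should then report the new data buffer size, which is a quick way to confirm the file is being picked up.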
11-15-2012 08:53 AM
Thanks all,
OK, I have been experimenting with the SIZE_DATA_BUFFERS_NDMP file, and although I can see the change to the "data buffer size" in the Activity Monitor's Detailed Status window for the job, it doesn't seem to make a difference in the performance of the tape writes. I get very acceptable tape writes of between 45MB/s - 65MB/s+ when I issue the following command from the Master Server: "ssh -l root NetApp dump 0fb nrst0a 256 /vol/new_vol/.snapshot/newvolsnap". I have reviewed the documentation (as you suggested), but I can't tell if there is a direct correlation between SIZE_DATA_BUFFERS_NDMP and the tape blocking factor. Is it possible that data compression is being used (which may take more time to compress and write)? How would I check that from NetBackup's point of view?
Thank you again,
DRH
11-15-2012 06:58 PM
One major difference left between dump and NDMP is path-based file history. It is possible that this performance degradation comes from collecting, transferring and storing the file history into the catalog.
Can you try adding "HIST = n" at the top of the file selections?
There is a major drawback, though: this prevents cataloging individual files and folders, so you will not be able to restore individual items under the specified paths.
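For reference, a sketch of what the policy's backup selections list might look like with the directive added. The volume path is just an example, and the exact spelling of the SET line should be checked against the NetBackup for NDMP guide:

```
SET HIST=N
/vol/new_vol
```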
11-16-2012 05:22 AM
Some more information... I decided to truss some of the NetBackup processes and found a couple of things. While trussing bpbrm I noticed a pattern between the "0" tape writes on the NetApp and the acceptable tape writes. Every time the NetApp showed "0" in the tape-writes column, the bpbrm process would also stop sending output to the screen with the message "db_end: Need to collect reply", issue an open() on the file /usr/openv/netbackup/bin/DBMto, fail with "Err#2 ENOENT", then go to a sleeping state. The sysstat output would continue to display zeros under the tape-writes column for several seconds (sometimes for up to a minute or more), then the truss would wake up and take off again as the tape writes returned to acceptable throughput. This pattern repeats: tape writes go to "0"... "db_end: Need to collect reply"... sleeping... writes to tape continue. So it appears the tape writes stall while some other process is issuing a reply...
Another observation, and it may not mean anything, but: I looked at another system here that was set up very similarly to this environment (Master Server is a SunFire V240 running Solaris 9 and NetBackup 6.5, with a NetApp attached to a tape library via FC SFP), and I noticed that the /usr/openv/netbackup/bp.conf files are very different. Perhaps someone can tell me if there is anything in here that may be causing these "hangs" in the tape writes. (I know the NetBackup versions are very different, but I was hoping something would stand out... I did not set up the problematic NetApp or NetBackup 7.5, so I'm really grasping here...)
MasterServer1 (The SunFire V210 with the tape write issues running Netbackup 7.5)
SERVER = MasterServer1
SERVER = MasterServer1
CLIENT_NAME = MasterServer1
CONNECT_OPTIONS = localhost 1 0 2
USE_VXSS = PROHIBITED
VXSS_SERVICE_TYPE = INTEGRITYANDCONFIDENTIALITY
EMMSERVER = MasterServer1
HOST_CACHE_TTL = 3600
VXDBMS_NB_DATA = /usr/openv/db/data
LIST_FS_IMAGE_HEADERS = NO
VERBOSE = 4
TELEMETRY_UPLOAD = NO
MasterServer2 (The SunFire V240 running Netbackup 6.5)
SERVER = MasterServer2.dev.com
CLIENT_NAME = MasterServer2
ALLOW_NDMP
ALLOW_MEDIA_OVERWRITE = DBR
ALLOW_MEDIA_OVERWRITE = TAR
ALLOW_MEDIA_OVERWRITE = CPIO
ALLOW_MEDIA_OVERWRITE = ANSI
ALLOW_MEDIA_OVERWRITE = AOS/VS
ALLOW_MEDIA_OVERWRITE = MTF1
ALLOW_MEDIA_OVERWRITE = RS-MTF1
ALLOW_MEDIA_OVERWRITE = BE-MTF1
MEDIA_SERVER = MasterServer2.dev.com
KEEP_JOBS_HOURS = 480
KEEP_JOBS_SUCCESSFUL_HOURS = 480
EMMSERVER = MasterServer2.dev.com
VXDBMS_NB_DATA = /usr/openv/db/data
DISALLOW_CLIENT_LIST_RESTORE = YES
DISALLOW_CLIENT_RESTORE = YES
VERBOSE = 5
I would be happy to provide the pertinent information from the truss output if that would help.
Thanks,
DRH