NDMP backups extremely slow

drhartman
Level 3

I have just installed NetBackup 7.5 on a Solaris 10 server, using NDMP to communicate with a NetApp (running ONTAP 8.1) that is connected to an Overland NEO 2000 series tape library via a SANbox 1400 (I'm waiting for a replacement tape drive, so I only have one drive in the library right now). My basic problem is that the backups take an extremely long time to complete. From the NetBackup server's perspective (the Activity Monitor in the Admin Console), the backups seem to start OK, but throughput sits at only about 19-20 MB/s, and most backups fail because the backup window closes before they can finish. When I log into the NetApp and observe the tape writes (using the command "sysstat -x 1"), I see write rates anywhere from about 35 MB/s to 165 MB/s, but after several seconds of writes the drive stops for 14 - 20 seconds before it begins writing to tape again. The following is an example (I apologize for the alignment; I hope it is readable enough to show what I'm seeing):

NetApp> sysstat -x 1
CPU  NFS  CIFS  HTTP  Total  Net-in  Net-out  Disk-read  Disk-write  Tape-read  Tape-write  Cache-age  Cache-hit  CP-time  CP-ty  Disk-util  OTHER  FCP  iSCSI  FCP-in  FCP-out  iSCSI-in  iSCSI-out
(the Net, Disk, Tape, FCP and iSCSI in/out and read/write columns are in kB/s)
3%  1002  0  0  1002  2966  3469  2648  0  0  0  23s  86%  0%  -  32%  0  0  0  0  0  0  0
3%  1106  0  0  1112  3141  2580  2492  0  0  0  23s  88%  0%  -  31%  6  0  0  0  0  0  0
2%  803  0  0  803  3937  1132  752  0  0  0  23s  93%  0%  -  5%  0  0  0  0  0  0  0
16%  895  0  0  895  4769  2705  66792  0  0  50397  39s  85%  0%  -  58%  0  0  0  0  0  0  0
19%  976  0  0  976  2808  1951  62504  0  0  80806  39s  89%  0%  -  60%  0  0  0  0  0  0  0
22%  1066  0  0  1066  3079  3116  103624  0  0  98763  25s  92%  0%  -  65%  0  0  0  0  0  0  0
20%  976  0  0  983  4003  2760  78588  0  0  68944  24s  85%  0%  -  66%  7  0  0  0  0  0  0
14%  1037  0  0  1038  2941  1825  33412  0  0  33554  24s  82%  0%  -  56%  1  0  0  0  0  0  0
3%  1003  0  0  1003  3897  1796  1100  0  0  0  24s  86%  0%  -  20%  0  0  0  0  0  0  0
9%  956  0  0  956  3295  1503  16568  5648  0  0  24s  97%  46%  Tf  46%  0  0  0  0  0  0  0
6%  1057  0  0  1057  3027  2202  6724  1886  0  0  24s  89%  100%  :f  37%  0  0  0  0  0  0  0
4%  875  0  0  887  2887  2936  2608  14052  0  0  24s  84%  62%  :  29%  12  0  0  0  0  0  0
3%  1174  0  0  1177  3373  1843  1660  0  0  0  24s  89%  0%  -  18%  3  0  0  0  0  0  0
3%  967  0  0  967  3833  3634  2640  0  0  0  24s  80%  0%  -  31%  0  0  0  0  0  0  0
3%  969  0  0  969  3581  1743  1756  0  0  0  24s  90%  0%  -  27%  0  0  0  0  0  0  0
3%  1101  0  0  1101  4565  2676  2424  0  0  0  24s  87%  0%  -  36%  0  0  0  0  0  0  0
2%  785  0  0  793  2628  1364  1388  0  0  0  24s  90%  0%  -  18%  8  0  0  0  0  0  0
2%  931  0  0  931  2438  1817  1836  0  0  0  24s  90%  0%  -  23%  0  0  0  0  0  0  0
3%  984  0  0  984  3009  2223  1560  0  0  0  24s  85%  0%  -  29%  0  0  0  0  0  0  0
3%  908  0  0  908  2748  2247  2488  0  0  0  24s  89%  0%  -  30%  0  0  0  0  0  0  0
4%  1114  0  0  1114  5121  3814  3712  0  0  0  24s  93%  0%  -  26%  0  0  0  0  0  0  0
9%  781  0  0  790  2216  2308  18016  8462  0  0  39s  96%  38%  Tf  49%  9  0  0  0  0  0  0
5%  896  0  0  896  2315  1418  4248  18888  0  0  39s  91%  100%  :f  39%  0  0  0  0  0  0  0
5%  897  0  0  899  4179  3082  4120  11368  0  1114  39s  79%  51%  :  22%  2  0  0  0  0  0  0
23%  810  0  0  816  4347  1195  107412  0  0  103416  41s  89%  0%  -  75%  6  0  0  0  0  0  0
16%  925  0  0  1002  2840  2742  53688  0  0  54002  41s  85%  0%  -  63%  77  0  0  0  0  0  0
18%  999  0  0  1006  3969  2644  69408  0  0  60817  42s  87%  0%  -  75%  7  0  0  0  0  0  0
30%  841  0  0  843  2619  2409  168996  0  0  165806  41s  96%  0%  -  48%  2  0  0  0  0  0  0
24%  1233  0  0  1233  2909  1819  121868  0  0  118424  41s  91%  0%  -  55%  0  0  0  0  0  0  0
16%  973  0  0  973  2706  2364  54840  0  0  49873  25s  83%  0%  -  67%  0  0  0  0  0  0  0
6%  1036  0  0  1036  2768  1811  9832  0  0  7209  25s  84%  0%  -  22%  0  0  0  0  0  0  0
4%  886  0  0  893  2863  2827  2584  0  0  0  25s  85%  0%  -  27%  7  0  0  0  0  0  0

It continues with this pattern: tape writes stop for 16 - 20 seconds, then resume at respectable throughput for a single drive. Can someone please help? This is a real head banger. Any advice or direction to take at this point would be greatly appreciated.

Thanks,

DRH

14 REPLIES

Nicolai
Moderator
Partner VIP

This is not a NetBackup problem. It's either a bad tape drive or an I/O issue at the NetApp filer. A bad tape drive usually performs a lot of rewrites (to correct write errors), and that reduces write speed considerably. If that's not the issue, you need to take a look at the filer.

Hint: attach the sysstat output as a file instead, for example:
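One way to capture it from the master server is shown below. This is just a sketch and assumes the same root ssh access to the filer that is used elsewhere in this thread; let it run through a few of the stall cycles, then stop it with Ctrl-C and attach the file.

ssh -l root NetApp sysstat -x 1 > /tmp/sysstat.out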

GlenG
Level 4

Does vmstat show anything interesting (i.e., a non-zero value for any of the following)? A sample invocation follows the list.

r    the number of kernel threads in the run queue
b    the number of blocked kernel threads waiting for resources (I/O, paging, and so forth)
w    the number of swapped-out lightweight processes (LWPs) waiting for processing resources to finish
sr   pages scanned by the clock algorithm
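For example, run this on the master server while a backup is active (ten one-second samples; just a sketch):

vmstat 1 10

In Solaris vmstat output, r, b and w are the first three columns (under "kthr"), and sr is in the "page" group.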

 

How about the USR, SYS and LAT columns of "prstat -mL"?

If you have the DTraceToolkit installed, pfilestat <pid> can show why a process is waiting or delayed.
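For example (the PID is a placeholder for whichever process you want to watch, such as bptm or ndmpagent, and the pfilestat path assumes a DTraceToolkit unpacked under /opt/DTT):

prstat -mL -p <pid> 5
/opt/DTT/Bin/pfilestat <pid>

High LAT generally means the threads are waiting rather than doing work, and pfilestat breaks down where the process spends its time on file I/O.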

 

have a good weekend,

GlenG

 

 

drhartman
Level 3

Thank you Nicolai and GlenG for your comments, tips and help. I do not consider myself an expert on the NetApp in any way (in fact, I feel I have just enough knowledge to be dangerous), so please be gentle. Is there a way to determine, on the NetApp, the number of retries or rewrites it is experiencing on the tape drives? This may indeed be a suspect tape drive: it has been actively used for 3 - 4 years, and the second tape drive went south just a few weeks ago. I now need to justify (through statistics, if possible) the purchase of yet another tape drive. I executed vmstat on the NetBackup master server while backups were active and found that r, b and w are basically 0, with r occasionally hitting 1 or 2, and sr is pegged at 0. I ran vmstat at 1-second intervals and observed it for several minutes. GlenG, I do not have the DTraceToolkit. Can you explain what I should look for in the output of the prstat command you so graciously provided?

Again, thank you for your help!

DRH

Modena
Level 2
Partner

I am seeing a similar thing. I have an HP MSL4048 library with NDMP coming from a NetApp 3240. Two drives are quick (110 MB/s) and two are slow, at half that rate.

 

Looking at the sysstat output on the NetApp: the fast drives are constantly writing, while the slow drives write for a few seconds at good speed, then drop to 0 for a few seconds.

 

It's not clear to me yet whether this is a tape drive, cable, library, NetApp port or FC switch issue. Still investigating.

Modena
Level 2
Partner

Just to close this out from my end: I had a faulty FC SFP on the NetApp connection. I replaced it, the multi-second drops to 0 have gone away, and all my drives are pumping at full speed now.

drhartman
Level 3

Thank you, Modena. I replaced the FC SFP on the NetApp, but it made no difference to the problem. The backups still toggle between 40 MB/s - 140 MB/s and then drop to 0 for about 10 seconds, and the pattern repeats. It almost appears as though some kind of buffering is taking place. From the NetBackup server, a SunFire V210 running Solaris 10, I trussed the ndmpagent process while running "sysstat -x 1" on the NetApp. The truss output would stop (and the process would sleep) when the tape writes went to 0, and start again when the writes picked up. I also trussed the bpbrm process and noticed that when the tape writes go to 0, a message is written to stdout, "db_end : Need to collect reply", followed by an open("/usr/openv/netbackup/bin/DBMto", O_RDONLY) that fails with Err#2 ENOENT. Again, the behavior is the same as with ndmpagent: the process goes to sleep at the same time the tape writes go to 0. Is there anything else I can do to track down the problem? So far I've swapped fibre cables and replaced the FC SFP on the NetApp.
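For reference, I attached truss roughly like this, with -d so every call is timestamped and the stalls can be lined up against the sysstat output (the PIDs are placeholders):

truss -d -o /tmp/ndmpagent.truss -p <ndmpagent_pid>
truss -d -o /tmp/bpbrm.truss -p <bpbrm_pid>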

Any help would be greatly appreciated.

DRH

Nicolai
Moderator
Partner VIP

Can you bypass NetBackup and write directly to tape via a NetApp command? That will help you isolate the issue.
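For example, something like this on the filer console (the tape device and volume names are only examples; substitute your own):

NetApp> dump 0f nrst0a /vol/yourvol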

watsons
Level 6

Yes, Nicolai's suggestion to use the NetApp "dump" command to back up directly to tape would be a good way to isolate the issue. Details at the URL below:

http://www.wafl.co.uk/dump/

If you would like to see whether there is any I/O progress on the NetBackup side, enable the ndmpagent (unified log OID 134) and ndmp (unified log OID 151) logs, plus the bpbrm and bptm legacy logs, on the media server (a sketch follows).
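A minimal sketch of enabling them, assuming the default NetBackup product ID (51216) for vxlogcfg and the standard legacy log directories; debug level 6 is just an example, and you should drop it again once you are done troubleshooting:

vxlogcfg -a -p 51216 -o 134 -s DebugLevel=6
vxlogcfg -a -p 51216 -o 151 -s DebugLevel=6
mkdir -p /usr/openv/netbackup/logs/bpbrm /usr/openv/netbackup/logs/bptm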

On the NetApp end, you can look at the ndmpd log.

It may not tell you why it's slow, but you may be able to tell how slow it is, or whether there is any connection timeout or buffer overflow (which can cause slow I/O).

drhartman
Level 3

watsons, Nicolai,

Thank you for your suggestions. As a first pass, I ran a remote dump from the master server without any options and observed little improvement: I was getting around 20 MB/s - 30 MB/s, with pauses in the writes of about 10 seconds after several seconds of writing. I then ran a remote dump with a blocking size of 128, and now, after the first couple of minutes, throughput is fairly steady at 45 MB/s - 65 MB/s with very few breaks in the writes. I believe this throughput is enough for our backups to complete within the 24-hour window. So, how do I change the blocking factor in NetBackup? Also, on a side note, does anyone know how to increase the number of days of jobs shown in the Activity Monitor? Currently only about 3 days are visible, but I would like to see a couple of weeks. Is there a downside to keeping that many days?
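For reference, the remote dump with the larger blocking size looked roughly like this (host, tape device and snapshot path are the ones from my environment; it is the same form of command I post further down with a blocking size of 256):

ssh -l root NetApp dump 0fb nrst0a 128 /vol/new_vol/.snapshot/newvolsnap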

Thank you all,

DRH 

Yasuhisa_Ishika
Level 6
Partner Accredited Certified

I'm not sure that SIZE_DATA_BUFFERS_NDMP in NetBackup is equivalent to the blocking factor, but it is worth a try. Have a look at the "Backup Planning and Performance Tuning Guide" for more detail:

http://www.symantec.com/docs/DOC4483
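A minimal sketch of setting it on the master/media server, assuming the usual touch-file location; 262144 bytes (256 KB) is only an example value and has to be a size the drive supports. It takes effect on the next job:

echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_NDMP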

drhartman
Level 3

Thanks all,

OK, I have been experimenting with the SIZE_DATA_BUFFERS_NDMP file, and although I can see the change to the "data buffer size" in the Activity Monitor Detailed Status window for that job ID, it doesn't seem to make a difference in the performance of the tape writes. I get very acceptable tape writes of 45 MB/s - 65 MB/s+ when I issue the following command from the master server: "ssh -l root NetApp dump 0fb nrst0a 256 /vol/new_vol/.snapshot/newvolsnap". I have reviewed the documentation (as you suggested), but I can't tell whether there is a direct correlation between SIZE_DATA_BUFFERS_NDMP and the tape blocking factor. Is it possible that data compression is being used (which may take more time to compress and write)? How would I check that from NetBackup's point of view?

 

Thank you again,

DRH

Yasuhisa_Ishika
Level 6
Partner Accredited Certified

One major difference remaining between dump and NDMP is path-based file history. It is possible that this performance degradation comes from collecting, transferring and storing the file history in the catalog.

Can you try adding "HIST = n" at the top of the backup selections? See the sketch below.
But here is a major drawback: this prevents cataloging of individual files and folders, so you will not be able to restore individual items under the specified paths.
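A minimal sketch of what the policy's backup selections might look like with the directive on top (the SET syntax and the volume path here are assumptions for illustration, not taken from your policy):

SET HIST=N
/vol/new_vol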

drhartman
Level 3

Some more information. I decided to truss some of the NetBackup processes and found a couple of things. While trussing bpbrm, I noticed a pattern between the "0" tape writes on the NetApp and the periods of acceptable tape writes. Every time the NetApp showed "0" in the tape-writes column, the bpbrm process would also stop sending output to the screen with the message "db_end: Need to collect reply", issue an open of the file /usr/openv/netbackup/bin/DBMto, fail with "Err#2 ENOENT", and then go to a sleeping state. The sysstat output would continue to display zeros under the tape-writes column for several seconds (sometimes for up to a minute or more); then the truss would wake up and take off again as the tape writes returned to acceptable throughput. This pattern keeps repeating: tape writes go to "0"... "db_end: Need to collect reply"... sleeping... then the writes to tape continue. So it appears the tape writes stall while some other process is issuing a reply.

Another observation, which may not mean anything: I looked at another system here that is set up very similarly to this environment (the master server is a SunFire V240 running Solaris 9 and NetBackup 6.5, with a NetApp attached to a tape library via an FC SFP) and noticed that the /usr/openv/netbackup/bp.conf files are very different. Perhaps someone can tell me whether there is anything in here that may be causing these hangs in the tape writes. (I know the NetBackup versions are very different, but I was hoping something would stand out. I did not set up the problematic NetApp or NetBackup 7.5, so I'm really grasping here.)

MasterServer1 (the SunFire V210 with the tape-write issues, running NetBackup 7.5):

SERVER = MasterServer1
SERVER = MasterServer1
CLIENT_NAME = MasterServer1
CONNECT_OPTIONS = localhost 1 0 2
USE_VXSS = PROHIBITED
VXSS_SERVICE_TYPE = INTEGRITYANDCONFIDENTIALITY
EMMSERVER = MasterServer1
HOST_CACHE_TTL = 3600
VXDBMS_NB_DATA = /usr/openv/db/data
LIST_FS_IMAGE_HEADERS = NO
VERBOSE = 4
TELEMETRY_UPLOAD = NO
 

MasterServer2 (the SunFire V240 running NetBackup 6.5):

SERVER = MasterServer2.dev.com
CLIENT_NAME = MasterServer2
ALLOW_NDMP
ALLOW_MEDIA_OVERWRITE = DBR
ALLOW_MEDIA_OVERWRITE = TAR
ALLOW_MEDIA_OVERWRITE = CPIO
ALLOW_MEDIA_OVERWRITE = ANSI
ALLOW_MEDIA_OVERWRITE = AOS/VS
ALLOW_MEDIA_OVERWRITE = MTF1
ALLOW_MEDIA_OVERWRITE = RS-MTF1
ALLOW_MEDIA_OVERWRITE = BE-MTF1
MEDIA_SERVER = MasterServer2.dev.com
KEEP_JOBS_HOURS = 480
KEEP_JOBS_SUCCESSFUL_HOURS = 480
EMMSERVER = MasterServer2.dev.com
VXDBMS_NB_DATA = /usr/openv/db/data
DISALLOW_CLIENT_LIST_RESTORE = YES
DISALLOW_CLIENT_RESTORE = YES
VERBOSE = 5

I would be happy to provide the pertinent information from the truss output if that would help.

Thanks,

DRH