11-02-2012 09:46 AM
I have just installed NetBackup 7.5 on a Solaris 10 server, using NDMP to communicate with a NetApp (running ONTAP 8.1) that is connected to an Overland NEO 2000 series tape library via a SANbox 1400 (I'm waiting for a replacement tape drive, so I only have one tape drive in the library right now). My basic problem is that the backups take an extreme amount of time to complete. The backups seem to start OK (from the NetBackup server's perspective, as seen in the Activity Monitor of the Admin Console), but throughput is only about 19-20MB/s, and most backups fail because the backup windows close before they can finish. When I log into the NetApp and observe the tape writes (using the command "sysstat -x 1"), I see values anywhere from around 35MB/s to 165MB/s, but after several seconds of writes it stops for 14 - 20 seconds before beginning to write to tape again. The following is an example (I apologize for the alignment... I hope it is readable enough to see what I'm seeing):
NetApp> sysstat -x 1
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
3% 1002 0 0 1002 2966 3469 2648 0 0 0 23s 86% 0% - 32% 0 0 0 0 0 0 0
3% 1106 0 0 1112 3141 2580 2492 0 0 0 23s 88% 0% - 31% 6 0 0 0 0 0 0
2% 803 0 0 803 3937 1132 752 0 0 0 23s 93% 0% - 5% 0 0 0 0 0 0 0
16% 895 0 0 895 4769 2705 66792 0 0 50397 39s 85% 0% - 58% 0 0 0 0 0 0 0
19% 976 0 0 976 2808 1951 62504 0 0 80806 39s 89% 0% - 60% 0 0 0 0 0 0 0
22% 1066 0 0 1066 3079 3116 103624 0 0 98763 25s 92% 0% - 65% 0 0 0 0 0 0 0
20% 976 0 0 983 4003 2760 78588 0 0 68944 24s 85% 0% - 66% 7 0 0 0 0 0 0
14% 1037 0 0 1038 2941 1825 33412 0 0 33554 24s 82% 0% - 56% 1 0 0 0 0 0 0
3% 1003 0 0 1003 3897 1796 1100 0 0 0 24s 86% 0% - 20% 0 0 0 0 0 0 0
9% 956 0 0 956 3295 1503 16568 5648 0 0 24s 97% 46% Tf 46% 0 0 0 0 0 0 0
6% 1057 0 0 1057 3027 2202 6724 1886 0 0 24s 89% 100% :f 37% 0 0 0 0 0 0 0
4% 875 0 0 887 2887 2936 2608 14052 0 0 24s 84% 62% : 29% 12 0 0 0 0 0 0
3% 1174 0 0 1177 3373 1843 1660 0 0 0 24s 89% 0% - 18% 3 0 0 0 0 0 0
3% 967 0 0 967 3833 3634 2640 0 0 0 24s 80% 0% - 31% 0 0 0 0 0 0 0
3% 969 0 0 969 3581 1743 1756 0 0 0 24s 90% 0% - 27% 0 0 0 0 0 0 0
3% 1101 0 0 1101 4565 2676 2424 0 0 0 24s 87% 0% - 36% 0 0 0 0 0 0 0
2% 785 0 0 793 2628 1364 1388 0 0 0 24s 90% 0% - 18% 8 0 0 0 0 0 0
2% 931 0 0 931 2438 1817 1836 0 0 0 24s 90% 0% - 23% 0 0 0 0 0 0 0
3% 984 0 0 984 3009 2223 1560 0 0 0 24s 85% 0% - 29% 0 0 0 0 0 0 0
3% 908 0 0 908 2748 2247 2488 0 0 0 24s 89% 0% - 30% 0 0 0 0 0 0 0
4% 1114 0 0 1114 5121 3814 3712 0 0 0 24s 93% 0% - 26% 0 0 0 0 0 0 0
9% 781 0 0 790 2216 2308 18016 8462 0 0 39s 96% 38% Tf 49% 9 0 0 0 0 0 0
5% 896 0 0 896 2315 1418 4248 18888 0 0 39s 91% 100% :f 39% 0 0 0 0 0 0 0
5% 897 0 0 899 4179 3082 4120 11368 0 1114 39s 79% 51% : 22% 2 0 0 0 0 0 0
23% 810 0 0 816 4347 1195 107412 0 0 103416 41s 89% 0% - 75% 6 0 0 0 0 0 0
16% 925 0 0 1002 2840 2742 53688 0 0 54002 41s 85% 0% - 63% 77 0 0 0 0 0 0
18% 999 0 0 1006 3969 2644 69408 0 0 60817 42s 87% 0% - 75% 7 0 0 0 0 0 0
30% 841 0 0 843 2619 2409 168996 0 0 165806 41s 96% 0% - 48% 2 0 0 0 0 0 0
24% 1233 0 0 1233 2909 1819 121868 0 0 118424 41s 91% 0% - 55% 0 0 0 0 0 0 0
16% 973 0 0 973 2706 2364 54840 0 0 49873 25s 83% 0% - 67% 0 0 0 0 0 0 0
6% 1036 0 0 1036 2768 1811 9832 0 0 7209 25s 84% 0% - 22% 0 0 0 0 0 0 0
4% 886 0 0 893 2863 2827 2584 0 0 0 25s 85% 0% - 27% 7 0 0 0 0 0 0
It continues with this pattern: tape writes stop for 16 - 20 seconds, then resume at respectable throughput for a single drive. Can someone please help... this is a real head banger. Can someone please give me some advice or a direction to take at this point?
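To put a number on the stall, one trick is to capture a stretch of the sysstat output into a file and average the tape-write column with awk. The snippet below is a sketch using three rows pasted from the output above; the field position (11) is an assumption that should be checked against your own header line.

```shell
# Three representative rows pasted from the sysstat -x output above
# (one idle tape sample, two active ones):
cat > sysstat.log <<'EOF'
3% 1002 0 0 1002 2966 3469 2648 0 0 0 23s 86% 0% - 32%
16% 895 0 0 895 4769 2705 66792 0 0 50397 39s 85% 0% - 58%
19% 976 0 0 976 2808 1951 62504 0 0 80806 39s 89% 0% - 60%
EOF

# Field 11 is the tape-write kB/s column in this layout (column
# positions can shift between ONTAP releases -- verify against your
# own header line before trusting the numbers).
awk '$1 ~ /%$/ {
    total += $11; n++
    if ($11 == 0) idle++
} END {
    printf "samples=%d idle=%d avg_tape_write_kBs=%.0f\n", n, idle, total / n
}' sysstat.log
```

Collecting a few minutes of samples this way gives a count of zero-write intervals and an effective average rate, which is easier to compare before and after a change than eyeballing the scrolling output.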
Thanks,
DRH
11-02-2012 12:17 PM
This is not a NetBackup problem. It's either a bad tape drive or an I/O issue on the NetApp filer. A bad tape drive usually performs a lot of rewrites (corrected write errors), and that reduces write speed a lot. If that's not the issue, you need to take a look at the filer.
Hint: add the sysstat output as an attachment instead.
11-02-2012 01:57 PM
Does vmstat show anything interesting (i.e. non-zero values for):
r the number of kernel threads in run queue
b the number of blocked kernel threads that are waiting for resources I/O, paging, and so forth
w the number of swapped out lightweight processes (LWPs) that are waiting for processing resources to finish.
sr pages scanned by clock algorithm
How about the USR, SYS and LAT columns of "prstat -mL"?
If you have the DTraceToolkit installed, pfilestat <pid> can show why a process is waiting/delayed.
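A small sketch of how this vmstat check can be automated against captured output ("vmstat 1 60 > vmstat.log" on the master server). The sample rows here are fabricated for illustration, and the assumption that r/b/w are fields 1-3 and sr is field 12 should be verified against the Solaris vmstat header on your system.

```shell
# Fabricated vmstat-style samples: first row is all quiet, second has
# kernel threads queued/blocked (r=2, b=1) and a non-zero scan rate.
cat > vmstat.log <<'EOF'
 0 0 0 2097152 524288 5 20 0 0 0 0 0 0 0 0 0 400 300 200 2 1 97
 2 1 0 2097152 524288 5 20 0 0 0 0 15 0 0 0 0 400 300 200 2 1 97
EOF

# Print only the samples where any of r, b, w or sr is non-zero,
# so a long capture reduces to just the interesting intervals.
awk '$1+$2+$3+$12 > 0 {
    printf "sample %d: r=%s b=%s w=%s sr=%s\n", NR, $1, $2, $3, $12
}' vmstat.log
```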
have a good weekend,
GlenG
11-07-2012 06:08 AM
Thank you Nicolai and GlenG for your comments, tips and help. I do not consider myself an expert on the NetApp in any way (in fact, I feel I have just enough knowledge to be dangerous), so please be gentle... Is there a way to determine, on the NetApp, the number of retries or rewrites it is experiencing on the tape drives? This may just be a suspect tape drive: it has been actively used for 3 - 4 years now, and the second tape drive went south just a few weeks ago. I now need to justify (through statistics, if possible) the purchase of yet another tape drive. I executed the vmstat command on the NetBackup Master Server while backups were active and found that r, b and w are basically 0, with r hitting a 1 or 2 occasionally; sr is pegged at 0. I ran vmstat at 1-second intervals and observed for several minutes. GlenG, I do not have the DTraceToolkit. Can you explain what I might look for in the output of the prstat command you so graciously provided?
Again, thank you for your help!
DRH
11-07-2012 09:25 PM
I am seeing a similar thing - I have an HP MSL4048 library with NDMP coming from a NetApp 3240. Two drives are quick (110MB/sec) and two are slow, at half that rate.
Looking at the sysstat output on the NetApp: for the fast drives it is constantly writing; on the slow drives it writes for a few seconds at good speed, then goes to 0 for a few seconds...
It's not clear to me yet whether this is a tape drive, cable, library, NetApp port or FC switch issue... still investigating...
11-09-2012 02:10 AM
Just to close this out from my end: I had a faulty FC SFP on the NetApp connection. I replaced it, the 0's for seconds at a time have gone away, and all my drives are pumping at full speed now.
11-13-2012 05:58 AM
Thank you, Modena. I replaced the FC SFP on the NetApp, but this made no change in the problem. The backups still toggle between 40MB/s - 140MB/s and then "0" for about 10 seconds, then continue and repeat... It almost appears as though some kind of buffering is taking place. From the NetBackup server, a SunFire V210 running Solaris 10, I trussed the ndmpagent process while running "sysstat -x 1" on the NetApp. The truss output would stop (and sleep) when the tape writes went to "0" and start again when the writes picked up. I also trussed the bpbrm process and noticed that when the tape writes go to "0", a message is written to stdout, "db_end : Need to collect reply", then there is an open("/usr/openv/netbackup/bin/DBMto", O_RDONLY) Err#2 ENOENT. Again, this behavior is the same as with ndmpagent: the process "goes to sleep" at the same time the tape writes go to "0". Is there anything else I can do to track down the problem? So far I've swapped fibre cables and replaced the FC SFP on the NetApp.
Any help would be greatly appreciated.
DRH
11-13-2012 08:24 AM
Can you bypass NetBackup and write directly to tape via a NetApp command? This will help you isolate the issue.
11-14-2012 02:19 AM
Yes, Nicolai's suggestion to use the NetApp "dump" command to back up directly to tape would be a good way to isolate the issue. Details in the URL below:
http://www.wafl.co.uk/dump/
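A sketch of what such a test run might look like when driven from the master server. The filer name, device and volume path are placeholders, and the command is only echoed here so it can be inspected before being run for real against the filer.

```shell
# Placeholders -- substitute your own filer, device and volume.
# In "dump 0fb": 0 = level-0 dump, f = dump to the named tape device,
# b = use the given blocking factor.
FILER=NetApp
DEVICE=nrst0a          # no-rewind tape device on the filer
BFACTOR=128            # blocking factor to experiment with
VOLPATH=/vol/new_vol   # hypothetical volume path

CMD="ssh -l root $FILER dump 0fb $DEVICE $BFACTOR $VOLPATH"
echo "$CMD"            # run it by hand once it looks right
```

If the direct dump sustains good throughput while the NetBackup-driven NDMP backup stalls, the problem is on the NetBackup/NDMP side rather than the drive or fabric.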
If you would like to see whether there is any I/O progress on the NetBackup side, enable the ndmpagent (unified log OID=134) and ndmp (unified log OID=151) logs, plus the bpbrm and bptm logs, on the media server.
On the netapp end, you can look at ndmpd log.
It may not tell you why it's slow, but you may be able to tell how slow it is, or whether there is any connection timeout or buffer overflow (which can cause slow I/O).
11-14-2012 08:55 AM
watsons, Nicolai,
Thank you for your suggestions... As a first test, I ran a remote dump from the Master Server without any options and observed little improvement: around 20MB/s - 30MB/s, with ~10-second pauses in the writes (after several seconds of writing). I then ran a remote dump with a blocking factor of 128. Now, after the first couple of minutes, throughput is fairly steady at between 45MB/s - 65MB/s, with very few breaks in the writes. I believe this throughput is enough for our backups to complete within the 24-hour period. So, how do I change the blocking factor in NetBackup? Also, on a side note, does anyone know how to increase the number of days of jobs shown in the Activity Monitor? Currently only 3 days are visible, but I would like to see a couple of weeks. Is there a downside to increasing this?
Thank you all,
DRH
11-14-2012 09:07 AM
Regarding the latter question (logs in the Activity Monitor):
http://www.symantec.com/business/support/index?pag...
Thanks to Andy for the tip here:
https://www-secure.symantec.com/connect/forums/activity-monitor-five-days-ago#comment-7295151
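For what it's worth, the linked tip comes down to two bp.conf entries on the master server (the same entries that appear in the MasterServer2 config quoted later in this thread); 336 hours would keep two weeks of jobs. A sketch, with the value as an example only:

```
KEEP_JOBS_HOURS = 336
KEEP_JOBS_SUCCESSFUL_HOURS = 336
```

The main trade-off is more job records for the Activity Monitor to load, and KEEP_JOBS_SUCCESSFUL_HOURS should not exceed KEEP_JOBS_HOURS.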
11-14-2012 07:12 PM
I'm not sure SIZE_DATA_BUFFERS_NDMP in NetBackup is equivalent to the blocking factor, but it is worth a try. Have a look at the "Backup Planning and Performance Tuning Guide" for more detail.
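A sketch of how the buffer file might be set up. The real file lives under /usr/openv/netbackup/db/config on the media server; a demo directory is used here so the snippet is harmless to run, and the mapping from blocking factor to bytes assumes the dump blocking factor counts 1 KB blocks (so a factor of 256 would mean 256 KB records) -- check the tuning guide before relying on that.

```shell
# Demo path only -- on a real media server this would be
# /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_NDMP.
CONFIG_DIR=./demo_config
mkdir -p "$CONFIG_DIR"

# Assuming 1 KB units for the dump blocking factor, a factor of 256
# corresponds to 262144-byte buffers:
echo $((256 * 1024)) > "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
cat "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
```

The Detailed Status window for the next NDMP job should then report the new data buffer size, which is a quick way to confirm the file is being picked up.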
11-15-2012 08:53 AM
Thanks all,
OK, I have been experimenting with the SIZE_DATA_BUFFERS_NDMP file, and although I can see the change to the "data buffer size" in the Activity Monitor's Detailed Status window for the job, it doesn't seem to make a difference in the performance of the tape writes. I get very acceptable tape writes of between 45MB/s - 65MB/s+ when I issue the following command from the Master Server: "ssh -l root NetApp dump 0fb nrst0a 256 /vol/new_vol/.snapshot/newvolsnap". I have reviewed the documentation (as you suggested), but I can't tell if there is a direct correlation between SIZE_DATA_BUFFERS_NDMP and the tape blocking factor. Is it possible that data compression is being used (which may take more time to compress and write)? How would I check that from NetBackup's point of view?
Thank you again,
DRH
11-15-2012 06:58 PM
One major difference left between dump and NDMP is path-based file history. It is possible that this performance degradation comes from collecting, transferring and storing the file history into the catalog.
Can you try adding "HIST = n" at the top of the file selections?
There is a major drawback, though: this prevents cataloging individual files and folders, so you will not be able to restore individual items under the specified paths.
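For reference, a sketch of what the policy's backup selections list might look like with the directive added. The volume path is just an example, and the exact spelling of the SET line should be checked against the NetBackup for NDMP guide:

```
SET HIST=N
/vol/new_vol
```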
11-16-2012 05:22 AM
Some more information... I decided to truss some of the NetBackup processes and found a couple of things. While trussing bpbrm I noticed a pattern between the "0" tape writes on the NetApp and the acceptable tape writes. Every time the NetApp showed "0" in the tape-writes column, the bpbrm process would also stop sending output to the screen with the message "db_end: Need to collect reply", issue an open() on the file /usr/openv/netbackup/bin/DBMto, fail with "Err#2 ENOENT", then go to a sleeping state. The sysstat output would continue to display zeros under the tape-writes column for several seconds (sometimes for up to a minute or more), then the truss would wake up and take off again as the tape writes returned to acceptable throughput. This pattern repeats: tape writes go to "0"... "db_end: Need to collect reply"... sleeping... writes to tape continue. So it appears the tape writes stall while some other process is issuing a reply...
Another observation, and it may not mean anything, but: I looked at another system here that was set up very similarly to this environment (Master Server is a SunFire V240 running Solaris 9 and NetBackup 6.5, with a NetApp attached to a tape library via FC SFP), and I noticed that the /usr/openv/netbackup/bp.conf files are very different. Perhaps someone can tell me if there is anything in here that may be causing these "hangs" in the tape writes. (I know the NetBackup versions are very different, but I was hoping something would stand out... I did not set up the problematic NetApp or NetBackup 7.5, so I'm really grasping here...)
MasterServer1 (The SunFire V210 with the tape write issues running Netbackup 7.5)
SERVER = MasterServer1
SERVER = MasterServer1
CLIENT_NAME = MasterServer1
CONNECT_OPTIONS = localhost 1 0 2
USE_VXSS = PROHIBITED
VXSS_SERVICE_TYPE = INTEGRITYANDCONFIDENTIALITY
EMMSERVER = MasterServer1
HOST_CACHE_TTL = 3600
VXDBMS_NB_DATA = /usr/openv/db/data
LIST_FS_IMAGE_HEADERS = NO
VERBOSE = 4
TELEMETRY_UPLOAD = NO
MasterServer2 (The SunFire V240 running Netbackup 6.5)
SERVER = MasterServer2.dev.com
CLIENT_NAME = MasterServer2
ALLOW_NDMP
ALLOW_MEDIA_OVERWRITE = DBR
ALLOW_MEDIA_OVERWRITE = TAR
ALLOW_MEDIA_OVERWRITE = CPIO
ALLOW_MEDIA_OVERWRITE = ANSI
ALLOW_MEDIA_OVERWRITE = AOS/VS
ALLOW_MEDIA_OVERWRITE = MTF1
ALLOW_MEDIA_OVERWRITE = RS-MTF1
ALLOW_MEDIA_OVERWRITE = BE-MTF1
MEDIA_SERVER = MasterServer2.dev.com
KEEP_JOBS_HOURS = 480
KEEP_JOBS_SUCCESSFUL_HOURS = 480
EMMSERVER = MasterServer2.dev.com
VXDBMS_NB_DATA = /usr/openv/db/data
DISALLOW_CLIENT_LIST_RESTORE = YES
DISALLOW_CLIENT_RESTORE = YES
VERBOSE = 5
I would be happy to provide the pertinent information from the truss output if that would help.
Thanks,
DRH