
NBU 5220 Performance

smakovits
Level 6

We recently deployed a 5220 appliance into our environment as it was to be the savior in our battle against a backup window we were no longer able to meet.  When we finally got it online and into our NBU environment, the initial performance was great.  The area where we expected to benefit most was VMware backups.  The data stores mounted directly to the appliance would allow direct access to the snapshots for a fast and efficient backup.  Before this, we were performing client-side backups, so the impact on the hosts every night was significant as we tried backing up 600+ VMs.  The plan was to move all dev and test off-host backups to the middle of the day, as the performance impact was minimal and the end result was a larger window to complete things.  When we started with the noon-time backups, deduplication rates were high and so were the speeds.

 

However, this performance gain was short-lived.  As we began increasing the load, we suddenly saw performance drop to a point of concern.  Backups were no longer speedy: 3,500 KB/sec to 10,000 KB/sec.  There are some that might pop to 24,000 KB/sec, but in a sample of 15 as I write this, only one is showing 24,000.

Now, I do have a few ideas. 

1. We have a 72TB appliance and therefore 2 disk trays.  During the backups, one disk tray is going crazy: all the lights are flashing and you can really see that it is working.  However, the second tray is doing nothing.  While you might see a blink here or there, it is almost nothing compared to the other disk tray.  Is this to be expected?  When we looked at the disk configuration, it shows concat; is this normal?

2. Too much data at once and we are simply burying the appliance.  In reality, what sort of performance should I be able to expect from the appliance?

3. Relating to number 2, since we only have the one appliance right now while we wait to get the remote appliance in place, we are duping off to tape.  This is running at the same time as a backup, so this means at the same time the appliance is writing a lot of data, it is also reading it back to tape.

4. We are overloading the data store so that read speed is bad from source to destination.  We have fewer hosts; therefore, if we limit the jobs per host, we limit the number of machines backing up at once (obviously).  This means that backups take way too long, so we removed the limit per host and just set a limit per data store.  As we are new to it all, I am not sure what has an impact where, but again, I am trying to list any and all ideas from the start.

5. The appliance does not support multi-pathing, therefore, we only have a single path to the disk.

 

Beyond that I am not sure, but this is something that doesn't help with showcasing the appliances to management at the moment.  However, given the initial performance, I am confident we can get back there.

1 ACCEPTED SOLUTION


Mark_Solutions
Level 6
Partner Accredited Certified

Lots of tuning possible but every environment is different and what helps for one site may hinder another!

The ch value is the maximum cache value - everything explained here : http://www.symantec.com/docs/HOWTO34665

Running backups and duplications at the same time is never going to be great - which I assume is why in 7.6 SLPs get a schedule window, and why others use a scheduler to suspend and enable them so that they don't run while backups are running.

Queue processing, garbage collection and rebasing can all also have an effect and this is another juggling act - again, people are adding extra queue processing runs to the crontab to keep it running lean

The readme for 2.5.3 is interesting ...

Better battery backup monitoring, catalog busy error fixes, well .. when you read all of the content through it does make you feel like you have to have it!

I guess now that 2.5.3 is here that Support / Symantec will say try that first?


50 REPLIES

Mark_Solutions
Level 6
Partner Accredited Certified

Lots of things to cover there but here are a few ideas which may help.

Firstly, in relation to one shelf getting used and the other not, I am unsure why this would be the case unless you added the second shelf at a later date and most data was already on the first shelf (being de-dupe, it may not be adding to the capacity of the appliance and just carries on using the first shelf). The one thought is that the first shelf will also most likely have the de-dupe database on it, which is doing most of the work: if you are getting good de-dupe there shouldn't be much data written, but the database should be getting a hammering!

Second - do check that the appliance itself is OK. Two things to check here, 1. The disks, a lost disk will substantially slow the thing down. 2. The RAID battery - massive slow down if this starts to go flat. Use:

 /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -getbbustatus -a0

to check its state.

If you are running a lot of jobs with a lot of data then it may be that queue processing, rebasing and garbage collection are starting to slow things down.

Queue processing runs automatically (twice a day) but it never seems to clear enough down, so it is well worth running it manually more often to keep it lean (/usr/openv/pdde/pdcr/bin/crcontrol --processqueue), and it may help speed things back up. It is worth checking regularly how big your queue is - I feel that it should run far more often than every 12 hours, and that midnight is not usually a great time to run as backups are usually running then - it would be best to run it 4 or 5 times a day during the daytime, or at least when backups are not running.

Rebasing and garbage collection do not run so often (monthly), so when these kick in they can also slow it down for a day or two; you may want to run these manually more regularly too.

Running duplications at the same time as backups will have an adverse effect too - when 7.6 comes out we will be able to schedule the SLPs, but for now you may want to just restrict the I/O on the disk pool to prevent too many operations running on the pool at the same time.

Reducing the fragment size of the de-dupe disk storage unit also seems to help - I try to keep them to around 5000MB for best performance (mainly helps with duplications / replications)

The current thinking is an optimum limit of one VM per data store at a time for best performance when backing up - so anything more than that will have an impact on performance

Again, when 7.6 is here you will get the option of using Accelerator for VMware backups which will change everything!

Hope some of this helps

smakovits
Level 6

Mark,

Funny you should mention the RAID battery as I got this email today:

                             Adapter Information                             |
|+---------------------------------------------------------------------------+|
||  |                    |Adapter| BBU  |BBU Learn | BBU  |      |           ||
||ID|   Adapter model    |Status |Status|  Cycle   |charge|State |Acknowledge||
||  |                    |       |      |  active  |      |      |           ||
||--+--------------------+-------+------+----------+------+------+-----------||
||  |Integrated Intel(R) |       |      |          |      |      |           ||
||1 |RAID Controller     |OK     |Not OK|-         |-     |Failed|No         ||
||  |SROMBSASMP2         |       |      |          |      |      |     

 

 

However, I ran your command and things look to be OK.

/opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -getbbustatus -a0

BBU status for Adapter: 0

BatteryType: iBBU
Voltage: 4064 mV
Current: 0 mA
Temperature: 38 C

BBU Firmware Status:

  Charging Status              : None
  Voltage                      : OK
  Temperature                  : OK
  Learn Cycle Requested        : No
  Learn Cycle Active           : No
  Learn Cycle Status           : OK
  Learn Cycle Timeout          : No
  I2c Errors Detected          : No
  Battery Pack Missing         : No
  Battery Replacement required : No
  Remaining Capacity Low       : No
  Periodic Learn Required      : No
  Transparent Learn            : No

Battery state:

GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : Yes
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : No
  Remaining Capacity Alarm: No
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No

Relative State of Charge: 96 %
Charger System State: 49168
Charger System Ctrl: 0
Charging current: 0 mA
Absolute state of charge: 96 %
Max Error: 2 %

Exit Code: 0x00
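For anyone who wants to screen this output automatically rather than read it by eye, here is a minimal Python sketch. It only parses text shaped like the status output above; the watch list of fields and the `min_charge_pct` threshold are assumptions made for illustration, not values documented by the vendor.

```python
def parse_bbu_status(text):
    """Turn 'Key : Value' lines of the status output into a dict.

    'Voltage' appears twice in the output (a reading in mV and an OK flag);
    this sketch keeps the first occurrence of any repeated key.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key and value:
            fields.setdefault(key, value)
    return fields


# Fields treated as healthy only when they read "No" (an assumed watch list).
EXPECT_NO = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "Learn Cycle Active",
)


def bbu_warnings(fields, min_charge_pct=50):
    """Return readable warnings; min_charge_pct is a hypothetical threshold."""
    warnings = [
        f"{name}: {fields[name]}"
        for name in EXPECT_NO
        if fields.get(name, "No") != "No"
    ]
    charge = fields.get("Relative State of Charge", "").rstrip("% ")
    if charge.isdigit() and int(charge) < min_charge_pct:
        warnings.append(f"Relative State of Charge: {charge} %")
    return warnings
```

Feeding it the text printed by the command (for example captured to a file) gives an empty list when nothing on the watch list is flagged and the charge is above the chosen threshold.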

I have a case open with support and I sent them the logs so they can dispatch a part if needed.

Another area I would like to discuss is the queue processing, rebasing and garbage collection.

Any chance you can briefly elaborate on what each does and the manual commands for each?

I did run /usr/openv/pdde/pdcr/bin/crcontrol --processqueue and it returned OK; not sure if that means it is done or what.

Lastly, something I discovered while getting the logs for support is that I have a gagillion temp vmdk files in the tmp folder.  A quick search revealed that this might be an issue.  https://www-secure.symantec.com/connect/forums/5220-appliance-file-system-errors-reboot-check-forced

You spoke on that thread, so I am curious if you know anything more specific about the EEB for VMware.  My current appliance version is 2.5.2.

At some point it might be worth investigating the fragment size you mention as well.

On the final point of one VM per data store, I will try this setting now, just to have it in place.  I had it at 4...but in the long run I think the battery issue needs to be hashed out first, and then the queues, fragment size and orphaned vmdk's in the tmp folder.

Mark_Solutions
Level 6
Partner Accredited Certified

Thanks for getting back to me ... if the battery is playing up it will cause you real issues ... and it will need the new one to be fitted, allowed to charge and then a reboot to make the RAID cache fully operational - this will greatly improve speed.

The VMware EEB is ET2982308; it is worth asking support if they have a 2.5.2 version for the appliance while you have a case open with them (unless the files are there from when the system was on 2.5.1 - are they new or old?)

So on to queue processing ... the queue size is shown in kb and the size of the queue can be seen by running:

/usr/openv/pdde/pdcr/bin/crcontrol --queueinfo

This shows the size of the queue, the larger it is the less efficient the system can be as it means it has a lot of transactions to work through.

To see if it is currently processing the queue run the following (which shows if it is running and if one is queued):


/usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo

If the queue is large (GBs) and it is not being processed then run it twice (the command I gave yesterday) - this will fire one off and queue another. It can take 5 or 6 runs to bring the queue right down.
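The decision described here (look at the queue, see whether a run is busy or pending, and request enough runs so one is running and one is queued) can be sketched in Python. This is only an illustration: it parses text in the form `total queue size : N` and `Busy : yes/no` / `Pending: yes/no`, which is the format quoted later in this thread, and the function names are made up for the example. It does not call crcontrol itself.

```python
def parse_queue_size(queueinfo_text):
    """Return the number after 'total queue size', or None if it is absent."""
    for line in queueinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "total queue size":
            return int(value.strip())
    return None


def parse_busy_pending(processqueueinfo_text):
    """Map the 'Busy' and 'Pending' lines to booleans."""
    state = {"busy": False, "pending": False}
    for line in processqueueinfo_text.splitlines():
        key, sep, value = line.partition(":")
        name = key.strip().lower()
        if sep and name in state:
            state[name] = value.strip().lower() == "yes"
    return state


def runs_to_request(state):
    """How many --processqueue requests to issue so one runs and one is queued."""
    if state["pending"]:
        return 0  # a run is already waiting, nothing to add
    return 1 if state["busy"] else 2
```

With neither busy nor pending set, `runs_to_request` returns 2, which matches the "run it twice" advice above.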

To see what your de-dupe data sizes are you can run either of the following - the second giving a little more detail:
/usr/openv/pdde/pdcr/bin/crcontrol --dsstat
/usr/openv/pdde/pdcr/bin/crcontrol --dsstat  1

If you want to manually run garbage collection (this is like an image cleanup job for de-dupe, but it only runs occasionally [monthly]) you can use the following - but do this via an IPMI session, as you have to leave the command until it completes, so it stops you doing anything else
/usr/openv/pdde/pdcr/bin/crcollect -v -m +1,+2

Rebasing is like a defrag for de-dupe - it tidies things up and, like a defrag, makes things run quicker

To check state run /usr/openv/pdde/pdcr/bin/crcollect –rebasestate

A useful command is to do the following after you have kicked off queue processing:

tail -f /disk/log/spoold/storage.log

You can then watch the log as it processes each set of transactions - it should process each batch in a few seconds (maybe up to 20 seconds) - any more than this means the system is slow.

Hope all this helps

 

smakovits
Level 6

Mark,

Awesome information!  Just a few things as I work through this all.

# /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
total queue size : 14634764929
 

# /usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
Busy   : no
Pending: no

# /usr/openv/pdde/pdcr/bin/crcontrol --processqueue (twice)

# tail -f /disk/log/spoold/storage.log
tail: cannot open `/disk/log/spoold/storage.log' for reading: No such file or directory
tail: no files remaining
 

# /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
total queue size : 14641550937
 

So, essentially, I am not certain as to what this means exactly.  Is it working or not or did I do something wrong?

As for the VMDK's in the tmp folder, some are older, end of April, while others are as recent as the end of May.  Therefore, I will contact support about the EEB.

While on the topic of the TMP, do those files need to be gracefully cleared out or is there a way to just delete it all?  Or is the data relevant?

File names are:

errfile_xxxxxx

infile_xxxxxx

outfile_xxxxxx

and then some xxx_filelist files

and 897 VMDK files

In total the tmp folder has 22,022 files.

And very lastly for now,

/usr/openv/pdde/pdcr/bin/crcontrol --dsstat  1

just hangs and I get nothing back, even after waiting several minutes.  I thought it was the extra space you had, so I removed it and still, the results are the same.

smakovits
Level 6

One thing I did manage to find is the log file.  While in the system I thought to go look for it and found we were missing a "d".

# tail -f /disk/log/spoold/storage.log

# tail -f /disk/log/spoold/storaged.log

June 04 12:54:04 INFO [1082194240]: Transaction log 2086858-2094268: 53500000 of 245142860 entries processed.
June 04 12:54:08 INFO [1082194240]: Transaction log 2086858-2094268: 53600000 of 245142860 entries processed.
June 04 12:54:13 INFO [1082194240]: Transaction log 2086858-2094268: 53700000 of 245142860 entries processed.
June 04 12:54:17 INFO [1082194240]: Transaction log 2086858-2094268: 53800000 of 245142860 entries processed.
June 04 12:54:22 INFO [1082194240]: Transaction log 2086858-2094268: 53900000 of 245142860 entries processed.
June 04 12:54:27 INFO [1082194240]: Transaction log 2086858-2094268: 54000000 of 245142860 entries processed.
June 04 12:54:34 INFO [1082194240]: Transaction log 2086858-2094268: 54100000 of 245142860 entries processed.
 

 

So not sure if this means it is working, but that is the output that is scrolling by.
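One way to read those lines is to turn them into a processing rate. The Python sketch below assumes the log line format shown above and that all samples fall on the same day; the function names are made up for the example. Using the first and last lines quoted (53,500,000 entries at 12:54:04 and 54,100,000 at 12:54:34), it works out to 20,000 entries per second, which would leave roughly 2 hours 40 minutes for the rest of this transaction log if the rate held.

```python
import re
from datetime import datetime

# Matches the time of day, entries processed so far, and the total entry count.
LINE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}) INFO \[\d+\]: Transaction log \S+: "
    r"(\d+) of (\d+) entries processed"
)


def progress_samples(log_text):
    """Return (time, entries_done, entries_total) tuples found in the log text."""
    samples = []
    for match in LINE_RE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        samples.append((stamp, int(match.group(2)), int(match.group(3))))
    return samples


def rate_and_remaining(samples):
    """Return (entries per second, seconds remaining) from first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two progress lines")
    (t0, done0, _), (t1, done1, total) = samples[0], samples[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0 or done1 <= done0:
        raise ValueError("no measurable progress between samples")
    rate = (done1 - done0) / elapsed
    return rate, (total - done1) / rate
```

Each 100,000-entry batch in the sample takes about 4 to 7 seconds, so by the "few seconds, maybe up to 20" rule of thumb given earlier the queue is moving at a reasonable pace.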

 

 

smakovits
Level 6

Also, found what was happening.

/usr/openv/pdde/pdcr/bin/crcollect –rebasestate  --- (failed)

# /usr/openv/pdde/pdcr/bin/crcontrol --rebasestate
Image rebasing: ON
Rebasing busy: Yes
 

Mark_Solutions
Level 6
Partner Accredited Certified

OK - glad you spotted my deliberate mistake with the missing d in storaged.log!!

The tail -f shows that it is processing fairly quickly which is good.

The queue size is huge!! It is not showing as busy or pending, so it really needs kicking off manually on a regular basis to try and get it trimmed down - I see you kicked it off manually twice, which is good, so presumably it will be a bit smaller by now - keep on top of it to see how small you can get it - if there is not at least one pending then fire it off again manually.

The rebase state is interesting as it shows it is ON and processing and yet has failed - not sure about that one, it may be worth turning it off and back on again (never heard that IT expression before!!) - just use crcollect -rebaseoff, then leave it a few minutes and use crcollect -rebaseon.

Then check what the -rebasestate says

While asking support about the VMware EEB, ask them about the telemetry EEB too - I thought it was included in 2.5.2, but it is worth checking if you have a lot of stuff in /tmp, although NetBackup itself does use the directory as its general cache location while doing normal processing.

As for deleting the stuff, I usually cd into /tmp and then just do a rm *.vmdk or similar; it is just best if no jobs are running at the time, as the files may actually be in use.
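Since files in /tmp may be in use by a running job, a cautious alternative to a blanket rm is to list only the files that have not been modified for a while and review that list first. The Python sketch below does exactly that and deletes nothing; the 24-hour default and the function name are assumptions for illustration, not anything NetBackup prescribes.

```python
import time
from pathlib import Path


def stale_files(directory, pattern="*.vmdk", min_age_hours=24, now=None):
    """List files matching pattern whose last modification is older than the cutoff.

    This is a dry run: it only returns paths and never removes anything.
    min_age_hours is a hypothetical safety margin.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_hours * 3600
    return sorted(
        path
        for path in Path(directory).glob(pattern)
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    for path in stale_files("/tmp"):
        print(path)
```

It is still best to run any actual cleanup when no jobs are active, as noted above.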

Finally, i think, is the crcontrol --dsstat 1 .... that is the right command but the issue may be the huge transaction queue you have - it could take a LONG time to return a result! Maybe wait until you have trimmed it down again and see how it goes then.

Worth mentioning the rebase failed state to support anyway

Hope all this helps

Andrew_Madsen
Level 6
Partner

Minor correction Mark. That should be crcontrol (/usr/openv/pdde/pdcr/bin/crcontrol) not crcollect for the rebase switch.

Mark_Solutions
Level 6
Partner Accredited Certified

Thanks Andrew - don't think my fingers are working right just lately - LOL!

smakovits
Level 6

OK, for starters, I finally got the battery replaced yesterday and am seeing some odd behavior; however, it could potentially be normal.  Should the RAID battery charge and discharge during use/backups?  If yes, then my concern is gone, but if no, then I guess it is back to support.  I did mention this to them too, so they might respond as I am writing this, but as far as this thread is concerned, I thought it was worth the mention.  Here are 3 snippets from the emails sent from the 5220 after the battery was replaced.  As you can see, it starts to charge up, but then later in the day loses the charge while the backups are running.

+--------------------------------------------------------------------------------------------+
|                                      RAID Infomation                                       |
|+------------------------------------------------------------------------------------------+|
||ID|Name|Status |Capacity| Type |Disks|Write Policy|Enclosure|HotSpare | State |Acknowledge||
||  |    |       |        |      |     |            |   ID    |Available|       |           ||
||--+----+-------+--------+------+-----+------------+---------+---------+-------+-----------||
||  |    |       |        |      |0 1 2|            |         |         |       |           ||
||3 |VD-0|Optimal|4.541TB |RAID-6|3 4 5|WriteThrough|0        |yes      |Warning|No         ||
||  |    |       |        |      |6    |            |         |         |       |           ||
|+------------------------------------------------------------------------------------------+|
|                              Adapter Information                                           |
|+-----------------------------------------------------------------------------+             |
||  |                     |Adapter| BBU  |BBU Learn | BBU  |       |           |             |
||ID|    Adapter model    |Status |Status|  Cycle   |charge| State |Acknowledge|             |
||  |                     |       |      |  active  |      |       |           |             |
||--+---------------------+-------+------+----------+------+-------+-----------|             |
||  |Integrated Intel(R)  |       |      |          |      |       |           |             |
||1 |RAID Controller      |OK     |OK    |No        |41 %  |Warning|No         |             |
||  |SROMBSASMP2          |       |      |          |      |       |           |             |
|+-----------------------------------------------------------------------------+             |
+--------------------------------------------------------------------------------------------+

 


Time Monitoring Ran: Wed Jun 5 19:21:05 2013 EDT

+--------------------------------------------------------------------------------------------+
|                                      RAID Infomation                                       |
|+------------------------------------------------------------------------------------------+|
||ID|Name|Status |Capacity| Type |Disks|Write Policy|Enclosure|HotSpare | State |Acknowledge||
||  |    |       |        |      |     |            |   ID    |Available|       |           ||
||--+----+-------+--------+------+-----+------------+---------+---------+-------+-----------||
||  |    |       |        |      |0 1 2|            |         |         |       |           ||
||3 |VD-0|Optimal|4.541TB |RAID-6|3 4 5|WriteThrough|0        |yes      |Warning|No         ||
||  |    |       |        |      |6    |            |         |         |       |           ||
|+------------------------------------------------------------------------------------------+|
|                              Adapter Information                                           |
|+-----------------------------------------------------------------------------+             |
||  |                     |Adapter| BBU  |BBU Learn | BBU  |       |           |             |
||ID|    Adapter model    |Status |Status|  Cycle   |charge| State |Acknowledge|             |
||  |                     |       |      |  active  |      |       |           |             |
||--+---------------------+-------+------+----------+------+-------+-----------|             |
||  |Integrated Intel(R)  |       |      |          |      |       |           |             |
||1 |RAID Controller      |OK     |OK    |No        |2 %   |Warning|No         |             |
||  |SROMBSASMP2          |       |      |          |      |       |           |             |
|+-----------------------------------------------------------------------------+             |
+--------------------------------------------------------------------------------------------+

 


Time Monitoring Ran: Thu Jun 6 02:21:04 2013 EDT

+--------------------------------------------------------------------------------------------+
|                                      RAID Infomation                                       |
|+------------------------------------------------------------------------------------------+|
||ID|Name|Status |Capacity| Type |Disks|Write Policy|Enclosure|HotSpare | State |Acknowledge||
||  |    |       |        |      |     |            |   ID    |Available|       |           ||
||--+----+-------+--------+------+-----+------------+---------+---------+-------+-----------||
||  |    |       |        |      |0 1 2|            |         |         |       |           ||
||3 |VD-0|Optimal|4.541TB |RAID-6|3 4 5|WriteThrough|0        |yes      |Warning|No         ||
||  |    |       |        |      |6    |            |         |         |       |           ||
|+------------------------------------------------------------------------------------------+|
|                              Adapter Information                                           |
|+-----------------------------------------------------------------------------+             |
||  |                     |Adapter| BBU  |BBU Learn | BBU  |       |           |             |
||ID|    Adapter model    |Status |Status|  Cycle   |charge| State |Acknowledge|             |
||  |                     |       |      |  active  |      |       |           |             |
||--+---------------------+-------+------+----------+------+-------+-----------|             |
||  |Integrated Intel(R)  |       |      |          |      |       |           |             |
||1 |RAID Controller      |OK     |OK    |No        |8 %   |Warning|No         |             |
||  |SROMBSASMP2          |       |      |          |      |       |           |             |
|+-----------------------------------------------------------------------------+             |
+--------------------------------------------------------------------------------------------+

Ed_Carter
Level 4
Certified

Having experienced 4 failed batteries: yes, the discharge is normal and is all part of the learn cycle. What Symantec never seem to warn about is the time it takes for this process to complete (while it's running, for obvious reasons, it keeps the write-back policy disabled and runs like crap). I've had it take 15 hours to complete before and it's really impacted our backup window! Basically, as soon as the policy shows as Writeback, decent performance should soon return.
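If you want to track this from the monitoring emails rather than eyeballing them, a rough Python sketch follows. It assumes the pipe-delimited layout of the reports quoted earlier in the thread (volume name in the second column, write policy in the seventh, BBU charge in the sixth column of the adapter row). The exact wording the report uses for the write-back policy is not shown in this thread, so the code just returns whatever text it finds.

```python
import re


def table_rows(report):
    """Split each '|'-delimited report line into stripped cells."""
    rows = []
    for line in report.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) > 1:
            rows.append(cells)
    return rows


def write_policies(report):
    """Return {volume name: write policy} for rows whose name starts with 'VD-'."""
    return {
        cells[1]: cells[6]
        for cells in table_rows(report)
        if len(cells) > 6 and cells[1].startswith("VD-")
    }


def bbu_charge_pct(report):
    """Return the first 'NN %' value found in the BBU charge column, or None."""
    for cells in table_rows(report):
        if len(cells) > 5 and re.fullmatch(r"\d+ %", cells[5]):
            return int(cells[5].split()[0])
    return None
```

On the first report quoted above this would give `{"VD-0": "WriteThrough"}` and a charge of 41.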

We too experienced poor performance when we first got our 5220s (although we only have a single disk tray so cannot comment on your issue there..). Our issues were caused by a number of factors. See my previous post, which has some great suggestions from the guys on this forum.

SLP tuning - If you are running AIR or tape duplications this helps. I suspended all SLP processing for the 1st period of my window to get a bulk of the backups through and then start via the task scheduler. I also use the Lifecycle_parameters files in db\config to batch things up. This really helped and keeps the noise in the activity monitor to a minimum..

Limit I/O on disk pool. This was the biggest help for performance and I have tuned it to 35 streams.

Use query based client selection and use VMware resource limits in the master properties. We limit to 2 concurrent backups per datastore. The query selection balances load across the virtual environment, and despite annoying features like being unable to rerun failed jobs from the monitor, this has been worth doing in our environment.

External factors. We had an I/O issue on the LUN that had our VC database on it. When the backups were hitting the virtual environment, all the VC logging etc. was creating a lot of noise on the DB, which created performance issues all round..

Of course you may just have issues with your appliance(s) but hope this helps. I went through a whole heap of pain but happy to say all is finally running well. Definitely a good idea to do some manual queue processing as mentioned above..

Just waiting for a battery to fail at 6pm on a Friday night now..

Cheers

Ed

smakovits
Level 6

Now, in order to avoid clutter with the above logs, I wanted to split out to a new response for the rest of the stuff I have.

VMDK EEB, got it from support, waiting to apply it when no jobs are running.

Telemetry EEB, "From all notes I have seen the issue with telemetry data was supposed to be addressed in 2.5.2. There is no telemetry EEB for 2.5.2 at this time."

My queue is still insane and not going anywhere:

# /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
total queue size : 6050796993
creation date of oldest tlog : Thu Jun  6 00:54:08 2013
 

It is not even working and I kicked it off twice again as suggested.

/usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
Busy   : no
Pending: no
 

I took the suggestion of turning the rebase process off and on.  I actually left it off for several minutes,

# /usr/openv/pdde/pdcr/bin/crcontrol --rebasestate
Image rebasing: ON
Rebasing busy: Yes
 

On the battery front, this is the latest:

# /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -getbbustatus -a0

BBU status for Adapter: 0

BatteryType: iBBU
Voltage: 4069 mV
Current: 0 mA
Temperature: 39 C

BBU Firmware Status:

  Charging Status              : None
  Voltage                      : OK
  Temperature                  : OK
  Learn Cycle Requested        : No
  Learn Cycle Active           : No
  Learn Cycle Status           : OK
  Learn Cycle Timeout          : No
  I2c Errors Detected          : No
  Battery Pack Missing         : No
  Battery Replacement required : No
  Remaining Capacity Low       : No
  Periodic Learn Required      : No
  Transparent Learn            : No

Battery state:

GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : Yes
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : No
  Remaining Capacity Alarm: No
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No

Relative State of Charge: 99 %
Charger System State: 49168
Charger System Ctrl: 0
Charging current: 0 mA
Absolute state of charge: 98 %
Max Error: 2 %

Exit Code: 0x00
 

 

I think that is it for now.  Support did respond and requested another DataCollect.

 

 

smakovits
Level 6

Thanks for the Reply Ed.  Certainly good to know the batteries are ultra reliable...

 

Currently, in order to track performance issues, I tuned VM jobs to 1 per datastore, but for obvious reasons, of 5 jobs I am looking at now, I see 8, 9, 10, 12, 11 MB/sec...sweet, I know!

Can I ask what your performance is?  I know mine won't be the same, but one thing nobody can tell me is what I can expect, which annoys me.

Of interest would be the DB issue you had too. Are you just saying the LUN the SQL DB was on, or is it something else?  Just curious, as it would be something to note in the back of my head once I am past the battery and queue processing issues.

Ed_Carter
Level 4
Certified

Yeah, that's not great. Generally if I get 20,000 or above I see that as acceptable for our window. I'd say on average we get about 40,000 KB/sec over SAN transport but do get a lot of variance. The range is from 20,000 all the way up to 130,000 KB/sec. If the I/O streams are at the max, say 20k - 50k. If there is less going on we hit some of the higher figures.
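Because the thread mixes MB/sec and KB/sec figures, a small conversion helps when comparing them. The Python sketch below assumes 1 MB = 1024 KB and uses a hypothetical 100 GB per-stream size purely to show how the rates quoted in this thread translate into elapsed time; nothing here is a measured result.

```python
KB_PER_MB = 1024
KB_PER_GB = 1024 * 1024


def kb_to_mb_per_sec(rate_kb_s):
    """Convert a job rate from KB/sec to MB/sec."""
    return rate_kb_s / KB_PER_MB


def hours_to_move(size_gb, rate_kb_s):
    """Hours needed to move size_gb at a steady rate_kb_s."""
    return size_gb * KB_PER_GB / rate_kb_s / 3600


if __name__ == "__main__":
    # Rates quoted in this thread; 100 GB is an illustrative size only.
    for rate in (3500, 10000, 20000, 40000, 130000):
        print(
            f"{rate:>7} KB/sec = {kb_to_mb_per_sec(rate):6.1f} MB/sec, "
            f"{hours_to_move(100, rate):5.2f} h per 100 GB"
        )
```

On that basis 10,000 KB/sec is a little under 10 MB/sec, in line with the 8 to 12 MB/sec jobs mentioned above, while 40,000 KB/sec is about 39 MB/sec.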

We finally got rid of tape after moving our 2nd 5220 to our second data centre and use bi-directional AIR to effectively offsite our backups. I've found tape duplication certainly has a negative impact due to it having to rehydrate the data and pull the various blocks from all over the disk (this is where the rebase comes into play, to minimise this as best it can). As AIR is only sending the unique blocks, the read hit is much less.

In terms of the VC database, we found it was sharing a LUN with a couple of other SQL servers and just wasn't getting the IOPS that it needed. We are running on an EMC CX4-480, which doesn't have the best reporting for IOPS, so we had to dig around to pull the stats (soon going to VNX, which should be much better). When the backups were in flight we found a marked decrease in vCenter performance, which had a knock-on effect on backup performance as well as reliability. We've actually given the server a dedicated LUN now and it has much improved our situation.

Best bet is to check out the data path all the way from source to target. Get your SAN stats (not just from VMware but from the actual array) to see if it's touching the sides, then work back to your SAN switches (port errors, faulty fibre etc).. I'm of course assuming you're using SAN and not nbd. If you're using nbd I think it'll be using the ESX host management network for the backup, so maybe check that out too.

My problem was that we are a Windows house and going to the 5220 on Linux was a bit of a step into the unknown, and I felt pretty helpless when trying to troubleshoot potential non-NetBackup issues on the appliance (for example disk utilisation, memory etc). My advice is to do all you can to rule out external factors in your environment. Symantec support can be extremely hard work, but speak to your account manager. They are really keen to prove how great their appliances are and I'd be very surprised if they wouldn't send an appliance engineer to your site for the day to assist you if you express your dissatisfaction (they did with us and were prepared to subsequently send a "technical expert" guy until we got things running well).

Hope this helps.

 

Ed

smakovits
Level 6

just found this little guy...

 

my webui was failing to display data

 

http://www.symantec.com/business/support/index?page=content&id=TECH204164

Brook_Humphrey
Level 4
Employee Accredited Certified

First off it's good to keep up with the late breaking news page:

http://www.symantec.com/docs/TECH145136

There are a couple publicly released EEB's in there that would be good to install. 

The one you mention and also the one for disk i/o alerts.

 

Now to discuss more your issue.

 

1. Rebasing will help to speed up the duplicaiton to tape once it completes. However if you are duping off to tape right after the backup it is likely that rebasing is not keeping up.

2. I did not see above how much space you are using on your dedupe pool. Is it less than 32TB, or are you using up all 72TB already? This will definitely affect the speed of your system as well.

3. Do you have the dedupe pool using the 4TB of internal disk on the appliance as well, or is it only using the external storage? The internal disks run at 3GB/s whereas the external ones run at 6GB/s. If you use the internal disks for your dedupe pool, it will downgrade the performance of the whole dedupe pool to 3GB/s for the disk subsystem.

A good option, and the recommended path if you are duping off to tape, is to use the internal disks for an advanced disk and the external storage shelves for dedupe only. Set up your SLP to do the backup to the advanced disk, then do both your dupe to tape and your dupe to the dedupe pool from there. Have it expire the image after duplication.

This will give good backup and duplication-to-tape performance as well as keeping the performance up for the dedupe pool itself.
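
To make the order of operations in that SLP concrete, here is a rough Python sketch of the flow. It is only a model for reasoning about the layout, not NetBackup syntax: the storage unit names (`advdisk_internal`, `tape_library`, `dedupe_pool_external`) and the retention labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlpOperation:
    action: str            # "backup" or "duplication"
    target: str            # hypothetical storage unit name
    source: Optional[str]  # copy this operation reads from (None for the backup)
    retention: str         # illustrative label, not a NetBackup keyword


def staged_slp() -> List[SlpOperation]:
    """Backup lands on the internal advanced disk; both dupes read from it."""
    return [
        SlpOperation("backup", "advdisk_internal", None, "expire_after_copy"),
        SlpOperation("duplication", "tape_library", "advdisk_internal", "fixed"),
        SlpOperation("duplication", "dedupe_pool_external", "advdisk_internal", "fixed"),
    ]


def reads_from_dedupe(ops: List[SlpOperation]) -> bool:
    """True if any operation has to read its source data back out of a dedupe pool."""
    return any(op.source is not None and "dedupe" in op.source for op in ops)


if __name__ == "__main__":
    for op in staged_slp():
        print(f"{op.action:12} -> {op.target:22} from {op.source} ({op.retention})")
    print("tape dupe reads from dedupe pool:", reads_from_dedupe(staged_slp()))
```

The point of the layout is visible in `reads_from_dedupe`: with the staging copy in place, nothing reads its source back out of the dedupe pool, so the pool only has to handle writes while the tape dupe runs.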

4. I don't see any examples above of what the rest of your infrastructure is like, but the things pointed out in point 3 should cover quite a bit. They should remove most of your current bottlenecks and help your systems perform faster. Beyond this, it is likely you would have to start looking at other aspects of your backups to find out where issues may lie.

a. What kind of data are you backing up?

b. What kind of dedupe rates are you getting?

c. What kind of performance are you actually getting out of your network? Have you done any testing on this to see?

Anyway if you have any further questions please let me know. 

 

Thanks 

 

smakovits
Level 6

Brook, some awesome data there. To be 100% honest, I did not do any of the configuration; it was done by the UNIX admin who is the backup admin. I administer the Windows side of things, so I don't mess with a lot of the tapes, devices, etc. However, I do have access to do it as needed, as long as I know what we are doing, its impact, etc. So, now that that is cleared up, let me see if I can respond to your post.

Well, before I do, I should report that something is definitely screwed up somewhere because I just checked my queue info and get the following.

# /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
total queue size : 15433220155
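
As a side note for anyone wanting to track that number over time rather than eyeball it: a minimal Python sketch that pulls the value out of the output line shown above. It assumes the line always has the form `total queue size : <number>`, as in this paste; the helper names are made up, and the unit of the figure is not stated in the output, so the code treats it as a plain integer.

```python
import re
from typing import List, Optional

# Matches the line format seen in the pasted output above.
QUEUE_LINE = re.compile(r"total queue size\s*:\s*(\d+)")


def parse_queue_size(output: str) -> Optional[int]:
    """Return the queue size from crcontrol --queueinfo output, or None if absent."""
    match = QUEUE_LINE.search(output)
    return int(match.group(1)) if match else None


def queue_deltas(samples: List[int]) -> List[int]:
    """Differences between consecutive samples; positive means the queue grew."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


if __name__ == "__main__":
    sample = "total queue size : 15433220155"
    print(parse_queue_size(sample))
```

Sampling the command at intervals and feeding the parsed values to `queue_deltas` would show whether queue processing is catching up or falling further behind.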

OK, now on to your thoughts.

1. This is most likely the case; the dupe jobs are running forever and all the time. Until we get the second appliance in place and replicating, there is an obvious need to dupe to tape...
 

2. The /usr/openv/pdde/pdcr/bin/crcontrol --dsstat  1 will not return a value, and I assume this is where this information is, so I guess I will ask if there is any other way to get the information? Regardless, I do not believe we are full; closer to 32 than 64 I believe. (42TB used, just looked)

3. I don't believe the 4TB of internal disk is in the dedupe pool.

 

Partition        Total        Used         Available    %Used
AdvancedDisk     0 GB         0 GB         0 GB         0
Deduplication    64 TB        42.300 TB    21.699 TB    67
Unallocated      11.482 TB    -            -            -

Is it difficult to set up the SLP you describe? I would be all for it if you feel that it would help with the speed of backups. Now, is this solution only for while we are duping to tape? If so, that is fine; I'm asking more for informational purposes. But I would certainly still be for exploring this and getting it put in place, or at least testing it.

 

4. a. One server that is slow all the time is a file server with thousands of small files. I know this impacts client-side backups in a big way; does it do the same for VMware backups as well?

b. Is there an easy way to get you the overall number? Looking in the admin console I see 30% - 100%, with lots of 70s, 80s and 90s.

c. What sort of testing do you mean? Source-to-destination tests from the appliance to disk, that sort of thing? During backups I can see rates of 6000KB to 100,000KB for VMware backups.

c2. For client-side backups I have a machine I am testing with that is currently able to do a 90+MB/sec backup to tape as a physical server. If I install SEP, this degrades to 60-70MB, so I have a case open with SEP support on that. Do SEP and the firewall have any impact on a VMware backup? I assume not, but can't be certain.
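
Since the job rates above are quoted in KB per second while the tape figures are in MB per second, a small conversion helper makes them comparable. This is only a sketch: it assumes 1 MB = 1024 KB and 1 GB = 1024 MB, and the function names are made up.

```python
def kb_per_sec_to_mb_per_sec(kb_per_sec: float) -> float:
    """Convert a KB/sec job rate to MB/sec (assuming 1 MB = 1024 KB)."""
    return kb_per_sec / 1024.0


def hours_to_move(size_gb: float, kb_per_sec: float) -> float:
    """Estimate hours needed to move size_gb of data at a sustained KB/sec rate."""
    if kb_per_sec <= 0:
        raise ValueError("rate must be positive")
    size_kb = size_gb * 1024 * 1024
    return size_kb / kb_per_sec / 3600.0


if __name__ == "__main__":
    # Rates taken from the figures quoted in this thread.
    for rate in (3500, 6000, 10000, 24000, 100000):
        print(f"{rate:>7} KB/s = {kb_per_sec_to_mb_per_sec(rate):6.1f} MB/s, "
              f"{hours_to_move(500, rate):6.1f} h per 500 GB")
```

The 500 GB size in the loop is an arbitrary example; substituting the actual size of a slow VM gives a quick sense of whether a given rate fits the backup window.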

 

So, I think for starters, if we could execute 3, that might be a good test.

Brook_Humphrey
Level 4
Employee Accredited Certified

2. Just use crcontrol --dsstat on its own instead (without the trailing 1), as it will return what you need.

3. In the web admin GUI where you got that info from (or did you get it from the CLISH?), you can resize and move partitions. On the left-hand side of the web GUI it also shows which storage is being used and how much, along with the options to move and resize. In this case you should be able to simply move the volume to the external storage.

Yes, this way of setting up SLPs is specifically for when you are duping to tape and will not be needed once you get replication going.

4. a. Actually, the appliances seem to handle a large amount of VM backups quite well. I have them in use at multiple sites and they are able to keep up with the load. We also make use of autodiscover quite a bit.

b. That's a little difficult. If you have OpsCenter Analytics then I have some custom reports that will get this data for you, but it's enough to say that the lower the dedupe rate, the more it will impact your performance.

c. Symantec support has a tool called appcritical that is a network testing and performance battery test. It returns more than just performance numbers and gives a good overall idea of how your network is working. 

c2. So things like SEP will greatly impact the performance of the client while it is doing fingerprinting. To work around this exclude the netbackup client install folder so that it is not constantly scanning the clients working files and processes.

As for VMware backups: if it does impact these I have not noticed, and my VM backups run snappily enough that I have never thought to check.

smakovits
Level 6

OK, I will look into 3 on Monday with the other admin and start there.

 

While poking around the web UI I came across this. Does it have any impact on the stuff we are doing?