11-09-2012 03:46 AM
Hi all, first-time poster. Looking forward to the experience.
OK, I work with NetBackup 7.0.1, with a clustered Unix master server and 40 media servers.
Backup capacity is about 500TB/week.
Our tape drive environment is two SL3000s: one in the production environment (32 drives, 2 robotic arms) and another in the DR environment (14 drives).
All drives are in an SSO configuration; all media servers can access all drives.
Our challenge is that the setup is unstable. We regularly have issues where all configured drives on a media server go to AVR mode and multiple instances of the tldcd process are spawned on the master server, often leading to backups failing with error code 98.
My question is: is there a more efficient way to manage/configure this environment?
Would love to hear your ideas...
Cheers
11-09-2012 08:54 AM
A lot depends on the O/S of your media servers.
By rights it should all be fine, but a few things to keep it tidy always help:
1. Make sure you always have cleaning tapes available, as drive errors can cause issues
2. Make sure all cabling and switch setups are good
3. Make sure switches are set not to do a reset upon error, otherwise they will detect a write error and reset the bus, dropping the drive connection
4. Use persistent binding for all drives on all servers and keep them identical across all servers, with the robotics being 0,0 where possible
5. On Windows, ensure that the Removable Storage service is stopped and disabled
6. On Windows, make sure that the AutoRun key for the tape drives is set and has a value of 0 (http://support.microsoft.com/kb/842411)
7. Keep tape drivers up to date
8. Reboot all master / media servers regularly to clear down orphaned processes (once a week if possible)
9. If you have any library / drive issues and have to restart it, always make sure you also restart the media servers (but only once the library is back up and showing as "ready")
Hope this helps
11-10-2012 01:00 AM
I tend to only share drives between two media servers. We call this configuration an SSO pair.
If too many media servers share drives, a single drive fault will multiply status 98 errors in horrible numbers until an NBU admin takes the drive down.
From the story you are telling, it sounds like you have had the same experience as me.
11-11-2012 01:54 AM
Thanks guys,
Mark, we already implement a lot of the points you mentioned, although cleaning tapes have proven a small challenge. We do not reboot the servers regularly enough, however, as we can hardly afford the downtime. It's something I will have to look into.
Nic, yeah, I think so too. We also down and replace drives with recurrent errors, but we don't have enough drives to implement SSO pairs. There is a plan to increase the number of drives, but I'm wondering if the global SSO configuration is the issue, or is there something we're missing that would stabilize the setup?
11-11-2012 04:53 AM
Also, you mention that the drives on one media server go to AVR mode.
If the library is ACS controlled, it is fairly certain the media server is losing its connection to the library, which points to a network issue.
If the library is not ACS, and is controlled by a different media server than the one that shows drives in AVR, then it is probably losing its connection to this other media server (the robot control host).
If the media server itself is the robot control host, then there is an issue between the server and the library; it's probably losing its connection over the SAN.
Martin
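The three cases above can be written down as a small decision helper. This is only a sketch of the reasoning in this post, not a NetBackup tool: the function name and its two boolean parameters are made up for illustration, and the returned text is a first guess at where to look, not a diagnosis.

```python
def likely_avr_cause(library_is_acs: bool, is_robot_control_host: bool) -> str:
    """Return a first guess at why a media server's drives dropped to AVR.

    Hypothetical helper: both parameters are assumptions about how you
    would describe your own setup; nothing here queries NetBackup.
    """
    if library_is_acs:
        # ACS-controlled library: the media server talks to the library
        # controller over the network.
        return "network link between media server and library"
    if is_robot_control_host:
        # This media server drives the robot itself.
        return "SAN link between media server and library"
    # Robot is controlled by another media server (the robot control host).
    return "network link between media server and robot control host"


if __name__ == "__main__":
    print(likely_avr_cause(library_is_acs=True, is_robot_control_host=False))
    print(likely_avr_cause(library_is_acs=False, is_robot_control_host=False))
    print(likely_avr_cause(library_is_acs=False, is_robot_control_host=True))
```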
11-12-2012 11:22 AM
A status 98 storm may cause NetBackup to put the robot in AVR status temporarily, even with a perfect connection to ACSLS.
11-13-2012 01:59 AM
I'm fairly sure there's a working limit to the number of shared connections and you are exceeding it.
For the record, I've got 6 media servers, SSO (5 Solaris, 1 NDMP), and I never reboot to fix NetBackup issues. There's no need to; the software is stable.
Here we are:
https://www-secure.symantec.com/connect/forums/sso-limitations-and-usage
250, unofficial! You've got 32x40 at least, so I'd be tempted to get the official line from Symantec as to how many drives/servers can share.
Superficially at least, somewhere between my config and yours things start going wrong. It would be useful to know Symantec's response.
Jim
11-13-2012 08:21 AM
You've almost got more media servers than drives. Is there a history behind this design?
I personally think 5 drives shared between 2 media servers is a good value and controllable.
11-14-2012 03:30 AM
You have got too many entities in a single SSO pool. You have 40 media servers and 32 drives altogether. That gives 40*32 = 1280 entities for NBRB to deal with. A real pain in the neck! I bet you suffer from resource allocation issues. I suppose in your environment there must be a distinct pause between the backup job start (green running man) and the moment it actually starts to write data to a tape. The most important architectural change you should implement is to split your huge SSO pool into a number of smaller ones. Say you make 4 pools, each with 10 media servers that share 8 drives. The number of entities, i.e. media servers x drives, must not exceed 255. Also, you should implement NBRB tuning to boost resource allocation in your environment. http://www.symantec.com/business/support/index?page=content&id=TECH57942
There is a detailed MS Word document on NBRB tuning in various NetBackup versions. Sorry I can't point you to the direct link, as I can't find it at the moment. Please do a search on the Symantec site.
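To make the arithmetic concrete, here is a minimal Python sketch that counts entities (media servers x drives) and tries an even split into smaller pools. It is an illustration only: the function names are made up, the even split is an assumption about how you would divide things up, and the 255 ceiling is simply the figure quoted in this post (another reply mentions an unofficial 250), not a number confirmed by Symantec.

```python
ENTITY_LIMIT = 255  # figure quoted in this thread; treat as unofficial


def sso_entities(media_servers: int, drives: int) -> int:
    """Entities in one SSO pool: every media server paired with every drive."""
    return media_servers * drives


def split_into_pools(media_servers: int, drives: int, pools: int):
    """Spread servers and drives as evenly as possible over `pools` pools.

    Returns a list of (media_servers, drives, entities) tuples, one per pool.
    """
    ms_base, ms_extra = divmod(media_servers, pools)
    dr_base, dr_extra = divmod(drives, pools)
    layout = []
    for i in range(pools):
        ms = ms_base + (1 if i < ms_extra else 0)
        dr = dr_base + (1 if i < dr_extra else 0)
        layout.append((ms, dr, sso_entities(ms, dr)))
    return layout


def min_pools(media_servers: int, drives: int, limit: int = ENTITY_LIMIT):
    """Smallest even split that keeps every pool at or under `limit`.

    Returns None if no split with at least one server and one drive
    per pool satisfies the limit.
    """
    for pools in range(1, min(media_servers, drives) + 1):
        layout = split_into_pools(media_servers, drives, pools)
        if max(entities for _, _, entities in layout) <= limit:
            return pools
    return None


if __name__ == "__main__":
    print(sso_entities(40, 32))  # 1280, the single global pool
    for ms, dr, entities in split_into_pools(40, 32, 4):
        print(f"{ms} media servers x {dr} drives = {entities} entities")
    print(min_pools(40, 32))
```

With the numbers from this thread, four pools of 10 media servers and 8 drives come to 80 entities each, comfortably under the quoted ceiling. Jobs that need many drives at once would still have to fit inside a single pool, so the pool size is a trade-off rather than a pure "smaller is better".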
11-14-2012 05:41 AM
Thanks guys,
@Anton, yes, there's normally a pronounced delay between when a resource request is made and when it actually begins to write, on bad days as much as 40 minutes. The challenge here would be that we have some big DBs that are configured to run on 8/10/12 channels, but I guess in the interest of stability we might have to reconsider that.
Right, from the feedback here, I'd guess this could be the issue with my configuration. Now for the daunting task of convincing my management to overhaul our configuration.
I will post any further challenges as I go about this.
Thanks again guys, a nice way to welcome the new guy (thumbs up).
01-09-2013 02:44 AM
Thanks, this information sheds some light on challenges I'm also experiencing in my environment along the same lines.