
Drives being marked down and up - how can I find the cause?

Storage_yoda
Level 4
Hi there

I am currently modeling a proposed backup environment for a new project. Everything is greenfield, so there is no existing setup to base the solution on. The test environment is also very new and untested, so the problems could be with the underlying environment.

Basics

NBU 6.5.4
Platform Windows 2003 R2 SP2 x64
Master server is clustered using VCS
  • site 1 local 2 node cluster
  • site 2 remote single node cluster
  • site 3 remote single node cluster 
Each site also has a single media server; each media server has SAN-based disk presented to it, configured into AdvancedDisk disk pools/storage units.

The sites are connected over an emulated WAN with latencies etc. that are intended to represent the production environment (MPLS cloud).

Description

The initial work I was doing was to test SLPs, with the plan that backups will take place on one of the 3 sites and data will be replicated to the other 2 sites using SLP duplication.

Basic testing was done using non-SLP-based policies, proving that backup jobs would run to each of the media servers from each of the other servers (so all the basic configuration to allow visibility and use of the disks and media servers is OK).

The test SLPs initially seem to work and then become very random in their completion. On review, significant numbers of error 800s are seen, either "media server missing" or "disk volume is down". The errors seem intermittent, and a job may be rerun and succeed where it previously failed.

Looking in the disk log report, significant numbers of entries such as

Volume xxxxxx H:\ marked down
Volume xxxxxx H:\ marked up

are seen, sometimes with the volume down for a few seconds, sometimes for a few minutes.
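
For reference, the volume state can also be watched from the command line. Something along these lines (assuming the default NetBackup 6.5 install path on a Windows master/media server) lists each AdvancedDisk volume and whether it is currently up or down:

rem List all AdvancedDisk disk volumes and their current state
"C:\Program Files\Veritas\NetBackup\bin\admincmd\nbdevquery" -listdv -stype AdvancedDisk -U

rem Summary view of the disk pools themselves
"C:\Program Files\Veritas\NetBackup\bin\admincmd\nbdevquery" -listdp -stype AdvancedDisk -U

This is just a convenience for polling the state repeatedly while jobs run; the install path is from a default installation, so adjust as needed.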

I don't seem to have issues with normal disk target backups, only when using SLPs, but that could just be luck!!

Question

How can I track down what the problem is?
I am thinking it could be timeouts across the WAN, but how can I prove this? And if it is, what can I do to prevent it?
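
For anyone suggesting an answer, these are the sorts of checks I have been running between the master and the remote media servers so far. The host names and addresses below are placeholders, and the commands sit under the NetBackup bin and bin\admincmd directories on a default install:

rem How NetBackup resolves the remote media server's name and address
bpclntcmd -hn mediaserver2
bpclntcmd -ip 10.1.2.3
bpclntcmd -self

rem Test a connection to the media server's bpcd, if bptestbpcd is available in this release
bptestbpcd -host mediaserver2

rem Rough latency check across the emulated WAN
ping -n 20 mediaserver2

If there is a better way to catch a timeout on the EMM side of things, I am happy to try it.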

I am limited in what logs etc. can be posted, but if you need any other info, let me know and I will see what I can do.

Thanks in advance

Alex

3 REPLIES

Storage_yoda
Level 4
Having done some more work, it appears that the disk drives showing as down are all on sites remote from the current location of the master server. So if the master server is currently active on site 1, then I am getting errors on drives at site 2 and site 3. The "Drive Down" events are seemingly random across the 2 remote sites. Usually failing jobs will complete after a number of retries, once the drive becomes "available" again.

This does look like a timeout in the EMM monitoring of the media servers and drives. Is there a way to increase this? Or am I barking up the wrong tree?
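
In case it is relevant, the only timeout settings I have turned up so far are the disk polling service proxy send/receive timeouts (DPS_PROXYDEFAULTSENDTMO and DPS_PROXYDEFAULTRECVTMO). I have not confirmed they are the right knob for this symptom, but raising them on a media server would look roughly like this (values in seconds; bpsetconfig reads the entries from a file, and the file name here is just an example):

rem dps_tmo.txt contains the two lines:
rem   DPS_PROXYDEFAULTSENDTMO = 120
rem   DPS_PROXYDEFAULTRECVTMO = 120
"C:\Program Files\Veritas\NetBackup\bin\admincmd\bpsetconfig" -h mediaserver2 dps_tmo.txt

If anyone knows whether these (or something else in EMM) govern the marked down/marked up behaviour, that would be good to hear.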

Anyone.....???

Thanks

Alex

Storage_yoda
Level 4
As ever with these things, the solution is the simplest one, and again as ever, it is network related.

The problem was that the nodes running the clustered master server on site 1 had 2 NICs enabled: one on the production network, which is routed and connects between sites; the other on what will in future become a backup network, which is currently unrouted and cannot connect between sites. Both IP addresses had been assigned to the same name in DNS, so when the name was resolved it could return either address. Obviously, data sent to the address with no connectivity effectively gets lost!

Solution: tidy up DNS (AGAIN!!) and ensure that name resolution is correct. Once this was done, all seems to run OK!
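
For completeness, the quick check I used afterwards was simply to confirm, from each media server, that the master's virtual name now resolves only to the routed production address (the name below is a placeholder for the real one):

rem Run on each media server; nbumaster stands in for the clustered master's virtual name
nslookup nbumaster
bpclntcmd -hn nbumaster

Once both consistently returned the production address, the jobs ran cleanly.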