Forum Discussion

Level 6

4 years ago

Unexpected Netbackup behaviour

Hi guys, today I have to tell you that something is happening in my environment. Since last week I've noticed backup jobs failing without any clear reason, even though we haven't made any changes. For exa...

windows server 2012

Level 6

SAN disappearing ? Really? What is being done about that?

Sometimes no changes also means no maintenance.

I have mentioned a number of things that you can check.

You really need to have a good look at the environment.
Bear in mind that NetBackup is reporting the problems. It is hardly ever causing them.

Level 6

4 years ago

From what you've said, I'd be looking at your fibre channel and IP networks first. It sounds like you're having a communication problem. If SAN is disappearing, that could be an HBA, cable, SFP, etc., that is failing. Gear never honors our change freezes, and while you may not have changed something, it could have changed itself :-). But if you're using an appliance as the disk pool, a SAN issue isn't going to be the source of a disk pool going down; that's more likely an appliance problem or a network problem causing communication to fail.

You've mentioned a number of different issues. Can you take a step back and give us a little more info? You've mentioned clustering, SAN, network, appliances, and SQL as potential problem sources at least. It looks like you're using 7.7 and a Windows Master -- is that correct?

If I'm troubleshooting this, given you're getting 2106 errors, I'd start there. Drive that to resolution and then if you're still getting 13s and 6s move to that, but as Marianne says, the environment has to be healthy for NBU to work properly, and it is one of the best tools at finding illness in an environment I've ever seen.

AndresV
Level 6
4 years ago
I really appreciate your time,
The main problem is that sometimes I see disk-down errors on some jobs but not on other jobs pointing to the same appliance, same disk pool, same storage unit.
I really can't understand this. I do understand that something in the environment may be affecting NetBackup, but is there a way to understand the behaviour behind the disk pool error?
I ran ./nbdb_admin NBDB -validate -full and got no errors.
- Marianne
  Level 6
  4 years ago
  For issues with a specific appliance, you need to check logs on the appliance as well as comms between the master and the appliance. Get network team to monitor comms during backup window.
  Logs to check: System logs (messages), bptm, bpdm.
  What is the size of the dedupe pool? How much memory on this appliance? Appliance model and software version? (Memory exhaustion on appliance can cause various issues.)
  
  This article is old, but may help to identify issues on an appliance:
  https://www.veritas.com/support/en_US/article.100031616
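If the network team can't watch comms for the whole backup window, a small probe run from the master can at least timestamp when the appliance stops answering. This is only a rough sketch: `probe` and `monitor` are hypothetical helper names, the host name is a placeholder, and it only proves a TCP connection can be opened, not that NetBackup itself is healthy.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Try one TCP connection; return (succeeded, seconds taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False, time.monotonic() - start


def monitor(host, port, attempts, interval=60.0):
    """Probe repeatedly and print one timestamped line per attempt."""
    results = []
    for i in range(attempts):
        ok, elapsed = probe(host, port)
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}:{port} {'OK' if ok else 'FAIL'} {elapsed:.3f}s")
        results.append((stamp, ok, elapsed))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Something like `monitor("appliance.example.com", 1556, attempts=480)` would cover roughly eight hours at one probe per minute. Port 1556 is my assumption for the PBX port, so check which ports your master and appliance actually use. The timestamps can then be lined up against the failed jobs.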
  - vtas_chas
    Level 6
    4 years ago
    The key to this is method of troubleshooting. There is no button that you can push that says 'tell me the exact problem I'm having in the environment right now', so you have to work your way through the issue.
    Given you've said that the appliance does function properly most of the time, I'm not sure I'd start there. While it's possible that the appliance has a problem, it isn't a universal problem, so just trying to comb through appliance logs will likely be unfruitful - you want to look through them with a finer sift than that. I do like the angle Marianne suggests with 100031616, as there is some chance you're exhausting a resource and recovering, but that's a 50/50 to me.
    You indicate there isn't a pattern. I have that feeling a lot when I'm dealing with a complex problem and I'm almost always proven wrong. There's a pattern, it's just not yet visible. Start there. If you've got failures going back in time, go back as far as you can and start isolating the clients that are having the issue. Look for a pattern of client, physical network, OS, time of backup, policy type, etc. If you're seeing a resource exhaustion, you may see a time-based pattern, i.e., when you have a lot of database or VM backups all starting at one time you start to see these failures.
    I tend to write these things down or build a spreadsheet so I can more easily sort the data. Start trying to answer questions like:
    How many failures have I had across the search period?
    Which clients?
    Do these clients always fail or do they get successful backups periodically?
    When failures occur, is it one client or multiple clients in the same period of time?
    What is different about the successful vs unsuccessful backups (time window, SLP structure, other jobs running, etc.)?
    Is replication running during successful windows?
    Is replication running during unsuccessful windows?
    Are the failures always on the same day?
    Do the failures occur during other datacenter activities?
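If a spreadsheet feels clumsy, the same sorting can be roughed out in a few lines of Python. This is just a sketch: the records below are made-up examples, and `tabulate` is a hypothetical helper, not anything NetBackup provides. You'd fill the list from whatever job report you export.

```python
from collections import Counter
from datetime import datetime

# Made-up failure records: (client, status code, job start time).
failures = [
    ("client-a", 2106, "2019-03-04 22:15"),
    ("client-b", 2106, "2019-03-04 22:40"),
    ("client-a", 13, "2019-03-06 01:05"),
    ("client-c", 2106, "2019-03-11 22:30"),
]


def tabulate(records):
    """Count failures by client, status code, weekday and hour of day."""
    counts = {
        "client": Counter(),
        "status": Counter(),
        "weekday": Counter(),  # 0 = Monday ... 6 = Sunday
        "hour": Counter(),
    }
    for client, status, started in records:
        when = datetime.strptime(started, "%Y-%m-%d %H:%M")
        counts["client"][client] += 1
        counts["status"][status] += 1
        counts["weekday"][when.weekday()] += 1
        counts["hour"][when.hour] += 1
    return counts


counts = tabulate(failures)
for dimension, counter in counts.items():
    # most_common() puts the biggest cluster first, which is where to look.
    print(dimension, counter.most_common())
```

Anything that clusters heavily on one client, one weekday or one hour is a candidate pattern worth chasing first.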
    If there's no discernible pattern in that data, I'd then look at which client(s) have experienced this the most in the last 30 days. For starters, look at the job info in the bptm and bpdm logs for all of the jobs that have failed, and correlate the time of the failures to system logs on the appliance, looking for something to show up. If you find a given failure, do the same thing for another client at another time to see if the same failure or type of failure occurs on the system.
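The time correlation can be semi-automated too. A minimal sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (real appliance and bptm/bpdm logs use other formats, so the parsing would need adjusting); `lines_near` is a hypothetical helper and the sample lines are invented.

```python
from datetime import datetime, timedelta


def lines_near(failure_time, log_lines, window_minutes=5):
    """Return the log lines stamped within +/- window_minutes of a failure."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        try:
            # Assumed format: the first 19 characters are the timestamp.
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - failure_time) <= window:
            hits.append(line)
    return hits


# Invented sample data, for illustration only.
sample_log = [
    "2019-03-04 22:13:50 kernel: example message near the failure",
    "2019-03-04 23:00:00 cron: example message far from the failure",
    "continuation line without a timestamp",
]
for hit in lines_near(datetime(2019, 3, 4, 22, 15), sample_log):
    print(hit)
```

Run it once per failed job and compare what turns up. If the same kind of message keeps appearing around unrelated failures, that's the thread to pull.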
    Sorry if this seems rudimentary, but like I said there's no button to push or switch to throw that's necessarily going to tell you exactly what is wrong. This is just how I'd start to do the footwork of figuring out what it is that's going on because it isn't immediately obvious.
