I have to tell you guys about something that is happening in my environment. Since last week I've noticed backup jobs failing for no clear reason, even though we haven't made any changes.
For example, I see 2106 errors (disk down) with some jobs but other jobs pointing to the same STU (same disk pool) don't fail.
Now I see errors 13 and 6 during SQL backup jobs when a week ago all jobs finished successfully.
I can't find any pattern in this behavior. Does anyone have an idea?
I know we have to upgrade to receive support but for now, we have to deal with this version.
NetBackup relies on the environment to be healthy.
You need to carefully look at each failure and examine possible reasons.
You will need to dig into logs (check Status Code Ref Guide for relevant logs), look at OS logs, even check free space on master server, diskpool, etc.
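For the free-space part of that check, a rough sketch like the one below could be run on each host. It is only an illustration: the 10% threshold and the paths you pass in are assumptions, so substitute the catalog, log, and disk pool mount points that actually exist in your environment.

```python
import shutil

def free_percent(path):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def low_space(paths, threshold=10.0):
    """Return (path, free %) for every path below the threshold.

    The default threshold is an arbitrary assumption, not a NetBackup limit.
    """
    results = []
    for path in paths:
        pct = free_percent(path)
        if pct < threshold:
            results.append((path, round(pct, 1)))
    return results

if __name__ == "__main__":
    # "." is a placeholder; list your own mount points here.
    for path, pct in low_space(["."]):
        print(f"{path}: only {pct}% free")
```

Running it during the backup window as well as outside it may show whether space is being exhausted and then recovered.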
Be sure that NBU catalog backups run at regular intervals and that they complete successfully.
I agree with @pats_729 - it could be network issues causing these errors. Get your network team involved to monitor switch logs and network comms during peak backup time.
Are you using OST?
I've found in my environment that sometimes when the OST on a media server goes haywire and can't communicate with the storage device, it reports the storage device as down.
It's a painful process to troubleshoot if you have many media servers.
Thank you for your response. We use appliances as media servers.
I can't explain why some jobs fail with "disk down" errors and other jobs pointing to the same appliance, same disk pool, same storage unit don't.
This is a bank and we are currently in a "freeze", so no changes have been made in the environment.
I'm really lost here because a couple of weeks ago we had another weird problem: jobs didn't start automatically, so we had to fail over the cluster twice to get them starting automatically again.
Now all these problems have suddenly come up. I could understand a couple of errors with some servers, but not so many errors with so many servers: full and differential SQL backup errors, SAN disappearing, disk down for some jobs... Errors like 2106, 5403, 87, 2075, 6, 13, 156, 230, 83, 25, 42, 54, 2074. :(
SAN disappearing? Really? What is being done about that?
Sometimes no changes also means no maintenance.
I have mentioned a number of things that you can check.
You really need to have a good look at the environment.
Bear in mind that NetBackup is reporting the problems. It is hardly ever causing them.
From what you've said, I'd be looking at your fibre channel and IP networks first. It sounds like you're having a communication problem. If SAN is disappearing, that could be an HBA, cable, SFP, etc., that is failing. Gear never honors our change freezes, and while you may not have changed something, it could have changed itself :-). But if you're using an appliance as the disk pool, a SAN issue isn't going to be the source of a disk pool down, that's more likely an appliance problem or a network problem causing communication to fail.
You've mentioned a number of different issues. Can you take a step back and give us a little more info? You've mentioned clustering, SAN, network, appliances, and SQL as potential problem sources at least. It looks like you're using 7.7 and a Windows Master -- is that correct?
If I'm troubleshooting this, given you're getting 2106 errors, I'd start there. Drive that to resolution and then if you're still getting 13s and 6s move to that, but as @Marianne says, the environment has to be healthy for NBU to work properly, and it is one of the best tools at finding illness in an environment I've ever seen.
I really appreciate your time.
The main problem is that sometimes I see disk down errors with some jobs, while other jobs pointing to the same appliance, same disk pool, and same storage unit don't fail.
I really can't understand this. I do understand that something in the environment may be affecting NetBackup, but is there a way to understand the behaviour behind the disk pool error?
I ran ./nbdb_admin NBDB -validate -full and it reported no errors.
For issues with a specific appliance, you need to check logs on the appliance as well as comms between the master and the appliance. Get network team to monitor comms during backup window.
Logs to check: System logs (messages), bptm, bpdm.
What is the size of the dedupe pool? How much memory on this appliance? Appliance model and software version? (Memory exhaustion on appliance can cause various issues.)
This article is old, but may help to identify issues on an appliance:
The key to this is method of troubleshooting. There is no button that you can push that says 'tell me the exact problem I'm having in the environment right now', so you have to work your way through the issue.
Given you've said that the appliance does function properly most of the time, I'm not sure I'd start there. While it's possible that the appliance has a problem, it isn't a universal problem, so just trying to comb through appliance logs will likely be unfruitful - you want to look through them with a finer sift than that. I do like the angle @Marianne suggests with 100031616, as there is some chance you're exhausting a resource and recovering, but that's a 50/50 to me.
You indicate there isn't a pattern. I have that feeling a lot when I'm dealing with a complex problem and I'm almost always proven wrong. There's a pattern, it's just not yet visible. Start there. If you've got failures going back in time, go back as far as you can and start isolating the clients that are having the issue. Look for a pattern of client, physical network, OS, time of backup, policy type, etc. If you're seeing a resource exhaustion, you may see a time-based pattern, i.e., when you have a lot of database or VM backups all starting at one time you start to see these failures.
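If a spreadsheet feels clumsy, the same sorting can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the CSV layout, column names, client names, and rows below are made up for illustration and are not a NetBackup export format, so adapt them to however you pull your failed-job list.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical export of failed jobs. Columns and values are invented
# purely to show the technique.
SAMPLE = """client,policy_type,status,start
clientA,MS-SQL-Server,13,2024-01-08 01:05
clientB,MS-SQL-Server,6,2024-01-08 01:20
clientA,MS-SQL-Server,13,2024-01-09 01:10
clientC,Standard,2106,2024-01-09 14:30
clientB,MS-SQL-Server,2106,2024-01-10 01:45
"""

def load(text):
    """Parse the CSV text into a list of dicts, one per failed job."""
    return list(csv.DictReader(io.StringIO(text)))

def tally(rows, key):
    """Count failures grouped by whatever key(row) returns."""
    return Counter(key(row) for row in rows)

def start_hour(row):
    """Hour of day at which the failed job started."""
    return datetime.strptime(row["start"], "%Y-%m-%d %H:%M").hour

rows = load(SAMPLE)
by_client = tally(rows, lambda r: r["client"])
by_policy = tally(rows, lambda r: r["policy_type"])
by_status = tally(rows, lambda r: r["status"])
by_hour = tally(rows, start_hour)

if __name__ == "__main__":
    for name, counts in [("client", by_client), ("policy type", by_policy),
                         ("status", by_status), ("start hour", by_hour)]:
        print(name, counts.most_common())
```

With real data, a lopsided count in any one of these groupings (one hour of the night, one policy type, a handful of clients) is the kind of pattern worth chasing first.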
I tend to write these things down or build a spreadsheet so I can more easily sort the data. Start trying to answer questions like
If there's no discernible pattern in that data, I'd then look at which client(s) have experienced this the most in the last 30 days. For starters, look at the job info in the bptm and bpdm logs for all of the jobs that have failed, and correlate the time of the failures to system logs on the appliance, looking for something to show up. If you find a given failure, do the same thing for another client at another time to see if the same failure or type of failure occurs on the system.
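The correlation step can be sketched the same way: take the time of a failure and pull every log entry that falls within a few minutes of it. The sketch assumes you have already parsed the appliance log into (timestamp, message) pairs. The timestamp format, the five-minute window, and the placeholder messages are all assumptions for illustration, not real log output.

```python
from datetime import datetime, timedelta

def parse(ts):
    """Parse a timestamp in an assumed 'YYYY-MM-DD HH:MM:SS' format."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

def lines_near(failure_time, log_entries, window_minutes=5):
    """Return messages logged within window_minutes of failure_time."""
    window = timedelta(minutes=window_minutes)
    return [msg for ts, msg in log_entries
            if abs(ts - failure_time) <= window]

# Placeholder entries standing in for parsed system log lines.
log_entries = [
    (parse("2024-01-09 14:27:10"), "placeholder message A"),
    (parse("2024-01-09 14:31:02"), "placeholder message B"),
    (parse("2024-01-09 18:00:00"), "placeholder message C"),
]

failure = parse("2024-01-09 14:30:00")
near = lines_near(failure, log_entries)

if __name__ == "__main__":
    for msg in near:
        print(msg)
```

Repeating this for a second client at a different time, as described above, shows whether the same kind of entry keeps turning up around the failures.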
Sorry if this seems rudimentary, but like I said there's no button to push or switch to throw that's necessarily going to tell you exactly what is wrong. This is just how I'd start to do the footwork of figuring out what it is that's going on because it isn't immediately obvious.