Good morning all. I'm new to NetBackup and of course as it's tossed to me I'm tasked with figuring out a very pressing issue. Our current full backups are failing almost every weekend now because they are not finishing within their scheduled timeframe. I've called Support but because we're on version 18.104.22.168 which is EOL they will not support me. I'm attempting to troubleshoot it myself for now. My first question may be an obvious one to all of you but I want to make sure that what I'm seeing is normal.
Let me begin by saying that we have NetBackup installed on a Win 2008 R2 server and we're backing up our Oracle environment, which runs on Solaris 10. This has been functioning perfectly for many years now (before me). Just recently our full backups have begun to fail because the backup does not complete within the normal timeframe and the machine reboots at the end of that timeframe due to our configuration. My question is this: when I open the Activity Monitor, view the job that failed, and then open the Detailed Status tab, I'm seeing long periods of nothing going on. Is this normal? In the image below you can see that from 12:10am - 3:57am there was no activity shown. That's nearly 4 hours of nothing being logged. Is that what it means, or is this just a time where it's writing files? Is there any way to get a more detailed log?
I have never heard of a NetBackup server getting rebooted at the end of a backup window.
Especially while backups are running.
What you see is perfectly normal.
The activity monitor does not display each filename that is being backed up.
You only see the warnings of files being skipped.
Your backup does not seem to be optimally configured...
If you look at the bptm log on the media server, you'll probably see activity during that 4 hours, while the client process was sending backup data. You may also be able to figure out how much data bptm received for that backup job during that time.
On your screenshot, I see 44000 KB/sec. I don't know whether that's real or an artifact of the failure. If it's real, that's slow. Since your backups used to succeed, did something change in your environment to degrade your network performance?
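To put that 44000 KB/sec figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the Activity Monitor reports kilobytes of 1024 bytes, and the 2000 GB total used in the example is purely hypothetical; substitute the real size of your full backup.

```python
def kb_per_sec_to_gb_per_hour(kb_per_sec):
    """Convert a KB/sec rate to GB/hour (assuming 1 KB = 1024 bytes)."""
    kb_per_hour = kb_per_sec * 3600
    return kb_per_hour / 1024 / 1024


def hours_needed(total_gb, kb_per_sec):
    """Estimate how long a backup of total_gb would take at a steady rate."""
    return total_gb / kb_per_sec_to_gb_per_hour(kb_per_sec)


rate = 44000  # KB/sec, the figure from the screenshot
print(f"{kb_per_sec_to_gb_per_hour(rate):.1f} GB/hour")

# Hypothetical total size; replace with your own figure.
print(f"{hours_needed(2000, rate):.1f} hours for 2000 GB")
```

Comparing the estimated hours against the length of your backup window shows quickly whether the window can be met at that rate at all.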
I also noticed that our Client Settings are configured to "Wait" on a locked file, or at least that's what I'm seeing. Is there a log that I can view that will tell me if it was waiting on a file?
I browsed to what I believe is the bptm log that you're talking about (C:\Program Files\Veritas\NetBackup\logs\bptm). I then opened the file named after that day and I can't find that server's name in the log file at all. I see other server names but not leopard.
EDIT: I just noticed that we have multiple Media Servers. The Win 2008 R2 server is the Master server, and Leopard (the client in the full backup job that is failing) is a Media server itself, but it runs Solaris 10, and unfortunately I have no idea how to view logs on it.
Use this technote to set up enhanced bpbkar logging. When enabled the bptm debug logs will show all processed files, meaning you can easily see if backup is waiting for nothing or processing files.
Apparently someone already enabled bpbkar logs. I see that they have been created for months now. Everything looks normal to me for the first few hours, then at 22:55 it does another "Device changing from ##### to ####" (I see about 10 of those so far) and then there's nothing in the logs for 4 hours. What was it doing for 4 hours?
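One way to locate silent periods like this without scrolling through the whole file is a small script that reports large jumps between consecutive timestamps. This is only a sketch: it assumes each log line starts with an `HH:MM:SS.mmm` timestamp (as the bpbkar lines quoted in this thread do), and the function name, sample lines, and 10-minute threshold are illustrative choices, not anything NetBackup provides.

```python
import re

# Matches a leading HH:MM:SS.mmm timestamp at the start of a log line.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def find_gaps(lines, min_gap_seconds=600):
    """Return (line_before, line_after, gap_seconds) for each quiet period."""
    gaps = []
    prev_secs = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip continuation lines that carry no timestamp
        h, m, s, ms = (int(g) for g in match.groups())
        secs = h * 3600 + m * 60 + s + ms / 1000
        if prev_secs is not None:
            delta = secs - prev_secs
            if delta < 0:
                delta += 86400  # the log rolled over midnight
            if delta >= min_gap_seconds:
                gaps.append((prev_line.rstrip(), line.rstrip(), delta))
        prev_secs, prev_line = secs, line
    return gaps


# Made-up sample lines; in practice pass an open log file instead.
sample = [
    "22:54:59.100  <4> example: first entry",
    "22:55:00.100  <4> example: last entry before the silence",
    "02:55:00.100  <4> example: first entry after the silence",
]
for before, after, seconds in find_gaps(sample):
    print(f"{seconds / 3600:.2f} h gap between:\n  {before}\n  {after}")
```

The two lines on either side of each gap are the interesting part: the last thing logged before the silence usually names the file or device the process was working on.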
I have verified with our Oracle guys that these .dbf databases are shut down before we attempt the backup. I've also uploaded two files. The first is the file straight from the Solaris server, and the second is the same file that I created by copying out the text.
It's also probably worth mentioning that we used to back everything up to both tape and disk. However, we've since removed all tape backups and we ONLY back up to disk.
I really appreciate the help. Thank you.
There are quite a number of references in bpbkar to 'sparse' files. One example:
02:07:01.155  <4> check_file_sparseness: File </zoneds/owl/root/data/oracle/dwhse01/d05/oracle/dwhse01data/PROD_IAS_TEMP01.dbf> is probably sparse. Size = 104873984 st_blocks = 2144 block_size = 512, fstype = 13
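The numbers in that log line can be checked by hand. The sketch below uses the common heuristic that a file is sparse when the space actually allocated (st_blocks times the block size) is smaller than its apparent size. This is an assumption about what the check amounts to, not NetBackup's actual code, and the function names are made up for illustration.

```python
import os


def allocated_bytes(st_blocks, block_size=512):
    """Space actually allocated on disk for the file."""
    return st_blocks * block_size


def is_probably_sparse(size, st_blocks, block_size=512):
    """Heuristic: less space allocated than the apparent file size."""
    return allocated_bytes(st_blocks, block_size) < size


def check_path(path):
    """Apply the same heuristic to a real file (Unix only: needs st_blocks)."""
    st = os.stat(path)
    return is_probably_sparse(st.st_size, st.st_blocks)


# Values copied from the bpbkar line for PROD_IAS_TEMP01.dbf.
size, st_blocks, block_size = 104873984, 2144, 512
on_disk = allocated_bytes(st_blocks, block_size)
print(f"apparent size: {size} bytes, allocated: {on_disk} bytes")
print(f"allocated fraction: {on_disk / size:.1%}")
print("probably sparse:", is_probably_sparse(size, st_blocks, block_size))
```

For this temp file only about 1 MB of a roughly 100 MB file is allocated on disk, which is consistent with the "probably sparse" message.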
See these TNs regarding sparse files:
One more thing - if you are backing up local zones from the global zones, please be sure that you read through the best practice docs: