07-30-2015 12:31 PM
08-10-2015 11:02 AM
Did you see Marianne's post ...
"
To just say that the backups were hanging is very vague.
If you look at the NBU process flow diagram, you will understand what is involved from an NBU point of view: network, memory, CPU, devices, etc.
Without knowing WHAT was hanging and without OS and NBU logs, you will never know.
Simply blame the OS patch install as this was the only change."
'NBU hung' is absolutely useless in terms of troubleshooting ....
Which log? It depends on 'how' it hung.
These two are master server logs, but they communicate with the media servers, so could be related:
nbjm
bpdbm
Media server logs could be:
bptm
bpbrm
bpbkar
If it hung with some tape-related issue:
ltid
tldd
tldcd
robots
If it goes to disk we could add another 3 or 4
So in a word no, I don't know which log as there has been no proper explanation of the issue.
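One practical note on the logs listed above: NBU's legacy debug logs are only written if the per-process directory already exists under the logs root (by default /usr/openv/netbackup/logs). A minimal sketch to create them, using just the process names mentioned above (the function takes an alternate root so it can be tried outside a real NBU install):

```shell
#!/bin/sh
# Sketch: create the per-process legacy debug log directories.
# NBU daemons only write a legacy log if its directory already exists
# under the logs root (default /usr/openv/netbackup/logs).
enable_nbu_logs() {
    logroot=${1:-/usr/openv/netbackup/logs}
    for proc in nbjm bpdbm bptm bpbrm bpbkar ltid tldd tldcd robots; do
        mkdir -p "${logroot}/${proc}" || return 1
    done
}
```

The processes pick the directories up when they next start; raising the VERBOSE level in bp.conf increases the detail logged.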
Working out why a hang has happened from logs alone is not always possible; you sometimes need a trace on the process involved (e.g. truss/strace etc.) and sometimes a debug binary from Engineering with increased logging.
Forget this one, you're not going to find a cause. You need to troubleshoot these issues as they are happening, not afterwards.
07-30-2015 01:42 PM
In what way have the backups gone down? Does NBU start? Do backups try to run?
What exactly was done, and what patches were applied? I'm not familiar with the term 'ghost/freak patching'.
Can you undo what was done? It is almost certain that one of the steps caused the issue. Patches were added, and any one of them could have caused it: remove them, check it then works, then reapply the changes one at a time until you find the one causing the issue.
Then when the 'bad patch' is known, contact the vendor to find out exactly, what that patch does.
It might be the case that something needs to be changed in the NBU code to work around the issue, or it could be that what the patch has changed is so 'unreasonable' that the vendor may have to change it, or it could be a combination of both, unknown at the moment.
At the end of the day, my view, and not necessarily that of my colleagues: tackle the problem from the cause side. It's a bit unreasonable to expect Symantec to come up with a fix seeing we didn't break it. I'm not saying we won't; we may have an obligation to, depending on what has happened, but some give and take is needed to help the investigation, which in this case means finding which change broke it. That will lead to a faster resolution, whatever that might be.
I can tell you that changes won't just be made to the NBU code until it is known exactly what has caused the issue, otherwise this could break things for other people - hence needing to know exactly, with 100% accuracy what was done.
07-30-2015 03:27 PM
After the Unix team applied the Linux patches, it ran well until last Thursday night, when the backup hung. The Unix team then did 3 hard reboots (one on Thursday night and 2 on Friday morning); the backup hung again, so we did a clean reboot and scanned all the drives. The server has been working well until now. I need to find the root cause, because the Unix team said they cannot find any error.
07-30-2015 11:20 PM
You write it ran well until Thursday night; when was the patching done?
I think your best bet is to get the Unix team to uninstall/roll back the patches and then apply them one at a time on 1 media server to see which one breaks NetBackup.
To me it sounds like one of these patches is causing a memory leak or something like that in parts of the OS NetBackup uses.
07-30-2015 11:49 PM
So after the Thu/Fri reboots, has it worked OK since then?
If the issue is not happening now, then it's going to be pretty difficult to find it.
Michael could be onto something: if there are no symptoms at the moment, keep an eye on the box and start investigating if the issue reoccurs.
When the backup hung, did any processes crash (look for core files)? Did anything else not work? Were commands slow? Was there an error status code?
It's impossible to work on this when the only detail is 'the backup hung'.
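On the core-file question, a quick sweep for recently created cores is easy to script. A sketch, with the search root and the 2-day age window purely as examples (pass a different directory to try it elsewhere):

```shell
#!/bin/sh
# Sketch: list core files created in the last 2 days under a directory.
# Default root is the NBU install tree; errors (unreadable dirs) are hidden.
recent_cores() {
    dir=${1:-/usr/openv}
    find "$dir" -type f -name 'core*' -mtime -2 2>/dev/null
}
```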
Sometimes with issues where backups hang, apart from logs (bptm and bpbrm would be a starting point), I'd create a cron job to monitor memory usage:
Create directory /usr/openv/netbackup/memsym on the server showing the issue.
Copy memsym.sh to this directory.
Change permissions: chmod 755 /usr/openv/netbackup/memsym/memsym.sh
Add one of the following lines to the system cron (crontab -e):
To run every 30 minutes:
0,30 * * * * /usr/openv/netbackup/memsym/memsym.sh
To run every 20 minutes:
0,20,40 * * * * /usr/openv/netbackup/memsym/memsym.sh
To run every 5 minutes:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/openv/netbackup/memsym/memsym.sh
To run every 2 minutes:
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58 * * * * /usr/openv/netbackup/memsym/memsym.sh
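As an aside, Vixie/ISC cron (the default on most Linux distributions) also accepts step syntax, which is equivalent to the comma lists above and easier to read:

```shell
# Equivalent step-syntax crontab entries (Vixie/ISC cron):
*/30 * * * * /usr/openv/netbackup/memsym/memsym.sh
*/5  * * * * /usr/openv/netbackup/memsym/memsym.sh
*/2  * * * * /usr/openv/netbackup/memsym/memsym.sh
```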
I am not sure how often it will need to be run; it depends on how quickly the memory usage builds up. Adjust the cron entry to run at whatever interval you desire; some examples are above, but you could set every 3 hours, once a day, etc.
You will need to monitor the system so that you know what time the problem happens, then look in the output directory and select the files from around that time; check the files to see the memory usage.
Here is the script. It was quickly put together, so please excuse that it simply creates the output directory on every run and discards any 'already exists' error; this does no harm.
#!/bin/sh
#Define variables
OUTDIR=/usr/openv/netbackup/memsym/output
OUTFILE=mem_$(date +%F_%R_%S)
#Create $OUTDIR (harmless if it already exists) before opening the output file
mkdir -p ${OUTDIR} >/dev/null 2>&1
#Define file descriptor 3 for all output
exec 3>${OUTDIR}/${OUTFILE}
echo "**********************************************************" >&3
echo "ps faxeo pid,pmem,rss,vsz,sz,size,args / Memory usage per process" >&3
echo "**********************************************************" >&3
echo "vsz/VSZ: total virtual process size (kb)" >&3
echo "rss/RSS: resident set size (kb, may be inaccurate(!), see man)" >&3
echo "osz/SZ: total size in memory (pages)" >&3
echo "**********************************************************" >&3
ps faxeo pid,pmem,rss,vsz,sz,size,args --width 120 >&3 2>&1
printf "\n\n\n" >&3
echo "**********************************************************" >&3
echo "vmstat output" >&3
echo "**********************************************************" >&3
vmstat >&3 2>&1
printf "\n\n\n" >&3
echo "**********************************************************" >&3
echo "free output" >&3
echo "**********************************************************" >&3
free >&3 2>&1
printf "\n\n\n" >&3
echo "**********************************************************" >&3
echo "/proc/meminfo output" >&3
echo "**********************************************************" >&3
cat /proc/meminfo >&3 2>&1
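When the hang reoccurs, the snapshots from around that time can then be mined. As a sketch, assuming the ps format used above (field 3 is RSS in kB, and you point it at the ps section of a snapshot), something like this pulls out the biggest memory consumers:

```shell
#!/bin/sh
# Sketch: print the N largest-RSS process lines from a memsym snapshot.
# Assumes lines in the `ps ... pid,pmem,rss,vsz,sz,size,args` format,
# i.e. field 3 is RSS in kB; header and banner lines are filtered out.
top_rss() {
    file=$1
    n=${2:-10}
    awk '$3 ~ /^[0-9]+$/ { print $3, $0 }' "$file" \
        | sort -rn | head -n "$n" | cut -d' ' -f2-
}
```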
07-31-2015 12:20 AM
07-31-2015 08:46 AM
You write it ran well until Thursday night, when was the patching done? They applied the patches on Thursday morning; the backup ran well until Thursday night and then it hung.
Think you best bet is to get the Unix Team to uninstall/rollback the patches and then apply them one at the time on 1 media server to see which one breaks NetBackup. No, that's not a good idea; the servers are running well now. I need to investigate why the backup processes hung.
That's very strange: Symantec support cannot find any error in /var/log/messages, and the Unix team said it's not a patch issue. I have never seen that issue before. Does anyone know which log I should look at? Thank you.
08-02-2015 03:57 PM
The info you provided about the OS itself is also vague.
Linux OS = Red Hat, SUSE or what? And which version?
Compare the output of "uname -a" from these 2 media servers to others which work well; what are the diffs?
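One concrete way to get those diffs is to capture the installed-package list from a good and a bad media server and compare them. A sketch, assuming an RPM-based distribution (mediaA/mediaB are placeholder hostnames); capture each list first with something like `ssh mediaA 'rpm -qa | sort' > mediaA.pkgs`:

```shell
#!/bin/sh
# Sketch: compare sorted installed-package lists from two servers.
# Inputs must be sorted (e.g. the output of `rpm -qa | sort`).
pkg_diff() {
    # comm -3 hides the common column: what remains is
    # column 1 = only on server 1, column 2 = only on server 2
    comm -3 "$1" "$2"
}
```

Any package that shows up on only one side is a candidate for the patch that changed behaviour.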
Application patching can sometimes interfere with library files that NetBackup uses, although this is somewhat rare. Check the patching logs to find out more, as well as the NetBackup side of the logs, to check where exactly it hangs (be it bpbrm, bptm or bpdm).
08-02-2015 11:55 PM
That's very strange: Symantec support cannot find any error in /var/log/messages, and the Unix team said it's not a patch issue. I have never seen that issue before. Does anyone know which log I should look at?
The reason you can't see an issue in messages is that the fault is probably in some other log.
If it's running correctly now, you will have to reproduce the fault to get some logs, unless you happen to have the correct logs from when the problem happened. And no, we have no idea which logs those will be, as there has been no indication of where the job hangs.
08-05-2015 05:56 PM
08-27-2015 11:19 AM