
After Unix patches were applied on 2 media servers (ghdnbk08lps & 09), backups went down

T_N
Level 6
After the Unix team applied the patches on the Linux media servers, backups went down, and stayed down even after we did hard reboots and soft reboots a couple of times. Here is what the Unix team said they did: "No kernel patches or kernel/boot changes ... just GHOST/FREAK patching and Spacewalk registration ... we have done this on hundreds of other boxes. You have root on this server, so you can dig through the logs. Let me know if you need a hand."

[root@ghdnbk08lps ~]# rpm -qa --last
openssl-1.0.1e-30.el6_6.11.x86_64              Mon 06 Jul 2015 10:30:55 AM EDT
glibc-utils-2.12-1.149.el6_6.9.x86_64          Mon 06 Jul 2015 10:30:50 AM EDT
glibc-common-2.12-1.149.el6_6.9.x86_64         Mon 06 Jul 2015 10:30:47 AM EDT
glibc-2.12-1.149.el6_6.9.x86_64                Mon 06 Jul 2015 10:30:37 AM EDT
yum-plugin-security-1.1.30-30.0.1.el6.noarch   Mon 06 Jul 2015 10:30:28 AM EDT
libselinux-utils-2.0.94-5.8.el6.x86_64         Mon 06 Jul 2015 10:30:22 AM EDT
libselinux-ruby-2.0.94-5.8.el6.x86_64          Mon 06 Jul 2015 10:30:22 AM EDT
spacewalk-backend-libs-2.0.3-1.0.1.el6.noarch  Mon 06 Jul 2015 10:30:21 AM EDT
rhncfg-client-5.10.55-1.el6.noarch             Mon 06 Jul 2015 10:30:21 AM EDT
rhncfg-actions-5.10.55-1.el6.noarch            Mon 06 Jul 2015 10:30:21 AM EDT
rhncfg-5.10.55-1.el6.noarch                    Mon 06 Jul 2015 10:30:21 AM EDT
osad-5.11.27-1.el6.noarch                      Mon 06 Jul 2015 10:30:21 AM EDT
libselinux-python-2.0.94-5.8.el6.x86_64        Mon 06 Jul 2015 10:30:21 AM EDT
jabberpy-0.5-0.21.el6.noarch                   Mon 06 Jul 2015 10:30:21 AM EDT
libselinux-2.0.94-5.8.el6.x86_64               Mon 06 Jul 2015 10:30:20 AM EDT
rhn-setup-2.2.7-1.0.2.el6.noarch               Mon 06 Jul 2015 10:29:19 AM EDT
rhnsd-5.0.14-1.el6.x86_64                      Mon 06 Jul 2015 10:29:19 AM EDT
yum-rhn-plugin-2.2.7-1.el6.noarch              Mon 06 Jul 2015 10:29:18 AM EDT
rhn-client-tools-2.2.7-1.0.2.el6.noarch        Mon 06 Jul 2015 10:29:18 AM EDT
rhn-check-2.2.7-1.0.2.el6.noarch               Mon 06 Jul 2015 10:29:18 AM EDT
rhnlib-2.5.72-1.el6.noarch                     Mon 06 Jul 2015 10:29:17 AM EDT
rhn-org-trusted-ssl-cert-1.0-1.noarch          Mon 06 Jul 2015 10:27:58 AM EDT
bash-4.1.2-15.el6_5.2.x86_64                   Mon 29 Sep 2014 01:54:42 PM EDT

I opened a case with Symantec support and sent /var/log/messages to them, but they said they cannot find anything. Does anyone have an idea how I can find the root cause? Thank you.
1 ACCEPTED SOLUTION

Accepted Solutions

mph999
Level 6
Employee Accredited

Did you see Marianne's post ...

"

To just say that the backups were hanging is very vague.
If you look at the NBU process flow diagram, you will understand what is involved from an NBU point of view: network, memory, CPU, devices, etc.

Without knowing WHAT was hanging, and without OS and NBU logs, you will never know.
All you can do is blame the OS patch install, as it was the only change."

 

'NBU hung' is absolutely useless in terms of troubleshooting ...

Which log? It depends on 'how' it hung.

These two are master server logs, but they communicate with the media servers, so they could be related:

nbjm

bpdbm

Media server logs could be:

bptm

bpbrm

bpbkar

 

If it hung with some tape-related issue:

ltid

tldd

tldcd

robots

If it goes to disk, we could add another 3 or 4.

So, in a word, no - I don't know which log, as there has been no proper explanation of the issue.
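
In the meantime, it is worth making sure the debug log directories for those processes actually exist, as most of the legacy logs listed above are only written if their directory has been created. A rough sketch, assuming a default install under /usr/openv (the bundled mklogdir script creates them all in one go, or you can create just the ones of interest by hand):

# Create all NetBackup legacy debug log directories at once
/usr/openv/netbackup/logs/mklogdir

# Or create only the media server ones of interest
mkdir -p /usr/openv/netbackup/logs/bptm /usr/openv/netbackup/logs/bpbrm /usr/openv/netbackup/logs/bpbkar

# Device/robotic daemon debug logging typically lives under the volmgr tree
mkdir -p /usr/openv/volmgr/debug/daemon /usr/openv/volmgr/debug/robots

# Some daemons only start writing after NBU is restarted, and higher verbosity
# can be set with a "VERBOSE = 5" entry in /usr/openv/netbackup/bp.conf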

Working out why a hang has happened from logs alone is not always possible; you sometimes need a trace on the process involved (e.g. truss/strace) and sometimes a debug binary from Engineering with increased logging.
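
To give an idea of the tracing side, this is roughly the sort of thing that gets run against a hung process while the hang is actually happening (the process name and PID below are only placeholders):

# Find the PID of the process the hung job is sitting in (bptm is just an example)
ps -ef | grep bptm

# Attach and capture system calls with timestamps for a few minutes, then Ctrl-C
strace -f -tt -p <PID> -o /tmp/bptm.strace.out

# Kernel-side view of what the process is currently waiting on (if the kernel exposes it)
cat /proc/<PID>/wchan
cat /proc/<PID>/stack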

Forget this one; you are not going to find a cause. You need to troubleshoot these issues as they are happening, not afterwards.


11 REPLIES

mph999
Level 6
Employee Accredited

In what way have the backups gone down? Does NBU start? Do backups try to run?

What exactly was done, and what patches were applied? I'm not familiar with the term GHOST/FREAK patching.

Can you undo what was done? It is almost certain that one of the steps caused the issue. Patches were added, and any one of them could have caused it: remove them, check that it then works, and reapply the changes one at a time until you find the one causing the issue.

Then, when the 'bad patch' is known, contact the vendor to find out exactly what that patch does.
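
To sketch what that could look like on a yum-based box (the transaction IDs are placeholders, and an undo only works if the previous package versions are still available in the repositories):

# What changed, most recent first
rpm -qa --last | head -30

# List recent yum transactions and inspect the one from the patch window
yum history list
yum history info <transaction_id>

# Roll back that single transaction, retest backups, then reapply packages one at a time
yum history undo <transaction_id>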

It might be the case that something needs to be changed in the NBU code to work around the issue, or it could be that what the patch has changed is so 'unreasonable' that the vendor may have to change it, or it could be a combination of both - unknown at the moment.

At the end of the day, my view, and not necessarily that of my colleagues: tackle the problem from the cause side. It's a bit unreasonable to expect Symantec to come up with a fix, seeing as we didn't break it. I'm not saying we won't - we may have an obligation to, depending on what has happened - but some give and take needs to happen to help the investigation, which in this case means finding which change broke it; that will lead to a faster resolution, whatever that might be.

I can tell you that changes won't just be made to the NBU code until it is known exactly what has caused the issue, otherwise this could break things for other people - hence the need to know exactly, with 100% accuracy, what was done.

 

T_N
Level 6

After the Unix team applied the Linux patches, it ran well until last Thursday night, then backups hung. The Unix team did 3 hard reboots (one on Thursday night and 2 on Friday morning), backups hung again, and then we did a clean reboot and scanned all the drives. The server has been working well since then. I need to find the root cause, because the Unix team said they cannot find any error.

Michael_G_Ander
Level 6
Certified

You write that it ran well until Thursday night - when was the patching done?

I think your best bet is to get the Unix team to uninstall/roll back the patches and then apply them one at a time on 1 media server, to see which one breaks NetBackup.

To me it sounds like one of these patches is causing a memory leak, or something like that, in a part of the OS that NetBackup uses.

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

mph999
Level 6
Employee Accredited

So after the reboots on Thursday/Friday, has it worked OK since then?

If the issue is not happening now, then it's going to be pretty difficult to find it.

Michael could be onto something; if there are no symptoms at the moment, keep an eye on the box and start investigating if the issue reoccurs.

The backup hung - did any processes crash (look for core files)? Did anything else not work? Were commands slow? Was there an error status code? ...

It's impossible to work this out with only 'the backup hung' as the detail.

Sometimes with issues where backups hang, apart from logs (bptm and bpbrm would be a starting point), I'd create a cron job to monitor memory usage:

Create the directory /usr/openv/netbackup/memsym on the server showing the issue.
Copy memsym.sh to this directory.
Change the permissions: chmod 755 /usr/openv/netbackup/memsym/memsym.sh

Add one of the following lines to the system cron (crontab -e):

 

To run every 30 minutes:
0,30 * * * * /usr/openv/netbackup/memsym/memsym.sh

To run every 20 minutes:
0,20,40 * * * * /usr/openv/netbackup/memsym/memsym.sh

To run every 5 minutes:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/openv/netbackup/memsym/memsym.sh

To run every 2 minutes:
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58 * * * * /usr/openv/netbackup/memsym/memsym.sh

I am not sure how often it will need to be run; it depends on how quickly the memory usage builds up. Adjust the cron entry to run at whatever interval you want - some examples are above, but you could set every 3 hours, once a day, etc.

You will need to monitor the system so that you know what time the problem happens, then look in the output directory and select the files from around that time - check the files to see the memory usage.

Here is the script. It's a quickly put together script, so please excuse the fact that it doesn't check whether the output directory already exists; it just creates it, even if it is there, and 'hides' the error - this will do no harm.

#!/bin/sh

#Define variables
OUTDIR=/usr/openv/netbackup/memsym/output
OUTFILE=mem_$(date +%F_%R_%S)

#Create $OUTDIR before opening the output file inside it
mkdir ${OUTDIR} >/dev/null 2>&1

#Define file descriptor 3, pointing at the output file
exec 3>${OUTDIR}/${OUTFILE}

echo "**********************************************************" >&3
echo "ps faxeo pid,pmem,rss,vsz,sz,size,args / Memory useage per Process" >&3
echo "**********************************************************" >&3
echo "vsz/VSZ: total virtual process size (kb)" >&3
echo "rss/RSS: resident set size (kb, may be inaccurate(!), see man)" >&3
echo "osz/SZ: total size in memory (pages)" >&3
echo "**********************************************************" >&3
ps faxeo pid,pmem,rss,vsz,sz,size,args --width 120 >&3 2>&1
printf "\n\n\n" >&3

echo "**********************************************************" >&3
echo "vmstat output" >&3
echo "**********************************************************" >&3
vmstat >&3 2>&1
printf "\n\n\n" >&3


echo "**********************************************************" >&3
echo "free output" >&3
echo "**********************************************************" >&3
free >&3 2>&1
printf "\n\n\n" >&3

echo "**********************************************************" >&3
echo "/proc/meminfo output" >&3
echo "**********************************************************" >&3
cat /proc/meminfo >&3 2>&1
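
For completeness, a quick sketch of putting it in place and reading the output afterwards (this just follows the steps above; angle brackets mark placeholders):

# Install the script as described above
mkdir -p /usr/openv/netbackup/memsym
cp memsym.sh /usr/openv/netbackup/memsym/memsym.sh
chmod 755 /usr/openv/netbackup/memsym/memsym.sh

# Append a 30-minute entry to root's crontab without losing existing jobs
( crontab -l 2>/dev/null; echo "0,30 * * * * /usr/openv/netbackup/memsym/memsym.sh" ) | crontab -

# After a hang, look at the snapshots taken around that time
ls -ltr /usr/openv/netbackup/memsym/output | tail
less /usr/openv/netbackup/memsym/output/mem_<date>_<time>_<seconds>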

Marianne
Moderator
Partner    VIP    Accredited Certified
If you had all sorts of logs in place at the time, you will need to dig into all of them to try and figure out what happened. You first of all need to know WHAT was hanging. Network-related processes? NBU processes? Device writes?

To just say that the backups were hanging is very vague. If you look at the NBU process flow diagram, you will understand what is involved from an NBU point of view: network, memory, CPU, devices, etc.

Without knowing WHAT was hanging, and without OS and NBU logs, you will never know. All you can do is blame the OS patch install, as it was the only change.

T_N
Level 6

"You write that it ran well until Thursday night - when was the patching done?" - They applied the patches on Thursday morning; backups ran well until Thursday night, then they hung.

"I think your best bet is to get the Unix team to uninstall/roll back the patches and then apply them one at a time on 1 media server, to see which one breaks NetBackup." - No, that's not a good idea; the servers are running well now. I need to investigate why the backup processes were hanging.

 

That's very strange: Symantec support cannot find any error in /var/log/messages, and the Unix team said it's not a patch issue. I have never seen this issue before. Does anyone know which log I should look at? Thank you.

watsons
Level 6

The info you provided about the OS itself is also vague...

Linux OS = Red Hat, SUSE, or something else? And which version?

Compare the output of "uname -a" from these 2 media servers with others that work well - what are the differences?

Application patching can sometimes interfere with library files that NetBackup uses, although this is somewhat rare. Check the patching logs to find out more, as well as the NetBackup logs, to see exactly where it hangs (be it bpbrm, bptm or bpdm).
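
Something like this would do for the comparison (hostnames other than ghdnbk08lps are placeholders):

# Kernel and OS release on the affected media servers vs. a known-good one
for h in ghdnbk08lps <other_affected_server> <known_good_server>; do
    echo "== $h =="
    ssh $h 'uname -a; cat /etc/*release 2>/dev/null'
done

# Package-level differences between an affected and a known-good server
ssh ghdnbk08lps 'rpm -qa | sort' > affected.txt
ssh <known_good_server> 'rpm -qa | sort' > good.txt
diff affected.txt good.txt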

mph999
Level 6
Employee Accredited

"That's very strange: Symantec support cannot find any error in /var/log/messages, and the Unix team said it's not a patch issue. I have never seen this issue before. Does anyone know which log I should look at?"

 

The reason you can't see an issue in messages is that the fault is probably in some other log.

If it's running correctly now, you will have to reproduce the fault to get some logs, unless you happen to have the correct logs from when the problem happened - and no, we have no idea which logs those will be, as there has been no indication of where the job hangs.

 

T_N
Level 6
Do you have an idea which log I should look at?

mph999
Level 6
Employee Accredited

Did you see Marianne's post ...

"

To just say that the backups were hanging is very vague.
If you look at the NBU process flow diagram, you will understand what is involved from an NBU point of view: network, memory, CPU, devices, etc.

Without knowing WHAT was hanging, and without OS and NBU logs, you will never know.
All you can do is blame the OS patch install, as it was the only change."

 

'NBU hung' is absolutely useless in terms of troubleshooting ...

Which log? It depends on 'how' it hung.

These two are master server logs, but they communicate with the media servers, so they could be related:

nbjm

bpdbm

Media server logs could be:

bptm

bpbrm

bpbkar

 

If it hung with some tape-related issue:

ltid

tldd

tldcd

robots

If it goes to disk, we could add another 3 or 4.

So, in a word, no - I don't know which log, as there has been no proper explanation of the issue.

Working out why a hang has happened from logs alone is not always possible; you sometimes need a trace on the process involved (e.g. truss/strace) and sometimes a debug binary from Engineering with increased logging.

Forget this one; you are not going to find a cause. You need to troubleshoot these issues as they are happening, not afterwards.

T_N
Level 6
Yes, I closed that Symantec case. Thank you, everyone.