
Did the NBEMM service not responding cause the OS to deallocate NBDB's shared memory?

DoubleP
Level 5

We recently had an issue with one of our master consoles (Veritas NetBackup 8.0/Linux). The Veritas tech who ended up rebooting the master server updated his ticket with:

"-- Console was stuck at connecting to nbsl

– Checked basic communication of master, all working fine

– NBEMM service was not responding

– After checking server.log found that shared memory was disconnected causing SQLANY to not respond to requests.
I. 07/26 16:30:11. Disconnected TCPIP client's AppInfo: HOST=bakup001.paychex.com;PID=29162;EXE=com.mchange.v2.async.ThreadPo
I. 07/26 16:30:38. Connection terminated abnormally
I. 07/26 16:30:38. Disconnected SharedMemory client's AppInfo: IP=10.80.4.120;HOST=bakup001.paychex.com;OSUSER=root;OS='Linux 2.6.32-696.20.1.el6.x86_64 #1 SMP Fri Jan 12 15:07:59 EST 2018 x';EXE=/usr/openv/netbackup/bin/nbstserv;PID=0x707a;THREAD=0x7f68bf230720;VERSION=16.0.0.2322;API=ODBC;TIMEZONEADJUSTMENT=-240
I. 07/26 16:30:38. Connection terminated abnormally
I. 07/26 16:30:38. Disconnected SharedMemory client's

– We rebooted the node of cluster

– After making cluster resources UP, services started fine

– We are able to login to console and backups are working fine.

Ideally you should not face this issue again. The issue occurred because the OS deallocated the shared memory of NBDB"

 

Now, this is a two-part question:

1) Did the NBEMM service not responding cause the shared memory issue, or vice versa?

 

2) Even though everything looked fine after the bounce, we discovered the next day that none of our Oracle backups (archive logs and onlines) had actually completed after the server was restarted. They received a socket error when trying to connect to the NetBackup master's tan address:

(<16> connect_to_service: connect failed STATUS (18) CONNECT_FAILED status: FAILED, (10) SOCKET_FAILED; system: (113) No route to host; FROM client IP TO master server tan IP bprd VIA pbx status: FAILED, (10) SOCKET_FAILED; system: )

We had to work with Veritas support and our internal networking and Linux SA groups to resolve this. Has anyone else experienced something similar, and what can be done so it doesn't reoccur?
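For anyone hitting the same "No route to host" (errno 113) symptom: below is a minimal, hedged sketch of the first checks we would run from a failing client. The hostname is a placeholder for your master's virtual name, and PBX on TCP 1556 is the NetBackup default port; both are assumptions to adapt.

```shell
#!/bin/sh
# Hedged sketch: from a failing client, probe the path an Oracle backup
# takes to reach bprd via pbx. "master-virtual-hostname" is a placeholder.

check_master() {
    host="$1"
    addr=$(getent hosts "$host" | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
        echo "cannot resolve $host"
        return 1
    fi
    echo "$host resolves to $addr"
    # "No route to host" (errno 113) would show up at this layer:
    ip route get "$addr" >/dev/null 2>&1 || echo "no route to $addr"
    # TCP reachability of pbx (default port 1556):
    if command -v nc >/dev/null 2>&1; then
        nc -z -w 5 "$host" 1556 && echo "pbx (1556) reachable" \
                                || echo "pbx (1556) NOT reachable"
    fi
}

check_master "${1:-master-virtual-hostname}" || true
```

If name resolution and routing pass but the pbx port check fails, the problem is more likely a host firewall or a NetBackup service not listening, which narrows down who to engage first.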

1 ACCEPTED SOLUTION


It was found that the OS deallocated the shared memory that NetBackup was attempting to use, leading to the failure of nbemm and nbdb. After a failure of that level, the only way to bring everything back up was a restart of the node. This was done. (According to Veritas support: "The RCA of the deallocation of memory will fall onto the OS administrators/vendors – we only felt the effect of the deallocation, we aren't able to cause it.")
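Since the RCA landed with the OS side, it may help to hand the OS team a baseline of System V shared-memory usage to compare against if NBDB's segment disappears again. This is an illustrative sketch, not an official procedure; the snapshot path and the assumption that NBDB runs as root are ours.

```shell
#!/bin/sh
# Hedged sketch: snapshot System V shared-memory segments as a baseline
# for the OS team. Paths and the owner filter are illustrative assumptions.
SNAP="/tmp/shm_snapshot.$(date +%Y%m%d%H%M%S)"
ipcs -m > "$SNAP" 2>/dev/null || { echo "ipcs not available"; exit 0; }
echo "snapshot written to $SNAP"
# Quick count of segments owned by root (NBDB ran as root on this host);
# the first three lines of ipcs -m output are headers, so skip them:
ipcs -m | awk 'NR > 3 && $3 == "root" { n++ } END { print n+0, "segments owned by root" }'
```

Running this from cron and diffing snapshots around the time of a failure gives the OS administrators something concrete to correlate with their own logs.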


4 REPLIES

Anshu_Pathak
Level 5

1) Did NBEMM Service not responding cause the shared memory issue or vice versa?

It's strange that the solution contains no nbemm log snippet, so you never actually got an RCA. The messages you see in server.log are informational (I.) messages, not errors (E.). Such messages are normal when a connection is closed. If you notice, the connection is terminated first, and only after that is shared memory disconnected, so the issue is not with shared memory.
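To make that distinction concrete: a quick filter separates genuine error lines from the normal informational disconnect messages quoted above. A hedged sketch; the server.log path varies by install and the default below is an assumption.

```shell
#!/bin/sh
# Hedged sketch: count genuine error ("E.") lines in the SQL Anywhere
# server.log, as opposed to the informational "I." disconnect messages.
# The default path is an assumption; adjust for your install.
LOG="${1:-/usr/openv/db/log/server.log}"
if [ -f "$LOG" ]; then
    errors=$(grep -c '^E\.' "$LOG")
    echo "$errors error (E.) lines in $LOG"
else
    echo "no server.log found at $LOG"
fi
```

If that count is zero around the incident window, the I. disconnects alone don't implicate the database, which supports the point above.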

Ideally you/the TSE should have checked the nbemm logs to figure out whether it was stuck/hung or in a loop. If a process is hung, it will not update its log file; if it is in a loop, it will keep updating the log.
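That hung-versus-looping check can be scripted by sampling the log file size twice: a hung nbemm stops writing, a looping one keeps writing. A hedged sketch, assuming logs live under /usr/openv/logs/nbemm (nbemm uses unified logging, so we just pick the newest file there):

```shell
#!/bin/sh
# Hedged sketch: tell a hung process from a looping one by sampling its
# log file size twice. The log directory is an assumption.

log_growing() {
    # $1 = log file, $2 = seconds between the two samples
    before=$(wc -c < "$1")
    sleep "$2"
    after=$(wc -c < "$1")
    [ "$after" -gt "$before" ]
}

LOGFILE="${LOGFILE:-$(ls -t /usr/openv/logs/nbemm/* 2>/dev/null | head -1)}"
if [ -n "$LOGFILE" ] && [ -f "$LOGFILE" ]; then
    if log_growing "$LOGFILE" 30; then
        echo "log is growing: nbemm is writing (alive, possibly looping)"
    else
        echo "log is static: nbemm may be hung"
    fi
fi
```

Capturing this output before a reboot gives the TSE something to base an RCA on, which is exactly what was missing here.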

Most of the time, such issues are related to system resources (RAM, CPU, IPC channels) or third-party software, hence a server reboot or NetBackup service restart can serve as the solution.

2) For question# 2

NetBackup is just an application which relies heavily on the OS, networking, and other software. An issue in those layers may not be visible or cause a failure at the OS level, yet may still cause failures in NetBackup. There are so many variables that it is very difficult to pinpoint the issue.

What you are facing here can be seen in many environments. Sometimes you have to change settings in NetBackup; however, most of the time you need to patch, reconfigure, or change settings at the OS, SAN, and network level.

 

 

Marianne
Moderator
Partner    VIP    Accredited Certified

In addition to the excellent reply by @Anshu_Pathak, herewith my 2c on #2:

Which Cluster technology are you using?
Was NBU started via the cluster and care was taken that all resources are online? 
I am particularly thinking of the virtual IP address and virtual hostname.
(In VCS config, NBU will only be started if Virt IP resource is online.)

Can you ping the virtual hostname from the clients?
Can you check that the 1st SERVER entry on all clients is pointing to the master's virtual hostname and not the physical nodename?
If NB_ORA_SERV variable is used in rman scripts, verify that it points to the virtual hostname.

Run 'bpclntcmd -pn' on database clients and carefully check the output to ensure that master's virtual hostname is listed:
expecting response from server <master_virtual_name>
....
This command will initiate a connection to bprd on the master - similar to rman backup request. 
Check bprd log on the master to see if connection attempt was received from client. 
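The client-side checks above can be sketched as a small script to run on each database client. This is a hedged sketch: "master-virtual-hostname" is a placeholder for your actual virtual name, and the bp.conf path is the Unix default.

```shell
#!/bin/sh
# Hedged sketch of the client-side checks above. VIRT is a placeholder
# for the master's virtual hostname; bp.conf path is the Unix default.
VIRT="${1:-master-virtual-hostname}"
BPCONF="${BPCONF:-/usr/openv/netbackup/bp.conf}"

first_server() {
    # Print the first SERVER entry from a bp.conf-style file
    awk -F ' *= *' '$1 == "SERVER" { print $2; exit }' "$1" 2>/dev/null
}

first=$(first_server "$BPCONF")
if [ "$first" = "$VIRT" ]; then
    echo "OK: first SERVER entry is $first"
else
    echo "CHECK: first SERVER entry is '$first', expected '$VIRT'"
fi
# Then confirm bprd connectivity; expect "expecting response from server $VIRT":
# /usr/openv/netbackup/bin/bpclntcmd -pn
```

If any client reports the physical nodename as its first SERVER entry, that alone would explain backups failing after a cluster failover while everything else looks healthy.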

@Anshu_Pathak and @Marianne,

Thanks for your responses. I'm checking with the TSE that my coworker worked with on the initial issue. 

Unfortunately, when the master server came back up and backups appeared to be running, we didn't drill down in VCS to see if the tan IP was showing Online afterwards. Though my coworker is pretty sure that nbu_group was showing Online on the main segment of the tree.

 
