nbevtmgr not running

Sadfad
Level 3

Hi,

We recently migrated our master server from Solaris 10 to Solaris 11 and to new hardware.
Since the migration we have had several issues. For example, enabling or disabling a policy takes up to 3 minutes.

We also found out that nbevtmgr is not running. The service starts and stops in under 1 second.
In the Java GUI under Reports/Problems we can see many errors:
The Error Description is: connection to nbevtmgr:VRTS_NBU_PEM_Channel on host *master* failed after XXXX attempts.
We are getting this error message every minute.

I hope you can help me.

1 ACCEPTED SOLUTION

We found the error.

It was a missing privilege: 'proc_procntl'.
After we added this privilege, nbevtmgr started and everything else worked fine.
Anyway thank you for your time and answers.

21 REPLIES 21

Will_Restore
Level 6

We checked this technote before, and sadly it is not the same fault we have.

Marianne
Moderator
Partner    VIP    Accredited Certified

Can you please share the process that you have followed to migrate to new OS and tell us which NBU version(s)? 

I have seen this some time ago when there was a corruption in some binaries that were copied manually from one server to the next. When downloaded software was ftp'ed and extracted on new server and re-installed, all was fine. (File sizes and checksum must be verified as well.)

mph999
Level 6
Employee Accredited

Is the process stopping or crashing?

If it is stopping, then the nbevtmgr log may be useful. This is unified log number 231.

(vxlogview -p 51216 -o 231 -d all -t 00:10:00 )

It might be worth increasing the log level first, recreating the issue, and then looking at the log.

vxlogcfg -a -p 51216 -o 231 -s DebugLevel=6 -s DiagnosticLevel=6

To decrease the log level again, just run the same command but use DebugLevel=1.

If the process is crashing, the log is still needed, but it usually doesn't give a clear answer (once the process has crashed it can't log, so you only see what happened just before it crashed, which may not be enough). In this case you need to collect the crash dump and, on Unix/Linux, process it through the OS debugger, then log a call with Veritas. You would also need to gather the system messages log, the process log (as mentioned), a copy of the core file (or the crash dump on Windows), a copy of the binary itself and the nbsu -c -t output.

Hi mph999,

The process is stopping, not crashing. Attached you will find the output of vxlogview.

Our NBU version is 7.7.3 on all master/media/clients.

We didn't copy any binaries manually.
1. Stopping and shutting down the old master server (Solaris 10)
2. Starting the new master server (same name and IP as the old one / Solaris 11)
3. Installing NetBackup Server on the new master server (the installation was incomplete due to a wrong IP config in /etc/hosts and we could not start the server processes. After the installation had finished, we installed the server software again over the old installation. After that we were able to start the master server and all processes started.)
4. Mounting the Disks with catalog and so on from the old master to the new master and setting the links to the catalog.
5. Starting the Master Server

This is a short version. We also adjusted e.g. bp.conf and vm.conf.
Do you need more information about the migration process?

Marianne
Moderator
Partner    VIP    Accredited Certified

I am curious about these steps:

Mounting the Disks with catalog and so on from the old master to the new master and setting the links to the catalog.

We also adjusted e.g. bp.conf and vm.conf.

Can you confirm that the catalog disk from the old master has all of netbackup/db as well as the relational databases, with all softlinks correctly resolving to the following paths?
/usr/openv/netbackup/db
/usr/openv/db/data  (or bp.conf VXDBMS_NB_DATA pointing to the correct location on the catalog disk)

What did you 'adjust' in bp.conf and vm.conf?

Is nbevtmgr the only process/daemon stopping, or are others stopping as well?

I can confirm that all softlinks are correct.
We just added the new media servers in bp.conf and a new tape library in vm.conf.

Yes, it is the only process/daemon which stops.

Marianne
Moderator
Partner    VIP    Accredited Certified

Probably best to log a Support call. 

They will understand what the following in the log means: 
(This seems to be the place where things go wrong and the service stops...)

02/23/17 14:34:08.840 V-231-144 [AsyncEventMgr::createChannel()] Created the BestEffort event channel with channel ID=1.
02/23/17 14:34:08.842 V-137-10 [OrbService::initInMain] CORBA exception: system exception, ID 'IDL:omg.org/CORBA/BAD_PARAM:1.0'
TAO exception, minor code = 0 (unknown location; unspecified errno), completed = NO
 during OrbService::doInit

 

mph999
Level 6
Employee Accredited

 

You have a comms issue …  I will hazard a guess it’s something to do with endpoints.

 

The CORBA error Marianne points out is an excellent start, but by no means the complete story.  Maybe others know better than I do, but that error alone is pretty much useless, apart from telling us you have a comms issue.

The problem happens at some point before that.  These issues can be very hard to find.

 

From your log

 

02/23/17 14:34:08.822 [Orb::getConfig] found service entry: name = veritas_pbx port = 1556 proto = tcp(Orb.cpp:542)

02/23/17 14:34:08.822 [Orb::getConfig] required_interface: xxx defined and assigned(Orb.cpp:582)

 

<snip>

 

02/23/17 14:34:08.835 [TAO] Using 5 threads for all Consumers.

02/23/17 14:34:08.835 [TAO] Using 5 threads for all Suppliers.

02/23/17 14:34:08.836 [ServiceContainer::doInit] successfully processed directive: dynamic CosNotify_Service Service_Object * "/usr/openv/lib/libvxTAO_CosNotification.so.6":_make_TAO_NS_Notify_Service() "-AllowReconnect -DispatchingThreads 5 -SourceThreads 5 -NoUpdates"(ServiceContainer.cpp:192)

02/23/17 14:34:08.836 [ServiceContainer::doInit] processing directive: dynamic AsyncEventMgr Service_Object * "/usr/openv/lib/libAsyncEventService.so":_make_AsyncEventMgr()(ServiceContainer.cpp:166)

02/23/17 14:34:08.836 [NBLogFactory::locateLogger(fid, oid)] Found a logger (#2). FID - 231, OID - 231(NBLogFactory.cpp:199)

02/23/17 14:34:08.836 [Orb::setOrbTimeoutPolicy] setting ORB request timeout policy: tv = 300(Orb.cpp:1388)

02/23/17 14:34:08.836 [Orb::setOrbRequestTimeout] timeout seconds: 300(Orb.cpp:1374)

02/23/17 14:34:08.836 [Info] V-231-134 The TAO Notification Service (CosNotify_Service) has been started.

02/23/17 14:34:08.837 [vnet_sortaddrs] [vnet_addrinfo.c:3992] sorted addrs: 1 0x1

02/23/17 14:34:08.837 [NBIORInterceptor::establish_components] Encoding IP Address [xxx] in IOR(NBIORInterceptor.cpp:120)

02/23/17 14:34:08.837 [vnet_sortaddrs] [vnet_addrinfo.c:3992] sorted addrs: 1 0x1

02/23/17 14:34:08.837 [NBIORInterceptor::establish_components] Encoding IP Address [xxx] in IOR(NBIORInterceptor.cpp:120)

02/23/17 14:34:08.839 [Info] V-231-135 TAO Notification Service (CosNotify_Service) Initialization complete.

02/23/17 14:34:08.839 [Info] V-231-140 Creating default Event Manager objects.

 

 

 

Now let's look at the log from my server

 

02/24/17 20:23:21.032 [Orb::getConfig] found service entry: name = veritas_pbx port = 1556 proto = tcp(Orb.cpp:542)

02/24/17 20:23:21.032 [Orb::getConfig] cluster_name and required_interface not defined, using ANY(Orb.cpp:594)

 

<snip>

 

02/24/17 20:23:21.314 [TAO] Using 5 threads for all Consumers.

02/24/17 20:23:21.314 [TAO] Using 5 threads for all Suppliers.

02/24/17 20:23:21.315 [ServiceContainer::doInit] successfully processed directive: dynamic CosNotify_Service Service_Object * "/usr/openv/lib/libvxTAO_CosNotification.so.6":_make_TAO_NS_Notify_Service() "-AllowReconnect -DispatchingThreads 5 -SourceThreads 5 -NoUpdates"(ServiceContainer.cpp:192)

02/24/17 20:23:21.315 [ServiceContainer::doInit] processing directive: dynamic AsyncEventMgr Service_Object * "/usr/openv/lib/libAsyncEventService.so":_make_AsyncEventMgr()(ServiceContainer.cpp:166)

02/24/17 20:23:21.316 [NBLogFactory::locateLogger(fid, oid)] Found a logger (#2). FID - 231, OID - 231(NBLogFactory.cpp:199)

02/24/17 20:23:21.317 [AsyncEventMgr::init()] +++ ENTERING +++ : obj = 1002ba900 (../AsyncEventMgr.cpp:99)

02/24/17 20:23:21.317 [LifecycleTools::startCosNotifSvc()] +++ ENTERING +++ : obj = ffffffff7fffce10 (../LifecycleTools.cpp:108)

02/24/17 20:23:21.317 [Orb::setOrbTimeoutPolicy] setting ORB request timeout policy: tv = 300(Orb.cpp:1388)

02/24/17 20:23:21.317 [Orb::setOrbRequestTimeout] timeout seconds: 300(Orb.cpp:1374)

02/24/17 20:23:21.318 [Info] V-231-134 The TAO Notification Service (CosNotify_Service) has been started.

02/24/17 20:23:21.321 [NBIORInterceptor::establish_components] Encoding IP Address [10.12.235.41] in IOR(NBIORInterceptor.cpp:120)

02/24/17 20:23:21.321 [NBIORInterceptor::establish_components] Encoding IP Address [10.12.235.41] in IOR(NBIORInterceptor.cpp:120)

02/24/17 20:23:21.347 [Info] V-231-135 TAO Notification Service (CosNotify_Service) Initialization complete.

02/24/17 20:23:21.347 [Info] V-231-140 Creating default Event Manager objects.

 

 

 

You can see the difference: my log shows my IP address, your log shows xxx.

 

As a matter of interest, in pbx log at the time of startup, do you see lines like this:

 

02/24/17 20:23:21.251 [Info] V-103-10 Adding server: nbevtmgr

02/24/17 20:23:21.399 [Info] PBX_Client_Proxy::parse_line, line = extension=nbevtmgr  From 127.0.0.1

 

To get the pbx log  (PBX log levels work a bit differently, it should be at max debug level of 10 by default)

 

vxlogview -p 50936 -o 103 -t 00:40:00  (last 40 mins of log relative to when you run the command; adjust the time as required, or just restart and get new logs …)

 

 

I will hazard a guess, that you won’t see the lines in PBX log, as nbevtmgr can’t connect to it.
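
To make that check less of an eyeball exercise, here is a minimal sketch of a shell helper that scans PBX log text for the two registration lines quoted above. The function name pbx_saw_nbevtmgr is made up for illustration, and the patterns are simply copied from the sample lines from my server, so treat it as a rough filter and not as anything official.

```shell
# Hypothetical helper: reads PBX log text (e.g. vxlogview output) on stdin
# and reports whether nbevtmgr appears to have registered with PBX.
pbx_saw_nbevtmgr() {
    # Patterns are taken from the two sample PBX lines quoted in this thread.
    if grep -E 'Adding server: nbevtmgr|extension=nbevtmgr' >/dev/null; then
        echo "nbevtmgr registered with PBX"
    else
        echo "no nbevtmgr registration found"
        return 1
    fi
}
```

It could be fed straight from the command above, for example: vxlogview -p 50936 -o 103 -t 00:40:00 | pbx_saw_nbevtmgr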

 

I’d suggest checking name resolution. If you have multiple interfaces on the master, even though you may only use one address, ALL addresses must be fully resolvable (both forwards and backwards): the way NBU works, it advertises all interfaces on a machine, so even if not used, every interface must be resolvable.

 

If you use a hosts file, make sure the localhost entry is correct.  On Unix/Linux, check cat -v /etc/hosts for bad characters; I had a corrupted hosts file the other week that caused this sort of issue.  It might be worth re-creating the hosts file to be safe (obviously, don’t copy it …).
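
As a rough sketch of both checks (bad characters in the hosts file, and forward plus reverse lookups), something like the following could be used. The function names hosts_bad_lines and check_resolution are invented for this example, and it assumes getent is available (it normally is on Solaris and Linux). The hosts check flags any byte that is not a tab or printable ASCII, which also catches stray carriage returns; that rule is my assumption about what counts as "bad".

```shell
# Hypothetical helper: list lines in a hosts file that contain anything
# other than tabs and printable ASCII (what cat -v would show as ^M, M-x ...).
# Returns non-zero if at least one suspect line was found.
hosts_bad_lines() {
    LC_ALL=C awk '
        BEGIN { bad = 0 }
        /[^\t -~]/ { print "suspect line " NR; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Hypothetical helper: forward lookup of a name, then reverse lookup of the
# address that came back. Uses getent so /etc/hosts and DNS are both consulted.
check_resolution() {
    name=$1
    ip=$(getent hosts "$name" | awk '{ print $1; exit }')
    if [ -z "$ip" ]; then
        echo "forward lookup failed: $name"
        return 1
    fi
    back=$(getent hosts "$ip" | awk '{ print $2; exit }')
    if [ -z "$back" ]; then
        echo "reverse lookup failed: $ip"
        return 1
    fi
    echo "$name -> $ip -> $back"
}
```

Run hosts_bad_lines /etc/hosts first, then check_resolution once for every hostname configured on every interface of the master, not just the one NBU is meant to use.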

 

As a matter of interest, is this a cluster?

 

I’ve also just had a look for previous cases with that same CORBA error.  There were various solutions: some were fixed by an EEB (not for the version you have, so N/A), and quite a few had name resolution as the cause, so again, further evidence to check this.  I apologise, but if I had £1 for every time I am told ‘there are no lookup issues’ only to find several hours later, oh look, a ‘name resolution issue’ …

We might need to increase the vx log levels for 156 and 137; these put loads of ‘network’ detail into the nbevtmgr log. You then need to recollect ONLY the nbevtmgr log, but run vxlogview -p 51216 -i 231, not -o 231.  To be honest, if it’s not an easily found lookup issue then you’ll probably need a support call, and in that case send the raw unprocessed logs.

mph999
Level 6
Employee Accredited

I've just had a horrible thought ...

Did you edit the log, remove the IP addresses and put in xxx?

02/23/17 14:34:08.822 [Orb::getConfig] required_interface: xxx defined and assigned(Orb.cpp:582)

If this is done, you need to substitute IP addresses, not remove them, and be consistent.

E.g. 10.20.30.40 becomes 1.1.1.2

20.30.40.50 becomes 1.1.1.3

... especially when you have some sort of communications problem.
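
For anyone who has to scrub logs before posting, here is a minimal sketch of a consistent substitution using awk. The function name anonymize_ips is made up, the numbering simply starts at 1.1.1.1 for the first address seen, and the pattern is a loose "four dotted numbers" match, so it would also rewrite things like four-part version strings. It also does nothing about hostnames.

```shell
# Hypothetical helper: replace every IPv4-looking token on stdin with a
# stable placeholder. The same real address always maps to the same
# placeholder, so relationships between log lines are preserved.
anonymize_ips() {
    awk '
    {
        out = ""
        rest = $0
        # Walk the line left to right, swapping each dotted quad as it is found.
        while (match(rest, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
            ip = substr(rest, RSTART, RLENGTH)
            if (!(ip in map)) {
                n++
                map[ip] = "1.1.1." n
            }
            out = out substr(rest, 1, RSTART - 1) map[ip]
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

Usage would be along the lines of: vxlogview -p 51216 -o 231 -d all | anonymize_ips > nbevtmgr_scrubbed.txt (the output filename is just an example). Process all logs in one run, otherwise the mapping will differ between files.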

Hi,

I did remove the IP addresses, because I have to. The IP address which I removed was the IP of the master server.
It was always the same ip.
I'm sorry for that, I should have mentioned it in my reply with the log.

Marianne
Moderator
Partner    VIP    Accredited Certified

@Sadfad -
You have not replied to any of Martin's requests in his lengthy reply posted on Friday?

I will check all of this. As soon as I have the information I will let you know.

mph999
Level 6
Employee Accredited

No problem ... I thought it looked a bit odd at the time, I should have asked on Friday really.

Some of what I said may be a bit irrelevant now, given that the IP address was correct.

However, I think the questions are still valid: multiple IPs, name resolution etc. can cause really odd stuff to happen, with most of NBU working fine and only the odd thing not working because, ultimately, there is still some communications issue.

I suspect ultimately this might need to be a support call, with full 156 and 137 logs enabled.  However, if the OP is happy we can check a few things first.

 

mph999
Level 6
Employee Accredited

OK ...

Can't really add much more at the moment. I suggest checking the PBX log for the lines mentioned before, as well as the other steps.

If that gets nowhere, then we need to see if the full logs help, I think. There is no need to stop anything as nbevtmgr isn't running, but I'd clear the logs first:

cd /usr/openv/logs/nbevtmgr and delete the log files

Then

vxlogcfg -a -p 51216 -o 231 -s DebugLevel=6 -s DiagnosticLevel=6

vxlogcfg -a -p 51216 -o 156 -s DebugLevel=6 -s DiagnosticLevel=6

vxlogcfg -a -p 51216 -o 137 -s DebugLevel=6 -s DiagnosticLevel=6

As soon as it fails, cp the 51216-231... file to nbevtmgr.raw

Run

vxlogview -p 51216 -i 231 >231.txt

vxlogview -p 51216 -i 231 -d all >231_all.txt  (this gives all the PIDs and TIDs)

Could also grab pbx log

vxlogview -p 50936 -i 103 -d all -t 00:15:00 >pbx.txt  (adjust time as necessary)

Turn the logs back down (same commands as before, but use DebugLevel=1).
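
Since the same vxlogcfg line gets typed several times with only the originator ID and level changing, a small generator can save typos. This is only a sketch: the function name nbevtmgr_loglevel_cmds is invented, it prints the commands without running them, and it assumes DiagnosticLevel should be set to the same value as DebugLevel when turning logging back down.

```shell
# Hypothetical helper: print (do not run) one vxlogcfg command per
# originator ID, using the same flags as the commands in this thread.
#   $1     = level to set (e.g. 6 to raise, 1 to lower)
#   $2...  = originator IDs
nbevtmgr_loglevel_cmds() {
    level=$1
    shift
    for oid in "$@"; do
        # Assumption: DiagnosticLevel follows DebugLevel.
        printf 'vxlogcfg -a -p 51216 -o %s -s DebugLevel=%s -s DiagnosticLevel=%s\n' \
            "$oid" "$level" "$level"
    done
}
```

For example, nbevtmgr_loglevel_cmds 6 231 137 prints the two commands for originators 231 and 137 at level 6. Review the output, pipe it to sh to run it, and call it again with 1 once the logs have been collected.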

We found the error.

It was a missing privilege: 'proc_procntl'.
After we added this privilege, nbevtmgr started and everything else worked fine.
Anyway thank you for your time and answers.

mph999
Level 6
Employee Accredited

Excellent that you found that. May I ask what led you to the answer?

We ran truss on the nbevtmgr process. In the output there was a short line with the error. We were just very lucky to find it.
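
For anyone landing here with the same symptom, a rough sketch of how such a line could be picked out of a truss capture is below. The thread does not show the actual truss line, so this just filters for the usual permission-style failures (EPERM and EACCES) in truss's Err# notation. The function name truss_permission_errors and the capture path are made up, and the binary path in the comment is an assumption about a default install. The privilege name is quoted above as it was posted; Solaris documents a similarly named proc_priocntl privilege, which may be what was meant, but that is a guess.

```shell
# Capture step (run by hand on the master; paths are assumptions):
#   truss -f -o /tmp/nbevtmgr.truss /usr/openv/netbackup/bin/nbevtmgr

# Hypothetical helper: show only the system calls that failed with a
# permission-type error. Reads the named truss output file(s), or stdin.
truss_permission_errors() {
    grep -E 'Err#(1 EPERM|13 EACCES)' "$@"
}
```

Usage: truss_permission_errors /tmp/nbevtmgr.truss. A process that exits in under a second produces a short trace, so any hit near the end is worth a close look.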