Netbackup backups failing with error 82 for a media server

ksurya1487 · ‎06-03-2013

Hi All..

We have 1 master server and 4 media servers. Backups are running fine on 3 media servers, but on one media server all backups are failing with error 82.

Activity Monitor logs

May 31, 2013 12:01:15 PM - requesting resource cvgsolbkpp001-tld-2-cbtsedl2b
May 31, 2013 12:01:15 PM - requesting resource nasolbkp200.NBU_CLIENT.MAXJOBS.cvgrhesosp004-bka
May 31, 2013 12:01:15 PM - requesting resource nasolbkp200.NBU_POLICY.MAXJOBS.fs.all.prd.200
May 31, 2013 12:01:52 PM - granted resource nasolbkp200.NBU_CLIENT.MAXJOBS.cvgrhesosp004-bka
May 31, 2013 12:01:52 PM - granted resource nasolbkp200.NBU_POLICY.MAXJOBS.fs.all.prd.200
May 31, 2013 12:01:52 PM - granted resource V25082
May 31, 2013 12:01:52 PM - granted resource CBTSEDL2B_DR010
May 31, 2013 12:01:52 PM - granted resource cvgsolbkpp001-tld-2-cbtsedl2b
May 31, 2013 12:01:52 PM - estimated 72281241 kbytes needed
May 31, 2013 12:01:55 PM - started process bpbrm (pid=20363)
May 31, 2013 12:01:56 PM - connecting
May 31, 2013 12:01:57 PM - connected; connect time: 0:00:00
May 31, 2013 12:01:59 PM - mounting V25082
May 31, 2013 12:02:09 PM - mounted V25082; mount time: 0:00:10
May 31, 2013 12:02:09 PM - positioning V25082 to file 1
May 31, 2013 12:02:09 PM - positioned V25082; position time: 0:00:00
May 31, 2013 12:02:09 PM - begin writing
May 31, 2013 12:02:39 PM - Info bpbrm (pid=20363) from client cvgrhesosp004-bka: TRV - /opt/app is in a different file system from /opt. Skipping.
May 31, 2013 12:10:44 PM - Error bptm (pid=20379) media manager terminated by parent process
May 31, 2013 12:10:46 PM - end writing; write time: 0:08:37
media manager killed by signal (82)

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After deep analysis of the BPTM, BPBRM, BPCD, BPBKAR and NBSU logs, Symantec advised us to upgrade the kernel on this server. We have the same kernel on all our servers, so why do we need to upgrade this one alone?

Below is the advice from Symantec. Do we really need to upgrade the kernel for this issue?

I was able to download the logs (bptm and bpbrm ) and we can clearly see Socket errors in the bptm log

Bptm shows the socket errors below

*********************************************

main: Setting mud from bp.conf

11:30:07.817 [16777] <2> nbjm_media_request: Passing job control to NBJM, type WRITE/9

11:30:07.817 [16777] <2> nbjm_media_request: old_media_id = , media_id = NULL

11:30:07.818 [16777] <2> RequestInitialResources: starting

11:30:07.818 [16777] <2> RequestInitialResources: started

11:30:07.821 [16777] <2> Orb::init: Created anon service name: NB_16777_-216338603(Orb.cpp:630)

11:30:07.821 [16777] <2> Orb::init: endpointvalue is : pbxiop://1556:NB_16777_-216338603(Orb.cpp:648)

11:30:07.821 [16777] <2> Orb::init: initializing ORB Default_DAEMON_Orb with: Unknown -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb Default_DAEMON_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBEndpoint pbxiop://1556:NB_16777_-216338603 -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(Orb.cpp:759)

11:30:07.846 [16777] <32> Orb::activate: Failed to initialize ORB: check to see if PBX is running or if service has permissions to connect to PBX. Check PBX logs for details

11:30:07.859 [16777] <8> Orb::init: CORBA exception: system exception, ID 'IDL:omg.org/CORBA/BAD_PARAM:1.0'

TAO exception, minor code = 5 (endpoint initialization failure in Acceptor Registry; ECONNREFUSED), completed = NO during orb activation

11:30:07.860 [16777] <16> initializeJmComm: RequestInitialResources : failed to initialize ORB: [BAD_PARAM]. Verify PBX is running and caller has permissions to connect to PBX. See PBX logs for details

11:30:07.860 [16777] <2> RequestInitialResources: retVal = 25 emmStatus = 3000000

11:30:07.860 [16777] <2> RequestInitialResources: returning

11:30:07.860 [16777] <4> nbjm_media_request: Error from RequestMultipleResources, Master nasolbkp200, error 25, resourceAllocated 0

11:30:07.861 [16777] <2> set_job_details: Tfile (1468172): LOG 1370014207 16 bptm 16777 nbjm_media_request() failed: 25, cannot continue with copy 1
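For anyone wading through a large bptm log like the excerpt above, a small filter can pull out only the high-severity lines. This is a minimal sketch, assuming each line follows the `HH:MM:SS.mmm [pid] <severity> message` layout seen in the excerpt and that a higher number in angle brackets means a more severe message (the failures above are logged at 16 and 32). The function name `find_errors` and the threshold default are illustrative, not part of NetBackup.

```python
import re

# Layout assumed from the excerpt: HH:MM:SS.mmm [pid] <severity> message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[(?P<pid>\d+)\] <(?P<sev>\d+)> (?P<msg>.*)$"
)


def find_errors(log_text, min_severity=16):
    """Return (time, pid, severity, message) for lines at or above min_severity."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m.group("sev")) >= min_severity:
            hits.append(
                (m.group("time"), int(m.group("pid")), int(m.group("sev")), m.group("msg"))
            )
    return hits


# Four lines copied from the bptm excerpt in this thread.
sample = """\
11:30:07.821 [16777] <2> Orb::init: Created anon service name: NB_16777_-216338603(Orb.cpp:630)
11:30:07.846 [16777] <32> Orb::activate: Failed to initialize ORB: check to see if PBX is running or if service has permissions to connect to PBX. Check PBX logs for details
11:30:07.859 [16777] <8> Orb::init: CORBA exception: system exception, ID 'IDL:omg.org/CORBA/BAD_PARAM:1.0'
11:30:07.860 [16777] <16> initializeJmComm: RequestInitialResources : failed to initialize ORB: [BAD_PARAM]. Verify PBX is running and caller has permissions to connect to PBX. See PBX logs for details
"""

for hit in find_errors(sample):
    print(hit)
```

On this sample only the two PBX/ORB failures are reported, which is where the ECONNREFUSED points: the process could not connect to PBX on the media server.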

The media server runs Solaris 10 but at a very old OS patch level

5.10 Generic_142900-03 sun4v sparc SUNW,SPARC-Enterprise-T5220

You will need to update the OS patch due to socket errors showing up in bptm

Due to the socket related error noted in BPTM for 2 different master daemon connection failures

Do the following:

1)

Solaris servers need the SUN KERNEL update of June 2011 or later.

Note the following kernel version on the problem media server / client

# uname -a

142900-03 = Release Date: Dec/09/2009

Oracle / Sun and SYMANTEC identified a minimum Solaris kernel patch level to avoid system socket management issues and NetBackup daemon issues when using the server sockets.

Symantec recommends that you download the patch set dated June 2011 (or newer) from the Oracle Support website.

https://support.oracle.com

The patch set contains the following minimum recommended patches:

- 118777-17 (SunOS 5.10: Sun GigaSwift Ethernet 1.0 driver patch)
- 139555-08 (Kernel patch with C++ library updates)
- 142394-01 (Internet Control Message Protocol (ICMP) patch)
- 143513-02 (Data Link Admin command for Solaris (DLADM) patch)
- 141562-02 (Address Resolution Protocol (ARP) patch)

The following patches are recommended for Solaris 10 SPARC with NXGE cards:

- 142909-17 (SunOS 5.10: nxge patch)
- 143897-03 (Distributed Link Software patch)
- 143135-03 (Aggregation patch)
- 119963-21 (Change Request ID - 6815915)
- 139555-08 (Change Request ID - 6723423)
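To compare a server against the minimum patch list above, something along these lines could help. It is only a sketch: it assumes you have saved the output of `showrev -p` from the Solaris 10 host and that installed patches appear on lines starting with `Patch: NNNNNN-RR`. The names `MINIMUM_PATCHES`, `installed_revisions` and `missing_or_old` are made up for illustration, and the sample text (including the package names) is invented. A patch that has been obsoleted by a different patch ID would still be reported as missing, so treat the result as a hint only.

```python
import re

# Minimum revisions taken from the general patch list quoted above.
MINIMUM_PATCHES = {
    "118777": 17,
    "139555": 8,
    "142394": 1,
    "143513": 2,
    "141562": 2,
}


def installed_revisions(showrev_output):
    """Map patch base ID -> highest revision found on 'Patch: NNNNNN-RR' lines."""
    found = {}
    pattern = r"^Patch:\s*(\d{6})-(\d{2})"
    for base, rev in re.findall(pattern, showrev_output, re.MULTILINE):
        found[base] = max(found.get(base, 0), int(rev))
    return found


def missing_or_old(showrev_output, minimums=MINIMUM_PATCHES):
    """Return {base: (installed_rev_or_None, minimum_rev)} for patches below minimum."""
    installed = installed_revisions(showrev_output)
    report = {}
    for base, min_rev in minimums.items():
        have = installed.get(base)
        if have is None or have < min_rev:
            report[base] = (have, min_rev)
    return report


# Invented sample input, not taken from the server in this thread.
sample = """\
Patch: 118777-17 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcea
Patch: 139555-05 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
"""

for base, (have, need) in sorted(missing_or_old(sample).items()):
    print(base, "installed:", have, "minimum:", need)
```

The NXGE-specific patches could be checked the same way by passing a second dictionary as `minimums`.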

Reference

Solaris nxge driver and NetBackup communication errors

http://www.symantec.com/business/support/index?page=content&id=TECH128953

Reference

SUN BUG 119963-21 - SunOS 5.10: Shared library patch for C++

This bug describes lock contention in the dtrace area when frequently forking or exiting short-lived processes that use the C++ runtime library libCrun.so. Symptoms are a high kernel load and delays in the fork() and exit() system calls.

NBU support has now started to identify this problem for NBU in the latest release information. Reference: NetBackup 7.5 Release Notes, page 65, Solaris Patches

http://www.symantec.com/docs/DOC5041

2)

Reduce Media Server socket usage

Move NBU internal VNETD socket connections on servers to the server loopback interface instead of using the VNETD daemon. Add the following line to /usr/openv/netbackup/bp.conf:

CONNECT_OPTIONS = localhost 1 0 2
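Before and after editing bp.conf it can be handy to confirm whether the entry is already there. A minimal sketch, assuming bp.conf consists of `KEY = value` lines as in the entry above; skipping lines that start with `#` is my own assumption, and the helper names and the sample file content are hypothetical.

```python
def bpconf_values(text, key):
    """Return all values for `key` from 'KEY = value' lines, whitespace-normalized."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' lines carry no settings.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip().upper() == key.upper():
            values.append(" ".join(value.split()))
    return values


def has_loopback_option(text):
    """True if the CONNECT_OPTIONS entry recommended above is present."""
    return "localhost 1 0 2" in bpconf_values(text, "CONNECT_OPTIONS")


# Hypothetical bp.conf content for illustration only.
sample_before = """\
SERVER = nasolbkp200
CLIENT_NAME = mediaserver-x
"""
sample_after = sample_before + "CONNECT_OPTIONS = localhost 1 0 2\n"

print(has_loopback_option(sample_before))
print(has_loopback_option(sample_after))
```

To check the real file, read `/usr/openv/netbackup/bp.conf` and pass its content to `has_loopback_option`.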

No restart needed

RamNagalla · ‎06-03-2013

Hi,

What is your NetBackup version? I assume it is 7.5, based on the mail from Symantec.

Have you done any recent upgrades?

Did this issue start recently? Did backups on this media server ever work fine?

The input from Symantec is pretty clear.

I would suggest you go with the kernel upgrade and see if that helps.

Rohit_Phiske · ‎06-03-2013

Hi Surya,

Since you have 4 media servers connected to 1 master server, have you tried recycling the environment?

If this issue came up recently and backups on this media server were running before, I would recommend restarting the NetBackup environment, in case the issue is due to connectivity with the master server.

Check whether the steps below can be tried:

1. Stop active jobs and suspend the scheduler so that it does not initiate new jobs.

2. Shut down media server services and PBX.

3. Shut down master server services and PBX.

4. Clear any stuck media in the tape drives.

5. Start PBX and master server services.

6. Start PBX and media server services. (This would ensure all connections and allocations are in line for all media servers.)

Let us know if it works.

ksurya1487 · ‎06-04-2013

Hi all

Thanks for your replies. It's NetBackup 7.0.1. Here are the changes we made, after which we started facing the issue.

the media server having issue with Error 82 - X

Master server - Y (Solaris)

Master server Y went down due to a hardware issue, and replacing the hardware component was taking time. In the meantime we built a new master server with the same name Y (on Red Hat) and added media server X to it for the catalog restore, as master server Y is not connected to the tape library.

Later the old master server Y (Solaris) was fixed, so we brought down the new server and configured media server X back to the old master server.

Once media server X was configured back to the old master server, we just ran the device configuration wizard for it and restarted the services/server. Nothing worked.

Symantec also says

"Due to the socket related error noted in BPTM for 2 different master daemon connection failures"

I told all of this to Symantec. Do we still need the kernel upgrade, or can something be fixed here?

Is there some configuration mismatch with the media server? Do we need to do an nbdecom of the media server and reinstall it?

Marianne · ‎06-04-2013

No need to decommission.

Simply delete devices, 'offline' the media server, then remove NBU software.

Reinstall NBU, patch.
Activate media server and test comms.
Please double-check /etc/hosts for duplicate entries and/or left-over entries for temporary Linux server.
Run device config wizard and test backups.
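For the /etc/hosts check, a quick script can flag any hostname that resolves to more than one address, which is what a left-over entry for the temporary Linux master with the same name would look like. This is a minimal sketch: the function names are made up, and the addresses and hostnames in the sample are placeholders, not the real ones from this environment.

```python
from collections import defaultdict


def parse_hosts(text):
    """Map each hostname or alias (lowercased) to the set of IPs it is listed under."""
    names = defaultdict(set)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        ip, hostnames = fields[0], fields[1:]
        for host in hostnames:
            names[host.lower()].add(ip)
    return names


def conflicting_names(text):
    """Return {hostname: [ips]} for names that appear under more than one IP."""
    return {n: sorted(ips) for n, ips in parse_hosts(text).items() if len(ips) > 1}


# Placeholder hosts file: 'master-y' is listed twice with different addresses.
sample = """\
127.0.0.1   localhost
192.0.2.10  master-y master-y.example.com   # Solaris master
192.0.2.20  master-y                        # left-over Linux master
192.0.2.30  media-x
"""

print(conflicting_names(sample))
```

Note that a host listed under both an IPv4 and an IPv6 address (for example localhost under 127.0.0.1 and ::1) would also be reported, so review the output rather than acting on it blindly.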

*** EDIT ***

Can you confirm that the Linux server is completely down/uninstalled? Just in case there is still some polling/comms from this server with the media server trying to connect back to this server instead of the Solaris master.


ksurya1487 · ‎06-05-2013

The Linux server has been isolated completely and it is down.

I am worried about the media assigned to that media server. It has around 3000+ media assigned; will those be assigned back again if we do this?

Marianne · ‎06-06-2013

Try to reinstall as per my previous post.

Remember to double-check /etc/hosts as well.
It seems as if media server is trying to connect back to wrong master server.

What does 'nbemmcmd -getemmserver' show?
