cancel
Showing results for 
Search instead for 
Did you mean: 

Client process aborted was reported only with the parent job

liuyl
Level 6

Enviroment: NBU 8.1.1 RHEL 6.5
Recently we had a long time standard full backup job running continuously with no any problem for hundred hours.
But after 196 Hrs the parent job done with status 50,  versus its child job(19018182) was still active running normally.

So:
1) How to understand such problem only failed with the parent job ?
2) What/which ever the client process aborted would mean about ?


1908182 Active DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 195.404 Hrs
1908181 Done 0 DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 175.502 Hrs
1908180 Done 0 DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 80.4825 Hrs
1908179 Done 0 DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 25.8503 Hrs
1908178 Active DCTM-ecmdata dctm1 13 Feb 2020 07:30:53 PM 195.405 Hrs

 

1908182 Active DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 213.676 Hrs
1908181 Done 0 DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 175.502 Hrs
1908180 Done 0 DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 80.4825 Hrs
1908179 Done 0 DCTM-ecmdata dctm1 13 Feb 2020 07:30:55 PM 25.8503 Hrs
1908178 Done 50 DCTM-ecmdata dctm1 13 Feb 2020 07:30:53 PM 196.771 Hrs

 

NEW_STREAM
/ecmdata/ecmdocdata
NEW_STREAM
/ecmdata/ecmdocdata1
/ecmdata/ecmdocdata2
NEW_STREAM
/ecmdata/ecmdocdata3
/ecmdata/ecmdocdata4
NEW_STREAM
/ecmdata/ecmdocdata5
/ecmdata/ecmdocdata6
/ecmdata/ecmdocdata7


50
client process aborted
34
02/13/2020 19:30:53 - Info nbjm(pid=7959) starting backup job (jobid=1908178) for client dctm1\
policy DCTM-ecmdata\
schedule dctm-ecmfile-full
02/13/2020 19:30:53 - Info nbjm(pid=7959) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=1908178\
request id:{4B04A00E-4E54-11EA-A5A2-C960B335B408})
02/13/2020 19:30:53 - requesting resource dctm1-hcart-robot-tld-4
02/13/2020 19:30:53 - requesting resource jcbak.NBU_CLIENT.MAXJOBS.dctm1
02/13/2020 19:30:53 - requesting resource jcbak.NBU_POLICY.MAXJOBS.DCTM-ecmdata
02/13/2020 19:30:53 RB - waiting for Stop Drive Scan
02/13/2020 19:30:55 - granted resource jcbak.NBU_CLIENT.MAXJOBS.dctm1
02/13/2020 19:30:55 - granted resource jcbak.NBU_POLICY.MAXJOBS.DCTM-ecmdata
02/13/2020 19:30:55 - granted resource A226L4
02/13/2020 19:30:55 - granted resource IBM.ULT3580-TD4.004
02/13/2020 19:30:55 - granted resource dctm1-hcart-robot-tld-4
02/13/2020 19:30:55 - estimated 0 kbytes needed
02/13/2020 19:30:55 - begin operation
02/13/2020 19:30:55 - begin operation
02/13/2020 19:30:55 - Info RUNCMD(pid=23284) started
02/13/2020 19:30:55 - Info RUNCMD(pid=23284) exiting with status: 0
Operation status: the requested operation was successfully completed (0)
02/13/2020 19:30:55 - end operation; operation time: 000:00:00
02/13/2020 19:30:55 - begin operation
Operation status: the requested operation was successfully completed (0)
02/13/2020 19:30:55 - end operation; operation time: 000:00:00
02/13/2020 19:30:55 - begin operation
Operation status: client process aborted (50)
02/22/2020 00:17:06 - end operation; operation time: 196:46:11
02/22/2020 00:17:06 - begin operation
Operation status: the requested operation was successfully completed (0)
02/22/2020 00:17:06 - end operation; operation time: 000:00:00
02/22/2020 00:17:06 - begin operation
02/22/2020 00:17:06 - Info RUNCMD(pid=22637) started
02/22/2020 00:17:06 - Info RUNCMD(pid=22637) exiting with status: 0
Operation status: the requested operation was successfully completed (0)
02/22/2020 00:17:06 - end operation; operation time: 000:00:00
Operation status: client process aborted (50)
02/22/2020 00:17:06 - end operation

 

 

02/22/2020 00:17:06.732 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759456532224 File ID:117 [No context] 1 [DefaultQueue::handle_input] sending JL updateJobStatus : job
id=-1908178, status=1(CallbackQueue.cpp:1510)
02/22/2020 00:17:06.732 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759456532224 File ID:117 [No context] 1 [DefaultQueue::handle_input] (0x1901af0) CallbackQueue now em
pty(CallbackQueue.cpp:1594)
02/22/2020 00:17:06.733 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] 1 [JobInstance::state] (0x7f1c301668e0) sta
te=5(JobInstance.cpp:355)
02/22/2020 00:17:06.733 [Application] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-94 FSM JobTable[STREAM_D
ISCOVERY:PEMPreprocessed] m_JobStep=4, Jobid=1908178, returns=50
02/22/2020 00:17:06.733 [Application] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-88 FSM JobTable[STREAM_D
ISCOVERY:StopOnError] m_JobStep=-7, Jobid=1908178
02/22/2020 00:17:06.733 [Application] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-94 FSM JobTable[STREAM_D
ISCOVERY:StopOnError] m_JobStep=5, Jobid=1908178, returns=0
02/22/2020 00:17:06.733 [Application] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-88 FSM JobTable[STREAM_D
ISCOVERY:EndNotifyScript] m_JobStep=6, Jobid=1908178
02/22/2020 00:17:06.733 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [No context] 1 [BPCRConnFactory::requestBPCRConnection] inserting connection
request (back) BPCRConnectionRequest(pool=2, connectionName=jcbak::RUNCMD, requester=0x7f1c383b0580)(BPCRConnFactory.cpp:1032)
02/22/2020 00:17:06.734 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [No context] 1 [BPCRConnection::linkToRequester] (0x7f1c10266180) link conne
ction to requester(BPCRConnFactory.cpp:163)
02/22/2020 00:17:06.734 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [No context] 1 [BPCRConnection::linkToRequester] Connection BPCRConnection(s
tate=1, hadError=0, activeFDcnt=3, connectName=jcbak::RUNCMD, hostName=jcbak, userName=) requester=0x7f1c383b0580(BPCRConnFactory.cpp:209)
02/22/2020 00:17:06.764 [Diagnostic] NB 51216 nbjm 117 PID:7969 TID:139759456532224 File ID:117 [No context] 1 V-117-219 [RunCmd::terminate] termination in progress
02/22/2020 00:17:06.766 [Diagnostic] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 V-117-238 [BPCRConnection::handleError] had an error, status=40
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 [BPCRConnFactory::returnConnection] connection returned was f
ound on the active list BPCRConnectionRequest(pool=2, connectionName=jcbak::RUNCMD, requester=0x7f1c383b0580, connection= BPCRConnection(state=3, hadError=1, activeFD
cnt=2, connectName=jcbak::RUNCMD, hostName=jcbak, userName=))(BPCRConnFactory.cpp:1106)
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 [BPCRConnection::unlinkFromRequester] marking connection not
reusable because parser is not reusable BPCRConnection(state=3, hadError=1, activeFDcnt=2, connectName=jcbak::RUNCMD, hostName=jcbak, userName=)(BPCRConnFactory.cpp:312)
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 [BPCRConnectionRequest::unlinkFromRequester()] connection not
reusable, will close BPCRConnection(state=4, hadError=1, activeFDcnt=2, connectName=jcbak::RUNCMD, hostName=jcbak, userName=)(BPCRConnFactory.cpp:1700)
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 [BPCRConnFactory::returnConnection] (0x1941020) connection co
uld not be reused(BPCRConnFactory.cpp:1138)
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-94 FSM JobTable[STREAM_D
ISCOVERY:EndNotifyScript] m_JobStep=6, Jobid=1908178, returns=0
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-107 *********** Jobid 19
08178 status summary for STREAM_DISCOVERY: *********
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step 1 StartNotifySc
ript status 0 jobstatus -
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step 2 StreamDiscove
ry status 0 jobstatus 0
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step 3 BMRSave statu
s - jobstatus 0
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step 4 PEMPreprocess
ed status 50 jobstatus 50
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step 5 ValidateImage
status - jobstatus 50
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step 6 EndNotifyScri
pt status 0 jobstatus 50
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-114 Step -7 StopOnError
status 0 jobstatus 50
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-111 ********** job statu
s is 50 **********
02/22/2020 00:17:06.766 [Application] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [jobid=1908178 parentid=1908178] [Info] V-117-52 GenBackupJob Terminat
ing status=50
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] 1 [JobInstance::state] (0x7f1c301668e0) sta
te=5(JobInstance.cpp:355)
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] 1 [JobInstance::state] (0x7f1c301668e0) sta
te=5(JobInstance.cpp:355)
02/22/2020 00:17:06.766 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] 1 [BackupJob::sendNotifications] (0x7f1c301
668e0) disable resume=false, brm started=false(BackupJob.cpp:2498)
02/22/2020 00:17:06.767 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [No context] 1 [BPCRConnFactory::requestBPCRConnection] inserting connection
request (back) BPCRConnectionRequest(pool=6, connectionName=dctm1::BPCD::root, requester=0x7f1c3837e480)(BPCRConnFactory.cpp:1032)
02/22/2020 00:17:06.767 [Diagnostic] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] 1 V-117-239 [BackupJob::terminateThisJ
ob] terminated job, jobid=1908178, status=50 {2}
02/22/2020 00:17:06.767 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 [BPCRConnection::linkToRequester] (0x7f1c2c4fe0e0) link conne
ction to requester(BPCRConnFactory.cpp:163)
02/22/2020 00:17:06.767 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 [BPCRConnection::linkToRequester] Connection BPCRConnection(s
tate=1, hadError=0, activeFDcnt=3, connectName=dctm1::BPCD::root, hostName=dctm1, userName=root) requester=0x7f1c3837e480(BPCRConnFactory.cpp:209)
02/22/2020 00:17:06.767 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 parentid=1908178] 1 [BackupJob::~BackupJob] (dump Backup Job)
:
**************************
*BackupJob *
(jobid=1908178 jobtype=0)
**************************
isStdBackup = false
backupType = 0
subtype = 505
mpxid = {4B12FC12-4E54-11EA-A7BD-D1139209671B}
maxmpx = 1
Birthtime = 1581593455
backupid = dctm1_1581593455
Active pid = 0
Job Count = 1
IPaddr =
currentThrottle = -1
estimatedSize = 0
estimatedFileCount= 0
suspendable = 0
progressLogTimerID= -1
localBackup = 0
policy = DCTM-ecmdata
Num copies = 1
Include_file# = 12
resourceCount = 1
Stu = dctm1-hcart-robot-tld-4
Pool = ECM
Retention = 2
UseLocal = false
(BackupJob.cpp:390)
02/22/2020 00:17:06.767 [Debug] NB 51216 nbjm 117 PID:7969 TID:139759647893248 File ID:117 [jobid=1908178 reqid=0] 1 [ExportedResourceMgr::~ExportedResourceMgr]
(0x7f1c3037ac90) nothing found to release or release on destruction is false(ExportedResourceMgr.cpp:162)

 

5 REPLIES 5

sdo
Moderator
Moderator
Partner    VIP    Certified

In most cases where NetBackup Client and NetBackup Server have been running a backup job quite well and quite steadily for many hours then it is usually quite likely for them to continue to do so until a sucessful end of the backup job is reached... unless... external factors cause a state change or interuption to an element of the communications path.  And, I'm sorry to say, that it will be nigh on impossible to determine which external factor(s) caused your backup job interuption unless you have a repeatable failure pattern.

1)  But why the child job could be still running active until its success eventually  ?
2)  It was strange that all the child jobs have their own schedule & PID versus the parent job not!
     So where the client process of the parent job came from?


1908182 Backup Done 0 DCTM-ecmdata dctm-ecmfile-full dctm1 24233 No
1908181 Backup Done 0 DCTM-ecmdata dctm-ecmfile-full dctm1 23668 No
1908180 Backup Done 0 DCTM-ecmdata dctm-ecmfile-full dctm1 23153 No
1908179 Backup Done 0 DCTM-ecmdata dctm-ecmfile-full dctm1 22681 No
1908178 Backup Done 50 DCTM-ecmdata - dctm1           No

sdo
Moderator
Moderator
Partner    VIP    Certified

When multi-streaming is not enabled then there is generally no concept of a parent and child job, i.e. there is a just a job.  When multi-streaming is enabled, then backup jobs are a set of a parent job (which has a "-" for schedule name) plus child jobs which do reflect the schedule name - and this is justa feature, this is just the way it is.

As for why child jobs would suceed, and yet the parent job fail with status 50, it is not possible to tell without detailed logs.

sdo
Moderator
Moderator
Partner    VIP    Certified

I did see this in your log snippet:

02/22/2020 00:17:06.766 [Diagnostic] NB 51216 nbjm 117 PID:7969 TID:139759446025984 File ID:117 [No context] 1 V-117-238 [BPCRConnection::handleError] had an error, status=40

...and to me status 40 have always been really difficult to track down errors to do with LAN networking and LAN switching, and hardly ever due to NetBackup itself.

1) But most backup types do have their parent job, such as Catalog、DB App,  so it should have nothing to do with the multistream.

378489 Backup Active sap_redo_SCERP Default-Application-Backup SCERPDB2 SCERPDB2 2818478 No
378488 Backup Active sap_redo_SCERP Default-Application-Backup SCERPDB2 SCERPDB2 5833122 No
378487 Backup Active sap_redo_SCERP Default-Application-Backup SCERPDB2 SCERPDB2 8323684 No
378486 Backup Active sap_redo_SCERP Default-Application-Backup SCERPDB2 SCERPDB2 6554144 No
378485 Backup Active sap_redo_SCERP                    sap_archivelog SCERPDB2 SCERPDB2 6619540 No
378484 Catalog Backup Done 0 NBU-Catalog full jchxbak jchxbak 28289 No
378483 Catalog Backup Done 0 NBU-Catalog full jchxbak jchxbak 28289 No
378482 Catalog Backup Done 0 NBU-Catalog full jchxbak jchxbak 28289 No
378481 Catalog Backup Done 0 NBU-Catalog full jchxbak jchxbak 27683 No
378480 Catalog Backup Done 0 NBU-Catalog full jchxbak jchxbak 27683 No
378479 Catalog Backup Done 0 NBU-Catalog full jchxbak jchxbak 27683 No
378478 Catalog Backup Done 0 NBU-Catalog full jchxbak              27579
378477 Catalog Backup Done 0 NBU-Catalog -    jchxbak                          No


2)  I shall check whether the corresponding network situation was abnormal at that time ...,  but why the child job was still all right regardless of the same media/client ?

..