Intermittent Status 213 failures
I'm using a fairly straightforward enviornment as follows:
Master: RHEL 5 NetBackup 7.1.0.4
Media: RHEL 5 NetBackup 7.1.0.4
OST Plugin: 2.4.1.0-289644
Using DDBoost Storage Unit which has had some intermittent failures which I cannot seem to figure out why. I had one occurrence on 12/13 (with 3-4 failures) which after some quick log skimming I couldn't find an obvious answer for and was willing to maybe pass it off as a network hiccup. I had another occurrence last night which means I am now expecting additional occurrences. I have been digging into logs and posts and tech notes for most of the day and cannot seem to put my finger on why exactly. There were 4 jobs which failed last night with status 213, while other active jobs continued to run and subsequent jobs have run successfully. I can see that the allocation of storage unit resource failed, but am just missing the part as to why that was.
Here is the output from one of the jobs:
vxlogview -X "jobid=7708479"
12/18/2012 03:26:48.409 [RecoverableJob::writeBooleans] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) flags=773(../RecoverableJob.cpp:103)
12/18/2012 03:26:48.410 V-116-215 [BaseJob::run] jobid=7708479 submitted to nbjm for processing
12/18/2012 03:27:40.059 [JobInstance::state] (0x2aaab40d7870) state=0(../JobInstance.cpp:316)
12/18/2012 03:27:40.059 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=7708479)
12/18/2012 03:27:40.060 [RecoverableJob::writeBooleans] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) flags=773(../RecoverableJob.cpp:103)
12/18/2012 03:27:40.060 V-118-227 [ResBroker_i::requestResources] received resource request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}, priority 0, secondary priority 24,631, description THE_BACKUP_JOB-7708479-{C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.060 [ResBroker_i::requestResources] RequestSeq:
Request=0 provider=MPXProvider resourcename=MpxRequest-7708479 MPXGroupRequest maxmpx=1
SubRequest=0 provider=DriveOperationProvider resourcename=DDB-SPPDUPFI071-ORACLE-0 userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=DDB-SPPDUPFI071-ORACLE-0 mediaPool=NetBackup retentionLevel=3 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=0 usageType=1 client=spgridfi020.usa-ed.net shareGroup=*ANY* } }
Request=1 provider=NamedResourceProvider resourcename=spmastfi030.usa-ed.net.NBU_CLIENT.MAXJOBS.spgridfi020.usa-ed.net CountedResourceRequest { max=99 }
Request=2 provider=NamedResourceProvider resourcename=spmastfi030.usa-ed.net.NBU_POLICY.MAXJOBS.IN-FISHERS-PRD-ORACLE-1 CountedResourceRequest { max=2147483647 }
12/18/2012 03:27:40.097 [allocateTwin] INITIATING:
12/18/2012 03:27:40.097 [allocateTwin] masterServer = spmastfi030.usa-ed.net, client = spgridfi020.usa-ed.net, jobType = 1, capabilityFlags = 0, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0
12/18/2012 03:27:40.097 [allocateTwin] Twin_Record: STUIdentifier = DDB-SPPDUPFI071-ORACLE-0, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 3, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0
12/18/2012 03:27:40.097 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.097 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.106 [set_host_and_dg_info_in_stu_list] could not get any media servers for storage unit
12/18/2012 03:27:40.107 [allocateTwin] EXIT INFO:
12/18/2012 03:27:40.107 [allocateTwin] EXIT STATUS = 2002011 (EMM_ERROR_STUS_Not_Available, Storage units are not available)
12/18/2012 03:27:40.108 V-118-201 [EMMAccess::allocateTwin] request to allocate STU (Master Server spmastfi030.usa-ed.net, Client spgridfi020.usa-ed.net, Job Type 1, STU DDB-SPPDUPFI071-ORACLE-0, STU Type 1, Media Pool NetBackup, Retention Level 3, Capabilities 0, Fibre Transport Preference 0) returned 2002011
12/18/2012 03:27:40.108 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.108 [Error] V-118-171 request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815} failed with status 0 (Not Enough Valid Resources); releasing 2 allocated resources
12/18/2012 03:27:40.108 [ResBroker_i::evaluateOne] resource request fails, requestId={C9427F56-48EC-11E2-AE0B-FD5ED0696815}, userId=jobid=7708479, status=2002011, birthTime=1355819260, timeToFirstEval=0 sec, timeToSuccessOrFailure=0 sec, numberOfTimesEvaluated=1
12/18/2012 03:27:40.108 V-118-108 [ResBroker_i::failOne] failing resource request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}, status 2002011
12/18/2012 03:27:40.109 V-118-137 [EMMProvider::getDriveForWaitingRequest] allocation information for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815} is not found
12/18/2012 03:27:40.109 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}, error code 2,002,011, reason not enough valid resources
12/18/2012 03:27:40.113 [BackupJob::ERMEvent] (0x2aaab40d7870) initial resource request failed, copy#=-1, EMM status=Storage units are not available, NBU status=213(../BackupJob.cpp:345)
12/18/2012 03:27:50.392 [Error] V-117-131 NBU status: 213, EMM status: Storage units are not available
12/18/2012 03:27:50.392 [JobInstance::state] (0x2aaab40d7870) state=1(../JobInstance.cpp:316)
12/18/2012 03:27:50.392 [JobInstance::state] (0x2aaab40d7870) state=1(../JobInstance.cpp:316)
12/18/2012 03:27:50.392 [BackupJob::sendNotifications] (0x2aaab40d7870) disable resume=false, brm started=false(../BackupJob.cpp:2128)
12/18/2012 03:27:50.393 V-117-239 [BackupJob::terminateThisJob] terminated job, jobid=7708479, status=213
12/18/2012 03:27:50.393 [BackupJob::~BackupJob] (dump Backup Job):
**************************
*BackupJob *
(jobid=7708479 jobtype=0)
**************************
isStdBackup = true
backupType = 0
subtype = 0
mpxid = {00000000-0000-0000-0000-000000000000}
maxmpx = 1
Birthtime = 0
backupid =
Active pid = 0
Job Count = 1
IPaddr =
currentThrottle = -1
estimatedSize = 0
estimatedFileCount= 0
suspendable = 0
progressLogTimerID= -1
localBackup = 0
policy = IN-FISHERS-PRD-ORACLE-1
Num copies = 1
Include_file# = 1
resourceCount = 1
Stu = DDB-SPPDUPFI071-ORACLE-0
Pool = NetBackup
Retention = 3
UseLocal = false
(../BackupJob.cpp:301)
12/18/2012 03:27:50.396 V-116-216 [RecoverableJob::doJobCompletion] jobid=7708479 completed with status 213
12/18/2012 03:27:50.401 [RecoverableJob::writeBooleans] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) flags=773(../RecoverableJob.cpp:103)
12/18/2012 03:27:50.463 [PemTask::taskComplete] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) Active subtask count=0(../PemTask.cpp:529)
12/18/2012 03:27:50.471 [Info] V-116-14 CLIENT spgridfi020.usa-ed.net POLICY IN-FISHERS-PRD-ORACLE-1 SCHED Default-Application-Backup EXIT STATUS 213 (no storage units available for use)
12/18/2012 03:27:50.478 [Error] V-116-85 backup of client spgridfi020.usa-ed.net exited with status 213 (no storage units available for use)
Looking at the nbrmms info indeed:
12/18/2012 03:27:19.100 [Warning] V-219-12 Volume DDB-SPPDUPFI071-ORACLE:IN-FI-SPPDUPFI070-ORACLE marked down, Storage server 192.168.212.40 P{rtmo}
12/18/2012 03:27:27.411 [Warning] V-219-14 Storage server DataDomain:192.168.212.40 was marked down-server
12/18/2012 03:27:30.527 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host punifimed100.usa-ed.net
12/18/2012 03:27:30.527 [Info] V-219-3 Volume spmastfi030.usa-ed.net:DDB-SPPDUPFI071-ORACLE:IN-FI-SPPDUPFI070-ORACLE marked up
On the Data Domain side the ddfs.info log shows nothing out of the ordinary. I can provide more detail, of course, but the underlying infrastructure has (and does) work just fine and 'nothing has been changed' recently :)
2 Suggestions that may help:
Nicolai's advise to increase "Maximum I/O streams per volume": https://www-secure.symantec.com/connect/forums/netbackup-and-datadomain-800-errors#comment-7409161
Extract from NBU 7.1 Release Notes http://www.symantec.com/docs/DOC3643 :
Deduplication backup jobs may fail with status code 213 or status code 2074.
Deduplication backup jobs may fail with status code 213 or 2074 even though
a suitable storage unit exists. The deduplication servers usually must be under
heavy load for that to occur.
To work around this issue, add the following touch file to the deduplication
storage server and to any load balancing servers:
UNIX: /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
Windows:
install_path\Veritas\NetBackup\db\config\DPS_PROXYDEFAULTRECVTMO
The file content must be the integer 800. No other file content is required.