Forum Discussion

Mark_Garringer's avatar
13 years ago

Intermittent Status 213 failures

I'm using a fairly straightforward enviornment as follows:

Master: RHEL 5 NetBackup 7.1.0.4

Media: RHEL 5 NetBackup 7.1.0.4

OST Plugin: 2.4.1.0-289644

Using DDBoost Storage Unit which has had some intermittent failures which I cannot seem to figure out why. I had one occurrence on 12/13 (with 3-4 failures) which after some quick log skimming I couldn't find an obvious answer for and was willing to maybe pass it off as a network hiccup. I had another occurrence last night which means I am now expecting additional occurrences. I have been digging into logs and posts and tech notes for most of the day and cannot seem to put my finger on why exactly. There were 4 jobs which failed last night with status 213, while other active jobs continued to run and subsequent jobs have run successfully. I can see that the allocation of storage unit resource failed, but am just missing the part as to why that was.

Here is the output from one of the jobs:

vxlogview -X "jobid=7708479"
12/18/2012 03:26:48.409 [RecoverableJob::writeBooleans] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) flags=773(../RecoverableJob.cpp:103)
12/18/2012 03:26:48.410 V-116-215 [BaseJob::run] jobid=7708479 submitted to nbjm for processing
12/18/2012 03:27:40.059 [JobInstance::state] (0x2aaab40d7870) state=0(../JobInstance.cpp:316)
12/18/2012 03:27:40.059 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=7708479)
12/18/2012 03:27:40.060 [RecoverableJob::writeBooleans] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) flags=773(../RecoverableJob.cpp:103)
12/18/2012 03:27:40.060 V-118-227 [ResBroker_i::requestResources] received resource request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}, priority 0, secondary priority 24,631, description THE_BACKUP_JOB-7708479-{C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.060 [ResBroker_i::requestResources] RequestSeq:
Request=0 provider=MPXProvider resourcename=MpxRequest-7708479 MPXGroupRequest maxmpx=1
    SubRequest=0 provider=DriveOperationProvider resourcename=DDB-SPPDUPFI071-ORACLE-0 userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=DDB-SPPDUPFI071-ORACLE-0 mediaPool=NetBackup retentionLevel=3 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=0 usageType=1 client=spgridfi020.usa-ed.net shareGroup=*ANY* } }
Request=1 provider=NamedResourceProvider resourcename=spmastfi030.usa-ed.net.NBU_CLIENT.MAXJOBS.spgridfi020.usa-ed.net CountedResourceRequest { max=99 }
Request=2 provider=NamedResourceProvider resourcename=spmastfi030.usa-ed.net.NBU_POLICY.MAXJOBS.IN-FISHERS-PRD-ORACLE-1 CountedResourceRequest { max=2147483647 }
12/18/2012 03:27:40.097 [allocateTwin] INITIATING:
12/18/2012 03:27:40.097 [allocateTwin] masterServer = spmastfi030.usa-ed.net, client = spgridfi020.usa-ed.net, jobType = 1, capabilityFlags = 0, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0
12/18/2012 03:27:40.097 [allocateTwin] Twin_Record: STUIdentifier = DDB-SPPDUPFI071-ORACLE-0, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 3, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0
12/18/2012 03:27:40.097 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.097 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.106 [set_host_and_dg_info_in_stu_list] could not get any media servers for storage unit
12/18/2012 03:27:40.107 [allocateTwin] EXIT INFO:
12/18/2012 03:27:40.107 [allocateTwin] EXIT STATUS = 2002011 (EMM_ERROR_STUS_Not_Available, Storage units are not available)

12/18/2012 03:27:40.108 V-118-201 [EMMAccess::allocateTwin] request to allocate STU (Master Server spmastfi030.usa-ed.net, Client spgridfi020.usa-ed.net, Job Type 1, STU DDB-SPPDUPFI071-ORACLE-0, STU Type 1, Media Pool NetBackup, Retention Level 3, Capabilities 0, Fibre Transport Preference 0) returned 2002011
12/18/2012 03:27:40.108 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}
12/18/2012 03:27:40.108 [Error] V-118-171 request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815} failed with status 0 (Not Enough Valid Resources); releasing 2 allocated resources
12/18/2012 03:27:40.108 [ResBroker_i::evaluateOne] resource request fails, requestId={C9427F56-48EC-11E2-AE0B-FD5ED0696815}, userId=jobid=7708479, status=2002011, birthTime=1355819260, timeToFirstEval=0 sec, timeToSuccessOrFailure=0 sec, numberOfTimesEvaluated=1
12/18/2012 03:27:40.108 V-118-108 [ResBroker_i::failOne] failing resource request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}, status 2002011
12/18/2012 03:27:40.109 V-118-137 [EMMProvider::getDriveForWaitingRequest] allocation information for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815} is not found
12/18/2012 03:27:40.109 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {C9427F56-48EC-11E2-AE0B-FD5ED0696815}, error code 2,002,011, reason not enough valid resources
12/18/2012 03:27:40.113 [BackupJob::ERMEvent] (0x2aaab40d7870) initial resource request failed, copy#=-1, EMM status=Storage units are not available, NBU status=213(../BackupJob.cpp:345)
12/18/2012 03:27:50.392 [Error] V-117-131  NBU status: 213, EMM status: Storage units are not available

12/18/2012 03:27:50.392 [JobInstance::state] (0x2aaab40d7870) state=1(../JobInstance.cpp:316)
12/18/2012 03:27:50.392 [JobInstance::state] (0x2aaab40d7870) state=1(../JobInstance.cpp:316)
12/18/2012 03:27:50.392 [BackupJob::sendNotifications] (0x2aaab40d7870) disable resume=false, brm started=false(../BackupJob.cpp:2128)
12/18/2012 03:27:50.393 V-117-239 [BackupJob::terminateThisJob] terminated job, jobid=7708479, status=213
12/18/2012 03:27:50.393 [BackupJob::~BackupJob] (dump Backup Job):
**************************
*BackupJob               *
(jobid=7708479 jobtype=0)
**************************
isStdBackup       = true
backupType        = 0
subtype           = 0
mpxid             = {00000000-0000-0000-0000-000000000000}
maxmpx            = 1
Birthtime         = 0
backupid          =
Active pid        = 0
Job Count         = 1
IPaddr            =
currentThrottle   = -1
estimatedSize     = 0
estimatedFileCount= 0
suspendable       = 0
progressLogTimerID= -1
localBackup       = 0
policy            = IN-FISHERS-PRD-ORACLE-1
Num copies        = 1
Include_file#     = 1
resourceCount     = 1
Stu               = DDB-SPPDUPFI071-ORACLE-0
Pool              = NetBackup
Retention         = 3
UseLocal          = false
(../BackupJob.cpp:301)
12/18/2012 03:27:50.396 V-116-216 [RecoverableJob::doJobCompletion] jobid=7708479 completed with status 213
12/18/2012 03:27:50.401 [RecoverableJob::writeBooleans] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) flags=773(../RecoverableJob.cpp:103)
12/18/2012 03:27:50.463 [PemTask::taskComplete] (ID:0x2aaabc0206a0 CTX:0x2aaabc0206a0) Active subtask count=0(../PemTask.cpp:529)
12/18/2012 03:27:50.471 [Info] V-116-14 CLIENT spgridfi020.usa-ed.net  POLICY IN-FISHERS-PRD-ORACLE-1  SCHED Default-Application-Backup  EXIT STATUS 213 (no storage units available for use)
12/18/2012 03:27:50.478 [Error] V-116-85 backup of client spgridfi020.usa-ed.net exited with status 213 (no storage units available for use)
 

Looking at the nbrmms info indeed:

12/18/2012 03:27:19.100 [Warning] V-219-12 Volume DDB-SPPDUPFI071-ORACLE:IN-FI-SPPDUPFI070-ORACLE marked down, Storage server 192.168.212.40 P{rtmo}
12/18/2012 03:27:27.411 [Warning] V-219-14 Storage server DataDomain:192.168.212.40 was marked down-server
12/18/2012 03:27:30.527 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host punifimed100.usa-ed.net
12/18/2012 03:27:30.527 [Info] V-219-3 Volume spmastfi030.usa-ed.net:DDB-SPPDUPFI071-ORACLE:IN-FI-SPPDUPFI070-ORACLE marked up
 

On the Data Domain side the ddfs.info log shows nothing out of the ordinary. I can provide more detail, of course, but the underlying infrastructure has (and does) work just fine and 'nothing has been changed' recently :)

  • 2 Suggestions that may help:

    Nicolai's advise to increase "Maximum I/O streams per volume": https://www-secure.symantec.com/connect/forums/netbackup-and-datadomain-800-errors#comment-7409161 

    Extract from NBU 7.1 Release Notes  http://www.symantec.com/docs/DOC3643 :

    Deduplication backup jobs may fail with status code 213 or status code 2074.
    Deduplication backup jobs may fail with status code 213 or 2074 even though
    a suitable storage unit exists. The deduplication servers usually must be under
    heavy load for that to occur.
    To work around this issue, add the following touch file to the deduplication
    storage server and to any load balancing servers:
    UNIX: /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
    Windows:
    install_path\Veritas\NetBackup\db\config\DPS_PROXYDEFAULTRECVTMO
    The file content must be the integer 800. No other file content is required.

  • Any errors in the Data Domain's autosupport logs?  My guess would be a failing or overheating network card in the DD, or a bad cable.  For a quick no- or low-cost fix, try just replace the network cables from the DD to the media server. 

    Is the DD communicating to the media server on the same network path as all other network traffic, or is it on a separate network?

    Intermittent errors like this, where one side of the path reports a communication issue but the other doesn't, and the problems seem to increase over time, would make me think of overheating or other hardware failing.

  • 2 Suggestions that may help:

    Nicolai's advise to increase "Maximum I/O streams per volume": https://www-secure.symantec.com/connect/forums/netbackup-and-datadomain-800-errors#comment-7409161 

    Extract from NBU 7.1 Release Notes  http://www.symantec.com/docs/DOC3643 :

    Deduplication backup jobs may fail with status code 213 or status code 2074.
    Deduplication backup jobs may fail with status code 213 or 2074 even though
    a suitable storage unit exists. The deduplication servers usually must be under
    heavy load for that to occur.
    To work around this issue, add the following touch file to the deduplication
    storage server and to any load balancing servers:
    UNIX: /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
    Windows:
    install_path\Veritas\NetBackup\db\config\DPS_PROXYDEFAULTRECVTMO
    The file content must be the integer 800. No other file content is required.

  • Thanks Marianne, I looked at both of these before posting and since I can not actually find additional supporting (log) information for either of them I am reluctant to change things 'just because.'

    The Max I/O streams per volume is set to 90 on that disk pool, and according to the system performace data it was only running in the mid teens for stream count during that time.

    I also felt like the DPS_PROXYDEFAULTRECVTMO was the best lead, but since I don't seem to be logging bpstsinfo timeouts either...

    How can I better tell why nbrmms did this? Just raise the debug level on that log and wait?

    12/18/2012 03:27:27.411 [Warning] V-219-14 Storage server DataDomain:192.168.212.40 was marked down-server

     

     

  • RonCaplinger - There were other sessions running to the same resources at the same time which did not fail, and the data domain logs are clean. This traffic does go over a private backup network from media server to data domain, which in theory should be a less conjested/dedidcated resource for it.

    wr - I also reviewed this prior to posting and feel it is not the case as I'm dealing with an AdvancedDisk. Following the spirit of the technote though:

    bpstulist -L

    Label:                DDB-SPPDUPFI071-ORACLE-0
    Storage Unit Type:    Disk
    Media Subtype:        DiskPool (6)
    Host Connection:      punifimed100.usa-ed.net
    Concurrent Jobs:      90
    On Demand Only:       yes
    Robot Type:           (not robotic)
    Max Fragment Size:    524288
    Max MPX:              1
    Block Sharing:        no
    File System Export:   no
    Ok On Root:           no
    Disk Pool:            DDB-SPPDUPFI071-ORACLE

    nbdevquery -listdp -stype DataDomain -U
    Disk Pool Name   : DDB-SPPDUPFI071-ORACLE
    Disk Pool Id     : DDB-SPPDUPFI071-ORACLE
    Disk Type        : DataDomain
    Status           : UP
    Flag             : Patchwork
    Flag             : Visible
    Flag             : OpenStorage
    Flag             : SingleStorageServer
    Flag             : CopyExtents
    Flag             : AdminUp
    Flag             : InternalUp
    Flag             : LifeCycle
    Flag             : CapacityMgmt
    Flag             : FragmentImages
    Flag             : Cpr
    Flag             : FT-Transfer
    Raw Size (GB)    : 32160.40
    Usable Size (GB) : 32160.40
    Num Volumes      : 1
    High Watermark   : 98
    Low Watermark    : 80
    Max IO Streams   : 90
    Comment          : DD Boost Oracle Data
    Storage Server   : 192.168.212.40 (UP)

    I believe the 'Host Connection' and the 'Storage Server' are different for Advanced Disk. The Host Connection is my media server and the Storage Server is the private (backup) network address of the Data Domain. Also I have no additional alias or DNS clutter as suggested by the tech note you referenced. Thanks though.