06-13-2013 09:15 PM
NetBackup v. 7.5.0.5
I have a number of NDMP policies configured to back up NAS filesystems on an EMC Celerra. The policies are configured with Max Jobs/Policy set to 4. And 4 jobs *will* start when the policy is kicked off... but the other backup selections fail with a status 99.
The EMC jobs show the following:
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) < Backup type: TAR >
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: TYPE Value: TAR
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: FILESYSTEM Value: /swint01p
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: PREFIX Value: /swint01p
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: LEVEL Value: 9
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: HIST Value: Y
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: UPDATE Value: y
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: OPTION Value: NT
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: SNAPSURE Value: Y
2013-06-13 20:21:08: NDMP: 3: Thread ndmp431 < Max allowed concurrent backup checkpoint is exceeded (count=5), try later >
2013-06-13 20:21:08: NDMP: 3: < LOG type: 2, msg_id: 0, entry: SnapSure file system creation fails, hasAssociatedMsg: 0, associatedMsgSeq: 0 >
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) NdmpdData::startBackup, SnapSure creation for backup failed
Now my impression is that the policy should only start 4 jobs and the rest should queue up. Instead of queueing, they are all failing with status 99...
Policy information:
------------------------------------------------------------
Policy Name:       NDMP-denpclrdm2-PRD01
  Policy Type:         NDMP
  Active:              yes
  Effective date:      02/03/1991 12:25:54
  Mult. Data Streams:  yes
  Client Encrypt:      no
  Checkpoint:          no
  Policy Priority:     1
  Max Jobs/Policy:     4
  Disaster Recovery:   0
  Collect BMR info:    no
  Residence:           DEN-DENSYM_ADV_PRD01-stu_disk_densyma01p-1y
  Volume Pool:         NetBackup
  Server Group:        *ANY*
  Keyword:             Production
  Data Classification:       -
  Residence is Storage Lifecycle Policy:    yes
  Application Discovery:      no
  Discovery Lifetime:      0 seconds
  ASC Application and attributes: (none defined)
  Granular Restore Info:  no
  Ignore Client Direct:  no
  Enable Metadata Indexing:  no
  Index server name:  NULL
  Use Accelerator:  no
  HW/OS/Client:  NDMP          NDMP          denpclrdm2b
  Include:  SET DIRECT=Y
            SET HIST=Y
            SET OPTION=NT
            SET TYPE = TAR
            /ccstg01_ckpt0
            /ccstg02_ckpt0
            /ccstg03_ckpt0
            /ccstg05_ckpt0
            /svn02p_ckpt0
            SET SNAPSURE=Y
            /amm02p
            /aspen02p
            /docm02p
            /dtp02p
            /dtp03p
            /ecnt03p
            /fax01p
            /ora10p
            /scm01p
            /svn03p
            /swmil01p
            /swwork01p
            /swint01p
  Schedule:              full
    Type:                Full Backup
    Frequency:           every 25 days
    Maximum MPX:         1
    Synthetic:           0
    Checksum Change Detection: 0
    PFI Recovery:        0
    Retention Level:     0 (1 week)
    Number Copies:       1
    Fail on Error:       0
    Residence:           (specific storage unit not required)
    Volume Pool:         (same as policy volume pool)
    Server Group:        (same as specified for policy)
    Residence is Storage Lifecycle Policy:         0
    Schedule indexing:   0
    Daily Windows:
          Monday     19:00:00  -->  Tuesday    08:00:00
  Schedule:              incr
    Type:                Differential Incremental Backup
    Frequency:           every 1 day
    Maximum MPX:         1
    Synthetic:           0
    Checksum Change Detection: 0
    PFI Recovery:        0
    Retention Level:     0 (1 week)
    Number Copies:       1
    Fail on Error:       0
    Residence:           (specific storage unit not required)
    Volume Pool:         (same as policy volume pool)
    Server Group:        (same as specified for policy)
    Residence is Storage Lifecycle Policy:         0
    Schedule indexing:   0
    Daily Windows:
          Sunday     17:15:00  -->  Monday     08:00:00
          Monday     17:15:00  -->  Tuesday    08:00:00
          Tuesday    17:15:00  -->  Wednesday  08:00:00
          Wednesday  17:15:00  -->  Thursday   08:00:00
          Thursday   17:15:00  -->  Friday     08:00:00
          Friday     17:15:00  -->  Saturday   08:00:00
          Saturday   17:15:00  -->  Sunday     08:00:00
Anyone have any ideas?
06-13-2013 09:17 PM
Master is Solaris 10
Media servers are all Symantec 5220's running 2.5.2 (7.5.0.5)
06-14-2013 01:48 AM
The policy looks OK to me to be honest - maybe it is the SET SNAPSURE=Y statement in the middle of it that causes the issue here.
Would it be possible to split this into 2 policies (and limit the jobs per client to 4) to see if that resolves it for you?
06-14-2013 05:56 AM
Technically I could split these up. Might be a good test. I have other NDMP policies which queue up jobs just fine... that's what is perplexing about these status 99 failures. Why do some NDMP jobs queue up while jobs from other policies immediately fail with a status 99?
06-14-2013 12:59 PM
The SET directives must come before the backup paths.
How did it work when you moved SET SNAPSURE=Y up?
06-14-2013 01:27 PM
The paths above the SNAPSURE directive are already snapshots... don't want the Celerra taking a snapshot of them. That's why the directive is where it is.
I have other NDMP policies with the SNAPSURE directive above all the paths and they have "Status 99" failures as well.
In addition - I have NDMP policies configured the same way as the ones that fail... that don't fail. It's strange.
06-14-2013 01:43 PM
I have never seen it done that way and our manual says SET directives must be first in the list, followed by the file systems or volumes to be backed up.
No explanation for what is happening in the successful jobs but the above is what is supported.
Be that as it may, the "SnapSure creation for backup failed" is coming from the filer.
Set debug logging per this article:
http://www.symantec.com/business/support/index?page=content&id=TECH150646
Re-create the error, pull the server_log and let's see if more information is available.
06-14-2013 10:50 PM
There can be several reasons for a status 99 on NDMP backups.
So the very first check: is the failed backup incremental or full?
Did you try changing the NDMP buffer size? Keep it below 2097152; you can try the values 262144 (256 KB) or 524288 (512 KB).
Create the file SIZE_DATA_BUFFERS_NDMP if it is not present and enter the value in it.
Path: /usr/openv/netbackup/db/config/
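For reference, creating that touch file is just a one-line write on the master/media server. This is only a sketch: it assumes the standard /usr/openv install path and uses 262144 as an example value (heed the cautions about buffer sizes before changing this in production).

```shell
# Sketch only: set the NDMP data buffer size to 256 KB (262144 bytes).
# Assumes the default NetBackup install path; adjust for your environment.
CONFIG_DIR=/usr/openv/netbackup/db/config
echo 262144 > "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
# Verify the value that was written:
cat "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
```

The file contains a single number; remove the file to revert to the default buffer size.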
06-17-2013 12:56 AM
Hello. In my experience, status 99s come from bad file names: upper case versus lower case, and so forth.
I would not play with buffer sizes as you may end up not being able to restore data you have already backed up via NDMP.
I would set up logging either on the filer or via the legacy ndmpagent log. This will give you the exact cause of the status 99.
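For the NetBackup side, legacy logging is enabled simply by creating the log directory on the media server handling the NDMP backup. A minimal sketch, assuming the standard install path (the verbosity value is an example, not a recommendation):

```shell
# Sketch: enable legacy ndmpagent logging on the NetBackup media server.
# Legacy logs are activated by the existence of the per-process directory.
mkdir -p /usr/openv/netbackup/logs/ndmpagent
chmod 777 /usr/openv/netbackup/logs/ndmpagent   # or a tighter mode per site policy
# Optionally raise general verbosity in bp.conf (example value shown):
# echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf
```

Re-run a failing backup afterwards and check the dated log file created under that directory.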
Splitting the policies up by directive also looks like a good idea. Keep it simple; otherwise, someday someone will change the directives without realizing you had set them up in a special way. Failing that, thoroughly test and document what you have done, and make sure you re-test when upgrading the filers.
09-03-2013 05:22 AM
Two things:
Please check this on your Data Mover - maybe increasing the number of concurrent data streams would help:
[nasadmin@nas_host ~]$ server_param server_2 -facility NDMP -list
server_2 :
param_name facility default current configured
maxProtocolVersion NDMP 4 4
scsiReserve NDMP 0 0
CDBFsinfoBufSizeInKB NDMP 1024 1024
bufsz NDMP 128 256 256
convDialect NDMP 8859-1 8859-1
concurrentDataStreams NDMP 4 4
portRange NDMP 1024-65535 1024-65535
includeCkptFs NDMP 1 1
md5 NDMP 1 1
snapTimeout NDMP 5 5
dialect NDMP
forceRecursiveForNonDAR NDMP 0 0
tapeSilveringStr NDMP ts ts
excludeSvtlFs NDMP 1 1
snapsure NDMP 0 0
v4OldTapeCompatible NDMP 1 1
To change this, use the following command, run from the Control Station of your NAS device:
server_param server_2 -facility NDMP -modify concurrentDataStreams -value 8
Of course this does not explain why you are seeing the 99 exit code - but I would use it as a workaround.
The other thing, which I think is more important: your policy has two schedules:
"
Schedule: full Type: Full Backup Frequency: every 25 days Maximum MPX: 1 Synthetic: 0 Checksum Change Detection: 0 PFI Recovery: 0 Retention Level: 0 (1 week)
"
"
Schedule: incr Type: Differential Incremental Backup Frequency: every 1 day Maximum MPX: 1 Synthetic: 0 Checksum Change Detection: 0 PFI Recovery: 0 Retention Level: 0 (1 week)
"
Please check the above and amend: with such an infrequent full backup, I would set the full schedule's retention to about 51 days, so you always have at least 2 full copies - but to be more careful I would set it to 100 days... or 3 months.
My take...
09-03-2013 07:26 AM
I understood EMC Celerra had a hard limit of 4 NDMP threads. Have not heard that this had changed.
09-03-2013 08:23 AM
I have re-read this all again, and it looks like all of the jobs are kicking in instead of the others queueing once 4 have started, and the filer itself is causing the others to fail.
A check through my docs shows that SET SNAPSURE=Y should be the first item in the selection list.
Any incorrect path name will also cause jobs to fail with a 99, but your issue is exceeding the 4 streams.
As there are no NEW_STREAM statements to split the jobs up, it may be better to use more than one policy and use the Maximum Jobs per Client setting to control how many can run at any one time - or just try adding this setting anyway to see if it helps (Master Server Host Properties - Client Attributes section).
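The Client Attributes cap can also be set from the command line with bpclient. A minimal sketch, assuming the standard install path and using the NDMP host name from the policy above; the job limit of 4 is an example matching the filer's stream limit:

```shell
# Sketch: cap concurrent jobs for the NDMP host via Client Attributes
# (same effect as Master Server Host Properties > Client Attributes).
# "denpclrdm2b" is the NDMP host from the policy shown earlier.
/usr/openv/netbackup/bin/admincmd/bpclient -client denpclrdm2b -add
/usr/openv/netbackup/bin/admincmd/bpclient -client denpclrdm2b -update -max_jobs 4
# Verify the client attribute entry:
/usr/openv/netbackup/bin/admincmd/bpclient -client denpclrdm2b -L
```

Unlike Max Jobs/Policy, this limit applies across all policies for that client, so jobs beyond the cap queue instead of being submitted to the filer.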
09-03-2013 08:26 AM
Hmm - well, this is valid, but only in specific circumstances. Let me quote EMC Article Number 000073016, Version 1:
"Issue
Cannot increase concurrent NDMP backup streams from the default value of four streams
Trying to change parameter to support eight concurrent backup streams:
# server_param server_2 -facility NDMP -modify concurrentDataStreams -value 8
server_2 :
Error 4418: server_2 : 8 is not in range (1,4)
Environment
Product: Celerra
EMC SW: NAS Code 6.0.36.4 and later
Feature: NDMP Backup
Cause
Error 4418 message is reported when trying to change from four to eight concurrent NDMP backup streams.
The error is expected and is normal if the Data Mover has less than 8 GB of physical memory.
Change
Trying to change from four to eight streams
Resolution
The ability to raise the number of concurrent NDMP backup streams from four to eight was introduced with the 6.0 family release.
But, the Data Mover hardware must have 8 GB or more of memory to support more than four backup streams. There is no workaround."