06-13-2013 09:15 PM
NetBackup v. 7.5.0.5
I have a number of NDMP policies configured to back up NAS filesystems on an EMC Celerra. The policies are configured with Max Jobs/Policy set to 4. And 4 jobs *will* start when the policy is kicked off... but the other backup selections fail with a status 99.
The EMC jobs show the following:
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) < Backup type: TAR >
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: TYPE Value: TAR
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: FILESYSTEM Value: /swint01p
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: PREFIX Value: /swint01p
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: LEVEL Value: 9
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: HIST Value: Y
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: UPDATE Value: y
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: OPTION Value: NT
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) Name: SNAPSURE Value: Y
2013-06-13 20:21:08: NDMP: 3: Thread ndmp431 < Max allowed concurrent backup checkpoint is exceeded (count=5), try later >
2013-06-13 20:21:08: NDMP: 3: < LOG type: 2, msg_id: 0, entry: SnapSure file system creation fails, hasAssociatedMsg: 0, associatedMsgSeq: 0 >
2013-06-13 20:21:08: NDMP: 4: Session 431 (thread ndmp431) NdmpdData::startBackup, SnapSure creation for backup failed
Now my impression is that the policy should only start 4 jobs and the rest should queue up. Instead of queueing, they are all failing with status 99...
Policy information:
------------------------------------------------------------
Policy Name:       NDMP-denpclrdm2-PRD01
  Policy Type:         NDMP
  Active:              yes
  Effective date:      02/03/1991 12:25:54
  Mult. Data Streams:  yes
  Client Encrypt:      no
  Checkpoint:          no
  Policy Priority:     1
  Max Jobs/Policy:     4
  Disaster Recovery:   0
  Collect BMR info:    no
  Residence:           DEN-DENSYM_ADV_PRD01-stu_disk_densyma01p-1y
  Volume Pool:         NetBackup
  Server Group:        *ANY*
  Keyword:             Production
  Data Classification:       -
  Residence is Storage Lifecycle Policy:    yes
  Application Discovery:      no
  Discovery Lifetime:      0 seconds
  ASC Application and attributes: (none defined)
  Granular Restore Info:  no
  Ignore Client Direct:  no
  Enable Metadata Indexing:  no
  Index server name:  NULL
  Use Accelerator:  no
  HW/OS/Client:  NDMP          NDMP          denpclrdm2b
  Include:  SET DIRECT=Y
            SET HIST=Y
            SET OPTION=NT
            SET TYPE = TAR
            /ccstg01_ckpt0
            /ccstg02_ckpt0
            /ccstg03_ckpt0
            /ccstg05_ckpt0
            /svn02p_ckpt0
            SET SNAPSURE=Y
            /amm02p
            /aspen02p
            /docm02p
            /dtp02p
            /dtp03p
            /ecnt03p
            /fax01p
            /ora10p
            /scm01p
            /svn03p
            /swmil01p
            /swwork01p
            /swint01p
  Schedule:              full
    Type:                Full Backup
    Frequency:           every 25 days
    Maximum MPX:         1
    Synthetic:           0
    Checksum Change Detection: 0
    PFI Recovery:        0
    Retention Level:     0 (1 week)
    Number Copies:       1
    Fail on Error:       0
    Residence:           (specific storage unit not required)
    Volume Pool:         (same as policy volume pool)
    Server Group:        (same as specified for policy)
    Residence is Storage Lifecycle Policy:         0
    Schedule indexing:   0
    Daily Windows:
          Monday     19:00:00  -->  Tuesday    08:00:00
  Schedule:              incr
    Type:                Differential Incremental Backup
    Frequency:           every 1 day
    Maximum MPX:         1
    Synthetic:           0
    Checksum Change Detection: 0
    PFI Recovery:        0
    Retention Level:     0 (1 week)
    Number Copies:       1
    Fail on Error:       0
    Residence:           (specific storage unit not required)
    Volume Pool:         (same as policy volume pool)
    Server Group:        (same as specified for policy)
    Residence is Storage Lifecycle Policy:         0
    Schedule indexing:   0
    Daily Windows:
          Sunday     17:15:00  -->  Monday     08:00:00
          Monday     17:15:00  -->  Tuesday    08:00:00
          Tuesday    17:15:00  -->  Wednesday  08:00:00
          Wednesday  17:15:00  -->  Thursday   08:00:00
          Thursday   17:15:00  -->  Friday     08:00:00
          Friday     17:15:00  -->  Saturday   08:00:00
          Saturday   17:15:00  -->  Sunday     08:00:00
Anyone have any ideas?
06-13-2013 09:17 PM
Master is Solaris 10
Media servers are all Symantec 5220's running 2.5.2 (7.5.0.5)
06-14-2013 01:48 AM
The policy looks OK to me to be honest - maybe it is the SET SNAPSURE=Y statement in the middle of it that causes the issue here.
Would it be possible to split this into 2 policies (and limit the jobs per client to 4) to see if that resolves it for you?
06-14-2013 05:56 AM
Technically I could split these up. Might be a good test. I have other NDMP policies which queue up jobs just fine... that's what is perplexing about these status 99 failures. Why do some NDMP jobs queue up while jobs from other policies immediately fail with a status 99?
06-14-2013 12:59 PM
The SET directives must come before the backup paths.
How did it work when you moved SET SNAPSURE=Y up?
06-14-2013 01:27 PM
The paths above the SNAPSURE directive are already snapshots... don't want the Celerra taking a snapshot of them. That's why the directive is where it is.
I have other NDMP policies with the SNAPSURE directive above all the paths and they have "Status 99" failures as well.
In addition - I have NDMP policies configured the same way as the ones that fail... that don't fail. It's strange.
06-14-2013 01:43 PM
I have never seen it done that way and our manual says SET directives must be first in the list, followed by the file systems or volumes to be backed up.
No explanation for what is happening in the successful jobs but the above is what is supported.
Be that as it may, the "SnapSure creation for backup failed" is coming from the filer.
Set debug logging per this article:
http://www.symantec.com/business/support/index?page=content&id=TECH150646
Re-create the error, pull the server_log and let's see if more information is available.
06-14-2013 10:50 PM
There can be several reasons for a status 99 on NDMP backups.
So the very first check: is the failed backup incremental or full?
Did you try changing the NDMP buffer size? Keep it below 2097152; you can try the values 262144 (256 KB) or 524288 (512 KB).
Create the file SIZE_DATA_BUFFERS_NDMP if it is not present and enter the value in it.
Path: /usr/openv/netbackup/db/config/
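For reference, creating that touch file is just a one-line write on the master/media server. This is only a sketch: it assumes the standard /usr/openv install path and uses 262144 as an example value (heed the cautions about buffer sizes before changing this in production).

```shell
# Sketch only: set the NDMP data buffer size to 256 KB (262144 bytes).
# Assumes the default NetBackup install path; adjust for your environment.
CONFIG_DIR=/usr/openv/netbackup/db/config
echo 262144 > "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
# Verify the value that was written:
cat "$CONFIG_DIR/SIZE_DATA_BUFFERS_NDMP"
```

The file contains a single number; remove the file to revert to the default buffer size.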
06-17-2013 12:56 AM
Hello. In my experience, status 99s come from bad file names: upper case versus lower case, and so forth.
I would not play with buffer sizes as you may end up not being able to restore data you have already backed up via NDMP.
I would set up logging either on the filer or via the legacy ndmpagent log. This will give you the exact cause of the status 99.
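For the NetBackup side, legacy logging is enabled simply by creating the log directory on the media server handling the NDMP backup. A minimal sketch, assuming the standard install path (the verbosity value is an example, not a recommendation):

```shell
# Sketch: enable legacy ndmpagent logging on the NetBackup media server.
# Legacy logs are activated by the existence of the per-process directory.
mkdir -p /usr/openv/netbackup/logs/ndmpagent
chmod 777 /usr/openv/netbackup/logs/ndmpagent   # or a tighter mode per site policy
# Optionally raise general verbosity in bp.conf (example value shown):
# echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf
```

Re-run a failing backup afterwards and check the dated log file created under that directory.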
Splitting the policies up by directive also looks like a good idea. Keep it simple; otherwise, someday someone will change the directives without realizing you had set them up in a special way. Failing that, thoroughly test and document what you have done, and make sure you re-test when upgrading the filers.
09-03-2013 05:22 AM
Two things:
Please check this on your Data Mover - maybe increasing the number of concurrent data streams would help:
[nasadmin@nas_host ~]$ server_param server_2 -facility NDMP -list
server_2 :
param_name facility default current configured
maxProtocolVersion NDMP 4 4
scsiReserve NDMP 0 0
CDBFsinfoBufSizeInKB NDMP 1024 1024
bufsz NDMP 128 256 256
convDialect NDMP 8859-1 8859-1
concurrentDataStreams NDMP 4 4
portRange NDMP 1024-65535 1024-65535
includeCkptFs NDMP 1 1
md5 NDMP 1 1
snapTimeout NDMP 5 5
dialect NDMP
forceRecursiveForNonDAR NDMP 0 0
tapeSilveringStr NDMP ts ts
excludeSvtlFs NDMP 1 1
snapsure NDMP 0 0
v4OldTapeCompatible NDMP 1 1
To change this, use the following command, run from the Control Station of your NAS device:
server_param server_2 -facility NDMP -modify concurrentDataStreams -value 8
Of course this does not explain why you are seeing the 99 exit code - but I would use it as a workaround.
The other thing, which I think is more important: your policy has two schedules:
"
Schedule: full Type: Full Backup Frequency: every 25 days Maximum MPX: 1 Synthetic: 0 Checksum Change Detection: 0 PFI Recovery: 0 Retention Level: 0 (1 week)
"
"
Schedule: incr Type: Differential Incremental Backup Frequency: every 1 day Maximum MPX: 1 Synthetic: 0 Checksum Change Detection: 0 PFI Recovery: 0 Retention Level: 0 (1 week)
"
Please check the above and amend: with such an infrequent full backup, I would set the full schedule's retention to about 51 days, so you always have at least 2 full copies - but to be more careful I would set it to 100 days... or 3 months.
My take...
09-03-2013 07:26 AM
I understood EMC Celerra had a hard limit of 4 NDMP threads. Have not heard that this had changed.
09-03-2013 08:23 AM
I have re-read this all again, and it looks like all of the jobs are kicking in instead of the others queueing once 4 have started, and the filer itself is causing the others to fail.
A check through my docs shows that SET SNAPSURE=Y should be the first item in the selection list.
Any incorrect path name will also cause jobs to fail with a 99, but your issue is exceeding the 4 streams.
As there are no NEW_STREAM statements to split the jobs up, it may be better to use more than one policy and use the Maximum Jobs per Client setting to control how many can run at any one time - or just try adding this setting anyway to see if it helps (Master Server Host Properties - Client Attributes section).
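The Client Attributes cap can also be set from the command line with bpclient. A minimal sketch, assuming the standard install path and using the NDMP host name from the policy above; the job limit of 4 is an example matching the filer's stream limit:

```shell
# Sketch: cap concurrent jobs for the NDMP host via Client Attributes
# (same effect as Master Server Host Properties > Client Attributes).
# "denpclrdm2b" is the NDMP host from the policy shown earlier.
/usr/openv/netbackup/bin/admincmd/bpclient -client denpclrdm2b -add
/usr/openv/netbackup/bin/admincmd/bpclient -client denpclrdm2b -update -max_jobs 4
# Verify the client attribute entry:
/usr/openv/netbackup/bin/admincmd/bpclient -client denpclrdm2b -L
```

Unlike Max Jobs/Policy, this limit applies across all policies for that client, so jobs beyond the cap queue instead of being submitted to the filer.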
09-03-2013 08:26 AM
Hmm - well, this is valid, but only in specific circumstances. Let me quote EMC Article Number 000073016, Version 1:
"Issue
Cannot increase concurrent NDMP backup streams from the default value of four streams
Trying to change parameter to support eight concurrent backup streams:
# server_param server_2 -facility NDMP -modify concurrentDataStreams -value 8
server_2 :
Error 4418: server_2 : 8 is not in range (1,4)
Environment
Product: Celerra
EMC SW: NAS Code 6.0.36.4 and later
Feature: NDMP Backup
Cause
Error 4418 message is reported when trying to change from four to eight concurrent NDMP backup streams.
The error is expected and is normal if the Data Mover has less than 8 GB of physical memory.
Change
Trying to change from four to eight streams
Resolution
The ability to raise the number of concurrent NDMP backup streams from four to eight was introduced with the 6.0 family release.
But, the Data Mover hardware must have 8 GB or more of memory to support more than four backup streams. There is no workaround."