
Data Domain pools filling up, backups failing

Albatross_
Level 5

Hi experts,

Our setup has a Data Domain as destination storage (OS: 5.5.3.1-509919, Model: DD890).

We are facing a serious concern with disk space utilization.

We have configured the high water mark to 95% and the low water mark to 85%.

Full backups kick off on Friday evenings, and some of them fail due to a storage-full issue.

Currently the DD has reached 96.4% disk utilization, but I don't see any duplicated images being deleted. It has been almost 48 hours and we don't see any change in the available-space numbers.

 

Environment:

NB master server 7.7.1

STU: DD890 (primary STU / duplications to tape library)

I think once the high water mark is reached, the duplicated images on the disk should be deleted automatically.

In our case this does not seem to be happening, and it could be one of the reasons for the failures.

Can someone help me out with solving this issue?

 

 

Cheers.

 


16 REPLIES

Marianne
Moderator
Partner    VIP    Accredited Certified
How are the DD and duplication configured in NBU? Basic disk with staging to tape, or OST with backup and duplication controlled by an SLP? If SLP, then expiration is 100% according to the retention set in the SLP. Cleanup will take place according to the DD maintenance cycle.

sdo
Moderator
Partner    VIP    Certified

1) First, prove that the DD is operating normally/correctly.

2) Then prove that the DD is indeed expunging from disk the backups that NetBackup thinks it has expired.

3) Check your retentions, and SLPs.

4) Check the NetBackup logs to prove that the DD is being informed of 'expired' images, which the DD can now expunge.

5) Check the NetBackup logs to prove that image cleanup is working correctly.

.

Start with step 1 above.  There are posts in this forum on how to check that old DD data is expunging.  Search the forum for the answer, or wait for someone to tell you.  Sorry, but I don't know much about DD. 

sdo
Moderator
Partner    VIP    Certified

Check this... and I hope you haven't got a NOexpire file too... surely not?

https://www.veritas.com/community/forums/images-are-not-deleting-disk-even-after-expiration
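A quick way to check from the command line (a minimal sketch; the path is the default touch-file location mentioned later in this thread):

# Look for any NOexpire touch file in the default location:
ls -l /usr/openv/netbackup/bin/*NOexpire* 2>/dev/null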

 

Albatross_
Level 5

Hi Marianne,

Yes, with OST and duplication controlled by SLP.

Marianne
Moderator
Partner    VIP    Accredited Certified
OST and SLP - then my previous post stands. HWM and LWM do not work with SLP and fixed retention.

Check the backup retention in the SLPs. Verify that tape duplications are successful and that the SLP backlog is within limits. Without successful duplication, the retention for disk backups remains infinite.

Then perform the checks suggested by sdo - image cleanup jobs at regular intervals, and bpdm logs on the media servers, which will show nbdelete (a quick grep sketch follows below). This is the command that sends the delete instruction to the DD.

Then over to the DD to check the maintenance/cleanup job. This normally runs once a week, on a Tuesday morning. In very busy environments this is a problem. I have seen a very busy environment where cleanup on the DD ran further and further behind, leaving large amounts of orphaned images on the DD. Needless to say, when it was time to buy more storage, they bought a dedupe appliance from another vendor...
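To see whether the delete instructions mentioned above are actually reaching the DD, something like this on a media server can help (a minimal sketch, assuming legacy bpdm logging is enabled in the default location):

# Search the bpdm legacy logs for nbdelete activity:
grep -i nbdelete /usr/openv/netbackup/logs/bpdm/log.*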

Nicolai
Moderator
Partner    VIP   

I agree with the statement above - furthermore, what is the average deduplication ratio on your Data Domain?

If you back up compressed or encrypted data to a Data Domain, the available space will take a serious plunge.

Finding badly deduplicating data is a cat-and-mouse chase. You never know when someone will suddenly place a large amount of, for example, video files or compressed SQL dumps on a file system you protect.

Albatross_
Level 5

Hi sdo,

 

Yes, I found a file named 'NOexpire-052713' under /usr/openv/netbackup/bin.

Please suggest what to do. I am new to this backup environment and have no idea how or why the NB environment was configured this way.

 

Albatross_
Level 5

Hi Marianne,

Check backup retention in SLPs.

The backup retention is 2 weeks, with duplication to tape retained for 7 years. There are two SLPs with these retention periods. We have another couple of SLPs without duplication where the backup retention is again two weeks.

Verify that tape duplications are successful and that SLP backlog is within limits

How do I check the duplications? I have checked in the Activity Monitor and I see there are some duplication jobs running.

Then perform the checks as suggested by sdo:

There are image cleanup jobs running after every successful backup, and I see a few batches of image cleanup jobs in the Activity Monitor.

I could not find any nbdelete entries in the bpdm logs on any of the media servers.

Yes, the DD cleanup job is running as scheduled, but since it keeps filling up we are cleaning it manually.

sdo
Moderator
Partner    VIP    Certified

The presence of a file named:

/usr/openv/netbackup/bin/NOexpire-052713

...should not affect you, but maybe NetBackup does a partial match on file names.

Personally, I would remove that file. It looks like someone renamed the file several years ago.

.

The way to check whether NetBackup "image cleanup" is requesting expirations is to open the Activity Monitor and look at an "image cleanup" job. Do you see counts of expired images? If so, then NetBackup is expiring images, so move on to the next check in the list.
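If you want a cleanup job to watch, you can also kick one off manually; a minimal sketch, assuming a default install path on the master server:

# Trigger an image cleanup for all clients:
/usr/openv/netbackup/bin/admincmd/bpimage -cleanup -allclients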

Marianne
Moderator
Partner    VIP    Accredited Certified

Check the bpdm logs on a media server after completion of an Image Cleanup job.
Please share one such bpdm log for a media server that should have expired images (as per the Details of the Image Cleanup job).
Please copy the log to bpdm.txt and upload it as a file attachment.

About checking SLP backlog:

Download the SLP Best Practice doc from the Download link in this TN: http://www.veritas.com/docs/000018455 

Look for this section and read through the relevant pages: 

Avoid increasing backlog 
  Monitoring SLP progress and backlog growth

Use 'nbstlutil stlilist -image_incomplete -U' at least once a day to verify that NOT_STARTED images are not older than 24 hours and not increasing.
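A minimal daily check, assuming the -U output format shown further down in this thread (one NOT_STARTED line per outstanding copy):

# Count outstanding SLP copies that have not started yet:
/usr/openv/netbackup/bin/admincmd/nbstlutil stlilist -image_incomplete -U | grep -c NOT_STARTED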

You may want to ask EMC to re-do a sizing exercise to verify that the amount and type of data being backed up can fit onto the DD and be kept for at least 2 weeks.

Also ask EMC for ways to identify data with poor dedupe rates as well as orphaned images.

Albatross_
Level 5

Hi Marianne,

I have executed the command nbstlutil stlilist -image_incomplete -U, and there are around 6000 entries like:

Copy to NYC_TLD1_LTO6 of type DUPLICATE is NOT_STARTED
Copy to NYC_TLD1_LTO6 of type DUPLICATE is NOT_STARTED

Does that mean there are around 6000 images still to be duplicated?

Let me know how to tackle this.

 

 

Marianne
Moderator
Partner    VIP    Accredited Certified

Please go through the SLP Best Practice Guide.

How many backup jobs per day?

The first thing to check is that you have sufficient tape drives and media, and sufficient bandwidth between the DD and the media servers (10Gb should be the minimum requirement...).

You (and your management) will need to make a decision about the backlog - old outstanding duplications will prevent newer backups from being duplicated.

Images not duplicated cannot be expired.

 

Albatross_
Level 5

Hi Marianne,

Around 1500+ backup jobs run per day.

We have configured SLPs (full backups to the DD pool, with duplication to tape).

We have around 13 tape drives and more than enough tapes.

I believe we also have sufficient bandwidth between DD and media servers.

Is there any way to duplicate the older images (using some kind of script)?

I am new to this project, there is not enough documentation, and the ironic part is that the person who worked on and configured NB has left the project.

I am really worried and sad.

 

 

Marianne
Moderator
Partner    VIP    Accredited Certified

You can make a list of the image IDs with timestamps older than 2 weeks (more or less 1452816000 and below).
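For example, something like the following could build that list. This is a rough sketch: the date range and file name are illustrative, and it assumes the backup ID is the last field of bpimagelist's -idonly output - verify the column layout in your environment before trusting the file:

# List backup IDs for images created before the 2-week cutoff (1452816000 = 15 Jan 2016 00:00 UTC)
# and save them to a bidfile for later use:
/usr/openv/netbackup/bin/admincmd/bpimagelist -idonly \
  -d "01/01/2016 00:00:00" -e "01/15/2016 00:00:00" | awk '{print $NF}' > /tmp/old_images.txt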

Your management will need to make a decision here - if you cancel the SLPs, the normal expiration date will be applied and images on the DD will expire without being duplicated to tape.

You can delay the older images by setting them to Inactive. This will give newer images a chance to be submitted.
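A minimal sketch of deactivating one image (the backup ID is hypothetical; loop over your list for bulk changes):

# Suspend SLP processing for an old image so newer duplications can run:
/usr/openv/netbackup/bin/admincmd/nbstlutil inactive -backupid client01_1452816000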

Newer timestamps can be cancelled and then manually duplicated using bpduplicate with a -Bidfile that contains the image list.
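Something like this, per image and then for the whole list (the backup ID and file name are hypothetical; the destination storage unit name is taken from the nbstlutil output earlier in this thread - substitute your own):

# Release a newer image from SLP control:
/usr/openv/netbackup/bin/admincmd/nbstlutil cancel -backupid client01_1453420800
# Then duplicate everything in the bidfile to the tape storage unit:
/usr/openv/netbackup/bin/admincmd/bpduplicate -Bidfile /tmp/newer_images.txt -dstunit NYC_TLD1_LTO6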

Again - care must be taken that the images can be duplicated before the 2-week expiration is reached.

We cannot make this decision for you, but as long as you have a backlog that is bigger than the number of backups per day, image expiration and cleanup are not going to happen.

You can only manually duplicate images if they are no longer under SLP control.

The problem with a backlog is that it becomes more and more difficult to catch up.
Every time a duplication is attempted and does not complete successfully, the retry interval is pushed out further and further.
SLP will always run oldest outstanding jobs before newer ones.

The only way around this is to cancel older jobs or set them to inactive....

PLEASE PLEASE PLEASE read through the Best Practice Guide....

Albatross_
Level 5

Thanks a lot Marianne,

I will check with my manager and get back as soon as possible.

 

Thanks

Nicolai
Moderator
Partner    VIP   

What is the duplication speed to tape?

Data Domain has something called "locality". Locality is how Data Domain places data on its file systems: with good locality, restore/duplication speed is good; with bad locality, restore/duplication speed can drop down to 50MB/sec. Bad locality is usually seen on data with a high change rate.

Using a 50G fragment size on tape (take a look in the storage unit configuration):

6:30 per fragment = 136MB/sec

10:00 per fragment = 83MB/sec

20:00 per fragment = 40MB/sec

40:00 per fragment = 20MB/sec.

From an Activity Monitor job of type duplication: speed = 136MB/sec, because 50GB was duplicated in 6 minutes and 30 seconds.

02/02/2016 08:00:00 - begin reading
02/02/2016 08:01:45 - Info bptm (pid=27316) waited for full buffer 202 times, delayed 344 times
02/02/2016 08:06:30 - end reading; read time: 0:06:30
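The arithmetic behind that kind of figure, as a quick sanity check (fragment size and read time come straight from the log above; the exact MB/sec depends on how the fragment size is rounded):

# 50GB fragment = 50 * 1024 = 51200 MB; read time 6:30 = 390 seconds
echo $(( 50 * 1024 / 390 ))   # prints 131, i.e. in the same ballpark as the figure quoted above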

Also check on the media server that SIZE_DATA_BUFFERS/NUMBER_DATA_BUFFERS are configured correctly. SIZE_DATA_BUFFERS should be 262144 and NUMBER_DATA_BUFFERS should be 256 or higher.
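A minimal sketch of setting those touch files, assuming a default install path on the media server (new jobs pick up the values when they start):

# Tune tape I/O buffers on the media server:
echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
echo 256 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS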

http://www.veritas.com/docs/000016306

http://www.veritas.com/docs/000004792