Forum Discussion

elanmbx
Level 6
10 years ago

Disk space (MSDP) discrepancies between master and media server appliance

So I have a discrepancy on one of my media server appliances (5220) which is causing me some headaches: it is reporting "disk full" during my backup window and failing some backups and duplications.  The strange thing is that the media server itself shows the MSDP pool as only 89% full, while the Java console and OpsCenter show it as 96% full:

[Screenshot: Screen3.jpg]

densyma02p.Storage> Show Partition
- [Info] Performing sanity check on disks and partitions... (5 mins approx)
-----------------------------------------------------------------------
Partition     | Total      | Available  | Used       | %Used | Status
-----------------------------------------------------------------------
AdvancedDisk  |   11.21 TB |    4.93 TB |    6.28 TB |    57 | Optimal
Configuration |      25 GB |   18.31 GB |    6.69 GB |    27 | Optimal
MSDP          |      63 TB |    7.22 TB |   55.78 TB |    89 | Optimal
MSDP Catalog  |       1 TB |  1010.2 GB |   13.78 GB |     2 | Optimal
Unallocated   |  251.98 GB |       -    |       -    |    -  | -      

[Screenshots: Screen1.jpg, Screen2.jpg]

I'm half-inclined to restart the services on the master... but wanted to know: has anyone else run into this issue?


9 Replies

  • A bit of environment info:

    Solaris master - 7.6.1

    5220 Media Servers - 2.6.1

    All systems were upgraded to 7.6.1/2.6.1 within the last week.

  • You have a couple of possibilities here. I have run into this in the past with large database backups: the system prechecks the storage available on the target disk before the job runs. Remember, the system does not know how much the data stream will deduplicate, so if a backup needs 10 TB and the pool reports only 7 TB available, the job will fail.
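
    A rough sketch of that precheck idea, purely illustrative (this is not NetBackup's actual code, and the 10 TB / 7 TB figures are the hypothetical ones from above):

    estimated_backup_tb=10   # front-end size of the stream; dedupe ratio is unknown at precheck time
    available_tb=7           # what the pool reports as free
    if [ "$estimated_backup_tb" -gt "$available_tb" ]; then
        echo "Precheck fails: need ${estimated_backup_tb} TB, only ${available_tb} TB free"
    fi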

    Now, for the discrepancies, run:

    /usr/openv/pdde/pdcr/bin/crcontrol --dsstat

    and see what the system reports. My suspicion is that the media server reports the raw filesystem space to the CLISH and GUI, but reports the pool's usable space, after the deduplication engine's own accounting, to the master server.

  • So here is the dsstat:

    densyma02p:/msdp/data/dp1/pdvol/history # /usr/openv/pdde/pdcr/bin/crcontrol --dsstat
    
    ************ Data Store statistics ************
    Data storage      Raw    Size   Used   Avail  Use%
                      63.0T  58.1T  55.9T   2.2T  97%
    
    Number of containers             : 350838
    Average container size           : 166807262 bytes (159.08MB)
    Space allocated for containers   : 58522326196922 bytes (53.23TB)
    Reserved space                   : 5458315213824 bytes (4.96TB)
    Reserved space percentage        : 7.9%

    This is obviously where it is getting its "shut it down!" information.  Seems like it is reserving an *awful* lot of space in the pool...

    For reference - I have two other identical 5220s, and each of them reserves only 4% (a quick arithmetic check of both appliances' percentages follows the output below):

    densyma03p:/msdp/data/dp1/pdvol/log/convert # /usr/openv/pdde/pdcr/bin/crcontrol --dsstat
    
    ************ Data Store statistics ************
    Data storage      Raw    Size   Used   Avail  Use%
                      62.9T  60.4T  49.4T  10.9T  82%
    
    Number of containers             : 407267
    Average container size           : 131225549 bytes (125.15MB)
    Space allocated for containers   : 53443835834010 bytes (48.61TB)
    Reserved space                   : 2771037736960 bytes (2.52TB)
    Reserved space percentage        : 4.0%
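
    A quick recomputation of those percentages (all values copied from the outputs above) is consistent with the suspicion earlier in the thread: the CLISH appears to compute used/raw, while the Java console and dsstat compute used against the post-reserve size:

    awk 'BEGIN {
        printf "CLISH view:  %.0f%%\n", 55.78 / 63.0 * 100                            # -> 89%
        printf "dsstat view: %.0f%%\n", 55.9 / 58.1 * 100                             # -> 96%
        printf "densyma02p reserved: %.1f%%\n", 5458315213824 / (63.0 * 1024^4) * 100 # -> 7.9%
        printf "densyma03p reserved: %.1f%%\n", 2771037736960 / (62.9 * 1024^4) * 100 # -> 4.0%
    }'

    (dsstat's 97% is likely the same figure with its own rounding of the terabyte values.)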

  • This continues to be a significant issue.  It appears that the MSDP is not allowing space to be reclaimed.  I have moved policies off this appliance since the MSDP is currently 100% full.  I continue to expire images yet get NOTHING back.

    Support has been working on it for over a week now with no results.

    Will update when progress is made.

  • Elanmbx - Are you using that MSDP as a target for replication? I had a scenario where a backlog of SLPs continued to copy data to a target disk pool that was full, and as we manually expired images (and ran garbage collection - see the sketch after this reply, if you are not already doing so) the disk pool would fill right back up.

    If that is not the case, please update us when you have a resolution.
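
    For reference, a sketch of manually driving MSDP space reclamation with crcontrol (same binary path as used elsewhere in this thread; expired space only comes back once the transaction queue has been processed):

    /usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo   # is a queue run already in progress?
    /usr/openv/pdde/pdcr/bin/crcontrol --processqueue       # process the transaction queue
    # commonly run a second time after the first pass completes, then re-check:
    /usr/openv/pdde/pdcr/bin/crcontrol --dsstat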

  • Interesting - we *did* have replications running to this MSDP from a remote site.  I turned those off a little while ago - but they would have been running to this appliance during this issue.  I wonder if there is a bunch of replicated data orphaned on the MSDP pool...

  • SLPs (if not manually cancelled) will run until they are completed. Have you seen any replication failures or a backlog of replication jobs in your Activity Monitor?
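
    If it helps, here is how I would check for (and clear) an SLP backlog from the master - assuming a Unix master where nbstlutil lives under /usr/openv/netbackup/bin/admincmd; the lifecycle name below is a placeholder:

    # list images whose SLP copies are still incomplete (the backlog)
    /usr/openv/netbackup/bin/admincmd/nbstlutil stlilist -image_incomplete -U
    # cancel pending SLP work for one lifecycle (placeholder name)
    /usr/openv/netbackup/bin/admincmd/nbstlutil cancel -lifecycle MY_SLP_NAME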

  • I manually cancelled all of the SLPs from my remote appliance that was replicating to this one.  At this point I have the vast majority of the images expired from this appliance and it will NOT release any space in the MSDP - it is still over 94% full as reported by NetBackup.

    I have a case and am working with support to get to the bottom of this...

  • Root cause: a bug in the 7.6.1 install left a VxFS storage checkpoint behind after the upgrade.  This checkpoint gradually grew until the MSDP filesystem was out of space.  Discovered using the command:

    fsckptadm list /msdp/data/dp1/pdvol

    Removed with:

    fsckptadm remove msdp_ckpt /msdp/data/dp1/pdvol

    Restarted services and my MSDP pool was back in business.

    This affected *2* of my 5220 appliances - both had to be repaired with the above procedure.

    Support Case #08311159 just in case anyone needs to reference this issue with Support.
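
    For anyone who lands here with the same symptoms, the whole sequence in one place. The fsckptadm calls are exactly as above; the restart and df check are my additions, and on an appliance you may prefer to restart services from the CLISH rather than the scripts shown (paths are the usual Unix ones and may differ per build):

    fsckptadm list /msdp/data/dp1/pdvol                # look for a leftover checkpoint (msdp_ckpt)
    fsckptadm remove msdp_ckpt /msdp/data/dp1/pdvol    # remove the stale checkpoint
    /usr/openv/netbackup/bin/bp.kill_all               # stop NetBackup services
    /usr/openv/netbackup/bin/bp.start_all              # start them again
    df -h /msdp/data/dp1/pdvol                         # confirm the space was released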