Forum Discussion

Alexis_Jeldrez
7 years ago

Troubleshooting MSDP with Disk Pool regularly going down

Master is the Storage Server: NetBackup 8.0 with Windows 2008 R2 SP1

Disk used for MSDP disk pool: External storage connected to the Windows server through a dedicated HBA. Uses a SAN switch and Microsoft MPIO.

Our PureDisk Disk Pool goes down a couple of times a month, during backup windows, after weeks of uptime and normal performance. This makes the NetBackup backup jobs directed to the associated MSDP storage unit fail with status 2074. The Disk Pool refuses to come back up even after restarting NB services, but a Windows reboot brings everything back online. The problem started a year and a half ago, before the upgrade to NetBackup 8.0.

Since I haven't been able to catch the incident itself, I tried replicating it (with NB services down) by disconnecting the external storage. I got logs in the FC switch and MPIO errors in Windows Event Viewer, neither of which appeared on the occasions when NetBackup marked the disk pool down. The storage itself has zero errors and is unaware of any problem. Therefore, my theory is that the disk has always been online and something is happening in the software.

So far the best NetBackup logs I have are from Disk Reports \ Disk Logs, where the following lines appear right before all the backup jobs start failing with status 2074:

Volume <Disk Pool>:PureDiskVolume monitored by <Storage Server> is down
Volume <Storage Server>:<Disk Pool>:PureDiskVolume marked down

I tried looking in <MSDP-path>\log\spoold\spoold.log but couldn't find the reason why the Disk Pool was marked down in the first place. Which logs should I be looking at, and how should I configure their verbosity if required?
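
To line up these events with job failures, an exported copy of the Disk Logs can be scanned for the two messages quoted above. This is only a minimal sketch: the patterns are taken from those two lines, the wording may differ in other NetBackup versions, and the names in `sample` (`pool1`, `master1`) are hypothetical placeholders.

```python
import re

# Patterns based on the two Disk Logs messages quoted in the post.
# Assumption: the wording is stable; adjust if your version differs.
DOWN_PATTERNS = [
    re.compile(r"Volume (?P<volume>\S+) monitored by (?P<server>\S+) is down"),
    re.compile(r"Volume (?P<volume>\S+) marked down"),
]


def find_down_events(lines):
    """Return (line_number, volume) for every line reporting a volume down."""
    events = []
    for number, line in enumerate(lines, start=1):
        for pattern in DOWN_PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((number, match.group("volume")))
                break  # one event per line is enough
    return events


# Hypothetical sample input; real exports may carry timestamps as well.
sample = [
    "Volume pool1:PureDiskVolume monitored by master1 is down",
    "some unrelated line",
    "Volume master1:pool1:PureDiskVolume marked down",
]
print(find_down_events(sample))
```

The line numbers returned can then be matched against the timestamps in the same export to see what was running when the volume was marked down.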

10 Replies

  • Hi,

    You can try configuring or modifying these touch files. The issue is not on the media server itself or on the storage. It is the master server that can't communicate with, or receive the status from, the media server across the LAN. When it can't get an update, it believes the MSDP is down and then marks it down in the EMM.

     

    https://www.veritas.com/support/en_US/article.100007548

      Nicolai
      Moderator

      I am with Riaan: the two touch files are easy to set (and remove) and may actually fix the issue.

      You will get the disk down error when the Disk Polling Service (DPS) fails to see the disk in the up state. The two touch files instruct the DPS service to be a bit more relaxed.

      Alexis_Jeldrez
      Level 6

      Sorry for not replying sooner, I couldn't check the platform until today.

      Thanks for the information. I hadn't tried CR_STATS_TIMER = 300 in the pd.conf file before. I'll check what happens now.

      I had already implemented the touch files and used the 1800 seconds configuration. The result was that the monitoring waited 30 minutes before declaring the disk pool down (previously it waited only a couple of minutes).

        Alexis_Jeldrez
        Level 6

        Thanks for replying.

        RiaanBadenhorst

        I tried everything in https://www.veritas.com/support/en_US/article.100007548 including CR_STATS_TIMER = 300 in the pd.conf file, and rebooted the OS: the problem persisted. I'm reverting the changes (deleting the three DPS_PROXY files). I checked the nbrmms and dps logs with vxlogview, as well as spoold.log and spad.log, but they just stop logging when the Disk Volume goes down.

        Marianne

        Most of the conversation in that thread is about a shared platform; I have physical access to the one I'm troubleshooting and I can confirm it's 100% physical and dedicated. RAM doesn't seem to be a problem: usage never goes beyond 60%.

        As for the speed of the storage, I tried https://www.veritas.com/support/en_US/article.000095782 and got these results:

        • Write speed: 239.2 MB/sec
        • Read with a 64k buffer: 139.4 MB/sec
        • Read with a 1024k buffer: 249.2 MB/sec
        • Read with a 4096k buffer: 300.6 MB/sec

        The contentrouter.cfg reads the following, which seems okay:

         

        ; This parameter determines the data store read buffer size, in Bytes. The default value
        ; is 33554432 (32MB)
        ; @restart
        ; @validate [0-9]+
        ReadBufferSize=4194304
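
A quick way to compare the configured value with the default quoted in the comment is to convert both to MiB. This is a minimal sketch that parses only the fragment shown above (the comment text in `fragment` is abbreviated); the helper name `parse_setting` is made up for illustration.

```python
DEFAULT_READ_BUFFER = 33554432  # default stated in the contentrouter.cfg comment


def parse_setting(text, key):
    """Return the integer value of `key` from cfg-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and ';' comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return int(value.strip())
    return None


# Abbreviated copy of the fragment quoted in the post.
fragment = """
; This parameter determines the data store read buffer size, in Bytes.
; @restart
ReadBufferSize=4194304
"""

configured = parse_setting(fragment, "ReadBufferSize")
print(configured // 2**20, "MiB configured")
print(DEFAULT_READ_BUFFER // 2**20, "MiB default")
```

So the configured buffer is 4 MiB, one eighth of the 32 MB default mentioned in the comment, and the same size as the 4096k read test above. Whether that difference has anything to do with the volume going down is not established here.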

         

        I tried testing the reading speed with a buffer of 8MB but got an error, which is strange because I should have 15GB of RAM in standby.

         

        E:\test>nbperfchk -i E:\TEST\file.test -bs 8192k -o NUL
             800 MB @ 266.7 MB/sec,      792 MB @ 264.0 MB/sec
            1728 MB @ 288.0 MB/sec,      928 MB @ 309.3 MB/sec
        input: Invalid access to memory location.

         

        I'm still checking https://www.veritas.com/support/en_US/article.100037977 , I've already confirmed:

        • Windows 2008 R2 has Service Pack 1 installed
        • Block Size of the NTFS filesystem is 64K
        • Windows disk indexing disabled
        • MSDP path is excluded from antivirus
        • NetBackup policies don't use compression nor encryption

        I'll keep trying the other items. In the meantime, I'll just have to reduce the number of duplications from MSDP to tape on weekends again.

      Alexis_Jeldrez
      Level 6

      Marianne

      Thanks for your assistance. The NetBackup server (everything NB-related is centralized in one single physical server) has 32 GB of RAM. I understand that should be enough for deduplicating backups: the sum of all the full backups made during a weekend is about 15 TB.

      All backups go through an SLP that first backs up to the MSDP disk pool and then duplicates everything to tape. I reduced the number of concurrent jobs the disk storage unit can handle to 30 and moved the SLP duplication schedule so backups and duplications don't coincide. We didn't observe any improvement.

      We monitored the performance and noticed that RAM use stays at about 55% (18 GB used of the total 32 GB), and CPU usage also sits at about half. When the Disk Pool goes down, CPU usage drops to 30% or lower, but I think that's a consequence of jobs no longer running. RAM usage stays the same. Attached: picture of the performance monitor, showing the CPU usage going down after the Disk Pool goes down.

      I'll have to check the TN thoroughly; I'm not sure of the impact of changing NTFS settings on existing data. I can confirm we already checked that the MSDP drive is excluded from the antivirus and that filesystem indexing is disabled. MSDP is on the Master server; I know this isn't recommended, but the performance is good.

        Alexis_Jeldrez
        Level 6

        We avoided problems for 4 months because we reduced the number of duplications (SLP) from MSDP to tape during the night. We had to revert that change recently and the problem returned: Status 2074.

        Following https://www.veritas.com/support/en_US/article.000008526 I realized that the disk pool is up but the NetBackup disk volume is down (pic related). I could try to bring it up, but I really want to know why it went down and what I can do to keep this from happening again. It does look like something is at its limit: the problem that used to occur only every three months is now happening every weekend.
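
For reference, checking the volume state and bringing it back up is normally done with the `nbdevquery` and `nbdevconfig` admin commands. The sketch below only builds the command lines and does not run anything. It assumes the standard option names (`-listdv`, `-changestate`, `-stype`, `-dp`, `-dv`, `-state`), which should be checked against the article and the commands reference for your version; `my_disk_pool` is a hypothetical placeholder for the real disk pool name.

```python
def list_volumes_cmd(disk_pool):
    """Command line to show the state of the volumes in an MSDP disk pool."""
    return ["nbdevquery", "-listdv", "-stype", "PureDisk", "-dp", disk_pool, "-U"]


def set_volume_state_cmd(disk_pool, state, volume="PureDiskVolume"):
    """Command line to mark a disk volume UP or DOWN."""
    if state not in ("UP", "DOWN"):
        raise ValueError("state must be 'UP' or 'DOWN'")
    return [
        "nbdevconfig", "-changestate", "-stype", "PureDisk",
        "-dp", disk_pool, "-dv", volume, "-state", state,
    ]


# 'my_disk_pool' is a placeholder, not a name taken from this thread.
print(" ".join(list_volumes_cmd("my_disk_pool")))
print(" ".join(set_volume_state_cmd("my_disk_pool", "UP")))
```

Bringing the volume up this way only clears the symptom; it does not explain why the volume was marked down.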