Solved: Troubleshooting MSDP with Disk Pool regularly going down

Alexis_Jeldrez · ‎06-04-2018

Master is the Storage Server: NetBackup 8.0 with Windows 2008 R2 SP1

Disk used for MSDP disk pool: External storage connected to the Windows server through a dedicated HBA. Uses a SAN switch and Microsoft MPIO.

Our PureDisk Disk Pool goes down a couple of times a month, during backup windows, after weeks of uptime and normal performance. This makes the NetBackup backup jobs directed to the associated MSDP storage unit fail with status 2074. The Disk Pool refuses to come back up even after restarting the NB services, but a Windows reboot brings everything back online. The problem started a year and a half ago, before the upgrade to NetBackup 8.0.

Since I haven't been able to catch the incident itself, I tried replicating it (with NB services down) by disconnecting the external storage: I got logs in the FC switch and MPIO errors in Windows Event Viewer, neither of which had appeared when NetBackup marked the disk pool down. The storage itself reports zero errors and is unaware of any problem. Therefore, my theory is that the disk has always been online and something is happening in the software.

Until now the best NetBackup logs I have are from Disk Reports \ Disk Logs, where the following lines are produced right before all the backup jobs start failing with status 2074:

Volume <Disk Pool>:PureDiskVolume monitored by <Storage Server> is down
Volume <Storage Server>:<Disk Pool>:PureDiskVolume marked down
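
For anyone trying to pin down exactly when the volume was marked down, the two messages above can be pulled out of an exported Disk Logs report with a short script. This is only a minimal sketch: it assumes the report has been saved as plain text with one message per line, and the function name `find_down_events` is made up for illustration.

```python
import re

# Patterns built from the two Disk Logs messages quoted above; the
# placeholder fields (<Disk Pool>, <Storage Server>) become capture groups.
MONITORED_DOWN = re.compile(
    r"Volume (?P<pool>[^:]+):PureDiskVolume monitored by (?P<server>\S+) is down"
)
MARKED_DOWN = re.compile(
    r"Volume (?P<server>[^:]+):(?P<pool>[^:]+):PureDiskVolume marked down"
)


def find_down_events(lines):
    """Return (line_number, kind, pool, server) for every matching line."""
    events = []
    for number, line in enumerate(lines, start=1):
        for kind, pattern in (("monitored-down", MONITORED_DOWN),
                              ("marked-down", MARKED_DOWN)):
            match = pattern.search(line)
            if match:
                events.append((number, kind, match["pool"], match["server"]))
                break
    return events
```

The line numbers it returns can then be matched against the timestamps in Windows Event Viewer and the FC switch logs to see whether anything else happened at the same moment.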

I tried looking in <MSDP-path>\log\spoold\spoold.log but couldn't find the reason why the Disk Pool was marked down in the first place. Which logs should I be looking at, and how should I configure their verbosity if required?

Marianne · ‎06-05-2018

@Alexis_Jeldrez

Can you confirm that this master/media server has sufficient memory for Master server load plus Media server processes plus Dedupe (1 to 1.5 GB memory per TB) as well as OS requirements?

Have you checked the TN with requirements for Windows MSDP?
https://www.veritas.com/support/en_US/article.100037977

RiaanBadenhorst · ‎06-05-2018

Hi,

You can try configuring / modifying these touch files. The issue is not on the actual media server, or the storage. It is the master server that can't communicate with, or receive the status from, the media server across the LAN. When it can't get an update, it believes the MSDP is down and then marks it down in the EMM.

https://www.veritas.com/support/en_US/article.100007548

Nicolai · ‎06-06-2018

I am with Riaan - the two touch files are easy to set (and remove) and may actually fix the issue.

You will get the disk down error when the Disk Polling Service fails to see the disk in the up state. The two touch files instruct the DPS service to be a bit more relaxed.

Alexis_Jeldrez · ‎06-08-2018

Sorry for not replying before, I couldn't check the platform until today.

Thanks for the information. I hadn't tried CR_STATS_TIMER = 300 in the pd.conf file before. I'll check what happens now.

I had already implemented the touch files and used the 1800 seconds configuration. The result was that the monitoring waited 30 minutes before declaring the disk pool down (previously it waited only a couple of minutes).
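
For readers who have not set these touch files before, the change can be sketched as below. Treat it as an assumption-laden illustration: the three file names are as I recall them from the Veritas article linked earlier in the thread (replies here mention both "two" and "three" files), the directory they belong in is deliberately left as a parameter, and whether the value needs a trailing newline is not something this sketch settles. Check the article before using any of it.

```python
from pathlib import Path

# Assumed file names, recalled from the Veritas article; verify them there.
TIMEOUT_FILES = ("DPS_PROXYDEFAULTSENDTMO", "DPS_PROXYDEFAULTRECVTMO")
NOEXPIRE_FILE = "DPS_PROXYNOEXPIRE"


def write_dps_touch_files(config_dir, timeout_seconds=1800):
    """Create the touch files in config_dir and return their paths."""
    config_dir = Path(config_dir)
    created = []
    for name in TIMEOUT_FILES:
        path = config_dir / name
        path.write_text(str(timeout_seconds))  # timeout value in seconds
        created.append(path)
    marker = config_dir / NOEXPIRE_FILE
    marker.touch()  # empty marker file, no content
    created.append(marker)
    return created


def remove_dps_touch_files(config_dir):
    """Undo write_dps_touch_files; missing files are ignored."""
    for name in TIMEOUT_FILES + (NOEXPIRE_FILE,):
        (Path(config_dir) / name).unlink(missing_ok=True)
```

Keeping a matching removal function makes it easy to revert, which matters here because the 1800-second timeout only delays the down event rather than preventing it.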

Alexis_Jeldrez · ‎06-08-2018

@Marianne

Thanks for your assistance. The NetBackup server (everything NB-related is centralized on a single physical server) has 32 GB of RAM. I understand that should be enough for deduplicating backups: all the full backups made during a weekend add up to about 15 TB.

All backups go through an SLP, which first backs up to the MSDP disk pool and then duplicates everything to tape. I reduced the number of concurrent jobs the disk storage unit can handle to 30 and moved the SLP duplication schedule so backups and duplications don't coincide. We didn't observe any improvement.

We monitored the performance and noticed that RAM use stays at about 55% (18 GB used of the total 32 GB), and CPU usage also stays at about half. When the Disk Pool goes down, CPU usage drops to 30% or lower, but I think that's a consequence of the jobs no longer running. RAM usage stays the same. Attached: picture of the performance monitor, showing the CPU usage going down after the Disk Pool goes down.

I'll have to check the TN thoroughly; I'm not sure of the impact of changing NTFS settings on existing data. I can confirm we already checked that the MSDP drive is excluded from the antivirus and that filesystem indexing is disabled. MSDP is on the Master server, I know this isn't recommended but the performance is good.

Alexis_Jeldrez · ‎10-09-2018

We avoided problems for 4 months because we reduced the number of duplications (SLP) from MSDP to tape during the night. We had to revert that change recently and the problem returned: Status 2074.

Following https://www.veritas.com/support/en_US/article.000008526 I realized that the disk pool is up but the NetBackup disk volume is down (pic related). I could try to bring it back up, but I really want to know why it went down and what I can do to keep this from happening again. It does look like something is at its limit: a problem that used to occur every three months is now happening every weekend.

Marianne · ‎10-09-2018

It sounds like this discussion: https://vox.veritas.com/t5/NetBackup/MSDP-status-2074-but-volume-is-not-down/td-p/637231

The Veritas TN: https://www.veritas.com/support/en_US/article.TECH156743

Alexis_Jeldrez · ‎10-09-2018

Thanks for replying.

@RiaanBadenhorst

I tried everything in https://www.veritas.com/support/en_US/article.100007548 including CR_STATS_TIMER = 300 in the pd.conf file and rebooted the OS: the problem persisted. I'm reverting the changes (deleting the three DPS_PROXY files). I checked the nbrmms and dps logs with vxlogview, as well as spoold.log and spad.log, but they just stop logging when the Disk Volume goes down.

@Marianne

Most of the conversation in that thread is about a shared platform; I have physical access to the one I'm troubleshooting and I can confirm it's 100% physical and dedicated. RAM doesn't seem to be a problem; it never goes beyond 60% use.

As for the speed of the storage, I tried https://www.veritas.com/support/en_US/article.000095782 and got these results:

Write speed: 239.2 MB/sec
Read with a 64k buffer: 139.4 MB/sec
Read with a 1024k buffer: 249.2 MB/sec
Read with a 4096k buffer: 300.6 MB/sec

The contentrouter.cfg reads the following, which seems okay:

; This parameter determines the data store read buffer size, in Bytes. The default value
; is 33554432 (32MB)
; @restart
; @validate [0-9]+
ReadBufferSize=4194304
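
One detail worth spelling out is how the configured value compares with the default named in the file's own comment. The sketch below just parses the snippet quoted above and does the arithmetic; the helper name `read_buffer_size` is made up, and it assumes nothing about how MSDP uses the value beyond what the comment says.

```python
# The contentrouter.cfg lines quoted above, verbatim.
CFG_SNIPPET = """\
; This parameter determines the data store read buffer size, in Bytes. The default value
; is 33554432 (32MB)
; @restart
; @validate [0-9]+
ReadBufferSize=4194304
"""

DEFAULT_BYTES = 33554432  # default quoted in the file's own comment


def read_buffer_size(cfg_text):
    """Return the ReadBufferSize value in bytes, skipping ';' comment lines."""
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith(";") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "ReadBufferSize":
            return int(value)
    raise KeyError("ReadBufferSize not found")


configured = read_buffer_size(CFG_SNIPPET)
print(f"configured: {configured // 2**20} MiB")                   # configured: 4 MiB
print(f"default:    {DEFAULT_BYTES // 2**20} MiB")                # default:    32 MiB
print(f"ratio:      1/{DEFAULT_BYTES // configured} of default")  # ratio:      1/8 of default
```

So the configured buffer is 4 MiB, one eighth of the documented default, and the same size as the 4096k buffer used in the read test above. Whether that is deliberate tuning or a leftover setting is not something the thread establishes.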

I tried testing the reading speed with a buffer of 8MB but got an error, which is strange because I should have 15GB of RAM in standby.

E:\test>nbperfchk -i E:\TEST\file.test -bs 8192k -o NUL
     800 MB @ 266.7 MB/sec,      792 MB @ 264.0 MB/sec
    1728 MB @ 288.0 MB/sec,      928 MB @ 309.3 MB/sec
input: Invalid access to memory location.
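
To compare runs like this one side by side, the progress lines can be parsed into numbers. This is a minimal sketch: the reading of the two columns (running total and average on the left, most recent interval on the right) is my inference from the figures above, not something taken from the nbperfchk documentation, and `parse_nbperfchk` is a made-up helper name.

```python
import re

# Matches progress lines such as "800 MB @ 266.7 MB/sec, 792 MB @ 264.0 MB/sec".
SAMPLE = re.compile(
    r"^\s*(\d+) MB @ ([\d.]+) MB/sec,\s*(\d+) MB @ ([\d.]+) MB/sec\s*$"
)


def parse_nbperfchk(output):
    """Split console output into parsed progress samples and other lines."""
    samples, other = [], []
    for line in output.splitlines():
        match = SAMPLE.match(line)
        if match:
            total_mb, avg_rate = int(match[1]), float(match[2])
            samples.append({
                "total_mb": total_mb,           # inferred: running total
                "avg_mb_s": avg_rate,           # inferred: average so far
                "step_mb": int(match[3]),       # inferred: last interval
                "step_mb_s": float(match[4]),
                "elapsed_s": round(total_mb / avg_rate, 1),
            })
        elif line.strip():
            other.append(line.strip())          # command line, errors, etc.
    return samples, other


# The 8192k run quoted above, verbatim.
RUN = r"""E:\test>nbperfchk -i E:\TEST\file.test -bs 8192k -o NUL
     800 MB @ 266.7 MB/sec,      792 MB @ 264.0 MB/sec
    1728 MB @ 288.0 MB/sec,      928 MB @ 309.3 MB/sec
input: Invalid access to memory location.
"""

samples, other = parse_nbperfchk(RUN)
for sample in samples:
    print(sample)
print(other[-1])
```

On this run it shows the read failing after 1728 MB, roughly six seconds in, which is a useful data point to hand to support alongside the error text.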

I'm still checking https://www.veritas.com/support/en_US/article.100037977 , I've already confirmed:

Windows 2008 R2 has Service Pack 1 installed
Block Size of the NTFS filesystem is 64K
Windows disk indexing disabled
MSDP path is excluded from antivirus
NetBackup policies don't use compression nor encryption

I'll keep working through the remaining items. In the meantime, I'll just have to reduce again the number of duplications from MSDP to tape on weekends.

sdo · ‎10-11-2018

Check the SAN switch ports that are zoned to provide connectivity between the NetBackup servers and storage (any tape and disk).

i.e. there are three sets of ports on which to zero the counters, and then check them the next time the problem happens, or even every day.

set 1 - the SAN switch ports facing the NetBackup servers

set 2 - the SAN switch ports facing the disk storage (for NetBackup)

set 3 - the SAN switch ports facing the tape storage (for NetBackup)

Also check the light levels of every SAN switch port. If your signal levels are less than -12dB (less than minus twelve) then that is borderline and suspect.

Of course, also look for error counts on all ports.

And look for "LLI" (lost link indicator) counts, which hint at reboots.
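
If zeroing the counters is not convenient, the same check can be done by taking a snapshot per port and comparing it with the next one. A rough sketch of that comparison follows; how the snapshots are collected depends on the switch vendor, so the dictionaries, port names and counter names here are all hypothetical. Only the -12 threshold comes from the advice above.

```python
LIGHT_LEVEL_FLOOR_DB = -12.0  # "borderline and suspect" threshold from the post


def counter_deltas(before, after):
    """Return, per port, the counters that increased between two snapshots."""
    deltas = {}
    for port, counters in after.items():
        earlier = before.get(port, {})
        changed = {
            name: value - earlier.get(name, 0)
            for name, value in counters.items()
            if value - earlier.get(name, 0) > 0
        }
        if changed:
            deltas[port] = changed
    return deltas


def weak_ports(light_levels):
    """Return the ports whose receive light level is below the floor."""
    return sorted(
        port for port, level in light_levels.items()
        if level < LIGHT_LEVEL_FLOOR_DB
    )
```

Running the comparison daily, as suggested, gives a baseline, so a counter that only moves during the weekend duplication window stands out.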

Alexis_Jeldrez · ‎12-03-2018

@Alexis_Jeldrez wrote:
@Marianne
MSDP is on the Master server, I know this isn't recommended but the performance is good.

We recently closed the support case; we were told again that MSDP on a Windows Master is not recommended. We will consider getting an Enterprise Server license for installing a second server which can use MSDP on the storage. In the meantime we will try client-side deduplication on our biggest NetBackup clients to see if that helps us with the disk volume uptime.

PS: I wonder if it would be appropriate to replace "not recommended" with "not supported" and to prevent an MSDP volume from being created on a Windows master in the first place.