Troubleshooting MSDP with Disk Pool regularly going down

Question

The master server is also the storage server: NetBackup 8.0 on Windows 2008 R2 SP1. The disk used for the MSDP disk pool is external storage connected to the Windows server through a dedicated HBA, via a SAN switch and Microsoft MPIO.

Our PureDisk disk pool goes down a couple of times a month, during backup windows, after weeks of uptime and normal performance. When that happens, the NetBackup backup jobs directed to the associated MSDP storage unit fail with status 2074. The disk pool refuses to come back up even after restarting the NetBackup services, but a Windows reboot brings everything back online. The problem started a year and a half ago, before the upgrade to NetBackup 8.0.

Since I haven't been able to catch the incident itself, I tried replicating it (with the NetBackup services down) by disconnecting the external storage. That produced logs in the FC switch and MPIO errors in the Windows Event Viewer, neither of which appeared when NetBackup marked the disk pool down. The storage itself reports zero errors and is unaware of any problem. My theory, therefore, is that the disk has always been online and something is happening in the software.

So far the best NetBackup logs I have are from Disk Reports \ Disk Logs, where the following lines appear right before all the backup jobs start failing with status 2074:

Volume <Disk Pool>:PureDiskVolume monitored by <Storage Server> is down
Volume <Storage Server>:<Disk Pool>:PureDiskVolume marked down

I looked in <MSDP-path>\log\spoold\spoold.log but couldn't find the reason why the disk pool was marked down in the first place. Which logs should I be looking at, and how should I configure their verbosity if required?
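To line up the moment the pool was marked down with the SAN switch and Windows event logs, one option is to export the Disk Logs report to text and scan it for the two messages quoted above. The following is only a minimal sketch: the layout of the exported report (timestamp prefix, one event per line) is an assumption, and the sample lines, `dp1` and `nbmaster` are invented placeholders, not real output.

```python
import re

# Matches the two messages quoted in the question. The surrounding layout
# of the exported report (timestamp prefix, separators) is an assumption.
DOWN_PATTERN = re.compile(
    r"Volume\s+(?P<volume>\S+:PureDiskVolume)\s+"
    r"(?:monitored by\s+\S+\s+is down|marked down)"
)

def find_down_events(lines):
    """Return (line_number, volume, full_line) for every 'down' message."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = DOWN_PATTERN.search(line)
        if match:
            events.append((number, match.group("volume"), line.strip()))
    return events

# Invented sample lines; 'dp1' and 'nbmaster' are placeholder names.
sample = [
    "01/01/2018 02:10:01 some unrelated message",
    "01/01/2018 02:14:37 Volume dp1:PureDiskVolume monitored by nbmaster is down",
    "01/01/2018 02:14:37 Volume nbmaster:dp1:PureDiskVolume marked down",
]

for number, volume, line in find_down_events(sample):
    print(f"line {number}: {volume}: {line}")
```

The pattern assumes pool and server names contain no spaces; it would need adjusting otherwise.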

marianne · Accepted Answer

Alexis_Jeldrez
Can you confirm that this master/media server has sufficient memory for Master server load plus Media server processes plus Dedupe (1 to 1.5 MB memory per TB) as well as OS requirements?
Have you checked the TN with requirements for Windows MSDP? https://www.veritas.com/support/en_US/article.100037977

riaanbadenhorst · Answer

Hi,
You can try configuring / modifying these touch files. The issue is not on the actual media server, or the storage. It is the master server that can't communicate with or receive the status from the media server across the LAN. When it can't get an update, it believes the MSDP is down and then marks it as down in the EMM.
https://www.veritas.com/support/en_US/article.100007548

sdo · Answer

Check the SAN switch ports that are zoned to provide connectivity between the NetBackup servers and storage (any tape and disk). That is, there are three sets of ports on which to zero the counters and then check the next time the problem happens, or even every day:

set 1 - the SAN switch ports facing the NetBackup servers
set 2 - the SAN switch ports facing the disk storage (for NetBackup)
set 3 - the SAN switch ports facing the tape storage (for NetBackup)

Check the light levels of every SAN switch port too. If your signal levels are less than -12dB (less than minus twelve) then that is borderline and suspect. Of course, also look for error counts on all ports, and look for "LLI" (lost link indicator) counts, which hint at reboots.
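If the light levels and error counters are noted down by hand each day, the check described above can be sketched as a small script. This is an illustration only: the -12 dB threshold comes from the post, while the port names, the `rx_db` and `errors` field names and all the readings are invented, and nothing here talks to a real switch.

```python
# Threshold taken from the post above; everything else here is illustrative.
SUSPECT_BELOW_DB = -12.0

def suspect_ports(ports, threshold_db=SUSPECT_BELOW_DB):
    """Return {port: [reasons]} for ports with weak light or non-zero errors."""
    flagged = {}
    for name, stats in ports.items():
        reasons = []
        if stats["rx_db"] < threshold_db:
            reasons.append("low light level")
        if stats["errors"] > 0:
            reasons.append("error count above zero")
        if reasons:
            flagged[name] = reasons
    return flagged

# Made-up readings, keyed by hypothetical port names.
ports = {
    "server-facing-0": {"rx_db": -3.1, "errors": 0},
    "disk-facing-4": {"rx_db": -12.8, "errors": 0},
    "tape-facing-9": {"rx_db": -7.4, "errors": 17},
}

for name, reasons in suspect_ports(ports).items():
    print(name, "->", ", ".join(reasons))
```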

nicolai · Answer

I am with Riaan - the two touch files are easy to set (and remove) and may actually fix the issue.
You will get the disk down error when the Disk Polling Service (DPS) fails to see the disk in the up state. The two touch files instruct the DPS to be a bit more relaxed.

alexis_jeldrez · Answer

Sorry for not replying before, I couldn't check the platform until today.

Thanks for the information. I hadn't tried setting CR_STATS_TIMER = 300 in the pd.conf file before. I'll check what happens now.

I had already implemented the touch files and used the 1800 seconds configuration. The result was that the monitoring waited 30 minutes before declaring the disk pool down (previously it waited just a couple of minutes).
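To confirm that the CR_STATS_TIMER change mentioned above is actually the active (uncommented) setting, a quick check of the file contents can help. The sketch below assumes pd.conf uses plain `NAME = VALUE` lines with `#` for comments; the sample text is a made-up fragment, not a real pd.conf, and reading the file from disk is left out because its location is not given in this thread.

```python
def read_pd_conf(text):
    """Parse 'NAME = VALUE' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings

# Made-up fragment for illustration only, not a complete pd.conf.
sample = """\
# a commented-out entry is ignored
#CR_STATS_TIMER = 0
CR_STATS_TIMER = 300
"""

settings = read_pd_conf(sample)
print("CR_STATS_TIMER:", settings.get("CR_STATS_TIMER", "not set"))
```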

alexis_jeldrez · Answer

Marianne

Thanks for your assistance. The NetBackup server (everything NB-related is centralized in one single physical server) has 32 GB of RAM. I understand that should be enough for deduplicating backups: the sum of all the full backups made during a weekend is about 15 TB.

All backups go through an SLP which backs up first to the MSDP disk pool and then duplicates everything to tape. I reduced the concurrent jobs the disk storage unit can handle to 30 and moved the SLP duplication schedule so that backups and duplications don't coincide. We didn't observe any improvement.

We monitored the performance and noticed that RAM use stays at about 55% (18 GB used of the total 32 GB), and CPU usage also stays at about half. When the disk pool goes down, CPU usage drops to 30% or lower, but I think that's a consequence of the jobs not running. RAM usage stays the same. Attached: picture of the performance monitor, showing the CPU usage going down after the disk pool goes down.

I'll have to check the TN thoroughly; I'm not sure of the impact of changing NTFS settings on existing data. I can confirm we already checked that the MSDP drive is excluded from the antivirus and that filesystem indexing is disabled. MSDP is on the master server. I know this isn't recommended, but the performance is good.
