Solved: Media server backup stop/hung on certain schedule/pool

Iwan_Tamimi · ‎08-15-2014

This problem happened again in our system (I reported a similar problem earlier at https://www-secure.symantec.com/connect/forums/netbackup-client-failing-daily-incremental-ok-weekly-fulls but there is no solution yet). We are using a NetBackup 7.6.0.2 master server running on RHEL 6.1; the affected media servers are running HP-UX and Windows 2008.

This is what happened:

Backups on all these media servers, for every pool/schedule, had been running fine for several months.
Last Monday a power failure suddenly hit the computer room, so many of the servers (NetBackup clients) rebooted. I believe some of the servers did not reboot cleanly (maybe some services were not up, or even the system itself was not up, but it took time for us to check them one by one).
After the power failure the Daily_Incre schedule (we don't actually use the NetBackup scheduler; we use an external one driven by scripts) began to have problems: if a backup hit a server (NetBackup client) that was still down, the backups with the same schedule on that particular media server would hang.
I could not kill the hung NetBackup jobs from the GUI; I had to kill them on the media server itself. I only killed one of them, and the rest then died by themselves.
During this time, backups running on the same media server but with a different schedule/pool were running fine.
(I say schedule/pool because in my system every job schedule goes to its own pool.)

My questions are:

When it backs up a failed client (one that is down or cannot be pinged), why doesn't it just fail immediately instead of causing the whole schedule/pool to hang?
Is some timeout set to infinity, and if so, which parameter is it?

Thank you.

Regards

Iwan

mph999 · ‎08-15-2014

Client connect timeout springs to mind. If this is high, it will wait that long before failing.
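
Since the schedule in this thread is driven by external scripts, one option that may help is to probe each client with a short TCP timeout before submitting its backup, so a down client is skipped quickly instead of being waited on. Below is a minimal Python sketch under stated assumptions: the clients are expected to accept connections on TCP 1556 (PBX) or 13724 (vnetd), and the function names, the port list, and the 5-second timeout are illustrative choices for this example, not NetBackup settings.

```python
import socket

# Assumption: ports commonly used to reach a NetBackup client
# (1556 = PBX, 13724 = vnetd). Adjust to match your environment.
DEFAULT_PORTS = (1556, 13724)


def client_reachable(host, ports=DEFAULT_PORTS, timeout=5.0):
    """Return True if any of the ports accepts a TCP connection in time."""
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout,
            # or name-resolution failure.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False


def split_clients(clients, ports=DEFAULT_PORTS, timeout=5.0):
    """Partition client names into (reachable, skipped) lists."""
    reachable, skipped = [], []
    for name in clients:
        if client_reachable(name, ports, timeout):
            reachable.append(name)
        else:
            skipped.append(name)
    return reachable, skipped
```

A scheduler script could call `split_clients(...)` on the client list, submit backups only for the first list, and log the second list for follow-up. This only checks that a port answers; it does not prove the client's backup services are healthy.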

INT_RND · ‎08-15-2014

Are you using dedupe disk pools like PureDisk or MSDP?

Were the NetBackup servers affected by the power outage?

Was there any network hardware affected by the power outage?

Iwan_Tamimi · ‎08-15-2014

Hi INT_RND and mph999,

Thank you for your responses.

We are using dedupe with Data Domain, not Symantec, but the affected schedule/pools are using tape, not dedupe. Our dedupe is only used for VM backups, and those are fine now.

Most of our NetBackup Servers (includes the master) are affected by the power outage, but I think they are fine after that. I think the problem mostly affected the clients (we have so many clients, some of them not really important production servers, so nobody checks whether they were OK after the outage).

Yes, network hardware was also affected, but it should be fine now.

I think the main cause of the problem is the problematic clients combined with a wrong setting in our NetBackup system. (As I mentioned earlier, I have posted about the same thing before and still could not find the real cause; last time my solution was just to find the problematic client and stop backing it up. This time there are just too many.)

Like mph999 said, I also think some timeout setting is not set correctly. Any idea?

Thanks again.

Regards,

iwan

RonCaplinger · ‎08-18-2014

I'll take a guess, since you said "...the affected schedule/pools are using tape..." and "Most of our NetBackup Servers (includes the master) are affected by the power outage..."

Assuming you are using a tape library, do you know if you are using the SCSI "persistent binding" settings with your media servers' HBAs connected to your tape drives? If not, this *could* cause the behavior you see, where a backup starts but no data ever gets transferred. The persistent binding is what tells your operating system which logical path to use when re-establishing the connection to the HBA/tape drives after the server has rebooted. This is not NetBackup; this is the OS communicating with the tape drives through the HBA. Remember that NetBackup is not the one connected to the tape drives; it's the OS, and NetBackup communicates with the OS.

Without persistent binding, the drive that was previously on one path is no longer there; it is now waiting for communication on another path. Since this happens between the OS and the tape drive, NOT NetBackup, NetBackup is still trying to send the data to a path that is no longer connected.

To fix this, use the normal set of utilities (scan, etc.) to make sure the OS still sees the tape drives. Then, delete the tape drives and robot from the NetBackup GUI and allow it to re-discover everything. If everything is connected and communicating correctly between the OS and the HBA and tape drives, NetBackup will rebuild the tape drive definitions and everything will work again.

Iwan_Tamimi · ‎08-19-2014

Hi RonCaplinger

I am still not sure about the hardware; I still believe some setting in my NetBackup is not correct. Now that we have identified the problematic clients (these also included some virtual IPs from a package/service of a clustered client), we either stop backing them up or fix them, and the problem goes away.

This does not really solve the problem, though: problematic clients will appear again someday, so this could happen again.

Regards,

Iwan

Marianne · ‎09-16-2014

Reading through the discussion again, the only logical cause of these hung backups is an excessive Client Connect timeout setting on the media servers.
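
To check for this on a UNIX media server, the timeout entries can be read out of the NetBackup configuration file. The Python sketch below is only an illustration, assuming the usual `KEY = value` layout of `bp.conf` (typically `/usr/openv/netbackup/bp.conf`) and the `CLIENT_CONNECT_TIMEOUT` and `CLIENT_READ_TIMEOUT` entries. The function names and the 300-second review threshold are made up for this example. On Windows media servers these settings are not kept in `bp.conf`, so the values would have to be obtained another way (for example with `bpgetconfig`).

```python
# Assumption: entries look like "CLIENT_CONNECT_TIMEOUT = 300" in bp.conf.
TIMEOUT_KEYS = ("CLIENT_CONNECT_TIMEOUT", "CLIENT_READ_TIMEOUT")


def read_timeouts(bp_conf_text, keys=TIMEOUT_KEYS):
    """Return {key: seconds} for the timeout entries present in the text."""
    found = {}
    for line in bp_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in keys:
            try:
                found[key] = int(value.strip())
            except ValueError:
                # Ignore entries whose value is not a plain integer.
                pass
    return found


def excessive(timeouts, limit_seconds=300):
    """Return the entries whose value is above the (hypothetical) limit."""
    return {k: v for k, v in timeouts.items() if v > limit_seconds}


# Example with made-up file content:
sample = """SERVER = master01
CLIENT_CONNECT_TIMEOUT = 3600
CLIENT_READ_TIMEOUT = 300
"""
print(excessive(read_timeouts(sample)))  # {'CLIENT_CONNECT_TIMEOUT': 3600}
```

An entry that is missing from the file is simply not reported; in that case the product default applies. What counts as "excessive" depends on the site, so the limit here is only a starting point for review.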
