Backup job hangs, forever, never errors out
I am having trouble backing up the file system on my Microsoft Exchange 2016 Server. I have two backup jobs for this server: (1) the file system and (2) the DAG databases. Both jobs back up to tape. The DAG backup job runs fine. The file system backup job runs for a while and then hangs. It seems to hang in a different place each time the job runs. The job should complete in 4 hours. I have given the job 48 hours to finish, and it never progresses once hung. Also, once the job hangs, it won't cancel. I have given the job 45 minutes to cancel before giving up. I then have to restart the Backup Exec server to stop the job.
This problem has been ongoing for 4 months. I opened a support case 3 months ago and have made zero progress toward a solution.
Part of the problem is that the job sometimes completes successfully and sometimes hangs. If debugging is enabled, the job completes successfully every time and creates 3500+ log files totaling 70 GB. When the job just hangs and doesn't cancel, the normal job logs don't show anything useful.
One proposed solution was to re-create the Backup job. I have recreated the Backup Exec job twice. Afterwards, the backup job will sometimes finish, but most of the time the job hangs.
I am running the latest version of Backup Exec, version 20.6. I have also uninstalled and re-installed the latest Backup Exec Remote Agent on the target server.
I use this Backup Exec media server to back up numerous Windows servers, and this Microsoft Exchange 2016 Server is the only server experiencing this problem.
This is such a headache because once the job hangs, other backup jobs start queueing behind it. Also, since the job doesn't complete, the tape end marker is unreadable and the tape cannot be appended with additional data.
Pleading for help ...
After literally months of occasionally experiencing the backup job hanging issue, the job errored out one time with this error: "The job failed with the following error: The network connection to the Backup Exec Remote Agent has been lost. Check for network errors."
There is a Veritas Knowledge Base article for this error located here: https://www.veritas.com/support/en_US/article.100005195
It proposes this solution:
Open Registry on the remote server
Start->Run->Regedit
Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Create a new DWORD value named IRPStackSize (case sensitive).
Modify the value to between 11 and 50. (15 is the default when the value is not present; the recommendation is to start at 30. Note: On some computers, values from 33 through 38 and above can cause problems.)
Reboot the server.
I implemented this solution and have not experienced the error again. (It has been a few weeks.)
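For anyone who would rather script the registry change than click through Regedit, here is a rough sketch in Python (my choice of language, not something from the KB article). It only builds the `reg add` command line for the IRPStackSize value and checks that the number is in the 11 to 50 range mentioned above; it does not touch the registry itself. The helper name `build_reg_command` and the constants are made up for illustration. I am assuming the standard `reg.exe` tool on the remote server and the `Parameters` key as it appears in the Windows registry.

```python
from typing import List

# Registry location from the steps above (HKLM is reg.exe's short form of
# HKEY_LOCAL_MACHINE).
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
VALUE_NAME = "IRPStackSize"  # spelled exactly as in the KB steps


def build_reg_command(size: int = 30) -> List[str]:
    """Hypothetical helper: return the reg.exe arguments that would set
    IRPStackSize to `size`. Nothing is executed here."""
    # The steps above give 11 through 50 as the allowed range.
    if not 11 <= size <= 50:
        raise ValueError("IRPStackSize must be between 11 and 50")
    return [
        "reg", "add", KEY,
        "/v", VALUE_NAME,   # value name
        "/t", "REG_DWORD",  # value type
        "/d", str(size),    # value data (decimal)
        "/f",               # overwrite without prompting
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(" ".join(build_reg_command(30)))
```

The printed command would need to be run from an elevated prompt on the remote (Exchange) server, and the server still has to be rebooted afterwards for the change to take effect.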