Solved: MSSQL DBs fail with status 23 but DBs show success

trs06 · ‎01-31-2013

We have MSSQL jobs that run application and log backups. From time to time I get a failure with status 23, but when I look at the log it shows that all operations succeeded and none failed. I've seen others' posts about this problem and Symantec's response via EEB and future releases. We are running: Master Server: 7.1.0.4 on RHEL; Media Servers: 7.1.0.4 on Win 2008 R2; clients: MSSQL 2008 (patched) running 7.1 and 7.1.0.4. What does this mean? Are the backups actually cataloged, or is the catalog info discarded due to the status > 1? See the Activity Monitor details excerpts below.

01/27/2013 04:48:55 - Info dbclient (pid=4416) INF - OPERATION #12 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\FULL.bch SUCCEEDED with STATUS 0 (0 is normal). Elapsed time = 31(0) seconds
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - ODBC return code <2>, SQL State <01000>, SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]10 percent processed.>.
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]20 percent processed.>
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]30 percent processed.>
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]40 percent processed.>
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]50 percent processed.>
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]60 percent processed.>
01/27/2013 04:48:56 - Info dbclient (pid=4416) INF - SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]70 percent processed.>
01/27/2013 04:49:02 - Info dbclient (pid=4416) INF - Thread has been closed for stripe #0
01/27/2013 04:49:04 - Info dbclient (pid=4416) INF - OPERATION #11 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\FULL.bch SUCCEEDED with STATUS 0 (0 is normal). Elapsed time = 42(0) seconds

01/27/2013 04:49:44 - Info dbclient (pid=4416) INF - Results of executing <C:\Program Files\Veritas\NetBackup\DbExt\MsSql\FULL.bch>:
01/27/2013 04:49:44 - Info dbclient (pid=4416) <14> operations succeeded. <0> operations failed.

Mark_Solutions · ‎04-03-2013

Any failures will automatically get logged to the \netbackup\logs\mssql_backup_failures directory.

It also depends on whether they are full or transaction log backups. I often see that people use $ALL for transaction log backups, which is not right: you cannot do one for the Master database, so that needs an exclude line in the bch file.

The other issue is if the backup takes over 2 hours. As Marianne says, a firewall can cause a break for this, but so too can the Windows Keep Alive timeout, which is also 2 hours by default. So on your Windows Media Server you can add the following registry keys (needs a reboot):

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
New DWORD - KeepAliveTime - Decimal value of 510000
New DWORD - KeepAliveInterval - Decimal value of 3

Hope this helps
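
For anyone who prefers to review the change as a file before touching the registry, here is a minimal Python sketch that only builds the text of a .reg file for the two values above. The function name keepalive_reg_file is made up for illustration and is not part of NetBackup or Windows. The values are simply the ones quoted in this post (510000 ms works out to 8.5 minutes), so adjust them to whatever your environment needs and check the output before importing it.

```python
# Sketch only: builds .reg file text for the two keep-alive values quoted above.
# keepalive_reg_file is a hypothetical helper, not a NetBackup or Windows API.

KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def keepalive_reg_file(keepalive_time_ms=510000, keepalive_interval=3):
    """Return the contents of a .reg file that sets both DWORD values."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        # .reg files store DWORDs as 8 hex digits, so 510000 becomes 0007c830
        f'"KeepAliveTime"=dword:{keepalive_time_ms:08x}',
        f'"KeepAliveInterval"=dword:{keepalive_interval:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(keepalive_reg_file())
```

Nothing is written to the registry by this script; it just prints the text so it can be saved, reviewed and imported by hand. A reboot is still needed afterwards, as noted above.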


RamNagalla · ‎01-31-2013

Are the backups actually cataloged or is the catalog info discarded due to the Status > 1? See the AM details excerpts below.

Any backup that ends with status code 0 or 1 has its catalog stored in the database; other codes do not.

You would probably need to confirm this by checking the image database and by doing a test restore.

trs06 · ‎03-08-2013

Sorry for not being back here in over a month. I was expecting consultants to help with this error, but that hasn't happened yet.

The backups are all confirmed as successful and verified as such by the db admin, and thus are in the catalog:

01/27/2013 04:49:44 - Info dbclient (pid=4416) <14> operations succeeded. <0> operations failed

I suspect each db job is recorded at its individual status, which is status 0, even though the parent job finishes reporting status 23. It's like it finishes all the work, then breaks the socket before reporting overall job success.

No solution yet :)

StefanosM · ‎03-08-2013

My long shot is that you are not backing up all databases.
Maybe one or more databases fail with error 23 before a job appears in the Activity Monitor, and the parent job fails with the same error.

Count your databases and the backup jobs and check whether I'm right. Do not count the parent job.

A better way to check whether a database backup is valid is to go to the catalog, find the backups and run a verification. If you still have doubts, you can do an alternate restore.

trs06 · ‎03-08-2013

Done all that, plus the point-in-time restore with the DB admin. Went line by line through the Detailed Status and verified every backup (db) was accounted for and showed status 0. Also filtered on the Parent Job ID, highlighted all jobs, and the count at the top was the number of dbs plus the parent. Brought my colleagues in to look at it as well for anything I missed, and nothing was found. The db admin confirmed the number of databases on the servers. There are generally around 8 servers out of 29 that this is happening to. All servers are running from the same policy. I have another person tracking whether it is always the same servers that get the status 23s and will follow up regarding that angle.
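
A line-by-line check like this can also be scripted. Below is a minimal Python sketch that reads the job details text, collects the per-operation results and the final summary line, and lets you compare the two. The function names are made up for illustration, and the "FAILED with STATUS" wording is an assumption (only SUCCEEDED lines appear in this thread), so treat the patterns as a starting point and adjust them to what your logs actually contain.

```python
import re

# Per-operation result lines, e.g.
#   "... INF - OPERATION #12 of batch C:\...\FULL.bch SUCCEEDED with STATUS 0 (0 is normal). ..."
# Assumption: failed operations use the same layout with FAILED instead of SUCCEEDED.
OPERATION_RE = re.compile(
    r"OPERATION #(\d+) of batch .*? (SUCCEEDED|FAILED) with STATUS (-?\d+)"
)
# Final summary line: "<14> operations succeeded. <0> operations failed."
SUMMARY_RE = re.compile(r"<(\d+)> operations succeeded\. <(\d+)> operations failed")


def summarize_dbclient_log(text):
    """Return ({operation_number: (result, status)}, (succeeded, failed) or None)."""
    operations = {}
    summary = None
    for line in text.splitlines():
        match = OPERATION_RE.search(line)
        if match:
            operations[int(match.group(1))] = (match.group(2), int(match.group(3)))
            continue
        match = SUMMARY_RE.search(line)
        if match:
            summary = (int(match.group(1)), int(match.group(2)))
    return operations, summary


def nonzero_operations(operations):
    """Return the sorted operation numbers whose status is not 0."""
    return sorted(num for num, (_, status) in operations.items() if status != 0)


# Three lines taken from the excerpt in the first post (the excerpt is partial,
# so only 2 of the 14 operations appear here).
SAMPLE = r"""01/27/2013 04:48:55 - Info dbclient (pid=4416) INF - OPERATION #12 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\FULL.bch SUCCEEDED with STATUS 0 (0 is normal). Elapsed time = 31(0) seconds
01/27/2013 04:49:04 - Info dbclient (pid=4416) INF - OPERATION #11 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\FULL.bch SUCCEEDED with STATUS 0 (0 is normal). Elapsed time = 42(0) seconds
01/27/2013 04:49:44 - Info dbclient (pid=4416) <14> operations succeeded. <0> operations failed.
"""

if __name__ == "__main__":
    ops, summary = summarize_dbclient_log(SAMPLE)
    print("operations seen:", sorted(ops))
    print("non-zero status:", nonzero_operations(ops))
    print("summary line:", summary)
```

Run against the full job details for a server, the number of operations collected should match the first number in the summary line, and the non-zero list should be empty if every database really finished with status 0.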

StefanosM · ‎03-08-2013

Some other long shots.

Are the 8 servers physical servers, and have you implemented network bonding? If yes, try deactivating it on one server.
Do the servers have multiple networks? Check that NetBackup communicates over only one network, or force it to as a test.
Check the bpcd logs of all the servers involved.

And finally, I imagine that you have already increased the timeouts sky-high on the master, media servers and clients.

trs06 · ‎04-02-2013

The consultants I thought could help have not come through yet.

What log files might provide me the best clues? Verbosity and Debug? And which ones on which servers:

Client logs:

Media Server logs:

Master Server logs:

Marianne · ‎04-02-2013

Logs that I would have a look at:

client: dbclient

media server: bpbrm and bptm
bpbrm will contain update info back to the master server.

Question: is there a firewall anywhere in the picture?
We have seen this where a firewall between the media server and the master times out, causing status 23 or 24.

See this (very) old TN: http://www.symantec.com/docs/TECH91271

This one is for status 24, but similar issue: http://www.symantec.com/docs/TECH145234


abhinav_trivedi · ‎04-02-2013

Have you checked the backed-up databases in the restore window?

Also, let us know whether you have tried restoring them. Please provide the parent node's detailed status along with the required logs (Updated by Marianne).

