Forum Discussion

Shaun_Taylor · 10 years ago

Oracle Backup Intermittent Status 41s

I have been investigating an intermittent issue with one of our Oracle clients over the last month or so, but unfortunately Symantec support have advised that they cannot find the cause and have recommended a rebuild of one of my master server nodes (v7.6.0.2). Although it's a long shot, I thought I would post here first just in case anyone has seen a similar issue before or has any other suggestions, as a rebuild is a last resort for us due to the amount of VCS / VVR / NBU configuration required. Logs from the master and media servers (bpbrm / bptm) and the client (bphdb / dbclient) have been provided multiple times but no cause has been found.

Database backup jobs for one particular Oracle client (Solaris 10 / Oracle 11.2.0.3) are intermittently failing with a status 41. However, the interesting thing is that we are backing up with 4 concurrent channels and only one of them fails - and when it fails, the other three continue with no problems and other child jobs continue to appear and complete until the end of the backup. The failed child job runs for around 30 minutes before actually failing without writing any data, which I assume is a timeout (client read?):

24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964      
24/03/2015 21:42:10 - Info bptm(pid=7520) start           
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out     
24/03/2015 22:12:21 - end writing; write time: 0:30:11
network connection timed out(41)
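Since the failing child runs for roughly 30 minutes before dying without writing any data, a configured timeout seems plausible. This is the sort of sketch I'd use to list the timeout settings on the media server and client for comparison against the 0:30:11 write time (default UNIX install path assumed; adjust BPCONF if yours differs):

```shell
#!/bin/sh
# List every *_TIMEOUT setting in bp.conf (e.g. CLIENT_READ_TIMEOUT)
# so it can be compared against the ~30-minute failure window above.
# The path below is the default UNIX location - an assumption.
BPCONF=${BPCONF:-/usr/openv/netbackup/bp.conf}

if [ -f "$BPCONF" ]; then
    result=$(grep 'TIMEOUT' "$BPCONF")
else
    result="no bp.conf at $BPCONF"
fi
echo "$result"
```

Running the same check on the media server, master node, and client lets you spot any value that lines up with the observed 30-minute wait.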

The issue only affects Oracle backups on this client (filesystem backups are fine), and we have many other Oracle backup clients that are unaffected (as well as hundreds of other backup clients that experience no problems). One thing we did find is that the issue only occurs on one particular node of our master server cluster, although as far as I can tell the configuration is the same between nodes. Running the backups at different times (including during business hours) has also seen the problem occur intermittently.

Our networks team have put a lot of time and effort into monitoring and checking the network but this does not seem to be the cause of the problem - which makes sense as all other child jobs for the same backup are not interrupted.

Our master servers are running on VMware 5.5, and we have even moved the affected master node to a completely different cluster / platform which has changed the underlying hardware and network. I have reinstalled VMware tools and configured a new vNIC (with a different MAC) as well. We have also tried different switch and NIC ports on the client.

Any suggestions would be appreciated.

  • If this is only happening with a specific master server node - has anyone had a look at the bprd log on the master server?

    If memory serves me right, bprd is expecting updates from the user_ops logs on the client.

    Since dbclient is reporting the timeout, I would be interested in all entries for that particular PID: 23383. 
    What was the process trying to do at this point in time? And the previous entries for this PID?

    About the one channel failing - is it with a particular DB or is it random?
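    To pull those entries, something like this would do (a sketch only; the path is the default UNIX client location, and 23383 is the PID from the job details above):

    ```shell
    #!/bin/sh
    # Collect all dbclient log entries for the failing child PID so we
    # can see what the process was doing when it timed out. LOG_DIR is
    # the default UNIX client location - adjust if installed elsewhere.
    LOG_DIR=${LOG_DIR:-/usr/openv/netbackup/logs/dbclient}
    PID=23383

    if [ -d "$LOG_DIR" ]; then
        entries=$(grep -h "$PID" "$LOG_DIR"/log.* 2>/dev/null)
    else
        entries="no dbclient log directory at $LOG_DIR (logging not enabled?)"
    fi
    echo "$entries"
    ```

    The entries immediately before the timeout are the interesting ones - they show which host the process was talking to.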

  • The client's backup interface was connected to a different switch and given a new IP address in a different VLAN. The most likely explanation is some sort of issue with the original switch, given how inconsistent and nonsensical the failures were, but we are unable to prove this for certain.

    Thanks for everyone's help and input.

17 Replies

  • I normally find everything I need in level 3 logs.

    Just be aware that NBU on the master server needs to be restarted to apply any changes to bprd logging.

    Yes - my question to Support would be - if you suspect that the issue is with the master server, why have you never requested bprd log?

    The error in the Job Details still looks like a comms issue between the client and media server, but the dbclient log entries for PID 23383 should tell us whether the client was trying to talk to the master or the media server when the process timed out.

    What is of utmost importance is to get a backline Symantec engineer who specializes in DB backups to trace the process flow and see exactly where the break in comms is.
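    For what it's worth, enabling bprd logging is quick. A sketch, assuming a default UNIX master install (adjust NB_BASE otherwise), and noting again that the restart is required because bprd only reads its logging settings at startup:

    ```shell
    #!/bin/sh
    # Enable legacy bprd logging on the master node. bprd writes a log
    # only if its log directory exists, and it picks up the verbosity
    # setting at startup - hence the NetBackup restart at the end.
    NB_BASE=${NB_BASE:-/usr/openv/netbackup}

    if [ -d "$NB_BASE" ]; then
        mkdir -p "$NB_BASE/logs/bprd"
        chmod 755 "$NB_BASE/logs/bprd"
        # Level 3 is normally enough; only go to 5 if Support asks.
        grep -q '^VERBOSE' "$NB_BASE/bp.conf" || echo 'VERBOSE = 3' >> "$NB_BASE/bp.conf"
        # Restart NetBackup so bprd applies the change (schedule a window!).
        "$NB_BASE/bin/bp.kill_all" && "$NB_BASE/bin/bp.start_all"
        status="bprd logging enabled"
    else
        status="no NetBackup install at $NB_BASE"
    fi
    echo "$status"
    ```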

  • I have already received a response and I was advised that the dbclient log showed that the client could not communicate with the master during the comm file update and that any bprd logs would simply tell us the same.

    However, this only returns me to the persistent question of - if this truly was a communication / network issue, why would 3 other child jobs for the same database backup continue uninterrupted, and why can our networks team not see any issues at the time of the failure?

    The case has been escalated to a backline engineer on a couple of occasions and they have also now advised rebuilding the server as they do not know the cause.

  • It would be interesting to at least cross-reference dbclient and bprd logs to see the request sent by the client and if the request was received by the master server....
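    Once copies of both logs are sitting side by side, something like this would do the cross-reference (a sketch - the file names and the 22:1x window are assumptions based on the job details above):

    ```shell
    #!/bin/sh
    # Print the entries around the failure window (22:10-22:12 per the
    # job details) from copies of the client's dbclient log and the
    # master's bprd log, so the client's request can be matched against
    # what the master actually received.
    LOG_COPIES=${LOG_COPIES:-.}
    WINDOW='22:1[012]'

    found=0
    for f in "$LOG_COPIES/dbclient.log" "$LOG_COPIES/bprd.log"; do
        if [ -f "$f" ]; then
            found=1
            echo "== $f =="
            grep "$WINDOW" "$f"
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no log copies found in $LOG_COPIES"
    fi
    ```

    If the client's request appears in dbclient but nothing arrives in bprd during the same window, the break is upstream of the master.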

  • I'll try to grab fresh bprd and dbclient logs overnight - they will be level 5, though, in case I pass them on to Support as well, so feel free to ignore them if you don't have time or they're too long!

  • Last night's backups ran fine following the network change (although we don't understand why) - I will monitor over the weekend and provide a further update.
