
Shaun_Taylor
10 years ago

Oracle Backup Intermittent Status 41s

I have been investigating an intermittent issue with one of our Oracle clients for the last month or so. Symantec support have been unable to find the cause and have recommended rebuilding one of my master server nodes (v7.6.0.2). A rebuild is a last resort for us because of the amount of VCS / VVR / NBU configuration required, so although it's a long shot, I thought I would post here first in case anyone has seen a similar issue before or has any other suggestions. Logs from the master and media servers (bpbrm / bptm) and the client (bphdb / dbclient) have been provided to support multiple times, but no cause has been found.

Database backup jobs for one particular Oracle client (Solaris 10 / Oracle 11.2.0.3) are intermittently failing with a status 41. However, the interesting thing is that we are backing up with 4 concurrent channels and only one of them fails - and when it fails, the other three continue with no problems and other child jobs continue to appear and complete until the end of the backup. The failed child job runs for around 30 minutes before actually failing without writing any data, which I assume is a timeout (client read?):

24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964
24/03/2015 21:42:10 - Info bptm(pid=7520) start
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out
24/03/2015 22:12:21 - end writing; write time: 0:30:11
network connection timed out(41)
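
To check whether the failure lines up with a configured timeout, the gap between "begin writing" and the status 41 line can be measured directly from the job details. The following is only a minimal sketch: it uses the lines quoted above, assumes the timestamps are day/month/year as displayed, and the function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Job-detail lines copied from the post (trailing whitespace trimmed).
JOB_DETAILS = """\
24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964
24/03/2015 21:42:10 - Info bptm(pid=7520) start
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out
24/03/2015 22:12:21 - end writing; write time: 0:30:11
"""

# Assumption: timestamps are day/month/year, as they appear in the post.
STAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def parse_line(line):
    """Split one job-detail line into (timestamp, message)."""
    stamp, _, message = line.partition(" - ")
    return datetime.strptime(stamp, STAMP_FORMAT), message.strip()


def elapsed_until_status(details, marker="status: 41"):
    """Return the time from 'begin writing' to the first line containing marker."""
    start = end = None
    for line in details.splitlines():
        if not line.strip():
            continue
        when, message = parse_line(line)
        if start is None and message == "begin writing":
            start = when
        if end is None and marker in message:
            end = when
    if start is None or end is None:
        raise ValueError("expected job-detail lines not found")
    return end - start


gap = elapsed_until_status(JOB_DETAILS)
print(gap)  # 0:30:11
```

A gap that is consistently just over 30 minutes across failed jobs would support the timeout theory, and the value can then be compared with the timeouts configured for this client and its media server.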

The issue is only affecting Oracle backups on this client (filesystem backups are fine) and we have many other Oracle backup clients that are not affected (as well as hundreds of other backup clients that experience no problems). One thing that we did find is that the issue only occurs on one particular node of our master server cluster, but as far as I can tell the configuration is the same between nodes. Running the backups at different times (including during business hours) has made no difference; the problem still occurs intermittently.

Our networks team have put a lot of time and effort into monitoring and checking the network but this does not seem to be the cause of the problem - which makes sense as all other child jobs for the same backup are not interrupted.

Our master servers are running on VMware 5.5, and we have even moved the affected master node to a completely different cluster / platform which has changed the underlying hardware and network. I have reinstalled VMware tools and configured a new vNIC (with a different MAC) as well. We have also tried different switch and NIC ports on the client.

Any suggestions would be appreciated.

  • If this is only happening with a specific master server node - has anyone had a look at the bprd log on the master server?

    If memory serves me right, bprd is expecting updates from the user_ops logs on the client.

    Since dbclient is reporting the timeout, I would be interested in all entries for that particular PID: 23383. 
    What was the process trying to do at this point in time? And the previous entries for this PID?

    About the one channel failing - is it always with a particular db, or random?

  • The client's backup interface was connected to a different switch and given a new IP address in a different VLAN. The most likely explanation is that there was some sort of issue with the original switch, as the problem was so inconsistent and nonsensical, but we are unable to prove this for certain.

    Thanks for everyone's help and input.
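
For the earlier suggestion to pull every entry for PID 23383, a small filter along these lines may help. It is a sketch only: it matches the PID as a whole number anywhere in a line rather than relying on a specific log layout, the sample lines are the job details quoted in the original post, and `lines_for_pid` is a made-up helper name.

```python
import re

# Lines taken from the job details quoted in the original post.
SAMPLE = [
    "24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964",
    "24/03/2015 21:42:10 - Info bptm(pid=7520) start",
    "24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out",
]


def lines_for_pid(lines, pid):
    """Yield lines that mention pid as a whole number, not as part of a longer one."""
    pattern = re.compile(r"(?<!\d)" + re.escape(str(pid)) + r"(?!\d)")
    for line in lines:
        if pattern.search(line):
            yield line.rstrip()


# Every line that mentions the dbclient process that reported the timeout.
for entry in lines_for_pid(SAMPLE, 23383):
    print(entry)
```

Because `lines_for_pid` accepts any iterable of strings, an open log file can be passed in place of `SAMPLE`. A short PID could also match digits inside a timestamp, so results for small numbers are worth checking by eye.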
