Oracle Backup Intermittent Status 41s

Shaun_Taylor
Level 5
Certified

I have been investigating an intermittent issue with one of our Oracle clients over the last month or so, but unfortunately Symantec support have advised that they cannot find the cause and have recommended a rebuild of one of my master server nodes (v7.6.0.2). Although it's a long shot, I thought I would post here first just in case anyone has seen a similar issue before or has any other suggestions, as a rebuild is a last resort for us due to the amount of VCS / VVR / NBU configuration required. Logs from the master and media servers (bpbrm / bptm) and the client (bphdb / dbclient) have been provided multiple times but no cause has been found.

Database backup jobs for one particular Oracle client (Solaris 10 / Oracle 11.2.0.3) are intermittently failing with a status 41. However, the interesting thing is that we are backing up with 4 concurrent channels and only one of them fails - and when it fails, the other three continue with no problems and other child jobs continue to appear and complete until the end of the backup. The failed child job runs for around 30 minutes before actually failing without writing any data, which I assume is a timeout (client read?):

24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964      
24/03/2015 21:42:10 - Info bptm(pid=7520) start           
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out     
24/03/2015 22:12:21 - end writing; write time: 0:30:11
network connection timed out(41)
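
As a rough check on the timeout theory, here is a minimal Python sketch that measures how long the child job sat between "begin writing" and the status 41. It assumes the job-detail lines always start with a DD/MM/YYYY HH:MM:SS stamp as they do above; the function name `seconds_between` is hypothetical and not part of any NetBackup tooling.

```python
from datetime import datetime

# Job details copied from the post above (trailing spaces trimmed).
JOB_DETAILS = """\
24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964
24/03/2015 21:42:10 - Info bptm(pid=7520) start
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out
24/03/2015 22:12:21 - end writing; write time: 0:30:11
"""

def seconds_between(details, start_marker, end_marker):
    """Return elapsed seconds between the first lines containing each marker."""
    stamps = {}
    for line in details.splitlines():
        for marker in (start_marker, end_marker):
            if marker in line and marker not in stamps:
                # Assumption: the first 19 characters are the timestamp.
                stamps[marker] = datetime.strptime(line[:19], "%d/%m/%Y %H:%M:%S")
    return int((stamps[end_marker] - stamps[start_marker]).total_seconds())

elapsed = seconds_between(JOB_DETAILS, "begin writing", "status: 41")
print(elapsed, "seconds before the status 41")
```

The gap comes out just over 1800 seconds, which would be consistent with a timeout configured at 30 minutes, although that is an inference and nothing in the job details confirms which timeout fired.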

The issue is only affecting Oracle backups on this client (filesystem backups are fine) and we have many other Oracle backup clients that are not affected (as well as hundreds of other backup clients that experience no problems). One thing that we did find is that the issue only occurs on one particular node of our master server cluster, but as far as I can tell the configuration is the same between nodes. The problem has also occurred intermittently when running the backups at different times (including during business hours).

Our networks team have put a lot of time and effort into monitoring and checking the network but this does not seem to be the cause of the problem - which makes sense as all other child jobs for the same backup are not interrupted.

Our master servers are running on VMware 5.5, and we have even moved the affected master node to a completely different cluster / platform which has changed the underlying hardware and network. I have reinstalled VMware tools and configured a new vNIC (with a different MAC) as well. We have also tried different switch and NIC ports on the client.

Any suggestions would be appreciated.

2 ACCEPTED SOLUTIONS

Accepted Solutions

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

If this is only happening with a specific master server node - has anyone had a look at bprd log on the master server?

If memory serves me right, bprd is expecting updates from the user_ops logs on the client.

Since dbclient is reporting the timeout, I would be interested in all entries for that particular PID: 23383. 
What was the process trying to do at this point in time? And the previous entries for this PID?

About the one channel failing - is it with a particular DB or random?


Shaun_Taylor
Level 5
Certified

The client's backup interface was connected to a different switch and given a new IP address in a different VLAN. The most likely suggestion is that there was some sort of issue with the original switch as the issue was so inconsistent and nonsensical but we are unable to prove this for certain.

Thanks for everyone's help and input.


17 REPLIES

Nicolai
Moderator
Moderator
Partner    VIP   

Does this happen when the master server gets vMotioned?

What virtual NIC type do you use? The best choice should be VMXNET 3.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified
We see a Client Connect Timeout here. This looks like a comms issue between the media server and the client. The logs that I would like to see at this point are dbclient and bpcd on the client and bptm and bpbrm on the media server.

Michael_G_Ander
Level 6
Certified

Have you tried lowering the number of channels on the problematic client?

Each channel allocates some space/memory in the Oracle database and I have seen cases where the allocation of too many channels has exhausted the configured pool. If that is the case it can probably be seen in the Oracle alert log.
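
One way to check the alert log for this is sketched below in Python: it simply counts every ORA-nnnnn code it finds. The sample lines are placeholders, not real alert log output, and `count_ora_errors` is a hypothetical helper name. ORA-04031 (a shared memory allocation failure) appears in the sample only as an example of the kind of code to look for.

```python
import re
from collections import Counter

# Matches Oracle error codes such as ORA-04031.
ORA_RE = re.compile(r"\bORA-\d{5}\b")

def count_ora_errors(lines):
    """Count each ORA-nnnnn code that appears in the given alert log lines."""
    counts = Counter()
    for line in lines:
        counts.update(ORA_RE.findall(line))
    return counts

# Placeholder lines for illustration only, not real alert log content.
sample = [
    "Tue Mar 24 21:50:02 2015",
    "ORA-04031: placeholder text",
    "ORA-04031: placeholder text",
    "Tue Mar 24 22:01:47 2015",
    "ORA-00600: placeholder text",
]
errors = count_ora_errors(sample)
print(errors.most_common())
```

Against a real alert log you would pass an open file handle instead of `sample`, then compare the timestamps around any hits with the time of the failed child job.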

Another idea is to run RMAN without actually writing to NetBackup (I think the command is RMAN BACKUP VALIDATE) to see if the problem is reading from the Oracle disks.

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

Shaun_Taylor
Level 5
Certified

Hi Nicolai - yes, the vNICs are all VMXNET 3 adapters.

The VM is rarely migrated between hosts (manually or by DRS) but I have double checked this and confirmed that it has not been vMotioned when the issue has occurred (which was usually once a night at various times).

Shaun_Taylor
Level 5
Certified

Hi Marianne - I had initially suggested the same but when we found that the issue only occurs when running on 1 of our 2 master server nodes (but using the same media servers) support focused on the master server side of things. The thing that is still confusing me is that 3 of 4 concurrent child jobs for the same database backup experience no issues but one times out after 30 minutes!

We are planning a further network change on the client today, so I will test further and provide another update.

Shaun_Taylor
Level 5
Certified

Hi Michael - we did decrease to 3 as a test but this made no difference. However I am keen to try a backup with 2 channels to test further. Just out of interest, what would you say is the usual number of channels allocated for a backup?

I believe a backup validate has been run with no problems on a few occasions, but I will confirm this as well as ask our DBAs to check the Oracle alert log.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

If this is only happening with a specific master server node - has anyone had a look at bprd log on the master server?

If memory serves me right, bprd is expecting updates from the user_ops logs on the client.

Since dbclient is reporting the timeout, I would be interested in all entries for that particular PID: 23383. 
What was the process trying to do at this point in time? And the previous entries for this PID?
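
To pull out every entry for one PID from a long level 5 log, a small Python filter like the following may help. It assumes the legacy debug log layout "HH:MM:SS.mmm [PID] <level> message"; if the dbclient log on your client looks different, the regular expression needs adjusting. The sample lines are placeholders rather than real NetBackup output, and `entries_for_pid` is a hypothetical name.

```python
import re

# Assumed line layout: "HH:MM:SS.mmm [PID] <level> message".
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (.*)$")

def entries_for_pid(lines, pid):
    """Yield (timestamp, level, message) for log lines written by `pid`."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and int(match.group(2)) == pid:
            yield match.group(1), int(match.group(3)), match.group(4)

# Placeholder lines for illustration only, not real NetBackup output.
sample = [
    "21:42:10.001 [23383] <4> placeholder entry one",
    "21:42:11.250 [23390] <4> placeholder entry from another channel",
    "22:12:21.900 [23383] <16> placeholder entry two",
]
matches = list(entries_for_pid(sample, 23383))
for stamp, level, message in matches:
    print(stamp, level, message)
```

Because the function accepts any iterable of lines, an open log file can be passed in directly, so even a very long log is read one line at a time.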

About the one channel failing - is it with a particular DB or random?

Shaun_Taylor
Level 5
Certified

Thanks for the response Marianne.

I could be wrong, but I have checked through the logs requested by support and I can't see bprd from the master in there.

The problem has occurred on different databases on the client, but other clients (Oracle or otherwise) are unaffected.

My testing is somewhat restricted now as we essentially guarantee a backup failure of a business critical system, but I would like to collect more logs on the other master node at the earliest opportunity. Just to confirm, is it bprd on the master and dbclient on the client that you would like to see or just the bprd log? I should be able to share them if I substitute the hostnames and IPs.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

I would like to see both.

I am amazed that Support is concentrating on the master server without looking at bprd! 
This is the process holding all the 'bits' of the process flow together!

Hopefully the logs are not at level 5 (I do not have clever tools to sift through them) but knowing Symantec Support they would insist on level 5 logs.....

Please upload logs as File attachments - will have a look if time permits.
Hopefully you have the bprd and dbclient log that covers the specific period as per the Job details posted above?

Shaun_Taylor
Level 5
Certified

You're right - they insisted on level 5 logs and they are very, very long! What level would be sufficient for you?

Frustratingly, I can't find a matching set of job details / logs from that far back now and support weren't asking for job details after the early stages of troubleshooting. I'll collect new logs (with job details!) and provide them as soon as possible.

Thanks for your help so far - I will also go back to support and enquire about bprd just to see what they say.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

I normally find everything I need in level 3 logs.

Just be aware that NBU on the master server needs to be restarted to apply any changes to bprd logging.

Yes - my question to Support would be - if you suspect that the issue is with the master server, why have you never requested bprd log?

The error in Job details still looks like a comms issue between the client and media server, but the dbclient log entries for PID 23383 should tell us if the client was trying to talk to the master or media server when the process timed out.

What is of utmost importance is to get a Backline Symantec engineer who specializes in DB backups who will trace the process flow to see where exactly the break in comms is.

Shaun_Taylor
Level 5
Certified

I have already received a response and I was advised that the dbclient log showed that the client could not communicate with the master during the comm file update and that any bprd logs would simply tell us the same.

However, this only returns me to the persistent question of - if this truly was a communication / network issue, why would 3 other child jobs for the same database backup continue uninterrupted, and why can our networks team not see any issues at the time of the failure?

The case has been escalated to a backline engineer on a couple of occasions and they have also now advised rebuilding the server as they do not know the cause.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

It would be interesting to at least cross-reference dbclient and bprd logs to see the request sent by the client and if the request was received by the master server....
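
A cross-reference like this can be sketched in Python by interleaving the two logs into one timeline. The sketch assumes both logs are already in time order, that each line starts with a 12-character HH:MM:SS.mmm stamp, and that the client and master clocks are in sync. The sample lines are placeholders, not real NetBackup output, and `merge_logs` is a hypothetical helper.

```python
import heapq

def merge_logs(**sources):
    """Interleave several time-sorted logs, tagging each line with its source."""
    def tagged(name, lines):
        for line in lines:
            # Assumption: the first 12 characters are HH:MM:SS.mmm.
            yield line[:12], name, line[13:].rstrip("\n")
    streams = [tagged(name, lines) for name, lines in sources.items()]
    return list(heapq.merge(*streams, key=lambda item: item[0]))

# Placeholder lines for illustration only, not real NetBackup output.
dbclient = [
    "22:12:20.100 [23383] <4> placeholder: request sent",
    "22:12:21.900 [23383] <16> placeholder: timed out",
]
bprd = [
    "22:12:20.400 [1111] <4> placeholder: request received",
]
timeline = merge_logs(dbclient=dbclient, bprd=bprd)
for stamp, source, rest in timeline:
    print(stamp, source, rest)
```

If a request shows up on the dbclient side with no matching bprd entry shortly afterwards, that would point at the path between client and master rather than at either process. Note that the string comparison on time-of-day stamps only works within a single day's log.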

Shaun_Taylor
Level 5
Certified

I'll try to grab fresh bprd and dbclient logs overnight - they will be level 5 though in case I pass them on to support as well so feel free to ignore them if you don't have time or they're too long!

Shaun_Taylor
Level 5
Certified

Last night's backups ran fine following the network change (although we don't understand why) - I will monitor over the weekend and provide a further update.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

Curious to know what kind of network change....

Shaun_Taylor
Level 5
Certified

The client's backup interface was connected to a different switch and given a new IP address in a different VLAN. The most likely suggestion is that there was some sort of issue with the original switch as the issue was so inconsistent and nonsensical but we are unable to prove this for certain.

Thanks for everyone's help and input.