Forum Discussion

Level 5

10 years ago

Oracle Backup Intermittent Status 41s

I have been investigating an intermittent issue with one of our Oracle clients over the last month or so, but unfortunately Symantec support have advised that they cannot find the cause and have recommended a rebuild of one of my master server nodes (v7.6.0.2). Although it's a long shot, I thought I would post here first just in case anyone has seen a similar issue before or has any other suggestions, as a rebuild is a last resort for us due to the amount of VCS / VVR / NBU configuration required. Logs from the master and media servers (bpbrm / bptm) and the client (bphdb / dbclient) have been provided multiple times but no cause has been found.

Database backup jobs for one particular Oracle client (Solaris 10 / Oracle 11.2.0.3) are intermittently failing with a status 41. However, the interesting thing is that we are backing up with 4 concurrent channels and only one of them fails - and when it fails, the other three continue with no problems and other child jobs continue to appear and complete until the end of the backup. The failed child job runs for around 30 minutes before actually failing without writing any data, which I assume is a timeout (client read?):

24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964
24/03/2015 21:42:10 - Info bptm(pid=7520) start
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out
24/03/2015 22:12:21 - end writing; write time: 0:30:11
network connection timed out(41)
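
As a sanity check on the timeout theory, the gap between "begin writing" and the status 41 can be worked out directly from the job details. This is only a minimal Python sketch: it assumes the day-first timestamp format shown in the excerpt above, and the function name `elapsed_between` is made up for illustration.

```python
from datetime import datetime

# Job detail lines copied from the failed child job above.
JOB_DETAILS = """\
24/03/2015 21:42:10 - begin writing
24/03/2015 21:42:10 - Info bptm(pid=7896) backup child process is pid 7520.6964
24/03/2015 21:42:10 - Info bptm(pid=7520) start
24/03/2015 22:12:21 - Info dbclient(pid=23383) done. status: 41: network connection timed out
24/03/2015 22:12:21 - end writing; write time: 0:30:11
network connection timed out(41)
"""

# Assumption: timestamps are day-first (dd/mm/yyyy), as in the excerpt.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def elapsed_between(details, start_marker, end_marker):
    """Return the time between the first line containing start_marker
    and the first later line containing end_marker."""
    start = None
    for line in details.splitlines():
        stamp, sep, message = line.partition(" - ")
        if not sep:
            continue  # no " - " separator, e.g. the final summary line
        when = datetime.strptime(stamp.strip(), TIMESTAMP_FORMAT)
        if start is None and start_marker in message:
            start = when
        elif start is not None and end_marker in message:
            return when - start
    raise ValueError("markers not found in job details")


gap = elapsed_between(JOB_DETAILS, "begin writing", "status: 41")
print(gap)  # 0:30:11
```

The result matches the "write time: 0:30:11" reported by the job, which is what makes me suspect a 30 minute (1800 second) timeout rather than a random network drop.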

The issue is only affecting Oracle backups on this client (filesystem backups are fine) and we have many other Oracle backup clients that are not affected (as well as hundreds of other backup clients that experience no problems). One thing that we did find is that the issue only occurs on one particular node of our master server cluster, but as far as I can tell the configuration is the same between nodes. The problem also occurs intermittently when the backups are run at different times (including during business hours).

Our networks team have put a lot of time and effort into monitoring and checking the network but this does not seem to be the cause of the problem - which makes sense as all other child jobs for the same backup are not interrupted.

Our master servers are running on VMware 5.5, and we have even moved the affected master node to a completely different cluster / platform which has changed the underlying hardware and network. I have reinstalled VMware tools and configured a new vNIC (with a different MAC) as well. We have also tried different switch and NIC ports on the client.

Any suggestions would be appreciated.

Windows Server (2003-2008)

Shaun_Taylor
10 years ago
The client's backup interface was connected to a different switch and given a new IP address in a different VLAN. The most likely explanation is that there was some sort of fault with the original switch, given how inconsistent and nonsensical the issue was, but we are unable to prove this for certain.

Thanks for everyone's help and input.

17 Replies


Nicolai
Moderator
10 years ago
Does this happen when the master server gets vMotioned?

What virtual NIC type do you use? VMXNET 3 should be the best choice.
Marianne
Level 6
10 years ago
We see a Client Connect Timeout here. This looks like a comms issue between the media server and the client. The logs that I would like to see at this point are dbclient and bpcd on the client, and bptm and bpbrm on the media server.
Michael_G_Ander
Level 6
10 years ago
Have you tried lowering the number of channels on the problematic client?

Each channel allocates some space/memory in the Oracle database and I have seen cases where the allocation of too many channels has exhausted the configured pool. If that is the case it can probably be seen in the Oracle alert log.

Another idea is to run RMAN without actually writing to NetBackup (I think the command is RMAN backup validate) to see if the problem is reading from the Oracle disks.
Shaun_Taylor
Level 5
10 years ago
Hi Nicolai - yes, the vNICs are all VMXNET 3 adapters.

The VM is rarely migrated between hosts (manually or by DRS) but I have double checked this and confirmed that it has not been vMotioned when the issue has occurred (which was usually once a night at various times).
Shaun_Taylor
Level 5
10 years ago
Hi Marianne - I had initially suggested the same but when we found that the issue only occurs when running on 1 of our 2 master server nodes (but using the same media servers) support focused on the master server side of things. The thing that is still confusing me is that 3 of 4 concurrent child jobs for the same database backup experience no issues but one times out after 30 minutes!

We are planning a further network change on the client today, so I will test further and provide another update.
Shaun_Taylor
Level 5
10 years ago
Hi Michael - we did decrease to 3 as a test but this made no difference. However I am keen to try a backup with 2 channels to test further. Just out of interest, what would you say is the usual number of channels allocated for a backup?

I believe a backup validate has been run with no problems on a few occasions, but I will confirm this as well as ask our DBAs to check the Oracle alert log.
Marianne
Level 6
10 years ago
If this is only happening with a specific master server node - has anyone had a look at bprd log on the master server?

If memory serves me right, bprd is expecting updates from the user_ops logs on the client.

Since dbclient is reporting the timeout, I would be interested in all entries for that particular PID: 23383.
What was the process trying to do at this point in time? And the previous entries for this PID?

About the one channel failing - is it with a particular db or random?
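
One way to pull those entries out of a very long log is to filter on the PID. The following is just a rough Python sketch. It assumes the dbclient log tags each line with the process ID in square brackets after the timestamp, so check that against your own log first. The function name and the sample lines are placeholders made up for illustration, not real log output.

```python
import re


def entries_for_pid(lines, pid):
    """Yield the log lines that carry the given PID in square brackets."""
    # Match the exact tag, so PID 2338 does not match "[23383]".
    tag = re.compile(r"\[" + re.escape(str(pid)) + r"\]")
    for line in lines:
        if tag.search(line):
            yield line.rstrip("\n")


# Placeholder lines for illustration only; not taken from the real logs.
SAMPLE = [
    "21:42:09.101 [23383] <2> placeholder: first entry for the failing process",
    "21:42:09.250 [23390] <2> placeholder: entry from another channel",
    "22:12:21.004 [23383] <16> placeholder: last entry before the timeout",
]

for entry in entries_for_pid(SAMPLE, 23383):
    print(entry)
```

For a real log, pass an open file object instead of SAMPLE. Reading the matches in order should show what the process was doing in the minutes before it gave up.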
Shaun_Taylor
Level 5
10 years ago
Thanks for the response Marianne.

I could be wrong, but I have checked through the logs requested by support and I can't see bprd from the master in there.

The problem has occurred on different databases on the client, but other clients (Oracle or otherwise) are unaffected.

My testing is somewhat restricted now as we essentially guarantee a backup failure of a business critical system, but I would like to collect more logs on the other master node at the earliest opportunity. Just to confirm, is it bprd on the master and dbclient on the client that you would like to see or just the bprd log? I should be able to share them if I substitute the hostnames and IPs.
Marianne
Level 6
10 years ago
I would like to see both.

I am amazed that Support is concentrating on the master server without looking at bprd!
This is the process holding all the 'bits' of the process flow together!

Hopefully the logs are not at level 5 (I do not have clever tools to sift through them) but knowing Symantec Support they would insist on level 5 logs.....

Please upload logs as File attachments - will have a look if time permits.
Hopefully you have the bprd and dbclient log that covers the specific period as per the Job details posted above?
Shaun_Taylor
Level 5
10 years ago
You're right - they insisted on level 5 logs and they are very, very long! What level would be sufficient for you?

Frustratingly, I can't find a matching set of job details / logs from that far back now and support weren't asking for job details after the early stages of troubleshooting. I'll collect new logs (with job details!) and provide them as soon as possible.

Thanks for your help so far - I will also go back to support and enquire about bprd just to see what they say.
