My first stop would be the DBA, to hear if anything had been changed on the database in question or if there is anything in the alert log.
1) What has changed: no changes at the DB end. The DBAs say that since it is failing at the media manager layer, it is not a problem on the DB side.
3) If there are any tech notes or VOX posts regarding the issue: I referred to article 000011002, but it is not exactly the issue in our environment. I'm unable to find a real document for this error:
"ORA-19506 ORA-27028 ORA-19511 - channel ch00 disabled, job failed on it will be run on another channel."
In the forums I referred to, some posts talk about checking for a hostname resolution error, some other communication error, or a firewall issue; others talk about permission problems on bp.conf and other such files.
It is not taking any of the channels defined in the RMAN script. But communication is fine between the NetBackup master and the host (we checked with bpclntcmd and this is okay).
The root job from NetBackup triggers as per the schedule, but this in turn should trigger the Oracle jobs (application backups), which is not happening.
The dbclient logs are not being generated, so to dig deeper into the issue, we have planned to restart the services on the host.
You need to test the connection between the client (database server) and the media/master server. This will tell you if there is communication between them.
1. Check if NetBackup is running: bpps -x
2. Test the connection: bptestbpcd -debug -verbose -client <client>, where client = media server and master server name
3. In bp.conf, add the storage unit's media server to the server list
4. Rerun the backup
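As a rough complement to the steps above, a plain TCP probe can show whether the NetBackup ports are reachable at all before you dig into bptestbpcd output. This is only a minimal Python sketch, not a replacement for bptestbpcd. The function names are made up for illustration, and the port list (1556 for PBX, 13724 for vnetd, 13782 for bpcd) assumes the default NetBackup ports, which your site may have changed.

```python
import socket

# Default NetBackup ports (assumption: your site has not changed them).
DEFAULT_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(hosts, ports=DEFAULT_PORTS):
    """Return {(host, service): reachable} for every host/port pair."""
    return {
        (host, name): port_is_open(host, port)
        for host in hosts
        for name, port in ports.items()
    }
```

You would call it as probe(["master01", "media01"]), where both host names are placeholders for your own master and media servers. A closed port here points at a firewall or a service that is down; an open port only proves the TCP layer works, so bptestbpcd is still needed to confirm the NetBackup connection itself.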
If the dbclient logs are not generated and the dbclient folder has 777 permissions, you probably have a problem somewhere in the shell script. This can often be seen in the bphdb logs.
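To rule out the permissions side quickly, something like the following could be run on the Oracle client. It is a sketch under assumptions: the paths assume a default UNIX install under /usr/openv, and describe_dir is a hypothetical helper name, not a NetBackup tool.

```python
import os
import stat

# Log locations on a UNIX client (assumption: default /usr/openv install path).
LOG_DIRS = [
    "/usr/openv/netbackup/logs/bphdb",
    "/usr/openv/netbackup/logs/dbclient",
    "/usr/openv/netbackup/logs/user_ops",
]


def describe_dir(path):
    """Return a short status string for one log directory."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISDIR(st.st_mode):
        return "not a directory"
    mode = stat.S_IMODE(st.st_mode)
    # 0o777 = read/write/execute for owner, group, and others.
    if mode & 0o777 == 0o777:
        return "ok (mode %o)" % mode
    return "restricted (mode %o)" % mode
```

Looping over LOG_DIRS and printing describe_dir(path) for each gives a one-screen view of which folders are missing or restricted. It only looks at the mode bits, so it will not catch problems higher up the directory tree.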
https://vox.veritas.com/t5/NetBackup/General-database-backup-error-troubleshooting/m-p/675381 gives my way of troubleshooting errors like this
Instead of just relying on bpclntcmd, which only really checks host name resolution, try running bptestbpcd from <install path>\admincmd. This will actually verify NetBackup's connection over bpcd.
The only thing I did from my end was create the user_ops and dbclient folders on the host (the oracle user didn't have read/write permissions on this folder initially).
Later I asked the DB team to submit the jobs from RMAN, and this worked.
I'm still confused and unable to understand why it didn't trigger initially. What are all the parameters for allocating channels?
Also, I have seen Oracle jobs fail sometimes. We don't see a complete failure: 5 or 6 jobs fail out of 30, while the jobs that ran before and after them succeed. How can I validate that I have a good backup of the databases, given that it normally says the job will be retried on another channel? How can we confirm that it was retried, and whether the data is complete or corrupt?
Well, if the oracle user didn't have read/write permissions originally, that could have caused the error. From what you're saying, it sounds like once the permissions were given, you asked the DBA to submit the job from RMAN and it worked successfully. Is that correct?
To validate whether the backups of the DBs are good, if you use OEM (Oracle Enterprise Manager), you could check the DB there and look at the backup reports. They would say if the backup completed successfully even if errors/warnings were thrown. Alternatively, you could look at the RMAN logs.
But these backups were running fine before without the permission to those folders. Yes, I tried submitting the job for the host manually from NetBackup, but that failed to call the corresponding Oracle jobs. So the root job completed successfully, but it never initiated the application backups.
We have a shell script that calls another RMAN script as the oracle user, which takes the Oracle backups.
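If you go the RMAN log route, a small script can at least tell you which channels were disabled and which error codes showed up, so you know which jobs to follow up on. The sketch below is an assumption-heavy starting point: summarize_rman_log is a hypothetical helper, and the "channel ... disabled" pattern is taken from the error text quoted earlier in this thread rather than from any documented log format.

```python
import re

# Matches Oracle/RMAN error codes such as ORA-19506 or RMAN-03009.
ERROR_CODE = re.compile(r"\b(?:ORA|RMAN)-\d{5}\b")
# Matches the "channel ch00 disabled" message quoted in this thread.
DISABLED = re.compile(r"channel (\S+) disabled")


def summarize_rman_log(text):
    """Collect the distinct error codes and disabled channels in a log."""
    return {
        "error_codes": sorted(set(ERROR_CODE.findall(text))),
        "disabled_channels": sorted(set(DISABLED.findall(text))),
    }
```

An empty result does not prove the backup is restorable; it only means the log shows no obvious failure. For real assurance the DBA would still need to check the backup pieces from RMAN itself, for example with a RESTORE ... VALIDATE run.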
When my DBA submitted the jobs from RMAN it worked.
You need logs to know why jobs are failing:
On Oracle client: bphdb and dbclient (with 777 permissions)
On master: bprd (NBU must be restarted after creating folder)
On media server: bptm and bpbrm
Level 0 is fine under most circumstances but sometimes a bit more is needed. Level 3 is normally fine.
If you want to log a Support call, they will request level 5.
(Please do not post level 5 logs here...)
If failures are intermittent on large DBs, try increasing the Client Connect and Client Read Timeouts on the media server.
A value of 1800 is normally sufficient.
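Before changing anything, it can help to see what the media server currently has configured. The sketch below assumes a UNIX media server where these settings live in bp.conf as CLIENT_CONNECT_TIMEOUT and CLIENT_READ_TIMEOUT lines in KEY = value form; on Windows they are not kept in a bp.conf file, so this would not apply there. The function names are hypothetical.

```python
def parse_bp_conf(text):
    """Parse 'KEY = value' lines; keys may repeat (for example SERVER)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and anything that is not a KEY = value pair.
        # Skipping '#' lines is purely defensive.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries.setdefault(key.strip(), []).append(value.strip())
    return entries


def timeouts_below(entries, minimum=1800):
    """Return the timeout keys that are unset (None) or below minimum."""
    low = {}
    for key in ("CLIENT_CONNECT_TIMEOUT", "CLIENT_READ_TIMEOUT"):
        values = entries.get(key)
        current = int(values[-1]) if values else None
        if current is None or current < minimum:
            low[key] = current
    return low
```

Feeding it the contents of /usr/openv/netbackup/bp.conf shows which of the two timeouts are unset (reported as None, meaning the NetBackup default applies) or lower than 1800. The same parse_bp_conf output also lists the SERVER entries, which is a quick way to confirm the media server is in the server list.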
Best to look at logs before you change anything.