04-08-2021 09:47 AM - edited 04-08-2021 09:48 AM
Hello Veritas community,
We're running a master server + media server (Linux versions) on NetBackup 8.1. NetBackup processes started failing with database connection errors. Our first step in trying to resolve this was to stop the NetBackup processes and restart them. Unfortunately, that did not resolve our issue. I have been digging through posts and found the common semaphore issue, but the values set seem to be ok (posted below). I'm also adding a few error logs from /var/log/messages. Any help or direction would be greatly appreciated.
Thank you
https://www.veritas.com/support/en_US/article.100023842
Semaphore settings output:
sysctl -a | grep kernel.sem
kernel.sem = 400 307200 128 1024
kernel.sem_next_id = -1
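For readers comparing these numbers against the linked article: the four kernel.sem values are, in order, SEMMSL (max semaphores per set), SEMMNS (max semaphores system-wide), SEMOPM (max operations per semop call), and SEMMNI (max semaphore sets). A minimal sketch that labels them follows. The function name label_sem is made up for illustration, and the script only prints the values. It does not judge whether they are adequate for NetBackup.

```shell
#!/bin/sh
# Label the four kernel.sem fields (order: SEMMSL SEMMNS SEMOPM SEMMNI).
# label_sem is a hypothetical helper name, not a NetBackup or system tool.
# Input: one line in the form printed by `sysctl kernel.sem`.
label_sem() {
    # Drop everything up to "=", then split the remainder on whitespace.
    set -- $(printf '%s\n' "$1" | sed 's/^[^=]*=//')
    printf 'SEMMSL=%s SEMMNS=%s SEMOPM=%s SEMMNI=%s\n' "$1" "$2" "$3" "$4"
}

# Value copied from the output above; on a live system pass "$(sysctl kernel.sem)".
label_sem "kernel.sem = 400 307200 128 1024"
```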
============
/var/log/messages
Apr 8 07:28:12 <master_hostname> bash: Error receiving command message: No data to receive from socket.
# THIS BLOCK MAY BE RELEVANT (I replaced our actual hostname with master_hostname)
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'EMM_DATA' in file '/usr/openv/db/data/EMM_DATA.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'EMM_INDEX' in file '/usr/openv/db/data/EMM_INDEX.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'DBM_DATA' in file '/usr/openv/db/data/DBM_DATA.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'DBM_INDEX' in file '/usr/openv/db/data/DBM_INDEX.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'DARS_DATA' in file '/usr/openv/db/data/DARS_DATA.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'DARS_INDEX' in file '/usr/openv/db/data/DARS_INDEX.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'JOBD_DATA' in file '/usr/openv/db/data/JOBD_DATA.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'SLP_DATA' in file '/usr/openv/db/data/SLP_DATA.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): Opening dbspace 'SLP_INDEX' in file '/usr/openv/db/data/SLP_INDEX.db' for database 'NBDB'
Apr 7 09:25:45 <master_hostname> SQLAnywhere(nb_<master_hostname>): This database is licensed for use with:
Apr 7 09:29:33 <master_hostname> SQLAnywhere(nbappdb): Starting database "NBAPPDB" (/opt/NBUAppliance/db/data/NBAPPDB.db) at Wed Apr 07 2021 09:29
Apr 7 09:29:34 <master_hostname> SQLAnywhere(nbappdb): This database is licensed for use with:
Apr 7 10:15:30 <master_hostname> SQLAnywhere(nbazdb): Starting database "NBAZDB" (/usr/openv/db/staging/NBAZDB.db) at Wed Apr 07 2021 10:15
Apr 7 10:15:33 <master_hostname> SQLAnywhere(nbdb): Starting database "NBDB" (/usr/openv/db/staging/NBDB.db) at Wed Apr 07 2021 10:15
Apr 7 10:15:33 <master_hostname> SQLAnywhere(nbdb): Opening dbspace 'EMM_DATA' in file '/usr/openv/db/staging/EMM_DATA.db' for database 'NBDB'
# SAME BLOCK STARTS OVER AT THIS POINT
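To make it easier to see which database files each SQLAnywhere start-up used, the "Starting database" lines can be reduced to name and path. This is only a sketch. The function name db_starts is hypothetical, and the pattern assumes the message format shown in the excerpt above.

```shell
#!/bin/sh
# Reduce SQLAnywhere "Starting database" syslog lines to "<name> <path>".
# db_starts is a hypothetical helper; the pattern assumes the format quoted above.
db_starts() {
    sed -n 's/.*Starting database "\([^"]*\)" (\([^)]*\)).*/\1 \2/p'
}

# Three lines copied from the excerpt above; on a live system the input
# would be /var/log/messages instead.
db_starts <<'EOF'
Apr 7 09:29:33 <master_hostname> SQLAnywhere(nbappdb): Starting database "NBAPPDB" (/opt/NBUAppliance/db/data/NBAPPDB.db) at Wed Apr 07 2021 09:29
Apr 7 10:15:30 <master_hostname> SQLAnywhere(nbazdb): Starting database "NBAZDB" (/usr/openv/db/staging/NBAZDB.db) at Wed Apr 07 2021 10:15
Apr 7 10:15:33 <master_hostname> SQLAnywhere(nbdb): Starting database "NBDB" (/usr/openv/db/staging/NBDB.db) at Wed Apr 07 2021 10:15
EOF
```

In the excerpt, the 10:15 entries use paths under /usr/openv/db/staging, while the 09:25 dbspace entries use /usr/openv/db/data. Whether that difference matters here is not established in this thread.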
============
04-08-2021 11:45 AM
What do you get if you just run "bperror"?
/var/log/messages doesn't show anything useful, but check that you haven't run out of disk space. NetBackup will shut down the database to protect itself if the volume is getting low on space.
04-08-2021 12:45 PM
Thank you @StoneRam-Simon
maintenance-!> bperror
Error occurred during initialization, check master configuration file
Space looks good as far as I can tell:
maintenance-!> df -hl
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/system-root        90G   38G   48G  45% /
devtmpfs                       63G     0   63G   0% /dev
tmpfs                          63G     0   63G   0% /dev/shm
tmpfs                          63G  9.8G   54G  16% /run
tmpfs                          63G     0   63G   0% /sys/fs/cgroup
/dev/sdb2                     198M   46M  143M  25% /boot
/dev/mapper/system-inst        50G  9.0G   38G  20% /inst
/dev/mapper/system-rep        241G   14G  215G   6% /repository
/dev/mapper/system-log        183G   35G  140G  20% /log
tmpfs                         4.0K     0  4.0K   0% /dev/vx
tmpfs                          13G     0   13G   0% /run/user/888
/dev/vx/dsk/nbuapp/advol      2.0T  3.9G  2.0T   1% /advanceddisk/dp1/advol
/dev/vx/dsk/nbuapp/catvol     1.5T  952G  544G  64% /cat
/dev/vx/dsk/nbuapp/pdcatvol   118G  4.5G  113G   4% /msdp/cat
/dev/vx/dsk/nbuapp/pdvol       36T  4.6T   31T  13% /msdp/data/dp1/pdvol
/dev/vx/dsk/nbuapp/cfgvol     100G   42G   58G  43% /config
/dev/vx/dsk/nbuapp/1pdvol      15G   13G  1.6G  90% /msdp/data/dp1/1pdvol
tmpfs                          13G     0   13G   0% /run/user/0
maintenance-!>
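A quick way to scan a long df listing like the one above is to filter on the Use% column. The sketch below is an illustration only. The helper name flag_full and the 85% default are assumptions, not NetBackup thresholds, and it expects df -P style rows whose mount points contain no spaces.

```shell
#!/bin/sh
# Print "<mount point> <use%>" for rows at or above a usage threshold.
# flag_full and the default of 85 are illustrative assumptions.
# Reads `df -P`-style text on stdin; the first (header) line is skipped.
flag_full() {
    awk -v limit="${1:-85}" 'NR > 1 {
        use = $5; sub(/%/, "", use)              # "90%" -> "90"
        if (use + 0 >= limit + 0) print $6, $5   # mount point, usage
    }'
}

# Header plus two rows copied from the df output above.
printf '%s\n' \
    'Filesystem Size Used Avail Use% Mounted on' \
    '/dev/vx/dsk/nbuapp/catvol 1.5T 952G 544G 64% /cat' \
    '/dev/vx/dsk/nbuapp/1pdvol 15G 13G 1.6G 90% /msdp/data/dp1/1pdvol' | flag_full 85
```

On a live system the input would be `df -P | flag_full 85`. With the listing above, only /msdp/data/dp1/1pdvol (90%) meets that threshold.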
04-08-2021 01:12 PM
I see this is a NetBackup Appliance.
You would need to elevate to be able to troubleshoot it further.
The message from bperror would indicate that either the processes are not running or something has changed the configuration.
Can you provide any other background about what was happening prior to the issue starting or what remedies you have tried?
One thing that can stop services from working is expired certificates. I'm not sure this is your issue, but it's worth checking.
https://www.veritas.com/support/en_US/downloads/update.UPD191278
https://www.veritas.com/support/en_US/article.100043900
04-08-2021 02:36 PM
I elevated and ran the bperror command. Unfortunately, I get "bperror: must be superuser to execute".
Operation was normal before the database connection errors began. The last jobs to run successfully were Catalog Backup jobs. The entries below were the first of many similar database connection errors (status 2505).
[FROM DUPLICATION JOB] Apr 5, 2021 5:39:48 AM - Error bptm (pid=91289) db_getIMAGE() failed: Unable to connect to the database (2505)
Apr 5, 2021 5:39:48 AM - Error bpduplicate (pid=91254) write host <master_hostname>: extended error status has been encountered, check logs (252)
Apr 5, 2021 5:39:48 AM - Error bpduplicate (pid=91254) Duplicate of backupid <master_hostname>_1617613380 failed, extended error status has been encountered, check logs (252).
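Since many similar entries followed, tallying the trailing status codes can show whether 2505 is the only underlying error. A rough sketch, assuming each job-detail line ends with the status code in parentheses as in the lines above (the function name status_codes is hypothetical):

```shell
#!/bin/sh
# Tally trailing "(NNN)" status codes from job-detail lines on stdin.
# status_codes is a hypothetical helper; it assumes the code is the last
# thing on the line, optionally followed by a period.
status_codes() {
    sed -n 's/.*(\([0-9][0-9]*\))\.\{0,1\}$/\1/p' |
        sort -n | uniq -c | awk '{ print $2, $1 }'   # prints "<code> <count>"
}

# The three entries quoted above.
status_codes <<'EOF'
Apr 5, 2021 5:39:48 AM - Error bptm (pid=91289) db_getIMAGE() failed: Unable to connect to the database (2505)
Apr 5, 2021 5:39:48 AM - Error bpduplicate (pid=91254) write host <master_hostname>: extended error status has been encountered, check logs (252)
Apr 5, 2021 5:39:48 AM - Error bpduplicate (pid=91254) Duplicate of backupid <master_hostname>_1617613380 failed, extended error status has been encountered, check logs (252).
EOF
```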
I have tried NetBackup Stop, NetBackup Start to restart all processes. I had rebooting the master server in mind, but I'm a bit hesitant. I wanted to reach out here to see if I could get any other pointers before going for a reboot.
Both master and media server certificates look good. Checked this in the NetBackup Administration Console: Security Management > Certificate Management and certs show Active.
04-08-2021 07:41 PM
If I were to go for the master server reboot to clear this problem, what is the best-practice method of performing the reboot? Any services or processes that need to go down prior to the reboot?
I want to be sure things come back up without issue.
Thank you for all the insight. @StoneRam-Simon
04-08-2021 09:20 PM - edited 04-08-2021 09:22 PM
This sounds like the NetBackup database (NBDB) is down.
Elevate to a root prompt on the appliance and try running the following commands.
/usr/openv/db/bin/nbdb_ping
Try to start NBDB alone
/usr/openv/netbackup/bin/nbdbms_start_stop start
Additionally, post /usr/openv/var/global/server.log from the master server.
Also check whether the Private Branch Exchange (PBX) service is running.
/usr/openv/netbackup/bin/bpps -x
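The checks above could be strung together roughly as follows. This is a sketch, assuming the paths quoted in this post and a root shell. The wrapper run_if_present is a made-up helper that reports a missing binary instead of failing, and the start command is left commented out so nothing is changed until the ping result has been looked at.

```shell
#!/bin/sh
# Run a command only if its binary exists and is executable; otherwise say so.
# run_if_present is a hypothetical wrapper, not a NetBackup tool.
run_if_present() {
    if [ -x "$1" ]; then
        echo "== $*"
        "$@"
    else
        echo "missing: $1"
    fi
}

run_if_present /usr/openv/db/bin/nbdb_ping        # is NBDB answering?
run_if_present /usr/openv/netbackup/bin/bpps -x   # process list, used above to check PBX

# Uncomment only after the ping shows that NBDB is down:
# run_if_present /usr/openv/netbackup/bin/nbdbms_start_stop start
```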
04-22-2021 11:24 AM
If this is an appliance and you're already getting failures then you're hurting absolutely nothing by rebooting - it's already broken remember ? =)
Just for kicks you can try cycling the NBU processes first (~5 minutes) down and up to see if it restarts cleanly, but if not a reboot should be your next step - kick it, go spend 30-45 minutes at lunch, and it'll probably be happy by the time you come back.