Notification emails not being sent on NetBackup appliance after 10.3/5.3 upgrade
TLDR: Not getting job failure notifications or catalog backup emails from a NetBackup appliance after an upgrade to 5.3/10.3, even though email test notifications work fine? Run "mailq" from a maintenance shell and see if mail is stuck in the queue. Then check the postfix service on the appliance from the maintenance shell; it may have stopped or died at some point. There may be a permissions error on a certain library, libpostfix-global.so. Change the permissions if needed and restart the postfix service from the maintenance shell.
Not TLDR:
I'm guessing this is one of those "one-offs" that likely won't happen to anyone else. However, if someone else ever sees something similar, here's what I went through:
I recently upgraded a customer of mine: 2 x 5240 appliances (edit: not 5230s! My bad), each acting as a Master/Media, with AIR. They were at version 3.1/8.1, so it was a two-phase upgrade: to 4.1/9.1 first, then some quick backup and recovery testing, then 5.3.0.1/10.3.0.1 (current as of Sept 2024). Afterwards I ran the firmware upgrade for each appliance, successfully.
After the upgrade, it looked like no backup job failure email notifications were being received from one appliance, and no catalog emails either. The other appliance was behaving normally. On both appliances, notifications for job failures and catalog backups had worked before the upgrade. Doing an "email test" from the Appliance CLISH under Settings > Notifications > Alert Configuration worked, though. No problem. There were no changes to the email configuration in the administration console either, in host properties, etc. Everything looked like it did before. Appliance CLISH hardware and software tests ran fine. A bpps check of NBU services from the maintenance shell looked fine.
It's been a while since I've had to dig into something like this, so it took me a while to figure out. This is opinion, impression, and anecdotal evidence only, and I do not claim to know what lies deep in the hearts and brains of Veritas engineers and developers.
This was a weird one, but I'll admit I don't play with NBU appliances as much as I once did.
First off, I checked the SMTP configuration in the appliance web UI, as I saw there was a note about the SMTP password sometimes needing to be re-entered after an upgrade (https://www.veritas.com/support/en_US/article.100061042). But no issues there.
I checked the verbose bpdbm and bpbrm logs, and the messages log, for any errors or anything weird with mailing out notifications. I didn't see anything unusual.
I didn't think anything would change in the admin console, but I went through and checked the host properties and email settings for the master servers. All were exactly as they were before.
As mentioned above, when I did an email test from the appliance CLISH... it worked. Came through without issue.
I went into the maintenance shell and sent some mail manually from the command line, using mailx/sendmail. It worked fine, and arrived immediately.
On a whim, though, and I don't know why I thought of it, I ran a "mailq" command from that maintenance shell: 107 emails in the queue.
At that point I was asking myself "how is that possible when alert test emails and mailing from the command line are working immediately?"
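If you want to script that queue check rather than eyeball it, here is a minimal sketch. It assumes the appliance's "mailq" is the Postfix one, whose output ends with a summary line such as "-- 52 Kbytes in 107 Requests." or reads "Mail queue is empty". The function name queued_count is made up for this example.

```shell
# Print the number of queued messages, given Postfix-style `mailq`
# output on stdin. Assumes the summary line looks like
# "-- 52 Kbytes in 107 Requests." or "Mail queue is empty".
# Exits non-zero if neither form is found.
queued_count() {
    awk '
        /^Mail queue is empty/          { print 0;        found = 1 }
        /^-- .* in [0-9]+ Requests?\.$/ { print $(NF-1);  found = 1 }
        END { if (!found) exit 1 }
    '
}

# Usage (from the maintenance shell):
#   mailq | queued_count
```

A non-zero count that keeps growing while test emails still arrive is the pattern described above.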
I didn't find any direct references to this regarding appliances (and if someone has those, I'd love to see them), but I did see a few vague references and long-dead Symantec tech articles about checking postfix config on BYOS Linux NBU installs for unrelated issues, and it made me wonder.
I went looking for a postfix service via "service postfix status". It was down. I then ran "journalctl -u postfix" and saw that the postfix service had died a few days earlier, after the upgrades and subsequent reboots. The error was: loading shared libraries libpostfix-global.so: permission denied
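To pick that kind of failure out of a long journal, a small filter helps. This is a sketch only: the function name failed_libs is hypothetical, and it assumes the journal lines contain the phrase "loading shared libraries" followed by the library name and a colon, as in the error quoted above.

```shell
# Read journal text on stdin and print the name of each shared library
# that failed to load, one per line, de-duplicated.
# Assumes lines contain "loading shared libraries", an optional colon,
# then the library name followed by a colon.
failed_libs() {
    sed -n 's/.*loading shared libraries:\{0,1\} *\([^: ]*\):.*/\1/p' | sort -u
}

# Usage (from the maintenance shell):
#   journalctl -u postfix --no-pager | failed_libs
```

An empty result just means no line matched the pattern; it does not prove the service is healthy, so still check the service status itself.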
In my case, I was able to do a "service postfix restart" from the elevated maintenance shell, and it started cleanly and immediately processed the whole mail queue successfully. If that doesn't work, you may end up having to make some manual permissions changes to that library. (If you're not comfortable with this, or with mucking around in the maintenance shell in general, please contact Veritas Support.)
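If the restart alone does not do it, a quick look at the library's file mode is a reasonable first check. The sketch below is an assumption-heavy illustration: the function name lib_readable_by_others is made up, it relies on GNU stat ("stat -c"), and it only inspects the classic permission bits. It says nothing about SELinux or about the directories above the file, either of which could also produce a "permission denied".

```shell
# Report whether "other" users can read a file, based on its octal mode.
# Uses GNU stat (-c '%a'); that being available on the appliance is an
# assumption. Returns 0 if readable, 1 if not, 2 if stat fails.
lib_readable_by_others() {
    mode=$(stat -c '%a' "$1") || return 2
    other=$((mode % 10))            # last octal digit = bits for "other"
    if [ $((other & 4)) -ne 0 ]; then
        echo "readable"
    else
        echo "not readable"
        return 1
    fi
}

# Usage: locate the library first (its path is not given in this post),
# then check it, for example:
#   find / -xdev -name 'libpostfix-global.so*' 2>/dev/null
#   lib_readable_by_others /path/reported/by/find
```

After any permissions change, restart the postfix service and run "mailq" again to confirm the queue drains.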
My theory, and it's not like I can easily prove it, is this: The appliance itself uses sendmail (or another MTA) for sending out hardware failure notifications, test emails, etc. The NBU installation/application environment, for lack of a better term, seems to use its own postfix MTA with configuration inferred from initial appliance settings. When the postfix service errored out, job failure and catalog notifications just filed into the mail queue with no notification.
The appliance itself is not aware of the postfix service. So it doesn't come up as a problem in the appliance software tests from the CLISH.
The NBU environment itself does not care about the postfix service either. It sends mail out on job failures or other conditions, and because that mail simply ends up in the queue, there's no error and nothing strange gets logged.
I could be right, wrong, or anything in between on this. However, it fixed the problem for me. If someone has some further information and sources for the appliances and how this is configured, I'd love to see it.
No logs to post as I do not have permission to do so from a sensitive environment.
May you never encounter this. I've done upgrades on appliances before and never encountered this. But if you do, I hope this provides something to reference.