Netbackup 5.1 bpsched issue on HP-UX

etmsreec · ‎01-10-2024

Hi,

Newbie here, so apologies for any errors in formatting or etiquette.

We have a Netbackup installation running v5.1MP6 on HP-UX. Yes, it's old, yes we'd like to retire it or upgrade.

It seems to have stopped scheduling jobs by itself. It was reliable and happy up to just before the holidays, but since then the scheduling of backups has started to fail.

We have restarted NBU a couple of times, which doesn't seem to have made any difference.

We are seeing the following message in the bpsched log file (once I'd created the directory for it to log bpsched to):

10:21:30.620 [16340] <2> bpsched: INITIATING (verbose=0) ...
10:21:30.621 [16340] <2> bpsched: BPRD PPID = 20591
10:21:30.621 [16340] <2> logparams: /usr/openv/netbackup/bin/bpsched -ppid 20591
10:21:30.621 [16340] <2> bpsched_main: wait_on_que=0, timeout_in_que=36000, reread_interval=300,queue_on_error=0, bptm_query_timeout=480, s_m 0
10:21:30.871 [16340] <2> LOCAL CLASS_ATT_DEFS: Product ID = 6
10:21:31.109 [16340] <8> bpsched_main: another regular bpsched is already examining the policy configuration
10:21:31.110 [16340] <4> bpsched: scheduler exiting - regular bpsched is already running (214)
10:31:30.249 [18497] <2> bpsched: INITIATING (verbose=0) ...
10:31:30.249 [18497] <2> bpsched: BPRD PPID = 20591
10:31:30.249 [18497] <2> logparams: /usr/openv/netbackup/bin/bpsched -ppid 20591
10:31:30.249 [18497] <2> bpsched_main: wait_on_que=0, timeout_in_que=36000, reread_interval=300,queue_on_error=0, bptm_query_timeout=480, s_m 0
10:31:30.482 [18497] <2> LOCAL CLASS_ATT_DEFS: Product ID = 6
10:31:30.714 [18497] <8> bpsched_main: another regular bpsched is already examining the policy configuration
10:31:30.715 [18497] <4> bpsched: scheduler exiting - regular bpsched is already running (214)
10:41:30.389 [20828] <2> bpsched: INITIATING (verbose=0) ...
10:41:30.389 [20828] <2> bpsched: BPRD PPID = 20591
10:41:30.389 [20828] <2> logparams: /usr/openv/netbackup/bin/bpsched -ppid 20591

Even without the "another regular bpsched is already examining..." it still didn't seem to want to schedule backups for itself?

Any thoughts please? We haven't just rebooted the server, as why would we have to and why would that make any difference? (Typical sys admin response! :) )

Thanks

Nicolai · ‎01-10-2024

hi @etmsreec

Put VERBOSE = 5 in bp.conf and restart Netbackup. This will increase the debugging level of bpsched and may explain why a previous bpsched is still examining the NBU policies.

Check for:

- any zombie processes;
- processes that survive a shutdown: use /usr/openv/netbackup/bin/bp.kill_all to stop all Netbackup processes and check with bpps -a that they are ALL gone. Use kill -9 on any NBU process that doesn't die;
- host names containing non-ASCII characters, or host names with an extra dot, e.g. srv1.acme. <-- dot at the end of the host name.
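
A minimal sketch of the host name check, assuming you can export the client names to a plain list by whatever means is convenient. The function name and the sample names are made up for illustration, and it only looks for the two problems mentioned above (non-ASCII characters and a trailing dot). It needs Python 3.7 or later, so it is probably easiest to run on a workstation rather than on the HP-UX master itself.

```python
def hostname_problems(name):
    """Return the reasons a client name looks suspicious (empty list if none)."""
    problems = []
    if not name.isascii():
        problems.append("non-ASCII character")
    if name.endswith("."):
        problems.append("trailing dot")
    return problems


# Hypothetical sample names; replace with your own client list.
for client in ["srv1.acme", "srv1.acme.", "srv\u00e9.acme"]:
    found = hostname_problems(client)
    if found:
        # ascii() keeps the output printable even on a non-UTF-8 terminal.
        print(f"{ascii(client)}: {', '.join(found)}")
```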

StoneRam-Simon · ‎01-10-2024

bpsched issues.... those were the days.....

Some things you can check...

Has anyone manually modified any files or folders under /usr/openv/netbackup/db/class?

Check file and folder permissions, and check whether there are any files in the top level of the class folder (it should contain only folders, one per policy, not files).
Each policy folder should have 3 files (info, includes and clients) and one folder, schedule. In schedule there will be one folder per schedule, each with 3 files (days, info and calendar).
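
The layout described above can be checked mechanically. This is only a sketch under the assumption that the description (given from memory) is accurate for your version: the function name is invented, and a reported deviation is a prompt to go and look, not proof of corruption. It only reads the directory tree and changes nothing.

```python
from pathlib import Path

# Expected names, taken from the description above (not verified against 5.1).
POLICY_FILES = {"info", "includes", "clients"}
SCHEDULE_FILES = {"days", "info", "calendar"}


def check_class_dir(root):
    """Return a list of deviations from the expected class folder layout."""
    findings = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            findings.append(f"stray file at top level: {entry.name}")
            continue
        for name in sorted(POLICY_FILES):
            if not (entry / name).is_file():
                findings.append(f"{entry.name}: missing file {name}")
        sched_root = entry / "schedule"
        if not sched_root.is_dir():
            findings.append(f"{entry.name}: missing folder schedule")
            continue
        for sched in sorted(sched_root.iterdir()):
            if not sched.is_dir():
                findings.append(f"{entry.name}/schedule: stray file {sched.name}")
                continue
            for name in sorted(SCHEDULE_FILES):
                if not (sched / name).is_file():
                    findings.append(
                        f"{entry.name}/schedule/{sched.name}: missing file {name}"
                    )
    return findings
```

Calling check_class_dir("/usr/openv/netbackup/db/class") on a copy of the folder, or on the live one with NetBackup stopped, returns an empty list when everything matches the description.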

Is the filesystem full, or has it filled up at some point and left any files corrupt?

Check the contents and permissions of the /usr/openv/netbackup/db/jobs too.

I assume you checked if there were multiple bpsched processes running...
bpps -a | grep bpsched

How many jobs does it try to launch? (Are you seeing failed jobs, or is nothing scheduled at all?)
What do you have "tries" set to in the host properties general attributes?

davidmoline · ‎01-10-2024

Hi @etmsreec

In addition to the suggestions from @Nicolai have a look for stale lock files.

These will most likely be in the /usr/openv/netbackup/bin directory, and for bpsched the lock file may be bpsched.lock or bpsched.lck. Shut down NetBackup before removing any lock files.
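
Before removing anything, it can help to see which lock-like files exist and when they were last touched. A small read-only sketch, assuming the lock files end in .lock or .lck as suggested above (the exact names are from memory, so treat the suffix list as a guess); the function name is made up.

```python
from datetime import datetime
from pathlib import Path


def find_lock_files(directory, suffixes=(".lock", ".lck")):
    """Return (name, last-modified time) for each lock-like file in directory."""
    hits = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and entry.suffix in suffixes:
            modified = datetime.fromtimestamp(entry.stat().st_mtime)
            hits.append((entry.name, modified))
    return hits
```

For example, find_lock_files("/usr/openv/netbackup/bin") lists the candidates. A lock file whose modification time is older than the last NetBackup restart would be a reasonable suspect for a stale lock.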

The final step would be to reboot the server to ensure there are no zombie processes etc. as per @Nicolai .

Cheers
David

Nicolai · ‎01-11-2024

bpsched.lock or bpsched.lck - good memory @davidmoline

I had a faint memory of lock files, but time had erased the name of the files in my memory :)

Nicolai · ‎01-15-2024

@etmsreec Any updates about your issues ?

etmsreec · ‎01-15-2024

Hi @Nicolai ,

I did try to reply earlier, but the reply got marked as spam for some reason. :(

We did a variety of things last week, including :

- killing off all of the NBU processes, setting the verbosity, and restarting;

- performing a reboot of the server;

- stopping NBU, moving the .lock files to a different directory and then restarting NBU.

The step of moving the lock files seemed to have allowed jobs to kick off on Friday night which I thought was going to be job done, but the jobs over the rest of the weekend failed to trigger. We get the error:

14:14:48.461 [21711] <8> bpsched_main: another regular bpsched is already examining the policy configuration
14:14:48.462 [21711] <4> bpsched: scheduler exiting - regular bpsched is already running (214)

As the song says, "...this is just where I came in..."

Nicolai · ‎01-15-2024

Hi @etmsreec

Posting debug text inline usually gets you in trouble, because the post will be marked as SPAM. To avoid this, attach the debug text to the post as a file.

In the bpsched debug text look for <8> (warning) & <16> (catastrophic). <1> <2> <4> can be ignored.
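
With VERBOSE = 5 the log gets large, so filtering by level saves a lot of scrolling. A sketch that assumes every entry has the shape seen in this thread (time, [pid], <level>, text); lines that don't match, such as wrapped continuation lines, are simply skipped. The function name is invented.

```python
import re

# Matches e.g.: 10:21:31.109 [16340] <8> bpsched_main: another regular ...
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (.*)$")


def lines_at_or_above(lines, minimum=8):
    """Yield (time, pid, level, text) for entries whose level is >= minimum."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            time, pid, level, text = match.groups()
            if int(level) >= minimum:
                yield time, int(pid), int(level), text


# Two lines copied from the log earlier in this thread.
sample = [
    "10:21:30.871 [16340] <2> LOCAL CLASS_ATT_DEFS: Product ID = 6",
    "10:21:31.109 [16340] <8> bpsched_main: another regular bpsched is "
    "already examining the policy configuration",
]
for time, pid, level, text in lines_at_or_above(sample):
    print(time, pid, level, text)
```

To run it over a real log, pass an open file object instead of the sample list.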

The "another regular bpsched is already examining the policy configuration" message is quite normal for old Netbackup versions. The scheduler wakes up every 10 minutes to examine the policy configuration and see if work is due. On larger installations this can easily take more than 10 minutes. But in your case it seems that bpsched never finishes its work, or crashes during the process.

Have any new clients been added to the policy configuration lately, potentially sneaking a non-ASCII character into a host name?

My recommendation :

Put VERBOSE = 5 in bp.conf
Clear out old debug logs in /usr/openv/netbackup/logs/bpsched
Restart Netbackup
Capture failed backup schedule
Capture bpsched log
Attach log file(s) to a post in this forum

Please note that the bpsched log may contain hostnames you don't want to share with the internet.

Also check DNS name resolution: both forward and reverse name lookups must be authorized. Non-authorized answers will fail the backup.
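
A rough way to test forward and reverse lookup for each client, using only the Python standard library. This is a sketch, not a NetBackup tool: the function name is invented, it uses whatever resolver the machine running it is configured with (so run it on the master, or somewhere with the same DNS and hosts file), and it says nothing about whether an answer is authoritative. The two resolver parameters exist only so the function can be exercised without a network.

```python
import socket


def check_dns(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return None if name resolves forward and back again, else a description."""
    try:
        ip = forward(name)
    except OSError as exc:
        return f"{name}: forward lookup failed ({exc})"
    try:
        rname, aliases, _addresses = reverse(ip)
    except OSError as exc:
        return f"{name}: {ip} has no reverse record ({exc})"
    # Compare case-insensitively against the primary name and its aliases.
    known = {rname.lower(), *(alias.lower() for alias in aliases)}
    if name.lower() not in known:
        return f"{name}: {ip} resolves back to {rname}"
    return None
```

Note that a short name whose reverse lookup returns only the fully qualified name is reported as a mismatch; whether that matters depends on how the clients are named in the policies.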

etmsreec · ‎01-18-2024

Hiya,

I think I have caught all of the hostnames, policy names, and IP addresses that needed to be changed in my log file before posting! I didn't think posting the whole file was relevant, as it came out quite big, so I have an extract.

The first message relating to another bpsched already running is at 21:18:37.979, and it is then repeated every ten minutes thereafter.

There have been no new clients added to the configuration in more than ten years.

I'm trying to add the file to the post, but it keeps telling me that the file type is not supported if I drag and drop, and there doesn't appear to be an Attach File option?

Steve

Nicolai · ‎01-21-2024

Hi @etmsreec

When creating a forum post, I get the option to attach files: a red cloud upload image. The file size limit is only 71 MB. If the file is larger than 71 MB, consider sharing it from Dropbox, G-Drive or OneDrive.

Best Regards

Nicolai

etmsreec · ‎01-22-2024

Hi @Nicolai ,

Nope, I don't get that, either with a reply or a new posting.

@benspickard did say that it was probably because the ability to add text files and attachments to posts had been turned off in order to reduce spam.

We don't have a dropbox or public drive that I could present a link to, unfortunately. :o(

Nicolai · ‎01-22-2024

@etmsreec

Strange - let me ask in Veritas country why you don't have the option to attach files. I thought it was a default option for everyone.

/Nicolai

benspickard · ‎01-22-2024

We have turned on the ability to attach files again! Give it another try.

Nicolai · ‎01-23-2024

@etmsreec Give it a go with the bpsched log file, you can attach files to posts now.

etmsreec · ‎01-23-2024

Hi @Nicolai

Yay! That worked. Thank you!

I think I've substituted all of the host names and IP addresses.

MASTER is the Master server, and there are no other media servers in this environment.

client followed by a letter is a Netbackup client.

All of the policies should have been renamed PolN where N is the number (e.g. Pol1, Pol2, Pol63).

21:18:37.979 is the timestamp for the first error of there being another regular bpsched checking the configuration, so there's a load of run-up to that message.

Steve

Nicolai · ‎01-23-2024

Hi @etmsreec

From the log file:

20:48:37.325 [11239] <2> bpsched: INITIATING (verbose=5) ...
20:58:37.454 [13437] <2> bpsched: INITIATING (verbose=5) ...
21:08:37.044 [15646] <2> bpsched: INITIATING (verbose=5) ...
21:18:37.454 [17781] <2> bpsched: INITIATING (verbose=5) ...

Those messages are expected, because bpsched is set to wake up and examine whether work is due.
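
The spacing between those start-ups can be measured rather than eyeballed. A sketch that assumes the INITIATING line format shown above; the function name is made up. The log has no date in its timestamps, so the arithmetic is only valid within one day's log file.

```python
import re
from datetime import datetime

START_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\] <2> bpsched: INITIATING"
)


def wakeup_intervals(lines):
    """Return the seconds between consecutive bpsched INITIATING entries."""
    starts = []
    for line in lines:
        match = START_RE.match(line.strip())
        if match:
            starts.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


# The four start-up lines quoted above.
sample = [
    "20:48:37.325 [11239] <2> bpsched: INITIATING (verbose=5) ...",
    "20:58:37.454 [13437] <2> bpsched: INITIATING (verbose=5) ...",
    "21:08:37.044 [15646] <2> bpsched: INITIATING (verbose=5) ...",
    "21:18:37.454 [17781] <2> bpsched: INITIATING (verbose=5) ...",
]
print([round(seconds) for seconds in wakeup_intervals(sample)])
```

For these four lines every gap rounds to 600 seconds, i.e. a ten-minute wake-up cycle.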

Then at :

21:18:37.454 [17781] <2> bpsched: INITIATING (verbose=5) ...
21:18:37.455 [17781] <2> bpsched: BPRD PPID = 21636
21:18:37.455 [17781] <2> logparams: /usr/openv/netbackup/bin/bpsched -ppid 21636
21:18:37.455 [17781] <2> bpsched_main: wait_on_que=0, timeout_in_que=36000, reread_interval=300,queue_on_error=0, bptm_query_timeout=480, s_m 0
21:18:37.747 [17781] <2> LOCAL CLASS_ATT_DEFS: Product ID = 6
21:18:37.979 [17781] <8> bpsched_main: another regular bpsched is already examining the policy configuration

This is also expected when backup windows are open and backup jobs are scheduled. If bpsched takes more than 10 minutes to do its work, a new bpsched will be initiated but will then immediately exit if an earlier instance is already running. That is also "normal" working behaviour.

@etmsreec Can you confirm that no new backups are scheduled after 21:18:37 ?

If we look at the bpsched initiated at 21:08:37.044, process ID 15646, it seems to have finished its work at 21:09:07:

21:09:07.117 [15646] <2> image_dir_name: ?

That's funny, because the bpsched started at 21:18:37.454, process ID 17781, claims a bpsched is already running.
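
To see at a glance which bpsched instances went quiet without saying goodbye, the log can be grouped by process ID. This is a sketch with an invented function name, and it rests on one loud assumption: that an instance which ends cleanly logs a line containing "exiting", as in the "scheduler exiting" message earlier in the thread. Whether a bpsched that completes a full scheduling pass logs anything similar is not shown in this thread.

```python
import re

LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (.*)$")


def summarize_by_pid(lines):
    """Map each PID to its first/last timestamp, last message and exit flag."""
    summary = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip wrapped or otherwise unparseable lines
        time, pid, _level, text = match.groups()
        entry = summary.setdefault(
            int(pid),
            {"first": time, "last": time, "last_text": text, "logged_exit": False},
        )
        entry["last"] = time
        entry["last_text"] = text
        if "exiting" in text:  # assumption: clean exits mention "exiting"
            entry["logged_exit"] = True
    return summary
```

Any PID with logged_exit still False, and a last timestamp well before the end of the log, is worth comparing against the output of bpps -a to see whether the process is still around.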

I did not find errors in the bpsched log narrowing down the issue.

maybe @davidmoline and @mph999 can add to the investigation ?

@etmsreec Does a bpsched process of PID 15646 exist on your system ? I know the log file is from the 17th of January.

etmsreec · ‎01-23-2024

Hi @Nicolai ,

"Does a bpsched process of PID 15646 exist on your system ?"

Netbackup has been restarted since then, but the process did exist.

etmsreec · ‎01-23-2024

Sorry, I missed the other question.

No new backups were scheduled after 21:18:37, even though there were candidates that should have been selected.

Nicolai · ‎01-24-2024

hi @etmsreec

It looks to me like the process dies or hangs without cleaning up. That is rare.

Is there any indication in the syslog from the master server? For example "no space left on device", which can apply to both file systems and semaphores.

Does the system run out of swap space? A low-memory condition could be the cause.

/Nicolai

etmsreec · ‎01-24-2024

Hi @Nicolai ,

The only messages in syslog are an interruption to GDM data collection and the Added Service messages for bpcd/tcp, vopied/tcp, and bpjava-msvc/tcp when the system was rebooted. The GDM data collection occasionally reports that there may be a problem, and then resumes a minute later. The times don't coincide with backup schedules starting to fail or to the reboot of the system.

We've only had one device full, and that was the filesystem that holds the Netbackup log files when I set the VERBOSE=5 *blush* The log filesystem is 68% now, 2.89GB free.

The server has 4GB physical memory. swapinfo reports memory as 86% used. It reports dev as 4% used, and that's 4GB as well.

Steve
