Forum Discussion

DPeaco
Moderator
3 years ago

bperror - final backup status

Greetings,

NBU 9.1.0.1 running on Red Hat Enterprise Linux 8.6

I'm running a script to dump the backup failures for the previous 24 hours. I don't want any "retry" error codes in the output, only the real final backup status. I'm creating tickets for failed backups, and if a backup "retries" and is ultimately successful, I don't want it included in the output.

bperror -backstat is what I'm using, with switches to look from 07:00 the previous day to 07:00 the current day.

Thoughts or ideas?

  • I see I'm getting several "lookers" but nobody offering any suggestions to my questions/queries. ;) 

    What I'm thinking is:

    Dumping the bperror output for the past 24 hours and then parsing that info. If a backup attempt has retries but the end result of the job is successful, the job ID is the same in the bperror -backstat output but the job stream's epoch timestamp has a greater value. Without getting too far into the weeds: I could parse the output from bperror -backstat, grep out everything that is successful, and grep out everything with a return code of 1. That should leave me with a list of return codes greater than 1 and no successful jobs in the remaining scrub file. Then I check the final return code for every listed backup ID via bpdbjobs; if the return code from bpdbjobs is greater than 1, I can dump the info for that line/job, flag it as a true failure, and cut a ticket on the failed job (a rough sketch of this pipeline follows below).

    Am I thinking correctly here?
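
    For illustration, a minimal, untested sketch of that pipeline in bash. The field positions are assumptions: field 6 is taken to be the job ID in raw bperror output, and field 4 the final status in bpdbjobs -all_columns output. Verify both against your environment before relying on it.

    # collect candidate job IDs seen by bperror in the last 24 hours
    bperror -backstat -hoursago 24 | awk '{print $6}' | sort -u |
    while read -r jobid; do
        # let bpdbjobs report the job's final status, retries included
        final=$(bpdbjobs -jobid "$jobid" -all_columns | awk -F"," 'NR==1 {print $4}')
        if [ -n "$final" ] && [ "$final" -gt 1 ]; then
            # final status above 1: a true failure, so cut a ticket
            echo "job $jobid final status $final"
        fi
    done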

    • davidmoline
      Level 6

      Hi DPeaco 

      I think you are on the right track.

      I think a simpler way to parse the output would be to record the status of each job in turn (the output is in time-sequential order). As a retry of a job comes through, the status for that job is updated to the new exit code. At the end, ignore everything with status 0 (and 1) and you have your list of tickets to raise.

      Potentially even simpler: just record the jobs that meet the fail criteria. If a job comes up again in the output, update its status, or throw it away if it "succeeds".
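
      A rough, untested sketch of the first idea. How you pull a job ID and a status code out of each bperror line is environment-specific, so the field choices ($6 and $NF) below are assumptions to verify first:

      # "last status wins": later lines for the same job overwrite earlier
      # ones, so a successful retry replaces the earlier retry exit code
      bperror -backstat -hoursago 24 |
      awk '{ last[$6] = $NF }
           END { for (j in last)
                     if (last[j] > 1)          # ignore status 0 (and 1)
                         print "raise ticket: job", j, "status", last[j] }'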

      Aren't there canned OpsCenter reports that would do this for you though?

      David

  • StefanosM
    Level 6

    Unfortunately, there is no OpsCenter report that will give you only the real failed jobs.
    Also, I do not think that bperror will give you enough info to create the report: you may have to deal with a policy with multiple data streams, and the job may retry with a different backup ID.

    What I have done is to get the output of bpdbjobs -all_columns for the last 24 hours and extract the status code, policy, schedule, client, and file selection.
    Then for every combination of policy, schedule, client, and file selection I check the latest status code.
    If the status code is 0, the backup is OK.
    If the status code is greater than 0, the last attempt failed and I add the line to the report.

    It is a very primitive way to get the failed jobs and the process is slow, but it works.

    P.S. I count status code 1 as a failed job, because a database backup with status 1 is a failed backup (at least one database is missing from the backup).

    • DPeaco
      Moderator

      StefanosM 

      I'd like to see your script code to use as a guide for the logic flow. If you need my email address, please let me know.

  • As I told you, it is somewhat primitive. It could be more efficient, but it works and I do not have time to improve it.
    I made it for a Windows master server and run it using portable Git; that's why the paths are Linux-like.
    https://github.com/git-for-windows/git/releases/download/v2.37.1.windows.1/PortableGit-2.37.1-64-bit.7z.exe

    You can change it for a Linux master easily.
    If someone wants to run it from a Windows command line or the Task Scheduler, the command must be
    C:\admin\scripts\PortableGit-1.7.2.3\bin\bash.exe --login /C/admin/scripts/failed_jobs.sh
    (assuming that Git and the script are stored at C:\admin\scripts\ and the script is named failed_jobs.sh)

     


    #set -x
    days=1

    # start with an empty results file
    echo > /c/temp/bpdbjobs_check.txt

    # jobs started in the last $days days, with SLP jobs filtered out
    ddate=`date --date="$days days ago" +"%m/%d/%y %H:%M"`
    bpdbjobs -all_columns -t "$ddate" | grep -v SLP_ > /c/temp/bpdbjobs_no_slp.txt

    # keep status ($4), policy ($5), schedule ($6), client ($7), file list ($33)
    cat /c/temp/bpdbjobs_no_slp.txt | awk -F"," '{print $4","$5","$6","$7","$33}' > /c/temp/bpdbjobs_working.txt

    # for every unique policy/schedule/client/file-list combination, take the
    # first (most recent) line and drop it if the status is 0 or empty
    for i in `cat /c/temp/bpdbjobs_working.txt | awk -F"," '{print $2","$3","$4","$5}' | sort | uniq`
    do
    echo -ne .
    grep -F "$i" /c/temp/bpdbjobs_working.txt | head -1 | grep -v "^0," | grep -v "^," | uniq >> /c/temp/bpdbjobs_check.txt
    done

    # build the human-readable report
    echo
    date > /c/temp/error.txt
    echo Failed Jobs for the last ${days} days >> /c/temp/error.txt
    cat /c/temp/bpdbjobs_check.txt | awk -F"," '{print $1,$2,$3,$4,$5}' >> /c/temp/error.txt
    echo >> /c/temp/error.txt
    echo ---------------------------- >> /c/temp/error.txt
    echo running jobs for more than 10H >> /c/temp/error.txt
    echo >> /c/temp/error.txt
    # active jobs ($3 == 1): convert elapsed seconds ($10) to hours/minutes/seconds
    # and report the jobs that have been running for 10 hours or more
    cat /c/temp/bpdbjobs_no_slp.txt | awk -F"," '{if ($3 == "1") print $1","$5","$6","$7","int($10/60/60)","int($10%(60*60)/60)","$10%60}' | awk -F"," '{if ($5 >= 10) print $1","$2","$3","$4","$5":"$6":"$7}' >> /c/temp/error.txt
    echo >> /c/temp/error.txt
    echo ---------------------------- >> /c/temp/error.txt
    cat /c/temp/error.txt

    # machine-readable copy of the failed-job list
    cat /c/temp/bpdbjobs_check.txt | awk -F"," '{print $1","$2","$3","$4","$5}' > /c/temp/error.csv

    echo
    echo

    I filter out the SLP jobs; you can easily change that.
    P.S. The last part of the script reports backup jobs that have been running for more than 10 hours. I find it useful.

      • StefanosM
        Level 6

        I think the for loop can be improved. Right now I check every unique backup combination; it would be more time-efficient to run it against only the failed jobs.

        If you improve it, please share it.
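
        For what it's worth, one possible single-pass rewrite of the loop, untested. It assumes the same field layout as the script above ($4 = status, $5 = policy, $6 = schedule, $7 = client, $33 = file selection), that $ddate is set as before, and that bpdbjobs lists the most recent attempt first (the same ordering the original head -1 relies on):

        bpdbjobs -all_columns -t "$ddate" | grep -v SLP_ |
        awk -F"," '{
            key = $5 FS $6 FS $7 FS $33        # policy,schedule,client,files
            if (!(key in seen)) {              # first line seen = latest attempt
                seen[key] = 1
                if ($4 != "0" && $4 != "")     # status above 0 (including 1) = failed
                    print $4 "," $5 "," $6 "," $7 "," $33
            }
        }' > /c/temp/bpdbjobs_check.txt

        This replaces one grep per combination with a single pass over the job list, and only failed attempts are written out.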