Omar_Villa
Level 6
Employee

                To troubleshoot an SLP backlog we first need to define what a backlog is: essentially, it is data that has not yet been duplicated to the second (or Nth) destination configured in the SLP.

For example:

                SLP Name: SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1

                                C1: Copy 1 (Backup)

                                1week: 1 Week Retention

                                MedA-disk1-dsu1: Backup destination, the STU MedA-disk1-dsu1

                                C2: Copy 2 (Duplicate)

                                3month: 3 Month Retention

                                MedA-tape1-stu1: Duplication destination, the STU MedA-tape1-stu1

                The storage lifecycle policy SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1 will back up data to a local disk DSU named MedA-disk1-dsu1 and, when that is done, duplicate the data from MedA-disk1-dsu1 to MedA-tape1-stu1. If for any reason the images stored under MedA-disk1-dsu1 don't get duplicated to MedA-tape1-stu1, we start to build a backlog in MedA-disk1-dsu1, and because the SLP's nature is to create every image under MedA-disk1-dsu1 with an infinite retention, at some point MedA-disk1-dsu1 will fill up and all backups will fail. Only once the duplication is successful does the retention for images under MedA-disk1-dsu1 change to the 1-week retention so they can eventually expire.
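
If you want to check where a single image stands in this process, nbstlutil can report on one backup at a time. This is only a quick illustration; the backup ID below is hypothetical and the exact flags and output format can vary by NetBackup release:

        # Show SLP processing details for a single image (hypothetical backup ID)
        /usr/openv/netbackup/bin/admincmd/nbstlutil list -backupid clientA_1330500000 -U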

                Now imagine a scenario where we have 20 different SLPs with different destinations for backups or duplications. Troubleshooting this can be a real challenge, which is why I created an ordered troubleshooting approach to better find bottlenecks and potential configuration issues.

                Throughout this article we will build a set of KSH functions that come together in a final script. The main idea is to learn to read each function's output and know how to interpret each piece of data; the other key goal is to collect all the needed information in order to deliver a better solution. A minimal skeleton of how the functions fit together is sketched right after the list of steps below.

Steps:

  1. Dump SLP incomplete images
  2. Get local Disk free space (Advanced Disk for this sample) new
  3. Count total Backlog
  4. Count Images by SLP Status new
  5. Count Backlog held by Media Server
  6. Count total Backups in the last 24 hours
  7. Top Clients Backlog new
  8. Split Images Count by size ranges new
  9. Count total duplications on a daily basis
  10. Count Backlog by SLP.
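
As mentioned above, here is a minimal sketch of how those functions hang together; the function names, paths and contents are illustrative only, and the attached script differs in detail:

        #!/bin/ksh
        # Skeleton of the approach used in this article (illustrative only)
        SCRIPT_NAME=BacklogCheck
        LOG_DIR=/var/log/$SCRIPT_NAME/logs
        SLP_DUMP=$LOG_DIR/$SCRIPT_NAME.DUMP
        NBSTLUTIL=/usr/openv/netbackup/bin/admincmd/nbstlutil    # full path so cron can find it

        DumpIncompleteImages ()
        {
                mkdir -p $LOG_DIR
                $NBSTLUTIL list -l -image_incomplete > $SLP_DUMP    # step 1: dump the incomplete images
        }

        GetTotalBklog ()
        {
                # step 3: sum the fragment (F) record sizes and print the total in TB
                awk '$2=="F" {SUM+=$14} END {printf ("%-30s%.2f TB\n", "Total Backlog", SUM/1024/1024/1024/1024)}' $SLP_DUMP
        }

        # ...one function per step; the final report simply calls them in order.
        DumpIncompleteImages
        GetTotalBklog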

Dump SLP incomplete images.

               First we need to dump all the images that haven't been duplicated and send that data to a file; the following code dumps all incomplete images into a log file.

               #!/bin/ksh
               SCRIPT_NAME=BacklogCheck                          # script name used for the log paths (set as needed)
               LOG_DIR=/var/log/$SCRIPT_NAME/logs
               SLP_DUMP=$LOG_DIR/$SCRIPT_NAME.DUMP
               mkdir -p $LOG_DIR                                 # make sure the log directory exists
               nbstlutil list -l -image_incomplete > $SLP_DUMP
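
As a quick sanity check on the dump, a small sketch that only relies on the record-type column used by every awk filter in this article can count how many records of each type we captured:

               # Count the record types in the dump ($2 is the record type, e.g. I = image, F = fragment)
               awk '{TYPES[$2]++} END {for (T in TYPES) print T, TYPES[T]}' $SLP_DUMP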

 

Get local Disk free space (Advanced Disk for this sample)

                To keep this sample simple we use Advanced Disk as our first destination option. To know where we stand, and whether there is enough space left to keep delivering healthy backups, we must know how much free space we have across all our media servers.

The first step is to dump the disk volume list and, with awk, grab only the DSU name and its free space, totalling the free space overall and per media server.

        nbdevquery -listdv -stype AdvancedDisk -l | awk '

        {

                SER[$2]+=$7  # Array keyed by DSU name, accumulating free space.

                TOTAL+=$7    # Running total of free space across all DSUs.

        }

Once the awk main body is done we print the header, the total free space, and each DSU value by going through each cell of the SER[] array.

        END {

                printf ("%-30s%.2f TB\n\n", "Total Free Space", TOTAL/1024)

                printf ("%-30s%s\n", "Media Server", "Free Space")

                for (INC in SER) {printf ("%-30s%.2f TB\n",INC, SER[INC]/1024) }

        }'

 

Output sample:

 

Total Free Space                           4 TB

 

Media Server DSU                        Free Space

MedA-disk1-dsu1                          1.00 TB

MedB-disk1-dsu2                          1.00 TB

MedC-disk1-dsu3                          1.00 TB

MedD-disk1-dsu4                        1.00 TB

 Update:

            After troubleshooting other sites with different technologies like PureDisk or Data Domain, I realized that limiting this script to AdvancedDisk wasn't that helpful, so I decommissioned this function and created a new one that detects any storage technology and reports the free space by DSU.

            We have also found that cron is not that smart and sometimes needs help finding the NetBackup commands, so we introduced the full path of each NBU command as a variable so the script can run cleanly under cron.

            >$LOG_DIR/total.log                 # Wipeout total.log file used to Summarize Free Space.

            NBDEVQUERY=/usr/openv/netbackup/bin/admincmd/nbdevquery

 

            printf "%-30s%-20s%s\n" "Media Server" "Storage Type" "Free Space"

 

            After we print the function's header, instead of dumping only the AdvancedDisk info we list every storage server type and loop through each one to gather the free space. The logic is the same as the old AdvancedDisk function; we only introduced a new output column that tells us which technology the DSU is using.

  

            $NBDEVQUERY -liststs -l | awk '{print $3}' | sort -u | while read STYPE

            do

                        $NBDEVQUERY -listdv -stype $STYPE -l | awk '

                        {

                                    SER[$2]+=$7

                                    TOTAL+=$7

                        }

                        END {

                                    print TOTAL >> "'$LOG_DIR'/total.log"

                                    for (INC in SER)

                                                {printf ("%-30s%-20s%.2f TB\n",INC, awkSTYPE, SER[INC]/1024)}

                        }' awkSTYPE=$STYPE

            done

 

            We learned a new trick to help a bit with performance and avoid re-running the whole thing just to sum the total free space: as you can see, we print the TOTAL variable into a total.log file, which collects one subtotal per storage type, and with the following awk we simply go through that file and add it up.

 

        awk '

        { TOTAL+=$1 }

        END {

            printf ("\n%-50s%.2f TB\n\n", "Total Free Space", TOTAL/1024)

        }' $LOG_DIR/total.log

 

Output Sample:

 

Media Server                  Storage Type        Free Space

MedA-disk1-dsu1          AdvancedDisk      1.00 TB

MedB-disk1-dsu2          AdvancedDisk      1.00 TB

MedC-disk1-dsu3          PureDisk             1.00 TB

MedD-disk1-dsu4          DataDomain        1.00 TB

 

Count total Backlog

                The next step is to know how much data we are holding, or better said, how much has not been duplicated. This comes from step 1, where we dumped the data into the $SLP_DUMP file.

     First we sum every image fragment that hasn't been duplicated:

        awk '

        $2=="F" {SUM+=$14}

 

     When the sum is done we print the total in TB.

 

        END {

                printf ("%-30s%.2f TB\n\n", "Total Backlog ", SUM/1024/1024/1024/1024 )

        }' $SLP_DUMP

Output sample:

Total Backlog                 120 TB

 

Count Images by SLP Status

            A key piece of backlog troubleshooting is the image state; knowing the status of the images is priceless for making better decisions. In the next piece of code we count images and sum image sizes by SLP state. We only handle the six main states (there are others), but just knowing how many images are NOT_MANAGED or IN_PROCESS should be enough to tell whether we have corrupted images or policies not using SLPs.

            awk goes through the dumped images and checks the 11th column for the values 0, 1, 2, 3, 9 and 10, which represent the NOT_MANAGED, NOT_STARTED, IN_PROCESS, COMPLETE, NOT_STARTED_INACTIVE and IN_PROCESS_INACTIVE states. Once the right value is found it is translated into a string, and we add 1 to an array cell keyed by that STATE string; the idea is to walk the array in the awk END block and print all the states with a simple loop.

            printf "%-30s%-15s%s\n" "IMAGE STATUS" "IMAGES COUNT" "SIZE"

            awk '

            $2=="I" {

                        IMAGE=$4

                        STATE_COL=$11

              

                        if (STATE_COL == 0)       STATE = "NOT_MANAGED"

                        else if (STATE_COL == 1)  STATE = "NOT_STARTED"

                        else if (STATE_COL == 2)  STATE = "IN_PROCESS"

                        else if (STATE_COL == 3)  STATE = "COMPLETE"

                        else if (STATE_COL == 9)  STATE = "NOT_STARTED INACTIVE"

                        else if (STATE_COL == 10) STATE = "IN_PROCESS INACTIVE"

                        else STATE = "OTHER"

              

                        IMG_STATE_LIST[STATE]+=1

            }

  

            To get the fragment size of the image captured in the previous awk block we compare columns 2 and 4: the first tells us we are on a fragment (F) line, and the second ensures the awk loop hasn't moved on to a different image. A second array sums the fragment sizes by the same STATE captured in the previous condition.

            $2=="F" && $4==IMAGE {

                        IMG_SUM[STATE]+=$14  

            }

 

            Once we have gone through the dump file we print the results by walking the arrays, showing the image count and the total storage queued per SLP state.

            END {

                        for (STATE_ELM in IMG_STATE_LIST)

                                    printf ("%-30s%-15d%.2f TB\n",  STATE_ELM, IMG_STATE_LIST[STATE_ELM], IMG_SUM[STATE_ELM]/1024/1024/1024/1024)

            }' $SLP_DUMP | sort

            printf "\n\n"

 

            In our sample we have a total of 20,000 images in the backlog: 15,000 are IN_PROCESS, 2,000 are NOT_MANAGED and 3,000 are NOT_STARTED. Each of these states demands different actions, but to start, it is good to know why those 2,000 images are NOT_MANAGED; in my experience they are either bad images or backup policies not using SLPs, and if you see many status 800 errors it is very likely a list of bad images.

Output Sample:

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    15000          100.00 TB

NOT_MANAGED                   2000           10.00 TB

NOT_STARTED                   3000           10.00 TB
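
To follow up on that NOT_MANAGED count, a small sketch using the same dump file and the same column positions described above can list the affected backup IDs so they can be investigated one by one:

            # Backup IDs of images whose SLP state (column 11 of the I record) is 0 = NOT_MANAGED
            awk '$2=="I" && $11==0 {print $4}' $SLP_DUMP | sort | head -20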

 

Count Backlog by Media Server

                Knowing the total backlog only tells us how bad our duplication SLA is. The next step is to break the dump down a bit and figure out which media server holds most of the data; this helps us make decisions such as assigning or removing SLP Alternate Readers, or changing an STU's concurrent drives value to give more resources to a specific media server.

                First we capture the storage unit name and fragment size from the $SLP_DUMP file and sum each fragment into an array, which lets us split the backlog by media server DSUs or STUs.

                printf "%-30s%s\n" "Media Server DSU" "Backlog Size"

                awk '{if ($2=="F") {print $9,$14} }' $SLP_DUMP | sort |

                awk '

                {

                        MED_LIST[$1]+=$2

                }

                Once the sum is done we go through the MED_LIST[] array and print, in TB, the total under each array cell (media server DSU name).

                END {

                                for(MED in MED_LIST)

                                printf ("%-30s%.2f TB\n", MED, MED_LIST[MED]/1024/1024/1024/1024) |"sort"

                }'

                printf "\n\n"

 

Output sample:

 

Media Server DSU                           Backlog Size

MedA-disk1-dsu1                          15.00 TB

MedB-disk1-dsu2                          15.00 TB

MedC-disk1-dsu3                          30.00 TB

MedD-disk1-dsu4                          60.00 TB

                This sample output tells us that MedD-disk1-dsu4 is holding half of the backlog and is probably our first point to troubleshoot. From experience, the first things to look at are tape drive health, storage unit groups missing an STU/DSU (forcing the SLP to skip duplicating the data held on the excluded STU/DSU), or backup policies over-utilizing the STU/DSU. There are lots of possibilities, but these are the ones I have found to be most common.

Count total Backups in the last 24 hours

                It is impossible to know where we stand on backlog without knowing how much data we are pulling in. This is why we count how much data we backed up in the last 24 hours: it tells us whether the free space we have will be enough for another night of backups, and we can compare it against another function, explained later, that shows how much data we are duplicating per day.

                Because we dump the last 24 hours of images with bpimagelist, we first need to know which NetBackup version we are running, since the output differs between NetBackup 6 and 7. Once we capture the version we know which column offset locates the SLP name and can let awk do the math.

                NBUVER=$(cat /usr/openv/netbackup/version | grep -i version | awk '{print $NF}' | awk -F. '{print $1}')

      

                if (( NBUVER == 7 )); then

                                IMG_SLP_COL=6

                else

                                IMG_SLP_COL=3

                fi

                The process is very similar to our previous functions: we dump the data and with awk arrays we sum how much data we backed up in the last 24 hours, both in total and by SLP. The per-SLP breakdown is important because it lets us spot the SLP with the highest backup load.

                bpimagelist -l -hoursago 24 | grep -i ^IMAGE | awk '

                {

                                SLP_LIST[$(NF-SLPCOL)]+=$19   # SLPCOL variable stores KSH Variable $IMG_SLP_COL value.

                                TOTAL+=$19

                }

                Once the sum is done in the awk main body, we print the results in the awk END block, going through the SLP_LIST[] array and showing the total data backed up per SLP.

                END {

                                printf ("%-30s%.2f TB\n\n", "Total 24hr Backup", TOTAL/1024/1024/1024)

                                printf ("%50s%27s\n", "Policy Name", "Backup Size")

                                for (SLP in SLP_LIST) {printf ("%50s%20.2f TB\n", SLP, SLP_LIST[SLP]/1024/1024/1024)}

                }' SLPCOL=$IMG_SLP_COL

               printf "\n\n"

 

Output sample:

 

SLP Name                                                                                                         Backup Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             4.00 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-3month-MedB-tape1-stu2                              1.00 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-3month-MedC-tape1-stu3                              2.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-9month-MedD-tape1-stu4                             2.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                 0.50 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-6month-MedB-tape1-stu2                              1.50 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-1year-MedC-tape1-stu3                                1.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-5year-MedD-tape1-stu4                                0.00 TB

                The idea is to get a picture of which SLPs are backing up the most data. For simplicity, the SLP names in this article only include STUs; there are no SUGs in this sample.

Count total duplications on a daily basis

                The next step is to know how much data we are duplicating. This can be tricky because we need to go through the jobs database and figure out how much data was successfully moved per job.

                First we capture the list of successful duplication jobs and join the job IDs into a single comma-separated line for the bpdbjobs command; this makes the lookup of successful jobs much faster than querying them one by one.

                printf "%-30s%s\n" "Date" "Duplicated Data"

                bpdbjobs -report | grep -i duplica | grep -i done | awk '{print $1}' | tr '\n' ',' | read JOBSLIST   # IDs of completed duplication jobs, comma separated

                echo $JOBSLIST | wc -m | read JOBSCHARS

                ((JOBSNUM=$JOBSCHARS-2))                                # drop the trailing comma left by tr

                echo $JOBSLIST | cut -c1-$JOBSNUM | read FINALJOBSLIST

                With the list ready we pull two columns from each line, the Unix timestamp and the size of the job in KB. These two values are used to translate the date into a human-readable format and to sum the successful duplications.

                bpdbjobs -jobid $FINALJOBSLIST -most_columns | awk -F, '{print $9,$15}' | while read UDATE SIZEKB

                do

                                RES=$(bpdbm -ctime $UDATE | awk '{print $4"-"$5"-"$NF}')

                                echo $RES $SIZEKB

                done | \

                With the date in human-readable format we sum the written data into an array split by date, so we can print a history of the last 4-6 days (the numbers for the oldest days can change as jobs get deleted from the Activity Monitor, which is why it is good to run this script daily and keep a record under the logs folder).

                awk '

                {

                                DAYLIST[$1]+=$2

                }

With the list done, we simply go through each array cell and print the results of the sums in TB.

                END {

                                for (DAYDUP in DAYLIST)

                                                printf ("%-30s%.2f%s\n",DAYDUP, DAYLIST[DAYDUP]/1024/1024/1024, "TB" )

                }' | sort -n

                printf "\n\n"

Output sample:

Date                              Duplicated Data

Feb-23-2012                   0.05TB

Feb-24-2012                   2.10TB

Feb-25-2012                   4.30TB

Feb-26-2012                   5.54TB

Feb-27-2012                   5.58TB

Feb-28-2012                   4.23TB

Feb-29-2012                   0.39TB

               To know whether we are doing well or badly on duplications we need to know how many tape drives are available and what kind they are. For this sample we have 10 LTO4 drives shared across 4 media servers. With that said, we know we are way behind on performance: each drive should be able to move around 120 MB/sec in an ideal world, so at the very least we should expect to move around 2-4 TB a day per drive, meaning we probably have a bottleneck at the drive or media server level (drive and media server performance troubleshooting will be covered in a second article; we first build a strong case and later make the right modifications).
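
To put rough numbers behind that estimate, here is a back-of-the-envelope calculation; the drive count, speed and duty cycle are this sample's assumptions, not measurements:

                # Rough daily duplication capacity (illustrative assumptions)
                DRIVES=10           # LTO4 drives available for duplication
                SPEED_MB=120        # best-case native transfer rate in MB/sec
                HOURS=8             # realistic hours per day a drive spends actually writing
                awk -v d=$DRIVES -v s=$SPEED_MB -v h=$HOURS 'BEGIN {
                        per_drive = s * h * 3600 / 1024 / 1024      # TB per drive per day
                        printf ("Per drive/day : %.1f TB\n", per_drive)
                        printf ("All drives/day: %.1f TB\n", per_drive * d)
                }'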

                There is also always the possibility that the reading side (the disk array) of the duplication is the root cause of the bottleneck, but we check everything in the backup world first before we blame the SAN guys.

Top Clients Backlog

            For those cases where one day we have no backlog and suddenly, within a 24 or 48 hour window, we jump to 20 TB of backlog out of nowhere, it is usually because a client decided to dump a 10 TB database into a folder that is not part of the exclude list, killing our space and increasing the backlog. To quickly detect this we created a function that by default gives us the top 10 backlog clients, so we can engage that customer and see what actions can be taken to prevent a bigger impact.

            The function allows us to select the number of clients we want to print, default is 10 but it can be any desired number.

            TOP=${1:-10}          # Number of clients to print; defaults to 10 if no argument is given

 

            As soon as we print the header, the logic is the same: we go through the SLP dump file, capture each client name as an array cell, and sum the fragment sizes into that client's cell.

  

            printf "%-30s%s\n" "Client Name" "Backlog Size"

            awk '$2=="F" {print $4,$14} ' $SLP_DUMP | tr '_' ' ' | awk '

            {

                        CLIENT_LIST[$1]+=$3

            }

 

            Once we have captured all the clients we print them by walking the array, and to surface the clients with the biggest backlog we sort the list by the size value (second column) and tail the last $TOP lines to print only the top 10, or any desired number of, clients.

 

            END {

                        for (CLIENT in CLIENT_LIST)

                                    printf ("%-30s%.2f GB\n", CLIENT, CLIENT_LIST[CLIENT]/1024/1024/1024)

            }' | sort -nk 2,2 | tail -$TOP

            print "\n\n"

 

Output Sample:

 

            In this quick sample we print the top 5 clients with the biggest backlog, and we can easily see that these clients represent at least 45% (55 TB) of our 120 TB backlog, which is not a bad place to start looking.

 

Client Name                       Backlog Size

Windows1                     2500GB

Unix1                            2500GB

Exchange1                    10000GB

MSSQL1                       15000GB

Oracle1                         25000GB

 

Count Images by Size Ranges

            Tuning the LIFECYCLE_PARAMETERS file can be a challenge if we don't have the right data, and guessing or changing values by trial and error doesn't go well with a backlog. Because of this, and to better understand what NBSTSERV is doing with the images, we need to know how many images we have and their sizes; with this information we can tune values such as MIN_GB_SIZE_PER_DUPLICATION_JOB and MAX_GB_SIZE_PER_DUPLICATION_JOB, as we will see in the output sample explanation.

The code is quite simple: we capture each image and its fragment sizes into an array, then scan the array and compare each cell's value against a set of hard-coded ranges, increasing the matching range counter by one. When the loop is done we simply print the image count per range.

 

        awk '

        $2=="I" {

                IMAGE=$4

        }

        $2=="F" && $4==IMAGE {

                IMGSUM[IMAGE]+=$14

        }

 

            Hard-coded range values:

 

        END {

                S100MB=104857600

                S500MB=524288000

                S1GB=1073741824

                S5GB=5368709120

                S10GB=10737418240

                S50GB=53687091200

                S100GB=107374182400

                S250GB=268435456000

                S500GB=536870912000

                S1TB=1073741824000

 

            The loop goes through the array values, compares them with the ranges we want, and increases a count variable that we print later.

 

                for (IMGSIZE in IMGSUM) {

                        if (IMGSUM[IMGSIZE] <= S100MB)                                            S100MB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S100MB && IMGSUM[IMGSIZE] <= S500MB)         S500MB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S500MB && IMGSUM[IMGSIZE] <= S1GB)           S1GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S1GB   && IMGSUM[IMGSIZE] <= S5GB)           S5GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S5GB   && IMGSUM[IMGSIZE] <= S10GB)          S10GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S10GB  && IMGSUM[IMGSIZE] <= S50GB)          S50GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S50GB  && IMGSUM[IMGSIZE] <= S100GB)         S100GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S100GB && IMGSUM[IMGSIZE] <= S250GB)         S250GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S250GB && IMGSUM[IMGSIZE] <= S500GB)         S500GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S500GB && IMGSUM[IMGSIZE] <= S1TB)           S1TB_COUNT+=1

                        else                                                                  SM1TB_COUNT+=1

                }

      

                        printf ("Images Size Range      Image Count\n")

                        printf ("< 100MB                %d\n", S100MB_COUNT)

                        printf ("> 100MB < 500MB        %d\n", S500MB_COUNT)

                        printf ("> 500MB < 1GB          %d\n", S1GB_COUNT)

                        printf ("> 1GB   < 5GB          %d\n", S5GB_COUNT)

                        printf ("> 5GB   < 10GB         %d\n", S10GB_COUNT)

                        printf ("> 10GB  < 50GB         %d\n", S50GB_COUNT)

                        printf ("> 50GB  < 100GB        %d\n", S100GB_COUNT)

                        printf ("> 100GB < 250GB        %d\n", S250GB_COUNT)

                        printf ("> 250GB < 500GB        %d\n", S500GB_COUNT)

                        printf ("> 500GB < 1TB          %d\n", S1TB_COUNT)

                        printf ("> 1TB                  %d\n", SM1TB_COUNT)              

        }' $SLP_DUMP

 

Output Sample:

Image Range            Image Count

< 100MB                     7000

> 100MB < 500MB    1000

> 500MB < 1GB          1500

> 1GB   < 5GB             800

> 5GB   < 10GB           500

> 10GB  < 50GB         500

> 50GB  < 100GB       300

> 100GB < 250GB      200

> 250GB < 500GB      50

> 500GB < 1TB           25

> 1TB                           20

            Two easy catches here are the 7,000 images smaller than 100 MB and the 20 images bigger than 1 TB. For the 7,000 small images I would first check the MIN_GB_SIZE_PER_DUPLICATION_JOB value in the LIFECYCLE_PARAMETERS file; if the value is too small, it is very likely we are creating tons of duplication jobs containing only 1 or 2 images each, and because they are so small the tape drives mount and dismount media every 10 minutes and never reach maximum speed, falling into a potential "shoe-shine effect". Increasing MIN_GB_SIZE_PER_DUPLICATION_JOB helps NBSTSERV better process the images and bundle them based on SLP, SLP priority, retention, source and destination; if all of these match, NBSTSERV will batch multiple images into a single, bigger duplication job.
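
For reference, this is roughly what the relevant entries look like in the LIFECYCLE_PARAMETERS file (usually /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS); the values below are only illustrative, not a recommendation for any particular environment:

            # /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS (illustrative values)
            MIN_GB_SIZE_PER_DUPLICATION_JOB = 8
            MAX_GB_SIZE_PER_DUPLICATION_JOB = 100
            MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 30
            DUPLICATION_SESSION_INTERVAL_MINUTES = 5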

            For the large images there are more things to check, such as comparing each image against the top 10 clients list to see whether those clients own any of them. A second check is the SLP state of those images: since we have some NOT_MANAGED images, it could be that some of these big ones are stuck because they are corrupted. Also, tuning the MAX_GB_SIZE_PER_DUPLICATION_JOB value to fit more data into a single duplication job could help improve each image's duplication.

 

Count Backlog by SLP

                The last step of the report is to know which SLP holds most of the backlog, or how balanced the load is. With this we can probably modify a couple of SLPs and fix the issue, or assign more resources to the SLPs that carry most of the load.

                 The process is to list each SLP, dump the incomplete images per SLP, and do the corresponding math summing all the fragments. In this case we don't need an awk array because we already know which SLP we are working on; we only need to pass the SLP name into the printing part of the report.

                printf "%50s%27s\n" "SLP Name" "Backlog Size"

                nbstl -b | while read SLP

                do

                     nbstlutil list -lifecycle $SLP -image_incomplete | awk '

                     $2=="F" { SUM+=$(NF-2) }

                      END {

                                printf ("%50s%20.2f TB\n", awkSLP, SUM/1024/1024/1024/1024)

                      } ' awkSLP=$SLP

               done | sort

               printf "\n\n"

Output sample:

SLP Name                                                                                                         Backlog Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             10.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-6month-MedA-tape1-stu1                             0.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                5.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-5year-MedA-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-3month-MedB-tape1-stu1                              10.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-6month-MedB-tape1-stu1                              1.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-1year-MedB-tape1-stu1                                3.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-5year-MedB-tape1-stu1                                 1.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-3month-MedC-tape1-stu1                              30.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-6month-MedC-tape1-stu1                              0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-1year-MedC-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-5year-MedC-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1                             30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1                             25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                                5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                                0.00 TB

                Our data shows four SLPs with double-digit numbers, but the most interesting ones are:

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1               30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1               25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                  5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                  0.00 TB

                 Because 60 TB of the data is held by media server MedD, matching step 5 (Count Backlog held by Media Server), we now have more granular data: we know which SLPs to attack first, and we can also figure out why MedD is heavily used while MedA and MedB are on vacation. The second checkpoint is MedC, with 30 TB clogging the 3-month SLP.

                 The possibilities are huge, but with the final report we can catch some obvious issues. This only covers the first phase of SLP troubleshooting, which is to know where the major problems are.

                 The final script is attached and can be used in Solaris environments. I haven't tried it on any other Unix/Linux platform, but it shouldn't be a problem beyond perhaps some slight modifications; if you have a different platform and the script fails, please let me know or upload the fix for your OS version.

                 Another note: if you don't have an AdvancedDisk configuration, just comment out the GetAdvDiskFreeSpace function, or adapt/create a new function for whatever you have, such as Data Domain or other third-party vendor configurations.

Script Syntax:

        SYNTAX: BacklogCheck.ksh -a | -sSBbDFMph [-m <email>] [-C <NClients>] | [-c <Ndays>]

                -a:     Print Full Report

                -s:     Print Short Report (NO SLP's)

                -S:     Print, Count and Sum SLP Images State

                -B:     Print Total Backlog in TB

                -b:     Print last 24hr Backup Info split by SLP

                -c:     Delete log files older than N days based on User Argument

                -C:     Get Top X Clients backlog where X is the desired top clients list

                -D:     Print Sum of Daily Duplications

                -i:     Print images count by size range

                -F:     Print DSU's Free Space

                -M:     Print Backlog held by Media Server

                -m:     Send Report to a Specified eMail

                -h:     Print this help.

        Sample: ./BacklogCheck.ksh -a -m darth.vader@thedarkside.com
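
Since the report is meant to run from cron (see the note on command paths above), a hedged example crontab entry could look like this; the install path and schedule are assumptions for illustration only:

        # Run the full report every morning at 06:00 and mail it (illustrative path and address)
        0 6 * * * /usr/openv/scripts/BacklogCheck.ksh -a -m darth.vader@thedarkside.com >/dev/null 2>&1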

Full Report Output:

Total Backlog                               120 TB

 

Media Server DSU                        Free Space

MedA-disk1-dsu1                          1.00 TB

MedB-disk1-dsu2                          1.00 TB

MedC-disk1-dsu3                         1.00 TB

MedD-disk1-dsu4                          1.00 TB

Total Free Space                           4 TB

 

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    15000          100.00 TB

NOT_MANAGED                   2000           10.00 TB

NOT_STARTED                   3000           10.00 TB

 

Media Server DSU                         Backlog Size

MedA-disk1-dsu1                          15.00 TB

MedB-disk1-dsu2                          15.00 TB

MedC-disk1-dsu3                          30.00 TB

MedD-disk1-dsu4                          60.00 TB

 

SLP Name                                                                                                         Backup Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             4.00 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-3month-MedB-tape1-stu2                              1.00 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-3month-MedC-tape1-stu3                              2.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-9month-MedD-tape1-stu4                             2.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                0.50 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-6month-MedB-tape1-stu2                              1.50 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-1year-MedC-tape1-stu3                                 1.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-5year-MedD-tape1-stu4                                 0.00 TB

 

Date                               Duplicated Data

Feb-23-2012                   0.05TB

Feb-24-2012                   2.10TB

Feb-25-2012                   4.30TB

Feb-26-2012                   5.54TB

Feb-27-2012                   5.58TB

Feb-28-2012                   4.23TB

Feb-29-2012                   0.39TB

 

Client Name                       Backlog Size

Windows1                     2500GB

Unix1                            2500GB

Exchange1                    10000GB

MSSQL1                       15000GB

Oracle1                         25000GB

 

Image Range            Image Count

< 100MB                     7000

> 100MB < 500MB    1000

> 500MB < 1GB          1500

> 1GB   < 5GB             800

> 5GB   < 10GB           500

> 10GB  < 50GB         500

> 50GB  < 100GB       300

> 100GB < 250GB      200

> 250GB < 500GB      50

> 500GB < 1TB           25

> 1TB                           20

 

SLP Name                                                                                                         Backlog Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             10.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-6month-MedA-tape1-stu1                             0.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                5.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-5year-MedA-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-3month-MedB-tape1-stu1                              10.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-6month-MedB-tape1-stu1                              1.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-1year-MedB-tape1-stu1                                 3.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-5year-MedB-tape1-stu1                                 1.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-3month-MedC-tape1-stu1                              30.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-6month-MedC-tape1-stu1                              0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-1year-MedC-tape1-stu1                                0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-5year-MedC-tape1-stu1                                0.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1                             30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1                             25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                                5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                                0.00 TB

 

Omar A. Villa

Netbackup Expert

These are my personal views and not those of the company I work for.

 

Comments
Omar_Villa
Level 6
Employee

Please post any comments or bugs under the script, and if you have any improvements to the report output they are very welcome.

 

Best Regards.

mph999
Level 6
Employee Accredited

Looks good, will have  a play when I have time.

Nice to see a script written properly.

Martin

Nicolai
Moderator
Moderator
Partner    VIP   

Nice Script !

The -lifecycle_only option is required if you have Data Domain or similar boxes. Otherwise backups on those appliances are counted as backlog.

Running the script on a Linux box causes a cut error when running with the -e option. (this may be caused by no backlog).

cut: invalid byte, character or field list

A yes from me.

 
Nathan_Kippen
Level 6
Certified

Just adding another reference for folks:

Troubleshooting Auto Image Replication ... a lot of stuff related to SLP

http://www.symantec.com/business/support/index?page=content&id=HOWTO42477

Also here is a link to NB 7.1 Best Practice - Using SLPs and AIR

http://www.symantec.com/business/support/index?page=content&id=TECH153154

 

 

 

revarooo
Level 6
Employee

Excellent job. We need more of these informative posts.

Omar_Villa
Level 6
Employee

Hi Guys,

       Appreciate the comments, and yes revarooo, more just came up. I just updated the article and uploaded a newer version of the script with a lot more functionality. Please take a look; the new functions are:

Get local Disk Free Space (Any type of Disk)

Count images by SLP Status

Top Clients Backlog

Split Images Count by Size Ranges

 

All explanations are in the article.

Please let me know what you think and if you find any bug or improvements.

Maurice_Byrd
Level 4

I'm working in an all Windows environment.  Does anyone know how to troubleshoot backlog issues on Windows Server 2008?

Omar_Villa
Level 6
Employee

It's pretty much the same concept, but you would need to develop the script for Windows; maybe PowerShell would do the job. Unfortunately I only have this for Unix/Linux and don't have a Windows environment to develop a Windows version, but the report output shows what you need to look for on the Windows side.

huangj11
Level 2

Great,good job.

Joe_Despres
Level 6
Partner

Do you have a version for Linux [RH].....

Thanks.....

 

Joe Despres

Nicolai
Moderator
Moderator
Partner    VIP   

The script is written in Korn Shell - it works on Linux as well.

Vinh_La
Not applicable

Omar,

I was going to ask you for this script :)

 

Thanks man.

 

Vinh.

Omar_Villa
Level 6
Employee

Hi,

     It's been a while since the last update. I have a couple of improvements to the script and a bug fix; please take a look at the new version 1.8.4.B and let me know if you have any issues, bugs or comments:

 

Updates:

#        DATE: 06/05/2012 BY: Omar A Villa
#        MODIFICATION: Introduced GetMediaServerDups Function (Ver 1.8.1.B)
#        DATE: 07/15/2013 BY: Omar A Villa
#        MODIFICATION: Added GetLibraryBacklog Function (Ver 1.8.2.B)
#        DATE: 11/14/2013 BY: Omar A Villa
#        MODIFICATION: Fixed bug under ValidateFilesAndFolders function when script is run by first time (Ver 1.8.3.B)
#        DATE: 11/14/2013 BY: Omar A Villa
#        MODIFICATION: Improved SendMail function to identify mail or mailx commands (Ver 1.8.4.B)

 

Open code to see functions in case you want to see the details.

 

Best Regards.

Omar_Villa
Level 6
Employee

Hi,

       There are a couple of improvements added to the script; please check it out, and thanks to Kevin Good for his input on the GetSLPsBklog function improvement, a very fine piece of code he wrote.

 

#        DATE: 02/01/2014 BY: Kevin Good
#        MODIFICATION: Improved GetSLPsBklog algorithm. Print only SLP's with backlog (Ver 1.8.5.B)
#        DATE: 02/01/2014 BY: Omar A Villa
#        MODIFICATION: Added -k parameter in to main to only list SLP's with backlog (Ver 1.8.6.B)

 

Best Regards.

backdfup
Not applicable
Partner Accredited Certified

Good read Omar!! I will put this to work for me.

 

DC Martin

HoldTheLine
Level 4

This is a really, really good script!  Thanks for not only putting in the work to do this but also for sharing.  I am seeing some odd output and am not sure if there is some customization that has to be done to get it working in some environments, for example:

 

Running with the -a switch for the full report I see this right

 

"Date                          Duplicated Data
Dec-31-1969                   4314.97TB"
 

 

@a                11.8195
S5                1.79069
 

the @a looks like a disk device, not sure what the S5 is but that adds up to the total backlog of about 12TB.  Any idea what that is supposed to be telling me? The time stamp of Dec-31-1969 is confusing to me as well.  This is on Linux RH if it matters.

 

Also, under 24 hour backup there is a heading for Policy name but it lists it not as the name of a policy but a number, in this case 0:

Total 24hr Backup             24.48 TB

                                       Policy Name                Backup Size
                                                 0               24.48 TB

 

Omar_Villa
Level 6
Employee

Hi,

          I think I can explain and modify the script:

          1. about the output:

           "Date                          Duplicated Data
            Dec-31-1969                   4314.97TB"

             There is a bug in the script with NBU 7.5. This output is supposed to print the amount of data duplicated in the last 5 days; for some reason the output is coming up with the oldest date NBU supports. I will take a look and update the script as soon as I can.

 

           2. about:

              @a                11.8195
              S5                1.79069

             Sorry, on this one I forgot to add the headers. These are the first 2 characters of your tape barcodes; the intention is to let you know which source library or VTL holds the backlog so you can focus on that. You are right, @a is your disk, which holds 11.8 TB, and S5 is a library whose tape barcodes start with S5, holding 1.79 TB of the backlog. In the next version I think I will need to print the full disk name for those cases where we have multiple instances.

 

            3. another bug:

              Total 24hr Backup             24.48 TB

                                       Policy Name                Backup Size
                                                 0               24.48 TB

              Let me check on this one, but I'm sure it is the same issue with bpdbjobs in NBU 7.5: the output changed a bit, the columns moved, and that is messing with the output.

 

As soon as I have everything fixed I will upload the script and post the output; hope it doesn't take me too long.

 

Thanks a lot for your input on enhancing this script.

Best Regards.

Andrew_Madsen
Level 6
Partner

Omar,

While you are at it:

 

slp_l4enbmed03_l4nbpdpa1_rep_l8nbpdpa1_logs_1mon 0.00 TB

slp_l8vnb5220a_l8nbpdpa1_rep_1mon 0.00 TB

slp_l4inb5220a-passthru-l4nbpbpa1_rep_l8nbpdpa1_2wks 0.00 TB

slp_l4enb5220a_l4nb5020a_l8nb5020a_1mon 0.00 TB

slp_l4enbmed02_rep_l8nbpdpa1_l4nb5020a_1mon 0.00 TB

slp_l4inb5220a-passthru-l4nbpbpa1_rep_l8nbpdpa1_1mon 0.00 TB

slp_l4enbmed03_rep_l8nbpdpa1_l4nb5020a_1mon 0.00 TB

slp_l4enbmed01_FS_rep_1mon 0.00 TB

slp_l4vnb5220b_l4nbpdpb1_rep_1mon 0.00 TB

slp_l4vnb5220a_l4nbpdpb1_rep_1mon 0.00 TB

slp_l4vnb5220b_l4nbpdpa1_rep_1mon 0.00 TB

slp_l4vnb5220a_l4nbpdpa1_rep_1mon 0.00 TB

slp_l8inb5220a_l8nbpdpa1_rep_l4nbpdpa1_2wks 0.00 TB

slp_l8inb5220a_l8nbpdpa1_rep_l4nbpdpa1_1mon 0.00 TB

slp_l4enbmed02_rep_l8nbpdpa1_c7nb5220a_dup_l4nb5020a 0.00 TB

slp_l4enbmed02_l4nbpdpa1_rep_l8nbpdpa1_logs_2Weeks 0.00 TB

Those values should be something besides 0.00 TB

Omar_Villa
Level 6
Employee

Hi Andrew,

            I think you might be running an old version of the script; in 1.8.6.B we fixed this so it only presents SLPs with backlog. If you are running the newest version, then those SLPs have a very small backlog and the math plus printf rounding are eating the value, meaning those SLPs' backlog is probably something like 0.001 TB. To confirm, you can go to the GetSLPsBklog function and modify this line from:

            printf ("%50s%20.2f TB\n", FoundSLP, SUM[FoundSLP]/1024/1024/1024/1024)

           TO

            printf ("%50s%20.2f TB\n", FoundSLP, SUM[FoundSLP]/1024/1024)

 

This will print in MB instead of TB.

 

Check it out and let us know.

Regards.

Omar_Villa
Level 6
Employee

Hi,

    The script has been updated and fixed; please check for the new version. Here are the updates:

 

#        DATE: 03/06/2013 BY: Omar A Villa
#        MODIFICATION: Fixed bug on GetSLPsBklog that was reporting 0's for each SLP backlog size (Ver 1.8.7.B)
#        DATE: 04/02/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetDailyDups function using bperror instead of bpdbjobs fixing output bug (Ver 1.8.8.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetLibraryBacklog function; Introduced header and splited report by Disk or Barcodes first 2 chars (Ver 1.8.9.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetSLPLast24hrBkp function; Removed NBU version search steps (Ver 1.8.10.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Introduced Header to GetSLPsBklog function (Ver 1.8.11.B)

 

Any questions please let me know.

Regards.

HoldTheLine
Level 4

Looking great so far, thanks!

 

Omar_Villa
Level 6
Employee

Hi Everyone,

            I have some enhancements, and I fixed a big bug in the backlog count that cuts the reported backlog size by about 50%: the script dump was also counting the Copy 1 data instead of only Copy 2 or higher. If you have more than 3 copies it is very likely you will need to customize the script a bit; please check out the modified functions and let me know if you have any questions.

 

#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Decommed GetMediaServerDups function (Ver 1.8.12.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Introduced GetMediaServerDupsAndSpeeds function (Ver 1.9.0.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Modified GetDailyDups function to print Average Speeds (Ver 1.9.1.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Enhanced GetTotalBklog function to print SLP Copies Backlogs (Ver 1.9.2.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Modified DataDumps function to fix bug that was doubling backlog size (Ver 1.9.3.B)

 

Have a good one.

Omar_Villa
Level 6
Employee

A few slight updates:

 

#        MODIFICATION: Modified DataDumps Function to fix bug for cases where there is only 1 dup copy (Ver 1.9.4.B)
#        DATE: 04/23/2014 BY: Omar A Villa
#        MODIFICATION: Enhanced CleanLogs Function adding an array and loop that will go through all logs names (Ver 1.9.5.B)
#        DATE: 04/24/2014 BY: Omar A Villa
#        MODIFICATION: Introduced GetOS function to help with those commands with syntax differences (ver 1.10.0.B)

 

Enjoy.

sgt_why
Level 3

Omar@ This has been invaluable!! 

We've been fighting with an insane backlog for months with no easy solution in sight.

After just a couple of days of running this script, we've already identified multiple areas for improvement.

For starters, we had 825TB of "un managed" images in the Queue ... which was about 80% of our images.

I found a tech note on how to clean that part up.

 

In addition, we have over 125,000 images under 100mb in size.

We are using LTO4s and have the following configured;

MIN_GB_SIZE_PER_DUPLICATION_JOB = 200
MAX_GB_SIZE_PER_DUPLICATION_JOB = 800
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 120
DUPLICATION_SESSION_INTERVAL_MINUTES = 15

... any advice on how to tweak it further to account for the large number of tiny images?

Omar_Villa
Level 6
Employee

Hi Sgt_why,

 

             Usually when I see a lot of small images I like them to be bundled into large duplication jobs; this way you will see fewer mount and dismount requests and NBRB will do better. Based on the parameters you have, I would change them to:

MIN_GB_SIZE_PER_DUPLICATION_JOB = 200
MAX_GB_SIZE_PER_DUPLICATION_JOB = 800
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 240
DUPLICATION_SESSION_INTERVAL_MINUTES = 60

 

             This will allow EMM and NBSTSERV to better process and bundle those tiny images into a big job. The trick is whether those images can actually be bundled, because there are several rules behind the bundling, such as retention, destination, SLP policy and some other small factors. In my experience, backlog troubleshooting is more an art than an exact science and you have to be very patient: make only one change a day and wait to see how your performance improves, so the next day you know exactly where you are. It will take a week or two to finally get a perfect tune-up; I know it's crazy, but it's what I have learned over all these years of dealing with the backlog beast.

            If the changes I recommended don't work, don't get disappointed; it just means you have a lot of images with different retentions and destinations, and you will need to tweak your SLPs' config to send data to the same STUs or SUGs.

 

Hope this helps and let us know how it goes.

Best Regards.

 

These are my personal opinions and not those of the company I work for.

 

sgt_why
Level 3

Hi Omar,

What would cause a high number of NOT_MANAGED images ?

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    1322           60.51 TB

NOT_MANAGED                   51083          541.26 TB

NOT_STARTED                   30             0.46 TB

 

I cleared all the backlog down to (0) and now it has grown back to over 51,000 images ... and is growing daily.

Any suggestions on how to troubleshoot it further?

Omar_Villa
Level 6
Employee
Hi sgt, I explained this in the article above, under the function that counts the image states: in our sample we have a total of 20,000 images in backlog, but 15,000 are IN_PROCESS, 2,000 are NOT_MANAGED and 3,000 are NOT_STARTED. Each of these states demands different actions, but to start, it is good to know why those 2,000 images are NOT_MANAGED; in my personal experience they are either bad images or backup policies not using SLPs, and if you have many status 800 errors it is very likely a list of bad images. Hope this helps. This is my personal opinion and not that of the company I work for.
chengfr
Level 3

Hi Omar, 

Your script gives a lot of valuable information on SLPs. Thanks. 

Could you elaborate the argument D 0 and 1? What's the difference in your script in terms of Duplication and Replication? 

We only have one backup domain so no "NetBackup Replication" so I am puzzled with "Replication Data"

Date                          Replicated Data  
07/18/2015                    47.37 TB         
07/19/2015                    56.84 TB         
07/20/2015                    13.74 TB         
07/21/2015                    18.48 TB         
07/22/2015                    17.85 TB         
  

Date                          Duplicated Data 
07/19/2015                    29.82 TB        
07/20/2015                    19.32 TB        
07/21/2015                    16.85 TB        
07/22/2015                    5.15 TB         
 

Thanks

 

 

Omar_Villa
Level 6
Employee

Hi Chengfr,

         I'm glad the script is helping you. About your question: the difference between replication and duplication is the action behind the job. For example, if you are using an OST technology to replicate your data, let's say between Data Domains, the deduplication appliances will only move the differentials or changed data; that is what is behind replication. On the other side, when you duplicate data you are moving all the data, not just the changes; the most common example is when you copy data to tape. So based on your output my guess is that you are probably using an OST technology to replicate, and there is a second copy remotely that goes to tape. Remember I'm just guessing here; if that is not the case I would need some more info about your environment to be able to tell what is going on.

        Regarding the -D 0|1 option, it basically selects whether to print only the replication or the duplication output. Sometimes when we are troubleshooting SLPs we don't want all the data, because the report can take a long time if you have a big backlog; that is why I split the script into so many functions.

 

Please let me know if I'm wrong or if you have any more questions.

Best Regards

Omar_Villa
Level 6
Employee

Hey,

         I just found a bug in the GetDailyDups and GetMediaServerDupsAndSpeeds functions: the initial IF condition was swapped, where 1==Dups, but it really should be 0==Dups. I have changed this and updated the script; you should be able to download the new version now.

#        DATE: 07/23/2015 BY: Omar A Villa
#        MODIFICATION: Fixed bug under GetDailyDups Function changing IF DUP_TYPE condition to equal 0 (ver 1.10.1.B)
#        DATE: 07/23/2015 BY: Omar A Villa
#        MODIFICATION: Fixed bug under GetMediaServerDupsAndSpeeds Function changing IF DUP_TYPE condition to equal 0 (ver 1.10.2.B)

 

Hope this helps.

Regards.

chengfr
Level 3

Hi Omar, 

You are right that we have Data Domain replication between two data centers and then tape out. That said, the -D 0 output would be the data taped out, and the -D 1 output would be the data replicated by Data Domain. Please correct me if that's wrong. 

I found that the output from v1.10.0.B and v1.10.2.B differs quite a bit.

Output from v1.10.2.B run today reports differently compared with output from v1.10.1.B from my yesterday's post. 

Date                          Duplicated Data         Average Duplication Speed
07/19/2015                    14.97 TB                 46201.10 KB/sec
07/20/2015                    19.32 TB                 63930.18 KB/sec
07/21/2015                    16.85 TB                 77306.24 KB/sec
07/22/2015                    5.15 TB                 60920.53 KB/sec
07/23/2015                    3.92 TB                 58941.52 KB/sec
07/24/2015                    3.18 TB                 56265.58 KB/sec

Date                          Replicated Data         Average Replication Speed
07/19/2015                    20.05 TB                 425773956.11 KB/sec
07/20/2015                    13.74 TB                 11398933.00 KB/sec
07/21/2015                    18.48 TB                 23037142.41 KB/sec
07/22/2015                    17.85 TB                 34231434.93 KB/sec
07/23/2015                    16.73 TB                 34343889.51 KB/sec
07/24/2015                    10.92 TB                 49382506.56 KB/sec

Thanks

Omar_Villa
Level 6
Employee

Hi Chengfr,

          The command used to pull this data is bperror with a 120-hour window, meaning some data is constantly left behind every time you run the script. I have sometimes seen big drops from one day to another because we leave some big jobs behind, but on average it works fine. About the crazy replication speeds, I can't tell why you see those numbers, but I have seen this behavior when the script is run against a new NetBackup version whose command output has changed. By chance, are you using NBU 7.6? I haven't tested the script on that version; it may be worth checking whether there is a new column in the bperror output that is messing with the speeds.

 

Hope this helps.

Best Regards.

chengfr
Level 3

Hi Omar, 

Thanks for your kind explanation. Your reply explains the difference between my two reports well. I am running NBU 7.6, so you are right, it might be affected by the new command output format.

Thanks again. 

 

Iwan_Tamimi
Level 6

Hi Omar, 

Thank you very much for the very useful script. 

I just don't understand a part of the output: 

Total 24hr Backup             48.40 TB

                                       Policy Name           Backup Size/24hr

                                     SLP_7DAYS_MOC                0.52 TB

                                    SLP_21DAYS_ROC                1.78 TB

                                    SLP_35DAYS_ROC                0.04 TB

                                      SLP_TEST_MOC                5.82 TB

                                            *NULL*               38.44 TB

                                    SLP_35DAYS_MOC                1.81 TB

                                       Policy Name               Backlog Size

                                    SLP_21DAYS_ROC                0.01 TB

                                      SLP_TEST_MOC                0.07 TB

                                    SLP_35DAYS_MOC                0.01 TB

 

What is the *NULL* ?

 

Thank you,

 

Iwan 
