Omar_Villa
Level 6
Employee

                To troubleshoot an SLP backlog we first need to define what a backlog is: essentially, it is data that has not yet been duplicated to the second (or Nth) destination configured in the SLP.

For example:

                SLP Name: SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1

                                C1: Copy 1 (Backup)

                                1week: 1 Week Retention

                                MedA-disk1-dsu1: Backup destination, the STU MedA-disk1-dsu1

                                C2: Copy 2 (Duplicate)

                                3month: 3 Month Retention

                                MedA-tape1-stu1: Duplication destination, the STU MedA-tape1-stu1

                The storage lifecycle policy SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1 will back up data to a local disk DSU named MedA-disk1-dsu1 and, when that is done, duplicate the data from MedA-disk1-dsu1 to MedA-tape1-stu1. If for any reason the images stored under MedA-disk1-dsu1 don't get duplicated to MedA-tape1-stu1, we start to build a backlog in MedA-disk1-dsu1, and because the SLP's nature is to create every image under MedA-disk1-dsu1 with an infinite retention, at some point MedA-disk1-dsu1 will fill up and all backups will fail. Only once the duplication is successful does the retention for images under MedA-disk1-dsu1 change to the 1-week retention so they can eventually expire.
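
If you want to check where a single image stands in this process, nbstlutil can report on one backup at a time. This is only a quick illustration; the backup ID below is hypothetical and the exact flags and output format can vary by NetBackup release:

        # Show SLP processing details for a single image (hypothetical backup ID)
        /usr/openv/netbackup/bin/admincmd/nbstlutil list -backupid clientA_1330500000 -U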

                Now imagine a scenario where we have 20 different SLPs with different destinations for backups or duplications. Troubleshooting this can be a real challenge, which is why I created an ordered troubleshooting approach to better find bottlenecks and potential configuration issues.

                Throughout this article we will build a set of KSH functions that come together in a final script. The main idea is to learn to read each function's output and know how to interpret each piece of data; the other key goal is to collect all the needed information in order to deliver a better solution. A minimal skeleton of how the functions fit together is sketched right after the list of steps below.

Steps:

  1. Dump SLP incomplete images
  2. Get local Disk free space (Advanced Disk for this sample) new
  3. Count total Backlog
  4. Count Images by SLP Status new
  5. Count Backlog held by Media Server
  6. Count total Backups in the last 24 hours
  7. Top Clients Backlog new
  8. Split Images Count by size ranges new
  9. Count total duplications on a daily basis
  10. Count Backlog by SLP.
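
As mentioned above, here is a minimal sketch of how those functions hang together; the function names, paths and contents are illustrative only, and the attached script differs in detail:

        #!/bin/ksh
        # Skeleton of the approach used in this article (illustrative only)
        SCRIPT_NAME=BacklogCheck
        LOG_DIR=/var/log/$SCRIPT_NAME/logs
        SLP_DUMP=$LOG_DIR/$SCRIPT_NAME.DUMP
        NBSTLUTIL=/usr/openv/netbackup/bin/admincmd/nbstlutil    # full path so cron can find it

        DumpIncompleteImages ()
        {
                mkdir -p $LOG_DIR
                $NBSTLUTIL list -l -image_incomplete > $SLP_DUMP    # step 1: dump the incomplete images
        }

        GetTotalBklog ()
        {
                # step 3: sum the fragment (F) record sizes and print the total in TB
                awk '$2=="F" {SUM+=$14} END {printf ("%-30s%.2f TB\n", "Total Backlog", SUM/1024/1024/1024/1024)}' $SLP_DUMP
        }

        # ...one function per step; the final report simply calls them in order.
        DumpIncompleteImages
        GetTotalBklog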

Dump SLP incomplete images.

               First we need to dump all the images that haven't been duplicated and send that data to a file; the following code dumps all incomplete images into a log file.

               #!/bin/ksh
               SCRIPT_NAME=BacklogCheck                          # script name used for the log paths (set as needed)
               LOG_DIR=/var/log/$SCRIPT_NAME/logs
               SLP_DUMP=$LOG_DIR/$SCRIPT_NAME.DUMP
               mkdir -p $LOG_DIR                                 # make sure the log directory exists
               nbstlutil list -l -image_incomplete > $SLP_DUMP
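
As a quick sanity check on the dump, a small sketch that only relies on the record-type column used by every awk filter in this article can count how many records of each type we captured:

               # Count the record types in the dump ($2 is the record type, e.g. I = image, F = fragment)
               awk '{TYPES[$2]++} END {for (T in TYPES) print T, TYPES[T]}' $SLP_DUMP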

 

Get local Disk free space (Advanced Disk for this sample)

                To keep this sample simple we use Advanced Disk as our first destination option. To know where we stand, and whether there is enough space left to keep delivering healthy backups, we must know how much free space we have across all our media servers.

The first step is to dump the disk volume list and, with awk, grab only the DSU name and its free space, totalling the free space overall and per media server.

        nbdevquery -listdv -stype AdvancedDisk -l | awk '

        {

                SER[$2]+=$7  # Array keyed by DSU name, accumulating free space.

                TOTAL+=$7    # Running total of free space across all DSUs.

        }

Once the awk main body is done we print the header, the total free space, and each DSU value by going through each cell of the SER[] array.

        END {

                printf ("%-30s%.2f TB\n\n", "Total Free Space", TOTAL/1024)

                printf ("%-30s%s\n", "Media Server", "Free Space")

                for (INC in SER) {printf ("%-30s%.2f TB\n",INC, SER[INC]/1024) }

        }'

 

Output sample:

 

Total Free Space                           4 TB

 

Media Server DSU                        Free Space

MedA-disk1-dsu1                          1.00 TB

MedB-disk1-dsu2                          1.00 TB

MedC-disk1-dsu3                          1.00 TB

MedD-disk1-dsu4                        1.00 TB

 Update:

            After troubleshooting other sites with different technologies like PureDisk or Data Domain, I realized that limiting this script to AdvancedDisk wasn't that helpful, so I decommissioned this function and created a new one that detects any storage technology and reports the free space by DSU.

            We have also found that cron is not that smart and sometimes needs help finding the NetBackup commands, so we introduced the full path of each NBU command as a variable so the script can run cleanly under cron.

            >$LOG_DIR/total.log                 # Wipeout total.log file used to Summarize Free Space.

            NBDEVQUERY=/usr/openv/netbackup/bin/admincmd/nbdevquery

 

            printf "%-30s%-20s%s\n" "Media Server" "Storage Type" "Free Space"

 

            After we print the function's header, instead of dumping only the AdvancedDisk info we list every storage server type and loop through each one to gather the free space. The logic is the same as the old AdvancedDisk function; we only introduced a new output column that tells us which technology the DSU is using.

  

            $NBDEVQUERY -liststs -l | awk '{print $3}' | sort -u | while read STYPE

            do

                        $NBDEVQUERY -listdv -stype $STYPE -l | awk '

                        {

                                    SER[$2]+=$7

                                    TOTAL+=$7

                        }

                        END {

                                    print TOTAL >> "'$LOG_DIR'/total.log"

                                    for (INC in SER)

                                                {printf ("%-30s%-20s%.2f TB\n",INC, awkSTYPE, SER[INC]/1024)}

                        }' awkSTYPE=$STYPE

            done

 

            We learned a new trick to help a bit with performance and avoid re-running the whole thing just to sum the total free space: as you can see, we print the TOTAL variable into a total.log file, which collects one subtotal per storage type, and with the following awk we simply go through that file and add it up.

 

        awk '

        { TOTAL+=$1 }

        END {

            printf ("\n%-50s%.2f TB\n\n", "Total Free Space", TOTAL/1024)

        }' $LOG_DIR/total.log

 

Output Sample:

 

Media Server                  Storage Type        Free Space

MedA-disk1-dsu1          AdvancedDisk      1.00 TB

MedB-disk1-dsu2          AdvancedDisk      1.00 TB

MedC-disk1-dsu3          PureDisk             1.00 TB

MedD-disk1-dsu4          DataDomain        1.00 TB

 

Count total Backlog

                The next step is to know how much data we are holding, or better said, how much has not been duplicated. This comes from step 1, where we dumped the data into the $SLP_DUMP file.

     First we sum every image fragment that hasn't been duplicated:

        awk '

        $2=="F" {SUM+=$14}

 

     When the sum is done we print the total in TB.

 

        END {

                printf ("%-30s%.2f TB\n\n", "Total Backlog ", SUM/1024/1024/1024/1024 )

        }' $SLP_DUMP

Output sample:

Total Backlog                 120 TB

 

Count Images by SLP Status

            A key piece of backlog troubleshooting is the image state; knowing the status of the images is priceless for making better decisions. In the next piece of code we count images and sum image sizes by SLP state. We only handle the six main states (there are others), but just knowing how many images are NOT_MANAGED or IN_PROCESS should be enough to tell whether we have corrupted images or policies not using SLPs.

            awk goes through the dumped images and checks the 11th column for the values 0, 1, 2, 3, 9 and 10, which represent the NOT_MANAGED, NOT_STARTED, IN_PROCESS, COMPLETE, NOT_STARTED_INACTIVE and IN_PROCESS_INACTIVE states. Once the right value is found it is translated into a string, and we add 1 to an array cell keyed by that STATE string; the idea is to walk the array in the awk END block and print all the states with a simple loop.

            printf "%-30s%-15s%s\n" "IMAGE STATUS" "IMAGES COUNT" "SIZE"

            awk '

            $2=="I" {

                        IMAGE=$4

                        STATE_COL=$11

              

                        if (STATE_COL == 0)       STATE = "NOT_MANAGED"

                        else if (STATE_COL == 1)  STATE = "NOT_STARTED"

                        else if (STATE_COL == 2)  STATE = "IN_PROCESS"

                        else if (STATE_COL == 3)  STATE = "COMPLETE"

                        else if (STATE_COL == 9)  STATE = "NOT_STARTED INACTIVE"

                        else if (STATE_COL == 10) STATE = "IN_PROCESS INACTIVE"

                        else STATE = "OTHER"

              

                        IMG_STATE_LIST[STATE]+=1

            }

  

            To get the fragment size of the image captured in the previous awk block we compare columns 2 and 4: the first tells us we are on a fragment (F) line, and the second ensures the awk loop hasn't moved on to a different image. A second array sums the fragment sizes by the same STATE captured in the previous condition.

            $2=="F" && $4==IMAGE {

                        IMG_SUM[STATE]+=$14  

            }

 

            Once we have gone through the dump file we print the results by walking the arrays, showing the image count and the total storage queued per SLP state.

            END {

                        for (STATE_ELM in IMG_STATE_LIST)

                                    printf ("%-30s%-15d%.2f TB\n",  STATE_ELM, IMG_STATE_LIST[STATE_ELM], IMG_SUM[STATE_ELM]/1024/1024/1024/1024)

            }' $SLP_DUMP | sort

            printf "\n\n"

 

            In our sample we have a total of 20,000 images in the backlog: 15,000 are IN_PROCESS, 2,000 are NOT_MANAGED and 3,000 are NOT_STARTED. Each of these states demands different actions, but to start, it is good to know why those 2,000 images are NOT_MANAGED; in my experience they are either bad images or backup policies not using SLPs, and if you see many status 800 errors it is very likely a list of bad images.

Output Sample:

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    15000          100.00 TB

NOT_MANAGED                   2000           10.00 TB

NOT_STARTED                   3000           10.00 TB
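
To follow up on that NOT_MANAGED count, a small sketch using the same dump file and the same column positions described above can list the affected backup IDs so they can be investigated one by one:

            # Backup IDs of images whose SLP state (column 11 of the I record) is 0 = NOT_MANAGED
            awk '$2=="I" && $11==0 {print $4}' $SLP_DUMP | sort | head -20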

 

Count Backlog by Media Server

                Knowing the total backlog only tells us how bad our duplication SLA is. The next step is to break the dump down a bit and figure out which media server holds most of the data; this helps us make decisions such as assigning or removing SLP Alternate Readers, or changing an STU's concurrent drives value to give more resources to a specific media server.

                First we capture the storage unit name and fragment size from the $SLP_DUMP file and sum each fragment into an array, which lets us split the backlog by media server DSUs or STUs.

                printf "%-30s%s\n" "Media Server DSU" "Backlog Size"

                awk '{if ($2=="F") {print $9,$14} }' $SLP_DUMP | sort |

                awk '

                {

                        MED_LIST[$1]+=$2

                }

                Once the sum is done we go through the MED_LIST[] array and print, in TB, the total under each array cell (media server DSU name).

                END {

                                for(MED in MED_LIST)

                                printf ("%-30s%.2f TB\n", MED, MED_LIST[MED]/1024/1024/1024/1024) |"sort"

                }'

                printf "\n\n"

 

Output sample:

 

Media Server DSU                           Backlog Size

MedA-disk1-dsu1                          15.00 TB

MedB-disk1-dsu2                          15.00 TB

MedC-disk1-dsu3                          30.00 TB

MedD-disk1-dsu4                          60.00 TB

                This sample output tells us that MedD-disk1-dsu4 is holding half of the backlog and is probably our first point to troubleshoot. From experience, the first things to look at are tape drive health, storage unit groups missing an STU/DSU (forcing the SLP to skip duplicating the data held on the excluded STU/DSU), or backup policies over-utilizing the STU/DSU. There are lots of possibilities, but these are the ones I have found to be most common.

Count total Backups in the last 24 hours

                It is impossible to know where we stand on backlog without knowing how much data we are pulling in. This is why we count how much data we backed up in the last 24 hours: it tells us whether the free space we have will be enough for another night of backups, and we can compare it against another function, explained later, that shows how much data we are duplicating per day.

                Because we dump the last 24 hours of images with bpimagelist, we first need to know which NetBackup version we are running, since the output differs between NetBackup 6 and 7. Once we capture the version we know which column offset locates the SLP name and can let awk do the math.

                NBUVER=$(cat /usr/openv/netbackup/version | grep -i version | awk '{print $NF}' | awk -F. '{print $1}')

      

                if (( NBUVER == 7 )); then

                                IMG_SLP_COL=6

                else

                                IMG_SLP_COL=3

                fi

                The process is very similar to our previous functions: we dump the data and with awk arrays we sum how much data we backed up in the last 24 hours, both in total and by SLP. The per-SLP breakdown is important because it lets us spot the SLP with the highest backup load.

                bpimagelist -l -hoursago 24 | grep -i ^IMAGE | awk '

                {

                                SLP_LIST[$(NF-SLPCOL)]+=$19   # SLPCOL variable stores KSH Variable $IMG_SLP_COL value.

                                TOTAL+=$19

                }

                Once the sum is done in the awk main body, we print the results in the awk END block, going through the SLP_LIST[] array and showing the total data backed up per SLP.

                END {

                                printf ("%-30s%.2f TB\n\n", "Total 24hr Backup", TOTAL/1024/1024/1024)

                                printf ("%50s%27s\n", "Policy Name", "Backup Size")

                                for (SLP in SLP_LIST) {printf ("%50s%20.2f TB\n", SLP, SLP_LIST[SLP]/1024/1024/1024)}

                }' SLPCOL=$IMG_SLP_COL

               printf "\n\n"

 

Output sample:

 

SLP Name                                                                                                         Backup Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             4.00 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-3month-MedB-tape1-stu2                              1.00 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-3month-MedC-tape1-stu3                              2.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-9month-MedD-tape1-stu4                             2.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                 0.50 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-6month-MedB-tape1-stu2                              1.50 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-1year-MedC-tape1-stu3                                1.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-5year-MedD-tape1-stu4                                0.00 TB

                The idea is to get a picture of which SLPs are backing up the most data. For simplicity, the SLP names in this article only include STUs; there are no SUGs in this sample.

Count total duplications on a daily basis

                The next step is to know how much data we are duplicating. This can be tricky because we need to go through the jobs database and figure out how much data was successfully moved per job.

                First we capture the list of successful duplication jobs and join the job IDs into a single comma-separated line for the bpdbjobs command; this makes the lookup of successful jobs much faster than querying them one by one.

                printf "%-30s%s\n" "Date" "Duplicated Data"

                bpdbjobs -report | grep -i duplica | grep -i done | awk '{print $1}' | tr '\n' ',' | read JOBSLIST   # IDs of completed duplication jobs, comma separated

                echo $JOBSLIST | wc -m | read JOBSCHARS

                ((JOBSNUM=$JOBSCHARS-2))                                # drop the trailing comma left by tr

                echo $JOBSLIST | cut -c1-$JOBSNUM | read FINALJOBSLIST

                With the list ready we pull two columns from each line, the Unix timestamp and the size of the job in KB. These two values are used to translate the date into a human-readable format and to sum the successful duplications.

                bpdbjobs -jobid $FINALJOBSLIST -most_columns | awk -F, '{print $9,$15}' | while read UDATE SIZEKB

                do

                                RES=$(bpdbm -ctime $UDATE | awk '{print $4"-"$5"-"$NF}')

                                echo $RES $SIZEKB

                done | \

                With the date in human-readable format we sum the written data into an array split by date, so we can print a history of the last 4-6 days (the numbers for the oldest days can change as jobs get deleted from the Activity Monitor, which is why it is good to run this script daily and keep a record under the logs folder).

                awk '

                {

                                DAYLIST[$1]+=$2

                }

With the list done, we simply go through each array cell and print the results of the sums in TB.

                END {

                                for (DAYDUP in DAYLIST)

                                                printf ("%-30s%.2f%s\n",DAYDUP, DAYLIST[DAYDUP]/1024/1024/1024, "TB" )

                }' | sort -n

                printf "\n\n"

Output sample:

Date                              Duplicated Data

Feb-23-2012                   0.05TB

Feb-24-2012                   2.10TB

Feb-25-2012                   4.30TB

Feb-26-2012                   5.54TB

Feb-27-2012                   5.58TB

Feb-28-2012                   4.23TB

Feb-29-2012                   0.39TB

               To know whether we are doing well or badly on duplications we need to know how many tape drives are available and what kind they are. For this sample we have 10 LTO4 drives shared across 4 media servers. With that said, we know we are way behind on performance: each drive should be able to move around 120 MB/sec in an ideal world, so at the very least we should expect to move around 2-4 TB a day per drive, meaning we probably have a bottleneck at the drive or media server level (drive and media server performance troubleshooting will be covered in a second article; we first build a strong case and later make the right modifications).
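
To put rough numbers behind that estimate, here is a back-of-the-envelope calculation; the drive count, speed and duty cycle are this sample's assumptions, not measurements:

                # Rough daily duplication capacity (illustrative assumptions)
                DRIVES=10           # LTO4 drives available for duplication
                SPEED_MB=120        # best-case native transfer rate in MB/sec
                HOURS=8             # realistic hours per day a drive spends actually writing
                awk -v d=$DRIVES -v s=$SPEED_MB -v h=$HOURS 'BEGIN {
                        per_drive = s * h * 3600 / 1024 / 1024      # TB per drive per day
                        printf ("Per drive/day : %.1f TB\n", per_drive)
                        printf ("All drives/day: %.1f TB\n", per_drive * d)
                }'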

                There is also always the possibility that the reading side (the disk array) of the duplication is the root cause of the bottleneck, but we check everything in the backup world first before we blame the SAN guys.

Top Clients Backlog

            For those cases where one day we have no backlog and suddenly, within a 24 or 48 hour window, we jump to 20 TB of backlog out of nowhere, it is usually because a client decided to dump a 10 TB database into a folder that is not part of the exclude list, killing our space and increasing the backlog. To quickly detect this we created a function that by default gives us the top 10 backlog clients, so we can engage that customer and see what actions can be taken to prevent a bigger impact.

            The function allows us to select the number of clients we want to print, default is 10 but it can be any desired number.

            TOP=${1:-10}          # Number of clients to print; defaults to 10 if no argument is given

 

            As soon as we print the header, the logic is the same: we go through the SLP dump file, capture each client name as an array cell, and sum the fragment sizes into that client's cell.

  

            printf "%-30s%s\n" "Client Name" "Backlog Size"

            awk '$2=="F" {print $4,$14} ' $SLP_DUMP | tr '_' ' ' | awk '

            {

                        CLIENT_LIST[$1]+=$3

            }

 

            Once we have captured all the clients we print them by walking the array, and to surface the clients with the biggest backlog we sort the list by the size value (second column) and tail the last $TOP lines to print only the top 10, or any desired number of, clients.

 

            END {

                        for (CLIENT in CLIENT_LIST)

                                    printf ("%-30s%.2f GB\n", CLIENT, CLIENT_LIST[CLIENT]/1024/1024/1024)

            }' | sort -nk 2,2 | tail -$TOP

            print "\n\n"

 

Output Sample:

 

            In this quick sample we print the top 5 clients with the biggest backlog, and we can easily see that these clients represent at least 45% (55 TB) of our 120 TB backlog, which is not a bad place to start looking.

 

Client Name                       Backlog Size

Windows1                     2500GB

Unix1                            2500GB

Exchange1                    10000GB

MSSQL1                       15000GB

Oracle1                         25000GB

 

Count Images by Size Ranges

            Tuning the LIFECYCLE_PARAMETERS file can be a challenge if we don't have the right data, and guessing or changing values by trial and error doesn't go well with a backlog. Because of this, and to better understand what NBSTSERV is doing with the images, we need to know how many images we have and their sizes; with this information we can tune values such as MIN_GB_SIZE_PER_DUPLICATION_JOB and MAX_GB_SIZE_PER_DUPLICATION_JOB, as we will see in the output sample explanation.

The code is quite simple: we capture each image and its fragment sizes into an array, then scan the array and compare each cell's value against a set of hard-coded ranges, increasing the matching range counter by one. When the loop is done we simply print the image count per range.

 

        awk '

        $2=="I" {

                IMAGE=$4

        }

        $2=="F" && $4==IMAGE {

                IMGSUM[IMAGE]+=$14

        }

 

            Hard-coded range values:

 

        END {

                S100MB=104857600

                S500MB=524288000

                S1GB=1073741824

                S5GB=5368709120

                S10GB=10737418240

                S50GB=53687091200

                S100GB=107374182400

                S250GB=268435456000

                S500GB=536870912000

                S1TB=1073741824000

 

            The loop goes through the array values, compares them with the ranges we want, and increases a count variable that we print later.

 

                for (IMGSIZE in IMGSUM) {

                        if (IMGSUM[IMGSIZE] <= S100MB)                                            S100MB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S100MB && IMGSUM[IMGSIZE] <= S500MB)         S500MB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S500MB && IMGSUM[IMGSIZE] <= S1GB)           S1GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S1GB   && IMGSUM[IMGSIZE] <= S5GB)           S5GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S5GB   && IMGSUM[IMGSIZE] <= S10GB)          S10GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S10GB  && IMGSUM[IMGSIZE] <= S50GB)          S50GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S50GB  && IMGSUM[IMGSIZE] <= S100GB)         S100GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S100GB && IMGSUM[IMGSIZE] <= S250GB)         S250GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S250GB && IMGSUM[IMGSIZE] <= S500GB)         S500GB_COUNT+=1

                        else if (IMGSUM[IMGSIZE] > S500GB && IMGSUM[IMGSIZE] <= S1TB)           S1TB_COUNT+=1

                        else                                                                  SM1TB_COUNT+=1

                }

      

                        printf ("Images Size Range      Image Count\n")

                        printf ("< 100MB                %d\n", S100MB_COUNT)

                        printf ("> 100MB < 500MB        %d\n", S500MB_COUNT)

                        printf ("> 500MB < 1GB          %d\n", S1GB_COUNT)

                        printf ("> 1GB   < 5GB          %d\n", S5GB_COUNT)

                        printf ("> 5GB   < 10GB         %d\n", S10GB_COUNT)

                        printf ("> 10GB  < 50GB         %d\n", S50GB_COUNT)

                        printf ("> 50GB  < 100GB        %d\n", S100GB_COUNT)

                        printf ("> 100GB < 250GB        %d\n", S250GB_COUNT)

                        printf ("> 250GB < 500GB        %d\n", S500GB_COUNT)

                        printf ("> 500GB < 1TB          %d\n", S1TB_COUNT)

                        printf ("> 1TB                  %d\n", SM1TB_COUNT)              

        }' $SLP_DUMP

 

Output Sample:

Image Range            Image Count

< 100MB                     7000

> 100MB < 500MB    1000

> 500MB < 1GB          1500

> 1GB   < 5GB             800

> 5GB   < 10GB           500

> 10GB  < 50GB         500

> 50GB  < 100GB       300

> 100GB < 250GB      200

> 250GB < 500GB      50

> 500GB < 1TB           25

> 1TB                           20

            Two easy catches here are the 7,000 images smaller than 100 MB and the 20 images bigger than 1 TB. For the 7,000 small images I would first check the MIN_GB_SIZE_PER_DUPLICATION_JOB value in the LIFECYCLE_PARAMETERS file; if the value is too small, it is very likely we are creating tons of duplication jobs containing only 1 or 2 images each, and because they are so small the tape drives mount and dismount media every 10 minutes and never reach maximum speed, falling into a potential "shoe-shine effect". Increasing MIN_GB_SIZE_PER_DUPLICATION_JOB helps NBSTSERV better process the images and bundle them based on SLP, SLP priority, retention, source and destination; if all of these match, NBSTSERV will batch multiple images into a single, bigger duplication job.
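
For reference, this is roughly what the relevant entries look like in the LIFECYCLE_PARAMETERS file (usually /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS); the values below are only illustrative, not a recommendation for any particular environment:

            # /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS (illustrative values)
            MIN_GB_SIZE_PER_DUPLICATION_JOB = 8
            MAX_GB_SIZE_PER_DUPLICATION_JOB = 100
            MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 30
            DUPLICATION_SESSION_INTERVAL_MINUTES = 5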

            For the large images there are more things to check, such as comparing each image against the top 10 clients list to see whether those clients own any of them. A second check is the SLP state of those images: since we have some NOT_MANAGED images, it could be that some of these big ones are stuck because they are corrupted. Also, tuning the MAX_GB_SIZE_PER_DUPLICATION_JOB value to fit more data into a single duplication job could help improve each image's duplication.

 

Count Backlog by SLP

                The last step of the report is to know which SLP holds most of the backlog, or how balanced the load is. With this we can probably modify a couple of SLPs and fix the issue, or assign more resources to the SLPs that carry most of the load.

                 The process is to list each SLP, dump the incomplete images per SLP, and do the corresponding math summing all the fragments. In this case we don't need an awk array because we already know which SLP we are working on; we only need to pass the SLP name into the printing part of the report.

                printf "%50s%27s\n" "SLP Name" "Backlog Size"

                nbstl -b | while read SLP

                do

                     nbstlutil list -lifecycle $SLP -image_incomplete | awk '

                     $2=="F" { SUM+=$(NF-2) }

                      END {

                                printf ("%50s%20.2f TB\n", awkSLP, SUM/1024/1024/1024/1024)

                      } ' awkSLP=$SLP

               done | sort

               printf "\n\n"

Output sample:

SLP Name                                                                                                         Backlog Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             10.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-6month-MedA-tape1-stu1                             0.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                5.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-5year-MedA-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-3month-MedB-tape1-stu1                              10.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-6month-MedB-tape1-stu1                              1.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-1year-MedB-tape1-stu1                                3.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-5year-MedB-tape1-stu1                                 1.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-3month-MedC-tape1-stu1                              30.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-6month-MedC-tape1-stu1                              0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-1year-MedC-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-5year-MedC-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1                             30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1                             25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                                5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                                0.00 TB

                Our data shows four SLPs with double-digit numbers, but the most interesting ones are:

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1               30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1               25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                  5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                  0.00 TB

                 Because 60 TB of the data is held by media server MedD, matching step 5 (Count Backlog held by Media Server), we now have more granular data: we know which SLPs to attack first, and we can also figure out why MedD is heavily used while MedA and MedB are on vacation. The second checkpoint is MedC, with 30 TB clogging the 3-month SLP.

                 The possibilities are huge, but with the final report we can catch some obvious issues. This only covers the first phase of SLP troubleshooting, which is to know where the major problems are.

                 The final script is attached and can be used in Solaris environments. I haven't tried it on any other Unix/Linux platform, but it shouldn't be a problem beyond perhaps some slight modifications; if you have a different platform and the script fails, please let me know or upload the fix for your OS version.

                 Another note: if you don't have an AdvancedDisk configuration, just comment out the GetAdvDiskFreeSpace function, or adapt/create a new function for whatever you have, such as Data Domain or other third-party vendor configurations.

Script Syntax:

        SYNTAX: BacklogCheck.ksh -a | -sSBbDFMph [-m <email>] [-C <NClients>] | [-c <Ndays>]

                -a:     Print Full Report

                -s:     Print Short Report (NO SLP's)

                -S:     Print, Count and Sum SLP Images State

                -B:     Print Total Backlog in TB

                -b:     Print last 24hr Backup Info split by SLP

                -c:     Delete log files older than N days based on User Argument

                -C:     Get Top X Clients backlog where X is the desired top clients list

                -D:     Print Sum of Daily Duplications

                -i:     Print images count by size range

                -F:     Print DSU's Free Space

                -M:     Print Backlog held by Media Server

                -m:     Send Report to a Specified eMail

                -h:     Print this help.

        Sample: ./BacklogCheck.ksh -a -m darth.vader@thedarkside.com
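
Since the report is meant to run from cron (see the note on command paths above), a hedged example crontab entry could look like this; the install path and schedule are assumptions for illustration only:

        # Run the full report every morning at 06:00 and mail it (illustrative path and address)
        0 6 * * * /usr/openv/scripts/BacklogCheck.ksh -a -m darth.vader@thedarkside.com >/dev/null 2>&1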

Full Report Output:

Total Backlog                               120 TB

 

Media Server DSU                        Free Space

MedA-disk1-dsu1                          1.00 TB

MedB-disk1-dsu2                          1.00 TB

MedC-disk1-dsu3                         1.00 TB

MedD-disk1-dsu4                          1.00 TB

Total Free Space                           4 TB

 

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    15000          100.00 TB

NOT_MANAGED                   2000           10.00 TB

NOT_STARTED                   3000           10.00 TB

 

Media Server DSU                         Backlog Size

MedA-disk1-dsu1                          15.00 TB

MedB-disk1-dsu2                          15.00 TB

MedC-disk1-dsu3                          30.00 TB

MedD-disk1-dsu4                          60.00 TB

 

SLP Name                                                                                                         Backup Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             4.00 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-3month-MedB-tape1-stu2                              1.00 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-3month-MedC-tape1-stu3                              2.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-9month-MedD-tape1-stu4                             2.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                0.50 TB

SLP_C1-1week-MedB-disk1-dsu2_C2-6month-MedB-tape1-stu2                              1.50 TB

SLP_C1-1week-MedC-disk1-dsu3_C2-1year-MedC-tape1-stu3                                 1.00 TB

SLP_C1-1week-MedD-disk1-dsu4_C2-5year-MedD-tape1-stu4                                 0.00 TB

 

Date                               Duplicated Data

Feb-23-2012                   0.05TB

Feb-24-2012                   2.10TB

Feb-25-2012                   4.30TB

Feb-26-2012                   5.54TB

Feb-27-2012                   5.58TB

Feb-28-2012                   4.23TB

Feb-29-2012                   0.39TB

 

Client Name                       Backlog Size

Windows1                     2500GB

Unix1                            2500GB

Exchange1                    10000GB

MSSQL1                       15000GB

Oracle1                         25000GB

 

Image Range            Image Count

< 100MB                     7000

> 100MB < 500MB    1000

> 500MB < 1GB          1500

> 1GB   < 5GB             800

> 5GB   < 10GB           500

> 10GB  < 50GB         500

> 50GB  < 100GB       300

> 100GB < 250GB      200

> 250GB < 500GB      50

> 500GB < 1TB           25

> 1TB                           20

 

SLP Name                                                                                                         Backlog Size

SLP_C1-1week-MedA-disk1-dsu1_C2-3month-MedA-tape1-stu1                             10.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-6month-MedA-tape1-stu1                             0.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-1year-MedA-tape1-stu1                                5.00 TB

SLP_C1-1week-MedA-disk1-dsu1_C2-5year-MedA-tape1-stu1                                 0.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-3month-MedB-tape1-stu1                              10.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-6month-MedB-tape1-stu1                              1.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-1year-MedB-tape1-stu1                                 3.00 TB

SLP_C1-1week-MedB-disk1-dsu1_C2-5year-MedB-tape1-stu1                                 1.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-3month-MedC-tape1-stu1                              30.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-6month-MedC-tape1-stu1                              0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-1year-MedC-tape1-stu1                                0.00 TB

SLP_C1-1week-MedC-disk1-dsu1_C2-5year-MedC-tape1-stu1                                0.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-3month-MedD-tape1-stu1                             30.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-6month-MedD-tape1-stu1                             25.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-1year-MedD-tape1-stu1                                5.00 TB

SLP_C1-1week-MedD-disk1-dsu1_C2-5year-MedD-tape1-stu1                                0.00 TB

 

Omar A. Villa

Netbackup Expert

These are my personal views and not those of the company I work for.

 

Comments
Omar_Villa
Level 6
Employee

Please post any comments or bugs under the script, and if you have any improvements to the report output they are very welcome.

 

Best Regards.

mph999
Level 6
Employee Accredited

Looks good, will have  a play when I have time.

Nice to see a script written properly.

Martin

Nicolai
Moderator
Moderator
Partner    VIP   

Nice Script !

The -lifecycle_only option is required if you have Data Domain or similar boxes. Otherwise backups on those appliances are counted as backlog.

Running the script on a Linux box causes a cut error when running with the -e option. (this may be caused by no backlog).

cut: invalid byte, character or field list

A yes from me.

 
Nathan_Kippen
Level 6
Certified

Just adding another reference for folks:

Troubleshooting Auto Image Replication ... a lot of stuff related to SLP

http://www.symantec.com/business/support/index?page=content&id=HOWTO42477

Also here is a link to NB 7.1 Best Practice - Using SLPs and AIR

http://www.symantec.com/business/support/index?page=content&id=TECH153154

 

 

 

revarooo
Level 6
Employee

Excellent job. We need more of these informative posts.

Omar_Villa
Level 6
Employee

Hi Guys,

       Appreciate the comments, and yes revarooo, more just came up. I just updated the article and uploaded a newer version of the script with a lot more functionality. Please take a look; the new functions are:

Get local Disk Free Space (Any type of Disk)

Count images by SLP Status

Top Clients Backlog

Split Images Count by Size Ranges

 

All explanations are in the article.

Please let me know what you think and if you find any bug or improvements.

Maurice_Byrd
Level 4

I'm working in an all Windows environment.  Does anyone know how to troubleshoot backlog issues on Windows Server 2008?

Omar_Villa
Level 6
Employee

It's pretty much the same concept, but you would need to develop the script for Windows; maybe PowerShell would do the job. Unfortunately I only have this for Unix/Linux and don't have a Windows environment to develop a Windows version, but the report output shows what you need to look for on the Windows side.

huangj11
Level 2

Great,good job.

Joe_Despres
Level 6
Partner

Do you have a version for Linux [RH].....

Thanks.....

 

Joe Despres

Nicolai
Moderator
Moderator
Partner    VIP   

The script is written in Korn Shell - it works on Linux as well.

Vinh_La
Not applicable

Omar,

I was going to ask you for this script :)

 

Thanks man.

 

Vinh.

Omar_Villa
Level 6
Employee

Hi,

     It's been a while since the last update. I have a couple of improvements to the script and a bug fix; please take a look at the new version 1.8.4.B and let me know if you have any issues, bugs or comments:

 

Updates:

#        DATE: 06/05/2012 BY: Omar A Villa
#        MODIFICATION: Introduced GetMediaServerDups Function (Ver 1.8.1.B)
#        DATE: 07/15/2013 BY: Omar A Villa
#        MODIFICATION: Added GetLibraryBacklog Function (Ver 1.8.2.B)
#        DATE: 11/14/2013 BY: Omar A Villa
#        MODIFICATION: Fixed bug under ValidateFilesAndFolders function when script is run by first time (Ver 1.8.3.B)
#        DATE: 11/14/2013 BY: Omar A Villa
#        MODIFICATION: Improved SendMail function to identify mail or mailx commands (Ver 1.8.4.B)

 

Open code to see functions in case you want to see the details.

 

Best Regards.

Omar_Villa
Level 6
Employee

Hi,

       There are a couple of improvements added to the script; please check it out, and thanks to Kevin Good for his input on the GetSLPsBklog function improvement, a very fine piece of code he wrote.

 

#        DATE: 02/01/2014 BY: Kevin Good
#        MODIFICATION: Improved GetSLPsBklog algorithm. Print only SLP's with backlog (Ver 1.8.5.B)
#        DATE: 02/01/2014 BY: Omar A Villa
#        MODIFICATION: Added -k parameter in to main to only list SLP's with backlog (Ver 1.8.6.B)

 

Best Regards.

backdfup
Not applicable
Partner Accredited Certified

Good read Omar!! I will put this to work for me.

 

DC Martin

HoldTheLine
Level 4

This is a really, really good script!  Thanks for not only putting in the work to do this but also for sharing.  I am seeing some odd output and am not sure if there is some customization that has to be done to get it working in some environments, for example:

 

Running with the -a switch for the full report I see this right

 

"Date                          Duplicated Data
Dec-31-1969                   4314.97TB"
 

 

@a                11.8195
S5                1.79069
 

the @a looks like a disk device, not sure what the S5 is but that adds up to the total backlog of about 12TB.  Any idea what that is supposed to be telling me? The time stamp of Dec-31-1969 is confusing to me as well.  This is on Linux RH if it matters.

 

Also, under 24 hour backup there is a heading for Policy name but it lists it not as the name of a policy but a number, in this case 0:

Total 24hr Backup             24.48 TB

                                       Policy Name                Backup Size
                                                 0               24.48 TB

 

Omar_Villa
Level 6
Employee

Hi,

          I think I can explain and modify the script:

          1. about the output:

           "Date                          Duplicated Data
            Dec-31-1969                   4314.97TB"

             There is a bug in the script with NBU 7.5. This output is supposed to print the amount of data duplicated in the last 5 days; for some reason the output is coming up with the oldest date NBU supports. I will take a look and update the script as soon as I can.

 

           2. about:

              @a                11.8195
              S5                1.79069

             Sorry, on this one I forgot to add the headers. These are the first 2 characters of your tape barcodes; the intention is to let you know which source library or VTL holds the backlog so you can focus on that. You are right, @a is your disk, which holds 11.8 TB, and S5 is a library whose tape barcodes start with S5, holding 1.79 TB of the backlog. In the next version I think I will need to print the full disk name for those cases where we have multiple instances.

 

            3. another bug:

              Total 24hr Backup             24.48 TB

                                       Policy Name                Backup Size
                                                 0               24.48 TB

              Let me check on this one, but I'm sure it is the same issue with bpdbjobs in NBU 7.5: the output changed a bit, the columns moved, and that is messing with the output.

 

As soon as I have everything fixed I will upload the script and post the output; hope it doesn't take me too long.

 

Thanks a lot for your input on enhancing this script.

Best Regards.

Andrew_Madsen
Level 6
Partner

Omar,

While you are at it:

 

slp_l4enbmed03_l4nbpdpa1_rep_l8nbpdpa1_logs_1mon 0.00 TB

slp_l8vnb5220a_l8nbpdpa1_rep_1mon 0.00 TB

slp_l4inb5220a-passthru-l4nbpbpa1_rep_l8nbpdpa1_2wks 0.00 TB

slp_l4enb5220a_l4nb5020a_l8nb5020a_1mon 0.00 TB

slp_l4enbmed02_rep_l8nbpdpa1_l4nb5020a_1mon 0.00 TB

slp_l4inb5220a-passthru-l4nbpbpa1_rep_l8nbpdpa1_1mon 0.00 TB

slp_l4enbmed03_rep_l8nbpdpa1_l4nb5020a_1mon 0.00 TB

slp_l4enbmed01_FS_rep_1mon 0.00 TB

slp_l4vnb5220b_l4nbpdpb1_rep_1mon 0.00 TB

slp_l4vnb5220a_l4nbpdpb1_rep_1mon 0.00 TB

slp_l4vnb5220b_l4nbpdpa1_rep_1mon 0.00 TB

slp_l4vnb5220a_l4nbpdpa1_rep_1mon 0.00 TB

slp_l8inb5220a_l8nbpdpa1_rep_l4nbpdpa1_2wks 0.00 TB

slp_l8inb5220a_l8nbpdpa1_rep_l4nbpdpa1_1mon 0.00 TB

slp_l4enbmed02_rep_l8nbpdpa1_c7nb5220a_dup_l4nb5020a 0.00 TB

slp_l4enbmed02_l4nbpdpa1_rep_l8nbpdpa1_logs_2Weeks 0.00 TB

Those values should be something besides 0.00 TB

Omar_Villa
Level 6
Employee

Hi Andrew,

            I think you might be running an old version of the script; in 1.8.6.B we fixed this so it only presents SLPs with backlog. If you are running the newest version, then those SLPs have a very small backlog and the math plus printf rounding are eating the value, meaning those SLPs' backlog is probably something like 0.001 TB. To confirm, you can go to the GetSLPsBklog function and modify this line from:

            printf ("%50s%20.2f TB\n", FoundSLP, SUM[FoundSLP]/1024/1024/1024/1024)

           TO

            printf ("%50s%20.2f TB\n", FoundSLP, SUM[FoundSLP]/1024/1024)

 

This will print in MB instead of TB.

 

Check it out and let us know.

Regards.

Omar_Villa
Level 6
Employee

Hi,

    The script has been updated and fixed; please check for the new version. Here are the updates:

 

#        DATE: 03/06/2013 BY: Omar A Villa
#        MODIFICATION: Fixed bug on GetSLPsBklog that was reporting 0's for each SLP backlog size (Ver 1.8.7.B)
#        DATE: 04/02/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetDailyDups function using bperror instead of bpdbjobs fixing output bug (Ver 1.8.8.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetLibraryBacklog function; Introduced header and splited report by Disk or Barcodes first 2 chars (Ver 1.8.9.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Re-architected GetSLPLast24hrBkp function; Removed NBU version search steps (Ver 1.8.10.B)
#        DATE: 04/03/2014 BY: Omar A Villa
#        MODIFICATION: Introduced Header to GetSLPsBklog function (Ver 1.8.11.B)

 

Any questions please let me know.

Regards.

HoldTheLine
Level 4

Looking great so far, thanks!

 

Omar_Villa
Level 6
Employee

Hi Everyone,

            I have some enhancements, and I fixed a big bug in the backlog count that cuts the reported backlog size by about 50%: the script dump was also counting the Copy 1 data instead of only Copy 2 or higher. If you have more than 3 copies it is very likely you will need to customize the script a bit; please check out the modified functions and let me know if you have any questions.

 

#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Decommed GetMediaServerDups function (Ver 1.8.12.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Introduced GetMediaServerDupsAndSpeeds function (Ver 1.9.0.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Modified GetDailyDups function to print Average Speeds (Ver 1.9.1.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Enhanced GetTotalBklog function to print SLP Copies Backlogs (Ver 1.9.2.B)
#        DATE: 04/08/2014 BY: Omar A Villa
#        MODIFICATION: Modified DataDumps function to fix bug that was doubling backlog size (Ver 1.9.3.B)

 

Have a good one.

Omar_Villa
Level 6
Employee

A few slight updates:

 

#        MODIFICATION: Modified DataDumps Function to fix bug for cases where there is only 1 dup copy (Ver 1.9.4.B)
#        DATE: 04/23/2014 BY: Omar A Villa
#        MODIFICATION: Enhanced CleanLogs Function adding an array and loop that will go through all logs names (Ver 1.9.5.B)
#        DATE: 04/24/2014 BY: Omar A Villa
#        MODIFICATION: Introduced GetOS function to help with those commands with syntax differences (ver 1.10.0.B)

 

Enjoy.

sgt_why
Level 3

Omar@ This has been invaluable!! 

We've been fighting with an insane backlog for months with no easy solution in sight.

After just a couple of days of running this script, we've already identified multiple areas for improvement.

For starters, we had 825TB of "un managed" images in the Queue ... which was about 80% of our images.

I found a tech note on how to clean that part up.

 

In addition, we have over 125,000 images under 100mb in size.

We are using LTO4s and have the following configured;

MIN_GB_SIZE_PER_DUPLICATION_JOB = 200
MAX_GB_SIZE_PER_DUPLICATION_JOB = 800
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 120
DUPLICATION_SESSION_INTERVAL_MINUTES = 15

... any advice on how to tweak it further to account for the large number of tiny images?

Omar_Villa
Level 6
Employee

Hi Sgt_why,

 

             Usually when I see a lot of small images I like them to be bundled into large duplication jobs; this way you will see fewer mount and dismount requests and NBRB will do better. Based on the parameters you have, I would change them to:

MIN_GB_SIZE_PER_DUPLICATION_JOB = 200
MAX_GB_SIZE_PER_DUPLICATION_JOB = 800
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 240
DUPLICATION_SESSION_INTERVAL_MINUTES = 60

 

             This will allow EMM and NBSTSERV to better process and bundle those tiny images into a big job. The trick is whether those images can actually be bundled, because there are several rules behind the bundling, such as retention, destination, SLP policy and some other small factors. In my experience, backlog troubleshooting is more an art than an exact science and you have to be very patient: make only one change a day and wait to see how your performance improves, so the next day you know exactly where you are. It will take a week or two to finally get a perfect tune-up; I know it's crazy, but it's what I have learned over all these years of dealing with the backlog beast.

            If the changes I recommended don't work, don't get disappointed; it just means you have a lot of images with different retentions and destinations, and you will need to tweak your SLPs' config to send data to the same STUs or SUGs.

 

Hope this helps and let us know how it goes.

Best Regards.

 

These are my personal opinions and not those of the company I work for.

 

sgt_why
Level 3

Hi Omar,

What would cause a high number of NOT_MANAGED images ?

IMAGE STATUS                  IMAGES COUNT   SIZE

IN_PROCESS                    1322           60.51 TB

NOT_MANAGED                   51083          541.26 TB

NOT_STARTED                   30             0.46 TB

 

I cleared all the backlog down to (0) and now it has grown back to over 51,000 images ... and is growing daily.

Any suggestions on how to troubleshoot it further?

Omar_Villa
Level 6
Employee
Hi sgt, I explained this in the article above, under the function that counts the image states: in our sample we have a total of 20,000 images in backlog, but 15,000 are IN_PROCESS, 2,000 are NOT_MANAGED and 3,000 are NOT_STARTED. Each of these states demands different actions, but to start, it is good to know why those 2,000 images are NOT_MANAGED; in my personal experience they are either bad images or backup policies not using SLPs, and if you have many status 800 errors it is very likely a list of bad images. Hope this helps. This is my personal opinion and not that of the company I work for.
chengfr
Level 3

Hi Omar, 

Your script gives a lot of valuable information on SLPs. Thanks. 

Could you elaborate the argument D 0 and 1? What's the difference in your script in terms of Duplication and Replication? 

We only have one backup domain so no "NetBackup Replication" so I am puzzled with "Replication Data"

Date                          Replicated Data  
07/18/2015                    47.37 TB         
07/19/2015                    56.84 TB         
07/20/2015                    13.74 TB         
07/21/2015                    18.48 TB         
07/22/2015                    17.85 TB         
  

Date                          Duplicated Data 
07/19/2015                    29.82 TB        
07/20/2015                    19.32 TB        
07/21/2015                    16.85 TB        
07/22/2015                    5.15 TB         
 

Thanks

 

 

Omar_Villa
Level 6
Employee

Hi Chengfr,

         I'm glad the script is helping you. About your question: the difference between replication and duplication is the action behind the job. For example, if you are using an OST technology to replicate your data, let's say between Data Domains, the deduplication appliances will only move the differentials or changed data; that is what is behind replication. On the other side, when you duplicate data you are moving all the data, not just the changes; the most common example is when you copy data to tape. So based on your output my guess is that you are probably using an OST technology to replicate, and there is a second copy remotely that goes to tape. Remember I'm just guessing here; if that is not the case I would need some more info about your environment to be able to tell what is going on.

        Regarding the -D 0|1 option, it basically selects whether to print only the replication or the duplication output. Sometimes when we are troubleshooting SLPs we don't want all the data, because the report can take a long time if you have a big backlog; that is why I split the script into so many functions.

 

Please let me know if I'm wrong or if you have any more questions.

Best Regards

Omar_Villa
Level 6
Employee

Hey,

         I just found a bug in the GetDailyDups and GetMediaServerDupsAndSpeeds functions: the initial IF condition was swapped, where 1==Dups, but it really should be 0==Dups. I have changed this and updated the script; you should be able to download the new version now.

#        DATE: 07/23/2015 BY: Omar A Villa
#        MODIFICATION: Fixed bug under GetDailyDups Function changing IF DUP_TYPE condition to equal 0 (ver 1.10.1.B)
#        DATE: 07/23/2015 BY: Omar A Villa
#        MODIFICATION: Fixed bug under GetMediaServerDupsAndSpeeds Function changing IF DUP_TYPE condition to equal 0 (ver 1.10.2.B)

 

Hope this helps.

Regards.

chengfr
Level 3

Hi Omar, 

You are right that we have Data Domain replication between two data centers and then tape out. That said, the -D 0 output would be the data taped out, and the -D 1 output would be the data replicated by Data Domain. Please correct me if that's wrong. 

I found that the output from v1.10.0.B and v1.10.2.B differs quite a bit.

Output from v1.10.2.B run today reports differently compared with output from v1.10.1.B from my yesterday's post. 

Date                          Duplicated Data         Average Duplication Speed
07/19/2015                    14.97 TB                 46201.10 KB/sec
07/20/2015                    19.32 TB                 63930.18 KB/sec
07/21/2015                    16.85 TB                 77306.24 KB/sec
07/22/2015                    5.15 TB                 60920.53 KB/sec
07/23/2015                    3.92 TB                 58941.52 KB/sec
07/24/2015                    3.18 TB                 56265.58 KB/sec

Date                          Replicated Data         Average Replication Speed
07/19/2015                    20.05 TB                 425773956.11 KB/sec
07/20/2015                    13.74 TB                 11398933.00 KB/sec
07/21/2015                    18.48 TB                 23037142.41 KB/sec
07/22/2015                    17.85 TB                 34231434.93 KB/sec
07/23/2015                    16.73 TB                 34343889.51 KB/sec
07/24/2015                    10.92 TB                 49382506.56 KB/sec

Thanks

Omar_Villa
Level 6
Employee

Hi Chengfr,

          The command used to pull this data is bperror with a 120-hour window, meaning some data is constantly left behind every time you run the script. I have sometimes seen big drops from one day to another because we leave some big jobs behind, but on average it works fine. About the crazy replication speeds, I can't tell why you see those numbers, but I have seen this behavior when the script is run against a new NetBackup version whose command output has changed. By chance, are you using NBU 7.6? I haven't tested the script on that version; it may be worth checking whether there is a new column in the bperror output that is messing with the speeds.

 

Hope this helps.

Best Regards.

chengfr
Level 3

Hi Omar, 

Thanks for your kind explanation. Your reply explains the difference between my two reports well. I am running NBU 7.6, so you are right, it might be affected by the new command output format.

Thanks again. 

 

Iwan_Tamimi
Level 6

Hi Omar, 

Thank you very much for the very useful script. 

I just don't understand a part of the output: 

Total 24hr Backup             48.40 TB

                                       Policy Name           Backup Size/24hr

                                     SLP_7DAYS_MOC                0.52 TB

                                    SLP_21DAYS_ROC                1.78 TB

                                    SLP_35DAYS_ROC                0.04 TB

                                      SLP_TEST_MOC                5.82 TB

                                            *NULL*               38.44 TB

                                    SLP_35DAYS_MOC                1.81 TB

                                       Policy Name               Backlog Size

                                    SLP_21DAYS_ROC                0.01 TB

                                      SLP_TEST_MOC                0.07 TB

                                    SLP_35DAYS_MOC                0.01 TB

 

What is the *NULL* ?

 

Thank you,

 

Iwan 
