
Tapes are taking too much time to expire

BharadwajR
Level 3

Hi,

For the past 3 days, we have been facing an issue with tape expiration.

In our environment, we expire 30 tapes daily with the bpexpdate command.
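
For reference, the command we run per tape is along these lines (the media ID here is just an example):

/usr/openv/netbackup/bin/admincmd/bpexpdate -m A00001 -d 0
# -d 0 expires all images on that media immediately and deassigns the tape (the command asks for confirmation)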

Previously we used to expire each tape within 5 minutes.

But now it is taking almost 40 minutes to expire each tape.

We haven't made any upgrades or changes recently. This issue suddenly started 3 days ago.

Please help.

 

Details:

Netbackup version: 7.0.1

Master server OS:  SunOS 5.10

Media Server OS : AIX 6.1 (2 media servers)

 

Regards

Sai


13 REPLIES

revarooo
Level 6
Employee

Have you considered upgrading? 7.0.1 is old

How many jobs are running at the time you are expiring the tapes? Have you checked? If the system is busy, expiration will take longer, as the database has other work to deal with.
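
A quick way to check that at the time you run the expirations (assuming the standard admincmd path on the master):

/usr/openv/netbackup/bin/admincmd/bpdbjobs -summary
# prints a summary with counts of queued/active/completed jobs; a busy master means bpdbm already has plenty to do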

Have you enabled the bpdbm and admin logs before you expire, and then checked the logs to see what is happening?
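
If those log directories do not exist yet, creating them is what enables the legacy logging (standard Unix paths; bpdbm may need a restart before it starts writing there):

mkdir /usr/openv/netbackup/logs/bpdbm
mkdir /usr/openv/netbackup/logs/admin
# admin commands log on their next run; bpdbm picks the directory up when it is restarted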

 

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified

1. You should not need to expire media when you've got a proper backup strategy and retention period configured!

2. Have you checked if there are any media servers that are not online/reachable in the environment?
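
If you want a quick connectivity test from the master to each media server, something like this should do it (the hostname is a placeholder):

/usr/openv/netbackup/bin/admincmd/bptestbpcd -host mediaserver01
# a clean run prints the connection details; a hang or error points at an unreachable media server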

Marianne
Level 6
Partner    VIP    Accredited Certified

If your Retention levels are configured in line with Business requirements, there should never be a need to manually expire tapes.

To get back to your issue:

Can you see a Image Cleanup job in Activity Monitor when bpexpdate is running?
Can you see a list of images in Overview tab of the job?
Can you see multiple or hanging bpdbm jobs?
(We could see multiple bpdbm as well as bpexpdate -deassignempty jobs...)

Any chance that any of these images are linked to GRT jobs?

Do you have verbose logging of bpdbm and admin enabled?

The reason for my questions is that I have seen something similar at my previous company about 4 years ago with an early NBU 7.x version at a fairly large customer. 

I unfortunately do not remember the outcome/solution of this incident - just the symptoms and that we were waiting for the customer to enable verbose logging and restart NBU.

I doubt that the issue is still around in more recent versions of NBU as I have not seen or heard of similar issues since then....

Bottom line - you need verbose logging of admin and bpdbm.
Level 3 is fine for most troubleshooting attempts, but Symantec Support will need level 5 logs.
 

BharadwajR
Level 3

Thank you revarooo for the quick response.

 

Around 84 month-end backups are running at the time I am expiring the tapes.

Below is the disk space utilization of /usr/openv on the master server.

bash-3.2# df -h /usr/openv
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d41        591G   499G    86G    86%    /usr/openv

Although 86% is on the high side, we used to expire tapes easily with the same load previously.

Both bpdbm and admin logs are already enabled, but the bpdbm log is very large (around 50 GB).

 

revarooo
Level 6
Employee

If you are expiring tapes manually to control the amount of disk space used by your catalog because you are nearing the 90% limit that NetBackup needs to run, you need to:

 

A) Look at your retention settings for your backups as previously mentioned

or

B) Increase your disk space of your /usr/openv/ filesystem

or

C) Move the catalogs to a filesystem with more space.
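
Whichever option you pick, it is worth seeing where the catalog space is actually going first. A rough check (standard image catalog path on Unix, adjust if yours has been relocated):

du -sk /usr/openv/netbackup/db/images/* | sort -n | tail
# sizes in KB per client directory; the biggest consumers show which clients/retentions to look at first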

 

 

mph999
Level 6
Employee Accredited

If the logs are large, simply 'zero' them ...

eg.

cd /usr/openv/netbackup/logs/bpdbm

>log.<date>

Repeat for admin log

This makes the log zero size (copy it first if you wish to keep it, but do NOT delete it, as that will break the logging and NBU will need restarting).

It would be best to do this at a quiet time, but once zeroed out, run a bpexpdate command and wait for it to complete. Then grab a copy of the logs quickly, before they get too large.
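
As a rough sketch of that sequence (the date in the log file name and the copy location are just examples):

cd /usr/openv/netbackup/logs/bpdbm
cp log.010115 /var/tmp/bpdbm.log.before    # keep a copy elsewhere, do NOT delete the original
> log.010115                               # truncate in place so bpdbm keeps writing to the same file
# ... run the bpexpdate for one tape and wait for it to complete ...
cp log.010115 /var/tmp/bpdbm.log.expire    # grab the fresh log before it grows too large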

As mentioned already, bpexpdate should never be run as part of a daily/weekly process. That is not what it is designed for, and manually running this command leads to mistakes and lost data.

BharadwajR
Level 3

Thank you Marianne for the quick response. Please find the answers below.

I am not seeing an Image Cleanup job in the Activity Monitor when bpexpdate is running.

Yes, I am seeing multiple bpdbm processes.

GRT is not enabled in our environment.

Logging of bpdbm and admin is enabled. The bpdbm verbose level is set to 0.

Marianne
Level 6
Partner    VIP    Accredited Certified

You will need to make a list of the PIDs of running bpdbm jobs and then locate the PIDs one at a time in bpdbm log to see what exactly each of them is doing.
Level 0 log will probably have very little useful info.
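
Something along these lines (the PID and log date are examples; use the PIDs you actually see):

ps -ef | grep bpdbm | grep -v grep
grep "\[13978\]" /usr/openv/netbackup/logs/bpdbm/log.010115 | more
# repeat the grep for each bpdbm PID to see what that particular process is busy with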

My experience has been that bpdbm needs to be restarted to apply a change in logging levels.
The problem is that you are already running low on disk space.
Level 5 logging will fill up your disk in a short amount of time.
You may also have a problem with memory consumption by bpdbm (not disk space); lots of these processes could cause bpdbm to hang.

As per other posts above - NBU disk usage should NOT dictate retention levels - ONLY business recovery needs.

There are different ways to deal with NBU catalog sizing :
- Revisit retention levels to ensure you do not have retentions longer than needed (you can for example customize levels 10 and above).
- Enable catalog compression.
- Archive catalogs.
- Add more space to existing volume.
- Move image folder to another filesystem.

All of the above is documented in NBU Admin Guide I.

BharadwajR
Level 3

Thank you Marianne for the valuable suggestions.

One of our engineers is suggesting that this is a bug in NetBackup and that we need to proceed with upgrading NetBackup.

Does upgrading NetBackup 7.0.1 to a higher version like 7.1.0.3 really fix the problem permanently?

Marianne
Level 6
Partner    VIP    Accredited Certified

I have not seen this issue in NBU 7.5 or 7.6.

Please consider one of these versions (preferably 7.6 or 7.6.1).

Before you upgrade - please resolve NBU catalog space shortage.
Each NBU version upgrade consumes more space and will not solve your disk space issue.

Also read through all the upgrade documentation and SORT to ensure that the master and media servers meet resource requirements such as memory (not disk space) and CPU.

mph999
Level 6
Employee Accredited

If you increase bpdbm verbose, the parent (which should always be running) will not pick up the change.

However, any 'child' bpdbm jobs will.

Let's give it a go ...  providing the bpexpdate runs as a child process, it should pick up the verbose change without a restart.
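
The change itself is just a line in bp.conf, something like this (standard path; if a VERBOSE entry already exists, edit that line rather than adding a second one):

echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf
# no restart needed for this test; run a bpexpdate and watch for the new child bpdbm PID in the log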

 

23:43:26.505 [13375] <2> listen_loop: socket fd from accept() is 10
23:43:26.505 [13375] <2> listen_loop: Locale info: message=C, ctype=C, time=C, collate=C, num=C
23:43:26.517 [13978] <2> logconnections: BPDBM ACCEPT FROM 10.12.235.41.55134 TO 10.12.235.41.13721 fd = 10
23:43:26.518 [13978] <2> vnet_pcache_init_table: [vnet_private.c:208] starting cache size 200 0xc8
23:43:26.520 [13978] <2> vnet_cached_getnameinfo: [vnet_addrinfo.c:1918] found via getnameinfo OUR_HOST=womble IPSTR=10.12.235.41
23:43:26.523 [13978] <4> bpdbm: VERBOSE = 5
23:43:26.524 [13978] <2> db_getrequest_vxss: Entering
23:43:26.525 [13978] <2> vnet_check_vxss_server_magic: [vnet_vxss_helper.c:502] VxSS magic=1041669, remote_vxss=260
23:43:26.525 [13978] <2> vnet_check_vxss_server_magic: [vnet_vxss_helper.c:559] Ignoring VxSS authentication 2 0x2
23:43:26.529 [13978] <2> VssAzAuthorizeEx: (vss_az.cpp,5062): Status = 1 : ""
23:43:26.553 [13978] <2> process_request: request complete: exit status 0  ; query type: 260
23:43:26.651 [13375] <2> listen_loop: socket fd from accept() is 10
23:43:26.651 [13375] <2> listen_loop: Locale info: message=C, ctype=C, time=C, collate=C, num=C
23:43:26.662 [13979] <2> logconnections: BPDBM ACCEPT FROM 10.12.235.41.55135 TO 10.12.235.41.13721 fd = 10
23:43:26.663 [13979] <2> vnet_pcache_init_table: [vnet_private.c:208] starting cache size 200 0xc8
23:43:26.665 [13979] <2> vnet_cached_getnameinfo: [vnet_addrinfo.c:1918] found via getnameinfo OUR_HOST=womble IPSTR=10.12.235.41
23:43:26.667 [13979] <4> bpdbm: VERBOSE = 5
23:43:26.668 [13979] <2> db_getrequest_vxss: Entering
23:43:26.669 [13979] <2> vnet_check_vxss_server_magic: [vnet_vxss_helper.c:502] VxSS magic=1041669, remote_vxss=87
23:43:26.669 [13979] <2> vnet_check_vxss_server_magic: [vnet_vxss_helper.c:559] Ignoring VxSS authentication 2 0x2
23:43:26.671 [13979] <2> image_db: Entering

The point where the PID changes from [13375] is where I ran bpexpdate; as we can see, it does pick up the verbose 5 setting, even though I didn't restart after adding VERBOSE = 5 to bp.conf.

We can see that [13979] is the correct PID for the tape expire (this is cut down; I just pulled out the lines with the tape media ID, TAPE01).


log.060115:23:43:26.673 [13979] <2> db_logimagerec: name1 TAPE01
log.060115:23:43:26.800 [13979] <4> db_error_add_to_file: changing media TAPE01 to expiration date Thu Jan 01 01:00:00 1970 (0)
log.060115:23:43:26.801 [13979] <4> ImageExpdateMediaId::validateInputQuery: changing media TAPE01 to expiration date Thu Jan 01 01:00:00 1970 (0)
log.060115:23:43:26.960 [13979] <4> emmlib_MediaQueryOne: (1) V_QUERY_BYID/V_QUERY_BYID_DEBUG request (TAPE01 0 0)
log.060115:23:43:27.018 [13979] <2> ImageExpdateMediaId::executeQuery: (47.2) Executing "SELECT DISTINCT I.ImageKey, F.CopyNumber FROM DBM_Main.DBM_Image I JOIN DBM_Main.DBM_ImageFragment F ON F.ImageKey = I.ImageKey WHERE I.MasterServerKey = 1000002 AND (F.StorageUnitType = 2 OR F.StorageUnitType = 3) AND F.MediaID = 'TAPE01' ORDER BY I.ImageKey, F.CopyNumber" Bindings <>
log.060115:23:43:27.044 [13979] <2> ImageExpdateMediaId::executeQuery: 1 images located for media TAPE01
log.060115:23:43:27.141 [13979] <2> fetch_DBM_ImageFragments: (.6) Fragment {IFK=13, IK=131, ICK=14, CN=2, FN=1, RC=1, MID='TAPE01', MSK=1000002, SUT=2, SST=1, FS=1, FSZ=32768, FID='TAPE01', D=6, FN=1, BS=262144, O=2, MD=1433156700, DWO=0, FF=0, MD='', FCP=0, MSN=0, ME='', SCMH=''}
log.060115:23:43:27.505 [13979] <4> db_error_add_to_file: altered 1 image records and failed to alter 0 image records and media with mediaid TAPE01
log.060115:23:43:27.505 [13979] <2> ImageExpdateMediaId::executeQuery: altered 1 image records and failed to alter 0 image records and media with mediaid TAPE01

 

mph999
Level 6
Employee Accredited

One of our engineers is suggesting that this is a bug in NetBackup and that we need to proceed with upgrading NetBackup.

Does upgrading NetBackup 7.0.1 to a higher version like 7.1.0.3 really fix the problem permanently?

 

It would be interesting to learn where you heard this could be a bug - was that a Symantec engineer?

Could it be a bug? Well, yes, anything could be until we prove otherwise; however, let us consider this very important point.

"We haven't made any upgrades or changes recently. This issue suddenly started 3 days ago."

... so you are running the same code as a few days back!!!

Maybe, however, a combination of things has started to happen which causes the delay - perhaps not a bug in the usual sense, more an unfortunate combination of events, which I guess you could still call a bug. It depends where the events are happening: are they all within NBU, or is the OS playing some part? Who knows ...

So - will an upgrade fix it? No idea, because we don't know what is causing it.

If an upgrade is no problem, then it would be worth trying, as you are better off at a higher version, and if it fixes the issue, we know that some change within the new code has been beneficial.

I'm also surprised that, if it is a bug, I haven't heard of it; if it were common I would have thought I would be aware of it - don't forget that many people expire tapes, so if it were an issue we would know about it. Of course, there is always the possibility that you are the first to see it because your environment has the exact conditions to make it appear that other people do not have - not impossible, but how likely that is I can't say.

I just searched the eTracks for this issue using 3 or 4 different search strings; I couldn't find anything for slow expiration of media in 7.0.1.

 

BharadwajR
Level 3

The issue got resolved after rebooting the master server.

Now we are able to expire each tape within 5 minutes.

Anyway, we are going to upgrade the master server to 7.5.

Thank you all for the valuable suggestions.