DB2 database hangs when archive logs cannot be written to NetBackup

Wiggy332 · ‎02-08-2012

I have an issue with a new NetBackup setup. We are running DB2 v9.7 with NetBackup 7.1 on AIX 6.1.

I have encountered an issue with the config: the DB2 database hangs and refuses connections when it is unable to write its archive logs to NetBackup using userexit and arcfunc save. While the media server is available, everything has looked spot on: backups complete fine and the restore process has been tested successfully.

I am, however, reluctant to roll this out to the production DBs until I know how to prevent the DB hanging if the media or master servers are unavailable.

2012-01-17-21.36.54.437114-360 I6347562A374       LEVEL: Info
PID     : 15466644             TID : 3636        PROC : db2sysc 0
INSTANCE: xxxx               NODE : 000         DB   : XXXX
EDUID   : 3636                 EDUNAME: db2loggr (QHPI04) 0
FUNCTION: DB2 UDB, data protection services, sqlpghck, probe:1780
MESSAGE : DB2 is waiting for log files to be archived.

When these archive log processes queue up because the media server is not available to archive the logs, the DB becomes unavailable until the processes either complete or are killed off. Any ideas how to prevent this, other than a workaround whereby we use arcfunc copy to write the archive logs to local disk and then back up the entire directory to NetBackup?

Thanks,

mph999 · ‎02-08-2012

It is a common issue with databases - if they cannot back up the archive logs, they stop ...

(1)

One technical solution within NBU would be to use storage unit groups ..

E.g., the STU group is called STUG1 and contains storage units STU1, STU2, STU3

If each of these 3 STUs points to a different media server, then when one is unavailable the backup will use another.
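To illustrate the failover idea only, here is a minimal Python sketch of "use the first storage unit whose media server is up". It is a toy model, not NetBackup code or its API: the `StorageUnit` class, the `pick_storage_unit` function and the media server names are all made up for the example, and the real selection behaviour depends on how the group is configured.

```python
from dataclasses import dataclass


@dataclass
class StorageUnit:
    # Hypothetical model of a storage unit; not a NetBackup data structure.
    name: str
    media_server: str
    available: bool


def pick_storage_unit(group):
    """Return the first available storage unit in the group, or None."""
    for stu in group:
        if stu.available:
            return stu
    return None


# STUG1 from the example above, with the first media server down.
stug1 = [
    StorageUnit("STU1", "media1", available=False),
    StorageUnit("STU2", "media2", available=True),
    StorageUnit("STU3", "media3", available=True),
]

chosen = pick_storage_unit(stug1)
print(chosen.name if chosen else "no storage unit available")
```

With STU1 marked unavailable, the archive log backup would land on STU2 instead of queueing, which is the behaviour you want for the DB2 user exit.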

That I might argue is a 'correct' solution.

I am a firm believer in keeping things 'simple' - STU groups aren't really complex, but they are more complex than not having them. Some of the highest performing and most reliable backup environments I see, and have seen (of various backup products), are those that are simple. Not because the software is not capable, but because complex systems are much, much harder to troubleshoot, and that increases the time to fix issues.

So, here is a 'non technical' solution from a previous company I worked for - this was a huge backup environment, most backups were databases, and it ran year in, year out with a success rate of about 99.5% (excluding errors caused by the network) across 2500 servers / about 70-80TB backed up each day.

<1> The database servers had enough space to run for 3 days without a backup (in case of backup server issues). This was a company requirement.

(Not having a backup for up to 3 days was seen as less of a risk/impact than the DB stopping ....)

<2> In the event of a backup server being unavailable, the physical layout of the hardware was such that it was easy to move a client to a different server until the issue was resolved. This was rarely done, but was an option if necessary.

.. and that was it ... simple, but designed from the beginning to do that. In the 3 years I ran the backups, we never lost data through a backup server being down.
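For point <1>, the sizing is simple arithmetic. A rough Python sketch, where the log rate, log file size and 20% headroom are invented numbers for illustration only (measure your own DB2 log generation rate before relying on anything like this):

```python
def required_log_space_gb(logs_per_hour, log_size_mb, hours, headroom=0.2):
    """Estimate disk space (GB) needed to hold archive logs for `hours`.

    All inputs are assumptions supplied by the caller; `headroom` adds a
    safety margin on top of the raw volume.
    """
    raw_mb = logs_per_hour * log_size_mb * hours
    return raw_mb * (1 + headroom) / 1024


# Hypothetical workload: 12 logs/hour at 40 MB each, 3 days (72 hours).
print(round(required_log_space_gb(12, 40, 72), 1))
```

The same function run with your peak log rate rather than the average gives a more conservative figure, which is the one to size the filesystem against.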

(Note, if you go for STU groups be aware that for each STU you add into the group, you increase the load on nbrb to allocate resources.)

Why did I add in option (2) - after all, it will be seen by some as an unacceptable solution ...

.. simply to highlight the fact of time...

Many NBU issues are made 'major' due to time - for example, a DB's archive logs don't back up, and I am informed I have 1 hour to fix the issue ... The solution requires a catalog restore that will take x hours ...

You can see the point I am trying to make: backup environments do go down, so it is advisable to ensure that the servers that rely on them are able to continue to run without them. For how long is a matter of design and company procedures.

Martin

Wiggy332 · ‎02-09-2012

Martin,

Thank you very much for your prompt response - all of that was very useful information and a great help. The way we have our NetBackup hardware set up, we have full replication between our production and DR sites. In the policy we set up to test the DB2 archive log process, the policy storage was set to use only the production site's PureDisk. Are you suggesting that the simple fix would be, in the event of the production PureDisk/media server being unavailable, to switch this policy storage to the DR site so the archiving of the logs can complete?

We have 24*7 onsite ops, so in theory this would be feasible, as we can tie in our monitoring system to ensure that they are alerted the minute the PureDisk or media server is unavailable.

Another option I have seen this morning, but not yet experimented with, is to select Any Available as the policy storage. Presumably this would attempt to go to the primary media server/PureDisk first, and then try the DR one automatically in the event of the primary not being available?

Thanks again for your information so far.

Chris

Marianne · ‎02-09-2012

Careful with Any Available!

Disk storage units are normally configured with 'On Demand Only', making them unavailable to policies with the 'Any Available' selection.

I vote for Martin's STUGs.


mph999 · ‎02-10-2012

No, I wouldn't switch the live backups to a DR server - DR servers get overwritten ... - that said, I am unfamiliar with your setup, so maybe this is a possibility - you would know more of course.

The example I gave ... let me explain more ...

The backup environment was covered by multiple backup servers (as it happened, it wasn't NetBackup, but a 'similar' product) - we had something like 25 - 30 backup servers (masters if you like).

Most 'big' Oracle servers were 'dedicated media servers' - only backing up themselves to SAN attached drives, with the remainder going over a backup LAN to the masters.

In the event that a client's backup was moved, it went to another 'live' / production backup server, not a DR server ... (not that we had any ...)

Not ideal in a way, as you suddenly have a backup on a different backup server than normal, so you have to document this somewhere, but if it is a choice of that, or the DB stopping ...

In the 3 (ish) years I was there, I only ever remember moving a few clients to a different server on one occasion - so not a regular occurrence.

The physical layout of the switches was such that we could (fairly) easily connect a 'media server' to different drives if required.

Martin

dukster · ‎02-13-2012

Another option is to write archive logs to DASD and have NetBackup pick them off the identified directory once they are archived. That could eliminate wait time.
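As a rough sketch of the "pick them off the directory" half, assuming the logs are already being copied to a local directory: the Python below only selects log files that have not been modified for a few minutes, so a file still being written is skipped. The function name, the 300-second threshold and the directory path are illustrative assumptions, and handing the resulting list to NetBackup (via a scheduled policy or a user-directed backup) is left out on purpose.

```python
import os
import time


def settled_logs(archive_dir, min_age_seconds=300, now=None):
    """Return sorted paths of files in archive_dir not modified recently.

    A file counts as 'settled' when its mtime is at least min_age_seconds
    old. `now` can be passed in for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    ready = []
    for entry in os.scandir(archive_dir):
        if entry.is_file() and now - entry.stat().st_mtime >= min_age_seconds:
            ready.append(entry.path)
    return sorted(ready)


if __name__ == "__main__":
    # Hypothetical local archive directory; replace with your own path.
    archive_dir = "/db2/archive_logs"
    if os.path.isdir(archive_dir):
        for path in settled_logs(archive_dir):
            print(path)
```

Because DB2 only has to finish a local copy, a NetBackup outage delays the pickup rather than blocking the database, as long as the local filesystem has room.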
