I created bptm and waiting

shahriar_sadm · 05-08-2016

Hi dear all,

After upgrading to 7.7.2, backups on four media servers fail with status 84 (media write error). These are application backups with RMAN; in some archive/Level 0 backups, one channel gets error 84.

Here is the full job detail:

05/07/2016 19:58:40 - Info nbjm (pid=3123) starting backup job (jobid=432802) for client fm-db2, policy FM-T6K-PH-ORA-ARC-fm-db2, schedule monthly
05/07/2016 19:58:40 - Info nbjm (pid=3123) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=432802, request id:{608605DC-1468-11E6-9F05-859882152612})
05/07/2016 19:58:40 - requesting resource fm-db2-hcart3-Qi6K-tld-2
05/07/2016 19:58:40 - requesting resource mstr-nbkp-srv.NBU_CLIENT.MAXJOBS.fm-db2
05/07/2016 19:58:40 - requesting resource mstr-nbkp-srv.NBU_POLICY.MAXJOBS.FM-T6K-PH-ORA-ARC-fm-db2
05/07/2016 19:58:41 - Waiting for scan drive stop HP.ULTRIUM6-SCSI.032, Media server: fm-db2
05/07/2016 19:58:41 - granted resource mstr-nbkp-srv.NBU_CLIENT.MAXJOBS.fm-db2
05/07/2016 19:58:41 - granted resource mstr-nbkp-srv.NBU_POLICY.MAXJOBS.FM-T6K-PH-ORA-ARC-fm-db2
05/07/2016 19:58:41 - granted resource L60029
05/07/2016 19:58:41 - granted resource HP.ULTRIUM6-SCSI.032
05/07/2016 19:58:41 - granted resource fm-db2-hcart3-Qi6K-tld-2
05/07/2016 19:58:54 - estimated 0 kbytes needed
05/07/2016 19:58:54 - Info nbjm (pid=3123) started backup (backupid=fm-db2_1462634934) job for client fm-db2, policy FM-T6K-PH-ORA-ARC-fm-db2, schedule monthly on storage unit fm-db2-hcart3-Qi6K-tld-2
05/07/2016 19:58:55 - started process bpbrm (pid=36869)
05/07/2016 19:58:55 - connecting
05/07/2016 19:58:56 - connected; connect time: 0:00:00
05/07/2016 19:59:36 - mounting L60029
05/07/2016 20:00:44 - Info bpbrm (pid=36869) fm-db2 is the host to backup data from
05/07/2016 20:00:44 - Info bpbrm (pid=36869) reading file list from client
05/07/2016 20:00:44 - Info bpbrm (pid=36869) listening for client connection
05/07/2016 20:00:45 - Info bpbrm (pid=36869) INF - Client read timeout = 3000
05/07/2016 20:00:45 - Info bpbrm (pid=36869) accepted connection from client
05/07/2016 20:00:46 - Info bphdb (pid=36750) Backup started
05/07/2016 20:00:46 - Info bpbrm (pid=36869) bptm pid: 36876
05/07/2016 20:00:46 - Info bptm (pid=36876) start
05/07/2016 20:01:26 - Info bptm (pid=36876) using 65536 data buffer size
05/07/2016 20:01:26 - Info bptm (pid=36876) using 30 data buffers
05/07/2016 20:01:26 - Info bptm (pid=36876) start backup
05/07/2016 20:01:26 - Info bptm (pid=36876) Waiting for mount of media id L60029 (copy 1) on server fm-db2.
05/07/2016 20:01:32 - mounted L60029; mount time: 0:01:56
05/07/2016 20:01:32 - positioning L60029 to file 7
05/07/2016 20:02:05 - positioned L60029; position time: 0:00:33
05/07/2016 20:02:05 - begin writing
05/07/2016 20:03:22 - Info bptm (pid=36876) media id L60029 mounted on drive index 35, drivepath /dev/rmt/53cbn, drivename HP.ULTRIUM6-SCSI.032, copy 1
05/07/2016 20:04:00 - Info bphdb (pid=36750) dbclient(pid=36750) wrote first buffer(size=262144)
05/07/2016 20:06:51 - end writing; write time: 0:04:46
05/07/2016 20:08:28 - Info bphdb (pid=36750) dbclient waited 7347 times for empty buffer, delayed 7347 times
05/07/2016 20:08:28 - Info bphdb (pid=36750) done. status: 0
05/07/2016 20:08:28 - Info bptm (pid=36876) waited for full buffer 4015 times, delayed 8130 times
05/07/2016 20:08:37 - Error bptm (pid=36876) FREEZING media id L60029, too many data blocks written, check tape/driver block size configuration
05/07/2016 20:08:37 - Info bptm (pid=36876) EXITING with status 84 <----------
05/07/2016 20:08:40 - Info bphdb (pid=36750) done. status: 84: media write error
media write error (84)

thanks

Deb_W · 05-08-2016

That's odd. Have you checked the system log on that media server?

Also, on the media server, check the netbackup/db/media/errors file. Is it the same drive that always sees the issue? Maybe it is simply a coincidence?

Deb

Marianne · 05-08-2016

Which 'Unix' OS? Please give the version and patch level. This error looks like an OS-related tape driver issue and not related to the NBU upgrade.


Nicolai · 05-09-2016

Please see mph999's post in this thread:

http://www.veritas.com/community/forums/veritas-netbackup-console-frequently-getting-media-write-error-84-too-many-data-blocks-writte#comment-8814881

Quick hint:

Ensure the directory /usr/openv/netbackup/logs/bptm exists on the media server.

Add BPTM_VERBOSE = 9 to bp.conf on the media server (be warned: the bptm log will grow fast).

Once the failure occurs again, the bptm log should have the needed info.
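Before waiting for the next failure, both steps can be confirmed with a read-only check. The following is only a sketch: the function name `check_bptm_logging` is made up for illustration, and the default bp.conf location is an assumption to verify on your installation. The log directory and the BPTM_VERBOSE entry are the ones named above.

```python
import os
import re

def check_bptm_logging(log_dir="/usr/openv/netbackup/logs/bptm",
                       bp_conf="/usr/openv/netbackup/bp.conf"):
    """Return (log_dir_exists, verbose_level or None). Reads only, changes nothing."""
    dir_ok = os.path.isdir(log_dir)
    level = None
    if os.path.isfile(bp_conf):
        with open(bp_conf) as fh:
            for line in fh:
                # Matches an entry written as "BPTM_VERBOSE = 9".
                m = re.match(r"\s*BPTM_VERBOSE\s*=\s*(\d+)\s*$", line)
                if m:
                    level = int(m.group(1))
    return dir_ok, level

dir_ok, level = check_bptm_logging()
print("bptm log directory present:", dir_ok)
print("BPTM_VERBOSE in bp.conf:", level)
```

If the directory is missing or the level prints as None, the next failure will probably not leave the detail needed in the bptm log.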

shahriar_sadm · 05-10-2016

This is the errors file output on the media server. It is not the same drive each time, but 3 drives have failures:

04/29/16 08:49:16 L60580 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
04/29/16 12:22:02 L60376 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
04/29/16 14:11:39 L60455 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
04/30/16 14:19:52 L60492 41 WRITE_ERROR HP.ULTRIUM6-SCSI.026
04/30/16 14:20:23 L60257 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/04/16 20:17:41 L60679 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/05/16 02:30:17 L60273 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/06/16 08:07:17 L60614 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/06/16 14:09:17 L60652 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/07/16 14:18:51 L60598 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/07/16 20:08:37 L60029 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/08/16 16:58:18 L60247 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/09/16 14:27:50 L60773 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/10/16 02:15:25 L60488 30 WRITE_ERROR HP.ULTRIUM6-SCSI.037
05/10/16 02:33:33 L60168 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
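To answer the "is it always the same drive?" question from the listing above, the entries can be tallied per drive. This is a minimal sketch: the column layout (date, time, media ID, drive index, error type, drive name) is inferred from the excerpt, and `tally_by_drive` is an illustrative name, not a NetBackup tool.

```python
from collections import Counter

# Lines copied from the errors file excerpt above.
ERRORS = """\
04/29/16 08:49:16 L60580 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
04/29/16 12:22:02 L60376 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
04/29/16 14:11:39 L60455 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
04/30/16 14:19:52 L60492 41 WRITE_ERROR HP.ULTRIUM6-SCSI.026
04/30/16 14:20:23 L60257 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/04/16 20:17:41 L60679 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/05/16 02:30:17 L60273 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/06/16 08:07:17 L60614 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/06/16 14:09:17 L60652 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/07/16 14:18:51 L60598 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/07/16 20:08:37 L60029 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/08/16 16:58:18 L60247 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/09/16 14:27:50 L60773 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/10/16 02:15:25 L60488 30 WRITE_ERROR HP.ULTRIUM6-SCSI.037
05/10/16 02:33:33 L60168 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
"""

def tally_by_drive(text, error_type="WRITE_ERROR"):
    """Count entries of one error type per drive name."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: date, time, media id, drive index, error, drive name.
        if len(fields) == 6 and fields[4] == error_type:
            counts[fields[5]] += 1
    return counts

for drive, n in tally_by_drive(ERRORS).most_common():
    print(drive, n)
```

On these 15 entries the tally is 7 for HP.ULTRIUM6-SCSI.030, 6 for HP.ULTRIUM6-SCSI.032, and 1 each for HP.ULTRIUM6-SCSI.026 and HP.ULTRIUM6-SCSI.037, with a different media ID every time.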

shahriar_sadm · 05-10-2016

SunOS 5.10 Generic_147147-26 sun4u sparc SUNW,SPARC-Enterprise

Oracle Solaris 10 1/13 s10s_u11wos_24a SPARC
Copyright (c) 1983, 2013, Oracle and/or its affiliates. All rights reserved.
Assembled 17 January 2013

This issue appeared right after the upgrade.

I checked the frozen media; the number of frozen media is increasing every day!

shahriar_sadm · 05-10-2016

I created the bptm log directory and am waiting for the next failure. I will share the result.

Marianne · 05-10-2016

Please open a Support call with Veritas and Oracle and give each vendor the case number for the other vendor.

Please ensure that VERBOSE entry exists in vm.conf to catch NBU logs to /var/adm/messages.
(NBU/ltid needs to be restarted after VERBOSE is added.)


shahriar_sadm · 05-11-2016

Hi Nicolai

This is the bptm log; error 84 appears in it.

Thanks

mph999 · 05-12-2016

Almost certainly, you have an issue outside NBU. I appreciate this happened after the upgrade; however, NetBackup doesn't actually write to tape itself (the OS does it).

The block size is 64k

(16:08:21.418 [78489] <2> write_data: received first buffer (65536 bytes), begin writing data)

This is a bit small for modern drives but should work without issue, however, I'd suggest:

Create a file called /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS on every media server.

Put this number in the file: 262144

This will be picked up on the next backup. Please also test a restore of a few files.

I don't think this will fix the issue, but it's a recommended tuning step and should give better speed.

What has happened:

16:12:54.303 [78489] <2> io_terminate_tape: absolute block position prior to writing empty header is 1885518, copy 1
16:12:54.303 [78489] <2> io_terminate_tape: block position check: actual 1885518, expected 1731186

The tape started at this position

16:08:15.325 [78489] <2> write_data: absolute block position prior to writing backup header(s) is 1576851, copy 1
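With verbose logging on, the bptm log gets large, so it can help to pull out just these position lines. The sketch below matches the message wording exactly as quoted in this post; other NetBackup versions may word the lines differently, and `extract_positions` is an illustrative name only.

```python
import re

# The two log lines quoted above.
LOG = """\
16:08:15.325 [78489] <2> write_data: absolute block position prior to writing backup header(s) is 1576851, copy 1
16:12:54.303 [78489] <2> io_terminate_tape: block position check: actual 1885518, expected 1731186
"""

START_RE = re.compile(
    r"write_data: absolute block position prior to writing backup header\(s\) is (\d+)")
CHECK_RE = re.compile(
    r"io_terminate_tape: block position check: actual (\d+), expected (\d+)")

def extract_positions(text):
    """Return (start, actual, expected) from bptm log text; None where not found."""
    start = actual = expected = None
    for line in text.splitlines():
        m = START_RE.search(line)
        if m:
            start = int(m.group(1))
        m = CHECK_RE.search(line)
        if m:
            actual, expected = int(m.group(1)), int(m.group(2))
    return start, actual, expected

print(extract_positions(LOG))
```

If a log holds several jobs, this keeps only the last match of each kind, so it is best run on the section for a single bptm pid.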

NetBackup then sent data to the operating system: 154335 blocks in total.

After the write finished, NBU knows the tape started at position 1576851 and that it sent 154335 blocks, so the tape should have finished at position 1576851 + 154335 = 1731186.

It then asks the tape drive, 'What position are you at?', and the drive answers, 'I'm at position 1885518'. This does not match where the tape should be, hence the error.
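The check described above is plain arithmetic, and can be reproduced with the numbers from this log. This is only an illustration of the reasoning, not NetBackup's actual code; `position_check` is a made-up name.

```python
def position_check(start, blocks_sent, reported):
    """Return (expected_end, excess) for a write that began at `start`."""
    expected = start + blocks_sent   # where the tape should be after the write
    excess = reported - expected     # positive: drive reports being further along
    return expected, excess

expected, excess = position_check(start=1576851, blocks_sent=154335, reported=1885518)
print("expected end position:", expected)
print("drive is ahead by:", excess, "blocks")
```

Any non-zero excess means the drive's reported position disagrees with the block count, which is the condition behind the "too many data blocks written" message in the job details.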

Could NBU cause it? Well, if we miscounted, yes. But if that were the case it would fail on every backup, because the same code would be used and it would always miscount.

Let's look:

Started at position 1576851

16:12:46.235 [78489] <2> write_data: Total Kbytes transferred 9877280

9877280/64 = 154332.5 - so the data backed up was about 154332 64k blocks (this will be a tiny bit out as there are headers as well), but it's close enough.

Let's say, then, that NBU sent 154332 blocks of data.

start position + number blocks sent = end position

1576851 + 154332 = 1731183

From this line:

16:12:54.303 [78489] <2> io_terminate_tape: block position check: actual 1885518, expected 1731186

NBU expected 1731186 (I'm close enough with my value of 1731183), but the drive is reporting 1885518, which is miles out.

1885518 - 1731186 = 154332 blocks

Or in other words, the drive is positioned 154332 blocks further down the tape than it should be (or at least, this is what it reports).
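The cross-check from the Kbytes figure can be redone in a few lines. All numbers come from the log lines quoted above; the function name `cross_check` and the 64 KB block size passed to it are just the values used in this walk-through.

```python
def cross_check(kbytes, block_kb, start, expected_logged, actual):
    """Rebuild the expected end position from the Kbytes figure and compare."""
    data_blocks = kbytes / block_kb           # data only, headers not counted
    estimate = start + int(data_blocks)       # rough expected end position
    header_gap = expected_logged - estimate   # blocks not explained by data alone
    overshoot = actual - expected_logged      # how far past the expected position
    return data_blocks, estimate, header_gap, overshoot

data_blocks, estimate, header_gap, overshoot = cross_check(
    kbytes=9877280, block_kb=64, start=1576851,
    expected_logged=1731186, actual=1885518)
print(data_blocks, estimate, header_gap, overshoot)
```

The estimate lands 3 blocks short of the logged expected position, which fits the small header allowance mentioned above, and the overshoot (154332) is almost exactly the number of data blocks sent.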

The issue is not NBU. I'd suggest upgrading the firmware / drivers (HBA and tape drives) and seeing if that helps. If they are already on the latest firmware, downgrade to the previous version (the latest is not always the best).

As a first look, I'm happy that I have proved this is not an NBU issue.

shahriar_sadm · 05-12-2016

Thank you, Martin, for your complete answer. I will plan to upgrade the tape drive firmware and HBA and see if the issue is resolved.

Another issue is that many media are currently frozen by error 84, and I saw error 96 (no more media available) today. Can I unfreeze the frozen media manually?

mph999 · 05-12-2016

Yes, because we know the reason the media is frozen, it is safe to unfreeze.

shahriar_sadm · 05-12-2016

Is it safe to unfreeze from the NetBackup GUI, or does it have to be done from the CLI?

Also, we have more than 100 frozen media, and I do not have a list of the media frozen because of error 84.

mph999 · 05-12-2016

You can select multiple media at the same time (shift and click) and unfreeze them in the GUI.

Are they all frozen because of this error? You need to be careful about unfreezing tapes that were frozen for a different reason. They might really be 'bad', in which case all sorts of really bad things could happen to your drives.
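One way to build a candidate list is to collect the media IDs that have a WRITE_ERROR entry in the netbackup/db/media/errors file and compare them with the frozen media shown in the GUI. The sketch below uses a few lines from the excerpt posted earlier in the thread. It assumes the column layout seen there, and a WRITE_ERROR entry on its own does not prove why a tape was frozen, so treat the result as a starting point only. `media_with_write_errors` is an illustrative name.

```python
# A few lines from the errors file excerpt posted earlier in this thread.
SAMPLE = """\
05/07/16 14:18:51 L60598 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/07/16 20:08:37 L60029 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
05/08/16 16:58:18 L60247 37 WRITE_ERROR HP.ULTRIUM6-SCSI.030
05/09/16 14:27:50 L60773 35 WRITE_ERROR HP.ULTRIUM6-SCSI.032
"""

def media_with_write_errors(text):
    """Return the sorted, de-duplicated media IDs that have a WRITE_ERROR entry."""
    media = set()
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: date, time, media id, drive index, error, drive name.
        if len(fields) == 6 and fields[4] == "WRITE_ERROR":
            media.add(fields[2])
    return sorted(media)

print(media_with_write_errors(SAMPLE))
```

Frozen media that do not appear in this list are the ones to look at more closely before unfreezing.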
