Solved: Cyclic redundancy error in log means media went ba...

Soumyaa · ‎01-01-2012

Cyclic redundancy error in log means media went bad?

Generally, in what circumstances do we decide that media has gone bad?

mph999 · ‎01-02-2012

From TN http://www.symantec.com/docs/TECH169477, CRC issues are specifically mentioned under the 'Read/ Write errors' section:

Problem

Troubleshooting Drive/ Library Issues in NetBackup

This document provides information on various tape drive issues that may be encountered whilst using NetBackup, and how to deal with them.

Solution

It is important to understand that NBU does not write data to a drive. For example, on Solaris, NetBackup relies on the operating system to write the data to the tape using the st tape driver. NetBackup's only 'slight' involvement is that it specifies the block size to use, and even this is passed to the operating system. Other operating systems work in a similar manner.

The SCSI pass-through driver (the sg driver on Solaris) allows SCSI commands to be passed directly to the drive. These are SCSI 'commands' such as 'test-unit-ready', which is used, for example, when mounting a tape. On occasion it is necessary to recreate/ rebuild the pass-through driver. The common symptom that involves the pass-through driver is that the scan command does not show the devices. Other issues involving the pass-through driver are very rare.

The majority of drive/ tape issues have a cause outside of NetBackup. When troubleshooting these issues it is advisable to start the troubleshooting process at the hardware/ firmware level.

It should always be borne in mind that although NetBackup reports an error, this does not mean NetBackup is the cause.

Common drive issues include:

Scan command

TAPE_ALERT

ASC/ ASCQ

Positioning errors

Read/ Write errors

I/O Errors

External event has caused rewind

Tapes not reaching capacity (for example, only 300GB of data is written to a tape with 400GB native capacity)

Tapes being incorrectly marked as 'read only'

Library Inventory Issues

Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"

Missing drives, or drives disappearing and reappearing

In the first instance, it is always worth power cycling the library or drives reporting an issue, as well as rebooting the associated servers. Many of the errors referenced in the TechNote can sometimes be cleared this way. In the event this does not clear the issue, it has at least been eliminated from being the cause.

Scan Command

The scan command shows no devices at all, or some or all of the devices disappear and reappear when the command is run repeatedly.

Firstly, it must be confirmed that the operating system can see and communicate correctly with the tape drives.

The devices appearing in (for example) 'Device Manager' (Windows) or cfgadm (Solaris) is NOT necessarily sufficient confirmation that the devices are correctly configured to the operating system.

It has been seen that although devices 'appear' to be visible to the operating system, SAN issues prevented full/ correct communication and, as a result, the scan command failed.

Two things need to be checked before further troubleshooting is carried out:

1/ Check no backups are running on the drives (only applicable if the drives are shared). A SCSI reservation held on a drive by a running backup may prevent the drive from responding to, and thus appearing in the output of, the scan command.

2/ Rebuild the 'pass through' driver (Unix only). If the drive/ operating system configuration has not changed, this is very unlikely to be the issue, but it can be eliminated from being the cause by recreating the 'pass through' files. See the device configuration guide for information on how to do this.

Aside from the exceptions above, issues with the scan command are not caused by NetBackup. Once it is understood how the scan command works, it is clear why the issues lie outside of NetBackup.

Although the scan command is supplied by Symantec, it does not issue any NetBackup commands, or interact with NetBackup in any way. When run, it issues 'operating system' SCSI commands to the devices configured in the operating system, and the output of the command comes from the devices themselves. There are no settings, 'tuning' or troubleshooting that can be performed on the scan command.

Windows servers do not require a pass through driver. Provided that there are no backups running on other servers that may share the drives, the issue will be caused by a SAN, firmware, hardware or driver issue. Consideration should be given to the SAN infrastructure (eg switches), HBAs or the physical drive/ library.

Unix servers require a pass through driver; for example, on Solaris this is called the sg driver. This is required as the SCSI commands issued to query the device cannot be passed to the devices via the regular operating system driver.

Once the sg driver is configured, provided the configuration is not changed, there should be no issue with the pass through driver. If the scan command shows devices disappearing and reappearing, then the pass through driver is not the cause. If the devices, or a device, permanently disappear, it may be worth reconfiguring the pass through driver. If the issue is not resolved, then the cause will be as per Windows servers, that is, SAN infrastructure (eg switches), HBAs or the physical drive/ library. Consideration should also be given to HBA configuration files, as incorrect settings in these have been seen to prevent output from the scan command being returned.

Provided the 'pass through' driver is configured (Unix only), Symantec recommends that the operating system/ SAN administrators, or hardware vendors, are consulted to further investigate scan command issues.

TapeAlert / Tape Alert

A "tape alert" message is a critical, warning, or informational alert that occurs due to a tape drive or robotic library hardware event. These "tape alert" messages are stored on the tape drive or robotic library. Applications like NetBackup query the tape device or robotic library for these "tape alert" messages and display the "tape alerts" to the user. "Tape alert" messages are reported in the NetBackup bptm log. The tape alert technology detects and logs hardware and media errors.

It is important to remember that while NetBackup displays these "tape alerts," the alerts occur due to a tape drive or robotic library hardware event. Check the Event Viewer/ system log for any hardware related errors. Contact the Original Equipment Manufacturer (OEM) for support.

As a TapeAlert is sent from the drive, it cannot be caused by NetBackup.

For example:

Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01
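When scanning a large system log for such entries, a small script can pull out the individual fields. The following is only a minimal Python sketch: the regular expression is modelled on the single example line above, and the names TAPE_ALERT_RE and parse_tape_alert are hypothetical helpers, not part of NetBackup. Other log layouts may need a different pattern.

```python
import re

# Pattern modelled on the one example line in this technote (an assumption);
# adjust it if your bptm/syslog output is laid out differently.
TAPE_ALERT_RE = re.compile(
    r"TapeAlert Code: (?P<code>0x[0-9a-fA-F]+), "
    r"Type: (?P<type>\w+), "
    r"Flag: (?P<flag>[^,]+), "
    r"from drive (?P<drive>\S+) \(index (?P<index>\d+)\), "
    r"Media Id (?P<media_id>\S+)"
)


def parse_tape_alert(line):
    """Return the TapeAlert fields found in a log line as a dict, or None."""
    match = TAPE_ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"], 16)   # "0x03" -> 3
    fields["index"] = int(fields["index"])     # drive index as an integer
    return fields


if __name__ == "__main__":
    sample = (
        "Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] "
        "TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, "
        "from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01"
    )
    print(parse_tape_alert(sample))
```

Collecting the drive name and media id for every alert makes it easier to hand the hardware vendor a list of which drives and cartridges were involved.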

To further investigate TapeAlert issues, Symantec recommends contacting your hardware vendor.

A link to the technote "Description of Tape Alerts and code definitions" is provided at the bottom of this technote.

ASC/ ASCQ

SCSI sense keys describe a 'state', and are returned when a command completes with a 'check condition' status.

In this example, robtest was failing to load a tape into a drive.

Initiating MOVE_MEDIUM from address 1000 to 500

move_medium failed, CHECK CONDITION

sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED

The analysis can be broken down as follows :

Sense Key 0x5 - Illegal Request

ASC/ ASCQ 0x30/0x00 - Incompatible Medium Installed
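The same breakdown can be sketched in a few lines of Python. This is an illustration only: decode_sense and SENSE_KEYS are hypothetical names, the pattern is based on the robtest output shown above, and the table holds only the common sense key names from the SCSI standard (see t10.org). ASC/ ASCQ descriptions are not looked up; the text printed on the line is reused as-is.

```python
import re

# Common sense key names as defined by the SCSI standard (t10.org).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0xB: "ABORTED COMMAND",
}

# Layout assumed from the robtest output quoted in this technote.
SENSE_RE = re.compile(
    r"sense key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), "
    r"ascq = (0x[0-9a-fA-F]+)(?:, (.*))?"
)


def decode_sense(line):
    """Split a robtest sense line into its numeric parts, or return None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.group(1, 2, 3))
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": asc,
        "ascq": ascq,
        # T10 tables list codes in this "30h/00h" style.
        "asc_ascq": f"{asc:02X}h/{ascq:02X}h",
        "description": (match.group(4) or "").strip(),
    }


if __name__ == "__main__":
    print(decode_sense(
        "sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED"
    ))
```

Formatting the pair as "30h/00h" matches the way the T10 tables list the codes, which makes the vendor documentation easier to search.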

In a similar manner to Tape Alerts, SCSI Sense Keys are produced by the device, not by NetBackup.

As ASC/ ASCQ alerts are sent from the hardware, it is impossible for them to be caused by NetBackup.

It has been seen that a power cycle of the drive (not soft reset) can sometimes clear ASC/ ASCQ errors.

Further information on these values can be found at http://www.t10.org

To further investigate ASC/ ASCQ issues, Symantec recommends contacting your hardware vendor.

Note

If hardware encryption is in use via NetBackup KMS, an issue with the service may cause the drives to send out ASC/ ASCQ errors relating to "Encryption". In this instance, although the drive is sending the message, the cause may be the KMS service, and so this should be given consideration.

Positioning Errors

Positioning errors occur when the operating system is unable to position the tape, for example with a forward-space-file (fsf) or rewind (rew) operation.

The error message seen may differ slightly, depending on when the error occurs.

Example 1

<2> write_data: block position check: actual 62504, expected 31254

Example 2

1/11/2010 7:50:13 AM - Error bptm(pid=3364) ioctl (MTREW) failed on media id W00229, drive index 0, The I/O bus was reset. (1111) (bptm.c.8039)

NetBackup requests the operating system to position the tape, at various points of the backup. Failure to correctly position, although detected by NetBackup, is most commonly caused by:

1. Hardware error

2. Tape error

3. Driver issue

4. Firmware issue

As NetBackup does not directly position tapes, Symantec recommends contacting your hardware vendor to further investigate positioning errors.

Note

One known issue can be seen in the bptm log, affecting NBU 6.5.6 to 7.0.1.

Error bptm (pid=2164) ioctl (MTWEOF) failed on media id V01497, drive index 0, The physical end of the tape has been reached.

EEB 2182228 resolves this issue.

If the issue is not resolved by this EEB, or you see this issue at an earlier or later version of NetBackup (before 6.5.6 or after 7.0.1), then the issue is related to firmware or hardware.

Read/ Write errors

The reading or writing operation is performed at the operating system/ tape driver level. Therefore, although this issue is detected and reported in the NetBackup logs, it is not caused by NetBackup.

The cause of read/ write errors is usually an issue with the tape drive or media cartridge.

For example:

Example 1

write_data: cannot write image to media id XXXXXX, drive index #, Data error (cyclic redundancy check).

Example 2

io_write_block: write error on media id MIR107, drive index 0, writing header block, 1117

Example 3

Error bptm(pid=5268) cannot read image from media id 500507, drive index 1, err = 234

Note

a) McAfee anti-virus software is known to be a possible cause of Status 84 errors on Windows media servers

b) Cyclic redundancy check errors indicate faulty hardware

I/O Error

I/O errors are caused at a hardware level, and are only detected by NetBackup.

For example:

11:20:18.246 [8504.5292] <4> write_data: WriteFile failed with: The request could not be performed because of an I/O device error. (1117); bytes written = 65536; size = 0

To further investigate I/O Errors, Symantec recommends contacting your hardware vendor.

Known issues

open failed in io_open I/O error

This exact error can be caused by misconfiguration of the drives, so this should be checked in the first instance. If the issue remains after confirmation that the configuration is correct, then the issue should be further investigated as a hardware/ firmware issue.

External event has caused rewind

This issue is (potentially) serious and requires immediate investigation, as data can be lost. NetBackup will display this error if the block position calculated by NetBackup does not match the position reported by the drive. It is not certain that a full rewind has occurred (this is impossible to tell from a simple block check), but it does mean that the position check has failed, most likely because the position reported by the drive is less than the expected position.

The error will look similar to the following:

<2> io_terminate_tape: block position check: actual 4, expected 5

<16> write_data: FREEZING media id XXXXXX, External event caused rewind during write, all data on media is lost

NetBackup keeps track of how much data it is sending to the operating system to write to the device. As an integrity check after the end of each write, NetBackup will ask the tape device for its position. If this position does not match what NetBackup has calculated the position should be, then the job will fail with a media write error.
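To spot these mismatches when reading through a long bptm log, the 'block position check' lines can be extracted and compared. The sketch below is a rough Python illustration: classify_position_check is a hypothetical helper, and the 'behind' / 'ahead' labels are my own shorthand for the two directions a mismatch can take, not NetBackup terminology.

```python
import re

# Layout taken from the bptm log lines quoted in this technote.
POSITION_RE = re.compile(
    r"block position check: actual (\d+), expected (\d+)"
)


def classify_position_check(line):
    """Return (actual, expected, verdict) for a position check line, or None.

    verdict is "ok" when the drive position matches NetBackup's calculation,
    "behind" when the drive reports a lower position than expected, and
    "ahead" when it reports a higher one.
    """
    match = POSITION_RE.search(line)
    if match is None:
        return None
    actual, expected = int(match.group(1)), int(match.group(2))
    if actual == expected:
        verdict = "ok"
    elif actual < expected:
        verdict = "behind"
    else:
        verdict = "ahead"
    return actual, expected, verdict


if __name__ == "__main__":
    line = "<2> io_terminate_tape: block position check: actual 4, expected 5"
    print(classify_position_check(line))
```

Any result other than "ok" only confirms that the check failed; as with the log message itself, it does not say what caused the mismatch.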

If a full rewind has occurred, the continued writing will overwrite the NetBackup header on the tape, making it unreadable; if this has happened, the data is lost. The most common cause is a SCSI reset on the SAN, which causes a rewind of the drive(s) whilst they are being written to. This event is undetected by NetBackup (it is impossible to detect) and is only discovered after the event, when the block position check is made. NetBackup cannot cause SCSI resets on the SAN; the cause has to be external (the tape positioning/ read/ write operations are controlled by the operating system).

If the issue is a position error (as opposed to a 'Full' rewind) a message similar to the following will be seen (bptm log).

<2> write_data: block position check: actual 62504, expected 31254

<16> write_data: FREEZING media id XXXXXX, too many data blocks written, check tape/driver block size configuration

The possible causes are numerous, and most commonly include:

Tape driver issue

Tape drive firmware issue

SAN fault

HBA fault, driver or firmware issue

Switch Fault

If the drives are attached to a NDMP device, it must be ensured that the SCSI reservation on the NDMP device is set to match the SCSI reservation type of NetBackup.

To further investigate "External Event has caused rewind" issues, Symantec recommends contacting your hardware/ operating system support vendor.

Note

The SCSI reservation is set /held by the Host Bus Adaptor, however NetBackup sends the reserve command through the SCSI pass-thru path for the device, so this needs to be configured correctly.

Known Issues:

NDMP

If the issue is occurring on drives that are shared (SSO) between a NDMP filer and NBU, and the drives are zoned directly to the filer, the issue can be caused if the SCSI reservation type set in NBU is not the same as the SCSI reservation type set on the filer.

If this is the case, the issue can be resolved by following these steps:

In the 'Host Properties' > 'Media Type' tab in NetBackup, check which SCSI reservation type is set, SPC2 or SCSI persistent

Change the type of SCSI reservation on the filer, to match the type you have set in NBU

Reboot the Robotic Library to break all the current reservations.

The following technote has a detailed explanation of SCSI reservation: http://www.symantec.com/docs/HOWTO32767

HP-UX 11.31 IA64 / atdd driver

BPTM block position check fails one block short using IBM atdd driver 6.0.0.96 on HP-UX 11.31 IA64

This issue is actually caused by the ATDD driver writing the EOT mark incorrectly. However, Symantec has produced a NetBackup 7.0.1 EEB to work around this issue (ETrack 2142743/ TECH155113).

Using the ATDD driver with NetBackup 7.0.1 and later on HP-UX 11.31 IA64 requires ATDD driver 6.0.2.8 or later. Upgrading to the new ATDD driver resolves the problem.

Tapes not reaching capacity

Issues where only (for example) 300GB of Data is written to a 400GB capacity tape ...

NBU passes data to the OS, one block at a time, to be written to the tape drive. NBU has no understanding of tape capacity; in theory it would keep writing to the same tape 'forever'.

When the tape physically passes the 'logical-end-of-tape' this is detected by the tape drive firmware. The tape drive firmware then sets a 'flag' in the tape driver (this would be the st driver in the case of Solaris). There is physically enough tape for the current block to be written, so this is completed successfully. NBU then attempts to send the next block of data (via the operating system) but now the tape driver refuses, as the 'tape full' flag is set. The st driver then passes this 'tape full' message to the operating system, which passes it to NetBackup. Only when this has happened will NetBackup change the tape.
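The sequence above can be illustrated with a toy simulation. This is a deliberately simplified Python model and not how NetBackup or any tape driver is implemented: SimulatedTape and backup are hypothetical names, and the block counts are arbitrary. It only shows that the writer changes tape when a write is refused, so a drive that signals logical-end-of-tape early produces a tape that looks 'short'.

```python
class SimulatedTape:
    """Toy model of a tape with a logical-end-of-tape (LEOT) position."""

    def __init__(self, leot_blocks):
        self.leot_blocks = leot_blocks  # where the firmware detects LEOT
        self.blocks_written = 0
        self.tape_full = False          # the 'flag' set in the tape driver

    def write_block(self):
        """Write one block; return False if the driver refuses the write."""
        if self.tape_full:
            return False
        self.blocks_written += 1
        if self.blocks_written >= self.leot_blocks:
            # LEOT passed: the current block still completes, but the
            # flag is set so the next write will be refused.
            self.tape_full = True
        return True


def backup(total_blocks, tapes):
    """Write total_blocks across tapes; return blocks written per tape.

    The writer has no notion of capacity. It keeps writing to the current
    tape and only moves on when a write is refused.
    """
    remaining = total_blocks
    used = []
    for tape in tapes:
        while remaining > 0 and tape.write_block():
            remaining -= 1
        used.append(tape.blocks_written)
        if remaining == 0:
            break
    if remaining > 0:
        raise RuntimeError("ran out of tapes")
    return used


if __name__ == "__main__":
    # The first tape signals LEOT after 3 blocks instead of 4.
    print(backup(6, [SimulatedTape(3), SimulatedTape(4)]))
```

Nothing in the writer loop refers to a capacity figure, which mirrors the point that there is no NetBackup setting to tune here; the amount written is decided entirely by when the simulated drive raises the flag.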

Common causes of this issue are tape drive firmware, or faulty hardware.

There are no settings in NetBackup that influence tape capacity. To further investigate Tape Capacity issues, Symantec recommends contacting your hardware vendor.

Tapes being incorrectly marked as 'read only'

NetBackup has no understanding of 'read only'. This state is reported by the tape drive, and is usually set by means of a small switch on the tape cartridge.

Therefore, if a tape is being reported as 'read only' this issue cannot be the fault of NetBackup.

'Read only' is reported by the firmware of the tape drive and logged by NetBackup. It is seen as a TapeAlert:

0x09: Cartridge write protected

It has been seen on occasion that firmware issues of the tape drive have caused tape media to be incorrectly reported as 'read only'.

Library Inventory Issues

NetBackup does not directly 'Inventory' a library. Instead it queries the library and waits to be told what tapes (barcodes) are located in which element address (slots/ drives). If, for example, NetBackup 'cannot see' a particular cartridge(s) it is because the library is 'hiding' the location, not because of any setting within NetBackup.

For example, common symptoms of library issues include tapes appearing in the incorrect/ wrong slot, and tapes/ slots not appearing at all. It is impossible for this to be caused by NetBackup.

To further investigate Library issues, Symantec recommends contacting your hardware vendor.

Note

Issues involving NetBackup and the Virtual I/O slots on the IBM 3500 series libraries where ALMS /Virtual I/O are enabled are occasionally seen.

Problems involving Virtual I/O slots cannot be caused by NetBackup because there are no settings in NetBackup that can influence the behavior of the Virtual I/O slots.

It has been found that the library setting "Queued Exports" should be set to 'HIDE' from within the IBM web console to allow tapes to be moved from the virtual I/O slots to the slots within the logical library.

Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"

This error is seen in the bptm log and, depending on the logging set, may be referenced in the ...volmgr/debug log, and the operating system event log.

An excellent way to check this is to use the robtest command. A link to a Technote documenting robtest is available at the end of this Technote.

The robtest command does not issue any 'NetBackup' commands. It only sends 'operating system' SCSI commands to the library, and the output seen from the command is sent from the library firmware. Given this description, it is clear that robtest failures cannot be caused by NetBackup.

For example:

(Using robtest command to issue a move media request from slot 86 to drive 2)

m s86 d2

move_medium failed

sense key = 0x4, asc = 0x15, ascq = 0x1, MECHANICAL POSITIONING ERROR

As robtest has only sent a SCSI move request, straight away this failure can be seen to not be caused by NetBackup.

Further, the error is referencing an 'ASC/ ASCQ' error, which as explained in the "ASC/ ASCQ" section of the Technote is never caused by NetBackup.

To further investigate robtest issues, Symantec recommends contacting the hardware vendor.

Missing drives, or drives disappearing and reappearing

This covers cases where, for example, tpautoconf -report_disc shows inconsistent numbers of missing devices when the command is run at different times.

tpautoconf -report_disc will report "Missing Device" if a device that is configured and available within NetBackup is no longer detected by the operating system.

For example:

======================= Missing Device (Drive) ======================

Inquiry = "IBM Ultrium 3-SCSI

Serial Number = HM74536FFS

Drive Path = /dev/rmt/0cbn

Drive Name = DRV_F2D3_LTO5
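To compare the output of several runs, the report can be reduced to a set of serial numbers per run. The following Python sketch assumes the output always has the 'Key = Value' layout shown in the example above; parse_missing_devices and flapping_serials are hypothetical helper names, and the report text must be captured by you (the script does not run tpautoconf itself).

```python
def parse_missing_devices(report):
    """Parse 'Missing Device' blocks into a list of dicts.

    The layout is assumed from the single example in this technote:
    a banner line containing 'Missing Device', then 'Key = Value' lines.
    """
    devices = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if "Missing Device" in line:
            current = {}
            devices.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip().strip('"')
    return devices


def flapping_serials(reports):
    """Return serial numbers missing in some runs but not in all of them."""
    per_run = [
        {device.get("Serial Number") for device in parse_missing_devices(r)}
        for r in reports
    ]
    if not per_run:
        return set()
    return set.union(*per_run) - set.intersection(*per_run)
```

A serial number that is returned by flapping_serials was missing in at least one run and present in another, which is the 'disappear and reappear' pattern described in this section.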

In this case, NetBackup is only reporting that the Operating System cannot find a device that was previously available.

If a different number of devices are missing at different times (that is, the devices 'disappear' and 'reappear') this is very likely a SAN issue.

NetBackup has no control over the communication of devices between the device and the operating system.

If a device is showing as 'missing' it is because of an issue outside of NetBackup. Problems on the SAN are a very common cause of this issue.

Associated Documentation:

http://www.symantec.com/docs/TECH124594 - "Description of Tape Alerts and code definitions"

http://www.symantec.com/docs/TECH83129 - "Robtest command that can be used to test the SCSI functionality of a robot"

Article URL http://www.symantec.com/docs/TECH169477


Marianne · ‎01-01-2012

Some TechNotes - some old, some new, even BE (as media errors are not unique to NBU)....

http://www.symantec.com/docs/TECH35336

http://www.symantec.com/docs/TECH5433

http://www.symantec.com/docs/TECH5325

http://www.symantec.com/docs/TECH139183

http://www.symantec.com/docs/TECH169477

If you Google 'Event ID 23 cyclic redundancy check' you will see that this is not a NetBackup error - this is an error reported by the OS while attempting to write to media. Extract from http://www.symantec.com/docs/TECH43243 :

As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes; OS configuration, a problem on the SCSI path, or a problem with the device.


mph999 · ‎01-01-2012

A CRC error means, with almost 100% certainty, that you have a faulty tape drive. It has nothing to do with bad media.

Martin UK Symantec Senior TSE

Kiran_Bandi · ‎01-01-2012

In what circumstances do we decide media went bad?

If the error occurs only on a particular media and all other media are working fine, then you can suspect that media.

If the error occurs with all media, then the tape drive could be the culprit. If the OS is Windows, and prior to 2008, stop and disable the Removable Storage Manager service. Then check the functionality of the tape drive with the vendor-specific diagnostic utility.
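If you have a lot of error lines to go through, the rule of thumb above can be roughly automated by cross-tabulating media ids against drive indexes. This is just a Python sketch under my own assumptions: the pattern is based on the 'media id ..., drive index ...' wording in the bptm examples quoted earlier in this thread, suspects is a hypothetical helper, and the 'more than one' thresholds are arbitrary.

```python
import re
from collections import defaultdict

# Wording assumed from the bptm error examples quoted in this thread.
EVENT_RE = re.compile(r"media id (\w+), drive index (\d+)")


def suspects(lines):
    """Return (suspect_media, suspect_drives) from bptm error lines.

    A tape that fails in more than one drive is listed as suspect media;
    a drive that fails with more than one tape is listed as a suspect drive.
    """
    drives_per_media = defaultdict(set)
    media_per_drive = defaultdict(set)
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        media_id, drive_index = match.group(1), int(match.group(2))
        drives_per_media[media_id].add(drive_index)
        media_per_drive[drive_index].add(media_id)
    suspect_media = sorted(
        m for m, drives in drives_per_media.items() if len(drives) > 1
    )
    suspect_drives = sorted(
        d for d, media in media_per_drive.items() if len(media) > 1
    )
    return suspect_media, suspect_drives
```

Treat the result as a hint on where to start, not a diagnosis; the drive should still be checked with the vendor's diagnostic utility.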

Give a read: https://www-secure.symantec.com/connect/blogs/facing-any-issues-your-tape-device-try

mph999 · ‎01-02-2012