04-22-2013 06:27 AM
NetBackup 7.1.0.4 with the IBM TS3310 tape library.
It would seem I cannot perform Multiplex w/ Encryption (KMS). The backup runs without error (false positive?). The tape is unreadable afterwards.
Anyone else share this experience? How did you fix it?
Solved! Go to Solution.
04-22-2013 04:06 PM
If you look in bptm for a job of the failed restore you should see lines containing this :
establish_decryption_key: next block encryption
This line should appear after this line:
io_position_for_read: successfully positioned
The question: how many 'establish_decryption' lines are there directly after the postioning line, there should only be 1.
If there are more, you are probably looking at an issue similar to eTrack 3093882, and will need to log a call.
The bptm log, KMS log and nbsu -c -t outputs will be required, as well as a exact description of the issue.
The casue apeears to be firmware related, and an EEB was released to work around the issue.
Martin
07-11-2013 12:06 PM
It is unlikely that the retention level or timestamps had anything to do with not being able to decrypt your data. With KMS encryption, during a read process, the tape drive will encounter an encrypted block of data on the tape and notify the BPTM process via SCSI commands. BPTM will then communicate with NBKMS on the master server to retrieve the encryption key. Without actually seeing a restore failure from a full BPTM log we can't know if the tape drive is telling us if the data is encrypted or not.
One of the comments earlier mentioned that this is likely a firmware issue, and I would agree to that.
What model of tape drive are you using and what is the firmware version? It wouldn't be the first time that we have encountered defective firmware that would cause this issue. Check out the following technote:
07-11-2013 02:42 PM
04-22-2013 06:36 AM
You can have Netbackup KMS and multiplexing enabled - no problem. Done that myself. If you have configured KMS the intended way backup images will be encrypted wit AES256 in the tape drive.
What tape library does not matter - just a tape drive with that support the T10 encryption standard.
http://www.symantec.com/docs/HOWTO56305
Do not try to mix library based encryption with Netbackup based encryption. It's like oil and water - they do not mix.
04-22-2013 06:41 AM
I confirmed the drives do support encryption back when I first setup KMS. Infact, when I don't use multiplexing, the tapes can be read just fine.
04-22-2013 06:43 AM
When you don't multiplex are the tapes encrypted then ?. Verify with bpimagelist or using the GUI
The volume group must have the prefix ENCR_ else no encryption take place.
04-22-2013 07:11 AM
Yes, tapes show up encrypted.
"Flags: 0x40 (Tape Encrypted)"
ENCR_ is the prefix. Even the IBM tape utiltiies show it is encrypted.
I have also not had a problem with mutiplexed tapes until encryption came along.
I'm getting the feeling that what I am experiencing is a little unique. I was hoping someone else may have had this experience before.
04-22-2013 09:46 AM
I have used KMS with MPX before as well. You say the tape is unreadable afterwards - what do you mean? The restore process can't read it? Check the KMS logs (unified logs only, if I recall). It should say something about retrieving keys, etc.
Bill
04-22-2013 12:18 PM
You are on Linux - right?.
What status code are you seeing - status 85 ?
04-22-2013 12:22 PM
Linux, Yes.
Status 85. Yes.
I'll have to check the logs out tomorrow. I'll see if I can post something then.
04-22-2013 12:51 PM
Sound Good.
And you are 100% sure you did not change the encryption key somwhere in time (yes I made that one in a lab - did cost me 2 hours of debugging in vain) ?.
04-22-2013 01:29 PM
Yup. Only ever made one key.
04-22-2013 02:28 PM
One of the verbose image listings will list the key tag for the encrypted image - might be a good idea to make sure that matches what's in your keystore....
Bill
04-22-2013 04:06 PM
If you look in bptm for a job of the failed restore you should see lines containing this :
establish_decryption_key: next block encryption
This line should appear after this line:
io_position_for_read: successfully positioned
The question: how many 'establish_decryption' lines are there directly after the postioning line, there should only be 1.
If there are more, you are probably looking at an issue similar to eTrack 3093882, and will need to log a call.
The bptm log, KMS log and nbsu -c -t outputs will be required, as well as a exact description of the issue.
The casue apeears to be firmware related, and an EEB was released to work around the issue.
Martin
04-22-2013 11:12 PM
Interesting !
No TN for eTrack 3093882 ?. I could not find any.
04-23-2013 07:28 AM
I do get those.
09:59:53.290 [12822] <2> io_position_for_read: successfully positioned 000170 to file number 5, mpx_header = 2 09:59:53.291 [12822] <2> establish_decryption_key: next block encryption status: LON 0x000000000000331e, algorithm index 1, encryption status 0x3
This eventually happens in the job.
04/23/2013 10:14:54 - Error bptm (pid=12822) cannot read image from media id 000170, drive index 0, Input/output error 04/23/2013 10:14:54 - Error bptm (pid=12828) The following files/folders were not restored: 04/23/2013 10:14:54 - Error bptm (pid=12828) UTF - /root/anaconda-ks.cfg 04/23/2013 10:14:55 - Info bpbrm (pid=12813) got ERROR 85 from media manager 04/23/2013 10:14:55 - Info bpbrm (pid=12813) terminating bpbrm child 12833 jobid=198940 04/23/2013 10:14:55 - Error bpbrm (pid=12833) bpcd on torbk01 exited with status 21: socket open failed
This comes up almost immediately in bptm
09:59:54.509 [12822] <2> mpx_read_data: attempting read error recovery, err = 5 09:59:54.509 [12822] <2> tape_error_rec: error recovery to block 13086 requested 09:59:54.509 [12822] <2> tape_error_rec: attempting error recovery, delay 3 minutes before next attempt, tries left = 5
10:14:54.663 [12822] <2> set_job_details: Tfile (198940): LOG 1366726494 16 bptm 12822 cannot read image from media id 000170, drive index 0, Input/output error 10:14:54.696 [12822] <16> mpx_read_data: cannot read image from media id 000170, drive index 0, Input/output error 10:14:54.696 [12822] <2> send_MDS_msg: DEVICE_STATUS 1 83395 bumaster 000170 4001353 IBM.ULT3580-TD4.000 2000283 READ_ERROR 0 0
04-23-2013 08:46 AM
Oh, and yes, there is only one 'establish_decryption_key' line.
I have just found out that the original admin for this system did use the generic 'SCSI Tape' driver that comes with RHEL 5.3. Apparently, back when this backup server was created, using the IBM drivers made the drives undetectable by NetBackup. Fastforward to today, and the result of that is likely biting me in the butt.
Seems like I'll have to do an OS upgrade and/or driver update.
04-23-2013 09:49 AM
I am using the inbox (system default) driver on RH 5.6 - no problems. I think i have read somwhere the inbox driver is the only supported driver by Symantec.
04-23-2013 11:13 AM
Going to upgrade to OEL6.4 on Monday. Will update how that goes.
05-09-2013 07:01 PM
Hello again. Just an update.
I went for RHEL 5.3 to Oracle Linux 5.9. I confirmed with the IBM Express Update utility that I have the latest BIOS on the QLogic HBA.
Still doesn't work.
I have opened up a support ticket with Symantec. Let's see how they do.
05-11-2013 03:10 AM
Thanks for the update. If Symantec can help you, Plese post the soloution to the issues.
07-10-2013 06:50 AM
Months later, and the problem has solved*.
What was the solution? I'm really not sure.
I did notice during my testing that the data expiration dates were slightly off. I had made a retention level of 2 hours, and the expiration date of newly assigned media would be set to yesterday. Retention level of 2 days, expires today. Somewhere, a time calculation was way off. If that is the case, and timestamps are factored into the decryption process, than this could be a hint as to the cause of the problem. But why Multiplexed tapes only?
How did I fix the timestamp issue? No idea. I restarted all NetBackup on the master/media server, and that behavior stopped. For all I know, it could come back again, and when it does, it could cause this to happen again. I have performed this restart several times prior.
I also upgraded to 7.5.0.6. Could there be anything in there that fixed it? I didn't see anything related in the patch notes.
If it comes up again, I'll share.
Edit: It is also very uncomfortable to lose trust in the integrity of ones backup because of something like this.