
backup to OST device failed w/ error 87 (status 2060014)

cbsi_nbu_team
Level 3

 

I am testing backups to an OST device, and occasionally a backup fails with error 87. The detailed log is below.

Does anyone have a clue where to look further? What is status 2060014?

17:47:43.399 [28936] <2> send_MDS_msg: KBYTES_WRITTEN 0 {9FCAB948-DEFF-11E5-A390-49AC35138B51} 209 1 141003008 375
17:47:43.401 [28936] <2> JobInst::sendIrmMsg: returning
17:49:42.708 [28936] <2> send_MDS_msg: KBYTES_WRITTEN 0 {9FCAB948-DEFF-11E5-A390-49AC35138B51} 209 1 144003072 375
17:49:42.709 [28936] <2> JobInst::sendIrmMsg: returning
17:50:42.068 [28940] <2> fill_buffer: [28936] socket is closed, waited for empty buffer 258 times, delayed 3749 times, read 145619232 Kbytes
17:50:42.083 [28936] <2> write_data: writing block shorter than BUFF_SIZE, 32768 bytes
17:50:42.083 [28936] <2> write_data: writing short block, 32768 bytes, remainder 0
17:50:42.083 [28936] <2> wait_for_sigcld: waiting for child to exit, timeout is 300
17:50:42.085 [28936] <2> write_data: waited for full buffer 219973 times, delayed 332436 times
17:50:42.085 [28936] <2> write_data: Total Kbytes transferred 145619232
17:50:42.085 [28936] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = 0
17:50:42.085 [28936] <2> write_backup: tp.tv_sec = 1456768242, stp.tv_sec = 1456762520, tp.tv_usec = 83824, stp.tv_usec = 319519, et = 5721765, mpx_total_kbytes[TWIN_INDEX = 0] = 145619232
17:50:42.086 [28936] <4> report_throughput: VBRT 1 28936 5 1 _PhysicalLSU @aaaal 0 1 0 23180064  23180064 (bptm.c.20959)
18:06:26.125 [28936] <16> 375:bptm:28936:nbu1: dm_synchronize(0x1e8a138) returned XCOMM_ERR_RCV
18:06:26.341 [28936] <16> 375:bptm:28936:nbu1: EXIT pgn_close_image: (operation aborted:2060014) ih=(nil)

18:06:26.342 [28936] <32> bp_sts_close_image: sts_close_handle failed: 2060014
18:06:26.342 [28936] <16> write_backup: cannot write image to disk, media close failed with status 2060014
18:06:26.352 [28936] <2> delete_image_disk_sts_impl: Deleting disk header for oradb1005.tm.cbsig.net_1456762518_C1_HDR
18:06:26.359 [28936] <2> check_if_query_requires_authentication: need authentication = AUTH_PROHIBITED,for query type = 161
18:06:26.359 [28936] <2> ConnectionCache::connectAndCache: Acquiring new connection for host nbu1, query type 161
18:06:26.361 [28936] <2> logconnections: BPDBM CONNECT FROM 10.25.66.66.37276 TO 10.25.66.66.13721 fd = 0
18:06:26.368 [28936] <2> db_end: Need to collect reply
18:06:27.643 [28936] <2> bptm: EXITING with status 87 <----------
18:06:27.653 [28936] <2> cleanup: Detached from BPBRM shared memory

 

I saw https://www.veritas.com/community/forums/status-87-media-close-error-2060014

but my storage is a Quantum DXi 4702, and I am using the latest firmware. Should I be contacting Quantum?

Any help is greatly appreciated!

Thanks!

5 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified
I would say Yes. Log a call with Quantum. Apart from latest firmware, do you have the latest plugin on the media server(s)? Sufficient resources on media server(s)?

cbsi_nbu_team
Level 3

OK, Quantum is taking a look at it now. Collecting a bunch of debug info.

Yes, we have the latest firmware and latest plugin. Plenty of resources on media server. 

One suspect is the FW in the routing path, but other clients go through the same FW without any issue.

 

cbsi_nbu_team
Level 3

The problem disappeared after moving the OST device out from behind the firewall. The vendor support person mentioned that additional tuning can be done if the FW can't be removed from the routing path, but I haven't received all the details yet.

Marianne
Level 6
Partner    VIP    Accredited Certified

It would be nice if Quantum could share Firewall port requirements. 

I have personally looked for this some time ago ...

cbsi_nbu_team
Level 3

This is what I figured out:

The media server opens a connection on port 3095 to the Quantum OST device and expects this connection to remain established while it uses port 10002 to send the actual data. So while port 10002 is always busy, the connection on port 3095 sits idle until the data transfer is finished. The TCP keepalive time on the media server is set to 7200 seconds (the standard Linux default), so it waits 2 hours before it starts sending keepalive probes. I suspect that the FW idle timeout is set to less than 2 hours, so the FW drops the connection on the idle port 3095 before the keepalives kick in. In the meantime the media server continues sending data on port 10002, and when the copy finishes, the media server tries to communicate on port 3095 again, cannot, and the job fails.
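
The timing in the log from the original post can be checked against this explanation. The sketch below uses the `stp.tv_sec` and `tp.tv_sec` values from the `write_backup` line and the 7200-second keepalive time mentioned above. It assumes the port 3095 connection was opened at roughly the start of the write and stayed idle for the whole transfer, which the log does not show directly. The function names are made up for illustration.

```python
# Values copied from the write_backup log line in the original post.
START_EPOCH = 1456762520  # stp.tv_sec: start of the write
END_EPOCH = 1456768242    # tp.tv_sec: end of the data transfer
KEEPALIVE_TIME = 7200     # seconds, the keepalive time quoted in the post


def idle_seconds(start: int, end: int) -> int:
    """Seconds the control connection is assumed to have sat idle."""
    return end - start


def keepalive_fired(idle: int, keepalive_time: int) -> bool:
    """True if the first keepalive probe would have gone out while idle."""
    return idle >= keepalive_time


idle = idle_seconds(START_EPOCH, END_EPOCH)
print(f"idle for {idle} s ({idle / 60:.1f} min)")
print("keepalive probe sent during transfer:", keepalive_fired(idle, KEEPALIVE_TIME))
```

The transfer took 5722 seconds, about 95 minutes, so under these assumptions no keepalive probe would have been sent on port 3095 before the media server tried to use it again. Any FW idle timeout shorter than that would be enough to produce the failure.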

So other ways to solve this issue are to lower the tcp_keepalive_time value on the media server to less than the FW idle timeout, or to raise the FW idle timeout above the default TCP keepalive time.
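
A minimal sketch of how the first option could be checked on a Linux media server. It reads the kernel keepalive settings from `/proc/sys/net/ipv4` and compares the keepalive time with a firewall idle timeout. The `FW_IDLE_TIMEOUT` value is a hypothetical placeholder, since the actual firewall setting was never posted in this thread, and the function names are made up for illustration. Note also that these kernel settings only affect sockets that have `SO_KEEPALIVE` enabled; whether the OST plugin enables it on port 3095 is an assumption here.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")  # Linux only
FW_IDLE_TIMEOUT = 3600  # seconds; hypothetical, ask the firewall admin for the real value


def read_keepalive_settings(proc_dir: Path = PROC_DIR) -> dict:
    """Read the three kernel TCP keepalive settings as integers."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return {name: int((proc_dir / name).read_text().strip()) for name in names}


def survives_firewall(keepalive_time: int, fw_idle_timeout: int) -> bool:
    """True if the first probe goes out before the firewall idle timer expires."""
    return keepalive_time < fw_idle_timeout


if __name__ == "__main__":
    if PROC_DIR.is_dir():
        settings = read_keepalive_settings()
        print(settings)
        ok = survives_firewall(settings["tcp_keepalive_time"], FW_IDLE_TIMEOUT)
        print("idle connection should survive the firewall:", ok)
    else:
        print("no /proc/sys/net/ipv4 here; run this on the Linux media server")
```

With the default of 7200 and a one-hour firewall timeout the check returns False. The value can be lowered at runtime with `sysctl -w net.ipv4.tcp_keepalive_time=<seconds>` and made persistent in `/etc/sysctl.conf`.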