Highlighted

Communication errors

Hello

I got:

Win2008R2 Master server

Tape Library Quantum Scalar i500

Oracle(RHEL) client machine

3 defferent Windows 2008R2 client machine

 

All 3 Windows client backups fails with error 98:

28.07.2016 13:00:00 - Info nbjm(pid=5476) starting backup job (jobid=494375) for client kzwbackup01.eub.kz, policy Backup_SQL_MUOR, schedule Differential-Inc
28.07.2016 13:00:00 - Info nbjm(pid=5476) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=494375, request id:{7B803017-4C6B-4FF1-B78B-C26E4B9B84DD})
28.07.2016 13:00:00 - requesting resource kzwnb01-hcart3-robot-tld-0
28.07.2016 13:00:00 - requesting resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwbackup01.eub.kz
28.07.2016 13:00:00 - requesting resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.Backup_SQL_MUOR
28.07.2016 13:00:00 - granted resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwbackup01.eub.kz
28.07.2016 13:00:00 - granted resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.Backup_SQL_MUOR
28.07.2016 13:00:00 - granted resource 000106
28.07.2016 13:00:00 - granted resource HP.ULTRIUM6-SCSI.004
28.07.2016 13:00:00 - granted resource kzwnb01-hcart3-robot-tld-0
28.07.2016 13:00:00 - estimated 161158 Kbytes needed
28.07.2016 13:00:00 - Info nbjm(pid=5476) started backup (backupid=kzwbackup01.eub.kz_1469689200) job for client kzwbackup01.eub.kz, policy Backup_SQL_MUOR, schedule Differential-Inc on storage unit kzwnb01-hcart3-robot-tld-0
28.07.2016 13:00:00 - started process bpbrm (9716)
28.07.2016 13:00:00 - started
28.07.2016 13:00:05 - Info bpbrm(pid=9716) kzwbackup01.eub.kz is the host to backup data from
28.07.2016 13:00:05 - Info bpbrm(pid=9716) reading file list for client
28.07.2016 13:00:05 - connecting
28.07.2016 13:00:08 - Info bpbrm(pid=9716) starting bpbkar32 on client
28.07.2016 13:00:08 - connected; connect time: 0:00:03
28.07.2016 13:00:10 - Info bpbkar32(pid=488) Backup started
28.07.2016 13:00:10 - Info bptm(pid=4040) start
28.07.2016 13:00:10 - Info bptm(pid=4040) using 65536 data buffer size
28.07.2016 13:00:10 - Info bptm(pid=4040) setting receive network buffer to 263168 bytes
28.07.2016 13:00:10 - Info bptm(pid=4040) using 30 data buffers
28.07.2016 13:00:10 - Info bptm(pid=4040) start backup
28.07.2016 13:00:10 - Info bptm(pid=4040) backup child process is pid 10092.7356
28.07.2016 13:00:10 - Info bptm(pid=4040) Waiting for mount of media id 000106 (copy 1) on server kzwnb01.eub.kz.
28.07.2016 13:00:10 - Info bptm(pid=10092) start
28.07.2016 13:00:10 - mounting 000106
28.07.2016 13:00:46 - Info bpbkar32(pid=488) change journal NOT enabled for <D:\sql_backup\MUOR>
28.07.2016 13:31:11 - Error bptm(pid=4040) error requesting media, TpErrno = Robot operation failed
28.07.2016 13:31:11 - Warning bptm(pid=4040) media id 000106 load operation reported an error
28.07.2016 13:31:11 - current media 000106 complete, requesting next resource Any
28.07.2016 14:01:27 - Info bptm(pid=4040) EXITING with status 98 <----------
28.07.2016 14:01:32 - Info bpbkar32(pid=488) done. status: 98: error requesting media (tpreq)
28.07.2016 14:01:32 - end writing
error requesting media (tpreq)(98)
28.07.2016 14:01:35 - Info bpbrm(pid=9020) Starting delete snapshot processing
28.07.2016 14:01:39 - Info bpfis(pid=7628) Backup started
28.07.2016 14:01:39 - Critical bpbrm(pid=9020) from client kzwbackup01.eub.kz: cannot open C:\Program Files\Veritas\NetBackup\online_util\fi_cntl\bpfis.fim.kzwbackup01.eub.kz_1469689200.1.0
28.07.2016 14:01:39 - Info bpfis(pid=7628) done. status: 1542
28.07.2016 14:01:39 - end operation
28.07.2016 14:01:39 - Info bpfis(pid=7628) done. status: 1542: An existing snapshot is no longer valid and cannot be mounted for subsequent operations

Oracle fails with error 6 (not every backup, just few times in a day)

RMAN logs shows next:

channel ORA_SBT_TAPE_1: input backup set: count=9309, stamp=916194140, piece=1
channel ORA_SBT_TAPE_1: starting piece 1 at 26-JUL-16
channel ORA_SBT_TAPE_1: backup piece /archive/archivelog/DB.LOGS.ERP.s9309.p1.t916194140
piece handle=2tr9o0qs_1_3 comment=API Version 2.0,MMS Version 5.0.0.0
channel ORA_SBT_TAPE_1: finished piece 1 at 26-JUL-16
channel ORA_SBT_TAPE_1: backup piece complete, elapsed time: 00:00:45
channel ORA_SBT_TAPE_1: input backup set: count=9297, stamp=916194139, piece=1
channel ORA_SBT_TAPE_1: starting piece 1 at 26-JUL-16
channel ORA_SBT_TAPE_1: backup piece /archive/archivelog/DB.LOGS.ERP.s9297.p1.t916194139
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_SBT_TAPE_1 channel at 07/26/2016 10:45:24
ORA-19506: failed to create sequential file, name="2hr9o0qr_1_3", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: failed to communicate with resource requester
ORA-19600: input file is backup piece (/archive/archivelog/DB.LOGS.ERP.s9297.p1.t916194139)
ORA-19601: output file is backup piece (2hr9o0qr_1_3)

Recovery Manager complete.
Backup ended at 2016-07-26 10:45:33

 

 

Can u provide me how to determine cause of that errors, which logs could help me

RMAN script is old an allways works fine, communication between master and client were checked via script(telneted ports 1556,13724,13782 every 30 secs, and no single interrupt appers in script log)

Master server were restarted few times, updated drivers for tape drives

Another windows and oracle(rhel) backups works well(about 30 clients)

 

 

 

 

17 Replies
Highlighted

Re: Communication errors

Here some logs from linux client(oracle):

more /usr/openv/netbackup/logs/user_ops/dbext/logs/16210.0.1469728327
Backup started Thu Jul 28 23:52:08 2016


23:52:09 Initiating backup
23:52:14 INF - Starting bpbrm
23:52:18 INF - Data socket = kzwnb01.eub.kz.IPC:/tmp/vnet-23466469728336472524000005994-8CAH0F;d15070320c90dd5ffac89c4ea9cdd487;11;1800
23:52:18 INF - Name socket = kzwnb01.eub.kz.IPC:/tmp/vnet-23467469728336748882000005994-J6NAJG;0a108b72858ec6d9bd551781ef6f47e2;11;1800
23:52:18 INF - Job id = 494667
23:52:18 INF - Backup id = KZLORA04.EUB.KZ_1469728332
23:52:18 INF - Backup time = 1469728332
23:52:19 INF - Policy name = KZLORA04_RMAN
23:52:19 INF - Snapshot = 0
23:52:19 INF - Frozen image = 0
23:52:19 INF - Backup copy = 0
23:52:20 INF - Master server = kzwnb01.eub.kz
23:52:20 INF - Media server = kzwnb01.eub.kz
23:52:20 INF - Multiplexing = 0
23:52:20 INF - New data socket = kzwnb01.eub.kz.IPC:/tmp/vnet-23460469728336212972000005994-KxZxkF;705f029b4a0a3b6cfb04117e0c2fd54b;11;1800
23:52:20 INF - Encrypt = 0
23:52:21 INF - Use shared memory = 0
23:52:21 INF - Compression = 1
23:52:21 INF - Encrypt = 0
23:52:21 INF - Keep logs = 28
23:52:21 INF - Client read timeout = 11800
23:52:22 INF - Media mount timeout = 0
23:52:23 INF - Waiting for mount of media id 000031 on server kzwnb01.eub.kz for writing
00:22:24 INF - Server status = 6
00:22:25 INF - Backup by oracle on client KZLORA04.EUB.KZ using policy KZLORA04_RMAN, sched Default-Application-Backup:the backup failed to back up the requested files

more /usr/openv/netbackup/logs/user_ops/dbext/logs/8917.0.1469744559
Backup started Fri Jul 29 04:22:39 2016


04:22:40 Initiating backup
04:22:45 INF - Starting bpbrm
04:22:49 INF - Data socket = kzwnb01.eub.kz.IPC:/tmp/vnet-09842469744567634243000006003-ej7Pmz;875d7aabfe6ff2b16a21976d8824cb61;11;1800
04:22:49 INF - Name socket = kzwnb01.eub.kz.IPC:/tmp/vnet-09845469744567914147000006003-rEMi6z;1b6c020056facf0c0385460f839867c9;11;1800
04:22:49 INF - Job id = 494897
04:22:49 INF - Backup id = KZLORA04.EUB.KZ_1469744564
04:22:50 INF - Backup time = 1469744564
04:22:50 INF - Policy name = KZLORA04_RMAN
04:22:50 INF - Snapshot = 0
04:22:50 INF - Frozen image = 0
04:22:51 INF - Backup copy = 0
04:22:51 INF - Master server = kzwnb01.eub.kz
04:22:51 INF - Media server = kzwnb01.eub.kz
04:22:51 INF - Multiplexing = 0
04:22:51 INF - New data socket = kzwnb01.eub.kz.IPC:/tmp/vnet-09841469744567347672000006003-hfeiCy;850f6d95072bbe80eb9a6f7d5c7e57ec;11;1800
04:22:52 INF - Encrypt = 0
04:22:52 INF - Use shared memory = 0
04:22:52 INF - Compression = 1
04:22:52 INF - Encrypt = 0
04:22:52 INF - Keep logs = 28
04:22:53 INF - Client read timeout = 11800
04:22:53 INF - Media mount timeout = 0
04:22:54 INF - Waiting for mount of media id 000031 on server kzwnb01.eub.kz for writing
04:23:02 INF - Waiting for mount of media id 000039 on server kzwnb01.eub.kz for writing
04:23:47 INF - Mounted media id 000039
04:25:00 INF - Beginning backup on server kzwnb01.eub.kz of client KZLORA04.EUB.KZ
04:25:09 INF - Server status = 0
04:25:10 INF - Backup by oracle on client KZLORA04.EUB.KZ using policy KZLORA04_RMAN, sched Default-Application-Backup:the requested operation was successfully completed

more /usr/openv/netbackup/logs/vnetd/log.072916
00:00:21.127 [7174] <2> setup_debug_log: switched debug log file for vnetd
00:22:23.544 [7174] <2> vnet_pbxAcceptSocket: Accepted sock[9] from 172.20.2.42:63851
00:22:23.544 [4350] <8> ProcessRequests: [vnetd.c:635] msg VNETD ACCEPT FROM 172.20.2.42.63851 TO 172.20.1.66.1556 fd = 9
00:22:23.549 [4350] <8> vnet_version_accept: [vnetd.c:1092] *actual_version 11 0xb
00:22:23.590 [4350] <8> ProcessRequests: [vnetd.c:718] msg Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
00:22:23.590 [4350] <8> vnet_generate_hashes: [vnet_vnetd.c:1592] hash_str1 d98e9d5f31b49718fc042bc9d50f1083
00:22:23.590 [4350] <8> vnet_generate_hashes: [vnet_vnetd.c:1593] hash_str2 9bd1cb593c3423ffe14d6a14714c1d3f
00:22:23.590 [4350] <8> process_connect_forward_socket: [vnetd.c:1972] ipc_string /tmp/vnet-04350469730143590705000005997-AHrPJK
00:22:23.796 [4350] <8> vnet_send_network_socket: [vnet_vnetd.c:1400] hash_str2 9bd1cb593c3423ffe14d6a14714c1d3f
00:22:23.797 [4350] <8> ProcessRequests: [vnetd.c:720] status 0 0x0
00:22:25.052 [7174] <2> vnet_pbxAcceptSocket: Accepted sock[9] from 172.20.2.42:63864
00:22:25.052 [4362] <8> ProcessRequests: [vnetd.c:635] msg VNETD ACCEPT FROM 172.20.2.42.63864 TO 172.20.1.66.1556 fd = 9
00:22:25.054 [4362] <8> vnet_version_accept: [vnetd.c:1092] *actual_version 11 0xb
00:22:25.096 [7174] <2> vnet_pbxAcceptSocket: Accepted sock[9] from 172.20.2.42:63866
00:22:25.097 [4364] <8> ProcessRequests: [vnetd.c:635] msg VNETD ACCEPT FROM 172.20.2.42.63866 TO 172.20.1.66.1556 fd = 9
00:22:25.097 [4362] <8> ProcessRequests: [vnetd.c:718] msg Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
00:22:25.097 [4362] <8> vnet_generate_hashes: [vnet_vnetd.c:1592] hash_str1 576d95acd0f17fbe6bd549e0aab642ce
00:22:25.097 [4362] <8> vnet_generate_hashes: [vnet_vnetd.c:1593] hash_str2 3c2f65ced6ec152d45b2eddfa580d20e
00:22:25.097 [4362] <8> process_connect_forward_socket: [vnetd.c:1972] ipc_string /tmp/vnet-04362469730145097185000005997-6HhwGO
00:22:25.099 [4364] <8> vnet_version_accept: [vnetd.c:1092] *actual_version 11 0xb
00:22:25.141 [4364] <8> ProcessRequests: [vnetd.c:718] msg Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
00:22:25.141 [4364] <8> vnet_generate_hashes: [vnet_vnetd.c:1592] hash_str1 5b74da90edf0ef5a00c351fa99595d90
00:22:25.142 [4364] <8> vnet_generate_hashes: [vnet_vnetd.c:1593] hash_str2 d22a419731e051d6da314c20bd947b20
00:22:25.142 [4364] <8> process_connect_forward_socket: [vnetd.c:1972] ipc_string /tmp/vnet-04364469730145141749000005997-gk5KNO
00:22:25.303 [4362] <8> vnet_send_network_socket: [vnet_vnetd.c:1400] hash_str2 3c2f65ced6ec152d45b2eddfa580d20e
00:22:25.303 [4362] <8> ProcessRequests: [vnetd.c:720] status 0 0x0
00:22:25.352 [4364] <8> vnet_send_network_socket: [vnet_vnetd.c:1400] hash_str2 d22a419731e051d6da314c20bd947b20
00:22:25.352 [4364] <8> ProcessRequests: [vnetd.c:720] status 0 0x0
02:26:27.258 [7174] <2> vnet_pbxAcceptSocket: Accepted sock[9] from 172.20.2.42:57941

more /usr/openv/netbackup/logs/dbclient/log.072916
00:07:24.487 [16210] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/16210.0.1469728327
00:07:24.498 [16210] <2> conf_update_time: ../../libvlibs/nbconf_glue.cpp.436: errno: 2 2 0x00000002
00:07:24.498 [16210] <2> conf_update_time: ../../libvlibs/nbconf_glue.cpp.437: path: /home/oracle/bp.conf
00:22:23.359 [16210] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/16210.0.1469728327
00:22:23.359 [16210] <32> serverResponse: ERR - could not read from comm file </usr/openv/netbackup/logs/user_ops/dbext/logs/16210.0.1469728327>
00:22:23.359 [16210] <16> CreateNewImage: ERR - serverResponse() failed
00:22:23.362 [16210] <16> VxBSACreateObject: ERR - Could not create new image with file /29rbkbg0_1_2.
00:22:23.362 [16210] <16> xbsa_CreateObject: ERR - VxBSACreateObject: Failed with error:
Server Status: Communication with the server has not been initiated or the server status has not been retrieved from the serve
00:22:35.962 [16210] <2> conf_update_time: ../../libvlibs/nbconf_glue.cpp.436: errno: 2 2 0x00000002
00:22:35.962 [16210] <2> conf_update_time: ../../libvlibs/nbconf_glue.cpp.437: path: /home/oracle/bp.conf
00:22:35.963 [16210] <4> sbtend: INF - --- END of SESSION ---
00:22:35.963 [16210] <8> close_image: Session being terminated abnormally, cleaning up
00:22:35.963 [16210] <4> close_image: INF - backup FAILED
00:22:35.963 [16210] <4> close_image: INF ---- end of Backup ---

00:22:35.963 [16210] <16> VxBSAEndTxn: ERR - Transaction ended with active Backup/Restore.
00:22:35.963 [16210] <16> xbsa_EndTransaction: ERR - VxBSAEndTxn: Failed with error:
The transaction was aborted.

 

 

Tags (2)
Highlighted

Re: Communication errors

Forget about Oracle backups for now (or start a new discussion for that).

This is the error that you need to troubleshoot first:

28.07.2016 13:31:11 - Error bptm(pid=4040) error requesting media, TpErrno = Robot operation failed
28.07.2016 13:31:11 - Warning bptm(pid=4040) media id 000106 load operation reported an error

Media id 000106 could not be loaded into the tape drive.

Have you confirmed that the media and tape drive(s) are compatibile?
E.g. LTO6 can write to LTO5 and LTO6 tapes.

Are you seeing the same issue with other media id's or just this one?

Highlighted

Re: Communication errors

Thanks for quick apply

All tape media are LTO6, all drives are compatibile

Errors appers with different drive ID an different media ID

Tape library reports dont include robot fails(reports via library web-interface)

Policies are old and works well.

Highlighted

Re: Communication errors

Is the master server also the media server and robot control host?

This is where you need to troubleshoot the load error. 

Ensure bptm folder exists under NetBackup\logs and turn up logging level in Host Properties to 3 or higher.
(Remember to drop down logging level when problem has been fixed).
Add VERBOSE entry to vm.conf (in volmgr folder) and restart NBU Device Management service.

After next status 98 error, check bptm log for errors as well as Event Viewer System and Application logs.

Other troubleshooting can be done by loading/unloading these tapes that gave the errors with 'robtest' utility.

See http://www.veritas.com/docs/000041570 

Highlighted

Re: Communication errors

Yes, media server and master server is a one host.

bptm folder exists and logging level property is already set to Host 5 (maximum).

VERBOSE record added and service been restarted

We are waiting for the appearance of an error 98

Thank you

Highlighted

Re: Communication errors

Utility "robtest" can not be used at this time as backups occur

Highlighted

Re: Communication errors

Hello!

After completing all the steps described above you, we have again an error has occurred 98.

bptm logs attached below:

 

https://www.dropbox.com/s/jmogtcb3clfz00f/073016.log?dl=0

https://www.dropbox.com/s/vzlz3sg7eyjkltu/073116.log?dl=0

https://www.dropbox.com/s/3kdn64yag34ioxz/080116.log?dl=0

Highlighted

Re: Communication errors

You have uploaded 3 logs. 

When did the error occur and on which day?

Please post the job details of the failed backup so that we know where in the logs to search.

Re: Communication errors

Screenshot 98 error's:

Capture.PNG

and logs the job:

KZWLOT08_LOTUS

30.07.2016 4:46:39 - Info nbjm(pid=6096) starting backup job (jobid=495701) for client KZWLOT08.EUB.KZ, policy KZWLOT08_LOTUS, schedule Differential-Inc
30.07.2016 4:46:39 - Info nbjm(pid=6096) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=495701, request id:{B17564B7-CF7A-4DCC-99AC-E21DA789D7B4})
30.07.2016 4:46:39 - requesting resource Any
30.07.2016 4:46:39 - requesting resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.KZWLOT08.EUB.KZ
30.07.2016 4:46:39 - requesting resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.KZWLOT08_LOTUS
30.07.2016 4:46:40 - awaiting resource Any Reason: Drives are in use, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Lotus, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

30.07.2016 4:49:16 - awaiting resource Any - No drives are available
30.07.2016 4:49:18 - awaiting resource Any Reason: Drives are in use, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Lotus, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

30.07.2016 5:19:29 - granted resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.KZWLOT08.EUB.KZ
30.07.2016 5:19:29 - granted resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.KZWLOT08_LOTUS
30.07.2016 5:19:29 - granted resource 000069
30.07.2016 5:19:29 - granted resource HP.ULTRIUM6-SCSI.000
30.07.2016 5:19:29 - granted resource kzwnb01-hcart3-robot-tld-0
30.07.2016 5:19:29 - estimated 4204336 Kbytes needed
30.07.2016 5:19:29 - Info nbjm(pid=6096) started backup (backupid=KZWLOT08.EUB.KZ_1469834369) job for client KZWLOT08.EUB.KZ, policy KZWLOT08_LOTUS, schedule Differential-Inc on storage unit kzwnb01-hcart3-robot-tld-0
30.07.2016 5:19:30 - started process bpbrm (6924)
30.07.2016 5:19:32 - Info bpbrm(pid=6924) KZWLOT08.EUB.KZ is the host to backup data from
30.07.2016 5:19:32 - Info bpbrm(pid=6924) reading file list for client
30.07.2016 5:19:32 - connecting
30.07.2016 5:19:36 - Info bpbrm(pid=6924) starting bpbkar32 on client
30.07.2016 5:19:36 - connected; connect time: 0:00:04
30.07.2016 5:19:39 - Info bpbkar32(pid=3376) Backup started
30.07.2016 5:19:39 - Info bpbkar32(pid=3376) change time comparison:<disabled>
30.07.2016 5:19:39 - Info bpbkar32(pid=3376) archive bit processing:<enabled>
30.07.2016 5:19:39 - Info bptm(pid=6340) start
30.07.2016 5:19:39 - Info bptm(pid=6340) using 65536 data buffer size
30.07.2016 5:19:39 - Info bptm(pid=6340) setting receive network buffer to 263168 bytes
30.07.2016 5:19:39 - Info bptm(pid=6340) using 30 data buffers
30.07.2016 5:19:39 - Info bptm(pid=6340) start backup
30.07.2016 5:19:39 - Info bptm(pid=6340) backup child process is pid 9152.2424
30.07.2016 5:19:39 - Info bptm(pid=6340) Waiting for mount of media id 000069 (copy 1) on server kzwnb01.eub.kz.
30.07.2016 5:19:39 - Info bptm(pid=9152) start
30.07.2016 5:19:39 - mounting 000069
30.07.2016 5:20:45 - Error bptm(pid=6340) error requesting media, TpErrno = No robot daemon or robotics are unavailable
30.07.2016 5:20:45 - Warning bptm(pid=6340) media id 000069 load operation reported an error
30.07.2016 5:20:45 - current media 000069 complete, requesting next resource Any
30.07.2016 5:50:49 - Info bptm(pid=6340) EXITING with status 98 <----------
30.07.2016 5:50:53 - Info bpbkar32(pid=3376) done. status: 98: error requesting media (tpreq)
30.07.2016 5:50:53 - end writing
error requesting media (tpreq)(98)

Backup_SQL_Detwig

31.07.2016 15:01:31 - Info nbjm(pid=6096) starting backup job (jobid=497798) for client kzwbackup01.eub.kz, policy Backup_SQL_Detwig, schedule Full
31.07.2016 15:01:31 - Info nbjm(pid=6096) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=497798, request id:{1B60CFDE-E54C-432D-84AC-0FFE0FC6132F})
31.07.2016 15:01:31 - requesting resource kzwnb01-hcart3-robot-tld-0
31.07.2016 15:01:31 - requesting resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwbackup01.eub.kz
31.07.2016 15:01:31 - requesting resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.Backup_SQL_Detwig
31.07.2016 15:01:31 - granted resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwbackup01.eub.kz
31.07.2016 15:01:31 - granted resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.Backup_SQL_Detwig
31.07.2016 15:01:31 - granted resource 000026
31.07.2016 15:01:31 - granted resource HP.ULTRIUM6-SCSI.000
31.07.2016 15:01:31 - granted resource kzwnb01-hcart3-robot-tld-0
31.07.2016 15:01:31 - estimated 168409357 Kbytes needed
31.07.2016 15:01:31 - Info nbjm(pid=6096) started backup (backupid=kzwbackup01.eub.kz_1469955691) job for client kzwbackup01.eub.kz, policy Backup_SQL_Detwig, schedule Full on storage unit kzwnb01-hcart3-robot-tld-0
31.07.2016 15:01:31 - started
31.07.2016 15:01:32 - started process bpbrm (8988)
31.07.2016 15:01:36 - Info bpbrm(pid=8988) kzwbackup01.eub.kz is the host to backup data from
31.07.2016 15:01:36 - Info bpbrm(pid=8988) reading file list for client
31.07.2016 15:01:37 - connecting
31.07.2016 15:01:40 - Info bpbrm(pid=8988) starting bpbkar32 on client
31.07.2016 15:01:40 - connected; connect time: 0:00:03
31.07.2016 15:01:42 - Info bpbkar32(pid=8128) Backup started
31.07.2016 15:01:42 - Info bptm(pid=8888) start
31.07.2016 15:01:42 - Info bptm(pid=8888) using 65536 data buffer size
31.07.2016 15:01:42 - Info bptm(pid=8888) setting receive network buffer to 263168 bytes
31.07.2016 15:01:42 - Info bptm(pid=8888) using 30 data buffers
31.07.2016 15:01:42 - Info bptm(pid=8888) start backup
31.07.2016 15:01:42 - Info bptm(pid=8888) backup child process is pid 916.10360
31.07.2016 15:01:42 - Info bptm(pid=8888) Waiting for mount of media id 000026 (copy 1) on server kzwnb01.eub.kz.
31.07.2016 15:01:42 - Info bptm(pid=916) start
31.07.2016 15:01:42 - mounting 000026
31.07.2016 15:02:50 - Info bpbkar32(pid=8128) change journal NOT enabled for <D:\sql_backup\Detwig>
31.07.2016 15:29:28 - Error bptm(pid=8888) error requesting media, TpErrno = No robot daemon or robotics are unavailable
31.07.2016 15:29:28 - Warning bptm(pid=8888) media id 000026 load operation reported an error
31.07.2016 15:29:28 - current media 000026 complete, requesting next resource Any
31.07.2016 15:59:54 - Info bptm(pid=8888) EXITING with status 98 <----------
31.07.2016 15:59:59 - Info bpbkar32(pid=8128) done. status: 98: error requesting media (tpreq)
31.07.2016 15:59:59 - end writing
error requesting media (tpreq)(98)
31.07.2016 16:00:03 - Info bpbrm(pid=7708) Starting delete snapshot processing
31.07.2016 16:00:08 - Info bpfis(pid=8112) Backup started
31.07.2016 16:00:08 - Critical bpbrm(pid=7708) from client kzwbackup01.eub.kz: cannot open C:\Program Files\Veritas\NetBackup\online_util\fi_cntl\bpfis.fim.kzwbackup01.eub.kz_1469955691.1.0
31.07.2016 16:00:08 - Info bpfis(pid=8112) done. status: 1542
31.07.2016 16:00:08 - end operation
31.07.2016 16:00:08 - Info bpfis(pid=8112) done. status: 1542: An existing snapshot is no longer valid and cannot be mounted for subsequent operations

Backup_SQL_MATM

30.07.2016 4:46:39 - Info nbjm(pid=6096) starting backup job (jobid=495699) for client kzwbackup01.eub.kz, policy Backup_SQL_MATM, schedule Differential-Inc
30.07.2016 4:46:39 - Info nbjm(pid=6096) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=495699, request id:{05140F21-D3AE-4103-8C7C-A86366320E59})
30.07.2016 4:46:39 - requesting resource kzwnb01-hcart3-robot-tld-0
30.07.2016 4:46:39 - requesting resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwbackup01.eub.kz
30.07.2016 4:46:39 - requesting resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.Backup_SQL_MATM
30.07.2016 4:46:40 - awaiting resource kzwnb01-hcart3-robot-tld-0 Reason: Drives are in use, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: SQL_DB, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

30.07.2016 4:49:16 - awaiting resource kzwnb01-hcart3-robot-tld-0 - No drives are available
30.07.2016 5:19:59 - granted resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwbackup01.eub.kz
30.07.2016 5:19:59 - granted resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.Backup_SQL_MATM
30.07.2016 5:19:59 - granted resource 000105
30.07.2016 5:19:59 - granted resource HP.ULTRIUM6-SCSI.005
30.07.2016 5:19:59 - granted resource kzwnb01-hcart3-robot-tld-0
30.07.2016 5:19:59 - estimated 29094294 Kbytes needed
30.07.2016 5:19:59 - Info nbjm(pid=6096) started backup (backupid=kzwbackup01.eub.kz_1469834399) job for client kzwbackup01.eub.kz, policy Backup_SQL_MATM, schedule Differential-Inc on storage unit kzwnb01-hcart3-robot-tld-0
30.07.2016 5:19:59 - started process bpbrm (6024)
30.07.2016 5:19:59 - started
30.07.2016 5:20:04 - Info bpbrm(pid=6024) kzwbackup01.eub.kz is the host to backup data from
30.07.2016 5:20:04 - Info bpbrm(pid=6024) reading file list for client
30.07.2016 5:20:04 - connecting
30.07.2016 5:20:06 - Info bpbrm(pid=6024) starting bpbkar32 on client
30.07.2016 5:20:06 - connected; connect time: 0:00:02
30.07.2016 5:20:08 - Info bpbkar32(pid=6456) Backup started
30.07.2016 5:20:08 - Info bptm(pid=7876) start
30.07.2016 5:20:08 - Info bptm(pid=7876) using 65536 data buffer size
30.07.2016 5:20:08 - Info bptm(pid=7876) setting receive network buffer to 263168 bytes
30.07.2016 5:20:08 - Info bptm(pid=7876) using 30 data buffers
30.07.2016 5:20:09 - Info bptm(pid=7876) start backup
30.07.2016 5:20:09 - Info bptm(pid=7876) backup child process is pid 8812.820
30.07.2016 5:20:09 - Info bptm(pid=7876) Waiting for mount of media id 000105 (copy 1) on server kzwnb01.eub.kz.
30.07.2016 5:20:09 - mounting 000105
30.07.2016 5:20:09 - Info bptm(pid=8812) start
30.07.2016 5:20:50 - Error bptm(pid=7876) error requesting media, TpErrno = No robot daemon or robotics are unavailable
30.07.2016 5:21:00 - Warning bptm(pid=7876) media id 000105 load operation reported an error
30.07.2016 5:21:00 - current media 000105 complete, requesting next resource Any
30.07.2016 5:21:47 - current media -- complete, awaiting next media Any Reason: Robotic library is down on server, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: SQL_DB, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

30.07.2016 5:21:57 - Info bpbkar32(pid=6456) change journal NOT enabled for <D:\sql_backup\MATM>
30.07.2016 5:27:27 - current media -- complete, awaiting next media Any Reason: Drives are in use, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: SQL_DB, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

30.07.2016 5:59:28 - current media -- complete, awaiting next media Any Reason: Robotic library is down on server, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: SQL_DB, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

30.07.2016 6:04:15 - Info bptm(pid=7876) Waiting for mount of media id 000105 (copy 1) on server kzwnb01.eub.kz.
30.07.2016 6:04:15 - granted resource 000105
30.07.2016 6:04:15 - granted resource HP.ULTRIUM6-SCSI.005
30.07.2016 6:04:15 - granted resource kzwnb01-hcart3-robot-tld-0
30.07.2016 6:04:15 - mounting 000105
30.07.2016 6:34:59 - Error bptm(pid=7876) error requesting media, TpErrno = No robot daemon or robotics are unavailable
30.07.2016 6:35:04 - Warning bptm(pid=7876) media id 000105 load operation reported an error
30.07.2016 6:35:04 - current media 000105 complete, requesting next resource Any
30.07.2016 7:05:31 - Info bptm(pid=7876) EXITING with status 98 <----------
30.07.2016 7:05:37 - Info bpbkar32(pid=6456) done. status: 98: error requesting media (tpreq)
30.07.2016 7:05:37 - end writing
error requesting media (tpreq)(98)
30.07.2016 7:05:40 - Info bpbrm(pid=7336) Starting delete snapshot processing
30.07.2016 7:05:48 - Info bpfis(pid=484) Backup started
30.07.2016 7:05:48 - Critical bpbrm(pid=7336) from client kzwbackup01.eub.kz: cannot open C:\Program Files\Veritas\NetBackup\online_util\fi_cntl\bpfis.fim.kzwbackup01.eub.kz_1469834399.1.0
30.07.2016 7:05:48 - Info bpfis(pid=484) done. status: 1542
30.07.2016 7:05:48 - end operation
30.07.2016 7:05:48 - Info bpfis(pid=484) done. status: 1542: An existing snapshot is no longer valid and cannot be mounted for subsequent operations

KZWLOT01_Lotus

01.08.2016 2:51:19 - Info nbjm(pid=6096) starting backup job (jobid=498127) for client kzwlot01.eub.kz, policy KZWLOT01_Lotus, schedule Differential-Inc
01.08.2016 2:51:19 - Info nbjm(pid=6096) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=498127, request id:{F694E5B4-54EA-45AF-9415-8BD1EC04BE0A})
01.08.2016 2:51:19 - requesting resource Any
01.08.2016 2:51:19 - requesting resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwlot01.eub.kz
01.08.2016 2:51:19 - requesting resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.KZWLOT01_Lotus
01.08.2016 2:51:19 - awaiting resource Any Reason: Drives are in use, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Lotus, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

01.08.2016 3:03:32 - awaiting resource Any - No drives are available
01.08.2016 3:03:51 - awaiting resource Any Reason: Drives are in use, Media Server: kzwnb01.eub.kz,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Lotus, Storage Unit: kzwnb01-hcart3-robot-tld-0, Drive Scan Host: N/A

01.08.2016 3:19:36 - granted resource kzwnb01.eub.kz.NBU_CLIENT.MAXJOBS.kzwlot01.eub.kz
01.08.2016 3:19:36 - granted resource kzwnb01.eub.kz.NBU_POLICY.MAXJOBS.KZWLOT01_Lotus
01.08.2016 3:19:36 - granted resource 000005
01.08.2016 3:19:36 - granted resource HP.ULTRIUM6-SCSI.000
01.08.2016 3:19:36 - granted resource kzwnb01-hcart3-robot-tld-0
01.08.2016 3:19:36 - estimated 63788092 Kbytes needed
01.08.2016 3:19:36 - Info nbjm(pid=6096) started backup (backupid=kzwlot01.eub.kz_1469999976) job for client kzwlot01.eub.kz, policy KZWLOT01_Lotus, schedule Differential-Inc on storage unit kzwnb01-hcart3-robot-tld-0
01.08.2016 3:19:36 - started process bpbrm (8052)
01.08.2016 3:19:38 - Info bpbrm(pid=8052) kzwlot01.eub.kz is the host to backup data from
01.08.2016 3:19:38 - Info bpbrm(pid=8052) reading file list for client
01.08.2016 3:19:38 - connecting
01.08.2016 3:19:41 - Info bpbrm(pid=8052) starting bpbkar32 on client
01.08.2016 3:19:41 - connected; connect time: 0:00:03
01.08.2016 3:19:43 - Info bpbkar32(pid=3544) Backup started
01.08.2016 3:19:43 - Info bptm(pid=1396) start
01.08.2016 3:19:43 - Info bpbkar32(pid=3544) change time comparison:<disabled>
01.08.2016 3:19:43 - Info bpbkar32(pid=3544) archive bit processing:<enabled>
01.08.2016 3:19:43 - Info bptm(pid=1396) using 65536 data buffer size
01.08.2016 3:19:43 - Info bptm(pid=1396) setting receive network buffer to 263168 bytes
01.08.2016 3:19:43 - Info bptm(pid=1396) using 30 data buffers
01.08.2016 3:19:43 - Info bptm(pid=1396) start backup
01.08.2016 3:19:43 - Info bptm(pid=1396) backup child process is pid 10964.7812
01.08.2016 3:19:43 - Info bptm(pid=1396) Waiting for mount of media id 000005 (copy 1) on server kzwnb01.eub.kz.
01.08.2016 3:19:43 - Info bptm(pid=10964) start
01.08.2016 3:19:43 - mounting 000005
01.08.2016 3:20:24 - Error bptm(pid=1396) error requesting media, TpErrno = No robot daemon or robotics are unavailable
01.08.2016 3:20:24 - Warning bptm(pid=1396) media id 000005 load operation reported an error
01.08.2016 3:20:24 - current media 000005 complete, requesting next resource Any
01.08.2016 3:50:40 - Info bptm(pid=1396) EXITING with status 98 <----------
01.08.2016 3:50:45 - Info bpbkar32(pid=3544) done. status: 98: error requesting media (tpreq)
01.08.2016 3:50:45 - end writing
error requesting media (tpreq)(98)

 

Highlighted

Re: Communication errors

Thanks for the job details.

The error is different to the previous status 98.
Before:
Warning bptm(pid=4040) media id .... load operation reported an error

We now see the following: 

Waiting for mount of media id ....  on server kzwnb01.eub.kz.
error requesting media, TpErrno = No robot daemon or robotics are unavailable

Which media server is robot control host for robot 0?

Can you check and verify that tldcd is running?
And tldd on media server kzwnb01?

Hopefully nobody is trying to use robtest (or forgot to exit the utility) while backups are running?

 

Highlighted

Re: Communication errors

Another possible reason for latest error message:

http://www.veritas.com/docs/000015874

 

Highlighted

Re: Communication errors

Yes, the media server kzwnb01 controls the robot 0 

Tell me how to check the status of tldcd and tldd ?

Highlighted

Re: Communication errors

From cmd, check output of 'bpps' on Windows master/media server.
(Command is in ...netbackup\bin)
There should be one tldcd process (robot control) and one tldd process for each robot-controlled tape drive.

Also check output of 'vmoprcmd -d' 
(in ....volmgr\bin) to ensure that all drives are in TLD control (not AVR).

This post is getting a bit lengthy and I now have to constantly page between previous and current posts...

Not sure if I have asked the following:

Add VERBOSE entry to vm.conf on master/media server
Create debug logs on master/media such as ltid and robots (see  http://www.veritas.com/docs/000109772 ) 
and restart Device Manager service.

Look for errors in Event Viewer System and Application logs and debug logs when status 98 is seen. 

 

Highlighted

Re: Communication errors

Thank you!

Here is a screenshot's of the teams "vmoprcmd -d" and "bpps"

tldd.pngvmoprcmd.png

"VERBOSE" was added earlier.

 

Highlighted

Re: Communication errors

You can see that there is no tldcd process and that tape drives are in AVR (non-robotic) control.

Check Windows Event Viewer System and Application log to see when and why tldcd went down.

Please enable VERBOSE logging and create log folders, then restart NBU Device Management service.
The service restart should hopefully start tldcd again.
Verbose logging should help with future troubleshooting.

PS:
6 tape drives attached to a single master/media server seems a bit much. 
Have you confirmed sufficient physical resources (memory, cpu, hba's, pci slots, bus speed, etc) to ensure that a single server can manage and stream 6 x LTO6 tape drives simultaneously?

Highlighted

Re: Communication errors

VERBOSE included. Service were restarted few times.
Environment is old, and in that configuration works well
The master server is powerful enough (CPU 2xIntel Xeon E5-2620 2.00GHz 24 Core, 32Gb RAM, 250Gb HDD, 10Gbps LAN) OS Windows Server 2008R2
We have a robotic library Quantum Scalar i500 with 6 LTO drives in it
Why robotic library must be non-robotic?
Also we got another backup spot(media server attached to kzwnb01) with name kzwnb03 and the same hardware configuration as described above only it is installed OS Windows Server 2012R2 and no backup fails with error code-98 on it
Here is screenshot from media server kzwnb03 where all tape drives are not in AVR(and works fine).

kzwnb03.png

Highlighted

Re: Communication errors

Service were restarted few times.

So, by the looks of vmoprcmd ouput, drives are back in TLD control.
This means that the service restart has started tldcd (robot control daemon).

It also seems as if 2 tape drives are loaded - meaning backups are running that previously ended in status 98?

Next time you see status 98, please check Windows System and Application logs for possible reasons why tldcd is going down. 
Services restart seems to be a short-term solution.
You need to find out why tldcd is going down as this quite abnormal.
The last time I have seen tldcd going down for no apparent reason was about 12 years ago when that particular NBU version had  a problem with AUTO_UPDATE_ROBOT entry in vm.conf. tldcd went down each time the MAP was opened instead of updating the robot inventory.
Windows System and/or Application logs will tell you if the server has lost connectivity to the robot or if robot is not responding as a result of robot door being opened, etc... etc...

Other helpful logs will be debug logs under volgmr. Hopefully you have created log folders such as 'robots'? 

Why robotic library must be non-robotic? 

Seems you misunderstood what I was trying to say.

The tape drives were in AVR /non-robotic mode because tldcd was not running. 

tldcd not running was the reason for the most recent status 98.