cancel
Showing results for 
Search instead for 
Did you mean: 

Netbackup job hung on 'begin writing' after 92h

moratar
Level 3

I am backuping up 3 hdd from media server.

3 discs are backed up as 3 streams on 3 lto6 drives. 15TB,50TB and 75TB

15TB,50TB got backed up and biggest one hang on message:

4/14/2015 1:37:07 PM - Info bptm(pid=3928) media id 601081 mounted on drive index 0, drivepath {5,0,1,0}, drivename lto61, copy 1
4/14/2015 1:37:08 PM - mounted; mount time: 0:00:54
4/14/2015 1:37:08 PM - positioning 601081 to file 1
4/14/2015 1:37:12 PM - positioned 601081; position time: 0:00:04
4/14/2015 1:37:12 PM - begin writing

It is hanging like this for 25h now, after 48TB got backed up.

There is no error anywhere in logs.

every 10 min there is looped message in BPTM log:

14:15:56.620 [468.3500] <2> initialize: fd values STDOUTSOCK=392 STDERRSOCK=532
14:15:56.635 [468.3500] <2> bptm: instance - -1578722640
14:15:56.635 [468.3500] <2> bptm: INITIATING (VERBOSE = 5): -rptdrv -jobid -1428495942 -jm
14:15:56.635 [468.3500] <2> bptm: Event Global\NetBackup Terminate Event, pid: 468 created.
14:15:56.635 [468.3500] <2> bptm: PORT_STATUS = 0x00000000
14:15:56.635 [468.3500] <2> drivename_open: Called with Create 0, file lto61
14:15:56.635 [468.3500] <2> drivename_checklock: Called
14:15:56.635 [468.3500] <2> drivename_checklock: File is locked
14:15:56.635 [468.3500] <2> report_drives: DRIVE = lto61 LOCK = TRUE CURTIME = 1429100156
14:15:56.635 [468.3500] <2> drivename_close: Called for file lto61
14:15:56.635 [468.3500] <2> report_drives: MODE = 0
14:15:56.635 [468.3500] <2> report_drives: TIME = 1429011374
14:15:56.635 [468.3500] <2> report_drives: MASTER = veritas
14:15:56.635 [468.3500] <2> report_drives: SR_KEY = 1161822238 2000057
14:15:56.635 [468.3500] <2> report_drives: PATH = {5,0,1,0}
14:15:56.635 [468.3500] <2> report_drives: MEDIA = 601081
14:15:56.635 [468.3500] <2> report_drives: REQID = -1428490612
14:15:56.635 [468.3500] <2> report_drives: ALOCID = 13396
14:15:56.635 [468.3500] <2> report_drives: RBID = {B45FF241-A33B-4FB8-BB21-41BEBA872DC7}
14:15:56.635 [468.3500] <2> report_drives: PID = 3928
14:15:56.635 [468.3500] <2> report_drives: FILE = C:\Program Files\Veritas\NetBackup\db\media\tpreq\drive_lto61
14:15:56.635 [468.3500] <2> main: Sending [EXIT STATUS 0] to NBJM
14:15:56.635 [468.3500] <2> bptm: EXITING with status 0 <----------
14:15:56.635 [468.3500] <2> bptm: Global\NetBackup Terminate Event, pid: 468 closed.


Media is allocated properly to correct media server.

Version is 7.6.0.1 on windows server 2012r2 x64.

 

Any help, idea what to do to keep the backup going would be awesom.

21 REPLIES 21

sdo
Moderator
Moderator
Partner    VIP    Certified

Is your client also your media server?

If so, then Windows 2012 R2 is only supported from v7.6.0.3:

http://www.symantec.com/business/support/library/BUSINESS/xCL/TECH76648/nbu_76_scl.html#operating_systems-microsoft_windows_server_2012

 

Marianne
Level 6
Partner    VIP    Accredited Certified

You need bptm log on media server doing the backup.

The log for 4/14/2015 on the media server should contain info for PID [3928].

More logs are needed to troubleshoot hung backups:

bpbrm on the media server (in addition to bptm)

bpbkar on the client.

The Job details should also contain PIDs for bpbrm and bpbkar on the client.

Can you check Task Monitor or bpps output on media server and client to see if these processes are still running?

 

 

moratar
Level 3

Client is also a Media server.

Last usefull info from bpkar is:

02:33:41.130 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 4240 9073 106966803492 -1 30 33216 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-307780267-2766365465-114402338-1001 4096 1428671993 1366970181 1366970181 /\\10.200.200.1/R$/._05_Portal
02:33:41.130 [5396.5076] <4> dos_backup::tfs_scannext: INF - no more path list entries
02:33:41.130 [5396.5076] <4> tar_backup::backup_endarg_state: INF - finish error count: 0
02:33:41.130 [5396.5076] <4> tar_backup::backup_endarg_state: INF - finish fatal count: 0
02:33:41.130 [5396.5076] <4> tar_backup::backup_add_last_entry_state: INF - catalog message: Dfi - 0 0 106966803507 -1 23 292 user other 0 1366970181 1366970181 1366970181 /VRTS_IMAGE_SIZE_RECORD
02:33:41.130 [5396.5076] <2> tar_base::backup_finish: TAR - backup:                     9073 files
02:33:41.130 [5396.5076] <2> tar_base::backup_finish: TAR - backup:          file data:  765234951 bytes  51005 gigabytes
02:33:41.130 [5396.5076] <2> tar_base::backup_finish: TAR - backup:         image data:  801664000 bytes  51005 gigabytes
02:33:41.130 [5396.5076] <2> tar_base::backup_finish: TAR - backup:       elapsed time:     381146 secs    143690353 bps
02:33:41.130 [5396.5076] <2> tar_base::V_vTarMsgW: INF - Total Size:765234951 + 51005GB
02:33:41.130 [5396.5076] <4> ov_log::OVLoop: Timestamp
02:33:41.130 [5396.5076] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
02:33:41.130 [5396.5076] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
02:33:41.130 [5396.5076] <2> tar_base::V_vTarMsgW: INF - Client completed sending data for backup
02:33:41.130 [5396.5076] <4> tar_base::stopKeepaliveThread: INF - waiting for keepalive thread to exit ...
02:33:41.130 [5396.3304] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
02:33:41.130 [5396.5076] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
02:33:41.130 [5396.5076] <4> send_msg_to_monitor: INF - in send_msg_to_monitor()...
02:33:41.130 [5396.5076] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 0: the requested operation was successfully completed
02:33:41.130 [5396.5076] <2> dtcp_shutdown: TCP - success: shutdown socket (652)
02:33:41.130 [5396.5076] <2> dtcp_shutdown: TCP - success: shutdown socket (392)
02:33:50.271 [5396.5076] <4> tar_backup::readServerMessage: INF - 'INF - Server status = 0' received
02:33:50.271 [5396.5076] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
02:33:50.271 [5396.5076] <4> OVStopCmd: INF - EXIT - status = 0
02:33:50.271 [5396.5076] <4> OVStopCmd: INF - Shutdown stderr connection
02:33:50.271 [5396.5076] <2> dtcp_shutdown: TCP - success: shutdown socket (504)
02:33:50.271 [5396.5076] <4> OVStopCmd: INF - Shutdown stdout connection
02:33:50.271 [5396.5076] <2> dtcp_shutdown: TCP - success: shutdown socket (392)
02:33:50.271 [5396.5076] <4> OVStopCmd: INF - Shutdown stdin connection
02:33:50.271 [5396.5076] <2> dtcp_shutdown: TCP - success: shutdown socket (652)
02:33:50.271 [5396.5076] <4> OVStopCmd: INF - Shutdown wait started
02:33:50.271 [5396.5076] <2> tar_base::V_Close: closing...
02:33:50.271 [5396.5076] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
02:33:50.271 [5396.5076] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000101
02:33:50.271 [5396.5076] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): ubs specifics: 0x001d0000
02:33:51.287 [5396.5076] <16> dtcp_read: TCP - failure: recv socket (652) (TCP 10058: Can't send after socket shutdown)
02:33:51.287 [5396.5076] <16> dtcp_read: TCP - failure: recv socket (504) (TCP 10058: Can't send after socket shutdown)
02:33:51.287 [5396.5076] <4> OVShutdown: INF - Closing stderr connection
02:33:51.287 [5396.5076] <2> dtcp_close: TCP - success: close socket (504)
02:33:51.287 [5396.5076] <4> OVShutdown: INF - Closing stdout connection
02:33:51.287 [5396.5076] <2> dtcp_close: TCP - success: close socket (392)
02:33:51.287 [5396.5076] <4> OVShutdown: INF - Closing stdin connection
02:33:51.287 [5396.5076] <2> dtcp_close: TCP - success: close socket (652)
02:33:51.287 [5396.5076] <4> OVShutdown: INF - Finished process
02:33:51.287 [5396.5076] <4> WinMain: INF - Exiting C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
02:34:04.615 [4288.6136] <4> tar_base::keepaliveThread: INF - sending keepalive
02:35:04.618 [4288.6136] <4> tar_base::keepaliveThread: INF - sending keepalive

 

since 12h than there is only keepalive info

 

 

Last info about bpkar pid 4288 was:

14:46:03.633 [4288.6136] <4> tar_base::keepaliveThread: INF - sending keepalive
14:46:18.602 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Lff - 820521673 10 0 10 7348 93399732057 -1 120 33216 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-2890356171-3120593025-2523854914-1001 820521509 1428795203 1389275751 1404200807 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_09_DariuszMichalski/eksport/20140109_DariuszMichalski.mov
14:46:18.602 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Dir - 144 7349 93422306167 -1 80 16832 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-2890356171-3120593025-2523854914-1001 0 1428795202 1402300816 1404200623 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/
14:46:18.618 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 6312 7350 93422306174 -1 89 33216 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-2890356171-3120593025-2523854914-1001 6148 1428795203 1402300821 1404200883 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/.DS_Store
14:46:18.618 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Dir - 144 7351 93422306193 -1 88 16832 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-2890356171-3120593025-2523854914-1001 0 1428795202 1388673081 1404200883 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/projekt/
14:46:18.649 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 31340 7352 93422306200 -1 112 33216 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-2890356171-3120593025-2523854914-1001 31176 1428795203 1388673081 1404200883 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/projekt/2014_01_02_NatSikora.fcp
14:46:18.665 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 4256 7353 93422306270 -1 114 33216 root;S-1-5-21-307780267-2766365465-114402338-2002 root;S-1-5-21-307780267-2766365465-114402338-1001 4096 1428795203 1404200883 1404200883 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/projekt/._2014_01_02_NatSikora.fcp
14:46:18.665 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Dir - 144 7354 93422306289 -1 88 16832 root;SYSTEM@NT:AUTHORITY root;S-1-5-21-2890356171-3120593025-2523854914-1001 0 1428795202 1388666943 1404200883 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/eksport/
14:46:18.696 [5396.5076] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 4256 7355 93422306296 -1 114 33216 root;S-1-5-21-307780267-2766365465-114402338-2002 root;S-1-5-21-307780267-2766365465-114402338-1001 4096 1428795203 1404200909 1404200909 /\\10.200.200.1/R$/01_Jedynka/muzyczna_jedynka/2014_01/2014_01_02_NataliaSikora/eksport/._2014_01_02_NatSikora.mov
14:46:49.962 [4288.1564] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Lff - 330833330 26 0 26 66085 91205724956 -1 101 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 330833198 1428052695 1390826887 1390822744 /\\10.200.200.1/S$/Produkcja/Projekty/Fonografia/2014_01_24_DEMONO_symfonicznie/source/locha_Kuks.mov
14:46:49.993 [4288.1564] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 15496 66086 91260897075 -1 96 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 15364 1426168075 1396940474 1390977413 /\\10.200.200.1/S$/Produkcja/Projekty/Fonografia/2014_01_24_DEMONO_symfonicznie/source/.DS_Store


no error there.

 

 

 

In bptm last info from job 3928 was:

14:46:20.743 [3928.1888] <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -6
14:46:20.743 [3928.1888] <2> io_ioctl: command (0)MTWEOF 1 0x81 from (bptm.c.20348) on drive index 0
14:46:20.774 [3928.1888] <2> write_backup: maximum fragment size written copy 1 --- Fragmenting
14:46:20.774 [3928.1888] <2> ConnectionCache::connectAndCache: Acquiring new connection for host veritas, query type 1
14:46:20.790 [3928.1888] <2> vnet_pbxConnect: pbxConnectEx Succeeded
14:46:20.790 [3928.1888] <2> logconnections: BPDBM CONNECT FROM 10.242.10.98.58929 TO 10.242.10.84.1556 fd = 1780
14:46:20.852 [3928.1888] <2> db_end: Need to collect reply
14:46:20.884 [3928.1888] <4> report_throughput: VBRT 1 3928 5 1 lto61 601081 0 1 0 50000000  50000000 (bptm.c.20836)
14:46:20.884 [3928.1888] <2> ConnectionCache::connectAndCache: Acquiring new connection for host veritas, query type 77
14:46:20.899 [3928.1888] <2> vnet_pbxConnect: pbxConnectEx Succeeded
14:46:20.899 [3928.1888] <2> logconnections: BPDBM CONNECT FROM 10.242.10.98.58930 TO 10.242.10.84.1556 fd = 1780
14:46:20.962 [3928.1888] <2> db_end: Need to collect reply
14:46:21.071 [3928.1888] <2> ConnectionCache::connectAndCache: Acquiring new connection for host veritas, query type 1
14:46:21.087 [3928.1888] <2> vnet_pbxConnect: pbxConnectEx Succeeded
14:46:21.087 [3928.1888] <2> logconnections: BPDBM CONNECT FROM 10.242.10.98.58931 TO 10.242.10.84.1556 fd = 1780
14:46:21.149 [3928.1888] <2> db_end: Need to collect reply
14:46:21.165 [3928.1888] <4> write_backup: successfully wrote backup id backuptv.prsa.pl_1428676854, copy 1, fragment 921, 50000000 Kbytes at 157340.567 Kbytes/sec
14:46:21.165 [3928.1888] <2> ConnectionCache::connectAndCache: Acquiring new connection for host veritas, query type 1
14:46:21.180 [3928.1888] <2> vnet_pbxConnect: pbxConnectEx Succeeded
14:46:21.180 [3928.1888] <2> logconnections: BPDBM CONNECT FROM 10.242.10.98.58932 TO 10.242.10.84.1556 fd = 1780
14:46:21.243 [3928.1888] <2> db_end: Need to collect reply
14:46:21.274 [3928.1888] <4> write_backup: begin writing backup id backuptv.prsa.pl_1428676854, copy 1, fragment 922, to media id 601081 on drive lto61 (index 0)
14:46:21.274 [3928.1888] <2> write_data: twin_index: 0 active: 1 dont_process: 0 wrote_backup_hdr: 0 finished_buff: 0 saved_cindex: -1 twin_is_disk 0 delay_brm: 0
14:46:21.274 [3928.1888] <2> write_data: Total Kbytes transferred 45625671808
14:46:21.274 [3928.1888] <2> write_data: absolute block position prior to writing backup header(s) is 5078153, copy 1
14:46:21.274 [3928.1888] <2> write_data: block position check: actual 5078153, expected 5078153
14:46:21.274 [3928.1888] <2> io_write_back_header: drive index 0, backuptv.prsa.pl_1428676854, file num = 14, mpx_headers = 0, copy 1
14:46:21.274 [3928.1888] <2> write_data: completed writing backup header, start writing data when first buffer is available, copy 1
14:46:21.274 [3928.1888] <2> write_data: first write, twin_index: 0 cindex: 0 dont_process: 1 wrote_backup_hdr: 1 finished_buff: 0
14:46:21.274 [3928.1888] <2> write_data: received first buffer (131072 bytes), begin writing data

 

Since then there is only loop from 1st post.

While this job was hanging at the same time other jobs ( was working correctly)

sdo
Moderator
Moderator
Partner    VIP    Certified

Can anyone confirm...

...if a Windows 2012 R2 media server has been accidentally installed using the v7.6.0.1 Windows 'Server' kit, then:

a) Is it safe to simply patch/upgrade to v7.6.0.3?

...or:

b) Must one un-install the v7.6.0.1 base kit (which thus should not have been installed on Windows 2012 R2) - and then re-install using the 'special' re-issued v7.6.0.3 combo base+patch 'Server' installer kit which was released specifically for Windows 2012 R2?

 

moratar
Level 3

Client is also a Media Server.

Logs are from Media server where the backup is runing.

Processes are still up.

 

Marianne
Level 6
Partner    VIP    Accredited Certified

Do you have bpbrm log on the media server?

This entry shows catalog update sent to bpbrm:

14:46:49.993 [4288.1564] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: ....

What was the date of this log file? How long ago is this entry?

Have you checked bpps output on the media server to see if processes are still running?

Do you have Checkpoints enabled in the policy? 
If so, try to suspend the backup and resume it a couple of minutes later.

If you dont't have Checkpoints in the policy, all that is left is to kill the backup...

 

 

moratar
Level 3

After update to 7.6.1 on master server and media server, issue was not solved and there is the same behavior.

moratar
Level 3

This log entry was like 15h before i noticed problem.

Suspending backup and resuming seems to work - backup keeps on going. That doesn't seem like good solution. There is like 150TB od data, and first stop is after 90-94h (45-50TB).

 

It would be nice to have backup done on their own rather than manualy suspending/resuming.

Suspending:

2015-04-27 20:22:24 - Info bptm(pid=3048) media id 601199 mounted on drive index 2, drivepath {5,0,2,0}, drivename lto63, copy 1
2015-04-27 20:22:25 - mounted; mount time: 0:00:57
2015-04-27 20:22:25 - positioning 601199 to file 1
2015-04-27 20:22:29 - positioned 601199; position time: 0:00:04
2015-04-27 20:22:29 - begin writing
2015-04-28 12:09:46 - Info bpbkar32(pid=1812) done. status: 157: suspend requested by administrator      
2015-04-28 12:09:46 - end writing; write time: 15:47:17
suspend requested by administrator (157)
2015-04-28 12:09:46 - Info bpbkar32(pid=1812) done. status: 157: suspend requested by administrator      
2015-04-28 12:09:46 - end writing


Resuming:

2015-04-28 12:10:47 - Info nbjm(pid=2264) starting backup job (jobid=17585) for client backuptv.prsa.pl, policy SAN_TV_test, schedule full  
2015-04-28 12:10:47 - Info nbjm(pid=2264) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=17585, request id:{E8CAD212-D1E5-45B2-A98E-9F5B73C2D192})  
2015-04-28 12:10:47 - requesting resource bck_TV
2015-04-28 12:10:47 - requesting resource veritas.NBU_CLIENT.MAXJOBS.backuptv.prsa.pl
2015-04-28 12:10:47 - requesting resource veritas.NBU_POLICY.MAXJOBS.SAN_TV_test
2015-04-28 12:10:49 - granted resource veritas.NBU_CLIENT.MAXJOBS.backuptv.prsa.pl
2015-04-28 12:10:49 - granted resource veritas.NBU_POLICY.MAXJOBS.SAN_TV_test
2015-04-28 12:10:49 - granted resource 601051
2015-04-28 12:10:49 - granted resource lto61
2015-04-28 12:10:49 - granted resource bck_TV
2015-04-28 12:10:49 - estimated 0 Kbytes needed
2015-04-28 12:10:49 - Info nbjm(pid=2264) resumed backup (backupid=backuptv.prsa.pl_1429878309) job for client backuptv.prsa.pl, policy SAN_TV_test, schedule full on storage unit bck_TV
2015-04-28 12:10:50 - started process bpbrm (3076)
2015-04-28 12:10:51 - Info bpbrm(pid=3076) backuptv.prsa.pl is the host to backup data from     
2015-04-28 12:10:51 - Info bpbrm(pid=3076) reading file list for client        
2015-04-28 12:10:51 - connecting
2015-04-28 12:10:52 - Info bpbrm(pid=3076) starting bpbkar32 on client         
2015-04-28 12:10:52 - Info bpbkar32(pid=3796) Backup started           
2015-04-28 12:10:52 - Info bpbkar32(pid=3796) change time comparison:<disabled>          
2015-04-28 12:10:52 - Info bpbkar32(pid=3796) archive bit processing:<enabled>          
2015-04-28 12:10:52 - connected; connect time: 0:00:01
2015-04-28 12:10:53 - Info bptm(pid=140) start            
2015-04-28 12:11:02 - Info bptm(pid=140) using 131072 data buffer size        
2015-04-28 12:11:02 - Info bptm(pid=140) setting receive network buffer to 263168 bytes      
2015-04-28 12:11:02 - Info bptm(pid=140) using 128 data buffers         
2015-04-28 12:11:02 - Info bptm(pid=140) start backup           
2015-04-28 12:11:02 - Info bptm(pid=140) Waiting for mount of media id 601051 (copy 1) on server backuptv.prsa.pl.
2015-04-28 12:11:02 - mounting 601051
2015-04-28 12:11:25 - Info bptm(pid=140) media id 601051 mounted on drive index 0, drivepath {5,0,1,0}, drivename lto61, copy 1
2015-04-28 12:11:26 - mounted; mount time: 0:00:24
2015-04-28 12:11:26 - positioning 601051 to file 56
2015-04-28 12:12:23 - positioned 601051; position time: 0:00:57
2015-04-28 12:12:23 - begin writing
2015-04-28 12:13:05 - current media 601051 complete, requesting next resource Any
2015-04-28 12:13:06 - Info bptm(pid=140) Waiting for mount of media id 601197 (copy 1) on server backuptv.prsa.pl.
2015-04-28 12:13:06 - granted resource 601197
2015-04-28 12:13:06 - granted resource lto62
2015-04-28 12:13:06 - granted resource bck_TV
2015-04-28 12:13:06 - end writing; write time: 0:00:43
2015-04-28 12:13:06 - mounting 601197
2015-04-28 12:13:47 - Info bptm(pid=140) media id 601197 mounted on drive index 1, drivepath {5,0,0,0}, drivename lto62, copy 1
2015-04-28 12:13:47 - mounted; mount time: 0:00:41
2015-04-28 12:13:47 - positioning 601197 to file 1
2015-04-28 12:13:51 - positioned 601197; position time: 0:00:04
2015-04-28 12:13:51 - begin writing

 

moratar
Level 3

The info on media server was like 14h before i noticed problem.

There are no other entries in logs, like if nothing was happening.

 

Suspending/resuming seems to work, but it dosn't look like good solution.

Would be nice to have backups runing fully normaly without need to manualy suspend/resume.

Supending:

2015-04-27 20:21:28 - end writing; write time: 4:20:47
2015-04-27 20:21:28 - mounting 601199
2015-04-27 20:22:24 - Info bptm(pid=3048) media id 601199 mounted on drive index 2, drivepath {5,0,2,0}, drivename lto63, copy 1
2015-04-27 20:22:25 - mounted; mount time: 0:00:57
2015-04-27 20:22:25 - positioning 601199 to file 1
2015-04-27 20:22:29 - positioned 601199; position time: 0:00:04
2015-04-27 20:22:29 - begin writing
2015-04-28 12:09:46 - Info bpbkar32(pid=1812) done. status: 157: suspend requested by administrator      
2015-04-28 12:09:46 - end writing; write time: 15:47:17
suspend requested by administrator (157)
2015-04-28 12:09:46 - Info bpbkar32(pid=1812) done. status: 157: suspend requested by administrator      
2015-04-28 12:09:46 - end writing
suspend requested by administrator (157)

Resuming

2015-04-28 12:10:47 - Info nbjm(pid=2264) starting backup job (jobid=17585) for client backuptv.prsa.pl, policy SAN_TV_test, schedule full  
2015-04-28 12:10:47 - Info nbjm(pid=2264) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=17585, request id:{E8CAD212-D1E5-45B2-A98E-9F5B73C2D192})  
2015-04-28 12:10:47 - requesting resource bck_TV
2015-04-28 12:10:47 - requesting resource veritas.NBU_CLIENT.MAXJOBS.backuptv.prsa.pl
2015-04-28 12:10:47 - requesting resource veritas.NBU_POLICY.MAXJOBS.SAN_TV_test
2015-04-28 12:10:49 - granted resource veritas.NBU_CLIENT.MAXJOBS.backuptv.prsa.pl
2015-04-28 12:10:49 - granted resource veritas.NBU_POLICY.MAXJOBS.SAN_TV_test
2015-04-28 12:10:49 - granted resource 601051
2015-04-28 12:10:49 - granted resource lto61
2015-04-28 12:10:49 - granted resource bck_TV
2015-04-28 12:10:49 - estimated 0 Kbytes needed
2015-04-28 12:10:49 - Info nbjm(pid=2264) resumed backup (backupid=backuptv.prsa.pl_1429878309) job for client backuptv.prsa.pl, policy SAN_TV_test, schedule full on storage unit bck_TV
2015-04-28 12:10:50 - started process bpbrm (3076)
2015-04-28 12:10:51 - Info bpbrm(pid=3076) backuptv.prsa.pl is the host to backup data from     
2015-04-28 12:10:51 - Info bpbrm(pid=3076) reading file list for client        
2015-04-28 12:10:51 - connecting
2015-04-28 12:10:52 - Info bpbrm(pid=3076) starting bpbkar32 on client         
2015-04-28 12:10:52 - Info bpbkar32(pid=3796) Backup started           
2015-04-28 12:10:52 - Info bpbkar32(pid=3796) change time comparison:<disabled>          
2015-04-28 12:10:52 - Info bpbkar32(pid=3796) archive bit processing:<enabled>          
2015-04-28 12:10:52 - connected; connect time: 0:00:01
2015-04-28 12:10:53 - Info bptm(pid=140) start            
2015-04-28 12:11:02 - Info bptm(pid=140) using 131072 data buffer size        
2015-04-28 12:11:02 - Info bptm(pid=140) setting receive network buffer to 263168 bytes      
2015-04-28 12:11:02 - Info bptm(pid=140) using 128 data buffers         
2015-04-28 12:11:02 - Info bptm(pid=140) start backup           
2015-04-28 12:11:02 - Info bptm(pid=140) Waiting for mount of media id 601051 (copy 1) on server backuptv.prsa.pl.
2015-04-28 12:11:02 - mounting 601051
2015-04-28 12:11:25 - Info bptm(pid=140) media id 601051 mounted on drive index 0, drivepath {5,0,1,0}, drivename lto61, copy 1
2015-04-28 12:11:26 - mounted; mount time: 0:00:24
2015-04-28 12:11:26 - positioning 601051 to file 56
2015-04-28 12:12:23 - positioned 601051; position time: 0:00:57
2015-04-28 12:12:23 - begin writing
2015-04-28 12:13:05 - current media 601051 complete, requesting next resource Any
2015-04-28 12:13:06 - Info bptm(pid=140) Waiting for mount of media id 601197 (copy 1) on server backuptv.prsa.pl.
2015-04-28 12:13:06 - granted resource 601197
2015-04-28 12:13:06 - granted resource lto62
2015-04-28 12:13:06 - granted resource bck_TV
2015-04-28 12:13:06 - end writing; write time: 0:00:43
2015-04-28 12:13:06 - mounting 601197
2015-04-28 12:13:47 - Info bptm(pid=140) media id 601197 mounted on drive index 1, drivepath {5,0,0,0}, drivename lto62, copy 1
2015-04-28 12:13:47 - mounted; mount time: 0:00:41
2015-04-28 12:13:47 - positioning 601197 to file 1
2015-04-28 12:13:51 - positioned 601197; position time: 0:00:04
2015-04-28 12:13:51 - begin writing

moratar
Level 3

bpkar32.exe is still running. It might be some issue with it, since it dosn't stop,

After suspedning and resumming backup old bpkar32.exe is still up running:

C:\Program Files\Veritas\NetBackup\bin>bpps.exe
* BACKUPTV                                               4/28/15 12:21:49.626
COMMAND           PID      LOAD             TIME   MEM                  START
mtstrmd          1200    0.000%            0.000  6.1M   4/24/15 14:07:46.678
nbcssc           1572    0.000%            0.281   18M   4/24/15 14:07:47.850
nbdisco          1860    0.000%            5.750   25M   4/24/15 14:07:50.116
vnetd            1912    0.000%            0.593  6.6M   4/24/15 14:07:52.850
nbrmms           2052    0.000%            4.265   29M   4/24/15 14:07:53.022
nbsl             2108    0.000%         3:25.250   36M   4/24/15 14:07:54.428
nbsvcmon         2204    0.000%            0.343   14M   4/24/15 14:07:55.460
bpinetd          2280    0.000%            1.843  8.4M   4/24/15 14:07:55.741
bpcd             2604    0.000%            0.796  9.9M   4/24/15 14:07:56.772
ltid             2716    0.000%            9.109   18M   4/24/15 14:07:57.022
vmd              2772    0.000%            8.906   16M   4/24/15 14:07:58.897
tldd             3108    0.000%            0.062   14M   4/24/15 14:08:00.616
avrd             3144    0.000%            0.062   13M   4/24/15 14:08:00.694
NBConsole        1936    3.077%        23:35.171   66M   4/24/15 14:22:58.518
bpbkar32         1812    0.000%      1:44:22.828   61M   4/24/15 14:25:12.562
bpbrm            3076    0.000%            0.093   16M   4/28/15 12:10:50.152
bpbkar32         3796    0.000%           15.843   57M   4/28/15 12:10:52.105
bptm              140    0.000%            4.031   37M   4/28/15 12:10:52.965
bpps             3244    0.000%            0.062  6.3M   4/28/15 12:21:48.548

 

 

Also there is new process for bpkar32.exe which started after resuming.

Weird that old process was not ended.

sdo
Moderator
Moderator
Partner    VIP    Certified

What version of NetBackup is the client running?  Wondering whether it's worth upgrading the client too?

To be able to suspend/resume, you must have 'checkpoint' enabled on the backup policiy.  Just wondering whether you could test same backup but without 'checkpoints'?  If you do decide to test this, make sure that there are no left over hung/orphaned bpbkar/bpbrm processes on the client before starting the test backup.

Marianne
Level 6
Partner    VIP    Accredited Certified

Please double-check and confirm NBU version on the client.

If 7.5.0.x, this TN might be applicable:

http://www.symantec.com/docs/TECH194918 

Logs needed:
On media server: bptm and bpbrm
On client: bpbkar

moratar
Level 3

On what client?

Media server is backing up it's local HDD.

Media server is 7.6.1, client on that media server is also 7.6.1

Logs on media server, says nothing in bptm, and bpbrm. 

bprm log just stops, bptm loops with something like this:

06:21:12.482 [3704.3392] <2> initialize: fd values STDOUTSOCK=388 STDERRSOCK=512
06:21:12.482 [3704.3392] <2> bptm: instance - 317633531
06:21:12.498 [3704.3392] <2> bptm: INITIATING (VERBOSE = 5): -rptdrv -jobid -1429876768 -jm 
06:21:12.498 [3704.3392] <2> bptm: Event Global\NetBackup Terminate Event, pid: 3704 created.
06:21:12.498 [3704.3392] <2> bptm: PORT_STATUS = 0x00000000
06:21:12.498 [3704.3392] <2> drivename_open: Called with Create 0, file lto63
06:21:12.498 [3704.3392] <2> drivename_checklock: Called
06:21:12.498 [3704.3392] <2> drivename_checklock: File is locked
06:21:12.498 [3704.3392] <2> report_drives: DRIVE = lto63 LOCK = TRUE CURTIME = 1430194872
06:21:12.498 [3704.3392] <2> drivename_close: Called for file lto63
06:21:12.498 [3704.3392] <2> report_drives: MODE = 0
06:21:12.498 [3704.3392] <2> report_drives: TIME = 1430158887
06:21:12.498 [3704.3392] <2> report_drives: MASTER = veritas
06:21:12.498 [3704.3392] <2> report_drives: SR_KEY = 0 1
06:21:12.498 [3704.3392] <2> report_drives: PATH = {5,0,2,0}
06:21:12.498 [3704.3392] <2> report_drives: MEDIA = 601199
06:21:12.498 [3704.3392] <2> report_drives: REQID = -1429872738
06:21:12.498 [3704.3392] <2> report_drives: ALOCID = 14081
06:21:12.498 [3704.3392] <2> report_drives: RBID = {B559CA20-B42A-4E87-8395-168D37784865}
06:21:12.498 [3704.3392] <2> report_drives: PID = 3048
06:21:12.498 [3704.3392] <2> report_drives: FILE = C:\Program Files\Veritas\NetBackup\db\media\tpreq\drive_lto63
06:21:12.498 [3704.3392] <2> main: Sending [EXIT STATUS 0] to NBJM
06:21:12.498 [3704.3392] <2> bptm: EXITING with status 0 <----------
06:21:12.498 [3704.3392] <2> bptm: Global\NetBackup Terminate Event, pid: 3704 closed.

 

Marianne
Level 6
Partner    VIP    Accredited Certified

The client is the one where the backup is hanging:  backuptv.prsa.pl .
In your case, all logs from the same SAN media server.

You need the logs of the backup start-date.

moratar
Level 3

At first i had no checkpoints, and couldn't suspend/resume

Than upgraded master server and media server  to 7.6.1, and turned checkpoints on.

When I upgraded Enterprise Clinet (media server) all files were updated to 7.6.1 there are no files from older verisions 7.6.0.x.

Media server's local hdd is backuped os there is no "other" client just media server (enterprise client) and master server.

Now i manualy do suspend/resume.

Test backups do run normaly.

Like i said before, i can buckup 2 local HDDs out of 3 locla HDDs i have in same backup policy.

2 hdds were backed up properly.

1st hdd - backup was performed correctly and it took 30h.

2nd hdd - backup was performed correctly and it took 70h.

3rd hdd - backup did hang after 92h ( after 92h backup hanged - no progress in logs)

3rd hdd had no errors in logs.

After suspending and restarting it seems the backup of 3rd hdd keeps on going forward.

moratar
Level 3

Those were logs from backuptv.prsa.pl

Marianne
Level 6
Partner    VIP    Accredited Certified

There are no logs for the backup start date (4/14/2015) in logs.zip.

Nothing we can do without these logs.

And no bpbkar log.

All I can suggest as this point in time is to kill all NBU processes in Windows Task Manager and start again.
Create bpbkar log folder before starting another backup.

You may want to break down the Backup Selection into smaller 'chucks' for this volume if at all possible, e.g:

NEW_STREAM
X:\folder1
X:\folder2
X:\folder3
NEW_STREAM
X:\folder4
X:\folder5
X:\folder6
NEW_STREAM
X:\folder7
X:\folder8
X:\folder9
NEW_STREAM
X:\folder...
X:\folder...
X:\folder...

Enable Multiple Data Streams in Policy Attributes.

The above example will generate 4 jobs.
If any of these jobs hang, it will be easier to follow the particular bpbkar process (PID) in the log to see where the job is getting hung.

moratar
Level 3

Bpkar logs:

14:47:55.760 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\Library\Receipts\com.jibapps.TrashMe.bom
14:47:55.760 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\Library\Receipts\com.jibapps.TrashMe.bom)
14:47:55.760 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\Library\Receipts\com.jibapps.TrashMe.bom'
14:47:55.791 [3796.724] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 72502 24941 2779216130 -1 59 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 72370 1372324866 1363691423 1363691423 /\\10.200.200.1/S$/Library/Receipts/com.jibapps.TrashMe.bom
14:47:55.791 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\Library\.DS_Store
14:47:55.791 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\Library\.DS_Store)
14:47:55.791 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\Library\.DS_Store'
14:47:55.807 [3796.724] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 6280 24942 2779216278 -1 36 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 6148 1429268187 1416922595 1416922006 /\\10.200.200.1/S$/Library/.DS_Store
14:47:55.807 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\ingest8
14:47:55.807 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\ingest8\)
14:47:55.807 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\ingest8'
14:47:55.807 [3796.724] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Dir - 112 24943 2779216297 -1 27 16832 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 0 1430204212 1430123206 1360323462 /\\10.200.200.1/S$/ingest8/
14:47:55.822 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\ingest8\BMZ 2.mov
14:47:55.822 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\ingest8\BMZ 2.mov)
14:47:55.822 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\ingest8\BMZ 2.mov'
14:48:23.150 [3796.724] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Lff - 1071819396 4 0 4 24944 2779216304 -1 36 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 1071819264 1430123199 1429894679 1429894370 /\\10.200.200.1/S$/ingest8/BMZ 2.mov
14:48:23.150 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\ingest8\BMZ 1.mov
14:48:23.150 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\ingest8\BMZ 1.mov)
14:48:23.150 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\ingest8\BMZ 1.mov'
14:48:24.229 [3796.724] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 228940420 24945 2789698316 -1 36 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 228940288 1430123200 1429894369 1429894356 /\\10.200.200.1/S$/ingest8/BMZ 1.mov
14:48:24.244 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\ingest8\.DS_Store
14:48:24.244 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\ingest8\.DS_Store)
14:48:24.244 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\ingest8\.DS_Store'
14:48:24.244 [3796.724] <4> tar_backup_tfi::backup_finishfile_state: INF - catalog message: Fil - 6280 24946 2790145472 -1 36 33216 root;S-1-5-21-2890356171-3120593025-2523854914-2002 root;Users@BUILTIN 6148 1430125434 1430123211 1363272443 /\\10.200.200.1/S$/ingest8/.DS_Store
14:48:24.260 [3796.724] <2> tar_backup_tfi::backup_startfile_state: TAR - Backup: \\10.200.200.1\S$\ingest8\LPP.mov
14:48:24.260 [3796.724] <4> folder_access::V_OpenForRead: INF - non-DFSR file name (\\?\UNC\10.200.200.1\S$\ingest8\LPP.mov)
14:48:24.260 [3796.724] <4> dos_backup::tfs_readopen: INF - NT Security information obtained for: '\\10.200.200.1\S$\ingest8\LPP.mov'

14:48:53.041 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:48:53.041 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:48:53.041 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:49:53.042 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:49:53.042 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:49:53.042 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:50:53.042 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:50:53.042 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:50:53.042 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:51:53.043 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:51:53.043 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:51:53.043 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:52:53.043 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:52:53.043 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:52:53.043 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:53:53.044 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:53:53.044 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:53:53.044 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:54:53.044 [3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
14:54:53.044 [3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received
14:54:53.044 [3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
14:55:19.576 [1812.3696] <2> tar_backup_cpr::wakeupThread: INF - setting checkpoint wakeup event

and 30MB of loop:

[3796.388] <4> tar_base::keepaliveThread: INF - sending keepalive
[3796.388] <4> bpio::read_string: INF - read non-blocking message of length 1
[3796.388] <4> tar_backup::readServerMessage: INF - keepalive message received

moratar
Level 3

Can't split this backup, content is dynamic, and there are no fixed folders.

But something changed, after suspend/restore backup now hangs after 1.5TB, and no more data is flowing from bpkar32.