Duplication stalling

rilwan_dawodu · ‎08-22-2005

Hi all,

Can anyone help me out?

We have a duplication script that we run to make duplicates.
The problem is that after some time the duplication jobs stop and will not continue.

Many times I have to restart the NetBackup services on the media servers and the master server, but that only solves it temporarily, because when I run the script again it stalls.

My master server is a Unix box running Solaris 8 and my media servers run Windows Server 2003.

What do I do now?

Please help.

regards,
ril

Jeffrey_Redingt · ‎08-22-2005

Is there any error message when it stalls? Do you have any logging enabled to find out what is going on when it stalls?

rilwan_dawodu · ‎08-22-2005

Hi,

No error message came up. It just stops at mounting.

How do I enable logging?

regards,

Jeffrey_Redingt · ‎08-22-2005

Simply create a folder called bpsched under \NetBackup\logs on the master server (on a Unix master the logs directory is /usr/openv/netbackup/logs). You may also create a bptm folder in the same location. This should give you an indication of what is going on. Or, if the media server doing the duplication is not the master, create the bptm log folder in that directory on the media server.
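
As a rough illustration of the step described above, here is a minimal Python sketch that creates the per-process log folders. The folder names bpsched and bptm come from the post; the function name and the logs directory you pass in are assumptions, so adjust the path to your own install.

```python
import os

def enable_legacy_logs(logs_dir, processes=("bpsched", "bptm")):
    """Create one sub-folder per process under the given logs directory.

    Returns the list of folders that were actually created; folders that
    already exist are left untouched.
    """
    created = []
    for name in processes:
        path = os.path.join(logs_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

On a Unix master this might be called as enable_legacy_logs("/usr/openv/netbackup/logs"); on a Windows media server, pass the NetBackup\logs folder of that install.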

MayurS · ‎08-22-2005

In addition, to enable detailed logging,
change the logging level in bp.conf to 5:
edit bp.conf and set
VERBOSE=5
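
A hedged sketch of that edit in Python, for anyone who wants to script it. It assumes bp.conf is a plain text file with one "NAME = value" entry per line (as on a Unix server); the function name set_verbose is made up for this example, and the sketch works on the file's text rather than touching the real file.

```python
import re

def set_verbose(conf_text, level=5):
    """Return bp.conf text containing exactly one 'VERBOSE = <level>' entry."""
    pattern = re.compile(r"^\s*VERBOSE\s*=")
    # Drop any existing VERBOSE entry, keep every other line as it is.
    lines = [ln for ln in conf_text.splitlines() if not pattern.match(ln)]
    lines.append("VERBOSE = %d" % level)
    return "\n".join(lines) + "\n"

# Illustrative input only; SERVER = master1 is a placeholder entry.
print(set_verbose("SERVER = master1\nVERBOSE = 0\n"))
```

Keep a copy of the original file before writing the result back, and remember to lower the level again afterwards, since level 5 logs grow quickly.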

Yasuhisa_Ishika · ‎08-22-2005

Show us the output of the vmoprcmd command on the server where the duplication runs.
vmoprcmd must be run while the stall is in progress.

rilwan_dawodu · ‎08-23-2005

Thanks for all your responses. Logging was already in place, and below are some of the contents of the logs.

The bpsched log says something about the storage unit being down. What could be the cause of this? It happens very often. What can be done to rectify it? I know the last resort will be to change it, but what can be done before it gets to that?

The bpsched log:

11:15:24.002 <2> run_any_ret_level: ----> 1
11:15:24.002 <2> run_any_ret_level: ----> 0
11:15:24.002 <2> run_backups: run_any_ret_level(1) returned -3(Something maxed out)
11:15:24.002 <2> recv_runQ_msg: msgrcv(nodelay) stat -1 errno 35 (No message of desired type). sigcld=0
11:15:24.002 <2> check_runmsgQ: no more bpsched children
11:15:24.002 <2> clean_up_worklists: worklists not clean, waiting for resources
11:15:24.002 <2> create_mesgQ: ?
11:15:24.015 <2> nb_getsockconnected: Connect to tpdvhhbu006 on port 1021
11:15:24.015 <2> logconnections: BPCD CONNECT FROM 10.205.56.25.1021 TO 10.205.56.31.13782
11:15:24.049 <2> start_bptm: /usr/openv/netbackup/bin/bptm bptm -count -cmd -rt 8 -rn 0 -stunit duplicate_
11:15:24.049 <2> start_bptm: Received BPCD success message
11:15:24.118 <4> log_in_errorDB: no drives up on storage unit
11:15:24.119 <2> ping_bpcd: Checking to see if bpcd available on tpdvhhbu002
11:15:24.120 <2> nb_getsockconnected: Connect to tpdvhhbu002 on port 1020
11:15:24.121 <2> logconnections: BPCD CONNECT FROM 10.205.56.25.1020 TO 10.205.56.27.13782
11:15:24.151 <2> ping_bpcd: BPCD_GET_VERSION_RQST
11:15:24.250 <2> bpcr_get_version_rqst: bpcd version: 05000401
11:15:24.250 <2> ping_bpcd: Received BPCD success message
11:15:24.251 <2> ping_bpcd: bpcd IS running on tpdvhhbu002
11:15:24.251 <2> ping_bpcd: Checking to see if bpcd available on tpdvhhbu003
11:15:24.251 <2> nb_getsockconnected: Connect to tpdvhhbu003 on port 1020
11:15:24.252 <2> logconnections: BPCD CONNECT FROM 10.205.56.25.1020 TO 10.205.56.28.13782
11:15:24.282 <2> ping_bpcd: BPCD_GET_VERSION_RQST
11:15:24.380 <2> bpcr_get_version_rqst: bpcd version: 05000401
11:15:24.380 <2> ping_bpcd: Received BPCD success message
11:15:24.381 <2> ping_bpcd: bpcd IS running on tpdvhhbu003
11:15:24.381 <2> nb_getsockconnected: Connect to tpdvhhbu002 on port 1020
11:15:24.382 <2> logconnections: BPCD CONNECT FROM 10.205.56.25.1020 TO

The bptm log:

11:14:52.125 <2> bptm: EXITING with status 0 <----------
11:15:22.539 <2> bptm: INITIATING (VERBOSE = -1): -U
11:15:22.539 <2> bptm: EXITING with status 0 <----------
11:15:52.942 <2> bptm: INITIATING (VERBOSE = -1): -U
11:15:52.942 <2> bptm: EXITING with status 0 <----------
11:16:23.909 <2> bptm: INITIATING (VERBOSE = -1): -U
11:16:23.909 <2> bptm: EXITING with status 0 <----------
11:16:54.469 <2> bptm: INITIATING (VERBOSE = -1): -U
11:16:54.469 <2> bptm: EXITING with status 0 <----------
11:17:24.983 <2> bptm: INITIATING (VERBOSE = -1): -U
11:17:24.999 <2> bptm: EXITING with status 0 <----------
11:17:25.499 <2> bptm: INITIATING (VERBOSE = -1): -count -cmd -rt 8 -rn 0 -stunit tpdvhhbu004-hcart-robot-tld-0 -den 6 -mt 2 -masterversion 500000
11:17:25.499 <2> bptm: EXITING with status 0 <----------
11:17:55.529 <2> bptm: INITIATING (VERBOSE = -1): -U
11:17:55.529 <2> bptm: EXITING with status 0 <----------
11:18:26.076 <2> bptm: INITIATING (VERBOSE = -1): -U
11:18:26.076 <2> bptm: EXITING with status 0 <----------
11:18:56.606 <2> bptm: INITIATING (VERBOSE = -1): -U
11:18:56.606 <2> bptm: EXITING with status 0 <----------
11:19:03.704 <2> bptm: INITIATING (VERBOSE = -1): -count -cmd -rt 8 -rn 0 -stunit tpdvhhbu004-hcart-robot-tld-0 -den 6 -mt 2 -masterversion 500000
11:19:03.704 <2> bptm: EXITING with status 0 <----------
11:19:27.700 <2> bptm: INITIATING (VERBOSE = -1): -U
11:19:27.700 <2> bptm: EXITING with status 0 <----------
11:19:58.230 <2> bptm: INITIATING (VERBOSE = -1): -U
11:19:58.230 <2> bptm: EXITING with status 0 <----------

11:34:06.800 <2> bptm: EXITING with status 0 <----------
11:34:10.021 <2> bptm: INITIATING (VERBOSE = -1): -countmedia -cmd -rt 8 -rn 0 -stunit tpdvhhbu004-hcart-robot-tld-0 -den 6 -p SQL -rl 3 -masterversion 500000
11:34:10.021 <2> add_to_vmhost_list: added tpdvhhbu001 to vmhost list
11:34:10.021 <2> nb_getsockconnected: host=tpdvhhbu001 service=vmd address=10.205.56.25 protocol=tcp non-reserved port=13701
11:34:10.037 <2> ParseConfigExA: Unknown configuration option on line 95: RenameIfExists = 0
11:34:10.037 <2> ParseConfigExA: Unknown configuration option on line 95: RenameIfExists = 0
11:34:10.037 <2> vmdb_get_scratch_list: server returned: ScratchPool

11:34:10.037 <2> vmdb_get_scratch_list: server returned: EXIT_STATUS 0
11:34:10.037 <2> nb_getsockconnected: host=tpdvhhbu001 service=vmd address=10.205.56.25 protocol=tcp non-reserved port=13701
11:34:10.177 <2> bptm: EXITING with status 0 <----------
11:39:03.254 <2> bptm: INITIATING (VERBOSE = -1): -count -cmd -rt 8 -rn 0 -stunit tpdvhhbu004-hcart-robot-tld-0 -den 6 -mt 2 -masterversion 500000
11:39:03.254 <2> bptm: EXITING with status 0 <----------
11:39:10.120 <2> bptm: INITIATING (VERBOSE = -1): -count -cmd -rt 8 -rn 0 -stunit tpdvhhbu004-hcart-robot-tld-0 -den 6 -mt 2 -masterversion 500000
11:39:10.120 <2> bptm: EXITING with status 0 <----------
11:44:09.607 <2> bptm: INITIATING (VERBOSE = -1): -count -cmd -rt 8 -rn 0 -stunit tpdvhhbu004-hcart-robot-tld-0 -den 6 -mt 2 -masterversion 500000
11:44:09.607 <2> bptm: EXITING with status 0 <----------

Can you please help me out based on the output of these logs?
Please, it's urgent.

Thanks

MayurS · ‎08-23-2005

It looks like the drives are down or no appropriate storage unit is specified for the duplications.

What is the output of this command?

vmoprcmd -d
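
The "no drives up on storage unit" entry is the only line in the posted bpsched excerpt that is not marked <2>. Below is a small Python sketch for pulling such lines out of a long log. It assumes the line layout seen in this thread (time, a number in angle brackets, function name, message) and assumes that a higher number in the brackets means a more important entry; the function name notable_lines is made up for the example.

```python
import re

# time <level> function: message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) <(\d+)> ([^:]+): (.*)$")

def notable_lines(log_text, min_level=4):
    """Return (time, level, function, message) for entries at or above min_level."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m.group(2)) >= min_level:
            hits.append((m.group(1), int(m.group(2)), m.group(3), m.group(4)))
    return hits

# Three lines copied from the bpsched excerpt in this thread.
sample = """\
11:15:24.049 <2> start_bptm: Received BPCD success message
11:15:24.118 <4> log_in_errorDB: no drives up on storage unit
11:15:24.119 <2> ping_bpcd: Checking to see if bpcd available on tpdvhhbu002
"""
for entry in notable_lines(sample):
    print(entry)
```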

rilwan_dawodu · ‎08-23-2005

Hi,

Find below the output of vmoprcmd -d.
The media server is a Windows box.

I can see two of the drives are down, but that does not mean the whole duplication should stall.

Moreover, these two drives keep going down even after I bring them back up. What could be wrong? Are the drives going bad?

The robot is an STK L700.

Thanks.

PENDING REQUESTS

DRIVE STATUS

Drv Type Control User Label RecMID ExtMID Ready Wr.Enbl. ReqId
0 hcart TLD NetBacku Yes PPR354 PPR354 Yes Yes 7
1 hcart TLD NetBacku Yes PPR245 PPR245 Yes Yes 4
2 hcart TLD NetBacku Yes PPR286 PPR286 Yes Yes 10
3 hcart TLD NetBacku Yes PPR068 PPR068 Yes Yes 11
4 hcart DOWN-TLD - No - -
5 hcart DOWN-TLD - No - -
6 hcart TLD NetBacku Yes PPR129 PPR129 Yes Yes 2
7 hcart TLD NetBacku Yes PPR096 PPR096 Yes Yes 9
8 hcart TLD NetBacku Yes PPR187 PPR187 Yes Yes 6
9 hcart TLD NetBacku Yes PPR313 PPR313 Yes Yes 8

ADDITIONAL DRIVE STATUS

Drv DriveName Shared Assigned Comment
0 IBMULTRIUM-TD25 Yes tpne-mgta-hbu07
1 IBMULTRIUM-TD26 Yes tpne-mgta-hbu07
2 IBMULTRIUM-TD27 Yes tpne-mgta-hbu07
3 IBMULTRIUM-TD28 Yes tpne-mgta-hbu07
4 IBMULTRIUM-TD29 Yes -
5 IBMULTRIUM-TD20 Yes -
6 IBMULTRIUM-TD21 Yes tpne-mgta-hbu07
7 IBMULTRIUM-TD22 Yes tpne-mgta-hbu07
8 IBMULTRIUM-TD23 Yes tpne-mgta-hbu07
9 IBMULTRIUM-TD24 Yes tpne-mgta-hbu07
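
For repeated checks, a listing like the one above can be scanned for down drives with a short Python sketch. It only relies on what is visible in this thread: rows under DRIVE STATUS start with the drive number and type, and the third field begins with DOWN for a downed drive. The function name down_drives is made up, and the sample text is a shortened copy of the listing, not fresh output.

```python
def down_drives(vmoprcmd_text):
    """Return the drive numbers whose Control field starts with DOWN."""
    down = []
    in_status = False
    for line in vmoprcmd_text.splitlines():
        stripped = line.strip()
        if stripped == "DRIVE STATUS":
            in_status = True
            continue
        if stripped == "ADDITIONAL DRIVE STATUS":
            in_status = False
            continue
        fields = stripped.split()
        # Data rows begin with a numeric drive index; headers do not.
        if in_status and len(fields) >= 3 and fields[0].isdigit():
            if fields[2].startswith("DOWN"):
                down.append(int(fields[0]))
    return down

sample = """\
PENDING REQUESTS

DRIVE STATUS

Drv Type Control User Label RecMID ExtMID Ready Wr.Enbl. ReqId
3 hcart TLD NetBacku Yes PPR068 PPR068 Yes Yes 11
4 hcart DOWN-TLD - No - -
5 hcart DOWN-TLD - No - -

ADDITIONAL DRIVE STATUS

Drv DriveName Shared Assigned Comment
4 IBMULTRIUM-TD29 Yes -
"""
print(down_drives(sample))
```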

MayurS · ‎08-24-2005

Hi,

Did you run the "vmoprcmd -d" command while the duplication in question was stalled?

Regarding the downed drives:
Are you able to see the 7-segment display on the drive in the library? I diagnose drive problems from the error code that flashes on the drive's 7-segment display, then refer to the TLU documentation and act accordingly.

Jeffrey_Redingt · ‎08-24-2005

Look at the Windows system and application event logs on the media server. Tell us what errors you are seeing.

TempoVisitor · ‎08-25-2005

Hi,
You have a complex site: SAN, SSO drives and so on.

You must be very precise about your duplication jobs:
- are all the original backups made by only one machine?
- is the destination STU on the same server?
- are there MPX jobs?
- what does your duplication script do?

I think your problem is duplication over the LAN.

Imagine you have several servers, all connected to the same library.
Each server makes its own backup, on its own tapes, in the common library.
When you duplicate, you have designated just ONE STU as the duplication destination.
Therefore, if you haven't given the destination server the right to read the tapes written by the other servers ... guess what ... each server mounts its own tapes, reads the data, and sends it ... ON THE LAN !!! to the destination server.
This has nothing to do with the drives being down ... you have to adapt your duplication:
- each server should make its own duplication
or
- host properties, media server: specify an alternate read server to give the destination server the right to mount tapes written by all the other ones.
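
The reasoning above can be put into a toy Python model. This is only a simplified sketch of the description in this post, not NetBackup's actual selection logic, and the server names media1 to media3 are placeholders. The server that reads the source tape is the one that wrote it, unless an alternate read server is set; data crosses the LAN whenever the reader is not the destination server.

```python
def duplication_path(image_writer, destination, alternate_read_server=None):
    """Return (reader, over_lan) for one duplication.

    image_writer: media server that wrote the original backup
    destination: media server that owns the destination storage unit
    alternate_read_server: server allowed to read tapes written by others
    """
    reader = alternate_read_server or image_writer
    over_lan = reader != destination
    return reader, over_lan

# Backups written by three servers, all duplicated to media3's storage unit.
for writer in ("media1", "media2", "media3"):
    print(writer, duplication_path(writer, "media3"))

# Same case with media3 set as the alternate read server.
print(duplication_path("media1", "media3", alternate_read_server="media3"))
```

In this model only the backups written by media3 itself stay off the LAN until the alternate read server is set.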

rilwan_dawodu · ‎08-29-2005

Hi,

The original backups are made by different media servers.

The duplication STU is on the same server.

There are MPX jobs.

The script runs multiple duplications in parallel. It is designed to duplicate all backups done within the last 24 hours.

Today the duplication was mostly successful, except for the last duplication job, which has now been stalled for over 2 hours.

Below is the output of vmoprcmd -d:

PENDING REQUESTS

DRIVE STATUS

Drv Type Control User Label RecMID ExtMID Ready Wr.Enbl. ReqId
0 hcart TLD - No - -
1 hcart TLD - No - -
2 hcart TLD - No - -
3 hcart TLD - No - -
4 hcart TLD - No - -
5 hcart TLD - No - -
6 hcart TLD NetBacku Yes DBT121 DBT121 Yes Yes 6
7 hcart TLD - No - -
8 hcart TLD - No - -
9 hcart TLD NetBacku Yes DBT350 DBT350 Yes Yes 0

ADDITIONAL DRIVE STATUS

Drv DriveName Shared Assigned Comment
0 IBMULTRIUM-TD20 Yes -
1 IBMULTRIUM-TD21 Yes -
2 IBMULTRIUM-TD22 Yes -
3 IBMULTRIUM-TD23 Yes -
4 IBMULTRIUM-TD24 Yes -
5 IBMULTRIUM-TD25 Yes -
6 IBMULTRIUM-TD26 Yes tpdvhhbu006
7 IBMULTRIUM-TD27 Yes -
8 IBMULTRIUM-TD28 Yes -
9 IBMULTRIUM-TD29 Yes tpdvhhbu006

How do I go about specifying an alternate read server to give the destination server the right to mount tapes written by all the other ones?

regards,
