Two Backup jobs, two media write errors - one job resumes, one doesn't

sb981
Level 2

A Tale of Two Backups

 
The Scene: Two backup jobs, "a" and "h", running simultaneously on client ned18 (along with three other backup jobs). Both policies have checkpointing enabled, and both jobs back up the same fileserver. Job "a" backs up the user directories beginning with "a" (one directory), and job "h" backs up the user directories beginning with "h" (three directories).
 
NetBackup 7.0, server and client both running Linux
 
 
Act 1: Job "a"
- Job "a" having backed up 12TB and 800K files, stops with a "1: (84) media write error"
- Select the job, click Actions -> Resume Job
- Within minutes job "a" got set up, mounted and positioned a fresh tape
- Activity monitor immediately shows the byte count and file count incrementing
- In short, job "a" is back on track and has been running fine for more than 8 hours
 
A few hours later...
 
Act 2, scene 1: Job "h"
- Job "h" having backed up 15TB and 150K files, stops with a "1: (84) media write error"
- Select the job, click Actions -> Resume Job
- Within minutes job "h" got set up, mounted and positioned a fresh tape
- BUT, the activity monitor shows no activity - the byte count and file count remain unchanged for job "h"
- Three hours later the master server timed out, "Error bpbrm (pid=nnn) socket read failed: errno = 62 - Timer expired" and "file read failed  (13)"
 
Act 2, scene 2: Job "h" (continued)
- Undaunted, enable debug logging for bpbkar on client (touched bpbkar_path_tr, VERBOSE = 5, Debug_Database = 5, ENABLE_ROBUST_LOGGING = YES)
- Select the job, click Actions -> Resume Job
- The bpbkar log file grows very quickly, BUT in a tight loop, and recording the same 19 lines over and over
- This continued for three more hours until the master server timed out (again, "errno = 62 - Timer expired").
 
Act 3: Looking for a happy ending
- Don't know why "Resume Job" for job "a" worked immediately
- Don't know why "Resume Job" for job "h" didn't work
- Once job "h" was resumed, what is the bpbkar process doing?
- And don't know what to try next

 
 
References:
 
Endlessly repeating "bpbkar" debug log entries for job "h"
 
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol01 on /
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (proc) proc on /proc
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (sysfs) sysfs on /sys
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (devpts) devpts on /dev/pts
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (tmpfs) tmpfs on /dev/shm
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/sda1 on /boot
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/sda2 on /boot-rcvy
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol05 on /crash
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol00 on /rcvy
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol03 on /var
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol02 on /var-rcvy
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (binfmt_misc) none on /proc/sys/fs/binfmt_misc
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ipathfs) none on /ipathfs
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (rpc_pipefs) sunrpc on /var/lib/nfs/rpc_pipefs
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (gpfs) /dev/gsfs3 on /gpfs/gsfs3
15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (gpfs) /dev/gsfs2 on /gpfs/gsfs2
15:56:46.048 [15296] <2> bpbkar resolve_path: INF - /gs3/users/hcs2 resolves to /gpfs/gsfs3/users/hcs2
15:56:46.048 [15296] <2> bpbkar resolve_path: INF - Actual mount point of /gs3/users/hcs2 is /gpfs/gsfs3/users/hcs2
15:56:46.059 [15296] <2> bpbkar cd: remap name would return=/gs3/users/hcs2
<REPEAT>
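
One way to confirm that a log like the excerpt above is looping is to strip the timestamp, pid, and level from each line and count the distinct messages that remain. The following is a minimal sketch, assuming the line layout `HH:MM:SS.mmm [pid] <level> message` seen in the excerpt. The function name `summarize_log` is hypothetical, and the sample simply repeats two lines from the excerpt to imitate the loop.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt: "HH:MM:SS.mmm [pid] <level> message"
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} \[(\d+)\] <(\d+)> (.*)$")

def summarize_log(lines):
    """Count messages after removing timestamp, pid, and level."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts

# Two lines copied from the excerpt, repeated to imitate the tight loop.
excerpt = [
    "15:56:46.048 [15296] <2> bpbkar resolve_path: INF - /gs3/users/hcs2 resolves to /gpfs/gsfs3/users/hcs2",
    "15:56:46.059 [15296] <2> bpbkar cd: remap name would return=/gs3/users/hcs2",
]
sample = excerpt * 3

counts = summarize_log(sample)
print(len(counts), "distinct messages in", sum(counts.values()), "lines")
```

A small number of distinct messages against a very large line count is the signature described in Act 2, scene 2. On a real log, the lines would be read from the file rather than from a list.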
 

 

"ps" output for processes for "a" and "h" backup jobs on client machine ned18
 
"a"
root     25535     1  6 08:05 ?        00:33:17 bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-a -sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 -ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso -b ned18_1392781497 -kl 7 -use_ofb
 
"h"
root     15296     1  9 12:50 ?        00:20:04 bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-h -sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 -ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso -b ned18_1392746035 -kl 7 -use_ofb
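
To see at a glance how the two bpbkar invocations differ, the command lines above can be split into flag/value pairs and compared. This is a rough sketch: the helper `parse_bpbkar_args` is hypothetical and only tokenizes the text shown in the "ps" output; it does not reflect any knowledge of what each flag means. The conversion of `-read_to` to hours assumes the value is in seconds, which the thread does not state.

```python
def parse_bpbkar_args(cmdline):
    """Split a bpbkar command line into {flag: value}; bare flags map to None."""
    tokens = cmdline.split()[1:]  # drop the leading "bpbkar"
    args = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        # The token after a flag is its value unless it is itself a flag.
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            args[flag] = tokens[i + 1]
            i += 2
        else:
            args[flag] = None
            i += 1
    return args

# Command lines copied from the "ps" output for jobs "a" and "h".
cmd_a = ("bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-a "
         "-sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 "
         "-ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso "
         "-b ned18_1392781497 -kl 7 -use_ofb")
cmd_h = ("bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-h "
         "-sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 "
         "-ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso "
         "-b ned18_1392746035 -kl 7 -use_ofb")

args_a = parse_bpbkar_args(cmd_a)
args_h = parse_bpbkar_args(cmd_h)

# Keep only the flags whose values differ between the two jobs.
diff = {k: (args_a[k], args_h.get(k)) for k in args_a if args_a[k] != args_h.get(k)}
print(diff)

# Assumption: -read_to is expressed in seconds.
read_to_hours = int(args_h["-read_to"]) / 3600
print(read_to_hours)
```

Only `-class` and `-b` differ, so the two processes were started with otherwise identical options. If `-read_to` is indeed in seconds, 10800 is three hours, which would be consistent with the three-hour timeouts described in Act 2.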
 
8 REPLIES

RamNagalla
Moderator
Partner    VIP    Certified

You have two issues:

1) The media write error.

2) The backup issue with job "h".

Please show me the backup policy configurations:

bppllist <policy name of job a> -L

bppllist <policy name of job h> -L

Also attach the entire bpbkar log for the "h" job, with VERBOSE = 5.

 

PS: I like the way you presented the issue here.

 

sb981
Level 2

Regarding 1) Media write errors -- yes, agreed, that's an issue.

Regarding 2) backup issue with job "h", attaching the info you requested. I had to resume the job again so I could capture the stuff at the beginning of the log file; I only included the first 1000 lines or so (I confirmed that there are 23 unique log entries and they all appear at the beginning of the log) 

Thanks for taking the time to look at this!

 

RamNagalla
Moderator
Partner    VIP    Certified
22:29:57.054 [19169] <2> bpbkar resolve_path: INF - /gs3/users/hcs2 resolves to /gpfs/gsfs3/users/hcs2
22:29:57.054 [19169] <2> bpbkar resolve_path: INF - Actual mount point of /gs3/users/hcs2 is /gpfs/gsfs3/users/hcs2

So the actual path behind /gs3/users/h* is under /gpfs/gsfs3/users.

Try changing the backup selection to /gpfs/gsfs3/users/h* and rerun the backups; let's see how it goes.
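
For anyone who wants to check what a symbolic-link shorthand resolves to before editing the backup selection, here is a small sketch using Python's `os.path.realpath`. The helper name `resolve_selection` and the temporary directory layout are hypothetical; the layout only imitates a `/gs3` link pointing at `/gpfs/gsfs3` and touches no real paths.

```python
import os
import tempfile

def resolve_selection(path):
    """Return the path with all symbolic links resolved."""
    return os.path.realpath(path)

# Self-contained demo in a temporary directory (hypothetical layout that
# imitates /gs3 -> /gpfs/gsfs3).
with tempfile.TemporaryDirectory() as root:
    real_dir = os.path.join(root, "gpfs", "gsfs3", "users", "hcs2")
    os.makedirs(real_dir)

    link = os.path.join(root, "gs3")
    os.symlink(os.path.join(root, "gpfs", "gsfs3"), link)

    resolved = resolve_selection(os.path.join(link, "users", "hcs2"))
    # Compare against the resolved real directory, since the temporary
    # directory itself may sit behind a symbolic link on some systems.
    same = resolved == os.path.realpath(real_dir)
    print(same)
```

On the client, calling the same function on one of the real selection paths would show the target, which should agree with the "resolves to" lines in the bpbkar log.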

I would also like to see the detailed status of the failed job, or the full bpbkar log, because I want to see how long it runs before it returns the failure message.

 

sdo
Moderator
Partner    VIP    Certified

Hi sb981 - What is the O/S version and CPU architecture of the backup client?

I ask because not all combinations of NetBackup + Linux + ext4 are listed in the compatibility matrix:

http://www.symantec.com/business/support/index?page=content&id=TECH76648

 

sb981
Level 2

I can see the rationale for using the actual mount point rather than our symbolic link shorthand... but both jobs have been backing up files this way for months.

About the log -- I'm not sure what is supposed to happen to the remote bpbkar client process when the master server places a job in Incomplete status, but for this "h" job the process persisted and continued to accumulate cpu time and write to the logs -- by the time I realized that was the case the logs had cycled through 111 50MB files (keeping three at any given time). I killed the process and the logging stopped. Note, however, there were no such lingering processes from the "a" backup, so it handled the transition more gracefully (and correctly, I'm sure).

I'm going to let this issue go for now because I just looked at the policy for "dis-gs3-h" with the GUI and noticed that the "Backup Selections" says "/gs3/users/g*" !! so I'm not surprised that this job isn't able to resume.
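
A quick sanity check for this kind of mismatch is to test whether a selection pattern actually covers the path being backed up. The sketch below uses Python's shell-style `fnmatchcase`; NetBackup's own wildcard handling may differ, so treat it as an approximation. The helper name `selection_covers` is hypothetical.

```python
from fnmatch import fnmatchcase

def selection_covers(selection, path):
    """Return True if the shell-style selection pattern matches the path."""
    return fnmatchcase(path, selection)

# The selection found in the policy, and the path seen in the bpbkar log.
covers_g = selection_covers("/gs3/users/g*", "/gs3/users/hcs2")
# The selection one would expect for job "h".
covers_h = selection_covers("/gs3/users/h*", "/gs3/users/hcs2")

print(covers_g)  # False
print(covers_h)  # True
```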

Thanks again for your help!

 

sb981
Level 2

Hi sdo,

Thanks for the link. The NetBackup master server is running RHEL5 on x86_64 and the client is RHEL6 on x86_64.

We're planning to upgrade NetBackup and the master server OS later this year.

 

 

sdo
Moderator
Partner    VIP    Certified

OK, it looks like ext4 is supported with your O/S, architecture, and NetBackup client version.

Next I'm wondering if there's an etrack listed (from around page 80 onwards) in the v7.0.1 release notes that might be relevant, but it'll be a lot of reading if you don't get lucky with a search:

http://www.symantec.com/business/support/index?page=content&id=TECH124476

 

sdo
Moderator
Partner    VIP    Certified

I can see that GFS2 support began with NetBackup v7.5, but it's not clear to me whether GFS2 is synonymous with GPFS, which might not be supported. I found this regarding GPFS:

http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/symantec-netbackup-gpfs-support-5245335

Is GFS2 the same thing as GPFS?