Two Backup jobs, two media write errors - one job resumes, one doesn't

sb981 · ‎02-22-2014

A Tale of Two Backups

The Scene: Two backup jobs, "a" and "h", running simultaneously on client ned18 (along with three other backup jobs). Both policies have checkpointing enabled, both jobs back up the same fileserver. Job "a" backs up the user directories beginning with "a" (one directory), and job "h" backs up the user directories beginning with "h" (three directories).

NetBackup 7.0, server and client both running Linux

Act 1: Job "a"

- Job "a", having backed up 12TB and 800K files, stops with a "1: (84) media write error"

- Select the job, click Actions -> Resume Job

- Within minutes job "a" got set up, mounted and positioned a fresh tape

- Activity monitor immediately shows the byte count and file count incrementing

- In short, job "a" is back on track and has been running fine for more than 8 hours

A few hours later...

Act 2, scene 1: Job "h"

- Job "h", having backed up 15TB and 150K files, stops with a "1: (84) media write error"

- Select the job, click Actions -> Resume Job

- Within minutes job "h" got set up, mounted and positioned a fresh tape

- BUT, the activity monitor shows no activity - the byte count and file count remain unchanged for job "h"

- Three hours later the master server timed out, "Error bpbrm (pid=nnn) socket read failed: errno = 62 - Timer expired" and "file read failed (13)"

Act 2, scene 2: Job "h" (continued)

- Undaunted, enable debug logging for bpbkar on client (touched bpbkar_path_tr, VERBOSE = 5, Debug_Database = 5, ENABLE_ROBUST_LOGGING = YES)

- Select the job, click Actions -> Resume Job

- The bpbkar log file grows very quickly, BUT in a tight loop, and recording the same 19 lines over and over

- This continued for three more hours until the master server timed out (again, "errno = 62 - Timer expired").

Act 3: Looking for a happy ending

- Don't know why "Resume Job" for job "a" worked immediately

- Don't know why "Resume Job" for job "h" didn't work

- Once job "h" was resumed, what is the bpbkar process doing?

- And don't know what to try next

References:

Endlessly repeating "bpbkar" debug log entries for job "h"

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol01 on /

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (proc) proc on /proc

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (sysfs) sysfs on /sys

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (devpts) devpts on /dev/pts

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (tmpfs) tmpfs on /dev/shm

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/sda1 on /boot

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/sda2 on /boot-rcvy

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol05 on /crash

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol00 on /rcvy

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol03 on /var

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ext4) /dev/mapper/VolGroup00-LogVol02 on /var-rcvy

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (binfmt_misc) none on /proc/sys/fs/binfmt_misc

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (ipathfs) none on /ipathfs

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (rpc_pipefs) sunrpc on /var/lib/nfs/rpc_pipefs

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (gpfs) /dev/gsfs3 on /gpfs/gsfs3

15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (gpfs) /dev/gsfs2 on /gpfs/gsfs2

15:56:46.048 [15296] <2> bpbkar resolve_path: INF - /gs3/users/hcs2 resolves to /gpfs/gsfs3/users/hcs2

15:56:46.048 [15296] <2> bpbkar resolve_path: INF - Actual mount point of /gs3/users/hcs2 is /gpfs/gsfs3/users/hcs2

15:56:46.059 [15296] <2> bpbkar cd: remap name would return=/gs3/users/hcs2
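
One way to confirm that a log like this is looping is to strip the timestamp, pid and level from each line and count the distinct messages that remain. The following is a minimal sketch, not a NetBackup tool: the line layout is assumed from the excerpt above, `unique_messages` is a hypothetical helper name, and the timestamps on the repeated pass in the sample are made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
#   HH:MM:SS.mmm [pid] <level> message
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} \[\d+\] <\d+> (.*)$")


def unique_messages(lines):
    """Count each distinct message, ignoring timestamp, pid and level."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Two messages from the excerpt, followed by a second pass with
# illustrative (made-up) timestamps to mimic the loop.
sample = [
    "15:56:46.048 [15296] <2> mount build_mount_list: INF - Processing (proc) proc on /proc",
    "15:56:46.059 [15296] <2> bpbkar cd: remap name would return=/gs3/users/hcs2",
    "15:56:47.112 [15296] <2> mount build_mount_list: INF - Processing (proc) proc on /proc",
    "15:56:47.123 [15296] <2> bpbkar cd: remap name would return=/gs3/users/hcs2",
]

counts = unique_messages(sample)
print(len(counts), "distinct messages in", sum(counts.values()), "lines")
```

A small number of distinct messages against a very large line count is the signature of the tight loop described in Act 2, scene 2.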

"ps" output for processes for "a" and "h" backup jobs on client machine ned18

"a"

root 25535 1 6 08:05 ? 00:33:17 bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-a -sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 -ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso -b ned18_1392781497 -kl 7 -use_ofb

"h"

root 15296 1 9 12:50 ? 00:20:04 bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-h -sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 -ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso -b ned18_1392746035 -kl 7 -use_ofb
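
To compare the two command lines without eyeballing them, they can be parsed into dictionaries and diffed. This is only a sketch: `parse_args` is a hypothetical helper, and it assumes that a flag followed by a token not starting with "-" takes that token as its value. Reading `-read_to` as a read timeout in seconds is also an assumption based on the flag name, though 10800 seconds is exactly three hours, which matches the three-hour waits before each "Timer expired" error.

```python
def parse_args(cmdline):
    """Turn 'bpbkar -flag value -switch ...' into a dict; bare switches map to True."""
    tokens = cmdline.split()[1:]  # drop the program name
    args = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        # Assumption: a following token that does not start with '-' is the value.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        args[flag] = tokens[i + 1] if has_value else True
        i += 2 if has_value else 1
    return args


job_a = parse_args(
    "bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-a "
    "-sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 "
    "-ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso -b ned18_1392781497 "
    "-kl 7 -use_ofb"
)
job_h = parse_args(
    "bpbkar -r 5356800 -ru root -dt 0 -to 0 -clnt ned18 -class dis-gs3-h "
    "-sched biomart -st FULL -bpstart_to 300 -bpend_to 300 -read_to 10800 "
    "-ckpt_time 7200 -blks_per_buffer 2048 -use_otm -fso -b ned18_1392746035 "
    "-kl 7 -use_ofb"
)

# Keep only the flags whose values differ between the two jobs.
diff = {
    flag: (job_a.get(flag), job_h.get(flag))
    for flag in sorted(set(job_a) | set(job_h))
    if job_a.get(flag) != job_h.get(flag)
}
for flag, (a_val, h_val) in diff.items():
    print(flag, a_val, h_val)

read_to_hours = int(job_h["-read_to"]) / 3600
print("read_to in hours:", read_to_hours)
```

Under these assumptions the only differences are the policy name (`-class`) and the backup ID (`-b`), so the command lines themselves do not explain why one job resumed and the other did not.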

RamNagalla · ‎02-22-2014

You have two issues:

1) Media write error.

2) backup issue with job "h"

Please show me the backup policy configurations:

bppllist <policy name of job a> -L

bppllist <policy name of job h> -L

Also attach the entire bpbkar log for the "h" job with VERBOSE = 5.

PS: I like the way you presented the issue here.

sb981 · ‎02-22-2014

Regarding 1) Media write errors -- yes, agreed, that's an issue.

Regarding 2) backup issue with job "h", attaching the info you requested. I had to resume the job again so I could capture the stuff at the beginning of the log file; I only included the first 1000 lines or so (I confirmed that there are 23 unique log entries and they all appear at the beginning of the log).

Thanks for taking the time to look at this!

RamNagalla · ‎02-22-2014

22:29:57.054 [19169] <2> bpbkar resolve_path: INF - /gs3/users/hcs2 resolves to /gpfs/gsfs3/users/hcs2
22:29:57.054 [19169] <2> bpbkar resolve_path: INF - Actual mount point of /gs3/users/hcs2 is /gpfs/gsfs3/users/hcs2

So the actual mount for /gs3/users/h* is /gpfs/gsfs3/users.

Try changing the backup selection to /gpfs/gsfs3/users/h* and rerun the backup; let's see how it goes.

I would also like to see the detailed status of the failed job, or the full bpbkar log, because I want to see how long it takes before it gives the failure message.
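
To check what a backup selection path really points at before editing the policy, the path can be resolved outside NetBackup. The sketch below rebuilds a stand-in for the layout in a temporary directory, on the assumption that /gs3 is a symbolic link to /gpfs/gsfs3 (which is what the resolve_path log lines suggest), and uses Python's `os.path.realpath` to follow the link. On the real client you would call `os.path.realpath` on the real selection path instead.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    root = os.path.realpath(root)  # normalise the temp dir itself

    # Stand-in for the real layout (assumed): gs3 -> gpfs/gsfs3
    target = os.path.join(root, "gpfs", "gsfs3")
    os.makedirs(os.path.join(target, "users", "hcs2"))
    link = os.path.join(root, "gs3")
    os.symlink(target, link)

    # The selection as written in the policy, via the symlink shorthand.
    selection = os.path.join(link, "users", "hcs2")
    resolved = os.path.realpath(selection)

    selection_rel = os.path.relpath(selection, root)
    resolved_rel = os.path.relpath(resolved, root)
    print(selection_rel, "resolves to", resolved_rel)
```

If the resolved path differs from the selection, the policy is going through a link, and using the resolved path in the backup selection removes that one variable from the troubleshooting.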

sdo · ‎02-23-2014

Hi sb981 - What is the O/S version and CPU architecture of the backup client ?

I ask because not all combinations of NetBackup + Linux + ext4 are listed in the compatibility matrix:

http://www.symantec.com/business/support/index?page=content&id=TECH76648

sb981 · ‎02-23-2014

I can see the rationale for using the actual mount point rather than our symbolic link shorthand... but both jobs have been backing up files this way for months.

About the log -- I'm not sure what is supposed to happen to the remote bpbkar client process when the master server places a job in Incomplete status, but for this "h" job the process persisted and continued to accumulate cpu time and write to the logs -- by the time I realized that was the case the logs had cycled through 111 50MB files (keeping three at any given time). I killed the process and the logging stopped. Note, however, there were no such lingering processes from the "a" backup, so it handled the transition more gracefully (and correctly, I'm sure).

I'm going to let this issue go for now because I just looked at the policy for "dis-gs3-h" with the GUI and noticed that the "Backup Selections" says "/gs3/users/g*" !! so I'm not surprised that this job isn't able to resume.

Thanks again for your help!
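
A mismatch like this can be caught with a quick check of each selection pattern against the directories the job is supposed to cover. This is a rough sketch only: `check_selection` is a hypothetical helper, Python's `fnmatch` stands in for NetBackup's own wildcard handling (which may differ), and only /gs3/users/hcs2 is listed because it is the one "h" directory named in this thread.

```python
from fnmatch import fnmatch


def check_selection(pattern, expected_dirs):
    """Return the expected directories that the pattern does NOT match."""
    return [d for d in expected_dirs if not fnmatch(d, pattern)]


# Only one of the three "h" directories is named in this thread.
expected = ["/gs3/users/hcs2"]

missed_by_g = check_selection("/gs3/users/g*", expected)
missed_by_h = check_selection("/gs3/users/h*", expected)

print("g* misses:", missed_by_g)
print("h* misses:", missed_by_h)
```

An empty list means the pattern covers every expected directory; anything left over is a directory the current selection would skip.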

sb981 · ‎02-23-2014

Hi sdo,

Thanks for the link. The NetBackup master server is running RHEL5 on x86_64 and the client is RHEL6 on x86_64.

We're planning to upgrade NetBackup and the master server OS later this year.

sdo · ‎02-23-2014

OK, it looks like ext4 is supported with your O/S, architecture and NetBackup Client version.

Next I'm wondering if there's an etrack listed (from around page 80 onwards) in the v7.0.1 release notes that might be relevant, but it'll be a lot of reading if you don't get lucky with a search:

http://www.symantec.com/business/support/index?page=content&id=TECH124476

sdo · ‎02-23-2014

I can see that GFS2 support began with NetBackup v7.5 - but it's not clear to me if GFS2 is synonymous with GPFS, which might not be supported... Found this re GPFS:

http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/symantec-netbackup-gpfs-support-5245335

Is GFS2 the same thing as GPFS?
