Re: vmware backup fail in job status 13

stucci · ‎01-03-2017

master Linux, nbu 7.7.3; media Appliance 5230 in 7.7.3; vSpheare 6.0.0;

Hi all, we have this Linux VM that fails backup by some weeks, do you have any troubleshooting suggestion ? tks in advance

Jan 1, 2017 9:29:55 AM - Info bpbrm (pid=78045) salclt4220 is the host to backup data from
Jan 1, 2017 9:29:55 AM - Info bpbrm (pid=78045) reading file list for client
Jan 1, 2017 9:29:55 AM - Info bpbrm (pid=78045) accelerator enabled
Jan 1, 2017 9:29:55 AM - Info bpbrm (pid=78045) There is no complete backup image match with track journal, a regular full backup will be performed.
Jan 1, 2017 9:29:55 AM - Info bpbrm (pid=78045) starting bpbkar on client
Jan 1, 2017 9:29:56 AM - Info bpbkar (pid=78061) Backup started
Jan 1, 2017 9:29:56 AM - Info bpbrm (pid=78045) bptm pid: 78062
Jan 1, 2017 9:29:57 AM - Info bptm (pid=78062) start
Jan 1, 2017 9:29:57 AM - Info bptm (pid=78062) using 1048576 data buffer size
Jan 1, 2017 9:29:57 AM - Info bptm (pid=78062) using 512 data buffers
Jan 1, 2017 9:29:59 AM - Info bpbkar (pid=78061) INF - Backing up vCenter server svvcmot11.syssede.systest.sanpaoloimi.com, ESX host svwmop920.sede.corp.sanpaoloimi.com, BIOS UUID 4228ea9b-ac41-9a9a-c429-1ba02e6386bf, Instance UUID 502804b9-75eb-1926-353e-26d1f4a34ae9, Display Name salclt4220, Hostname salclt4220
Jan 1, 2017 9:30:02 AM - Info bptm (pid=78062) start backup
Jan 1, 2017 9:30:22 AM - Info nbjm (pid=9940) starting backup job (jobid=3246268) for client salclt4220, policy CCM_svvcmot11_Settimanale, schedule Full_Weekly
Jan 1, 2017 9:30:22 AM - Info nbjm (pid=9940) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=3246268, request id:{BA4002A9-7A0B-4C38-B226-88355A31C313})
Jan 1, 2017 9:30:22 AM - requesting resource ccmnbu04
Jan 1, 2017 9:30:22 AM - requesting resource scapmop0307.NBU_CLIENT.MAXJOBS.salclt4220
Jan 1, 2017 9:30:22 AM - requesting resource scapmop0307.VMware.Datastore.svvcmot11.syssede.systest.sanpaoloimi.com/Moncalieri_CCM/A_024_PURE_BRONZE_WEBLOGIC_BRZ_SIL_CCM
Jan 1, 2017 9:30:22 AM - requesting resource scapmop0307.VMware.ESXserver.svwmop920.sede.corp.sanpaoloimi.com
Jan 1, 2017 9:30:22 AM - requesting resource scapmop0307.VMware.vCenter.svvcmot11.syssede.systest.sanpaoloimi.com
Jan 1, 2017 9:30:22 AM - granted resource  scapmop0307.NBU_CLIENT.MAXJOBS.salclt4220
Jan 1, 2017 9:30:22 AM - granted resource  scapmop0307.VMware.Datastore.svvcmot11.syssede.systest.sanpaoloimi.com/Moncalieri_CCM/A_024_PURE_BRONZE_WEBLOGIC_BRZ_SIL_CCM
Jan 1, 2017 9:30:22 AM - granted resource  scapmop0307.VMware.ESXserver.svwmop920.sede.corp.sanpaoloimi.com
Jan 1, 2017 9:30:22 AM - granted resource  scapmop0307.VMware.vCenter.svvcmot11.syssede.systest.sanpaoloimi.com
Jan 1, 2017 9:30:22 AM - granted resource  MediaID=@aaaab;DiskVolume=PureDiskVolume;DiskPool=ccmnbu04_DP;Path=PureDiskVolume;StorageServer=ccmnbu04;MediaServer=ccmnbu04
Jan 1, 2017 9:30:22 AM - granted resource  ccmnbu04
Jan 1, 2017 9:30:23 AM - estimated 0 kbytes needed
Jan 1, 2017 9:30:23 AM - Info nbjm (pid=9940) started backup (backupid=salclt4220_1483259422) job for client salclt4220, policy CCM_svvcmot11_Settimanale, schedule Full_Weekly on storage unit ccmnbu04 using backup host ccmnbu04
Jan 1, 2017 9:30:33 AM - started process bpbrm (pid=78045)
Jan 1, 2017 9:30:36 AM - connecting
Jan 1, 2017 9:30:36 AM - connected; connect time: 0:00:00
Jan 1, 2017 9:30:47 AM - begin writing
Jan 1, 2017 9:30:55 AM - Error bpbrm (pid=78045) db_FLISTsend failed: file read failed (13)
Jan 1, 2017 9:30:57 AM - Error bptm (pid=78062) media manager terminated by parent process
Jan 1, 2017 9:31:42 AM - Info ccmnbu04 (pid=78062) StorageServer=PureDisk:ccmnbu04; Report=PDDO Stats (multi-threaded stream used) for (ccmnbu04): scanned: 82947 KB, CR sent: 10031 KB, CR sent over FC: 0 KB, dedup: 87.9%, cache disabled
Jan 1, 2017 9:31:46 AM - Info bpbkar (pid=0) done
Jan 1, 2017 9:31:46 AM - Info bpbkar (pid=0) done. status: 13: file read failed
Jan 1, 2017 9:32:27 AM - end writing; write time: 0:01:40
file read failed  (13)

sdo · ‎01-03-2017

1) Have you checked whether the file-systems within the Linux VM are supported for VM style backups?

Can you tell us more about the VM, such as...

2) OS type and version?

3) Any RDMs?

4) File-systems types and versions?

5) File-system "bit-ness" - i.e. 32bit or 64-bit file systems?

6) Any other VMs of same OS and same layout and same file-systems which do work for backups?

stucci · ‎01-03-2017

Hello sdo,

I think that it is all supported, I do more backup of similar VM without any issue, I tried a manual snapshot on vcenter and run properly.

[root@salclt4220 ~]# uname -a
Linux salclt4220 2.6.32-642.1.1.el6.x86_64 #1 SMP Fri May 6 14:54:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@salclt4220 ~]# 
[root@salclt4220 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.8 (Santiago)

sdo · ‎01-03-2017

Ok - if you are 100% absolutely definitely certain that there are no differences in backup policy configuration and no differences in VM technology and guest OS versions and VMtools versions and file system types and versions (bit-ness) between the "failing for backup VM" and other "working for backups VMs"... then, for starters, you/we will probably need to dig in to the bpfis log on the backup proxy/host. Are you ok doing that?

Marianne · ‎01-03-2017

bpbrm on the media server is failing with this error:
Error bpbrm (pid=78045) db_FLISTsend failed: file read failed (13)

Check bpbrm log on Appliance media server for more info.

Have you tried the backup without Accelerator?

Handy NetBackup Links

stucci · ‎01-04-2017

hello, I just tried without accelerator but the issue is the same, now I gather the logs, and I will update you

Jan 4, 2017 10:07:30 AM - Info bpbrm (pid=182310) salclt4220 is the host to backup data from
Jan 4, 2017 10:07:30 AM - Info bpbrm (pid=182310) reading file list for client
Jan 4, 2017 10:07:30 AM - Info bpbrm (pid=182310) starting bpbkar on client
Jan 4, 2017 10:07:30 AM - Info bpbkar (pid=182322) Backup started
Jan 4, 2017 10:07:30 AM - Info bpbrm (pid=182310) bptm pid: 182323
Jan 4, 2017 10:13:41 AM - Info nbjm (pid=9940) starting backup job (jobid=3258060) for client salclt4220, policy Esecuzioni_spot_settimo, schedule full
Jan 4, 2017 10:13:42 AM - estimated 0 kbytes needed
Jan 4, 2017 10:13:42 AM - Info nbjm (pid=9940) started backup (backupid=salclt4220_1483521221) job for client salclt4220, policy Esecuzioni_spot_settimo, schedule full on storage unit setnbu04 using backup host setnbu04
Jan 4, 2017 10:13:52 AM - started process bpbrm (pid=182310)
Jan 4, 2017 10:07:30 AM - Info bptm (pid=182323) start
Jan 4, 2017 10:07:31 AM - Info bptm (pid=182323) using 1048576 data buffer size
Jan 4, 2017 10:07:31 AM - Info bptm (pid=182323) using 512 data buffers
Jan 4, 2017 10:13:54 AM - connecting
Jan 4, 2017 10:13:54 AM - connected; connect time: 0:00:00
Jan 4, 2017 10:07:32 AM - Info bpbkar (pid=182322) INF - Backing up vCenter server svvcmot11.syssede.systest.sanpaoloimi.com, ESX host svwmop920.sede.corp.sanpaoloimi.com, BIOS UUID 4228ea9b-ac41-9a9a-c429-1ba02e6386bf, Instance UUID 502804b9-75eb-1926-353e-26d1f4a34ae9, Display Name salclt4220, Hostname salclt4220
Jan 4, 2017 10:07:49 AM - Info bptm (pid=182323) start backup
Jan 4, 2017 10:14:21 AM - begin writing
Jan 4, 2017 10:16:13 AM - Error bpbrm (pid=182310) db_FLISTsend failed: file read failed (13)
Jan 4, 2017 10:16:15 AM - Error bptm (pid=182323) media manager terminated by parent process
Jan 4, 2017 10:16:49 AM - Info setnbu04 (pid=182323) StorageServer=PureDisk:setnbu04; Report=PDDO Stats (multi-threaded stream used) for (setnbu04): scanned: 64515 KB, CR sent: 7576 KB, CR sent over FC: 0 KB, dedup: 88.3%, cache disabled
Jan 4, 2017 10:16:54 AM - Info bpbkar (pid=0) done
Jan 4, 2017 10:16:54 AM - Info bpbkar (pid=0) done. status: 13: file read failed
Jan 4, 2017 10:23:19 AM - end writing; write time: 0:08:58
file read failed  (13)

stucci · ‎01-04-2017

hello, in attachment you can find "/usr/openv/netbackup/logs/bpbrm" log. I tried to find on internet something for help me without success. Below you can find some extracts with alert upper than <4>

10:35:57.697 [241932.241932] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: file read failed (13)
10:36:33.732 [264551.264551] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:3728] vnet_get_user_credential_path failed 35 0x23
10:36:33.890 [264551.264551] <8> do_pbx_service: [vnet_connect.c:2302] via PBX VNETD CONNECT FROM 10.254.255.202.40089 TO 10.254.255.202.1556 fd = 5
10:36:33.891 [264551.264551] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:455] VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa
10:36:33.931 [264551.264551] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:480] ipc_string /usr/openv/var/tmp/vnet-64569483522593930667000000761-jty4DA

Marianne · ‎01-04-2017

I am honestly battling with level 5 logs...
Veritas Support engineers have tools to sift through these large files...

What I have found was this:

10:35:55.885 [241932.241932] <2> non_mpx_backup_archive_verify_import: from client salclt4220: INF - Files System Name to catalog /opt/oracle
10:35:55.885 [241932.241932] <2> non_mpx_backup_archive_verify_import: start to write catalog
...
10:35:55.885 [241932.241932] <2> ConnectionCache::connectAndCache: Acquiring new connection for host scapmop0307, query type 78
10:35:55.901 [241932.241932] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:35:55.902 [241932.241932] <2> logconnections: BPDBM CONNECT FROM 10.254.171.33.54938 TO 10.254.168.215.1556 fd = 8
10:35:56.026 [241932.241932] <2> db_end: Need to collect reply
10:35:57.697 [241932.241932] <2> db_end: no DONE from db_getreply(): file read failed
10:35:57.697 [241932.241932] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: file read failed (13)

What I see here is that the attempt from the media server to update the catalog on the master server failed.

You need bpdbm log on the master server to see what happened with the update attempt between 10:35:55 and 10:35:57.

Probably good idea to log a Support call with Veritas as well - they have the time and the tools to sift through level 5 logs...

Handy NetBackup Links

stucci · ‎01-05-2017

hello Marianne,

I searched also in bpdbm (master server) without success, nothing that can help me.

I found something in bpbkar log of the media server, maybe can help us, what do you think about:

3e-26d1f4a34ae9, Display Name salclt4220, Hostname salclt4220
16:53:02.118 [61200.61200] <2> vnet_pbxConnect: pbxConnectEx Succeeded
16:53:02.118 [61200.61200] <2> logconnections: BPRD CONNECT FROM 10.254.171.33.52372 TO 10.254.168.215.1556 fd = 7
16:53:02.743 [61200.61200] <4> stage_one_virtual: INF - Snapshot mapping type: ON
16:53:02.743 [61200.61200] <4> stage_one_virtual: INF - Snapshot BLIB type: OFF
16:53:02.815 [61200.61200] <4> NbuVddkInitializer: Using created vixDiskLib config file: /usr/openv/var/tmp/vixDiskLib_61200_
1483545182.ini
16:53:03.899 [61200.61200] <4> getSSLThumbprint: SSL Thumbprint for server svvcmot11.syssede.systest.sanpaoloimi.com = E2:F5:
23:6D:97:C5:0F:B0:EE:30:6B:37:58:56:B9:04:7A:1D:08:D0
17:01:04.322 [61200.61200] <4> get_virtual_volumes: INF - Found 14 virtual volume(s)
17:01:05.706 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /
17:01:05.723 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /
17:01:31.956 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /
17:01:31.956 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /
17:01:31.957 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /
17:01:33.325 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /boot
17:01:33.325 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /boot
17:01:33.346 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /boot
17:01:33.346 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /boot
17:01:33.346 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /boot
17:01:34.797 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /home
17:01:34.797 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /home
17:01:37.549 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /home
17:01:37.550 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /home
17:01:37.550 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /home
17:01:38.874 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt
17:01:38.877 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt
17:01:46.995 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt
17:01:46.996 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt
17:01:46.996 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt
17:01:48.434 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/jdk
17:01:48.435 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/jdk
17:01:57.032 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/jdk
17:01:57.032 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/jdk
17:01:57.033 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/jdk
17:02:01.183 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle
17:02:01.184 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle
17:02:07.844 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle
17:02:07.844 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle
17:02:07.844 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle
17:02:09.216 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle-mw
17:02:09.217 [61200.61200] <4> catalogName_PS: INF - Files System Name to catalog /opt/oracle-mw
17:02:32.659 [61200.61200] <8> ~NbuVddkInitializer: Unable to remove temporary file /usr/openv/var/tmp/vixDiskLib_61200_14835
45182.ini, errno 0
17:02:32.728 [61200.61200] <8> bpbkar Exit: FTL - cleanup() failed, status 63
17:02:32.728 [61200.61200] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 63: process was killed by a signal
17:02:32.728 [61200.61200] <2> keep_alive_log: KeepAliveThread::stopKeepAlive Waiting for keepalive thread to finish
17:02:32.728 [61200.61200] <4> bpbkar Exit: INF - EXIT STATUS 63: process was killed by a signal
17:02:32.745 [61200.61200] <4> bpbkar Exit: INF - setenv FINISHED=0
setnbu04:/usr/openv/netbackup/logs/bpbkar #

Naturally, I will proceed to open a case in Veritas, thank you very much for your support.

regards

Marianne · ‎01-05-2017

I saw the status 63 in bpbrm log as well.

If you Google status 63, you will see that there is an issue with catalog update on the master server.

If there is nothing useful in bpdbm log, then your logging level is probably not high enough.
For bpdbm log, you need to restart NBU when the folder is created as well as each time the logging level is changed.

I suggest that that you log a Support call with Veritas ASAP.
They WILL request a full set of level 5 logs.

Handy NetBackup Links

Marianne · ‎01-05-2017

Have a look at this TN:

http://www.veritas.com/docs/000076426

Touch file may help but I don't see the attempt to add file to catalog taking more than 10 minutes.

Extracts from TN:

You can change the batch size on the Media Server that is used to send metadata to the NetBackup Catalog during backups.

A change to the batch size can help in the following cases:

If backups fail because a query to add files to the catalog takes more than 10 minutes to complete.

In this case, the backup job fails, and the bpbrm log contains a message that indicates a failed attempt to add files to the catalog. Note that the bpdbm log does not contain a similar message.
To improve backup performance when the folders to back up contain a large number of small files or subfolders.

To adjust the batch size for sending metadata to the catalog:

Create the following file on the NetBackup Media Server:

For FlashBackup-Windows policies (including NetBackup for VMware and Hyper-V):

/usr/openv/netbackup/FBU_MAX_FILES_PER_ADD

For all other policy types (including FlashBackup for UNIX):

/usr/openv/netbackup/MAX_ENTRIES_PER_ADD

In the file, enter a value for the number of metadata entries to be sent to the catalog in each batch. The allowed values are from 1 to 100,000.

Note:

In NetBackup versions 7.5.0.0 through 7.5.0.6 and in version 7.6.0.1, the MAX_ENTRIES_PER_ADD limit is 50000.

In NetBackup version 7.5.0.7 and in versions 7.6.0.2 and later, the MAX_ENTRIES_PER_ADD limit is 100000

Handy NetBackup Links

stucci · ‎01-17-2017

hi all,

update, we have opened a veritas case and it seems that on this machine has mounted an unsupported filesystem EXT4. We are waiting of OS support confirm, I will update you

VOX

vmware backup fail in job status 13