cancel
Showing results for 
Search instead for 
Did you mean: 

VMware Policy Backup - Network Connection Broken (40)

SYMAJ
Level 6
Partner Accredited

I have a Netbackup Environment with a Virtual Master Server running 7506 and a 5230 Appliance running 2.5.2 (7505).  There are a small number of Physical clients (Windows and Linux) and many VMware clients.

Whilst most of the VMware clients (ESX5) are working successfully (120+) I have two which are giving major problems.  The ESX LUNs are presented to the 5230 over SAN and we are authenticating to the vCenter server to control backups.  Policy type is VMWARE.

The two servers in question have a large number of files, over 1TB total size each.

One of the servers, following the backup being initiaited, performs the VMware snapshot, starts the backup job and then sits there for some time before failing with a status 40 - Network Connection Broken.  Which network conection ?  As this is a VMware policy type I am not talking directly to the VM, it is being snapped and then presented to the 5230 as a local image.  I have extended the client read timeouts on both the master and 5230 appliance with no success.

This job has run successfully in the past numerous times - both FULL and DIFF.

Failed Job log:

10/09/2013 10:53:38 - Info nbjm(pid=3704) starting backup job (jobid=4321) for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule DAILY 
10/09/2013 10:53:38 - estimated 1388605664 Kbytes needed
10/09/2013 10:53:38 - Info nbjm(pid=3704) started backup (backupid=DUB-APPS-001_1378806818) job for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule DAILY on storage unit DUB-5230-DeDup
10/09/2013 10:53:39 - started process bpbrm (29973)
10/09/2013 10:53:41 - Info bpbrm(pid=29973) DUB-APPS-001 is the host to backup data from    
10/09/2013 10:53:41 - Info bpbrm(pid=29973) reading file list from client       
10/09/2013 10:53:41 - Info bpbrm(pid=29973) starting bpbkar on client        
10/09/2013 10:53:41 - Info bpbkar(pid=29997) Backup started          
10/09/2013 10:53:41 - Info bpbrm(pid=29973) bptm pid: 29998         
10/09/2013 10:53:41 - Info bptm(pid=29998) start           
10/09/2013 10:53:41 - connecting
10/09/2013 10:53:41 - connected; connect time: 00:00:00
10/09/2013 10:53:42 - Info bptm(pid=29998) using 1048576 data buffer size       
10/09/2013 10:53:42 - Info bptm(pid=29998) using 64 data buffers        
10/09/2013 10:53:42 - Info bptm(pid=29998) start backup          
10/09/2013 10:54:25 - begin writing
10/09/2013 12:01:17 - Error bptm(pid=29998) media manager terminated by parent process      
10/09/2013 12:01:24 - Info dub-5230(pid=29998) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 3 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 100.0%
10/09/2013 12:01:25 - Info bpbkar(pid=0) done. status: 40: network connection broken      
10/09/2013 12:01:25 - end writing; write time: 01:07:00
network connection broken(40)
10/09/2013 12:11:28 - Info bpbrm(pid=11144) DUB-APPS-001 is the host to backup data from    
10/09/2013 12:11:28 - Info bpbrm(pid=11144) reading file list from client       
10/09/2013 12:11:28 - Info bpbrm(pid=11144) starting bpbkar on client        
10/09/2013 12:11:28 - Info bpbkar(pid=11151) Backup started          
10/09/2013 12:11:28 - Info bpbrm(pid=11144) bptm pid: 11152         
10/09/2013 12:11:28 - Info bptm(pid=11152) start           
10/09/2013 12:11:29 - Info bptm(pid=11152) using 1048576 data buffer size       
10/09/2013 12:11:29 - Info bptm(pid=11152) using 64 data buffers        
10/09/2013 12:11:29 - Info bptm(pid=11152) start backup     

Following this the job just sits there doing nothing.

Successful Job Log:

07/09/2013 02:59:09 - Info nbjm(pid=3984) starting backup job (jobid=4051) for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule MONTHLY 
07/09/2013 02:59:09 - estimated 0 Kbytes needed
07/09/2013 02:59:09 - Info nbjm(pid=3984) started backup (backupid=DUB-APPS-001_1378519149) job for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule MONTHLY on storage unit DUB-5230-DeDup
07/09/2013 02:59:11 - started process bpbrm (7231)
07/09/2013 02:59:12 - Info bpbrm(pid=7231) DUB-APPS-001 is the host to backup data from    
07/09/2013 02:59:12 - Info bpbrm(pid=7231) reading file list from client       
07/09/2013 02:59:12 - connecting
07/09/2013 02:59:13 - Info bpbrm(pid=7231) starting bpbkar on client        
07/09/2013 02:59:13 - Info bpbkar(pid=7238) Backup started          
07/09/2013 02:59:13 - Info bpbrm(pid=7231) bptm pid: 7239         
07/09/2013 02:59:13 - Info bptm(pid=7239) start           
07/09/2013 02:59:13 - Info bptm(pid=7239) using 1048576 data buffer size       
07/09/2013 02:59:13 - Info bptm(pid=7239) using 64 data buffers        
07/09/2013 02:59:13 - connected; connect time: 00:00:01
07/09/2013 02:59:14 - Info bptm(pid=7239) start backup          
07/09/2013 02:59:51 - begin writing
07/09/2013 04:00:53 - Info bpbkar(pid=7238) INF - Transport Type = san      
07/09/2013 20:25:04 - Info bpbkar(pid=7238) bpbkar waited 124 times for empty buffer, delayed 77660 times  
07/09/2013 20:25:04 - Info bptm(pid=7239) waited for full buffer 848100 times, delayed 3160597 times   
07/09/2013 20:25:29 - Info bptm(pid=7239) EXITING with status 0 <----------       
07/09/2013 20:25:29 - Info dub-5230(pid=7239) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 1148749424 KB, CR sent: 57710289 KB, CR sent over FC: 0 KB, dedup: 95.0%
07/09/2013 20:25:29 - Info bpbrm(pid=7231) validating image for client DUB-APPS-001       
07/09/2013 20:25:29 - end writing; write time: 17:25:38
07/09/2013 20:25:30 - Info bpbkar(pid=7238) done. status: 0: the requested operation was successfully completed   
the requested operation was successfully completed(0)

     

The other server is more promising, but runs extremely slowly and would take days to finish if let run.  This job has run successfully (FULL) twice and secured 1.6TB in 14hrs.

Failed Job Log:

09/09/2013 16:00:45 - Info nbjm(pid=3704) starting backup job (jobid=4227) for client DUB-PTFILE-P01, policy AEGON-PROD-VMWARE-ADHOC, schedule MONTHLY 
09/09/2013 16:00:45 - estimated 0 Kbytes needed
09/09/2013 16:00:45 - Info nbjm(pid=3704) started backup (backupid=DUB-PTFILE-P01_1378738845) job for client DUB-PTFILE-P01, policy AEGON-PROD-VMWARE-ADHOC, schedule MONTHLY on storage unit DUB-5230-DeDup
09/09/2013 16:00:46 - started process bpbrm (459)
09/09/2013 16:00:48 - Info bpbrm(pid=459) DUB-PTFILE-P01 is the host to backup data from    
09/09/2013 16:00:48 - Info bpbrm(pid=459) reading file list from client       
09/09/2013 16:00:48 - Info bpbrm(pid=459) starting bpbkar on client        
09/09/2013 16:00:48 - Info bpbkar(pid=467) Backup started          
09/09/2013 16:00:48 - Info bpbrm(pid=459) bptm pid: 468         
09/09/2013 16:00:48 - connecting
09/09/2013 16:00:48 - connected; connect time: 00:00:00
09/09/2013 16:00:49 - Info bptm(pid=468) start           
09/09/2013 16:00:49 - Info bptm(pid=468) using 1048576 data buffer size       
09/09/2013 16:00:49 - Info bptm(pid=468) using 64 data buffers        
09/09/2013 16:00:50 - Info bptm(pid=468) start backup          
09/09/2013 16:00:51 - begin writing
10/09/2013 10:42:07 - Error bptm(pid=468) media manager terminated by parent process      
10/09/2013 10:42:14 - Info dub-5230(pid=468) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 3636227 KB, CR sent: 342397 KB, CR sent over FC: 0 KB, dedup: 90.6%
10/09/2013 10:42:15 - Info bpbkar(pid=0) done           
10/09/2013 10:42:15 - Info bpbkar(pid=0) done. status: 150: termination requested by administrator     
10/09/2013 10:42:15 - end writing; write time: 18:41:24
termination requested by administrator(150)

Successful job log:

10/08/2013 00:13:37 - Info nbjm(pid=3836) starting backup job (jobid=1681) for client DUB-PTFILE-P01, policy Daily_None__Weekly_00.01__Monthly_00.01__VMware, schedule WEEKLY 
10/08/2013 00:13:37 - estimated 0 Kbytes needed
10/08/2013 00:13:37 - Info nbjm(pid=3836) started backup (backupid=DUB-PTFILE-P01_1376090017) job for client DUB-PTFILE-P01, policy Daily_None__Weekly_00.01__Monthly_00.01__VMware, schedule WEEKLY on storage unit DUB-5230-DeDup
10/08/2013 00:13:38 - started process bpbrm (23623)
10/08/2013 00:13:40 - Info bpbrm(pid=23623) DUB-PTFILE-P01 is the host to backup data from    
10/08/2013 00:13:40 - Info bpbrm(pid=23623) reading file list from client       
10/08/2013 00:13:40 - connecting
10/08/2013 00:13:40 - connected; connect time: 00:00:00
10/08/2013 00:13:41 - Info bpbrm(pid=23623) starting bpbkar on client        
10/08/2013 00:13:41 - Info bpbkar(pid=23645) Backup started          
10/08/2013 00:13:41 - Info bpbrm(pid=23623) bptm pid: 23648         
10/08/2013 00:13:41 - Info bptm(pid=23648) start           
10/08/2013 00:13:41 - Info bptm(pid=23648) using 1048576 data buffer size       
10/08/2013 00:13:41 - Info bptm(pid=23648) using 64 data buffers        
10/08/2013 00:13:42 - Info bptm(pid=23648) start backup          
10/08/2013 00:13:51 - begin writing
10/08/2013 08:09:45 - Info bpbkar(pid=23645) INF - Transport Type = san      
10/08/2013 13:56:30 - Info bpbkar(pid=23645) bpbkar waited 389820 times for empty buffer, delayed 1140862 times  
10/08/2013 13:56:30 - Info bptm(pid=23648) waited for full buffer 107426 times, delayed 1894435 times   
10/08/2013 13:59:14 - Info bptm(pid=23648) EXITING with status 0 <----------       
10/08/2013 13:59:15 - Info dub-5230(pid=23648) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 1617774661 KB, CR sent: 187714934 KB, CR sent over FC: 0 KB, dedup: 88.4%
10/08/2013 13:59:16 - Info bpbrm(pid=23623) validating image for client DUB-PTFILE-P01       
10/08/2013 13:59:17 - end writing; write time: 13:45:26
10/08/2013 13:59:18 - Info bpbkar(pid=23645) done. status: 0: the requested operation was successfully completed   
the requested operation was successfully completed(0)

 

I note in both of the above FAILED joblogs we never get the message INF - Transport Type = san, whereas we do in the successful jobs.

I am unaware of any other changes in the environment which have been made - and in either case why would it affect only these two VM clients ?  As a test I have now installed the 7506 agent on each of the VM's and am running an MS-Windows policy typw with Accelerator enabled and Client Side Dedup to test if the job will complete in that manner.  However - I do not want to progress this way, and both jobs worked perfectly previously ??

 

Any pointers ?

AJ

24 REPLIES 24

cruisen
Level 6
Partner Accredited

Hi Symaj,

 

does the upgrade solve the issue?

 

Best regards,

 

Cruisen

SYMAJ
Level 6
Partner Accredited

Cruisen,

No - this did not resolve the issues.  Both issues subsequently raised with support (not by myself) - I will get an update and post.

AJ 

halo
Level 2

Cruisen,

Have you heard anything back from support, or the customer about this issue as I am facing a similiar thing.

Thanks

J

Daniel_W
Level 2
Partner Accredited

I know this is an older thread, but here's hoping.

 

I have a customer that's seeing the same or similar issue.

Windows 2k3r2 Master running 7.5.0.6

NBU 5230 Appliance running 2.5.3 (7.5.0.6)

Dell Compellent Disk Array SAN-attached to Appliance (down a single path)

ESXi 5.1 with vCenter 5.1 server.

 

We're getting quite a lot of intermittent status code 40's on VMs of all sizes.

 

Were you able to get any traction on this issue?

sdo
Moderator
Moderator
Partner    VIP    Certified

Curious as to what is shown by:

$ bpgetconfig -M n5230-appliance-name sectorsperbuffer

...aka "raw partition buffer read size".