Flashbackup-Windows socket read failure Err 13

Kev_Lamb
Level 6

Hi,

Having managed to get the Flashbackup-Windows policy to validate (the problem was with the firewall), I am now getting an Error 13 each time the backup runs.

The snapshot part of the policy completes without error and the backup starts OK, but it gets to about 400128 KB and 325500 files and then terminates with the following in the Activity Monitor:

01/13/2014 09:17:37 - Info bpbrm (pid=21133) lonbfbelvis1.corp.ad.timeinc.com is the host to backup data from
01/13/2014 09:17:37 - Info bpbrm (pid=21133) reading file list from client
01/13/2014 09:17:38 - Info bpbrm (pid=21133) starting bpbkar on client
01/13/2014 09:17:40 - Info bpbkar (pid=6580) Backup started
01/13/2014 09:17:40 - Info bpbrm (pid=21133) bptm pid: 21135
01/13/2014 09:17:40 - Info bptm (pid=21135) start
01/13/2014 09:17:40 - Info bptm (pid=21135) using 262144 data buffer size
01/13/2014 09:17:40 - Info bptm (pid=21135) using 30 data buffers
01/13/2014 09:17:40 - Info bptm (pid=21135) start backup
01/13/2014 09:17:45 - Info bptm (pid=21135) backup child process is pid 21152
01/13/2014 09:18:01 - Info nbjm (pid=16501) starting backup job (jobid=233765) for client lonbfbelvis1.corp.ad.timeinc.com, policy ELVIS1-SNAP-TEST, schedule Weekly-Full
01/13/2014 09:18:01 - estimated 0 kbytes needed
01/13/2014 09:18:01 - Info nbjm (pid=16501) started backup (backupid=lonbfbelvis1.corp.ad.timeinc.com_1389604681) job for client lonbfbelvis1.corp.ad.timeinc.com, policy ELVIS1-SNAP-TEST, schedule Weekly-Full on storage unit BFB-VMWARE-OST
01/13/2014 09:18:02 - started process bpbrm (pid=21133)
01/13/2014 09:18:03 - connecting
01/13/2014 09:18:03 - connected; connect time: 0:00:00
01/13/2014 09:18:11 - begin writing
01/13/2014 09:25:52 - Error bpbrm (pid=21133) socket read failed: errno = 104 - Connection reset by peer
01/13/2014 09:25:52 - Error bptm (pid=21152) system call failed - Connection reset by peer (at child.c.1298)
01/13/2014 09:25:52 - Error bptm (pid=21152) unable to perform read from client socket, connection may have been broken
01/13/2014 09:25:52 - Error bptm (pid=21135) media manager terminated by parent process
01/13/2014 09:26:01 - Error bpbrm (pid=21133) could not send server status message
01/13/2014 09:26:02 - Critical bpbrm (pid=21133) unexpected termination of client lonbfbelvis1.corp.ad.timeinc.com
01/13/2014 09:26:02 - Info bpbkar (pid=0) done. status: 13: file read failed
01/13/2014 09:26:27 - end writing; write time: 0:08:16
file read failed  (13)
 
The data area is approx 3 TB in size and is made up of millions of small files. It sits on a 10 TB disk, so I know that the snap space is adequate (more than 15%). Could this solely be down to I/O issues whilst backing up the snap from the same disk? Also, are there any rules that need to be followed when using an alternative host for the backup I/O?
 
The backup is currently being performed onto a B6200 using OST.
 
Kev
Attitude is a small thing that makes a BIG difference
1 ACCEPTED SOLUTION

Accepted Solutions

Kev_Lamb
Level 6

Looks like we fixed this problem ourselves. We found that the application was creating recursive directories for whatever reason, and one of these was also causing Storage Essentials to fail on a discovery. We scanned the area and it came back OK, so we deleted the directory and re-ran the SE discovery, which worked (we still have a few other recursive directories, but these seem OK). I then re-ran the Flashbackup policy and it is now working OK.

Not sure what was wrong with the directory, but that was definitely the issue.
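
For anyone hunting for a similar directory, here is a minimal sketch of one way to flag suspiciously deep or self-repeating directory chains. The thread does not say how the bad directory was actually found, so the function name `find_suspect_dirs`, the depth threshold, and the "same name repeated many times in one path" heuristic are all assumptions for illustration only.

```python
import os
from collections import Counter


def find_suspect_dirs(root, max_depth=40, max_repeats=4):
    """Return (path, reason) pairs for directories that look recursive.

    max_depth and max_repeats are arbitrary illustrative thresholds,
    not values taken from NetBackup or Storage Essentials.
    """
    root = os.path.abspath(root)
    suspects = []
    for dirpath, dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        reason = None
        if len(parts) > max_depth:
            reason = f"depth {len(parts)} exceeds {max_depth}"
        elif parts:
            # Most frequent path component, e.g. "data" in data/data/data/...
            name, count = Counter(parts).most_common(1)[0]
            if count > max_repeats:
                reason = f"component {name!r} repeats {count} times"
        if reason:
            suspects.append((dirpath, reason))
            dirnames[:] = []  # do not descend further into a flagged tree
    return suspects
```

Running it against the data volume (or a mounted snapshot) and reviewing the flagged paths by hand is safer than deleting anything automatically; a legitimately deep tree will also trip these thresholds.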

 

Kev


29 REPLIES

Mark_Solutions
Level 6
Partner Accredited Certified

Although it may be hitting a corrupt file, the most likely cause is a timeout.

Increase the Client Read timeout on your Media Server to 3600 and see if that resolves it for you

If the job fails after an hour next time then you will need to increase it further
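
On a UNIX or Linux media server this setting is normally stored as a `CLIENT_READ_TIMEOUT` entry in `bp.conf`. As a minimal sketch, assuming the usual `KEY = value` layout of that file, the following reads the value from the file's text so it can be checked on each media server. The helper name `client_read_timeout` and the sample contents (including the host name) are hypothetical.

```python
def client_read_timeout(bp_conf_text):
    """Return CLIENT_READ_TIMEOUT in seconds from bp.conf-style text.

    Returns None when the entry is absent; the first matching entry wins.
    """
    for line in bp_conf_text.splitlines():
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "CLIENT_READ_TIMEOUT":
            return int(raw.strip())
    return None


# Hypothetical sample contents, not taken from the poster's servers.
sample = """SERVER = master.example.com
CLIENT_READ_TIMEOUT = 3600
"""
print(client_read_timeout(sample))
```

A result of None only means the entry is not set explicitly in that file, in which case the product default applies.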

Hope this helps

Kev_Lamb
Level 6

Hi,

I have increased the client read timeout to 3600 but it is failing at the same point. If this is a corrupt file, is there any way to determine which one it is, and what log should I be looking at?

 

Thanks

 

Kev


Mark_Solutions
Level 6
Partner Accredited Certified

You would want the bpfis (in case it crashes during the snapshot process) and the bpbkar logs - both with the logging level increased

These could become very large for this type of job but if it fails quickly then you may be OK

bpcd would also be useful in case it is a communications issue

Kev_Lamb
Level 6

Thanks Mark,

the bpfis is clean and the snapshot works without any problems, I will take a look at the other two logs and get back to the forum

 

Kev


Mark_Solutions
Level 6
Partner Accredited Certified

If the bpbkar log doesn't give enough, there is a touch file you can add to make it write every file it backs up into the bpbkar log. Hopefully you won't need that, but let's see what you get first.

Marianne
Level 6
Partner    VIP    Accredited Certified

Backup fails within 7 minutes - I doubt that this is a timeout:

01/13/2014 09:18:11 - begin writing
01/13/2014 09:25:52 - Error bpbrm (pid=21133) socket read failed: errno = 104 - Connection reset by peer

Please check bpbrm and bptm logs on media server in addition to client's bpbkar log.

You may need level 5 logs to trace all info/comms.

Please ask firewall admins once again to monitor network traffic between the media server and client when backup is kicked off. The problem may still be with the firewall, terminating the connection.
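
The elapsed time between the two log lines quoted above can be checked directly. This is a minimal sketch that assumes every Activity Monitor line starts with a `MM/DD/YYYY HH:MM:SS` timestamp, as the lines in this thread do; the helper name `elapsed_seconds` is made up for illustration, and 3600 is the client read timeout value discussed earlier in the thread.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps that prefix two job-detail lines."""
    def stamp(line):
        # The first 19 characters hold "MM/DD/YYYY HH:MM:SS".
        return datetime.strptime(line[:19], FMT)
    return (stamp(end_line) - stamp(start_line)).total_seconds()


begin = "01/13/2014 09:18:11 - begin writing"
error = ("01/13/2014 09:25:52 - Error bpbrm (pid=21133) socket read failed: "
         "errno = 104 - Connection reset by peer")

secs = elapsed_seconds(begin, error)
print(secs, secs < 3600)  # 461.0 True
```

At 461 seconds the job dies far short of a 3600 second timeout, which supports looking at the connection or the client process rather than at the timeout value.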

Kev_Lamb
Level 6

It certainly looks like something is dropping; I just cannot fathom at what point and why. I have attached the bpbrm and bptm logs from the media server that is being used for the backup, and I will engage our F/W team to take a look as well.

Thanks in advance for all the assistance with this.

 

Kev

 


Mark_Solutions
Level 6
Partner Accredited Certified

from the log:

bpbrm Exit: unexpected termination of client lonbfbelvis1.corp.ad.timeinc.com
 

and a few in the other log

Sounds like a crash or an Anti Virus interruption - you don't use McAfee, do you?

Anything in the event logs?

Is the client OK for free disk space?

#edit#

Oddly though, when looking at the logs the job kicks in at 13:11 and bombs out at 13:21 - that is 10 minutes - are you sure your client read timeouts are all set?

Kev_Lamb
Level 6

Hi Mark,

The disk is 10 TB and holds only 3 TB of data, so the snap area is OK (> 15%).

Just looked at the application logs on the server and got this at 13:21

Log Name:      Application
Source:        Application Error
Date:          13/01/2014 13:21:41
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      lonbfbelvis1.corp.ad.timeinc.com
Description:
Faulting application name: bpbkar32.exe, version: 7.500.412.916, time stamp: 0x5055e1c4
Faulting module name: ntdll.dll, version: 6.1.7601.17725, time stamp: 0x4ec4aa8e
Exception code: 0xc00000fd
Fault offset: 0x0000000000054f4a
Faulting process id: 0x238c
Faulting application start time: 0x01cf1060f7a3740d
Faulting application path: C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: a333992d-7c55-11e3-b3a0-90b11c18ad6d
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Application Error" />
    <EventID Qualifiers="0">1000</EventID>
    <Level>2</Level>
    <Task>100</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-01-13T13:21:41.000000000Z" />
    <EventRecordID>97590</EventRecordID>
    <Channel>Application</Channel>
    <Computer>lonbfbelvis1.corp.ad.timeinc.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>bpbkar32.exe</Data>
    <Data>7.500.412.916</Data>
    <Data>5055e1c4</Data>
    <Data>ntdll.dll</Data>
    <Data>6.1.7601.17725</Data>
    <Data>4ec4aa8e</Data>
    <Data>c00000fd</Data>
    <Data>0000000000054f4a</Data>
    <Data>238c</Data>
    <Data>01cf1060f7a3740d</Data>
    <Data>C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe</Data>
    <Data>C:\Windows\SYSTEM32\ntdll.dll</Data>
    <Data>a333992d-7c55-11e3-b3a0-90b11c18ad6d</Data>
  </EventData>
</Event>
 
The client read timeout has been set to 3600 on all the media servers, just in case the job decided to use a different server. Looking at the above error, could this indicate a faulty install?
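
Before suspecting the install, it can help to decode the exception code in the event. Windows exception codes are NTSTATUS values, and 0xc00000fd is `STATUS_STACK_OVERFLOW`. The sketch below is only a lookup over a small hand-picked table of well-known codes, not a complete list; the names `KNOWN_NTSTATUS` and `describe_exception` are made up for illustration.

```python
# Tiny hand-picked subset of NTSTATUS values; the real list is far longer.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exception(code_text):
    """Return (name, severity) for a hex exception code string.

    Severity is the top two bits of the 32-bit value: 3 means error.
    """
    code = int(code_text, 16)  # accepts "0xc00000fd" or "c00000fd"
    return KNOWN_NTSTATUS.get(code, "unknown"), code >> 30


print(describe_exception("0xc00000fd"))
```

A stack overflow says that bpbkar32.exe ran out of stack while the fault was raised inside ntdll.dll; on its own it does not show whether the install, the data being read, or something else is responsible.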
 
Kev

Kev_Lamb
Level 6

Just had a look on the Symantec tech notes and found this http://www.symantec.com/business/support/index?page=content&id=TECH202598

 

Looks like a reinstall of the client is required...

 

 


Mark_Solutions
Level 6
Partner Accredited Certified

There are a lot of fixes in 7.5.0.6 / 7

It may just be worth trying 7.5.0.7 on that client first to see if that resolves it

But if you do have any Anti Virus / Access protection running on the client make sure all NBU processes are excluded

#edit#

That server doesn't have DFSR running on it, does it? If so, that causes the crash as well, so 7.5.0.7 would help.

Kev_Lamb
Level 6

Can I use NBU 7.5.0.7 on the client with a lower revision on the media server? I didn't think that was OK. I am looking to upgrade the Master/Media servers to 7.5.0.7 in the next 4 weeks or so.

I could try putting the 7.5.0.6 patch on the client and see what happens. If this requires a reboot (like most things on Windows) then it will have to be change controlled, as it is a production server.

 

Kev


Mark_Solutions
Level 6
Partner Accredited Certified

Double dot releases are OK - you could have a 7.5.0.1 Master with 7.5.0.6 client

As long as the Master and Media are 7.5.0.x then the client can be 7.5.0.6 or 7.5.0.7
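
As a minimal sketch, the rule of thumb above (same first three version components, any fourth component) can be written as a quick check. This only encodes the statement made in this post and is not a substitute for the vendor's compatibility list; the function name `same_release_family` is hypothetical.

```python
def same_release_family(server_version, client_version):
    """True when both versions share the first three components (e.g. 7.5.0)."""
    return server_version.split(".")[:3] == client_version.split(".")[:3]


print(same_release_family("7.5.0.1", "7.5.0.6"))  # True
print(same_release_family("7.1.0.4", "7.5.0.7"))  # False
```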

#edit#

Doesn't usually need a reboot on Windows - not since we got rid of VSP in 6.x

Kev_Lamb
Level 6

In that case I will place the 7.5.0.7 patch on the client. I have raised a change for Wednesday evening, so I will let you know if this works.

 

Thanks for all your help

 

Kev


Kev_Lamb
Level 6

Performed an upgrade to 7.5.0.7 and ran the backup. Again this failed after 10 minutes with the dll error in the Windows application log, so I uninstalled NetBackup, removed the directories, and then did a fresh client install of 7.5 followed by the 7.5.0.7 patch.

Ran the backup again. This created a larger shadow copy on the disk than previously, ran for approx 19 minutes, and then bombed out again with the same error.

Log Name:      Application
Source:        Application Error
Date:          15/01/2014 20:39:57
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      lonbfbelvis1.corp.ad.timeinc.com
Description:
Faulting application name: bpbkar32.exe, version: 7.500.713.1123, time stamp: 0x5291767f
Faulting module name: ntdll.dll, version: 6.1.7601.17725, time stamp: 0x4ec4aa8e
Exception code: 0xc00000fd
Fault offset: 0x0000000000054f3c
Faulting process id: 0x163c
Faulting application start time: 0x01cf122f498f886d
Faulting application path: C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: 31cbf9ed-7e25-11e3-bc49-90b11c18ad6d
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Application Error" />
    <EventID Qualifiers="0">1000</EventID>
    <Level>2</Level>
    <Task>100</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-01-15T20:39:57.000000000Z" />
    <EventRecordID>97969</EventRecordID>
    <Channel>Application</Channel>
    <Computer>lonbfbelvis1.corp.ad.timeinc.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>bpbkar32.exe</Data>
    <Data>7.500.713.1123</Data>
    <Data>5291767f</Data>
    <Data>ntdll.dll</Data>
    <Data>6.1.7601.17725</Data>
    <Data>4ec4aa8e</Data>
    <Data>c00000fd</Data>
    <Data>0000000000054f3c</Data>
    <Data>163c</Data>
    <Data>01cf122f498f886d</Data>
    <Data>C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe</Data>
    <Data>C:\Windows\SYSTEM32\ntdll.dll</Data>
    <Data>31cbf9ed-7e25-11e3-bc49-90b11c18ad6d</Data>
  </EventData>
</Event>
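
To compare this event with the one from 13/01/2014 without reading both by eye, the `<Data>` values can be pulled out of the Event Xml. This is a minimal sketch: it assumes the positional layout seen in the two events in this thread (application version second, faulting module fourth, exception code seventh), and `TEMPLATE` is a trimmed stand-in for the full XML, filled with values copied from those events. The names `crash_fields` and `TEMPLATE` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def crash_fields(event_xml):
    """Extract a few fields from an Application Error (Event ID 1000) XML."""
    root = ET.fromstring(event_xml)
    data = [d.text for d in root.findall("e:EventData/e:Data", NS)]
    # Positions follow the order seen in the events pasted in this thread.
    return {"app_version": data[1], "module": data[3], "exception": data[6]}


# Trimmed stand-in for the pasted Event Xml (System block omitted).
TEMPLATE = (
    '<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">'
    "<EventData>"
    "<Data>bpbkar32.exe</Data><Data>{version}</Data><Data>{stamp}</Data>"
    "<Data>ntdll.dll</Data><Data>6.1.7601.17725</Data><Data>4ec4aa8e</Data>"
    "<Data>c00000fd</Data>"
    "</EventData></Event>"
)

first = crash_fields(TEMPLATE.format(version="7.500.412.916", stamp="5055e1c4"))
second = crash_fields(TEMPLATE.format(version="7.500.713.1123", stamp="5291767f"))
print(first["exception"] == second["exception"], first["module"] == second["module"])
```

Both events carry the same exception code in the same module even though the bpbkar32.exe version changed, so the upgrade and the clean reinstall did not alter how the process fails.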
 
Stuck now as to why the Win ntdll.dll is causing an issue, unless that file is corrupt?
 
Could this be an issue with the server being in a DMZ? I am just clutching at straws now to find out why this would continue to fail.

CRZ
Level 6
Employee Accredited Certified

Could you be experiencing the issue in this TechNote?

The NetBackup bpbkar32.exe process crashes with each backup job
 http://symantec.com/docs/TECH210854

Check your bpbkar log.  You might need to open a case and get an escalation if the listed workaround (which this UNIX guy does not understand) doesn't help.

(I found this document by doing a SymWISE search on "ntdll.dll" [no quotes] with NetBackup Enterprise Server specified as the product.)

Mark_Solutions
Level 6
Partner Accredited Certified

Just in case it is struggling to cope could you try this...

Open the media server's host properties, but from the Clients section (you may have to add it to a test policy if you don't see it there)

Go to the Windows section and then Client Settings section

At the very bottom is the Raw partition read buffer size setting

It defaults to 32 KB - change it to 1024 and try it again.

I am fairly sure it is the media server that needs this, but it won't hurt to change it on the client too.

It may need a service restart to register this change

See if that helps at all

Kev_Lamb
Level 6

Hi,

Many thanks for that link CRZ, I will give that a go and let you know.

 

Mark,

All my NBU estate is on RHEL, so there is no option on the media server to change that parameter. I have modified it on the client and will run a test to see what happens.


Kev_Lamb
Level 6

Hi Mark,

After making that change on the client, this has still bombed out with a file read failure (13) after approx 7 minutes of running. I will try the suggestion from CRZ, but this will have to be a change request as it means rebooting the client a couple of times.

 

Kev
