Solved: Info bpbkar32(pid=3124) done. status: 44: network ...

Zahid_Haseeb · ‎01-25-2012

Environment

Veritas Netbackup = 7.1

OS of Netbackup = win2008

Tape Library attached with six drives

Problem

I am running a catalog backup to a tape cartridge. At around 80% completion, the backup failed.

1/26/2012 10:36:15 AM - Info nbjm(pid=4108) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=143574, request id:{12B409E1-989A-442B-A1EA-8F10361D8B3D})
1/26/2012 10:36:15 AM - requesting resource NBU-Server-hcart-robot-tld-0
1/26/2012 10:36:15 AM - requesting resource NBU-Server.NBU_CLIENT.MAXJOBS.NBU
1/26/2012 10:36:15 AM - requesting resource NBU-Server.NBU_POLICY.MAXJOBS.Catalog_Backup
1/26/2012 10:36:15 AM - awaiting resource NBU-Server-hcart-robot-tld-0 - No drives are available
1/26/2012 10:39:53 AM - Info bpbrm(pid=5288) NBU-Server is the host to backup data from
1/26/2012 10:39:53 AM - Info bpbrm(pid=5288) reading file list from client
1/26/2012 10:39:53 AM - granted resource NBU-Server.NBU_CLIENT.MAXJOBS.NBU
1/26/2012 10:39:53 AM - granted resource NBU-Server.NBU_POLICY.MAXJOBS.Catalog_Backup
1/26/2012 10:39:53 AM - granted resource 0014L5
1/26/2012 10:39:53 AM - granted resource IBM.ULT3580-TD5.002
1/26/2012 10:39:53 AM - granted resource NBU-Server-hcart-robot-tld-0
1/26/2012 10:39:53 AM - estimated 48899691 Kbytes needed
1/26/2012 10:39:53 AM - Info nbjm(pid=4108) started backup job for client NBU-Server, policy Catalog_Backup, schedule Full on storage unit NBU-hcart-robot-tld-0
1/26/2012 10:39:53 AM - started process bpbrm (5288)
1/26/2012 10:39:53 AM - connecting
1/26/2012 10:39:54 AM - Info bpbrm(pid=5288) starting bpbkar32 on client
1/26/2012 10:39:54 AM - connected; connect time: 00:00:01
1/26/2012 10:39:55 AM - Info bpbkar32(pid=3124) Backup started
1/26/2012 10:39:55 AM - Info bptm(pid=960) start
1/26/2012 10:39:55 AM - Info bptm(pid=960) using 65536 data buffer size
1/26/2012 10:39:55 AM - Info bptm(pid=960) setting receive network buffer to 263168 bytes
1/26/2012 10:39:55 AM - Info bptm(pid=960) using 30 data buffers
1/26/2012 10:39:55 AM - Info bptm(pid=960) start backup
1/26/2012 10:39:55 AM - Info bptm(pid=960) Waiting for mount of media id 0014L5 (copy 1) on server NBU-Server.
1/26/2012 10:39:55 AM - mounting 0014L5
1/26/2012 10:40:32 AM - Info bptm(pid=960) media id 0014L5 mounted on drive index 2, drivepath {3,0,4,0}, drivename IBM.ULT3580-TD5.002, copy 1
1/26/2012 10:40:32 AM - mounted; mount time: 00:00:37
1/26/2012 10:40:32 AM - positioning 0014L5 to file 281
1/26/2012 10:41:30 AM - positioned 0014L5; position time: 00:00:58
1/26/2012 10:41:30 AM - begin writing
1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.
1/26/2012 10:55:28 AM - Error bpbrm(pid=5288) db_FLISTsend failed: network write failed (44)
1/26/2012 11:00:28 AM - Error bpbrm(pid=5288) could not send server status message
1/26/2012 11:00:28 AM - end writing; write time: 00:18:58
1/26/2012 11:00:33 AM - Info bpbkar32(pid=3124) done. status: 44: network write failed
network write failed(44)

Mark_Solutions · ‎02-10-2012

Also check out the post I did earlier in the thread to tune your network - did you do these and reboot the server?:

1. Add the following registry key to the Master Server:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

DWORD – TcpTimedWaitDelay - Decimal Value of 30

2. From a command prompt started with "Run as administrator":

Netsh int ipv4 set dynamicport tcp start=10000 num=50000

This gives it 50000 dynamic ports (10000 to 59999); the default is 16384 (49152 to 65535).

Also, what NET_BUFFER_SZ values are you using (if any - \netbackup\bin\)?
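
As a quick check of the arithmetic behind the netsh command in step 2, the sketch below works out which ports a given start/num pair covers. The function name is made up for illustration, and the default of start=49152, num=16384 is my assumption for Windows Server 2008; `netsh int ipv4 show dynamicport tcp` will show what a given server actually uses.

```python
def dynamic_port_range(start, num):
    """Return (first, last) port for a dynamic range of `num` ports starting at `start`."""
    last = start + num - 1
    if last > 65535:
        raise ValueError("range runs past the highest TCP port (65535)")
    return start, last


# Values from the netsh command in step 2.
print(dynamic_port_range(10000, 50000))   # (10000, 59999)

# Assumed Windows Server 2008 default, for comparison.
print(dynamic_port_range(49152, 16384))   # (49152, 65535)
```

So `num` is the number of ports available, and the top of the range is start + num - 1.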


mph999 · ‎01-26-2012

Please ensure you check out this post when starting new threads.

https://www-secure.symantec.com/connect/blogs/minimum-information-required-when-logging-problem-deta...

Did this ever work?

If it did, when did it last work?

What has changed? (If it did work, something HAS changed.)

Is the master server also the media server running the backup?

Does it fail every time, always at the same point?

Can you run other similar-sized backups from this server? Run a test if necessary.

From this, the performance looks poor :

1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000

This could well be the main issue :

1/26/2012 11:00:33 AM - Info bpbkar32(pid=3124) done. status: 44: network write failed
network write failed(44)

The main question I guess at the moment, is if the master is acting as the media server.

If the master is the media server, then the problem may be a little more complex.

If the media server is separate from the master server, I would suspect the Network.

Martin

Yasuhisa_Ishika · ‎01-26-2012

This may help...or not.

Anonymous · ‎01-26-2012

Try running the backup to NULL to test local performance.
http://www.symantec.com/docs/TECH17541

Then move onto checking disk defragmentation.

Check that the network speeds on uplinks and switch ports match. (I had a mismatch last week on a new client, and correcting it fixed the problem.)

mph999 · ‎01-26-2012

Also, please explain what troubleshooting steps you have taken, and what you suspect the issue to be.

Apologies, but I am unsure why Accredited members are not posting up some seriously detailed troubleshooting steps, and are not running basic testing before logging a new post.

These lines:

1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.

... should show that some performance testing needs to be done. They show that the issue is most likely between the client and the media server (which could be the same box), hence my questions and Stuart's suggestions.
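
To compare these counters from one test run to the next, a small script can pull them out of the job details. This is only a sketch: the two patterns are written against the exact wording of the two lines quoted above, and other NetBackup versions may word them differently.

```python
import re

# Patterns assumed from the two log lines quoted in this thread.
BPTM = re.compile(
    r"(bptm)\(pid=\d+\) waited for (full|empty) buffer (\d+) times, delayed (\d+) times"
)
BPBKAR = re.compile(
    r"(bpbkar32)\(pid=\d+\) bpbkar waited (\d+) times for (full|empty) buffer, delayed (\d+) times"
)


def buffer_waits(log_text):
    """Return (process, buffer kind, waits, delays) for each buffer line found."""
    rows = []
    for line in log_text.splitlines():
        m = BPTM.search(line)
        if m:
            rows.append((m.group(1), m.group(2), int(m.group(3)), int(m.group(4))))
            continue
        m = BPBKAR.search(line)
        if m:
            rows.append((m.group(1), m.group(3), int(m.group(2)), int(m.group(4))))
    return rows


JOB_DETAILS = """\
1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.
"""

for row in buffer_waits(JOB_DETAILS):
    print(row)
```

Running the same extraction after each change (backup to NULL, different buffer settings and so on) makes it easier to see whether the waits are going up or down.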

The questions in here https://www-secure.symantec.com/connect/blogs/minimum-information-required-when-logging-problem-deta...

should be supplied by accredited members on every post, without having to be asked for them - I appreciate some questions will not be relevant to every problem, but they are a good guide, and show the level of detail needed. If a question is not relevant, think "what other details can I give that may be needed in place of this question".

Thanks,

Martin

Marianne · ‎01-26-2012

Unpatched NBU 7.1?

W2008 patch level? Physical resources on master?

From the opening post, the master seems to be the media server as well - NBU-Server-hcart-robot-tld-0 .... client NBU-Server

There seems to be a disconnect between bpbrm and bpdbm:

Error bpbrm(pid=5288) db_FLISTsend failed: network write failed (44)

You will need all relevant logs....


mph999 · ‎01-26-2012

bpbrm and bptm at VERBOSE = 5 please ...

Zahid_Haseeb · ‎01-26-2012

@ mph999

The same machine is Master and Media Server. (Other backups from different machines are running fine. The catalog backup is taken from the NetBackup server itself, so NBU does not seem to be loaded.)

@ Yasuhisa

Client read timeout is 6000

@ Stuart

My Network seems fine and all backups are going good.

=========================

Will share the bpbrm and bptm logs soon

Zahid_Haseeb · ‎01-26-2012

Thanks all for kind inputs and follow-up on my Post.

I increased the Browse timeout from 300 to 6000. The Read timeout was already 6000 while the backup was failing. After increasing the Browse timeout, my backup succeeded. I have triggered the backup again to check whether this really was the cause.

mph999 · ‎01-26-2012

OK, I would argue that changing the client browse timeout for a catalog backup on a 'non-busy' (???) server is NOT a fix. It is a workaround that is used to hide the problem caused by some other issue.

The above is likely to be true IF .... the increase in browse timeout required to make it work is quite large, but we need to know the history ...

If this was working at a browse timeout of 300, and if you tested and found that you only needed to increase the browse timeout to say 320, then yes, I would accept that - the system just got a bit bigger ...

If however, you have to increase the timeout to 6000 before it works again, something is very wrong - a system doesn't suddenly require an extra 5700 seconds to complete a 'task' that it could do previously in under 300.

So, I would do further testing to see what the value is for the backup to fail ... for example, reduce the timeout value until it fails. If this value is 'close to' 300 then OK, that is probably fine. If the value is high, then you haven't fixed the problem, you have only hidden it.
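
One way to organise that testing is to halve the range each time rather than stepping down in small amounts. The sketch below is an illustration only: `backup_succeeds` is a hypothetical stand-in for "set the timeout, run the catalog backup, note the result", and the search assumes the backup reliably fails below some threshold and reliably works above it. An intermittent failure would break that assumption.

```python
def smallest_working_timeout(low, high, backup_succeeds):
    """Binary-search the smallest timeout in [low, high] for which the backup works.

    backup_succeeds(timeout) is a hypothetical callback returning True or False;
    in practice each call is one manual backup run with that timeout set.
    """
    if not backup_succeeds(high):
        raise ValueError("backup fails even at the highest timeout tried")
    while low < high:
        mid = (low + high) // 2
        if backup_succeeds(mid):
            high = mid          # works here, so the threshold is at or below mid
        else:
            low = mid + 1       # fails here, so the threshold is above mid
    return low


# Simulated example: pretend the backup needs at least 320 seconds.
print(smallest_working_timeout(300, 6000, lambda t: t >= 320))   # 320
```

Between 300 and 6000 this needs roughly a dozen runs, each of which is a full catalog backup, so it is worth doing only once the failure is reproducible.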

Martin - Senior Symantec UK TSE

Zahid_Haseeb · ‎02-01-2012

Yes, you are right. That was not the solution. It failed again. I have opened a case with Symantec and will share the outcome at the end.

Mark_Solutions · ‎02-01-2012

Re-reading the original post it looks like the backup actually completed but then failed to update itself with the final result.

You have said that the system is not busy but it could actually be a port lock out issue causing this.

The internal communication gets its ports locked out so by the time the backup finishes it cannot communicate

It writes for 13 minutes 58 seconds (guess your catalog is not very big?), but then has the failure exactly after 5 minutes.
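
Those two figures come straight from the timestamps in the job details; a minimal sketch of the arithmetic, using the "begin writing" line and the two bpbrm error lines from the opening post:

```python
from datetime import datetime

# Timestamp layout used in the Activity Monitor job details.
FMT = "%m/%d/%Y %I:%M:%S %p"


def elapsed(start, end):
    """Return the time between two job-detail timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


begin_writing = "1/26/2012 10:41:30 AM"   # begin writing
first_error = "1/26/2012 10:55:28 AM"     # db_FLISTsend failed
second_error = "1/26/2012 11:00:28 AM"    # could not send server status message

print(elapsed(begin_writing, first_error))   # 0:13:58
print(elapsed(first_error, second_error))    # 0:05:00
```

The exact 300-second gap between the two errors is what points at a fixed 5 minute timeout.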

Four things worth doing here:

1. Add the following registry key to the Master Server:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

DWORD – TcpTimedWaitDelay - Decimal Value of 30

2. From a command prompt started with "Run as administrator":

Netsh int ipv4 set dynamicport tcp start=10000 num=50000

This gives it 50000 dynamic ports (10000 to 59999); the default is 16384 (49152 to 65535).

3. On the Master Server's Host Properties, ensure that in the Timeouts section the Client Connect, Client Read, File Browse and Media Server Connect timeouts are set to 600, just to be sure.

4. Now it could be something to do with bpdbm or bpcd which have hard coded 5 minute timeouts.

Check to make sure you don't have a load of bpdbm processes running. If the system is quiet, do a bpdown and see if the bpdbm processes are still running - then do another bpdown and see if they go. If not, reboot the Master for a full cleanup - regular process cleanups / reboots are always worth doing.

Also, plain 7.1 has issues, especially for bpdbm, and there is an EEB to help overcome this which comes to light for GRT Restore browsing errors. This increases the bpdbm hardcoded timeout and I believe this is included in 7.1.0.3, so I would strongly recommend patching your Master Server.

Hope this helps

mph999 · ‎02-01-2012

Why have you opened a case with Symantec? You should open the case with either your network support guys, or whoever supports the operating system.

This is not a NetBackup issue.

Martin

Zahid_Haseeb · ‎02-01-2012

I have a single machine which is Master / Media Server. My catalog is on the same machine. All other backups of different servers (including SQL, flat file and Exchange 2010 backups) to DSU and the tape library are running fine, 100% perfect. How could this be a network error?

mph999 · ‎02-01-2012

Perhaps I was a little harsh. Looking again at the details, there is a possibility: if, for example, we get stuck updating the catalog, the network could time out. But I think this is not the most likely cause (as the other backups also update the catalog).

Instead of automatically blaming NetBackup, let's look at this the other way ...

Here is the error posted ...

Info bpbkar32(pid=3124) done. status: 44: network write failed network write failed(44)

Tell me how this is NOT a Network or OS/ Performance issue ...

I have looked through this post, I and many others have spent a great deal of time making suggestions :

For example:

I asked a number of questions which have NOT been answered, for example, when did this start to fail, history of issue etc ...

I also pointed out ...

1/26/2012 10:41:30 AM - positioned 0014L5; position time: 00:00:58

1/26/2012 10:41:30 AM - begin writing

1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times

1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.

1/26/2012 10:55:28 AM - Error bpbrm(pid=5288) db_FLISTsend failed: network write failed (44)

Stuart Green made some suggestions about testing the read performance, which looks poor - bptm should NOT be delayed 40142 times ...

If we had the answers to the questions we had asked, and the results of the testing we had suggested, we would be better informed to comment accurately. You need to answer the questions asked; it gives more details about the problem, and may well lead to a faster resolution.

Stuart suggested some testing that could be done, as another idea try this :

bpimagelist -m aabbcc1

Apologies, it might be bpimagelist -mediaid aabbcc1 , can't remember, but this will look through the catalog and will not find tape aabbcc1 (hopefully), please then tell us how long this takes.

If on unix, use time bpimagelist -m aabbcc1

This will then give you the time taken without you having to sit there watching it.
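
This master is on Windows, where there is no `time` prefix, so a few lines of Python can do the same job. This is a sketch, not a NetBackup tool: `time_command` is a made-up helper, and the example call uses a harmless stand-in command. Substitute whichever catalog query command turns out to be correct, since the bpimagelist option above is quoted from memory.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv, discard its output, and return (exit code, elapsed seconds)."""
    started = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.perf_counter() - started


# Stand-in command so the sketch runs anywhere; on the master this list would
# hold the catalog query and its arguments instead.
code, seconds = time_command([sys.executable, "-c", "pass"])
print(f"exit code {code}, took {seconds:.1f} s")
```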

This may give an indication of any issues reading through the catalog. Also, approximately, how big is the catalog.

Another command to run :

bpdbm -consistency 2

This will report if there are any issues in the catalog. If there are, this could cause a delay, during which time the network times out.

Martin

Zahid_Haseeb · ‎02-02-2012

1. The exact error message (if an error is given)

2. During which type of operation does the problem occur ? For example, Backup, Duplication, Vault, Restore etc.

Backup

3. At what point in the job does the issue occur, for example, right at the beginning.

In the middle of backup

4. Which system(s) are involved? For example, Master Server, Media Server - Please provide system names

Single Machine which is Master/Media Server

5. What version of NetBackup are the servers running ?

7.1

6. What type of operating system are the servers running, and what version (including patch and kernel level)

Win2008 SP2 (32bit)

7. Please submit the support report from the master and any relevant Media Servers in the environment.

xxxx

8. Is this a reoccurring issue ?

It seems to be intermittent. I triggered a backup just now and it succeeded.

9. What was the system doing at this time ?

The system(Master/Media Server) was busy in taking backup

10. Please send in the contents of the details tab in the Activity Monitor for the job.

I have attached below. please find the attachment

11. What logs or screenshots are available?

A screenshot is provided in step # 1 and logs are attached below

12. Has this worked previously? If so, when was the last time it DID work?

It had been working fine until a few days ago

13. Have there been any environmental; system; or configuration changes?

no

14. When did the problem begin?

A few days ago

15. How often does the problem occur?

It occurred the day before yesterday and yesterday, but today the backup succeeded

16. Is a similar configuration working properly?

Yes

17. If applicable, are you running the job to/from disk or tape, if tape, is it a VTL.

Yes, it's a tape cartridge, not a VTL. It's an IBM 3310 tape library with six drives

============================================================

a.) bpimagelist -m MediaName, bpimagelist -media MediaName and bpimagelist -MediaID MediaName are not working; maybe the command is not correct

b.) Catalog backup size is around 40 to 50 GB

c.) bpdbm -consistency 2 successfully completed no error generated

Mark_Solutions · ‎02-02-2012

How much disk space do you have on the drive where your databases are held?

Zahid_Haseeb · ‎02-02-2012

Total space on the C drive is 80GB and free space is around 8GB. NetBackup is installed in the default location, which is on the C drive

Mark_Solutions · ‎02-02-2012

You are starting to get low on disk space

I asked because during the catalog backup the databases are staged, which takes up additional space

When EMM detects a shortage of disk space it can shut itself down, which would cause this error during the catalog backup.

This is not supposed to happen until 1 or 2 % free space but I have seen many issues when it gets to 10% as well

See if you can clear down your logs to see how much free space you can make then see if it works
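
For reference, 8GB free out of 80GB is exactly 10%, which is the level where I have seen problems start. A minimal sketch for keeping an eye on it; the 10% warning level is only my own observation above, not a documented NetBackup threshold, and the path passed in is whatever drive holds the databases:

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Free space as a percentage of total space."""
    return 100.0 * free_bytes / total_bytes


def check_path(path, warn_below=10.0):
    """Return (free percent, True if at or below the assumed warning level)."""
    usage = shutil.disk_usage(path)
    pct = free_percent(usage.total, usage.free)
    return pct, pct <= warn_below


GB = 1024 ** 3
print(free_percent(80 * GB, 8 * GB))   # 10.0, the figures from this thread

pct, low = check_path(".")             # use the database drive here, e.g. "C:\\"
print(f"{pct:.1f}% free, low on space: {low}")
```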

mph999 · ‎02-02-2012

OK, my error :

Try this ...

bpimmedia -mediaid ddd

Thanks,

Martin
