01-25-2012 11:59 PM
Environment
Veritas Netbackup = 7.1
OS of Netbackup = win2008
Tape Library attached with six drives
Problem
I am running a Catalog backup to a tape cartridge. The backup fails at around 80% completion.
1/26/2012 10:36:15 AM - Info nbjm(pid=4108) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=143574, request id:{12B409E1-989A-442B-A1EA-8F10361D8B3D})
1/26/2012 10:36:15 AM - requesting resource NBU-Server-hcart-robot-tld-0
1/26/2012 10:36:15 AM - requesting resource NBU-Server.NBU_CLIENT.MAXJOBS.NBU
1/26/2012 10:36:15 AM - requesting resource NBU-Server.NBU_POLICY.MAXJOBS.Catalog_Backup
1/26/2012 10:36:15 AM - awaiting resource NBU-Server-hcart-robot-tld-0 - No drives are available
1/26/2012 10:39:53 AM - Info bpbrm(pid=5288) NBU-Server is the host to backup data from
1/26/2012 10:39:53 AM - Info bpbrm(pid=5288) reading file list from client
1/26/2012 10:39:53 AM - granted resource NBU-Server.NBU_CLIENT.MAXJOBS.NBU
1/26/2012 10:39:53 AM - granted resource NBU-Server.NBU_POLICY.MAXJOBS.Catalog_Backup
1/26/2012 10:39:53 AM - granted resource 0014L5
1/26/2012 10:39:53 AM - granted resource IBM.ULT3580-TD5.002
1/26/2012 10:39:53 AM - granted resource NBU-Server-hcart-robot-tld-0
1/26/2012 10:39:53 AM - estimated 48899691 Kbytes needed
1/26/2012 10:39:53 AM - Info nbjm(pid=4108) started backup job for client NBU-Server, policy Catalog_Backup, schedule Full on storage unit NBU-hcart-robot-tld-0
1/26/2012 10:39:53 AM - started process bpbrm (5288)
1/26/2012 10:39:53 AM - connecting
1/26/2012 10:39:54 AM - Info bpbrm(pid=5288) starting bpbkar32 on client
1/26/2012 10:39:54 AM - connected; connect time: 00:00:01
1/26/2012 10:39:55 AM - Info bpbkar32(pid=3124) Backup started
1/26/2012 10:39:55 AM - Info bptm(pid=960) start
1/26/2012 10:39:55 AM - Info bptm(pid=960) using 65536 data buffer size
1/26/2012 10:39:55 AM - Info bptm(pid=960) setting receive network buffer to 263168 bytes
1/26/2012 10:39:55 AM - Info bptm(pid=960) using 30 data buffers
1/26/2012 10:39:55 AM - Info bptm(pid=960) start backup
1/26/2012 10:39:55 AM - Info bptm(pid=960) Waiting for mount of media id 0014L5 (copy 1) on server NBU-Server.
1/26/2012 10:39:55 AM - mounting 0014L5
1/26/2012 10:40:32 AM - Info bptm(pid=960) media id 0014L5 mounted on drive index 2, drivepath {3,0,4,0}, drivename IBM.ULT3580-TD5.002, copy 1
1/26/2012 10:40:32 AM - mounted; mount time: 00:00:37
1/26/2012 10:40:32 AM - positioning 0014L5 to file 281
1/26/2012 10:41:30 AM - positioned 0014L5; position time: 00:00:58
1/26/2012 10:41:30 AM - begin writing
1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.
1/26/2012 10:55:28 AM - Error bpbrm(pid=5288) db_FLISTsend failed: network write failed (44)
1/26/2012 11:00:28 AM - Error bpbrm(pid=5288) could not send server status message
1/26/2012 11:00:28 AM - end writing; write time: 00:18:58
1/26/2012 11:00:33 AM - Info bpbkar32(pid=3124) done. status: 44: network write failed
network write failed(44)
02-10-2012 01:45 AM
Also, check out the post I made earlier in the thread about tuning your network. Did you make these changes and reboot the server?
1. Add the following registry key to the Master Server:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
DWORD – TcpTimedWaitDelay - Decimal Value of 30
2. From a command prompt opened with "Run as administrator":
Netsh int ipv4 set dynamicport tcp start=10000 num=50000
This gives it 50000 dynamic ports (10000 through 59999); the default on Windows Server 2008 is 16384 (49152 through 65535).
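As a quick check on what the netsh command in step 2 actually opens up, here is a minimal Python sketch of the arithmetic. The function name is made up for illustration, and the Windows Server 2008 default (start=49152, num=16384) is stated from general knowledge, not from this thread.

```python
def dynamic_port_range(start, num):
    """Return (first, last) port for a dynamic range of `num` ports beginning at `start`."""
    last = start + num - 1
    if not (1 <= start <= last <= 65535):
        raise ValueError("range falls outside 1-65535")
    return start, last


# Values from the netsh command in step 2.
print(dynamic_port_range(10000, 50000))  # (10000, 59999)

# Assumed Windows Server 2008 default: start=49152, num=16384.
print(dynamic_port_range(49152, 16384))  # (49152, 65535)
```

So the change moves the pool from about 16 thousand ports to 50 thousand, which is the point of the tuning step.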
Also, what NET_BUFFER_SZ values are you using (if any - \netbackup\bin\)?
01-26-2012 12:24 AM
Please ensure you check out this post when starting new threads.
Did this ever work?
If it did, when did it last work?
What has changed? (If it did work, something HAS changed.)
Is the master server also the media server running the backup?
Does it fail every time, always at the same point ?
Can you run other similar sized backups from this server, run a test if necessary
From this, the performance looks poor :
1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.
This could well be the main issue :
1/26/2012 11:00:33 AM - Info bpbkar32(pid=3124) done. status: 44: network write failed
network write failed(44)
The main question at the moment, I guess, is whether the master is acting as the media server.
If the master is the media server, then the problem may be a little more complex.
If the media server is separate from the master server, I would suspect the network.
Martin
01-26-2012 01:07 AM
01-26-2012 01:18 AM
Try running the backup to NULL to test local performance.
http://www.symantec.com/docs/TECH17541
Then move onto checking disk defragmentation.
Check network speed on uplinks and switch ports are matched. (Had a mismatch last week on new client and this fixed it.)
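For anyone who wants a rough feel for local read speed before working through the tech note, here is a small Python stand-in for the idea of reading the data and throwing it away. It is not the NetBackup procedure from TECH17541 and it does not use bpbkar32; the function name and the 64 KB chunk size (chosen only to mirror the 65536 data buffer size in the job log) are assumptions for illustration.

```python
import os
import time


def read_and_discard(root, chunk_size=64 * 1024):
    """Read every file under `root`, discard the data, and return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        total += len(chunk)
            except OSError:
                continue  # skip files that cannot be opened or read
    return total, time.perf_counter() - start
```

Pointing it at the directory holding the catalog and dividing bytes by seconds gives a ballpark MB/s figure. File system caching will flatter a second run, so treat the number as indicative only.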
01-26-2012 01:38 AM
Also, please explain what troubleshooting steps you have taken, and what you suspect the issue to be.
Apologies, but I am unsure why Accredited members are not posting up some seriously detailed troubleshooting steps, and are not running basic testing before logging a new post.
These lines:
1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times
1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.
... should show that some performance testing needs to be done. They show that the issue is most likely between the client and the media server (which could be the same box), hence my questions and Stuart's suggestions.
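If you want to pull those wait and delay counters out of a job log to compare runs, something like the following Python sketch would do it. The regular expression and the dictionary keys are my own and are only matched against the two line shapes quoted above; other NetBackup versions may word these lines differently.

```python
import re

# Matches both quoted shapes:
#   "... bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times"
#   "... bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times."
WAIT_RE = re.compile(
    r"Info (?P<proc>\w+)\(pid=\d+\).*?waited "
    r"(?:for (?P<kind1>full|empty) buffer )?(?P<waited>\d+) times"
    r"(?: for (?P<kind2>full|empty) buffer)?, delayed (?P<delayed>\d+)"
)


def parse_wait_line(line):
    """Return the process, buffer kind and counters from a wait line, or None if no match."""
    m = WAIT_RE.search(line)
    if m is None:
        return None
    return {
        "process": m.group("proc"),
        "buffer": m.group("kind1") or m.group("kind2"),
        "waited": int(m.group("waited")),
        "delayed": int(m.group("delayed")),
    }


log = [
    "1/26/2012 10:55:28 AM - Info bptm(pid=960) waited for full buffer 31786 times, delayed 40142 times",
    "1/26/2012 10:55:28 AM - Info bpbkar32(pid=3124) bpbkar waited 10922 times for empty buffer, delayed 11000 times.",
]
for entry in log:
    print(parse_wait_line(entry))
```

Keeping the counters from a good run and a bad run side by side makes it easier to see whether the waits are growing.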
The questions in here https://www-secure.symantec.com/connect/blogs/minimum-information-required-when-logging-problem-deta...
should be supplied by accredited members on every post, without having to be asked for them. I appreciate some questions will not be relevant to every problem, but they are a good guide, and show the level of detail needed. If a question is not relevant, think "what other details can I give that may be needed in place of this question".
Thanks,
Martin
01-26-2012 02:29 AM
Unpatched NBU 7.1?
W2008 patch level? Physical resources on master?
From the opening post, The Master seems to be Media server as well - NBU-Server-hcart-robot-tld-0 .... client NBU-Server
There seems to be a disconnect between bpbrm and bpdbm:
Error bpbrm(pid=5288) db_FLISTsend failed: network write failed (44)
You will need all relevant logs....
01-26-2012 02:46 AM
bpbrm and bptm at VERBOSE = 5 please ...
01-26-2012 04:57 AM
@ mph999
The same machine is Master and Media Server. (Other backups of different machines are running fine. The Catalog backup is taken from the NetBackup server itself, so NBU does not seem to be loaded.)
@ Yasuhisa
Client read timeout is 6000
@ Stuart
My Network seems fine and all backups are going good.
=========================
Will share the bpbrm and bptm logs soon
01-26-2012 05:06 AM
Thanks all for kind inputs and follow-up on my Post.
I increased the Browse timeout from 300 to 6000. The Read timeout was already 6000 while the backup was failing. After increasing the Browse timeout, my backup completed successfully. I have triggered the backup again to check whether this really was the cause.
01-26-2012 07:58 AM
OK, I would argue that changing the client browse timeout for a catalog backup on a 'non-busy' (???) server is NOT a fix. It is a workaround that is used to hide the problem caused by some other issue.
The above is likely to be true IF .... the increase in browse timeout required to make it work is quite large, but we need to know the history ...
If this was working at a browse timeout of 300, and if you tested and found that you only needed to increase the browse timeout to say 320, then yes, I would accept that - the system just got a bit bigger ...
If however, you have to increase the timeout to 6000 before it works again, something is very wrong - a system doesn't suddenly require an extra 5700 seconds to complete a 'task' that it could do previously in under 300.
So, I would do further testing to see what the value is for the backup to fail ... for example, reduce the timeout value until it fails. If this value is 'close to' 300 then ok, that is probably ok. If the value is high, then you haven't fixed the problem, you have only hidden it.
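One way to keep the number of test runs down when hunting for that value is a binary search between a timeout known to fail and one known to work. The Python sketch below only illustrates the search; `fake_backup` is a hypothetical stand-in for "set the timeout, run the catalog backup, check the status", and it assumes that any timeout above a working one also works.

```python
def smallest_working_timeout(succeeds, low, high):
    """Binary-search the smallest timeout in (low, high] for which succeeds() is True.

    Assumes the backup fails at `low`, works at `high`, and that success is
    monotonic: any timeout above a working one also works.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid  # mid works, so the threshold is at or below mid
        else:
            low = mid   # mid fails, so the threshold is above mid
    return high


def fake_backup(timeout_seconds):
    # Hypothetical: pretend the backup needs at least 320 seconds.
    return timeout_seconds >= 320


print(smallest_working_timeout(fake_backup, 300, 6000))  # 320
```

Between 300 and 6000 that is roughly a dozen runs instead of stepping down one value at a time, although each run is still a full catalog backup.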
Martin - Senior Symantec UK TSE
02-01-2012 01:14 AM
Yes, you are right. That was not the solution. It failed again. I have opened a case with Symantec and will share the outcome at the end.
02-01-2012 03:26 AM
Re-reading the original post it looks like the backup actually completed but then failed to update itself with the final result.
You have said that the system is not busy but it could actually be a port lock out issue causing this.
The internal communication gets its ports locked out, so by the time the backup finishes it cannot communicate.
It writes for 13 minutes 58 seconds (guess your catalog is not very big?), but then has the failure exactly after 5 minutes.
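Those intervals can be checked directly against the timestamps in the job log from the opening post. A minimal Python sketch, assuming the log's month/day/year 12-hour format and an English locale for the AM/PM marker (the helper name is made up):

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "1/26/2012 10:41:30 AM"


def elapsed(start, end):
    """Return the time between two job-log timestamps as H:MM:SS."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return str(delta)


# begin writing -> buffer statistics and db_FLISTsend error
print(elapsed("1/26/2012 10:41:30 AM", "1/26/2012 10:55:28 AM"))  # 0:13:58
# db_FLISTsend error -> "could not send server status message"
print(elapsed("1/26/2012 10:55:28 AM", "1/26/2012 11:00:28 AM"))  # 0:05:00
# begin writing -> end writing, matching the logged write time of 00:18:58
print(elapsed("1/26/2012 10:41:30 AM", "1/26/2012 11:00:28 AM"))  # 0:18:58
```

The logged write time of 00:18:58 is therefore the 13 minutes 58 seconds of writing plus the 5 minute wait.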
Four things worth doing here:
1. Add the following registry key to the Master Server:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
DWORD – TcpTimedWaitDelay - Decimal Value of 30
2. From a command prompt opened with "Run as administrator":
Netsh int ipv4 set dynamicport tcp start=10000 num=50000
This gives it 50000 dynamic ports (10000 through 59999); the default on Windows Server 2008 is 16384 (49152 through 65535).
3. On the Master Server's Host Properties, ensure that in the Timeouts section the Client Connect, Client Read, File Browse and Media Server Connect timeouts are set to 600, just to be sure.
4. Now it could be something to do with bpdbm or bpcd, which have hard-coded 5 minute timeouts.
Check to make sure you don't have a load of bpdbm processes running. If the system is quiet, do a bpdown and see if the bpdbm processes are still running, then do another bpdown and see if they go. If not, reboot the Master for a full cleanup. Regular process cleanups / reboots are always worth doing.
Also, plain 7.1 has issues, especially with bpdbm, and there is an EEB to help overcome this, which came to light for GRT restore browsing errors. It increases the bpdbm hard-coded timeout and I believe it is included in 7.1.0.3, so I would strongly recommend patching your Master Server.
Hope this helps
02-01-2012 04:13 AM
Why have you opened a case with Symantec? You should open the case with either your network support guys, or whoever supports the operating system.
This is not a NetBackup issue.
Martin
02-01-2012 04:49 AM
I have a single machine which is both Master and Media Server. My Catalog is on the same machine. All other backups of different servers (including SQL, flat-file and Exchange 2010 backups) to DSU and the Tape Library are running fine. How could this be a network error?
02-01-2012 11:54 AM
02-02-2012 04:52 AM
02-02-2012 05:10 AM
How much disk space do you have on the drive where your databases are held?
02-02-2012 05:21 AM
Total Space of C Drive is 80GB and free space is around 8GB. The Netbackup is installed on default location which is C Drive
02-02-2012 05:38 AM
You are starting to get low on disk space.
I asked because during the catalog backup the databases are staged, which takes up additional space.
When EMM detects a shortage of disk space it can shut itself down, which would cause this error during the catalog backup.
This is not supposed to happen until 1 or 2% free space, but I have seen many issues when it gets to 10% as well.
See if you can clear down your logs to see how much free space you can make, then see if it works.
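For reference, 8GB free on an 80GB drive is exactly 10% free, which is right at the level where I have seen problems. A small Python sketch of the check, where the function names and the 10% warning level are illustrative choices rather than NetBackup settings:

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Return free space as a percentage of total space."""
    return 100.0 * free_bytes / total_bytes


GB = 1024 ** 3
print(free_percent(80 * GB, 8 * GB))  # 10.0


def check_volume(path, warn_at_or_below=10.0):
    """Return (free percentage, True if at or below the warning level) for the volume holding `path`."""
    usage = shutil.disk_usage(path)
    pct = free_percent(usage.total, usage.free)
    return pct, pct <= warn_at_or_below
```

Running `check_volume("C:\\")` on the master before and after clearing down the logs shows how much headroom the cleanup bought.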
02-02-2012 05:44 AM
OK, my error :
Try this ...
bpimmedia -mediaid ddd
Thanks,
Martin