Netbackup: Flash backups on Zimbra mail servers - Performance improvement

Anand_Avaala
Level 3

Can anybody help me out with my environment (details below)? I am facing frequent failures with errors 40, 13, 20, 50, etc.

One Master server, 165 SAN Media Servers, HP VLS 9000 (VTL).

We use flash backups on all 165 servers and run the backups through a separate policy for each server.

Backups run Monday to Sunday: 6 days of Diff Incr and 1 day of Full backup. Each has a 2-week retention.

I need help with the setup and with improving performance.

Please let me know if any further information is required.

11 REPLIES

Mark_Solutions
Level 6
Partner Accredited Certified
165 SAN Media Servers - that is quite a few! All of these constantly have to check in with the Master Server about their storage unit status etc.

From the error numbers you are getting, it is possible that the Master Server is not coping: it may be short of ports or memory, or the NetBackup processes may be getting overloaded.

Let us know the O/S and specification of the Master Server (RAM, network speed etc.) so that we can advise further.

Anand_Avaala
Level 3

Hi Mark,

Thanks for your response.

Master server OS:
[/usr/openv/lib]# cat /etc/*-release
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
[/usr/openv/lib]# uname -mrs
Linux 2.6.18-53.1.6.el5 x86_64

RAM: 16GB
Network Speed:
Speed: 1000Mb/s
Duplex: Full
Port: FIBRE

We are using 7.5.3 on Master as well as all SAN Media servers.

 

Mark_Solutions
Level 6
Partner Accredited Certified
OK - doesn't look too bad then, so it may just need a little tuning to prevent the overload and/or timeouts.

First look at the keepalive settings. These are typical settings:

# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

If they are at these values then change them as follows:

# echo 510 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes

To keep the changes persistent after a reboot, add something like the following to /etc/sysctl.conf (use the vi editor):

## Keepalive at 8.5 minutes
# start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
net.ipv4.tcp_keepalive_time=510
# close connection after 3 unanswered probes (default 9)
net.ipv4.tcp_keepalive_probes=3
# wait 3 seconds for response to each probe (default 75)
net.ipv4.tcp_keepalive_intvl=3

You don't need a restart for them to take effect. Then run:

chkconfig boot.sysctl on

to commit the changes.

See if these help to start with. It may need some nbrb tuning (using nbrbutil) to maximise its capabilities, but the Status 50 points towards a possible keepalive issue.
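To put the suggested keepalive values in perspective, here is a minimal Python sketch (not part of the original advice) that estimates how long an idle connection to a dead peer can linger before the kernel drops it. It assumes the commonly documented approximation of idle time plus interval times probe count. The function names `detection_seconds` and `read_keepalive` are made up for illustration.

```python
from pathlib import Path

# Standard Linux location of the keepalive tunables quoted in the post.
PROC_DIR = Path("/proc/sys/net/ipv4")


def detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate seconds before an idle connection to a dead peer is dropped."""
    return time_s + intvl_s * probes


def read_keepalive(proc_dir: Path = PROC_DIR) -> dict:
    """Read the three keepalive values currently in effect (Linux only)."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return {name: int((proc_dir / name).read_text().split()[0]) for name in names}


# Values taken from the post: defaults (7200/75/9) and suggested (510/3/3).
defaults = detection_seconds(7200, 75, 9)
tuned = detection_seconds(510, 3, 3)
print(f"defaults: {defaults} s, tuned: {tuned} s")

# On a Linux host, also report what the running kernel is using.
if (PROC_DIR / "tcp_keepalive_time").exists():
    live = read_keepalive()
    total = detection_seconds(
        live["tcp_keepalive_time"],
        live["tcp_keepalive_intvl"],
        live["tcp_keepalive_probes"],
    )
    print(live, "->", total, "s")
```

With the defaults the first probe is only sent after two hours of idle time. With the suggested values probing starts after 8.5 minutes and a dead peer is given up on a few seconds later.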

Anand_Avaala
Level 3

Hi Mark,

I have made the changes you advised above and will monitor the environment for a couple of days and get back to you.

However, do we need to make any changes to the kernel semaphores to tune performance on the master?

Current Kernel Sem values:
# Syntax of the following parameter:  kernel.sem = SEMMSL SEMMNS SEMOPM SEMMNI
# 4 values defining limits for System V IPC semaphores.
# These fields are, in order:
#     SEMMSL  The maximum semaphores per semaphore set.
#     SEMMNS  A system-wide limit on the number of semaphores in all semaphore sets.
#     SEMOPM  The maximum number of operations that may be specified in a semop(2) call.
#     SEMMNI  A system-wide limit on the maximum number of semaphore identifiers.
kernel.sem = 300 32000 64 1024

 

I ask because I have read about those values in some of the tech notes from Symantec.

Please advise.

Thanks a ton Mark.
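For readers comparing their own kernel.sem line against a tech note, the following Python sketch maps the four positional values quoted above onto their names. It is illustrative only: the `SemLimits` type and `parse_kernel_sem` helper are hypothetical, and no recommended values are implied.

```python
from typing import NamedTuple


class SemLimits(NamedTuple):
    semmsl: int  # maximum semaphores per semaphore set
    semmns: int  # system-wide limit on semaphores in all sets
    semopm: int  # maximum operations per semop(2) call
    semmni: int  # system-wide limit on semaphore identifiers


def parse_kernel_sem(line: str) -> SemLimits:
    """Parse a 'kernel.sem = a b c d' line as it appears in sysctl.conf."""
    key, _, value = line.partition("=")
    if key.strip() != "kernel.sem":
        raise ValueError(f"not a kernel.sem line: {line!r}")
    fields = [int(v) for v in value.split()]
    if len(fields) != 4:
        raise ValueError("kernel.sem expects exactly 4 values")
    return SemLimits(*fields)


# The line quoted in the post.
limits = parse_kernel_sem("kernel.sem = 300 32000 64 1024")

# The per-set limit times the number of sets gives a theoretical ceiling;
# SEMMNS is the system-wide cap that actually applies.
ceiling = limits.semmsl * limits.semmni
print(limits)
print(f"SEMMSL * SEMMNI = {ceiling}; SEMMNS caps the total at {limits.semmns}")
```

Naming the fields makes it easier to change one value at a time, in line with the advice to test a single tuning change before moving on.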

Mark_Solutions
Level 6
Partner Accredited Certified

Love to help you but this one is out of my scope, I am afraid!

If you have found it is a tuning tech note from Symantec then I can't see that it would hurt - but maybe try one thing at a time so that you know what actually does the trick for you

Anand_Avaala
Level 3

Even after tuning the parameters you suggested, backups are still failing with errors 13 and 40.

Please find the job details for both errors below.

4/3/2013 4:09:07 AM - Info bpbkar(pid=9680) 60000 entries sent to bpdbm       
4/3/2013 4:09:07 AM - Error bpbrm(pid=9636) from client sz0066a-util.westchester.pa.mail.comcast.net:  1364951613 755297/      
4/3/2013 4:09:22 AM - Error bpbrm(pid=9636) db_FLISTsend failed: file read failed (13)      
4/3/2013 4:09:24 AM - Error bptm(pid=9694) media manager terminated by parent process      
4/3/2013 4:09:28 AM - Info bpbkar(pid=0) done. status: 13: file read failed      
4/3/2013 4:09:28 AM - end writing; write time: 03:06:07
file read failed(13)

4/3/2013 4:04:14 AM - Info bpbkar(pid=16939) 100000 entries sent to bpdbm       
4/3/2013 4:04:42 AM - Info bpbkar(pid=16939) 105000 entries sent to bpdbm       
4/3/2013 4:05:19 AM - Error bptm(pid=16950) media manager exiting because bpbrm is no longer active   
4/3/2013 4:05:19 AM - Info bpbkar(pid=16939) 110000 entries sent to bpdbm       
4/3/2013 4:05:20 AM - Info bptm(pid=16950) EXITING with status 174 <----------       
network connection broken(40)
 

I suspect it's something to do with bpbrm and the media manager causing these errors.

If it's out of your scope, could you please refer me to someone who can help with this?

Thanks for your support, by the way.
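When comparing failures across many jobs, it can help to pull only the Error lines out of job details like those pasted above. The Python sketch below does that under some assumptions: the line layout is inferred purely from the excerpts in this thread, the dates are read as month/day/year, and the set of severity words (Info, Error, Warning) is a guess. It is not a NetBackup tool or API.

```python
import re
from datetime import datetime

# Layout inferred from the pasted excerpts:
#   <date> <time> AM|PM - <level> <process>(pid=<n>) <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) - "
    r"(?P<level>Info|Error|Warning) "
    r"(?P<proc>\w+)\(pid=(?P<pid>\d+)\) "
    r"(?P<msg>.*\S)\s*$"
)

# A few lines copied from the status 13 job in the post.
SAMPLE = """\
4/3/2013 4:09:07 AM - Info bpbkar(pid=9680) 60000 entries sent to bpdbm
4/3/2013 4:09:22 AM - Error bpbrm(pid=9636) db_FLISTsend failed: file read failed (13)
4/3/2013 4:09:24 AM - Error bptm(pid=9694) media manager terminated by parent process
4/3/2013 4:09:28 AM - end writing; write time: 03:06:07
"""


def parse(text):
    """Yield one dict per line that matches the assumed layout; skip the rest."""
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rec = match.groupdict()
            # Assumption: month/day/year with a 12-hour clock.
            rec["ts"] = datetime.strptime(rec["ts"], "%m/%d/%Y %I:%M:%S %p")
            rec["pid"] = int(rec["pid"])
            yield rec


records = list(parse(SAMPLE))
errors = [r for r in records if r["level"] == "Error"]
for r in errors:
    print(r["ts"].isoformat(), r["proc"], r["pid"], r["msg"])
```

Lines that do not fit the pattern, such as the "end writing" summary, are skipped rather than guessed at.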

 

 

 

Mark_Solutions
Level 6
Partner Accredited Certified

After how long do they fail?

As for "referring you" - this is an open forum - we are all here to help and do it in our own time, so hopefully someone will see this and can assist further - my only referral would be to advise you to open a support case with Symantec

 

We would need to see the full log (please attach it as a text file rather than pasting it into the thread).

Wiriadi_Wangsa
Level 4
Employee Accredited

Just a thought: given that the first error occurs during "db_FLISTsend", you might want to have a look at this:

http://www.symantec.com/business/support/index?page=content&id=HOWTO56209

Anand_Avaala
Level 3

Hi Mark/Wiriadi Wangsa,

Sorry for the long delay in replying back to this forum post.

 

We are working with Symantec on this. However, we have one particular client where the Incr backup fails with error 13 or 14, while the Full backup completes successfully without any issues.

I have checked both of the values below and they are set to the max.

/usr/openv/netbackup/MAX_FILES_PER_ADD = 100000
/usr/openv/netbackup/bin/DBMto = 30
 
However, the Incr backup ran fine without any errors after we uninstalled NBU on the media server and reinstalled it. The issue then recurs after some time.
 
Please let me know if you need any logs related to this.
 
Thank you.