Forum Discussion

Ali_YAZICI's avatar
Ali_YAZICI
Level 2
11 years ago

interrupt when backup large SQL databases

Hi,

Usually MSSQL backups are interrupted at the middle of backup, it success if try again second or third times.

We tried scenerios below,(First tried with version 7.5.0.5, then 7.6.0.1 and result is same) What is the root cause of the problem do you think?

 

netbackup appliance test results.png

Regards.

  • There is no firewall between the sql server and the Master or Media Server.

    Root cause and Solution of the problem:

     

    Backline Root Cause Analysis

     

     

    RCA Author:

     John Peters

     

     

    Error Code/Message 

    Error 13

    Diagnostic Steps 

    Verbose logging, TCP dumps, Media,Master,client I/O parameter adjustments

    Related E-Tracks

     

    3445293

    Problem Description 

     MS SQL backups are failing randomly if MSDP is used.

    Cause 

    It is believed that several adjustments that have been done on Master and Media Servers  have resolved the issues .  Engineering believes that both Network and IO tuning ended the congestion, hence the bottleneck, resulting NBU to successfully complete backups.

    Solution 

     Master, Media Server and client side, I/O parameter adjustments.

    Resolution Criteria 

     ----

    Description Resolution  (if Criteria is not met) 

     

    In our attempts to determine the root cause of the status 13 failures

    when backing up SQL databases to MSDP using media side deduplication,

     

    From the initial logs we could see that after some time, the dbclient

    process attempted to send data to the appliance, ie :

    16:07:53.335 [17300.26824] <4> VxBSASendData: INF - entering SendData.

    16:07:53.335 [17300.26824] <4> dbc_put: INF - sending 65536 bytes

    16:07:53.335 [17300.26824] <2> writeToServer: bytesToSend = 65536 bytes

    16:07:53.335 [17300.26824] <2> writeToServer: start send -- try=1

    16:08:12.224 [17300.26824] <16> writeToServer: ERR - send() to server

    on socket failed:

    This message occurs a as a consequence of the OS library call send()

    returning a non-zero status attempting to send data to a file

    descriptor ( socket ) that is associated with a network connection.

    When this failure occurs, dbclient sends a failure status to the media

    server and bpbrm terminates the job resulting in the status 13. The OS

    library call to recv() data on the media server does not receive any

    data or indication from other end of the network connection at the time

    of the failure which indicates that the TCP stacks on the two hosts are

    not in agreement on the state of the connection.

    What we needed to find out was if the problem was occurring in the TCP

    stack on the client or appliance side of that connection.

    In order to understand this, it was felt necessary to look at the

    problem from a TCP level, we hoped that this would show if data was

    flowing or if there was a plugin ingest issue.

    For this to happen, we would need to perform network packet capture

    between the hosts, as well as performing strace on bptm  processes

    together with full debug logging enabled.

    This approach poses a difficult problem based on the intermittent

    nature of the failures and the indeterminate time taken for jobs to

    fail. The resulting logs would be large and the tcpdump / straces enormous.

    We attempted to work around this by rotating the tcp capture logs and

    terminating the traces as soon as the job failed in an attempt to

    capture the critical information.

    However, when we set this test up, the additional load imposed by the

    tracing and logging impacted the appliance to such a degree that the

    backup eventually failed with a 13, unfortunately we do not feel that

    this failure was the same issue seen for earlier failures.

    In this case when the job was repeated with tracing turned off it was

    successful.

     

     

    In order to try and prevent the issue occuring we made some

    tuning changes to components we felt were potentially the source of the

    problem,   specifically :

    1) Increased the TCP retransmission settings on the client to 15 (

    http://support.microsoft.com/default.aspx?scid=kb;en-us;170359)

    2) Implemented following settings on the appliance :

    echo 128 > /usr/openv/netbackup/db/config/CD_NUMBER_DATA_BUFFERS

     

    echo 524288 > /usr/openv/netbackup/db/config/CD_SIZE_DATA_BUFFERS

       

    echo 180 > /usr/openv/netbackup/db/config/CD_UPDATE_INTERVAL

    touch /usr/openv/netbackup/db/config/CD_WHOLE_IMAGE_COPY

    echo 0 > /usr/openv/netbackup/NET_BUFFER_SZ

    echo 0 > /usr/openv/netbackup/NET_BUFFER_REST

    echo 256 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS

    echo 512 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK

    echo 16 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_FT

     

    echo 1500 > /usr/openv/netbackup/db/config/OS_CD_BUSY_RETRY_LIMIT

       

    echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS

    echo 1048576 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK

     

    echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_FT

         

    echo 3600 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

    echo 3600 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTSENDTMO

    touch /usr/openv/netbackup/db/config/DPS_PROXYNOEXPIRE

    echo 5000 > /usr/openv/netbackup/db/config/MAX_FILES_PER_ADD

    3) Set /usr/openv/pdde/pdcr/etc/contentrouter.cfg

    PrefetchThreadNum=16

    MaxNumCaches=1024

    ReadBufferSize=65536

    WorkerThreads=128

    4) Set /usr/openv/lib/ost-plugins/pd.conf

     

    PREFETCH_SIZE = 67108864

    CR_STATS_TIMER = 300

     

    5) Modified the "Client Communication Buffer size"

     

    Host properties --> Client Settings --> Communication Buffer size set

    to 256kb

     

     

    6) MSSQL script was modified :

     

    BUFFERCOUNT to 2

    STRIPE to 4

     

    7) Additionally we would always advise to check the "resolution"

    section of this technote http://www.symantec.com/docs/TECH37372.

     

     

     

     

     

     

     

     

     

     

     

     

     

  • If the backup fails after exactly 2 hours (give or take a few seconds) then this is caused by the keepalive settings

    From your diagrams i assume the appliance is also a master server but if not you will need to implement keep alive settings on your master server too.

    You need to go into the O/S of the appliance:

    Support - Maintenance - P@ssw0rd - elevate

    Run the following to check the cirrent settings - typical responses shown below:

    # cat /proc/sys/net/ipv4/tcp_keepalive_time
      7200
    # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
    75
    # cat /proc/sys/net/ipv4/tcp_keepalive_probes
    9

    To set new values do the following:

    # echo 510 > /proc/sys/net/ipv4/tcp_keepalive_time
    # echo 3 > /proc/sys/net/ipv4/tcp_keepalive_intvl
    # echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes

    These settings take effect immediately so you can test it straight away.

    To keep there persistent after a reboot use vi editor to make the additions such as the following to /etc/sysctl.conf file:

    ## Keepalive at 8.5 minutes

    # start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
    net.ipv4.tcp_keepalive_time=510

    # close connection after 4 unanswered probes (default 9)
    net.ipv4.tcp_keepalive_probes=3 

    # wait 45 seconds for reponse to each probe (default 75
    net.ipv4.tcp_keepalive_intvl=3

    Then run:

    chkconfig boot.sysctl on

    If you also have a Master Server do the same on that - if it is Windows it will need regsitry keys adding and a reboot for them to work:

    HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

    New DWORD - KeepAliveTime – Decimal value of 510000
    New DWORD – KeepAliveInterval – Decimal Value of 3

    See if this helps - applies if they fail after 2 hours (if they fail after 1 hour check for any firewalls between the Master / Media / Client

  • Lovely picture but no details about the backup failure?

    What is the status code?
    All text in Details tab of failed jobs?

    Any useful info in bptm and bpbrm logs on the Appliances?

    Have you enabled dbclient log on SQL clients?
    Error in these logs?

    What is Client Read and Client Connect timeouts set to on the Appliance media servers?

    PS:
    Please copy log files to .txt (e.g. bptm.txt) and upload as File attachments.

  • Hi Mark,

    It  fails randomly, sometimes at the begining, sometimes in the middle or when the %90.

    Hi Marianne,

    it gives status code 13 and 6.

    actually Symantec support are analysing the logs and  I mean which part has problem do you think?

    Parent Job: 
    3/2/2014 5:55:04 AM - Info nbjm(pid=2800) starting backup job (jobid=1496352) for client client.domainname.com, policy MSSQL_SS_DW_FACT, schedule Full_1M  
    3/2/2014 5:55:04 AM - Info nbjm(pid=2800) requesting MEDIA_SERVER_WITH_ATTRIBUTES resources from RB for backup job (jobid=1496352, request id:{133765CF-CF86-4BF3-A950-309D37E3C0BA})  
    3/2/2014 5:55:04 AM - requesting resource stu_disk_sp-nbuappliance23
    3/2/2014 5:55:04 AM - requesting resource masterserver.domainname.com.NBU_CLIENT.MAXJOBS.client.domainname.com
    3/2/2014 5:55:04 AM - requesting resource masterserver.domainname.com.NBU_POLICY.MAXJOBS.MSSQL_SS_DW_FACT
    3/2/2014 5:55:05 AM - granted resource masterserver.domainname.com.NBU_CLIENT.MAXJOBS.client.domainname.com
    3/2/2014 5:55:05 AM - granted resource masterserver.domainname.com.NBU_POLICY.MAXJOBS.MSSQL_SS_DW_FACT
    3/2/2014 5:55:05 AM - granted resource stu_disk_sp-nbuappliance23
    3/2/2014 5:55:05 AM - estimated 0 Kbytes needed
    3/2/2014 5:55:05 AM - Info nbjm(pid=2800) started backup (backupid=client.domainname.com_1393732505) job for client client.domainname.com, policy MSSQL_SS_DW_FACT, schedule Full_1M on storage unit stu_disk_sp-nbuappliance23
    3/2/2014 5:55:07 AM - started process bpbrm (30879)
    3/2/2014 5:55:27 AM - Info bpbrm(pid=30879) client.domainname.com is the host to backup data from     
    3/2/2014 5:55:27 AM - Info bpbrm(pid=30879) reading file list for client        
    3/2/2014 5:55:28 AM - connecting
    3/2/2014 5:56:08 AM - Info bpbrm(pid=30879) starting bphdb on client         
    3/2/2014 5:56:09 AM - Info bphdb(pid=4736) Backup started           
    3/2/2014 5:56:09 AM - connected; connect time: 0:00:41
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) INF - BACKUP STARTED USING        
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) Microsoft SQL Server 2008 (SP3) - 10.0.5500.0 (X64)     
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) Sep 21 2011 22:45:45         
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) Copyright (c) 1988-2008 Microsoft Corporation        
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) Enterprise Edition (64-bit) on Windows NT 6.0 <X64> (Build 6002: Service Pack 2) 
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) Batch = C:\NBU_SCRIPT\full_FACT.bch, Op# = 1       
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) INF - Using backup images clientNODEA.MSSQL7.client.db.FIN_BI_DW_FACT.~.7.001of008.20140302055612..C to clientNODEA.MSSQL7.client.db.FIN_BI_DW_008of008.20140302055612.C     
    3/2/2014 5:56:17 AM - Info dbclient(pid=10124) INF - backup database "FIN_BI_DW_FACT" to VIRTUAL_DEVICE='VNBU0-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU1-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU2-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU3-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU4-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU5-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU6-10124-3808-1393732577', VIRTUAL_DEVICE='VNBU7-10124-3808-1393732577' with  stats = 10, blocksize = 65536, maxtransfersize = 4194304, buffercount = 160
    3/2/2014 5:56:18 AM - Info dbclient(pid=10124) INF - Number of stripes: 8, Number of buffers per stripe 20. 
    3/2/2014 5:56:20 AM - Info dbclient(pid=10124) INF - Created VDI object for SQL Server instance <client.domainname.com>. Connection timeout is <300> seconds.
    3/2/2014 6:31:13 AM - Info dbclient(pid=10124) INF - ODBC return code <2>, SQL State <01000>, SQL Message <3211><[Microsoft][ODBC SQL Server Driver][SQL Server]10 percent processed.>.
    3/2/2014 7:20:50 AM - Info dbclient(pid=10124) ERR - Error in VxBSASendData: 1.       
    3/2/2014 7:20:50 AM - Info dbclient(pid=10124)     CONTINUATION: - All backup jobs have completed.  
    3/2/2014 7:20:50 AM - Info dbclient(pid=10124) ERR - Internal error. See the dbclient log for more information.  
    3/2/2014 7:20:50 AM - Info dbclient(pid=10124) ERR - Error in DBthreads::dbclient: 6.       
    3/2/2014 7:20:50 AM - Info dbclient(pid=10124)     CONTINUATION: - The system cannot find the file specified.
    3/2/2014 7:20:51 AM - Info dbclient(pid=10124) ERR - Error in VxBSAEndData: 6.       
    3/2/2014 7:20:51 AM - Info dbclient(pid=10124)     CONTINUATION: - The handle used to associate this call with a previous VxBSAInit() call is invalid.
    3/2/2014 7:20:52 AM - Info dbclient(pid=10124) INF - OPERATION #1 of batch C:\NBU_SCRIPT\full_FACT.bch FAILED with STATUS 1 (0 is normal). Elapsed time = 5079(5079) seconds.
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124) INF - Results of executing <C:\NBU_SCRIPT\full_FACT.bch>:       
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124) <0> operations succeeded. <1> operations failed.       
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124) INF - The following object(s) were not backed up successfully.   
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124) INF - FIN_BI_DW_FACT          
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124) INF - Error removing ODBC data source name DSN=NBMSSQL_13936_13828_4    
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124)     CONTINUATION: - Invalid DSN     
    3/2/2014 7:20:54 AM - Info dbclient(pid=10124)     CONTINUATION: - Driver's ConfigDSN, ConfigDriver, or ConfigTranslator failed 
    3/2/2014 7:20:54 AM - Error bpbrm(pid=30879) from client client.domainname.com: ERR - command failed: none of the requested files were backed up (2)
    3/2/2014 7:20:54 AM - Error bpbrm(pid=30879) from client client.domainname.com: ERR - bphdb exit status = 2: none of the requested files were backed up
    3/2/2014 7:21:15 AM - Info bphdb(pid=4736) done. status: 2: none of the requested files were backed up  
    3/2/2014 7:21:15 AM - end writing
    none of the requested files were backed up(2)
     
    Child job 1:
     
    3/2/2014 5:56:29 AM - Info nbjm(pid=2800) starting backup job (jobid=1496354) for client client.domainname.com, policy MSSQL_SS_DW_FACT, schedule Full_App  
    3/2/2014 5:56:29 AM - Info nbjm(pid=2800) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=1496354, request id:{0293ADA0-9021-4EF4-A4AC-488DE7378F22})  
    3/2/2014 5:56:29 AM - requesting resource stu_disk_sp-nbuappliance23
    3/2/2014 5:56:29 AM - requesting resource masterserver.domainname.com.NBU_CLIENT.MAXJOBS.client.domainname.com
    3/2/2014 5:56:29 AM - requesting resource masterserver.domainname.com.NBU_POLICY.MAXJOBS.MSSQL_SS_DW_FACT
    3/2/2014 5:56:34 AM - granted resource masterserver.domainname.com.NBU_CLIENT.MAXJOBS.client.domainname.com
    3/2/2014 5:56:34 AM - granted resource masterserver.domainname.com.NBU_POLICY.MAXJOBS.MSSQL_SS_DW_FACT
    3/2/2014 5:56:34 AM - granted resource MediaID=@aaabJ;DiskVolume=PureDiskVolume;DiskPool=dp_disk_sp-nbuappliance23;Path=PureDiskVolume;StorageServer=sp-nbuappliance23;MediaServer=sp-nbuappliance23
    3/2/2014 5:56:34 AM - granted resource stu_disk_sp-nbuappliance23
    3/2/2014 5:56:35 AM - estimated 0 Kbytes needed
    3/2/2014 5:56:35 AM - Info nbjm(pid=2800) started backup (backupid=client.domainname.com_1393732595) job for client client.domainname.com, policy MSSQL_SS_DW_FACT, schedule Full_App on storage unit stu_disk_sp-nbuappliance23
    3/2/2014 5:56:37 AM - started process bpbrm (31014)
    3/2/2014 5:57:17 AM - Info bpbrm(pid=31014) client.domainname.com is the host to backup data from     
    3/2/2014 5:57:17 AM - Info bpbrm(pid=31014) reading file list for client        
    3/2/2014 5:57:18 AM - connecting
    3/2/2014 5:58:18 AM - Info bpbrm(pid=31014) listening for client connection         
    3/2/2014 5:59:39 AM - Info bpbrm(pid=31014) INF - Client read timeout = 14400      
    3/2/2014 5:59:39 AM - Info bpbrm(pid=31014) accepted connection from client         
    3/2/2014 5:59:40 AM - Info dbclient(pid=10124) Backup started           
    3/2/2014 5:59:40 AM - Info bpbrm(pid=31014) bptm pid: 31523          
    3/2/2014 5:59:40 AM - connected; connect time: 0:02:22
    3/2/2014 5:59:41 AM - Info bptm(pid=31523) start            
    3/2/2014 5:59:43 AM - Info bptm(pid=31523) using 262144 data buffer size        
    3/2/2014 6:00:24 AM - Info bptm(pid=31523) using 32 data buffers         
    3/2/2014 6:00:31 AM - Info bptm(pid=31523) start backup           
    3/2/2014 6:01:52 AM - Info bptm(pid=31523) backup child process is pid 31696       
    3/2/2014 6:01:52 AM - begin writing
    3/2/2014 6:01:56 AM - Info dbclient(pid=10124) dbclient(pid=10124) wrote first buffer(size=65536)         
    3/2/2014 7:20:54 AM - Error bpbrm(pid=31014) socket read failed: errno = 104 - Connection reset by peer  
    3/2/2014 7:28:39 AM - Error bptm(pid=31523) media manager terminated by parent process       
    3/2/2014 9:17:28 AM - Info sp-nbuappliance23(pid=31523) StorageServer=PureDisk:sp-nbuappliance23; Report=PDDO Stats for (sp-nbuappliance23): scanned: 45234690 KB, CR sent: 3272738 KB, CR sent over FC: 0 KB, dedup: 92.8%, cache hits: 19846 (5.6%), rebased: 78 (0.0%)
    3/2/2014 9:17:57 AM - Info dbclient(pid=10124) done. status: 13: file read failed       
    3/2/2014 9:17:57 AM - end writing; write time: 3:16:05
    file read failed(13)
     
    Child job2:
    3/2/2014 5:56:25 AM - Info nbjm(pid=2800) starting backup job (jobid=1496359) for client client.domainname.com, policy MSSQL_SS_DW_FACT, schedule Full_App  
    3/2/2014 5:56:25 AM - Info nbjm(pid=2800) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=1496359, request id:{C430C84C-2BDD-41B5-A65D-9F4D1870DCF0})  
    3/2/2014 5:56:25 AM - requesting resource stu_disk_sp-nbuappliance23
    3/2/2014 5:56:25 AM - requesting resource masterserver.domainname.com.NBU_CLIENT.MAXJOBS.client.domainname.com
    3/2/2014 5:56:25 AM - requesting resource masterserver.domainname.com.NBU_POLICY.MAXJOBS.MSSQL_SS_DW_FACT
    3/2/2014 5:56:29 AM - granted resource masterserver.domainname.com.NBU_CLIENT.MAXJOBS.client.domainname.com
    3/2/2014 5:56:29 AM - granted resource masterserver.domainname.com.NBU_POLICY.MAXJOBS.MSSQL_SS_DW_FACT
    3/2/2014 5:56:29 AM - granted resource MediaID=@aaabJ;DiskVolume=PureDiskVolume;DiskPool=dp_disk_sp-nbuappliance23;Path=PureDiskVolume;StorageServer=sp-nbuappliance23;MediaServer=sp-nbuappliance23
    3/2/2014 5:56:29 AM - granted resource stu_disk_sp-nbuappliance23
    3/2/2014 5:56:32 AM - estimated 0 Kbytes needed
    3/2/2014 5:56:32 AM - Info nbjm(pid=2800) started backup (backupid=client.domainname.com_1393732592) job for client client.domainname.com, policy MSSQL_SS_DW_FACT, schedule Full_App on storage unit stu_disk_sp-nbuappliance23
    3/2/2014 5:56:34 AM - started process bpbrm (31005)
    3/2/2014 5:57:14 AM - Info bpbrm(pid=31005) client.domainname.com is the host to backup data from     
    3/2/2014 5:57:14 AM - Info bpbrm(pid=31005) reading file list for client        
    3/2/2014 5:57:15 AM - connecting
    3/2/2014 5:58:15 AM - Info bpbrm(pid=31005) listening for client connection         
    3/2/2014 5:59:36 AM - Info bpbrm(pid=31005) INF - Client read timeout = 14400      
    3/2/2014 5:59:36 AM - Info bpbrm(pid=31005) accepted connection from client         
    3/2/2014 5:59:37 AM - Info dbclient(pid=10124) Backup started           
    3/2/2014 5:59:37 AM - Info bpbrm(pid=31005) bptm pid: 31489          
    3/2/2014 5:59:37 AM - connected; connect time: 0:02:22
    3/2/2014 5:59:37 AM - Info bptm(pid=31489) start            
    3/2/2014 5:59:38 AM - Info bptm(pid=31489) using 262144 data buffer size        
    3/2/2014 6:00:19 AM - Info bptm(pid=31489) using 32 data buffers         
    3/2/2014 6:00:27 AM - Info bptm(pid=31489) start backup           
    3/2/2014 6:01:48 AM - Info bptm(pid=31489) backup child process is pid 31694       
    3/2/2014 6:01:49 AM - begin writing
    3/2/2014 6:01:56 AM - Info dbclient(pid=10124) dbclient(pid=10124) wrote first buffer(size=65536)         
    3/2/2014 7:20:47 AM - Info dbclient(pid=10124) done. status: 6          
    3/2/2014 7:28:39 AM - Error bptm(pid=31489) media manager terminated by parent process       
    3/2/2014 9:17:25 AM - Info sp-nbuappliance23(pid=31489) StorageServer=PureDisk:sp-nbuappliance23; Report=PDDO Stats for (sp-nbuappliance23): scanned: 46257154 KB, CR sent: 3218010 KB, CR sent over FC: 0 KB, dedup: 93.0%, cache hits: 21267 (5.9%), rebased: 10 (0.0%)
    3/2/2014 9:17:56 AM - end writing; write time: 3:16:07
    the backup failed to back up the requested files(6)
    3/2/2014 9:17:56 AM - Info dbclient(pid=10124) done. status: 6: the backup failed to back up the requested files 
     

     

  • It could be a timeout, but we will need to see dbclient on SQL client and bptm and bpbrm to know for sure.
    Actually - timeout is already VERY high:  Client read timeout = 14400 
    So, this should not be the cause.

    Also seems like there is a problem on the client:

    3/2/2014 7:20:50 AM - Info dbclient(pid=10124) ERR - Internal error. See the dbclient log for more information.  

    All of the above logs are needed to troubleshoot.
    You have probably already submitted level 5 logs to Symantec.
    Let them get on with it. Those logs are W-A-Y too big and time consuming for us 'mere mortals' to try and sift through...

    Any errors in Event Viewer Application log?
    Or SQL ERROLOG and VDI log?

  • An hour and 20 minutes every time - definately sounds like a timeout - that is 4800 seconds so not your client read timeout

    Is there a firewall between the sql server and the Master or Media Server?

    Did you try the keepalive settings that I suggested?

    Having said that they also all fail at 7.20 am - is this a regular time for them to fail? If so have you checked that your SQL admins are not running any database maintenace (SQL dumps etc) on the servers at that time?

  • There is no firewall between the sql server and the Master or Media Server.

    Root cause and Solution of the problem:

     

    Backline Root Cause Analysis

     

     

    RCA Author:

     John Peters

     

     

    Error Code/Message 

    Error 13

    Diagnostic Steps 

    Verbose logging, TCP dumps, Media,Master,client I/O parameter adjustments

    Related E-Tracks

     

    3445293

    Problem Description 

     MS SQL backups are failing randomly if MSDP is used.

    Cause 

    It is believed that several adjustments that have been done on Master and Media Servers  have resolved the issues .  Engineering believes that both Network and IO tuning ended the congestion, hence the bottleneck, resulting NBU to successfully complete backups.

    Solution 

     Master, Media Server and client side, I/O parameter adjustments.

    Resolution Criteria 

     ----

    Description Resolution  (if Criteria is not met) 

     

    In our attempts to determine the root cause of the status 13 failures

    when backing up SQL databases to MSDP using media side deduplication,

     

    From the initial logs we could see that after some time, the dbclient

    process attempted to send data to the appliance, ie :

    16:07:53.335 [17300.26824] <4> VxBSASendData: INF - entering SendData.

    16:07:53.335 [17300.26824] <4> dbc_put: INF - sending 65536 bytes

    16:07:53.335 [17300.26824] <2> writeToServer: bytesToSend = 65536 bytes

    16:07:53.335 [17300.26824] <2> writeToServer: start send -- try=1

    16:08:12.224 [17300.26824] <16> writeToServer: ERR - send() to server

    on socket failed:

    This message occurs a as a consequence of the OS library call send()

    returning a non-zero status attempting to send data to a file

    descriptor ( socket ) that is associated with a network connection.

    When this failure occurs, dbclient sends a failure status to the media

    server and bpbrm terminates the job resulting in the status 13. The OS

    library call to recv() data on the media server does not receive any

    data or indication from other end of the network connection at the time

    of the failure which indicates that the TCP stacks on the two hosts are

    not in agreement on the state of the connection.

    What we needed to find out was if the problem was occurring in the TCP

    stack on the client or appliance side of that connection.

    In order to understand this, it was felt necessary to look at the

    problem from a TCP level, we hoped that this would show if data was

    flowing or if there was a plugin ingest issue.

    For this to happen, we would need to perform network packet capture

    between the hosts, as well as performing strace on bptm  processes

    together with full debug logging enabled.

    This approach poses a difficult problem based on the intermittent

    nature of the failures and the indeterminate time taken for jobs to

    fail. The resulting logs would be large and the tcpdump / straces enormous.

    We attempted to work around this by rotating the tcp capture logs and

    terminating the traces as soon as the job failed in an attempt to

    capture the critical information.

    However, when we set this test up, the additional load imposed by the

    tracing and logging impacted the appliance to such a degree that the

    backup eventually failed with a 13, unfortunately we do not feel that

    this failure was the same issue seen for earlier failures.

    In this case when the job was repeated with tracing turned off it was

    successful.

     

     

    In order to try and prevent the issue occuring we made some

    tuning changes to components we felt were potentially the source of the

    problem,   specifically :

    1) Increased the TCP retransmission settings on the client to 15 (

    http://support.microsoft.com/default.aspx?scid=kb;en-us;170359)

    2) Implemented following settings on the appliance :

    echo 128 > /usr/openv/netbackup/db/config/CD_NUMBER_DATA_BUFFERS

     

    echo 524288 > /usr/openv/netbackup/db/config/CD_SIZE_DATA_BUFFERS

       

    echo 180 > /usr/openv/netbackup/db/config/CD_UPDATE_INTERVAL

    touch /usr/openv/netbackup/db/config/CD_WHOLE_IMAGE_COPY

    echo 0 > /usr/openv/netbackup/NET_BUFFER_SZ

    echo 0 > /usr/openv/netbackup/NET_BUFFER_REST

    echo 256 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS

    echo 512 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_DISK

    echo 16 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS_FT

     

    echo 1500 > /usr/openv/netbackup/db/config/OS_CD_BUSY_RETRY_LIMIT

       

    echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS

    echo 1048576 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_DISK

     

    echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS_FT

         

    echo 3600 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

    echo 3600 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTSENDTMO

    touch /usr/openv/netbackup/db/config/DPS_PROXYNOEXPIRE

    echo 5000 > /usr/openv/netbackup/db/config/MAX_FILES_PER_ADD

    3) Set /usr/openv/pdde/pdcr/etc/contentrouter.cfg

    PrefetchThreadNum=16

    MaxNumCaches=1024

    ReadBufferSize=65536

    WorkerThreads=128

    4) Set /usr/openv/lib/ost-plugins/pd.conf

     

    PREFETCH_SIZE = 67108864

    CR_STATS_TIMER = 300

     

    5) Modified the "Client Communication Buffer size"

     

    Host properties --> Client Settings --> Communication Buffer size set

    to 256kb

     

     

    6) MSSQL script was modified :

     

    BUFFERCOUNT to 2

    STRIPE to 4

     

    7) Additionally we would always advise to check the "resolution"

    section of this technote http://www.symantec.com/docs/TECH37372.