Very slow Exchange GRT backups

lu · ‎11-26-2009

Hi ! We have NBU 6.5.4 and try to run GRT backups on a W2003 SP2 R2 server, with all needed nfs patches. We can see two phases in the backup 1- very fast backup of the store via SAN client 2- very slow indexation when the NFS client/server is involved. It takes 1 to 4 hours to index the 60GB store which contains 5 mailboxes. The exchange server and the media server are connected to the same switch. If we snoop the dialog on the media server, we can see that NFS packets are sent very slowly (about 4-8 KB/s). We see that the NBU nfs server waits 1 second before sending the data (see snoop below). We also found this 1 second delay in the bpdm log (see below). Happens with basic disk and advanced disk. Any idea how to solve this problem ? TIA, Ludo. ------- nfs snoop --------- 2.24509 exchserv -> nbumedia026 NFS C READ3 FH=2283 at 1175568384 for 4096 ==> read request from the Exchange server 2.29524 nbumedia026 -> exchserv TCP D=18624 S=7394 Ack=4153401168 Seq=45132715 Len=0 Win=64240 ==> small ack sent by the media server (len=0) 3.28856 nbumedia026 -> exchserv NFS R READ3 OK (4096 bytes) ==> but data is sent 1 second after the request ! 3.28862 nbumedia026 -> exchserv TCP D=18624 S=7394 Ack=4153401168 Seq=45134175 Len=1460 Win=64240 3.28866 nbumedia026 -> exchserv TCP D=18624 S=7394 Push Ack=4153401168 Seq=45135635 Len=1308 Win=64240 ------- bpdm -------- 14:18:52.572 [13851] <2> read_data: waited for empty buffer 3 times, delayed 71 times 14:18:52.572 [13851] <2> set_restore_cntl: dmcommon.c.6963: firstblk = 0, blocks_to_skip = 0, bytes _to_skip = 0, fragnum = 0 (input parameters) 14:18:52.572 [13851] <2> read_backup: seeking to image relative block number 130967312 frag relativ e block number 130967312 to start read-blockmap ===> here is another 1 second delay 14:18:53.557 [13852] <2> send_bptm_req: [13851] bptm parent answered 0, 0, 0 14:18:53.557 [13852] <2> write_blocks: [13851] writing 2048 data blocks of 512 14:18:53.648 [13852] <2> filter_image_ifr: [13851] sending bp*m position request, curr_frag = 1, ne w_frag = 1, curr_blknum = 130969360, new_blknum = 130967216, firstblk = 130967216 14:18:53.651 [13851] <2> check_positioning: CINDEX 0 wants to skip to frag 1, firstblk 130967216, A CTIVE_GC = 1

bernes_stainz · ‎11-26-2009

Hi Ludo,

we`ve the same poblem

ciao bernes

lu · ‎11-27-2009

Same delay seen in the bpdm logs ?

Mouse · ‎11-28-2009

I've opened couple of cases with Symantec support several months ago and didn't receive any meaningful explanation so far.
It seems that GRT technology either was not tested well or simply doesn't work as advertised.

lu · ‎11-30-2009

Awesome ! With a truss you can see that they added a sleep(1) in the bpdm code ! Please Symantec, fix this code ! ... 3.5539 read(0, " X F E R B L O C K 4 2".., 21) = 21 3.5541 alarm(0) = 600 3.5542 sigaction(SIGALRM, 0xFFBFD0C0, 0x00000000) = 0 new: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000 3.5545 getpid() = 22360 [22355] 3.5546 llseek(4, 0, SEEK_END) = 1964349 3.5548 write(4, " 1 7 : 0 0 : 5 2 . 2 8 5".., 172) = 172 3.5550 nanosleep(0xFFBFE240, 0xFFBFE238) = 0 tmout: 0.000000000 sec resid: 0.000000000 sec 4.5553 nanosleep(0xFFBFE240, 0xFFBFE238) = 0 tmout: 1.000000000 sec resid: 0.000000000 sec 4.5559 kill(22355, SIG#0) = 0 4.5560 getpid() = 22360 [22355] 4.5562 llseek(4, 0, SEEK_END) = 1965001 4.5563 write(4, " 1 7 : 0 0 : 5 3 . 2 8 6".., 77) = 77 4.5566 getpid() = 22360 [22355] 4.5567 llseek(4, 0, SEEK_END) = 1965078 ...

MattS · ‎11-30-2009

Hi Lu,
Thats a good find.

Right now Symantec is still working on another EEB. Today i'm going to test with extending the SoftMountPingtimeout to 60 (in regards to NFS). I'll let you know how that goes.

Matt

lu · ‎11-30-2009

Does GRT works for Exchange 2007 ? Only "alpha quality" for Exchange 2003 ?

MattS · ‎11-30-2009

We are using exchange 2007 and are experiencing this issue. See the post in the other thread for our setup details.
https://www-secure.symantec.com/connect/forums/totally-annoyed-granular-recovery-nbu-653

lu · ‎12-01-2009

So sum up, the problem is found on NBU 6.5.3 or 6.5.4, Win 2k3 or Solaris 10 media server.

lu · ‎12-01-2009

I wanted to remove the 1 second delay in bpdm, so I tried to do this with a LD_PRELOAD32 (Solaris 10). 1- Compile the code below with "cc -G libnanosleep.c -o libnanosleep.so" 2- Create a wrapper for bpdm using this script: #!/bin/ksh LD_PRELOAD_32=/tmp/libnanosleep.so export LD_PRELOAD_32 /usr/openv/netbackup/bin/bpdm.orig "$@" It will replace the 1 sec sleep with a 20ms one. Here are our results: - indexing time=3-4 minutes instead of 1-4 hours - but much higher I/O load on the media server, caused by the fact that 'bpdm' read data in 256k block when NFS needs only 4/8k. So there are 2 bugs in the 6.5.4 bpdm: 1- the 1 second sleep before each seek 2- it reads 256kB when NBU NFS only asks for 4kB This "patch" only addresses the 1st one.

===libnanosleep.c===
#include <stdio.h>
#include <errno.h>
#include <dlfcn.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <strings.h>

static unsigned int (*func)();

unsigned int sleep(unsigned int seconds)
{
  unsigned int retval;
  int fd;
  char buf[256];
  time_t cur_time;

  if(!func) {
    func = (unsigned int (*)()) dlsym(RTLD_NEXT, "sleep");
  }
  
  sprintf(buf, "/tmp/sleep.%d.txt", (int)getpid()); 
  fd = open(buf, O_WRONLY|O_CREAT|O_APPEND);
  time(&cur_time);
  sprintf(buf, "%.19s: sleep=%d\n", ctime(&cur_time), (int)seconds);
  write(fd, (const void *)buf, (size_t)strlen(buf));
  close(fd);

  if (seconds == 1) { usleep(20000); return 0; }
  
  retval = func(seconds);
  return retval;
}

Vapulaflo · ‎12-01-2009

Hi ludo,

I've got exactly the same problem!

Vapulaflo · ‎12-01-2009

Hi Lu,

Thanks for the tip to reduce the nanosleep!.... It s crazy!

lu · ‎12-01-2009

Can you give more details ? (NBU version/platform, Exchange version, stores size, etc)

Vapulaflo · ‎12-01-2009

We try to backup a mail store of 60 Go with 5 mailbox

The windows 2003 server is on R2 SP2 with the NFS patch.

I think I will test your code to reduce the nanosleep time, but I should need a confirmation of a veritas or symantec guy that's working fine....

MattS · ‎12-01-2009

Did anyone experiencing this issue have older versions of netbackup or backup exec previously installed on their exchange servers? I was looking through the registry on our exchange server and I see lots of entries for the previous install of backup exec.

The reason i mention this is because netbackup uses backup exec code to backup exchange.

I guess a better question would be, are any of you working with a fresh OS, exchange install and netbackup install?

lu · ‎12-02-2009

We have a fresh install of the media servers on Solaris 10, so there's no registry. We see that we have the same problem with two completely different OSes, so the problem is really in bpdm.

lu · ‎12-02-2009

Just tried with the 6.5.5 bpdm binary. Still this 1 second delay... (I just installed the bpdm binary, not tried a full 6.5.5 upgrade on the media server: anybody can try with a 6.5.5 upgrade ?)

MattS · ‎12-02-2009

lu i was refering to the client, the exchange server.

By the way, i have experienced this with a RHEL4 media server on a seperate exchange cluster.

lu · ‎12-02-2009

> lu i was refering to the client, the exchange server. Ok ! Our exchange server was recently installed from scratch, for this test. So I think the registry is clean...

MattS · ‎12-02-2009

Good to know, thanks.

VOX

Very slow Exchange GRT backups