RALUS randomly dies

bigbenaugust
Level 3

Server: WS2k3 with BE2010R2

Clients: RHEL4 (32-bit) and RHEL5 (32- and 64-bit), RALUS 4164.111.146017. (also WS2k3, but that's not important here)

Since we upgraded everything from 11d.7170 to 2010R2, the RALUS agents manage to run just fine for a week or two and then at the start of a job, the beremote daemon on one of the Linux clients will crash. I restart the daemon, check the server and all clients for BackupExec updates, install if there are any, and the cycle repeats itself a few weeks later. I did enable debugging in the ralus.cfg, and I got this one line of debugging info when one of the clients crashed last night:

 *** glibc detected *** free(): invalid pointer: 0x0b0209a4 ***

Has anyone else using RALUS with 2010R2 experienced this?

9 REPLIES

Tony_Bigras
Level 2

We are seeing these failures more often on OSX Snow with the same version of RALUS. Jobs typically fail 3 out of 4 times, and the agent then needs to be restarted. This agent was released in July 2010 and there does not appear to be an update for it yet. The current agent is barely usable.

bigbenaugust
Level 3

From what I can tell, the most recent update for RALUS I have deployed is dated 23 December 2010, and the latest file date in the install is 8 Dec 2010.

bigbenaugust
Level 3

I am aware that it's just a glibc error. I have turned up the debugging level in ralus.cfg to see if more information can be obtained next time.

Since I can't edit my original post to add some more info, the BackupExec job log usually looks like this:

 Network control connection is established between 192.168.1.22:4265 <--> 192.168.1.15:6050
Network data connection is established between    192.168.1.22:4266 <--> 192.168.1.15:10001
V-79-57344-65297 - The Linux or Unix resource is not responding. Backup set canceled.
V-79-57344-65297 - The Linux or Unix resource is not responding. Backup set canceled.

So we can see that it establishes a connection, the daemon crashes with a glibc error, and then the client is marked as non-responsive. From looking at the job history, it also looks like it's a pretty regular 14 days or so between failures.

Tony_Bigras
Level 2

Service pack 1 for 2010R2 has a July 12 2010 date. These appear to be version 4164.5

Do you have a link to a more recent RALUS version?

Tony

bigbenaugust
Level 3

I always go to the Agents subdirectory in the install directory on the BackupExec server (in my case, it's D:\Program Files\Symantec\BackupExec\Agents), and if there is a newer version or patch for the RALUS, LiveUpdate will drop it there (in a .tar.gz) when it downloads updates for everything else. I scp it to the servers and install it.

This is probably on page 4578635 of the billion-page admin guide. So I saved you some reading. :)

Alec_K
Level 2

Have you checked how much memory beremote is using?

I am using 2010R2 with the RALUS agents to try to back up Red Hat 9 (yes, really) and Ubuntu Linux file servers.

We have had real issues with the latest version of the agent eating memory: it would crash and take the Linux box down with an out-of-memory error before finishing the backup.

To let it at least complete the backup, I added an extra swap file. On the Red Hat 9 box beremote grew to over 1.9 GB while backing up 170 GB; on the Ubuntu box it grew to 4.9 GB after backing up 170 GB.

Not only does it eat memory, it also doesn't free it when the backup completes!

We have added a cron job to restart the beremote agent daily at midday to make sure it releases the memory.

This was an issue in 12.5 that got hotfixed (http://www.symantec.com/business/support/index?page=content&id=TECH51566) but it seems to be back.
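
For anyone wanting to copy the midday restart, here is a minimal sketch of the kind of wrapper cron could call. It assumes the agent is controlled through /etc/init.d/VRTSralus.init (check your own install); the script name, the RALUS_INIT and RESTART_DELAY variables and the --run flag are made up for this example.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/sbin/restart_beremote.sh.
# ASSUMPTION: the agent is controlled by /etc/init.d/VRTSralus.init;
# set RALUS_INIT if your install uses a different script.

restart_agent() {
    init="${RALUS_INIT:-/etc/init.d/VRTSralus.init}"
    # Stop may fail if beremote has already crashed; carry on regardless.
    "$init" stop
    # Give the daemon a moment to exit before starting it again.
    sleep "${RESTART_DELAY:-5}"
    "$init" start
}

# Only act when asked to, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ]; then
    restart_agent
fi

# Example root crontab entry for a daily restart at midday:
# 0 12 * * * /usr/local/sbin/restart_beremote.sh --run
```

Changing the schedule fields to `0 12 * * 0` would turn this into a weekly restart on Sundays instead. Pick a time when no backup job is running against the box.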

You can check your process memory usage with top: start it, then press <shift>M to sort by memory usage.

In the example below it has only grown to 432 MB so far, but it had only backed up 6 GB:

top - 15:34:30 up 669 days, 17:01,  1 user,  load average: 0.99, 1.12, 1.17
Tasks: 114 total,   1 running, 112 sleeping,   0 stopped,   1 zombie
Cpu(s):  4.1% us,  4.1% sy,  0.0% ni, 74.9% id, 16.8% wa,  0.1% hi,  0.1% si
Mem:   1028316k total,  1018504k used,     9812k free,   334956k buffers
Swap:  6146184k total,   106152k used,  6040032k free,   218268k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16872 root      16   0  432m 281m 9.9m S   32 28.1  20:15.72 beremote
 5287 root      16   0 53372  33m 3052 S    0  3.3 159:46.49 amd
17747 ntp       16   0 13572 4392 3532 S    0  0.4   0:27.50 ntpd
17255 root      16   0 41984 3072 2416 S    0  0.3   0:00.08 smbd
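
If you would rather not sit watching top, the same check can be scripted. The sketch below converts the size fields top prints (plain numbers are KiB, an "m" suffix is MiB, a "g" suffix is GiB) and compares them against a limit. The function names and the 1g limit are only examples, and the commented usage assumes the column layout shown above, where RES is the sixth field.

```shell
#!/bin/sh
# Convert a top-style size field (e.g. 53372, 432m, 4.9g) to whole KiB.
to_kib() {
    echo "$1" | awk '{
        n = $1 + 0                              # numeric part of the field
        u = tolower(substr($1, length($1)))     # last character = unit suffix
        if (u == "g")      n *= 1024 * 1024
        else if (u == "m") n *= 1024
        printf "%d\n", n
    }'
}

# Succeeds (exit status 0) when the first size is larger than the second.
over_limit() {
    [ "$(to_kib "$1")" -gt "$(to_kib "$2")" ]
}

# Example use from cron (assumes RES is column 6, as in the output above):
#   res=$(top -b -n 1 | awk '$NF == "beremote" { print $6; exit }')
#   over_limit "$res" 1g && logger "beremote RES is $res"
```

With the figures from the listing above, `over_limit 281m 1g` fails, so nothing would be logged yet.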

I'm also having issues with the speed of the Linux agent, which is rubbish compared to the Windows version. On Windows, to the same HP tape loader, I see up to 4 GB/min; on Linux I'm lucky if we see 200 MB/min, about 20 times slower.

This means that my Build area (300 GB) can't be backed up overnight!

Alec_K
Level 2

By the way, to find the exact version of beremote you are using:

[adk@darwin /opt/VRTSralus/bin]$ strings beremote | grep Version
_Z14ndmpSetVersionPvt
_Z14ndmpGetVersionPv
VERITAS_Backup_Exec_File_Version=13.0.4164.0

This one came from ralus4164SP1.tar.gz dated 23/12/2010
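
The grep above also catches the mangled ndmp symbol names. If you only want the version number (say, to compare several clients), a small filter like this sketch does it. The function name is made up; the VERITAS_Backup_Exec_File_Version key is the one shown in the output above, and other agent builds may word it differently.

```shell
#!/bin/sh
# Read `strings` output on stdin and print only the value of the
# VERITAS_Backup_Exec_File_Version line, dropping the other matches.
extract_be_version() {
    sed -n 's/^VERITAS_Backup_Exec_File_Version=//p'
}

# Example (path as in the prompt above):
#   strings /opt/VRTSralus/bin/beremote | extract_be_version
```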

Alec

Tony_Bigras
Level 2

Thanks bigbenaugust for the directory heads up. 

I have picked up the ralus4164SP1.tar.gz dated 23/12/2010 and attempted an install on OSX. The script fails, and it appears to fail because there is no Darwinx86 directory in the patch. There are only linux and linux64 directories. So on the face of it there does not appear to be an update for OSX in SP1.

Tony

bigbenaugust
Level 3

I've never had it crash the whole box, but I will have to look into the memory usage. I also haven't noticed any speed issues (Dell PE6650 server with a PV132T (2xLTO3) SCSI library). It is slower than Windows, but it isn't a tremendous difference like you are describing.

beremote on another one of my RHEL4 boxes crashed last night in mid-backup. It seems to happen more to the RHEL4 boxes than the RHEL5 machines. The cron job route is one that I have been seriously considering, but maybe only with a weekly restart as opposed to daily.