bpdbm is consuming all system memory and eventually hanging the server

NBU version is 7.1.0.4 and OS version is Solaris 10.

 

The issue started 2 days ago, when the server suddenly began hanging. We later identified that bpdbm was the process using very high memory.

 

Here is some prstat output from just before the server hung. The system has 32 GB of RAM.

 

 

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 14876 root       19G   18G cpu0    11    0   0:08:17 3.8% bpdbm/1
  2241 root     3029M 2960M cpu10   11    0  21:03:15 3.8% dbsrv11/64
  2387 root      468M  356M sleep   59    0   0:16:24 0.1% nbstserv/19
  2347 root      292M  213M sleep   59    0  13:01:21 2.6% nbemm/98
  2350 root      265M  209M sleep   59    0   1:26:17 0.4% nbrb/10
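
For anyone who wants to catch the runaway process before the box hangs, here is a minimal sketch that parses prstat-style output and flags any process whose RSS exceeds a share of physical RAM. It is only an illustration: the function names (`to_bytes`, `heavy_processes`) and the 50% threshold are my own assumptions, and the sample lines are copied from the output above rather than read from a live system.

```python
import re

# Sample lines copied from the prstat output in this post.
PRSTAT_SAMPLE = """\
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 14876 root       19G   18G cpu0    11    0   0:08:17 3.8% bpdbm/1
  2241 root     3029M 2960M cpu10   11    0  21:03:15 3.8% dbsrv11/64
  2387 root      468M  356M sleep   59    0   0:16:24 0.1% nbstserv/19
"""

# Size suffixes as they appear in the SIZE and RSS columns.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def to_bytes(text):
    """Convert a size such as '19G' or '3029M' to a byte count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])", text)
    if match is None:
        raise ValueError("unrecognised size: %r" % text)
    return int(float(match.group(1)) * UNITS[match.group(2)])


def heavy_processes(prstat_text, ram_bytes, fraction=0.5):
    """Return (pid, name, rss_bytes) for processes above fraction * RAM.

    The 0.5 default is an arbitrary threshold chosen for illustration.
    """
    heavy = []
    for line in prstat_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 10-column process row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        pid = int(fields[0])
        rss = to_bytes(fields[3])
        name = fields[9].split("/")[0]  # drop the /NLWP suffix
        if rss > fraction * ram_bytes:
            heavy.append((pid, name, rss))
    return heavy


if __name__ == "__main__":
    # 32 GB of RAM, as stated in the post.
    for pid, name, rss in heavy_processes(PRSTAT_SAMPLE, 32 * 1024 ** 3):
        print(pid, name, rss)
```

With the sample above, only the bpdbm process (PID 14876) crosses the threshold. On a live server the text would have to come from an actual prstat run, and what to do with a flagged PID is left to the operator.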
 

 

/var/adm/messages :-

 

Sep  8 06:09:24 nbupitsun002 tldcd[22500]: [ID 673158 daemon.error] TLD(1) fork failure, Resource temporarily unavailable

Sep  8 06:19:24 nbupitsun002 sendmail[1404]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable

Sep  8 06:25:44 nbupitsun002 sendmail[1402]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable

Sep  8 07:05:23 nbupitsun002 sshd[526]: [ID 800047 auth.error] error: fork: Error 0

 

If we kill that particular PID, everything returns to normal. But after some time, another PID starts showing the same behaviour. Every bpdbm process starts with normal memory usage, but then one of them grows within a matter of seconds until it is using all available memory, which eventually hangs the system.

 

The issue has been going on for the last 2 days, and Symantec has not found much in the logs. The server keeps hanging and has to be rebooted each time.

There are 2 Etracks listed in the NetBackup 7.5.0.6 release notes that look to be describing your issue. I can't seem to find the 7.1.x equivalent, but support should be able to help you track it down.

3019537 - A corrupt '.f' file causes the NetBackup Database Manager (bpdbm) to consume a lot of memory.

3092156 - A corrupt image file causes the NetBackup Database Manager (bpdbm) process to consume CPU and memory until the bpdbm process is killed.

 

3092156 applies to 7.1.0.4 as well, so Kevin might be onto something.  You might want to open a support case referencing that Etrack and see if they can give you an EEB.  Then, you will want to make plans to upgrade to 7.5.0.6!  ;-)

A case is already open and they are analysing the logs. One EEB was provided for ET3092156 and it has been applied. We tried bpdbm -consistency, and it shows the errors below:

 

checking image file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR>

>>PRIMARY_COPY is set to an invalid copy

>>EXPIRATION is not set to the next valid copy to expire

checking files file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR.f>

 

 

checking image file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL>

>image is invalid

>>failed backup

>>image not deleted yet, backup may be resumed

>skipping - copy 1 on disk

checking files file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f>

</usr/openv/netbackup/db/images/BR249FILE1.bu/1374000000/Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f> does not exist

 

 

checking image file <Cigna_VMware_Prod_1378853406_FULL>

>image is invalid

>>failed backup

checking files file <Cigna_VMware_Prod_1378853406_FULL.f>

</usr/openv/netbackup/db/images/CIGPITLINVMPRTAP802/1378000000/Cigna_VMware_Prod_1378853406_FULL.f> does not exist

 

 

checking image file <eHRO_VMware_Tier2_Win_V_1378804188_INCR>

>image is invalid

>>failed backup

>skipping - copy 1 on disk

checking files file <eHRO_VMware_Tier2_Win_V_1378804188_INCR.f>

</usr/openv/netbackup/db/images/BUCKPITVPWHL002.acs.uscoopers.com/1378000000/eHRO_VMware_Tier2_Win_V_1378804188_INCR.f> does not exist

 

 

checking image file <Cigna_VMware_Prod_1377597706_INCR>

>skipping - copy 1 on disk

>copy 2 frag -1 media CI0023 host nbupitwin001: copy has expired

>copy 2 frag -1 media CI0023 host nbupitwin001: media is not allocated to this host nor is in an appropriate server group

>copy 2 frag 1 media CI0023 host nbupitwin001: copy has expired

>copy 2 frag 1 media CI0023 host nbupitwin001: media is not allocated to this host nor is in an appropriate server group

 

 

 

 

There are many more errors like these. Unfortunately, we are not able to complete this check, because bpdbm memory usage climbs again and we have to kill it.
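
Since the check never runs to completion, it may help to summarise whatever partial output is captured. The sketch below pulls out every '.f' path reported as "does not exist". It assumes the output has been saved as text in exactly the format pasted above; the function name `missing_files` is hypothetical, and the sample is a short excerpt from this post.

```python
import re

# Short excerpt of the consistency output pasted in this post.
CONSISTENCY_SAMPLE = """\
checking image file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL>
>image is invalid
>>failed backup
checking files file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f>
</usr/openv/netbackup/db/images/BR249FILE1.bu/1374000000/Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f> does not exist
checking image file <Cigna_VMware_Prod_1377597706_INCR>
>skipping - copy 1 on disk
"""

# Matches lines of the form "</absolute/path> does not exist".
MISSING = re.compile(r"^<(/[^>]+)> does not exist$")


def missing_files(text):
    """Return the paths reported as missing, in the order they appear."""
    paths = []
    for line in text.splitlines():
        match = MISSING.match(line.strip())
        if match:
            paths.append(match.group(1))
    return paths


if __name__ == "__main__":
    for path in missing_files(CONSISTENCY_SAMPLE):
        print(path)
```

A list like this only shows which images the check complained about before it was killed. It does not by itself identify which image or '.f' file is triggering the memory growth.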