Solved: BPDBM is choking system memory and making it hung eventually

ajiwww · ‎09-11-2013

NBU version is 7.1.0.4 and OS version is Solaris 10.

The issue started 2 days ago, when the server suddenly began hanging. We later identified that BPDBM was the process using a very large amount of memory.

Here is some prstat output from just before it hung. The system has 32 GB of RAM.

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
14876 root       19G   18G cpu0    11    0   0:08:17 3.8% bpdbm/1
 2241 root     3029M 2960M cpu10   11    0  21:03:15 3.8% dbsrv11/64
 2387 root      468M  356M sleep   59    0   0:16:24 0.1% nbstserv/19
 2347 root      292M  213M sleep   59    0  13:01:21 2.6% nbemm/98
 2350 root      265M  209M sleep   59    0   1:26:17 0.4% nbrb/10
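For anyone who wants to spot this condition automatically, here is a minimal Python sketch that reads prstat-style rows and flags any process whose RSS exceeds a chosen share of physical RAM. The function names (parse_size, memory_hogs), the 50% threshold, and the assumption that every data row has exactly ten whitespace-separated fields are illustrative choices, not anything taken from NetBackup or Solaris tooling.

```python
import re

# Size suffixes as printed in the prstat rows above, treated as binary multiples.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '19G' or '3029M' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])", text)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def memory_hogs(prstat_text, ram_bytes, threshold=0.5):
    """Return (pid, process, rss_bytes) for rows whose RSS exceeds threshold * RAM."""
    hogs = []
    for line in prstat_text.splitlines():
        fields = line.split()
        # Assumption: a data row has 10 fields and starts with a numeric PID.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        rss = parse_size(fields[3])
        if rss > threshold * ram_bytes:
            hogs.append((int(fields[0]), fields[9], rss))
    return hogs


# Rows copied from the prstat output in this post.
SAMPLE = """\
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
14876 root 19G 18G cpu0 11 0 0:08:17 3.8% bpdbm/1
2241 root 3029M 2960M cpu10 11 0 21:03:15 3.8% dbsrv11/64
2387 root 468M 356M sleep 59 0 0:16:24 0.1% nbstserv/19
"""

if __name__ == "__main__":
    for pid, name, rss in memory_hogs(SAMPLE, ram_bytes=32 * 1024**3):
        print(pid, name, rss // 1024**2, "MB")
```

With the rows above and 32 GB of RAM, only the bpdbm row crosses the 50% line; the threshold would need tuning for a real master server.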

/var/adm/messages:

Sep 8 06:09:24 nbupitsun002 tldcd[22500]: [ID 673158 daemon.error] TLD(1) fork failure, Resource temporarily unavailable

Sep 8 06:19:24 nbupitsun002 sendmail[1404]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable

Sep 8 06:25:44 nbupitsun002 sendmail[1402]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable

Sep 8 07:05:23 nbupitsun002 sshd[526]: [ID 800047 auth.error] error: fork: Error 0
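The fork failures above are the visible symptom of memory exhaustion, so it can help to pull them out of the log and see when they began. The following is a rough Python sketch that assumes the syslog layout shown in these four lines (month, day, time, host, daemon[pid], message); the regular expression and the fork_failures name are my own and may need adjusting for other message formats.

```python
import re

# Assumed layout: "Sep 8 06:09:24 host daemon[pid]: message"
LINE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[^\s\[]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def fork_failures(log_text):
    """Return (timestamp, daemon) for every log line whose message mentions fork."""
    hits = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and "fork" in match.group("msg").lower():
            hits.append((match.group("when"), match.group("daemon")))
    return hits


# Lines copied from /var/adm/messages in this post.
SAMPLE = """\
Sep 8 06:09:24 nbupitsun002 tldcd[22500]: [ID 673158 daemon.error] TLD(1) fork failure, Resource temporarily unavailable
Sep 8 06:19:24 nbupitsun002 sendmail[1404]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable
Sep 8 06:25:44 nbupitsun002 sendmail[1402]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable
Sep 8 07:05:23 nbupitsun002 sshd[526]: [ID 800047 auth.error] error: fork: Error 0
"""

if __name__ == "__main__":
    for when, daemon in fork_failures(SAMPLE):
        print(when, daemon)
```

The first timestamp returned gives a rough idea of when the system ran out of resources, which can be lined up against the bpdbm process start times.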

If we kill that particular PID, everything returns to normal. But after some time, another PID starts the same behaviour. All the BPDBM processes start out with normal memory usage, but then one of them grows within a matter of seconds until it is using all the memory, eventually making the system hang.

The issue has been going on for the last 2 days, and Symantec hasn't found much in the logs. The server keeps hanging and getting rebooted afterwards.
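Because the growth happens within seconds, a check on growth rate can catch the runaway process before it reaches the whole of RAM. Below is a small Python sketch of that idea. It only does the decision logic: how the samples are collected (for example by polling prstat or ps) is left out, and the find_runaway name, the 50% size limit, and the 100 MB per second growth limit are assumptions chosen for illustration.

```python
def find_runaway(samples, ram_bytes, max_fraction=0.5,
                 max_growth_per_sec=100 * 1024**2):
    """Return sorted PIDs that look like runaway processes.

    samples maps pid -> list of (timestamp_seconds, rss_bytes), oldest first.
    A PID is flagged if its latest RSS exceeds max_fraction of RAM, or if RSS
    grew faster than max_growth_per_sec between the last two samples.
    """
    runaway = []
    for pid, points in samples.items():
        if not points:
            continue
        last_time, last_rss = points[-1]
        too_big = last_rss > max_fraction * ram_bytes
        too_fast = False
        if len(points) >= 2:
            prev_time, prev_rss = points[-2]
            elapsed = last_time - prev_time
            if elapsed > 0:
                too_fast = (last_rss - prev_rss) / elapsed > max_growth_per_sec
        if too_big or too_fast:
            runaway.append(pid)
    return sorted(runaway)


if __name__ == "__main__":
    MB, GB = 1024**2, 1024**3
    # Made-up sample values for two bpdbm PIDs, 10 seconds apart.
    samples = {
        14876: [(0, 200 * MB), (10, 6 * GB)],
        14880: [(0, 180 * MB), (10, 185 * MB)],
    }
    print(find_runaway(samples, ram_bytes=32 * GB))
```

The sketch deliberately only reports PIDs; whether to kill the process, as described above, is a decision better left to the operator.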

Kevin_Good · ‎09-11-2013

There are 2 Etracks listed in the NetBackup 7.5.0.6 release notes that look to be describing your issue... I can't seem to find the 7.1.x equivalent, but support should be able to help you track it down...

3019537 - A corrupt '.f' file causes the NetBackup Database Manager (bpdbm) to consume a lot of memory.

3092156 - A corrupt image file causes the NetBackup Database Manager (bpdbm) process to consume CPU and memory until the bpdbm process is killed.

CRZ · ‎09-11-2013

3092156 applies to 7.1.0.4 as well, so Kevin might be onto something. You might want to open a support case referencing that Etrack and see if they can give you an EEB. Then, you will want to make plans to upgrade to 7.5.0.6! ;)

ajiwww · ‎09-12-2013

A case is already open and they are analysing the logs. One EEB was provided for ET3092156 and it has been applied. I also tried bpdbm -consistency, which shows the errors below:

checking image file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR>

>>PRIMARY_COPY is set to an invalid copy

>>EXPIRATION is not set to the next valid copy to expire

checking files file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR.f>

checking image file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL>

>image is invalid

>>failed backup

>>image not deleted yet, backup may be resumed

>skipping - copy 1 on disk

checking files file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f>

</usr/openv/netbackup/db/images/BR249FILE1.bu/1374000000/Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f> does not exist

checking image file <Cigna_VMware_Prod_1378853406_FULL>

>image is invalid

>>failed backup

checking files file <Cigna_VMware_Prod_1378853406_FULL.f>

</usr/openv/netbackup/db/images/CIGPITLINVMPRTAP802/1378000000/Cigna_VMware_Prod_1378853406_FULL.f> does not exist

checking image file <eHRO_VMware_Tier2_Win_V_1378804188_INCR>

>image is invalid

>>failed backup

>skipping - copy 1 on disk

checking files file <eHRO_VMware_Tier2_Win_V_1378804188_INCR.f>

</usr/openv/netbackup/db/images/BUCKPITVPWHL002.acs.uscoopers.com/1378000000/eHRO_VMware_Tier2_Win_V_1378804188_INCR.f> does not exist

checking image file <Cigna_VMware_Prod_1377597706_INCR>

>skipping - copy 1 on disk

>copy 2 frag -1 media CI0023 host nbupitwin001: copy has expired

>copy 2 frag -1 media CI0023 host nbupitwin001: media is not allocated to this host nor is in an appropriate server group

>copy 2 frag 1 media CI0023 host nbupitwin001: copy has expired

>copy 2 frag 1 media CI0023 host nbupitwin001: media is not allocated to this host nor is in an appropriate server group

There are many more errors like these. Unfortunately, we are not able to complete this check, because BPDBM's memory usage climbs again and we have to kill it.
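Since the check never runs to completion, it may be worth summarising whatever output it does produce, so the images with missing '.f' files can be handed to support. Here is a minimal Python sketch that groups the findings per image and collects the paths reported as missing. It assumes the output looks exactly like the excerpt above ("checking image file <...>", lines starting with ">", and "<path> does not exist"); that layout is inferred from this post only, and the summarize name is my own.

```python
import re

IMAGE = re.compile(r"^checking image file <(.+)>$")
MISSING = re.compile(r"^<(.+)> does not exist$")


def summarize(output):
    """Return (findings, missing).

    findings maps image name -> list of findings reported for that image.
    missing is the list of paths reported as "does not exist".
    """
    findings = {}
    missing = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = IMAGE.match(line)
        if match:
            current = match.group(1)
            findings.setdefault(current, [])
            continue
        match = MISSING.match(line)
        if match:
            missing.append(match.group(1))
            continue
        # Finding lines start with one or more '>' characters.
        if line.startswith(">") and current is not None:
            findings[current].append(line.lstrip(">").strip())
    return findings, missing


# Excerpt copied from the consistency check output in this post.
SAMPLE = """\
checking image file <Cigna_VMware_Prod_1378853406_FULL>
>image is invalid
>>failed backup
checking files file <Cigna_VMware_Prod_1378853406_FULL.f>
</usr/openv/netbackup/db/images/CIGPITLINVMPRTAP802/1378000000/Cigna_VMware_Prod_1378853406_FULL.f> does not exist
"""

if __name__ == "__main__":
    findings, missing = summarize(SAMPLE)
    for image, notes in findings.items():
        print(image, notes)
    for path in missing:
        print("missing:", path)
```

Run against a saved copy of the partial output, this gives a short list of suspect images instead of pages of per-fragment messages.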
