09-11-2013 12:11 PM
NBU version is 7.1.0.4 and OS version is Solaris 10.
The issue started two days ago, when the server suddenly began hanging. We later identified BPDBM as the process using very high memory.
Here are some outputs from prstat just before it hung. The system has 32 GB of RAM.
/var/adm/messages :-
Sep 8 06:09:24 nbupitsun002 tldcd[22500]: [ID 673158 daemon.error] TLD(1) fork failure, Resource temporarily unavailable
Sep 8 06:19:24 nbupitsun002 sendmail[1404]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable
Sep 8 06:25:44 nbupitsun002 sendmail[1402]: [ID 702911 mail.info] runqueue: Skipping queue run -- fork() failed: Resource temporarily unavailable
Sep 8 07:05:23 nbupitsun002 sshd[526]: [ID 800047 auth.error] error: fork: Error 0
If we kill that particular PID, everything returns to normal. But after some time, another PID starts the same behaviour. All the BPDBM PIDs start with normal memory usage, but then one of them grows in a matter of seconds, uses up all the memory, and eventually makes the system hang.
The issue has been going on for the last 2 days and Symantec didn't find much in the logs. The server keeps hanging and has to be rebooted each time.
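For anyone trying to catch which bpdbm PID is the one growing, here is a minimal sketch. It assumes you capture the text output of `ps -eo pid,rss,comm` (RSS in kilobytes) and pass it in as a string. The function name `bpdbm_by_rss` and the sample PIDs and sizes are hypothetical, made up purely for illustration.

```python
def bpdbm_by_rss(ps_output):
    """Return (pid, rss_kb) tuples for bpdbm processes, largest RSS first.

    Assumes three whitespace-separated columns: PID, RSS, COMMAND.
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, rss, comm = parts
        if not (pid.isdigit() and rss.isdigit()):
            continue  # skips the header line and anything malformed
        # The command may be printed as a bare name or as a full path.
        if comm.strip().split("/")[-1] == "bpdbm":
            rows.append((int(pid), int(rss)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical sample input, not taken from the affected server.
sample = """\
  PID      RSS COMMAND
 1402     3120 /usr/lib/sendmail
22611    48200 /usr/openv/netbackup/bin/bpdbm
22790 29874112 /usr/openv/netbackup/bin/bpdbm
"""

for pid, rss_kb in bpdbm_by_rss(sample):
    print(pid, rss_kb)
```

Running this every few seconds from a small loop and logging the top entry would show which PID starts growing, and roughly when, so it can be matched against the bpdbm logs.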
09-11-2013 12:48 PM
There are 2 Etracks listed in the NetBackup 7.5.0.6 release notes that look like they describe your issue. I can't seem to find the 7.1.x equivalent, but support should be able to help you track it down.
3019537 - A corrupt '.f' file causes the NetBackup Database Manager (bpdbm) to consume a lot of
memory.
3092156 - A corrupt image file causes the NetBackup Database Manager (bpdbm) process to consume
CPU and memory until the bpdbm process is killed.
09-11-2013 01:33 PM
3092156 applies to 7.1.0.4 as well, so Kevin might be onto something. You might want to open a support case referencing that Etrack and see if they can give you an EEB. Then, you will want to make plans to upgrade to 7.5.0.6! ;)
09-12-2013 02:24 AM
A case is already open and they are analysing the logs. An EEB was provided for ET3092156 and it has been applied. I tried a bpdbm -consistency and it shows the errors below:
checking image file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR>
>>PRIMARY_COPY is set to an invalid copy
>>EXPIRATION is not set to the next valid copy to expire
checking files file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR.f>
checking image file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL>
>image is invalid
>>failed backup
>>image not deleted yet, backup may be resumed
>skipping - copy 1 on disk
checking files file <Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f>
</usr/openv/netbackup/db/images/BR249FILE1.bu/1374000000/Ralcorp_Remote_Ripon_BR249FILE1_1_1374876063_FULL.f> does not exist
checking image file <Cigna_VMware_Prod_1378853406_FULL>
>image is invalid
>>failed backup
checking files file <Cigna_VMware_Prod_1378853406_FULL.f>
</usr/openv/netbackup/db/images/CIGPITLINVMPRTAP802/1378000000/Cigna_VMware_Prod_1378853406_FULL.f> does not exist
checking image file <eHRO_VMware_Tier2_Win_V_1378804188_INCR>
>image is invalid
>>failed backup
>skipping - copy 1 on disk
checking files file <eHRO_VMware_Tier2_Win_V_1378804188_INCR.f>
</usr/openv/netbackup/db/images/BUCKPITVPWHL002.acs.uscoopers.com/1378000000/eHRO_VMware_Tier2_Win_V_1378804188_INCR.f> does not exist
checking image file <Cigna_VMware_Prod_1377597706_INCR>
>skipping - copy 1 on disk
>copy 2 frag -1 media CI0023 host nbupitwin001: copy has expired
>copy 2 frag -1 media CI0023 host nbupitwin001: media is not allocated to this host nor is in an appropriate server group
>copy 2 frag 1 media CI0023 host nbupitwin001: copy has expired
>copy 2 frag 1 media CI0023 host nbupitwin001: media is not allocated to this host nor is in an appropriate server group
There are many more errors like these. Unfortunately, we are not able to complete this check, since BPDBM memory usage goes high again and we have to kill it.
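Since the check cannot run to completion, it may help to summarise whatever partial output was captured. The following is a rough sketch that assumes the line formats are exactly as in the excerpt above ("checking image file <...>", ">image is invalid", "<path.f> does not exist"). The function name `summarize_consistency` is hypothetical, and the script only reads the text; it does not touch the image catalog.

```python
import re

# Patterns inferred from the output excerpt in this thread (an assumption).
CHECK_RE = re.compile(r"^checking image file <(.+)>$")
MISSING_RE = re.compile(r"^<(.+\.f)> does not exist$")


def summarize_consistency(text):
    """Collect images flagged invalid and .f files reported missing."""
    invalid_images = []
    missing_f_files = []
    current_image = None
    for raw in text.splitlines():
        line = raw.strip()
        match = CHECK_RE.match(line)
        if match:
            current_image = match.group(1)
            continue
        if line == ">image is invalid" and current_image is not None:
            invalid_images.append(current_image)
            continue
        match = MISSING_RE.match(line)
        if match:
            missing_f_files.append(match.group(1))
    return {"invalid_images": invalid_images, "missing_f_files": missing_f_files}


# Sample lines copied from the output posted above.
sample = """\
checking image file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR>
>>PRIMARY_COPY is set to an invalid copy
checking files file <Ralcorp_Remote_SBeloit_BRBELOITDC1_1_1378159247_INCR.f>
checking image file <Cigna_VMware_Prod_1378853406_FULL>
>image is invalid
>>failed backup
checking files file <Cigna_VMware_Prod_1378853406_FULL.f>
</usr/openv/netbackup/db/images/CIGPITLINVMPRTAP802/1378000000/Cigna_VMware_Prod_1378853406_FULL.f> does not exist
"""

summary = summarize_consistency(sample)
print(summary["invalid_images"])
print(summary["missing_f_files"])
```

The resulting lists give support a compact view of which images were flagged before bpdbm had to be killed. Which of them, if any, is the corrupt one that triggers the memory growth still has to be confirmed with support.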