NetBackup Appliance hangs then shuts down

Hi troops,

IHAC who has twice found one of his two 5220 appliances (2.6.1.1) turned off, and he had to power it off from the display since IPMI was not responding either.

They opened a case with Symantec, but since there was nothing relevant in the logs, Support could not provide an RCA. I looked at the logs myself and, apart from the gap in the messages log, I did not find any specific errors.

Have you ever seen this?

Below are Support's response and the messages log. The crash must have occurred on 29 August at about 21:42:01.

var messages
****************
Aug 29 21:42:01 nbami-5220 smbd[745]: [2015/08/29 21:42:01,  0] printing/print_cups.c:463(cups_async_callback)
Aug 29 21:42:01 nbami-5220 smbd[745]:   failed to retrieve printer list: NT_STATUS_UNSUCCESSFUL
Aug 31 08:34:11 nbami-5220 syslog-ng[6216]: syslog-ng starting up; version='2.0.9'
Aug 31 08:34:11 nbami-5220 SMmonitor: SMmonitor started. PID=6243
Aug 31 08:34:12 nbami-5220 logger: sisids.init: Using driver version 2.6.32.43/sisfim-x86_64-default.ko
Aug 31 08:34:15 localhost StorageArray: ESM AutoSync Mode = Enabled, started on 8/31/15 8:34 AM
Aug 31 08:34:15 nbami-5220 syslog-ng[6216]: Configuration reload request received, reloading configuration;

Symantec Support reply:

*****************************

• Appliance Details: 
Compute Node nbami-5220 
NetBackup Model = 5220 
Nbapp-release = 2.6.1.1 
Serial Number = SYM1000789 
Bios Version = S5500.86B.01.00.0057.031020111721 
Firmware = 2.120.63-1242/ 1.41.372-2527 
\\pun-evidence.pun.spt.symantec.com\evidence\pun\62\09190462\2015-07-29\ 


• Node does not have any errors. -> Time Monitoring Ran: Mon Jul 27 2015 13:01:26 UTC 

• There is nothing logged in the messages log between Jul 25 07:24:01 and Jul 26 19:42:50 

Jul 25 07:24:01 nbami-5220 /usr/sbin/cron[7188]: (root) CMD (PATH=/opt/VRTSperl/bin:/sbin:/usr/sbin:$PATH; export PATH; . /opt/NBUAppliance/db/nbappdb_env.sh > /dev/null 2>&1; /opt/NBUAppliance/bin/perl -I/opt/NBUAppliance/scripts/ /opt/NBUAppliance/scripts//hwmon/callhome.pl > /dev/null 2>&1) 
Reboot occurred here 
Jul 26 19:42:50 nbami-5220 syslog-ng[6285]: syslog-ng starting up; version='2.0.9' 
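
A gap like the one above can be found mechanically by comparing consecutive timestamps. The following is a minimal Python sketch, not a Symantec tool: the function name `find_gaps`, the one-hour threshold, and the assumed year 2015 (timestamps in this syslog format carry no year) are illustrative assumptions, and the cron line in the sample is abbreviated.

```python
from datetime import datetime, timedelta

def find_gaps(lines, year=2015, threshold=timedelta(hours=1)):
    """Yield (previous, current) timestamps whenever two consecutive
    timestamped lines are further apart than `threshold`."""
    previous = None
    for line in lines:
        try:
            # "Jul 25 07:24:01" has no year, so an assumed one is prepended.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, e.g. "Reboot occurred here"
        if previous is not None and stamp - previous > threshold:
            yield previous, stamp
        previous = stamp

# Sample taken from the excerpt in this thread (cron command abbreviated).
sample = [
    "Jul 25 07:24:01 nbami-5220 /usr/sbin/cron[7188]: (root) CMD (...)",
    "Reboot occurred here",
    "Jul 26 19:42:50 nbami-5220 syslog-ng[6285]: syslog-ng starting up; version='2.0.9'",
]

gaps = list(find_gaps(sample))
for start, end in gaps:
    print(f"no log entries between {start} and {end} (gap of {end - start})")
```

Run against a full messages file, this would list every silent period longer than the threshold, which makes it easier to compare the two outages side by side.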

The RAID card firmware terminal log shows that a consistency check (CC) was running during that time: 
\\pun-evidence.pun.spt.symantec.com\evidence\pun\62\09190462\2015-07-29\DataCollect\SYM1000789\Raid-... 
07/25/15 7:14:03: EVT#93269-07/25/15 7:14:03: 94=Patrol Read progress on PD 17(e0x18/s16) is 69.99%(14699s) 
07/25/15 7:15:13: EVT#93270-07/25/15 7:15:13: 65=Consistency Check progress on VD 00/0 is 10.19%(15313s) 
07/25/15 7:15:53: EVT#93271-07/25/15 7:15:53: 65=Consistency Check progress on VD 01/1 is 28.60%(15353s) 
07/25/15 7:18:07: EVT#93272-07/25/15 7:18:07: 65=Consistency Check progress on VD 01/1 is 28.85%(15487s) 
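
To see at a glance which background operations were in flight shortly before the log went silent, the progress events can be reduced to the latest value per target. This is a rough Python sketch; the regular expression is an assumption based only on the four lines quoted above and may not cover other event formats in the firmware log.

```python
import re

# Matches e.g. "65=Consistency Check progress on VD 01/1 is 28.60%(15353s)".
PROGRESS = re.compile(
    r"=(?P<task>Patrol Read|Consistency Check) progress on "
    r"(?P<target>(?:PD|VD) \S+) is (?P<pct>\d+(?:\.\d+)?)%"
)

def latest_progress(lines):
    """Return {(task, target): last reported percentage}."""
    latest = {}
    for line in lines:
        m = PROGRESS.search(line)
        if m:
            latest[(m["task"], m["target"])] = float(m["pct"])
    return latest

sample = [
    "07/25/15 7:14:03: EVT#93269-07/25/15 7:14:03: 94=Patrol Read progress on PD 17(e0x18/s16) is 69.99%(14699s)",
    "07/25/15 7:15:13: EVT#93270-07/25/15 7:15:13: 65=Consistency Check progress on VD 00/0 is 10.19%(15313s)",
    "07/25/15 7:15:53: EVT#93271-07/25/15 7:15:53: 65=Consistency Check progress on VD 01/1 is 28.60%(15353s)",
    "07/25/15 7:18:07: EVT#93272-07/25/15 7:18:07: 65=Consistency Check progress on VD 01/1 is 28.85%(15487s)",
]

progress = latest_progress(sample)
for (task, target), pct in progress.items():
    print(f"{task} on {target}: {pct}%")
```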


• Alarm-Log_07-27-2015_12.55.55.log shows many repeated messages like the ones below 
137 06/30/2015 14:15:29 Voltage #0x16 Upper Non-critical going high 
138 06/30/2015 15:34:19 Voltage #0x1c Upper Non-critical going high 

155 06/30/2015 18:22:34 Voltage #0x1c Upper Critical going high 
156 06/30/2015 18:22:51 Voltage #0x1c Upper Critical going high 

197 06/30/2015 20:11:16 Voltage #0x1c Upper Critical going high 
198 06/30/2015 20:13:13 Voltage #0x1c Upper Critical going high 

These were logged every day until 13th July: 
e36 07/13/2015 03:54:37 Voltage #0x1c Upper Critical going high 
e37 07/13/2015 03:54:46 Voltage #0x1c Upper Critical going high 


• Checked on another customer's machine. 
# ipmitool sdr list -verbose 
Sensor ID : BB +3.3V (0x16) 
Entity ID : 7.1 (System Board) 
Sensor Type (Analog) : Voltage 

Sensor ID : BB -12.0V (0x1c) 
Entity ID : 7.1 (System Board) 
Sensor Type (Analog) : Voltage 
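
The repeated voltage alarms can be tallied per sensor and severity to show which rail is misbehaving most often. The sketch below is a minimal Python example under stated assumptions: the line format is inferred from the alarm log excerpt above, and the sensor names come from the `ipmitool sdr` output of a different machine, so the mapping may not hold for this appliance.

```python
import re
from collections import Counter

# Matches e.g. "155 06/30/2015 18:22:34 Voltage #0x1c Upper Critical going high".
EVENT = re.compile(
    r"^(?P<id>\S+)\s+(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"Voltage\s+#(?P<sensor>0x[0-9a-fA-F]+)\s+(?P<event>.+?)\s*$"
)

# Assumption: names copied from another machine's SDR listing in this thread.
SENSOR_NAMES = {"0x16": "BB +3.3V", "0x1c": "BB -12.0V"}

def tally(lines):
    """Count voltage events per (sensor number, event text)."""
    counts = Counter()
    for line in lines:
        m = EVENT.match(line.strip())
        if m:
            counts[(m["sensor"].lower(), m["event"])] += 1
    return counts

sample = """\
137 06/30/2015 14:15:29 Voltage #0x16 Upper Non-critical going high
138 06/30/2015 15:34:19 Voltage #0x1c Upper Non-critical going high
155 06/30/2015 18:22:34 Voltage #0x1c Upper Critical going high
156 06/30/2015 18:22:51 Voltage #0x1c Upper Critical going high
e36 07/13/2015 03:54:37 Voltage #0x1c Upper Critical going high
""".splitlines()

counts = tally(sample)
for (sensor, event), n in sorted(counts.items()):
    print(f"{SENSOR_NAMES.get(sensor, sensor)} ({sensor}): {event} x{n}")
```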

30 
Requested for below outputs from customer 
# ipmiutil fru > /tmp/ipmi_events/ipmi_events_fru.out 
# ipmiutil health > /tmp/ipmi_events/ipmi_events_health.out 
# ipmiutil sensor > /tmp/ipmi_events/sensor.out 
# ipmiutil sel > /tmp/ipmi_events/sel.out 
# ipmitool sdr list -verbose > /tmp/ipmi_events/sd_list_verbose.out 
# ipmitool sdr elist full > /tmp/ipmi_events/sdr_elist_full.out 
# ipmitool sdr list compact > /tmp/ipmi_events/srd_list_compact.out 
# ipmitool chassis status 
# ipmitool mc selftest 




3/8 

Customer confirmed that backups are working fine and operation is normal, but one LED on the power supply is blinking. 



Asked customer to confirm whether the LED is blinking amber. 
• LED is amber 
o Indication: The current power supply has failed. The other power supply in the unit is on. 
o Recommended actions: 
1. Check if the AC power cord is unplugged. 
2. Check whether AC power has been lost. Critical events that cause a loss of AC may include: over current protection (OCP), over voltage protection (OVP), over temperature protection (OTP). 

Customer mentioned that the LED is Blinking Green 
• LED is blinking green 
o Indication: AC present, only 12VSB on. Power supply is off or in cold redundant state. 


Customer provided a DataCollect log, which is on the fileshare: 
\\pun-evidence.pun.spt.symantec.com\evidence\pun\62\09190462\2015-08-03 


06/08 
Checked the DataCollect; all hardware components are showing OK. 
Checking further. 

14/08 
Emailed customer to ask whether the appliance has gone down again since last time. These are old logs, and nothing in particular is seen around the time of the hang. 

- Also saw messages like the ones below, which may or may not have anything to do with the appliance hang that happened in July. 
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1799.582514] :megasr[ahci]: port [1] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset 
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1799.582519] :megasr[ahci]: warning PortReset called for port[1] 
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1799.594016] :megasr[ahci]: device on port:[1] online:[0x123] [0] milliseconds after reset 
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1800.638618] :megasr[ahci]: converting NCQ hal packet [ffff880037876e48] to N-NCQ for error recovery owner:6 start block:0x681900 
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1800.638627] :megasr[ahci]: port [0x1] is paused for read log ext 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1800.649104] :megasr[ahci]: port [0] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1800.649107] :megasr[ahci]: warning PortReset called for port[0] 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1800.660604] :megasr[ahci]: device on port:[0] online:[0x123] [0] milliseconds after reset 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705206] :megasr[ahci]: converting NCQ hal packet [ffff88003788c868] to N-NCQ for error recovery owner:4 start block:0x3621623a 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705211] :megasr[ahci]: port [0x0] is paused for read log ext 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705229] :megasr[ahci]: port [0x1] is restarted after read log ext 
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705637] :megasr[ahci]: port [0x0] is restarted after read log ext 
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537935] :megasr[ahci]: hal packet [ffff880037872bb8] owner:46 start block:0x1200 timeout detected. 
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537940] :megasr[ahci]: hal_pkt[ffff880037872bb8:ffff880037872bb8] in port[0] list [1] getting timed out 
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537945] :megasr[ahci]: hal_pkt[ffff880037872bb8:ffff880037851738] in port[0] list [2] getting timed out 
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537948] :megasr[ahci]: hal_pkt[ffff880037872bb8:ffff88003786e928] in port[0] list [2] getting timed out 
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537952] :megasr[ahci]: hal packet [ffff880037887f30] owner:46 start block:0x1200 timeout detected. 
messages/var/log/kernel_messages.out:May 14 11:54:32 nbami-5220 kernel: [ 1819.537956] :megasr[ahci]: hal_pkt[ffff880037887f30:ffff880037887f30] in port[1] list [1] getting timed out 
messages/var/log/kernel_messages.out:May 14 11:54:32 nbami-5220 kernel: [ 1819.537959] :megasr[ahci]: warning PortReset called for port[0] 
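
To judge whether these `megasr` events are a one-off or a recurring pattern, they can be counted per port across the whole kernel log. The following Python sketch is illustrative only: the patterns are derived from the message texts quoted above, and the category labels are made up for this example.

```python
import re
from collections import Counter

HARD_ERROR = re.compile(r"megasr\[ahci\]: port \[(\d+)\] encountered a hard error")
PORT_RESET = re.compile(r"megasr\[ahci\]: warning PortReset called for port\[(\d+)\]")
TIMEOUT = re.compile(r"megasr\[ahci\]: hal packet \[[0-9a-f]+\] .*timeout detected")

def summarize(lines):
    """Count hard errors and resets per port, plus packet timeouts."""
    summary = Counter()
    for line in lines:
        if (m := HARD_ERROR.search(line)):
            summary[f"port {m[1]} hard error"] += 1
        elif (m := PORT_RESET.search(line)):
            summary[f"port {m[1]} reset"] += 1
        elif TIMEOUT.search(line):
            summary["packet timeout"] += 1
    return summary

# A subset of the lines quoted in this thread.
sample = [
    "May 14 11:53:38 nbami-5220 kernel: [ 1799.582514] :megasr[ahci]: port [1] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset",
    "May 14 11:53:38 nbami-5220 kernel: [ 1799.582519] :megasr[ahci]: warning PortReset called for port[1]",
    "May 14 11:53:39 nbami-5220 kernel: [ 1800.649104] :megasr[ahci]: port [0] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset",
    "May 14 11:53:59 nbami-5220 kernel: [ 1819.537935] :megasr[ahci]: hal packet [ffff880037872bb8] owner:46 start block:0x1200 timeout detected.",
    "May 14 11:54:32 nbami-5220 kernel: [ 1819.537959] :megasr[ahci]: warning PortReset called for port[0]",
]

summary = summarize(sample)
for label, n in sorted(summary.items()):
    print(f"{label}: {n}")
```

Grouping the same counts by date would show whether the errors cluster around the July and August outages or only appear on May 14.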

From Wikipedia: 
Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed. 
This can reduce the amount of unnecessary drive head movement, resulting in increased performance (and slightly decreased wear of the drive) for workloads where multiple simultaneous read/write requests are outstanding, most often occurring in server-type applications. 



- messages log May 14 showed catalog volume full. 
messages/var/log/kernel_messages.out:May 14 17:38:25 nbami-5220 kernel: [22454.498786] vxfs: msgcnt 1 mesg 001: V-2-1: vx_nospace - /dev/vx/dsk/nbuapp/pdcatvol file system full (2 block extent) 


Since nothing was logged at the time of the hang, we are unable to provide a root cause analysis. 
The appliance has been working fine for 3 weeks since then, so the customer gave consent to close the case. 
They will reference this case in future if they happen to see the same scenario again.

Thanks for your help,

Francesco

5 Replies

Hi, it might be necessary to upgrade or reflash the BIOS, as the BIOS itself can cause random issues such as these.

Normally I would use our analysis tool on their DataCollect file to confirm versions and other items, but it is no longer available from their last case because the case has been closed.

Feel like attaching one or sending it to me via other means?


Hi,

Please find attached the logs collected at the time of failure. I think they are more relevant than a recent DataCollect.

Thanks a lot for your help.

Francesco

Everything seems OK except for the fact that the system log stops and the system reboots (which is what you describe happening here).

The BIOS seems up to date, but the HBAs could use an update to their firmware.

Thanks for your reply, I will open a support case to find the new HBA firmware.

Accepted Solution!

No guarantees it will help, but it's the only thing that isn't up to par (other than the fact that it simply stops logging and reboots).