Solved: NetBackup Appliance hangs then shuts down

FrancescoRusso · ‎10-01-2015

Hi troops,

I have a customer who twice found one of his two 5220 appliances (2.6.1.1) turned off, and he had to power it off from the display since IPMI was not responding either.

They opened a case with Symantec, but since there was nothing relevant in the logs, Support could not provide an RCA. I looked at the logs myself and, apart from the gap in the messages log, I did not find any specific errors.

Have you ever seen this?

Below are Support's response and the messages log. The crash must have occurred on 29 August at about 21:42:01.

var messages
****************
Aug 29 21:42:01 nbami-5220 smbd[745]: [2015/08/29 21:42:01, 0] printing/print_cups.c:463(cups_async_callback)
Aug 29 21:42:01 nbami-5220 smbd[745]: failed to retrieve printer list: NT_STATUS_UNSUCCESSFUL
Aug 31 08:34:11 nbami-5220 syslog-ng[6216]: syslog-ng starting up; version='2.0.9'
Aug 31 08:34:11 nbami-5220 SMmonitor: SMmonitor started. PID=6243
Aug 31 08:34:12 nbami-5220 logger: sisids.init: Using driver version 2.6.32.43/sisfim-x86_64-default.ko
Aug 31 08:34:15 localhost StorageArray: ESM AutoSync Mode = Enabled, started on 8/31/15 8:34 AM
Aug 31 08:34:15 nbami-5220 syslog-ng[6216]: Configuration reload request received, reloading configuration;
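
A gap like this can also be located mechanically rather than by eye. Below is a minimal sketch in Python, not a supported tool. It assumes classic syslog timestamps without a year (so the year is passed in) and English month names; the helper name `find_gaps` and the one-hour threshold are my own choices for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, min_gap=timedelta(hours=1)):
    """Yield (previous_ts, next_ts) for consecutive entries further apart than min_gap."""
    prev = None
    for line in lines:
        try:
            # Syslog timestamps such as "Aug 29 21:42:01" carry no year, so one is supplied.
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and ts - prev > min_gap:
            yield prev, ts
        prev = ts


# Three lines taken from the excerpt above.
sample = [
    "Aug 29 21:42:01 nbami-5220 smbd[745]: failed to retrieve printer list: NT_STATUS_UNSUCCESSFUL",
    "Aug 31 08:34:11 nbami-5220 syslog-ng[6216]: syslog-ng starting up; version='2.0.9'",
    "Aug 31 08:34:11 nbami-5220 SMmonitor: SMmonitor started. PID=6243",
]

gaps = list(find_gaps(sample, 2015))
for before, after in gaps:
    print(f"no entries between {before} and {after} ({after - before})")
```

On a busy system the threshold probably needs tuning, since a quiet night can look like a gap too.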

Symantec Support reply:

*****************************

• Appliance Details:
Compute Node nbami-5220
NetBackup Model = 5220
Nbapp-release = 2.6.1.1
Serial Number = SYM1000789
Bios Version = S5500.86B.01.00.0057.031020111721
Firmware = 2.120.63-1242/ 1.41.372-2527
\\pun-evidence.pun.spt.symantec.com\evidence\pun\62\09190462\2015-07-29\

• Node does not have any errors. -> Time Monitoring Ran: Mon Jul 27 2015 13:01:26 UTC

• There is nothing logged in the messages log between Jul 25 07:24:01 and Jul 26 19:42:50

Jul 25 07:24:01 nbami-5220 /usr/sbin/cron[7188]: (root) CMD (PATH=/opt/VRTSperl/bin:/sbin:/usr/sbin:$PATH; export PATH; . /opt/NBUAppliance/db/nbappdb_env.sh > /dev/null 2>&1; /opt/NBUAppliance/bin/perl -I/opt/NBUAppliance/scripts/ /opt/NBUAppliance/scripts//hwmon/callhome.pl > /dev/null 2>&1)
Reboot occurred here
Jul 26 19:42:50 nbami-5220 syslog-ng[6285]: syslog-ng starting up; version='2.0.9'

The RAID card FW term log shows that a consistency check (CC) was running during that time:
\\pun-evidence.pun.spt.symantec.com\evidence\pun\62\09190462\2015-07-29\DataCollect\SYM1000789\Raid-...
07/25/15 7:14:03: EVT#93269-07/25/15 7:14:03: 94=Patrol Read progress on PD 17(e0x18/s16) is 69.99%(14699s)
07/25/15 7:15:13: EVT#93270-07/25/15 7:15:13: 65=Consistency Check progress on VD 00/0 is 10.19%(15313s)
07/25/15 7:15:53: EVT#93271-07/25/15 7:15:53: 65=Consistency Check progress on VD 01/1 is 28.60%(15353s)
07/25/15 7:18:07: EVT#93272-07/25/15 7:18:07: 65=Consistency Check progress on VD 01/1 is 28.85%(15487s)

• Alarm-Log_07-27-2015_12.55.55.log shows many repeated messages like below
137 06/30/2015 14:15:29 Voltage #0x16 Upper Non-critical going high
138 06/30/2015 15:34:19 Voltage #0x1c Upper Non-critical going high

155 06/30/2015 18:22:34 Voltage #0x1c Upper Critical going high
156 06/30/2015 18:22:51 Voltage #0x1c Upper Critical going high

197 06/30/2015 20:11:16 Voltage #0x1c Upper Critical going high
198 06/30/2015 20:13:13 Voltage #0x1c Upper Critical going high

Logged every day until 13 July:
e36 07/13/2015 03:54:37 Voltage #0x1c Upper Critical going high
e37 07/13/2015 03:54:46 Voltage #0x1c Upper Critical going high
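
To see at a glance which sensor dominates an alarm log like this, the events can be tallied per sensor and severity. The following is a rough Python sketch under stated assumptions: the line layout is inferred only from the excerpt above (hex record ID, date, time, sensor type, sensor number, event text), and `tally_events` and `SENSOR_NAMES` are hypothetical names. The two sensor names are the ones from the `ipmitool sdr` listing in this same reply, which came from another machine.

```python
import re
from collections import Counter

# Matches lines such as: "155 06/30/2015 18:22:34 Voltage #0x1c Upper Critical going high"
EVENT_RE = re.compile(
    r"^\s*(?P<rec>[0-9a-fA-F]+)\s+(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<type>\S+)\s+#(?P<sensor>0x[0-9a-fA-F]+)\s+(?P<event>.+?)\s*$"
)

# Sensor numbers and names as reported by the sdr listing quoted in this reply.
SENSOR_NAMES = {"0x16": "BB +3.3V", "0x1c": "BB -12.0V"}


def tally_events(lines):
    """Count alarm-log events per (sensor number, event text); unparsable lines are ignored."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            counts[(m["sensor"].lower(), m["event"])] += 1
    return counts


# The eight lines quoted from the alarm log above.
sample = [
    "137 06/30/2015 14:15:29 Voltage #0x16 Upper Non-critical going high",
    "138 06/30/2015 15:34:19 Voltage #0x1c Upper Non-critical going high",
    "155 06/30/2015 18:22:34 Voltage #0x1c Upper Critical going high",
    "156 06/30/2015 18:22:51 Voltage #0x1c Upper Critical going high",
    "197 06/30/2015 20:11:16 Voltage #0x1c Upper Critical going high",
    "198 06/30/2015 20:13:13 Voltage #0x1c Upper Critical going high",
    "e36 07/13/2015 03:54:37 Voltage #0x1c Upper Critical going high",
    "e37 07/13/2015 03:54:46 Voltage #0x1c Upper Critical going high",
]

counts = tally_events(sample)
for (sensor, event), n in counts.most_common():
    print(f"{SENSOR_NAMES.get(sensor, sensor)} ({sensor}): {event} x{n}")
```

Run against the full alarm log, the same tally would show whether the critical events are confined to sensor 0x1c or spread across several rails.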

• Checked on another customer's machine.
# ipmitool sdr list -verbose
Sensor ID : BB +3.3V (0x16)
Entity ID : 7.1 (System Board)
Sensor Type (Analog) : Voltage

Sensor ID : BB -12.0V (0x1c)
Entity ID : 7.1 (System Board)
Sensor Type (Analog) : Voltage

30
Requested the outputs below from the customer
# ipmiutil fru > /tmp/ipmi_events/ipmi_events_fru.out
# ipmiutil health > /tmp/ipmi_events/ipmi_events_health.out
# ipmiutil sensor > /tmp/ipmi_events/sensor.out
# ipmiutil sel > /tmp/ipmi_events/sel.out
# ipmitool sdr list -verbose > /tmp/ipmi_events/sd_list_verbose.out
# ipmitool sdr elist full > /tmp/ipmi_events/sdr_elist_full.out
# ipmitool sdr list compact > /tmp/ipmi_events/srd_list_compact.out
# ipmitool chassis status
# ipmitool mc selftest

3/8

Customer confirmed that the backups are working fine and operation is normal, but one LED on the power supply is blinking.

Asked customer to confirm whether the LED is blinking amber
• LED is amber
o Indication: The current power supply has failed. The other power supply in the unit is on.
o Recommended actions:
1. Check if the AC power cord is unplugged.
2. Check if AC power is lost. Critical events causing an AC loss may include: over current protection (OCP), over voltage protection (OVP), over temperature protection (OTP).

Customer mentioned that the LED is Blinking Green
• LED is blinking green
o Indication: AC present, only 12VSB on. Power supply is off or in cold redundant state.

Customer provided a DataCollect log, which is on the fileshare:
\\pun-evidence.pun.spt.symantec.com\evidence\pun\62\09190462\2015-08-03

06/08
checked datacollect and all hw components are showing ok
checking further

14/08
Emailed customer to ask whether the appliance has gone down again since last time. These are old logs, and nothing in particular is seen around the time of the hang.

- Also see messages like the ones below, which may or may not have anything to do with the appliance hang that happened in July.
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1799.582514] :megasr[ahci]: port [1] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1799.582519] :megasr[ahci]: warning PortReset called for port[1]
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1799.594016] :megasr[ahci]: device on port:[1] online:[0x123] [0] milliseconds after reset
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1800.638618] :megasr[ahci]: converting NCQ hal packet [ffff880037876e48] to N-NCQ for error recovery owner:6 start block:0x681900
messages/var/log/kernel_messages.out:May 14 11:53:38 nbami-5220 kernel: [ 1800.638627] :megasr[ahci]: port [0x1] is paused for read log ext
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1800.649104] :megasr[ahci]: port [0] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1800.649107] :megasr[ahci]: warning PortReset called for port[0]
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1800.660604] :megasr[ahci]: device on port:[0] online:[0x123] [0] milliseconds after reset
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705206] :megasr[ahci]: converting NCQ hal packet [ffff88003788c868] to N-NCQ for error recovery owner:4 start block:0x3621623a
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705211] :megasr[ahci]: port [0x0] is paused for read log ext
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705229] :megasr[ahci]: port [0x1] is restarted after read log ext
messages/var/log/kernel_messages.out:May 14 11:53:39 nbami-5220 kernel: [ 1801.705637] :megasr[ahci]: port [0x0] is restarted after read log ext
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537935] :megasr[ahci]: hal packet [ffff880037872bb8] owner:46 start block:0x1200 timeout detected.
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537940] :megasr[ahci]: hal_pkt[ffff880037872bb8:ffff880037872bb8] in port[0] list [1] getting timed out
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537945] :megasr[ahci]: hal_pkt[ffff880037872bb8:ffff880037851738] in port[0] list [2] getting timed out
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537948] :megasr[ahci]: hal_pkt[ffff880037872bb8:ffff88003786e928] in port[0] list [2] getting timed out
messages/var/log/kernel_messages.out:May 14 11:53:59 nbami-5220 kernel: [ 1819.537952] :megasr[ahci]: hal packet [ffff880037887f30] owner:46 start block:0x1200 timeout detected.
messages/var/log/kernel_messages.out:May 14 11:54:32 nbami-5220 kernel: [ 1819.537956] :megasr[ahci]: hal_pkt[ffff880037887f30:ffff880037887f30] in port[1] list [1] getting timed out
messages/var/log/kernel_messages.out:May 14 11:54:32 nbami-5220 kernel: [ 1819.537959] :megasr[ahci]: warning PortReset called for port[0]
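
The megasr messages above can be summarized per port to check whether one port stands out. This is only an illustrative Python sketch: the two patterns are derived from the message text in this excerpt rather than from any driver documentation, and `summarize` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Patterns inferred from the kernel messages quoted above.
HARD_ERROR_RE = re.compile(r"megasr\[ahci\]: port \[(\d+)\] encountered a hard error")
PORT_RESET_RE = re.compile(r"megasr\[ahci\]: warning PortReset called for port\[(\d+)\]")


def summarize(lines):
    """Return {port: {"hard_error": n, "port_reset": n}} for the megasr messages found."""
    summary = defaultdict(lambda: {"hard_error": 0, "port_reset": 0})
    for line in lines:
        for key, pattern in (("hard_error", HARD_ERROR_RE), ("port_reset", PORT_RESET_RE)):
            m = pattern.search(line)
            if m:
                summary[int(m.group(1))][key] += 1
    return dict(summary)


# Five of the lines quoted above, without the file-name prefix.
sample = [
    "May 14 11:53:38 nbami-5220 kernel: [ 1799.582514] :megasr[ahci]: port [1] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset",
    "May 14 11:53:38 nbami-5220 kernel: [ 1799.582519] :megasr[ahci]: warning PortReset called for port[1]",
    "May 14 11:53:39 nbami-5220 kernel: [ 1800.649104] :megasr[ahci]: port [0] encountered a hard error:[0x40000001] during NCQ operation, performing an explicit reset",
    "May 14 11:53:39 nbami-5220 kernel: [ 1800.649107] :megasr[ahci]: warning PortReset called for port[0]",
    "May 14 11:54:32 nbami-5220 kernel: [ 1819.537959] :megasr[ahci]: warning PortReset called for port[0]",
]

summary = summarize(sample)
for port in sorted(summary):
    print(f"port {port}: {summary[port]}")
```

In this excerpt both ports report a hard error within about a second of each other, so a per-port count alone does not single out one device.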

From Wikipedia:
Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed.
This can reduce the amount of unnecessary drive head movement, resulting in increased performance (and slightly decreased wear of the drive) for workloads where multiple simultaneous read/write requests are outstanding, most often occurring in server-type applications.

- messages log May 14 showed catalog volume full.
messages/var/log/kernel_messages.out:May 14 17:38:25 nbami-5220 kernel: [22454.498786] vxfs: msgcnt 1 mesg 001: V-2-1: vx_nospace - /dev/vx/dsk/nbuapp/pdcatvol file system full (2 block extent)

Since there is nothing logged at the time of the hang, we are unable to provide a root cause analysis.
The appliance has been working fine for 3 weeks since then, hence the customer gave consent to close the case.
They will reference this case in future if they happen to see the same scenario again.

Thanks for your help,

Francesco

mnolan · ‎10-03-2015

Hi, it might be necessary to upgrade or reflash the BIOS, as the BIOS itself can cause random issues such as these.

Normally I would use our analysis tool on their datacollect file to confirm versions and other items, but the file from their last case is no longer available because the case has been closed.

Feel like attaching one or sending to me via other means?

FrancescoRusso · ‎10-05-2015

Hi,

Please find attached the logs collected at the time of failure. I think they are more relevant than a recent datacollect.

Thanks a lot for your help.

Francesco

mnolan · ‎10-09-2015

Everything seems OK except for the fact that the system log stops and the system reboots (which is what you are describing here).

The BIOS seems up to date, but the HBAs could use an update to their firmware.

FrancescoRusso · ‎10-09-2015

Thanks for your reply, I will open a support case to find the new HBA firmware.

mnolan · ‎10-10-2015

No guarantees it will help, but it's the only thing that isn't up to par (other than the fact that it simply stops logging and reboots).
