ONTAP Hardware

Metrocluster / Node crash / Boot loop

UweB

Hi everyone,

We have a two-node MetroCluster configuration with two FAS3140 controllers. The storage configuration is Multi-Path HA. One of the nodes crashed and a failover happened. Unfortunately, the node that crashed fails to boot and is stuck in a boot loop. The RLM on the crashed node still works, and this is the output when I log in and try to boot it manually:

 

 

 

login as: naroot
naroot@SV01rlm's password:
RLM SV01> system console
Type Ctrl-D to exit.

LOADER-A> version

Variable Name        Value
-------------------- --------------------------------------------------
BIOS_VERSION         4.4.0
LOADER_VERSION       1.8
LOADER-A> ifconfig
Network interface has not been configured

LOADER-A> show devices
Device Name  Description
-----------  ---------------------------------------------------------
uart0a        NS16550 UART at 0x3F8
vidcons0a     Video and Keyboard Console
rlm0a         Remote LAN Module (RLM): Console at 0x3F8, RLM at 0x2F8
clock0a       ISA RTC at 0x70 (index) and 0x71 (target)
ide0.0        STEC    NACF1GM1U-B11, Sectors: 2001888 (977 MB) at I/O 01F0
e0M           BCM5703C Ethernet at 0xFDF00000 (00-A0-98-23-20-76)
e0a           BCM5715 Ethernet at 0xFE510000 (00-A0-98-23-20-74)
e0b           BCM5715 Ethernet at 0xFE530000 (00-A0-98-23-20-75)

LOADER-A> printenv

Variable Name        Value
-------------------- --------------------------------------------------
CPU_NUM_CORES        2
BOOT_CONSOLE         rlm0a
BIOS_VERSION         4.4.0
BIOS_DATE            07/24/2010
SYS_MODEL            FAS3140
SYS_REV              B0
SYS_SERIAL_NUM       xxx
MOBO_MODEL           1
MOBO_REV             A4
MOBO_SERIAL_NUM      741755
CPU_SPEED            2400
CPU_TYPE             Opteron
savenv               saveenv
ENV_VERSION          1
fmmbx-lkg-0           
16F5383011DF970EA0008CBB7620239806A1D5A6000000000C0040209884D7CA000000000000000000000000000000000000 
0000000000000000000000000000240000204E8139B600000000000000000000000000000000000000000000000000000000 
00000000
fmmbx-lkg-0b          
16F5383011DF970EA0008CBB7620239806A1D5A60000000052B400204EC9CA53000000000000000000000000000000000000 
000000000000000000000000000024000020FD7F39B600000000000000000000000000000000000000000000000000000000 
00000000
NVRAM_CLEAN          false
fc-port-0b           9
fc-port-0d           9
last-OS-booted-raid-ver 11
last-OS-booted-wafl-ver 22331
last-OS-booted-ver   8.1.4P1
REBOOT_REASON        REBOOT_GIVEBACK
partner-sysid        xxx
fmmbx-lkg-1           
3854498011DF91AAA00024879A222398015319770000000024000020A8E04CB6000000000000000000000000000000000000 
00000000000000000000000000002400002027114DB600000000000000000000000000000000000000000000000000000000 
00000000
fmmbx-lkg-1b          
3854498011DF91AAA00024879A222398015319770000000024000020D17F39B6000000000000000000000000000000000000 
000000000000000000000000000024000020D1A049B600000000000000000000000000000000000000000000000000000000 
00000000
fud_in_progress      false
USE_SECONDARY        true
LOADER_VERSION       1.8
ARCH                 x86_64
BOARDNAME            SB_XV
PRIMARY_KERNEL_URL   fat://ide0.0/x86_64/kernel/primary.krn
BACKUP_KERNEL_URL    fat://ide0.0/backup/x86_64/kernel/primary.krn
DIAG_URL             fat://ide0.0/x86_64/diag/diag.krn
GX_DIAG_URL          fat://ide0.0/x86_64/diag/kernel
FIRMWARE_URL         fat://ide0.0/x86_64/firmware/SB_XV/firmware.img
bootarg.mgwd.scsi_blade_uuid 81f5ff2d-e3a9-11e1-b677-fd0f5a25c34b
bootarg.from.version 8.1P2
failoverToken        SV01_16:49:17_2016:11:22
BOOT_DEVICE          ide0.0
BOOT_FILE            x86_64/freebsd/image2/kernel
BIOS_INTERFACE       9FC3
BOOT_FLASH           flash0a
GX_PRIMARY_KERNEL_URL fat://ide0.0/x86_64/freebsd/image2/kernel
GX_BACKUP_KERNEL_URL fat://ide0.0/x86_64/freebsd/image1/kernel
ntap.init.kernelname x86_64/freebsd/image2/kernel
AUTOBOOT             true
AUTOBOOT_FROM        PRIMARY
AUTO_FW_UPDATE       true
BOOTED_FROM          OTHER
boot_ontap           autoboot ide0.0
boot_primary         setenv BOOTED_FROM PRIMARY; boot -elf64 $GX_PRIMARY_KERNEL_URL $PRIMARY_KERNEL_URL
boot_backup          setenv BOOTED_FROM BACKUP; boot -elf64 $GX_BACKUP_KERNEL_URL $BACKUP_KERNEL_URL
netboot              setenv BOOTED_FROM NETWORK; boot -elf64
boot_diags           boot -elf64 $GX_DIAG_URL $DIAG_URL
ldkern               load -elf64 $GX_PRIMARY_KERNEL_URL $PRIMARY_KERNEL_URL
update_flash         flash -backup $FIRMWARE_URL flash0a
version              printenv BIOS_VERSION LOADER_VERSION
CF_BIOS_VERSION      4.4.0
CF_LOADER_VERSION    1.8

LOADER-A> boot_ontap
Loading x86_64/freebsd/image2/kernel:.....0x100000/8455008 0x910360/1278312 Entry at 0x80158990
Loading x86_64/freebsd/image2/platform.ko:.0xa49000/655152 0xb97c40/694752 0xae8f40/39656  
0xc41620/43152 0xaf2a28/86316 0xb07b54/63858 0xb174e0/140640 0xc4beb0/159120 0xb39a40/2024  
0xc72c40/6072 0xb3a228/304 0xc743f8/912 0xb3a358/1680 0xc74788/5040 0xb3a9e8/960 0xc75b38/2880  
0xb3ada8/184 0xc76678/552 0xb3ae60/448 0xb6f000/12918 0xb97b53/237 0xb72278/84120 0xb86b10/69699
Starting program at 0x80158990
NetApp Data ONTAP 8.1.4P1 7-Mode
Copyright (C) 1992-2014 NetApp.
All rights reserved.
md1.uzip: 25536 x 16384 blocks
md2.uzip: 5760 x 16384 blocks
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
^C^C^C^CProcessing PCI error...		(Ctrl-C doesn't work)
Probing EXB(0,6,0)
Probing EXB(0,7,0)
Probing EXB(0,8,0)
Probing EXB(0,9,0)
        report Dv(6,0,0) from error source register 0x600.
▒
Phoenix TrustedCore(tm) Server
Copyright 1985-2006 Phoenix Technologies Ltd.
All Rights Reserved
BIOS version: 4.4.0
Portions Copyright (c) 2007-2009 NetApp. All Rights Reserved.
CPU= Dual-Core AMD Opteron(tm) Processor 2216 X 1
Testing RAM
512MB RAM tested
4096MB RAM installed
Fixed Disk 0: STEC

Boot Loader version 1.8
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2009 NetApp

CPU Type: Dual-Core AMD Opteron(tm) Processor 2216

(Now it loops!)

Starting AUTOBOOT press Ctrl-C to abort...
Loading x86_64/freebsd/image2/kernel:.....0x100000/8455008 0x910360/1278312 Entry at 0x80158990
Loading x86_64/freebsd/image2/platform.ko:.0xa49000/655152 0xb97c40/694752 0xae8f40/39656  
0xc41620/43152 0xaf2a28/86316 0xb07b54/63858 0xb174e0/140640 0xc4beb0/159120 0xb39a40/2024  
0xc72c40/6072 0xb3a228/304 0xc743f8/912 0xb3a358/1680 0xc74788/5040 0xb3a9e8/960 0xc75b38/2880  
0xb3ada8/184 0xc76678/552 0xb3ae60/448 0xb6f000/12918 0xb97b53/237 0xb72278/84120 0xb86b10/69699
Starting program at 0x80158990
NetApp Data ONTAP 8.1.4P1 7-Mode
Copyright (C) 1992-2014 NetApp.
All rights reserved.
md1.uzip: 25536 x 16384 blocks
md2.uzip: 5760 x 16384 blocks
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
^CProcessing PCI error...	(Ctrl-C doesn't work)
Probing EXB(0,6,0)
Probing EXB(0,7,0)
Probing EXB(0,8,0)
Probing EXB(0,9,0)
        report Dv(6,0,0) from error source register 0x600.
▒
Phoenix TrustedCore(tm) Server
Copyright 1985-2006 Phoenix Technologies Ltd.
All Rights Reserved
BIOS version: 4.4.0
Portions Copyright (c) 2007-2009 NetApp. All Rights Reserved.
CPU= Dual-Core AMD Opteron(tm) Processor 2216 X 1
Testing RAM
512MB RAM tested
4096MB RAM installed
Fixed Disk 0: STEC


Boot Loader version 1.8
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2009 NetApp

CPU Type: Dual-Core AMD Opteron(tm) Processor 2216

Starting AUTOBOOT press Ctrl-C to abort... (Ctrl-C works this time)

Autoboot of PRIMARY image aborted by user.

LOADER-A> 

 

 

NetApp Release 8.1.4P1 7-Mode: Tue Feb 11 23:23:31 PST 2014

What I have tried so far: boot, boot_ontap, boot_backup, boot_primary, bye, and a power cycle.
The support contract has expired. It's an old system, the partner node works flawlessly, and we are in the process of decommissioning both machines, but it would be better to have both machines online. :)
Any idea how to revive SV01? Could this be a problem with the internal disk? Can I flash the fixed disk?

 

Regards
Uwe


andris

Hi Uwe,

 

You can purchase one-time support for systems with expired entitlement.  Contact NetApp Support for details.

 

In the meantime, it would be very helpful if you had the initial panic/crash information recorded. It might still be there if you collect an RLM diagnostic dump.

From the RLM CLI: > rlm status -d

 

Look over the output for any interesting info about the initial cause of the crash and what the console logs and system event logs say.
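
If the console capture is long, a small script can help with that review. The following is only a rough sketch, assuming the console output has been saved to a plain text file; the function name `summarize_console` is made up for illustration, and the patterns are based solely on the lines quoted in the post above, so they may need adjusting for other output.

```python
import re

# Patterns based on the console lines quoted in this thread (assumption:
# other ONTAP releases or platforms may word these messages differently).
PCI_ERROR = re.compile(r"Processing PCI error")
ERROR_SOURCE = re.compile(
    r"report Dv\((\d+),(\d+),(\d+)\) from error source register (0x[0-9A-Fa-f]+)"
)
BOOT_START = re.compile(r"^Loading \S+/kernel:")


def summarize_console(text):
    """Count boot attempts and PCI errors, and tally the reported devices."""
    boots = 0
    pci_errors = 0
    devices = {}
    for line in text.splitlines():
        if BOOT_START.search(line):
            boots += 1
        if PCI_ERROR.search(line):
            pci_errors += 1
        match = ERROR_SOURCE.search(line)
        if match:
            # Key: ((a, b, c) from "Dv(a,b,c)", error source register)
            key = (tuple(int(g) for g in match.groups()[:3]), match.group(4))
            devices[key] = devices.get(key, 0) + 1
    return {"boot_attempts": boots, "pci_errors": pci_errors, "devices": devices}
```

To use it, read the saved capture (for example with `open("console.txt").read()`, where the file name is hypothetical) and pass the text to `summarize_console`. If the same `Dv(...)` entry shows up on every boot attempt, that supports looking at the hardware before anything else.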

 

Since it looks like a HW issue, it would also be a good idea to pull the PCM and check/re-seat all of the cards in the enclosure. 

After reseating, it wouldn't be a bad idea to run system diagnostics. Here's the link to the Diagnostics Guide applicable to the FAS3140.

https://library.netapp.com/ecmdocs/ECMP1112531/html/ch1/overview.htm

 

 
