ONTAP Discussions

Cannot run this command because the system is not fully initialized

zaadman
I have a remote NetApp FAS9000 that I am trying to retrieve some data from.
  
  netappcluster-cluster-001::> version
NetApp Release 9.6P17: Fri Nov 19 09:15:10 UTC 2021
 
 
  
Only one node is running - Node 1. Node 2 is down due to a power supply issue. I only have remote access to Node 1.
  netappcluster-cluster-001::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
netapp-001      false   true
netapp-002      false   true
2 entries were displayed.
 
  
None of the interfaces on Node 1 are configured for data. I am trying to create an interface group a0b and am getting a 'system is not fully initialized' error.
 
netappcluster-cluster-001::> network port ifgrp create -node netapp-001 -ifgrp a0b -distr-func ip -mode multimode
 
Error: command failed: Cannot run this command because the system is not fully initialized. Wait a few minutes, and then try the
       command again.
 
netappcluster-cluster-001::> network port ifgrp create -node netapp-001 -ifgrp a0b -distr-func ip -mode singlemode
 
Error: command failed: Cannot run this command because the system is not fully initialized. Wait a few minutes, and then try the
       command again.
   
   
Here are some other 'network port' commands I have tried with the same error.    
 
netappcluster-cluster-001::> network arp show
 
Error: show failed: Cannot run this command because the system is not fully initialized. Wait a few minutes, and then try the
       command again.
 
netappcluster-cluster-001::>
 
 
I don't know what else to try to figure out the problem.
 
Any help or suggestions would be appreciated.
 
Thanks 
 
Andy 
 
Here are some other commands I have run
 
netappcluster-cluster-001::> autosupport show
Node                  State     From          To            Mail Hosts
--------------------- --------- ------------- ------------- ----------
netapp-001     enable    netapp@*****.com
                                              autosupport@netapp.com, operationsglobal@*****.com
                                                            smtp.global.****
 
Warning: Unable to list entries on node netapp-002. RPC: Couldn't make connection [from mgwd on node "netapp-001"
         (VSID: -1) to mgwd at 169.254.25.24]
 
netappcluster-cluster-001::>
 
 
netappcluster-sun1-cluster-001::> storage aggregate show
 
Info: Node netapp-002 that hosts aggregate n2_aggr_root is offline
      Node netapp-002 that hosts aggregate n2_aggr_sas_data01 is offline
      Node netapp-002 that hosts aggregate n2_aggr_ssd_3_8t_data is offline

TMACMD

There is something more going on here than a bad power supply. A bad power supply alone would not bring down a FAS9000.

Additionally, node 1 has not taken over node 2, another indicator that something else is very wrong.

 

The cluster show command indicates both nodes are NOT healthy, which is another indicator.

What really happened here?

 

Maybe start with:

 aggr show

 node show

 net int show

 net port show?
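
Since the error complains about the system not being fully initialized, it may also be worth checking whether the cluster management applications ever came up on node 1. A rough sketch of what I would try (cluster ring show needs advanced privilege, and I am going from memory on the 9.6 syntax):

 storage failover show

 set -privilege advanced

 cluster ring show

If the mgmt, vldb, vifmgr, or bcomd rings never formed quorum, that would line up with the "not fully initialized" errors you are seeing.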

 

 

zaadman

I don't know the history of this filer. I was brought in as a contractor to retrieve data from it.

 

The filer does not have network connectivity to anything except the Windows PC used to SSH to the management console. None of the interfaces are configured to serve data.

I was able to download a current AutoSupport for node 1 from https://172.31.40.103/spi, and I am going through it to see what the problem(s) are.
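
In case it helps anyone else pulling diagnostics the same way: the AutoSupport collections already sitting on the node can also be listed (and a fresh one generated) from the clustershell. Syntax here is from memory, so treat it as a sketch; the collection should still be written locally and show up on the /spi page even though this filer cannot deliver it anywhere:

 system node autosupport history show -node netapp-001

 system node autosupport invoke -node netapp-001 -type all -message "manual collection for troubleshooting"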

 

Here are the commands you suggested.

 

netappcluster-1-cluster-001::> aggr show

Info: Node netapp-p-1-002 that hosts aggregate n2_aggr_root is offline
      Node netapp-p-1-002 that hosts aggregate n2_aggr_sas_data01 is offline
      Node netapp-p-1-002 that hosts aggregate n2_aggr_ssd_3_8t_data is offline

Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- --------------
n1_aggr_root
            1.05TB   104.5GB   90% online       1 netapp-p-1-001   raid_dp,normal
n1_aggr_sas_data01
                 -         -     - unknown      - netapp-p-1-001   -
n1_aggr_ssd_3_8t_data
                 -         -     - unknown      - netapp-p-1-001   -
n2_aggr_root     -         -     - unknown      - netapp-p-1-002   -
n2_aggr_sas_data01
                 -         -     - unknown      - netapp-p-1-002   -
n2_aggr_ssd_3_8t_data
                 -         -     - unknown      - netapp-p-1-002   -
6 entries were displayed.

netappcluster-1-cluster-001::>

 

netappcluster-1-cluster-001::> node show
Node      Health Eligibility Uptime        Model       Owner    Location
--------- ------ ----------- ------------- ----------- -------- ---------------
netapp-1-001
          false  true        27 days 07:23 FAS9000              USA,CA,Sunnyvale,DC-1,
netapp-1-002
          false  true        -             -           -        -
2 entries were displayed.

netappcluster-1-cluster-001::>


netappcluster-1-cluster-001::> net int show
  (network interface show)
            Logical    Status     Network            Current        Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node           Port    Home
----------- ---------- ---------- ------------------ -------------- ------- ----
Cluster
            netapp-1-001_clus1
                       up/up      169.254.127.206/16 netapp-p-1-001 e8a     false
            netapp-1-001_clus2
                       up/up      169.254.26.63/16   netapp-p-1-001 e8a     true
netappcluster-1-cluster-001
            netapp-1-001_mgmt1
                       up/up      172.31.40.103/23   netapp-p-1-001 e0M     true
3 entries were displayed.

netappcluster-1-cluster-001::>

 

 

netappcluster-1-cluster-001::> net port show?
(network port show)
show Display network port attributes
show-address-filter-info Print the port's address filter information

netappcluster-1-cluster-001::> net port show
(network port show)

Node: netapp-1-001
Speed(Mbps) Health
Port IPspace Broadcast Domain Link MTU Admin/Oper Status
--------- ------------ ---------------- ---- ---- ----------- --------
a0a Default - up 9000 auto/40000 -
a0a-1015 Default - up 9000 auto/40000 -
a0a-1018 Default - up 9000 auto/40000 -
a0a-1097 Default - up 9000 auto/40000 -
a0a-1127 Default - up 9000 auto/40000 -
a0a-2000 Default - up 9000 auto/40000 -
a0a-2001 Default - up 9000 auto/40000 -
a0a-2002 Default - up 9000 auto/40000 -
a0a-2003 Default - up 9000 auto/40000 -
a0a-2004 Default - up 9000 auto/40000 -
a0a-2015 Default - up 9000 auto/40000 -
a0a-2016 Default - up 9000 auto/40000 -
a0a-2018 Default - up 9000 auto/40000 -
a0a-2019 Default - up 9000 auto/40000 -
a0a-2097 Default - up 9000 auto/40000 -
e0M Default - up 1500 1000/1000 -
e1a Default - down 9000 1000/auto -
e1e Default - down 9000 1000/auto -
e4a Cluster - up 9000 40000/40000 -
e4e Default - up 9000 40000/40000 -
e8a Cluster - up 9000 40000/40000 -
e8e Default - up 1500 40000/40000 -
22 entries were displayed.

netappcluster-1-cluster-001::>

 

 

---An attempt to run cluster setup terminated the SSH connection to the filer---


netappcluster-1-cluster-001::> cluster setup

Enter the node management interface port [e0M]:
Enter the node management interface IP address [172.31.40.103]:
Enter the node management interface netmask [255.255.254.0]:

Network error: Software caused connection abort


Session stopped
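
(Note for next time: running cluster setup over an SSH session to the node-management LIF is what killed the session, since setup reconfigures that interface. Running it from the Service Processor's console session instead should survive that. A sketch, assuming the usual SP commands; <sp-ip-of-node-1> is a placeholder:

 ssh admin@<sp-ip-of-node-1>
 SP netapp-1-001> system console)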

 

 

I started working on this filer a week ago, but the SP logs show a reboot loop with power losses back at the beginning of August, before I got here.

-----Here are some log entries from the Service Processor on Node 1---
Record 1661: Thu Aug 01 20:35:28.254366 2024 [IPMI.notice]: 9905 | 02 | EVT: 6fc203ff | System_FW_Status | Assertion Event, "Memory Initialization done"
Record 1662: Thu Aug 01 20:51:53.000000 2024 [SP.emergency]: System input power lost
Record 1663: Thu Jan 01 00:00:49.949122 1970 [IPMI.notice]: 9a05 | c0 | OEM: ffff7000ff00 | ManufId: 150300 | SP Power Reset
Record 1664: Thu Jan 01 00:00:50.028428 1970 [IPMI.notice]: 9b05 | c0 | OEM: fcff70560000 | ManufId: 150300 | POS Register: Power on Reset(Normal Power Cycle)
Record 1665: Thu Jan 01 00:01:12.258484 1970 [IPMI.notice]: (PUA) Enable power to all PCIe slots
Record 1666: Thu Jan 01 00:01:12.263922 1970 [IPMI.notice]: (PUA) Enable power to all PCIe on board device
Record 1667: Thu Jan 01 00:01:12.286839 1970 [IPMI.notice]: (PUA) P_stat :slots=0x0,onboard_devs=0x0,final
Record 1668: Thu Jan 01 00:01:12.286902 1970 [IPMI.notice]: (PUA) Power status of all PCIe slots unchanged
Record 1669: Thu Jan 01 00:01:14.290589 1970 [IPMI.notice]: 9c05 | 02 | EVT: 6fc203ff | System_FW_Status | Assertion Event, "Memory Initialization done"
Record 1670: Thu Jan 01 00:01:15.301725 1970 [IPMI.notice]: 9d05 | 02 | EVT: 6f01ffff | PCM_Status | Assertion Event, "Unpowered Partner Present"
Record 1671: Thu Jan 01 00:01:15.936215 1970 [IPMI.notice]: 9e05 | 02 | EVT: 6f03ffff | PCM_Status | Assertion Event, "Power Good"
Record 1672: Thu Jan 01 00:01:41.003850 1970 [SP.notice]: Running primary version 4.10
Record 1673: Thu Jan 01 00:01:41.268885 1970 [IPMI.notice]: 9f05 | 02 | EVT: 6f00ffff | Fan1_Status | Assertion Event, "Present"
Record 1674: Thu Jan 01 00:01:56.104677 1970 [IPMI.notice]: a005 | 02 | EVT: 6f00ffff | Fan2_Status | Assertion Event, "Present"
Record 1675: Thu Jan 01 00:02:07.044567 1970 [IPMI.notice]: a105 | 02 | EVT: 6f00ffff | Fan3_Status | Assertion Event, "Present"
Record 1676: Thu Jan 01 00:02:19.144632 1970 [IPMI.notice]: a205 | 02 | EVT: 6f00ffff | PSU1_Status | Assertion Event, "Present"
Record 1677: Thu Jan 01 00:02:19.604307 1970 [IPMI.notice]: a305 | 02 | EVT: 6f04ffff | PSU1_Status | Assertion Event, "DC OK"
Record 1678: Thu Jan 01 00:02:20.596198 1970 [IPMI.notice]: a405 | 02 | EVT: 6f05ffff | PSU1_Status | Assertion Event, "AC OK"
Record 1679: Thu Jan 01 00:02:29.152540 1970 [IPMI.notice]: a505 | 02 | EVT: 6f00ffff | PSU3_Status | Assertion Event, "Present"
Record 1680: Thu Jan 01 00:02:29.519500 1970 [IPMI.notice]: a605 | 02 | EVT: 6f06ffff | PSU3_Status | Assertion Event, "Fault Signal"
SP netapp-1-001>

SP netapp-1-001> events search [SP.critical]:
Record 1286: Thu Aug 01 19:14:39.173279 2024 [SP.critical]: Filer Reboots *** this is every 5 min
Record 1310: Thu Aug 01 19:19:35.414390 2024 [SP.critical]: Filer Reboots
Record 1334: Thu Aug 01 19:24:29.343534 2024 [SP.critical]: Filer Reboots
Record 1358: Thu Aug 01 19:29:24.837207 2024 [SP.critical]: Filer Reboots
Record 1382: Thu Aug 01 19:34:20.593598 2024 [SP.critical]: Filer Reboots
Record 1406: Thu Aug 01 19:39:14.668504 2024 [SP.critical]: Filer Reboots
Record 1431: Thu Aug 01 19:44:08.067638 2024 [SP.critical]: Filer Reboots
Record 1455: Thu Aug 01 19:49:02.038600 2024 [SP.critical]: Filer Reboots
Record 1479: Thu Aug 01 19:53:53.194053 2024 [SP.critical]: Filer Reboots
Record 1503: Thu Aug 01 19:58:49.065370 2024 [SP.critical]: Filer Reboots
Record 1527: Thu Aug 01 20:03:43.435989 2024 [SP.critical]: Filer Reboots
Record 1551: Thu Aug 01 20:08:36.482520 2024 [SP.critical]: Filer Reboots
Record 1576: Thu Aug 01 20:13:29.929644 2024 [SP.critical]: Filer Reboots
Record 1600: Thu Aug 01 20:18:23.599256 2024 [SP.critical]: Filer Reboots
Record 1624: Thu Aug 01 20:23:15.518206 2024 [SP.critical]: Filer Reboots
Record 1649: Thu Aug 01 20:33:57.493421 2024 [SP.critical]: Filer Reboots ***** 10 min
Record 1785: Thu Aug 01 21:15:02.436245 2024 [SP.critical]: Heartbeat stopped ****** reboots stopped
Record 1829: Sat Aug 03 17:47:53.345350 2024 [SP.critical]: Filer Reboots
Record 1853: Sat Aug 03 17:54:08.645220 2024 [SP.critical]: Filer Reboots ***** 7 min
Record 1899: Sat Aug 03 18:53:02.933777 2024 [SP.critical]: Filer Reboots ***** 59 min
Record 1920: Sat Aug 03 18:58:00.027271 2024 [SP.critical]: Filer Reboots ***** 5 min
Record 1944: Sat Aug 03 19:02:50.772372 2024 [SP.critical]: Filer Reboots ***** 4 min
Record 1964: Sat Aug 03 19:04:34.399076 2024 [SP.critical]: Heartbeat stopped ---
Record 1969: Sat Aug 03 19:07:45.950964 2024 [SP.critical]: Filer Reboots
Record 1993: Sat Aug 03 19:12:41.080466 2024 [SP.critical]: Filer Reboots
Record 2018: Sat Aug 03 19:17:31.531760 2024 [SP.critical]: Filer Reboots
Record 2042: Sat Aug 03 19:22:27.328425 2024 [SP.critical]: Filer Reboots
Record 2066: Sat Aug 03 19:27:23.751174 2024 [SP.critical]: Filer Reboots
Record 2090: Sat Aug 03 19:32:18.742056 2024 [SP.critical]: Filer Reboots
Record 2114: Sat Aug 03 19:37:13.612796 2024 [SP.critical]: Filer Reboots
Record 2138: Sat Aug 03 19:49:47.812270 2024 [SP.critical]: Filer Reboots
Record 2167: Sat Aug 03 20:24:28.816457 2024 [SP.critical]: Filer Reboots
Record 2206: Sat Aug 03 20:37:27.998200 2024 [SP.critical]: Filer Reboots
Record 2275: Sun Aug 04 15:46:41.111212 2024 [SP.critical]: Filer Reboots ***** after 15 hours
Record 2311: Sun Aug 04 15:58:00.017593 2024 [SP.critical]: Heartbeat stopped
Record 2314: Sun Aug 04 16:02:15.301245 2024 [SP.critical]: Filer Reboots
Record 2348: Sun Aug 04 16:59:38.007575 2024 [SP.critical]: Filer Reboots
Record 2362: Sun Aug 04 17:02:21.213344 2024 [SP.critical]: Heartbeat stopped
Record 2428: Tue Aug 06 20:20:51.882857 2024 [SP.critical]: Filer Reboots
Record 2441: Tue Aug 06 20:23:09.111343 2024 [SP.critical]: Heartbeat stopped
Record 2474: Wed Aug 07 18:52:22.001551 2024 [SP.critical]: Filer Reboots
Record 2489: Wed Aug 07 18:54:20.086292 2024 [SP.critical]: Heartbeat stopped
Record 2507: Wed Aug 07 19:09:07.401009 2024 [SP.critical]: Filer Reboots
Record 2520: Wed Aug 07 19:11:47.298321 2024 [SP.critical]: Heartbeat stopped
SP netapp-1-001>


======================================EMERGENCY=====================================

SP netapp-1-001> events search [SP.emergency]:
Record 1662: Thu Aug 01 20:51:53.000000 2024 [SP.emergency]: System input power lost
Record 1725: Thu Aug 01 21:00:27.000000 2024 [SP.emergency]: System input power lost
Record 15: Fri Aug 09 20:38:37.000000 2024 [SP.emergency]: System input power lost
SP netapp-1-001>


=========================================================

 

 

zaadman

------UPDATE   SEPT 6  03:16 AM PDT-----

Looks like this is a software bug.

https://kb.netapp.com/on-prem/ontap/OHW/OHW-KBs/Takeover_disabled_due_to_interconnect_down

https://mysupport.netapp.com/site/bugs-online/product/ONTAP/BURT/1396573

https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/REBOOT_(panic)_ALERT_-_AutoSupport_message  

 

netappcluster-cluster-001::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
netapp-001     netapp-002     false    Waiting for netapp-002. Waiting for
                                       cluster applications to come online on
                                       the local node. Offline applications:
                                       mgmt, vldb, vifmgr, bcomd, crs, scsi
                                       blade, clam. Takeover is not possible:
                                       NVRAM log not synchronized, Disk
                                       inventory not exchanged
netapp-002     netapp-001     -        Unknown
2 entries were displayed.

  

globalStatus.critical:EMERGENCY]: Controller failover of netapp-002 is not possible: unsynchronized log. Power Supply Status Critical: PSU3. Disk shelf fault.

Sep 01 17:41:00 [netapp-001:monitor.globalStatus.critical:EMERGENCY]: Controller failover of netapp-002 is not possible: HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support. Power Supply Status Critical: PSU3. Disk shelf fault.

Sep 01 17:44:00 [netapp1-001:monitor.globalStatus.critical:EMERGENCY]: Controller failover of netapp1-002 is not possible: unsynchronized log. Power Supply Status Critical: PSU3. Disk shelf fault.


slot 6: Interconnect HBA: Generic OFED Provider
Port Name: ic6a
GID: fe80:0000:0000:0000:0000:0000:0000:0104
Base LID: 0x104
Active MTU: 8192
slot 6: HSL Interconnect
Data Rate: 0 Gb/s (8X)
Link State: DOWN
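
To pull more detail on the interconnect and failover state from the CLI, this is roughly what I have been running; exact parameter names are from memory, so treat it as a sketch:

 event log show -severity EMERGENCY,ALERT

 storage failover show -node netapp-001 -instance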

 

 

 

 

zaadman

So it looks like the customer that owns the filer will have to open a support case with NetApp on this one.

This would have been a whole lot easier if the filer had been able to send an AutoSupport to NetApp. Alas, the filer has no access to the internet. :>)

Have a great weekend.

Andy

TMACMD

Really? You do see that all the local node aggregates are unknown, which is likely due to missing disks. There is a remote possibility that bug might be in play, but I'm thinking there is a lot more hardware that is not even powered on, causing cascading issues.

TMACMD

Right now my guess is that MANY supplies are offline. You have a root aggregate for this node, but all other aggregates are unknown, meaning they are likely missing disks. They are missing if ONTAP can't see them, which means they are broken or powered off.

This probably deserves a site visit to fully determine the scenario.
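
Before someone goes on site, it might be worth capturing what ONTAP on node 1 can actually see, so whoever shows up knows which shelves and disks are missing versus broken. Roughly something like this (going from memory on the command set):

 storage shelf show

 storage disk show -container-type broken

 storage disk show -container-type unassigned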

 

Sorry, not looking good kid!

zaadman

You're right, it is missing (unable to see) a lot of disks.
A site visit is definitely in order.

Also, the Service Processor log (see above) shows that during the first week of August the filer rebooted 37 times and lost power several times.
Something unusual is definitely going on.

zaadman

The customer did a Zoom call with NetApp support. It looks like faulty cables from the IOMs to the disk shelves. New cables have been ordered, and they will replace the faulty cables tomorrow.

Will post final resolution once it's up and working.
