Solved: Node "Morgan02" has taken over node "Morgan01"

mickdon · ‎2020-07-28

Hi guys, when I logged on to ONTAP System Manager this morning, this was the message on my overview: Node "Morgan02" has taken over node "Morgan01". I SSHed into the management interface to check.

First command:

Morgan::> system health subsystem show
Subsystem         Health
----------------- ------------------
SAS-connect       ok
Environment       ok
Memory            ok
Service-Processor ok
Switch-Health     ok
CIFS-NDO          ok
Motherboard       ok
IO                ok
MetroCluster      ok
MetroCluster_Node ok
FHM-Switch        ok
FHM-Bridge        ok
SAS-connect_Cluster
                  ok
13 entries were displayed.

Second, I tried cluster peer health show:

This table is currently empty.

Warning: Unable to list entries on node Morgan-01. RPC: Couldn't make
connection [from mgwd on node "Morgan-02" (VSID: -1) to mgwd at
169.254.26.183]

Any help troubleshooting this issue will be greatly appreciated.

SpindleNinja · ‎2020-07-28

First off, this isn't the place for P1 cases; please contact NetApp support.

Second, what's the output of the following:

cluster show

storage failover show

system node show -fields uptime

And is there anything in the logs? (event log show)

mickdon · ‎2020-07-28

Hi SpindleNinja,

Thanks for your help. I agree that it's a P1 case and I should be contacting support. However, I don't have that access at the moment. Below is the output you requested:

MorganLib::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
MorganLib-01          false   true
MorganLib-02          true    true
2 entries were displayed.

MorganLib::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
MorganLib-01   MorganLib-02   -        Unknown
MorganLib-02   MorganLib-01   false    In takeover, Auto giveback deferred
2 entries were displayed.

MorganLib::> system node show
show show-discovered
MorganLib::> system node show -fields uptime
node         uptime
------------ ------
MorganLib-01 -
MorganLib-02 2 days 04:53
2 entries were displayed.

Thank you, kind sir.
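
For anyone who wants to check this state from a script rather than by eye, here is a minimal Python sketch that parses pasted "storage failover show" output and flags nodes that are in takeover or not reporting. It assumes each row fits on one line, as in the output above; real output can wrap long state descriptions onto continuation lines, which this sketch ignores. The names FailoverRow, parse_failover and summarize are made up for illustration and are not part of any NetApp tool.

```python
import re
from dataclasses import dataclass

# One data row: node, partner, takeover-possible flag (true/false/-), state text.
# Header, separator, prompt and "N entries were displayed." lines do not match.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(true|false|-)\s+(.+)$")


@dataclass
class FailoverRow:
    node: str
    partner: str
    takeover_possible: str
    state: str


def parse_failover(text):
    """Return one FailoverRow per data line found in the pasted output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(FailoverRow(*match.groups()))
    return rows


def summarize(rows):
    """Return human-readable notes for rows that need attention."""
    notes = []
    for row in rows:
        state = row.state.lower()
        if state.startswith("in takeover"):
            notes.append(f"{row.node} has taken over {row.partner}: {row.state}")
        elif state == "unknown":
            notes.append(f"{row.node}: state unknown (node may be down or unreachable)")
    return notes


# Rows copied from the output posted above.
SAMPLE = """\
MorganLib::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
MorganLib-01   MorganLib-02   -        Unknown
MorganLib-02   MorganLib-01   false    In takeover, Auto giveback deferred
2 entries were displayed.
"""

if __name__ == "__main__":
    for note in summarize(parse_failover(SAMPLE)):
        print(note)
```

The regular expression only accepts true, false or - in the third column, which is what keeps the header and separator lines out of the result without any special-casing.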

SpindleNinja · ‎2020-07-28

It looks like you definitely had an unplanned failover/takeover of node 1. Something on the node failed or caused a panic.

I would review the event log to see what happened before you do the giveback. You can connect to the SP/BMC and check the current state of the node by running "system console". It might just be holding at "waiting for giveback", or it might be dead.

You can also run some commands from the SP/BMC to see what happened to the node.

From the SP CLUSTER-01> prompt you can run "events all", or "events newest 100" to see the 100 most recent events.

The best option, though, is to open a ticket with support and work with them to do the giveback when you can.

mickdon · ‎2020-07-28

Thank you @SpindleNinja ... Our support ended last month and management doesn't want to pay for it, even though I presented a case for it. I hope they will listen, because I'm already losing it, and I'm letting them know today how important it is to our architecture.

Anyway, pardon my frustration! pwwwwwwwwwww... breathe, breathe... thanks, I needed to vent!

I got logs from the SP; I highlighted the most recent ones:

Record 2414: Wed Jun 24 02:56:34 2020 [Heartbeat.notice]: Heartbeat start: Set SP time. Old time: Wed Jun 24 02:56:34 2020. New time: Wed Jun 24 02:56:35 2020.
Record 2415: Fri Jun 26 00:47:34 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Fri Jun 26 00:47:32 2020. New time: Fri Jun 26 00:47:34 2020.
Record 2416: Sat Jun 27 19:30:34 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Sat Jun 27 19:30:32 2020. New time: Sat Jun 27 19:30:34 2020.
Record 2417: Mon Jun 29 13:57:34 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Mon Jun 29 13:57:32 2020. New time: Mon Jun 29 13:57:34 2020.
Record 2418: Wed Jul 1 07:47:34 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Wed Jul 1 07:47:32 2020. New time: Wed Jul 1 07:47:34 2020.
Record 2419: Thu Jul 2 23:53:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Thu Jul 2 23:53:33 2020. New time: Thu Jul 2 23:53:35 2020.
Record 2420: Sat Jul 4 17:59:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Sat Jul 4 17:59:33 2020. New time: Sat Jul 4 17:59:35 2020.
Record 2421: Mon Jul 6 10:59:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Mon Jul 6 10:59:33 2020. New time: Mon Jul 6 10:59:35 2020.
Record 2422: Wed Jul 8 04:38:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Wed Jul 8 04:38:33 2020. New time: Wed Jul 8 04:38:35 2020.
Record 2423: Thu Jul 9 21:56:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Thu Jul 9 21:56:33 2020. New time: Thu Jul 9 21:56:35 2020.
Record 2424: Sat Jul 11 15:04:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Sat Jul 11 15:04:33 2020. New time: Sat Jul 11 15:04:35 2020.
Record 2425: Mon Jul 13 07:37:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Mon Jul 13 07:37:33 2020. New time: Mon Jul 13 07:37:35 2020.
Record 2426: Wed Jul 15 00:52:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Wed Jul 15 00:52:33 2020. New time: Wed Jul 15 00:52:35 2020.
Record 2427: Thu Jul 16 17:10:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Thu Jul 16 17:10:33 2020. New time: Thu Jul 16 17:10:35 2020.
Record 2428: Sat Jul 18 09:51:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Sat Jul 18 09:51:33 2020. New time: Sat Jul 18 09:51:35 2020.
Record 2429: Mon Jul 20 01:50:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Mon Jul 20 01:50:33 2020. New time: Mon Jul 20 01:50:35 2020.
Record 2430: Tue Jul 21 18:39:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Tue Jul 21 18:39:33 2020. New time: Tue Jul 21 18:39:35 2020.
Record 2431: Thu Jul 23 12:06:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Thu Jul 23 12:06:33 2020. New time: Thu Jul 23 12:06:35 2020.
Record 2432: Sat Jul 25 06:23:35 2020 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Sat Jul 25 06:23:33 2020. New time: Sat Jul 25 06:23:35 2020.
Record 2433: Thu Jan 1 00:00:36 1970 [IPMI.notice]: b404 | c0 | OEM: ffff7000ff00 | ManufId: 150300 | SP Power Reset
Record 2434: Thu Jan 1 00:00:36 1970 [IPMI.notice]: b504 | c0 | OEM: fcff70560000 | ManufId: 150300 | POS Register: Power on Reset(Normal Power Cycle)
Record 2435: Thu Jan 1 00:00:41 1970 [IPMI.notice]: b604 | 02 | EVT: 0157ff88 | SASS_1.2V | Assertion Event, "Upper Non-critical going high"
Record 2436: Thu Jan 1 00:00:41 1970 [IPMI.notice]: b704 | 02 | EVT: 0159ff8e | SASS_1.2V | Assertion Event, "Upper Critical going high"
Record 2437: Thu Jan 1 00:00:43 1970 [IPMI.notice]: b804 | 02 | EVT: 0301ffff | Power_Good | Assertion Event, "State Asserted"
Record 2438: Thu Jan 1 00:00:43 1970 [IPMI.notice]: b904 | 02 | EVT: 0301ffff | Power_Proc_OK | Assertion Event, "State Asserted"
Record 2439: Thu Jan 1 00:00:43 1970 [IPMI.notice]: ba04 | 02 | EVT: 0301ffff | Controller_Fault | Assertion Event, "State Asserted"
Record 2440: Thu Jan 1 00:00:43 1970 [IPMI.notice]: bb04 | 02 | EVT: 0900ffff | Wrench_Port_Up | Assertion Event, "Device Disabled"
Record 2441: Thu Jan 1 00:00:46 1970 [IPMI.notice]: bc04 | 02 | EVT: 81597d8e | SASS_1.2V | Deassertion Event, "Upper Critical going high"
Record 2442: Thu Jan 1 00:00:46 1970 [IPMI.notice]: bd04 | 02 | EVT: 81577d88 | SASS_1.2V | Deassertion Event, "Upper Non-critical going high"
Record 2443: Thu Jan 1 00:01:33 1970 [IPMI.notice]: be04 | 02 | EVT: 0901ffff | Wrench_Port_Up | Assertion Event, "Device Enabled"
Record 2444: Thu Jan 1 00:02:21 1970 [SP.notice]: Running primary version 2.10
Record 2445: Thu Jan 1 00:02:33 1970 [IPMI.warning]: FRUID 1 Access error
Record 2446: Thu Jan 1 00:02:45 1970 [IPMI.notice]: bf04 | 02 | EVT: 6fc100ff | System_FW_Status | Assertion Event, "Unspecified"
Record 2447: Thu Jan 1 00:02:49 1970 [IPMI.warning]: FRUID 1 Access error
Record 2448: Thu Jan 1 00:03:06 1970 [IPMI.warning]: FRUID 1 Access error
Record 2449: Thu Jan 1 00:03:22 1970 [IPMI.warning]: FRUID 1 Access error
Record 2450: Thu Jan 1 00:03:38 1970 [IPMI.warning]: FRUID 1 Access error
Record 2451: Thu Jan 1 00:03:55 1970 [IPMI.warning]: FRUID 1 Access error
Record 2452: Thu Jan 1 00:04:09 1970 [IPMI.warning]: FRUID 1 Access error
Record 2453: Sun Jul 26 11:39:44 2020 [BIOS.warning]: POST error 0x00a5: Definition not available Additional data: 0x00000000 0x00000000
Record 2454: Thu Jan 1 00:04:37 1970 [IPMI.notice]: c004 | 02 | EVT: 6fc204ff | System_FW_Status | Assertion Event, "Restoring MCH Values"
Record 2455: Sun Jul 26 11:39:49 2020 [CFE.notice]: Loader time adjust: Set SP time. Old time: Thu Jan 1 00:04:40 1970. New time: Sun Jul 26 11:39:49 2020.
Record 2456: Sun Jul 26 11:39:49 2020 [Boot Loader.notice]: Received time sync
Record 2457: Sun Jul 26 11:39:49 2020 [IPMI.notice]: c104 | 02 | EVT: 6fc000ff | System_FW_Status | Assertion Event, "Unspecified fatal firmware error"
Record 2458: Sun Jul 26 11:39:49 2020 [Boot Loader.critical]: Abort Autoboot due to BIOS POST failure.
Record 2459: Sun Jul 26 11:39:49 2020 [Trap Event.critical]: hwassist post_error (26)
Record 2460: Sun Jul 26 11:39:50 2020 [IPMI.warning]: FRUID 1 Access error
Record 2461: Sun Jul 26 11:39:52 2020 [IPMI.notice]: c204 | 02 | EVT: 6fc213ff | System_FW_Status | Assertion Event, "System boot initiated"
Record 2462: Sun Jul 26 11:39:57 2020 [IPMI.notice]: c304 | 02 | EVT: 6fc220ff | System_FW_Status | Assertion Event, "Bootloader is running"
Record 2463: Sun Jul 26 11:40:25 2020 [IPMI.warning]: FRUID 1 Access error
Record 2464: Sun Jul 26 11:40:42 2020 [IPMI.warning]: FRUID 1 Access error
Record 2465: Sun Jul 26 11:40:58 2020 [IPMI.warning]: FRUID 1 Access error
Record 2466: Sun Jul 26 11:41:14 2020 [IPMI.warning]: FRUID 1 Access error
Record 2467: Sun Jul 26 11:41:31 2020 [IPMI.warning]: FRUID 1 Access error
Record 2468: Sun Jul 26 11:41:47 2020 [IPMI.warning]: FRUID 1 Access error
Record 2469: Sun Jul 26 11:42:04 2020 [IPMI.warning]: FRUID 1 Access error
Record 2470: Sun Jul 26 11:42:20 2020 [IPMI.warning]: FRUID 1 Access error
Record 2471: Sun Jul 26 11:42:37 2020 [IPMI.warning]: FRUID 1 Access error
Record 2472: Sun Jul 26 11:42:53 2020 [IPMI.warning]: FRUID 1 Access error
Record 2473: Sun Jul 26 11:43:09 2020 [IPMI.warning]: FRUID 1 Access error
Record 2474: Sun Jul 26 11:43:26 2020 [IPMI.warning]: FRUID 1 Access error
Record 2475: Sun Jul 26 11:43:42 2020 [IPMI.warning]: FRUID 1 Access error
Record 2476: Sun Jul 26 11:43:58 2020 [IPMI.warning]: FRUID 1 Access error
Record 2477: Sun Jul 26 11:44:15 2020 [IPMI.warning]: FRUID 1 Access error
Record 2478: Sun Jul 26 11:44:31 2020 [IPMI.warning]: FRUID 1 Access error
Record 2479: Sun Jul 26 11:44:48 2020 [IPMI.warning]: FRUID 1 Access error
Record 2480: Sun Jul 26 11:45:04 2020 [IPMI.warning]: FRUID 1 Access error
Record 2481: Sun Jul 26 11:45:20 2020 [IPMI.warning]: FRUID 1 Access error
Record 2482: Sun Jul 26 11:45:42 2020 [IPMI.warning]: FRUID 1 Access error
Record 2483: Sun Jul 26 11:45:58 2020 [IPMI.warning]: FRUID 1 Access error
Record 2484: Sun Jul 26 11:46:14 2020 [IPMI.warning]: FRUID 1 Access error
Record 2485: Sun Jul 26 11:46:39 2020 [IPMI.warning]: FRUID 1 Access error
Record 2486: Sun Jul 26 11:46:43 2020 [ASUP.notice]: First notification email | (SYSTEM_BOOT_FAILED (POST failed)) CRITICAL | Send failed
Record 2487: Sun Jul 26 11:46:56 2020 [IPMI.warning]: FRUID 1 Access error
Record 2488: Sun Jul 26 11:47:12 2020 [IPMI.warning]: FRUID 1 Access error
Record 2489: Sun Jul 26 11:47:29 2020 [IPMI.warning]: FRUID 1 Access error
Record 2490: Sun Jul 26 11:47:45 2020 [IPMI.warning]: FRUID 1 Access error
Record 2491: Sun Jul 26 11:48:02 2020 [IPMI.warning]: FRUID 1 Access error
Record 2492: Sun Jul 26 11:48:18 2020 [IPMI.warning]: FRUID 1 Access error
Record 2493: Sun Jul 26 11:48:40 2020 [IPMI.warning]: FRUID 1 Access error
Record 2494: Sun Jul 26 11:48:40 2020 [IPMI.critical]: Rebooting SP due to task restarts
Record 2495: Sun Jul 26 11:48:40 2020 [IPMI.critical]: df: 98304 35160 63144 36%
Record 2496: Sun Jul 26 11:48:40 2020 [IPMI.critical]: fp: 795 0 12590
Record 2497: Sun Jul 26 11:48:40 2020 [IPMI.critical]: uptime: 811.890015 672.890015
Record 2498: Sun Jul 26 11:48:40 2020 [IPMI.critical]: ldavg: 1.480000 1.340000 0.860000 4/122 2237
Record 2499: Thu Jan 1 00:00:35 1970 [IPMI.notice]: c404 | c0 | OEM: ffff70005100 | ManufId: 150300 | SP Reset Internally
Record 2500: Thu Jan 1 00:00:41 1970 [IPMI.notice]: c504 | 02 | EVT: 0301ffff | Power_Good | Assertion Event, "State Asserted"
Record 2501: Thu Jan 1 00:00:42 1970 [IPMI.notice]: c604 | 02 | EVT: 0301ffff | Power_Proc_OK | Assertion Event, "State Asserted"
Record 2502: Thu Jan 1 00:00:42 1970 [IPMI.notice]: c704 | 02 | EVT: 6fc220ff | System_FW_Status | Assertion Event, "Bootloader is running"
Record 2503: Thu Jan 1 00:00:43 1970 [IPMI.notice]: c804 | 02 | EVT: 0301ffff | Controller_Fault | Assertion Event, "State Asserted"
Record 2504: Thu Jan 1 00:01:03 1970 [SP.notice]: Running primary version 2.10
Record 2505: Sun Jul 26 11:50:56 2020 [CFE.notice]: Loader time adjust: Set SP time. Old time: Thu Jan 1 00:01:34 1970. New time: Sun Jul 26 11:50:56 2020.
Record 2506: Sun Jul 26 11:50:56 2020 [Boot Loader.notice]: Received time sync
Record 2507: Sun Jul 26 12:02:30 2020 [SP.critical]: Heartbeat stopped
Record 2508: Sun Jul 26 12:02:30 2020 [Trap Event.warning]: hwassist loss_of_heartbeat (30)
Record 2509: Sun Jul 26 12:02:47 2020 [ASUP.notice]: First notification email | (HEARTBEAT_LOSS) WARNING | Send failed
Record 2510: Sun Jul 26 12:17:31 2020 [ASUP.notice]: Reminder email | (HEARTBEAT_LOSS) WARNING | Send failed
Record 2511: Tue Jul 28 20:46:15 2020 [IPMI.notice]: c904 | 02 | EVT: 0900ffff | Wrench_Port_Up | Assertion Event, "Device Disabled"
Record 2512: Tue Jul 28 20:48:23 2020 [IPMI.notice]: ca04 | 02 | EVT: 0901ffff | Wrench_Port_Up | Assertion Event, "Device Enabled"
Record 2513: Tue Jul 28 21:39:38 2020 [SP CLI.notice]: "log in from Serial Console"
SP Morgan-01>

Thanks for your help
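
With an SP log this long, the repeated "FRUID 1 Access error" lines bury the handful of records that matter. As a rough aid, here is a minimal Python sketch that keeps only warning and critical records and collapses consecutive repeats into one entry with a count. It assumes every record follows the "Record N: timestamp [Source.severity]: message" shape seen above; the function names are made up for illustration, and the sample lines are copied from the log in this post.

```python
import re
from itertools import groupby

# Record <n>: <timestamp> [<source>.<severity>]: <message>
RECORD = re.compile(r"^Record (\d+): (.+?) \[(.+?)\.(\w+)\]: (.*)$")


def parse_records(text):
    """Yield (number, timestamp, source, severity, message) for each record line."""
    for line in text.splitlines():
        match = RECORD.match(line.strip())
        if match:
            num, when, source, severity, message = match.groups()
            yield int(num), when, source, severity, message


def significant_events(text, levels=("warning", "critical")):
    """Keep records at the given severities and collapse consecutive repeats.

    Returns tuples of (first record number, first timestamp, source,
    severity, message, repeat count).
    """
    kept = [rec for rec in parse_records(text) if rec[3] in levels]
    summary = []
    # groupby only merges neighbours, so the order of events is preserved.
    for key, group in groupby(kept, key=lambda rec: (rec[2], rec[3], rec[4])):
        group = list(group)
        first = group[0]
        summary.append((first[0], first[1], key[0], key[1], key[2], len(group)))
    return summary


SAMPLE = """\
Record 2456: Sun Jul 26 11:39:49 2020 [Boot Loader.notice]: Received time sync
Record 2458: Sun Jul 26 11:39:49 2020 [Boot Loader.critical]: Abort Autoboot due to BIOS POST failure.
Record 2459: Sun Jul 26 11:39:49 2020 [Trap Event.critical]: hwassist post_error (26)
Record 2460: Sun Jul 26 11:39:50 2020 [IPMI.warning]: FRUID 1 Access error
Record 2463: Sun Jul 26 11:40:25 2020 [IPMI.warning]: FRUID 1 Access error
Record 2464: Sun Jul 26 11:40:42 2020 [IPMI.warning]: FRUID 1 Access error
Record 2507: Sun Jul 26 12:02:30 2020 [SP.critical]: Heartbeat stopped
"""

if __name__ == "__main__":
    for num, when, source, severity, message, count in significant_events(SAMPLE):
        repeat = f" (x{count})" if count > 1 else ""
        print(f"{num} {when} [{source}.{severity}] {message}{repeat}")
```

Repeats are collapsed after the severity filter, so notice-level records sitting between two identical warnings do not split them into separate entries.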

SpindleNinja · ‎2020-08-03

Sorry for the delay. Are you still in failover? That log doesn't look pretty.

Can you also run "sp status -v" from the SP on 01?

And related to what Parisi said: on 02, from the cluster shell, what's the output of "storage failover show-giveback"?

mickdon · ‎2020-08-11

Hi @SpindleNinja, thanks for your help. We are running at full capacity now, thanks to you. I was able to connect to the SP and got it running from there.

I checked the health and all seems to be fine. I'm keeping a keen eye on it, though, but so far so good. Thank you, kind sir.

SpindleNinja · ‎2020-08-11

No problem!

parisi · ‎2020-08-03

"auto-giveback deferred" usually means that the automatic giveback was interrupted by something in the system that would prevent non-disruptive giveback. Usually it's a CIFS/SMB lock. Sometimes, it's something more serious, such as a hardware issue.

The RDB output you see is due to the partner node not being online - there's no working RDB instance on that node, as it's not currently up. That's expected.

"event log show" would tell you *why* the autogiveback failed. Specifically:

::*> event log show -message-name cf.fsm.autoGiveback*

If you don't see messages in the CLI, it's possible the event log rolled over, so you can check AutoSupport for events around that time.

But a support case can help guide you through this process.

mickdon · ‎2020-08-11

Hi @parisi, there was a power outage, which was confirmed by the event logs.

I was able to get them both back online by following @SpindleNinja's suggestion. I connected to the SP console and used '?' to find my way around it, checked the logs and the status of the node, and finally power-cycled it. Everything seems to be working fine now. I checked the health of both nodes and they are great. Fingers crossed, I hope these folks can pay for support quickly.

Thanks for your help