Solved: Re: FAS2750 Redundancy Testing

TroyPayne · ‎2023-02-17

Hello,

I recently purchased and set up a FAS2750 (one controller shelf with 12 SSDs and three SAS disk shelves with twelve 10 TiB drives each). The system will only be doing SMB/CIFS.

I wired up the controllers and shelves per the setup guide.

Updated to ONTAP 9.12.1.

Installed the latest DQP.

Created some aggregates, set up two physical 10G Ethernet ports on each node in LACP to my Cisco Nexus switch, and created a couple of SVMs and shares. Everything was working great.

I started testing redundancy by pulling the 10G cables while doing a large file copy from a terminal server to a CIFS share on the filer. It didn't go perfectly but was acceptable.

Next, I simulated a node failure by removing Node1. I expected Node2 to take over the aggregates, SVMs, and SMB/CIFS shares, but this didn't happen. Node2 did not take over the virtual management IP, the aggregates, or the SVMs. I was able to get into the web GUI only by connecting to the management IP of Node2.

Can somebody explain this behavior or suggest anything to check?

SpindleNinja · ‎2023-02-17

Few questions -

So you physically just removed the node from the cluster?

What's the output of the command in the alert that's in the screen shot?

TroyPayne · ‎2023-02-17

Yes, I physically removed the node from shelf1.

I don't have the output of the command in the screenshot anymore. I put the node back and the system returned to normal after about 15 minutes. I'd guess there are some logs, but I'm not sure how to get them.

SpindleNinja · ‎2023-02-17

Did HA testing pass a normal Takeover-Giveback?

There's the event log ("event log show"). You might also need to check the BMC logs: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Systems/FAS_Systems/What_logs_to_collect_for_a_down_system_with_a_BMC

I would open a ticket though and have them review the logs.

Also, to add, I don't think I've ever known anyone to physically rip out the controller for an HA test, usually it's running a halt on the node or power off from the SP/BMC.

TroyPayne · ‎2023-02-21

Yes, all the givebacks were successful, but they took a while.

TroyPayne · ‎2023-02-17

Read from bottom up. Oldest events at the bottom.

MVTMFILER01-02	notice	vifmgr	vifmgr.portup: A link up event was received on node MVTMFILER01-02, port e0b.
MVTMFILER01-02	emergency	cf_main	callhome.partner.down: Call home for PARTNER DOWN, TAKEOVER IMPOSSIBLE
MVTMFILER01-02	error	cf_hwassist	cf.hwassist.missedKeepAlive: HW-assisted takeover missing keep-alive messages from HA partner (MVTMFILER01-01).
MVTMFILER01-02	error	start_asup_collector_thread	callhome.dsk.redun.fault: Call home for DISK REDUNDANCY FAILED
MVTMFILER01-02	notice	start_asup_collector_thread	shelf.config.tomixedha: System has transitioned to a mixture of single, multi-path or quad-path storage configurations.
MVTMFILER01-02	notice	vifmgr	vifmgr.reach.noreach: Network port e0M on node MVTMFILER01-02 cannot reach its expected broadcast domain Default:Default. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02	notice	vifmgr	vifmgr.reach.noreach: Network port a0a-208 on node MVTMFILER01-02 cannot reach its expected broadcast domain Default:Default-3. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02	notice	vifmgr	vifmgr.lifBeingRemoved: LIF lif_svm0_106 (on virtual server 4), IP address 10.246.208.41, is being removed from node MVTMFILER01-02, port a0a-208.
MVTMFILER01-02	alert	dsa_worker5	ses.status.procCplxError: DS224-12 (S/N SHJGD2248900479) shelf 0 on channel 0b Processor Complex error for Processor Complex 1: PCM on partner not installed This module is on the unknown location.
MVTMFILER01-02	error	dsa_worker5	ses.status.electronicsWarn: DS224-12 (S/N SHJGD2248900479) shelf 0 on channel 0b environmental monitoring warning for SES electronics 1: not installed. This module is on the rear of the shelf at the top left.
MVTMFILER01-02	notice	token_mgr_admin	token.node.out.of.quorum: All token references from node (ID - 54ba682b-6f63-11ed-b917-d039eaa12d32) are dropped because the node went out of quorum.
MVTMFILER01-02	notice	ThreadHandlerun	cf.misc.ProgTakeoverFail: Failover monitor: Programmatic takeover failed (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support)
MVTMFILER01-02	emergency	kltp	clam.node.ooq: Node (name=MVTMFILER01-01, ID=1000) is out of 'CLAM quorum' (reason=heartbeat failure).
MVTMFILER01-02	emergency	kltp	callhome.clam.node.ooq: Call home for NODE(S) OUT OF CLUSTER QUORUM.
MVTMFILER01-02	emergency	monitor	monitor.globalStatus.critical: Controller failover of MVTMFILER01-01 is not possible: HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support.
MVTMFILER01-02	notice	CsmMpAgentThread	nvmf.remote.status.cb: NVMeOF remote status callback endpoint=CLIENT, status=Session Failed.
MVTMFILER01-02	alert	cf_main	ha.takeoverImpIC: Takeover of the partner node is not possible due to interconnect errors.
MVTMFILER01-02	notice	vifmgr	vifmgr.reach.skipped: Network port e0b on node MVTMFILER01-02 was not scanned for reachability because it was administratively or operationally down at the time of the scan.
MVTMFILER01-02	notice	vifmgr	vifmgr.reach.noreach: Network port e0b on node MVTMFILER01-02 cannot reach its expected broadcast domain Cluster:Cluster. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02	notice	vifmgr	vifmgr.reach.skipped: Network port e0a on node MVTMFILER01-02 was not scanned for reachability because it was administratively or operationally down at the time of the scan.
MVTMFILER01-02	notice	vifmgr	vifmgr.reach.noreach: Network port e0a on node MVTMFILER01-02 cannot reach its expected broadcast domain Cluster:Cluster. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02	notice	cf_main	cf.fsm.partnerNotResponding: Failover monitor: partner not responding
MVTMFILER01-02	notice	cf_main	cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist alert from partner(MVTMFILER01-01), system_down because controller_inaccessible.
MVTMFILER01-02	alert	vifmgr	vifmgr.cluscheck.notassigned: Cluster LIF MVTMFILER01-02_clus2 (node MVTMFILER01-02) is not assigned to any port.
MVTMFILER01-02	alert	vifmgr	callhome.clus.net.degraded: Call home for CLUSTER NETWORK DEGRADED: Cluster LIF Not Assigned to Any Port - Cluster LIF MVTMFILER01-02_clus1 (node MVTMFILER01-02) is not assigned to any port
MVTMFILER01-02	alert	vifmgr	vifmgr.cluscheck.notassigned: Cluster LIF MVTMFILER01-02_clus1 (node MVTMFILER01-02) is not assigned to any port.
MVTMFILER01-02	notice	vifmgr	vifmgr.lifmoved.linkdown: LIF MVTMFILER01-02_clus2 (on virtual server 4294967293), IP address 169.254.186.84, is being moved to node MVTMFILER01-02, port e0b.
MVTMFILER01-02	error	cf_main	cf.fsm.takeoverByPartnerDisabled: Failover monitor: takeover of MVTMFILER01-02 by MVTMFILER01-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
MVTMFILER01-02	error	cf_main	cf.fsm.takeoverOfPartnerDisabled: Failover monitor: takeover of MVTMFILER01-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
MVTMFILER01-02	notice	vifmgr	vifmgr.lifmoved.linkdown: LIF MVTMFILER01-02_clus1 (on virtual server 4294967293), IP address 169.254.95.221, is being moved to node MVTMFILER01-02, port e0a.
MVTMFILER01-02	emergency	vifmgr	vifmgr.clus.linkdown: The cluster port e0b on node MVTMFILER01-02 has gone down unexpectedly.
MVTMFILER01-02	notice	vifmgr	vifmgr.portdown: A link down event was received on node MVTMFILER01-02, port e0b.
MVTMFILER01-02	emergency	vifmgr	vifmgr.clus.linkdown: The cluster port e0a on node MVTMFILER01-02 has gone down unexpectedly.
MVTMFILER01-02	notice	vifmgr	vifmgr.portdown: A link down event was received on node MVTMFILER01-02, port e0a.
MVTMFILER01-02	notice	gop_sbb_thread	cf.ic.sbb: HA interconnect: SBB Compatibility Event. No compatible partner node found. The interconnect device has been disabled.
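Because the paste above is newest-first, it can be easier to read after flipping it into chronological order. The following is only a minimal sketch, assuming each pasted line is tab-separated as node, severity, source, and "event.name: message" (which is how the lines above appear). The names `Event`, `parse_events`, and `only_severe` are hypothetical helpers, not part of any NetApp tool.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    node: str
    severity: str
    source: str
    name: str   # e.g. "ha.takeoverImpIC"
    text: str   # free-text part after the event name


def parse_events(pasted: str) -> List[Event]:
    """Parse newest-first pasted lines and return them oldest-first."""
    events = []
    for line in pasted.splitlines():
        parts = line.split("\t", 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        node, severity, source, message = parts
        # The event name is everything before the first ": ".
        name, _, text = message.partition(": ")
        events.append(Event(node, severity, source, name, text.strip()))
    events.reverse()  # the paste is newest-first, so flip it
    return events


def only_severe(events: List[Event],
                severities=("alert", "emergency")) -> List[Event]:
    """Keep only events whose severity is in the given set."""
    return [e for e in events if e.severity in severities]


# Two lines copied from the log above, in their original newest-first order.
SAMPLE = (
    "MVTMFILER01-02\talert\tcf_main\tha.takeoverImpIC: Takeover of the "
    "partner node is not possible due to interconnect errors.\n"
    "MVTMFILER01-02\tnotice\tgop_sbb_thread\tcf.ic.sbb: HA interconnect: "
    "SBB Compatibility Event. No compatible partner node found. "
    "The interconnect device has been disabled.\n"
)

if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event.severity, event.name)
```

Run against the full paste, this puts the cf.ic.sbb line first and the vifmgr.portup line last, which matches the bottom-up reading order.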

SpindleNinja · ‎2023-02-17

It looks like all the backend clusternet cables were disconnected before the node was pulled, correct?

When the backend clusternet cabling was pulled, it caused the cluster to disable HA. It's not really normal to have the whole of the backend network fail. That's why there's redundancy: two cables on a 2-node cluster, and two switches when you have a 4-node or larger cluster.

What it looks like it did:
Cables were disconnected -> The Cluster disabled HA -> Node was pulled and went offline -> Since HA was disabled there was no takeover.

This scenario is kinda like removing the wings on a plane and then having an engine failure and hoping the second engine will keep the plane in the air.

For hard node failure testing, you're better off doing a hard halt on the node.
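The ordering argument above can be checked mechanically against the pasted log. This is a rough sketch, not NetApp logic: it only asks whether a "takeover disabled" event was logged before the partner was reported as not responding. The event names come from the log in this thread; the function name `takeover_blocked_before_partner_down` and the choice of which events count as "disabled" are assumptions made for illustration.

```python
from typing import Sequence

# Events from the pasted log that report takeover as unavailable (assumed set).
TAKEOVER_DISABLED = {"cf.fsm.takeoverOfPartnerDisabled", "ha.takeoverImpIC"}
PARTNER_DOWN = "cf.fsm.partnerNotResponding"


def takeover_blocked_before_partner_down(names: Sequence[str]) -> bool:
    """Return True if takeover was already disabled when the partner went quiet.

    `names` must be event names in chronological (oldest-first) order.
    """
    names = list(names)
    if PARTNER_DOWN not in names:
        return False
    down_at = names.index(PARTNER_DOWN)
    # Look only at events logged strictly before the partner-down event.
    return any(n in TAKEOVER_DISABLED for n in names[:down_at])


# A shortened, oldest-first excerpt of the event names in the log above.
EXCERPT = [
    "cf.ic.sbb",
    "vifmgr.portdown",
    "cf.fsm.takeoverOfPartnerDisabled",
    "cf.fsm.partnerNotResponding",
    "ha.takeoverImpIC",
]

if __name__ == "__main__":
    print(takeover_blocked_before_partner_down(EXCERPT))
```

The check says nothing about why the interconnect went down first; it only confirms the order in which the surviving node logged the events.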

TroyPayne · ‎2023-02-20

Negative, I did not remove the cluster interconnect cables before I pulled out Controller Node 1. I know the logs indicate that, but that isn't what happened.

I merely pulled the little release lever and dislodged the node about 2 inches from its bay on Shelf1 and left all cables in place.

I guess from Node2's perspective the logs are not wrong, because the cables are effectively removed when node1 is pulled.

My other system is a FAS8040 7-Mode MetroCluster. With its type of redundancy you can instantly lose a node and data continues to be saved and served to clients. Am I wrong to assume the redundancy/failover behavior of my 2750 should be the same as my MC?

I thought about doing a graceful shutdown of Node1, instead of pulling, but decided that's not a fair test because likely Node1 will send some kind of "Hey I'm about to go down" notifications to Node2 before shutting down.

Is running the "halt" you mentioned a good option?

TMACMD · ‎2023-02-20

With the system in its "normal" state, what is the output of the following commands?

storage failover show
cluster show

TroyPayne · ‎2023-02-21

TMACMD · ‎2023-02-17

Or just dislodge the controller (and wait 60 seconds for NVMEM destage) without removing cables first.

If you try the Service-processor route:

Log in to the SP.

At the SP prompt try a "system power off"

Wait 60 seconds (for NVMEM destage), then run "system power status" followed by "system power on"

TroyPayne · ‎2023-02-20

Hi TMACMD,

Are the steps you provided above two different methods you would use to perform a failover test? I'm not sure I have NVME in my 2750.

I do have the SP configured with an IP address and I can SSH into it. Strangely it appears to be sharing the e0M management interface. I'm trying to figure out what's the point of that.

TroyPayne · ‎2023-02-20

Correction. I see you wrote NVMEM not NVME.

TMACMD · ‎2023-02-20

The service-processor (SP or BMC) is in fact aligned with the serial port. If you "SSH" to the SP/BMC, you can run the command: "system console" to access the serial console port.

Once at the SP login, you can do "system power status" to see the current power status.

Or "system power cycle" to effectively turn the controller off and then on. Or "system power off", wait about 45 seconds for the NVMEM destage, then "system power status" to make sure it is still off then "system power on" to restore power to the controller.

I have used both methods for testing (pulling the controller, but leaving the cables attached!) and also from the SP.
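The SP power-off sequence described above (power off, wait for the NVMEM destage, check status, power on) can be written down as a small script so the wait is not skipped. This is a sketch under assumptions: `send` is a hypothetical callable that delivers one command to an already-open SP/BMC SSH session and returns its output, and `sp_power_test` is an illustrative name. Only the three quoted SP commands come from this thread, and the 45-second default is the wait mentioned above.

```python
import time
from typing import Callable, List


def sp_power_test(send: Callable[[str], str],
                  destage_wait_s: float = 45.0,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run the off / wait / status / on sequence and return each reply."""
    replies = []
    replies.append(send("system power off"))
    sleep(destage_wait_s)  # give the NVMEM destage time to finish
    replies.append(send("system power status"))  # confirm it is still off
    replies.append(send("system power on"))
    return replies


if __name__ == "__main__":
    # Dry run: print the commands instead of sending them, and skip the wait.
    def dry_run(command: str) -> str:
        print("would send:", command)
        return ""

    sp_power_test(dry_run, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the dry run instant. In a real test the takeover should be checked from the surviving node before sending "system power on".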

TroyPayne · ‎2023-02-21

I tried these steps to power off Node1 yesterday and then again today to power off Node2. The failover worked much better each time than when I yanked the node out. I didn't lose the management interface or SMB access, and the node that still had power took over the aggregates, volumes, LIFs, and SVMs that were homed on the down node.

My question at this point is: why does powering off a node and leaving its cables plugged in behave differently (better) than physically pulling the node?

Do I have unrealistic expectations?