ONTAP Discussions

FAS2750 Redundancy Testing

TroyPayne
6,953 Views

Hello,

I recently purchased and set up a FAS2750 (1 controller shelf with 12 SSDs and 3 SAS disk shelves with twelve 10TiB drives each). The system will only be doing SMB/CIFS.

I wired up the controllers and shelves per the setup guide.

Updated to ONTAP 9.12.1

Installed the latest DQP.

Created some aggregates, set up 2 physical 10G Ethernet ports on each node in LACP to my Cisco Nexus switch, and created a couple of SVMs and shares. Everything was working great.

I started testing redundancy by pulling the 10G cables while doing a large file copy from a terminal server to a CIFS share on the filer. It didn't go perfectly, but it was acceptable.

Next, I simulated a node failure by removing Node1. I expected Node2 to take over the aggregates, SVMs and SMB/CIFS shares, but this didn't happen. Node2 did not take over the virtual management IP, the aggregates, or the SVMs. I was able to get into the web GUI only by connecting to the management IP of Node2.

TroyPayne_0-1676662652873.png

 

Can somebody explain this behavior or suggest anything to check?

14 REPLIES

SpindleNinja
6,940 Views

Few questions - 

So you physically just removed the node from the cluster?  

 

What's the output of the command in the alert that's in the screen shot? 

 

 

TroyPayne
6,938 Views

Yes, I physically removed the node from shelf1.

 

I don't have the output of the command in the screenshot anymore. I put the node back and the system returned to normal after about 15 minutes. I'd guess there are some logs, but I'm not sure how to get them.

SpindleNinja
6,929 Views

Did HA testing pass a normal Takeover-Giveback? 

 

There's the event log ("event log show"). You might also need to check the BMC logs - https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Systems/FAS_Systems/What_logs_to_collect_for_a_down_system_with_a_BMC

 

I would open a ticket though and have them review the logs. 

 

Also, to add, I don't think I've ever known anyone to physically rip out the controller for an HA test; usually it's running a halt on the node or a power off from the SP/BMC.

 

TroyPayne
6,703 Views

Yes, all the givebacks were successful, but it took a while.

TroyPayne
6,922 Views

Read from bottom up. Oldest events at the bottom.

MVTMFILER01-02 notice vifmgr vifmgr.portup: A link up event was received on node MVTMFILER01-02, port e0b.
MVTMFILER01-02 emergency cf_main callhome.partner.down: Call home for PARTNER DOWN, TAKEOVER IMPOSSIBLE
MVTMFILER01-02 error cf_hwassist cf.hwassist.missedKeepAlive: HW-assisted takeover missing keep-alive messages from HA partner (MVTMFILER01-01).
MVTMFILER01-02 error start_asup_collector_thread callhome.dsk.redun.fault: Call home for DISK REDUNDANCY FAILED
MVTMFILER01-02 notice start_asup_collector_thread shelf.config.tomixedha: System has transitioned to a mixture of single, multi-path or quad-path storage configurations.
MVTMFILER01-02 notice vifmgr vifmgr.reach.noreach: Network port e0M on node MVTMFILER01-02 cannot reach its expected broadcast domain Default:Default. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02 notice vifmgr vifmgr.reach.noreach: Network port a0a-208 on node MVTMFILER01-02 cannot reach its expected broadcast domain Default:Default-3. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02 notice vifmgr vifmgr.lifBeingRemoved: LIF lif_svm0_106 (on virtual server 4), IP address 10.246.208.41, is being removed from node MVTMFILER01-02, port a0a-208.
MVTMFILER01-02 alert dsa_worker5 ses.status.procCplxError: DS224-12 (S/N SHJGD2248900479) shelf 0 on channel 0b Processor Complex error for Processor Complex 1: PCM on partner not installed  This module is on the unknown location.
MVTMFILER01-02 error dsa_worker5 ses.status.electronicsWarn: DS224-12 (S/N SHJGD2248900479) shelf 0 on channel 0b environmental monitoring warning for SES electronics 1: not installed.  This module is on the rear of the shelf at the top left.
MVTMFILER01-02 notice token_mgr_admin token.node.out.of.quorum: All token references from node (ID - 54ba682b-6f63-11ed-b917-d039eaa12d32) are dropped because the node went out of quorum.
MVTMFILER01-02 notice ThreadHandlerun cf.misc.ProgTakeoverFail: Failover monitor: Programmatic takeover failed (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support)
MVTMFILER01-02 emergency kltp clam.node.ooq: Node (name=MVTMFILER01-01, ID=1000) is out of 'CLAM quorum' (reason=heartbeat failure).
MVTMFILER01-02 emergency kltp callhome.clam.node.ooq: Call home for NODE(S) OUT OF CLUSTER QUORUM.
MVTMFILER01-02 emergency monitor monitor.globalStatus.critical: Controller failover of MVTMFILER01-01 is not possible: HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support.
MVTMFILER01-02 notice CsmMpAgentThread nvmf.remote.status.cb: NVMeOF remote status callback endpoint=CLIENT, status=Session Failed.
MVTMFILER01-02 alert cf_main ha.takeoverImpIC: Takeover of the partner node is not possible due to interconnect errors.
MVTMFILER01-02 notice vifmgr vifmgr.reach.skipped: Network port e0b on node MVTMFILER01-02 was not scanned for reachability because it was administratively or operationally down at the time of the scan.
MVTMFILER01-02 notice vifmgr vifmgr.reach.noreach: Network port e0b on node MVTMFILER01-02 cannot reach its expected broadcast domain Cluster:Cluster. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02 notice vifmgr vifmgr.reach.skipped: Network port e0a on node MVTMFILER01-02 was not scanned for reachability because it was administratively or operationally down at the time of the scan.
MVTMFILER01-02 notice vifmgr vifmgr.reach.noreach: Network port e0a on node MVTMFILER01-02 cannot reach its expected broadcast domain Cluster:Cluster. No other broadcast domains appear to be reachable from this port.
MVTMFILER01-02 notice cf_main cf.fsm.partnerNotResponding: Failover monitor: partner not responding
MVTMFILER01-02 notice cf_main cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist alert from partner(MVTMFILER01-01), system_down because controller_inaccessible.
MVTMFILER01-02 alert vifmgr vifmgr.cluscheck.notassigned: Cluster LIF MVTMFILER01-02_clus2 (node MVTMFILER01-02) is not assigned to any port.
MVTMFILER01-02 alert vifmgr callhome.clus.net.degraded: Call home for CLUSTER NETWORK DEGRADED: Cluster LIF Not Assigned to Any Port - Cluster LIF MVTMFILER01-02_clus1 (node MVTMFILER01-02) is not assigned to any port
MVTMFILER01-02 alert vifmgr vifmgr.cluscheck.notassigned: Cluster LIF MVTMFILER01-02_clus1 (node MVTMFILER01-02) is not assigned to any port.
MVTMFILER01-02 notice vifmgr vifmgr.lifmoved.linkdown: LIF MVTMFILER01-02_clus2 (on virtual server 4294967293), IP address 169.254.186.84, is being moved to node MVTMFILER01-02, port e0b.
MVTMFILER01-02 error cf_main cf.fsm.takeoverByPartnerDisabled: Failover monitor: takeover of MVTMFILER01-02 by MVTMFILER01-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
MVTMFILER01-02 error cf_main cf.fsm.takeoverOfPartnerDisabled: Failover monitor: takeover of MVTMFILER01-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
MVTMFILER01-02 notice vifmgr vifmgr.lifmoved.linkdown: LIF MVTMFILER01-02_clus1 (on virtual server 4294967293), IP address 169.254.95.221, is being moved to node MVTMFILER01-02, port e0a.
MVTMFILER01-02 emergency vifmgr vifmgr.clus.linkdown: The cluster port e0b on node MVTMFILER01-02 has gone down unexpectedly.
MVTMFILER01-02 notice vifmgr vifmgr.portdown: A link down event was received on node MVTMFILER01-02, port e0b.
MVTMFILER01-02 emergency vifmgr vifmgr.clus.linkdown: The cluster port e0a on node MVTMFILER01-02 has gone down unexpectedly.
MVTMFILER01-02 notice vifmgr vifmgr.portdown: A link down event was received on node MVTMFILER01-02, port e0a.
MVTMFILER01-02 notice gop_sbb_thread cf.ic.sbb: HA interconnect: SBB Compatibility Event. No compatible partner node found. The interconnect device has been disabled.
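
For anyone who wants to read a dump like the one above in time order, here is a minimal Python sketch. It is only an illustration, not a NetApp tool: `parse_line` and `chronological` are hypothetical helper names, and the regex assumes each line starts with the node name followed by one of the four severities seen in this log (with or without a space between the columns). The source and event-name columns are deliberately left unsplit in `rest`.

```python
import re

# Node name, then one of the severities seen in this log; the columns may be
# fused together or separated by whitespace, so the separators are optional.
LINE_RE = re.compile(
    r"^(?P<node>\S+?)\s*(?P<severity>emergency|alert|error|notice)\s*(?P<rest>.+)$"
)


def parse_line(line):
    """Return a dict with node, severity and rest, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def chronological(lines, severities=("emergency", "alert", "error")):
    """Reverse a newest-first dump and keep only the requested severities.

    Pass severities=None to keep every parsed line.
    """
    parsed = [p for p in (parse_line(line) for line in lines) if p is not None]
    parsed.reverse()  # the dump was pasted newest-first
    if severities is None:
        return parsed
    return [p for p in parsed if p["severity"] in severities]


# Three lines taken from the log above, newest first.
SAMPLE = [
    "MVTMFILER01-02alertcf_mainha.takeoverImpIC: Takeover of the partner node "
    "is not possible due to interconnect errors.",
    "MVTMFILER01-02 notice cf_main cf.fsm.partnerNotResponding: Failover "
    "monitor: partner not responding",
    "MVTMFILER01-02noticegop_sbb_threadcf.ic.sbb: HA interconnect: SBB "
    "Compatibility Event. No compatible partner node found. The interconnect "
    "device has been disabled.",
]

for event in chronological(SAMPLE, severities=None):
    print(event["severity"], event["rest"])
```

Run against the sample, the oldest entry (the HA interconnect being disabled) prints first and the "takeover not possible" alert prints last, which matches the bottom-up reading of the log.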

SpindleNinja
6,907 Views

It looks like all the backend clusternet cables were disconnected before the node was pulled, correct?

 

When the backend clusternet cables were pulled, it caused the cluster to disable HA. It's not really normal to have the whole backend network fail; that's why there's redundancy: two cables on a 2-node cluster, and 2 switches when you have a cluster of 4 nodes or more.

 

What it looks like it did: 
Cables were disconnected -> The  Cluster disabled HA -> Node was pulled and went offline -> Since HA was disabled there was no takeover.   

 

This scenario is kinda like removing the wings on a plane and then having an engine failure and hoping the second engine will keep the plane in the air.

 

For hard node failure testing,  you're better off to do a hard halt on the node. 

TroyPayne
6,772 Views

Negative, I did not remove the cluster interconnect cables before I pulled out Controller Node 1. I know the logs indicate that, but that isn't what happened.

 

I merely pulled the little release lever and dislodged the node about 2 inches from its bay on Shelf1 and left all cables in place.

 

I guess from Node2's perspective the logs are not wrong, because the cables are effectively removed when node1 is pulled.

 

My other system is a FAS8040 7mode MetroCluster. With its redundancy type you can instantly lose a node and data continues to be saved and served to clients. Am I wrong to assume the redundancy/failover behavior of my 2750 should be the same as my MC?

 

I thought about doing a graceful shutdown of Node1, instead of pulling, but decided that's not a fair test because likely Node1 will send some kind of "Hey I'm about to go down" notifications to Node2 before shutting down.

 

Is running the "Halt" you mentioned a good option ? 

TMACMD
6,750 Views

With the system in its "normal" state, what is the output of:

  • storage failover show
  • cluster show

 

TroyPayne
6,713 Views

TroyPayne_0-1677012395414.png

TroyPayne_1-1677012412944.png

 

 

TMACMD
6,903 Views

Or just dislodge the controller (and wait 60 seconds for NVMEM destage) without removing cables first.

 

If you try the Service-processor route:

Log in to the SP.

At the SP prompt, try a "system power off".

Wait 60 seconds (for the NVMEM destage), then run "system power status" followed by "system power on".
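
The SP sequence above can be sketched in Python as follows. This is a hedged illustration only: `sp_power_off_on` and `dry_run` are hypothetical names, and the `run` callable stands in for however you send a command to the SP (for example over your own SSH session); nothing here talks to real hardware, and any interactive confirmation the SP asks for would have to be handled by your `run` implementation. Only the three SP commands and the 60-second wait come from the steps above.

```python
import time


def sp_power_off_on(run, destage_wait_s=60, sleep=time.sleep):
    """Power a controller off and back on from the SP.

    run:   callable that takes an SP command string and returns its output
           (assumption: you supply this, e.g. a wrapper around an SSH session).
    sleep: injectable so the wait can be skipped in a dry run or test.
    Returns a list of (command, output) tuples in the order they were issued.
    """
    transcript = []

    # Step 1: cut power to the controller.
    transcript.append(("system power off", run("system power off")))

    # Step 2: give NVMEM time to destage before doing anything else.
    sleep(destage_wait_s)

    # Step 3: record the power status so the operator can confirm it is off.
    transcript.append(("system power status", run("system power status")))

    # Step 4: restore power to the controller.
    transcript.append(("system power on", run("system power on")))
    return transcript


def dry_run(command):
    """Stand-in runner that only prints what would be sent to the SP."""
    print("SP>", command)
    return ""


# Dry run: prints the three commands in order without waiting.
sp_power_off_on(dry_run, sleep=lambda seconds: None)
```

Keeping `run` and `sleep` injectable means the same function can be exercised as a dry run first and only later pointed at a real SP session.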

TroyPayne
6,771 Views

Hi TMACMD,

Are the steps you provided above 2 different methods you would use to perform a failover test? I'm not sure I have NVMe in my 2750.

 

I do have the SP configured with an IP address and I can SSH into it. Strangely it appears to be sharing the e0M management interface. I'm trying to figure out what's the point of that.

TroyPayne
6,757 Views

Correction: I see you wrote NVMEM, not NVMe.

TMACMD
6,751 Views

The service-processor (SP or BMC) is in fact aligned with the serial port. If you "SSH" to the SP/BMC, you can run the command: "system console" to access the serial console port.

 

Once at the SP login, you can do "system power status" to see the current power status.

Or "system power cycle" to effectively turn the controller off and then on. Or "system power off", wait about 45 seconds for the NVMEM destage, then "system power status" to make sure it is still off, then "system power on" to restore power to the controller.

 

I have used both methods for testing: pulling the controller (but leaving the cables attached!) and powering it off from the SP.

TroyPayne
6,709 Views

I tried these steps to power off Node1 yesterday and then again today to power off Node2. The failover worked much better each time than when I yanked the node out. I didn't lose the management interface or SMB access, and the node that still had power took over the aggregates, volumes, LIFs, and SVMs that were homed on the down node.

 

My question at this point is: why does powering off a node and leaving its cables plugged in behave differently (better) than physically pulling the node?

Do I have unrealistic expectations?
