Network and Storage Protocols

HA Testing

TIMWALSHMI

I've set up a FAS3270 and want to test failover to make sure it works. Here's what I do:

1) Make sure HA is enabled on both controllers

2) Verify that the interface group (ifgrp0) is in shared mode for both controllers

3) Start a continuous ping to the IP addresses of both interface groups

4) From the CLI, run cf takeover on the first controller

5) Using the SP, watch the two controllers as they gracefully fail over and the one being taken over reboots

6) Once the taken-over controller has rebooted and is in the waiting-for-giveback state, run the cf giveback -f command

7) Watch the takeover controller hand control back to the rebooted controller

8) Repeat using the other controller
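To put numbers on the ping behavior during the steps above, a small script can timestamp every lost and restored reply instead of relying on watching the console. The following is only a sketch under stated assumptions: it assumes a Linux client with the iputils `ping` command (whose `-c` and `-W` flags are used here), and the function names `ping_once`, `outage_windows` and `monitor` are made up for illustration.

```python
import subprocess
import time
from typing import Dict, Iterable, List, Optional, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, reply received?)


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo with the system ping (assumes Linux iputils flags)."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, Optional[float]]]:
    """Return (first_lost, first_restored) for each run of lost pings.

    first_restored is None if the outage was still ongoing at the last sample.
    """
    windows: List[Tuple[float, Optional[float]]] = []
    start: Optional[float] = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts))     # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def monitor(hosts: List[str], duration_s: float = 600.0,
            interval_s: float = 1.0) -> Dict[str, List[Tuple[float, Optional[float]]]]:
    """Ping every host once per interval and report outage windows per host."""
    samples: Dict[str, List[Sample]] = {h: [] for h in hosts}
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        for h in hosts:
            samples[h].append((time.time(), ping_once(h)))
        time.sleep(interval_s)
    return {h: outage_windows(s) for h, s in samples.items()}
```

Calling `monitor(["10.5.141.21", "10.5.141.22"])` from the client before running the takeover and giveback would return, for each address, the time each outage started and ended. Running it once from a client on the storage VLAN and once from a client on another subnet makes it easier to tell a routing problem from a failover problem.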

What I'm seeing, however, doesn't convince me that HA is actually working. In step 5 I also watch the continuous pings, and from the moment the controller is taken over until the giveback completes, pings to its IP address time out. It's my understanding that in shared mode the other controller should assume the IP address of the failed controller. If I can't ping it, then I can't access it, so how is this HA?

Both controllers have ifgrp0 on the same VLAN, with separate IP addresses: 10.5.141.21 and 10.5.141.22.

Am I missing something? If so, what?

aborzenkov

HA is not transparent. Your clients should tolerate a short period of unavailability. That's normal and expected.

TIMWALSHMI

Thanks for the response. While it's true that clients will experience a momentary disconnect when the node fails, it's my understanding that once the partner node takes over, the clients reconnect and have access to their storage. I understand CIFS clients experience this disconnect the most.

My assumption is that once the partner node has taken over the IP address of the failed node, it will respond to pings to that address. This does not seem to be the case, which is why I'm asking whether my assumption is correct. I'm assuming I'm missing an options setting; I just haven't figured out which one.

aborzenkov

The usual cause is a stale ARP entry on the client. Normally NetApp sends an unsolicited ARP reply so that clients update their ARP cache, but it is entirely up to the client to listen and react to it. Another cause would be the MAC address table in the switches.

Wait 10 to 15 minutes. If you can then access the filer after failover, it is most likely one of these two causes.
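One way to check the stale-ARP theory from the client side is to compare the MAC address the client holds for the filer's IP before and after the takeover. The sketch below is hedged accordingly: it assumes a Linux client with the iproute2 `ip neigh show` command, and the helper names `parse_neighbors` and `current_mac` are hypothetical.

```python
import subprocess
from typing import Dict, Optional


def parse_neighbors(text: str) -> Dict[str, Optional[str]]:
    """Map IP address -> MAC address from `ip neigh show` style output.

    Entries without an `lladdr` field (for example FAILED or INCOMPLETE)
    map to None.
    """
    table: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        mac: Optional[str] = None
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                mac = fields[idx + 1].lower()
        table[fields[0]] = mac
    return table


def current_mac(ip: str) -> Optional[str]:
    """Return the MAC this client currently has cached for ip, if any."""
    out = subprocess.run(["ip", "neigh", "show"], capture_output=True,
                         text=True, check=True).stdout
    return parse_neighbors(out).get(ip)
```

If `current_mac("10.5.141.21")` still returns the taken-over controller's MAC well after the takeover, the client's ARP cache is stale. If it changes to the partner's MAC and pings still fail, the problem lies elsewhere.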

TIMWALSHMI

So I modified the test a bit, and from a device that is on the storage VLAN I can ping the IP address of the failed node and get a response.

From the core switch I can look at the ARP table and see that the IP address has been remapped to the MAC address of the takeover node.

I cannot, however, ping the address from a device that is not on the storage VLAN.

This FAS3270 has a management port connected as well as ifgrp0. On failover, only ifgrp0's IP address is taken over by the takeover node. The default gateway is for the management port. I wonder if I messed up the default gateway: should it be on the storage VLAN and not the management port?

aborzenkov

If I understand you correctly, the interface that is used for the default route does not fail over? Then it's not going to work (for connections that have to go via the default route). You need to ensure that all interfaces required for connectivity are present during failover.
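The reasoning in this reply can be illustrated with a simplified route lookup. This is a toy model, not how Data ONTAP implements routing: the /24 prefix length, the management subnet 192.0.2.0/24, the gateways, and the client addresses are all assumptions made up for the example, and `reply_interface` is a hypothetical helper. Only 10.5.141.21 comes from the thread.

```python
import ipaddress
from typing import Dict, Optional


def reply_interface(client_ip: str, interfaces: Dict[str, str],
                    default_gw: Optional[str]) -> Optional[str]:
    """Return the interface a reply to client_ip would leave through.

    interfaces maps an interface name to its address in CIDR form.
    Returns None when no usable route exists.
    """
    client = ipaddress.ip_address(client_ip)
    nets = {name: ipaddress.ip_interface(cidr).network
            for name, cidr in interfaces.items()}

    # Directly connected: the client is on the same subnet as an interface.
    for name, net in nets.items():
        if client in net:
            return name

    # Otherwise the reply must go via the default gateway, which is only
    # usable if some interface that is still present can reach it.
    if default_gw is not None:
        gw = ipaddress.ip_address(default_gw)
        for name, net in nets.items():
            if gw in net:
                return name
    return None


# During takeover only ifgrp0 is hosted by the partner; the management
# port (assumed here to be on 192.0.2.0/24) does not fail over.
taken_over = {"ifgrp0": "10.5.141.21/24"}

print(reply_interface("10.5.141.50", taken_over, "192.0.2.1"))    # storage VLAN client
print(reply_interface("198.51.100.9", taken_over, "192.0.2.1"))   # remote client, gateway on management subnet
print(reply_interface("198.51.100.9", taken_over, "10.5.141.1"))  # remote client, gateway on storage VLAN
```

In this model the storage VLAN client is always answered through ifgrp0, while the remote client gets a reply only when the default gateway sits on a subnet whose interface fails over. That matches the symptoms reported earlier in the thread.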

TIMWALSHMI

Changed the default gateway and ran the test again. This time, when I do a takeover, the continuous ping from the client loses only 1 ping and then is back again. It loses 6 pings when I do a giveback. Worst case, 6 seconds on takeover and 18 seconds on giveback.
