Solved: NFS Datastore Failover

Lumina · ‎2019-08-29

Hi All,

I have a relatively simple scenario but I'm having a hard time getting it to work the way I thought it would. I'm pretty much a one-man show and new to NFS, so maybe I'm just missing the boat. We are migrating off of fibre channel storage to NFS with VMware for the sake of simplicity and ease of use.

Here is my setup:

NetApp:

I have an AFF8080 with a single LIF on 10GbE to use as an NFS datastore for VMs. The LIF is set up on a 10GbE link on controller #1, port e0b, with a failover group set to fail over to e0b on controller #2.

Ethernet Switches:
Two 10GbE switches. One cable to controller #1, the other to controller #2.

Host:

For this example, I have a single ESXi 6.7u3 host and vSphere Standard. The host has 2 physical NICs with 2 ports each, for a total of 4 ports. I have two cables running to one Ethernet switch and two to the other. I have set up a single VMkernel port with an IP address for the storage traffic and have added it to a standard vSwitch.

The NetApp LIF, VMkernel adapter, and vSwitch are set to MTU 9000, and the switches to 9216. I have VSC installed on vCenter and the NFS VAAI plugin installed on the host. All NFS, MPIO, and Adapter settings in VSC are green.

With this setup, I can easily add the NFS datastore and access it from vCenter. I placed a test Windows Server VM on there and it runs without issue. My dilemma is that when I simulate a failover, either by pulling a cable from the switch or NetApp, or by migrating the LIF from controller #1 to #2, I get an average of about 30 seconds on my RDP session to the VM before it becomes unresponsive. As soon as I migrate the LIF back to the home port (controller #1), my RDP session is restored. When I try to remote into the VM from vCenter, it just gives me a black screen, so it must be in some kind of paused state looking for a connection to the storage. The datastore still seems to be accessible from vCenter (e.g. I can browse for files or create folders). It does not say "inaccessible" like I would think it should.

I've tried playing with the vSwitch settings. Currently I have the VMkernel adapter set with all four NICs (ports) to Active. I've tried setting one to active and one to standby, but I still end up with the same results. I have also played around with the failback, notify switch, and load balancing settings, to no avail.

I've done countless Google searches and still am not sure where I'm going wrong. Maybe this just isn't how NFS works? Or maybe my network setup is just too simple to accomplish what I'm trying to do.

Any insight would be appreciated!
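
A ping with the default packet size will not exercise the MTU 9000 / 9216 settings described above. The following is a minimal sketch, not anything posted in the thread, of how to size a don't-fragment ping so that it only succeeds when jumbo frames survive the whole path. It assumes IPv4 and standard ICMP header sizes; the helper names are hypothetical, and 172.16.94.20 is the datastore LIF address that appears in the CLI output further down.

```python
# Sketch: size a don't-fragment ping for a jumbo-frame path test.
# Assumes IPv4 (20-byte header) and ICMP echo (8-byte header).
IPV4_HEADER = 20
ICMP_HEADER = 8


def df_ping_payload(ip_mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return ip_mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(target_ip: str, ip_mtu: int = 9000) -> str:
    """Build an ESXi vmkping line: -d sets don't-fragment, -s the payload size."""
    return f"vmkping -d -s {df_ping_payload(ip_mtu)} {target_ip}"


if __name__ == "__main__":
    # Prints: vmkping -d -s 8972 172.16.94.20
    print(vmkping_command("172.16.94.20"))
```

Running the resulting command from the ESXi host against the LIF on each controller in turn would show whether full-size frames get through on both paths; a small ping succeeding says nothing about that.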

SpindleNinja · ‎2019-08-30

If you create a new lif and put it on the port on controller 2, can you ping it from the host? And can you ping from the netapp through that lif to an IP on the host?

Also, can you also post the following:

net int show -fields failover-group, failover-policy

net interface failover-groups show

net port show

broadcast-domain show

Lumina · ‎2019-08-30

Thanks for your reply!

I was able to ping the new “test” LIF on the second controller from the host, as well as from the LIF to the host.

Here are the commands you requested:

AFF8080::> ping -lif
AFF8080-SVM1_NFS_10GBe AFF8080-SVM1_mgmt AFF8080-SVM1_nfs_lif1
test AFF8080-01_clus1 AFF8080-01_clus2
AFF8080-02_clus1 AFF8080-02_clus2 AFF8080-01_SnapMirror
AFF8080-01_mgmt1 AFF8080-02_SnapMirror AFF8080-02_mgmt1
cluster_mgmt

AFF8080::> ping -lif test -vserver
AFF8080 AFF8080-SVM1 Cluster
AFF8080::> ping -lif test -vserver AFF8080
AFF8080 AFF8080-SVM1
AFF8080::> ping -lif test -vserver AFF8080-SVM1 -destination
<Remote InetAddress> Destination
AFF8080::> ping -lif test -vserver AFF8080-SVM1 -destination 172.16.94.51
172.16.94.51 is alive

AFF8080::> clear
Error: Ambiguous command. Possible matches include:
storage failover hwassist stats clear
storage tape alias clear
system smtape status clear

AFF8080::>

AFF8080::> net int show -fields failover-group, failover-policy
(network interface show)
vserver      lif                    failover-policy       failover-group
------------ ---------------------- --------------------- -----------------------
AFF8080      AFF8080-01_SnapMirror  local-only            Public
AFF8080      AFF8080-01_mgmt1       local-only            10GBe_Datacenter_VLAN_5
AFF8080      AFF8080-02_SnapMirror  local-only            Public
AFF8080      AFF8080-02_mgmt1       local-only            10GBe_Datacenter_VLAN_5
AFF8080      cluster_mgmt           broadcast-domain-wide 10GBe_Datacenter_VLAN_5
AFF8080-SVM1 AFF8080-01_fc_lif_1    disabled              -
AFF8080-SVM1 AFF8080-01_fc_lif_2    disabled              -
AFF8080-SVM1 AFF8080-02_fc_lif_1    disabled              -
AFF8080-SVM1 AFF8080-02_fc_lif_2    disabled              -
AFF8080-SVM1 AFF8080-SVM1_NFS_10GBe system-defined        10GBe_Datacenter_VLAN_5
AFF8080-SVM1 AFF8080-SVM1_mgmt      system-defined        10GBe_Datacenter_VLAN_5
AFF8080-SVM1 AFF8080-SVM1_nfs_lif1  system-defined        Default
AFF8080-SVM1 test                   system-defined        10GBe_Datacenter_VLAN_5
Cluster      AFF8080-01_clus1       local-only            Cluster
Cluster      AFF8080-01_clus2       local-only            Cluster
Cluster      AFF8080-02_clus1       local-only            Cluster
Cluster      AFF8080-02_clus2       local-only            Cluster
17 entries were displayed.

AFF8080::> net interface failover-groups show
(network interface failover-groups show)
                 Failover
Vserver          Group            Targets
---------------- ---------------- --------------------------------------------
AFF8080
                 10GBe_Datacenter_VLAN_5
                                  AFF8080-01:e0b-5, AFF8080-02:e0b-5
                 Default
                                  AFF8080-01:e0M, AFF8080-01:e0i,
                                  AFF8080-01:e0j, AFF8080-01:e0k,
                                  AFF8080-02:e0M, AFF8080-02:e0i,
                                  AFF8080-02:e0j, AFF8080-02:e0k
                 Public
                                  AFF8080-01:e0l, AFF8080-02:e0l
Cluster
                 Cluster
                                  AFF8080-01:e0a, AFF8080-01:e0c,
                                  AFF8080-02:e0a, AFF8080-02:e0c
4 entries were displayed.

AFF8080::>
AFF8080::> net port show
(network port show)

Node: AFF8080-01
                                                         Speed(Mbps) Health
Port      IPspace      Broadcast Domain        Link MTU  Admin/Oper  Status
--------- ------------ ----------------------- ---- ---- ----------- --------
e0M       Default      Default                 up   1500 auto/1000   healthy
e0a       Cluster      Cluster                 up   9000 auto/10000  healthy
e0b       Default      -                       up   9000 auto/10000  healthy
e0b-5     Default      10GBe_Datacenter_VLAN_5 up   9000 auto/10000  healthy
e0c       Cluster      Cluster                 up   9000 auto/10000  healthy
e0d       Default      -                       up   9000 auto/10000  healthy
e0d-5     Default      -                       up   9000 auto/10000  healthy
e0i       Default      Default                 down 1500 auto/-      -
e0j       Default      Default                 up   1500 auto/1000   healthy
e0k       Default      Default                 down 1500 auto/-      -
e0l       Default      Public                  up   1500 auto/1000   healthy

Node: AFF8080-02
                                                         Speed(Mbps) Health
Port      IPspace      Broadcast Domain        Link MTU  Admin/Oper  Status
--------- ------------ ----------------------- ---- ---- ----------- --------
e0M       Default      Default                 up   1500 auto/1000   healthy
e0a       Cluster      Cluster                 up   9000 auto/10000  healthy
e0b       Default      -                       up   9000 auto/10000  healthy
e0b-5     Default      10GBe_Datacenter_VLAN_5 up   9000 auto/10000  healthy
e0c       Cluster      Cluster                 up   9000 auto/10000  healthy
e0d       Default      -                       up   9000 auto/10000  healthy
e0d-5     Default      -                       up   9000 auto/10000  healthy
e0i       Default      Default                 down 1500 auto/-      -
e0j       Default      Default                 up   1500 auto/1000   healthy
e0k       Default      Default                 down 1500 auto/-      -
e0l       Default      Public                  up   1500 auto/1000   healthy
22 entries were displayed.

AFF8080::> broadcast-domain show
(network port broadcast-domain show)
IPspace Broadcast                                          Update
Name    Domain Name             MTU   Port List            Status Details
------- ----------------------- ----- -------------------- --------------
Cluster Cluster                 9000
                                      AFF8080-01:e0a       complete
                                      AFF8080-01:e0c       complete
                                      AFF8080-02:e0a       complete
                                      AFF8080-02:e0c       complete
Default 10GBe_Datacenter_VLAN_5 9000
                                      AFF8080-01:e0b-5     complete
                                      AFF8080-02:e0b-5     complete
        Default                 1500
                                      AFF8080-01:e0M       complete
                                      AFF8080-01:e0i       complete
                                      AFF8080-01:e0j       complete
                                      AFF8080-01:e0k       complete
                                      AFF8080-02:e0M       complete
                                      AFF8080-02:e0i       complete
                                      AFF8080-02:e0j       complete
                                      AFF8080-02:e0k       complete
        Public                  1500
                                      AFF8080-01:e0l       complete
                                      AFF8080-02:e0l       complete
4 entries were displayed.

AFF8080::>

Let me know if you need anything else.

Thanks!
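
For anyone reading the output above, one quick thing to check is whether every port in a broadcast domain reports the same MTU. Below is a rough sketch of that check. The rows are transcribed by hand from the `net port show` output in this post (only ports that belong to a named broadcast domain on the 10GbE and Public networks are included), and `mtu_mismatches` is a hypothetical helper, not an ONTAP tool.

```python
from collections import defaultdict

# (node, port, broadcast_domain, mtu), transcribed from `net port show` above.
PORTS = [
    ("AFF8080-01", "e0a", "Cluster", 9000),
    ("AFF8080-01", "e0b-5", "10GBe_Datacenter_VLAN_5", 9000),
    ("AFF8080-01", "e0c", "Cluster", 9000),
    ("AFF8080-01", "e0l", "Public", 1500),
    ("AFF8080-02", "e0a", "Cluster", 9000),
    ("AFF8080-02", "e0b-5", "10GBe_Datacenter_VLAN_5", 9000),
    ("AFF8080-02", "e0c", "Cluster", 9000),
    ("AFF8080-02", "e0l", "Public", 1500),
]


def mtu_mismatches(ports):
    """Return {domain: [mtus]} for every domain whose ports disagree on MTU."""
    by_domain = defaultdict(set)
    for _node, _port, domain, mtu in ports:
        by_domain[domain].add(mtu)
    return {d: sorted(m) for d, m in by_domain.items() if len(m) > 1}


if __name__ == "__main__":
    # An empty dict means the storage-side MTUs agree within each domain.
    print(mtu_mismatches(PORTS))
```

With the values posted here the result is empty, so the ONTAP side looks consistent. This check cannot see the switch ports, which have to be inspected separately.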

SpindleNinja · ‎2019-08-30

Can you also post the following whenever you have a minute? (Forgot this one earlier.)

row 0

net int show

Lumina · ‎2019-08-30

You got it.

AFF8080::> row 0
(rows)

AFF8080::> net int show
(network interface show)
            Logical                Status     Network                 Current    Current Is
Vserver     Interface              Admin/Oper Address/Mask            Node       Port    Home
----------- ---------------------- ---------- ----------------------- ---------- ------- ----
AFF8080
            AFF8080-01_SnapMirror  up/up      10.38.1.2/24            AFF8080-01 e0l     true
            AFF8080-01_mgmt1       up/up      172.16.94.21/24         AFF8080-01 e0b-5   true
            AFF8080-02_SnapMirror  up/up      10.38.1.3/24            AFF8080-02 e0l     true
            AFF8080-02_mgmt1       up/up      172.16.94.23/24         AFF8080-02 e0b-5   true
            cluster_mgmt           up/up      172.16.94.22/24         AFF8080-01 e0b-5   true
AFF8080-SVM1
            AFF8080-01_fc_lif_1    up/up      20:08:00:a0:98:a8:45:0c AFF8080-01 0e      true
            AFF8080-01_fc_lif_2    up/up      20:09:00:a0:98:a8:45:0c AFF8080-01 0f      true
            AFF8080-02_fc_lif_1    up/up      20:0a:00:a0:98:a8:45:0c AFF8080-02 0e      true
            AFF8080-02_fc_lif_2    up/up      20:0b:00:a0:98:a8:45:0c AFF8080-02 0f      true
            AFF8080-SVM1_NFS_10GBe up/up      172.16.94.20/24         AFF8080-01 e0b-5   true
            AFF8080-SVM1_mgmt      up/up      172.16.94.24/24         AFF8080-01 e0b-5   true
            AFF8080-SVM1_nfs_lif1  up/up      172.16.93.27/24         AFF8080-01 e0j     true
            test                   up/up      172.16.94.70/24         AFF8080-02 e0b-5   true
Cluster
            AFF8080-01_clus1       up/up      169.254.212.228/16      AFF8080-01 e0a     true
            AFF8080-01_clus2       up/up      169.254.26.92/16        AFF8080-01 e0c     true
            AFF8080-02_clus1       up/up      169.254.2.56/16         AFF8080-02 e0a     true
            AFF8080-02_clus2       up/up      169.254.189.200/16      AFF8080-02 e0c     true
17 entries were displayed.

AFF8080::>

Thanks!

Trubida · ‎2019-09-01

It almost sounds like a spanning tree issue. What model of switches are you connected to, and what configuration do you have on the ports? Have you tried to create a LIF on node 2 and then try the failover?

Lumina · ‎2019-09-02

@Trubida wrote:

It almost sounds like a spanning tree issue. What model of switches are you connected to, and what configuration do you have on the ports? Have you tried to create a LIF on node 2 and then try the failover?

Unfortunately, networking isn't my specialty. I am working with a vendor to get this all set up, but I have not run this by them yet. I've been doing a lot of troubleshooting on my own since we have a tab going with the vendor. This is my last hurdle, but I can run the spanning tree idea by them.

We are using two Juniper EX4600-40F switches. The other guys set them up, but it appears the port mode is set to trunk, the MTU is 9216, the speed is 10 Gbps, duplex is full, and flow control is enabled.

I just tried creating an NFS LIF on node 2 and I am able to mount the datastore to the host. When I try to clone a test VM to the new datastore, it just kind of freaks out and never finishes and the datastore sometimes shows up as inaccessible. So there must be something up with the port on node 2? All of the port configurations on the switches are the same, but I'm wondering about the physical cabling now. Should port e0b on node 1 and node 2 be connected to the same physical switch? Or does this not matter? I assumed you would connect one node to switch one and the other node to switch 2 for redundancy.

I can try to create a new LIF again later with port e0d this time and see what the results are.

UPDATE: I tested the LIF connections and handoff from e0d on node 1 and 2 and all worked flawlessly. It's only when I use port e0b on node 2 that things get weird. So I'm assuming that the scope of this issue can be reduced to the link between the NetApp port e0b on node 2 and the Ethernet switch it is connected to. The host side of the switches should be OK since the other ports fail over correctly. Is this right?

UPDATE #2: We can cross this off the list. I checked the switch port configuration for the affected e0b port and noticed that the MTU was set to 1514 instead of 9216, so it must have been a typo. I changed it to 9216 and everything works as it should.

Thanks again for all of your input! You got me in the right direction.
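
The fix above also explains the earlier symptoms: pings and the NFS mount worked through node 2, while the clone hung. A small sketch of the arithmetic follows. It assumes the switch MTU counts the 14-byte Ethernet header (and, on a trunk port, the 4-byte 802.1Q tag) but not the FCS; whether the tag counts varies by platform, and the conclusion is the same for these numbers either way. The function names are hypothetical. As a side note, 1514 is a common default on switches that include the Ethernet header in the MTU, so the port may simply have been left unconfigured; the thread does not confirm this.

```python
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType
VLAN_TAG = 4          # 802.1Q tag carried on a trunk port


def frame_size(ip_packet_bytes: int, tagged: bool = True) -> int:
    """On-wire frame size (without FCS) for an IP packet of the given size."""
    return ip_packet_bytes + ETHERNET_HEADER + (VLAN_TAG if tagged else 0)


def passes(ip_packet_bytes: int, switch_port_mtu: int, tagged: bool = True) -> bool:
    """True if the frame fits within the switch port's configured MTU."""
    return frame_size(ip_packet_bytes, tagged) <= switch_port_mtu


if __name__ == "__main__":
    # 84 bytes = 56-byte ping payload + 8 ICMP + 20 IPv4 (a typical small ping).
    for label, size in [("small ping", 84), ("full 9000-byte packet", 9000)]:
        for port_mtu in (1514, 9216):
            print(f"{label} via port MTU {port_mtu}: {passes(size, port_mtu)}")
```

Small packets fit through the 1514 port, so connectivity tests and the mount succeed; full-size 9000-byte packets do not, which is consistent with bulk I/O stalling only when the LIF sat on node 2's e0b.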

SpindleNinja · ‎2019-09-03

I assumed you would connect one node to switch one and the other node to switch 2 for redundancy.

Yes, you do, provided the VLAN is accessible on both switches.

Make sure that the e0d ports are in the same broadcast domain and that the LIFs can fail over and migrate freely between them. If e0b is being "weird", I would not even give the LIFs a chance to fail over to those ports.
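
As a rough illustration of the broadcast-domain point, the sketch below checks whether two ports share a named broadcast domain. The membership table is copied from the `net port show` output posted earlier in the thread, where e0d-5 on both nodes shows "-" (represented here as None). This is a simplified model with hypothetical names; it does not replicate how ONTAP actually evaluates failover groups.

```python
# Broadcast-domain membership from the earlier `net port show` output.
# None stands for the "-" shown for ports that are not in any domain.
PORT_DOMAIN = {
    ("AFF8080-01", "e0b-5"): "10GBe_Datacenter_VLAN_5",
    ("AFF8080-02", "e0b-5"): "10GBe_Datacenter_VLAN_5",
    ("AFF8080-01", "e0d-5"): None,
    ("AFF8080-02", "e0d-5"): None,
}


def share_broadcast_domain(port_a, port_b, port_domain=PORT_DOMAIN):
    """True only if both ports belong to the same named broadcast domain."""
    dom_a = port_domain.get(port_a)
    dom_b = port_domain.get(port_b)
    return dom_a is not None and dom_a == dom_b


if __name__ == "__main__":
    print(share_broadcast_domain(("AFF8080-01", "e0b-5"), ("AFF8080-02", "e0b-5")))
    print(share_broadcast_domain(("AFF8080-01", "e0d-5"), ("AFF8080-02", "e0d-5")))
```

Against the output as originally posted, the e0b-5 pair shares a domain and the e0d-5 pair does not, so the e0d-5 ports would need to be added to a broadcast domain before LIFs could rely on them for failover.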

SpindleNinja · ‎2019-09-01

I'm assuming these are your NFS datastore lifs?

AFF8080-SVM1_NFS_10GBe up/up 172.16.94.20/24 AFF8080-01 e0b-5 true
test up/up 172.16.94.70/24 AFF8080-02 e0b-5 true

what's this one for?

AFF8080-SVM1_nfs_lif1 up/up 172.16.93.27/24 AFF8080-01 e0j true

Why do the cluster_mgmt and the node mgmt share the same vlan as the NFS datastores?

Lumina · ‎2019-09-02

@SpindleNinja wrote:

I'm assuming these are your NFS datastore lifs?

AFF8080-SVM1_NFS_10GBe up/up 172.16.94.20/24 AFF8080-01 e0b-5 true
test up/up 172.16.94.70/24 AFF8080-02 e0b-5 true

what's this one for?

AFF8080-SVM1_nfs_lif1 up/up 172.16.93.27/24 AFF8080-01 e0j true

Why do the cluster_mgmt and the node mgmt share the same vlan as the NFS datastores?

That is correct. AFF8080-SVM1_NFS_10GBe is the new datastore LIF.

The AFF8080-SVM1_nfs_lif1 is our old 1 gig NFS file share. This will be retired at some point but I don't want to re-IP the file share on each VM until the new networking is solid.

They are sharing the same VLAN for the sake of simplicity. We had a flat network before, but I didn't want to go too crazy with segmentation just yet.
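
To make the shared-network point concrete, here is a short sketch that groups a few of the LIF addresses from the `net int show` output by IP subnet, using Python's standard `ipaddress` module. Sharing a subnet is only a proxy for sharing a VLAN, and `group_by_network` is a hypothetical helper written for this illustration.

```python
import ipaddress

# LIF name -> address/mask, copied from the `net int show` output in the thread.
INTERFACES = {
    "AFF8080-SVM1_NFS_10GBe": "172.16.94.20/24",
    "test": "172.16.94.70/24",
    "cluster_mgmt": "172.16.94.22/24",
    "AFF8080-01_mgmt1": "172.16.94.21/24",
    "AFF8080-SVM1_nfs_lif1": "172.16.93.27/24",
}


def group_by_network(interfaces):
    """Group interface names by the IP network their address belongs to."""
    groups = {}
    for name, addr in interfaces.items():
        network = str(ipaddress.ip_interface(addr).network)
        groups.setdefault(network, []).append(name)
    return groups


if __name__ == "__main__":
    for network, names in group_by_network(INTERFACES).items():
        print(network, names)
```

The datastore LIFs and the management LIFs land in 172.16.94.0/24 together, while the old 1 Gb file-share LIF sits alone in 172.16.93.0/24, which matches the description above.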