ONTAP Discussions

Direct NFS for multiple ESXi

AllanHedegaard

I would like the community's advice on the correct way to connect two ESXi servers to an AFF using SFP+ without an L2 switch. The goal is to have the same NFS datastore mounted on both servers.

 

Direct connection from each server NIC to a port on each controller. This is to provide redundancy in case of link or controller failure.

 

Is one subnet possible across multiple ports, so the same mount IP can be used from different hosts, or should the ESXi hosts file be used to allow different mount IPs under the same DNS name?

 

I really can't be the first to encounter this scenario and would like to know the recommended path.

 

Thanks

 

 

1 ACCEPTED SOLUTION

AllanHedegaard

Then assigning a failover adapter in VMware would be the best way to go. It will not provide load balancing, but it gives failover capability if the actual link fails. Such an event would also move the LIF to the secondary port. Correct?


27 REPLIES

aborzenkov

NFS high availability is based on failing over the IP address to a different physical port. This requires an L2 switch and won't work with a direct connection.

AllanHedegaard

Thanks for your reply. I am not sure what exactly you mean by NFS high availability.

 

I am just talking about exposing a LIF's home and failover ports to each server. ESXi supports beacon probing. I am running a similar setup today without switches.

 

For smaller setups a 10G switch is not necessarily required; only the number of physical ports sets the limit.

aborzenkov

ESXi beacon probing relies on L2 connectivity between all physical ports in the NIC team, which you do not have here. I am not aware of any automatic mechanism to detect a LIF move and redirect traffic to another port (in the direct-attach case).

AllanHedegaard

I am not sure the actual NFS version should matter in this case, as multipathing is not supported. So I would go with v3 for simplicity.

 

Consider one server:

 

One data LIF with home port a1 and failover port b1 could be connected directly to the server NIC. If a1 fails, the LIF will migrate to b1.

 

I would just use multiple LIF pairs for multiple servers.
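The LIF-pair idea described here could be sketched in the ONTAP CLI roughly as follows (all node, port, SVM, and address names are hypothetical, and exact syntax varies by ONTAP release):

```shell
# Failover group containing the two directly cabled ports (one per controller)
network interface failover-groups create -vserver svm1 -failover-group fg_esx1 \
    -targets node-a:e0c,node-b:e0c

# Data LIF for host 1: home on node-a e0c, may fail over to node-b e0c
network interface create -vserver svm1 -lif nfs_esx1 -role data \
    -data-protocol nfs -home-node node-a -home-port e0c \
    -address 192.168.10.11 -netmask 255.255.255.0 \
    -failover-group fg_esx1 -auto-revert true
```

A second LIF pair for the second server would use the same pattern on another pair of directly cabled ports.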

Just had a thought: why not create an LACP bond between the two pairs of ports? Would that be possible?

 

I know VMware supports port aggregation. Can I span such a 'port channel' across both NetApp controllers? I am not looking for active/active, just failover capability.

aborzenkov

@AllanHedegaard wrote:

Can I span such a 'port-channel' to both Netapp controllers?


No. An ifgrp can include only ports on the same controller.
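For reference, an ifgrp is always created against a single node; there is no parameter that would let it span two controllers (a sketch, node and port names assumed):

```shell
# Each ifgrp belongs to exactly one node (-node takes a single controller)
network port ifgrp create -node node-a -ifgrp a0a -distr-func ip -mode multimode_lacp
network port ifgrp add-port -node node-a -ifgrp a0a -port e0c
network port ifgrp add-port -node node-a -ifgrp a0a -port e0d
```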

AllanHedegaard


I can confirm that this works. A lot of money saved on Nexus switches, and no drawbacks from my point of view.

walterr

Could you please elaborate more on how you configured this?

 

How is it possible that the two ESX servers can connect to the same NFS export with the same IP if you do not have a layer-2 switch in between? Did you configure different NFS target IPs for the ESX hosts? Can ESX connect to the same datastores via different NFS IPs?

 

How did you configure the LIFs on the NetApp side and how did you configure the kernel ports on ESX?

AllanHedegaard

Just use the hosts file on the ESXi hosts. That way you can mount the same NFS share using the same hostname, but with different IPs on the SVM. Works fine 🙂
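A minimal sketch of this hosts-file approach, assuming hypothetical LIF addresses and an SVM hostname of svm1-nfs:

```shell
# On ESX1 (cabled to the LIF at 192.168.10.11), /etc/hosts contains:
#   192.168.10.11  svm1-nfs
# On ESX2 (cabled to the LIF at 192.168.10.12), /etc/hosts contains:
#   192.168.10.12  svm1-nfs

# Both hosts then mount the same export under the same datastore name:
esxcli storage nfs add -H svm1-nfs -s /vol_datastore1 -v datastore1
```

Because both hosts mount by the same hostname and see the same export path, vSphere treats it as the same datastore even though each host reaches it over a different directly attached link.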

It works, but I think it's an unsupported configuration. Beware: NetApp support may not help if you have any issues 😉

Jonathan Colón | Blog | Linkedin

I have had it working fine on 5+ installations for more than 18 months now. We are very happy with the low latency and the cost savings on the 10GbE switch 🙂

walterr

Still not sure what you mean.

Following example:

Let's say I have a LIF with IP1 on e0c and another LIF with IP2 on e0e, both exporting the same datastore.

One port from ESX1 is directly connected to e0c, and one port from ESX2 is directly connected to e0e.

So how can you configure ESX1 to connect to IP1 and ESX2 to IP2 if this is an ESX cluster? If I connect ESX1 to IP1, doesn't ESX2 also automatically connect to IP1?

AllanHedegaard

By using the hosts file on ESXi, you don't mount using the IP but using the hostname of the SVM.

walterr

How exactly did you configure the failover of the ESX kernel port? I tried this configuration now, and it is working, but I have the problem that during a NetApp storage takeover the ESX loses its connection to the IP. Apparently it keeps trying to connect through the kernel port that is connected to the controller which has rebooted and is waiting for giveback. After the giveback everything works again.

AllanHedegaard

Assign an active and a passive physical port to the vSwitch in ESXi. If the active port goes down, it will switch to the passive one.

 

Of course this is not bulletproof, as there are different states of 'going down'; this only covers loss of the physical link.
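The active/standby uplink setup described here can be configured from the ESXi shell roughly like this (vSwitch and vmnic names are assumptions):

```shell
# vmnic2 active, vmnic3 standby on the vSwitch carrying the NFS vmkernel port
esxcli network vswitch standard policy failover set \
    -v vSwitch1 -a vmnic2 -s vmnic3
```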

 

But to be honest, for me it doesn't matter: I have 6 ESXi hosts connected and simply use the failover capability of VMware HA. So far it has never been needed.

 

Looking at it from a risk perspective, it is more likely that a volume runs full and stops I/O.

walterr

The thing is that during a storage failover the active ports on controller 1 physically go down only briefly; after controller 1 boots and reaches the waiting-for-giveback state, the physical port comes up again, but the IPs are still on controller 2. So how would the failover kernel ports on ESX recognize this?

AllanHedegaard

Auto-revert (auto-home) the LIF of the SVM?
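Auto-revert is a per-LIF setting; a sketch of enabling and checking it (SVM and LIF names assumed):

```shell
# Send the LIF back to its home port as soon as that port is healthy again
network interface modify -vserver svm1 -lif nfs_esx1 -auto-revert true

# Compare the LIF's current location with its home
network interface show -vserver svm1 -lif nfs_esx1 \
    -fields home-node,home-port,curr-node,curr-port,is-home
```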

walterr

Yes, of course I have configured the LIF with auto-revert. Failback is working, but failover will not work, by design.

 

Following example:

ESX with nic1 and nic2; nic1 connected to controller A e0c, nic2 connected to controller B e0c. Broadcast domain and failover groups are properly configured with both e0c ports. When you do a storage failover, or even a simple LIF migrate to the other controller, ESX will lose access to the IP.
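In a layout like this, the failover targets and broadcast-domain membership can be checked from the ONTAP CLI (SVM and LIF names assumed):

```shell
# List the ports this LIF is allowed to fail over to
network interface show -vserver svm1 -lif nfs_esx1 -failover

# Confirm both e0c ports are in the same broadcast domain
network port broadcast-domain show
```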

 

This is what happens during a LIF migrate from controller A e0c to controller B e0c:

- A e0c is still physically connected to ESX nic1, although the IP is now on B e0c

- ESX does not recognize the LIF migration

- ESX still tries to access the IP via A e0c and hence loses access to the datastore

 

This is what happens during a controller failover from controller A to controller B:

- aggregates and LIFs are migrated to controller B

- controller A is rebooted

- the A e0c link goes offline and ESX temporarily switches to nic2, but only for a short time

- controller A continues booting, brings A e0c online again, and goes to waiting for giveback

- from now on ESX switches back to nic1 and cannot access the IP, since it is hosted on B e0c

- until you manually do a giveback; then everything works again

 

Conclusion: ESX does not recognize a storage failover or a LIF migration. You would have to physically disconnect the cable connected to A e0c for ESX to notice, which is not feasible in a production environment.

AllanHedegaard

Agreed, in some scenarios the failover is not optimal. In my case it works well enough for operation.
