VMware ESXi 5.5 living through a cluster failover...

orametrix · ‎2017-02-24

Is there anything that needs to be done to the ESXi 5.5 (update 2) configuration to help it live through a storage failover? We're using NFS to access our datastores. ESXi 5.5 U2 only supports NFS v3.

We've had some storage instability (lucky me, I'm living through an ONTAP bug that panics a filer node) and my virtual machines haven't always lived through the cluster failover unscathed. I'm wondering if there are some timeouts that need tweaking, etc. In VSC (6.2P2), the host setting status is all green for NFS, MPIO and Adapter settings. I read about guest OS SCSI timeout settings, but the documentation for VSC says these aren't necessary if your datastores are riding on NFS.

Suggestions? Yes, I know I need to upgrade my ONTAP so it stops barfing. The answer to this question also helps me assess how much I need to shut down my environment for the ONTAP upgrade. I know they say "non-disruptive", but I don't consider anything that's going to terminate all of the CIFS sessions to be "non-disruptive". I'm trying to decide if I need to shut down all of my VMs as well to be sure nothing gets corrupted.

Thanks

Pat

GidonMarcus · ‎2017-02-25

Hi

You haven't specified the version and flavour of ONTAP.

Let's start from the end. For CIFS, if your clients use SMB 2 or 3, durable handles will make the CIFS drop seamless for most applications. I have some heavy-duty CIFS workloads from 2000 users and around 10 heavy-workload 24x7 applications, and I can't remember anyone complaining, in either 7-Mode or C-Mode. Yes, the alarming notifications about session drops are everywhere, but this is more a technical detail than a real-world issue (other protocols recover at the protocol level, while CIFS requires the layers above to do so).

Same for VMs: this should be non-disruptive. As for corruption, taking a snapshot and a backup beforehand is always good practice, but I would not go to the extent of taking stuff down. You paid good money to have an always-on system. I suggest making use of it 🙂

About the main question: as you mentioned, VSC already sets the tweaks NetApp recommends. It's still a good idea to review the latest best practices, mainly around the network layout (as that does change from version to version):

https://kb.netapp.com/support/s/article/ka31A000000137lQAA/How-to-configure-VMware-vSphere-6-x-on-Data-ONTAP-8-x

https://kb.netapp.com/support/s/article/ka31A000000137bQAA/how-to-configure-vmware-vsphere-5-x-for-data-ontap-7-3-and-8-x

The above should have been set by VSC, as documented here:

https://kb.netapp.com/support/s/article/ka21A0000000dGLQAY/what-settings-made-by-vsc-for-vmware-require-an-esx-host-reboot-to-take-effect

TRs

https://www.netapp.com/us/media/tr-4333.pdf

https://www.netapp.com/us/media/tr-4067.pdf

https://www.netapp.com/us/media/tr-4068.pdf

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

orametrix · ‎2017-02-27

Hi Gidi,

Thanks for the information. I'll do some more reading.

Our FAS2650 is running 9.1 RC2 (it came from the factory that way). The CIFS bug that's killing us is fixed in 9.1 GA. However, 9.1 GA has another bad CIFS bug, so they're recommending 9.1 P1, which probably won't be available until later this week. I also don't know how I feel about being one of the "pioneers" with any new release of software, even if it's a patch release.

During our recent panic experience, we had a few things happen that led me to these questions. GEEK ALERT: I was in a very "Star Wars" mood when we received the new filer. Our 2 nodes are named k-2so and chirrut. Chirrut does mostly CIFS traffic, so when the poisoned CIFS request comes in, he's the one that panics. Strangely enough, we had a VMware datastore (riding on NFS on k-2so) go offline. K-2so wasn't the victim of the panic, so this caught my attention. The datastore didn't stay offline, but it was enough for the VMs to notice (and it forced their root volumes to remount as read-only). We had iSCSI LUNs on k-2so "wink" offline enough that Microsoft SQL didn't consider the databases healthy.

That's why I'm getting nervous about how "non-disruptive" that upgrade is really going to be.

GidonMarcus · ‎2017-02-27

Hi

I can't get a 1:1 view of the system from the input you gave, but a few points:

* Yes, SQL is very likely not to like any outage. As you probably know, it has a DB and a log, and if it managed to write something to the log before a write was completely committed to the DB, it will complain (not about corruption per se, but about some sort of mismatch; if you see real corruption, something is really wrong).

* Why did the datastore go read-only? That's a very good question for NetApp and VMware (from experience, VMware is very likely to push back as long as you're on any RC version, as it's not in their HCL). But maybe give us a snippet from the vmkernel.log so we can have a look, plus some view of the network, vol and aggr locations. You believe they should all be independent of the CIFS node, but because the cluster is a "cluster", some funny config might cause a CIFS workload to land on one node while you expect it on the other, and vice versa.

* Related to the above: could it be that during one panic, or before the LIF failed back, you had another panic on the other node? Has k-2so stayed online since the deployment (command "system node show")?

* I tried to look here https://mysupport.netapp.com/NOW/cgi-bin/relcmp.on?notfirst=Go!&rrel=9.1RC1&rels=9.1&what=fix as I was wondering which bug exactly you were hitting, but I'm not sure I found the right one (is it the last one?)
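For pulling a snippet out of vmkernel.log, a rough Python sketch like the one below may help. It assumes the log has already been copied off the host to a local file, and the default keyword ("NFS") is only a guess at what is relevant; `grep_lines` is a made-up helper name, not part of any VMware or NetApp tool.

```python
def grep_lines(lines, keywords=("NFS",)):
    """Return the lines that contain any keyword (case-insensitive).

    `lines` is any iterable of strings, e.g. an open file object.
    The default keyword is an assumption; adjust it to what you are chasing.
    """
    wanted = [k.lower() for k in keywords]
    return [ln.rstrip("\n") for ln in lines if any(k in ln.lower() for k in wanted)]
```

To use it, open the local copy of the log and pass the file object in, for example `grep_lines(open("vmkernel.log"), ("NFS", "APD"))`, then paste only the lines around the time of the panic.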

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

orametrix · ‎2017-02-27

Hi Gidi,

Thanks for the input.

We didn't corrupt a database. But we had to restart SQL to get it going again, so I had an application down.

Ubuntu (maybe other Linux variants too) has a mount option "errors=remount-ro". For those VMs, they saw something wrong with the disk and remounted ro (as they were configured to do). Again, not a corruption but I had another application down when that happened.
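To spot which guests were bitten after a failover, something like the following Python sketch could be run inside each Linux VM. It assumes the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options) and only looks at ext2/3/4 filesystems; `read_only_mounts` is a hypothetical helper name.

```python
def read_only_mounts(mounts_text, fstypes=("ext2", "ext3", "ext4")):
    """Return mount points that are currently mounted read-only.

    `mounts_text` is the content of /proc/mounts. A filesystem counts as
    read-only when the bare option "ro" appears in its option list; the
    option "errors=remount-ro" on its own does not match, because it only
    describes what the kernel will do when it hits an error.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mountpoint)
    return hits
```

Feeding it the text of /proc/mounts, for example `read_only_mounts(open("/proc/mounts").read())`, gives a list of mount points that need a remount or a reboot. An empty list means none of the checked filesystems is read-only at that moment.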

During at least one of the panics, chirrut went down and k-2so took over like he was supposed to. Even though that happened, these minor disruptions happened to VMs that were served by k-2so. We had 4 panics during Thursday and Friday of last week and I will admit they're starting to merge together in my memory.

I am holding out a little bit of hope that a controlled takeover during an upgrade might be a bit more graceful than recovering from a panic.

Our bug is 1056085. We noticed that the panics started right after we moved some volumes to the new filer, so I moved them back to our old filer over the weekend. So far, we've been stable today (fingers crossed). This buys us some time so we don't have to rush into an ONTAP upgrade.

SYNTAXERROR · ‎2017-02-26

Hi Pat

I've done a lot of upgrades and it was never a problem unless there was a bug in the new version.

Which version are you currently running on your (cDOT or 7-Mode) system, and which version do you want to go to?

Please check these things before you do the upgrade:

- Broadcast-Domains and Failover-groups if this is a cDOT system

- /etc/rc if this is a 7-Mode system (double-check it with System Manager)

- Switch configuration (VLANs, Trunks and so on)

- Interface groups if you use them

Also use the Upgrade Advisor to check that you don't miss anything.

If everything seems fine you should be able to do the upgrade without any problems.

If you have a cDOT system, you can migrate the interfaces to another node in the cluster to check that the VLANs exist on the switch and that the switch configuration is all right. If not, you can move the interfaces back to the source node. You have about 3 minutes until NFS fails, so there's enough time to test it.

Kind regards

Dario

orametrix · ‎2017-02-27

Hi Dario,

Thanks for responding.

We're running 9.1 RC2 (cDOT) on a FAS2650. The filer came from the factory with RC2. The bug that's killing us is fixed in 9.1 GA. However, there's a caution about 9.1 GA on the download pages that suggests waiting for 9.1 P1.

I've run the upgrade advisor for 9.1 GA and I'm working through the list.

I'm feeling pretty good about my switch configuration. The 4 NICs on each node are configured into an interface group that's trunked, so VLANs aren't an issue. Through our panics last week, LIFs have moved around and still worked.

Pat