Cluster failover impact

saranraj456 · ‎2012-02-26

Hi friends,

During a cluster failover, some of the VMs, LPARs, and Windows hosts using NetApp LUNs became inaccessible. What could be the reason for this?

Is a cluster failover expected to have any impact?

Thanks,

Saran

aborzenkov · ‎2012-02-26

When you say “cluster failover”, do you mean NetApp takeover/giveback or host clustering?

For NetApp, the usual reason is improper timeout configuration in the host’s drivers. Another possibility is an incorrect connection (e.g. no path via the partner).

saranraj456 · ‎2012-02-26

I mean CF takeover.

Perhaps it was due to the timeout configuration in the host driver, since it happened on only a few hosts.

columbus_admin · ‎2012-02-27

We are running 3160s for our ESX guest images, RDMs, and NAS storage. We had a bad CNA card that had to be replaced, and the failover did not impact any of the services, including ESX guests with RDMs and NAS storage. The NAS side had a couple of timeouts, but none of the LUNs reported any issues.

Drivers and settings are always important, but the HBA's drivers and settings are extremely important. We had issues with the VMs at one point when their default settings were too low; they would get timeout errors pretty regularly when accessing their RDMs. So our usual steps:

1.) What is different between the working and affected hosts?

2.) Driver, settings, and config checks

3.) Best-practice checks for the equipment involved

4.) Multipathing software checks

5.) And my favorite... log scrubbing on everything to see what each piece reported (can be tedious and time consuming)

- Scott
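
For step 1, comparing hosts by hand gets tedious once there are more than a few of them. Below is a minimal sketch of how the comparison could be scripted. It assumes you have already collected each host's settings into a dictionary by whatever means your platform offers. The host names, setting names, and values are hypothetical placeholders, not recommended values, and `find_outliers` is an illustrative helper, not part of any NetApp or VMware tool.

```python
from collections import Counter


def find_outliers(hosts):
    """Return {setting: {host: value}} for hosts that deviate from the
    most common value of each setting. A missing setting counts as None."""
    settings = sorted({key for cfg in hosts.values() for key in cfg})
    report = {}
    for setting in settings:
        values = {host: cfg.get(setting) for host, cfg in hosts.items()}
        # The most common value is treated as the baseline for this setting.
        majority, _ = Counter(values.values()).most_common(1)[0]
        deviants = {h: v for h, v in values.items() if v != majority}
        if deviants:
            report[setting] = deviants
    return report


# Hypothetical inventory: names and values are placeholders only.
inventory = {
    "esx01": {"hba_driver": "1.2.3", "mpio_policy": "round-robin", "disk_timeout_s": 60},
    "esx02": {"hba_driver": "1.2.3", "mpio_policy": "round-robin", "disk_timeout_s": 60},
    "esx03": {"hba_driver": "1.1.0", "mpio_policy": "round-robin", "disk_timeout_s": 30},
}

for setting, deviants in find_outliers(inventory).items():
    print(setting, deviants)
```

Any host that shows up in the report is a candidate for closer inspection. Note that the majority value is only a baseline for spotting differences; it is not necessarily the correct value, so check it against the vendor's best practices.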

saranraj456 · ‎2012-02-29

Hi Scott,

Regarding point 1: do we need to check for differences on the host end or the storage end?

Please clarify.

Thanks,

Saran

columbus_admin · ‎2012-02-29

There really isn't anything to configure on the filer end in a cluster, so check the host end: your hosts should all be running the same version of the HBA driver and have the same MPIO settings, the same firmware settings, etc.

- Scott
