Lost LUN after dirty shut-down of a FAS

chalupskys · ‎2011-10-25

Pop quiz: A customer lost power to one of their FAS controllers, which is part of an active/active pair, and with that, the LUNs attached to their Hyper-V server went away. The environment is such that nothing was lost, and we are able to troubleshoot why it happened. That said, I performed a cf takeover (then cf giveback) to see if the same thing happened on a graceful hand-off. The results were as one would expect when things are set up properly: the LUN remained online and visible to the server throughout the activity.

So my question: why did they lose connectivity to their LUN on a dirty shut-down? I am going to be troubleshooting onsite tomorrow, so any suggestions would be welcome.

radek_kubka · ‎2011-10-25

Hi,

Nice one - I like it!

So, let's start with this question: is it a relatively modern version of ONTAP, or an ancient one? Is FCP running in single_image cfmode? If that's the case, then this simply should not happen - a controller failure is equivalent to a path failover (if everything is set up properly).

Regards,

Radek

chalupskys · ‎2011-10-25

Radek,

Let's just say, they were "testing a theory!" 🙂

So the FAS controllers are running 8.0.2 7-Mode. It was my understanding that single_image was the default mode in 8.x+. I'll double-check tomorrow. As for the protocol, it is iSCSI over 10GbE.

Regards,

Steve

radek_kubka · ‎2011-10-25

My bad - I automatically assumed LUN equals FCP!

Single_image cfmode is FCP specific & irrelevant for iSCSI (and it's actually default since 7.2).

What are the disk timeouts on the hosts? An unplanned takeover will presumably take longer than a clean one - this may explain the difference in observed behaviour.
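
A minimal sketch of that check on a Windows host, assuming Python is available there. It reads the standard Windows disk timeout (the `TimeOutValue` registry value under `HKLM\SYSTEM\CurrentControlSet\Services\Disk`, in seconds) and compares it with a takeover duration. The function names and the durations in the usage note are hypothetical and only illustrate the comparison; the real takeover times have to be measured on the system itself.

```python
import sys

# Registry location of the Windows disk I/O timeout (value is in seconds).
DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
DISK_VALUE = "TimeOutValue"


def read_disk_timeout():
    """Return the disk timeout in seconds, or None if unset or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, DISK_VALUE)
            return int(value)
    except FileNotFoundError:
        # Key or value is missing, so no explicit timeout has been configured.
        return None


def survives_takeover(disk_timeout_s, takeover_s):
    """True if the host keeps retrying I/O for longer than the takeover lasts."""
    return disk_timeout_s > takeover_s


if __name__ == "__main__":
    timeout = read_disk_timeout()
    print("Disk TimeOutValue (s):", timeout)
```

For example, with a made-up 60-second disk timeout, `survives_takeover(60, 30)` would hold for a quick planned takeover, while `survives_takeover(60, 90)` would not for a slower unplanned one, which would match a LUN that stays up on cf takeover but drops on a power loss.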

aborzenkov · ‎2011-11-06

As this is iSCSI and there was a power outage - any chance that the switch providing LUN connectivity was powered off as well?

pascalduk · ‎2011-10-25

You did not mention whether a cluster failover took place because of the power outage. Did you lose only the controller, or also the disk shelves, during the power outage?

chalupskys · ‎2011-10-26

@Radek, I'll have to check the timeout settings.

@Pascal, a failover did take place. A controller power outage was all that occurred. The shelves stayed powered and disk multipath is set up properly.

wes_beckum · ‎2011-11-04

My guess would be that the LUN lost its mapping when it lost power. Just a guess.

Wes

rorzmcgauze · ‎2011-11-14

What else did they do during the outage?

It seems strange, I'll give you that. Was iSCSI licensed on both controllers, and patched correctly? Did they test CFO during the initial setup of this Hyper-V environment? Do they have the ONTAP DSM and SnapDrive installed? How many LUNs and how many Hyper-V hosts are involved?

Did you grab the logs and ASUPs from this event? Did they help troubleshoot this?