Solved: iSCSI failover time with MPxIO

UNIXSYSV · ‎2019-10-28

Hi,

What kind of failover times should we expect when using a Solaris 11 host with iSCSI against an ONTAP cluster? During a test we had two active online paths and filtered out all traffic to and from one ONTAP node at a time. Writes then stalled for over a minute before the other online path was used. Is MPxIO on Solaris expected to handle the failure, or should the initiator fail over before that when we have more than one session?

Thanks

TMACMD · ‎2019-10-29

Have you checked your config against the Support Matrix (IMT) at http://support.netapp.com/matrix ?

It looks like Solaris 11.2 and 11.3 do not fully support MPxIO, and you should be on some version of 11.4.

paul_stejskal · ‎2019-10-29

Also, if we have any kind of host-side software (HUK, DSM, etc.), usually it just sets timeouts. It really depends on your specific OS and what it is running. I would get to a supported configuration, and if you really want to take a closer look, it's best to open a support case. iSCSI failover should not take that long, which tells me that the host possibly isn't configured right, rather than the storage.

TMACMD · ‎2019-10-29

According to the IMT, MPxIO will not work correctly on versions prior to 11.4. That's why I asked.

It appears to be an issue on the Solaris side and nothing to do with the storage as you suspected.

paul_stejskal · ‎2019-10-29

Yes I saw. Thank you for the research.

UNIXSYSV · ‎2019-10-29

Thanks for the replies. I tweaked most of what I could on the Solaris side: Host Utilities settings applied, initiator timeout changed, etc., but there are not that many things that can be changed. I tested both with a ZFS pool and with just a format inquiry, which also hung for a long time while MPxIO was figuring things out.

But it should be faster than this. How fast should it be? Should the initiator handle it first, with MPxIO then taking the path offline once it has determined the path to be non-functional?
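
To put a number on the stall rather than eyeballing it, one option is to timestamp synced writes while a node is being filtered out. The following is only a minimal sketch, assuming Python is available on the host. The function names, the file path you pass in, the 4 KiB block size and the one-second threshold are all hypothetical choices, not anything prescribed by Solaris or ONTAP.

```python
import os
import time


def find_stalls(timestamps, threshold_s=1.0):
    """Return (start, gap) pairs for every gap between consecutive
    completion timestamps that is longer than threshold_s."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold_s:
            stalls.append((prev, gap))
    return stalls


def timed_writes(path, duration_s=120.0, block=b"x" * 4096):
    """Write and fsync one block at a time to `path` for roughly
    duration_s seconds. Returns the monotonic time at which each
    synced write completed (the first entry is the start time)."""
    stamps = [time.monotonic()]
    deadline = stamps[0] + duration_s
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while time.monotonic() < deadline:
            os.write(fd, block)
            os.fsync(fd)  # block until the write has reached the device
            stamps.append(time.monotonic())
    finally:
        os.close(fd)
    return stamps


def report(stamps, threshold_s=1.0):
    """Format each detected stall as a human-readable line."""
    return [
        f"stall of {gap:.1f}s starting {start - stamps[0]:.1f}s into the run"
        for start, gap in find_stalls(stamps, threshold_s)
    ]
```

Running `timed_writes()` against a file on the pool that sits on the iSCSI LUN, blocking one node partway through, and then printing `report(stamps)` gives the length of each write stall. That makes it easier to compare runs before and after changing a timeout.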

paul_stejskal · ‎2019-10-29

We cannot comment on specifics, but it should be robust enough I'd imagine.

Like I said before, a case might be better, but at this point it would be good to have both vendors (Solaris and NetApp) engaged in a formal case. We are 95% sure this is all on the Solaris side, but it would be good to review the storage logs and rule out any storage issues.

UNIXSYSV · ‎2019-10-29

OK, this is a Solaris 11.4 host that was also patched to the latest SRU (or perhaps the one from the month before).