ONTAP Hardware

ESXi and Multipathing

Mako77
6,818 Views

Hi,

 

I have an ESXi 6.5 environment connecting over FC to a two-node NetApp FAS2552. Each node has redundant paths through the switches, so it is a dual-fabric topology. My question is about a path failure on a single node. ESXi is sending I/O to the node that owns the aggr, so it is using the optimised path, and then one of the fibres to that node fails. What we see is that the VMs stop for around 30 secs before continuing. I would not expect this for a path failure on the node, as there is another path and the controller hasn't failed. Am I wrong? I assumed the timeout would apply to node failures and handovers, but that for paths on the same node I/O would simply re-route very quickly, as the disk is still accessible through another route. Like in a TCP network with modern 'spanning-tree'.

 

Thanks


Ed

1 ACCEPTED SOLUTION

Jeff_Yao
6,744 Views

hi

 

For FC, the paths to both sides of the filer are actually active at all times, but because of ALUA the host sticks with the optimised path for I/O. If the optimised path fails, the host needs some time to realise that the path has failed before it moves to the other path. The normal timeout value should be 60s, so what you are seeing is fine.
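To make the path states concrete, here is a minimal Python sketch of the selection rule described above. It is a toy model, not the actual ESXi multipathing code: the path names and the `usable_paths` and `fail` helpers are made up for illustration, and it says nothing about how long the host takes to notice a dead path.

```python
from dataclasses import dataclass
from enum import Enum


class AluaState(Enum):
    ACTIVE_OPTIMIZED = "active/optimized"
    ACTIVE_NON_OPTIMIZED = "active/non-optimized"
    DEAD = "dead"


@dataclass
class Path:
    name: str          # hypothetical label: fabric plus target node
    state: AluaState


def usable_paths(paths):
    """Paths the host would send I/O down in this simplified model.

    Optimized paths are preferred; non-optimized paths are used only
    when no optimized path is left.
    """
    optimized = [p for p in paths if p.state is AluaState.ACTIVE_OPTIMIZED]
    if optimized:
        return optimized
    return [p for p in paths if p.state is AluaState.ACTIVE_NON_OPTIMIZED]


def fail(paths, name):
    """Return a copy of the path list with the named path marked dead."""
    return [Path(p.name, AluaState.DEAD) if p.name == name else p for p in paths]


# Assumed layout: node1 owns the aggr, two fabrics, one path per fabric per node.
paths = [
    Path("fabricA-node1", AluaState.ACTIVE_OPTIMIZED),
    Path("fabricB-node1", AluaState.ACTIVE_OPTIMIZED),
    Path("fabricA-node2", AluaState.ACTIVE_NON_OPTIMIZED),
    Path("fabricB-node2", AluaState.ACTIVE_NON_OPTIMIZED),
]

after_one = fail(paths, "fabricA-node1")
after_both = fail(after_one, "fabricB-node1")

print([p.name for p in usable_paths(paths)])
print([p.name for p in usable_paths(after_one)])
print([p.name for p in usable_paths(after_both)])
```

In this model, losing one optimized path still leaves an optimized path to the same node, and the non-optimized paths only come into play once both are gone. Any pause comes from the host detecting the failure, which the sketch does not model.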

below is the kb about ALUA for your info

https://kb.netapp.com/support/s/article/ka21A0000000d32QAA/asymmetric-logical-unit-access-alua-support-on-netapp-storage-frequently-asked-questions?la...

 

jeff


6 REPLIES 6

Mako77
6,732 Views

Thanks - I would have thought that switching to the second path to the same controller, through the second switch, would be much quicker than having to go through the alternative controller, or than the aggr moving to the other controller in a controller failure.

 

Ed

Jeff_Yao
6,727 Views

In general it depends on your topology and multipathing policy, for example how many paths you have from one host to one controller.

more kb to read 🙂

https://kb.netapp.com/support/s/article/ka31A00000013QYQAY/how-to-verify-vmware-esx-fibre-channel-configurations-with-multipathing-i-o-mpio

 

ps. read the related links too, that should be helpful

 

thanks

 

Jeff

Mako77
6,719 Views

Many thanks. Yes, I have the standard dual-fabric setup: Server -> Switch -> Node with redundant paths. I understand the 30 - 120 second failover for the aggr moving nodes, but I was surprised that the path to the same node through an alternative switch took any time. Both these paths are optimised and active, so in my mind if one fails the other should continue without a 'pause'. On my old SAN array this was the case, but as far as I know it didn't use ALUA, so I guess this is where the delay comes in.

 

Ed

Mako77
6,715 Views

I suppose my question is more simply this: if I have two optimised active I/O paths and one fails, should I still see a 'pause' of 30 secs? Is this what other users are seeing? I have only the one unit so can't compare.

 

Ed

Mako77
6,636 Views

So it appears that removing a non-optimized path results in no downtime, and a software takeover also results in no downtime. Removing the optimized path/paths, rebooting controllers without takeover, or any other unscheduled failure along that route results in a 30 - 40 second delay. With much bigger loads on the system I expect the delay to be nearer 60 seconds.

 

So in conclusion, before any maintenance we need to move the aggrs off the affected node using takeover, and then disable automatic giveback so that they don't automatically return to their home node. These timeouts appear to be down to the ALUA protocol and, in my view, inefficient!
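As a rough illustration of that maintenance sequence, the Python sketch below only builds the list of ONTAP CLI commands as text; it does not connect to or run anything on a cluster. The node names `node1` and `node2` and the `maintenance_plan` helper are hypothetical. Turning off auto-giveback on both nodes before the takeover, rather than after, is my assumption, and the exact commands should be checked against the documentation for your ONTAP version.

```python
def maintenance_plan(target_node, partner_node):
    """Ordered ONTAP CLI commands for planned maintenance on target_node.

    The commands are returned as strings for review; nothing is executed.
    """
    return [
        # Stop the aggrs from returning home on their own (assumed to be
        # set on both nodes before the takeover).
        f"storage failover modify -node {target_node} -auto-giveback false",
        f"storage failover modify -node {partner_node} -auto-giveback false",
        # Planned takeover: the partner starts serving target_node's aggrs.
        f"storage failover takeover -ofnode {target_node}",
        # Check the takeover state before touching cabling or hardware.
        "storage failover show",
        # After maintenance, hand the aggrs back manually.
        f"storage failover giveback -ofnode {target_node}",
    ]


plan = maintenance_plan("node1", "node2")
for step, command in enumerate(plan, start=1):
    print(f"{step}. {command}")
```

Re-enabling auto-giveback afterwards, if you want it, is left out of the sketch.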
