VM with VVOLS drops network when migrated - What!!??

echolaughmk · ‎2015-10-09

Hello,

While testing vMotion of a VM between hosts within a cluster, I noticed that the VM drops network pings for about 15 seconds, and I am hoping someone can shed some light on why. vMotion itself is working fine, since I don't have this issue on traditional datastores. I am leaning toward the issue being around binding/re-binding on the PE side, but I lack the documentation to validate that fully and was hoping someone from product management or engineering might be able to shed some light on it.

TEST:

1. ESX HostA with 1 path to the PE (1 iSCSI LIF seen on the array) - VM is running on this host with 4 vvols underneath (config/memory/disk vvols).

2. ESX HostB with 2 paths to the PE (2 iSCSI LIFs seen on the array)

3. VM is vMotioned between ESX HostA and HostB. A console within the VM and a ping from the CMD line outside of the VM both drop at the tail end of the vMotion (picture attached), and the vMotion takes 20 MINUTES to complete - why?
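To put a number on the drop in step 3 rather than eyeballing the console, something like the following could be run against a saved ping log. This is only a rough sketch: it assumes Windows-style `ping -t` output (success lines contain `TTL=`, failures read `Request timed out.` or mention `unreachable`), and the function name `longest_drop` and the address 10.0.0.5 in the sample are made up for illustration.

```python
def longest_drop(lines):
    """Return (longest run of consecutive failed probes, total failed probes).

    Assumes Windows-style ping output. Header and statistics lines are ignored.
    """
    longest = current = total = 0
    for raw in lines:
        line = raw.strip()
        if "TTL=" in line:
            # A successful echo reply ends the current run of failures.
            current = 0
        elif line.startswith("Request timed out") or "unreachable" in line:
            # A failed probe extends the current run.
            current += 1
            total += 1
            longest = max(longest, current)
        # Anything else (banner, blank lines, summary) is skipped.
    return longest, total


# Hypothetical sample log; 10.0.0.5 is a placeholder address.
sample = """\
Pinging 10.0.0.5 with 32 bytes of data:
Reply from 10.0.0.5: bytes=32 time<1ms TTL=128
Request timed out.
Request timed out.
Request timed out.
Reply from 10.0.0.5: bytes=32 time=2ms TTL=128
Request timed out.
Reply from 10.0.0.5: bytes=32 time<1ms TTL=128
""".splitlines()

print(longest_drop(sample))
```

The count is in probes, not seconds, since a timed-out probe takes longer than a successful one; comparing the count across runs and across datastore types is the useful part.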

I have tried to query the VASA inventory for any rebind operations, but either I don't catch them or they don't happen. The VM stays online for the entire duration of the vMotion except for the tail end, and I am not sure if this is expected, a bug, or a misconfiguration on my end, since I can't recreate it with traditional datastores.

Note - this test has been run several times in opposite directions, and the same outcome is seen regardless of the source/target host. This seems like a critical issue, and I am curious whether others are seeing the same.

Thanks.

peterl · ‎2015-10-10

One thing that looks off is that you say HostA has one path (1 iSCSI target LIF) and HostB has 2. This KB article suggests that's not a recommended config, since the ESXi iSCSI initiator will try to reach all iSCSI targets returned by SENDTARGETS. Any targets the initiator can't get to won't reply, and the initiator eventually times out. Rescans will take substantially longer. As a test, I suggest you do a rescan of the iSCSI initiator on each host and see if there's a significant difference between them.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2038869
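For the rescan comparison, a rough way to time it on each host is sketched below. It assumes Python is available in the ESXi shell and that `esxcli storage core adapter rescan --all` is the rescan you want to measure; the helper name `time_command` is just illustrative.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return (elapsed_seconds, exit_code)."""
    start = time.monotonic()
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.monotonic() - start, result.returncode


# Assumed rescan command; swap "--all" for "--adapter=<vmhba name>" to
# time only the iSCSI adapter.
RESCAN = ["esxcli", "storage", "core", "adapter", "rescan", "--all"]

# On each host:
#   elapsed, rc = time_command(RESCAN)
#   print("rescan took %.1f s (exit code %d)" % (elapsed, rc))
```

If the host with the unreachable target takes noticeably longer, that would point at the initiator timing out on the path it can't reach.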

When I do vMotion (host, not storage) with VVols it takes 7 seconds or less with VVols on FC, 30-35 seconds with VVols on NFS, and doesn't drop any pings in or out. Some ping responses go up as high as 72ms.

echolaughmk · ‎2015-10-11

Thanks for the info and advice. I set that test up on purpose with different paths to simulate and test what happens during path failures (network, since this is iSCSI at the moment) and vvol rebinding. Specifically, I am looking to see what the impact would be in these odd situations. I agree it is not the best practice.

I will try to run a few more tests with all paths. I don't have FC set up yet (that round is next), so I'm not sure whether this is specific to iSCSI, given your experience with FC and NFS. In the end, I still don't expect to see the outcome I am seeing, so I was hoping someone might be able to shed light on any restrictions there are, since this seems kind of off.

I could remove the iSCSI targets related to the vmkernel port group I removed to simulate the single path PE as well and redo that test.

UPDATE - I just put the other path online to HostA so each host has been rescanned for 2 paths to the PE and I saw a slightly different behavior - the vMotion didn't drop a ping this time (good news), but it still took 12 minutes to run (not good news). Given some of the other tests I have been doing, I don't feel this is network related now that everything is online, so I'm still in the realm of thinking it has to do with vvols and some sort of rebinding process.....any thoughts?

Not sure if anyone knows what to look for in the VP logs or inventory...I never catch a rebindingbean process in the inventory...

Thanks!

echolaughmk · ‎2015-10-14

Hi Peter,

So I did have a negotiation issue in my environment, which has been corrected. As a result, the network ping no longer drops - which is good.

However, I still see a good amount of time needed to complete the storage vMotions, likely due to the change in VVOL provisioning (thin-to-thick and thick-to-thin). I assume this is somewhat expected given the change in disk provisioning? Not sure if you have any comments there? I am still running through test iterations as I write this, so the behavior is still new to me...

Also, the other thing I am seeing is that, as a function of the storage vMotion, net-new vvols are created that appear to be related to the storage vMotion's delta vmdks for the migration. Sometimes the vvols end up keeping the same UUIDs, and other times the UUIDs have changed after the migration. The VM stays online through all of this, but I am curious what the thinking is here, if you know. I'm guessing it is related to how VMware is doing the storage vMotion, but I wasn't sure whether the change in vvol UUID could cause any issues.

Thanks for any insight!