Let me preface this by saying I know it is primarily a VMware question (and it has been posted there also) but I really need help with the NetApp side as well.
I am having a really hard time architecting a solution for an environment I am working on. I have found documentation for NFS or iSCSI, but cannot figure out the best way to get both at the same time with the hardware we are working with. We are dealing with the following:
The storage devices are dedicated to VMware so all that will be on them are VMFS stores for iSCSI and NFS datastores.
We have the need to use both iSCSI and NFS on the above devices. I cannot wrap my head around how the vifs and vSwitches are going to look. One note is that we are working with vSphere Enterprise Plus so we do have vNetwork Distributed Switches and the ability to "Route based on physical NIC load" which sounds really intriguing for the NFS traffic. For iSCSI, I have always been told there should be a 1 to 1 mapping between VMKernel port and physical NIC, and each path to the SAN should be on a separate subnet to ensure that the traffic is sent/received on the expected interfaces and to allow for proper multipathing. From what I can tell with NFS, multipathing is not possible and it is recommended to team all the physical NICs into a single VMKernel port. The NFS traffic will then balance across datastores (different target IPs).
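The 1:1 VMkernel-to-NIC mapping for iSCSI described above could be sketched like this on the ESX/ESXi 4.x command line. This is only an illustration, not a tested config: the portgroup names, IP addresses, subnets, and the `vmhba33`/`vmk2`/`vmk3` identifiers are all assumptions.

```shell
# Sketch: one VMkernel port per physical NIC, each on its own subnet
# (names, addresses, and adapter IDs are placeholders for illustration).
esxcfg-vmknic -a -i 10.0.10.11 -n 255.255.255.0 iSCSI-A   # backed by vmnic2
esxcfg-vmknic -a -i 10.0.20.11 -n 255.255.255.0 iSCSI-B   # backed by vmnic3

# Bind each VMkernel interface to the software iSCSI adapter so the
# multipathing layer sees two distinct paths (ESX 4.x syntax):
esxcli swiscsi nic add -n vmk2 -d vmhba33
esxcli swiscsi nic add -n vmk3 -d vmhba33

# Verify the bindings:
esxcli swiscsi nic list -d vmhba33
```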
Anyway, I guess a couple of the questions I am struggling with are:
I am very much open to suggestions and any advice or articles that will help clarify. I've combed through numerous articles only to find more questions that needed answers. They all seem to be one or the other but never both on the same environment.
Any help would be greatly appreciated!
A couple of considerations:
I would suggest stacking the 3750s and using the 2960s for VM traffic; this will make your life easier when it comes to load balancing and providing fault tolerance against a switch failure. As for your other points, my comments are inline.
As I said earlier, you should stack the switches. I don't think it is worth breaking the iSCSI traffic out onto different physical NICs; I think VLANs will work fine. You will probably be fine with 4 NICs for storage traffic. I would suggest monitoring before deciding you really need 4 NICs for VM traffic; you would be surprised how little VM traffic there is on a normal network. People assume they need 4 NICs for 30 VMs, but most of the time they do not.
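The VLAN-based separation suggested above might look like this on the ESX side, using the 4.x `esxcfg-vswitch` syntax. The VLAN IDs and portgroup names below are made up for illustration, not taken from the thread:

```shell
# Hypothetical portgroup/VLAN layout on one vSwitch carrying storage traffic
# (VLAN IDs and names are illustrative assumptions).
esxcfg-vswitch -A "iSCSI-A" vSwitch1 && esxcfg-vswitch -v 100 -p "iSCSI-A" vSwitch1
esxcfg-vswitch -A "iSCSI-B" vSwitch1 && esxcfg-vswitch -v 101 -p "iSCSI-B" vSwitch1
esxcfg-vswitch -A "NFS"     vSwitch1 && esxcfg-vswitch -v 200 -p "NFS"     vSwitch1

# List the resulting vSwitch/portgroup layout:
esxcfg-vswitch -l
```

With the switches stacked, the corresponding VLANs just need to be trunked to the uplink ports on both physical switches.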
I've been using NFS on Netapps via 10G redundant ethernet links (active/standby) for about six months now using ESX 4.1. It rocks. I've not seen utilization on the Netapps approach 4 Gig yet, and IOPs are still relatively low (~200 VMs and climbing fast).
But I'm being forced to put up some servers utilizing 8 x 1 Gig links in a manner similar to your environment.
I can't imagine why anyone would use iSCSI rather than NFS anymore, particularly now that "Route based on physical NIC load" is a teaming option in vSphere Enterprise Plus.
I've run tests with a VMware distributed switch (not the Cisco Nexus 1000V) on a single portgroup and VLAN with physical uplinks to TWO DIFFERENT SWITCHES without creating an L2 loop. It works fine without 802.3ad LACP, proprietary Cisco EtherChannel, or spanning tree. In fact the docs say load-based teaming is not compatible with LACP or EtherChannel. VMware takes care of the placement, putting VM1 on the first link, then VM2 on the second link, etc. It then looks for any physical port which goes over 70% bandwidth for an extended period of time, and starts to move VMs off of that port. Now, that's extremely cool. I haven't found anywhere I can change the 70% threshold, and I only know about it from reading.
So, you put the same (multiple) VLAN tags on two different ports on two different physical switches, and you have an active/active load-balanced team (LBT). You can scale to four or six uplinks balanced across physical switches as well. This means that high-end switches which can do LACP (or EtherChannel) across two switches are no longer necessary.
In my new design, which will house Oracle and MSSQL VMs (don't ask), I'm going active/standby to two different switches for management console, and two 3 x 1 Gig uplinks to different physical switches for both data and storage, no LACP at all, and that's it. NFS has its own portgroup, and with ESX 4.1 I have QOS or whatever VMware calls it, so I assign priority to IP storage and let it fly.
You do have to be careful configuring your portgroups. Hope this gives you some ideas. VMware + Netapps + NFS is really very nice. I suppose you could throw iSCSI in there as well.
I agree that NFS is the way to go with vSphere datastores, but iSCSI still has its place. For starters, the storage QoS you mention (vSphere Storage I/O Control) doesn't work on NFS; currently it is only available for block-based protocols / VMFS volumes. The other area where iSCSI has a leg up is the ability to aggregate bandwidth. Using iSCSI MPIO in vSphere you can aggregate bandwidth with round-robin I/O down all available interfaces. Until we see pNFS support we won't have that option with NFS datastores.
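For what it's worth, the round-robin MPIO mentioned above is set per LUN. A hedged sketch using the ESX 4.x `esxcli nmp` namespace (the device identifier below is a placeholder, not a real LUN):

```shell
# Switch the path selection policy for an iSCSI LUN to round robin
# (ESX 4.x syntax; naa.xxxxxxxxxxxx is a placeholder device ID).
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxx --psp VMW_PSP_RR

# Confirm the device now shows "Path Selection Policy: VMW_PSP_RR":
esxcli nmp device list
```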
Using iSCSI MPIO in vSphere you can aggregate bandwidth with round-robin I/O down all available interfaces.
To be fair to NFS, a similar end goal can be achieved by using multiple NFS datastores and VMkernel ports, connecting each of them via a different vmnic.
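One way to spread NFS load in that manner is to mount each datastore from a different target IP, so each mount traverses a different VMkernel port and subnet. A sketch using the 4.x `esxcfg-nas` syntax; the addresses, export paths, and datastore labels are illustrative assumptions:

```shell
# Mount two NFS datastores from two different filer target IPs so traffic
# takes different VMkernel interfaces (all names/addresses are placeholders).
esxcfg-nas -a -o 10.0.30.10 -s /vol/vmware_ds1 nfs_ds1   # reached via one vmknic
esxcfg-nas -a -o 10.0.40.10 -s /vol/vmware_ds2 nfs_ds2   # reached via another

# List mounted NFS datastores:
esxcfg-nas -l
```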
Probably the vSphere "QOS" I mentioned is a misnomer -- it is based on IP, and not dependent on file systems or block-level storage. Edit the dvSwitch properties and there's the "Enable Network I/O Control" checkbox. Use that with NFS and it may mistakenly be called QoS; I suppose it really is not. The bottom line is that you can protect your NFS bandwidth, and while that's not as granular as Cisco's QoS, it sure is nice.
IP is IP -- and at least in the VMware realm I see little advantage using IP to reach LUNs, except that it is cheaper than fiber channel. In the last year or so I've come to shy away from LUNs altogether, regardless of protocol, at least for vSphere.
Using LACP (or EtherChannel) with "route based on IP hash" for NFS on a VMkernel is as effective as aggregating 1 Gig iSCSI uplinks. No balancing or round-robin about it, just aggregation. And newer 10 Gig environments pretty much render aggregation moot. The tests and graphs I've seen show pretty much identical performance between iSCSI and NFS, but iSCSI has more limitations in my mind. But I'll admit I'm biased (*nix guy). In my 10GigE blade environment I've tested NFS-based VMs against clones sitting on LUNs accessed via 8 Gig HBAs. I'm happy. I will continue to use fibre channel when somebody insists on it, but otherwise not. It's getting so that the bottleneck is neither the protocol nor the bandwidth, but the storage device itself. IOPs. So... more VMs, more storage devices. Must be a nice business to be in.
The trickiest thing about using IP storage of any sort is getting separate redundant physical routes to storage, and that's critical in virtual environments. Fibre channel makes that a foregone conclusion, but you may have a hard time convincing network managers (real network managers) that you need separate physical IP pathing. Generally they change their minds when hundreds of VM servers fall off the network due to a single 10G switch failure.
Probably the vSphere "QOS" I mentioned is a misnomer -- it is based on IP, and not dependent on file systems or block level storage.
This is the feature which does work on iSCSI, but doesn't work (yet?) on NFS datastores:
Arguably FlexShare can deliver similar functionality, provided all datastores are on NetApp.
I am not referring to "Storage I/O" whatsoever.
I am referring to the capability to limit network I/O. With NFS it amounts to the same thing. I am putting limitations on IP, not storage.
I am using it successfully, and it has nothing to do with the API for storage. Nothing. It is strictly based on a vmkernel getting more bandwidth than anything else in a dvSwitch.
Yes, Network I/O Control is a nice feature on distributed vSwitches. That said, if you have a runaway process on a VM connected to an NFS datastore, that single VM can consume all your bandwidth to that datastore. This is where an iSCSI-, FC-, or FCoE-mounted VMFS datastore with Storage I/O Control will be of benefit. You can run both Net I/O and Storage I/O together for optimal performance by guaranteeing bandwidth (Network I/O Control) and preventing a single VM from starving others for disk I/O (Storage I/O Control). NFS datastores do not yet have that capability in vSphere. If you have storage SLAs based on latency, at this point VMFS volumes are the only way you can make such a guarantee. So while NFS is an awesome choice, and one I strongly advocate, iSCSI does still have its place. I think the tools you have outlined here certainly make NFS a very attractive and simple choice for customers thinking about IP storage in a vSphere environment.