Very Low Write Performance with Ontap Select 2 Node

Matt99 · ‎2018-01-14

I've been playing around with Ontap Select 9.3 for our remote branch solution.

I've been able to get it installed and running and have started my performance testing.

Each node has an internal Intel NVME PCI-E Flash card. I've deployed the 2 node cluster using the "small" config.

I'm seeing terrible write latency: a steady 200ms+ when performing storage vMotions. Volumes are presented as NFS, like our physical Netapps are.

I can actually storage vMotion faster between two old test SATA Netapps than I can to/from the Ontap Select w/ NVME.

The performance graphs show the underlying NVME datastore has sub 1ms latency between the Node VM and the datastore.

So it looks like the latency is within the ontap stack.

I certainly would not expect bare-metal speed. But considering we are using very fast NVME PCI-E cards, I'd expect much better performance than we are seeing.

This same gear was also used for a vSAN test and its performance was way, way higher.

So is there something I'm missing? I do see the virtual e0a adapters are set to 1000Mb full, but changing them to 10Gb full did not seem to help. These hosts are all connected via 40Gb Mellanox switches.

I've also tried disabling dedup/compression on the volume but that did not seem to have any impact either.

Reg · ‎2018-01-14

I doubt the hardware configuration you mentioned is Netapp supported or tested. Please validate it.
Usually the bottleneck in the stack is the disk type (since it has to mirror across), the network pipes, or the compute CPU and memory supported by the license type.

Matt99 · ‎2018-01-14

Can you elaborate?

Netapp was pretty clear with us that, being "software", the hardware requirements are those stated as supported by the underlying hypervisor.

It seems odd that a software layer without any knowledge of the physical hardware would have a hardware specific requirement.

Can you send a link to the list of "supported" ontap select hardware? I don't see any mention of one.

All of our hardware is new, fully listed on the VMWare HCL, and all vSAN ready nodes. Our switches are fully supported by Netapp (they are the same switches used for our physical Netapp cluster switches).

What hardware would you see as an "issue" for support?

On a side note, write performance is also terrible when using the built-in HP SSDs (which are rebranded Intel DCs) on the internal RAID controller. Using the HP branded Intel NVME drive doesn't make the terrible write latency any better or worse.

Reg · ‎2018-01-14

NetApp Interoperability Matrix https://www.netapp.com/us/technology/interop.aspx

I've attached 3 files as examples for the configuration you may have, to show what is supported.

Matt99 · ‎2018-01-14

Yes, but that's exactly what I was saying.

Those requirements are quite general. They don't say anywhere that you must have "Network Adapter Brand X, with Firmware X"; instead they are general, like "Dual 40Gb adapters".

All of our hardware fully meets the requirements in those docs as far as I can see. And all the hardware and VMWare versions meet the requirements for Netapp interop and the VMWare HCL.

This isn't old hardware. It's all brand new HP vSAN ready nodes.

All of this hardware is supported and performs very well with vSAN and physical Netapps. VMWare performance graphs show the underlying latency from the nodes to physical storage and network is very, very low (just like when using vSAN).

But the performance from the datastore to the SVM has terrible write latency. It doesn't seem like a hardware issue to me unless you see a specific item that would be a problem.

Reg · ‎2018-01-14

Agree with what you say, and technically it should work, but all the NetApp literature that I've read talks about SSDs or HDDs, not Flash cards, and only mentions 10Gbps cards, etc.

Therefore, I cannot comment any further where the issue is.

What ESXi version have you installed on the physical nodes? Have you tagged the datastores as SSD within vSphere, and is the controller driver the right one? Have you changed the MTU size on the NICs? Have you checked the performance from NetApp System Manager to see where it is choking?

Matt99 · ‎2018-01-15

We can replicate the very low write performance with standard SSD drives on the HP Smart Array Controller, so it's not the NVME PCI-E card. When testing with the standard SSDs, the write latency goes to 200+ ms as soon as there is any steady workload.

These hosts were designed for vSAN, so they have a very fast PCI-E SSD for write cache and then a bunch of standard SSDs for the capacity tier. We initially tested with the PCI-E card as it can push 100,000+ IOPS at just a few milliseconds of latency.

One very interesting test: we deployed the Ontap node to occupy only half of the PCI-E card, then deployed a test VM directly on the PCI-E datastore next to the Ontap node. During testing the Ontap node was pushing 98 MB/s of throughput at 210 ms average latency. A VM on that same datastore did 1050 MB/s with an average latency of 3 ms. We then ran both tests at the same time; Ontap performance stayed the same and the standalone VM dropped only slightly, so there is plenty of headroom on those PCI-E cards.
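
As a rough sanity check on those numbers, Little's law (data in flight = throughput × average latency) gives an idea of how much I/O each path is holding in its queues. This is only a minimal sketch: it assumes the figures above are MB/s and average per-I/O latency in milliseconds, and the function name is made up for illustration.

```python
def in_flight_mb(throughput_mb_s: float, latency_ms: float) -> float:
    """Little's law: data in flight = throughput x average latency."""
    return throughput_mb_s * (latency_ms / 1000.0)

# Figures quoted in the post (assumed units: MB/s and milliseconds).
ontap = {"throughput_mb_s": 98.0, "latency_ms": 210.0}
plain_vm = {"throughput_mb_s": 1050.0, "latency_ms": 3.0}

ontap_queue = in_flight_mb(**ontap)
vm_queue = in_flight_mb(**plain_vm)
throughput_ratio = plain_vm["throughput_mb_s"] / ontap["throughput_mb_s"]
latency_ratio = ontap["latency_ms"] / plain_vm["latency_ms"]

print(f"Ontap Select path: {ontap_queue:.2f} MB in flight")
print(f"Standalone VM:     {vm_queue:.2f} MB in flight")
print(f"Throughput ratio:  {throughput_ratio:.1f}x, latency ratio: {latency_ratio:.0f}x")
```

If those assumptions hold, the Ontap path is holding several times more data in flight while moving roughly a tenth of the throughput, which would be consistent with I/O waiting somewhere above the flash device rather than on it.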

All of these hosts are on 6.5u1 and the drives are already identified by VMWare as flash devices.

We have 10Gb SFP+ cards in the hosts already, in addition to the 40Gb ones, so I'm going to reconfigure the hosts to use the HP integrated 10Gb cards to see if that changes anything, but my guess is it won't.

If there was a network latency issue with the 40Gb cards then vSAN should see the same slowdown, but it shows single-digit latency instead of 200+ ms.

System Manager does not show the same terrible performance that the VMs see. And the Node VM to the physical datastore shows very low latency. I'll have to dig into the node performance to see if there are any new performance metrics that can detail the latency between the nodes for the write ack.
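
One way to put a number on what a guest actually sees, independent of the vSphere graphs and System Manager, is a small synchronous write probe run from a test VM (or any client) against a directory on the storage under test. The sketch below is an assumption of how such a probe could look, not something used in the tests above: the function name, block size and sample count are arbitrary, and the target directory is a placeholder.

```python
import os
import statistics
import tempfile
import time

def probe_sync_write_latency(directory: str, block_size: int = 64 * 1024, count: int = 200) -> dict:
    """Write `count` blocks, fsync after each one, and return latency stats in ms."""
    payload = os.urandom(block_size)
    samples_ms = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency_probe_")
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # wait until the write is acknowledged as stable
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    samples_ms.sort()
    return {
        "count": count,
        "avg_ms": statistics.mean(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (count - 1))],
        "max_ms": samples_ms[-1],
    }

if __name__ == "__main__":
    # Placeholder target: replace with a directory on the mount under test.
    target = tempfile.gettempdir()
    print(probe_sync_write_latency(target))
```

Running the same probe against a directory on the Ontap Select volume and against one on the local flash datastore would give directly comparable per-write figures, since each sample includes the full round trip to the write acknowledgement.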

ildella · ‎2018-04-03

Hello Matt99,

I am very interested in this topic. Did you have the chance to perform further tests to understand why the latency was so high?

Thank you in advance,

Alessandro

Matt99 · ‎2018-04-03

The final update is that Ontap Select is not for performance-sensitive workloads.

We tested many different ways and used several different brands of SSDs, all of which are VMWare HCL certified. We tested both HP servers and Dell servers (each with their own brand of controllers and drives).

No matter what we did, the latency between the VM and the SVM was very high with even a moderate load. The physical storage stats between the SVM and the physical datastore it runs on were always very low (sub 1ms).

It seems the current version of Ontap Select introduces a lot of latency within its stack. Perhaps that's why Netapp doesn't make too many (if any at all) claims about performance levels.

Hopefully they will get the platform tuned in future versions. We love the idea of software defined storage using the Netapp stack. But as it is, even for a remote branch file server, the performance was too low.

FrankRust · ‎2019-05-16

Can you specify what "Very Low Performance" means for you?

I started with a one-node KVM version of Ontap Select with as little as 3MB/s NFS write performance on SATA disks.

By tuning several parameters I got it to 250MB/s, which is quite enough for our use as backup.

The host natively writes 330MB/s peak as NFS to the same disk set (18 disks, software RAID 6, without hardware cache controller). So the overhead of Ontap Select is definitely there, but not too bad.

Esxi_host_cache · ‎2020-05-19

I think the latency issue in Ontap Select is not related to the specific SSDs / hardware components involved but to the long storage IO path. The write IOs traverse two VMFS layers + the Netapp filesystem layer + the VMware network (to mirror the data).

Disclosure: I am an SE with Virtunet Systems; our software for ESXi caches hot data to in-host RAM / SSD. Ontap Select customers have used it to reduce VM storage latencies. Here's a link specific to Ontap Select.