ONTAP Discussions

QoS limits inaccurate, observed values not even close to -max-throughput setting

tcm10

I have a workload that I'd like to limit with QoS, so I tested QoS on a non-production SVM with a single cluster-wide policy group containing a single volume.  Whether I specify an IOPS limit or a throughput limit, the actual observed value is exactly half of the specified limit.

The doc shows a nice example where a workload is clamped right at the set limit.  The doc also says a 10% overage is not uncommon, but coming in 50% low seems way off.

Example (statistics output clipped to show only the volume of interest):

No qos set:

 

clusterX::> qos statistics workload performance show
Workload          ID     IOPS   Throughput    Latency
--------------- ------ -------- ------------ ----------
home01-wid2228    2228     3127    97.73MB/s  1341.00us
home01-wid2228    2228     3171    99.08MB/s  1288.00us
home01-wid2228    2228     3206   100.17MB/s     1.70ms

(The unthrottled load is capable of ~3,200 IOPS at ~100 MB/s.)
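
A quick sanity check on these numbers (my own arithmetic on the samples above, nothing from the docs) shows the implied average request size is roughly 32 KB:

```python
# Implied average op size from the unthrottled sample above
# (numbers copied from the qos statistics output; my arithmetic).
throughput_mb_s = 100.17   # MB/s, third sample
iops = 3206                # IOPS, same sample

op_size_kb = throughput_mb_s * 1024 / iops
print(f"~{op_size_kb:.1f} KB per op")  # ~32.0 KB per op
```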

qos bandwidth limit set:

clusterX::> qos policy-group modify -policy-group test_qos -max-throughput 10mb/s

clusterX::> qos statistics workload performance show
Workload          ID     IOPS   Throughput    Latency
--------------- ------ -------- ------------ ----------
home01-wid2228    2228      159     4.96MB/s   673.04ms
home01-wid2228    2228      160     4.99MB/s   727.68ms
home01-wid2228    2228      159     4.95MB/s   787.47ms

qos iops limit set:

clusterX::> qos policy-group modify -policy-group test_qos -max-throughput 1000iops

clusterX::> qos statistics workload performance show
Workload          ID     IOPS   Throughput    Latency
--------------- ------ -------- ------------ ----------
home01-wid2228    2228      494    15.44MB/s   242.15ms
home01-wid2228    2228      505    15.76MB/s   241.72ms
home01-wid2228    2228      494    15.45MB/s   245.38ms

(I realize as well that the limit is a maximum, so anything under it is "technically" correct, but still...)
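
For what it's worth, putting the throttled samples next to the configured ceilings (again just my arithmetic on the numbers pasted above), both tests land almost exactly at 50% of the limit:

```python
# Observed value vs. configured limit, from the throttled samples above.
bw_observed_mb_s, bw_limit_mb_s = 4.99, 10.0  # -max-throughput 10mb/s test
iops_observed, iops_limit = 494, 1000         # -max-throughput 1000iops test

print(f"bandwidth test: {bw_observed_mb_s / bw_limit_mb_s:.0%} of limit")  # 50% of limit
print(f"iops test:      {iops_observed / iops_limit:.0%} of limit")        # 49% of limit
```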

The system is a 2-node AFF cluster running 8.3.2P5.


GidonMarcus

Hi

I looked at it for a few minutes and I want to repro it in my lab.

What type of workload did you apply in that test? Have you tried applying a workload of another type (other tool, other read/write ratio, more operation instances in parallel)? What protocol is used, and what OS and tools are used for the tests? What is the network configuration used for this (LIF, multipathing, network redundancy such as LACP/native teaming)?

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

tcm10

Gidi,

The test is a copy of a 4 GB file to the volume (100% writes) via NFS from a Red Hat 6.5 client.  I have not tried other workloads yet because of the unexpected nature of this initial result.

Our network configuration is a 2 port LACP ifgrp at 10 Gbps per port.

@GidonMarcus wrote:

Hi

I looked at it for a few minutes and I want to repro it in my lab.

What type of workload did you apply in that test? Have you tried applying a workload of another type (other tool, other read/write ratio, more operation instances in parallel)? What protocol is used, and what OS and tools are used for the tests? What is the network configuration used for this (LIF, multipathing, network redundancy such as LACP/native teaming)?

Gidi

tcm10

Just tried a read test.  Same observed limit (half the specified maximum).

Just tried two clients reading simultaneously.  Same observed limit (half the specified maximum).

GidonMarcus

Hi

I did a repro that I think is really close to yours, without hitting the problem (file attached).

Need to start excluding components...

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

Joao_Matos

Hi,

Do you have any news on this subject?

I'm experiencing exactly the same thing. The volume's QoS limit is set to 2000 IOPS and it never hits more than a constant 1000 IOPS.

Thanks!

(Screenshot attached: Capture.JPG)

AlexDawson

Hi there,

We have determined that some versions of ONTAP are susceptible to an issue where IOPS are limited too much by QoS, as detailed in this BURT. This is resolved in ONTAP 9.1P8, and the fix can be confirmed by doing a takeover/giveback cycle to see if throughput increases.
