This question is actually very common. The short answer: a volume IOP (front end, from a data consumer) is not the same as an aggregate IOP (back end, to the physical disk).
The more important question is why?
Let's assume for the moment that all front-end IOPs are 4K in size, which matches the disk block size on the back end. Every storage system uses some type of cache - internal system RAM, SSD, etc. - to minimize how often the really slow disk has to be used instead of faster media. So, imagine a user workload that reads the same 10 blocks of "volume" data - a file perhaps - at a rate of 1000 IOPs. The first such read would hit the disk. All subsequent reads would likely come from RAM cache. So there are 1000 volume IOPs on the front end and essentially 0 aggregate IOPs on the back end for that workload.
Of course, front-end workloads are not always 4K in size. They could be anything. A single 64K read IOP to a volume could result in 16 4K read IOPs to the aggregate back end. Conversely, a whole bunch of 512-byte IOPs at the front end might be satisfied through 1 back-end IOP to disk. ONTAP, like other storage systems, uses predictive readahead mechanisms to preload the cache with data likely to be requested next. When successful, this can significantly reduce back-end disk IOPs, especially when front-end IOPs are less than 4K in size.
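To make the fan-out concrete, here's a minimal sketch. It assumes a 4K back-end block size and ignores caching, readahead, and coalescing - so it shows the worst case, not what a real system actually does:

```python
import math

BLOCK_SIZE = 4096  # assumed back-end block size in bytes

def backend_iops(frontend_io_size, frontend_iops):
    """Worst-case back-end 4K IOPs generated by a front-end workload,
    with no caching, readahead, or coalescing."""
    blocks_per_io = max(1, math.ceil(frontend_io_size / BLOCK_SIZE))
    return blocks_per_io * frontend_iops

print(backend_iops(64 * 1024, 1))  # one 64K read -> 16 back-end IOPs
print(backend_iops(512, 8))        # eight 512B reads -> up to 8; if they all
                                   # land in one block, a real system might
                                   # collapse them to a single disk IOP
```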
Storage efficiency features - deduplication, compression, cloning - further disconnect volume IOPs from the aggregate IOPs needed to serve them.
The disconnect highlights the need to monitor the system as a whole rather than just a single measurement to determine likely future performance. In general a disk has a generic IOPs number you can use. For 10K SAS, you could use, say, 170 4K IOPs per disk. For simplicity, consider a 10-data-disk aggregate, which in theory would produce around 1700 IOPs. Your volume load reading/writing to that aggregate might be 10000 IOPs, and maybe that is using about 800 IOPs of the aggregate due to caching, access patterns, data consolidation, etc. So the aggregate is running at roughly 50% of its theoretical IOP capacity. I personally try to limit average disk utilization to no more than 60%, so in theory you have about 10 percentage points of aggregate capacity left - roughly 220 more back-end IOPs on top of the 800 in use. If the workload pattern stays similar, volume IOPs scale proportionally with back-end IOPs, so you could expect roughly 12750 IOPs measured from the volume point of view (10000 × 1020/800). Of course this assumes the CPU, network bandwidth, and all the other components also have that much more capacity to give. That's the basics of using the differences in IOPs measurement between aggregates and volumes to determine workload potential.
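As a sanity check on that arithmetic, here is the headroom estimate spelled out. The 170 IOPs/disk figure and the 60% ceiling are rules of thumb from the example, not measured values:

```python
PER_DISK_IOPS = 170      # rule-of-thumb 4K IOPs for a 10K SAS disk
DATA_DISKS = 10
UTIL_CEILING = 0.60      # personal limit on average disk utilization

aggregate_capacity = PER_DISK_IOPS * DATA_DISKS          # 1700 back-end IOPs
backend_in_use = 800                                     # measured back-end IOPs
volume_iops = 10_000                                     # measured front-end IOPs

utilization = backend_in_use / aggregate_capacity        # ~0.47, i.e. ~50%
headroom = UTIL_CEILING * aggregate_capacity - backend_in_use  # ~220 IOPs

# If the workload pattern stays similar, volume IOPs scale with back-end IOPs:
scale = UTIL_CEILING * aggregate_capacity / backend_in_use     # ~1.275
print(round(volume_iops * scale))  # ~12750 volume IOPs at the ceiling
```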
For latency the same concept applies - aggregate latency and volume latency are different, disconnected, but related in roundabout ways. So, just as with IOPs, latency for the aggregate is tracked separately from latency for a volume. We also now need to take read/write operations into account more specifically.
Starting with aggregates. When not under high load stress, latency for an aggregate is essentially constant, between 4-10ms per IOP depending on disk type, size, etc. Why? All IOPs are 4K in size. A wide-ranging workload tends toward a level of randomness in access despite WAFL's best efforts to leverage sequential access when it can. Thus the latency for every aggregate IOP is essentially a mathematical function of the degree of randomness, rotational speed, interface speed, etc. That is classic disk math.
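That "disk math" can be sketched roughly like this. The seek-time and transfer-rate figures are illustrative assumptions, not vendor specs:

```python
def disk_latency_ms(rpm, avg_seek_ms, io_kb=4, transfer_mb_s=150):
    """Rough per-IOP service time for a random read on a rotating disk:
    average seek + half a rotation (on average) + data transfer."""
    rotational_ms = (60_000 / rpm) / 2            # half a rotation, on average
    transfer_ms = io_kb / 1024 / transfer_mb_s * 1000
    return avg_seek_ms + rotational_ms + transfer_ms

print(round(disk_latency_ms(10_000, avg_seek_ms=3.8), 1))  # 10K SAS: ~6.8 ms
print(round(disk_latency_ms(7_200, avg_seek_ms=8.5), 1))   # 7.2K SATA: ~12.7 ms
```

Notice how the 10K SAS result lands squarely in the 4-10ms range mentioned above.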
But clearly a volume is not always getting 4-10ms per IOP. In fact it is getting wildly different performance, from <1ms to 20ms per IOP typically (again assuming no silly loads). Why? First, consider writes. Writes are not sent to disk first. Rather, a write is stored in NVRAM (or equivalent) on both nodes of an HA pair and then acknowledged. The volume latency could easily be 0.2ms for write operations. Periodically NVRAM is flushed to disk as needed. That triggers write IOPs to the aggregate in that 4-10ms range as described above. Thus, as long as disk is keeping up bandwidth-wise with the total combined write load of the volumes, volume writes can get really good latency in the sub-millisecond range while the disks hold steady at 4-10ms per back-end IOP.
Volume reads have a similar result. The first read of any volume has to go to disk to get data so may be subject to that 4-10ms aggregate latency in addition to internal processing overhead - protocol, space efficiency expansion, etc. But, if the readahead prediction is good, subsequent reads will mostly come from data in RAM that came along for free with that first read. As such, subsequent volume read latencies could be much reduced.
All that holds in isolation. There are numerous other "total system" factors that drive both read and write latency up, the largest of which is the number of volume IOPs that directly require aggregate disk IOPs to satisfy. There can be multiplication effects due to space efficiencies as well - multiple layers of snapshots, deduplication, and cloning can force extra disk reads to find the key data. Extra disk reads due to load drive up aggregate latency, which eventually drives up volume latency as well. There are performance counters to track how many reads were satisfied from the various layers of the data stack - RAM, FlashPool (if available), SSD (if all flash), FlashCache, and disk. As you watch periods where more reads require disk access, volume read latencies naturally increase as well.
A key takeaway is that you shouldn't try to add or average volume performance indicators to get the equivalent aggregate performance indicators. That's still comparing apples to oranges. Of course they are related, but they are not directly comparable. Rather, you can use one as the indicator to investigate the other. For example, if a volume is getting unacceptable latency, you might then check the latency of the aggregate where that volume lives. If it is within the mathematical range the aggregate should have based on disk type, then aggregate performance is probably not the contributing factor to the volume latency in question. If aggregate latency is either high or spiky in sync with the volume latency, then aggregate performance is more likely to be a contributing factor.
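That triage flow can be sketched as a toy helper. The 10ms threshold reflects the top of the 4-10ms disk-math range discussed earlier; in practice you'd derive it from your actual disk type:

```python
def triage(aggr_latency_ms, expected_max_ms=10.0):
    """Given a volume with bad latency, decide whether the aggregate it
    lives on is a likely contributor, per the flow described above."""
    if aggr_latency_ms <= expected_max_ms:
        return "aggregate within expected range; look elsewhere (CPU, network, protocol)"
    return "aggregate latency elevated; likely contributing to volume latency"

print(triage(7.5))   # healthy aggregate -> investigate other layers
print(triage(18.0))  # elevated aggregate latency -> likely contributor
```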
An example - I had a system that would average 60-80ms latency for reads for hours (it was a backup cycle) on a single aggregate, measured at the volume level. Protocol IOPs would hit 30K, but back-end disk IOPs would be more like 15K - which shows the disconnect. Disk utilization would spike to 75-80% but wasn't steadily high. Disk read latency was crappy though due to the extreme load. Throughput off the system was 15-20 Gbps, over multiple links of course. This was bulk data reads, and in the aggregate data was flowing out really well. Measured against a single volume the latency and performance were not great, but being a backup cycle it was the total system performance that mattered - and how do I complain about sustaining 15 Gbps out for 6 hours? That's roughly 40TB out in 6 hours, give or take. Of course, any "normal" application read during that window also saw that terrible response time due to the aggregate trying to meet the backup workload.
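For the curious, the throughput-to-total-data arithmetic:

```python
gbps = 15                  # sustained throughput, gigabits per second
hours = 6
bytes_total = gbps * 1e9 / 8 * hours * 3600   # bits/s -> bytes/s, times seconds
print(round(bytes_total / 1e12, 1))           # ~40.5 TB moved over the window
```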
I gather that you are using the simulator, which makes things just a bit interesting. From the output it appears you have very little active I/O, which can make measurement somewhat more challenging.
For the node, try "statistics node show -interval 12 -iterations 10" and be sure to direct some I/O to the virtual node while that's running for the two minutes.
For the aggregate, it's a bit trickier. The aggregate will track operations, but latency is tracked under the "disk" object in the performance counters. So you'd have to collect the latencies from all the disks in the aggregate, weight them appropriately (data disks generally have different latencies than parity disks), and build an aggregate-level latency. If you went to that level, I'd leave off the parity disks and just report the latency of the data disks.
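A minimal sketch of that weighting, with parity disks excluded as suggested. The disk names, IOPs, and latencies here are made up for illustration - you'd pull the real values from the "disk" object's counters:

```python
disks = [
    # (name, role, iops, avg_latency_ms) - hypothetical values
    ("1.0.1", "data",   180, 6.2),
    ("1.0.2", "data",   175, 6.8),
    ("1.0.3", "data",   190, 5.9),
    ("1.0.4", "parity",  60, 4.1),  # parity disks left out of the average
]

data = [(iops, lat) for _name, role, iops, lat in disks if role == "data"]
weighted = sum(iops * lat for iops, lat in data) / sum(iops for iops, _ in data)
print(round(weighted, 2))  # IOP-weighted data-disk latency, in ms
```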
That's where tools like OnCommand Performance Manager or Harvest/Graphite/Grafana come in - they collect that data automatically for display and analysis.