Performance data discrepancy

gmilazzoitag · ‎2015-01-31

Hi everybody,

During a POC in a fairly complex environment, the customer and I noticed some discrepancies between the data reported in the OCI Java GUI, the actual data recorded with tools such as sysstat (or stats on CMode), and the data from other NetApp consoles such as Operations Manager in OnCommand Core (aka DFM).
I was able to get data from two different 2-node CMode installations (fas4 and fas5) and from a 6240 HA 7-Mode one.

In 7-Mode we consistently saw that the reported IOPS and throughput were very high, almost impossibly so, such as 50k IOPS on a single aggregate; even allowing for the role of Flash Cache this seemed very unlikely. The values from CLI sysstat and from DFM were lower, and they matched each other but not the OCI ones. In other cases OCI reported spikes of 30-35k IOPS and a mean higher than what sysstat or DFM showed.
Furthermore, in the Java GUI the graphs never match the numerical values.

In CMode we instead noticed a discrepancy between the data reported in the Java GUI and the data in the Web UI asset dashboard. There the high values were confirmed by stats, but they don't match between the Web UI and the Java UI. In this case too, the Java UI graphs do not match the numbers!

These differences can easily lead the customer to question the reliability of the data OCI reports in its different screens...

I'm attaching some screen captures where you may be able to see the anomalies.

Regards.

ostiguy · ‎2015-02-01

Hey Giacomo

In your first two screenshots, the lower left hand table is storage pool performance - which for Ontap, means an aggregate, or in rare cases, a traditional volume. However, the lower right hand pane is showing the overall storage performance. So, those are not like for like comparisons.

Moving on:

As a rule, OCI's storage pool performance numbers will be representative of the disk performance for the disks within that pool.

The DFM view you are looking at appears to be focused on protocol ops. NFS "other" ops, or metadata operations, are serviced from memory. You can have a workload where a controller is doing 30k NFS ops, of which 25k are metadata and 5k are not. Only the 5k that are not metadata have any chance of actually driving OCI's internal volume (flexvol) or storage pool (aggregate) statistics, as only those 5k can possibly impact the disks.

OCI's storage pool and internal volume numbers are built off of iops that we think will impact disks - therefore, metadata ops are not included. OCI's node performance numbers are more complex, and the metadata IOP numbers are likely to be included.
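
To make the 30k / 25k / 5k example concrete, here is a minimal Python sketch of the arithmetic. The function name and the dictionary keys are made up for illustration; this is not how OCI or DFM actually compute their counters, only a way to see why a protocol-level view and a disk-oriented view can legitimately show different numbers for the same workload.

```python
def split_protocol_ops(total_ops, metadata_ops):
    """Split protocol ops into a protocol-level view and a disk-oriented view.

    Hypothetical helper: it only restates the arithmetic from the example,
    it does not reproduce any real OCI or DFM calculation.
    """
    if total_ops < 0 or metadata_ops < 0 or metadata_ops > total_ops:
        raise ValueError("metadata_ops must be between 0 and total_ops")

    # Ops that are not metadata are the only ones that might reach the disks.
    data_ops = total_ops - metadata_ops
    metadata_share = metadata_ops / total_ops if total_ops else 0.0

    return {
        "protocol_view": total_ops,        # what a protocol-focused chart counts
        "disk_candidate_view": data_ops,   # what a disk-oriented chart could count
        "metadata_share": metadata_share,  # fraction served from memory
    }


# The workload from the example: 30k NFS ops, 25k of them metadata.
example = split_protocol_ops(30_000, 25_000)
print(example)
```

With these inputs the protocol view is 30,000 ops while only 5,000 are candidates for disk IO, so the two views differ by a factor of six without either being wrong.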

Hopefully this helps

Matt

gmilazzoitag · ‎2015-02-01

Hi Matt,

Could be, but I'm not totally convinced by your comments 😉

Assuming that in the first screenshot (CMode) I have an aggregate on the left (OCI calls them storage pools) and the overall storage performance on the right, where is that 50k IOPS peak?
The graph is scaled to 20k and, what's more, the MB/s graph is scaled to a throughput of 900 MB/s, while I don't see that value in any of the other cumulative data for the overall storage.

And last, which may be important from a customer perspective: why is the same data not plotted correctly in the Web UI (low in the first screenshot)? In any case, for CMode, stats confirmed those high values, even though none of the graphs seems to report them correctly.

In the second shot (6240-HA in 7-Mode) we have 40k and 56k IOPS, but the overall graphs are scaled to 20k and do not show those peaks. And again, those graphs plot values completely different from the Web UI. This has caused confusion and bafflement for the customer - you should also consider that he's with Cineca, an Italian institute for advanced math and a computing center since the '60s 🙂

Good point on the DFM explanation you gave me (OCI counts all IO: data, metadata, dedupe, spreading, WAFL block IO and so on...).
I had not considered that, but it can cause an issue. If DFM does not account for the overall IOPS of all running protocols, the value it reports is "false" and can give a misleading picture of the real capacity of the storage.
After all, this 7-Mode system serves NFS only (for VMware exports), and the values reported by sysstat -x 1 matched the DFM ones, not the OCI ones.

Last but not least, the Web UI and Java UI graphs appear slightly different.

Anyway, I will advise the customer to take more graphs and comparisons during this POC and to send me some more screenshots, because our job is to demonstrate that the values are the right ones.

I know that OCI, DFM and the other tools all draw on the same performance logs that Ontap periodically generates, but maybe a good way to read and compare data for a given timeframe correctly is to run a perfstat and compare what it reports with the OCI data (knowing, of course, that OCI does not have the same granularity as a perfstat). I think this could be real proof of the validity of the data, do you agree? Or could this route cause confusion?
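
To illustrate the granularity caveat, here is a small Python sketch that averages fine-grained samples into coarser windows. The sample values, the 1-second spacing and the 10-sample window are all invented for the example; they are not the real perfstat or OCI polling intervals, which I have not verified.

```python
from statistics import mean


def downsample(samples, window):
    """Group consecutive samples into windows; return (mean, peak) per window.

    Hypothetical helper for comparing a fine-grained capture with a
    coarser-grained tool; not part of perfstat or OCI.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of samples")
    buckets = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [(mean(b), max(b)) for b in buckets]


# Invented 1-second IOPS samples: steady 5k with a 2-second burst at 50k.
fine = [5000] * 8 + [50000] * 2 + [5000] * 10

# View the same data through 10-sample windows.
coarse = downsample(fine, 10)
for avg, peak in coarse:
    print(f"window mean={avg} peak={peak}")
```

The first window averages to 14,000 IOPS even though its peak is 50,000, so a tool that stores only averages over longer intervals will never show the burst that a 1-second capture sees. Any perfstat vs. OCI comparison should therefore be done on averages over the same timeframe, not on peaks.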

Let me know if you have other cases where somebody reported differences between the Java UI and the Web one. Those graphs must match.

Thank you very much!

Bye

ostiguy · ‎2015-02-01

Disk IO may or may not be reflected in the storage level statistics if the disk IO is NOT in response to end user driven IO.

If the disk IO is due to a disk rebuild, or a backend process like dedupe, you can have high disk IO (which would be reflected in OCI's storage pool numbers), but relatively low IO as seen in the node performance, internal volume, and/or volume numbers.

In your screenshots, I don't see a single performance chart for a storage pool from the Java UI - the Java UI screenshots all appear to show the total storage array statistics - the title of the chart, in the first screenshot's lower right hand corner is "Storage Performance Chart: Storage fas-5". Conversely, the OCI webUI screenshots all appear to be from the Storage Pool landing page, so they default to showing Storage Pool statistics

If you want to see charts for individual arrays, right click the array in the top pane -> Analyze, and in the ensuing window, there should be a tab for Storage Pools.

Matt

gmilazzoitag · ‎2015-02-02

Thanks.

I'll try to have the user do this in the next few days. I'm no longer on site and, for now, the POC has been completed; now they have to play and study 😉 for this 2 PB opportunity! 😉

If I get some new data/screens I'll share them with you and the community, of course.

Regards,

PS: by the way, this customer is the one waiting for Dell array support... any news, or not enough market presence? 🙂