ONTAP Discussions

Discrepancy between performance metrics output by CLI and shown by OnCommand Unified Manager

markccollins

Because the CLI is not capable of retrieving historical performance data (if it is, please let me know!), we've developed a script to call the CLI and get raw counter data for different components like so:
statistics show -raw -object volume -counter [list of common metrics like read_ops, write_ops, read_latency, etc.]
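For context, a minimal sketch of what the collector does (Python purely for illustration; the cluster address, account name, and key-based SSH access are assumptions, not our real setup):

#!/usr/bin/env python3
# Rough sketch of the collector described above -- not the real script.
# Assumes key-based SSH to the cluster management LIF; CLUSTER and USER
# are placeholders.
import subprocess
from datetime import datetime, timezone

CLUSTER = "cluster-mgmt.example.com"   # placeholder management LIF
USER = "admin"                         # placeholder account
# Counter list; adjust the names/separator to whatever the script actually passes.
COUNTERS = "read_ops|write_ops|read_latency|write_latency"

def collect_raw_volume_counters() -> str:
    """Run the raw-counter command once and return its text output."""
    cmd = f"statistics show -raw -object volume -counter {COUNTERS}"
    result = subprocess.run(
        ["ssh", f"{USER}@{CLUSTER}", cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # One sample per run; cron (or a similar scheduler) runs this every 5 minutes.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with open(f"raw_counters_{stamp}.txt", "w") as f:
        f.write(collect_raw_volume_counters())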

 

The script is scheduled to run commands like this through the CLI and save the output every 5 minutes. When we go back and look at these raw records, we can take any two records for a component (e.g. a volume), subtract their raw counters, and divide the difference by the number of seconds between the two records to get the rate of a metric over that time period. Take read ops as an example:

Object: volume
Instance: performance_test
Start-time: 12/17/2020 10:04:24
End-time: 12/17/2020 10:04:24
Scope: svm1

    Counter                                                     Value
    -------------------------------- --------------------------------
    read_ops                                                     1752
Object: volume
Instance: performance_test
Start-time: 12/17/2020 10:09:23
End-time: 12/17/2020 10:09:23
Scope: svm1

    Counter                                                     Value
    -------------------------------- --------------------------------
    read_ops                                                     4837

This is roughly what the CLI output looks like for two records. You can see that they were captured just one second shy of exactly 5 minutes apart. So if we ask ourselves "what was the rate of read operations per second over those five minutes?", we can come up with an answer.
We subtract the old raw counter from the current raw counter and divide by the duration (about 5 minutes, or 300 seconds):

(4837 - 1752) / 300 = 3085 / 300 ≈ 10.28 read operations per second (dividing by the exact 299-second gap gives ≈ 10.32).
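In script form, the same delta calculation looks roughly like this (Python just for illustration; the numbers and timestamps are the ones from the two records above):

from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

# The two raw samples shown above for volume "performance_test".
t0, read_ops_0 = datetime.strptime("12/17/2020 10:04:24", FMT), 1752
t1, read_ops_1 = datetime.strptime("12/17/2020 10:09:23", FMT), 4837

elapsed = (t1 - t0).total_seconds()          # 299 seconds here
rate = (read_ops_1 - read_ops_0) / elapsed   # raw counter delta / seconds
print(f"{rate:.2f} read ops/s over {elapsed:.0f} s")   # ~10.32 read ops/s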

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------

 

Our issue is that, while this makes sense to us and we've verified that the method works for one volume, other volumes tend to show drastically higher rates than OnCommand Unified Manager, which we are using to verify that our calculations are accurate. We take a component, calculate a metric rate for a given time period, and then compare it to the OCUM performance data for that same time.

Questions:

1) Why would a comparison for one volume match OCUM performance data graphs, but not other volumes?

2) Is this method accurate/reliable?

3) Are there any mistakes in the calculations we are performing?

4) Is there a way to obtain historical performance data from the CLI?

 

Thanks,

Mark


7 REPLIES

paul_stejskal

This is because the volume level is the WAFL layer, and WAFL ops are 64 KB max. An op can be up to 1 MB (or even bigger) at the LUN or CIFS/NFS layer. AIQUM shows the op size at the protocol layer.

 

If your application is requesting 256 KB reads (a SQL query, for example), then you may see 500 IOPS in UM but 2,000 in statistics for the volume object, because each 256 KB protocol read is counted as four 64 KB WAFL reads.
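As a rough sketch of that arithmetic (the 64 KB WAFL limit and the 256 KB read size are just the figures above):

import math

WAFL_MAX_OP_BYTES = 64 * 1024     # max op size counted at the volume/WAFL layer
protocol_op_bytes = 256 * 1024    # e.g. 256 KB application reads
protocol_iops = 500               # what AIQUM reports (protocol layer)

# Each protocol op larger than 64 KB is counted as several WAFL ops.
wafl_ops_per_protocol_op = math.ceil(protocol_op_bytes / WAFL_MAX_OP_BYTES)   # 4
wafl_iops = protocol_iops * wafl_ops_per_protocol_op
print(wafl_iops)   # 2000 -- what the volume-object statistics reflect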

 

Does that match up with what you are seeing?

markccollins

@paul_stejskal That actually sounds very similar to what we're experiencing! Thank you for your explanation, that helps us understand it more.

If that's the case, is there any way we can get CLI statistics to use op size at the protocol layer so that we can match what is shown in UM?

paul_stejskal

You can use qos statistics volume performance show.

 

I've got to update some KBs. I don't recall which version this changed in, but UM used to use volume-level stats (stats volume); I think it may have been as recent as OCUM 9.5 that it still did.

markccollins

@paul_stejskal Hmm, qos statistics volume performance show only lets us view performance in real time, and I haven't seen any option in the documentation for outputting raw counters the way "statistics show -raw -object volume" does.

 

Are there any other options for us to query or record+calculate historical performance through the CLI that would return metrics at the protocol layer?

paul_stejskal

AIQUM will show workload statistics, and Harvest will give volume-level statistics (stats volume).

 

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Infrastructure_Management/Active_IQ_Unified_Manager/Why_is_there_a_difference_between_statistics... <- KB I just created.

markccollins

@paul_stejskal It sounds like we won't be able to utilize the CLI for these purposes then.

I'm unsure if I should make a new thread for this topic:
Given that OCUM has a database storing all of the performance data it builds its graphs from, is there a way to obtain historical performance data from OCUM itself?
The API endpoint /rest/volumes appears to return point-in-time performance data, but I haven't found any filters for specifying a start/end time to get historical data.
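For reference, this is roughly the point-in-time call we've been making (a sketch only; the hostname, credentials, and TLS handling are placeholders, and /rest/volumes is simply the endpoint we've been querying):

import requests

UM_HOST = "ocum.example.com"      # placeholder Unified Manager host
AUTH = ("api-user", "secret")     # placeholder credentials

# Returns point-in-time volume performance data; we have not found any
# documented start/end-time parameters for historical data on this endpoint.
resp = requests.get(f"https://{UM_HOST}/rest/volumes", auth=AUTH, verify=False)
resp.raise_for_status()
print(resp.json())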

paul_stejskal

UM offers a date filter in the volume performance view. There is an example here: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Infrastructure_Management/Active_IQ_Unified_Manager/How_to_monitor_volume_latency_from_ActiveIQ_...

 

If you wish to follow up, I will be unavailable until Monday. If you need immediate assistance, please open a case and we'll be glad to help. You can reference this thread in the case.
