Solved: About latency that can be checked in SolidFire

shuhei · ‎2021-09-13

When the customer asked via a support case, the answer was that the latency displayed in ActiveIQ includes the communication time between the client and SolidFire.
The answer also stated that the SupportBundle contains no past performance data.

(The latency of SolidFire alone cannot be checked from the SupportBundle.)

In this case, do I need to choose one of the following methods to check the past latency of SolidFire alone?

・Using OnCommand Insight/Cloud Insights
・Using Grafana (https://grafana.com/grafana/dashboards/14025)

Or is it not possible to check the latency of SolidFire alone, because the latency figure is based on the time it takes for the client to respond with an ACK after the packet arrives successfully?

elementx · ‎2021-09-14

They don't really need Grafana. They could use Bash, curl, Python or PowerShell to call the API (like solidfire-exporter does) and gather latency info. Grafana is mainly for visualization.
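For anyone who wants to try the API route, here is a minimal Python sketch using only the standard library. It assumes the Element JSON-RPC method GetVolumeStats and the latencyUSec, readLatencyUSec and writeLatencyUSec fields in its result. The function names, the API version string "12.3" and the connection details are placeholders of mine, so check them against the Element API reference for your cluster version. This is not how solidfire-exporter is implemented.

```python
import base64
import json
import ssl
import urllib.request


def build_volume_stats_request(volume_id, request_id=1):
    """Build the JSON-RPC body for a GetVolumeStats call (assumed method name)."""
    return {
        "method": "GetVolumeStats",
        "params": {"volumeID": volume_id},
        "id": request_id,
    }


def extract_latency_usec(response):
    """Pull the three latency counters (microseconds) out of a response dict."""
    stats = response["result"]["volumeStats"]
    return {
        "latencyUSec": stats["latencyUSec"],
        "readLatencyUSec": stats["readLatencyUSec"],
        "writeLatencyUSec": stats["writeLatencyUSec"],
    }


def get_volume_latency(mvip, user, password, volume_id,
                       api_version="12.3", verify_tls=True):
    """Query one volume over HTTPS with basic auth and return its latency counters.

    mvip, user, password and api_version are placeholders for your environment.
    """
    url = f"https://{mvip}/json-rpc/{api_version}"
    body = json.dumps(build_volume_stats_request(volume_id)).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Only for lab clusters with self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        return extract_latency_usec(json.load(resp))
```

Calling get_volume_latency() for one volume ID on a schedule (cron, a loop with sleep, and so on) would be enough to build a latency history for that volume.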

The SolidFire latency counter doesn't monitor external TCP/IP flows, because that would be too resource-consuming. It only monitors "internal" service latency.

External latency, such as network and client (hypervisor) latency, is a separate concern anyway.

You can see in https://docs.netapp.com/us-en/element-software/storage/reference_monitor_volume_performance_details.html:

Total Latency
The average time, in microseconds, to complete read and write operations to a volume.
Read Latency
The average time, in microseconds, to complete read operations to the volume in the last 500 milliseconds.
Write Latency
The average time, in microseconds, to complete write operations to a volume in the last 500 milliseconds.

I think these are measured from the moment a request is received until it's fulfilled (sent out by SolidFire).

If you run a simple (low IOPS) performance test from the client and collect I/O latency information on the same volume, you'll see that:

a) client latency is around (say) 3ms

b) SolidFire latency is 0.2-0.3ms
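To make the comparison concrete, here is a small worked example in Python. It assumes both figures cover the same I/Os over the same interval, and it treats everything outside SolidFire (network, initiator, hypervisor) as one lump, which is a simplification. The function names are made up for illustration.

```python
def external_latency_ms(client_latency_ms, solidfire_latency_usec):
    """Latency not accounted for by SolidFire's own counter, in milliseconds."""
    solidfire_ms = solidfire_latency_usec / 1000.0  # API values are in microseconds
    return client_latency_ms - solidfire_ms


def external_share(client_latency_ms, solidfire_latency_usec):
    """Fraction of the client-observed latency spent outside SolidFire."""
    return external_latency_ms(client_latency_ms, solidfire_latency_usec) / client_latency_ms


if __name__ == "__main__":
    # Example figures from this post: 3 ms at the client, 250 microseconds on SolidFire.
    print(external_latency_ms(3.0, 250))
    print(round(external_share(3.0, 250), 3))
```

With 3 ms at the client and 250 microseconds on SolidFire, about 2.75 ms (roughly 92%) would be spent outside the storage service.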

SolidFire Exporter or HCI Collector will collect latency for all volumes, which can be fairly significant overhead if they don't need latency for all volumes. In that case it's better to write your own script in Python CLI or PowerShell, collect the stats for just one or more volumes, and store them in SQLite or CSV.
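A minimal sketch of the "store them in SQLite" part, in Python with the standard sqlite3 module. The table layout and function names are my own invention, and the stats dictionary is assumed to carry the latencyUSec, readLatencyUSec and writeLatencyUSec keys from a volume stats API response.

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the samples table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS volume_latency (
               collected_at       REAL    NOT NULL,
               volume_id          INTEGER NOT NULL,
               latency_usec       INTEGER,
               read_latency_usec  INTEGER,
               write_latency_usec INTEGER
           )"""
    )
    return conn


def store_sample(conn, volume_id, stats, collected_at=None):
    """Insert one latency sample; stats uses the assumed API key names."""
    timestamp = time.time() if collected_at is None else collected_at
    conn.execute(
        "INSERT INTO volume_latency VALUES (?, ?, ?, ?, ?)",
        (
            timestamp,
            volume_id,
            stats["latencyUSec"],
            stats["readLatencyUSec"],
            stats["writeLatencyUSec"],
        ),
    )
    conn.commit()


def average_latency_usec(conn, volume_id):
    """Average total latency (microseconds) over all stored samples for a volume."""
    row = conn.execute(
        "SELECT AVG(latency_usec) FROM volume_latency WHERE volume_id = ?",
        (volume_id,),
    ).fetchone()
    return row[0]
```

Passing a file path such as "latency.db" to open_db() instead of the in-memory default keeps the history across runs, which is what covers the "past latency" requirement.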

If you want to monitor end to end, then use Cloud Insights, or expand your script to also collect hypervisor (or VM) metrics.

HCI Collector collects vSphere performance metrics, for example, but it is not meant for finding precise correlations between different components such as SolidFire and VMs (it's surprisingly complex).

That's why I suggest focusing on a selected volume and a selected client. At least with your own script you can watch precisely those two entities.

shuhei · ‎2021-09-15

Thank you for your comment.
The problem in the customer's environment is that the latency displayed in ActiveIQ includes the client part.

In other words, we just need to make sure that the latency we can get from the SolidFire API does not include the client part.

As you kindly mentioned, the customer already maintains their own mechanism for collecting data via API, so I think this can be solved by having it query the SolidFire API.

elementx · ‎2024-06-15

> The problem in the customer environment is that the latency displayed in ActiveIQ includes the client part.

It does? How does AIQ know what latency the client has?

Edit: by the way I know such latency can be collected, but I don't know that AIQ does it.

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/diskio/README.md
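As a rough illustration of how client-side latency can be derived from counters like these: a sketch in Python that turns two snapshots of cumulative disk counters into an average per-operation latency. It assumes the field names reads, writes, read_time and write_time, with the time fields in milliseconds, as I understand the diskio README. The function name and sample numbers are made up.

```python
def average_latency_ms(previous, current, op="read"):
    """Average latency per operation between two cumulative counter snapshots.

    previous/current are dicts with assumed diskio-style fields:
    'reads', 'writes' (operation counts) and 'read_time', 'write_time' (ms).
    Returns None when no operations completed in the interval.
    """
    ops_key = f"{op}s"        # 'reads' or 'writes'
    time_key = f"{op}_time"   # 'read_time' or 'write_time'
    delta_ops = current[ops_key] - previous[ops_key]
    delta_time_ms = current[time_key] - previous[time_key]
    if delta_ops <= 0:
        return None
    return delta_time_ms / delta_ops


if __name__ == "__main__":
    # Made-up snapshots taken some seconds apart on the client.
    before = {"reads": 1000, "read_time": 2000, "writes": 400, "write_time": 1200}
    after = {"reads": 1500, "read_time": 3500, "writes": 400, "write_time": 1200}
    print(average_latency_ms(before, after, "read"))
    print(average_latency_ms(before, after, "write"))
```

Comparing a number like this from the client with the SolidFire-side counter for the same volume and interval is the kind of two-entity check described earlier in the thread.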