
Interpreting Grafana Graphs

Hi all,

 

There's a wealth of information presented in Grafana, but I'm having trouble finding definitions for some graphs. For example, what's the difference between frontend and backend latency, as well as cluster latency? What causes them to increase? And what is the difference between "CIFS Frontend" and "CIFS Backend"?

 

I'd love to see brief descriptions of, or links to documentation on, what the graphs are presenting, because unfortunately it is proving very difficult to find. I've searched the internet high and low but haven't found many resources that cover the above-mentioned graphs.

 

Thank you for your time everyone!

Re: Interpreting Grafana Graphs

Hi @Storage001

 

I understand your challenge and plan to populate the 'description' field of every Grafana panel in a future update of Harvest to explain what each panel shows and what you might do about it.  In the meantime, you can always click 'edit' on any panel to see the underlying Graphite metrics string, which includes the source ONTAP object and counter name.  On the cluster you can then use 'statistics catalog' from 'set -priv advanced' mode to get the descriptions for those counters.
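For example, something along these lines (a sketch from memory; exact object and counter names vary by ONTAP release, so check the command's own help output):

```
cluster1::> set -privilege advanced
cluster1::*> statistics catalog counter show -object volume
cluster1::*> statistics catalog counter show -object volume -counter read_latency
```

The first command lists all counters for an object; adding -counter narrows it to one counter's description.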

 

At a high level: 'frontend' is work measured at the node that owns the LIF that serviced the IO, while 'backend' is work measured at the node that owns the volume that serviced the IO.  If you have 100% direct traffic they should be [nearly] the same, but if you have a lot of indirect traffic they will differ and answer different questions.
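As a toy illustration of that relationship (hypothetical numbers and function names, not real ONTAP counters): with direct traffic the LIF node and the volume node are the same, so the two measurements roughly coincide; with indirect traffic the frontend view additionally includes the cluster interconnect transit.

```python
# Toy model of frontend vs backend latency (hypothetical values,
# not actual ONTAP counters). Backend latency is measured at the
# node that owns the volume; frontend latency is measured at the
# node that owns the LIF and, for indirect IO, also includes the
# cluster interconnect transit time.

BACKEND_MS = 2.0       # latency measured at the volume-owning node
INTERCONNECT_MS = 0.5  # assumed transit cost for an indirect IO

def frontend_latency_ms(backend_ms: float, indirect: bool) -> float:
    """Frontend latency includes backend latency, plus transit if indirect."""
    return backend_ms + (INTERCONNECT_MS if indirect else 0.0)

# Direct IO: frontend and backend measurements [nearly] agree.
print(frontend_latency_ms(BACKEND_MS, indirect=False))  # 2.0
# Indirect IO: frontend is higher because it includes the extra hop.
print(frontend_latency_ms(BACKEND_MS, indirect=True))   # 2.5
```

This is only a mental model, but it shows why a gap between the frontend and backend graphs is a hint that a lot of your traffic is indirect.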

 

Rather than these views, though, I really like using the QoS-sourced statistics shown on the SVM and volume detail pages, as they give you a latency breakdown by component in the cluster.  Check this other post where I explain a little more about interpreting these graphs.

 

If you have follow-up questions, or a graph you'd like explained, fire away!

 

  

Cheers,
Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

 

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!

 

 

 

Re: Interpreting Grafana Graphs


wow @madden, thank you for the informative post. You've given me a lot of great information to go on and I really appreciate it.

 

The issue that brought me here was that we are seeing streaming backups causing high CIFS frontend latency on the node that hosts the CIFS LIF. We also found that the source volume (hosted on another node) was reporting similarly high "qos_latency_from_frontend" stats.

 

At first I was wondering where the latency is coming from: is it the CIFS LIF node, or the node hosting the volume? However, after reading your post, am I correct in saying that the latency is coming from the CIFS LIF node and the volume graph is merely reporting on that? I came to this conclusion because the volume would be considered 'backend', as it relates to the "node that owns the volume that serviced the IO", and it wouldn't be considered 'frontend' in this scenario given that the CIFS LIF is hosted on a different node. How does that sound?

 

EDIT: I'm just sifting through the statistics catalog at the moment and it is fantastic! Though I'm having trouble locating the "qos" entries. I've had a look at the Grafana metrics hoping they would point me in the right direction, but I'm not having any luck yet.

Re: Interpreting Grafana Graphs

Hi @Storage001

 

If you see high "qos_latency_from_frontend" on a volume then I would check CPU utilization on the node that owns the LIF that is servicing the traffic.  If CPU is generally high, then queueing for CPU could be the cause of your latency.  If total CPU is not high, or you only see the CPU saturation at the same time as the workload increase, then you might be encountering a bug. Usually frontend processing is much lighter than backend processing.  To research this further I would open a support case and provide a perfstat captured during the workload so the engineer can tell you exactly where the latency is coming from on the frontend node.

 

For frontend and backend (not the QoS breakdown, but the rows on the node dashboard page, for example): frontend is the work and latency measured from the moment an IO arrives at the LIF until the reply is sent back out on it, whereas backend is the work and latency from the time the volume received the IO until it answered back.  So frontend latency will indeed include backend latency as well.  In the QoS breakdown data, by contrast, latency is separated out per service center, i.e. per potential 'stop' in the cluster.
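As a sketch of what that per-service-center separation buys you (hypothetical names and values; Harvest's actual qos_latency_from_* metrics are derived from ONTAP QoS counters, not computed like this):

```python
# Sketch of a QoS-style latency breakdown. Total observed latency is
# the sum of time spent at each "service center" (each potential
# 'stop' in the cluster), so the dominant entry points at the
# bottleneck. Names and numbers here are illustrative only.

service_centers_ms = {
    "frontend": 0.3,  # network/protocol processing on the LIF node
    "cluster": 0.5,   # interconnect transit for indirect IO
    "data": 1.2,      # data processing on the volume-owning node
    "disk": 3.0,      # time spent waiting on disk
}

total_ms = sum(service_centers_ms.values())
print(f"total latency: {total_ms:.1f} ms")  # total latency: 5.0 ms

bottleneck = max(service_centers_ms, key=service_centers_ms.get)
print(f"bottleneck: {bottleneck}")  # bottleneck: disk
```

This is why the QoS breakdown answers "where is the time going?" directly, while the plain frontend number only tells you the end-to-end total.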

 

Great that you like the statistics catalog details.  It is true that the qos_ counters are not listed there.  The raw statistics tracked by the cluster have to be manipulated and summarized by Harvest, so for these there is no 1:1 match to the statistics catalog as there is for most other counters.  For these I would read the link in my original response, as I explain some of these counters there.

 

Cheers,
Chris Madden


Re: Interpreting Grafana Graphs

Thanks for another informative reply @madden.

 

I should have mentioned in my previous post that CPU utilisation does not spike at all during the issue. What we do see is the "slow at the network processing node" alert that OCPM raises, as well as up to 2.5 seconds of read latency on the node that owns the CIFS LIF. A ticket was logged with Commvault and they suggested it might be due to the slow SATA disks being used, and asked us to put the data on faster disks. However, as you've suggested, I'll log a support ticket and send a perfstat through.

 

Thank you again for your time and detailed responses Chris. It is very much appreciated.

Re: Interpreting Grafana Graphs

Hi @Storage001

 

If the SATA disks were to blame then in the QoS latency breakdown you should see lots of latency from disk, but you don't: you see it on the frontend node.  A NetApp support case with a perfstat is definitely the way to go on this one.  And also, thank you for the kind words!

 

Cheers,

Chris

 

 

 

Re: Interpreting Grafana Graphs

I completely agree @madden. Disk latency is 4.9 ms, so it is definitely not the issue.

 

And no worries at all about the kind words, they're very well deserved. 
