Tech ONTAP Blogs
Monitoring the performance of a solution is crucial for your entire infrastructure and usually necessitates a dedicated platform.
Amazon FSx for NetApp ONTAP (FSx for ONTAP) can be monitored using different platforms and applications, but the most relevant information can be effectively monitored using Amazon CloudWatch.
Amazon CloudWatch gathers and processes raw data from FSx for ONTAP in order to create readable, near real-time metrics. These statistics are stored for 15 months, allowing you to access historical information to evaluate your file system’s performance.
In this post, I’ll cover the basics of using Amazon CloudWatch metrics to monitor FSx for ONTAP performance and discuss some conclusions you can draw from those metrics.
Read on, here is what we’ll cover:
When configuring a cloud solution, it's important to consider the performance and total cost of ownership (TCO) requirements in order to select the right FSx for ONTAP throughput capacity. Let’s focus on the standard, first-generation, Multi-AZ option for now, which offers six throughput capacities: 128MBps, 256MBps, 512MBps, 1GBps, 2GBps, and 4GBps. These numbers represent the maximum throughputs in megabytes per second (MBps) or gigabytes per second (GBps) from the SSD.
In summary, there are three key limits that define FSx for ONTAP performance:
The AWS documentation has all the details on FSx for ONTAP limits. There are two main tables, each with separate sections for Single-AZ and Multi-AZ deployments, in both first-generation and second-generation variants.
The first table pertains to FSx for ONTAP deployed in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), and Ireland (eu-west-1). The second table is for all other regions.
This means that you have different limits based on the region for the network, and you also have different Burst limits, which I’ll explain later. However, the Disk Baseline limits are the same across all regions.
In the Amazon CloudWatch demonstration below I will use the throughput capacity of 256MBps in a Multi-AZ deployment, which is highlighted in the table below (though check the FSx for ONTAP performance guide for the latest numbers):
Please take note of the following details:
Baseline: This represents the minimum throughput that you can expect from your file system.
Burst: FSx for ONTAP can burst to higher speeds for short periods of time, for both network and disk throughput. It uses a credit mechanism to allocate throughput and input/output operations per second (IOPS) based on average utilization. Note that not all throughput capacities support bursting; bursting is available for 128MBps and 256MBps in all regions, and for 512MBps in select regions.
Burst can be utilized to manage workload spikes, but it isn’t suitable for capacity planning. For that reason, baseline values must be used.
Let’s review the initial table for a Multi-AZ deployment with 256MBps throughput capacity. It's important to note that it comes with the following limits:
These limits play a crucial role in defining the scope of the solution and must be diligently utilized in sizing your FSx for ONTAP file system.
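To make the "size on baseline, not burst" rule concrete, here is a minimal sketch that picks the smallest throughput capacity covering a sustained workload. It is a simplification: it compares the workload against the nominal throughput capacity, treats 1GBps as 1024MBps, and adds a 20% headroom factor, all of which are assumptions made for illustration. The authoritative baseline numbers are the ones in the AWS tables.

```python
# Throughput capacities listed in this post for first-generation Multi-AZ, in MBps.
# Treating 1GBps as 1024MBps is an assumption made for this sketch.
THROUGHPUT_CAPACITIES_MBPS = [128, 256, 512, 1024, 2048, 4096]


def smallest_capacity_for(sustained_mbps, headroom=0.2):
    """Return the smallest listed capacity that covers the sustained workload
    plus headroom, or None if no listed capacity is large enough.

    Burst is deliberately ignored, because burst is not a planning number.
    """
    required = sustained_mbps * (1 + headroom)
    for capacity in THROUGHPUT_CAPACITIES_MBPS:
        if capacity >= required:
            return capacity
    return None


print(smallest_capacity_for(200))  # 200 * 1.2 = 240, so 256 is enough
print(smallest_capacity_for(230))  # 230 * 1.2 = 276, so the next step up is 512
```

The headroom value is a judgment call; the point is only that the comparison is made against a sustained figure, never against a burst figure.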
Amazon CloudWatch is an AWS monitoring and alerting service which can be used to monitor a wide range of AWS services, including FSx for ONTAP. It collects and processes raw data from an FSx for ONTAP file system into readable metrics, which are retained for 15 months.
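If you prefer to pull the same data programmatically, the sketch below shows one way to query a file-system-level metric with boto3's `get_metric_data`. It assumes the `AWS/FSx` namespace, the `FileSystemId` dimension, and the `DataReadBytes` metric name; the file system ID is a made-up placeholder, and the helper names are mine. Treat it as a starting point rather than a finished tool.

```python
from datetime import datetime, timedelta, timezone


def build_fsx_query(file_system_id, metric_name, stat="Sum", period_seconds=60):
    """Build one MetricDataQueries entry for a file-system-level FSx metric."""
    return {
        "Id": "m1",  # CloudWatch requires the Id to start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/FSx",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
            },
            "Period": period_seconds,
            "Stat": stat,
        },
        "ReturnData": True,
    }


def fetch_last_hour(file_system_id, metric_name):
    """Fetch the last hour of datapoints. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the query builder works without it

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[build_fsx_query(file_system_id, metric_name)],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
    result = response["MetricDataResults"][0]
    return list(zip(result["Timestamps"], result["Values"]))


# Hypothetical file system ID, for illustration only.
query = build_fsx_query("fs-0123456789abcdef0", "DataReadBytes")
print(query["MetricStat"]["Metric"]["MetricName"])
```

Calling `fetch_last_hour` returns a list of (timestamp, value) pairs, where each value is the sum of bytes read during that one-minute period.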
CloudWatch provides information related to used and available capacity, storage efficiencies, and many other metrics. This demonstration will focus on file system-related and volume-related performance metrics and dashboards, specifically.
You can find here a CloudFormation template to deploy an AWS CloudWatch dashboard for monitoring FSx for ONTAP systems. The dashboard offers comprehensive insights into your FSx for ONTAP resources, helping you monitor performance, track metrics, and manage alarms efficiently.
Let’s get our hands dirty by logging into the AWS console and accessing the CloudWatch dashboards. Remember that for the test, I used a 256MBps-capable Multi-AZ file system.
I’m excited to show you the set of CloudWatch dashboards I collected.
Here's a look at the client IOPS dashboard that can be found by clicking on the “Monitoring & performance" tab under the “Summary” section.
In the Summary section, you'll find two important client-related statistics: Total client throughput and Total client IOPS.
These stats provide crucial information about the traffic between all clients and the FSx for ONTAP file system, specifically regarding the protocol workload (NFS, CIFS/SMB, S3 or iSCSI), and traffic not related to disk/SSD.
See here the Total Client IOPS view:
You may notice a significant increase in IOPS, going all the way up to around 13,000. Since the 256MBps throughput capacity file system is limited to 12,000 IOPS, you might be asking yourself “How can I be getting 13,000 operations per second?”
As mentioned above, this chart is showing protocol operations, not disk operations, which is what the 12,000 IOPS limit is referring to. So, it is very likely that some of these operations are being served from the cache and therefore not requiring any disk I/Os.
The other client-side metric is Total client throughput:
It shows the sum of bytes read and bytes written from the clients in MBps.
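Both client-side charts can be reproduced from raw per-period sums. The sketch below assumes you have already retrieved the `Sum` statistic for the read, write, and metadata counters over a fixed period (the FSx metric names I have in mind are `DataReadBytes`, `DataWriteBytes`, `DataReadOperations`, `DataWriteOperations`, and `MetadataOperations`). It also assumes 1MB = 1,000,000 bytes, which you can change through the parameter. The input numbers are made up.

```python
def total_client_throughput_mbps(read_bytes, write_bytes, period_seconds,
                                 bytes_per_mb=1_000_000):
    """Sum of bytes read and written in the period, expressed as MB per second."""
    return (read_bytes + write_bytes) / period_seconds / bytes_per_mb


def total_client_iops(read_ops, write_ops, metadata_ops, period_seconds):
    """All protocol operations in the period, expressed as operations per second."""
    return (read_ops + write_ops + metadata_ops) / period_seconds


# Made-up sums for a single 60-second period.
throughput = total_client_throughput_mbps(9_000_000_000, 3_000_000_000, 60)
iops = total_client_iops(500_000, 200_000, 80_000, 60)
print(throughput)  # 200.0
print(iops)        # 13000.0
```

Because these are protocol operations, the IOPS figure can legitimately exceed the disk IOPS limit whenever part of the workload is served from cache.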
Now let's explore the Average latency dashboard. It provides easy and simple-to-understand read, write, and metadata breakdowns.
Note: Metadata in CloudWatch represents SMB/NFS metadata operations, like getting the last modification time of a file, and other operations.
This graph represents the Average FSx for ONTAP file system latency, but be aware that it is possible for a client’s latency to be slightly higher, due to the added network path.
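An average latency figure of this kind is simply total time spent divided by the number of operations. Here is a minimal sketch, assuming you have the summed operation time in seconds and the operation count for the same period (for reads, I believe the relevant FSx metrics are `DataReadOperationTime` and `DataReadOperations`, but verify the names against the AWS documentation). The sample values are invented.

```python
def average_latency_ms(total_operation_time_seconds, operation_count):
    """Average time per operation in milliseconds, or None when there were no operations."""
    if operation_count == 0:
        return None
    return total_operation_time_seconds / operation_count * 1000


# Invented sample: 12.5 seconds of total read time across 25,000 reads.
read_latency = average_latency_ms(12.5, 25_000)
print(round(read_latency, 3))  # 0.5
```

Returning `None` for an idle period avoids a division by zero and keeps quiet intervals from being plotted as zero latency.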
FSx for ONTAP is designed to provide sub-millisecond latency, but there are some cases where we expect the latency to be higher. One such case is having a tiering policy set to “All” on a volume, where all the data is hosted on the capacity tier. The question is how to identify that scenario in a CloudWatch dashboard. Let’s see an example:
As you can see in the image above, the read latency is very high, around 100 milliseconds per operation. Now let’s see what the contributing factor is.
The answer is something we’ll find in the volume dashboard. Let’s find out which volume is the culprit here.
Moving to the Volumes tab, look for a volume with the tiering policy set to “ALL” in the Tiering policy column, as shown here:
Let’s review the performance statistics for that volume. It appears that a significant number of reads were served from the capacity pool when the latency was at its highest.
Utilization dashboards
We have now arrived at the core of this article: the dashboards concerned with your overall utilization. These CloudWatch dashboards will be the go-to places to understand your FSx for ONTAP file system’s current utilization.
Note that these graphs show the percentage of utilization of the specific resource based on the amount that resource has been provisioned. Doing it this way makes it easy to see if you’re being limited by an under-provisioned resource.
For example, if the IOPS utilization is at 100% for an extended amount of time, then you can increase the maximum number of IOPS via the AWS console to increase performance. Note, however, that this won’t be possible if you’re already at the maximum allowable IOPS for that file system.
Let’s look at the key dashboards in the graphics below (notice that the time frames are not the same). These dashboards can be found by going to the Monitoring & performance tab, then navigating to the Performance section. Let's first look at the Network Throughput chart:
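What counts as "an extended amount of time" is up to you, but the check itself is easy to automate. The sketch below takes a list of utilization percentages, one per datapoint, and returns the longest run of consecutive datapoints at or above a threshold. The function name and the sample series are hypothetical.

```python
def longest_saturated_run(utilization_percent, threshold=100.0):
    """Length of the longest run of consecutive datapoints at or above threshold."""
    longest = current = 0
    for value in utilization_percent:
        if value >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Invented one-minute datapoints for IOPS utilization.
samples = [62.0, 100.0, 100.0, 100.0, 97.5, 100.0]
run = longest_saturated_run(samples)
print(run)  # 3
```

With one-minute datapoints, a result of 3 means the resource was saturated for three minutes in a row. A short run like that is a spike; a run covering most of the window suggests the resource is under-provisioned.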
As you can see, network utilization goes up to 319% (!!) then it drops down to 127%.
Now let’s look at disk throughput:
This measures disk utilization in terms of resource utilization percentage. It goes up to 180%, then it drops to ~85%.
Now, when you look at the graphs, two questions might come to mind:
Let me dispel any doubts and fill you in on the answers. To do that, we need to revisit our initial FSx for ONTAP configuration.
If you recall, for this test I used a 256MBps throughput capacity Multi-AZ deployment with 3,072GB of SSD capacity, provisioned IOPS of 9,216, and a maximum network throughput of 300MBps.
The percentages are calculated based on the deployed configuration. For instance, if CloudWatch shows disk throughput at 180%, it means it's at 180% of 256MBps, which is equivalent to roughly 461MBps. Similarly, if CloudWatch displays 319% for network utilization, it means you are at 319% of 300MBps, i.e., 957MBps. I hope this clarifies the first question.
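The conversion from a utilization percentage back to an absolute number is a one-liner. Here is a small sketch that applies it to the peak values read off the two charts; the function name is mine.

```python
def utilization_to_mbps(utilization_percent, limit_mbps):
    """Convert a utilization percentage into absolute throughput in MBps."""
    return utilization_percent / 100 * limit_mbps


disk_peak = utilization_to_mbps(180, 256)     # disk throughput chart
network_peak = utilization_to_mbps(319, 300)  # network throughput chart
print(round(disk_peak, 1))     # 460.8
print(round(network_peak, 1))  # 957.0
```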
In response to “How can utilization exceed 100%?,” remember what was said about baseline and burst. When utilization exceeds 100%, the burst comes into play, as can be seen in the graph below, where I used disk throughput as an example.
I want to explore exactly how burst credits kick in, but that will have to wait for an upcoming post.
Here’s another example of usage numbers exceeding 100%. If you look at the disk IOPS utilization, you will find two metrics: one in the file system performance section and another in the disk performance section:
You may notice that both show different utilization percentages for the same data. How is it possible to have different IOPS utilizations for the same SSD disks?
Let’s review to understand how the utilization is calculated. In the first graph under the “File system performance” section, the percentage utilization is calculated based on the throughput capacity maximum disk IOPS, which is 12,000 for 256MBps throughput capacity.
55.22 % of 12,000 IOPS = 6,626 IOPS.
Meanwhile, for the second graph displayed under the disk performance section, the percentage utilization is based on the provisioned IOPS, which is 9,216 IOPS.
71.90 % of 9,216 IOPS = 6,626 IOPS
So, to conclude, the disk IOPS are the same in both graphs. The only difference is the limit that the percentage is calculated against.
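You can verify this yourself by going the other way: start from the absolute IOPS figure and divide it by each limit. This is a minimal sketch using the numbers from this test system.

```python
def utilization_percent(observed_iops, iops_limit):
    """Observed IOPS as a percentage of the given limit."""
    return observed_iops / iops_limit * 100


observed = 6_626  # disk IOPS during the test

# File system performance section: limit is the throughput capacity maximum.
print(round(utilization_percent(observed, 12_000), 2))  # 55.22
# Disk performance section: limit is the provisioned IOPS.
print(round(utilization_percent(observed, 9_216), 2))   # 71.9
```

The same 6,626 IOPS reads as about 55% in one chart and about 72% in the other, which is why it matters to know which limit a chart uses before drawing conclusions from it.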
Does a brand new FSx for ONTAP system come with any predefined CloudWatch alert definitions? Yes, you’ll have a number of different alerts that will trigger under specific circumstances (though note that they are not configurable):
When encountering any of these situations, it's essential to consider your next steps carefully. Instead of jumping to immediately scale up your deployment, thoroughly evaluate the following:
Armed with this information, we can confidently make informed decisions about the proper course of action. For example, if it was a one-time event with no performance impact, exclusively affecting network utilization, you won’t have to remediate but it’s imperative to record the incident for future reference.
You can also refer to the AWS main page for further information on this topic.
With this article I hope I have provided enough information to allow you to monitor your FSx for ONTAP deployment in a more effective way using CloudWatch.
Feel free to access the GitHub repository for more information or useful tools to easily deploy CloudWatch.
For more information, visit the Amazon FSx for NetApp ONTAP page.