Tech ONTAP Blogs
The scope of many migrate-to-the-cloud projects often ends after successful implementation. You scoped the effort, decided on a target architecture, sized the individual components, and moved the workload to the cloud. Everything works just fine. But a few months later you get paged out of the blue on a weekend because the application stopped working. Hectic troubleshooting ensues. While you try to get to the root cause of the issue, management is scheduling hourly status update calls that take 50 minutes each. It’s hard to make any progress in the remaining 10 minutes. Eventually you discover that a volume ran out of space and caused a lot of painful trouble.
Nobody likes these escalations. With proper monitoring of your infrastructure, you can catch issues long before they get serious. Adding alerting in addition to monitoring frees you from having to check the status periodically; instead, you get notified if metrics start to become critical.
Google built a centralized monitoring service into their cloud platform. Most services send their metrics to Cloud Monitoring, which provides flexible monitoring and alerting on those metrics. You can create all kinds of graphs from metrics, group them into dashboards, do SLO monitoring, and trigger alerts on user-defined criteria.
Google Cloud NetApp Volumes sends lots of useful metrics on resource usage and performance to Cloud Monitoring. The documentation contains a list of all available metrics. NetApp Volumes ships metric updates every 5 minutes. You can’t get more granular than that. Performance metrics like volume throughput, IOPS, and latency are averaged over the 5-minute period. When charting metrics, be sure to use a minimum interval of 5 minutes; otherwise, your charts will be incorrect. This article uses PromQL queries, which don’t have this issue.
Let’s look into a few use cases that will help you avoid running into operational issues.
A very common request is monitoring “disk space” to make sure that your workloads don’t run out of it.
NetApp Volumes exports the metric netapp.googleapis.com/volume/bytes_used to Cloud Monitoring. This metric tells you how many bytes are used inside the volume. Note that space used for snapshots is also counted as used space in a volume.
This metric is good to know, but it isn’t really helpful unless you also know how big the volume actually is. netapp.googleapis.com/volume/allocated_bytes gives you that number.
To get a usage percentage (“How full is my volume in percent?”) you can use Cloud Monitoring’s PromQL to do a simple calculation:
netapp_googleapis_com:volume_bytes_used / netapp_googleapis_com:volume_allocated_bytes * 100
If your volume usage is greater than 80%, it’s time to investigate. Is my workload growing, or do I have a rogue script or user filling up my volume with unwanted data? Do I have to delete data or grow the volume? Be careful when deleting data. If you use snapshots, deleted data will move into the snapshot and will be freed only when the snapshot eventually gets deleted.
You can use Metrics Explorer to run this PromQL query. Here’s an example of a very static demo environment:
You can see that the volume “okdata” in region northamerica-northeast1 is close to full at 99.43% utilization. It’s time to increase the size of this volume.
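If you monitor a whole fleet of volumes, you may only want to see the ones that need attention. Building on the same two metrics, you can add a threshold filter; the following sketch returns only the series above 80% usage and drops everything else:

# only return volumes that are more than 80% full
netapp_googleapis_com:volume_bytes_used / netapp_googleapis_com:volume_allocated_bytes * 100 > 80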
Inodes are another metric worth monitoring. Volumes are file systems. Every file or directory in a file system is stored in an internal data structure called an inode. Like all resources, inodes are not infinite. In NetApp Volumes, this limit increases with the volume size. If you run out of inodes, you can’t create new files or directories in the volume.
To monitor the used inodes in relation to the number of available inodes, the following PromQL query is helpful:
netapp_googleapis_com:volume_inode_used / netapp_googleapis_com:volume_inode_limit * 100
Again, if your usage grows beyond 80%, it’s time to plan further growth.
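If you monitor many volumes, a quick way to find the ones closest to inode exhaustion is to wrap the same calculation in topk(). The following sketch returns the five volumes with the highest inode utilization:

# five volumes with the highest inode utilization (in percent)
topk(5, netapp_googleapis_com:volume_inode_used / netapp_googleapis_com:volume_inode_limit * 100)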
Typical performance charts show throughput, IOPS, and latency over time for monitored volumes. Charts with metrics are easy to create with Cloud Monitoring. Metrics get updated every 5 minutes. Throughput, IOPS, and latency numbers in Cloud Monitoring are the average of a 5-minute window. Here’s an example:
Can you learn more from those metrics? Yes! You can use them to size volumes to better suit your workload requirements. Before learning how, you need to understand how volume performance in NetApp Volumes works.
NetApp Volumes performance is based on throughput; IOPS are a byproduct. With every additional GiB of capacity that you provision, you also get additional throughput bandwidth.
For the service levels Standard, Premium, and Extreme you get a bandwidth of 16, 64, and 128KiB/s per GiB of provisioned volume capacity. A 2TiB Premium volume therefore has a throughput capacity of 2TiB * 1024GiB/TiB * 64KiB/s/GiB = 131072KiB/s = 128MiB/s. The larger the volume, the more throughput it can drive. Or you can change the service level to Extreme to double the throughput while retaining the volume capacity. It boils down to capacity versus performance versus cost optimization.
For the Flex service level, you get 16KiB/s per GiB of provisioned storage pool capacity. Small volumes can use the full throughput capability of the storage pool but share it with other volumes in the same pool.
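You can turn this rule of thumb into a chartable number. The following PromQL sketch derives a volume’s throughput ceiling in MiB/s from its allocated capacity, assuming the Premium service level (64KiB/s per GiB); swap the factor 64 for 16 (Standard) or 128 (Extreme). It doesn’t apply to Flex, where the limit is set at the pool level:

# Premium throughput ceiling in MiB/s: allocated GiB * 64KiB/s per GiB, converted to MiB/s
netapp_googleapis_com:volume_allocated_bytes / 1024 / 1024 / 1024 * 64 / 1024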
In the example Volume Throughput chart, the throughput hit a ceiling at 6.25MiB/s and IOPS are stuck at 100. We can easily tell that the applied workload is doing 6.25MiB/s * 1024 KiB/MiB / 100 IO/s = 64KiB sized I/O operations, which is considered large block I/O.
But why is it stuck at 6.25MiB/s and 100 IOPS? This is a 100GiB Premium volume. Its throughput is limited to 64KiB/s per GiB of volume size, which translates to 100GiB * 64KiB/s/GiB = 6400KiB/s = 6.25MiB/s. The volume goes exactly as fast as I, the volume administrator, defined it. If I want it to go faster, I have to increase its size or change its service level to Extreme. Or maybe this was only a short burst, and it’s fine for the volume to go that slow in return for saving the cost of more capacity or a faster service level. NetApp Volumes gives you the flexibility to size according to your needs and constraints.
What if you have very large and very small volumes in your charts? The large volumes would dwarf the small ones, making their graphs look like noise. The best approach is to graph the actual volume throughput against the volume’s throughput capability:
Although this chart looks similar to the previous one, it shows the “performance usage” of a volume as a percentage. You can quickly see the performance usage of any volume without the differing throughput capacities of the volumes skewing the picture. If any volume runs hot too often, it’s time to redo the performance sizing.
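The query behind such a chart divides the measured throughput by the throughput ceiling derived earlier. Treat the following as a sketch for Premium volumes: the throughput metric name and its unit (KiB/s) are assumptions, so look up the exact metric in the NetApp Volumes metrics list before using it:

# sketch: performance usage in percent for Premium volumes
# netapp_googleapis_com:volume_throughput and its unit (KiB/s) are assumed; replace with the actual throughput metric from the documentation
netapp_googleapis_com:volume_throughput / (netapp_googleapis_com:volume_allocated_bytes / 1024 / 1024 / 1024 * 64) * 100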
As you can see in the previous examples, you will want to monitor a few NetApp Volume charts. Charts can be combined into dashboards, which are basically named groups of charts shown on one page. To make your life easier and start you off with some best practice charts, see the example dashboard for NetApp Volumes on GitHub.
This dashboard provides usage charts for volume capacity, inodes, and performance, along with a few other commonly used charts. Under the hood, a dashboard is just a JSON definition that Cloud Monitoring understands, which makes the example easy to import and use. You can use it as a baseline and customize it for your needs.
This blog post covers how to chart individual NetApp Volumes metrics with Cloud Monitoring and shares a starter dashboard you can import that contains baseline best practice charts. We may add new best practice charts, so check back occasionally.
In the next part of this blog post we will look into creating alerts to notify you if certain metrics or calculations exceed defined thresholds. Stay tuned.