Tech ONTAP Blogs

The Performance Optimization Journey of Google Cloud NetApp Volumes Block Storage


By Mohammad Hossein Hajkazemi, Bhushan Jain, Arpan Chowdhry

 

Introduction

Google Cloud NetApp Volumes (GCNV) is a fully managed, cloud-native storage service for Google Cloud that delivers high-performance, enterprise-grade storage with advanced data management features. NetApp’s ONTAP storage operating system underpins GCNV as a unified solution that supports both file and block access, delivering robust enterprise-grade data management capabilities over standard NAS and SAN protocols.

 

In this phase of the solution, we are introducing an iSCSI block service under the GCNV Flex service level. Benchmarking, performance analysis, and continuous improvement are critical for cloud storage customers to ensure their workloads are served with the lowest possible latency and the highest possible throughput. This document outlines the core principles of storage system performance analysis, details the bottlenecks we encountered, and describes the optimizations implemented to enhance the performance of the GCNV block service.

 

Evaluating the Performance of a Storage System

When assessing the performance of a storage system, it’s crucial to understand how it behaves with different types of workloads. Storage systems exhibit varying performance characteristics depending on the size and pattern of data access. To capture these variations succinctly and meaningfully, we rely on a set of standardized microbenchmarks known as the 4-corner microbenchmarks.

The 4-corner microbenchmarks focus on four fundamental types of I/O operations that represent the extremes or "corners" of typical storage workloads:

  1. 64KiB Sequential Read
  2. 64KiB Sequential Write
  3. 8KiB Random Read
  4. 8KiB Random Write

These benchmarks test the system’s ability to handle large, contiguous data transfers (sequential) as well as small, scattered data accesses (random), for both reading and writing. Sequential operations involve reading or writing large blocks of data in a continuous stream. This pattern is common in media streaming, backups, and large file transfers. Random operations involve accessing small blocks of data scattered across the storage medium, typical in database queries, virtual machine disk operations, and general-purpose file sharing workloads.
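To make these four workloads concrete, the sketch below expresses them as jobs for the open-source fio tool (our own testing uses the proprietary sio tool, described later). The target device path, queue depth, and runtime are placeholder assumptions for illustration only.

```python
# Illustrative sketch: the four corner workloads expressed as fio jobs.
# sio is NetApp-proprietary, so the open-source fio tool is used here as a
# stand-in; the target path, queue depth, and runtime are placeholder
# assumptions. Note that the write corners overwrite the target device.
import subprocess

TARGET = "/dev/sdX"      # placeholder block device (e.g., an iSCSI LUN)
RUNTIME_SECS = 120       # assumed steady-state duration per corner

FOUR_CORNERS = {
    "seq_read_64k":  {"rw": "read",      "bs": "64k"},
    "seq_write_64k": {"rw": "write",     "bs": "64k"},
    "rand_read_8k":  {"rw": "randread",  "bs": "8k"},
    "rand_write_8k": {"rw": "randwrite", "bs": "8k"},
}

def run_corner(name, params):
    """Run one corner workload with direct I/O so the client page cache
    does not mask storage-side behavior."""
    cmd = [
        "fio", f"--name={name}", f"--filename={TARGET}",
        f"--rw={params['rw']}", f"--bs={params['bs']}",
        "--direct=1", "--ioengine=libaio", "--iodepth=16",
        "--time_based=1", f"--runtime={RUNTIME_SECS}",
        "--output-format=json",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    for name, params in FOUR_CORNERS.items():
        print(f"--- {name} ---")
        print(run_corner(name, params)[:200])  # print a snippet of each JSON result
```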

 

While storage workloads can be infinitely varied, the 4-corner microbenchmarks are sufficient for a high-level performance evaluation because:

  • They represent the extremes of access patterns: Most real-world workloads fall somewhere between purely sequential and purely random, and between large and small I/O sizes.
  • They capture key performance trade-offs: For example, a storage system might have excellent sequential throughput but poor random IOPS, or vice versa.
  • They simplify benchmarking: Instead of running dozens of tests with varying block sizes and access patterns, these four tests provide a manageable yet informative set of metrics.
  • They enable comparison: Using standardized benchmarks allows for apples-to-apples comparisons across systems and configurations.

 

Our Measurement Methodology

Evaluating the performance of a storage system requires a methodology that ensures reliable and representative results for real-world workloads. This evaluation focuses on the four fundamental microbenchmarks—64KiB sequential read and write, and 8KiB random read and write—to capture the key performance characteristics of the system.

 

Benchmarks are generated using sio, a NetApp proprietary tool similar to the industry-standard tool fio, which allows precise control over the rate of I/O operations per second (ops/s). The load is increased gradually in a controlled manner to observe how the system scales and where it reaches saturation. Multiple load-generating client instances are deployed, and each client runs multiple sio processes concurrently, simulating a distributed workload and enabling evaluation of multi-client performance and contention effects.
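A minimal sketch of this ramped, multi-process load generation is shown below, using fio as an open-source stand-in for sio. The process count, IOPS steps, queue depth, and target path are assumptions for illustration, not our actual test parameters.

```python
# Illustrative load-ramp sketch: step the offered load up in controlled
# increments, with several concurrent worker processes per client, and
# record achieved throughput and latency at each step. The process count,
# IOPS steps, queue depth, and target path are assumptions, not our actual
# test parameters.
import json
import subprocess

TARGET = "/dev/sdX"                  # placeholder iSCSI LUN
PROCESSES_PER_CLIENT = 8             # assumed concurrency per load generator
IOPS_STEPS = [5_000, 10_000, 20_000, 40_000, 80_000]   # offered load per step

def run_step(offered_iops):
    """Run one ramp step with the offered load split across worker processes."""
    per_proc = max(offered_iops // PROCESSES_PER_CLIENT, 1)
    out = subprocess.run(
        ["fio", f"--name=ramp_{offered_iops}", f"--filename={TARGET}",
         "--rw=randread", "--bs=8k", "--direct=1", "--ioengine=libaio",
         "--iodepth=8", f"--numjobs={PROCESSES_PER_CLIENT}",
         f"--rate_iops={per_proc}", "--time_based=1", "--runtime=60",
         "--group_reporting=1", "--output-format=json"],
        capture_output=True, text=True, check=True).stdout
    job = json.loads(out)["jobs"][0]
    achieved = job["read"]["iops"]
    latency_ms = job["read"]["clat_ns"]["mean"] / 1e6
    return achieved, latency_ms

if __name__ == "__main__":
    for step in IOPS_STEPS:
        iops, lat = run_step(step)
        print(f"offered={step:>7} ops/s  achieved={iops:>10,.0f} ops/s  latency={lat:.2f} ms")
```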

 

During the performance test, statistics are gathered continuously for different ONTAP sub-systems and cloud resources. Results are visualized by plotting ops-latency or MiB/s-latency curves, which show the relationship between achieved throughput and response time. Analyzing these curves alongside the collected statistics helps identify peak performance points and potential bottlenecks such as CPU saturation, network limitations, disk contention, or software limitations.

 

The measurement philosophy reflects a deep understanding of the storage system’s internal behavior. For sequential reads, the readahead engine pre-populates the cache by reading data ahead of requests. The benchmark ensures that this engine fetches data directly from the disk, measuring true backend throughput rather than benefiting from cached hits. For random reads, each request reads directly from the disk, bypassing cache effects to measure raw random-access performance. For both random and sequential writes, we overwrite existing data to guarantee actual write operations instead of simply appending or writing to empty space, which could skew the results.

 

Performance Summary

The following table presents a summary of the performance outcomes for the 4-corner methodology outlined previously. It includes results for both sequential and random operations. Sequential workload performance is reported in MiB/s, while random workloads are measured in Ops/s. Latency metrics were recorded at the GCNV nodes.

Workload                          Performance @ 1 ms   Performance Bottleneck
64KiB Sequential Read (MiB/s)     4,700                VM MiB/s entitlement limit
64KiB Sequential Write (MiB/s)    1,800                Journal replication latency
8KiB Random Read (Ops/s)          160,000              GCNV service limits
8KiB Random Write (Ops/s)         113,000              Journal replication latency
 

Enhancing Performance Throughout the Development Cycle

This section outlines the methods employed to enhance the performance of the various microbenchmarks.

 

Random Read Performance

NetApp Volumes benefit from an external cache (EC) that delivers superior performance for read workloads. The capacity of this cache is substantially greater than that of the buffer cache, and read operations served through it do not count towards VM disk entitlement limits, allowing significant scalability in read performance. However, a workload's working set can be significantly larger than the cache, causing some percentage of requests to be served by the disks. Since most customer dataset sizes fall somewhere in between, we covered both extremes: all requests served by the disk (100% cache miss) and all requests served by the cache (100% cache hit).
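To illustrate why both corner cases are useful, the sketch below approximates the latency of a partially cached workload as a weighted average of the 100% hit and 100% miss measurements. The latency values are hypothetical placeholders, not measured GCNV numbers.

```python
# Hypothetical illustration: once the two corner cases are measured, the
# expected latency for any intermediate hit ratio can be approximated as a
# weighted average of the cache-hit and cache-miss results. The latencies
# below are made-up placeholders, not measured GCNV values.
HIT_LATENCY_MS = 0.3    # assumed latency when served from the external cache
MISS_LATENCY_MS = 1.0   # assumed latency when served from the backend disk

def blended_latency(hit_ratio):
    """Weighted-average latency for a workload with the given cache hit ratio."""
    return hit_ratio * HIT_LATENCY_MS + (1.0 - hit_ratio) * MISS_LATENCY_MS

for ratio in (0.0, 0.5, 0.9, 1.0):
    print(f"hit ratio {ratio:.0%}: ~{blended_latency(ratio):.2f} ms")
```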

 

100% Cache Miss

When running a random read benchmark, our performance goal is to fully saturate either the storage IOPS entitlement or the compute. The initial 8KiB random read measurements showed lower-than-expected performance, as neither the storage nor the compute was saturated. Although additional disk IOPS were available, the maximum observed usage reached only 80% of that limit.

 

The first round of bottleneck analysis revealed that the thread pool serving disk operations was not sized adequately. While n disks were attached to the ONTAP cluster, only n/2 such threads were spawned. Increasing the thread pool size led to a noticeable improvement in performance, although the impact was somewhat limited. However, we were still unable to fully saturate either the disk IOPS or the compute, indicating that the bottleneck had shifted. This motivated us to continue investigating.
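The sketch below illustrates the general idea of sizing a worker pool to the number of attached disks rather than half of it. It is a generic Python analogy, not ONTAP's actual threading code, and the disk count is a placeholder.

```python
# Generic sketch (not ONTAP code): a worker pool that services per-disk I/O
# queues. Sizing the pool to the number of attached disks, rather than half
# of it, keeps every disk queue drained concurrently.
from concurrent.futures import ThreadPoolExecutor

NUM_DISKS = 16   # n attached disks (placeholder value)

def drain_disk_queue(disk_id):
    """Placeholder for a worker that issues and completes I/O for one disk."""
    return f"disk {disk_id} serviced"

# Before the fix the pool held only NUM_DISKS // 2 workers, so at most half
# of the disk queues could be drained at any moment. Sizing it to NUM_DISKS
# lets every queue make progress in parallel.
with ThreadPoolExecutor(max_workers=NUM_DISKS) as pool:
    results = list(pool.map(drain_disk_queue, range(NUM_DISKS)))
    print(f"{len(results)} disk queues serviced concurrently")
```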

 

Following this, we collaborated with the disk driver development team to address the remaining bottleneck. Through our joint efforts, we implemented an optimization that batches acknowledgments (ACKs) from the backend disk before relaying them to the upper-level stack, rather than processing each acknowledgment individually. This amortizes the overhead of interrupt handling over multiple ACKs. With this approach, ONTAP could fully saturate the disk, resulting in improved performance. As a result of all the measurements, analysis, and improvements, peak performance was enhanced by approximately 23% compared to the initial results.
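The following sketch captures the general shape of the ACK-batching optimization: completions are buffered and handed to the upper layer in batches so that the per-handoff overhead is amortized. It is a simplified analogy, not the driver code, and the batch size is an assumed value.

```python
# Generic sketch (not the actual driver code): completions from the backend
# disk are accumulated and handed to the upper layer in batches, so the cost
# of one handoff/interrupt is amortized over many ACKs. The batch size and
# the upper-layer callback are illustrative assumptions.
BATCH_SIZE = 32

class AckBatcher:
    def __init__(self, deliver_batch):
        self.pending = []
        self.deliver_batch = deliver_batch   # upper-layer completion handler

    def on_disk_ack(self, ack):
        """Called for every completion coming back from the backend disk."""
        self.pending.append(ack)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        """Hand all buffered ACKs to the upper layer in one call."""
        if self.pending:
            self.deliver_batch(self.pending)
            self.pending = []

# Usage: 100 completions result in only a handful of upper-layer handoffs.
handoffs = []
batcher = AckBatcher(deliver_batch=lambda batch: handoffs.append(len(batch)))
for i in range(100):
    batcher.on_disk_ack(i)
batcher.flush()   # a real implementation would also drain the tail on idle/timeout
print(f"{sum(handoffs)} ACKs delivered in {len(handoffs)} handoffs")
```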

 

100% Cache Hit

Our initial 8KiB random read benchmark measurement with the optimized external cache (EC) enabled showed strong performance, which was further improved by right-sizing the iSCSI LUN handler thread pool. However, after making a few code changes to improve other workloads, we noticed a significant performance regression. Our analysis indicated that the degradation was caused by scheduling overhead from newly enabled helper threads; this thread pool is meant to enhance sequential read performance when sequential blocks are read from the external cache. To address this, we disabled these threads while ensuring that sequential read performance was not adversely affected. Peak performance was improved by approximately 25%.

 

The Figure[1] below shows the performance improvement progression for random reads.

 

[Picture1.png: OPS-latency curves showing the random read performance improvement progression]

[1] Please refer to the Appendix section for an explanation of how the OPS-latency graphs are generated.

 

Sequential Read Performance

Although we expected no performance impact when running sequential reads with the external cache (backed by SSD or NVMe devices), we observed a performance degradation. We found that ONTAP uses small block sizes for external cache (EC) reads and writes, which makes large block operations inefficient due to I/O amplification.
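For example, if the external cache performed its I/O in 8KiB units (an assumed unit size, for illustration only), a single 64KiB sequential read would expand into eight cache operations, an 8x amplification in operation count compared with a single large backend read.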

 

The EC-Bypassing feature was introduced to let large read requests skip the external cache and is activated adaptively in response to defined response time thresholds. Since the read-ahead engine already pre-fetches large sequential reads, EC offers no extra benefit. After enabling bypassing, no notable improvement was seen. Reviewing EC-related runtime statistics showed that EC-Bypass was ineffective because EC response times remained in the acceptable range. The actual bottleneck stemmed from saturated interrupt threads processing the completion queue, not EC saturation. Offloading completions to helper threads improved performance and resolved the bottleneck, but this approach couldn't be adopted long-term as it caused performance regressions in other benchmarks.
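The sketch below illustrates the kind of adaptive bypass decision described above: large reads skip the external cache once its observed response time crosses a threshold. Both thresholds are placeholder assumptions rather than ONTAP's actual values.

```python
# Illustrative sketch of an adaptive bypass decision: large reads skip the
# external cache once the cache's observed response time crosses a
# threshold. The size and latency thresholds are placeholder assumptions,
# not ONTAP's actual values.
LARGE_READ_BYTES = 64 * 1024        # treat reads at/above this size as "large"
EC_LATENCY_THRESHOLD_MS = 0.5       # assumed acceptable cache response time

def should_bypass_external_cache(read_size_bytes, ec_avg_latency_ms):
    """Return True when a large read should go straight to the backend disk."""
    return (read_size_bytes >= LARGE_READ_BYTES
            and ec_avg_latency_ms > EC_LATENCY_THRESHOLD_MS)

# A 64KiB read with a healthy cache keeps using the cache path...
print(should_bypass_external_cache(64 * 1024, 0.2))   # False
# ...but bypasses it once the cache response time degrades.
print(should_bypass_external_cache(64 * 1024, 0.9))   # True
```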

 

Finally, we changed the local caching policy so that only randomly written and read data is cached; under the older policy, sequentially read and written data would also be cached. Changing the policy recovered the performance, as shown in the graph below, since almost all sequentially read data was prefetched by the readahead engine from the backend disk instead of the external cache. The OPS-latency graph below demonstrates how sequential read performance improved by more than 2x through the steps explained above.
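A minimal sketch of the revised admission policy is shown below. The access-pattern flag is assumed to be supplied by the I/O path, and the function is illustrative rather than the actual ONTAP policy code.

```python
# Illustrative sketch of the revised admission policy: only randomly
# accessed data is admitted to the external cache, while sequential I/O is
# left to the readahead engine and the backend disk. The access-pattern
# flag is assumed to be provided by the I/O path.
def admit_to_external_cache(is_sequential_io: bool) -> bool:
    """Revised policy: admit only randomly read or written blocks."""
    return not is_sequential_io

# Under the older policy, sequential reads and writes were also cached,
# which competed with readahead and caused I/O amplification.
print(admit_to_external_cache(is_sequential_io=True))    # False: bypass the cache
print(admit_to_external_cache(is_sequential_io=False))   # True: cache random I/O
```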

 

[Picture2.png: OPS-latency curves showing the sequential read performance improvement progression]

Sequential Write Performance

One observation was that the performance of our block measurements was slightly lower than that of the file protocol-based measurements. Our analysis indicated that this discrepancy is due to the higher latency of journal replication to the high-availability partner node. This motivated us to investigate parameter tuning, including the flow control settings for journal mirroring. Ultimately, our investigation revealed that the existing settings were appropriate.

 

During our investigation into the performance discrepancy between block and file protocol measurements, we identified an approximately 34% performance regression caused by recent changes in the file system cleaner module. After consulting with the file system team, we decided to revert and rework these changes. As a result, performance was restored. 

 

While investigating potential performance improvements, we implemented a few other enhancements based on our observations. Here are the two most important ones:

  1. We noticed that the journal provisioned size was lower than expected. This could negatively impact the performance of write workloads. We modeled and recommended the correct size to address this issue.
  2. Datasets with low compressibility had a redundant process that increased CPU cost per operation, which we addressed with an optimization.
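For the second item, the sketch below shows one generic way to avoid redundant per-operation work on poorly compressible data: estimate compressibility from a small sample and skip the extra processing when the expected savings are negligible. The sample size, threshold, and use of zlib are illustrative assumptions, not ONTAP's implementation.

```python
# Generic sketch (not ONTAP's implementation): estimate compressibility from
# a small sample and skip the extra per-operation processing when the data
# is unlikely to compress. The sample size and savings threshold are
# illustrative assumptions.
import os
import zlib

SAMPLE_BYTES = 4096
MIN_SAVINGS_RATIO = 0.10   # require at least ~10% estimated savings

def worth_compressing(block: bytes) -> bool:
    """Cheap up-front check that avoids redundant work on incompressible data."""
    sample = block[:SAMPLE_BYTES]
    compressed = zlib.compress(sample, level=1)
    savings = 1.0 - len(compressed) / max(len(sample), 1)
    return savings >= MIN_SAVINGS_RATIO

print(worth_compressing(b"A" * 8192))        # highly compressible -> True
print(worth_compressing(os.urandom(8192)))   # incompressible random data -> False
```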

 

Random Write Performance

Our initial 8KiB random write measurements indicated that the expected performance for this benchmark was met, and no improvements were necessary. As anticipated, journal replication latency was the performance bottleneck in this case.

 

Conclusion

The Google Cloud NetApp Volumes (GCNV) block service was evaluated using the 4-corner microbenchmarks, covering sequential and random read/write workloads. The evaluation revealed several performance bottlenecks, which were addressed through optimizations such as thread pool resizing, caching policy adjustments, and parameter tuning. These changes improved throughput and latency across workloads while maintaining balanced performance across the four corners.

 

Appendix

OPs-Latency Graph

The OPS-latency graphs shared in this document report the achieved throughput, in terms of either IOPS or MiB/s, along with the latency from ONTAP’s perspective. Clients might see higher latency depending on the connectivity to the ONTAP cluster. In these graphs, the x-axis shows the throughput and the y-axis shows the latency. Each data point on each curve represents one iteration of the performance measurement: the leftmost data point corresponds to the lowest offered load and the rightmost to the highest.
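A minimal sketch of how such a curve can be produced from per-iteration results is shown below; the (throughput, latency) pairs are made-up placeholders rather than measured data.

```python
# Illustrative sketch of generating an OPS-latency curve from per-iteration
# results. Each (throughput, latency) pair would come from one load step of
# the ramp; the values below are made-up placeholders, not measured data.
import matplotlib.pyplot as plt

# One (achieved ops/s, latency ms) pair per measurement iteration,
# ordered from lowest to highest offered load.
iterations = [
    (10_000, 0.25), (40_000, 0.35), (80_000, 0.50),
    (120_000, 0.75), (160_000, 1.00), (175_000, 2.10),
]

ops, latency_ms = zip(*iterations)
plt.plot(ops, latency_ms, marker="o")
plt.xlabel("Achieved throughput (ops/s)")
plt.ylabel("Latency (ms)")
plt.title("OPS-latency curve (illustrative data)")
plt.savefig("ops_latency_curve.png")
```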

 
