Tech ONTAP Blogs
From event-driven architectures to operational analytics and user-facing services, Apache Kafka underpins much of today’s streaming infrastructure. Many organizations run it on Amazon Web Services (AWS) to benefit from managed infrastructure, AWS Regional scale, and native storage options.
Even with this cloud elasticity, Kafka's tightly coupled architecture means storage and replication costs can rise sharply as workloads scale. Diskless Kafka deployments have emerged in response: compute scales independently from storage, which improves elasticity and reduces long-term storage costs. But running diskless Kafka across multiple AWS Availability Zones (AZs) introduces a new challenge: preserving low write latency without incurring high cross-AZ data transfer charges.
AutoMQ is a cloud-based, drop-in replacement for Apache Kafka designed to address this challenge by providing faster scalability, reduced costs, and improved performance at the storage layer. AutoMQ now supports Amazon FSx for NetApp ONTAP (FSx for ONTAP) as a file system service, which adds the trusted capabilities of NetApp® ONTAP® technologies to the solution.
In this blog post, we show how this new approach makes it possible to run diskless Kafka across multiple AZs and sidesteps the traditional trade-off between cost and latency, delivering sub-10 ms writes across AZs.
Read on as we cover:
The cost-latency trade-off of multi-AZ Kafka streaming on AWS
How AutoMQ addresses the write-ahead log challenge using FSx for ONTAP
AutoMQ now supports FSx for ONTAP
The benefits of AutoMQ integration with FSx for ONTAP for diskless Kafka deployments
Explore AutoMQ with FSx for ONTAP
Production Kafka deployments on AWS typically balance three non-negotiable requirements:
Multi-AZ architectures address the availability requirement, but they also amplify storage and network costs when traditional Kafka replication patterns are applied. Because each broker bundles its own storage, adding brokers for resilience or throughput means replicating data across all of them, multiplying storage and cross-AZ transfer costs.
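As a rough illustration of that multiplication, the sketch below estimates how much replication traffic crosses AZ boundaries in a traditional deployment. It assumes every follower replica sits in a different AZ from the leader, so each produced byte is copied once per follower. The function names and the price parameter are illustrative placeholders, not AWS figures.

```python
def cross_az_replication_gb(ingest_gb: float, replication_factor: int) -> float:
    """Estimate GB of replication traffic that crosses AZ boundaries.

    Assumption: each follower replica lives in a different AZ from the
    leader, so every produced byte is copied (replication_factor - 1)
    times across AZs.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return ingest_gb * (replication_factor - 1)


def replication_transfer_cost(ingest_gb: float, replication_factor: int,
                              price_per_gb: float) -> float:
    """Multiply the cross-AZ volume by a caller-supplied price per GB."""
    return cross_az_replication_gb(ingest_gb, replication_factor) * price_per_gb


if __name__ == "__main__":
    # 1,000 GB produced with three replicas: 2,000 GB copied across AZs.
    print(cross_az_replication_gb(1000, 3))
```

Under these assumptions the cross-AZ volume grows linearly with both ingest and replica count, which is why replication-heavy designs become expensive at scale.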
Diskless Kafka designs attempt to rebalance this equation through compute-storage decoupling. In these architectures, brokers remain stateless while durable data moves to shared storage services. This approach improves elasticity and reduces long-term storage costs, particularly for workloads with large data retention windows.
Here you can see AutoMQ’s diskless Kafka architecture with a shared write-ahead log (WAL):
However, decoupling compute and storage shifts pressure onto the WAL. While bulk data can move to higher-latency object storage, the WAL can’t. It must commit every write synchronously before acknowledgment, making its latency the dominant factor in producer performance.
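To make the synchronous commit concrete, here is a minimal sketch of a file-backed WAL that returns an offset to the caller only after the record has been flushed to stable storage. The `FileWal` class and its record format are hypothetical and are not AutoMQ's implementation; the point is simply that the `os.fsync` call sits directly on the producer's critical path.

```python
import os
import struct


class FileWal:
    """Toy write-ahead log: length-prefixed records in one append-only file.

    Offsets start at 0 for each new instance; a real log would recover the
    next offset from the existing file contents.
    """

    def __init__(self, path: str) -> None:
        # O_APPEND makes every write land at the current end of the file.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._next_offset = 0

    def append(self, payload: bytes) -> int:
        """Persist one record and return its offset only after fsync."""
        record = struct.pack(">I", len(payload)) + payload
        written = os.write(self._fd, record)
        if written != len(record):
            raise OSError("short write to WAL")
        os.fsync(self._fd)  # the producer's acknowledgment waits on this call
        offset = self._next_offset
        self._next_offset += 1
        return offset

    def close(self) -> None:
        os.close(self._fd)
```

Whatever latency the underlying storage adds to that flush is latency every producer sees, which is why the choice of WAL storage matters so much.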
Because of the WAL challenge, running diskless Kafka on AWS has meant making a structural trade-off in multi-AZ deployments:
These options forced teams to choose between fast but expensive multi-AZ designs and affordable architectures that limit Kafka to latency-tolerant workloads. AutoMQ support for FSx for ONTAP changes that.
AutoMQ is one way companies have addressed the WAL requirements of diskless Kafka storage. AutoMQ’s cloud-native tiered storage architecture improves WAL behavior in multi-AZ Kafka deployments by redefining where the WAL lives and how it is shared.
AutoMQ decouples the WAL from both local broker disks and object storage, placing the WAL acceleration layer on shared, AWS Regional storage, such as FSx for ONTAP, while continuing to use Amazon Simple Storage Service (Amazon S3) for long-term log retention. This design keeps write latency low while brokers remain stateless and long-term data scales cost-efficiently on Amazon S3.
This architecture enforces ordering and durability at the shared storage layer rather than through broker-to-broker replication.
Here you can see the AutoMQ multi-AZ architecture using FSx for ONTAP as a shared write-ahead log:
AutoMQ now supports FSx for ONTAP as a shared WAL layer. FSx for ONTAP is a fully managed storage service accessible from multiple AZs, giving Kafka brokers a shared write coordination point without broker-to-broker replication.
The combined solution of AutoMQ with FSx for ONTAP provides:
With FSx for ONTAP serving as the WAL, brokers append records sequentially and commit writes before acknowledging producers. AutoMQ then flushes data asynchronously in batches to Amazon S3 for long-term retention. This separation keeps write latency independent of retention scale.
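The write path described above can be modeled in a few lines. This is a simplified sketch, not AutoMQ code: the `WritePath` class, its in-memory `wal` list, and the `object_store` dictionary are stand-ins for the shared WAL and Amazon S3, and the flush is called explicitly rather than from a background task so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class WritePath:
    """Toy model: acknowledge from the WAL, upload to object storage later."""

    batch_size: int = 3
    wal: list = field(default_factory=list)           # stand-in for the shared WAL
    object_store: dict = field(default_factory=dict)  # stand-in for Amazon S3
    next_offset: int = 0
    uploaded_upto: int = 0  # first offset not yet in object storage

    def produce(self, record: bytes) -> int:
        """Append to the WAL and acknowledge; no object-store call here."""
        offset = self.next_offset
        self.wal.append((offset, record))
        self.next_offset += 1
        return offset

    def flush_once(self) -> int:
        """Upload up to batch_size WAL records as one object, then trim the WAL."""
        batch = self.wal[: self.batch_size]
        if not batch:
            return 0
        first, last = batch[0][0], batch[-1][0]
        self.object_store[f"segment-{first}-{last}"] = [rec for _, rec in batch]
        del self.wal[: len(batch)]
        self.uploaded_upto = last + 1
        return len(batch)
```

Because `produce` never touches the object store, the amount of data retained there has no effect on acknowledgment time in this model.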
On the read path, recent data is served directly from the WAL on FSx for ONTAP for trailing reads. When consumers fall behind and need older segments, reads shift to Amazon S3, where higher latency is acceptable because such reads are less frequent and not on the hot path.
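A read router for this split can be sketched as a single comparison. The function below is hypothetical and assumes the broker tracks the oldest offset still held in the WAL (`wal_start_offset`) and the end of the log (`log_end_offset`); neither name comes from AutoMQ.

```python
def choose_read_tier(requested_offset: int, wal_start_offset: int,
                     log_end_offset: int) -> str:
    """Pick the storage tier that should serve a fetch request.

    Offsets at or after wal_start_offset are still in the WAL (tailing
    reads); anything older has already been moved to object storage.
    """
    if requested_offset < 0 or requested_offset > log_end_offset:
        raise ValueError("offset out of range")
    if requested_offset >= wal_start_offset:
        return "wal"
    return "object_store"


if __name__ == "__main__":
    print(choose_read_tier(950, wal_start_offset=900, log_end_offset=1000))
    print(choose_read_tier(100, wal_start_offset=900, log_end_offset=1000))
```

A consumer that keeps up with producers stays on the `"wal"` branch; only a consumer that has fallen behind the WAL's oldest offset pays the object-storage latency.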
In an FSx for ONTAP multi-AZ deployment, FSx for ONTAP operates as a high-availability (HA) pair across two AZs and synchronously replicates data at the storage layer. That means ordering and durability for the WAL are preserved through the storage service, rather than requiring broker-level replication to create multiple in-cluster copies of the same data.
This changes the durability model in a practical way: Brokers in different AZs can fail over and continue operating by re-attaching to the same shared WAL, with consistency handled at the storage layer.
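The failover idea can be illustrated with a small sketch. It assumes a shared WAL visible to every broker and a marker (`uploaded_upto`) recording how far the failed broker had uploaded to object storage. The data layout and the `take_over` function are illustrative assumptions, not AutoMQ's recovery protocol.

```python
from typing import Dict, List, Tuple

# partition name -> ordered list of (offset, record) still held in the WAL
SharedWal = Dict[str, List[Tuple[int, bytes]]]


def take_over(partition: str, shared_wal: SharedWal,
              uploaded_upto: int) -> Tuple[int, List[bytes]]:
    """Return (next_offset, pending_records) for a broker taking over.

    pending_records are entries that are durable in the shared WAL but were
    not yet uploaded to object storage by the failed broker. No data is
    copied between brokers; the new owner only reads the shared log.
    """
    entries = shared_wal.get(partition, [])
    pending = [rec for off, rec in entries if off >= uploaded_upto]
    next_offset = entries[-1][0] + 1 if entries else uploaded_upto
    return next_offset, pending


if __name__ == "__main__":
    wal = {"orders-0": [(5, b"a"), (6, b"b"), (7, b"c")]}
    print(take_over("orders-0", wal, uploaded_upto=6))
```

In this model the surviving broker needs only the shared log and one marker to resume, which is the practical meaning of brokers being stateless.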
Note that AutoMQ uses multiple AZs both to balance load and to stay available during AZ disruptions. The multi-AZ FSx for ONTAP configuration extends that protection against AZ disruptions to the WAL.
Because the WAL is shared and multi-AZ by design, AutoMQ avoids broker-to-broker replication on the write path while keeping brokers stateless. This supports rapid scale-out and scale-in without partition reassignment or data migration and keeps write coordination within the storage layer rather than the broker layer. It also speeds up recovery of failed brokers, which is an important factor when running large-scale production workloads.
With FSx for ONTAP and AutoMQ, there is no forced trade-off between cost efficiency and low latency. The architectural shift is a purpose-built Regional WAL that preserves diskless Kafka characteristics while supporting production-grade operation in multi-AZ deployments.
The shared WAL architecture described above translates directly into operational advantages for production Kafka deployments. Teams gain multi-AZ resilience without the traditional performance penalties or cost multipliers, a combination that opens Kafka to workloads previously constrained by infrastructure trade-offs.
Let’s explore how the architecture translates to clear benefits for performance, cost efficiency, scalability, and operational simplicity.
The synchronous replication carried out by FSx for ONTAP at the storage layer eliminates the need for broker-to-broker coordination on writes, keeping producer acknowledgments on a low-latency path even across AZs.
Benchmarks for AutoMQ with FSx for ONTAP show sub-10 ms write latency in multi-AZ deployments, with p99 at 17.5 ms and end-to-end latency at 28 ms under high-throughput workloads. These performance characteristics approach local-disk performance for real-time use cases while maintaining multi-AZ fault tolerance.
Since brokers append to a shared FSx for ONTAP-based WAL rather than replicating data between themselves, cross-AZ data transfer costs are eliminated.
The WAL footprint on FSx for ONTAP remains relatively fixed, storing only active log data, while Amazon S3 scales independently for retention. This separates fixed WAL costs from variable storage costs.
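The fixed-versus-variable split can be expressed as a simple cost model. The sketch below is an assumption-laden illustration: every parameter, including both per-GB prices, is a placeholder supplied by the caller rather than a published AWS or NetApp rate.

```python
def monthly_storage_cost(wal_capacity_gb: float, wal_price_per_gb: float,
                         ingest_gb_per_day: float, retention_days: int,
                         object_price_per_gb: float) -> dict:
    """Split storage spend into a fixed WAL part and a retention-driven part.

    Assumption: the WAL is provisioned at a constant size, while the object
    store holds roughly ingest_gb_per_day * retention_days of data.
    """
    wal_cost = wal_capacity_gb * wal_price_per_gb
    retained_gb = ingest_gb_per_day * retention_days
    object_cost = retained_gb * object_price_per_gb
    return {"wal": wal_cost, "object": object_cost,
            "total": wal_cost + object_cost}


if __name__ == "__main__":
    short = monthly_storage_cost(100, 2, 50, 7, 1)
    longer = monthly_storage_cost(100, 2, 50, 14, 1)
    # Doubling retention changes only the object-storage line.
    print(short, longer)
```

In this model, lengthening retention moves only the object-storage term, while the WAL term stays constant.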
FSx for ONTAP shared storage aligns naturally with the stateless broker model presented by AutoMQ, allowing brokers to scale in or out in seconds without data migration or repartitioning.
Throughput and capacity scale independently: FSx for ONTAP provides consistent WAL performance while broker count adjusts to workload demand. This supports dynamic scaling patterns in production without service interruption.
Eliminating local disks from brokers removes the operational overhead associated with disk provisioning, balancing, and recovery. Brokers can fail over and resume operation by re-attaching to the shared WAL, with durability and consistency handled at the storage layer.
Also, FSx for ONTAP HA pairs provide consistent behavior across AZs, simplifying failure scenarios and reducing operational complexity compared to architectures that rely on broker-managed replication and recovery logic.
Learn more about these benefits in the benchmark testing published by AutoMQ.
For teams running event-driven architectures, streaming analytics, or user-facing services on AWS, the trade-off between multi-AZ resilience and sub-10 ms latency no longer exists.
AutoMQ with FSx for ONTAP is available today to support organizations that require real-time Kafka performance across AZs without the replication costs or operational complexity of traditional diskless deployments.
Ready to transform your Kafka infrastructure?