Tech ONTAP Blogs

Low-latency, multi-AZ diskless Kafka with AutoMQ and Amazon FSx for NetApp ONTAP

VarunS
NetApp

From event-driven architectures to operational analytics and user-facing services, Apache Kafka underpins much of today’s streaming infrastructure. Many organizations run it on Amazon Web Services (AWS) to benefit from managed infrastructure, AWS Regional scale, and native storage options.

 

Even with this cloud elasticity, Kafka's tightly coupled architecture means storage and replication costs can rise sharply as workloads scale. Diskless Kafka deployments have emerged in response: compute scales independently from storage, which improves elasticity and reduces long-term storage costs. But running diskless Kafka across multiple AWS Availability Zones (AZs) introduces a new challenge: preserving low write latency without incurring high cross-AZ data transfer charges.

 

AutoMQ is a cloud-based, drop-in replacement for Apache Kafka designed to address this challenge by providing faster scalability, reduced costs, and improved performance at the storage layer. AutoMQ now supports Amazon FSx for NetApp ONTAP (FSx for ONTAP) as a file system service, which adds the trusted capabilities of NetApp® ONTAP® technologies to the solution.

 

In this blog post, we show how this new approach makes it possible to run diskless Kafka across multiple AZs and sidesteps the traditional trade-off between cost and latency, delivering sub-10 ms writes across AZs.

 

Read on as we cover:

  • The cost-latency trade-off of multi-AZ Kafka streaming on AWS
  • How AutoMQ addresses the write-ahead log challenge using FSx for ONTAP
  • How AutoMQ works
  • AutoMQ now supports FSx for ONTAP
  • The benefits of AutoMQ integration with FSx for ONTAP for diskless Kafka deployments
  • Explore AutoMQ with FSx for ONTAP

 

The cost-latency trade-off of multi-AZ Kafka streaming on AWS

Production Kafka deployments on AWS typically balance three non-negotiable requirements:

 

  • Resilience across AZs
  • Cost efficiency at scale
  • Predictable low latency

 

Multi-AZ architectures address the availability requirement, but they also amplify storage and network costs when traditional Kafka replication patterns are applied. Because each broker bundles its own storage, adding brokers for resilience or throughput means replicating data across all of them, multiplying storage and cross-AZ transfer costs.
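
To make the multiplication concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a replication factor of 3 with one replica per AZ and producers spread evenly across three AZs. The function name, the throughput figure, and the model itself are illustrative assumptions, not AWS or AutoMQ numbers, and consumer traffic is ignored.

```python
def cross_az_gb_per_hour(ingress_mb_s: float,
                         replication_factor: int = 3,
                         az_count: int = 3) -> float:
    """Estimate cross-AZ traffic (GB/hour) for broker-replicated Kafka.

    Illustrative assumptions: each replica sits in a different AZ, and a
    producer shares an AZ with the partition leader with probability
    1 / az_count.
    """
    # Producer -> leader traffic crosses an AZ boundary unless both share an AZ.
    produce_fraction = 1 - 1 / az_count
    # Leader -> follower traffic: every follower copy crosses an AZ boundary.
    replicate_fraction = replication_factor - 1
    cross_az_mb_s = ingress_mb_s * (produce_fraction + replicate_fraction)
    return cross_az_mb_s * 3600 / 1024


# Hypothetical workload: 100 MB/s of producer ingress.
gb_per_hour = cross_az_gb_per_hour(100)
print(f"{gb_per_hour:.1f} GB/hour crosses AZ boundaries")
```

Under these assumptions, every megabyte a producer writes generates roughly 2.7 MB of cross-AZ traffic, so a 100 MB/s workload moves about 937.5 GB across AZ boundaries each hour. Multiply by the current per-GB inter-AZ rate for your Region to turn that into a cost.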

 

Diskless Kafka designs attempt to rebalance this equation through compute-storage decoupling. In these architectures, brokers remain stateless while durable data moves to shared storage services. This approach improves elasticity and reduces long-term storage costs, particularly for workloads with large data retention windows.

 

Here you can see AutoMQ’s diskless Kafka architecture with a shared write-ahead log (WAL):

 
 

[Figure: AutoMQ's diskless Kafka architecture with a shared WAL]

 

However, decoupling compute and storage shifts pressure onto the WAL. While bulk data can move to higher-latency object storage, the WAL can’t. It must commit every write synchronously before acknowledgment, making its latency the dominant factor in producer performance.

 

Because of the WAL challenge, running diskless Kafka on AWS has meant making a structural trade-off in multi-AZ deployments:

 

  • Object-storage-backed WALs using Amazon S3 remove replication overhead, but write latencies of over 150 ms restrict Kafka to latency-tolerant workloads such as observability pipelines or batch ingestion.
  • Local or block-based WALs using Amazon EBS deliver sub-10 ms latency, but multi-AZ deployments introduce replication traffic between brokers, increasing cross-AZ data transfer costs and operational complexity.

 

With only these options, teams had to choose between fast but expensive multi-AZ designs and affordable architectures that limited Kafka to latency-tolerant workloads. AutoMQ support for FSx for ONTAP changes that.

 

How AutoMQ addresses the write-ahead log challenge using FSx for ONTAP

AutoMQ is one way companies have addressed the WAL requirements of diskless Kafka storage. AutoMQ’s cloud-native tiered storage architecture improves WAL behavior in multi-AZ Kafka deployments by redefining where the WAL lives and how it is shared.

 

How AutoMQ works

AutoMQ decouples the WAL from both local broker disks and object storage, placing the WAL acceleration layer on shared, AWS Regional storage, such as FSx for ONTAP, while continuing to use Amazon Simple Storage Service (Amazon S3) for long-term log retention. This design keeps write latency low while brokers remain stateless and long-term data scales cost-efficiently on Amazon S3.

 

This architecture changes where ordering and durability are enforced: at the shared storage layer rather than through broker-to-broker replication.

Here you can see the AutoMQ multi-AZ architecture using FSx for ONTAP as a shared write-ahead log:

[Figure: AutoMQ multi-AZ architecture using FSx for ONTAP as a shared WAL]

 

 

 

AutoMQ now supports FSx for ONTAP

AutoMQ now supports FSx for ONTAP as a shared WAL layer. FSx for ONTAP is a fully managed storage service accessible from multiple AZs, giving Kafka brokers a shared write coordination point without broker-to-broker replication.

 

The combined solution of AutoMQ with FSx for ONTAP provides:

 

  • Improved write and read flow

With FSx for ONTAP serving as the WAL, brokers append records sequentially and commit writes before acknowledging producers. AutoMQ then flushes data asynchronously in batches to Amazon S3 for long-term retention. This separation keeps write latency independent of retention scale.

 

On the read path, recent data is served directly from the WAL on FSx for ONTAP for trailing reads. When consumers fall behind and need older segments, reads shift to Amazon S3, where higher latency is acceptable because it is less frequent and not on the hot path.
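
The write and read flow above can be sketched as a toy model in Python. This is not AutoMQ's implementation: a local file stands in for the WAL on FSx for ONTAP, a dictionary stands in for Amazon S3, and `ToyBroker`, `flush_threshold`, and the record framing are hypothetical names chosen for illustration.

```python
import os
import tempfile


class ToyBroker:
    """Toy model: sync WAL append, batched flush, two-tier reads."""

    def __init__(self, wal_path: str, flush_threshold: int = 2):
        self.flush_threshold = flush_threshold
        self.object_store = {}   # offset -> record (stand-in for object storage)
        self.wal_index = {}      # offset -> record still held in the WAL
        self.next_offset = 0
        self._wal = open(wal_path, "ab")

    def produce(self, record: bytes) -> int:
        """Append sequentially, force to stable storage, then acknowledge."""
        offset = self.next_offset
        self._wal.write(len(record).to_bytes(4, "big") + record)
        self._wal.flush()
        os.fsync(self._wal.fileno())   # commit happens before the ack
        self.wal_index[offset] = record
        self.next_offset += 1
        return offset                  # returning the offset models the ack

    def flush_to_object_store(self) -> int:
        """Background step: move a batch from the WAL to the object store."""
        if len(self.wal_index) < self.flush_threshold:
            return 0
        moved = len(self.wal_index)
        self.object_store.update(self.wal_index)
        self.wal_index.clear()
        self._wal.truncate(0)          # WAL space is reclaimed after upload
        return moved

    def fetch(self, offset: int):
        """Trailing reads hit the WAL; older offsets come from the object store."""
        if offset in self.wal_index:
            return "wal", self.wal_index[offset]
        return "object_store", self.object_store[offset]

    def close(self) -> None:
        self._wal.close()


# Demo with a temporary file as the "shared" WAL.
wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
broker = ToyBroker(wal_file, flush_threshold=2)
first = broker.produce(b"event-0")
source_before = broker.fetch(first)[0]   # served from the WAL
broker.produce(b"event-1")
moved = broker.flush_to_object_store()   # batch of two moves out
source_after = broker.fetch(first)[0]    # now served from the object store
broker.close()
```

The point of the sketch is the ordering: the producer's acknowledgment waits only on the WAL commit, never on the object-store upload, which is why write latency stays independent of how much data is retained.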

 

  • Multi-AZ durability without replication overhead

In an FSx for ONTAP multi-AZ deployment, FSx for ONTAP operates as a high-availability (HA) pair across two AZs and synchronously replicates data at the storage layer. That means ordering and durability for the WAL are preserved through the storage service, rather than requiring broker-level replication to create multiple in-cluster copies of the same data.

 

This changes the durability model in a practical way: Brokers in different AZs can fail over and continue operating by re-attaching to the same shared WAL, with consistency handled at the storage layer.

 

Note that AutoMQ uses multiple AZs both to balance load and to provide resilience against AZ disruptions. The multi-AZ FSx for ONTAP configuration extends that resilience to the WAL itself.
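
As a minimal sketch of what "re-attaching to the same shared WAL" could look like, the Python below has one broker append length-prefixed records to a file and a replacement broker replay them. The file stands in for the shared WAL, and `append_record`, `replay`, and the framing format are hypothetical; AutoMQ's actual recovery logic is not described in this post.

```python
import os
import struct
import tempfile


def append_record(wal_path: str, record: bytes) -> None:
    """Append one length-prefixed record and force it to stable storage."""
    with open(wal_path, "ab") as wal:
        wal.write(struct.pack(">I", len(record)) + record)
        wal.flush()
        os.fsync(wal.fileno())


def replay(wal_path: str) -> list:
    """Read back every complete record, ignoring a torn write at the tail."""
    records = []
    with open(wal_path, "rb") as wal:
        while True:
            header = wal.read(4)
            if len(header) < 4:
                break                      # clean end of log (or torn header)
            (length,) = struct.unpack(">I", header)
            body = wal.read(length)
            if len(body) < length:
                break                      # incomplete record: was never acked
            records.append(body)
    return records


# Demo: "broker A" writes two records and then disappears;
# "broker B" attaches to the same path and recovers them.
wal_path = os.path.join(tempfile.mkdtemp(), "shared_wal.log")
append_record(wal_path, b"order-1")
append_record(wal_path, b"order-2")
recovered = replay(wal_path)
```

Because the replacement broker rebuilds its state from shared storage, nothing has to be copied from the failed broker's disk, which is the property the durability model above relies on.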

 

  • Rapid scaling for stateless brokers

Because the WAL is shared and multi-AZ by design, AutoMQ avoids broker-to-broker replication on the write path while keeping brokers stateless. This supports rapid scale-out and scale-in without partition reassignment or data migration, and it keeps write coordination in the storage layer rather than the broker layer. It also speeds the recovery of failed brokers, an important factor when running large-scale production workloads.

 

With FSx for ONTAP as the WAL, AutoMQ no longer forces a trade-off between cost efficiency and low latency. The architectural shift is a purpose-built Regional WAL that preserves diskless Kafka characteristics while supporting production-grade operation in multi-AZ deployments.

 

The benefits of AutoMQ integration with FSx for ONTAP for diskless Kafka deployments

 

The shared WAL architecture described above translates directly into operational advantages for production Kafka deployments. Teams gain multi-AZ resilience without the traditional performance penalties or cost multipliers, a combination that opens Kafka to workloads previously constrained by infrastructure trade-offs.

 

Let’s explore how the architecture translates to clear benefits for performance, cost efficiency, scalability, and operational simplicity.

 

  • Performance: Near local-disk write latency in multi-AZ deployments

The synchronous replication carried out by FSx for ONTAP at the storage layer eliminates the need for broker-to-broker coordination on writes, keeping producer acknowledgments on a low-latency path even across AZs.

 

Benchmarks for AutoMQ with FSx for ONTAP show sub-10 ms write latency in multi-AZ deployments, with p99 at 17.5 ms and end-to-end latency at 28 ms under high-throughput workloads. These figures approach local-disk performance for real-time use cases while maintaining multi-AZ fault tolerance.

 

  • Cost optimization: Eliminating cross-AZ replication charges

Because brokers append to a shared FSx for ONTAP-based WAL rather than replicating data among themselves, the cross-AZ data transfer charges caused by broker-to-broker replication are eliminated.

 

The WAL footprint on FSx for ONTAP remains relatively fixed, storing only active log data, while Amazon S3 scales independently for retention. This separates fixed WAL costs from variable storage costs.
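
A small Python sketch can illustrate the difference between the two cost shapes. All prices below are placeholder values in arbitrary units per TB-month, not AWS pricing, and both function names are hypothetical.

```python
def replicated_block_cost(retained_tb: float,
                          replication_factor: int,
                          block_price_per_tb: float) -> float:
    """Broker-attached storage: every retained TB is stored once per replica."""
    return retained_tb * replication_factor * block_price_per_tb


def shared_wal_cost(retained_tb: float,
                    wal_fixed_cost: float,
                    object_price_per_tb: float) -> float:
    """Shared WAL plus object storage: a fixed WAL term plus a retention term."""
    return wal_fixed_cost + retained_tb * object_price_per_tb


# Placeholder inputs: block = 100/TB, object = 25/TB, WAL = 500 flat.
for tb in (10, 20, 40):
    coupled = replicated_block_cost(tb, 3, 100.0)
    decoupled = shared_wal_cost(tb, 500.0, 25.0)
    print(f"{tb} TB retained: coupled={coupled:.0f} decoupled={decoupled:.0f}")
```

With these made-up inputs, doubling retention doubles the whole bill in the coupled model, while in the decoupled model only the object-storage term grows and the WAL term stays flat. Substitute real prices for your Region and workload before drawing conclusions.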

 

  • Scalability and elasticity: Rapid scaling without data migration

FSx for ONTAP shared storage aligns naturally with the stateless broker model presented by AutoMQ, allowing brokers to scale in or out in seconds without data migration or repartitioning.

 

Throughput and capacity scale independently: FSx for ONTAP provides consistent WAL performance while broker count adjusts to workload demand. This supports dynamic scaling patterns in production without service interruption.

 

  • Operational simplicity: Stateless brokers and consistent multi-AZ failover

Eliminating local disks from brokers removes the operational overhead associated with disk provisioning, balancing, and recovery. Brokers can fail over and resume operation by re-attaching to the shared WAL, with durability and consistency handled at the storage layer.

 

Also, FSx for ONTAP HA pairs provide consistent behavior across AZs, simplifying failure scenarios and reducing operational complexity compared to architectures that rely on broker-managed replication and recovery logic.

 

Learn more about these benefits in the benchmark testing published by AutoMQ.

 

Explore AutoMQ with FSx for ONTAP

For teams running event-driven architectures, streaming analytics, or user-facing services on AWS, the trade-off between multi-AZ resilience and sub-10 ms latency no longer exists.

 

AutoMQ with FSx for ONTAP is available today for organizations that need real-time Kafka performance across AZs without the replication costs or operational complexity of traditional diskless deployments.

 

Ready to transform your Kafka infrastructure?

 

 
