Kafka in KRaft Mode on Amazon FSx for NetApp ONTAP (FSxN)
Introduction
Apache Kafka, a widely used open-source distributed event streaming platform, is renowned for its high throughput, low latency, and scalability. It is employed by numerous companies for diverse applications such as high-performance data pipelines, streaming analytics, data integration, and IoT (Internet of Things) message passing, among others.
The scalability of Kafka is attributed to its distributed nature, allowing for the addition of more brokers to manage increasing load. Furthermore, Kafka can process large volumes of data at high speeds, making it suitable for IoT devices and real-time/big data applications.
Kafka's flexibility extends to its storage configuration. It can be paired with a variety of storage options available in the market, such as EBS, NVMe, and FSxN-ONTAP.
After evaluation, FSxN-ONTAP emerges as a superior choice for Kafka storage, catering to both short-term and long-term requirements. It excels in terms of availability, durability, reliability, resilience, and fault-tolerance, making it ideal for enterprise mission-critical production systems.
Performance Benchmark
This benchmark uses Apache Kafka configured as a 3-node cluster running in KRaft mode (without ZooKeeper). The study evaluates throughput and synchronization across three storage options: EBS, NVMe, and FSxN-ONTAP. Configuration details and benchmark results for each storage option are given below.
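As an illustration of the 3-node KRaft layout described above, the following sketch renders a minimal server.properties for each node. The property names are standard Kafka KRaft settings, but the hostnames, ports, and the /mnt/fsxn mount path are hypothetical placeholders, not the configuration used in this benchmark. The sketch also assumes each broker gets its own subdirectory on the shared mount, since brokers cannot share a log directory.

```python
# Sketch only: hostnames, ports, and mount path are assumptions for illustration.
NODES = {
    1: "kafka-1.example.internal",
    2: "kafka-2.example.internal",
    3: "kafka-3.example.internal",
}
BROKER_PORT = 9092
CONTROLLER_PORT = 9093
LOG_DIR = "/mnt/fsxn/kafka-logs"  # assumed NFS mount of an FSxN-ONTAP volume


def quorum_voters(nodes):
    """Build the controller.quorum.voters value as 'id@host:port' entries."""
    return ",".join(
        f"{node_id}@{host}:{CONTROLLER_PORT}" for node_id, host in sorted(nodes.items())
    )


def render_properties(node_id, nodes, log_dir):
    """Render server.properties text for one combined broker/controller node."""
    host = nodes[node_id]
    props = {
        "process.roles": "broker,controller",
        "node.id": str(node_id),
        "controller.quorum.voters": quorum_voters(nodes),
        "listeners": f"PLAINTEXT://:{BROKER_PORT},CONTROLLER://:{CONTROLLER_PORT}",
        "advertised.listeners": f"PLAINTEXT://{host}:{BROKER_PORT}",
        "controller.listener.names": "CONTROLLER",
        "inter.broker.listener.name": "PLAINTEXT",
        "listener.security.protocol.map": "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT",
        # One subdirectory per broker on the shared file system.
        "log.dirs": f"{log_dir}/node-{node_id}",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    for node_id in sorted(NODES):
        print(f"# --- node {node_id} ---")
        print(render_properties(node_id, NODES, LOG_DIR))
```

A production setup would normally use TLS or SASL listeners instead of PLAINTEXT; plaintext is used here only to keep the sketch short.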

The data presented above, including numbers and charts, represents the typical workflow in enterprise Kafka production systems, in which data moves from client machines to Kafka brokers and then to storage disks (Client Machines -> Kafka Brokers -> Storage Disks).
The typical flow from Kafka to storage (Kafka -> Storage) is shown below. The IOPS and throughput (TPS) benchmarks for this path were run with the FIO tool, which tests directly from the Kafka hosts to the storage disks and so provides a realistic measure of storage performance.
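To make the FIO step concrete, here is a minimal sketch of how such a test could be assembled and summarized. The fio options used are standard ones, but the job name, block size, queue depth, and other parameter values are assumptions chosen for illustration; they are not the settings used to produce the results in this document.

```python
import json


def fio_command(directory, rw="randwrite", block_size="4k", size="1G",
                jobs=4, iodepth=32, runtime_s=60):
    """Build an fio argument list for a time-based test against `directory`."""
    return [
        "fio",
        "--name=kafka-storage-test",   # hypothetical job name
        f"--directory={directory}",
        f"--rw={rw}",
        f"--bs={block_size}",
        f"--size={size}",
        f"--numjobs={jobs}",
        f"--iodepth={iodepth}",
        "--ioengine=libaio",
        "--direct=1",                  # bypass the page cache
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
        "--output-format=json",
    ]


def summarize(fio_json_text):
    """Sum IOPS and bandwidth (KiB/s) over all jobs in fio's JSON report."""
    report = json.loads(fio_json_text)
    totals = {"read_iops": 0.0, "write_iops": 0.0, "read_bw_kib": 0, "write_bw_kib": 0}
    for job in report["jobs"]:
        for direction in ("read", "write"):
            totals[f"{direction}_iops"] += job[direction]["iops"]
            totals[f"{direction}_bw_kib"] += job[direction]["bw"]
    return totals
```

The argument list can be passed to subprocess.run(cmd, capture_output=True, text=True) on a broker host, and the captured stdout handed to summarize. Running the same command against each storage mount keeps the comparison like-for-like.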

Security
Security is a paramount concern for everyone, particularly when it comes to data. Data primarily resides in two states: in transit or at rest.
- For data in transit, it is the responsibility of the client or application to implement customized serialization/deserialization (serde) techniques. These techniques should include client-specific custom encryption and decryption to secure the payload.
- As for data at rest, industry-standard Identity and Access Management (IAM) authentication and authorization policies are in place. These policies work in conjunction with the respective cloud or storage provider to ensure your data is safe and secure at rest, adhering to industry-standard protocols. By default, Key Management Service (KMS) keys are used to encrypt and decrypt file systems, including both data and metadata. There is also the option to use your own customer managed keys (CMKs) for further control over your data.
Cost
The chart below provides an idea of the yearly on-demand infrastructure cost for the configurations. Among all the options, FSxN-ONTAP stands out as the most cost-effective choice when compared to other storage options. You can choose between FSxN-ONTAP Single-AZ or Multi-AZ, depending on your High Availability (HA) and long-term storage needs. FSxN-ONTAP offers numerous benefits such as data auto-tiering and compression, snapshots, etc.
Upon closer examination of the pricing, you will notice that as the number of nodes/machines in your cluster increases, the cost of Amazon Elastic Block Store (EBS) grows significantly. This is because EBS disks are tightly attached to instances, meaning each machine requires a dedicated EBS disk. On the other hand, FSxN-ONTAP volumes are network shared file systems, allowing a single FSxN to be attached to multiple instances, which can result in cost savings.
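The scaling argument above can be sketched as a small cost model. All prices and sizes passed in are placeholders that the reader must replace with current figures from the AWS pricing pages; nothing below is a real AWS price, and the model ignores IOPS charges, backups, and data-reduction savings.

```python
def ebs_yearly_cost(nodes, gb_per_node, price_per_gb_month):
    """One dedicated volume per broker, so cost grows linearly with node count."""
    return nodes * gb_per_node * price_per_gb_month * 12


def shared_fs_yearly_cost(total_gb, price_per_gb_month,
                          throughput_mbps, price_per_mbps_month):
    """One shared file system: cost depends on total capacity and provisioned throughput."""
    monthly = total_gb * price_per_gb_month + throughput_mbps * price_per_mbps_month
    return monthly * 12


def compare(nodes, gb_per_node, ebs_price, fs_total_gb, fs_price,
            fs_throughput_mbps, fs_throughput_price):
    """Return both yearly costs so they can be tabulated per cluster size."""
    return {
        "nodes": nodes,
        "ebs": ebs_yearly_cost(nodes, gb_per_node, ebs_price),
        "shared_fs": shared_fs_yearly_cost(
            fs_total_gb, fs_price, fs_throughput_mbps, fs_throughput_price
        ),
    }
```

Calling compare for several node counts while holding the shared file system size fixed shows where, under the chosen placeholder prices, the per-node EBS cost overtakes the shared option.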

Pros & Cons
EBS:
- EBS volumes are non-ephemeral, meaning they are permanent or long-lasting.
- EBS is a block-level storage solution used with the EC2 cloud service to store persistent data. The data remains on the AWS EBS servers even if the EC2 instances are shut down or fail.
- EBS volumes are tightly coupled with EC2 instances. If you have N instances, you will end up with N EBS volumes. This is a 1-1 mapping, unlike FSxN-ONTAP disks which can be shared across multiple instances.
- IOPS and TPS are closely tied to their respective machines.
- EBS charges are based on storage capacity, TPS, and IOPS.
- EBS may not be the most cost-effective solution for long-term data storage compared to FSxN-ONTAP, which offers superior data management capabilities such as data deduplication, compression, tiering, and at-rest security.
- Node failures and horizontal or vertical scale-up and scale-down events incur network traffic and the associated charges, and rebalancing partitions adds significant latency while the data is moved; with FSxN this can be avoided.
NVMe:
- NVMe instance store volumes are ephemeral: they exist only for the lifetime of the instance they are attached to.
- NVMe disks are tightly coupled with instances.
- Data on NVMe disks is non-persistent: it survives a reboot but is not available after the instance is stopped or terminated.
- NVMe provides high-performance, quick storage access, supporting over 1M IOPS.
- Data on NVMe disks is lost if the underlying disk fails or if the instance is stopped, hibernated, or terminated.
- NVMe SSD-based devices are more expensive than standard devices.
- Older systems may not support NVMe, making it difficult to upgrade storage systems.
- NVMe is not typically preferred for production systems due to its ephemeral nature.
FSxN-ONTAP:
- FSxN-ONTAP is non-ephemeral, suitable for permanent or long-lasting data.
- FSxN-ONTAP is a network shared file system for storing persistent data. ONTAP volumes can be mounted by multiple instances, and the data remains available even if instances fail or shut down.
- IOPS and TPS are shared across machines within the same region.
- FSxN-ONTAP offers popular data management capabilities like snapshots, replication (SnapMirror), cloning (FlexClone), data compression/deduplication, and data tiering with Multi-AZ support.
- Data on FSxN-ONTAP can be accessed from various environments via industry-standard NFS (Network File System), SMB, and iSCSI protocols.
- FSxN-ONTAP provides the flexibility to select the exact storage capacity, throughput, and IOPS as per your needs.
- ONTAP charges are based on the storage capacity, TPS, and IOPS that you select.
- FSxN-ONTAP is a preferred option for enterprise production workloads and mission-critical applications, especially if you intend to retain your data for more than 7 days for use cases such as KSQL, which avoids the need for a separate database.
- No network traffic or associated charges are incurred on node failures or horizontal/vertical scale-up, and no data is moved for partition rebalancing, because FSxN volumes are a shared file system, unlike EBS. There is no separate in-region charge for the front-end compute nodes; it is all rolled into the FSxN throughput cost.
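The data-movement point in the EBS and FSxN-ONTAP lists can be put into rough numbers. The sketch below estimates how long it takes to re-copy a replaced broker's partition data when each broker owns its own disk. The function name and the usable_fraction parameter (the share of the link left over for replication traffic) are assumptions introduced for illustration only.

```python
def rereplication_seconds(data_gib, link_gbit_per_s, usable_fraction=0.5):
    """Estimate seconds needed to copy `data_gib` GiB over a network link.

    usable_fraction is the assumed share of link bandwidth available for
    replication while the cluster keeps serving producers and consumers.
    """
    if link_gbit_per_s <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("bandwidth must be positive and usable_fraction in (0, 1]")
    data_bits = data_gib * 1024**3 * 8
    usable_bits_per_s = link_gbit_per_s * 1e9 * usable_fraction
    return data_bits / usable_bits_per_s


if __name__ == "__main__":
    # Hypothetical broker holding 2048 GiB, on a 10 Gbit/s link, half of it usable.
    hours = rereplication_seconds(2048, 10, 0.5) / 3600
    print(f"estimated re-replication time: {hours:.2f} h")
```

With shared FSxN volumes the document's claim is that this copy is avoided, because a replacement broker can mount the existing data directory instead of rebuilding it over the network.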
Conclusion
After evaluating key factors such as cost, high availability (HA), security, and disaster recovery (DR) scenarios, among other considerations, it is concluded that FSxN-ONTAP is the optimal choice.
Recommendations:
- For use cases such as typical core Kafka message passing with short-term retention requirements, especially for small scale or non-production systems, the FSxN-ONTAP Single-AZ deployment is recommended.
- For more complex use cases, like message passing with long-term retention needs, KSQL, quick business analysis on raw data, machine learning, and others, the FSxN-ONTAP Multi-AZ deployment is suggested. This option helps avoid the need for additional databases to store data over extended periods.
- Small-scale customers can benefit from FSxN by provisioning only the bandwidth they need, which matches EBS; large-scale customers can match or beat EBS on cost and performance by sizing the number of nodes to what the business demands in network and storage.
Please note that the choice between Single-AZ and Multi-AZ deployment should be made based on specific use cases and requirements, as both options have their own advantages and trade-offs.
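For the retention-driven recommendations above, a short sizing sketch may help. It converts a retention period in days into a value for Kafka's retention.ms topic setting and estimates the raw storage that retention implies. The ingest rate and replication factor are inputs the reader supplies; the estimate ignores compression, index files, and any storage-side data reduction.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
SECONDS_PER_DAY = 24 * 60 * 60


def retention_ms(days):
    """Convert a retention period in days to a retention.ms value."""
    if days <= 0:
        raise ValueError("days must be positive")
    return int(days * MS_PER_DAY)


def required_storage_bytes(ingest_bytes_per_s, days, replication_factor):
    """Rough raw capacity needed to hold `days` of data at a steady ingest rate."""
    return ingest_bytes_per_s * days * SECONDS_PER_DAY * replication_factor


if __name__ == "__main__":
    # Hypothetical workload: 5 MB/s ingest, 30-day retention, replication factor 3.
    print("retention.ms =", retention_ms(30))
    tb = required_storage_bytes(5_000_000, 30, 3) / 1e12
    print(f"approximate raw storage: {tb:.1f} TB")
```

Kafka's default broker retention is 168 hours (7 days), so any longer period has to be set explicitly, and the storage estimate grows linearly with it.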
Acronyms
- EBS – Elastic Block Store: This is a service provided by Amazon Web Services (AWS) that provides raw block-level storage that can be attached to Amazon EC2 instances.
- NVMe – Non-Volatile Memory Express: This is a protocol for accessing high-speed storage media that brings many advantages compared to legacy protocols.
- FSxN – Amazon FSx for NetApp ONTAP: This is a combination of two different services. FSx is a family of fully managed file storage services provided by AWS. ONTAP is data management software from NetApp.
- ONTAP – Open Network Technology for Appliance Products: ONTAP is a proprietary data storage management operating system from NetApp.
- TPS – Throughput per Second: This measures the amount of data processed per second.
- IOPS – I/O operations per second: This is a common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid state drives (SSD), and storage area networks (SAN).
- CMK – Customer Master Key: This is the older AWS term for a KMS key. In the context of AWS, a CMK can be either a customer managed key or an AWS managed key.
- KSQL – Kafka Structured Query Language: This is a SQL-like streaming query language for Apache Kafka, a distributed streaming platform.
- KMS – Key Management Service: A feature provided by cloud services like AWS for creating and controlling encryption keys. It is unrelated to the Kerberos KDC.
- KDC – Key Distribution Center: The KDC is the part of the Kerberos protocol that authenticates networked users.
- ZK – ZooKeeper: Apache ZooKeeper, a service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- NFS – Network File System: This is a distributed file system protocol that allows a user on a client computer to access files over a network much as local storage is accessed.
- SMB – Server Message Block (also known as CIFS, the Common Internet File System): SMB is a network protocol for sharing access to files, printers, and serial ports.
- iSCSI – Internet Small Computer System Interface: This is a network protocol that allows clients (called initiators) to send SCSI commands (CDBs) to SCSI storage devices (targets) on remote servers.
- PR – Produce Rate: The rate at which data is produced in a system.
- CR – Consumer Rate: The rate at which data is consumed in a system.
- FIO – Flexible I/O Tester.
Further References
https://aws.amazon.com/ebs/general-purpose/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html
https://aws.amazon.com/fsx/netapp-ontap/
https://aws.amazon.com/ebs/pricing/
https://aws.amazon.com/fsx/netapp-ontap/pricing/
https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/what-is-fsx-ontap.html
https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/getting-started.html
https://www.datacore.com/blog/availability-durability-reliability-resilience-fault-tolerance/
https://fio.readthedocs.io/en/latest/fio_doc.html
https://openmessaging.cloud/docs/benchmarks/