In the ever-evolving landscape of artificial intelligence and machine learning (AI and ML), vector databases have emerged as foundational for enhancing the capabilities and performance of retrieval-augmented generation (RAG) systems. These specialized databases are designed to efficiently store, search, and manage vector embeddings: high-dimensional representations of data. Fast retrieval of relevant information significantly boosts the intelligence and responsiveness of RAG-based architectures.
Using vector databases in RAG is not merely a technical enhancement; it’s a paradigm shift. By enabling more nuanced and contextually aware retrievals, vector databases empower applications to generate responses that are grounded in the semantic meaning of the data. This leap in relevance is crucial for a wide range of applications, from natural language processing and conversational AI to personalized recommendations and beyond. And it marks a pivotal moment in our journey toward creating more intelligent, efficient, and human-centric AI systems.
In this blog post, we delve into the I/O characteristics of vector databases. Understanding these characteristics is pivotal for effectively using vector databases in RAG deployments, because they directly affect the performance, scalability, and efficiency of these systems.
Table of Contents
Lab setup
Infrastructure
Software
Benchmark: VectorDB-Bench
Vector databases
Methodology
Results and lessons learned
Results
Lessons learned
References
Lab setup
This section describes the lab setup for our study.
Infrastructure
The testbed includes a NetApp® AFF A800 HA pair running ONTAP® 9.14.1, a Fujitsu PRIMERGY RX2540-M4 running Ubuntu 22.04, and a Cisco switch providing 100GbE connectivity between host and storage.
The NetApp system was connected to the Cisco switch with four 100GbE links, and the Fujitsu host was connected with a single 100GbE link. To optimize performance from the single host, we configured 48 NetApp FlexVol® volumes, each containing one LUN, all mapped to the host over the iSCSI protocol.
On the host, the /etc/iscsi/iscsid.conf file was modified to increase the number of iSCSI sessions from one to four, and multipathd was enabled. A volume group was then established using these 48 LUNs, and a striped logical volume was created to support the XFS file system.
Software
This section outlines the configuration of the software stack that we used during our performance measurements.
Benchmark: VectorDB-Bench
VectorDB-Bench is a vector database benchmark tool designed for user-friendliness. It enables anyone to easily replicate tests or evaluate new systems, simplifying the selection process among numerous cloud and open-source providers.
VectorDB-Bench tests mimic real-world conditions, including data insertion and various search functions, using public datasets from actual production environments like SIFT, GIST, Cohere, and one generated by OpenAI.
Vector databases
Milvus
Milvus is a database that is engineered specifically for storing, indexing, and managing the vast amounts of embedding vectors generated by deep neural networks and other machine learning models. Designed to operate on a scale of billions of vectors, Milvus excels in handling embedding vectors derived from unstructured data, a task that traditional relational databases, which focus on structured data, cannot perform.
With the rise in the volume of unstructured data, such as emails, social media content, and IoT sensor data, Milvus can store this data in the form of vector embeddings. This ability allows it to measure the similarity between vectors and, by extension, between the data sources they originate from.
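To make that workflow concrete, here is a minimal sketch using the pymilvus MilvusClient against a local standalone instance. This is not how VectorDB-Bench drives the database; the collection name and random vectors are placeholders for real embeddings.

from pymilvus import MilvusClient
import numpy as np

# Connect to a local Milvus standalone instance (default port 19530).
client = MilvusClient(uri="http://localhost:19530")

# A simple collection of 768-dimensional embeddings; the name is illustrative.
client.create_collection(collection_name="demo_docs", dimension=768)

# Random vectors stand in for embeddings produced by an embedding model.
vectors = np.random.rand(3, 768).tolist()
client.insert(collection_name="demo_docs",
              data=[{"id": i, "vector": v} for i, v in enumerate(vectors)])

# Retrieve the two stored vectors most similar to a query embedding.
hits = client.search(collection_name="demo_docs",
                     data=[np.random.rand(768).tolist()],
                     limit=2)
print(hits)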
PostgreSQL pgvecto.rs extension
Pgvecto.rs is a PostgreSQL extension that enhances the relational database with vector similarity search capabilities. It is developed in Rust and builds on the framework provided by pgrx.
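For comparison, here is a rough sketch of what the equivalent looks like in PostgreSQL with pgvecto.rs, driven from Python with psycopg2. The connection string, table name, and tiny 3-dimensional vectors are placeholders, and the index statement follows the pgvecto.rs 0.2 documentation as we understand it; verify the exact syntax and options against the docs for your version.

import psycopg2

# Connection parameters are illustrative; adjust to your PostgreSQL deployment.
conn = psycopg2.connect("dbname=vectordemo user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vectors;")
cur.execute("CREATE TABLE IF NOT EXISTS items "
            "(id bigserial PRIMARY KEY, embedding vector(3) NOT NULL);")
cur.execute("INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]'), ('[0.9, 0.8, 0.7]');")

# Vector index on the embedding column using L2 distance
# (see the pgvecto.rs docs for index options such as HNSW parameters).
cur.execute("CREATE INDEX IF NOT EXISTS items_idx ON items "
            "USING vectors (embedding vector_l2_ops);")

# <-> is the L2 distance operator; return the nearest stored vector.
cur.execute("SELECT id FROM items ORDER BY embedding <-> '[0.2, 0.2, 0.2]' LIMIT 1;")
print(cur.fetchone())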
Index types
Hierarchical Navigable Small World index
The Hierarchical Navigable Small World (HNSW) index is a type of data structure used in vector databases for efficient search of high-dimensional data. It’s particularly good at finding the nearest neighbors in this kind of data, which is a common requirement for many machine learning applications, such as recommendation systems and similarity searches. How does it work?
Imagine that you’re at a large party and you need to find a group of people who share your interests out of hundreds of guests. Walking up to each person to find out if they’re a match would take a long time. Instead, HNSW organizes people into groups based on how similar they are to each other, creating layers of these groups from very broad to very specific. When you start your search, you first interact with the broad groups, which quickly guide you to increasingly specific groups until you find your best matches without having to meet everyone at the party.
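For readers who want to see the idea in code, here is a small standalone sketch using the hnswlib library, which implements HNSW outside of any database. The dimensions, dataset size, and parameter values are arbitrary illustrations.

import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build the layered graph: M controls the number of links per node,
# ef_construction controls how broadly the build searches for neighbors.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef controls query-time search breadth: higher ef means better recall, lower QPS.
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=10)
print(labels)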
Disk-Approximate Nearest Neighbor index
The Disk-Approximate Nearest Neighbor (DiskANN) index is a type of indexing mechanism designed to efficiently perform nearest neighbor searches on very large datasets that don’t fit entirely into the main memory of a host, but rather need to be stored on disk. How does it work?
Suppose that you have a huge library of books, far more than could fit on a single shelf or even in an entire room. You need a system to find the most relevant book based on a topic you’re interested in. However, space constraints mean that you can’t possibly have all the books laid out in front of you at once, so you need a smart way to store and retrieve them. DiskANN creates an efficient pathway to retrieve the most relevant books (or data points) from your storage (the disk), even though they’re not all immediately accessible in your main memory. It optimizes the layout of data on the disk and intelligently caches parts of the data to minimize disk access times, which are typically the bottleneck in such large-scale systems.
HNSW versus DiskANN
In summary, HNSW is highly efficient for datasets that can fit within the server’s cache (RAM), leveraging fast memory access to speed up the search for nearest neighbors in high-dimensional space. However, its effectiveness is bounded by the amount of RAM available, which can limit its use in extremely large datasets.
On the other hand, DiskANN is designed to handle situations where the dataset is too large to fit into RAM. It uses clever strategies to minimize the performance penalties of having to fetch data from slower disk storage, thereby extending the potential size of the dataset to the limits of disk capacity. This makes DiskANN suitable for massive datasets, trading off some speed for the ability to handle larger amounts of data.
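In practice, the choice often comes down to one parameter in the index definition. Here is a hedged sketch using the pymilvus ORM API; it assumes a collection named embeddings_demo with an embedding field already exists, and the parameter values are illustrative. Use one of the two create_index calls, not both.

from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("embeddings_demo")   # assumes this collection already exists

# Option 1: HNSW, an in-memory graph index; M and efConstruction tune the graph build.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "L2",
                  "params": {"M": 16, "efConstruction": 200}})

# Option 2: DiskANN, which keeps the graph on disk (drop the HNSW index first).
# collection.create_index(
#     field_name="embedding",
#     index_params={"index_type": "DISKANN", "metric_type": "L2"})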
Methodology
We started our setup by deploying a Milvus standalone instance using a shell script, available at https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh. The script spins up a set of three containers, which constitute the Milvus database service.
Next, we measured the performance of the Milvus database instance using two datasets. The OpenAI dataset contains 5 million vectors of 1,536 dimensions each and was tested with the DiskANN index. The LAION dataset contains 10 million vectors of 768 dimensions each and was tested with the HNSW index; it was also used for the comparison of Milvus versus pgvecto.rs.
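To put those dataset sizes in perspective, here is a quick back-of-the-envelope calculation, assuming 4-byte float32 components and ignoring index and metadata overhead.

# Raw embedding footprint of the two benchmark datasets (float32 assumed).
for name, count, dims in [("OpenAI", 5_000_000, 1536), ("LAION", 10_000_000, 768)]:
    gib = count * dims * 4 / 2**30          # 4 bytes per float32 component
    print(f"{name}: {gib:.1f} GiB of raw vectors")

Each dataset works out to roughly 29GiB of raw vector data before any index structures are built on top of it.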
The measurement using the DiskANN index focused on understanding the I/O characteristics of this type of index. The measurement using the HNSW index focused on checking whether there would be any I/O at all, because it’s an in-memory index, and it served as the basis for the performance comparison between Milvus and pgvecto.rs.
To capture the I/O characteristics of the database during the vectordb-bench process, we recorded the start and end dates and times for each run and generated an ONTAP performance archive corresponding to the measurement periods.
When the Milvus measurements were completed, we switched the database to PostgreSQL running with pgvecto.rs 0.2.0.
A note about the index types we used in our measurements: Milvus supports both HNSW and DiskANN, so we collected measurements with both indexes. At the time we measured performance, pgvecto.rs didn’t yet support DiskANN, so we collected its measurements with HNSW only.
Results and lessons learned
Results
First, let’s examine the performance of Milvus and pgvecto.rs using the HNSW index. pgvecto.rs delivered 1,068 queries per second (QPS) with a recall of 0.6344, whereas Milvus managed 106 QPS but achieved a much higher recall of 0.9842. In terms of 99th percentile latency, Milvus demonstrated marginally better latency than pgvecto.rs.
From the perspective of storage, there was no disk I/O, which aligns with expectations, because the index is memory-based and was completely loaded into RAM.
When precision in query results is important, the benchmark results show that Milvus is superior to pgvecto.rs, because it retrieves a higher proportion of relevant items for each query.
When query throughput is the priority, pgvecto.rs outperforms Milvus in terms of QPS. However, it’s important to note that the relevance of the retrieved data is compromised, because about 37% of the results are not pertinent to the specified query.
Let’s now examine Milvus using the DiskANN index. Milvus reached 10.93 QPS with a recall of 0.9987 and a 99th percentile latency of 708.2 milliseconds. Notably, the host CPU, operating at full capacity throughout, was the primary bottleneck.
From a storage point of view, the data ingestion and post-insert optimization phase primarily involved a mix of read and write operations, predominantly writes, with an average I/O size of 64KB. During the query phase, the workload consisted entirely of random read operations, with an average I/O size of 8KB.
Lessons learned
In reviewing the index implementations for vector databases, HNSW emerges as the predominant type, largely due to its established presence. DiskANN, being a newer technology, is not yet as universally adopted. However, as generative AI applications expand and the associated data grows, more developers are integrating DiskANN options into vector databases.
DiskANN is increasingly important for managing large, high-dimensional datasets that exceed RAM capacities, and it is gaining traction in the market. Its disk I/O profile is well suited for modern flash-based storage systems, like NetApp AFF A-Series and C-Series, ensuring that it handles large data volumes efficiently.
References
[1] VectorDB Benchmark. https://github.com/zilliztech/VectorDBBench
[2] Milvus Vector Database. https://milvus.io/docs
[3] Postgres pgvecto.rs Database. https://docs.pgvecto.rs/getting-started/overview.html
ONTAP FlexGroup volumes offer incredible performance and massive capacity scalability. What if you have data today in a Flexible volume (FlexVol) but decide you want a FlexGroup volume? How best do you go about this change?
First, we should confirm that you should in fact consider the change. Starting with ONTAP 9.12.1, the maximum FlexVol size tripled from 100TB to 300TB. If your workload is well under 300TB and you don’t expect it to grow to that level, a FlexVol might be the best place to stay. Every version of ONTAP delivers better single-volume performance, so simply upgrading to a more current version of ONTAP might be the best course of action.
However, what if you decide that you do need to convert to a FlexGroup volume? What is the best process? The first step is to determine the reason for the move, because that determines the best path forward.
Need greater capacity
Let’s say you have a large pool of cool data. The performance of a FlexVol is sufficient, but you need to grow that pool of data well beyond 300TB. In this case, you could perform an in-place FlexVol to FlexGroup volume conversion.
An in-place conversion leaves all existing data in the original volume but expands the data container by adding member volumes, creating a FlexGroup volume. The conversion doesn’t accelerate access to the original data, but it does allow seamless expansion to multiple petabytes of capacity, with new data placed optimally across the new member volumes.
It’s important to note that after the conversion, you will have one member volume (the original volume) that is quite full and other member volumes that are empty. This imbalance is normal and equalizes over time.
Need greater performance
What if, instead of needing greater capacity, you have a workload that now demands much greater performance, perhaps exceeding what a single storage node can satisfy? In this case, a different approach is recommended.
For these cases, a new separate FlexGroup volume should be created with an optimal number of member volumes. ONTAP does this automatically, and this number will vary based on how large your data set is and the composition of your ONTAP cluster. Data in the original FlexVol should be copied to the new FlexGroup volume.
This is most easily done by mounting the new FlexGroup volume at a different mount point and then copying the data with NetApp’s free XCP software. XCP can copy significant amounts of data quickly, and the new FlexGroup volume places the data in an optimal layout across the member volumes.
A short cutover is required: clients using the original FlexVol unmount it, XCP makes a final copy pass, and the FlexVol is taken offline. The new FlexGroup volume is then mounted in place of the FlexVol, and the clients remount the share.
Although this approach requires more work, it leaves you with an optimally laid-out FlexGroup volume, ready to deliver a large increase in throughput and to grow as needed.
There you go: if you need to move from a FlexVol to a FlexGroup volume, you have two options, depending on your needs and situation.
Large language models (LLMs) have revolutionized the field of natural language processing in recent years. However, getting the full value of these models often requires customization in the form of fine tuning. Fine tuning these models for specific tasks or datasets can be challenging. Data has weight. Migrating critical data for access by GPU compute systems or HPC deployments is a heavy lift.
So, we wondered, what prevents us from working on our data in its home? What piece of this workload do we need to enhance NetApp® ONTAP® to be able to address? Spoiler alert: ONTAP is already very good at these workloads today, and NetApp is investing in making it even better.
In this blog post, we look closely at the fine-tuning training process, paying special attention to its I/O characteristics. Our journey is about understanding the I/O characteristics of each phase of the process to provide guidance and insights to our customers about this type of application. Fine tuning a pretrained LLM involves adjusting its parameter weights to fit a specific task or dataset. This process typically involves the following phases.
Phase 1 | Data Preparation
Dataset selection. Choose a dataset that closely aligns with the intended application of the model. The quality and relevance of this data are paramount.
Data cleaning and preprocessing. Clean the dataset to remove irrelevant or corrupt data. This phase may include normalizing text, handling missing values, and tokenization.
Data annotation. For certain tasks, like question answering or sentiment analysis, the data may need to be annotated with the correct answers or sentiments.
Phase 2 | Model Selection
Choosing a base model. Select a pretrained model as a starting point. This choice depends on factors like the model's original training data, size, and previous performance on similar tasks.
Architecture considerations. In some cases, slight modifications to the model architecture are made to suit the specific task.
Phase 3 | Adaptation and Training
Parameter adjustment. Fine tuning involves adjusting the model's parameter weights based on the new dataset. This is typically done by using techniques like gradient descent, with a lower learning rate to make only incremental changes. (A hand-rolled sketch of this phase appears after this list.)
Regularization techniques. To prevent overfitting, regularization techniques such as dropout may be employed.
Phase 4 | Evaluation and Iteration
Performance metrics. Use appropriate metrics to evaluate the model's performance on the fine-tuning task. Common metrics include accuracy, F1-score, and perplexity, depending on the task.
Iterative refinement. Based on performance evaluations, the model may go through several iterations of fine tuning to optimize its accuracy and effectiveness.
Phase 5 | Integration and Deployment
Application integration. When fine tuning is complete, integrate the model into the target application or service.
Continuous monitoring. Deploy mechanisms for continuous monitoring to catch any performance degradation or drift over time.
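To make the adaptation and training phase more tangible, here is a rough, hand-rolled sketch of parameter-efficient fine tuning using the Hugging Face transformers, datasets, and peft libraries. This is for illustration only: it is not the tooling we used in our lab, and the dataset handling and hyperparameters are simplified assumptions.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "h2oai/h2ogpt-4096-llama2-13b"      # the backbone we later used in our experiment
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token        # LLaMA-family tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# LoRA: train small low-rank adapter matrices instead of all 13B parameter weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# A public instruction dataset stands in for a curated, task-specific one.
dataset = load_dataset("OpenAssistant/oasst1", split="train")
dataset = dataset.map(lambda row: tokenizer(row["text"], truncation=True, max_length=512),
                      remove_columns=dataset.column_names)

args = TrainingArguments(output_dir="./checkpoints",
                         per_device_train_batch_size=2,
                         learning_rate=1.5e-4,
                         num_train_epochs=3,
                         save_strategy="epoch")  # checkpoint writes are what hit storage
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()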
In a typical artificial intelligence/machine learning (AI/ML) pipeline, data preparation and training are the stages where storage performance is crucial to keep job execution time as low as possible. That’s especially true in training, because an idle GPU means money flying out the window. For that reason, this blog post focuses on the training phase, where a lot of the heavy lifting happens in terms of data handling, like carrying data from disk to GPUs and vice versa.
Idle GPUs mean money flying out the window. Storage performance during the data preparation and training stages is crucial to prevent idle GPUs.
Don’t take my word for it! Kevin Lee, Adi Gangidi, and Mathew Oldham from Meta’s Data Center Engineering said in the blog post Building Meta’s GenAI Infrastructure,
“Storage plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly. The need to fit all that data storage into a performant, yet power-efficient footprint doesn’t go away though, which makes the problem more interesting.”
Lab setup
For this project, we designed a simple but powerful testbed. Figure 1 shows the equipment used in this study: an NVIDIA DGX-1 system and a NetApp AFF A400, both connected to an NVIDIA Mellanox SN3700 Ethernet switch. The DGX-1 system is connected to the switch via one 100GbE port, and the NetApp AFF A400 is connected via two 100GbE ports (one port per storage controller).
Figure 1) LLM fine-tuning testbed.
The operating system on the DGX-1 system is DGX OS version 6, and the NetApp AFF A400 runs ONTAP version 9.14.1.
From a storage layout perspective, we had a 10TB FlexGroup with a total of 16 data-constituent volumes (8 data-constituent volumes per storage controller) exported via NFS v4.1 with pNFS enabled over TCP/IP.
To enable NFS v4.1 and pNFS, the following command was issued in the ONTAP system:
::> vserver nfs modify -vserver llm-lab -v4.1 enabled -v4.1-pnfs enabled
Then the FlexGroup was mounted in the DGX-1 system with the following command:
# mount -o vers=4.1,proto=tcp,rsize=1048576,wsize=1048576,hard,nconnect=16 llm-lab:/fg_llm /mnt/fg_llm
For fine tuning, we installed LLM Studio from H2O.ai. LLM Studio is a framework and no-code GUI designed for fine tuning state-of-the-art LLMs. This tool supports the most recent fine-tuning techniques available at the time of this study, such as low-rank adaptation (LoRA) and 8-bit model training with a low memory footprint.
The dataset used for fine tuning was OASST, the only open-source dataset available at the time we did this study. We know that this isn’t a large dataset, but we also learned that datasets for fine tuning generally aren’t very large. These datasets are typically curated by humans, which is a labor-intensive and therefore expensive task.
I believe that we are all on this learning journey together. However, if you have a different opinion or experience with datasets for fine tuning, feel free to share your use case in the comments. I’ll bet that everybody reading this post will appreciate it.
Methodology
Operating LLM Studio turned out to be very easy. After installing it, you access its GUI via a browser, and you can import your dataset right away. Next, you create an experiment in which you select a large language model for your fine-tuning training, and then adjust some hyperparameters such as learning rate, batch size, and the number of epochs. Finally, you start training and monitor your experiment.
The following table shows how we configured our experiment.
Model: h2oai/h2ogpt-4096-llama2-13b
Architecture → backbone dtype: int4
Training → learning_rate: 0.00015
Training → batch_size: 2
Training → epochs: 3
Training → evaluate_before_training: False
Training → save_best_checkpoint: True
Because the goal is to characterize the I/O, we collected some information at the operating system level during each training run.
To understand the GPU utilization, we collected the output of “nvidia-smi dmon” by sampling it every 10 seconds. For NFS performance, we collected the output of “nfsiostat” by sampling it every second. To understand the general system activity and the details about file access patterns, we monitored all system calls during the training interval with a couple of bpftrace scripts.
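As an illustration of how the first two collectors can be launched together, here is a minimal Python wrapper; the log file names and the mount point are placeholders, and our bpftrace scripts are omitted from this sketch.

import subprocess

# Launch the collectors side by side; assumes nvidia-smi and nfsiostat are on PATH.
collectors = [
    # GPU utilization and memory activity, sampled every 10 seconds, logged to a file.
    subprocess.Popen(["nvidia-smi", "dmon", "-d", "10", "-f", "gpu_dmon.log"]),
    # Per-mount NFS ops and throughput, sampled every second.
    subprocess.Popen(["nfsiostat", "1", "/mnt/fg_llm"],
                     stdout=open("nfsiostat.log", "w")),
]

try:
    for proc in collectors:
        proc.wait()            # run until the training job is done (Ctrl-C to stop)
except KeyboardInterrupt:
    for proc in collectors:
        proc.terminate()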
With GPU, protocol (NFS), and system-call monitoring in place, we collected a hoard of nerd stats to pore over. After capturing several full training cycles, we had plenty of data to analyze. As you’ll see in the next section, the results are simultaneously surprising and completely intuitive.
Results and lessons learned
The DGX-1 provides eight V100 GPUs. We expected some tuning or tweaking to achieve optimal results with our system. After all, we were completely new to this realm and weren’t starting from some best practice guidance. Imagine our surprise when, without tuning or tweaking, we achieved 97% GPU utilization, as shown in Figure 2.
Figure 2) GPU utilization during the fine-tuning experiment.
What gives? Does the result mean that this is a boring performance blog?
Not exactly. Figure 3 shows typical I/O behavior during our fine-tuning experiment. There are exceptionally few reads going on here. This makes sense when you consider the data flow, but it was not exactly what we anticipated. After all, training is extremely data intensive! Yet typical fine-tuning workflows put substantial emphasis on curated, high-quality training data that is expensive to produce, which keeps these datasets relatively small.
Leveraging the (at the time) largest public fine-tuning dataset we could get our hands on, the DGX-1 comfortably cached the dataset in RAM after its initial read. And, looking toward newer generation systems, RAM has continued to scale.
But what about these bursts of activity we’re seeing in Figure 3? Enter checkpointing. Saving model weights is clearly a critical element for fine tuning and progressing a model toward the desired behavior. In our experiments, flushing these weights is the critical element to keep our GPUs crunching away to improve our model.
This is the first of several sets of tests and experiments that we are doing with AI model fine tuning. Watch for the next blog post on our AI journey, in which we will detail our experience, research, and experiments in fine tuning retrieval-augmented generation (RAG) systems.
Figure 3) NFS I/O throughput during the fine-tuning experiment.
This cycle makes storage write performance crucial for fine tuning. In our training cycle, checkpointing is the only time that GPUs aren’t spending cycles on training. If the system takes too long to flush data to disk before starting its next round of training, your precious GPU resources sit idle, wasting time and money. At the same time, checkpointing is a tiny portion of overall clock time. Right-sizing storage to match computational capability, and scaling storage alongside compute, are critical considerations in model training.
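To see why write throughput matters, here is a back-of-the-envelope sketch. The checkpoint size and storage throughput figures are illustrative assumptions, not measurements from our lab; LoRA adapter checkpoints are far smaller than full-weight ones, but the principle is the same.

# How long GPUs wait for one checkpoint flush (illustrative numbers only).
params = 13e9                       # a 13B-parameter model
bytes_per_param = 2                 # 16-bit weights
checkpoint_gb = params * bytes_per_param / 1e9      # ~26 GB per full-weight checkpoint

for write_gbps in (1, 3, 6):        # assumed sustained storage write throughput, GB/s
    print(f"{write_gbps} GB/s -> {checkpoint_gb / write_gbps:.0f} s of idle GPU time per checkpoint")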
We experimented with a relatively small model on a relatively small compute setup. We also used a single HA pair. As we scale these dimensions (model size, compute scale, storage scale), the picture remains remarkably the same.
All of this means that you can keep your GPUs busy and your AI workflows productive, enabling you to get the most out of your GPU investment and to deliver fine-tuned models for more business applications. By leveraging ONTAP as your data lake storage system, you can keep your data accessible to any system via any protocol: NFS over RDMA, S3-compatible, CIFS, and many other storage protocols, in the cloud or on your premises. And all without compromising on full enterprise data management features or performance.
Yesterday’s leaders in HPC storage rapidly adapted to the demands of training and fine tuning for generative AI workloads. Today, we are demonstrating that NetApp is ready to bring modern data management and features to these workloads without compromise. These features include:
NFS over RDMA, which allows data to be copied directly between storage system memory and the host system memory, circumventing CPU overhead.
FlexGroup volumes, which provide scale-out NAS containers for high performance and automated load distribution across constituent volumes.
Modern protocol stacks, which provide performant and secure data movement for model training.
We demonstrated that the workload isn’t a huge mystery, and that it isn’t an always-on fire hose (in the context of a single job). Fast, feature-rich storage that can scale with your compute can save you time and money and accelerate your GenAI progress, backed by the bulletproof quality of ONTAP.