Fine-Tuning a Llama2 Model and Observing its Storage I/O Activity

RodrigoNascimento · ‎2024-11-11

Large language models (LLMs) have revolutionized the field of natural language processing in recent years. However, getting the full value of these models often requires customization in the form of fine tuning. Fine tuning these models for specific tasks or datasets can be challenging. Data has weight. Migrating critical data for access by GPU compute systems or HPC deployments is a heavy lift.

So, we wondered, what prevents us from working on our data in its home? What piece of this workload do we need to enhance NetApp® ONTAP® to be able to address? Spoiler alert: ONTAP is already very good at these workloads today, and NetApp is investing in making it even better.

In this blog post, we look closely at the fine-tuning training process, paying special attention to its I/O characteristics. Our journey is about understanding the I/O characteristics of each phase of the process to provide guidance and insights to our customers about this type of application. Fine tuning a pretrained LLM involves adjusting its parameter weights to fit a specific task or dataset. This process typically involves the following phases.

Phase 1 | Data Preparation

Dataset selection. Choose a dataset that closely aligns with the intended application of the model. The quality and relevance of this data are paramount.
Data cleaning and preprocessing. Clean the dataset to remove irrelevant or corrupt data. This phase may include normalizing text, handling missing values, and tokenization.
Data annotation. For certain tasks, like question answering or sentiment analysis, the data may need to be annotated with the correct answers or sentiments.

Phase 2 | Model Selection

Choosing a base model. Select a pretrained model as a starting point. This choice depends on factors like the model's original training data, size, and previous performance on similar tasks.
Architecture considerations. In some cases, slight modifications to the model architecture are made to suit the specific task.

Phase 3 | Adaptation and Training

Parameter adjustment. Fine tuning involves adjusting the model's parameter weights based on the new dataset. This is typically done by using techniques like gradient descent, with a lower learning rate to make only incremental changes.
Regularization techniques. To prevent overfitting, regularization techniques such as dropout may be employed.
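To make the point about a lower learning rate concrete, here is a minimal sketch on a one-parameter toy model. It is not how LLM Studio trains; the data, the learning rates, and the function name are hypothetical and chosen only to show that a small learning rate nudges a pretrained weight rather than overwriting it.

```python
def sgd_fine_tune(w, data, learning_rate, epochs):
    """Fit y ~ w * x with plain gradient descent on mean squared error."""
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


pretrained_w = 2.0                                     # hypothetical "pretrained" weight
new_task_data = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]   # new task prefers w = 2.5

# Low learning rate: the weight moves only slightly away from its pretrained value.
gentle = sgd_fine_tune(pretrained_w, new_task_data, learning_rate=0.001, epochs=3)

# High learning rate: the weight jumps almost all the way to the new optimum.
aggressive = sgd_fine_tune(pretrained_w, new_task_data, learning_rate=0.1, epochs=3)

print(f"gentle={gentle:.4f} aggressive={aggressive:.4f}")
```

In a real model the same trade-off applies across billions of weights, which is why fine-tuning runs tend to use small learning rates.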

Phase 4 | Evaluation and Iteration

Performance metrics. Use appropriate metrics to evaluate the model's performance on the fine-tuning task. Common metrics include accuracy, F1-score, and perplexity, depending on the task.
Iterative refinement. Based on performance evaluations, the model may go through several iterations of fine tuning to optimize its accuracy and effectiveness.
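As a rough illustration of the metrics named above, the following sketch computes accuracy, F1-score, and perplexity in plain Python. The function names and the sample inputs are assumptions made for this example; production pipelines would normally rely on an evaluation library instead.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


labels = [1, 0, 1, 1]        # hypothetical reference labels
predictions = [1, 0, 0, 1]   # hypothetical model outputs
print(accuracy(labels, predictions), f1_score(labels, predictions))
print(perplexity([math.log(0.25)] * 4))
```

Accuracy and F1 suit classification-style tasks, while perplexity is the usual choice when the fine-tuned model is judged on how well it predicts text.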

Phase 5 | Integration and Deployment

Application integration. When fine tuning is complete, integrate the model into the target application or service.
Continuous monitoring. Deploy mechanisms for continuous monitoring to catch any performance degradation or drift over time.

In a typical artificial intelligence/machine learning (AI/ML) pipeline, data preparation and training are the stages where storage performance is crucial to keep job execution time as low as possible. That’s especially true in training, because an idle GPU means money flying out the window. For that reason, this blog post focuses on the training phase, where a lot of the heavy lifting happens in terms of data handling, like carrying data from disk to GPUs and vice versa.


Don’t take my word for it! Kevin Lee, Adi Gangidi, and Mathew Oldham from Meta’s Data Center Engineering said in the blog post Building Meta’s GenAI Infrastructure,

“Storage plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly. The need to fit all that data storage into a performant, yet power-efficient footprint doesn’t go away though, which makes the problem more interesting.”

Lab setup

For this project, we designed a simple but powerful testbed. Figure 1 shows the equipment used in this study: an NVIDIA DGX-1 system and a NetApp AFF-A400, both connected to an NVIDIA Mellanox SN3700 Ethernet switch. The DGX-1 system is connected to the switch via one 100GbE port, and the NetApp AFF-A400 is connected via two 100GbE ports (one port per storage controller).

Figure 1) LLM fine-tuning testbed.

The operating system in the DGX-1 system is DGX-OS version 6, and the NetApp AFF-A400 runs with ONTAP version 9.14.1.

From a storage layout perspective, we had a 10TB FlexGroup with a total of 16 data-constituent volumes (8 data-constituent volumes per storage controller) exported via NFS v4.1 with pNFS enabled over TCP/IP.

To enable NFS v4.1 and pNFS, the following command was issued in the ONTAP system:

::> vserver nfs modify -vserver llm-lab -v4.1 enabled -v4.1-pnfs enabled

Then the FlexGroup was mounted in the DGX-1 system with the following command:

# mount -o vers=4.1,proto=tcp,rsize=1048576,wsize=1048576,hard,nconnect=16 llm-lab:/fg_llm /mnt/fg_llm
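After mounting, it can be useful to confirm which options the client actually negotiated. The sketch below parses text in the /proc/mounts format and returns the options for a mount point. The helper names are hypothetical, and the sample line is illustrative only; the kernel typically reports more options than were passed on the command line.

```python
def parse_mount_options(option_string):
    """Turn 'a=1,b,c=2' into {'a': '1', 'b': True, 'c': '2'}."""
    options = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def find_mount(mounts_text, mount_point):
    """Return (fs_type, options) for mount_point, or None if it is not mounted."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[2], parse_mount_options(fields[3])
    return None


# Illustrative line, not captured from the lab system.
sample = (
    "llm-lab:/fg_llm /mnt/fg_llm nfs4 "
    "rw,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp,nconnect=16 0 0"
)
fs_type, opts = find_mount(sample, "/mnt/fg_llm")
print(fs_type, opts["vers"], opts["nconnect"])
```

On a live client, the same function could be called as find_mount(open("/proc/mounts").read(), "/mnt/fg_llm").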

For fine tuning, we installed LLM Studio from H2O.ai. LLM Studio is a framework and no-code GUI designed for fine tuning state-of-the-art LLMs. This tool supports the most recent fine-tuning techniques available at the time of this study, such as low-rank adaptation (LoRA) and 8-bit model training with a low memory footprint.
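To give a feel for why LoRA keeps the memory footprint low, here is a back-of-the-envelope sketch. LoRA leaves a weight matrix frozen and trains two small matrices of rank r in its place, so the trainable parameter count per matrix drops from d_in × d_out to r × (d_in + d_out). The 5120 dimension and the rank of 4 are assumptions picked for illustration, not settings taken from our experiment.

```python
def full_params(d_in, d_out):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


d = 5120   # assumed projection width, for illustration
r = 4      # hypothetical LoRA rank

dense = full_params(d, d)
lora = lora_trainable_params(d, d, r)
print(f"dense={dense:,} lora={lora:,} ratio={lora / dense:.4%}")
```

Fewer trainable parameters also means less optimizer state to keep in memory, which matters on GPUs with limited capacity.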

The dataset used for fine tuning was OASST, the only open-source dataset available at the time we did this study. We know that this isn’t a large dataset, but we also learned that datasets for fine tuning usually aren’t very large. In general, these datasets are curated by humans, which is a labor-intensive and therefore expensive task.

I believe that we are all on this learning journey together. However, if you have a different opinion or experience with datasets for fine tuning, feel free to share your use case in the comments. I’ll bet that everybody reading this post will appreciate it.

Methodology

Operating LLM Studio turned out to be very easy. After installing it, you access its GUI via a browser, and you can import your dataset right away. Next, you create an experiment in which you select a large language model for your fine-tuning training, and then adjust some hyperparameters such as learning rate, batch size, and the number of epochs. Finally, you start training and monitor your experiment.

The following table shows how we configured our experiment.

Experiment Property	Value
Model	h2oai/h2ogpt-4096-llama2-13b
Architecture → backbone dtype	int4
Training → learning_rate	0.00015
Training → batch_size	2
Training → epochs	3
Training → evaluate_before_training	False
Training → save_best_checkpoint	True
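The backbone dtype in the table has a direct effect on how much memory the model weights occupy. The arithmetic below is a rough sketch that assumes about 13 billion parameters (inferred from the "13b" in the model name) and ignores activations, optimizer state, and quantization metadata.

```python
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}


def weight_bytes(num_params, dtype):
    """Approximate bytes needed to hold the weights alone at a given precision."""
    return num_params * BITS_PER_PARAM[dtype] // 8


num_params = 13_000_000_000   # approximate, assumed from the model name

for dtype in ("float16", "int8", "int4"):
    gigabytes = weight_bytes(num_params, dtype) / 1e9
    print(f"{dtype}: {gigabytes:.1f} GB")
```

The same kind of estimate helps when sizing checkpoints, although what actually gets written depends on what the framework chooses to save.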

Because the goal is to characterize the I/O, we must collect some information at the operating system level to help achieve our goal.

To understand the GPU utilization, we collected the output of “nvidia-smi dmon” by sampling it every 10 seconds. For NFS performance, we collected the output of “nfsiostat” by sampling it every second. To understand the general system activity and the details about file access patterns, we monitored all system calls during the training interval with a couple of bpftrace scripts.
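As one way to post-process the GPU samples, the sketch below averages the "sm" (streaming multiprocessor utilization) column from captured "nvidia-smi dmon" output. The function name is hypothetical and the sample text is illustrative rather than taken from our runs; the parser locates the column from the header row because the exact column set varies between driver versions.

```python
def mean_sm_utilization(dmon_text):
    """Average the 'sm' column of `nvidia-smi dmon` output across all samples."""
    sm_index = None
    values = []
    for line in dmon_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Header rows start with '#'; the row containing 'sm' names the columns.
            names = stripped.lstrip("#").split()
            if "sm" in names:
                sm_index = names.index("sm")
            continue
        fields = stripped.split()
        # Skip rows seen before a header, short rows, and '-' placeholders.
        if sm_index is not None and sm_index < len(fields) and fields[sm_index].isdigit():
            values.append(int(fields[sm_index]))
    return sum(values) / len(values) if values else None


# Illustrative capture with two GPUs and one sample each.
sample = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0   250    60     -    97    40     0     0   877  1380
    1   245    58     -    95    38     0     0   877  1380
"""
average = mean_sm_utilization(sample)
print(average)
```

A capture like this could be produced with "nvidia-smi dmon -d 10" redirected to a file, where -d sets the sampling interval in seconds.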

Next, we monitored GPU activity, NFS protocol traffic, and system calls to collect a hoard of nerd stats to pore over. After capturing several full training cycles, we were left with a bunch of data to look at. As you’ll see in the next section, the results are simultaneously surprising and completely intuitive.

Results and lessons learned

The DGX-1 provides eight V100 GPUs. We expected some tuning or tweaking to achieve optimal results with our system. After all, we were completely new to this realm and weren’t starting from some best practice guidance. Imagine our surprise when, without tuning or tweaking, we achieved 97% GPU utilization, as shown in Figure 2.

Figure 2) GPU utilization during the fine-tuning experiment.

What gives? Does the result mean that this is a boring performance blog?

Not exactly. Figure 3 shows typical I/O behavior during our fine-tuning experiment. There are exceptionally few reads going on here. This makes sense when considering the data flow, but it was not exactly what we anticipated. After all, training is extremely data intensive! Yet in typical fine-tuning workflows there is substantial emphasis on having curated, high-quality training data that is expensive to produce.

Leveraging the (at the time) largest public fine-tuning dataset we could get our hands on, the DGX-1 comfortably cached the dataset in RAM after its initial read. And, looking toward newer generation systems, RAM has continued to scale.

But what about these bursts of activity we’re seeing in Figure 3? Enter checkpointing. Saving model weights is clearly a critical element for fine tuning and progressing a model toward the desired behavior. In our experiments, flushing these weights to storage quickly is what keeps our GPUs crunching away to improve our model.

This is the first set of several tests and experiments we are doing with AI model fine tuning. Watch for the next blog post on our AI journey, in which we will detail our experience, research, and experiments in fine tuning retrieval-augmented generative (RAG) systems.

Figure 3) NFS I/O throughput during the fine-tuning experiment.

This cycle makes storage write performance crucial for fine tuning. In our training cycle, this is the only time that GPUs won’t be spending cycles on training. If the system is taking too long to flush data to disk to start its next round of training, it means that your precious GPU resources are idle, wasting time and money. At the same time, checkpointing is a tiny portion of overall clock time. Right-sizing storage to match computational capability, and scaling storage alongside compute, are critical features in model training.
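A simple model helps put the checkpoint cost in perspective. The sketch below assumes that training pauses completely while a checkpoint is written, and every number in it (checkpoint size, write throughput, time between checkpoints) is hypothetical rather than measured in our lab.

```python
def checkpoint_stall_seconds(checkpoint_bytes, write_bytes_per_second):
    """Time the GPUs wait while one checkpoint is flushed to storage."""
    return checkpoint_bytes / write_bytes_per_second


def gpu_idle_fraction(train_seconds_between_checkpoints, stall_seconds):
    """Share of wall-clock time spent waiting on checkpoint writes."""
    return stall_seconds / (train_seconds_between_checkpoints + stall_seconds)


checkpoint_bytes = 26e9    # hypothetical checkpoint size
write_throughput = 2e9     # hypothetical sustained write rate, bytes per second
train_interval = 3600      # hypothetical seconds of training between checkpoints

stall = checkpoint_stall_seconds(checkpoint_bytes, write_throughput)
idle = gpu_idle_fraction(train_interval, stall)
print(f"stall={stall:.1f}s idle={idle:.2%}")
```

Halving the write throughput doubles the stall, so the idle fraction grows as checkpoints get larger or more frequent relative to what the storage can absorb.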

We experimented with a relatively small model on a relatively small compute setup. We also used a single HA pair. As we scale these dimensions (model size, compute scale, storage scale), the picture remains remarkably the same.

All of this means that you can keep your GPUs busy and your AI workflows productive, enabling you to get the most out of your GPU investment and to deliver fine-tuned models for more business applications. By leveraging ONTAP as your data lake storage system, you can keep your data accessible to any system via any protocol: NFS over RDMA, S3-compatible, CIFS, and many other storage protocols, in the cloud or on your premises. And all without compromising on full enterprise data management features or performance.

Yesterday’s leaders in HPC storage rapidly adapted to the demands of training and fine tuning for generative AI workloads. Today, we are demonstrating that NetApp is ready to bring modern data management and features to these workloads without compromise. These features include:

NFS over RDMA, which allows data to be copied directly between storage system memory and the host system memory, circumventing CPU overhead.
FlexGroup volumes, which provide scale-out NAS containers for high performance and automated load distribution across constituent volumes.
Modern protocol stacks, which provide performant and secure data movement for model training.

We demonstrated that the workload isn’t a huge mystery, and that it isn’t an always-on fire hose (in the context of a single job). Fast, feature-rich storage that can scale with your compute can save you time and money and accelerate your GenAI progress with the bulletproof quality of ONTAP.