Accelerating LLM Retrieval-Augmented Generation development with NetApp

arndt
NetApp

 

NetApp is the intelligent data infrastructure company. In this post, we will take a look at what this means in the context of the now ubiquitous technology that is Generative AI (genAI).

 

Large language models (LLMs) are at the center of genAI offerings. These models take months and significant hardware investment to create. As an example, the well-known open source Llama 2 LLMs from Meta were trained between January 2023 and July 2023. If you want to build an application that leverages more recently created data, you have a couple of options. One of the most popular approaches is Retrieval-Augmented Generation (RAG), which brings more current data or domain-specific knowledge to a general-purpose LLM.

 

Why use NetApp for developing LLMs with RAG?

The rest of this post walks through a technical proof of concept. Before we get into those details, I will lay out why NetApp is relevant and brings value to this space.

  • Centralized storage from NetApp has long been a popular choice for unstructured data. NetApp is likely where your current documents and domain-specific knowledge already live.
  • NetApp Snapshots can instantly capture a point-in-time version of your data, LLM, and RAG database for traceability.
  • NetApp FlexClones can instantly create many copies of your data, LLM, and RAG database. This allows each of your data scientists to continue iterative development without delay.

 

Creating a basic chatbot

To start, let's take a look at how to create a simple Python chatbot with a Llama 2 model. I am using the llama-cpp-python library with LangChain to interface with the LLM. We will ask our chatbot a couple of questions whose answers I know were not publicly documented at the time the model was trained.
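Before we look at the output, here is a minimal sketch of what such a chat loop can look like. This is a hedged illustration that assumes the langchain-community package is installed; the actual chat.py in the repository may differ in its details.

#!/usr/bin/env python3
# Hedged sketch of a simple chat loop using llama-cpp-python via LangChain.
# Assumes a Llama 2 chat model in GGUF format, passed as the first argument.
import sys
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(model_path=sys.argv[1], n_ctx=2048, temperature=0.0)

print("Chatbot initialized, ready to chat...")
while True:
    question = input("> ")
    # Llama 2 chat models expect the [INST] ... [/INST] instruction format.
    print(llm.invoke(f"[INST] {question} [/INST]"))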

 

(rag) arndt@rag:~/rag_llama2$ ./chat.py /aiwork/ws1/models/llama-2-7b-chat.Q4_K_M.gguf
Chatbot initialized, ready to chat...
> As briefly as possible, tell me what version of ONTAP supports NFS over RDMA on the A900 platform.
ONTAP (NetApp ONTAP) version 9.3 and later support NFS (Network File System) over RDMA (Remote Direct Memory Access) on the A900 platform.

> As briefly as possible, tell me how many concurrent API calls are supported by ONTAP.
ONTAP supports a maximum of 1024 concurrent API calls.

 

If you have used genAI enough, you have likely seen it give an incorrect answer, and that is the case in both examples above. I need to give the model more up-to-date information to work with, which I will do next.

 

Populating the RAG vector database

RAG uses a special database, called a vector database, to store numerical representations (embeddings) of data. For this testing, we will use a popular open source vector database called Chroma. Let's populate our Chroma database with embeddings from a directory of text files containing notes on a variety of topics.
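Here is a hedged sketch of what that ingest step can look like. The embedding model and chunk sizes below are illustrative assumptions, and the actual rag_db_add.py in the repository may differ.

#!/usr/bin/env python3
# Hedged sketch of populating a Chroma vector database with LangChain.
# The embedding model and chunk sizes are illustrative assumptions.
import sys
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

src_dir, db_dir = sys.argv[1], sys.argv[2]

# Load the text files and split them into chunks suitable for embedding.
docs = DirectoryLoader(src_dir, glob="**/*.txt", loader_cls=TextLoader).load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

# Embed the chunks and persist them in the Chroma database.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory=f"{db_dir}/chromadb", embedding_function=embeddings)
db.add_documents(chunks)

# _collection is the underlying chromadb collection object.
print(f"There are now {db._collection.count()} document chunks in the db.")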

 

(rag) arndt@rag:~/rag_llama2$ ls /aiwork/ws1/vectors
(rag) arndt@rag:~/rag_llama2$ ./rag_db_add.py /files/data1 /aiwork/ws1/vectors
There are now 2039 document chunks in the db.
(rag) arndt@rag:~/rag_llama2$ ls /aiwork/ws1/vectors
chromadb
(rag) arndt@rag:~/rag_llama2$

 

In my POC, I am using an NFSv4.1 mount to store my Chroma database. If you prefer a block protocol such as iSCSI, NVMe/TCP, FC-SAN, or NVMe/FC, that will also work: NetApp features such as Snapshots and FlexClone are protocol agnostic. NetApp All-Flash storage systems can deliver many GB/s of throughput with sub-1ms latency over any of these protocols, including NFS!

 

Implementing a chatbot that leverages RAG

Now I will present the same questions to a RAG-enabled version of the chatbot.
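A hedged sketch of the retrieval step follows; as before, the prompt format and the number of retrieved chunks are assumptions, and rag_chat.py in the repository may differ.

#!/usr/bin/env python3
# Hedged sketch of a RAG-enabled chat loop: retrieve relevant chunks from
# Chroma, then include them in the prompt as context for the LLM.
import sys
from langchain_community.llms import LlamaCpp
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

model_path, db_dir = sys.argv[1], sys.argv[2]

llm = LlamaCpp(model_path=model_path, n_ctx=4096, temperature=0.0)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory=f"{db_dir}/chromadb", embedding_function=embeddings)

print("Chatbot initialized, ready to chat...")
while True:
    question = input("> ")
    # Retrieve the most similar chunks and include them as context.
    context = "\n".join(
        d.page_content for d in db.similarity_search(question, k=4)
    )
    prompt = (f"[INST] Answer using only this context:\n{context}\n\n"
              f"Question: {question} [/INST]")
    print(llm.invoke(prompt))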

 

(rag) arndt@rag:~/rag_llama2$ ./rag_chat.py /aiwork/ws1/models/llama-2-7b-chat.Q4_K_M.gguf /aiwork/ws1/vectors
Chatbot initialized, ready to chat...
> As briefly as possible, tell me what version of ONTAP supports NFS over RDMA on the A900 platform.
Currently supported versions are 9.6 and later releases.

> As briefly as possible, tell me how many concurrent API calls are supported by ONTAP.
Up to 240 concurrent API calls are supported by ONTAP, but not all of them will run at the same time. Some will be queued depending on the node and SVM limits.

 

This is better. We now have a correct answer for our second question, but not for the first. It looks like we need to iterate on our RAG implementation!

 

Snapshot and FlexClone implementation

Next, let's use the NetApp DataOps Toolkit to create a snapshot of the current state. After the snapshot is created, we will create a clone for another developer to update the RAG database. Note that the clone is junctioned into the namespace under a volume that was already NFS-mounted on my compute node, so it is automatically available at that path.

 

(rag) arndt@rag:~/rag_llama2$ netapp_dataops_cli.py create snapshot --volume ws1 --name snap1
Creating snapshot 'snap1'.
Snapshot created successfully.

(rag) arndt@rag:~/rag_llama2$ netapp_dataops_cli.py clone volume --name ws2 --source-volume ws1 --source-snapshot snap1 --junction /aiwork/ws2
Creating clone volume 'svm1:ws2' from source volume 'svm1:ws1'.
Clone volume created successfully.
Setting export-policy:default snapshot-policy:none

(rag) arndt@gpu01:~/rag_llama2$ ls /aiwork/ws2/models
llama-2-7b-chat.Q4_K_M.gguf
(rag) arndt@gpu01:~/rag_llama2$ ls /aiwork/ws2/vectors
chromadb
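
The DataOps Toolkit also exposes these operations through a Python API, which is handy if you want to automate the snapshot-and-clone step inside a pipeline. The sketch below is an assumption that mirrors the CLI flags above; check the toolkit documentation for the exact function signatures.

# Hedged sketch of the same workflow via the DataOps Toolkit Python API.
# The argument names below mirror the CLI flags and are assumptions.
from netapp_dataops.traditional import create_snapshot, clone_volume

create_snapshot(volume_name="ws1", snapshot_name="snap1")
clone_volume(
    new_volume_name="ws2",
    source_volume_name="ws1",
    source_snapshot_name="snap1",
    junction="/aiwork/ws2",
)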

 

Let's talk about when it makes sense to leverage these capabilities. If you have a small number of developers and it takes 30 seconds to make a copy of your environment, this may not be a big win for you. So, when is this valuable?

  • What if you have tens, hundreds, or even thousands of developers?
  • What if creating copies can go from tens of minutes to under a minute for each developer?
  • What if your developers are geographically dispersed, and you need to replicate the data to be close to them?
  • What if you need to replicate your data between on-prem and a public cloud?

In all of the above scenarios, NetApp solutions will provide a massive improvement in developer productivity, and therefore accelerate your time-to-market.

 

Updating the RAG database clone

After creating a clone of our RAG database, we can iterate on our previous work by adding more data to the vector database. Note that in the command below, we are using a new path to the source files, and we are updating the vector database in workspace ws2, leaving our original vector database untouched.

 

(rag) arndt@rag:~/rag_llama2$ ./rag_db_add.py /files/data2 /aiwork/ws2/vectors
There are now 2206 document chunks in the db.
(rag) arndt@rag:~/rag_llama2$

 

 

One last chat

Once our RAG database is updated, we will ask the RAG-enabled chatbot our first question one more time. I provide the path to the new vectors in the ws2 clone with my command line arguments below.

 

(rag) arndt@rag:~/rag_llama2$ ./rag_chat.py /aiwork/ws2/models/llama-2-7b-chat.Q4_K_M.gguf /aiwork/ws2/vectors
Chatbot initialized, ready to chat...
> As briefly as possible, tell me what version of ONTAP supports NFS over RDMA on the A900 platform.
ONTAP 9.14.1

 

Success! I hope you enjoyed the exercise.

 

Python Code Repository

All of the Python scripts used in this exercise can be found at https://github.com/arndt-netapp/rag_llama2. In this repository I've also included some tips about how to get llama-cpp-python to use GPU hardware.
