Tech ONTAP Blogs

Data Management Challenges for Retrieval Augmented Generation (RAG)

PuneetD
NetApp

Retrieval Augmented Generation (RAG) frameworks have become a popular way to connect generative AI foundation models to private data. Utilizing a company's private data augments model responses with knowledge derived from private data sets, producing answers that are grounded in context and facts drawn from company data. In a typical RAG implementation, private documents and user queries are converted into vector embeddings to perform similarity search. An embedding model converts text or images into a numerical vector representation, which is stored in a vector database. User queries, from an application such as a Q&A chatbot, are converted into their vector representation and searched within the vector database to 'retrieve' relevant results. The original user query or prompt is then 'augmented' with relevant context from similar documents retrieved from the vector database. This augmented prompt is sent to the foundation language model, which uses the user query and the augmented content from the vector database to 'generate' the end user response.
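The retrieve/augment/generate flow described above can be sketched in a few lines of Python. Everything here is an illustrative stand-in: the bag-of-words "embedding", the in-memory "vector database", and the sample documents are toys, where a real pipeline would call an embedding model and query an actual vector database.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector. A real RAG pipeline
    would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in "vector database": each document stored with its embedding.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The support hotline is open on weekdays from 9am to 5pm.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """Embed the query and return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def augment(query):
    """Build the augmented prompt sent to the foundation model."""
    context = "\n".join(retrieve(query))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer using only the context above.")

prompt = augment("What is your refund policy for returns?")
```

The final step, not shown, would pass `prompt` to a foundation model to generate the end user response.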

 

[Figure: RAG.png, overview of the RAG pipeline]

 

Use of unstructured data as a source of knowledge is gaining momentum across a multitude of use cases, such as deriving insights from data, document summarization, and virtual assistants for case management, employee onboarding, research assistance, etc. When unstructured data is used within RAG, as explained above, there are two key data sets of importance: the company's source data and the vector embeddings created from that source data. If not thought through carefully, integrating unstructured data into RAG can present several data management challenges. Based on our engagements with customers in the recent past, below are some of the challenges data and platform engineers could experience, especially as they move generative AI applications into production and run them at scale:

 

  1. Data Discovery, Privacy and Security: Ensuring customer data used in RAG pipelines remains private and secure is critical, both from an access control perspective and when dealing with sensitive information. Planning a secure implementation starts with making it easy for data owners to discover the right data, along with the right access model to share datasets with GenAI applications and developers. Most enterprises already have access controls in place for their data. GenAI applications using enterprise data should honor these access controls and not provide insights from data that end users do not have access to. Similarly, sensitive information such as PII should not leak into GenAI applications without the needed guardrails in place. Compliance with data protection regulations, such as GDPR or HIPAA, is a key challenge for enterprises, and these regulations apply to all applications accessing customer data, including ones using GenAI. 
  2. Protection against Ransomware: Like any other application or infrastructure, generative AI applications using company data are susceptible to ransomware attacks via malware, prompt injection, data poisoning, etc. Customers not only need mechanisms to quickly detect such attacks to limit their blast radius, but also need to be able to quickly recover their data when such incidents happen. 
  3. Data Synchronization: Keeping the knowledge sources or vector embeddings used in RAG pipelines up to date with changes in source data or source metadata (e.g. document permissions) is important to ensure that the latest information is used by applications to augment model responses. For example, if access permissions are changed on source data, the GenAI application querying the vector database should take the new permissions into account when searching for relevant records. Keeping source data and vector embeddings in sync can be challenging with large and continuously evolving datasets. 
  4. Cost and Complexity Management: The costs associated with data storage, processing, and retrieval can be significant. Optimizing these costs while maintaining performance can be a challenge for enterprises deploying RAG systems. Furthermore, creating new storage silos for GenAI (e.g. by copying data to public cloud object stores) to integrate with cloud generative AI services adds cost and management complexity. Instead, customers want to use and extend their existing data sources to build GenAI applications. This helps them control both costs and management complexity, while utilizing all the data management capabilities and processes they have already established. 
  5. Data Gravity and Availability of Accelerated Compute Capacity: Customer data may be distributed across many sites where the compute and GPU resources available to run inferencing tasks may be very limited. Customers could utilize compute and GPUs in the public clouds; however, moving customer data from on-premises or distributed sites quickly and cost-efficiently may be a challenge. Additionally, keeping distributed data sets synchronized and secure poses another challenge.
  6. Integrating Varied Datasets: Integrating data from multiple sources into a cohesive knowledge source that RAG systems can query is a complex task. It involves dealing with various data sources that may require different protocols such as NFS/SMB/S3 or block storage, unstructured and structured data, different data formats and schemas, and potential inconsistencies. Use cases that need multimodal GenAI models (e.g. a product recommendation engine may need to consume both product description texts and images or videos to make recommendations) may need to consume and produce data in multiple formats, which the underlying storage system must be able to store efficiently and serve at the needed performance. 
  7. Data Storage and Scalability: Managing the storage and retrieval of large volumes of data, and ensuring that storage systems can scale efficiently as data grows, can be a challenge. This includes considerations for both the source data and the vector databases. As RAG systems need to quickly retrieve relevant documents from potentially vast datasets, efficient indexing and search algorithms are essential. Balancing speed and accuracy in the retrieval process can be a significant challenge, especially as the size of data and the complexity of queries grow. 
  8. Data Protection and Availability: Ensuring the data used for functions such as ingest sources and vector embeddings is backed up and highly available can add complexity and cost if customers must create separate data copies or islands to integrate with GenAI applications. Among the multiple storage options available, especially in public clouds, choosing the right storage service that will meet application service level agreements for uptime, RPO and RTO while adhering to infrastructure budgets can be a time-consuming task. 
  9. Developer Efficiency: Developers working on generative AI applications or proofs of concept don't want to be held back by the time or cost of making data available to them. Developers may need quick access to large production data sets or vector databases to test new features, perform A/B testing, etc. However, making data copies available to multiple developers can be a time-consuming and costly affair if full data copies need to be made for each request. If developers are geographically distributed, customers may also have to deal with data movement latencies and bandwidth costs. 
  10. Data Traceability: In production RAG deployments, especially in regulated industries, traceability of the data, models and embeddings can be crucial for audit trail compliance or troubleshooting purposes. Customers may need to go back in time and analyze a consistent view of data to troubleshoot issues, optimize applications or respond to regulatory requests. 
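Challenge 1 above, honoring existing access controls at retrieval time, is often addressed by storing each source document's ACL as metadata next to its embedding and filtering search results by the querying user's groups before the prompt is augmented. A minimal sketch of that pattern, where the chunk schema, embeddings, and group names are all made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A document chunk in the vector database, carrying the ACL
    copied from its source file as metadata (hypothetical schema)."""
    text: str
    embedding: list
    allowed_groups: set = field(default_factory=set)

def search(index, query_embedding, user_groups, k=2):
    """Rank chunks by similarity, then drop any the user cannot read.
    Filtering before augmentation ensures the model never sees content
    the end user has no access to."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(index, key=lambda c: dot(c.embedding, query_embedding),
                    reverse=True)
    visible = [c for c in ranked if c.allowed_groups & user_groups]
    return visible[:k]

index = [
    Chunk("Q3 revenue grew 12%.", [0.9, 0.1], {"finance"}),
    Chunk("Cafeteria menu for Monday.", [0.1, 0.9], {"all-employees"}),
]

# A user outside the finance group only retrieves chunks open to all.
results = search(index, [1.0, 0.0], user_groups={"all-employees"})
```

Filtering at query time, rather than baking permissions into the index, also means a permission change on source data takes effect for the very next query, which touches on challenge 3 as well.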
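For challenge 3, one common pattern is to record a content hash alongside each stored embedding, so that an incremental sync job can re-embed only documents that changed and delete entries whose sources are gone. A rough sketch under that assumption; the dict-based store layout and file names are hypothetical:

```python
import hashlib

def content_hash(text):
    """Fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source_docs, vector_store):
    """Compare source documents against the hashes recorded with their
    embeddings; return IDs to (re-)embed and IDs to delete.
    source_docs: doc_id -> current text; vector_store: doc_id -> stored hash."""
    to_embed = [doc_id for doc_id, text in source_docs.items()
                if vector_store.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in vector_store
                 if doc_id not in source_docs]
    return to_embed, to_delete

# "b.txt" was edited after indexing; "c.txt" was deleted from the source.
source = {"a.txt": "unchanged text", "b.txt": "edited text"}
store = {"a.txt": content_hash("unchanged text"),
         "b.txt": content_hash("original text"),
         "c.txt": content_hash("deleted text")}
to_embed, to_delete = plan_sync(source, store)
```

Note that this only covers content changes; metadata changes such as document permissions would need their own comparison, since the text (and therefore its hash) is unchanged.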

Addressing the above data management challenges can be crucial for the successful implementation and operation of production RAG systems. Solutions may involve a combination of advanced database technologies, efficient algorithms, robust data governance policies, and continuous monitoring and optimization efforts. We at NetApp are constantly working to help our customers with solutions for managing their data for generative AI in a simple, secure and cost-effective manner.  

 

For customers looking to develop generative AI applications in AWS, we will be announcing some exciting updates in the near future. Stay tuned for more information!

 

Authors: Puneet Dhawan (@PuneetD) , Yuval Kalderon (@YuvalK)

 

 
