Tech ONTAP Blogs
A ready-to-deploy reference architecture that addresses the challenges of building and scaling enterprise AI data pipelines for Retrieval Augmented Generation (RAG) workflows that scale into AI Agents for advanced use cases.
Built on the Google Cloud Platform, this solution leverages the NVIDIA Foundational RAG Blueprint to develop a RAG pipeline that will serve as the foundational data plane for several other NVIDIA Blueprints and GenAI architectures that depend on data. Google Cloud NetApp Volumes (GCNV), also referred to as NetApp Volumes, is integrated as the data plane, providing the scalable, customizable, high-performance file storage that is essential for Enterprise RAG.
This design is a collaborative effort between NetApp, NVIDIA, and Google that outlines the deployment of an enterprise-ready Foundational Retrieval-Augmented Generation (RAG) pipeline. These pipelines are designed to empower AI agents by connecting them with enterprise data, utilizing advanced models for reasoning, and ultimately delivering trusted business insights.
It offers a step-by-step guide and reference architecture for managing large volumes of multimodal enterprise content, ensuring fast, accurate responses. This capability is powered by NVIDIA's RAG Blueprints, alongside high-performance storage provided by NetApp Volumes.
Over 80% of all organizational data is unstructured or semi-structured (documents, emails, code, media), a vast, underutilized asset pool. This multi-modal data holds deep, domain-specific insights essential for business innovation, from accelerated R&D to hyper-personalized customer experiences.
The challenge, however, is accessing this knowledge securely, accurately, and at scale. This realization is driving a critical shift: the need for well-defined, scalable, secure and repeatable Retrieval-Augmented Generation (RAG) data pipelines that natively integrate with primary enterprise data sources.
This reference architecture details a robust Enterprise Foundational RAG solution on Google Cloud Platform, utilizing the NVIDIA Foundational RAG Blueprint and the high-performance file storage of Google Cloud NetApp Volumes.
By bridging the gap between proprietary data and generative AI models, this solution empowers organizations to extract actionable intelligence from their data landscape.
The NVIDIA AI Blueprint for RAG is a production-ready, modular, GPU-optimized reference architecture for building high-accuracy, high-performance RAG systems for enterprise use cases like search and copilots.
It supports modern agent ecosystems with features like summarization, reasoning configurability, query decomposition, and dynamic metadata filtering, using native Python libraries, OpenAI-compatible APIs, and a built-in data catalog.
It enables advanced multimodal generation and has a robust pipeline for extracting and enriching various content types.
Designed for flexibility and scale, it features hybrid dense + sparse retrieval, multi-collection search, GPU-accelerated indexing, reranking, and pluggable vector database support (ElasticSearch, Milvus). It includes Observability (OpenTelemetry), evaluation scripts (RAGAS), and optional guardrails.
Deployable via Docker or Kubernetes, it is customizable, runs standalone or integrates with other systems, and is a foundational layer of the NVIDIA AI Data Platform, transforming raw data into AI-ready knowledge. Use this architecture to ground AI-driven decisions in trusted enterprise data at production scale.
It is also foundational to the AI Agent for Enterprise Research, which is showcased in this reference architecture, providing the trusted knowledge base, summarization, and retrieval capabilities required for advanced, reasoning-driven enterprise agents.
NetApp Volumes is a fully managed, first-party Google Cloud data storage service built on NetApp's ONTAP technology. It delivers high-performance, scalable, and feature-rich file and block storage over NFS, SMB and iSCSI protocols for enterprise workloads within the Google Cloud Platform.
NetApp Volumes is uniquely suited to serve as the high-performance data plane for enterprise RAG pipelines by delivering -
NetApp Volumes caters to multiple service levels that can deliver the required performance for RAG pipelines based on the scale of operations -
NetApp Volumes allows customers to start at a minimum storage capacity of 1 TiB and supports capacity growth in increments of 1 GiB, providing a fine-grained approach to capacity management in the cloud. This enables RAG pipelines to grow storage capacity on demand as the source data for ingestion grows.
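As a rough illustration of the sizing rule above, the following sketch computes a capacity request in GiB for a given dataset size. The function name `capacity_gib` and the default headroom percentage are illustrative assumptions and are not part of any NetApp or Google Cloud tooling.

```shell
#!/bin/sh
# Hypothetical helper: capacity (in GiB) to request for a dataset of a given size.
# Assumptions: a 1 TiB (1024 GiB) minimum and 1 GiB growth increments, as described
# above; the headroom percentage is an arbitrary illustrative choice.
capacity_gib() {
  dataset_gib=$1
  headroom_pct=${2:-20}
  # Integer ceiling of dataset_gib * (100 + headroom_pct) / 100
  needed=$(( (dataset_gib * (100 + headroom_pct) + 99) / 100 ))
  min_gib=1024
  if [ "$needed" -lt "$min_gib" ]; then
    needed=$min_gib
  fi
  echo "$needed"
}

# A small dataset is clamped to the minimum; a larger one grows in 1 GiB steps.
capacity_gib 200
capacity_gib 1500 20
```

Because growth happens in 1 GiB steps, the result can be requested as-is rather than rounded up to a coarser size.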
In addition to this, the foundational ONTAP features delivered by NetApp Volumes provide advanced data management capabilities that enrich the implementation of RAG pipelines in several ways:
| ONTAP Features via NetApp Volumes | Benefits to GenAI and RAG Pipelines |
| --- | --- |
| Snapshots | Creates immediate, space-efficient, point-in-time copies of the volumes that host the RAG pipelines and the source data. Essential for auditability and quick rollback/recovery. |
| FlexClone | Allows for instantaneous, zero-capacity-copy creation of full data volumes. This enables rapid versioning and experimentation (e.g., testing a new RAG data-pipeline) without wasting time or storage capacity by copying huge amounts of data. |
| Auto-tiering | Tiers source data to lower-cost storage after ingestion workflows, which helps free up the high-performance tier for busy workloads. |
As part of this reference architecture, we will deploy the NVIDIA Foundational RAG blueprint on Google Cloud Platform. This blueprint is available as a Helm package and ready for deployment on Kubernetes.
The Foundational RAG Blueprint will be deployed on GKE using the persistent storage from NetApp Volumes. Subsequently, the GCNV Data Ingestor will be deployed and the RAG data-pipeline will be configured. This will set up the file selection and embedding workflow for the data that is present in NetApp Volumes.
Google Kubernetes Engine (GKE) - a single GKE cluster will be provisioned with the requisite GPUs for container orchestration to host all the pods corresponding to the blueprint deployments.
The GKE cluster will be provisioned with a node-pool that comprises the requisite NVIDIA GPUs.
The NVIDIA blueprint components are packaged as containers and will be deployed as Kubernetes deployments on the GKE cluster. In this reference architecture design, the Foundational RAG blueprint will be deployed on a GKE cluster with a single node that contains 8 x NVIDIA RTX 6000 PRO GPUs. The NVIDIA drivers are automatically deployed as part of the GKE cluster creation.
Google Cloud NetApp Volumes (GCNV) - will deliver the high-performance storage required to deploy the blueprints and will also serve as the source of user data for the RAG pipeline. NetApp Trident CSI is installed and configured to work with a GCNV Storage Pool as the storage backend, and a Storage Class that is mapped to the Trident provisioner is configured as the default storage class for the GKE cluster.
Networking - Both GKE and GCNV are provisioned in the same region and connected to the same workload VPC.
Create a GKE Cluster in your region of choice where you have GPUs available.
Below is a sample command that creates a GKE Cluster with the following parameters-
Replace the placeholders enclosed in << >> with details specific to your environment.
gcloud container --project "<<project_name>>" clusters create "nvidia-gcnv-ai-blueprint" \
--zone "us-east1-b" \
--enable-ip-alias \
--cluster-version "1.35.1-gke.1396002" \
--release-channel "regular" \
--machine-type "g4-standard-384" \
--accelerator "type=nvidia-rtx-pro-6000,count=8,gpu-driver-version=default" \
--image-type "UBUNTU_CONTAINERD" \
--disk-type "hyperdisk-balanced" \
--disk-size "300" \
--num-nodes "1" \
--network "projects/<<project_name>>/global/networks/<<network_name>>" \
--subnetwork "projects/<<project_name>>/regions/us-east1/subnetworks/<<subnetwork_name>>" \
--workload-pool "<<project_name>>.svc.id.goog" \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,JOBSET,CADVISOR,KUBELET,DCGM \
--node-locations "us-east1-b"
Note: Enabling Workload Identity on the GKE cluster is a prerequisite to use Cloud Identity with the NetApp Trident CSI, which will be covered later in this solution.
A Storage Pool using the Flex service level will be provisioned in the same region/zone where the GKE cluster was created. The pool is configured with a throughput of 1 GiB/s and 16384 IOPS; you can change these values to suit your requirements.
Replace the placeholders enclosed in << >> with details specific to your environment.
gcloud netapp storage-pools create nvidia-aiq-blueprint --location=us-east1-b --capacity=1024 --network=name=<<network_name>> --service-level=Flex --custom-performance-enabled=true --total-iops=16384 --total-throughput=1024
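The relationship between the stated 1 GiB/s throughput and the value 1024 in the command can be made explicit with a small dry-run sketch. It assumes that `--total-throughput` is expressed in MiB/s, which is consistent with the sample command above. The VPC name is a hypothetical placeholder. The script only prints the command and does not call gcloud.

```shell
#!/bin/sh
# Illustrative values; replace them with details specific to your environment.
POOL_NAME="nvidia-aiq-blueprint"
LOCATION="us-east1-b"
NETWORK_NAME="my-vpc"      # hypothetical VPC name
CAPACITY_GIB=1024          # 1 TiB
THROUGHPUT_GIBPS=1
TOTAL_IOPS=16384

# Assumption: the throughput flag takes MiB/s, so 1 GiB/s becomes 1024.
THROUGHPUT_MIBPS=$(( THROUGHPUT_GIBPS * 1024 ))

# Dry run: assemble and print the command rather than executing it.
POOL_CMD="gcloud netapp storage-pools create ${POOL_NAME} --location=${LOCATION} --capacity=${CAPACITY_GIB} --network=name=${NETWORK_NAME} --service-level=Flex --custom-performance-enabled=true --total-iops=${TOTAL_IOPS} --total-throughput=${THROUGHPUT_MIBPS}"
echo "$POOL_CMD"
```

Review the printed command, then run it directly once the values match your environment.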
The next step is to deploy the Trident CSI driver on the GKE cluster and configure the Storage Pool on NetApp Volumes as a Storage Backend for Trident to work with.
helm repo add netapp-trident https://netapp.github.io/trident-helm-chart
Install the Helm chart and use the service account that was created earlier.
helm install trident-csi netapp-trident/trident-operator --version 100.2602.0 --create-namespace --namespace trident --set cloudProvider="GCP" --set cloudIdentity="'iam.gke.io/gcp-service-account: <<name_of_service_account>>'"
kubectl get pods -n trident
NAME                                  READY   STATUS    RESTARTS   AGE
trident-controller-6c9d99984d-wcmft   6/6     Running   0          5m3s
trident-node-linux-2wlgs              2/2     Running   0          5m3s
trident-node-linux-bg2dp              2/2     Running   0          5m3s
trident-node-linux-v69wq              2/2     Running   0          5m3s
trident-operator-774b6c5568-v2tqj     1/1     Running   0          5m26s
gcloud iam service-accounts add-iam-policy-binding <<name_of_service_account>> --role=roles/iam.workloadIdentityUser --member="serviceAccount:<<project_name>>.svc.id.goog[trident/trident-controller]"
Annotate the trident-controller service account
kubectl annotate serviceaccount trident-controller --namespace trident iam.gke.io/gcp-service-account=<<name_of_service_account>>
serviceaccount/trident-controller annotated
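The `--member` value used in the binding follows the Workload Identity pattern `serviceAccount:<project>.svc.id.goog[<namespace>/<kubernetes service account>]`. A minimal sketch of building that string is shown here. The helper name `wi_member` and the project ID are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the Workload Identity member string for a
# Kubernetes service account.
# Format: serviceAccount:<project>.svc.id.goog[<namespace>/<k8s-service-account>]
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

PROJECT_ID="my-project"   # hypothetical project ID
wi_member "$PROJECT_ID" trident trident-controller
```

The namespace and service account name must match those of the Trident controller, otherwise the controller cannot use the bound Google Cloud service account.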
To configure the Storage Pool created earlier as a Storage Backend, a TridentBackendConfig(TBC) needs to be defined.
Refer to the below sample definition of the TridentBackendConfig and create a definition trident-backend-gcnv.yaml.
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: tbc-gcnv
spec:
  version: 1
  storageDriverName: google-cloud-netapp-volumes
  backendName: backend-tbc-gcp-gcnv
  projectNumber: '<<gcp_project_number>>'
  location: us-east1
  storage:
    - labels:
        performance: flex
      serviceLevel: flex
kubectl apply -f trident-backend-gcnv.yaml -n trident
tridentbackendconfig.trident.netapp.io/tbc-gcnv created
A Storage Class that uses the NetApp Trident CSI as the provisioner will be created and it will be set as the default storage class.
Refer to the sample definition of the Storage Class here: https://github.com/NetApp/trident/blob/master/trident-installer/sample-input/storage-class-samples/storage-class-ontapnas-gold.yaml
Ensure that the following parameters in the Storage Class are set -
provisioner: csi.trident.netapp.io
backendType: google-cloud-netapp-volumes
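A minimal Storage Class carrying these two parameters might look like the sketch below, written to a local file for review. The name gcnv-flex matches the class referenced elsewhere in this guide. The file name and the `allowVolumeExpansion` setting are assumptions and do not come from the linked sample.

```shell
#!/bin/sh
# Write a minimal StorageClass manifest for the Trident GCNV backend.
# Assumptions: the file name and allowVolumeExpansion are illustrative choices.
cat > gcnv-flex-sc.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gcnv-flex
provisioner: csi.trident.netapp.io
parameters:
  backendType: "google-cloud-netapp-volumes"
allowVolumeExpansion: true
EOF

# Apply it against the cluster when ready:
# kubectl apply -f gcnv-flex-sc.yaml
```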
After the Storage Class has been created, set it as the Default storage class.
kubectl patch storageclass gcnv-flex -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
In this reference architecture the Blueprint is deployed using a Helm chart that is provided in this repository https://github.com/NVIDIA-AI-Blueprints/rag/blob/main/docs/deploy-helm.md.
The NVIDIA NIM Operator will be installed in a dedicated namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
--username='$oauthtoken' \
--password=$NGC_API_KEY
helm repo update
helm install nim-operator nvidia/k8s-nim-operator -n nim-operator --create-namespace
kubectl create namespace rag
model:
  engine: tensorrt_llm
  precision: "fp8"
  qosProfile: "throughput"
  tensorParallelism: "1"
  gpus:
    - product: "rtx6000_blackwell_sv"
helm upgrade --install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.5.0.tgz \
--username '$oauthtoken' \
--password "${NGC_API_KEY}" \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY \
-f deploy/helm/nvidia-blueprint-rag/values.yaml
kubectl get pods -n rag
kubectl get pvc -n rag
kubectl get svc -n rag
You can access the GUI of the RAG service by port forwarding the rag-frontend service to your local machine.
kubectl port-forward -n rag service/rag-frontend 3000:3000 --address 0.0.0.0
Click on “New Collection”.
Enter a name for the Collection, e.g. ‘gcnv_rag’.
You may configure additional settings under the Data Catalog, Collection Configuration and Metadata Schema sections.
Click “Create Collection”.
In this step, a data ingestion service is configured and used to set up a data pipeline that creates a collection of files from NetApp Volumes and ingests them into the RAG system.
This is an important step in the configuration that indicates how the source data will be made available to the RAG pipeline. The GCNV volume that represents the data source will be presented as a Persistent Volume Claim to the GCNV data ingestor pod.
Subsequently, the workflows that are part of the data ingestor will enable the creation of vector embeddings for the files that the user has identified. The details of these workflows will be covered in the upcoming section.
The GCNV data ingestor will be deployed in a separate namespace. If needed, it can also be deployed in the same namespace as the RAG Blueprint.
kubectl create namespace gcnv-data-ingestor
Navigate to “examples/google-cloud-netapp-volumes-data-ingestor/values.yaml” in the directory where the GitHub repository https://github.com/NVIDIA-AI-Blueprints/rag.git was cloned earlier.
Update the following parameters in the values.yaml file before installing the helm chart for the GCNV data ingestor. Refer to the README.md file for detailed customization.
image.repository: "ghcr.io/netapp/gcnv_data_ingestor"
image.tag: "latest"
appData.storageClassName: "gcnv-flex"
appData.size: app PVC size, defaults to `50Gi`
Option 1 - If data for RAG will be copied into the system after the blueprint is set up, the recommended option is to provision a new PVC to store the data.
New volume creation, where data will be made available
sourceData.storageClassName: "gcnv-flex"
sourceData.size: source PVC size request, defaults to `200Gi`
Option 2 - If data for RAG is already available in a GCNV volume, then leverage the Trident CSI Volume Import feature to present it as a PVC to the GKE cluster and use the below parameters to present the PVC as the data source for RAG.
To use existing PVC with data for RAG
sourceData.create: false
sourceData.existingClaim: <<name of PVC>>
Note: The PVC must exist in the same namespace where the ingestor is deployed.
env.nvIngestEndpoint: http://ingestor-server.rag.svc.cluster.local:8082/v1
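For Option 2, the following sketch prepares the two files involved: a PVC definition for Trident volume import and a values override for the ingestor chart. The PVC name rag-source-data, both file names, the access mode, and the nested layout of the values file (inferred from the dotted parameter names above) are assumptions; check them against the chart's README.md before use.

```shell
#!/bin/sh
# PVC definition used by Trident volume import (names are hypothetical).
cat > source-data-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rag-source-data
  namespace: gcnv-data-ingestor
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: gcnv-flex
EOF

# Import the existing GCNV volume under that PVC (run against your cluster):
# tridentctl import volume backend-tbc-gcp-gcnv <<name_of_gcnv_volume>> -f source-data-pvc.yaml -n trident

# Values override for Option 2; the nesting is assumed from the dotted names.
cat > values-option2.yaml <<'EOF'
appData:
  storageClassName: gcnv-flex
  size: 50Gi
sourceData:
  create: false
  existingClaim: rag-source-data
env:
  nvIngestEndpoint: http://ingestor-server.rag.svc.cluster.local:8082/v1
EOF
```

The PVC is created in the gcnv-data-ingestor namespace so that it sits alongside the ingestor pod that mounts it.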
A dataset comprising judgements from the Supreme Court of India has been downloaded from Kaggle and made available in the PVC.
This dataset contains more than 26,000 PDF files of judgements from 1950 to 2024.
Run the below command to install the helm chart with the configured values.
helm install gcnv-data-ingestor ./examples/gcnv-data-ingestor \
--namespace gcnv-data-ingestor \
-f ./examples/gcnv-data-ingestor/values.yaml
kubectl get pod,svc,pvc -n gcnv-data-ingestor
You can access the GUI of the data ingestor for NetApp Volumes by port forwarding the gcnv-data-ingestor service to your local machine.
kubectl port-forward -n gcnv-data-ingestor service/gcnv-data-ingestor 8000:8000 --address 0.0.0.0
Click on “Create Scanner”
Enter the following details to create the scanner -
Click on “Create Scanner”
The scanner is then created. Since this is the first job run, a full sync operation is performed.
Subsequent scans run every 60 minutes and are incremental: only files that were modified or newly created in the last 60 minutes are presented to the RAG ingestor. Files that have not changed in that window are skipped.
In this manner, the scanner ensures that the vector embeddings in the RAG endpoint are kept in sync with the source data every 60 minutes. If needed, the sync interval can be updated in the scanner at any time.
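The idea of an incremental scan can be illustrated with a small, self-contained sketch that selects only files modified since the previous scan. This is not the data ingestor's implementation. The marker file, directory, and file names are hypothetical, and comparing modification times is only one possible way to detect changes.

```shell
#!/bin/sh
# Sketch of incremental selection by modification time (not the ingestor's code).
SRC_DIR=$(mktemp -d)
MARKER="$SRC_DIR/.last_scan"   # hypothetical marker recording the previous scan

# Two files that predate the last scan, then the scan marker itself.
touch -t 202401010000 "$SRC_DIR/old-a.pdf" "$SRC_DIR/old-b.pdf"
touch -t 202406010000 "$MARKER"

# One file created after the last scan (timestamp: now).
touch "$SRC_DIR/new-c.pdf"

# Select only PDFs modified more recently than the marker.
CHANGED=$(find "$SRC_DIR" -type f -name '*.pdf' -newer "$MARKER")
echo "$CHANGED"

# Record this scan so that the next run starts from here.
touch "$MARKER"
```

Only new-c.pdf is selected. The two older files are skipped, which mirrors how an incremental scan avoids re-ingesting unchanged content.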
After the files are presented to the RAG ingestor, the Scanner transitions to an “Idle” state.
The scanner indicates that 61 files were uploaded to the RAG Ingestor in the previous iteration.
A look at the source folder confirms that the directory contains 61 files, all of them PDFs.
On accessing the RAG Web Server interface and navigating to the “gcnv_rag” collection, it can be easily verified that the 61 files have been ingested successfully.
A prompt which can be answered using the available data in the gcnv_rag collection is submitted through the RAG Web Server chat interface and an appropriate response is seen.
This reference architecture demonstrates a robust, scalable, and highly efficient solution for building and operating enterprise-grade Retrieval-Augmented Generation (RAG) pipelines that are fundamental to enabling advanced GenAI use cases through AI Agents, such as Deep Research.
By leveraging the Google Cloud Platform (GCP), organizations gain the massive scalability, global reach, and secure infrastructure required to handle petabytes of unstructured data. The combination of GKE for container orchestration and Google Cloud NetApp Volumes (GCNV) for high-performance, enterprise-grade file storage provides a seamless and powerful foundation for AI at scale.
The NVIDIA Foundational RAG Blueprint offers a highly optimized, production-ready framework that drastically simplifies deployment. It provides an efficient, end-to-end RAG pipeline, complete with GPU acceleration, advanced retrieval techniques (hybrid search, re-ranking), observability, and evaluation tools, effectively reducing the time-to-market for AI-driven applications.
The critical layer between NetApp Volumes and the RAG Blueprint that elevates this reference architecture is the Google Cloud NetApp Volumes (GCNV) Data Ingestor. This service directly addresses the perennial challenge in RAG: bridging the gap between proprietary enterprise data and the RAG endpoint.
The GCNV Data Ingestor provides customers with the ability to:
This architecture delivers a complete, production-ready solution, enabling enterprises to transform vast amounts of unstructured data into trustworthy, actionable intelligence that is securely built upon a performant and scalable foundation.
The author extends appreciation and recognition to the following individuals for their contributions to this reference architecture -