Tech ONTAP Blogs

Zero to LLM Inference with NVIDIA NIM and Google Cloud NetApp Volumes

moglesby
NetApp

Several weeks ago, I published a post walking through the deployment of a basic LLM inference stack that will run on any of NetApp's NVIDIA-based AIPods. What if you don't have immediate access to a NetApp AIPod, however? I have good news for you - the stack that I deployed in that post is fully portable, meaning that you can deploy the same stack in any environment in which you have access to NVIDIA GPUs and NetApp storage. If your goal is to get started quickly, the public cloud is a great option. In this post, I will walk through the deployment of the same basic LLM inference stack in Google Cloud using Google Kubernetes Engine (GKE) and Google Cloud NetApp Volumes.

 

Prerequisites

 

  1. You must have created at least one Google Cloud NetApp Volumes storage pool.
  2. You must have created a GKE cluster, and you must have NVIDIA GPUs available in your GKE cluster. NVIDIA's tutorial on running NIM on GKE can be a helpful reference.
  3. You must have installed and configured NetApp Trident in your GKE cluster, and you must have a Trident-affiliated StorageClass as your default StorageClass (a quick way to verify this is shown just after this list).
  4. You must have Helm and kubectl installed on the laptop or jump box that you are using to administer your Kubernetes cluster.
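
If you want to confirm the Trident prerequisite before proceeding, the commands below are one way to do so. Treat them as a sketch: the Trident namespace and the StorageClass name are assumptions, so substitute the values used in your environment.

# Confirm that the Trident pods are running (the namespace may differ in your cluster).
kubectl get pods -n trident

# List StorageClasses; the default is marked "(default)".
kubectl get storageclass

# If your Trident-backed StorageClass is not the default, mark it as such.
kubectl patch storageclass <trident-storage-class> \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'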

 

Deploy an Inference Server

 

First, you need to deploy an inference server. Just as I did in the previous post, I will use an NVIDIA NIM as my inference server. NIMs are prebuilt, optimized inference microservices for rapidly deploying the latest AI models on any NVIDIA-accelerated infrastructure.

 

You can use NVIDIA NIM for LLMs to quickly deploy an LLM inference server. First, create a nim-values.yaml file.

 

# Change image to use a different model (https://docs.nvidia.com/nim/large-language-models/latest/models.html)
image:
  repository: "nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1"
  tag: 1.8.5

model:
  ngcAPIKey: <ngc-api-key> # Enter your NGC API Key

persistence:
  enabled: true
  size: 1Ti

resources:
  limits:
    nvidia.com/gpu: 4 # Change based on model requirements (https://docs.nvidia.com/nim/large-language-models/latest/models.html)

startupProbe:
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 1500   # tune based on the size of your model; model downloads from HuggingFace may be throttled.
readinessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
livenessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
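
This values file requests four GPUs for the llama-3.3-nemotron-super-49b-v1 NIM. Before installing, it can be worth confirming that your GKE nodes actually expose that many allocatable GPUs. The custom-columns query below is one way to check; output formatting may vary slightly between kubectl versions.

# Show the allocatable NVIDIA GPU count per node.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"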

 

Next, use Helm to deploy NIM for LLMs.

 

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.7.0.tgz --username='$oauthtoken' --password=<ngc-api-key>
helm install -n nim --create-namespace nim-nemotron-49b nim-llm-1.7.0.tgz -f nim-values.yaml
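
Because persistence is enabled in nim-values.yaml, the chart creates a PersistentVolumeClaim for the model cache, which Trident satisfies with a volume backed by Google Cloud NetApp Volumes. The exact claim name is derived from the chart and release name, so yours may differ slightly; you can confirm that it is bound with the following commands.

# The PVC should report a STATUS of "Bound" and a 1Ti capacity.
kubectl -n nim get pvc

# The corresponding PersistentVolume should reference your Trident-backed StorageClass.
kubectl get pv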

 

After deploying, confirm that the NIM is running.

 

kubectl -n nim get pod

 

The output of this command should show a single running pod, with a single container reporting as "ready." It may take a while for this container to reach a "ready" state - the model will be downloaded as part of the initial startup process, and large model downloads are frequently throttled.

 

NAME                         READY   STATUS    RESTARTS   AGE
nim-nemotron-49b-nim-llm-0   1/1     Running   0          1m
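
Once the container is ready, you can optionally smoke-test the OpenAI-compatible API from your workstation through a port-forward. This is just a sketch: the service name matches the Helm release used above, but the served model name in the second request is an assumption, so check the /v1/models response for the exact value.

# Forward the NIM service to your workstation.
kubectl -n nim port-forward svc/nim-nemotron-49b-nim-llm 8000:8000

# In a second terminal, list the models served by the NIM...
curl -s http://localhost:8000/v1/models

# ...and send a minimal chat completion request.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'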

 

Deploy a Chat UI

 

You now have an LLM inference server with an OpenAI-compatible API, but you still need a user-facing chat UI so that you (and the users that you support) can easily interact with your LLM. As I mentioned in the previous post, there are many options, several of them free and open source, but Open WebUI stands out - it is a fully packaged, production-ready, open-source chat UI that is very popular in the AI engineering community.

 

You can use Open WebUI to quickly deploy a chat UI in front of your inference server. First, create an open-webui-values.yaml file.

 

ollama:
  enabled: false

pipelines:
  enabled: false

service:
  type: LoadBalancer # NodePort or LoadBalancer

openaiBaseApiUrl: "http://nim-nemotron-49b-nim-llm.nim.svc.cluster.local:8000/v1" # URL for NIM inference API (only accessible from within K8s cluster)

extraEnvVars:
  - name: OPENAI_API_KEY
    value: "" # Leave as empty string
  - name: ENABLE_WEB_SEARCH
    value: "True" # Change to False to disable web search within chat UI
  - name: WEB_SEARCH_ENGINE
    value: duckduckgo
  - name: ENABLE_DIRECT_CONNECTIONS
    value: "False"

# For additional options, including SSO integration,
# refer to https://github.com/open-webui/helm-charts/blob/main/charts/open-webui/values.yaml.
# For additional environment variables to include in the 'extraEnvVars' section,
# refer to https://docs.openwebui.com/getting-started/env-configuration.


Next, use Helm to deploy Open WebUI.

 

helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install open-webui open-webui/open-webui -n open-webui --create-namespace -f open-webui-values.yaml

 

The console output of the 'helm install' command will include instructions for accessing the application. Note down the URL.

 

2. Access the Application:
  Access via LoadBalancer service:
  NOTE: It may take a few minutes for the LoadBalancer IP to be available.
  NOTE: The external address format depends on your cloud provider:
    - AWS: Will return a hostname (e.g., xxx.elb.amazonaws.com)
    - GCP/Azure: Will return an IP address
  You can watch the status by running:

    kubectl get -n open-webui svc open-webui --watch
    export EXTERNAL_IP=$(kubectl get -n open-webui svc open-webui -o jsonpath="{.status.loadBalancer.ingress[0].hostname:-.status.loadBalancer.ingress[0].ip}")
    echo http://$EXTERNAL_IP:80

 

After deploying, confirm that Open WebUI is running.

 

kubectl -n open-webui get pod

 

The output of this command should show a single running pod.

 

NAME           READY   STATUS    RESTARTS   AGE
open-webui-0   1/1     Running   0          1m
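
If the LoadBalancer address is not ready yet, or you chose NodePort, a port-forward is a quick way to reach the UI in the meantime; the service name and port below match the chart defaults used in this example.

# Forward the Open WebUI service to your workstation, then browse to http://localhost:8080.
kubectl -n open-webui port-forward svc/open-webui 8080:80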

 

Access and Configure Chat UI

 

Access the Open WebUI application in your web browser using the URL that you noted down, and then step through the initial setup wizard. You now have your own private Chat UI!

 


 

Troubleshooting Tip

 

If you can't access the application, you may need to add a firewall ingress rule that allows access to the Open WebUI service's NodePort on each of your GKE nodes. To retrieve the NodePort value, run the following command.

 

kubectl get -n open-webui -o jsonpath="{.spec.ports[0].nodePort}" services open-webui
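
With the NodePort value in hand, a firewall rule along the following lines opens access. This is a sketch: the rule name, VPC network, node tag, and source range are placeholders, so substitute the values for your project and GKE node pool.

gcloud compute firewall-rules create allow-open-webui-nodeport \
  --network=<vpc-network> \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:<nodeport> \
  --target-tags=<gke-node-tag> \
  --source-ranges=<client-cidr>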

 

If you did not configure SSO integration in your open-webui-values.yaml file (I did not in the preceding example), you will need to create accounts for your users via the admin panel. To access the admin panel, click on your name in the bottom-left corner of the screen, and then click on 'Admin Panel.'

 


 

To add a user, click on the '+' icon in the top-right corner of the screen.

 


 

If you add additional users, you will also want to make your model visible to them. By default, the model is visible only to you, meaning that other users will have no model to chat with. To change this, access the admin panel, click on 'Settings' in the navigation bar at the top of the screen, click on 'Models' in the navigation panel on the left, and then click on the pencil icon next to your model.

 


 

Once on the model settings page, click on 'Visibility,' and then choose 'Public' in the drop-down menu. At this point, you can also decide whether you want to turn reasoning on. If you used the same values in your nim-values.yaml as I did in the preceding example, then you deployed the llama-3.3-nemotron-super-49b-v1 model inside your NIM. This is a reasoning model, but reasoning is disabled by default. To enable reasoning, add "detailed thinking on" to the 'System Prompt.'
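
The same toggle applies if you call the NIM's API directly (for example, through the port-forward shown earlier): passing "detailed thinking on" as the system message should enable reasoning for that request. The model name below is an assumption; check /v1/models for the exact value.

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [
          {"role": "system", "content": "detailed thinking on"},
          {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
        ],
        "max_tokens": 512
      }'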

 


 

You and your users can now chat with your NIM! You are now a GenAI superstar, and it only took a few minutes!

 


 


 

What about RAG?

 

RAG, or retrieval-augmented generation, is a common technique for augmenting LLMs with private data and knowledge without performing any additional training. Open WebUI includes RAG capabilities, and its Workspace functionality is particularly powerful. Review the Open WebUI documentation for more details.

 

Why NetApp?

 

At this point, you may be asking why you should use Google Cloud NetApp Volumes and Trident for the persistent storage component of this stack. The answer is simple. If you run NIM for LLMs and Open WebUI on Google Cloud NetApp Volumes, provisioned through Trident, then your model versions, chat UI configurations, chat history, RAG knowledge base, and so on are protected. You can use snapshot copies to efficiently back up your data so that you never lose anything. Additionally, since NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, in addition to on-premises storage appliances, your entire application stack is fully portable: you can move between the cloud and the datacenter, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings. You will never be locked in to one specific environment. NetApp even offers free tools like Trident Protect that greatly simplify the process of migrating applications across environments.
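
As a concrete example, the following sketch takes a CSI snapshot of the Open WebUI data volume. It assumes that the CSI snapshot CRDs and snapshot controller are present in your cluster (recent GKE versions typically ship them; verify in your environment), and it uses a placeholder for the PVC name, which you can look up with 'kubectl -n open-webui get pvc'.

# Create a Trident VolumeSnapshotClass and snapshot the Open WebUI PVC.
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: trident-snapshot-class
driver: csi.trident.netapp.io
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: open-webui-snapshot
  namespace: open-webui
spec:
  volumeSnapshotClassName: trident-snapshot-class
  source:
    persistentVolumeClaimName: <open-webui-pvc-name>  # look up with 'kubectl -n open-webui get pvc'
EOF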

 

Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.
