Confused about where to start with LLMs? You've come to the right place. In this post, I will walk through the deployment of a basic LLM inference stack that will run on any of NetApp's NVIDIA-based AIPods. This stack is appropriate for smaller-scale deployments and POCs.
Prerequisites
- You must have an NVIDIA-based NetApp AIPod, such as NetApp AIPod for NVIDIA DGX or NetApp AIPod with Lenovo.
- You must have installed Kubernetes on one or more compute nodes in your AIPod.
- You must have installed and configured the NVIDIA GPU Operator in your Kubernetes cluster.
- You must have installed and configured NetApp Trident in your Kubernetes cluster, and you must have a Trident-affiliated StorageClass as your default StorageClass.
- You must have Helm installed on the laptop or jump box that you are using to administer your Kubernetes cluster.
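If you want to sanity-check these prerequisites before proceeding, a few standard commands will do it. This is a minimal sketch; the GPU Operator namespace ('gpu-operator' below) and your StorageClass names will vary by environment.

kubectl get nodes                 # all nodes should be Ready
kubectl -n gpu-operator get pods  # GPU Operator pods should be Running
kubectl get storageclass          # a Trident-affiliated class should be marked "(default)"
helm version                      # Helm should be v3.x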
Deploy an Inference Server
First, you need to deploy an inference server. There are many different options, but since you have an NVIDIA-based AIPod, you have access to the NVIDIA AI Enterprise suite of tools, libraries, and frameworks. This includes NVIDIA NIMs, which are prebuilt, optimized inference microservices for rapidly deploying the latest AI models on any NVIDIA-accelerated infrastructure.
You can use NVIDIA NIM for LLMs to quickly deploy an LLM inference server. First, create a nim-values.yaml file.
# Change image to use a different model (https://docs.nvidia.com/nim/large-language-models/latest/models.html)
image:
  repository: "nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1"
  tag: 1.8.5
model:
  ngcAPIKey: <ngc-api-key> # Enter your NGC API key
persistence:
  enabled: true
  size: 1Ti
resources:
  limits:
    nvidia.com/gpu: 4 # Change based on model requirements (https://docs.nvidia.com/nim/large-language-models/latest/models.html)
Next, use Helm to deploy NIM for LLMs.
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.7.0.tgz --username='$oauthtoken' --password=<ngc-api-key>
helm install -n nim --create-namespace nim-nemotron-49b nim-llm-1.7.0.tgz -f nim-values.yaml
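On first startup, the NIM downloads the model into the persistent volume that you configured in the 'persistence' section, so it may take several minutes before the pod is ready. If you want to watch progress, you can tail the pod's logs and confirm that the model cache PVC is bound. A minimal sketch, using the pod name that this example produces:

kubectl -n nim logs -f nim-nemotron-49b-nim-llm-0  # watch model download and server startup
kubectl -n nim get pvc                             # model cache PVC should be Bound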
After deploying, confirm that the NIM is running.
kubectl -n nim get pod
The output of this command should show a single running pod.
NAME                         READY   STATUS    RESTARTS   AGE
nim-nemotron-49b-nim-llm-0   1/1     Running   0          1m
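Optionally, you can also confirm that the inference API itself is responding before you put a UI in front of it. A minimal sketch: port-forward the NIM service (its name matches the in-cluster URL used later in this post) and send a test request to the OpenAI-compatible API. The model name below is an assumption; confirm it via the /v1/models endpoint.

# Forward the NIM service to your admin host (run in a separate terminal).
kubectl -n nim port-forward svc/nim-nemotron-49b-nim-llm 8000:8000
# List the models that the NIM is serving.
curl -s http://localhost:8000/v1/models
# Send a test chat completion request.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'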
Deploy a Chat UI
You now have an LLM inference server with an OpenAI-compatible API, but you still need a client-side chat UI so that you (and the users you support) can interact with your LLM. Again, there are many options, including several that are free and open source. Open WebUI stands out: it is a fully packaged, production-ready, open-source chat UI.
You can use Open WebUI to quickly deploy a chat UI in front of your inference server. First, create an open-webui-values.yaml file.
ollama:
  enabled: false
pipelines:
  enabled: false
service:
  type: NodePort # NodePort or LoadBalancer
openaiBaseApiUrl: "http://nim-nemotron-49b-nim-llm.nim.svc.cluster.local:8000/v1" # URL for NIM inference API (only accessible from within the K8s cluster)
extraEnvVars:
  - name: OPENAI_API_KEY
    value: "" # Leave as an empty string
  - name: ENABLE_WEB_SEARCH
    value: "True" # Change to False to disable web search within the chat UI
  - name: WEB_SEARCH_ENGINE
    value: duckduckgo
  - name: ENABLE_DIRECT_CONNECTIONS
    value: "False"
# For additional options, including SSO integration,
# refer to https://github.com/open-webui/helm-charts/blob/main/charts/open-webui/values.yaml.
# For additional environment variables to include in the 'extraEnvVars' section,
# refer to https://docs.openwebui.com/getting-started/env-configuration.
Next, use Helm to deploy Open WebUI.
helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install open-webui open-webui/open-webui -n open-webui --create-namespace -f open-webui-values.yaml
The console output of the 'helm install' command will include instructions for accessing the application. Note down the URL.
2. Access the Application:
Access via NodePort service:
export NODE_PORT=$(kubectl get -n open-webui -o jsonpath="{.spec.ports[0].nodePort}" services open-webui)
export NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
After deploying, confirm that Open WebUI is running.
kubectl -n open-webui get pod
The output of this command should show a single running pod.
NAME           READY   STATUS    RESTARTS   AGE
open-webui-0   1/1     Running   0          1m
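If the chat UI comes up but fails to list your model, a quick way to isolate the problem is to call the NIM's in-cluster API from a throwaway pod. A minimal sketch, using the public curlimages/curl image and the same URL that you configured in open-webui-values.yaml:

kubectl -n open-webui run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://nim-nemotron-49b-nim-llm.nim.svc.cluster.local:8000/v1/models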
Access and Configure Chat UI
Access the Open WebUI application in your web browser using the URL that you noted down, and then step through the initial setup wizard. You now have your own private Chat UI!

If you did not configure SSO integration in your open-webui-values.yaml file (I did not in the preceding example), then you will need to create accounts for your users via the admin panel. To access the admin panel, click your name in the bottom-left corner of the screen, and then click 'Admin Panel.'

To add a user, click on the '+' icon in the top-right corner of the screen.

If you add additional users, you will also want to make your model visible to them. By default, the model is visible only to you, meaning that other users will have no model to chat with. To make the model visible to all users, open the admin panel, click 'Settings' in the navigation bar at the top of the screen, click 'Models' in the navigation panel on the left, and then click the pencil icon next to your model.

Once on the model settings page, click 'Visibility' and choose 'Public' from the drop-down menu. At this point, you can also decide whether you want to turn reasoning on. If you used the same values in your nim-values.yaml file as in the preceding example, then you deployed the llama-3.3-nemotron-super-49b-v1 model inside your NIM. This is a reasoning model, but reasoning is disabled by default. To enable reasoning, add "detailed thinking on" to the 'System Prompt' field.
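The same toggle works when calling the NIM's API directly, which is handy for scripted clients. A minimal sketch, reusing the earlier port-forward and assuming the same model name:

# Enable reasoning by sending "detailed thinking on" as the system prompt.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [
          {"role": "system", "content": "detailed thinking on"},
          {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
        ]
      }'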

You and your users can now chat with your NIM! You are now a GenAI superstar, and it only took five minutes!


What about RAG?
RAG, or retrieval-augmented generation, is a common technique for augmenting LLMs with private data and knowledge without performing any additional training. Open WebUI includes RAG capabilities, and its Workspace functionality is particularly powerful. Review the Open WebUI documentation for more details.
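To see the core idea in miniature, here is a toy sketch of the RAG pattern against the NIM's OpenAI-compatible API. This is for illustration only, not how Open WebUI implements RAG; it assumes the earlier port-forward, a local ./docs directory of text files, and the jq utility.

# "Retrieve" relevant lines from local documents with a naive keyword search,
# then inject them into the prompt as context for the model.
QUESTION="What is our refund policy?"
CONTEXT=$(grep -ri "refund" ./docs | head -n 5)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg ctx "$CONTEXT" --arg q "$QUESTION" '{
        model: "nvidia/llama-3.3-nemotron-super-49b-v1",
        messages: [
          {role: "system", content: ("Answer using only this context:\n" + $ctx)},
          {role: "user", content: $q}
        ]
      }')"

Real RAG systems replace the naive keyword search with embedding-based retrieval over a vector store, which is what Open WebUI's knowledge features handle for you.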
Why NetApp?
At this point, you may be asking, why should I run all of this on NetApp? The answer is simple. If your goal is to build an enterprise-class, production-ready GenAI application, as opposed to a "science project" or proof of concept, then you need enterprise-class storage. If you run NIM for LLMs and Open WebUI on NetApp, using Trident, then your model versions, chat UI configurations, chat history, RAG knowledge base, and so on are all safe and protected. You can use Snapshot copies to efficiently back up your data, so that you can quickly recover from accidental deletion or corruption. Additionally, you can use NetApp's Autonomous Ransomware Protection to help keep your critical data safe. Finally, because NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, your entire application stack is fully portable: you can move between the cloud and the data center, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings.
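As a concrete example, you can take a point-in-time Snapshot copy of the NIM model cache with a standard CSI VolumeSnapshot. A minimal sketch: it assumes a Trident-backed VolumeSnapshotClass exists in your cluster, and the placeholder names should be replaced with values from your environment.

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: nim-model-cache-snap
  namespace: nim
spec:
  volumeSnapshotClassName: <trident-snapshot-class> # from 'kubectl get volumesnapshotclass'
  source:
    persistentVolumeClaimName: <nim-pvc-name> # from 'kubectl -n nim get pvc'
EOF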
Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.