Tech ONTAP Blogs

Zero to LLM Inference - vLLM Edition

moglesby
NetApp

A couple of weeks ago, I published a post walking through the deployment of a basic LLM inference stack that will run on any of NetApp's NVIDIA-based AIPods. In that post, I used NVIDIA NIM for LLMs as my inference server. NIM is powerful and easy to adopt, but it is not the only option for the inference server portion of the stack. Some organizations might not have an NVIDIA AI Enterprise subscription. Other organizations might just prefer open-source software or a custom hardware configuration. For these organizations, vLLM is a good option. In this post, I will walk through the deployment of a basic LLM inference stack that uses vLLM.

 

Prerequisites

 

  1. You must have one or more servers with NVIDIA GPUs.
  2. You must have installed Kubernetes on your servers.
  3. You must have installed and configured the NVIDIA GPU Operator in your Kubernetes cluster.
  4. You must have installed and configured NetApp Trident in your Kubernetes cluster, and you must have a Trident-affiliated StorageClass as your default StorageClass.
  5. You must have Helm and kubectl installed on the laptop or jump box that you are using to administer your Kubernetes cluster.
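 

Before proceeding, it is worth confirming that the GPU and storage prerequisites are actually in place. The commands below are a quick sanity check; they assume that the GPU Operator was installed into the gpu-operator namespace, so adjust the namespace to match your environment.

 

# GPU Operator pods should all be Running or Completed
kubectl -n gpu-operator get pods

# Each GPU node should report 'nvidia.com/gpu' under Capacity and Allocatable
kubectl describe nodes | grep "nvidia.com/gpu"

# A Trident-backed StorageClass should be marked "(default)"
kubectl get storageclass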

 

Deploy an Inference Server

 

First, you need to deploy an inference server. As I mentioned previously, this post will focus on vLLM for the inference server portion of the stack. vLLM is a popular open-source LLM inference engine that is known for its flexibility and performance.

 

To deploy vLLM on Kubernetes, you can use the Helm chart published by the vLLM Production Stack project. First, create a vllm-prod-stack-values.yaml file.

 

servingEngineSpec:
  vllmApiKey: "my-key" # Set an API key for your vLLM API
  modelSpec:
  - name: "mistral"
    repository: "vllm/vllm-openai"
    tag: "v0.9.1"
    modelURL: "mistralai/Mistral-Small-3.2-24B-Instruct-2506" # Choose a model from HuggingFace
    replicaCount: 1
    requestCPU: 14 # Set based on model requirements
    requestMemory: "64Gi" # Set based on model requirements
    requestGPU: 2 # Set based on model requirements
    pvcStorage: "1Ti"
    vllmConfig:
      tensorParallelSize: 2 # Set to number of GPUs
      maxModelLen: 16384 # Set based on model requirements
      extraArgs: ["--tokenizer-mode", "mistral", "--config_format", "mistral", "--load_format", "mistral", "--limit_mm_per_prompt", "image=4"] # Set based on model requirements
    hf_token: "<huggingface_token>" # Enter your HuggingFace access token
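 

The values above cover only a small subset of what the chart exposes. Once you have added the Helm repository in the next step, you can dump the chart's full default values, which is a handy way to confirm option names such as hf_token and vllmConfig against the specific chart version you are installing.

 

helm show values vllm/vllm-stack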

 

Next, use Helm to deploy vLLM.

 

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -n vllm-stack --create-namespace -f vllm-prod-stack-values.yaml

 

After deploying, confirm that vLLM is running.

 

kubectl -n vllm-stack get pod

 

The output of this command should show two running pods. It may take several minutes for the inference server pod to reach a running state as vLLM will need to download the model from HuggingFace.

 

NAME                                            READY   STATUS    RESTARTS   AGE
vllm-deployment-router-75d66f7fb4-6s5p7         1/1     Running   0          20m
vllm-mistral-deployment-vllm-86b7bc4bf5-9kkhz   1/1     Running   0          20m
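 

Optionally, you can confirm that the model cache PVC was provisioned by your default (Trident-backed) StorageClass and send a quick test request to the OpenAI-compatible API. The commands below are a sketch; they assume the router Service created by the chart (vllm-router-service, which is also referenced later in the Open WebUI configuration), the API key set in the values file ("my-key"), and that the served model name defaults to the modelURL.

 

# Confirm that the model cache PVC was bound by your Trident StorageClass
kubectl -n vllm-stack get pvc

# Port-forward the router Service and exercise the OpenAI-compatible API
kubectl -n vllm-stack port-forward svc/vllm-router-service 8000:80 &

# List the models that the router knows about
curl http://localhost:8000/v1/models -H "Authorization: Bearer my-key"

# Send a test chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer my-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506", "messages": [{"role": "user", "content": "Hello!"}]}'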

 

Deploy a Chat UI

 

You now have an LLM inference server with an OpenAI-compatible API, but you still need a client-side chat UI so that you (and the users that you support) can interact with your LLM. As I mentioned in the previous post, there are many different options, including several that are free and open source, but Open WebUI is one that stands out. Open WebUI is a fully packaged, production-ready, open-source chat UI.

 

You can use Open WebUI to quickly deploy a chat UI in front of your inference server. First, create an open-webui-values.yaml file.

 

ollama:
  enabled: false

pipelines:
  enabled: false

service:
  type: NodePort # NodePort or LoadBalancer

openaiBaseApiUrl: "http://vllm-router-service.vllm-stack.svc.cluster.local:80/v1" # URL for vLLM inference API (only accessible from within K8s cluster)

extraEnvVars:
  - name: OPENAI_API_KEY
    value: 'my-key' # vLLM API key that you set
  - name: ENABLE_WEB_SEARCH
    value: "True"
  - name: WEB_SEARCH_ENGINE
    value: duckduckgo
  - name: ENABLE_DIRECT_CONNECTIONS
    value: "False"

# For additional options, including SSO integration,
#   refer to https://github.com/open-webui/helm-charts/blob/main/charts/open-webui/values.yaml.
# For additional environment variables to include in the 'extraEnvVars' section,
#   refer to https://docs.openwebui.com/getting-started/env-configuration.
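 

One thing to note: the file above embeds the vLLM API key in plaintext. If the chart passes extraEnvVars through to the container's env list unchanged (as most charts do; check the chart's templates to confirm), you can reference a Kubernetes Secret instead. The secret name vllm-api-key below is an arbitrary example.

 

kubectl create namespace open-webui
kubectl -n open-webui create secret generic vllm-api-key --from-literal=OPENAI_API_KEY=my-key

 

Then, in open-webui-values.yaml, replace the plaintext OPENAI_API_KEY entry with a reference to the secret:

 

extraEnvVars:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: vllm-api-key
        key: OPENAI_API_KEY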

 

Next, use Helm to deploy Open WebUI.

 

helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install open-webui open-webui/open-webui -n open-webui --create-namespace -f open-webui-values.yaml

 

The console output of the 'helm install' command will include instructions for accessing the application. Note down the URL.

 

2. Access the Application:
  Access via NodePort service:
    export NODE_PORT=$(kubectl get -n open-webui -o jsonpath="{.spec.ports[0].nodePort}" services open-webui)
    export NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
    echo http://$NODE_IP:$NODE_PORT

 

After deploying, confirm that Open WebUI is running.

 

kubectl -n open-webui get pod

 

The output of this command should show a single running pod.

 

NAME          READY   STATUS    RESTARTS   AGE
open-webui-0  1/1     Running   0          1m
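 

As a quick connectivity check before logging in, you can request Open WebUI's health endpoint, reusing the NODE_IP and NODE_PORT variables from the 'helm install' output above. The /health path is what current Open WebUI releases expose; if it has changed in your version, simply open the URL in a browser instead.

 

curl http://$NODE_IP:$NODE_PORT/health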

 

Access and Configure Chat UI

 

Access the Open WebUI application in your web browser using the URL that you noted down, and then step through the initial setup wizard. You now have your own private Chat UI!

 

If you did not configure SSO integration in your open-webui-values.yaml file, as I did not in the preceding example, then you will need to create accounts for your users via the admin panel. To access the admin panel, click on your name in the bottom-left corner of the screen, and then click on 'Admin Panel.'

 

To add a user, click on the '+' icon in the top-right corner of the screen.

 

If you add additional users, you will need to make your model visible to them. By default, the model is visible only to you, meaning that other users will have no model to chat with. To make the model visible to all users, open the admin panel, click on 'Settings' in the navigation bar at the top of the screen, click on 'Models' in the navigation panel on the left, and then click on the pencil icon next to your model.

 

Once on the model settings page, click on 'Visibility,' and then choose 'Public' from the drop-down menu. You can also edit your system prompt here if you wish.

 

You and your users can now chat with your vLLM inference server! You are now a GenAI superstar, and it only took a few minutes!

 

What about RAG?

 

RAG, or retrieval-augmented generation, is a common technique for augmenting LLMs with private data and knowledge without performing any additional training. As I mentioned in the previous post, Open WebUI includes RAG capabilities. Open WebUI's Workspace functionality is particularly powerful. Review the Open WebUI documentation for more details.

 

Why NetApp?

 

At this point, you may be asking: why should I run all of this on NetApp? The answer is simple. If your goal is to build an enterprise-class, production-ready GenAI application, as opposed to a "science project" or proof of concept, then you need enterprise-class storage. If you run vLLM and Open WebUI on NetApp, using Trident, then your model versions, chat UI configurations, chat history, RAG knowledge base, etc., will be safe and protected. You can use snapshot copies to efficiently back up your data, ensuring that you never lose anything. Additionally, you can use NetApp's Autonomous Ransomware Protection to ensure that your critical data is protected. Finally, since NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, your entire application stack will be fully portable. You can move between the cloud and the datacenter, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings.
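 

As an illustration of the snapshot workflow, here is a minimal sketch of a CSI VolumeSnapshot for the Open WebUI data volume. It assumes that the CSI snapshot CRDs and controller are installed in your cluster, that you have created a VolumeSnapshotClass for Trident (the name trident-snapshotclass below is an arbitrary example), and that the PVC name matches the one created by the Open WebUI chart (check with 'kubectl -n open-webui get pvc').

 

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: open-webui-data-snapshot
  namespace: open-webui
spec:
  volumeSnapshotClassName: trident-snapshotclass # Your Trident VolumeSnapshotClass
  source:
    persistentVolumeClaimName: open-webui # Check the actual PVC name in your cluster

 

Trident takes the snapshot on the backing NetApp volume, and you can later restore it by creating a new PVC that references the VolumeSnapshot as its dataSource.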

 

Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.
