Tech ONTAP Blogs
Tech ONTAP Blogs
A couple of weeks ago, I published a post walking through the deployment of a basic LLM inference stack that will run on any of NetApp's NVIDIA-based AIPods. In that post, I used NVIDIA NIM for LLMs as my inference server. NIM is powerful and easy to adopt, but it is not the only option for the inference server portion of the stack. Some organizations might not have an NVIDIA AI Enterprise subscription. Other organizations might just prefer open-source software or a custom hardware configuration. For these organizations, vLLM is a good option. In this post, I will walk through the deployment of a basic LLM inference stack that uses vLLM.
First, you need to deploy an inference server. As I mentioned previously, this post will focus on vLLM for the inference server portion of the stack. vLLM is a popular open-source LLM inference that is known for its flexibility and performance.
To deploy vLLM on Kubernetes, you can use the helm chart that is published by the vLLM Production Stack project. First, create a vllm-prod-stack-values.yaml file.
servingEngineSpec:
vllmApiKey: "my-key" # Set an API key for your vLLM API
modelSpec:
- name: "mistral"
repository: "vllm/vllm-openai"
tag: "v0.9.1"
modelURL: "mistralai/Mistral-Small-3.2-24B-Instruct-2506" # Choose a model from HuggingFace
replicaCount: 1
requestCPU: 14 # Set based on model requirements
requestMemory: "64Gi" # Set based on model requirements
requestGPU: 2 # Set based on model requirements
pvcStorage: "1Ti"
vllmConfig:
tensorParallelSize: 2 # Set to number of GPUs
maxModelLen: 16384 # Set based on model requirements
extraArgs: ["--tokenizer-mode", "mistral", "--config_format", "mistral", "--load_format", "mistral", "--limit_mm_per_prompt", "image=4"] # Set based on model requirements
hf_token: "<huggingface_token>" # Enter your HuggingFace access token
Next, use Helm to deploy vLLM.
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -n vllm-stack --create-namespace -f vllm-prod-stack-values.yaml
After deploying, confirm that vLLM is running.
kubectl -n vllm-stack get pod
The output of this command should show two running pods. It may take several minutes for the inference server pod to reach a running state as vLLM will need to download the model from HuggingFace.
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-75d66f7fb4-6s5p7 1/1 Running 0 20m
vllm-mistral-deployment-vllm-86b7bc4bf5-9kkhz 1/1 Running 0 20m
You now have an LLM inference server with an OpenAI-compatible API, but you still need a client-side chat UI so that you (and the users that you support) can interact with your LLM. As I mentioned in the previous post, there are many different options, including several free and open-source options, but Open WebUI is one that stands out. Open WebUI is a fully-packaged, production-ready, open-source chat UI.
You can use Open WebUI to quickly deploy a chat UI in front of your inference server. First, create an open-webui-values.yaml file.
ollama:
enabled: false
pipelines:
enabled: false
service:
type: NodePort # NodePort or LoadBalancer
openaiBaseApiUrl: "http://vllm-router-service.vllm-stack.svc.cluster.local:80/v1" # URL for vLLM inference API (only accessible from within K8s cluster)
extraEnvVars:
- name: OPENAI_API_KEY
value: 'my-key' # vLLM API key that you set
- name: ENABLE_WEB_SEARCH
value: "True"
- name: WEB_SEARCH_ENGINE
value: duckduckgo
- name: ENABLE_DIRECT_CONNECTIONS
value: "False"
# For additional options, including SSO integration,
# refer to https://github.com/open-webui/helm-charts/blob/main/charts/open-webui/values.yaml.
# For additional environment variables to include in the 'extraEnvVars' section,
# refer to https://docs.openwebui.com/getting-started/env-configuration.
Next, use Helm to deploy Open WebUI.
helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install open-webui open-webui/open-webui -n open-webui --create-namespace -f open-webui-values.yaml
The console output of the 'helm install' command will include instructions for accessing the application. Note down the URL.
2. Access the Application:
Access via NodePort service:
export NODE_PORT=$(kubectl get -n open-webui -o jsonpath="{.spec.ports[0].nodePort}" services open-webui)
export NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
After deploying, confirm that Open WebUI is running.
kubectl -n open-webui get pod
The output of this command should show a single running pod.
NAME READY STATUS RESTARTS AGE
open-webui-0 1/1 Running 0 1m
Access the Open WebUI application in your web browser using the URL that you noted down, and then step through the initial setup wizard. You now have your own private Chat UI!
If you did not configure SSO integration in your open-webui-values.yaml file, as I did not in the preceding example, then you will need to create accounts for your users via the admin panel. To access the admin panel, click on your name in the bottom-left corner of the screen, and then click on 'Admin Panel.'
To add a user, click on the '+' icon in the top-right corner of the screen.
If you add additional users, you will want to make your model visible to all users. By default, the model is only visible to you, meaning that other users will have no model to chat with. To make the model visible to all users, access the admin panel, then click on 'Settings' in the navigation bar at the top of the screen, then click on 'Models' in the navigation panel on the left, and then click on the pencil icon next to your model.
Once on the model settings page, click on 'Visibility,' and then choose 'Public' in the drop-down menu. You can also edit your system prompt if you desire to do so.
You and your users can now chat with your vLLM inference server! You are now a GenAI superstar, and it only took a few minutes!
RAG, or retrieval augmented generation, is a common technique for augmenting LLMs with private data/knowledge without having to perform any additional training. As I mentioned in the previous post, Open WebUI includes RAG capabilities. Open WebUI's Workspace functionality is particularly powerful. Review the Open WebUI documentation for more details.
At this point, you may be asking, why should I run all of this on NetApp? The answer is simple. If your goal is to build an enterprise-class, production-ready GenAI application as opposed to a "science project" or proof-of-concept, then you need enterprise-class storage. If you run vLLM and Open WebUI on NetApp, using Trident, then your model versions, chat UI configurations, chat history, RAG knowledge base, etc., will be safe and protected. You can use snapshot copies to efficiently back up your data, ensuring that you will never lose anything. Additionally, you can utilize NetApp's Autonomous Ransomware Protection to ensure that your critical data is protected. Finally, since NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, your entire application stack will be fully portable, meaning that you can move between the cloud and the datacenter, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings.
Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.