Tech ONTAP Blogs
Tech ONTAP Blogs
Earlier this week, OpenAI released two state-of-the-art open-weight LLMs (large language models), gpt-oss-20b and gpt-oss-120b. These models have generated quite a bit of interest due to their reasoning capabilities, tool-use performance, and native efficiency. According to OpenAI, the gpt-oss-120b model, which can run on a single GPU with 80GB of VRAM (for example, a single NVIDIA H100), performs similarly to their proprietary o4-mini model on core reasoning benchmarks. The gpt-oss-20b model, which only requires a GPU with 16GB of VRAM, performs similarly to their proprietary o3‑mini model on common benchmarks. In this post, I will show you how to deploy a gpt-oss model on an NVIDIA-based NetApp AIPod.
To host a gpt-oss model, you need to deploy an inference server. You can use vLLM, NVIDIA NIM, or Ollama as your inference server since the gpt-oss models are supported by all three. However, vLLM's and NIM's support is limited to Blackwell and Hopper GPUs, so if you have a NetApp AIPod with Lenovo or an older NetApp AIPod for NVIDIA DGX with Ampere GPUs, you will need to use Ollama. In this post, I will demonstrate the deployment of gpt-oss-120b using Ollama, which works an all NVIDIA-based NetApp AIPods.
If you have been following my posts over the past couple of months, you have seen my posts about Open WebUI. Open WebUI is a popular production-ready open-source chat UI. Conveniently, the Open WebUI helm chart includes the option to deploy an Ollama instance and automatically connect it to Open WebUI. This means that this helm chart can be used to deploy a full-stack inference application, including an inference server and a user-facing chat UI.
To deploy Open WebUI and Ollama together, first create an open-webui-values.yaml file.
ollama:
enabled: true
fullnameOverride: "open-webui-ollama"
ollama:
gpu:
enabled: true
type: 'nvidia'
number: 2 # If you have H100, H200, or B200 GPUs, you can change this to 1.
models:
pull:
- "gpt-oss:120b"
run:
- "gpt-oss:120b"
image:
tag: 0.11.2
runtimeClassName: nvidia # If you are running on OpenShift, you may need to change this value to an empty string ("")
persistentVolume:
enabled: true
size: 1Ti
pipelines:
enabled: false
service:
type: NodePort # NodePort or LoadBalancer
extraEnvVars:
- name: ENABLE_WEB_SEARCH
value: "True"
- name: WEB_SEARCH_ENGINE
value: duckduckgo
- name: ENABLE_DIRECT_CONNECTIONS
value: "False"
# For additional options, including SSO integration,
# refer to https://github.com/open-webui/helm-charts/blob/main/charts/open-webui/values.yaml.
# For additional environment variables to include in the 'extraEnvVars' section,
# refer to https://docs.openwebui.com/getting-started/env-configuration.
Next, use Helm to deploy Open WebUI.
helm repo add open-webui https://open-webui.github.io/helm-charts
helm repo update
helm install open-webui open-webui/open-webui -n open-webui --create-namespace -f open-webui-values.yaml
The console output of the 'helm install' command will include instructions for accessing the application. Note down the URL.
2. Access the Application:
Access via NodePort service:
export NODE_PORT=$(kubectl get -n open-webui -o jsonpath="{.spec.ports[0].nodePort}" services open-webui)
export NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
After deploying, confirm that Open WebUI and Ollama are running.
kubectl -n open-webui get pod
The output of this command should show a two running pods, with all containers in a "Ready" state. It may take a while for the Ollama container to reach a "Ready" state - the model will be downloaded as part of the initial startup process.
NAME READY STATUS RESTARTS AGE
open-webui-0 1/1 Running 0 30m
open-webui-ollama-5ffb8fd8f5-ff7gg 1/1 Running 0 30m
Access the Open WebUI application in your web browser using the URL that you noted down, and then step through the initial setup wizard. You now have your own private Chat UI with gpt-oss-120b!
If you did not configure SSO integration in your open-webui-values.yaml file, as I did not in the preceding example, then you will need to create accounts for your users via the admin panel. To access the admin panel, click on your name in the bottom-left corner of the screen, and then click on 'Admin Panel.'
To add a user, click on the '+' icon in the top-right corner of the screen.
If you add additional users, you will want to make your model visible to all users. By default, the model is only visible to you, meaning that other users will have no model to chat with. To make the model visible to all users, access the admin panel, then click on 'Settings' in the navigation bar at the top of the screen, then click on 'Models' in the navigation panel on the left, and then click on the pencil icon next to your model.
Once on the model settings page, click on 'Visibility,' and then choose 'Public' in the drop-down menu. You can also edit your system prompt if you desire to do so - this is a simple way to introduce basic guardrails.
You and your users can now chat with gpt-oss-120b!
RAG, or retrieval augmented generation, is a common technique for augmenting LLMs with private data/knowledge without having to perform any additional training. As I mentioned in the previous post, Open WebUI includes RAG capabilities. Open WebUI's Workspace functionality is particularly powerful. Review the Open WebUI documentation for more details.
At this point, you may be asking, why should I run all of this on NetApp? The answer is simple. If your goal is to build an enterprise-class, production-ready GenAI application as opposed to a "science project" or proof-of-concept, then you need enterprise-class storage. If you host gpt-oss on NetApp, then your model versions, chat UI configurations, chat history, RAG knowledge base, etc., will be safe and protected. You can use snapshot copies to efficiently back up your data, ensuring that you will never lose anything. Additionally, you can utilize NetApp's Autonomous Ransomware Protection to ensure that your critical data is protected. Finally, since NetApp is the only storage vendor to offer native services co-engineered with each of the three major public cloud providers, your entire application stack will be fully portable, meaning that you can move between the cloud and the datacenter, and between different cloud providers, as needed to optimize costs or take advantage of specialized infrastructure offerings.
Stay tuned for more posts on building out your LLM inference stack! To learn more about NetApp, visit netapp.com.