Tech ONTAP Blogs

Hybrid AI Training with NetApp and Union.ai

tsathish
NetApp

Bridging Data Gravity and Cloud Ambition

What do you do when the future of AI seems to be in the cloud, but your data is firmly rooted on the ground? This isn’t just a technical riddle; it’s a business and ethical crossroads facing every organization with aspirations for AI at scale.

 

I’ve spent the last few years immersed in conversations with customers across healthcare, financial services, life sciences, and manufacturing. Their stories are remarkably consistent: the data that matters most, such as medical images, transaction histories, and confidential proprietary designs, resides on-premises, protected by layers of governance, privacy, and sheer volume. Yet the promise of AI has never been more compelling, and the cloud exerts an almost gravitational pull with its elastic compute, scalable GPUs, and speed to innovation.

 

But as with all things in technology, the path forward is rarely linear. Instead, it’s about blending strengths, respecting constraints, and, sometimes, rethinking what it means to “move” at all.

 

The Unmovable Data Dilemma

There’s a growing perception that the road to AI excellence simply involves pouring all your data into the cloud and letting algorithms go to work. If only it were that simple. In reality, data gravity is real. Regulatory compliance, cost, and the sheer physics of petabyte-scale archives mean that for many organizations, data stays put by necessity, not by choice.

 

The question, then, is not whether to migrate, but how to innovate with what you have and where you have it.

 

I often hear from teams who feel caught in the middle. They want to experiment, to iterate, to bring the power of modern AI models and access to in-demand GPUs to bear on their most valuable data assets. But the roadblocks, technical, regulatory, and operational alike, are daunting. Should they wait for new hardware to be purchased and deployed? Should they attempt a risky migration? Or is there another way?

 

Rethinking Hybrid: Where Data Stays, and Compute Roams

The answer, increasingly, lies in hybrid thinking, not as a compromise, but as a new foundation. What if your data could remain secure and compliant, right where it belongs, while your AI workloads stretch their legs in the cloud?

 

This is the thinking behind the integration of Union.ai’s orchestration platform with NetApp FlexCache and Kubernetes-native storage provisioning via Trident. It’s an approach that doesn’t force a binary choice. Instead, it quietly bridges the gap, allowing cloud-based AI training jobs to access on-premises data without requiring that massive data to be relocated or replicated.

 

NetApp FlexCache is a caching technology that creates a read-optimized cache volume in the cloud, logically linked to your primary ONTAP volume on-premises, delivering faster throughput with a smaller footprint. When a training job running in the cloud accesses a file, FlexCache transparently retrieves it from the origin volume if it’s not already cached. From that point forward, reads are served locally from the cache. Writes, if any, are immediately passed through to the origin, ensuring consistency and compliance.
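The read-through, write-through behavior described above can be sketched in a few lines of Python. This is a conceptual model only, not the FlexCache implementation; the class names and the sample file are illustrative:

```python
class Origin:
    """Stands in for the on-premises ONTAP origin volume."""
    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0  # count of reads that had to travel to the origin

    def read(self, path):
        self.reads += 1
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class FlexCacheModel:
    """Conceptual model of a FlexCache volume: reads are cached locally
    after the first fetch; writes pass straight through to the origin."""
    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def read(self, path):
        if path not in self.cache:          # cache miss: fetch from origin
            self.cache[path] = self.origin.read(path)
        return self.cache[path]             # cache hit: served locally

    def write(self, path, data):
        self.origin.write(path, data)       # write-through for consistency
        self.cache[path] = data


origin = Origin({"scan_001.dcm": b"pixels"})
cache = FlexCacheModel(origin)
cache.read("scan_001.dcm")   # first read travels to the origin
cache.read("scan_001.dcm")   # second read is served from the cache
print(origin.reads)          # -> 1
```

The key property the model captures is that only the first access pays the wide-area round trip; repeated epochs over the same dataset read from the cloud-local cache.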

 

Union.ai orchestrates both on-premises data processing workflows and cloud-based training workflows. These jobs request persistent volumes via the NetApp DataOps Toolkit, which provisions a FlexCache volume on ONTAP and then binds it to a Kubernetes PVC. To the training job, the data appears as a mounted file system path, with no awareness of whether the data is cached or fetched on-demand. The result is a seamless hybrid experience: data remains governed and secure on-prem, while compute scales elastically in the cloud.
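The post doesn’t show the exact toolkit call, but the shape of the request a job makes can be sketched as a Kubernetes PersistentVolumeClaim built in Python. The storage class name `flexcache-sc` and the PVC name are assumptions for illustration; in practice the storage class would be a Trident class configured against the cloud ONTAP backend:

```python
def flexcache_pvc(name, size, storage_class="flexcache-sc"):
    """Build a PVC manifest for a FlexCache-backed volume (illustrative)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadOnlyMany"],   # training jobs read the cache
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


pvc = flexcache_pvc("training-data", "500Gi")
print(pvc["spec"]["storageClassName"])  # -> flexcache-sc
```

Once Trident binds the claim, every pod that mounts it sees an ordinary file system path, which is exactly the transparency the paragraph above describes.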

 

This hybrid approach works because of the orchestration intelligence that Union.ai brings to the table. Union.ai is the company behind Flyte, the open-source orchestration platform trusted by over 3,500 organizations to power their most critical AI and data workflows. As the original creators and core maintainers of Flyte, Union brings deep expertise in scalable, reproducible, and cloud-native AI/ML operations. The platform enables teams to define, schedule, and execute complex AI pipelines with ease, whether they are running on-premises, in the cloud, or across both. Customers choose Union because it helps them move faster with confidence, offering enterprise-grade support, governance, and extensibility on top of the proven Flyte foundation. In hybrid environments, Union orchestrates AI training and inference workflows seamlessly across infrastructure boundaries, ensuring that data scientists and ML engineers can focus on innovation, not infrastructure.

 

Together with NetApp FlexCache and Trident, Union enables a hybrid AI training model that is not just technically feasible, but operationally elegant. It’s a solution that respects the realities of enterprise environments while unlocking the agility and scale of the cloud.

 


 

From Theory to Practice: A Day in the Life

Consider a healthcare organization training an AI model on sensitive medical images under HIPAA regulations. In the past, this might have meant months of planning, risk assessments, and ultimately a decision not to proceed, or to settle for less. But in this hybrid model, the workflow becomes almost routine.

 

The data remains protected on-premises in a NetApp ONTAP system. A FlexCache volume is created in the cloud, on Amazon FSx for NetApp ONTAP, Google Cloud NetApp Volumes, or a self-managed ONTAP cluster. Trident, the CSI-compliant storage orchestrator, is deployed in both the on-prem and cloud Kubernetes environments. It provisions persistent volumes from the FlexCache backend in the cloud cluster.
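From the training job’s point of view, the final step is just mounting a volume. A minimal sketch of the pod spec, with the PVC name, pod name, image, and mount path all assumed for illustration:

```python
def training_pod(pvc_name="training-data", mount_path="/data"):
    """Build an illustrative training-pod manifest that mounts the
    FlexCache-backed PVC; the job sees the dataset as a plain path."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "train-medical-images"},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "example.com/train:latest",  # placeholder image
                "volumeMounts": [{"name": "dataset",
                                  "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "dataset",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }


pod = training_pod()
print(pod["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"])
```

Nothing in the container image changes: the model-training code reads from `/data` exactly as it would from a local disk.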

 

There’s no drama, no heavy lifting. Just a smooth handshake between storage, orchestration, and compute—letting teams focus on science, not infrastructure.

 

The Subtle Power of Not Moving

What’s most striking about this approach is how it reframes the conversation about AI infrastructure. It’s not about “lifting and shifting,” or about maintaining silos. It’s about building connective tissue, a data fabric that respects the realities of governance, cost, and performance, while still unlocking the potential of cloud-scale compute.

 

This isn’t just a technical win. It’s a cultural shift. Data scientists and AI engineers are empowered to iterate and explore, free from the friction and delay of infrastructure constraints. Compliance teams rest easier, knowing that control and auditability are preserved. And leaders gain a faster path to AI-driven outcomes, without compromise.

 

Looking Ahead

As AI continues to shape the future, the organizations that succeed will be those that find harmony between where their data lives and where their ideas can flourish. Hybrid training with Union.ai and NetApp FlexCache is one example of this new balance and one that doesn’t force a false choice between innovation and control.

 

And the hybrid model is just the beginning. As AI evolves, so will the infrastructures that support it. Expect to see tighter integrations, smarter caching strategies, and even policy-driven workload placement, where orchestration systems like Union dynamically decide where to run each part of your pipeline based on cost, latency, data security, or compliance. The future isn’t just hybrid, it’s adaptive.
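To make "policy-driven placement" concrete, here is a toy placement function. The rules and field names are invented for illustration and are not Union’s actual scheduling logic:

```python
def place_step(step):
    """Decide where a pipeline step should run, given simple policy rules."""
    if step.get("sensitive"):          # regulated data never leaves on-prem
        return "on-prem"
    if step.get("needs_gpu"):          # elastic GPU capacity lives in the cloud
        return "cloud"
    # Otherwise, pick the cheaper venue.
    cheaper_in_cloud = step.get("cloud_cost", 0) <= step.get("onprem_cost", 0)
    return "cloud" if cheaper_in_cloud else "on-prem"


pipeline = [
    {"name": "deidentify", "sensitive": True},
    {"name": "train", "needs_gpu": True},
    {"name": "report", "cloud_cost": 1, "onprem_cost": 5},
]
print([(s["name"], place_step(s)) for s in pipeline])
# -> [('deidentify', 'on-prem'), ('train', 'cloud'), ('report', 'cloud')]
```

An adaptive orchestrator would evaluate rules like these per task at run time, so the same pipeline definition could shift between venues as costs or policies change.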

 

If your organization is grappling with the realities of data gravity and looking for a way to move faster without moving at all, consider what’s possible when you let your data stay rooted and your AI ambitions take flight. Learn more about NetApp AI solutions and Union.ai, and start building the future of hybrid AI, on your terms.
