Tech ONTAP Blogs

Hybrid RAG in the Real World: Graphs, BM25, and the End of Black-Box Retrieval

DavidvonThenen
NetApp

In the earlier posts in this series, we talked about what happens when Retrieval-Augmented Generation leans too hard on vector search. The first post, From "Trust Me" to "Prove It": Why Enterprises Need GraphRAG, walked through why enterprises need retrieval that behaves more like a knowledge graph than a fuzzy lookup table. This blog is a follow-on to my last one, DocumentRAG Using OpenSearch: GraphRAG-like Structure Without the Graph Overhead, which showed how BM25 (or full-text search) and entity-aware retrieval can deliver graph-like behavior using tooling teams already know how to operate. Together, those posts made a simple point: if you care about explainability, governance, and reproducibility, you cannot treat retrieval as an opaque side effect of embedding math.

 

This post picks up where those ideas left off and pushes into a harder question: how do you externalize the retrieval step so you can actually steer AI responses? If your agent keeps producing half-baked answers, how do you change its retrieval behavior without throwing more embeddings at the problem or asking data scientists to backfill the corpus with synthetic data glue? GraphRAG addresses this by encoding your data relationships into a knowledge graph. DocumentRAG (or BM25-based RAG) addresses it by anchoring retrieval to entities, fields, and BM25 scoring. Both approaches share the same philosophy: treat retrieval as a controllable, inspectable process, not as a magical function inside a vector database.

 

ai-governance.jpg

 

This blog will also explain why NetApp cares so much about this space. Enterprise customers live in a world of audits, SLAs, and regulators who do not accept "the LLM or Embeddings said so" as an answer. They need pipelines that are explainable, repeatable, and grounded in storage systems that already hold decades of business data. In this post, we are going to connect those dots: externalized retrieval, AI governance, and the next step in the architecture story, Hybrid RAG. We will examine how combining graph-style structure with BM25 and vector search can improve accuracy, reduce risk, and still run on infrastructure patterns operators already know how to scale.

 

At NetApp, this problem shows up daily in customer environments where AI systems are expected to reason over decades of data already governed, protected, and replicated across hybrid and multi-cloud estates. Retrieval doesn't live in isolation there. It lives alongside storage platforms that already enforce durability, lineage, and compliance.

 

Heard of Hybrid RAG Agents?  

 

Enterprise teams live and die by repeatable, explainable processes. The same bar applies to AI systems, especially when the use case carries real risk. In healthcare, finance, or safety-critical workloads, a hallucinated answer is not a quirky failure mode; it's a liability, one where someone might get hurt or worse. If an AI agent is going to assist with decisions that affect people, safety, or money, it must ground its answers in evidence, show where that evidence came from, and allow humans to inspect the path from query to result. Remember how, back in school, we had to show our work? The same should be true for AI agents.

 

hybrid-rag.png

 

This is why graph-based RAG architectures became popular. GraphRAG builds a knowledge graph from your corpus, organizing entities and relationships so the system can retrieve a connected subgraph that matches the question, rather than a grab bag of "similar" chunks or embeddings. That graph behaves like a map: it encodes how concepts relate, which rules depend on which exceptions, and how data points sit inside a larger structure. NetApp's Graph RAG Guide details this design, showing how dual-memory, knowledge-graph-based retrieval improves governance, traceability, and control over multi-hop reasoning paths.
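As a toy illustration of that "map" idea (not the actual design from the Graph RAG Guide), the sketch below builds a tiny entity graph with networkx and pulls back the subgraph within two hops of any entities mentioned in a question. The entities, relationships, and the naive string-match "entity extraction" are all made up for brevity.

```python
import networkx as nx

# Toy corpus graph: nodes are entities, edges are typed relationships.
G = nx.Graph()
G.add_edge("Policy A", "Exception 3", relation="has_exception")
G.add_edge("Exception 3", "Region EU", relation="applies_to")
G.add_edge("Policy A", "Data Retention", relation="governs")

def retrieve_subgraph(question: str, hops: int = 2) -> nx.Graph:
    """Return the connected neighborhood around entities mentioned in the question."""
    # Naive entity "extraction": keep graph nodes that literally appear in the text.
    seeds = [n for n in G.nodes if n.lower() in question.lower()]
    sub = nx.Graph()
    for seed in seeds:
        # ego_graph gives every node/edge within `hops` of the seed entity.
        sub = nx.compose(sub, nx.ego_graph(G, seed, radius=hops))
    return sub

evidence = retrieve_subgraph("What exceptions apply to Policy A in the EU?")
print(list(evidence.edges(data=True)))
```

The point is that the retrieved evidence is a structure you can print, inspect, and audit, not an opaque list of nearest neighbors.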

 

Hybrid RAG takes this idea a step further by combining graph-based and vector-based retrieval into a single pipeline. Research from NVIDIA and BlackRock on HybridRAG shows that fusing Knowledge Graphs with traditional vector RAG produces better results than either approach alone, especially on financial document Q&A tasks. Their experiments report gains in faithfulness, answer relevance, and context recall when the system integrates contexts from both structured graphs and dense embeddings. In practice, this means you can use the graph to lock in high-precision, domain-structured evidence, while vectors help you pick up nuanced context and edge cases that are hard to encode as explicit relationships.
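To make the fusion idea concrete, here is a minimal sketch of what that combination can look like at the prompt level. This is not the pipeline from the HybridRAG paper; the graph_retriever, vector_retriever, and llm callables are hypothetical placeholders.

```python
# Minimal sketch of the Hybrid RAG idea: merge structured (graph) context with
# semantic (vector) context before prompting the LLM.

def hybrid_rag_answer(question: str, graph_retriever, vector_retriever, llm) -> str:
    # High-precision, structured evidence (entities, relationships, rules).
    graph_context = vector_context = None
    graph_context = graph_retriever(question)    # e.g., a serialized subgraph
    # Nuanced, semantic evidence that is hard to encode as explicit relationships.
    vector_context = vector_retriever(question)  # e.g., top-k similar chunks

    prompt = (
        "Answer using ONLY the evidence below and cite which block you used.\n\n"
        f"[STRUCTURED EVIDENCE]\n{graph_context}\n\n"
        f"[SEMANTIC EVIDENCE]\n{vector_context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)
```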

 

 

If you want a good mental model for how this works in real systems, Mitesh Patel's session at the AI Engineer World's Fair is a useful reference point. He explains how hybrid pipelines use knowledge graphs to enforce structure and governance, while vector search fills semantic gaps and improves coverage of messy, real-world text. Another great explanation of the effectiveness of Hybrid RAG is the video above from The AI Automators. As I noted in my previous article, NVIDIA reported 96% factual faithfulness on financial-filings answers using a Graph+Vector architecture (financial documents are a harder domain than most, but the finding extends to traditional RAG implementations as well); the BM25+Vector design here is a great middle ground. For enterprises, Hybrid RAG agents occupy the sweet spot: they address the need for audits and data relationships while still benefiting from the flexibility of embeddings. I think of the two halves of Hybrid RAG as follows: the knowledge graph grounds answers in the documents, and vector embeddings provide domain-level understanding.

 

Why BM25-based Hybrid RAG?  

 

Building a full knowledge graph is powerful, but it's also a serious lift. You need a graph database, a schema, lifecycle rules for your nodes and edges, and people who know how to operate that system in production. For many teams, that's a lot to take on before they even know whether their AI agent will justify the engineering investment. This is where BM25-based Hybrid RAG using OpenSearch becomes a practical middle ground. It gives you some of the structure and grounding benefits of GraphRAG without forcing you to stand up an entirely new data stack.

 

data-points-bm25.png

 

BM25 helps by providing your system with a way to reason about text using signals that humans already understand: keywords, phrases, term frequency, and document relevance. It doesn't provide the deep relationship mapping of a full knowledge graph, but it does something important: it surfaces documents based on explicit textual cues, not fuzzy embedding geometry. Those cues act like lightweight relationships. They reveal when sections share entities, when policies overlap, or when multiple documents describe the same concept in different words. You're not building a graph, but you're still grounding retrieval in interpretable structure rather than pure semantic guesswork.
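To make that concrete, here is a minimal sketch of a BM25 query against OpenSearch using opensearch-py. The connection details, index name, and the text and entities fields are assumptions for illustration, not taken from the repo.

```python
from opensearchpy import OpenSearch

# Hypothetical connection details and index/field names for illustration only.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def bm25_search(query: str, entities: list[str], size: int = 10):
    """BM25 retrieval: every scoring signal is a human-readable text match."""
    body = {
        "size": size,
        "query": {
            "bool": {
                # Full-text relevance scored by BM25 over the document body.
                "must": [{"match": {"text": {"query": query}}}],
                # Lightweight "relationships": boost docs that share the same entities.
                "should": [{"term": {"entities": e}} for e in entities],
            }
        },
    }
    resp = client.search(index="documents", body=body)
    # Each hit carries an explainable BM25 score you can log and audit.
    return [(h["_id"], h["_score"]) for h in resp["hits"]["hits"]]
```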

 

The video below explains how BM25, vector, and graph retrieval differ when answering questions on the same dataset. The contrast is useful. Vector search excels at capturing subtle meaning but can drift toward irrelevant passages when context is ambiguous. Graph retrieval hits the structured center of the target but requires up-front modeling work. BM25 sits in the middle. It's transparent, predictable, and surprisingly effective when combined with a vector-based reranking or augmentation step. That pairing makes BM25-based Hybrid RAG a strong option for teams that want explainability and grounding without the overhead of a full graph engineering effort.
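One way that pairing can look in practice is a two-stage pass: BM25 decides which documents make the candidate set, and vector similarity decides the order. The sketch below assumes a hypothetical embed() callable that maps text to a numpy vector; it is not code from the repo.

```python
import numpy as np

def vector_rerank(query: str, bm25_hits, embed):
    """Rerank BM25 hits by cosine similarity to the query embedding.

    bm25_hits: list of (doc_id, text) pairs from the lexical step.
    embed:     hypothetical callable mapping text -> numpy vector.
    """
    q = embed(query)
    scored = []
    for doc_id, text in bm25_hits:
        d = embed(text)
        sim = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((doc_id, sim))
    # BM25 chose the candidates; vectors refine the ordering.
    return sorted(scored, key=lambda x: x[1], reverse=True)
```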

 

 

ONTAP brings the same operational simplicity to BM25-based Hybrid RAG that it delivers across all workloads: unified data management, automated tiering, FlexCache distribution, atomic snapshots, and seamless hybrid cloud replication that works identically whether you're managing search indices and vector embeddings or any other enterprise application. That eliminates the need for workload-specific storage architectures. For example, it lets you manage both your full-text corpus and dense vector embeddings under a single storage "fabric," where FlexCache can automatically distribute hot indices and frequently accessed embeddings to geo-distributed RAG clusters while intelligent tiering moves cold data to capacity storage. SnapMirror provides atomic, unified replication of your entire OpenSearch data estate across hybrid cloud environments with consistent snapshot-based versioning, so you can manage a single, coherent data protection strategy rather than orchestrating separate backup solutions for sparse and dense search components. At AI scale, this operational simplicity translates directly into cost efficiency through a reduced storage footprint and consolidation of the specialized expertise required to maintain the environment.

 

If you are looking for more details on how NetApp integrates with this type of Hybrid RAG architecture, take a look at the README.md guide in this GitHub repo: https://github.com/davidvonthenen-com/hybrid-rag-bm25-with-ai-governance/blob/main/enterprise_version/README.md

 

If you want to explore how these pieces fit together, the Document RAG Guide remains a solid reference. For many organizations, this approach offers the right balance: firm grounding, improved correctness, and an infrastructure footprint that doesn't require reinventing your entire data platform.

 

Consider Externalizing the Hybrid Search Process  

 

One thing we should call out is that although OpenSearch ships with native hybrid search (BM25 and vector) in the same engine, this Hybrid RAG Guide intentionally avoids using the built-in hybrid query. The reason might not be immediately apparent: the native hybrid scorer blends BM25 and vector signals into a single score, and you can't tell which half contributed what. When you're troubleshooting missing documents, tuning retrieval behavior, or explaining a result to an auditor, that black-box blending gets in the way. By externalizing the BM25 and vector retrieval into separate steps, you keep clear boundaries between the two modalities and gain full visibility into how each one shaped the final answer.
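As a sketch of what "externalized" means in practice, the snippet below runs the BM25 query and the k-NN vector query as two separate OpenSearch calls so each modality returns its own scores. The index name, the text and embedding fields (the latter assumed to be a knn_vector field), and how the query vector gets produced are all assumptions for illustration.

```python
def retrieve_externally(client, query: str, query_vector, k: int = 10):
    """Run BM25 and vector retrieval as separate, inspectable steps."""
    # Step 1: lexical retrieval; scores are plain BM25 values.
    bm25 = client.search(
        index="documents",
        body={"size": k, "query": {"match": {"text": {"query": query}}}},
    )

    # Step 2: dense retrieval via the k-NN plugin; scores are similarity-based.
    knn = client.search(
        index="documents",
        body={"size": k, "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}}},
    )

    # Keep the two result sets separate so you can log, audit, and tune each
    # modality independently before fusing them in your own application code.
    def to_pairs(resp):
        return [(h["_id"], h["_score"]) for h in resp["hits"]["hits"]]

    return {"bm25": to_pairs(bm25), "vector": to_pairs(knn)}
```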

 

Another limitation is that the native hybrid search keeps all normalization, weighting, and reranking logic within OpenSearch's internal processors. That's fine when your policies are basic and your corpus is relatively small, but the moment you need nuanced behavior, you're stuck encoding retrieval strategies in pipelines and script contexts rather than in your normal application layer or as a service. It also means your business rules must fit within the domain-specific language (DSL) OpenSearch exposes, not the other way around. Externalizing the process frees you to write policies, blends, or intent-aware routing in your own stack, where you can evolve the logic as your use case matures.
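Here is one illustrative example of the kind of policy that's awkward to express in query DSL but trivial in your own stack: a hypothetical intent-based blend that weights BM25 higher for exact-match style questions (IDs, codes, quoted phrases) and vectors higher for open-ended ones. The thresholds and regex are made up; a real system might use a classifier, user role, or domain metadata instead.

```python
import re

def blend_weights(query: str) -> tuple[float, float]:
    """Return (bm25_weight, vector_weight) based on a crude query-intent guess."""
    # Queries with quoted phrases, codes, or long numbers favor exact lexical matching.
    if re.search(r'"[^"]+"|\b[A-Z]{2,}-\d+\b|\b\d{4,}\b', query):
        return 0.8, 0.2
    # Short keyword-style queries still lean lexical.
    if len(query.split()) <= 3:
        return 0.6, 0.4
    # Long, conversational questions benefit more from semantic retrieval.
    return 0.35, 0.65
```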

 

Reciprocal_Rank_Fusion-NEW.gif

 

The native hybrid mechanisms also enforce specific normalization and score-combination techniques. Those work well for simple use cases but fall apart when you need more advanced retrieval behavior: custom Reciprocal Rank Fusion (RRF) weighting (pictured above), query-intent-based blending, ML rerankers that consume lexical and vector features, or multi-stage retrieval chains like BM25 → vector expansion → cross-encoder rerank. And because OpenSearch's normalization processor lacks knowledge of your labels, domain tags, or task metadata, it can't incorporate that richer context into scoring. By moving retrieval logic out of the engine and into your application, you can build hybrid pipelines that respond to domain signals, dynamically adapt to query intent, and support more advanced ranking strategies without being constrained by the internal architecture of the search backend.
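For completeness, here is a minimal sketch of weighted Reciprocal Rank Fusion done in application code rather than inside OpenSearch's normalization processor. The function and variable names are hypothetical; the per-retriever weights could come from something like the intent policy sketched above.

```python
def weighted_rrf(result_lists, weights, k: int = 60):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    result_lists: dict of retriever name -> ordered list of (doc_id, score) pairs.
    weights:      dict of retriever name -> blend weight for that retriever.
    k:            RRF damping constant (60 is the commonly used default).
    """
    fused = {}
    for name, results in result_lists.items():
        w = weights.get(name, 1.0)
        for rank, (doc_id, _score) in enumerate(results, start=1):
            # Each retriever contributes w / (k + rank); better rank earns more credit.
            fused[doc_id] = fused.get(doc_id, 0.0) + w / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Example wiring with the earlier sketches:
# results = retrieve_externally(client, query, query_vector)
# bm25_w, vec_w = blend_weights(query)
# ranking = weighted_rrf(results, {"bm25": bm25_w, "vector": vec_w})
```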

 

Having said that, OpenSearch’s native hybrid search is an excellent “easy button” for the following use cases: 

  • You are just starting to explore Hybrid RAG 
  • You don't want to augment the ranking process with your own metadata or data relationships 
  • You don't have hard requirements for AI governance 

 However, if you are dealing with enterprise data at scale, I would highly recommend investigating the reasons for externalizing the search and reranking process, and how owning it enables AI governance in your AI solutions.

 

Hybrid RAG for Production-Ready AI  

 

A lot has been written lately about AI pilots failing before reaching production. Anyone who has tried to build an agent they actually depend on knows why: inconsistent retrieval, drifting answers, and systems that can't reproduce the same result twice. The issue isn't "AI isn't ready." The issue is that many teams leaned on retrieval pipelines they couldn't inspect, tune, or govern. When your AI agent can't explain why it surfaced a document or why it ignored a better one… you don't have a production system. You have a prototype with a deadline.

 

This is why so many teams are circling back to traditional, proven data technologies. Yes, there has even been a movement to use SQL databases for Retrieval operations in RAG because of the decades of operational experience behind them. These systems were built for consistency, traceability, and compliance long before AI entered the picture. They don't hide their logic behind gradient math, and they don't require specialized modeling expertise to understand the scoring rules. They behave in ways operators can predict, developers can debug, and auditors can verify. That alone puts them ahead of many of the "AI-native" tools that dominate the marketing hype cycle.

 

confidently-incorrect.png

 

This entire series has been about understanding the level of reliability your AI agent must deliver. In some cases, fast and lightweight vector search is enough... But for enterprises where the answer must be correct (and where a wrong answer has a real cost), explainability and correctness become non-negotiable. Hybrid RAG, whether powered by graphs, BM25, or a structured blend of both, gives teams the control they need to trust their systems. Because at the end of the day, the answers your AI produces reflect the judgment and engineering of your organization. If your agent speaks with confidence but gets the facts wrong, that still lands on you. Building retrieval pipelines that are grounded, observable, and correct isn't just good engineering practice... It's how you build AI systems that deserve to be in production.

 
