
RAG: Right Data, Wrong Answers? Fixing Knowledge Conflicts in Enterprise Retrieval

Production RAG often retrieves the right document but still returns wrong answers. Four concrete fixes for knowledge conflicts in enterprise retrieval.

Martin Benes · Founder & AI Automation Engineer · May 16, 2026 · Updated May 30, 2026 · 9 min read

Drafted by Flux Bot · Reviewed by Martin Benes

Last reviewed: 16 May 2026 by Martin Benes.

A CFO opens the corporate AI assistant and asks for the 2023 annual revenue. The system retrieves two documents — both technically about 2023 revenue, both from the company's own SharePoint. One is a preliminary earnings release showing $4.2M; the other is the audited restatement at $3.97M. The model picks one and returns it confidently. The retrieval was perfect. The answer is wrong.

TL;DR: When RAG fails on enterprise corpora, it usually isn't a retrieval miss. It's a knowledge conflict — the right document is retrieved alongside a contradictory document, and the model has no principled way to choose between them. Below: why this happens, and four concrete fixes that don't require a new model.

The problem: right data, wrong answers

Most RAG failures in production aren't the model fabricating from thin air. They're the model fabricating a confident synthesis across contradictory but plausibly relevant context. The retrieval layer is doing its job by pulling the top-k semantically nearest chunks, but the corpus itself contains overlapping, versioned, mutually incompatible truth.

In enterprise corpora this shows up in four shapes:

Versioned truth — a preliminary earnings release and an audited restatement sit in the same data lake. Both are "about 2023 revenue." Only one is authoritative.
Temporal staleness — last year's pricing policy and this quarter's pricing policy both reference "current rates." Without a valid-from / valid-until anchor on retrieval, the older one frequently wins on lexical similarity.
Source-authority drift — a draft memo from one team and a signed-off policy from another make different claims about the same workflow. Both look like reasonable corporate documents to the retriever.
Ambiguous entity references — "the Helsinki contract" matches a 2021 supplier contract and a 2024 customer contract. Different entities, same string match.

Researchers have studied this class of failure under several names. Kortukov et al. (2024) describe "knowledge conflicts" between an LLM's parametric memory and the retrieved context. Liu et al. (2024) show that models use information in the middle of a long context less reliably than information at its beginning or end (the "lost-in-the-middle" effect), so when conflicting passages share a prompt, position can influence which one the model uses. Mishra et al. (2024) propose a fine-grained taxonomy of hallucination types, which gives a vocabulary for classifying the confabulation that conflicting retrieval induces. The unifying observation: vector similarity is the wrong objective when the corpus contains versioned truth.

Why semantic similarity isn't enough

Embeddings encode topical proximity. They don't encode which document is current, which is authoritative, or which has been superseded. So when two chunks tie on cosine similarity, the tie-break is essentially arbitrary: chunk order, position in the prompt, lexical overlap with the query. The model then either picks one value or writes a confident sentence that blends the two, and in the second case the CFO leaves with a number that exists in no real source document.
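To make the tie concrete, here is a minimal sketch that uses a bag-of-words cosine as a stand-in for a real embedding model. The `bow_cosine` helper and the two passage strings are illustrative assumptions, not part of any production retriever; the point is only that a similarity score carries no information about which passage is authoritative.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (a crude stand-in for embeddings)."""
    va = Counter(re.findall(r"[a-z0-9.$]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9.$]+", b.lower()))
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

query = "2023 annual revenue"
preliminary = "Preliminary release: 2023 annual revenue was $4.2M"
audited = "Audited restatement: 2023 annual revenue was $3.97M"

score_preliminary = bow_cosine(query, preliminary)
score_audited = bow_cosine(query, audited)

# Both passages share the same three query terms, so the scores tie.
print(score_preliminary, score_audited)
```

Nothing in either score says "audited" outranks "preliminary". That signal has to come from somewhere other than similarity.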

The fix isn't a bigger model or a longer context window. Both GPT-4o-class and Claude-Sonnet-class models, given two contradictory chunks, will still produce a confident wrong answer roughly half the time, because the prompt offers no signal about which chunk to trust. The fix has to live in retrieval and context construction, not in generation.

Four fixes that actually work

Fix 1 — Source-authority weighting at index time

Tag every document with structured metadata when you ingest it: issuing system, version number, valid-from / valid-until, a signed-off boolean, and supersedes/superseded-by relations to other documents. Then bake those signals into your ranker, typically as a learned re-weighting on top of the cosine-similarity score, or as hard filters at retrieval time ("only retrieve chunks where signed_off = true and valid_until > now()").

This is the single highest-impact change you can make. In our work with regulated clients, simply gating retrieval on signed_off=true for policy-class documents eliminates roughly two-thirds of "right data, wrong answer" incidents — because the unsigned drafts that used to tie on similarity never enter the prompt.
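The hard-filter variant can be sketched in a few lines. This is a minimal illustration, assuming the metadata has already been attached at ingestion. The `Chunk` fields, the `is_authoritative` rule, the similarity scores, and the example dates are all hypothetical and would be adapted to your own schema and vector store.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class Chunk:
    text: str
    similarity: float                    # cosine similarity from the vector store
    signed_off: bool
    valid_from: date
    valid_until: Optional[date] = None   # None means "still in force"
    superseded_by: Optional[str] = None  # id of the replacing document, if any

def is_authoritative(chunk: Chunk, today: date) -> bool:
    """Hard filter: signed off, currently in force, and not superseded."""
    in_force = chunk.valid_from <= today and (
        chunk.valid_until is None or chunk.valid_until > today
    )
    return chunk.signed_off and in_force and chunk.superseded_by is None

def retrieve_authoritative(candidates: List[Chunk], today: date, k: int = 5) -> List[Chunk]:
    """Drop non-authoritative chunks first, then rank the rest by similarity."""
    kept = [c for c in candidates if is_authoritative(c, today)]
    return sorted(kept, key=lambda c: c.similarity, reverse=True)[:k]

candidates = [
    Chunk("Preliminary release: 2023 revenue was $4.2M", 0.91,
          signed_off=False, valid_from=date(2024, 1, 15),
          superseded_by="audited-2023"),
    Chunk("Audited restatement: 2023 revenue was $3.97M", 0.90,
          signed_off=True, valid_from=date(2024, 3, 1)),
]

top = retrieve_authoritative(candidates, today=date(2026, 5, 16))
print([c.text for c in top])
```

In this example the preliminary release has the higher similarity score, yet it never reaches the prompt because the filter runs before ranking. In practice the same predicate would usually be pushed down into the vector store's metadata filter rather than applied in application code.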

Fix 2 — Temporal awareness in the ranker

For any corpus where the same fact can be restated over time (financials, pricing, policies, customer records), include a freshness term in the retrieval score. A common pattern: score = α · cosine_similarity + β · recency_decay(doc.timestamp), where recency_decay is an exponential decay whose half-life is tuned to the document class. Earnings restatements have a half-life of weeks; reference architecture docs have a half-life of years.

The detail that matters: recency means "as of when the document is true," not "as of when the document was uploaded." A scanned PDF of last year's contract uploaded yesterday is not a fresh document. The valid-from timestamp must come from the document content (or its system-of-record metadata), never from the storage mtime.
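The scoring pattern above can be sketched as follows. This is an illustrative sketch, not a tuned ranker: the `alpha` and `beta` defaults and the 30-day half-life are placeholder assumptions, and `valid_from` is assumed to come from the document's content or system of record, as described above.

```python
from datetime import date, timedelta

def recency_decay(valid_from: date, today: date, half_life_days: float) -> float:
    """Exponential decay: 1.0 for a document valid from today, 0.5 after one half-life."""
    age_days = max((today - valid_from).days, 0)
    return 0.5 ** (age_days / half_life_days)

def score(cosine_similarity: float, valid_from: date, today: date,
          half_life_days: float, alpha: float = 0.7, beta: float = 0.3) -> float:
    """score = alpha * cosine_similarity + beta * recency_decay(valid_from)."""
    return alpha * cosine_similarity + beta * recency_decay(
        valid_from, today, half_life_days
    )

today = date(2026, 5, 16)
half_life = 30.0  # placeholder: a fast-moving document class

# Two chunks with identical similarity; only the valid-from date differs.
newer = score(0.90, today - timedelta(days=30), today, half_life)
older = score(0.90, today - timedelta(days=365), today, half_life)
print(newer, older)
```

With equal similarity, the chunk that became valid more recently scores higher. Choosing the half-life per document class is what keeps a years-old reference architecture from being penalised like a stale earnings figure.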

Fix 3 — Explicit contradiction detection before generation

Between retrieval and generation, run a contradiction-check pass over the top-k chunks. Two cheap options:

NLI-style cross-check — a small natural-language-inference model (e.g. a distilled DeBERTa or a fine-tuned 7B Llama) scores each pair of retrieved chunks for entailment / contradiction / neutrality. If any pair scores high on contradiction, branch.
Self-check with the generator — ask the same generation model to compare retrieved chunks against each other before answering ("Are any of these passages making incompatible claims about the same entity? If yes, list them."). Cheaper to wire up; biased by the same prompt-following behaviour you're trying to constrain, so verify on held-out examples.

When a contradiction fires, the system has principled options: surface the conflict to the user, prefer the chunk with the stronger metadata signal, or escalate to a stricter retrieval pass. The wrong response is to silently let the generator pick.
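A pairwise check of this kind might look like the sketch below. The scorer is deliberately pluggable: `toy_scorer` is a hypothetical stand-in that only compares quoted dollar amounts, and in a real pipeline it would be replaced by a call to an NLI model or an LLM rubric. The 0.8 threshold is likewise an assumption to be tuned on held-out examples.

```python
import re
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical scorer signature: probability that passage `a` contradicts passage `b`.
ContradictionScorer = Callable[[str, str], float]

def find_conflicts(chunks: List[str], scorer: ContradictionScorer,
                   threshold: float = 0.8) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair of chunks at or above the threshold."""
    conflicts = []
    for (i, a), (j, b) in combinations(enumerate(chunks), 2):
        # NLI is directional, so score both orders and keep the stronger signal.
        p = max(scorer(a, b), scorer(b, a))
        if p >= threshold:
            conflicts.append((i, j, p))
    return conflicts

def toy_scorer(a: str, b: str) -> float:
    """Stand-in for a real NLI model: flags passages quoting different dollar amounts."""
    amounts_a = set(re.findall(r"\$[\d.]+M", a))
    amounts_b = set(re.findall(r"\$[\d.]+M", b))
    return 1.0 if amounts_a and amounts_b and amounts_a != amounts_b else 0.0

chunks = [
    "Preliminary release: 2023 revenue was $4.2M",
    "Audited restatement: 2023 revenue was $3.97M",
    "The fiscal year ends on 31 December",
]

conflicts = find_conflicts(chunks, toy_scorer)
print(conflicts)
```

A non-empty result is the branch point: surface the conflict, prefer the chunk with the stronger metadata, or re-run retrieval with stricter filters. The check is quadratic in k, which is usually acceptable for the small top-k values used in RAG.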

Fix 4 — Citation-aware prompting and answer-grounding constraints

Even after the retrieval is clean, prompt construction matters. Two patterns that materially reduce confabulation across conflicting context:

Per-chunk citations in the prompt: number each retrieved chunk and instruct the model to attach the chunk number to each factual claim. Reviewers (human or automated) can then check whether the cited chunk actually supports the claim. Related work formalises this kind of grounding check: Self-RAG (Asai et al., 2023) trains the model to critique whether its own output is supported by the retrieved passages, and corrective RAG (Yan et al., 2024) adds an evaluator that grades retrieved documents before generation.
Abstention as a first-class output: explicitly instruct the model to return "I don't have enough authoritative source data to answer this" when retrieved context is in conflict and the conflict-detection step fires. In enterprise contexts, "I don't know" with a referral to the system of record beats a confident wrong number every time.
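Both patterns come down to how the prompt is assembled. The sketch below is one possible shape, assuming the conflict flag comes from an earlier detection step. The `build_prompt` function and the exact rule wording are illustrative and should be tested against your own generator.

```python
from typing import List

ABSTENTION = "I don't have enough authoritative source data to answer this"

def build_prompt(question: str, chunks: List[str], conflict_detected: bool) -> str:
    """Number each chunk and state the citation and abstention rules explicitly."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    rules = [
        "Answer using only the numbered passages below.",
        "Attach the passage number, for example [2], to every factual claim.",
        f'If the passages do not support an answer, reply: "{ABSTENTION}"',
    ]
    if conflict_detected:
        # Assumed policy: on a detected conflict, abstain rather than pick a side.
        rules.append(
            "A conflict check flagged these passages as contradictory. "
            f'Do not choose between them. Reply: "{ABSTENTION}" '
            "and list the conflicting passage numbers."
        )
    rule_text = "\n".join(f"- {rule}" for rule in rules)
    return f"Rules:\n{rule_text}\n\nPassages:\n{numbered}\n\nQuestion: {question}"

prompt = build_prompt(
    "What was the 2023 annual revenue?",
    [
        "Preliminary release: 2023 revenue was $4.2M",
        "Audited restatement: 2023 revenue was $3.97M",
    ],
    conflict_detected=True,
)
print(prompt)
```

Because every claim carries a passage number, a downstream checker can verify each citation against the passage it points to instead of trusting the answer as a whole.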

What this costs you to build

The unglamorous truth: most of the fixes above are data-engineering work, not ML work. Adding structured metadata to your ingestion pipeline, wiring a metadata-aware ranker, and bolting a contradiction-check pass between retrieval and generation collectively take a small team a few weeks. The model never changes. The corpus never changes. What changes is how the corpus is described to the retriever, and how the retriever's output is described to the generator.

This is also the reason that swapping in a more capable LLM rarely fixes "right data, wrong answers" failures. The model never had the information it needed to choose correctly. The information lives in metadata you haven't surfaced yet.

References

Asai et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.
Kortukov et al. (2024). Studying Large Language Model Behaviors Under Context-Memory Conflicts.
Liu et al. (2024). Lost in the Middle: How Language Models Use Long Contexts.
Mishra et al. (2024). Fine-grained Hallucination Detection and Editing for Language Models.
Yan et al. (2024). Corrective Retrieval-Augmented Generation (CRAG).

Q&A

RAG returns wrong answers even when it retrieves the right document because the corpus itself contains contradictory or stale versions of the same fact (preliminary vs. audited financials, draft vs. signed policy, mirror vs. canonical record). The retriever dutifully returns both, and the LLM has no signal about which one is authoritative, so the choice falls to incidental factors such as lexical overlap with the query or position in the prompt. The retrieval step is correct in the narrow sense; the conflict-resolution step is missing.

To detect conflicts before generation, run an NLI-style (natural language inference) cross-check, or a smaller LLM with a contradiction-detection rubric, over the top-k retrieved chunks before passing them to the generator. When two chunks make incompatible claims about the same entity, branch: surface the conflict to the user, deterministically prefer the chunk with the stronger metadata signal (for example the most recent valid-from timestamp), or fall back to a stricter retrieval pass that re-weights by source authority.

The highest-impact single change is source-authority weighting at index time. Tag every document with structured metadata (issuing system, version, signing status, valid-from / valid-until timestamps) and bake those into the ranker so an audited record always outranks a draft, and a current policy always outranks a superseded one. Pure semantic similarity is the wrong objective when the corpus contains versioned truth.

Rerankers do not fix knowledge conflicts by themselves. Cross-encoder rerankers improve relevance by putting more topically appropriate chunks at the top, but they don't know which of two contradictory chunks is authoritative. Pair the reranker with metadata-weighted scoring (source, recency, signing status) and an explicit contradiction-detection pass; reranking alone amplifies relevance, not truth.

For regulated industries (finance, healthcare, defence, public sector), self-hosting the embedding model and vector store is usually worth it. Embedding-as-a-service providers see your queries and, depending on the contract, may retain them. A self-hosted embedding model (e.g. BGE-large, jina-embeddings-v3) plus a self-hosted vector store (pgvector, Qdrant, Weaviate) closes that exposure. The trade-off is operational: you own the cluster, the index, and the reindex cadence.
