RAG evaluation explained

Retrieval-augmented generation (RAG) grounds LLM answers in your documents — until it doesn’t. A pipeline can retrieve irrelevant chunks, miss the one paragraph that matters, or hallucinate a confident answer while ignoring perfectly good context. RAG evaluation measures each link in that chain: did retrieval surface the right evidence, did the generator stay faithful to it, and was the final answer actually useful? Without dedicated metrics, teams optimize embedding models on vibes and ship regressions every time the knowledge base updates. This guide covers retrieval vs generation metrics, golden QA datasets, LLM-as-judge scoring (including RAGAS-style decompositions), offline regression suites vs online monitoring, how evaluation pairs with RAG architecture and reranking, a Harbor Archive documentation portal worked example, a metric decision table, common pitfalls, and a production checklist — alongside our broader LLM evaluation and hybrid search guides.

What RAG evaluation measures

RAG is a two-stage system: retrieval fetches candidate chunks from a vector index, keyword index, or both; generation conditions the LLM on those chunks to produce an answer. Failures split cleanly across stages, so metrics should too.

Retrieval quality

Did the top-k chunks contain the information needed to answer the question? Classic information-retrieval metrics apply: recall@k (was the gold chunk in the set?), precision@k (how many retrieved chunks were relevant?), mean reciprocal rank (how high did the best chunk rank?), and NDCG when multiple relevant passages exist. For RAG you often care about context recall — whether any retrieved context suffices to support a correct answer, even if noisy chunks appear alongside it.
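The retrieval metrics above can be sketched in a few lines. This is a minimal illustration that assumes binary relevance (a chunk is either gold or not) and that chunks are identified by string IDs; the function names are illustrative and not taken from any particular library.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold chunks that appear in the top-k results."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & gold) / len(gold)


def precision_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the returned top-k results that are gold chunks."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in gold) / len(top)


def reciprocal_rank(retrieved: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """NDCG with binary gains: each gold chunk contributes 1 / log2(rank + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Mean reciprocal rank is the average of `reciprocal_rank` over all questions in the eval set. With a single gold chunk per question, `recall_at_k` reduces to the 0-or-1 "was the gold chunk in the set?" check.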

Generation quality

Given the retrieved context, did the model answer correctly and honestly? Faithfulness (also called groundedness or attribution) checks whether every claim in the answer is supported by the provided chunks — no invented dates, prices, or policy exceptions. Answer relevance checks whether the response addresses the user’s question without padding or topic drift. Answer correctness compares the output to a reference answer when you have one; when you don’t, human rubrics or LLM-as-judge with strict criteria fill the gap.

End-to-end quality

End-to-end accuracy asks: would a user with the same question accept this answer? It conflates retrieval and generation failures, which is why you need stage-level metrics first. A high end-to-end score with low faithfulness usually means lucky guessing: the answers happen to be right but are not supported by the retrieved context. A low end-to-end score with high context recall means the evidence was there, so your generator or prompt is the bottleneck.
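That reading of the three scores can be written down as a small triage helper. This is a sketch under stated assumptions: the single 0.8 cutoff for "high" versus "low" is a placeholder to tune per product, and the "retrieval" branch (low end-to-end with low context recall) is inferred from the stage split rather than stated above.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    end_to_end: float,
    threshold: float = 0.8,  # assumed cutoff between "high" and "low"
) -> str:
    """Map stage-level scores to the stage most likely to need attention."""
    if end_to_end >= threshold and faithfulness < threshold:
        # Answers look right but are not grounded in the retrieved context.
        return "ungrounded"
    if end_to_end < threshold and context_recall >= threshold:
        # Evidence was retrieved, so the generator or prompt is the bottleneck.
        return "generation"
    if end_to_end < threshold:
        # Evidence was not retrieved; look at retrieval first (assumed rule).
        return "retrieval"
    return "healthy"
```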

Building golden evaluation datasets

Metrics are only as good as the questions you test. A golden RAG dataset is a table of rows: user question, optional reference answer, and document IDs (or exact chunk text) that constitute ground-truth evidence.
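One way to represent such a row is a small dataclass serialized as JSON Lines. The field names and the `source` tag are assumptions for illustration; any schema that keeps the question, the optional reference answer, and the gold evidence IDs together would serve.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class GoldenRow:
    question: str
    gold_chunk_ids: List[str]               # ground-truth evidence for the question
    reference_answer: Optional[str] = None  # optional, as described above
    source: str = "production_log"          # hypothetical provenance tag


def to_jsonl(rows: List[GoldenRow]) -> str:
    """Serialize rows one JSON object per line, with stable key order for diffs."""
    return "\n".join(json.dumps(asdict(row), sort_keys=True) for row in rows)


def from_jsonl(text: str) -> List[GoldenRow]:
    """Parse rows back, skipping blank lines."""
    return [GoldenRow(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Stable key order keeps the file diff-friendly when the golden set is versioned alongside the code.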

Sources of test questions

  • Production logs — sample real queries (redact PII), have experts label correct answers and supporting doc IDs. Highest external validity.
  • Synthetic QA from docs — prompt an LLM to generate questions answerable from each chunk; filter with human review or automatic answerability checks. Fast coverage of the corpus.
  • Adversarial sets — questions that should trigger “I don’t know,” cross-document reasoning, or outdated-doc traps. Essential for faithfulness testing.
  • Regression fixtures — freeze 50–200 rows that must never break; run on every index rebuild and model swap.

Labeling guidelines

Document what counts as a relevant chunk (exact paragraph vs whole section), whether paraphrased answers are acceptable, and how to score partial credit. Inconsistent labels make automated metrics noisy — two annotators should agree on chunk relevance at least 80% of the time before you trust the eval set.

Train/eval contamination

If you tune embedding models on your documents, do not generate synthetic eval questions from those same documents: hold out entire documents or topics for evaluation instead. Leakage inflates recall@k and hides retrieval failures on unseen content.
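A document-level holdout can be made deterministic by hashing the document ID, so every chunk of a document lands on the same side of the split. This is a minimal sketch; the 20% eval fraction and the `(doc_id, chunk_id)` tuple shape are assumptions.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def split_for(doc_id: str, eval_fraction: float = 0.2) -> str:
    """Assign a whole document to 'eval' or 'train' based on a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "eval" if bucket < eval_fraction * 2**64 else "train"


def partition_chunks(
    chunks: Iterable[Tuple[str, str]], eval_fraction: float = 0.2
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (doc_id, chunk_id) pairs by the split of their parent document."""
    parts: Dict[str, List[Tuple[str, str]]] = {"train": [], "eval": []}
    for doc_id, chunk_id in chunks:
        parts[split_for(doc_id, eval_fraction)].append((doc_id, chunk_id))
    return parts
```

Because the split depends only on the document ID, it stays stable across index rebuilds and does not need a stored assignment table. Holding out by topic would need a topic label in place of the document ID.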

LLM-as-judge for RAG

Human labeling does not scale to nightly CI. LLM-as-judge prompts a capable model to score faithfulness, relevance, and context utilization on a 1–5 rubric or binary pass/fail. Frameworks like RAGAS decompose the pipeline into context precision, context recall, faithfulness, and answer relevancy scores using chained judge prompts.

How judge prompts work

A faithfulness judge receives the question, retrieved context, and generated answer. It lists each atomic claim in the answer and marks whether each is entailed by the context. The score is the fraction of supported claims. Relevance judges compare the answer to the question without requiring word overlap. Context-recall judges check whether the reference answer (or known facts) can be inferred from retrieved chunks alone.
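The faithfulness computation described above reduces to a prompt plus a ratio. The sketch below assumes the judge's output has already been parsed into per-claim verdicts; the prompt wording is illustrative only (not RAGAS's actual prompt), and the call to the judge model is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the judge found the claim entailed by the context


def build_faithfulness_prompt(question: str, chunks: List[str], answer: str) -> str:
    """Assemble a judge prompt; the wording is a placeholder, not a tested rubric."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "List each atomic claim in the ANSWER. For each claim, state whether it is "
        "entailed by the CONTEXT (yes/no).\n\n"
        f"QUESTION:\n{question}\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}\n"
    )


def faithfulness_score(verdicts: List[ClaimVerdict]) -> Optional[float]:
    """Fraction of claims the judge marked as supported.

    Returns None when the answer contains no claims (for example a pure
    abstention), so the caller decides how to count that case.
    """
    if not verdicts:
        return None
    return sum(v.supported for v in verdicts) / len(verdicts)
```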

Calibrating judges

LLM judges correlate with humans but are biased toward verbose, confident text and may miss subtle hallucinations. Calibrate monthly: sample 100 rows, score with humans and judges, compute Cohen’s kappa, and adjust rubrics or switch judge models when agreement drops. Use a stronger judge than the model under test; never evaluate GPT-4 class answers with a 7B judge without validation.
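Cohen's kappa corrects raw agreement for the agreement expected by chance given each rater's label frequencies. A minimal sketch for two label sequences (human and judge) over the same rows, assuming categorical labels such as pass/fail:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(human)
    # Observed agreement: share of rows where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = human_counts.keys() | judge_counts.keys()
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

In the example below, raw agreement is 80% but kappa is about 0.58, which shows why raw agreement alone can overstate how well a judge tracks human labels.

```python
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa(human, judge), 3))  # 0.583
```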

When not to use judges

In high-stakes domains (medical dosing, legal liability, financial compliance), keep human review of the judge-scored sample permanently rather than phasing it out once agreement looks good. Judges also struggle with numeric tables and multi-hop math, so use deterministic checks (exact match on extracted entities, regex on policy IDs) alongside soft scores.
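Deterministic checks of this kind are cheap to write. The sketch below assumes a hypothetical policy ID format of `POL-` followed by four digits; substitute whatever identifier scheme your documents actually use.

```python
import re
from typing import List, Set

# Hypothetical ID format for illustration: "POL-" followed by four digits.
POLICY_ID = re.compile(r"\bPOL-\d{4}\b")


def unsupported_policy_ids(answer: str, context_chunks: List[str]) -> Set[str]:
    """Policy IDs cited in the answer that appear in none of the retrieved chunks."""
    cited = set(POLICY_ID.findall(answer))
    available: Set[str] = set()
    for chunk in context_chunks:
        available.update(POLICY_ID.findall(chunk))
    return cited - available


def exact_match(predicted: str, expected: str) -> bool:
    """Case- and whitespace-insensitive equality for extracted entities."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(predicted) == normalize(expected)
```

A non-empty result from `unsupported_policy_ids` is a hard failure regardless of what the soft judge score says.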

Offline regression vs online monitoring

Offline eval runs the full RAG pipeline on the golden set before deploy: embed query, retrieve, optionally rerank, generate, score. Track metrics per component version (embedding model, chunk size, reranker, prompt template, LLM). Block releases when faithfulness or context recall drops by more than an agreed threshold (e.g. 3 absolute points, or 0.03 on a 0–1 scale).
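A release gate of this kind compares the candidate's metrics with the last accepted baseline. This is a minimal sketch, assuming scores on a 0–1 scale so that 3 absolute points is 0.03; the metric names and the dictionary shape are illustrative.

```python
from typing import Dict, List, Sequence


def regression_failures(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    gated: Sequence[str] = ("faithfulness", "context_recall"),
    max_drop: float = 0.03,  # assumed: 3 absolute points on a 0-1 scale
) -> List[str]:
    """Return one message per gated metric that dropped by more than max_drop."""
    failures = []
    for name in gated:
        drop = baseline[name] - candidate[name]
        # Small tolerance so a drop of exactly max_drop is not failed by float noise.
        if drop > max_drop + 1e-9:
            failures.append(
                f"{name}: {baseline[name]:.3f} -> {candidate[name]:.3f} (drop {drop:.3f})"
            )
    return failures
```

In CI, a non-empty list would translate into a non-zero exit code so the deploy job stops.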

Online monitoring samples live traffic: implicit signals (thumbs down, reformulated queries, escalation to human support) and periodic LLM-judge scoring on production logs. Watch for drift when document freshness matters — a corpus update can crater recall@k without any code change.

What to log per request

  • Query text, embedding model version, index snapshot ID
  • Retrieved chunk IDs, scores, and character counts
  • Reranker scores if used
  • Final prompt hash and LLM model ID
  • Latency per stage (embed, retrieve, rerank, generate)
  • User feedback and support ticket linkage
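The fields listed above map naturally onto a structured log record. The sketch below is one possible shape, not a prescribed schema; all field names are assumptions, and the prompt is stored as a SHA-256 hash rather than raw text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class RetrievedChunk:
    chunk_id: str
    score: float
    char_count: int
    rerank_score: Optional[float] = None  # only set when a reranker is used


@dataclass
class RagRequestLog:
    query: str
    embedding_model: str
    index_snapshot_id: str
    chunks: List[RetrievedChunk]
    prompt_hash: str
    llm_model_id: str
    latency_ms: Dict[str, float]  # keys such as embed, retrieve, rerank, generate
    user_feedback: Optional[str] = None
    support_ticket_id: Optional[str] = None


def hash_prompt(prompt: str) -> str:
    """Stable fingerprint of the final prompt, so logs avoid storing full text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def to_json_line(record: RagRequestLog) -> str:
    """Serialize one request as a single JSON line (nested chunks included)."""
    return json.dumps(asdict(record), sort_keys=True)
```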

Worked example: Harbor Archive documentation portal

Harbor Archive hosts 4,200 internal markdown pages (API references, runbooks, onboarding). The RAG assistant must cite the correct doc section and refuse when policy is missing.

Dataset

Engineers exported 600 anonymized Slack questions from #docs-help. Three senior devs labeled 400 rows with reference answers and gold chunk IDs; 200 adversarial rows ask about deprecated v1 APIs and fictional features. Synthetic QA added 1,800 rows from held-out doc sections, with a 10% sample audited by humans.

Baseline and iteration

Initial stack: OpenAI text-embedding-3-small, 512-token chunks with 64-token overlap, top-8 retrieval, GPT-4o-mini generation. Baseline: context recall 0.71, faithfulness 0.84, answer relevance 0.88. The recall bottleneck was traced to API pages where function names appeared only in tables that had been split across chunks.

Fixes measured offline

  • Parent-child chunking (retrieve small, expand to section) — context recall +0.09
  • BGE reranker on top-24 — precision@8 +0.12, faithfulness +0.04
  • Hybrid BM25 + dense with RRF — recall +0.06 on exact-error-code queries
  • Faithfulness rubric in system prompt (“cite doc slug; say unknown if absent”) — faithfulness +0.07, fewer escalations
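The hybrid fix in the list above fuses BM25 and dense rankings with reciprocal rank fusion (RRF). A minimal sketch of RRF follows; it is not Harbor Archive's actual code. The constant `k = 60` is the value commonly used with RRF, and ties are broken by chunk ID purely to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk IDs: score(c) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk ID as a deterministic tie-breaker.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

```python
bm25 = ["c2", "c1", "c3"]
dense = ["c1", "c4", "c2"]
print(reciprocal_rank_fusion([bm25, dense]))  # ['c1', 'c2', 'c4', 'c3']
```

RRF uses only ranks, so BM25 and cosine scores never need to be put on a common scale. That also explains why exact-error-code queries benefit: a chunk that only the keyword index ranks highly still surfaces in the fused list.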

CI gates: faithfulness ≥ 0.90 and context recall ≥ 0.78 on the 200-row regression fixture. Nightly judge scoring on 2% of live queries; weekly human audit of 30 failures.

Metric and approach decision table

| Symptom | Likely stage | Metric to watch | Typical fix |
|---|---|---|---|
| Wrong doc family retrieved | Retrieval | Recall@k, MRR | Better embeddings, hybrid search, metadata filters |
| Right doc, wrong paragraph | Chunking / retrieval | Context recall, precision@k | Smaller chunks, parent-child, reranker |
| Good context, invented details | Generation | Faithfulness | Stricter prompt, lower temperature, cite-or-abstain |
| Accurate but off-topic | Generation | Answer relevance | Query rewriting, relevance filter on chunks |
| Confident on missing policy | End-to-end | Abstention rate on adversarial set | Relevance threshold, explicit IDK training |
| Regression after index rebuild | Infra | Fixture pass rate | Version index IDs, diff golden set before swap |

Common pitfalls

  • End-to-end only: you cannot tell whether to fix retrieval or the LLM.
  • Eval set too easy: synthetic questions generated from the same chunks you already retrieve perfectly; add adversarial and logged queries.
  • Uncalibrated judge: trusting LLM scores without human agreement checks.
  • Static golden set forever: product and docs change; refresh quarterly from production.
  • Ignoring latency: rerankers and huge context windows improve scores but blow p95 latency budgets.
  • Metric gaming: prompts that refuse everything score high on faithfulness but are useless on relevance.
  • Single-number dashboards: averaging faithfulness and recall hides tradeoffs; report both.

Production checklist

  • Golden dataset with question, gold chunks, and reference answer where applicable.
  • Held-out regression fixture (50+ rows) run on every deploy.
  • Stage-level metrics: recall@k, faithfulness, answer relevance logged separately.
  • LLM-as-judge rubrics versioned; human calibration sample monthly.
  • Adversarial subset for abstention and stale-doc behavior.
  • Per-request retrieval logs with chunk IDs and scores.
  • CI gate thresholds documented; failures block merge.
  • Online sample rate for judge scoring and user feedback linkage.
  • Index version ID in logs to correlate corpus changes with metric drops.
  • Playbook for diagnosing retrieval vs generation regressions.
  • Quarterly golden-set refresh from production logs.
  • Human escalation review on all faithfulness failures above severity threshold.

Key takeaways

  • RAG evaluation splits into retrieval metrics (recall, precision) and generation metrics (faithfulness, relevance).
  • Golden datasets from production logs plus adversarial rows catch real failure modes synthetic QA misses.
  • LLM-as-judge scales scoring but requires calibration against human labels.
  • Offline regression gates prevent index and model regressions; online monitoring catches drift.
  • Diagnose bottlenecks by stage before swapping embedding models or LLMs at random.
