Agentic RAG explained

Classic retrieval-augmented generation (RAG) embeds the user question, fetches the top-k chunks, stuffs them into the prompt, and generates an answer. That single-shot pattern works for simple FAQs but fails on multi-hop questions, ambiguous queries, and cases where the first retrieval pass returns irrelevant noise. Agentic RAG treats retrieval as a loop the model controls: plan sub-queries, call search tools, grade whether evidence is sufficient, rewrite the query, and repeat until grounded or a budget expires. This guide covers Self-RAG and Corrective RAG, query decomposition, ReAct-style orchestration with LangGraph, retrieval grading, a Harbor Support escalation agent worked example, an architecture decision table, pitfalls, and a production checklist — building on our AI agents and tool use explainer.

Naive RAG vs agentic RAG

Naive RAG is a fixed pipeline: chunk documents, embed at index time, embed the question at query time, cosine-similarity search, concatenate hits, generate. Latency is predictable (one embedding call + one vector query + one LLM call) and cost is low. The failure modes are equally predictable: wrong chunks when the question uses different vocabulary than the source, missing context that requires two documents (compare policy A vs policy B), and no recovery when retrieval returns garbage.
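
The fixed pipeline above fits in a few lines. This is a minimal sketch, not a production retriever: the bag-of-words `embed` function is a stand-in for a real embedding model, and `generate` is a hypothetical placeholder for whatever LLM call you use.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Index time: embed every chunk once.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, question: str, k: int = 2) -> list[str]:
    # Query time: embed the question and rank chunks by cosine similarity.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def naive_rag(index, question: str, generate, k: int = 2) -> str:
    # Single shot: retrieve once, concatenate hits, generate. No grading, no retry.
    context = "\n".join(retrieve(index, question, k))
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```

Nothing in `naive_rag` inspects the retrieved chunks before generating, which is exactly where the failure modes listed above come from.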

Agentic RAG adds a control layer — often an LLM with tools or a state machine — that decides when to retrieve, what to search for, and whether to try again. The agent may decompose “Which Harbor plan covers international wire fees and what is the SLA?” into two searches, merge results, notice the SLA doc is missing, broaden the query to “wire transfer processing time,” and only then draft the answer. You trade latency and token cost for accuracy on hard questions.

When agentic RAG is worth it

Use agentic patterns when evaluation shows naive RAG fails on multi-document reasoning, when users ask compound questions, or when your corpus is large enough that top-k alone routinely misses the right section. Skip it for high-QPS, low-complexity lookups (password reset steps) where a single retrieval plus caching is enough. Start with naive RAG, measure faithfulness and context recall, then add agency only where the metrics show a gap.

Self-RAG: retrieve, generate, critique

Self-RAG (self-reflective retrieval-augmented generation) trains or prompts the model to emit special tokens or structured decisions at each step: retrieve?, is this passage relevant?, is the draft supported?, is the answer useful? In production implementations you rarely need custom training tokens — a smaller classifier model or a structured JSON schema from a capable LLM plays the same role.

A typical Self-RAG loop:

  1. Model decides whether retrieval is needed (some chit-chat skips search).
  2. Retrieve top-k chunks; model grades each as relevant / irrelevant / ambiguous.
  3. Generate a draft answer citing graded passages.
  4. Model checks faithfulness (claims match sources) and overall utility.
  5. If utility is low, rewrite the query or retrieve again; else return.
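
The five steps above can be sketched as a bounded loop. Every callable here (`needs_retrieval`, `grade_passage`, `is_supported`, `rewrite`, and so on) is a hypothetical hook you would back with a classifier or a structured LLM call; the names and the return shape are assumptions for illustration only.

```python
from typing import Callable

def self_rag(
    question: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    grade_passage: Callable[[str, str], str],  # "relevant" | "irrelevant" | "ambiguous"
    generate: Callable[[str, list[str]], str],
    is_supported: Callable[[str, list[str]], bool],
    rewrite: Callable[[str], str],
    max_rounds: int = 3,
) -> dict:
    # Step 1: skip retrieval when it is not needed (chit-chat).
    if not needs_retrieval(question):
        return {"answer": generate(question, []), "rounds": 0, "grounded": False}
    query, draft = question, ""
    for round_no in range(1, max_rounds + 1):
        # Step 2: retrieve, then keep only passages graded relevant.
        passages = [p for p in retrieve(query) if grade_passage(question, p) == "relevant"]
        # Step 3: draft an answer from the graded passages.
        draft = generate(question, passages)
        # Step 4: gate on evidence being present and the draft being supported.
        if passages and is_supported(draft, passages):
            return {"answer": draft, "rounds": round_no, "grounded": True}
        # Step 5: rewrite the query and try again.
        query = rewrite(query)
    # Budget spent: return the last draft, flagged as ungrounded.
    return {"answer": draft, "rounds": max_rounds, "grounded": False}
```

The `grounded` flag lets the caller decide whether to show the draft, hedge it, or escalate.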

Self-RAG shines when hallucination risk is high and you want explicit gates before users see an answer. Pair grading with RAG evaluation metrics offline so threshold tuning is data-driven, not guesswork.

Corrective RAG (CRAG): fix bad retrieval

Corrective RAG focuses on the retrieval step. After the first search, a lightweight retrieval evaluator scores whether returned documents are relevant to the question. Outcomes branch:

  • Correct — documents look good; proceed to generation.
  • Incorrect — discard vector hits; fall back to web search, a secondary index, or a broader keyword query.
  • Ambiguous — keep some chunks but also run a refined query or decompose into sub-questions.

CRAG is cheaper than full agent loops because the evaluator can be a small cross-encoder reranker or a fast LLM call with a yes/no rubric. It directly addresses the “garbage in, garbage out” problem without requiring the main model to plan every move. Combine CRAG with hybrid search so the fallback path has lexical recall when embeddings miss exact product codes or legal citations.
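
One plausible way to wire the three branches is shown below. The thresholds (0.7 and 0.3) and the function names are illustrative assumptions, not values from any particular CRAG implementation; `evaluator` stands in for a cross-encoder or a yes/no LLM rubric that returns a score between 0 and 1.

```python
def crag_route(scores: list[float], upper: float = 0.7, lower: float = 0.3) -> str:
    """Map evaluator scores for the retrieved documents to a CRAG action."""
    if not scores:
        return "incorrect"
    best = max(scores)
    if best >= upper:
        return "correct"
    if best < lower:
        return "incorrect"
    return "ambiguous"

def crag_retrieve(question, vector_search, evaluator, fallback_search, refine,
                  upper: float = 0.7, lower: float = 0.3):
    docs = vector_search(question)
    scores = [evaluator(question, d) for d in docs]
    action = crag_route(scores, upper, lower)
    # Drop clearly irrelevant hits in every branch.
    kept = [d for d, s in zip(docs, scores) if s >= lower]
    if action == "incorrect":
        # Discard vector hits and use the fallback path (web, secondary index, keywords).
        return action, fallback_search(question)
    if action == "ambiguous":
        # Keep what looks usable and add results from a refined query.
        return action, kept + vector_search(refine(question))
    return action, kept
```

Tune `upper` and `lower` on a labelled set of question/document pairs rather than trusting these defaults.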

Query decomposition and step-back prompting

Complex questions hide multiple information needs. Query decomposition asks the LLM to split the user message into atomic sub-queries, run retrieval for each, then synthesize. Example: “Can I downgrade from Enterprise mid-contract and keep SSO?” becomes (1) “Enterprise contract downgrade policy” and (2) “SSO availability by plan tier.”

Step-back prompting generates a broader question first (“What are Harbor billing and plan change rules?”), retrieves background context, then answers the specific question with that framing. Step-back helps when users ask narrow questions that depend on unstated policy context the embedding model would not connect.
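
In practice the fragile part of decomposition is parsing what the model returns. The sketch below assumes the decomposer is prompted to emit a JSON list of strings; `decompose` and `retrieve` are hypothetical callables, and the cap of three sub-queries is an arbitrary default.

```python
import json

def parse_sub_queries(llm_output: str, original: str, max_sub_queries: int = 3) -> list[str]:
    """Parse a JSON list of sub-queries; fall back to the original question on bad output."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return [original]
    if not isinstance(data, list):
        return [original]
    cleaned = [q.strip() for q in data if isinstance(q, str) and q.strip()]
    return cleaned[:max_sub_queries] or [original]

def decompose_and_retrieve(question: str, decompose, retrieve) -> dict[str, list[str]]:
    sub_queries = parse_sub_queries(decompose(question), question)
    # One retrieval per sub-query; evidence stays grouped so synthesis can cite per need.
    return {sq: retrieve(sq) for sq in sub_queries}
```

Falling back to the original question means a malformed decomposition degrades to naive retrieval instead of failing the request.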

Parallel vs sequential retrieval

Independent sub-queries can run in parallel (lower latency, higher embedding cost). Dependent sub-queries — answer B requires a date found in doc A — need sequential loops. LangGraph and similar frameworks model this as a graph: decompose node, parallel retrieve nodes, merge node, grade node, conditional edges back to retrieve or forward to generate. Cap parallel fan-out (e.g. max 4 sub-queries) to prevent cost explosions on rambling user input.
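
The two shapes can be sketched without any framework, using only the standard library. The `retrieve` and `next_query` callables are hypothetical; the fan-out cap of 4 mirrors the example in the text.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_FAN_OUT = 4  # cap taken from the example above

def parallel_retrieve(sub_queries: list[str], retrieve, max_fan_out: int = MAX_FAN_OUT) -> dict:
    """Run independent sub-queries concurrently, dropping any beyond the cap."""
    capped = sub_queries[:max_fan_out]
    with ThreadPoolExecutor(max_workers=max(1, len(capped))) as pool:
        results = list(pool.map(retrieve, capped))  # map preserves input order
    return dict(zip(capped, results))

def sequential_retrieve(first_query: str, retrieve, next_query, max_steps: int = 3) -> list:
    """Run dependent sub-queries: each next query is built from the previous hits."""
    query, trail = first_query, []
    for _ in range(max_steps):
        hits = retrieve(query)
        trail.append((query, hits))
        query = next_query(hits)  # return None when no follow-up is needed
        if query is None:
            break
    return trail
```

Threads suit I/O-bound retrieval calls; if your search client is async, the same structure maps onto `asyncio.gather`.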

ReAct-style tool orchestration

ReAct (reasoning + acting) interleaves natural-language thought steps with tool calls: search_kb(query="refund window"), observe snippets, search_kb(query="partial refund exceptions"), then answer. In agentic RAG, tools are not only vector search — they include SQL over metadata, filtered search by date or product, graph traversals, and calculators.

Framework patterns:

  • Tool definitions — JSON Schema describing search_documents, fetch_page, list_related; see our function calling guide.
  • State — accumulated chunks, query history, grades, token budget remaining.
  • Termination — max steps (3–5 is common), confidence threshold from grader, or explicit finish tool.
  • Human-in-the-loop — pause before external web search or PII-heavy lookups.
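
To make the pieces above concrete, here is a framework-free sketch of a tool definition, loop state, and termination. The schema envelope is a generic shape (the exact wrapper varies by provider), the `product` filter is a hypothetical parameter, and `policy` stands in for the LLM that chooses the next action.

```python
# JSON Schema style description of one search tool (generic shape, assumed).
SEARCH_TOOL = {
    "name": "search_documents",
    "description": "Search the knowledge base and return matching snippets.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "product": {"type": "string"},  # hypothetical optional filter
        },
        "required": ["query"],
    },
}

def run_react(question: str, policy, tools: dict, max_steps: int = 4) -> dict:
    """Alternate policy decisions and tool calls until 'finish' or the step cap."""
    state = {"question": question, "observations": [], "query_history": []}
    for step in range(max_steps):
        # The policy returns e.g. {"tool": "search_documents", "args": {"query": "..."}}.
        action = policy(state)
        if action["tool"] == "finish":
            return {"answer": action["args"]["answer"], "steps": step, "stopped": "finish"}
        result = tools[action["tool"]](**action["args"])
        state["query_history"].append(action)
        state["observations"].append(result)
    # Step cap reached without an explicit finish.
    return {"answer": None, "steps": max_steps, "stopped": "max_steps"}
```

Returning `stopped` separately from `answer` makes step-limit hits easy to count in logs.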

LlamaIndex QueryEngine tools and LangGraph prebuilt ReAct agents wrap this loop; you still own prompt design, grader thresholds, and observability.

Retrieval grading and re-querying

A retrieval grader scores each chunk (or the set) against the current sub-question. Implementation options include:

  • Cross-encoder reranker score below threshold → treat as irrelevant.
  • LLM rubric: “Does this passage contain information to answer X? yes/no/uncertain.”
  • Embedding similarity between chunk and hypothetical ideal answer.

On failure, re-query strategies include: HyDE (generate a hypothetical answer and embed that instead of the question), query expansion with synonyms, stripping entities to search more broadly, switching from semantic search to BM25 for SKU-like tokens, or asking the user a clarifying question (a human-in-the-loop tool). Log every rewrite: clusters of failed queries signal indexing or chunking bugs upstream.
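
A grade-then-rewrite loop might look like the sketch below. The `search`, `score`, and rewrite callables are hypothetical hooks (a reranker score, a HyDE generator, a BM25 client), and the threshold and minimum count are placeholder defaults.

```python
def retrieve_with_requery(query: str, search, score, rewrites,
                          min_relevant: int = 2, threshold: float = 0.5):
    """Try the original query, then each rewrite in order, until enough chunks pass the grader.

    rewrites: list of (strategy_name, rewrite_fn) pairs, tried lazily in order.
    Returns (kept_chunks, log), where log records every attempt.
    """
    strategies = [("original", lambda q: q)] + list(rewrites)
    kept, log = [], []
    for name, rewrite in strategies:
        candidate = rewrite(query)
        hits = search(candidate)
        # Grade against the original question, not the rewritten text.
        kept = [h for h in hits if score(query, h) >= threshold]
        log.append({"strategy": name, "query": candidate, "relevant": len(kept)})
        if len(kept) >= min_relevant:
            break
    return kept, log
```

The returned `log` is the rewrite trail the paragraph above recommends keeping; aggregate it offline to find queries that only succeed after a fallback.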

Budgets and guardrails

Unguarded loops burn money. Set max retrieval rounds, max total chunks injected into context, and max LLM calls per request. When the budget is exhausted, return a partial answer that says “I could not verify X” rather than hallucinating. Rate-limit agentic endpoints separately from simple chat.
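
A per-request budget object is one simple way to enforce these caps. The default limits below are illustrative assumptions, as are the class and function names.

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    """Raised when a charge would push the request past one of its limits."""

@dataclass
class Budget:
    max_rounds: int = 3       # retrieval rounds (assumed default)
    max_chunks: int = 12      # chunks injected into context (assumed default)
    max_llm_calls: int = 6    # LLM calls per request (assumed default)
    rounds: int = 0
    chunks: int = 0
    llm_calls: int = 0

    def charge(self, rounds: int = 0, chunks: int = 0, llm_calls: int = 0) -> None:
        """Record usage, raising before any limit would be exceeded."""
        if (self.rounds + rounds > self.max_rounds
                or self.chunks + chunks > self.max_chunks
                or self.llm_calls + llm_calls > self.max_llm_calls):
            raise BudgetExceeded("request budget exhausted")
        self.rounds += rounds
        self.chunks += chunks
        self.llm_calls += llm_calls

def partial_answer(verified: list[str], unverified: list[str]) -> str:
    """Return what was verified and name what was not, instead of guessing."""
    lines = list(verified)
    lines += [f"I could not verify: {claim}" for claim in unverified]
    return "\n".join(lines)
```

Call `charge` before each retrieval or LLM call, catch `BudgetExceeded` at the top of the agent loop, and build the reply with `partial_answer`.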

Worked example: Harbor Support escalation agent

Harbor Support runs a tier-1 bot on 40k Confluence pages plus Zendesk macros. Naive RAG (k=8, single shot) scored 62% faithfulness on a golden set of 200 multi-hop tickets. They deployed an agentic pipeline in LangGraph with four nodes:

  1. Router — fast classifier: FAQ (naive RAG path) vs complex (agent path). 70% of traffic stays naive.
  2. Decomposer — GPT-4o-mini emits 1–3 sub-queries as JSON.
  3. Retrieve + CRAG — parallel hybrid search per sub-query; cross-encoder grades each hit; if <2 relevant chunks, HyDE re-query once.
  4. Generate + Self-check — draft answer; grader verifies each bullet against sources; on fail, one more targeted search with the missing fact as query.
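
The four nodes can be approximated in plain Python as below. This is a hypothetical reconstruction for illustration, not Harbor's code: it is written without LangGraph, it runs sub-queries sequentially rather than in parallel to stay short, and every callable (`classify`, `decompose`, `hybrid_search`, `rerank`, `hyde_requery`, `generate`, `verify`) is an assumed hook.

```python
def add_unique(evidence: list, hits) -> None:
    # Append hits that are not already in the evidence list.
    for h in hits:
        if h not in evidence:
            evidence.append(h)

def harbor_pipeline(ticket, classify, naive_rag, decompose, hybrid_search,
                    rerank, hyde_requery, generate, verify, min_relevant: int = 2) -> dict:
    # Node 1: router. FAQ traffic stays on the cheap path.
    if classify(ticket) == "faq":
        return {"route": "naive", "answer": naive_rag(ticket)}
    # Node 2: decomposer, capped at three sub-queries.
    sub_queries = decompose(ticket)[:3]
    # Node 3: retrieve + CRAG-style grading, with one HyDE re-query per sub-query.
    evidence = []
    for sq in sub_queries:
        hits = [h for h in hybrid_search(sq) if rerank(sq, h)]
        if len(hits) < min_relevant:
            hits = [h for h in hyde_requery(sq) if rerank(sq, h)]
        add_unique(evidence, hits)
    # Node 4: generate + self-check, with one targeted search on failure.
    draft = generate(ticket, evidence)
    missing = verify(draft, evidence)  # the unsupported fact, or None
    if missing is not None:
        add_unique(evidence, hybrid_search(missing))
        draft = generate(ticket, evidence)
    return {"route": "agent", "answer": draft, "evidence": evidence}
```

Each retry is allowed exactly once, so the worst-case number of searches and LLM calls per ticket is fixed and easy to budget.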

Example ticket: “Customer on Pro annual paid June 1 wants refund after using API for 2 weeks — eligible?” Sub-queries: (1) Pro annual refund policy, (2) API usage effect on refund eligibility, (3) 14-day window exceptions. First pass retrieved marketing copy (graded incorrect); CRAG triggered keyword search on “refund eligibility API usage” and found the billing policy PDF section. Self-check caught an unsupported claim about “full refund” and replaced it with prorated language from the doc. Faithfulness on the agent path rose to 81%; p95 latency went from 1.2s (naive) to 4.8s (agent). They accept the trade-off only for the 30% complex route and show a “reviewing sources” UI during loops.

Architecture decision table

Pattern | Best for | Latency | Cost | Complexity
Naive RAG (single shot) | Simple FAQs, high QPS | Low | Low | Low
CRAG only | Noisy retrieval, clear fallbacks | Low–medium | Medium | Medium
Query decomposition | Multi-part questions | Medium | Medium | Medium
Self-RAG grading | High hallucination risk | Medium | Medium–high | Medium
Full ReAct agent | Many tool types, exploratory search | High | High | High
Router + hybrid paths | Mixed traffic (Harbor pattern) | Variable | Optimized | High
Graph RAG + agent | Entity-heavy corpora | High | High | Very high

Common pitfalls

  • Agentic everything — 3–5x cost on questions a cached naive RAG answers fine.
  • No step cap — models loop until timeout; always set max iterations.
  • Grader too weak — yes/no from the same model that hallucinated passes bad evidence.
  • Context stuffing — ten retrieval rounds append duplicate chunks; dedupe by doc ID.
  • Ignoring latency UX — users abandon after 3s without progress indicators.
  • Skipping offline eval — agent paths multiply failure modes; golden sets per route are mandatory.
  • Tool sprawl — twenty search tools confuse the planner; compose filters into one tool with parameters.
  • No audit trail — support cannot explain why the bot said X; log queries, grades, and sources.
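
For the context-stuffing pitfall, deduplication by document ID takes only a few lines. The chunk fields (`doc_id`, `score`) and the cap of 8 are assumptions about how your retriever returns results.

```python
def dedupe_chunks(chunks: list[dict], max_chunks: int = 8) -> list[dict]:
    """Keep the highest-scoring chunk per doc_id, best first, capped at max_chunks."""
    best: dict[str, dict] = {}
    for chunk in chunks:
        key = chunk["doc_id"]
        if key not in best or chunk["score"] > best[key]["score"]:
            best[key] = chunk
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)[:max_chunks]
```

If several distinct sections of one document matter, key on a (document ID, chunk ID) pair instead so only true repeats are dropped.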

Production checklist

  • Baseline naive RAG with faithfulness, context recall, and latency metrics.
  • Route simple queries to the cheap path; reserve agentic loops for classified hard cases.
  • Implement CRAG or retrieval grading before full ReAct agents.
  • Cap max steps, chunks, and LLM calls per user request.
  • Deduplicate and rank merged chunks from parallel sub-queries.
  • Expose retrieval progress in the UI for multi-second agent runs.
  • Log sub-queries, tool calls, grader scores, and final citations for debugging.
  • A/B test agentic vs naive on real traffic; watch cost per resolved ticket.
  • Alert on step-limit hits and grader failure rate spikes (index drift signal).
  • Document when to escalate to humans — agents should not guess on legal or safety edge cases.
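
The logging and alerting items can start as something this small. The record fields and the alert thresholds (5% and 20%) are illustrative assumptions; pick values from your own baseline.

```python
import json

def make_trace(request_id, sub_queries, tool_calls, grader_scores,
               citations, hit_step_limit, grader_passed) -> dict:
    """Build one audit record per request (field names are assumptions)."""
    return {
        "request_id": request_id,
        "sub_queries": sub_queries,
        "tool_calls": tool_calls,
        "grader_scores": grader_scores,
        "citations": citations,
        "hit_step_limit": hit_step_limit,
        "grader_passed": grader_passed,
    }

def to_log_line(trace: dict) -> str:
    # One JSON object per line keeps traces easy to grep and load.
    return json.dumps(trace, sort_keys=True)

def alert_signals(traces: list[dict], max_step_limit_rate: float = 0.05,
                  max_grader_fail_rate: float = 0.20) -> list[str]:
    """Return alert messages when step-limit hits or grader failures exceed thresholds."""
    if not traces:
        return []
    n = len(traces)
    step_limit_rate = sum(1 for t in traces if t["hit_step_limit"]) / n
    grader_fail_rate = sum(1 for t in traces if not t["grader_passed"]) / n
    alerts = []
    if step_limit_rate > max_step_limit_rate:
        alerts.append(f"step-limit rate {step_limit_rate:.0%}")
    if grader_fail_rate > max_grader_fail_rate:
        alerts.append(f"grader failure rate {grader_fail_rate:.0%}")
    return alerts
```

Run `alert_signals` over a rolling window of recent traces so a spike after a re-index shows up quickly.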

Key takeaways

  • Agentic RAG loops retrieval until evidence is good enough — naive RAG retrieves once.
  • Self-RAG grades relevance and faithfulness; CRAG corrects bad first-pass retrieval.
  • Query decomposition and step-back prompting handle multi-hop and under-specified questions.
  • ReAct + LangGraph orchestrate tools with explicit budgets and termination.
  • Route by complexity — agentic power where metrics justify cost, naive RAG everywhere else.
