LlamaIndex fundamentals explained

A compliance team needs analysts to query 600 pages of internal policy without the model hallucinating obligations. Pasting raw PDFs into a chat prompt fails by page ten; stitching loaders, splitters, vector stores, and rerankers together by hand in LangChain works but scatters retrieval logic across unrelated modules. LlamaIndex (formerly GPT Index) is a Python framework built retrieval-first: it models knowledge as documents and nodes, builds indexes over them, and exposes query engines and chat engines that orchestrate retrieval, synthesis, and citation in one API surface. Where LangChain generalizes across LLM app patterns, LlamaIndex optimizes the path from messy enterprise files to grounded answers, including ingestion pipelines, hybrid retrievers, reranking postprocessors, and event-driven workflows for multi-step agents.

This guide assumes baseline RAG vocabulary. It covers core data structures, index types, query and chat engines, retriever composition, hybrid search integration, deployment patterns, a Harbor Analytics policy knowledge base worked example, a framework decision table versus LangChain and raw SDKs, common pitfalls, and a production checklist.

What LlamaIndex is and when to adopt it

LlamaIndex sits between your data sources and any LLM provider. Its central bet: most production LLM apps are really retrieval problems with a synthesis step on top. The framework therefore ships first-class abstractions for parsing, chunking, embedding, indexing, filtering, reranking, and response synthesis — rather than treating documents as one of dozens of equal Runnable types.

Core primitives

Document — a loaded unit (PDF page, HTML article, database row) with text and metadata.
Node — a chunk derived from documents; the atomic unit stored in indexes (with parent/child relationships for hierarchical retrieval).
Index — a data structure mapping nodes to retrieval strategies (vector, keyword, tree summary, knowledge graph).
Retriever — fetches relevant nodes for a query; composable via fusion and routing.
Query engine — retriever + response synthesizer; answers one-shot questions with optional source citations.
Chat engine — query engine plus conversational memory for multi-turn Q&A over the same corpus.
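
The relationship between documents and nodes can be sketched in plain Python. This is a toy model for illustration only, not LlamaIndex's actual classes: ToyDocument, ToyNode, and split_into_nodes are hypothetical names, and splitting on blank lines stands in for a real node parser.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDocument:
    """A loaded unit of text plus metadata (hypothetical stand-in for Document)."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyNode:
    """A chunk derived from a document (hypothetical stand-in for a text node)."""
    node_id: str
    text: str
    metadata: dict
    parent_id: Optional[str] = None  # link back to the source document


def split_into_nodes(doc: ToyDocument) -> list[ToyNode]:
    """Split a document on blank lines; each paragraph becomes one node."""
    paragraphs = [p.strip() for p in doc.text.split("\n\n") if p.strip()]
    return [
        ToyNode(
            node_id=f"{doc.doc_id}-{i}",
            text=p,
            metadata=dict(doc.metadata),  # nodes inherit the document's metadata
            parent_id=doc.doc_id,
        )
        for i, p in enumerate(paragraphs)
    ]


# Invented example text and metadata.
doc = ToyDocument("policy-7", "Retain logs for 90 days.\n\nDelete on request.", {"region": "EU"})
nodes = split_into_nodes(doc)
```

Copying metadata onto every node is what later lets a retriever filter by region or department without going back to the source document.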

Adopt LlamaIndex when your product is document-heavy Q&A, enterprise search with LLM synthesis, or agent workflows grounded in private knowledge. Prefer LangChain when you need broad provider glue, LCEL composition across non-retrieval steps, or tight LangSmith integration without mixing frameworks. Prefer a thin raw SDK layer when you have one static prompt and no corpus. Pair LlamaIndex retrieval with LangGraph when agent control flow (cycles, human approval, checkpoints) matters more than indexing ergonomics.

Documents, nodes, and ingestion pipelines

Loading starts with SimpleDirectoryReader, specialized readers (Notion, Slack, databases), or LlamaParse for complex PDFs and tables. Readers emit Document objects. Splitters such as SentenceSplitter or SemanticSplitterNodeParser convert documents into TextNode instances with metadata (source file, page number, section heading).

An IngestionPipeline chains transformations declaratively:

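from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Assumes `documents` (Document objects from a reader) and `vector_store`
# (any configured LlamaIndex vector store) are already defined.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),  # split into nodes
        OpenAIEmbedding(),  # embed each node
    ],
    vector_store=vector_store,  # embedded nodes are written here
)
nodes = pipeline.run(documents=documents)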

Pipelines support caching (skip re-embedding unchanged files), parallel workers, and docstore persistence so incremental updates do not rebuild the entire index. Store rich metadata early: department, effective date, and access tier enable metadata filters at query time. Follow embedding model guidance on chunk size — legal prose often needs 256–512 tokens with overlap; API reference docs tolerate smaller chunks with heading-aware splits.
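
To make chunk size and overlap concrete, here is a minimal sliding-window sketch. It is an assumption-laden simplification: it counts whitespace-separated words rather than model tokens and ignores sentence boundaries, both of which a real splitter such as SentenceSplitter handles. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, repeating `overlap` words between chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
# chunks -> [w0..w3], [w3..w6], [w6..w9]
```

The repeated word at each boundary is the overlap: a sentence that straddles two chunks stays at least partly visible in both.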

Index types and VectorStoreIndex

VectorStoreIndex is the default for semantic search: nodes embed into vectors stored in Chroma, Pinecone, pgvector, Qdrant, or Elasticsearch dense vectors. Build from nodes or call from_documents for a one-shot prototype:

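from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Global defaults used by indexes and query engines built afterwards.
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Assumes `documents` was loaded earlier, e.g. by SimpleDirectoryReader.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=6)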

Other index types address different retrieval shapes:

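SummaryIndex — stores nodes as a sequential list and reads all of them at query time; useful for small corpora or map-reduce summarization over many chunks.
TreeIndex — hierarchical summaries; legacy pattern largely superseded by parent-child node retrieval.
KnowledgeGraphIndex — extracts entity-relationship triples for structured traversal alongside vectors; recent releases steer new projects toward PropertyGraphIndex, so check which one your installed version recommends.
ComposableGraph — composes multiple indexes (per department or product line) so a query can be directed to the right sub-index; RouterQueryEngine is the more common routing primitive in recent releases.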

Production systems rarely rely on pure vector search. Combine dense embeddings with BM25 keyword retrieval via QueryFusionRetriever (reciprocal rank fusion) — the pattern detailed in our hybrid search guide. Persist indexes to external vector stores so application servers stay stateless and horizontally scalable.
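
Reciprocal rank fusion itself is small enough to sketch in plain Python. This is not QueryFusionRetriever's internal code; it is a minimal illustration of the scoring rule, where each result list contributes 1 / (k + rank) per node. The constant k = 60 is a commonly used default and an assumption here, as are the node IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of node IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)  # higher ranks contribute more
    # Sort by score descending; break ties by node ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["n3", "n1", "n7"]  # hypothetical dense-retrieval order
bm25_hits = ["n1", "n9", "n3"]    # hypothetical keyword-retrieval order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
fused_ids = [node_id for node_id, _ in fused]
```

Because only ranks are used, the dense and BM25 scores never need to be normalized against each other, which is the main reason this fusion rule is popular for hybrid search.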

Query engines, chat engines, and synthesis modes

A query engine wires a retriever to a ResponseSynthesizer. Call query_engine.query("What is the retention policy for EU customers?") and receive a Response object with .response text and .source_nodes for citations. Synthesis modes trade latency for quality:

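compact — default; packs retrieved chunks into as few prompts as the context window allows, refining across prompts only when the chunks overflow.
tree_summarize — recursively summarizes chunks bottom-up; suited to summarization over many chunks.
refine — one LLM call per chunk, iteratively refining the answer (slower, more thorough).
simple_summarize — truncates chunks to fit a single prompt; fast, but can drop detail.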
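
The latency difference between refine-style and compact-style synthesis comes down to the number of LLM calls. The sketch below illustrates that with a counting stub instead of a real model. It assumes a character budget as a stand-in for the token budget, and pack_chunks, refine, compact, and counting_llm are hypothetical names rather than library functions.

```python
from typing import Callable


def pack_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Greedily join chunks into as few prompts as fit under max_chars."""
    packed: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)  # budget exceeded: start a new prompt
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def refine(question: str, chunks: list[str], llm: Callable[[str], str]) -> str:
    """One LLM call per chunk, each call revising the previous draft."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Question: {question}\nDraft: {answer}\nContext: {chunk}")
    return answer


def compact(question: str, chunks: list[str], llm: Callable[[str], str], max_chars: int) -> str:
    """Pack chunks first, then refine over the (fewer) packed prompts."""
    return refine(question, pack_chunks(chunks, max_chars), llm)


calls: list[str] = []


def counting_llm(prompt: str) -> str:
    """Stub model that only records how often it was called."""
    calls.append(prompt)
    return f"draft {len(calls)}"


chunks = ["a" * 40, "b" * 40, "c" * 40]
refine("q", chunks, counting_llm)
refine_calls = len(calls)
calls.clear()
compact("q", chunks, counting_llm, max_chars=100)
compact_calls = len(calls)
```

With three 40-character chunks and a 100-character budget, the refine path makes three calls while the compact path makes two, since the first two chunks fit in one prompt.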

Chat engines wrap the same retrieval stack with memory. index.as_chat_engine(chat_mode="condense_plus_context") condenses conversation history into a standalone retrieval query before fetching nodes, which is critical when users say “what about the German version?” without repeating context. For streaming UIs, build query engines with streaming=True, call stream_chat on chat engines, and yield tokens to the client as they arrive.
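
The condense step can be illustrated without any framework. In this sketch, fake_llm and fake_retrieve are canned stand-ins for a real model and retriever, the prompt wording is an assumption rather than the library's built-in template, and the chat history is invented example text.

```python
from typing import Callable


def condense_question(
    history: list[tuple[str, str]],
    follow_up: str,
    llm: Callable[[str], str],
) -> str:
    """Rewrite a follow-up into a standalone question using the chat history."""
    if not history:
        return follow_up  # nothing to condense on the first turn
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = (
        "Rewrite the follow-up as a standalone question.\n"
        f"Conversation:\n{transcript}\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model call (illustration only)."""
    return " What is the retention policy in the German version? "


retrieval_queries: list[str] = []


def fake_retrieve(query: str) -> list[str]:
    """Record the query the retriever receives and return a placeholder chunk."""
    retrieval_queries.append(query)
    return ["<retrieved policy text>"]


history = [
    ("user", "What is the retention policy for EU customers?"),
    ("assistant", "Retention is described in the EU policy."),  # invented text
]
standalone = condense_question(history, "What about the German version?", fake_llm)
context = fake_retrieve(standalone)  # retrieval sees the rewritten question
first_turn = condense_question([], "What is the retention policy?", fake_llm)
```

The point of the design is that the retriever never sees the ambiguous follow-up: it embeds the rewritten question, which carries the topic from earlier turns.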

Customize prompts with PromptTemplate objects (for example, passed as text_qa_template when building the query engine): inject tone, citation format (“answer only from sources; say I don’t know otherwise”), and structured output instructions. Structured answers pair naturally with Pydantic output parsers on the LLM side.

Retrievers, postprocessors, and routing

Drop down one level when you need fine control: index.as_retriever(similarity_top_k=20) returns scored nodes without synthesis. Compose retrievers with RouterRetriever (a selector, usually an LLM, chooses among retrievers based on their descriptions) or AutoMergingRetriever (fetch child chunks, then replace them with their parent node when enough siblings match).
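
The auto-merging idea can be sketched as a small function over node IDs. This is an illustration of the logic, not AutoMergingRetriever's implementation: the 0.5 ratio threshold, the function name auto_merge, and the parent and child IDs are all assumptions.

```python
from collections import defaultdict


def auto_merge(
    retrieved_child_ids: list[str],
    children_of: dict[str, list[str]],
    ratio_threshold: float = 0.5,
) -> list[str]:
    """Replace retrieved children with their parent when enough siblings matched."""
    parent_of = {child: parent for parent, kids in children_of.items() for child in kids}
    hits_by_parent: dict[str, int] = defaultdict(int)
    for child_id in retrieved_child_ids:
        hits_by_parent[parent_of[child_id]] += 1

    merged: list[str] = []
    seen: set[str] = set()
    for child_id in retrieved_child_ids:
        parent_id = parent_of[child_id]
        ratio = hits_by_parent[parent_id] / len(children_of[parent_id])
        result = parent_id if ratio > ratio_threshold else child_id
        if result not in seen:  # emit each parent or child only once
            seen.add(result)
            merged.append(result)
    return merged


children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
merged = auto_merge(["a", "d", "b"], children_of)
# Two of P1's three children matched (ratio 0.67), so P1 replaces them;
# only one of P2's four children matched, so "d" stays as a child.
```

Merging trades a little extra context length for coherence: the synthesizer reads one contiguous parent section instead of several disjoint fragments of it.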

Postprocessors refine retrieved nodes before synthesis:

SimilarityPostprocessor — cut off low cosine-similarity nodes.
CohereRerank / cross-encoder rerankers — reorder top-20 to top-5 with a second-stage model.
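Metadata filters — restrict to doc_type=policy and region=EU; these are usually applied inside the vector store query (via the retriever's filters) rather than after retrieval.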
LongContextReorder — place most relevant chunks at context window edges (lost-in-the-middle mitigation).
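
Two of these steps are simple enough to sketch directly: a similarity cutoff and a reorder that pushes the strongest chunks to the edges of the context. The code below illustrates the ideas only and is not the library's implementation; the function names, scores, and the 0.5 cutoff are assumptions.

```python
def apply_similarity_cutoff(scored: list[tuple[str, float]], cutoff: float) -> list[tuple[str, float]]:
    """Drop nodes whose similarity score falls below the cutoff."""
    return [(node_id, score) for node_id, score in scored if score >= cutoff]


def reorder_for_long_context(scored: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Place the highest-scoring nodes at both ends and the weakest in the middle."""
    ascending = sorted(scored, key=lambda pair: pair[1])
    reordered: list[tuple[str, float]] = []
    for i, pair in enumerate(ascending):
        if i % 2 == 0:
            reordered.insert(0, pair)  # alternate between the front...
        else:
            reordered.append(pair)     # ...and the back, weakest first
    return reordered


scored = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.3)]
kept = apply_similarity_cutoff(scored, cutoff=0.5)      # drops "e"
ordered_ids = [node_id for node_id, _ in reorder_for_long_context(kept)]
```

After the cutoff, the reorder yields b, d, c, a: the two strongest nodes sit at the start and end of the context, where models tend to attend most reliably.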

Pass postprocessors to as_query_engine(node_postprocessors=[...]). Measure recall@k and nDCG on a golden question set; reranking often lifts answer faithfulness more than swapping the base LLM. Evaluate with patterns from our RAG evaluation guide.
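
Both metrics can be computed in a few lines once each golden question has a set of relevant node IDs. The sketch below uses binary relevance and assumes the retrieved IDs are unique; the node IDs are invented for illustration.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant nodes that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, node_id in enumerate(retrieved[:k], start=1)
        if node_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["n4", "n1", "n8", "n2"]   # hypothetical retriever output
relevant = {"n1", "n2", "n5"}          # hypothetical golden labels
recall = recall_at_k(retrieved, relevant, k=3)
ndcg = ndcg_at_k(retrieved, relevant, k=3)
```

Recall@k only asks whether relevant nodes made the cut, while nDCG also rewards placing them near the top, which is why a reranker can raise nDCG without changing recall.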

Agents, workflows, and observability

Beyond Q&A, LlamaIndex ships agents that call query engines as tools alongside APIs and calculators. OpenAIAgent.from_tools([query_engine_tool, ...]) implements multi-step reasoning: retrieve policy, compare versions, draft summary email. Newer releases favor workflow-based agents such as FunctionAgent over OpenAIAgent, so check which agent classes your installed version provides. For explicit control flow, Workflows (event-driven, async) replace opaque agent loops with testable steps: ingest event, retrieve event, synthesize event, human-review event.

LlamaCloud offers hosted parsing, indexing, and retrieval APIs when you want managed infrastructure. Self-hosted deployments should enable llama_index.core.set_global_handler("simple") or integrate OpenTelemetry/Langfuse for trace spans per retrieval and synthesis call. Log embedding model version, index schema version, and chunk counts alongside each answer for reproducibility.
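
One way to capture those reproducibility fields is a structured log line per answer. The record layout below is a hypothetical sketch: the field names, the schema version string, and the interpretation of chunk count as the number of chunks in the index are all assumptions.

```python
import json
from datetime import datetime, timezone


def build_audit_record(
    question: str,
    answer: str,
    retrieved: list[tuple[str, float]],
    *,
    embed_model: str,
    index_schema_version: str,
    index_chunk_count: int,
) -> dict:
    """Bundle an answer with the retrieval and versioning details behind it."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "retrieved": [{"node_id": node_id, "score": score} for node_id, score in retrieved],
        "embed_model": embed_model,
        "index_schema_version": index_schema_version,
        "index_chunk_count": index_chunk_count,
    }


record = build_audit_record(
    "What is the retention policy for EU customers?",
    "<answer text>",
    [("n1", 0.91), ("n4", 0.83)],      # invented node IDs and scores
    embed_model="text-embedding-3-small",
    index_schema_version="v3",         # hypothetical version label
    index_chunk_count=1200,            # hypothetical count
)
log_line = json.dumps(record, sort_keys=True)  # one JSON object per log line
```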

Expose query engines through FastAPI:

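from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # query_engine is assumed to be built once at startup
class AskRequest(BaseModel):
    question: str

@app.post("/ask")
async def ask(body: AskRequest):
    response = await query_engine.aquery(body.question)
    return {
        "answer": str(response),
        "sources": [n.metadata for n in response.source_nodes],
    }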

Rate-limit per user, enforce auth before retrieval, and never return nodes the caller is not cleared to see — metadata filters must mirror your ACL model.
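
A fail-closed clearance check might look like the sketch below. It assumes three ordered tiers (public, internal, restricted) stored under a classification metadata key; the node dictionaries and the visible_nodes helper are hypothetical. In production the same rule would normally be pushed into the vector store query so that restricted nodes are never retrieved at all; the post-filter here only illustrates the rule.

```python
CLEARANCE_RANK = {"public": 0, "internal": 1, "restricted": 2}


def visible_nodes(nodes: list[dict], user_clearance: str) -> list[dict]:
    """Keep only nodes whose classification the caller is cleared to read."""
    limit = CLEARANCE_RANK.get(user_clearance, -1)  # unknown clearance sees nothing
    allowed = []
    for node in nodes:
        rank = CLEARANCE_RANK.get(node.get("metadata", {}).get("classification"))
        if rank is not None and rank <= limit:  # unlabeled nodes stay hidden
            allowed.append(node)
    return allowed


nodes = [
    {"id": "n1", "metadata": {"classification": "public"}},
    {"id": "n2", "metadata": {"classification": "restricted"}},
    {"id": "n3", "metadata": {}},  # missing label: treated as not visible
]
internal_view = [n["id"] for n in visible_nodes(nodes, "internal")]
restricted_view = [n["id"] for n in visible_nodes(nodes, "restricted")]
unknown_view = [n["id"] for n in visible_nodes(nodes, "contractor")]
```

Hiding unlabeled nodes by default means an ingestion bug that drops the classification field results in missing answers rather than leaked ones.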

Worked example: Harbor Analytics policy knowledge base

Harbor Analytics operates in twelve jurisdictions. Its 220 internal policy PDFs (data retention, model governance, client communication) total 1,800 pages. Analysts previously searched SharePoint manually; wrong answers in client audits carried regulatory risk.

Ingestion and index design

The team runs a nightly IngestionPipeline: LlamaParse extracts tables from PDFs, SentenceSplitter chunks at 400 tokens with 80-token overlap, and text-embedding-3-small embeds into pgvector on PostgreSQL. Each node carries metadata: policy_id, jurisdiction, effective_date, and classification (public/internal/restricted). A ComposableGraph routes queries: an LLM classifier picks the jurisdiction sub-index before retrieval.

Query stack

Production queries use QueryFusionRetriever (vector + BM25, k=20), then CohereRerank(top_n=5), then a compact synthesizer with a strict citation prompt. The chat engine runs in condense_plus_context mode for follow-ups. Analysts see inline citations linking to source PDF page anchors. Restricted nodes are filtered out unless the user’s JWT carries clearance=restricted.

Operations and outcomes

Golden-set evals (180 compliance questions) run weekly: faithfulness and citation accuracy must stay above 0.92 or the deploy is blocked. P95 latency is 6.2 seconds including rerank. Human escalation handles the 9% of queries whose maximum similarity score falls below 0.72. Incremental ingestion re-processes only changed files via content-hash cache keys. Audit logs store retrieved node IDs per answer for regulator review.
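
Content-hash change detection, the idea behind the incremental ingestion step, can be sketched with the standard library. This is an illustration under assumptions: the file names and byte contents are invented, the hash store is an in-memory dict rather than a persistent cache, and files_to_reprocess is a hypothetical helper.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def files_to_reprocess(current_files: dict[str, bytes], seen_hashes: dict[str, str]) -> list[str]:
    """Return paths whose content changed since the last run, updating the hash store."""
    changed = []
    for path, data in sorted(current_files.items()):
        digest = content_hash(data)
        if seen_hashes.get(path) != digest:
            changed.append(path)
            seen_hashes[path] = digest  # remember the new version
    return changed


seen: dict[str, str] = {}
first_run = files_to_reprocess({"a.pdf": b"v1", "b.pdf": b"v1"}, seen)
second_run = files_to_reprocess({"a.pdf": b"v1", "b.pdf": b"v2"}, seen)
```

On the first run both files are new; on the second only b.pdf is returned, so parsing and embedding costs scale with what changed rather than with corpus size.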

Framework decision table

Need	Prefer	Why
Large document corpus, Q&A, citations	LlamaIndex	Indexing, retrievers, and synthesis are first-class; less boilerplate
General chains, many providers, LCEL glue	LangChain	Broader Runnable ecosystem; LangSmith native
Stateful agent graphs, human-in-the-loop	LangGraph (+ LlamaIndex retriever as tool)	Explicit checkpoints; use LlamaIndex only for retrieval layer
Single prompt, no private corpus	Raw OpenAI/Anthropic SDK	Minimal dependencies; easiest to audit
Multi-tenant SaaS search	LlamaIndex + external vector DB per tenant	Metadata filters, composable graphs, managed parsers
Graph-augmented RAG over entities	LlamaIndex KnowledgeGraphIndex or dedicated graph RAG	Built-in triple extraction; see graph RAG guide for tradeoffs

Common pitfalls

Rebuilding indexes on every deploy — persist to pgvector/Pinecone; treat the index as durable state, not ephemeral memory.
Ignoring metadata — flat vector search returns wrong jurisdiction or outdated policy; filter by effective_date.
Chunk size copy-paste — 1024-token chunks work for blogs, fail for dense legal tables; tune per corpus.
Skipping reranking — top-k vector hits often miss nuance; a cross-encoder rerank step is cheap relative to GPT-4 synthesis.
No citation enforcement — models confabulate when prompts do not require source grounding; return source_nodes to the UI.
Mixing frameworks blindly — LangChain retrievers + LlamaIndex synthesizers doubles abstraction layers; pick a primary owner.
Blocking ingestion in request path — PDF parsing belongs in batch jobs; query endpoints should only read prebuilt indexes.
Trusting retrieved text — poisoned documents are prompt injection vectors; sanitize and ACL-filter before synthesis.

Practitioner checklist

Define chunking and metadata schema before writing application code.
Run ingestion through IngestionPipeline with caching for incremental updates.
Persist vectors in a managed store (pgvector, Qdrant, Pinecone) with versioned embedding models.
Combine vector and keyword retrieval; add reranking postprocessors in production.
Enforce metadata ACL filters matching your auth system on every query.
Use condense_plus_context chat mode for multi-turn sessions over the same corpus.
Build a golden question set; block deploys on faithfulness and citation regressions.
Log retrieved node IDs, scores, and model versions for audit trails.
Expose async query endpoints; keep heavy parsing offline.
Review quarterly whether LangGraph agents should own orchestration while LlamaIndex owns retrieval only.
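
The golden-set gate from the checklist can be reduced to a small function that a CI job calls after the evaluation run. The sketch below assumes metrics arrive as a name-to-score mapping and treats a score at or below its threshold, or a missing metric, as a failure. The metric names and sample scores are illustrative; the 0.92 thresholds mirror the worked example.

```python
def deploy_gate(metrics: dict[str, float], thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (allowed, failing_metric_names) for a golden-set evaluation run."""
    failures = sorted(
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) <= minimum  # missing metrics fail closed
    )
    return (not failures, failures)


thresholds = {"faithfulness": 0.92, "citation_accuracy": 0.92}
blocked = deploy_gate({"faithfulness": 0.95, "citation_accuracy": 0.90}, thresholds)
allowed = deploy_gate({"faithfulness": 0.95, "citation_accuracy": 0.93}, thresholds)
missing = deploy_gate({"faithfulness": 0.95}, thresholds)
```

Returning the failing metric names alongside the boolean lets the CI job print why a deploy was blocked instead of only that it was.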

Key takeaways

LlamaIndex optimizes retrieval-first LLM apps: documents, nodes, indexes, and query engines in one coherent stack.
VectorStoreIndex plus hybrid fusion and reranking delivers production recall; synthesis mode trades speed for depth.
Chat engines with condense-plus-context handle follow-up questions without losing retrieval precision.
Ingestion pipelines with caching and external vector stores keep indexes fresh without full rebuilds.
Pair with LangGraph or LangChain when you need broader orchestration; own retrieval quality with evals and citations.