LLM embeddings explained: vectors, similarity search and model choice
A chat LLM predicts the next token. An embedding model maps whole passages into a single dense vector — a point in high-dimensional space where semantic similarity becomes geometric closeness. "Refund policy for damaged goods" and "how to return a broken item" may share no keywords, yet their embeddings sit near each other. That property powers retrieval-augmented generation (RAG), deduplication, clustering, recommendation, and multimodal search when image and text share a joint embedding space. This guide explains how embeddings are trained, which similarity metrics matter, how to chunk and index text for vector search, and how to evaluate models before you commit to one in production.
What an embedding actually is
Given input text (or an image, audio clip, or code snippet), an embedding model outputs a fixed-length array of floats — commonly 384, 768, 1,024, or 1,536 dimensions. Each dimension is not human-interpretable; meaning lives in the relationship between vectors. Two passages about the same topic should have a small angle between their vectors; unrelated passages should be far apart.
Embeddings are not the same as the hidden states inside a generative LLM during chat. Chat models optimize for fluent continuation; embedding models optimize for retrieval — pulling similar items together and pushing dissimilar items apart. You typically call a dedicated embedding endpoint (or run a small encoder locally), store the vectors in a database, and query by nearest neighbor — not by feeding embeddings back into a chat prompt as raw numbers.
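As a rough illustration of "store the vectors, then query by nearest neighbor", the sketch below does a brute-force search over an in-memory dictionary. The three-dimensional vectors and document IDs are invented for the example; a real system would get its vectors from an embedding model and would normally use a vector database with an approximate index instead of a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, store, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors invented for illustration; real embeddings have
# hundreds of dimensions and come from a model.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "return-broken-item": [0.8, 0.2, 0.1],
    "office-hours": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
top = nearest(query, store, k=2)
top_ids = [doc_id for doc_id, _ in top]
```

The two return-related documents come back as the top matches, and the unrelated one is left out.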
Tokenization still matters: the same tokenizer boundaries that affect LLM context windows also cap how much text a single embedding can represent. Most models truncate or pool over the first N tokens (often 512–8,192). Long documents must be split into chunks, each embedded separately.
How embedding models are trained
Modern text embedders are usually fine-tuned from transformer encoders using one or more of these objectives:
Contrastive learning
Pairs of similar texts (query + relevant passage, duplicate questions, translated sentences) are pulled together; random negatives are pushed apart. Loss functions like InfoNCE and triplet loss shape the geometry of the space. Hard-negative mining, which picks negatives that are superficially similar but semantically wrong, is often what separates mediocre embedders from strong ones.
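To make the contrastive objective concrete, here is a minimal sketch of InfoNCE with in-batch negatives. It assumes unit-length vectors and a temperature of 0.05; both are illustrative choices, not values taken from any particular model. Real training computes this on GPU tensors with automatic differentiation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, passages, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] and passages[i] form the positive pair; every other
    passage in the batch acts as a negative for query i.
    Vectors are assumed to be unit length already.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log( exp(positive) / sum(exp(all)) )
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy 2-D batch invented for illustration.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each positive matches its query
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped

loss_aligned = info_nce(queries, aligned)
loss_misaligned = info_nce(queries, misaligned)
```

The loss is close to zero when each query sits next to its own passage and large when the positives are swapped, which is the pressure that pulls matching pairs together during training.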
Matryoshka and variable dimensions
Some models (OpenAI text-embedding-3, Nomic Matryoshka variants) train so that the first k dimensions of a larger vector remain useful on their own. You can store 1,536-dim vectors for quality-critical search and 256-dim vectors for coarse pre-filtering, cutting storage and index size without retraining.
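A minimal sketch of shortening a Matryoshka-style vector by hand: keep the first k values, then rescale to unit length so cosine and dot-product scores stay comparable. The helper name is hypothetical, and the approach only makes sense for models trained with this objective; check the model card before truncating.

```python
import math

def truncate_and_normalize(vec, k):
    """Keep the first k dimensions and rescale the result to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy 4-D unit vector invented for illustration.
full = [0.6, 0.0, 0.8, 0.0]
short = truncate_and_normalize(full, k=2)
short_norm = math.sqrt(sum(x * x for x in short))
```

Skipping the rescale leaves truncated vectors with norms below 1, which distorts dot-product scores against an index that assumes unit length.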
Instruction prefixes
Models like E5 and BGE expect different input formats for queries and documents, e.g. the prefixes "query: " and "passage: ". Skipping the prefix or using the wrong one measurably hurts recall. Always read the model card; treat prefixes as part of your API contract.
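One way to treat prefixes as a contract is to route every string through a single helper so callers cannot forget them. The sketch below uses the "query: " and "passage: " prefixes mentioned above; the helper name and the PREFIXES table are hypothetical, and the exact strings for your model should come from its model card.

```python
# Prefix table for one model; replace the values with whatever the
# model card specifies (some models use no prefix on one side).
PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_prefix(text, role, prefixes=PREFIXES):
    """Prepend the role-specific prefix; reject unknown roles loudly."""
    if role not in prefixes:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(prefixes)}")
    return prefixes[role] + text.strip()

query_input = with_prefix(" how to return a broken item ", "query")
passage_input = with_prefix("Refund policy for damaged goods", "passage")
```

Failing on an unknown role is deliberate: a silent fallback to "no prefix" is exactly the kind of bug that lowers recall without raising an error.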
Similarity metrics: cosine, dot product, Euclidean
Once you have vectors, search means "find the stored vectors closest to the query vector." The metric must match how the model was trained:
- Cosine similarity — measures the angle between vectors, ignoring magnitude. Default for most text embedders. Values range from -1 to 1; higher is more similar. Normalizing vectors to unit length before indexing makes cosine equivalent to dot product, which speeds SIMD math.
- Dot product — identical to cosine when vectors are unit length; on unnormalized vectors it also rewards magnitude, and some models are trained with a dot-product loss and expect it.
- Euclidean (L2) distance — common in image embeddings and some older pipelines; less typical for modern text APIs unless the provider specifies it.
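Mixing metrics on unnormalized vectors (for example, indexing with dot product but evaluating with cosine) changes the ranking and silently degrades recall. On unit-length vectors, cosine, dot product, and L2 all produce the same ordering. Lock the metric in your vector index configuration and in your offline eval harness.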
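The relationship between the three metrics can be checked on toy data. In this sketch the 2-D vectors are invented so that raw dot product and cosine disagree; after normalization they agree, and squared L2 distance equals 2 - 2 * cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors invented for illustration.
query = [1.0, 1.0]
docs = {"short_aligned": [0.5, 0.5], "long_off_axis": [3.0, 1.0]}

def best(score, higher_is_better=True):
    """Return the doc ID that the given metric ranks first."""
    pick = max if higher_is_better else min
    return pick(docs, key=lambda doc_id: score(query, docs[doc_id]))

best_by_cosine = best(cosine)                      # angle only
best_by_raw_dot = best(dot)                        # angle and magnitude
best_by_l2 = best(l2, higher_is_better=False)      # smaller distance wins

# After normalization, dot product ranks like cosine.
unit_query = normalize(query)
unit_docs = {doc_id: normalize(vec) for doc_id, vec in docs.items()}
best_by_unit_dot = max(unit_docs, key=lambda d: dot(unit_query, unit_docs[d]))
```

Here cosine prefers the perfectly aligned short vector while raw dot product prefers the longer one, which is the disagreement that appears when index and evaluation use different metrics on unnormalized data.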
Choosing an embedding model
No single model wins every benchmark. Use your domain, language mix, latency budget, and hosting constraints as filters, then validate on your queries:
Hosted APIs
OpenAI text-embedding-3-small and -large, Cohere Embed, Voyage, and Google Gemini embedding endpoints offer strong out-of-the-box quality, automatic updates, and predictable SLAs. Cost scales with tokens embedded; batch APIs reduce price for offline indexing jobs.
Open-weight encoders
Models like bge-large-en-v1.5, e5-mistral-7b-instruct, nomic-embed-text-v1.5, and gte-Qwen2 run on your own GPU or CPU via sentence-transformers, ONNX, or llama.cpp-style runtimes. You pay for infrastructure instead of per-token API fees and keep data on-prem, which matters for regulated workloads.
Multilingual and domain-specific
English-only models underperform on mixed-language corpora. Legal, medical, and code domains often benefit from domain-tuned embedders or fine-tuning on in-house query–document pairs. A model that tops the public MTEB leaderboard can still fail on your acronyms and product names.
Dimension vs quality trade-off
Higher dimensions capture finer distinctions but multiply storage and ANN index memory. At float32, 10 million 1,536-dim vectors take about 61 GB before any index overhead, which is not trivial on pgvector; 384-dim vectors take about 15 GB and may suffice for FAQ-scale retrieval. Benchmark at the dimension you plan to ship.
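The storage side of this trade-off is simple arithmetic: rows times dimensions times bytes per value. The sketch assumes float32 (4 bytes) and counts raw vectors only; ANN index structures such as HNSW graphs add overhead on top, which is not modeled here.

```python
def raw_vector_bytes(rows, dims, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return rows * dims * bytes_per_value

for dims in (1536, 384):
    gb = raw_vector_bytes(10_000_000, dims) / 1e9
    print(f"{dims} dims: {gb:.2f} GB")
```

Halving the precision (for example `bytes_per_value=2` for float16) halves the total in the same way that reducing dimensions does.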
Chunking and indexing for RAG
Embeddings represent chunks, not whole knowledge bases. Chunking strategy often matters more than model choice for end-to-end RAG quality:
- Fixed token windows (256–512 tokens with 10–20% overlap) — simple, works for uniform prose. Overlap prevents answers from sitting on a boundary cut.
- Structure-aware splits — respect markdown headings, HTML sections, PDF page breaks, or code function boundaries. Preserves semantic units better than blind token splits.
- Parent–child indexing — embed small chunks for precise retrieval but return a larger parent span to the LLM for context. Balances recall and context completeness.
- Metadata sidecars — store source URL, section title, date, and access tier alongside each vector. Filter before vector search to avoid retrieving stale or unauthorized chunks.
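The fixed-window strategy from the list above can be sketched in a few lines. This version assumes the text is already tokenized; the usage example splits on whitespace as a stand-in, whereas a real pipeline should count tokens with the embedding model's own tokenizer. The function name and defaults are illustrative.

```python
def chunk_tokens(tokens, size=256, overlap=32):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

# Whitespace split is a stand-in for a real tokenizer.
text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text.split(), size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk starts with the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one window when it is shorter than the overlap.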
Re-embed when you change models. Vectors from different embedders are not comparable — mixing them in one index destroys search quality.
Hybrid retrieval: when vectors alone fail
Pure semantic search misses exact matches: SKUs, error codes, person names, legal citations. Hybrid search combines BM25 (or another keyword index) with vector similarity, then fuses rankings with reciprocal rank fusion (RRF) or a learned cross-encoder reranker.
A practical pipeline: (1) hybrid retrieve top 50–100 candidates, (2) rerank with a cross-encoder or lightweight LLM scorer down to 5–10, (3) pass those chunks to the generator. Step 2 fixes cases where the bi-encoder embedding was too coarse. Latency increases, but answer quality on enterprise support and legal corpora usually jumps.
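Reciprocal rank fusion itself is small enough to sketch directly: each document scores 1 / (k + rank) in every ranking where it appears, and the scores are summed. The constant k = 60 is a commonly used default, assumed here; the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Toy rankings invented for illustration.
bm25_ranking = ["sku-123", "a", "b"]     # keyword index finds the exact SKU
vector_ranking = ["a", "c", "sku-123"]   # semantic index ranks it lower
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because RRF uses only ranks, it needs no score calibration between BM25 and vector similarity, which is why it is a convenient first fusion step before a reranker.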
Evaluating embedding quality
Public leaderboards (MTEB, BEIR) provide coarse rankings across retrieval, clustering, and classification tasks. Treat them as a shortlist, not a contract. Build a golden set of 100–500 real user queries with human-labeled relevant documents from your corpus and measure:
- Recall@k — did the right document appear in the top k results?
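- MRR and nDCG — MRR scores how high the first relevant document ranks; nDCG scores the ordering of all relevant documents, weighting top positions most.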
- Latency p95 — embed query + ANN search under production load.
A/B test model swaps on live traffic when possible: downstream RAG answer correctness (human or LLM-judged) is the metric that ultimately matters, not cosine scores in isolation.
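The offline metrics above are straightforward to compute once a golden set exists. This sketch assumes binary relevance labels (a document is relevant or not) and shows the per-query values; average them over all golden queries to get the reported numbers. The document IDs are invented.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# One golden query: retrieved order and human-labeled relevant set.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall3 = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg4 = ndcg_at_k(ranked, relevant, k=4)
```

MRR is the mean of `reciprocal_rank` across queries. Keeping these functions in the same harness that fixes the similarity metric helps ensure model comparisons are like for like.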
Common pitfalls
- Embedding the wrong thing — summarizing a page before embedding loses detail; embedding raw HTML with nav boilerplate adds noise.
- Query/document asymmetry — using the same prefix for questions and passages on instruction-tuned models.
- Stale indexes — content updates without re-embedding leave ghost chunks that contradict the live source.
- Normalization bugs — normalizing documents but not queries (or the reverse), skipping re-normalization after truncating dimensions, or mixing a float32 index with float16 query vectors.
- Security — retrieved chunks can carry prompt-injection payloads; treat them as untrusted input to the generator.
Production checklist
- Pick metric (cosine vs dot) to match the model card; normalize consistently.
- Apply correct query/passage prefixes for instruction-tuned embedders.
- Define chunk size, overlap, and structure-aware splits on a sample corpus.
- Version embedding model ID in index metadata; plan full re-index on model change.
- Run hybrid retrieval if your queries include IDs, codes, or rare proper nouns.
- Measure Recall@10 on a golden query set before shipping.
- Monitor embed latency, index size, and retrieval p95 separately from generation.
- Sanitize retrieved text before it reaches the LLM context window.
Key takeaways
- Embeddings compress meaning into dense vectors for similarity search, distinct from generative LLM hidden states.
- Contrastive training and instruction prefixes shape retrieval quality; follow each model's input format.
- Cosine similarity is the default metric; keep it consistent across index, query, and eval.
- Chunking and hybrid search often beat swapping to a marginally better embedder.
- Evaluate on your queries with Recall@k and downstream RAG correctness, not leaderboard rank alone.