
vLLM fundamentals explained

Your prototype runs fine on Ollama until marketing launches a public demo and fifty users hit send at once — GPU utilization flatlines, queues balloon, and p95 latency crosses ten seconds. vLLM is an open-source LLM inference engine built for exactly that wall: it combines PagedAttention (virtual-memory-style KV cache management), continuous batching (requests join and leave batches mid-flight), and optimized CUDA kernels so a single A100 can serve far more concurrent chats than naive Hugging Face generate() loops. It exposes an OpenAI-compatible HTTP server, supports tensor and pipeline parallelism across GPUs, and powers many production stacks behind LiteLLM gateways. This guide covers what vLLM is, core memory and scheduling concepts, installation and the API server, parallelism modes, offline batch inference, a Harbor Analytics chat gateway worked example, an inference-stack decision table, common pitfalls, and a production checklist. Pair it with KV cache fundamentals and FlashAttention for the theory behind the speedups.

What vLLM is (and is not)

vLLM is a high-throughput inference runtime for transformer language models. You point it at Hugging Face weights (or a local checkpoint), start vllm serve meta-llama/Llama-3.1-8B-Instruct, and clients call /v1/chat/completions the same way they would OpenAI. Under the hood, vLLM schedules decode steps across all active sequences, reclaims KV cache pages when sessions end, and fuses attention with custom kernels. In its launch benchmarks the vLLM project reported up to 24× higher throughput than Hugging Face Transformers on the same hardware; the gain over any static-batching baseline depends on the workload mix.
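
The scheduling idea can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not vLLM's scheduler: every decode step costs the same, prefill is ignored, and the output lengths and the batch size of 2 are made-up values. It only counts how many decode steps a static batch (which waits for its slowest member) needs compared with a batch that refills freed slots immediately.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a waiting request takes a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []  # remaining tokens to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit new requests into free slots between decode steps.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every active sequence;
        # sequences that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# Hypothetical output lengths: short answers mixed with long ones.
lengths = [8, 512, 16, 480, 12, 500, 10, 490]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With this mix the short requests no longer hold a slot idle while a long neighbor finishes, which is where the step savings come from. Real gains also depend on memory management and kernels, which the toy model leaves out.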

It is not a model trainer, a desktop chat app, or a multi-vendor API router. Fine-tuning still happens in PyTorch or TRL; routing across OpenAI, Anthropic, and local endpoints belongs in LiteLLM. vLLM is the engine inside your VPC when you own the GPU and need many simultaneous users on one model revision. For a laptop prototype, Ollama remains simpler; graduate to vLLM when metrics show queue depth and GPU memory fragmentation, not model quality, are the bottleneck.

Core concepts

  • PagedAttention — KV cache stored in fixed-size blocks (like OS pages); sequences of different lengths share GPU memory without pre-allocating max context per request.
  • Continuous batching — new requests enter the running batch between decode steps; finished sequences exit without waiting for the slowest member of a static batch.
  • Prefill vs decode — prefill processes the prompt in one parallel pass; decode generates one token at a time per sequence. Scheduling balances both phases (see model serving).
  • Tensor parallelism (TP) — shards weight matrices across GPUs; required for 70B+ models that do not fit on one card.
  • Pipeline parallelism (PP) — layers split across GPUs; trades latency for capacity on very deep models.
  • Quantization — AWQ, GPTQ, and FP8 kernels reduce VRAM and raise tokens/sec; quality tradeoffs vary by task (see LLM quantization).
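
The block-based bookkeeping behind PagedAttention can be sketched in a few lines. This is a toy illustration, not vLLM's implementation: the class name PagedKVCache, its methods, and the block size of 16 tokens are assumptions chosen for the example, and no attention math is performed. It shows only that a sequence takes a new block when its last one fills up, and that finished sequences hand their blocks back.

```python
import math

class PagedKVCache:
    """Toy block table: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_tokens(self, seq_id, n_tokens):
        """Grow a sequence, allocating blocks only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            # A real scheduler would queue or preempt instead of failing.
            raise MemoryError("no free KV blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
cache.append_tokens("chat-a", 40)  # 40-token prompt needs 3 blocks
cache.append_tokens("chat-b", 10)  # short prompt needs 1 block
cache.append_tokens("chat-a", 1)   # one decoded token still fits in block 3
print(len(cache.block_tables["chat-a"]), len(cache.free_blocks))
cache.free("chat-a")
print(len(cache.free_blocks))
```

Because blocks are handed out on demand, a 40-token chat and a 10-token chat together occupy four blocks rather than two full-context reservations.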

Installation and the OpenAI-compatible server

vLLM targets Linux with NVIDIA CUDA (AMD ROCm builds exist). Install with pip in a fresh virtualenv — CUDA version must match your driver:

pip install vllm

# Start server (downloads weights from Hugging Face on first run)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Key flags:

  • --max-model-len — caps the context window (prompt plus generated tokens); lower values bound how much KV cache memory any one sequence can consume, which raises concurrent capacity.
  • --gpu-memory-utilization — fraction of VRAM vLLM may claim; leave headroom if you colocate embeddings or vision towers.
  • --tensor-parallel-size N — spread one model across N GPUs (e.g. 2× A100 for 70B).
  • --quantization awq / gptq — load pre-quantized checkpoints for memory-constrained hosts.
  • --enable-prefix-caching — reuse KV blocks for identical prompt prefixes (RAG system prompts, few-shot templates).

Client integration

Point the OpenAI Python SDK at your server:

from openai import OpenAI

client = OpenAI(base_url="http://gpu-box:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    temperature=0.2,
    max_tokens=256,
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

Streaming uses Server-Sent Events; disable reverse-proxy buffering on the path. For embeddings, run a separate small model or use vLLM’s embedding entrypoints when your version supports them — many teams keep RAG embedders on CPU to preserve GPU cycles for the main LLM.
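
For a dependency-free check of the non-streaming path, the same route can be called with the standard library. This is a minimal sketch: the helper names build_chat_request and smoke_test are hypothetical, the host gpu-box comes from the example above, and the response shape is assumed to follow the OpenAI chat completions format. The network call is defined but not executed here.

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=64):
    """Build a non-streaming POST to the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def smoke_test(base_url, model, timeout=30):
    """Send one request and return the reply text; raises on HTTP or parse errors."""
    req = build_chat_request(base_url, model, "Reply with the word ok.")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Inspect the request without touching the network.
req = build_chat_request(
    "http://gpu-box:8000/v1", "meta-llama/Llama-3.1-8B-Instruct", "ping"
)
print(req.get_method(), req.full_url)
```

Calling smoke_test("http://gpu-box:8000/v1", "meta-llama/Llama-3.1-8B-Instruct") against a running server would exercise the full round trip.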

Offline batch inference

Not every workload is an HTTP API. vLLM’s LLM Python class runs high-throughput offline jobs — nightly summarization, eval harnesses, synthetic data generation for distillation:

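from vllm import LLM, SamplingParams

# Example inputs; in a real job these would come from a file or database.
ticket_batch = [
    "The export button has been broken for two days.",
    "Thanks, the new dashboard loads much faster.",
]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
# temperature=0 gives greedy decoding, so labels are repeatable across runs.
params = SamplingParams(temperature=0, max_tokens=512)

prompts = [f"Classify sentiment: {text}" for text in ticket_batch]
# generate() batches all prompts internally and returns results in input order.
outputs = llm.generate(prompts, params)
labels = [o.outputs[0].text.strip() for o in outputs]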

Offline mode shares the same PagedAttention scheduler; batch thousands of prompts without standing up HTTP. Use this for backfills; expose HTTP only for interactive traffic that cannot wait for a batch window.

GPU planning and parallelism

VRAM needs scale with model parameters, precision, max context, and concurrent sequences. Rough planning for FP16 8B instruct models on a 24 GB GPU:

  • Weights — ~16 GB for 8B FP16; AWQ 4-bit cuts this to roughly 5–6 GB.
  • KV cache — dominant at long context and high concurrency; use --max-model-len and monitor gpu_cache_usage_perc metrics.
  • Concurrent sequences — vLLM packs many short chats; with 4-bit weights, one 24 GB card can often hold 32–64 active 2K-context sessions for 8B models, while FP16 weights leave room for far fewer.
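
The arithmetic behind these estimates can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128) matches Llama-3.1-8B; the 1.5 GB runtime overhead, the 6 GB figure for 4-bit weights, and the assumption that every session holds a full 2,048 tokens are rough planning guesses, and GB and GiB are treated as interchangeable.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of keys and values stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_concurrent_sequences(vram_gb, gpu_mem_util, weights_gb, overhead_gb,
                             tokens_per_seq, bytes_per_token):
    """Full-length sequences that fit after weights and overhead are subtracted."""
    budget_bytes = (vram_gb * gpu_mem_util - weights_gb - overhead_gb) * 1024**3
    return max(0, int(budget_bytes // (tokens_per_seq * bytes_per_token)))

# Llama-3.1-8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB of KV cache per token")

for label, weights_gb in [("FP16 weights", 16), ("4-bit weights", 6)]:
    n = max_concurrent_sequences(
        vram_gb=24, gpu_mem_util=0.90, weights_gb=weights_gb,
        overhead_gb=1.5,          # assumed activation and runtime overhead
        tokens_per_seq=2048, bytes_per_token=per_token,
    )
    print(f"{label}: about {n} full 2K-token sequences")
```

Under these assumptions a 24 GB card holds far fewer worst-case sessions with FP16 weights than with 4-bit weights. Real sessions rarely all sit at full length, so measured concurrency is usually higher than this floor.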

For 70B models, set --tensor-parallel-size 4 on four 80 GB H100s (or 2× A100 80GB with aggressive quantization). Pipeline parallelism helps when the model does not fit on one node's GPUs even with tensor parallelism: it splits the layer stack into stages, typically across nodes. Always run a soak test with production-shaped prompts, because marketing copy and JSON extraction stress the KV cache differently than haiku demos.
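
A quick way to sanity-check a tensor-parallel layout is to divide the weight footprint by the number of GPUs. The sketch below assumes weights shard evenly and ignores quantization scales, replicated embeddings, and activation memory, so treat the output as a lower bound on per-GPU usage. The helper name is hypothetical.

```python
def weights_gb_per_gpu(params_billion, bytes_per_param, tensor_parallel_size):
    """Approximate weight memory per GPU when matrices are sharded evenly."""
    total_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    return total_gb / tensor_parallel_size

layouts = [
    ("70B FP16, TP=4", 2.0, 4, 80),
    ("70B 4-bit, TP=2", 0.5, 2, 80),
]
for label, bytes_per_param, tp, vram_gb in layouts:
    per_gpu = weights_gb_per_gpu(70, bytes_per_param, tp)
    print(f"{label}: {per_gpu:.1f} GB weights per GPU, "
          f"{vram_gb - per_gpu:.1f} GB left for KV cache and overhead")
```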

Observability

  • Prometheus metrics on /metrics — queue length, cache usage, tokens/sec.
  • Log request_id and time-to-first-token (TTFT) separately from inter-token latency.
  • Alert when pending queue > N for 60s — scale replicas before users time out.
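
The queue alert above can be prototyped before wiring up a full alerting stack. This is a sketch: the metric names in the sample text (vllm:num_requests_waiting and friends) are assumptions that vary across vLLM versions, the sample values are invented, and the parser ignores labels and optional timestamps.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}, dropping labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values

def queue_alert(samples, threshold, window_s=60):
    """True when every sample in the trailing window exceeds the threshold.

    samples is a list of (unix_seconds, queue_depth) pairs, oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < window_s:
        return False  # not enough history to cover the window
    recent = [depth for ts, depth in samples if latest - ts <= window_s]
    return all(depth > threshold for depth in recent)

# Illustrative scrape body, not captured from a real server.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 24.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.83
"""
metrics = parse_metrics(SAMPLE)
print(metrics["vllm:num_requests_waiting"])
print(queue_alert([(0, 12), (30, 15), (60, 11)], threshold=10))
```

In production the same rule is usually expressed in the monitoring system itself; the Python version is mainly useful for soak-test scripts.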

Harbor Analytics chat gateway (worked example)

Harbor Analytics shipped an internal “ask your metrics” chat widget to 400 analysts. Prototype Ollama on a T4 handled five users; launch week peaked at 120 concurrent sessions with 4K-token SQL context snippets. Migration to vLLM on two L40S GPUs (48 GB each):

  1. Model — meta-llama/Llama-3.1-8B-Instruct AWQ 4-bit; --max-model-len 6144, chosen after profiling the 95th-percentile prompt length.
  2. Gateway — LiteLLM proxy in front with API keys, per-team rate limits, and fallback to GPT-4o for queries tagged complex_reasoning.
  3. Prefix caching — shared schema DDL and metric definitions cached via --enable-prefix-caching; TTFT dropped 38% on repeat dashboards.
  4. Autoscaling — Kubernetes HPA on vLLM queue depth; second replica warms during business hours only.
  5. Results — p95 latency 1.1s (was 8.4s on Ollama under load); GPU utilization 72% vs 28%; cloud fallback share fell from 40% to 12% after AWQ quality audit passed.
  6. Ops — model weights baked into AMI; config in git; weekly canary prompt suite gates rollouts.

The team kept Ollama on engineer laptops for ad-hoc prompts. Production traffic never mixed dev and serve tiers — a common mistake that starves paying users when someone pulls a 70B experiment on the shared box.
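
The routing policy from step 2 can be expressed as a small decision function. This is a hypothetical sketch, not LiteLLM configuration: the ChatRequest type, the backend names, and the queue-overflow rule are assumptions added for illustration. Only the complex_reasoning escalation comes from the example.

```python
from dataclasses import dataclass, field

@dataclass
class ChatRequest:
    team: str
    prompt: str
    tags: set = field(default_factory=set)

def choose_backend(request, local_queue_depth, max_local_queue=32):
    """Pick a backend name for one request under the example's routing policy."""
    if "complex_reasoning" in request.tags:
        return "cloud"  # tagged queries escalate to the hosted model
    if local_queue_depth > max_local_queue:
        return "cloud"  # assumed overflow rule, not stated in the example
    return "local-vllm"

simple = ChatRequest(team="finance", prompt="Weekly revenue by region?")
hard = ChatRequest(team="finance", prompt="Explain the churn anomaly.",
                   tags={"complex_reasoning"})
print(choose_backend(simple, local_queue_depth=5))
print(choose_backend(hard, local_queue_depth=5))
```

Keeping the decision in one pure function makes it easy to unit-test the policy separately from whichever gateway enforces it.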

Inference stack decision table

| Need | vLLM | Ollama | TGI (HF) | Cloud API |
|---|---|---|---|---|
| Laptop / single-dev prototyping | Heavy | Excellent | Moderate | Instant |
| Many concurrent chat sessions | Excellent | Moderate | Good | Excellent |
| OpenAI-compatible HTTP API | Yes | Yes | Yes | Native |
| Multi-GPU tensor parallelism | Native | Limited | Yes | N/A |
| Prefix / prompt caching | Yes | Limited | Varies | Vendor-specific |
| Offline batch throughput | Excellent | Moderate | Good | Rate-limited |
| Fastest time-to-first-demo | Slow | Fast | Moderate | Fastest |
| Frontier model without self-hosting | No | No | No | Yes |

Choose vLLM when GPU saturation and queue depth are the problem and you control the hardware. Stay on Ollama for solo dev and light internal tools. Use cloud APIs for frontier reasoning and as the escape hatch in cascade routers.

Common pitfalls

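  • Max context set to model maximum — vLLM must be able to fit at least one full-length sequence in the KV cache, so a 128K cap can fail at startup or let a few long requests crowd out everyone else when the median prompt is 2K; cap --max-model-len to real usage.
  • No gateway auth — vLLM's built-in protection is limited to a single shared key (--api-key); it has no per-user keys, rate limits, or budgets, and exposing port 8000 publicly invites crypto miners and data exfiltration.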
  • Ignoring prefill spikes — huge prompts stall decode for everyone; chunk RAG context or use separate prefill-heavy and chat-light pools.
  • Quantization without eval — AWQ can crush JSON extraction accuracy; run task-specific benchmarks before cutover.
  • Single replica, no queue SLO — one GPU restart drops all sessions; run N+1 replicas behind a load balancer.
  • Mixing dev and prod on one daemon — experimental model pulls evict production weights from VRAM.
  • Streaming through buffering proxies — nginx default buffering breaks SSE; set proxy_buffering off.
  • Wrong chat template — Hugging Face models need matching tokenizer templates; garbled outputs often mean template mismatch, not “bad weights.”
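
For the first pitfall, a context cap can be derived from logged prompt sizes rather than guessed. The sketch below assumes token counts per prompt have already been collected; the sample numbers, the 512-token generation budget, and the rounding to multiples of 1,024 are illustrative choices, and the helper names are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_max_model_len(prompt_tokens, max_new_tokens, pct=95, round_to=1024):
    """Cover the chosen prompt percentile plus the generation budget, rounded up."""
    needed = percentile(prompt_tokens, pct) + max_new_tokens
    return math.ceil(needed / round_to) * round_to

# Made-up token counts standing in for a day of production prompts.
prompt_tokens = [900, 1200, 1500, 1800, 2100, 2400, 3000, 3600, 4200, 5200]
print(percentile(prompt_tokens, 50), percentile(prompt_tokens, 95))
print(suggest_max_model_len(prompt_tokens, max_new_tokens=512))
```

Prompts above the chosen cap still need a plan, such as truncation, summarization, or a clear error returned to the caller.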

Production checklist

  • Profile production prompt length percentiles before choosing --max-model-len.
  • Install vLLM with CUDA version matching host drivers; pin package version in Docker.
  • Smoke-test /v1/chat/completions with streaming and non-streaming clients.
  • Enable prefix caching if many requests share system or RAG headers.
  • Configure tensor parallelism for models that exceed single-GPU VRAM.
  • Put LiteLLM or similar gateway in front for keys, budgets, and cloud fallback.
  • Scrape Prometheus metrics; alert on queue depth and cache usage.
  • Run soak test at 2× expected peak concurrent sessions.
  • Bake weights into images or use local HF cache volumes — avoid cold-start pulls.
  • Document model revision, quant method, and max context in runbooks.
  • Schedule canary eval suite on every model or config change.

Key takeaways

  • vLLM optimizes multi-user LLM serving via PagedAttention and continuous batching.
  • The OpenAI-compatible server drops into existing SDKs and LiteLLM gateways with minimal code change.
  • VRAM planning must account for concurrent sequences and KV cache, not weights alone.
  • Graduate from Ollama when queues and GPU memory fragmentation appear under real concurrency.
  • Pair self-hosted vLLM with gateway routing and cloud escalation — local throughput and frontier quality are complementary tiers.
