vLLM fundamentals explained
Your prototype runs fine on Ollama until marketing launches a public demo and fifty users hit send at once: GPU utilization flatlines, queues balloon, and p95 latency crosses ten seconds. vLLM is an open-source LLM inference engine built for exactly that wall. It combines PagedAttention (virtual-memory-style KV cache management), continuous batching (requests join and leave batches mid-flight), and optimized CUDA kernels so a single A100 can serve far more concurrent chats than naive Hugging Face generate() loops. It exposes an OpenAI-compatible HTTP server, supports tensor and pipeline parallelism across GPUs, and powers many production stacks behind LiteLLM gateways.
This guide covers what vLLM is, core memory and scheduling concepts, installation and the API server, parallelism modes, offline batch inference, a Harbor Analytics chat gateway worked example, an inference-stack decision table, common pitfalls, and a production checklist. Pair it with KV cache fundamentals and FlashAttention for the theory behind the speedups.
What vLLM is (and is not)
vLLM is a high-throughput inference runtime for transformer language models. You point it at Hugging Face weights (or a local checkpoint), start vllm serve meta-llama/Llama-3.1-8B-Instruct, and clients call /v1/chat/completions the same way they would OpenAI. Under the hood, vLLM schedules decode steps across all active sequences, reclaims KV cache pages when sessions end, and fuses attention with custom kernels. The vLLM project's launch benchmarks reported up to 24× higher throughput than Hugging Face Transformers on the same hardware; actual gains depend on model, prompt mix, and concurrency.
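The scheduling difference can be illustrated with a toy step counter. This is a minimal sketch, not vLLM's scheduler: it ignores prefill, treats every decode step as equal cost, and the function names and the sample output lengths are made up for illustration.

```python
def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must wait for its longest sequence."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a free slot is refilled as soon as a sequence finishes."""
    pending = list(lengths)   # output lengths of requests not yet admitted
    running = []              # remaining tokens for each in-flight sequence
    steps = 0
    while pending or running:
        # Admit waiting requests into any free slots before the next step.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        # One decode step emits one token for every running sequence.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [2, 8, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 16 10
```

In this example the short requests no longer wait behind the two long ones, so the same work finishes in fewer steps. The size of the gap depends entirely on how uneven the output lengths are.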
It is not a model trainer, a desktop chat app, or a multi-vendor API router. Fine-tuning still happens in PyTorch or TRL; routing across OpenAI, Anthropic, and local endpoints belongs in LiteLLM. vLLM is the engine inside your VPC when you own the GPU and need many simultaneous users on one model revision. For a laptop prototype, Ollama remains simpler; graduate to vLLM when metrics show queue depth and GPU memory fragmentation, not model quality, are the bottleneck.
Core concepts
- PagedAttention — KV cache stored in fixed-size blocks (like OS pages); sequences of different lengths share GPU memory without pre-allocating max context per request.
- Continuous batching — new requests enter the running batch between decode steps; finished sequences exit without waiting for the slowest member of a static batch.
- Prefill vs decode — prefill processes the prompt in one parallel pass; decode generates one token at a time per sequence. Scheduling balances both phases (see model serving).
- Tensor parallelism (TP) — shards weight matrices across GPUs; required for 70B+ models that do not fit on one card.
- Pipeline parallelism (PP) — layers split across GPUs; trades latency for capacity on very deep models.
- Quantization — AWQ, GPTQ, and FP8 kernels reduce VRAM and raise tokens/sec; quality tradeoffs vary by task (see LLM quantization).
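The PagedAttention idea from the list above can be sketched as a block-table allocator. This is a toy model under stated assumptions, not vLLM's implementation: the class name, the 16-token block size, and the pool of 8 blocks are illustrative, and it tracks only block ids, not the key/value tensors themselves.

```python
class PagedKVAllocator:
    """Toy block-table allocator; a real engine also stores tensors per block."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:   # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a scheduler would queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):          # 40 tokens need ceil(40 / 16) = 3 blocks
    alloc.append_token("a")
for _ in range(5):           # 5 tokens need 1 block
    alloc.append_token("b")
blocks_in_use = 8 - len(alloc.free_blocks)
blocks_for_a = len(alloc.block_tables["a"])
alloc.free("a")              # sequence "a" finishes; its blocks are reusable
print(blocks_in_use, blocks_for_a, len(alloc.free_blocks))  # 4 3 7
```

Memory is claimed one block at a time as a sequence grows, so a short chat never reserves space for the maximum context, and waste is bounded by one partly filled block per sequence.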
Installation and the OpenAI-compatible server
vLLM targets Linux with NVIDIA CUDA (AMD ROCm builds exist). Install with pip in a fresh virtualenv; the CUDA version the vLLM wheel was built against must be supported by your NVIDIA driver:
pip install vllm
# Start server (downloads weights from Hugging Face on first run)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
Key flags:
- --max-model-len: caps context window; lower values save KV cache RAM and raise concurrent capacity.
- --gpu-memory-utilization: fraction of VRAM vLLM may claim; leave headroom if you colocate embeddings or vision towers.
- --tensor-parallel-size N: spread one model across N GPUs (e.g. 2× A100 for 70B).
- --quantization awq/gptq: load pre-quantized checkpoints for memory-constrained hosts.
- --enable-prefix-caching: reuse KV blocks for identical prompt prefixes (RAG system prompts, few-shot templates).
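If the launch command lives in code rather than a shell script, a small builder keeps the flags above in one reviewed place. The sketch below is an assumption-laden helper, not part of vLLM: build_serve_command and its keyword names are hypothetical, and it only emits the flags already listed in this section.

```python
import shlex


def build_serve_command(model, *, host=None, port=None, max_model_len=None,
                        gpu_memory_utilization=None, tensor_parallel_size=None,
                        quantization=None, enable_prefix_caching=False):
    """Assemble a `vllm serve` argv list from the flags discussed above."""
    if gpu_memory_utilization is not None and not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization is a fraction in (0, 1]")
    argv = ["vllm", "serve", model]
    valued_flags = {
        "--host": host,
        "--port": port,
        "--max-model-len": max_model_len,
        "--gpu-memory-utilization": gpu_memory_utilization,
        "--tensor-parallel-size": tensor_parallel_size,
        "--quantization": quantization,
    }
    for flag, value in valued_flags.items():
        if value is not None:            # omit flags left at their defaults
            argv += [flag, str(value)]
    if enable_prefix_caching:            # boolean switch takes no value
        argv.append("--enable-prefix-caching")
    return argv


argv = build_serve_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,
)
print(shlex.join(argv))
```

The returned list can be passed to subprocess.run or rendered into a container command; keeping it as a list avoids shell-quoting mistakes.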
Client integration
Point the OpenAI Python SDK at your server:
from openai import OpenAI

# The SDK requires a non-empty key; the value here is a placeholder.
client = OpenAI(base_url="http://gpu-box:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    temperature=0.2,
    max_tokens=256,
    stream=True,
)
# Each streamed chunk carries an incremental delta; content may be None.
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
Streaming uses Server-Sent Events; disable reverse-proxy buffering on the path. For embeddings, run a separate small model or use vLLM’s embedding entrypoints when your version supports them — many teams keep RAG embedders on CPU to preserve GPU cycles for the main LLM.
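When streaming, it is useful to time the first chunk separately from the gaps between later chunks. The helper below is a minimal sketch: measure_stream is a hypothetical name, it works on any iterable of text pieces, and it treats each non-empty chunk as roughly one token, which is an approximation. The fake clock exists only to make the example deterministic.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume text pieces; return (text, ttft_seconds, mean_gap_seconds)."""
    start = clock()
    stamps = []
    parts = []
    for piece in chunks:
        if not piece:              # skip empty deltas such as role-only chunks
            continue
        stamps.append(clock())
        parts.append(piece)
    if not stamps:
        return "", None, None
    ttft = stamps[0] - start       # time to first non-empty chunk
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return "".join(parts), ttft, mean_gap


# Deterministic demo with a fake clock instead of a live server.
fake_ticks = iter([0.0, 0.5, 0.6, 0.8])
text, ttft, gap = measure_stream(["Hi", "", " there", "!"],
                                 clock=lambda: next(fake_ticks))
print(text, ttft, gap)
```

With the SDK client shown earlier, the same function could be fed a generator such as (c.choices[0].delta.content for c in resp), using the default clock.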
Offline batch inference
Not every workload is an HTTP API. vLLM’s LLM Python class runs high-throughput offline jobs such as nightly summarization, eval harnesses, and synthetic data generation for distillation:
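from vllm import LLM, SamplingParams

# Placeholder input; in practice load ticket_batch from your own data store.
ticket_batch = ["The export button is broken again.", "Love the new dashboard!"]
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0, max_tokens=512)  # greedy decoding
prompts = [f"Classify sentiment: {text}" for text in ticket_batch]
outputs = llm.generate(prompts, params)  # one result object per prompt
labels = [o.outputs[0].text.strip() for o in outputs]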
Offline mode shares the same PagedAttention scheduler; batch thousands of prompts without standing up HTTP. Use this for backfills; expose HTTP only for interactive traffic that cannot wait for a batch window.
GPU planning and parallelism
VRAM needs scale with model parameters, precision, max context, and concurrent sequences. Rough planning for FP16 8B instruct models on a 24 GB GPU:
- Weights: ~16 GB for 8B FP16; AWQ 4-bit cuts this to roughly a third.
- KV cache: dominant at long context and high concurrency; use --max-model-len and monitor gpu_cache_usage_perc metrics.
- Concurrent sequences: vLLM packs many short chats; one 24 GB card often handles 32–64 active 2K-context sessions for 8B models.
For 70B models, set --tensor-parallel-size 4 on four 80 GB H100s (or 2× A100 80GB with aggressive quantization). Pipeline parallelism helps when the model's layers do not all fit on the GPUs of a single node, so stages must be split across nodes. Always run a soak test with production-shaped prompts: marketing copy and JSON extraction stress KV cache differently than haiku demos.
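The planning arithmetic above can be made explicit. The sketch below is a back-of-the-envelope estimate under loud assumptions: the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, the weight sizes in GiB, and the 1 GiB runtime overhead are all assumed values, and the function names are hypothetical. It counts sessions that fill their whole context, so it is a conservative bound; real sessions are usually shorter.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_full_sessions(vram_gib, gpu_memory_utilization, weights_gib,
                      overhead_gib, context_tokens, per_token_bytes):
    """Sessions that can each hold a full context in the remaining KV budget."""
    budget_gib = vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    budget_bytes = max(budget_gib, 0) * 1024**3
    return int(budget_bytes // (context_tokens * per_token_bytes))


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed sizes: ~15 GiB FP16 weights, ~5.5 GiB AWQ 4-bit weights, 1 GiB overhead.
fp16_sessions = max_full_sessions(24, 0.90, weights_gib=15.0, overhead_gib=1.0,
                                  context_tokens=2048, per_token_bytes=per_token)
awq_sessions = max_full_sessions(24, 0.90, weights_gib=5.5, overhead_gib=1.0,
                                 context_tokens=2048, per_token_bytes=per_token)
print(per_token, fp16_sessions, awq_sessions)  # 131072 22 60
```

Under these assumptions each token costs 128 KiB of cache, which is why trimming max context or quantizing weights moves concurrency so much. Treat the numbers as a starting point for a soak test, not a capacity guarantee.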
Observability
- Prometheus metrics on /metrics: queue length, cache usage, tokens/sec.
- Log request_id and time-to-first-token (TTFT) separately from inter-token latency.
- Alert when pending queue > N for 60s; scale replicas before users time out.
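The "pending queue > N for 60s" rule from the list above is a sustained-threshold check. In production this would normally be a Prometheus alerting rule; the Python below is only a sketch of the logic, with a hypothetical function name and made-up sample data.

```python
def queue_alert(samples, threshold, window_s=60.0):
    """True if queue depth stayed above `threshold` for at least `window_s` seconds.

    `samples` is an iterable of (timestamp_seconds, queue_depth), oldest first.
    """
    breach_start = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = ts          # first sample of a new breach
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None            # any dip resets the timer
    return False


# Hypothetical scrapes every 15 s: the queue stays above 10 from t=15 to t=75.
sustained = [(0, 3), (15, 12), (30, 14), (45, 11), (60, 13), (75, 15)]
# The same threshold with a dip at t=30 never accumulates a full 60 s breach.
spiky = [(0, 12), (30, 4), (60, 12), (90, 12)]
print(queue_alert(sustained, threshold=10), queue_alert(spiky, threshold=10))  # True False
```

Requiring the breach to persist filters out short bursts that continuous batching absorbs on its own, so the alert fires only when extra replicas would actually help.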
Harbor Analytics chat gateway (worked example)
Harbor Analytics shipped an internal “ask your metrics” chat widget to 400 analysts. The Ollama prototype on a T4 handled five users; launch week peaked at 120 concurrent sessions with 4K-token SQL context snippets. The team migrated to vLLM on two L40S GPUs (48 GB each):
- Model: meta-llama/Llama-3.1-8B-Instruct AWQ 4-bit; --max-model-len 6144, chosen after profiling the 95th-percentile prompt length.
- Gateway: LiteLLM proxy in front with API keys, per-team rate limits, and fallback to GPT-4o for queries tagged complex_reasoning.
- Prefix caching: shared schema DDL and metric definitions cached via --enable-prefix-caching; TTFT dropped 38% on repeat dashboards.
- Autoscaling: Kubernetes HPA on vLLM queue depth; second replica warms during business hours only.
- Results: p95 latency 1.1s (was 8.4s on Ollama under load); GPU utilization 72% vs 28%; cloud fallback share fell from 40% to 12% after AWQ quality audit passed.
- Ops: model weights baked into AMI; config in git; weekly canary prompt suite gates rollouts.
The team kept Ollama on engineer laptops for ad-hoc prompts. Production traffic never mixed dev and serve tiers — a common mistake that starves paying users when someone pulls a 70B experiment on the shared box.
Inference stack decision table
| Need | vLLM | Ollama | TGI (HF) | Cloud API |
|---|---|---|---|---|
| Laptop / single-dev prototyping | Heavy | Excellent | Moderate | Instant |
| Many concurrent chat sessions | Excellent | Moderate | Good | Excellent |
| OpenAI-compatible HTTP API | Yes | Yes | Yes | Native |
| Multi-GPU tensor parallelism | Native | Limited | Yes | N/A |
| Prefix / prompt caching | Yes | Limited | Varies | Vendor-specific |
| Offline batch throughput | Excellent | Moderate | Good | Rate-limited |
| Fastest time-to-first-demo | Slow | Fast | Moderate | Fastest |
| Frontier model without self-hosting | No | No | No | Yes |
Choose vLLM when GPU saturation and queue depth are the problem and you control the hardware. Stay on Ollama for solo dev and light internal tools. Use cloud APIs for frontier reasoning and as the escape hatch in cascade routers.
Common pitfalls
- Max context set to model maximum: reserves KV cache for 128K even when median prompt is 2K; cap --max-model-len to real usage.
- No gateway auth: vLLM's server supports only a single shared key (--api-key), with no per-user keys, quotas, or rate limits; exposing port 8000 publicly invites crypto miners and data exfiltration.
- Ignoring prefill spikes: huge prompts stall decode for everyone; chunk RAG context or use separate prefill-heavy and chat-light pools.
- Quantization without eval: AWQ can degrade JSON extraction accuracy; run task-specific benchmarks before cutover.
- Single replica, no queue SLO: one GPU restart drops all sessions; run N+1 replicas behind a load balancer.
- Mixing dev and prod on one daemon: experimental model pulls evict production weights from VRAM.
- Streaming through buffering proxies: nginx default buffering breaks SSE; set proxy_buffering off.
- Wrong chat template: Hugging Face models need matching tokenizer templates; garbled outputs often mean template mismatch, not “bad weights.”
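For the first pitfall, the cap can be derived from observed traffic instead of guessed. The sketch below is one possible approach, not a vLLM feature: suggest_max_model_len, the nearest-rank percentile, the rounding to a multiple of 256, and the sample lengths are all illustrative assumptions.

```python
import math


def suggest_max_model_len(prompt_token_counts, max_output_tokens,
                          percentile=95, round_to=256):
    """Nearest-rank percentile of prompt lengths plus the output budget, rounded up."""
    if not prompt_token_counts:
        raise ValueError("need at least one observed prompt length")
    ordered = sorted(prompt_token_counts)
    rank = math.ceil(percentile / 100 * len(ordered))   # nearest-rank method
    p_len = ordered[max(rank, 1) - 1]
    needed = p_len + max_output_tokens                  # prompt plus generation
    return math.ceil(needed / round_to) * round_to


# Hypothetical prompt lengths (in tokens) sampled from production logs.
observed = [900, 1200, 1500, 1800, 2100, 2400, 2600, 3000, 3900, 5200]
suggested = suggest_max_model_len(observed, max_output_tokens=512)
print(suggested)  # 5888
```

Prompts longer than the chosen cap still need a plan, such as truncation or routing to a separate long-context pool, so track how many requests fall above the percentile you pick.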
Production checklist
- Profile production prompt length percentiles before choosing --max-model-len.
- Install vLLM with CUDA version matching host drivers; pin package version in Docker.
- Smoke-test /v1/chat/completions with streaming and non-streaming clients.
- Enable prefix caching if many requests share system or RAG headers.
- Configure tensor parallelism for models that exceed single-GPU VRAM.
- Put LiteLLM or similar gateway in front for keys, budgets, and cloud fallback.
- Scrape Prometheus metrics; alert on queue depth and cache usage.
- Run soak test at 2× expected peak concurrent sessions.
- Bake weights into images or use local HF cache volumes to avoid cold-start pulls.
- Document model revision, quant method, and max context in runbooks.
- Schedule canary eval suite on every model or config change.
Key takeaways
- vLLM optimizes multi-user LLM serving via PagedAttention and continuous batching.
- The OpenAI-compatible server drops into existing SDKs and LiteLLM gateways with minimal code change.
- VRAM planning must account for concurrent sequences and KV cache, not weights alone.
- Graduate from Ollama when queues and GPU memory fragmentation appear under real concurrency.
- Pair self-hosted vLLM with gateway routing and cloud escalation — local throughput and frontier quality are complementary tiers.
Related reading
- Ollama fundamentals explained — fast local prototyping before you need vLLM scale
- LiteLLM fundamentals explained — gateway layer for keys, budgets, and multi-provider fallback
- LLM KV cache explained — prefill, decode, and why PagedAttention matters
- Model serving explained — online inference, batching, and production latency patterns