Guide
LLM speculative decoding explained
Harbor Support's tier-1 chat gateway ran a 70B target on eight A100s with CUDA graphs and continuous batching already tuned. Decode still dominated P50 latency at 28 ms per token because each autoregressive step launched a full forward pass through 70 billion parameters. The platform team added speculative decoding: a 1.5B draft model proposes short token chains; the 70B target verifies them in one batched forward. P50 decode fell to 11 ms per effective token; acceptance rate averaged 72% on support transcripts once draft vocabulary alignment and temperature handling matched production sampling.
Speculative decoding (also called speculative sampling) speeds autoregressive generation without changing the target model's output distribution when implemented correctly. A small, fast draft model guesses the next k tokens; the large target model evaluates all guesses in parallel and accepts a prefix via rejection sampling. Accepted tokens cost one target forward pass instead of k. This guide covers the draft-and-verify loop, acceptance-rate math, Eagle and Medusa variants, vLLM speculative configuration, interaction with KV cache and batching, the Harbor Support refactor, a technique decision table versus graphs-only and quantization-only stacks, pitfalls, and a production checklist — building on vLLM serving fundamentals.
Why decode is the bottleneck
LLM inference splits into prefill (process the whole prompt in parallel) and decode (emit one token per step). Prefill is compute-bound matrix math; decode is serial and typically memory-bandwidth-bound, because each step must re-read the model weights and the growing KV cache to produce a single token and cannot start until the previous token exists. Even with PagedAttention, a 70B model may spend 15–40 ms per token on kernel launches, weight and KV reads, and attention over the growing context.
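A rough way to see why decode is limited by memory traffic is to estimate the time needed just to stream the weights once per token. The sketch below assumes FP16 weights (2 bytes per parameter), even sharding across eight GPUs, and roughly 2 TB/s of HBM bandwidth per GPU. These figures are illustrative assumptions, not measurements from the Harbor Support fleet, and the function name is hypothetical.

```python
def weight_read_floor_ms(params_billion, bytes_per_param, num_gpus, hbm_gb_per_s):
    """Lower bound on per-token decode latency from streaming the weights once.

    Ignores KV-cache reads, kernel launches, and inter-GPU communication,
    so real latency is higher than this floor.
    """
    total_gb = params_billion * bytes_per_param   # 1e9 params * bytes = GB
    gb_per_gpu = total_gb / num_gpus              # assumes even sharding
    return gb_per_gpu / hbm_gb_per_s * 1000.0     # seconds -> milliseconds


# Assumed figures: 70B params, FP16, 8 GPUs, ~2000 GB/s HBM per GPU.
floor_ms = weight_read_floor_ms(70, 2, 8, 2000)
print(f"weight-read floor: {floor_ms:.2f} ms per token")
```

Under these assumptions the floor is 8.75 ms per token, and it is paid once per forward pass rather than once per scored position, which is the slack that speculation exploits.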
Common mitigations each address part of the wall:
| Technique | What it fixes | What it does not fix |
|---|---|---|
| CUDA graphs | Kernel launch overhead per step | Still one target forward per token |
| Quantization (FP8/INT4) | Weight and KV bytes per step | Sequential token dependency |
| Continuous batching | GPU utilization across requests | Per-request tokens/sec ceiling |
| Speculative decoding | Multiple tokens per target forward | Draft cost + verification when acceptance is low |
Speculation targets the fundamental autoregressive serial loop: if the draft agrees with the target often enough, you amortize expensive target passes across several emitted tokens.
Draft-and-verify loop
Classical speculative decoding (Leviathan et al., Chen et al.) runs two models sharing the same tokenizer and (ideally) similar distributions:
- Draft autoregressively proposes tokens d₁, d₂, …, dₖ (typical k = 4–8) using the small model's KV cache.
- Target verifies by running one forward pass on the concatenated prefix plus all draft tokens, producing logits at each position.
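- Acceptance sampling compares target and draft probabilities position by position: draft token dᵢ is accepted with probability min(1, p(dᵢ)/q(dᵢ)), where p is the target distribution and q the draft distribution at that position. On the first rejection, sample a correction from the adjusted (residual) target distribution, proportional to max(0, p − q), and discard the remaining drafts. If every draft is accepted, the target's logits at the final position yield one extra token.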
- Append accepted tokens (and the correction token) to the sequence; refresh both models' KV caches; repeat.
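The loop above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not an engine implementation: `toy_draft` and `toy_target` are hypothetical lookup tables over a three-token vocabulary standing in for real models, there is no KV cache, and the target is called once per position where a real engine would use a single batched forward pass.

```python
import random


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(prefix, draft_fn, target_fn, k, rng):
    """One draft-and-verify round; returns the tokens to append to prefix."""
    # 1. Draft proposes k tokens autoregressively, keeping its distributions.
    drafts, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_fn(ctx)
        tok = sample(q, rng)
        drafts.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Target scores all k+1 positions (one batched pass in a real engine).
    p_dists = [target_fn(list(prefix) + drafts[:i]) for i in range(k + 1)]

    # 3. Accept each draft token with probability min(1, p/q).
    out = []
    for i, tok in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # 4. All k drafts accepted: the last target distribution gives a bonus token.
    out.append(sample(p_dists[k], rng))
    return out


# Hypothetical toy models: the next-token distribution depends on the last token.
TARGET = {0: [0.6, 0.3, 0.1], 1: [0.2, 0.5, 0.3], 2: [0.3, 0.3, 0.4]}
DRAFT = {0: [0.5, 0.4, 0.1], 1: [0.3, 0.4, 0.3], 2: [0.2, 0.3, 0.5]}


def toy_target(ctx):
    return TARGET[ctx[-1]]


def toy_draft(ctx):
    return DRAFT[ctx[-1]]


rng = random.Random(0)
seq = [0]
while len(seq) < 20:
    seq.extend(speculative_step(seq, toy_draft, toy_target, k=4, rng=rng))
print(seq)
```

Each round emits between 1 and k+1 tokens, so the number of target passes needed to reach a given length depends on how often the two tables agree.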
Acceptance rate and speedup
Let α be the per-position probability that the target accepts the draft's token (the acceptance rate). Assuming acceptances are independent across positions, the expected number of tokens emitted per target forward is roughly 1 + α + α² + … + αᵏ = (1 − αᵏ⁺¹) / (1 − α) for draft length k. Example: α = 0.7 and k = 5 yields ~2.9 effective tokens per target step before draft overhead. Speculation yields a net speedup only when:
(tokens_accepted × T_target) > T_draft + T_verify
where T_target is the latency of one ordinary target decode step, T_draft is the total time spent on cheap autoregressive drafting, T_verify is one parallel target forward over k+1 positions, and drafting runs on the same GPU or a sidecar. If α < 0.4 on your traffic, speculation often slows inference.
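The arithmetic above is easy to check numerically. The sketch below evaluates the geometric sum and the speedup inequality; the timing values (28 ms target step, 5 ms per draft step, 30 ms verify pass) are hypothetical inputs chosen for illustration, and both function names are made up for this example.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target forward: 1 + a + a^2 + ... + a^k."""
    return sum(alpha ** i for i in range(k + 1))


def estimated_speedup(alpha, k, t_target_ms, t_draft_step_ms, t_verify_ms):
    """Baseline time for the emitted tokens divided by the cost of one round."""
    baseline = expected_tokens(alpha, k) * t_target_ms
    round_cost = k * t_draft_step_ms + t_verify_ms   # T_draft + T_verify
    return baseline / round_cost


# Hypothetical timings, not measurements.
for alpha in (0.4, 0.7):
    tokens = expected_tokens(alpha, 5)
    speedup = estimated_speedup(alpha, 5, t_target_ms=28, t_draft_step_ms=5, t_verify_ms=30)
    print(f"alpha={alpha}: {tokens:.2f} tokens/step, speedup {speedup:.2f}x")
```

For α = 0.7 and k = 5 the sum is about 2.94 tokens per target step. With these made-up timings that works out to roughly 1.5×, while α = 0.4 falls below 1×, meaning speculation would be a net loss.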
Distribution correctness
Proper rejection sampling guarantees that outputs match sampling from the target alone, which is critical for regulated or A/B-sensitive workloads. Greedy verification (accept a draft token only when it equals the target's argmax) is faster, but it matches the target only when the target itself decodes greedily (temperature 0); applied to sampled outputs, it changes the distribution. Temperature and top-p must be applied consistently during verification, not only on the draft.
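The equivalence claim can be verified exactly for a single position. The sketch below uses rational arithmetic to compute the distribution of the emitted token under accept/reject with residual resampling, and compares it with an argmax-only shortcut. The two distributions `p` and `q` are hypothetical, and the greedy function is one simple reading of "greedy verify", not any particular engine's implementation.

```python
from fractions import Fraction as F


def speculative_output_dist(p, q):
    """Exact distribution of one token under accept/reject plus residual resampling."""
    # Probability that token i is drafted and accepted: q_i * min(1, p_i / q_i).
    accept = [qi * min(F(1), pi / qi) if qi > 0 else F(0) for pi, qi in zip(p, q)]
    reject_mass = 1 - sum(accept)
    residual = [max(F(0), pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0:                       # p == q, so nothing is ever rejected
        return accept
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]


def greedy_verify_output_dist(p, q):
    """Argmax-only verify: the emitted token is always the target argmax.

    q is unused because a mismatching draft is replaced by the argmax anyway.
    """
    top = max(range(len(p)), key=lambda i: p[i])
    return [F(1) if i == top else F(0) for i in range(len(p))]


p = [F(6, 10), F(3, 10), F(1, 10)]      # hypothetical target distribution
q = [F(3, 10), F(5, 10), F(2, 10)]      # hypothetical draft distribution

assert speculative_output_dist(p, q) == p      # matches the target exactly
assert greedy_verify_output_dist(p, q) != p    # collapses onto the argmax
```

The first assertion holds for any pair of distributions, not just this one, which is why the draft's quality affects speed but not output when verification is sampling-correct.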
Eagle, Medusa, and tree speculation
Separate draft models are not the only pattern:
- Eagle / Eagle-2 — lightweight prediction heads atop the target's hidden states, trained to forecast multiple future tokens. Shares backbone features; draft cost drops because you avoid a second full model load.
- Medusa — multiple decoding heads on frozen target weights, proposing several continuations per step. Verification still uses target logits; heads specialize in high-acceptance branches.
- Tree speculation — draft explores a branching tree of continuations; target verifies the best path in one batched pass. Higher acceptance on ambiguous contexts at the cost of more draft FLOPs.
- Lookahead decoding — Jacobi-style parallel token updates without a separate draft model; works when target self-predicts well at offset positions.
Eagle-style methods dominate when VRAM cannot fit two full models. Dual-model speculation wins when a well-aligned small checkpoint exists (e.g. same family: Llama-3.2-1B drafting for Llama-3.3-70B).
vLLM and production configuration
vLLM and similar engines expose speculative decoding via draft model paths, num_speculative_tokens, and acceptance metrics. Practical rules:
- Draft selection: same tokenizer and chat template as the target; ideally the same pretraining lineage. Mismatched BPE merges destroy acceptance.
- VRAM budget: draft plus target KV pools must fit; shrinking target batch concurrency to make room may negate the speedup. Profile gpu_memory_utilization with speculation enabled.
- Draft length k: start at 4; increase only if acceptance stays >65%. Long drafts waste verify FLOPs on rejections.
- Colocation: run the draft on the same GPU as the target to avoid cross-device transfers over PCIe; dedicated draft GPUs help only at very large target scale.
- Metrics: export spec_acceptance_rate, spec_draft_tokens, spec_accepted_tokens, and effective tokens per second versus baseline decode.
Pair with CUDA graphs on the verify pass: fixed shapes for k+1 positions bucket cleanly. Avoid stacking speculation with aggressive KV eviction unless golden tests confirm acceptance rates hold on truncated context.
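As a concrete shape for these settings, the sketch below assembles engine keyword arguments as a plain dictionary without importing vLLM. It assumes a recent vLLM release that accepts a `speculative_config` dict; older releases took `speculative_model` and `num_speculative_tokens` as direct arguments, so check the documentation for the installed version. The helper name and the specific values are illustrative, and the model pairing is the same-family example mentioned earlier in this guide.

```python
import json


def build_engine_kwargs(target, draft, k, max_num_seqs, gpu_memory_utilization=0.90):
    """Assemble keyword arguments for an engine that loads a draft model.

    ASSUMPTION: the installed vLLM version accepts `speculative_config`.
    """
    return {
        "model": target,
        "tensor_parallel_size": 8,
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_num_seqs": max_num_seqs,          # lowered to leave room for the draft
        "speculative_config": {
            "model": draft,
            "num_speculative_tokens": k,
        },
    }


kwargs = build_engine_kwargs(
    target="meta-llama/Llama-3.3-70B-Instruct",
    draft="meta-llama/Llama-3.2-1B-Instruct",
    k=4,                                       # start short, tune from metrics
    max_num_seqs=36,
)
print(json.dumps(kwargs, indent=2))

# With vLLM installed and GPUs available, the dict would be passed as:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

Keeping the configuration as data makes it straightforward to diff a speculative deployment against the target-only baseline and to toggle the `speculative_config` key behind a feature flag.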
Batching, KV cache, and scheduling
Speculative decoding complicates continuous batching:
- Sequences in the same batch may accept different draft lengths each step — schedulers pad verify passes to the batch maximum k.
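- Draft models maintain separate KV caches per sequence, so dual-model setups have two KV pools to account for; the draft pool is usually much smaller per token than the target's but still competes for the same VRAM.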
- Prefill is usually unchanged; speculation applies to decode only. Do not expect TTFT improvements unless draft also assists prompt extension (rare).
- Prefix-cache hits reduce verify FLOPs; hot system prompts raise acceptance because draft and target agree on templated boilerplate.
Admission control should estimate both model footprints. A fleet sized for target-only concurrency may OOM when speculation is toggled on without reducing max concurrent sequences.
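A back-of-the-envelope version of that estimate is sketched below. The layer counts, KV-head counts, and head dimensions are assumed shapes loosely modelled on a 70B-class target and a 1B-class draft with FP16 caches; the KV budget and per-sequence token reservation are hypothetical. The sketch ignores the draft model's weights, activation memory, and tensor-parallel sharding, all of which shrink the real budget further.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_gib, tokens_per_seq, per_token_bytes):
    """How many sequences of a given reserved length fit in a KV budget."""
    budget = kv_budget_gib * 1024 ** 3
    return int(budget // (tokens_per_seq * per_token_bytes))


# ASSUMED model shapes (not taken from the deployment described in this guide).
target_kv = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
draft_kv = kv_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)

budget_gib = 40        # hypothetical VRAM left for KV after weights
ctx_tokens = 4096      # hypothetical tokens reserved per sequence

target_only = max_concurrent_sequences(budget_gib, ctx_tokens, target_kv)
with_draft = max_concurrent_sequences(budget_gib, ctx_tokens, target_kv + draft_kv)
print(target_only, with_draft)   # 32 29
```

Even with a draft cache one tenth the size of the target's per token, the admissible concurrency drops, so the scheduler limit needs to be recomputed rather than carried over from the target-only configuration.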
Harbor Support chat gateway refactor
Harbor Support routed billing and outage tickets through a 70B instruct model with a strict latency SLO (P50 < 15 ms/token decode). The refactor:
- Baseline: measured 28 ms P50 decode with graphs enabled; acceptance of a trial 3B draft from a different family averaged 41% (unusable).
- Draft alignment: switched to a same-family 1.5B instruct model; matched chat template, stop tokens, and temperature=0.7 on both sides during verify.
- Spec config: num_speculative_tokens=5, draft colocated on the same GPU; reduced max concurrent sequences from 48 to 36 to fit dual KV pools.
- Scheduling: verify pass CUDA-graph bucketed for draft lengths 1–5; eager fallback for rare overflow.
- Monitoring: dashboard on acceptance rate by ticket category; alert if <60% for 10 minutes.
- Rollback: feature flag disables draft load without draining the fleet; target-only path preserved for regression tests.
Decode P50 fell 28 ms → 11 ms per effective token. P99 fell 94 ms → 41 ms. Throughput per GPU rose 2.1× on support-shaped prompts. Output quality audits (human spot checks + logprob KL vs baseline) showed no measurable drift once sampling-correct verification shipped.
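The monitoring rule in the refactor (alert when acceptance stays under 60% for 10 minutes) could be approximated with a sliding window over draft and accepted token counts. The class below is a hypothetical sketch, not Harbor Support's implementation; it interprets the rule as "the aggregate acceptance rate over the last 10 minutes is below the threshold", which is one of several reasonable readings.

```python
from collections import deque


class AcceptanceAlert:
    """Flags when the windowed acceptance rate is below a threshold."""

    def __init__(self, threshold=0.60, window_s=600):
        self.threshold = threshold
        self.window_s = window_s
        self.samples = deque()          # (timestamp_s, drafted, accepted)

    def record(self, ts, drafted, accepted):
        self.samples.append((ts, drafted, accepted))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= ts - self.window_s:
            self.samples.popleft()

    def rate(self):
        drafted = sum(d for _, d, _ in self.samples)
        accepted = sum(a for _, _, a in self.samples)
        return accepted / drafted if drafted else None

    def should_alert(self):
        rate = self.rate()
        return rate is not None and rate < self.threshold


alert = AcceptanceAlert()
for minute in range(10):                # healthy traffic at 72% acceptance
    alert.record(ts=minute * 60, drafted=1000, accepted=720)
healthy = alert.should_alert()

for minute in range(10, 20):            # a new ticket category drops to 50%
    alert.record(ts=minute * 60, drafted=1000, accepted=500)
degraded = alert.should_alert()
print(healthy, degraded)
```

One instance per ticket category would give the per-category view described above; weighting by drafted tokens keeps quiet minutes from dominating the rate.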
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| Decode-bound 70B+ chat, aligned small model available | Dual-model speculative decoding | Longer draft chains before measuring α |
| VRAM cannot fit two full models | Eagle/Medusa heads on target | Large separate draft checkpoint |
| Highly creative / high-temperature outputs | Full rejection sampling verify | Greedy verify (distribution shift) |
| Acceptance <50% on production traces | CUDA graphs + quantization instead | More speculative tokens |
| Code or JSON with rare tokens | Domain-finetuned draft or shorter k | Generic small model draft |
| Regulated identical-output requirement | Speculation with proven sampling equivalence | Approximate verify shortcuts |
Pitfalls
- Tokenizer mismatch — draft and target must share vocabulary; mixed families silently crater acceptance.
- Chat template drift — different special tokens between draft and target break alignment on the first generated token.
- VRAM surprise — enabling speculation without lowering concurrency causes OOM mid-flight.
- Measuring draft-only speed — benchmarks that skip verify passes overstate gains.
- Long-context acceptance collapse — draft models weak on 32K+ context accept fewer tokens; tune per tier.
- Ignoring draft latency — autoregressive drafting on CPU or across PCIe can erase verify savings.
- Distribution audits skipped — greedy shortcuts change customer-visible tone in support bots.
Production checklist
- Measure baseline decode P50/P99 before enabling speculation.
- Choose draft from same model family and tokenizer as target.
- Align chat templates, stop sequences, and sampling params on verify.
- Start with num_speculative_tokens=4; tune from acceptance metrics.
- Resize max concurrent sequences for dual KV memory.
- CUDA-graph verify buckets for common draft lengths.
- Export acceptance rate and effective tokens/sec dashboards.
- Run logprob KL or A/B quality checks against target-only baseline.
- Feature-flag speculation for instant rollback.
- Re-profile when prompt distribution shifts (new product lines, languages).
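For the logprob KL item in the checklist, one possible shape of the check is sketched below. It assumes you can log a full (or top-n renormalized) next-token log-probability vector at each decode step from both the target-only baseline and the speculative path on the same prompts; the function names and example numbers are hypothetical.

```python
import math


def kl_divergence(logp_baseline, logp_candidate):
    """KL(baseline || candidate) for two aligned lists of log-probabilities."""
    return sum(
        math.exp(lp) * (lp - lq)
        for lp, lq in zip(logp_baseline, logp_candidate)
    )


def mean_kl(baseline_steps, candidate_steps):
    """Average per-position KL across the decode steps of a transcript."""
    kls = [kl_divergence(b, c) for b, c in zip(baseline_steps, candidate_steps)]
    return sum(kls) / len(kls)


def logs(probs):
    return [math.log(x) for x in probs]


# Made-up distributions over a three-token vocabulary, two decode steps each.
baseline = [logs([0.7, 0.2, 0.1]), logs([0.5, 0.4, 0.1])]
same = [logs([0.7, 0.2, 0.1]), logs([0.5, 0.4, 0.1])]
drifted = [logs([0.5, 0.3, 0.2]), logs([0.5, 0.4, 0.1])]

print(mean_kl(baseline, same))      # identical distributions give 0.0
print(mean_kl(baseline, drifted))   # any drift gives a positive value
```

A sampling-correct verify path should stay near zero up to numerical noise; the alert threshold is a judgment call to calibrate against run-to-run variation of the baseline itself.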
Key takeaways
- Speculative decoding proposes multiple tokens with a fast draft, then verifies them in one target forward — trading draft FLOPs for fewer serial target steps.
- Speedup depends on acceptance rate α; misaligned drafts below ~50% acceptance often slow inference.
- Correct rejection sampling preserves the target distribution; greedy verify is not interchangeable for sampled outputs.
- Eagle/Medusa avoid a second full model when VRAM is tight; dual-model speculation wins with well-matched small checkpoints.
- Harbor Support cut decode P50 28 ms → 11 ms with a 1.5B draft and 72% acceptance on support transcripts.
Related reading
- LLM continuous batching explained — scheduling that speculation plugs into
- LLM CUDA graphs for decode inference explained — launch overhead reduction on verify passes
- LLM KV cache explained — dual-model memory accounting
- vLLM fundamentals explained — serving engine baseline for speculative config