LLM speculative decoding explained

Harbor Support's tier-1 chat gateway ran a 70B target on eight A100s with CUDA graphs and continuous batching already tuned. Decode still dominated P50 latency at 28 ms per token because each autoregressive step launched a full forward pass through 70 billion parameters. The platform team added speculative decoding: a 1.5B draft model proposes short token chains; the 70B target verifies them in one batched forward. P50 decode fell to 11 ms per effective token; acceptance rate averaged 72% on support transcripts once draft vocabulary alignment and temperature handling matched production sampling.

Speculative decoding (also called speculative sampling) speeds autoregressive generation without changing the target model's output distribution when implemented correctly. A small, fast draft model guesses the next k tokens; the large target model evaluates all guesses in parallel and accepts a prefix via rejection sampling. A round that accepts several tokens costs one target forward pass instead of one pass per token. This guide covers the draft-and-verify loop, acceptance-rate math, Eagle and Medusa variants, vLLM speculative configuration, interaction with KV cache and batching, the Harbor Support refactor, a technique decision table versus graphs-only and quantization-only stacks, pitfalls, and a production checklist. It builds on vLLM serving fundamentals.

Why decode is the bottleneck

LLM inference splits into prefill (process the prompt in parallel) and decode (emit one or few tokens per step). Prefill is compute-bound matrix math; decode is serial and memory-bandwidth-bound, because each step must stream the model weights and the KV cache to emit a single token and cannot start until the previous token exists. Even with PagedAttention, a 70B model may spend 15–40 ms per token on kernel launch, memory bandwidth, and attention over growing context.
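A back-of-envelope bandwidth floor makes the per-token cost concrete. The sketch below assumes every weight byte is read once per decode step and ignores KV reads, inter-GPU communication, and kernel launch, so real steps land above it. The function name and the hardware figures (FP16 weights, roughly 2.0e12 bytes/s of HBM bandwidth per GPU) are illustrative assumptions, not measurements from this guide.

```python
def decode_floor_ms(n_params: float, bytes_per_param: float,
                    n_gpus: int, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step, assuming weights are streamed once per token."""
    weight_bytes = n_params * bytes_per_param
    aggregate_bandwidth = n_gpus * hbm_bytes_per_s  # assumes perfect sharding
    return 1e3 * weight_bytes / aggregate_bandwidth


# Assumed figures: 70e9 parameters, 2 bytes each, 8 GPUs, ~2.0e12 B/s HBM per GPU.
floor_ms = decode_floor_ms(70e9, 2, 8, 2.0e12)
print(f"bandwidth floor: {floor_ms:.2f} ms per token")
```

Under these assumptions the floor is a few milliseconds per token regardless of how many tokens the step emits, which is why emitting several tokens per target pass is attractive.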

Common mitigations each address part of the wall:

Technique	What it fixes	What it does not fix
CUDA graphs	Kernel launch overhead per step	Still one target forward per token
Quantization (FP8/INT4)	Weight and KV bytes per step	Sequential token dependency
Continuous batching	GPU utilization across requests	Per-request tokens/sec ceiling
Speculative decoding	Multiple tokens per target forward	Draft cost + verification when acceptance is low

Speculation targets the fundamental autoregressive serial loop: if the draft agrees with the target often enough, you amortize expensive target passes across several emitted tokens.

Draft-and-verify loop

Classical speculative decoding (Leviathan et al., Chen et al.) runs two models sharing the same tokenizer and (ideally) similar distributions:

Draft autoregressively proposes tokens d₁, d₂, …, dₖ (typical k = 4–8) using the small model's KV cache.
Target verifies by running one forward pass on the concatenated prefix plus all draft tokens, producing logits at each position.
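Acceptance sampling compares the target probability p(dᵢ) with the draft probability q(dᵢ) position by position: each draft token is accepted with probability min(1, p(dᵢ)/q(dᵢ)). On the first rejection, sample a correction from the adjusted target distribution, proportional to max(0, p − q), and discard the remaining drafts. If all k drafts are accepted, the target's logits at the final position supply one extra token.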
Append accepted tokens (and the correction token) to the sequence; refresh both models' KV caches; repeat.
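The four steps above can be sketched on a toy vocabulary. This is a minimal illustration, not an engine implementation: `draft_fn` and `target_fn` are hypothetical callables that return a next-token probability list for a prefix, and the target is called in a loop where a real engine would use one batched forward pass. KV cache handling is omitted.

```python
import random
from typing import Callable, List, Sequence

Dist = Sequence[float]                 # probabilities over a toy vocabulary
ProbFn = Callable[[List[int]], Dist]   # prefix -> next-token distribution


def sample(weights: Dist, rng: random.Random) -> int:
    # random.choices accepts unnormalized non-negative weights.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(prefix: List[int], draft_fn: ProbFn, target_fn: ProbFn,
                     k: int, rng: random.Random) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. Draft proposes k tokens autoregressively.
    ctx, drafts, q_dists = list(prefix), [], []
    for _ in range(k):
        q = draft_fn(ctx)
        tok = sample(q, rng)
        drafts.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Target scores all k+1 positions (one batched pass in a real engine).
    p_dists = [target_fn(list(prefix) + drafts[:i]) for i in range(k + 1)]

    # 3. Accept each draft token with probability min(1, p/q).
    out: List[int] = []
    for i, tok in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: sample a correction from max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # 4. All drafts accepted: the last target distribution yields a bonus token.
    out.append(sample(p_dists[k], rng))
    return out


rng = random.Random(0)
uniform = lambda ctx: [0.5, 0.5]
print(len(speculative_step([], uniform, uniform, k=4, rng=rng)))
```

When draft and target agree exactly, every ratio is 1 and the round emits k + 1 tokens; when they disagree completely, the round emits only the correction token.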

Acceptance rate and speedup

Let α be the per-position probability that the target accepts a draft token (the acceptance rate). Assuming acceptances are independent across positions, the expected number of tokens emitted per target forward is 1 + α + α² + … + αᵏ = (1 − αᵏ⁺¹)/(1 − α) for draft length k, counting the correction or bonus token. Example: α = 0.7 and k = 5 yields ~2.9 effective tokens per target step before draft overhead. Speedup is only positive when:

(tokens_accepted × T_target) > T_draft + T_verify

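where tokens_accepted is the number of tokens emitted per speculation round, T_target is one baseline target decode step, T_draft is the total time for the k autoregressive draft steps, and T_verify is one parallel target forward over k+1 positions. Draft time counts whether the draft runs on the same GPU or a sidecar. If α < 0.4 on your traffic, speculation often slows inference.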
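The inequality above is easy to evaluate before committing hardware. The sketch below uses the geometric-sum estimate for tokens per round and treats acceptances as independent, which is a simplification. The function names and the draft and verify timings in the example are assumptions for illustration; only the 28 ms baseline and the 72% acceptance figure come from the Harbor Support numbers.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per round: 1 + alpha + ... + alpha**k."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, t_target_ms: float,
                      t_draft_step_ms: float, t_verify_ms: float) -> float:
    """Baseline time for the emitted tokens divided by the cost of one round."""
    round_ms = k * t_draft_step_ms + t_verify_ms
    return expected_tokens(alpha, k) * t_target_ms / round_ms


# alpha and the 28 ms baseline are from the guide; 1.5 ms per draft step and a
# 30 ms verify pass are assumed values to be replaced with your own profile.
print(f"{expected_tokens(0.72, 5):.2f} tokens per round")
print(f"{estimated_speedup(0.72, 5, 28.0, 1.5, 30.0):.2f}x estimated speedup")
```

A result below 1.0 means the round costs more than decoding the same tokens with the target alone, so the draft should be shortened, replaced, or disabled.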

Distribution correctness

Proper rejection sampling guarantees outputs match sampling from the target alone, which is critical for regulated or A/B-sensitive workloads. Greedy verification (accept a draft token only if it equals the target's argmax) is faster and exact for temperature-0 decoding, but it changes the distribution whenever production samples at temperature > 0. Temperature and top-p must be applied consistently during verification, not only on the draft.
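The equivalence claim can be checked exactly for a single position. The sketch below computes, with rational arithmetic, the distribution of the token emitted by one accept-or-correct step, and compares it with an argmax-only verify. The distributions `p` and `q` are made-up examples and the function names are hypothetical.

```python
from fractions import Fraction as F
from typing import List


def rejection_output_dist(p: List[F], q: List[F]) -> List[F]:
    """Exact distribution of the emitted token for target p and draft q."""
    # Mass that arrives via an accepted draft token: q(x) * min(1, p(x)/q(x)).
    accepted = [qi * min(F(1), pi / qi) if qi else F(0) for pi, qi in zip(p, q)]
    reject_mass = 1 - sum(accepted)
    # Corrections are drawn from the normalized residual max(0, p - q).
    residual = [max(F(0), pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:  # p == q, nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


def greedy_output_dist(p: List[F]) -> List[F]:
    """Argmax-only verify always ends up emitting the target's argmax."""
    best = max(range(len(p)), key=lambda i: p[i])
    return [F(1) if i == best else F(0) for i in range(len(p))]


p = [F(6, 10), F(3, 10), F(1, 10)]  # target distribution (example values)
q = [F(2, 10), F(5, 10), F(3, 10)]  # draft distribution (example values)
print(rejection_output_dist(p, q) == p)
print(greedy_output_dist(p) == p)
```

The first comparison holds for any pair of distributions; the second holds only when the target distribution is already a point mass, which is the temperature-0 case.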

Eagle, Medusa, and tree speculation

Separate draft models are not the only pattern:

Eagle / Eagle-2 — lightweight prediction heads atop the target's hidden states, trained to forecast multiple future tokens. Shares backbone features; draft cost drops because you avoid a second full model load.
Medusa — multiple decoding heads on frozen target weights, proposing several continuations per step. Verification still uses target logits; heads specialize in high-acceptance branches.
Tree speculation — draft explores a branching tree of continuations; target scores every branch in one batched pass using a tree-shaped attention mask and keeps the longest accepted path. Higher acceptance on ambiguous contexts at the cost of more draft and verify FLOPs.
Lookahead decoding — Jacobi-style parallel token updates without a separate draft model; works when target self-predicts well at offset positions.

Eagle-style methods dominate when VRAM cannot fit two full models. Dual-model speculation wins when a well-aligned small checkpoint exists (e.g. same family: Llama-3.2-1B drafting for Llama-3.3-70B).
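Tree-based variants verify several branches in one pass by restricting attention so that each draft node sees only its own ancestors. The sketch below builds that mask from a parent-index list; it is an illustrative helper with a hypothetical name, not code from any of the methods listed above, and it leaves out the shared prefix that every node also attends to.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or -1 when node i hangs
    directly off the already-accepted prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking self and ancestors
            mask[i][j] = True
            j = parents[j]
    return mask


# Node 0 is the first draft token, nodes 1 and 2 are alternative continuations
# of node 0, and node 3 continues node 1.
for row in tree_attention_mask([-1, 0, 0, 1]):
    print(["x" if allowed else "." for allowed in row])
```

Sibling branches never see each other, so the target's logits at each node are the same as if that branch had been verified alone.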

vLLM and production configuration

vLLM and similar engines expose speculative decoding via draft model paths, num_speculative_tokens, and acceptance metrics. Practical rules:

Draft selection — same tokenizer and chat template as target; ideally same pretraining lineage. Mismatched BPE merges destroy acceptance.
VRAM budget — draft plus target KV pools must fit; shrinking target batch concurrency may negate speedup. Profile gpu_memory_utilization with speculation enabled.
Draft length k — start at 4; increase only if acceptance stays >65%. Long drafts waste verify FLOPs on rejections.
Colocation — run draft on the same GPU as target to avoid cross-device transfers on every draft step; dedicated draft GPUs help only at very large target scale.
Metrics — export spec_acceptance_rate, spec_draft_tokens, spec_accepted_tokens, effective tokens per second versus baseline decode.

Pair with CUDA graphs on the verify pass — fixed shapes for k+1 positions bucket cleanly. Avoid stacking speculation with aggressive KV eviction unless golden tests confirm acceptance rates hold on truncated context.
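The rules above can be captured in a small settings builder. This is a sketch under assumptions: the `speculative_config` dictionary with `model` and `num_speculative_tokens` keys follows the shape used by recent vLLM releases, while older releases took top-level `speculative_model` and `num_speculative_tokens` arguments, so check the version you run. The model names are placeholders and `build_engine_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict


def build_engine_kwargs(target: str, draft: str, k: int = 4,
                        gpu_memory_utilization: float = 0.90,
                        max_num_seqs: int = 36) -> Dict[str, Any]:
    """Assemble engine settings for dual-model speculation (keys are assumed)."""
    if k < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return {
        "model": target,
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_num_seqs": max_num_seqs,   # lowered to leave room for the draft
        "speculative_config": {
            "model": draft,
            "num_speculative_tokens": k,
        },
    }


# Placeholder model names; substitute a same-family, same-tokenizer pair.
kwargs = build_engine_kwargs("target-70b-instruct", "draft-1.5b-instruct")
print(kwargs["speculative_config"])
# With vLLM installed, these would be passed as vllm.LLM(**kwargs).
```

Keeping the settings in one place makes it simple to diff a speculation-enabled deployment against the target-only baseline.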

Batching, KV cache, and scheduling

Speculative decoding complicates continuous batching:

Sequences in the same batch may accept different draft lengths each step — schedulers pad verify passes to the batch maximum k.
Draft models maintain separate KV caches per sequence; dual-model setups must account for a second KV pool, which is smaller per token than the target's but not free.
Prefill is usually unchanged; speculation applies to decode only. Do not expect TTFT improvements unless draft also assists prompt extension (rare).
Prefix-cache hits reduce the prompt work both models repeat for shared prefixes; hot system prompts also tend to raise acceptance because draft and target agree on templated boilerplate.

Admission control should estimate both model footprints. A fleet sized for target-only concurrency may OOM when speculation is toggled on without reducing max concurrent sequences.
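A rough KV budget calculation shows how much concurrency a draft costs. The sketch below counts only KV bytes (keys plus values, per layer, per KV head) and ignores draft weights and activations, which also consume VRAM. The layer and head counts are assumed values in the range of a 70B-class target and a ~1B-class draft, and the 40 GiB budget and 4096-token sequence length are arbitrary examples.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KVShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # FP16/BF16 cache

    def bytes_per_token(self) -> int:
        # Factor 2 covers keys and values.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem


def max_sequences(kv_budget_bytes: int, tokens_per_seq: int,
                  shapes: Iterable[KVShape]) -> int:
    """How many full-length sequences fit when every listed model keeps a cache."""
    per_seq = tokens_per_seq * sum(s.bytes_per_token() for s in shapes)
    return kv_budget_bytes // per_seq


target = KVShape(layers=80, kv_heads=8, head_dim=128)  # assumed 70B-class shape
draft = KVShape(layers=16, kv_heads=8, head_dim=64)    # assumed ~1B-class shape
budget = 40 * 1024**3                                  # example KV budget

print(max_sequences(budget, 4096, [target]))
print(max_sequences(budget, 4096, [target, draft]))
```

An admission controller can run the same arithmetic with measured free memory to set the concurrency cap before speculation is switched on.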

Harbor Support chat gateway refactor

Harbor Support routed billing and outage tickets through a 70B instruct model with strict latency SLO (P50 < 15 ms/token decode). The refactor:

Baseline — measured 28 ms P50 decode with graphs enabled; acceptance of a trial 3B draft from a different family averaged 41% (unusable).
Draft alignment — switched to same-family 1.5B instruct; matched chat template, stop tokens, and temperature=0.7 on both sides during verify.
Spec config — num_speculative_tokens=5, draft colocated on same GPU; reduced max concurrent sequences from 48 to 36 to fit dual KV pools.
Scheduling — verify pass CUDA-graph bucketed for draft lengths 1–5; fallback eager for rare overflow.
Monitoring — dashboard on acceptance rate by ticket category; alert if <60% for 10 minutes.
Rollback — feature flag disables draft load without draining fleet; target-only path preserved for regression tests.

Decode P50 fell 28 ms → 11 ms per effective token. P99 fell 94 ms → 41 ms. Throughput per GPU rose 2.1× on support-shaped prompts. Output quality audits (human spot checks + logprob KL vs baseline) showed no measurable drift once sampling-correct verification shipped.
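The monitoring rule in the refactor (alert when acceptance stays under 60% for 10 minutes) could be approximated with a trailing-window aggregate. The sketch below is one interpretation, assuming the alert fires when the acceptance rate over the last 600 seconds is below the threshold; the class name and the sample format are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class AcceptanceAlert:
    """Tracks drafted/accepted token counts over a trailing time window."""

    def __init__(self, threshold: float = 0.60, window_s: float = 600.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.samples: Deque[Tuple[float, int, int]] = deque()  # (ts, drafted, accepted)

    def record(self, ts: float, drafted: int, accepted: int) -> None:
        self.samples.append((ts, drafted, accepted))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= ts - self.window_s:
            self.samples.popleft()

    def rate(self) -> Optional[float]:
        drafted = sum(d for _, d, _ in self.samples)
        accepted = sum(a for _, _, a in self.samples)
        return accepted / drafted if drafted else None

    def firing(self) -> bool:
        r = self.rate()
        return r is not None and r < self.threshold


alert = AcceptanceAlert()
alert.record(ts=0, drafted=100, accepted=80)
alert.record(ts=300, drafted=100, accepted=30)
print(alert.rate(), alert.firing())
```

Keeping one instance per ticket category would match the per-category dashboard described above.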

Technique decision table

Your situation	Prefer	Avoid
Decode-bound 70B+ chat, aligned small model available	Dual-model speculative decoding	Longer draft chains before measuring α
VRAM cannot fit two full models	Eagle/Medusa heads on target	Large separate draft checkpoint
Highly creative / high-temperature outputs	Full rejection sampling verify	Greedy verify (distribution shift)
Acceptance <50% on production traces	CUDA graphs + quantization instead	More speculative tokens
Code or JSON with rare tokens	Domain-finetuned draft or shorter k	Generic small model draft
Regulated identical-output requirement	Speculation with proven sampling equivalence	Approximate verify shortcuts

Pitfalls

Tokenizer mismatch — draft and target must share vocabulary; mixed families silently crater acceptance.
Chat template drift — different special tokens between draft and target break alignment on the first generated token.
VRAM surprise — enabling speculation without lowering concurrency causes OOM mid-flight.
Measuring draft-only speed — benchmarks that skip verify passes overstate gains.
Long-context acceptance collapse — drafts from models that are weak on 32K+ context are accepted less often; tune k per context tier.
Ignoring draft latency — autoregressive drafting on CPU or across PCIe can erase verify savings.
Distribution audits skipped — greedy shortcuts change customer-visible tone in support bots.
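The tokenizer pitfall is cheap to catch before deployment by diffing the two vocabularies. The sketch below works on plain token-to-id dictionaries; with Hugging Face tokenizers such a dictionary is available from `get_vocab()`. The helper name and the toy vocabularies are illustrative.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[int], Optional[int]]  # (token, target_id, draft_id)


def vocab_mismatches(draft_vocab: Dict[str, int],
                     target_vocab: Dict[str, int],
                     limit: int = 10) -> List[Mismatch]:
    """Return up to `limit` tokens whose ids differ or exist on one side only."""
    issues: List[Mismatch] = []
    for token, target_id in target_vocab.items():
        draft_id = draft_vocab.get(token)
        if draft_id != target_id:
            issues.append((token, target_id, draft_id))
    for token in sorted(draft_vocab.keys() - target_vocab.keys()):
        issues.append((token, None, draft_vocab[token]))
    return issues[:limit]


target_vocab = {"<s>": 0, "a": 1, "b": 2}
draft_vocab = {"<s>": 0, "a": 2, "b": 1, "c": 3}
for issue in vocab_mismatches(draft_vocab, target_vocab):
    print(issue)
```

An empty result is necessary but not sufficient: chat templates and special-token handling still need to be compared separately.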

Production checklist

Measure baseline decode P50/P99 before enabling speculation.
Choose draft from same model family and tokenizer as target.
Align chat templates, stop sequences, and sampling params on verify.
Start with num_speculative_tokens=4; tune from acceptance metrics.
Resize max concurrent sequences for dual KV memory.
CUDA-graph verify buckets for common draft lengths.
Export acceptance rate and effective tokens/sec dashboards.
Run logprob KL or A/B quality checks against target-only baseline.
Feature-flag speculation for instant rollback.
Re-profile when prompt distribution shifts (new product lines, languages).
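For the logprob KL item in the checklist, a per-position comparison between the target-only baseline and the speculation-enabled path might look like the sketch below. It assumes both sides report log-probabilities for the same token set; when an API returns only top-k logprobs the result is an approximation. The function name and the example values are hypothetical.

```python
import math
from typing import Dict


def kl_from_logprobs(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """KL(baseline || candidate) in nats from {token: logprob} dictionaries."""
    kl = 0.0
    for token, lp in baseline.items():
        p = math.exp(lp)
        if p == 0.0:
            continue  # zero-probability baseline tokens contribute nothing
        lq = candidate.get(token, -math.inf)
        if lq == -math.inf:
            return math.inf  # candidate assigns no mass where baseline does
        kl += p * (lp - lq)
    return kl


baseline = {"yes": math.log(0.5), "no": math.log(0.5)}
shifted = {"yes": math.log(0.25), "no": math.log(0.75)}
print(kl_from_logprobs(baseline, baseline))
print(round(kl_from_logprobs(baseline, shifted), 4))
```

Averaging this value over a fixed golden set of prompts gives a single drift number to track across releases.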

Key takeaways

Speculative decoding proposes multiple tokens with a fast draft, then verifies them in one target forward — trading draft FLOPs for fewer serial target steps.
Speedup depends on acceptance rate α; misaligned drafts below ~50% acceptance often slow inference.
Correct rejection sampling preserves the target distribution; greedy verify is not interchangeable for sampled outputs.
Eagle/Medusa avoid a second full model when VRAM is tight; dual-model speculation wins with well-matched small checkpoints.
Harbor Support cut decode P50 28 ms → 11 ms with a 1.5B draft and 72% acceptance on support transcripts.