Conformal prediction explained
A classifier that says “defect” with 92% softmax confidence can still be wrong
15% of the time on your production line. Conformal prediction wraps
any underlying model — logistic regression, gradient boosting, a fine-tuned
vision transformer — and outputs sets or intervals with a
finite-sample guarantee: at user-chosen level 1 − α, the true
label or value falls inside the output at least that often, without assuming Gaussian
errors or a specific parametric family. This guide covers the exchangeability
assumption, nonconformity scores, split conformal and cross-conformal variants,
classification prediction sets versus regression intervals (including
conformalized quantile regression), adaptive and conditional coverage limits,
pairing with test-time compute abstain policies, a Harbor Analytics visual defect triage worked example, a method
decision table, common pitfalls, and a production checklist, alongside our
time-series forecasting guide and MLflow fundamentals explainer.
What conformal prediction guarantees (and what it does not)
Classical confidence intervals assume you know the noise distribution: homoscedastic Gaussian residuals, Poisson counts, and so on. Real ML pipelines routinely violate such assumptions, with heavy tails in fraud scores, multimodal errors in price forecasts, and label shift after deployment. Conformal methods are distribution-free: given exchangeable calibration and test points, they promise marginal coverage:
P(Y_{n+1} ∈ C(X_{n+1})) ≥ 1 − α
where C(X) is the prediction set or interval your procedure outputs
for input X. The guarantee holds for any base model, even a
black-box neural net, as long as calibration data are exchangeable with test data.
What conformal prediction does not guarantee:
- Conditional coverage — 90% marginal coverage can mask 60% coverage on a rare subgroup unless you use specialized variants (Mondrian, class-conditional, or weighted conformal).
- Optimal set size — coverage is enforced; efficiency (small sets) depends on how good your base model and score function are.
- Causality or fairness — conformal sets describe uncertainty under the observed distribution; they do not fix biased training data.
- Robustness to adversarial inputs — crafted perturbations can break exchangeability assumptions in security-sensitive settings.
Exchangeability: the hidden contract
Exchangeability means the joint distribution of your n + 1 points is
invariant to permutations — a weaker condition than i.i.d. but still violated
by strong temporal drift, feedback loops (model predictions change future labels),
or retraining on errors the model already made. In production, monitor calibration
set age and refresh when coverage on a held-out audit stream drops below target.
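The monitoring advice above can be sketched as a small rolling tracker. This is a minimal illustration, not a prescribed design: the class name `CoverageMonitor`, the window length, and the alarm floor are all hypothetical choices, and the audit stream is assumed to deliver true labels for a sample of production predictions.

```python
from collections import deque
from typing import Collection, Hashable, Optional


class CoverageMonitor:
    """Tracks empirical coverage over the most recent audited predictions."""

    def __init__(self, window: int, floor: float) -> None:
        # Each entry records whether the true label fell inside the set.
        self.hits = deque(maxlen=window)
        self.floor = floor

    def update(self, true_label: Hashable, prediction_set: Collection) -> None:
        self.hits.append(true_label in prediction_set)

    @property
    def coverage(self) -> Optional[float]:
        if not self.hits:
            return None
        return sum(self.hits) / len(self.hits)

    def needs_recalibration(self) -> bool:
        # Only raise the alarm once the window is full, to avoid noisy
        # decisions on a handful of audited points.
        full = len(self.hits) == self.hits.maxlen
        return full and self.coverage < self.floor


monitor = CoverageMonitor(window=4, floor=0.75)
for label, pred_set in [("a", {"a"}), ("b", {"b", "c"}), ("a", {"a"}), ("c", {"a"})]:
    monitor.update(label, pred_set)
```

A short window reacts quickly to drift but is noisy; the right size depends on audit volume and is a judgment call.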
Nonconformity scores: measuring “how weird” a prediction is
Every conformal procedure needs a nonconformity score
s(x, y) — how unusual it would be to see label y
alongside features x given what the model learned. Higher scores mean
the pair is stranger. Common choices:
- Classification (multiclass): s(x, y) = 1 − π_y(x), where π_y is the predicted probability of class y. Alternatives: log-loss margin, distance to decision boundary in embedding space.
- Regression: absolute residual |y − μ(x)| from point predictor μ, or a scaled residual that divides by an estimate of the local error scale (for example a predicted standard deviation).
- LLM classification: one minus normalized token probability of the chosen label string, or rank of the correct answer in a scored candidate list.
The score function is where domain knowledge enters. A good score makes true labels look “conforming” (low score) and mistakes look “nonconforming” (high score), which shrinks average set size without breaking the coverage proof.
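The three score families listed above can be written as small functions. This is a sketch in plain Python; the function names are hypothetical, and `probs`, `mu`, `scale`, and `candidate_scores` are assumed to come from whatever base model you already have.

```python
from typing import Sequence


def classification_score(probs: Sequence[float], label: int) -> float:
    """One minus the predicted probability of `label`."""
    return 1.0 - probs[label]


def regression_score(y: float, mu: float, scale: float = 1.0) -> float:
    """Absolute residual, optionally divided by a local scale estimate."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return abs(y - mu) / scale


def rank_score(candidate_scores: Sequence[float], label: int) -> int:
    """Rank of `label` among scored candidates (0 means top-scored)."""
    own = candidate_scores[label]
    return sum(1 for s in candidate_scores if s > own)
```

In every case a higher value means the (x, y) pair looks stranger to the model, which is the only property the coverage argument needs.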
Split conformal: the workhorse algorithm
Split conformal is the simplest production pattern:
- Split data into proper training (fit the base model) and calibration (never used for gradient updates).
- Train model f on proper training data.
- For each calibration point (x_i, y_i), compute s_i = s(x_i, y_i).
- Let q be the ⌈(n + 1)(1 − α)⌉ / n quantile of calibration scores (with finite-sample correction).
- At test time, output all labels y with s(x, y) ≤ q (classification) or the interval [μ(x) − q, μ(x) + q] (symmetric regression).
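The calibration and test-time steps above can be sketched in a few lines of plain Python. The helper names are hypothetical, the calibration scores are made-up numbers for illustration, and the base model is assumed to be already trained. The quantile is taken as the k-th smallest score with k = ⌈(n + 1)(1 − α)⌉, which matches the level in the steps; when k exceeds n the threshold is infinite and every label is included.

```python
import math
from typing import List, Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """k-th smallest calibration score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite threshold.
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(probs: Sequence[float], q: float) -> List[int]:
    """All labels whose score 1 - p is at most the threshold q."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= q]


def prediction_interval(mu: float, q: float) -> Tuple[float, float]:
    """Symmetric regression interval around the point prediction."""
    return (mu - q, mu + q)


# Illustrative calibration scores (not real data).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.50, 0.60, 0.70]
q = conformal_quantile(cal_scores, alpha=0.2)
labels = prediction_set([0.50, 0.45, 0.05], q)
```

At serving time only `q` is needed, so the threshold can be precomputed once per calibration run.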
The calibration set size trades off statistical tightness against data you can spend on training. Rule of thumb: hundreds to low thousands of calibration points for stable 90% coverage; fewer works but sets become wider and quantile estimates noisier.
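To put rough numbers on that rule of thumb: for split conformal with exchangeable data and continuous scores (no ties), the coverage achieved by one particular calibration draw is known to follow a Beta(n + 1 − l, l) distribution with l = ⌊(n + 1)α⌋. The sketch below, with a hypothetical function name, computes its mean and standard deviation so you can see how much realized coverage wobbles at a given n.

```python
import math
from typing import Tuple


def coverage_mean_std(n: int, alpha: float) -> Tuple[float, float]:
    """Mean and std of realized coverage under the Beta(n + 1 - l, l) result."""
    l = math.floor((n + 1) * alpha)
    a, b = n + 1 - l, l
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


for n in (100, 1000, 5000):
    mean, std = coverage_mean_std(n, alpha=0.1)
    print(f"n={n}: mean coverage {mean:.4f}, std {std:.4f}")
```

The mean never drops below 1 − α, while the spread shrinks roughly like 1/√n, which is why a few hundred to a few thousand calibration points is usually enough for a 90% target.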
Full conformal and cross-conformal
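Full conformal skips the separate calibration split: for each test input and
each candidate label, it refits the model on the data augmented with that candidate pair
and scores every point. It uses the data more efficiently, but it needs one refit per
candidate label per test point, so it is feasible only for cheap models.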
Cross-conformal (+) uses k-fold splits as a compromise:
each fold supplies calibration scores while others train, reducing waste relative
to a single split at moderate compute cost. Use cross-conformal when training data
is scarce and retraining is affordable (e.g. sklearn pipelines on tabular data).
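As one concrete instance of the k-fold idea, here is a sketch of a CV+ style regression interval, which follows the jackknife+ construction rather than the original cross-conformal p-value formulation. Everything here is illustrative: `fit_line` is a stand-in for any base learner, the data are synthetic, and the fold assignment by striding assumes the rows are already in random order. Note that the worst-case theoretical bound for CV+ is weaker than for split conformal (about 1 − 2α), although empirical coverage is typically close to 1 − α.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Ordinary least squares for y = a + b * x; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def cv_plus_interval(xs, ys, x_new: float, alpha: float,
                     n_folds: int = 5, fit=fit_line) -> Tuple[float, float]:
    n = len(xs)
    lows: List[float] = []
    highs: List[float] = []
    for f in range(n_folds):
        held = set(range(f, n, n_folds))  # indices held out in this fold
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        pred_new = model(x_new)
        for i in held:
            # Residual of a point the fold model never saw.
            r = abs(ys[i] - model(xs[i]))
            lows.append(pred_new - r)
            highs.append(pred_new + r)
    lows.sort()
    highs.sort()
    k_lo = math.floor(alpha * (n + 1))       # rank for the lower endpoint
    k_hi = math.ceil((1 - alpha) * (n + 1))  # rank for the upper endpoint
    lower = lows[k_lo - 1] if k_lo >= 1 else -math.inf
    upper = highs[k_hi - 1] if k_hi <= n else math.inf
    return lower, upper


# Synthetic data: y = 1 + 2x + unit Gaussian noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]
lower, upper = cv_plus_interval(xs, ys, x_new=5.0, alpha=0.1)
print(f"CV+ interval at x=5: [{lower:.2f}, {upper:.2f}]")
```

Every point contributes a residual from a model that did not train on it, so no data is set aside purely for calibration; the cost is `n_folds` fits instead of one.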
Prediction sets vs prediction intervals
Classification: sets instead of a single argmax
Instead of forcing one label, conformal classification returns a prediction
set C(x) ⊆ {1, …, K}. If the set has size 1, act
automatically. If size 2–3, escalate to a human or a stronger model. If the
set is empty (rare with standard scores) or equals all classes, abstain or route to
manual review. Set size is a natural uncertainty dial for
workflows that pair with LLM-as-judge arbitration on ambiguous cases.
Regression: symmetric and conformalized quantile regression (CQR)
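Symmetric intervals around μ(x) have the same width for every input and assume errors are
evenly spread above and below the prediction; real targets are often skewed and
heteroscedastic. Conformalized quantile regression
(CQR) trains low and high quantile models q_{α/2}
and q_{1−α/2}, then measures on a calibration set how far true values fall
outside the predicted band and shifts both bounds by the resulting quantile, widening
(or tightening) intervals just enough to hit coverage.
CQR handles heteroscedasticity better than symmetric intervals and is a common default for
forecast intervals on non-stationary demand or price series.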
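The conformalization step of CQR can be sketched as follows. The function names are hypothetical, and the lower and upper quantile predictions are assumed to come from quantile models you have already trained (for example gradient boosting with a quantile loss); the numbers are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def cqr_correction(q_lo: Sequence[float], q_hi: Sequence[float],
                   y: Sequence[float], alpha: float) -> float:
    """Conformal quantile of the CQR scores on the calibration set.

    The score is positive when y falls outside [q_lo, q_hi] and negative
    when it sits inside, so the correction can widen or tighten the band.
    """
    scores = sorted(max(lo - t, t - hi) for lo, hi, t in zip(q_lo, q_hi, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else math.inf


def cqr_interval(q_lo: float, q_hi: float, correction: float) -> Tuple[float, float]:
    """Shift both predicted quantiles outward by the correction."""
    return (q_lo - correction, q_hi + correction)


# Illustrative calibration predictions and targets (not real data).
cal_lo = [1.0, 2.0, 3.0, 4.0]
cal_hi = [3.0, 4.0, 5.0, 6.0]
cal_y = [2.0, 4.5, 2.5, 5.0]
corr = cqr_correction(cal_lo, cal_hi, cal_y, alpha=0.2)
interval = cqr_interval(10.0, 14.0, corr)
```

Because the band width still comes from the quantile models, intervals stay narrow where the models predict low spread and wide where they predict high spread.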
Adaptive coverage (APS, RAPS)
The naive score 1 − π_y(x) applies a single probability cutoff to every input, so set
sizes adapt poorly to difficulty: hard inputs tend to be under-covered and easy ones
over-covered. Adaptive Prediction Sets (APS)
and regularized variants (RAPS) accumulate sorted class probabilities until a
calibrated threshold is met, yielding smaller sets when the model is confident and
wider sets when it is not, still with marginal coverage guarantees.
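A deterministic version of the APS score and set construction can be sketched like this. It leaves out the randomization used in the full APS procedure and the regularization term of RAPS, so treat it as a simplified illustration; the function names are hypothetical, and the threshold `q` is assumed to be the conformal quantile of `aps_score` values on a calibration set.

```python
from typing import List, Sequence


def _descending_order(probs: Sequence[float]) -> List[int]:
    # Class indices sorted from most to least probable (stable on ties).
    return sorted(range(len(probs)), key=lambda k: -probs[k])


def aps_score(probs: Sequence[float], label: int) -> float:
    """Probability mass of all classes ranked at or above `label`."""
    total = 0.0
    for k in _descending_order(probs):
        total += probs[k]
        if k == label:
            return total
    raise ValueError("label is not a valid class index")


def aps_set(probs: Sequence[float], q: float) -> List[int]:
    """All classes whose APS score is at most the threshold q."""
    chosen: List[int] = []
    total = 0.0
    for k in _descending_order(probs):
        total += probs[k]
        if total > q:
            break  # this class and every lower-ranked one exceed q
        chosen.append(k)
    return chosen
```

Without randomization this variant can return an empty set when the top probability alone exceeds `q`; a serving policy should decide how to treat that case.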
Worked example: Harbor Analytics visual defect triage
Harbor Analytics runs a conveyor-line camera model that classifies PCB snapshots into ok, scratch, missing_component, or solder_bridge. A false “ok” ships defective boards; a false alarm stops the line for manual inspection ($2,400/hour). Their baseline EfficientNet hits 94% top-1 accuracy but only 88% recall on solder_bridge — unacceptable without uncertainty handling.
Setup: 40k labeled images; 32k train the model, 4k early-stop
validation, 4k conformal calibration (stratified by class). Score:
s(x, y) = 1 − πy(x) with APS for set construction.
Target 1 − α = 0.95 marginal coverage on calibration.
Policy:
- |C(x)| = 1 and top class ≠ ok → auto-reject lane.
- |C(x)| = 1 and top class = ok → auto-accept (the 95% guarantee is marginal: it bounds how often the true label is left out of the set across all boards, not the miss rate among auto-accepted boards, so still monitor).
- |C(x)| ≥ 2 → human review station; prioritize if “ok” ∈ C(x) alongside a defect class.
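The policy above maps directly onto a small routing function. This is a sketch only: the lane names are hypothetical, and the handling of an empty set is an assumption, since the policy as stated does not cover it.

```python
from typing import Set


def route(pred_set: Set[str]) -> str:
    """Map a conformal prediction set to a handling lane."""
    if len(pred_set) == 1:
        # Singleton set: act automatically on the only remaining label.
        return "auto_accept" if "ok" in pred_set else "auto_reject"
    if len(pred_set) >= 2:
        # "ok" next to a defect class is the riskiest ambiguity.
        return "human_review_priority" if "ok" in pred_set else "human_review"
    # Empty set: assumed to go to manual review (not specified in the policy).
    return "human_review"
```

For example, `route({"ok", "solder_bridge"})` returns `"human_review_priority"`, while `route({"scratch"})` returns `"auto_reject"`.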
On a two-week pilot, auto-handled fraction rose from 61% to 78% while missed-defect rate on auto-accepted units fell from 1.2% to 0.4% because ambiguous boards were routed to humans. Average set size was 1.3 classes. They log empirical coverage weekly in MLflow alongside ECE calibration curves and trigger recalibration when audit coverage drops below 93%.
Method decision table
| Problem | Recommended conformal variant | Avoid |
|---|---|---|
| Multiclass tabular, plenty of data | Split conformal + APS/RAPS | Ignoring rare-class conditional coverage |
| Small tabular dataset | Cross-conformal+ | Full conformal on deep nets |
| Heteroscedastic regression | CQR intervals | Symmetric ±q around mean only |
| Time-series forecast | Rolling calibration window + CQR | Single static calibration set under drift |
| LLM label selection | Score over candidate label strings; set-size abstain | Treating softmax of one token as calibrated prob |
| Subgroup fairness audit | Mondrian / group-conditional conformal | Assuming marginal coverage protects minorities |
| Real-time SLA < 50ms | Precompute quantile; vectorized score | Retraining full conformal per request |
| Distribution shift detected | Weighted conformal or fresh calibration | Stale calibration scores after deployment change |
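
For the time-series row in the table, a rolling calibration window can be sketched as below. The class name and window size are hypothetical, and this is a heuristic: recomputing the threshold on recent scores tracks drift, but it does not by itself restore the exchangeability guarantee.

```python
import math
from collections import deque


class RollingCalibrator:
    """Keeps the most recent scores and recomputes the conformal threshold."""

    def __init__(self, window: int, alpha: float) -> None:
        self.scores = deque(maxlen=window)  # oldest scores drop out first
        self.alpha = alpha

    def add(self, score: float) -> None:
        self.scores.append(score)

    def threshold(self) -> float:
        n = len(self.scores)
        k = math.ceil((n + 1) * (1 - self.alpha))
        if k > n:
            return math.inf  # not enough recent scores yet
        return sorted(self.scores)[k - 1]


cal = RollingCalibrator(window=5, alpha=0.2)
empty_threshold = cal.threshold()
for s in [1.0, 2.0, 3.0, 4.0, 5.0]:
    cal.add(s)
before_shift = cal.threshold()
for s in [10.0, 10.0, 10.0]:  # residuals grow after a shift
    cal.add(s)
after_shift = cal.threshold()
```

Sorting on every call is fine for small windows; a latency-sensitive service would cache the threshold and refresh it on a schedule.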
Common pitfalls
- Data leakage — tuning the base model on the same points used for calibration destroys the proof; hold out a sacred calibration split.
- Test-time covariate shift — cameras moved, new supplier, seasonal demand; marginal coverage silently fails. Monitor and refresh.
- Confusing confidence with calibration — a 99% softmax score is not a 99% frequency unless the model is calibrated; conformal fixes this at the set level, not by trusting raw logits.
- Oversized sets treated as precision — returning all ten classes satisfies coverage but is useless; track average set size and efficiency.
- Conditional coverage blind spot — 95% overall can mean 80% on a critical rare defect; slice metrics by class and line.
- Feedback loops — if humans only label escalated cases, future training data are biased; log auto-accepted outcomes for audit sampling.
- Wrong score for the task — margin scores on imbalanced data inflate sets for head classes; try class-conditional normalization.
- No abstain path — forcing a decision when |C(x)| is large negates the point; pair with human or heavier inference on escalations.
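One way to address the conditional coverage blind spot is class-conditional (Mondrian) calibration, where each class gets its own threshold from calibration points of that class. The sketch below uses the simple score of one minus the class probability; the function names are hypothetical, and always including classes that have no calibration points is a conservative assumption.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else math.inf


def class_thresholds(cal_scores: Sequence[float], cal_labels: Sequence[int],
                     alpha: float) -> Dict[int, float]:
    """One conformal threshold per class, from that class's own scores."""
    by_class = defaultdict(list)
    for s, y in zip(cal_scores, cal_labels):
        by_class[y].append(s)
    return {y: conformal_quantile(v, alpha) for y, v in by_class.items()}


def class_conditional_set(probs: Sequence[float],
                          thresholds: Dict[int, float]) -> List[int]:
    # Classes never seen in calibration get an infinite threshold (assumption).
    return [y for y, p in enumerate(probs)
            if 1.0 - p <= thresholds.get(y, math.inf)]


# Illustrative calibration scores of the true labels (not real data).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
thr = class_thresholds(scores, labels, alpha=0.2)
```

Rare classes have few calibration points, so their thresholds are noisy or infinite; expect larger sets for them in exchange for per-class coverage.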
Production checklist
- Reserve calibration data before any hyperparameter search on the base model.
- Choose α from business cost of errors vs review volume (not default 0.05 because textbooks say so).
- Implement nonconformity score + quantile threshold as a versioned artifact.
- Log set size, empirical coverage, and per-class coverage on an audit stream.
- Define escalation policy for |C(x)| > 1 before launch; train reviewers.
- Schedule recalibration triggers (coverage drop, drift alarm, quarterly).
- Benchmark against naive softmax thresholding — conformal should win on risk at comparable automation rate.
- For regression, plot interval width vs feature buckets to catch heterogeneity.
- Document exchangeability assumptions and known violations (seasonality, etc.).
- Store calibration sets and quantiles in model registry for reproducibility.
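The logging item in the checklist can be sketched as one function that summarizes an audit batch. The function name and output keys are hypothetical; the resulting dictionary is plain data that you could pass to whatever tracking tool you use.

```python
from collections import defaultdict
from typing import Collection, Dict, Hashable, Sequence


def audit_metrics(true_labels: Sequence[Hashable],
                  pred_sets: Sequence[Collection]) -> Dict[str, object]:
    """Overall coverage, average set size, and coverage sliced by true class."""
    n = len(true_labels)
    hits = defaultdict(int)
    counts = defaultdict(int)
    for y, s in zip(true_labels, pred_sets):
        counts[y] += 1
        hits[y] += int(y in s)
    return {
        "coverage": sum(hits.values()) / n,
        "avg_set_size": sum(len(s) for s in pred_sets) / n,
        "per_class_coverage": {y: hits[y] / counts[y] for y in counts},
    }


metrics = audit_metrics(
    ["ok", "ok", "scratch", "scratch"],
    [{"ok"}, {"ok", "scratch"}, {"ok"}, {"scratch"}],
)
```

Here the overall coverage looks acceptable while one class is covered only half the time, which is exactly the pattern the per-class slice is meant to expose.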
Key takeaways
- Conformal prediction turns any model into one with finite-sample marginal coverage guarantees at level 1 − α.
- Nonconformity scores measure how strange each (x, y) pair is; better scores yield smaller prediction sets.
- Split conformal is the default production path; CQR is the regression workhorse under heteroscedastic noise.
- Prediction set size is an actionable uncertainty signal for abstain, escalate, or auto-act policies.
- Coverage is not eternal — monitor, slice by subgroup, and recalibrate when the world shifts.
Related reading
- LLM test-time compute explained — spend extra inference on hard cases after conformal abstain
- Time-series forecasting explained — forecast intervals and drift under CQR
- MLflow fundamentals explained — log coverage metrics and calibration artifacts
- LLM-as-judge explained — arbitrate ambiguous prediction sets