
Conformal prediction explained

A classifier that says “defect” with 92% softmax confidence can still be wrong 15% of the time on your production line, because softmax scores are not calibrated frequencies. Conformal prediction wraps any underlying model (logistic regression, gradient boosting, a fine-tuned vision transformer) and outputs sets or intervals with a finite-sample guarantee: at a user-chosen level 1 − α, the true label or value falls inside the output with probability at least 1 − α, without assuming Gaussian errors or a specific parametric family. This guide covers the exchangeability assumption, nonconformity scores, split conformal and cross-conformal variants, classification prediction sets versus regression intervals (including conformalized quantile regression), adaptive and conditional coverage limits, pairing set size with abstain and test-time compute policies, a Harbor Analytics visual defect triage worked example, a method decision table, common pitfalls, and a production checklist. It sits alongside our time-series forecasting guide and MLflow fundamentals explainer.

What conformal prediction guarantees (and what it does not)

Classical confidence and prediction intervals assume you know the noise distribution: homoscedastic Gaussian residuals, Poisson counts, and so on. Real ML pipelines routinely violate such assumptions, with heavy tails in fraud scores and multimodal errors in price forecasts. Conformal methods are distribution-free: given exchangeable calibration and test points, they promise marginal coverage:

P(Y_{n+1} ∈ C(X_{n+1})) ≥ 1 − α

where C(X) is the prediction set or interval your procedure outputs for input X. The guarantee holds for any base model, even a black-box neural net, as long as calibration data are exchangeable with test data.

What conformal prediction does not guarantee:

Conditional coverage — 90% marginal coverage can mask 60% coverage on a rare subgroup unless you use specialized variants (Mondrian, class-conditional, or weighted conformal).
Optimal set size — coverage is enforced; efficiency (small sets) depends on how good your base model and score function are.
Causality or fairness — conformal sets describe uncertainty under the observed distribution; they do not fix biased training data.
Robustness to adversarial inputs — crafted perturbations can break exchangeability assumptions in security-sensitive settings.

Exchangeability: the hidden contract

Exchangeability means the joint distribution of your n + 1 points is invariant to permutations — a weaker condition than i.i.d. but still violated by strong temporal drift, feedback loops (model predictions change future labels), or retraining on errors the model already made. In production, monitor calibration set age and refresh when coverage on a held-out audit stream drops below target.

Nonconformity scores: measuring “how weird” a prediction is

Every conformal procedure needs a nonconformity score s(x, y) — how unusual it would be to see label y alongside features x given what the model learned. Higher scores mean the pair is stranger. Common choices:

Classification (multiclass) — s(x, y) = 1 − π_y(x) where π_y is the predicted probability of class y. Alternatives: log-loss margin, distance to decision boundary in embedding space.
Regression — absolute residual |y − μ(x)| from point predictor μ, or scaled residual dividing by an estimated local variance.
LLM classification — one minus normalized token probability of the chosen label string, or rank of the correct answer in a scored candidate list.

The score function is where domain knowledge enters. A good score makes true labels look “conforming” (low score) and mistakes look “nonconforming” (high score), which shrinks average set size without breaking the coverage proof.
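
The score choices listed above can be sketched as small functions. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the scaled regression score assumes the local spread estimate is a positive number such as a predicted standard deviation.

```python
def classification_score(probs, label):
    """1 - pi_y(x): low when the model puts high probability on `label`."""
    return 1.0 - probs[label]


def regression_score(y, mu):
    """Absolute residual |y - mu(x)| from a point prediction mu."""
    return abs(y - mu)


def scaled_regression_score(y, mu, spread, eps=1e-8):
    """Residual divided by a local spread estimate (assumed positive)."""
    return abs(y - mu) / (spread + eps)


def rank_score(candidate_scores, label):
    """Rank of `label` in a scored candidate list (1 = top-ranked)."""
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    return ranked.index(label) + 1


probs = {"ok": 0.8, "scratch": 0.2}
print(classification_score(probs, "ok"))       # likely label: low score
print(classification_score(probs, "scratch"))  # unlikely label: high score
```

Any of these can be plugged into the same calibration step; only the efficiency of the resulting sets changes, not the validity of the coverage argument.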

Split conformal: the workhorse algorithm

Split conformal is the simplest production pattern:

Split data into proper training (fit the base model) and calibration (never used for gradient updates).
Train model f on proper training data.
For each calibration point (x_i, y_i), compute s_i = s(x_i, y_i).
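Let q be the ⌈(n + 1)(1 − α)⌉ / n empirical quantile of the calibration scores, that is, the ⌈(n + 1)(1 − α)⌉-th smallest score; the (n + 1) factor is the finite-sample correction. If that rank exceeds n, set q = ∞ so the output is the full label set or the whole real line.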
At test time, output all labels y with s(x, y) ≤ q (classification) or the interval [μ(x) − q, μ(x) + q] (symmetric regression).
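
The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the calibration probabilities and labels are made up, the helper names are hypothetical, and the base model is represented only by its predicted probabilities.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points for this alpha
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(probs, q):
    """All labels whose score 1 - pi_y(x) is at most q."""
    return [label for label, p in probs.items() if 1.0 - p <= q]


def symmetric_interval(mu, q):
    """Regression interval [mu - q, mu + q] around a point prediction."""
    return (mu - q, mu + q)


# Toy calibration split: model probabilities plus the true labels.
cal_probs = [
    {"ok": 0.90, "scratch": 0.10},
    {"ok": 0.60, "scratch": 0.40},
    {"ok": 0.20, "scratch": 0.80},
    {"ok": 0.55, "scratch": 0.45},
]
cal_labels = ["ok", "scratch", "scratch", "ok"]

cal_scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]
q_hat = conformal_quantile(cal_scores, alpha=0.25)

print(prediction_set({"ok": 0.7, "scratch": 0.3}, q_hat))
print(prediction_set({"ok": 0.5, "scratch": 0.5}, q_hat))
```

At serving time only `q_hat` and the score function are needed, which is why the threshold can be precomputed once per calibration run.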

The calibration set size trades off statistical tightness against data you can spend on training. Rule of thumb: hundreds to low thousands of calibration points for stable 90% coverage; fewer works but sets become wider and quantile estimates noisier.
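
To see why calibration size matters, a small simulation helps. The sketch below assumes scores are i.i.d. Uniform(0, 1), so the coverage achieved by a threshold q is simply q; that assumption and all names are illustrative. For continuous scores, the realized coverage for a given calibration set follows a Beta distribution whose spread shrinks as n grows, and the simulation shows the same effect empirically.

```python
import math
import random
import statistics


def conformal_quantile(scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def realized_coverages(n, alpha, trials, rng):
    """Coverage achieved by each of `trials` independent calibration sets.

    With Uniform(0, 1) scores, P(score <= q) = q, so the threshold itself
    is the coverage that calibration set delivers on future data.
    """
    out = []
    for _ in range(trials):
        q = conformal_quantile([rng.random() for _ in range(n)], alpha)
        out.append(min(1.0, q))
    return out


rng = random.Random(0)
mean_by_n, spread_by_n = {}, {}
for n in (50, 500, 2000):
    cov = realized_coverages(n, alpha=0.1, trials=300, rng=rng)
    mean_by_n[n] = statistics.mean(cov)
    spread_by_n[n] = statistics.stdev(cov)
    print(n, round(mean_by_n[n], 3), round(spread_by_n[n], 3))
```

The average coverage stays near the target at every n; what changes is how far any single calibration run can land from it.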

Full conformal and cross-conformal

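Full conformal skips the calibration split: for each test input and each candidate label y, it refits the model on the training data augmented with (x, y) and keeps y if its score is not unusual among the scores of the training points. It uses all the data for both fitting and calibration, but needs a refit per test point and candidate label, so it is feasible only for cheap models. Cross-conformal (and its CV+ relative) uses k-fold splits as a compromise: each fold supplies out-of-fold calibration scores while the other folds train, reducing waste relative to a single split at moderate compute cost. Its worst-case guarantee is weaker than split conformal's 1 − α, although empirical coverage is typically close to the target. Use cross-conformal when training data is scarce and retraining is affordable (e.g. sklearn pipelines on tabular data).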
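
The k-fold idea can be sketched with a cheap model. The example below is a simplification, not the exact cross-conformal or CV+ construction: it pools out-of-fold absolute residuals from a one-dimensional least-squares fit and centers the interval on a final model trained on all points, so it does not inherit the split-conformal proof. The data and helper names are made up for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def out_of_fold_scores(xs, ys, k):
    """Absolute residual of each point under a model that never saw it."""
    n = len(xs)
    scores = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in range(fold, n, k):  # indices held out in this fold
            scores[i] = abs(ys[i] - (slope * xs[i] + intercept))
    return scores


rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

scores = out_of_fold_scores(xs, ys, k=5)
q_hat = conformal_quantile(scores, alpha=0.1)
slope, intercept = fit_line(xs, ys)  # final model uses every point
center = slope * 4.0 + intercept
interval = (center - q_hat, center + q_hat)
print(interval)
```

Every point contributes both to training (in k − 1 folds) and to calibration (in one), which is the data-efficiency gain over a single split.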

Prediction sets vs prediction intervals

Classification: sets instead of a single argmax

Instead of forcing one label, conformal classification returns a prediction set C(x) ⊆ {1, …, K}. If the set has size 1, act automatically. If size 2–3, escalate to a human or a stronger model. If the set is empty (rare with standard scores) or equals all classes, abstain or route to manual review. Set size is a natural uncertainty dial for workflows that pair with LLM-as-judge arbitration on ambiguous cases.

Regression: symmetric and conformalized quantile regression (CQR)

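Symmetric intervals around μ(x) have the same width everywhere and assume errors are evenly spread; real targets are often skewed or heteroscedastic. Conformalized quantile regression (CQR) trains low and high quantile models q_{α/2} and q_{1−α/2}, then measures on a calibration set how far each true value falls outside the predicted band and shifts both ends by the conformal quantile of those scores, widening (or occasionally narrowing) intervals just enough to hit coverage. CQR handles heteroscedasticity better and is a common default for forecast intervals on demand or price series; under non-stationarity, pair it with a rolling calibration window, because drift breaks exchangeability.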
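
The CQR adjustment can be sketched as follows, assuming the two quantile models already exist and are represented here only by their predictions. The calibration numbers and function names are invented for illustration. The score is the signed distance by which the true value falls outside the predicted band, so it is negative when the band is wider than it needs to be.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def cqr_scores(lo_preds, hi_preds, ys):
    """max(lo - y, y - hi): positive outside the band, negative inside."""
    return [max(lo - y, y - hi) for lo, hi, y in zip(lo_preds, hi_preds, ys)]


def cqr_interval(lo, hi, q):
    """Shift both ends by q; a negative q narrows the band."""
    return (lo - q, hi + q)


# Toy calibration split: predicted lower/upper quantiles and true values.
cal_lo = [1.0, 2.0, 3.0, 4.0]
cal_hi = [2.0, 3.0, 4.0, 5.0]
cal_y = [1.5, 3.5, 2.8, 4.2]

q_hat = conformal_quantile(cqr_scores(cal_lo, cal_hi, cal_y), alpha=0.25)
print(cqr_interval(10.0, 12.0, q_hat))
```

Because the band width still comes from the quantile models, intervals stay narrow where the models predict little spread and wide where they predict a lot; the conformal step only adds one global correction.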

Adaptive coverage (APS, RAPS)

The naive score 1 − π_y(x) tends to give small sets on average but adapts poorly to input difficulty: it can under-cover hard inputs while over-covering easy ones. Adaptive Prediction Sets (APS) and regularized variants (RAPS) accumulate sorted class probabilities until a calibrated threshold is met, yielding smaller sets when the model is confident and larger sets when it is not, still with marginal coverage guarantees. RAPS adds a penalty on low-ranked classes to keep APS sets from growing too large.
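
A deterministic variant of the APS score can be sketched as below. Published APS also randomizes whether the boundary class is included, which this sketch omits, and the threshold 0.92 is a placeholder standing in for a quantile computed from calibration APS scores of the true labels. Class names follow the worked example later in this guide; everything else is hypothetical.

```python
def aps_scores(probs):
    """Map each label to the cumulative mass of labels ranked at or above it.

    Labels are ranked by descending probability (ties broken by name).
    """
    total = 0.0
    scores = {}
    for label, p in sorted(probs.items(), key=lambda kv: (-kv[1], kv[0])):
        total += p
        scores[label] = total
    return scores


def aps_set(probs, q):
    """All labels whose cumulative-mass score is at most q."""
    return [label for label, s in aps_scores(probs).items() if s <= q]


q_hat = 0.92  # placeholder; in practice the conformal quantile of
              # aps_scores(probs)[true_label] over the calibration split

confident = {"ok": 0.90, "scratch": 0.05,
             "missing_component": 0.03, "solder_bridge": 0.02}
uncertain = {"ok": 0.40, "scratch": 0.35,
             "missing_component": 0.15, "solder_bridge": 0.10}

print(aps_set(confident, q_hat))
print(aps_set(uncertain, q_hat))
```

The same threshold yields a singleton for the confident input and a three-class set for the uncertain one, which is the adaptivity the text describes.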

Worked example: Harbor Analytics visual defect triage

Harbor Analytics runs a conveyor-line camera model that classifies PCB snapshots into ok, scratch, missing_component, or solder_bridge. A false “ok” ships defective boards; a false alarm stops the line for manual inspection ($2,400/hour). Their baseline EfficientNet hits 94% top-1 accuracy but only 88% recall on solder_bridge — unacceptable without uncertainty handling.

Setup: 40k labeled images; 32k train the model, 4k early-stop validation, 4k conformal calibration (stratified by class). Score: APS, built on the same class probabilities π_y(x) as the plain 1 − π_y(x) score. Target: 1 − α = 0.95 marginal coverage, with the threshold fit on the calibration split.

Policy:

|C(x)| = 1 and top class ≠ ok → auto-reject lane.
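|C(x)| = 1 and top class = ok → auto-accept. The marginal guarantee caps how often the true label is missing from the set at 5% across all boards; it does not cap the defect rate among auto-accepted boards specifically, so keep monitoring that lane.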
|C(x)| ≥ 2 → human review station; prioritize if “ok” ∈ C(x) alongside a defect class.
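
The policy above maps directly onto a small routing function. The lane names are hypothetical, and sending an empty set to human review is an assumption, since the policy as stated only covers sets of size one or more.

```python
DEFECT_CLASSES = {"scratch", "missing_component", "solder_bridge"}


def route(pred_set):
    """Map a conformal prediction set to a handling lane."""
    labels = set(pred_set)
    if len(labels) == 1:
        return "auto_accept" if labels == {"ok"} else "auto_reject"
    if "ok" in labels and labels & DEFECT_CLASSES:
        # The model cannot rule out shipping a defect: review first.
        return "human_review_priority"
    # Covers two or more defect classes, and (by assumption) empty sets.
    return "human_review"


print(route(["ok"]))
print(route(["solder_bridge"]))
print(route(["ok", "scratch"]))
print(route(["scratch", "solder_bridge"]))
```

Keeping the policy as a pure function of the set makes it easy to version alongside the calibration threshold and to replay on logged sets when α changes.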

On a two-week pilot, the auto-handled fraction rose from 61% to 78% while the missed-defect rate on auto-accepted units fell from 1.2% to 0.4%, because ambiguous boards were routed to humans. Average set size was 1.3 classes. They log empirical coverage weekly in MLflow alongside expected calibration error (ECE) and calibration curves, and trigger recalibration when audit coverage drops below 93%.
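
A recalibration trigger of this kind can be sketched as a rolling window over audited boards. The class name, window size, and the choice to wait for a full window before alerting are assumptions for illustration; only the 93% floor comes from the example above.

```python
from collections import deque


class CoverageMonitor:
    """Rolling empirical coverage over the most recent audited items."""

    def __init__(self, window, floor):
        self.hits = deque(maxlen=window)  # True when the set held the label
        self.floor = floor

    def update(self, pred_set, true_label):
        self.hits.append(true_label in pred_set)

    def coverage(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def needs_recalibration(self):
        # Wait for a full window so a few early misses do not raise alarms.
        full = len(self.hits) == self.hits.maxlen
        return full and self.coverage() < self.floor


monitor = CoverageMonitor(window=10, floor=0.93)
for _ in range(9):
    monitor.update(["ok"], "ok")
monitor.update(["ok"], "scratch")  # one miss in ten audited boards
print(monitor.coverage(), monitor.needs_recalibration())
```

In practice the window should be large enough that sampling noise alone rarely crosses the floor; a window of ten is used here only to keep the example readable.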

Method decision table

Problem	Recommended conformal variant	Avoid
Multiclass tabular, plenty of data	Split conformal + APS/RAPS	Ignoring rare-class conditional coverage
Small tabular dataset	Cross-conformal+	Full conformal on deep nets
Heteroscedastic regression	CQR intervals	Symmetric ±q around mean only
Time-series forecast	Rolling calibration window + CQR	Single static calibration set under drift
LLM label selection	Score over candidate label strings; set-size abstain	Treating softmax of one token as calibrated prob
Subgroup fairness audit	Mondrian / group-conditional conformal	Assuming marginal coverage protects minorities
Real-time SLA < 50ms	Precompute quantile; vectorized score	Retraining full conformal per request
Distribution shift detected	Weighted conformal or fresh calibration	Stale calibration scores after deployment change

Common pitfalls

Data leakage — tuning the base model on the same points used for calibration destroys the proof; hold out a sacred calibration split.
Test-time covariate shift — cameras moved, new supplier, seasonal demand; marginal coverage silently fails. Monitor and refresh.
Confusing confidence with calibration — a 99% softmax score is not a 99% frequency unless the model is calibrated; conformal fixes this at the set level, not by trusting raw logits.
Oversized sets treated as precision — returning all ten classes satisfies coverage but is useless; track average set size and efficiency.
Conditional coverage blind spot — 95% overall can mean 80% on a critical rare defect; slice metrics by class and line.
Feedback loops — if humans only label escalated cases, future training data are biased; log auto-accepted outcomes for audit sampling.
Wrong score for the task — margin scores on imbalanced data inflate sets for head classes; try class-conditional normalization.
No abstain path — forcing a decision when |C(x)| is large negates the point; pair with human or heavier inference on escalations.
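
The conditional coverage blind spot can be made concrete with class-conditional (Mondrian) thresholds: calibrate one quantile per true class instead of one pooled quantile. The sketch below uses the 1 − π_y(x) score and invented calibration scores; treating a class with no calibration data as always included is a conservative assumption, not part of any standard recipe.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def class_thresholds(cal_scores, cal_labels, alpha):
    """One conformal quantile per true class."""
    by_class = defaultdict(list)
    for s, y in zip(cal_scores, cal_labels):
        by_class[y].append(s)
    return {y: conformal_quantile(s, alpha) for y, s in by_class.items()}


def class_conditional_set(probs, thresholds):
    """Keep label y when 1 - pi_y(x) is within that class's threshold."""
    return [y for y, p in probs.items()
            if 1.0 - p <= thresholds.get(y, math.inf)]


# Toy scores: the rare defect class is harder, so its scores run higher.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.50, 0.60, 0.70]
cal_labels = ["ok"] * 4 + ["solder_bridge"] * 4

pooled_q = conformal_quantile(cal_scores, alpha=0.25)
per_class = class_thresholds(cal_scores, cal_labels, alpha=0.25)

probs = {"ok": 0.65, "solder_bridge": 0.35}
pooled_set = [y for y, p in probs.items() if 1.0 - p <= pooled_q]
print(pooled_set, class_conditional_set(probs, per_class))
```

On this toy input the pooled threshold drops the rare defect class from the set while the per-class threshold keeps it, at the cost of needing enough calibration points in every class.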

Production checklist

Reserve calibration data before any hyperparameter search on the base model.
Choose α from business cost of errors vs review volume (not default 0.05 because textbooks say so).
Implement nonconformity score + quantile threshold as a versioned artifact.
Log set size, empirical coverage, and per-class coverage on an audit stream.
Define escalation policy for |C(x)| > 1 before launch; train reviewers.
Schedule recalibration triggers (coverage drop, drift alarm, quarterly).
Benchmark against naive softmax thresholding — conformal should win on risk at comparable automation rate.
For regression, plot interval width vs feature buckets to catch heterogeneity.
Document exchangeability assumptions and known violations (seasonality, etc.).
Store calibration sets and quantiles in model registry for reproducibility.
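
One way to treat the score and threshold as a versioned artifact is a small immutable record serialized next to the model. The field names and example values below are hypothetical, and the sketch deliberately uses plain JSON rather than any particular registry API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConformalArtifact:
    """Everything needed to reproduce prediction sets at serving time."""

    model_version: str   # base model the scores were computed with
    score_name: str      # e.g. "one_minus_prob" or "aps"
    alpha: float         # target miscoverage level
    n_calibration: int   # size of the calibration split
    q_hat: float         # calibrated threshold
    calibrated_on: str   # ISO date of the calibration run

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


# Placeholder values for illustration only.
artifact = ConformalArtifact(
    model_version="example-v1",
    score_name="aps",
    alpha=0.05,
    n_calibration=4000,
    q_hat=0.97,
    calibrated_on="2000-01-01",
)
restored = ConformalArtifact.from_json(artifact.to_json())
print(restored == artifact)
```

Pinning the threshold to a model version matters because a quantile calibrated on one model's scores says nothing about another model's scores.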

Key takeaways

Conformal prediction turns any model into one with finite-sample marginal coverage guarantees at level 1 − α.
Nonconformity scores measure how strange each (x, y) pair is; better scores yield smaller prediction sets.
Split conformal is the default production path; CQR is the regression workhorse under heteroscedastic noise.
Prediction set size is an actionable uncertainty signal for abstain, escalate, or auto-act policies.
Coverage is not eternal — monitor, slice by subgroup, and recalibrate when the world shifts.