Feature selection explained
A credit-risk model with 400 engineered columns trains slowly, overfits noise, and confuses auditors who ask why browser_timezone_offset predicts default. Feature selection answers a practical question: which inputs actually carry signal, and which can you drop without hurting generalization? Unlike feature engineering, which creates new signals, selection prunes the set you already have. The payoff is faster training, lower inference cost, simpler explanations, and often better out-of-sample performance when redundant or leaky columns are removed. This guide covers filter, wrapper, and embedded methods, multicollinearity diagnostics, cross-validation-safe pipelines, a Harbor Payments credit-scoring worked example, a method decision table, pitfalls, and a production checklist — building on machine learning fundamentals and cross-validation discipline.
Why selection beats using every column
More features are not automatically better. Each additional dimension increases the volume of input space your model must cover, a problem known as the curse of dimensionality. Sparse regions get filled by interpolation that does not generalize. Redundant features inflate variance in linear models and split importance in trees without adding information. Noisy columns act as random memorization hooks, especially when the number of features p approaches or exceeds the number of training rows n.
Selection also serves operations: fewer features mean smaller serialized models, lower latency at scoring time, and cheaper feature-store backfills. For regulated domains, a 25-feature fraud or credit model is easier to document than a 400-column black box. The goal is not minimalism for its own sake — it is keeping columns that improve validation metrics while dropping those that only help the training set.
Selection vs extraction
Feature selection keeps a subset of original columns. Feature extraction (PCA, autoencoders) builds new combined dimensions that are harder to interpret but can compress correlated groups. Start with selection when interpretability and compliance matter; reach for extraction when you need aggressive compression and can sacrifice per-column explanations.
Filter methods: score columns before training
Filters rank or threshold features using statistics computed on the data alone — no iterative model fitting. They are fast, parallelizable, and a good first pass on high-dimensional tabular data.
Variance and missingness thresholds
Drop near-constant columns: if 99.8% of rows share the same value, the feature cannot discriminate. Likewise remove columns with excessive missing rates unless missingness itself is informative (encode as a flag, then re-evaluate).
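As a rough illustration of these two thresholds, the sketch below flags columns whose most common value covers at least 99.8% of rows or whose missing rate exceeds 40%. The helper name dead_columns, the demo column names, and the 40% cutoff are assumptions for this example, not fixed rules.

```python
import numpy as np
import pandas as pd


def dead_columns(df: pd.DataFrame, max_dominant_share: float = 0.998,
                 max_missing_rate: float = 0.40) -> list[str]:
    """Return columns that are near-constant or mostly missing."""
    flagged = []
    for col in df.columns:
        missing_rate = df[col].isna().mean()
        # Share of rows taken by the single most common value (NaN counts as a value).
        counts = df[col].value_counts(dropna=False, normalize=True)
        dominant_share = counts.iloc[0] if len(counts) else 1.0
        if missing_rate > max_missing_rate or dominant_share >= max_dominant_share:
            flagged.append(col)
    return flagged


# Tiny synthetic frame; the column names are illustrative only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, 1_000),
    "is_emulator": [0] * 999 + [1],                    # 99.9% of rows are zero
    "fax_number": [np.nan] * 600 + list(range(400)),   # 60% missing
})
print(dead_columns(demo))
```

Before dropping a mostly-missing column, it is worth adding an is-missing flag and checking whether that flag carries signal, as described above.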
Correlation and redundancy
For numeric pairs with |r| > 0.95, keep one representative. Pearson correlation only measures linear association, so it can understate nonlinear relationships: rank correlation (Spearman) catches monotonic ties Pearson misses, and mutual information also catches non-monotonic dependence.
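One way to apply the 0.95 rule is to scan the upper triangle of the correlation matrix and drop the later column of each offending pair. This is a minimal sketch using pandas; the function name drop_correlated and the demo columns are hypothetical, and which member of a pair to keep is ultimately a domain decision.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95,
                    method: str = "spearman") -> list[str]:
    """Return columns to drop so that no remaining pair exceeds the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is too correlated with any earlier column.
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
monthly = rng.lognormal(8, 0.5, 500)
demo = pd.DataFrame({
    "monthly_income": monthly,
    "annual_income": monthly * 12,     # same signal on a different scale
    "log_income": np.log(monthly),     # nonlinear but monotonic transform
    "age": rng.integers(18, 80, 500),
})
print(drop_correlated(demo))
```

Passing method="pearson" lets you compare how the two correlation measures treat the monotonic but nonlinear log_income column.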
Chi-square and mutual information
For classification, chi-square tests independence between each feature and the label (works on count data and binned numerics). Mutual information (MI) measures how much knowing a feature reduces uncertainty about the target; it captures nonlinear relationships chi-square may miss. sklearn.feature_selection.SelectKBest with mutual_info_classif is a common pattern: keep the top-k scorers, then train your real model on the subset.
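The SelectKBest pattern might look like the following sketch. The synthetic data from make_classification, the choice of k=10, and the fixed random_state are assumptions made only to keep the example reproducible.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a wide table: 40 columns, 5 of them informative.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# mutual_info_classif uses a randomized estimator, so pin its seed.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=10)

# Fit on training rows only, then apply the same column mask to test rows.
selector.fit(X_train, y_train)
kept = selector.get_support(indices=True)
X_train_small = selector.transform(X_train)
X_test_small = selector.transform(X_test)
print(kept, X_train_small.shape)
```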
Filters ignore feature interactions: two weak columns might combine into a strong signal that univariate MI never surfaces. That is why filters are often stage one, not the final word.
Wrapper methods: search with your actual model
Wrappers treat feature subsets as a search problem. They train (or partially train) a model on candidate subsets and score them with cross-validation. They are usually more accurate than filters on interaction-heavy problems but far more expensive: an exhaustive search over p features covers 2^p subsets, so practical wrappers rely on greedy search.
Recursive feature elimination (RFE)
Train a model with all features, rank by importance (coefficient magnitude for linear models, feature_importances_ for tree ensembles), drop the weakest, repeat until k features remain. RFECV wraps RFE in cross-validation to pick k automatically. RFE with logistic regression or gradient boosting is a workhorse for tabular credit and fraud pipelines.
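A small RFECV run with logistic regression could be sketched as follows. The synthetic dataset, the ROC AUC scoring choice, and min_features_to_select=3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, n_features=25, n_informative=5,
                           n_redundant=2, random_state=0)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),  # ranks features by |coef_|
    step=1,                                       # drop one feature per round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    min_features_to_select=3,
)
rfecv.fit(X, y)

print("features kept:", rfecv.n_features_)
print("mask:", rfecv.support_)
```

The ranking_ attribute records the elimination order, which is useful when you need to explain why a borderline column was dropped.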
Sequential forward and backward selection
Forward selection starts empty and adds the feature that most improves validation score. Backward elimination starts full and removes the least helpful. Both are greedy — they can miss optimal combinations — but with 50–150 candidates they often beat univariate filters on structured data.
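scikit-learn exposes both greedy directions through SequentialFeatureSelector. The sketch below runs forward selection on synthetic data; the target of 5 features and the scoring metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=4,
                           n_redundant=0, random_state=0)

forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",   # "backward" starts full and removes features instead
    scoring="roc_auc",
    cv=5,
)
forward.fit(X, y)

selected = forward.get_support(indices=True)
print(selected)
```

Each round refits the model once per remaining candidate and fold, which is why the cost grows quickly with the number of candidates.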
Cost control
Limit search depth: run filters first to cut 400 columns to 80, then RFE on the survivors. Use stratified k-fold with a consistent random seed. Log every subset score so you can audit why a column survived.
Embedded methods: selection inside training
Embedded methods bake sparsity or importance into the learning algorithm itself. No separate search loop — regularization or tree structure does the pruning.
L1 (Lasso) and elastic net
Lasso regression penalizes the sum of absolute coefficients. Irrelevant features shrink exactly to zero, yielding a sparse linear model. Elastic net mixes L1 and L2, which works better when many correlated features should survive as a group rather than one arbitrary winner. After scaling numeric features, fit Lasso with cross-validated alpha (LassoCV in scikit-learn) and keep columns with nonzero coefficients.
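The scale-then-LassoCV recipe can be sketched as below. It uses a synthetic regression target from make_regression as an assumption; for a binary label, an L1-penalized logistic regression plays the same role.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=50, n_informative=8,
                       noise=10.0, random_state=0)

# Scaling first puts every coefficient under the same penalty.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
coefs = lasso.coef_
kept = np.flatnonzero(coefs)   # indices of columns with nonzero coefficients

print("chosen alpha:", lasso.alpha_)
print("columns kept:", len(kept), "of", X.shape[1])
```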
Tree-based importance
Random forests and gradient boosting report split-gain importance. Train once, threshold on cumulative importance (e.g., keep features summing to 95% of total gain). Beware: importance favors high-cardinality columns; use permutation importance on a held-out set to validate that a “top” feature actually hurts metrics when shuffled.
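A cumulative-importance cut followed by a permutation check might look like this sketch. The random forest settings, the synthetic data, and the 95% cutoff are assumptions; note that scikit-learn's feature_importances_ is impurity-based.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Sort by importance and keep the shortest prefix that reaches 95% of the total.
order = np.argsort(forest.feature_importances_)[::-1]
cumulative = np.cumsum(forest.feature_importances_[order])
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
kept = order[:n_keep]

# Cross-check on held-out rows: how much does AUC drop when a column is shuffled?
perm = permutation_importance(forest, X_hold, y_hold, scoring="roc_auc",
                              n_repeats=10, random_state=0)
for idx in kept:
    print(idx, round(forest.feature_importances_[idx], 3),
          round(perm.importances_mean[idx], 3))
```

Columns that rank high on impurity importance but show a permutation drop near zero are candidates for removal.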
Regularized generalized linear models
Logistic regression with L1 penalty performs embedded selection for classification. For wide sparse text matrices, linear SVMs with L1 are an alternative. Embedded paths shine when p >> n and you need one training pass.
Multicollinearity and stability
Highly correlated features do not always hurt tree models, but they destabilize linear coefficients: small data shifts flip signs and magnitudes. The variance inflation factor (VIF) measures how much the variance of coefficient j is inflated by correlation with the other columns: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. Rule of thumb: investigate VIF above 5 to 10; drop or combine collinear groups.
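To make the VIF definition concrete, here is a minimal NumPy sketch that regresses each column on the others and converts the resulting R^2 into a VIF. The function name vif and the synthetic income and age columns are assumptions for illustration; statsmodels ships a comparable helper if you prefer a library call.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus every other column.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res.
        out[j] = ss_tot / ss_res if ss_res > 0 else np.inf
    return out


rng = np.random.default_rng(0)
monthly_income = rng.normal(4_000, 1_000, 500)
annual_income = 12 * monthly_income + rng.normal(0, 2_000, 500)  # near-duplicate
age = rng.normal(40, 10, 500)

vifs = vif(np.column_stack([monthly_income, annual_income, age]))
print(dict(zip(["monthly_income", "annual_income", "age"], vifs.round(1))))
```

In this toy setup the two income columns inflate each other while age stays close to 1, which is the pattern that should trigger a manual keep-one decision.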
Domain knowledge resolves ambiguity filters cannot: annual_income and monthly_income are redundant — keep one. Pair VIF screening with business semantics before trusting automated drops.
Worked example: Harbor Payments credit default model
Harbor Payments trains a logistic default model on 380 applicant features: bureau tradelines, cash-flow aggregates, device fingerprints, and merchant category one-hots. Offline AUC is 0.91 with all columns; production inference p95 latency is 42 ms — too slow for real-time checkout.
Stage 1: filter pass
Drop 47 near-zero-variance device flags. Remove 12 columns with >40% missing (no missingness signal after encoding). Chi-square on binned numerics eliminates 89 weak category dummies. MI keeps top 120 of the remainder. Runtime: seconds.
Stage 2: embedded LassoCV
Standard-scale continuous columns and frequency-encode high-cardinality merchant categories inside a pipeline. LassoCV with 5-fold stratified CV zeros out 61 of 120 features, leaving 59. Validation AUC: 0.908, a negligible loss going from 380 to 59 columns.
Stage 3: RFE confirmation
RFECV with logistic regression on the 59 survivors suggests 44 features are sufficient (AUC 0.907). Permutation importance on the holdout month confirms debt_to_income, months_on_file, and avg_daily_balance_90d drive most gain. Three device timezone columns survived the earlier stages, but permutation importance shows near-zero impact, so they are removed, leaving 41 features.
Outcome
Final model: 41 features, validation AUC 0.907, inference p95 11 ms. Compliance documentation lists each retained column with MI rank, Lasso coefficient sign, and permutation delta-AUC. Retrain monthly inside a scikit-learn Pipeline so selection steps refit only on training folds.
Method decision table
| Method | Speed | Captures interactions | Best when |
|---|---|---|---|
| Variance / missing filter | Very fast | No | First pass on any wide table; removes obvious dead columns |
| Mutual information / chi-square | Fast | No (univariate) | Classification with hundreds of candidates; quick shortlist |
| RFE / sequential search | Slow | Yes | Moderate feature count after filtering; need model-aware subset |
| Lasso / elastic net | Medium | Partial (linear) | High-dimensional linear models; p >> n sparse solutions |
| Tree importance + permutation | Medium | Yes | Nonlinear tabular; validate importance with shuffle tests |
| PCA / autoencoders | Medium | Yes (latent) | Interpretability optional; aggressive compression needed |
CV-safe pipelines (non-negotiable)
Fitting a selector on the full dataset before cross-validation leaks label information into every fold. The fix: nest selection inside each training fold.
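    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Every step is refit on each training fold, so the selector never sees held-out labels.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        # L1-penalized logistic regression zeros out weak coefficients; SelectFromModel keeps the rest.
        ("select", SelectFromModel(LogisticRegressionCV(penalty="l1", solver="saga", cv=5))),
        ("clf", LogisticRegressionCV(cv=5)),
    ])

    # X, y: your feature matrix and binary labels.
    scores = cross_val_score(pipe, X, y, cv=5)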
cross_val_score(pipe, X, y, cv=5) now reflects honest generalization. Persist the entire pipeline with joblib so production applies the same scale-then-select-then-predict steps fitted on the final training window.
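To see how large the leak can be, the sketch below scores a classifier on pure noise twice: once with the selector fitted on all rows before cross-validation, and once with the selector nested in a pipeline. The data is random by construction, and the choice of SelectKBest with f_classif and k=20 is an assumption made for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))    # pure noise features
y = rng.integers(0, 2, size=100)    # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label before cross-validation starts.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000),
                        X_leaky, y, cv=cv).mean()

# Honest: the selector is refit inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"leaky accuracy:  {leaky:.2f}")
print(f"honest accuracy: {honest:.2f}")
```

Because no column carries real signal, the honest estimate should hover around chance, while the leaky one tends to look far better than it has any right to.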
Common pitfalls
- Selecting on the test set: any tuning of k or thresholds using holdout data inflates reported metrics; use nested CV or a dedicated validation split.
- Univariate filters on interacting features: pairwise MI or wrapper search may be required when signal lives in combinations.
- Ignoring leakage columns: features computed after the prediction point (post-default collections activity) score high in MI but fail in production; audit timelines before selection.
- Trusting default tree importance: high-cardinality categoricals dominate split counts; confirm with permutation importance.
- Unscaled Lasso on mixed units: income in dollars and age in years are penalized on different scales; always scale before L1.
- Stable feature myth across retrains: monthly retrains may swap borderline columns; monitor set overlap and coefficient sign stability.
- Dropping protected proxies carelessly: zip code may proxy for demographics; selection does not remove fairness obligations.
Production checklist
- Document baseline metrics with all features vs selected subset on the same CV splits.
- Run variance, missingness, and MI filters as stage one on wide tables.
- Check VIF on linear model finalists; resolve redundant business duplicates manually.
- Nest selectors inside Pipeline; never fit selectors outside CV loops.
- Validate surviving features with permutation importance on a recent holdout month.
- Log selected feature names, version, and selection method in the model registry.
- Measure inference latency and model size before and after pruning.
- Monitor feature-set drift: alert when >20% of columns change between retrains.
- Publish a model card listing each retained column and why it survived.
- Re-run selection when label definition or data source schema changes.
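The drift alert in the checklist needs a concrete definition of "columns changed". One possible choice, shown here as an assumption rather than a standard, is the share of the combined feature set that was added or removed (one minus the Jaccard overlap). The function name and the feature names beyond those in the worked example are hypothetical.

```python
def changed_fraction(previous: set[str], current: set[str]) -> float:
    """Share of the combined feature set that was added or removed."""
    union = previous | current
    if not union:
        return 0.0
    return len(previous ^ current) / len(union)


last_month = {"debt_to_income", "months_on_file", "avg_daily_balance_90d",
              "utilization_ratio", "inquiries_6m"}
this_month = {"debt_to_income", "months_on_file", "avg_daily_balance_90d",
              "utilization_ratio", "open_tradelines"}

drift = changed_fraction(last_month, this_month)
alert = drift > 0.20   # threshold taken from the checklist above
print(f"changed fraction: {drift:.2f}, alert: {alert}")
```

Logging the added and removed names alongside the fraction makes the alert actionable during model review.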
Key takeaways
- Feature selection prunes; engineering creates. Use both on wide tabular problems: engineer first, then select.
- Filters are fast first passes; wrappers and embedded methods capture model-specific signal. Combine stages rather than picking one religion.
- Nested pipelines prevent selection leakage: the selector must refit inside each training fold.
- Permutation importance validates tree rankings; never ship based on split gain alone.
- Fewer strong features beat hundreds of weak ones for speed, stability, and auditability.
Related reading
- Feature engineering explained — encoding, scaling, and building signals before selection
- Overfitting and cross-validation explained — honest validation while tuning feature subsets
- Data leakage in machine learning explained — why some high-MI columns must never ship
- scikit-learn fundamentals explained — pipelines, selectors, and model persistence