Naive Bayes explained
Your inbox routes obvious spam to junk before you read it. Behind many of those filters sits a classifier that is almost embarrassingly simple on paper yet remarkably effective in production: naive Bayes. Given a new email, the model asks: “Which label, spam or not spam, is most probable given the observed words?” That question is Bayes theorem applied to classification. The “naive” part is a deliberate shortcut: assume every word appears independently of every other word, given the class. Real language violates that assumption constantly (“not” and “free” are not independent), yet naive Bayes often competes with far heavier models on text, especially when training data is limited and features are high-dimensional. This guide covers the math intuition, the Gaussian, multinomial, and Bernoulli variants, Laplace smoothing, log-space numerics, a worked Harbor Support ticket triage example, a model comparison table, pitfalls, and a production checklist. It builds on machine learning fundamentals alongside logistic regression and support vector machines.
Bayes theorem for classification
For class label C and feature vector x = (x1, …, xd),
Bayes theorem gives:
P(C | x) = P(x | C) · P(C) / P(x)
P(C) is the prior — how often each class appears in
training data. P(x | C) is the likelihood of observing features
given the class. P(x) is the evidence, identical for all classes
at prediction time, so you can ignore it when comparing labels. The classifier picks
the class with the highest posterior P(C | x).
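As a small illustration of the formula, the sketch below computes a posterior for a single binary feature. The prior and likelihood numbers are invented for the example and do not come from any real mail corpus.

```python
# Hypothetical numbers: 20% of mail is spam, and the word "free"
# appears in 60% of spam messages and 5% of ham messages.
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {"spam": 0.6, "ham": 0.05}  # P("free" present | class)

# Numerator of Bayes theorem for each class: P(x | C) * P(C)
joint = {c: likelihood[c] * prior[c] for c in prior}

# Evidence P(x) is the same for every class; it only rescales the scores.
evidence = sum(joint.values())
posterior = {c: joint[c] / evidence for c in joint}

# The argmax can be taken on the unnormalized scores.
predicted = max(joint, key=joint.get)
print(posterior, predicted)
```

Dividing by the evidence changes the scale of the scores but not their ranking, which is why the prediction can be read off the unnormalized products.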
The naive independence assumption
Estimating the joint likelihood P(x | C) directly requires counting
every combination of features — impossible when d is thousands
of word counts. Naive Bayes factorizes:
P(x | C) ≈ ∏i P(xi | C)
Each feature’s distribution is learned separately per class. Training reduces to counting: for text, how often word refund appears in billing tickets versus password-reset tickets. Prediction multiplies (or sums logs of) those per-feature probabilities. Wrong independence assumption, fast estimation, often good enough — especially when the decision boundary cares about which features lean toward which class, not perfect joint modeling.
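A from-scratch sketch can make "training reduces to counting" concrete. The four tickets, the class names, and the helper functions `likelihood` and `score` are made up for illustration; the estimates are deliberately unsmoothed.

```python
from collections import Counter

# Toy labeled tickets (invented for illustration).
train = [
    ("billing", "refund charge refund invoice"),
    ("billing", "invoice charge card"),
    ("password", "reset password login"),
    ("password", "password login refund"),
]

word_counts = {}        # class -> Counter of token counts
doc_counts = Counter()  # class -> number of documents
for label, text in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(text.split())

def likelihood(word, label):
    """Unsmoothed P(word | class): the word's share of the class's tokens."""
    counts = word_counts[label]
    return counts[word] / sum(counts.values())

def score(text, label):
    """P(C) times the product of per-word likelihoods (naive factorization)."""
    p = doc_counts[label] / sum(doc_counts.values())
    for word in text.split():
        p *= likelihood(word, label)
    return p

scores = {label: score("refund charge", label) for label in word_counts}
print(scores)
```

In this toy data "charge" never appears in a password ticket, so the password score collapses to exactly zero. That is the failure mode the Laplace smoothing section of this guide addresses.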
Why log probabilities matter
Multiplying hundreds of probabilities in (0, 1) underflows floating point to zero.
Implementations work in log space: log P(C | x) = log P(C) + ∑i log P(xi | C) − log P(x), where the last term is the same for every class and can be dropped when comparing labels.
Libraries like scikit-learn expose predict_log_proba for exactly this reason. Because the logarithm is monotonic, the argmax of log posteriors equals the argmax of posteriors.
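The underflow is easy to reproduce. The sketch below uses 100 identical, made-up likelihoods of 1e-4 and compares the direct product with the sum of logs.

```python
import math

probs = [1e-4] * 100  # 100 hypothetical per-word likelihoods

product = 1.0
for p in probs:
    product *= p  # the true value 1e-400 is below the smallest double

log_sum = sum(math.log(p) for p in probs)  # roughly -921, easily representable

print(product)  # underflows to 0.0
print(log_sum)
```

Once every class's product has underflowed to zero, the classes can no longer be ranked. The log-space sums stay finite and comparable.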
Three common variants
The “naive” independence shell is shared; what changes is how each
P(xi | C) is modeled.
Multinomial naive Bayes
The default for bag-of-words text and document counts. Features
are non-negative counts, typically integer term frequencies; fractional weights such as TF-IDF are sometimes passed in and treated as counts.
Each class maintains a multinomial over the vocabulary: P(word | C)
is the fraction of tokens in class-C documents belonging to that word.
Use MultinomialNB when input is word counts, n-gram counts, or
positive-valued sparse vectors.
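A minimal sketch of this variant with scikit-learn, assuming a four-document toy corpus and class names invented for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "refund my last charge",
    "charge appeared twice on invoice",
    "cannot reset my password",
    "password reset link expired",
]
labels = ["billing", "billing", "access", "access"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of token counts
clf = MultinomialNB().fit(X, labels)

# Words outside the training vocabulary ("not", "working") are ignored
# by the vectorizer at prediction time.
pred = clf.predict(vectorizer.transform(["password reset not working"]))
print(pred[0])
```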
Bernoulli naive Bayes
Features are binary: word present or absent, not how many times.
Better for short texts where presence matters more than frequency (“congratulations
you won” once is enough). Use BernoulliNB on binarized
document-term matrices.
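A comparable sketch for the Bernoulli variant, again on invented data. Here `CountVectorizer(binary=True)` produces the presence/absence matrix, so repeating "won" three times in one message counts once.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = [
    "congratulations you won won won",
    "you won a prize",
    "meeting moved to monday",
    "see you at the meeting",
]
labels = ["spam", "spam", "ham", "ham"]

# binary=True records presence (1) or absence (0), not frequency.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
clf = BernoulliNB().fit(X, labels)

pred = clf.predict(vectorizer.transform(["you won"]))
print(X.max(), pred[0])
```

Unlike the multinomial model, the Bernoulli likelihood also scores the words that are absent from a document, which is part of why it can behave differently on short texts.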
Gaussian naive Bayes
Features are continuous. Each feature per class is modeled as a
Gaussian with its own mean and variance: P(xi | C) = N(xi; μ_{i,C}, σ²_{i,C}), where μ_{i,C} and σ²_{i,C} are the mean and variance of feature i within class C.
Use GaussianNB on low-to-moderate dimensional numeric tabular data
when each feature is roughly bell-shaped within each class. It is a poor fit for heavy-tailed financial
ratios without transforms.
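The sketch below fits GaussianNB on synthetic data drawn from two well-separated normal distributions. The data, seed, and class means are assumptions chosen so the per-class means are easy to read back from the fitted model.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Two synthetic classes, each with two features and unit variance.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

clf = GaussianNB().fit(X, y)

# theta_ holds the learned mean of each feature within each class.
print(clf.theta_)

pred = clf.predict([[2.8, 3.1], [-0.2, 0.4]])
print(pred)
```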
Laplace (additive) smoothing
A word never seen in spam training data gets P(word | spam) = 0,
zeroing the entire product. Laplace smoothing (additive smoothing)
adds pseudo-counts α (often 1) to every feature value before
normalizing:
P(xi | C) = (count(xi, C) + α) / (total count in C + α · V)
where V is vocabulary size. α = 1 is Laplace;
smaller α stays closer to raw counts; larger α
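pulls estimates toward uniform. Always smooth text models unless you are willing to let a single word that never co-occurred with a class in training drive that class's posterior to zero.
scikit-learn exposes this as the alpha hyperparameter on MultinomialNB and BernoulliNB. GaussianNB has no alpha; its var_smoothing parameter instead adds a small value to the variances for numerical stability.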
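To check the smoothing formula against a library, the sketch below computes the smoothed estimates by hand for one class and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The 3-word count matrix is a toy assumption.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents; columns are counts for a 3-word vocabulary (toy data).
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 0, 3]])
y = np.array([0, 0, 1])
alpha = 1.0
V = X.shape[1]  # vocabulary size

# Class 0 word counts are [3, 1, 0]; the third word was never seen in class 0.
counts_c0 = X[y == 0].sum(axis=0)
manual = (counts_c0 + alpha) / (counts_c0.sum() + alpha * V)

clf = MultinomialNB(alpha=alpha).fit(X, y)
from_sklearn = np.exp(clf.feature_log_prob_[0])

print(manual)        # 4/7, 2/7, 1/7
print(from_sklearn)
```

The unseen word ends up with probability 1/7 instead of 0, so a document containing it no longer zeroes the class score.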
Class priors and imbalance
P(C) defaults to empirical class frequencies. With 98% legitimate
email, the model biases toward “ham.” For rare-positive detection
(fraud, abuse), set class_prior explicitly or pass
fit_prior=False, which replaces the empirical frequencies with a uniform prior. Then tune the decision threshold
on a validation set using precision-recall curves rather than relying on the default argmax (0.5 in the binary case).
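The sketch below shows the three prior settings side by side and one possible way to pick a threshold from a precision-recall curve. The synthetic count data, the helper `make_split`, the 0.7/0.3 prior, and the "recall of at least 0.90" target are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

def make_split(n_ham, n_fraud):
    """Synthetic 3-feature counts; fraud leans on the third feature."""
    ham = rng.poisson(lam=[3, 1, 1], size=(n_ham, 3))
    fraud = rng.poisson(lam=[1, 1, 3], size=(n_fraud, 3))
    X = np.vstack([ham, fraud])
    y = np.array([0] * n_ham + [1] * n_fraud)
    return X, y

X_train, y_train = make_split(950, 50)
X_val, y_val = make_split(950, 50)

empirical = MultinomialNB().fit(X_train, y_train)                 # priors from data
uniform = MultinomialNB(fit_prior=False).fit(X_train, y_train)    # uniform prior
explicit = MultinomialNB(class_prior=[0.7, 0.3]).fit(X_train, y_train)

for name, model in [("empirical", empirical), ("uniform", uniform), ("explicit", explicit)]:
    print(name, np.exp(model.class_log_prior_))

# Choose the threshold with the best precision among those reaching the recall target.
scores = uniform.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, scores)
meets_recall = recall[:-1] >= 0.90  # the last curve point has no threshold
threshold = thresholds[meets_recall][np.argmax(precision[:-1][meets_recall])]
print(threshold)
```

Changing the prior shifts every posterior by the same log-offset per class, so prior adjustment and threshold tuning are closely related levers; in practice the threshold is the one tuned against validation data.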
Text preprocessing pipeline
Naive Bayes quality depends more on representation than on tuning twelve hyperparameters.
- Tokenization: lowercase, strip punctuation, optionally stem or lemmatize. Keep it consistent between train and serve.
- Stop words: removing “the” and “and” can help or hurt depending on domain; evaluate on validation, do not assume.
- N-grams: bigrams capture “not happy”; vocabulary explodes, so cap max_features or use chi-squared selection.
- TF-IDF vs counts: multinomial NB traditionally uses raw counts; TF-IDF weighting with NB is debated. If you use TF-IDF, treat values as non-negative features and test against count baselines.
- Vocabulary bounds: CountVectorizer(max_features=50000, min_df=5) drops rare noise and controls memory.
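The preprocessing choices above can be sketched with `CountVectorizer` plus chi-squared feature selection. The four documents and sentiment labels are invented, and `min_df=1`, `max_features=50`, and `k=10` are scaled down so the toy corpus is not pruned to nothing; a real corpus would use larger bounds such as those quoted in the list.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "Not happy with the refund",
    "Happy with the quick refund",
    "Not happy, login is broken",
    "Login works and I am happy",
]
labels = ["negative", "positive", "negative", "positive"]

# Lowercasing plus unigrams and bigrams, with vocabulary bounds.
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2),
                             min_df=1, max_features=50)
X = vectorizer.fit_transform(docs)

# Feature names ordered by column index.
names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Keep the 10 features most associated with the labels under a chi-squared test.
selector = SelectKBest(chi2, k=10).fit(X, labels)
kept = [name for name, keep in zip(names, selector.get_support()) if keep]
print(len(names), kept)
```

The bigram "not happy" becomes its own feature, which lets a bag-of-words model see the negation that unigrams alone would miss.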
Worked example: Harbor Support ticket routing
Harbor Support receives 14,000 tickets per week across four queues: billing, technical, account access, and sales. Agents manually tag each ticket; leadership wants a first-pass auto-router to cut median time-to-first-response by 30%.
Data and features
- Historical set: 82,000 labeled tickets (subject + first customer message).
- Preprocessing: lowercase, remove HTML, CountVectorizer with unigrams + bigrams, max_features=40000, min_df=3.
- Holdout: stratified 70/15/15 train/validation/test by week to catch vocabulary drift.
Model choice and results
The team trains MultinomialNB(alpha=0.5) inside a sklearn
Pipeline with the vectorizer. Validation macro-F1 = 0.81; billing
and technical classes exceed 0.88 F1. Account-access tickets that mention
“password” and “2FA” route correctly; confusion concentrates
between billing and sales when customers write “cancel my subscription
and upgrade plan” in one message.
They compare against linear SVM (macro-F1 0.84, 40 ms inference) and a small fine-tuned transformer (macro-F1 0.87, 220 ms on CPU). Naive Bayes hits macro-F1 0.81 at 2 ms per ticket on the same hardware with a 12 MB serialized pipeline — acceptable for edge routing where top-2 suggestions go to agents, not fully automated closure.
Production rules: if max posterior < 0.55, route to human triage; log
predict_proba for weekly calibration checks. When “OAuth”
spikes after a product launch, retrain weekly until posteriors stabilize.
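One way the routing rule might look in code is sketched below. The eight tickets are stand-ins, since the Harbor Support data is not available here; the `route` helper and the `"human_triage"` label are hypothetical names; `max_features` and `min_df` are omitted because they would prune a corpus this small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

POSTERIOR_FLOOR = 0.55  # floor quoted in the production rules

texts = [
    "refund for duplicate charge on invoice",
    "update credit card for invoice",
    "app crashes when exporting report",
    "api returns error 500",
    "forgot password and lost 2fa device",
    "cannot login password reset fails",
    "pricing for enterprise plan",
    "upgrade plan and add seats",
]
queues = ["billing", "billing", "technical", "technical",
          "account access", "account access", "sales", "sales"]

pipeline = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("nb", MultinomialNB(alpha=0.5)),
])
pipeline.fit(texts, queues)

def route(ticket):
    """Return the chosen queue and the top-2 suggestions with posteriors."""
    proba = pipeline.predict_proba([ticket])[0]
    classes = pipeline.named_steps["nb"].classes_
    order = np.argsort(proba)[::-1][:2]
    top2 = [(classes[i], float(proba[i])) for i in order]
    queue = top2[0][0] if top2[0][1] >= POSTERIOR_FLOOR else "human_triage"
    return {"queue": queue, "top2": top2}

result = route("password reset and 2fa")
print(result)
```

Returning the top two classes alongside the decision keeps the model in a suggestion role: agents see the runner-up queue even when the posterior clears the floor.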
Model comparison table
| Algorithm | Strengths | Weaknesses vs naive Bayes |
|---|---|---|
| Multinomial naive Bayes | Extremely fast train/serve; strong text baseline; closed-form counts | — |
| Logistic regression | Fits all feature weights jointly, so correlated features are not double-counted; usually better-calibrated probabilities | Slower on huge sparse vocab without careful solvers; needs more tuning |
| Linear SVM | Max-margin; good on sparse high-dim text | No native probabilities; heavier than NB for streaming retrains |
| Gradient-boosted trees | Best on mixed numeric/categorical tabular | Poor default for raw bag-of-words; large models |
| Transformers (BERT etc.) | State-of-art accuracy on nuanced language | GPU cost, latency, data hunger; overkill for coarse routing |
Practical rule: start with multinomial naive Bayes for text routing, spam, and sentiment baselines. If validation F1 is within a few points of heavier models and latency or interpretability matter, ship naive Bayes. Reach for transformers when semantic nuance (sarcasm, long context, cross-lingual) dominates errors.
Common pitfalls
- Wrong variant for feature type: Gaussian NB on sparse word counts fits poorly, and multinomial NB cannot take negative inputs such as standardized or centered features (scikit-learn rejects them; TF-IDF itself is never negative).
- No smoothing: zero counts for unseen words kill posteriors; always set alpha > 0 for text.
- Data leakage in vectorizer: fit CountVectorizer only on training folds inside CV pipelines, not on the full corpus before split.
- Ignoring class imbalance: majority-class prior hides rare abuse; tune thresholds on precision-recall, not accuracy.
- Train-serve skew: different tokenization in production (HTML entities, emoji stripping) shifts word distributions silently.
- Overfitting vocabulary: retaining every hapax legomenon memorizes typos; use min_df and max_features.
- Assuming independence holds: bigrams or logistic regression may be needed when negation and phrase order drive errors.
- Comparing raw probabilities across retrains: recalibrate or monitor ranking metrics; absolute posteriors drift when priors change.
Production checklist
- Pick variant: multinomial (counts), Bernoulli (binary), or Gaussian (continuous).
- Wrap vectorizer + model in a single Pipeline; persist with joblib.
- Stratified split by time if vocabulary drifts; walk-forward validation for weekly retrains.
- Set alpha via grid search on validation macro-F1 or cost-weighted loss.
- Configure class priors or decision thresholds for rare-positive classes.
- Expose top-2 classes and confidence when routing to humans below a posterior floor.
- Log misroutes with raw text (redacted) for label audit and vocabulary updates.
- Monitor OOV rate and per-class F1; alert when billing F1 drops >5 points week-over-week.
- Document preprocessing (stemming, n-grams) in the model card for compliance.
- Benchmark latency on production CPU; naive Bayes should be single-digit milliseconds per doc.
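Several checklist items (single Pipeline, alpha grid search on macro-F1, joblib persistence) can be combined in one short sketch. The corpus, the alpha grid, the two-fold split, and the file name `router.joblib` are assumptions for the example. Because the vectorizer sits inside the pipeline, each cross-validation fold refits it on that fold's training portion only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "refund duplicate charge", "invoice shows wrong charge",
    "update card on invoice", "refund not received",
    "reset my password", "password link expired",
    "locked out after login", "login code never arrives",
]
labels = ["billing"] * 4 + ["access"] * 4

pipeline = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])

# Grid over the smoothing strength, scored by macro-F1.
search = GridSearchCV(pipeline, {"nb__alpha": [0.1, 0.5, 1.0]},
                      scoring="f1_macro", cv=2)
search.fit(texts, labels)
print(search.best_params_)

# Persist vectorizer and model together so serving uses identical preprocessing.
path = os.path.join(tempfile.mkdtemp(), "router.joblib")
joblib.dump(search.best_estimator_, path)
restored = joblib.load(path)

sample = ["password reset link"]
print(restored.predict(sample))
```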
Key takeaways
- Naive Bayes is Bayes theorem with a factorized likelihood: training is counting, prediction is summing log probabilities.
- Choose the variant to match features: multinomial for text counts, Bernoulli for presence, Gaussian for continuous.
- Laplace smoothing is mandatory for open vocabularies; tune alpha on validation.
- Speed and interpretability are the wins: inspect per-class word odds ratios for debugging routes.
- Benchmark against logistic regression and linear SVM before jumping to transformers on text routing tasks.
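As a sketch of the interpretability point, per-word log-odds between two classes can be read from a fitted `MultinomialNB` via `feature_log_prob_`. The documents and class names below are invented; with more than two classes the same subtraction would be done per pair of classes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "refund refund charge", "refund invoice charge",
    "password reset login", "password login locked",
]
labels = ["billing", "billing", "access", "access"]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(docs), labels)

# Vocabulary ordered by column index so it lines up with feature_log_prob_.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
billing = list(clf.classes_).index("billing")
access = list(clf.classes_).index("access")

# log P(word | billing) - log P(word | access): positive values favor billing.
log_odds = clf.feature_log_prob_[billing] - clf.feature_log_prob_[access]
order = np.argsort(log_odds)[::-1]
top_billing = [vocab[i] for i in order[:3]]
print(top_billing)
```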
Related reading
- Bayesian inference explained — priors, posteriors, and MCMC beyond naive factorization
- Logistic regression explained — linear classifiers with calibrated log-loss training
- Support vector machines explained — max-margin alternatives on sparse text
- scikit-learn fundamentals explained — pipelines, vectorizers, and model persistence