Naive Bayes explained

Your inbox routes obvious spam to junk before you read it. Behind many of those filters sits a classifier that is almost embarrassingly simple on paper yet remarkably effective in production: naive Bayes. Given a new email, the model asks: “Which label, spam or not spam, is most probable given the observed words?” That question is Bayes theorem applied to classification. The “naive” part is a deliberate shortcut: assume every word appears independently of every other word, given the class. Real language violates that assumption constantly (“not” and “free” are not independent), yet naive Bayes often competes with far heavier models on text, especially when training data is limited and features are high-dimensional. This guide covers the math intuition, the Gaussian, multinomial, and Bernoulli variants, Laplace smoothing, log-space numerics, a worked Harbor Support ticket-triage example, a model comparison table, pitfalls, and a production checklist. It builds on machine learning fundamentals and sits alongside logistic regression and support vector machines.

Bayes theorem for classification

For class label C and feature vector x = (x1, …, xd), Bayes theorem gives:

P(C | x) = P(x | C) · P(C) / P(x)

P(C) is the prior — how often each class appears in training data. P(x | C) is the likelihood of observing features given the class. P(x) is the evidence, identical for all classes at prediction time, so you can ignore it when comparing labels. The classifier picks the class with highest posterior P(C | x).
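
As a quick illustration of the formula, the sketch below turns priors and likelihoods into posteriors for two classes. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior(priors, likelihoods):
    """Return P(C | x) for each class from P(C) and P(x | C)."""
    # Unnormalized scores: P(x | C) * P(C)
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # Evidence P(x) is the sum over classes; it only rescales the scores
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values: spam is rare, but these words are far likelier in spam
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {"spam": 0.05, "ham": 0.001}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)
print(best, post)
```

Dividing by the evidence changes every score by the same factor, so the winning class is the same with or without it.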

The naive independence assumption

Estimating the joint likelihood P(x | C) directly requires counting every combination of features — impossible when d is thousands of word counts. Naive Bayes factorizes:

P(x | C) ≈ ∏i P(xi | C)

Each feature’s distribution is learned separately per class. Training reduces to counting: for text, how often the word “refund” appears in billing tickets versus password-reset tickets. Prediction multiplies (or sums logs of) those per-feature probabilities. The independence assumption is wrong, but estimation is fast and the result is often good enough, especially when the decision depends on which features lean toward which class rather than on a perfect joint model.
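
To make "training is counting" concrete, here is a minimal sketch on four made-up tickets. The function names and data are illustrative assumptions, and the estimates are deliberately unsmoothed, so a word never seen in a class gets probability zero.

```python
from collections import Counter, defaultdict


def fit_counts(docs):
    """docs: list of (label, tokens). Returns per-class word counts."""
    counts = defaultdict(Counter)
    for label, tokens in docs:
        counts[label].update(tokens)
    return counts


def word_prob(counts, label, word):
    """Unsmoothed estimate of P(word | label) as a fraction of class tokens."""
    total = sum(counts[label].values())
    return counts[label][word] / total


def naive_likelihood(counts, label, tokens):
    """Product of per-word probabilities under the independence assumption."""
    p = 1.0
    for w in tokens:
        p *= word_prob(counts, label, w)
    return p


# Hypothetical toy tickets
docs = [
    ("billing", ["refund", "invoice", "refund"]),
    ("billing", ["charge", "refund"]),
    ("access", ["password", "reset"]),
    ("access", ["password", "locked", "reset"]),
]
counts = fit_counts(docs)
print(word_prob(counts, "billing", "refund"))
print(naive_likelihood(counts, "billing", ["refund", "charge"]))
```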

Why log probabilities matter

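Multiplying hundreds of probabilities in (0, 1) can underflow to zero in floating point. Implementations work in log space: log P(C | x) = log P(C) + ∑i log P(xi | C) − log P(x), where the last term is the same for every class and can be dropped when comparing labels. Libraries like scikit-learn expose predict_log_proba for exactly this reason. Because the logarithm is monotonic, the argmax of the log posteriors equals the argmax of the posteriors.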
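
The underflow is easy to reproduce. The sketch below multiplies 500 hypothetical per-word likelihoods directly, then does the same work in log space; log_posteriors is an illustrative helper, not a library function, and it normalizes with the usual log-sum-exp trick.

```python
import math

probs = [0.01] * 500  # hypothetical per-word likelihoods

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 in float64

log_sum = sum(math.log(p) for p in probs)  # stays finite


def log_posteriors(log_priors, log_likelihoods):
    """Normalized log P(C | x) from log P(C) and per-feature log P(x_i | C)."""
    scores = {c: log_priors[c] + sum(log_likelihoods[c]) for c in log_priors}
    # log-sum-exp: subtract the max score before exponentiating
    m = max(scores.values())
    log_evidence = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {c: s - log_evidence for c, s in scores.items()}


lp = log_posteriors(
    {"a": math.log(0.5), "b": math.log(0.5)},
    {"a": [math.log(0.01)] * 500, "b": [math.log(0.02)] * 500},
)
print(product, log_sum, max(lp, key=lp.get))
```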

Three common variants

The “naive” independence shell is shared; what changes is how each P(xi | C) is modeled.

Multinomial naive Bayes

The default for bag-of-words text and document counts. Features are non-negative counts (term frequencies; TF-IDF weights are sometimes used as fractional counts). Each class maintains a multinomial over the vocabulary: P(word | C) is the fraction of all tokens in class-C documents that are that word. Use MultinomialNB when input is word counts, n-gram counts, or positive-valued sparse vectors.

Bernoulli naive Bayes

Features are binary: word present or absent, not how many times. Better for short texts where presence matters more than frequency (“congratulations you won” once is enough). Use BernoulliNB on binarized document-term matrices.
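
The practical difference between the multinomial and Bernoulli variants shows up in the log-likelihood. In this sketch (hypothetical probabilities, illustrative function names) the multinomial score grows with repeated words and ignores absent ones, while the Bernoulli score ignores repetition but charges log(1 − p) for every vocabulary word that is absent. The multinomial coefficient is omitted because it is the same for every class.

```python
import math


def multinomial_loglik(counts, word_probs):
    """sum over words of count * log P(word | C); absent words add nothing."""
    return sum(n * math.log(word_probs[w]) for w, n in counts.items() if n > 0)


def bernoulli_loglik(present, presence_probs):
    """Every vocabulary word contributes: log p if present, log(1 - p) if absent."""
    total = 0.0
    for w, p in presence_probs.items():
        total += math.log(p) if w in present else math.log(1.0 - p)
    return total


# Hypothetical per-class parameters for a three-word vocabulary
word_probs = {"won": 0.5, "free": 0.3, "meeting": 0.2}       # token fractions
presence_probs = {"won": 0.6, "free": 0.5, "meeting": 0.1}   # document fractions

once = multinomial_loglik({"won": 1}, word_probs)
thrice = multinomial_loglik({"won": 3}, word_probs)
binary = bernoulli_loglik({"won"}, presence_probs)
print(once, thrice, binary)
```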

Gaussian naive Bayes

Features are continuous. Each feature per class is modeled as a Gaussian with its own mean and variance: P(xi | C) = N(xi; μi,C, σ²i,C). Use GaussianNB on low-to-moderate dimensional numeric tabular data when each feature is roughly bell-shaped within a class. It fits poorly on heavy-tailed features such as financial ratios unless you transform them first.
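
A minimal sketch of the Gaussian variant follows: fit a mean and variance per feature per class, then sum the per-feature log densities. The data is made up, and the small constant added to each variance is an assumption for numerical safety, not GaussianNB's exact smoothing rule.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def fit_gaussian(rows):
    """Per-feature mean and variance for one class; rows are equal-length lists."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # 1e-9 is an assumed variance floor to avoid division by zero
    variances = [
        sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9 for j in range(d)
    ]
    return means, variances


def class_loglik(x, means, variances):
    """Sum of per-feature log densities (the naive factorization)."""
    return sum(gaussian_logpdf(xj, m, v) for xj, m, v in zip(x, means, variances))


# Hypothetical two-feature data for two classes
params_a = fit_gaussian([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])
params_b = fit_gaussian([[8.0, 1.0], [9.0, 3.0], [10.0, 2.0]])

x = [2.5, 10.5]
print(class_loglik(x, *params_a), class_loglik(x, *params_b))
```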

Laplace (additive) smoothing

A word never seen in spam training data gets P(word | spam) = 0, zeroing the entire product. Laplace smoothing (additive smoothing) adds pseudo-counts α (often 1) to every feature value before normalizing:

P(xi | C) = (count(xi, C) + α) / (total count in C + α · V)

where V is vocabulary size. α = 1 is Laplace smoothing; smaller α stays closer to the raw counts; larger α pulls estimates toward uniform. Always smooth text models: even with a frozen vocabulary, a word that never appears in one class during training would otherwise get zero probability for that class. scikit-learn exposes this as the alpha hyperparameter on MultinomialNB and BernoulliNB; GaussianNB has no alpha and stabilizes its variances with var_smoothing instead.
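
The smoothing formula is a one-liner. This sketch applies it to a hypothetical four-word vocabulary and shows that the smoothed estimates still sum to one and that an unseen word no longer gets zero.

```python
def smoothed_prob(count, class_total, vocab_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (class_total + alpha * V)."""
    return (count + alpha) / (class_total + alpha * vocab_size)


# Hypothetical counts for one class; "password" was never seen in it
class_counts = {"refund": 3, "invoice": 1, "charge": 1, "password": 0}
total = sum(class_counts.values())
V = len(class_counts)

laplace = {w: smoothed_prob(c, total, V, alpha=1.0) for w, c in class_counts.items()}
light = {w: smoothed_prob(c, total, V, alpha=0.1) for w, c in class_counts.items()}
print(laplace["password"], light["password"])
```

With alpha=0.1 the unseen word keeps a much smaller share of the probability mass than with alpha=1.0, which is the trade-off the text describes.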

Class priors and imbalance

P(C) defaults to empirical class frequencies. With 98% legitimate email, the model biases toward “ham.” For rare-positive detection (fraud, abuse), set class_prior explicitly or pass fit_prior=False (which uses a uniform prior in MultinomialNB and BernoulliNB), then tune the decision threshold on a validation set using precision-recall curves rather than relying on the default cutoff (0.5 in the binary case).
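
One simple way to pick a threshold from validation scores is sketched below: take the lowest threshold whose precision meets a required floor, which also gives the highest recall among thresholds that qualify. The scores, labels, and the 0.75 floor are hypothetical, and treating precision as 1.0 when nothing is flagged is a convention assumed here.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Lowest candidate threshold whose precision meets the floor, else None."""
    for t in sorted(set(scores)):
        precision, _ = precision_recall_at(scores, labels, t)
        if precision >= min_precision:
            return t
    return None


# Hypothetical validation posteriors for the rare positive class
scores = [0.05, 0.10, 0.30, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]

threshold = pick_threshold(scores, labels, min_precision=0.75)
print(threshold, precision_recall_at(scores, labels, threshold))
```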

Text preprocessing pipeline

Naive Bayes quality depends more on representation than on tuning twelve hyperparameters.

  • Tokenization — lowercase, strip punctuation, optionally stem or lemmatize. Keep it consistent between train and serve.
  • Stop words — removing “the” and “and” can help or hurt depending on domain; evaluate on validation, do not assume.
  • N-grams — bigrams capture “not happy”; vocabulary explodes, so cap max_features or use chi-squared selection.
  • TF-IDF vs counts — multinomial NB traditionally uses raw counts; TF-IDF weighting with NB is debated. If you use TF-IDF, treat values as non-negative features and test against count baselines.
  • Vocabulary bounds — CountVectorizer(max_features=50000, min_df=5) drops rare noise and controls memory.
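
The preprocessing steps in this list can be sketched in plain Python to show what each one does. This is a simplified stand-in, not CountVectorizer's implementation: for example, it ranks features by document frequency when applying max_features, which is an assumption made for brevity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n_max=2):
    """Unigrams up to n_max-grams, joined with spaces."""
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


def build_vocab(texts, min_df=2, max_features=None):
    """Keep features that occur in at least min_df documents."""
    df = Counter()
    for t in texts:
        df.update(set(ngrams(tokenize(t))))  # set(): count each doc once
    kept = [(f, c) for f, c in df.items() if c >= min_df]
    kept.sort(key=lambda fc: (-fc[1], fc[0]))  # most frequent first
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(f for f, _ in kept)


# Hypothetical messages
texts = ["Not happy with my refund!", "Refund not received", "I am not happy"]
vocab = build_vocab(texts, min_df=2)
print(vocab)
```

The same functions must run at serving time; any difference in tokenization between training and production changes the features the model sees.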

Worked example: Harbor Support ticket routing

Harbor Support receives 14,000 tickets per week across four queues: billing, technical, account access, and sales. Agents manually tag each ticket; leadership wants a first-pass auto-router to cut median time-to-first-response by 30%.

Data and features

  • Historical set: 82,000 labeled tickets (subject + first customer message).
  • Preprocessing: lowercase, remove HTML, CountVectorizer with unigrams + bigrams, max_features=40000, min_df=3.
  • Holdout: stratified 70/15/15 train/validation/test by week to catch vocabulary drift.

Model choice and results

The team trains MultinomialNB(alpha=0.5) inside a sklearn Pipeline with the vectorizer. Validation macro-F1 = 0.81; billing and technical classes exceed 0.88 F1. Account-access tickets that mention “password” and “2FA” route correctly; confusion concentrates between billing and sales when customers write “cancel my subscription and upgrade plan” in one message.
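
A setup along these lines might look like the sketch below, assuming scikit-learn is installed. The eight tickets and queue labels are invented for illustration, and min_df and max_features are left at their defaults because the toy corpus is far too small for the production settings described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy tickets; the real training set is not shown in this guide
texts = [
    "refund for duplicate charge on invoice",
    "invoice shows wrong charge please refund",
    "app crashes when uploading a file",
    "error message when the app starts",
    "cannot log in password reset not working",
    "locked out need password reset and 2FA code",
    "pricing for enterprise plan",
    "want a demo of the enterprise plan",
]
labels = [
    "billing", "billing",
    "technical", "technical",
    "account_access", "account_access",
    "sales", "sales",
]

pipeline = Pipeline([
    # Unigrams + bigrams, lowercased, as in the preprocessing described above
    ("vectorizer", CountVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("nb", MultinomialNB(alpha=0.5)),
])
pipeline.fit(texts, labels)

query = ["password reset and 2FA code not working"]
pred = pipeline.predict(query)
log_proba = pipeline.predict_log_proba(query)  # one row, one column per class
print(pred[0], dict(zip(pipeline.classes_, log_proba[0])))
```

Keeping the vectorizer and classifier in one Pipeline means a single object is fitted, persisted, and served, so the vocabulary can never drift apart from the model.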

They compare against linear SVM (macro-F1 0.84, 40 ms inference) and a small fine-tuned transformer (macro-F1 0.87, 220 ms on CPU). Naive Bayes hits macro-F1 0.81 at 2 ms per ticket on the same hardware with a 12 MB serialized pipeline — acceptable for edge routing where top-2 suggestions go to agents, not fully automated closure.

Production rules: if max posterior < 0.55, route to human triage; log predict_proba for weekly calibration checks. When “OAuth” spikes after a product launch, retrain weekly until posteriors stabilize.
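
The posterior-floor rule can be expressed in a few lines. In this sketch the queue name "human_triage" and the function name route are hypothetical; the 0.55 floor and the top-2 suggestions come from the rules above.

```python
def route(posteriors, floor=0.55, top_k=2):
    """Pick a queue from class posteriors, deferring to humans when unsure."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    return {
        # Below the floor, a person decides; suggestions still go along
        "queue": best_label if best_p >= floor else "human_triage",
        "suggestions": ranked[:top_k],
    }


# Hypothetical posteriors for an ambiguous and a clear-cut ticket
ambiguous = route({"billing": 0.48, "sales": 0.41, "technical": 0.07, "account_access": 0.04})
clear = route({"billing": 0.90, "sales": 0.05, "technical": 0.03, "account_access": 0.02})
print(ambiguous["queue"], clear["queue"])
```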

Model comparison table

Algorithm | Strengths | Weaknesses vs naive Bayes
Multinomial naive Bayes | Extremely fast train/serve; strong text baseline; closed-form counts |
Logistic regression | Fits all feature weights jointly, so correlated features are not double-counted; usually better-calibrated probabilities | Slower on huge sparse vocab without careful solvers; needs more tuning
Linear SVM | Max-margin; good on sparse high-dim text | No native probabilities; heavier than NB for streaming retrains
Gradient-boosted trees | Best on mixed numeric/categorical tabular | Poor default for raw bag-of-words; large models
Transformers (BERT etc.) | State-of-art accuracy on nuanced language | GPU cost, latency, data hunger; overkill for coarse routing

Practical rule: start with multinomial naive Bayes for text routing, spam, and sentiment baselines. If validation F1 is within a few points of heavier models and latency or interpretability matter, ship naive Bayes. Reach for transformers when semantic nuance (sarcasm, long context, cross-lingual) dominates errors.

Common pitfalls

  • Wrong variant for feature type — Gaussian NB on sparse word counts fits poorly, and multinomial NB on features that can go negative (standardized values, SVD components) fails outright or produces nonsense.
  • No smoothing — zero counts for unseen words kill posteriors; always set alpha > 0 for text.
  • Data leakage in vectorizer — fit CountVectorizer only on training folds inside CV pipelines, not on the full corpus before split.
  • Ignoring class imbalance — majority-class prior hides rare abuse; tune thresholds on precision-recall, not accuracy.
  • Train-serve skew — different tokenization in production (HTML entities, emoji stripping) shifts word distributions silently.
  • Overfitting vocabulary — retaining every hapax legomenon memorizes typos; use min_df and max_features.
  • Assuming independence holds — bigrams or logistic regression may be needed when negation and phrase order drive errors.
  • Comparing raw probabilities across retrains — recalibrate or monitor ranking metrics; absolute posteriors drift when priors change.
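
The vectorizer-leakage pitfall is avoided by cross-validating the whole pipeline, so the vocabulary is rebuilt from the training fold each time. A minimal sketch, assuming scikit-learn is installed and using invented tickets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four tickets per class
texts = [
    "refund my invoice", "wrong charge on invoice",
    "refund the duplicate charge", "invoice charge refund",
    "password reset link", "locked out reset password",
    "password not accepted", "reset my password please",
]
labels = ["billing"] * 4 + ["access"] * 4

# The vectorizer lives inside the pipeline, so each CV split fits it
# on that split's training documents only.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB(alpha=1.0)),
])

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(scores)
```

Fitting CountVectorizer on all eight texts first and then cross-validating only the classifier would let held-out words shape the vocabulary, which is the leak this pattern prevents.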

Production checklist

  • Pick variant: multinomial (counts), Bernoulli (binary), or Gaussian (continuous).
  • Wrap vectorizer + model in a single Pipeline; persist with joblib.
  • Split by time (stratified by class within each period) if vocabulary drifts; use walk-forward validation for weekly retrains.
  • Set alpha via grid search on validation macro-F1 or cost-weighted loss.
  • Configure class priors or decision thresholds for rare-positive classes.
  • Expose top-2 classes and confidence when routing to humans below a posterior floor.
  • Log misroutes with raw text (redacted) for label audit and vocabulary updates.
  • Monitor OOV rate and per-class F1; alert when billing F1 drops >5 points week-over-week.
  • Document preprocessing (stemming, n-grams) in the model card for compliance.
  • Benchmark latency on production CPU; naive Bayes should be single-digit milliseconds per doc.
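
Two of the monitoring items above reduce to small calculations. The helper names below are hypothetical; the 5-point drop matches the alert rule in the checklist.

```python
def oov_rate(token_lists, vocabulary):
    """Fraction of incoming tokens that are not in the training vocabulary."""
    total = 0
    unknown = 0
    for tokens in token_lists:
        for tok in tokens:
            total += 1
            if tok not in vocabulary:
                unknown += 1
    return unknown / total if total else 0.0


def f1_dropped(previous_f1, current_f1, max_drop_points=5.0):
    """True when F1 fell by more than max_drop_points percentage points."""
    return (previous_f1 - current_f1) * 100.0 > max_drop_points


# Hypothetical vocabulary and a batch of tokenized production tickets
vocabulary = {"refund", "password", "reset"}
batch = [["refund", "oauth"], ["password", "reset", "oauth", "token"]]

rate = oov_rate(batch, vocabulary)
print(rate, f1_dropped(0.88, 0.82), f1_dropped(0.88, 0.85))
```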

Key takeaways

  • Naive Bayes is Bayes theorem with a factorized likelihood — training is counting, prediction is summing log probabilities.
  • Choose the variant to match features — multinomial for text counts, Bernoulli for presence, Gaussian for continuous.
  • Laplace smoothing is mandatory for open vocabularies; tune alpha on validation.
  • Speed and interpretability are the wins — inspect per-class word odds ratios for debugging routes.
  • Benchmark against logistic regression and linear SVM before jumping to transformers on text routing tasks.
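
Inspecting word odds ratios, as suggested in the takeaways, can be as simple as ranking words by smoothed log-odds between two classes. The counts and function names here are hypothetical.

```python
import math


def log_odds(word, counts_a, counts_b, vocab_size, alpha=1.0):
    """log [P(word | A) / P(word | B)] with additive smoothing."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    p_a = (counts_a.get(word, 0) + alpha) / (total_a + alpha * vocab_size)
    p_b = (counts_b.get(word, 0) + alpha) / (total_b + alpha * vocab_size)
    return math.log(p_a) - math.log(p_b)


def top_words(counts_a, counts_b, k=3, alpha=1.0):
    """Words that most strongly favor class A over class B."""
    vocab = sorted(set(counts_a) | set(counts_b))
    scored = [(log_odds(w, counts_a, counts_b, len(vocab), alpha), w) for w in vocab]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [w for _, w in scored[:k]]


# Hypothetical per-class word counts
billing = {"refund": 30, "invoice": 20, "plan": 10}
sales = {"plan": 25, "demo": 20, "pricing": 15}

print(top_words(billing, sales, k=2))
print(top_words(sales, billing, k=2))
```

A misrouted ticket can then be explained by listing which of its words carry the largest log-odds toward the wrong queue.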
