This post is a simple code implementation of the research paper ‘Naive Bayes Classifier for Efficient Text Classification’ by Peter Norvig.
Categories: code, research paper, impact
Author: Risheek kumar B
Published: June 15, 2026
How a reverend from the 1700s gave us one of the most practical classifiers in machine learning
In 1763, an English Presbyterian minister named Thomas Bayes had his most famous work published posthumously. He’d been thinking about a simple question: if I see something happen, how should that change what I believe? More than 260 years later, his theorem powers spam filters, medical diagnoses, and recommendation engines. Let’s trace the journey from a bag of balls to a working classifier, building everything by hand along the way.
The Bag of Balls
Before we get to spam, let’s start somewhere tangible. Imagine you have a bag with 3 red balls and 7 blue balls. You draw one ball, don’t put it back, and draw another.
Question: what’s the probability both are red?
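Think about it: the first draw has a 3/10 chance. But if you drew red, now there are only 2 red balls left out of 9 total. So:
\[P(\text{both red}) = \frac{3}{10} \times \frac{2}{9} = \frac{6}{90} = \frac{1}{15} \approx 0.067\]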
This is the multiplication rule for dependent events — the second draw depends on what happened in the first. This dependency is exactly where Bayes’ insight begins.
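The multiplication rule above can be checked with exact fractions. This is a minimal sketch; the variable names are illustrative and not part of the original post.

```python
from fractions import Fraction

red, blue = 3, 7
total = red + blue

# First draw: 3 red balls out of 10.
p_first_red = Fraction(red, total)

# Second draw, given the first was red and not replaced: 2 red out of 9.
p_second_red_given_first_red = Fraction(red - 1, total - 1)

# Multiplication rule for dependent events.
p_both_red = p_first_red * p_second_red_given_first_red

print(p_both_red)         # 1/15
print(float(p_both_red))  # roughly 0.067
```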
Bayes’ Theorem
Let’s build the theorem from scratch. Start with the definition of conditional probability:
\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]
In plain English: “out of all the times B happens, how often does A also happen?”
Rearranging: \(P(A \cap B) = P(A|B) \cdot P(B)\)
By symmetry, we can also write: \(P(A \cap B) = P(B|A) \cdot P(A)\)
Both expressions equal \(P(A \cap B)\), so set them equal and solve for \(P(A|B)\):
\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]
This is Bayes’ theorem — the equation Bayes never actually wrote in this form (that was Laplace). The terms have names:
- Prior \(P(A)\): what you believed before seeing evidence
- Likelihood \(P(B|A)\): how probable the evidence is, given your hypothesis
- Posterior \(P(A|B)\): your updated belief after seeing evidence
The Bayesian mindset in one sentence: start with a prior, update it with evidence.
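To see the prior-to-posterior update in numbers, here is a small sketch. The function name `posterior` and all three input values are hypothetical, chosen only for illustration. The evidence term \(P(B)\) is expanded with the law of total probability over "A" and "not A".

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 30% of email is spam, a given word appears in
# 50% of spam emails and in 5% of non-spam emails.
p = posterior(prior=0.3, likelihood=0.5, likelihood_given_not=0.05)
print(round(p, 3))  # 0.811
```

Seeing the word moves the belief in "spam" from the 30% prior to roughly 81%.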
Frequentists vs Bayesians: A Centuries-Old Debate
Bayes’ theorem isn’t just a formula — it represents a fundamentally different way of thinking about probability. This difference sparked one of the longest-running arguments in statistics.
The Frequentist view: Probability is about long-run frequencies. A coin has a 50% chance of heads because if you flipped it infinitely many times, half would be heads. A hypothesis (like “this drug works”) is either true or false — there’s no meaningful way to say “I’m 80% sure it works.”
The Bayesian view: Probability is a measure of belief. You can say “I’m 80% sure this drug works” — it reflects your state of knowledge, updated by evidence. Before a trial, you have a prior belief. After seeing data, you have a posterior belief.
This isn’t just philosophy — it changes how you answer practical questions:
| Question | Frequentist | Bayesian |
| --- | --- | --- |
| “Is this email spam?” | Either yes or no; I’ll use a test with 95% confidence | P(spam) = 0.88 given these words |
| “Does this drug work?” | Reject or fail to reject the null hypothesis | There’s a 93% probability it works |
| “How confident are you?” | “If I repeated this experiment 100 times…” | “Given everything I’ve seen…” |
For most of the 20th century, frequentist methods dominated (think p-values, confidence intervals, hypothesis tests). But Bayesian methods have surged in recent decades, powered by faster computers that can handle the harder calculations.
Naive Bayes sits squarely in the Bayesian camp: it starts with a prior (how common is spam?) and updates it with evidence (which words appear?). The “prior → evidence → posterior” loop is the beating heart of Bayesian thinking.
Applying Bayes to Classification
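Now let’s make this practical. Suppose you’re building a spam filter for email. You want to classify a new email as spam or not spam based on the words it contains. Bayes’ theorem gives:
\[P(\text{spam} | \text{words}) = \frac{P(\text{words} | \text{spam}) \cdot P(\text{spam})}{P(\text{words})}\]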
The prior \(P(\text{spam})\) is easy — just the fraction of spam emails in your training data. But \(P(\text{words} | \text{spam})\)? That’s the probability of seeing this exact combination of words in a spam email. With a vocabulary of 10,000 words, the number of possible combinations is astronomical. You’d never have enough data to estimate it directly.
This is the wall that Bayes’ theorem alone can’t climb. We need a simplification.
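To get a feel for "astronomical", here is a rough sketch that assumes a "combination" simply means which vocabulary words are present or absent in an email. That simplification is an assumption made here; it ignores word order and repeat counts, so the real space is even larger.

```python
vocab_size = 10_000

# Each word is either present or absent, so there are 2**vocab_size patterns.
patterns = 2 ** vocab_size

# Count the decimal digits of that number.
print(len(str(patterns)))  # 3011
```

A number with over three thousand digits dwarfs any training set, so most combinations would never be observed even once.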
The Naive Assumption
Here’s the trick that makes it all work: assume every word is independent of every other word.
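Instead of computing \(P(\text{"free", "money"} | \text{spam})\) as one monster probability, we break it apart:
\[P(\text{"free", "money"} | \text{spam}) = P(\text{"free"} | \text{spam}) \cdot P(\text{"money"} | \text{spam})\]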
This is obviously wrong. “Free” and “win” tend to appear together in spam. “Dear” and “friend” travel as a pair. Words aren’t independent.
But here’s the surprising twist — it doesn’t matter much for classification. Even if the individual probabilities are off, the ranking of classes usually stays correct. The spam email still scores higher than the not-spam email. Naive Bayes is a poor probability estimator but a surprisingly good classifier.
This is why it’s called “naive” — and why it works despite being naive.
Hands-On: Naive Bayes by Hand
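Let’s walk through a concrete example with three training emails. Two are spam and together contain the words “free”, “money”, “free”, “offer”; one is not spam and contains “meeting”, “tomorrow”. So \(P(\text{spam}) = 2/3\) and \(P(\text{not spam}) = 1/3\).
Now classify a new email containing “free money”. For spam, “free” accounts for 2 of the 4 spam words and “money” for 1 of 4:
\[P(\text{spam}) \cdot P(\text{"free"} | \text{spam}) \cdot P(\text{"money"} | \text{spam}) = \frac{2}{3} \cdot \frac{2}{4} \cdot \frac{1}{4} = \frac{1}{12}\]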
But for not spam — “free” and “money” never appeared. Their probability is zero. Multiply by zero and the entire score vanishes. This is the zero probability problem, and it’s a dealbreaker in real systems.
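The zero probability problem is easy to reproduce. The sketch below reuses the word lists and priors from the Implementation section of this post, but estimates word probabilities with raw counts and no smoothing. The helper names `unsmoothed` and `score` are illustrative.

```python
from collections import Counter

spam_words = ["free", "money", "free", "offer"]
notspam_words = ["meeting", "tomorrow"]
p_spam, p_notspam = 2 / 3, 1 / 3

def unsmoothed(word, class_words):
    """Raw relative frequency of a word within a class (no smoothing)."""
    return Counter(class_words)[word] / len(class_words)

def score(words, prior, class_words):
    """Prior times the product of per-word probabilities."""
    result = prior
    for w in words:
        result *= unsmoothed(w, class_words)
    return result

test = ["free", "money"]
score_spam = score(test, p_spam, spam_words)           # 2/3 * 2/4 * 1/4
score_notspam = score(test, p_notspam, notspam_words)  # 1/3 * 0 * 0

print(score_spam)     # roughly 0.083
print(score_notspam)  # 0.0
```

One unseen word is enough to zero out a class, no matter how strong the rest of the evidence is.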
Laplace Smoothing
The fix is beautifully simple: add 1 to every word count, so no word ever has zero probability. Add the vocabulary size to the denominator to keep things normalized:
\[P(\text{word} | \text{class}) = \frac{\text{count(word in class)} + 1}{\text{total words in class} + |\text{vocab}|}\]
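For the test email “free money”, using the training word counts (spam: “free” twice, “money”, “offer”; not spam: “meeting”, “tomorrow”) and a vocabulary of 5 words:
\[\text{score(spam)} = \frac{2}{3} \cdot \frac{2+1}{4+5} \cdot \frac{1+1}{4+5} = \frac{2}{3} \cdot \frac{3}{9} \cdot \frac{2}{9} = \frac{4}{81} \approx 0.049\]
\[\text{score(not spam)} = \frac{1}{3} \cdot \frac{0+1}{2+5} \cdot \frac{0+1}{2+5} = \frac{1}{3} \cdot \frac{1}{7} \cdot \frac{1}{7} = \frac{1}{147} \approx 0.0068\]
Spam wins, by a factor of about 7! Notice we never divided by \(P(\text{words})\). It’s the same for both classes, so it cancels out when we compare. That’s why Naive Bayes is so fast: we only need the numerators.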
Two Flavors: Multinomial vs Gaussian
So far we’ve been counting words — discrete values like 0, 1, 2… This is where Multinomial Naive Bayes lives. The name comes from the multinomial distribution, a generalization of the binomial: instead of coin flips (2 outcomes), you have dice rolls (many outcomes). Each word drawn from an email is like rolling a vocabulary-sized die.
But what if your features are continuous? Blood pressure readings, temperatures, pixel intensities? That’s where Gaussian Naive Bayes steps in. It assumes each feature follows a bell curve (normal distribution) within each class. Instead of counting, you compute the mean and variance from training data, then plug a new value into the Gaussian formula.
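Here is a minimal from-scratch sketch of the Gaussian variant with a single feature. The temperature readings and class names are invented for illustration, and the sketch skips refinements that production libraries add, such as a small variance floor for numerical stability.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical training data: body temperature in degrees Celsius per class.
train = {
    "healthy": [36.5, 36.8, 37.0, 36.6],
    "fever": [38.2, 38.9, 39.1, 38.5],
}
n_total = sum(len(values) for values in train.values())

# "Training" is just a prior, a mean, and a variance for each class.
params = {
    label: (len(values) / n_total, mean(values), pvariance(values))
    for label, values in train.items()
}

def classify(x):
    """Pick the class with the largest prior * Gaussian likelihood."""
    scores = {
        label: prior * gaussian_pdf(x, mu, var)
        for label, (prior, mu, var) in params.items()
    }
    return max(scores, key=scores.get)

print(classify(36.9))  # healthy
print(classify(38.4))  # fever
```

With several features, the naive assumption means multiplying one such density per feature, exactly as the word probabilities are multiplied in the multinomial case.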
Rule of thumb: Multinomial for counts, Gaussian for measurements. The right choice depends on your data.
Implementation
Let’s put it all together — first from scratch, then with sklearn.
From scratch:
```python
spam_words = ["free", "money", "free", "offer"]
notspam_words = ["meeting", "tomorrow"]
vocab = {"free", "money", "offer", "meeting", "tomorrow"}
V = len(vocab)

def p_word_given_class(word, class_words):
    count = class_words.count(word)
    return (count + 1) / (len(class_words) + V)

p_spam = 2/3
p_notspam = 1/3
test = ["free", "money"]

score_spam = p_spam
for w in test:
    score_spam *= p_word_given_class(w, spam_words)

score_notspam = p_notspam
for w in test:
    score_notspam *= p_word_given_class(w, notspam_words)

# Normalized probability of spam
print(score_spam / (score_spam + score_notspam))  # ≈ 0.879
```
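With sklearn, a comparable sketch might look like the following. It is not the post's original listing. The split of the training words into three separate emails is an assumption, and only the per-class word counts affect the result. `CountVectorizer` and `MultinomialNB` are standard scikit-learn classes, and `alpha=1.0` corresponds to the add-one smoothing described earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed split of the training words into three emails (two spam, one not).
emails = ["free money", "free offer", "meeting tomorrow"]
labels = ["spam", "spam", "not spam"]

# Turn each email into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# alpha=1.0 is add-one (Laplace) smoothing; class priors are learned from labels.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

# Probability that the test email is spam.
X_test = vectorizer.transform(["free money"])
spam_index = list(clf.classes_).index("spam")
p_spam_sklearn = float(clf.predict_proba(X_test)[0][spam_index])
print(round(p_spam_sklearn, 3))  # 0.879
```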
Both approaches give the same answer — 87.9% spam probability. The sklearn version scales to millions of emails.
When to Use Naive Bayes
Naive Bayes isn’t always the best tool, but it’s often the best first tool.
Reach for it when:
- ✅ You need a fast, interpretable baseline
- ✅ Training data is limited (it often outperforms logistic regression when data is scarce)
- ✅ You’re working with high-dimensional sparse data like text
- ✅ You need real-time predictions
Watch out when:
- ⚠️ Features are strongly correlated (it double-counts evidence)
- ⚠️ You need well-calibrated probabilities, not just classification
Interview tip: If asked “when would you choose Naive Bayes over logistic regression?”, the key answer is small data and text. As data grows, logistic regression typically catches up and surpasses it, because it does not assume features are independent: correlated features are weighted jointly rather than double-counted.
Thomas Bayes never imagined spam filters. But his simple idea — update your beliefs with evidence — turned out to be one of the most practical tools in machine learning.