8. Naïve Bayes
Naive Bayes is a family of probabilistic machine learning algorithms used primarily for classification. It's based on Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event.
The "naive" part comes from the strong assumption that the features (predictors) in the dataset are independent of each other given the class label. This simplifies calculations significantly, making the algorithm efficient even for large datasets.
In essence, Naive Bayes calculates the probability that a given instance belongs to a particular class and assigns it to the class with the highest posterior probability. It's particularly popular in text classification because it handles high-dimensional data well (e.g., word counts in documents).
Key variants include:
- Gaussian Naive Bayes: Assumes features follow a normal distribution (for continuous data).
- Multinomial Naive Bayes: Suited for discrete data, like word frequencies in text.
- Bernoulli Naive Bayes: For binary/boolean features, like word presence/absence.
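To make the variants concrete, here is a minimal sketch of the Gaussian variant using scikit-learn's `GaussianNB`; the Iris dataset and the split parameters are illustrative assumptions, not part of the original example (the multinomial variant is shown in the Implementation section below).

```python
# Minimal sketch: Gaussian Naive Bayes on continuous features (Iris is an illustrative choice).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gnb = GaussianNB()          # estimates a per-class mean and variance for every feature
gnb.fit(X_train, y_train)
print("Gaussian NB accuracy:", accuracy_score(y_test, gnb.predict(X_test)))
```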
Naive Bayes is derived from Bayes' Theorem:
\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \]
In classification terms:
- Let \( C \) be the class label.
- Let \( X = (x_1, x_2, \dots, x_n) \) be the feature vector.
We want the posterior probability \( P(C|X) \), and we classify \( X \) to the class \( C_k \) that maximizes this:
\[ P(C_k|X) = \frac{P(X|C_k)\,P(C_k)}{P(X)} \]
Since \( P(X) \) is constant for all classes, we can ignore it and focus on maximizing \( P(X|C_k) \cdot P(C_k) \).
The naive assumption: features are conditionally independent given the class, so:
\[ P(X|C_k) = \prod_{i=1}^{n} P(x_i|C_k) \]
Both quantities are estimated from the training data:
- Prior \( P(C_k) \): the probability of class \( C_k \) in the training data (e.g., the fraction of samples in that class).
- Likelihood \( P(x_i|C_k) \): depends on the variant:
  - Multinomial: \( P(x_i|C_k) = \frac{N_{ki} + \alpha}{N_k + \alpha V} \) (with Laplace smoothing, where \( \alpha \) is the smoothing parameter, \( N_{ki} \) is the count of feature \( i \) in class \( k \), \( N_k \) is the total count of all features in class \( k \), and \( V \) is the vocabulary size).
  - Gaussian: \( P(x_i|C_k) = \frac{1}{\sqrt{2\pi\sigma_{ki}^2}} \exp\left( -\frac{(x_i - \mu_{ki})^2}{2\sigma_{ki}^2} \right) \) (mean \( \mu_{ki} \) and variance \( \sigma_{ki}^2 \) estimated from the training data).
To avoid zero probabilities (when a feature doesn't appear in a class), we use additive smoothing (e.g., Laplace with \( \alpha = 1 \)).
The final prediction is:
\[ \hat{C} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i|C_k) \]
(Often computed in log space to avoid underflow: \( \hat{C} = \arg\max_{k} \left[ \log P(C_k) + \sum_{i=1}^n \log P(x_i|C_k) \right] \).)
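To make the log-space formulation concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes with Laplace smoothing; the tiny count matrix and `alpha = 1.0` are illustrative assumptions, not data from the text.

```python
import numpy as np

# Toy term-count matrix: 4 documents x 3 features, with binary class labels (illustrative data)
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 3],
              [0, 2, 2]])
y = np.array([0, 0, 1, 1])
alpha = 1.0                                                            # Laplace smoothing

classes = np.unique(y)
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))      # log P(C_k)

# Smoothed log-likelihoods: log P(x_i | C_k) = log((N_ki + alpha) / (N_k + alpha * V))
feature_counts = np.array([X[y == c].sum(axis=0) for c in classes])   # N_ki per class
smoothed = feature_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score a new count vector in log space: log P(C_k) + sum_i x_i * log P(x_i | C_k)
x_new = np.array([1, 0, 2])
joint_log = log_prior + x_new @ log_likelihood.T
print("Predicted class:", classes[np.argmax(joint_log)])
```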
Implementation
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups  # example text dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load sample text data (two newsgroup categories for a binary classification task)
newsgroups = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'soc.religion.christian'])

# Convert raw documents to word-count vectors (bag of words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a multinomial Naive Bayes classifier and evaluate it
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
```
Example output (the exact value varies with the random split):
```
Accuracy: 0.9814814814814815
```
Applications
Naive Bayes is widely used due to its simplicity and speed:
- Spam Detection: Classify emails as spam/ham based on word frequencies (e.g., Gmail's early filters).
- Sentiment Analysis: Determine if reviews are positive/negative by analyzing text features.
- Document Classification: Categorize news articles into topics like sports, politics.
- Medical Diagnosis: Predict diseases from symptoms (assuming independence).
- Recommendation Systems: Basic user preference prediction.
- Real-Time Prediction: Fraud detection in finance, where quick decisions are needed on high-dimensional data.
It's especially effective in natural language processing (NLP) with bag-of-words models.
Pros and Cons
Pros
- Simple and Fast: Easy to implement, trains quickly (a single O(n) pass over the data), and prediction time is constant with respect to the training-set size.
- Handles High Dimensions: Performs well with many features, as in text data, and is relatively resistant to the curse of dimensionality.
- Scalable: Can be applied to large datasets and updated incrementally with new data (see the sketch after this list).
- Interpretable: Its probabilistic nature makes it possible to see how each feature contributes to a prediction.
- Good with Small Datasets: Requires less training data than more complex models.
- Probabilistic Output: Provides probability estimates, not just labels.
- Robust to Irrelevant Features: The independence assumption helps the model ignore noise.
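A minimal sketch of the incremental updating mentioned above, using scikit-learn's `partial_fit` on `MultinomialNB`; the mini-batches of counts are illustrative placeholders.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Illustrative mini-batches of count features (rows = documents, columns = terms)
batch1_X = np.array([[3, 0, 1], [0, 2, 2]])
batch1_y = np.array([0, 1])
batch2_X = np.array([[4, 1, 0], [1, 3, 1]])
batch2_y = np.array([0, 1])

clf = MultinomialNB()
# partial_fit must be told the full set of classes on the first call
clf.partial_fit(batch1_X, batch1_y, classes=np.array([0, 1]))
clf.partial_fit(batch2_X, batch2_y)        # later batches update the stored counts
print(clf.predict(np.array([[2, 0, 1]])))
```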
Cons
- Independence Assumption: Rarely holds in real data (e.g., the words "machine" and "learning" are correlated), which can lead to suboptimal performance.
- Zero Probability Problem: If a feature-class combination never appears in training, its probability becomes zero (mitigated by smoothing, but not perfectly).
- Poor for Continuous Data Without the Proper Variant: Multinomial works for counts, while Gaussian assumes normality, which may not fit the data.
- Biased Estimates: Posterior probabilities are often poorly calibrated, even though the classification itself can still be good.
- Sensitive to Data Representation: Text data needs careful preprocessing (e.g., stemming, stop-word removal).
Additional
Play Golf
A classic example used to illustrate Naive Bayes is predicting whether to play golf based on weather conditions. The dataset includes features like Outlook (Sunny, Overcast, Rain), Temperature (Hot, Mild, Cool), Humidity (High, Normal), and Windy (True, False), along with the target variable Play (Yes, No).
| Day | Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | False | No |
| 2 | Sunny | Hot | High | True | No |
| 3 | Overcast | Hot | High | False | Yes |
| 4 | Rain | Mild | High | False | Yes |
| 5 | Rain | Cool | Normal | False | Yes |
| 6 | Rain | Cool | Normal | True | No |
| 7 | Overcast | Cool | Normal | True | Yes |
| 8 | Sunny | Mild | High | False | No |
| 9 | Sunny | Cool | Normal | False | Yes |
| 10 | Rain | Mild | Normal | False | Yes |
| 11 | Sunny | Mild | Normal | True | Yes |
| 12 | Overcast | Mild | High | True | Yes |
| 13 | Overcast | Hot | Normal | False | Yes |
| 14 | Rain | Mild | High | True | No |
Using this dataset, we can calculate the probabilities needed to predict whether to play golf given specific weather conditions using the Naive Bayes approach.
The goal
The objective is to predict whether a person will play golf on a new, unseen day based on its weather conditions, such as:
Outlook = Sunny,
Temperature = Cool,
Humidity = High,
and Windy = True.
To make this prediction, we will calculate the posterior probabilities for both classes (Play = Yes and Play = No) using the Naive Bayes formula and then compare them.
Calculation steps

1. Calculate prior probabilities:
   - P(Play = Yes): 9 out of 14 days were "Yes."
     \( \displaystyle P(\text{Play} = \text{Yes}) = \frac{\text{Number of Yes}}{\text{Total}} = \frac{9}{14} \approx 0.643 \)
   - P(Play = No): 5 out of 14 days were "No."
     \( \displaystyle P(\text{Play} = \text{No}) = \frac{\text{Number of No}}{\text{Total}} = \frac{5}{14} \approx 0.357 \)
2. Calculate likelihood probabilities:
   For each feature value given each class, we calculate the likelihood from the table:

   | Feature | P(Feature = Value \| Play = Yes) | P(Feature = Value \| Play = No) |
   |---|---|---|
   | Outlook = Sunny | \( \frac{2}{9} \) | \( \frac{3}{5} \) |
   | Temperature = Cool | \( \frac{3}{9} \) | \( \frac{1}{5} \) |
   | Humidity = High | \( \frac{3}{9} \) | \( \frac{4}{5} \) |
   | Windy = True | \( \frac{3}{9} \) | \( \frac{3}{5} \) |

3. Calculate posterior probabilities (Bayes' theorem):
   Using the naive assumption that all features are independent given the class, apply the theorem to find the (unnormalized) posterior for each class. The simplified formula is:
   \[ P(C|X) \propto P(C) \cdot \prod_{i} P(x_i|C) \]
   - P(Yes | Sunny, Cool, High, True):
     \(\begin{align*} = & P(\text{Yes}) \cdot P(\text{Sunny}|\text{Yes}) \cdot P(\text{Cool}|\text{Yes}) \cdot P(\text{High}|\text{Yes}) \cdot P(\text{True}|\text{Yes}) \\ = & \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \\ \approx & \ 0.0053 \end{align*}\)
   - P(No | Sunny, Cool, High, True):
     \(\begin{align*} = & P(\text{No}) \cdot P(\text{Sunny}|\text{No}) \cdot P(\text{Cool}|\text{No}) \cdot P(\text{High}|\text{No}) \cdot P(\text{True}|\text{No}) \\ = & \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \\ \approx & \ 0.0206 \end{align*}\)
4. Make the prediction:
   Since \( P(\text{No} \mid \text{Sunny, Cool, High, True}) \approx 0.0206 \) is greater than \( P(\text{Yes} \mid \text{Sunny, Cool, High, True}) \approx 0.0053 \), we predict that the person will not play golf on that day.
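To double-check the hand calculation, here is a minimal Python sketch that recomputes the priors, likelihoods, and unnormalized posteriors directly from the Play Golf table; the `data` list is just a transcription of the table above and `query` is the new day being classified.

```python
from collections import Counter

# Transcription of the Play Golf table: (Outlook, Temperature, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", False, "No"),      ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),  ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),   ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),  ("Rain", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),   ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rain", "Mild", "High", True, "No"),
]
query = ("Sunny", "Cool", "High", True)          # the new day we want to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for label, n_label in class_counts.items():
    score = n_label / len(data)                  # prior P(C)
    for i, value in enumerate(query):            # multiply by each likelihood P(x_i | C)
        n_match = sum(1 for row in data if row[-1] == label and row[i] == value)
        score *= n_match / n_label
    scores[label] = score

print(scores)                                    # approximately {'No': 0.0206, 'Yes': 0.0053}
print("Prediction:", max(scores, key=scores.get))
```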