The Naïve Bayes classifier is a popular machine learning algorithm. In this post we will discuss why it still works in practice even when the (naïve) conditional independence assumption is violated.
Introduction
The Naïve Bayes classifier is a machine learning algorithm that can be used for classification tasks, similar to the perceptron algorithm. It is a probabilistic method: it relies on probability theory to compute the probability of each class given the input features, and it predicts the class with the highest probability.
For example, consider an input instance described by n features and a class label y. Using Bayes' theorem, the probability of the class given the features is

$$
P(y \mid x_1, \ldots, x_n) = \frac{P(y)\,P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}
$$
Now, the naïve conditional independence assumption comes into play. Basically, the algorithm assumes that the features are conditionally independent given the output label, so the likelihood factorizes into a product of per-feature terms:

$$
P(x_1, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
$$

This replaces one hard-to-estimate joint distribution with n much simpler one-dimensional ones.
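To make the mechanics concrete, here is a minimal sketch of how this factorization could be turned into a classifier. It is not any particular library's implementation, and the plant features, class names and probability tables are invented for illustration.

```python
import math

def naive_bayes_posterior(x, priors, likelihoods):
    """Score each class y via P(y) * prod_i P(x_i | y), then normalize the scores."""
    scores = {}
    for y, prior in priors.items():
        log_score = math.log(prior)
        for i, value in enumerate(x):
            # The naive factorization: multiply per-feature conditional probabilities.
            log_score += math.log(likelihoods[y][i][value])
        scores[y] = math.exp(log_score)
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Invented example: two leaf features (width, height), each "small" or "large".
priors = {"fern": 0.5, "ivy": 0.5}
likelihoods = {
    "fern": [{"small": 0.8, "large": 0.2}, {"small": 0.7, "large": 0.3}],
    "ivy":  [{"small": 0.3, "large": 0.7}, {"small": 0.4, "large": 0.6}],
}
print(naive_bayes_posterior(("small", "small"), priors, likelihoods))
```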
What is conditional independence?
Before we go further, let's take a detour and build some intuition about what it means to be conditionally independent. For starters, two events A and B are independent if the occurrence of one event has no effect on the occurrence of the other. In probability terms, we say A and B are independent if

$$
P(A \cap B) = P(A)\,P(B)
$$
However, conditional independence is always with respect to a third event, so we say events A and B are conditionally independent given a third event C if

$$
P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)
$$
To understand it better, let us consider this nice example. Suppose we pick one of two coins at random, one fair and one biased towards heads, and then toss the same coin twice. Let A be the event that the first toss lands heads, B the event that the second toss lands heads, and C the event that we picked the biased coin. A and B are not independent: seeing the first toss land heads makes it more likely that we are holding the biased coin, which in turn makes heads on the second toss more likely. Once we know which coin we picked, however, the first toss tells us nothing more about the second, so A and B are conditionally independent given C.
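A quick simulation makes this tangible. The sketch below uses made-up numbers (a fair coin and a coin that lands heads 90% of the time): the estimate of P(A ∩ B) does not match P(A)P(B), but once we condition on C the product rule holds.

```python
import random

random.seed(0)
N = 200_000
count = {"A": 0, "B": 0, "AB": 0, "C": 0, "A_C": 0, "B_C": 0, "AB_C": 0}

for _ in range(N):
    biased = random.random() < 0.5       # C: we picked the biased coin
    p_heads = 0.9 if biased else 0.5     # assumed bias: 90% heads vs a fair coin
    a = random.random() < p_heads        # A: first toss is heads
    b = random.random() < p_heads        # B: second toss is heads
    count["A"] += a
    count["B"] += b
    count["AB"] += a and b
    if biased:
        count["C"] += 1
        count["A_C"] += a
        count["B_C"] += b
        count["AB_C"] += a and b

# Unconditionally, P(A and B) != P(A) P(B): the first heads hints at the biased coin.
print(count["AB"] / N, (count["A"] / N) * (count["B"] / N))
# Given C (the biased coin), the tosses are independent: P(A and B | C) = P(A|C) P(B|C).
print(count["AB_C"] / count["C"], (count["A_C"] / count["C"]) * (count["B_C"] / count["C"]))
```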
What happens in practice?
In practice, the naïve assumption is often violated. For example, suppose we try to predict a plant's type from the width, height and color of its leaves. The classifier treats all of these features as conditionally independent given the plant type, without any consideration of their effect on each other. In reality, leaf width and height are strongly dependent, so the same evidence is effectively counted twice and the classifier overestimates the true class probability.
Another example is text classification, where each class is modelled through the words that appear in its documents. The classifier assumes that each word occurs independently of every other word given the class, which is clearly false: words in real text are highly correlated with one another.
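The double counting is easy to see numerically. In the sketch below (numbers invented), a second feature that is an exact copy of the first adds no real information, yet the naïve estimate of the positive class becomes noticeably more extreme than the true posterior, while the predicted class stays the same.

```python
# A single binary feature A with P(A=1 | +) = 0.8 and P(A=1 | -) = 0.3,
# equal class priors, and a second feature B that is an exact copy of A.
p_a_pos, p_a_neg = 0.8, 0.3

# True posterior after observing A = 1 (the copy B adds no new information).
true_posterior = 0.5 * p_a_pos / (0.5 * p_a_pos + 0.5 * p_a_neg)

# Naive Bayes posterior after observing A = 1 and B = 1: A is counted twice.
nb_posterior = 0.5 * p_a_pos**2 / (0.5 * p_a_pos**2 + 0.5 * p_a_neg**2)

print(round(true_posterior, 3))  # ~0.727
print(round(nb_posterior, 3))    # ~0.877 -- overestimated, but the same predicted class
```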
Why does it work?
The answer to this question appeared in a paper from 1997 by Domingos and Pazzani titled On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. In that paper, the authors argued that although the Naïve Bayes algorithm generally produces poor estimates of the class probabilities, this does not hurt classification: under zero-one loss, the classifier only needs to identify which class has the highest probability, not to estimate that probability accurately.
Let P(+|x) and P(-|x) denote the true probabilities of the two classes given an instance x, and let the hatted versions denote the Naïve Bayes estimates. The classifier is optimal for x under zero-one loss whenever it picks the same class as the true probabilities would, that is, whenever

$$
\big(P(+\mid x) \ge P(-\mid x) \text{ and } \hat{P}(+\mid x) \ge \hat{P}(-\mid x)\big)
\quad\text{or}\quad
\big(P(+\mid x) \le P(-\mid x) \text{ and } \hat{P}(+\mid x) \le \hat{P}(-\mid x)\big)
$$
In other words, it does not matter if the Naïve Bayes classifier overestimates or underestimates the probabilities, as long as the estimated probabilities fall on the same side of the decision threshold as the true ones, so that the same class comes out on top.
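Here is a tiny illustration of that point with invented numbers: the estimated posteriors are far from the true ones, but each pair falls on the same side of 0.5, so the two classifiers label every instance identically.

```python
# Each pair is (true P(+ | x), Naive Bayes estimate of P(+ | x)) for one instance x.
# The estimates are deliberately far off (invented numbers), yet each falls on the
# same side of 0.5 as the true value, so the predicted class never changes.
instances = [(0.55, 0.97), (0.40, 0.08), (0.70, 0.99), (0.45, 0.21)]

for true_p, est_p in instances:
    true_label = "+" if true_p >= 0.5 else "-"
    nb_label = "+" if est_p >= 0.5 else "-"
    print(f"true={true_p:.2f} est={est_p:.2f} -> optimal: {true_label}, naive Bayes: {nb_label}")
```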
To make this idea of optimality clearer, let us describe an example that appeared in the same paper, about a dataset with three Boolean attributes A, B and C and two equally likely classes + and - (so P(+) = P(-) = 1/2). Attribute B is an exact copy of A, so A and B are completely dependent, while A and C are independent given the class.
A theoretically optimal classifier will choose to ignore the attribute B, since it adds no information beyond what A already provides, and base its decision only on A and C.
As such, the optimal classifier will assign the positive class + exactly when

$$
P(+\mid A, C) > P(-\mid A, C)
$$

where, expanding with Bayes' theorem and the independence of A and C within each class,

$$
P(+\mid A, C) = \frac{P(+)\,P(A \mid +)\,P(C \mid +)}{P(A, C)},
\qquad
P(-\mid A, C) = \frac{P(-)\,P(A \mid -)\,P(C \mid -)}{P(A, C)}
$$
On the other hand, the Naïve Bayes classifier does not ignore the attribute B: it is included in the calculations, which gives the following estimates for the two classes (since B = A, the factor for A appears twice):

$$
\hat{P}(+\mid A, B, C) \propto P(+)\,P(A \mid +)^2\,P(C \mid +),
\qquad
\hat{P}(-\mid A, B, C) \propto P(-)\,P(A \mid -)^2\,P(C \mid -)
$$
The Naïve Bayes classifier assigns the positive class + exactly when

$$
P(+)\,P(A \mid +)^2\,P(C \mid +) > P(-)\,P(A \mid -)^2\,P(C \mid -)
$$
We can simplify the equations for both classifiers by applying Bayes' theorem in the other direction, rewriting each class-conditional factor in terms of the single-attribute posteriors:

$$
P(A \mid +) = \frac{P(+\mid A)\,P(A)}{P(+)},
\quad
P(A \mid -) = \frac{P(-\mid A)\,P(A)}{P(-)},
\quad
P(C \mid +) = \frac{P(+\mid C)\,P(C)}{P(+)},
\quad
P(C \mid -) = \frac{P(-\mid C)\,P(C)}{P(-)}
$$
Plugging these rewrites into the optimal classifier decision equation, it becomes:

$$
\frac{P(+\mid A)\,P(+\mid C)\,P(A)\,P(C)}{P(+)} > \frac{P(-\mid A)\,P(-\mid C)\,P(A)\,P(C)}{P(-)}
$$
Using the fact that P(+) = P(-) = 1/2, the denominators on the two sides are equal.
Cancelling the common factors P(A), P(C) and the equal priors from both sides, the optimal decision reduces to:

$$
P(+\mid A)\,P(+\mid C) > P(-\mid A)\,P(-\mid C)
$$
Similarly, plugging these rewrites into the Naïve Bayes classifier decision equation, it becomes:

$$
\frac{P(+\mid A)^2\,P(+\mid C)\,P(A)^2\,P(C)}{P(+)^2} > \frac{P(-\mid A)^2\,P(-\mid C)\,P(A)^2\,P(C)}{P(-)^2}
$$
Using the fact that P(+) = P(-) = 1/2 once more, the denominators on the two sides are again equal.
Cancelling the common factors P(A)^2, P(C) and the equal priors leaves:

$$
P(+\mid A)^2\,P(+\mid C) > P(-\mid A)^2\,P(-\mid C)
$$
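As a sanity check on this algebra, the sketch below plugs made-up class-conditional probabilities (consistent with the setup above: equal priors and B an exact copy of A) into both the original decision rules and the simplified ones; each pair of decisions should agree.

```python
# Made-up class-conditional probabilities for the observed values A = 1 and C = 1
# (B is an exact copy of A), with equal class priors P(+) = P(-) = 1/2.
p_a = {"+": 0.9, "-": 0.2}   # P(A | class)
p_c = {"+": 0.6, "-": 0.4}   # P(C | class)
prior = {"+": 0.5, "-": 0.5}

# Single-attribute posteriors P(+ | A) and P(+ | C) via Bayes' theorem.
post_a = prior["+"] * p_a["+"] / (prior["+"] * p_a["+"] + prior["-"] * p_a["-"])
post_c = prior["+"] * p_c["+"] / (prior["+"] * p_c["+"] + prior["-"] * p_c["-"])

# Decisions from the original (unsimplified) rules.
optimal_plus = prior["+"] * p_a["+"] * p_c["+"] > prior["-"] * p_a["-"] * p_c["-"]
naive_plus = prior["+"] * p_a["+"] ** 2 * p_c["+"] > prior["-"] * p_a["-"] ** 2 * p_c["-"]

# Decisions from the simplified rules derived above.
optimal_simplified = post_a * post_c > (1 - post_a) * (1 - post_c)
naive_simplified = post_a ** 2 * post_c > (1 - post_a) ** 2 * (1 - post_c)

print(optimal_plus, optimal_simplified)   # should match
print(naive_plus, naive_simplified)       # should match
```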
Let p = P(+|A) and q = P(+|C), so that P(-|A) = 1 - p and P(-|C) = 1 - q. The two decision rules can then be written compactly.
The optimal classifier decision: assign the positive class whenever

$$
p\,q > (1-p)(1-q)
$$

which simplifies to p + q > 1.
Which results in a decision boundary described by:

$$
q = 1 - p
$$
The Naïve Bayes classifier decision: assign the positive class whenever
$$
p^2\,q > (1-p)^2\,(1-q)
$$
Which results in a decision boundary described by:

$$
q = \frac{(1-p)^2}{p^2 + (1-p)^2}
$$
The figure below shows the plot of these two decision boundaries over the unit square of possible (p, q) values. The two classifiers agree everywhere except in the narrow regions between the line and the curve, which is why the Naïve Bayes classifier so often produces the same predictions as the optimal one, even though the independence assumption is badly violated here.
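A rough version of that plot can be generated with a few lines of matplotlib; this is only an illustrative sketch, not the paper's original figure.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)

optimal_boundary = 1 - p                                   # q = 1 - p
naive_boundary = (1 - p) ** 2 / (p ** 2 + (1 - p) ** 2)    # q = (1-p)^2 / (p^2 + (1-p)^2)

# Shade where the two classifiers disagree: the regions between the line and the curve.
P, Q = np.meshgrid(p, p)
optimal_plus = P + Q > 1
naive_plus = P ** 2 * Q > (1 - P) ** 2 * (1 - Q)
disagree = optimal_plus != naive_plus
print("fraction of the (p, q) square where the decisions differ:", disagree.mean())

plt.plot(p, optimal_boundary, label="optimal boundary: p + q = 1")
plt.plot(p, naive_boundary, label="naive Bayes boundary")
plt.contourf(P, Q, disagree.astype(float), levels=[0.5, 1], alpha=0.3)
plt.xlabel("p = P(+ | A)")
plt.ylabel("q = P(+ | C)")
plt.legend()
plt.show()
```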
Conclusion
In this post, we looked at the Naïve Bayes algorithm and its conditional independence assumption, saw examples of how that assumption is violated in practice, and then examined why the classifier still works so well: under zero-one loss it only needs to rank the correct class highest, not estimate the class probabilities accurately.