The List of Data Science Blogs: ‘ Eight to Late ‘

Hello, beyond the World… Today I discovered another good blog to add to my List of Data Science Blogs. The name of the Blog is: ‘Eight to Late‘. Mysteriously enough the title of the Blog, but its content is about good practice and theory in Data Science issues. Today’s post is about the important Machine Learning topic of Naive Bayes classifiers. With a gentle introduction to a topic that is quite sophisticated in the Statistics knowledge required. A good brain wrap up:

A gentle introduction to Naïve Bayes classification using R

Some highlights:

We’re interested only in relative probabilities of the representative being a democrat or republican because the predicted party affiliation depends only on which of the two probabilities is larger (the actual value of the probability is not important). This being the case, we can factor out any terms that are constant.  As it happens, the denominator of the above equation – the probability of a particular voting pattern – is a constant because it depends on the total number of representatives (from both parties)  who voted a particular way.

Now, using the chain rule of conditional probability, we can rewrite the numerator as:

$p(D) p(v1=y, v2=n,v3=y|D)$

$= p(D)p(v1=y|D) p(v2=n,v3=y|D,v1=y)$

(…)

Basically, the second term on the left hand side, $p(v1=y, v2=n,v3=y|D)$, is the probability of getting a particular voting pattern  (y,n,y) assuming the rep is a Democrat (D). The definition of conditional probabilityallows us to rewrite this as the probability of getting a n vote for issue v2 and a y vote for issue v3 given that the rep is a Democrat who has voted y on issue v1. Again, this is simply a consequence of the definition of conditional probability.

(…)

The key assumption of Naïve Bayes is that the conditional probability of each feature given the class is independent of all other features. In mathematical terms this means that,

$p(v2=n|D,v1=y) = p(v2=n|D)$

and

$p(v3=y|D,v1=y,v2=n) = p(v3=y|D)$

The quantity of interest, the numerator of equation (1) can then be written as:

$p(D) p(v1=y, v2=n,v3=y|D)$

$= p(D)p(v1=y|D)p(v2=n|D)p(v3=y|D).......(2)$

(…)

The assumption of independent conditional probabilities is a drastic one.  What it is saying is that the features are completely independent of each other. This is clearly not the case in the situation above: how representatives vote on a particular issue is coloured by their beliefs and values. For example, the conditional probability of voting patterns  on socially progressive issues are definitely not independent of each other. However, as we shall see  in the next section, the Naïve Bayes assumption works well for this problem as it does in many other situations where we know upfront that it is grossly incorrect.

(…)

Another good example of the unreasonable efficacy of Naive Bayes is in spam filtering.  In the case of spam, the features are individual words in an email.  It is clear that certain word combinations tend to show up consistently in spam – for example, “online”, “meds”, “Viagra” and “pharmacy.” In other words, we know upfront that their occurrences are definitely not independent of each other. Nevertheless, Naïve Bayes based spam detectors which assume mutual independence of features do remarkably well in distinguishing spam from ham.

I am well reminded in this post how Machine Learning techniques are sometimes regarded as a kind of black magic art. How is it possible for the Naive Bayes classifier to work well under the incorrect assumption of independent conditional probabilities isn’t easy to accept… But it is the truth. Anyway I recommend the full reading of the post to more on this and I just would like to remark that I do not intend to display any political view with this post. I found the political subject by the middle of the article and was a bit annoyed by that. But neutrality, and recognizing the appropriateness of the subject to the kind of study presented softened the little distaste. And the timing couldn’t be better…

The rest of the post is well worth to read to better understand the Naive Bayes classifier, and its implementation with the R computing language.

To sum up: I have illustrated the use of a popular Naive Bayes implementation in R and attempted to convey an intuition for how the algorithm works.  As we have seen, the algorithm works quite well in the example case, despite the violation of the assumption of independent conditional probabilities.

The reason for the unreasonable effectiveness of the algorithm is two-fold. Firstly, the algorithm picks the predicted class based on the largest predicted probability, so ordering is more important than the actual value of the probability. Secondly, in many cases, a bias one way for a particular vote may well be counteracted by a bias the other way for another vote. That is, biases tend to cancel out, particularly if there are a large number of features.

That said, there are many cases in which the algorithm fails miserably – and we’ll look at some of these in a future post.  However, despite its well-known shortcomings, Naive Bayes is often the first port of call in prediction problems simply because it is easy to set up and is fast compared to many of the iterative algorithms we will explore later in this series of articles.

Featured Image: Project 3: Naïve Bayes and the Perceptron