IMDB movie review polarity using Naive Bayes Classifier

Develop a naive Bayes algorithm-based machine learning model that predicts sentiment or polarity of an IMDB movie review.

Photo by Hitesh Choudhary on Unsplash


Naive Bayes is an algorithm that uses Bayes' theorem. Bayes' theorem is a formula that calculates a probability by counting the frequency of given values or combinations of values in a data set [6]. If A represents the prior events and B represents the dependent event, then Bayes' theorem states that P(B|A) = P(A|B) * P(B) / P(A). For review classification it reads:

Bayes' Theorem:

p(Ck|x) = p(Ck) * p(x|Ck) / p(x)

Here x stands for the words in the review text and Ck for the class label.

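p(Ck|x): the probability of class label Ck given the review words x

The review text x can be represented as {x1, x2, x3, …, xn}

The "naive" assumption is that the words are conditionally independent given the class, so:

p(Ck|x) = p(Ck|x1, x2, x3, …, xn) ∝ p(Ck) * Π(p(xi|Ck))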


About IMDB movie review dataset:

Data source:

The IMDB Movie Review Dataset consists of text reviews, loaded here into a data frame named ‘data’.

Each review in the dataset is labeled with a polarity of either positive or negative.

Let us take some random examples (not from the dataset) and learn how naive Bayes classifier works.

Taking a few examples:

Let’s take a toy set of movie reviews and their sentiment polarity (0->negative, 1->positive).

Text Preprocessing:

Here is a checklist to use to clean your data:

  1. Begin by removing the HTML tags
  2. Remove punctuation and a limited set of special characters such as , or . or #
  3. Check that the word is made up of English letters only and is not alphanumeric
  4. Check that the length of the word is greater than 2 (reportedly, there are no two-letter adjectives)
  5. Convert the word to lowercase
  6. Remove stopwords (for example: the, and, a, …)

After following these steps and checking for additional errors, we can start using the clean, labeled data to train models!
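The checklist above can be sketched in a few lines of Python. This is a minimal illustration, not the article's original code: the function name `clean_review` and the tiny `STOPWORDS` set are assumptions made for the example, and a real pipeline would normally use a fuller stopword list.

```python
import re

# Hypothetical mini stopword list for illustration only; a real pipeline
# would normally use a much fuller list.
STOPWORDS = {"the", "and", "a", "was", "with", "too", "ever", "this", "that"}

HTML_TAG = re.compile(r"<[^>]+>")
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9\s]")


def clean_review(text):
    """Apply the six checklist steps to one raw review and return its tokens."""
    text = HTML_TAG.sub(" ", text)          # 1. remove HTML tags
    text = SPECIAL_CHARS.sub(" ", text)     # 2. remove punctuation / special characters
    tokens = []
    for word in text.split():
        if not word.isalpha():              # 3. drop numeric and alphanumeric words
            continue
        if len(word) <= 2:                  # 4. keep only words longer than 2 letters
            continue
        word = word.lower()                 # 5. convert to lowercase
        if word in STOPWORDS:               # 6. remove stopwords
            continue
        tokens.append(word)
    return tokens


print(clean_review("<br />The plot was pointless, rated 2/10!!"))
# ['plot', 'pointless', 'rated']
```

Filtering on word length before the stopword lookup means very short stopwords such as "of" or "in" are already dropped by step 4.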

Bag of Words Representation:

The next step is to create a numerical feature vector for each document. BoW counts the number of times each token appears in every document of the collection. It returns a matrix with the following characteristics:

The number of columns = number of unique tokens in the whole collection of documents (vocabulary).

The number of rows = number of documents in the whole collection of documents.

Every cell contains the frequency of a particular token (column) in a particular document (row).
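To make the matrix layout concrete, here is a small from-scratch sketch. The helper `bag_of_words` and the toy documents are hypothetical and exist only for this illustration; in practice a library vectorizer such as scikit-learn's CountVectorizer builds the same kind of matrix in sparse form.

```python
from collections import Counter


def bag_of_words(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts token j in document i."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)          # one column per unique token
        for tok, count in Counter(doc).items():
            row[column[tok]] = count         # frequency of the token in this document
        matrix.append(row)                   # one row per document
    return vocabulary, matrix


# Hypothetical toy documents, already preprocessed into tokens.
docs = [
    ["great", "movie", "great", "music"],
    ["worst", "movie"],
]
vocab, X = bag_of_words(docs)
print(vocab)  # ['great', 'movie', 'music', 'worst']
print(X)      # [[2, 1, 1, 0], [0, 1, 0, 1]]
```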

To classify a new review, we compute the posterior probability of each class. This is easily done by looking up the tables of class priors and per-word conditional probabilities that we built in the learning phase.

P(class=1|text) ∝ P(class=1) * Π(P(wi|class=1))

Some important points:

  1. Laplace/Additive Smoothing

In statistics, additive smoothing, also called Laplace smoothing, is a technique for smoothing categorical data. Given an observation x = (x1, …, xd) from a multinomial distribution with N trials and parameter vector θ = (θ1, …, θd), a “smoothed” version of the data gives the estimator:

θ̂i = (xi + α) / (N + αd),  i = 1, …, d

where the pseudo count α > 0 is the smoothing parameter (α = 0 corresponds to no smoothing). Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical estimate xi / N and the uniform probability 1/d. Using Laplace’s rule of succession, some authors have argued that α should be 1 (in which case the term add-one smoothing is also used), though in practice a smaller value is typically chosen.

So how do we apply Laplace smoothing in our case?

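We might set the smoothing parameter α = 0.1 with d = 2, which gives the denominators 4.2 = 4 + 0.1*2 and 3.2 = 3 + 0.1*2 used in the worked examples below. We add α to every count (not 1 to every probability), so no conditional probability P(wi|class) is zero, and therefore P(class|text) will never be zero.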
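As a small sketch of the smoothed estimate, the function below computes (count + α) / (total + α·d). The name `smoothed_probability` is hypothetical, and the defaults α = 0.1 and d = 2 are an assumption inferred from the denominators 4.2 and 3.2 in the worked examples that follow.

```python
def smoothed_probability(count, total, alpha=0.1, d=2):
    """Additive-smoothing estimate: (count + alpha) / (total + alpha * d)."""
    return (count + alpha) / (total + alpha * d)


# A word seen in 0 of 4 positive reviews still gets a small non-zero probability.
print(smoothed_probability(0, 4))            # 0.1 / 4.2
# A word seen in 3 of 4 positive reviews.
print(smoothed_probability(3, 4))            # 3.1 / 4.2
# With alpha = 0 there is no smoothing and the estimate can be exactly zero.
print(smoothed_probability(0, 4, alpha=0))   # 0.0
```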

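  2. Log probability for numerical stability

Multiplying many small probabilities can underflow to zero in floating-point arithmetic. Working with log probabilities avoids this, because the product becomes a sum, and the class with the largest log score is still the class with the largest posterior.

P(class=1 or 0|text) ∝ P(class=1 or 0) * Π(P(wi|class=1 or 0))

log(P(class=1 or 0|text)) = log(P(class=1 or 0)) + ∑(log(P(wi|class=1 or 0))) + constant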
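The effect is easy to see on a long input. The numbers below are hypothetical (a 1000-word review where every word has conditional probability 0.01) and are chosen only to trigger underflow.

```python
import math

# Hypothetical long review: 1000 words, each with conditional probability 0.01.
word_probs = [0.01] * 1000
prior = 0.5

# Direct product: the true value (about 1e-2000) is far below the smallest float.
product = prior * math.prod(word_probs)

# Log space: the same quantity as a sum of logs stays finite.
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)

print(product)    # 0.0
print(log_score)  # roughly -4605.86
```

Because both classes would underflow to 0.0 in the direct product, the comparison between them would be lost, while the log scores can still be compared.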

text query1: The plot of the movie was pointless with the worst music ever.

text preprocessed: * plot * movie * pointless * worst music *

P(class=1|text) ∝ P(class=1)*P(plot|1)*P(movie|1)*P(pointless|1)*P(worst|1)*P(music|1)

= (4/7)*(0.1/4.2)*(3.1/4.2)*(0.1/4.2)*(0.1/4.2)*(1.1/4.2) = 1.49097*10^(-6)

P(class=0|text) ∝ P(class=0)*P(plot|0)*P(movie|0)*P(pointless|0)*P(worst|0)*P(music|0)

= (3/7)*(1.1/3.2)*(3.1/3.2)*(1.1/3.2)*(1.1/3.2)*(2.1/3.2) = 1.10671*10^(-2)

Since the score for class=0 is greater than the score for class=1 for text query1, we classify the query text as a negative review.

text query2: In love with the action scenes and music was amazing too.

text preprocessed: * love * * action scenes * music * amazing *

P(class=1|text) ∝ P(class=1)*P(love|1)*P(action|1)*P(scenes|1)*P(music|1)*P(amazing|1)

= (4/7)*(2.1/4.2)*(2.1/4.2)*(3.1/4.2)*(1.1/4.2)*(2.1/4.2) = 1.38079*10^(-2)

P(class=0|text) ∝ P(class=0)*P(love|0)*P(action|0)*P(scenes|0)*P(music|0)*P(amazing|0)

= (3/7)*(0.1/3.2)*(0.1/3.2)*(0.1/3.2)*(2.1/3.2)*(0.1/3.2) = 2.6822*10^(-7)

Since the score for class=1 is greater than the score for class=0 for text query2, we classify the query text as a positive review.
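The two hand calculations can be checked with a short script. This is only a sketch: the per-class counts are read off the numerators in the examples above (numerator minus 0.1), the values α = 0.1 and d = 2 are inferred from the denominators 4.2 and 3.2, and all names are hypothetical.

```python
import math

ALPHA, D = 0.1, 2        # smoothing values inferred from the denominators 4.2 and 3.2
N_DOCS = {1: 4, 0: 3}    # 4 positive and 3 negative toy reviews (priors 4/7 and 3/7)

# Per-class counts read off the numerators in the worked examples.
COUNTS = {
    1: {"plot": 0, "movie": 3, "pointless": 0, "worst": 0, "music": 1,
        "love": 2, "action": 2, "scenes": 3, "amazing": 2},
    0: {"plot": 1, "movie": 3, "pointless": 1, "worst": 1, "music": 2,
        "love": 0, "action": 0, "scenes": 0, "amazing": 0},
}


def log_score(tokens, label):
    """log P(class) + sum of log P(word | class), with additive smoothing."""
    total_docs = sum(N_DOCS.values())
    score = math.log(N_DOCS[label] / total_docs)
    for tok in tokens:
        count = COUNTS[label].get(tok, 0)
        score += math.log((count + ALPHA) / (N_DOCS[label] + ALPHA * D))
    return score


def classify(tokens):
    """Return the label (0 or 1) with the larger log score."""
    return max(N_DOCS, key=lambda label: log_score(tokens, label))


query1 = ["plot", "movie", "pointless", "worst", "music"]
query2 = ["love", "action", "scenes", "music", "amazing"]
print(classify(query1), classify(query2))  # 0 1
```

Exponentiating `log_score` recovers the products computed by hand, so the script doubles as a check on the arithmetic.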

Implementing Multinomial Naive Bayes Classifier:

Train the Multinomial Naive Bayes classifier for different values of alpha and plot the error against alpha to find the value of alpha with the minimum error.

plot for error vs hyperparameter alpha
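The search over alpha can be sketched as a simple loop. The code below is a from-scratch stand-in written for illustration, not the article's implementation: the toy data, the alpha grid, and the helper names are all assumptions. A library classifier such as scikit-learn's MultinomialNB exposes the same `alpha` smoothing parameter and would normally be used on the real dataset.

```python
import math
from collections import Counter


def train(docs, labels, alpha):
    """Fit multinomial naive Bayes; return (log_priors, log_likelihoods, vocab)."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_priors = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihoods = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihoods[c] = {
            tok: math.log((word_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_priors, log_likelihoods, vocab


def predict(model, doc):
    """Return the class with the highest log score; unseen words are skipped."""
    log_priors, log_likelihoods, vocab = model
    scores = {
        c: log_priors[c] + sum(log_likelihoods[c][t] for t in doc if t in vocab)
        for c in log_priors
    }
    return max(scores, key=scores.get)


def error_rate(model, docs, labels):
    wrong = sum(predict(model, d) != y for d, y in zip(docs, labels))
    return wrong / len(docs)


# Hypothetical toy training and validation sets, already tokenized.
train_docs = [["great", "movie"], ["love", "music"], ["amazing", "action"],
              ["worst", "movie"], ["pointless", "plot"], ["bad", "music"]]
train_labels = [1, 1, 1, 0, 0, 0]
val_docs = [["great", "music"], ["bad", "plot"]]
val_labels = [1, 0]

# Hypothetical alpha grid; alpha must be > 0 here to avoid log(0).
errors = {}
for alpha in [0.01, 0.1, 1, 6, 10]:
    model = train(train_docs, train_labels, alpha)
    errors[alpha] = error_rate(model, val_docs, val_labels)

best_alpha = min(errors, key=errors.get)
print(errors, best_alpha)
```

On real data the `errors` dictionary is what would be plotted against alpha, with the error measured on held-out reviews rather than on the training set.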

We get the optimal value of alpha at 6, so we now perform the following steps:

  1. Apply Multinomial Naive Bayes for alpha=6
  2. Predict the output using the Multinomial Naive Bayes classifier
  3. Find test accuracy and train accuracy
  4. Plot a confusion matrix and heatmap

A confusion matrix is a table that allows us to visualize the performance of a classification algorithm: each cell counts how many reviews of a given true class were predicted as a given class.
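As a minimal sketch of what the confusion matrix and accuracy contain, the helpers below compute both from lists of labels. The labels and predictions are hypothetical and chosen only for illustration; they are not results from the IMDB model.

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))  # [[3, 1], [1, 3]]
print(accuracy(y_true, y_pred))          # 0.75
```

The diagonal holds the correctly classified reviews, so accuracy is the sum of the diagonal divided by the total count.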

Finding the most frequent words used in both positive and negative reviews:

We took a sample of positive and negative reviews and counted the most frequent words in each.

Here we observe that words like “bad” are used frequently, which indicates negative reviews.

Here we observe that words like “great” are used frequently, which indicates positive reviews.
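Counting the most frequent words per class takes only a few lines. The reviews below are hypothetical tokens made up for this sketch, not samples from the dataset.

```python
from collections import Counter

# Hypothetical preprocessed reviews grouped by label, for illustration only.
positive_reviews = [["great", "movie", "great", "music"], ["great", "action"]]
negative_reviews = [["bad", "plot"], ["bad", "movie", "bad", "music"]]


def most_frequent(reviews, n=3):
    """Count tokens across all reviews and return the n most common."""
    counts = Counter(tok for review in reviews for tok in review)
    return counts.most_common(n)


print(most_frequent(positive_reviews, 1))  # [('great', 3)]
print(most_frequent(negative_reviews, 1))  # [('bad', 3)]
```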


Some words appear in many documents from both classes, so they carry little information about the class. To overcome this problem there is a useful technique called term frequency-inverse document frequency (TF-IDF). It considers not just how often a word occurs in a document but also how rare the word is across the collection.

Furthermore, in the BoW model that we created, each token represents a single word. That’s called the unigram model. We can also try adding bigrams, where tokens represent pairs of consecutive words.

Scikit-learn implements TF-IDF with the TfidfVectorizer class.

With TF-IDF and bigrams, test accuracy improves from 82.464% to 85.508% and train accuracy from 87% to 93%.
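To illustrate the weighting and the bigram tokens described above, here is a simplified from-scratch sketch. It is not a reproduction of TfidfVectorizer: the helper names and toy documents are hypothetical, the smoothed idf formula ln((1 + n) / (1 + df)) + 1 is an assumption chosen for the example, and no length normalization is applied.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Unigrams plus n-grams of consecutive words up to max_n (bigrams by default)."""
    grams = list(tokens)
    for n in range(2, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(tokenized_docs, max_n=2):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    docs = [Counter(ngrams(doc, max_n)) for doc in tokenized_docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed idf (an assumption for this sketch): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


# Hypothetical toy documents, already tokenized.
docs = [["great", "movie"], ["bad", "movie"], ["great", "music"]]
weights = tfidf(docs)
print(weights[1])
```

In the second document, "movie" appears in two of the three documents while "bad" appears in only one, so "bad" receives the larger weight; the bigram "bad movie" is also present as its own token.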


Naive Bayes is a simple but useful technique for text classification tasks. We can create solid baselines with little effort and, depending on business needs, explore more complex solutions.

For text classification, Naive Bayes is commonly treated as the baseline: the accuracy of other algorithms is compared against it.

You can get full code here.

Thank You for reading

Please give 👏🏻 Claps if you like the blog
