Naive Bayes Classifier in R Programming
Overview
Classification is one of the most important tasks in machine learning and data analysis projects. It involves assigning data points to predefined categories based on their features. Identifying spam emails, analyzing sentiment, diagnosing diseases, and recognizing images are some common real-world tasks that require classification.
In this article, we will discuss the supervised machine learning algorithm Naive Bayes in R. This algorithm is simple yet powerful and is based on Bayes' theorem. It works on the concept of probability and can be applied to many different classification tasks. We will explore how it works and then implement it in R to see it in practice.
What is Naive Bayes?
The name "Naive Bayes" is made up of two parts, "Naive" and "Bayes," each describing a key aspect of the algorithm:
- Naive: The term "Naive" refers to the algorithm's assumption that all features used in classification are independent of one another. When making predictions, the algorithm treats each feature as if it has no relationship with or influence on the others.
For example, when determining whether an email is spam, each word is evaluated independently, without considering its relationship with other words. This assumption decreases the computational complexity of the algorithm and enhances its efficiency.
On the other hand, this assumption can be a limitation: a model that incorrectly treats symptoms as independent may misdiagnose a rare disease in a patient because it has not considered that the symptoms are correlated.
- Bayes: The term "Bayes" refers to Bayes' theorem, an important result in probability theory. The theorem provides a mathematical approach for calculating conditional probabilities, allowing the algorithm to update the probability of a particular event using new data. It can be expressed as:

P(A|B) = P(B|A) × P(A) / P(B)

where:
- P(A|B): The posterior probability, i.e., the probability of event A occurring given that event B has occurred.
- P(B|A): The likelihood, i.e., the probability of event B occurring given that event A has occurred.
- P(A): The prior probability of event A.
- P(B): The prior probability of event B.
Introduction to Naive Bayes Classification
Naive Bayes classification operates by calculating the probability that a given data point belongs to each possible class based on the available features, and then assigning the data point to the class with the highest probability.
Understanding the Data
To understand the concept of the Naive Bayes classification algorithm, let us consider a dataset with customers' snack preferences and whether they purchased the snack. Here's what our dataset looks like:
| Sr. No. | Snack Items | Purchase |
|---|---|---|
| 1 | Burger | Yes |
| 2 | Pizza | No |
| 3 | Noodles | Yes |
| 4 | Burger | Yes |
| 5 | Pizza | Yes |
| 6 | Burger | No |
| 7 | Noodles | No |
| 8 | Noodles | Yes |
| 9 | Pizza | No |
| 10 | Noodles | Yes |
| 11 | Noodles | No |
| 12 | Noodles | No |
| 13 | Pizza | Yes |
| 14 | Pizza | Yes |
| 15 | Noodles | No |
Creating Frequency Tables
In the next step, we will convert our dataset into frequency tables. Here, we will count how many times each snack choice (e.g., Pizza, Burger, Noodles) corresponds to a purchase (Yes or No). It will help us to understand the distribution of purchases for each snack.
| Snack Items | No | Yes |
|---|---|---|
| Burger | 1 | 2 |
| Pizza | 2 | 3 |
| Noodles | 4 | 3 |
| Total | 7 | 8 |
Creating a Likelihood Table
Further, we will create likelihood tables by calculating the probabilities of the snack items based on the target variable, i.e., the purchase decision. To simplify calculating the prior and posterior probabilities, we can use two tables: Likelihood Table 1 displays the prior probabilities, while Likelihood Table 2 shows the posterior probabilities.
For each snack option, we will determine the likelihood of a purchase (Yes) and no purchase (No). Here is our Likelihood Table 1:
| Snack Items | No | Yes | Prior Probability |
|---|---|---|---|
| Burger | 1 | 2 | 3/15 = 0.20 |
| Pizza | 2 | 3 | 5/15 = 0.33 |
| Noodles | 4 | 3 | 7/15 = 0.47 |
| Total | 7 (7/15 = 0.47) | 8 (8/15 = 0.53) | |
Calculating Probabilities
We will create another likelihood table that summarizes the probabilities of a customer choosing a particular snack and making a purchase decision. Below is our Likelihood Table 2:
| Snack Items | No | Yes | Posterior Probability for No | Posterior Probability for Yes |
|---|---|---|---|---|
| Burger | 1 | 2 | 1/7 = 0.14 | 2/8 = 0.25 |
| Pizza | 2 | 3 | 2/7 = 0.29 | 3/8 = 0.375 |
| Noodles | 4 | 3 | 4/7 = 0.57 | 3/8 = 0.375 |
| Total | 7 | 8 | | |
Now suppose we want to calculate the probability that a customer who chooses Pizza makes a purchase, i.e., P(Yes|Pizza). By Bayes' theorem, P(Yes|Pizza) = P(Pizza|Yes) × P(Yes) / P(Pizza), and we can use the following steps:
- Calculate prior probabilities: P(Pizza) = 5/15 ≈ 0.33 and P(Yes) = 8/15 ≈ 0.53 (from Likelihood Table 1).
- Calculate posterior probabilities: P(Pizza|Yes) = 3/8 = 0.375 (from Likelihood Table 2).
- Put the prior and posterior probabilities in the equation: P(Yes|Pizza) = (0.375 × 0.53) / 0.33 = 0.60.
So, the probability of purchasing a Pizza is 0.60. Since this is greater than 0.5, the classifier predicts that a customer who chooses Pizza will make a purchase.
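As a quick sanity check, we can reproduce this calculation in R (the variable names are illustrative):

```r
# Reproduce the Bayes' theorem calculation for P(Yes | Pizza)
p_pizza <- 5 / 15           # prior P(Pizza)
p_yes <- 8 / 15             # prior P(Yes)
p_pizza_given_yes <- 3 / 8  # P(Pizza | Yes) from Likelihood Table 2

p_yes_given_pizza <- p_pizza_given_yes * p_yes / p_pizza
print(p_yes_given_pizza)    # 0.6
```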
Data Preparation/Dataset for Naive Bayes Classification
To develop a Naive Bayes classifier, data preparation is an essential step, as it is for any machine learning project. The functionality and accuracy of our model are directly dependent on the quality and appropriateness of our dataset.
Here are some important steps for preparing a dataset for Naive Bayes classification:
- Data Collection: Before data preparation, we need to collect relevant data from various sources, such as databases, APIs, etc.
- Data Understanding: It is important to understand the domain and context of our dataset. It includes knowing the target variable, understanding the meaning of each variable, and identifying any possible problems in the data.
- Data Cleaning: The dataset must be cleaned to deal with missing values, outliers, and inconsistent data.
- Data Exploration: The data should be explored to gain insights about its distribution and characteristics. We can identify patterns and relationships among variables using different methods such as visualizations, summary statistics, etc.
- Data Encoding: All the features in the dataset should be in a format that the Naive Bayes algorithm can handle. We can convert categorical variables into a suitable format by applying one-hot or label encoding, as sketched below.
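As a minimal sketch of the encoding step (the data frame and column names here are hypothetical), categorical variables can be label-encoded as factors or one-hot encoded with caret's dummyVars():

```r
library(caret)

# A small hypothetical dataset with one categorical feature
df <- data.frame(snack = c("Burger", "Pizza", "Noodles"),
                 purchase = c("Yes", "No", "Yes"))

# Label encoding: represent the categories as a factor
df$snack <- factor(df$snack)

# One-hot encoding: one indicator column per category level
dummies <- dummyVars(~ snack, data = df)
one_hot <- predict(dummies, newdata = df)
print(one_hot)
```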
Building a Naive Bayes Classifier in R Programming
Now, let us build a Naive Bayes classifier in R using the following steps:
Step 1 - Import necessary Libraries
In this step, we will import the three essential R packages - mlbench, caret, and e1071 for building a Naive Bayes classifier in R. If these packages are not already installed, we can easily do so with the following commands:
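```r
# Install the required packages (only needed once)
install.packages("mlbench")
install.packages("caret")
install.packages("e1071")
```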
Once the packages are successfully installed, we can load them in our R environment as shown below:
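```r
# Load the packages into the current R session
library(mlbench)
library(caret)
library(e1071)
```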
The mlbench package provides us with various datasets that can be used to evaluate the performance of machine learning algorithms. On the other hand, we can use the caret package for building and evaluating predictive models. Additionally, the e1071 package includes functions for training a Naive Bayes classifier.
Step 2 - Read and Explore the Dataset
We will use the built-in "Sonar" dataset from the "mlbench" package in R to build a Naive Bayes Classifier in R.
This dataset contains 208 rows with 61 variables, collected by reflecting sonar waves off a metal cylinder and off rocks at various angles and under various conditions. We will use the head() function to quickly print the first few rows of the dataset.
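```r
# Load the Sonar dataset from the mlbench package
data(Sonar)

# Print the first few rows of the dataset
head(Sonar)
```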
Output:
The target variable in this dataset is Class. The Class variable represents the type of object that was detected by the sonar, either a rock or a mine.
Step 3 - Train and Test Data
To evaluate the effectiveness of our Naive Bayes classifier, we will split the dataset into two parts, i.e., a training set and a testing set. We will use the training set for model training, while the testing set will be used to measure its accuracy.
Here, we set a random seed for reproducibility. Also, we are using 70% of the data for training and 30% for testing.
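One way to write this step, using caret's createDataPartition() (the seed value and the train/test names are illustrative, so exact results may differ from those reported below):

```r
# Set a random seed for reproducibility (the value itself is arbitrary)
set.seed(42)

# Create a stratified 70/30 split on the target variable Class
index <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
train <- Sonar[index, ]
test <- Sonar[-index, ]
```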
Step 4 - Create a naive Bayes model
Let us now train our model using the naiveBayes() function with our training data.
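```r
# Train a Naive Bayes classifier on the training data
model <- naiveBayes(Class ~ ., data = train)
```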
Here, Class ~ . means we're using the "Class" variable as the one to predict, while all other variables in the dataset (denoted by the dot ".") are used to make that prediction.
Step 5 - Make Predictions on the Test Dataset
Next, we can make predictions on our test data using the trained model.
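```r
# Predict class labels for the test set
predictions <- predict(model, newdata = test)
```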
Step 6 - Check the Accuracy of the Model
Now, we will create a confusion matrix and calculate the accuracy score. The confusion matrix will tell us how many instances our model predicted correctly and incorrectly. Additionally, we will print the accuracy score, calculated by dividing the number of correctly predicted cases by the total number of instances in the test set.
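```r
# Confusion matrix of predicted vs. actual classes
cm <- table(Predicted = predictions, Actual = test$Class)
print(cm)

# Accuracy: correctly predicted cases divided by total test cases
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```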
Output:
Here our model correctly classified 24 instances as M (Mine) and 25 instances as R (Rock), while it incorrectly classified 4 instances as M (Mine) and 9 instances as R (Rock). Therefore, the model correctly classified approximately 79.03% of all instances in our test set.
Next, let us extract a particular entry from the test dataset and make a prediction as shown in the following code:
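```r
# Extract the fifth entry from the test dataset
entry <- test[5, ]

# Predict its class with the trained model
prediction <- predict(model, newdata = entry)

# Compare the actual class with the predicted class
print(entry$Class)
print(prediction)
```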
Output:
Here, we extracted the fifth entry from the test dataset and used our trained Naive Bayes model to make a prediction for this entry. It then prints the actual Class of the entry and the predicted Class.
Types of Naive Bayes Classifiers
Naive Bayes classifiers are a family of probabilistic algorithms that find applications in various classification tasks.
Let us discuss in brief the four important types of Naive Bayes classifiers:
- Gaussian Naive Bayes Classifier
- It is ideal for continuous data that follows a Gaussian (normal) distribution.
- It assumes that the features are normally distributed within each class.
- It is widely applied wherever the features are continuous, real-valued measurements, such as sensor readings or physical attributes (see the sketch after this list).
- Multinomial Naive Bayes Classifier
- It is commonly used for classification of text documents.
- It assumes that the features represent counts, such as word frequencies or the frequencies of other specific items.
- It works especially well in document classification and spam detection applications when the data can be provided as word counts.
- Bernoulli Naive Bayes Classifier
- It is suitable for binary or boolean features, where each feature is present or absent.
- It is mostly used in text classification tasks to find whether a specific word exists within a given document.
- Complementary Naive Bayes Classifier
- It is designed for imbalanced datasets, where one class has many more instances than the others.
- To avoid bias towards the dominant class, it estimates feature probabilities from the complement of each class rather than from the class itself.
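As a brief illustration of the Gaussian variant mentioned above, e1071's naiveBayes() fits a Gaussian distribution per class for each numeric predictor; the built-in iris dataset provides suitable continuous features:

```r
library(e1071)

# iris has four continuous measurements per flower and a Species label;
# naiveBayes() models each numeric predictor with a Gaussian per class
gnb <- naiveBayes(Species ~ ., data = iris)

# Predict the species of the first flower
predict(gnb, newdata = iris[1, ])
```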
Applications of Naive Bayes Classifier
Because they are simple and effective in many situations, Naive Bayes classifiers find applications in various domains. Some common applications are:
- Text Classification: Naive Bayes classifiers can be used for sentiment analysis, spam detection, and document classification.
- Email Filtering: It can classify incoming emails as spam or non-spam.
- Medical Diagnosis: It can be used to detect diseases based on patient symptoms and test results and to predict the possibility of the existence of a particular disease.
- Recommendation Systems: These systems can provide relevant recommendations for products, services, movies, or similar content to customers on the basis of their preferences and behavior.
- Fraud Detection: It can detect fraudulent transactions, such as credit card or insurance fraud, by analyzing transaction data and patterns.
Conclusion
In conclusion, Naive Bayes in R is an efficient classification technique that has several real-world applications.
- Naive Bayes is a useful tool for data analysis and modeling in R because it is fast to train and performs well even on relatively small datasets.
- Naive Bayes can be further improved to handle more complex situations, such as imbalanced datasets, using approaches like oversampling and undersampling.
- Variants of Naive Bayes, such as Gaussian Naive Bayes and Multinomial Naive Bayes, support smoothing techniques (such as Laplace smoothing) that avoid zero probabilities and help produce reliable models in R.
- However, some common Naive Bayes pitfalls include the assumption of feature independence, sensitivity to sparse data, and difficulties when handling imbalanced datasets.