BBC News Classification Project

Overview

Working with Machine Learning (ML) and Natural Language Processing (NLP) at the same time can seem daunting. This article will get you working with both in no time: we will use ML and NLP to classify news from the BBC (British Broadcasting Corporation) into five categories.

This project can also be extended to real-time applications: the classifiers we develop could help news networks direct their stories to the right audience.

What are We Building?

In the BBC News Classification Project, we are building a predictive model that reads news records and classifies them into categories.

The inputs under consideration are the news headlines, each labeled with its category.

After cleaning and preprocessing the dataset using NLP techniques, we will use Machine Learning algorithms such as Random Forest and SVM to classify each headline into its respective category.

Pre-requisites

To get the most out of this article, the following prerequisites will help:

  • Basic knowledge of Python.
  • Familiarity with libraries such as Pandas, NumPy, Seaborn, Matplotlib, and scikit-learn.
  • An understanding of Machine Learning algorithms such as Random Forest, Linear Regression, and Logistic Regression.
  • An intermediate understanding of text cleaning and preprocessing techniques such as stemming and lemmatization.

How Are We Going to Build This?

Here's how we are going to work on this project:

  • Libraries - Importing the necessary NLP and ML libraries.
  • Data Analysis - This step will enable us to figure out the various values and features of the dataset.
  • Data Visualization - With basic Data Visualization, we will be able to figure out the various underlying patterns of our dataset.
  • Preprocessing - Since we are working with textual data in this project, preprocessing is very important for us to create a classifier. Using techniques like Tokenization and Lemmatization, we'll make our data model-ready.
  • Model Training - In this step, we will use various Machine Learning algorithms like Random Forest, Logistic Regression, SVC, etc., to try and create a classifier that successfully classifies news to their respective categories.
  • Model Evaluation - To make sure our classifier is working the way we want it to, we'll compute evaluation metrics such as accuracy and ROC score.
  • Model Testing - Finally, we will test our classifier with real-life data to see if it can actually predict the category of the headline.

Final Output

Our final output is a classifier that can predict the category of a given headline.

Requirements

  • Environment - An online or offline Python environment (e.g., a notebook or IDE)
  • Libraries - Pandas, NumPy, Seaborn, Matplotlib, scikit-learn, and NLTK (for PorterStemmer, WordNetLemmatizer, stopwords, etc.)

About the Data

The dataset that we'll be working with is the BBC News Dataset, a collection of BBC news stories made available as a benchmark for machine learning research. For convenience, the original data has been transformed into a single CSV file that preserves the news title, the name of the corresponding text file, the news content, and the category.

Understanding the Dataset

The aim of our classifier is to successfully predict the category of a headline. This could help news corporations segregate their headlines and be more accurate with their broadcasts.

Algorithms Used in BBC News Classification

For this project, here are the Machine Learning algorithms we will use to predict a headline's category:

  • Logistic Regression - Logistic Regression is a classification algorithm. It is most often used to forecast a binary outcome from a number of independent variables, and scikit-learn's implementation extends it to multi-class problems like ours.

  • Naive Bayes - The Naive Bayes algorithm is a supervised learning method for classification problems based on Bayes' theorem. It is widely employed in text categorization, especially with large training sets.

  • Random Forest - Random Forest is a supervised machine learning approach that grows and combines many decision trees to produce a "forest." It is built on the premise that uncorrelated models (decision trees) perform noticeably better in combination than they do individually.

  • Linear SVC - The Linear SVC (Support Vector Classifier) fits our data and returns a "best fit" hyperplane that divides, or categorizes, it. Once we have the hyperplane, we can feed features to the classifier to see what the predicted class is. This makes the algorithm well suited to our use case, though it applies in many other situations.

Building the Classifier

Loading the Dataset

To start things off, we will load our dataset using Pandas.
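A minimal loading sketch follows; the file name bbc-news.csv and the column names used throughout this article's code sketches (title, category) are assumptions, so adjust them to match your copy of the dataset.

```python
import pandas as pd

# File and column names are assumptions; rename to match your dataset.
df = pd.read_csv("bbc-news.csv")
print(df.shape)
print(df.head())
```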

Output

[Output image: the first few rows of the loaded dataset]

Visualization of the Dataset

Now that we have loaded our dataset, let us get some insights. Before modeling, we want to spot patterns in the data itself, so we will use Seaborn to draw several plots from our dataset.

Firstly, let us check the number of headlines in each category using a simple bar plot.
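A sketch of such a plot with Seaborn, assuming the category column from the loading step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count of headlines in each of the five categories ("category" is assumed).
sns.countplot(x="category", data=df)
plt.title("Number of headlines per category")
plt.show()
```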

Output

[Output image: simple bar plot of headline counts]

We will visualize the number of stop words present in our dataset:
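One way to sketch this, using NLTK's English stop word list and the assumed title column:

```python
import matplotlib.pyplot as plt
import nltk
import seaborn as sns
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Count the stop words appearing in each headline ("title" is assumed).
df["stopword_count"] = df["title"].apply(
    lambda text: sum(1 for word in text.lower().split() if word in stop_words)
)

sns.histplot(df["stopword_count"])
plt.title("Stop words per headline")
plt.show()
```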

Output

[Output image: stop word counts in the dataset]

Using a rainbow plot, we will determine the most frequently occurring words in our dataset:
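A sketch of one plausible reading of this plot: a bar chart of the top words drawn with Seaborn's "rainbow" palette.

```python
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

# Word frequencies across all headlines ("title" is an assumed column name).
words = " ".join(df["title"]).lower().split()
labels, counts = zip(*Counter(words).most_common(20))

sns.barplot(x=list(counts), y=list(labels), palette="rainbow")
plt.title("Most frequent words")
plt.show()
```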

Output

[Output image: rainbow plot of the most frequent words]

To conclude, let us create a word cloud to visualize all the words in our dataset:
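A minimal sketch with the wordcloud package (pip install wordcloud), again assuming the title column:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Generate a word cloud from all headline text.
text = " ".join(df["title"])
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```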

Output

[Output image: word cloud of all words in the dataset]

Now that we have finished generating insights using visualization, let us move on to cleaning and preprocessing our dataset.

Data Preprocessing

Right now, our dataset is not ideal for model building. It contains stop words, recurring words, punctuation, and other inconsistencies that can make our model underfit or overfit. Using basic techniques like tokenization and lemmatization, we will clean and preprocess our dataset; a consolidated code sketch of these steps appears after the list below.

  • Lowering

    We will convert our dataset to lowercase to keep the text consistent throughout.

  • Tokenization

    Splitting paragraphs and sentences into smaller units so they are easier to process in Natural Language Processing is called tokenization. The NLTK module in Python provides word_tokenize(), which helps us perform tokenization.

    Output

  • Stop Word Removal

    The words that are generally filtered out before processing natural language are called stop words. They are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.

    Output

  • Stemming and Lemmatization

    Stemming is the practice of chopping the last few characters off a word, which frequently produces incorrect spellings and meanings. Lemmatization, by taking context into account, reduces a word to its meaningful base form, or lemma.

  • Encoding Data

    Encoding is the process of converting meaningful text into a number/vector representation that maintains the relationships between words and sentences.

    In this project, we will encode our dataset using the TF-IDF Vectorizer in NLP.

    Output
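Here is a consolidated sketch of the steps above, assuming NLTK for tokenization, stop word removal, and lemmatization, and scikit-learn's TfidfVectorizer for encoding; the clean_text helper and the clean_title column are names introduced here for illustration.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowering + tokenization.
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens only (drops punctuation) and remove stop words.
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Lemmatization: reduce each token to its base form.
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

df["clean_title"] = df["title"].apply(clean_text)

# Encoding: TF-IDF turns the cleaned text into numeric feature vectors.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["clean_title"])
```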

To conclude, we will map the various categories of our dataset to numerical values:
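A sketch of the mapping, assuming the dataset's five standard categories:

```python
# Map each category name to an integer label (category names assumed).
category_map = {"business": 0, "entertainment": 1, "politics": 2, "sport": 3, "tech": 4}
df["category_id"] = df["category"].map(category_map)
```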

Output

Model Training

Before moving on with model training, we will split our dataset into train and test using train_test_split() provided by sklearn.

To convert the text into token counts, discarding separators such as spaces, commas, and exclamation marks, we will use the CountVectorizer() function.
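A sketch of both steps, reusing the clean_title and category_id columns introduced above; the 80/20 split ratio is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hold out 20% of the headlines for testing.
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_title"], df["category_id"], test_size=0.2, random_state=42
)

# CountVectorizer builds a vocabulary from the training text and turns each
# headline into a vector of token counts.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
```

Each of the four models below is then fit on these count vectors; a consolidated training sketch follows the list.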

  • Logistic Regression

    Output

  • Random Forest

    Output

  • Naive Bayes

    Output

  • Linear SVM (Support Vector Machine)

    Output
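A consolidated sketch of training all four models on the count vectors; the hyperparameters shown are defaults plus assumptions, not necessarily the article's exact settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
}

# Fit every model on the vectorized training headlines.
for name, model in models.items():
    model.fit(X_train_vec, y_train)
```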

Model Evaluation

Model Evaluation helps us validate our Machine Learning models, showing which model performs best and which needs improvement. A sketch of computing these metrics for all four models follows the list below.

  • Logistic Regression Metrics

    Output

  • Random Forest Metrics

    Output

  • Naive Bayes Metrics

    Output

  • Linear SVM Metrics

    Output
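A sketch of scoring all four models on the held-out test set with accuracy and a per-category classification report:

```python
from sklearn.metrics import accuracy_score, classification_report

for name, model in models.items():
    y_pred = model.predict(X_test_vec)
    print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.3f}")
    print(classification_report(y_test, y_pred))
```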

Model Testing

According to our evaluation, all the models perform exceptionally well, with similar accuracy and precision scores (97-98%).

To check how our Machine Learning models perform on actual data, we will take some of the real headlines available on the BBC website and see what category they fall in. Since any of our models would do, we will go with Naive Bayes.
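A sketch of such a test; the headline below is made up for illustration and is not the exact headline used in the original run.

```python
# A hypothetical World Cup headline.
sample = ["Argentina win the FIFA World Cup final on penalties"]
sample_vec = vectorizer.transform([clean_text(t) for t in sample])

# Predict with the Naive Bayes model and map the label back to its name.
pred = models["Naive Bayes"].predict(sample_vec)[0]
inverse_map = {v: k for k, v in category_map.items()}
print(inverse_map[pred])  # expected: "sport"
```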

Output

As we can see, our headline was about the FIFA World Cup, and our Naive Bayes model successfully predicted it to be from the SPORTS category.

What is Next

Here are the next steps you can take to give this project your own spin!

  • Web Implementation - We can use React or Flask to create a web interface for this ML project.
  • Models - Apart from the standard ML models, you can use ensemble models and figure out the best model to work with.

Conclusion

  • BBC News Classification is a project in which we create a classifier that can successfully predict any news article's category.
  • The best accuracy we achieved was 98%, using Naive Bayes.