Identify Topics from News Items with NLP

Overview

Natural language processing (NLP) comes with a plethora of tasks, such as sentiment analysis, text summarization, and many more. One such important task is topic modeling: the process of automatically discovering the topics present in a document or corpus of data, along with the words that characterize each topic. This is useful because scanning a short list of topic keywords is far faster and simpler than reading through every document in a collection. This article revolves around topic modeling, and we will build a small project around it.

What are We Building?

We are going to build a machine learning model using Latent Dirichlet Allocation (LDA) that will perform topic modeling on a dataset containing news items.

Description of Problem Statement

Input: A dataset containing news items belonging to different topics.

Output: A model that can identify the topic of unseen news data.

Pre-Requisites

  • A preliminary understanding of machine learning.
  • Sound knowledge of natural language processing (NLP) tasks such as stemming, lemmatization, etc.
  • Good command of the Python language.

How Are We Going to Build This?

This is an overview of the steps involved:

  • First, we are going to preprocess the raw text data by performing tasks such as tokenization, stopword removal, stemming, and lemmatization.
  • Next, this preprocessed text is converted into a bag of words: a mapping where the key is a word from the corpus, and the value is the number of times that word occurs in a given document.
  • Using this bag of words, we will build our LDA model.
  • Train the model on the input data.
  • Test the model on unseen news data.

Final Output

The final output will be a model that can take in any news item as input and give out the topic category the input belongs to.

Requirements

  • scikit-learn: Simple and efficient tools for predictive data analysis.

  • gensim: Gensim is a free, open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.

  • nltk: NLTK is a leading platform for building Python programs to work with human language data.

  • numpy: NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more, and it interoperates well with the other libraries above.

Building a Model to Identify Topics from News Items with NLP

Dataset

The dataset we have chosen for this task is the 20 Newsgroups dataset available in scikit-learn. It contains news articles grouped into 20 categories. The dataset is easily available and quick to import, hence ready to use.

Here's one way to load the dataset (a minimal sketch; shuffling with a fixed random_state is our choice, made for reproducibility):
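```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the training split of the 20 Newsgroups dataset.
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
```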

Now, let's look at a couple of the news items:
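```python
# Print the first two raw posts, truncated for readability.
for text in newsgroups_train.data[:2]:
    print(text[:500])
    print('-' * 40)
```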

The printed posts are raw newsgroup messages, complete with email headers, quoted replies, and signatures.

We can also see the target topic names available:
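```python
# The names of the 20 newsgroup categories.
print(newsgroups_train.target_names)
```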

Output:
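```
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
 'sci.space', 'soc.religion.christian', 'talk.politics.guns',
 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
```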

Observing these category names, we can say that some of the broad topics are:

  • Science
  • Politics
  • Sports
  • Religion
  • Technology

Data Preprocessing

As discussed earlier, we are going to perform a few data preprocessing steps to convert the raw text data into machine-readable form:

  • Tokenization: Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered discrete elements. The token occurrences in a document can be used directly as a vector representing that document. This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning.
  • Stopwords removal: A stop word is a commonly used word (such as “the”, “a”, “an”, or “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. These words should be removed as they would contribute nothing to the model.
  • Stemming: Stemming is the process of reducing a word to its stem by stripping prefixes and suffixes. Example: "faster" is converted to "fast".
  • Lemmatization: Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, stemming the word 'Caring' would return 'Car', whereas lemmatizing it would return 'Care'.

We will implement the aforementioned steps using gensim and nltk.

Code Implementation

Now, we will create two functions:

  • One for stemming and lemmatization.
  • One for tokenization and stopword removal.

Code implementation (a minimal sketch built on gensim's simple_preprocess and stopword list plus NLTK's stemmer and lemmatizer; the SnowballStemmer, the verb part of speech for lemmatization, and the minimum token length of 3 are our assumptions):
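```python
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

nltk.download('wordnet')  # needed once for the WordNet lemmatizer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize (treating the token as a verb), then stem the result.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and very short tokens.
    return [lemmatize_stemming(token)
            for token in simple_preprocess(text)
            if token not in STOPWORDS and len(token) > 3]
```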

Let's see how the above functions perform on a sample document:
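```python
doc_sample = newsgroups_train.data[0]
print('Original document:\n', doc_sample[:300])
print('Preprocessed tokens:\n', preprocess(doc_sample))
```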

The output shows the original text next to its preprocessed form: a list of lowercased, stemmed, and lemmatized tokens with stopwords removed.

Now, we will push our entire dataset through the function:
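```python
# Preprocess every post in the corpus (this can take a minute or two).
processed_docs = [preprocess(doc) for doc in newsgroups_train.data]
print(processed_docs[:2])
```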

Each news item is now represented as a list of cleaned tokens.

Converting Text to Bag of Words

As discussed earlier, we are going to convert the preprocessed text to a bag of words: for each document, a mapping from each word to the number of times that word occurs in it. The first step is to build a dictionary that assigns every unique word in the corpus an integer id.

This is implemented through gensim's Dictionary class:
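```python
import gensim

# Map every unique token in the corpus to an integer id.
dictionary = gensim.corpora.Dictionary(processed_docs)
```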

Here's a peek at what the dictionary looks like:
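```python
# Show the first few (id, token) pairs in the dictionary.
for token, token_id in list(dictionary.token2id.items())[:10]:
    print(token_id, token)
```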


We can also remove the very rare and the very common words (the exact thresholds below are our choices):
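```python
# Drop tokens that appear in fewer than 15 documents or in more than
# half of all documents, then keep at most the 100,000 most frequent.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
```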

Next, we create the bag-of-words model for each document, i.e., a list of (token id, token count) pairs reporting which words appear in that document and how many times. We save this to bow_corpus:
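```python
# doc2bow counts the occurrences of each dictionary token per document.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
```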

Here's a preview of the bag of words for the first document:
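```python
print(bow_corpus[0][:10])  # list of (token_id, count) pairs
```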

Building the Model

LDA can be implemented quickly and efficiently through the gensim library. However, we need to specify how many topics we expect the dataset to contain; let's say we start with eight unique topics. The number of passes is the number of training sweeps the model makes over the corpus (10 below, like the worker count, is our choice):
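```python
# Train an LDA model on the bag-of-words corpus.
# num_topics=8 matches the eight topics chosen above; passes=10 and
# workers=2 are assumptions you may want to tune.
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=8,
                                       id2word=dictionary,
                                       passes=10,
                                       workers=2)
```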

Now, for each topic, we will explore the words occurring in that topic and their relative weights:
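```python
# Print every topic as a weighted list of its ten most probable words.
for idx, topic in lda_model.print_topics(num_topics=-1, num_words=10):
    print('Topic {}: {}'.format(idx, topic))
```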

Each topic is printed as a weighted combination of its most probable words.

The LDA model doesn't give a name to any of these topics; it is up to us humans to interpret the word lists.

Testing the Model on an Unseen Document
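To test the model, let's pull a post from the held-out test split (using the first test document is an arbitrary choice for illustration):

```python
# Fetch the test split and pick one unseen document.
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
unseen_document = newsgroups_test.data[0]
print(unseen_document)
```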


We have to perform the same preprocessing steps on the unseen document before handing it to the model:
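```python
# Apply the same preprocessing, then convert to a bag-of-words vector.
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
print(bow_vector)
```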

The output is the document's bag-of-words vector: a list of (token id, count) pairs.
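Finally, we feed this vector to the trained model. Indexing a gensim LDA model with a bag-of-words vector returns the document's topic distribution, which we sort so that the most likely topic comes first:

```python
# Score the unseen document against every topic, most likely first.
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -tup[1]):
    print('Score: {:.4f}\t Topic: {}'.format(score, lda_model.print_topic(index, 5)))
```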

The model prints a probability score for each topic alongside that topic's top words, with the most likely topic first.

Hence, the model correctly assigns the unseen document to the X category with 'x'% probability.

Conclusion

  • Topic modeling is useful because scanning a short list of topic keywords is far faster and simpler than reading through every document in a corpus.
  • LDA is one of the most widely used algorithms for topic modeling.
  • We can implement LDA easily using gensim.
  • The necessary NLP preprocessing tasks, such as stemming and lemmatization, are handled well by NLTK.