Introduction to NLP with PyTorch

Overview

Natural Language Processing (or NLP) is a fascinating machine learning use case that deals with our natural language. Many recent technological breakthroughs are applications of Natural Language Processing: automated chatbots on websites, AI assistants like Alexa and Siri, next-word prediction systems on modern devices, and so on. The latest advances in Natural Language Processing are driven by deep neural networks. To this end, this article introduces NLP in PyTorch along with the fundamental concepts essential to getting a head start in the field of NLP. These concepts also form the backbone of present-day NLP's most modern and state-of-the-art tools.

Introduction

Text and speech are the primary ways humans interact with each other. The study of this interaction, that is, the study of the language we speak, is called linguistics, and the field has been around for a very long time.

Computational linguistics encompasses a wide variety of tasks that use computational algorithms to study human language. It too has been around for a while and, from the early 1990s through the 2010s, was dominated by the statistical study of natural language, called Statistical Natural Language Processing. While Statistical NLP successfully advanced many developments during its time, it required a lot of bookkeeping and storage because its manually defined algorithms relied on lookup tables and dictionaries.

The field of Natural Language Processing is now dominated by Machine Learning, and more specifically by deep neural networks (neural NLP), which have accelerated major breakthroughs in building systems around natural language text data. The prime reasons are the abundance of easily accessible text data and, even more importantly, advances in computing that allow large amounts of data to be processed quickly and cost-effectively. The latter is what empowers modern-day computers to process a lot of text data and hence become capable of learning from it.

Computers are now (artificially) intelligent enough to predict the next word in a sentence we're typing and even correct us, not only on spelling errors but also on sentence structure, grammar, and so on.

All of these are applications of Natural Language Processing with Deep Learning, and so in this article, we will learn NLP with PyTorch, which is one of the most popular and easy-to-use libraries for deep neural modeling. Let us begin with a formal definition of Natural Language Processing.

What is NLP?

"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data." - from Wikipedia.

Our end goal is to build software systems capable of "understanding" text data in its written form, including the contextual and semantic specifics of the language used. Furthermore, we want to build such systems capable of recognizing patterns and deriving insights from Natural Language text data.

Before diving into how computers can be made to understand and identify human language, let us first explore some exciting use cases of modeling text data using Natural Language Processing.

Use Cases of NLP

Let us look at some use cases of modeling text data with deep learning, that is, NLP with neural networks -

  • AI-based assistants work with speech data, which is essentially natural human language in spoken form. They are trained to act as personalized search engines by learning and extracting information from our daily verbal activity.
  • Email systems and e-commerce platforms like Amazon and Flipkart can deploy natural language systems to classify the label (spam or not spam, or a sentiment) associated with a piece of text. This boosts user productivity by separating spam mails from non-spam ones, or by using similarity scores between user reviews to recommend the right products.
  • Automatic information extraction is another use case of NLP, where relevant information such as names, genders, and contact details is extracted from text documents automatically. This falls under the broader task of Named Entity Recognition (NER), where entities in a document are tagged and labeled.
  • Summarization is another area where Natural Language Processing models can be trained to generate brief summaries from long pieces of text.

Deep Learning with PyTorch for NLP

Approaching an NLP problem

Computers understand numbers! For computers to understand and thus model our text data using algorithms, we need to define a way to map text data into something computers can work or "compute" with - numbers.

Deep learning algorithms require tensors or vectors as input. With images, the task is trivial: internally, for a computer, an image is already a multidimensional array of pixels, where each pixel value represents the brightness at that spot.

However, when working with text data, the task of mapping it to numbers or vectors, to be specific, is a challenging one, and we need to devise intuitive algorithms for this mapping.

There are a bunch of ways to map text data into vectors. Although more advanced techniques work by training the vectors for individual tokens, formally known as word or token embeddings, there are some basic, fundamental approaches that we will discuss next. Understanding these methods is important, as they can still be used to build NLP systems, for example, when little compute is available.

Note: Tokens and words slightly differ from each other. A lot of state-of-the-art NLP techniques for generating vectors for text data (better known as word or text embeddings) deal with tokens rather than with words. Our further discussions in this article shall make the difference between tokens and words clear while also highlighting the shortcomings of using one over the other.

There are a bunch of algorithms available to map text data to numbers, like TF-IDF, OHE (One-Hot Encoding), N-grams, BOW (Bag of Words), etc. We will be discussing some of these in the following sections.

One-Hot Encoding

  • One-hot encoding is used to map words into vectors, where each vector has a dimensionality equal to the size of the vocabulary, N.
  • Elaborating on it, in one-hot encoding each index in a vector corresponds to a unique word in the vocabulary, so each word is assigned a vector of size N with a 1 at its corresponding index and 0s everywhere else.
  • This means each word in the vocabulary has a unique vector assigned to it, and no two words can have the same one-hot vector representation.

With one-hot encoding, one word is represented as one vector, and so one sentence with w words can be represented as an array of vectors, that is, a two-dimensional array of size w * N.

And furthermore, for a paragraph with more than one sentence, say s sentences, the encoding will produce a three-dimensional array of size s * w * N.

Implementation in PyTorch

To see a working example of One-hot encoding, we will take a slightly different approach for brevity. We will be treating characters as the smallest unit rather than words, and so each word will be treated as a sequence of characters.

So, a word containing c characters shall now be mapped into an array of size c * N, where N is the size of the vocabulary of characters.

This means that the vocabulary is defined by the smallest unit being recognized. If we take characters as the smallest unit, then the vocabulary shall consist of the 26 letters of the English alphabet, and so N = 26.

On the other hand, if words are taken as the smallest unit, then the size of the vocabulary shall be defined by the number of words in the English language. Since it is impractical to create vectors as large as the number of words in a language, we restrict ourselves to a defined set of words that we treat as our vocabulary, and any word outside of that vocabulary is termed OOV (out of vocabulary).

In PyTorch -
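A minimal sketch follows, assuming the word "hello" as the sample input and the 26 lowercase English letters as the character vocabulary -

```python
import torch
import torch.nn.functional as F

# character vocabulary: the 26 lowercase English letters, so N = 26
vocab = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

word = "hello"  # sample word with c = 5 characters

# map each character to its integer index in the vocabulary
indices = torch.tensor([char_to_idx[ch] for ch in word])

# one-hot encode: the result is a (c x N) = (5 x 26) tensor with a single 1
# in each row at the position of the corresponding character
one_hot = F.one_hot(indices, num_classes=len(vocab))

print(one_hot.shape)  # torch.Size([5, 26])
print(one_hot)
```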


Drawbacks of One Hot Encoding

While one-hot encoding is the simplest and most intuitive way to encode text, it suffers from major drawbacks that make it unsuitable for a majority of NLP tasks. The drawbacks, summarised, are as follows -

  • With a large vocabulary, the encodings produced are very large and very sparse, causing inefficient and expensive computation.
  • There is no contextual or semantic information carried by One hot encoded vectors, and so these aren't suitable for tasks relying on the semantic properties of text data like POS tagging, NER, etc.
  • Each one hot encoded vector is equidistant from every other vector, so there is no scope to expose the relational information between the words of a vocabulary.

Tokenization

A Token can be any discrete unit of the language structure. Tokens can represent characters, words, sub-words, and even sentences.

Tokenization is the process of breaking raw text into constituent tokens. These tokens can further be used in certain ways to create embeddings and, therefore, for modeling raw text.

Hence, depending on the type of tokens, Tokenization can be of many types like word tokenization, subword tokenization, sentence tokenization, etc.

We will demonstrate tokenization using the torchtext module in PyTorch, like so -
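Here is a small sketch using torchtext's built-in basic_english tokenizer (the sample sentence is illustrative) -

```python
from torchtext.data.utils import get_tokenizer

# the basic_english tokenizer lower-cases the text and splits it on
# whitespace and punctuation
tokenizer = get_tokenizer("basic_english")

text = "This is an article about NLP in PyTorch."
tokens = tokenizer(text)

print(tokens)
# roughly: ['this', 'is', 'an', 'article', 'about', 'nlp', 'in', 'pytorch', '.']
```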


N-Gram Language Models

N-gram language models estimate the next word in a sentence by assigning probabilities to the candidate words. For example, given the partial sentence "The capital of France is", what word should come next?

The probability assigned to "Paris" shall be higher than that assigned to "parrot."

An n-gram can be defined as a contiguous sequence of n discrete units (words, subwords, characters, etc.) from a body of text or Speech. N-gram language models are then trained on these n-grams to model words in sequences.

Here is how the n-grams materialize: for the sentence "this is an article about NLP in PyTorch," we can construct unigrams, bigrams, trigrams, and so on as follows -

  • 1-gram (unigram): "this", "is", "an", "article", "about", "NLP", "in", "PyTorch"
  • 2-gram (bigram): "this is", "is an", "an article", "article about", "about NLP", "NLP in", "in PyTorch"
  • 3-gram (trigram): "this is an", "is an article", "an article about", "article about NLP", "about NLP in", "NLP in PyTorch"

Example of n-grams in PyTorch

Let us create an n-gram model using PyTorch, where we will create trigrams from the text corpus and train a model with an embedding layer for the text embeddings and two linear layers to learn patterns from the data.

First of all, we will define an embedding dimension and context size for creating the trigrams. Then we manually construct trigrams from the test sentence and define a vocabulary and a dictionary to map words to integers, like so -
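The exact corpus and hyperparameter values are not reproduced here; a minimal sketch with an illustrative test sentence could look like this -

```python
import torch
import torch.nn as nn
import torch.optim as optim

# illustrative hyperparameters
CONTEXT_SIZE = 2    # two preceding words are used to predict the next one
EMBEDDING_DIM = 10

# illustrative test sentence (stands in for the text corpus)
test_sentence = ("this is an article about NLP in PyTorch and it introduces "
                 "the fundamental concepts needed for NLP in PyTorch").split()

# trigrams as ([word_i-2, word_i-1], word_i) pairs
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# vocabulary and a dictionary mapping each word to an integer index
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

print(trigrams[:3])
```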

After this, we will inherit from the base module torch.nn.Module to define a custom model class to specify our model and the forward call, like so -
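A sketch of such a model class, with an assumed hidden size of 128 -

```python
class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        # one embedding vector per word in the vocabulary
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # two linear layers learn patterns from the concatenated context embeddings
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # look up the context word embeddings and flatten them into one vector
        embeds = self.embeddings(inputs).view((1, -1))
        out = torch.relu(self.linear1(embeds))
        out = self.linear2(out)
        # log-probabilities over the vocabulary for the next word
        return torch.log_softmax(out, dim=1)
```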

We will now loop over our trigrams and train our model for 10 epochs, while printing out the losses, like so -
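A sketch of the training loop, assuming the trigrams, vocabulary, and model class defined above -

```python
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()   # pairs with the log-softmax output of the model
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0.0
    for context, target in trigrams:
        # turn the two context words into a tensor of integer indices
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        model.zero_grad()
        log_probs = model(context_idxs)

        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"epoch {epoch}: loss {total_loss:.4f}")
```

The printed loss should decrease steadily over the 10 epochs as the model fits the trigrams.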


Bag of Words Language Models

While the n-gram language model worked with a sequential pattern of text constructing n-grams in the order they appear, i.e., maintaining their inherent notion of sequence, the bag of words model ignores the order of occurrence and focuses only on the words occurring in the text and their corresponding frequency. Hence, the model treats the words as if they are contained in a bag in an unordered manner.

With this model, we completely discard the structural information in the text documents and focus only on a known vocabulary of words and a measure of the presence of those known words.

Let us build and train a bag of words model using PyTorch. We will also use it to predict a word for a given set of context words.

First, we will define a function for constructing the context vector, build our model vocabulary from the raw text, and create the mapper dictionaries between words and integers, like so -
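The original raw text is not shown here; a minimal sketch with an illustrative corpus could look like this -

```python
import torch
import torch.nn as nn

# illustrative raw text standing in for the original corpus
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.""".split()

CONTEXT_SIZE = 2      # two words on each side of the target word
EMBEDDING_DIM = 100

vocab = set(raw_text)
vocab_size = len(vocab)

# mapper dictionaries between words and integers
word_to_ix = {word: ix for ix, word in enumerate(vocab)}
ix_to_word = {ix: word for ix, word in enumerate(vocab)}

def make_context_vector(context, word_to_ix):
    """Map a list of context words to a tensor of vocabulary indices."""
    return torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

# (context, target) pairs: each target word with two words on either side of it
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = [raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]]
    data.append((context, raw_text[i]))
```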

We will now define our custom model class with one embedding layer, two linear layers, and two suitable activation layers. Finally, the softmax layer is applied before producing the final output from the model.
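A sketch of such a model (the hidden size of 128, the ReLU activation, and the use of log-softmax paired with NLLLoss are assumptions) -

```python
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation1 = nn.ReLU()
        self.linear2 = nn.Linear(128, vocab_size)
        # (log-)softmax over the vocabulary produces the final output
        self.activation2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # sum the embeddings of all context words into a single "bag" vector
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.activation1(self.linear1(embeds))
        out = self.linear2(out)
        return self.activation2(out)
```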

Let us now train our model for 50 epochs and test it on a sample to make a prediction, like so -
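A sketch of the training loop and a test prediction, assuming the data, helper function, and model class defined above; the test context is simply the four words surrounding "the" in the illustrative corpus -

```python
model = CBOW(vocab_size, EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    total_loss = 0.0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss {total_loss:.4f}")

# test on a sample context; with enough training the model should recover
# the word that sits in the middle of these context words in the corpus
context = ["to", "study", "idea", "of"]
context_vector = make_context_vector(context, word_to_ix)
prediction = ix_to_word[torch.argmax(model(context_vector)).item()]

print(f"Context: {context}")
print(f"Prediction: {prediction}")
```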


Our model predicts 'the' as the most likely word for the given sample of context words.

With this, we are done studying the fundamental techniques that form strong basics for NLP in PyTorch.

Conclusion

This article gave the readers an introduction to NLP in PyTorch. Let us conclude what we've studied in this article -

  • We began by getting an introduction to the field of Natural Language Processing and studied its applications in the technological industry.
  • We looked at how computers can work with natural language by mapping text into numerical vectors, touching upon the concept of word embeddings.
  • Then, we studied various techniques to create vectors corresponding to words: one-hot encoding, n-gram language models, the bag of words model, and the concept of tokenization in NLP.
  • Every technique was learned with suitable code examples in PyTorch.