Introduction to N-grams in NLP


Overview

N-grams are contiguous sequences of items collected from a text or speech corpus, or from almost any other type of sequential data. The n in n-grams specifies the number of items to consider: a unigram for n = 1, a bigram for n = 2, a trigram for n = 3, and so on.

N-grams and n-gram models are widely used in probability, communication theory, computational linguistics (such as statistical natural language processing), computational biology, and other fields.

Introduction

  • In the modern world, the vast array of languages helps us express our feelings and thoughts and share ideas and customs across different cultures and societies, something unique to the human species.
    • Humans also have the capacity to use complex language far more than any other species on Earth.
  • AI systems that understand these languages and generate text are known as language models, and they are among the most prominent software technologies of this decade.
  • Language modeling is a crucial element in modern NLP applications and lets machines work with qualitative information. Each type of language model, in one way or another, turns the qualitative information generated by humans into quantitative information, which in turn allows people to communicate with machines, to a limited extent, much as they do with each other.

Let us understand further about N-grams and language models and how they are built using N-grams.

Language Modeling

Language modeling (LM) is the use of statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models assign probabilities to a sentence or a sequence of words, or estimate the probability of an upcoming word given a previous set of words.

  • Language models are useful for a vast number of NLP applications such as next-word prediction, machine translation, spelling correction, authorship identification, and natural language generation.
  • The central idea in language models is to use probability distributions over word sequences that describe how often the sequence occurs as a sentence in some domain of interest.
  • Language models estimate the likelihood of texts belonging to a language. The sequences are divided into multiple elements, and the language model models the probability of an element given the previous elements.
    • The elements can be bytes, characters, subwords or tokens. Then, the sequence likelihood is the product of the elements’ probabilities.
  • Language models are primarily of two kinds: N-Gram language models and Grammar-based language models such as probabilistic context-free grammar.
  • Language models can also be classified into Statistical Language Models and Neural language models.
  • Utility of generating language text using language modeling: Independently of any application, we could use a language model as a random sentence generator where we sample sentences according to their language model probability.
    • There are very few real-world use cases where you want to actually generate language randomly.
    • But the understanding of how to do this and what happens when you do so will allow us to do more interesting things later.

Statistical Language Modeling

A statistical language model is simply a probability distribution over all possible sentences. Statistical language models learn the probability of word occurrence based on examples of text.

  • While simpler models may look at a context of a short sequence of words, larger statistical language models may function at the level of sentences or paragraphs.
    • Typically, most commonly used statistical language models operate at the level of words.
    • Also, almost all language models decompose the probability of a sentence into a product of conditional probabilities.
  • Statistical language models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.
    • Out of all these models, the most popular and easily implementable are N-gram language models.

Neural Language Modeling

  • Neural language models use different kinds of neural network approaches, such as feedforward neural networks, recurrent neural networks, attention-based networks, and transformer-based networks, to model language, and they have surpassed statistical language models in their effectiveness.
  • Neural language models have many advantages over the statistical language models as they can handle much longer histories and also can generalize better over contexts of similar words and are more accurate at word prediction.
  • Neural net language models are also much more complex, slower, and more energy-intensive to train, and they are less interpretable than statistical language models.
  • Hence, for practical settings without much computing power or training data, and especially for smaller tasks, statistical language models like the n-gram language model are the right tool.
  • On the other side, large language models (LLMs) based on neural networks represent the state of the art and have given rise to major advancements in NLP and AI.
    • They hold the promise of transforming domains through learned knowledge and are slowly becoming ubiquitous in day-to-day text and speech use cases.
    • LLM sizes have also been increasing roughly 10x every year for the last few years, and as these models grow in size and complexity, so do their capabilities, for example, becoming few-shot learners.

Evaluating Language Models

  • A machine learning model, in general, is considered good if it predicts a held-out test set of sentences well.
    • We reserve some portion of the data for estimating parameters and use the remainder for testing the model.
    • A good model assigns high probabilities to the test sentences, where typically the probability of each sentence is normalized for length.
  • An alternative approach to measuring how well the test samples are predicted is the measure of perplexity.
    • Models that minimize perplexity also maximize the probability of the test set.
    • Mathematically, perplexity is the inverse probability of the test set normalized by the number of tokens, or equivalently the exponentiated average negative log probability, as shown in the formula after this list.
  • Perplexity: Perplexity is a measure of surprise in random choices.
    • Distributions with high uncertainty have high perplexity. A uniform distribution has high perplexity because it is hard to predict a random draw from it.
    • A peaked distribution has low perplexity because it is easy to predict the outcome of a random draw from it.
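For reference, perplexity over a test set $W = w_1 w_2 \ldots w_N$ can be written as the inverse probability of the test set normalized by the number of tokens, which is the same as exponentiating the average negative log probability:

$$
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1 \ldots w_{i-1})\right)
$$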

What are N-grams?

N-gram Models

N-gram models are a particular set of language models based on the statistical frequency of groups of tokens.

  • An n-gram is an ordered group of n tokens. The bigrams of the sentence The cat eats fish. are (The, cat), (cat, eats), (eats, fish) and (fish, .). The trigrams are (The, cat, eats), (cat, eats, fish) and (eats, fish, .).
  • The smallest n-grams with n =1 are called unigrams. Unigrams are simply the tokens appearing in the sentence.
  • The conditional probability that a certain token appears after the previous tokens is estimated by Maximum Likelihood Estimation on a set of training sequences.

N-grams - the central concept is that the next word depends only on the previous n - 1 words.
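As a quick, dependency-free illustration of these definitions, the following sketch (not from the original article) extracts unigrams, bigrams, and trigrams from the example sentence, assuming it has already been split into tokens:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "cat", "eats", "fish", "."]

print(ngrams(tokens, 1))  # [('The',), ('cat',), ('eats',), ('fish',), ('.',)]
print(ngrams(tokens, 2))  # [('The', 'cat'), ('cat', 'eats'), ('eats', 'fish'), ('fish', '.')]
print(ngrams(tokens, 3))  # [('The', 'cat', 'eats'), ('cat', 'eats', 'fish'), ('eats', 'fish', '.')]
```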

Intuitive Formulation

The intuitive idea behind n-grams and n-gram models is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words like humans do while understanding speech and text.

Illustration for N-gram probabilities

$$
\begin{aligned}
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n) && \text{unigram} \\
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n \mid w_{n-1}) && \text{bigram} \\
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n \mid w_{n-1}\, w_{n-2}) && \text{trigram} \\
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n \mid w_{n-1}\, w_{n-2}\, w_{n-3}) && \text{4-gram} \\
P(w_n \mid w_1 \ldots w_{n-1}) &\approx P(w_n \mid w_{n-1}\, w_{n-2}\, w_{n-3}\, w_{n-4}) && \text{5-gram}
\end{aligned}
$$

  • Hence, N-gram models are a simple class of language models (LMs) that assign probabilities to sequences of words using a shorter context.
  • Using these assigned probabilities, we can predict the chance of a word appearing in a given position, for example, the last word of an n-gram given the previous words.

The Use of N-grams

N-gram models are the simplest and most common kind of language model.

  • In general, modeling with n-grams alone is an insufficient model of a language because text in natural languages has long-distance dependencies.
  • But in practice, we can still use N-Gram models to represent languages effectively.
  • Choice of N in N-Gram models: The accuracy of the model increases with an increase in N.
    • But with bigger N values, we run the risk of not getting good estimates for the N-Gram probabilities, and the N-Gram tables will be more sparse.
    • The smaller the N, the less accurate the model, but we may get better estimates for the N-Gram probabilities, and the N-Gram tables will be less sparse.
  • In practice, we rarely use anything higher than trigrams, and often nothing beyond bigrams.
    • For example, with a vocabulary of 10,000 words, the N-gram table sizes are 10,000 for unigrams, 10,000 × 10,000 = 100,000,000 for bigrams, and 10,000 × 10,000 × 10,000 = 1,000,000,000,000 for trigrams.

Bi-gram Model

  • The bigram model approximates the probability of a word given all the previous words by using only the conditional probability of one preceding word.
  • So, the prediction for the next word is dependent on the previous word alone.
  • It is also one of the most widely used models; a minimal sketch follows below.
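Below is a minimal sketch of the bigram idea on a tiny hand-made toy corpus (the corpus and helper names are illustrative only, not from the original article): bigram counts are collected, and the probability of a word given the previous word is its relative frequency.

```python
from collections import Counter, defaultdict

# Toy training corpus, already tokenized and padded with sentence boundaries.
corpus = [
    ["<s>", "the", "cat", "eats", "fish", "</s>"],
    ["<s>", "the", "dog", "eats", "meat", "</s>"],
    ["<s>", "the", "cat", "sleeps", "</s>"],
]

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """MLE estimate P(curr | prev) = count(prev curr) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 2/3
print(bigram_prob("cat", "eats"))  # 1/2

def predict_next(prev):
    """Return the most likely next word after `prev`, or None if unseen."""
    counts = bigram_counts[prev]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat'
```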

Probability Estimation

  • There are generally two main steps in building a machine learning model: defining the model and estimating the model's parameters; the latter is called the training or learning step. For language models, the definition depends on the model we choose.
  • There are also two quantities we need to estimate when developing language models, for all words in the vocabulary of the language we are working with:
    • The probability of observing a sequence of words from a language. For example, Pr(Colorless green ideas sleep furiously).
      • This is the probability of a sentence or sequence: Pr(w1, w2, …, wn).
    • The probability of observing a word having observed a sequence. For example, Pr(furiously | Colorless green ideas).
      • This is the probability of the next word in a sequence: Pr(wk+1 | w1, …, wk).
    • Pr(w1, w2, …, wn) is short for Pr(W1 = w1, W2 = w2, …, Wn = wn).
    • This notation denotes that the random variable W1 takes on the value w1, and so on. For example, Pr(I, love, fish) = Pr(W1 = I, W2 = love, W3 = fish).
  • Typical assumptions made in probability estimation methods: Probability models almost always make independence assumptions.
    • Even though two variables, X and Y, are not actually independent, the model may treat them as independent, which can drastically reduce the number of parameters to be estimated.
    • Models without independence assumptions have way too many parameters to estimate reliably from the data we may have and are intractable.
    • Sometimes the independence assumptions are not correct, and models relying on those incorrect assumptions may often be incorrect as well, for example by assigning probability mass to events that cannot occur.
  • Probability estimation without independence assumptions: If we make no independence assumptions about the sequence, one way to estimate its probability is the fraction of times we see it: Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N, where N is the total number of sequences.
    • Estimating with such frequency fractions is problematic because we would need to have seen the particular sentence many times to assign it a good probability.
    • Estimating from sparse observations is also unreliable, and we have no estimate at all for new sequences.
  • Maximum Likelihood Estimation: The most basic parameter estimation technique is relative frequency estimation (frequencies are counts), also called Maximum Likelihood Estimation (MLE).
    • Estimation simply works by counting how often each n-gram appears in a source text corpus and then normalizing the counts into probabilities.
    • Chain rule of probability in estimation: To estimate sequence probabilities, we rely on the chain rule of probability, decomposing the joint probability into a product of conditional probabilities, which the independence assumption then shortens (see the equations after this list).
      • Keep in mind that estimating conditional probabilities with long contexts is difficult; conditioning on 4 or more words is already very hard.
  • Markov assumption in probability estimation: The use of Markov assumption in probability estimation solves a lot of problems we encounter with data sparsity and conditional probability calculations.
    • The assumption is that the probability of a word depends only on the previous word(s). It is like saying the next event in a sequence depends only on its immediate past context.
    • Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past.
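Putting the pieces above together for the bigram case: the chain rule factorizes the sequence probability, the Markov assumption shortens each conditioning context, and MLE estimates each factor from counts $C(\cdot)$:

$$
\begin{aligned}
P(w_1, \ldots, w_n) &= \prod_{k=1}^{n} P(w_k \mid w_1, \ldots, w_{k-1}) \\
&\approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \qquad \text{(Markov / bigram assumption)} \\
P_{\mathrm{MLE}}(w_k \mid w_{k-1}) &= \frac{C(w_{k-1}\, w_k)}{C(w_{k-1})}
\end{aligned}
$$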

Challenges of Probability Estimation

  • Reliable Generalization: The main problem to tackle with estimating parameters or probabilities is to do it reliably so that the models generalize well to unseen data.
    • We can estimate unigrams quite reliably, but they are often not a good model.
    • Higher-order n-grams require large amounts of data; they are better models but also have a tendency to overfit the data.
  • General Challenges of building a good language model
    • It is difficult to build a language model that does well on the task we want to use it for, as it takes a huge amount of time for both development and training.
    • A good language model should also model the language well: if we ask questions of the model, it should provide reasonable answers, and it should be able to keep doing so, somewhat recursively, depending on our further responses.
    • We also want well-formed sentences of the language (for example, English) to be more probable, where the words the model predicts as the next in a sequence should fit well grammatically, semantically, contextually, and culturally.
      • Given how simple the basic models are, this is a complex task and perhaps too much to ask for.

Sensitivity to the Training Corpus

Most language models currently in practice, especially statistical language models, are extremely sensitive to changes in the style, topic, or genre of the text on which they are trained.

  • Hence, they are very sensitive to the particular corpus they are trained on.
  • For example, for modeling casual phone conversations, one is much better off using a very large corpus of telephone-conversation transcripts than a corpus of even millions of words of transcripts from TV and radio news broadcasts.
  • The effect is quite strong, even for changes that seem trivial to a human. For example, a language model trained on one newswire publication can see its perplexity double when applied to text from a very similar newswire publication from the same time period.

Smoothing

Smoothing is the task of adjusting the maximum likelihood estimate of probabilities to produce more accurate probabilities.

  • Central idea in smoothing algorithms: We assign some probability mass to unseen events by taking away some probability mass from seen events.
  • Smoothing can also be seen as the process of flattening a probability distribution implied by a language model so that all reasonable word sequences can occur with some probability.
    • This often involves broadening the distribution by redistributing weight from high-probability regions to zero-probability regions.
  • The name smoothing reflects the fact that these techniques tend to make distributions more uniform, adjusting low probabilities (such as zero probabilities) upward and high probabilities downward.
  • Smoothing not only prevents zero probabilities but also attempts to improve the accuracy of the model as a whole.
  • In higher-order N-gram models, which tend to be domain or application-specific, smoothing provides a way of generating generalized language models.

Laplace Smoothing or Add-one Smoothing

Laplace smoothing merely adds one to each count (hence the alternate name add-one smoothing). We also need to adjust the denominator to take the extra observations into account: since there is a fixed number of words V in the vocabulary and each word's count is incremented by one, the denominator grows by V. The resulting formula is shown below.
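For a bigram model with vocabulary size $V$, the Laplace-smoothed estimate therefore becomes:

$$
P_{\mathrm{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i) + 1}{C(w_{i-1}) + V}
$$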

  • Laplace smoothing does not perform well enough to be used in modern n-gram models but is a useful tool and introduction to most other concepts.
  • A related way to view Laplace smoothing is as discounting: lowering some non-zero counts in order to obtain the probability mass that will be assigned to the zero counts.
  • Add-k smoothing: One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k. This algorithm is therefore called add-k smoothing.
    • Add-k smoothing requires a method for choosing k, which can be done by optimizing over different values of k on a holdout set.
    • Add-k smoothing is useful for some tasks such as text classification, in addition to probability estimation, but it still doesn't work well for language modeling, as it generates counts with poor variances and often inappropriate discounts.

Implementation

Setting up The Jupyter Notebook

Jupyter Notebook can be installed via the popular Anaconda distribution or from Jupyter directly. Reference instructions for setting up from source can be followed from here. Once installed, we can spin up an instance of the Jupyter notebook server, open a Python notebook, and run the following code to set up the basic libraries and functionality.
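The original setup cell is not reproduced here; a minimal sketch that imports the libraries used later in this walkthrough (assuming nltk is already installed) might look like this:

```python
# Core libraries used throughout this walkthrough.
import re
import random
import pickle
from collections import Counter
from itertools import chain

import nltk
from nltk.util import ngrams, everygrams, pad_sequence
from nltk.lm import MLE, Vocabulary
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.tokenize import ToktokTokenizer

print("NLTK version:", nltk.__version__)
```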

If a particular library is not found, it can simply be installed from within the notebook by running a system command such as pip install sample_library. For example, to install seaborn, run the command from a cell within the Jupyter notebook like:
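```
# The leading "!" runs a shell command from within the notebook cell.
!pip install seaborn
```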

Getting the Dataset

  • For this particular n-gram modeling problem, we are getting data directly from here and also from a Kaggle dataset here.
  • The dataset is an excerpt from an article by Adam Kilgarriff, a corpus linguist specializing in lexicography.
  • The dataset files should be downloaded and kept in the same directory as the Jupyter notebook in which we run the code.
  • A sample excerpt from the first dataset can be printed as shown below:
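The snippet below is a sketch for loading and peeking at the downloaded excerpt; the filename is a placeholder and should match whatever name the downloaded file was saved under.

```python
# Read the downloaded excerpt; replace the filename with the one you saved.
with open("language-never-random.txt", encoding="utf8") as fin:
    text = fin.read()

# Print the first few hundred characters as a sample of the excerpt.
print(text[:300])
```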

Data Processing

We will use a naive sentence tokenizer and the ToktokTokenizer, which requires no extra dependencies, for tokenization. A sketch of how this tokenization can be done is below.
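Here is one possible sketch of that tokenization, using a simple regex-based sentence splitter and NLTK's ToktokTokenizer for words; it assumes the text variable loaded in the previous step.

```python
import re

from nltk.tokenize import ToktokTokenizer

toktok = ToktokTokenizer()

def naive_sent_tokenize(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [sent for sent in re.split(r"(?<=[.!?])\s+", text) if sent]

# A list of sentences, each a list of lowercased word tokens.
tokenized_text = [
    [token.lower() for token in toktok.tokenize(sentence)]
    for sentence in naive_sent_tokenize(text)
]

print(tokenized_text[0])  # tokens of the first sentence
```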

  • Traditionally, we can use n-grams to generate language models to predict which word comes next given a history of words. We'll use the lm module in nltk to get a sense of how non-neural language modeling is done.
  • If we want to train a bigram model, we need to turn this text into bigrams.
  • Let us look at what the first sentence of our text would look like if we use the ngrams function from NLTK for this.
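A sketch of that step, using the bigram case (n = 2):

```python
from nltk.util import ngrams

# Bigrams of the first tokenized sentence.
first_sentence = tokenized_text[0]
print(list(ngrams(first_sentence, 2)))
```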

The output is a list of bigram tuples over the tokens of the first sentence.

  • Concept of Padding: Notice how in the above output, "when" occurs both as the first and second member of different bigrams, but "hence" and "linguistic" don't.
    • As a general practice, it is better to also indicate how often sentences start with "hence" and end with "linguistic".
    • A standard way to deal with this is to add special "padding" symbols to the sentence before splitting it into ngrams.
    • NLTK has functions for padding; let us apply them to see how padding works on our example first sentence (see the combined sketch after this list).
    • The n argument tells the padding function which n-gram order we need padding for, and the nltk.lm module provides a convenience function that has the padding symbols already set, while the other arguments remain the same as for pad_sequence.
    • Bigram and trigram padded examples are included in the combined sketch after this list.
  • Let us combine the two parts discussed so far. We get the following preparation steps for one sentence.
  • To make our model more robust, we could also train it on unigrams (single words) as well as bigrams, which are the main source of information. NLTK provides a function called everygrams for this.
  • During training and evaluation of the model, we will rely on a vocabulary that defines which words are "known" to the model. To create this vocabulary, we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.
  • In most cases, we want to use the same text as the source for both the vocabulary and the n-gram counts, so we can simply import an NLTK function that does everything together: padded_everygram_pipeline. All of these steps are combined in the sketch below.
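The sketch below combines the padding, everygrams, vocabulary-stream, and pipeline steps discussed in this list; it assumes the tokenized_text list built earlier, and the exact printed values will depend on the downloaded text.

```python
from itertools import chain

from nltk.util import ngrams, everygrams, pad_sequence
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

first_sentence = tokenized_text[0]

# Explicit padding: for n=2, pad_sequence adds one "<s>" and one "</s>".
padded_bigram_sent = list(pad_sequence(
    first_sentence, n=2,
    pad_left=True, left_pad_symbol="<s>",
    pad_right=True, right_pad_symbol="</s>",
))

# Convenience function from nltk.lm.preprocessing with the symbols preset.
print(list(pad_both_ends(first_sentence, n=2)))  # bigram padding: one <s>, one </s>
print(list(pad_both_ends(first_sentence, n=3)))  # trigram padding: two <s>, two </s>

# Combining the two parts: padding, then n-gram extraction, for one sentence.
padded_bigrams = list(ngrams(pad_both_ends(first_sentence, n=2), 2))

# Unigrams as well as bigrams via everygrams.
print(list(everygrams(list(pad_both_ends(first_sentence, n=2)), max_len=2))[:8])

# Flatten padded sentences into one stream of words for the vocabulary.
vocab_stream = list(chain.from_iterable(
    pad_both_ends(sent, n=2) for sent in tokenized_text
))

# Or do everything in one go: lazy n-gram iterators plus a flat padded word stream.
train_data, padded_sents = padded_everygram_pipeline(2, tokenized_text)
```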

Building the Model

  • With all the data preprocessing steps done and our data prepared, we can start training a model. We will train a Maximum Likelihood Estimator (MLE) in this case. We only need to specify the highest n-gram order to instantiate the model (see the sketch after this list).
  • In some cases, we want to ignore words that we did see during training, but that didn't occur frequently enough to provide us with useful information.

  • We can tell the vocabulary to ignore such words using the unk_cutoff (present in nltk.lm.vocabulary.Vocabulary) argument for the vocabulary lookup.

  • Also, for more sophisticated ngram models that include smoothing, we can look at the following objects from nltk.lm.models:

    • Lidstone: Provides Lidstone-smoothed scores.
    • Laplace: Implements Laplace (add one) smoothing.
    • InterpolatedLanguageModel: Logic common to all interpolated language models (Chen & Goodman 1995).
    • WittenBellInterpolated: Interpolated version of WittenBell smoothing.
  • Using the N-gram Language Model: When it comes to n-gram models, the training boils down to counting up the ngrams from the training corpus.

  • Generation using the N-gram language model is shown in the sketch after this list.
  • Saving the model, if needed, is also included there.
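The following sketch pulls the model-building steps together: training an MLE model with NLTK, inspecting counts and scores, applying a vocabulary cutoff, generating text, and saving the model. It assumes the tokenized_text built in the preprocessing steps; the output filename and the example word "language" are placeholders.

```python
import pickle
from itertools import chain

from nltk.lm import MLE, Vocabulary
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

n = 2  # highest n-gram order

# The pipeline returns lazy iterators, so build them right before fitting.
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

# Instantiate and train the Maximum Likelihood Estimator.
model = MLE(n)
model.fit(train_data, padded_sents)
print("Vocabulary size:", len(model.vocab))

# Training boils down to counting n-grams; scores are relative frequencies.
print(model.counts["language"])   # how often "language" was seen
print(model.score("language"))    # its unigram probability under the model

# Optionally build a vocabulary that maps rare words (seen < 2 times) to <UNK>.
words = chain.from_iterable(pad_both_ends(sent, n=n) for sent in tokenized_text)
vocab = Vocabulary(words, unk_cutoff=2)
print(vocab.lookup("someveryrareword"))  # maps to "<UNK>" if seen fewer than 2 times

# Generate a short text sample from the trained model.
print(model.generate(20, random_seed=42))

# Save the model to disk if needed; if plain pickle fails on model internals,
# the dill library can be used as a drop-in replacement.
with open("kilgarriff_ngram_model.pkl", "wb") as fout:
    pickle.dump(model, fout)
```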

Conclusion

  • Language modeling is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence.
  • N-gram language model predicts the probability of a given N-gram within any sequence of words in the language.
  • Maximum likelihood estimation is one of the popular techniques for estimating parameters and probabilities of NLP models.
  • Smoothing is a method for adjusting the maximum likelihood estimate of probabilities to produce more accurate probabilities when we encounter problems of sparsity.
  • In Laplace smoothing (add-one smoothing), we simply add 1 to all word counts so that we never incur zero probabilities.