Backoff in NLP
Overview
Backoff and smoothing are techniques in NLP to adjust probabilities to tackle data sparseness and parameter estimation while building NLP models. The methods involve broadening the estimated probability distribution by redistributing weight from highprobability regions to zeroprobability regions in a suitable fashion.
Introduction
First, we will briefly understand what language models are, what Ngram models are, what are the issues that arise with estimating probability distributions and, in turn, the parameters of these models, and then see how smoothing methods like backoff techniques can help solve these problems.
Language models
One of the underlying ideas behind language models is the simple idea that some word sequences are more frequent than others, which led to the development of language models for text and also gave rise to mainstream NLP research.
 Statistical Language Models (SLMs), in particular, are based on ngrams and are now widely used in most of the NLP fields.
 Statistical Language Models have the power to represent some crucial aspects of language and also are quick and easy to implement, and do not require the laborintensive development of NLP resources such as dictionaries or grammar, making them ideal for research in relatively resourcepoor languages also.
 Ngram language models: Statistical language models can be viewed as a probability distribution over a set of possible sentences representing how often each sentence occurs in the modeled language.
 The most widelyused SLMs are based on ngrams in which the probability of a given word in a sentence is determined by the n1 previous words, which is also the recent history.
 SLM computes probabilities based on the maximum likelihood (ML) estimates in the simplest form with some allowance for unseen instances.
 In a trigram model, for example, the maximum likelihood conditional probability of a word wi given its two predecessors wi2 and wi1 is defined as the counting (c) of occurrences of the trigram divided by the number of occurrences of its bigram constituent.
Now, let us see what issues arise when estimating parameters and computing probabilities in these language models.
Issues with probability estimation in LM
When dealing with probability distributions to be estimated for language models, we come across scenarios related to determining the probability or likelihood estimate of a sequence of words (like a sentence) occurring together when one or more words individually (unigram) or Ngrams such as bigramKaTeX parse error: Expected 'EOF', got '−' at position 7: (wi/wi−̲1) or trigramKaTeX parse error: Expected 'EOF', got '−' at position 7: (wi/wi−̲1wi−2) in the given set have never occurred in the past.
 Even if one ngram is unseen in the sentence, the probability of the whole sentence becomes zero.
 Hence the probability of a test document given some class can be zero even if a single word in that document is unseen
 One way to avoid this is to allot some probability mass by reserving for the unseen words.
 Smoothing techniques are touted as solutions to tackle this major problem in building language models.
 Example of unseen word estimation using MLE: If we take the example of probability estimation using MLE, for the P(the  burnish). If the bigram "burnish the" doesn't appear in our training corpus, then P(the burnish) = 0.
 But we know that no word sequence should have 0 probability as that can’t be intuitively right.
 Furthermore, MLE also produces poor estimates when the counts are nonzero but still small.
 Hence the need for better estimators than MLE for rare events, which is solved by somewhat decreasing the probability of previously seen events so that there is a little bit of probability mass left over for previously unseen events.
What is Backoff?
The general idea of backoff is that it always helps to use lesser context to generalize for contexts that the model doesn’t know enough about.
 For example,we can use trigram probabilities if there is sufficient evidence, else we can use bigram or unigram probabilities.
 It is the concept where sentences are generated in small steps which can be recombined in other ways.
 If we have no examples of a particular trigram Wn2, Wn1, Wn, to compute $P(WnWn1, Wn2)$, we can estimate its probability by using the bigram probability $P(WnWn1)$.
Why Do We Need Backoff in NLP?
We need techniques like smoothing (and backoff is a subset of smoothing techniques) to tackle the problems of sparsity and improve the generalization power of NLP models.
Issues of Sparsity in NLP
 Language models use huge amounts of parameters on data to build the NLP models, and sparsity becomes an issue, and incorporating proper smoothing techniques into the models usually leads to more accuracy than the original models.
 Data sparsity is another issue with language models, statistical modeling, and also NLP techniques in general, and smoothing can help with the performance issue.
 Sometimes, we come across the extreme case where there is so much training data, and even though all parameters can be accurately trained without smoothing, even in such cases, it is almost always usual to expand the model such as by moving to a higher ngram model to achieve improved performance and smoothing can help with such cases too.
, For example,, in several million words of English text, more than 50% of the trigrams occur only once, and 80% of the trigrams occur less than five times.
Issues with Model Generalization in NLP
Language models, in general, use a huge amount of data for training, and the output of these models is dependent on the Ngrams appearing in the training corpus. The models work well only if the training corpus is similar to the testing dataset; this is the risk of overfitting to the training set.
 As with any machine learning method, we would like results that are generalizable to new information.
 One of the other harder problems with language models and, in general, NLP is how we deal with words that do not even appear in training but are in the test data.
Thus, no matter how much data one has or how many parameters are there in the model, smoothing can almost always help performance for a relatively small effort.
Concept of Combining Estimators in NLP
 Typical probability estimation techniques like maximum likelihood estimation don't work when dealing with cases of unknown events where sparsity leads to poor performance.
 We can choose to estimate the parameters with different smoothing estimation techniques like Laplace smoothing or good Turing discounting.
 Smoothing methods provide the same estimate for all unseen or rare ngrams with the same prefix making use only of the raw frequency of the ngram.
 One other method is to combine a complex model with simpler models with techniques like linear interpolation or modified KneserNey smoothing or backoff techniques.
 Backoff techniques utilize the additional source of knowledge based on the ngram hierarchy.
 We can estimate the probability of those particular trigrams where there are no examples by using the bigram probability, and further, if there are no examples of the bigram to compute $P(WnWn1)$, then we can use the unigram probability $P(Wn)$ as a step ahead.
 Hence suitably combining various models of different orders is the secret to success for ngram models.
 The ngram models at level n can be combined with the models for the shorter unigrams to (n1)grams.
GoodTuring Discounting
Good Turing discounting is one of the useful techniques in smoothing to improve the model accuracy by taking into account additional information about the data distribution.
The basic idea behind good Turing smoothing is to use the total frequency of events that occur only once to estimate how much mass to shift to unseen events.
 GoodTuring estimator (GT) developed by Gale & Sampson in 1995 assumes the probability mass of unseen events to be n1 / N, where N is the total of instances observed in the training data and n1 is the number of instances observed only once.
 Reassign the probability mass of all events that occur k times in the training data to all events that occur k–1 times.
 Reestimates the amount of probability mass to assign to Ngrams with zero or low counts by looking at the number of Ngrams with higher counts.
 As you can see in the above illustration, we take the probability of Ngrams with lower values and give it to ones with zero.
 It is the norm to treat Ngrams with low counts, for example, with very small counts of one, as if the count was 0 in all the discounting methods.
 Advantages of Good Turing Discounting: Good Turing discounting Works very well in practice.
 Usually, the excellent Turing discounted estimate is used only for unreliable counts, for example, for counts less than 5.
 Disadvantages with Good Turing Discounting: There are two main problems with good Turing discounting.
 The reassignment may change the probability of the most frequent event, and also, we generally don’t observe events for every k.
 Both of these problems can be solved by a variant of good Turing called Simple GoodTuring Discounting.
Ngram Backoff
Concept of Ngram Backoff
The central idea in ngram backoff is to approximate the probability of an unobserved Ngram using more frequently occurring lower order Ngrams.
 If we are trying to compute $P\left(w_n \mid w_{n2} w_{n1}\right)$, but we have no examples of a particular trigram $w_{n2} w_{n1} w_{n}$, we can instead estimate its probability by using the bigram probability $P\left(w_n \mid w_{n1}\right)$.
 Similarly, if we don’t have counts to compute $P\left(w_n \mid w_{n1}\right)$, we can look to the unigram $P(w_n)$. Hence if an Ngram count is zero, we approximate its probability using a lower order Ngram.
General Formula for Backoff
\begin{aligned} &P\left(w_n \mid w_{n1}, w_{n2}\right) \ &=P^*\left(w_n \mid w_{n1}, w_{n2}\right), \text { if } C\left(w_{n2}, w_{n1}, w_n\right)>0 \ &=\alpha\left(w_{n1}, w_{n2}\right) P\left(w_n \mid w_{n1}\right), \text { otherwise } \end{aligned}
 Where P* is discounted probability (nonMLE probability) to save some probability mass for lower order ngrams
 α(wn1,wn2) is to ensure that probability mass from all bigrams sums up exactly to the amount saved by discounting in trigrams.
 The scaling factor is chosen to make the conditional distribution sum to one.
Example for Ngram Backoff
 If we consider the example i spent xxx, and let's say the trigram i spent in occurs only once, the probability of word / i, spent is 1 if the word is in and 0 for all other words.
 But the bigram spent three may occur many times when we look at bigram tokens beginning with spent, so we these instead to compute the probability.
Main Advantage of Ngram backoff: Extremely popular for Ngram modeling in speech recognition because we can control complexity as well as generalization.
Discounted Backoff
Need for Discounting in Backoff
 While modeling probabilities with backoff, higherorder ngrams are explicitly discounted and the ngram which has zero probability is replaced with a lowerorder ngram in backoff.
 This will add probability mass and the total probability assigned to all possible strings by the language model would be greater than 1!
 Hence the need to modify the discounted MLE probabilities also.
 So, in addition to the explicit discount factor (from higher grams to lower ones), we develop a function to distribute the additional probability mass to the lower order ngrams, which is done by the Discounted Backoff.
Katz Backoff
Katz backoff is another way to solve the issue of a zeroprobability word sequence in typical sparse train datasets where the corpus is limited and finite in size.
 We rely on a discounted probability P∗ if we’ve seen the ngram before if we have nonzero counts, and otherwise, we recursively back off to the Katz probability for the shorterhistory (N1)gram.
 Utilising the Katz backoff will help in assigning meaningful probabilities to acceptable word sequences in the language considered, which are missing in the dataset.
 The concept of discounting in Katz backoff involves subtracting a fixed discount D from each nonzero count for the ngrams (or entities) and then redistributing this probability mass to Ngrams with zero counts.
 Discounting can also be seen as a kind of additional smoothing step in the algorithm which introduces a discount factor which is the ratio of the discounted counts to the original count
 Hence discounting is the way of lowering each nonzero count c to c* (c > c*) according to the smoothing algorithm.
Stupid Backoff
There's too much information to use interpolation effectively when we are dealing with webscale applications, and therefore we need a different backoff technique.
 In Stupid Backoff, we use the trigram if we have enough data points to make it seem credible; otherwise, if we don't have enough of a trigram count, we back off and use the bigram, and if there still isn't enough of a bigram count, we use the unigram probability.
 Instead of the discounting step in the smoothing procedure, we can use the relative frequencies of the words alone.
 The method is named stupid because the initial thought when the method was proposed was that such a simple scheme can’t be possibly good.
 Eventually, the method turned out to be as good as the stateoftheart Kneser Ney smoothing.
Equations for Stupid Backoff
\begin{aligned} &S\left(w_n \mid w_{n1}, w_{n2}\right)\ &=\frac{C\left(w_{n2}, w_{n1}, w_n\right)}{C\left(w_{n2}, w_{n1}\right)} \text {, if } C\left(w_{n2}, w_{n1}, w_n\right)>0\ &=\alpha S\left(w_n \mid w_{n1}\right) \text {, otherwise }\ &S\left(w_n\right)=\frac{C\left(w_n\right)}{\text { Size of training corpus }} \end{aligned} The symbol S is used instead of the probability symbol P as these are not probabilities but scores.
Advantages of Stupid backoff
 Inexpensive yet accurate calculations with large training sets: Stupid Backoff is inexpensive to calculate in a distributed environment while approaching the quality of KneserNey smoothing for large amounts of data.
 The lack of normalization also does not affect the functioning of the language model as smoothing based on stupid backoff depends on relative rather than absolute featurefunction values.
 Inexpensive to calculate for webscale ngrams.
Deleted Interpolation
Interpolation is another smoothing technique to combine knowledge from multiple types of Ngram* into the computation of the probabilities. Deleted interpolation is similar to backoff, where for computing trigram probability, we use the bigram and unigram information of focus words.

The main difference with backoff is that if we have nonzero trigram counts, we rely on these trigram counts completely instead of interpolating with bigram and trigram counts.

We can also say that we are linearly interpolating a bigram and a unigram model when combining bigrams and unigrams with trigrams. We can generalize this to interpolating an Ngram model using an (N1)gram model.
 We need to note that this leads to a recursive procedure if the lower order Ngram probability also doesn't exist & hence if necessary, everything can be estimated in terms of a unigram model.

A scaling factor is also used to make sure that the conditional distribution will sum to one.
 An Ngramspecific weight is used, and it would lead to far too many parameters to estimate. In such cases, we need to cluster multiple such weights like with word classes suitably, or in extreme cases. We can also use a single weight if clustering doesn't work.

Equation for simple linear interpolation: $\begin{aligned} \hat{P}\left(w_n \mid w_{n2} w_{n1}\right)=& \lambda_1 P\left(w_n\right) \\ &+\lambda_2 P\left(w_n \mid w_{n1}\right) \\ &+\lambda_3 P\left(w_n \mid w_{n2} w_{n1}\right) \end{aligned}$
 One constraint is that the λs must sum to 1.

Concept of conditional interpolation λ: This is a slightly sophisticated version of linear interpolation where each λ weight is computed by conditioning on the context.
 This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy.
 So we can make the λs for those trigrams higher and thus give that trigram more weight in the interpolation.

Finding values for λ: Both the simple interpolation and conditional interpolation λs are learned from a heldout corpus.
 We choose the λ values that maximize the likelihood of the given heldout corpus.
 We fix the ngram probabilities and then search for the λ values that—when plugged into the overall equation, give us the highest probability of the heldout set.
 We can also use the expectation maximization (EM) algorithm, which is an iterative learning algorithm that converges on locally optimal λs.
Comparison of Different Smoothing Techniques
 The relative performance of smoothing techniques can vary based on training set size, ngram order, and also the quality of the training corpus.
 Backoff vs. Interpolation: Backoff is superior on large datasets because it is superior on high counts, while interpolation is superior on low counts.
 Interpolation works well in scenarios with low counts as lower order distributions provide valuable information about the correct amount to discount, and thus interpolation is superior for such situations.
 Katz backoff smoothing performs well on ngrams with large counts.
 Kneser Ney smoothing is also one of the best for small counts.
 Algorithms that perform well on low counts perform well overall when low counts form a larger fraction of the total entropy in small datasets.
 Backoff, in general, performs better on bigram models than on trigram models as bigram models contain higher counts than trigram models on the same size data.
Conclusion
 Language models are used in NLP applications like Text Categorization and smoothing & backoff techniques to solve the sparsity issues arising from training such models.
 Interpolation is a smoothing technique to combine knowledge from multiple types of Ngrams into the computation of the probabilities.
 For very large training sets like web data, the stupid backoff technique is more efficient.
 Backoff is superior on large datasets because it is superior on high counts, while interpolation is superior on low counts.
 Katz’s Backoff Model allows us to calculate the conditional probability of a word against its history of preceding words.