Evaluating Language Models in NLP

Overview

Language modeling is the task of predicting the next word or character in a document and can be used to train language models that can be applied to a wide range of natural language tasks like text generation, text classification, and question answering.

The performance of language models in NLP is crucial to understand and can be evaluated with metrics like perplexity, cross-entropy, and bits-per-character (BPC).

Introduction

Language models are very useful in a broad range of applications like speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition (OCR), handwriting recognition, information retrieval, and many other everyday tasks.

  • One of the main steps before using a language model is to evaluate its performance, and only then apply it to further tasks.
  • This lets us build confidence in how the language model will behave and also lets us know if there are any places where the model may behave uncharacteristically.

In practice, we need to decide on the dataset to use, the method to evaluate, and also select a metric to evaluate language models. Let us learn about each of the elements further.

How to Evaluate a Language Model?

  • Evaluating a language model lets us know whether one language model is better than another during experimentation and also to choose among already trained models.
  • There are two ways to evaluate language models in NLP: Extrinsic evaluation and Intrinsic evaluation.
    • Intrinsic evaluation measures how well the model captures what it is supposed to capture, such as the probabilities it assigns to held-out text.
    • Extrinsic evaluation (or task-based evaluation) captures how useful the model is in a particular task.
  • Comparing among language models: We compare models by collecting a corpus of text that is common to the models being compared.
    • We then divide the data into training and test sets and train the parameters of both models on the training set.
    • We then compare how well the two trained models fit the test set, as sketched below.
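As a minimal sketch of this comparison (assuming each trained model exposes a hypothetical log_prob(sentence) method that returns the log-probability it assigns to a sentence), the selection could look like:

```python
def total_log_prob(model, test_sentences):
    """Sum of log-probabilities the model assigns to the held-out sentences."""
    return sum(model.log_prob(sentence) for sentence in test_sentences)


def better_model(model_a, model_b, test_sentences):
    """Return whichever model assigns the higher probability to the shared test set."""
    # log_prob() is a hypothetical interface, not a specific library call.
    if total_log_prob(model_a, test_sentences) > total_log_prob(model_b, test_sentences):
        return model_a
    return model_b
```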

What Does Evaluating a Model Mean?

  • After we train the models, whichever model assigns a higher probability to the test set is generally considered to predict the test set more accurately and is hence the better model.
  • Among multiple probabilistic language models, the better model is the one that has a tighter fit to the test data or that better predicts the details of the test data and hence will assign a higher probability to the test data.

Issue of Data Leakage or Bias in Language Models

  • Most evaluation metrics for language models in NLP are based on test set probability, so it is important not to let the test sentences into the training set.
  • Example: Suppose we are computing the probability of a particular test sentence; if that sentence is also part of the training corpus, we will mistakenly assign it an artificially high probability when it occurs in the test set.
    • We call this situation training on the test set.
    • Training on the test set introduces a bias that makes the probabilities all look too high and causes huge inaccuracies in metrics like perplexity.
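As a simple precaution, sketched below with plain sentence strings, we can check for and remove any overlap between the training and test sets before evaluation:

```python
def remove_leaked_sentences(train_sentences, test_sentences):
    """Drop training sentences that also appear in the test set, so the
    test-set probabilities are not artificially inflated."""
    test_set = set(test_sentences)
    leaked = sum(1 for s in train_sentences if s in test_set)
    if leaked:
        print(f"Removing {leaked} training sentences that also occur in the test set")
    return [s for s in train_sentences if s not in test_set]
```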

Extrinsic Evaluation

Extrinsic evaluation is the best way to evaluate the performance of a language model by embedding it in an application and measuring how much the application improves.

  • It is an end-to-end evaluation where we can understand if a particular improvement in a component is really going to help the task at hand.
  • Example: For speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.
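The accuracy of a transcription is usually measured with word error rate (WER); a minimal, framework-free sketch (the function name is ours) is shown below:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of words in the reference transcription."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


# The language model whose recognizer output gets the lower WER wins.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```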

Intrinsic Evaluation

We need to take advantage of intrinsic measures because running big language models in NLP systems end-to-end is often very expensive, and it is easier to have a metric that can be used to quickly evaluate potential improvements in a language model.

An intrinsic evaluation metric is one that measures the quality of a model independent of any application.

  • We also need a test set for an intrinsic evaluation of a language model in NLP.
  • The probabilities of an N-gram model come from the corpus it is trained on, called the training set or training corpus.
  • We can then measure the quality of an N-gram model by its performance on some unseen data, called the test set or test corpus.
  • We will also sometimes call test sets and other datasets that are not in our training sets held-out corpora because we hold them out from the training data.

Good scores during intrinsic evaluation do not always mean better scores during extrinsic evaluation, so we need both types of evaluation in practice.

Perplexity

Perplexity is a very common method to evaluate the language model on some held-out data. It is a measure of how well a probability model predicts a sample.

  • Perplexity is also an intrinsic measure (it needs no external datasets) for evaluating the performance of language models in NLP.
    • Perplexity as a metric quantifies how uncertain a model is about the predictions it makes. Low perplexity only guarantees a model is confident, not accurate.
    • Perplexity also often correlates well with the model’s final real-world performance, and it can be quickly calculated using just the probability distribution the model learns from the training dataset.

The Intuition

  • The basic intuition is that the lower the perplexity measure is, the better the language model is at modeling unseen sentences.
  • Perplexity can also be seen as a simple monotonic function of entropy. But perplexity is often used instead of entropy due to the fact that it is arguably more intuitive to our human minds than entropy.

Calculating Perplexity

  • Perplexity of a probability model like language models in NLP: For a model of an unknown probability distribution p and a proposed probability model q, we can evaluate the perplexity mathematically as b^{-\frac{1}{N} \sum_{i=1}^N \log_b q\left(x_i\right)}
    • We can choose b as 2.
    • In general, better models assign higher probabilities to the test events. Hence good models will have lower perplexity values and are less surprised by the test sample.
    • If all the probabilities were 1, then the perplexity would be one and the model would perfectly predict the text. Conversely, the perplexity will be higher for poorer language models.
  • Perplexity, denoted by PP, of a discrete probability distribution p is mathematically defined as PP(p) := 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)} = \prod_x p(x)^{-p(x)}
    • Where H(p) is the entropy (in bits) of the distribution and x ranges over events which we will learn about further.
    • Perplexity of a random variable X may be defined as the perplexity of the distribution over its possible values x.
  • One other formulation for Perplexity from the perspective of language models in NLP: It is the multiplicative inverse of the probability assigned to the test set by the language model normalized by the number of words in the test set.
    • We can define perplexity mathematically as:
      • PP(W) = P\left(w_1 w_2 \ldots w_N\right)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P\left(w_1 w_2 \ldots w_N\right)}}
    • A language model that assigns a higher probability P(sentence) to unseen sentences from the test set is the more accurate one.
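A minimal sketch of this computation, assuming we already have the probability the model assigned to each word of the test sequence:

```python
import math


def perplexity(word_probabilities):
    """Perplexity of a sequence: PP = exp(-(1/N) * sum(log p(w_i))),
    which equals P(w_1 ... w_N) raised to the power -1/N."""
    n = len(word_probabilities)
    log_prob_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_prob_sum / n)


# A model that assigns probability 0.1 to each of 5 words has perplexity 10.
print(perplexity([0.1] * 5))  # ~10.0
```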

Interpreting Perplexity

  • Perplexity intuitively provides a more human way of thinking about the random variable’s uncertainty. The reasoning is that the perplexity of a uniform discrete random variable with K outcomes is K.
    • Example: The perplexity of a fair coin is two and the perplexity of a fair six-sided die is six.
    • This kind of framework provides a frame of reference for interpreting a perplexity value.
  • Simple framework to interpret perplexity: If the perplexity of some random variable X is 10, our uncertainty towards the outcome of X is equal to the uncertainty we would feel towards a 10-sided die, helping us intuit the uncertainty more deeply.
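This is easy to verify numerically from the 2^{H(p)} definition above; the following sketch computes the perplexity of a few uniform distributions:

```python
import math


def perplexity_of_distribution(probabilities):
    """Perplexity as 2 raised to the entropy (in bits) of the distribution."""
    entropy_bits = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy_bits


print(perplexity_of_distribution([0.5, 0.5]))   # fair coin         -> 2.0
print(perplexity_of_distribution([1 / 6] * 6))  # fair 6-sided die  -> ~6.0
print(perplexity_of_distribution([0.1] * 10))   # fair 10-sided die -> ~10.0
```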

Perplexity to Compare Different N-Gram Models

  • Steps to compute perplexity for n-gram models:
    • We first calculate the joint probability of all the words in the sentence under the n-gram model after we estimate the model parameters from the training corpus.
    • We then transform the joint probability (the product of the conditional probabilities of each word) into a perplexity for each sentence.
      • We first calculate the length of the sentence in words (including the end-of-sentence token) and then compute perplexity = 1/(pow(sentence_probability, 1.0/sentence_length)); a code sketch follows after this list.
    • Then we compute a single perplexity from the overall model (if there are multiple sentences) as:
      • PP(W) = \sqrt[N]{\frac{1}{P\left(s_1 s_2 \ldots s_N\right)}}
  • Generic benchmarks and typical values of perplexity for n-gram models: If we assume a corpus of English with a vocabulary size of ~50,000, we can establish typical perplexity values for unigram, bigram, and trigram language models.
    • In a bigram model, each word depends only on the previous word in the sentence while in a unigram model each word is chosen completely independently of other words in the sentence.
    • The typical reported perplexity figures for such a dataset are ~74 for a trigram model, ~137 for a bigram model, and ~955 for a unigram model. The perplexity for a model that simply assigns probability 1/50,000 to each word in the vocabulary would be 50,000.
    • Hence the trigram model gives a big improvement over bigram and unigram models and a huge improvement over assigning a probability of 1/50,000 to each word in the vocabulary.
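Putting the steps above together, here is a sketch of per-sentence perplexity for a bigram model; bigram_prob(prev, word) is a hypothetical estimate of P(word | prev) obtained from the training corpus, and <s> / </s> mark sentence boundaries:

```python
import math


def sentence_perplexity(sentence, bigram_prob):
    """Perplexity of one sentence under a bigram model."""
    words = sentence.split() + ["</s>"]   # include the end-of-sentence token
    prev = "<s>"
    log_prob = 0.0
    for word in words:
        log_prob += math.log(bigram_prob(prev, word))  # P(word | prev)
        prev = word
    n = len(words)                        # sentence length incl. </s>
    return math.exp(-log_prob / n)        # = 1 / P(sentence)^(1/n)
```

For a whole test corpus, we would instead sum the log-probabilities over all sentences and normalize by the total number of word tokens before exponentiating.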

Perplexity in the Real World

  • Perplexity is used as a measure in training language models related to standardized datasets like One Billion Word Benchmark. The dataset was collected from thousands of online news articles published in 2011, all broken down into their component sentences.
  • Perplexity as a metric measures how accurately a model can mimic the style of the dataset it is being tested against. Models trained on datasets from the same period as the benchmark dataset have an unfair advantage due to vocabulary similarity and may not work well when tested on different time periods or slightly different datasets, even within the same domain.
  • Perplexity also rewards models for mimicking the test dataset, so it may end up favoring models most likely to imitate subtly toxic content (if the dataset contains freely flowing language such as social media text), as studies have shown such content is more polarizing and more widely discussed than non-toxic topics.

Pros and Cons

  • Advantages of using Perplexity
    • Fast to calculate, which lets researchers weed out models that are unlikely to perform well before moving to real-world testing, where computing is prohibitively costly and testing is time-consuming and expensive.
    • Useful to have an estimate of the model uncertainty/information density
  • Disadvantages of Perplexity
    • Not good for final evaluation since it just measures the model’s confidence and not its accuracy
    • Hard to make comparisons across different datasets with different context lengths, vocabulary sizes, word vs. character-based models, etc.
    • Perplexity can also end up rewarding models that mimic outdated datasets.

Entropy

Entropy is a metric that has been used to quantify the randomness of a process in many fields and, specifically in computational linguistics, to compare languages around the world.

Definition for Entropy: The entropy of a random variable is the average level of information, surprise, or uncertainty (the expected self-information) inherent in the variable's possible outcomes.

  • The more certain or the more deterministic an event is, the less information it will contain. In a nutshell, information corresponds to an increase in uncertainty or entropy.
  • Entropy of a discrete distribution p(x) over the event space X is given by: H(p) = -\sum_{x \in X} p(x) \log p(x)
    • H(X) >= 0; H(X) = 0 only when the value of X is fully determined (certain), in which case it provides no new information.
    • The smallest possible entropy for any distribution is zero.
    • We also know that the entropy of a probability distribution is maximized when it is uniform.
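A direct translation of the definition (log base 2, so entropy is measured in bits):

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H(p) = -sum p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))   # 1.0   -> fair coin
print(entropy([0.9, 0.1]))   # ~0.47 -> biased coin, less uncertainty
print(entropy([0.25] * 4))   # 2.0   -> uniform over 4 outcomes (the maximum)
```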

Entropy in Different Fields of NLP & AI

  • From a probability theory and language perspective in NLP, entropy can also be defined as a statistical parameter that measures how much information is produced, on average, for each letter of a text in the language.
    • If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language.
  • From a machine learning perspective, entropy is a measure of uncertainty, and the objective of the machine learning model is to minimize uncertainty.
    • Decision tree learning algorithms use relative entropy to determine the decision rules that govern the data at each node.
    • Classification algorithms in machine learning like logistic regression or artificial neural networks often employ a standard loss function called cross entropy loss that minimizes the average cross entropy between ground truth and predicted distributions.
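For instance, with a one-hot ground-truth label, the cross-entropy loss reduces to the negative log-probability assigned to the correct class; a plain-Python sketch (not tied to any particular framework):

```python
import math


def cross_entropy_loss(true_distribution, predicted_distribution):
    """Cross entropy H(P, Q) = -sum P(x) * log Q(x)."""
    return -sum(p * math.log(q)
                for p, q in zip(true_distribution, predicted_distribution)
                if p > 0)


# Ground truth is class 1; the model assigns it probability 0.7.
print(cross_entropy_loss([0, 1, 0], [0.2, 0.7, 0.1]))  # ~0.357 (= -log 0.7)
```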

Historical Perspective for Entropy

  • Entropy in Information Theory: It was introduced by Claude Shannon, whose definition describes it as a statistical parameter which measures, in a certain sense, how much information is produced on average for each letter of a text in the language.
    • If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy is the average number of binary digits required per letter of the original language.
  • Entropy for a natural language: The entropy of a natural language is the average amount of information of one character in an infinite length of text, which characterizes the complexity of natural language.
    • Historically, there have been many proposals for experimentally estimating the entropy rate, since the true probability distribution of natural language is unknown.
    • Most of these approaches relied on the predictive power of humans or computational models such as n-gram language models and compression algorithms.
  • Using entropy as a metric: The main idea is that if a model captures more of the structure of a language, then the entropy of the model should be lower and we can use entropy as a measure of the quality of the models.

Cross Entropy

Because we cannot access an infinite amount of text in the language, and the true distribution of the language is unknown, we define a more useful and practical metric called Cross Entropy.

  • Intuition for Cross entropy: It is often used to measure the closeness of two distributions: one (Q) is the distribution the language model learns from the sample text, aiming to approximate the other, the empirical distribution of the language (P), as closely as possible.
    • Mathematical cross-entropy is defined as:
      • H(P,Q) = E_P[-\log Q], which can also be written as H(P,Q) = H(P) + D_{KL}(P \| Q)
      • H(P) is the entropy of P, and D_{KL}(P \| Q) is the Kullback–Leibler (KL) divergence of Q from P, also known as the relative entropy of P with respect to Q.
  • From the formulation, we can see that the cross entropy of Q with respect to P is the sum of two terms, the entropy and the relative entropy:
    • H(P), the entropy of P, is the average number of bits needed to encode any possible outcome of P.
    • D_{KL}(P \| Q), the number of extra bits required to encode any possible outcome of P when using a code optimized for Q.

The empirical entropy H(P) is fixed and cannot be optimized away, so when we train a language model with the objective of minimizing the cross-entropy loss, the true objective is to minimize the KL divergence between the distribution learned by our language model and the empirical distribution of the language.
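The decomposition H(P,Q) = H(P) + D_KL(P||Q) can be checked numerically; in the sketch below, P and Q are arbitrary toy distributions over the same three events:

```python
import math


def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.3, 0.2]  # toy stand-in for the empirical distribution of the language
Q = [0.4, 0.4, 0.2]  # toy stand-in for the distribution learned by the model

# Cross entropy equals entropy plus KL divergence; only the KL term depends on Q.
print(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))  # both ~1.52
```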

Handling Unknown Words

  • Tokenizers in Language Models: Tokenization is the first and an important step in any NLP pipeline, especially for language models, as it breaks unstructured natural language text into chunks of information that can be treated as discrete elements (tokens).
    • The token occurrences in a document can be used directly as a vector representing that document.
    • The goal when crafting the vocabulary with tokenizers is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
  • Issue with unknown vocabulary/tokens: The general approach in most tokenizers is to encode the rare words in your dataset using a special token UNK by convention so that any new out-of-vocabulary word would be labeled as belonging to the rare word category.
    • We then expect the model to learn how to deal with such out-of-vocabulary words through the custom UNK token.
    • It is generally a bad sign if the tokenizer produces a lot of these unknown tokens, as it was not able to find a sensible representation for those words and we are losing information along the way.
  • Methods to handle unknown tokens / OOV (out of vocabulary) words: Character level embeddings and sub-word tokenization are some effective ways to handle unknown tokens.
    • Under sub-word tokenization, WordPiece and BPE are the de facto methods employed by successful language models such as BERT and GPT.
  • Character level embeddings: Character and subword embeddings were introduced as an attempt to limit the size of embedding matrices (as in BERT), and they have the added advantage of handling new slang words, misspellings, and OOV words.
    • The required embedding matrix is much smaller than what is required for word-level embeddings; the vectors simply represent each character of the language.
    • Example: Instead of a single vector for "king" as in word embeddings, there would be a separate vector for each of the letters "k", "i", "n", and "g".
    • Character embeddings do not encode the same type of information that word embeddings contain; they can be thought of as encoding lexical information and may be used to enhance or enrich word-level embeddings.
    • Character level embeddings are also generally shallow in meaning, but with character embeddings, a vector can be formed for every single word, even an out-of-vocabulary one.
  • Subword tokenization: Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations and also enables the model to process words it has never seen before by decomposing them into known subwords.
    • Example: The word refactoring can be split into re, factor, and ing. Subwords re, factor, and ing occur more frequently than the word refactoring, and their overall meaning is also kept intact.
  • Byte-Pair Encoding (BPE): BPE was initially developed as an algorithm to compress texts and then used by OpenAI for tokenization when pretraining the GPT model.
    • It is used by a lot of Transformer models like GPT, GPT-2, RoBERTa, BART, and DeBERTa.
    • BPE strikes a balance between character-level and word-level representations, which makes it capable of managing large corpora.
    • This kind of behavior also enables the encoding of any rare words in the vocabulary with appropriate subword tokens without introducing any “unknown” tokens.
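A toy sketch of the core BPE training loop, which repeatedly merges the most frequent adjacent pair of symbols (a simplified illustration, not the exact tokenizer used by any particular model):

```python
from collections import Counter


def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word starts as a tuple of characters, mapped to its frequency.
vocab = {tuple("refactoring"): 3, tuple("factoring"): 2, tuple("refactor"): 2}
for _ in range(10):                  # perform 10 merges for illustration
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    vocab = merge_pair(vocab, pair)
print(vocab)  # frequent subwords such as "factor" emerge as single symbols
```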

Conclusion

  • Intrinsic evaluation and extrinsic evaluation are two methods to evaluate the performance of language models in NLP.
  • Intrinsic evaluation captures how well the model captures what it is supposed to capture on test sets from the corpus.
  • Extrinsic evaluation is also called task-based evaluation and captures how useful the model is in a particular task that is used in downstream applications.
  • Entropy, Cross entropy, and Perplexity are common metrics for evaluating the performance of language models in NLP.
  • Words not seen while training a language model are out of vocabulary words and can be handled using custom tokens, character level embeddings, and sub-word tokenization techniques.