Bleu Score in NLP


Overview

A lot of NLP tasks in today's world generate text as their output. For example, tasks such as machine translation, chatbots, and even caption generators all produce output that is not a single word but rather a proper sentence. In such cases, simple metrics such as Precision, Accuracy, and so on fail to properly measure the model's performance. This is where the Bleu Score in NLP comes into play.

Pre-Requisites

  • Knowledge of Python.
  • Basic knowledge of metrics like precision and recall would be helpful.

Introduction

Suppose you are working on a caption generator, and you are given an image of a man in a white shirt addressing a crowd. Consider the following candidate captions:

  • A man in a white shirt giving a speech
  • A crowd of people listening to a man
  • A man with a white shirt is pointing to the left and holding a mike

All of these statements are viable and correct. They convey different information, but each of them can be used to give a proper caption to the image.

So, in this case, how will you measure which generated text is better than others?

  • Maybe use a human interpreter to choose the correct statement
  • Or select the longest statement
  • Or choose one with the highest number of keywords

This can be a tricky problem. How do we solve it?

We use metrics such as Bleu Score in NLP to solve such types of problems.

What is Bleu Score?

As we discussed earlier, Bleu Score is an evaluation metric for NLP tasks. BLEU stands for Bilingual Evaluation Understudy. It compares your predicted output with a reference text and calculates a score for the same.

This score is calculated by comparing the n-grams of your predicted text with the n-grams of the reference text.

  • Here, n-grams are simply sequences of consecutive words, with "n" specifying the number of words in each sequence.

It is a combination of precision using n-grams and something called a brevity penalty.

We will see later how this is calculated using precision (another metric for comparing predicted and true output), n-grams, and the brevity penalty.

Where is Bleu Score Used?

Originally, Bleu Score was used for machine-translation tasks. Over time, however, it has come to be used for many other tasks, such as:

  • Caption Generation: Evaluate the generated caption against a reference caption (usually human-written in this case)
  • Chatbots: Evaluate the generated text against that of an actual conversation
  • Text Summarization: Evaluate the quality of the summarized text against a human-written summary
  • Automatic Speech Recognition: Although not a direct application of NLP, it is still based on a speech-to-text model and uses Bleu Score to evaluate the generated output
  • Machine Translation: Evaluate the generated translation against a reference translation.

How to Compute the Bleu Score?

Before we understand how to compute the Bleu Score in NLP, let's understand two very important things we will be using:

Precision

In simple terms, precision measures the fraction of words in the predicted text that also appear in the reference text.

For example, say

Reference text: It was raining today

Predicted text: It is raining today

The precision formula would be:

Number of correct predicted words / total number of words in the predicted text

Hence, here it will be 3/4. But using precision like this can lead to a few problems:

  1. It does not handle repetition. For example, if the predicted text is "It It It," the precision would be 3/3 = 1, which is clearly misleading.
  2. As we saw earlier, there are multiple ways to write the same sentence; hence, there can be multiple reference texts for the same output.

To work around these two issues, we use a modified version of precision, called "Clipped Precision," for computing the Bleu Score in NLP.

Clipped Precision

Say,

Reference Text 1: It was a rainy day

Reference Text 2: It was raining heavily today

Predicted Text: It It It is raining heavily

Now,

  • We compare each word in the predicted text to all of the reference texts.
  • We limit the count for each correct word to the maximum number of times it occurs in any single reference text.

Refer to the table below for more clarity:

Word     | Matched Text | Predicted Match Count | Clipped Count
---------|--------------|-----------------------|--------------
It       | Both         | 3                     | 1
is       | None         | 0                     | 0
raining  | Ref Text 2   | 1                     | 1
heavily  | Ref Text 2   | 1                     | 1
Total    |              | 5                     | 3

Now, as we can see, the clipped precision here will be: clipped count / total number of predicted words = 3/6 = 1/2.

The plain precision for this would have been 5/6. Hence, we were able to overcome the shortcomings of using only precision.
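
To make the clipping rule concrete, here is a minimal Python sketch (the helper name clipped_precision is our own, just for illustration) that reproduces the 3/6 result from the table above:

```python
from collections import Counter

def clipped_precision(predicted_tokens, reference_token_lists):
    """Clipped unigram precision: each predicted word is counted at most
    as many times as it appears in any single reference text."""
    predicted_counts = Counter(predicted_tokens)
    clipped = 0
    for word, count in predicted_counts.items():
        # Maximum number of times this word occurs in any one reference
        max_ref_count = max(ref.count(word) for ref in reference_token_lists)
        clipped += min(count, max_ref_count)
    return clipped / len(predicted_tokens)

references = [
    "It was a rainy day".split(),
    "It was raining heavily today".split(),
]
predicted = "It It It is raining heavily".split()

print(clipped_precision(predicted, references))  # 0.5, i.e. 3/6
```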

N-grams

Simply put, n-grams are just sets of "n" consecutive words. For example, for "It is raining heavily":

  • 1-gram (unigram): "It", "is", "raining", "heavily"
  • 2-gram (bigram): "It is", "is raining", "raining heavily"
  • 3-gram (trigram): "It is raining", "is raining heavily"
  • 4-gram: "It is raining heavily"
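
If you want to generate n-grams in code, a minimal sketch (the helper get_ngrams is our own, not from any library) could look like this:

```python
def get_ngrams(tokens, n):
    """Return all n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "It is raining heavily".split()
print(get_ngrams(tokens, 1))  # [('It',), ('is',), ('raining',), ('heavily',)]
print(get_ngrams(tokens, 2))  # [('It', 'is'), ('is', 'raining'), ('raining', 'heavily')]
```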

Calculating Bleu Score

Now, based on these, how can we calculate Bleu Score in NLP?

Let's take into account the earlier predicted and reference text.

Reference Text: It was raining heavily today

Predicted Text: It It It is raining heavily

We will take two cases here, unigram and bigram, for a simpler calculation, though usually 1-gram to 4-gram are used.

Now we have to calculate clipped precision for unigram and bigram.

Unigram:

The clipped precision, as we saw from the table earlier, will be 3/6 = 1/2 for this case.

Bigram:

Bigrams for reference text: ["It was", "was raining", "raining heavily", "heavily today"]

Bigrams for predicted text: ["It It", "It It", "It is", "is raining", "raining heavily"]

Clipped Precision: 1/5

Now, we combine these precision scores:

$$\text{Global Average Precision} = \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) = \prod_{n=1}^{N} p_n^{w_n} = (p_1)^{1/2} \cdot (p_2)^{1/2}$$

Usually, we use $N = 4$ and $w_n = 1/4$, but in this case, we will use $N = 2$ and $w_n = 1/2$.

$p_1$ and $p_2$ are the clipped precisions of the unigrams and bigrams, respectively.

Thus, the value of Global Average Precision = $(1/2)^{1/2} \cdot (1/5)^{1/2} \approx 0.316$.
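
As a quick check of this arithmetic, here is a tiny sketch that plugs our clipped precisions (1/2 and 1/5) into the formula above:

```python
import math

# Clipped precisions from our example: p1 = 1/2 (unigram), p2 = 1/5 (bigram)
precisions = [1 / 2, 1 / 5]
weights = [1 / 2, 1 / 2]   # w_n = 1/2 since N = 2

# exp(sum(w_n * log(p_n))) is the same as the product of p_n ** w_n
global_average_precision = math.exp(
    sum(w * math.log(p) for w, p in zip(weights, precisions))
)
print(round(global_average_precision, 3))  # 0.316
```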

Brevity Penalty

Now, what is the brevity penalty?

Suppose our predicted text is just the single word "raining". For this, the clipped precision would be 1. This is misleading, as it encourages the model to output fewer words to achieve high precision. To penalize such cases, we use the Brevity Penalty.

$$\text{Brevity Penalty} = \begin{cases} 1, & \text{if } c > r \\ e^{(1 - r/c)}, & \text{if } c \le r \end{cases}$$

Here,

r = number of words in the reference text

c = number of words in the predicted text

This ensures that the brevity penalty can never be larger than 1, even if the predicted text is longer than the reference text.

In our example,

r = 5

c = 6

Since c > r, the Brevity Penalty = 1.
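
A small sketch of the brevity penalty formula (the function name is our own) makes its effect easy to see:

```python
import math

def brevity_penalty(c, r):
    """c = number of words in the predicted text, r = number of words in the reference text."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(brevity_penalty(c=6, r=5))  # 1.0 -> our example is not penalized
print(brevity_penalty(c=1, r=5))  # ~0.018 -> a one-word prediction is penalized heavily
```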

Calculating Bleu Score Based on All These Values

Now, to calculate the Bleu Score in NLP, we simply multiply the Brevity Penalty by the Global Average Precision: 1 × 0.316 = 0.316.

Thus, the Bleu Score for our predicted text is 0.316.

In reality, 1-gram to 4-gram precisions are used, in which case the metric is also called Bleu-4, so the result will differ if you compute the Bleu Score with a library's default 4-gram settings.

What we did here is called Bleu-2.

Problems with the Bleu Score

Even though Bleu Score in NLP is one of the most popular metrics for machine translation, it has several shortcomings:

  • Bleu Score is based on precision; hence, it does not take recall into account. In simpler terms, it does not check whether all the words of the reference are covered by the predicted text or not.

  • Although the Bleu score accounts for repeating groups of words to an extent by using n-grams, it is still not good at dealing with chunks of words in the predicted text that map to chunks of words in the reference text.

  • It fails to capture the meaning and order of words.

    • For example, "house" and "home" can mean the same thing, as can variations such as "walk" and "walking", yet the Bleu Score will not be able to capture this similarity.
    • Similarly, "Kenny arrived late due to the traffic", and "traffic arrived late due to Kenny", will get the same Bleu Score even though the order and meaning are different.
  • It doesn't take into account the importance of certain words in the sentence.

Implementation for Bleu Score with Python

Code using nltk:
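
Here is a minimal sketch using NLTK's sentence_bleu function; we pass weights (0.5, 0.5), i.e. Bleu-2, so that the score matches our manual calculation:

```python
from nltk.translate.bleu_score import sentence_bleu

# A list of tokenized reference texts and one tokenized predicted text
references = ["It was raining heavily today".split()]
predicted = "It It It is raining heavily".split()

# weights=(0.5, 0.5) -> only unigram and bigram precisions are used (Bleu-2)
score = sentence_bleu(references, predicted, weights=(0.5, 0.5))
print(f"Bleu Score (nltk): {score:.3f}")  # ~0.316
```

With the default weights of (0.25, 0.25, 0.25, 0.25), the same call computes the usual Bleu-4 instead.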

Bleu implementation from scratch:

  • First, we generate n-grams for our predicted text and the reference text (n = 2 in this case)
  • Next, we calculate the clipped precision scores and use them to calculate the global average precision score.
  • Finally, we calculate the brevity penalty and, using it together with the global average precision score, calculate the Bleu Score. You can refer to the earlier section, "Calculating Bleu Score", for the theory behind the same.

Let's look at the code implementation.
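
Below is a minimal from-scratch sketch that follows the three steps above (all helper names are our own) and mirrors the Bleu-2 calculation from the earlier section:

```python
import math
from collections import Counter

def get_ngrams(tokens, n):
    """All n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(predicted, references, n):
    """Clipped n-gram precision of the prediction against all references."""
    predicted_counts = Counter(get_ngrams(predicted, n))
    total = sum(predicted_counts.values())
    clipped = 0
    for ngram, count in predicted_counts.items():
        # An n-gram is counted at most as often as it appears in any single reference
        max_ref_count = max(Counter(get_ngrams(ref, n))[ngram] for ref in references)
        clipped += min(count, max_ref_count)
    return clipped / total if total else 0.0

def bleu_score(predicted, references, max_n=2):
    """Bleu-N: weighted geometric mean of clipped precisions times the brevity penalty."""
    precisions = [clipped_precision(predicted, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so the score is simply 0
    weights = [1 / max_n] * max_n
    global_avg = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
    # Brevity penalty: c = predicted length, r = reference length closest to c
    c = len(predicted)
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * global_avg

references = ["It was raining heavily today".split()]
predicted = "It It It is raining heavily".split()
print(f"Bleu Score (from scratch): {bleu_score(predicted, references):.3f}")  # ~0.316
```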

Output in Both Cases:

Here, we use the NLTK library's sentence_bleu function to calculate the Bleu Score, as well as a scratch-level implementation of the same. Notice that both give roughly 0.316, the same result we got from our manual calculation!

Conclusion

  • A point to note: even though we calculated the Bleu Score on a single sentence, in practice, it is always calculated on a corpus of text and not a single sentence.
  • Irrespective of its shortcomings, Bleu Score is still one of the most widely used metrics when it comes to machine translation.
  • Not only is Bleu Score easy and fast to calculate, but it is also language-independent and correlates with how humans would evaluate a similar text.
  • One more advantage is that it can be used against multiple ground-truth reference sentences.

This article gives you a clear understanding of the intuition and an in-depth idea of the algorithm behind how the BLEU Score in NLP works.