# Introduction to Probability Theory in NLP

## Overview

**Probability theory** is the branch of mathematics concerned with the analysis of random phenomena and probability. It is an essential field to many human activities that involve quantitative analysis of data.

## Introduction

**Randomness** and **uncertainty** is all around us and exist in our daily lives as well as in every discipline in science, engineering, and technology.

- The outcome of any such random event cannot be determined before it occurs, but it may be any one of several possible outcomes.
- A random phenomenon can have several outcomes, and the actual outcome is considered to be determined by chance.
- Probability theory is the mathematical framework that allows us to analyze chance events in a logically sound manner.
- The probability of an event computed under probability theory nlp is a
**number**indicating**how likely that event will occur**. This number is always between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.

## Some Standard Terms

### Experiment (Trial)

A **random experiment** in probability theory nlp can be defined as a trial that is repeated multiple times to get a well-defined set of possible outcomes. For example, tossing a coin or rolling a die is an example of a random experiment.

- Also, an experiment has a meaning a little different from the one we use in day-to-day terms. It is anything with a
**well-defined set of possible outcomes**that can be repeated infinitely. **Random experiments**are also often conducted in a repeated manner so that we can collect results and**perform statistical analysis**on these multiple outcomes.- A fixed number of repetitions of the same experiment can be thought of as a
**composed experiment**in which case the individual repetitions are called**trials**. - If one were to toss the same coin one hundred times and record each result, each toss would be considered a trial within the experiment composed of all hundred tosses.

- A fixed number of repetitions of the same experiment can be thought of as a
**Example of the experiment:**

We can toss a fair coin in which the two possible outcomes are heads or tails. The probability of flipping a head or a tail is 1/2.- In an actual series of coin tosses, we may get more or less than exactly 50% heads.
- But as the number of flips increases, the long-run frequency of heads is bound to get closer and closer to 50%.

### Random Variable

**Random variable** is defined as a function that assigns a real number to each outcome in the probability space.

- In probability theory nlp, a random variable can also be thought of as a variable that assumes the value of all possible outcomes of an experiment.
- There are two types of random variables, as given below :
**Discrete Random Variable:**

Discrete random variables can take an exact countable value such as 0, 1, 2...etc. It can also be described by the**cumulative distribution function**and the probability mass function.**Continuous Random Variable:**

A variable that can take on an infinite number of values is known as a continuous random variable. The cumulative distribution function and probability density function are used to define the characteristics of this variable.

### Sample Space

A **sample space** is a collection or a set of possible outcomes that result from conducting a random experiment. The sample space is represented using the symbol S.

- The sample spaces for a random experiment are written within curly braces { }.
- A sample space may contain several outcomes that depend on the experiment. If it contains a finite number of outcomes, then it is known as
**discrete**or**finite sample spaces**. - For example, the sample space of tossing a fair coin is {heads, tails}.

### Event

An **event** is a desired outcome out of all possible outcomes of the random experiment. It is a subset of the respective sample space.

- The probability of occurrence of any event lies between 0 and 1.
- The sample space for a coin two times subsequently is given by :

### Event Space

An **event space** contains all possible events for a given experiment or happening. An event space is often denoted by the Greek letter **sigma** (Σ).

- An event under event space is just a set of outcomes of an experiment combined with their probability.
- Event space is generally confused with the sample space of an experiment, referred to usually by
**omega(Ω)**. **Difference between event space and sample space:**

While the sample space of an experiment contains all possible outcomes, the event space contains all sets of outcomes, all subsets of the sample space.**Examples of Event Space:**

For rolling a die, we will get the sample space**S**as {One, Two, Three, Four, Five, Six} whereas the**event**can be written as {One, Three, Five} which represents the set of odd numbers and {Two, Four, Six} which represents the set of even numbers.

### Disjoint Sets

Two sets are said to be disjoint sets if they have **no element in common**. Disjoint sets are also the sets whose **intersection is the empty set**.

- A collection of two or more sets is called disjoint if any two distinct sets of the collection are disjoint.
- For example, {1, 2, 3} and {4, 5, 6} are disjoint sets, while {1, 2, 3} and {3, 4, 5} are not disjoint.

- The chance of any (one or more) of two or more events occurring is called the union of the events. The
**probability**of the**union**of disjoint events is the**sum of their probabilities**. - If two events A and B are
**disjoint**, then the probability of**either**event is the**sum**of the probabilities of the two events: $P(A$ or $B) = P(A) + P(B)$. **Examples for disjoint events:**

Tossing a coin and getting a head and a tail at the same time is impossible. A football game can’t be held at the same time as a rugby game on the same field.

### Well-founded Probability Space

When working with probabilities and probability space, we need to establish the following quantities under probability theory nlp to maintain a well-founded probability space :

- Sample space Ω with a σ field of events F.
- A probability function (where all the individual probabilities sum to 1).

## What is Probability Theory?

**Probability theory** is a branch of mathematics that investigates the probabilities associated with a random phenomenon. Probability theory describes the chance of occurrence of a particular outcome by using certain formal concepts.

- Probability theory makes use of some fundamentals such as sample space, probability distributions, random variables, etc., we have defined above to find the likelihood of occurrence of an event.
- Probability theory also makes use of random variables and probability distributions, and cumulative distribution functions together to assess uncertain situations mathematically by modeling the random event and determining various associated probabilities in such events.
- The concept of probability is used in probability theory nlp to assign a numerical description to the likelihood of occurrence of an event.
- Probability can be defined as the number of favorable outcomes divided by the total number of possible outcomes of an event.
- Probability also can be thought of as how likely an event is to occur.

## Conditional Probability

Conditional probability is a **measure** of the probability of an event occurring **given that another event has already occurred** in the sequence of events of the experiment. It allows us to account for information we have already known about our system of interest.

- Conditional probability is
**calculated by multiplying the probability of the preceding event**by the renewed probability of the succeeding or the conditional event. **Formula**and**Notation:**

The probability of an event occurring, based on the occurrence of a previous event is called**Conditional probability $P(A|B)$**- A and B are the events.
- Probability that event A happens, given that event B has already happened.

**Example for conditional probability:**

We might expect the probability that it will rain tomorrow to be smaller than the probability it will rain tomorrow, given that it is cloudy today.- The latter probability is a conditional probability as it accounts for relevant information that we already possess.
- Instead of looking at how often it rains on any day in general, we assume that our sample space consists of only those days for which the previous day was cloudy. We then determined how many of those days were rainy.

Computing a conditional probability amounts to **shrinking our sample space** to a particular event.

### Independent

Two events A and B are said to be independent of each other if $P(AB) = P(A)P(B)$. The independence assumption means that there is **no overlap between the two events** concerning each other.

- When applying the independence assumption, $P(A|B) = P(A)$ because the presence of B does not affect the probability of A at all.
- When there are
**three events**under consideration (A, B, C), two events, A and B are conditionally independent given an event C where $P( C )>0$ if $P(A∩B|C) = P(A|C) P(B|C)$. **Example of independent events:**

Rolling dice and flipping a coin simultaneously are separate independent events because the outcome of the first event as rolling dice is separate from the second event of filliping a coin. So rolling a six doesn't increase or decrease the probability of the coin landing heads or tails.

In terms of information obtained by running an experiment, **independence** means that knowing the value of A does not reduce the uncertainty about B.

### Dependent

Two events are said to be dependent if the outcome of one event affects the outcome of the other.

- If A and B are dependent events, then the probability of A and B occurring is written as :
- Given that event A has already occurred Probability of event A is $P(A)$, then the probability of event B is $P(B|A) = P(B ∩ A) / P(A)$, the formula can also be written as $P(B|A) = P(A ∩ B) / P(A)$
- $P(B|A)$ means that event A has already happened. It is also called the
**conditional probability**of B given A. - We also need to have the constraint that $P(A)$ must be greater than 0 as $P(A)$ less than 0 means A is an
**impossible event**. - In $P(A ∩ B)$ the intersection denotes a
**compound probability**of an event.

**Example for dependent events :**

Purchasing a few lottery tickets and winning the lottery. The more tickets we purchased, the greater our odds of winning.

Dependent events **influence the probability of other** events or their probability of occurring is affected by other events.

### Joint Probability

As we have encountered, the term $P(A ∩ B)$ (also can be called $P(A\ and\ B)$) or $P(B ∩ A)$ (which is also $P(B\ and\ A)$) in computing conditional probabilities, we have to find these probabilities of multiple events occurring together which is formally called as **joint probability**.

- It is the probability of
**both events occurring simultaneously**which is the intersection of two or more events. - Joint probability $P(A ∩ B) = P(A) * P(B)$
- $P(A ⋂ B)$ is the notation for the joint probability of events A and B.
- $P(A)$ is the probability of event “A” occurring.
- $P(B)$ is the probability of event “B” occurring.
- Joint probability is also symmetric meaning $P(A ∩ B) = P(B ∩ A)$.
- For joint probability calculations to work, the
**events can be independent where one event**must not be able to influence each other or otherwise dependent.

**Joint probability** in probability theory NLP refers to the probability that two events will both occur and is the likelihood of two events occurring together.

## NLP Example

Let us see an example sentence and **compute probabilities for NLP entities for reference**.

**Sentence :**

The fast dog ran quickly. The dog ran quickly. The bird dog ran. The fat bird dog ate very slowly.- Let us break the sentence into its parts of speech: Adj N V Adv Det N N V Det Adj N N V Adv
- where Det is
**Determiner**, Adj is**Adjective**, N is**Noun**, V is**Verb**& Adv is**Adverb**.

- where Det is
- The marginal probabilities (also called prior probabilities) look like this :
- $Det - 4/20 = 0.2; Adj - 2/20 = 0.1; N - 6/20 = 0.3; V - 4/20 = 0.2; Adv - 4/20 = 0.2$

- Let us compute probabilities for Noun Phrases which are Adj Noun or Noun Nouns appearing together.
- If we assume independence among sentence, $P(NP) = P(AdjN) + P(NN)$
- P(Adj. N) = P(Adj) _ P(Noun) = 0.1 _ 0.3 = 0.03
- P(NN) = P(Noun) _ P(Noun) = 0. 3 _ 0. 3 = 0.09
- P(NP) = 0.03 + 0.09 = 0.12 or 12%

Similarly, we can compute joint probabilities for different combinations of parts of speech.

## What is Entropy in NLP?

The **entropy** (also called self-information) of a random variable is the **average level of the information**, surprise, or uncertainty inherent to the single variable's possible outcomes.

- The more certain or the more deterministic an event is, the
**less information**it will contain. In a nutshell, the information is an increase in uncertainty or entropy. - Entropy of a discrete distribution $p(x)$ over the event space X is given by: $H(p)=-\sum_{x \in X} p(x) \log p(x)$
- $H(X) >=0; H(X) = 0$ only when the value of X is indeterminate and hence providing no new information
- The smallest possible entropy for any distribution is zero.
- We also know that the
**entropy of a probability distribution**is maximized when it is uniform.

- In terms of probability theory NLP, language perspective, and probability theory nlp, entropy can also be defined as a statistical parameter that measures how much information is produced for each letter of a text in the language.
- If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language.

- From a
**machine learning perspective**, entropy is a measure of uncertainty, and the objective of the machine learning model**is to minimize uncertainty**.- Decision tree learning algorithms use relative entropy to determine the decision rules that govern the data at each node.
- Classification algorithms in machine learning like logistic regression or artificial neural networks often employ a
**standard loss function**called**cross-entropy loss**that minimizes the average cross entropy between ground truth and predicted distributions.

## Perplexity in NLP

**Perplexity** is a measurement of how well a probability model predicts a sample under probability theory nlp.

- Perplexity is one of the ways to
**intrinsically**(without the use of external datasets)**evaluate**the performance of language models which come under NLP.- Intrinsic evaluation is the way of finding some property of a model that estimates the model’s quality independent of the specific tasks it used to perform.
- Perplexity is a metric that quantifies how uncertain a model is about the predictions it makes. Low perplexity only guarantees a model is confident, not accurate.
- Perplexity also often
**correlates**well with the**model’s final real-world performance**and it can be quickly calculated using just the probability distribution the model learns from the training dataset.

One **caveat** with perplexity is that a lower perplexity isn’t guaranteed to translate into better model performance. Let us learn about them :

- A model’s worst-case perplexity is fixed by the language’s vocabulary size and we can greatly lower the model’s perplexity just by switching from a word-level model to something else.
- For example, we can switch from word-level models with a typical vocabulary size of multiple thousands of words to a
**character-level model**with a vocabulary size of around 26 for the English language regardless of whether the character-level model is more accurate. - Other variables like the size of your training dataset or your model’s context length can also have a disproportionate effect on a model’s perplexity.

- For example, we can switch from word-level models with a typical vocabulary size of multiple thousands of words to a
- Perplexity, like all internal evaluation doesn’t provide any form of sanity-checking.
- For example, models trained on the two different datasets can have identical perplexities but would output wildly different answers.

**Advantages of Perplexity:**

Perplexity as a metric is**fast to calculate**and hence allows researchers to select among models that are unlikely to perform well in expensive/time-consuming real-world testing.**Disadvantage of perplexity :**

Not good for final evaluation since it just measures the model’s confidence not its accuracy & can end up rewarding models that mimic toxic or outdated datasets.

Perplexity is a measurement of how well a probability distribution or probability model predicts a sample, generally probability theory NLP domain.

## Kullback in NLP

**Kullback Leibler Divergence** (also called relative entropy) is a measure to compare the difference between two probability distributions (or any distributions, especially in NLP).

- The KL divergence metric is defined by the notation $K L(P \| Q)$ and given by the following formulation :
- $K L(P \| Q)=\sum p_i(x) \log \left(\frac{p_i(x)}{q_i(x)}\right)$
- The || operator indicates divergence or Ps divergence from Q.
- KL divergence is a
**non-symmetric measure**of the discrepancy between p and q. - Kullback Leibler Divergence metric is only well defined when $Q(i) > 0$ and $P(i) > 0$. In the cases where it is not defined, we can take the computation to be that $0 log(0) = 0$.
- KL divergence will be infinite if P is Gaussian and Q is uniform (not necessarily the other way around).

**Understanding the meaning of the KL(p||q) divergence:**

If we assume a language that generates strings over an alphabet with distribution p where $p(a)$ is the likelihood that symbol a is used in strings of the language.- We construct an estimate of p using modeling that is inaccurate in the form of the distribution q over the same alphabet.
- We can now construct an encoding using some suitable scheme over the alphabet based on q.
- The average number of bits per symbol wasted because we used q instead of p to encode the symbols over the language is $KL(p||q)$.

- In the context of recently popularized unsupervised learning methods GAN's and VAE, Kullback-Leibler divergence (KLD) is the distance metric that computes the similarity between the real sample given to the encoder and the generated fake image or text from the decoder.
- If the loss function yields more value, it means the decoder does not generate fake images similar to the real samples.

Identifying a distribution q that minimizes the divergence $KL(p||q)$ for a known p is a useful way to intuitively understand and evaluate several machine learning methods, especially under probability theory nlp.

## Bayesian in NLP

**Bayesian techniques** are useful tools for modeling a wide range of data and phenomena and especially NLP-related phenomenon like natural languages.

- Let us see a glimpse of the Bayesian approach toward NLP :
- We will model the data x probabilistically with $p(x|θ)$, where θ are some unknown parameters. For example, the modeling could be a generative story for a sentence x based on some unknown context-free grammar parameters θ.
- We will then represent the uncertainty about θ by a prior distribution $p(θ)$.
- Then, after collecting the relevant data, we can apply
**Bayes theorem**$p(θ|x) ∝ p(x|θ)p(θ)$ to find the posterior distribution for the quantities of interest.

- The Bayesian approach provides us with an elegant and unified way to incorporate prior knowledge and manage uncertainty over parameters.
- The Bayesian method can also be used to provide capacity control for complex models as an alternative to smoothing.
- There have been many successful applications of Bayesian techniques in probability theory NLP (natural language processing) like word segmentation, syntax determination, morphology understanding, coreference resolution, machine translation, etc.

- Let us also understand how Bayes theorem works which is the basis for modeling any process under consideration :
- Bayesian inferences use probability theory to model any self-consistent quantification of uncertainty.
- Using the Bayes model, we update a prior distribution over the model configuration space that quantifies our domain expertise into a posterior distribution over the model configuration space that incorporates both our domain expertise and the information encoded in the observed data.
- Mathematically Bayes theorem is given by the formulation : $p(Y=y \mid X=x)=\frac{p(X=x \mid Y=y) p(Y=y)}{p(X=x)}$
- where $P(Y|X)$ is called
**likelihood probability**, $P(X)$ is called**prior probability**. - $P(Y)$ is called
**evidence**and $P(X|Y)$ is called**posterior probability**.

- where $P(Y|X)$ is called

- Importance of Bayes theorem: Bayes Theorem tells us about the probability of an event based on prior knowledge of conditions that might be related to the event.
- Bayes theorem is very useful to gain confidence in an event to happen.
- It also
**combines**the marginal and the conditional probability. - Bayes theorem is generally used to find the
**reversed conditional probability**that we already know with the data.

## How is this Related to a Language Model?

In the context of language models, Bayesian methods make it easy to analyze the experimental data through the lens of our probabilistic cognitive models and allow us to reason about various relevant unknowns based on the data observed.

- The relevant unknowns include either
**parameters**or**predictions**or the**models**themselves: model parameters, future observations, and model hypotheses.

### Estimation of P

An **N-gram model** is built by counting how often word sequences occur in corpus text and then estimating the probabilities, which are **computed using the Bayesian** perspective using conditional probabilities.

- An
**example n-gram model**for computing probability : P('There was heavy rain') = P('There', 'was', 'heavy', 'rain') = P('There')P('was'|'There')P('heavy'|'There was')P('rain'|'There was heavy') - This is impractical to calculate with simply the conditional probabilities, hence we use Markov assumption to approximate using bigram model - P('There was heavy rain') ~ P('There')P('was'|'There')P('heavy'|'was')P('rain'|'heavy')

## Conclusion

**Probability theory**is a branch of mathematics that investigates the probabilities associated with a random phenomenon.**Bayesian approach**provides an elegant and unified way to incorporate prior knowledge and manage uncertainty over parameters into language models.**Entropy**,**Kullback**, and**Perplexity**are measures to compare across distributions and evaluate the performance of a probability theory NLP model.**Perplexity**is one of the ways to intrinsically evaluate the performance of language models which come under NLP.**Kullback-Leibler divergence**calculates a score that measures the divergence of one probability distribution from another.