# The Principle of Maximum Entropy


## Overview

The Principle of Maximum Entropy is a fundamental principle in statistical mechanics and information theory that states that, when we are uncertain about the probability distribution of a system, we should choose the distribution that has maximum entropy subject to the constraints imposed by the available information. In simpler terms, if we know some basic information about a system (such as the average value of some quantity), but we don't know the full probability distribution of the system, then the principle of maximum entropy tells us to choose the most "random" or "uncertain" distribution that still satisfies the known constraints.

## Introduction

The principle of maximum entropy is a principle in probability theory and information theory that suggests choosing the probability distribution that has the maximum entropy subject to the constraints imposed by the available information. In other words, when we have incomplete information about a probability distribution, we should choose the distribution that is the least biased: the one that assumes nothing beyond what the available information requires.

Entropy, in this context, refers to the degree of randomness or uncertainty in a probability distribution. The maximum entropy distribution is the one that has the highest degree of randomness or uncertainty and is most spread out. This principle has applications in various fields, including physics, economics, and computer science. It is used to derive probability distributions in situations where we have incomplete information, such as in natural language processing, image processing, and machine learning. It is also used to solve optimization problems and to model complex systems in physics and biology.

## What is Entropy?

Entropy is a measure of the average information content of the outcomes of a random variable $X$. A less probable outcome conveys more information than a more probable one. Thus, entropy can be stated as a measure of uncertainty. A system's entropy determines how difficult it is to communicate and make judgments about it. When the goal is to find a distribution that assumes as little as possible, entropy should consequently be maximal. Formally, entropy is defined as follows: if $X$ is a discrete random variable with distribution $P(X=x_i)=p_i$, then the entropy of $X$ is

$H(X) = -\sum_{i} p_i \log p_i$
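As a minimal sketch of this definition, the snippet below computes the entropy of a few simple distributions. The function name `entropy` and the choice of base 2 (which gives the result in bits) are assumptions made for illustration; the formula above does not fix a logarithm base.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H = -sum(p * log p); zero-probability terms contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # less uncertain than the fair coin
certain = entropy([1.0, 0.0])        # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The fair coin has the largest entropy of the three, which matches the reading of entropy as a measure of uncertainty.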

## What is the Principle of Maximum Entropy?

The principle of maximum entropy is a fundamental principle in statistical inference and information theory that states that when we have incomplete information about a probability distribution, we should choose the distribution that has the maximum entropy subject to the constraints imposed by the available information.

In other words, the principle of maximum entropy suggests that we should choose the most unbiased probability distribution that is consistent with the available knowledge. The maximum entropy distribution is the one that is most spread out, with the least amount of structure or bias.

The concept of entropy comes from thermodynamics, where it measures the disorder or randomness of a physical system. In information theory, entropy is used to measure the amount of uncertainty or randomness in a message or data set.

## Examples

Two examples are commonly used to illustrate the Principle of Maximum Entropy. One is a crude business model, and the other is a crude model of a physical system.

### Berger’s Burgers

Berger's Burgers is a hypothetical example often used to illustrate the Principle of Maximum Entropy, which is a fundamental principle in statistical inference and information theory.

According to the Principle of Maximum Entropy, when we have incomplete information about a probability distribution, we should choose the distribution that has the maximum entropy subject to the constraints imposed by the available information.

In the context of Berger's Burgers, suppose we have a burger joint that sells three types of burgers: veggie, chicken, and beef. We know that the total number of burgers sold in a day is 100, but we don't know the exact number of each type sold. We also know that the joint sells more chicken burgers than beef burgers, but we don't know the exact ratio.

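Using the Principle of Maximum Entropy, we can calculate the probability distribution that maximizes the entropy subject to these constraints. The total of 100 burgers only fixes the scale, so the constraints on the probabilities are that they sum to 1 and that $p_{chicken} \geq p_{beef}$. If the ordering is read in this way, as "at least as many", the uniform distribution satisfies it. Since the uniform distribution has the highest entropy of any distribution over three outcomes, it is the maximum entropy solution: each burger type has a probability of 1/3.

This means that a purely qualitative ordering does not move the maximum entropy estimate away from equal probabilities. If the ordering is strict (strictly more chicken than beef), the maximum is not attained exactly, but distributions arbitrarily close to uniform come arbitrarily close to it. To shift the estimate noticeably toward chicken, we would need quantitative information, such as a known ratio or a known average.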
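The result can be checked numerically with a brute-force search. The sketch below is only an illustration: the function name `max_entropy_on_grid`, the grid resolution, and the reading of "more chicken than beef" as either `>=` or `>` are assumptions, not part of the original example.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_on_grid(steps=300, strict=False):
    """Search (veggie, chicken, beef) distributions on a grid of spacing 1/steps.

    Only distributions with chicken >= beef (or chicken > beef when
    strict=True) are considered.
    """
    best, best_h = None, -1.0
    for v in range(steps + 1):
        for c in range(steps + 1 - v):
            b = steps - v - c
            if c < b or (strict and c == b):
                continue  # violates the ordering constraint
            probs = (v / steps, c / steps, b / steps)
            h = entropy(probs)
            if h > best_h:
                best, best_h = probs, h
    return best, best_h

best, best_h = max_entropy_on_grid()
best_strict, best_h_strict = max_entropy_on_grid(strict=True)

print(best, best_h)                  # the uniform distribution, entropy log(3)
print(best_strict, best_h_strict)    # a near-uniform distribution, slightly lower entropy
```

With the non-strict constraint the search returns the uniform distribution. With the strict constraint it returns the grid point closest to uniform, and refining the grid moves that point closer still.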

### Magnetic Dipole Model

The magnetic dipole model is a simple physical model used to describe the behavior of a magnet. It assumes that the magnet can be approximated as a small magnetic dipole, which is a pair of equal and opposite magnetic poles separated by a small distance.

In the context of the Principle of Maximum Entropy, the magnetic dipole model can be used to determine the probability distribution of the orientation of the magnet's magnetic moment, which is a vector that points from the magnet's south pole to its north pole.

Suppose we have some prior knowledge about the orientation of the magnet's magnetic moment. For example, we might know that the magnet is placed in a magnetic field that tends to align the moment with the field. However, we might not know the exact orientation of the moment.

Using the Principle of Maximum Entropy, we can determine the probability distribution of the orientation of the magnetic moment that has the maximum entropy subject to the available information. This means that we choose the distribution that is consistent with the available knowledge but is otherwise as uniform as possible.

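If the only constraint is that the probabilities sum to 1, the maximum entropy distribution assigns equal probabilities to all possible directions of the moment. Information about the field changes this. If we know the average alignment of the moment with the field (equivalently, the average energy), the maximum entropy distribution is no longer uniform but favors directions along the field, with probability proportional to $e^{\lambda \cos\theta}$, where $\theta$ is the angle between the moment and the field and $\lambda$ is fixed by the known average. This is the Boltzmann distribution of statistical mechanics.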
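A small numerical sketch can make the effect of a mean-alignment constraint concrete. Everything specific here is an assumption made for illustration: the orientation is discretized into 21 equally spaced values of $\cos\theta$ (equal spacing in $\cos\theta$ corresponds to equal solid angle), the target averages 0 and 0.5 are hypothetical, and the multiplier is found by simple bisection.

```python
import math

def tilted_distribution(levels, lam):
    """Distribution with p_k proportional to exp(lam * x_k)."""
    weights = [math.exp(lam * x) for x in levels]
    total = sum(weights)
    return [w / total for w in weights]

def mean(levels, probs):
    return sum(x * p for x, p in zip(levels, probs))

def solve_multiplier(levels, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Bisection on lam; the mean of the tilted distribution increases with lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(levels, tilted_distribution(levels, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical discretization: 21 equally spaced values of cos(theta) in [-1, 1].
levels = [-1 + 2 * k / 20 for k in range(21)]

# No net alignment: the multiplier is zero and the distribution is uniform.
lam_free = solve_multiplier(levels, 0.0)
p_free = tilted_distribution(levels, lam_free)

# Assumed average alignment of 0.5: probabilities grow toward cos(theta) = 1.
lam_field = solve_multiplier(levels, 0.5)
p_field = tilted_distribution(levels, lam_field)

print(lam_free, lam_field)
```

The exponential form used in `tilted_distribution` is the standard solution of entropy maximization under a single average-value constraint; the code only solves for the multiplier that matches the assumed average.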

## Applications

The maximization of entropy is frequently applied to inferential problems in the following ways:

• Prior probabilities: The principle of maximum entropy is a way to assign a prior probability distribution, that is, the distribution we assume before we have seen any evidence, using only the constraints we already know. This approach is popular and is used in areas like communication channels.
• Posterior probabilities: The maximum entropy rule can be used to update probabilities when new information comes in. This use is related to probability kinematics, a rule for updating probabilities, and to radical probabilism, the view that we can never be 100% sure of any piece of evidence. It is not a one-size-fits-all approach, and it does not cover every possible updating rule.
• Maximum entropy models: Another way to use the maximum entropy principle is to build models where we assume that the data we observe is all the information we have. These models are used in natural language processing, and logistic regression is an example of such a model.
• Probability density estimation: One of the main uses of the maximum entropy principle is to estimate probability densities, whether they are continuous or discrete. A great advantage of this approach is that we can use previous knowledge to help us make better estimates.

## Information Entropy

Information entropy is a concept from information theory that measures the amount of uncertainty or randomness in a message or data set. It was introduced by Claude Shannon in the 1940s as a way to quantify the amount of information contained in a message or signal.

The entropy of a message or data set is defined as the average amount of information contained in each symbol or data point. It is calculated as the negative of the sum, over all possible symbols or data points, of each probability multiplied by the logarithm of that probability.

### Shannon’s theorem

Shannon’s approach starts by stating conditions that a measure of the amount of uncertainty $H_n$ has to satisfy.

• We can associate uncertainty with real numbers.
• $H_n$ is a continuous function of the $p_i$. If $H_n$ (the amount of uncertainty) could change abruptly, a very small change in the probability distribution could produce a large jump in uncertainty. Therefore, $H_n$ should change smoothly with small changes in the probabilities.
• $H_n$ should correspond to common sense in that the more possibilities there are, the more uncertain we should be, and the fewer possibilities there are, the less uncertain we should be. This condition has the effect that, when the $p_i$ are all equal, the quantity $h(n) = H_n(1/n, \ldots, 1/n)$ is a monotonically increasing function of $n$.
• If there are multiple ways to calculate the amount of uncertainty $H_n$, then they should all give the same answer. So the amount of uncertainty should be consistent, regardless of how we calculate it.

Under these assumptions, the resulting measure of uncertainty of a probability distribution $p$, unique up to a positive constant factor (that is, a choice of logarithm base), turns out to be the negative of the average log-probability:

$H(p) = -\sum_{i} p_i \log p_i$

Accepting this interpretation of entropy, it follows that the distribution $(p_1, \ldots, p_n)$ that maximizes this expression, subject to the constraints imposed by the information at hand, will serve as the most accurate representation of what the model knows regarding the propositions $(A_1, \ldots, A_n)$. The function $H$ is called the information entropy of the distribution $\{p_i\}$.
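Two of the conditions above can be checked numerically for this formula. The sketch below is illustrative only: the helper name `entropy` and the specific distribution $(1/2, 1/3, 1/6)$ are assumptions, chosen because the same choice can be made either in one step or in two stages.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Monotonicity: for n equally likely outcomes, h(n) = log(n) grows with n.
h = [entropy([1.0 / n] * n) for n in range(1, 9)]

# Consistency: choosing among (1/2, 1/3, 1/6) directly, or first choosing
# between two equally likely halves and then (half of the time) splitting
# the second half as (2/3, 1/3), should give the same uncertainty.
direct = entropy([1 / 2, 1 / 3, 1 / 6])
two_stage = entropy([1 / 2, 1 / 2]) + 0.5 * entropy([2 / 3, 1 / 3])

print(h)
print(direct, two_stage)
```

The two-stage value weights the second choice by 1/2 because that choice is only made half of the time.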

### The Wallis derivation

A second and perhaps more intuitive approach to deriving entropy was suggested by G. Wallis. We are given information $I$, which is to be used in assigning probabilities $\{p_1, \ldots, p_m\}$ to $m$ different possibilities. We have a total amount of probability

$\sum_{i=1}^mp_i=1$

to allocate among them.

The problem can be stated as follows. Choose some integer $n \gg m$, and imagine that we have $n$ little quanta of probability, each of magnitude $\delta = 1/n$, to distribute in any way we see fit.

Suppose we were to scatter these quanta at random among the m choices (penny-pitch game into m equal boxes). If we simply toss these quanta of probability at random, so that each box has an equal probability of getting them, nobody can claim that any box is being unfairly favored over any other.

If we do this and the first box receives exactly $n_1$ quanta, the second $n_2$ quanta etc., we will say the random experiment has generated the probability assignment:

$p_i = n_i \delta = n_i / n, \quad i = 1, 2, \ldots, m.$

The probability that this will happen is the multinomial distribution:

$m^{-n} \frac{n!}{n_1! \cdots n_m!}$

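The most probable assignment is therefore the one that maximizes the multiplicity $W = \frac{n!}{n_1! \cdots n_m!}$. For large $n$, Stirling's approximation gives $\frac{1}{n} \log W \to -\sum_{i=1}^m p_i \log p_i$, so the assignment that this fair random experiment is most likely to produce is the one with maximum entropy.

To sum it up: Entropy is a measure of uncertainty. The higher the entropy of a random variable $X$, the more uncertainty it incorporates.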
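The link between the multinomial count and entropy can be seen numerically. In the sketch below, the target assignment $(0.5, 0.3, 0.2)$ and the helper names are hypothetical; the code compares the log of the multinomial coefficient per quantum with the entropy of the assignment as the number of quanta $n$ grows.

```python
import math

def log_multiplicity(counts):
    """log( n! / (n_1! ... n_m!) ), using lgamma to avoid huge factorials."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical assignment p = (0.5, 0.3, 0.2), realized as counts (5k, 3k, 2k).
target_entropy = entropy([0.5, 0.3, 0.2])

gaps = []
for k in (1, 10, 100, 1000, 10000):
    counts = [5 * k, 3 * k, 2 * k]
    n = sum(counts)
    per_quantum = log_multiplicity(counts) / n
    gaps.append(target_entropy - per_quantum)
    print(n, per_quantum, target_entropy)
```

The gap between the two quantities shrinks as $n$ increases, which is the sense in which maximizing the multiplicity becomes the same as maximizing the entropy.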

## The KL Divergence

KL divergence, short for Kullback-Leibler divergence, is a measure of the difference between two probability distributions. It was introduced by Solomon Kullback and Richard Leibler in 1951 as a way to quantify the amount of information lost when using one probability distribution to approximate another.

KL divergence is a non-symmetric measure, which means that the KL divergence between distribution A and distribution B is not necessarily the same as the KL divergence between distribution B and distribution A.

The formula for KL divergence is:

$KL(P \parallel Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

where $P$ and $Q$ are the two probability distributions, $p(x)$ is the probability of the event $x$ under distribution $P$, and $q(x)$ is the probability of the event $x$ under distribution $Q$.

Intuitively, KL divergence measures the expected extra information (for example, extra bits per symbol) needed to encode samples drawn from $P$ using a code optimized for $Q$ rather than a code optimized for $P$. It is commonly used in machine learning and information theory, for example, in model selection and parameter estimation.

KL divergence is always non-negative and is zero if and only if the two distributions are identical. It is not a true distance measure, as it is not symmetric and does not satisfy the triangle inequality. It is also infinite whenever $Q$ assigns zero probability to an event that has positive probability under $P$.
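The following sketch illustrates these properties on small discrete distributions. The function names and the example distributions `p` and `q` are assumptions made for illustration. The last comparison uses the identity that the KL divergence from a distribution to the uniform distribution over $n$ outcomes equals $\log n$ minus its entropy, which is one way to see why maximizing entropy means staying as close as possible to uniform.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) * log(p(x) / q(x)), in nats.

    Returns infinity if Q gives zero probability to an outcome P allows.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # terms with p(x) = 0 contribute nothing
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]

forward = kl_divergence(p, q)       # KL(P || Q)
backward = kl_divergence(q, p)      # KL(Q || P), generally a different value
to_uniform = kl_divergence(p, uniform)

print(forward, backward)
print(to_uniform, math.log(3) - entropy(p))
```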

## Bayesian Inference

Bayesian inference is a method of statistical inference that involves updating the probability of a hypothesis or model in light of new evidence or data. It is based on Bayes' theorem, which provides a way to calculate the probability of a hypothesis given some observed data.

Bayesian inference is a powerful tool for modeling and analyzing complex systems, and it is widely used in a variety of fields, including physics, biology, engineering, and economics. The basic steps in Bayesian inference are as follows:

• Specify a prior probability distribution for the hypothesis or model.
• Collect data or evidence.
• Update the prior probability distribution using Bayes' theorem to obtain the posterior probability distribution.
• Evaluate the goodness of fit of the model to the data and compare it to alternative models.

The prior probability distribution represents our beliefs about the hypothesis or model before we have seen any data. The posterior probability distribution represents our updated beliefs about the hypothesis or model after we have seen the data. The likelihood function represents the probability of the data given the hypothesis or model.
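The update step can be sketched for a small discrete problem. The setup is hypothetical: three candidate values for a coin's probability of heads, a uniform prior over them (the maximum entropy choice when nothing else is known), and an invented sequence of four flips.

```python
def bayes_update(prior, likelihood):
    """Posterior is proportional to prior times likelihood, normalized to sum to 1."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: u / evidence for h, u in unnormalized.items()}

# Hypothetical hypotheses: the coin's probability of heads is one of these values.
hypotheses = [0.25, 0.5, 0.75]

# Uniform prior: no hypothesis is favored before seeing data.
belief = {h: 1.0 / len(hypotheses) for h in hypotheses}

# Hypothetical data: heads, heads, tails, heads.
for flip in "HHTH":
    likelihood = {h: (h if flip == "H" else 1.0 - h) for h in hypotheses}
    belief = bayes_update(belief, likelihood)

for h in hypotheses:
    print(h, belief[h])
```

After three heads and one tail, the posterior favors the hypothesis 0.75 over 0.5, and 0.5 over 0.25, while still leaving probability on all three.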

Bayesian inference has several advantages over classical, or frequentist, statistical inference, which draws conclusions from the frequency with which events occur in a given dataset. It provides a more coherent and consistent way of incorporating prior knowledge or beliefs into the analysis. It also provides a way to quantify the uncertainty in the results, which is often useful for decision-making and risk assessment. However, it often requires more computational resources and expertise to implement compared to classical methods.

## Conclusion

• The Principle of Maximum Entropy is a principle from probability theory and information theory that suggests choosing the probability distribution that has the maximum entropy subject to the constraints imposed by the available information.
• Entropy, in this context, refers to the degree of randomness or uncertainty in a probability distribution.
• The principle has applications in various fields, including physics, economics, and computer science. It is used to derive probability distributions in situations where we have incomplete information, such as in natural language processing, image processing, and machine learning.
• The maximum entropy distribution is the one that is most spread out, with the least amount of structure or bias.
• The principle is used to solve optimization problems and to model complex systems in physics and biology.
• The principle has been explained using two examples, namely, Berger's Burgers and the Magnetic Dipole Model.
• Bayesian inference is a statistical method that uses Bayes' theorem to update the probability of a hypothesis as new evidence or information becomes available. It is commonly used in machine learning and artificial intelligence.