Huffman Coding


Overview

When we want to send files or store them on a computer and their size is very large, we generally compress them before storing or sending.

One way of compressing such files is Huffman Coding, a greedy algorithm that is also known as variable-length coding; it is a lossless data compression technique.

Introduction to Huffman Coding

Huffman Coding is a technique that is used for compressing data to reduce its size without losing any of its details. It was first developed by David Huffman and was named after him.

Huffman Coding is generally used to compress data that contains frequently repeating characters.

Huffman Coding is a famous greedy algorithm. It is said to be greedy because, at every step, it combines the two least frequent symbols, so the length of the code assigned to a character depends on the frequency of that character.

A character with a higher frequency gets a shorter code, and a character with a lower frequency gets a longer one. Huffman Coding uses variable-length encoding, which means it assigns codes of different lengths to the characters in the given stream of data.

Prefix Rule

This rule means that the code assigned to any character must not be a prefix of the code assigned to any other character.

If this rule is not followed, ambiguities can occur while decoding the encoded bitstream using the Huffman tree that was formed.

To understand this rule better, let's consider an example. Suppose the following codes are assigned to the characters: a = 0, b = 1, and c = 01.

Now assume that the generated bitstream is 001. While decoding, it can be split in more than one way:

The given string could be read either as aab (0, 0, 1) or as ac (0, 01). Such ambiguities can be avoided if we follow the prefix rule: when no codeword is a prefix of any other codeword, every encoded bitstream has exactly one valid decoding.
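As a small Python sketch (the decodings helper name is just for illustration), assuming the codes a = 0, b = 1, c = 01 from the example above, the function below lists every possible way a bitstream can be split into codewords, showing why 001 is ambiguous:

    def decodings(bits, codes, prefix=""):
        """Recursively list every way 'bits' can be split into codewords."""
        if not bits:
            return [prefix.strip()]
        results = []
        for ch, code in codes.items():
            if bits.startswith(code):
                results += decodings(bits[len(code):], codes, prefix + " " + ch)
        return results

    # Non-prefix code: 'a' (0) is a prefix of 'c' (01), so decoding is ambiguous.
    codes = {"a": "0", "b": "1", "c": "01"}
    print(decodings("001", codes))   # ['a a b', 'a c'] -> two valid decodings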

How does Huffman Coding work?

There are mainly two major steps involved in obtaining the Huffman Code for each unique character:

  1. First, build a Huffman Tree from the given stream of data (taking only the unique characters and their frequencies).
  2. Second, traverse the Huffman Tree that is formed, assign codes to the characters, and then decode the given string using these Huffman codes.

Major Steps in Huffman Coding

Steps involved in building the Huffman tree from the given set of characters

Here, if Huffman Coding is used for data compression, then we need to determine the following for decoding:

  • Huffman Code for each character
  • Average code length
  • Length of Huffman encoded message (in bits)

The last two of them are found using formulas that are discussed below.

How to Build a Huffman Tree from Input Characters?

  • Firstly, we need to calculate the frequency of each character in the given string.
Character    Frequency/count
a            4
b            7
c            3
d            2
e            4
  1. Create a leaf node for every unique character along with its frequency from the given stream of data.
  2. Sort these nodes in ascending order of frequency and store them in a priority queue Q (a min-heap).
  3. Extract the two nodes with the minimum frequencies from the min-heap and create a new internal node whose frequency is the sum of these two frequencies.
  4. While combining the two extracted minimum-frequency nodes:
    • Make the first extracted node (the one with the smaller frequency) the left child and the other extracted node the right child, since the minimum frequency should always be on the left side of the root.
    • Add this new internal node back to the min-heap.
  5. Repeat steps 3 and 4 until the heap contains only one node (i.e., all characters are included in the tree). The remaining node is the root node and the tree is complete, as illustrated in the sketch below.
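A quick sketch of this merge loop in Python (the variable names here are just for illustration), using heapq and the frequencies from the table above; it tracks only the frequencies, not the tree nodes themselves:

    import heapq

    # Frequencies from the table above: a:4, b:7, c:3, d:2, e:4
    heap = [4, 7, 3, 2, 4]
    heapq.heapify(heap)

    while len(heap) > 1:
        first = heapq.heappop(heap)    # smallest remaining frequency
        second = heapq.heappop(heap)   # next smallest remaining frequency
        heapq.heappush(heap, first + second)
        print(f"merge {first} + {second} -> {first + second}")
    # Prints: 2 + 3 -> 5, then 4 + 4 -> 8, then 5 + 7 -> 12, then 8 + 12 -> 20

These merges are exactly the steps walked through in the example below.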

Example of Huffman Coding

Let us understand the algorithm with an example:

example-of-huffman-coding

example-of-huffman-coding2

Huffman Coding Algorithm

Step 1: Build a min-heap that contains 5 (number of unique characters from the given stream of data) nodes where each node represents the root of a tree with a single node.

huffman-min-heap

Step 2: Extract the two minimum-frequency nodes from the min-heap and add a new internal node formed by combining them, with frequency 2 + 3 = 5.

minimum-frequency-nodes-from-min-heap

  • Now the min-heap contains 4 nodes, where 3 nodes are roots of trees with a single element each, and 1 heap node is the root of a tree with 2 elements.

Step 3: Similarly, extract the two minimum-frequency nodes from the heap and add a new internal node formed by combining them; the frequency to be added in the tree is 4 + 4 = 8.

two-minimum-frequency-nodes-from-min-heap

  • Now the min-heap contains 3 nodes, where 1 node is the root of a tree with a single element, and 2 heap nodes are roots of trees with more than one node.

Step 4: Extract the two minimum-frequency nodes and add a new internal node formed by combining them; the frequency to be added in the tree is 5 + 7 = 12.

  • While building a Huffman tree, we need to make sure that the smaller of the two extracted values is always placed on the left side of the new node and the other on the right side. The tree formed so far is shown in the image below:

minimum-value-placed-on-left-sideff

Step 5: Extract the next two minimum-frequency nodes and add a new internal node formed by combining them; the frequency to be added in the tree is 12 + 8 = 20.

huffman-tree

Repeat the procedure until all the unique characters are inserted into the tree. The above image shows the Huffman tree formed for the given set of characters.

Now, to form the code for each character, assign 0 to the left edge and 1 to the right edge of every non-leaf node.

Rules to be followed while assigning weights for edges:

  1. If you assign weight 0 to the left edges, then assign weight 1 to the right edges.
  2. If you assign weight 1 to the left edges, then assign weight 0 to the right edges.
  3. Either of the above two conventions can be followed.
  4. However, the same convention must be used while decoding the tree as well.

The modified tree after assigning the weights is shown below:

modified-huffman-tree-after-assigning-the-weights

Decoding the Code

  • To obtain the Huffman code for each character from the Huffman tree that was formed, we traverse the tree until we reach the leaf node where the character is present.
  • While traversing, the edge weights along the path are recorded and assigned as the code of the character present at that particular leaf node.
  • This can be better understood from the tree shown in the image above.
  • To get the code for each character, we need to traverse the full tree (until all leaf nodes are covered).
  • Therefore, the codes for every character are obtained with the help of the tree formed. The code obtained for each character is shown below (a small decoding sketch follows the table):
Character    Frequency/count    Code
a            4                  01
b            7                  11
c            3                  101
d            2                  100
e            4                  00
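A minimal decoding sketch in Python (the decode function and decode_map names are just for illustration), assuming the codes from the table above; because no code is a prefix of another, the bitstream can be read from left to right without any ambiguity:

    # Codes from the table above (prefix-free)
    codes = {"a": "01", "b": "11", "c": "101", "d": "100", "e": "00"}
    decode_map = {code: ch for ch, code in codes.items()}

    def decode(bits):
        result, current = [], ""
        for bit in bits:
            current += bit
            if current in decode_map:      # a complete codeword has been read
                result.append(decode_map[current])
                current = ""
        return "".join(result)

    encoded = "".join(codes[ch] for ch in "abcde")
    print(encoded)          # 011110110000
    print(decode(encoded))  # abcde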

Algorithm:

  • First, create a priority queue Q consisting of a node for each unique character from the given stream of data.
  • Then order these nodes in ascending order of their frequencies (this can be done using a min-heap or a comparator).
  • While more than one node remains in the priority queue, do the following:
    • Create a newNode.
    • Extract the node with the minimum frequency from the priority queue Q and make it the left child of newNode.
    • Extract the node with the next minimum frequency from Q and make it the right child of newNode.
    • Set the value (frequency) of newNode to the sum of the two extracted minimum values.
    • Insert newNode back into the priority queue Q.
  • Finally, return the last remaining node, which is the root of the Huffman Tree formed.

Implementation of the above approach:

Python implementation of Huffman Coding:
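A minimal Python sketch of the above approach (the helper names build_huffman_tree and assign_codes are chosen here for illustration). It builds the tree with a min-heap, assigns 0/1 codes by traversing the tree, and encodes an example string with the frequencies a:4, b:7, c:3, d:2, e:4:

    import heapq
    from collections import Counter

    class Node:
        """A node of the Huffman tree: a leaf (with a character) or an internal node."""
        def __init__(self, freq, ch=None, left=None, right=None):
            self.freq = freq
            self.ch = ch
            self.left = left
            self.right = right

        def __lt__(self, other):
            # heapq orders nodes by frequency; ties are broken arbitrarily
            return self.freq < other.freq

    def build_huffman_tree(text):
        """Build the Huffman tree for the given text and return its root."""
        freq = Counter(text)
        heap = [Node(f, ch) for ch, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            left = heapq.heappop(heap)    # node with the minimum frequency
            right = heapq.heappop(heap)   # node with the next minimum frequency
            heapq.heappush(heap, Node(left.freq + right.freq, None, left, right))
        return heap[0]

    def assign_codes(node, prefix="", codes=None):
        """Traverse the tree, appending 0 for a left edge and 1 for a right edge."""
        if codes is None:
            codes = {}
        if node.ch is not None:           # leaf node: the path so far is its code
            codes[node.ch] = prefix or "0"
        else:
            assign_codes(node.left, prefix + "0", codes)
            assign_codes(node.right, prefix + "1", codes)
        return codes

    if __name__ == "__main__":
        data = "aaaabbbbbbbcccddeeee"     # frequencies a:4, b:7, c:3, d:2, e:4
        root = build_huffman_tree(data)
        codes = assign_codes(root)
        encoded = "".join(codes[ch] for ch in data)
        print("Huffman codes :", codes)
        print("Encoded length:", len(encoded), "bits")
        # The encoded length is 45 bits here; the exact codes can differ when
        # frequencies tie, but the code lengths (and hence the total) stay optimal.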

Explanation:

The Huffman Tree is built and then traversed to generate the codes. Whenever a leaf node is reached, the edge weights collected along the path are assigned as the code of the character present at that leaf node. In this way, a Huffman Code is obtained for each unique character present in the given stream of data.

Huffman Coding Complexity

  • Time complexity: O(N log N)
    • Encoding the given unique characters based on their frequencies takes O(N log N) time, where N is the number of unique characters.
    • Getting the minimum from the priority queue (min-heap) takes O(log N) time, since it calls the minHeapify method.
    • extractMin() is called 2*(N-1) times in total.
    • Therefore, the overall complexity of Huffman Coding is O(N log N).
  • Space complexity: O(N)
    • The space complexity is O(N) because a map/array of size N is used to store the characters while calculating their frequencies.

Disadvantages of Huffman Coding

In this section, let's discuss the disadvantages of Huffman Coding and why it is not always optimal:

  • It is not optimal unless all probabilities/frequencies of the characters are negative powers of 2 (i.e., 1/2, 1/4, 1/8, ...).
  • Although one may come closer to the optimum by grouping symbols together and extending the alphabet, this blocking method requires a larger alphabet to be handled, so sometimes Huffman Coding is not very effective.
  • Although there are many efficient methods for counting the frequency of each symbol/character, rebuilding the entire tree for every symbol can be very slow. This is usually the case when the alphabet is large and the probability distribution changes rapidly with each symbol.

Greedy Algorithm for Constructing a Huffman Code

  • Huffman invented a greedy algorithm that creates an optimal prefix code called a Huffman Code for every unique character from the given stream of data.
  • The algorithm builds the Huffman tree in a bottom-up manner, combining the two minimum-frequency nodes at every step.
  • This is called a greedy approach because at each step the locally best choice (merging the two least frequent nodes) is made. As a result, each character's code length depends on its frequency in the given stream of data: the shorter the code obtained, the more frequently that character occurs in the data.

Important Formulas

In this section, we discuss a few important formulas required to estimate the size of the given data stream after compression, i.e., after forming the Huffman Tree and assigning a code to each unique character.

The total size required can be estimated as follows:

  • Average code length = Σ (frequency_i × code length_i) / Σ (frequency_i)

  • Length of Huffman Encoded Message: Total number of bits in the Huffman encoded message = Total number of characters in the message × Average code length per character (obtained from the above formula). A worked example is given below.
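For instance, using the frequency and code table obtained earlier (a = 01, b = 11, c = 101, d = 100, e = 00):

    Average code length = (4×2 + 7×2 + 3×3 + 2×3 + 4×2) / (4 + 7 + 3 + 2 + 4)
                        = (8 + 14 + 9 + 6 + 8) / 20
                        = 45 / 20
                        = 2.25 bits per character

    Length of the encoded message = 20 characters × 2.25 bits = 45 bits,
    compared to 20 × 3 = 60 bits if a fixed-length 3-bit code were used for the 5 characters.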

Applications of Huffman Coding

Here we will be discussing the real-life applications of Huffman Coding:

  • Huffman Coding is frequently used by conventional compression formats like PKZIP, GZIP, etc.
  • For data transmission using fax and text, Huffman Coding is used because it reduces the size and improves the speed of transmission.
  • Many multimedia formats such as JPEG, PNG, and MP3 use Huffman encoding (especially prefix codes) for compressing files.
  • Image compression is often done using Huffman Coding.
  • It is more useful in some cases where there is a series of frequently occurring characters to be transmitted.

Conclusion

  • Huffman Coding is generally useful for compressing data in which some characters occur frequently.
  • We can observe that the most frequent character gets the smallest code and the least frequent character gets the largest code.
  • The Huffman Code compression technique produces a variable-length code, i.e., a different number of bits for each character/symbol. This is more advantageous than fixed-length coding because it reduces memory space and decreases the time required for transmitting the data.
  • To gain a better understanding of Greedy Algorithms, go through this article.