# CMU 15-112: Fundamentals of Programming and Computer Science Class Notes: Data Compression with Huffman Encoding

Data Compression: Overview

Whenever we represent data in a computer, we need to choose some sort of encoding with which to represent it. When representing strings, for example, we have learned about ASCII codes to represent individual characters. Under the ASCII encoding, each character is represented using 8 bits, so a string of length n requires 8n bits of storage. For example, consider an encoding for the non-whitespace characters of the string "more free coffee". Ignoring spaces, this string can be represented in ASCII using 14 * 8 = 112 bits. The table below shows the relevant subset of the standard ASCII table:

| Character | ASCII code | Bit pattern |
|-----------|------------|-------------|
| 'm'       | 109        | 01101101    |
| 'o'       | 111        | 01101111    |
| 'r'       | 114        | 01110010    |
| 'e'       | 101        | 01100101    |
| 'f'       | 102        | 01100110    |
| 'c'       | 99         | 01100011    |

So the string "more free coffee" would be encoded in ASCII as:

 109 111 114 101 102 114 101 101 99 111 102 102 101 101

or as the following stream of bits:

 01101101 01101111 01110010 01100101 01100110 01110010 01100101 01100101 01100011 01101111 01100110 01100110 01100101 01100101
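The bit stream above can be reproduced with a few lines of Python, using `ord` to get each character's ASCII code and `format` to render it as an 8-bit pattern (a small sketch, not part of the course code):

```python
# Reproduce the ASCII bit stream for the non-whitespace characters
# of "more free coffee": one 8-bit pattern per character.
s = "morefreecoffee"
bits = [format(ord(ch), "08b") for ch in s]
print(" ".join(bits))          # 01101101 01101111 01110010 ...
print(len("".join(bits)))      # 112 = 14 characters x 8 bits each
```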

When using the ASCII encoding we confine ourselves to representing each character using the same number of bits (a fixed-length encoding). What if we allowed ourselves to use a variable-length encoding? In that case we can take advantage of special properties of the data, such as letter frequency, by assigning shorter codes to characters that occur more frequently. For example, consider using the following code:

| Character | Code |
|-----------|------|
| 'e'       | 0    |
| 'o'       | 110  |
| 'm'       | 1010 |
| 'c'       | 1011 |
| 'r'       | 100  |
| 'f'       | 111  |

Notice that the above encoding is prefix-free: no code word is a prefix of any other code word. For instance, 'm' is encoded above as 1010, and none of its three prefixes (1, 10, and 101) is the code for any character. Can you think why this is an important feature? According to the encoding we have specified above, the representation for the non-whitespace characters in the string "more free coffee" would be:
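One way to see why prefix-freeness matters is to write a decoder. The sketch below (using the code table above; `decodeBits` is a hypothetical helper, not part of the course code) reads bits left to right and emits a character as soon as the bit buffer matches a code word. Because no code word is a prefix of another, the decoder can never commit to a wrong character:

```python
# Greedy decoding with the prefix-free code above. Because no code word
# is a prefix of another, emitting a character as soon as the bit buffer
# matches a code word is always correct.
code = {"e": "0", "o": "110", "m": "1010", "c": "1011", "r": "100", "f": "111"}
decodeTable = {bits: ch for ch, bits in code.items()}

def decodeBits(bitString):
    result, buffer = "", ""
    for bit in bitString:
        buffer += bit
        if buffer in decodeTable:   # unambiguous thanks to prefix-freeness
            result += decodeTable[buffer]
            buffer = ""
    return result

bits = "".join(code[ch] for ch in "morefreecoffee")
print(decodeBits(bits))  # morefreecoffee
```

If the code were not prefix-free (say, if 'e' were 0 and 'r' were 01), the decoder could not know whether to stop after reading a 0 or to keep reading.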

 1010 110 100 0 111 100 0 0 1011 110 111 111 0 0

which is encoded with only 34 bits. This is clearly much better than the ASCII encoding. In fact, it can be proven that this particular encoding is optimal for this string: no other encoding can represent the string using fewer than 34 bits. Furthermore, this encoding is optimal for any string that has the same distribution of characters. A method for constructing such an optimal coding scheme was developed by David Huffman and is discussed below.
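The 34-bit total is easy to check with a quick sketch (the code dict below is the table above):

```python
# Check the bit counts: the variable-length code uses 34 bits for the
# non-whitespace characters of "more free coffee", versus 112 in ASCII.
code = {"e": "0", "o": "110", "m": "1010", "c": "1011", "r": "100", "f": "111"}
s = "morefreecoffee"
encoded = "".join(code[ch] for ch in s)
print(len(encoded))   # 34
print(8 * len(s))     # 112
```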

Note: While we ignored whitespace in the above examples, there really isn't anything special about spaces, and we could easily assign a code word to the space character. We just chose not to, since we can communicate the basic principles without worrying about it. In real life, spaces matter and you do want to give them their own code word.

Tree data structure: Overview

The Huffman coding scheme takes each symbol and its frequency of occurrence, and generates an encoding for each symbol that takes the frequencies into account, so that symbols with higher frequency have fewer bits in their encoding. Huffman encoding first creates a tree of nodes and then uses this tree to read off the code for each of the specified characters. So let's first do a brief review of trees.

What is a tree?
A tree is a collection of nodes, where each node is a data structure consisting of a value, together with a list of references to child nodes. Here, we'll use a class to represent a tree node.

Vocabulary
• Root - The top node in a tree
• Parent - An internal node has one or more child nodes and is called the parent of its child nodes.
• Siblings - Nodes with the same parent
• Leaf - A node with no children
• Internal node - A node with at least one child
• Edge - A connection between one node and another
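A minimal node class of the kind described above might look like this (the names here are illustrative; the Huffman code later in these notes uses its own HuffmanNode class):

```python
# A minimal tree node: a value plus a list of references to child nodes.
class TreeNode(object):
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []

# A small tree: 'a' is the root (an internal node); 'b' and 'c' are
# siblings, and both are leaves (no children).
root = TreeNode("a", [TreeNode("b"), TreeNode("c")])
print(root.value)                        # a
print([c.value for c in root.children])  # ['b', 'c']
```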

Huffman Encoding: Overview

The first step of Huffman encoding is building the Huffman tree. Given a set of characters and their associated frequencies, we can build an optimal Huffman tree as follows:

1. Construct leaf Huffman trees for each character/frequency pair
2. Repeatedly choose two minimum-frequency Huffman trees and join them together into a new Huffman tree whose frequency is the sum of their frequencies.
3. When only one Huffman tree remains, it represents an optimal encoding.

Then the code for each character can be obtained by following the path from the root of the tree to the leaf holding the given character, accumulating a '0' when following a left edge and a '1' when following a right edge. The accumulated zeros and ones at each leaf constitute the Huffman code word for that character. The image below illustrates this process.
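The three steps above can be run by hand on the "more free coffee" frequencies. The sketch below is illustrative only (its names and tuple-based trees are not the implementation given later): heap entries are (frequency, tiebreak, tree), where a tree is either a single character or a (left, right) pair of subtrees.

```python
import heapq

# Step 1: one leaf per character/frequency pair, kept in a min-heap.
freqs = {"e": 5, "f": 3, "o": 2, "r": 2, "m": 1, "c": 1}
heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freqs.items()))]
heapq.heapify(heap)
counter = len(heap)  # unique tiebreak so tuple comparison never reaches the tree

# Step 2: repeatedly join the two minimum-frequency trees.
while len(heap) > 1:
    f1, _, t1 = heapq.heappop(heap)  # minimum-frequency tree
    f2, _, t2 = heapq.heappop(heap)  # next minimum-frequency tree
    heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
    counter += 1

# Step 3: one tree remains; its frequency is the total character count.
rootFreq, _, tree = heap[0]
print(rootFreq)  # 14

# Read off the codes: '0' for a left edge, '1' for a right edge.
def getCodes(t, prefix=""):
    if isinstance(t, str):
        return {t: prefix}
    left, right = t
    return {**getCodes(left, prefix + "0"), **getCodes(right, prefix + "1")}

codes = getCodes(tree)
# Any Huffman tree for these frequencies encodes the string in 34 bits.
print(sum(freqs[ch] * len(codes[ch]) for ch in freqs))  # 34
```

Ties between equal frequencies can be broken either way and yield different trees, but every Huffman tree for a given distribution achieves the same minimal total bit count.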

Huffman Encoding: Implementation

Note: We are using a module called heapq in the code below. This module lets us quickly obtain the node with the lowest frequency as we build our Huffman tree, instead of having to loop over the whole list of trees each time. You do not need to worry about exactly how it works, but if you are interested, read about the Heap data structure and see pq_heap.py
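If you just want to see heapq in action, here is a tiny demo (toy values):

```python
import heapq

# heapify rearranges a list into a min-heap in place, and heappop
# always removes and returns the smallest remaining item.
nums = [5, 1, 4, 2]
heapq.heapify(nums)
print(heapq.heappop(nums))  # 1
print(heapq.heappop(nums))  # 2
```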

```python
import heapq

class HuffmanNode(object):
    def __init__(self, freq, char=None, left=None, right=None):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

    # used mainly for debugging purposes
    def __repr__(self):
        return "HuffmanNode(char=%s, freq=%s)" % (self.char, self.freq)

    # needed for node comparison; used to order the nodes appropriately
    # in the priority queue
    def __lt__(self, other):
        return self.freq < other.freq

    def isLeaf(self):
        return (self.left == None and self.right == None)

def buildHTree(freqData):
    huffmanNodes = []
    for char in freqData:
        huffmanNodes.append(HuffmanNode(freqData[char], char))
    # the list of huffmanNodes is transformed into a priority queue to keep
    # track of the minimum-frequency Huffman nodes
    heapq.heapify(huffmanNodes)
    while (len(huffmanNodes) > 1):
        # obtain the two minimum-frequency Huffman nodes
        child1 = heapq.heappop(huffmanNodes)
        child2 = heapq.heappop(huffmanNodes)
        parent = HuffmanNode(child1.freq + child2.freq,
                             left=child1, right=child2)
        heapq.heappush(huffmanNodes, parent)
    return None if huffmanNodes == [] else heapq.heappop(huffmanNodes)

def hTreeToHCode(hTree):
    code = dict()
    # a left edge represents a 0 bit, a right edge represents a 1 bit, and
    # the path from the root to a leaf gives the code word for the character
    # stored at that leaf
    def getCode(hNode, curCode=""):
        if (hNode == None):
            return
        if (hNode.left == None and hNode.right == None):
            code[hNode.char] = curCode
        getCode(hNode.left, curCode + "0")
        getCode(hNode.right, curCode + "1")
    getCode(hTree)
    return code

def encode(s, freqData):
    hTree = buildHTree(freqData)
    hCode = hTreeToHCode(hTree)
    hEncoded = ""
    for char in s:
        hEncoded += hCode[char]
    return hEncoded

def decode(s, freqData):
    hTree = buildHTree(freqData)
    decodedStr = ""
    curTreeNode = hTree
    for charCode in s:
        if (charCode == "0"):
            curTreeNode = curTreeNode.left
        else:
            curTreeNode = curTreeNode.right
        if (curTreeNode.isLeaf()):
            decodedStr += curTreeNode.char
            curTreeNode = hTree
    return decodedStr

freqData = {"e":5, "o":2, "m":1, "c":1, "r":2, "f":3}
encodedStr = encode("morefreecoffee", freqData)
print("encodedStr", encodedStr)
decodedStr = decode(encodedStr, freqData)
print("decodedStr", decodedStr)
```

Accessibility Application

Obviously, Huffman encoding is widely used in compressing text and other files on your computer, but there are less obvious ways compression can be applied, such as accessibility. Take a look at this typing interface and see how Huffman encoding can lead to higher typing efficiency. Note that linear scanning, as detailed in the demo, is actually in use today as a typing method by many people with motor impairments who do not have the fine motor control to use a keyboard or a touchscreen.

Discussion

1. If Huffman encoding is "better", i.e., produces shorter strings, why does ASCII exist? Why do computers use ASCII by default?

   The advantage of the ASCII encoding scheme is that the boundaries between characters are easily determined, and the pattern used for each character is fixed and universal. "Decoding" an ASCII-represented string (i.e., translating the binary encoding back into the original text) simply consists of breaking the encoding into 8-bit bytes and converting each byte into the character it represents using the ASCII table. A file encoded in this scheme needs no additional information passed along with it, since all files and computers share the same binary-to-character mapping. On the other hand, with Huffman encoding, data with different character distributions will have different encodings, so the Huffman tree (or the frequency table) that was used to encode the data must be stored with it to make decoding possible.

2. When I zip my files, is it using Huffman encoding?

   It depends. Many compression algorithms use a combination of different techniques. For example, DEFLATE uses a combination of the LZ77 algorithm and Huffman encoding: LZ77 handles repetitive data, while Huffman encoding is used for the non-repetitive data.

3. What other kinds of data compression are there?

   There are many other ways of compressing data. Here are a few links if you're curious to learn more:

   All of the above algorithms belong to a data compression category known as lossless compression. You can also look into lossy data compression, which is widely used in image and video compression.

4. I think accessibility is super cool! Where can I learn more and how can I get more involved?

• Read more about Google Accessibility. Look through the features and products to get an idea of what is currently being offered to help people with certain disabilities have equal access to technology. Keep these teams in mind when you apply for internships!
• Get involved in research. Here are a few research groups at CMU that you might find interesting:
• Take some HCI-Accessibility classes. Here's one being offered in the spring.
• Read here about a novel idea to help blind people navigate.
• Email Rudina (rmorina). I would love to tell you more and help you get involved.