Shannon Entropy (aka information entropy, hereafter abbreviated SE) is one of the most important, under-appreciated, and very often misunderstood aspects of theoretical computer science, with applications in compression, encryption, signaling, and even applications as remote as statistics. This article is a brief, accessible[1], CS-oriented introduction to SE and its applications.

I should note before we dive in for the engineers that there is not necessarily any direct connection with “entropy” as it is used in thermodynamics[2]. It has a similar name because it is similar, but if you assume it is the same you can run into serious trouble.

Suppose that Bob lives in an apartment with a rather large safe. Inside the safe is a tasty cake, which Bob would very much like to eat, but unfortunately the safe is locked securely, and can only be opened if the correct combination of 5 octal digits are entered (octal digits are numbers 0-7). As a proper mathematician, you see readily that there are 8^5 valid combinations, and that the cake will spoil long before Bob has an opportunity to try them all.

If Bob had the first digit to the safe, there would only be 4096 remaining possible combinations. If Bob had the first two digits, there would only be 512 remaining combinations. If Bob had the first three digits, there would only be 64 remaining combinations. At this point, Bob can probably try them all and eat the cake. So far so good.

Unfortunately the only person who has the digits to the safe is Alice, who lives some sizable distance away, but within line-of-sight of Bob’s window. Since Alice’s FIOS deployment was delayed, the only way she can get a message to Bob is via a red laser pointer to Bob’s window. She devises the following scheme: an eight-digit number can be represented as three binary bits (000 = 0, 001 = 1, 010 = 2, 011 = 3, etc.). Each second, Alice will set the laser pointer to either on or off, representing a bit. So to signal the entire code, she will signal three bits for the first digit (taking three seconds), three bits for the second digit (taking an additional three seconds), etc. At the end, she will stop, wait a few minutes, and then repeat the message.

Meanwhile, Bob, being a good computer scientist, immediately realizes the purpose of Alice’s message, and quickly works out the encoding scheme (binary is a pretty obvious encoding). When 15 seconds elapse, Bob has the complete code and is able to eat the delicious cake. This should all be obvious.

But instead, something rather unfortunate happens: the battery in Alice’s laser pointer runs out during the message! Bob is only able to receive the first part of the code. Is the information he has sufficient to try the remaining possible combinations, or will the cake spoil in the locked safe?

Let’s consider some cases. Suppose Alice is sending this:

010101000101111

Suppose Bob only receives the first digit, a 0. So therefore, it cannot start with a 1, meaning that the first digit of the safe combination must be less than 4 (since 100, the least number beginning with 1, = 4). This eliminates all combinations starting with 4, 5, 6, or 7. If you work out the math, Bob has gone from 32,678 possible combinations to 16,384 combinations, a reduction of half.

Now suppose Alice is able to send the first two bits before the battery run out. Assuming Bob already has the first bit (and knows that there are 16,384 remaining possibilities), after the second bit there are only 8,192 possibilities. And if she manages to get three bits out (a full digit), there are only 4096 possibilities. Following it through, the last bit eliminates exactly one possibility.

A curious thing has just happened while you weren’t paying attention. You see, each of these bits took the same amount of time (1 second) to transmit. But the first bit (and so the first second) eliminated 16,000 combinations, whereas the third bit only eliminated 4,000 combinations. Each bit taking an identical amount of work to transmit for Alice, Alice is clearly experiencing the law of diminishing returns, as no matter where you draw the line, the first N digits will be more helpful to Bob than the next M digits. In other words, **not all bits are created equal**. Some bits are better than others.

Shannon Entropy is, stated simply, a quantitative unit of measurement for comparing these bits to those bits. In our example, we’ve established, quite specifically, that the first bit will be twice as good as the second, which is twice as good as the third, and so on. By “twice as good” we simply mean that we’ve eliminated twice as much work for Bob typing combinations into the safe. And so, without knowing it, we’ve walked right into the arms of SE theory.

However, Alice has bigger problems. It turns out that the laser pointer is cheap crap from some Taiwanese sweatshop, and that the switch to turn off and on the laser doesn’t always work. Each time she transmits the entire 15 bits comprising the combination, exactly once (at a random place) the switch will get stuck and what should be a 1 will become a 0 and vice versa. So if Bob receives the entire combination, he now has to check 15 combinations (one for each of the 15 bits, one of which he knows was inverted). And if Alice’s battery is still intermittent and Bob only receives 14 bits, he has to try:

probability_of_bad_bit_received_yet * combinations_required_in_case_1 + (1 – probability_of_bad_bit_received_yet) * combinations_required_in_case_2

or

bits_sent/15 * (bits_sent * 2^(15 – bits_sent)) + (1-bits_sent/15) * (2^(15-bits_sent))

or

14/15 * (14 * 2^1) + (1/15)*2^1 = about 26 combinations.

We can graph these approaches side-by-side:

We see two things pretty immediately. First, that the encoding scheme without the bit flipping interference is quite a bit better than the one with interference (which is common sense, interference should make things worse), and second, that Bob needs about 7 bits in a no-interference system or about 10 bits in an interfering system to get to the cake. That is, 7 bits in one system is about as helpful to Bob as 10 bits in the other system, which is a very specific result. In fact, with this graph, or one quite like it, you can say “k bits in system A will take Bob five years, which is equivalent to m bits in system B”. For lots of values of A and B. For instance:

- What if Alice has a tri-state laser pointer that can be red, green, or off?
- What if exactly k bits per 15 bits transferred are randomly flipped, for k > 1?
- What if every kth bit Alice transmits is incorrect?
- What if the switch gets stuck “on” 10% of the time, so that there’s a random 10% chance Alice cannot turn it off if it is on (but can still turn it on if it is off)
- And many more!

And all of this is still in the world of “Alice uses a fixed encoding scheme, does Bob get the cake under condition X?” But if Alice and Bob are aware of the situation and can meet and talk things over in advance, perhaps they can design a better encoding scheme. And now our SE example is suddenly dealing with compression, i.e. choosing our encoding scheme such that we transmit the most useful information in the smallest number of bits.

Although it may not be immediately obvious, Alice’s initial choice of using three binary bits per digit was very fortuitous. And not only, say, because it was more efficient than using 100 binary bits per digit. When selecting an encoding scheme, Alice has equivalently selected a decoding algorithm for Bob to use. And not just any algorithm, but the algorithm of binary search. Binary search, as you recall, is the algorithm in which you search for an item by eliminating half of the remaining search space on each iteration. And every time Bob receives a bit, he eliminates half the remaining safe combinations from consideration. Just as binary search is an excellent comparison search algorithm, Alice’s encoding is a very good one for message passing.

However, binary search is not optimal *strictly* because it eliminates half the elements on each iteration, but rather because it divides the search space into **two equally probable groups**. If there is some hanky panky going on with the safe combinations, and “greater or less than 40000” is *not* a good way to divide the search space into two equally-probable groups, changes have to be made.

Think of what would happen if Alice knew that the first digit was probably, but not certainly, a 5. Alice and Bob could agree in advance to transmit the probably-five digit last instead of first. Most of the time, this will yield better results, because the other (more uncertain) digits will be transmitted first, which is more useful if Alice could run out of battery at any time.

If the safe combination is chosen by an evil mathematician who typically chooses numbers with interesting mathematical properties, Alice might try a totally different approach, with the first bit representing prime or composite numbers, the second bit representing (if prime) whether it is a Mersenne prime, and so on, finally drilling down to a very small set of possibilities. This will not work well in the general case since most safe combinations are mathematically uninteresting, but with a bit of background knowledge about the safe creator she can design a system that works better on average for the particular application.

If Alice is trying to send an English message to Bob, and she knows that ‘e’ is much more common than ‘z’, there are even more options. Rather than create a fixed-width encoding scheme of exactly 5 bits per character, she decides that “1” should mean ‘e’, leaving binary strings starting with 0 open to be other letters. If ‘l’ is the second most common letter, “01” can be ‘l’, “001” can be ‘r’, and so on, with odd characters like ‘q’ being very long encodings. With such a system, exploiting the knowledge of English means a more-efficient variable-length encoding for average English messages, resulting in faster message transfer. The encoding I’ve described is almost exactly how Huffman Encoding works in practice.

Returning from the stratosphere of compression examples, however, it turns out that Alice’s initial binary encoding scheme is as good as she can do in the no-interference situation. Shannon’s source coding theorem tells us that, if all safe combinations are equally likely, the best we can hope for is going to be O(log2,x), which is exactly the order of the function Alice has created for the no-interference encoding method. So barring additional information about safe combinations, Alice has reached the theoretical limit already for compressing the safe combination into fewer characters.

Similarly, we can use one of the many other theorems in information theory (the noisy channel coding theorem is particularly useful) to give us tight lower bounds for theoretical compression maximums for arbitrary cases, including noise, interference, little gremlins in the transmission channel, etc. These theorems let us talk quite quantitatively about the theoretical limits of compression and will let us evaluate the current state of compression algorithms against this theoretical best-case algorithm.

A brief recapitulation of what we’ve done so far. We’ve compared the relative usefulness of some bits to other bits using SE. We’ve taken that information to compare the effects or arbitrary error on a communications channel. And then we’ve used those results to give us a theoretical maximum compression ratio for a given situation. All of these results are tied to some particular situation: the usefulness of a bit to Bob depends in no small part on how many other bits he has, the effect of arbitrary error depends on the kind and frequency of the error, and the ideal compression ratio is particular to certain known properties of the possible data.

Now we can come full circle. Because if we know there exists some theoretical compression algorithm, A, which can compress English well, we also know that “sandwich”, being a valid English word, has some compressed form via this algorithm with exactly k bits. Since we know that our compression algorithm is optimal for English, we know that our representation of “sandwich” is optimal for English, and so we know that “sandwich” can be optimally represented with k bits. And with that, we’ve quantified, quite literally, the information content in “sandwich” the same way we’ve quantified the effectiveness of Alice’s laser pointer at getting Bob the cake. And as it turns out, k is about 10-11 bits for “sandwich”, meaning that what is, in ASCII encoding, 8 bytes (of 8 bits each, for 64 bits) really contains only 10-11 actual, informative bits that convey actual information content. Those other 53 bits that are allocated by your system to display “sandwich” are just fluff. You’re using an inefficient encoding, with poor assumptions of the underlying information, the same was as if Alice was using a binary encoding with safe combinations chosen by a mathematician. Only 17% of the bits used to store English words are actually used to impart meaning, the rest of the bits can just be thrown out.

Yu’roe udenr the ipmressoin taht Egnilsh caers if you tatolly roedrer the ltteres in wrods. & the l4titud3 t0 r3pl4c3 ch4r4c|-|3r5 wh0l3s4l3 15 pr3tty 1ncr3dibl3. Pls u cn jst shrtn ur wrds, mking evrytng stl readbl. Ud *bttr bliev* u cn jst thro awy bits…

It is important to emphasize here that our SE analysis of English (like any other SE analysis) is context-sensitive; it’s only relevant for English. We would be wrong to assume the same information density for Spanish or Italian (or poorly-written “English”).

We’ve looked at SE and its applications to noise/interference analysis, data loss, compression, and its ability to directly quantify information content. SE analysis is powerful and lets us perform a quantitative study of different forms of information. As we saw, “binary: the encoding” has a suspiciously-parallel algorithm “binary search”. In fact, every algorithm that operates on data can be considered an encoding or decoding and can be analyzed with SE theory. We can take any algorithm, and instead of performing traditional algorithmic analysis, instead analyze the input and output quantitatively with SE. We can relate the O-notation for the algorithm to the difference in SE from the input to the output. So SE can be considered the weird twin of algorithmic analysis, and algorithmic analysis the long-lost cousin of SE analysis. They answer the same question, merely with a different focus.

But there is one important difference, although not an absolute one. Algorithmic analysis is focused on the behavior of a *fixed* algorithm with *arbitrary* (best-case, worst-case) input, whereas SE analysis is focused on the *fixed* input with an *arbitrary* algorithm. In practice, then, SE tends to be employed in fields like cryptography or genetics, where the actual algorithm may be unknown. (We don’t know how every gene is actually decoded and used, but here’s some DNA, let’s do an SE analysis.) But if you are a computer scientist searching for an algorithm, you may consider doing an SE-based analysis instead of enumerating algorithms and then performing algorithmic analysis on each one individually and just hoping to draw some broad conclusions. SE is much more direct for that case, and unfortunately it seems relatively unknown in undergraduate CS education.

[1] Although smart people posit that such a relationship exists.

[2] Like any accessible piece, it has some, shall we say, mistruths, in the interests of communicating the most amount of information in the shortest amount of space. True information theorists should be able to live with this.

Copyright © 2011 Drew Crawford, All Rights Reserved

Powered by WordPress

Is it just me, or are the labels “Normal” and “One Bit Flipped” inverted on the graph?

You’re right! Sorry

Thanks for clarifying, and for writing this article in the first place. I have had on my list to try to find a concise explanation of info entropy that would stick in my head and yours did.