Entropy


To augment my No Magic and Kraft articles, I am including Shannon's derivation of the entropy function.

The entropy, H, is a measure of the information content of a source. A source is anything that emits a series of messages.

Clearly a message cannot be encoded in an arbitrarily small length. For example, to send three values of equal probability using a binary code, you cannot do better than 0,10,11 (or an equivalent relabeling). Codes like 0,1,01 are not uniquely decodable, because the decoder would have to know where each codeword ends; that corresponds to sending an additional EOF symbol (0eof, 1eof, 01eof), which must be counted explicitly.
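(A little illustration of my own, not part of Shannon's argument: a brute-force Python check that the prefix code 0,10,11 parses a bit string in exactly one way, while 0,1,01 parses it in several ways.)

def parses(bits, code):
    # all ways of splitting 'bits' into a sequence of codewords
    if bits == "":
        return [[]]
    result = []
    for w in code:
        if bits.startswith(w):
            for rest in parses(bits[len(w):], code):
                result.append([w] + rest)
    return result

print(parses("0110", ["0", "10", "11"]))   # one parse : [['0', '11', '0']]
print(parses("0110", ["0", "1", "01"]))    # two parses -> ambiguous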

Thus a message has a minimum length L in which it can be represented. Let us assume for simplicity that our source is an order-0 source, i.e. each message occurs with a probability P and subsequent messages have no effect on each other. Any source can be transformed into an apparent order-0 source using modeling techniques.

Thus each message occurs with probability P and requires a minimum of length L to store.

Let there be N messages. Let Pi be the probability of message i, and Li the length to store message i, for 0 <= i < N.

The minimum length of a message (from a certain source) is equal to its information content. Thus the Entropy is the average minimum length of the messages sent. This is simply:

H = Sum(i) { Pi * Li }

i.e. the probability-weighted average message length.
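(To make the formula concrete, here is a quick computation of my own, using the code 0,10,11 from above and the probabilities 1/2, 1/3, 1/6 that show up again below:)

P = [1/2, 1/3, 1/6]     # message probabilities
L = [1, 2, 2]           # lengths of the codewords 0, 10, 11
print(sum(p * l for p, l in zip(P, L)))    # 1.5 bits per message on average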

Now let us state some properties which we intuitively require of H.

1. H is a continuous function of each Pi

2. If Pi = 1/N for all i, then H should be monotonically increasing in N, i.e. more possibilities mean more information can be sent.

3. If the probabilities are represented as a tree - each leaf being a message to send, and each branch being labeled with the probability of taking that branch - then the entropy of each node should equal the entropy of the choice made at that node plus the probability-weighted entropies of its children: H(node) = H( branch probabilities ) + Sum(branches) { P(branch) * H(branch target node) } , with H = 0 at the leaves.
The Entropy of the source is the entropy of the root node.

Condition #3 is the extensibility property of the entropy and deserves a brief comment. In our definition of entropy, we created a tree with all the characters branching directly from the root node. Also, such a tree is a good coding tree when the entropy of each node is as close as possible to an integer. Thus tree-decomposition is a good way to achieve a coding tree (hence Huffman and Shannon-Fano coding). BTW the sum of the probabilities branching from each node is always one, by definition of probability.
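(To make the tree-decomposition remark concrete, here is a minimal Shannon-Fano-style sketch of my own - not code from this article - which recursively splits the messages into two groups of nearly equal total probability, adding a 0 or 1 bit at each split:)

def shannon_fano(symbols):
    # symbols : list of (symbol, probability) pairs, sorted by descending
    # probability ; returns a dict {symbol : codeword}
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    running, split, best = 0.0, 1, None
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(2 * running - total)      # how uneven this split is
        if best is None or diff < best:
            best, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code
    return codes

print(shannon_fano([("a", 1/2), ("b", 1/3), ("c", 1/6)]))
# -> {'a': '0', 'b': '10', 'c': '11'}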

Finally, one brief example of property #3 :

H( 1/2 , 1/3 , 1/6 ) = H( 1/2 , 1/2 ) + 1/2 * H( 2/3 , 1/3 )

Which corresponds to the tree:

(source) --(1/2)--> a
         --(1/3)--> b
         --(1/6)--> c

Changed into the tree:

(source) --(1/2)--> a
         --(1/2)--> (node) --(2/3)--> b
                           --(1/3)--> c
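(A quick numerical check of my own, using the -P*log2(P) form of H that is derived at the end of this page:)

from math import log2

def H(probs):
    # entropy of a single choice with the given probabilities
    return sum(-p * log2(p) for p in probs)

lhs = H([1/2, 1/3, 1/6])
rhs = H([1/2, 1/2]) + 1/2 * H([2/3, 1/3])
print(lhs, rhs)     # both print about 1.459 bits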


Now, given these properties, does our current definition of H satisfy them all?

H = Sum(i) { Pi * Li }

No, not yet. We need to say more about Li, because the extensibility property of entropy has put some constraints on it.

First, Li must be a function only of the source message probabilities, for these are the only things H depends on.

Here is Shannon's proof:

Take the case where all Pi = P = 1/N
Thus all Li = L , and H = Sum(i) { (1/N) * L } = L
Write this common length as a function of N :
H = L(N)

From the extensibility property (decompose the choice among N = s^m equiprobable messages into m successive equiprobable s-way choices), we have:
L(N) = m*L(s)
where s^m = N , m is the depth of the tree, and s is
the number of branches at each node

similarly, for an alphabet of some other size t, L(t^n) = n*L(t)

choose n very large, and choose m such that (s and t arbitrary, >= 2) :

s^m <= t^n < s^(m+1)

Now take the log2 of each side:

m*log2(s) <= n*log2(t) < (m+1)*log2(s)

now divide through by n*log2(s)

m/n <= log2(t)/log2(s) < (m+1)/n

or

| log2(t)/log2(s) - m/n | < 1/n
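(A numerical illustration of my own: with s = 2, t = 3 and even a modest n, the ratio m/n already pins log2(t)/log2(s) down to within 1/n.)

from math import log2, floor

s, t, n = 2, 3, 10
m = floor(n * log2(t) / log2(s))       # the largest m with s^m <= t^n
assert s**m <= t**n < s**(m + 1)
print(m / n, log2(t) / log2(s))        # 1.5 vs about 1.585, within 1/n = 0.1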

Now we know L(N) is monotonic in N, so:

L(s^m) <= L(t^n) < L(s^(m+1))

so

m*L(s) <= n*L(t) < (m+1)*L(s)

divide by n*L(s) and rearrange in the same way:

| L(t)/L(s) - m/n | < 1/n

recall:

| log2(t)/log2(s) - m/n | < 1/n

thus, adding the two bounds (triangle inequality):

| L(t)/L(s) - log2(t)/log2(s) | < 2/n

but n may be arbitrarily large.

thus L(t)/log2(t) = L(s)/log2(s) for any two alphabet sizes, i.e. L(N) = K * log2(N) for some constant K. The constant only sets the unit: measuring lengths in bits means L(2) = 1, so K = 1 and

L(N) = log2(N)

and P = 1/N, thus

L = - log2(P)


Conclusions:

The minimum length needed (in general) to encode a message of probability P is log2(1/P)

The entropy of a source is:

H = Sum(i) { Pi * ( - log2(Pi) ) }

This is the minimum average output length to encode the messages from a source with probabilities Pi
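(One last sanity check of my own, with the same example probabilities used above: the entropy of the source (1/2, 1/3, 1/6) is about 1.459 bits, just under the 1.5 bits per message achieved by the code 0,10,11.)

from math import log2

P = [1/2, 1/3, 1/6]                              # example message probabilities
H = sum(-p * log2(p) for p in P)                 # entropy of the source
avg = sum(p * l for p, l in zip(P, [1, 2, 2]))   # average length of 0,10,11
print(H, avg)                                    # about 1.459 <= 1.5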


Charles Bloom / cb at my domain
