There are no magic compressors that can pack any data!!!

I shall now go through a semi-technical proof of the fact
that NO compressor can compress all data.


First of all, we must understand that all compressors of an
input file S must work statistically.  Really, this isn't
saying anything at all, or making any incorrect assumptions - it
is just a mode of interpretation (kind of like the various
interpretations of quantum mechanics - the results are always the
same under the mathematic framework, you just have some freedom in
choosing your philosophical viewpoint).  Let me describe
what I mean.

Each file S, we will assume is a string of characters in
an array, terminated by an EOF character of value 0x100 .
Imagine we "know" the probability of each file appearing;
Let us call this P(S) , then all we do to compress a file
is to assign a number to each S, and we write that number
in -log2(P(S)) bits.  It is known that you cannot do better
than this for a given P(S) (see Shannon).

Really, the only this the statistical view is to consider the 
probabilities P(S) of each code instead of the length L(S) of
each code ; by the relation L(S) = -log2(P(S)) , these views
are equal by a transform (kind of like Matrix Mechanics vs.
Wave Mechanics vs. q-number algebra ; I'm just full of quantum
mechanical analogies today).

see Shannon's derivation of L(S) = -log2(P(S))

But we can generalize this, for those of you who say
"you don't have to use a probability compressor!" - let us
just assign a code to each possible input S.  These codes
must be prefix codes in order for them to be uniquely
decodable - i.e. we do not want two codes to be confused
by the decoder.  Let us call the length of the code for a
certain input L(S) ; now all we do is write this code
with this length.  However, we cannot just choose our
lengths arbitrarily - obviously, if S is 2 bytes long,
then I could not write 1 bit for each code in S - some
of the possible values of S would be unspecifiable!  i.e.

S    code
00    0
01    1
10    00
11    01

IS NOT OK - because the decoder doesn't know how long
code is, so when it sees '00' it doesn't know if that
means S = 00 or S = 10 ; if I want to use that code
set I must also write out code lengths, which make all
the codes longer.

It turns out there is an easy way to specify this
"decodability" requirement.  It is called the Kraft
Inequality.  I have written about it before in :
Kraft Requirement on Prefix Codes and Power of the Kraft Inequality

However, there is an equally good way of saying it, more simply:

For each L(S) let me define a number P(S) which I have not yet given
a name or meaning, and let me define it by the relation:

 L(S) = -log2( P(S) )

or P(S) = 2^( - L(S) )

Now, the decodability requirement is :

 Sum[ all S ] P(S) <= 1

And the average output length is minimized when this Sum is at
its maximum, i.e.

 Sum[ all S ] P(S) = 1

Furthermore, it is a property of the definition of P(S) that

 0 < P(S) < 1    ( with L(S) > 0 )

Now these properties look just like the properties of a probability,
so I will call P(S) a probability, since it acts just like one,
whether it really is one or not.

Thus, we come back to the assertion : however I compress the
stream S, I MUST assign each S an output length L(S) ; these
lengths are NOT arbitrary, but must satisfy the Kraft inequality
to be decodable.  This requirment is the same as saying that
each L(S) corresponds to a P(S), which acts like a probability,
and is constrained by normal probability rules.  Thus, finding
the best L(S) set is identical to finding the best P(S) set,
and coding with L(S) = -log2( P(S) ).

NOTE that I did not say ANYTHING about our coder being
a statistical coder - all my conclusions simply came from the
fact that it was a coder which wrote a length L(S) for each
input S, and that its output was decodable.  Thus these
conclusions apply to ANY compressor - including one that
works by mathematical tricks like factoring !!!!!

Now, we can say some more very nice things given this fact :

1. Any statistical coder (PPM,LZ77,Zip,HA,ACB,etc. etc.) is
   the same as the simple P(S) coder I have described, except
   that instead of keeping a massive offline database of every
   P(S) , they use tricks to adaptively figure out P(S) based
   one what they expect a "reasonable" file to look like.
   In general, a coder which works character by character is
   just a subset of the more general whole-file-at-once coder

2. Because we know we are constrained by the Kraft Inequality,
   my proof from  Power of the Kraft Inequality 
   Holds on our L(S) set : therefore if you compress one possible
   input file by a factor F, you MUST expand another file by
   a factor greater than F - this is a simple mathematical consequence
   of the Kraft Inequality.

3. Let's take a look at what the "average" output length will be.
   Imagine that each input S occurs with a probability Q(S) , exactly.
   Note that Q(S) not necessarily = P(S) , because P(S) isn't yet
   a probability, it is just defined to be P(S) = 2^( - L(S) ) , from
   the output lengths of our general coder.  Now, the output length
   L(S) will occur with probability Q(S), so the average output length
   L(avg) is:

     L(avg) = Sum[ all S ] Q(S) * L(S)

   but L(S) = - log2( P(S) ) so:

     L(avg) = - Sum[ all S ] Q(S) * log2( P(S) )

It is shown by Shannon and others (and is easy to prove, but requires some math) that L(avg) is minimized when P(S) = Q(S)

  Thus let us define H = min[ L(avg) ] = - Sum[ all S ] P(S) * log2( P(S) )

  Now we clearly see that in the best case, P(S) is indeed the probability
  of S appearing, and that the Entropy H is equal to the minimum possible
  average output length.  Thus when people say that you cannot code
  better than the entropy, IT IS TRUE , and it DOES apply to ANY coder!!!

4. Let me attempt a short proof of H = min[ L(avg) ] ;
   I will use a two-symbol (binary) alphabet for S; S = 0 or 1 ; Any
   larger alphabet of S can be trivially constructed by recursive
   application of a binary coder to a binary tree.

   P(0) = a
   P(1) = 1-a
   Q(0) = b
   Q(1) = 1-b

   L(avg) = - Sum[ all S ] Q(S) * log2( P(S) )
          = - [ b * log2( a )  + ( 1-b ) * log2( 1-a ) ]

   Now, let us find the value of a that minimizes this L(avg)
   It is determined by :

   d/da [ L(avg) ] = 0

   L(avg) = - ( 1 / ln(2) ) * [ b * ln(a) + (1-b) * ln(1-a) ]

   d/da [ L(avg) ] = - ( 1 / ln(2) ) * [ b * (1/a) + (1-b) * (-1/(1-a)) ]

     = 0

     b/a - (1-b)/(1-a) = 0

   b/a = (1-b)/(1-a)

   b/(1-b) = a/(1-a)

   b = a

   -> L(avg) minimized when P(0) = Q(0)

   proves it for the binary case, therefore for the general case.

Charles Bloom / cb at my domain

Send Me Email

Back to the Index

The free web counter says you are visitor number