A Fallacy in Model + Encoder



There is a subtle mistake in the Model + Encoder
paradigm.  Let me warm up to it.

The modern coder is actually not a general Model
+ Encoder, because it is too slow to ask the model
to supply probabilities for each new character to
be encoded.  Instead, a modern coder works as follows :

	1. a Model generates a "context" for the
	character to be encoded.  All characters
	assigned to a given context shall be
	called a "context group".

	2. the encoder uses a separate set of
	statistics for each context.
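
To make the split concrete, here is a minimal sketch in Python of
steps 1 and 2, using an order-1 model (the previous byte is the
context) and add-one counts as the per-context statistics.  The
names (ContextCodec, order1_context, coded_bits) are illustrative,
not taken from any particular coder :

import math
from collections import defaultdict

ALPHABET = 256  # byte alphabet

class ContextCodec:
    """Step 2 : one adaptive count table per context; all characters
    coded under the same context form a "context group"."""
    def __init__(self):
        self.counts = defaultdict(lambda: [1] * ALPHABET)  # add-one init

    def codelen(self, ctx, ch):
        """Ideal code length, in bits, an arithmetic coder would spend."""
        table = self.counts[ctx]
        return -math.log2(table[ch] / sum(table))

    def update(self, ctx, ch):
        self.counts[ctx][ch] += 1

def order1_context(prev_ch):
    """Step 1 : the model; here the context is just the previous byte."""
    return prev_ch

def coded_bits(data: bytes) -> float:
    """Total ideal coded size of a byte string under this scheme."""
    codec, total, prev = ContextCodec(), 0.0, 0
    for ch in data:
        ctx = order1_context(prev)
        total += codec.codelen(ctx, ch)
        codec.update(ctx, ch)
        prev = ch
    return total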

In a two-pass scenario, step 2 could be thought of as separating
the characters into many separate arrays, one for each context
group, and then coding each array to its order-zero entropy.  More
realistically, step 2 is done by an adaptive arithmetic coder;
this adaptive coder should not only learn the statistics of each
context group, it can also adapt to changing trends (temporal
trends, if you consider a file as a time sequence).
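
In the two-pass view the total cost is just each group's size times
its order-zero entropy.  A sketch of that accounting, where
context_of stands in for whatever function the model computes :

import math
from collections import Counter, defaultdict

def order0_entropy(symbols):
    """Order-zero entropy, in bits per symbol, of one context group."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def two_pass_cost(data: bytes, context_of) -> float:
    """Pass 1 : split the characters into context groups.
    Pass 2 : charge each group its order-zero entropy."""
    groups = defaultdict(list)
    prev = 0
    for ch in data:
        groups[context_of(prev)].append(ch)
        prev = ch
    return sum(len(g) * order0_entropy(g) for g in groups.values())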

The goal of the first step is to decorrelate all the context
groups; that is, if G1 and G2 are different groups, the mutual
information I(G1;G2) should be zero.
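
Measuring that directly is not trivial, but if you could draw
paired samples from two groups (how to pair them is itself a
modeling choice glossed over here), the mutual information could be
estimated in the usual way :

import math
from collections import Counter

def mutual_information(pairs):
    """Empirical I(X;Y) in bits from paired samples (x, y)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())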

This seems all well and good, but it is wrong.  In
practice, the arithmetic coder used on each context group
is adapting, so its statistics change over time.  This
means that the statistical character of each context group
is changing over time.  Thus the correct context group to
add a certain character to might change.  Let us make this
more precise.  There are :

contexts : Cn
characters : c(t)
already-filled context groups : Gn(t)
the best-estimated statistics of a character : H(t)
the current statistics of the adaptive coder for group Gn : Hn(t)

Our goal in assigning a context Cn to the character c(t) is
to choose the n such that | Hn(t) - H(t) | is minimized.  This
problem statement differs from the original one above in that I
have now explicitly acknowledged that Hn changes with t : that is,
no static context-choosing scheme can ever be locally optimal
(all known context-choosing schemes, such as DMC, PPM, CALIC,
ECECOW, etc., are static; the only schemes which approach
non-static-ness are Paul Volf's Mixing Scheme and the CTW blending
scheme; my old PPMZ LOE is also somewhere in this family).
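
A context-choosing rule in this spirit would, at each step, compare
the best current estimate of the character's statistics against
each group's current adaptive statistics and pick the closest.  A
toy sketch, using KL divergence as one possible distance
| Hn(t) - H(t) | (this is only an illustration, not PPMZ LOE or
Volf's mixer) :

import math

def closest_group(p_hat, group_dists):
    """Pick the group whose *current* adaptive distribution is
    closest to p_hat, the best estimate of the character's
    statistics.  p_hat and each group distribution are dicts
    mapping symbol -> probability."""
    def kl(p, q):
        return sum(p[s] * math.log2(p[s] / q.get(s, 1e-12))
                   for s in p if p[s] > 0)
    return min(group_dists, key=lambda n: kl(p_hat, group_dists[n]))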

Now that we have recognized the problem, we can elaborate on it.
We have implicit control over Hn(t) which is seldom used or
recognized.  Rather than imagining that Hn is constant and trying
to group similar statistics, we can acknowledge that Hn(t) varies,
and we can control this variation (that is, we can do more than
just let Hn(t) vary due to the adaptation of the model).
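
One crude but concrete way to exert that control is to manage the
adaptation rate of each group's counts directly, for example by
halving them whenever their total grows past a cap, so that Hn(t)
weights recent characters more heavily.  The cap and the halving
rule below are just one possible choice :

class AdaptiveCounts:
    """Per-context counts whose variation over time is controlled
    directly : when the total passes a cap, the counts are halved,
    so the group's statistics Hn(t) track recent characters."""
    def __init__(self, alphabet=256, cap=4096):
        self.counts = [1] * alphabet
        self.cap = cap

    def update(self, ch):
        self.counts[ch] += 1
        if sum(self.counts) > self.cap:
            self.counts = [(c + 1) // 2 for c in self.counts]

    def prob(self, ch):
        return self.counts[ch] / sum(self.counts)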

A nice example of this is image coding.  Consider lossless image
coding (this is all that matters, since lossy coding is just
lossless coding of the post-quantized coefficients).  We look at
pixels in the nearby two-dimensional neighborhood to create the
context.  Once we have this context, Cn, which we have tried to
design to correspond to Hn, we simply code the character with
Hn(t).  But Hn(t) is usually an adaptive arith-coder, which is
gathering statistics in one-dimensional time : that is, in some
arbitrary scan order!  In fact, when we made Cn from the 2d
neighborhood we implicitly assumed that Hn(t) was also localized
in the 2d neighborhood.  This is usually not true, and results in
a mismatch between Cn and Hn(t).
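
To see that structure in code : the context comes from the 2d
neighborhood, but the counts behind Hn(t) are updated in
raster-scan order, so at any pixel they summarize everything seen
so far in the scan, not the pixel's 2d locality.  The gradient
context and the bucket count below are illustrative choices, not
any particular image coder :

from collections import defaultdict

def context_2d(img, x, y):
    """Model : a context from the 2d neighborhood; here a crude
    quantized gradient of the West and North neighbors."""
    w = img[y][x - 1] if x > 0 else 0
    n = img[y - 1][x] if y > 0 else 0
    return min(abs(w - n) // 32, 7)  # 8 gradient buckets

def scan_order_stats(img):
    """Encoder : per-context counts, updated in raster-scan order,
    so Hn(t) is one-dimensional in time even though Cn is 2d."""
    counts = defaultdict(lambda: [1] * 256)
    for y in range(len(img)):
        for x in range(len(img[0])):
            ctx = context_2d(img, x, y)
            # ... arithmetic-code img[y][x] with counts[ctx] here ...
            counts[ctx][img[y][x]] += 1
    return counts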

When I refer to this sort of mismatch, I mean that in asymptotia
there is some optimal matching : Cx and Hy are matched.  There is
a mismatch if at some finite time t, Cx is best fit by some Hn
with n != y.  That is, the relation of contexts and context groups
flip-flops at some finite time.

LoPresto, Ramchandran, and Orchard (in "Image Coding based on Mixture
Modeling...", the EQ paper from DCC 97) solved this problem in the
most trivial way.  They simply cut the (t) from Hn, and use a
constant probability distribution for each context : Cn chooses a
fixed Hn.  (In their case "n" corresponds to a standard deviation
of a generalized Gaussian.)  This obviously eliminates the
incongruity I mentioned above, but it is also obviously not optimal,
because it entirely surrenders the possibility of coding with
changing statistics.  (Using Hn(t) instead of Hn can compensate for
a poor context-choosing algorithm.)
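
In code terms the EQ-style choice replaces the adaptive table with
a fixed distribution per context.  A sketch of what that
degenerates to; a plain Gaussian shape stands in here for the
paper's generalized-Gaussian fit, and the set of sigmas is
illustrative :

import math

def make_static_tables(sigmas, alphabet=256):
    """One *fixed* probability table per context n, where n indexes
    a quantized standard deviation.  The small floor avoids zero
    probabilities."""
    tables = {}
    for n, sigma in enumerate(sigmas):
        w = [math.exp(-((s - alphabet // 2) ** 2) / (2.0 * sigma * sigma))
             + 1e-12 for s in range(alphabet)]
        z = sum(w)
        tables[n] = [x / z for x in w]
    return tables

def codelen_static(tables, n, ch):
    """Hn has no (t) : the same table is used for the whole image."""
    return -math.log2(tables[n][ch])

# e.g. tables = make_static_tables([2.0, 8.0, 32.0])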


Charles Bloom / cb at my domain
