see also:

George Buynovsky's Description of ACB (the author)

Peter Fenwick's LZP = BlockSort = Shannon


From: (Leonid A. Broukhis)
To: Mark Nelson and Charles Bloom

	Mark and Charles,

I went over the George's paper on weekend and here is what
I made out of it (the paper only describes the basic idea without
implementation details; this is my understanding of it):

	A. Encoding.

Consider a sorted dictionary D of N binary strings
with imaginary "all 0's" as 0th entry and "all 1's" as N+1th entry.
The procedure of encoding a binary stream S with the dictionary is as follows:

1. Find the lexicographic location L of S to be encoded
in the dictionary. L = X: { length of matching part of D[X](*) and S(*) = max 
Encode and transmit L to the output stream.

	There may be two cases:

  D[L]:   XX...X0ZZ.Z0V...          D[Lprev]: XX...X0YY...
     S:   XX...X0ZZ.Z1T...             S:     XX...X1ZZ.Z0T...
D[Lnext]: XX...X1YY...              D[L]:     XX...X1ZZ.Z1V...

2. Now, encode and transmit the difference bit B=1 between S(+) and D[L](+);
let K = the number of (~B) bits in ZZ.Z .

3. Update the dictionary if desired.

4. Make the first non-encoded bit to be the new stream head, repeat.

The decoder gets L and B, transmits the common part of D[L] and D[Lnext]
or D[Lprev] depending on B, transmits ~B, gets K, transmits the ZZ..ZZ part
of D[L] until it is about to transmit the K+1th (~B) bit not including it,
then transmits B.

Pretty close to LZP+Sort so far, but what's new is

	B. Dictionary organization.

At this point we define the way the dictionary is sorted. Each dictionary
entry has two parts: content (post-history in George's terms) and context
(pre-history), and it is sorted _rigth_to_left_ based on context.
E.g.the dictionary built using ...D R D O B B S D R D O B B S ... :

	   |  George's terms:
 Dict.     |"Funnels of analogies"
entry #'s  +---------+----------+
	   | pre-    |  post-   |
	   | history |  history |
- -----------+---------+----------+
       i  ...SDRDOBB | SDRDOBB...
      ii  ...BSDRDOB | BSDRDOB...
     iii  ...OBBSDRD | OBBSDRD...
      iv  ...RDOBBSD | RDOBBSD...
       v  ...BBSDRDO | BBSDRDO...
      vi  ...DOBBSDR | DOBBSDR...
     vii  ...DRDOBBS | DRDOBBS...

Here is what happens if we want to encode ...DR | ROBBSSDP...
where ...DR is the context, and ROBBSSDP... is the content.

	Entry (vi) is the best context match, entry (iv) is the best
content match, therefore L = -2, the first letter 'R' matches.

What will be transmitted is (-2, 'O', 1). Now, the context is
...DRRO, and the content is BBSSDP... Best context match is (v), the same
as the best content match. Difference character is 'S', and there are only
two non-S characters is the string (v) before the difference, therefore
we transmit (0, 'S', 2). Now the context is ....BBSS, content = DP....
Best context match is (vii), best content match is (vii) (we are free to
choose between (vi) and (vii), so we choose the min distance),
what's transmitted is (0, 'P', 0), as the first characters of (vi) and
(vii) match, and there's nothing more to copy.

It's easy to see (I'm not sure about "to prove") that for most real inputs
abs(L) will be quite small.

If a sequence of literals needs to be output, it is prepended by a
"magic dict. reference number".

The depth with which the contexts are sorted controls the speed and the
compression ratio of the algorithm.

|Footnotes:     |
|(*) use context|
|(+) use content|

Last note: the actual ACB algorithm uses right-to-left scanning, so
the contexts are left-to-right, and the contents - right-to-left.
This simplifies sorting of the contexts, and George claims that it
improves compression as well (~0.01 bpb, but very consistently).

	Please tell me what you think,

From: (Leonid A. Broukhis)
Newsgroups: comp.compression.research
Subject: Associative coding by George Buynovsky
Date: 5 May 1996 07:05:29 GMT

This article is based on the ideas described by George Buynovsky
in his 8'94 article "Associative coding" and on the "Shannon" coder.
In the setting of the Shannon's experiment, the test subject was asked
to guess the next symbol in the text, then guess again if the first
guess was wrong, etc. The idea is to guess phrases, not single
symbols. That is, the "predictor" gives several possible multicharacter
guesses, and rank and length of the longest correct one is recorded.
An implementation may use the following organization of
the dictionary:
        each entry in the dictionary consists of the two parts -
the context and the content, where the context is unbounded and
is directed "backwards", and the content is also unbounded and
is directed "forwards". The entries in the dictionary are being
alphabetically sorted by the conteXts. It is best done using
the ring buffer. The example shows the dictionary built on
a string "MISSISSIPPI":
        Context         Content
   1          ...MI SSISSIPPI..
   2          ...PI MISSISSIPPI
   3       ...MISSI SSIPPI...
   4       ...SISSI PPI...
   5           ...M ISSISSIPPI.
   6          ...IP PI...
   7          ...PP I...
   8         ...MIS SISSIPPI...
   9         ...SIS SIPPI...
   10       ...MISS ISSIPPI...
   11       ...SISS IPPI...
The contexts are shown up to the point of uniqueness, and sorted
alphabetically right-to-left (towards the past).
Let's see what
happens when we encode a phrase "MISTER IS SIPPING ..." (spaces
are ignored for clarity). For demonstration purposes assume the
previous character was A. This will place the current context
(...A) before the context 1 in the dictionary. The best matching
conteNt (with the current MIST...) is content 2 with match
length 3. The first non-matching character is E. The output of
the coder will be (+2, 3, E), and the context becomes (...E).
The current content (R...) doesn't match any, therefore a literal
must be output.
Now the current conteXt is (...R), placed between the conteXts
7 and 8, and the best conteNt match is 10 with length 7.
Therefore a triplet (+3, 7, N) will be output, etc.
Several improvements can be made to reduce the set of possible
dictionary offsets and lengths that need to be output by restricting
the set of entries the current content is matched against, e.g.
imposing a conteXt match length requirement (this will eliminate
random matches with weird dictionary offsets), and only counting
characters comprising the _unique_ part of the matched entry
(e.g. in the triplet (+3, 7, N) corresponding to the string "ISSIPPI"
the first 4 characters are not unique, they also occur in entry 5,

Charles Bloom / cb at my domain

Send Me Email

Back to the Index

The free web counter says you are visitor number