February 14, 1999

The basic architecture of the Wavelet Video encoder:

- Wavelet transform all the frames
- Send the first frame wavelet with a coder like ECECOW
- Send the band 0 (the LL band) for all frames. I like to make the LL band
have dimensions 8-15 or so; it takes about 200 bytes. We would do a 3D DPCM on
the LL data, presumably cutting it down to about 50 bytes.
- Start band number at N = 1. (A band is a level of LH/HL/HH data)
- For each image on band N, compute the band to delta against (see later).
Then take the difference with this computed band, and send the deltas with our coder.
- repeat for all bands

The inverse is obvious. The only thing to talk about is the "computed delta band". If we do no motion compensation, the delta band is just the previous band, and this is just a wavelet image coder with deltas between the frames.

BTW I don't consider 3d wavelets very practical because of the collosal memory use, and the difficulty of incorporating really nice motion compensation.

Ok, now for the serious stuff, how to compute the band to delta against. To compute the delta for band N, we do the transform up to L bands (L <= (N-1)), so that we have a 2^L size untransformed image. We can choose L to be larger for better quality and harder computation; the fastest version uses L=0, just using the LL band for computation.

Let **I** be the current 2^L low band. Let **I'** be the previous. Let **H** be the current band-N
square, and let **H'** be the previous.

For each pixel in **H** that we wish to code, we go up to its parent in **I**. For each pixel in **I**,
we look for the closest matching local neighborhood in **I'**. We favor local neighborhood offsets
that are similar to the ones chosen nearby in **I**. Now, this offset from **I** to **I'** is transformed
into an offset from **H** to **H'** , so that for our coding in **H** we now have found a local neighborhood
in **H'**. Now we look at the actual already-coded pixels in **H** and compare them to the pixel in our
target neighborhood in H' , and do pixel shifts of <= +- 4 pixels to improve the fit.

Now that we've chosen a local neighborhood in **H'** that corresponds to our pixel in **H**, we compare how
well the already-coded neighbors in **H** and **H'** match eachother. If the match is better than some
tolerance, we do a delta against the pixel in **H'**.

My tests suggest that you can achieve television quality in realtime, at 20 fps (on a 300+ Mhz Pentium), with 256x256 images, at 1024 bytes per frame. (or about 160 kbits/second). A lower quality version, at about 10 fps and 128x128 should be able to run on a 56k modem!.

To provide seeking, we would want to send a whole image, probably once per second, and then just send the delta frames to fill out the second.

Charles Bloom / cb at my domain

Send Me Email

The free web counter says you are visitor number