A discussion on Audio Coding, January 1999.

From cb at my domain Thu Jan 21 22:26:59 1999
Path: news.giganews.com!nntp.giganews.com!news2.giganews.com.POSTED!not-for-mail
Newsgroups: comp.compression,comp.dsp
Subject: thoughts on audio compression
Followup-to: comp.compression
From: cb at my domain (Charles Bloom)
X-Newsreader: WinVN 0.99.9 (Beta 3) (x86 32bit)
MIME-Version: 1.0
Content-Type: Text/Plain; charset=US-ASCII
Lines: 121
Message-ID: 
Date: Mon, 18 Jan 1999 05:11:54 GMT
NNTP-Posting-Host: 207.207.3.85
X-Trace: news2.giganews.com 916636314 207.207.3.85 (Sun, 17 Jan 1999 23:11:54 CDT)
NNTP-Posting-Date: Sun, 17 Jan 1999 23:11:54 CDT
Xref: nntp.giganews.com comp.compression:18953 comp.dsp:32580


Some thoughts on audio compression, seeking comments:

All the modern coders (MPEG1-Layer3, LPC/CELP/MELP, etc.)
suffer from pretty serious deficiencies.  For one
thing, they don't take advantage of modern statistical
coding techniques.  For another, they don't take advantage
of varying bitrates very well (e.g. internet transmission
could do better with non-fixed rates). Finally, none of
them really take advantage of both space and frequency
localization of energy (e.g. as wavelets allow).

Let us imagine for a moment that we don't care about
convenience factors like real-time decoding, or streaming
or anything like that; we simply want to compress a
hunk of sound as much as possible.

In the end this all leads back to semi-ill-conceived
basics.  We have the problem in sound compression that
we must work in both the space and frequency domain.
Most audio signals are actually created by sinusoids
of varying frequencies convolved with Gaussians (or
something) that give them a limited lifetime.  So,
a coder like (modified) ADPCM can take full advantage
of silent spaces and time-correlation information (like
coding the twang that comes after a guitar-string pluck)
with some Markov model.  On the other hand, if I just
did a big Fourier transform of the whole sound block, I
would be able to take advantage of the human perceptive
model, which we primarily understand in terms of its
frequency response (e.g. I could quantize all the DCT
coefficients with quantizers scaled to the human hearing
thresholds by frequency, which are tabulated in various
places).
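
To make that last idea concrete, here is a minimal sketch (Python;
the threshold curve is the usual Terhardt-style approximation to the
absolute threshold of hearing, and the mapping from threshold to
quantizer step size is just my illustration, not a tuned model):

import numpy as np
from scipy.fftpack import dct, idct

def ath_db(f_khz):
    # Terhardt-style approximation to the absolute threshold of
    # hearing, in dB SPL, for frequency in kHz.
    f = np.maximum(f_khz, 0.02)
    return (3.64 * f**-0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3)**2)
            + 1e-3 * f**4)

def code_block(x, fs=44100):
    # Transform the block, then quantize each coefficient with a
    # step size that grows where the ear is less sensitive.
    X = dct(x, norm='ortho')
    f_khz = np.arange(len(x)) * fs / (2.0 * len(x) * 1000.0)
    step = 10 ** (ath_db(f_khz) / 20.0)    # illustrative mapping
    return np.round(X / step), step        # symbols to entropy-code

def decode_block(symbols, step):
    return idct(symbols * step, norm='ortho')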

So, how do we capture both space and frequency 
information?  The textbook answer is wavelets, but we'll
come back to that later.  The MPEG answer is to cut
the stream into hunks, and do a Fourier on each hunk
(actually they do a 32-band subband filter, but that's
not really an important difference).  We can then do
all the frequency-related perceptive masking and
thresholding within each block.  We can also use
statistical correlation across blocks (MPEG doesn't
do this well, but it could and should).  This all
seems well and good; the only problem is in the
fundamental structure of blocking.  At low bitrates,
we get the same problems as JPEG: noise at block
boundaries.  Even aside from these, we kill our
sample.  Imagine we take a sitar and strum it at
time zero.  It makes a sound that dies off in less
than a second.  We strum it again at time one, and
then time two, strumming again once per second.
Now, if our Fourier-block size was equal to one
second at our sampling rate, then we would do the
transform once, and our cross-block correlation
coder would crush the stream down to nothing.
On the other hand, if the MPEG block was 1009
milliseconds (that's a prime), then we would get
a different Fourier transform every time, and
our stream would suddenly become huge.
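
A quick numerical check of that argument (Python; the 'strum' is a
made-up decaying sinusoid standing in for the sitar):

import numpy as np

fs = 1000                                   # toy sampling rate
t = np.arange(fs) / fs
strum = np.exp(-t / 0.1) * np.sin(2 * np.pi * 200 * t)
signal = np.tile(strum, 10)                 # one strum per second

def block_spectra(x, blk):
    nb = len(x) // blk
    return np.abs(np.fft.rfft(x[:nb * blk].reshape(nb, blk), axis=1))

aligned = block_spectra(signal, fs)         # block = repetition period
print(np.allclose(aligned[0], aligned[1]))  # True: blocks all identical
skewed = block_spectra(signal, 1009)        # coprime ('prime') block
print(np.allclose(skewed[0], skewed[1]))    # False: every block differs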

In fact, we can create a prescription for making
a very low entropy sound sample which is compressed
to a size well above its entropy by the standard codecs.
Take a sound sample of an instrument.  Create a
new sample by repeating this sample at random
intervals, with a random volume and random 
sampling rate.
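
In code, the prescription is only a few lines (Python; the parameter
ranges are arbitrary):

import numpy as np

def pathological(sample, n_repeats=20, rng=None):
    # The output is fully specified by `sample` plus three short
    # random lists, yet a block-based codec sees a new block every
    # frame.
    rng = rng or np.random.default_rng(0)
    out = np.zeros(2 * n_repeats * len(sample))
    for _ in range(n_repeats):
        start = rng.integers(0, len(out) - 2 * len(sample))
        gain = rng.uniform(0.2, 1.0)             # random volume
        rate = rng.uniform(0.8, 1.25)            # random sampling rate
        n = int(len(sample) / rate)
        warped = np.interp(np.arange(n) * rate,  # crude resampling
                           np.arange(len(sample)), sample)
        out[start:start + n] += gain * warped
    return out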

This may seem like an artifice, but when I say
"hello" twenty times, I create essentially the
same stream (with a little extra randomness),
so this is not an unrealistic case.  (of course,
this type of stream is also very important for
music, but music is a superposition of these
streams, each generated by different instruments,
and our ability to resolve a sound file into its
component instruments is another step still).

So, Fourier or subband on frames seems unacceptable.
Codebook and prediction methods like LPC/CELP/MELP
could be conditioned to allocate more codebook space
proportional to the accuracy of the human ear in
that region, but it's a far more difficult task
than using the information in frequency space.

Thus, we come back to the question of wavelets.
We always hear that wavelets give us time and
frequency localization of energy.  Thus, they seem
the natural answer.  The problem is the 'frequency';
(little quotes will henceforth be used for inaccuracies)
there's plenty of research on how the human ear
responds to near-sinusoidal stimulus, but wavelet
bases are not sinusoids.  Furthermore, you only
get something like power-of-two 'frequencies' :
like 8192 Hz, 4096 Hz, 2048, 1024 ..., but you
don't have as much control over how you send
these; if you crush the '128 Hz' wavelet, you also
damage all higher frequency components.
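
To see the dyadic structure I mean, here is a toy Haar analysis
(Python; the input length is assumed divisible by 2^levels):

import numpy as np

def haar_dwt(x, levels):
    # Each pass splits the remaining low band in half, so the detail
    # bands sit at roughly fs/4, fs/8, ... -- the power-of-two
    # 'frequencies' -- and a coarse coefficient spans (and, if
    # crushed, disturbs) a long stretch of time.
    bands, low = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = low[0::2], low[1::2]
        bands.append((even - odd) / np.sqrt(2))  # detail band
        low = (even + odd) / np.sqrt(2)          # remaining low band
    bands.append(low)                            # final approximation
    return bands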

In addition, the wavelets trivially fail the sample
dilation test.  Only if the sample's rate is dilated to
exactly an integer power of two of the original will
the wavelets not be badly affected (the information
simply moves from one level to another).  If the wavelet
bases are compact (e.g. Haar) then they are never
badly affected by dilations; however, in that case,
they correspond poorly to Fourier bases, and they
essentially become a progressive version of ADPCM.

Well, the situation seems pretty bleak.  If these
considerations have been in error, let me know.

----------------------------------
Charles Bloom     cb at my domain
http://www.cbloom.com/~cbloom/

I'm capable of wondering if I am
intelligent life, therefore I am.



From cb at my domain Thu Jan 21 22:27:07 1999
Path: news.giganews.com!nntp.giganews.com!worldfeed.news.gte.net!newshub.northeast.verio.net!btnet-peer!btnet!news-lond.gip.net!newsfeed.uk.ibm.net!news.ibm.net.il!ibm.net!news.biu.ac.il!news.tau.ac.il!not-for-mail
From: "Dr. Noam Amir" 
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
Date: Mon, 18 Jan 1999 10:43:47 +0200
Organization: Centre for Technological Education Holon
Lines: 107
Message-ID: <36A2F443.B495BFAF@wine.cteh.ac.il>
References: 
Reply-To: noamoto@wine.cteh.ac.il
NNTP-Posting-Host: pc169.cteh.ac.il
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 4.5 [en] (Win95; I)
X-Accept-Language: en
Xref: nntp.giganews.com comp.compression:18957


hi,
   as far as i understand, some of your points are correct, though
there's no reason to be so pessimistic about it. after all, mpeg does
quite a good job of compression, considering you can put enormous
amounts of music on a CD with it. so what's bleak about that?

your remark on variable bit rate is very true. radio transmission
requires more or less fixed bit rate, though CDMA cellular is a good
counter example. but for packet switched networks i don't see much of a
reason to stick to fixed bit rate. or for storage, for that matter. 

as for squeezing all the redundancy out of the signal: the more you know
about the signal, the more you can compress it. that's why speech
coders, based on the physical model of speech production - all the LPC
variants - can compress speech very well. compressing music, on the
other hand, you can make very few assumptions about the signal. once
it's you saying hello 20 times, once it's an orchestra. if you could
somehow isolate every instrument in the orchestra and send a few
parameters such as "violin - c# - forte - 1 sec." you could compress
much more. right now what people are using for audio compression is just
statistical redundancy and perceptual masking. i suppose that as
computing power grows, it will be possible to look for more and more
complex forms of redundancy and compress better.

it's quite true that the best compression must be based on better
analysis of the signal - not using fixed size chunks, but chunks the
size of the natural time constants. for instance, you can see this in
Waveform Interpolation, which is a method of coding speech. the analysis
window is synchronized with the pitch period, which makes a lot of
sense.

in any case - i wouldn't get too pessimistic. on the contrary - it seems
that there's a lot of work left to do, which is good, since that way one
can hopefully publish many papers and get tenure.

good luck,
      noam.
Charles Bloom wrote:
> 
> Some thoughts on audio compression, seeking comments:
> 
> All the modern coders (MPEG1-Layer3, LPC/CELP/MELP, etc.)
> suffer from pretty serious deficiencies.  For one
> thing, they don't take advantage of modern statistical
> coding techniques.  For another, they don't take advantage
> of varying bitrates very well (e.g. internet transmission
> could do better with non-fixed rates). Finally, none of
> them really take advantage of both space and frequency
> localization of energy (e.g. as wavelets allow).

> 
> In the end this all leads back to semi-ill-conceived
> basics.  We have the problem in sound compression that
> we must work in both the space and frequency domain.
> Most audio signals are actually created by sinusoids
> of varying frequencies convolved with Gaussians (or
> something) that give them a limited lifetime.  So,
> a coder like (modified) ADPCM can take full advantage
> of silent spaces and time-correlation information (like
> coding the twang that comes after a guitar-string pluck)
> with some Markov model.  On the other hand, if I just
> did a big Fourier transform of the whole sound block, I
> would be able to take advantage of the human perceptive
> model, which we primarily understand in terms of its
> frequency response (e.g. I could quantize all the DCT
> coefficients with quantizers scaled to the human hearing
> thresholds by frequency, which are tabulated in various
> places).
> [...]
> Well, the situation seems pretty bleak.  If these
> considerations have been in error, let me know.
> 
> ----------------------------------
> Charles Bloom     cb at my domain
> http://www.cbloom.com/~cbloom/
> 
> I'm capable of wondering if I am
> intelligent life, therefore I am.

-- 
*************************************************************************
Dr. Noam Amir            *    Dept. of Communications Engineering       *
noamoto@wine.cteh.ac.il  *    Center for Technological Education Holon  *
                         *    52 Golomb st., P.O.B. 305                 *
voice: 972-3-5026689     *    Holon 58102                               *
Phax:  972-3-5026643     *    ISRAEL                                    *
*************************************************************************
                  *                       *
                  *     SPACE FOR RENT    *
                  *                       *
                  *************************

My short URL:
http://www.cteh.ac.il/staff/noama


From cb at my domain Thu Jan 21 22:27:13 1999
Path: news.giganews.com!nntp.giganews.com!news1.giganews.com.POSTED!not-for-mail
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
From: cb at my domain (Charles Bloom)
X-Newsreader: WinVN 0.99.9 (Beta 3) (x86 32bit)
References:  <36A2F443.B495BFAF@wine.cteh.ac.il>
MIME-Version: 1.0
Content-Type: Text/Plain; charset=US-ASCII
Lines: 14
Message-ID: <51Jo2.13208$bf6.2538@news1.giganews.com>
Date: Mon, 18 Jan 1999 16:09:37 GMT
NNTP-Posting-Host: 204.1.4.155
X-Trace: news1.giganews.com 916675777 204.1.4.155 (Mon, 18 Jan 1999 10:09:37 CDT)
NNTP-Posting-Date: Mon, 18 Jan 1999 10:09:37 CDT
Xref: nntp.giganews.com comp.compression:18964


In article <36A2F443.B495BFAF@wine.cteh.ac.il>, noamoto@wine.cteh.ac.il says...
>it's quite true that the best compression must be based on better
>analysis of the signal - not using fixed size chunks, but chunks the
>size of the natural time constants. for instance, you can see this in
>Waveform Interpolation, which is a method of coding speech. the analysis
>window is synchronized with the pitch period, which makes a lot of
>sense.

Can you give me a reference or starting point on this?

My pessimism is mostly because these tasks are quite daunting
(especially for real-time applications!)



From cb at my domain Thu Jan 21 22:27:23 1999
Path: news.giganews.com!nntp.giganews.com!news.idt.net!news-nyc.telia.net!masternews.telia.net!newsfeed1.swip.net!swipnet!newsfeed1.uni2.dk!sunsite.auc.dk!not-for-mail
Newsgroups: comp.compression
From: Jan Oestergaard 
Subject: Re: thoughts on audio compression
In-Reply-To: 
Message-ID: 
References: 
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Lines: 176
Date: Mon, 18 Jan 1999 09:17:54 GMT
NNTP-Posting-Host: 130.225.51.56
X-Complaints-To: news@sunsite.auc.dk
X-Trace: sunsite.auc.dk 916651074 130.225.51.56 (Mon, 18 Jan 1999 10:17:54 MET DST)
NNTP-Posting-Date: Mon, 18 Jan 1999 10:17:54 MET DST
Organization: SunSITE Denmark (sunsite.auc.dk)
Xref: nntp.giganews.com comp.compression:18959

Hello Charles,

I find your thoughts very interesting and would like to comment on a few
things about the wavelets (see further below).


> Some thoughts on audio compression, seeking comments:
> 
> All the modern coders (MPEG1-Layer3, LPC/CELP/MELP, etc.)
> suffer from pretty serious deficiencies.  For one
> thing, they don't take advantage of modern statistical
> coding techniques.  For another, they don't take advantage
> of varying bitrates very well (e.g. internet transmission
> could do better with non-fixed rates). Finally, none of
> them really take advantage of both space and frequency
> localization of energy (e.g. as wavelets allow).

Yes I agree that varying bit-rates should be considered more often,
especially for IT purposes.
 
> Let us imagine for a moment that we don't care about
> convenience factors like real-time decoding, or streaming
> or anything like that; we simply want to compress a
> hunk of sound as much as possible.
> 
> In the end this all leads back to semi-ill-conceived
> basics.  We have the problem in sound compression that
> we must work in both the space and frequency domain.
> Most audio signals are actually created by sinusoids
> of varying frequencies convolved with Gaussians (or
> something) that give them a limited lifetime.  So,
> a coder like (modified) ADPCM can take full advantage
> of silent spaces and time-correlation information (like
> coding the twang that comes after a guitar-string pluck)
> with some Markov model.  On the other hand, if I just
> did a big Fourier transform of the whole sound block, I
> would be able to take advantage of the human perceptive
> model, which we primarily understand in terms of its
> frequency response (e.g. I could quantize all the DCT
> coefficients with quantizers scaled to the human hearing
> thresholds by frequency, which are tabulated in various
> places).
> 
> So, how do we capture both space and frequency 
> information?  The textbook answer is wavelets, but we'll
> come back to that later.  The MPEG answer is to cut
> the stream into hunks, and do a Fourier on each hunk
> (actually they do a 32-band subband filter, but that's
> not really an important difference).  We can then do
> all the frequency-related perceptive masking and
> thresholding within each block.  We can also use
> statistical correlation across blocks (MPEG doesn't
> do this well, but it could and should).  This all
> seems well and good; the only problem is in the
> fundamental structure of blocking.  At low bitrates,
> we get the same problems as JPEG: noise at block
> boundaries.  Even aside from these, we kill our
> sample.  Imagine we take a sitar and strum it at
> time zero.  It makes a sound that dies off in less
> than a second.  We strum it again at time one, and
> then time two, strumming again once per second.
> Now, if our Fourier-block size was equal to one
> second at our sampling rate, then we would do the
> transform once, and our cross-block correlation
> coder would crush the stream down to nothing.
> On the other hand, if the MPEG block was 1009
> milliseconds (that's a prime), then we would get
> a different Fourier transform every time, and
> our stream would suddenly become huge.
> 
> In fact, we can create a prescription for making
> a very low entropy sound sample which is compressed
> to a size well above its entropy by the standard codecs.
> Take a sound sample of an instrument.  Create a
> new sample by repeating this sample at random
> intervals, with a random volume and random 
> sampling rate.
> 
> This may seem like an artifice, but when I say
> "hello" twenty times, I create essentially the
> same stream (with a little extra randomness),
> so this is not an unrealistic case.  (of course,
> this type of stream is also very important for
> music, but music is a superposition of these
> streams, each generated by different instruments,
> and our ability to resolve a sound file into its
> component instruments is another step still).
> 
> So, Fourier or subband on frames seems unacceptable.
> Codebook and prediction methods like LPC/CELP/MELP
> could be conditioned to allocate more codebook space
> proportional to the accuracy of the human ear in
> that region, but it's a far more difficult task
> than using the information in frequency space.
> 
> Thus, we come back to the question of wavelets.
> We always hear that wavelets give us time and
> frequency localization of energy.  Thus, they seem
> the natural answer.  The problem is the 'frequency';
> (little quotes will henceforth be used for inaccuracies)
> there's plenty of research on how the human ear
> responds to near-sinusoidal stimulus, but wavelet
> bases are not sinusoids.

There have been articles concerning this. One article describes how to
transform your wavelet output coefficients into Fourier
transform coefficients, which gives you the ability to adapt the
phenomenon of frequency masking to your wavelet compression scheme; see
[1]. Another interesting article tries to describe the behavior of the
cochlea using wavelets, i.e. they try to let the wavelet transform act
like the cochlea does; see [2].

>  Furthermore, you only
> get something like power-of-two 'frequencies' :
> like 8192 Hz, 4096 Hz, 2048, 1024 ..., but you
> don't have as much control over how you send
> these; if you crush the '128 Hz' wavelet, you also
> damage all higher frequency components.

Some claim that oversampled (redundant) wavelet
transforms lead to higher compression rates than critically/dyadically
sampled transforms. You then have the possibility of arbitrarily choosing
the needed scaling (1/frequency). If wanted, I can give you some citations,
but I am not home right now, and do not remember the exact titles.

> 
> In addition, the wavelets trivially fail the sample
> dilation test.  Only if the sample's rate is dilated to
> exactly an integer power of two of the original will
> the wavelets not be badly affected (the information
> simply moves from one level to another).  If the wavelet
> bases are compact (e.g. Haar) then they are never
> badly affected by dilations; however, in that case,
> they correspond poorly to Fourier bases, and they
> essentially become a progressive version of ADPCM.


Coifman and Meyer thought about how to make the wavelets more 'natural',
and they came up with the Malvar wavelets. They said that music often
begins with an attack, lasts a period of time and then slowly decays. So
they created a 'wavelet' which consists of a smooth envelope containing an
oscillating sine or cosine. The envelope has a sharp (but smooth)
transition from zero to one in the beginning, then lasts for an
arbitrary period (they are stretched when adapted to the signal), and
then has a slow decay at the end.
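
As a rough sketch of that shape (Python; the envelope pieces and all
the parameters are my own guesses, not taken from the papers):

import numpy as np

def malvar_atom(rise, sustain, decay, freq, fs=8000):
    # A smooth attack, a stretched middle, a slow decay -- an
    # envelope multiplied by an oscillating cosine.
    n_r, n_s, n_d = int(rise * fs), int(sustain * fs), int(decay * fs)
    env = np.concatenate([
        np.sin(0.5 * np.pi * np.linspace(0, 1, n_r)) ** 2,  # attack
        np.ones(n_s),                                       # sustain
        np.cos(0.5 * np.pi * np.linspace(0, 1, n_d)) ** 2,  # decay
    ])
    t = np.arange(len(env)) / fs
    return env * np.cos(2 * np.pi * freq * t)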

> 
> Well, the situation seems pretty bleak.  If these
> considerations have been in error, let me know.
> 


I know that I haven't given any "useful" comments on your thoughts,
but I am interested in transparent compression of wide-band speech
signals using the DWT, and just wanted to let you know that several of the
aspects you touch on in your post are being widely explored around the world.
 

Citations:
----------
[1] Deepen Sinha and Ahmed H. Tewfik, "Low bit rate transparent audio
compression using adapted wavelets", IEEE Trans. Signal Processing,
Vol. 41, No. 12, Dec. 1993.

[2] Toshio Irino and Hideki Kawahara, "Signal reconstruction from modified
auditory wavelet transform", IEEE Trans. Signal Processing, Vol. 41, No. 12,
Dec. 1993.


 -Jan
=================================================================
           Aalborg University           |     Jan Oestergaard
    Institute for Electronic Systems    |    janoe@kom.auc.dk
Department for Communication Technology | Frb. 7, A4-101 Gr. 926
=================================================================




From cb at my domain Thu Jan 21 22:27:33 1999
Path: news.giganews.com!nntp.giganews.com!news2.giganews.com.POSTED!not-for-mail
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
From: cb at my domain (Charles Bloom)
X-Newsreader: WinVN 0.99.9 (Beta 3) (x86 32bit)
References:  
MIME-Version: 1.0
Content-Type: Text/Plain; charset=US-ASCII
Lines: 27
Message-ID: 
Date: Mon, 18 Jan 1999 16:54:48 GMT
NNTP-Posting-Host: 204.1.4.155
X-Trace: news2.giganews.com 916678488 204.1.4.155 (Mon, 18 Jan 1999 10:54:48 CDT)
NNTP-Posting-Date: Mon, 18 Jan 1999 10:54:48 CDT
Xref: nntp.giganews.com comp.compression:18966


In article , 
janoe@kom.auc.dk says...
>
>There have been articles concerning this. One article describes how to
>transform your wavelet output coefficients into Fourier ...

Very interesting, thanks for the references!

>Some claim that oversampled (redundant) wavelet
>transforms lead to higher compression rates than critically/dyadically
>sampled transforms. You then have the possibility of arbitrarily choosing
>the needed scaling (1/frequency). If wanted, I can give you some citations,
>but I am not home right now, and do not remember the exact titles.

That's a little odd.  (I always think about wavelets in terms of
"lifting", in which scheme anything but dyadic wavelets seem very
unnatural).  On the other hand, wavelet-packets are an over-complete
basis, and of course do better than wavelets on most data...
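
(For concreteness, the Haar case via lifting -- predict the odd
samples from the even ones, update the evens from the new details,
and recurse on the half-length result; the scheme is dyadic by
construction:)

import numpy as np

def haar_lift(x):                  # x: integer array, even length
    even, odd = x[0::2].copy(), x[1::2].copy()
    d = odd - even                 # predict step: detail
    s = even + d // 2              # update step: integer average
    return s, d                    # recurse on s for more levels

def haar_unlift(s, d):             # exactly inverts haar_lift
    even = s - d // 2
    odd = d + even
    out = np.empty(2 * len(s), dtype=s.dtype)
    out[0::2], out[1::2] = even, odd
    return out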

>Coifman and Meyer thought about how to make the wavelets more 'natural',
>and they came up with the Malvar wavelets. ....

I imagine I can find information on these in standard wavelet reference
tomes (?).  References to both of the last two topics (Malvar wavelets
and over-sampled wavelets) would be appreciated!



From cb at my domain Thu Jan 21 22:27:42 1999
Path: news1.giganews.com!nntp.giganews.com!cyclone.news.idirect.com!island.idirect.com!news-peer.gip.net!news.gsl.net!gip.net!howland.erols.net!news.net.uni-c.dk!sunsite.auc.dk!not-for-mail
Newsgroups: comp.compression
From: Jan Oestergaard 
Subject: Re: thoughts on audio compression
In-Reply-To: 
Message-ID: 
References:   
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Lines: 73
Date: Tue, 19 Jan 1999 17:44:00 GMT
NNTP-Posting-Host: 130.225.51.56
X-Complaints-To: news@sunsite.auc.dk
X-Trace: sunsite.auc.dk 916767840 130.225.51.56 (Tue, 19 Jan 1999 18:44:00 MET DST)
NNTP-Posting-Date: Tue, 19 Jan 1999 18:44:00 MET DST
Organization: SunSITE Denmark (sunsite.auc.dk)
Xref: nntp.giganews.com comp.compression:18982

On Mon, 18 Jan 1999, Charles Bloom wrote:

> 
> In article , 
> janoe@kom.auc.dk says...
> >
> >There have been articles concerning this. One article describes how to
> >transform your wavelet output coefficients into Fourier ...
> 
> Very interesting, thanks for the references!
> 
> >Some claim that oversampled (redundant) wavelet
> >transforms lead to higher compression rates than critically/dyadically
> >sampled transforms. You then have the possibility of arbitrarily choosing
> >the needed scaling (1/frequency). If wanted, I can give you some citations,
> >but I am not home right now, and do not remember the exact titles.
> 
> That's a little odd.  (I always think about wavelets in terms of
> "lifting", in which scheme anything but dyadic wavelets seem very
> unnatural).  On the other hand, wavelet-packets are an over-complete
> basis, and of course do better than wavelets on most data...
> 

Actually the CWT and oversampled wavelet transforms are often used when
analysing signals, since they "visually" give more information than
traditional dyadic wavelets. You might miss certain transients and the
like, because they appear somehow different than expected due to the
downsampling in the DWT. And arbitrary scaling might emphasize
information "hidden" between dyadic scales.

Look in "Computational Signal Processing with  Wavelets", by Anthony
Teolis, Birkhauser 1998. That is where I remember reading about the over
sampled wavelet transforms used for compression. But for analysing signals
there is various articles describing good results obtained by redundant
wavelet transforms. 

btw: When using wavelet-packets you often choose an orthogonal basis
representation, whereas with the redundant wavelet transforms you often
keep some kind of redundancy in the coefficients. If it is compression,
you want to minimize the redundancy, but then again, since redundancy is
more robust to quantization noise and channel distortion, you still might
want to keep a bit of redundancy.

> >Coifman and Meyer thought about how to make the wavelets more 'natural',
> >and they came up with the Malvar wavelets. ....
> 
> I imagine I can find information on these in standard wavelet reference
> tomes (?).  References to both of the last two topics (Malvar wavelets
> and over-sampled wavelets) would be appreciated!
> 

Unfortunately it has been a long time since I read about the Malvar
wavelets, but I remember having read about them in "The World According
to Wavelets: The Story of a Mathematical Technique in the Making", written
by a journalist 8-) called Barbara Burke Hubbard and published by A K
Peters Ltd., 1996. I remember it was fun reading that book too, even
though it is not a scientific educational book and hence sometimes
explains very simple things.

If I remember which articles contain details about the Malvar wavelets I
will let you know, but at the moment I do not.

 sincerely,

     Jan

=================================================================
           Aalborg University           |     Jan Oestergaard
    Institute for Electronic Systems    |    janoe@kom.auc.dk
Department for Communication Technology | Frb. 7, A4-101 Gr. 926
=================================================================




From cb at my domain Thu Jan 21 22:27:48 1999
Path: news1.giganews.com!nntp.giganews.com!news2.giganews.com.POSTED!not-for-mail
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
From: cb at my domain (Charles Bloom)
X-Newsreader: WinVN 0.99.9 (Beta 3) (x86 32bit)
References:    
MIME-Version: 1.0
Content-Type: Text/Plain; charset=US-ASCII
Lines: 24
Message-ID: 
Date: Wed, 20 Jan 1999 05:01:09 GMT
NNTP-Posting-Host: 207.207.3.160
X-Trace: news2.giganews.com 916808469 207.207.3.160 (Tue, 19 Jan 1999 23:01:09 CDT)
NNTP-Posting-Date: Tue, 19 Jan 1999 23:01:09 CDT
Xref: nntp.giganews.com comp.compression:18997


In article , 
janoe@kom.auc.dk says...
>
>Unfortunately it has been a long time since I read about the Malvar
>wavelets, ....
>

A web search turned up very little on the Malvar transform, but I
got enough to see that it was closely related (perhaps synonymous?)
to the Adaptive Local Trigonometric Transform (ALTT) and the
Modulated Lapped Transform (MLT).  They seem to be just the product
of a function with compact support on an interval (like the
step function, though that's not a great choice) with a trig function.
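
Something like the following, I gather (Python; a direct O(N^2) MDCT
with the common sine window -- blocks overlap by 50% and the window
satisfies the Princen-Bradley condition, so overlap-add
reconstruction is exact despite the blocking):

import numpy as np

def mdct(block):
    # One block of length 2N in, N coefficients out.
    N = len(block) // 2
    n = np.arange(2 * N)
    w = np.sin(np.pi * (n + 0.5) / (2 * N))   # sine window
    k = np.arange(N)[:, None]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ (w * block)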

There are some details that I could work out, but I'd like a cue
that I'm on the right track.

----------------------------------
Charles Bloom     cb at my domain
http://www.cbloom.com/~cbloom/
I'm capable of wondering if I am
intelligent life, therefore I am.



From cb at my domain Thu Jan 21 22:27:54 1999
Path: news1.giganews.com!nntp.giganews.com!news.maxwell.syr.edu!newsfeed.ecrc.net!news.net.uni-c.dk!sunsite.auc.dk!not-for-mail
Newsgroups: comp.compression
From: Jan Oestergaard 
Subject: Re: thoughts on audio compression
In-Reply-To: 
Message-ID: 
References:     
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Lines: 30
Date: Wed, 20 Jan 1999 08:12:51 GMT
NNTP-Posting-Host: 130.225.51.56
X-Complaints-To: news@sunsite.auc.dk
X-Trace: sunsite.auc.dk 916819971 130.225.51.56 (Wed, 20 Jan 1999 09:12:51 MET DST)
NNTP-Posting-Date: Wed, 20 Jan 1999 09:12:51 MET DST
Organization: SunSITE Denmark (sunsite.auc.dk)
Xref: nntp.giganews.com comp.compression:18999

On Wed, 20 Jan 1999, Charles Bloom wrote:

> 
> janoe@kom.auc.dk says...
> >
> >Unfortunately it has been a long time since I read about the Malvar
> >wavelets, ....
> >
> 
> A web search turned up very little on the Malvar transform, but I
> got enough to see that it was closely related (perhaps synonymous?)
> to the Adaptive Local Trigonometric Transform (ALTT) and the
> Modulated Lapped Transform (MLT).  They seem to be just the product
> of a function with compact support on an interval (like the
> step function, though that's not a great choice) with a trig function.

Yes, they are very related, and maybe an extension. You can read a lot
about these Lapped Orthogonal Transforms in "Adapted Wavelet Analysis
from Theory to Software", written by Mladen Victor Wickerhauser, published
by Wellesley, 1994. (Though he does not mention the name "Malvar wavelets".)
 
 -Jan
=================================================================
           Aalborg University           |     Jan Oestergaard
    Institute for Electronic Systems    |    janoe@kom.auc.dk
Department for Communication Technology | Frb. 7, A4-101 Gr. 926
=================================================================





From cb at my domain Thu Jan 21 22:28:07 1999
Path: news.giganews.com!nntp.giganews.com!newsfeed.cwix.com!18.181.0.26!bloom-beacon.mit.edu!senator-bedfellow.mit.edu!usenet
From: "Eric Scheirer" 
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
Date: Mon, 18 Jan 1999 10:38:28 -0500
Organization: Massachvsetts Institvte of Technology
Lines: 166
Message-ID: <77vkgc$d23@senator-bedfellow.MIT.EDU>
References: 
NNTP-Posting-Host: ozric.media.mit.edu
X-Newsreader: Microsoft Outlook Express 4.72.3110.5
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
Xref: nntp.giganews.com comp.compression:18965

Charles Bloom wrote:

>Some thoughts on audio compression, seeking comments:

Some comments attached.

>All the modern coders (MPEG1-Layer3, LPC/CELP/MELP, etc.)

These coders are hardly "modern"!  MPEG-1 Layer 3 was completed
in 1992 and there have been two whole generations of MPEG audio
standards since then (AAC and MPEG-4).  The truly modern
coders do solve many of the problems you highlight below --
it's now a matter of dissemination and implementation,
not codec design.

>suffer from pretty serious deficiencies.  For one
>thing, they don't take advantage of modern statistical
>coding techniques.

The noiseless coding stage in modern perceptual coders (AAC)
quickly reaches a point of diminishing returns.  The need
to do careful bit-allocation for the subband data requires
some remaining structure in the bitstream representation,
and thus not *all* redundancy (especially frame-to-frame
redundancy) can be removed.  This is a necessary (?)
implication of the subband/frame-based coding model.

> For another, they don't take advantage
>of varying bitrates very well (e.g. internet transmission
>could do better with non-fixed rates).

As another poster pointed out, most audio coders today
do allow variable-bit-rate operation.  It's a matter of
producing encoders that easily support this, not better
codec formats.  MPEG-4 has fine-grained scalability support
(so you can progressively "unwrap" parts of the coded
signal with minimum effect on the sound quality).

> Finally, none of
>them really take advantage of both space and frequency
>localization of energy (e.g. as wavelets allow).

What you say is partly true, but of course there are limits
to the human ability to perceive space (by which I assume
you mean "time") and frequency localization.  The quality
of our perceptual coders has progressed with our understanding
of the relevant masking principles.  IMHO, at this point,
for perceptual coding advances, it's all about understanding
psychoacoustics, not fancier bases for signal representation.

>Let us imagine for a moment that we don't care about
>convenience factors like real-time decoding, or streaming
>or anything like that, we simply want to compress a
>hunk of sound as much as possible.

Okay, I like these sorts of thought experiments (much to
the chagrin of other MPEG audio people!)...

>Most audio signals are actually created by sinusoids
>of varying frequencies convolved with Gaussians (or
>something) that give them a limited lifetime.

I'm not sure what you mean by this.  *My* audio signals
are created by my playing my trombone into a microphone
and digitizing the resulting electrical potentials with
an ADC.  Fourier and sinusoidal and other mathematical
representations may have nice properties to work with,
but ultimately have limited correspondence to the sound
models in physical reality or the acoustical processing
in the auditory system.

> So,
>a coder like (modified) ADPCM can take full advantage
>of silent spaces and time-correlation information (like
>coding the twang that comes after a guitar-string pluck)
>with some Markov model.

Unfortunately, the models that actually underlie sounds
like guitars are too complex to really be produced by
Markov models of any limited order.  To go in this sort
of direction, you need a more sophisticated notion of
"time correlation" than N-th order statistics.

>On the other hand, if I just
>did a big Fourier transform of the whole sound block, I
>would be able to take advantage of the human perceptive
>model, which we primarily understand in terms of its
>frequency response.

I think it is no longer true that we understand the
human perceptual system in terms of its frequency response.
Frequency response is an LTI concept, and the
perceptual system is known to be non-linear and
time-dependent.

>This all
>seems well and good; the only problem is in the
>fundamental structure of blocking.

[Good description of blocking elided.]

>In fact, we can create a prescription for making
>a very low entropy sound sample which is compressed
>to a size well above its entropy by the standard codecs.
>Take a sound sample of an instrument.  Create a
>new sample by repeating this sample at random
>intervals, with a random volume and random
>sampling rate.

When we get to a sophisticated argument like this, it
is important to be careful of terms like "entropy".
Entropy is a term that is only defined in terms of
an ensemble of signals, or other probabilistic distribution
of events.  In order to use it properly, we must be
sure that we are really speaking of a stochastic
framework, which I don't think you are here.

However, your larger point holds -- this is a kind of
signal redundancy that is not captured by block-based
codecs.  I call this type of redundancy "structural
redundancy" and you need, as you suggest, some other
type of coding technique to deal with it.  I have written about
this in the context of the MPEG-4 Structured Audio coder --
see http://sound.media.mit.edu/mpeg4/sa-tech.html.
(I also have a real manuscript that is currently in the
peer-review process, and I discuss structural redundancy
briefly in my article in the Jan. 1999 Multimedia Systems [1]).

>(of course,
>this type of stream is also very important for
>music, but music is a superposition of these
>streams, each generated by different instruments,
>and our ability to resolve a sound file into its
>component instruments is another step still).

Since we're in the world of thought experiments, I find it
valuable to distinguish decoding from encoding.  It's easy
to imagine a coding format (MPEG-4, for example) capable
of representing superposed sounds and decoding and mixing
them together.  We don't yet have automatic encoding for
these sorts of sound models, but some kind of human-assisted
process is useful.

>Well, the situation seems pretty bleak.  If these
>considerations have been in error, let me know.

There *are* new developments in coding technology and coding
theory, particularly around MPEG-4 Audio.  It's still hard to
find coherent summaries, but that's part of the growing pains
involved with new developments.

Thanks for a thought-provoking posting.

Best,

 -- Eric

REFERENCES

[1] Scheirer ED, Structured audio and effects processing in the
 MPEG-4 multimedia standard.  Multimedia Systems 7:1, pp. 11-22,
 Jan 1999.






From cb at my domain Thu Jan 21 22:28:14 1999
Path: news.giganews.com!nntp.giganews.com!news1.giganews.com.POSTED!not-for-mail
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
From: cb at my domain (Charles Bloom)
X-Newsreader: WinVN 0.99.9 (Beta 3) (x86 32bit)
References:  <77vkgc$d23@senator-bedfellow.MIT.EDU>
MIME-Version: 1.0
Content-Type: Text/Plain; charset=US-ASCII
Lines: 107
Message-ID: 
Date: Mon, 18 Jan 1999 17:07:35 GMT
NNTP-Posting-Host: 204.1.4.155
X-Trace: news1.giganews.com 916679255 204.1.4.155 (Mon, 18 Jan 1999 11:07:35 CDT)
NNTP-Posting-Date: Mon, 18 Jan 1999 11:07:35 CDT
Xref: nntp.giganews.com comp.compression:18967


Thanks for the informative reply!

In article <77vkgc$d23@senator-bedfellow.MIT.EDU>, eds@media.mit.edu says...
> ... modern coders (AAC and MPEG-4). 

Is there an online source for more information on these?
I've been having trouble finding good online information
on sound compression (contrast to image compression, where
many of the Wavelet guys have full online libraries).
I saw that the MIT SA page points to you for reprint requests...

>The need
>to do careful bit-allocation for the subband data requires
>some remaining structure in the bitstream representation,
>and thus not *all* redundancy (especially frame-to-frame
>redundancy can be removed).  This is a necessary (?)
>implication of the subband/frame-based coding model.

This is the biggest remaining problem (IMHO) in modern
wavelet image coders: combining bit allocation with
non-fixed coding schemes.  Your basic choices are to
fix a coding scheme (e.g. for wavelets, separate the
coefficients into classes, fit to a generalized Gaussian,
and send the parameters of the Gaussian for each class), in
which case you can do optimal bit allocation, or else to
use an attractive adaptive coding technique (like
Xiaolin Wu's ECECOW), in which case optimal bit allocation
becomes extremely (!) hard.
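
A sketch of the first option (Python; the grid and constants are
illustrative): fit each class of coefficients by moment matching,
then allocate bits with the classical log-variance rule.

import numpy as np
from scipy.special import gamma

def gg_shape(coeffs):
    # For a generalized gaussian the ratio E|x| / sqrt(E x^2)
    # depends only on the shape parameter; invert it by grid search.
    r = np.mean(np.abs(coeffs)) / np.sqrt(np.mean(coeffs ** 2))
    betas = np.linspace(0.2, 3.0, 1000)
    ratios = gamma(2 / betas) / np.sqrt(gamma(1 / betas) * gamma(3 / betas))
    return betas[np.argmin(np.abs(ratios - r))]

def allocate_bits(variances, total_bits):
    # b_i = B/n + 0.5 * log2(var_i / geometric-mean variance),
    # clipped at zero bits.
    v = np.asarray(variances, dtype=float)
    b = total_bits / len(v) + 0.5 * np.log2(v / np.exp(np.mean(np.log(v))))
    return np.maximum(b, 0.0)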

>MPEG-4 has fine-grained scalability support
>(so you can progressively "unwrap" parts of the coded
>signal with minimum effect on the sound quality).

I'm not sure what this means.  Do you mean the stream
is "embedded" in the sence that I can create a lower-quality
version by hacking away parts of the encoded stream?

>>Most audio signals are actually created by sinusoids
>>of varying frequencies convolved with Gaussians (or
>>something) that give them a limited lifetime.
>
>Fourier and sinusoidal and other mathematical
>representations may have nice properties to work with,
>but ultimately have limited correspondence to the sound
>models in physical reality or the acoustical processing
>in the auditory system.

Well, to some extent my physics background is showing
through (all sounds are just wave-packets of Fourier
density waves!).  In the end, the Fourier transform
of the sound is a fine thing to do; the question is
whether the Fourier coefficients are more understandable
than the original.  For an instrument like a flute or
a recorder (or a tuning fork!), I think you see a sound
which is best modeled with a couple of Fourier components,
a duration, and the errors to supply the "texture".
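
i.e. something like this crude sketch (Python): keep the few
strongest partials and code the residual as the "texture".

import numpy as np

def sines_plus_texture(x, n_partials=3):
    X = np.fft.rfft(x)
    keep = np.argsort(np.abs(X))[-n_partials:]  # strongest components
    S = np.zeros_like(X)
    S[keep] = X[keep]
    tonal = np.fft.irfft(S, n=len(x))
    return tonal, x - tonal                     # (model, "texture")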

>Unfortunately, the models that actually underlie sounds
>like guitars are too complex to really be produced by
>Markov models of any limited order.  To go in this sort
>of direction, you need a more sophisticated notion of
>"time correlation" than N-th order statistics.

True, but there are some wins to be had.  Part of the problem
with blindly applying a Markov model to sounds (it seems to
me) is that many of the redundant shapes in sound are slightly
dilated or amplified, so that a lossless Markov model is rapidly
confused.  (e.g. 0123210246420 is incompressible by a standard
Markov model.)  This is partly alleviated by wavelets; the correlation
between a wavelet coefficient and its parent is a scale-independent
thing, and if we normalize them against each other, then it becomes
amplification-independent as well.
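
A sketch of the normalization (Python, on a coarse-to-fine list of
dyadic DWT bands; the regularizing epsilon is arbitrary):

import numpy as np

def parent_normalized(bands):
    # bands[j] is half the length of bands[j+1] (coarse to fine).
    # Dividing each coefficient by its co-located parent gives a
    # context that is roughly gain- and dilation-independent.
    out = []
    for parent, child in zip(bands[:-1], bands[1:]):
        p = np.repeat(parent, 2)[:len(child)]
        out.append(child / (np.abs(p) + 1e-9))
    return out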

>Frequency response is an LTI concept, and the
>perceptual system is known to be non-linear and
>time-dependent.

What does "LTI" mean?  Other than 'temporal masking',
I'm not familiar with other non-frequency-based
masking.  (I've seen the nice pictures that plot the
masking effect as a bump in time and frequency..)

>When we get to a sophisticated argument like this, it
>is important to be careful of terms like "entropy".
> ... In order to use it properly, we must be
>sure that we are really speaking of a stochastic
>framework, which I don't think you are here.

Yeah, I get caught on this.  I guess I was really talking
about MDL (minimum description length) or Kolmogorov
Complexity.

>...  It's easy
>to imagine a coding format (MPEG-4, for example) capable
>of representing superposed sounds and decoding and mixing
>them together....

BTW just my two cents: I'm pretty fond of the way the
*PEGs describe coding formats, so that the encoders can
become more sophisticated and still be compatible (see, e.g.
optimal quantization tables for JPEG).  Unfortunately, the
internet software community has not been very good about
jumping on these opportunities (is there even a good motion-
compensating encoder (implemented & disseminated) for MPEG-1
yet?)



From cb at my domain Thu Jan 21 22:28:26 1999
Path: news1.giganews.com!nntp.giganews.com!news.maxwell.syr.edu!howland.erols.net!bloom-beacon.mit.edu!senator-bedfellow.mit.edu!usenet
From: "Eric Scheirer" 
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
Date: Tue, 19 Jan 1999 09:11:16 -0500
Organization: Massachvsetts Institvte of Technology
Lines: 121
Message-ID: <7823p4$dq2@senator-bedfellow.MIT.EDU>
References:  <77vkgc$d23@senator-bedfellow.MIT.EDU> 
NNTP-Posting-Host: ozric.media.mit.edu
X-Newsreader: Microsoft Outlook Express 4.72.3110.5
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
Xref: nntp.giganews.com comp.compression:18985


Charles Bloom wrote in message ...

>Is there an online souce for more information on these?
>I've been having trouble finding good online information
>on sound compression (contrast to image compression, where
>many of the Wavelet guys have full online libraries).

Not as much as there should be, I'm afraid.  There are articles
in the technical literature, and some information at the
MPEG Audio home page (US mirror at http://sound.media.mit.edu/mpeg4/audio).
I try to do what I can to evangelize and promote the Structured
Audio part of the standard, but even just that part is a
bit much for one person.

>>MPEG-4 has fine-grained scalability support
>>(so you can progressively "unwrap" parts of the coded
>>signal with minimum effect on the sound quality).
>
>I'm not sure what this means.  Do you mean the stream
>is "embedded" in the sence that I can create a lower-quality
>version by hacking away parts of the encoded stream?

Yes, exactly.  The exact way in which this is possible
depends on the original encoding.  You can spend a little
extra bitrate on the high-quality signal and get small-
step scalability (2 kbps increments); or you can have
a slightly smaller high-quality signal (you save about
12 kbps for equivalent quality) and only have large-step
scalability (16 kbps increments).

There are actually too many different ways you can do this
(you can combine different types of coders to form
the "core" and the "scalability layers") to test them all,
so in a formal sense, MPEG doesn't really know which way
works best in which circumstance.  This continues to highlight
the point that decoding is much easier than encoding.


>Well, to some extent my physics background is showing
>through (all sounds are just wave-packets of Fourier
>density waves!).  In the end, the Fourier transform
>of the sound is a fine thing to do; the question is
>whether the Fourier coefficients are more understandable
>than the original.  For an instrument like a flute or
>a recorder (or a tuning fork!), I think you see a sound
>which is best modeled with a couple of Fourier components,
>a duration, and the errors to supply the "texture".

You've picked a special example, though -- a flute *is*
well-modeled with a sinusoid+noise representation.  But
a guitar isn't -- it's easier to model with a digital
waveguide, since the waveguide is closer to the physical
sound-producing mechanism.  In my opinion, it's not
the sound-in-air representation that is crucial to model
(since as you say, all sounds behave the same in air),
but the original sound-*generating* mechanism.

CELP coders work so well for speech because they are an
approximation to the way the human vocal tract makes
sound -- a quasi-periodic excitation followed by a
subtractive shaping filter.
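
(A toy source-filter sketch of that intuition -- Python, not an
actual CELP codec, and the all-pole coefficients are arbitrary:)

import numpy as np
from scipy.signal import lfilter

def toy_speech(pitch_hz=110, fs=8000, dur=0.5):
    n = int(fs * dur)
    excitation = np.zeros(n)
    excitation[::fs // pitch_hz] = 1.0      # quasi-periodic pulses
    lpc = [1.0, -1.2, 0.9]                  # arbitrary stable all-pole
    return lfilter([1.0], lpc, excitation)  # subtractive shaping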

>>Frequency response is an LTI concept, and the
>>perceptual system is known to be non-linear and
>time-dependent.
>
>What does "LTI" mean?  Other than 'temporal masking',
>I'm not familiar with other non-frequency-based
>masking.  (I've seen the nice pictures that plot the
>masking effect as a bump in time and frequency..)

"LTI" means "linear and time-invariant".  It's not only
the time-dependence, but the differing response at different
sound levels that is important.  To get a good psychoacoustic
model for coding requires understanding this behavior
in detail, and it's really only barely understood by
psychoacousticians at this point -- thus a lot of the
best coding people are also psychoacousticians themselves.

>Yeah, I get caught on this.  I guess I was really talking
>about MDL (minimum description length) or Kolmogorov
>Complexity.

I've just finished a manuscript discussing structured audio
and perceptual audio coding in connection with algorithmic
description, by connecting these ideas to Kolmogorov
complexity and other ideas from the theory of computing.
The manuscript is currently in review, but I'm happy
to send drafts to anyone interested who contacts me
via email.

>BTW just my two cents : I'm pretty fond of the way the
>*PEGs describe coding formats, so that the encoders can
>become more sophisticated and still be compatible (see, eg.
>optimal quantization tables for JPEG).  Unfortunately, the
>internet software community has not been very good about
>jumping on these opportunities (is there even a good motion-
>compensating encoder (implemented & disseminated) for MPEG-1
>yet?)

It's the difference between encoding and decoding again --
it's much easier to build decoders than encoders.  In almost
any coding domain, it takes Ph.D.-level knowledge (or more)
to build the new advances into encoders.  The intersection
of Ph.D.-level coding researchers with the free-software
community is vanishingly small, so it takes a long time.

As you say, even the advances that are no longer new take
a long time to get disseminated into the Internet world.
All of the free MP3 encoders still use (as far as I'm aware)
the psychoacoustic model "borrowed" from the Fraunhofer
encoder.

Best,

 -- Eric






From cb at my domain Thu Jan 21 22:28:33 1999
Path: news1.giganews.com!nntp.giganews.com!news2.giganews.com.POSTED!not-for-mail
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
From: cb at my domain (Charles Bloom)
X-Newsreader: WinVN 0.99.9 (Beta 3) (x86 32bit)
References:  <77vkgc$d23@senator-bedfellow.MIT.EDU>  <7823p4$dq2@senator-bedfellow.MIT.EDU>
MIME-Version: 1.0
Content-Type: Text/Plain; charset=US-ASCII
Lines: 45
Message-ID: 
Date: Wed, 20 Jan 1999 06:13:45 GMT
NNTP-Posting-Host: 207.207.3.95
X-Trace: news2.giganews.com 916812825 207.207.3.95 (Wed, 20 Jan 1999 00:13:45 CDT)
NNTP-Posting-Date: Wed, 20 Jan 1999 00:13:45 CDT
Xref: nntp.giganews.com comp.compression:18998

In article <7823p4$dq2@senator-bedfellow.MIT.EDU>, eds@media.mit.edu says...
>
>You've picked a special example, though -- a flute *is*
>well-modeled with a sinusoid+noise representation.  But
>a guitar isn't -- it's easier to model with a digital
>waveguide, since the waveguide is closer to the physical
>sound-producing mechanism.

Yeah, that's sort of what I was trying to get at - in
some cases the near-Fourier is a good model, in which
case you can use the frequency-based psychoacoustic models.
Some sort of adaptive transform might be in order, but
the best adaptive transform would actually be a model of
sound-producing phenomena..

I guess this is part of the idea behind MP4, that the
various samples can be modeled in all these different ways...

Of course, we'd prefer to not have to explicitly send this
model to the decoder, so we've just moved the compression
from packing the data to packing the model of the data.

> To get a good psychoacoustic
>model for coding requires understanding this behavior
>in detail,

We have the additional problem that even when we have good
models, computing an MSPE (mean squared perceptual error)
is prohibitively complex for use in optimizations.

>It's the difference between encoding and decoding again --
>it's much easier to build decoders than encoders. 

Unless I'm mistaken, this is going to get even more
pronounced with MP4, which essentially lets you create
arbitrary bitstreams.  (though the MP4 decoder is going
to be by far the most complex we've seen yet, unless I'm 
mistaken!)

----------------------------------
Charles Bloom     cb at my domain
http://www.cbloom.com/~cbloom/
I'm capable of wondering if I am
intelligent life, therefore I am.



From cb at my domain Thu Jan 21 22:28:38 1999
Path: news1.giganews.com!nntp.giganews.com!news.idt.net!feed1.news.rcn.net!rcn!master.news.rcn.net!howland.erols.net!bloom-beacon.mit.edu!senator-bedfellow.mit.edu!usenet
From: "Eric Scheirer" 
Newsgroups: comp.compression
Subject: Re: thoughts on audio compression
Date: Wed, 20 Jan 1999 10:16:30 -0500
Organization: Massachvsetts Institvte of Technology
Lines: 31
Message-ID: <784s1h$jkr@senator-bedfellow.MIT.EDU>
References:  <77vkgc$d23@senator-bedfellow.MIT.EDU>  <7823p4$dq2@senator-bedfellow.MIT.EDU> 
NNTP-Posting-Host: ozric.media.mit.edu
X-Newsreader: Microsoft Outlook Express 4.72.3110.5
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
Xref: nntp.giganews.com comp.compression:19016

>Of course, we'd prefer to not have to explicitly send this
>model to the decoder, so we've just moved the compression
>from packing the data to packing the model of the data.

I don't think there's a problem with explicitly sending the
model to the decoder.  For most kinds of sounds, models
are very small compared to the sound data itself.  Even
the whole MPEG-AAC decoder is only 450 KB of software
(uncompressed) -- this is only about 30 seconds of sound
equivalent.

>Unless I'm mistaken, this is going to get even more
>pronounced with MP4, which essentially lets you create
>arbitrary bitstreams.  (though the MP4 decoder is going
>to be by far the most complex we've seen yet, unless I'm
>mistaken!)

Yes, the MP4 decoder for the full profile is very complex
and multifaceted.  But what the MPEG-4 standard for
audio emphasizes more clearly than MPEG-2 is that
there's no "one right way" to encode things -- you have
lots of alternatives.  It's not only a tradeoff between
bandwidth and quality, it's a tradeoff between encoding
effort, bandwidth, and quality.

Best,

 -- Eric


