I'm going to look at how the Alpha channel should be weighted vs the RGB components when measuring distortions in RGBA textures, when that RGBA texture is used for alpha blending.
(aside: of course many RGBA 4-channel images are not actually using A for opacity in alpha blending and this discussion does not apply to those; if your RGBA image is really 4 scalar channels of the same type of signal then the weight of each channel should be the same)
For concreteness, an RGBA image used for alpha blending means that we construct an output pixel like :
out = A * RGB + (1 - A) * background
RGBA in [0,1] scale here
and for most of this discussion I will assume that the distortion measure we are looking at is weighted squared difference (L2 norm) :
' = prime = indicates values in the modified texture
SD(RGB) = ( RGB' - RGB )^2 = (R' - R)^2 + (G' - G)^2 + (B' - B)^2
SD(RGBA) = SD(RGB) + (A' - A)^2
WSD(RGBA) = SD(RGB) + W(A) * (A' - A)^2
W(A) is the weight of the A term in the RGBA squared difference
In the end what we care about is what the user sees, which are the "out" pixel values after alpha blending. So the correct error metric (or weight for A) should be determined by minimizing the error in the output pixel.
(choosing a weight for A is equivalent to choosing a bit allocation ratio between RGB color channels and the A channel; in an RD encoder, changing W(A) corresponds to transferring bits)
The output pixel is RGB only (no A), and we'll assume that we want to minimize the squared difference of the output pixel. So the question is how should that be expressed in an error metric on the RGBA texture.
There are three intuitive factors which should obviously affect this issue, and we will see them come out in the math. They are :
1. When background RGB is nearly equal to the texture RGB, A doesn't matter at all. If background == RGB , then any A value produces the same output pixel. The bigger difference there is between background and the current texture, the more important A is.
2. When A is smaller, the texture RGB is less important. In the extreme case of A=0, the texture RGB doesn't matter at all, it isn't used and is fully redundant. This is a well known problem in texture compression and is why we prefer premultiplied alpha (see more later).
3. Ignoring those factors, A affects the output color in all 3 channels, so will be something like ~3 times more important than each RGB channel.
So, without further ado let's do the simple math :
E = (out' - out)^2
out = A * RGB + (1 - A) * background
let background = RGB + D
(background is an RGB vector; D is a scalar difference on each channel)
out = A * RGB + (1 - A) * (RGB + D) = RGB + (1 - A)*D
out(R) = R + (1-A)*D
out' = A' * RGB' + (1 - A') * (RGB + D)
(note the background RGB in out' is not RGB' ; the background color stays the same)
R' = R + dR
A' = A + dA
out'(R) = (A + dA) * (R + dR) + (1 - A-dA) * (R+D)
out'(R) = (A + dA) * dR + R + (1 - A-dA) * D
out'(R) = A * dR + R + (1 - A)*D -dA * D
(dropping terms squared in the delta d)
(out' - out)(R) = A * dR - dA * D
E = A^2 dRGB^2 + 3*D^2 dA^2 + linear terms
linear terms = - 2 * A * D * dA * dRGB
dropping the linear terms (see later on why) :
E = A^2 * dRGB^2 + 3*D^2 * dA^2
This is equivalent to a weight on alpha :
W(A) = 3 * D^2 / A^2
D = difference between texture and background (scalar)
note instead of 3*D^2 you could use the vector RGB difference to background :
W(A) = D_RGB^2 / A^2
This weight term encapsulates the three intuitive principles we saw at the beginning: when the foreground and background are the same RGB color, then A doesn't matter, W(A) ~ D^2 goes to zero. As A goes to zero, RGB doesn't matter; W(A) goes to infinity like 1/A^2 (if you prefer, the weight of RGB goes to zero like W(RGB) ~ A^2). Other than that, if A is ~ D, then W(A) is ~ 3X the RGB weight because it affects all three channels.
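To make that concrete, here's a minimal sketch of the weighted squared difference for straight (non-premultiplied) alpha. The struct, function name, and the small clamp on A^2 are illustrative choices, not part of the derivation; the background color has to be assumed by the caller :

typedef struct { float r,g,b,a; } RGBAF; // all values in [0,1]

// WSD(RGBA) = SD(RGB) + W(A) * dA^2 , with W(A) = D_RGB^2 / A^2
float wsd_rgba_straight( RGBAF orig, RGBAF enc, RGBAF assumed_background )
{
    float dr = enc.r - orig.r, dg = enc.g - orig.g, db = enc.b - orig.b;
    float da = enc.a - orig.a;
    float sd_rgb = dr*dr + dg*dg + db*db;

    // D_RGB^2 = squared distance from texture color to the assumed background
    float Dr = assumed_background.r - orig.r;
    float Dg = assumed_background.g - orig.g;
    float Db = assumed_background.b - orig.b;
    float D2 = Dr*Dr + Dg*Dg + Db*Db;

    float a2 = orig.a * orig.a;
    if ( a2 < 1e-6f ) a2 = 1e-6f; // don't let the RGB weight go completely to zero

    return sd_rgb + (D2 / a2) * da*da;
}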
(revisiting the note on dropping the linear terms : several reasons why I think this is right. First of all, because these are signed linear terms, they will average out to zero when summing over pixels, assuming your distortions dRGB are random around the target value and not net biased to one side. Second of all, because we don't really want to be measuring this effect. What the linear term measures is ways that you can compensate for mis-blending the background with a wrong A by using a wrong RGB to fix it. Say you're blending white onto a black background and your A is too high; you can compensate and lower E by making RGB lower to get the right output color. When that does happen to work in our favor we just don't even want to know about that. It also assumes exact knowledge of the background.)
Now we can't assume that we know the background. We could look at the worst case, D = 1.0, blending black on white or vice versa.
That's when A matters the most, and W(A) = 3 / A^2 ; in that case :
maximal difference to background, D = 1.0
W(A = 1) = 3
W(A = 0.5) = 12
W(A = 0.25) = 48
Alpha should be weighted much higher than RGB.
(note that because of interpolation we probably don't want the weight of RGB to ever go completely to zero)
But D = 1 is rare in practice. In fact in games we very often do alpha blending when D is closer to zero. For example say you're doing some alpha blended particle effects, often you blend many fire or smoke puffs on top of each other, so they are often blending onto other fire and smoke that is very similar. On many games, the RGB palette is squarely in the gray/brown domain so we expect D to be much less than 1 most of the time.
If we assume that foreground and background are both random numbers on the unit interval [0,1] , then we would have :
D = < |x - y| > (x and y uniform random on the unit interval)
(this is simple integral and fun exercise for the reader)
D = 1/3
W(A) = (1/3) / (A^2)
W(A = 0.5) = 4/3
Now, assuming the colors are random is also clearly wrong; don't take this as "the answer". It's just another data point, perhaps more realistic than D=1, but it still doesn't use any assumption that texture and background often match, which would make D even smaller than 1/3.
If you assume the worst case (D=1) you are over-weighting A when it's not always necessary, but assuming D very small would make you under-weight A when the background is in fact very different from the texture.
Let's set that aside and revisit something we mentioned early on : premultiplied alpha.
We like premultiplied alpha in texture compression because without it you are sending totally unnecessary bits in RGB values when A=0. With some types of textures that are largely transparent this can be a huge amount of bits wasted. (some texture codecs will not encode RGB when A=0 exactly, but premultiplied also correctly down-weights the importance of RGB when A is finite but tiny; premultiplied also filters/interpolates better)
With premultiplied alpha the blending equation is :
out = RGB + (1 - A) * background
(no *A on the RGB from the texture, A is only used to multiply the background)
Let's compute what the squared difference weight W(A) should be for RGBA textures used in premultiplied alpha. For laughs I'll do it a different way.
for small deltas :
out_R = R + (1 - A) * background_R
d(out_R) = d/dR(out_R) * dR + d/dA(out_R) * dA
d/dR(out_R) = 1
d/dA(out_R) = - background_R
d(out_R) = (dR - background_R * dA)
E = d(out)^2 = d(out_R)^2 + d(out_G)^2 + d(out_B)^2
dropping linear terms (see arguments above)
E = dR^2 + dG^2 + dB^2 + (background_R^2 + background_G^2 + background_B^2) * dA^2
E = SD(RGB) + WPM(A) * dA^2
WPM(A) = background_R^2 + background_G^2 + background_B^2
WPM(A) = background_RGB^2
WPM(A) = 3 * background^2 (assuming the same scalar "background" value on all channels)
The weight for pre-multiplied alpha is similar to the non-PM case, but without the annoying 1/A^2 in the denominator, which is a big advantage.
Our weighting no longer depends on our A at all, which means the channels can be treated independently. The A weight has been absorbed into the "premultiply" scaling of RGB.
This is just a way of saying that our intuition for premultiplied alpha was correct : using premultiplied alpha has correctly
scaled the RGB component for its importance with alpha, so that you can use the L2 norm on the RGBA channels without any inter-channel
correction factor.
Rather than a scaling by D (difference of texture and background), we now have a scaling with just "background". It's obvious from the premultiplied blending equation that if background == 0, then A has no effect on the output color.
Obviously when encoding a texture we don't know the value of the background. Looking at a few values :
WPM(A) , background=0 : = 0
WPM(A) , background=1/2 : = 3/4
WPM(A) , background=2/3 : = 4/3
WPM(A) , background=1 : = 3
it's probably okay to just use WPM(A) = 1, that is just weight all channels the same, use RGBA squared difference.
This is a compromise given the unknown background; it's convenient that equal weight is not too bad for pre-multiplied alpha.
If you have domain-specific knowledge you could use something different. For example on the web or word-processing where you are
mostly blending onto white backgrounds, alpha weight closer to 3 would be better.
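A corresponding sketch for the premultiplied case, using the same RGBAF struct as above; the scalar assumed_background is a knob you'd set from domain knowledge (eg. near 1.0 for mostly-white backgrounds), or just use WPM = 1 as suggested :

// WSD(RGBA) = SD(RGB) + WPM(A) * dA^2 , with WPM(A) = 3 * background^2
float wsd_rgba_premultiplied( RGBAF orig, RGBAF enc, float assumed_background )
{
    float dr = enc.r - orig.r, dg = enc.g - orig.g, db = enc.b - orig.b;
    float da = enc.a - orig.a;
    float sd_rgb = dr*dr + dg*dg + db*db;

    float WPM = 3.0f * assumed_background * assumed_background;

    return sd_rgb + WPM * da*da;
}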
Conclusion :
When measuring distortion on RGBA textures that you know will be used for alpha-blending, we can find a squared-difference in texture-space that approximates the error in the visible output pixel.
This can be expressed as a weight on the alpha component of the squared difference :
SD(RGBA) = SD(RGB) + W(A) * dA^2
for standard alpha blending :
W(A) = 3 * D^2 / A^2
D = difference between texture and background (scalar)
or W(A) = D_RGB^2 / A^2
for pre-multiplied alpha :
WPM(A) = background_RGB^2
WPM(A) = 3 * background^2
For pre-multiplied alpha, using WPM(A)=1 is reasonable, and just using 4-channel uniform RGBA squared difference is probably fine.
This is just another great reason to prefer pre-multiplied alpha.
For non-pre-multiplied alpha, it's hard to escape the need to scale RGB by A for purposes of the error metric. ( W(A) ~ 1/A^2 is equivalent to W(RGB) ~ A^2 which is equivalent to scaling RGB by A). If you aren't scaling RGB by A then you're giving too many bits to RGB when A is low, and not enough when A is high. Assuming you want a scalar weight that doesn't depend on background or A, then plain old uniform W(A) = 1 is not too bad. (D ~ 1/3 and A ~ 1/2 gives W(A) = 4/3 , so W(A)=1 is in the right ballpark).
I wrote before in Topics in Quantization for Games some general basics on quantization. We're going to continue from there, but focus just on the specific case : GPU convention quantization of n-bit unsigned values (UNORM).
As noted before, there are two viable conventions for uniform linear quantizers; either "floor" quantization (with bias on reconstruct) or "round" quantization (bias on quantizer) with direct reconstruction (end points restored directly). The latter is what GPUs use, therefore games should always use this convention for all quantizers to avoid mistakes.
(for example RGBE quantization in the HDR file format uses the opposite convention; floor quantize and bias on dequantize; make sure to use the right convention to match the file format)
The case we will talk about here is :
n-bit value
N = 2^n
quantized values in [0, N-1]
source values in [0.0, 1.0] (inclusive)
quantize(f) = (int)( f * (N-1) + 0.5f )
dequantize(i) = (float) i / (N-1);
dequantize(0) -> 0.0
dequantize(N-1) -> 1.0
Now we want to talk about "requantization". Requantization is when you have some n-bit value and need to change the representation into an
m-bit value; we will assume always that both values are in UNORM GPU convention (bias on quantize).
A correct requantizer is one that corresponds to dequantization and quantization to the new bit depth :
x is quantized in n bits
dequantize :
f = (float) x / (N-1);
quantize to m bits :
y = (int)( f * (M-1) + 0.5f);
y = (int)( x * (float)(M-1)/(N-1) + 0.5f);
y is the requantization of x from n bits to m bits
(N = 2^n , M = 2^m)
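Written out as code, that requantizer is just (function name mine; this is the simple float path that we'll be matching exactly with integer math below) :

int requantize( int x, int from_bits, int to_bits )
{
    int N = 1 << from_bits, M = 1 << to_bits;
    float f = (float) x / (N-1);            // dequantize to [0,1]
    return (int)( f * (M-1) + 0.5f );       // quantize to to_bits
}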
Okay so back in Topics in Quantization for Games we showed that "bit replication" is the correct requantizer for changing n to 2n.
In requantization what we focus on is the denominator of the quantizer, so changing n bits to m bits is changing the denominator from (N-1) to (M-1). The denominator is also the quantized representation of "One" (1.0 in dequantized range), and we need that to be preserved. Bit replication is the correct requantizer because bit replication is multiplication by (N+1) (since ((x<<n) | x) = x * ((1<<n) + 1)), and (N+1)*(N-1) (that's (N+1)*One) = (N^2 - 1), which is the new One.
To continue that, we can show that you can repeat bit replication to double bits again :
Start at n bits
Bit replicate once : *= (N+1)
now at 2n bits
2n bits range is N^2
Bit replicate again : *= (N^2+1)
What happens to One in n bits (N-1) :
(N-1)*(N+1)*(N^2+1)
(N^2-1)*(N^2+1)
(N^4-1)
becomes One in 4n bits
Alternatively just write out the requantizer :
y = (int)( (x/(N-1)) * (N^4-1) + 0.5f );
then using the factorization above this is just :
y = (int)( (x/(N-1)) * (N^2-1) * (N^2+1) + 0.5f );
y = (int)( x * (N+1) * (N^2+1) + 0.5f );
y = x * (N+1) * (N^2+1);
exact in ints , which is bit replicate twice
So for example to go from 4 to 16 bits, you bit-replicate 0xA to 0xAAAA
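As code, repeated bit replication is just integer multiplies by the factors above (this sketch assumes 4-bit input) :

int requantize_4_to_16( int x ) // x in [0,15]
{
    // 4 -> 8 bits : *= 17 (0x11) , that's (N+1) with N=16
    // 8 -> 16 bits : *= 257 (0x101) , that's (N^2+1)
    // so 4 -> 16 is *= 17*257 = 0x1111 , eg. 0xA -> 0xAAAA
    return x * 0x1111; // == (x<<12)|(x<<8)|(x<<4)|x
}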
Note that while bit replication is the correct requantizer for doubling the bit count, you can *not* just truncate bits to halve it. eg. to requantize from 16 bits to 8 bits, you can not just grab the top 8 bits. While grabbing the top 8 bits would round-trip 8-bit values that had been bit-doubled, it would not put the bucket boundaries in the right place for values that are in between. eg. 16-bit values in [129,255] should requantize to "1" in 8-bit, not the 0 you would get if you truncated to the top 8 bits.
Okay, so let's get into some requantizers. I want to note that just doing the simple form :
y = (int)( x * (float)(M-1)/(N-1) + 0.5f);
is totally fine and you can just do that and not worry about this. But that said, our goal here is to work out requantizers
that are exactly equal to that result, but that stay in integer.
There are a couple general techniques that help with this, so I'll do examples of both.
1. Taylor series in the denominator method.
For example, let's find the requantizer from 16 bit to 8 bit. The simple form is :
y = (int)( x * 255/65535.f + 0.5f);
To find the int form, we can try changing 65535 to 65536. One way to do that is approximate :
1/65535 = 1/(65536-1) = 1/65536 * 1/(1 - 1/65536)
We can use the Taylor expansion of 1/(1 - x) for x small :
1/(1 - x) = 1 + x + O(x^2)
using x = 1/65536 , it's reasonable to drop terms in x^2
So :
1/65535 ~= 1/65536 * ( 1 + 1/65536 )
y = (int)( x * 255/65535.f + 0.5f);
y = (int)( (x * 255 + 32767.5f)/65535.f );
y = (int)( (x * 255 + 32767.5f) * (1/65536) * ( 1 + 1/65536 ) );
y = (int)( (x * 255 + 32767.5f + x*255/65536 + 32767.5f/65536) * (1/65536) );
65536/255 = 257.003922.. ; try 257
32767.5f + 32767.5f/65536 = 32768 - 0.5f/65536 ; try 32768
y ~= (int)( (x * 255 + (x/257.f) + 32768) * (1/65536) );
y ~= (int)( (x * 255 + (x/257.f) + 32768) ) >> 16;
y = (x * 255 + (x/257) + 32768) >> 16;
and in fact that is the correct 16 to 8 requantizer in integers (in the sense that it matches the simple float
formula exactly).
There is however an even simpler 16 to 8 requantizer which also matches :
int requantize_16_to_8( int x )
{
    ASSERT( x >= 0 && x <= 65535 );
    return ( x * 255 + 32768 + 127 )>>16;
}
which can be made by replacing the (x/257) term with its average.
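If you want to convince yourself these match, a brute-force check over all 16-bit inputs is cheap (this is just my test scaffolding, not part of the derivation; it assumes a C assert) :

#include <assert.h>

void test_requantize_16_to_8( void )
{
    for ( int x = 0; x <= 65535; x++ )
    {
        int y_float  = (int)( x * 255/65535.f + 0.5f );
        int y_taylor = ( x * 255 + (x/257) + 32768 ) >> 16;
        int y_simple = ( x * 255 + 32768 + 127 ) >> 16;
        assert( y_taylor == y_float && y_simple == y_float );
    }
}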
2. Bit replicating up to make an infinite repeating fraction
Another technique that sometimes works is based on a different conceptual picture.
We have (for example) an 8 bit value; 0x00 is "zero" , and 0xFF is "one" (that's what they dequantize to).
To get 0xFF to mean "one" we think of it as a fixed point fraction, and bit replicate so for 0xFF think :
0xFF = 0. FF FF FF ... forever
then there is no number between that and 1.0 , therefore it is equal to 1.0
so to find requantizers or interpolators, we can bit replicate several times and approximate the infinite
repetition with a finite number , but then we'll be a little bit too low so we'll have to bias up a bit.
To be clear, this repeated fraction is not just hand waving and not just right at the end points; if you
could repeat forever it would be exact. We can see it by using our Taylor expansion again and doing all
the terms :
1/(1-x) = 1 + x + x^2 + x^3 + x^4 + ...
using x = 1/256 , this is :
256/255 = 1 + 2^-8 + 2^-16 + 2^-24 + 2^-32 + ...
or
1/255 = shift down 8 + shift down 16 + shift down 24 + ...
or
0x34/255 = 0 . 34 34 34 34 ...
Any denominator that's a power of two minus one is a repeated binary fraction.
We want to see if we can use just a few repeats and then bias up and get correct requantizer results.
As an example, let's work out the requantizer from 10 bits to 8 bits. As usual the simple way is :
f = x/1023.f
y = (int)( f * 255.f + 0.5f );
but we'll try to find an exact match to that formula that stays in int. We guess that we want to start by
making a repeated fraction :
thinking of f as a fixed-point fraction with the binary point 20 places to the left
f = (x << 10) | x;
y = (int)( f * 255.f + 0.5f );
y = ( ((x << 10) | x) * 255 + (1<<19) + bias )>>20;
where we guess we need "bias" because of the fact that we are using a truncated 20 place fraction rather than forever
that is, we need to push up 0.FFFFF slightly to get to 1.0
It turns out it works with bias = (1<<9) (among others)
y = ( x * 1025 * 255 + (1<<19) + (1<<9) )>>20;
( using ( (x << 10) | x ) = x * 1025 )
This is in fact a correct requantizer from 10 to 8 bits :
int requantize_10_to_8( int x )
{
    ASSERT( x >= 0 && x <= 1023 );
    return ( x * 0x03FCFF + 0x080200 )>>20;
}
Okay, so we can see it's frequently possible to use an approximation of the infinite repeated fraction to find correct
requantizers.
(you could also get this from our one term Taylor expansion as well; essentially we just did 1/1023 ~= 1025/(1024^2) )
This requantize_10_to_8 requires 28-bit intermediates, so it fits in 32-bit registers. For scalar that's fine, but for SIMD we'd like to be able to do these requantizers in as few intermediate bits as possible (ideally in 16 to get wider SIMD).
If we go back to the verbose form we can see how to get these in fewer bits :
y = ( ((x << 10) | x) * 255 + (1<<19) + (1<<9) )>>20;
y = ( ((x*255) << 10) + ( x * 255 ) + (1<<19) + (1<<9) )>>20;
t = (x*255) + (1<<9)
y = ( (t<<10) + t )>>20;
y = ( t + (t>>10) )>>10;
int requantize_10_to_8_alt( int x )
{
    ASSERT( x >= 0 && x <= 1023 );
    int t = (x*255) + (1<<9);
    return ( t + (t>>10) )>>10;
}
This gives the same result again; it now uses only 18 bit intermediates.
This form is familiar from the classic "mul8bit" from Jim Blinn's "Three Wrongs Make a Right" :
int mul8bit( int a, int b )
{
    int t = a*b + 128;
    return ( t + (t>>8) )>>8;
}
mul8bit takes a [0,255] value and makes it act as an interpolator from 0% to 100% , eg. for alpha blending.
It is equivalent to :
(int)( ((a / 255.f) * b) + 0.5f );
which is dequantizing a to UNORM, using it to scale b, then requantizing.
We can now see that the "mul8bit" interpolator is a relative of our requantizers, and we can derive mul8bit by using the repeated fraction or Taylor series.
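As with the requantizers, the equivalence is easy to verify by brute force over all inputs (again just test scaffolding of mine) :

#include <assert.h>

void test_mul8bit( void )
{
    for ( int a = 0; a <= 255; a++ )
    for ( int b = 0; b <= 255; b++ )
    {
        int t = a*b + 128;
        int y_int   = ( t + (t>>8) ) >> 8;
        int y_float = (int)( ((a / 255.f) * b) + 0.5f );
        assert( y_int == y_float );
    }
}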
The end!
Oodle 2.9.0 is now out. The full changelog is :
## Release 2.9.0 - March 23, 2021
$* *fix* : OodleLZ_Compress had an encoder crash bug in the Optimal level encodes on data in sizes just slightly over a 256KB chunk (eg. 262145) with a repeated substring at the very end
$* *change* : Mac libs and dylibs are now fat binaries with x64 and ARM64
$* *change* : Tex : Oodle Texture no longer checks for license file
$* *change* : defining OODLE_IMPORT_LIB / OODLE_IMPORT_DLL is no longer needed; you can link with either type of lib without setting a define
$* *change* : Oodle public headers no longer define types like U8, SINTa, they are instead OO_U8, OO_SINTa, etc.
$* *change* : Oodle public headers now require stdint.h which on Windows/MSVC means VC2010 or higher
$* *change* : Net : OODLE_PLATFORM_HAS_SELECTDICTIONARYANDTRAIN define removed. Call OodleNetwork1_SelectDictionarySupported instead.
$* *removed* : Core : support for the deprecated LZ compressors is now removed (LZH,LZA,etc.). Only newlz (Kraken,Mermaid,Selkie,Leviathan,Hydra) and LZB16 are supported.
$* *removed* : Core : OodleLZ_CompressContext_* functions removed; streaming encoders no longer supported
$* *removed* : Ext : OODLEX_PATH_* public defines removed.
$* *removed* : Ext : OODLEX_WCHAR_SIZE public define removed.
$* *removed* : Tex : OodleTex_CheckLicense func removed ; Oodle Texture no longer checks for license files
$* *deprecation* : OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size no longer used; newlz compressors no longer ever drop down to LZB16 (default behavior unchanged)
The biggest change is the removal of all deprecated pre-Kraken compressors (LZH,LZHLW,LZNA,LZNIB,BitKnit,etc.) except for LZB16, which stays for the moment (but is also deprecated). Data compressed with the old codecs cannot be decoded by Oodle 2.9.0 ; up through Oodle 2.8.14, data made by any earlier version can always be decoded by later versions.
Note that old versions of the Kraken-family of compressors can still be decoded (Oodle 2.9.0 will read data written by Oodle 2.3 Kraken), and that will always be the case, we never break reading old data (made by the supported codecs). So if you are using the Kraken-family of compressors, you can always update to newer versions and your old data will be readable.
If you were using one of the old codecs, we recommend changing to one of the Oceanic Cryptozoology codecs (Kraken, Mermaid, Selkie, Leviathan). They are better in every way. To do this, you need to decode that data with the old version of Oodle you have (or any version up through 2.8.14) because 2.9.0 cannot decode the old codecs. Assuming your old version is post-Kraken (2.1.5), you can re-encode to Kraken with your old sdk and the current version (2.9.0) will be able to load that data.
If you can't switch codecs for some reason, you can always keep using the old version of Oodle you are currently on.
We have also designated the last version before 2.9.0 (which was 2.8.14) as a "long term support" release. We will continue to push critical bug fixes to 2.8.14.lts for people who need to stay on old codecs.
(2.8.14.lts has already received a bug fix since the 2.8.14 first release; clients on 2.8.14 should update to the latest version)
If at all possible we recommend everyone to get on the Oceanic Cryptozoology codecs and update to 2.9.0 ; if you need the old codecs we recommend 2.8.14 for long term support.
LZB16 is still supported in 2.9.0 because some data may need it, but we strongly discourage its use.
The BWT (Burrows Wheeler Transform) has long fascinated people for its ability to capture complex correlations with a very simple inverse transform. Unfortunately despite that inverse transform being very simple, it is also slow. I will briefly review the inverse BWT (i-BWT) and then look at ways to speed it up.
Jump to the end for the punch line : speed results and source code.
Let's briefly review the basic BWT to establish the baseline algorithm and notation.
Given an input string like "inputstring" add an EOF symbol so S = "inputstring|" then form all rotations :
inputstring|
nputstring|i
putstring|in
utstring|inp
tstring|inpu
string|input
tring|inputs
ring|inputst
ing|inputstr
ng|inputstri
g|inputstrin
|inputstring
Then sort all the rows, this gives the matrix M :
g|inputstrin
ing|inputstr
inputstring|
ng|inputstri
nputstring|i
putstring|in
ring|inputst
string|input
tring|inputs
tstring|inpu
utstring|inp
|inputstring
^ ^
F L
this is all the suffixes of the file, and we've sorted them all. We've labeled the first column F
and the last column L.
The last column L is the BWT, which is what we transmit. (typically we don't transmit the EOF character, instead we transmit its location and leave it out of the BWT string that is sent).
In the matrix M each column is a permutation of the original string S (remember the rows are cyclic rotations). The first column F is just the alphabetic sort of S. F can obviously be reconstructed from a histogram of S, which can be counted from L. The column L is the original characters of S in order of their following suffixes.
The inverse transform is simple conceptually. We transmit L, the BWT string (the last column), we need to regenerate S the top row.
Because the rows are all just rotations of S, the character at F[i] on each row is the character that
follows L[i]. This gives you an answer to "what's the next character after this one?" which is all you
need :
Output decoded string D
Start with D=| the EOF character.
What's the next character?
L = nr|iinttsupg
F = giinnprsttu|
Find | in L , it's L[2]
so the next char is F[2] = i
D=|i
What's the next character?
Find i in L , it's L[4] (NOT L[3])
so the next char is F[4] = n
D=|in
Then find n at L[5] (not L[0])
the next char is p, so D=|inp
after p is u, etc.
So you can just read out the characters one by one like that.
When a character occurred multiple times in L we had to find the right one (which corresponds to the same location of that character in the original string S). eg. when finding 'i' we needed L[4] not L[3] because L[4] is the i that follows | in "input", while L[3] is the i in "string".
The way to do that is to find the same occurrence number of that character. If it's the 2nd "i" in F, we want that to map to the 2nd "i" in L.
The reason is that L has the characters sorted by their suffixes, while F is sorted by the whole strings. So for occurrences of the same symbol, the order they occur in L is the same as the order they occur in F.
For example the n's in F :
ng|inputstri
nputstring|i
are in this order because "ng" is less than "np"; in the last column L, the n's occur as :
g|inputstrin
putstring|in
they are sorted by their suffixes "g|" and "pu", so they are in the same order.
So the nth occurrence of each character in L corresponds to the nth occurrence of that character in F. The main thing we need for inversion is this mapping between L and F for each character.
It turns out to be more convenient to go backwards and make the mapping from L to F (rather than F to L).
This is because the F array can just be implicit, we never need to make it in memory. F is just the "cum prob"
table. That is :
Each symbol has a histogram count[s]
start[s] = prefix sum of counts < s
cumulative probability range[s] = [ start[s] , start[s] + count[s] )
(closed interval on low end, open interval on high end)
F[] = fill s in range[s]
To find the nth occurrence of s in F[] is just i = start[s] + n
This is the same array you would use in arithmetic coding or RANS.
We'll call the mapping from L to F "LF[]", and it is constructed thusly :
scan L[]
s = L[i]
count C[s] as you go (start at zero)
find the C[s]th occurance of s in F[]
LF[i] = start[s] + C[s]
then C[s] ++
Now LF[i] takes us from the symbol at L[i] to the same symbol in F[] which will be at F[ LF[i] ], then from
there you can step to the prior symbol, which is at L[ LF[i] ].
Note that this is backwards from what we were doing before, so we are now reading the stream in reverse. Reading out the i-BWT (backwards) is now :
start at i = location of EOF in L[]
for N times :
{
find symbol L[i] in F[] ; this is i = LF[i]
then step to the prior symbol F -> L
output L[i]
}
that's it! A full BWT inverse transform is then :
L[] = the BWT string, we get this from the decoder
count[i] = histogram of L[]
start[i] = prefix sum of count[]
make LF :
C[] = 0
for i to N
{
s = L[i];
LF[i] = start[s] + C[s];
C[s]++;
}
read out :
i = index of EOF (transmitted)
// L[i] is EOF sym which we don't need to output
for N times
{
i = LF[i];
*ptr++ = L[i];
}
This produces the string backwards, which we usually handle by reversing in the encoder before doing the
BWT so that the decoder can just output forwards.
That's it!
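Putting that together, here's a minimal C sketch of the byte-at-a-time inverse transform just described. For simplicity it assumes the EOF sentinel is present in the transmitted L[] as a byte value and that its index was sent, and it writes the output backwards as discussed; names and the memory management are illustrative only :

#include <stdint.h>
#include <stdlib.h>

void ibwt_basic( const uint8_t * L, int n, int eof_index, uint8_t * out )
{
    int count[256] = {0};
    int start[256];
    int C[256] = {0};
    int * LF = (int *) malloc( n * sizeof(int) );

    // histogram of L[] and prefix sum -> start[]
    for ( int i = 0; i < n; i++ ) count[ L[i] ]++;
    int sum = 0;
    for ( int s = 0; s < 256; s++ ) { start[s] = sum; sum += count[s]; }

    // build LF[]
    for ( int i = 0; i < n; i++ )
    {
        int s = L[i];
        LF[i] = start[s] + C[s];
        C[s]++;
    }

    // read out (backwards) ; n-1 real symbols, the EOF sentinel is not output
    int i = eof_index;
    for ( int k = 0; k < n-1; k++ )
    {
        i = LF[i];
        out[k] = L[i];
    }

    free(LF);
}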
I'm not going to talk about how the compressor transmits the BWT string L[] compactly. See the original BW94 paper or the bzip code for the classic MTF + RLE + switching Huffmans encoding. See Fenwick for analysis of BWT as a symbol ranker. In the modern world most people do BWT encoding without MTF; see Maniscalco M99, context induction, BCM and BBB.
I'm also not talking about the forward transform at all. The forward transform also has interesting efficiency issues, mainly also relating to cache misses for large buffers or to GPU implementations, and there has been lots of good research on that; see the papers. In this blog I will assume you don't care much about forward transform speed, so I'll push all the slow operations to the encoder side to make the decoder as fast as possible.
I will also just briefly mention that BWT of long string recurrences can be sped up with an LZP pre-pass. We will assume this has been done where appropriate so that there are no very long identical strings in our source input (eg. perhaps suffixes are always unique after 16 symbols).
End of review and we'll dive into speeding up the i-BWT.
So the transform is delightfully simple, but it's slow. Why?
(and by "slow" I mean *really* slow; specifically 200-400 cycles per iteration on my Intel machine "beasty")
The problem is that each instruction of the i-BWT intrinsically depends on the previous instruction (intrinsically meaning it can't be removed with code reorganization), so no instruction-level-parallelism (ILP) can be used. On modern CPUs that might be able to do 4 instructions per clock this is bad. What's worse is that there is an intrinsic cache miss in the lookup step of the i-BWT.
The problem is the index of the next symbol in the sequence is semi-random and goes all over the BWT buffer. It cache misses often (almost every single iteration). Now you can obviously fix this by making the BWT chunk smaller, which is what the old bzip does. If the chunk is small enough to fit in cache, no problem. But that also kills compression and sort of ruins the whole point of the BWT (if small chunk BWT compression is good enough for you, then you should just use an LZ like Kraken which better fits that space-speed niche; high speed small cache is outside the operating range of where BWT makes sense vs alternatives). In-cache BWT is not a Pareto optimal way to do a fast compressor; we are only interested in BWT in the high-compression domain. So we want to use BWT chunk sizes that are bigger than the fast cache (16M chunks perhaps, which often fits in L3) and speed things up.
The core of the i-BWT is just this :
for N times
{
i = LF[i];
*ptr++ = L[i];
}
The problem is the LF[i] are a near-random walk around the buffer.
They are in order of the sorted suffixes, which tend to be completely randomly
distributed in N, assuming the input array is not in some degenerate pre-sorted order.
You load at LF[i] and it cache misses, you can't do any more work, the loop completely stalls. As soon
as the cache miss returns, you immediately do i = LF[i] and look up at LF[i] again and stall again.
Now one obvious thing we can do is to first combine the L[] and LF[] arrays into a single merged array, so that we get exactly 1 cache missing load per iteration instead of 2. But we still need fundamental improvements.
There are two big deficiencies we need to address :
1. Instruction sequence with sequential dependencies
2. Cache miss on looking up the next symbol
The most obvious approach to deal with these deficiencies is to use a technique we are now familiar with, which is to do multiple streams at the same time.
Independent multiple streams allow you to keep more execution units busy, especially in a case like this where there is a fundamental cache miss stall that you want to fill with useful CPU work during those cycles.
Say you have a single execution stream where every instruction is dependent on the last (c follows b follows a),
only 1 can execute per cycle, and if one cache misses you get unused cycles with no work to do :
clock tick on the left
"1" is the independent stream number
a,b,c indicate a sequence of dependent instructions
0: 1a
1: 1b
2: 1c (cache miss)
3: .
4: .
5: .
6: .
7: 1d
If you run 4 streams of independent work, each stream dependent on the last instruction within its stream but no
cross-stream dependencies, now the CPU can execute 2 instructions per cycle (say), and fill in during the cache miss :
0: 1a + 2a
1: 1b + 2b
2: 1c + 2c (both cache miss)
3: 3a + 4a
4: 3b + 4b
5: 3c + 4c (both cache miss)
6: .
7: 1d + 2d
It's a bit like what hyper-threading does for you, but we get it just by running more independent execution sequences
on our single thread to give the out-of-order execution a bigger selection of work to find instructions to fill the gaps.
In our case with heavy cache missing dependent work, in the 1-stream case you are waiting on the latency of each cache miss with the processor doing nothing, so you get a full stall all that time. Going to more streams at least lets us get multiple memory operations pending at once, rather than stall, wake up, stall again on one load at a time.
In the case of i-BWT we can do this by taking the input data (before BWT) and think of it as T segments. We don't cut the string into pieces (remember chunking hurts compression and we don't want to do that), we still do the forward BWT on the whole string.
Like :
"inputstring|"
in 2 segments :
input+string|
do the BWT on the whole string "inputstring|"
then find the locations of the EOF symbol | in L[]
and the + segment boundary, by finding where "string|" starts
g|inputstrin
ing|inputstr
inputstring| <- EOF key
ng|inputstri
nputstring|i
putstring|in
ring|inputst
string|input <- segment key
tring|inputs
tstring|inpu
utstring|inp
|inputstring
^ ^
F L
EOF key = 2
segment key = 7
Then you would transmit the BWT string L[] and the EOF key (2) as usual, and also transmit the segment key (7).
The segment key lets you start the i-BWT core loop at that location and read out characters from there, in addition
to the stream from the EOF key.
In the i-BWT you can start output cursors for all T segments and you can read them out of the i-BWT simultaneously.
The i-BWT core loop just does "from this symbol, what's the next symbol?" so you can do T of those independently.
1->..+2->...|
cursor 1 starts at i=2 in L[]
cursor 2 starts at i=7 in L[]
decodes:
i....+s.....|
in...+st....|
etc..
input+string|
The N-stream decode is very simple, literally just doing the 1-stream process to N output pointers :
T segments
key[T] are the start indexes
i1,i2,etc. = key[]
ptr1 = start of output buffer
ptr2 = ptr1 + (N/T) , start of next segment, etc.
for N/T times
{
i1 = LF[i1];
*ptr1++ = L[i1];
i2 = LF[i2];
*ptr2++ = L[i2];
}
it just does the basic i-BWT loop on T segments in each iteration. Crucially these are independent, so when one of them
stalls on a cache miss, the others can still do work. If all of them were stalling on cache misses and the memory
system could service T cache line fetches simultaneously, we would be able to stall with all T requests in flight which
would give us a factor of T speedup. In practice it appears to be somewhat less than that.
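As a concrete sketch, the 4-stream version of the core loop might look like this in C (assuming n is a multiple of 4 and the 4 segment start keys were transmitted; uint8_t/stdint.h as in the earlier sketch, and in practice L[] and LF[] would be merged into one struct so each step is a single cache-missing load) :

void ibwt_core_4streams( const int * LF, const uint8_t * L, int n,
                         const int key[4], uint8_t * out )
{
    int seg = n / 4;
    int i0 = key[0], i1 = key[1], i2 = key[2], i3 = key[3];
    uint8_t * p0 = out, * p1 = out + seg, * p2 = out + 2*seg, * p3 = out + 3*seg;

    for ( int k = 0; k < seg; k++ )
    {
        // four independent dependency chains; when one stalls on a cache miss
        // the others can keep issuing loads
        i0 = LF[i0]; *p0++ = L[i0];
        i1 = LF[i1]; *p1++ = L[i1];
        i2 = LF[i2]; *p2++ = L[i2];
        i3 = LF[i3]; *p3++ = L[i3];
    }
}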
You can also cut into segments like this for parallel decoding on multiple threads for very large blocks (without doing any chunking, which sacrifices compression). eg. you might find 16 segments to decode and give 4 segments each to 4 threads. The threads have no cache contention because they are writing linearly to different output segments, and the shared memory of the LF[] array is read-only in this loop.
Going to 4 streams provides around a 2.7X speedup, 8 streams gets us about 4X :
enwik7 :
ibwt_1x1 : seconds:0.7500 | cycles per: 299.330 | MB/s : 13.33
ibwt_1x4 : seconds:0.2801 | cycles per: 111.769 | MB/s : 35.71
ibwt_1x8 : seconds:0.1835 | cycles per: 73.218 | MB/s : 54.51
(see more speeds at end)
The next thing we can do is an idea from the paper "Cache Friendly Burrows-Wheeler Inversion". We can make our basic i-BWT step output words or dwords instead of bytes.
This is a simple idea, but I think it's quite interesting *why* and *when* it works.
What we want to do is just do our core i-BWT to output words at a time instead of bytes, by outputting
a precomputed pair of one byte steps :
core i-BWT loop for words :
i = key (EOF location in L[])
for N/2 times
{
*word_ptr++ = LL[i];
i = LF2[i];
}
where LL = a word, two steps of L[] , and LF2[] = two steps of LF[],
that is LF2[i] = LF[LF[i]].
(reminder that in practice we put the LL[] and LF2[] into a single struct so that we get one cache miss array lookup rather than two).
So to use that we just need to build the LL and LF2 arrays first. To do that we just need to precompute
each pair of steps in the byte-at-a-time BWT :
prep for word i-BWT :
first build LF[] as usual
for i = 0 to N
{
int i2 = LF[i];
LL[i] = L[i] | (L[i2]<<8);
LF2[i] = LF[i2];
}
At every index in L[] we precompute a two-character step by following LF[] once.
(side note : you could precompute a two-character step forward more simply; from L[i] the next character is just F[i], which you could get from the cumulative probability table at i, but that would give you a step forward and what we want is a step backward; L[i2] here is the character that precedes L[i]; it wouldn't help because we need a lookup at i2 to make LF2[] anyway)
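In code the prep pass and the word core loop might look like this (struct layout and names are mine; the point is that the two output bytes and the double-step index live in one struct so the core loop does one lookup per iteration) :

#include <stdint.h>

typedef struct { uint16_t LL; int LF2; } Step2;

void build_LF2( const int * LF, const uint8_t * L, int n, Step2 * step )
{
    for ( int i = 0; i < n; i++ ) // i++ loop : iterations are independent
    {
        int i2 = LF[i];
        step[i].LL = (uint16_t)( L[i] | (L[i2]<<8) );
        step[i].LF2 = LF[i2];
    }
}

void ibwt_core_words( const Step2 * step, int n, int key, uint16_t * word_out )
{
    int i = key; // EOF location in L[]
    for ( int k = 0; k < n/2; k++ ) // N/2 iterations of the cache-missing core loop
    {
        word_out[k] = step[i].LL;
        i = step[i].LF2;
    }
}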
Now at first glance this might not seem helpful, what we have done is transform :
byte at-a-time BWT :
core i-BWT loop :
N times :
i <- LF[i]
output byte from i
word at-a-time BWT :
prep :
N times :
LF2[i] = LF[LF[i]]
word = L[i] + L[LF[i]]
N/2 times :
i <- LF2[i]
output word from i
We are actually doing more work; we now do 3/2 N iterations instead of N iterations.
But it is in fact much faster.
Why?
The reason is that the N-step "prep" loop to build LF2 does not have the two big problems of the
core loop. Remember the two big issues :
1. Instruction sequence with sequential dependencies
2. Cache miss on looking up the next symbol
these are what make the "core" i-BWT loop so slow.
First issue #1 : the "prep" loop naturally allows for execution ahead, unlike the core loop. The reason
is that it is just doing i++ to iterate the loop, rather than chasing i = LF[i] around a data-dependent
serpentine walk through the arrays. This means that the processor can effectively unroll the loop to execute
ahead :
prep loop :
iter 1 :
LF2[i] = LF[LF[i]]
word = L[i] + L[LF[i]]
i++;
iter 2 :
LF2[i] = LF[LF[i]]
word = L[i] + L[LF[i]]
i++;
iter 1 and iter 2 can run at the same time because i++ can be precomputed.
If LF[LF[i]] stalls on a cache miss in iter 1, iter 2 can still proceed.
core loop :
iter 1 :
i = LF[i]
output byte/word from i
iter 2 :
i = LF[i]
output byte/word from i
iter 2 can't start until iter 1 is done because we don't know "i" until the
results of the lookup at LF[i] are done.
A modern processor will be able to execute ahead in the "prep" loop to fully saturate
execution, and we don't need to manually do the N-streams to get ILP because it just
naturally exists.
issue #2: cache misses. The "prep" loop cache misses far less than the "core" loop.
Again this is not naively obvious. The byte "core" loop does N lookups at LF[i]. The "prep" loop also does N lookups at LF[i]. So they should cache miss the same, right?
Nope. The difference is because the "core" loop is doing i = LF[i] which takes you to a semi-random
place each time, whereas the "prep" loop is doing i++ for subsequent lookups, and that means the
lookups are often to nearby locations that stay in cache.
core loop does :
starting from i
i1 = LF[i]
lookup at i1
i2 = LF[i1]
lookup at i2
i3 = LF[i2]
lookup at i3
prep loop does :
starting from i
i1 = LF[i]
lookup at i1
i2 = LF[i+1]
lookup at i2
i3 = LF[i+2]
lookup at i3
The indexes i1,i2,i3 in the core loop are semi-random, in the prep loop they are not.
To understand, let's remember what LF[] is. It takes us from the symbol at L[i] to the location of the preceding symbol in the BWT (we're outputting backwards).
So say you were currently at "ing|" (in "inputstring|"), it takes you to the location of "ring|" then "tring|" ; those are i1,i2,i3 in the core loop.
In the "prep" loop you are taking the starting strings at i, i+1, i+2. These are subsequent in sorted
order. These might be something like :
"hella ..."
"hellish a.."
"hellish b.."
"hello a..."
"hello b..."
So the steps LF[] gives us to i1,i2,i3 will be to tack on the preceding character and look that up.
They might all be preceded by random characters, but in practice because these strings are similar
their preceding characters tend to be similar. That might be :
" hella ..."
" hellish a.."
"shellish b.."
" hello a..."
"shello b..."
then you look up those locations in the suffix sort. As long as you tacked on the same preceding
character, they will stay adjacent :
" hella ..." -> i =4000
" hellish a.." -> i = 4001
"shellish b.." -> i = 6078
" hello a..." -> i = 4002
"shello b..." -> i = 6079
The preceding character for the sorted suffixes is of course just the BWT string L[] itself. The
portion here would be " s s".
So as long as the BWT string has repeated characters, the indexes of the "prep" loop lookup will be adjacent. If they are adjacent indexes, they will not cache miss because we already loaded their neighbor which brought in that cache line (load at i=4001 won't cache miss because we just did a load at i=4000).
Well, getting repeated characters in L[] is exactly what the BWT gives us! If we think of it in terms of post-MTF BWT strings, a 0 will be a load from the most likely location, 1 will be the second most likely, etc. so there will tend to be a few cache lines that we already have that provide most of our loads for the "prep" loop. Only rarely do you get unexpected characters that cause cache misses.
Note that this only works on *compressible* data where the BWT is giving the coherence we want. On random data that doesn't compress this breaks down. (though even in that case, there are still at most 256 hot cache lines = 16 KB that we need to read from, so it's still better than the "core" loop, and we still have property #1 that each iteration is independent).
Actual load locations so you can see how this works in practice :
LF[i] = 8853
LF[i] = 8854
LF[i] = 8855
LF[i] = 8856
LF[i] = 10553
LF[i] = 8857
LF[i] = 10554
LF[i] = 48
LF[i] = 49
LF[i] = 50
LF[i] = 10555
LF[i] = 10556
LF[i] = 51
LF[i] = 52
LF[i] = 5070
LF[i] = 10557
LF[i] = 2477
LF[i] = 53
LF[i] = 54
Where you have 10553,10554,10555 that means the same character was preceding those suffixes, so they look up in adjacent places.
You can of course take this idea for word output and repeat it again to get dword (u32) output per iteration. In practice what I see is that dword output is only faster for compressible data where the cache coherence property above works well. On random input dword output gets a lot slower than word-at-a-time. Because of this I think word output is best if you only choose one i-BWT, but you can do even better by adaptively choosing the output algorithm by the compression ratio.
The dword (4x) variant does an N-loop to build LF2 which is quite coherent, the next N-loop to build LF4 is
slightly less coherent unless the data is compressible (we're relying on two-symbols-away correlation now
instead of neighbor correlation, which is less strong), then it does N/4 of the "core" loop which is still quite
cache missy. So the net is :
1x byte loop : N steps of "core" , cache missing
2x word loop : N steps of LF2 + (N/2) steps of core
4x dword loop : N steps of LF2 + N steps of LF4 + (N/4) steps of core
The authors of "Cache Friendly Burrows-Wheeler Inversion" have another method that accelerates long repeated substrings called "precopy". I think that a pre-pass with an LZP transform is probably a faster way to accomplish the same thing on data where that is helpful. "precopy" also makes N-stream interleaving much more complex. I have not tested it.
Speeds measured on my Intel Core i7 3770 (locked at 3403 MHz) (Ivy Bridge)
ibwt 1x1 = classic byte at a time inverse BWT
ibwt 1x4 = byte x 4 streams
ibwt 2x4 = word x 4 streams
ibwt 4x4 = dword x 4 streams
ibwt 1x8 = byte x 8 streams
ibwt 2x8 = word x 8 streams
ibwt 4x8 = dword x 8 streams
cycles per byte for the ibwt :
| name | 1x1 | 1x4 | 2x4 | 4x4 |
| dickens | 350.6 | 123.7 | 70.0 | 47.9 |
| enwik7 | 301.2 | 116.0 | 66.8 | 47.0 |
| webster | 376.9 | 139.0 | 78.5 | 51.2 |
| lzt99 | 211.3 | 114.6 | 65.1 | 61.4 |
| rand_16M | 401.6 | 130.6 | 80.4 | 297.3 |
| name | 1x1 | 1x8 | 2x8 | 4x8 |
| dickens | 337.0 | 79.8 | 46.8 | 35.8 |
| enwik7 | 299.1 | 73.2 | 46.0 | 36.1 |
| webster | 376.5 | 98.0 | 57.5 | 40.3 |
| lzt99 | 208.6 | 77.9 | 48.1 | 53.7 |
| rand_16M | 401.3 | 84.1 | 55.0 | 273.4 |
We've improved the i-BWT speed from around 300 cycles/byte to 50 cycles/byte. That's still extremely slow. It's around the speed of Oodle LZNA (30 cycles/byte) or 7z LZMA (70 cycles/byte). It's an order of magnitude slower than Oodle Leviathan (5 cycles/byte). But in some cases, such as text, BWT on large blocks gets a lot more compression than those, so it could be interesting on the right data types if you care about the very high compression domain. All speeds single threaded.
The i-BWT on a small buffer that stays in cache is around 10 cycles/byte (and could probably be faster; we haven't looked at all at micro-optimizing the actual instructions here, since if we're stalling on cache misses they don't matter), so at 50 cycles/byte we're still spending 40 cycles/byte stalled on the memory subsystem in our best algorithm, which is not great.
ADD : with MEM_LARGE_PAGES :
| name | 1x1 | 1x8 | 2x8 | 4x8 |
| dickens | 301.3 | 52.4 | 33.9 | 27.7 |
| enwik7 | 265.8 | 50.5 | 33.1 | 28.2 |
| webster | 319.6 | 54.5 | 34.8 | 29.0 |
| lzt99 | 177.5 | 50.6 | 32.5 | 35.6 |
| rand_16M | 336.6 | 53.5 | 41.4 | 154.1 |
Without large pages, the bottleneck appears to be in TLB/page mapping. With large pages it seems to be limited by the rate that cache missing reads can be retired (assuming the latency of that is fixed, this is the number of different cache lines that can have in-flight reads simultaneously). eg. if latency is 300 cycles, with 10 simultaneous reads we could get to 30 cycles/byte.
Downloads :
The source code is public domain. The ibwt routines are standard C. The test driver uses cblib.
Oodle Texture does rate-distortion optimization (RDO) BCN encoding, optimizing the BCN encoding for an R-D two axis score such that the size of the BCN texture after a following lossless compression is reduced while the distortion (difference from original) is not increased too much.
One way to think about RDO conceptually is as rate allocation. Rate allocation is when an encoder intentionally puts more or fewer bits in parts of the data to control what the decoder will receive, so that it can use more bits where it helps more, and the result is a better quality for a given output size. I want to talk a bit about that viewpoint and how it works in Oodle Texture.
Traditional lossy encoders like JPEG didn't have active rate allocation; they used a fixed scheme (DCT and quantize) that did a fixed rate allocation (give more bits to the lower frequencies) and didn't try to optimize that based on image contents. Modern codecs like H264 use rate allocating encoders and furthermore have explicit tools in the format to make rate allocation more flexible, such as variable quantizers. Formats without explicit rate allocation controls can still be rate adjusted, which is what we do in Oodle Texture, but it's not as easy to dial.
In that context, we think of a traditional non-RDO BCN encoder as allocating rate evenly to all blocks; it doesn't care where the rate goes and just tries to minimize distortion. The BCN encoder itself produces fixed size blocks (8 bytes per block for BC1 or 16 bytes per block for BC7), but the "rate" we care about is the size of the block after subsequent LZ compression (eg. with Kraken or Zip/deflate).
In Oodle Texture, we use the Lagrange parameter method for RDO. Specifically we construct a
Lagrange cost :
J = D + lambda * R
and then we can just minimize a single number, J, rather than a two-axis score of {R,D}.
Obviously when lambda=0 this reduces to J=D , minimize D, which is a non-RDO encoding that only cares
about quality and not rate. As lambda increases your score is more and more a balance of minimizing both
distortion and rate.
We sometimes think of D(R), distortion as a function of R, but that's not how it works in an implementation; you don't actually know the functions for how R and D relate.
In practice there is some encoding, which I will index by "i" , and you have two separate functions :
R(i) and D(i)
That is, you just choose various encodings that look promising, and measure R and D of each. This will
give you a scattering of points on an R-D plot. Some points are off the Pareto Frontier - strictly worse in terms of R & D
and we just reject those points.
See Fabian's blog on RDO
for some pictures of this.
Henceforth I will assume that points off the Pareto Frontier have been rejected and we now only are looking at encodings on the Pareto R-D curve. Also I will sort the index of encodings "i" by rate.
The R(i) and D(i) curves both tend to be hyperbolic-shaped, with minima at opposite ends of the "i" spectrum. If you minimized either one on their own, they don't have any local minima so you would just go to the edge of what the encoding format is capable of doing. By adding them together with lambda, it makes a trough somewhere in the middle :
High D, low R on the left (very lossy), low D high R (non-RDO) on the right.
At low lambda, the minimum of J is in the high-R, low-D domain. As lambda increases it shifts the minimum of J to lower R. In the example chart, the red low-lambda curve has minimum at i=14, the teal medium-lambda curve has minimum at i=12, the gold high-lambda curve has minimum at i=10. In this way, lambda can be used to dial rate, but indirectly.
Concretely, if we reparameterize to imagine we have a function D(R) (Distortion as a function of R; there
is only one encoding for each R because we discard non-Pareto encodings), we have :
J(R) = D(R) + lambda * R
minimum is at d/dR J = 0
d/dR J = 0 = d D /dR + lambda
lambda = - dD / dR
lambda is the slope on the D(R) curve.
Because the D(R) curve is hyperbolic-shaped, the slope acts like a parameter of D itself. That is, where D is higher, the curve is also steeper, so by dialing the slope we have control over the principal value D as well.
Aside for experts : we are assuming that D(R) is continuous and monotonic AND dD/dR is continuous and monotonic; that is, the slope steadily increases from low to high. The whole idea of lambda as a slope of D(R) only really works if these curves are smooth and hyperbolic shaped as expected. We also need J to only have one local minimum. If these assumptions fail (which in practice they do!) you can wind up at a minimum that's not where you want, and the lambda-dials-D relationship can break down. There are some tricks to avoid these pitfalls. Try to avoid large steps of R/D in your encoding choices; provide the encoder with a fine grained series of steps; don't use the true R which can have strange non-monotonic wiggles, instead use an approximation of R which tends to smooth things out; you generally want to pre-classify the encodings that you think are low R vs high R and force them to be monotonic rather than measuring true R.
Now, "lambda is the slope on the D(R) curve" sounds a bit unintuitive and abstract, but in fact it is a concrete simple thing.
Lambda is a price. It is the gain in D that you must get for a gain in R to be worth it.
By using lambda as your control parameter instead of D, you are dialing quality via the *cost* of quality in terms of rate. You are saying "this much gain in quality is worth this much rate to me". We typically measure R in bits, so let's say lambda is quality per bit. In order to make a file bigger by 1 bit, it must provide at least "lambda" quality gain. Smaller quality gain, it's not worth it, don't do it.
Having our control parameter be price instead of directly controlling R or D turns out to be very beneficial, both for us internally and for you externally.
First why it's right externally : lambda control (slope control) lets you set a single parameter that works for all kinds of data or images. It automatically gives more rate to images that are harder to compress ("harder" here very concretely means a steeper slope - they give up more distortion than average for each step of rate). So you can have a mix of vastly different data, lightmaps and normal maps and cartoon drawings, and they automatically rate adjust to send bits where they help the most. Many novices prefer to think of "rate reduction" or a "constant quality" kind of setting, like "I want to RDO all my images to 30% rate reduction", but that's really wrong. On some images, getting to 30% rate reduction would do lots of quality damage, while on others you could easily get more rate reduction (say 50%) without much quality loss at all. Lambda does that for you automatically.
Internally it's important because it lets us make each coding decision in isolation, just by looking at the J cost of that one decision in isolation. It makes the problem separable (which is also great for parallelism), but still achieve a global minimum. Lagrange optimization automatically does rate allocation within the image to different blocks and decisions.
I think it helps to compare to an alternative manual rate allocation method to see how advantageous it is :
Algorithm Manual Rate Allocation :
you have an image of N BCN 4x4 blocks
for each block, find the lowest D (non-RDO) encoding
measure R of that encoding - this is the max rate for each block
now incrementally reduce total rate
you want to take rate from whichever of the N blocks is the best one to take rate from
for all N blocks
find the next smaller encoding
measure the D increase for that step down of R
slope(n) = delta(D) / delta(R)
take the one block change n with the lowest slope(n)
repeat until all slopes(n) are > max_desired_slope
vs.
Algorithm Lagrange Rate Allocation :
for all N blocks in parallel :
try various encodings of the block
compute J = D + lambda * R for each encoding
take the encoding on that block with the best J
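In code, the per-block Lagrange decision is nothing more than picking the candidate with the lowest J (this is just the selection logic with stand-in names, not Oodle Texture's actual implementation) :

// pick the encoding of one block with the lowest J = D + lambda * R
int pick_best_encoding( int num_candidates,
                        const double * D, // distortion of each candidate
                        const double * R, // rate of each candidate (bits after LZ)
                        double lambda )
{
    int best = 0;
    double best_J = D[0] + lambda * R[0];
    for ( int c = 1; c < num_candidates; c++ )
    {
        double J = D[c] + lambda * R[c];
        if ( J < best_J ) { best_J = J; best = c; }
    }
    return best; // each block is decided independently; lambda couples the allocation
}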
In "Manual Rate Allocation" we have this palette of various bins (the blocks) to choose to take rate from, and you take it from the
one that does the least damage to D, if you keep repeating that the slopes will converge until they are all roughly equal.
The powerful thing is that these two algorithms converge to exactly the same solution (caveat: assuming monotonic and smooth R/D curves, which in fact they are not). The Lagrange method is actually doing rate allocation between the blocks, even though it's not explicit, and each block can be considered in isolation. In an RDO encoder, rate is like water, it flows away from high ground (dD/dR is high) to low ground, until level is reached. The Lagrange method is able to do this separably because the total amount of water (rate) is not conserved.
We can control exactly what the rate allocation does through the choice of the D function. The R function for rate is relatively inflexible - we're counting bits of output size (though in practice we may want to use a smoothed rate rather than true rate), we don't have much choice in R - in contrast, the D function is not absolutely specified and there are lots of options there. The choice of D changes where rate flows.
For example consider the common choices of D = L1 norm (SAD) vs L2 norm (SSD). These metric score different errors differently, which
causes rate to flow. In particular L2 (squaring) puts a very big weight on single large errors vs. smaller multiple errors.
Source data = { 10, 10 }
Approximation = { 8, 12 }
SAD error = 2 + 2 = 4
SSD error = 2^2 + 2^2 = 8
Source data = { 10, 10 }
Approximation = { 10, 14 }
SAD error = 0 + 4 = 4
SSD error = 0^2 + 4^2 = 16
Choosing D = SSD would cause rate to flow from the {8,12} approximation to the {10,14} approximation because the error is much
bigger there (relative to where it is with SAD). All reasonable D functions will be zero for an exact match, and increase as
the approximation gets further from the source data, but they can increase at different rates for different types of distortions.
In a non-RDO encoding, these different D functions might find very similar encodings, but with RDO
by choosing different D you get rate to flow between blocks to different characters of error.
With no further ado, we can get to the fun part : pictures of how rate allocation plays out in practice in Oodle Texture.
These images show the original image, followed by gray scale images of the rate allocation. These are for BC7 encodings with Oodle Texture. The non-RDO encoding is Oodle Texture at lambda=0 ; the RDO encodings are at lambda=40 (typical medium quality setting).
In the rate allocation images, each pixel represents one 4x4 block. White is full rate (16 bytes per block) and black is zero bits. These encodings are not normalized to the same total rate (in particular the non-RDO is of course much larger). The images are gamma corrected so that linear light intensity corresponds to bits per block. The original images are halved in size for display here. So for example "mysoup" was 1024x512, the original shown here is 512x256, and rate map is one pixel per block, so 256x128.
The "rmse" and "perceptual" images use the exact same coding procedure at the same lambda, they differ only in the D function used to measure distortion. "rmse" tries to minimize simple L2 norm, while "perceptual" has some estimation of visual quality (trying to avoid blocking artifacts, for example).
The R measured for these maps is the size of each block after subsequent compression by Oodle's LZA compressor.
(images: "mysoup" original and rate allocation maps - non-RDO | RDO rmse | RDO perceptual)
The non-RDO "mysoup" has very high entropy, the BC7 blocks are nearly incompressible by the LZ (7.759 bpb). The RDO with D=rmse keeps very high bit rate in the blocks of the bowl rim, but allocates lots of bits away from the smooth regions in the egg and pumpkin. In the "perceptual" allocation, the bits in the rim are reduced, allowing more error there, and the bit rate of the smooth regions come up to avoid blocking artifacts.
(images: "test6" original and rate allocation maps - non-RDO | RDO rmse | RDO perceptual)
In the house image "test6" the bit rate of the "rmse" allocation goes nearly to zero in the smooth dark areas of the beams under the eaves. That's a mistake perceptually and causes bad blocking artifacts, so the "perceptual" allocator shifts rate back to those blocks.
(images: original and rate allocation maps - non-RDO | RDO rmse | RDO perceptual)
We can see that Oodle Texture is changing near-even rate allocation to very variable rate per block. Even though the BC7 blocks are always 16 bytes, we have made some more compressible than others, shifting rate to where it is needed. On many blocks, 16 bytes is just too much: that much rate is not needed to get a good encoding, and the difference in D to a reduced-size encoding is small; these are low slope blocks and will lose rate in RDO. After rate allocation, the blocks all have the same dD/dR slope; they reach an equilibrium where you can't take bits from one block and move them to another block and improve total quality.
Something you may notice in all the rate allocation maps is there are horizontal lines of higher rate going through the image. This is where we slice the images into chunks for parallelism of some operations that have to work per-chunk instead of per-block. The stripes of higher rate show that we could find some improvement there.
Links :
Oodle Texture at radgametools.com
Lagrange space-speed optimization - cbloom blog
Leviathan is a market trader - cbloom blog
Rate-Distortion Optimisation in Dirac
Rate-distortion optimization - The ryg blog
Oodle 2.8.14 is out. The full changelog is at RAD.
The highlights are :
## Release 2.8.14 - February 15, 2021
* *enhancement* : BC7 encoding is faster ; slightly different encodings at higher speed with similar quality
* *new* : Mac ARM64 build now provided ; Mac example exes are fat x64+arm64
* *new* : Apple tvOS build now provided
* *deprecation* : Mac 32 bit x86 build no longer provided
We're also now shipping plugin integrations for Unreal 4.26
Kraken decompression is wicked fast on the M1 :
Kraken, Win-x64 msvc-1916, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 = 3.243 bpb = 2.467 to 1
decode : 16.444 millis, 2.26 c/B, rate= 1502.09 MB/s
Kraken, Mac-x64 xcode-12.0.0, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 = 3.243 bpb = 2.467 to 1
decode : 16.183 millis, 2.10 c/B, rate= 1526.36 MB/s
(emulated!)
Kraken, Mac-ARM64 xcode-12.0.0, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 = 3.243 bpb = 2.467 to 1
decode : 11.967 millis, 1.55 c/B, rate= 2064.13 MB/s
Win64 run is :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz, no turbo)
Zen2, 16 cores (32 hyper), TSC at 3490 MHz
Mac runs on Macbook Mini M1 at 3205 MHz
Mac x64 is emulated on the M1
c/B = cycles per byte should be taken with a grain of salt, as we have trouble finding real clock rates, but it's clear the M1 has superior IPC (instructions per clock) to the Zen2. It seems to be about the same speed as the Zen2 even when running emulated x64!
It will be interesting to see what the M1 high performance variants can do.
Some more speeds cuz I like big numbers :
Macbook Mini M1 ARM64 :
Mermaid, Normal, lzt99 :
24,700,820 ->11,189,930 = 3.624 bpb = 2.207 to 1
encode (x1) : 299.154 millis, 37.21 c/B, rate= 82.57 MB/s
decode (x30) : 6.438 millis, 0.80 c/B, rate= 3836.60 MB/s
Mermaid, Optimal2, lzt99 :
24,700,820 ->10,381,175 = 3.362 bpb = 2.379 to 1
encode (x1) : 3.292 seconds, 409.43 c/B, rate= 7.50 MB/s
decode (x30) : 7.134 millis, 0.89 c/B, rate= 3462.57 MB/s
Selkie, Normal, lzt99 :
24,700,820 ->13,258,742 = 4.294 bpb = 1.863 to 1
encode (x1) : 213.197 millis, 26.51 c/B, rate= 115.86 MB/s
decode (x30) : 3.126 millis, 0.39 c/B, rate= 7901.52 MB/s
Selkie, Optimal2, lzt99 :
24,700,820 ->12,712,659 = 4.117 bpb = 1.943 to 1
encode (x1) : 1.861 seconds, 231.49 c/B, rate= 13.27 MB/s
decode (x30) : 3.102 millis, 0.39 c/B, rate= 7962.55 MB/s
AVIF is an image format derived from I-frames of AV1 video (similar to HEIC/HEIF from H265/HEVC). See also my 2014 Test of BPG, which is an H265 I-frame image format.
Here are some links I've found on AVIF :
AVIF image format supported by Cloudflare Image Resizing
GitHub - AOMediaCodec/libavif : Library for encoding and decoding .avif files
GitHub - google/brunsli : Practical JPEG Repacker
Releases · kornelski/cavif-rs · GitHub
GitHub - link-u/cavif : avif encoder, using libaom directly.
GitHub - xiph/rav1e : The fastest and safest AV1 encoder.
AVIF for Next-Generation Image Coding - Netflix TechBlog
Submissions from xiph.org - Hacker News
Image formats for the web: HEIC and AVIF – The Publishing Project
Squoosh
Comparing AVIF vs WebP file sizes at the same DSSIM
AV1 Codec
AVIF images color loss/change - AV1
Unfortunately I have not found a standard encoder with recommended settings. There appears to be zero guidance on settings anywhere.
Because of that I am using the simplest encoder I could find, Kornelski's "cavif" which has a simple --quality parameter. I run thusly :
cavif --quality=X -o out.avif in.png
avifdec out.avif dec.png
imdiff in.png dec.png
CALL FOR HELP : if you know better programs/settings for encoding AVIF, please let me know ; in avifenc, what should the min/max Q parameters be?
adaptive quantization appears to be off by default, don't we want that on?
I measure results using my imdiff.
I will compare AVIF to what I call "JPEG++" which is JPEG with the packjpg/Lepton back end, and a deblocking decoder (my "jpegdec"). This is a rough stand-in for what I think a real JPEG++ should be (it should really have an R-D front-end and chroma-from-luma as well; that's all very easy and unequivocally good).
With no further ado, some results :
(imdiff "fit" is a quality in 0-10 , higher is better)
porsche640.bmp :
PDI_1200 :
Results are a bit disappointing. AVIF is much better on RMSE but slightly worse on my other two scores. That most likely means it's better overall, but it's not a huge margin.
(I'm sure AVIF is a big win on graphic/text images where JPEG does poorly)
AVIF results here look worse than what I saw from BPG (HEIC). Perhaps better encoders/settings will fix that.
Looking at the results visually, AVIF preserves sharp lines much better, but is completely throwing away detail in some places. There are some places where I see AVIF actually *change* the image, whereas JPEG is always just making the same image but slightly worse.
7z of encoded images to compare (2 MB)
NOTE : the fact that AVIF wins strongly in RGB RMSE but not in my other perceptual metrics indicates that it is not optimizing for those metrics. Perhaps in other perceptual metrics it would show a strong win. The metrics I use here from imdiff were chosen because I found them to be the best fit to human quality scores. Lots of the standard scores that people use (like naive SSIM) I have found to be utter junk, with no better correlation to human quality than RMSE. MS-SSIM-IW is the best variant of SSIM I know, but I haven't tested some of the newer metrics that have come out in the last few years.
A couple of tips and tools for working with JPEG files. I use jhead, jpegcrop, and FreeVimager.
Of course you can use exiftool and jpegtran, but these are rather simpler.
1. Strip all personal data before uploading images to the web.
JPEG EXIF headers contain things like date shot and location. If you don't want Google to scrape that and use it
to track your movements and send drones to follow you carrying advertisements, you might want to strip all that private
data before you upload images to the net. I use :
jhead -purejpg %*
jhead -mkexif %*
jhead -dsft %*
with the very handy jhead tool. This removes all non-image headers, then makes a blank exif, then sets the exif time from the mod time.
The last two steps are optional, if you want to preserve the shot time (assuming the mod time of the file was equal to the exif time).
Note if the times are not equal you can use "jhead -ft" to set modtime from exif time before this.
Also note that if you use tools to modify images (resizing, changing exposure, what have you), they may or may not carry through the old exif data; whether that's good or bad is up to you, but it's very inconsistent. They also will probably set the modtime to now, whereas I prefer to keep the modtime = shot time so that I can find files by date more easily.
2. Remove orientation tags
iPhones are the only devices I know of that consistently make use of the JPEG orientation tags rather than actually rotating the pixels. Unfortunately lots of loaders don't support these tags right, so if you load the image in a non-compliant loader it will appear sideways or upside down.
(this is a classic problem with data formats that have too many features; inevitably many implementers only support a subset of the features which they deem to be the necessary ones; then some bone-head says "oh look the format has this weird feature, let's use it", and they output data which most loaders won't load right (then they blame the loaders). This is a mistake of the people defining the format; don't include unnecessary features, and make the practical subset a well-defined subset, not something that forms ad-hoc from use.)
To fix this, you want to read the orientation tag, rotate the pixels, then blank out the orientation tag (you have to do the last step or
compliant loaders will doubly rotate). I used to have scripts to do all that with jpegtran, but it's super easy to do with jhead, you just :
jhead -autorot %*
3. Lossless crop
Avoid loading a JPEG, modifying it, and saving it out again. It destroys the image by re-applying the lossy quantization. I think by far the most common modification people do is cropping and there's no damn reason to re-encode for a crop. You can crop at 8x8 granularity losslessly, and you should. (all the web image uploaders that give you a crop tool need to do this, please, stop sucking so bad).
jpegcrop is a GUI tool that provides a decent lossless crop.
FreeVimager is okay too. Lossless crop is hidden in "JPEG Advanced".
We've got JPEG-XL coming soon, which looks great, but all the tech in the world won't help if people like Instagram & Youtube keep re-encoding uploads, and re-encoding every time you modify.
Oodle 2.8.13 fixes an issue in Oodle Texture with consistency of encodings across machine architectures.
We try to ensure that Oodle Texture creates the same encodings regardless of the machine you run on. So for example if you run on machines that have AVX2 or not, our optional AVX2 routines won't change the results, so you get binary identical encodings.
We had a mistake that was causing some BC1 RDO encodings to be different on AMD and Intel chips. This is now fixed.
The fixed encoding (for BC1 RDO) made by 2.8.13 can be different than either of the previous (AMD or Intel) encodings. I want to use this announcement as an opportunity to repeat a point I am trying to push :
Do not use the binary output of encodings to decide what content needs to be patched!
This is widespread practice in games and it is really a bad idea. The problem is that any changes to the encoders you use (either Oodle Texture, your compressor, or other) can cause you to patch all your content unnecessarily. We do NOT guarantee that different versions of Oodle Texture produce the same output, and in fact we can specifically promise they won't (always produce the same encodings), because we will continue to improve the encoder and find better encodings over time. Aside from version changes causing binary diffs, you might want to change settings, quality or speed levels, etc. and that shouldn't force a full patch down to customers.
The alternative to using binary output to make patches is to check if the pre-encoding content has changed, and don't patch if that is the same. I know there are difficult issues with that, often for textures you do something like load a bitmap, apply various transforms, then do the BCN encoding, and there's not a snapshot taken of the fully processed texture right before the BCN encoding.
If you are shipping a game with a short lifespan, your intention is to only ship it and only patch for bugs, then patching based on binary compiled content diffs is probably fine. But if you are shipping a game that you intend to have a long lifetime, with various generations of DLC, or a long term online player base, then you seriously need to consider patching based on pre-compression content diffs. It is very likely you will at some point in the life time of the game have to face OS version changes, compiler changes, or perhaps bugs in the encoder which force you to get on a newer version of the compression tools. You don't want to be in a situation where that's impossible because it would generate big patches.
One elegant way to solve this, and also speed up your content cooking, is to implement a cooked content cache.
Take the source bitmap, and all the texture cooking & encoding options, and use those as a key to look up pre-cooked content from a network share or similar. If found, don't re-encode.
Every time you export a level, you don't want to have to re-encode all the textures with Oodle Texture. With huge art teams, when someone edits a texture, everyone else on the team can just fetch from the cook cache rather than encode it locally.
The same kind of system can be used to avoid unnecessary patches. If you then populate your cook-cache with the content from the last shipped version, you won't make new encodings unless the source art or options change, even if the encoder algorithm changes.
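A minimal sketch of such a cache key (the hash and struct here are hypothetical stand-ins, not any particular engine's or Oodle's API) :

#include <stdint.h>
#include <stddef.h>

// any strong content hash will do ; FNV-1a here just to keep the sketch self-contained
uint64_t HashBytes( const void * bytes, size_t size )
{
    const uint8_t * p = (const uint8_t *) bytes;
    uint64_t h = 14695981039346656037ull;
    for ( size_t i = 0; i < size; i++ ) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

struct CookKey
{
    uint64_t source_hash;   // hash of the source bitmap bytes
    uint64_t options_hash;  // hash of all cook & encode options (format, RDO lambda, mip settings, ...)
    // deliberately NOT a hash of the compiled BCN output, so a new encoder version
    // doesn't invalidate the cache or force a patch when source & options are unchanged
};

Look the key up in the shared cache; on a hit, reuse the previously cooked and encoded data, on a miss, encode and insert it.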
For the lossless package compressor, ideally the patch generator would be able to decode the compression and wouldn't generate patches if only the compressed version changed.
To be clear, I'm assuming here that you are compressing assets individually, or even in smaller sub-asset chunks, or some kind of paging unit. I'm assuming you are NOT compressing whole packages as single compression units; if you did that, a single byte changing in any asset could change the entire compressed unit, which is a bad idea for patching.
(note that encryption also has the same property of taking a single byte change and spreading it around the chunk, so encryption should usually be done on the same or smaller chunk than the compression)
With most existing patchers that are not compression-aware, if you change the compressor (for example by changing the encode level option, or updating to a newer version), the compressed bytes will change and generate large patches. What they should ideally do is see that while the compressed bytes changed, the decompressed bytes are the same, so no patch is needed, and the old version of the compressed bytes can be retained. This would allow you to deploy new compressors and have them used for all new content and gradually roll out, without generating unnecessary large patches.
The Sony PS5 will have the fastest data loading ever available in a mass market consumer device, and we think it may be even better than you have previously heard. What makes that possible is a fast SSD, an excellent IO stack that is fully independent of the CPU, and the Kraken hardware decoder. Kraken compression acts as a multiplier for the IO speed and disk capacity, storing more games and loading faster in proportion to the compression ratio.
Sony has previously published that the SSD is capable of 5.5 GB/s and expected decompressed bandwidth around 8-9 GB/s, based on measurements of average compression ratios of games around 1.5 to 1. While Kraken is an excellent generic compressor, it struggled to find usable patterns on a crucial type of content : GPU textures, which make up a large fraction of game content. Since then we've made huge progress on improving the compression ratio of GPU textures, with Oodle Texture which encodes them such that subsequent Kraken compression can find patterns it can exploit. The result is that we expect the average compression ratio of games to be much better in the future, closer to 2 to 1.
Oodle Kraken is the lossless data compression we invented at RAD Game Tools, which gets very high compression ratios and is also very fast to decode. Kraken is uniquely well suited to compress game content and keep up with the speed requirements of the fast SSD without ever being the bottleneck. We originally developed Oodle Kraken as software for modern CPUs. In Kraken our goal was to reformulate traditional dictionary compression to maximize instruction level parallelism in the CPU, with lots of independent work running at all times and a minimum of serial dependencies and branches. Adapting it for hardware was a new challenge, but it turned out that the design decisions we had made to make Kraken great on modern CPUs were also exactly what was needed to be good in hardware.
The Kraken decoder acts as an effective speed multiplier for data loading. Data is stored compressed on the SSD and decoded transparently at load time on PS5. What the game sees is the rate that it receives decompressed data, which is equal to the SSD speed multiplied by the compression ratio.
Good data compression also improves game download times, and lets you store more games on disk. Again the compression ratio acts as an effective multiplier for download speed and disk capacity. A game might use 80 GB uncompressed, but with 2 to 1 compression it only takes 40 GB on disk, letting you store twice as many games. A smaller disk with better compression can hold more games than a larger disk with worse compression.
When a game needs data on PS5, it makes a request to the IO system, which loads compressed data from the SSD; that is then handed to the hardware Kraken decoder, which outputs the decompressed data the game wanted to RAM. As far as the game is concerned, it just gets its decompressed data, but at higher throughput. On other platforms, Kraken can be run in software, getting the same compression gains but using CPU time to decode. When using software Kraken, you would first load the compressed data, then when that IO completes perform decompression on the CPU.
If the compression ratio was exactly 1.5 to 1, the 5.5 GB/s peak bandwidth of the SSD would decompress to 8.25 GB/s uncompressed bytes output from the Kraken decoder. Sony has estimated an average compression ratio of between 1.45 to 1 and 1.64 to 1 for games without Oodle Texture, resulting in expected decompressed bandwidth of 8-9 GB/s.
Since then, Sony has licensed our new technology Oodle Texture for all games on the PS4 and PS5. Oodle Texture lets games encode their textures so that they are drastically more compressible by Kraken, but with high visual quality. Textures often make up the majority of content of large games and prior to Oodle Texture were difficult to compress for general purpose compressors like Kraken.
The combination of Oodle Texture and Kraken can give very large gains in compression ratio. For example on a texture set from a recent game :
| Zip | 1.64 to 1 |
| Kraken | 1.82 to 1 |
| Zip + Oodle Texture | 2.69 to 1 |
| Kraken + Oodle Texture | 3.16 to 1 |
Kraken plus Oodle Texture gets nearly double the compression of Zip alone on this texture set.
Oodle Texture is a software library that game developers use at content creation time to compile their source art into GPU-ready BC1-7 formats. All games use GPU texture encoders, but previous encoders did not optimize the compiled textures for compression like Oodle Texture does. Not all games at launch of PS5 will be using Oodle Texture as it's a very new technology, but we expect it to be in the majority of PS5 games in the future. Because of this we expect the average compression ratio and therefore the effective IO speed to be even better than previously estimated.
The most common alternative to Kraken would be the well known Zip compressor (aka "zlib" or "deflate"). Zip hardware decoders are readily available, but Kraken has special advantages over Zip for this application. Kraken gets more compression than Zip because it's able to model patterns and redundancy in the data that Zip can't. Kraken is also inherently faster to decode than Zip, which in hardware translates to more bytes processed per cycle.
Kraken is a reinvention of dictionary compression for the modern world. Traditional compressors like Zip were built around the requirement of streaming with low delay. In the past it was important for compressors to be able to process a few bytes of input and immediately output a few bytes, so that encoding and decoding could be done incrementally. This was needed due to very small RAM budgets and very slow communication channels, and typical data sizes were far smaller than they are now. When loading from HDD or SSD, we always load data in chunks, so decompressing in smaller increments is not needed. Kraken is fundamentally built around decoding whole chunks, and by changing that requirement Kraken is able to work in different ways that are much more efficient for hardware.
All dictionary compressors send commands to the decoder to reproduce the uncompressed bytes. These are either a "match" to a previous substring of a specified length at an "offset" from the current output pointer in the uncompressed stream, or a "literal" for a raw byte that was not matched.
Old fashioned compressors like Zip parse the compressed bit stream serially, acting on each bit in different ways, which requires lots of branches in the decoder - does this bit tell you it's a match or a literal, how many bits of offset should I fetch, etc. This also creates an inherent data dependency, where decoding each token depends on the last, because you have to know where the previous token ends to find the next one. This means the CPU has to wait for each step of the decoder before it begins the next step. Kraken can pre-decode all the tokens it needs to form the output, then fetch them all at once and do one branchless select to form output bytes.
One of the special things about Kraken is that the encoded bit stream format is modular. Different features of the encoder can be turned on and off, such as entropy coding modes for the different components, data transforms, and string match modes. Crucially the Kraken encoder can choose these modes without re-encoding the entire stream, so it can optimize the way the encoder works for each chunk of data it sees. Orthogonality of bit stream options is a game changer; it means we can try N boolean options in only O(N) time by measuring the benefit of each option independently. If you had to re-encode for each set of options (as in traditional monolithic compressors), it would take O(2^N) time to find the best settings.
The various bit stream options do well on different types of data, and they have different performance trade offs in terms of decoder speed vs compression ratio. On the Sony PS5 we use this to make encoded bit streams that can be consumed at the peak SSD bandwidth so that the Kraken decoder is never the bottleneck. As long as the Kraken decoder is running faster than 5.5 GB/s input, we can turn on slower modes that get more compression. This lets us tune the stream to make maximum use of the time budget, to maximize the compression ratio under the constraint of always reading compressed bits from the SSD at full speed. Without this ability to tune the stream you would have very variable decode speed, so you would have to way over-provision the decoder to ensure it was never the bottleneck, and it would often be wasting computational capacity.
There are a huge number of possible compressed streams that will all decode to the same uncompressed bytes. We think of the Kraken decoder as a virtual machine that executes instructions to make output bytes, and the compressed streams are programs for that virtual machine. The Kraken encoder is then like an optimizing compiler that tries to find the best possible program to run on that virtual machine (the decoder). Previous compressors only tried to minimize the size of the compressed stream without considering how choices affect decode time. When we're encoding for a software decoder, the Kraken encoder targets a blend of decode time and size. When encoding for the PS5 hardware decoder, we look for the smallest stream that meets the speed requirement.
We designed Kraken to inherently have less variable performance than traditional dictionary compressors like Zip. All dictionary compressors work by copying matches to frequently occurring substrings; therefore they have a fast mode of decompression when they are getting lots of long string matches, where they can output many bytes per step of the decoder. Prior compressors like Zip fall into a much slower mode on hard to compress data with few matches, where only one byte at a time is output per step, and another slow mode when they have to switch back and forth between literals and short matches. In Kraken we rearrange the decoder so that more work needs to be done to output long matches, since that's already a super fast path, and we make sure the worst case is faster. Data with short matches or no matches or frequent switches between the two can still be decoded to output at least three bytes per step. This ensures that our performance is much more stable, which means the clock rate of the hardware Kraken decoder doesn't have to be as high to meet the minimum speed required.
Kraken is a powerful generic compressor that can find good compression on data with repeated patterns or structure. Some types of data are scrambled in such a way that the compressibility is hard for Kraken to find unless that data is prepared in the right way to put it in a usable form. An important case of this for games is GPU textures.
Oodle Kraken offers even bigger advantages for games when combined with Oodle Texture. Often the majority of game content is in BC1-BC7 textures. BC1-7 textures are a lossy format for the GPU that encodes 4x4 blocks of pixels into 8 or 16 byte blocks. Oodle Kraken is designed to model patterns at this kind of granularity, but with previous BC1-BC7 texture encoders there simply wasn't any pattern there to find; they were nearly incompressible with both Zip and Kraken. Oodle Texture creates BC1-7 textures in a way that has patterns in the data that Kraken can find to improve compression, but that are not visible to the human eye. Kraken can see that certain structures in the data repeat (the lengths of matches and offsets and the spacing between matches) and code them in fewer bits. This is done without expensive operations like context coding or arithmetic coding.
It's been a real pleasure working with Sony on the hardware implementation of Kraken for PS5. It has long been our mission at RAD to develop the best possible compression for games, so we're happy to see publishers and platforms taking data loading and sizes seriously.
We do "quantization" any time we take a high precision value (a floating point, or higher-bit integer) and store it in a smaller value. The quantized value has less precision. Dequantization takes you back to the space of the input and should be done to minimize the desired error function.
I want to encourage you to think of quantization like this :
quantization takes some interval or "bucket" and assigns it to a label
dequantization restores a given label to a certain restoration point
"quantization" does not necessarily take you to a linear numeric space with fewer bits
The total expected error might be what we want to minimize :
Total_Error = Sum_x P(x) * Error( x, dequantization( quantization(x) ) )
Note that in general the input values x do not have uniform probability, and the Error is not just linear L1 or L2
error, you might care about some other type of error. (you might also care more about minimizing the maximum rather
than the average error).
I like to think of the quantized space as "labels" because it may not be just a linear numerical space where you can do distance metrics - you always dequantize back to your original value space before you do math on the quantization labels.
I started thinking about this because of my recent posts on Widespread error in RGBE and Alternative quantizers for RGBE, and I've been looking in various game-related code bases and found lots of mistakes in quantization code. These are really quite big errors compared to what we work very hard to reduce. I've found this kind of thing before outside of games too. For example it's very common for the YUV conversions in video and image codecs to be quite crap, giving up lots of error for no good reason. Common errors I have seen in the YUV conversions are : using the terribad 16-235 range, mixing up the rec601/bt709 matrices so that you encode with one and decode with the other, and using terribad down and/or up filters for the chroma downsample. It's frustrating when the actual H264 layer works very hard to minimize error, but then the YUV-RGB layer outside it adds some that could be easily avoided.
We do quantization all the time. A common case is converting 8-bit RGB colors to float colors, and vice versa. We do it over and over when we do rendering passes; every time you write values out to a render target and read them back, you are quantizing and dequantizing. It is important to take care to make sure that those quantization errors are not magnified by later passes. For example when writing something like normals or lighting information, a quantization error of 1/256 can become much larger in the next stage of rendering.
(a common example of that is dot products or cosines; if you have two vectors and store something that acts like a dot product between them (or a cosine of an angle), the quantization bucket around 1.0 for the two vectors being parallel corresponds to a huge amount of angular variation, and this is often right where you care most about having good precision; it's much better to store something that's like the acos of the dot product)
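As a toy sketch of that (my own example encoding, just to illustrate; storing the angle rather than the dot product spreads the precision evenly over angles) :

#include <math.h>

// quantizing the dot product directly puts a huge angular range in the bucket next to 1.0 ;
// quantizing the angle instead gives uniform angular precision
unsigned char quantize_angle_of_dot( float dot )   // dot in [-1,1]
{
    float angle01 = acosf( dot ) * (1.f / 3.14159265f);   // 0 = parallel , 1 = anti-parallel
    return (unsigned char)( angle01 * 255.f + 0.5f );     // round to one of 256 levels
}

float dequantize_dot_from_angle( unsigned char q )
{
    return cosf( ( q * (1.f/255) ) * 3.14159265f );
}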
If you aren't going to do the analysis about how quantization errors propagate through your pipeline, then the easiest thing to do is to only quantize once, at the very end, and keep as much precision through the stages as possible. If you do something like a video codec, or an image processing pipeline, and try to work in limited precision (even 16 bit), it is important to recognize that each stage is an implicit quantization and to look at how those errors propagate through the stages.
(aside: I will mention just briefly that we commonly talk about a "float" as being the "unquantized" result of dequantization; of course that's not quite right. A "float" is a quantized representation of a real number, it just has variable size quantization bins, smaller bins for smaller numbers, but it's still quantized with steps of 1 ulp (unit in last place). More correctly, going to float is not dequantization, but rather requantization to a higher precision quantizer. The analysis of propagating through quantization error to work in 8 bits or whatever is the same you should do for how float error propagates through a series of operations. That said I will henceforth be sloppy and mostly talk about floats as "dequantized" and assume that 1 ulp is much smaller than precision that we care about.)
So lets go back and start at the beginning :
If our input values x are all equally probable ( P(x) is a constant ), and the error metric we care about is linear L1 or L2 norm, then the optimal quantizer is just equal size buckets with restoration to center of bucket.
(for L1 norm the total error is actually the same for any restoration point in the bucket; for L2 norm total error is minimized at center of bucket; for L1 norm the maximum error is minimized at center of bucket)
We'll now specifically look at the case of an input value in [0,1) and quantizing to N buckets. The primary options are :
int quantize_floor( float x , int N )
{
return (int)( x * N );
// or floor( x * N );
// output is in [0, N-1] , input x in [0,1) not including 1.0
}
float dequantize_floor( int q, int N )
{
return (q + 0.5f ) * (1.f / N);
}
int quantize_centered( float x, int N )
{
return (int)( x * (N-1) + 0.5f );
// or round( x * (N-1) )
// output is in [0, N-1] , input x in [0,1] , including 1.0 is okay
}
float dequantize_centered( int q, int N )
{
return q * (1.f / (N-1));
}
The rule of thumb for these quantizers is you either bias by 0.5 in the quantizer, or in the dequantizer. You must bias on one side or the other, not both and not neither! The "floor" quantizer is "bias on dequant", while the "centered" quantizer is "bias on quant".
Visually they look like this, for the case of N = 4 :
(the top is "floor" quantization, the bottom is "centered")
(the top is "floor" quantization, the bottom is "centered")
In both cases we have 4 buckets and 4 restoration points. In the "floor" case the terminal bucket boundaries correspond to the boundaries of the [0,1) input interval. In the "centered" case, the terminal buckets are centered on the [0,1) endpoint, which means the bucket boundaries actually go past the end, but they restore exactly to the endpoints.
If your input values are actually all equally likely and the error metric that you care about is just L2 norm, then "floor" quantization is strictly better. You can see that the bucket size for "floor" quantization is 1/4 vs. 1/3 for "centered", which means the maximum error after dequantization is 1/8 vs. 1/6.
In practice we often care more about the endpoints or the integers, not just average or maximum error; we suspect the probability P(x) for x = 0 and 1 is higher, and the error metric Error( dequantization( quantization(x) ) - x ) may also be non-linear, giving higher weight to the error when x = 0 and 1.
"centered" quantization also has the property of preserving integers. For example say your input range was [0,255) in floats. If you quantize to N=256 buckets with "centered" quantization, it will restore exactly to the integers.
While in theory there are cases where you might want to use either type of quantization, if you are in games don't do that!
The reason is that the GPU standard for UNORM colors has chosen "centered" quantization, so you should do that too. Certainly you need to do that for anything that interacts with the GPU and textures, but I encourage you to just do it for all your quantization, because it leads to confusion and bugs if you have multiple different conventions of quantizer in your codebase.
The GPU UNORM convention is :
float dequantize_U8_UNORM( unsigned char u8 )
{
return u8 * (1.f/255);
}
which implies centered quantization, so please use centered quantization everywhere in games. That means : bias 0.5 on quantize,
no bias on dequantize.
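The matching quantizer to go with that convention is then (a sketch) :

unsigned char quantize_U8_UNORM( float f )
{
    // centered quantization : bias 0.5 on quantize, no bias on dequantize
    // f in [0,1] ; clamp first if it might be outside that range
    return (unsigned char)( f * 255.f + 0.5f );
}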
While on the topic of UNORM, let's look at conversion between quantized spaces with different precision. Let's do U8 UNORM to U16 UNORM for example.
The way to get that right is to think about it as dequantization followed by quantization. We dequantize the U8 UNORM back to
real numbers, then quantize real numbers back to U16 :
dequant = u8 * (1.f/255);
u16 = round( dequant * 65535 );
u16 = round( u8 * (1.f/255) * 65535 );
u16 = round( u8 * 257 );
u16 = u8 * 257;
u16 = u8 * (256 + 1);
u16 = (u8<<8) + u8;
So U8 to U16 re-quantization for UNORM is : take the U8 value, and replicate it shifted up by 8.
requantize U8 UNORM to U16 UNORM :
0xAB -> 0xABAB
This obviously has the necessary property that 00 stays zero, and 0xFF becomes 0xFFFF, so 1.0 is preserved.
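In code, the whole derivation collapses to (a sketch) :

unsigned short requantize_U8_to_U16_UNORM( unsigned char u8 )
{
    // 0xAB -> 0xABAB ; exactly equal to dequantizing to [0,1] and re-quantizing to 16 bits
    return (unsigned short)( ( u8 << 8 ) | u8 );
}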
This is something we call "bit replication". Let's take a moment to see why it works exactly in some cases and only approximately in others.
Bit replication is often used in games to change the bit count of a quantized value (to "requantize" it).
For example it's used to take 5-bit colors in BC1 to 8-bit :
The top 3 bits of the 5-bit value are replicated to the bottom :
abcde -> abcde|abc
giving an 8 bit value
Bit replication clearly gets the boundary cases right : all 0 bits to all 0's (dequantizes to 0.0), and all 1 bits
to all 1 bits (dequantizes to 1.0); in between bit replication linearly increases the low bits between those
endpoints, so it's obviously sort of what you want. In some cases bit replication corresponds exactly to requantization,
but not in others.
With a B-bit UNORM value, it has N = 2^B values. The important thing for quantization is the denominator (N-1).
For example with a 5-bit value, (N-1) = 31 is the denominator. It becomes clear if we think about requantization as
changing the *denominator* of a fraction.
Requantization from 5 bits to 10 bits is changing the denominator from 31 to 1023 :
dequant( 5b ) = 5b / 31.0;
requant_10( x ) = round( x * 1023.0 );
requant_5_to_10 = round( x * 1023 / 31 );
1023/31 = 33 exactly, so :
requant_5_to_10 = x * 33
in integers. And 33 = (32 + 1) = shift up 5 and replicate
requantization from 5 to 10 bits is just duplicating the bits shifted up
abcde -> abcde|abcde
What that means is bit replication from B to 2B is exactly equal to what you would get if you dequantized that number to
UNORM and requantized it again.
This is of course general for any B :
denominator for B is (N-1)
denominator for 2B is (N^2 - 1)
requantization is *= (N^2 - 1) / (N-1)
(N^2 - 1) = (N-1) * (N+1)
so
requantization is *= (N+1)
which is bit replication
Now more generally for bit replication to some number of bits that's not just double (but <= double, eg. between
B and 2B) :
b between B and 2B
n = 2^b
requant_B_to_b(x) = round( x * (n-1) / (N-1) )
requant_B_to_b(x) = round( x * (N+1) * (n-1) / (N^2-1) )
requant_B_to_b(x) = round( (x bit replicated to 2B) * ( scale down ) )
bit replication from B to b is :
bitrep(x) = (x bit replicated to 2B) >> (2B - b)
that is, just replicate to 2B and then truncate low bits to get to b
when b = 2B , these are exactly equal as we showed above
obviously also at b = B (NOP)
and also at b = B+1 (adding one bit)
in the range b = [B+2, 2B-1] they are not quite exactly equal, but close
Let's look at an example, 5 bits -> 8 bits :
bitdouble( 5b ) = (5b * 33)
requant_5_to_8(5b) = round( (5b * 33) * ( 255.0 / 1023.0 ) )
bitrep_5_to_8(5b) = (5b * 33) >> 2
we can see where the small difference comes from :
bit replication just truncates off the 2 bottom bits
requantization does * (255/1023) , which is almost a /4 (like >>2) but not quite
and the requantization also rounds instead of truncating
So we see that bit replication is similar to centered UNORM requantization, but not quite the same.
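You can see exactly how close they are by enumerating all 32 values (a standalone check, not Oodle code) :

#include <stdio.h>

int main()
{
    for ( int x = 0; x < 32; x++ )
    {
        int bitrep  = ( x << 3 ) | ( x >> 2 );             // abcde -> abcde|abc
        int requant = (int)( x * 255.0 / 31.0 + 0.5 );     // round( x * 255 / 31 )
        if ( bitrep != requant )
            printf( "x=%2d : bitrep=%3d requant=%3d\n", x, bitrep, requant );
    }
    return 0;
}
// only a few values differ (eg. x=3 : bitrep 24, requant 25), and never by more than 1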
Now, bit replication is used in BC7, ASTC, etc. Is it a source of error? No, not if you do your encoder right. What it does mean is that you can't just find the 5-bit color value by doing a centered quantizer to 5 bits. Instead you have to ask what each 5-bit value bit-replicates to, and find the one whose reconstruction is closest to your input.
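A sketch of that search (brute force over the 32 candidates; a real encoder would fold this into its endpoint search or use a table) :

#include <stdlib.h>   // abs

// find the 5-bit value whose bit-replicated 8-bit reconstruction is closest to the target
int best_5bit_for_target8( int target8 )
{
    int best = 0, bestErr = 256;
    for ( int v = 0; v < 32; v++ )
    {
        int recon = ( v << 3 ) | ( v >> 2 );    // what the decoder will actually reconstruct
        int err = abs( recon - target8 );
        if ( err < bestErr ) { bestErr = err; best = v; }
    }
    return best;
}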
So far we've talked about quantizing finite ranges, specifically [0,1) but you can map any other finite range to that interval. Let's have a brief look at quantizing infinite ranges.
If you just quantize a signed number to a signed quantized number, then you can use the above _floor or _centered quantizers without thinking any more about it. You will have uniform buckets across the whole number line. But what we often want to do is take a signed input number and quantize it to *unsigned* and separate out the sign bit, to create a sign+magnitude representation. (this makes the most sense with values whose probability P(x) is symmetric about zero and whose mean is at zero; eg. after a transform that subtracts off the mean)
One reason we might want to do that is because most of our schemes for sending unbounded (variable length) numbers work on unsigned numbers. For example : Encode Mod and Exp Golomb .
Now one option would be to quantize to signed ints and then Fold up Negatives to make an unsigned number to feed to your variable length scheme.
There are reasons we don't like that in data compression. Folded up negatives have a number line like : {0, -1, 1, -2, 2, -3 ... }
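That fold and its inverse are just (a sketch) :

// {0,-1,1,-2,2,-3,...} -> {0,1,2,3,4,5,...}
unsigned fold_negatives( int x )
{
    return ( x >= 0 ) ? (unsigned)( 2*x ) : (unsigned)( -2*x - 1 );
}

int unfold_negatives( unsigned u )
{
    return ( u & 1 ) ? -(int)( ( u + 1 ) >> 1 ) : (int)( u >> 1 );
}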
The annoying thing about that for data compression is that if you have a probability model like a Laplacian that decreases with absolutely value of x, the probabilities have these steps where values are repeated : { P(0), P(1), P(1), P(2), P(2), ... } and coding them with something like exp-golomb is no longer quite correct as they don't progressively fall off. Some codecs in the past have used tricks to reduce this (eg. JPEG-LS and CALIC) by doing things like being able to flip the sign so that you get either {0, -1, 1, -2, ... } or {0, 1, -1, 2, ... } depending on whether positive or negative is more probable.
Rather than do all that, let's assume you want to extract the sign bit and send it separately. So you are sending only the magnitude.
So we have taken the sign out and now only have a one sided interval [0, inf) to quantize.
You can take that one-sided interval and just apply floor or centered quantization to it :
unsigned half_line_quantize( float x, bool centered )
{
    ASSERT( x >= 0.f );
    // centered == false : floor quantizer, same as floor( x )
    // centered == true  : centered quantizer, same as round( x )
    float bias = centered ? 0.5f : 0.f;
    return (unsigned) ( x + bias );
}
but something a bit funny has happened.
Floor and centered quantization now just act to shift where the boundary of the 0 bin is. But the 0 bin gets mirrored to both sides of zero once the sign is added back, so to make the 0 bin the same total size as the other bins, its boundary on the half line should be at 0.5 (half the size of the other bins on the half interval). (I'm assuming here that your quantization bucket size is 1.0 ; for general sized quantization buckets just scale x before it gets here).
It's clear that the zero bin is a bit special, so we usually just go ahead and special case it :
int signed_line_quantizer( float x, float deadzone )
{
    // x signed
    float ax = fabsf(x);
    if ( ax < deadzone )
    {
        // special bucket for zero :
        // don't send sign bit
        return 0;
    }
    else
    {
        // do send sign bit of x
        // floor quantizer above the zero bucket ; non-zero labels start at 1 :
        return 1 + (int) floorf( ax - deadzone );
    }
}
Now if you want the zero bucket to have the same size as all others, you would set deadzone = 0.5 (it's half the zero
bucket size on the full line). If you want to use a uniform floor quantizer on the half line, that would correspond to
deadzone = 1.0 (making the zero bucket actually twice the size of others after mirroring to the negative half of
the line).
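The matching dequantizer for that quantizer would be something like this (a sketch that restores non-zero buckets to their centers; as discussed below, in practice you'd restore biased towards zero) :

float signed_line_dequantizer( int q, bool negative, float deadzone )
{
    if ( q == 0 ) return 0.f;   // zero bucket, no sign bit was sent
    // non-zero bucket q covers [ deadzone + (q-1) , deadzone + q ) on the half line
    float mag = deadzone + (float)q - 0.5f;
    return negative ? -mag : mag;
}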
What's been found in data compression is that a "deadzone" larger than equal size buckets (larger than 0.5) is beneficial. There are two primary reasons :
We use codecs where coding zeros is especially cheap, so sending more zeros is very desirable. So larger deadzone in the quantizer will give you more zeros, hence cheaper coding, and this is a greater benefit than the loss in quality. This is sort of a hacky way of doing some rate-distortion optimization, like trellis quantization but without any work.
The other reason is perceptual modeling; many human perception systems (eyes and ears) are less sensitive to the initial onset of a signal than they are to variations once the signal is present. Signals near zero are not detected by humans at all until they reach some threshold, and then once they pass the threshold there's a finer discrimination of level. For example the human ear might not detect a harmonic until it is 10 dB, but then distinguish volume levels at 1 dB changes after that.
Essentially your quantizer has two parameters, the bucket size for zero, and then the bucket size for values above zero. This is a very simple form of a more general variable quantizer.
In theory you would like to have variable size bins, such that each bin corresponds to an equal amount of perceptual importance (eg. larger bins where the values are less important). For the most part we now do that by applying a nonlinear transformation to the value before it reaches the uniform quantizer, rather than trying to do variable size bins. For example you might take log(x) before quantizing if you think precision of high values is less important. Another common example is the "gamma corrected" color space (or sRGB) for images; that's a non-linear transform applied to the signal (roughly pow 2.2) to map it to a space that's more perceptually uniform so that the quantization buckets give more precision where it's needed.
Something to watch out for is that a lot of code uses a deadzone quantizer without being clear about it.
If you see something like :
int half_line_quantizer_thats_actually_a_deadzone( float x )
{
ASSERT( x >= 0.f );
return (int) x;
}
That's actually a deadzone quantizer with a 2x sized zero bin, if it's being used after sign removal.
In the olden days, variable-size quantization buckets were used as a kind of entropy coder. They would have smaller buckets in higher probability regions and larger buckets in lower probability regions, so that the quantized output value had equal probability for all bins. Then you could send the quantized value with no entropy coding. This is now almost never done, it's better to use quantization purely for error metric optimization and use a separate entropy coder on the output.
Just briefly some topics in dequantization.
For values that are all equally likely, under an L2 (SSD/RMSE) error norm, dequantization to the center of the bucket is optimal. More generally the restoration point for each bucket should minimize the error metric weighted by the probability of that input value.
An easy case is with an L2 error metric but a non-uniform probability. Then the error in a given bucket for a restoration point is :
L2 error of restoring to r in this bucket :
E = Sum_x P(x) * ( r - x )^2
( Sum_x for x's in this bucket )
find r that minimizes E by taking d/dr and setting to zero :
d/dr E = 0
d/dr E = Sum_x P(x) * ( r - x ) * 2
Sum_x P(x) * ( r - x ) = 0
Sum_x P(x) * r = Sum_x P(x) * x
r = ( Sum_x P(x) * x ) / ( Sum_x P(x) )
that's just the expectation value of x in the bucket
we should restore to the average expected 'x' value in the bucket.
A common case of that is for a skewed probability distribution - something like Laplacian or Poisson with a falloff of probabilities away from the peak - we should restore each bucket to a value that's skewed slightly towards the peak, rather than restoring the center.
Now if you have a mathematical model of P(x) then you could compute where these centers should be, and perhaps store them in a table.
What's often better in practice is just to measure them experimentally. Do trial runs and record all the values that fall into each quantization bucket and take their mean - that's your restoration point.
Then you could store those measured restoration points in constants in your code, OR you could measure them and store them per-data item. (for example an image compressor could transmit them per image - maybe not all but a few of the most important ones).
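A sketch of measuring those restoration points (the quantize callback stands in for whatever quantizer you are actually using) :

#include <stdlib.h>   // calloc, free

// the mean of the source values landing in each bucket is the L2-optimal restoration point for that data
void measure_restoration_points( const float * values, int count,
                                 int num_buckets, int (*quantize)(float),
                                 float * restore_out )
{
    double * sum = (double *) calloc( num_buckets, sizeof(double) );
    int    * n   = (int *)    calloc( num_buckets, sizeof(int) );
    for ( int i = 0; i < count; i++ )
    {
        int q = quantize( values[i] );
        sum[q] += values[i];
        n[q] += 1;
    }
    for ( int q = 0; q < num_buckets; q++ )
    {
        // empty buckets : fall back to whatever your analytic dequantizer would restore
        restore_out[q] = ( n[q] > 0 ) ? (float)( sum[q] / n[q] ) : 0.f;
    }
    free( sum );
    free( n );
}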
Another thing you can do in dequantization is to not always restore to the same point. I noted briefly previously that if what you care about is L1 norm, then any restoration point in the bucket has the same error. Rather than just pick one, you could restore to any random point in the bucket and that would give the same expected L1 norm.
L2 norm strongly prefers the mean (minimizing L2 is blurring or smoothing, while L1 allows lots of noise), but perceptually it may be better to add some randomness. You could restore to mean in the bucket plus a small amplitude of noise around there. Again this noise could be global constant, or could be sent per-image, or per-band; it could also be predicted from local context so you could have more or less noisy areas.
Note that adding noise in dequantization is not the same as just adding noise arbitrarily after the fact. The values
are still within the quantization bucket, so they could have been the true source values. That is, we can reframe
dequantization as trying to guess the source given the quantized version :
Encoder had original image I
made Q = quant( I )
Q was transmitted
rather than just run I' = dequant( Q )
we instead pose it as :
we want to find I'
such that
Q = quant( I' )
and I' has the maximum probability of being the original I
or I' has the most perceptual similarity to our guess of I
The key thing here is that noise within the quantization bucket keeps the constraint Q = quant(I') satisfied.
As an example I'll mention something I've done in the past for wavelet bit-plane truncation.
Wavelet coding converts an image into activity residuals at various frequency subbands. These are initially quantized with a uniform+deadzone quantizer (if a floating point wavelet transform was used). Then in many codecs they are sent progressively in bit planes, so the highest bits are sent first, then lower bits, so that you get the most important bits first. You can then truncate the stream, cutting off transmission of lower bits in the higher subbands, effectively increasing the quantizer there. This is done in JPEG2000 with the EBCOT scheme for example.
So a given wavelet residual might be sent like :
value 45
= 101101
only top 2 bits sent :
10xxxx
the others are cut off.
In the decoder you know which bits you got and which are missing, which is equivalent to a larger quantization
bucket.
The classic option (eg. SPIHT) was just to fill the lost xx bits with zeros :
10xxxx -> 100000
This makes values that are too low and is generally very smoothing (high frequency detail just goes away)
You might think, it's a quantization bucket, we should restore to the middle, which is 0.5 which is the
next bit on :
10xxxx -> 101000 or 100111
That is much too high, it's larger than the expectation and actually looks like a sharpen filter.
The reason is that wavelet amplitudes have P(x) strongly skewed towards zero, so the mean value is
way below the middle of the bucket.
Restoring to 0.25 is a bit better :
10xxxx -> 100100
but even better is to just measure what is the mean in the image for each missing bit count; that
mean depends on how large our value was (the part that's not truncated).
Finally in addition to restoring the missing bits to mean, you could add randomness in the dequantization,
either within the quantization bucket (below the bottom bit), or in the low part of the missing bits
(eg. if 4 bits are missing the bottom 2 might get some randomness). You can compute the amount of
randomness desired such that the decompressed image matches the high frequency energy of the original image.
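A toy sketch of that kind of restoration (my own version, with a per-truncation-level mean fraction and optional dither; a real codec would measure the means per subband) :

#include <stdlib.h>   // rand

// restore a wavelet magnitude whose bottom 'missing_bits' were truncated away
// mean_fraction = measured mean of the missing part, as a fraction of its range
// (0.5 = middle of the bucket ; skewed wavelet distributions want less, eg. ~0.25 or a measured value)
int restore_truncated( int kept_bits_value, int missing_bits, float mean_fraction, int noise_bits )
{
    int range = 1 << missing_bits;
    int fill  = (int)( mean_fraction * range + 0.5f );
    if ( noise_bits > 0 )
    {
        int noise_range = 1 << noise_bits;
        fill += ( rand() % noise_range ) - noise_range/2;   // small dither around the mean
        if ( fill < 0 ) fill = 0;
        if ( fill > range-1 ) fill = range-1;
    }
    return ( kept_bits_value << missing_bits ) + fill;
}
// eg. "10xxxx" : kept_bits_value = 2 (binary 10), missing_bits = 4 ;
// mean_fraction 0.5 restores 101000, 0.25 restores 100100, matching the example above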
And that's enough on quantization for now!
In Oodle, "BC1_WithTransparency" doesn't necessarily mean that the texture has any 1-bit transparency. (for background: BC1 can encoding alpha values of 0 or 255 (1.0), which binary on/off alpha; when alpha is 0 the color is always 0 or black). It means that the transparency bit is preserved. In our normal "BC1" encoder, we assume that the alpha value will not be read, so we are free to choose the encoding to maximize RGB quality.
The choice of "BC1_WithTransparency" vs "BC1" should not be made based on whether the source has alpha or not, it should be made based on whether your shader will *consume* alpha or not. So for example if you just have opaque RGB textures and you need to be able to pipe them to a shader that reads A to do alpha blending, and you are unable to manage the book-keeping to pipe constant 1.0 as a shader source for the A channel, then you must use "BC1_WithTransparency" to encoding opaque alpha in the texture.
When possible, we think it is best to use the "BC1" format which does not preserve alpha. Most people do not actually use binary transparency in BC1 (even for textures where the alpha is binary in the top mip, you typically need full alpha to make decent mips, so you should use BC7), they use BC1 for opaque textures. On opaque textures the "BC1" format that is free to change alpha can give much higher quality. You can then just map constant 1.0 as the A value source for the shader when binding a texture that is marked as opaque.
We understand that is not always practical in your pipeline, so we are trying to make "BC1_WithTransparency" work as well as possible.
Our RDO for "BC1_WithTransparency" will never change the binary alpha state of a pixel. Because of this the RMSE is actually only RGB, the A values will never differ from the original, assuming the original only had alpha values of 0 and 255 in U8.
An example of the quality of "BC1_WithTransparency" RDO on the "mysoup" image available in the
Oodle Texture sample run :
otexdds bc1 mysoup1024.png r:\mysoup1024.dds --verbose
OodleTex_BC1 RMSE per texel: 7.0511
otexdds bc1a mysoup1024.png r:\mysoup1024.dds --verbose
OodleTex_BC1_WithTransparency RMSE per texel: 7.0510
otexdds bc1 mysoup1024.png r:\mysoup1024.dds --verbose --rdo
OodleTex_BC1 RMSE per texel: 7.5995
otexdds bc1a mysoup1024.png r:\mysoup1024.dds --verbose --rdo
OodleTex_BC1_WithTransparency RMSE per texel: 7.6006
On photographic images like this without a lot of opaque-black, the quality of "BC1_WithTransparency" is
almost identical to "BC1".
On images that mix opaque black and other colors, the quality difference can be severe :
otexdds bc1 frymire.png r:\out.dds --verbose
OodleTex_BC1 RMSE per texel: 6.1506
otexdds bc1a frymire.png r:\out.dds --verbose
OodleTex_BC1_WithTransparency RMSE per texel: 12.1483
On "Frymire" from the Waterloo Bragzone set, the RMSE is nearly double with "BC1_WithTransparency".
We have also updated the Unreal integration for Oodle Texture to use "BC1_WithTransparency", as Unreal expects to be able to fetch opaque A from the texture on all BC1 encodings. Prior to 2.8.11 we were incorrectly using our "BC1" format in Unreal, which could change opaque black texels to transparent black.
Note that "BC1_WithTransparency" RDO's roughly the same as "BC1", so we expect compressed sizes to stay roughly the same.
I thought I'd have a look at how various options for the back end lossless compressor do on BCN texture data after Oodle Texture RDO. (Oodle 2.8.9)
127,822,976 bytes of BC1-7 sample data from a game. BC1,3,4,5, and 7. Mix of diffuse, normals, etc. The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.
"baseline" is the non-RDO encoding to BCN by Oodle Texture. "rdo lambda 40" is a medium quality RDO run; at that level visual degradation is just starting to become easier to spot (lambda 30 and below is high quality).
baseline, by ratio:
ooLeviathan8 : 1.79:1 , 1.4 enc MB/s , 1069.7 dec MB/s
lzma_def9 : 1.79:1 , 8.7 enc MB/s , 34.4 dec MB/s
ooKraken8 : 1.76:1 , 2.2 enc MB/s , 1743.5 dec MB/s
ooMermaid8 : 1.71:1 , 4.9 enc MB/s , 3268.7 dec MB/s
zstd22 : 1.70:1 , 4.5 enc MB/s , 648.7 dec MB/s
zlib9 : 1.64:1 , 15.1 enc MB/s , 316.3 dec MB/s
lz4hc1 : 1.55:1 , 72.9 enc MB/s , 4657.8 dec MB/s
ooSelkie8 : 1.53:1 , 7.4 enc MB/s , 7028.2 dec MB/s
rdo lambda 40, by ratio:
lzma_def9 : 3.19:1 , 7.7 enc MB/s , 60.7 dec MB/s
ooLeviathan8 : 3.18:1 , 1.1 enc MB/s , 1139.3 dec MB/s
ooKraken8 : 3.13:1 , 1.7 enc MB/s , 1902.9 dec MB/s
ooMermaid8 : 3.01:1 , 4.2 enc MB/s , 3050.6 dec MB/s
zstd22 : 2.88:1 , 3.3 enc MB/s , 733.9 dec MB/s
zlib9 : 2.69:1 , 16.5 enc MB/s , 415.3 dec MB/s
ooSelkie8 : 2.41:1 , 6.6 enc MB/s , 6010.1 dec MB/s
lz4hc1 : 2.41:1 , 106.6 enc MB/s , 4244.5 dec MB/s
[log-log chart : the compressors above on the RDO lambda 40 data]
If you compare the log-log charts before & after RDO, it's easy to see that the relative position of all the compressors is basically unchanged, they just all get more compression.
The output size from baseline divided by the output size from post-RDO is the compression improvement
factor. For each compressor it is :
ooLeviathan8 : 1.7765
ooKraken8 : 1.7784
ooMermaid8 : 1.7602
ooSelkie8 : 1.5548
lzma_def9 : 1.7821
zstd22 : 1.6941
zlib9 : 1.6402
lz4hc1 : 1.5751
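(Equivalently, since the input data is the same in both runs, the improvement factor is just the post-RDO compression ratio divided by the baseline ratio; eg. for ooKraken8 : 3.13 / 1.76 ≈ 1.78.)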
Leviathan, Kraken, Mermaid and LZMA all improve around 1.77X; ZStd and Zlib a little less (1.65X - 1.70X); LZ4 and Selkie less still (1.55X - 1.57X).
Basically the stronger compressors (on this type of data) get more help from RDO and their advantage grows.
ZStd is stronger than Mermaid on many types of data, but Mermaid is particularly good on BCN.
* : Caveat on ZStd & LZ4 speed here : this is a run of all compressors built with MSVC 2017 on my AMD reference machine. ZStd & LZ4 have very poor speed in their MSVC build; they do much better in a clang build. Their clang build can be around 1.5X faster; ZStd-clang is usually slightly faster to decode than Leviathan, not slower. LZ4-clang is probably similar in decode speed to Selkie. The speed numbers for ZStd & LZ4 here should not be taken literally.
It is common that the more powerful compressors speed up (decompression) slightly on RDO data because they speed up with higher compression ratios, while the weaker compressors (LZ4 and Selkie) slow down slightly on RDO data (because they are often in the incompressible path on baseline BCN, which is a fast path).
Looking at the log-log plots some things stand out to me as different than generic data :
Leviathan, Kraken & Mermaid have a smaller gap than usual. Their compression ratio on this data is quite similar, usually there's a bigger step, but here the line connecting them in log-log space is more horizontal. This makes Mermaid more attractive because you're not losing much compression ratio for the speed gains. (for example, Mermaid + BC7Prep is much better for space & speed than Kraken alone).
ZStd is relatively poor on this type of data. Usually it has more compression than Mermaid and is closer to Kraken, here it's lagging quite far behind, and Mermaid is significantly better.
Selkie is relatively poor on this type of data. Usually Selkie beats LZ4 for compression ratio (sometimes it even beats zlib), but here it's just slightly worse than LZ4. Part of that is the 256 KB chunking is not allowing Selkie to do long-distance matches, but that's not the main issue. Mermaid looks like a much better choice than Selkie here.
Another BCN data set :
358,883,720 bytes of BCN data. Mostly BC7 with a bit of BC6. Mix of diffuse, normals, etc. The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.
baseline (non-RDO), by ratio:
ooLeviathan8 : 1.89:1 , 1.1 enc MB/s , 937.0 dec MB/s
lzma_def9 : 1.88:1 , 7.6 enc MB/s , 35.9 dec MB/s
ooKraken8 : 1.85:1 , 1.7 enc MB/s , 1567.5 dec MB/s
ooMermaid8 : 1.77:1 , 4.3 enc MB/s , 3295.8 dec MB/s
zstd22 : 1.76:1 , 3.9 enc MB/s , 645.6 dec MB/s
zlib9 : 1.69:1 , 11.1 enc MB/s , 312.2 dec MB/s
lz4hc1 : 1.60:1 , 73.3 enc MB/s , 4659.9 dec MB/s
ooSelkie8 : 1.60:1 , 7.0 enc MB/s , 8084.8 dec MB/s
[log-log chart : the compressors above on the baseline (non-RDO) data]
rdo lambda 40, by ratio:
lzma_def9 : 4.06:1 , 7.2 enc MB/s , 75.2 dec MB/s
ooLeviathan8 : 4.05:1 , 0.8 enc MB/s , 1167.3 dec MB/s
ooKraken8 : 3.99:1 , 1.3 enc MB/s , 1919.3 dec MB/s
ooMermaid8 : 3.69:1 , 3.9 enc MB/s , 2917.8 dec MB/s
zstd22 : 3.65:1 , 2.9 enc MB/s , 760.0 dec MB/s
zlib9 : 3.36:1 , 19.1 enc MB/s , 438.9 dec MB/s
ooSelkie8 : 2.93:1 , 6.2 enc MB/s , 4987.6 dec MB/s
lz4hc1 : 2.80:1 , 114.8 enc MB/s , 4529.0 dec MB/s
[log-log chart : the compressors above on the RDO lambda 40 data]
On this data set, Mermaid lags the stronger compressors by a bit more, and it's almost equal to ZStd. On BCN data, the strong compressors (LZMA, Leviathan, & Kraken) have less difference in compression ratio than they do on some other types of data. On this data set, Selkie pulls ahead of LZ4 after RDO, as the increased compressibility of post-RDO data helps it find some gains. Zlib, LZ4, and Selkie have almost identical compression ratios on the baseline pre-RDO data, but zlib pulls ahead post-RDO.
The improvement factors are :
ooLeviathan8 : 2.154
ooKraken8 : 2.157
ooMermaid8 : 2.085
ooSelkie8 : 1.831
lzma_def9 : 2.148
zstd22 : 2.074
zlib9 : 1.988
lz4hc1 : 1.750
Similar pattern, around 2.15X for the stronger compressors, around 2.08X for the medium ones, and under 2.0 for the weaker ones.
Oodle Texture works great with all the lossless LZ coders tested here. We expect it to work well with all packaging systems.
The compression improvement factor from Oodle Texture is similar and good for all the compressors, but stronger compressors like Oodle Kraken are able to get even more benefit from the entropy reduction of Oodle Texture. Not only do they start out with more compression on baseline non-RDO data, they also improve by a larger multiplier on RDO data.
The Oodle Data lossless compressors are particularly good on BCN data; they are even stronger relative to alternatives like zlib and ZStd than they are on some other data types. For example Oodle Mermaid is often slightly lower compression than ZStd on other data types, but is slightly higher compression than ZStd on BCN.
Mermaid has a substantial compression advantage over zlib on post-RDO BCN data, and decompresses 5-10X faster, making Mermaid a huge win over software zlib (zip/deflate/inflate).
Oodle Texture RDO is always going to be slower than non-RDO encoding; it simply has to do a lot more work. It has to search many possible encodings of the block to BCN, it has to evaluate those possible encodings for both R & D, it has to use more sophisticated D functions, and it has to search for possible good encodings in a non-convex search space. It simply has to be something like 5X slower than non-RDO encoding. But previously we also had a perf bug where the working set got larger than the cache size, which caused a performance cliff; that shouldn't happen any more. If you do find any performance anomalies, such as encoding on a specific texture or with specific options being much slower than expected, please contact RAD.
timerun 287 vs 289
hero_xxx_n.png ; 4096 x 4096
timerun textest bcn bc7 r:\hero_xxx_n.png r:\out.dds -r40 --w32
got opt: rdo_lagrange_parameter=40
Oodle 2.8.7 :
encode time: ~ 8.9 s
per-pixel rmse (bc7): 0.8238
---------------------------------------------
timerun: 10.881 seconds
Oodle 2.8.9 :
encode time: 4.948s
per-pixel rmse (bc7): 0.8229
---------------------------------------------
timerun: 6.818 seconds
the "timerun" time includes all loading and saving and startup, which appears to be about 1.9s ; the RDO encode
time has gone from about 8.9s to 4.95 s
(Oodle 2.8.7 textest bcn didn't log encode time so that's estimated; the default number of worker threads has changed, so use --w32 to make it equal for both runs)
The Oodle Texture integration is currently only for Oodle Texture RDO/non-RDO BCN encoders (not BC7Prep). It should be pretty simple, once you integrate it your Editor will just do Oodle Texture encodes. The texture previews in the Editor let you see how the encodings look, and that's what you pack in the game. It uses the Unreal Derived Data Cache to avoid regenerating the encodings.
We expose our "lambda" parameter via the "LossyCompressionAmount" field which is already in the Editor GUI
per texture. Our engine patches further make it so that LossyCompressionAmount inherits from LODGroup,
and if not set there, it inherits from a global default. So you can set lambda at :
per texture LossyCompressionAmount
  if Default then look at :
    LODGroup LossyCompressionAmount
      if Default then look at :
        global lambda
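To make that inheritance concrete, here is a minimal sketch of the resolution order (the struct and function names are illustrative only, not the actual Unreal or Oodle integration API) :

    #include <optional>

    struct LambdaSettings
    {
        std::optional<int> perTexture;   // per-texture LossyCompressionAmount, if not Default
        std::optional<int> lodGroup;     // LODGroup LossyCompressionAmount, if not Default
        int                globalLambda; // global default lambda
    };

    int resolve_lambda(const LambdaSettings & s)
    {
        if ( s.perTexture ) return *s.perTexture; // explicit per-texture value wins
        if ( s.lodGroup )   return *s.lodGroup;   // else inherit from the LODGroup
        return s.globalLambda;                    // else fall back to the global default
    }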
We believe that best practice is to avoid having artists tweaking lambda a lot per-texture. We recommend
leaving that at "Default" (inherit) as much as possible. The tech leads should set up the global lambda to
what's right for your game, and possibly set up the LODGroups to override that for specific texture classes.
Only rarely should you need to override on specific textures.
LIMITATIONS :
Currently our Oodle Texture for UE4 integration only works for non-console builds. (eg. Windows,Linux,Mac, host PC builds). It cannot export content for PS4/5/Xbox/Switch console builds. We will hopefully be working with Epic to fix this ASAP.
If you are a console dev, you can still try Oodle Texture for UE4, and it will work in your Editor and if you package a build for Windows, but if you do "package for PS4" it won't be used.
Sample package sizes for "InfiltratorDemo" :
InfiltratorDemo-WindowsNoEditor.pak
No compression : 2,536,094,378
No Oodle Data (Zlib), no Oodle Texture : 1,175,375,893
Yes Oodle Data, no Oodle Texture : 969,205,688
No Oodle Data (Zlib), yes Oodle Texture : 948,127,728
Oodle Data + Oodle Texture lambda=40 : 759,825,164
Oodle Texture provides great size benefit even with the default Zlib compression in Unreal, but it
works even better when combined with Oodle Data.
1. Mea Culpa.
We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.
The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.
2. Fastmail tua culpa.
Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!
Details for each :
1. Mea Culpa.
We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.
The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.
It was sort of a story of being too "mature" again.
In our image analysis process, we do a lowpass filter with a Gaussian. In coding that up, I was experimenting with lots of different ideas, so I just did a first quick dumb implementation as a temp thing to get the results and see how it worked. I always intended to come back and rewrite it in the optimization phase if it worked out. (90% of the stuff I try in the experimentation phase just gets deleted, so I try to avoid spending too much time on early implementation until we work out what method is the one we want to ship).
So we tried various things and eventually settled on a process, and came back to optimize what we settled on. I immediately thought, oh well this Gaussian filter I did was a really dumb implementation and obviously we know there are various ways to do fast implementations of that, that's an obvious place to look at speed.
But rather than just dive in and optimize it, I decided to be "mature". The mature programmer doesn't just optimize code that is obviously a bad implementation. Instead they profile, and measure how much time it is actually taking. That way you can prioritize your efforts to spend your programming time where it has the biggest impact. Any programmer work is not zero-sum; if you spend time on X it takes away time from Y, so you can't just say yes of course we should do X, you have to say "X is more important than Y". If I'm optimizing the Gaussian I'm not doing something else important.
So I profiled it, and it was ~1% of total CPU Time. So I thought hrmm, well that's surprising, but I guess it's not important to total CPU time, so I won't optimize it.
I was wrong. The problem was I tested on an image that was too small. There's a huge cliff in performance that happens when the image doesn't fit in cache.
(for people who are aware of the performance issues in image filtering, this is obvious. The main issue for CPU image filtering is the cache usage pattern; there are various ways to fix that, tiles and strips and different local access patterns; that's well known)
Images up to 1024*1024 easily fit in cache (even in 4-float format at 16 bytes per pel, that's 16 MB). Up to 2k x 2k can almost fit in the big 64 MB L3 that is increasingly common.
At 8k x 8k , a 4-float image is 1 GB. (it's unintuitive how quickly those sizes grow with resolution!). At that size you get a huge performance penalty from naive filtering implementations, which are constantly cache missing.
Foolishly, I did my one profile of this code section on a 1k x 1k image, so it looked totally fine.
The solution is simple and we'll have it out soon. (in typical Charles & Fabian obsessive perfectionism style, we can't just fix it "good enough", we have to fix it the best way possible, so we'll wind up with the super over-engineered world's best implementation) I just feel a bit embarrassed about it because doing good profiling and making smart implementation decisions is our specialty and I totally F'ed it.
I think it is an instructive case of some general principles :
1A. Profiling is hard, and a little bit of profiling is worse than none.
In this case there's a huge performance cliff when you go from working sets that fit in cache to ones that don't. That depends on cache size and machine; it can also depend on how much other CPU work is happening that's competing for cache. It depends on machine architecture, for example we've seen many compressors perform horribly on ARM big-little systems where latency to main memory can be much bigger than is typical on x86/64 desktops, because their architects did not profile on that type of machine.
Profiling is a special case of the more general "measurement fallacy". People have this very misplaced faith in a measured number. That can be extremely misleading, and in fact a bad measurement can be worse than not measuring at all. For example medical trials without valid controls or insufficiently large samples can lead to very harmful public policy decisions if their results are not ignored.
You can be making a completely garbage point, but if you start showing that it was 17.20 and here's a chart with some points, all of a sudden people start thinking "this is rigorous"; to trust any measurement you have to dig into how it was done, does it actually measure what you want to know? were noise and biasing factors controlled and measured? You have to pose the right question, measure the right thing in the right way, sample the right group, do statistical analysis of error and bias, etc. without that it's fucking pseudoscience garbage.
I see far too many people who know about this measurement problem, but then ignore it. For example pretty much everyone knows that GDP is a terrible measure of overall economic health of a country, and yet they still talk about GDP all the time. Maybe they'll toss in a little aside about ("GDP isn't really what we should talk about, but...") and then after the "but" they proceed to do a whole article looking at GDP growth. This is the trap! When you have a bad measurement, you MUST ignore it and not even think about it at all. (see also: graduation rates, diet, cost of social programs, etc. etc.)
You see this all the time with profiling where people measure some micro-benchmark of a hash table, or a mutex lock, and find the "fastest" implementation. These things are massively context dependent and measuring them accurately in a synthetic benchmark is nearly impossible (it would require very complex simulation of different input types, memory layouts and working set sizes, different numbers of threads in different thread usage patterns).
The problem with a bad measurement is it gives a number which then people can brandish as if it's unimpeachable (X was 4 cycles and Y was 5 cycles, therefore we must use X even though it's complicated and fragile and harder to use, and in fact after all the surrounding implementation it winds up being much worse). It far too often makes people believe that the result they saw in one measurement is universally true, when in fact all you observed is that *if* measured in that particular way in that particular situation, this is what you saw that one time. (reminds me of the old "black sheep" joke about the engineer, physicist and the mathematician).
There are lots of common mistakes in profiling that we see all the time, unfortunately, as people try Oodle and feel the need to measure performance for themselves. It's not that easy to just "measure performance". We try to be very careful about using data sets that are realistic samples of expected data, we remove fluctuations due to thermal throttling or single-core boosts, we run multiple times to check repeatability of results, etc. This is literally our job and we spend a lot of time thinking about it, and sometimes we still get it wrong, and yet every single day we get people going "oh I just cooked up this benchmark in two seconds and I'm getting weird results". See also : Tips for benchmarking a compressor and The Perils of Holistic Profiling .
In the modern world you have to consider profiling with N other threads running that you don't control, you can't assume that you get the whole machine to yourself. For example a very common huge mistake that I see is unnecessary thread switches; let's just hand off to this other thread very briefly then come back to our first thread to continue the work. That may be totally fine when you test it on a machine that is otherwise idle, but if you're competing for CPU time on a machine that has a ton of other threads running, that "little thread switch" to pop over to a different async task might take seconds. Over-threading tends to improve benchmarks when run on machines in isolation but hurt performance in the real world.
(See also *2 at end)
1B. Optimization is good for its own sake.
The whole idea that you should "avoid premature optimization" has many flaws and should be one of the learnings that you forget. Yes yes, of course don't go off and spend a month writing an assembly version of a loop without being sure it's an important thing to do, and also that you've got the right overall algorithmic structure and memory access pattern and so on. I'm not advocating just being dumb.
But also, don't use a really slow LogPrintf() implementation just because it doesn't show up in profiles.
When you have bad/slow code, it changes the way you use it. You wind up avoiding that function in high performance areas. It makes you code around the performance bug rather than just writing things the way you should.
I've worked at a number of companies where they disable asserts in debug builds because they've gotten too slow. I of course try turning on asserts, and a thousand of them fire because nobody else is testing with asserts on. The solution should have been to fix the speed of the debug build to something usable, not to avoid important programming tools.
Sometimes when you do a good implementation of something (even when it wasn't necessary for any particular profile of total app performance), it becomes a really useful component that you then wind up using all over. Like maybe you do a really cool memcpy that can do interleaves and shuffles, that winds up being a really useful tool that you can build things with, that you wouldn't have thought about until you did the good implementation of it.
It's also just fun and fun is good.
1C. Trust what is obviously true.
When the truth is staring you in the face, but some measurement, or some complex second thoughts contradict it, you need to stop and reevaluate. The obvious truth is probably right and your overthinking or being too "mature" with measuring things may be misleading you.
In this case the fact that a naive filter implementation was a temp place-holder and needed to be optimized was obviously true, and some over-thinking clouded that.
2. Fastmail tua culpa.
Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!
What was happening was fastmail (*1) was generating emails that failed SPF check. This would cause my sent emails to be just rejected by some receivers, with no "undelivered" response at all, so I didn't know it was happening.
The SPF record is supposed to verify that an email came from the sending mail host that it claims to (but not the sending address). Emails coming from the fastmail mail host mark themselves as being from fastmail, then the receiver can look up the SPF record at fastmail.com and see the IP's that it should have come from to verify it actually came from there. This prevents spammers from claiming to be sending mail from fastmail servers but actually using a different server. This makes it possible for receivers to have white & black lists for hosts. (SPF records do *not* verify the "from" field of the email)
I had my fastmail email set up to forward to an alias account (also inside fastmail). When I then replied to these (via SMTP through
smtp.fastmail.com), it was going out identified as :
helo=wforward1-smtp.messagingengine.com;
client-ip=64.147.123.30
then receivers would check the SPF record for fastmail and get :
v=spf1 include:spf.messagingengine.com ?all
64.147.123.17
64.147.123.18
64.147.123.19
64.147.123.20
64.147.123.21
64.147.123.24
64.147.123.25
64.147.123.26
64.147.123.27
64.147.123.28
64.147.123.29
which does not include the .30 IP , therefore my email was marked as an SPF failure.
Fastmail tech support was useless and unhelpful about figuring this out. It also sucks that I get no notification of the undelivered mail.
Some things that were useful :
NIST Email Authentication Tester
dmarcanalyzer SPF checker
*1: I switched to fastmail from dreamhost because dreamhost was failing to deliver my sent email. Deja vu. Why is it so fucking hard to deliver a god damn email !? (in dreamhost's case it's because they intentionally provide smtp service to lots of spammers, so the dreamhost smtp servers get into lots of blacklists)
*2: Another common problem with profiling and benchmarking I've been thinking about recently is the drawback of large tests, which you then average or sum.
People now often have access to large amounts of data to test on. That may or may not be great. It depends on whether that data is an unbiased random sampling of real world data that reflects what you care about the performance on in your final application.
The problem is that you often don't know exactly what data you will be used on, and the data you have is just "some stuff" that you don't really know if it reflects the distribution of data that will be observed later. (this is a bit like the machine learning issue of having a training set that is a good reflection of what will be seen in production use).
Again like the "measurement fallacy" the "big data" test can lead to a false sense of getting an accurate number. If you test on 4 TB of sample data that does not mean the numbers that come out are more useful than a test on 1 MB of sample data.
Large data averages and totals can swamp interesting cases with lots of other cases. There might be some extreme outliers in there where your performance is very bad, but they get washed away in the total. That would be fine if that was in fact a good representation of what you will see in real use, but if it's not, you could be doing very badly on exactly the cases that matter.
The worst case is for a library provider like us, we don't know what data types are important to the client. That one weird case where we do badly might be 90% of the client's data.
Any time you're working with test sets where you take averages and totals you have to be aware of how you're pooling (weighted by size? (eg. adding times is weighted by size), or by file? or are categories equally weighted?). If your test set is 20% text and 40% executable, that is assigning an effective importance weight to those data types.
In data compression we also have the issue of incompressible files, such as already compressed files, which are not something you should ever be running through your compressor. People running "lots of data" that just iterate every file on their personal disk and think they're getting a good measurement are actually biasing the inputs toward weird things that should not be used.
Because of these considerations and more, I have been increasingly using the method of "minimizing the maximum" of bad performance, or boosting the worst case.
Rather than using a big testset to take an average performance, I use a big test set to find the one file with the worse performance, and then do all I can to optimize that bad case. Measure again, find the new worst case, attack that one.
This has many advantages. It prevents clients from ever seeing a really bad case. That one worst case might actually be the type of data they really care about. It also tends to find interesting corner cases and reveals flaws you don't see on average cases (like oh this one weird file runs most of the loop iterations in the tail/safe loop), that lets you find and fix those cases. It's sort of a case of "you learn from your mistakes" by really digging into these examples of bad performance.
Another nice thing about the "fix the worst" method is that it's strictly additive for bigger test sets. You can just always toss more in your test set and you have more chances to find a worst case. You don't have to worry about how the data is distributed and if that reflects real world distributions. For example say someone gives you a terabyte of images that are all grayscale. You don't have to worry that this is going to bias your image test set towards a weird over-weighting of grayscale.
This approach has been used on both Oodle Leviathan and Oodle Texture. It was one of the guiding principles of Leviathan that we not only be good on average, but we minimize the gap to the best compressor on every type of data. (we can't be the best possible compressor on every type of data, where specialized compressors can excel in some cases, but we wanted to minimize the worst difference). That led to lots of good discoveries in Leviathan that also helped the average case, and we used a similar principle in Oodle Texture. I think of it as a bit like the machine learning technique AdaBoost, where you take your worst cases and train on them more to get better at them, then keep repeating that and you wind up with a good classifier in general.
The naive assumption is that ReadFile(size) either successfully reads all "size" bytes or fails mysteriously and we should abort. What you actually need to be handling is :
ReadFile(size) :
  succeeded, but got less than size
  failed, but only because it was already at EOF
  failed, but due to a temporary system condition and we should retry
  succeeded, but is not asynchronous the way we expected
  succeeded and was asynchronous, but then GetOverlappedResult does not wait as we expected
  failed, but only because the IO size was too big and we should cut it into pieces
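As an illustration only (this is a sketch, not the linked helpers, and it only covers the synchronous cases), a read wrapper that deals with partial reads, EOF, and cutting oversized requests into pieces might look like :

    #include <windows.h>

    // returns true on success ; *pGot receives the number of bytes actually read
    // (may be less than size if EOF was hit)
    bool robust_read_sync(HANDLE h, void * buf, DWORD size, DWORD * pGot)
    {
        const DWORD kMaxChunk = 16*1024*1024; // cut big IOs into pieces (chunk size is an assumption)
        DWORD total = 0;
        while ( total < size )
        {
            DWORD want = size - total;
            if ( want > kMaxChunk ) want = kMaxChunk;
            DWORD got = 0;
            if ( ! ReadFile(h, (char *)buf + total, want, &got, NULL) )
            {
                DWORD err = GetLastError();
                if ( err == ERROR_HANDLE_EOF ) break; // EOF is not a hard failure
                // a real helper should also retry on transient errors
                // (eg. ERROR_NO_SYSTEM_RESOURCES)
                return false;
            }
            if ( got == 0 ) break; // EOF on a synchronous handle
            total += got;
        }
        *pGot = total;
        return true;
    }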
In a surely pointless attempt to improve matters, I've tried to make easy-to-use clean helpers that do all this for you, so you can just include this code and have robust IO :
Even if you are being careful and checking all the error codes, some issues you may not be handling :
(this is now rarely an issue on 64-bit windows, but was common on 32-bit windows, and can still happen on the MS consoles)
NOTE this is not the issue with trying to do GetOverlappedResult on the same OVERLAPPED struct from multiple threads - that is just wrong; access to the OVERLAPPED struct should be mutex protected if you will query results from multiple threads, and you should also track your own "is io async" status to check before calling GetOverlappedResult.
For example, here's how to call GetOverlappedResult : (only call if st == win32_io_started_async)
    BOOL win32_get_async_result(HANDLE handle,
                                OVERLAPPED * data,
                                DWORD * pSize)
    {
        // only call this if you got "win32_io_started_async"
        // so you know IO is actually pending
        DWORD dwSize = 0;

        // first check result with no wait
        // this also resets the event so that the next call to GOR works :
        if ( GetOverlappedResult(handle,data,&dwSize,FALSE) )
        {
            if ( dwSize > 0 )
            {
                *pSize = (DWORD) dwSize;
                return true;
            }
        }

        // if you don't do the GOR(FALSE)
        // then the GOR(TRUE) call here can return even though the IO is not actually done

        // call GOR with TRUE -> this yields our thread if the IO is still pending
        if ( ! GetOverlappedResult(handle,data,&dwSize,TRUE) )
        {
            DWORD err = GetLastError();

            if ( err == ERROR_HANDLE_EOF )
            {
                if ( dwSize > 0 )
                {
                    *pSize = (DWORD) dwSize;
                    return true;
                }
            }

            return false;
        }

        *pSize = (DWORD) dwSize;
        return true;
    }
Get the code :
Also note that I don't recommend trying to do unbuffered writes on Windows. It is possible but complex and requires privilege elevation which puts up a UAC prompt, so it's not very usable in practice. Just do buffered writes. See also : 01-30-09 - SetFileValidData and async writing and 03-12-09 - ERROR_NO_SYSTEM_RESOURCES
(Integrating BC7Prep is rather different; see BC7Prep data flow here. Essentially BC7Prep will integrate like a compressor, you ship the runtime with a decompressor; it doesn't actually make texture data but rather something you can unpack into a texture. BC7Prep is not actually a compressor, it relies on your back-end compressor (Kraken or zip/deflate typically), but it integrates as if it was.)
Caching output for speed and patches
You may wish to cache the output of Oodle Texture BCN encoding, and reuse it when the source content hasn't changed, rather than regenerate it.
One reason is for speed of iteration; most likely you already have this system in some form so that artists can run levels without rebaking to BCN all the time. Perhaps you'd like to have a two-stage cache; a local cache on each person's machine, and also a baked content server that they can fetch from so they don't rebake locally when they encounter new levels.
Oodle Texture RDO encodes can be slow. You wouldn't like to have to rebake all the BCN textures in a level on a regular basis. We will be speeding it up in future versions and probably adding faster (lower quality) modes for quicker iteration, but it will never be realtime.
Caching can also be used to reduce unnecessary patch generation.
Oodle Texture guarantees deterministic output. That is, the same input on the same version of Oodle Texture, on the same platform, with the same options will make the same output. So you might think you can rely on there being no binary diff in the generated output to avoid patches.
The problem with that idea is it locks you into a specific version of Oodle Texture. This is a brand new product and it's not a good idea to assume that you will never have to update to a new version. Newer versions *will* make different output as we improve the algorithms. Relying on there being no binary diff to avoid making patches means never taking updates of Oodle Texture from RAD. While it is possible you could get away with this, it's very risky. It has the potential of leaving you in a situation where you are unable to update to a better new version because it would generate too many binary diffs and cause lots of patching.
It's much safer to make your patches not change if the source content hasn't changed. If the source art and options are the same, use the same cached BCN.
With these considerations, the cache should be indexed by : a hash of the source art bits (perhaps Meow hash ; not the file name and mod time), the options used (such as the lambda level), but NOT the Oodle Texture version.
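For illustration, a cache key under that policy might look something like this (the field names and hash choice are placeholders, not a real API) :

    #include <cstdint>

    struct BakeOptions
    {
        int bcnFormat; // which BC1-7 target was chosen
        int lambda;    // the RDO lambda actually used
        // ... anything else that changes the encoded output
    };

    struct BakedTextureCacheKey
    {
        uint64_t    sourceHash; // hash of the source art bits (eg. Meow hash),
                                // not the file name or mod time
        BakeOptions options;    // encoding options
        // deliberately NOT the Oodle Texture version : a new library version
        // reuses the cached BCN instead of forcing a rebake and a patch
    };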
Lambda for texture types depends on your usage context
It's worth thinking a bit about how you want to expose lambda to control quality for your artists. Just exposing direct control of the lambda value per texture is probably not the right way. A few issues to consider :
You should make it possible to tweak lambda across the board at a later date. It's very common to not know your size target until very late in development. Perhaps only a month before ship you see your game is 9 GB and you'd like to hit 8 GB. You can do that very easily if you have a global multiplier to scale lambdas. What you don't want is to have lots of hard-coded lambda values associated with individual textures.
We try to make "lambda" have the same approximate meaning in terms of visual quality across various texture types, but we can only see how that affects error in the texels, not in how they are shown on screen. Transformations that happen in your shader can affect how important the errors are, and lambda should be scaled appropriately.
For example say you have some type of maps that use a funny shader :
fetch rgb
color *= 2
in that case, texel errors in the map are actually twice as important as we think. So if you were using lambda=40 for standard
diffuse textures, you should use lambda=20 for these funny textures.
Now doubling the color is obviously silly, but that is effectively what you do with maps that become more important on screen.
Probably the most intuitive and well known example is normal maps. Normal maps can sometimes massively scale up errors from texel to screen, it depends on how they are used. If you only do diffuse lighting in smooth lighting environments, then normal map errors can be quite mild, and standard lambda scaling might be fine. But in other situations, for example if you did environment map reflections with very sharp contrast (the classic case is a rotating car with a "chrome" map), then any errors in normals become massively magnified and you will want very little error indeed.
(note that even in the "very little error" case you should still use Oodle Texture RDO, just set lambda to 1 for near lossless encoding; this can still save quite a lot of size with no distortion penalty; for maximum quality you should almost always still be doing RDO in near-lossless mode, not turning it off)
We (RAD) can't just say that "normal maps should always be at 1/2 the lambda of diffuse maps". It really depends on how you're using
them, how high contrast the specular lighting is. What really matters is the error in the final image on screen, but what we measure
is the error level in the texture; the ratio of the two is how you should multiply lambda :
lambda multiplier = (texture error) / (screen error)
This kind of error magnification depends mainly on the type of map (normals, AO, gloss, metalness, translucency, etc.) and how your engine interprets them. If we think of diffuse albedo as the baseline, the other maps will have errors that are 75% or 200% or whatever the importance, and lambda should be scaled accordingly.
I suggest that you should have a lambda scaling per type of map. This should not be set per texture by artists, but should be set by the shader programmer or tech artists that know how maps feed the rendering pipeline.
In the end, the way you should tweak the map-type lambda scaling is by looking at how the errors come out on the screen in the final rendered image, not by looking at the errors in the texture itself. The transformations you do between texel fetch and screen affect how much those texel errors are visible.
Aside from the per-map-type lambda scaling, you probably want to provide artists with a per-texture override. I would encourage this to be used as little as possible. You don't want artists going through scaling every lambda because they don't like the default, rather get them to change the default. This should be used for cases where almost all the textures look great but we want to tweak a few.
Per-texture scaling can be used to tweak for things that are outside the scope of what we can see inside the texture. For example if the texture is used on the player's weapons so it's right in your face all the time, perhaps you'd like it higher quality. Another common case is human faces are much more sensitive to the human observer, so you might want them to be at higher quality.
I think a good way to expose per-texture scaling is as a percentage of the global map-type lambda. eg. you expose a default of 100% = a 1.0
multiplier, artists can slide that to 50% or 200% to get 0.5 or 2.0x the map-type lambda. So perhaps you set something up like :
global map type default lambdas :
diffuse/albedo lambda = 30
normal maps lambda = 10
AO lambda = 30
roughness lambda = 40
then an artist takes a specific normal map
because it's for a car, say
and slides the "rate reduction %" from "100%" down to "50%"
so it would get a lambda of 5
then late in dev you decide you want everything to be globally a bit smaller
you can go through and tweak just the global map type lambdas and everything adjusts
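A sketch of how those pieces might combine (purely illustrative) :

    // map-type default lambda, scaled by the artist's per-texture percentage override
    int effective_lambda(int mapTypeLambda, int perTexturePercent /* 100 = default */)
    {
        int lambda = (mapTypeLambda * perTexturePercent) / 100;
        if ( lambda < 1 ) lambda = 1; // assumption : stay in near-lossless RDO rather than turning RDO off
        return lambda;
    }
    // eg. normal maps default to lambda 10 ; the artist's 50% override gives lambda 5
    // a late global size push then only touches the map-type defaults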
Delay baking and format choices
It's best practice to delay baking and format choices until right before the BCN encode, and do all the earlier steps at maximum precision.
For example don't hard-code the BCN choice; some older engines specify BC1 for diffuse and "DXT5n" (BC3) for normals. Those are not the formats you want in most modern games; you probably want BC7 for diffuse and BC5 for normals. It's probably best in your tools to not directly expose the BCN choice to artists, but rather just the texture type, and let your baker choose the format.
Oodle Texture is designed to be a very low level library; we don't do a lot of texture processing for you, we only do the final RGB -> BCN encode step. We assume you will have a baker layer that's just above Oodle Texture that does things like mip maps and format conversions.
Normal maps require special care. If possible they should be kept at maximum precision all the way through the pipeline (from whatever tool that made them up to the Oodle Texture encode). If they come out of geometry normals as F32 float, just keep them that way, don't quantize down to U8 color maps early. You may decide later that you want to process them with something like a semi-octahedral encoding, and to do that you should feed them with full precision input, not quantized U8 values that have large steps. 16 bit maps also have plenty of precision, but with 16 bit integers ensure you are using a valid quantizer (restore to center of quantizer bucket), and the correct normalization (eg. signed float -1.0 to 1.0 should correspond to S16 -32767 to 32767 , -32768 unused). Our BC4/5 encoders are best fed S16 or U16 input.
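For example, a correct symmetric S16 quantizer for signed [-1,1] data, following the convention above (this is just a sketch) :

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // signed float in [-1,1]  ->  S16 in [-32767,32767] ; -32768 is never produced
    int16_t quantize_snorm16(float f)
    {
        f = std::max(-1.0f, std::min(1.0f, f));
        return (int16_t) std::lrint(f * 32767.0f); // round to nearest = center of the bucket
    }

    // S16 -> signed float ; restores the center of the quantizer bucket
    float dequantize_snorm16(int16_t q)
    {
        return (float) q / 32767.0f;
    }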
Delaying quantization to a specific type of int map lets you choose the best way to feed the BCN encoder.
In the Oodle Texture SDK help there's an extensive discussion in the "Texture Mastering Guide" on choosing which BC1-7 and some tips on preparing textures for BCN.
In the previous post about Oodle Texture we looked at the data flow of texture content in games, for the case of Oodle Texture RDO.
There is a different technology in Oodle Texture called "bc7prep" that can be used differently, or in addition to RDO.
BC7prep is a lossless transform specifically for BC7 (not BC1-6) that rearranges the bits to improve the compression ratio. Unlike Oodle Texture RDO, BC7Prep requires a reverse transform at runtime to unpack it back to BC7. This is a super fast operation, and can also be done with a GPU compute shader so the CPU doesn't have to touch the bits at all.
BC7prep can be used in combination with Oodle Texture RDO encoding, or on BC7 encodings made from any source. It typically improves the compression of BC7 by 5-15%.
BC7 is particularly important because it makes up a lot of the size of modern games. It's a commonly used texture format, and without RDO or bc7prep it often doesn't compress much at all.
The similar data flow chart for BC7 textures that use bc7prep is :
RGB uncompressed source art content like BMP or PNG
| <- this step is where Oodle Texture RDO goes
V
BC7 compressed texture <- you can also start here for bc7prep
| <- bc7prep transforms BC7 to "prepped" data
V
BC7Prep transformed texture
|
V
Kraken or zlib compression on the game package containing the BCN textures
| TOOLS ^
V
sent over network, stored on disk
| RUNTIME v
V
decompress Kraken or zlib to load (sometimes with hardware decompressor)
|
V
BC7Prep transformed texture
| <- bc7prep unpacking, can be done on GPU
V
BC7 compressed texture in memory
|
V
rendered on GPU
Oodle Texture gives you several options for how you use it depending on what best fits your game. You can use only Oodle Texture RDO, in which case no runtime decoding is needed. You can use just bc7prep on existing BC7 encoded data, in which case you don't need to use Oodle Texture's BCN encoding from source art at all. Or you can use both together.
BC7Prep combined with Oodle Texture RDO at "near lossless" levels provides size savings for BC7 with almost no visual quality difference from a non-RDO BC7 encoding.
On varied BC7 textures :
total :
BC7 + Kraken : 1.387 to 1
BC7Prep + Kraken : 1.530 to 1
NearLossless RDO + BC7Prep + Kraken : 1.650 to 1
The size savings with BC7Prep are not huge and dramatic the way they are with RDO, because the BC7Prep transform
is lossless, but they are very large compared to the normal differences between lossless compression options.
For example the compression ratio on that BC7 set with Oodle Leviathan is 1.409 to 1, not much better than Kraken, and a far smaller gain than BC7Prep gives. Oodle Leviathan is a very strong compressor that usually finds bigger gains over Kraken than that, but BC7 data is hard for compressors to parse. BC7Prep and Oodle Texture RDO put the data into a form that increases the benefit of strong compressors like Oodle over weaker ones like zlib. (on data that's incompressible, all compressors are the same).
The runtime BC7Prep untransform is extremely fast. If you're using Oodle Data compression in software, it's best to do the BC7Prep untransform in software right after decompression. If you're on a platform with hardware decompression, you may want to use BC7Prep untransform compute shaders so the texture data never has to be touched by the CPU.
Visit the RAD Game Tools website to read all about Oodle Texture.
The example here is my photograph "mysoup" which I make CC0. While most game textures are not like photographs, this is typical of the results we see. To try Oodle Texture on your own images contact RAD Game Tools for an evaluation. You can also see more sample encodings on real game textures at the Oodle Texture web site.
I will be showing "mysoup" encoded to BC7 here, which is the format commonly used by games for high quality textures. It is 8 bits per texel (vs 24 for source RGB). The BC7 images I provide here are in DDS format; they are not viewable by normal image viewers, this is intended for game developers and technical artists to see real game data.
I have made these BC7 DDS with Oodle Texture in "BC7 RGBA" mode, which attempts to preserve the opaque alpha channel in the encoding. I would prefer to use our "BC7 RGB" which ignores alpha; this would get slightly higher RGB quality but can output undefined in alpha. Because many third party viewers don't handle this mode, I've not used it here.
I will also show the encoding of a few different maps from a physically based texture : coral_mud_01 from texturehaven.com (CC0), to show a sample of a full texture with diffuse, normal, and attribute maps.
I show RMSE here for reference, you should be able to reproduce the same numbers on the sample data. The RMSE I show is per texel (not per channel), and I compute only RGB RMSE (no A). Note that while I show RMSE, Oodle Texture has been run for Perceptual quality, and visual quality is what we hope to maximize (which sacrifices RMSE performance).
Oodle Texture is not a compressor itself, it works with the back end lossless compressor of your choice. Here I will show sizes with software Kraken at level 8 and zlib at level 9. You can download the data and try different back end compressors yourself. Oodle Texture works with lots of different lossless back end compressors to greatly increase their compression ratio.
The "mysoup" images are 1024x1024 but I'm showing them shrunk to 512x512 for the web site; you should always inspect images for visual quality without minification; click any image for the full size.
Download all the files for mysoup1024 here : (BC7 in DDS) :
mysoup1024_all.7z
Download all the files for coral_mud_01 here : (BC7,BC4,BC5 in DDS) :
coral_mud_01.7z
mysoup1024.png :
original uncompressed RGB :
click image for full resolution.
baseline BC7 :
(no RDO, max quality BC7)
RMSE per texel: 2.7190
Kraken : 1,048,724 -> 990,347 = 7.555 bpb = 1.059 to 1
Kraken+bc7prep : 1,080,696 -> 861,173 = 6.375 bpb = 1.255 to 1
zlib9 : 1,048,724 -> 1,021,869 = 7.795 bpb = 1.026 to 1
BC7 data is difficult for traditional compressors to handle. We can see here that neither Kraken nor zlib can get much compression on the baseline BC7, sending the data near the uncompressed 8 bits per texel size of BC7.
Oodle Texture provides "bc7prep", which is a lossless transform that makes the BC7 data more compressible. "bc7prep" can be used with Kraken, zlib, or any other back end compressor. "bc7prep" does require a runtime pass to transform the data back to BC7 that the GPU can read, but this can be done with a GPU compute shader so no CPU involvement is needed. Here bc7prep helps quite a bit in changing the data into a form that can be compressed by Kraken.
rdo lambda=5 :
RDO in near lossless mode
RMSE per texel: 2.8473
Kraken : 1,048,724 -> 849,991 = 6.484 bpb = 1.234 to 1
Kraken+bc7prep : 1,080,468 -> 767,149 = 5.680 bpb = 1.408 to 1
zlib9 : 1,048,724 -> 895,421 = 6.831 bpb = 1.171 to 1
In near lossless mode, Oodle Texture can make encodings that are visually indistinguishable from baseline, but compress much better. At this level, Oodle Texture RDO is finding blocks that are lower rate (eg. compress better with subsequent Kraken compression), but are no worse than the baseline choice. It's simply a smarter encoding that considers rate as well as distortion when considering the many possible ways the block can be encoded in BCN.
Note that when we say "near lossless RDO" we mean nearly the same quality as the baseline encoding. The baseline encoding to BC7 forces some quality loss from the original, and this RDO setting does not increase that. The RMSE difference to baseline is very small, but the visual quality difference is even smaller.
We believe that legacy non-RDO encodings of most texture types should never be used. Oodle Texture RDO provides huge wins in size with no compromise; if you need max quality just run in near lossless mode. It simply makes a better encoding to BC7 which is much more compressible. Many common BC7 encoders produce worse quality than Oodle Texture does in near-lossless mode.
rdo lambda=40 :
RDO in medium quality mode
RMSE per texel: 4.2264
Kraken : 1,048,724 -> 509,639 = 3.888 bpb = 2.058 to 1
Kraken+bc7prep : 1,079,916 -> 455,747 = 3.376 bpb = 2.370 to 1
zlib9 : 1,048,724 -> 576,918 = 4.401 bpb = 1.818 to 1
At lambda=40 we are now trading off some quality for larger rate savings. At this level, visual differences from the original may start to appear, but are still very small, and usually acceptable. (for example the errors here are far far smaller than if you encoded to BC1, or even if you encoded with a poor BC7 encoder that reduces choices in a hacky/heuristic way).
At this level, Kraken is now able to compress the image nearly 2 to 1 , to 3.888 bits per texel, starting from baseline which got almost no compression at all. We've shrunk the Kraken compressed size nearly by half. This also means the content can load twice as fast, giving us an effective 2X multiplier on the disk speed. This is a HUGE real world impact on game content sizes with very little down side.
zlib has also benefitted from RDO, going from 1,021,869 to 576,918 bytes after compression. Kraken does a bit better because it's a bit stronger compressor than zlib. The difference is not so much because Oodle Texture is specifically tuned for Kraken (it's in fact quite generic), but because more compressible data will tend to show the difference between the back end compressors more. On the baseline BC7 data, it's nearly incompressible, so the difference between Kraken and zlib looks smaller there.
Download all the "mysoup" files here : (BC7 in DDS) :
mysoup1024_all.7z
Summary of all the compression results :
baseline BC7 :
Kraken : 1,048,724 -> 990,347 = 7.555 bpb = 1.059 to 1
Kraken+bc7prep : 1,080,696 -> 861,173 = 6.375 bpb = 1.255 to 1
zlib9 : 1,048,724 -> 1,021,869 = 7.795 bpb = 1.026 to 1
RDO lambda=5 :
Kraken : 1,048,724 -> 849,991 = 6.484 bpb = 1.234 to 1
Kraken+bc7prep : 1,080,468 -> 767,149 = 5.680 bpb = 1.408 to 1
zlib9 : 1,048,724 -> 895,421 = 6.831 bpb = 1.171 to 1
RDO lambda=40 :
Kraken : 1,048,724 -> 509,639 = 3.888 bpb = 2.058 to 1
Kraken+bc7prep : 1,079,916 -> 455,747 = 3.376 bpb = 2.370 to 1
zlib9 : 1,048,724 -> 576,918 = 4.401 bpb = 1.818 to 1
coral_mud_01 :
You can get the source art for coral_mud_01 at texturehaven.com. I used the 1k PNG option. On the web site here I am showing a 256x256 crop of the images so they can be seen without minification. Download the archive for the full res images.
coral_mud_01_diff_1k :
diffuse (albedo) color in BC7 (RGBA)
| BC7 non-RDO | BC7 lambda=30 | BC7 lambda=50 |
| RMSE 3.5537 | RMSE 4.9021 | RMSE 5.7194 |
| 7.954 bpb | 5.339 bpb | 4.683 bpb |
[images : BC7 non-RDO / lambda=30 / lambda=50 crops]
coral_mud_01_Nor_1k.png:
normal XY in RG channels only in BC5
BC5 decodes to 16 bit
RMSE is RG only
| BC5 non-RDO | BC5 lambda=30 | BC5 lambda=50 |
| RMSE 3.4594 | RMSE 5.8147 | RMSE 7.3816 |
| 8.000 bpb | 5.808 bpb | 5.083 bpb |
[images : BC5 non-RDO / lambda=30 / lambda=50 crops]
coral_mud_01_bump_1k.png:
bump map, single scalar channel
BC4 decodes to 16 bit
| BC4 non-RDO | BC4 lambda=30 | BC4 lambda=50 |
| RMSE 3.2536 | RMSE 4.1185 | RMSE 5.2258 |
| 7.839 bpb | 6.871 bpb | 6.181 bpb |
[images : BC4 non-RDO / lambda=30 / lambda=50 crops]
A single scalar channel in BC4 is an unusual usage for games. Typically several scalar channels would be combined in a BC7 texture.
Compressed sizes are with software Kraken at level 8, no additional filters. "bpb" means "bits per byte", it's the compressed size in bits, per uncompressed input byte. The non-RDO textures in coral_mud all get almost no compression at all with Kraken (they're between 7.8 and 8.0 bpb). With RDO that improves to around 5 bpb, which is an 8:5 ratio or 1.6 to 1 compression.
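(For example, working the bpb out on the mysoup lambda=40 Kraken result earlier : 509,639 compressed bytes * 8 bits per byte / 1,048,724 uncompressed input bytes ≈ 3.888 bpb.)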
With BC7 and BC5, the "bpb" size is also the number of bits per texel, because they start at 8 bits per texel when uncompressed. If RDO can improve BC7 to 4 bits per texel, that means it's now the same size on disk as uncompressed BC1, but at far higher visual quality. (2:1 on BC7 is a bit more than we typically expect; 8:5 or 8:6 is more common)
Download all the files for coral_mud_01 here : (BC7,BC4,BC5 in DDS) :
coral_mud_01.7z
Read more about Oodle Texture on the RAD Game Tools web site
Oodle Texture creates BC1-7 GPU textures that are far more compressible, so that when packaged for storage or distribution they are much smaller - up to 2X smaller. Many games have most of their content in this form, so this leads to a huge impact on compressed game sizes, usually 10%-50% smaller depending on the content and how Oodle Texture is used.
Smaller content also loads faster, so improving the compression ratio by 2X also improves effective IO speed by 2X. This is possible when the decompression is not the bottleneck, such as when you use super fast Oodle Kraken decompression, or a hardware decoder.
At RAD, we previously developed Oodle Kraken, part of Oodle Data Compression, which provides super fast decompression with good compression ratios, which makes Kraken great for game data loading where you need high speed. But Kraken is generic, it works on all types of data and doesn't try to figure out data-specific optimizations. Oodle Texture is able to greatly decrease the size that a following Kraken compression gets by preparing the textures in ways that make them more compressible.
Oodle Texture is specialized for what are called "block compressed textures". These are a form of compressed image data that is used by GPUs to provide the rendering attributes for surfaces in games. Oodle Texture works on BC1-BC7 textures, sometimes called "BCN textures". The BC1-7 are seven slightly different GPU formats for different bit depths and content types, and most games use a mix of different BCN formats for their textures. Modern games use a huge amount of BCN texture data. Shrinking the BCN textures to half their previous compressed size will make a dramatic difference in game sizes.
For an example of what Oodle Texture can do, on a real game data test set from a small selection of textures from
real shipping content :
127 MB BCN GPU textures, mix of BC1-7, before any further compression
78 MB with zip/zlib/deflate
70 MB with Oodle Kraken
40 MB with Oodle Texture + Kraken
Without Oodle, the game may have shipped the zlib compressed textures at 78 MB.
The Oodle Texture + Kraken compressed game is almost half the size of the traditional zlib-compressed game (40 MB).
While Oodle Texture is great with Kraken, it also works to prepare textures for compression by other lossless back ends
(like zlib). We believe that Oodle Texture should be widely used on game textures, even when Kraken isn't available.
While Kraken is a huge technological advance over zip/zlib, it only saved 8 MB in the example above (this is partly because BCN texture data is difficult for generic compressors to work with), while Oodle Texture saved an additional 30 MB, nearly 4X more than Kraken alone. The size savings possible with Oodle Texture are huge, much bigger than we've seen from traditional compressors, and you don't need to accept painful quality loss to get these savings.
The way that games process texture data is :
RGB uncompressed source art content like BMP or PNG
| <- this step is where Oodle Texture RDO goes
V
BCN compressed texture
|
V
Kraken or zlib compression on the game package containing the BCN textures
| TOOLS ^
V
sent over network, stored on disk
| RUNTIME v
V
decompress Kraken or zlib to load (sometimes with hardware decompressor)
|
V
BCN compressed texture in memory
|
V
rendered on GPU
Oodle Texture doesn't change this data flow, it just makes the content compress better so that the packaged size is smaller.
You still get GPU-ready textures as the output. Note that Oodle Texture RDO isn't required in the runtime side at all.
(Oodle Texture also contains bc7prep which has slightly different usage; see more later, or here)
Games don't decompress the BCN encoding, rendering reads directly from BCN. Games use BCN textures directly in memory because GPUs are optimized to consume that format, and they also take less memory than the original uncompressed RGB image would (and therefore also use less bandwidth), but they aren't a great way to do lossy compression to optimize size in packages. For example the familiar JPEG lossy image compression can make images much smaller than BCN can at similar visual quality levels. In Oodle Texture we want to shrink the package sizes, but without changing the texture formats, because games need them to load into BCN. We also don't want to use any slow transcoding step, cause an unnecessary loss of quality, or require decoding at runtime.
Oodle Texture can be used on the new consoles that have hardware decompression without adding any software processing step. You just load the BCN textures into memory and they are decompressed by the hardware, and you get the benefit of much smaller compressed sizes, which also effectively multiplies the load speed.
Oodle Texture RDO can't be used to compress games with existing BCN texture content, as that has already been encoded. We need to re-encode to BCN from source art as part of the game's content baking tools.
BCN textures work on 4x4 blocks of pixels, hence the name "block compressed textures". They are a lossy encoding that stores an approximation of the original source texture in fewer bits. The source colors are typically 24 or 32 bits per texel, while BCN stores them in 4 or 8 bits per texel. So BCN is already a compression factor of something like 6:1 (it varies depending on BC1-7 and the source format).
How does Oodle Texture do it?
To understand the principles of how Oodle Texture finds these savings, we'll have to dig a little into what a BCN encoding is. All the BCN are a little different but have similar principles. I'm going to henceforth talk about BC1 to be concrete as an example that illustrates the main points that apply to all the BC1-7.
BC1 stores 24 bit RGB in 4 bits per texel, which is 64 bits per block of 4x4 texels. It does this by sending the block with two 16-bit endpoints for a line segment in color space (32 bits total for endpoints), and then sixteen 2-bit indices that select an interpolation along those endpoints. 2-bits can encode 4 values for each texel, which are each of the endpoints, or 1/3 or 2/3 of the way between them. (BC1 also has another mode with 3 interpolants instead of 4, but we'll ignore that here for simplicity). The BC1 endpoints are 16-bit in 5:6:5 for R:G:B which is a coarser quantization of the color space than the original 8 bits.
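As a sketch, that 64-bit block layout is :

    #include <cstdint>

    struct BC1Block
    {
        uint16_t endpoint0; // RGB packed as 5:6:5
        uint16_t endpoint1; // RGB packed as 5:6:5
        uint32_t indices;   // sixteen 2-bit selectors, one per texel of the 4x4 block
    };
    static_assert(sizeof(BC1Block) == 8, "64 bits per block = 4 bits per texel");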
We think of RGB as a "color space" where R, G, and B are axes of a 3-dimensional coordinate system. A single color is a point in this color space. The original 4x4 block of uncompressed texels is equivalent to sixteen points in this color space. In general those points are scattered around this big 3d space, but in practice they usually form a cloud (or a few clusters) that is compact, because colors that are nearby each other in the image tend to have similar RGB values.
BC1 approximates these points with a line segment that has 4 discrete codable points on the segment, at the endpoints, and 1/3 of the way from each end. Each color in the original sixteen can pick the closest of the 4 codable points with the 2 bits sent per texel. The problem of BC1 encoding is to choose the endpoints for this line segment, so that the reproduced image looks as good as possible. Once you choose the endpoints, it's easy to find the indices that minimize error for that line segment.
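For example, once you've fixed a candidate endpoint pair (and therefore its 4-color palette), minimum-error index selection is just a nearest-point search per texel. A sketch (mine, not Oodle code; palette and texels as plain int RGB triples) :
#include <limits.h>
// choose the min-error 2-bit index for each of the 16 texels, packed LSB-first
unsigned int pick_indices(const int palette[4][3], const int texels[16][3])
{
    unsigned int indices = 0;
    for (int i = 0; i < 16; i++)
    {
        int best = 0, best_err = INT_MAX;
        for (int p = 0; p < 4; p++)
        {
            int dr = texels[i][0] - palette[p][0];
            int dg = texels[i][1] - palette[p][1];
            int db = texels[i][2] - palette[p][2];
            int err = dr*dr + dg*dg + db*db;
            if ( err < best_err ) { best_err = err; best = p; }
        }
        indices |= (unsigned int)best << (2*i);
    }
    return indices;
}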
The thing that makes BC1 encoding interesting and difficult is that there are a large number of encodings that have nearly the same error. Your goal is to put a line segment through a cluster of points, and slightly different endpoints correspond to stretches or rotations of that line. You can hit any given color with either an endpoint of the segment or a 1/3 interpolant, so you can do these big stretches or contractions of the line segment and still have nearly the same error.
For example, here are two clusters of points (the black dots) in color space, with some possible BC1 encodings that produce similar errors :
If you're only considering distortion, then these options have nearly the same error. In fact you could just put your line segment through the principal axis of the color clusters, and then you are within bounded error of the best possible encoding (if the line segment was sent with real numbers, then the best fit line would in fact minimize squared error, by definition; the quantization of the endpoints means this doesn't necessarily give you minimum error). That's possible because the distortion varies smoothly and convexly (except for quantization effects, which are bounded). This is just a way of saying that there's a minimum error encoding where the line segment goes through the original colors, and if you keep stepping the endpoints away from that line segment, the error gets worse.
Oodle Texture isn't just looking for the lowest error (or "distortion") when encoding to BCN; it does "rate-distortion optimization". This means that in addition to considering the distortion of each possible encoding, it also considers the rate. The "rate" in this case is the estimated size of the chosen block encoding after subsequent compression by a lossless compressor like Kraken or zlib.
By considering rate, Oodle Texture can make smarter encodings that optimize for compressed size as well as quality. Sometimes this is just free: by measuring the rate of different choices, you may find that two encodings with equal quality do not have the same rate, and you should choose the one with the better rate. Sometimes this means a tradeoff, where you sacrifice a small amount of quality to get a big rate gain.
Rate Distortion Optimization or RDO does not mean that we are introducing loss or bad quality into the encoding. It simply means the encoder is considering two types of cost when it makes decisions. It can balance the desire for maximum quality against the desire for the smallest possible size; since both are not possible at the same time, a tradeoff must be made, which the game developer can control with a quality parameter. Oodle Texture RDO can produce very high quality encodings that are nearly visually indistinguishable from non-RDO encodings, but compress much more, simply by being a smart encoding that takes the rate of the choices into consideration.
People actually do rate-distortion optimization in games all the time without realizing it. When you choose to use a 4k x 4k texture vs. an 8k x 8k texture, you are making a visual quality vs size decision. Similarly if you choose BC1 vs BC7, you're trading size (4 vs 8 bits per texel) against quality. Those are very big coarse steps, and the value of the tradeoff is not systematically measured. The difference with Oodle Texture is that our RDO is automatic, it provides a smooth easy to control parameter, the tradeoff is scored carefully, and the best possible ways to trade size for quality are chosen.
Here's an example of Oodle Texture BC7 encoding made with and without RDO :
| BC7 baseline | BC7 RDO lambda=30 |
| 1.081 to 1 compression | 1.778 to 1 compression |
(texture from cc0textures.com, resized to 512x512 before BC7 encoding; compression ratio is with Kraken level 8)
(BC7 textures like this that hardly compress at all without RDO are common)
Oodle Texture RDO encodes source art to BCN, looking at the many different options for endpoints and measuring "rate" and "distortion" on them. We noted previously that distortion is pretty well behaved as you search for endpoints, but in contrast, the rate does not behave the same way. The rate of two different endpoint choices could be vastly different even for endpoints whose colors are right next to each other in color space. Rate does not vary smoothly or monotonically as you explore the endpoint possibilities, it varies wildly up and down, which means a lot more possibilities have to be searched.
The way we get compression of BCN textures is mainly through reuse of components of the block encoding. That is, the back end compressor will find that a set of endpoints or indices (the two 32-bit parts of a BC1 block, for example) are used in two different places, and therefore can send the second use as an LZ77 match instead of transmitting them again. We don't generally look for repetition of entire blocks, though this can reduce rate, because it causes visually obvious repetitions. Instead by looking to repeat the building components that make up the BCN blocks, we get rate reduction without obvious visual repetition.
You might have something like
Encode block 1 with endpoints {[5,10,7] - [11,3,7]} and indices 0xE3F0805C
Block 2 has lots of choices of endpoints with similar distortions
{[6,11,7] - [11,3,7]} distortion 90 rate 32 bits
{[1,10,7] - [16,5,7]} distortion 95 rate 32 bits
{[5,10,7] - [11,3,7]} distortion 100 rate 12 bits
the choice of {[5,10,7] - [11,3,7]} has a rate that's much lower than the others
because it matches previously used endpoints
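A cartoon of that rate estimate (purely illustrative; the real estimator in Oodle Texture is more involved) might look like :
// cartoon rate estimate : an endpoint pair that matches something recently used
// is assumed to become an LZ match (cheap), otherwise pay roughly its literal size
int estimate_endpoint_rate_bits(unsigned int endpoints,
                                const unsigned int * recent, int recent_count)
{
    for (int i = 0; i < recent_count; i++)
        if ( recent[i] == endpoints )
            return 12;  // rough cost of coding the reuse as a match
    return 32;          // cost of sending the 32 endpoint bits as literals
}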
Part of what makes RDO encoding difficult is that both "rate" and "distortion" are not trivial to evaluate. There's no simple formula for either that provides the rate and distortion we need.
For distortion, you could easily just measure the squared distance error of the encoding (aka RMSE, SSD or PSNR), but that's not actually what we care about. We care about the visual quality of the block, and the human eye does not work like RMSE; it sees some errors as objectionable even when they are quite numerically small. For RDO BCN we need to be able to evaluate distortion millions of times on the possible encodings, so complex human-visual simulations are not possible. We use a very simple approximation that treats errors as more significant when they occur in smooth or flat areas, because those will be more jarring to the viewer; errors that occur in areas that were already noisy or detailed will not be as noticeable, so they get a lower D score. Getting this right has huge consequences: without a perceptual distortion measure, the RDO can produce ugly visible blocking artifacts even when RMSE is quite low.
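As a sketch of that idea (my own, not the actual Oodle metric; 'local_activity' stands in for whatever smoothness measure you compute on the original neighborhood) :
// simple activity-masked distortion : errors in smooth regions count for more,
// errors in regions that were already noisy or detailed count for less
double masked_block_error(const float * orig, const float * coded, int n,
                          double local_activity)
{
    double ssd = 0;
    for (int i = 0; i < n; i++)
    {
        double d = orig[i] - coded[i];
        ssd += d * d;
    }
    // divide by a term that grows with local detail (activity masking)
    return ssd / (1.0 + local_activity);
}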
To measure the rate of each block coding decision, we need to guess how well a block will compress, but we don't yet have all the other blocks, and the compressors that we use are dependent on context. That is, the actual rate will depend on what comes before, and the encoding we choose for the current block will affect the rate of future blocks. In LZ77 encoding this comes mainly through the ability to match the components of blocks; when choosing a current block you want it to be low "rate" in the sense that it is a match against something in the past, but also that it is useful to match against in the future. We use a mix of techniques to try to estimate how different choices for the current block will affect the final compressed size.
When choosing the indices for the BCN encoding (the four interpolants along the line segment that each texel chooses), the non-RDO encoder just took the closest one, giving the minimum color error. The RDO encoder also considers taking interpolants that are not the closest if that lets it form index bytes which occur elsewhere in the image, thus reducing rate. Often a given color is nearly the same distance from two interpolants, but they might have very different rate. Also, some choice of endpoints might not give you any endpoint reuse, but it might change how the colors map to indices in a way that creates reuse there. Considering all these possibilities quickly is challenging.
Oodle Texture measures these rate and distortion scores for lots of possible block encodings, and makes a combined score
J = D + lambda * R
that lets us optimize for a certain tradeoff of rate and distortion, depending on the lambda parameter.
You can't minimize distortion and rate at the same time, but you can minimize J, which reaches the ideal mix of
rate and distortion at that tradeoff. The client specifies lambda to control if they want maximum quality, or
lower quality for more rate reduction. Lambda is a smooth continuous parameter that gives fine control, so there are
no big jumps in quality. Oodle Texture RDO can encode to the same quality as the non-RDO encoders at low lambda,
and gradually decreases rate as lambda goes up.
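As a toy illustration of that selection step (my sketch, not the Oodle Texture internals), you score each candidate encoding of a block and keep the one with the lowest J :
// toy Lagrangian selection : keep the candidate block encoding with lowest J = D + lambda * R
typedef struct { double D; double R; /* ... plus the encoded block bits ... */ } Candidate;

int pick_best_candidate(const Candidate * candidates, int count, double lambda)
{
    int best = 0;
    double best_J = candidates[0].D + lambda * candidates[0].R;
    for (int i = 1; i < count; i++)
    {
        double J = candidates[i].D + lambda * candidates[i].R;
        if ( J < best_J ) { best_J = J; best = i; }
    }
    return best;
}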
This optimization automatically finds the rate savings in the best possible places. It takes rate away where it makes the smallest distortion gain (measured with our perceptual metric, so the distortion goes where it is least visible). This means that not all textures get the same rate savings, particularly difficult ones will get less rate reduction because they need the bits to maintain quality. That's a feature that gives you the best quality for your bits across your set of textures. Oodle Texture is a bit like a market trader going around to all your textures, asking who can offer a bit of rate savings for the lowest distortion cost and automatically taking the best price.
Textures encoded with Oodle Texture RDO and then Kraken act a bit more like a traditional lossy encoding like JPEG. Non-RDO BCN without followup compression encodes every 4x4 block to the same number of output bits (either 64 or 128). With Oodle Texture RDO + Kraken, the size of output blocks is now variable depending on their content and how we choose to encode them. Easier to compress blocks will take fewer bits. By allocating bits differently, we can reduce the number of bits a given block takes, and perhaps lower its quality. One way to think about Oodle Texture RDO is as a bit allocation process. It's looking at the number of bits taken by each block (after compression) and deciding where those bits are best spent to maximize visual quality.
Rate-distortion optimization is standard in modern lossy codecs such as H264 & H265. They do similar bit allocation decisions in the encoder, usually by explicitly changing quantizers (a quantizer is like the JPEG quality parameter, but modern codecs can vary quantizer around the image rather than having a single value for the whole image) or thresholding small values to zero. What's different here is that Oodle Texture still outputs fixed size blocks, we don't have direct control of the final compression stage, we can only estimate what it will do. We don't have anything as simple as a quantizer to control block rate, we make the lower rate block encodings by finding ways to pack the RGB to BCN that are likely to compress more.
BC7 textures offer higher quality than BC1 at double the size (before compression). Without RDO, BC7 textures have been particularly large in game packages because they naturally compress very poorly. BC7 has many different modes, and packs its fields off byte alignment, which confuses traditional compressors like Kraken and zlib, and makes it hard for them to find any compression. It's quite common for non-RDO BC7 textures to compress by less than 10%.
Oodle Texture RDO can make BC7 encodings that are much more compressible. For example :
"mysoup1024"
non-RDO BC7 :
Kraken : 1,048,724 -> 990,347 = 7.555 bpb = 1.059 to 1
RDO lambda=40 BC7 :
Kraken : 1,048,724 -> 509,639 = 3.888 bpb = 2.058 to 1
Modern games are using more and more BC7 textures because they provide much higher quality than BC1 (which suffers
from chunky artifacts even at max quality). This means lots of game packages don't benefit as much from compression
as we'd like. Oodle Texture RDO on BC7 fixes this.
Oodle Texture also has a lossless transform for BC7 called "bc7prep" that rearranges the fields of BC7 to make it more compressible. This gives a 5-15% compression gain on existing BC7 encodings. It works great stacked with RDO in the high quality levels as well.
We think that Oodle Texture is just a better way to encode BCN textures, and it should be used on games on all platforms. Oodle Texture has the potential to dramatically shrink compressed game sizes.
You can read more about Oodle Texture at the RAD Game Tools web site, along with the rest of the Oodle family of data compression solutions.
In any case, we'll look at a couple RGBE followup topics because I think they may be educational. Do NOT use these. This is for our education only, don't copy paste these and put them in production! If you want an RGBE conversion you can use, see the previous post!
In the previous post I wrote that I generally prefer centered quantization that does bias on encode.
This is different than what is standard for Radiance HDR RGBE files. (DO NOT USE THIS). But say
you wanted to do that, what would it look like exactly?
// float RGB -> U8 RGBE quantization
void float_to_rgbe_centered(unsigned char * rgbe,const float * rgbf)
{
// NOT HDR Radiance RGBE conversion! don't use me!
float maxf = rgbf[0] > rgbf[1] ? rgbf[0] : rgbf[1];
maxf = maxf > rgbf[2] ? maxf : rgbf[2];
if ( maxf <= 1e-32f )
{
// Exponent byte = 0 is a special encoding that makes RGB output = 0
rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
}
else
{
int exponent;
frexpf(maxf, &exponent);
float scale = ldexpf(1.f, -exponent + 8);
// bias might push us up to 256
// instead increase the exponent and send 128
if ( maxf*scale >= 255.5f )
{
exponent++;
scale *= 0.5f;
}
// NOT HDR Radiance RGBE conversion! don't use me!
rgbe[0] = (unsigned char)( rgbf[0] * scale + 0.5f );
rgbe[1] = (unsigned char)( rgbf[1] * scale + 0.5f );
rgbe[2] = (unsigned char)( rgbf[2] * scale + 0.5f );
rgbe[3] = (unsigned char)( exponent + 128 );
}
}
// U8 RGBE -> float RGB dequantization
void rgbe_to_float_centered(float * rgbf,const unsigned char * rgbe)
{
// NOT HDR Radiance RGBE conversion! don't use me!
if ( rgbe[3] == 0 )
{
rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
}
else
{
// NOT HDR Radiance RGBE conversion! don't use me!
float fexp = ldexpf(1.f, (int)rgbe[3] - (128 + 8));
// centered restoration, no bias :
rgbf[0] = rgbe[0] * fexp;
rgbf[1] = rgbe[1] * fexp;
rgbf[2] = rgbe[2] * fexp;
}
}
what's the difference in practice ?
On random floats, there is no difference. This has the same 0.39% max round trip error as the reference implementation that does bias on decode.
The difference is that on integer colors, centered quantization restores them exactly. Specifically : for all the 24-bit LDR (low dynamic range) RGB colors, the "centered" version here has zero error, perfect restoration.
That sounds pretty sweet, but it's not actually that helpful in practice, because the way we use HDR data in games typically has the LDR range scaled to [0,1.0], not [0,255]. The "centered" way does still preserve 0 and 1.0 exactly.
The other thing I thought might be fun to look at is :
The Radiance RGBE conversion has 0.39% max round trip error. That's exactly the same as a flat quantizer from the unit interval to 7 bits. (the bad conversion that did floor-floor had max error of 0.78% - just the same as a flat quantizer to 6 bits).
But our RGBE fields all have 8 bits. We should be able to get 8 bits of precision. How would you do that?
Well, one obvious issue is that we are sending the max component with the top bit on. It's in [128,255]; since the top bit is always set, we only get 7 bits of precision. We could send it more like a real floating point encoding with an implicit top bit, and use all 8 bits.
If we do that, then the decoder needs to know which component was the max to put the implicit top bit back on. So we need to signal it. Well, fortunately we have 8 bits for the exponent which is way more dynamic range than we need for HDR imaging, so we can take 2 bits from there to send the max component index and leave 6 bits for exponent.
Then we also want to make sure we use the full 8 bits for the non-maximal components. To do that we can scale their fractional size relative to max up to 255.
Go through the work and we get what I call "rgbeplus" :
/**
! NOT Radiance HDR RGBE ! DONT USE ME !
"rgbeplus" packing
still doing 8888 RGBE, one field in each 8 bits, not the best possible general 32 bit packing
how to get a full 8 bits of precision for each component
(eg. maximum error 0.19% instead of 0.38% like RGBE)
for the max component, we store an 8-bit mantissa without the implicit top bit
(like a real floating point encoding, unlike RGBE which stores the on bit)
(normal RGBE has the max component in 128-255 so only 7 bits of precision)
because we aren't storing the top bit we need to know which component was the max
so the decoder can find it
we put the max component index in the E field, so we only get 6 bits for exponent
(6 is plenty of orders of magnitude for HDR images)
then for the non-max fields, we need to get a full 8 bits for them too
in normal RGBE they waste the bit space above max, because we know they are <= max
eg. if max component was 150 , then the other components can only be in [0,150]
and all the values above that are wasted precision
therefore worst case in RGBE the off-max components also only have 7 bits of precision.
To get a full 8, we convert them to fractions of max :
frac = not_max / max
which we know is in [0,1]
and then scale that up by 255 so it uses all 8 bits
this all sounds a bit complicated but it's very simple to decode
I do centered quantization (bias on encode, not on decode)
**/
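// helper definitions assumed by the code below (my additions, not from the original post) :
#include <assert.h>
#define myassert(x) assert(x)
// u8_check : floor to int, verify it fits in a byte, then cast
static unsigned char u8_check(float f)
{
    int i = (int)f;
    assert( i >= 0 && i <= 255 );
    return (unsigned char) i;
}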
// float RGB -> U8 RGBE quantization
void float_to_rgbeplus(unsigned char * rgbe,const float * rgbf)
{
// rgbf[] should all be >= 0 , RGBE does not support signed values
// ! NOT Radiance HDR RGBE ! DONT USE ME !
// find max component :
int maxi = 0;
if ( rgbf[1] > rgbf[0] ) maxi = 1;
if ( rgbf[2] > rgbf[maxi] ) maxi = 2;
float maxf = rgbf[maxi];
// 0x1.p-32 ?
if ( maxf <= 1e-10 ) // power of 10! that's around 2^-32
{
// Exponent byte = 0 is a special encoding that makes RGB output = 0
rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
}
else
{
int exponent;
frexpf(maxf, &exponent);
float scale = ldexpf(1.f, -exponent + 9);
// "scale" is just a power of 2 to put maxf in [256,512)
// 6 bits of exponent :
// exponent == -32 with maxi == 0 would collide with the special zero encoding (rgbe[3] == 0),
// so treat exponent <= -32 as zero as well :
if ( exponent <= -32 )
{
// Exponent byte = 0 is a special encoding that makes RGB output = 0
rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
return;
}
myassert( exponent < 32 );
// bias quantizer in encoder (centered restoration quantization)
int max_scaled = (int)( maxf * scale + 0.5f );
if ( max_scaled == 512 )
{
// slipped up because of the round in the quantizer
// instead do ++ on the exp
scale *= 0.5f;
exponent++;
//max_scaled = (int)( maxf * scale + 0.5f );
//myassert( max_scaled == 256 );
max_scaled = 256;
}
myassert( max_scaled >= 256 && max_scaled < 512 );
// grab the 8 bits below the top bit :
rgbe[0] = (unsigned char) max_scaled;
// to scale the other two components
// we need to use the maxf the *decoder* will see
float maxf_dec = max_scaled / scale;
myassert( fabsf(maxf - maxf_dec) <= (0.5/scale) );
// scale lower components to use full 255 for their fractional magnitude :
int i1 = (maxi+1)%3;
int i2 = (maxi+2)%3;
rgbe[1] = u8_check( rgbf[i1] * 255.f / maxf_dec + 0.4999f );
rgbe[2] = u8_check( rgbf[i2] * 255.f / maxf_dec + 0.4999f );
// rgbf[i1] <= maxf
// so ( rgbf[i1] * 255.f / maxf ) <= 255
// BUT
// warning : maxf_dec can be lower than maxf
// maxf_dec is lower by a maximum of (0.5/scale)
// worst case is
// (rgbf[i1] * 255.f / maxf_dec ) <= 255.5
// so you can't add + 0.5 or you will go to 256
// therefore we use the fudged bias 0.4999f
rgbe[3] = (unsigned char)( ( (exponent + 32) << 2 ) + maxi );
}
}
// U8 RGBE -> float RGB dequantization
void rgbeplus_to_float(float * rgbf,const unsigned char * rgbe)
{
// ! NOT Radiance HDR RGBE ! DONT USE ME !
if ( rgbe[3] == 0 )
{
rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
}
else
{
int maxi = rgbe[3]&3;
int exp = (rgbe[3]>>2) - 32;
float fexp = ldexpf(1.f, exp - 9);
float maxf = (rgbe[0] + 256) * fexp;
float f1 = rgbe[1] * maxf / 255.f;
float f2 = rgbe[2] * maxf / 255.f;
int i1 = (maxi+1)%3;
int i2 = (maxi+2)%3;
rgbf[maxi] = maxf;
rgbf[i1] = f1;
rgbf[i2] = f2;
}
}
and this in fact gets a full 8 bits of precision. The max round trip error is 0.196% , the same as a flat
quantizer to 8 bits.
(max error is always measured as a percent of the max component, not of the component that has the error; any shared exponent format has 100% max error if you measure as a percentage of the component)
Again repeating myself : this is a maximum precision encoding assuming you need to stick to the "RGBE" style of using RGB color space and putting each component in its own byte. That is not the best possible way to send HDR images in 32 bits, and there's no particular reason to use that constraint.
So I don't recommend using this in practice. But I think it's educational because these kind of considerations should always be studied when designing a conversion. The errors from getting these trivial things wrong are very large compared to the errors that we spend years of research trying to save, so it's quite frustrating when they're done wrong.
Note that RGBE 8888 is a very lossy encoding of floating point HDR images to begin with. It is really not appropriate as a working HDR image format if further processing will be applied that may magnify the errors. A half-float format is probably a better choice. If you do use RGBE HDR, at least use the correct conversions which I provide here. (See Greg Ward's article on HDR encodings and also a nice table of dynamic range and accuracy of various HDR file formats).
The Radiance HDR RGBE encoding stores 3-channel floating point RGB using four 8-bit components. It takes the RGB floats and finds the exponent of the largest component, sends that exponent in E, then shifts the components to put the largest component's top bit at 128 and sends them as linear 8 bits.
In making the corrected version, I have been guided by 3 principles in order of priority : 1. ensure existing RGBE HDR images are correct encodings, 2. treat the original Radiance implementation as defining the file format, 3. minimize the error of round trip encoding.
With no further ado, let's have the correct RGBE conversion (the code is given in Appendix 1 below); then we'll talk about the details.
RGBE encoding puts the largest component's stored byte in [128,255], so it actually only has 7 bits of mantissa. The smaller components are put on a shared exponent with the largest component, which means if you ever care about tiny values in the non-maximal components this encoding is terrible (as is any shared-exponent encoding). There are much better packings of RGB floats to 32 bits that store more useful bits of precision.
Let's now look at the broken version that's being widely used. It uses the same encoder as above, but in the
decoder the bad version does :
// RGBE 8888 -> float RGB
// BAD decode restoration, don't copy me
rgbf[0] = rgbe[0] * fexp;
rgbf[1] = rgbe[1] * fexp;
rgbf[2] = rgbe[2] * fexp;
missing the biases by half.
Essentially this is just a quantization problem. We are taking floats and quantizing them down to 8 bits. Each 8 bit index refers to a "bucket" or range of float values. When you restore after quantizing, you should restore to the middle of the bucket. (with non-uniform error metrics or underlying probability distributions, the ideal restoration point might not be the middle; more generally restore to the value that minimizes error over the expected distribution of input values).
The broken version is essentially doing a "floor" on both the quantization and restoration.
floor quantization :
+---------+---------+---------+---------+---------+---------+
|0 |1 |2 |3 |4 |5 |
+---------+---------+---------+---------+---------+---------+
->
floor restoration :
*0 *1 *2 *3 *4 *5
->
bias 0.5 restoration :
*0.5 *1.5 *2.5 *3.5 *4.5 *5.5
the rule of thumb is that you need a 0.5 bias on either the quantization side or the restoration side.
If you do floor quantization, do +0.5 in restoration. If you do centered quantization (+0.5) you can do
integer restoration.
The broken version is just a bad quantizer that restores value ranges to one edge of the bucket. That's creating a net shift of values downward and just creates error that doesn't need to be there. On random RGB floats :
Maximum relative error :
Correct RGBE encoding : 0.3891%
Bad floor-floor RGBE encoding : 0.7812%
(percentage error relative to largest component).
Note this is a LOT of error. If you just took a real number in [0,1) and quantized it to 8 bits, the maximum error is 0.195% , so even the correct encoding at around 0.39% error is double that (reflecting that we only have 7 bits of precision in the RGBE encoding), and the bad encoding at around 0.78% is double that again (it's equal to the maximum error of uniform quantization if we only had 6 bits of precision).
Reference test code that will print these errors : test_rgbe_error.cpp
I have where possible tried to make this corrected RGBE quantizer match the original HDR file IO from Greg Ward's Radiance. I believe we should treat the Radiance version as canonical and make HDR files that match it. I have adopted a small difference which I note in Appendix 4; the change that I propose there actually makes the encoder match the description of its behavior better than it did before. The original Radiance code does floor quantization and midpoint restoration, like here. The broken version was introduced later.
ADD : I have also recently found Greg Ward's code from Real Pixels in Graphics Gems II; it also contains the correct conversion, and is easier to go explore than Radiance. (conversions are in color.c)
The broken floor-floor code has been widely copied into many tools used in games (such as STBI and NVTT), hopefully those will be fixed soon. Fortunately, the encoder does not need to be changed, only the decode code is changed, so existing HDR images are okay. They can be loaded with the corrected restoration function and should see an improvement in quality for free.
That's it! We'll followup some details in appendices for those who are interested.
Appendix 1 :
Correct conversion in raw text if the pastebin isn't working :
// float RGB -> U8 RGBE quantization
void float_to_rgbe(unsigned char * rgbe,const float * rgbf)
{
// rgbf[] should all be >= 0 , RGBE does not support signed values
float maxf = rgbf[0] > rgbf[1] ? rgbf[0] : rgbf[1];
maxf = maxf > rgbf[2] ? maxf : rgbf[2];
if ( maxf <= 1e-32f )
{
// Exponent byte = 0 is a special encoding that makes RGB output = 0
rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
}
else
{
int exponent;
float scale;
frexpf(maxf, &exponent);
scale = ldexpf(1.f, -exponent + 8);
rgbe[0] = (unsigned char)( rgbf[0] * scale );
rgbe[1] = (unsigned char)( rgbf[1] * scale );
rgbe[2] = (unsigned char)( rgbf[2] * scale );
rgbe[3] = (unsigned char)( exponent + 128 );
}
}
// U8 RGBE -> float RGB dequantization
void rgbe_to_float(float * rgbf,const unsigned char * rgbe)
{
if ( rgbe[3] == 0 )
{
rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
}
else
{
// the extra 8 here does the /256
float fexp = ldexpf(1.f, (int)rgbe[3] - (128 + 8));
rgbf[0] = (rgbe[0] + 0.5f) * fexp;
rgbf[1] = (rgbe[1] + 0.5f) * fexp;
rgbf[2] = (rgbe[2] + 0.5f) * fexp;
}
}
Appendix 2 :
Why I prefer centered quantization.
Radiance HDR does floor quantization & centered restoration. I think it should have been the other way around (centered quantization & int restoration), but I don't propose changing it here because we should stick to the reference Greg Ward implementation, and match how existing .hdr files have been encoded.
The reason I prefer centered quantization is that it exactly preserves integers and powers of two. (it has a small bonus of not needing a bias at decode time).
If your input real numbers are evenly distributed with no special values, then there's no reason to prefer either style, they're equal. But usually our inputs are not evenly distributed and random, we often deal with inputs where values like 0.0 and 1.0 or 256.0 are special and we'd like to preserve them exactly.
If you do centered quantization, these restore exactly. If you do floor quantization and then have an 0.5 bias on dequantization to do centered restoration, values like 0 shift to be in the center of their bucket.
In the correct RGBE encoding above, a float input of 1.0 is restored as 1.0039063 (= 1 + 1/256), because 1.0 corresponds to the bottom edge of a quantization bucket, and we restore to the middle of the bucket.
For example for non-HDR images the quantization I like is :
// U8 to float
// map 1.0 to 255
// centered quantization
unsigned char unit_float_to_u8(float f)
{
// f in [0,1.0]
// scale up to [0,255] , then do centered quantization :
// eg. values after *255 in [0.5,1.5] -> 1
// clamp before casting if you aren't sure your floats are bounded!
return (unsigned char) (f * 255.f + 0.5f);
}
float u8_to_unit_float(unsigned char u8)
{
// u8 in [0,255]
// scale to [0,1.0]
// do straight mapped dequantization :
return u8 * (1.f/255);
}
It seems that perhaps the desire to preserve 1.0 exactly is what got us into this whole mess.
A widely referenced extraction from Radiance
was posted here : bjw rgbe, but
with a crucial flaw. The reconstruction was changed to remove the bias :
/* standard conversion from rgbe to float pixels */
/* note: Ward uses ldexp(col+0.5,exp-(128+8)). However we wanted pixels */
/* in the range [0,1] to map back into the range [0,1]. */
Ward had a valid quantizer (floor encode, bias decode), but it disappeared in this version.
The correct way to get 1.0 preserved would have been to do a centered quantization (bias on the encode). As noted previously I don't recommend changing that now as it would mean existing hdr files were encoded wrong, and it deviates from Ward's original Radiance implementation, which should be taken as defining the format. We should consider the BJW implementation to simply have an incorrect decoder. (the BJW encoder is okay, but it does also suffer from the small flaw discussed in Appendix 4)
If you do centered quantization with RGBE, you get perfect reconstruction of the LDR (low dynamic range) integer colors. That is, all colors with RGB in integers in [0,255] are preserved exactly in RGBE *if* we did centered quantization. That sounds appealing, but it's not actually a common way to use HDR images; most people put their colors in [0,1.0].
Appendix 3 :
RGBE encoder using IEEE floating point bit manipulations.
I don't post this as a suggested optimization, but rather because I think it illuminates what's actually going on in the RGBE encoding.
The IEEE floating points that we are encoding are sign-exponent-mantissa :
SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
To send the largest component with RGBE, what we are doing is taking the top 7 bits of the mantissa, adding the implicit 1 bit
ahead of them, and putting those 8 bits in a byte.
000000001MMMMMMM0000000000000000
Dropping the bits below the top 7 is the floor quantization : whatever those bits were, they're gone.
The +0.5 bias on restoration is equivalent to saying we expect all values of those dropped bits to be equally likely, hence their
average value is 10000 (the highest dropped bit turned on, the rest left off).
000000001MMMMMMM1000000000000000
The components that aren't the highest are forced onto the same exponent as the highest; this shifts in zeros from the left, their mantissa bits are shifted out the bottom of the word, and then we grab the top 8 bits of them.
The "ldexpf" we do in the correct implementation is just making a pure power of two, which is just an exponent in IEEE floats (all mantissa bits zero). When you multiply that on another float, you're just adding to the exponent part. Unfortunately there aren't standard C ways to do simple float operations like getting an exponent or adding to an exponent so we'll have to dig into the bits.
The full operation is :
union float_and_bits
{
float f;
unsigned int bits;
};
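// 'min' used below is assumed to be the usual integer min (not defined in the original post) :
#ifndef min
#define min(a,b) ( (a) < (b) ? (a) : (b) )
#endif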
// float RGB -> U8 RGBE quantization
void float_to_rgbe_bits(unsigned char * rgbe,const float * rgbf)
{
float_and_bits fab[3];
fab[0].f = rgbf[0];
fab[1].f = rgbf[1];
fab[2].f = rgbf[2];
unsigned int max_bits = fab[0].bits > fab[1].bits ? fab[0].bits : fab[1].bits;
max_bits = max_bits > fab[2].bits ? max_bits : fab[2].bits;
int max_exp = (max_bits >> 23) - 127;
//max_bits == 0 is exact float zero
//(int)max_bits < 0 means the sign bit is on (illegal negative input)
//max_exp == 128 is NaN
if ( (int)max_bits <= 0 || max_exp <= -100 || max_exp == 128 )
{
// Exponent byte = 0 is a special encoding that makes RGB output = 0
rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
}
else
{
// IEEE gives us the value as [1,2)*2^exp ; we want [0.5,1)*2^exp (frexp convention) so do ++
max_exp++;
unsigned int mantissa = 1 << 23;
int exp0 = (fab[0].bits>>23) - 127;
int exp1 = (fab[1].bits>>23) - 127;
int exp2 = (fab[2].bits>>23) - 127;
int man0 = (fab[0].bits & (mantissa-1)) | mantissa;
int man1 = (fab[1].bits & (mantissa-1)) | mantissa;
int man2 = (fab[2].bits & (mantissa-1)) | mantissa;
man0 >>= min(max_exp - exp0 - 8 + 23,31);
man1 >>= min(max_exp - exp1 - 8 + 23,31);
man2 >>= min(max_exp - exp2 - 8 + 23,31);
rgbe[0] = (unsigned char)( man0 );
rgbe[1] = (unsigned char)( man1 );
rgbe[2] = (unsigned char)( man2 );
rgbe[3] = (unsigned char)( max_exp + 128 );
}
}
// U8 RGBE -> float RGB dequantization
void rgbe_to_float_bits(float * rgbf,const unsigned char * rgbe)
{
if ( rgbe[3] == 0 )
{
rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
}
else
{
float_and_bits fab;
int exp = (int)rgbe[3] - 128 - 8;
fab.bits = (exp+127)<<23;
float fexp = fab.f;
rgbf[0] = (rgbe[0] + 0.5f) * fexp;
rgbf[1] = (rgbe[1] + 0.5f) * fexp;
rgbf[2] = (rgbe[2] + 0.5f) * fexp;
}
}
this version using bit manipulation produces exactly identical encodings to the correct version I have
posted before.
Appendix 4 :
Another common oddity in the RGBE encoders.
I have tried to stick to the original Radiance version where reasonable, but I have changed one part that is widely done strangely.
The major bias error we have talked about before is in the decoder. This error is in the encode side, but unlike the bias this error is not a large one, it's a small correction and doesn't change the values intended to be stored in the format.
In my correct version, this part :
int exponent;
float scale;
frexpf(maxf, &exponent);
scale = ldexpf(1.f, -exponent + 8);
is just trying to get the exponent of the largest component, and then form a pure power of 2 float that multiplies by
256 / 2^exponent.
This step has been widely done strangely; the majority of other implementations do this instead :
divide by self version :
scale = frexp(maxf,&exponent) * 256.0/maxf;
frexp returns the mantissa part of the float, scaled to [0.5,1), so when you divide that by maxf what you're doing is cancelling out
the mantissa part and leaving just the exponent part of maxf (2^-exponent).
This is not only a bit strange, it is in fact worse : 'scale' made this way does not come out as an exact power of 2, because floating point math is not exact (a reciprocal then multiply does not give back exactly 1.0). This divide-by-yourself method does not produce exactly the same encoding as the bit extraction reference version above.
Aside from small drifts of 'scale' causing small errors in reconstruction, if it ever drifted too high you could have another problem. Our scaled value is supposed to be in [128,256) (inclusive on the bottom of the range, exclusive on the top). That means you can just cast to unsigned char. But if 'scale' ever came out even slightly high, you could hit 256 exactly, and then the cast to unsigned char would wrap to 0, which would be a huge error.
This appears to have motivated Greg Ward in the original Radiance code to add a safety margin :
original Radiance code version :
/ray/src/common/color.c line 272
d = frexp(d, &e) * 255.9999 / d;
The 255.9999 fudge ensures that imprecise math can't make us incorrectly get to 256. It's an ugly way to handle it,
and it does cause a small loss of quality, so despite my goal to stick to the original Radiance
implementation I am not copying this.
(Radiance was started in 1985 when IEEE compliant floating point rounding was by no means standard, so we can certainly forgive it some fudgy bias factors)
The way I have shown creates a pure power of two scale factor, so it can't suffer from any drift of the scale due to floating point imprecision. This actually matches the described behavior of the RGBE encoding better than previous versions.
For example from Wikipedia :
Radiance calculates light values as floating point triplets, one each for red, green and blue. But storing a full double precision float for each channel (8 bytes × 3 = 24 bytes) is a burden even for modern systems. Two stages are used to compress the image data. The first scales the three floating point values to share a common 8-bit exponent, taken from the brightest of the three. Each value is then truncated to an 8-bit mantissa (fractional part). The result is four bytes, 32 bits, for each pixel. This results in a 6:1 compression, at the expense of reduced colour fidelity.
This is different than the benchmarks I've shown here, which are just the decompressors being run in isolation in a test bench. Jon measures times in Unreal. "Time to first frame" is the best measurement of load time, but it isn't total load time, as Unreal continues to do async loading after first frame.
Jon measured times on his PC, but with Unreal locked to one thread (as well as we could), and with various simulated disk speeds. This let us see the times more clearly, and lets us make some times that are more comparable to slower machines, or the last gen of consoles. Consoles like Switch will see the most load time benefit from Oodle, as they need help with both CPU time and IO time, so the double benefit of Oodle really helps there (Oodle gives more compression than zlib and is also much faster to decode, so you save time on both IO and decompression).
Since Unreal 4.24 we offer Oodle Data Compression and Network plugins that drop into Unreal without any changes to the Unreal Engine code. This uses the facilities that Epic added for us to make the compressor & network stack pluggable, so that we don't have to patch Engine source code to do our integration (thanks Epic!). Now you just drop in our plugins and Oodle automatically replaces the default compression in Unreal.
Some previous posts on Oodle in Unreal for more :
Oodle Data compression in Unreal
Oodle Network compression test (including Unreal)
Oodle 2.8.6 continues small fixes and tweaks in the 2.8.x major version. Leviathan decompression is now 5-10% faster on modern processors.
I have a new standard test machine, so I want to use this space to leave myself some checkpoint reference numbers on the new machine. My new standard machine is :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
Zen2, 16 cores (32 hyper), TSC at 3490 MHz
I always down-clock my test machines and disable turbo (or "boost") for the core clocks. This gives me very reliable profiling (of single threaded work, anyway). If you don't do this, you will see not just run-to-run variability, but also that the first thing you test will usually seem faster than later tests, so your testing may be order dependent. If you don't lock your cores at low clocks, then you should be repeating your tests many times, trying them in different orders, and seeing if your results are stable.
new machine :
ect seven Oodle 2.8.6
AMD Ryzen 9 3950X (CPU locked at 3393 MHz) (Zen 2)
ooSelkie7 : 2.19:1 , 7.7 enc MB/s , 4205.6 dec MB/s
ooMermaid7 : 2.85:1 , 4.3 enc MB/s , 1985.1 dec MB/s
ooKraken7 : 3.08:1 , 3.5 enc MB/s , 1258.7 dec MB/s
ooLeviathan7 : 3.22:1 , 2.2 enc MB/s , 778.4 dec MB/s
zlib9 : 2.33:1 , 9.4 enc MB/s , 328.1 dec MB/s
lzma_def9 : 3.18:1 , 4.0 enc MB/s , 58.2 dec MB/s
old machine :
ect seven Oodle 2.8.6
Intel Core i7 3770 (locked at 3403 MHz) (Ivy Bridge)
ooSelkie7 : 2.19:1 , 7.5 enc MB/s , 3682.9 dec MB/s
ooMermaid7 : 2.85:1 , 3.9 enc MB/s , 1722.3 dec MB/s
ooKraken7 : 3.08:1 , 3.0 enc MB/s , 1024.9 dec MB/s
ooLeviathan7 : 3.22:1 , 1.9 enc MB/s , 679.4 dec MB/s
zlib9 : 2.33:1 , 8.0 enc MB/s , 310.2 dec MB/s
lzma_def9 : 3.18:1 , 3.8 enc MB/s , 55.4 dec MB/s
speeds are all single threaded, except the Oodle Optimal level encoders which use 2 threads for encoding (Jobify).
All reports on my blog before this post were on the Core i7 3770, where the run platform was not explicitly identified. All reports in the future will be on the AMD Ryzen.
Here's an "example_lz_chart" run on the new machine :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
Oodle 2.8.6 example_lz_chart
<file>
lz_chart loading r:\testsets\lztestset\lzt99...
file size : 24700820
------------------------------------------------------------------------------
Selkie : super fast to encode & decode, least compression
Mermaid: fast decode with better-than-zlib compression
Kraken : good compression, fast decoding, great tradeoff!
Leviathan : very high compression, slowest decode
------------------------------------------------------------------------------
chart cell shows | raw/comp ratio : encode MB/s : decode MB/s |
All compressors run at various encoder effort levels (SuperFast - Optimal).
Many repetitions are run for accurate timing.
------------------------------------------------------------------------------
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie |1.41:834:4353|1.45:742:4355|1.53:557:4112|1.68:465:4257|1.70:412:4232|
Mermaid|1.54:702:3119|1.66:535:2591|1.79:434:2450|2.01:350:2429|2.04:324:2395|
Kraken |1.55:702:2247|1.71:532:1432|1.88:421:1367|2.10:364:1399|2.27:241:1272|
------------------------------------------------------------------------------
compression ratio (raw/comp):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 1.412 | 1.447 | 1.526 | 1.678 | 1.698 |
Mermaid| 1.542 | 1.660 | 1.793 | 2.011 | 2.041 |
Kraken | 1.548 | 1.711 | 1.877 | 2.103 | 2.268 |
------------------------------------------------------------------------------
encode speed (MB/s):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 834.386 | 742.003 | 557.065 | 465.025 | 412.442 |
Mermaid| 701.818 | 534.711 | 433.517 | 350.444 | 324.358 |
Kraken | 701.792 | 531.799 | 420.887 | 364.245 | 240.661 |
------------------------------------------------------------------------------
decode speed (MB/s):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 4352.567 | 4355.253 | 4111.801 | 4256.927 | 4231.549 |
Mermaid| 3118.633 | 2590.950 | 2449.676 | 2429.461 | 2394.976 |
Kraken | 2247.102 | 1431.774 | 1366.672 | 1399.416 | 1272.313 |
------------------------------------------------------------------------------
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie |1.75:285:3847|1.83:127:4121|1.86: 55:4296|1.93: 10:4317|1.94:7.2:4297|
Mermaid|2.12:226:2307|2.19:115:2533|2.21: 52:2661|2.37:5.5:2320|2.44:4.2:2256|
Kraken |2.32:152:1387|2.39: 30:1483|2.44: 23:1469|2.55:9.8:1350|2.64:3.5:1292|
Leviath|2.48: 58: 899|2.56: 23: 937|2.62: 11: 968|2.71:3.9: 948|2.75:2.4: 932|
------------------------------------------------------------------------------
compression ratio (raw/comp):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 1.748 | 1.833 | 1.863 | 1.933 | 1.943 |
Mermaid| 2.118 | 2.194 | 2.207 | 2.370 | 2.439 |
Kraken | 2.320 | 2.390 | 2.435 | 2.553 | 2.640 |
Leviath| 2.479 | 2.557 | 2.616 | 2.708 | 2.749 |
------------------------------------------------------------------------------
encode speed (MB/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 284.979 | 127.375 | 55.468 | 10.398 | 7.168 |
Mermaid| 226.279 | 114.597 | 52.334 | 5.457 | 4.229 |
Kraken | 152.400 | 29.891 | 22.928 | 9.849 | 3.530 |
Leviath| 58.356 | 23.379 | 10.845 | 3.927 | 2.380 |
------------------------------------------------------------------------------
decode speed (MB/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 3846.881 | 4121.199 | 4296.318 | 4317.344 | 4297.364 |
Mermaid| 2307.345 | 2532.950 | 2660.551 | 2320.415 | 2255.556 |
Kraken | 1387.219 | 1483.488 | 1469.246 | 1350.332 | 1292.404 |
Leviath| 899.052 | 937.473 | 968.337 | 948.179 | 932.194 |
------------------------------------------------------------------------------
A couple more reference numbers on the public Silesia set :
AMD on Silesia :
ooMermaid8 : 3.77:1 , 2.9 enc MB/s , 2208.1 dec MB/s
ooKraken8 : 4.25:1 , 1.1 enc MB/s , 1357.1 dec MB/s
ooLeviathan8 : 4.34:1 , 0.7 enc MB/s , 1015.2 dec MB/s
zlib9 : 3.13:1 , 10.0 enc MB/s , 373.0 dec MB/s
lzma_def9 : 4.27:1 , 3.0 enc MB/s , 85.5 dec MB/s
AMD on Silesia/mozilla :
ooMermaid8 : 3.34:1 , 2.8 enc MB/s , 2145.0 dec MB/s
ooKraken8 : 3.74:1 , 1.1 enc MB/s , 1256.5 dec MB/s
ooLeviathan8 : 3.82:1 , 0.7 enc MB/s , 814.2 dec MB/s
zlib9 : 2.68:1 , 6.4 enc MB/s , 326.0 dec MB/s
lzma_def9 : 3.81:1 , 2.8 enc MB/s , 71.0 dec MB/s
// returns number in [0,mod) aka [0,mod-1]
// similar to rand % mod
U32 rand_in_range_1(U32 mod)
{
U32 r = rand32();
return ( (U64) r * mod ) >> 32;
}
On x86 this is just a mul and then taking edx (the high 32 bits of the 32x32 -> 64 bit product). This is very fast and highly
portable.
I always knew rand_in_range_1 wasn't perfectly uniform (that not all the output values in [0,mod-1] had exactly equal probability, assuming your input rand32() did generate all numbers in [0,2^32) with equal probability), but I had never thought about exactly how non-uniform it was or how to fix it. So let's look at that.
How non-uniform is it?
Barely. Extremely not. It's so uniform it's hard to imagine a case where you care about it. (assuming always that mod is small relative to 2^32 ; if you want larger mods, use a 64-bit rand)
For example when mod = 7 , if you enumerate all 32-bit possibilities for rand32, the histogram is :
{
613566757,613566757,613566756,613566757,613566756,613566757,613566756
};
this pattern holds for all 'mod'; the difference between maximum and minimum count is at most 1. So for all practical uses that
I can imagine, this is good enough and I see no reason to want a better distribution.
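If you want to reproduce that histogram, a brute force enumeration of all 2^32 inputs is a quick check (a sketch; U64 counters just to be safe) :
// count how often each output of rand_in_range_1 occurs over every possible 32-bit input
void histogram_rand_in_range(U32 mod, U64 * counts) // counts[mod], zeroed by caller
{
    U32 r = 0;
    do
    {
        U32 out = (U32)( ( (U64) r * mod ) >> 32 );
        counts[out]++;
        r++;
    } while ( r != 0 ); // U32 wraps back to 0 after 2^32 iterations
}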
In fact with a static mapping that invokes rand only once, you cannot possibly do better. If you think of "rand_in_range" as taking a 32-bit integer input and mapping it to something in [0,mod) repeatably (same input always makes same output), you cannot do better than this.
If we want a true uniform distribution with all output values equally likely, we must do something that does not have constant output over the input of a single 32-bit rand. We must occasionally re-draw a rand to spread the imbalance, so that if we histogram over many runs, it will tend towards perfect uniformity.
See Footnote *1 for classic rejection method.
The minimal amount of extra work comes if we could make the histogram :
613566757,613566757,613566756,613566757,613566756,613566757,613566756
->
613566756,613566756,613566756,613566756,613566756,613566756,613566756
redraw ,redraw ,ok ,redraw ,ok ,redraw ,ok
that is, we want to map 613566756 values to each histo bucket from our original rand, then if we hit
the 613566757th draw for that number, just restart the function to draw again.
That would give us perfect uniformity with only 4 redraws out of 2^32 rands.
Said again; we are trying to make a function that takes the 2^32 input number line and maps it to either an output number in [0,mod) or a special output "redraw". We want to map the minimum amount of the 2^32 input space to "redraw" (which is 4 here) and the other numbers map to [0,mod) with equal count.
Let's look at how we might do this.
The uniform histogram distribution has "even" numbers from the 2^32 line mapped to each output :
even = (2^32) / mod
all_even = even * mod
remainder = 2^32 - all_even
remainder = ( 2^32 ) % mod
remainder = ( 2^32 - mod ) % mod
remainder = ( (U32) - mod ) % mod
(this is just a way of getting 2^32 % mod using only 32 bit ops)
(note that when remainder == 0 you can't store all_even = 2^32 in 32 bits)
So "remainder" are the number of entries on the 2^32 line that map to an excess in one of the output histogram.
If we only had "all_even" input numbers (remainder less than 2^32) *and* we did a divide by all_even instead of
2^32 , we would have an even distribution. That is :
U32 rand_in_range_2(U32 mod)
{
U32 remainder = ( (U32) - mod ) % mod;
U32 all_even = - remainder;
for(;;)
{
U32 r = rand32();
// if r is not in [0,all_even) then redraw
if ( r >= all_even ) continue;
// uniform mapping of [0,all_even) -> [0,mod)
return ( (U64) r * mod ) / all_even;
}
}
This is obviously not how we want to do it, it uses two divides, but it has the general structure we want : we need
to retry 'remainder' times and if we're in the "all_even" subset (2^32 - remainder) then we just want a uniform
mapping.
Footnote *2 for alternate way to remove one divide.
{
BTW as an irrelevant aside, intuition on how the naive rand_in_range_1 generates its uneven distribution :
we're dividing by a denominator that's a little too big (2^32 instead of 2^32 - remainder). What this does is
sometimes cause a number to slip down into the lower bucket.
for mod = 7
remainder = 4
even = floor( 2^32 / 7 ) = 613566756
when x = even
(x*7) / (2^32 - 4) == 1
but using the too-large divisor it slips down :
(x*7) / 2^32 == 0
so for the lower part of the 32 bit range in [0,all_even) we slip down and put too many entries in the low histos, but then
we also have [remainder] extras at the top of the range that make up for the missing ones in the last histo.
}
So we want this technique: for [remainder] entries we will just redraw, and on the others we want a uniform mapping.
How do we generate that? If we just took our naive starting point rand_in_range_1 , and we could generate this table :
redraw ,redraw ,ok ,redraw ,ok ,redraw ,ok
and find an "r" (rand value) in each of those redraw buckets to tell us when to redraw, that would give us our desired
uniform mapping. We want to choose just one input 'r' for each output value that has an excess count in the histo, and
cause a redraw in that case.
To see how to find that, it helped me to think about what a modulo multiply is doing.
r = 32-bit input from rand
U64 m = (U64) r * mod;
U32 hi = (U32) (m >> 32);
U32 lo = (U32) m;
what do hi and lo do as we step through the integers 'r' in order?
m starts at 0
each integer step of 'r' is m += mod
'lo' is making steps up by 'mod'
when m hits 2^32 ,
'lo' wraps around back through zero (m % 2^32)
let us for now consider the case of mod being a prime
then when 'lo' wraps around it will always land on a value it hasn't visited yet
when 'lo' wraps, 'hi' increments
so first 'lo' steps up by 'mod' through the 2^32 range with 'hi' = 0
then it steps back up with 'hi' = 1
etc.
if you step all 2^32 values of r, then all 2^32 values of 'lo' will be visited
what I want to imagine is :
step through the values of 'lo'
at each entry for 'lo' , fill in the current value of 'hi'
first start at lo=0 with hi=0
so start filling in 0's , stepping by 7 :
0......0......0......0......[...]0...
when you get to 2^32
it wraps back to lo = 3 and you fill in 1's :
0..1...0..1...0..1...0..1...
then hi=2 starts at 6 :
0..1..20..1..20..1..20..1..2[...]0..1
0.31..20.31..20.31..20.31..2[...]0.31
0.31.420.31.420.31.420.31.42[...]0.31
0531.420531.420531.420531.42[...]0531
0531642053164205316420531642[...]0531
we have blocks of [mod] (7 in this case) that are all the values of 'hi' once :
[0531642] over and over up the end
then [remainder] extra at the end :
in this case remainder = 4
the last 4 entries on the 2^32 line will be [0531]
if you look at where we had over-counts originally :
redraw ,redraw ,ok ,redraw ,ok ,redraw ,ok
that's 0,1,3,5
So this is exactly what we need.
If we take the low part of our multiply :
lo = (U32) m;
then as lo counts through the ints, 'hi' does [0531642]
If we only allow 'lo' up to 'all_even' (2^32 - remainder), the maximum number of groups of 'mod',
then we will have an equal number of occurance of each value of 'hi' - that's a uniform distribution.
So we can write our first attempt at using this :
U32 rand_in_range_3(U32 mod)
{
U32 remainder = ( (U32) - mod ) % mod;
U32 all_even = - remainder;
for(;;)
{
U32 r = rand32();
U64 m = (U64) r * mod;
U32 hi = (U32) (m >> 32);
U32 lo = (U32) m;
if ( lo >= all_even ) continue;
return hi;
}
}
The check for 'lo' being in the remainder range tells us that we are in the spots where 'hi' will get an excess count
that we don't want.
We've eliminated one divide from rand_in_range_2 ; we are now making the output just by taking the high part of the multiply (not dividing by all_even), but we are still doing a mod to get all_even.
(note also that this doesn't work when remainder = 0 because all_even = 2^32 can't be stored in 32 bits; we'll fix that too)
We can't eliminate the last divide, but we can make it rare :
First realize that we can reject
if ( lo < remainder )
instead of
if ( lo >= all_even )
because the set where we extra-count hi [0531] occurs the same at the top and bottom of the 'lo' number line.
Then realize remainder is less than mod, so ( lo < remainder ) can only occur if ( lo < mod ),
that means we can defer the expensive calculation of 'remainder' (which requires a divide) into the rare case
of (lo < mod), which only occurs mod/2^32 of the time :
the result is :
U32 rand_in_range_4(U32 mod)
{
for(;;)
{
U32 r = rand32();
U64 m = (U64) r * mod;
U32 hi = (U32) (m >> 32);
U32 lo = (U32) m;
if ( lo < mod )
{
// might have lo < remainder
// compute it now :
U32 remainder = ( (U32) - mod ) % mod;
if ( lo < remainder )
{
//redraw
continue;
}
}
return hi;
}
}
and that's Lemire's algorithm.
Footnote *4 for more direct way to see that this is a correct solution.
When mod is small relative to 2^32 , the expected runtime is nearly identical to the original rand_in_range_1 , so there's very little cost. OTOH as noted I think there's also extremely little benefit.
I've worked here with 32-bit integers and only a 32-bit mul hi; you can of course do the same thing with 64 bit, but the 64-bit "mul hi" operation is not portably fast.
Conclusion:
For all practical applications you should just be using the trivial standard form of multiply and shift to
generate random numbers in an interval :
// returns number in [0,mod) aka [0,mod-1]
// similar to rand % mod
U32 rand_in_range_1(U32 mod)
{
U32 r = rand32();
return ( (U64) r * mod ) >> 32;
}
this method does produce very slightly biased probabilities, which can be fixed using the rejection method
at very low expected cost via Lemire's algorithm, which defers the expensive division needed to find the
remainder into a low probability branch.
Footnotes :
*1. The classic rejection method for generating random numbers with some desired property is to just draw a uniform random (say rand32() for example), and reject it if it doesn't fit the desired property, then try again.
In this case the classic rejection method would be :
// returns number in [0,mod) aka [0,mod-1]
// similar to rand % mod
U32 rand_in_range_classic_rejection(U32 mod)
{
U32 mask = get_greater_or_equal_pow2(mod) - 1;
for(;;)
{
U32 r = rand32() & mask;
if ( r < mod ) return r;
}
}
this generates perfectly uniform probability rands in the desired range. get_greater_or_equal_pow2 can
be implemented with clz. This method has a maximum redraw rate of just under 50% (when mod is just over a
power of two, for example mod = 65537).
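For completeness, here's one portable way to write get_greater_or_equal_pow2 without clz (a sketch; a clz-based version, as mentioned above, is what you'd typically use) :
// round x up to the next power of 2 (returns x itself if it's already a power of 2)
// assumes x >= 1
U32 get_greater_or_equal_pow2(U32 x)
{
    x--;
    // smear the top set bit down into all lower positions :
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return x + 1;
}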
*2. In rand_in_range_2 we looked at just taking the input random number and dividing into equal size bins to make our uniform random output. This took two divides (one for remainder, and one to assign the output value). We can eliminate one of those divides a different way by approximating the division.
Reminder what we are doing here is finding the number of elements that can be evenly divided :
even = floor( 2^32 / mod )
all_even = even * mod;
remainder = 2^32 - all_even;
If we had only randoms in [all_even] and we divided them by [even] we would get the desired result.
The step we need to look at is :
out = (r * mod) / all_even;
all_even = 2^32 - remainder;
we would rather do (r * mod) >> 32
but all_even is slightly less than 2^32
for remainder small we can use the approximation :
1 / (1-x) ~= 1 + x
1/all_even = (1/2^32) * (1 /(1 - remainder/2^32));
~= (1/2^32) * (1 + remainder/2^32);
so
x/all_even ~= (x + (remainder*x)/2^32 )/2^32;
and that in fact works, we can write it in pseudo-code as :
r = rand32();
reject (redraw) if r >= all_even
U64 m = (U64)(r+1) * mod;
m += (remainder * m) >> 32;
ret = (U32)( m >> 32 );
ret is in [0,mod) and perfectly uniform
For some large enough mod, the approximation that (mod/2^32) is small must break down; that seems to be
around mod = 80k .
We'll write out the full code to be precise, and slightly modify things to avoid use of "all_even" which is
problematic when remainder == 0 :
U32 rand_in_range_2_one_divide(U32 mod)
{
U32 remainder = ( (U32) - mod ) % mod;
for(;;)
{
U32 r = rand32();
// redraw in bottom [remainder] portion, so [all_even] top values remain
if ( r < remainder ) continue;
// slide down remainder and bias for rounding :
r = r - remainder + 1;
// uniform mapping of [0,all_even) -> [0,mod)
U64 m = r * (U64) mod;
m += (remainder * m) >> 32;
return (U32)( m >> 32 );
}
}
we still need a divide for remainder here, so this is quite pointless, but there you go.
*4. A more direct way to see that Lemire's algorithm has the property we want (uniform probability of each output value). This is roughly his argument from the paper, which is clear and good; you may wish to consult the original.
If we start from the random value :
U32 r = rand32();
it has values in [0,2^32) and all are equally likely. If we now scale it up by mod :
U64 m = (U64) mod * r;
it is now in the range [0,2^32*mod) and only values divisible by mod occur, and they are equally likely.
Now separate m into top and bottom components :
U32 hi = (U32) (m>>32);
U32 lo = (U32) m;
Now hi is in the range [0,mod) , and the frequency of each value of 'hi' is equal to the number of values of 'lo' that can be made.
The values of 'lo' only occur (mod) apart. The number of values of 'lo' that can be made is either floor(2^32/mod) or ceil(2^32/mod),
depending on where the first value occurs.
In order to make all the values of 'hi' equally likely, we just have to make the number of accessible values in 'lo' equal, that is, make it floor(2^32/mod).
To do that, instead of allowing all 2^32 values of 'lo', only allow a contiguous range of (floor(2^32/mod)*mod) (which is the same as 'all_even' or '2^32 - remainder').
Done.
Think about for example if you had a range of 16 (4 bits) instead of 2^32 and you had a mod step of 7.
how many spots can you access in the 4-bit number line
taking steps of 7 ?
either 2 or 3, depending on where you start :
................
.x......x......x
...x......x.....
if you restrict the line to 14 contiguous values,
then you will always have 2 accessible values,
regardless of where you start :
..............OO
.x......x.....OO
...x......x...OO
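Here's a tiny brute force of that toy example (range 16, mod 7), counting how often each 'hi' occurs with and without the rejection :
#include <stdio.h>

int main(void)
{
    const unsigned W = 16, mod = 7;
    unsigned remainder = W % mod;               // = 2 , so all_even = 14
    unsigned count_all[7] = {0}, count_rej[7] = {0};
    for (unsigned r = 0; r < W; r++)
    {
        unsigned m  = r * mod;
        unsigned hi = m / W;                    // == m >> 4
        unsigned lo = m % W;                    // == m & 15
        count_all[hi]++;
        if ( lo >= remainder ) count_rej[hi]++; // Lemire-style rejection of lo < remainder
    }
    for (unsigned h = 0; h < mod; h++)
        printf("hi=%u : %u raw , %u after rejection\n", h, count_all[h], count_rej[h]);
    // raw counts come out as 2 or 3 (biased) ; after rejection every count is exactly 2
    return 0;
}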
The toy problem we will study is this :
Given an array of random bytes, you for some silly reason decide to compress them by sending the distance to the most recent preceding occurrence of that byte.
So to send 'A' you count backward from your current position P ; if there's an A at P-1, you send a 0, if it's at P-2 send a 1, etc.
What is the entropy of this distance encoding?
This "compressor" is silly, it expands (it will send more than 8 bits per byte). How much does it expand?
First of all, why are we looking at this? It's a boiled down version of LZ77 on a specific type of data. LZ77 sends repeated strings by sending the distance to the previous occurrence of that string. Imagine your data consists of a finite set of tokens. Each token is several bytes and imagine the tokens are drawn from a random source, and there are 256 of them. If you had the indices of the tokens, you would just have a random byte sequence. But LZ77 does not have that and cannot convert the stream back to the token indices. Instead the best it can do is to parse on token boundaries, and send the distance to the previous occurrence of that token.
This does occur in the real world. For example consider something like a palettized image. The palette indices are bytes that refer to a dictionary of 256 colors. If you are given the image to compress after de-palettizing, you would see something like 32-bit RGBA tokens that repeat. Once you have seen the 256 unique colors, you no longer need to send anything but distances to previous occurrences of a color. English text is also a bit like this, with the tokens equal to a word+punctuation string of variable length.
So back to our toy problem. To get the entropy of the distance coding, we need the probability of the distances.
To find that, I think about coding the distance with a sequence of binary coding events. Start at distance = 0. Either the current byte matches ours, or it doesn't. Chance of matching is (1/256) for random bytes. The chance of no-match is (255/256). So we multiply our current probability by (1/256) and the probability of all future distances by (255/256), then move to the next distance. (perhaps it's easier to imagine that the preceding bytes don't exist yet; rather you are drawing a random number as you step back in distance; any time you get your own value you stop)
This gives you the probability distribution :
P(0) = (1/256)
P(1) = (255/256) * (1/256)
P(2) = (255/256)^2 * (1/256)
...
P(n) = (255/256)^n * (1/256)
an alternative way to get the same distribution is to look at distance D. First multiply by the probability that it is not a lower distance
(one minus the sum of all lower distance probabilities).
Then the probability that it is here is (1/256). That's :
P(0) = (1/256)
P(1) = (1 - P(0)) * (1/256)
P(2) = (1 - P(0) - P(1)) * (1/256)
...
P(n) = (1 - P(0) - P(1) .. - P(n-1)) * (1/256)
which is equal to the first way, since 1 - P(0) - P(1) - ... - P(n-1) = (255/256)^n .
Given this distribution we can compute the entropy :
H = - Sum_n P(n) * log2( P(n) )
starting at n = 0
let x = (255/256)
P(n) = (1-x) * x^n
log2( P(n) ) = log2( 1-x ) + n * log2( x )
H = - Sum_n (1-x) * x^n * ( log2( 1-x ) + n * log2( x ) )
terms that don't depend on n pull out of the sum :
H = - (1-x) * log2( 1-x ) * Sum_n x^n
- (1-x) * log2( x ) * Sum_n n * x^n
we need two sums :
G = Sum_n x^n
S = Sum_n n * x^n
G is just the geometric series
G = 1 / (1 - x)
recall the trick to find G is to look at G - x*G
we can use the same trick to find S
S - x*S = G - 1
the other standard trick to find S is to take the d/dx of G
either way you get:
S = x / (1-x)^2
H = - (1-x) * log2( 1-x ) * G
- (1-x) * log2( x ) * S
H = - (1-x) * log2( 1-x ) * ( 1 / (1-x) )
- (1-x) * log2( x ) * ( x / (1-x)^2 )
H = - log2( 1-x ) - log2( x ) * ( x / (1-x) )
putting back in x = (255/256)
1-x = 1/256
the first term "- log2( 1-x )" is just 8 bits, send a byte
H = 8 + 255 * log2( 256 / 255 )
H = 8 + 1.43987 bits
about 9.44 bits
or if we look at general alphabet size N now instead of source bytes
H = log2(N) + (N-1) * log2( N / (N-1) )
recalling of course log2(N) is just the number of bits it would take to code a random symbol of N possible values
we'll look at the expansion due to distance coding, H - log2(N)
if we now take the limit of N large
H - log2(N) -> (N-1) * log2( N / (N-1) )
H - log2(N) -> (N-1) * log2( 1 + 1 / (N-1) )
H - log2(N) -> log2( ( 1 + 1 / (N-1) ) ^ (N-1) )
H - log2(N) -> log2( ( 1 + 1/N ) ^ N )
( 1 + 1/N ) ^ N -> e !!
H - log2(N) -> log2( e ) = 1.44269 bits
for large alphabet, the excess bits sent due to coding distances instead of indices is log2(e) !!
(for finite N the excess bits are < log2(e) but quite close; eg. 1.406 at N = 20)
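If you want to sanity check the closed form, here's a quick numerical sketch that just sums the geometric distribution directly for N = 256 :
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double N = 256.0;
    double x = (N - 1.0) / N;          // probability of continuing to the next distance
    double H = 0.0, p = 1.0 / N;       // P(0) = (1-x)
    for (int n = 0; n < 100000; n++)   // the tail beyond this is negligible
    {
        H -= p * log2(p);
        p *= x;                        // P(n+1) = P(n) * x
    }
    double closed = log2(N) + (N - 1.0) * log2(N / (N - 1.0));
    printf("summed H = %.5f , closed form = %.5f , log2(e) = %.5f\n",
           H, closed, log2(exp(1.0)));
    // expect ~9.43987 for both ; the excess over 8 bits is close to log2(e) = 1.44269
    return 0;
}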
I thought it was pretty entertaining for Euler to show up in this problem.
Why is distance coding fundamentally inefficient (vs just coding the indices of these repeated tokens) ? It's due to repetitions of values.
If our preceding bytes were "ABCB" and we now wish to code an "A" , it's at distance = 3 because we had to count the two B's. Our average distance is getting bumped up because symbols may occur multiple times before we find our match. If we did an LZ77 coder that made all substrings of the match length unique, we would not have this inefficiency. (for example LZFG which sends suffix trie node indices rather than distances does this)
We can see where this inefficiency appears in the probabilities:
if you are drawing a random number from 256 at each step
keep stepping until you get a match
each time you have to draw from all 256 possibilities (repeats allowed)
P(0) = (1/256)
P(1) = (255/256) * (1/256)
P(2) = (255/256)^2 * (1/256)
...
P(n) = (255/256)^n * (1/256)
(as above)
instead imagine drawing balls from a hat
once you draw a ball it is discarded so it cannot repeat
stop when you match your byte
first draw has the same 1/256 chance of match :
P(0) = (1/256)
as before multiply by 1-P to get the probability of continuing
but now we only have 255 balls left, so the chance of a match is (1/255)
P(1) = (255/256) * (1/255) = (1/256)
current P was (1/255) so multiply the next by (254/255)
now we have 254 , so we match (1/254) of the time :
P(2) = (255/256) * (254/255) * (1/254) = (1/256)
...
P(n) = (1/256)
it's just a flat probability.
decrementing the alphabet size by one each time makes a big difference.
This theoretical coding loss is also observed exactly in the real world.
experiment :
make 100,000 random bytes
make a palette of 256 32-bit random dwords
use the 100,000 random bytes to index the palette to output 400,000 bytes
what can LZ77 do on these 400,000 bytes?
Our theoretical analysis says the best possible is 9.44 bits per palette index
plus we have to send 1024 bytes of the palette data as well
best = 100,000 * 9.44/8 + 1024 = 119,024
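(a sketch of generating this test file, if you want to try it yourself; rand() is just a stand-in for a real random source and the exact values don't matter :)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    uint32_t palette[256];
    for (int i = 0; i < 256; i++)        // 256 random 32-bit "colors"
        palette[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

    FILE *f = fopen("random_bytes_depal.bin", "wb");
    if ( !f ) return 1;
    for (int i = 0; i < 100000; i++)     // 100,000 random indices -> 400,000 output bytes
    {
        uint32_t token = palette[rand() & 255];
        fwrite(&token, sizeof(token), 1, f);
    }
    fclose(f);
    return 0;
}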
real world :
Leviathan Optimal5
random_bytes_depal.bin : 400,000 -> 119,034 = 2.381 bpb = 3.360 to 1
it turns out this theoretical limit is almost exactly achieved by Oodle Leviathan, only 10 bytes over due to headers and so on.
Leviathan is able to hit this limit (unlike simpler LZ77's) because it will entropy code all the match signals to nearly zero bits (all the matches will be "match len 4" commands, so they will have a probability near 1 and thus entropy code to zero bits). Also the offsets are in multiples of 4, but Leviathan will see the bottom bits are always zero and not send them. The result is that Leviathan sends the matches in this kind of data just as if it was sending a distance in tokens (as opposed to bytes) back to find the repeated token, which is exactly what our toy problem looked at.
We cannot do better than this with LZ77-style distance coding of matches. If we want to beat this we must induce the dictionary and send id references to whole words. Leviathan and similar LZ's cannot get as close to the optimum on text, because we don't handle tokens of varying length as well. In this kind of depalettized data, the match length is totally redundant and should be coded in zero bits. With text there is content-length correlation which we don't model at all.
Also note that we assume random tokens here. The loss due to sending distances instead of token id's gets even worse if the tokens are not equiprobable, as the distances are a poor way to capture the different token probabilities.
Summary :
The minimum possible coding loss due to sending repeated tokens by distance rather than by index approaches log2(e) bits for large alphabets (and "large" here means only around 20; the excess is already 1.406 bits at N = 20). This theoretical limit acts as a real world limit on the compression that LZ77 class algorithms can achieve on depalettized data.
ozip now has a benchmarking option (ozip -b) which is an easy way to test Oodle.
ozip -b runs encode & decode many times to provide accurate timing. It does not include IO. It was designed to be similar to zstd -b so that they are directly comparable.
ozip -b can take a file or a dir (plus wildcard), in which case it will test all the files in the dir. You can set up the specific compressor and options you want to test to see how they affect performance and compression ratio.
So for example you can test the effect of spaceSpeedTradeoffBytes on Kraken level Optimal1 :
r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os512
K 5 mozilla : 51220480 -> 14288373 (3.585), 3.5 MB/s, 1080.4 MB/s
r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla
K 5 mozilla : 51220480 -> 14216948 (3.603), 3.5 MB/s, 1048.6 MB/s
r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os128
K 5 mozilla : 51220480 -> 14164777 (3.616), 3.5 MB/s, 1004.6 MB/s
Or to test Kraken HyperFast3 on all the files in Silesia :
r:\>ozip -b -c8 -ol-3 r:\testsets\silesia\*
K-3 12 files : 211938580 -> 81913142 (2.587), 339.0 MB/s, 1087.6 MB/s
Another option for easy testing with Oodle is example_lz_chart, which is also provided as a pre-compiled exe and also as source code.
example_lz_chart runs on a single file you provide and prints a report of the compression ratio and speed of various Oodle compressors and encode levels. This gives you an overview of the different performance points you can hit with Oodle.
example_lz_chart takes a while because it's running all the compressors at all levels, and doing each several times to get accurate timing. You can run it on your own data file to get a report like the one below, which was run on lzt99 (Core i7-8750H @ 4 GHz) :
Oodle 2.8.1 example_lz_chart
<file>
lz_chart loading r:\testsets\lztestset/lzt99...
file size : 24700820
------------------------------------------------------------------------------
Selkie : super fast to encode & decode, least compression
Mermaid: fast decode with better-than-zlib compression
Kraken : good compression, fast decoding, great tradeoff!
Leviathan : very high compression, slowest decode
------------------------------------------------------------------------------
chart cell shows | raw/comp ratio : encode MB/s : decode MB/s |
All compressors run at various encoder effort levels (SuperFast - Optimal).
Many repetitions are run for accurate timing.
------------------------------------------------------------------------------
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie |1.41:976:5636|1.45:906:5581|1.53:684:5299|1.68:473:5327|1.70:439:5287|
Mermaid|1.54:820:3946|1.66:639:3218|1.79:524:3031|2.01:368:2943|2.04:343:2900|
Kraken |1.55:840:2774|1.71:624:1710|1.88:500:1632|2.10:376:1625|2.27:233:1477|
------------------------------------------------------------------------------
compression ratio (raw/comp):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 1.412 | 1.447 | 1.526 | 1.678 | 1.698 |
Mermaid| 1.542 | 1.660 | 1.793 | 2.011 | 2.041 |
Kraken | 1.548 | 1.711 | 1.877 | 2.103 | 2.268 |
------------------------------------------------------------------------------
encode speed (MB/s):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 976.375 | 905.754 | 683.646 | 472.663 | 439.319 |
Mermaid| 820.129 | 638.548 | 524.130 | 368.366 | 343.193 |
Kraken | 840.333 | 624.029 | 500.286 | 376.477 | 232.842 |
------------------------------------------------------------------------------
decode speed (MB/s):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 5636.497 | 5581.097 | 5299.468 | 5327.012 | 5287.330 |
Mermaid| 3946.133 | 3217.677 | 3030.813 | 2942.921 | 2899.940 |
Kraken | 2774.189 | 1709.920 | 1631.861 | 1625.268 | 1477.216 |
------------------------------------------------------------------------------
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie |1.75:290:4854|1.83:143:5141|1.86: 56:5289|1.93: 12:5336|1.94:8.3:5307|
Mermaid|2.12:246:2817|2.19:128:3070|2.21: 51:3234|2.37:6.7:2834|2.44:5.1:2731|
Kraken |2.32:171:1625|2.39: 43:1742|2.44: 27:1719|2.55: 11:1591|2.64:3.9:1499|
Leviath|2.48: 74:1085|2.56: 30:1126|2.62: 13:1146|2.71:4.2:1137|2.75:2.6:1103|
------------------------------------------------------------------------------
compression ratio (raw/comp):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 1.748 | 1.833 | 1.863 | 1.933 | 1.943 |
Mermaid| 2.118 | 2.194 | 2.207 | 2.370 | 2.439 |
Kraken | 2.320 | 2.390 | 2.435 | 2.553 | 2.643 |
Leviath| 2.479 | 2.557 | 2.617 | 2.708 | 2.749 |
------------------------------------------------------------------------------
encode speed (MB/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 290.440 | 143.077 | 55.614 | 12.236 | 8.348 |
Mermaid| 246.499 | 128.477 | 51.089 | 6.696 | 5.106 |
Kraken | 171.500 | 43.091 | 26.710 | 11.366 | 3.932 |
Leviath| 73.601 | 30.005 | 13.121 | 4.218 | 2.625 |
------------------------------------------------------------------------------
decode speed (MB/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 4853.862 | 5141.077 | 5288.575 | 5336.449 | 5307.097 |
Mermaid| 2816.706 | 3070.103 | 3233.684 | 2833.508 | 2730.911 |
Kraken | 1624.872 | 1742.194 | 1718.545 | 1590.779 | 1498.863 |
Leviath| 1085.373 | 1126.420 | 1145.960 | 1136.569 | 1102.641 |
------------------------------------------------------------------------------
WARNING :
Do not try to profile Oodle by process timing ozip.
The normal ozip (not -b mode) uses stdio and is not designed to be as efficient as possible. It's designed for simplicity and to replicate gzip behavior when used for streaming pipes on UNIX.
In general it is not recommended to benchmark by timing with things like IO included because it's very difficult to do that right and can give misleading results.
See also :
Tips for benchmarking a compressor
The Perils of Holistic Profiling
Tips for using Oodle as Efficiently as Possible
NOTE :
ozip does not have any threading. ozip -b is benchmarking single threaded performance.
This is true even for the new Jobify threading because ozip initializes OodleX without threads :
OodleX_Init_Default(OODLE_HEADER_VERSION,OodleX_Init_GetDefaults_DebugSystems_No,OodleX_Init_GetDefaults_Threads_No);
I believe that zstd -b is also single threaded so they are apples to apples. However some compressors use threads by default (LZMA, LZHAM, etc.) so if they are being compared they should be set to not use threads OR you should use Oodle with threads. Measuring multi-threaded performance is context dependent (for example are you encoding many small chunks simultaneously?) and I generally don't recommend it, it's much easier to compare fairly with single threaded performance.
For high performance on large files, ask for the "oozi" example.
The major new features in Oodle 2.8 are faster Optimal-level encoding for the new LZ compressors, and the new Jobify threading of Oodle Core.
An example of the encoder speed improvement on the "seven" test set, measured with ect on a Core i7 3770, Kraken levels 5 (Optimal1) and 7 (Optimal3) :
Oodle 2.7.5 :
ooKraken5 : 3.02:1 , 3.3 enc MB/s , 1089.2 dec MB/s
ooKraken7 : 3.09:1 , 1.5 enc MB/s , 1038.1 dec MB/s
Oodle 2.8.0 : (without Jobify)
ooKraken5 : 3.01:1 , 4.6 enc MB/s , 1093.6 dec MB/s
ooKraken7 : 3.08:1 , 2.3 enc MB/s , 1027.6 dec MB/s
Oodle 2.8.0 : (with Jobify)
ooKraken5 : 3.01:1 , 7.2 enc MB/s , 1088.3 dec MB/s
ooKraken7 : 3.08:1 , 2.9 enc MB/s , 1024.6 dec MB/s
See the full change log for more.
About the new Jobify threading of Oodle Core :
Oodle Core is a pure code lib (as much as possible) that just does memory to memory compression and decompression. It does not have IO, threading, or other system dependencies. (that's provided by Oodle Ext). The system functions that Oodle Core needs are accessed through function pointers that the user can provide, such as for allocations and logging. We have extended this so you can now plug in a Job threading system which Oodle Core can optionally use to multi-thread operations.
Currently the only thing we will multi-thread is OodleLZ_Compress encoding of the new LZ compressors (Kraken, Mermaid, Selkie, Leviathan) at the Optimal levels, on buffers larger than one BLOCK (256 KB). In the future we may multi-thread other things.
Previously if you wanted multi-threaded encoding you had to split your buffers into chunks and multi-thread at the chunk level (with or without overlap), or by encoding multiple files simultaneously. You still can and should do that. Oodle Ext for example provides functions to multi-thread at this granularity. Oodle Core does not do this for you. I refer to this as "macro" parallelism.
The Oodle Core provides more "micro" parallelism that uses multiple cores even on individual buffers. It parallelizes at the BLOCK level, hence it will not get any parallelism on buffers <= one BLOCK (256 KB).
Threading of OodleLZ_Compress is controlled by the OodleLZ_CompressOptions:jobify setting. If you don't touch it, the default value (Jobify_Default) is to use threads if a thread system is plugged in to Oodle Core, and to not use threads if no thread system is plugged in. You may change that option to explicitly control which calls try to use threads and which don't.
OodleX_Init plugs the Oodle Ext thread system in to Oodle Core. So if you use OodleX and don't touch anything, you will now have Jobify threading of OodleLZ_Compress automatically enabled. You can specify Threads_No in OodleX_Init if you don't want the OodleX thread system. If you use OodleX you should NOT plug in your own thread system or allocators into Oodle Core - you must let OodleX provide all the plugins. The Oodle Core plugins allow people who are not using OodleX to provide the systems from their own engine.
WHO WILL SEE AN ENCODE PERF WIN :
If you are encoding buffers larger than 1 BLOCK (256 KB).
If you are encoding at level Optimal1 (5) or higher.
If you use the new LZ codecs (Kraken, Mermaid, Selkie, Leviathan)
If you plug in a job system, either with OodleX or your own.
CAVEAT :
If you are already heavily macro-threading, eg. encoding lots of files multi-threaded, using all your cores, then Jobify probably won't help much. It also won't hurt, and might help ensure full utilization of all your cores. YMMV.
If you are encoding small chunks (say 64 KB or 256 KB), then you should be macro-threading, encoding those chunks simultaneously on many threads and Jobify does not apply to you. Note when encoding lots of small chunks you should be passing pre-allocated memory to Oodle and reusing that memory for all your compress calls (but not sharing it across threads - one scratch memory buffer per thread!). Allocation time overhead can be very significant on small chunks.
If you are encoding huge files, you should be macro-threading at the chunk level, possibly with dictionary backup for overlap. Contact RAD support for the "oozi" example that demonstrates multi-threaded encoding of huge files with async IO.
NOTE : All the perf numbers we post about and shows graphs for are for single threaded speed. I will try to continue to stick to that.
A few APIs have changed & the CompressOptions struct has changed.
This is why the middle version number (8) was incremented. When the middle ("major") version of Oodle is the same, the Oodle lib is binary link compatible. That means you can just drop in a new DLL without recompiling. When the major version changes you must recompile.
A few APIs have small signature changes :
OodleLZ_GetDecodeBufferSize, OodleLZ_GetCompressedBufferSizeNeeded and OodleLZ_GetInPlaceDecodeBufferSize :
now take a compressor argument so they can return smaller padding for the new codecs.
OodleLZ_GetChunkCompressor : now takes a compressed size argument to ensure it doesn't read past the end.
These should give compile errors and be easy to fix.
The CompressOptions struct has new fields. Those fields may be zero initialized to get default values. So if you were initializing the struct
thusly :
struct OodleLZ_CompressOptions my_options = { 1, 2, 3 };
the new fields on the end will be implicitly zeroed by C, and that is fine.
NOTE : I do NOT recommend that style of initializing CompressOptions. The recommended pattern is to GetDefaults and then modify the fields you
want to change :
struct OodleLZ_CompressOptions my_options = OodleLZ_CompressOptions_GetDefault();
my_options.seekChunkReset = true;
my_options.dictionarySize = 256*1024;
then after you set up options you should Validate :
OodleLZ_CompressOptions_Validate(&my_options);
Validate will clamp values into valid ranges and make sure that any constraints are met. Note that if Validate changes your options you should
really look at why, you shouldn't be shipping code where you rely on Validate to clamp your options.
WARNINGS :
example_lz before 2.8.0 had a bug that caused it to stomp the user-provided input file, if one was provided.
YES IT WOULD STOMP YOUR FILE!
That bug is not in the Oodle library, it's in the example code, so we did not issue a critical bug fix for it, but please beware running the old example_lz with a file argument. If you update to the 2.8.0 SDK please make sure you update the *whole* SDK including the examples, not just the lib!
On Windows it is very important to not link both Oodle Core and Ext. The Oodle Ext DLL includes a whole copy of Oodle Core - if you use OodleX you should ONLY link to the Ext DLL, *not* both.
Unfortunately because of the C linkage model, if you link to both Core and Ext, the Oodle Core symbols will be multiply defined and just silently link without a warning or anything. That is not benign. (It's almost never benign and C needs to get its act together and fix linkage in general). It's specifically not benign here, because Oodle Ext will be calling its own copy of Core, but you might be calling to the other copy of Core, so the static variables will not be shared.
I thought it might be interesting to write up LZNIB and look at some of the things we learned from it.
LZNIB is a non-entropy coded LZ (*) which writes values in nibbles and bytes. Literals are written uncompressed (8 bits).
(* = actually it's more like "semi entropy coded"; it doesn't use bit io, and doesn't use standard entropy coders like Huffman, but the variable length integer encoding was adjusted to minimize entropy, and can vary to fit the file's statistics; more details on this later)
LZNIB can send three actions : literal run ("lrl") , match with offset ("normal match"), match with repeat offset ("rep match").
One of the key realizations for LZNIB is that even though there are three actions, only two are possible at any time.
After Literals :
LRL should not occur (would have just been a longer LRL)
match & rep match are possible
After Match or Rep Match :
rep match should not occur (would have just been a longer match)
normal match & lrl are possible
because there are only two actions possible at each point, we can send this using a nibble with a decision threshold value :
Threshold T
values in [0 , T-1] are action type 1
values in [T , 15] are action type 2
So the core decode step of LZNIB is to grab the next nibble, do one branch, and then process that action.
The values within your threshold group are the first part of your LRL or match len.
There are at least two different thresholds, one for {match/rep-match} in the after-literals state,
and one for {match/lrl} in the after-match state. In LZNIB we hard-coded the threshold for match/rep-match
to 5 as this was found to be good on most data. The optimal {match/lrl} threshold is more dependent on the data.
Approximately, (T/16) should be the probability of action type 1. This is in fact exactly what entropy coding this binary action choice would do. What entropy coding does is take your range of possible encodable values and splits them by the probability of the binary choice. The remaining values in the word can then be split to send further choices. Here we just do the first choice using a semi-entropy-coding, then use the remainder of the word to send as much of our next value ("len") as we can. (the fact that we do not entropy code "len" and that len has probability peaks is why the probability based choice of T might not be actually optimal)
In LZNIB the threshold T could be transmitted per chunk so that it could be optimized to fit the data. In practice that was rarely used, and mostly the default value of T=8 was used. Part of why it was rarely used is due to the difficulty of parse-statistics feedback. The parse must make decisions based on some assumed T, because it affects coding cost. Then after you have your parse choices you can make a choice for optimal T for that data, but if that T is different you might want to re-parse with the new T. This is a mess and the simplest way to address it here is just to parse for all values of T. You can reduce the set of likely useful T values to around 8 or 10 and just do the parse 8-10X times to make a T choice. This in fact works great and helps compression a lot on some data, but is obviously slow.
In contrast to something like LZ4, LZNIB has a flexible action sequence. LZ4 always alternates {LRL,match,LRL,match,...}; for example to send two matches in a row it must send an LRL of length 0 between them. LZNIB's flexible action sequence requires a branch in the decoder, but it can send {LRL,match,match,LRL,rep match,match,...} directly.
LZNIB uses unbounded offsets for matches. They are sent using a variable length integer scheme. Initially 12 bits are sent, then a byte is added as necessary. The scheme used is "encodemod", which treats each value sent as being divided in two ranges. One range is for values that can terminate in the current word, the other range is for values that don't fit and includes a portion of the remaining value. See the encodemod post series for details.
The encodemod scheme is very flexible and can be tuned to fit the entropy characteristics of the data (unlike traditional bit-packing variable length integer schemes, which have a less flexible knob). To do this we gathered LZ parses for many files and found the best encodemod thresholds. This was done for the offsets, for LRL, match len (after match), match len (after literal), and rep match len.
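To make the "terminate vs continue" split concrete, here's a minimal sketch of one encodemod-style variable length integer using byte words and a single threshold M. The actual LZNIB coder uses a 12-bit first word for offsets and nibble first words for lens, with thresholds tuned from gathered statistics, so treat this only as an illustration :
#include <stdint.h>

// bytes < M terminate the value ; bytes >= M carry part of the value and continue.
// (M in [1,255] ; no overflow guards, illustration only)
static uint8_t * encodemod_put(uint8_t *p, uint32_t val, uint32_t M)
{
    while ( val >= M )
    {
        uint32_t r = val - M;
        *p++ = (uint8_t)( M + r % (256 - M) );  // continuation byte carries the low part
        val = r / (256 - M);                    // the rest goes in the following bytes
    }
    *p++ = (uint8_t)val;                        // terminating byte, < M
    return p;
}

static const uint8_t * encodemod_get(const uint8_t *p, uint32_t *pval, uint32_t M)
{
    uint32_t base = 0, scale = 1;
    for(;;)
    {
        uint32_t b = *p++;
        if ( b < M ) { *pval = base + scale * b; return p; }
        base  += scale * b;          // b = M + low part of the remaining value
        scale *= (256 - M);
    }
}
The single knob M is the thing that gets tuned per field (offset, lrl, the two match lens, rep len) : raising it spends more of each word on values that terminate immediately, lowering it gives more room to the continuation path.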
All the lens (LRL, match len, etc.) are sent first in the control nibble. If they don't fit, the maximum is sent in the control nibble, then the remainder is sent using encodemod. The encodemod used for lens sends first another nibble, then any further values are sent with bytes.
The full decoder for LZNIB in pseudocode is :
Do a literal run to start.
After-literal state :
Get control nibble
if nib < 5
{
rep match
if nib == 4 get further ml using encodemod, first nibble then bytes
copy rep match
}
else
{
normal match
if nib == 15 get further ml using encodemod, first nibble then bytes
get offset using encodemod, first 12 bits then bytes
copy match
}
After-match state :
Get control nibble
if nib < T
{
literal run
if nib == T-1 get further lrl using encodemod, first nibble then bytes
copy literal run
goto After-literal
}
else
{
normal match
if nib == 15 get further ml using encodemod, first nibble then bytes
get offset using encodemod, first 12 bits then bytes
copy match
goto After-match
}
That's it.
LZNIB is simple to decode but it turned out to be quite hard to parse well. Making good choices in the encoder can have an absolutely huge effect on the decode speed, even at roughly equal compressed size. Some of those issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be important again for Kraken & Mermaid/Selkie.
One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in units of nibbles. Those ties need to be broken in favor of faster decode speed.
To good approximation, decode speed is just about minimizing the number of control nibbles. You want as few transitions between actions as possible, you prefer to stay in long literal runs and long matches. You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size with fewer mode switches.
Let's look at some simple examples to get a feel for these issues.
Consider a rep match of len 1. A rep match control takes 1 nibble, while sending a literal instead takes 2 nibbles, so sending a rep instead would save you 1 nibble. But, if the rep is followed by another literal run, you would have to send a new lrl control nibble there, while staying in an lrl might save you that.
L = literal , costs 2 nibbles + 1 at start of lrl
R = rep match, costs 1 control nibble
M = normal match, costs 1 control nibble + 3 or more nibbles for offset
LLR = 1+2*2+1 = 6 nibbles
LLL = 1+3*2 = 7 nibbles
so LLR is best right? take the rep!
LLRL = 1+2*2+1 + 1+2 = 9 nibbles
LLLL = 1+4*2 = 9 nibbles
No! Nibble count is the same but fewer controls = prefer the literal.
It depends what follows the rep.
LLRM
LLLM
in this case the rep is cheaper.
So if a literal follows the rep, don't take it, right?
LLLLRLLLL = 1+2*4 + 1 + 1+2*4 = 19 nibbles
LLLLLLLLL = 1+2*9 + 1 = 20 nibbles
No! In the second case the LRL of 9 can't fit in the control nibble, so an
extra lrl nibble must be sent. So prefer the rep here.
So the simple choice of "do I take a rep of len 1 or stay in LRL" is not easy and can only be made
non-locally.
A similar thing happens with normal matches. A normal match of len 3 with an offset that fits in 12 bits
takes 4 nibbles, which saves you 2 vs sending 3 literals. But if you have to resume an LRL after the match,
that costs you 1, so your savings is down to 1. There may be cheaper ways (in terms of decode speed) to get that
1 nibble savings, such as a rep of len 1 for example. Or if you can trade two len 3 matches for a single len 4 match :
MMMLMMML = 4 + 3 + 4 + 3 = 14
LLMMMMLL = 5 + 4 + 5 = 14
same nibble count, but fewer mode switches = prefer the single len 4 match over two len 3's
A len 3 match that doesn't fit in the 12 bit offset (but does fit in the next threshold, at 20 bits) takes 6
nibbles to send, which is a tie with 3 literals. But again it could actually save you a nibble if it causes
the LRL to fit in control.
You might be tempted to just punt this, eg. make rep matches have a minimum length of 2, and normal matches have a minimum length of 4. However that is leaving a lot of compression on the table. The shorter matches only save a nibble here and there, but on some files there are many of those possible. For the optimal parsers we wanted to be able to get those wins when they are possible.
The challenge of optimal parsing LZNIB.
The standard approach for optimal parsing a format with rep matches is the "forward arrivals" style parse (as in LZMA). (this is in contrast to the classical backwards LZSS optimal parse which can only be used in formats that do not have adaptive state which is parse-path dependent).
See some posts on forward-arrivals parse : here and here
The standard forward arrivals parse does not work well with LZNIB. In the forward-arrivals parse, you take the cheapest way to arrive at pos P, you cost all ways to go forward from P (with a literal, or various match options), and fill out the costs at P+1, P+len, etc.
The problem is that it doesn't account for the fact that the best way to arrive at pos P depends on what comes next (in particular, do literals or matches come next, and if literals, how many?). It treats each action as being independent.
We'll look at some ways to improve the forward arrivals parse but first a detour. Fortunately in this case it's possible to solve this systematically.
Whenever I face a difficult heuristic like this where we know we're approximating in some way and don't know quite the best way, the first thing I always look for is - can we solve it exactly? (or with some bounded error). The more exact solution might be too slow and we won't actually be able to ship it, but it gives us a target, it lets us know how close our approximation is, and may give us guidance on how to get there.
In this case what we can do is a full dynamic programming parse with the LRL as part of the state matrix.
Optimal parsing is always a form of dynamic programming. We often approximate it by ignoring the internal state of the parse and making an array of only [pos]. What I mean by "full dynamic programming" is to make the state explicit and use a 2d array of [pos][state]. Then on each parse choice, you look at how that steps position (eg. by match len) and also how that changes the internal state, and you move to the next array slot. In this case the important state variable is LRL.
(we're treating a single literal as a coding action, which it is not, and correcting that by considering LRL as a state variable. The result is that we get correct code costs for all LRL steps from each pos.)
(note that the offset which is used for rep matches is also an important internal state variable, but we are continuing to ignore that as is usual; we do store a different rep arrival in each [pos][lrl] entry, but we don't differentiate between arrivals that only differ by rep offset)
We consider LRL's in [0,21]. This lets us capture the transition of LRL not fitting in the control nibble (at 7, with the default threshold of 8), and then again when it overflows the first nibble of encodemod (at 7+15=21). LRL value of 21 is a state for all higher LRL's, so we don't account for when later bytes of encodemod overflow.
We make a 2d matrix that's (file len) * 22 entries.
At each pos P, you can advance all the entries by adding one literal. This does :
for all LRL
costs[P+1][LRL+1] = 2 + costs[P][LRL] + LRL_Delta_Cost(LRL)
(2 nibbles is the cost of a literal byte)
where LRL_Delta_Cost =
1 at LRL = 0
1 at LRL = 7
1 at LRL = 21
otherwise zero
Or you can advance with a match. To advance with a match, start from the cheapest arrival with any LRL and step
by the match len and fill costs[P+len][0]. You can also advance with a rep match, which is similar except that it
cannot advance from the LRL=0 state. Each arrival stores where it came from (both pos and LRL).
When you reach the end of the file, you take the cheapest of all the LRL arrivals and trace it backwards to the root to find the parse.
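Roughly, the update described above looks like this in code. This is only a sketch : the arrays are hypothetical, match finding / rep-offset tracking / backtrace bookkeeping are omitted, cost[][] starts at INT_MAX except the root, and the clamped LRL=21 state simply stops charging further overflow nibbles :
#include <limits.h>

#define NUM_LRL   22   /* LRL states 0..21 ; 21 stands for "21 or more" */
#define LIT_COST  2    /* nibbles per literal byte */

/* extra nibble paid when the literal run is started or overflows a length threshold */
static int lrl_delta_cost(int lrl)
{
    return ( lrl == 0 || lrl == 7 || lrl == 21 ) ? 1 : 0;
}

/* advance every arrival at pos P by one literal */
static void advance_literal(int cost[][NUM_LRL], int P)
{
    for (int lrl = 0; lrl < NUM_LRL; lrl++)
    {
        if ( cost[P][lrl] == INT_MAX ) continue;
        int next = ( lrl < NUM_LRL-1 ) ? lrl+1 : lrl;   /* clamp the state at 21 */
        int c = cost[P][lrl] + LIT_COST;
        if ( next > lrl ) c += lrl_delta_cost(lrl);     /* clamped top state stops charging */
        if ( c < cost[P+1][next] ) cost[P+1][next] = c;
    }
}

/* advance by a match of length 'len' costing 'match_nibbles' nibbles ;
   a rep match cannot start from the LRL=0 (after-match) state */
static void advance_match(int cost[][NUM_LRL], int P, int len, int match_nibbles, int is_rep)
{
    int best = INT_MAX;
    for (int lrl = is_rep ? 1 : 0; lrl < NUM_LRL; lrl++)
        if ( cost[P][lrl] < best ) best = cost[P][lrl];
    if ( best != INT_MAX && best + match_nibbles < cost[P+len][0] )
        cost[P+len][0] = best + match_nibbles;
}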
This kind of full matrix dynamic programming parse completely captures the non-local effects caused by things like LRL running into thresholds that change the cost. Unfortunately it's too slow for production (and uses a lot of memory), but it is very useful as a reference point. It told us that a much better optimal parse was possible.
An important note : The dynamic programming parse needs to be making space-speed decisions. As noted previously in particular there are a lot of ties and those need to be broken in favor of decode speed. The primary driver for decode speed is the number of control tokens. What we did is just add a cost penalty to each control token. The number we used is (1/4) of a nibble penalty for each control token. That is, we will give up 1 nibble of compression if it saves 4 mode switches. If we can send {LRL} instead of {LRL,match,LRL,rep,match} we will do it if the penalty is only one nibble.
(modern Oodle now uses a rate-distortion-like Lagrange multiplier to optimize for the desired tradeoff of space & speed, which the user can control. This work in LZNIB predates that system and just used a hard-coded tradeoff that we found to greatly improve decode speed without much penalty to compressed size.)
So, armed with the dynamic programming solution, we could generate stats to look at how it was parsing files, and
compare that to how forward-arrivals was parsing. What we saw was :
dynamic programming :
---------------------------------------------------------
7.6 bytes per token ; 30.2% literals, 55.6% matches, 14.1% LO
AM: 50.4% match / 49.6% lrl ; 8.9 average AM ML , 7.0 lrl
AM: offlow 40.7% ml3 24.0% ml4 21.8%
AL: 49.2% match / 50.8% LO ; 7.6 average AL ML , 6.4 LO len
AL: offlow 53.5% ml3 24.6% ml4 19.9% ; LO len1 11.4% len2 24.9%
---------------------------------------------------------
forward arrivals :
---------------------------------------------------------
5.9 bytes per token ; 21.0% literals, 61.1% matches, 17.9% LO
AM: 46.4% match / 53.6% lrl ; 8.4 average AM ML , 3.5 lrl
AM: offlow 43.1% ml3 37.5% ml4 19.9%
AL: 39.5% match / 60.5% LO ; 7.5 average AL ML , 5.0 LO len
AL: offlow 36.7% ml3 43.1% ml4 15.0% ; LO len1 30.1% len2 23.4%
---------------------------------------------------------
key :
---------------------------------------------------------
decompressed bytes per token ; % of control tokens of each type
AM = in after-match state
AM: % of tokens ; average len of that token
AM: offlow = offset in 12 bits , ml3 = % of matches that have len 3
AL = in after-literal state
LO means "last offset" which a synonym for "rep match"
---------------------------------------------------------
In this case the compressed size was very similar, but the dynamic programming parse was much faster to decode (about 20% faster).
We can easily see what's going on :
DP parse has more bytes per token, eg. fewer tokens for the whole file. This is why it is faster. This is more of the end result
of the problem rather than the source of the problem.
DP parse has way more bytes in literals (30.2% vs 21%)
DP parse has way longer average LRL (7.0 vs 3.5)
forward-arrivals parse sends way more len-3 matches (37.5% vs 24.0% and 43.1% vs 24.6%)
forward-arrivals parse sends way more rep len 1 matches (30.1% vs 11.4%)
I found it quite interesting that two ways to parse the file could produce nearly the same
compressed size, but get there in very different ways.
Clearly the forward arrivals parse needs a way to find the longer literal runs when they are best in a non-local way.
When you are walking along in a forward-arrivals parse, you just see the single cheapest arrival to your pos;
consider something like this :
1234
LLLR
At pos 4, the cheapest arrival is via rep-len1. The standard forward arrivals parse fills that spot with the rep-len1 arrival,
and then continues forward from there. There's no way to go back in time. You no longer see the arrival at pos 3.
A key thing we should observe is that when a non-local cost effect makes us change a decision (away from just using cheapest arrival), it is always in favor of LRL. The standard arrivals parse is taking too many matches & reps and we want to take literal runs instead in some of those places.
The solution is pretty simple :
Store only match (and rep-match) arrivals at each pos (not literal arrivals). When considering going forward, consider starting at current pos P
from the match arrival there, OR go from [P-1] with 1 LRL, or [P-2] with 2 LRL, etc.
considering an example from before :
12345678
MMMLMMML
LLMMMMLL
at pos 8 I look at the ways to go forward (find matches & reps)
say I find a match
how should I arrive at pos 8 ?
I can arrive via
arrival[7] (MMMLMMM) + LRL 1
or
arrival[6] (LLMMMM) + LRL 2
You should choose the latter.
Even though arrival[7] was the cheaper way to get to pos 7, arrival[6] is the better way to get to pos 8.
You want to limit the LRL lookback to something reasonable. Perhaps 8 (or 22 as we did in the full matrix parse, but that isn't necessary).
If you find no match arrivals in the 8-step lookback, you need a way to go back to the most recent preceding match arrival.
Instead of just looking at a forward step of {match} we consider {step back 1 + L + M} , {back2 + LLM }, etc. In LZNIB, costing the LRL lookback steps is simple because literals are just always 1 byte.
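A sketch of that arrival selection with lookback, using the simple cost model from earlier (2 nibbles per literal plus one control nibble to restart the run; the LRL-overflow nibble is ignored for brevity, and the array is hypothetical) :
#include <limits.h>

/* arrival_cost[p] = cheapest match (or rep-match) arrival stored at p, in nibbles,
   or INT_MAX if no match arrival is stored there */
static int best_cost_to_reach(const int *arrival_cost, int P, int max_lookback)
{
    int best = INT_MAX;
    for (int lrl = 0; lrl <= max_lookback && lrl <= P; lrl++)
    {
        int c = arrival_cost[P - lrl];
        if ( c == INT_MAX ) continue;
        if ( lrl > 0 ) c += 1 + 2*lrl;   /* lrl control nibble + 2 nibbles per literal */
        if ( c < best ) best = c;
    }
    return best;   /* then cost the matches & reps going forward from P on top of this */
}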
Essentially what we're doing here is treating the LZNIB parse as if it used an {LRL,match} combined packet. Even though the actual format has separate literal run and match actions, they act like they are combined.
In fact there's a different way to look at LZNIB. After an LRL nibble action token, there must always be a match or rep-match action.
So we can think of those as instead being a byte action token for {LRL+match}. In that case rather than the after-literal / after-match states,
there is only one state :
LZNIB always after-match :
nibble action : match
(no rep match is possible)
byte action : LRL then match
byte action : LRL then rep-match
If you parse using these actions, then the non-local effects go away.
In the end we found huge improvements to the LZNIB parse that gave us ~20% faster decompression just through making better decisions in the encoder. The way that we investigated this and developed the parse was later fruitful in the development of Kraken & Mermaid/Selkie, which have similar issues but are even far more complicated to parse well.
Some old reference links related to LZNIB, and some perf reports showing LZNIB in the pre-oceanic-cryptozoology days :
cbloom rants 10-22-12 - LZ-Bytewise conclusions
cbloom rants 03-02-15 - LZ Rep-Match after Match Strategies
cbloom rants 05-13-15 - Skewed Pareto Chart
cbloom rants Oodle Results Update