Go to the new cbloom rants @ blogspot

02-20-15 | Shoes All Suck

Shoes are fucking bullshit. This is another in the long series of "everyone else is a fucking moron and I should be in charge of everything".

Skippable background : {

My feet are wide in the toebox area (forefoot). This has caused me enormous problems over the years. Basically no shoes fit. As a result of wearing shoes that are too tight in the forefoot, I developed a neuroma. My neuroma is in the usual place, between the metatarsal heads. There's basically no treatment for neuromas. I've tried orthotics and cortisone shots and all the usual stuff. There are surgical options, but the success rate is extremely low and the problem often recurs worse after surgery. The neuroma goes from being a minor annoyance to being incredibly painful when inflamed. It's affected what I can do in my life; I never run on concrete, I try to avoid long walks or standing on concrete. I still do things like hiking and backpacking, but I know I'll be in great pain afterward and just take that as part of the deal.

Recently it's gotten much worse. The problem is I've subconsciously developed a gait that avoids putting pressure on the neuroma. For the past I don't know how many years, I've been walking really crooked. I didn't really notice until it started having bad effects up the chain. First I had an SI slip in my left hip. I got it popped back in, then a month later had another SI slip. (I now know how to pop it back in myself, it's pretty easy). I started going to physical therapy for the left hip, and that immediately set off a groin strain on my left pelvis, which has been lingering now for several months. Basically the short version is that the left foot neuroma has now caused bad effects all the way up the chain due to subconscious favoring.

(I had the same kind of issue with my shoulder separation ; if I could just force my body to keep moving normally and ignore the pain, it actually doesn't have any bad mechanical consequences. In fact the surgical treatment for a neuroma is just to try to kill the nerve. The pain is not indicative of damage occurring, you want to just power through it. The problem is that subconsciously the body just takes over and modifies your movements to avoid the pain, and those modifications are actually much worse for you than the original injury. I'm still doing PT almost every day to try to convince my shoulder that it's allowed to move in a normal way and doesn't have to do janky things to avoid pops.)

So I decided that the neuroma pain that I've been just ignoring for a long time was something that I need to try to address if it's going to fuck up my hips and low back and so on. Part of that is trying to find shoes that give me enough toe room.


So. I need 4E wide shoes, which basically just don't exist. (New Balance is the only significant maker of wide shoes, and they're bullshit. They just aren't actually wide. NB makes its wide shoes by putting a larger upper on the same sole, which means the ball of my foot is hanging over the side of the sole. Fail New Balance.) Even the rare 4E shoes that do exist are wrong, which we will now discuss.

Anyway. That's just background. The point is shoes fucking suck. I'm going to go by type of shoes.

First, oxfords and traditional dress shoes and such. I don't know WTF the point of these shoes is. Raised heel = terrible for gait. No. Tapered toe box = neuroma pain. No. Hard leather sole = uncomfortable. Most have terrible grip soles. Overly rigid soles that don't bend with the step. Just non-functional all around.

It's a shame because I'm tempted to get custom made shoes with a last cut for my foot. The problem is that the custom shoe maker guys only make these fucking non-functional antiquated bullshit shoes.

Second, traditional running shoes.

The stupid minimalist community is all down on these, and I agree with some of the complaints, but there are good things about them.

Good : cushioning to protect from pain on concrete and pebbles. A flexible but structured upper that you can tighten to really fit snug on the foot so it doesn't slide around, thus allowing you to make a hard athletic "cut" (if you can't make a cut in a shoe without the shoe sliding around, then the shoe does not work and goes in the fucking trash). Generally grippy soles that aren't made of something retarded like vibram or leather.

Bad : too much heel-to-toe drop. Too much heel height in general causing early heel strike and strange gait. Too little feel of the ground. Too much taper in the toe box.

Let's talk about the last one : almost every shoe in the world has too much taper in the toe box. Feet are not pointed at the front. It's like the shoe makers all think we have feet that are shaped like boats that come to a point. They're trying to sell us jester shoes.

In fact real feet (especially freak feet like mine) are pretty flat in front. The foot is widest at the ball of the foot and then does *NOT* taper ahead of the ball - the toes go straight ahead, and then there's a pretty straight line across the front edge. Not a pointy taper.

I will illustrate the problem on a shoe I actually like a lot. The Asics Gel Unifire TR 4E is one of the best shoes of the 100 I have tried in the past few months. It's a true 4E sole, has good structure in the heel so my foot doesn't slide all over, and is wide enough in the ball of the foot. *However* it's pointed, which makes it still a bit painful for me.

I believe these images are self explanatory :

... and everyone who makes shoes is fired.

(the other things I hate about the Unifire are : 1. they're ugly as piss, and 2. the tread chunks stick out past the edges of the upper, which makes them feel bigger than they are. A shoe should feel as small as possible, like it's not there. 3. the heel is way too high; I do not want or need high-heel sneakers.)

cliff note version :
These are better shoes than the stupid shit that mainstream shoe manufacturers make :

Now, you may object - "cbloom, you're taking your own personal weird foot issue and projecting on the masses; blaming shoe makers for making crap shoes when in fact their shoes work fine for most people". Mmm, yes, but I don't think so. The thing is, I don't think your shoes actually *do* work for you. You may wear something ridiculously narrow and pointed like a typical Nike shoe, and you've been lucky enough to not get any obvious painful problems like a bunion or hammertoe or neuroma, so you think your shoes are great. But I don't think they are. Most people who wear western shoes have got their pinky & big toes jammed inward. This is fucking up your feet. Just because you've endured the toe-smushing doesn't mean it's actually right.

Aside from the shape of the toe box, traditional running shoes would be easy to fix. For me the perfect sole is something like what is on my beloved Onitsuka Tigers. I don't need a fucking one inch thick insane high-heel rubber sole like you get on New Balance or the Unifire. I want one centimeter like you get on the Tigers. (Tigers do have a tiny bit too much drop, I'd like a little more padding under the toes and a little less in the heel). I think that retro sneakers are as close as you can get to a perfect shoe. They're generally light weight, you can actually tie them tight to your foot, they have pretty good ground feel. They have the plusses of running shoes but without too much sole. Unfortunately for me there's not a single one of them that's made wide, and they're all too tapered in the front.

Third, let's talk about minimal shoes.

Minimal shoes are fucking bullshit. Shoes in the real world need some padding. Maybe if we only walked on grass and dirt they would be fine. But we don't, we walk on concrete and gravel and other painful shit. Like the fucking moron "paleo" eaters, they love to talk about how the "foot evolved to be barefoot" blah blah. Okay, first of all, a minimal shoe is not at all like actually being barefoot. That's just a lie. You don't move in the same way at all. Second of all, when the foot evolved we didn't walk on fucking concrete all day, we didn't have broken glass and nails, etc. Third of all, once apes evolved the ability to make things, one of the very first fucking things any primitive society makes is shoes.

(the only minimal shoe that I sort of believe in is the VFF, which at least allows your toes to grip the ground and so on, in theory. The "minimal" shoe I'm primarily objecting to here is something like the VivoBarefoot, which is basically a leather sock. It provides no padding. It slips around on your foot if you try to do anything athletic in it. It removes your toes' ability to actually feel the ground. It's total bullshit.)

(I do actually believe that truly barefoot is the best way to be. That's very different than "barefoot" shoes. It also only makes sense to me on surfaces that are sufficiently soft. eg. maybe if you have a nice rubber running track that you know to be free of nails and glass - run on it truly barefoot, and yeah I totally support that)

The distressing thing is that in some ways the minimal shoes are doing things right. Some of them actually have quite wide toe boxes (*) that let your feet spread nicely. I like zero drop. Actually for me I think a tiny bit of drop is better, but zero drop is better than the huge drop of traditional shoes. I like light weight. What I don't like is zero padding.

(* = unfortunately almost all of the minimal shoes CLAIM to have "wide toe boxes" but very few of them actually do. Altra, innov8, not at all wide. VivoBarefoot is the best I've found in terms of shoe shape - actually wide and squared off in the front (they look like clown shoes), but they have literally zero padding and are unwearable. Reebok Crossfit Nanos and Lems shoes are probably good for someone with a normal width foot; they feel like a 2E to me, which is not wide enough for me but probably okay for others.)

Crossfit Nanos and Lems are what I would call "semi-minimal". They have a little more structure and actually have about 1 cm of sole thickness for some padding. Lems are the best designed of all the shoes I tried. Unfortunately, they only make whole Euro sizes (**), and they're a tiny bit too narrow for me. The Nanos should be pretty great, but for some moronic reason they've made the sole out of hard plastic instead of soft rubber. (yeah yeah weight lifting, bullshit, whatever).

(** = whole Euro sizes, WTF. The step between sizes is way too big.)

The best shaped shoe that I've found is the Patagonia Loulu. It's a hybrid shoe; semi-minimal, semi-sneaker. Unfortunately it's almost ruined by a fucking retarded synthetic sole. It's some recycled plastic garbage and it just doesn't work. It's widely known to wear away insanely quickly, and despite that it also has zero grip in the wet. It's like ice skating. Fucking awful moronic sole material. JUST USE NATURAL RUBBER you fucking morons, it was perfect a hundred years ago and you're only making it worse.

While I'm at it let me complain about shoe shopping in general :

1. How hard is it to fucking standardize the sizes? I should be able to measure my foot and buy a shoe that will fit. Instead a fucking "size 10" is slightly different in every damn brand.

2. Widths are even worse. There's an ambiguity in width. Is it just the toe box that's wide, and the heel is narrow? Are they both wide? Is it just the upper (bullshit) or the sole also? Furthermore, in a lot of brands "wide" actually means "fat" and the shoe is just bigger all over.

3. How fucking terrible is the internet. I have to search on a whole mess of different web sites. They all have different searches, most of which are broken (like width is not an option, or I can't filter by size until I select a type of shoe). Lots of shoes are only for sale on the manufacturer's web site which I have to go digging around to find and then use their own custom terrible interface. But most of all -

- none of them actually take advantage of the internet. Shopping sites could easily be wikis where customers can mark up entries with extra information. Some people would be nerdy enough to actually measure shoes and add that information so I could search by true measurements instead of the stupid official sizes. I should be able to cross-index all the different shopping sites with one global index. But no. None of that. It's just so fucking backwards.

4. Why can't I get custom made sneakers? I should be able to measure my foot and provide specs and get shoes made to order for reasonable fees.

It's all so antiquated. It requires actually trying on hundreds of shoes to find one that fits. There will of course always be some aspect of personal feel, but right now it's not even that. It's like I put on a shoe and immediately go "nope, this 4E is not actually wide at all" or "nope, this supposed size 10.5 is actually a 10 or 11". I'm wasting all this time weeding out candidates, which could have been done systematically.

02-14-15 | Template Arithmetic Coder Generation

I did something for Oodle LZA that I thought worked out very neatly. I used templates to generate efficient & arbitrary variants of arithmetic coder decompositions.

First you define a pattern. Mine was :

"coder" pattern implements :

void reset();
int decode( arithmetic_decoder * dec );
void encode( arithmetic_encoder * enc, int val );
int cost(int val) const;

where encode & decode also both adapt the model (if any).

You can then implement that pattern in various ways.

Perhaps the first "coder" pattern is just a single bit :

template <int t_tot_shift, int t_upd_shift>
struct arithbit_updshift
{
    U16  m_p0;

    void reset()
    {
        m_p0 = 1<<(t_tot_shift-1);
    }

    // .. etc ..
};


And then you do some N-ary coders. The standard ones being :

template <int t_alphabet_size>
struct counting_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_bottomup_coder;

template <int t_maxval, typename t_arithbit>
struct unary_coder;

I'm not going to show implementations of all of these in this post, but I'll do a few here and there for clarity. The point is more about the architecture than the implementation.

For example :

template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder
{
    enum { c_numvals = 1<<t_numbits };
    //t_arithbit m_bins[c_numvals-1]; // <- what's needed (use ctx-1 index)
    t_arithbit m_bins[c_numvals]; // padded for simpler addressing (use ctx index)

    int decode( arithmetic_decoder * dec )
    {
        int ctx = 1;

        for(int i=0;i<t_numbits;i++)
        {
            int bit = m_bins[ctx].decode(dec);
            ctx <<= 1; ctx |= bit;
        }

        return ctx & (c_numvals-1);
    }

    // etc ...
};

and "t_arithbit" is your base coder pattern for doing a single modeled bit. By making that a template parameter it can be any type that implements the single-bit coder pattern, eg. a concrete instantiation like arithbit_updshift<12,5>, or some other template specialization of your own.
The fun thing is you can start combining them to make new coders which also fit the pattern.

For example, say you want to do a bitwise decomposition coder, but you don't have an even power of 2 worth of values? And maybe you don't want to put your split points right in the middle?

// t_fracmul256 is the split fraction times 256
// eg. t_fracmul256 = 128 is middle splits (binary subdivision)
// t_fracmul256 = 0 is a unary coder

template <int t_numvals, int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder
{
    enum { c_numvals = t_numvals };
    enum { c_numlo = RR_CLAMP( ((t_numvals*t_fracmul256)/256) , 1, (t_numvals-1) ) };
    enum { c_numhi = t_numvals - c_numlo };

    t_arithbit  m_bin;
    bitwise_split_coder<c_numlo,t_fracmul256,t_arithbit> m_lo;
    bitwise_split_coder<c_numhi,t_fracmul256,t_arithbit> m_hi;

    void reset();

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < c_numlo ) { m_bin.encode(enc,0); m_lo.encode(enc,val); }
        else                 { m_bin.encode(enc,1); m_hi.encode(enc, val - c_numlo ); }
    }

    // .. etc ..
};


plus explicit template specializations for t_numvals=1,2 to terminate the recursion.

This lets you compile-time generate funny branching trees to be able to handle something like "my alphabet has 37 symbols, and I want to code each interval as a binary flag at 1/3 of the range, so the first event is [0-12][13-37]" and so on.

And then you can make yourself some generic tools for plugging coders together. The main ones I use are :

// val_split_coder :
// use t_low_coder below t_low_count
// then t_high_coder above t_low_count

template <int t_low_count, typename t_low_coder, typename t_high_coder , typename t_arithbit>
struct val_split_coder
{
    t_arithbit  m_bin;
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < t_low_count ) { m_bin.encode(enc,0); m_lo.encode(enc,val); }
        else                     { m_bin.encode(enc,1); m_hi.encode(enc, val - t_low_count ); }
    }

    // .. etc ..
};


// bit_split_coder :
// use t_low_coder for t_low_bits
// use t_high_coder for higher bits
// (high and low bits are independent)

template <int t_low_bit_count, typename t_low_coder, typename t_high_coder >
struct bit_split_coder
{
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        int low = val & ((1<<t_low_bit_count)-1);
        int high = val >> t_low_bit_count;
        m_lo.encode(enc,low);
        m_hi.encode(enc,high);
    }

    // .. etc ..
};

// bit_split_coder_contexted :
// split bits, code hi then low with hi as context
// (used for decomposition of a value where the bits are dependent)

template <int t_low_bit_count, typename t_low_coder, int t_high_bit_count, typename t_high_coder >
struct bit_split_coder_contexted
{
    t_high_coder m_hi;
    t_low_coder m_lo[(1<<t_high_bit_count)];

    void encode(arithmetic_encoder * enc,int val)
    {
        int high = val >> t_low_bit_count;
        int low = val & ((1<<t_low_bit_count)-1);
        m_hi.encode(enc,high);
        m_lo[high].encode(enc,low); // high bits are the context for the low bits
    }

    // .. etc ..
};

So that gives us a bunch of tools. Then you put them together and make complicated things.

For example, my LZ match length coder is this :

val_split_coder< 8 , 
    bitwise_topdown_coder< 3 , arithbit_updshift<12,5> > ,  // low val coder
    numsigbit_coder< unary_coder<16, arithbit_updshift<12,5> > > ,  // high val coder
    arithbit_updshift<12,5> >

which says : first code a binary event for length < 8 or not. If length < 8 then code it as a 3-bit binary value. If >= 8 then code it using a num-sig-bits decomposition where the # of sig bits are written using unary (arithmetic coded) and the remainder is sent raw.

And the standard LZ offset coder that I described in the last post is something like this :

val_split_coder< 64 , 
    bitwise_topdown_coder< 6 , arithbit_updshift<12,5> > , // low vals
    bit_split_coder< 5 ,  // high vals
        bitwise_bottomup_coder< 5 , arithbit_updshift<12,5> > ,  // low bits of high vals
        numsigbit_coder< unary_coder<30, arithbit_updshift<12,5> > >  // high bits of high vals
    > ,
    arithbit_updshift<12,5> >

To be clear : the advantage of this approach is that you can easily play with different variations and decompositions, plug in different coders for each portion of the operation, etc.

The code generated by this is very good, but once you fully lock down your coders you probably want to copy out the exact cases you used and hand code them, since the human eye can do things the optimizer can't.

02-10-15 | LZ Offset Modeling Rambles

Canonical LZ offset coding :

The standard way to send LZ offsets in a modern LZ coder is like this :

1. Remove the bottom bits. The standard number is 4-7 bits.

low = offset & ((1<<lowbits)-1);
offset >>= lowbits;

then you send "low" entropy coded, using a Huffman table or arithmetic or whatever. (offsets less than the mask are treated separately).

The point of this is that offset bottom bits have useful patterns in structured data. On text this does nothing for you and could be skipped. On structured data you get probability peaks for the low bits of offset at 4,8,12 for typical dword/qword data.

You also get a big peak at the natural structure size of the data. eg. if it's a D3DVertex or whatever and it's 36 or 72 bytes there will be big probability peaks at 36,72,108,144.

2. Send the remaining high part of offset in a kind of "# sig bits" (Elias Gamma) coding.

Count the number of significant bits (ie. find the position of the top set bit). Send the # of sig bits using an entropy coder.

Then send the bits below the top bit raw, non-entropy coded. Like this :

topbitpos = bit_scan_top_bit_pos( offset );
ASSERT( offset >= (1<<topbitpos) && offset < (2<<topbitpos) );

rawbits = offset & ((1<<topbitpos)-1);

send topbitpos entropy coded
send rawbits in topbitpos bits

Optionally you might also entropy-code the bit under the top bit. You could just arithmetic code that bit (conditioned on the position of the top bit as context). Or you can make an expanded top-bit-position code :

slot = 2*topbitpos + ((offset>>(topbitpos-1))&1)

send "slot" entropy coded
send only topbitpos-1 of raw bits

More generally, you can define slots by the number of raw bits being sent in each level. We've done :

Straight #SB slots :


Slot from #SB plus one lower bit :


More generally it helps a little to put more slots near the bottom :


but in the more general method you can't use a simple bitscan to find your slot.

The intermediate bits that are sent raw do have a slight probability decline for larger values. But there's very little to win there, and a lot of noise in the modeling. In something like a Huffman-coded-LZ, sending the code lengths for extra slots there costs more than the win you get from modeling it.

With this we're just modeling the general decrease in probability for larger offsets. Something that might be interesting that I've never heard of anyone doing is to fit a parametric probability function, laplacian or something else, to the offsets. Instead of counting symbols to model, fit the parametric function, either adaptively or statically.

So, a whole offset is sent something like this :

offset bits (MSB on left) :


S = bit sent in slot index, entropy coded
R = raw bit
L = low bit, entropy coded

Now some issues and followup.

I. The low bits.

The low-bits-mask method actually doesn't handle the structure size very well. It does well for the 4-8 dword/qword stuff, because those generally divide evenly into the low bit mask. (A file that had trigram structure, like an RGB bitmap, would have problems.)

The problem is the structure size is rarely an exact multiple of the power of two "lowbits" mask. For example :

 36&0xF = 0x04 = 0100
 72&0xF = 0x08 = 1000
108&0xF = 0x0C = 1100
144&0xF = 0x00 = 0000

A file with structure size of 36 will make four strong peaks if the lower 4 bits are modeled.

If instead of doing (% 16) to get low bits, you did (% 36), you would get a perfect single peak at zero.

Any time the structure size doesn't divide your low bits, you're using extra bits that you don't need to.

But this issue also gets hidden in a funny way. If you have "repeat match" codes then on strongly structured data your repeat offsets will likely contain {36,72,...} which means those offsets don't even go into the normal offset coder. (the redundancy between low-bits modeling and repeat matches as a way of capturing structure is another issue that's not quite right).

II. Low bit structure is not independent of high bits.

Sometimes you get a file that just has the exact same struct format repeated throughout the whole file. But not usually.

It's unlikely that the same structure that occurs in your local region (4-8-36 for example) occurs back at offset one million. It might be totally different structure there. Or, it might even be the same structure, but shifted by 1 because there's an extra byte somewhere, which totally mucks up the low bits.

The simplest way to model this is to make the lowbits entropy coder be conditioned by the high bits slot. That is, different models for low offsets vs higher offsets. The higher offsets will usually wind up with low bits that are nearly random (equiprobable) except in the special case that your whole file is the same structure.

More generally, you could remember the local structure that you detect as you scan through the file. Then when you match back to some prior region, you look at what the structure was there.

An obvious idea is to use this canonical coding for offsets up to 32768 (or something), and then for higher offsets use LZSA.

So essentially you have a small sliding LZ77 window immediately preceding your current pointer, and as strings fall out of the LZ77 window they get picked up in a large LZSA dictionary behind.

02-09-15 | LZSA - Some Results

So I re-ran some Oodle Network tests to generate some LZSA results so there's some concreteness to this series.

"Oodle Network" is a UDP packet compressor that works by training a model/dictionary on captured packets.

The shipping "OodleNetwork1 UDP" is a variant of LZP. "OodleStaticLZ" is LZSA-Basic and obviously HC is HC.

Testing on one capture with dictionaries from 2 - 64 MB :

test1   380,289,015 bytes

OodleNetwork1 UDP [1|17] : 1388.3 -> 568.4 average = 2.442:1 = 59.06% reduction
OodleNetwork1 UDP [2|18] : 1388.3 -> 558.8 average = 2.484:1 = 59.75% reduction
OodleNetwork1 UDP [4|19] : 1388.3 -> 544.3 average = 2.550:1 = 60.79% reduction
OodleNetwork1 UDP [8|20] : 1388.3 -> 524.0 average = 2.649:1 = 62.26% reduction
OodleNetwork1 UDP [16|21] : 1388.3 -> 493.7 average = 2.812:1 = 64.44% reduction
OodleNetwork1 UDP [32|22] : 1388.3 -> 450.4 average = 3.082:1 = 67.55% reduction
OodleNetwork1 UDP [64|23] : 1388.3 -> 390.9 average = 3.552:1 = 71.84% reduction

OodleStaticLZ [2] : 1388.3 -> 593.1 average = 2.341:1 = 57.28% reduction
OodleStaticLZ [4] : 1388.3 -> 575.2 average = 2.414:1 = 58.57% reduction
OodleStaticLZ [8] : 1388.3 -> 546.1 average = 2.542:1 = 60.66% reduction
OodleStaticLZ [16] : 1388.3 -> 506.9 average = 2.739:1 = 63.48% reduction
OodleStaticLZ [32] : 1388.3 -> 445.8 average = 3.114:1 = 67.89% reduction
OodleStaticLZ [64] : 1388.3 -> 347.8 average = 3.992:1 = 74.95% reduction

OodleStaticLZHC [2] : 1388.3 -> 581.6 average = 2.387:1 = 58.10% reduction
OodleStaticLZHC [4] : 1388.3 -> 561.4 average = 2.473:1 = 59.56% reduction
OodleStaticLZHC [8] : 1388.3 -> 529.9 average = 2.620:1 = 61.83% reduction
OodleStaticLZHC [16] : 1388.3 -> 488.6 average = 2.841:1 = 64.81% reduction
OodleStaticLZHC [32] : 1388.3 -> 429.4 average = 3.233:1 = 69.07% reduction
OodleStaticLZHC [64] : 1388.3 -> 332.9 average = 4.170:1 = 76.02% reduction


test2   423,029,291 bytes

OodleNetwork1 UDP [1|17] : 1406.4 -> 585.4 average = 2.402:1 = 58.37% reduction
OodleNetwork1 UDP [2|18] : 1406.4 -> 575.7 average = 2.443:1 = 59.06% reduction
OodleNetwork1 UDP [4|19] : 1406.4 -> 562.0 average = 2.503:1 = 60.04% reduction
OodleNetwork1 UDP [8|20] : 1406.4 -> 542.4 average = 2.593:1 = 61.44% reduction
OodleNetwork1 UDP [16|21] : 1406.4 -> 515.6 average = 2.728:1 = 63.34% reduction
OodleNetwork1 UDP [32|22] : 1406.4 -> 472.8 average = 2.975:1 = 66.38% reduction
OodleNetwork1 UDP [64|23] : 1406.4 -> 410.3 average = 3.428:1 = 70.83% reduction

OodleStaticLZ [2] : 1406.4 -> 611.6 average = 2.300:1 = 56.52% reduction
OodleStaticLZ [4] : 1406.4 -> 593.0 average = 2.372:1 = 57.83% reduction
OodleStaticLZ [8] : 1406.4 -> 568.2 average = 2.475:1 = 59.60% reduction
OodleStaticLZ [16] : 1406.4 -> 528.6 average = 2.661:1 = 62.42% reduction
OodleStaticLZ [32] : 1406.4 -> 471.1 average = 2.986:1 = 66.50% reduction
OodleStaticLZ [64] : 1406.4 -> 374.2 average = 3.758:1 = 73.39% reduction

OodleStaticLZHC [2] : 1406.4 -> 600.4 average = 2.342:1 = 57.31% reduction
OodleStaticLZHC [4] : 1406.4 -> 579.9 average = 2.425:1 = 58.77% reduction
OodleStaticLZHC [8] : 1406.4 -> 552.8 average = 2.544:1 = 60.70% reduction
OodleStaticLZHC [16] : 1406.4 -> 511.8 average = 2.748:1 = 63.61% reduction
OodleStaticLZHC [32] : 1406.4 -> 453.8 average = 3.099:1 = 67.73% reduction
OodleStaticLZHC [64] : 1406.4 -> 358.3 average = 3.925:1 = 74.52% reduction

Here's a plot of the compression on test1 ; LZP vs. LZSA-HC :

Y axis is comp/raw and X axis is log2(dic mb)

What you should see is :

OodleNetwork1 (LZP) is better at small dictionary sizes. I think this is just because it's a lot more tweaked out; it's an actual shipping quality codec, whereas the LZSA implementation is pretty straightforward. Things like the way you context model, how literals & lengths are coded, etc. need tweakage.

At around 8 MB LZSA catches up, and then as dictionary increases it rapidly passes LZP.

This is the cool thing about LZSA. You can just throw more data at the dictionary and it just gets better. With normal LZ77 encoding you have to worry about your offsets taking more bits. With LZ77 or LZP you have to make sure the data's not redundant or doesn't replace other more useful data. (OodleNetwork1 benefits from a rather careful and slow process of optimizing the dictionary for LZP so that it gets the most useful strings)

Memory use of LZSA is quite a bit higher per byte of dictionary, so it's not really a fair comparison in that sense. It's a comparison at equal dictionary size, not a comparison at equal memory use.

02-07-15 | LZMA note

Originally posted at encode.ru

I just had a look in the LZMA code and want to write down what I saw. (I could be wrong; it's hard to follow!)

The way execution flows in LZMA is pretty weird, it jumps around. The outer loop is for the encoding position. At each encoding position, he asks for the result of the optimal parse ahead. The optimal parse ahead caches results found in some sliding window ahead. If the current pos is in the window, return it, else fill a new window.

To fill a window :

find matches at current position
set window_len = longest match length
fill costs going forward as described by Bulat above
as you go forward, if a match length crosses past window_len, extend the window
if window reaches "fast bytes" then break out
when you reach the end of the window, reverse the parse and fill the cache
now you can return the parse result at the base position of the window

So, some notes :

1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse. The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh (from the start of the current parse window).

3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* - and this is something I finally appreciate - LZMA also considers some multi-operation steps.

LZMA considers the price to go forward using :

literal
rep matches
normal matches

If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers :

literal + rep0 match
rep + literal + rep0
normal match + literal + rep0

These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise. (literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos.) That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.

02-07-15 | LZSA - Index Post

LZSA series :

cbloom rants 02-04-15 - LZSA - Part 1
cbloom rants 02-04-15 - LZSA - Part 2
cbloom rants 02-05-15 - LZSA - Part 3
cbloom rants 02-05-15 - LZSA - Part 4
cbloom rants 02-06-15 - LZSA - Part 5
cbloom rants 02-06-15 - LZSA - Part 6
cbloom rants 02-09-15 - LZSA - Some Results

And other suffix related posts :

cbloom rants 09-27-08 - 2 (On LZ and ACB)
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
09-26-11 - Tiny Suffix Note
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 06-21-14 - Suffix Trie Note
cbloom rants 07-14-14 - Suffix-Trie Coded LZ
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 09-10-14 - Suffix Trie EOF handling

02-06-15 | LZSA - Part 6

Some wrapping up.

I've only talked about LZSA for the case of static dictionaries. What about the more common case of scanning through a file, where the dictionary is built dynamically from previously transmitted data?

Well, in theory you could use LZSA there. You need a streaming/online suffix trie construction. That does exist, and it's O(N) just like offline suffix array construction. So in theory you could do that.

But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their character counts. That's ideal for data that is generated by a truly static source (that also has no finite state patterns). Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent data is much better correlated than old data. That is easy to model in LZ77 but very hard to model in LZSA; you would have to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also benefits from numeric modeling of the offsets (eg. binary 4-8 patterns and so on).

(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland, then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real model) then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters, the best value of those parameters will change over time, appearing to be dynamic.)

Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.

The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to compact your PPM nodes. You just have suffix sorted strings.

Some papers for further reading :

"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ

"Unbounded Length Contexts for PPM" - the PPM* paper

"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ

"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.

"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.

Mark Nelson's Suffix Tree writeup ; see also Larsson, Senft, and Ukkonen.

There's obviously some relation to ACB, particularly LZSA*, just in the general idea of finding the longest backwards context and then encoding the longest forward match from there, but the details of the coding are totally unrelated.

02-06-15 | LZSA - Part 5

LZSA is not really an LZ. This is kind of what fascinates me about LZSA and why I think it's so interesting (*). Ryg called it "lz-ish" because it's not really LZ. It's actually much closer to PPM.

(* = it's definitely not interesting because it's practical. I haven't really found a use of LZSA in the real world yet. It appears to be a purely academic exercise.)

The fundamental thing is that the code space used by LZSA to encode a match is proportional to the probability of that substring :

P(abc) = P(a) * P(b|a) * P(c|ab)

where P is the probability as estimated by observed counts. That kind of code-space allocation is very fundamentally PPM-ish.

This is in contrast to things like LZFG and LZW that also refer directly to substrings without offsets, but do not have the PPM-ish allocation of code space.

The funny thing about LZSA is that naively it *looks* very much like an LZ. The decoder of LZSA-Basic is :

LZSA-Basic decoder rough sketch :


handle literal case
decode match_length

arithmetic_fetch suffix_index in dictionary_size

match_ptr = suffix_array[suffix_index];

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove get_suffix_count( suffix_index, match_length ) , dictionary_size


whereas a similar LZ77 on a static dictionary with a flat offset model is :

LZ77-static-flat decoder rough sketch :


handle literal case
decode match_length

arithmetic_fetch offset_index in dictionary_size

match_ptr = dictionary_array + offset_index;

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove {offset_index,offset_index+1} , dictionary_size


they are almost identical. The only two changes are : 1. an indirection table for the match index, and 2. the arithmetic_remove can have a range bigger than one, eg. it can remove fewer than log2(dictionary_size) bits.
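The value of that wider interval is easy to quantify : removing a range of count equivalent suffixes instead of one exact index saves log2(count) bits. A toy calculation (the numbers here are invented for illustration) :

```python
import math

# LZ77-static-flat must name one exact suffix : log2(dictionary_size) bits.
# LZSA removes a whole range of `count` equivalent suffixes instead.
dictionary_size = 15
count = 3   # e.g. three suffixes in the dictionary all start with our match

flat_bits = math.log2(dictionary_size)
lzsa_bits = math.log2(dictionary_size / count)

# the saving is exactly log2 of the substring's count :
assert math.isclose(flat_bits - lzsa_bits, math.log2(count))
```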

We're going to have a further look at LZSA as a PPM by examining some more variants :


ROLZ = "reduced offset LZ" uses some previous bytes of context to reduce the set of strings that can be matched.

This is trivial to do in LZSA, because we can use the same suffix array that we used for string matching to do the context lookup as well. All we have to do is start the suffix array lookup at an earlier position than the current pointer.

eg. instead of looking up matches for "ptr" and requiring ML >= 3 , we instead look up matches for "ptr-2" and require ML >= 5. (that's assuming you want to keep the same MML at 3. In fact with ROLZ you might want to decrease MML because you're sending offsets in fewer bits, so you could use an MML of 2 instead, which translates to a required suffix lookup ML of 4).

That is, say my string to encode is "abracket" and I've done "ab" so I'm at "racket". My static dictionary is "abraabracadabra". The suffix sort is :

 0 : a
 1 : aabracadabra
 2 : abra
 3 : abraabracadabra
 4 : abracadabra
 5 : acadabra
 6 : adabra
 7 : bra
 8 : braabracadabra
 9 : bracadabra
10 : cadabra
11 : dabra
12 : ra
13 : raabracadabra
14 : racadabra

The dictionary size is 15.

With LZSA-Basic I would look up "racket" find a match of length 3, send ML=3, and index = 14 in a range of 15.

With LZSA-ROLZ-o2 I would do :

string = "ab|racket"

look up context "ab" and find the low & high suffix indexes for that substring
 (low index = 2, count = 3)

look up "abracket" ; found "abracadabra" match of length 5
 at index = 4

send ML=3

arithmetic_encode( suffix_index - context_low_index , context_count )
  (that's 2 in a range of 3)

You could precompute and tabulate the suffix ranges for the contexts, and then the complexity is identical to LZSA-Basic.
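That tabulated-range lookup can be sketched with an ordinary binary search over the sorted suffixes; this toy reproduces the numbers in the example above (`suffix_range` is an invented helper, not the real tabulated version) :

```python
import bisect

D = "abraabracadabra"                      # the static dictionary from the text
sa = sorted(D[i:] for i in range(len(D)))  # suffix sort, kept as plain strings

def suffix_range(prefix):
    # (low index, count) of the suffixes that start with prefix
    lo = bisect.bisect_left(sa, prefix)
    hi = bisect.bisect_left(sa, prefix + "\x7f")  # '\x7f' sorts after any letter
    return lo, hi - lo

ctx_lo, ctx_n = suffix_range("ab")     # the order-2 context range
m_lo, m_n     = suffix_range("abrac")  # the "abracadabra" match prefix

assert (ctx_lo, ctx_n) == (2, 3)       # low index = 2, count = 3, as in the text
assert m_lo == 4                       # "abracadabra" is at suffix index 4
assert m_lo - ctx_lo == 2              # encode 2 in a range of 3
```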

In LZSA-ROLZ you cannot encode any possible string, so MML down to 1 like LZSA-HC is not possible. You must always be able to escape out of your context to be able to encode characters that aren't found within that context.

LZSA-Basic had the property of coding from order-0,order-1,2,3,.. ML, jumping back to order-0. In LZSA-ROLZ, instead of jumping down to order 0 after the end of a match, you jump down to order-2 (or whatever order you chose for the ROLZ). You might then have to jump again to order-0 to encode a literal. So you still have this pattern of the context order going up in waves and jumping back down, you just don't jump all the way to order-0.


LZSA-ROLZ* :

(pronounced "ROLZ star" ; named by analogy to "PPM*" (PPM star) which starts from the longest possible context)

This idea can be taken further, which turns out to be interesting.

Instead of duing ROLZ with a fixed order, do ROLZ from the highest order possible. That is, take the current context (preceding characters) and find the longest match in the dictionary. In order to do that efficiently you need another lookup structure, such as a suffix trie on the reversed dictionary (a prefix tree). The prefix tree should have pointers to the same string in the suffix tree.

eg. say you're coding "..abcd|efgh.."

You look up "dcba..." in the prefix tree (context backwards). The longest match you find is "dcbx..". So you're at order-3. You take the pointer over to the suffix tree to find "bcd" in the suffix tree. You then try to code "efgh..." from the strings that followed "bcd" in the dictionary.

You pick a match, send a match length, send an offset.

Say the deepest order you found is order-N. Then if you code a match, you're coding at order-N+1,order-N+2,etc.. as you go through the match length.

The funny thing is : those are the same orders you would code if you just did PPM* and only coded symbols, not string matches.

Say you're doing PPM* and you find an order-N context (say "abcd"). You successfully code a symbol (say "x"). You move to the next position. Your context now is "abcdx" - well, that context must occur in the dictionary because you successfully coded an x following abcd. Therefore you will find an order-N+1 context. Furthermore there can be no longer context or you would have found a longer one at the previous location as well. Therefore with PPM* as you successfully code symbols you will always code order-N, order-N+1, order-N+2 , just like LZSA!

If LZSA-ROLZ* can't encode a symbol at order-N it must escape down to lower orders. You want to escape down to the next lower order that has new following characters. You need to use exclusion.

Remember that in LZSA the character counts are used just like PPM, because of the way the suffix ranges form a probability distribution. If the following strings are "and..,at..,boobs.." then character a is coded with a 2/3 probability.
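A minimal sketch of that toy follow-set (the strings are the text's; everything else is invented) :

```python
from collections import Counter
from fractions import Fraction

follows = ["and", "at", "boobs"]        # the text's toy follow-set
counts = Counter(s[0] for s in follows) # first characters of the follows
total = sum(counts.values())

# probability each next character would be coded with :
P = {ch: Fraction(n, total) for ch, n in counts.items()}
assert P['a'] == Fraction(2, 3)         # character a coded with 2/3 probability
assert P['b'] == Fraction(1, 3)
```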

There are a few subtle differences :

In real PPM* they don't use just the longest context. They use the *shortest* deterministic context (if there is any). If there is any deterministic context, then the longest context is deterministic, predicting the same thing. But there may be other mismatching contexts that predict the same thing and thus affect counts. That is :

..abcdefg|x    abcdefg is the longest context, and only x follows it
.....xefg|y    context "efg" sees both x and y.  So "defg" is the shortest deterministic context
....ydefg|x    but "ydefg" also predicts x

So the longest deterministic context only sees "x" with a count of 1

But if you use the shortest deterministic context, you see "x" with a count of 2

And that affects your escape estimation.

The big difference is in the coding of escapes vs. match lengths.

And if you code matches in LZSA* from orders lower than the deepest, that results in different order selection than PPM*.

02-05-15 | LZSA - Part 4

When you don't find a match in LZSA, that actually carries a lot of information.

Say you have an MML of 2 and encode a literal "a". That means all digrams in the dictionary that start with "a" are exclusions. The following character cannot be any of those, and those are the most likely characters so they are worth a lot.

Discussed previously here :

cbloom rants 09-03-10 - LZ and Exclusions

Even when you send a match, you have a powerful exclude from the characters that cannot extend that match. In a normal LZ77 like LZMA this is done for the *single* match offset that we sent with a literal-after-match exclude. In LZSA we can easily exclude *all* the characters that could have extended the match.

LZSA-HC = LZSA High Compression

For LZSA-HC we will take MML down to 1, so literals and matches are unified and we are always writing "matches". However we will also write matches by first writing the first character of the match and then writing the rest of the match. I assume that you have ensured the dictionary contains every character at least once, so there's no issue of impossible encodings.

Algorithm LZSA-HC :

Data structures required are the same as LZSA-Basic

When the match lookup structure (typically suffix trie) returns a match
it should also provide the set of characters which were found in the dictionary
 but did not match the next character in the search string

To encode :

initialize exclude_set = { empty }

for all of ptr

encode current character (*ptr) using literal coder,
 doing exclusion from current exclude_set

look up the current string (ptr) in the match lookup structure
set exclude_set = characters found that followed match but != ptr[match_length]

transmit match length ( >= 1 )

if ( match_length > 1 )
  send the suffix substring matched :
  simply one arithmetic encode call

  we only need to send it within the range of suffixes of the already-sent first character
    char_suffix_low[ *ptr ] is a table lookup
    char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];

  arithmetic_encode( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );

ptr += match_length;

a few notes on this.

We only send matches, length 1,2,3... Then exclude all characters that could not come after the match. This exclusion is exactly the same as exclusion in PPM when you escape down orders. In LZ you escape from order-ML down to order-0.

Coding the first character of the match separately is just so that I can use a different coder there. I use order-1 plus some bits of match history as context. For purity, I could have left that out and just always coded a suffix index for the match. In that case "exclusion" would consist of subtracting off the char_suffix_count[] number of suffixes from each excluded character.

Because match_length is sent after the first character, I use the first character as context for coding the match length.

The "if ( match_length > 1 )" is actually optional. You could go ahead and run that block for match_length = 1 as well. It will arithmetic_encode a symbol whose probability is 1; that is a cumprob which is equal to the full range. This should be a NOP on your arithmetic encoder. Whether that's a good idea or not depends on implementation.

In practice the exclude_set is 256 bit flags = 32 bytes.
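Computing the exclude set from a suffix range is simple given the sorted suffixes; a toy sketch on the "abraabracadabra" dictionary (a Python set stands in for the 256-bit-flag version; helper names invented) :

```python
import bisect

D = "abraabracadabra"
sa = sorted(D[i:] for i in range(len(D)))  # suffix sort as plain strings

def exclude_set_after(matched, next_char):
    # characters seen after `matched` in the dictionary, minus the one that
    # actually continued the string - those are the PPM-style exclusions
    lo = bisect.bisect_left(sa, matched)
    hi = bisect.bisect_left(sa, matched + "\x7f")  # '\x7f' sorts after letters
    seen = {s[len(matched)] for s in sa[lo:hi] if len(s) > len(matched)}
    return seen - {next_char}

# after a length-1 match "a" whose continuation 'r' isn't in the dictionary,
# all of a,b,c,d are excluded for the next literal :
assert exclude_set_after("a", 'r') == {'a', 'b', 'c', 'd'}

# after matching "abra", the characters that could have extended it :
assert exclude_set_after("abra", 'd') == {'a', 'c'}
```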

LZSA-HC must use greedy parsing (always send the longest possible match) because of full exclusion. There's no such thing as lazy/flexible/optimal parses.

To decode :

we now can't really use the faster backward tree,
because we need a lookup structure that will also provide the exclude_set

initialize exclude_set = { empty }

for all of ptr

decode current character (*ptr) using literal coder,
 doing exclusion from current exclude_set

decode match_length

if ( match_length == 1 )
  // just precompute a table of exclude sets for the after-1-character case :
  exclude_set = o1_exclude_set[ *ptr ]
else
  decode the suffix index
  within the range of suffixes of the already-sent first character
    char_suffix_low[ *ptr ] is a table lookup
    char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];

  int suffix_index = arithmetic_fetch( char_suffix_count ) + char_suffix_low;

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );
  here get_suffix_count must be a real match lookup and should also fill exclude_set

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );

ptr += match_length

And that is LZSA-HC

02-05-15 | LZSA - Part 3


You can of course implement a static PPM easily with a suffix array. (you would not want to do it for dynamic PPM because inserting into a sort is painful; the standard context tree used for PPM is a suffix trie)

Say my string is "abraabracadabra" (Steve Miller style). The suffix sort is :

 0 : a
 1 : aabracadabra
 2 : abra
 3 : abraabracadabra
 4 : abracadabra
 5 : acadabra
 6 : adabra
 7 : bra
 8 : braabracadabra
 9 : bracadabra
10 : cadabra
11 : dabra
12 : ra
13 : raabracadabra
14 : racadabra

I want to encode the next character starting with order-3 PPM. My current context is "bra". So I look up "bra..." in the suffix sort. I've seen "a" and "c" before, each with count 1. So I have a total count of 2 and a novel count of 2. I can do the old fashioned PPM escape estimation from that either code a symbol in that context or escape. If I escape, I go to order 2 and look up "ra..." , etc.

Now, what if you did a funny thing with your PPM.

Start at order-0 and encode a character. At order-0 we have all the strings in the suffix array, so we just count the occurrence of each first character ( C(a) = 7, C(b)=3, C(c)=1, C(d)=1, C(r)=3 , Total = 15).

Say we encode an "a". So we send 7 in 15. (and a no-escape flag)

On the next character move to order-1. So we reduce to these strings :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra

Within this context we have C(a)=1, C(b)=3, C(c)=1, C(d)=1 (to send an 'r' we would have to escape). Say we sent a "c".

Now we change to order-2. We only have this string at order-2 :

acadabra

So if our next character is "a" we can send that with just a no-escape flag. Otherwise we have to escape out of this context.

What we have now is a "deterministic context" and we can keep increasing order as long as we match from it.

This is LZSA !

(the only difference is that in LZSA the escape is only modeled based on context order, not the contents of the context, which it normally would be in PPM)
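This funny walk (order-0, then order-1, then riding a deterministic context) can be checked directly against the suffix sort. A minimal sketch in Python, assuming a plain sorted-suffix lookup (`next_char_counts` is an invented helper) :

```python
import bisect
from collections import Counter

D = "abraabracadabra"
sa = sorted(D[i:] for i in range(len(D)))  # suffix sort as plain strings

def next_char_counts(context):
    # counts of the characters that follow `context` in the dictionary
    lo = bisect.bisect_left(sa, context)
    hi = bisect.bisect_left(sa, context + "\x7f")  # '\x7f' sorts after letters
    return Counter(s[len(context)] for s in sa[lo:hi] if len(s) > len(context))

c0 = next_char_counts("")              # order-0 : all first characters
assert (c0['a'], c0['b'], c0['c'], c0['d'], c0['r']) == (7, 3, 1, 1, 3)

c1 = next_char_counts("a")             # order-1 after coding "a"
assert (c1['a'], c1['b'], c1['c'], c1['d']) == (1, 3, 1, 1)
assert 'r' not in c1                   # sending an 'r' here needs an escape

c2 = next_char_counts("ac")            # order-2 : a deterministic context
assert c2 == Counter({'a': 1})
```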

To be clear :

When LZSA encodes a reference to the string "abc" it does so with an arithmetic encoder and the equivalent probability :

P = probability equivalent as used in arithmetic encoding

P = C("abc") / C_total

C("abc") = C("a") * C("ab")/C("a") * C("abc")/C("ab")

as the counts become large and match the true probabilities :

C("ab")/C("a") => P("b"|"a")   (reads count of "b" given a previous "a")

C("abc") => C("a") * P("b"|"a") * P("c"|"ab")

P encoded => P("a") * P("b"|"a") * P("c"|"ab")

That's order-0 * order-1 * order-2. (and so on for larger match lengths).

The match length can be thought of as unary. So ML=3 is "1110". We read that as "match,match,match, no-match".

In this way we see the match length as a series of escape/no-escape flags.
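The chain above can be verified numerically from the suffix sort; a toy sketch on the "abraabracadabra" dictionary, using exact rational arithmetic (`C` and `suffix range` via bisect are invented helpers) :

```python
import bisect
from fractions import Fraction

D = "abraabracadabra"
sa = sorted(D[i:] for i in range(len(D)))
N = len(sa)                            # 15 suffixes = cumulative total

def C(s):
    # count of dictionary suffixes starting with s
    lo = bisect.bisect_left(sa, s)
    hi = bisect.bisect_left(sa, s + "\x7f")
    return hi - lo

# order-0 * order-1 * order-2 chain for the match "abr" :
p_chain = Fraction(C("a"), N) * Fraction(C("ab"), C("a")) * Fraction(C("abr"), C("ab"))

# one arithmetic encode of the "abr" suffix range pays exactly the same :
assert p_chain == Fraction(C("abr"), N)
```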

Another note on PPM.

In real-world modern PPM's you do LOE. (LOE = Local Order Estimation). In the olden days you just chose "we always use order-4 PPM", which means you start at order-4, and escape down to order-3,2,1. With LOE you look at local information and decide which order to use.

Now, let's say you had some data that was binary and had a repeated pattern that was like :

1 byte is random
3 bytes predictable
1 byte is random
7 bytes predictable
repeated. What orders should you use?

order-0 for the random byte
order-0,1,2 for the 3 predictable bytes
order-0 for the random byte
order-0,1,...,6 for the 7 predictable bytes

these are the orders that LZ uses!

This is part of why LZ can beat PPM on binaries but not on text, because the funny way that LZ jumps down to order-0 at the start of a match can actually be a good thing on binary.

Now, what if you had skip contexts?

(skip contexts are contexts that use some non-consecutive range of previous characters; eg. something like YYN is a two-symbol context that uses the 2nd & 3rd symbols back, but not the immediately preceding 1 symbol.)

If you have random bytes and skip contexts, then what you want is :

order-0 for random
order-0, Y, YY
order-0 for random
and this is what a "repeat match" is in LZSA.

Say I matched "abr" in the example above, but then my string is "abrd" , so I get a match of length 3, then a mismatch. I can then continue from the "repeat match" skip-context using YYYN of "abrX" :

abraabracadabra
abracadabra

so there are two strings available as matches for a "repeat match". If your next character is not "a" or "c" then you code an "escape" and drop the YYYN context, which is the same as saying you code a flag to select a normal match and not a repeat match.

If we like, we could make LZSA more of a true PPM.

In LZSA-Basic we encoded the match length and then the suffix index to select the match.

Instead, you could code the suffix index first. The decoder can then get the match string :

  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];

but we still don't know the match length.

You can then encode "match length" unary style, a sequence of binary "more length" flags (these are PPM escape flags). Because we already know match_string, we can use the characters matched so far in our match flag coding. We could use them as contexts for the match flag. Or we could do a true PPM and use them to look up a context node and get total & novel counts to do PPM-style escape estimation.

If we do that, then LZSA really becomes a true PPM. It's a PPM that does this funny order selection : order-0,order-1,... order-N, then escapes back to order-0.

It has a neat advantage over traditional PPM - we are encoding the character selection in one single arithmetic coding operation, instead of one at each context level.

02-04-15 | LZSA - Part 2

Algorithm LZSA-Basic :

LZSA-Basic encoder :

given a dictionary
form the suffix sort
make a match lookup structure on the suffix sort (suffix trie for example)

when you look up a string in the match structure
you are given the lowest index of that substring in the suffix sort
and also the number of entries that match that prefix

for example in a suffix trie
each leaf corresponds to an entry in the suffix sort
each node stores the lowest leaf index under it, and the number of leaves

To encode :

look up the current string in the match lookup structure
if match length < MML 
  flag literal & send literal
else
  flag not literal
  send match length

  send the suffix substring matched :
  simply one arithmetic encode call
  (dictionary size usually a power of 2 for more speed)

  arithmetic_encode( suffix_sort_low_index, suffix_count , dictionary_size );

Lazy parsing and other standard LZ things are optional.

Minimum Match Length , MML >= 2 as written. However, you could also set MML=1 and dispense with the match flag entirely. Then literals are written as a match of length 1, (and you must ensure every character occurs at least once in the dictionary). This is identical to order-0 coding of the literals, because the suffix ranges for matches of length 1 are just the order-0 counts! In practice it's better to code literal separately because it lets you do a custom literal coder (using order-1 context, or match history context, or whatever).
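That equivalence is easy to sanity-check : the suffix range of each single character is exactly its order-0 count. A toy sketch with a plain Python suffix sort (invented names) :

```python
import bisect

D = "banana"
sa = sorted(D[i:] for i in range(len(D)))  # ['a','ana','anana','banana','na','nana']

# the suffix range of each length-1 match equals that character's order-0 count :
for ch in set(D):
    lo = bisect.bisect_left(sa, ch)
    hi = bisect.bisect_left(sa, ch + "\x7f")  # '\x7f' sorts after any letter
    assert hi - lo == D.count(ch)
```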

LZSA-Basic decoder :

decoder requires the suffix sort
it also requires the suffix count for the given match length
(see later)

To decode :

get match flag
if not match
  decode literal
else
  decode match length

  get the suffix index :

  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );

  this must be the same interval that was encoded :
  (suffix_index is somewhere in that range)
  (note that all suffix_index values in that range provide the same match string over match_length)

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );

easy peasy, and very fast. Decoding is just as fast as normal LZ77, except for one piece : get_suffix_count.

To implement get_suffix_count we need something like the suffix trie that was used in the encoder. But we can do something a bit more compact and efficient. Rather than a forward tree, we can use a backward only tree, because we have a leaf index to jump into, and we only need to go up to parents to find the right node.

get_suffix_count :

struct backward_suffix_node
  int parent; // node index
  int depth;
  int low,count; // suffix sort range

unsigned char * suffix_sort[ dictionary_size ];
int suffix_leaf_parent[ dictionary_size ];
backward_suffix_node suffix_nodes[ dictionary_size ];

suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length )
    int node = -1;
    int parent = suffix_leaf_parent[ suffix_index ];
    while( match_length <= suffix_nodes[ parent ].depth )
        node = parent;
        parent = suffix_nodes[ node ].parent;

    if ( node == -1 )
        return suffix_index, 1;
    else
        return suffix_nodes[ node ].low , suffix_nodes[ node ].count;

the logic here is just slightly fiddly due to path compression. With path compression, match_length can be between the depth of two nodes, and when that happens you want to stay at the child node. The leaf nodes are implicit, and the number of internal nodes is always <= the number of leaves.

You could of course also accelerate the suffix_count lookup for low match lengths, at ML=3 or 4 for example by just having a direct array lookup for that case.

In theory walking backward like this has a bad O(N^2) possible running time (if the tree is deep, but you're only getting short matches in it). Conversely, walking forward up the tree ensures that decode time is O(N), because the trie walk is proportional to match length. In practice the backward walk is always significantly faster. (a single forward trie step can involve lots of linked list steps and jumping around in memory; the backward trie is much more compact and easier to walk without conditionals; you have to have a very deep average tree depth for it to be worse). If this was actually an issue, you could augment the backwards trie with a Fenwick/skip-list style larger parent step in a binary pattern (some halfway-to-root steps, some quarter-way-to-root steps, etc.). But it just isn't an issue.
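For reference, here is what get_suffix_count must return, sketched naively with a fresh range search instead of the backward tree (semantics only, invented names; the real decoder uses the node walk above) :

```python
import bisect

D = "banana"
sa = sorted(D[i:] for i in range(len(D)))  # ['a','ana','anana','banana','na','nana']

def get_suffix_count(suffix_index, match_length):
    # low index & count of the suffix range sharing the first match_length
    # characters with sa[suffix_index] - all of them decode the same string
    prefix = sa[suffix_index][:match_length]
    lo = bisect.bisect_left(sa, prefix)
    hi = bisect.bisect_left(sa, prefix + "\x7f")  # '\x7f' sorts after letters
    return lo, hi - lo

# suffix indices 1 and 2 both decode "ana", and both map to the same interval :
assert get_suffix_count(1, 3) == (1, 2)
assert get_suffix_count(2, 3) == (1, 2)
```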

02-04-15 | LZSA - Part 1

I'm going to introduce what I believe is a novel (*) compression algorithm. I'm calling it "LZSA" for "LZ Suffix Array" , though ryg rightly points out it's not really an LZ.

(* = if you're actually a scientist and not a cock-munching "entrepreneur" you should know that nothing is ever novel. This could be considered a simplification of ACB)

(Note to self : first emailed 07/27/2014 as "Internal Compression Blog")

I'm going to write a mini series about this.

Here's some previous posts that are related :

cbloom rants 09-27-08 - 2
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 07-14-14 - Suffix-Trie Coded LZ

So let's dive in.

Part 1 : motivation and background

I was working on compression from static dictionaries. The problem with traditional LZ77 on static dictionaries is that to get good compression you want a large dictionary, but then the offsets require more bits as well. In a normal dynamic scan dictionary, you have very strong offset modeling (they tend to be small, as well as binary patterns). In particular, short common strings will occur at low offset and thus not require many bits. But in a static dictionary all references take the same large number of bits, even if the match is short and the substring matched is very common. (*)

(* = obviously you could sort the substrings by frequency to try to create an optimal static dictionary that has strongly biased offsets; but then you also have to be aware of how adjacent substrings form larger strings (eg. "ab" next to "cd" also adds "abcd"), and have to make that whole grammar sorted, and that seems like a monstrous hard problem)

The problem is that common substrings occur all over the static dictionary (eg. in an english text dictionary "the" occurs in thousands of places), but in LZ77 you have to code an offset to one specific occurrence of that substring. In effect you are wasting log2(N) bits, where N is the count of that substring.

In fact, the solution is very easy conceptually. Just take the static dictionary and do a suffix sort on it. Now all occurrences of "the" are consecutive in the suffix sort.

Say our dictionary is "banana" , then the strings we can match are :

0 : banana
1 : anana
2 : nana
3 : ana
4 : na
5 : a

to code "ana" we could send index 1 or 3, they both decode as "ana" at length 3.

After suffix sort our strings are :

0 : a
1 : ana
2 : anana
3 : banana
4 : na
5 : nana

And to send "ana" we send index 1 or 2.

So now we need to send an integer, and we need it to be in a range, but we don't need to specify it exactly.

That is, we want to send an integer in the range {suffix_lo,suffix_hi} but we don't care what it is exactly in that range (because they all decode to the same string), and we don't want to waste bits unnecessarily specifying what it is in that region.

That's exactly what an arithmetic encoder does! We just need the low and high index of our substring in the suffix array, and we send that range with an arithmetic encoder.

It's exactly like a cumulative frequency table. The arithmetic encoder is guaranteed to send an integer that is somewhere in the range we need. We don't know which exact integer the decoder will see; it won't be determined until we do some more arithmetic encodings and the range is reduced further.

We're just treating the # of strings in the dictionary as the cumulative probability total. Then the low & high suffix index that contains our substring are the probabilities that we use to encode a "symbol" in the arithmetic coder. By coding that range we have specified a substring, and we save log2(substring_count) bits of unnecessary information.
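The whole trick fits in a few lines with the "banana" dictionary (a toy sketch; a real coder would feed the interval to an arithmetic encoder, and the names here are invented) :

```python
import bisect
import math

D = "banana"
sa = sorted(D[i:] for i in range(len(D)))  # ['a','ana','anana','banana','na','nana']

# the suffix range containing every occurrence of "ana" :
lo = bisect.bisect_left(sa, "ana")
hi = bisect.bisect_left(sa, "ana" + "\x7f")  # '\x7f' sorts after any letter
assert (lo, hi) == (1, 3)                    # "ana" occupies suffix indices 1..2

# naming one exact index costs log2(6) bits; coding the whole interval
# {lo,hi} saves log2(count) bits of unnecessary information :
saved = math.log2(hi - lo)
assert saved == 1.0
```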

Next post I'll describe the algorithm more precisely and then I'll talk about it.

02-01-15 | Fucking Fuck the Fucking Web

ADDED PREFACE : This is a pretty ranty rant. I write these on a semi-regular basis, and pretty much always delete them before posting them these days. I thought I would leave this one for old time rants' sake.

So I was forced to upgrade my Firefox from 15 to 35. AAAARRGGGGH.

Of course first thing is "Classic Theme Restorer" to make it look reasonable again. (christ fucking rounded shit). Ok, waste a bunch of time as usual learning new settings and updating. Whatever, fucking awful life as usual, no biggie.

But it just SUCKS it sucks it sucks. I want fucking FF 15 back.

It's so much *less* responsive and less asynchronous than before despite the big rewrite to make it async. Loading a page now, the elements keep popping in and re-flowing and shit moves all over. It takes way longer for the focus text box to become active. I have to either manually click in it or wait forever for most of the page to load for it to activate. So bad. Even typing in the address bar to go to a new page (while a page is still loading) is all hitchy now, that is so fucking retarded and borked, your basic GUI elements need to always be independent of the page loading.

The HTML5 video player is just so not ready for primetime. Fucking buffering doesn't work right, it hitches and stalls out and then won't restart. Audio cuts out; you can't seek; etc. It's just so frustrating I want to smash my goddamn computer. I was hoping to run without Flash now, but it's just not ready. You can't fucking ship it like that.

Every so often I log in from weird places and Google decides that's "suspicious activity". So now I have to pick a new password.

If my password isn't good enough, the entry form just gets wiped, and I have to enter my old password again, and the new one twice. Nope, not good enough, wipe the form.

God damn you. I don't want to enter a complicated password because I have to type the damn thing in on my fucking PHONE with fucking nightmare starred-out stupid fucking touch keyboard, where things like going caps and lower case takes forever. I want my password to be "hello" or some shit. Nope, blank the form.

This is really fucking important Google. Because the big security leak is hackers just guessing passwords. It's not when fucking Target or Visa or someone just loses an entire mainframe full of personal account info.

Actually the biggest leak is when a major retailer literally GIVES AWAY everyone's personal account info. Because most of the big hacks are "human engineering" which just means the hackers called someone up and that person gave them access to their back end.

So yeah, really fucking helpful that you're making me put punctuation in my password and starring it out so that no one in the room can get it. Thanks a lot. That's going to help when Home Depot loses millions of credit card numbers.

Oh yeah, and when I get suspicious activity, Google locks my account and wants to send me an SMS.

My phone number is a FUCKING GOOGLE VOICE NUMBER. How am I supposed to get an SMS? Oh, I'll send it to my actual hardware phone number (which I never use and don't even know, but I have it written down somewhere.) Okay, let's try that... MY PHONE GETS SMS WITH FUCKING GOOGLE HANGOUTS. And it won't let me log in and get my SMS because Google has flagged my account for suspicious activity. Amazing.

(and when Google Hangouts can't login with my primary Google account it automatically switches to my RAD account without asking. No retry / enter a new password prompt or anything. Awesome.)

Oh my god, I hate the new fucking google maps. Every time I want a map now the process is :

1. Huh, WTF, why is it loading so slow ?
2. Oh my god the fucking nav buttons still haven't finished popping in?
3. Aw crap, it's the new maps I forgot about this nightmare
4. How do I go back to the old maps? Is it settings? Nope
5. "Help" button.
6. Select the reason why I want old maps : "I want to punch you in the nuts".
7. Return to Classic Google Maps

Fucking christ. If it's slower it's worse. Don't fucking ship it. My god it's so fucking slow.

And how is the fucking most basic search interaction of maps & normal search still so fucking garbage. Basic shit like if I do a normal search and I get various search results, I can't just click one and go "show on maps". If I'm in maps, I can't just take my current search text and switch back to a web search. If I'm in maps trying to get directions, and I put in a description search instead of a specific address, it kicks me out of directions mode when I choose the address. WTF WTF. This is what you fucking do and you're so garbage at it. Everyone is fired.

I was going to write "I don't understand how companies look at these things and think they're okay" but actually I know exactly how it happens.

The fundamental way that tech is developed is all wrong.

There's this very weird non-results-oriented thing that can happen with developers. You sort of go into this rabbit hole where you have a blind spot for what the actual user experience is.

In any product development cycle, it starts out with reasonably good intentions. You want to solve a problem and make things good for the user. But then you translate that into todos which are technical, like "rewrite the back end in such a way". The weird thing that then happens is the developers focus on the technical todos, and they are happy if they accomplish them, even if the end result is actually worse for the user.

It's like :

1. We want to make it good for the user
2. To do that we have to figure out some tech todos
3. So we decide to do A, B, C
4. We go away and put our heads down and crank on A, B, & C
5. Great success!  We have accomplished A !  Pat ourselves on the back
6. We have totally forgotten about #1.

(though realistically, even at the start there are usually lots of "todos" on the list that have nothing to do with improving user experience. There will be things like "make it IEEE PPTP RFC 40187 compatible" which you feel is really important but in fact for the user means almost nothing. There will also be lots of pet todos from the developers like "make it fine-grain threaded" which again, if they actually result in a user experience that's more snappy, great, but if not then it's just tech masturbation).

As far as how the horrific moronic GUIs keep getting green lit, I don't really know. I can only guess. I guess that there are probably GUI mockup meetings where things are approved based on still screen shots. I guess that managers who are in charge of these things have no fucking clue about the most basic principles of good GUI design, like "buttons should obviously be buttons", "focus should never move without the user doing it", "GUI elements should not trickle in or change unexpectedly", "buttons should always be in the same place", "delays in rendering should not change the meaning of UI input", etc. etc.

And there's a general lack of testing. You have to get people in and watch them use your software. People who aren't familiar with it. Don't give them any instructions. Any time they need help or get frustrated, you have failed. It can't be a last minute thing when you aren't willing to make drastic changes. It has to be early and often and you have to have many months left to make changes before shipping. I cannot believe that these companies actually get anyone to try their product (or web page), and see how fucking shitty the user experience is, and actually respond to it.

(Nobody who ever uses Outlook.com comes away doing anything but screaming their head off)

Fucking Amazon sucks as a UI. Fucking ebay sucks. Fucking facebook sucks. Not like sucks a little bit, like they're just awful. Our fucking most valuable, biggest most important software products are just fucking garbage. Why is it so fucking hard for me to spend my money on the internet? It's ridiculous.

Part of the problem is that by the time you have something to show, the team is usually so weary and calcified, that they don't really want to listen to any feedback. They'll make lots of excuses to not change things.

Managers are hesitant to ever admit a mistake or cut a huge feature or redo work, because it reflects badly on them. That's terrible. It means that broken awful shit gets shipped just because they politically have to keep insisting "we did a great job". In a better world, a manager would be able to say "actually our GUI rewrite for VC2010 is a total fuckup and we're going to just bin the entire project and go back to VC6". Unfortunately in corporations, like real politics, it serves you better to insist that you totally believe in all your choices still even when they have obviously turned out to be huge fuckups. Kind of mind boggling that people like it better when their politician is walking down the street naked saying "look at my fabulous choice of clothes" than if they would just admit that they got duped.

I think an interesting idea for development would be to have a totally separate "finishing team". You get all new developers and managers on at the end, so that they aren't wedded to any of the ideas, they aren't building fences around their work and unwilling to change anything, they aren't prideful and unable to admit that it sucks. The finishing team is only concerned with the *results*, what the actual user experience is, and they can mercilessly cut up the product to make it better. Politically the finishers have no stake in keeping work, so they can even revert huge chunks or just cut features.

We've had great results at RAD with ryg as a sort of ruthless finisher. Developers very regularly get some crazy ideas in their head about why they absolutely must do something in a crazy over-complicated way, and they cling to some silly reason why they can't just do it the simple obvious way. It takes a 3rd party coming in to rip up the code and take out the stupidity. (This used to be sort of my specialty).

(sort of like how Hollywood brings in separate writers to finish screenplays; the analogy's not great though because in writing there's a lot to be said for the flawed artistry of a single voice, but in web apps not so much)

I certainly have been guilty of this sin myself many times. It's just very fundamental to the developer personality. We all get balkanized and prideful and defensive and calcified.

One example that I regret is the camera on Stranger's Wrath. I wrote the camera; I had this vision of things that I wanted to do, partly based on mistakes I'd seen in the past. I wanted it to be very robust, physical and analog feeling (not digital), to not get stuck or ever jump ahead or go through objects, framerate independent, very solid. And for the most part I accomplished those and thought that I was very clever and had done an amazing thing.

But people didn't like it. Almost right away I started getting feedback from within the company that was not glowing; why can't it be more like this other game, can it move faster, can it do this or that. Later as we started getting outside play testing it was a common complaint. And basically I did nothing about it. We did some minor tweaking to speed it up a bit based on feedback, but I didn't have the "come to jesus" moment of hey this doesn't actually work as something that makes users happy. It may have hit all the todos that I set out for it originally, but it didn't accomplish the only goal that really matters which is to make users go "this feels great".

One of the stupid excuses that I made with the camera, which is a very common one, was : "people are just used to other cameras (Mario 64, Jak & Daxter and so on), so they're expecting that and just think it feels bad because it's different. Once they get used to it they'll like it."


Even if that's true, you're wrong. The fact that people are used to some other thing is a fact of the world and you have to work within that. They're used to GUIs in a certain style. They're used to the controls in their car being in a certain place. Maybe you have thought of a better way to do it, but if that feels bad to people in the real world because they're used to the old way, YOU'RE WRONG. You have to make it good within the context of actual users. (that's not to say you can't ever change any GUI, but the changes should feel good to someone who knows the old way; and of course if it's something like a car where the user will switch between various vehicles, then fuck you just make it standard and don't change things that don't actually make anything better)

Whoah. Enough ranting.

01-23-15 | LZA New Optimal Parse

Oodle LZA has a new optimal parse.

It's described at encode.ru and I don't feel like repeating myself today. So go read about it there. Thanks to Bulat Ziganshin for explaining the LZMA-style parse in a way that I could finally understand it.

First see Bulat's earlier post [LZ] Optimal parsing which is a description of Tornado's optimal parse and probably the clearest place to get started.

This style of parse is also described in the paper "Bit-Optimal Lempel-Ziv compression".

Some previous rants on the topic :

cbloom rants 10-10-08 - 7
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 06-12-14 - Some LZMA Notes
cbloom rants 06-16-14 - Rep0 Exclusion in LZMA-like coders
cbloom rants 06-21-14 - Suffix Trie Note

I should note that the advantage of this optimal parse over my previous LZA optimal parse (backward-forward-chain-N) is that it can achieve roughly the same compression in much less time. The chain-N forward parse can get more compression if N is increased, but the time taken is exponential in N, so it becomes very slow for N over 2, and N over 4 is unusable in practice.

New LZA Optimal level 1 (-z5) uses the straightforward version of this parse. I flush the parse to update statistics any time there are no matches that cross a position (either because no match is possible, or a long match is possible in which case I force it to be chosen, ala LZMA). This looks like :

New LZA Optimal level 2 (-z6) stores the 4 cheapest arrivals to each position. This allows you to arrive from a previous parse which was not the cheapest on its prior subgraph, but results in a cheaper total parse to your current subgraph. You get this sort of braided 4-line thing.

When you flush the parse to output codes and update statistics, you choose one specific path through the parse. You trace it backwards from the end point, because you have arrivals, then you reverse it to output the codes.
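A minimal sketch of that backward trace (illustrative names and structure assumed here, not Oodle's actual code): each position stores up to four arrivals recording their cost and where they came from; flushing walks the from-links back from slot 0 at the flush point, then reverses the path so codes can be output in forward order.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of the 4-arrival parse table and the backward trace
// at a flush point. All names here are made up for illustration.
const int kNumArrivals = 4;

struct Arrival
{
    int cost;     // price in bits to reach this position via this arrival
    int fromPos;  // position this arrival came from
    int fromSlot; // which arrival slot at fromPos was extended
};

// Trace back from slot 0 at flushPos (the cheapest arrival) to startPos,
// then reverse so the chosen path is in forward (output) order.
static std::vector<std::pair<int,int> > TraceBack(
    const std::vector<std::vector<Arrival> > & arrivals,
    int startPos, int flushPos)
{
    std::vector<std::pair<int,int> > path;
    int pos = flushPos, slot = 0; // always flush (and resume) from slot 0
    while (pos > startPos)
    {
        path.push_back(std::make_pair(pos, slot));
        const Arrival & a = arrivals[pos][slot];
        slot = a.fromSlot; // may hop to a different thread, as in the prints
        pos  = a.fromPos;
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

The "step back N, from thread M" pairs in the prints below are exactly these (fromPos delta, fromSlot) links.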

Here's a drawing showing the trace backwards to flush the parse after it's been filled out :

At the flush pos, you take arrival slot 0 (the cheapest) and discard the other arrivals. When you resume from that spot to continue the parse you must resume from only slot 0. It sort of reminds me of the earlier QM post - the parse is uncertain, not set, with these 4 different realities that are possible at each position, until you make a Copenhagen measurement and actually flush the parse, at which point the alternative histories are discarded and you snap to just one. When you resume the parse, the number of arrivals very rapidly grows from 1 up to the maximum of 4, and then stays at 4 through the parse until you snap back to 0.

Unlike the level 1 parse, I do not sync at unambiguous points because you want to keep the possibilities alive for a while. There's a tradeoff between up-to-date statistics and longer flexible parse intervals. (the reason this tradeoff exists is that I don't bring the statistics forward in the parse; it is possible to do so, as described by Bulat in the encode.ru thread, but my tests indicate it's not worth the speed & memory hit).

I'm going to show some prints of an actual parse. The columns are the 4 arrivals, and positions increase going down.

The first number is how many positions you step back to your previous point, and the next number is which thread (arrival context) you came from at that position.

So a literal is step back 1 :

  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3

This is a chunk of parse where only literals are possible, and the costs are about the same in each context so the states just carry forward.

  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3

Literals don't always just carry forward, because the cost to code a literal depends on the last-offset context which can be different in each arrival. Here threads 1 & 2 swapped because of literal cost differences.

Through the vast majority of the parse, slot 0 (cheapest) arrives from slot 0. Through all that range, the alternate parses are playing no role. But once in a while they swap threads. Here's an example with the trace-back highlighted :

 [3|0]  3|1   3|2   1|0
 [1|0]  1|1   1|2   1|3
 [1|0]  1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  5|0  [6|0]  5|1   6|1
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  8|0  [8|1]  8|2   8|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  4|0  [4|1]  4|2   4|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  3|0  [3|1]  3|2   3|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
 [7|1]  7|3   7|0   7|2

If the last pos there was a flush point, that would be the parse we trace back to output. Remember that the length is not the length coded at that pos, but the length to *arrive* at that pos (the length coded at the previous pos).

A more detailed printout showing how this type of thing happens. Here I'm also printing match vs. last-offset match, and the offset used in the match :

 [4L+708|0] 3L+708|0  4L+708|1  4L+708|2
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [3L+708|0] 3L+708|1  3L+708|2  3L+708|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+228|0] 4M+228|1  4M+228|2  3M+228|0
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+732|0] 4M+732|1  1 +  0|0  3M+732|0
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  4M+ 12|0 [3L+732|0] 5L+ 12|0  4L+ 12|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  4M+252|0 [4M+252|1] 4M+252|2  4M+252|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4L+708|1] 4L+708|2  4L+708|3  4M+708|0

In the last position here we want to code a match of length 4 at offset 708. Arrivals 1-3 can code it as a last-offset, but arrival 0 (the cheapest) does not have 708 in the last-offset set (because it previously coded 252,12,732,228 and knocked the 708 out of its history).

You can see way back in the past 708 was used twice, and threads 1-3 only used three other offsets (my last-offset set keeps four offsets) so they still have 708.

This is really what the multiple arrivals are all about. You get an extremely non-local effect: by keeping that 708 in the last-offset set, you get a cheaper parse way off in the future when it's used again.
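To make the mechanics concrete, here's a toy sketch (made-up names, not actual LZA code) of a four-entry last-offset set with an LZMA-style move-to-front update. Coding a fourth distinct new offset is what knocks the 708 out of slot-0's history:

```cpp
#include <cassert>

// Illustrative four-entry last-offset ("repeat offset") set, LZMA-style.
// Using an offset moves it to the front; a new offset evicts the oldest.
struct LastOffsets
{
    int offs[4];

    LastOffsets() { offs[0] = offs[1] = offs[2] = offs[3] = 0; }

    // returns the slot index if 'offset' is in the set, else -1
    int Find(int offset) const
    {
        for (int i = 0; i < 4; i++)
            if (offs[i] == offset) return i;
        return -1;
    }

    // code a match at 'offset' : MTF it to slot 0, inserting if absent
    void Update(int offset)
    {
        int slot = Find(offset);
        if (slot < 0) slot = 3;       // not found : evict the oldest entry
        for (int i = slot; i > 0; i--)
            offs[i] = offs[i-1];
        offs[0] = offset;
    }
};
```

After coding 708 and then 252, 12, 732, 228, the 708 is gone; a thread that coded only three other offsets after the 708 still has it and can code the future match as a cheap last-offset.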

Another one because I find this amusing :

  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 15M+360|0 12M+360|0[16M+720|0]12M+360|1
  1 +  0|0 [1 +  0|2] 1 +  0|1  1 +  0|3
 [1 +  0|1] 1 +  0|0  1 +  0|2  1 +  0|3
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3

What's happened here is the literal coding is cheaper with offset 720 as the last-offset. It's not until two literals later that it becomes the winner. The first literal after the match uses LAM exclusion, and then the next literal after that uses LAM as coding context. ( as described here )

I've been highlighting interesting bits of parse. It should be noted that the vast majority of every parse is just staying on the channel-0 (cheapest) thread. Like this :

     1688: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1689: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1690: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1691: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1692: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1693: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1694: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1695:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1696:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1697: [  3M+167|0]   3M+167|1    3M+167|2    1 +  0|0 
     1698:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1699:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1700: [  3M+ 45|0]   4M+ 45|0    3M+ 45|1    3M+ 45|2 
     1701:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1702:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1703:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1704:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1705:    3M+ 55|0    3M+ 55|1    3M+ 55|2    3M+ 55|3 
     1706:    6M+ 58|0    6M+ 58|1    6M+ 58|2    6M+ 58|3 
     1707:    7M+251|0    7M+251|1    1 +  0|0    1 +  0|1 
     1708: [  8M+330|0]   8M+330|1    8M+330|2    1 +  0|0 
     1709:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1710:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1711:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 

Anyway, here's one I spotted which exhibits unusual jumping around :

     3642: [  4L+612|0]
     3643: [  1 +  0|0]
     3644: [  1 +  0|0]
     3645: [  1 +  0|0]
     3646: [  1 +  0|0]
     3647: [  1 +  0|0]
     3648:    1 +  0|0 
     3649:    1 +  0|0 
     3650: [  3M+192|0]   1 +  0|0 
     3651:    1 +  0|0    1 +  0|1 
     3652:    1 +  0|0    1 +  0|1 
     3653:    1 +  0|0    1 +  0|1 
     3654:    5L+ 12|0 [  4M+ 12|0]   4L+ 12|1    3M+ 12|0 
     3655:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3656:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3657:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3658:    3L+612|0 [  3L+612|1]   3L+612|2    3L+612|3 
     3659:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3660:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3661:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3662:    3L+228|0 [  3L+192|1]   3L+228|2    3L+192|3 
     3663:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3664:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3665:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3666:    3L+228|0    4M+624|0    4M+624|1 [  3L+612|1]
     3667:    1 +  0|0    1 +  0|1 [  1 +  0|3]   1 +  0|2 
     3668:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3669:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3670:    3M+576|0    3M+576|1 [  3M+576|2]   3M+576|3 
     3671:    1 +  0|0    1 +  0|1 [  1 +  0|2]   1 +  0|3 
     3672:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3673:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3674: [  3L+192|2]   3L+192|3    3M+192|0    3M+192|1 
     3675:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3676:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3677:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3678: [  4L+ 12|0]   4M+ 12|1    3L+ 12|0    3L+624|1 

All the previous examples have been from highly structured binary files. Here's an example on text :

     1527: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1528:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1529:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1530:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1531: [  4M+100|0]   4M+100|1    4M+100|2    4M+100|3 
     1532:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1533:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1534:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1535:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1536:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1537:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1538:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1539:   10M+ 32|0   10M+ 32|1 [  8M+ 32|0]   8M+ 32|1 
     1540:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1541:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1542:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1543:    3M+ 47|0    3M+ 47|1    3M+ 47|2    3M+ 47|3 
     1544:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1545:    5M+ 49|0    5M+ 49|1    5M+ 49|2    5M+ 49|3 
     1546:    5M+  1|0    5M+  1|1    5M+  1|2    5M+  1|3 
     1547:    8M+ 50|0 [  8M+ 50|2]   8M+ 50|1    8M+ 50|3 
     1548:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     1549:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1550: [  2L+999|1]   2L+999|3    2L+999|0    2L+999|2 
     1551: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1552: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 

The benefit of the N-arrival parse is far far lower on text than on binary.

text :

enwik8 -z5 : 25384698
enwik8 -z6 : 25358366

binary :

fez_essentials -z5 : 9780036
fez_essentials -z6 : 9512780

lzt02 -z5 : 146098
lzt02 -z6 : 141141

lzt25 -z5 : 63376
lzt25 -z6 : 56982

Some random notes :

In LZA I do not have the N^2 problem like Tornado. For each match, I only consider the maximum possible match length. The reason I can do this is because I enforce strict LAM exclusion. Shorter lengths are not allowed because they would not exclude the LAM.

This is not an "optimal" parse in the sense of being exactly right. To be a true optimal parse, you would have to carry the statistics forward through the tree, and you would have to store *all* arrivals to a given position, not just the best 4.

The A-Star parse I previously wrote about is pretty similar to this actually. Rather than only store 4 arrivals at each position, it can store an unbounded # of arrivals, and it tries to discard paths that cannot ever beat the best path (with a fuzzy tolerance to make that viable).

Storing the 4 cheapest arrivals is maybe not right. You want the 1 cheapest, but then what you want for the other 3 is for them to be interesting alternatives. For example you want their last-offset set to be different than the cheapest arrival, you don't just want a more expensive way to wind up with the same thing. I've done some limited experiments with this but haven't found anything definitive yet.

There's also the issue of deciding when to reset the statistical state.

Because statistics aren't carried forward, there's the issue of parse-statistics feedback. In particular, some heuristic guidance does help (like don't ever take a normal match when a longer repeat-match is possible). Also if the beginning of the file is unusual, it can lead you into a bad local minimum.

01-22-15 | Nexus 5 Usability

The physical design of these phones is so rotten.

First of all, here's a fundamental principle of good device design :

It should not look symmetric unless it actually *is* symmetric.

I frequently pick it up and try to tap the on button, only to discover that I have it upside down. I shouldn't have to spend those seconds staring at the thing to figure out which end is up.

Since it's in fact behaviorally *not* symmetric, they shouldn't have tried so hard to disguise the buttons and make it look symmetric. It should have a strong different-color strip on one end to make it obvious which end is up. This whole "make it look like a single color brick" that came from Apple is fucking retarded design.

Also the side-buttons should not be black and blend in to the device. They should be a contrast color.

My grilling process would be like this : "why are the buttons the same color as the body?" "uhh, because it looks cool?" "you're fired". "which end is up?" "well, it looks cooler all the same color" "you're fired".

A related symmetry complaint is that it doesn't rotate the screen when you turn it upside down. Again if it looks fucking symmetric, then act symmetric. Don't leave me there going "why the fuck are you not rotating the screen? Oh it's fucking upside down". If I turn it upside down then fucking rotate the screen the way I'm holding it.

And of course the fucking volume-up-down should be a physical slider so that I can tell the level at a glance and change it even when the thing is off. And tapping it by accident doesn't mute it or send the volume through the roof or any shit like that. Like all fucking knobs and sliders should *always* be physical not fucking digital shuttles OMG. Fucking AC controls and volume knobs in cars that are just digital shuttles make me want to murder everyone.

Maybe the neatest symmetry solution would be to make it slightly tapered. Make the bottom slightly narrower than the top, which would fit the hand better anyway. More curved at the bottom and more square at the top.

There's a huge hardware problem in that if you have a finger on the screen it won't register other taps. I don't understand how this is acceptable. If I ham-fist it and accidentally touch the screen with my gripping hand, it just doesn't work. WTF. Perhaps worse, if my baby has her finger on it, then I can no longer operate it to change to a video for her or whatever.

Many of the standard Google Apps have major GUI flaws.

For example in the Camera it's way too hard to change the basic function, such as switching between pictures and video or switching the camera direction, you have to go through some fucking settings menu; they should just be buttons.

In the phone, it's way too easy to call someone when you're just trying to get info on your call history. (I had to add a "call confirm" app because I kept dialing numbers when I scrolled around)

In all the apps, if you have them set to only sync on wifi or only sync on charger, there's no override button to just "fucking sync now anyway". You have to dig into the menus and change the setting then change it back after syncing, very annoying.

In gmail it's nearly fucking impossible to change between threaded & non-threaded view, which would be handy to do frequently. (the threaded view also makes it really hard to find the newest mail)

But there's a worse problem in general with apps.

It's just completely hidden what you can actually do in them. What areas can I click on? What are buttons and what do they do? And then some spots are different actions if I hold a tap there. And some spots are a different action if I drag there. WTF WTF. This is such rotten usability GUI design.

One option to fix it would be to obviously color all GUI elements based on how they can be used. Red lines are draggable, blue elements are tappable, green have a tap-and-hold extra action.

Even aside from making it more obvious, there needs to be a way to highlight all GUI actions. Somewhere I can tap that will make them all pop up tool-tips, or animate "this is holdable, this is draggable" etc.

Why do I have to lecture people on fucking 1950's GUI principles. Make fucking buttons looks like buttons! Don't hide them as graphical elements! Function over form please!

The copy-paste interface is horrific. The main way I run into it is if I get something like an address in an email and it fails to auto-link to the maps app, and I want to manually get that address over there. For fucks sake it's impossible. Or like if I find a URL and it doesn't auto-link to the browser, and I have to try to select it and copy it, I want to kill myself. Which I guess is just a general phone problem - as long as the function you want is hard-coded in, it's okay, but god help you actually make apps interact in a nonstandard way. (like take a photo that's on your phone and upload it as an attachment in a web forum post)

Some other random complaints while I'm at it :

There's no fucking missed calls & texts notification on the home page or lock page. Uh, hello. Make it work as a basic fucking phone please. It's like the most basic fucking thing you need in a phone, it should tell you what calls and texts you got. I have to fucking browse around various apps just to find out if I got any calls or texts? WTF. (yes I know I can buy an app to add these features, but seriously WTF).

(addendum : seriously WTF. Recent missed calls and texts need to be on the home page. This is one complaint I actually need to fix because clicking around every time I turn the phone on is ridiculous.)

There seems to be no way to cache my entire local region in maps. For a real world where we have spotty cell coverage in the country, this is pretty awful. In general offline use (as you might want in areas of terrible reception, eg. everywhere) is extremely poor.

There seems to be no way to clear all private data on every boot. I guess nobody in the world but me even tries to have any privacy these days. I'd like to have no saved cookies, no web passwords, and zero app-sign-ins on every boot.

I should be able to get on the internet over the USB cable.

I should be able to remote control the phone from a PC app over a USB cable. I'd like to do all my setup and management from a PC with a fucking mouse and keyboard please, not by poking with my ham-fist. I'd like to have just a big tree-view of all the options so I can clearly see them and not have to poke around to find them.

I fucking hate the way maps is always trying to twist on me. Fucking up is north. Seems to not be a setting to lock north, I have to constantly tap the fucking icon. Very annoying.

There needs to be a global option to turn off all starring of passwords. Because starred-out passwords and ham-fingers don't mix.

01-14-15 | Spooky Action at a Distance

I loved "Only Lovers Left Alive". There's not much to the plot, but as just a bit of mood it's lovely. The design, the music, everything is perfect. It reminds me of my youth, actually. An aesthetic and lifestyle that's no longer fashionable.

Adam's song - Spooky Action At A Distance - is fantastic.

BUT - there is no fucking spooky action at a distance!

Ooo, you say, Einstein was confused about it. It must be all mysterious and nobody understands it. NO!

Quantum Mechanics is not some crazy mysterious metaphysical thing that science can't explain and therefore allows all sorts of quackery.

First of all, the average person shouldn't read thoughts about Quantum Mechanics that were written in the 20's and 30's when it was still new and not very well understood yet. Hey, obviously they were a little confused back then. Read something modern!

Second of all, I think the way QM is taught in schools is still largely terrible. I don't know how it's done now, but when I went through it it was still being taught in a "historical" way. That is, going through the progression in the same way it was discovered, roughly. First you learn quantization of photons, then you get a wave function, you get copenhagen interpretation and measurement collapse, and then you finally start to get bra-ket and hilbert spaces and only much later (grad school) do you get decoherence theory.

The right way to teach QM is to jump straight to the modern understanding without confusing people with all that other gunk :

1. The universe is a linear vector space of *amplitudes* (this is the source of all the weirdness and is the fundamental postulate) (the universe seems to work in the square roots of things, and double-covers, and this leads to a lot of the non-classical weirdness)

2. The entire universe is QM, there is no classical "observer" that lies outside the QM system.

3. The universe takes all possible outcomes all the time. However, an observer in any given state is not in all of those outcomes.

4. Measurement is the process of entangling the QM state of the measurement device with the QM state of the observed process.

In a very brief nutshell measurement decoherence (entanglement of the measurement device / outside world with a quantum system) goes like this :

You have a measurement device which is initially in an undetermined state |M>

You have something to measure.  Let's take a quantum state which is either 0 or 1 
(eg. a single particle that's either spin up or down) 

|0> + |1>

Initially they are not entangled :

|M> ( |0> + |1> )

And measurement changes them to be entangled :

( |M0>|0> + |M1>|1> )

M0 means the measurement device has observed the state 0

Note that there is no "collapse". Both possibilities still exist, but if the measurement device observes a 0 then the particle must be in state 0.
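That little bra-ket bookkeeping can be played with numerically. Here's a toy numpy sketch (my own illustration, not from any physics library): two qubits, with a CNOT gate standing in for the measurement interaction:

```python
import numpy as np

# The device starts un-triggered, the particle in (|0> + |1>)/sqrt(2).
# Basis ordering for the joint state |m>|p> : |00>, |01>, |10>, |11>.
device = np.array([1.0, 0.0])                  # |M>
particle = np.array([1.0, 1.0]) / np.sqrt(2)   # |0> + |1>, normalized
state = np.kron(device, particle)              # |M>( |0> + |1> ) : a product state

# "Measurement" is just a unitary interaction that copies the particle bit
# into the device -- a CNOT with the particle as control:
measure = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]], dtype=float)
entangled = measure @ state

# entangled == |M0>|0> + |M1>|1> (amplitudes [1/sqrt2, 0, 0, 1/sqrt2]).
# Both branches still exist -- no collapse -- but the cross terms
# |M0>|1> and |M1>|0> have zero amplitude.
```

No projection operator, no classical observer; the "measurement" is just ordinary unitary evolution that correlates the two systems.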

Now, in the book teaching QM we'd have to go into more detail about what "measurement decoherence" actually is. Because the entire universe is governed by QM, we'd have to show that that process happens by QM as well. In fact that has been done and the details worked out in the past 40 years, so it's relatively new.

Measurement decoherence doesn't happen at the single-particle level. Rather it's a large-scale phenomenon that happens due to entropy, much like thermodynamics. This is why "measurement collapse" could be seen as a connection to an outside classical system for large objects, because it only happens at large system scales. It only happens one way - you can entangle the measurement device with the quantum particles, but you can never de-entangle them (this is why in the old copenhagen interpretation you could get away with talking about "collapse", because if the measurement device is classical-scale then it can never go backwards). The basic laws of QM are time-reversible, but decoherence is not. It's exactly like if you took a bucket of red water and a bucket of blue water and mixed them in one larger bucket - it is now effectively impossible to ever separate out all the particles and restore the two different buckets.

So. The so-called "EPR Paradox" is not a paradox at all. (well, it is if you assume a deterministic physics with hidden variables, which is just wrong; it should be called the EPR Proof that Einstein was Wrong Sometimes). The EPR thought experiment goes like this :

You have two particles that are either in state 0 or 1, and are entangled so that they are either both 0 or both 1 :

|0>|0> + |1>|1>

You separate them to opposite ends of the universe, where lie Alice and Bob :

|A>|B>( |0>|0> + |1>|1> )

Bob then measures his particle and becomes entangled with it, in state B0 or B1 :

|A>( |B0>|0>|0> + |B1>|1>|1> )

If Alice now measures her particle, she will get either a 0 or 1, and see the same thing as Bob. However, nothing has moved faster than light. Her entanglement happens entirely on her local particle :

|A>|0> -> |A0>|0>

it has nothing to do with the state of Bob far away. The initial entangled state of the particle simply guarantees they will get the same result. But the physical interaction is still local and slower than light, and the information about the result must travel slower than light.
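If you want to convince yourself there's no signaling, you can check it with density matrices. A numpy sketch (my construction, not anything standard from a physics package) showing that Bob's measurement leaves Alice's local state untouched:

```python
import numpy as np

# Bell pair (|00> + |11>)/sqrt(2); joint index = 2*a + b for Alice bit a, Bob bit b.
psi = np.zeros(4)
psi[0] = psi[3] = 1.0 / np.sqrt(2)
rho = np.outer(psi, psi)           # density matrix of the joint state

def alice_marginal(r):
    """Alice's local state: partial trace over Bob's index."""
    return np.einsum('abcb->ac', r.reshape(2, 2, 2, 2))

before = alice_marginal(rho)

# Bob measures in the computational basis. In decoherence language the joint
# state just loses its cross terms (it entangles with Bob's device):
P0 = np.kron(np.eye(2), np.diag([1.0, 0.0]))
P1 = np.kron(np.eye(2), np.diag([0.0, 1.0]))
rho_after = P0 @ rho @ P0 + P1 @ rho @ P1

after = alice_marginal(rho_after)
# before == after : Bob measuring changes nothing Alice can locally detect,
# even though the outcomes are perfectly correlated when they later compare.
```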

There is never any "action at a distance" or any "paradox" or anything "spooky".

Sheesh. Come on vampires, you've been alive long enough to get your physics right.

01-05-15 | Daala PVQ Summary

A much belated attempt at summarizing the Daala PVQ coder. This post is in the wrong temporal order, it's sort of an introduction before the other rambling.

First consider Intra, that is the case of no prediction.

We have a block of some variable size (4x4 to 16x16) that has been transformed. The DC is sent by some other mechanism, so we are left with some AC's.

The AC's are grouped into subbands. These are just collections of likely-similar AC coefficients. Ideally they should be statistically similar. The current subbands in Daala group together chunks of AC coefficients in similar frequency ranges.

The AC coefficients in each subband are considered as a vector. We first send the "gain" of the vector, which is the L2 norm (the "length" of the vector). (*1)

The gain can be sent with a non-uniform scalar quantizer. This allows you to build variance-adaptive quantization into the codec without side data. (H264 does this by sending per-block quantizer changes based on activity, which costs extra bits in signalling). Essentially you want to use larger quantization bins for larger gains, because specifying the exact amount of a large energy is not as important. Daala seems to take gain to the 2/3 power and uniformly quantize that. (*2) (*3)
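A toy version of that companded gain quantizer (the 2/3 exponent is from the post; the step handling is my invention):

```python
# Compand-then-quantize: uniformly quantize gain**(2/3), so reconstruction
# levels spread out as the gain grows (big energies get coarser bins).
def quantize_gain(gain, step):
    return round(gain ** (2.0 / 3.0) / step)   # integer index sent in the stream

def reconstruct_gain(q, step):
    return (q * step) ** 1.5                   # invert the companding

# Spacing between adjacent reconstruction levels grows with gain,
# which is exactly the built-in VAQ behavior:
low_spacing  = reconstruct_gain(10, 0.5) - reconstruct_gain(9, 0.5)
high_spacing = reconstruct_gain(41, 0.5) - reconstruct_gain(40, 0.5)
```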

Once we've sent the gain, we need to send how that energy is distributed on the coefficients. I won't go into the details of how it's done. Of course you have already sent the total energy so you don't want to just send the coefficient values, that would be highly redundant. You may think of it as dividing the vector by the gain, we are left with a unit vector. The Fischer "Pyramid Vector Quantizer" is one good starting point.

Note that Fischer PVQ is explicitly not quite applicable here, because it is only optimal if the coefficients in the vector are independent and have the same distribution, which is not the case in images. Because of that, just sending the index of the PVQ codebook without entropy coding is probably wrong, and a way of coding a PVQ codebook selection that uses the properties of the AC coefficients is preferable. (eg. code one by one in a z-scan order so you know likelihood is geometrically decreasing, etc. Daala has some proposals along these lines).
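For concreteness, a minimal sketch of the Pyramid VQ idea: the codebook is the set of integer vectors with L1 norm K, and a simple greedy projection quantizes onto it. (This is my sketch, not Daala's actual search, and greedy pulse assignment by fractional part isn't exactly L2-optimal.)

```python
import numpy as np

def pvq_quantize(x, K):
    """Quantize the direction of x to an integer vector y with sum(|y|) == K,
    i.e. a point of the Fischer pyramid codebook. Assumes x is not all zero."""
    x = np.asarray(x, dtype=float)
    s = np.sign(x)
    a = np.abs(x)
    a = a * (K / a.sum())          # project onto the L1 sphere of radius K
    y = np.floor(a).astype(int)
    # hand out the remaining pulses to the largest fractional parts
    for _ in range(K - y.sum()):
        i = np.argmax(a - y)
        y[i] += 1
    return (s * y).astype(int)

y = pvq_quantize([0.9, -0.3, 0.1, 0.0], K=5)   # -> [4, -1, 0, 0]
```

The transmitted direction is then the normalized codeword, eg. y / np.linalg.norm(y), with the gain carrying the actual length.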

A key issue is determining the PVQ codebook resolution. With PVQ this is K, the L1 norm of the unnormalized (integer coefficients) codebook vectors. Daala computes K without sending it. K is computed such that error due to the PVQ quantization is equal to error due to gain quantization. (*4). This makes K a function of the overall quantization level, and also of the gain of that block - more gain = more K = more resolution for where that gain goes.

A non-uniform quantization matrix is a bit murky in this scheme (ala the JPEG CSF factors) because it changes the scaling of each axis in the vector, as well as the average magnitude, which violates the assumptions of Pyramid VQ. Applying a different quantizer per subband is an easy compromise, but pretty coarse. (*5)

And that's it for intra.

Advantages vs. traditional residual coding :

1. You send the gain up front, which is a good summary of the character of the block. This allows for good modeling of that gain (eg. from neighbors, previous gains). It also allows that gain to be used as a context or coding model selector for the rest of the block.

2. Because you send the gain explicitly, it is better preserved than with traditional coding (especially with Trellis Quant which tends to kill gain). (deadzone quantizers also tend to kill gain; here a deadzone quantizer might be appropriate for the total gain, but that's better than doing it for each coefficient independently). Preserving gain is good perceptually. It also allows for separate loss factor for gain and distribution of energy which may or may not be useful.

3. Because you send the gain explicitly, you can use a non-linear quantizer on it.

4. You have a simple way of sending large subbands = zero without coding an extra redundant flag.

5. No patents! (unbelievable that this is even an issue, all you software patenters out there plz DIAGF)

asides :

*1 = In my studies I found preserving L1 norm of AC activity to be more visually important than preserving L2 norm. That doesn't mean it's better to code L1 norm though.

*2 = One advantage of this scheme is having VAQ built in. A disadvantage is that it's hard-coded into the codec, so the encoder isn't free to do adaptive quantization based on better human visual studies. Of course the encoder can send corrections to the built-in quantizer, but you better be careful about tweaking the VAQ that's in the standard!

*3 = I always found variable quantization of log(gain) to be appealing.

*4 = My perceptual studies indicate that there should probably be more error due to the distribution quantization (K) than from gain quantization.

*5 = Of course, applying a CSF-like matrix to *residuals* as is done in most video coders is also very murky.

Okay, but now we have prediction, from motion compensation or intra prediction or whatever. You have a current block to encode and a predicted block that you think is likely similar.

The problem is we can't just subtract off the predicted block and make a residual the way normal video codecs do. If you did that, you would no longer have a "gain" which was the correct energy level of the block - it would be an energy level of the *residual*, which is not a useful thing either for perceptual quality control or for non-uniform quantization to mimic VAQ.

Sending the gain is easy. You have the gain for each predicted subband, so you can just predict the gain for the current subband to be similar to that (send the delta, context code, etc.). You want to send the delta in quantized space (after the power mapping). The previous block might have more information than you can preserve at the current quantization level (it may have a finely specified gain which your quantizer cannot represent). With a normal residual method, we could just send zero residuals and keep whatever detail was in the previous frame. Daala makes it possible to carry this forward by centering a quantization bucket exactly on the previous gain.

Now we need the distribution of coefficients. The issue is you can't just send the delta using Pyramid VQ. We already have the gain, which is the length of the coefficient N-vector; it's not the length of the delta vector.

Geometrically, we are on an N-sphere (since we know the length of the current vector) and we have a point on that sphere where the predicted block was. So we need to send our current location on the sphere relative to that known previous position. Rather than mess around with spherical coordinate systems, we can take the two vectors, and then essentially just send the parallel part (the dot product, or angle between them), and then the perpendicular part.

JM Valin's solution for Daala using the Householder reflection is just a way of getting the "perpendicular part" in a convenient coordinate system, where you can isolate the N-2 degrees of freedom. You have the length of the perpendicular part (it's gain*sin(theta)), so you can normalize it to a unit vector and use Pyramid VQ to send it.

So, to transmit our coefficients we send the gain (as delta from previous gain), we send the extent to which the vectors are parallel (eg. by sending theta (*6)), we then know the length of the perpendicular part and just need to send the remaining N-2 directional DOF using Pyramid VQ.
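The gain/theta/perpendicular decomposition falls straight out of the geometry. A numpy sketch of just that math (this is my illustration of the scheme described above, not Daala's code; all names are mine):

```python
import numpy as np

def predictive_split(x, r):
    """Split current vector x into (gain, theta, unit perpendicular) relative
    to prediction r, via a Householder reflection aligning r with an axis."""
    gain = np.linalg.norm(x)
    e0 = np.zeros_like(r)
    e0[0] = 1.0
    s = -1.0 if r[0] < 0 else 1.0
    v = r / np.linalg.norm(r) + s * e0          # Householder vector
    H = np.eye(len(r)) - 2.0 * np.outer(v, v) / np.dot(v, v)
    z = H @ x          # x in the frame where the prediction lies along -s*e0
    cos_t = -s * z[0] / gain
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    perp = z.copy()
    perp[0] = 0.0                               # N-1 dims, length gain*sin(theta)
    n = np.linalg.norm(perp)
    unit_perp = perp / n if n > 0 else perp     # zero vector when theta == 0
    return gain, theta, unit_perp
```

With a perfect prediction theta comes out zero (and K would go to zero with it); with an orthogonal prediction theta is pi/2 and you're back to coding the whole direction.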

As in the intra case, the K (quantizer resolution) of the Pyramid VQ is computed from other factors. Here obviously rather than being proportional to the total gain, it should be proportional to the length of the perpendicular part, eg. gain*sin(theta). In particular if you send a theta near zero, K goes toward zero.

One funny thing caused by the Householder reflection (coordinate system change to get the perpendicular part) is that you've scrambled up the axes in a way you can't really work with. So custom trained knowledge of the axes, like expected magnitudes and Z-scan order and things like that are gone. eg. with a normal coefficient delta, you know that it's very likely the majority of the delta is in the first few AC's, but after the rotation that's lost. (you can still build that in at the subband level, just not within the subbands).

Another funny thing is the degeneracy of polar conversion around the origin. When two vectors are very close (by Euclidean distance) they have a small angle between them, *if* they are long enough. Near the origin, the polar conversion has a pole (ha ha punny). This occurs for subbands near zero, eg. nearly flat, low energy blocks. Since the gain has previously been sent, it's possible that could be used to change to a different coder for gains near zero (eg. just use the Intra coder). (in Daala you would just send a "noref" flag). To be clear, the issue is that the vectors might be very close in Euclidean distance, and thus seem like good matches based on SAD motion search, but could easily be vectors pointing in completely opposite directions, hence be very bad to code using this theta scheme.

And I think that's about it.

The big goal of this funny business is to be able to send the gain (length) of the current subband vector, rather than sending the length of the delta. This gives you the advantages as discussed previously in the simpler Intra case.

asides :

*6 = send theta, or sin(theta) ? They have slightly different quantization bucket scaling. Most of this assumes that we have a reasonably good prediction so theta is small and theta ~= sin(theta).

Geometrically with white board drawing :

Traditional video coders just form the Euclidean "Delta" vector and send its components.

Daala Intra (no prediction) takes the "Current" vector, sends its length, and then its unit vector (direction) using Pyramid VQ.

Daala Predictive VQ sends the length of "Current", the angle (theta) from Predicted to Current, and the unit vector (direction) of the "Perpendicular" vector.

(of course Daala doesn't send the "Perpendicular" vector in the coordinate space drawn above; it's reflected into the space where "Predicted" is aligned with an axis, that way Perpendicular is known to have a zero in that direction and is simply a vector in N-1 dimensions (and you've already sent the length so it has N-2 DOF))

12-27-14 | Lagrange Rate Control Part 4

I wrote previously about Lagrange Rate control for video :

01-12-10 - Lagrange Rate Control Part 1
01-12-10 - Lagrange Rate Control Part 2
01-13-10 - Lagrange Rate Control Part 3

I was thinking about it recently, and I wanted to do some vague rambling about the overall issue.

First of all, the lagrange method in general for image & video coding is just totally wrong. The main problem is that it assumes every coding decision is independent. That distortions are isolated and additive, which they aren't.

The core of the lagrange method is that you set a "bit usefulness" value (lambda) and then you make each independent coding decision based on whether more bits improve D by lambda or more.
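Concretely, each independent lagrangian decision is just this (the distortion/rate numbers are hypothetical):

```python
# Core of the lagrange method: each local decision independently picks the
# coding option minimizing J = D + lambda * R.
def pick_mode(options, lam):
    """options: list of (name, distortion, rate_bits) tuples."""
    return min(options, key=lambda o: o[1] + lam * o[2])

options = [("skip",   100.0,  1),   # cheap, blurry
           ("coarse",  40.0, 12),
           ("fine",    10.0, 60)]   # expensive, sharp

# When bits are cheap (small lambda) you buy quality; as lambda rises,
# the same block degrades toward skip -- with no knowledge of neighbors.
assert pick_mode(options, lam=0.1)[0] == "fine"
assert pick_mode(options, lam=2.0)[0] == "coarse"
assert pick_mode(options, lam=10.0)[0] == "skip"
```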

But that's just wrong, because distortions are *not* localized and independent. I've mentioned a few times recently the issue of quality variation; if you make one block in image blurry, and leave others at high detail, that looks far worse than the localized D value tells you, because it's different and stands out. If you have a big patch of similar blocks, then making them different in any way is very noticeable and ugly. There are simple non-local effects, like if the current block is part of a smooth/gradient area, then blending smoothly with neighbors in the output is crucial, and the localized D won't tell you that. There are difficult non-local effects, like if the exact same kind of texture occurs in multiple parts of the image, then coding them differently makes the viewer go "WTF", it's a quality penalty worse than the local D would tell you.

In video, the non-local D effects are even more extreme due to temporal coherence. Any change of quality over time that's due to the coder (and not due to motion) is very ugly (like I frames coming in with too few bits and then being corrected over time, or even worse the horrible MPEG pop if the I-frame doesn't match the cut). Flickering of blocks if they change coding quality over time is horrific. etc. etc. None of this is measurable in a localized lagrangian decision.

(I'm even ignoring for the moment the fact that the encoding itself is non-local; eg. coding of the current block affects the coding of future blocks, either due to context modeling or value prediction or whatever; I'm just talking about the fact that D is highly non-local).

The correct thing to do is to have a total-image (or total-video) perceptual quality metric, and make each coding decision based on how it affects the total quality. But this is impossible.


So the funny thing is that the lagrange method actually gets you some global perceptual quality by accident.

Assume we are using quite a simple local D metric like SSD or SAD, possibly with SATD or something. Just in images, perceptually what you want is for smooth blocks to be preserved quite well, and very noisy/random blocks to have more error. Constant quantizer doesn't do that, but constant lambda does! Because the random-ish blocks are much harder to code, they cost more bits per quality, so they will be coded at lower quality.

In video, it's even more extreme and kind of magical. Blocks with a lot of temporal change are not as important visually - it's okay to have high error where there's major motion, and they are harder to code so they get worse quality. Blocks that stay still are important to have high quality, but they are also easier to code so that happens automatically.

That's just within a frame, but frame-to-frame, which is what I was talking about as "lagrange rate control" the same magic sort of comes out. Frames with lots of detail and motion are harder to code, so get lower quality. Chunks of the video that are still are easier to code, so get higher quality. The high-motion frames will still get more bits than the low-motion frames, just not as many more bits as they would at constant-quality.

It can sort of all seem well justified.

But it's not. The funny thing is that we're optimizing a non-perceptual local D. This D is not taking into account things like the fact that high motion block errors are less noticeable. It's just a hack that by optimizing for a non-perceptual D we wind up with a pretty good perceptual optimization.

Lagrange rate control is sort of neat because it gets you started with pretty good bit allocation without any obvious heuristic tweakage. But that goes away pretty fast. You find that using L1 vs. L2 norm for D makes a big difference in perceptual quality; maybe L1 squared? other powers of D change bit allocation a lot. And then you want to do something like MB-tree to push bits backward; for example the I frame at a cut should get a bigger chunk of bits so that quality pops in rather than trickles in, etc.

I was thinking of this because I mentioned to ryg the other day that I never got B frames working well in my video coder. They worked, and they helped in terms of naive distortion measures, but they created an ugly perceptual quality problem - they had a slightly different look and quality than the P frames, so in a PBPBP sequence you would see a pulsing of quality that was really awful.

The problem is they didn't have uniform perceptual quality. There were a few nasty issues.

One is that at low bit rates, the "B skip" block becomes very desirable in B frames. (for me "B skip" = send no movec or residual; use predicted movec to future and past frames to make an interpolated output block). The "B skip" is very cheap to send, and has pretty decent quality. As you lower bit rate, suddenly the B frames start picking "B skip" all over, and they actually have lower quality than the P frames. This is an example of a problem I mentioned in the PVQ posts - if you don't have a very smooth continuum of R/D choices, then an RD optimizing coder will get stuck in some holes and there will be sudden pops of quality that are very ugly.

At higher bit rates, the B frames are easier to code to high quality, (among other things, the P frame is using mocomp from further in the past), so the pulsing of quality is high quality B's and lower quality P's.

It's just an issue that lagrange rate control can't handle. You either need a very good real perceptual quality metric to do B-P rate control, or you just need well tweaked heuristics, which seems to be what most people do.

12-17-14 | PVQ Vector Distribution Note

So, PVQ, as in "Pyramid VQ" is just a way of making a VQ codebook for a unit vector that has a certain probability model (each value is laplacian (the same laplacian) and independent).

You have a bunch of values, you send the length separately so you're left with a unit vector. Independent values are not equally distributed on the unit sphere, so you don't want a normal unit vector quantizer, you want this one that is scaled squares.

Okay, that's all fine, but the application to images is problematic.

In images, we have these various AC residuals. Let's assume 8x8 blocks for now for clarity. In each block, you have ACij (ij in 0-7 and AC00 excluded). The ij are frequency coordinates for each coefficient. You also have spatial neighbors in the adjacent blocks.

The problem is that the AC's are not independent - they're not independent either in frequency or in spatial coordinates. AC42 is strongly correlated with AC21 and also with AC42 in the neighboring blocks. They also don't have the same distribution; lower frequency-index AC's have much higher means.

In order to use Pyramid VQ, we need to find a grouping of AC's into a vector, such that the values we are putting in that vector are as uncorrelated and equally distributed as possible.

One paper I found that I linked in the previous email forms a vector by taking all the coefficients in the same frequency slot in a spatial region (for each ij, take all ACij in the 4x4 neighborhood of blocks). This is appealing in the sense that it gathers AC's of the same frequency subband, so they have roughly the same distribution. The problem is there are strong spatial correlations.

In Daala they form "subbands" of the coefficients by grouping together chunks of ACs that are in similar frequency groups.

The reason why correlation is bad is that it makes the PVQ codebook not optimal. For correlated values you should have more codebook vectors with neighboring values similar. eg. more entries around {0, 2, 2, 0} and fewer around {2, 0, 0, 2}. The PVQ codebook assumes those are equally likely.

You can however, make up for this a bit with the way you encode the codebook index. It doesn't fix the quantizer, but it does extract the correlation.

In classical PVQ (P = Pyramid) you would simply form an index to the vector and send it with an equiprobable code. But in practice you might do it with binary subdivision or incremental enumeration schemes, and then you need not make all codebook vectors equiprobable.

For example in Daala one of the issues for the lower subbands is that the vectors that have signal in the low AC's are more probable than the high AC's. eg. for subband that spans AC10 - AC40 , {1,0,0,0} is much more likely than {0,0,0,1}.

Of course this becomes a big mess when you consider Predictive VQ, because the Householder transform scrambles everything up in a way that makes it hard to model these built-in skews. On the other hand, if the Predictive VQ removes enough of the correlation with neighbors and subband, then you are left with a roughly evenly distributed vector again, which is what you want.

12-16-14 | Daala PVQ Emails

First, I want to note that the PVQ Demo page has good links at the bottom with more details in them, worth reading.

Also, the Main Daala page has more links, including the "Intro to Video" series, which is rather more than an intro and is a good read. It's a broad survey of modern video coding.

Now, a big raw dump of emails between me, ryg, and JM Valin. I'm gonna try to color them to make it a bit easier to follow. Thusly :


And this all starts with me being not very clear on PVQ so the beginning is a little fuzzy.

I will be following this up with a "summary of PVQ as I now understand it" which is probably more useful for most people. So, read that, not this.

(also jebus the internet is ridonkulous. Can I have BBS's back? Like literally 1200 baud text is better than the fucking nightmare that the internet has become. And I wouldn't mind playing a little Trade Wars once a day...)

Hi, I'm the author of the latest Daala demo on PVQ on which you commented recently. Here's some comments on your comments. I wasn't able to submit this to your blog, but feel free to copy them there.

> 1. I've long believed that blocks should be categorized into "smooth", "detail" and "edge". For most of this discussion we're going to ignore smooth and edge and just talk about detail. That is, blocks that are not entirely smooth, and don't have a dominant edge through them (perhaps because that edge was predicted).

I agree here, although right now we're treating "smooth" and "detail" in the same way (smooth is just low detail). Do you see any reason to treat those separately?

> 2. The most important thing in detail blocks is preserving the amount of energy in the various frequency subbands. This is something that I've talked about before in terms of perceptual metrics.

This is exactly what I had in mind with this PVQ work. Before Daala, I worked on the CELT part of the Opus codec, which has strict preservation of the energy. In the case of Daala, it looks so far like we want to relax the energy constraint a little. Right now, the codebook has an energy-preserving structure, but the actual search is R/D optimized with the same weight given to the amount of energy and its location. It's pretty easy to change the code to give more weight to energy preservation. I could even show you how to play with it if you're interested.

> 3. You can take a standard type of codec and optimize the encoding towards this type of perceptual metric, and that helps a bit, but it's the wrong way to go. Because you're still spending bits to exactly specify the noise in the high frequency area.

Correct, hence the gain-shape quantization in my post.

> 4. What you really want is a joint quantizer of summed energy and the distribution of that energy. At max bit rate you send all the coefficients exactly. As you reduce bitrate, the sum is preserved pretty well, but the distribution of the lower-right (highest frequency) coefficients becomes lossy. As you reduce bit rate more, the total sum is still pretty good and the overall distribution of energy is mostly right, but you get more loss in where the energy is going in the lower frequency subbands, and you also get more scalar quantization of the lower frequency subbands, etc.

Well, my experience with both Opus and CELT is that you want the same resolution for the energy as you use for the "details". That being said, having an explicit energy still means you can better preserve it in the quantization process (i.e. it won't increase or decrease too much due to the quantization).

> 5. When the energy is unspecified, you'd like to restore in some nice way. That is, don't just restore to the same quantization vector every time ("center" of the quantization bucket), since that could create patterns. I dunno. Maybe restore with some randomness; restore based on prediction from the neighborhood; restore to maximum likelihood? (ML based on neighborhood/prediction/image not just a global ML)

I experimented a little bit with adding noise at the correct energy and while it slightly improved the quality on still images, it wasn't clear how to apply it to video because then you have the problem of static vs dynamic noise.

> 6. An idea I've tossed around for a while is a quadtree/wavelet-like coding scheme. Take the 8x8 block of coefficients (and as always exclude DC in some way). Send the sum of the whole thing. Divide into four children. So now you have to send a (lossy) distribution of that sum onto the 4 child slots. Go to the upper left (LL band) and do it again, etc.

I considered something along these lines, but it would not be easy to do because the lowest frequencies would kind of drown out the high frequencies.

> 7. The more energy you have, the less important its exact distribution, due to masking. As you have more energy to distribute, the number of vectors you need goes up a lot, but the loss you can tolerate also goes up. In terms of the bits to send a block, it should still increase as a function of the energy level of that block, but it should increase less quickly than naive log2(distributions) would indicate.

Yes, that is exactly what the PVQ companding does.

> 8. Not all AC's are equally likely or equally perceptually important. Specifically the vector codebook should contain more entries that preserve values in the upper-left (low frequency) area.

This is the equivalent of the quantization matrix, which PVQ has as well (though I didn't really talk about it).

> 9. The interaction with prediction is ugly. (eg. I don't know how to do it right). The nature of AC values after mocomp or intra-prediction is not the same as the nature of AC's after just transform (as in JPEG). Specifically, ideas like variance masking and energy preservation apply to the transformed AC values, *not* to the deltas that you typically see in video coding.

Handling the prediction is exactly what the whole Householder reflection in PVQ is about (see the 6 steps figure). The PVQ gain encoding scheme is always done on the input and not on the prediction. So the activity masking is applied on the input energy and not based on the energy of the residual.

> 10. You want to send the information about the AC in a useful order. That is, the things you send first should be very strong classifiers of the entropy of that block for coding purposes, and of the masking properties for quantization purposes.

Well, coding the energy first achieves most of this.

> You don't want sending the "category" or "masking" information to be separate side-band data. It should just be the first part of sending the coefficients. So your category is maybe something like the bit-vector of which coefficient groups have any non-zero coefficients. Something like that which is not redundant with sending them, it's just the first gross bit of information about their distribution.

Well, the masking information is tied to the gain. For now, the category information is only tied to the block size decision (based on the assumption that edges will be 4x4), but it's not ideal and it's something I'd like to improve.

On the topic of lapped transform, this has indeed been causing us all sorts of headaches, but it also has interesting properties. Jury's still out on that one, but so far I think we've managed to make reasonably good use of it.

Cheers, Jean-Marc

Thanks for writing! Before I address specific points, maybe you can teach me a bit about PVQ and how you use it? I can't find any good resources on the web (your abstract is rather terse). Maybe you can point me at some relevant reference material. (the CELT paper is rather terse too!)

Are you constructing the PVQ vector from the various AC's within a single block? Or gathering the same subband from spatial neighbors? (I think the former, but I've seen the latter in papers)

Assuming the former - isn't it just wrong? The various AC's have different Laplacian distributions (lower frequencies more likely) so using PVQ just doesn't seem right. In particular PVQ assumes all coefficients are equally likely and equally distributed.

In your abstract you seem to describe a coding scheme which is not a uniform length codeword like traditional PVQ. It looks like it assigns shorter codes to vectors that have their values early on in some kind of z-scan order.

How is K chosen?

Hi,

On 02/12/14 08:52 PM, Charles Bloom wrote:
> Thanks for writing! Before I address specific points, maybe you can
> teach me a bit about PVQ and how you use it? I can't find any good
> resources on the web (your abstract is rather terse). Maybe you can
> point me at some relevant reference material. (the CELT paper is rather
> terse too!)

I'm currently writing a longer paper for a conference in February, but for now there isn't much more than the demo and the abstract I link to at the bottom. I have some notes that describe some of the maths, but it's a bit all over the place right now: http://jmvalin.ca/video/video_pvq.pdf

> Are you constructing the PVQ vector from the various AC's within a
> single block? Or gathering the same subband from spatial neighbors? (I
> think the former, but I've seen the latter in papers)
>
> Assuming the former -

Correct. You can see the grouping (bands) in Fig. 1 of: http://jmvalin.ca/video/spie_pvq_abstract.pdf

> Isn't it just wrong? The various AC's have different Laplacian
> distributions (lower frequencies more likely) so using PVQ just doesn't
> seem right.
>
> In particular PVQ assumes all coefficients are equally likely and
> equally distributed.
>
> In your abstract you seem to describe a coding scheme which is not a
> uniform length codeword like traditional PVQ. It looks like it assigns
> shorter codes to vectors that have their values early on in some kind of
> z-scan order.

One thing to keep in mind is that the P in PVQ now stands for "perceptual". In Daala we are no longer using the indexing scheme from CELT (which does assume identical distribution). Rather, we're using a coding scheme based on Laplace distributions of unequal variance. You can read more about the actual encoding process in another document: http://jmvalin.ca/video/pvq_encoding.pdf

> How is K chosen?

The math is described (poorly) in section 6.1 of http://jmvalin.ca/video/video_pvq.pdf

Basically, the idea is to have the same resolution in the direction of the gain as in any other direction. In the no-prediction case, it's roughly proportional to the gain times the square root of the number of dimensions. Because K only depends on values that are available to the decoder, we don't actually need to signal it.

Hope this helps, Jean-Marc
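Jean-Marc's description of K (roughly proportional to the quantized gain times the square root of the number of dimensions, computed purely from decoder-side values) can be sketched as a toy formula. This is my illustration only, not the actual Daala computation (which also has to handle the prediction/theta case), and `scale` is a made-up tuning constant:

```python
import math

def pulses_for_gain(g_quantized, n_dims, scale=0.5):
    # Toy illustration: K grows roughly like gain * sqrt(N).
    # 'scale' is a made-up constant, not Daala's. Both arguments are
    # already known to the decoder, so K never needs to be signalled.
    return max(1, round(scale * g_quantized * math.sqrt(n_dims)))

k_small = pulses_for_gain(4, 15)   # small band
k_large = pulses_for_gain(4, 63)   # bigger band, same gain -> more pulses
```

The point is just the shape of the dependence: more energy or more dimensions means more pulses, and since both inputs are decoded values, K itself is free.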

Thanks for the responses and the early release papers, yeah I'm figuring most of it out.

K is chosen so that distortion from the PVQ (P = Pyramid) quantization is the same as distortion from gain quantization. Presumably under a simple D metric like L2. The actual PVQ (P = Pyramid) part is the simplest and least ambiguous.

The predictive stuff is complex. Let me make sure I understand this correctly -

You never actually make a "residual" in the classic sense by subtracting the prediction off. You form the prediction in transformed space. (perhaps by having a motion vector, taking the pixels it points to and transforming them, dealing with lapping, yuck!)

The gain of the current block is sent (for each subband). Not the gain of the delta. The gain of the prediction in the same band is used as coding context? (the delta of the quantized gains could be sent).

The big win that you guys were after in sending the gain seems to have been the non-linear quantization levels; essentially you're getting "variance adaptive quantization" without explicitly sending per block quantizers.

The Householder reflection is the way that vectors near the prediction are favored. This is the only way that the predicted block is used!? Madness!

(presumably if the prediction had detail that was finer than the quantization level of the current block that could be used to restore within the quantization bucket; eg. for "golden frames")

On 03/12/14 12:18 AM, Charles Bloom wrote:
> K is chosen so that distortion from the PVQ (P = Pyramid) quantization
> is the same as distortion from gain quantization. Presumably under a
> simple D metric like L2.

Yes, it's an L2 metric, although since the gain is already warped, the distortion is implicitly weighted by the activity masking, which is exactly what we want.

> You never actually make a "residual" in the classic sense by subtracting
> the prediction off.

Correct.

> You form the prediction in transformed space. (perhaps by having a
> motion vector, taking the pixels it points to and transforming them,
> dealing with lapping, yuck!)

We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do.

> The gain of the current block is sent (for each subband). Not the gain
> of the delta.

Correct.

> The gain of the prediction in the same band is used as coding context?
> (the delta of the quantized gains could be sent).

Yes, the gain is delta-coded, so coding "same gain" is cheap. Especially, there's a special symbol for gain=0,theta=0, which means "skip this band and use prediction as is".

> The big win that you guys were after in sending the gain seems to have
> been the non-linear quantization levels; essentially you're getting
> "variance adaptive quantization" without explicitly sending per block
> quantizers.

Exactly. Not only that but it's adaptive based on the variance of the current band, not just an entire macroblock.

> The Householder reflection is the way that vectors near the prediction
> are favored. This is the only way that the predicted block is used!?
> Madness!

Well, the reference is used to compute the reflection *and* the gain. In the end, we're using exactly the same amount of information, just in a different space.

> (presumably if the prediction had detail that was finer than the
> quantization level of the current block that could be used to restore
> within the quantization bucket; eg. for "golden frames")

Can you explain what you mean here?

Jean-Marc

So one thing that strikes me is that at very low bit rate, it would be nice to go below K=1. In the large high-frequency subbands, the vector dimension N is very large, so even at K=1 it takes a lot of bits to specify where the energy should go. It would be nice to be more lossy with that location.

It seems that for low K you're using a zero-runlength coder to send the distribution, with a kind of Z-scan order, which makes it very similar to standard MPEG.

(maybe you guys aren't focusing on such low bit rates; when I looked at low bit rate video the K=1 case dominated)

At 09:42 PM 12/2/2014, you wrote:
>> (presumably if the prediction had detail that was finer than the
>> quantization level of the current block that could be used to restore
>> within the quantization bucket; eg. for "golden frames")
>
> Can you explain what you mean here?

If you happen to have a very high quality previous block (much better than your current quantizer / bit rate should give you) - with normal mocomp you can easily carry that block forward, and perhaps apply corrections to it, but the high detail of that block is preserved.

With the PVQ scheme it's not obvious to me that that works. When you send the quantized gain of the subbands you're losing precision (it looks like you guys have a special fudge to fix this, by offsetting the gain based on the prediction's gain?)

But for the VQ part, you can't really "carry forward" detail in the same way. I guess the reflection vector can be higher precision than the quantizer, so in a sense that preserves detail, but it doesn't carry forward the same values, because they drift due to rotation and staying a unit vector, etc.

Some more questions -

Is the Householder reflection method also used for Intra prediction? (do you guys do the directional Intra like H26x ?)

How much of this scheme is because you believe it's the best thing to do vs. you have to avoid H26x patents?

If you're not sending any explicit per-block quantizer, it seems like that removes a lot of freedom for future encoders to do more sophisticated perceptual optimization. (ROI bit allocation or whatever)

On 03/12/14 02:17 PM, Charles Bloom wrote:
> So one thing that strikes me is that at very low bit rate, it would be
> nice to go below K=1. In the large high-frequency subbands, the vector
> dimension N is very large, so even at K=1 it takes a lot of bits to
> specify where the energy should go. It would be nice to be more lossy
> with that location.

Well, for large N, the first gain step already has K>1, which I believe is better than K=1. I've considered adding an extra gain step with K=1 or below, but never had anything that was really worth it (didn't try very hard).

> It seems that for low K you're using a zero-runlength coder to send the
> distribution, with a kind of Z-scan order, which makes it very similar
> to standard MPEG.
>
> (maybe you guys aren't focusing on such low bit rates; when I looked at
> low bit rate video the K=1 case dominated)

We're also targeting low bit-rates, similar to H.265. We're not yet at our target level of performance though.

> Is the Householder reflection method also used for Intra prediction?
> (do you guys do the directional Intra like H26x ?)

We also use it for intra prediction, though right now our intra prediction is very limited because of the lapped transform. Except for chroma which we predict from the luma. PVQ makes this particularly easy. We just use the unit vector from luma as chroma prediction and code the gain.

> How much of this scheme is because you believe it's the best thing to
> do vs. you have to avoid H26x patents?

The original goal wasn't to avoid patents, but it's a nice added benefit.

> If you're not sending any explicit per-block quantizer, it seems like
> that removes a lot of freedom for future encoders to do more
> sophisticated perceptual optimization. (ROI bit allocation or whatever)

We're still planning on adding some per-block/macroblock/something quantizers, but we just won't need them for activity masking.

Cheers, Jean-Marc

Hi,

Just read your "smooth blocks" post and I thought I'd mention one thing we do in Daala to improve the quality of smooth regions. It's called "Haar DC" and the idea is basically to apply a Haar transform to all the DCs in a superblock. This has the advantage of getting us much better quantization resolution at large scales. Unfortunately, there's absolutely no documentation about it, so you'd have to look at the source code, mostly in od_quantize_haar_dc() and a bit of od_compute_dcts() http://git.xiph.org/?p=daala.git;a=blob;f=src/encode.c;h=879dda;hb=HEAD

Cheers, Jean-Marc
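As a rough sketch of the idea (mine, not the actual od_quantize_haar_dc(), which works recursively across several scales): one orthonormal 2x2 Haar step over four block DCs separates out a superblock-scale average term, which can then be given a finer quantizer than the per-block detail terms:

```python
def haar2x2(a, b, c, d):
    # One orthonormal 2x2 Haar step: average plus three detail terms.
    # With the 1/2 scaling the transform matrix is symmetric and
    # orthogonal, so it is its own inverse.
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

dcs = (10.0, 12.0, 11.0, 13.0)
coeffs = haar2x2(*dcs)       # coeffs[0] is the large-scale DC term
restored = haar2x2(*coeffs)  # self-inverse, so this round-trips exactly
```

Quantizing coeffs[0] finely while leaving the detail terms at the normal step is what buys the "better resolution at large scales": a small shift of the whole superblock's average is now representable even when individual DC steps are coarse.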

Yeah I definitely can't follow that code without digging into it. But this: "much better quantization resolution at large scales" is interesting.

When I did the DLI test: http://cbloomrants.blogspot.com/2014/08/08-31-14-dli-image-compression.html something I noticed in both JPEG and DLI (and in everything else, I'm sure) is:

Because everyone just does naive scalar quantization on DC's, large regions of solid color will shift in a way that is very visible. That is, it's a very bad perceptual RD allocation. Some bits should be taken away from AC detail and put into making that large region DC color more precise.

The problem is that DC scalar quantization assumes the blocks are independent and random and so on. It models the distortion of each block as being independent, etc. But it's not. If you have the right scalar quantizer for the DC when the blocks are in a region of high variation (lots of different DC's), then that is much too large a quantizer for regions where blocks all have roughly the same DC.

This is true even when there is a decent amount of AC energy, eg. the image I noticed it in was the "Porsche640" test image posted on that page - the greens of the bushes all color shift in a very bad way. The leaf detail does not mask this kind of perceptual error.

Two more questions -

1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, how does that work with gain preservation and the Pyramid VQ unit vector?

2. Do you mind if I post all these mails publicly?

On 11/12/14 02:03 PM, Charles Bloom wrote:
> 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so,
> how does that work with gain preservation and the Pyramid VQ unit
> vector?

Right now, we just set a different quantizer value for each "band", so we can't change resolution on a coefficient-by-coefficient basis, but it still looks like a good enough approximation. If needed we might try doing something fancier at some point.

> 2. Do you mind if I post all these mails publicly?

I have no problem with that and in fact I encourage you to do so.

Cheers, Jean-Marc

ryg: Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit.

Re "3 1/2. The normal zig-zag coding schemes we use are really bad."

Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths? Classic JPEG/MPEG style coding rules (H.264 and later are somewhat different) 1. assume short runs are more probable than long ones and 2. give a really cheap way to end blocks early. The result is that the coder likes blocks with a fairly dense cluster in the first few coded components (and only this is where zig-zag comes in) and truncated past that point.

Now take Fischer-style PVQ (original paper is behind a paywall, but this: http://www.nul.com/pbody17.pdf covers what seems to be the proposed coding scheme). You have two parameters, N and K. N is the dimensionality of the data you're coding (this is a constant at the block syntax level and not coded) and K is the number of unit pulses (=your "energy"). You code K and then send an integer (with a uniform model!) that says which of all possible arrangements of K unit pulses across N dimensions you mean. For 16-bit ACs in an 8x8 block so N=63, there's on the order of 2^(63*16) = 2^1008 different values you could theoretically code, so clearly for large K this integer denoting the configuration can get quite huge.

Anyway, suppose that K=1 (easiest case). Then the "configuration number" will tell us where the pulse goes and what sign it has, uniformly coded. That's essentially a run length with *uniform* distribution plus sign.

K=2: we have two pulses. There's N*2 ways to code +-2 in one AC and the rest zeros (code AC index, code sign), and (N choose 2) * 2^2 ways to code two slots at +-1 each. And so forth for higher K.

From there, we can extrapolate what the general case looks like. I think the overall structure ends up being isomorphic to this:

1. You code the number M (<=N) of nonzero coefficients using a model derived from the combinatorics given N and K (purely counting-based). (K=1 implies M=1, so nothing to code in that case.)
2. Code the M sign bits.
3. Code the positions of the M nonzero coeffs - (N choose M) options here.
4. Code another number denoting how we split the K pulses among the M coeffs - that's an integer partition of K into exactly M parts, not sure if there's a nice name/formula for that.

This is close enough to the structure of existing AC entropy coders that we can meaningfully talk about the differences. 1) and 2) are bog-standard (we use a different model knowing K than a regular codec that doesn't know K would, but that's it). You can view 3) in terms of significance masks, and the probabilities have a reasonably simple form (I think you can adapt the Reservoir sampling algorithm to generate them) - or, by looking at the zero runs, in terms of run lengths. And 4) is a magnitude coder constrained by knowing the final sum of everything.

So the big difference is that we know K at the start, which influences our choice of models forthwith. But it's not actually changing the internal structure that much!

That said, I don't think they're actually doing Fischer-style PVQ of "just send a uniform code". The advantage of breaking it down like above is that you have separate syntax elements that you can apply additional modeling on separately. Just having a giant integer flat code is not only massively unwieldy, it's also a bit of a dead end as far as further modeling is concerned.

-Fabian
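Fabian's four-step decomposition can be sanity-checked numerically (my sketch, not from the mails). Step 4 does have a closed form once the positions are fixed: the splits are ordered, i.e. compositions of K into exactly M positive parts, of which there are C(K-1, M-1). Multiplying the counts for steps 1-4 and summing over M should reproduce the codebook size from the Fischer recurrence (called N(l,k) in the mails; V here so N can stay the dimension):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def V(l, k):
    # codebook size: integer vectors of dimension l with L1 norm exactly k
    if l == 0:
        return 1 if k == 0 else 0
    return sum(V(l - 1, k - abs(i)) for i in range(-k, k + 1))

def V_decomposed(n, k):
    # steps 1-4: choose m nonzero positions, m signs, then an ordered
    # split of the k pulses among them: comb(k - 1, m - 1) compositions
    return sum(comb(n, m) * 2**m * comb(k - 1, m - 1)
               for m in range(1, min(n, k) + 1))

# the two counts agree for all small (n, k)
for n in range(1, 8):
    for k in range(1, 8):
        assert V(n, k) == V_decomposed(n, k)
```

So the factored coder really does cover exactly the same codebook; knowing K just redistributes how the bits are spent across the syntax elements.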

cbloom: At 01:36 PM 12/2/2014, you wrote:
> Don't wanna post this to your blog because it's a long comment and will
> probably fail Blogger's size limit.
>
> Re "3 1/2. The normal zig-zag coding schemes we use are really bad."
>
> Don't agree here about zig-zag being the problem. Doesn't it just boil
> down to what model you use for the run lengths?

My belief is that for R/D optimization, it's bad when there's a big R step that doesn't correspond to a big D step. You want the prices of things to be "fair". So the problem is cases like :

XX00 X01 0

vs

XX00 X001 000 00 0

which is not a very big D change at all, but is a very big R step.

I think it's easy to see that even keeping something equivalent to the zigzag, you could change it so that the position of the next coded value is sent in a way such that the rates better match entropy and distortion.

But of course, really what you want is to send the positions of those later values in a lossy way. Even keeping something zigzagish you can imagine easy ways to do it, like you send a zigzag RLE that's something like {1,2,3-4,5-7,8-13} whatever.

ryg: Actually the Fischer (magnitude enumeration) construction corresponds pretty much directly to a direct coder: from the IEEE paper, l = dim, k = number of pulses, then the number of code words N(l,k) is

    N(l,k) = sum_{i=-k}^k N(l-1, k-|i|)

This is really direct: N(l,k) just loops over all possible values i for the first AC coeff. The remaining uncoded ACs then are l-1 dimensional and <= k-|i|. Divide through by N(l,k) and you have a probability distribution for coding a single AC coeff. Splitting out the i=0 case and sign, we get:

    N(l,k) = N(l-1,k) + 2 * sum_{j=1}^k N(l-1,k-j) =: N(l-1,k) + 2 * S(l-1,k)

which corresponds 1:1 to this encoder:

    // While energy (k) left
    for (i = 0; k > 0; i++) {
        assert(i < Ndims); // shouldn't get to N with leftover energy
        int l = N - i; // remaining dims
        int coeff = coeffs[i];

        // encode significance
        code_binary(coeff == 0, N(l-1,k) / N(l,k));
        if (coeff != 0) {
            // encode sign
            code_binary(coeff < 0, 0.5);
            int mag = abs(coeff);
            // encode magnitude (multi-symbol)
            // prob(mag=j) = N(l-1,k-j) / S(l-1,k)
            // then:
            k -= mag;
        }
    }

and this is probably how you'd want to implement it given an arithmetic back end anyway. Factoring it into multiple decisions is much more convenient (and as said before, easier to do secondary modeling on) than the whole "one giant bigint" mess you get if you're not low-dimensional. Having the high-dimensional crap in there blows because the probabilities can get crazy. Certainly Ndims=63 would suck to work with directly.

Separately, I'd expect that for k "large" (k >= Ndims? Ndims*4? More? Less?) you can use a simpler coder and/or fairly inaccurate probabilities because that's gonna be infrequent. Maybe given k = AC_sum(1,63) = sum_{i=1}^63 |coeff_i|, there's a reasonably nice way to figure out say AC_sum(1,32) and AC_sum(33,63). And if you can do that once, you can do it more than once.
Kind of a top-down approach: you start with "I have k energy for this block" and first figure out which subband groups that energy goes into. Then you do the "detail" encode like above within each subband of maybe 8-16 coeffs; with l<=Ndim<=8 and k<=Ndim*small, you would have reasonable (practical) model sizes. -Fabian
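For concreteness, here's a toy version (mine, not from any of the papers) of the Fischer-style enumeration that the incremental coder above factors apart: the same recurrence (V here, N(l,k) in the mails, repeated so the snippet is self-contained) gives a bijection between pulse vectors and integers in [0, V(l,k)) - i.e. the "one giant bigint" index:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(l, k):
    # codebook size: integer vectors of dimension l with L1 norm exactly k
    if l == 0:
        return 1 if k == 0 else 0
    return sum(V(l - 1, k - abs(i)) for i in range(-k, k + 1))

def pvq_index(x, k):
    # map a vector with L1 norm k to a unique index in [0, V(len(x), k))
    idx = 0
    for pos, v in enumerate(x):
        l = len(x) - pos
        for i in range(-k, k + 1):   # canonical order for this slot
            if i == v:
                break
            idx += V(l - 1, k - abs(i))  # skip all codewords starting with i
        k -= abs(v)
    return idx

def pvq_vector(idx, l, k):
    # inverse mapping: walk the same canonical order, descending into
    # whichever sub-codebook the index falls in
    x = []
    for pos in range(l):
        rem = l - pos
        for i in range(-k, k + 1):
            n = V(rem - 1, k - abs(i))
            if idx < n:
                x.append(i)
                k -= abs(i)
                break
            idx -= n
    return x

# round-trip every codeword of a small codebook
l, k = 4, 3
seen = set()
for idx in range(V(l, k)):
    x = pvq_vector(idx, l, k)
    assert sum(abs(v) for v in x) == k
    assert pvq_index(x, k) == idx
    seen.add(tuple(x))
assert len(seen) == V(l, k)
```

Each decision the incremental coder makes corresponds to skipping a contiguous range of these indices, which is why the factored and flat views cost exactly the same (up to rounding).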

cbloom: No, I don't think that's right. The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme. It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( log2 N ) bits. Your coding scheme is variable length.

I need to find the original Fischer paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong.

I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neighboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ?

(paper attached to next email)

ryg: On 12/2/2014 5:46 PM, Charles Bloom {RAD} wrote:
> No, I don't think that's right. The N recursion is just for counting the
> number of codewords, it doesn't imply a coding scheme. It explicitly
> says that the pyramid vector index is coded with a fixed length word,
> using ceil( log2 N ) bits. Your coding scheme is variable length.

I wasn't stating that Fischer's scheme is variable-length; I was stating that the decomposition as given implies a corresponding way to encode it that is equivalent (in the sense of exact same cost). It's not variable length. It's variable number of symbols but the output length is always the same (provided you use an exact multi-precision arithmetic coder that is, otherwise it can end up larger due to round-off error). log2(N(l,k)) is the number of bits we need to spend to encode which one out of N(l,k) equiprobable codewords we use. The ceil(log2(N)) is what you get when you say "fuck it" and just round it to an integral number of bits, but clearly that's not required.

So suppose we're coding to the exact target rate using bignum rationals and an exact arithmetic coder. Say I have a permutation of 3 values and want to encode which one it is. I can come up with a canonical enumeration (doesn't matter which) and send an index stating which one of the 6 candidates it is, in log2(6) bits. Or I can send one bit stating whether it's an even or odd permutation, which partitions my 6 cases into 2 disjoint subsets of 3 cases each, and then send log2(3) bits to encode which of the even/odd permutations I am, for a total of log2(2) + log2(3) = log2(6) bits.

Or I can get fancier. In the general case, I can (arbitrarily!) partition my N values into disjoint subsets with k_1, k_2, ..., k_m elements, respectively, sum_i k_i = N. To code a number, I then first code the number of the subset it's in (using probability p_i = k_i/N) and then send a uniform integer denoting which element it is, in log2(k_i) bits.
Say I want to encode some number x, and it falls into subset j. Then I will spend -log2(p_j) + log2(k_j) = -log2(k_j / N) + log2(k_j) = log2(N / k_j) + log2(k_j) = log2(N) bits (surprise... not). I'm just partitioning my uniform distribution into several distributions over smaller sets, always setting probabilities exactly according to the number of "leaves" (=final coded values) below that part of the subtree, so that the product along each path is still a uniform distribution. I can nest that process of course, and it's easy to do so in some trees but not others meaning I get non-uniform path lengths, but at no point am I changing the size of the output bitstream.

That's exactly what I did in the "coder" given below. What's the value of the first AC coefficient? It must obey -k <= ac_0 <= k per definition of k, and I'm using that to partition our codebook C into 2k+1 disjoint subsets, namely C_x = { c in C | ac0(c) = x } and nicely enough, by the unit-pulse definition that leads to the enumeration formula, each of the C_x corresponds to another PVQ codebook, namely with dimension l-1 and energy k-|x|. Which implies the whole thing decomposes into "send x and then do a PVQ encode of the rest", i.e. the loop I gave.

That said, one important point that I didn't cover in my original mail: from the purposes of coding this is really quite similar to a regular AC coder, but of course the values being coded don't mean the same thing. In a JPEG/MPEG style entropy coder, the values I'm emitting are raw ACs. PVQ works (for convenience) with code points on an integer lattice Z^N, but the actual AC coeffs coded aren't those lattice points, they're (gain(K) / len(lattice_point)) * lattice_point (len here being Euclidean and not 1-norm!).

> I need to find the original Fischer paper because this isn't making
> sense to me. The AC's aren't equally probable and don't have the same
> Laplacian distribution so PVQ just seems wrong.
>
> I did find this paper ("Robust image and video coding with pyramid
> vector quantisation") which uses PVQ and is making the vectors not from
> within the same block, but within the same *subband* in different
> spatial locations. eg. gathering all the AC20's from lots of neighboring
> blocks. That does make sense to me but I'm not sure if that's what
> everyone means when they talk about PVQ ?
>
> (paper attached to next email)

The link to the extended abstract for the Daala scheme (which covers this) is on the Xiph demo page: http://jmvalin.ca/video/spie_pvq_abstract.pdf

Page 2 has the assignment of coeffs to subbands. They're only using a handful, and notably they treat 4x4 blocks as a single subband.

-Fabian

cbloom: Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding.

There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do. But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem.

Basically the idea is something like -

Divide the block into subbands. Let's say the standard wavelet tree for concreteness -

01447777
23447777
55667777
55667777
8..

Send the sum in each subband ; this is the "gain" ; let's say g_s

g_s is sent with some scalar quantizer (how do you choose q_s ?) (in Daala a non-linear quantizer is used)

For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)

Quantize the vector to a PVQ lattice point; send the lattice index

So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part. The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other?

Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context).

The slides (reference #4) clear things up a bit.
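The "quantize the vector to a PVQ lattice point" step can be sketched with a naive greedy search (my toy code, not Daala's actual search, and it ignores the gain companding discussed below): scale onto the L1 = K pyramid, round down, hand out the leftover pulses at the largest rounding errors, and reconstruct as gain times the L2-normalized lattice point:

```python
import math

def pvq_quantize(x, K):
    # project |x| onto the pyramid sum(|y|) = K (assumes x is not all
    # zeros), then round down
    s = sum(abs(v) for v in x)
    t = [K * abs(v) / s for v in x]
    y = [int(v) for v in t]
    # greedily place the remaining pulses at the largest rounding errors
    while sum(y) < K:
        i = max(range(len(x)), key=lambda j: t[j] - y[j])
        y[i] += 1
    return [p if v >= 0 else -p for p, v in zip(y, x)]

def pvq_dequantize(y, gain):
    # reconstruction: gain times the unit (L2) vector of the lattice point
    norm = math.sqrt(sum(v * v for v in y))
    return [gain * v / norm for v in y]

x = [3.1, -0.2, 0.4, -1.8]
y = pvq_quantize(x, K=5)
xhat = pvq_dequantize(y, gain=math.sqrt(sum(v * v for v in x)))
```

This makes the gain/shape split concrete: K and the lattice point only carry shape, and all of the magnitude comes back in at dequantize time through the separately coded gain.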

ryg: On 12/2/2014 8:21 PM, Charles Bloom {RAD} wrote:
> Ah yeah, you are correct of course. I didn't see how you had the
> probabilities in the coding.
>
> There are a lot of old papers I can't get about how to do the PVQ
> enumeration in an efficient way. I'm a bit curious about what they do.

Well, the one I linked to has a couple variants already. But it's pretty much beside the point. You can of course turn this into a giant combinatorical circle-jerk, but I don't see the use. For example (that's one of the things in the paper I linked to) if you're actually assigning indexes to values then yeah, the difference between assigning codewords in order { 0, -1, 1, -2, 2, ... } and { -k, -k+1, ..., -1, 0, 1, 2, ..., k } matters, but once you decompose it into several syntax elements most of that incidental complexity just disappears completely.

> But as I'm starting to understand it all a bit now, that just seems like
> the least difficult part of the problem.

Yeah, agreed.

> Basically the idea is something like -
>
> Divide the block into subbands. Let's say the standard wavelet tree for
> concreteness -
>
> 01447777
> 23447777
> 55667777
> 55667777
> 8..

Yup.

> Send the sum in each subband ; this is the "gain" ; let's say g_s

No, the gain isn't the sum (1-norm), it's the Euclidean (2-norm) length. If you used the 1-norm you wouldn't deform the integer lattice, meaning you're still just a scalar quantizer, just one with a funky backend. E.g. in 2D, just sending k = q(|x| + |y|) (with q being a uniform scalar quantizer without dead zone for simplicity) and then coding where the pulses go is just using the same rectangular lattice as you would have if you were sending q(x), q(y) directly. (Once you add a dead zone that's not true any more; scalar favors a "+" shape around the origin whereas the 1-norm PVQ doesn't. But let's ignore that for now.)

With an ideal vector quantizer you make the "buckets" (=Voronoi regions) approximately equally likely. For general arbitrary 2D points that means the usual hex lattice.
The PVQ equivalent of that is the NPVQ pattern: https://people.xiph.org/~jm/daala/pvq_demo/quantizer4.png

That's clearly suboptimal (not a real hex lattice at all), but it has the nice gain/shape-separation: the circles are all equal-gain. You unwrap each circle by normalizing the point in the 1-norm, and then sending the corresponding AC pulses.

> g_s is sent with some scalar quantizer (how do you choose q_s ?) (in
> Daala a non-linear quantizer is used)

q_s would come from the rate control, as usual. g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.

CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though!

> For each subband, scale the vector to an L1 length K_s (how do you
> choose K_s?)

You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited. You look at circles with a radius in the right neighborhood (most obviously, just floor(g) and ceil(g), though you might want to widen the search if you're doing RDO). You find the closest lattice points on both circles (this is convex, so no risk of getting stuck in a local min). Choose whichever of the two circles is better.
(All points *on* the same circle have the same cost, at least with vanilla PVQ. So the only RD trade-off you do is picking which circle.)

K_s is the index of the circle you're on. The origin is K_s=0, the first real circle is K_s=1 (and has 2N points where N is your dimensionality), and so forth.

> Quantize the vector to a PVQ lattice point; send the lattice index

Finding that is the convex search.

> So PVQ (P = Pyramid) solves this problem of how to enumerate the
> distribution given the sum. But that's sort of the trivial part.

Well, that's the combinatorical part. The actual vector quantizer is the idea of warping the 1-norm diamonds into sensibly-spaced 2-norm circles. The regular structure enables the simplified search.

> The how do you send the subband gains, what is K, etc. is the hard part.
> Do the subband gains mask each other?

Not sure if they're doing any additional masking beyond that. If they do, they're not talking about it.

> Then there's the whole issue of PVQ where P = Predictive. This
> Householder reflection business. Am I correct in understanding that
> Daala doesn't subtract off the motion prediction and make a residual?
> The PVQ (P = predictive) scheme is used instead? That's quite amazing.
> And it seems that Daala sends the original gain, not the gain of the
> residual (and uses the gain of the prediction as context).

As far as I can tell, yeah. And yes, definitely gain of the overall block, not of the residual!

Again, you have the separation into gain and shape here. The gains are coded separately, and hence out of the equation. What remains is unit vectors for both your target block and your prediction. That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere. 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain.
They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). This tells you how good the prediction is. If it's 0.9, you've just removed 90% of the energy to code. You need to quantize this appropriately - you want to make sure the quantizer resolution here is reasonably matched to quantizer resolution of points on your sphere, or you're wasting bits. Now you turn whatever is left of g into K (as above). You can iterate this as necessary. If you do bi-prediction, you can do another Householder reflection to align the energy of pred2 that was orthogonal to pred1 (the rest is gone already!) with the 2nd AC. You code another correlation coefficient and then deal with the residuals. Fade-in / fade-out just kind of fall out when you do prediction like this. It's not a special thing. The "ACs" don't change, just the gains. Handling cross-fades with one predictor is still shitty, but if you're doing bipred they kinda fall out as well. It all sounds pretty cool. But I have no clue at all how well it works in practice or where it stands cost/benefit-wise. -Fabian
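The "find the closest lattice point on a circle" step is the usual greedy K-pulse search. A minimal sketch of that shape quantization, not Daala's actual code (the function name and the purely greedy strategy are mine; real implementations pre-allocate most pulses by rounding first):

```c
#include <math.h>
#include <stdlib.h>

/* Quantize x[0..n-1] to an integer vector y[0..n-1] with sum |y_i| == k,
   greedily maximizing correlation with x. Each step adds the pulse that
   maximizes dot(x,y)^2 / dot(y,y), i.e. cos^2 of the angle to x. */
static void pvq_quantize(const float *x, int *y, int n, int k)
{
    int i, p;
    float xy = 0.0f, yy = 0.0f;   /* running dot(x,y) and dot(y,y) */
    for (i = 0; i < n; i++) y[i] = 0;
    for (p = 0; p < k; p++) {
        int best = 0;
        float best_score = -1.0f;
        for (i = 0; i < n; i++) {
            /* adding a pulse at slot i (sign matching x[i]) changes
               dot(x,y) by |x_i| and dot(y,y) by 2|y_i| + 1 */
            float nxy = xy + fabsf(x[i]);
            float nyy = yy + 2.0f * (float)abs(y[i]) + 1.0f;
            float score = nxy * nxy / nyy;
            if (score > best_score) { best_score = score; best = i; }
        }
        xy += fabsf(x[best]);
        yy += 2.0f * (float)abs(y[best]) + 1.0f;
        y[best] += (x[best] < 0.0f) ? -1 : 1;
    }
}
```

Because the codebook points all have L1 norm K, only the pulse distribution is searched here; the gain (which circle you are on) is handled separately, exactly as described above.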

ryg:

> That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere.

Important note for this and all that follows: For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This combined with the reflections etc. makes this pretty pricey, all things considered.

If you weren't splitting by subbands, I believe you could finesse your way around this: (normalized) DCT and Householder reflections are both unitary, so they preserve both the L2 norm and dot products. Which means you could calculate both the overall gain and the correlation coeffs for your prediction *before* you do the DCT (and hence in the decoder, add that stuff back in post-IDCT, without having to DCT your reference). But with the subband splitting, that no longer works, at least not directly. You could still do it with a custom filter bank that just passes through precisely the DCT coeffs we're interested in for each subband, but eh, somehow I have my doubts that this is gonna be much more efficient than just eating the DCT. It would certainly add yet another complicated mess to the pile. -Fabian
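The reflection being discussed is the standard Householder construction: H = I - 2vv^T/(v^Tv) with v = r - e0 maps a unit prediction vector r onto the first coefficient axis, and is its own inverse. A small self-contained sketch (not Daala's code):

```c
#include <math.h>

/* Apply, in place to x, the Householder reflection that maps the unit
   vector r onto +e0 (the "1st AC" slot in the discussion above).
   H = I - 2 v v^T / (v^T v), with v = r - e0. Unitary: preserves the
   L2 norm and all dot products, so gains and correlations survive. */
static void householder_apply(float *x, const float *r, int n)
{
    float v[64];                        /* assumes n <= 64 */
    float vv = 0.0f, vx = 0.0f;
    int i;
    for (i = 0; i < n; i++) v[i] = r[i];
    v[0] -= 1.0f;                       /* v = r - e0 */
    for (i = 0; i < n; i++) { vv += v[i] * v[i]; vx += v[i] * x[i]; }
    if (vv < 1e-12f) return;            /* r is already e0 */
    for (i = 0; i < n; i++) x[i] -= 2.0f * (vx / vv) * v[i];
}
```

Applying this to the prediction itself yields e0 exactly; applying it to the target block leaves a unit vector whose first component is the correlation coefficient ryg describes.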

cbloom: At 10:23 PM 12/2/2014, Fabian Giesen wrote:

> For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This combined with the reflections etc. makes this pretty pricey, all things considered.

Yeah, I asked Valin about this. They form an entire predicted *image* rather than block-by-block because of lapping. They transform the predicted image the same way as the current frame. Each subband gain is sent as a delta from the predicted image subband gain. Crazy! His words :

> You form the prediction in transformed space. (perhaps by having a motion vector, taking the pixels it points to and transforming them, dealing with lapping, yuck!)

We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do.

cbloom: At 09:31 PM 12/2/2014, you wrote:

> q_s would come from the rate control, as usual.

Yeah, I just mean the details of that is actually one of the most important issues. eg. how does Q vary for the different subbands. Is there inter-subband masking, etc. In Daala the Q is non-linear (variance adaptive quantizer).

> g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.

In Daala they send g and derive K.

> CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though!

Err, yeah maybe. What they actually did was take the x264 VAQ and try to reproduce it.

> > For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)
>
> You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited.

No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g) but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to gain scalar quantization.

> 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period).

I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets. It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much. It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

ryg:

> Yeah, I asked Valin about this. They form an entire predicted *image* rather than block-by-block because of lapping.

That doesn't have anything to do with the lapping, I think - that's because they don't use regular block-based mocomp. At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation). There's no nice way to do this block-per-block in the first place, not with OBMC in the mix anyway; if you chop it up into tiles you end up doing a lot of work twice.

ryg: On 12/03/2014 10:26 AM, Charles Bloom {RAD} wrote:

> > g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.
>
> In Daala they send g and derive K.

Ah, my bad.

> > > For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)
> >
> > You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited.
>
> No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g) but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to gain scalar quantization.

Ah OK, that makes sense.

> > 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period).
>
> I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets.

Didn't check how they send it. I do find thinking of this in terms of cross-correlation between block and pred a lot simpler than phrasing it in terms of angles.

> It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much.

The reflection itself doesn't do anything meaningful. Your normalized points were on a unit sphere before, and still are after. You're just spinning it around. It does mean that your coefficient numbering really loses all meaning. After one such reflection, you're already scrambled. Overall energy is still the same (because it's unitary) but the direction is completely different. Since PVQ already assumes that the directions are equiprobable (well, more or less, since the PVQ doesn't actually uniformly cover the sphere), they don't care.

> It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

Well, it's just a different goal for the predictors. Regular motion search tries to minimize SAD or similar, as do the H.264 spatial predictors. For this kind of scheme you don't care about differences at all, instead you want to maximize the normalized correlation coeff between image and reference. (You want texture matches, not pixel matches.) -Fabian

cbloom: The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1 the N slots that could have that one value is very large, so sending the index of that one slot is a lot of bits. At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that.

ryg: On 12/3/2014 1:04 PM, Charles Bloom {RAD} wrote:

> The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1 the N slots that could have that one value is very large, so sending the index of that one slot is a lot of bits.

Yeah, the decision to send a subband *at all* means you have to code gain, theta and your AC index. For N=16 that's gonna be hard to get below 8 bits even for trivial signals. At which point you get a big jump in the RD curve, which is bad. Terriberry has a few slides that explain how they're doing inter-band activity masking currently: https://people.xiph.org/~tterribe/daala/pvq201404.pdf The example image is kind of terrible though. The "rose" dress (you'll see what I mean) is definitely better in the AM variant, but the rest is hard to tell for me unless I zoom in, which is cheating.

> At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that.

This is only really interesting at low K, where the PVQ codebook is relatively small. So, er, let's just throw this one in: suppose you're actually sending codebook indices. You just have a rate allocation function that tells you how many bits to send, independent of how big the codebook actually is. If you truly believe that preserving narrowband energy is more important than getting the direction right, then getting a random vector with the right energy envelope is better than nothing. Say K=1, Ndim=16. You have N=32 codewords, so a codebook index stored directly is 5 bits. Rate function says "you get 0 bits". So you don't send an index at all, and the decoder just takes codeword 0. Or rate function says "you get 2 bits" so you send two bits of the codebook index, and take the rest as zero. This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id).

For any given block (subband, whatever), this doesn't help you at all. You either guess right or you guess wrong. But statistically, suppose you shaved 2 bits off the codebook IDs for 1000 blocks. Then you'd expect about 250 of these blocks to reconstruct the right ACs. For the rest, you reconstructed garbage ACs, but it's garbage with the right energy levels at least! :) No clue if this is actually a good idea at all. It definitely allows you to remove a lot of potholes from the RD curve. -Fabian
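ryg's permuted-truncated-index idea can be sketched concretely. Everything below is one hypothetical choice of p_x (XOR with a position hash, which is an involution on a power-of-2 codebook); the hash function is mine, not from any real codec:

```c
#include <stdint.h>

/* Per-block hash; both encoder and decoder can compute this from the
   block position. Hypothetical mixing function. */
static uint32_t block_hash(uint32_t pos)
{
    pos *= 0x9E3779B9u;            /* Fibonacci-hash multiply */
    return pos ^ (pos >> 16);
}

/* Encoder: permute the codebook id, then keep only 'bits' low bits
   (rate control decides 'bits', independent of codebook size). */
static uint32_t encode_id(uint32_t id, uint32_t pos, int bits, int codebook_bits)
{
    uint32_t mask = (1u << codebook_bits) - 1u;
    uint32_t permuted = (id ^ block_hash(pos)) & mask;
    return permuted & ((1u << bits) - 1u);
}

/* Decoder: zero-fill the missing high bits, then un-permute. With
   bits == codebook_bits this round-trips exactly; with fewer bits you
   get *some* valid codeword (right energy, random-ish direction). */
static uint32_t decode_id(uint32_t sent, uint32_t pos, int bits, int codebook_bits)
{
    uint32_t mask = (1u << codebook_bits) - 1u;
    (void)bits;
    return (sent ^ block_hash(pos)) & mask;
}
```

The XOR permutation is the cheapest thing that removes the systematic "always decode codeword 0" bias; as ryg notes, a smarter ordering of the codebook would make partial indices degrade more gracefully.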

ryg: On 12/3/2014 1:48 PM, Fabian Giesen wrote:

> [..] This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id).

Now this is all assuming you either get the right code or you get garbage, and living with whichever one it is. You can also go in the other direction and try to get the direction at least mostly right. You can try to determine an ordering of the code book so that distortion more or less smoothly goes down as you add extra bits. (First bit tells you which hemisphere, that kind of thing.) That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close". (Really with K=1 and high dim it's always gonna be garbage, though, because you just don't have any other vector in the code book that's even close; this is more interesting at K=2 or up.) This makes the per-block randomization (you want that to avoid systematic bias) harder, though. One approach that would work is to do a Householder reflection with a random vector (again hashed from position or similar).

All that said, I don't believe in this at all. It's "solving" a problem by "reducing" it to a more difficult unsolved problem (in this case, "I want a VQ codebook that's close to optimal for embedded coding"). Of course, even if you do a bad job here, it's still not gonna be worse than the direct "random permutation" stuff. But I doubt it's gonna be appreciably better either, and it's definitely more complex. -Fabian

cbloom: At 02:17 PM 12/3/2014, Fabian Giesen wrote:

> That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close".

Yes, this is the type of scheme I imagine. Sort of like a wavelet significant-bit thing. As you send fewer bits the location gets coarser. The codebook for K=1 is pretty obvious. You're just sending a location; you want the top bits to grossly classify the AC and the bottom bits to distinguish neighbors (H neighbors for H-type AC's, and V neighbors for V-type AC's). For K=2 and up it's more complex. You could just train them and store them (major over-train risk) up to maybe K=3, but then you have to switch to an algorithmic method.

Really the only missing piece for me is how you get the # of bits used to specify the locations. It takes too many bits to actually send it, so it has to be implicit from some other factors, like the block Q and K, and I'm not sure how to get that.

12-08-14 | BPG

BPG is a nice packaging of HEVC (H265) I-frame compression for still images. The author provides Windows command line tools with reasonable options (yay!), so I'm quite happy to test it.

It's pretty dang slow. Really slow. It's all covered by many patents. So I'm not sure how realistic it is as a useable format. Nonetheless, it's very useful as something to compare against.

I ran with default options (YCbCr in 420). I compared against JPEG_pdec, which, as I previously noted, is very comparable to DLI. (JPEG_pdec = JPEG + packjpg + my jpegdec (deblocker, etc)).

Conclusion :

BPG is really good. The best I've seen. It kills JPEG_pdec in RMSE; in fact I think it's the best RMSE performance I've seen despite being at a disadvantage (YCbCr and 420). Under the perceptual metrics (MS-SSIM-Y and "Combo") it doesn't win so strongly. That tells me there is probably room for better perceptual tuning of bit allocation and quantizers. But it's definitely strong.

Quick visual evaluation by me :

Porsche640 : BPG wins pretty hard here. Perhaps the most noticeable thing is much better detail preservation in the texture regions (the gravel and bushes). It also does a better job on the edges of the car, it doesn't smear them into nasty DCT block artifacts. You may download Porsche640 comparison images here (1 MB, RAR)

Moses : actually not a very big win here. It does much better at preserving the smooth gradient background (my current JPEGdec doesn't have any special modes for big smooth areas). Visually the main thing you'll notice is that the smooth gradients are nasty chunky steps with JPEG and are nice and smooth with BPG. Other than that, I actually think JPEG is better on Moses himself. Both make a big perceptual rate allocation mistake and put too many bits on the jacket texture and not enough on the human skin texture. But JPEG preserves more of the face texture; when you A/B compare it's clear that BPG is way over-smoothing his face. Particularly on the forehead and the neck fat. But all over really. Both BPG and JPEG make a classic mistake on Moses : they kill too much of the red and blue detail in the tie because it's in chroma.

The raw reports :

porsche640 :

imdiff RMSE_RGB
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|18.55,13.46,9.59,8.46,7.87,6.79,5.45,4.72,3.14,2.56|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|22.22,19.98,16.74,14.86,13.60,12.73,11.49,10.62,9.82,9.35,8.81,8.30,7.63,6.87,5.91,4.57,3.07| fit imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.64,5.17,5.69,5.87,5.96,6.16,6.43,6.60,7.02,7.20|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|4.33,4.52,4.82,5.01,5.16,5.26,5.42,5.54,5.65,5.72,5.81,5.89,6.01,6.14,6.33,6.63,7.04|

imdiff MS_SSIM_IW_Y
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|83.03,89.78,93.93,95.07,95.57,96.43,97.39,97.88,98.87,99.25|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|79.88,82.47,86.76,89.12,90.67,91.72,93.14,94.07,94.84,95.30,95.74,96.18,96.72,97.27,97.95,98.70,99.40| fit imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.23,5.06,5.73,5.96,6.07,6.29,6.57,6.73,7.18,7.41|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|3.90,4.17,4.66,4.97,5.19,5.35,5.59,5.76,5.91,6.01,6.11,6.22,6.36,6.53,6.76,7.08,7.53|

imdiff Combo
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.83,4.05,3.41,3.21,3.10,2.90,2.63,2.46,2.03,1.80|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|5.30,4.97,4.44,4.11,3.88,3.72,3.48,3.31,3.16,3.06,2.96,2.86,2.72,2.56,2.35,2.05,1.74| fit imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.19,4.99,5.65,5.86,5.98,6.18,6.45,6.62,7.05,7.28|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|3.69,4.04,4.59,4.93,5.17,5.34,5.58,5.75,5.91,6.01,6.11,6.22,6.36,6.52,6.74,7.03,7.34|


imdiff RMSE_RGB
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|16.50,12.13,8.70,7.79,7.27,6.38,5.31,4.79,3.76,3.43|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|17.05,15.36,14.18,13.28,12.07,11.20,10.38,9.91,9.36,8.80,8.09,6.38| fit imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.84,5.34,5.83,5.98,6.07,6.24,6.46,6.58,6.84,6.93|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|4.79,4.96,5.09,5.19,5.34,5.46,5.57,5.64,5.72,5.81,5.93,6.24|

imdiff MS_SSIM_IW_Y
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|87.81,92.21,95.12,95.98,96.34,96.99,97.77,98.16,98.98,99.30|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|89.84,91.44,92.46,93.22,94.23,94.94,95.53,95.87,96.23,96.60,97.04,98.08| fit imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.80,5.43,5.97,6.17,6.26,6.44,6.69,6.84,7.24,7.45|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|5.07,5.31,5.47,5.60,5.79,5.93,6.06,6.14,6.23,6.33,6.45,6.81|

imdiff Combo
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.51,3.84,3.27,3.08,2.98,2.80,2.55,2.40,2.02,1.81|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|4.25,3.96,3.76,3.60,3.39,3.23,3.09,3.00,2.91,2.80,2.67,2.32| fit imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.52,5.21,5.80,5.99,6.09,6.28,6.53,6.68,7.07,7.27|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|4.79,5.08,5.29,5.46,5.67,5.84,5.98,6.07,6.17,6.28,6.41,6.76|


moses :

imdiff RMSE_RGB
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|13.96,9.96,7.15,6.37,5.99,5.30,4.38,3.86,2.65,2.18|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|13.20,11.59,10.38,9.65,8.60,7.86,7.18,6.83,6.44,6.06,5.58,4.40| fit imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|5.11,5.63,6.09,6.24,6.32,6.46,6.68,6.81,7.18,7.34|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|5.20,5.41,5.57,5.68,5.84,5.97,6.09,6.15,6.23,6.30,6.40,6.68|

imdiff MS_SSIM_IW_Y
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|84.91,90.55,94.24,95.26,95.72,96.50,97.41,97.88,98.88,99.25|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|87.54,89.76,91.23,92.22,93.54,94.44,95.20,95.62,96.02,96.43,96.93,98.06| fit imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|4.45,5.17,5.79,6.00,6.11,6.30,6.57,6.73,7.18,7.42|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|4.76,5.06,5.27,5.43,5.66,5.83,5.99,6.08,6.18,6.29,6.42,6.80|

imdiff Combo
Built Aug 30 2014 11:30:27

Live Chart PlayGround

raw imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|4.43,3.78,3.22,3.04,2.95,2.79,2.56,2.44,2.03,1.81|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|4.19,3.89,3.65,3.51,3.29,3.13,2.98,2.90,2.81,2.72,2.59,2.26| fit imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|4.60,5.28,5.84,6.03,6.12,6.29,6.52,6.65,7.05,7.28|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|4.85,5.16,5.41,5.56,5.78,5.94,6.09,6.18,6.27,6.36,6.49,6.83|


12-03-14 | Lossy AC Location Quantization

Okay, I want to ramble a bit about the idea of lossy AC location quantization.

This is the idea that as you are quantizing the AC's, throwing out some information, one of the things you should be throwing out is the *location* of AC signals, not just their energy levels (as you do with a scalar quantizer).

Now, Daala's PVQ scheme (P = Predictive or Pyramid depending on context, confusingly) does this a bit. They send the L2 norm ("gain") of each subband (their "subbands" being chunks of similar AC coefficients), and they send a Pyramid VQ unit vector from a codebook describing the distribution of that gain within the subband. When they do the VQ index selection, that can wind up decreasing some AC values and increasing others, because you rotated a bit in the N-dimensional space. Unlike normal AC scalar quantization, this is energy preserving because it's a rotation - when one value goes down, others go up.

But Daala is not really super lossy on the location part of the AC's in the way my imaginary coder is. In particular the step from sending no AC's to sending one (at K=1) is a very large bit rate step, and there's no way to send one AC in a somewhat-specified location. The step from K=1 to K=2 is also large.
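For concreteness, the size of the PVQ codebook (and hence the bit cost of a flat index) follows a standard recurrence, and V(N,1) = 2N is exactly why the step from K=0 to K=1 is so big for large subbands. A sketch:

```c
#include <stdint.h>

/* Number of PVQ codewords: integer vectors of dimension n with
   sum |y_i| == k. Standard recurrence: split on whether the first
   coefficient is zero, nonzero-and-last-pulse, or nonzero-with-more.
   V(n,1) == 2n (the "2N points on the first circle" above). */
static uint64_t pvq_codebook_size(int n, int k)
{
    if (k == 0) return 1;
    if (n == 0) return 0;
    return pvq_codebook_size(n - 1, k)
         + pvq_codebook_size(n, k - 1)
         + pvq_codebook_size(n - 1, k - 1);
}
```

So a 16-dimensional subband at K=1 has 32 codewords (5 bits flat), and the cost grows quickly with K; a real coder would use the closed-form or an iterative table rather than this naive recursion.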

To be clear, this idea is all about low bit rates. At high bit rates, K is high, this all doesn't matter very much.

When I was doing video, I was looking at low bit rates. Well, typical internet video bit rates. It's around 4k bytes per frame. For 720p that's around 2.4 bits per 8x8 block. For 480p it's 7.3 bits per 8x8 block. The bit rates for video are low!
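The arithmetic behind those figures, as a sketch; the exact numbers depend on frame dimensions and how you count chroma, so this only lands in the same ballpark as the quoted values:

```c
/* Rough bits available per 8x8 block at a given frame byte budget.
   4096 bytes at 1280x720 gives about 2.3 bits per block; at 640x480
   about 6.8 bits per block. */
static double bits_per_block(double frame_bytes, int width, int height)
{
    double blocks = (double)(width / 8) * (double)(height / 8);
    return frame_bytes * 8.0 / blocks;
}
```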

At those low bit rates, when you get to send any AC's at all, it's usually just AC10 and/or AC01. If you have blocks with energy out in the higher AC's, it usually just costs too much to send them at all in an R/D coder. They cost too many bits, because you have to send a long Z-scan run-length, and they don't help D enough. (of course I'm talking about mocomp residuals here)

But perceptually this is bad. As noted previously, it makes it too big of a step in R to get to the next D option. There needs to be something in between.

That is, you start by sending just { AC10 , end-of-block } ; then your next step is { AC10 , AC45 , eob } - that's like a 10 bit step or something. You need something in between that's a few bits and provides some D benefit, like { AC10 , AC-somewhere }. That is, the location of that AC needs to be lossy.

Now, none of the old coders do this because under RMSE it doesn't make any sense. Taking some AC signal energy and putting it on a different basis function is just a big WTF in terms of classic transform theory.

But perceptually it is right.

Rather than pulling something out of my ass, this time I'll post the actual DCT categories that I found to be optimal in my perceptual metric studies -

const int c_dctCategory[64] = 
{
   0, 1, 1, 3, 3, 3, 3, 6,
   2, 5, 5, 5, 6, 6, 6, 6,
   2, 5, 5, 6, 6, 6, 6, 6,
   4, 5, 7, 5, 6, 6, 6, 6,
   4, 7, 7, 7, 8, 8, 8, 8,
   4, 7, 7, 7, 8, 8, 8, 8,
   4, 7, 7, 7, 8, 8, 8, 8,
   7, 7, 7, 7, 8, 8, 8, 8
};

(there's also inter-band masking; categories 1+2 mask 3-5, and then categories 1-5 mask 6-8).

If you can't get the AC's all right, the next best thing is to get the sums within the categories right.
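Computing those per-category sums from the table above is trivial. A sketch (the category table is passed in; whether the L1 sum used here is the right "energy" to preserve is exactly the open question raised below):

```c
#include <math.h>

/* Sum of |AC| within each of the 9 categories. 'category' is the
   64-entry c_dctCategory-style table from the listing above. */
static void category_sums(const float *coeffs, const int *category,
                          float *sums /* [9] */)
{
    int i;
    for (i = 0; i < 9; i++) sums[i] = 0.0f;
    for (i = 0; i < 64; i++)
        sums[category[i]] += fabsf(coeffs[i]);
}
```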

The categories sort of grossly tell you what kind of detail is in the block perceptually; is it high frequency, is it vertical detail or horizontal detail. It's what the eye picks up most of all in low bit rate video. Hey, there's a big chunky edge here where it should have been smooth, or hey, this textured water is smooth in one block. The most important thing is for there to be *some* texture in the water, not for it to be the right texture.

Point being, there is some benefit to filling in the gaps in the R-D space. Between sending no AC's at all, and sending "AC64 is 1" you can send "something in group 8 is 1". (in Pyramid VQ speak, there are big gaps to fill between K=0 and K=1 and K=2)

(note that under a normal D metric, this won't be good, so RDO won't work. You either need a D metric that counts this right, or fudge it by interpolating)

(also note that I'm hand-waving in a nasty way here; the perceptual property that "subband sums make sense" is true when applied to the original block - not the residual after subtracting out mocomp, while the property that "almost all AC's are zero" is a property of the residual! This is a difficulty that leads to the Daala PVQ (P = Predictive) Householder business. We'll ignore it for now.)

So, how might you actually send AC's with lossy positions?

There are a few ways. I'll sketch one or two.

First of all, let's work one subband at a time. We'll assume for now that we want the "lossy positioning" to only move values around within a subband, where they look rather similar. (perhaps at very low bit rates being even more lossy would be good). So, we want to send the sum of energy on each subband, and then send the way that energy is distributed within the subband.

It's an open question to me what the best "energy" on the subband is to preserve; L2, L1 or something else? Some power of those norms? In any case, it's scalar quantized. Then we have to distribute the energy, knowing the sum.

At higher bit rates, you quantize the distribution somehow and send it. One method is the Pyramid VQ (PVQ) that Daala uses. Perhaps I'll describe this more later, or you can follow the links and read old reference material. For small subbands and low quantized sums, you can also just store a codebook.

When the quantized sum is 1 ("K" in PVQ speak), then you just need to encode the position of where that 1 pulse is; eg. in a subband of N slots you need to code the spot that gets the energy, which takes log2(N) bits for a flat code. (of course a flat code is wrong because the slots aren't equally likely; the lower frequencies are slightly more likely, even within a subband that has been selected to be somewhat uniform).

So, what about sending less than log2(N) bits?

Well, at K=1 you can think of the bits as being binary splitting planes, cutting the subband in half with each bit you send. If you send all log2(N) bits then you specify a value precisely.


subband 3 is { a b c d }

K = 0 ; no bits sent , position = { abcd }
1 bit : { ab | cd }
K = 1 : 2 bits : { a | b | c | d }

We now have a continuous way of sending more bits and getting more precision.

On larger subbands, the splitting planes need not be in the middle (binary encoding); they could instead be placed where they divide the probability 50/50, so that the index is variable-length and doesn't require entropy coding.

When you've specified a lossy range of possible energy locations, eg. at 1 bit we specified "either a or b", you have a lot of options for how to restore that in the decoder. It could be just the most probable spot (a), or a pseudo-random choice of either. If the encoder and decoder use the same pseudo-random choice, that allows the encoder to make an RD decision based on whether the restoration will match reality or not.
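A concrete sketch of this truncate-and-refill scheme for K=1 positions in a power-of-2-sized subband. The shared pseudo-random fill (the hash) is a hypothetical choice of mine, and a flat binary split is used rather than the probability-balanced split suggested above:

```c
#include <stdint.h>

/* Shared pseudo-random fill for the unsent low bits; both sides
   compute it from the block position. Hypothetical hash. */
static uint32_t hash_fill(uint32_t block_pos, int missing_bits)
{
    uint32_t h = block_pos * 0x9E3779B9u;
    return (h >> 16) & ((1u << missing_bits) - 1u);
}

/* Encoder: keep only the top 'sent_bits' of a position index in a
   subband of 2^total_bits slots. Each dropped bit doubles the
   ambiguity range ({a|b|c|d} -> {ab|cd} -> {abcd}). */
static uint32_t encode_pos(uint32_t pos, int total_bits, int sent_bits)
{
    return pos >> (total_bits - sent_bits);
}

/* Decoder: re-expand and fill the missing low bits pseudo-randomly,
   so the restored pulse lands somewhere in the signaled range. */
static uint32_t decode_pos(uint32_t sent, uint32_t block_pos,
                           int total_bits, int sent_bits)
{
    int missing = total_bits - sent_bits;
    return (sent << missing) | hash_fill(block_pos, missing);
}
```

With all bits sent this degenerates to exact position coding; with fewer bits the encoder can still evaluate decode_pos itself and make the RD choice of whether the pseudo-random restoration happens to match reality.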

There's a tricky issue above K=1 when you want to add lossy detail for the next value distribution. It needs to be done such that D actually goes down, which is not trivial. You want D to be as continuous as possible as you add more lossy detail. The "embedded" style of truncating away bits of location is also not so obvious to me when you have two values to distribute.

More on this whole topic later.

12-03-14 | PVQ Inspired Rambling - RD Ideals

Lossy coders with RD optimization issues.

You make some lossy coder which has various modes and decisions and quantization options. These affect both R&D on a micro scale. You want to run an automated RDO that can in theory try everything.

In practice this creates problems. The RDO winds up doing funny things. It is technically finding the best D for the desired R, but that winds up looking bad.
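The decision machinery being criticized here is the usual Lagrangian formulation: among the candidate modes, pick the one minimizing J = D + lambda*R. A minimal sketch of that micro-decision:

```c
/* One coding option: its rate in bits and its (simplified) distortion. */
typedef struct { double rate_bits; double distortion; } rd_option;

/* Lagrangian mode decision: minimize J = D + lambda * R. This is
   "technically finding the best D for the desired R" per block; the
   problems described in the text come from D being too simple and
   too local, not from this selection step itself. */
static int rdo_pick(const rd_option *opts, int count, double lambda)
{
    int i, best = 0;
    double best_j = opts[0].distortion + lambda * opts[0].rate_bits;
    for (i = 1; i < count; i++) {
        double j = opts[i].distortion + lambda * opts[i].rate_bits;
        if (j < best_j) { best_j = j; best = i; }
    }
    return best;
}
```

Sweeping lambda traces out the R-D curve: small lambda favors low distortion at high rate, large lambda the reverse.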

One issue is that to make RDO efficient we use a simplified D which is not the whole perceptual error.

Often we're measuring D locally (eg. on one block at a time) but there are non-local effects. Either far apart in the frame, or over time.

For example, RDO will often take away too many bits from part of the frame to put them on another, which creates an ugly variation of quality. RDO can create alternating smooth and detailed blocks when they should be uniform. In video, temporal variation can be even worse, with blocks flickering as RDO makes big changes. Even though it was doing the right thing in terms of D on each individual block, it winds up looking terrible.

Some principles that prevent this from happening :

1. There should be a lot of fine steps in R (and D). There should be coding opportunities where you can add a bit or half a bit and get some more quality. Big steps are bad, they cause blocks (either with spatial or temporal shifts) to jump a lot in R & D.

A common problem case is in video residuals, the jumps from "zero movec, zero residual" to "some movec, zero residual" to "some movec, some residual" can easily be very discontinuous if you're not careful (big steps in R) which leads to nasty chunkiness under RDO.

2. Big steps are also bad for local optimization. Assuming you aren't doing a full-search RDO, you want to make a nicely searchable coding space that has smooth and small steps as you vary coding choices.

3. The R(D) (or D(R)) curve should be as smooth as possible, and of course monotonic. Globally, the RDO that you do should result in that. But it should also be true locally (eg. per block) as much as possible.

4. Similar D's should have similar R's. (and vice versa, but it's harder to intuit the other way around). If there are two quite different ways of making a similar distortion number (but different looking) - those should have a similar R. If not, then your coder is visually biased.

eg. if horizontal noise is cheaper to code than vertical noise (at the same D), the RD optimization will kill one but not the other, and the image will visually change. It will appear smoothed in one direction.

Of course this has to be balanced against entropy - if the types of distortion are not equally probable, they must have different bit rates. But the difference should not be any more than necessary, which is a common flaw. Often rare distortion gets a codeword that is too long, and people don't care much because it's rare; the result is that it just gets killed by RDO.

Part of the problem here is that most coders (such as DCT and z-scan) are based on an implicit model of the image which is built on a global average ideal image. By the time you are coding a given block, you often have enough information to know that this particular image does not match the ideal, but we don't compensate correctly.

You can think about it this way - you start at 0 bits. You gradually increase the rate and give more bits out. Under RDO, you give each bit out to the place where it buys the biggest decrease in D. At each step as you do this, there should be lots of places where you can get that good delta-D. It should be available all over the frame with just slightly different D's. If there are big steps in your coding scheme, it will only be in a few places.
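That greedy picture can be sketched in code. This is a toy illustration only - the D(R) curves and the AllocateGreedy name are made up, not from any real coder : each bit in the budget goes to whichever block currently offers the biggest drop in D for one more bit.

```cpp
#include <cstddef>
#include <vector>

// D[b][r] = distortion of block b after spending r bits on it
std::vector<size_t> AllocateGreedy(const std::vector<std::vector<double>> & D,
                                   int budget)
{
    std::vector<size_t> r(D.size(), 0); // bits spent on each block so far
    while ( budget-- > 0 )
    {
        int best = -1;
        double bestGain = 0;
        for ( size_t b = 0; b < D.size(); b++ )
        {
            if ( r[b] + 1 >= D[b].size() ) continue; // curve exhausted
            double gain = D[b][r[b]] - D[b][r[b]+1]; // delta-D of one more bit
            if ( gain > bestGain ) { bestGain = gain; best = (int)b; }
        }
        if ( best < 0 ) break; // nowhere left to spend
        r[best]++;
    }
    return r;
}
```

With smooth convex curves the bits spread across all the blocks; if one block's curve had a single huge step, the greedy pass would dump bits there and starve its neighbors - which is exactly the chunkiness complained about above.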

12-03-14 | PVQ Inspired Rambling - Smooth Blocks

So there's a question of what exactly is a "smooth" block.

I'm not entirely sure. On the one hand you could just say that you only have "detail" blocks, and a detail block where all the AC energy is zero is a "smooth" block.

But there are some things about "smooth" that I think deserve a more global consideration in the coder as a special mode. I will try to ramble about them a bit :

1. Smooth blocks need more DC resolution. With no detail in the block the eye is much more sensitive to the color being wrong. In fact DC quantizer should go up with AC energy in general. In typical coders the DC's are all sent separately from the AC, before them, so that order should be changed. Either signal block type before sending the DC's, or send some information about the total AC energy.

1a. They particularly need more DC resolution when they have a similar DC to their neighbors. This can be done by sending the delta-DC with a non-uniform quantizer, which is a hack I wrote about previously.

2. Even with lapped transforms or standard deblocking filters, a zero-AC block is lumpy. They use filter shapes that make the block flat in the middle and sloped at the edges, which gives them a C1-smooth stair-step look.

3. Connected regions of "smooth" should be handled specially. They should be restored, within the range the quantizer allows, to the maximum-likelihood gradient.

Say you have a DC quantizer of 8, and you have a row of smooth blocks with DC's of [ 0 0 0 8 8 8 ] . Current methods will give you a run of all 0 pixels, then a smooth ramp up to 8, and then a run of all 8's. Obviously what should really be there is a long smooth gradient.
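A minimal sketch of that restoration, under the simplest possible reading of "maximum-likelihood gradient" - fit a least-squares line through the dequantized DCs, then clamp each restored value back into its own quantization bucket so the result is still consistent with what was transmitted. (RestoreGradient and its interface are mine, just for illustration.)

```cpp
#include <algorithm>
#include <vector>

// dc[] holds the dequantized DCs; each true value lies in dc[i] +- Q/2
std::vector<double> RestoreGradient(const std::vector<double> & dc, double Q)
{
    int n = (int)dc.size(); // assume n >= 2
    // least-squares line y = a + b*i through the dequantized DCs :
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for ( int i = 0; i < n; i++ )
    {
        sx += i; sy += dc[i]; sxx += (double)i*i; sxy += i*dc[i];
    }
    double b = (n*sxy - sx*sy) / (n*sxx - sx*sx);
    double a = (sy - b*sx) / n;
    std::vector<double> out(n);
    for ( int i = 0; i < n; i++ )
    {
        // clamp the line back into each DC's quantization bucket :
        out[i] = std::max( dc[i] - Q*0.5, std::min( dc[i] + Q*0.5, a + b*i ) );
    }
    return out;
}
```

On the [ 0 0 0 8 8 8 ] row with Q = 8 this gives a monotonic ramp from about -1.1 up to about 9.1, instead of two flat runs joined by a step.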

4. A "smooth" next to an "edge" block can imply that the smooth gradient region goes up to the edge found within that edge block and no further.

5. If you get to such low bit rates that you take a detail block and send zero AC's just because you can't afford them, you don't want that block to be considered "smooth" and put into the big-smooth-gradient pathway. Though really this needs to be something your RDO fights hard against (don't let detail blocks go to zero AC's even if it seems like the right RD decision because it actually isn't under a better D metric)

12-01-14 | AC Quantization

I just read the Xiph PVQ demo page ; pretty interesting, go have a read. (BTW it kills me that Daala is using lapped transforms, oy!) (also - thank god some people still write real blogs and RSS them so I can find them!)

It reminded me of some things I've been talking about for a long time, but I couldn't find a link to myself on my blog. Anyway, internally at RAD I've been saying that we should be sending AC coefficients with more precision given to the *sum* of the values, and less given to the *distribution* of the values.

ADD : some searching finds these which are relevant -

cbloom rants 08-25-09 - Oodle Image Compression Looking Back
cbloom rants 08-27-09 - Oodle Image Compression Looking Back Pictures
cbloom rants 03-10-10 - Distortion Measure
cbloom rants 10-30-10 - Detail Preservation in Images

I've never actually pursued this, so I don't have the answers. But it is something that's been nibbling at my brain for a long time, and I still think it's a good idea, so that tells me there's something there.

I'll scribble a few notes -

1. I've long believed that blocks should be categorized into "smooth", "detail" and "edge". For most of this discussion we're going to ignore smooth and edge and just talk about detail. That is, blocks that are not entirely smooth, and don't have a dominant edge through them (perhaps because that edge was predicted).

2. The most important thing in detail blocks is preserving the amount of energy in the various frequency subbands. This is something that I've talked about before in terms of perceptual metrics. In a standard DCT you do make categories something like :

and what's most important is preserving the sum in each category. (that chart was pulled out of my ass but you get the idea).

The sums should be preserved in a kind of wavelet subband quadtree type of way. Like preserve the sum of each of the 4x4 blocks; then go only to the upper-left 4x4 and divide it into 2x2's and preserve those sums, and then go to only the upper-left 2x2, etc.

3. You can take a standard type of codec and optimize the encoding towards this type of perceptual metric, and that helps a bit, but it's the wrong way to go. Because you're still spending bits to exactly specify the noise in the high frequency area. (doing the RD optimization just let you choose the cheapest way to specify that noise).

What you really want is a perceptual quantizer that fundamentally gives up information in the right way as you reduce bitrate. At low bits you just want to say "hey there's some noise" and not spend bits specifying the details of it.

The normal scalar quantizers that we use are just not right. As you remove bits, they kill energy, which looks bad. It looks better for that energy to be there, but wrong.

3 1/2. The normal zig-zag coding schemes we use are really bad.

In order to specify any energy way out in the highest-frequency region (E in the chart above) you have to send a ton of zeros to get there. This makes it prohibitively costly in bits.

One of the first things that you notice when implementing an R/D optimized coder with TQ is that it starts killing all your high frequency detail. This is because under any kind of normal D norm, with zig-zag-like schemes, the R to send those values is just not worth it.

But perceptually that's all wrong. It makes you over-smooth images.

3 3/4. Imagine that the guy allocating bits is standing on the shore. The shore is the DC coefficient at 00. He's looking out to sea. Way out at the horizon are the super-high-frequency coefficients in the bottom right of the DCT. In the foreground (that's AC10 and AC01) he can see individual waves and where rocks are. Way out at sea he shouldn't be spending bandwidth trying to describe individual waves, but he should still be saying things like "there are a lot of big waves out there" or "there's a big swell but no breakers". Yeah.

4. What you really want is a joint quantizer of summed energy and the distribution of that energy. At max bit rate you send all the coefficients exactly. As you reduce bitrate, the sum is preserved pretty well, but the distribution of the lower-right (highest frequency) coefficients becomes lossy. As you reduce bit rate more, the total sum is still pretty good and the overall distribution of energy is mostly right, but you get more loss in where the energy is going in the lower frequency subbands, and you also get more scalar quantization of the lower frequency subbands, etc.

Like by the time that AC11 has a scalar quantizer of 8, the highest frequency lower-right area has its total energy sent with a quantizer of 32, and zero bits are sent for the location of that energy.

5. When the energy is unspecified, you'd like to restore in some nice way. That is, don't just restore to the same quantization vector every time ("center" of the quantization bucket), since that could create patterns. I dunno. Maybe restore with some randomness; restore based on prediction from the neighborhood; restore to maximum likelihood? (ML based on neighborhood/prediction/image not just a global ML)

6. An idea I've tossed around for a while is a quadtree/wavelet-like coding scheme. Take the 8x8 block of coefficients (and as always exclude DC in some way). Send the sum of the whole thing. Divide into four children. So now you have to send a (lossy) distribution of that sum onto the 4 child slots. Go to the upper left (LL band) and do it again, etc.
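Just as a sketch of the bookkeeping for that idea (the layout and names are mine; this is not a bitstream, and these splits are still lossless - the loss would come from quantizing them) : take the block total, split it onto 4 quadrants, then keep subdividing only the upper-left (LL) quadrant.

```cpp
#include <cstdlib>
#include <vector>

// sum of |coeff| over a size x size quad at (x,y) of an 8x8 block :
static int QuadSum(const int * c, int x, int y, int size)
{
    int s = 0;
    for ( int dy = 0; dy < size; dy++ )
        for ( int dx = 0; dx < size; dx++ )
            s += abs( c[ (y+dy)*8 + (x+dx) ] );
    return s;
}

// returns { total, 4 quadrant sums, 4 sums splitting the LL 4x4,
//           4 sums splitting the LL 2x2 } :
std::vector<int> QuadtreeSums(int coeff[64])
{
    coeff[0] = 0; // "exclude DC in some way" : just drop it in this sketch
    std::vector<int> out;
    out.push_back( QuadSum(coeff,0,0,8) ); // total for the whole block
    for ( int size = 8; size >= 2; size /= 2 )
    {
        int h = size/2;
        out.push_back( QuadSum(coeff,0,0,h) ); // LL, subdivided next level
        out.push_back( QuadSum(coeff,h,0,h) );
        out.push_back( QuadSum(coeff,0,h,h) );
        out.push_back( QuadSum(coeff,h,h,h) );
    }
    return out;
}
```

Each group of 4 sums adds up to its parent, so a decoder only strictly needs 3 of the 4 at each level; that redundancy is where quantized or "don't care" splits would introduce the loss.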

7. The more energy you have, the less important its exact distribution, due to masking. As you have more energy to distribute, the number of vectors you need goes up a lot, but the loss you can tolerate also goes up. In terms of the bits to send a block, it should still increase as a function of the energy level of that block, but it should increase less quickly than naive log2(distributions) would indicate.

8. Not all AC's are equally likely or equally perceptually important. Specifically the vector codebook should contain more entries that preserve values in the upper-left (low frequency) area.

9. The interaction with prediction is ugly. (eg. I don't know how to do it right). The nature of AC values after mocomp or intra-prediction is not the same as the nature of AC's after just transform (as in JPEG). Specifically, ideas like variance masking and energy preservation apply to the transformed AC values, *not* to the deltas that you typically see in video coding.

10. You want to send the information about the AC in a useful order. That is, the things you send first should be very strong classifiers of the entropy of that block for coding purposes, and of the masking properties for quantization purposes.

For example, packJPG first sends the (quantized) location of the last non-zero coefficient in Z-scan order. This turns out to be the best classifier of blocks for normal JPEG, so that is used as the primary bit of context for coding further information about the block.

You don't want sending the "category" or "masking" information to be separate side-band data. It should just be the first part of sending the coefficients. So your category is maybe something like the bit-vector of which coefficient groups have any non-zero coefficients. Something like that which is not redundant with sending them, it's just the first gross bit of information about their distribution.

I found this in email; I think there are some more somewhere ...

(feel free to ignore this) Currently every video coder in the world does something like mocomp then DCT (or a similar transform (eg. one that is orthonormal, complete, ...)) on the residuals. The thing is, the DCT really doesn't make much sense on the residuals. Heck, even just removing the DC doesn't usually do much on the residuals, because the mocomp will generally match the DC, so the DC of the residuals is near zero. But the DCT still has value. The big advantage it has is the ability to send just a few coefficients and get a rough approximation of simple shapes like flat vs. gradient vs. noisy. For example if the residual is sort of speckly all over, to send that by coding values one by one you would have to send lots of values. With the DCT you can just send some signal in AC11 or AC22 and you get some rough approximation of splotches all over.

The DCT has another advantage; with quantization and R/D optimization, you want to be able to identify which coefficients are okay to crush. You want to be able to put the basis functions in order of perceptual importance so that you can send them in a sequence where truncating the tail is an okay way to send less data.

The subtle issue in all cases is that the coder choice implies a general type of error that you will create, because the coder creates an ordering, and under R/D optimization, you will prefer to send the parts of the signal that are cheaper. eg. in the normal DCT + zigzag type of coder, sending high frequency signals is much more expensive than low frequency. The result is that R/D optimization on DCT+zigzag coders tends to make things smoother. That is both good and bad. It means you can't really let the R/D optimization go nuts or you lose all your high frequency energy and your video appears to have no "detail". It would be nice to have coders that create different types of error under heavy R/D optimization. But maybe there's something better than the DCT?

1. Basis transform other than DCT. This is just a small change, not a huge shift in method, but some subsets are special :

1.a. Multiple bases. This makes the coder over-complete, which is bad in theory but can help in practice. For example you could have 3 different bases possible - DCT (cosine), Haar (steps), and Legendre (polynomials), and the coder could first send a selection of bases then the residuals.

1.b. KLT - optimal bases per frame. The encoder looks at the actual residuals and finds the optimum (in the sense of "most energy compacting") basis. You must then transmit the basis at the start of the frame. I believe this is not possible for 8x8 transforms, because sending the basis costs too much; you would have to send 64*64 coefficients, which is large compared to typical video frame sizes. It is possible, however, for 4x4 bases, and that could be cascaded with a 2x2 Haar to make an 8x8 transform. There are some problems with this, though - because the KLT bases are computed, the perceptual character of them is unknown; eg. how to order them by energy is known, but that order might not be their perceptual order. Also there is no "fast" transform because the bases are unknown; you must use straight basis addition to do the reverse transform. (there are weird cases where the KLT basis is clearly good; for example scenes that don't match the "natural/smooth" assumption of DCT, such as movie titles or credits, or scenes full of the same pattern (eg. videos of grass); but just like with color transforms, the disadvantage of KLT is that you no longer know exactly what your bases are, and you can't pre-decide their perceptual importance (eg. the JPEG quantizer matrix))

1.c. Overcomplete basis. Rather than multiple bases or per-frame bases, use a static pre-chosen overcomplete set. Maybe some cosines, some linear ramps, some exact hard edges at various locations and angles.

2. Quadtree with "don't care" distribution. I haven't quite figured this out, but the thinking goes like this : Perhaps the most important (perceptual) single piece of information about the residuals is the total amount of AC energy. This is just the L-something sum of the residuals. Less important is exactly how that energy is distributed. So maybe we could come up with a way of encoding the residuals such that we first send the sum of energy, and then we send information about the distribution, and when we increase the "quantizer" or we do R/D truncation, what we lose first is information about the distribution, not the total. That is, loss causes the location of energy to become imprecise, not the total.

I don't have a great idea for how to do this, but maybe something like this : First send the AC total for the whole block. Then consider blocks in a quadtree hierarchy. For each block, you already know the total of energy, so transmit how that energy is split among the 4. Then for any child whose total is not zero, recurse onto that child and repeat. At this point it's lossless, so it's not very interesting; the question is how to introduce some loss. Two ideas :

2.a. Add some amount of "don't care" energy to each quad chunk. That is, you know the total for each quadtree block as you descend. Rather than sending the 4 portions which must add up to the total, send 4 portions which are LE the total; the remainder (missing energy) is "don't care" energy and is distributed in some semi-random way.

2.b. Quantize the distribution. Send the total first. Then the distribution is sent with some larger quantizer, and the missing parts are fixed. eg. you send a total of 13. The distribution gets an extra quantizer of 3. So the first block you would send as a distribution of 13/3 = 4. So you send how the 4 is spread. That only accounts for 4*3 = 12, so 1 was not sent; you put it somewhere random. If the distribution quantizer is 1, it's lossless; if the distribution quantizer is reasonably large, then you send no info about the distribution, just the total energy. This method is pretty poor at sending linear ramps or edges, though.

3. S/D/E separate coders. I've been talking about this idea for a while without much substance. The idea is that the perceptually important thing really changes dramatically based on the local character. For smooth areas, you really must keep it smooth, but the exact level or type of ramp is not that important. For edges, you want to keep a sharp edge, and not introduce ringing or blocking, but any other detail near the edge gets severely visually masked, so you can fuck that up. For detail areas, the dominating thing is the amount of energy and the rough frequency of it; the exact speckles don't matter.

Anyway, it occurred to me that rather than trying to do an SDE coder from the ground up, what you can do is just provide various coders, like maybe 1c and 2b and also normal DCT coding. You don't specifically ever decide "this block is smooth" and use the smooth coder; instead you just try all the coders and take the one that comes out best. One of the difficulties with this is you don't know the perceptual impact of your coding decisions. With a normal DCT zigzag type coder, you can just use the magnitude of the values as a rough estimate of the perceptual importance, and add some fudge factors for the DC being more important than AC and so on. With weird bases and weird coders you don't have a good way to make R/D decisions other than by actually running a perceptual D calculation for each R choice, which is several orders of magnitude slower.

(there were some good responses to that mail as well...)

(If someone wants to give me a professorship and a few grad students so that I can solve these problems, just let me know anytime...)

11-12-14 | Intel TSX notes

Had a little look at Intel TSX. Some sloppy notes.

TSX are new instructions for (limited) hardware transactions.

I'm gonna talk mainly about the newer RTM (XBEGIN/XEND) because it's conceptually clearer than XACQUIRE/XRELEASE.

In general RTM allows a chunk of speculative code to be run, which can be fully rolled back, and which is not visible to any other core until it is committed. This allows you to do a transaction involving several memory ops, and only commit if there was no contention, so the entire transaction appears atomically to other cores.

The TSX documentation talks about it a lot in terms of "hardware lock elision" (HLE), but it's more general than that. (and HLE, the algorithmic technique, is not to be confused with the annoyingly named "HLE instructions" (XACQUIRE/XRELEASE))

TSX does not guarantee forward progress, so there must always be a fallback non-TSX pathway. (complex transactions might always abort, even without any contention, because they overflow the speculation buffer. Even transactions that could run in theory might livelock forever if you don't have the right pauses to allow forward progress, so the fallback path is needed then too).

TSX works by keeping a speculative set of registers and processor state. It tracks all reads done in the speculation block, and enqueues all writes to be delayed until the transaction ends. The memory tracking of the transaction is currently done using the L1 cache and the standard cache line protocols. This means contention is only detected at cache line granularity, so you have the standard "false sharing" issue.

If your transaction reads a cache line, then any write to that cache line by another core causes the transaction to abort. (reads by other cores do not cause an abort).

If your transaction writes a cache line, then any read or write by another core causes the transaction to abort.

If your transaction aborts, then any cache lines written are evicted from L1. If any of the cache lines involved in the transaction are evicted during the transaction (eg. if you touch too much memory, or another core locks that line), the transaction is aborted.

TSX seems to allow quite a large working set (up to size of L1 ?). Obviously the more memory you touch the more likely to abort due to contention.

Obviously you will get aborts from anything "funny" that's not just plain code and memory access. Context switches, IO, kernel calls, etc. will abort transactions.

At the moment, TSX is quite slow, even if there's no contention and you don't do anything in the block. There's a lot of overhead. Using TSX naively may slow down even threaded code. Getting significant performance gains from it is non-trivial.

RTM Memory Ordering :

A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction. The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never made visible to any other logical processor.

One of the best resources is the new Intel Optimization Manual, which has a whole chapter on TSX.

RTM is very nice as a replacement for traditional lockfree algorithms based on atomics when those algorithms are very complex. Something simple like just an atomic increment, you obviously shouldn't use RTM, just do the atomic. Even something like a lockfree LIFO Stack will be better with traditional atomics. But something complex like a lockfree MPMC FIFO Queue will be appropriate for RTM. (an example MPMC FIFO requires two CAS ops, and an atomic load & store, even without contention; so you can replace all those atomic ops with one RTM section which either commits or aborts)

RTM handles nesting in the simplest way - nested transactions are absorbed into their parent. That is, no transaction commits until the topmost parent commits. Aborts in nested transactions will abort the parent.

BEWARE : the transactional code might not need the old fashioned lock-free atomics, but you do still have to be careful about what the optimizer does. Use volatiles, or perhaps relaxed-order atomic stores to make sure that the variable reads/writes you think are happening actually happen where you expect!!

I think an interesting point is that elided locks don't act like normal locks.

Consider a simple object protected by a lock :

struct Thingy
{
    int m_lock; // 0 if unlocked
    int m_a;
    int m_b;
};

The lock provides mutual exclusion for work done on m_a and m_b.

Let's pretend we protected it using a simple spin lock, so a function that used it would be like :

void DoStuff( Thingy * t )
{
    while( AtomicExchange(t->m_lock,1) != 0 )
    {
        // someone else has the lock; spin :
        _mm_pause();
    }

    // I own the lock
    // do stuff to m_a and m_b in here :

    AtomicStore(t->m_lock,0); // release the lock
}


Now we replace it with the RTM transactional version :

void DoStuff( Thingy * t )
{
    if ( _xbegin() == _XBEGIN_STARTED )
    {
        // transactional path

        // add "m_lock" to the transaction read-set :
        if ( t->m_lock != 0 )
            _xabort(0); // lock is held, abort the transaction

        // do stuff to m_a and m_b in here :

        _xend(); // commit the transaction
    }
    else
    {
        // transaction aborts go here
        // normal non-transactional path :

        while( AtomicExchange(t->m_lock,1) != 0 )
        {
            // someone else has the lock; spin :
            _mm_pause();
        }

        // I own the lock
        // do stuff to m_a and m_b in here :

        AtomicStore(t->m_lock,0); // release the lock
    }
}


So, how does this work? Let's have a look :

In the transactional path, m_lock is not written to. The lock is not actually held. We do make sure to *read* m_lock so that if another thread takes the lock, it aborts our transaction. Our transaction will only complete if no other thread writes the memory we access.

In fact, the transactional path does not provide mutual exclusion. Multiple threads can read from the same object without conflicting. As long as the "do stuff" work only reads, the transactions will all proceed. In this sense, RTM has converted a normal lock into a reader-writer lock!

The transactional path is fine grained! The way we wrote the code is coarse-grained. That is, m_lock protects the entire object, not the individual fields. So if thread 1 tries to modify m_a , and thread 2 modifies m_b, they must wait for each other, even though they are not actually racing. The transactional path will let them both run at the same time, provided there's cache line padding to prevent false sharing.

To be clear :

Transactional :

T1 :
read m_lock
write m_a

T2 :
read m_lock
write m_b

can run simultaneously with no conflict

traditional/fallback lock :

T1 :
take m_lock
write m_a

T2 :
take m_lock
write m_b

must synchronize against each other.

NOTE : as written the transactional path will actually fail all the time, because all the variables are on the same cache line. They need cache-line-size padding between the fields. Perhaps most importantly, the mutex/lock variable that's checked should be on its own cache line that is not written to on the transactional path.
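For concreteness, the padding that note calls for might look something like this (64-byte cache lines assumed; ThingyPadded is a made-up name, and C++11 alignas is just one way to get the layout) :

```cpp
#include <cstddef>

struct alignas(64) ThingyPadded
{
    alignas(64) int m_lock; // only *read* on the transactional path
    alignas(64) int m_a;    // T1 can write this line...
    alignas(64) int m_b;    // ...while T2 writes this one, with no conflict
};

static_assert( sizeof(ThingyPadded) == 192, "one cache line per field" );
```

Each field lands on its own cache line (offsets 0, 64, 128), so transactions touching m_a and m_b don't falsely conflict, and the lock line is never written by the transactional path.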

Obviously this doesn't seem too interesting on this little toy object, but on large objects with many fields, it means that operations working on different parts of the object don't synchronize. You could have many threads reading from part of an object while another thread writes a different part of the object with no conflict.

It's pretty interesting to me that the elided lock behaves very differently than the original lock. It changes to an RW lock and becomes fine-grained.

Overall I think TSX is pretty cool and I hope it becomes widespread. On the other hand, there is not much real world benefit to most code at the moment.

Some links :

Scaling Existing Lock-based Applications with Lock Elision - ACM Queue
Lock elision in glibc 01.org
TSX anti patterns in lock elision code Intel® Developer Zone
Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator Intel® Developer Zone
Web Resources about Intel® Transactional Synchronization Extensions Intel® Developer Zone
Fun with Intel® Transactional Synchronization Extensions Intel® Developer Zone

11-11-14 | x64 movdqa atomic test

How to do an atomic load/store of a 128 bit value on x64 is an ugly question.

The only guaranteed way is via cmpxchg16b. But that's an awfully slow way to do a load or store.

movdqa appears to be an atomic way to move 128 bits - on most chips. Not all. And Intel/AMD don't want to clearly identify the cases where it is atomic or not. (they specifically don't guarantee it)

At the moment, shipping code needs to use cmpx16 to be safe. (my tests indicate that the x64 chips in the modern game consoles *do* have atomic movdqa, so it seems safe to use there)

My main curiosity is whether there exist any modern ("Core 2" or newer) x64 chips that *do not* provide an atomic movdqa.

Anyhoo, here's a test to check if movdqa is atomic on your machine. If you like, run it and send me the results : (Windows, x64 only)


The atomic test will just run for a while. If it has a failure it will break.

(you will get some errors about not having a v: or r: drive; you can just ignore them.)

Copy the output, or you should be able to get the log file here :


email me at cb at my domain.

For the record, an atomic 128 bit load/store for x64 Windows using cmpx16 :

#include <intrin.h>

void AtomicLoad128(__int64 * out_Result, __int64 volatile * in_LoadFrom)
{
    // do a swap of the value in out_Result to itself :
    //  if it matches, the same value is stored back and nothing changes
    //  if it doesn't match, the comparand (out_Result) is reloaded with the
    //  current value - either way out_Result ends up with an atomic load
    _InterlockedCompareExchange128(in_LoadFrom,out_Result[1],out_Result[0],out_Result);
}

void AtomicStore128(__int64 volatile * out_StoreTo,const __int64 * in_Value)
{
    // do an initial non-atomic load of StoreTo :
    __int64 check_StoreTo[2];
    check_StoreTo[0] = out_StoreTo[0];
    check_StoreTo[1] = out_StoreTo[1];
    // store with cmpx16 :
    while( ! _InterlockedCompareExchange128(out_StoreTo,in_Value[1],in_Value[0],check_StoreTo) )
    {
        // check_StoreTo was reloaded with the value in out_StoreTo ; retry
    }
}

09-24-14 | Smart Phone Advice

Errmmm... I think I might finally get a smart phone.

I don't really want to waste any time researching about this because I fucking hate them and want as little as possible to do with this entire industry.

Definitely not anything Apple. I absolutely don't want to deal with any headaches of doing funny OS flashes or anything nonstandard that will complicate my life.

I'm thinking Google Nexus 5 because I understand it's the most minimal pure android ; I hate dealing with bloatware. I've already spent more time on this than I would like to. I actually prefer something smaller and lighter since I will only use it in emergencies. Galaxy S5 Mini? Not actually significantly smaller. Jesus christ.

It looks like "Straight Talk" is probably the right plan option for someone like me who will rarely use it. (?). I can't use anything T-Mobile because their coverage sucks around Seattle. Too bad because they seem to have the best pay-go plans.

One thing I have no idea about is how much bandwidth I need and whether paying per byte is okay or if I'll be fucked. If I accidentally browse to some web page that spews data at me, will that cost me a fortune? I don't like the idea of having to worry about that.

09-10-14 | Suffix Trie EOF handling

I realized something.

In a Suffix Trie (or suffix sort, or anything similar), handling the end-of-buffer is a mess.

The typical way it's described in the literature is to treat the end-of-buffer (henceforth EOB) as if it is a special character with value out of range (such as -1 or 256). That way you can just compare strings that go up to EOB with other strings, and the EOB will always mismatch a normal character and cause those strings to sort in a predictable way.

eg. on "banana" when you insert the final "na-EOB" you compare against "nana" and wind up comparing EOB vs. 'n' to find the sort order for that suffix.
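A tiny illustration of that convention - a brute-force suffix sort (not a trie, just to show the comparison rule) where any read past the end returns an out-of-range sentinel (-1 here), so every suffix-vs-suffix compare resolves :

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<int> BuildSuffixArray(const std::string & s)
{
    int n = (int)s.size();
    // reads past the end return EOB = -1, smaller than any real byte :
    auto ch = [&](int i) { return i < n ? (int)(unsigned char)s[i] : -1; };
    std::vector<int> sa(n);
    for ( int i = 0; i < n; i++ ) sa[i] = i;
    std::sort( sa.begin(), sa.end(), [&](int a, int b)
    {
        while ( a < n && ch(a) == ch(b) ) { a++; b++; }
        return ch(a) < ch(b); // EOB mismatches everything and sorts first
    } );
    return sa;
}
```

On "banana" this yields the order a, ana, anana, banana, na, nana - the "na-EOB" vs "nana" compare above is decided by exactly that EOB-vs-'n' mismatch.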

The problem is that this is just a mess in the code. All over my suffix trie code, everywhere that I do a string lookup, I had to do extra special case checking for "is it EOB" and handle that case.

In addition, when I find a mismatch of "na-EOB" and "nan", the normal path of the code would be to change the prefix "na" into a branch and add children for the two different paths - EOB and "n". But I can't actually do that in my code because the child is selected by an unsigned char (uint8), and EOB is out of bounds for that variable type. So in all the node branch construction paths I have to special case "is it a mismatch just because of EOB, then don't create a branch". Blah blah.

Anyway, I realized that can all go away.

The key point is this :

Once you add a suffix that hits EOB (eg. the first mismatch against the existing suffixes is at EOB), then all future suffixes will also hit EOB, and that will be the deepest match for all of them.

Furthermore, all future suffix nodes can be found immediately using "follows"

eg. in the "banana" case, construction of the trie goes like :

"ban" is in the suffix trie
  (eg. "ban..","an..","n.." are all in the trie)

add "ana" :

we find the existing "an.."

the strings we compare are "anana-EOB" vs "ana-EOB"

so the mismatch hits EOB

That means all future suffixes will also hit EOB, and their placement in the tree can be found
just by using "follows" from the current string.

"ana-EOB" inserts at "anana"
"na-EOB" inserts at "nana"
"a-EOB" inserts at "ana"

That is, at the end of every trie construction when you start hitting EOB you just jump out to this special case of very simple follows addition.

So all the special EOB handling can be pulled out of the normal Trie code and set off to the side, which is lovely for the code clarity.

In practice it's quite common for files to end with a run of "000000000" which now gets swallowed up neatly by this special case.

ADD : if you don't care about adding every occurrence of each suffix, then it gets even simpler -

If you hit EOB when adding a string - just don't add it. A full match of that suffix already exists, and your suffix that goes to EOB can't match any lookup better than what's in there.

(note that when I say "hit EOB" I mean you are at a branch node, and your current character doesn't match any of the branches because it is EOB. You will still add leaves that go to EOB, but you will never actually walk down those leaves to reach EOB.)

08-31-14 | DLI Image Compression

I got pinged about DLI so I had a look.

DLI is a closed source image compressor. There's no public information about it. It may be the best lossy image compressor in the world at the moment.

(ASIDE : H265 looks promising but has not yet been developed as a still image compressor; it will need a lot of perceptual work; also as in my previous test of x264 you have to be careful to avoid the terrible YUV and subsamplers that are in the common tools)

I have no idea what the algorithms are in DLI. Based on looking at some of the decompressed images, I can see block transform artifacts, so it has something like an 8x8 DCT in it. I also see certain clues that make me think it uses something like intra prediction. (those clues are good detail and edge preservation, and a tendency to preserve detail even if it's the wrong detail; the same thing that you see in H264 stills)

Anyway, I thought I'd run my own comparo on DLI to see if it really is as good as the author claims.

I tested against JPEG + packJPG + my JPEG decoder. I'm using an unfinished version of my JPEG decoder which uses the "Nosratinia modified" reconstruction method. It could be a lot better. Note that this is still a super braindead simple JPEG. No per-block quantizer. No intra prediction. Only 8x8 transforms. Standard 420 YCbCr. No trellis quantization or rate-distortion. Just a modern entropy coding back end and a modern deblocking decoder.

I test with my perceptual image tester imdiff. The best metric is Combo, which is a linear combo of SCIELAB_MyDelta + MS_SSIM_IW_Y + MyDctDelta_YUV.

You can see some previous tests on mysoup or moses or PDI

NOTE : "dlir.exe" is the super-slow optimizing variant. "dli.exe" is reasonably fast. I tested both. I ran dlir with -ov (optimize for visual quality) since my tests are mostly perceptual. I don't notice a huge difference between them.

My impressions :

DLI and jpeg+packjpg+jpegdec are both very good. Both are miles ahead of what is commonly used these days (old JPEG for example).

DLI preserves detail and contrast much better. JPEG tends to smooth and blur things at lower bit rates. Part of this may be something like a SATD heuristic metric + better bit allocation.

DLI does "mangle" the image. That is, it gets the detail *wrong* sometimes, which is something that JPEG really never does. The primary shapes are preserved by jpeg+packjpg+jpegdec, they just lose detail. With DLI, you sometimes get weird lumps appearing that weren't there before. If you just look at the decompressed image it can be hard to spot, because it looks like there's good detail there, but if you A-B test the decompressed image against the original, you'll see that DLI is actually changing the detail. I saw this before when analyzing x264.

DLI is similar looking to x264-still but better.

DLI seems to have a special mode for gradients. It preserves smooth gradients very well. JPEG-unblock creates a stepped look because it's a series of ramps that are flat in the middle.

DLI seems to make edges a bit chunky. Smooth curves get steppy. jpeg+packjpg+jpegdec is very good at preserving a smooth curved edge.

DLI is the only image coder I've seen that I would say is definitely slightly better than jpeg+packjpg+jpegdec. Though it is worse in some ways, I think the overall impression of the decoded image is definitely better. Much better contrast preservation, much better detail energy level preservation.

Despite jpeg often scoring better than DLI on the visual quality metrics I have, DLI usually looks much better to my eyes. This is a failure of the visual quality metrics.

Okay. Time for some charts.

In all cases I will show the "TID Fit" score. This is a 0-10 quality rating with higher better. This removes the issue of SSIM, RMSE, etc. all being on different scales.

NOTE : I am showing RMSE just for information. It tells you something about how the coders are working and why they look different, where the error is coming from. In both cases (DLI and JPEG) the runs are optimized for *visual* quality, not for RMSE, so this is not a comparison of how well they can do on an RMSE contest. (dlir should be run with -or and jpeg should be run with flat quantization matrices at least).

(see previous tests on mysoup or moses or PDI )

mysoup :

moses :

porsche640 :

pdi1200 :

Qualitative Comparison :

I looked at JPEG and DLI encodings at the same bit rate for each image. Generally I try to look around 1 bpp (that's logbpp of 0) which is the "sweet spot" for lossy image compression comparison.

Here are the original, a JPEG, and a DLI of Porsche640.
Download : RAR of Porsche640 comparison images (1 MB)

What I see :

DLI has very obvious DCT ringing artifacts. Look at the lower-right edge of the hood, for example. The sharp line of the hood has ringing ghosts in 8x8 chunks.

DLI preserves contrast overall much better. The most obvious places are in the background - the leaves, the pebbles. JPEG just blurs those and drops a lot of high frequency detail, DLI keeps it much better. DLI preserves a lot more high frequency data.

DLI adds a lot of noise. JPEG basically never adds noise. For example compare the centers of the wheels. The JPEG just looks like a slightly smoothed version of the original. The DLI has got lots of chunkiness and extra variation that isn't in the original.

In a few places DLI really mangles the image. One is the A-pillar of the car, another is the shadow on the hood, also the rear wheel.

Both DLI and JPEG do the same awful thing to the chroma. All the orange in the gravel is completely lost. The entire color of the laurel bush in the background is changed. Both just produce a desaturated image.

Based on the scores and what I see perceptually, my guess is this : DLI uses an 8x8 DCT. It uses a quantization matrix that is much flatter than JPEG's.

08-27-14 | LZ Match Length Redundancy

A quick note on something that does not work.

I've written before about the redundancy in LZ77 codes. ( for example ). In particular the issue I had a look at was :

Any time you code a match, you know that it must be longer than any possible match at lower offsets.

eg. you won't send a match of length 3 to offset 30514 if you could have sent offset 1073 instead. You always choose the lowest possible offset that gives you a given match length.

The easy way to exploit this is to send match lengths as the delta from the next longest match length at lower offset. You only need to send the excess, and you know the excess is greater than zero. So if you have an ML of 3 at offset 1073, and you find a match of length 4 at offset 30514, then you send {30514,+1}

To implement this in the encoder is straightforward. If you walk your matches in order from lowest offset to highest offset, then you know the current best match length as you go. You only consider a match if it exceeds the previous best, and you record the delta in lengths that you will send.

The same principle applies to the "last offsets" ; you don't send LO2 if you could send LO0 at the same length, so the higher index LO matches must be of greater length. And the same thing applies to ROLZ.

I tried this in all 3 cases (normal LZ matches, LO matches, ROLZ). No win. Not even a tiny one; the gain was essentially zero.

Part of the problem is that match lengths are just not where the bits are; they're small already. But I assume that part of what's happening is that match lengths have patterns that the delta-ing ruins. For example binary files will have patterns of 4 or 8 long matches, or in an LZMA-like you'll have certain patterns show up like at certain pos&3 intervals after a literal you get a 3-long match, etc.

I tried some obvious ideas like using the next-lowest-length as part of the context for coding the delta-length. In theory you should be able to recapture patterns like: a next-lowest length of 3 predicts a delta of 1 in places where an ML of 4 is likely. But I couldn't find a win there.

I believe this is a dead end. Even if you could find a small win, it's too slow in the decoder to be worth it.

07-15-14 | I'm back

Well, the blog took a break, and now it's back. I'm going to try moderated comments for a while and see how that goes.

I also renamed the VR post to break the links from reddit and twitter, but it's still there.

07-14-14 | Suffix-Trie Coded LZ

Idea : Suffix-Trie Coded LZ :

You are doing LZ77-style coding (eg. matches in the prior stream or literals), but send the matches in a different way.

You have a Suffix Trie built on the prior stream. To find the longest match for a normal LZ77 you would take the current string to code and look it up by walking it down the Trie. When you reach the point of deepest match, you see what string in the prior stream made that node in the Trie, and send the position of that string as an offset.

Essentially what the offset does is encode a position in the tree.

But there are many redundancies in the normal LZ77 scheme. For example if you only encode a match of length 3, then the offsets that point to "abcd.." and "abce.." are equivalent, and shouldn't be distinguished by the encoding. The fact that they both take up space in the numerical offset is a waste of bits. You only want to distinguish offsets that actually point at something different for the current match length.

The idea in a nutshell is that instead of sending an offset, you send the descent into the trie to find that string.

At each node, first send a single bit for does the next byte in the string match any of the children. (This is equivalent to a PPM escape). If not, then you're done matching. If you like, this is like sending the match length with unary : 1 bits as long as you're in a node that has a matching child, then a 0 bit when you run out of matches. (alternatively you could send the entire match length up front with a different scheme).

When one of the children matches, you must encode which one. This is just an encoding of the next character, selected from the previously seen characters in this context. If all offsets are equally likely (they aren't) then the correct thing is just Probability(child) = Trie_Leaf_Count(child) , because the number of leaves under a node is the number of times we've seen this substring in the past.

(More generally the probability of offsets is not uniform, so you should scale the probability of each child using some modeling of the offsets. Accumulate P(child) += P(offset) for each offset under a child. Ugly. This is unfortunately very important on binary data where the 4-8-struct offset patterns are very strong.)

Ignoring that aside - the big coding gain is that we are no longer uselessly distinguishing offsets that only differ at higher match length, AND instead of just wasting those bits, we instead use them to make those offsets code smaller.

For example : say we've matched "ab" so far. The previous stream contains "abcd","abce","abcf", and "abq". Pretend that somehow those are the only strings. Normal LZ77 needs 2 bits to select from them - but if our match len is only 3 that's a big waste. This way we would say the next char in the match can either be "c" or "q" and the probabilities are 3/4 and 1/4 respectively. So if the length-3 match is a "c" we send that selection in only log2(4/3) bits = 0.415

And the astute reader will already be thinking - this is just PPM! In fact it is exactly a kind of PPM, in which you start out at low order (min match length, typically 3 or so) and your order gets deeper as you match. When you escape you jump back to order 3 coding, and if that escapes it jumps back to order 0 (literal).

There are several major problems :

1. Decoding is slow because you have to maintain the Suffix Trie for both encode and decode. You lose the simple LZ77 decoder.

2. Modern LZ's benefit a lot from modeling the numerical value of the offset in binary files. That's ugly & hard to do in this framework. This method is a lot simpler on text-like data that doesn't have numerical offset patterns.

3. It's not Pareto. If you're doing all this work you may as well just do PPM.

In any case it's theoretically interesting as an ideal of how you would encode LZ offsets if you could.

(and yes I know there have been many similar ideas in the past; LZFG of course, and Langdon's great LZ-PPM equivalence proof)

07-03-14 | Oodle 1.41 Comparison Charts

I did some work for Oodle 1.41 on speeding up compressors. Mainly the Fast/VeryFast encoders got faster. I also took a pass at trying to make sure the various options were "Pareto", that is the best possible space/speed tradeoff. I had some options that were off the curve, like much slower than they needed to be, or just worse with no benefit, so it was just a mistake to use them (LZNib Normal was particularly bad).

Oodle 1.40 got the new LZA compressor. LZA is a very high compression arithmetic-coded LZ. The goal of LZA is as much compression as possible while retaining somewhat reasonable (or at least tolerable) decode speeds. My belief is that LZA should be used for internet distribution, but not for runtime loading.

The charts :

compression ratio : (raw/comp ratio; higher is better)
compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 2.362 2.508 2.541 2.645 2.698
LZHLW 2.161 2.299 2.33 2.352 2.432
LZH 1.901 1.979 2.039 2.121 2.134
LZNIB 1.727 1.884 1.853 2.079 2.079
LZBLW 1.636 1.761 1.833 1.873 1.873
LZB16 1.481 1.571 1.654 1.674 1.674

lzmamax  : 2.665 to 1
lzmafast : 2.314 to 1
zlib9 : 1.883 to 1 
zlib5 : 1.871 to 1
lz4hc : 1.667 to 1
lz4fast : 1.464 to 1

encode speed : (mb/s)
compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 23.05 12.7 6.27 1.54 1.07
LZHLW 59.67 19.16 7.21 4.67 1.96
LZH 76.08 17.08 11.7 0.83 0.46
LZNIB 182.14 43.87 10.76 0.51 0.51
LZBLW 246.83 49.67 1.62 1.61 1.61
LZB16 511.36 107.11 36.98 4.02 4.02

lzmamax  : 5.55
lzmafast : 11.08
zlib9 : 4.86
zlib5 : 25.23
lz4hc : 32.32
lz4fast : 606.37

decode speed : (mb/s)
compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 34.93 37.15 37.76 37.48 37.81
LZHLW 363.94 385.85 384.83 391.28 388.4
LZH 357.62 392.35 397.72 387.28 383.38
LZNIB 923.66 987.11 903.21 1195.66 1194.75
LZBLW 2545.35 2495.37 2465.99 2514.48 2515.25
LZB16 2752.65 2598.69 2687.85 2768.34 2765.92

lzmamax  : 42.17
lzmafast : 40.22
zlib9 : 308.93
zlib5 : 302.53
lz4hc : 2363.75
lz4fast : 2288.58

While working on LZA I found some encoder speed wins that I ported back to LZHLW (mainly in Fast and VeryFast). A big one is to early out for last offsets; when I get a last offset match > N long, I just take it and don't even look for non-last-offset matches. This is done in the non-Optimal modes, and surprisingly hurts compression almost not at all while helping speed a lot.

Four of the compressors are now in pretty good shape (LZA,LZHLW,LZNIB, and LZB16). There are a few minor issues to fix someday (someday = never unless the need arises) :

LZA decoder should be a little faster (currently lags LZMA a tiny bit).
LZA Optimal1 would be better with a semi-greedy match finder like MMC (LZMA is much faster to encode than me at the same compression level; perhaps a different optimal parse scheme is needed too).
LZA Optimal2 should seed with multi-parse.
LZHLW Optimal could be faster.
LZNIB Normal needs much better match selection heuristics; the ones I have are really just not right.
LZNIB Optimal should be faster; needs a better way to do threshold-match-finding.
LZB16 Optimal should be faster; needs a better 64k-sliding-window match finder.

The LZH and LZBLW compressors are a bit neglected and you can see they still have some of the anomalies in the space/speed tradeoff curve, like the Normal encode speed for LZBLW is so bad that you may as well just use Optimal. Put aside until there's a reason to fix them.

If another game developer tells me that "zlib is a great compromise and you probably can't beat it by much" I'm going to murder them. For the record :

zlib -9 :
4.86 MB/sec to encode
308.93 MB/sec to decode
1.883 to 1 compression

LZHLW Optimal1 :
4.67 MB/sec to encode
391.28 MB/sec to decode
2.352 to 1 compression
Come on! zlib's encoder is slow, its decoder is slow, and it compresses poorly.

LZMA in very high compression settings is a good tradeoff. In its low compression fast modes, it's very poor. zlib has the same flaw - they just don't have good encoders for fast compression modes.

LZ4 I have no issues with; in its designed zone it offers excellent tradeoffs.

In most cases the encoder implementations are :

VeryFast =
cache table match finder
single hash
greedy parse

Fast = 
cache table match finder
hash with ways
second hash
lazy parse
very simple heuristic decisions

Normal =
varies a lot for the different compressors
generally something like a hash-link match finder
or a cache table with more ways
more lazy eval
more careful "is match better" heuristics

Optimal =
exact match finder (SuffixTrie or similar)
cost-based match decision, not heuristic
backward exact parse of LZB16
all others have "last offset" so require an approximate forward parse

I'm mostly ripping out my Hash->Link match finders and replacing them with N-way cache tables. While the cache table is slightly worse for compression, it's a big speed win, which makes it better on the space-speed tradeoff spectrum.

I don't have a good solution for windowed optimal parse match finding (such as LZB16-Optimal). I'm currently using overlapped suffix arrays, but that's not awesome. Sliding window SuffixTrie is an engineering nightmare but would probably be good for that. MMC is a pretty good compromise in practice, though it's not exact and does have degenerate case breakdowns.

LZB16's encode speed is very sensitive to the hash table size.

24,700,820 ->16,944,823 =  5.488 bpb =  1.458 to 1
encode           : 0.045 seconds, 161.75 b/kc, rate= 550.51 mb/s
decode           : 0.009 seconds, 849.04 b/kc, rate= 2889.66 mb/s

24,700,820 ->16,682,108 =  5.403 bpb =  1.481 to 1
encode           : 0.049 seconds, 148.08 b/kc, rate= 503.97 mb/s
decode           : 0.009 seconds, 827.85 b/kc, rate= 2817.56 mb/s

24,700,820 ->16,491,675 =  5.341 bpb =  1.498 to 1
encode           : 0.055 seconds, 133.07 b/kc, rate= 452.89 mb/s
decode           : 0.009 seconds, 812.73 b/kc, rate= 2766.10 mb/s

24,700,820 ->16,409,957 =  5.315 bpb =  1.505 to 1
encode           : 0.064 seconds, 113.23 b/kc, rate= 385.37 mb/s
decode           : 0.009 seconds, 802.46 b/kc, rate= 2731.13 mb/s

If you accidentally set it too big you get a huge drop-off in speed. (The charts above show -h13 ; -h12 is more comparable to lz4fast (which was built with HASH_LOG=12)).

I stole an idea from LZ4 that helped the encoder speed a lot. (lz4fast is very good!) Instead of doing the basic loop like :

  if ( match )
    output match
    output literal

instead do :

  while( ! match )
    output literal

  output match

This lets you make a tight loop just for outputting literals. It makes it clearer to you as a programmer what's happening in that loop and you can save work and simplify things. It winds up being a lot faster. (I've been doing the same thing in my decoders forever but hadn't done it in the encoder).

My LZB16 is very slightly more complex to encode than LZ4, because I do some things that let me have a faster decoder. For example my normal matches are all no-overlap, and I hide the overlap matches in the excess-match-length branch.

06-26-14 | VR Impressions

NOTE : changed post title to break the link.

Yesterday I finally went to Valve and saw "The Room". This is a rather rambly post about my thoughts after experiencing it.

For those who have been under a rock (like me), Valve has got this amazing VR demo. It uses unique prototype hardware that provides very good positional head tracking and very low latency graphics. It's in a calibrated room with registration spots all over the walls. It's way way better than any other VR, it's the real thing.

There is this magic thing that happens, it does tickle your brain intuitively. Part of you thinks that you're there. I had the same experiences that I've heard other people recount - your body starts reacting; like when a sphere moves towards you, you flinch and try to dodge it without thinking.

Part of the magic is that it's good enough that you *want* to believe it. It's not actually good enough that it seems real. Even in the carefully calibrated Valve room, it's glitchy and things pop a bit, and you always know you're in a simulation. But you choose to ignore the problems. It felt like when you're watching a good movie, and if you were being rational you would say that this is all illogical and the green screening looks fucking terrible and that is physically impossible what he just did, but if it's good you just choose to ignore all that and go along for the ride. VR felt like that to me.

One of the cool things about VR is that there is an absolute sense of scale, because you are always the size of you. This gives you scale reference in a way that you never have in games. Which is also a problem. It's wonderful if you're making games where you play as a human, but you can't play as a giant (if you just scale down everything else, it feels like you're you in a world where everything else is tiny, not that you're bigger; scale is no longer relative, you are always you). You can't make the characters run at 60 mph the way we usually do in games.

As cool as it is, I don't see how you actually make games with it.

For one thing there are massive short term technical problems. The headset is heavy and uncomfortable. The lenses have to be perfectly aligned to your eyes or you get sick. The registration is very easy to mess up. I'm sure these will be resolved over time. The headset has a cable which is always in danger of tripping or strangling you, which is a major problem and technically hard to get rid of, but perhaps possible some day.

But there are more fundamental inherent problems. When I stepped off the ledge, I wanted to fall. But of course I never actually can. You make my character fall, but not my body? That's weird. Heck if my character steps up on something, I want to step up myself. You can only make games where you basically stand still. In the room with the pipes, I want to climb on the pipes. Nope, you can't - and probably never can. Why would I want to be in a virtual world if I can't do anything there? I don't know how you even walk around a world without it feeling bizarre. All the Valve demos are basically you stuck in a tiny box, which is going to get old.

How do you ever make a game where the player character is moved without their own volition? If an NPC bumps me and pushes my avatar, what happens? You can't push my real human body, so it breaks the illusion. It seems to me that as soon as your viewpoint has a physical reaction with the virtual world and isn't just a viewer with no collision detection, it just doesn't work.

There's this fundamental problem that the software cannot move the player's viewpoint. The player must always get to move their own viewpoint with their head, or the illusion is broken (or worse, you get sick). This is just such a huge problem for games, it means the player can only be a ghost, or an omniscient observer in an RTS game, or other such things. Sure you can make games where you stand over an RTS world map and poke at it. Yay, it's a board game with fancy graphics. I see how it could be great as a sculpting or design tool. I see how it would be great for The Witness and similar games.

For me personally, it's so disappointing that you can't actually physically be in these worlds. The most exciting moments for me were some of the outdoor scenes, or the pipe room, where I just felt viscerally - "I want to run around in this world". What would be amazing for me would be to go in the VR world to alien planets with crazy strange plants and geology, and be able to run around it and climb on it. And I just don't see how that ever works. You can't walk around your living room, you'll trip on things or run into the wall. You can't ever step up or down anything, you have to be on perfectly flat ground all the time. You can't ever touch anything. (It's like a strip club; hey this is so exciting! can I interact with it? No? I have to just sit here and not move or touch anything? How fucking boring and frustrating. I'm leaving.)

At the very minimum you need gloves with amazing force feedback to give you some kind of tactile experience of the VR world, but even then it's just good for VR surgery and VR board games and things where you stand still and touch things. (and we all know the real app is VR fondling).

You could definitely make a killer car racing game. Put me in a seat with force feedback, and that solves all the physical interaction problems. (or, similarly, I'm driving a mech or a space ship or whatever; basically lock the player in a seat so you don't have to address the hard problems for now).

There are also huge huge software problems. Collision detection has to be polygon-perfect ; coarse collision proxies are no longer acceptable. Physics and animation have to be way better. Texture mapping and normal mapping just don't work. Billboard cards just don't work. We basically can't have trees or smoke or anything soft or complex for a long time, it's going to be a lot of super simple rigid objects. Skinned characters and painted on clothing (and just using textures to paint on geometry), none of it works. Flat shaded simple stuff is totally fine, but all the hacks we've used for so long are out the window.

I certainly see the appeal (for a software engineer) of starting from scratch on so many issues and working on the hard problems. Fun.

Socially I find VR rather scary.

One issue is the addictive nature of living in a VR world. Yes yes people are already addicted to their phones and facebook and WoW and whatever, but this is a whole new level. Plus it's even more disengaged from reality; it's one thing for everyone in a coffee shop these days to be staring at their laptops (god I hate you) but when they're all in headsets then interaction in the real world is completely over. I have no doubt that there will be a large class of people that live in the VR world and never leave their living room; Facebook will provide a "deliver pizza" button so that you don't even have to exit the simulation. It will be bad.

Perhaps more disturbing to me is how real and scary it can be. Just having a cube move into me was a kind of real physical fright that I haven't felt in a game. I think that being in a realistic VR world with people shooting each other would be absolutely terrifying and disgusting and really would do bad things to the brains of the players.

And if we wind up with evil overlords like Facebook or Apple or whoever controlling our VR world, that is downright dystopian. We all had our chance to say "no" to the rise of closed platforms when the Apple shit started to take off, and we all fucking dropped the ball (well, you did). Hell we did the same thing with the PATRIOT act. We're all just lying down and getting raped and not doing a damn thing about it and the future of freedom is very bleak indeed. (wow that rant went off the rails)

Anyway, I look forward to trying it again and seeing what people come up with. It's been a long time since I saw anything in games that made me say "yes! I want to play that!" so in that sense VR is a huge win.

Saved comments :

Tom Forsyth said... Playing as a giant is OK - grow the player's height, but don't move their eyes further apart. So the scale is unchanged, but the eyes are higher off the ground. July 3, 2014 at 7:45 PM

brucedawson said... Isn't a giant just somebody who is way taller than everybody else? So yeah, maybe if you 'just' scale down everyone else then you'll still feel normal size. But you'll also feel like you can crush the tiny creatures like bugs! Which is really the essence of being a giant. And yes, I have done the demo. July 3, 2014 at 8:56 PM

Grzegorz Adam Hankiewicz said... I don't understand how you say a steering wheel with force feedback solves any VR problem when the main reason I know I'm driving fast is how forces are being applied to my whole body, not that I'm holding something round instead of a gamepad. You mention it being jarring not being able to climb, wouldn't it be jarring to jump on a terrain bump inside your car and not feel gravity changes? Maybe the point of VR is not replicating dull life but simulating what real life can't possibly give us ever? July 4, 2014 at 3:08 AM

cbloom said... @GAH - not a wheel with force feedback (they all suck right now), but a *seat* like the F1 simulators use. They're quite good at faking short-term forces (terrain bumps and such are easy). I certainly don't mean that that should be the goal of VR. In fact it's quite disappointing that that is the only thing we have any hope of doing a good job of in the short term. July 4, 2014 at 7:33 AM

Stu said... I think you're being a bit defeatist about it, and unimaginative about how it can be used today. Despite being around 30 years old, the tech has only just caught up to the point whereby it can begin to travel down the path towards full immersion, Matrix style brain plugs, holodeck etc. This shit's gotta start somewhere, and can still produce amazing gaming - an obvious killer gaming genre is in any vehicular activity, incl. racing, normal driving, flying, space piloting, etc. Let the other stuff slowly evolve towards your eventual goal - we're in the 'space invaders' and 'pacman' era for VR now, and it works as is for a lot of gaming. July 4, 2014 at 9:11 AM

cbloom said... I'd love to hear any ideas for how game play with physical interaction will ever work. Haven't heard any yet. Obviously the goal should be physical interaction that actually *feels* like physical interaction so that it doesn't break the illusion of being there. That's unattainable for a long time. But even more modest is just how do you do something like walking around a space that has solid objects in it, or there are NPC's walking around. How do you make that work without being super artificial and weird and breaking the magic? In the short term we're going to see games that are basically board games, god games, fucking "experiences" where flower petals fall on you and garbage like that. We're going to see games that are just traditional shitty computer games, where you slide around a fucking lozenge collision proxy using a gamepad, and the VR is just a viewer in that game. That is fucking lame. What I would really hate to see is for the current trend in games to continue into VR - just more of the same shit all the time with better graphics. If people just punt on actually solving VR interaction and just use it as a way to make amazing graphics for fucking Call of Doody or Retro Fucking Mario Indie Bullshit Clone then I will be sad. When the top game is fucking VR Candy Soul-Crush then I will be sad. What is super magical and amazing is the feeling that you actually are somewhere else, and your body gets involved in a way it never has before, you feel like you can actually move around in this VR world. And until games are actually working in that magic it's just bullshit. July 4, 2014 at 9:36 AM

cbloom said... If you like, this is an exhortation to not cop the fuck out on VR the way we have in games for the last 20 years. The hard problems we should be solving in games are AI, animation/motion, physics. But we don't. We just make the same shit and put better graphics on it. Because that sells, and it's easy. Don't do that to VR. Actually work on how people interact with the simulation, and how the simulation responds to them. July 4, 2014 at 10:03 AM

Dillon Robinson said... Son, Bloom, kiddo, you've talking out of your ass again. Think before you speak.

.. and then it really went downhill.

06-21-14 | Suffix Trie Note

A small note on Suffix Tries for LZ compression.

See previously :

Sketch of Suffix Trie for Last Occurance

So. Reminder to myself : Suffix Tries for optimal parsing are clean and awesome. But *only* for finding the length of the longest match. *not* for finding the lowest offset of that match. And *not* for finding the longest match length and the lowest offset of any other (shorter) matches.

I wrote before about the heuristic I currently use in Oodle to solve this. I find the longest match in the trie, then I walk up to parent nodes and see if they provide lower offset / short length matches, because those may be also interesting to consider in the optimal parser.

(eg. for clarity, the situation you need to consider is something like a match of length 8 at offset 482313 vs. a match of length 6 at offset 4 ; it's important to find that lower-length lower-offset match so that you can consider the cost of it, since it might be much cheaper)

Now, I tested the heuristic of just doing parent-gathers and limited updates, and it performed well *in my LZH coder*. It does *not* necessarily perform well with other coders.

It can miss out on some very low offset short matches. You may need to supplement the Suffix Trie with an additional short range matcher, like even just a 1024 entry hash-chain matcher. Or maybe a [256*256*256] array of the last occurrence location of a trigram. Even just checking at offset=1 for the RLE match is helpful. Whether or not they are important depends on the back end coder, so you just have to try it.
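For concreteness, the trigram idea can be sketched like this (a toy, with all names mine; the actual Oodle matchers are of course more involved):

```cpp
#include <vector>

// Sketch of the [256*256*256] last-occurrence idea.
// One entry per trigram, holding the last position that trigram was seen.
struct TrigramLastOcc
{
    std::vector<int> lastPos;

    TrigramLastOcc() : lastPos( 1<<24, -1 ) { }

    static int Trigram( const unsigned char * p )
    {
        return (p[0]<<16) | (p[1]<<8) | p[2];
    }

    // return the previous occurrence of the trigram at buf[pos] (or -1),
    // then record pos as the latest occurrence
    int LookupAndUpdate( const unsigned char * buf, int pos )
    {
        int t = Trigram( buf + pos );
        int prev = lastPos[t];
        lastPos[t] = pos;
        return prev;
    }
};
```

Each lookup gives you a candidate very-low-offset short match to hand to the optimal parser alongside the trie's long match.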

For LZA I ran into another problem :

The Suffix Trie exactly finds the length of the longest match in O(N). That's fine. The problem is when you go up to the parent nodes - the node depth is *not* the longest match length with the pointer there. It's just the *minimum* match length. The true match length might be anywhere up to *and including* the longest match length.

In LZH I was considering those matches with the node depth as the match length. And actually I re-tested it with the correct match length, and it makes very little difference.

Because LZA does LAM exclusion, it's crucial that you actually find what the longest ML is for that offset.

(note that the original LZMA exclude coder is actually just a *statistical* exclude coder; it is still capable of coding the excluded character, it just has very low probability. My modified version that only codes 7 bits instead of 8 is not capable of coding the excluded character, so you must not allow this.)

One bit of ugliness is that extending the match to find its true length is not part of the neat O(N) time query.
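That extension step is just a straight byte compare; a minimal sketch (mine, not the Oodle code):

```cpp
// Extend a candidate match to its true length by direct comparison.
// This linear scan is the part that falls outside the O(N) trie query.
static int ExtendMatch( const unsigned char * buf, int pos, int matchPos, int maxLen )
{
    int len = 0;
    while ( len < maxLen && buf[pos+len] == buf[matchPos+len] )
        len++;
    return len;
}
```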

In any case, I think this is all a bit of a dead-end for me. I'd rather move my LZA parser to be forward-only and get away from the "find a match at every position" requirement. That allows you to take big steps when you find long matches and makes the whole thing faster.

06-21-14 | The E46 M3

Fuck Yeah.

Oh my god. It's so fucking good.

When I'm working in my little garage office, I can feel her behind me, trying to seduce me. Whispering naughty thoughts to me. "Come on, let's just go for a little spin".

On the road, I love the way you can just pin the throttle on corner exit; the back end gently slides out, just a little wiggle. You actually just straighten the steering early, it's like half way through the corner you go boom throttle and straighten the lock and the car just glides out to finish the turn. Oh god it's like sex. You start the turn with your hands and then finish it with your foot, and it's amazing, it feels so right.

On the track there's a whole other feeling, once she's up to speed at the threshold of grip, on full tilt. My favorite thing so far is the chicane at T5 on the back side of pacific. She just dances through there in such a sweet way. You can just lift off the throttle to get a little engine braking and set the nose, then back on throttle to make the rear end just lightly step out and help ease you around the corner. The weight transfer and grip front to back just so nicely goes back and forth, it's fucking amazing. She feels so light on her feet, like a dancer, like a boxer, like a cat.

There are a few things I miss about the 911. The brakes certainly, the balance under braking and the control under trail-braking yes, the steering feel, oh god the steering feel was good and it fucking SUCKS in the M3, the head room in the 911 was awesome, the M3 has shit head room and it's really bad with a helmet, the visibility - all that wonderful glass and low door lines, the feeling of space in the cabin. Okay maybe more than a few things.

But oh my god the M3. I don't care that I have to sit slightly twisted (WTF); I don't care that there are various reliability problems. I don't care that it requires super expensive annual valve adjustments. I forgive it all. For that engine, so eager, so creamy, screaming all the way through the 8k rev range with not a single dip in torque, for the quick throttle response and lack of electronic fudging, for the chassis balance, for the way you can trim it with the right foot. Wow.

06-18-14 | Oodle Network Test Results

Well, it's been several months now that we've been testing out the beta of Oodle Network Compression on data from real game developers.

Most of the tests have been UDP, with a few TCP. We've done a few tests on data from the major engines (UE3/4, Source) that do delta property bit-packing. Some of the MMO type games with large packets were using zlib on packets.

This is a summary of all the major tests that I've run. This is not excluding any tests where we did badly. So far we have done very well on every single packet capture we've seen from game developers.

MMO game :
427.0 -> 182.7 average = 2.337:1 = 57.21% reduction
compare to zlib -5 : 427.0 -> 271.9 average

MMRTS game :
122.0 -> 75.6 average = 1.615:1 = 38.08% reduction

Unreal game :
220.9 -> 143.3 average = 1.542:1 = 35.15% reduction

Tiny packet game :
21.9 -> 15.6 average = 1.403:1 = 28.72% reduction

Large packet Source engine game :
798.2 -> 519.6 average = 1.536:1 = 34.90% reduction

Some of the tests surprised even me, particularly the tiny packet one. When I saw the average size was only 22 bytes I didn't think we had much room to work with, but we did!
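For reference, the ratio and reduction columns above are derived from the in/out averages like so (trivial helpers, not Oodle API):

```cpp
#include <cmath>

// ratio = in/out ; reduction = 1 - out/in, expressed as a percent
static double Ratio( double avgIn, double avgOut )
{
    return avgIn / avgOut;
}

static double ReductionPct( double avgIn, double avgOut )
{
    return 100.0 * ( 1.0 - avgOut / avgIn );
}
```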

Some notes :

06-16-14 | Rep0 Exclusion in LZMA-like coders

For reference on this topic : see the last post .

I believe there's a mistake in LZMA. I could be totally wrong about that because reading the 7z code is very hard; in any case I'm going to talk about Rep0 exclusion. I believe that LZMA does not do this the way it should, and perhaps this change should be integrated into a future version. In general I have found LZMA to be very good and most of its design decisions have been excellent. My intention is not to nitpick it, but to give back to a very nice algorithm which has been generously made public by its author, Igor Pavlov.

LZMA does "literal-after-match" exclusion. I talked a bit about this last time. Basically, after a match you know that the next literal cannot be the one that would have continued the match. If it was you would have just written a longer match. This relies on always writing the maximum length for matches.

To model "LAM" exclusion, LZMA uses a funny way of doing the binary arithmetic model for those symbols. I wrote a bit about that last time, and the LZMA way to do that is good.

LZMA uses LAM exclusion on the first literal after a match, and then does normal 8-bit literal coding if there are more literals.

That all seems fine, and I worked on the Oodle LZ-Arith variant for about a month with a similar scheme, thinking it was right.

But there's a wrinkle. LZMA also does "rep0len1" coding.

For background, LZMA, like LZX before it, does "repeat match" coding. A rep match means using one of the last N offsets (usually N = 3) and you flag that and send it in very few bits. I've talked about the surprising importance of repeat matches before (also here and other places ).

LZMA, like LZX, codes rep matches with MML of 2.
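As a sketch of the repeat-offset machinery (the update rules vary between LZX and LZMA; the move-to-front behavior here is illustrative, not LZMA's exact scheme):

```cpp
// N=3 repeat-offset set with move-to-front update (illustrative)
struct RepOffsets
{
    int rep[3];

    RepOffsets() { rep[0] = rep[1] = rep[2] = 1; }

    // a normal (non-rep) match pushes its offset to the front
    void AddOffset( int off )
    {
        rep[2] = rep[1]; rep[1] = rep[0]; rep[0] = off;
    }

    // using rep index i moves that offset to the front
    void UseRep( int i )
    {
        int off = rep[i];
        for ( int j = i; j > 0; j-- ) rep[j] = rep[j-1];
        rep[0] = off;
    }
};
```

rep0 is then "the last offset matched from", which is exactly the offset that rep0len1 and LAM exclusion both refer to.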

But LZMA also has "rep0len1". rep0len1 codes a single symbol at the 0th repeat match offset. That's the last offset matched from. That's the same offset that provides the LAM exclusion. In fact you can state the LAM exclusion as "rep0len1 cannot occur on the symbol after a match". (and in fact LZMA gets that right and doesn't try to code the rep0len1 bit after a match).

rep0len1 is not a win on text, but it's a decent little win on binary structured data (see example at bottom of this post ). It lets you get things like the top byte of a series of floats (off=4 len1 match, then 3 literals).

The thing is, if you're doing the rep0len1 coding, then normal literals also cannot be the rep0len1 symbol. If they were, then you would just code them with the rep0len1 flag.

So *every* literal should be coded with rep0 exclusion. Not just the literal after a match. And in fact the normal 8-bit literal coding path without exclusion is never used.

To be concrete, coding a literal in LZMA should look like this :

cur_lit = buffer[ pos ];

rep0_lit = buffer[ pos - rep0_offset ];

if ( after match )
{
    // LAM exclusion means cur_lit should never be == rep0_lit
    ASSERT( cur_lit != rep0_lit );
}
else
{
    if ( cur_lit == rep0_lit )
    {
        // lit is rep0, send rep0len1 :
        ... encode rep0len1 flag ...
        return; // do not drop through to literal coding
    }
}

// either way, we now have exclusion :
ASSERT( cur_lit != rep0_lit );

encode_with_exclude( cur_lit, rep0_lit );

and that provides a pretty solid win. Of all the things I did to try to beat LZMA, this was the only clear winner.

ADDENDUM : some notes on this.

Note that the LZMA binary-exclude coder is *not* just doing exclusion. It's also using the exclude symbol as modelling context. Pure exclusion would just take the probability of the excluded symbol and distribute it to the other symbols, in proportion to their probability.
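To make "pure exclusion" concrete: in an N-ary counting model it is just zeroing the excluded count, which hands its probability mass to the remaining symbols in proportion to their counts (a sketch):

```cpp
// Pure exclusion in an N-ary counting model : the excluded symbol's count
// is removed from the total, so its probability is redistributed to the
// remaining symbols in proportion to their counts.
static double ProbWithExclude( const int * counts, int numSyms, int sym, int exclude )
{
    int total = 0;
    for ( int s = 0; s < numSyms; s++ )
        if ( s != exclude )
            total += counts[s];

    if ( sym == exclude ) return 0.0;
    return (double) counts[sym] / total;
}
```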

It turns out that the rep0 literal is an excellent predictor, even beyond just exclusion.

That is, say you're doing normal 8-bit literal coding with no exclusion. You are allowed to use an 8-bit literal as context. You can either use the order-1 literal (that's buffer[pos-1]) or the rep0 literal (that's buffer[pos-rep0_offset]).

It's better to use the rep0 literal!

Of course the rep0 literal becomes a weaker predictor as you get away from the end of the match. It's very good on the literal right after a match (lam exclusion), and still very good on the next literal, and then steadily weaker.

It turns out the transition point is 4-6 literals away from the end of the match; that's the point at which the o1 symbol becomes more correlated to the current symbol than the rep0 lit.

One of the ideas that I had for Oodle LZA was to remove the rep0len1 flag completely and instead get the same effect from context modeling. You can instead take the rep0 lit and use it as an 8-bit context for literal coding, and should get the same benefit. (the coding of the match flag is implicit in the probability model).

I believe the reason I couldn't find a win there is because it turns out that LZ literal coding needs to adapt very fast. You want very few context bits, you want super fast adaptation of the top bits. Part of the issue is that you don't see LZ literals very often; there are big gaps where you had matches, so you aren't getting as much data to adapt to the file. But I can't entirely explain it.

You can intuitively understand why the rep0 literal is such a strong predictor even when it doesn't match. You've taken a string from earlier in the file, and blacked out one symbol. You're told what the symbol was before, and you're told that in the new string it is not that symbol. It's something like :

"Ther" matched before
'r' is to be substituted
What is the substituted symbol, given that it was 'r' before but isn't 'r' here?

Given only the o1 symbol ('e') and the substituted symbol ('r'), you can make a very good guess of what should be there ('n' probably, maybe 'm', or 's', ...). Obviously more context would help, but with limited information, the substituted symbol (rep0 lit) sort of gives you a hint about the whole area.

An even simpler case is given just the fact that rep0lit is upper or lower case - you're likely to substitute it with a character of the same case. Similarly if it's a vowel or consonant, you're likely to substitute with one of the same. etc. and of course I'm just using English text because it's intuitive, it works just as well on binary structured data.

There's another very small flaw in LZMA's exclude coder, which is more of a minor detail, but I'll include it here.

The LZMA binary exclude coder is equivalent to this clearer version that I posted last time :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    do
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 256 );
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return; // coded all 8 bits in the matched phase
        if ( bit != exclude_bit )
            break;
    } while(1);

    // then finish bits that are unmatched :
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit); // non-matched bins
        ctx = (ctx<<1) | bit;
    } while( ctx < 256 );
}

This codes up to 8 bits while the bits of "val" match the bits of "exclude" , and up to 8 bits while the bits of "val" don't match.

Now, obviously the very first bit can never be coded in the unmatched phase. So we could eliminate that from the unmatched bins. But that only saves us one slot.

(and actually I'm already wasting a slot intentionally; the "ctx" with place value bit like this is always >= 1 , so you should be indexing at "ctx-1" if you want a tight packed array. I intentionally don't do that, so that I have 256 bins instead of 255, because it makes the addressing work with "<<8" instead of "*255")

More importantly, in the "matched" phase, you don't actually need to code all 8 bits. If you code 7 bits, then you know that "val" and "exclude" match in all top 7 bits, so it must be that val == exclude^1. That is, it's just one bit flip away; the decoder will also know that so you can just not send it.
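That claim is easy to verify exhaustively: if val and exclude agree on the top 7 bits but val != exclude, then val must be exclude with the bottom bit flipped:

```cpp
// Verify : matching in the top 7 bits while differing somewhere
// forces val == (exclude ^ 1), so the last bit needn't be coded.
static bool LastBitForced()
{
    for ( int exclude = 0; exclude < 256; exclude++ )
    for ( int val = 0; val < 256; val++ )
    {
        if ( val == exclude ) continue; // excluded, can't occur
        if ( (val >> 1) == (exclude >> 1) ) // top 7 bits match
        {
            if ( val != (exclude ^ 1) )
                return false;
        }
    }
    return true;
}
```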

The fixed encoder is :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    do
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 128 );
        m_bins[256 + ctx + (exclude_bit<<7)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( bit != exclude_bit )
            break;
        if ( ctx >= 128 )
        {
            // I've coded 7 bits
            // and they all match
            // no need to code the last one
            return;
        }
    } while(1);

    // then finish bits that are unmatched :
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit); // non-matched bins
        ctx = (ctx<<1) | bit;
    } while( ctx < 256 );
}
Note that now ctx in the matched phase can only reach 128. That means this coder actually only needs 2*256 bins, not 3*256 bins as stated last time.

This is a little speed savings (tiny because we only get one less arith coding event on a rare path), a little compression savings (tiny because that bottom bit models very well), and a little memory use savings.

06-12-14 | Some LZMA Notes

I've been working on an LZ-Arith for Oodle, and of course the benchmark to beat is LZMA, so I've had a look at a few things.

Some previous posts related to things I'll discuss today :

cbloom rants 09-27-08 - 2 On LZ and ACB
cbloom rants 10-01-08 - 2 First Look at LZMA
cbloom rants 10-05-08 - 5 Rant on New Arithmetic Coders
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions

Some non-trivial things I have noticed :

1. The standard way of coding literals with a binary arithmetic coder has a subtle quirk to it.

LZMA uses the now standard fractional update method for binary probability modeling. That's p0 -= (p0 >> updshift) and so on. See for example : 10-05-08 - 5 : Rant on New Arithmetic Coders .

The fractional update method is an approximation of a standard {num0,num1} binary model in which you are kept right at the renormalization threshold. That is, a counting model does :

P0 = num0 / (num0+num1);
... do coding ...
if ( bit ) num1++;
else num0++;
if ( (num0+num1) > renorm_threshold )
{
    // scale down somehow; traditionally :
    num0 >>= 1; num1 >>= 1;
}
The fractional shift method is equivalent to :

num0 = P0;
num1 = (1<<frac_tot) - P0;
if ( bit ) num1++;
else num0++;

// num0+num1 is now ((1<<frac_tot)+1); rescale :
P0 = num0 * (1<<frac_tot)/((1<<frac_tot)+1);

That is, it assumes you're right at renormalization threshold and keeps you there.
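A minimal version of the fractional-shift model looks like this (the 12-bit precision and shift of 5 are just typical values, not necessarily LZMA's exact constants):

```cpp
// Minimal fractional-shift binary model.
// Each update moves P0 a fixed fraction (1/32 here) toward the observed bit.
struct BinShiftModel
{
    enum { kProbBits = 12, kUpdShift = 5 };
    int p0; // P(bit==0) in kProbBits fixed point

    BinShiftModel() : p0( 1 << (kProbBits-1) ) { } // start at 50%

    void Update( int bit )
    {
        if ( bit ) p0 -= p0 >> kUpdShift;                        // saw a 1 : decrease P0
        else       p0 += ( (1<<kProbBits) - p0 ) >> kUpdShift;   // saw a 0 : increase P0
    }
};
```

Because the step is always a fixed fraction of the remaining range, the model never slows down : the adaptation speed is permanently clamped at the renormalization-threshold rate.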

The important thing about this is adaptation speed.

A traditional {num0,num1} model adapts very quickly at first. Each observed bit causes a big change to P0 because total is small. As total gets larger, it becomes more stable, it has more inertia and adapts more slowly. The renorm_threshold sets a minimum adaptation speed; that is, it prevents the model from becoming too full of old data and too slow to respond to new data.

Okay, that's all background. Now let's look at coding literals.

The standard way to code an N bit literal using a binary arithmetic coder is to code each bit one by one, either top down or bottom up, and use the previous coded bits as context, so that each subtree of the binary tree gets its own probability models. Something like :

ctx = 1;
while( ctx < 256 ) // 8 codings
{
    int bit = (val >> 7)&1; // get top bit
    val <<= 1; // slide val for next coding
    BinaryCode( bit, p0[ctx-1] );
    // put bit in ctx for next event
    ctx = (ctx<<1) | bit;
}


Now first of all there is a common misconception that binary coding is somehow different than N-ary arithmetic coding, or that it will work better on "binary data" that is somehow organized "bitwise" vs text-like data. That is not strictly true.

If we use a pure counting model for our N-ary code and our binary code, and we have not reached the renormalization threshold, then they are in fact *identical*. Exactly identical.

For example, say we're coding two-bit literals :

The initial counts are :

0: 3
1: 1
2: 5
3: 4
total = 13

we code a 2 with probability 5/13 in log2(13/5) bits = 1.37851
and its count becomes 6

With binary modeling the counts are :

no ctx (top bit) :
0: 4
1: 9

ctx=0 (bottom bit after a 0) :
0: 3
1: 1

ctx=1 (bottom bit after a 1) :
0: 5
1: 4

to code a "2"
we first code a 1 bit with no context
with probability 9/13 in log2(13/9) bits = 0.53051
and the counts become {4,10}

then we code a 0 bit with a 1 context
with probability 5/9 in log2(9/5) bits = 0.84800
and the counts become {6,4}

And of course 1.37851 = 0.53051 + 0.84800

The coding is exactly the same. (and furthermore, binary coding top down or bottom up is also exactly the same).
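You can check the arithmetic of that example directly:

```cpp
#include <cmath>

// Codelen equivalence from the example above :
// N-ary : code symbol 2 with count 5 out of total 13
static double NaryCodeLen()
{
    return std::log2( 13.0 / 5.0 );
}

// binary : top bit 1 with p=9/13, then bottom bit 0 with p=5/9
static double BinaryCodeLen()
{
    return std::log2( 13.0 / 9.0 ) + std::log2( 9.0 / 5.0 );
}
```

The intermediate probabilities telescope : (9/13)*(5/9) = 5/13, so the two codelens are identical by construction.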

However, there is a difference, and this is where the quirk of LZMA comes in. Once you start hitting the renormalization threshold, so that the adaptation speed is clamped, they do behave differently.

In a binary model, you will see many more events at the top bit. The exact number depends on how spread your statistics are. If all 256 symbols are equally likely, then the top bit is coded 128X more often than the bottom bits (and each of the next bits is coded 64X, etc.). If only one symbol actually occurs then all the bit levels will be coded the same number of times. In practice it's somewhere in between.

If you were trying to match the normal N-ary counting model, then the binary model should have much *slower* adaptation for the top bit than it does for the bottom bit. With a "fractional shift" binary arithmetic coder that would mean using a different "update shift".

But LZMA, like most code I've seen that implements this kind of binary coding of literals, does not use different adaptation rates for each bit level. Instead they just blindly use the same binary coder for each bit level.

This is wrong, but it turns out to be right. I tested a bunch of variations and found that the LZMA way is best on my test set. It seems that having much faster adaptation of the top bits is a good thing.

Note that this is a consequence of using unequal contexts for the different bit levels. The top bit has 0 bits of context, while the bottom bit has 7 bits of context, which means its statistics are diluted 128X (or less). If you do an order-1 literal coder this way, the top bit has 8 bits of context while the bottom bit gets 15 bits.

2. The LZMA literal-after-match coding is just an exclude

I wrote before (here : cbloom rants 08-20-10 - Deobfuscating LZMA ) about the "funny xor thing" in the literal-after-match coder. Turns out I was wrong, it's not really funny at all.

In LZ coding, there's a very powerful exclusion that can be applied. If you always output matches of the maximum length (more on this later), then you know that the next symbol cannot be the one that followed in the match. Eg. if you just copied a match from "what" but only took 3 symbols, then you know the next symbol cannot be "t", since you would have just done a length-4 match in that case.

This is a particularly good exclusion because the symbol that followed in the match is what you would predict to be the most probable symbol at that spot!

That is, say you need to predict the MPS (most probable symbol) at any spot in the file. Well, what you do is look at the preceding context of symbols and find the longest previous match of the context, and take the symbol that follows that context. This is "PPM*" essentially.

So when you code a literal after a match in LZ, you really want to do exclusion of the last-match predicted symbol. In a normal N-ary arithmetic coder, you would simply set the count of that symbol to 0. But it's not so simple with the binary arithmetic coder.

With a binary arithmetic coder, let's say you have the same top 7 bits as the exclude symbol. Well then, you know exactly what your bottom bit must be without doing any coding at all - it must be the bit that doesn't match the exclude symbol. At the next bit level above that, you can't strictly exclude, but you can probabilistically exclude. That is :

Working backwards from the bottom :

At bit level 0 :

if symbol top 7 bits == exclude top 7 bits
then full exclude

that is, probability of current bit == exclude bit is zero

At bit level 1 :

if symbol top 6 bits == exclude top 6 bits

if symbol current bit matches exclude current bit, I will get full exclusion in the next level
so chance of that path is reduced but not zero

the other binary path is unaffected

that is, we're currently coding to decide between 4 symbols.  Something like :

0 : {A,B}
1 : {C,D}

we should have P0 = (PA+PB)/(PA+PB+PC+PD)

but we exclude one; let's say B, so instead we want to code with P0 = PA/(PA+PC+PD)


That is, the exclude is strongest at the bottom bit level, and becomes less strong as you go back up to higher bit levels, because there are more symbols on each branch than just the exclude symbol.

The LZMA implementation of this is :

  static void LitEnc_EncodeMatched(CRangeEnc *p, CLzmaProb *probs, UInt32 symbol, UInt32 matchByte)
  {
    UInt32 offs = 0x100;
    symbol |= 0x100;
    do
    {
      matchByte <<= 1;
      RangeEnc_EncodeBit(p, probs + (offs + (matchByte & offs) + (symbol >> 8)), (symbol >> 7) & 1);
      symbol <<= 1;
      offs &= ~(matchByte ^ symbol);
    }
    while (symbol < 0x10000);
  }

I rewrote it to understand it; maybe this is clearer :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    // same thing but maybe clearer :
    bool matched = true;
    val |= 0x100; // place holder top bit

    for(int i=0;i<8;i++) // 8 bit literal
    {
        int exclude_bit = (exclude >> (7-i)) & 1;
        int bit = (val >> (7-i)) & 1;

        int context = val >> (8-i);
        if ( matched )
            context += exclude_bit?512:256;

        m_bins[context].encode(arith,bit);

        if ( bit != exclude_bit )
            matched = false;
    }
}

We're tracking a running flag ("matched" or "offs") which tells us if we are on the same path of the binary tree as the exclude symbol. That is, do all prior bits match. If so, that steps us into another group of contexts, and we add the current bit from the exclude symbol to our context.

Now of course "matched" always starts true, and only turns to false once, and then stays false. So we can instead implement this as two loops with a break :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    do
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return; // all 8 bits coded
        if ( bit != exclude_bit )
            break;
    } while(1);

    // then finish bits that are unmatched :
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit); // non-matched bins
        ctx = (ctx<<1) | bit;
    } while( ctx < 256 );
}

It's actually not weird at all, it's just the way to do symbol exclusion with a binary coder.

ADDENDUM : maybe I'm going too far saying it's not weird. It is a bit weird, sort of like point 1, it's actually not right, but in a good way.

The thing that's weird is that when coding the top bits, it's only using the bits seen so far of the exclude symbol. If you wanted to do a correct probability exclusion, you need *all* the bits of the exclude symbol, so that you can see exactly what symbol it is, how much probability it contributes to that side of the binary tree.

The LZMA way appears to work significantly better than doing the full exclude.

That is, it's discarding some bits of the exclude as context, and that seems to help due to some issue with sparsity and adaptation rates. LZMA uses 3*256 binary probabilities, while full exclusion uses 9*256. (though in both cases, not all probs are actually used; eg. the first bit is always coded from the "matched" probs, not the "un-matched").

ADDENDUM2 : Let me say it again perhaps clearer.

The way to code a full exclude using binary modeling is :

coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
  N bits coded so far
  use 8 bits of exclude as context
  code current bit of val
  if current bit of val != same bit of exclude : exit to the unmatched loop

while there are bits left to code in val
  N bits coded so far
  use N bits of val as context
  code current bit of val

The LZMA way is :

coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
  N bits coded so far
  use N+1 bits of exclude as context   // <- only difference is here
  code current bit of val
  if current bit of val != same bit of exclude : exit to the unmatched loop

while there are bits left to code in val
  N bits coded so far
  use N bits of val as context
  code current bit of val

I also tried intermediate schemes like using N+2 bits of exclude (past bits+current bit+one lower bit) which should help a little to identify the exclusion probability without diluting statistics too much - they all hurt.

3. Optimal parsing and exclusions are either/or and equal

There are two major options for coding LZ-arith :

I. Do literal-after-match exclusion and always output the longest match. Use a very simplified optimal parser that only considers literal vs match (and a few other things). Essentially just a fancier lazy parse (sometimes called a "flexible parse").

II. Do not do literal-after-match exclusion, and consider many match lengths in an optimal parser.

It turns out that these give almost identical compression.

Case II has the simpler code stream because it doesn't require the literal-after-match special coder, but it's much much slower to encode at high compression because the optimal parser has to work much harder.

I've seen this same principle many times and it always sort of intrigues me. Either you can make a code format that explicitly avoids redundancy, or you can exploit that redundancy by writing an encoder that aggressively searches the coding space.

In this case the coding of exclude-after-match is quite simple, so it's definitely preferable to do that and not have to do the expensive optimal parse.

4. LZMA is very Pareto

I can't really find any modification to it that's a clear win. Obviously you can replace the statistical coders with either something faster (ANS) or something that gives more compression (CM) and you can move the space/speed tradeoff, but not in a clearly beneficial way.

That is, on the compression_ratio / speed / memory_use three-way tradeoff, if you hold any two of those constant, there's no improvement to be had in the other.

.. except for one flaw, which we'll talk about in the next post.

03-30-14 | Decoding GIF

So I'm writing a little image viewer for myself because I got fed up with ACDSee sucking so bad. Anyway, I had almost every image format except GIF, so I've been adding that.

It's mostly straightforward except for a few odd quirks, so I'm writing them down.

Links :

spec of gif89a
A breakdown of a GIF decoder
The GIFLIB Library
Frame Delay Times for Animated GIFs by humpy77 on deviantART
gif_timing test
ImageMagick - Animation Basics -- IM v6 Examples
(Optional) Building zlib, libjpeg, libpng, libtiff and giflib - Leptonica & Visual Studio 2008
theimage.com gif Disposal Methods 2
theimage.com GIF Table of Contents

My notes :

A GIF is a "canvas" and the size of the canvas is in the file header as the "screen width/height". There are then multiple "images" in the file drawn into the canvas.

In theory multiple images could be used even for non-animated gifs. Each image can have its own palette, which lets you do true color gifs by assigning different palettes to different parts of the image. So you should not assume that a GIF decodes into an 8-bit palettized image. I have yet to see any GIF in the wild that does this. (and if you had one, most viewers would interpret it as an animated gif, since delay=0 is not respected literally)

(one hacky way around that which I have seen suggested elsewhere : a gif with multiple images but no GCE blocks should be treated as compositing to form a single still image, whereas GCE blocks even with delay of 0 must be interpreted as animation frames)

Note that animation frames which only update part of the image *are* common. Also the transparency in a GIF must be used when drawing frames onto the canvas - it does not indicate that the final pixel should be transparent. That is, an animation frame may mark some pixels as transparent, and that means don't update those pixels.

There is an (optional) global palette and an (optional) per-image palette. In the global header there is a "background color". That is an index to the global palette, if it exists. The background color will be visible in parts of the canvas where there is no image rectangle, and also where images are transparent all the way down to the canvas. However, the ImageMagick link above has this note :

        There is some thinking that rather than clearing the overlaid area to
        the transparent color, this disposal should clear it to the 'background'
        color meta-data setting stored in the GIF animation. In fact the old
        "Netscape" browser (version 2 and 3), did exactly that. But then it also
        failed to implement the 'Previous' dispose method correctly.

        On the other hand the initial canvas should also be set from the formats
        'background' color too, and that is also not done. However all modern
        web browsers clear just the area that was last overlaid to transparency,
        as such this is now accepted practice, and what IM now follows. 
which makes me think that many decoders (eg. web browsers) ignore background and just make those pixels transparent.

(ADD : I've seen quite a few cases where the "background" value is larger than the global palette. eg. global palette has 64 colors, and "background" is 80 or 152.)

In the optional GCE block, each image can have a transparent color set. This is a palette index which acts as a color-key transparency. Transparent pixels should show whatever was beneath them in the canvas. That is, they do not necessarily result in transparent pixels in the output if there was a solid pixel beneath them in the canvas from a previous image.

Each image has a "delay" time and "dispose" setting in the optional GCE block (which occurs before the image data). These apply *after* that frame.

Delay is the frame time; it can vary per frame, there is no global constant fps. Delay is in centiseconds, and the support for delay < 10 is nasty. In practice you must interpret a delay of 0 or 1 centiseconds to mean "not set" rather than to mean they actually wanted a delay of 0 or 1. (people who take delay too literally are why some gifs play way too fast in some viewers).
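In code, the delay interpretation comes out to something like this (a minimal sketch; the function name and the 100 ms fallback are my choices - the "0 or 1 cs means not set" rule is the common-practice interpretation above, not anything in the spec) :

```c
#include <assert.h>

// Convert a GIF GCE delay (centiseconds) to milliseconds.
// Delays of 0 or 1 cs are treated as "not set" and replaced
// with a ~10 fps default, matching what most viewers do.
static int gif_delay_to_ms(int delay_cs)
{
    if (delay_cs <= 1)
        return 100;        // "not set" : use a default
    return delay_cs * 10;  // centiseconds -> milliseconds
}
```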

Dispose is an action to take on the image after it has been displayed and held for delay time. Note that it applies to the *image* not the canvas (the image may be just a rectangular portion of the canvas). It essentially tells how that image's data will be committed to the canvas for future frames. Unfortunately the dispose value of 0 for "not specified" is extremely common. It appears to be correct to treat that the same as a value of 1 (leave in place).

(ADD : I've seen several cases of a dispose value of "5". Disposal is supposed to be a 3 bit value, of which only the values 0-3 are defined (and fucking 0 means "undefined"). Values 4-7 are supposed to be reserved.)
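So in practice disposal handling looks something like this (hypothetical normalization following the observations above; treating 0 and the out-of-spec values as "leave in place" is my guess, not spec behavior) :

```c
#include <assert.h>

// GIF disposal values per the spec; 0 means "not specified".
enum { GIF_DISPOSE_LEAVE = 1, GIF_DISPOSE_BACKGROUND = 2, GIF_DISPOSE_PREVIOUS = 3 };

// Normalize a raw 3-bit disposal field : 2 and 3 pass through,
// 0 ("not specified") and reserved values (4-7) become 1 (leave in place).
static int gif_normalize_disposal(int d)
{
    if (d == GIF_DISPOSE_BACKGROUND || d == GIF_DISPOSE_PREVIOUS)
        return d;
    return GIF_DISPOSE_LEAVE;
}
```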

The ImageMagick and "theimage.com" links above are very handy for testing disposal and other funny animation details.

It's a shame that zero-delay is so widely mis-used and not supported, because it is the most interesting feature in GIF for encoder flexibility.

03-25-14 | deduper

So here's my little dedupe :

dedupe.zip (x64 exe)

This is a file level deduper (not a block or sector deduper) eg. it finds entire files that are identical.

dedupe.exe does not delete dupes or do anything to them. Instead it outputs a batch file which contains commands to do something to the dupes. The intention is that you then go view/edit that batch file and decide which files you actually want to delete or link or whatever.

It runs on Oodle, using multi-threaded dir enumeration and all that. (see also tabdir )

It finds possible dedupes first by checking file sizes. For any file size where there is more than one file of that size, it then hashes and collects possible dupes that have the same hash. Those are then verified to be actual byte-by-byte dupes.
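The candidate-finding pass looks roughly like this (a toy sketch, NOT the actual dedupe.exe code; the FileInfo struct and names are mine) :

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// A file entry from the dir enumeration (toy version).
typedef struct { const char *name; long size; } FileInfo;

static int cmp_size(const void *a, const void *b)
{
    const FileInfo *fa = (const FileInfo *)a;
    const FileInfo *fb = (const FileInfo *)b;
    return (fa->size > fb->size) - (fa->size < fb->size);
}

// Sort by size and report runs of equal size as dupe candidates.
// (the real thing then hashes each candidate group, and byte-by-byte
// compares the files whose hashes match)
static void find_candidates(FileInfo *files, int count)
{
    qsort(files, count, sizeof(FileInfo), cmp_size);
    for (int i = 0; i < count; )
    {
        int j = i + 1;
        while (j < count && files[j].size == files[i].size) j++;
        if (j - i > 1)  // more than one file of this size : possible dupes
            printf("%d candidate files of size %ld\n", j - i, files[i].size);
        i = j;
    }
}
```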

Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.

Possible todos to make it faster :

1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS extra data stream on that file (which allows the file to be moved or renamed and retain its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features.

2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.

03-24-14 | GDC 2014 Aftermath

The Saturday after GDC, I took BART to Richmond to get on an Amtrak train.

(I got on the wrong train; lol; Amtrak trains have no signs or anything indicating which train they are. So I'm standing on the platform waiting, and at about the right time a train pulls up. Everyone on the platform gets on, and there's no announcement or anything so I just hop on. Nobody takes tickets at the door. Eventually a train guy walks through to check tickets and tells me I'm on the wrong train. Oops. It was my first time on Amtrak I think; it was pretty dang nice actually; if they had car-carrying trains I would use that to avoid long freeway treks).

Got on the right train and took it to Sacramento and just got to a bank before 1 PM. Bought an E46 M3. Immediately pointed it north on the 5 and drove all the way to Seattle.

(took one tiny detour to drive some mountain roads near Shasta; just had to open her up a tiny bit. Yum yum fucking yum. What a car.)

Woot. So exhausted from the combined GDC + drive, but happy to be home with the wonderful family, spring flowers blooming, and my dream car in the garage.

... and back to work. Sigh.

03-20-14 | GDC 2014 Intermezzo

The logical conclusion :

Congratulations on your new baby girl! Your delivery medical costs are free, but your baby has been implanted with advertisement-delivering contacts.

Congratulations on your new baby girl! Your delivery medical costs are free, but we will take 1% of all your child's pay forever.

Of course it is your choice. No one is forcing you. It's a free market system. You can opt out if you pay $10 million immediately.

Lots of people are talking about VR, but I have yet to hear anybody talk about what will actually sell VR, which is porn. I'm also interested in VR skype. The easiest way to do remote VR would be to have a robotic camera at the remote site that mimics your head movements. eg. I could have a robotic camera with my baby, and when I move around my room, the camera over there moves the same way, so it is as if I am there. (I miss my baby!) But there are practical problems with that (oops the robotic arm camera killed my baby again) so better solutions are needed. I'm sure you could record "VR video" in a room with all the walls covered in cameras (perhaps plenoptic cameras, or z-cameras, but mainly just lots of them).

We came up with this : "Oodle puts your data on a diet. Effective from the first byte!" (good to the last dropped packet, etc.)

Official cbloom rants disclaimer :

Despite the constant air of incontrovertibility, some of my rants are well thought out, tested, and fully-cooked. Others are not-quite-cooked musings. Do not base a major game production pipeline on my half-baked ranting! (!! (there are not enough exclamation points in the universe) !!)

In related news, coming to GDC 2015 panel session : How cbloom ruined my engine and delayed my game. Speakers who wish to contribute should contact yo momma.

Oh, also -

Can we just fucking abolish shaking hands as a culture? Not just at GDC but in general. My main concern at GDC is the fact that I get deathly ill every year because some fucker with a cold shakes my hand instead of saying "sorry I'm sick I won't shake your hand" (you asshole, wtf are you thinking, I can see you're obviously deathly ill and you're still holding out your hand to me?).

But even aside from that, hand shaking just sucks. It far too often goes bad.

No I don't need to feel your gross pudgy round blob of a hand. No I don't want to feel your clammy sweaty palm. If you feel the need to wipe your hand on your pants before offering it, then maybe just don't offer it. No I don't want to feel your sticky food encrusted hand. No I'm not impressed by how strong you are (stop crushing my hand you fucking small-dick-having macho low-self-esteem puffed up loser), and also no I don't enjoy your limp fingers-only handshake either.

There is a happy middle ground for a correct hand shake, and then we can pat ourselves on the back and feel good that we didn't epically fail at basic motor control and hygiene. Yay! But that rare reward is in no way worth all the bad times.

Fist bump?

(and no, the fists do *not* explode when they touch!! no no no!)

03-19-14 | GDC 2014

I'm at GDC on Thursday and Friday.

I'll be giving a talk on using luxury branding techniques for upselling in microtransactions, and player retention across advertisements in free-to-play . See the schedule link for details.

Feel free to do a drive-by of the RAD booth and yell "Oodle sucks" at me.

On an actual serious note though, we're showing Oodle's new compressors for network packets . They use a trained model to compress TCP or UDP packets, and give much more compression than previous solutions.

03-15-14 | Bit IO of Elias Gamma

Making a record of something not new :

Elias Gamma codes are made by writing the position of the top bit using unary, and then the lower bits raw. eg to send 30, 30 = 11110 , the top bit is in position 4 so you send that as "00001" and then the bottom 4 bits "1110". The first few values (starting at 1) are :

1 =   1 : 1
2 =  10 : 01 + 0
3 =  11 : 01 + 1
4 = 100 : 001 + 00
5 = 101 : 001 + 01

The naive way to send this code is :

void PutEliasGamma( BITOUT, unsigned val )
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutUnary( BITOUT, topbit );

    val -= (1<<topbit); // or &= (1<<topbit)-1;

    PutBits( BITOUT, val, topbit );

But it can be done more succinctly.

We should notice two things. First of all PutUnary is very simple :

void PutUnary( BITOUT, unsigned val )
    PutBits( BITOUT , 1, val+1 );

That is, it's just putting the value 1 in a variable number of bits. This gives you 'val' leading zero bits and then a 1 bit, which is the unary encoding.

The next is that the 1 from the unary is just the same as the 1 we remove from the top position of 'val'. That is, we can think of the bits thusly :

5 = 101 : 001 + 01

unary of two + remaining 01

5 = 101 : 00 + 101

two zeros + the value 5

The result is a much simplified elias gamma coder :

void PutEliasGamma( BITOUT, unsigned val )
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutBits( BITOUT, val, 2*topbit+1 );

note that if your bit IO is backwards then this all works slightly differently. (I'm assuming you can combine two PutBits into one, with the first PutBits in the top of the second; that is,
PutBits(a,na)+PutBits(b,nb) == PutBits((a<<nb)|b,na+nb) .)

Perhaps more importantly, we can do a similar transformation on the reading side.

The naive reader is :

int GetEliasGamma( BITIN )
    int bits = GetUnary( BITIN );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;

(assuming your GetBits can handle a request for zero bits; the result here is always >= 1). The naive unary reader is :

int GetUnary( BITIN )
    int ret = 0;
    while( GetOneBit( BITIN ) == 0 )
        ret++;
    return ret;

but if your bits are top-justified in your bit input word (as in ans_fast for example, or see the end of this post for a reference implementation), then you can use count_leading_zeros to read unary :

int GetUnary( BITIN )
    int clz = count_leading_zeros( BITIN );

    ASSERT( clz < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, clz+1 );
    ASSERT( one == 1 );

    return clz;

here the GetBits is just consuming the zeros and the terminating 1 bit of the unary code. Just like in the Put case, the key thing is that the trailing 1 bit of the unary is the same as the top bit value ( "(1<<bits)" ) that we added in the naive reader. That is :

int GetEliasGamma( BITIN )
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, bits+1 );
    ASSERT( one == 1 );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;

can be simplified to combine the GetBits :

int GetEliasGamma( BITIN )
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int ret = GetBits( BITIN, 2*bits + 1 );

    ASSERT( ret >= (1<<bits) && ret < (2<<bits) );

    return ret;

again assuming that your GetBits combines like big-endian style.
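To sanity-check the derivation, here's a self-contained round-trip of the simplified coder, using a single 64-bit accumulator as the whole "bit stream" (only holds ~64 bits of codes, so it's for illustration, not a real bitio; all names here are mine) :

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t bits; int numbits; } Bits;

// big-endian style combine : earlier puts land in the higher bits, so
// PutBits(a,na)+PutBits(b,nb) == PutBits((a<<nb)|b,na+nb) as required
static void put_bits(Bits *b, uint64_t val, int nb)
{
    b->bits = (b->bits << nb) | val;
    b->numbits += nb;
}

static int bsr_u64(uint64_t v) { int b = 0; while (v >>= 1) b++; return b; }

// the whole Elias Gamma code is one PutBits of val in 2*topbit+1 bits
static void put_elias_gamma(Bits *b, unsigned val)
{
    assert(val >= 1);
    int topbit = bsr_u64(val);
    put_bits(b, val, 2*topbit + 1);
}

static unsigned get_elias_gamma(Bits *b)
{
    int pos = b->numbits;
    int clz = 0;  // count leading zeros of the remaining (top-justified) bits
    while (((b->bits >> (pos - 1 - clz)) & 1) == 0) { clz++; assert(clz < pos); }
    int nb = 2*clz + 1;  // zeros + the 1 bit + the low bits, read all at once
    unsigned ret = (unsigned)((b->bits >> (pos - nb)) & ((1ULL << nb) - 1));
    b->numbits -= nb;
    b->bits &= (b->numbits ? (1ULL << b->numbits) - 1 : 0);  // drop consumed bits
    return ret;
}
```

Round-tripping 3 then 5 then 1 gives them back in order, matching the hand-worked "5 = 00 + 101" example.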

You can do the same for "Exp Golomb" of course, which is just Elias Gamma + some raw bits. (Exp Golomb is the special case of Golomb codes with a power of two divisor).

Summary :

// Elias Gamma works on vals >= 1
// these assume that the value fits in your bit word
// and your bit reader is big-endian and top-justified

#define BITOUT_PUT_ELIASGAMMA(bout_bits,bout_numbits,val) do { \
    ASSERT( val >= 1 ); \
    uint32 topbit = bsr64(val); \
    BITOUT_PUT(bout_bits,bout_numbits, val, 2*topbit + 1 ); \
    } while(0)

#define BITIN_GET_ELIASGAMMA(bitin_bits,bitin_numbits,val) do { \
    uint32 nlz = clz64(bitin_bits); \
    uint32 nbits = 2*nlz+1; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    } while(0)

// MSVC implementations of bsr and clz :

static inline uint32 bsr64( uint64 val )
    ASSERT( val != 0 );
    unsigned long b = 0;
    _BitScanReverse64( &b, val );
    return b;

static inline uint32 clz64(uint64 val)
    return 63 - bsr64(val);

// and for completeness, reference bitio that works with those functions :
//  (big endian ; bit input word top-justified)

#define BITOUT_VARS(bout_bits,bout_numbits,bout_ptr) \
    uint64 bout_bits; \
    int64 bout_numbits; \
    uint8 * bout_ptr;

#define BITOUT_START(bout_bits,bout_numbits,bout_ptr,buf) do { \
        bout_bits = 0; \
        bout_numbits = 0; \
        bout_ptr = (uint8 *)buf; \
    } while(0)

#define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        ASSERT( (val) < (1ULL<<(nb)) ); \
        bout_numbits += nb; \
        bout_bits |= ((uint64)(val)) << (64 - bout_numbits); \
    } while(0)
#define BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr) do { \
        *((uint64 *)bout_ptr) = _byteswap_uint64( bout_bits ); \
        bout_bits <<= (bout_numbits&~7); \
        bout_ptr += (bout_numbits>>3); \
        bout_numbits &= 7; \
    } while(0)
#define BITOUT_END(bout_bits,bout_numbits,bout_ptr) do { \
        BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr); \
        if ( bout_numbits > 0 ) bout_ptr++; \
    } while(0)


#define BITIN_VARS(bitin_bits,bitin_numbits,bitin_ptr) \
    uint64 bitin_bits; \
    int64 bitin_numbits; \
    uint8 * bitin_ptr;

#define BITIN_START(bitin_bits,bitin_numbits,bitin_ptr,begin_ptr) do { \
        bitin_ptr = (uint8 *)begin_ptr; \
        bitin_bits = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += 8; \
        bitin_numbits = 64; \
    } while(0)

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { if ( bitin_numbits <= 56 ) { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        bitin_ptr += bytesToGet; \
        bitin_bits |= next8 >> bitin_numbits; \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } } while(0)

#define BITIN_GET(bitin_bits,bitin_numbits,nb,ret) do { \
        ASSERT( nb <= bitin_numbits ); \
        ret = (bitin_bits >> 1) >> (63 - nb); \
        bitin_bits <<= nb; \
        bitin_numbits -= nb; \
    } while(0)


and yeah yeah I know this bitio could be faster. It's a reference implementation that's trying to avoid obfuscations. GTFO.

Added exp-golomb. The naive put is :

PutEliasGamma( val >> r );
PutBits( val & ((1<<r)-1) , r );

but you do various reductions and get to :

// this Exp Golomb is for val >= 0
// Exp Golomb is Elias Gamma + 'r' raw bits

#define BITOUT_PUT_EXPGOLOMB(bout_bits,bout_numbits,r,val) do { \
    ASSERT( val >= 0 ); \
    uint64 up = (val) + (1ULL<<(r)); \
    uint32 topbit = bsr64(up); \
    ASSERT( topbit >= (uint32)(r) ); \
    BITOUT_PUT(bout_bits,bout_numbits, up, 2*topbit + 1 - r ); \
    } while(0)
#define BITIN_GET_EXPGOLOMB(bitin_bits,bitin_numbits,r,val) do { \
    uint32 nbits = 2*clz64(bitin_bits)+1+r; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    ASSERT( val >= (1ULL<<r) ); \
    val -= (1ULL<<r); \
    } while(0)


03-14-14 | Fold Up Negatives

Making a record of something not new :

Say you want to take the integers {-inf,inf} and map them to just the non-negatives {0,1,..inf}. (and/or vice-versa)

(this is common for example when you want to send a signed value using a variable length code, like unary or golomb or whatever; yes yes there are other ways, for now assume you want to do this).

We need to generate a number line with the negatives "folded up" and interleaved with the positives, like :

0, -1, 1, -2, 2, -3, ...  ->  0, 1, 2, 3, 4, 5, ...

The naive way is :

// fold_up makes positives even
//   and negatives odd

unsigned fold_up_negatives(int i)
    if ( i >= 0 )
        return i+i;
    else
        return (unsigned)(-i-i-1); 

int unfold_negatives(unsigned i)
    if ( i&1 ) 
        return - (int)((i+1)>>1);
    else
        return (i>>1);

Now we want to do it branchless.

Let's start with folding up. What we want to achieve mathematically is :

fold_up_i = 2*abs(i) - 1 if i is negative

To do this we will use some tricks on 2's complement integers.

The first is getting the sign. Assuming 32-bit integers for now, we can use :

minus_one_if_i_is_negative = (i >> 31);

= 0 if i >= 0
= -1 if i < 0

which works by taking the sign bit and replicating it down. (make sure to use signed right shift, and yes this is probably undefined blah blah gtfo etc).

The other trick is to use the way a negative is made in 2's complement.

(x > 0)

-x = (x^-1) + 1


-x = (x-1)^(-1)

and of course x^-1 is the same as (~x), that is flip all the bits. This also gives us :

x^-1 = -x -1

And it leads obviously to a branchless abs :

minus_one_if_i_is_negative = (i >> 31);
abs_of_i = (i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

since if i is negative this is

-x = (x^-1) + 1

and if i is positive it's

x = (x^0) + 0

So we can plug this in :

fold_up_i = 2*abs(i) - 1 if i is negative

fold_up_i = abs(2i) - 1 if i is negative

minus_one_if_i_is_negative = (i >> 31);
abs(2i) = (2i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

fold_up_i = abs(2i) + minus_one_if_i_is_negative

fold_up_i = (2i) ^ minus_one_if_i_is_negative

or in code :

unsigned fold_up_negatives(int i)
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;

For unfold we use the same tricks. I'll work it backwards from the answer for variety and brevity. The answer is :

int unfold_negatives(unsigned i)
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;

and let's prove it's right :

if i is even

half_i = i>>1;
sign_i = 0;

return half_i ^ 0 = i/2;
// 0,2,4,... -> 0,1,2,...

if i is odd

half_i = i>>1; // i is odd, this truncates
sign_i = -1;

return half_i ^ -1 
 = -half_i -1
 = -(i>>1) -1
 = -((i+1)>>1)
// 1,3,5,... -> -1,-2,-3,...

which is what we wanted.

Small note : on x86 you might rather use cdq to get the replicated sign bit of an integer rather than >>31 ; there are probably similar instructions on other architectures. Is there a neat way to make C generate that? I dunno. Not sure it ever matters. In practice you should use an "int32" type or compiler_assert( sizeof(int) == 4 );

Summary :

unsigned fold_up_negatives(int i)
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;

int unfold_negatives(unsigned i)
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
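And for reference, the summary in compilable form with the round-trip checked; the only changes from the code above are braces and casts (and it assumes 32-bit int with arithmetic right shift, as noted) :

```c
#include <assert.h>

// fold_up makes positives even and negatives odd, branchless
static unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;             // 0 or -1 : replicated sign bit
    return two_i ^ (unsigned)sign_i;
}

static int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );    // 0 or -1 : from the low bit
    return (int)(half_i ^ (unsigned)sign_i);
}
```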

03-13-14 | Hilbert Curve Testing

So I stumbled on this blog post about counting the rationals which I found rather amusing.

The snippet that's relevant here is that if you iterate through the rationals naively by doing something like

1/1 2/1 3/1 4/1 ...
1/2 2/2 3/2 4/2 ...
1/3 2/3 ...
then you will never even reach 1/2 because the first line is infinitely long. But if you take a special kind of path, you can reach any particular rational in a finite number of steps. Much like the way a Hilbert curve lets you walk the 2d integer grid using only a 1d path.

It reminded me of something that I've recently changed in my testing practices.

When running code we are always stepping along a 1d path, which you can think of as discrete time. That is, you run test 1, then test 2, etc. You want to be able to walk over some large multi-dimensional space of tests, but you can only do so by stepping along a 1d path through that space.

I've had a lot of problems testing Oodle, because there are lots of options on the compressors, lots of APIs, lots of compression levels and different algorithms - it's impossible to test all combinations, and particularly impossible to test all combinations on lots of different files. So I keep getting bitten by some weird corner case.

(Total diversion - actually this is a good example of why I'm opposed to the "I tested it" style of coding in general. You know when you stumble on some code that looks like total garbage spaghetti, and you ask the author "wtf is going on here, do you even understand it?" and they say "well it works, I tested it". Umm, no, wrong answer. Maybe it passed the test, but who knows how it's going to be used down the road and fail in mysterious ways? Anyway, that's an old school cbloom coding practices rant that I don't bother with anymore ... and I haven't been great about following my own advice ...)

If you try to set up a big test by just iterating over each option :

for each file
  for each compressor
    for each compression_level
      for each chunking
        for each parallel branching
          for each dictionary size
            for each sliding window size

then you'll never even get to the second file.

The better approach is to get a broad sampling of options. An approach I like is to enumerate all the tests I want to run, using a loop like the above, put them all in a list, then randomly permute the list. Because it's just a permutation, I will still only run each test once, and will cover all the cases in the enumeration, but by permuting I get a broader sampling more quickly.

(you could also add the nice technique that we worked out here long ago - generate a consistent permutation using a pseudorandom generator with known seed, and save your place in the list with a file or the registry or something. That way when you stop and start the tests, you will resume where you left off, and eventually cover more of the space (also when a test fails you will automatically repeat the same test if you restart)).
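The enumerate-then-permute idea is tiny in code. A sketch (the test parameters here are made-up placeholders, not real Oodle options) :

```c
#include <assert.h>
#include <stdlib.h>

// One test = one combination of options.
typedef struct { int file, compressor, level; } Test;

// Enumerate every combination into a flat list, exactly like the
// nested-loop version, and return the count.
static int build_tests(Test *out, int nfiles, int ncomps, int nlevels)
{
    int n = 0;
    for (int f = 0; f < nfiles; f++)
        for (int c = 0; c < ncomps; c++)
            for (int l = 0; l < nlevels; l++)
                out[n++] = (Test){ f, c, l };
    return n;
}

// Fisher-Yates shuffle with a fixed seed : still runs each test exactly
// once, but samples the option space broadly, and the known seed makes
// the permutation consistent across restarts (so you can resume by index).
static void shuffle_tests(Test *t, int n, unsigned seed)
{
    srand(seed);
    for (int i = n - 1; i > 0; i--)
    {
        int j = rand() % (i + 1);
        Test tmp = t[i]; t[i] = t[j]; t[j] = tmp;
    }
}
```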

The other trick that's useful in practice is to front-load the quicker tests. You do want to have a test where you run on 8 GB files to make sure that works, but if that's your first test you have to wait forever to get a failure. This is particularly for the case that something dumb is broken, it should show up as quickly as possible so you can just cancel the test and fix it. So you want an initial broad sampling of options on very quick tests.

03-03-14 | Windows Links

I wrote a deduper in Oodle today. I was considering making the default action be to replace duplicate files with a link to the original.

I wasn't sure whether to use "hard" or "soft" links, so I did a little research.

In Windows a "hard link" means that multiple file names all point at the same file data. It's a bit of a misnomer, it's not really a "link" to the original. There is no "original" or base file - all instances of a hard link are equal peers.

A "soft link" is just an OS-level shortcut. There is an original base file, and the soft links point at it.

Both are ridiculously broken concepts and IMO should almost never be used.

With "hard links" the problem is that if you accidentally edit any of the links, you have editted the master data. If you did not intend that, you may have fucked up something severely.

Hard links are reasonable *if* the files linked are read-only (and somehow actually kept read-only, not just attrib'ed away).

The problem with "soft links" is that the links are not protected; if you rename or move or delete the original file, all the links are broken, and again you may have just severely fucked something up.

The big problem is that you get no warning in either case. Clearly what you want when you try to rename a file which has a bunch of soft links pointing at it is some kind of dialog that says "hey all these links point at this file, do you really want to rename it and break the links? or should I update the links to point at the new name?". Similarly with hard links, obviously what you want is some prompt like "hey if you modify this, do you want these hard links to see the new version or the old version?".

Now obviously you can't solve this problem in general without user prompts. But I believe that a refcounted copy-on-write link would have been a much more robust and safe solution. Open for write should have done a COW by default unless you passed a special flag to indicate you intend to edit shared data.

Even ignoring the fundamental logic of how links work, there are some very bad practical issues for links in windows.

1. Soft links show a file size of 0 in the dir enumeration file info. This breaks the assumption that most apps make that the file size they get from the dir enumeration will be the same as the file size they get if they open that file handle and ask for its size. It can also screw up enumerations that are just trying to skip zero-size files.

Hard link file sizes are out of date. If the file data is modified, only the directory entry for the one that was used to modify the data is updated. All other links still have the old file sizes, mod times, etc.

2. Hard links break the assumption that saving to a file is the same as saving to a temp and then renaming onto the file. Many apps may or may not use the "write to temp then rename" pattern; what you get is massively different results in a very unexpected way.

3. Mod times are hosed. In general attributes are hosed; neither type of link reflects the attributes of the actual file data in the link - until they are opened, then they get updated. Mod times are particularly bad because many apps use them to detect changes, and with links the file data can be changed but the mod time won't reflect it.

Dear lord. So non-robust.

02-25-14 | ANS Applications

Some rambling on where I see ANS being useful.

In brief - anywhere you used Huffman in the past, you can use ANS instead.

ANS (or ABS) are not very useful for high end compressors. They are awkward for adaptive modeling. Even if all you care about is decode speed (so you don't mind buffering up the model updates to make the encode work backwards) it's just not a big win over arithmetic coding. Things like PAQ/CM , LZMA, H264, all the high compression cases that use adaptive context models, there's no real win from ANS/ABS.

Some specific cases where I see ANS being a good win :

JPEG-ANS obviously. Won't be competitive with sophisticated coders like "packjpg" but will at least fix the cliff at low bit rate caused by Huffman coding.

JPEGNEXT-ANS. I've been thinking for a long time about writing a "JPEGNEXT". Back end coefficient classification; coefficients in each group sent by value with ANS. Front end 32x32 macroblocks with various DCT sizes. Try to keep it as simple as possible but be up to modern quality levels.

LZ-ANS. An "LZX class" (which means large window, "repeat match", entropy resets) LZ with ANS back end should be solid. Won't be quite LZMA compression levels, but way way faster.

Lossless image DPCM. ANS on prediction residual values is a clear killer. Should have a few statistics groups with block classification. No need to transmit the ANS counts if you use a parametric model ala TMW etc. Should be easy, fast, and good.

blocksort-ANS. Should replace bzip. Fast to decode.

MPEG-class video coders. Fast wavelet coders. Anywhere you are using minimal context modelling (only a few contexts) and are sending values by their magnitude (not bit plane coders).


02-25-14 | WiFi

So our WiFi stopped working recently, and I discovered a few things which I will now write down.

First of all, WiFi is fucked. 2.4 GHz is way overcrowded and just keeps getting more crowded. Lots of fucking routers now are offering increased bandwidth by using multiple channels simultaneously, etc. etc. It's one big interference fest.

The first issue I found was baby monitors. Baby monitors, like many wireless devices, are also in the 2.4 GHz band and just crap all over your wifi. Turning them off helped our signal a bit, but we were still getting constant problems.

Next issue is interference from neighbors'ses wifises. This is what inSSIDer looks like at my house :

We are *way* away from any neighbors, at least 50 feet in every direction, and we still get this amount of shit from them. Each of my cock-ass-fuck neighbors seems to have four or five wifi networks. Good job guys, way to fix your signal strength issues by just piling more shit in the spectrum.

What you can't see from the static image is that lots of the fucking neighbor wifis are not locked to a specific channel, many of them are constantly jumping around trying to find a clear channel, which just makes them crud them all up.

(I'd love to get some kind of super industrial strength wifi for my house and just crank it up to infinity and put it on every channel so that nobody for a mile around gets any wifi)

I've long had our WiFi on channel 8 because it looked like the most clear spot to be. Well, it turns out that was a classic newb mistake. Apparently it's worse to be slightly offset from a busy channel than it is to be right on it. When you're offset, you get signal leakage from the other channel that just looks like noise; when you're right on the channel you're fighting with other people, but at least you are seeing their data as real data that you can ignore. Anyway, switching our network to channel 11 fixed it.

It looks like in practice channel 6 and 11 are the only usable ones in noisy environments (eg. everywhere).

The new 802.11ac on 5 GHz should be a nice clean way to go for a few years until it too gets crudded up.

02-18-14 | ans_fast implementation notes

Some notes about the ans_fast I posted earlier .

ans_fast contains a tANS (table-ANS) implementation and a rANS (range-ANS) implementation.

First, the benchmarking. You can compare to the more naive implementations I posted earlier . However, do not compare this tANS impl to Huffman or arithmetic and conclude "ANS is faster" because the tANS impl here has had rather more loving than those. Most of the tricks used on "ans_fast" can be equally used for other algorithms (though not all).

Here L=4096 to match the 12-bits used in the previous test. This is x64 on my lappy (1.7 GHz Core i7 with turbo disabled). Compressed sizes do not include sending the counts. Time "withtable" includes all table construction but not histogramming or count normalization (that affects encoder only). ("fse" and "huf" on the previous page included table transmission and histogramming time)


tANS 768771 -> 435252.75

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

rANS 768771 -> 435980 bytes (v2)

ticks to encode: 6.97 decode: 5.06
mbps encode: 248.02 decode: 341.63

withtable ticks to encode: 6.97 decode: 5.07
withtable mbps encode: 247.92 decode: 341.27


tANS 513216 -> 78856.88

ticks to encode: 4.53 decode: 3.47
mbps encode: 382.02 decode: 497.75

withtable ticks to encode: 4.62 decode: 3.56
withtable mbps encode: 374.45 decode: 485.40

rANS 513216 -> 79480 bytes (v2)

ticks to encode: 5.62 decode: 3.53
mbps encode: 307.78 decode: 490.32

withtable ticks to encode: 5.63 decode: 3.54
withtable mbps encode: 307.26 decode: 488.88

First a note on file sizes : rANS file sizes are a few bytes larger than the "rans 12" posted last time. That's because that was a 32-bit impl. The rANS here is 64-bit and dual-state so I have to flush 16 bytes instead of 4. There are ways to recover some of those bytes.

The tANS file sizes here are smaller than comparable coders. The win comes from the improvements to normalizing counts and making the sort order. In fact, the +1 bias heuristic lets me beat "arith 12" and "rans 12" from the last post, which were coding nearly perfectly to the expected codelen of the normalized counts.

If you run "ans_learning" you will often see that the written bits are less than the predicted codelen :

H = 1.210176
CL = 1.238785
wrote 1.229845 bpb
this is because the +1 bias heuristic lets the codelens match the data better than the normalized counts do.

Okay, so on to the speed.

The biggest thing is that the above reported speeds are for 2x interleaved coders. That is, two independent states encoding the single buffer to a single compressed stream. I believe ryg will talk about this more soon. You can read his paper on arxiv now. Note that this is not just unrolling. Because the states are independent they allow independent execution chains to be in flight at the same time.

The speedup from interleaving is pretty huge (around 1.4X) :


rANS non-interleaved (v1)

ticks to encode: 26.84 decode: 7.33
mbps encode: 64.41 decode: 235.97

withtable ticks to encode: 26.85 decode: 7.38
withtable mbps encode: 64.41 decode: 234.19

rANS 2x interleaved (v1)

ticks to encode: 17.15 decode: 5.16
mbps encode: 100.84 decode: 334.95

withtable ticks to encode: 17.15 decode: 5.22
withtable mbps encode: 100.83 decode: 331.31

tANS non-interleaved

ticks to encode: 6.43 decode: 4.68
mbps encode: 269.10 decode: 369.44

withtable ticks to encode: 6.48 decode: 4.73
withtable mbps encode: 266.86 decode: 365.39

tANS 2x interleaved

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

But even non-interleaved it's fast. (note that interleaved tANS is using only a single shared bit buffer). The rest of the implementation discussion will use the non-interleaved versions for simplicity.

The tANS implementation is pretty straightforward.

Decoding one symbol is :

    struct decode_entry { uint16 next_state; uint8 num_bits; uint8 sym; };

    decode_entry * detable = table - L;

    #define DECODE_ONE() do { \
        de = detable + state; \
        nb = de->num_bits; \
        state = de->next_state; \
        BITIN_OR(bitin_bits,bitin_numbits,nb,state); \
        *toptr++ = (uint8) de->sym; \
    } while(0)

where BITIN_OR reads "nb" bits and ors them onto state.

With a 64-bit bit buffer, I can ensure >= 56 bits are in the buffer. That means with L up to 14 bits, I can do four decodes before checking for more bits needed. So the primary decode loop is :

        // I know >= 56 bits are available  
        // each decode consumes <= 14 bits

        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();

        BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr);

        // now >= 56 bits again

The fastest way I could find to do the bit IO was "big endian style". That's the next bits at the top of the word. Bits in the word are in order of bits in the file. This lets you unconditionally grab the next 8 bytes to refill, but requires a bswap (on little endian platforms). eg :

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += bytesToGet; \
        bitin_bits |= (next8 >> 1) >> (bitin_numbits-1); \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } while(0)
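To make the refill invariants concrete, here's a standalone sketch of the same style of bit reader (my toy code, not the ans_fast macros; portable byte assembly stands in for the unaligned load + _byteswap_uint64, and it assumes 8 readable bytes past the read pointer) :

```cpp
#include <cassert>
#include <cstdint>

// Big-endian-style bit reader : the next bits live at the TOP of a
// 64-bit word, in file order.
struct BitIn {
    uint64_t bits = 0;      // pending bits, left-justified
    int numbits = 0;        // how many of them are counted as valid
    const uint8_t* ptr;     // next byte to fetch (8 bytes must be readable)

    explicit BitIn(const uint8_t* p) : ptr(p) {}

    // After refill, >= 56 bits are available, so e.g. four <=14-bit
    // decodes can run before the next refill.
    void refill() {
        int bytesToGet = (64 - numbits) >> 3;
        uint64_t next8 = 0;
        for (int i = 0; i < 8; i++)     // unconditional big-endian 8-byte grab
            next8 = (next8 << 8) | ptr[i];
        ptr += bytesToGet;
        if (numbits == 0) bits = next8;
        else bits |= (next8 >> 1) >> (numbits - 1); // two-step shift dodges shift-by-64 UB
        numbits += bytesToGet << 3;
        assert(numbits >= 56 && numbits <= 64);
    }

    uint32_t get(int nb) {  // consume the next nb bits (1 <= nb <= numbits)
        uint32_t v = (uint32_t)(bits >> (64 - nb));
        bits <<= nb;
        numbits -= nb;
        return v;
    }
};
```

Note the same idempotent-OR trick as the real macro : the refill may re-read a few bytes it already OR'd in last time, but since they carry the same bit values, OR-ing them again is harmless.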

The other nice thing about the bits-at-top style is that the encoder can put bits in the word without any masking. The encoder is :

    #define ENCODE_ONE() do { \
        sym = *bufptr--; ee = eetable+sym; \
        msnb = ee->max_state_numbits; \
        msnb += ( state >= ee->max_state_thresh ); \
        BITOUT_PUT(bout_bits,bout_numbits, state,msnb); \
        state = ee->packed_table_ptr[ state>>msnb ]; \
        } while(0)

    #define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        bout_bits >>= nb; \
        bout_bits |= ((uint64)val) << (64 - nb); \
        bout_numbits += nb; \
    } while(0)

the key interesting part being that the encoder just does BITOUT_PUT with "state", and by shifting it up to the top of the word for the bitio, it gets automatically masked. (and rotate-right is a way to make that even faster).

Similarly to the decoder, the encoder can write 4 symbols before it has to check if the bit buffer needs any output.

The other crucial thing for fast tANS is the sort order construction. I do a real sort, using a radix sort. I do the first step of radix sorting (generating a histogram), and then I directly build the tables from that, reading out of the radix histogram. There's no need to explicitly generate the sorted symbol list as an intermediate step. I use only an 8-bit radix here (256 entries) but it's not significantly different (in speed or compression) than using a larger radix table.

The rANS implementation is pretty straightforward and I didn't spend much time on it, so it could probably be faster (particularly encoding which I didn't spend any time on (ADDENDUM : v2 rANS now sped up and encoder uses fast reciprocals)). I use a 64-bit state with 32-bit renormalization. The basic decode operation is :

        uint64 xm = x & mask;   
        const rans_decode_table::entry & e = table[xm];
        x = e.freq * (x >> cumprobtot_bits) + e.xm_minus_low;
        buffer[i] = (uint8) e.sym;
        if ( x < min_x )
        {
            x <<= 32;
            x |= *((uint32 *)comp_ptr);
            comp_ptr += 4;
        }

One thing I should note is that my rANS decode table is 2x bigger than the tANS decode table. I found it was fastest to use an 8-byte decode entry for rANS :

    // 8-byte decode entry
    struct entry { uint16 freq; uint16 xm_minus_low; uint8 sym; uint16 pad; };
obviously you can pack that a lot smaller (32 bits from 12+12+8) but it hurts speed.

For both tANS and rANS I make the encoder write backwards and the decoder read forwards to bias in favor of decoder speed. I make "L" a variable, not a constant, which hurts speed a little.

02-18-14 | Understanding ANS - Conclusion

I think we can finally say that we understand ANS pretty well, so this series will end. I may cover some more ANS topics but they won't be "Understanding ANS".

Here is the index of all posts on this topic :

cbloom rants 1-30-14 - Understanding ANS - 1
cbloom rants 1-31-14 - Understanding ANS - 2
cbloom rants 02-01-14 - Understanding ANS - 3
cbloom rants 02-02-14 - Understanding ANS - 4
cbloom rants 02-03-14 - Understanding ANS - 5
cbloom rants 02-04-14 - Understanding ANS - 6
cbloom rants 02-05-14 - Understanding ANS - 7
cbloom rants 02-06-14 - Understanding ANS - 8
cbloom rants 02-10-14 - Understanding ANS - 9
cbloom rants 02-11-14 - Understanding ANS - 10
cbloom rants 02-14-14 - Understanding ANS - 11
cbloom rants 02-18-14 - Understanding ANS - 12

And here is some source code for my ANS implementation : (v2 02/21/2014)

ans.zip - contains ans_fast and ans_learning
cblib.zip is required to build my code

My home code is MSVC 2005/2008. Port if you like. Email me if you need help.

NOTE : this release is not a library you just download and use. It is intended as documentation of research. If you want some ANS code you can just use off the shelf, go get FiniteStateEntropy . You may also want ryg_rans .

I think I'll do a followup post with the performance of ans_fast and some optimization notes so it doesn't crud up this index post. Please put implementation speed discussion in that followup post .

02-18-14 | Understanding ANS - 12

A little note about sorts and tables.


What's wrong with that sort?

(That's the naive rANS sort order; it's just a "cum2sym" table. It's each symbol Fs times in consecutive blocks. It has M=32 entries. M = sum{Fs} , L = coding precision)

(here I'm talking about a tANS implementation with L=M ; the larger (L/M) is, the more you preserve the information in the state x)

Think about what the state variable "x" does as you are coding. In the renormalization range it's in [32,63]. Its position in that range is a slider for the number of fraction bits it contains. At the bottom of the range, log2(x) is 5, at the top log2(x) is 6.

Any time you want to encode a "D" you must go back to a singleton precursor state, Is = [1]. That means you have to output all the bits in x, so all fractional bits are thrown away. All information about where you were in that I range is gone. Then from that singleton Is range you jump to the end of the I range.

(if Fs=2 , then you quantize the fractional bits up to 0.5 ; if Fs=3, you quantize to 1/3 of a bit, etc.)

Obviously the actual codelen for a "D" is longer than that for an "A". But so is the codelen for a "C", and the codelen for "A" is too short. Another way to think of it is that you're taking an initial state x that spans the whole interval [32,63] and thus has variable fractional bits, and you're mapping it into only a portion of the interval.

In order to preserve the fractional bit state size, you want to map from the whole interval back to the whole interval. In the most extreme case, something like :


(M=16) , when you encode an A you go from [16,31] to [8,15] and then back to the A's in that string. The net result is that state just lost its bottom bit. That is, x &= ~1. You still have the full range of possible fractional bits from [0,1] , you just lost the bottom bit of precision.

I was thinking about this because I was making some weird alternative tANS tables. In fact I suppose not actually ANS tables, but more general coding tables.

For background, you can make one of the heuristic tANS tables thusly :

shuffle(s) = some permutation function
shuffle is one-to-one over the range [0,L-1]
such as Yann's stepping prime-mod-L
or bit reverse

    int next_state[256];    
    uint8 permutation[MAX_L];
    // make permutation :
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        next_state[sym] = count;
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;
            permutation[index] = (uint8)sym;
        }
    }
    ASSERT( cumulative_count == (uint32)L );

    // permutation is now our "output string"   

    for LOOP(i,L) // iterating over destination state
    {
        int sym = permutation[i];
        // step through states for this symbol
        int from_state = next_state[sym];
        next_state[sym] ++;
        int to_state = L + i;
        encode_packed_table_ptr[sym][from_state] = to_state;
    }

which is all well and good. But I started thinking - can I eliminate the intermediate permutation[] table entirely? Well, yes. There are a few ways.

If you have a "cum2sym" table already handy, then you can just use shuffle() to look up directly into cum2sym[], and that is identical to the above. But you probably don't have cum2sym.

Well what if we just use shuffle() to make the destination state? Note that this is calling it in the opposite direction (from cum2sym index to to_state , rather than from to_state to cum2sym). If your shuffle is self-inverting like bit reversal is, then it's the same.

It gives you a very simple table construction :

    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }
    }
    ASSERT( cumulative_count == (uint32)L );

make_tans_shuffle_direct walks the Fs in a kind of cum2sym order and then scatters those symbols out to semi-random target locations using the shuffle() function.

It doesn't work. Or rather, it works, it encodes & decodes data correctly, but the total coded size is worse.

The problem is that the encode table is no longer monotonic. That is, as "from_state" increases, "to_state" does not necessarily increase. The Fs encode table entries for each symbol are not numerically in order.

In the images we've been picturing from earlier in the post we can see the problem. Some initial state x is renormalized down to the Is coding range. We then follow the state transition back to the I range - but we go somewhere random. We don't go to the same neighborhood where we started, so we randomly get more or less fractional bits.

You can fix it thusly :

    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }

        // fix - to_states not monotonic
        // sort the destination states for this symbol :
        std::sort( encode_packed_table_ptr[sym]+count, encode_packed_table_ptr[sym]+2*count );
    }
    ASSERT( cumulative_count == (uint32)L );

and then it is identical to "make_tans_shuffle" (identical if shuffle is self-inverting, and if not then it's different but equal, since shuffle is really just a random number generator so running it backwards doesn't hurt compression).

For the record the compression penalty for getting the state transition order wrong is 1-2% :

CCC total bytes out :

correct sort : 1788631
shuffle fixed: 1789655
shuffle bad  : 1813450

02-14-14 | Understanding ANS - 11

I want to do some hand waving about the different ways you can conceptually look at ANS.

Perhaps the best way to understand ANS mathematically is via the analogy with arithmetic coding . While ANS is not actually building up an arithmetic coder interval for the file, each step is very much like a LIFO arithmetic coder, and the (x/P) scaling is what makes x grow the right amount for each symbol. This is the most useful way to think about rANS or uANS, I find.

But there are other ways to think about it.

One is Duda's "asymmetric numeral system", which is how he starts the explanation in the paper, and really confused me to begin with. Now that we've come at ANS from the back way we can understand what he was on about.

The fundamental op in ANS is :

integer x contains some previous value

make x' = x scaled up in some way + new value 

with a normal "symmetric numeral system" , you would just do base-b math :

x' = x * b + v

which gives you an x' where the old value (x) is distributed evenly, and the v's just cycle :

b = 3 for example

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  0  1  1  1  2  2  2
v :  0  1  2  0  1  2  0  1  2

x' is a way of packing the old value x and the new value v together. This symmetric packing corresponds to the output string "012" in the parlance of this series. The growth factor (x'/x) determines the number of bits required to send our value, and it's uniform.

But it doesn't need to be uniform.

0102 :

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  1  0  2  1  3  1  4 
v :  0  1  0  2  0  1  0  2  0

Intuitively, the more often a symbol occurs in the output string, the more slots there are for the previous value (x) to get placed; that is, more bits of x can be sent in lower values of x' when the symbol occurs in many slots. Hence x' grows less. If you're thinking in terms of normalized x's, growing less means you have to output fewer bits to stay in the renormalization range.
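Reading (x,v) off an output string is mechanical; here's a toy sketch (my code) that reproduces the tables above :

```cpp
#include <string>
#include <utility>

// Decode x' against an output string S : the symbol v is whatever sits
// at slot (x' mod |S|), and x counts how many earlier slots held that
// same symbol, including whole earlier repetitions of S.
std::pair<char,int> numeral_line(const std::string& S, int xp) {
    int n = (int)S.size();
    int slot = xp % n;
    char c = S[slot];
    int occ_before = 0, occ_total = 0;
    for (int i = 0; i < n; i++) {
        if (S[i] == c) { occ_total++; if (i < slot) occ_before++; }
    }
    return { c, occ_before + (xp / n) * occ_total };
}
```

With S = "012" this gives the uniform base-3 table; with S = "0102" it gives the asymmetric table above (e.g. x'=5 decodes to v=1, x=1).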

You can draw these asymmetric numeral lines in different ways, which Duda does in the paper. For example :

input x as the axis line,
output x' in the table :

  0 1 2 3 4 5 6  x
A 0 2 4          x'
B 1 3 5


  0 1 2 3 4 5 6  x
A 0 1 3 4 6 7 9  x'
B 2 5 8 11

output x' as the axis line
input x in the table :

  0 1 2 3 4 5 6  x'
A 0   1   2   3  x
B   0   1   2

  0 1 2 3 4 5 6  x'
A 0 1   2 3   4  x
B     0     1

output x' line implicit
show x and output symbol :

0 0 1 1 2 2 3

0 1 0 2 3 1 4

That is, it's a funny way of just doing base-b math; we're shifting up the place value and adding our value in, but we're in an "asymmetric numeral system", so the base is nonuniform. I find this mental image not very useful when thinking about how the coding actually works.

There's another way to think about tANS in particular (tANS = table-based ANS), which is what Yann is talking about .

To get there mentally, we actually need to optimize our tANS code.

When I covered tANS encoding before , I described it something like this :

x is the current state
x starts in the range I = [L,2L-1]

to encode the next symbol s
we need to reach the precursor range Is = [Fs,2Fs-1]

to do that, output bits from x
b = x&1; x >>= 1;
until x is lowered to reach Is

then take the state transition C()
this takes x back into I

this should be familiar and straightforward.

To optimize, we know that x always starts in a single power of 2 interval [L,2L-1] , and it always lands in a power of 2 interval [Fs,2Fs-1]. That means the minimum number of bits we ever output is from L to 2Fs-1 , and the maximum number of bits is only 1 more than that. So the renormalization can be written as :

precompute :

max_state = 2Fs - 1;
min_num_bits = floor(log2(L/Fs));

to renormalize :

x in [L,2L-1]
output min_num_bits from x
x >>= min_num_bits

now ( x >= Fs && x < 2*max_state );

if ( x > max_state ) output 1 more bit; x>>= 1;

now x in [Fs,2Fs-1]

But you can move the check for the extra output bit earlier, before shifting x down :

precompute :

min_num_bits = log2(L) - log2ceil(Fs);  // if L is power of 2
threshold = (2*Fs)<<min_num_bits;

to renormalize :

x in [L,2L-1]
num_bits = min_num_bits;
if ( x >= threshold ) num_bits ++;
output num_bits from x
x >>= num_bits

x in [Fs,2Fs-1]

and then use C(x) since x is now in Is.

It's just straightforward optimization, but it actually allows us to picture the whole process in a different way. Let's write the same encoder, but just in terms of a table index :

let t = x - L
t in [0,L-1]

t is a table index.

to encode s :

num_bits = min_num_bits[s] + ( t >= threshold[s] );
bitoutput( t, num_bits );
t = encode_state_table[s][ (t+L)>>num_bits ];

That is, we're going from a table index to another table index. We're no longer thinking about going back to the [Fs,2Fs-1] precursor range at all.

Before we got our desired code len by the scaling of the intervals [L,2L)/[Fs,2Fs) , now the code len is the stored number of bits. We can see that we get fractional bits because sometimes we output one more.

Let's revisit an example that we went through previously , but with this new image.

L = 8
Fs = {3,3,2}
output string = ABCABCAB

We can see right away that our table index t is 3 bits. To encode a 'C' there will be only two slots on our numeral line that correspond to a lower digit of C, so we must output 2 bits and keep 1 bit of t. To encode an 'A' we can keep 3 values, so we can output 1 bit for t in [0,3] and 2 bits for t in [4,7] ; that will give us 2 retained values in the first region and 1 retained value in the second.

Explicitly :

t in [0,7]
I want to encode an A
so I want to reach {AxxAxxAx}

t in [0,3]
  output t&1
  index = (t+L)>>1 = 4 or 5
  take the last two A's {xxxAxxAx}
  so state -> 3 or 6

t in [4,7]
  output t&3
  index = (t+L)>>2 = 3
  take the first A {Axxxxxxx}
  state -> 0

Note that the way we're doing it, high states transition to low states, and vice versa. This comes up because of the +L sentry-bit method used to separate the subranges produced by the shift.

The tANS construction creates this encode table :

A : b=1+(t>=4) : {0,3,6}
B : b=1+(t>=4) : {1,4,7}
C : b=2+(t>=8) : {2,5}

It should be obvious that we can now drop all our mental ideas about "ANS" and just make these coding tables directly. All you need is an output string, and you think about doing these kinds of mapping :

t in [0,7]

I want to encode a B

[xxxxxxxx] -> [xBxxBxxB]

output bits to reduce the 3 values
and transition to one of the slots with a B

The decode table is trivial to make from the inverse :

 0: A -> 4 + getbits(2)
 1: B -> 4 + getbits(2)
 2: C -> 0 + getbits(2)
 3: A -> 0 + getbits(1)
 4: B -> 0 + getbits(1)
 5: C -> 4 + getbits(2)
 6: A -> 2 + getbits(1)
 7: B -> 2 + getbits(1)

Note that each symbol's decode covers the entire origin state range :

 0: A -> 4 + getbits(2)  from [4,7]
 3: A -> 0 + getbits(1)  from [0,1]
 6: A -> 2 + getbits(1)  from [2,3]

 1: B -> 4 + getbits(2)  from [4,7]
 4: B -> 0 + getbits(1)  from [0,1]
 7: B -> 2 + getbits(1)  from [2,3]

 2: C -> 0 + getbits(2)  from [0,3]
 5: C -> 4 + getbits(2)  from [4,7]

During decode we can think about our table index 't' as containing two pieces of information : one is the current symbol to output, but there's also some information about the range where t will be on the next step. That is, the current t contains some bits of the next t. The number of bits depends on where we are in the table. eg. in the example above; when t=4 we specify a B, but we also specify 2 bits worth of the next t.

Doing another example from that earlier post :



A : b=1+(t>=12) : {0,3,5,7,10,12,15}
B : b=1+(t>=8) : {1,4,6,9,11,14}
C : b=2+(t>=8) : {2,8,13}

 0: A -> 12 + getbits(2)
 1: B -> 8 + getbits(2)
 2: C -> 8 + getbits(3)
 3: A -> 0 + getbits(1)
 4: B -> 12 + getbits(2)
 5: A -> 2 + getbits(1)
 6: B -> 0 + getbits(1)
 7: A -> 4 + getbits(1)
 8: C -> 0 + getbits(2)
 9: B -> 2 + getbits(1)
10: A -> 6 + getbits(1)
11: B -> 4 + getbits(1)
12: A -> 8 + getbits(1)
13: C -> 4 + getbits(2)
14: B -> 6 + getbits(1)
15: A -> 10 + getbits(1)

and this concludes our conception of tANS in terms of just an [0,t-1] table.

I'm gonna be super redundant and repeat myself some more. I think it's intriguing that we went through all this ANS entropy coder idea, scaling values by (x/P) and so on, and from that we constructed tANS code. But you can get to the exact same tANS directly from the idea of the output string!

Let's invent tANS our new way, starting from scratch.

I'm given normalized frequencies {Fs}. Sum{Fs} = L. I want a state machine with L entries. Take each symbol and scatter it into our output string in some way.

To encode each symbol, I need to map the state machine index t in [0,L-1] to one of its occurrences in the output string.

There are Fs occurrences in the output string

I need to map an [0,L-1] value to an [0,Fs-1] value
by outputting either b or b+1 bits

now clearly if (L/Fs) is a power of 2, then the log2 of that is just b and we always output that many bits. (eg L=16, Fs=4, we just output 2 bits). In general if (L/Fs) is not a power of 2, then

b = floor(log2(L/Fs))
b+1 = ceil(log2(L/Fs))

so we just need two sub-ranges of L such that the total adds up to Fs :

threshold T
values < T output b bits
values >= T output b+1 bits

total of both ranges after output should equal Fs :

(T>>b) + (L-T)>>(b+1) = Fs

(2T + L-T)>>(b+1) = Fs

L+T = Fs<<(b+1)

T = (Fs<<(b+1)) - L

and that's it! We've just made a tANS encoder without talking about anything related to the ANS ideas at all.
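To sanity check the recipe, here's a tiny sketch (my code) that recomputes b and T and verifies the two sub-ranges land on exactly Fs states :

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// For one symbol : b = floor(log2(L/Fs)) output bits, plus one more
// when t >= T, with T = (Fs<<(b+1)) - L.
std::pair<uint32_t,uint32_t> bits_and_threshold(uint32_t L, uint32_t Fs) {
    uint32_t b = 0;
    while ((Fs << (b + 1)) <= L) b++;        // b = floor(log2(L/Fs))
    uint32_t T = (Fs << (b + 1)) - L;
    // (T>>b) + (L-T)>>(b+1) must equal Fs
    assert((T >> b) + ((L - T) >> (b + 1)) == Fs);
    return { b, T };
}
```

For L=8, Fs={3,3,2} it reproduces the b=1+(t>=4) and b=2+(t>=8) thresholds from the example above.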

The funny thing to me is that we got the exact same condition before from "b-uniqueness". That is, in order to be able to encode symbol s from any initial state, we worked out that the only valid precursor range was Is = [Fs,2*Fs-1] . That leads us to the renormalization loop :

while x > (2*Fs-1)
  output a bit from x; x>>= 1;

And from that we computed a minimum number of output bits, and a threshold state for one more. That threshold we computed was

(max_state + 1)<<min_num_bits

= (2*Fs-1 + 1)<<b
= Fs<<(b+1)

which is the same.

02-11-14 | Understanding ANS - 10

Not really an ANS topic, but a piece you need for ANS so I've had a look at it.

For ANS and many other statistical coders (eg. arithmetic coding) you need to create scaled frequencies (the Fs in ANS terminology) from the true counts.

But how do you do that? I've seen many heuristics over the years that are more or less good, but I've never actually seen the right answer. How do you scale to minimize total code len? Well let's do it.

Let's state the problem :

You are given some true counts Cs

Sum{Cs} = T  the total of true counts

the true probabilities then are

Ps = Cs/T

and the ideal code lens are log2(1/Ps)

You need to create scaled frequencies Fs
such that

Sum{Fs} = M

for some given M.

and our goal is to minimize the total code len under the counts Fs.

The ideal entropy of the given counts is :

H = Sum{ Ps * log2(1/Ps) }

The code len under the counts Fs is :

L = Sum{ Ps * log2(M/Fs) }

The code len is strictly worse than the entropy

L >= H

We must also meet the constraint

if ( Cs != 0 ) then Fs > 0

That is, all symbols that exist in the set must be codeable. (note that this is not actually optimal; it's usually better to replace all rare symbols with a single escape symbol, but we won't do that here).

The naive solution is :

Fs = round( M * Ps )

if ( Cs > 0 ) Fs = MAX(Fs,1);

which is just scaling up the Ps by M. This has two problems - one is that Sum{Fs} is not actually M. The other is that just rounding the float does not actually distribute the integer counts to minimize codelen.
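The sum problem is easy to see with a toy check (counts here are made up by me) :

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Naive scaling : Fs = round(M*Ps), clamped to 1 for live symbols.
// Returns Sum{Fs}, which is NOT guaranteed to equal M.
uint32_t naive_scaled_sum(const std::vector<uint32_t>& Cs, uint32_t M) {
    double T = 0;
    for (size_t s = 0; s < Cs.size(); s++) T += Cs[s];
    uint32_t sum = 0;
    for (size_t s = 0; s < Cs.size(); s++) {
        uint32_t F = (uint32_t)std::lround(M * (Cs[s] / T));
        if (Cs[s] > 0 && F == 0) F = 1;      // all live symbols must be codeable
        sum += F;
    }
    return sum;
}
```

With Cs = {6,5,2} and M = 16, rounding gives 7+6+2 = 15, one short of M; other count sets land over M instead.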

The usual heuristic is to do something like the above, and then apply some fix to make the sum right.

So first let's address how to fix the sum. We will always have issues with the sum being off M because of integer rounding.

What you will have is some correction :

correction = M - Sum{Fs}

that can be positive or negative. This is a count that needs to be added onto some symbols. We want to add it to the symbols that give us the most benefit to L, the total code len. Well that's simple, we just measure the effect of changing each Fs :

correction_sign = correction > 0 ? 1 : -1;

Ls_before = Ps * log2(M/Fs)
Ls_after = Ps * log2(M/(Fs + correction_sign))

Ls_delta = Ls_after - Ls_before
Ls_delta = Ps * ( log2(M/(Fs + correction_sign)) - log2(M/Fs) )
Ls_delta = Ps * log2(Fs/(Fs + correction_sign))

so we need to just find the symbol that gives us the lowest Ls_delta. This is either an improvement to total L, or the least increase in L.

We need to apply multiple corrections. We don't want a solution that's O(alphabet*correction) , since that can be 256*256 in bad cases. (correction is <= alphabet and typically in the 1-50 range for a typical 256-symbol file). The obvious solution is a heap. In pseudocode :

For all s
    push_heap( Ls_delta , s )

For correction
    s = pop_heap
    adjust Fs
    compute new Ls_delta for s
    push_heap( Ls_delta , s )

note that after we adjust the count we need to recompute Ls_delta and repush that symbol, because we might want to choose the same symbol again later.

In STL+cblib this is :

to[] = Fs
from[] = original counts

struct sort_sym
{
    int sym;
    float rank;
    sort_sym() { }
    sort_sym( int s, float r ) : sym(s) , rank(r) { }
    bool operator < (const sort_sym & rhs) const { return rank < rhs.rank; }
};


    if ( correction != 0 )
    {
        int32 correction_sign = (correction > 0) ? 1 : -1;

        vector<sort_sym> heap;

        for LOOP(i,alphabet)
        {
            if ( from[i] == 0 ) continue;
            ASSERT( to[i] != 0 );
            if ( to[i] > 1 || correction_sign == 1 )
            {
                double change = log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
                heap.push_back( sort_sym(i,change) );
            }
        }

        std::make_heap(heap.begin(),heap.end());

        while( correction != 0 )
        {
            ASSERT_RELEASE( ! heap.empty() );
            std::pop_heap(heap.begin(),heap.end());
            sort_sym ss = heap.back();
            heap.pop_back();

            int i = ss.sym;
            ASSERT( from[i] != 0 );

            to[i] += correction_sign;
            correction -= correction_sign;
            ASSERT( to[i] != 0 );

            if ( to[i] > 1 || correction_sign == 1 )
            {
                double change = log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
                heap.push_back( sort_sym(i,change) );
                std::push_heap(heap.begin(),heap.end());
            }
        }

        ASSERT( cb::sum(to,to+alphabet) == (uint32)to_sum_desired );
    }

You may have noted that the above code is using natural log instead of log2. The difference is only a constant scaling factor, so it doesn't affect the heap order; you may use whatever log base is fastest.

Errkay. So our first attempt is to just use the naive scaling Fs = round( M * Ps ) and then fix the sum using the heap correction algorithm above.

Doing round+correct gets you 99% of the way there. I measured the difference between the total code len made that way and the optimal, and they are less than 0.001 bpb different on every file I tried. But it's still not quite right, so what is the right way?

To guide my search I had a look at the cases where round+correct was not optimal. When it's not optimal it means there is some symbol a and some symbol b such that { Fa+1 , Fb-1 } gives a better total code len than {Fa,Fb}. An example of that is :

count to inc : (1/1024) was (1866/1286152 = 0.0015)
count to dec : (380/1024) was (482110/1286152 = 0.3748)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43
to dec; cl before : 1.43 cl after : 1.43 , true cl : 1.42

The key point is on the 1 count :

count to inc : (1/1024) was (1866/1286152 = 0.0015)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43

1024*1866/1286152 = 1.485660
round(1.485660) = 1

so Fs = 1 , which is a codelen of 10

but Fs = 2 gives a codelen (9) closer to the true codelen (9.43)

And this provided the key observation : rather than rounding the scaled count, what we should be doing is either floor or ceil of the fraction, whichever gives a codelen closer to the true codelen.

BTW before you go off hacking a special case just for Fs==1, it also happens with higher counts :

count to inc : (2/1024) was (439/180084) scaled = 2.4963
to inc; cl before : 9.00 cl after : 8.42 , true cl : 8.68

count to inc : (4/1024) was (644/146557) scaled = 4.4997
to inc; cl before : 8.00 cl after : 7.68 , true cl : 7.83

though obviously the higher Fs, the less likely it is because the rounding gets closer to being perfect.

So it's easy enough just to solve exactly, simply pick the floor or ceil of the ratio depending on which makes the closer codelen :

Ps = Cs/T from the true counts

down = floor( M * Ps )
down = MAX( down,1)

Fs = either down or (down+1)

true_codelen = -log2( Ps )
down_codelen = -log2( down/M )
  up_codelen = -log2( (down+1)/M )

if ( |down_codelen - true_codelen| < |up_codelen - true_codelen| )
  Fs = down
else
  Fs = down+1

And since all we care about is the inequality, we can do some maths and simplify the expressions. I won't write out all the algebra to do the simplification because it's straightforward, but there are a few key steps :

| log(x) | = log( MAX(x,1/x) )

log(x) >= log(y)  is the same as x >= y

down <= M*Ps
down+1 >= M*Ps

the result of the simplification in code is :

from[] = original counts (Cs) , sum to T
to[] = normalized counts (Fs) , will sum to M

    double from_scaled = from[i] * (double) M / T;

    uint32 down = (uint32)( from_scaled );
    to[i] = ( from_scaled*from_scaled <= down*(down+1) ) ? down : down+1;

Note that there's no special casing needed to ensure that (from_scaled < 1) gives you to[i] = 1 , we get that for free with this expression.

I was delighted when I got to this extremely simple final form.

And that is the conclusion. Use that to find the initial scaled counts. There will still be some correction that needs to be applied to reach the target sum exactly, so use the heap correction algorithm above.

As a final note, if we look at the final expression :

to[i] = ( from_scaled*from_scaled <= down*(down+1) ) ? down : down+1;

to[i] = ( test <= 0 ) ? down : down+1;

test = from_scaled*from_scaled - down*(down+1);

from_scaled = down + frac

test = (down + frac)^2 - down*(down+1);

solve for frac where test = 0

frac = sqrt( down^2 + down ) - down

That gives you the fractional part of the scaled count where you should round up or down. It varies with floor(from_scaled). The actual values are :

1 : 0.414214
2 : 0.449490
3 : 0.464102
4 : 0.472136
5 : 0.477226
6 : 0.480741
7 : 0.483315
8 : 0.485281
9 : 0.486833
10 : 0.488088
11 : 0.489125
12 : 0.489996
13 : 0.490738
14 : 0.491377
15 : 0.491933
16 : 0.492423
17 : 0.492856
18 : 0.493242
19 : 0.493589

You can see as Fs gets larger, it goes to 0.5 , so just using rounding is close to correct. It's really in the very low values where it's quite far from 0.5 that errors are most likely to occur.

02-10-14 | Understanding ANS - 9

If you just want to understand the basics of how ANS works, you may skip this post. I'm going to explore some unsolved issues about the sort order.

Some issues about constructing the ANS sort order are still mysterious to me. I'm going to try to attack a few points.

One thing I wrote last time needs some clarification - "Every slot has an equal probability of 1/M."

What is true is that every character of the output string is equiprobable (assuming again that the Fs are the true probabilities). That is, if you have the string S[] with L symbols, each symbol s occurs Fs times, then you can generate symbols with the correct probability by just drawing S[i] with random i.

The output string S[] also corresponds to the destination state of the encoder in the renormalization range I = [L,2L-1]. What is not true is that all states in I are equally probable.

To explore this I did 10,000 random runs of encoding 10,000 symbols each time. I used L=1024 each time, and gathered stats from all the runs.

This is the actual frequency of the state x having each value in [1024,2047] (scaled so that the average is 1000) :

The lowest states (x=1024) are the most probable, with roughly 2X the frequency of the highest, least probable states (x=2047).

Note : this data was generated using Duda's "precise initialization" (my "sort by sorting" with 0.5 bias). Different table constructions will create different utilization graphs. In particular the various heuristics will have some weird bumps. And we'll see what different bias does later on.

This is the same data with 1/X through it :

This probability distribution (1/X) can be reproduced just from doing this :

            x = x*b + irandmod(b); // for any base b that's not a power of 2 (a power-of-2 base is exactly undone by the shift)
            while( x >= 2*K ) x >>= 1;
            stats_count[x-K] ++;            

though I'd still love to see an analytic proof and understand that better.

So, the first thing I should correct is : final states (the x' in I) are not equally likely.

How that should be considered in sort construction, I do not know.

The other thing I've been thinking about was why did I find that the + 1.0 bias is better in practice than the + 0.5 bias that Duda suggests ("precise initialization") ?

What the +1 bias does is push low probability symbols further towards the end of the sort order. I've been contemplating why that might be good. The answer is not that the end of the sort order makes longer codelens, because that kind of issue has already been accounted for.

My suspicion was that the +1 bias was beating the +0.5 bias because of the difference between normalized counts and unnormalized original counts.

Recall that to construct the table we had to make normalized frequences Fs that sum to L. These, however, are not the true symbol frequencies (except in synthetic tests). The true symbol frequencies had to be scaled to sum to L to make the Fs.

The largest coding error from frequency scaling is on the least probable symbols. In fact the very worst case is symbols that occur only once in a very large file. eg. in a 1 MB file a symbol occurs once; its true probability is 2^-20 and it should be coded in 20 bits. But when we scale the frequencies to sum to 1024 (for example), that symbol still must get a count of 1, so it's coded in 10 bits.

What the +1 bias does is take the least probable symbols and push them to the end of the table, which maximizes the number of bits they take to code. If the {Fs} were the true frequencies, this would be bad, and the + 0.5 bias would be better. But the {Fs} are not the true frequencies.

This raises the question - could we make the sort order from the true frequencies instead of the scaled ones? Yes, but you would then have to either transmit the true frequencies to the decoder, or transmit the sort order. Either way takes many more bits than transmitting the scaled frequencies. (in fact in the real world you may wish to transmit even approximations of the scaled frequencies). You must ensure the encoder and decoder use the same frequencies so they build the same sort order.

Anyway, I tested this hypothesis by making buffers synthetically by drawing symbols from the {Fs} random distribution. I took my large testset, for each file I counted the real histogram, made the scaled frequencies {Fs}, then regenerated the buffer from the frequencies {Fs} so that the statistics match the data exactly. I then ran tANS on the synthetic buffers and on the original file data :

synthetic data :

total bytes out : 146068969.00  bias=0.5
total bytes out : 146117818.63  bias=1.0

real data :

total bytes out : 144672103.38  bias=0.5
total bytes out : 144524757.63  bias=1.0

On the synthetic data, bias=0.5 is in fact slightly better. On the real data, bias=1.0 is slightly better. This confirms that the difference between the normalized counts & unnormalized counts is in fact the origin of 1.0's win in my previous tests, but doesn't necessarily confirm my guess for why.

An idea for an alternative to the bias=1 heuristic is you could use bias=0.5 , but instead of using the Fs for the sort order, use the estimated original count before normalization. That is, for each Fs you can have a probability model of what the original count was, and select the maximum-likelihood count from that. This is exactly analogous to restoring to expectation rather than restoring to middle in a quantizer.

Using bias=1.0 and measuring state occurrence counts, we get this :

Which mostly has the same 1/x curve, but with a funny tail at the end. Note that these graphs are generated on synthetic data.

I'm now convinced that the 0.5 bias is "right". It minimizes measured output len on synthetic data where the Fs are the true frequencies. It centers each symbol's occurrences in the output string. It reproduces the 1/x distribution of state frequencies. However there is still the missing piece of how to derive it from first principles.


While I was at it, I gathered the average number of bits output when coding from each state. If you're following along with Yann's blog he's been explaining FSE in terms of this. tANS outputs bits to get the state x down into the coding range Is for the next symbol. The Is are always lower than I (L), so you have to output some bits to scale down x to reach the Is. x starts in [L,2L) and we have to output bits to reach [Fs,2Fs) ; the average number of bits required is like log2(L/Fs) which is log2(1/P) which is the code length we want. Because our range is [L,2L) we know the average output bit count from each state must differ by 1 from the top of the range to the bottom. In fact it looks like this :

Another way to think about it is that at state = L, the state is empty. As state increases, it holds some fractional bits of information in the state variable. That number of fractional bits goes from 0 at L up to 1 at 2L.

Ryg just pointed me at a proof of the 1/x distribution in Moffat's "Arithmetic Coding Revisited" (DCC98).

The "x" in ANS has the same properties as the range ("R") in an arithmetic coder.

The bits of information in x is I ~= log2( x )

the fractional part of I is in [0,1) and is a uniform random value, Pr(I) ~= 1

if log(x) has a uniform distribution, then Pr(x) must be ~= 1/x

The fact that I is uniform is maybe not entirely obvious; Moffat just hand-waves about it. Basically you're accumulating a random variable into I ( -log2(P_sym) ) and then dropping the integer part; the result is a fractional part that's random and uniform.
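Spelled out slightly (a sketch, with the information measured relative to the bottom of the range):

```latex
u = \log_2 x - \log_2 L \;\in\; [0,1) \quad \text{(uniform)}

\Pr(x \le X) \;=\; \Pr\!\left( u \le \log_2 (X/L) \right) \;=\; \log_2 (X/L)

p(x) \;=\; \frac{d}{dx}\, \log_2 (x/L) \;=\; \frac{1}{x \ln 2} \;\propto\; \frac{1}{x}
```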

02-06-14 | Understanding ANS - 8

Time to address an issue that we've skirted for some time - how do you make the output string sort order?

Recall : The output string contains Fs occurrences of each symbol. For naive rANS the output string is just in alphabetical order (eg. "AAABBBCCC"). With tANS we can use any permutation of that string.

So what permutation should we use? Well, the output string determines the C() and D() encode and decode tables. It is in fact the only degree of freedom in table construction (assuming the same constraints as last time, b=2 and L=M). So we should choose the output string to minimize code length.

The guiding principle will be (x/P). That is, we achieve minimum total length when we make each code length as close to log2(1/P) as possible. We do that by making the input state to output state ratio (x'/x) as close to (1/P) as possible.

(note for the record : if you try to really solve to minimize the error, it should not just be a distance between (x'/x) and (1/P) , it needs to be log-scaled to make it a "minimum rate" solution). (open question : is there an exact solution for table building that finds the minimum rate table that isn't NP (eg. not just trying all permutations)).

Now we know that the source state always comes from the precursor ranges Is, and we know that

destination range :
I = [ M , 2*M - 1]

source range :
Is = [ Fs, 2*Fs - 1 ] for each symbol s

and Ps = Fs/M

so the ideal target for the symbols in each source range is :

target in I = (1/Ps) * (Is) = (M/Fs) * [ Fs, 2*Fs - 1 ] = [ M , 2*M - M/Fs ]

and taking off the +M bias to make it a string index in the range [0,M-1] :

Ts = target in string = target in I - M

Ts = { 0 , M * 1/Fs , M * 2/Fs , ... }

Essentially, we just need to take each symbol and spread its Fs occurrences evenly over the output string.

Now there's a step that I don't know how to justify without waving my hands a bit. It works slightly better if we imagine that the source x was not just an integer, but rather a bucket that covers the unit range of that integer. That is, rather than starting exactly at the value "x = Fs" you start in the range [Fs,Fs+1]. So instead of just mapping up that integer by 1/P we map up the range, and we can assign a target anywhere in that range. In the paper Duda uses a bias of 0.5 for "precise initialization" , which corresponds to assuming the x's start in the middle of their integer buckets. That is :

Ts = { M * (b/Fs), M* (1+b)/Fs, M * (2+b)/Fs , ... }

with b = 0.5 for Duda's "precise initialization". Obviously b = 0.5 makes T centered on the range [0,M] , but I see no reason why that should be preferred.

Now assuming we have these known target locations, you can't just put all the symbols into the target slots that they want, because lots of symbols want the same spot.

For example :

M = 8 , F_A = 3 , F_B = 3 , F_C = 2 , bias = 0.5 :


T_A = { 8 * 0.5/3 , 8 * 1.5 / 3 , 8 * 2.5 / 3 } = { 1 1/3 , 4 , 6 2/3 }
T_B = T_A
T_C = { 8 * 0.5/2 , 8 * 1.5/2 } = { 2 , 6 }

One way to solve this problem is to start assigning slots, and when you see that one is full you just look in the neighbor slot, etc. So you might do something like :

initial string is empty :

string = "        "

put A's in 1,4,6

string = " A  A A "

put B's in 1,4,6 ; oops they're taken, shift up one to find empty slots :

string = " AB ABAB"

put C's in 2,6 ; oops they're taken, hunt around to find empty slots :

string = "CABCABAB"

now obviously you could try to improve this kind of algorithm, but there's no point. It's greedy so it makes mistakes in the overall optimization problem (it's highly dependent on order). It can also be slow because it spends a lot of time hunting for empty slots; you'd have to write a fast slot allocator to avoid degenerate bad cases. There are other ways.

Another thing I should note is that when doing these target slot assignments, there's no reason to prefer the most probable symbol first, or the least probable or whatever. The reason is every symbol occurrence is equally probable. That is, symbol s has frequency Fs, but there are Fs slots for symbol s, so each slot has a frequency of 1. Every slot has an equal probability of 1/M.

An alternative algorithm that I have found to work well is to sort the targets. That is :

make a sorting_array of size M

add { Ts, s } to sorting_array for each symbol  (that's Fs items added)

sort sorting_array by target location

the symbols in sorting_array are in output string order

I believe that this is identical to Duda's "precise initialization" which he describes using push/pop operations on a heap; the result is the same - assigning slots in the order of desired target location.

Using the sort like this is a little weird. We are no longer explicitly trying to put the symbols in their target slots. But the targets (Ts) span the range [0, M] and the sort is an array of size M, so they wind up distributed over that range. In practice it works well, and it's fast because sorting is fast.

A few small notes :

You want to use a "stable" sort, or bias the target by some small number based on the symbol. The reason is you will have lots of ties, and you want the ties broken consistently. eg. for "AABBCC" you want "ABCABC" or "CBACBA" but not "ABCCAB". One way to get a stable sort is to make the sorting_array work on U32's, and pack the sort rank into the top 24 bits and the symbol id into the bottom 8 bits.

The bias = 0.5 that Duda uses is not strongly justified, so I tried some other numbers. bias = 0 is much worse. It turns out that bias = 1.0 is better. I tried a bunch of values on a large test set and found that bias = 1 is consistently good.

One very simple way to get a decent sort is to bit-reverse the rANS indexes. That is, start from a rANS/alphabetical order string ("AABB..") and take the index of each element, bit-reverse that index (so 0001 -> 1000) , and put the symbol in the bit reversed slot. While this is not competitive with the proper sort, it is simple and one pass.

Another possible heuristic is to just scatter the symbols by stepping through the table with a stride that's relatively prime to M. This is what Yann does in fse.c.

All the files in Calgary Corpus :
(compression per file; sum of output sizes)

M = 1024

rANS/alphabetical : 1824053.75

bit reverse : 1805230.75

greedy search for empty slots : 1801351

Yann's heuristic in fse.c : 1805503.13

sort , bias = 0.0 : 1817269.88

sort , bias = 0.5 : 1803676.38  (Duda "precise")

sort , bias = 1.0 : 1798930.75

Before anyone throws a fit - yes, I tested on my very large test set, not just calgary. The results were consistent on all the test sets I tried. I also tested with larger M (4096) and the results were again the same, though the differences are smaller the larger you make M.

For completeness, here is what the sorts actually do :


bit reverse :   ABABABACABACABBC

greedy search : CABABACABABACABB

greedy search, LPS first :  ABCABAACBABACBAB

Yann fse :          AAABBCAABBCAABBC

sort , bias = 0.0 : ABCABABCABABCABA

sort , bias = 0.5 : ABCABABACBABACBA

sort , bias = 1.0 : ABABCABABCABAABC

but I caution against judging sorts by whether they "look good" since that criteria does not seem to match coding performance.

Finally for clarity, here's the code for the simpler sorts :

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );
    const int fse_step = (sort_count>>1) + (sort_count>>3) + 1;
    int fse_pos = 0;
    int s = 0;
    for LOOP(a,alphabet)
    {
        int count = normalized_counts[a];

        for LOOP(c,count)
        {
            // choose one :

            // rANS :
            sorted_syms[s] = a;

            // fse :
            sorted_syms[fse_pos] = a;
            fse_pos = (fse_pos + fse_step) % sort_count;

            // bitreverse :
            sorted_syms[ bitreverse(s, numbits(sort_count)) ] = a;

            s++;
        }
    }
}


and the code for the actual sorting sort (recommended) :

struct sort_sym
{
    int sym;
    float rank;
    bool operator < (const sort_sym & rhs) const
    {
        return rank < rhs.rank;
    }
};

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    vector<sort_sym> sort_syms(sort_count);

    int s = 0;

    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        float invp = 1.f / count;
        float base = 1.f * invp; // 0.5f for Duda precise

        for LOOP(c,(int)count)
        {
            sort_syms[s].sym = sym;
            sort_syms[s].rank = base + c * invp;
            s++;
        }
    }
    ASSERT_RELEASE( s == sort_count );

    std::stable_sort( sort_syms.begin(), sort_syms.end() ); // stable, so ties break consistently

    for LOOP(s,sort_count)
        sorted_syms[s] = sort_syms[s].sym;
}

and for the greedy search :

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    // make all slots empty :
    for LOOP(s,sort_count)
        sorted_syms[s] = -1;

    for LOOP(a,alphabet)
    {
        uint32 count = normalized_counts[a];
        if ( count == 0 ) continue;
        uint32 step = (sort_count + (count/2) ) / count;
        uint32 first = step/2;
        for LOOP(c,(int)count)
        {
            uint32 slot = (first + step * c) % sort_count;
            // find an empty slot :
            while ( sorted_syms[slot] != -1 )
                slot = (slot + 1)%sort_count;
            sorted_syms[slot] = a;
        }
    }
}
small note : the reported results use a greedy search that searches away from slot using +1,-1,+2,-2 , instead of the simpler +1,+2 in this code snippet. This simpler version is very slightly worse.

02-05-14 | Understanding ANS - 7

And we're ready to cover table-based ANS (or "tANS") now.

I'm going to be quite concrete and consider a specific choice of implementation, rather than leaving everything variable. But extrapolation to the general solution is straightforward.

You have integer symbol frequencies Fs. They sum to M. The cumulative frequencies are Bs.

I will stream the state x in bits. I will use the smallest possible renormalization range for this example , I = [ M , 2*M - 1]. You can always use any integer multiple of M that you want (k*M, any k), which will give you more coding resolution (closer to entropy). This is equivalent to scaling up all the F's by a constant factor, so it doesn't change the construction here.

Okay. We will encode/decode symbols using this procedure :

ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

We need tables for C() and D(). The constraints are :

D(x') = { x , s }  outputs a state and a symbol

D(x) must be given for x in I = [ M , 2*M - 1 ]

D(x) in I must output each symbol s Fs times

that is, D(x in I) must be an output string made from a permutation of "AA..BB.." , each symbol Fs times

D( C( s, x ) ) = { x , s }  decode must invert coding

C(s,x) = x'  outputs the following state

C(s,x) must be given for x' in I
 that's x in Is

The precursor ranges Is = { x : C(s,x) is in I }
must exist and be of the form Is = [ k , 2k-1 ] for some k

Now, if we combine the precursor range requirement and the invertability we can see :

D(x in I) outputs each s Fs times

C(s,x) with x' in I must input each s Fs times

the size of Is must be Fs

the precursor ranges must be Is = [ Fs, 2*Fs - 1 ]

C(s,x) must be given in M slots

And I believe that's it; those are the necessary and sufficient conditions to make a valid tANS system. I'll go over some more points and fill in some details.

Here's an example of the constraint for an alphabet of "ABC" and M = 8 :

Now, what do you put in the shaded area? You just fill in the output states from 8-15. The order you fill them in corresponds to the output string. In this case the output string must be some permutation of "AAABBBCC".

Here's one way : (and in true Duda style I have confusingly used different notation in the image, since I drew this a long time ago before I started this blog series. yay!)

In the image above I have also given the corresponding output string and the decode table. If you're following along in Duda's paper arxiv 1311.2540v2 this is figure 9 on page 18. What you see in figure 9 is a decode table. The "step" part of figure 9 is showing one method of making the sort string. The shaded bars on the right are showing various permutations of an output string, with a shading for each symbol.

Before I understood ANS I was trying tables like this :

Fs = {7,6,3}

x\s|  0|  1|  2
  1|  2|  3|  4
  2|  5|  6| 10
  3|  7|  8| 15
  4|  9| 11| 20
  5| 12| 13| 26
  6| 14| 16| 31
  7| 17| 19|   
  8| 18| 22|   
  9| 21| 24|   
 10| 23| 27|   
 11| 25| 29|   
 12| 28|   |   
 13| 30|   |   

This table does not work. If you're in state x = 7 and you want to encode symbol 2, you need to stream out bits to get into the precursor range I2. So you stream out from x=7 and get to x=3. Now you look in the table and you are going to state 15 - that's not in the range I=[16,31]. No good!

A correct table for those frequencies is :

x\s|  0|  1|  2
  3|   |   | 18
  4|   |   | 24
  5|   |   | 29
  6|   | 17|   
  7| 16| 20|   
  8| 19| 22|   
  9| 21| 25|   
 10| 23| 27|   
 11| 26| 31|   
 12| 28|   |   
 13| 30|   |   

Building the decode table from the encode table is trivial.

Note that the decode table D(x) only has to be defined for x in I - that's M entries.

C(x,s) also only has M entries. If you made it naively as a 2d array, it would be |alphabet|*M . eg. something like (256*4096) slots, but most of it would be empty. Of course you don't want to do that.

The key observation is that C(x,s) is only defined over consecutive ranges of x for each s. In fact it's defined over [Fs, 2*Fs-1]. So, we can just pack these ranges together. The starting point in the packed array is just Bs - the cumulative frequency of each symbol. That is :

PC = packed coding table
PC has M entries

C(x,s) = PC[ Bs + (x - Fs) ]

eg. for the {3,3,2} table shown in the image above :

PC = { 8,11,14, 9,12,15, 10,13 }

this allows you to store the coding table also in an array of size M.

There are a few topics on tANS left to cover but I'll leave them for the next post.

02-04-14 | Understanding ANS - 6

Okay, let's get to streaming.

For illustration let's go back to the simple example of packing arbitrary base numbers into an integer :

// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

as you encode, state grows, and eventually gets too big to fit in an integer. So we need to flush out some bits (or bytes).

But we can't just stream out bits. The problem is that the decoder does a modulo to get the next value. If we stream in and out high bits, that's equivalent to doing something like +65536 on the value. When you do a mod-3 (or whatever) on that, you have changed what you decode.

If you only ever did mod-pow2's, you could stream bits out of the top at any time, because the decoding of the low bits is not affected by the high bits. This is how the Huffman special case of ANS works. With Huffman coding you can stream in and out any bits that are above the current symbol, because they don't affect the mask at the bottom.

In general we want to stream bits (base 2) or bytes (base 256). To do ANS in general we need to mod and multiply by arbitrary values that are not powers of 2 (or 256).

To ensure that we get decodability, we have to stream such that the decoder sees the exact value that the encoder made. That is :

ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

The key thing is that the value of x' that C(s,x) makes is exactly the same that goes into D(x).

This is different from Huffman, as noted above. It's also different than arithmetic coding, which can have an encoder and decoder that are out of sync. An arithmetic decoder only uses the top bits, so you can have more or less of the rest of the stream in the low bits. While the basic ANS step (x/P + C) is a kind of arithmetic coding step, the funny trick we did to take some bits of x and mod it back down to the low bits (see earlier posts) means that ANS is *not* making a continuous arithmetic stream for the whole message that you can jump into anywhere.

Now it's possible there are multiple streaming methods that work. For example with M = a power of 2 in rANS you might be able to stream high bytes. I'm not sure, and I'm not going to talk about that in general. I'm just going to talk about one method of streaming that does work, which Duda describes.

To ensure that our encode & decode streaming produce the same value of x', we need a range to keep it in. If you're streaming in base b, this range is of the form [L, b*L-1] . So, I'll use Duda's terminology and call "I" the range we want x' to be in for decoding, that is

I = [L, b*L-1]

Decoder streams into x :

x <- x*b + get_base_b();

until x is in I

but the encoder must do something a bit funny :

stream out from x

x' = C(s,x)  , coding function

x' now in I

that is, the stream out must be done before the coding function, and you must wind up in the streaming range after the coding function. x' in the range I ensures that the encoder and decoder see exactly the same value (because any more streaming ops would take it out of I).

To do this, we must know the "precursor" ranges for C(). That is :

Is = { x : C(s,x) is in I }

that is, the values of x such that after coding with x' = C(s,x), x' is in I

these precursor ranges depend on s. So the encoder streaming is :

I'm about to encode symbol s

stream out bits from x :

put_base_b( x % b )
x <- x/b

until x is in Is

so we get into the precursor range, and then after the coding step we are in I.

Now this is actually a constraint on the coding function C (because it determines what the Is are). You must be able to encode any symbol from any state. That means you must be able to reach the Is precursor range for any symbol from any x in the output range I. For that to be true, the Is must span a power of b, just like "I" does. That is,

all Is must be of the form

Is = [ K, b*K - 1 ]

for some K

eg. to be concrete, if b = 2, we're streaming out bits, then Is = { 3,4,5 } is okay, you will be able to get there from any larger x by streaming out bits, but Is = {4,5,6} is not okay.

I = [8, 15]

Is = {4,5,6}

x = 14

x is out of Is, so stream out a bit ; 14 -> 7

x is out of Is, so stream out a bit ; 7 -> 3

x is below Is!  crap!

this constraint will be our primary guide in building the table-based version of ANS.

02-03-14 | Understanding ANS - 5

First in case you aren't following them already, you should follow along with ryg and Yann as we all go through this :
RealTime Data Compression : A comparison of Arithmetic Encoding with FSE
The ryg blog : rANS notes

Getting back to my slow exposition track.

We talked before about how strings like "ABC" specify an ANS state machine. The string is the symbols that should be output by the decoder in each state, and there's an implicit cyclic repeat, so "ABC" means "ABCABC..". The cyclic repeat corresponds to only using some modulo of state in the decoder output.

Simple enumerations of the alphabet (like "ABC") are just flat codes. We saw before that power of two binary-distributed strings like "ABAC" are Huffman codes.

What about something like "AAB" ? States 0 and 1 both output an A. State 2 outputs a B. That means A should have twice the probability of B.

How do we encode a state like that? Putting in a B is obvious, we need to make the bottom of state be a 2 (mod 3) :

x' = x*3 + 2

but to encode an A, if we just did the naive op :

x' = x*3 + 0
x' = x*3 + 1

we're wasting a value. Either a 0 or 1 at the bottom would produce an A, so we have a free bit there. We need to make use of that bit or we are wasting code space. So we need to find a random bit to transmit to make use of that freedom. Fortunately, we have a value sitting around that needs to be transmitted that we can pack into that bit - x!

take a bit off x :
b = x&1
x >>= 1

x' = x*3 + b

or :

x' = (x/2)*3 + (x%2)

more generally if the output string is of length M and symbol s occurs Fs times, you would do :

x' = (x/Fs)*M + (x%Fs) + Bs

which is the formula for rANS.

Now, note that rANS always makes output strings where the symbols are not interleaved. That is, it can make "AAB" but it can't make "ABA". The states that output the same symbol are in consecutive blocks of length Fs.

This is actually not what we want, it's an approximation in rANS.

For example, consider a 3-letter alphabet and M=6. rANS corresponds to an output string of "AABBCC". We'd prefer "ABCABC". To see why, recall the arithmetic coding formula that we wanted to use :

x' = (x/P) + C

the important part being the (x/P). We want x to grow by that amount, because that's what gives us compression to the entropy. If x grows by too much or too little, we aren't coding with codelens that match the probabilities, so there will be some loss.

P = F/M , and we will assume for now that the probabilities are such that the rational expression F/M is the true probability. What we want is to do :

x' = (x*M)/F + C

to get a more accurate scaling of x. But we can't do that because in general that division will cause x' to not fall in the bucket [Cs,Cs+1) , which would make decoding impossible.

So instead, in rANS we did :

x' = floor(x/F)*M + C + (x%F)

the key part here being that we had to do floor(x/F) instead of (x/P), which means the bottom bits of x are not contributing to the 1/P scaling the way we want them to.


x = 7
F = 2
M = 6
P = 1/3

should scale like

x -> x/P = 21

instead scales like

x -> (x/F)*M + (x%F) = (7/2)*6 + (7%2) = 3*6 + 1 = 19

too low
because we lost the bottom bit of 7 when we did (7/2)

In practice this does in fact make a difference when the state value (x) is small. When x is generally large (vs. M), then (x/F) is a good approximation of the correct scaling. The closer x is to M, the worse the approximation.

In practice with rANS, you should use something like x in 32-bits and M < 16-bits, so you have decent resolution. For tANS we will be using much lower state values, so getting this right is important.

As a concrete example :

alphabet = 3
1000 random symbols coded
H = 1.585
K = 6
6 <= x < 12

output string "ABCABC"
wrote 1608 bits

output string "AABBCC"
wrote 1690 bits

And a drawing of what's happening :

I like the way Jarek Duda called rANS the "range coder" variant of ANS. While the analogy is not perfect, there is a similarity in the way it approximates the ideal scaling and gains speed.

The crucial difference between a "range coder" and prior (more accurate) arithmetic coders is that the range coder does a floor divide :

range coder :

symhigh * (range / symtot)

CACM (and such) :

(symhigh * range) / symtot

this approximation is good as long as range is large and symtot is small, just like with rANS.
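You can see the size of that approximation error directly; here's a little sketch of the two divide orders (the numbers are mine, just picked to make the floor loss visible) :

```cpp
#include <cassert>
#include <cstdint>

// CACM-style : multiply first, then divide (more accurate)
uint32_t scale_cacm(uint32_t symhigh, uint32_t range, uint32_t symtot)
{
    return (uint32_t)(((uint64_t)symhigh * range) / symtot);
}

// range-coder style : floor divide first, then multiply (faster, lossy)
uint32_t scale_range(uint32_t symhigh, uint32_t range, uint32_t symtot)
{
    return symhigh * (range / symtot);
}
```

With range=1000, symtot=300, symhigh=299 : the CACM order gives 996 but the range coder gives 897, because the floor of (range/symtot) threw away the remainder before it got multiplied. Make range a million and the two answers agree to within a fraction of a percent, which is why the approximation is fine when range is large vs symtot.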

02-02-14 | Understanding ANS - 4

Another detour from the slow exposition to mention something that's on my mind.

Let's talk about arithmetic coding.

Everyone is familiar with the standard simplified idea of arithmetic coding. Each symbol has a probability P(s). The sum of all preceding probability is the cumulative probability, C(s).

You start with an interval in [0,1]. Each symbol is specified by a range equal to P(s) located at C(s). You reduce your interval to the range of the current symbol, then put the next symbol within that range, etc. Like this :

As you go, you are making a large number < 1, with more and more bits of precision being added at the bottom. In the end you have to send enough bits so that the stream you wanted is specified. You get compression because more likely streams have larger intervals, and thus can be specified with fewer bits.

In the end we just made a single number that we had to send :

x = C0 + P0 * ( C1 + P1 * (C2 ...

in order to make that value in a FIFO stream, we would have to use the normal arithmetic coding style of tracking a current low and range :

currently at [low,range]
add in Cn,Pn

low <- low + range * Cn
range <- range * Pn

and of course for the moment we're assuming we get to use infinite precision numbers.
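Faking the infinite precision with doubles, the FIFO update looks like this (the 3-symbol alphabet here is hypothetical, just for illustration) :

```cpp
#include <cassert>
#include <cmath>

// hypothetical 3-symbol alphabet :
const double P[3] = { 0.5, 0.25, 0.25 };  // probabilities
const double C[3] = { 0.0, 0.5, 0.75 };   // cumulative probabilities

// FIFO arithmetic encode : returns low ; any x in [low, low+range) specifies the stream
double encode_fifo(const int *syms, int n)
{
    double low = 0.0, range = 1.0;
    for (int i = 0; i < n; i++)
    {
        low   += range * C[syms[i]];  // low <- low + range * Cn
        range *= P[syms[i]];          // range <- range * Pn
    }
    return low;
}
```

eg. the stream {0,2} lands at low = 0 + 0.5*0.75 = 0.375 with range 0.125.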

But you can make the same final value x another way. Start at the end of the stream and work backwards, LIFO :


x contains all following symbols already encoded
x in [0,1]

x' <- Cn + Pn * x

there's no need to track two variables [low,range], you work from back to front and then send x to specify the whole stream. (This in fact is an ancient arithmetic coder. I think it was originally described by Elias even before the Pasco work. I mention this to emphasize that single variable LIFO coding is nothing new, though the details of ANS are in fact quite new. Like "range coding" vs prior arithmetic coders , it can be the tiny details that make all the difference.) (umm, holy crap, I just noticed the ps. at the bottom of that post ("ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand")).

You can decode an individual step thusly :

x in [0,1]
find s such that C(s) <= x < C(s+1)

x' <- (x - Cs)/Ps
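Here's the whole LIFO scheme as a runnable sketch (again with doubles standing in for infinite precision, and a made-up 3-symbol alphabet) :

```cpp
#include <cassert>
#include <cmath>

// hypothetical 3-symbol alphabet :
const double P[3] = { 0.5, 0.25, 0.25 };
const double C[3] = { 0.0, 0.5, 0.75 };

// LIFO encode step : x' <- Cs + Ps * x , with x in [0,1)
double encode_step(double x, int s) { return C[s] + P[s]*x; }

// decode step : find s such that C(s) <= x < C(s+1) , then x' <- (x - Cs)/Ps
int decode_step(double x, double *xout)
{
    int s = (x < C[1]) ? 0 : (x < C[2]) ? 1 : 2;
    *xout = (x - C[s]) / P[s];
    return s;
}
```

You encode back-to-front, then decode front-to-back; a single state variable, no [low,range] pair.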

Now let's start thinking about doing this in finite precision, or at least fixed point.

If we think of our original arithmetic coding image, growing "x" up from 0, instead of keeping x in [0,1] the whole time, let's keep the active interval in [0,1]. That is, as we go we rescale so that the bottom range is [0,1] :

That is, instead of keeping the decimal at the far left and making a fraction, we keep the decimal at the far right of x and we grow the number upwards. Each coding step is :

x' <- x/Ps + Cs

in the end we get a large number that we have to send using log2(x) bits. We get compression because highly probable symbols will make x grow less than improbable symbols, so more compressible streams will make smaller values of x.

(x = the value before the current step, x' = the value after the current step)

We can decode each step simply if the (x/Ps) are integers, then the Cs is the fractional part, so we just do :

f = frac(x)
find s such that C(s) <= f < C(s+1)

x' <- (x - Cs)*Ps
that is, we think of our big number x as having a decimal point with the fractional part Cs on the right, and the rest of the stream is in a big integer on the left. That is :

[ (x/Ps) ] . [ Cs ]

Of course (x/Ps) is not necessarily an integer, and we don't get to do infinite precision math, so let's fix that.

Let :

Ps = Fs / M

P is in [0,1] , a symbol probability
F is an integer frequency
M is the frequency denominator

Sum{F} = M

Cs = Bs / M

B is the cumulative frequency

now we're going to keep x an integer, and our "decimal" that separates the current symbol from the rest of the stream is a fixed point in the integer. (eg. if M = 2^12 then we have a 12 bit fixed point x).

Our coding step is now :

x' = x*M/Fs + Bs

and we can imagine this as a fixed point :

[ (x*M/Fs) ] . [ Bs ]

in particular the bottom M-ary fraction specifies the current symbol :

( x' mod M ) = Bs

the crucial thing for decoding is that the first part, the (x/P) part which is now (x*M/F) shouldn't mess up the bottom M-ary fraction.

But now that we have it written like this, it should be obvious how to do that, if we just write :

x*M/F -> floor(x/F)*M

then the (mod M) operator on that gives you 0, because it has an explicit *M

so we've made the right (x/P) scaling, and made something that doesn't mess up our bottom mod M for decodability.

But we've lost some crucial bits from x, which contains the rest of the stream. When we did floor(x/F) we threw out some bottom bits of x that we can't get rid of. So we need that (x mod F) back.

Fortunately we have the perfect place to put it. We can specify the current symbol not just with Bs, but with anything in the interval [Bs , Bs + Fs) ! So we can do :

x' = M*floor(x/Fs) + Bs + (x mod Fs)

which is :

x' = [ floor(x/Fs) ] . [ Bs + (x mod Fs) ]

with the integer part growing on the left and the base-M fractional part on the right specifying the current symbol s

and this is exactly the rANS encoding step !

As we encode x grows by (1/P) with each step. We wind up sending x with log2(x) bits, which means the code length of the stream is log2(1/P0*P1...) which is what we want.

For completeness, decoding is straightforwardly undoing the encode step :

f = M-ary fraction of x  (x mod M)
find s such that Bs <= f < Bs+1

x' = Fs * (x/M) + (x mod M) - Bs


x' = Fs * intM(x) + fracM(x) - Bs

and we know

(fracM(x) - Bs) is in [0, Fs)

which is the same as the old arithmetic decode step : x' = (x - Cs)*Ps

Of course we still have to deal with the issue of keeping x in fixed width integers and streaming, which we'll come back to.
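Before moving on, here's the whole derivation put together as a round trip. This sketch keeps x in a 64-bit int with no renormalization, so only short streams fit; the M=8 alphabet is hypothetical :

```cpp
#include <cassert>
#include <cstdint>

// hypothetical alphabet : M = 8 , F = {4,2,2} , B = {0,4,6}
const uint64_t M = 8;
const uint64_t F[3] = { 4, 2, 2 };
const uint64_t B[3] = { 0, 4, 6 };

// rANS encode step : x' = M*floor(x/Fs) + Bs + (x mod Fs)
uint64_t rans_encode(uint64_t x, int s) { return M*(x/F[s]) + B[s] + (x % F[s]); }

// rANS decode step : x' = Fs*floor(x/M) + (x mod M) - Bs
int rans_decode(uint64_t x, uint64_t *xout)
{
    uint64_t f = x % M;                           // the bottom M-ary fraction
    int s = (f < B[1]) ? 0 : (f < B[2]) ? 1 : 2;  // find s : B(s) <= f < B(s+1)
    *xout = F[s]*(x/M) + f - B[s];
    return s;
}
```

Encode is LIFO (walk the stream backwards), decode pops symbols front-to-back and ends at the initial state.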

02-01-14 | Understanding ANS - 3

I'm gonna take an aside from the slow exposition and jump way ahead to some results. Skip to the bottom for summary.

There have been some unfortunate claims made about ANS being "faster than Huffman". That's simply not true. And in fact it should be obvious that it's impossible for ANS to be faster than Huffman, since ANS is a strict superset of Huffman. You can always implement your Huffman coder by putting the Huffman code tree into your ANS coder, therefore the speed of Huffman is strictly >= ANS.

In practice, the table-based ANS decoder is so extremely similar to a table-based Huffman decoder that they are nearly identical in speed, and all the variation comes from minor implementation details (such as how you do your bit IO).

The "tANS" (table ANS, aka FSE) decode is :

  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = decodeTable[state].nbBits;
  state = decodeTable[state].newState + getbits(nBits);

while a standard table-based Huffman decode is :

  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = codelen[sym];
  state = ((state<<nBits)&STATE_MASK) + getbits(nBits);  

where for similarity I'm using a Huffman code with the first bits at the bottom. In the Huffman case, "state" is just a portion of the bit stream that you keep in a variable. In the ANS case, "state" is a position in the decoder state machine that has memory; this allows it to carry fractional bits forward.

If you so chose, you could of course put the Huffman codelen and next state into decodeTable[] just like for ANS and they would be identical.

So, let's see some concrete results comparing some decent real world implementations.

I'm going to compare four compressors :

huf = order-0 static Huffman

fse = Yann's implementation of tANS

rans = ryg's implementation of rANS

arith = arithmetic coder with static power of 2 cumulative frequency total and decode table

For fse, rans, and arith I use a 12-bit table (the default in fse.c)
huf uses a 10-bit table and does not limit code length

Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)

Some notes on the four implementations will follow. First the raw results :

inName : book1
H = 4.527

arith 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
arith encode     : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
arith decode     : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s

rans 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
rans encode      : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
rans decode      : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s

fse :   768,771 ->   435,981 =  4.537 bpb =  1.763 to 1 
fse encode       : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
fse decode       : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s

huf :   768,771 ->   438,437 =  4.562 bpb =  1.753 to 1 
huf encode       : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
huf decode       : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
huf decode       : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)

inName : pic
H = 1.210

arith 12:   513,216 ->    79,473 =  1.239 bpb =  6.458 to 1 
arith encode     : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
arith decode     : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s

rans 12:   513,216 ->    79,474 =  1.239 bpb =  6.458 to 1 
rans encode      : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
rans decode      : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s

fse :   513,216 ->    80,112 =  1.249 bpb =  6.406 to 1 
fse encode       : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
fse decode       : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s

huf :   513,216 ->   106,691 =  1.663 bpb =  4.810 to 1 
huf encode       : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
huf decode       : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s

And some conclusions :

1. "tANS" (fse) is almost the same speed to decode as huffman, but provides fractional bits. Obviously this is a huge win on skewed files like "pic". But even on more balanced distributions, it's a decent little compression win for no decode speed hit, so probably worth doing everywhere.

2. Huffman encode is still significantly faster than tANS encode.

3. "rANS" and "arith" almost have their encode and decode speeds swapped. Round trip time is nearly identical. They use identical tables for encode and decode. In fact they are deeply related, which is something we will explore more in the future.

4. "tANS" is about twice as fast as "rANS". (at this point)

And some implementation notes for the record :

"fse" and "rans" encode the array by walking backwards.  The "fse" encoder outputs bits forwards and
consumes them backwards, while the "rans" encoder writes bits backwards and consumes them forwards.

"huf" and "fse" are transmitting their code tables.  "arith" and "rans" are not.  
They should add about 256 bytes of header to be fair.

"arith" is a standard Schindler range coder, with byte-at-a-time renormalization

"arith" and "rans" here are nearly identical, both byte-at-a-time, and use the exact same tables
for encode and decode.

All times include their table-build times, and the time to histogram and normalize counts.
If you didn't include those times, the encodes would appear faster.

"huf" here is not length-limited.  A huf decoder with a 12-bit table and 12-bit length-limited
codes (like "fse" uses) should be faster.
(*1 = I did a length-limited version with a non-overflow handling decoder)

"huf" here was implemented with PowerPC and SPU in mind.  A more x86/x64 oriented version should be
a little faster.  ("fse" is pretty x86/x64 oriented).

and todo : compare binary rANS with a comparable binary arithmetic coder.

1-31-14 | Understanding ANS - 2

So last time I wrote about how a string of output symbols like "012" describes an ANS state machine.

That particular string has all the values occurring the same number of times, in as close to the same slots as possible. So they are encoded in nearly the same code length.

But what if they weren't all the same? eg. what if the decode string was "0102" ?

Then to decode, we could take (state % 4) and look it up in that array. For two values we would output a 0.

Alternatively we could say -

if the bottom bit is 0, we output a 0

if the bottom bit is 1, we need another bit to tell if we should output a 1 or 2

So the interesting thing is now to encode a 0, we don't need to do state *= 4. Our encode can be :

void encode(int & state, int val)
    ASSERT( val >= 0 && val < 3 );
    if ( val == 0 )
        state = state*2;
    else
        state = state*4 + (val-1)*2 + 1;

When you encode a 0, the state grows less. In the end, state must be transmitted using log2(state) bits, so when state grows less you send a value in fewer bits.

Note that when you decode you are doing (state % 4), but to encode you only did state *= 2. That means when you decode you will see some bits from previously encoded symbols in your state. That's okay because those different values for state all correspond to the same output symbol. This is why when a symbol occurs more often in the output descriptor string it can be sent in fewer bits.
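Putting the encoder together with its decoder, the "0102" machine round-trips like this (a sketch; braces and the lookup table are mine) :

```cpp
#include <cassert>

// output string "0102" : symbol 0 occurs twice, symbols 1 and 2 once each
void encode(int & state, int val)
{
    if ( val == 0 ) state = state*2;           // a 0 costs one bit
    else state = state*4 + (val-1)*2 + 1;      // 1 and 2 cost two bits
}

int decode(int & state)
{
    static const int outputs[4] = { 0, 1, 0, 2 };  // the "0102" string
    int val = outputs[state & 3];
    state >>= (val == 0) ? 1 : 2;   // remove only as many bits as were added
    return val;
}
```

Encoding is LIFO as always : push symbols in reverse, pop them in order.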

Now, astute readers may have noticed that this is a Huffman code. In fact Huffman codes are a subset of ANS, so let's explore that subset.

Say we have some Huffman codes, specified by code[sym] and codelen[sym]. The codes are prefix codes in the normal top-bit first sense. Then we can encode them thusly :

void encode(int & state, int sym)
    state <<= codelen[sym];
    state |= reversebits( code[sym], codelen[sym] );

where reversebits reverses the bits so that it is a prefix code from the bottom bit. Then you can decode either by reading bits one by one to get the prefix code, or with a table lookup :

int decode(int & state)
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;

where decodetable[] is the normal huffman fast decode table lookup, but it looks up codes that have been reversed.

So, what does this decodetable[] look like? Well, consider the example we did above. That corresponds to a Huffman code like this :

normal top-bit prefix :

0: 0
1: 10
2: 11

reversed :

0:  0
1: 01
2: 11

so the maxcodelen is 2. We enumerate all the 2-bit numbers and how they decode :

00 : 0
01 : 1
10 : 0
11 : 2

decodetable[] = { 0,1,0,2 }

So decodetable[] is the output state string that we talked about before.

Huffman codes create one restricted set of ANS codes with integer bit length encodings of every symbol. But this same kind of system can be used with more general code lens, as we'll see later.
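As a runnable sketch of that restricted set, here's the 3-symbol Huffman code above done as an ANS coder with bit-reversed codes (table values are just the example's; names are mine) :

```cpp
#include <cassert>

// Huffman code { 0:"0", 1:"10", 2:"11" }, stored bit-reversed
// so each code is a prefix code read from the bottom bit :
const int rcode[3]   = { 0, 1, 3 };   // "0", "01", "11" bottom-first
const int codelen[3] = { 1, 2, 2 };
const int maxcodelen = 2;
const int decodetable[4] = { 0, 1, 0, 2 };  // the output state string

void encode(int & state, int sym)
{
    state = (state << codelen[sym]) | rcode[sym];
}

int decode(int & state)
{
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;
}
```

Same LIFO discipline : encode the stream backwards, decode it forwards back down to the initial state.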

1-30-14 | Understanding ANS - 1

I'm trying to really understand Jarek Duda's ANS (Asymmetric Numeral System). I'm going to write a bit as I figure things out. I'll probably make some mistakes.

For reference, my links :

RealTime Data Compression Finite State Entropy - A new breed of entropy coder
Asymmetric Numeral System - Polar
arxiv [1311.2540] Asymmetric numeral systems entropy coding combining speed of Huffman coding with compression rate of arithmetic
encode.ru - Asymetric Numeral System
encode.ru - FSE
Large text benchmark - fpaqa ans
New entropy coding faster than Huffman, compression rate like arithmetic - Google Groups

I actually found Polar's page & code the easiest to follow, but it's also the least precise and the least optimized. Yann Collet's fse.c is very good but contains various optimizations that make it hard to jump into and understand exactly where those things came from. Yann's blog has some good exposition as well.

So let's back way up.

ANS adds a sequence of values into a single integer "state".

The most similar thing that we're surely all familiar with is the way that we pack integers together for IO or network transmission. eg. when you have a value that can be in [0,3) and one in [0,7) and one in [0,12) you have a range of 3*7*12 = 252 so you can fit those all in one byte, and you use packing like :

// encode : put val into state
void encode(int & state, int val, int mod)
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;

// decode : remove a value from state and return it
int decode(int & state, int mod )
    int val = state % mod;
    state /= mod;
    return val;

Obviously at this point we're just packing integers, there's no entropy coding, we can't do unequal probabilities. The key thing that we will keep using in ANS is in the decode - the current "state" has a whole sequence of values in it, but we can extract our current value by doing a mod at the bottom.

That is, say "mod" = 3, then this decode function can be written as a transition table :

state   next_state  val
0       0           0
1       0           1
2       0           2
3       1           0
4       1           1
5       1           2
6       2           0

In the terminology of ANS we can describe this as "0120120120..." or just "012" and the repeating is implied. That is, the bottom bits of "state" tell us the current symbol by looking up in that string, and then those bottom bits are removed and we can decode more symbols.

Note that encode/decode is LIFO. The integer "state" is a stack - we're pushing values into the bottom and popping them from the bottom.

This simple encode/decode is also not streaming. That is, to put an unbounded number of values into state we would need an infinite length integer. We'll get back to this some day.
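For what it's worth, here's the packing above as a round trip, pushing the three example values into one byte's worth of state (braces added to make it compile) :

```cpp
#include <cassert>

// put val into state
void encode(int & state, int val, int mod)
{
    state = state*mod + val;
}

// remove a value from state and return it
int decode(int & state, int mod)
{
    int val = state % mod;
    state /= mod;
    return val;
}
```

Because the state is a stack, you encode in the reverse of the order you want to decode.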

12-31-13 | Statically Linked DLL

Oodle for Windows is shipped as a DLL.

I have to do this because the multiplicity of incompatible CRT libs on Windows has made shipping libs for Windows an impossible disaster.

(seriously, jesus christ, stop adding features to your products and make it so that we can share C libs. Computers are becoming an increasingly broken disaster of band-aided together non-functioning bits.)

The problem is that clients (reasonably so) hate DLLs. Because DLLs are also an annoying disaster on Windows (having to distribute multiple files, accidentally loading from an unexpected place, and what if you have multiple products that rely on different versions of the same DLL, etc.).

Anyway, it seems to me that the best solution is actually a "statically linked DLL".

The DLL is the only way on Windows to combine code packages without mashing their CRT together, and being able to have some functions publicly linked and others resolved privately. So you want that. But you don't want an extra file dangling around that causes problems, you just want it linked into your EXE.

You can build your DLL as DelayLoad, and do the LoadLibrary for it yourself, so the client still sees it like a normal import lib, but you actually get the DLL from inside the EXE. The goal is to act like a static lib, but avoid all the link conflict problems.

The most straightforward way to do it would be to link the DLL in to your EXE as bytes, and at startup write it out to a temp dir, then LoadLibrary that file.

The better way is to write your own "LoadLibraryFromMem". A quick search finds some leads on that :

Loading Win3264 DLLs manually without LoadLibrary() - CodeProject
Loading a DLL from memory » ~magogpublic
LoadLibrary replacement - Source Codes - rohitab.com - Forums

Crazy or wise?

12-12-13 | Call for Game Textures

... since I've had SOooo much luck with these calls for data in the past. Anyway, optimism...

I need game textures! They must be from a recent game (eg. modern resolutions), and I need the original RGB (not the shipped DXTC or whatever).

If you can provide some, please contact me.

11-25-13 | Oodle and the real problems in games

When I started work on Oodle, I specifically didn't want to do a compression library. For one thing, I had done a lot of compression and don't like to repeat the same work, I need new topics and things to learn. For another, I doubted that we could sell a compression library; selling compression is notoriously hard, because there's so much good free stuff out there, and even if you do something amazing it will only be 10% better than the free stuff due to the diminishing returns asymptote; customers also don't understand things like space-speed tradeoffs. But most of all I just didn't think that a compression library solved important problems in games. Any time I do work I don't like to go off into esoteric perfectionism that isn't really helping much, I like to actually attack the important problems.

That's why Oodle was originally a paging / packaging / streaming / data management product. I thought that we had some good work on that at Oddworld and it seemed natural to take those concepts and clean them up and sell that to the masses. It also attacks what I consider to be very important problems in games.

Unfortunately it became pretty clear that nobody would buy a paging product. Game companies are convinced that they "already have" that, or that they can roll it themselves easily (in fact they don't have that, and it's not easy). On the other hand we increasingly saw interest in a compression library, so that's the direction we went.

(I don't mean to imply that the clients are entirely at fault for wanting the wrong thing; it's sort of just inherently problematic to sell a paging library, because it's too fundamental to the game engine. It's something you want to write yourself and have full control over. Really the conception of Oodle was problematic from the beginning, because the ideal RAD product is a very narrow API that can be added at the last second and does something that is not too tied to the fundamental operation of the game, and also that game coders don't want to deal with writing themselves)

The two big problems that I wanted to address with Original Oodle was -

1. Ridiculous game level load times.

2. Ridiculous artist process ; long bake times ; no real previews, etc.

These are very different problems - one is the game runtime, one is in the dev build and tools, but they can actually be solved at the same time by the same system.

Oodle was designed to offer per-resource paging; async IO and loading, background decompression. Resources could be grouped into bundles; the same resource might go into several bundles to minimize loads and seeks. Resources could be stored in various levels of "optimization" and the system would try to load the most-optimized. Oodle would store hashes and mod times so that old/incompatible data wouldn't be loaded. By checking times and hashes you can do a minimal incremental rebuild of only the bundles that need to be changed.

The same paging system can be used for hot-loading, you just page out the old version and page in the new one - boom, hot loaded content. The same system can provide fast in-game previews. You just load an existing compiled/optimized level, and then page in a replacement of the individual resource that the artist is working on.

The standard way to use such a system is that you still have a nightly content build that makes the super-optimized bundles of the latest content, but then throughout the day you can make instant changes to any of the content, and the newer versions are automatically loaded instead of the nightly version. It means that you're still loading optimized bakes for 90% of the content (thus load times and such aren't badly affected) but you get the latest all day long. And if the nightly bake ever fails, you don't stop the studio, people just keep working and still see all the latest, it just isn't all fully baked.

These are important problems, and I still get passionate about them (aka enraged) when I see how awful the resource pipeline is at most game companies.

(I kept trying to add features to the paging product to make it something that people would buy; I would talk to devs and say "what does it need to do for you to license it", and everybody would say something different (and even if it had that feature they wouldn't actually license it). That was a bad road to go down; it would have led to huge feature bloat, been impossible to maintain, and made a product that wasn't as lean and focused as it should be; customers don't know what they want, don't listen to them!)

Unfortunately, while compression is very interesting theoretically, making a compressor that's 5% better than an alternative is just not that compelling in terms of the end result.

11-14-13 | Oodle Packet Compression for UDP

Oodle now has compressors for UDP (unordered / stateless) packets. Some previous posts on this topic :

cbloom rants 05-20-13 - Thoughts on Data Compression for MMOs
cbloom rants 08-08-13 - Oodle Static LZP for MMO network compression
cbloom rants 08-19-13 - Sketch of multi-Huffman Encoder

What I'm doing for UDP packet is static model compression. That is, you pre-train some model based on a capture of typical network data. That model is then const and can be just written out to a file for use in your game. At runtime, you read the model from disk, then it is const and shared by all network channels. This is particularly desirable for large scale servers because there is no per-channel overhead, either in channel startup time or memory use.

(ASIDE : Note that there is an alternative for UDP, which is to build up a consistent history between the encoder and decoder by having the decoder send back "acks", and then making sure the encoder uses only ack'ed packets as history, etc. etc. An alternative is to have the encoder mark packets with a description of the history used to encode them, and then when the decoder gets them if it doesn't have the necessary history it drops the packet and requests it be resent or something. I consider these a very bad idea and Oodle won't do them; I'm only looking at UDP compression that uses no transmission history.)

Call for test data! I currently only have a large network capture from one game, which obviously skews my results. If you make a networked game and can provide real-world sample data, please contact me.

Now for a mess of numbers comparing the options.

UDP compression of packets (packet_test.bin)

order-0 static huffman :  371.1 -> 234.5 average
(model takes 4k bytes)

order-0 static multi-huffman (32 huffs) : 371.1 -> 209.1 average
(model takes 128k bytes)

order-2 static arithmetic model : 371.0 -> 171.1 average
(model takes 549444 bytes)

OodleStaticLZP for UDP : 371.0 -> 93.4 average
(model takes 13068456 bytes)

In all cases there is no per-channel memory use. OodleStaticLZP is the recommended solution.

For comparison, the TCP compressors get :

LZB16 models take : 131072 bytes per channel
LZB16 [sw16|ht14] : 371.0 -> 122.6 average

LZNib models take : 1572864 bytes per channel
LZnib [sw19|ht18] : 371.0 -> 90.8 average

LZP models take : 104584 bytes per channel, 12582944 bytes shared
LZP [8|19] : 371.0 -> 76.7 average

zlib uses around 400k per channel
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

For MMO type scenarios (large number of connections, bandwidth is important), LZP is a huge win. It gets great compression with low per-channel memory use. The other compelling use case is LZNib when you are sending large packets (so per-byte speed is important) and have few connections (so per-channel memory use is not important); the advantage of LZNib is that it's quite fast to encode (faster than zlib-3 for example) and gets pretty good compression.

To wrap up, logging the variation of compression under some options.

LZPUDP can use whatever size of static dictionary you want. More dictionary = more compression.

LZPUDP [dic mb | hashtable log2]

LZPUDP [4|18] : 595654217 -> 165589750 = 3.597:1
1605378 packets; 371.0 -> 103.1 average
LZPUDP [8|19] : 595654217 -> 154353229 = 3.859:1
1605378 packets; 371.0 -> 96.1 average
LZPUDP [16|20] : 595654217 -> 139562083 = 4.268:1
1605378 packets; 371.0 -> 86.9 average
LZPUDP [32|21] : 595654217 -> 113670899 = 5.240:1
1605378 packets; 371.0 -> 70.8 average

And MultiHuffman can of course use any number of huffmans.

MultiHuffman [number of huffs | number of random trials]

MultiHuffman [1|8] : 66187074 -> 41830922 = 1.582:1
178376 packets; 371.1 -> 234.5 average, H = 5.056
MultiHuffman [2|8] : 66187074 -> 39869575 = 1.660:1
178376 packets; 371.1 -> 223.5 average, H = 4.819
MultiHuffman [4|8] : 66187074 -> 38570016 = 1.716:1
178376 packets; 371.1 -> 216.2 average, H = 4.662
MultiHuffman [8|8] : 66187074 -> 38190760 = 1.733:1
178376 packets; 371.1 -> 214.1 average, H = 4.616
MultiHuffman [16|8] : 66187074 -> 37617159 = 1.759:1
178376 packets; 371.1 -> 210.9 average, H = 4.547
MultiHuffman [32|8] : 66187074 -> 37293713 = 1.775:1
178376 packets; 371.1 -> 209.1 average, H = 4.508

On the test data that I have, the packets are pretty homogenous, so more huffmans is not a huge win. If you had something like N very different types of packets, you would expect to see big wins as you go up to N and then pretty flat after that.

Public note to self : it would amuse me to try ACB for UDP compression. ACB with dynamic dictionaries is not Pareto because it's just too slow to update that data structure. But with a static precomputed suffix sort, and optionally dynamic per-channel coding state, it might be good. It would be slower & higher memory use than LZP, but more compression.

10-14-13 | Oodle's Fast LZ4

Oodle now has a compressor called "LZB" (LZ-Bytewise) which is basically a variant of Yann's LZ4 . A few customers were testing LZNib against LZ4 and snappy and were bothered that it was slower to decode than those (though it offers much more compression and I believe it is usually a better choice of tradeoffs). In any case, OodleLZB is now there for people who care more about speed than ratio.

Oodle LZB has some implementation tricks that could also be applied to LZ4. I thought I'd go through them as an illustration of how you make compressors fast, and to give back to the community since LZ4 is nicely open source.

OodleLZB is 10-20% faster to decode than LZ4. (1900 mb/s vs. 1700 mb/s on lzt99, and 1550 mb/s vs. 1290 mb/s on all_dds). The base LZ4 implementation is not bad, it's in fact very fast and has the obvious optimizations like 8-byte word copies and so on. I'm not gonna talk about fiddly junk like do you do ptr++ or ++ptr , though that stuff can make a difference on some platforms. I want to talk about how you structure a decode loop.

The LZ4 decoder is like this :

U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

    ip += lrl;
    op += lrl;

    if ( EOF ) break;

    // get offset
    int offset = *((U16 *)ip);
    ip += 2;

    int ml = ml_control + 4;
    if ( ml_control == 0xF ) AddExcess(ml,ip);

    // copy match :
    if ( overlap )
        copy_match_overlap(op, op - offset, ml );
    else
        copy_match_nooverlap(op, op - offset, ml );

    op += ml;

    if ( EOF ) break;
}

and AddExcess is :

#define AddExcess(val,cp)   do { int b = *cp++; val += b; if ( b != 0xFF ) break; } while(1)
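Written out as a plain function (an equivalent sketch of the macro, with hypothetical names, for clarity):

```cpp
#include <cassert>

// LZ4-style excess decoding: each 0xFF byte adds 255 and continues;
// the first byte below 0xFF adds its value and terminates the sequence.
static int add_excess(int val, const unsigned char ** cp)
{
    for (;;)
    {
        int b = *(*cp)++;   // consume one excess byte
        val += b;
        if (b != 0xFF)      // any byte below 0xFF ends the excess
            return val;
    }
}
```

So a control nibble of 0xF followed by the bytes FF FF 03 decodes to 15 + 255 + 255 + 3 = 528.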

So, how do we speed this up?

The main thing we want to focus on is branches. We want to reduce the number of branches on the most common paths, and we want to make maximum use of the branches that we can't avoid.

There are four branches we want to pay attention to :

1. the checks for control nibbles = 0xF
2. the check for match overlap
3. the check of LRL and match len inside the match copiers
4. the EOF checks

So last one first; the EOF check is the easiest to eliminate and also the least important. On modern chips with good branch prediction, the highly predictable branches like that don't cost much. If you know that your input stream is not corrupted (because you've already checked a CRC of some kind), then you can put the EOF check under one of the rare code cases, like perhaps LRL control = 0xF. Just make the encoder emit that rare code when it hits EOF. On Intel chips that makes almost no speed difference. (if you need to handle corrupted data without crashing, don't do that).

On to the substantial ones. Note that branch #3 is hidden inside "copy_literals" and "copy_match". copy_literals is something like :

for(int i=0;i<lrl;i+=8)
    *((U64 *)(dest+i)) = *((U64 *)(src+i));

(the exact way to do copy_literals quickly depends on your platform; in particular are offset-addresses free? and are unaligned loads fast? Depending on those, you would write the loop in different ways.)
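On platforms where unaligned word access isn't guaranteed safe, one portable way to get the same single load/store (a sketch, not Oodle's actual code) is to route the copy through memcpy, which optimizing compilers reduce to one unaligned move:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// 8-byte copy with no alignment assumptions; optimizing compilers
// turn the two memcpy calls into a single unaligned load + store.
static inline void copy8(void * dest, const void * src)
{
    std::uint64_t tmp;
    std::memcpy(&tmp, src, 8);
    std::memcpy(dest, &tmp, 8);
}

// copy_literals in these terms: strides by 8, so it may write up to
// 7 bytes past dest+lrl - the output buffer needs that much slack.
static void copy_literals(unsigned char * dest, const unsigned char * src, int lrl)
{
    for (int i = 0; i < lrl; i += 8)
        copy8(dest + i, src + i);
}
```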

We should notice a couple of things about this. One is that the first branch on lrl is the most important. Later branches are rare, and we're also covering a lot of bytes per branch in that case. When lrl is low, we're getting a high number of branches per byte. Another issue about that is that the probability of taking the branch is very different the first time, so you can help the branch-prediction in the chip by doing some explicit branching, like :

if ( lrl > 0 )
{
    *((U64 *)(dest)) = *((U64 *)(src));
    if ( lrl > 8 )
    {
        *((U64 *)(dest+8)) = *((U64 *)(src+8));
        if ( lrl > 16 )
        {
            // .. okay now do a loop here ...
        }
    }
}

We'll also see later that the branch on ( lrl > 0 ) is optional, and in fact it's best to not do it - just always unconditionally/speculatively copy the first 8 bytes.

The next issue we should see is that the branch for lrl > 16 there for the copier is redundant with the check for the control value (lrl == 0xF). So we should merge them :

change this :

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

to :

    // get excess lrl
    if ( lrl == 0xF )
    {
        // lrl >= 15
        copy_literals_long(op, ip, lrl);
    }
    else
    {
        // lrl < 15
        copy_literals_short(op, ip, lrl);
    }

This is the principle that we can't avoid the branch on the LRL escape code, so we should make maximum use of it. That is, it carries extra information - the branch tells us something about the value of LRL, and any time we take a branch we should try to make use of all the information it gives us.

But we can do more here. If we gather some stats about LRL we see something like :

% of LRL >= 0xF : 8%
% of LRL > 8 : 15%
% of LRL <= 8 : 85%
The vast majority of LRLs are short, so we should first detect and handle that case :

    if ( lrl <= 8 )
    {
        if ( lrl > 0 )
            *((U64 *)(op)) = *((U64 *)(ip));
    }
    else
    {
        // lrl > 8
        if ( lrl == 0xF )
        {
            // lrl >= 15
            copy_literals_long(op, ip, lrl);
        }
        else
        {
            // lrl > 8 && < 15
            *((U64 *)(op)) = *((U64 *)(ip));
            *((U64 *)(op+8)) = *((U64 *)(ip+8));
        }
    }

which we should then rearrange a bit :

    // always immediately speculatively copy 8 :
    *((U64 *)(op)) = *((U64 *)(ip));

    if_unlikely( lrl > 8 )
    {
        // lrl > 8
        // speculatively copy next 8 :
        *((U64 *)(op+8)) = *((U64 *)(ip+8));

        if_unlikely ( lrl == 0xF )
        {
            // lrl >= 15
            // can copy first 16 without checking lrl
            copy_literals_long(op, ip, lrl);
        }
    }

    op += lrl;
    ip += lrl;

and we have a faster literal-copy path.

Astute readers may notice that we can now be stomping past the end of the buffer, because we always do the 8 byte copy regardless of LRL. There are various ways to prevent this. One is to make your EOF check test for (end - 8), and when you break out of that loop, then you have a slower/safer loop to finish up the end. (ADD : not exactly right, see notes in comments)

Obviously we should do the exact same procedure with the ml_control. Again the check for spilling the 0xF nibble tells us something about the match length, and we should use that information to our advantage. And again the short matches are by far the most common, so we should detect that case and make it as quick as possible.

The next branch we'll look at is the overlap check. Again some statistics should be our guide : less than 1% of all matches are overlap matches. Overlap matches are important (well, offset=1 overlaps are important) because they are occasionally very long, but they are rare. One option is just to forbid overlap matches entirely, and really that doesn't hurt compression much. We won't do that. Instead we want to hide the overlap case in another rare case : the excess ML 0xF nibble case.
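For clarity, here's why the overlap case can't use the wide copier: with offset=1 the source is the byte just behind the output pointer, so an 8-byte copy would read bytes that haven't been written yet. A byte-at-a-time sketch naturally replicates the run:

```cpp
#include <cassert>

// Overlap-safe match copy; works even when src is only 1 byte behind dest.
// With offset=1 this is effectively RLE : each copied byte feeds the next.
static void copy_match_overlap(unsigned char * dest, const unsigned char * src, int ml)
{
    for (int i = 0; i < ml; i++)
        dest[i] = src[i];
}
```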

The way to make compressors fast is to look at the code you have to write, and then change the format to make that code fast. That is, write the code the way you want it and the format follows - don't think of a format and then write code to handle it.

We want our match decoder to be like this :

if ( ml_control < 0xF )
{
    // 4-18 byte short match
    // no overlap allowed
    // ...
}
else
{
    // long match OR overlap match
    if ( offset < 8 )
    {
        ml = 4; // back up to MML

        // do offset < 8 possible overlap match
        // ...
    }
    else
    {
        // do offset >= 8 long match
        // ...
    }
}

so we change our encoder to make data like that.

Astute readers may note that overlap matches now take an extra byte to encode, which does cost a little compression ratio. If we like we can fix that by reorganizing the code stream a little (one option is to send one ml excess byte before sending offset and put a flag value in there; since this is all in the very rare pathway it can be more complex in its encoding), or we can just ignore it, it's around a 1% hit.

That's it.

One final note, if you want to skip all that, there is a short cut to get much of the win.

The simplest case is the most important - no excess lrl, no excess ml - and it occurs around 40% of all control bytes. So we can just detect it and fastpath that case :

U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    if ( (control & 0x88) == 0 )
    {
        // lrl < 8 and ml_control < 8

        // copy literals :
        *((U64 *)(op)) = *((U64 *)(ip));
        ip += lrl;
        op += lrl;

        // get offset
        int offset = *((U16 *)ip);
        ip += 2;
        int ml = ml_control + 4;

        // copy match; 4-11 bytes :
        U8 * from = op - offset;
        *((U64 *)(op)) = *((U64 *)(from));
        *((U32 *)(op+8)) = *((U32 *)(from+8));

        op += ml;

        continue;
    }

    // ... normal decoder here ...
}

This fastpath doesn't help much with all the other improvements we've done here, but you can just drop it on the original decoder and get much of the win. Of course beware the EOF handling. Also this fastpath assumes that you have forbidden overlaps from the normal codes and are sending the escape match code (0xF) for overlap matches.

ADDENDUM : A few notes :

In case it's not clear, one of the keys to this type of fast decoder is that there's a "hot region" in front of the output pointer that contains undefined data, which we keep stomping on over and over. eg. when we do the 8-byte literal blasts regardless of lrl. This has a few consequences you must be aware of.

One is if you're trying to decode in a minimum size circular window (eg. 64k in this case). Offsets to matches that are near window size (like 65534) are actually wrapping around to be just *ahead* of the current output pointer. You cannot allow those matches in the hot region because those bytes are undefined. There are a few fixes - don't use this decoder for 64k circular windows, or don't allow your encoder to output offsets that cause that problem.

A similar problem arises with parallelizing the decode. If you're decoding chunks of the stream in parallel, you cannot allow the hot region of one thread to be stomping past the end of its chunk, which another thread might be reading from. To handle this Oodle defines "seek points" which are places that you are allowed to parallelize the decode, and the coders ensure that the hot regions do not cross seek points. That is, within a chunk up to the seek point, the decoder is allowed to do these wild blast-aheads, but as it gets close to a seek point it must break out and fall into a safer mode (this can be done with a modified end condition check).

10-09-13 | Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with less threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're already holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.
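The admission test is trivial once jobs declare their needs up front; a toy sketch (hypothetical names, not any real job system):

```cpp
#include <cassert>

// Reserve a job's declared memory from the budget before it starts;
// if it doesn't fit, the manager defers it and tries a smaller job.
static bool try_start_job(int & free_mem, int job_need)
{
    if (job_need > free_mem)
        return false;
    free_mem -= job_need;   // reserved up front; the job allocates from this pool
    return true;
}
```

With 12 memories free and jobs needing {10,1,10,1}, the first 10 and both 1's are admitted, and the second 10 waits until the first finishes and returns its reservation.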

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark to what extent it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are guaranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)

10-04-13 | The Reality of Coding

How you actually spend time.

4 hours - think about a problem, come up with a new algorithm or variation of algorithm, or read papers to find an algorithm that will solve your problem. This is SO FUCKING FUN and what keeps pulling me back in.

8 hours - do initial test implementation to prove concept. It works! Yay!

And now the fun part is over.

50 hours - do more careful implementation that handles all the annoying corner cases; integrate with the rest of your library; handle failures and so on. Provide lots of options for slightly different use cases that massively increase the complexity.

50 hours - adding features and fixing rare bugs; spread out over the next year

20 hours - have to install new SDKs to test it; inevitably they've broken a bunch of APIs and changed how you package builds so waste a bunch of time on that

10 hours - some stupid problem with Win8 loading the wrong drivers; or the linux dir my test is writing to is no longer writeble by my user account; whatever stupid computer problem; chase that around for a while

10 hours - the p4 server is down / vpn is down / MSVC has an internal compiler error / my laptop is overheating / my hard disk is full, whatever stupid problem always holds you up.

10 hours - someone checks in a breakage to the shared library; it would take a minute just to fix it, but you can't do that because it would break them so you have to do meetings to agree on how it should be fixed

10 hours - some OS API you were using doesn't actually behave the way you expected, or has a bug; some weird corner case or undocumented interaction in the OS API that you have to chase down

40 hours - writing docs and marketing materials, teaching other members of your team, internal or external evangelizing

30 hours - some customer sees a bug on some specific OS or SDK version that I no longer have installed; try to remote debug it, that doesn't work, try to find a repro, that doesn't work, give up and install their case; in the end it turns out they had bad RAM or something silly.

The reality is that as a working coder, the amount of time you actually get to spend working on the good stuff (new algorithms, hacking, clever implementations) is vanishingly small.

10-03-13 | SetLastError(0)

Public reminder to myself about something I discovered a while ago.

If you want to do IO really robustly in Windows, you can't just assume that your ReadFile / WriteFile will succeed under normal usage. There are lots of nasty cases where you need to retry (perhaps with smaller IO sizes, or just after waiting a bit).

In particular you can see these errors in normal runs :

ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending

ERROR_NOT_ENOUGH_QUOTA  = single IO call too large
    not enough process space pages available
    -> SetProcessWorkingSetSize

ERROR_NO_SYSTEM_RESOURCES =
    failure to alloc pages in the kernel address space for the IO
    try again with smaller IOs

ERROR_IO_PENDING = normal async IO return value

ERROR_HANDLE_EOF = sometimes normal EOF return value

anyway, this post is not about the specifics of IO errors. (random aside : I believe that some of these annoying errors were much more common in 32-bit windows; the failure to get address space to map IO pages was a bigger problem in 32-bit (I saw it most often when running with the /3GB option which makes the kernel page space a scarce commodity), I don't think I've seen it in the field in 64-bit windows)

I discovered a while ago that ReadFile and WriteFile can fail (return false) but not set last error to anything. That is, you have something like :

SetLastError(77); // something bogus

if ( ! ReadFile(....) )
{
    // failure, get code :
    DWORD new_error = GetLastError();

    // new_error should be the error info about ReadFile failing
    // but sometimes it's still 77
}

and *sometimes* new_error is still 77 (or whatever; that is, it wasn't actually set when ReadFile failed).

I have no idea exactly what situations make the error get set or not. I have no idea how many other Win32 APIs are affected by this flaw, I only have empirical proof of ReadFile and WriteFile.

Anyhoo, the conclusion is that best practice on Win32 is to call SetLastError(0) before invoking any API where you need to know for sure that the error code you get was in fact set by that call. eg.

SetLastError(0);

if ( ! SomeWin32API(...) )
{
    DWORD hey_I_know_this_error_is_legit = GetLastError();
}


That is, Win32 APIs returning failure does *not* guarantee that they set LastError.
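The hazard and the fix, demonstrated with stand-ins for the Win32 calls (mocks, since the point is the calling pattern, not the API itself):

```cpp
#include <cassert>

// Mock of the Win32 last-error mechanism, for a self-contained example.
static unsigned long g_last_error = 0;
static void MockSetLastError(unsigned long e) { g_last_error = e; }
static unsigned long MockGetLastError() { return g_last_error; }

// A call that fails but (like the flaky ReadFile cases) never sets the error :
static bool FlakyCall() { return false; }

// Robust pattern : clear the error first, so a stale code can't
// masquerade as the reason this call failed.
static unsigned long RobustErrorOfFlakyCall()
{
    MockSetLastError(0);
    if ( ! FlakyCall() )
        return MockGetLastError();
    return 0;
}
```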

ADD : while I'm at it : watching the last-error value in the VC watch window is pretty sweet.

GetLastError is :

*(DWORD *)($tib+0x34)

or *(DWORD *)(FS:[0x34]) on x86

09-27-13 | Playlist for Rainy Seattle

Who the fuck turned the lights out in the world? Hello? I'm still here, can you turn the lights back on please?

Playlist for the gray :

(Actually now that I think about it, "playlist for the gray" is really just the kind of music I listened to when I was young (and, apparently, depressed). It reminds me of sitting in the passenger seat of a car, silent, looking out the window, it's raining, the world passing.)

Music request :

Something I've been seeking for a while and am having trouble finding : really long repetitive tracks. Preferably analog, not techno or dance music. Like some guitar strumming and such. Not pure drone, not just like one phrase repeated over and over, but a song, a proper song, just a really long version of that song. And not some awful Phish crap either; I don't mean "jazz" or anything with long improvs that get away from the basic song structure. I don't want big Mogwai walls of sound or screeching anything either; nothing experimental; I hate songs that build to a big noise crescendo, no no, just keep the steady groove going. Just regular songs, but that go on and on.

Any pointers appreciated. Particularly playlists or "best of" lists on music sites. There must be some DJ kid doing super-long remixes of classic rock songs, right? Where is he?

Some sort of in the right vein examples :

Neu! - Hallogallo (10 min)
Traffic - Mr Fantasy Live (11 min) (maybe not a great example, too solo-y)
YLT - Night Falls on Hoboken (17 min)

09-20-13 | House Design

Blah blah blah cuz it's on my mind.

First a general rant. There's nothing worse than "designers". The self-important sophomoric egotism of them is just mind boggling. Here's this product (or in this case, house plan) that has been refined over 1000's of years by people actually using it. Oh no, I know better. I'm so fucking great that I can throw all that out and come up with something off the top of my head and it will be an improvement. I don't have to bother with prototyping or getting evaluations from the people who actually use this stuff every day because my half-baked ideas are so damn good. I don't need to bother learning about how this product or layout has been fiddled with and refined in the past because my idea *must* be brand new, no one could have possibly done the exact same thing before and proved that it was a bad idea.

And onto the random points -

X. Of course the big fallacy is that a house is going to make your life better or fix anything. One of the most hilarious variants of this is people who put in a specific room for a gym or a bar or a disco, because of course in their wonderful new house they'll be working out and having parties and lots of friends. Not sitting in front of the TV browsing donkey porn like they do in their old house.

X. Never use new technology. It won't work long term, or it will be obsolete. You don't want a house that's like a computer and needs to be replaced in a few years. Use old technology that works. That goes for both the construction itself and for any integrated gadgets. Like if you get some computer-controlled power and lighting system; okay, good luck with that, I hope it's not too hard to rip out of the walls in five years when it breaks and the replacement has gone out of production. Haven't you people ever used electronics in your life? How do you not know this?

X. Living roof? WTF are you thinking? What a nightmare of maintenance. And it's just a huge ceiling leak inevitably. Yeah, I'm sure that rubber membrane that had several leaks during construction is gonna be totally fine for the next 50 years.

X. Assume that all caulks, all rubbers, all glues, all plastics will fail at some point. Make them easy to service and don't rely on them for crucial waterproofing.

X. Waterproofing should be primarily done with the "shingle principle". That is, mechanical overlaps - not glues, caulks, gaskets, coatings. Water should have to run upwards in order to get somewhere you don't want it.

X. Lots of storage. People these days want to eliminate closets to make rooms bigger. WTF do you need those giant rooms for? Small rooms are cosy. Storage everywhere makes it easy to keep the rooms uncluttered. So much nicer to have neat clean small rooms. The storage needs to be in every single room, not centralized, so you aren't walking all over the house every time you want something.

X. Rooms are good. Small rooms. I feel like there are two extreme trends going on these days that are both way off the ideal - you have the McMansion types that are making 5000 sqft houses, and then the "minimalist" types trying to live in 500 sqft to prove some stupid point. Both are retarded. I think the ideal amount of space for two people is around 1200-1500 sqft. For a family of three it's 1500-2000 or so.

X. Doors are good. Lofts are fucking retarded. Giant open single spaces are awful. Yeah it's okay if you're single, but if there are two people in the house you might just want to do different things and not hear each other. Haven't you ever lived in a place like that? It's terrible. Doors and separated spaces are wonderful things. (I like traditional Japanese interiors with the sliding screens so that you can rearrange spaces; maybe an open living-dining room, but with a sliding door through the middle to make it into two rooms when you want that? Not sure.)

X. Shadow gaps, weird kitchen islands, architectural staircases, sunken areas, multiple levels. Bleh. You've got to think about the cleaning. These things might look okay when it's first built, but they're a nightmare for maintenance.

X. Use trim. The popular thing these days seems to be trim-less walls. (part of the sterile awful "I live in a museum" look). In classic construction trim is partly used to hide the boundary between two surfaces that might not have a perfect joint, or that are different materials and thus might move against each other over time. With fancy modern construction the idea is that you don't have a bad joint that you have to hide, so you can do away with the baseboards and crown molding for a cleaner look. Blah, no, wrong. Baseboards are not just for hiding the joint, they serve a purpose. You can clean them, you can kick them, and they protect the bottom of your walls. They also provide a visual break from the floor to the wall, though that's a matter of taste.

X. I don't see anybody do the things that are so fucking obviously necessary these days. One is that all your wiring should be accessible for redoing, because we're going to continue to get new internet and cabling needs. Really you should run PVC tubes through your walls with fish-wires in them so that you can just easily pull new wiring through your house. (or of course if you make one of those god-awful modern places you should just do exposed pipes and wires; it's one of the few advantages of modern/industrial style, wtf. don't take the style and then reject the advantage of it). You should have charging stations that are just cupboards with outlets inside the cupboard so that you can hide all your charger cords and devices. There should be tons of outlets and perhaps they should be hidden; you could make them styled architectural features in some cases, or hide them in some wood trim or something. Another option would be to have retractable power outlets that coil up inside the wall and you can pull out as needed. Another idea is your baseboards could have a hollow space behind them, so you could snap them off and run cords around the room hidden behind the baseboards.

It would have been fun to be an architect. I have a lot of ideas about design, and I appreciate being in physical spaces that do something special to your experience. I love making physical objects that you can see and touch and show other people; it's so frustrating working in code and never making anything real.

But I'm sure I would have failed. For one thing, being an architect requires a lot of salesmanship and bullshitting. Particularly at the top levels, it's more about being a celebrity persona than about your work (just like art, cooking, etc.). To make any money or get the good commissions as an architect you have to have a bit of renown; you need to get paid extra because you're a name that's getting magazine attention.

But it's really more about following trends than about doing what's right. I suppose that's just like cooking too. Magazines (and blogs now) have a preconceived idea of what is "current" or "cutting edge" (which is not actually cutting edge at all, because it's a widespread cliche by that time), and they look for people that fit their preconceived idea. If you're a cook that does fucking boring ass generic "farm to table" and "sustainably raised" shit, then you're current and people will feature you in the news. If you just stick to what you know is delicious and ignore the stupid fads, then you're ignored.

09-18-13 | Per-Thread Global State Overrides

I wrote about this before ( cbloom rants 11-23-12 - Global State Considered Harmful ) but I'm doing it again because I think almost nobody does it right, so I'm gonna be really pedantic.

For concreteness, let's talk about a Log system that is controlled by bit-flags. So you have a "state" variable that is an or of bit flags. The flags are things like where do you output to (LOG_TO_FILE, LOG_TO_OUTPUTDEBUGSTRING, etc.) and maybe things like subsection enablements (LOG_SYSTEM_IO, LOG_SYSTEM_RENDERER, ...) or verbosity (LOG_V0, LOG_V1, ...). Maybe some bits of the state are an indent level. etc.

So clearly you have a global state where the user/programmer have set the options they want for the log.

But you also need a TLS state. You want to be able to do things like disable the log in scopes :


{
    U32 oldState = Log_SetState(0);

    // ... do stuff with the log disabled ...

    Log_SetState(oldState);
}
(and in practice it's nice to use a scoper-class to do that for you). If you do that on the global variable, your thread is fucking up the state of other threads, so clearly it needs to be per-thread, eg. in the TLS. (similarly, you might want to inc the indent level for a scope, or change the verbosity level, etc.).

(note of course this is the "system has a stack of states which is implemented in the program stack").

So clearly, those need to be Log_SetLocalState. Then the functions that are used to set the overall options should be something like Log_SetGlobalState.

Now some notes on how the implementation works.

The global state has to be thread safe. It should just be an atomic var :

static U32 s_log_global_state;

U32 Log_SetGlobalState( U32 state )
{
    // set the new state and return the old; this must be an exchange

    U32 ret = Atomic_Exchange(&s_log_global_state, state, mo_acq_rel);

    return ret;
}

U32 Log_GetGlobalState( )
{
    // probably could be relaxed but WTF let's just acquire

    U32 ret = Atomic_Load(&s_log_global_state, mo_acquire);

    return ret;
}

(note that I sort of implicitly assume that there's only one thread (a "main" thread) that is setting the global state; generally it's set by command line or .ini options, and maybe from user keys in a HUD; the global state is not being fiddled by lots of threads at program time, because that creates races. eg. if you wanted to do something like turn on the LOG_TO_FILE bit, it should be done with a CAS loop or an Atomic OR, not by doing a _Get and then _Set).

Now the Local functions need to set the state in the TLS and *also* which bits are set in the local state. So the actual function is like :

per_thread U32_pair tls_log_local_state;

U32_pair Log_SetLocalState( U32 state , U32 state_set_mask )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    // write TLS :

    tls_log_local_state = U32_pair( state, state_set_mask );

    return ret;
}

U32_pair Log_GetLocalState( )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    return ret;
}

Note obviously no atomics or mutexes are need in per-thread functions.

So now we can get the effective combined state :

U32 Log_GetState( )
{
    U32_pair local = Log_GetLocalState();
    U32 global = Log_GetGlobalState();

    // take local state bits where they are set, else global state bits :

    U32 state = (local.first & local.second) | (global & (~local.second) );

    return state;
}
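That masked merge - local bits win wherever the set-mask is on, global bits show through everywhere else - is easy to sanity-check in isolation:

```cpp
#include <cassert>

// Combine rule from Log_GetState : local value where the mask is set,
// global value elsewhere.
static unsigned int combine_state(unsigned int local_val, unsigned int local_mask,
                                  unsigned int global_val)
{
    return (local_val & local_mask) | (global_val & ~local_mask);
}
```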

So internally to the log's operation you start every function with something like :

static bool NoState( U32 state )
{
    // if all outputs or all systems are turned off, no output is possible
    return ((state & LOG_TO_MASK) == 0) ||
        ((state & LOG_SYSTEM_MASK) == 0);
}

void Log_Printf( const char * fmt, ... )
{
    U32 state = Log_GetState();

    if ( NoState(state) )
        return;

    // ... more here ...
}


So note that up to the "... more here ..." we have not taken any mutexes or in any way synchronized the threads against each other. So when the log is disabled we just exit there before doing anything painful.

Now the point of this post is not about a log system. It's that you have to do this any time you have global state that can be changed by code (and you want that change to only affect the current thread).

In the more general case you don't just have bit flags, you have arbitrary variables that you want to be per-thread and global. Here's a helper struct to do a global atomic with thread-overridable value :

struct tls_intptr_t
{
    int m_index;

    tls_intptr_t()
    {
        m_index = TlsAlloc();
        ASSERT( get() == 0 );
    }

    intptr_t get() const { return (intptr_t) TlsGetValue(m_index); }

    void set(intptr_t v) { TlsSetValue(m_index,(LPVOID)v); }
};

struct intptr_t_and_set
{
    intptr_t val;
    intptr_t set; // bool ; is "val" set

    intptr_t_and_set(intptr_t v,intptr_t s) : val(v), set(s) { }
};

struct overridable_intptr_t
{
    atomic<intptr_t>    m_global;
    tls_intptr_t    m_local;    
    tls_intptr_t    m_localset;

    overridable_intptr_t(intptr_t val = 0) : m_global(val)
    {
        ASSERT( m_localset.get() == 0 );
    }

    intptr_t set_global(intptr_t val)
    {
        return m_global.exchange(val,mo_acq_rel);
    }
    intptr_t get_global() const
    {
        return m_global.load(mo_acquire);
    }

    intptr_t_and_set get_local() const
    {
        return intptr_t_and_set( m_local.get(), m_localset.get() );
    }
    intptr_t_and_set set_local(intptr_t val, intptr_t set = 1)
    {
        intptr_t_and_set old = get_local();
        if ( set )
            m_local.set(val);
        m_localset.set(set);
        return old;
    }
    intptr_t_and_set set_local(intptr_t_and_set val_and_set)
    {
        intptr_t_and_set old = get_local();
        if ( val_and_set.set )
            m_local.set(val_and_set.val);
        m_localset.set(val_and_set.set);
        return old;
    }
    intptr_t_and_set clear_local()
    {
        intptr_t_and_set old = get_local();
        m_localset.set(0);
        return old;
    }

    intptr_t get_combined() const
    {
        intptr_t_and_set local = get_local();
        if ( local.set )
            return local.val;
        else
            return get_global();
    }
};


// test code :  

static overridable_intptr_t s_thingy(1);

int main(int argc,char * argv[])
{
    argc; argv;

    ASSERT( s_thingy.get_combined() == 1 );

    intptr_t_and_set prev = s_thingy.set_local(3,1);

    ASSERT( s_thingy.get_combined() == 3 );

    s_thingy.set_global(2); // local override still wins :

    ASSERT( s_thingy.get_combined() == 3 );

    s_thingy.set_local(prev); // restore : back to seeing the global

    ASSERT( s_thingy.get_combined() == 2 );

    return 0;
}

Or something.

Of course this whole post is implicitly assuming that you are using the "several threads that stay alive for the length of the app" model. An alternative is to use micro-threads that you spin up and down, and rather than inheriting from a global state, you would want them to inherit from the spawning thread's current combined state.

09-18-13 | Fast TLS on Windows

For the record; don't use this blah blah unsafe unnecessary blah blah.

extern "C" DWORD __cdecl FastTlsGetValue_x86(int index)
{
  __asm
  {
    mov     eax,dword ptr fs:[00000018h]
    mov     ecx,index

    cmp     ecx,40h // 40h = 64
    jae     over64  // Jump if above or equal 

    // return Teb->TlsSlots[ dwTlsIndex ]
    // +0xe10 TlsSlots         : [64] Ptr32 Void
    mov     eax,dword ptr [eax+ecx*4+0E10h]
    jmp     done

  over64:
    // return Teb->TlsExpansionSlots[ dwTlsIndex - 64 ]
    mov     eax,dword ptr [eax+0F94h]
    mov     eax,dword ptr [eax+ecx*4-100h]

  done:
  }
}


DWORD64 FastTlsGetValue_x64(int index)
{
    if ( index < 64 )
    {
        return __readgsqword( 0x1480 + index*8 );
    }
    else
    {
        DWORD64 * table = (DWORD64 *) __readgsqword( 0x1780 );
        return table[ index - 64 ];
    }
}

The ASM one is from Nynaeve originally ( 1 2 ). I'd rather rewrite it in C using __readfsdword but haven't bothered.

Note that these may cause a bogus failure in MS App Verifier.

Also, as noted many times in the past, you should just use the compiler __declspec thread under Windows when that's possible for you. (eg. you're not in a DLL pre-Vista).

09-17-13 | Travel with Baby

We did our first flight with baby over the weekend.

It went totally fine. All the things that people worry about (getting through security, baby ears popping, whatever) are no problem. Sure she cried on the plane some, but mostly she was really good, and anybody on the plane who was bothered can go to hell.

(hey dumbass who sat next to us - when you see a couple with a baby and they say "would you like to move to another seat" you should fucking move. And if you don't move, then you need to be cool about it. Jesus christ you all suck so bad, I'm fucking holding your hand helping you be a decent person, you don't even need to take any initiative, I know you all suck too bad to speak up and take the lead at being decent, but even when I open the door for you, you still can't manage to take the easy step. Amazing.)

Despite it being totally fine, it made me feel like I don't really need to do that again.

You just wind up spending the whole trip staring at baby anyway. You wind up spending a lot of time stuck in the hotel room, because she needs to nap, or you have to go back to feed her and get more toys, or get her out of the sun, or whatever. Hotel rooms are god awful. There's this weird romanticism about hotels, but the reality is they're almost uniformly dreary, in the standard shoebox design with light at only one end. My house is so much fucking better.

Like I'm in another part of the world, but I'm still just doing "goo-goos" and shaking the rattle, why do I need to bother with the travel if this is all I'm doing? And of course it's much harder because I don't have all my handy baby accessories and her comfy bed and all that. It made me think of those cheesy photo series with the baby always in the foreground of the photo and all kinds of different world locations in the background.

09-17-13 | A Day in the Life

Wake up at 730 with the baby.

Show her some toys, shake them around, let her chew on them. Practice rolling over, she gets frustrated, help her out. She wants to walk around a bit, so pick her up and hold her while she walks. Ugh this is hard on my back. She swats at things, I move away the dangerous knives and bring close some jars for her to play with. She's getting frustrated. Show her the mirror baby, she flirts for a minute. She gets fussy; check diaper, yep it's dirty; go change it. Lay her down and make faces and do motorboats and stuff. Try to read her a book; she just wants to eat the pages and gets frustrated. Walk her around outside and show her some plants, let her grab some leaves. She wants to walk, help her walk in the grass. Ugh this is hard on my back. Getting tired now. Take her in and put her in the walker; put some toys on the walker, she swats them off, pick them up and put them back. She's getting bored of that. Show her some running water, let her suck my finger. She's getting fussy again.

God I'm tired. Check the clock.

It's 800.

Twelve more hours until she sleeps. ZOMG.

09-12-13 | Health Insurance

We've got a bunch of health insurance bills from baby and related care, and several of them have fraudulent up-coding or double-billing. But it's sort of a damned-if-you-whatever situation. Either you :

1. Fight it, spend lots of time on hold on the phone, filling out forms, talking to assholes who won't help you. Probably get nowhere and be stressed and angry the whole time.

2. Just pay it and try to "let it go". The money's not that huge and peace of mind is worth it. But in fact feel defeated and like a pussy for letting them get away with robbing you. Become less of a man day by day as you are crushed by the indefeatable awfulness of the world.

Though I suppose those are generally your only two options when fighting bureaucracy. It's just that health care is more important to our lives, our wallets, and generally the health of the entire economy as it becomes an increasing tax on all of us.

We walked through some local street fair a few weeks ago, and saw one of the doctors who's fraudulently billed us; he was being all smiley, oo look at me I'm part of the community, I'm your nice neighborhood doctor man. I wanted to just punch him in the face.

Also : Republicans are retarded pieces of shit. How could you possibly be seriously opposed to tearing apart our entire health care sector and rebuilding from scratch with cost controls and a public option? They're either morons (unaware of their evil), or assholes (aware and intentionally evil). Oh yes, it's wonderful that we have choice and competition in this robust and functioning health care economy. Oh no, wait, actually we don't have that at all. We have a corrupt government-private conspiracy, which you have intentionally created, which is screwing over everyone in America except for the heads of the health care industry (and the politicians that take their money). Sigh. Time to go back to pretending that nothing exists outside my home, because it's all just too depressing.

09-10-13 | Grand Designs

I've been watching a lot of "Grand Designs" lately; it's my new go-to soothing background zone-out TV. It's mostly inoffensive (*) and slightly interesting.

(* = the incredibly predictably saccharine sum-up by Kevin at the end is pretty offensive; all throughout he's raising very legitimate concerns, and then at the end every time he just loves it. There are a few episodes where you can see the builder/client is just crushed and miserable at the end, but they never get into anything remotely honest like that, it's all very superficial, ooh isn't excess consumerism wonderful!)

I'm not even going to talk about the house designs really. I think most of them are completely retarded and abysmal generic pseudo-modern crap. (IMO in general modernism just doesn't work for homes. Moderism is beautiful when it's pure, unforgiving, really strictly adhered to. But nobody can live in a home like that. So they water it down and add some natural materials and normal cosy furniture and storage and so on, and that completely ruins it and turns it into "condo pseudo-modernism" which is just the tackiest of all types of housing. Modernism should be reserved for places where it's realistic to keep the severe empty minimalism that makes it beautiful, like museums).

Thoughts in sections :

What makes a house.

One of the most amusing things is noticing what people actually say at the sum-up at the end. Every time Kevin sits down and talks with the couple when the house is done and asks what they really love about it, the comments are things like :

"We're just glad it's done / we're just glad to have a roof over our head".
"It's so nice to just be together as a family"
"The views are amazing."
"The location is so special."

etc. it always strikes me that none of the pros has anything to do with the expensive house they just made. They never say anything like : "the architecture is incredibly beautiful and it's a joy to just appreciate the light and the spaces". Or "we're really glad we made a ridiculous 4000 sq ft living room because we often have 100 friends over for huge orgies". Because they don't actually care about the house at all (and it sucks).

Quite often I see the little cramped bungalow that people start out with and think "that looks just charming, why are you building?". It's all cosy and full of nick-nacks. It's got colorful paint schemes and is appropriately small and cheap and homey. Then they build some awful huge cavernous cold white box.

The children in particular always suffer. Often they're in a room together before the build, and the family is building to get more space so the kids can all have their own room. But the kids in the shared room are all laughing and wrestling, piled up on each other and happy. Kids are meant to be with other kids. In fact, humans are meant to be with other humans. We spend all this money to get big empty lonely spaces, and are worse off for it. Don't listen to what people say they want, it's not good for them.

In quite a few of the episodes, the couple at the beginning is full of energy, really loving to each other, bright eyed. At the end of the episode, they look crushed, exhausted, stressed out. Their lips are thin and they're all tense and dead inside.

Even in the disasters they're still saying how wonderful it is and how they'd do it all over (because people are so awfully, boringly, cowardly dishonest and never admit regret about major life decisions until after they've unwound them (like every marriage is "wonderful" right up until the day of the divorce)), but you can see it in their eyes. Actually the final interviews of Grand Designs are an interesting study in non-verbal communication, because the schlock nonsense that they say with their mouths has absolutely zero information content (100% predictable = zero information), so you have to get all your information from what they're not saying.

It's so weird the way some people get some ridiculous idea in their head and stick to it no matter how inconvenient and costly and more and more obviously foolish it is. Like I absolutely have to build on this sloping lot that has only mud for soil, or I have to build in this old water tower that's totally impractical. They incur huge costs, for what? You could have just bought a normal practical site, and you would have been just as happy, probably much happier.

In my old age I am appreciating that all opinions are coincidences and taste is an illusion. Why in the world would you butt your head against some obviously difficult and costly and impractical problem. You didn't actually want that thing anyway. You just thought you wanted it because you are brainwashed by the media and your upbringing. Just get something else that's easier. All things are equal.

Eco Houses.

The "eco houses" are some of the most offensive episodes to me. The purported purpose of these "eco houses" is reducing their long-term carbon footprint, with the primary focus being on insulation to reduce heating costs. That's all well and good, but it's a big lie built on fallacies.

They completely ignore the initial eco cost of the build. Often they tear down some perfectly good house to start the build, which is absurd. Then they build some giant over-sized monstrosity out of concrete. They ignore all that massive energy waste and pollution because "long term the carbon savings will make up for it". Maybe, maybe not. Correctly doing long term costs is very hard.

Obviously anyone serious about an eco house should build it as small as reasonable. Not only does a small house use less material to make and use less energy to heat and light, there's less maintenance over its whole life, you fill it with less furniture, it's smaller in the landfill when inevitably torn down, etc.

Building a ridiculously huge concrete "eco house" is just absurd on the face of it; it's so hypocritical that it kind of blows your sensor out and you can't even measure it any more. It's sort of like making a high performance electric sports car and pretending that's "eco", it's just an absurd transparently illogical combination; oh wait...

One of the big fallacies of the eco house is the "long term payoff". There might be new building technology in 5 years that makes your house totally obsolete. Over-engineering with anything technical is almost always a mistake, because the cost (and environment cost in this case) is going down so fast.

Your house might go out of style. Houses now are not permanent, venerated monuments. They are fashion accessories. You can see by the way the eco people so happily tear down the houses from the 50's and 70's as if they were garbage. In 20 years if your house isn't fashionable, someone will buy it and tear it down. You're using lots of experimental new technology which greatly reduces the chance of your house actually lasting for the long term. Things like burying a house in the ground with a neoprene waterproofing layer makes the probability of the house actually lasting for the long term very small.

Perhaps the biggest issue is that they assume that the carbon cost of energy is constant. In fact it's going down rapidly. The whole idea of the carbon savings (for things like using massive amounts of concrete) is that the alternative (a timber house with normal insulation and more energy use) is polluting a lot through its energy use. But if its heat energy comes from solar or other clean sources, then your passive house is not a win. The technology of energy production will improve massively in the next 20-50 years, so saying your house is a win over the long (50-100 years) term is insane.

As usual in these phony arguments, they use a straw man alternative. They compare building a new $500k eco house vs. just leaving the old house alone. That's absurd. Of course if you want to say that the tear-down-and-build is more "eco" you should compare your new house vs. spending $500k on the old house or other eco purposes. What if you just left the old house and spent $100k to add insulation and solar panels and better windows? Then you could spend $400k on preserving forest land or something. A fair comparison has to be along an iso-line of constant cost, and doing the best you can per dollar in each case.

I'm sure the reality in most cases is just that people *want* a new house and are rationalizing and making up excuses why it's okay to do it. I'd like it so much better if they just said "yeah, we fucking want a new house that uses tons of concrete and we don't give a shit about the eco, but we're going to make it passive so that we can feel smug and show off to our friends".

European building vs. American.

Holy crap European building quality is ridiculously good.

In one of the episodes somebody puts softwood cladding on the house and the host is like "but that will rot in 10 years!" and the builder feels all guilty about it. (it's almost vanishingly rare to have anything but softwood cladding in America (*), and yes in fact it does rot almost immediately). (* = you might get plastic or aluminum, or these days we have various fake wood cement fiber-board options, but you would never ever use hardwood, yegads).

Granted the houses on the show are on the high end; I'm sure low end European houses are shittier. Still.

Almost every house is post-and-beam, either timber or steel frames. The timber frames are fucking oak which just blew my mind the first time I saw it. Actual fucking carpenters cutting joints. And real fucking wood. We have nothing like that. "skilled tradesman" isn't even an occupation in America anymore. All we have is "day laborer who is using a nail gun for the first time ever".

An American-style asphalt shingle roof is looked down upon as ridiculously crappy. Everything is slate or tile or metal. Their rooves last longer than our entire houses.

One funny thing I noticed is that the cost of stonemasons and proper carpenters seems to be incredibly low in the UK. There are some houses with amazing hand-done stonework and carpentry, and they cost about the same as the fucking awful modern boxes that are all prefab glass and drywall. Why in the world would you get a horrible cold cheapo-condo-looking modern box when you could have something made of natural materials cut by hand? The stone work in particular shocked me how cheap it was.

Another odd one is the widespread use of "blockwork" (concrete block walls). That's something we almost never do for homes in America, and I'm not sure why not. It's very quick and cheap, and makes a very solid wall. We associate it with prisons and prison-like schools and such, but if you put plaster over it, it's a perfectly nice wall and feels just like a stone house. I guess even blockwork is expensive compared to the ridiculously cheap American stick-framing method.

Another difference that shocked me is the "fixed price contract". Apparently in the UK you can get a builder to bid a price, and then if there are any overruns *they* have to cover it. OMG how fucking awesome is that, I would totally consider building a house if you could do that.

Oh yeah, and of course the planning regulations are insane. Necessary evil I suppose. It's why Europe is beautiful (despite heavy human modification of absolutely everything) and America looks like a fucking pile of vomit anywhere it's been touched by the hand of man. (though a lot of the stuff that gets allowed on the show in protected areas is pretty awful modern crap that doesn't fit in or hide well at all, so I'm not sure the planners are really doing a great job. It seems like if you spend enough time and money they will eventually let you build an eyesore).

It's interesting to watch how people (the clients) handle the building process.

A few people get completely run over by their builder or architect, pushed into building something they don't want, and forced to eat delays and overruns and shitty quality, and that's sad to see. But it's also unforgivably pathetic of them to let it happen. YOU HAVE A FUCKING CH4 CAMERA CREW! It's the easiest time ever to make your builder be honest and hard-working. Just go confront them when the cameras are there and make them explain themselves on camera. WTF how can you be such a pussy that you don't stand up for yourself even when you have this amazing backup. But no, they'll say "oh, I don't know, what can I do about it?". You can bloody well do more than you are doing.

A few people are asshole micro-managers totally hovering over the crew all the time. The crew hate them and complain about them. But they also do seem to work harder. In this sad awful life of ours, being an annoying nag really does work great, because most people just don't want to deal with it and so will do what they have to in order to not get nagged.

Building a house is one of those situations where you can really see the difference between people who just suck it up and go with the flow "oh, I guess that's just what it costs", vs. people who are always scrapping and fighting and getting more for themselves. You can see some rich old pussy fart who doesn't fight might spend $1M on a build, and some other poor immigrant guy who knows how to deal and cajole and hustle might spend $200k on the exact same build. You can be bigger than your money or your intellect if you just fight for it.

The ones that are most impressive to me are the self-builds. It just astounds me how hard they work. And how wonderful to put 2 years or so of your life into just building one thing, that afterward you can go "I made this". Amazing, I'd love to do that. It's also the only time that I really see the people enjoying the process, and being happy afterward. (I particularly like the couple in scotland that does the gut-rehab of an old stone house all by themselves with no experience).

There are a few episodes with the classic manipulative architect. The architect-client relationship is usually semi-adversarial. Architects don't just want to make you the nice house you want, that suits you and is cheap and easy to build. They want to build something that will get them featured in a magazine; they want to build something that is cutting edge, or they have some bee in their bonnet that they want to try out. They want to use expensive and experimental methods and make you take all the risk for it. In order to get you to do that, they will lie to you about how risky and expensive it really it is. I don't necessarily begrudge the architects for that, it's what they have to do in order to get something interesting built. But it's amazing how naive and trusting some of the clients are. And it's a sort of inherently shady situation. Any time one person gets the upside (eg. the architect benefits if it goes well) and someone else gets the downside (the client has to eat the cost overrun and delays and live in the shitty house if it doesn't go well), that's a big problem for morality. You're relying solely on their ethics to treat you well, and that is an iffy thing to rely on.

09-02-13 | DEET

About to go camping for a few days. Discovered that the DEET has eaten its way through the original tube it came in, through a few layers of ziplocs, and out into my tub of camping stuff where it gladly ate a hole in my thermarest. Fucking DEET !

I guess after every trip I need to take the deet out and put it in a glass jar by itself. Or in a toxic waste containment facility or some shit. It's nasty stuff. Still better than mosquitos.

08-28-13 | Crying Baby Algorithm

Diagnosis and solution of crying baby problem.

1. Check diaper. This is always the first check because it's so easy to do and is such a conclusive result if positive. Diaper is changed, proceed to next step.

2. Check for boredom. Change what you're doing, take baby outside and play a new game. If this works, hunger or fatigue are still likely, you're just temporarily averting them.

3. Check hunger. My wife disagrees about this being so early in the order of operations, but testing it early seems like a clear win to me. After diaper this is the test with the most reliable feedback - if baby eats, that's a positive result and you're done. You can do an initial test for hungry-sucking with a finger in the mouth (hungry-sucking is much harder and more aggressive than comfort-sucking); hungry-sucking does not 100% of the time indicate actual hunger, nor does lack of hungry-sucking mean no hunger, but there is strong correlation. Doing a test feed is much easier with breastfed babies than with bottles (in which case you have to warm them up and find a clean nipple and so on). If baby is extremely upset, then failure to respond to a test-feed does not mean it is not hungry, you may have to repeat with stronger measures (see below).

4. Check for obvious stomach pain. Crying from stomach pain (gas, burps, acid, whatever) can be hard to diagnose once it becomes constant or severe (we'll get to that case below). But in the early/mild stages it's pretty easy to detect by testing body positions. If baby cries like crazy on its back but calms down upright or on its stomach (in a football hold), stomach pain is likely. If baby quiets down with abdomen pressure and back-patting, stomach pain is likely.

5. Look for obvious fatigue. The signs of this are sometimes clear - droopy eyes, yawning, etc. Proceed to trying to make baby sleep.

This is the end of the easy quick checks. If none of these work you're getting into the problem zone, where there may be multiple confounding factors, or the factors may have gone on so long that baby no longer responds well to them (eg. hungry but won't eat, gassy and patting doesn't stop the crying), so you will have to do harder checks.

Before proceeding, go back to #1 and do the easy checks again. Often it's the case that baby was just hungry, but after you did #1 it pooped and so wouldn't eat. It also helps to have a different person do the re-check; baby may not have responded to the first person but will respond to the next person doing exactly the same check the second time.

This is worth repeating because it's something that continues to be a stumbling block for me to this day - just because you checked something earlier in the crying process doesn't mean it's not the issue. You tested hunger, she wouldn't eat. That doesn't mean you never test hunger again! You have to keep working on issues and keep re-checking. You have not ruled out anything! (except dirty diaper, stick your finger in there again to make sure, nope that's not it).

If that still doesn't work, proceed to the longer checks :

6. Check for overstimulation. This is pretty straightforward but can take a while. Take baby to a dark room, pat and rock, make shushes or sing to baby, perhaps swaddle. Just general calming. Crying may continue for quite a while even if this is the right solution so it takes some patience. (and then do 1-5 again)

7. Check for fatigue but won't go to sleep. The usual technique here is a stroller walk. (and then do 1-5 again)

8. Check for hungry but won't eat. You will have to calm baby and make it feel okay in order to eat. Proceed as per #6 (overstimulation) but stick baby on a nipple, in its most secure eating position. For us it helps to do this in the bedroom where baby sleeps and eats at night, that seems to be the most comforting place. Swaddling also helps here. Baby may even turn away and reject eating several times before it works. (and then do 1-5 again)

9. Assume unsignalled stomach pain. Failing all else, assume the problem is gastrointestinal, despite the lack of a clear GI relief test result (#4). So just walk baby around and pat its back, there's nothing else to do. (and keep repeating 1-5 again)

10. If you reach this point, your baby might just be a piece of shit. Throw it out and get a better one.

08-28-13 | How to Crunch

Baby is like the worst crunch ever. Anyway it's got me thinking about things I've learned about how to cope with crunch.

1. There is no end date. Never push yourself at an unsustainable level, assuming it's going to be over soon. Oh, the milestone is in two weeks, I'll just go really hard and then recover after. No no no, the end date is always moving, there's always another crunch looming, never rely on that. The proper way to crunch is to find a way to lift your output to the maximum level that you can hold for an indeterminate amount of time. Never listen to anyone telling you "it will be over on day X, let's just go all-out for that", just smile and nod and quietly know that you will have the energy to keep going if necessary.

2. Don't stop taking care of yourself. Whatever you need to do to feel okay, you need to keep doing. Don't cut it because of crunch. It really doesn't take that much time, you do have 1-2 hours to spare. I think a lot of people impose a kind of martyrdom on themselves as part of the crunch. It's not just "let's work a lot" it's "let's feel really bad". If you need to go to the gym, have a swim, have sex, do yoga, whatever it is, keep doing it. Your producers and coworkers who are all fucking stupid assholes will give you shit about it with passive aggressive digs; "ooh I'm glad our crunch hasn't cut into your workout time, none of the rest of us are doing that". Fuck you you retarded inefficient less-productive martyr pieces of crap. Don't let them peer pressure you into being stupid.

3. Resist the peer pressure. Just decided this is worth its own point. There's a lot of fucking retarded peer pressure in crunches. Because others are suffering, you have to also. Because others are putting in stupid long hours at very low productivity, you have to also. A classic stupid one is the next point -

4. Go home. One of the stupidest ideas that teams get in crunches is "if someone on the team is working, we should all stay for moral support". Don't be an idiot. You're going to burn out your whole team because one person was browsing the internet a month ago when they should have been working and is therefore way behind schedule? No. Everyone else GO HOME. If you aren't on the critical path, go sleep, you might be critical tomorrow. Yes the moral support is nice, and in rare cases I do advocate it (perhaps for the final push of the game if the people on the critical path are really hitting the wall), but almost never. Unfortunately as a lead you do often need to stick around if anyone on your team is there, that's the curse of the lead.

5. Sleep. As crunch goes on, lack of sleep will become a critical issue. You've got to anticipate this and start actively working on it from the beginning. That doesn't just mean making the time to lie in bed, it means preparing and thinking about how you're going to ensure you are able to get the sleep you need. Make rules for yourself and then be really diligent about it. For me a major issue is always that the stress of crunch leads to insomnia and the inability to fall asleep. For me the important rules are things like - always stop working by 10 PM in order to sleep by 12 (that means no computer at all, no emails, no nothing), no coffee after 4 PM, get some exercise in the afternoon, take a hot shower or bath at night, no watching TV in bed, etc. Really be strict about it; your sleep rules are part of your todo list, they are tasks that have to be done every day and are not something to be cut. I have occasionally fallen into the habit of using alcohol to help me fall asleep in these insomnia periods; that's a very bad idea, don't do that.

6. Be smart about what you cut out of your life. In order to make time for the work crunch you will have to sacrifice other things you do with your life. But it's easy to cut the wrong things. I already noted don't cut self care. (also don't cut showering and teeth brushing, for the love of god, you still have time for those). Do cut non-productive computer and other electronics time. Do cut anything that's similar to work but not work, anything where you are sitting inside, thinking hard, on a computer, not exercising. Do cut "todos" that are not work or urgent; stuff like house maintenance or balancing your checkbook, all that kind of stuff you just keep piling up until crunch is over. Do cut ways that you waste time that aren't really rewarding in any way (TV, shopping, whatever). Try not to cut really rewarding pleasure time, like hanging out with friends or lovers, you need to keep doing that a little bit (for me that is almost impossible in practice because I get so stressed I can't stop thinking about working for a minute, but in theory it sounds like a good idea).

7. Be healthy. A lot of people in crunch fall into the trap of loading up on sugar and caffeine, stopping exercising, generally eating badly. This might work for a few days or even weeks, but as we noted before crunch is always indeterminate, and this will fuck you badly long term. In fact crunch is the most critical time to be careful with your body. You need it to be healthy so you can push hard, in fact you should be *more* careful about healthy living than you were before crunch. It's a great time to cut out all sugary snacks, fast food, and alcohol.

08-27-13 | Email Woes

I'm having some problem with emails. Somewhere between gmail and my verio POP host, some emails are just not getting through.

If you send me an important email and I don't reply, you should assume I didn't get it. Maybe send it again with return-receipt to make sure. (if you send me a "funny" link and I don't reply, GTFO)

Fucking hell, I swear everyone is fired. How is this so hard. Sometimes I think that the project I should really be working on (something that would actually be important and make a difference in the world to a lot of people) is making a new internet from scratch. Fully encrypted and peer-to-peer so governments can never monitor it, and no corporate overlord controls the backbone, and text-only so fucking flash or whatever can never ruin it. Maybe no advertising either, like the good old days. But even if I built it, no one would come because it wouldn't have porn or lolcats or videos of nut-shots.

08-26-13 | OMG E46 M3's have the best ads

Everyone knows that E46 M3's attract an amazing demographic. I'm still occasionally keeping my eye out for them (I can't face the minivan!), and you come across the most incredible ads for these things.

Check this one out :

2004 E46 Bmw Laguna Blue M3

OMG it starts so good and then just gets better and better. Like you can't believe that he can top his last sentence, and then he just blows it away. Amazing. (if it's fake, it's fucking brilliant)

In white for when that expires (fucking useless broken ass piece of shit internet can't even hold some text permanently, what a fakakta load of crap everything in this life is, oy) :

Hello i got my baby blue BMW m3. Finally decided to sell. You will never find a cleaner BMW i promise. Best deal for best car. -black on black -6 speed manual -66,435 miles (all freeway) (half city)(1/2country) never racetracked. -Rebuilt title( -fixed by my brother Yuri with quality shop, he did good job). -19 inch wheels perfect for race, burnout, drift. I never do.. I promise. -AC blows hard like Alaska. -333hp but i put intake and its now 360hp at least, trust me. Powerfull 3,2 engine almost as fast as Ferarri 360. You dont believe? Look it up, youtube i have video! I can show you in the test drive.. You will feel like king on the roads... Trust me. This car puts you in extasy feeling. The pistons push hard. sometimes i say, "Big mistake- big payment, sucks job". But for you this the best because i believe ur serious. I keep it very clean, oil change every 20,000 miles. Ladies like it. My wife is mad sell, because she drives to work everyday, in seattle, and likes the power. I say we buy toyota and go vacation to Hawai. CASH ONLY( no credit, debit , fake checks, paypal, ebay, trade only for mercedes AMG, i like power. No lowball, serious buyers only, im tired of low balls offers, so please serious. I have pictures here. Thank you. I love the car so i could keep it if you dont want it. And go to casino to make payment. Im good with blackjack.

You dont believe? Look it up, youtube i have video!

08-22-13 | Sketch of Suffix Trie for Last Occurrence

I don't usually like to write about algorithms that I haven't actually implemented yet, but it seems in my old age that I will not actually get around to doing lots of things that I think about, so here goes one.

I use a Suffix Trie for string matching for my LZ compressors when optimal parsing.

reminder: Suffix Tries are really the super-awesome solution, but only for the case that you are optimal parsing not greedy parsing, so you are visiting every byte, and for large windows (sliding window Suffix Tries are not awesome). see : LZ String Matcher Decision Tree (w/ links to Suffix Trie posts)

Something has always bothered me about it. Almost the entire algorithm is this sweet gem of computer science perfection with no hackiness and a perfect O(N) running time. But there's one problem.

A Suffix Trie really just gives you the longest matching substring in the window. It's not really about the *location* of that substring. In particular, the standard construction using pointers to the string that was inserted will give you the *first* occurrence of each substring. For LZ compression what you want is the *last* occurrence of each substring.

(I'm assuming throughout that you use path compression and your nodes have pointers into the original window. This means that each step along the original window adds one node, and that node has the pointer to the insertion location.)

In order to get the right answer, whenever you do a suffix query and find the deepest node that you match, you should then visit all children and see if any of them have a more recent pointer. Say you're at depth D, all children at depth > D are also substring matches of the same first D bytes, so those pointers are equally valid string matches, and for LZ you want the latest one.

An equivalent alternative is instead of searching all children on query, you update all parents on insertion. Any time you insert a new node, go back to all your parents and change their pointers to your current pointer, because your pointer must match them all up to their depth, and it's a larger value.

Of course this ruins the speed of the suffix trie so you can't do that.

In Oodle I use limited parent updates to address this issue. Every time I do a query/insert (they always go together in an optimal parse, and the insert is always directly under the deepest match found), I take the current pointer and update N steps up the parent links. I tested various values of N against doing full updates and found that N=32 gave me indistinguishable compression ratios and very little speed hit.

(any fixed walk-up count preserves the O(N) running time of the suffix trie, where N here is the window size; it's just a constant multiplier). (you need to walk up to parents anyway if you want to find shorter matches at lower offsets; the normal suffix lookup just gives you the single longest match).

So anyway, that heuristic seems to work okay, but it just bothers me because everything else about the Suffix Trie is so pure with no tweak constants in it, and then there's this one hack. So, can we solve this problem exactly?

I believe so, but I don't quite see the details yet. The idea goes like this :

I want to use the "push pointer up to parents method". But I don't actually want to update all parents for each insertion. The key to being fast is that many of the nodes of the suffix trie will never be touched again, so we want to kind of virtually mark those nodes as dirty, and they can update themselves if they are ever visited, but we don't do any work if they aren't visited. (BTW by "fast" here I mean the entire parse should still be O(N) or O(NlogN) but not fall to O(N^2) which is what you get if you do full updates).

In particular in the degenerate match cases, you spend all your time way out at the leaves of the suffix trie chasing the "follows" pointer, you never go back to the root, and many of the updates overwrite each other in a trivial way. That is, you might do substring "xxaaaa" at "ptr", and then "xxaaaaa" at "ptr+1" ; the update of "ptr" back up the tree will be entirely overwritten by the update from "ptr+1" (since it also serves as an "xxaa" match and is later), so if you just delay the update it doesn't need to be done at all.

(in the end this whole problem boils down to a very simple tree question : how do you mark a walk from a leaf back to the root with some value, such that any query along that walk will get the value, but without actually doing O(depth) work if those nodes are not touched? Though it's not really that question in general, because in order to be fast you need to use the special properties of the Suffix Trie traversal.)

My idea is to use "sentries". (this is a bit like skip-lists or something). In addition to the "parent" pointer, each node has a pointer to the preceding "sentry". Sentry steps take you >= 1 step toward root, and the step distance increases. So stepping up the sentry links might take you 1,1,2,4,.. steps towards root. eg. you reach root in log(depth) steps.

When you insert a new node, instead of walking all parents and changing them to your pointer, you walk all sentries and store your pointer as a pending update.

When you query a node, you walk to all sentries and see if any of them has a lower pointer. This effectively finds if any of your children did an update that you need to know about.

The pointer that you place in the sentry is really a "pending update" marker. It means that update needs to be applied from that node up the tree to the next sentry (ADD: I think you also need to store the range that it applies to, since a large-step range can get broken down to smaller ranges by updates). You know what branch of the tree it applies to because the pointer is the string and the string tells you what branch of the tree to follow.

The tricky bit happens when you set the pointer in the sentry node, there may be another pointer there from a previous insertion that is still pending update. You need to apply the previous pending update before you store your new pointer in the pending update slot.

Say a node contains a pending update with the pointer "a", and you come in and want to mark it with "b". You need to push the "a" update into the range that it applies to, so that you can set that node to be pending with a "b".

The key to speed is that you only need to push the "a" update where it diverges from "b". For example if the substring of "a" and "b" is the same up to a deeper sentry that contains "b" then you can just throw away the "a" pending update, the "b" update completely replaces it for that range.

Saying it all again :

You have one pointer update "a" that goes down a branch of the tree. You don't want to actually touch all those nodes, so you store it as applying to the whole range. You do a later pointer update "b" that goes down a branch that partially overlaps with the "a" branch. The part that is "a" only you want to leave as a whole range marking, and you do a range-marking for "b". You have to find the intersection of the two branches, and then the area where they overlap is again range-marked with "b" because it's newer and replaces "a". The key to speed is that you're marking big ranges of nodes, not individual nodes. My proposal for marking the ranges quickly is to use power-of-2 sentries, to mark a range of length 21 you would mark spans of length 16+4+1 kind of a thing.

Maybe some drawings are clearer. Here we insert pointer "a", and then later do a query with pointer "b" that shares some prefix with "a", and then insert "b".

The "b" update to the first sentry has to push the "a" update that was there up until the substrings diverge. The update back to the root sees that "a" and "b" are the same substring for that entire span and so simply replaces the pending update of "a" with a pending update of "b".

Let's see, finishing up.

One thing that is maybe not clear is that within the larger sentry steps the smaller steps are also there. That is, if you're at a deep leaf you walk back to the root with steps that go 1,1,2,4,8,16,32. But in that last big 32 step, that doesn't mean that's one region of 32 nodes with no other sentries. Within there are still 1,2,4 type steps. If you have to disambiguate an update within that range, it doesn't mean you have to push up all 32 nodes one by one. You look and see hey I have a divergence in this 32-long gap, so can I just step up 16 with "a" and "b" being the same? etc.

I have no idea what the actual O() of this scheme is. It feels like O(NlogN) but I certainly don't claim that it is without doing the analysis.

I haven't actually implemented this so there may be some major error in it, or it might be no speed win at all vs. always doing full updates.

Maybe there's a better way to mark tree branches lazily? Some kind of hashing of the node id? Probabilistic methods?

08-19-13 | Sketch of multi-Huffman Encoder

Simple way to do small-packet network packet compression.

Train N different huffman code sets. Encoder and decoder must have a copy of the static N code sets.

For each packet, take a histogram. Use the histogram to measure the transmitted length with each code set, and choose the smallest. Transmit the selection and then the encoding under that selection.

All the effort is in the offline training. Sketch of training :

Given a large number of training packets, do this :

for - many trials - 

select 1000 or so seed packets at random
(you want a number 4X N or so)

each of the seeds is a current hypothesis of a huffman code set
each seed has a current total histogram and codelens

for each packet in the training set -
add it to one of the seeds
the one which has the most similar histogram
one way to measure that is by counting the huffman code length

after all packets have been accumulated onto all the seeds,
start merging seeds

each seed has a cost to transmit; the size of the huffman tree
plus the data in the seed, huffman coded

merge seeds with the lowest cost to merge
(it can be negative when merging makes the total cost go down)

keep merging the best pairs until you are down to N seeds

once you have the N seeds, reclassify all the packets by lowest-cost-to-code
and rebuild the histograms for the N trees using only the new classification

those are your N huffman trees

measure the score of this trial by encoding some other training data with those N trees.

It's just k-means with random seeds and bottom-up cluster merging. Very heuristic and non-optimal but provides a starting point anyway.

The compression ratio will not be great on most data. The advantage of this scheme is that there is zero memory use per channel. The huffman trees are const and shared by all channels. For N reasonable (4-16 would be my suggestion) the total shared memory use is quite small as well (less than 64k or so).

Obviously there are many possible ways to get more compression at the cost of more complexity and more memory use. For packets that have dword-aligned data, you might do a huffman per mod-4 byte position. For text-like stationary sources you can do order-1 huffman (that is, 256 static huffman trees, select by the previous byte), but this takes rather more const shared memory. Of course you can do multi-symbol huffman, and there are lots of ways to do that. If your data tends to be runny, an RLE transform would help. etc. I don't think any of those are worth pursuing in general, if you want more compression then just use a totally different scheme.

Oh yeah this also reminds me of something -

Any static huffman encoder in the vernacular style (eg. does periodic retransmission of the table, more or less optimized in the encoder (like Zip)) can be improved by keeping the last N huffman tables. That is, rather than just throwing away the history when you send a new one, keep them. Then when you do retransmission of a new table, you can just send "select table 3" or "send new table as delta from table 5".

This lets you use locally specialized tables far more often, because the cost to send a table is drastically reduced. That is, in the standard vernacular style it costs something like 150 bytes to send the new table. That means you can only get a win from sending new tables every 4k or 16k or whatever, not too often because there's big overhead. But there might be little chunks of very different data within those ranges.

For example you might have one Huffman table that only has {00,FF,7F,80} as literals (or whatever, specific to your data). Any time you encounter a run where those are the only characters, you send a "select table X" for that range, then when that range is over you go back to using the previous table for the wider range.

08-13-13 | Dental Biofeedback

Got my teeth cleaned the other morning, and as usual the dental hygienist cut the hell out of my gums with her pointy stick.

It occurred to me that the real problem is that she can't feel what I feel; when the tip of the tool starts to go into my gum, she doesn't get the pain message. It's incredibly hard to do sensitive work on other people because you don't have their self-sensory information.

But there's no reason for it to be that way with modern technology. You should be able to put a couple of electrodes on the patient to pick up the pain signal (or perhaps easier is to detect the muscles of the face clenching) hooked up to like a vibrating pad in the hygienist's chair. Then the hygienist can poke away, and get haptic feedback to guide their pointy stick.

I should make a medical device so I can get in on the gold-rush which is the health care raping of America. Yeeeaaaah!

This reminds me how incredibly shitty all the physical therapists that I saw for my shoulder were.

One of my really tough lingering problems is that after my injury, my serratus anterior basically stopped firing, which has given me a winged scapula. I went and got nerve conduction testing, as long thoracic nerve dysfunction is a common cause of this problem, but it seems the nerve is fine, it's just that my brain is no longer telling that muscle to fire. When I do various movements that should normally be recruiting the SA, instead my brain tells various other muscles to fire and I do the movement in a weird way.

Anyway I did lots of different types of PT with various types of quacks who never really listened to me properly or examined the problem properly, they just started doing their standard routine that didn't quite apply to my problem. (by far the worst was the head of Olympic PT, who started me on his pelvic floor program; umm, WTF; I guess the guy is just a molester who likes to mess around with pelvic floors, I could see half of the PT office doing pelvic floor work). When someone has this problem of firing the wrong muscles to do movements, you can't just tell them to go do 100 thera-band external rotations, because they will wind up losing focus and doing them with the wrong muscles. Just having the patient do exercises is entirely the wrong prescription.

Not one of them did the correct therapy for this problem, which is biofeedback. The patient should be connected to electrodes that detect the firing of the non-functioning muscle. They should then be told to do a series of movements that in a normal human would involve the firing of that muscle. The patient should be shown a monitor or given a buzzer that lets them know if they are firing it correctly. The direct sensory feedback is the best way to retrain the muscle control part of the brain.

A very simple but effective way to do this is for the therapist to put a finger on the muscle that should be firing. By just applying slight pressure with the finger it makes the patient aware of whether that muscle is contracting or not and can guide them to do the movement with the correct muscles. (it's a shame that our society is so against touching, because it's such an amazing aid to have your personal trainer or your yoga teacher or whatever put their hands on the part of the body they are trying to get you to concentrate on, but nobody ever does that in stupid awful don't-molest-me America).

Everyone is fired.

A related note : I've tried various posture-correction braces over the years and I believe they all suck or are even harmful. In order to actually pull your shoulders back against your body weight, you have to strap yourself into a very tight device, and none of them really work. And even if they do work, having a device support your posture is contrary to building muscle and muscle-memory to train the patient to do it themselves. I always thought that the right thing was some kind of feedback buzzer system. Give the patient some kind of compression shirt with wiring on the front and back that can detect the length of the front of the shirt and the length of the back. Have them establish a baseline of correct posture. The shirt then has a little painful electric zapper in it, so if they are curving forward, which stretches the back of the shirt and compresses the front, they get a zap. The main problem with people with posture problems is just that they forget about it and fall into bad old habits, so you need to give them this constant monitoring and immediate feedback to fix it.

08-13-13 | Subaru WRX Mini Review

We got a WRX Wagon a while ago to be our primary family car. It's basically superb, I absolutely love it.

Above all else I love that it is simple and functional. It feels like the best car that Subaru could have possibly built for the money. It has no extra frills. It's not trying to be "luxury" in any way, and I fucking love that. But where it counts, the engine, the drivetrain, the reliability, it is totally quality, superb quality, much better than the so-called German masters of engineering. And actually I like the feel of the interior plastics (for the most part); they feel functional and solid and unpretentious. Way better than the gross laminated wood in modern BMWs and the blingy rhinestone-covered interiors of modern Audis. So-called "nice" cars have all become like tacky McMansions catering to the Chelsea tastes, all faux-marble and oversized and so on. Trying to appear better than they actually are, whereas the Subaru is better quality than it looks.

It feels so crazy fast on the street. To be so sporty and also be a good family wagon is just amazing. Sometimes I wish I'd spent the extra money for the STI (and then spent some more to de-stupidify the ridiculously hard and low suspension of the STI) just for the diffs. The base WRX (non STI) has a totally non-functional viscous 4wd system that ruins the ability to throttle steer in a really predictable fun way. But realistically the number of times that I'm pushing the car that hard is vanishingly small and it would need lots of other mods to be a Ken Block trick machine.

Given that I basically love it and highly recommend it (*), here are a few things that I don't like about it :

(* = I believe that everyone should buy a WRX. If you want a car for X purpose, no matter what X is, the answer is WRX. The only reasons I see to buy anything but a WRX are 1. the gas mileage is not great, and 2. if you really need more space, like if you have 8 children and 4 great danes or something ridiculous like that).

Sport Seats. God sports seats are so fucking retarded and are ruining cars. They are entirely for looks, put in by the marketing department to attract dumb teenage boys (the same dumb boys that like low profile tires and jutting hips). You can make a non-sports seat that holds you in place perfectly fine, and if you are actually doing performance driving you just need a harness (or a lap-belt lock). The specific problem in the WRX is the big solid head rests that are part of the seat - they block visibility really badly for backing up, and they also get in the way of the folding rear seats, so to fold the back seats up and down you have to go fold the front seats up first. It's lame.

Hill-start Assist. I hate hill-start assist. It would be okay if it was optional, actually I would like it if it was optional, on a button I could turn on and off so I would only get it when I actually wanted it, but no it's on all the time. I can start a car on a hill just fine by myself, thank you (where were you 15 years ago when I couldn't ?). The main result of hill-start assist is that I press the gas to get going and the car doesn't move and it feels like the brake is holding the car and I'm like "wtf is going on, the car is broken god dammit" and then it finally lets go and I realize it's the dumb HSA. Sucky. Stop trying to help me drive.

Variable Power Steering. The PS is variably boosted and *way* too variable. It's so boosted at low speed that the wheel feels all swimmy with no response. Variable PS in general hurts your ability to develop muscle-memory for turning. If you are going to do variable PS, it should be a very minor difference, not like a completely different car the way it is in the WRX; basically an ideal variable PS system should be so minor that the driver doesn't consciously know it's there at all, it should just feel right. At least it's still hydraulic, not that god-awful electric crap.

Hatch blockage. One of the big problems with the car is that the wagon back is not as useful as it should be. It's a decent amount of space, but there's a lot of weird bumps and lips that make it hard to get large things inside. The hatch opening in particular is a lot smaller than it should be, and it shouldn't be so sloped in the rear; if the roof was longer and the rear glass more vertical, you could actually get a small couch in the back (fast-backs in general are retarded; they are drawn by car designers because they *look* aerodynamic, but a hard edge is actually more aero and provides way more interior space). As is, it's unfortunately terrible for cargo carrying and just not a very good small station wagon.

Summer Tires. God damn summer tires are such a stupid scam. They're annoyingly low profile too. It just sucks that almost any car you buy these days you have to immediately buy new wheels and tires to undo the stupidity, which is again marketing department tinkering. Cars should be sold on all-season tires, and we need to stop this low-profile retardation. Rubber is amazing wonderful stuff, it's what makes wheels work well. (part of the reason cars are all sold on summer tires now is to make the magazine tests "on stock tires" look better. Those "on stock tires" tests have always been retarded, because there are a few sports cars sold on autocross tires (street legal track tires) which is totally unfair, and a few actually good-to-their-customers car companies sell their cars on all-seasons, so the magazines' claim that by testing all cars on stock tires is a fair way to compare is just bananas.)

Jack plates & hitch. The jack points on these cars absolutely suck (pinch welds with the actually reinforced part inboard of the pinch weld). What would it cost the factory to put a decent jack plate there, maybe $10? Come on. A factory hitch option would have been nice too.

Plastic bumpers. The body skirts are super low (too low, they don't clear a parking spot curb thingy) and what's worse is they're made from crunchy hard plastic that cracks or pops out of the weak plastic press-fittings rather than bends. I wish bumpers and skirts were still steel with rubber, or even soft polyurethane plastic that bends on small impacts would be fine. As usual with metric over-training, the fact that the official crash tests don't rate the damage to the car means that modern cars are very good at protecting the occupants, but self-destruct on even the slightest impact. These really aren't even "bumpers", the actual impact absorbing part is all hidden underneath; these are plastic skins over the bumpers, and they really should be designed so that you can have a 1 mph impact without cracking them or popping the fittings out.

But really the only serious complaint is the stupid shape that ruins its cargo capacity :

I just hate it when marketing people ruin the functionality of things for no good reason. First you make it as functional as possible. Then the marketing people can play with the look or whatever but *only* in ways that don't change the functionality. Almost all modern cars are really "haunchy" these days, with wheels jutting out from the body, and bodies that are cut in at the top and back to be "sleek". No it is *not* for handling or aerodynamics. It's entirely an aesthetic thing. It's making me long for something like the Honda Element that at least has not compromised function for stupid vanity.

08-13-13 | Kinesis Freestyle 2 Mini Review

Aaron got a KF2 so I tried it out. Very quick review :

(on the right side I've failed to circle the Shift and Control keys which are perhaps the worst part)

The key action is not awesome but is totally acceptable. It would not stop me from using it.

Even in brief testing I could feel that having my hands further apart was pretty cool. Getting your elbows back tucked into your ribs with forearms externally rotated helps reduce the way that normal computer use posture has the weight of the arms collapsing the chest, causing kyphosis.

The fact that it sits flat and is totally old-skool keyboard style is pretty retarded. Who is buying an ergonomic split keyboard that wants a keyboard that could have been made in the 80's ? You need to start from a shape like the MS Natural and *then* cut it in half and split it. You don't start with a keyboard that came on your Packard Bell BBS-hosting 8086 machine. In other words, each half should have a curve and should be lifted and tilted. (the thumbs should be much higher than the pinkies).

(ideally each half should sit on something like a really nice high-friction camera tripod mount, so that you can adjust the tilt to any angle you want; the perfectly-flat or perfectly-vertical is no good).

The real absolute no-go killer is the fucking retarded arrow key layout. It's not a fucking laptop (and even on a laptop there's no excuse for that). What are you thinking? There is *one* way to do arrow keys and the pgup/home set, you do not change that. Also cramming the arrows in there makes the shift key too small and puts the right-control in the wrong place. Totally unacceptable, why are they trying so hard to save on plastic? There should be a normal proper offset arrow key set over on the right side.

And get rid of all those custom-function buttons on the left side. They serve no purpose, and negatively move my left-hand-mouse out further left than it needs to be.

Why is everyone but me so retarded? Why am I not hired to design every product in the world? Clearly no one else is capable. Sheesh.

08-12-13 | Cuckoo Cache Tables - Failure Report

This is a report on a dead end, which I wish people would do more often.

Ever since I read about Cuckoo Hashing I thought, hmm even if it's not the win for hash tables, maybe it's a win for "cache tables" ?

(A cache table is like a hash table, but it never changes size, and inserts might overwrite previous entries (or not insert the new entry, though that's unusual). There may be only a single probe or multiple).

Let me introduce it as a progression :

1. Single hash cache table with no hash check :

This is the simplest. You hash a key and just look it up in a table to get the data. There is no check to ensure that you get the right data for your key - if you have collisions you may just get the wrong data back from lookup, and you will just stomp other people's data when you write.

Data table[HASH_SIZE];

lookup :

hash = hash_func(key);
Data & found = table[hash];

insert :

table[hash] = data;

This variant was used in LZP1 ; it's a good choice in very memory-limited situations where collisions are either unlikely or not that big a deal (eg. in data compression, a collision just means you code from the wrong statistics, it doesn't break your algorithm).
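For concreteness, variant #1 might be sketched like this in C++ (all the names, sizes, and the mixing constant here are illustrative, not from LZP1):

```cpp
#include <cstdint>

// Variant 1 : single hash, no check. Lookup may silently return the
// wrong data on a collision; insert just stomps whatever was there.
const int HASH_BITS = 12;
const int HASH_SIZE = 1 << HASH_BITS;

typedef uint32_t Data;

struct CacheTable1
{
    Data table[HASH_SIZE];

    // any decent mixing function works; this one is just an example
    static uint32_t hash_func(uint64_t key)
    {
        key *= 0x9E3779B97F4A7C15ull;
        return (uint32_t)(key >> (64 - HASH_BITS));
    }

    void insert(uint64_t key, Data data) { table[hash_func(key)] = data; }
    Data lookup(uint64_t key) const      { return table[hash_func(key)]; }
};
```

Note there's no way to distinguish "found" from "found garbage" here; that's the whole point of this variant, and why it only makes sense when collisions are benign.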

2. Single hash cache table with check :

We add some kind of hash-check value to our hash table to try to ensure that the entry we get actually was from our key :

Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash = hash_func(key);
check = alt_hash_func(key);
if ( table_check[hash] == check )
  Data & found = table[hash];

insert :

table_check[hash] = check;
table[hash] = data;

In practice, hash_func and alt_hash_func are usually actually the same hash function, and you just grab different bit ranges from it. eg. you might do a 64-bit hash func and grab the top and bottom 32 bits.

In data compression, the check hash value can be quite small (8 bits is common), because as noted above collisions are not catastrophic, so just reducing the probability of an undetected collision to 1/256 is good enough.
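A sketch of variant #2 with an 8-bit check, pulling the index and the check from different bit ranges of one 64-bit hash (names and constants are hypothetical):

```cpp
#include <cstdint>

// Variant 2 : single hash plus an 8-bit check value.
const int HASH_BITS = 12;
const int HASH_SIZE = 1 << HASH_BITS;

typedef uint32_t Data;

struct CacheTable2
{
    Data    table[HASH_SIZE];
    uint8_t check[HASH_SIZE];
    bool    used[HASH_SIZE]; // just for the sketch; real tables often skip this

    // one 64-bit hash; index and check come from different bit ranges
    static uint64_t hash64(uint64_t key)
    {
        key *= 0x9E3779B97F4A7C15ull;
        key ^= key >> 29;
        return key;
    }
    static uint32_t index_of(uint64_t key) { return (uint32_t)(hash64(key) & (HASH_SIZE-1)); }
    static uint8_t  check_of(uint64_t key) { return (uint8_t)(hash64(key) >> 56); }

    void insert(uint64_t key, Data data)
    {
        uint32_t h = index_of(key);
        table[h] = data; check[h] = check_of(key); used[h] = true;
    }

    // returns false on a miss or detected collision; about 1/256 of
    // collisions go undetected, which is fine for compression use
    bool lookup(uint64_t key, Data * found) const
    {
        uint32_t h = index_of(key);
        if ( ! used[h] || check[h] != check_of(key) ) return false;
        *found = table[h];
        return true;
    }
};
```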

3. Double hash cache table with check :

Of course since you are now making two hashes, you could look up two spots in your table. We're basically running the primary hash and alt_hash above, but instead of unconditionally using only one of them as the lookup hash and one as the check, we can use either one.

Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash1 = hash_func(key);
hash2 = alt_hash_func(key);
if ( table_check[hash1] == hash2 )
  Data & found = table[hash1];
else if ( table_check[hash2] == hash1 )
  Data & found = table[hash2];

insert :

if ( quality(table[hash1]) <= quality(table[hash2]) )
{
    table_check[hash1] = hash2;
    table[hash1] = data;
}
else
{
    table_check[hash2] = hash1;
    table[hash2] = data;
}

Where we now need some kind of quality function to decide which of our two possible insertion locations to use. The simplest form of "quality" just checks if one of the slots is unused. More complex would be some kind of recency measure, or whatever is appropriate for your data. Without any quality rating you could still just use a random bool there or a round-robin, and you essentially have a hash with two ways, but where the ways are overlapping in a single table.

Note that here I'm showing the check as using the same number of bits as the primary hash, but it's not required for this type of usage, it could be fewer bits.

(also note that it's probably better just to use hash1 and hash1+1 as your two hash check locations, since it's so much better for speed, but we'll use hash1 and hash2 here as it leads more directly to the next -)
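Variant #3 might be sketched like this; "quality" here is just an occupancy check, the simplest form mentioned above (everything else is a hypothetical name):

```cpp
#include <cstdint>

// Variant 3 : two probe locations; each entry's check stores the
// *other* hash, so an entry is findable at either location.
const int HASH_BITS = 12;
const int HASH_SIZE = 1 << HASH_BITS;

typedef uint32_t Data;

struct CacheTable3
{
    Data     table[HASH_SIZE];
    uint32_t check[HASH_SIZE];
    bool     used[HASH_SIZE];

    // simplest quality : empty slots are worse than occupied ones,
    // so we prefer to fill an empty slot
    int quality(uint32_t h) const { return used[h] ? 1 : 0; }

    bool lookup(uint32_t hash1, uint32_t hash2, Data * found) const
    {
        if ( used[hash1] && check[hash1] == hash2 ) { *found = table[hash1]; return true; }
        if ( used[hash2] && check[hash2] == hash1 ) { *found = table[hash2]; return true; }
        return false;
    }

    void insert(uint32_t hash1, uint32_t hash2, Data data)
    {
        if ( quality(hash1) <= quality(hash2) )
        {
            table[hash1] = data; check[hash1] = hash2; used[hash1] = true;
        }
        else
        {
            table[hash2] = data; check[hash2] = hash1; used[hash2] = true;
        }
    }
};
```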

4. Double hash cache table with Cuckoo :

Once you get to #3 the possibility of running a Cuckoo is obvious.

That is, every entry has two possible hash table indices. You can move an entry to its alternate index and it will still be found. So when you go to insert a new entry, instead of overwriting, you can push what's already there to its alternate location. Lookup is as above, but insert is something like :

insert :

PushCuckoo(hash1);

table_check[hash1] = hash2;
table[hash1] = data;

PushCuckoo(h) :

// I want to write at h; kick out whatever is there

if ( table[h] is empty ) return;

// move the occupant of h to its alternate location
alt = table_check[h];

PushCuckoo(alt);

table[alt] = table[h];
table_check[alt] = h;


Now of course that's not quite right because this is a cache table, not a hash table. As written above you have a guaranteed infinite loop because cache tables are usually run with more unique insertions than slots, so PushCuckoo will keep trying to push things and never find an empty slot.

For cache tables you just want to do a small limited number of pushes (maybe 4?). Hopefully you find an empty slot in that search, and if not you lose the entry that had the lowest "quality" in the sequence of steps you did. That is, remember the slot with lowest quality, and do all the cuckoo-pushes that precede that entry in the walk.

For example, if you have a sequence like :

I want to fill index A

hash2[A] = B
hash2[B] = C
hash2[C] = D
hash2[D] = E

none are empty

entry C has the lowest quality of A-E

Then push :

B -> C
A -> B
insert at A

That is,

table[C] = table[B]
hash2[C] = B
table[B] = table[A]
hash2[B] = A
table[A],hash2[A] = new entry

The idea is that if you have some very "high quality" entries in your cache table, they won't be destroyed by bad luck (some unimportant event which happens to have the same hash value and thus overwrites your high quality entry).
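The bounded-push insert might be sketched like this (a sketch, not the tested implementation; here "quality" is just a score the caller supplies, and a real version would probably also want to detect cycles in the walk):

```cpp
#include <cstdint>

const int HASH_BITS  = 8;
const int HASH_SIZE  = 1 << HASH_BITS;
const int MAX_PUSHES = 4;

struct Entry { uint32_t data; int quality; bool used; };

struct CuckooCache
{
    Entry    table[HASH_SIZE];
    uint32_t alt[HASH_SIZE];  // alternate index of the occupant (the "check")

    void insert(uint32_t hash1, uint32_t hash2, uint32_t data, int q)
    {
        // walk the chain of alternate locations starting at hash1,
        // remembering the lowest-quality slot seen (the victim)
        uint32_t walk[MAX_PUSHES+1];
        walk[0] = hash1;
        int worst = 0, n = 1;
        for(;;)
        {
            uint32_t cur = walk[n-1];
            if ( ! table[cur].used ) { worst = n-1; break; } // empty : nothing lost
            if ( table[cur].quality < table[walk[worst]].quality ) worst = n-1;
            if ( n > MAX_PUSHES ) break;
            walk[n++] = alt[cur];
        }
        // cuckoo-push every entry that precedes the victim in the walk
        for(int i = worst; i > 0; i--)
        {
            table[walk[i]] = table[walk[i-1]];
            alt[walk[i]]   = walk[i-1];
        }
        table[hash1].data = data; table[hash1].quality = q; table[hash1].used = true;
        alt[hash1] = hash2;
    }

    bool lookup(uint32_t hash1, uint32_t hash2, uint32_t * found) const
    {
        if ( table[hash1].used && alt[hash1] == hash2 ) { *found = table[hash1].data; return true; }
        if ( table[hash2].used && alt[hash2] == hash1 ) { *found = table[hash2].data; return true; }
        return false;
    }
};
```

On the A-E example above this does exactly the moves shown: worst lands on C, so B is pushed to C, A is pushed to B, and the new entry goes in at A.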

So, I have tried this and in my experiments it's not a win.

To test it I wrote a simple symbol-rank compressor. My SR is order-5-4-3-2-1 with only 4 symbols ranked in each context. (I chose an SR just because I've been working on SR for RAD recently; otherwise there's not much reason to pursue SR, it's generally not Pareto). Contexts are hashed and looked up in a cache table. I compressed enwik8. I tweaked the compressor just enough so that it's vaguely competitive with state of the art (for example, I use a very simple secondary statistics table for coding the SR rank), because otherwise it's not a legitimate test.

For Cuckoo Caching, the hash check value must be the same size as the hash table index, so that's what I've done for most of the testing. (in all the other variants you are allowed to set the size of the check value freely). I also tested an 8-bit check value for the single lookup case.

I'm interested in low memory use and really stressing the cache table, so most of the testing was at 18-bits of hash table index. Even at 20 bits the difference between Cuckoo and no-Cuckoo disappears.

The results :

18 bit hash :

Single hash ; no confirmation :
Compress : 100000000 -> 29409370 : 2.353

Single hash ; 8 bit confirmation :
Compress : 100000000 -> 25169283 : 2.014

Single hash ; hash_bits size confirmation :
Compress : 100000000 -> 25146207 : 2.012

Dual Hash ; hash_bits size confirmation :
Compress : 100000000 -> 24933453 : 1.995

Cuckoo : (max of 10 pushes)
Compress : 100000000 -> 24881931 : 1.991

Conclusion : Cuckoo Caching is not compelling for data compression. Having some confirmation hash is critical, but even 8 bits is plenty. Dual hashing is a good win over single hashing (and surprisingly there's very little speed penalty (with small cache tables, anyway, where you are less likely to pay bad cache miss penalties)).

For the record :

variation of compression with hash table size :

two hashes, no cuckoo :

24 bit o5 hash : (24,22,20,16,8)
Compress : 100000000 -> 24532038 : 1.963
20 bit o5 hash : (20,19,18,16,8)
Compress : 100000000 -> 24622742 : 1.970
18 bit o5 hash : (18,17,17,16,8)
Compress : 100000000 -> 24933453 : 1.995

Also, unpublished result : noncuckoo-dual-hashing is almost as good with the 2nd hash kept within cache page range of the 1st hash. That is, the good thing to do is lookup at [hash1] and [hash1 + 1 + (hash2&0xF)] , or some other form of semi-random nearby probe (as opposed to doing [hash1] and [hash2] which can be quite far apart). Just doing [hash1] and [hash1+1] is not as good.
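That nearby second probe can be sketched as (mask and constants illustrative):

```cpp
#include <cstdint>

const uint32_t HASH_SIZE = 1 << 18;

// second probe lands within 16 slots of the first, so both probes
// usually touch the same cache page; hash2 just perturbs the offset
uint32_t probe2(uint32_t hash1, uint32_t hash2)
{
    return (hash1 + 1 + (hash2 & 0xF)) & (HASH_SIZE - 1);
}
```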

08-08-13 | Oodle Static LZP for MMO network compression

Followup to my post 05-20-13 - Thoughts on Data Compression for MMOs :

So I've tried a few things, and Oodle is now shipping with a static dictionary LZP compressor.

OodleStaticLZP uses a static dictionary and hash table which is const and shared by all network channels. The size is set by the user. There is an adaptive per-channel arithmetic coder so that the match length and literal statistics can adapt to the channel a bit (this was a big win vs. using any kind of static models).

What I found from working with MMO developers is that per-channel memory use is one of the most important issues. They want to run lots of connections on the same server, which means the limit for per-channel memory use is something like 512k. Even a zlib encoder at 400k is considered rather large. OodleStaticLZP has 182k of per-channel state.

On the server, a large static dictionary is no problem. They're running 16GB servers with 10,000 connections, they really don't care if the static dictionary is 64MB. However, that same static dictionary also has to be on the client, so the limit on how big a static dictionary you can use really comes from the client side. I suspect that something in the 8MB - 16MB range is reasonable. (and of course you can compress the static dictionary; it's only something like 2-4 MB that you have to distribute and load).

(BTW you don't necessarily need an adaptive compression state for every open channel. If some channels tend to go idle, you could drop their state. When the channel starts up again, grab a fresh state (and send a reset message to the client so it wipes its adaptive state). You could do something like have a few thousand compression states which you cycle in an LRU for an unbounded number of open channels. Of course the problem with that is if you actually get a higher number of simultaneous active connections you would be recycling states all the time, which is just the standard cache over-commit problem that causes nasty thrashing, so YMMV etc.)

This is all only for downstream traffic (server->client). The amount of upstream traffic is much less, and the packets are tiny, so it's not worth the memory cost of keeping any memory state per channel for the upstream traffic. For upstream traffic, I suggest using a static huffman encoder with a few different static huffman models; first send a byte selecting the huffman table (or uncompressed) and then the packet huffman coded.

I also tried a static dictionary / adaptive statistics LZA (LZA = LZ77+arith) (and a few other options, like a static O3 context coder and some static fixed-length string matchers, and static longer-word huffman coders, but all those were much worse than static LZA or LZP). The static dictionary LZA was much worse than the LZP.

I could conjecture that the LZP does better on static dictionaries than LZA because LZP works better when the dictionary mismatches the data. The reason being that LZP doesn't even try to code a match unless it finds a context, so it's not wasting code space for matches when they aren't useful. LZ77 is always trying to code matches, and will often find 3-byte matches just by coincidence, but the offsets will be large so they're barely a win vs literals.

But I don't think that's it. I believe the problem with static LZA is simply that for an offset-coded LZ (as I was using), it's crucial to put the most useful data at low offset. That requires a very cleverly made static dictionary. You can't just put the most common short substrings at the end - you have to also be smart about how those substrings run together to make the concatenation of them also useful. That would be a very interesting hard algorithm to work on, but without that work I find that static LZA is just not very good.

There are obvious alternatives to optimizing the LZA dictionary; for example you could take the static dictionary and build a suffix trie. Then instead of sending offsets into the window, forget about the original linear window and just send substring references in the suffix trie directly, ala the ancient Fiala & Green. This removes the ugly need to optimize the ordering of the linear window. But that's a big complex bowl of worms that I don't really want to go into.

Some results on some real packet data from a game developer :

downstream packets only
1605378 packets taking 595654217 bytes total
371.0 bytes per packet average

O0 static huff : 371.0 -> 233.5 average

zlib with Z_SYNC_FLUSH per packet (32k window)
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

OodleLZH has a 128k window
OodleLZH Fast :
371.0 -> 91.2 average

OodleLZNib Fast lznib_sw_bits=19 , lznib_ht_bits=19 : (= 512k window)
371.0 -> 90.6 average

OodleStaticLZP [mb of static dic|bits of hash]

LZP [ 4|18] : 371.0 -> 82.8 average
LZP [ 8|19] : 371.0 -> 77.6 average
LZP [16|20] : 371.0 -> 69.8 average
LZP [32|21] : 371.0 -> 59.6 average

Note of course that LZP would also benefit from dictionary optimization. Later occurrences of a context replace earlier ones, so more useful strings should be later in the window. Also just getting the most useful data into the window will help compression. These results are without much effort to optimize the LZP dictionary. Clients can of course use domain-specific knowledge to help make a good dictionary.

TODOS : 1. optimization of LZP static dictionary selection. 2. mixed static-dynamic LZP with a small (32k?) per-channel sliding window.

08-08-13 | Selling my Porsche 911 C4S

Well, it's time to sell my car (a 2006 911 C4S). Unfortunately I missed the best selling season (early summer) because of baby's arrival, so it may take a while. It's so horrible having to work with these web sites, they're so janky and broken, jesus christ the web is such total balls, you misclick the back button and lose all your work or hit enter and some required field wasn't entered so they wipe out all your entries. (Whenever I remember to, it's best practice to always do your writing into a text editor and then copy-paste it to the web site, don't edit directly in the web site). So far I'm just getting lots of emails from spammers and dealers and other lowballers, so it's a great waste of my time. (and a lot of the lowballers (guys will offer things like $30k for the car I'm asking $48k for) are delightful human beings who insult me when I politely tell them they're out of their mind). Anyhoo...

Why am I selling it? Primarily I just don't drive it anymore. Between baby and everything else I have no time for it. Also Seattle is just a horrible place for driving. With my back problems I just really can't sit in a sports car for long periods of time (even though the 911 is by far the best of any of the cars I tried, it's super roomy inside which is really awesome), I'm gonna have to get an SUV (*shudder*) or something more upright, I no longer bend at the waist. I don't have the cash to sit on a depreciating car that I'm not driving. I also have concluded that I should not own a car that I'm worried about scratching, it means I can't really enjoy it and I'm kind of tense about it all the time. I need a piece of junk that I can crash into things and not care. (I feel the same way about nice clothes; if it's expensive enough that I am worried about spilling my food on it, then I don't want to wear it).

I'm gonna take a huge value loss on the car even if I get a good price, because it's in great shape mechanically, but has a few cosmetic flaws. Porsche buyers would much rather have something that's spotless but a disaster in the engine bay. That is, there's a lot of ownership value in the car which just evaporates for me when I put it on sale. Oh well.

Autotrader ad here with pictures.

Here's the ad text properly formatted :

07-25-13 | Baby Baby

Some rambling.

We've been going through a pretty hard time. Baby has had colic for the past few weeks (colic = mysterious crying, probably due to GI pain). It's pretty reliable in the evening, usually about 2 hours, but on really bad days it can be 5 hours. It's brutal and exhausting. The only thing that soothes the crying is constant walking around, patting, cooing, various distraction techniques. It just takes massive amounts of energy. Wifey and I both get super exhausted doing it and then we start snapping at each other.

Some of the silly books say "start to establish a going-to-bed ritual". Okay, got it. Cry inconsolably. Parents start thinking "fucking baby I'm going to kill you". After 2-3 hours of crying, go to bed. Yay, we've got a ritual! On the plus side, after the big cry session, she is exhausted and sleeps pretty solidly through the night.

But it's getting better I think. She's recently learned how to get burps out mostly on her own and without a huge fuss about it, and I think that's relieving a lot of it. We're also getting better at learning the soothing techniques. We've had a few days with hardly any of the crazy colic crying at all, and those are like "wow, this isn't that bad". (addendum : in the weeks since I wrote this and haven't posted it, it's continued to get better and she seems to be down to basically just fussy baby crying, not so much the evil colic stuff).

Just in the past few days she's started reaching for things and grasping. She still has zero coordination, so the reaching is like random flailing of tentacles, and if one of them happens to hit the target then it locks on. She does pretty amazing smiling laughing now, which is adorable (the cutest ones are when she gets so happy that she just can't handle it and explodes with this flailing of arms and has to turn her head away like a shy Japanese schoolgirl). We have nice little "talking" sessions of err-ga's and such. I rhetorically wonder if she has any concept that the sounds I make at her (or the sounds she makes back) have any meaning (or any possibility of meaning) at all, or are they just random sounds?

My mom visited and helped out for a while, which was amazing. It was so nice just to be able to eat dinner together (wifey and I), or work in the garden, or just generally have some time when Wifey and I could both be baby free. To all of you baby owners who had parents nearby to help out with yours : suck it. (or, you know, thank them).

I'm so looking forward to when my kid(s) are like 5-10 years old and can actually do stuff with me. I took baby on a walk by the lake the other day, and saw families with their little kids swimming in the lake, and damn that is the part that I want. It's what I've always wanted; I want to swim and ride bikes and play in the park. I've been doing it alone as a creepy old single man in the park ("hi, cute kid you've got there") for the past 10 years and I'm ready to do it with my own kids (not the creepy part, just the playing in the park part).

I think the 5-10 years old phase might be the most unconditionally happy time in life. I mean yeah things are better in many ways when you're college age and finally free and having sex and all that, but in those early years (assuming you have a good family) you're just totally free of worry, life is so care free, it's before the horrible teenage years where you start feeling the societal pressures and stresses of having to do well and so on. You can just live in the moment; like hey I'm at the park, I'm gonna run around the field and there are zero thoughts of what I have to accomplish, the fucking health insurance up-coding I have to fight, my excessive job todo list, etc.

Hmm, Charles's life map :

baby : who fucking cares you can't remember it
adolescent : playing, making bows and arrows, sweet parents, yay fun times!
teen : oo I'm so mopey, life sucks, listen to emo music and pout
college : I'm free! do drugs, have sex, go clubbing, party time oh yeah!
post-college : ugh wtf do I do with my life, I'm so stressed out, work sucks, new baby sucks
child is adolescent : playing, making bows and arrows, sweet parents, yay fun times!
child is teen : fucking ungrateful annoying mopey bastard kid, I'm so stressed out, work sucks
child goes away to college : we're free! do prescription drugs, drink wine all day, oh yeah!

or something.

We've now tried almost every single baby carrier that's made, and none of them are working for us. Thankfully we inherited most of them. We have a Peanut Shell, a Moby wrap, a Babyhawk Mei Tai, a traditional Mei Tai, and an Ergo. Baby pretty much hates all of them. It is possible to get her in the Moby in positions she likes, but it takes about five hours so by the time you get it all tied up she has a dirty diaper and wants to eat again so you have to take her out. We've got a few difficulties with them. One is that she hates not being able to see the world, and the vast majority of them stick her face directly in your chest which she won't tolerate. The other is that none of them work very well with small baby's feet/legs. She's too young to really wrap her legs around our torsos, but the alternative is to just fold up her legs inside and they get all crushed (I suspect all the carriers will work better once she gets to the wrap-legs-around stage). It's been a frustrating ordeal, trying over and over to get her to tolerate a carrier only to have to bail out after five minutes because she starts screaming.

(addendum : we finally caved and got a Baby Bjorn to use front-facing, even though you're "not supposed to" because it's perhaps bad for her hips (*). Yep, it works. It's the only carrier she'll tolerate for more than a few minutes when she's awake).

(* = the "not supposed to" being the standard line from the internet clique of chattering moms who spew a lot of nonsense that's not based on any facts and then repeat it over and over as if it came from some solid source. The anti-rational-thought-basis of the modern internet mob is pretty sickening to me.)

One of the things that's been really difficult for me personally is adjusting my night schedule. I used to use dinner as my transition point from hard-working-mode to relaxed-day-is-done-mode. I loved to have a very slow dinner, lying down and nibbling for a while like a Roman, having a glass of wine, taking some deep breaths. It's great for the digestion to just really relax and eat slowly. Then there would be no more work for the rest of the day. No longer can I do that; dinner is a frantic rush to cram some food in my mouth while we take turns holding baby, and the work day is not done until we get her down to sleep at night, so I can't really start relaxing until then. It all means that my "on" time is so much longer, it takes more endurance to stick it out, and then after we get her down I have stay up for a few more hours to get that relaxation to get to sleep. Oh well; life changes.

Just recently I started working near full time again, and it's such a change. For one thing it's actually easier. Work is a relief compared to baby care. (though the hardest thing of all is trying to squeeze in bits of work and still do masses of baby care and try to keep a good attitude about it). On the negative side, I could feel my closeness with baby slip away almost immediately.

In the first few weeks when I was baby-caring full time, I felt almost as close to baby as mom was. Obviously she had the breastfeeding advantage, but I could soothe baby just as well, and interpret her signals, and I didn't feel like there was any huge gap. As soon as I started working full time and I was away from baby for long stretches I started seeing a difference. Suddenly baby wanted mom more than me; mom could read signals I couldn't read; baby would sometimes get fussy and I couldn't console her, but a hand-off to mom instantly calmed her down.

07-16-13 | Vaccines

Every parent these days has to face the question of vaccines, and whether they will blindly follow the standard CDC schedule or choose any modifications.

Obviously most of the anti-vaccine movement is total nutjobs, based on no science. They're an odd mix of the crazy christians and the crazy liberal/organic types who are part of a general insane granola-y anti-science movement. The scaremongering has become so widespread that even the sane parent has to ask themselves "is there anything to this?".

Unfortunately, the pro-vaccine side is not really without faults. They also are dishonest and misrepresent the facts, and make lots of arguments that aren't helping their side.

A few links on the pro-vaccine (anti-Dr. Bob) side :

The Problem With Dr Bob's Alternative Vaccine Schedule
CDC - Vaccines - Immunization Schedules for Children in Easy-to-read Formats
Cashing In On Fear: The Danger of Dr. Sears « Science-Based Medicine

Let me make a few points :

1. Some vaccines may be in the best interest of society, but not of the child. A parent making a purely selfish decision for the best interest of their child would logically choose not to get that vaccine.

The pro-vaccine people really don't want you to know this, so they lie about it or try to hide it.

Let's consider the MMR shot. The pro-vaccine people will say something like : the rate of severe complications (mainly encephalopathy and other high-fever side effects) from the MMR vaccine is something like 1-3 per million (*). However if you actually do get measles the rate of severe side effects (eg. death) is around 1-3 per thousand.

(* = there is some debate about whether the rate is actually higher than placebo, but let's ignore that question for the purpose of this point)

That's a slam dunk for vaccines, right? One per thousand vs one per million! So they claim, but no it is not. The problem is that you are getting a vaccine 100% of the time, but assuming that everyone else continues to get the MMR shot so that the diseases remain vanishingly rare (**), the chance of you getting measles is only something like 1 in a million. So in fact the chance that you will die from measles in your lifetime is only 1 in a billion, much lower than the complication rate of the vaccine. (of course the complication is worse and you have to weigh that somehow)

(** = I know perfectly well there have been recent outbreaks due to the non-vaccinating nutters, but it's still vanishingly rare at this point)

Now, obviously if you choose to fuck over society for such a marginal +EV for your child, you're an asshole. But that is the American Way. It could practically be on our national seal - "Take what you want and fuck the greater good".

I think this is the most interesting point of the whole post; theoretically let us imagine there is a vaccine that is actually on average harmful for each individual, but as a whole provides a massive benefit to society. Should you get it? Would people actually get it? I believe that the demonstrated behavior of the modern vaccine-abstainer movement is that no, people would in fact not get it. So then, should it be required by law?

BTW I suspect that there probably is not a currently existing vaccine (of a major disease; I'm not talking about chickenpox, hpv, or flu, which are in a separate category) where it is in fact +EV for the child to abstain, but I think it's close in a few cases and it's an interesting theoretical question. (the main question for it being "close" is that I suspect the supposed MMR side effects, and the settlements under the table injury law, are mostly not actually MMR side effects (*!*)).

(*!* = The primary question is about the cases of encephalopathy that have been compensated under the NCVIA based solely on them occurring shortly after the injection. In these cases, the government pays a settlement automatically without admitting fault or requiring any proof that the encephalopathy was caused by the vaccine. Some on the anti-vaccine side incorrectly interpret these settlements as courts finding that the vaccines did harm, but that is not the case. It's unclear whether the rate of encephalopathy following MMR is in fact statistically significant compared to the rate following a placebo shot; there are lots of papers on this in case you want to waste a day).

2. The pro-vaccine people claim that combo shots are perfectly safe and there's no reason to separate them. However, we know that quite a few of the combo shots that have been sold by the major pharmaceutical companies in the last 40 years have in fact been *not* safe. And during that time they were saying "oh yeah sure ProQuad (or whatever) is totally safe, trust us". So we're supposed to believe that despite the safety net failing repeatedly in the past, it's worked now and the current set of shots is fine.

My opinion on medication in general is to not trust anything that's new. If it's new, not only has it not been tested much in the field (and for many reasons you should not trust the manufacturer's own tests), but more than that there's a profit motive. The generic combinations that have been in use for a long time are no longer cash cows, so they try to bring out some new combination that puts even more together, and where there is a profit opportunity there are often pharmaceutical companies making people sick. I'd much rather get a 20 year old drug than a new one that's supposed to be better (though it's hard to find doctors that go along with this).

And the whole pro-combo-shot argument seems illogical to me on the face of it. What they generally argue is that the number of antigens in the vaccines is low, in fact very low compared to the number of antigens that babies get environmentally all the time (***). They also contend that babies' immune systems are perfectly capable of handling the load. Sure sure. But in fact vaccines do sometimes trigger high level immune responses, a very small fraction of the time. Each separate type of antigen is capable of doing that, and if you get unlucky and the body responds badly to several at the same time, that's more likely to be a higher and longer lasting fever.

(*** = and of course that's a specious argument; the daily environmental antigen exposure is obviously not the same thing as an injection of very specific antigens related to a major disease. They're not the same kind of antigens, your body doesn't have the same reaction, they aren't introduced the same way, it's just a retarded argument. The way the pro-vaccine group constantly tried to make its argument "stronger" by adding more points that aren't quite right doesn't help).

They put up this absurd straw man argument, claiming the objection to combos is that the infant immune system will be "overwhelmed", and in fact it will not be, QED combos are fine. Umm, what?

3. The "Aluminum is safe" arguments are weak. There is no good data on whether the Aluminum in the new vaccines is safe or not long term. It simply hasn't been used long enough to know if there is a low level long term toxicity.

In order to contend that it is safe, they compare it to the recommended ingestible aluminum limits and slow exposure limits and things like that, which are not the same thing. Of course there's never any long term testing of any new medicine, so any new drug you take could have bad long term effects. And even for drugs that are on the market a long time, it can be very hard to attribute long term complications to them (eg. the mercury carriers that were used before aluminum probably were in fact perfectly safe, but it's hard to tell for sure).

It sort of reminds me of when dental fillings all switched from mercury amalgam over to epoxy resins. There was zero evidence that the mercury fillings were actually harmful, but because it has the scary word "mercury" in the name people thought they must be toxic. So instead we all of a sudden get some new chemical reaction happening inside our mouths, that due to being new of course had no long term health results. So essentially you're putting your whole population to a new risk for no good reason (eg. dental resins release various VOCs during initial curing, which may or may not have health consequences).

4. The CDC schedule is in fact not made with the health of your infant in mind.

This is one of the big dishonesties in the pro-vaccine camp that bothers me. Lots of the pro-vaccine people say "follow the CDC, they're the experts, the schedule is made in the best way". Well, yes, sort of, but they made the schedule with several different factors in mind. One is the best interest of your child. One is herd immunity. Another one is protecting babies from parents that are dishonest or cheaters; eg. the early Hep B vaccine is important even for parents who claim they are clean, because many parents are either secret drug users or having secret extramarital sex. If you know you are not doing those things, it's probably fine to skip it.

Another major factor in the CDC schedule is compliance. They want you to get all the vaccines early and in few appointments, because they know that is the only chance to get most people to see their doctors. If it took lots of appointments, and continued into later childhood, there would be huge compliance failures. A large part of the CDC schedule is behavioral engineering. In fact the best schedule for your child's health is probably slower and more sparse than the CDC schedule (assuming you would actually stick to the slower schedule).

Shots like Polio are given early not because the infant needs them at that age, but just to make sure that person gets them *ever*, because the early shots are the least likely to be missed. It's probably better to get those shots later (though only microscopically since the harm of getting them early is minimal (in fact so minimal that I suspect the extra doctors visits of a spaced out schedule like Dr. Bob's are probably more harmful than just going ahead and getting all the shots early even though you don't need them early)).

Unfortunately the pro-vaccine people don't want to admit this fact and provide a well researched science based slower schedule that is designed with the best interest of the child in mind, so parents are left with things like the non-scientific ad hoc Dr. Bob schedule.

5. Lots of pro-vaccine people in the end resort to "they're the experts, they know more than you, trust them". LOL yeah right, because pharmaceutical companies have never tried to sell us poison, and our government has always given us good science-based health policies.

If you are in fact right, you should be able to argue the facts without resorting to "because we know best". Also the argument that "it's the only schedule that's been tested" is a cheap way out, since you don't allow any other schedule to be tested.

BTW in case it's not clear, my personal belief is that of course you should vaccinate (don't be ridiculous; vaccines are one of the very best things that medical science has ever done (in close competition with antibiotics and antiseptics)), and lacking any better information you should probably just use the CDC schedule. Any -EV in it for your child is very very small, and the risk of trying to make up your own schedule that would be better for your child is greater than any potential benefit.

Of course good decision making also considers the meta-decision of "should I spend my time thinking about this" and I believe in this case the answer is "no".

07-02-13 | Baby Baby Baby

Bleck it's so hard to get any work done. I've been going into the office a bit recently to try to get some concentration, but it's not helping a lot. Part of the problem is I'm not used to the office so it feels weird and uncomfortable being here. A lot of the problem is I just hate commuting so very much; by the time I get here I'm enraged and exhausted and need a lot of time to calm down.

(one thing that I've finally realized recently is that if you spend a lot of energy trying to do every little trivial thing in your life well (like driving, or loading the dishwasher, or enqueueing your laundry), it takes away energy that you could spend on something more important, like your social life or your work. I can kind of see the merit of being a total non-mindful fuckup when you're doing the trivial stuff of daily life, like the way people will walk straight into me when walking down the sidewalk, or let their leashed dog go on the opposite side of me, or leave their grocery cart right in the middle of the aisle; I always thought "jesus christ what a fucking asshole shit-for-brains", but maybe they are just saving their mental energy for more important things. More realistically I now see that the average McDonalds employee who is obviously not putting any effort into doing their job well is actually doing the right thing; why should they? of course they should just be as lazy as possible and save their energy for the fun glue-huffing party that night).

Baby has started to get a little easier. She's sleeping a bit better, and the severe colic we were occasionally getting is perhaps decreasing. I now suspect that the worst colic was due to foremilk/hindmilk imbalance, so we're trying some steps to address that and perhaps it's working. It's pretty hard and indirect to diagnose these kind of issues, so we just sort of stab in the dark and see if things get better (and of course when things do get better that is no proof that you were right (classic "entrepreneur's fallacy" (*))). We're also just learning how to deal with it better; when she gets into the once a day fussy time, we just have to walk her back and forth for hours, keep patting her back or bouncing her, show her things to keep her amused, and just wait it out until she settles down again.

(* "entrepreneur's fallacy" is my own coinage (I dunno if there's a more standard name for this). Basically it's the belief that because you were successful, everything you did was right and part of that success. It is the unfounded cause-effect association of every decision you made with the observed result of your success.) (I suppose this is just the "post hoc" fallacy, but I enjoy the pejorative implied in my nomenclature)

It's still exhausting and we're running on very little sleep. I'm a little bothered by the frequent advice we get to "ask for help". Ask who exactly? And WTF are they going to do? Are they going to sleep with baby and feed her so wifey can get a decent night's sleep for once? Are they going to clean the house? Of course not. It's like the advisers think the problem is that we're foolish martyrs who are choosing to take on more than we should. Uh no, we'd love to have help. There is no fucking help. Not for anything in this life. I've had the same advice in different arenas - at work, in social life - "make sure you ask for help if you need it". Bullshit. In my experience when you go to your boss or producer in a job and say "I really have too much work, you need to offload me or get another person on this" what you get is not help, but rather a condescending lesson on time management or prioritizing, like "well let's look through your task list and see what time estimates you've got and perhaps we can reduce some of those". (but of course in the work place it is very important to ask for help even though you won't get it, and in fact very important to make sure it's in writing (really every communication with your boss/producer needs to be in writing so you have a record), so that when you start missing tasks they can't say "you should've asked for help"). I've often asked friends for help with work or life issues or whatever (partly just because I think it's nice in this world to get and give help, and I like to have that relationship with people when I can), and the majority of the time if it is at all inconvenient or just not totally trivial for them to help, or not in their own personal best interest, the result is no help (with exceptions that I am grateful for).

She's now doing some simple two syllable "uh-goos" and "err-kch". One of my favorite times with her is right after she eats (so she's in a really good mood and relaxed, not eager to leave), I'll sit with her in my lap and we'll stare into each other's eyes and talk to each other. I make sounds and she smiles at me and occasionally makes them back. She loves textiles; anything with a complex pattern she'll stare at for minutes totally enthralled.

I'm dreading the upcoming 4th of July. There are already random fireworks being set off in our neighborhood and baby hates them. If she's sleeping, they wake her right up. The night of the 4th is going to suck bad. It's been a major heat wave recently so keeping all the windows closed is not really an option.

Last weekend when we had this big hot spike we went down to the lake for some relief. We sat on a blanket with baby and it was pretty sweet. It was interesting to me to think about how it would have been different doing the same thing before baby, since hanging out by the lake in the heat was one of my favorite activities. With baby we were basically walking her back and forth the whole time to keep her content, occasionally feeding her. Without baby I would have been sunning, swimming, reading, perhaps boozing. It was okay, I didn't miss it much (part of the "everything sucks anyway so it's not too bad to lose it" principle). Single times in the sun at the lake were a joy, but also sort of unsatisfying, haunted by that feeling of "is this all there is?" or "shouldn't I be living it up more somehow?". The thing that I really missed during the hot spike was being out at night. One of my favorite things from the old days was on those heat wave nights, getting out on the bike and feeling the night air, or going to an outdoor cafe to feel the buzzing urban heat wave night energy. Actually some of my first dates with N were heat wave nights, and they were magical. Oh well, sayonara.

(oh and tangential by-the-lake rant : fucking boats and motorcycles with excessively loud motors make the lake a fucking din of growling engines all weekend. Some asshole owners do that intentionally, but the real problem is the law; we need noise limits on boat motors. WTF. They should have to be even quieter than cars, because the sound travels so clearly over the lake, and the fucking lake should be a beautiful peaceful place. If you want a fucking motorboat speedway, go to Chelan or a similar lake that's more rednecky. I'm okay with there being a handful of lakes where the drunks can run each other over, but the majority should be free of that awfulness).

We're trying to hire a "mother's helper" or housekeeper who also helps with baby a bit. They're so fucking cheap, why the hell not? If we can get some relief (and I can get some more work done) for $150/week of course it's worth it. Well, it's not so easy. So far the hiring process has been a frustrating waste of time. Kids are always complaining about how there're no jobs these days; well let me tell you why you can't get a job, it's because you're either a total irresponsible fuck-up or a spazzy freak show. You only need to have just the most minimal level of basic professionalism, like if we set an appointment time for a phone interview, fucking answer your phone when I call. If you come to our house, be on time, clean, and considerate. If you send me an email application, check your email often so that you can follow up on my reply within 24 hours. COME ON!

(in general everyone these days seems to be such a moron that it's a bit risky to let anyone in your house; they do things like flush paper towels down the toilet, put onion peels down the garbage disposal, etc. you've got to keep an eye on them constantly or they're going to do annoying or expensive damage to your house).

We should all know by now that humanity is just fucking vile and horrible and dumb and selfish and mean. But we've started taking baby on walks recently and my opinion of the human race has gone down another notch. People are fucking asshole retards behind the wheel of a car, who are generally irresponsible, inconsiderate, dicky, selfish, dangerous, and just generally stupid, but I always thought that most of those people would still be reasonable around babies. Nope. Just about every time we take her on a walk in the stroller, some asshole tries to speed by us as close as possible. There are several intersections near our house where the cars basically don't stop at the stop signs (perhaps they slow down and roll through if they're one of the more considerate ones). I assumed that hey if I'm crossing at the stop sign and I'm halfway through the intersection with a stroller, they're going to actually stop now, right? right? Nope. If I've got my car parked with the door open and I'm taking baby from stroller to car seat, people passing are going to slow down a bit or pull out a bit wider, right? Nope, full speed right past. Unbelievable, so depressing. It's so deeply sad to me every time I leave my home and see how awful the world is, it makes me never want to leave home.

06-27-13 | Some Media Reviews

Light Asylum - umm, yeah, amazing. Light Asylum is a modern band that recreates the 80's goth industrial sound. It's perfect, tinny synths, that bad operatic singing, it's exactly like what the kid who painted his finger nails black listened to. So, long story: every few months I just go see what the hip kid websites are recommending and download all their favorite stuff, then I gradually get around to listening to it. About a year or so ago I got Light Asylum and put it in my playlist. The first time it played I was like "WTF this is awful" and skipped it. But I was lazy and left it in my playlist. Then over the next few months, once in a while I would listen to something else (mainly "Hooray for Earth" and Grizzly Bear's Shields) and Light Asylum was after those in the PL order so it would drift into there. I would be in the other room and not skip it immediately and I started thinking "this is hilariously bad, but kind of addictive". Flash forward and now I can't get enough of it, I'm listening to it over and over. In every objective way it's just awful; the beats are repetitive, the songs are very basic and don't go anywhere, the singing is terrible, but somehow that's all just right. (I think "IP" is the best example; it's so boiled down and repetitive, and the "25 to liiiiife" is just awful, but wow it works). Amazing.

Top of the Lake - great. My first impression was "bleh it's just The Killing in New Zealand, not this shit again". But it's much better than that. It's intense, the character development is superb, it's hard to watch, you really hate some people and are afraid of what might happen, which means they're doing something right. It is also a bit disappointing in the end; there are some weird Lynchian tones hinted at early on, making me think it might drift into a semi-Twin-Peaks territory and it never develops that thread. And the last episode really sucks; all the episodes before the last are slow and develop things gradually and beautifully, and the last one just wraps everything up neatly in a rushed way. Totally worth watching. It did all feel a bit disjointed, as most modern TV shows do, like they were writing it as they went along without a great overall plan, and it had a lot more promise than it delivered, but still just way more real and powerful than almost anything else on TV.

The Fall - meh, good. Totally straightforward BBC-style crime serial. Not really anything interesting about it, hey there's some crime and some detectives, with no particular twist or local character, but it's very well made, the acting is good, it looks beautiful. Watchable.

Nobody Walks - underrated; simple little obvious movie, but nicely done; it flows well, some good little moments of human interaction. It's right in the early-Lena-Dunham wheelhouse of dysfunctional upper class intellectual families.

BBC Storyville - "The Road" - really well done slice of life doc. I love this kind of thing; reminds me of "The Tower". Sad and beautiful, this world.

Endeavour - old-school BBC style detective show, in the sense that it's sort of charming but the "mysteries" are totally retarded, the local characters are shallow stereotypes, and it sets you up from the beginning to let you know what you're going to get and then gives it to you. It's like a warm bowl of milk and a cardigan, very comforting. I think it's great, carried by the delightful Shaun Evans and Roger Allam.

Out of the Wild - trash reality show survival thing, but I'm a sucker for this and found it addicting. Better than Survivorman or Man vs Wild for my taste. The group dynamic is pretty interesting and much nicer than the typical vote-them-off reality format.

Crap : Orphan Black, The White Queen, Silver Linings Playbook, Rectify, lots of other crap that doesn't bear mentioning.

Lots of good food docs on Netflix right now. "Three Stars" is the best, really superb, but I also enjoyed "Step Up to the Plate", and "A Matter of Taste: Serving Up Paul Liebrandt", both interesting.

06-19-13 | Baby Blues

I had my first real day of "baby blues" a few days ago.

Baby gets these bouts of colic that I believe are mostly from gas. When she has it, she can't stand to be horizontal, and really just wants to be held over the shoulder and patted. That's okay for a while, but sometimes it goes for an hour, which is exhausting. Most days it only happens once, but rarely it occurs over and over throughout the day.

We had a pretty bad day, and I found myself just losing my mind. You get so hungry and tired, but you can't take a break, and you start thinking "shut the fuck up! WTF do you want god dammit". In a real bad moment I started getting these weird impulses "like maybe if I throw the baby on the floor it will shut up" or "maybe shaking a baby isn't that bad". And then you just have to go whoah, keep it together, calm down.

It made me realize that if I was somewhat more irresponsible, more prone to rage, or less in control of myself, I easily could shake a baby, or one of those awful other things that people do (just lock it in a room to cry itself out, or give it booze, or whatever).

In fact it made me sort of understand those moms that kill their children, or the dads that go out for cigarettes and never come back. There's this feeling that these fucking kids are ruining your life and you can't do anything about it and you're going to be stuck with them for the next 18 years and there's this sudden feeling of helpless desperation. I can sympathize with the impulse, but of course that's where being an adult with some self control comes in.

I'm in awe of the women who had to take care of their kids all alone, with no help from their selfish misogynist husband that wouldn't touch a diaper or cook for the family, and appalled at those husbands.

Part of my thinking in having a baby was that I understood perfectly well that I would lose going out to eat, and travel, and pretty much every activity out of the house. And I'm okay with that, because those things fucking suck anyway. I don't understand what old single people do. I feel like I've done pretty much everything (*) there is to do in this life, and I don't need to keep doing the same shit over and over. I'm fine with losing all of that. But I do miss the ability to just have a quiet moment for myself, especially at the end of the day when I'm exhausted and frazzled.

(* = obviously not actually everything, nor everything that I want to do; what I have done is everything that's *easy* to do, which is all that normal people ever do. Things like going to restaurants and driving cars on race tracks and skydiving and travel and watching movies - how fucking boring and pathetic is all of that, don't you have any creativity? That's just the easy default consumerist way to waste your time - pay some money and get some superficial kicks. The actual good activities are things like : make your own internet comedy, assassinate an evil politician, find sponsors and be a motorcycle racer, go horseback camping on your own across the Russian steppe. But everyone's just too lazy and boring to ever do anything good with their life, and so am I. So just have a kid.)

One thing I've realized is that a good parent is never annoyed; a good parent never says "not now". You need to be always able to drop what you're doing, or get up off the couch and help your kid. I've always been the kind of person who has moments where I'll socialize and other times when I really want to be left alone, and if someone tries to talk to me during the left alone time I'm really pissy at them. That's not okay for a parent, you can't be pissy at your kid because they talked to you when you're tired or hungry or trying to work.

(I suppose this is related to a realization I had some time ago, which is if you believe you are a nice person except when you are tired or hungry or cranky or busy or having a bad day - you're not a nice person. Your character is how you behave in the difficult times.)

A good parent doesn't only love their children when they're behaving well. If you only like them when they're being quiet and happy, you're an asshole. A good dad has unconditional love that is not taken away when they misbehave; if anything you need to have more kindness in your heart when they're bad, because that's when they need it from you the most. Don't be a pissy little selfish whiney baby who's like "whaah the kids are being jerks to me so I'm justified in yelling at them or just running away to my office". You're the adult, you're supposed to be the bigger person than them. (in fact a good adult treats other adults the same way)

I really miss sleeping. I'm getting way more sleep than my wife, but it's still just not enough; I suppose the draining hard work of caring for baby is part of the problem, I really need even more sleep than usual and instead I'm getting less. I can feel my brain is not working the way it used to, and that feels horrible.

Maybe most of all I miss being able to have a moment where I know I don't have to do anything. So I can finally stop working, so I can let down my guard and just relax and know I'm not going to have to start up the diesel engine again. I may never have that again in my life, because kid issues can occur at any time; you can't knock yourself out on heroin anymore; you never get that time when you know you can just shut off your brain. Parents have to be always-on, and that's just exhausting.

I guess one of the problems for me is that I give my all to work; I don't stop working for the day until I feel completely drained, like I have nothing left, and then I just can't deal with any more tasks, I need to crash, be left alone. That's been a shitty thing for me to do my whole life; it's been shitty to my girlfriends and wife that I get home from work and just have no energy for them, but perhaps even more so it's been shitty to myself. Even morons (smarter than me) who work retail or whatever know that you shouldn't give all your energy to the stupid job; of course you should be texting your friends while you work, planning your night's activities, and when you get off work you shouldn't be drained, you should have energy to hang out, be nice to people, have a life outside of work. Anyway, when there's a whole mess of baby work to be done when you get off work, you really can't afford to work 'til you drop.

In other shitty-but-true news : if I was hiring programmers, and I had the choice between a dad and a single guy, I would not hire the dad. They would have to be massively better to compensate for the drain of children. Only young single guys will stupidly throw away their lives on work the way a manager really wants.

06-18-13 | How to Work

Reminder to myself, because I've gotten out of the habit. In the months before baby I had developed a pretty good work pattern and I want it back.

There is only full-on hard work, and away-from-computer restorative time. Nothing in between.

1. When working, disable internet. No browsing around. If you have a long test run or something, usually it's not actually blocking and you can work on something else while it goes, but if it is blocking then just walk away from the computer, get your hands off the machine, do some stretching.

2. No "easing into it". This is a self-indulgence that I can easily fall into, letting myself start slowly in the morning, and before I know it it's close to noon. When you start you just fucking start.

3. If you're tired and can't keep good posture when working, stop working. Go sleep. Work again when you're fresh.

4. Whenever you aren't working, don't do anything that's similar to work. No computer time. Fuck computers, there's nothing good to see on there anyway. Just walk away. Avoid any activity that has your hands in front of your body. Try to spend time with your arms overhead and/or your hands behind your back.

5. When you feel like you need to work but can't really focus, don't substitute shitty work like paying bills or online shopping or fiddling around cleaning up code pointlessly, or whatever else makes you feel like you're doing something productive. You're not. Either fucking get over it and work anyway, or if you really can't then walk away.

6. Be defensive of your good productive time. For me it's first thing in the morning. Lots of forces will try to take this away from you; you need to hold baby, you need to commute to go to the office. No no no, this is the good work time, go work.

7. Never ever work at a laptop. Go to your workstation. If you feel your ergonomics are bad, do not spend one second working in the bad position. Fix it and then continue.

8. Set goals for the day; like "I'm going to get X done" not just "I'm going to work on X for a while" which can easily laze into just poking at X without making much progress.

9. When you decide to stop working for the day, be *done*. No more touching the computer. Don't extend your work hours into your evening with pointless trickles of super-low-productivity work. This is easier if you don't use any portable computing device, so just step away from the computer and that's it.

10. Avoid emotional disturbances. Something like checking email in the morning should be benign, but if there's a 10% chance it makes you pissed off, that's a big negative because it lingers as a distraction for a while. I've basically stopped reading any news, and I think it's a big life +EV and certainly productivity +EV.

06-06-13 | Baby Misc

You're old when it takes you a while to remember how old you are.
You're older when you have to do the math from your birth year to figure it out.
You're really old when you have to do the math and get it wrong.
You're really really old when you do the math, get it wrong, and insist you're right despite everyone else in the room telling you a different number.

Pooping baby looks like an alternating sequence of Angry Andy Richter :

and O-face Gollum :

(baby made me realize that Andy Richter looks just like a giant baby)

On TV you always hear people gushing about that "great baby smell" , like mmm let me stick my nose in and smell that baby. WTF there's no great baby smell. I suppose what those people like is the smell of Johnson&Johnson shampoo and baby powder (both of which are rather out of fashion now). In our house we always try to avoid scented products (the better to smell you by), so our baby gets none of those. The real natural baby smell is mainly sour milk. Milk vomit, milk spitup, milk poos, spilled milk. Yum. It's mixed in with a faint whiff of that really nasty toe-jam funk, because babies have all these awkward fat folds and no matter how thoroughly we bathe her, we seem to miss some fold or other that accumulates crud.

One of the baby diaper-changing suggestions is "make sure to wipe front to back so that you don't spread poo towards the genitals". Umm, have you ever seen a baby diaper? It's like someone set off a poo grenade in there; there's poo everywhere; it's leaking out the top of the diaper, it's way deep inside the vagina. Oh, let me carefully wipe all that poo from front to back, ok, that'll make it all fine.

Baby is finally starting to spend some time awake that's not just eating or burping. That's kind of nice, she is starting to make some more expressive "eh"'s ; the other day she head-tracked her mom across the room when she was hungry for the first time.

Before baby was born IC made this note to me that "babies are boring", that I thought was charmingly honest. Yep, it's true, babies are boring as fuck. Sure they're cute and all, but there's just endless hours of feeding, burping, rocking; yeah, yeah, baby I've seen your cute arm waving before, just go to sleep already so I can do something else. At first I was kind of trying not to watch TV while holding baby or just put her in a mechanical rocker, I thought it was better to engage and talk to her and play with her and such, but now I say fuck it, there's just too much time to kill.

There's a sort of Stockholm Syndrome with babies. They're so hard in the first few weeks, constantly demanding attention, that it makes you grateful when they do the most basic things. Like, oh great baby overlord, thank you so much for sleeping for 3 hours in the evening, we are so grateful for your kindness. It's like the classic dick boss/dad trick of being really shitty to people so that if you are just halfway decent they love you for it.

06-04-13 | Reader Replacement

Can anyone suggest a good Reader replacement? (WTF Google, seriously).

I tried a few of the Win32 RSS Readers and absolutely hated them; they all tried to be too fancy and out-feature each other. I don't want anything that has its own built-in html renderer. I certainly don't want you to recommend related feeds to me or anything like that. I just want a list of my RSS subscriptions and show me the ones with unread updates, then I'll click it and you can open the original page in Firefox. (even Google Reader is too fancy for my taste; I don't like the in-Reader view of the feed, which often renders wrong; just open the source page in a new tab).

(actually I suppose I don't really care for RSS at all; don't send me the text of an update, all I want is a date stamp of last update so I can see when it changes).
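For what it's worth, the date-stamp-only checker described above is nearly trivial to sketch. This is a hypothetical illustration, not an endorsement of any particular reader: it just parses a feed's XML and reports the newest item date, so a loop over your subscriptions could flag which ones changed since you last looked (fetching each feed URL and remembering a last-seen date per subscription is left out).

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime

ATOM = "{http://www.w3.org/2005/Atom}"

def latest_update(feed_xml):
    """Return the newest item/entry timestamp in an RSS 2.0 or Atom feed,
    or None if no dates were found."""
    root = ET.fromstring(feed_xml)
    dates = []
    # RSS 2.0 items carry an RFC-822 date in <pubDate>
    for d in root.iter("pubDate"):
        dates.append(parsedate_to_datetime(d.text))
    # Atom entries carry an ISO-8601 date in <updated>
    for d in root.iter(ATOM + "updated"):
        dates.append(datetime.fromisoformat(d.text.replace("Z", "+00:00")))
    return max(dates) if dates else None
```

Real-world feeds are sloppier than this (missing timezones, malformed dates), so a robust version needs more defensive parsing, but the point stands that "show me which subscriptions changed" needs nothing fancier than a date comparison.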

Anyway, suggestions please.

Also if someone knows a webmail + spam filter that can integrate with a POP3 reader, I would drop gmail too, and be happy about my pointless solitary boycott of Google.

06-02-13 | Sport

I've been watching F1 for the past few years. There are some good things about; for me it satisfies a certain niche of mild entertainment that I call "drone". It's a kind of meditation, when you are too tired to work or exercise, but you don't really want to watch anything either, you just put it on in the background and it sort of lets you zone out, your mind goes into the buzz of the motors and the repetitive monotony of the cars going around over and over in the same order. I particularly like the onboards, you can get your head into imagining that it's you behind the wheel and then sort of go into a trance watching the left-right-left-right sweeping turns.

One thing that clicked for me watching F1 is just how active it is in the cockpit. When we drive on the street we're mostly just sitting there doing nothing, then there's a turn, you are active for a second or two, then it's back to doing nothing. Even with my limited experience on track, I'm so far below the capability of my car that I'm still getting breaks between the corners. A proper race driver lengthens every corner - that's what the "racing line" is all about - you use the whole track to make the corners less sharp, and you extend them so that one runs in the next; the result is that except on the rare long straight, you are actively working the car every second. Also, F1 cars are actually slipping around all the time; you don't really notice it from the outside because the drivers are so good; from the outside the cars seem to just drive around the corner, but they are actually constantly catching slides. The faster you drive a car, the less grip it has; you keep going faster and faster until the lack of grip limits you; every car driven on the absolute limit is slippy (and thus fun). I've been annoyed by how damn grippy my car is, but that's just because I'm not driving it fast enough.

F1 has been super broken for many years now. I suppose the fundamental thing that ruined it was aerodynamics. Aero is great for speed, but horrible for racing. In a very simplified summary, the primary problem is that aero makes it a huge advantage to be the front-runner, and it makes it a huge disadvantage to be behind another car, which makes it almost impossible to make "natural" passes. (more on natural passing shortly). 10 years or so ago before KERS and DRS and such, F1 was completely unwatchable; a car would qualify on pole and then would easily lead the whole race. Any time a faster car behind got up behind a car it wanted to pass, aero would make it slower and unable to make the pass. It was fucked. So they added KERS and DRS, which sort of work to allow passing despite fundamentally broken racing, and that's how it's been the last few years, but it's weird and unnatural and not actually fun to watch a DRS pass, there's no cleverness or bravery or beauty about it. The horribly designed new tracks have not helped either (bizarre how one firm can become the leading track designer in the world and do almost all the new tracks, and yet be so totally incompetent about how to make a track that promotes natural passing; it's a great example of the way that the quality of your work is almost irrelevant to whether you'll get hired again or not).

(the thing that's saved F1 for a while is the combined brilliance of Raikkonen, Alonso, Vettel, and Hamilton. They continue to find surprising places to pass; long high speed passes in sweeping corners where passing shouldn't be possible, or diving through tiny holes. It's a shame they don't have a better series to race in, those guys are amazing. Vettel is often foolishly criticized as only being able to lead, but actually some of the best races have been when he gets a penalty or a mechanical fault so that he has to start way back in the pack, he charges up more ferociously than anyone else, just man-handling his way up the order despite the RB not being a great car for racing from behind)

Anyway this year I just can't watch any more. The new tires are just so fucked; they take a series that already felt weird and artificial and make it even more so. The whole series is a squabble about regulations and political wrangling about the tires and blah blah I'm done with it.

Searching for something else, I stumbled on MotoGP. I'd seen the Mark Neale documentaries a few years ago ("Faster" etc) and thought they were pretty great, but never actually watched a season. Holy crap MotoGP has been amazing this year. There are three guys who all have legitimate shots at the title (Marquez, Pedrosa, Lorenzo). Rossi is always exciting to watch. Marquez is an incredible new star; I can't help thinking it will end badly for him, he seems too fearless, but until then he is a threat to win any race.

The best thing about MotoGP is there's no aero. No fucking stupid aero. So of course you don't need artificial crap like DRS. The result is that passing is entirely "natural", and it is a beautiful thing to watch; it's a sort of dance, it's smooth and swooping. The bikes are just motors and tires and drivers, the way racing should be. (actually without aero, it's a slight advantage to be behind because you get slipstream; giving an advantage to the follower is good, that's how you would design it as a videogame if real world physics were not a constraint; giving an advantage to the driver in front is totally retarded game design).

Natural passing is almost always done by braking late and taking an inside line to a corner. The inside line lets you reach the apex sooner, so you are ahead of the person you want to pass, but you are then badly set up for the corner exit, so that the person you passed will often have a chance to get you back on the exit; you therefore have to take a blocking line through corner exit while you are at a speed disadvantage due to your inside line. It's how racing passing should be; it's an absolutely beautiful thing to behold; it requires courage to brake late and skill to take the right line on exit and intelligence to set it up well.

Watching the guys ride around on the MotoGP bikes, I wish I could have that feeling. Puttering around on a cruiser bike (sitting upright, in traffic, ugh) never really grabbed my fancy, but to take this beast of a bike and have to grab it and manhandle it and pull it over to the side to get it to turn, it's like riding a bull, it really is like being a jockey (you're never sitting on your butt, you're posted up on your legs and balancing and adjusting your weight all the time), it's a physical athletic act, and yeah fuck yeah I'd like to do that.

I believe the correct way to fix F1 is to go back to the 70's or 80's. Ban aero. Make a standard body shell; let the teams do the engine, suspension, chassis, whatever internals, but then they have to put on a standard-shaped exterior skin (which should also be some material other than carbon so that it can take a tiny bit of contact without shattering). Design the shape of the standard skin such that being behind another car is an advantage (due to slipstream) not a disadvantage. Then no more DRS, maybe leave KERS. Get rid of all the stupid intentionally bad tires and just let the tire maker provide the best tires they can make. Of course none of that will happen.

I've also been watching a bit of Super Rugby. You have to be selective about what teams you watch, but if you are then the matches can be superb, as good or better than internationals. There have been a couple of great experimental rule changes for Super Rugby this year and I hope they get more widely adopted.

1. Time limit on balls not in hand (mainly at the back of a ruck). The ref will call "use it" and then you have 5 seconds to get the ball in play. No more scrumhalves standing around at the ruck doing nothing with the ball at your feet.

2. Limiting scrum resets, and just generally trying to limit time spent on scrums. The refs are instructed not to endlessly reset bad scrums; either let them collapse and play on if the ball is coming out, or call a penalty on the side that's losing ground.

The result is the play is much more ball-in-hand running, which is the good stuff kids go for.

If you want to watch a game, these are the teams I recommend watching, because they are skilled and also favor an attacking ball-in-hand style : Chiefs, Waratahs, Rebels, Blues (Rene Ranger is amazing), Brumbies (too much strategic field position play, but very skilled), Cheetahs, Crusaders. The Reds and Hurricanes are good for occasional flashes of brilliance. The Bulls are an incredibly skilled forward-based team, but not huge fun to watch unless they're playing against one of the above.

The Chiefs play an incredible team attack that I've never seen the like of in international rugby. The thing about the international teams is that while they are the cream of the talent, they don't practice together very much, so they are often not very coordinated. (international matches also tend to be conservative and defensive field-position battles, yawn). The Chiefs crash into every breakdown and recycle ball so efficiently, with everybody working and knowing their part; they go through the phases really fast and are always running vertical, it's fantastic.

Quade Cooper is actually amazing on the Reds. I'd only seen him before in some Wallaby matches where he single-handedly threw away the game for his side, so I always thought of him as a huge talent that unfortunately thought his talent was even bigger than it really is. He plays too sloppy, makes too many mistakes, tries to force moves that aren't there. But on the Reds, it occasionally all works; perhaps because he has more practice time with the teammates so they know where to be when he makes a wild pass out of contact. I saw a couple of quick-hands knock passes by him that just blew me away.

I'm continually amazed by how great rugby refereeing is. It occurred to me that the fundamental difference is that rugby is played with the intention of not having any penalties. That is, both the players and the refs are generally trying to play without penalties. That is completely unlike any other sport. Basketball is perhaps the worst, in that penalties are essentially just part of the play, they are one of the moves in your arsenal and you use them quite intentionally. Football is similar in that you are semi-intentionally doing illegal stuff all the time (holding, pass interference, etc.) and the cost of the foul is often less than the cost of not doing it, like if a receiver gets away and would score, of course you just grab him and pull him down. That doesn't happen in rugby - if it did they would award a penalty try and you would get a card. If someone is committing the same foul again, particularly one that is impeding the opponent from scoring, the ref will take them aside and say "if you do that again it's a card". It's a totally different attitude to illegal play. In most sports, it's up to the player to make a decision about whether the illegal play is beneficial to them or not. I think it reflects the general American attitude to rules and capitalism - there's no "I play fair because that's right" , it's "I'll cheat if I won't get caught" or "I'll weigh the pros and cons and decide based on the impact on me".

05-30-13 | Well Crap

The predictable happened - baby threw out my back. Ever since we had it I kept thinking "oh shit, this position is really not good for me, this is not going to end well", and then yesterday blammo excruciating pain, can't stand up, etc. It's an episode like I haven't had in several years. I've been trying to forestall it, trying to keep up my stretching regimen and so on, but combined with the fatigue of not sleeping it was inevitable. Fatigue is the real enemy of my spine, because it leads to lazy positions, slouching, resting on my bones and so on.

I've been really happy in the last 6 months or so because I've finally been able to work out again after something like 5 years of not being able to properly. Ever since SF I've been doing nothing but light weight therapy work; I kept trying to slowly ramp back into real workouts and would immediately aggravate an injury, have to bail out and start over again with only light PT. I felt like I finally turned the corner, it's the first time I've been able to do basic exercises like squats and deadlifts without immediately re-injuring myself, and it's been wonderful. I still have to be super careful; I only do pulls, never pushes, I don't do any jerky violent moves, I keep the weight well below my max so that I can be really careful with form; perhaps most importantly I spend ages warming up before lifting, which is so damn boring but necessary. And now I'm fucked again.

I used to have all these ambitions (discover a scientific principle and have a theorem named after me, etc. naive child!) but now my one great dream is just to be able to do some basic human activities like lie in a bed or throw a ball without being in great pain.

Sometimes I wonder how much of my sourpuss personality is just because I'm in pain all the time. Like those kids who struggle in school and it turns out they're just slightly deaf or blind or whatever. Often things that you think are fundamental about yourself are actually just because of some surface issue that you should just fix. (and of course my physical problems are totally trivial compared to what lots of people go through)

(like for example I know that part of my problem with socializing is I have some kind of crowd deafness issue; if I'm in a conversation with more than one person, I find it hard to follow the thread, and if more than one person at a time is speaking the words kind of get all jumbled up; sometimes it's so frustrating that I just give up on trying to listen in groups and just check out and zone out. I also avoid a lot of social situations like dinners and movies because I know they'll mean I'm stuck sitting for a long time, which is inevitable severe back pain (and dinners are often intestinal pain); I think a lot of those people who just seem so happy and well-adjusted are that way not because of any mental difference but because they lucked out and don't have any major physical problems)

I think if I had a pool things would be much better. Swimming is amazing for the body. I have a dream; it's to live somewhere sunny with my own pool. I'll lie in the sunshine and swim, and lie in the sunshine and swim. I'm really looking forward to being an excessively tan old man shamelessly swimming (and just walking around) in a tiny speedo.

I often see those guys on the beach in Hawaii or where-ever, shamelessly strutting about with their leathery sunburns and tiny speedos under droopy bellies, and think "I hope that's me in 20 years, and I can't wait!".

05-27-13 | Marsupial Sapiens

I'm convinced that the human being is actually a marsupial that just hasn't developed a pouch yet.

The human baby is the least developed of any mammal. There are various theories why the human baby is born at such an early stage of development (all humans, like marsupials, are essentially born 3 months before they're ready); the naive guess is because later birth would not fit in the mother, but modern theories are different (metabolic or developmental).

A baby is pretty crap as a human, but it's pretty good as an infant marsupial. It wants to just lie on the mother's chest and sleep and eat. Once you think of it as a marsupial lots of other things are just obvious, like it needs low light and not very much stimulation. If you try to make it do anything other than what a marsupial wants (like sleep without skin contact) it gets upset, of course. It really struck me when watching our baby do a sort of proto-crawl (really just a random moving of the limbs) and wiggling around trying to get from the chest to the nipple; that evolved proto-crawl is useless to get along the land to the mother, the only thing it can do is move the baby around inside the marsupial sack.

The Karp method is at its core one sentence of information - "babies are soothed by recreating a womb-like environment". But it's even more accurate to say that what you want to do is create a marsupial-pouch-like environment (eg. you aren't putting the baby in total darkness and immersing it in fluid, nor are you feeding it through a tube).

As is often the case with childrearing, ancient man does a better job than modern man. I think the ideal way to deal with a newborn is just to tuck it in mama's shirt and leave it there. It sleeps on the chest, suckles whenever it wants, and bonds to mama in a calm, sheltered, warm environment. Being tool-using animals, we make our missing marsupial pouch from some cloth. The modern baby carrier things are okay (especially the ones that have the baby facing you on the chest), but they're wrong in a major aspect - they're worn outside the clothing, the baby should really be right on your skin.

It's really incredible how badly we fucked up childbirth and rearing in the western world in the last 100-200 years. Granted it was for good reason - babies and moms used to die all the time. I don't romanticize the old days of home birth with all its lack of sanitation and high mortality rates, the way some of the more foolish hippy homebirth types do. Birth was a major medical problem that needed to be fixed, it's just that we fixed it in massively wrong ways.

I'm trying as much as possible to not read any baby books or crap on the internet. I don't want to be influenced by all the trends and crap advice out there. I want to just observe my baby and do what seems obviously right to me. So for example, I was aware of "attachment parenting" or "aware parenting" or whatever the fuck the cult is called now, but I hadn't made up my mind that we would try that. But once we had the baby and I just looked at it, it was totally obvious that that's what you should do.

If you just pay attention to your baby, it's so clearly trying to tell you "I'm hungry" or "I'm sleepy" or "the light is too bright" or whatever. The only time that a healthy baby cries is when you have ignored its communication for quite a long time (5-10 minutes) and it's gotten frustrated and fed up; a baby crying is it saying "you are neglecting me, please please help me I can't help myself, you fucker, I already asked you nicely". (of course it's not the same in unhealthy babies; and we have some issues like acid reflux and/or gas that lead to some amount of crying inevitably; it's not that all crying is necessarily bad parenting, but there are enough cases of crying that are a result of not listening to the earlier gentle communication that it just seems obvious that you should take care of the baby's needs before it gets to the point of crying, if possible). You've created this helpless little creature, and it gets hungry or uncomfortable and is begging for your help, and to ignore it for so long that it has to yell at you is pretty monstrous.

It's crazy to me that people for so many years thought it was right to just let a baby cry, that it was even good for it to work its lungs or develop independence or not get spoiled. Society gets so stuck in these group-thinks; of course we're all just in a new one now, and it's impossible for me to ever know if I am actually thinking clearly based on what I see, or if I have been subconsciously brainwashed by my awareness of the societal group-think.

(Society considered harmful) (only fools think that they could ever have an independent idea, or any taste or opinion that is actually their own).

05-24-13 | Hunger Cues

I'm so exhausted that I've been eating from fatigue, which is terrible, so a note for myself. I very much believe in listening to your body, but you have to know what to listen to, and hunger cues can be confusing.

1. Tiredness is not a hunger cue. Yes, sure popping some sugar will give you a boost, but that is not the right solution. When you are tired you need sleep, not food. This is always a huge problem for me in work crunches, I start jamming candy down my throat to try to keep my energy up, and I'm tempted to do it now for baby, so hey self - no, sleepiness is not solved by eating. Go sleep.

2. Your belly is actually not a great cue. Sure extreme belly ache and rumbling means you need to eat, but a slight hollow empty feeling, which most people take as "I must eat now", is not actually a good hunger cue. Humans are not meant to feel full in the belly all the time, but in our culture we get used to that feeling and so it feels strange when it's gone and you think something is wrong that you have to fix by putting more food in. It's really not; you need to try to detach the mental association of "belly empty" with "eat now".

3. The actual correct hunger cue is light headedness, dizziness, or weakness. That means you really do need to eat something now, but perhaps not a lot. (getting quantities right for yourself takes some experimentation over time to figure out)

I believe that the primary goal of food consumption portioning and scheduling should be to eat as little as possible, without ever going into that red zone of critically low blood sugar. (of course I'm assuming that you want to be near your "ideal" body weight, with "ideal" being the modern standard of trim; if your ideal is to be as large as possible then you would do something different). Note that belly feelings show up nowhere in the "primary goal", you just ignore them. Perhaps even learn to enjoy that slight hollow feeling in your gut, which gives you a bit of a hungry wolf feeling, it's sort of energizing. (if I'm slightly hungry, slightly horny, and slightly angry, good god, get out of the way!)

I'm convinced that the right way to eat is something like 5 small meals throughout the day. Long ago when I was single and quite fit, I ate like that and was quite successful at meeting the "primary goal of food consumption portioning and scheduling" (minimal eating without going into the red zone). It's very hard to keep that up in a relationship, because eating a big meal is such a key part of our social conventions. When I was single I would very rarely eat a proper dinner; I would eat a sandwich at 4 PM or something so that I wouldn't really be too hungry at 8 when Fifth Meal came around, so I might just eat a salad and some canned tuna. It is possible to do in a relationship, and I'm sort of trying it now. You have to just make sure that you eat less at the standard meal times, or eat more low-cal food like cooked veg and salads, and then go ahead and eat the intermediate meals yourself. (it's particularly hard when someone else cooks and you feel compelled to eat a large amount to show that you like it; it's also hard at restaurants where the portions are always huge and you feel like you have to eat it to "get your money's worth"; eating around other men is also a problem, because of the stupid macho "let's stuff our faces" competition)

05-22-13 | Baby Work

Jesus christ it's a lot of work. I was hoping to get back to doing a little bit of RAD work by Monday (two days ago), but it's just not possible yet. I'm doing all the work for Tasha and baby and it's completely swamping me. I get little moments of free time (that I oddly use to write things like this), but never a solid enough block to actually do programming work, and you can never predict when that solid block of time will be, which is so hard for programming. A few times I tried to get going on some actual coding, and then baby wakes up and I have to stop and run around changing diapers and getting mom snacks. I give up.

(even when I do get a bit of solid time, I'm just too tired to be productive; I stare blankly at the code. Actually I've had a few funny "tired dad" moments already. I went to the grocery store and was shopping along, and all of a sudden I noticed my cart was full of stuff that wasn't mine. Oh crap I took someone else's cart at some point because I was so asleep on my feet. I remember all through my childhood my mom would make shopping mistakes that I found so infuriating (Mom, I said corn chex and you got wheat chex, omg mom how could you be so daft!) and now I finally have some sympathy; you just get so exhausted that you can't even perform the most rudimentary of tasks without spacing out and making mistakes).

If you haven't had your baby yet, get some help, don't try to do it alone. (our relief should arrive tomorrow, and it's getting easier each day anyway; the hardest days were the beginning when we were still exhausted from labor and cleaning up after the home birth, and baby hadn't figured out nursing very well yet, that was some serious crunch time). We thought it would be sweet to have some time for just the three of us, and it has been, actually it's been really nice just being alone together, but it's too much work, I don't recommend it.

(we've been incredibly fortunate so far that our baby is super easy, really well behaved, a good sleeper and nurser; we just have none of the problems that you hear about (*). Of course that's not entirely luck (though perhaps mostly luck), we're both super healthy people and we've done what we believe is right to make a happy newborn (singing to the womb, immediate mommy skin contact after birth, never separating baby from parents, cosleeping, breastfeeding, etc.). But it's been so hard even with a well-behaved baby I now have new respect for the parents that go through a baby with colic or feeding difficulties or one that doesn't sleep, you guys are heroes).

(* = of course we're struggling with some acid reflux problems (what used to be called "colic" back when parents were awful and thought babies just cried because they were a nuisance, rather than because they were in serious pain that should be addressed) and forceful letdown and latching difficulties and etc etc, some other typical minor baby fussiness struggles, but that's all just normal baby stuff that I can't complain about, not a serious hardship like a baby with an actual health problem or disability)

House work is so annoying in the way that you can't just get it all done at once. Even during this time when the house work is so much harder than usual, it's still only something like 4-6 hours total of work, but it's all spread out over the day; you work for a while, then you have a 15 minute break (waiting for laundry to finish or the baby to poop again), then you do more work, then another tiny break, etc. I don't do well with work like that; I'm a work sprinter (actually I'm an everything sprinter), I want to get all my tasks on a list and then I'm just gonna strap on the gusto and knock it out as fast as possible so that I can be done and have a deep rest.

I'm sad that I'm so busy running around doing chores that I can't just lie in bed with mom and baby very much. I've always used doing work for people as a way of showing love, and it's fucking retarded. It's not what they really want, and it's not perceived as love. I'm sure that if I had someone else do all the work and I just spent more time cuddling, I would be a better dad.

At family gatherings, there are always those people who disappear from the group to help out in the kitchen; they might help cook, then help set up, help clean up; they do a lot of work and show their love for the family that way. There are other people who just hang out with the group the whole time and chat and smile and are more overtly interested in being there. Of course the hard worker is subconsciously perceived negatively, as standoffish, while the smiler is a "good family man" that everyone enjoys. In my youth I would rage about the unfairness of it all. I'm past that and can see things more clearly :

1. If you have a choice, then of course it's better to be the smiler who does no work and everyone loves. There is no reward in the real world for being a hard worker; not in social situations nor in the workplace. It's much better to be perceived as nice than to actually do deeply giving nice things.

2. Being a friendly, funny social creature is of course a type of "work" that's contributing value to the social situation. Think of them as an MC if you like; they're providing a service to the group, telling stories or laughing or whatever. That's valuable work as well.

3. The people who are really stealing from the group are the ones that don't do the work, and also don't provide laughs and good vibes. They are energy vampires and you should minimize your contact with them.

4. If you're a "smiler", the hard-working types will give you dirty looks or even drop passive aggressive shitty remarks about how "some people don't contribute" or whatever. Fuck them. They're just morons who have chosen a bad path for themselves and are trying to bring you down. Just laugh at them inside your head. Foolish hard-workers, your judgement has no sway over me, I don't need your approval. If someone else wants to do all the hard work for you, and then make themselves feel all sour about the fact that you didn't do the work, fantastic.

5. If you're a hard-worker, don't despair about the unfair world. You have found your lot in life. Maybe you are simply incapable of being a smiler. That's too bad for you, but we all have our place, and it's better to accept it and be content than to rage about what cannot be yours.

In my youth I used to struggle with trying it both ways. One of the nice things about age is you figure out your lot in life and just accept it; some years ago I concluded that I would never really contribute great social energy to groups, so I should just be a dish-doer in order to avoid being an energy vampire.


I'm a bit worried that I will never be able to really concentrate in my home office the way I have in the past. It's been a wonderful sanctuary of peace and alone time for me, where I can dive into my work and there's nobody making noise or peeking in at me (the way people do in offices). But now even just knowing that my baby is in the house, my mind is partly on the baby; is she okay, should I go check on her now? I'm sure that will decrease over time, but never go away. And of course once the child is running around making noise that will be a new distraction. Oh well, I guess we'll see how it goes.

Programming is such hard work to mix with anything else because you really need that solid block of uninterrupted time. You can't just pause and resume; or I can't anyway, I need to get into this sort of trance, which takes a while to achieve, and is quite draining. I feel a bit like a wizard in a fantasy novel; I can cast this amazing spell ("write code"), but to do so drains a bit of my life force, and if I do it more than once per day I bring myself close to death; if you're interrupted in coding, the spell is cancelled. I can write code without the spell, but then progress is slow and difficult, just like a normal human trying to write code, it's so frustrating for me to write code without the spell that I don't like to do it at all.


The actual point of this post is that I feel this need to get back to doing RAD work right away, and it's making me angry. Why do I have to feel that way? Fuck RAD work, I need to be with family. But my god damn hard-working WASPy martyr upbringing makes me feel like I can't ever take anything for myself, that I need to go and sacrifice and work for other people.

The whole time Tasha was pregnant I was crunching like crazy trying to finish Oodle 1.1 and get the real public release done, and to just get as much work done then as I could so that I would feel better about taking time off after the birth. I neglected Tasha when she needed me and she was really upset at me for it. I wanted to get ahead on my schedule, and I did, but now that I'm here my brain won't let me have that and wants me to go back to work.

One of the things I've really struggled with at RAD is the lack of structure and the self-scheduling. There's never a point where I can get ahead of an official schedule; I can't hit all my milestones for the month and then feel okay about taking it easy for a while. Any time I do take it easy because I just need a break, I feel bad about it. In general, my stupid brain makes me productive as a programmer, but also miserable as a programmer.

People who have a job where they just have a list of things to do and they can actually do them all and then go "I'm done, I'm going home" are very fortunate. In programming, the todo list is always effectively infinite (it's finite, but always longer than what you can ever accomplish). You might make a schedule and set a target set of tasks for a given month, but if you get them done sooner you don't go "great, I'm done for the month, time for a few days off", you go "oh, I went faster than expected, I guess I'll adjust the schedule and start on next month's tasks".

Even in a structured programming work environment, if you do your tasks faster than scheduled, you don't get sent home - you get given more tasks. In the traditional producer/team work system, your reward for being the fastest on the team is not more free time or even more pay, it's more work. Yay. Cynical "realist" programmers learn this at some point and many of them start to sandbag. They might finish their tasks quickly, but don't report it to production until their previously allotted time. Or they will intentionally take "slow work" breaks, like browsing the web or watching videos while working.

I used to use my speed as a way to work on features I wasn't supposed to; mainly in the pre-Oddworld days, I would sprint and do my assigned tasks, and then not tell anyone I had finished much faster than scheduled, so that I could work on VIPM or secretly rewriting the DX render layer or some other task that had been decided by production was "low priority". Oo, what a rebel I was, secretly giving my employer masses of value for free, great way to use your youth cbloom.

Anyway. I guess this post is all just my way of trying to convince myself that it's okay for me to take a few more days off.

05-20-13 | Thoughts on Data Compression for MMOs

I've been thinking about this a bit and thought I'd scribble some ideas publicly. (MMO = not necessarily just MMOs but any central game server with a very large number of clients, that cares about total bandwidth).

The situation is roughly this : you have a server and many clients (let's say 10k clients per server just to be concrete). Data is mostly sent server->client , not very much is client->server. Let's say 10k bytes per second per channel from server->client, and only 1k bytes per second from client->server. So the total data rate from the server is high (100 MB/s) but the data rate on any one channel is low. The server must send packets more than once per second for latency reasons; let's say 10 times per second, so packets are only 1k on average; server sends 100k packets/second. You don't want the compressor to add any delay by buffering up packets.
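The budget above works out as follows; this is just the post's illustrative figures restated as arithmetic, nothing more:

```cpp
#include <cstdint>

// The scenario's figures as constants. These are the post's round
// illustrative numbers, not measurements of any real server.
constexpr int64_t kClients         = 10000;  // clients per server
constexpr int64_t kDownBytesPerSec = 10000;  // server->client, per channel
constexpr int64_t kPacketsPerSec   = 10;     // sends per second per channel

constexpr int64_t kTotalDownBytesPerSec = kClients * kDownBytesPerSec;      // 100 MB/s off the server
constexpr int64_t kAvgPacketBytes       = kDownBytesPerSec / kPacketsPerSec; // 1k per packet
constexpr int64_t kTotalPacketsPerSec   = kClients * kPacketsPerSec;         // 100k packets/s
```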

I'm going to assume that you're using something like TCP so you have guaranteed packet order and no loss, so that you can use previous packets from any given channel as encoder history on that channel. If you do have an error in a connection you have to reset the encoder.
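Per-client encoder state for that setup might look roughly like this; a sketch of mine, not any real codec's API. Each channel keeps a ring buffer of previously sent bytes as compression history, and the whole thing is reset whenever the connection resets:

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Hypothetical per-client encoder state: a small ring of previously-sent
// bytes usable as LZ history. A real encoder would also maintain a
// match-finder structure over this history; omitted here for brevity.
struct ChannelEncoder {
    static const size_t kHistorySize = 64 * 1024; // per-client history window

    uint8_t  history[kHistorySize];
    size_t   pos;        // write cursor into the ring
    uint64_t totalBytes; // bytes seen since last reset

    // Called on connect and on any connection error (see above).
    void Reset() {
        pos = 0;
        totalBytes = 0;
        memset(history, 0, kHistorySize);
    }

    // Append a sent packet into the history so later packets on this
    // channel can match against it.
    void OnPacketSent(const uint8_t* data, size_t len) {
        for (size_t i = 0; i < len; i++) {
            history[pos] = data[i];
            pos = (pos + 1) % kHistorySize;
        }
        totalBytes += len;
    }
};
```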

This is a rather unusual situation for data compression, and the standard LZ77 solutions don't work great. I'm going to talk only about the server->client transmission for now; you might use a completely different algorithm for the other direction. Some properties of this situation :

1. You care more about encode time than decode time. CPU time on the server is one of the primary limiting factors. The client machine has almost no compression work to do, so decode time could be quite slow.

2. Per-call encode time is very important (not just per-byte time). Packets are small and you're doing lots of them (100k packets/sec), so you can't afford long startup/shutdown times for the encoder. This is mostly just an annoyance for coding, it means you have to be really careful about your function setup code and such.

3. Memory use on the server is a bit limited. Say you allow 4 GB for encoder states; that's only 400k per client. (this is assuming that you're doing per-client encoder state, which is certainly not the only choice).

4. Cache misses are much worse than a normal compression encoder scenario. Say you have something like a 256k hash table to accelerate match finding. Normally when you're compressing you get that whole table into L2 so your hash lookups are in cache. In this scenario you're jumping from one state to another all the time, so you must assume that every memory lookup is a cache miss.

5. The standard LZ77 thing of not allowing matches at the beginning or end is rather more of a penalty. In general all those inefficiencies that you normally have on tiny buffers are more important than usual.

6. Because clients can be added at any time and connections can be reset, encoder init/reset time can't be very long. This is another reason aside from memory use that encoder state must be small.

7. The character of the data being sent doesn't really vary much from client to client. This is one way in which this scenario differs from a normal web server type of situation (in which case, different clients might be receiving vastly different types of data). The character of the data can change from packet to packet; there are sort of a few different modes of the data and the stream switches between them, but it's not like one client is usually receiving text and another one is receiving images. They're all generally receiving bit-packed 3d positions and the same type of thing.

And now some rambling about what encoder you might use that suits this scenario :

A. It's not clear that adaptive encoding is a win here. You have to do the comparison with CPU use held constant; if you just compare an encoder running adaptive vs the same encoder with a static model, that's not fair, because the static model can be so much faster that you should use a more sophisticated encoder. The static model can also use vastly more memory. Maybe not a whole 4G, but a lot more than 400k.

B. LZ77 is not great here. The reason we love LZ77 is the fast, simple decoder. We don't really care about that here. An LZP or ROLZ variant would be better, that has a slightly slower and more memory-hungry decoder, but a simpler/faster encoder.

C. There are some semi-static options. Perhaps a static match dictionary with something like LZP, and then an adaptive simple context model per channel. That makes the per-channel adaptive part small in memory, but still allows some local adaptation for packets of different character. Another option would be a switching static-model scheme. Do something like train 4 different static models for different packet types, and send 2 bits to pick the model then encode the packet with that static model.
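A sketch of what that switching scheme might look like; StaticModel and ChooseModel are made-up illustration names, and a "model" here is just a table of per-symbol code lengths in bits, as if built offline from training data:

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// each packet is scored against a few pre-trained static models,
// the cheapest one wins, and a 2-bit selector is prepended.

struct StaticModel { uint8_t bits[256]; };  // code length per byte value, in bits

// cost of coding this packet with this model, in bits
static size_t PacketCost(const StaticModel & m, const uint8_t * p, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++) c += m.bits[p[i]];
    return c;
}

// returns the index of the chosen model; *pcost includes the 2-bit selector
static int ChooseModel(const StaticModel * models, int num_models,
                       const uint8_t * p, size_t n, size_t * pcost)
{
    int best = 0;
    size_t best_cost = PacketCost(models[0], p, n);
    for (int i = 1; i < num_models; i++)
    {
        size_t c = PacketCost(models[i], p, n);
        if (c < best_cost) { best_cost = c; best = i; }
    }
    *pcost = best_cost + 2; // + 2-bit model selector
    return best;
}
```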

D. Static context mixing is kind of appealing. You can have static hash tables and a static mixing scheme, which eliminates a lot of the slowness of CM. Perhaps the order-0 model is adaptive per channel, and perhaps the secondary-statistics table is adaptive per channel. Hitting 100 MB/s might still be a challenge, but I think it's possible. One nice thing about CM here is that it can have the idea of packets of different character implicit in the model.

E. For static-dictionary LZ, the normal linear offset encoding doesn't really make a lot of sense. Sure, you could try to optimize a dictionary by laying out the data in it such that more common data is at lower offsets, but that seems like a nasty indirect way of getting at the solution. Off the top of my head, it seems like you could use something like an LZFG encoding. That is, make a Suffix Trie and then send match references as node or leaf indexes; leaves all have equal probability, nodes have a child count which is proportional to their probability (relative to siblings).

F. Surely the ideal solution is a blended static/dynamic coder. That is, you have some large trained static model (like a CM or PPM model, or a big static dictionary for LZ77) and then you also run a local adaptive model in a circular window for each channel. Then you compress using a mix of the two models. There are various options on how to do this. For LZ77 you might send 0-64k offsets in the local adaptive window, and then 64k-4M offsets as indexes into the static dictionary. Or you could more explicitly code a selector bit to pick one of the two and then an offset. For CM it's most natural, you just mix the result of the static model and the local adaptive model.
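To make the mixing idea concrete for a binary CM coder, here's a minimal sketch; all the names are made up, and a plain average is shown where a real coder would use trained logistic mixing:

```cpp
#include <cassert>
#include <cstdint>

// one big shared static model supplies one bit probability, a tiny
// per-channel adaptive counter supplies another, and the arithmetic
// coder codes with their mix.  Probabilities are P(bit==1) in [0,4096).

struct AdaptiveBit
{
    uint16_t p1 = 2048;  // start at 50/50
    void update(int bit)
    {
        // standard shift-register probability update
        if (bit) p1 += (4096 - p1) >> 5;
        else     p1 -= p1 >> 5;
    }
};

// mix the static probability (from the shared trained model) with the
// per-channel adaptive one; a real coder would weight these adaptively
inline uint16_t MixP1(uint16_t static_p1, uint16_t adaptive_p1)
{
    return (uint16_t)((static_p1 + adaptive_p1) / 2);
}
```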

G. What is not awesome is model preconditioning (and it's what most people do, because it's the only thing available with off-the-shelf compressors like zlib or whatever). Model preconditioning means taking an adaptive coder and initially loading its model (eg. an LZ dictionary) from some static data; then you encode packets adaptively. This might actually offer excellent compression, but it has bad channel startup time, and high memory use per channel, and it doesn't allow you to use more efficient algorithms that are possible with fully static models (such as different types of data structures that provide fast lookup but not fast update).

I believe if you're doing UDP or some other unreliable packet scheme, then static-model compression is the only way to go (rather than trying to track the different received and transmitted states to use for a dynamic model). Also if clients are very frequently joining and leaving or moving servers, then they will never build up much channel history, so static model is the way to go there as well. If streams vary vastly in size, like if they're usually less than 1k but occasionally you do large 1M+ transfers (like for content serving as opposed to updating game state), you would use a totally different scheme for the large transfers.

I'd like to do some tests. If you work on an MMO or similar game situation and can give me some real-world test data, please be in touch.

05-17-13 | Cloth Diapering

Oh yes indeed, you are in for a spate of baby-related blogging.

I'm pretty sure cloth diapers are bullshit. I'm about to cancel my diaper service. In this first week I've been using a semi-alternating mix of cloth and disposable. I assumed that I would start out with disposables just for ease in the first few days and then switch to cloth because it's "better", but I don't think I will.

(I make all my decisions now based only on 1. personal observations and 2. serious scientific studies where I can read the original papers. I try to avoid and discount 3. journalism 4. hearsay 5. the internet 6. mass-market nonfiction. I think they are garbage and mental poison.)

What I'm seeing is :

Disposable diapers actually work the way they claim to. The seal around the borders is good. The entire diaper itself has a nice low profile so is not too bulky or uncomfortable. But most importantly, they actually do trap and absorb moisture. When baby has a heavy pee in a disposable diaper, the moisture stays right in one little spot and doesn't spread all over. When I remove the diaper I can feel her skin all over the nether regions is pretty dry.

Cloth diapers don't. The worst aspect is that when baby has a heavy pee, the cloth soaks it up, and because it's cloth and wicks moisture, the pee is spread all over her entire lower parts. When I get the diaper off, she's soaking wet all over. (and yes of course I'm changing her almost instantly after peeing because at this point we're watching her constantly). That alone is enough to turn me off cloth diapers, but there's lots more that sucks about them. It's really hard to get the diaper cover on such that it actually makes a water-tight seal, so leakage is much more likely (and if you do try to make it water tight, it's easy to make it too tight and cut off circulation, which I accidentally did once). The cloth diaper alone looks pretty comfortable on her, but the diaper cover is much rougher and more bulky than a disposable; the result is that she has this huge awkward thing on.

When you add the inconvenience of cloth diapers (longer changing times, having to store poop in your house, taking the pail in and out for pickup), it just seems like a massive lose.

The only possible argument pro-cloth that makes sense to me is the reduction of the landfill load. Now, environmental arguments are always complicated; there are arguments for the other side based on the environmental cost of washing (though I think they're bogus). But even assuming that the environmental case is clear, being a hypocritical liberal I wouldn't actually inconvenience myself and discomfort my baby for the benefit of the landfill.

Eh, actually I take back that false self-accusation. That's a retarded Fox News style "gotcha" that's based on misrepresentation and not understanding. I've never advocated the standard liberal martyrdom (and if I once did, I certainly don't now). I don't believe in choosing to undermine yourself because you believe the world would be better if everyone did it. I believe in changing the laws such that they encourage you to make the choice that is better for the world. eg. people who don't drive because they believe it's evil, even if it would be much to their benefit, are just being dumb martyrs. The US government massively subsidizes driving, so if you don't take advantage of that you are essentially paying for other people to drive. I would love it if the government would subsidize *not driving* rather than the other way around, but until they do I'm driving up a storm. (tangent : the massive subsidies for Teslas are a great example of the way that Dems and Reps are in fact both really working for the same cause : creating loopholes and kickbacks so that they can give money to rich people).

I'm a big tangent wanderer. My political philosophy in a nutshell :

Government's role is to create a market structure (through laws, regulation, the Fed, direct market action, etc) such that when each actor maximizes their own personal utility, the net result is as good for the entire world (nation) as possible.

(if you're out of high school (or the 18th century) you should know that a free market does not do that on its own)

(And crucially, "good for" must be defined on something like a sum-of-logs scale, or perhaps just maximize the median, or minimize the number in poverty; if you maximize the sum (basically GDP) then giving huge profits to Larry Ellison and fucking everyone else looks like it's "good for the world")

And, uh, oh yeah, cloth diapers suck.

05-15-13 | Baby

I suppose this is the easiest way to announce to various friends and semi-friends rather than trying to mass-email. I have a new baby, yay! No pictures, you paparazzos. She's adorable and healthy. I love how simple and direct her communication is. Suckling lips = needs to nurse. Squirming = needs a diaper change. Fussing = cold or gassy. Everything else = needs to sleep. I wish all humans communicated so clearly.

I want to write about the wonderful experience of having a home birth (see *2), but don't want to intrude on Tasha's privacy. Suffice it to say it was really good, so good to be home and have everything at hand to make Tasha comfortable, and then be able to take baby in our arms and settle into bed right away. We spent the first 36 hours after birth all in bed together and I think that time was really important.

I've always wanted to have kids, but I'm (mostly) glad that I waited this long. For one thing Tasha is a wonderful mom and I'm glad I found her. But also, I realize now that I wasn't ready in my twenties. I've changed a lot in the last five years and I'm a much better person now. I've learned important lessons that are helping me a lot in this challenging time, like that to do hard work correctly you have to not only complete the task but also keep a good attitude and be nice to the people around you while you do it. And that when you are tired and hungry is when you can really show your character; anyone can have a good attitude when they're fresh, but if you get nasty when the going gets tough then you are nasty. etc. standard cbloom slowly figures out things that most people learned in their teens.

Now for some old-style ranting.

1. "We had a baby". No you fucking did not. Your wife had a baby. If you were a really good husband, you held her hand and got her snacks. She squeezed a watermelon out of her vagina. You do not get to take any credit for that act, it was all her. It's a bit like Steve Jobs saying "we invented" anything; no you did not you fucking credit-stealing douchebag, your company didn't even invent it, much less you.

(tangent : I can't stand the political correctness in sport post-game interviews these days; they're all so sanitized and formulaic. They must go to interview coaching classes or something because everyone says exactly the same things. Of course it's not the athlete's fault, they would love to have emotional honest outbursts, it's the god damn stupid public who throw a conniption if anybody says anything remotely true. In particular this post reminds me of how athletes always immediately go "it wasn't just me, it was the team"; no it was not, Kobe, you just had an 80 point game, it was all fucking you, don't give me this bullshit credit to the team stuff. Be a man and say "*I* won this game".)

2. People are busy-body dicks. When we would tell acquaintances about our plans to have a home birth, a good 25% would feel like they had to tell us what a bad idea that was and nag us about the dangers of childbirth. Shut the fuck up you stupid asshole. First of all, don't you think that maybe we've researched that more than you before making our decision, so you don't know WTF you're talking about? Second of all, we're not going to change our mind because of your nagging, so all you're doing is being nasty about something you're not going to change. We didn't ask for your opinion, you can just stay the hell out of it. (The doctors that we would occasionally see for tests were often negative and naggy as well, which only made us more confident in our choice).

It's a bit like if a friend tells you they're marrying someone and you go "her?". Even if the marriage is a questionable choice, they're not going to stop it due to your misgivings, so all you're doing is adding some unpleasantness to their experience.

You always run into these idiots when you do software reviews or brainstorming sessions. You'll call a meeting to discuss revisions to the boss fight sequence, and some asshole will always chime in with "I really think the whole idea of boss fights sucks and we should start over". Umm, great, thanks, very helpful. We're not going to tear up the whole design of the game a few months from shipping, so maybe you could stick to the topic at hand and get some kind of clue about what things are reasonable to change and which need to be taken as a given and worked within as constraints.

Like when I'd ask for reviews of Oodle, a few of the respondents would give me something awesomely unhelpful like "I don't like the entire style of the API, and I'd throw it out and do a new one" , or "actually I think a paging + data compression library is a bad idea and I'd just start over on something else". Great, thanks; I might agree with you but obviously you must know that that is not going to happen and it's not what I was asking for, so if you don't want to say anything helpful then just say "no".

ADDENDUM : a few notes on home birth and midwives.

Even if you are planning to do home birth (without a doctor present), you should get an OB and do a prenatal visit with them to "establish care". That way you are officially their patient, even if you don't see them again. In the US health care system, if you do wind up having a problem, or even just a question, and you have not "established care" with a certain practice, you are just fucked. You would wind up at the ER and that's never good.

While the midwives seemed reasonably competent at popping out a healthy baby from a healthy mother with no complications, I certainly would not do it if there were any major risk factors. They also are less than thorough at the prenatal and postnatal exams, so it's probably worth seeing a regular doc for those at least once (probably only once).

05-08-13 | A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :

Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :

    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid  = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0);   // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == old_state + (1<<handle_guid_shift)
            if ( s->m_state.compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
            }
            // else : somebody else mucked with me
        }
    }

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :

void bad_function( OwningRefT<Thingy> sptr )

void Stuff::bad_caller()
    OwningRefT<Thingy> sptr( m_weakRef );
    if ( sptr != NULL )

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thing * ptr )

void Stuff::good_caller()
    OwningRefT<Thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )

If you like formal rules, they're something like this :

1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enhancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle, and a 64-bit state word, then you have 32-bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically get a lock and inc refcount in a single op). A better usage is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.
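For concreteness, here's one hypothetical packing of such a 64-bit state word. The field widths (and the extra payload argument to make_state) are invented for illustration; the post doesn't give the real Oodle layout:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit slot state layout (widths are made up) :
//
//   [63..48] guid     (matches the guid bits of the 32-bit handle)
//   [47..32] refcount
//   [31..0 ] free payload : object status, lock bits, etc.

typedef uint64_t state_t;

inline state_t  make_state(uint64_t guid, uint64_t rc, uint32_t payload)
    { return (guid << 48) | (rc << 32) | payload; }
inline uint64_t state_get_guid(state_t s)     { return s >> 48; }
inline uint64_t state_get_refcount(state_t s) { return (s >> 32) & 0xFFFF; }
inline uint32_t state_get_payload(state_t s)  { return (uint32_t)s; }
```

With a layout like this, one atomic 64-bit load gives you the guid check, the refcount, and the status/lock bits all at once.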

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state($).load(mo_relaxed);
        handle_type old_guid   = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
            // CAS failed : state_prev was reloaded, loop
        }
    }

05-02-13 | Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntatically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use real Cpp0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a cpp0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :

// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release);
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};

Download : cbliblf.zip
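If you want to try the queue without pulling in the cblib headers, here is the same algorithm restated against plain std::atomic (spsc_fifo is just an illustration name; the false-sharing padding is omitted for brevity):

```cpp
#include <atomic>
#include <cassert>

// single-producer/single-consumer fifo : one dummy node,
// producer owns m_head, consumer owns m_tail.
template <typename T>
struct spsc_fifo
{
    struct node
    {
        std::atomic<node*> next{nullptr};
        T data{};
    };

    node * m_head;  // producer side
    node * m_tail;  // consumer side

    spsc_fifo() { m_head = m_tail = new node; } // dummy node

    ~spsc_fifo()
    {
        // free whatever is left, dummy included
        while (node * t = m_tail)
        {
            m_tail = t->next.load(std::memory_order_relaxed);
            delete t;
        }
    }

    void push(const T & v)
    {
        node * n = new node;
        n->data = v;
        // the release store publishes n->data to the consumer :
        m_head->next.store(n, std::memory_order_release);
        m_head = n;
    }

    bool pop(T * out)
    {
        node * t = m_tail;
        node * n = t->next.load(std::memory_order_acquire);
        if (!n) return false;
        *out = n->data;
        m_tail = n;
        delete t;
        return true;
    }
};
```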

04-30-13 | Packing Values in Bits : Flat Codes

One of the very simplest forms of packing values in bits is simply to store a value with non-power-of-2 range and all values of equal probability.

You have a value that's in [0,N). Ideally all code lengths would be the same ( log2(N) ) which is fractional for N not a power of 2. With just bit output, we can't write fractional bits, so we will lose some efficiency. But how much exactly?

You can of course trivially write a symbol in [0,N) by using log2ceil(N) bits. That's just going up to the next integer bit count. But you're wasting values in there, so you can take each wasted value and use it to reduce the length of a code that you need. eg. for N = 5 , start with log2ceil(N) bits :

0 : 000
1 : 001
2 : 010
3 : 011
4 : 100
x : 101
x : 110
x : 111

The first five codes are used for our values, and the last three are wasted. Rearrange to interleave the wasted codewords :

0 : 000
x : 001
1 : 010
x : 011
2 : 100
x : 101
3 : 110
4 : 111

Now since we have adjacent codes where one is used and one is not used, we can reduce the length of those codes and still have a prefix code. That is, if we see the two bits "00" we know that it must always be a value of 0, because "001" is wasted. So simply don't send the third bit in that case :

0 : 00
1 : 01
2 : 10
3 : 110
4 : 111

(this is a general way of constructing shorter prefix codes when you have wasted values). You can see that the number of wasted values we had at the top is the number of codes that can be shortened by one bit.

A flat code is written thusly :

void OutputFlat(int sym, int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;
    // T is the number of "wasted values"
    if ( sym < T )
    {
        // write in B-1 bits
        PutBits(sym, B-1);
    }
    else
    {
        // write in B bits
        // push value up by T
        PutBits(sym+T, B);
    }
}

int InputFlat(int N)
{
    ASSERT( N >= 2 );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;

    int sym = GetBits(B-1);
    if ( sym < T )
        return sym;
    // need one more bit :
    int ret = (sym<<1) - T + GetBits(1);
    return ret;
}

That is, we write (T) values in (B-1) bits, and (N-T) in (B) bits. The intlog2ceil can be slow, so in practice you would want to precompute that or pass it in as a parameter.

So, what is the loss vs. ideal, and where does it occur? Let's work it out :

H = log2(N)  is the ideal (fractional) entropy

N is in (2^(B-1),2^B]
so H is in (B-1,B]

The number of bits written by the flat code is :

L = ( T * (B-1) + (N-T) * B ) / N

with T = 2^B - N

Let's set

N = f * 2^B

with f in (0.5,1] our fractional position in the range.

so T = 2^B * (1 - f)

At f = 0.5 and 1.0 there's no loss, so there must be a maximum in that interval.

Doing some simplifying :

L = (T * (B-1) + (N-T) * B)/N
L = (T * B - T + N*B - T * B)/N
L = ( N*B - T)/N = B - T/N
T/N = (1-f)/f = (1/f) - 1
L = B - (1/f) + 1

The excess bits is :

E = L - H

H = log2(N) = log2( f * 2^B ) = B + log2(f)

E = (B - (1/f) + 1) - (B + log2(f))
E = 1 - (1/f) - log2(f)

so find the maximum of E by taking a derivative :

d/df(E) = 0
d/df(E) = 1/f^2 - (1/f)/ln2
1/f^2 = (1/f)/ln2
1/f = 1/ln(2)
f = ln(2)
f = 0.6931472...

and at that spot the excess is :

E = 1 - (1/ln2) - ln(ln2)/ln2
E = 0.08607133...

The worst case is 8.6% of a bit per symbol excess. The worst case appears periodically, once for each power of two.

The actual excess bits output for some low N's :

The worst case actually occurs as N->large, because at higher N you can get f closer to that worst case fraction (ln(2)). At lower N, the integer steps mean you miss the worst case and so waste less. This is perhaps a bit surprising; you might think that the worst case would be at something like N = 3.

In fact for N = 3 :

H = l2(3) = 1.584962 ...

L = average length written by OutputFlat

L = (1+2+2)/3 = 1.66666...

E = L - H = 0.08170421...

(obviously if you measure the loss as a fraction of the entropy, the worst case is at N=3, and there it's 5.155% of the entropy).
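These numbers are easy to verify numerically from the derived L = B - T/N (a quick sketch, not part of the original post):

```cpp
#include <cassert>
#include <cmath>

// Excess bits per symbol of the flat code vs. the ideal entropy log2(N).
double FlatExcess(int N) {
    int B = 0; while ((1 << B) < N) ++B;   // intlog2ceil(N)
    double T = (double)((1 << B) - N);     // wasted values
    double L = (double)B - T / N;          // average flat-code length
    return L - std::log2((double)N);       // E = L - H
}
```

Scanning N up to a few thousand shows every value staying under the 0.086071 bound, with the max approached as N grows and f = N/2^B nears ln(2).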

04-21-13 | How to grow Linux disk under VMWare

There are a lot of these guides around the net, but I found them all a bit confusing to follow, so here's my experience : It seems to me it's a good idea to leave it this way - BIOS is set to boot first from CD, but the VM is set with no CD hardware enabled. This makes it easy to change the ISO and just check that box any time you want to boot from an ISO, rather than having to go into that BIOS nightmare again.

More generally, what have I learned about multi-platform development from working at RAD ?

That it's horrible, really horrible, and I pray that I never have to do it again in my life. Ugh.

Just writing cross-platform code is not the issue (though that's horrible enough, solely due to stupid non-fundamental issues like the fact that struct packing isn't standardized, adding signed ints isn't standardized, restrict/noalias isn't standardized, inline linkage varies greatly, etc. urg wtf etc etc). If you're just releasing some code on the net and offering it for many platforms (leaving it up to the downloaders to actually build it and test it), your life is easy. The horrible part is if you actually have to maintain machines and build systems for all those platforms, test them, be able to debug on them, keep all the sdk's up to date, etc. etc.

(in general coding is easy when you don't actually test your code and make sure it works well, which a surprising number of people think is "done"; hey it compiles, I'm done! umm, no...)

(I guess that's a more general life thing; I observe a lot of people who just do things and don't actually measure whether the "doing" was successful or done well, but they just move on and are generally happy. People who stress over whether what they're doing is actually a good job or not are massively less happy but also actually do good work.)

I feel like I spend 90% of my time on stupid fucking non-algorithmic issues like this Linux partition resizing shit (probably more like 20%, but that's still frustratingly high). The regression tests are failing on Linux, okay have to figure out why, oh it's because the VM disk is too small, okay how do I fix that; or the PS4 compiler has a bug I have to work around, or the system software on this system has a bug, or the Mac clang wants to spew pointless warnings about anonymous namespaces, or my tests aren't working on Xenon .. spend some time digging .. oh the console is just turned off, or the IP changed or it got reflashed and my SDK doesn't work anymore, and blah blah fucking blah. God dammit I just want to be able to write algorithms. I miss coding, I miss thinking about hard problems. Le sigh.

I've written before about how in my imagination I could hire some kid for $50k to do all this shit work for me and it would be a huge win for me overall. But I'm afraid it's not that easy in reality.

What really should exist is a "coder cloud" service. There should be a bunch of VMs of different OS'es with various compilers and SDKs installed, so I can just say "build my shit for X with Y". Of course you need to be able to run tests on that system as well, and if something goes wrong you need remote desktop for interactive debugging. It's got to have every platform, including things like game consoles where you need license agreements, which is probably a no-go in reality because corporations are jerks. There's got to be superb customer service, because if I can't rely on it for builds at every moment of every day then it's a no-go. Unfortunately, programmers are almost uniformly moronic about this kind of thing (in that they massively overestimate their own ability to manage these things quickly) so wouldn't want to pay what it costs to run that service.

04-10-13 | Waitset Resignal Notes

I've been redoing my low level waitset and want to make some notes. Some previous discussion of the same issues here :

cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores

In particular, two big things occurred to me :

1. I talked before about the "passing on the signal" issue. See the above posts for more in depth details, but in brief the issue is if you are trying to do NotifyOne (instead of NotifyAll), and you have a double-check waitset like this :

waiter = waitset.prepare_wait(condition);

if ( double check )
    waiter.cancel();
else
    waiter.wait();
    // possibly loop and re-check condition


then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of another thread that did need that signal can be missed.

Now, I talked about this before as an "ugly hack", but over time thinking about it, it doesn't seem so bad. In particular, if you put the resignal inside the cancel() , so that the client code looks just like the above, it doesn't need to know about the fact that the resignal mechanism is happening at all.

So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal that it didn't consume. If so, it just passes on that signal. The fact that this is okay and not a hack came to me when I thought about under what conditions this actually happens. If you recall from the earlier posts, the need for resignal comes from situations like :

T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2

the thing is this can only happen if all the threads are awake and racing against each other (it requires a very specific interleaving); that is, the T3 in particular that decs count and does the resignal had to be awake anyway (because its first check saw no count, but its double check did dec count, so it must have raced with the sem post). It's not like you wake up a thread you shouldn't have and then pass it on. The thread wakeup scheme is just changed from :

T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait

to :

T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait

that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty rare race, so really not a bad thing.

The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the condition you want right at the same time that you check it, so your first check fails and after you prepare_wait, your second check passes. The resignal only occurs if you are in that race path, and also the setting thread sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal that should have gotten it. Basically this case is quite rare, we don't care too much about it being fast or elegant (as long as it's not disastrously slow), we just need behavior to be correct when it does happen - and the "pass on the signal" mechanism gives you that.

The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting this as standard practice in waitset.

2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.

Conceptually, they are made lock free by turning them into messages. "Notify" is now the receiver of messages and the scheme is now :

waiter w;
waitset : send message { prepare_wait , &w, condition };

if ( double check )
    waitset : send message { cancel , &w };
else
    w.wait();



waitset Notify(condition) :
    first consume all messages
    do prepare_wait and cancel actions
    then do the normal notify
    eg. see if there are any waiters that want to know about "condition"

The result is that the entire wait-side operation is lock free. The notify-side still uses a lock to ensure the consistency of the wait list.

This greatly reduces contention in the most common usage patterns :

Mutex :

only the mutex owner does Notify
 - so contention of the waitset lock is non-existent
many threads may try to lock a mutex
 - they do not have any waitset-lock contention

Semaphore :

the common case of one producer and many consumers (lots of threads do wait() )
 - zero contention of the waitset lock

the less common case of many producers and few consumers is slow

Another way to look at it is instead of doing little bits of waitlist maintenance in three different places (prepare_wait,notify,cancel) which each have to take a lock on the list, all the maintenance is moved to one spot.

Now there are some subtleties.

If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle needed per thread and you can use that to implement every threading primitive. But now you have to be able to recycle the waiter. Note that you don't have to worry about other threads using your waiter; the waiter is per-thread so you just have to worry about when you come around and use it again yourself.

If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free wait side there are some issues.

First is that when you do a prepare-then-cancel , your cancel might not actually be done for a long time (it was just a request). So if you come back around on the same thread and call prepare() again, prepare has to check if that earlier cancel has been processed or not. If it has not, then you just have to force the Notify-side list maintenance to be done immediately.

The second related issue is that the lock-free wait-side can give you spurious signals to your waiter. Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist, you might actually still be in the waitlist from the previous time you used your waiter. Thus you can get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals (I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.

Ok, so we now have a clean waitset that can do NotifyOne and guarantee no spurious signals. Let's use it.

You may recall we've looked at a simple waitset-based mutex before :

U32 thinlock;

Lock :
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            break;
        }

        w.Wait();
    }

Unlock :
    if ( Exchange(&thinlock,0) == 2 )
        waitset.NotifyAll( &thinlock );

This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop for efficiency.

This was an okay way to build a mutex on waitset when all you had was NotifyAll. It only does the notify if there are waiters, but the big problem with it is if you have multiple waiters, it wakes them all and then they all run in to try to grab the mutex, and all but one fail and go back to sleep. This is a common type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.

(any time you write threading code where the wakeup means "hey wakeup and see if you can grab an atomic" (as opposed to "wakeup you got it"), you should be concerned (particularly when the wake is a broadcast))

Now that we have NotifyOne we can fix that mutex :

U32 thinlock;

Lock :
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            break;
        }

        w.Wait();
    }

Unlock :
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
        waitset.NotifyOne( &thinlock );

We changed the NotifyAll to NotifyOne , but a few funny bits are worth noting :

(*1) - we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in there for funny reasons (see cbloom rants 07-15-11 - Review of many Mutex implementations , where this type of mutex is discussed as "Three-state mutex using Event" ), but it doesn't work with the NotifyOne.

(*2) - with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is just that there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum count of 1 (actually it's exactly like it for non-recursive mutexes); you only need to pass on the signal when it's possible that some other thread needs to know about it.

(*3) - you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing. TryLock should still put in a 1, so in the uncontended case the thinlock will have a value of 1 and no Notify is done. The thinlock only goes to a 2 if there is some contention, and then the value stays at 2 until the last unlock of that contended sequence.

Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :

U32 thinlock; // = 0 initializes thinlock

Lock :
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !

Unlock :
    if ( Fetch_Add(&thinlock,-1) > 1 )
        // there were waiters
        waitset.NotifyOne( &thinlock );

The mutex is just a wait-count now. (as usual you should TryLock a few times before jumping in to the PrepareWait). This mutex is more elegant; it also has a small performance advantage in that it only calls NotifyOne when it really needs to; because its gate is also a wait-count it knows if it needs to Notify or not. The previous Mutex posted will always Notify on the last unlock whether or not it needs to (eg. it will always do one Notify too many).

This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :

U32 thinsem; // = 0 initializes thinsem

Wait :
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !

Post :
    if ( Fetch_Add(&thinsem,1) < 0 )
        waitset.NotifyOne( &thinsem );

which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal, because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time; and the mutex itself serializes the unlocks).
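As an aside, this wait-count gate idea also works with a plain semaphore instead of a waitset (the classic "benaphore" pattern - to be clear, this is a simplified analogy, not the Oodle waitset code). A minimal C++11 sketch, with a mutex+condvar semaphore standing in for the OS semaphore:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Plain counting semaphore (stand-in for the waitset / OS semaphore).
class Sem {
    std::mutex m; std::condition_variable cv; int count = 0;
public:
    void post() { { std::lock_guard<std::mutex> l(m); ++count; } cv.notify_one(); }
    void wait() { std::unique_lock<std::mutex> l(m); cv.wait(l, [&]{ return count > 0; }); --count; }
};

// The atomic counter is both the lock gate and the wait count, so unlock
// only signals when some thread is actually waiting.
class WaitCountMutex {
    std::atomic<int> thinlock{0};
    Sem sem;
public:
    void lock()   { if (thinlock.fetch_add(1, std::memory_order_acquire) != 0) sem.wait(); }
    void unlock() { if (thinlock.fetch_add(-1, std::memory_order_release) > 1) sem.post(); }
};
```

Note the same property as the waitset version: the uncontended path is a single atomic RMW and no signaling at all.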

Oh, one very subtle issue that I only discovered due to relacy :

waitset.Notify requires a #StoreLoad between the change to the shared variable and the check for waiters. That is, the standard pattern for any kind of "Publish" is something like :

    change shared variable

    #StoreLoad

    if ( any waiters )
        Notify()

Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do some kind of more weakly ordered primitive, then you must force a barrier there.
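In C++11 atomics the publish side looks something like this; a sketch with hypothetical names (condition, waiter_count, notify_calls), where a seq_cst fence supplies the #StoreLoad:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> condition{0};     // the shared variable being published
std::atomic<int> waiter_count{0};  // how many threads are parked in the waitset
int notify_calls = 0;              // stands in for calling waitset.Notify()

void Publish() {
    condition.store(1, std::memory_order_relaxed);        // change shared variable
    std::atomic_thread_fence(std::memory_order_seq_cst);  // #StoreLoad : the store must
                                                          // be visible before we load waiters
    if (waiter_count.load(std::memory_order_relaxed) > 0)
        ++notify_calls;                                   // waitset.Notify( &condition );
}
```

Without the fence, the store and the waiter-check load could effectively reorder, and a waiter that checked the condition just before the store could be missed.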

This is the exact same issue that I've run into before and forgot about again :

cbloom rants 07-31-11 - An example that needs seq_cst -
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3

04-04-13 | Tabdir

I made a fresh build of "tabdir" my old ( old ) utility that does a recursive dirlisting in tabbed-text format for "tabview".

Download : tabdir 320k zip

tabdir -?
usage : tabdir [opts] [dir]
 -v  : view output after writing
 -p  : show progress of dir enumeration (with interactive keys)
 -w# : set # of worker threads
 -oN : output to name N [r:\tabdir.tab]

This new tabdir is built on Oodle so it has a multi-threaded dir lister for much greater speed. (*)

Also note to self : I fixed tabview so it works as a shell file association. I hit this all the time and always forget it : if something works on the command line but not as a shell association, it's probably because the shell passes you quotes around file names, so you need a little code to strip quotes from args.

Someday I'd like to write an even faster tabdir that reads the NTFS volume directory information directly, but chances are that will never happen.

One odd thing I've spotted with this tabdir is that the Windows SxS Assembly dirs take a ton of time to enumerate on my machine. I dunno if they're compressed or WTF the deal is with them (I pushed it on the todo stack to investigate), but they're like 10X slower than any other dir. (could just be the larger number of files in there; but I mean it's slow *per file*)

I never did this before because I didn't expect multi-threaded dir enumeration to be a big win; I thought it would just cause seek thrashing, and if you're IO bound anyway then multi-threading can't help, can it? Well, it turns out the Win32 dir enum functions have quite a lot of CPU overhead, so multi-threading does in fact help a bit :

nworkers| elapsed time
1       | 12.327
2       | 10.450
3       | 9.710
4       | 9.130

(* = actually the big speed win was not multi-threading, it's that the old tabdir did something rather dumb in the file enum. It would enum all files, and then do GetInfo() on each one to get the file sizes. The new one just uses the file infos that are returned as part of the Win32 enumeration, which is massively faster).

04-04-13 | Worker Thread Forward Permit Delay-Kicks

I've had this small annoying issue for a while, and finally thought of a pretty simple solution.

You may recall, I use a worker thread system with forward "permits" (reversed dependencies). When any handle completes it sees if that completion should trigger any followup handles, and if so those are then launched. Handles may be SPU jobs or IOs or CPU jobs or whatever. The problem I will talk about occurred when the predecessor and the followup were both CPU jobs.

I'll talk about a specific case to be concrete : decoding compressed data while reading it from disk.

To decode each chunk of LZ data, a chunk-decompress job is made. That job depends on the IO(s) that read in the compressed data for that chunk. It also depends on the previous chunk if the chunk is not a seek-reset point. So in the case of a non-reset chunk, you have a dependency on an IO and a previous CPU job. Your job will be started by one or the other, whichever finishes last.

Now, when decompression was IO bound, then the IO completions were kicking off the decompress jobs, and everything was fine.

In these timelines, the second line is IO and the bottom four are workers. (click images for higher res)

LZ Read and Decompress without seek-resets, IO bound :

You can see the funny fans of lines that show the dependency on the previous decompress job and also the IO. Yellow is a thread that's sleeping.

You may notice that the worker threads are cycling around. That's not really ideal, but it's not related to the problem I'm talking about today. (that cycling is caused by the fact that the OS semaphore is FIFO. For something like worker threads, we'd actually rather have a LIFO semaphore, because it makes it more likely that you get a thread with something useful still hot in cache. Someday I'll replace my OS semaphore with my own LIFO one, but for now this is a minor performance bug). (Win32 docs say that they don't guarantee any particular order, but in my experience threads of equal priority are always FIFO in Win32 semaphores)

Okay, now for the problem. When the IO was going fast, so we were CPU bound, it's the prior decompress job that triggers the followup work.

But something bad happened due to the forward permit system. The control flow was something like this :

On worker thread 0

wake from semaphore
do on an LZ decompress job
mark job done
completion change causes a permits check
permits check sees that there is a pending job triggered by this completion
  -> fire off that handle
   handle is pushed to worker thread system
   no worker is available to do it, so wake a new worker and give him the job
finalize (usually delete) job I just finished
look for more work to do
   there is none because it was already handed to a new worker

And it looked like this :

LZ Read and Decompress without seek-resets, CPU bound, naive permits :

You can see each subsequent decompress job is moving to another worker thread. Yuck, bad.

So the fix in Oodle is to use the "delay-kick" mechanism, which I'd already been using for coroutine refires (which had a similar problem; the problem occurred when you yielded a coroutine on something like an IO, and the IO was done almost immediately; the coroutine would get moved to another worker thread instead of just staying on the same one and continuing from the yield as if it wasn't there).

The scheme is something like this :

On each worker thread :

Try to pop a work item from the "delay kick queue"
  if there is more than one item in the DKQ,
    take one for myself and "kick" the remainder
    (kick means wake worker threads to do the jobs)

If nothing on DKQ, pop from the main queue
  if nothing on main queue, wait on work semaphore

Do your job

Set "delay kick" = true
  ("delay kick" has to be in TLS of course)
Mark job as done
Permit system checks for successor handles that can now run 
  if they exist, they are put in the DKQ instead of immediately firing
Set "delay kick" = false


In brief : work that is made runnable by the completion of work is not fired until the worker thread that did the completion gets its own shot at grabbing that new work. If the completion made 4 jobs runnable, the worker will grab 1 for itself and kick the other 3. The kick is no longer in the completion phase, it's in the pop phase.
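Stripped of all the actual threading, the pop-side ordering can be sketched as a single-threaded simulation (the names here are mine, not the Oodle internals):

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Sketch of the delay-kick pop order. Completing a job puts its successors
// in the DKQ; the popping worker then takes one successor for itself and
// "kicks" the remainder to the main queue (where other workers would get them).
struct Worker {
    std::deque<std::string> dkq, main_q;
    std::vector<std::string> ran;

    void complete(const std::string &job, std::vector<std::string> successors) {
        ran.push_back(job);
        for (auto &s : successors) dkq.push_back(s);  // delayed : don't wake anyone yet
    }
    bool pop(std::string &out) {
        if (!dkq.empty()) {
            out = dkq.front(); dkq.pop_front();       // keep one for myself
            while (!dkq.empty()) {                    // kick the remainder
                main_q.push_back(dkq.front()); dkq.pop_front();
            }
            return true;
        }
        if (!main_q.empty()) { out = main_q.front(); main_q.pop_front(); return true; }
        return false;                                 // would wait on the work semaphore here
    }
};
```

The point the simulation shows: the worker that completed the predecessor stays on the successor chain instead of handing it to a freshly woken thread.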

And the result is :

LZ Read and Decompress without seek-resets, CPU bound, delay-kick permits :

Molto superiore.

These last two timelines are on the same time scale, so you can see just from the visual that eliminating the unnecessary thread switching is about a 10% speedup.

Anyway, this particular issue may not apply to your worker thread system, or you may have other solutions. I think the main take-away is that while worker thread systems seem very simple to write at first, there's actually a huge amount of careful fiddling required to make them really run well. You have to be constantly vigilant about doing test runs and checking threadprofile views like this to ensure that what's actually happening matches what you think is happening. Err, oops, I think I just accidentally wrote an advertisement for Telemetry .

04-04-13 | Oodle Compression on BC1 and WAV

I put some stupid filters in, so here are some results for the record and my reference.

BC1 (DXT1/S3TC) DDS textures :

All compressors run in max-compress mode. Note that it's not entirely fair because Oodle has the BC1 swizzle and the others don't.

Some day I'd like to do a BC1-specific encoder. Various ideas and possibilities there. Also RD-DXTC.

I also did a WAV filter. This one is particularly ridiculous because nobody uses WAV, and if you want to compress audio you should use a domain-specific compressor, not just OodleLZ with a simple delta filter. I did it because I was annoyed that RAR beat me on WAVs (due to its having a multimedia filter), and RAR should never beat me.

WAV compression :

See also : same chart with 7z (not really fair cuz 7z doesn't have a WAV filter)

Happy to see that Oodle-filt handily beats RAR-filt. I'm using just a trivial linear gradient predictor :

out[i] = in[i] - 2*in[i-1] + in[i-2]

this could surely be better, but whatever, WAV filtering is not important.

I also did a simple BMP delta filter and EXE (BCJ/relative-call) transform. I don't really want to get into the business of offering all kinds of special case filters the way some of the more insane modern archivers do (like undoing ZLIB compression so you can recompress it, or WRT), but anyhoo there's a few.

ADDED : I will say something perhaps useful about the WAV filter.

There's a bit of a funny issue because the WAV data is 16 bit (or 24 or 32), and the back-end entropy coder in a simple LZ is 8 bit.

If you just take a 16-bit delta and put it into bytes, then most of your values will be around zero, and you'll make a stream like :

[00 00] [00 01] [FF FF] [FF F8] [00 04] ...

The bad thing to notice here is that the high bytes are switching between 00 and FF even though the values have quite a small range. (Note that the common thing of centering the values with +32768 doesn't change this at all).

You can make this much better just by doing a bias of +128. That makes it so the most important range of values (around zero (specifically [-128,127])) all have the same top byte.

I think it might be even slightly better to do a "folded" signed->unsigned map, like

{ 0,-1,1,-2,2,-3,...,32767,-32768 }

The main difference is that values like -129 and +128 get the same high byte in this mapping, rather than two different high bytes in the simple +128 bias scheme.

Of course you really want a separate 8-bit huffman for alternating pairs of bytes. One way to get that is to use a few bottom bits of position as part of the literal context. Also, the high byte should really be used as context for the low byte. But both of those are beyond the capabilities of my simple LZ-huffs so I just deinterleave the high and low bytes to two streams.

04-02-13 | The Method of Random Holdouts

Quick and dirty game-programming style hacky way to make fitting model parameters somewhat better.

You have some testset {T} of many items, and you wish to fit some heuristic model M over T which has some parameters. There may be multiple forms of the model and you aren't sure which is best, so you wish to compare models against each other.

For concreteness, you might imagine that T is a bunch of images, and you are trying to make a perceptual DXTC coder; you measure block error in the encoder as something like (SSD + a * SATD ^ b + c * SSIM_8x8 ) , and the goal is to minimize the total image error in the outer loop, measured using something complex like IW-MS-SSIM or "MyDCTDelta" or whatever. So you are trying to fit the parameters {a,b,c} to minimize an error.

For reference, the naive training method is : run the model on all data in {T}, optimize parameters to minimize error over {T}.

The method of random holdouts goes like this :

Run many trials

On each trial, take the testset T and randomly separate it into a training set and a verification set.
Typically training set is something like 75% of the data and verification is 25%.

Optimize the model parameters on the {training set} to minimize the error measure over {training set}.

Now run the optimized model on the {verification set} and measure the error there.
This is the error that will be used to rate the model.

When you make the average error, compensate for the size of the model thusly :
average_error = sum_error / ( [num in {verification set}] - [dof in model] )

Record the optimal parameters and the error for that trial

Now you have optimized parameters for each trial, and an error for each trial. You can take the average over all trials, but you can also take the sdev. The sdev tells you how well your model is really working - if it's not close to zero then you are missing something important in your model. A term with a large sdev might just be a random factor that's not useful in the model, and you should try again without it.

The method of random holdouts reduces over-training risk, because in each run you are measuring error only on data samples that were not used in training.

The method of random holdouts gives you a decent way to compare models which may have different numbers of DOF. If you just use the naive method of training, then models with more DOF will always appear better, because they are just fitting your data set.

That is, in our example say that (SSD + a * SATD ^ b) is actually the ideal model and the extra term ( + c * SSIM_8x8 ) is not useful. As long as it's not just a linear combo of other terms, then naive training will find a "c" such that that term is used to compensate for variations in your particular testset. And in fact that incorrect "c" can be quite a large value (along with a negative "a").

This kind of method can also be used for fancier stuff like building complex models from ensembles of simple models, "boosting" models, etc. But it's useful even in this case where we wind up just using a simple linear model, because you can see how it varies over the random holdouts.
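For concreteness, here's a sketch of a single trial with a toy one-parameter model y ~= a*x (least-squares fit of "a" on the training split, dof-compensated error on the holdout); this is illustrative, not the DXTC fitting code:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// One random-holdout trial : shuffle, split 75/25, fit on train, score on verify.
double HoldoutTrial(const std::vector<double> &x, const std::vector<double> &y,
                    std::mt19937 &rng, double *fit_a) {
    const int n = (int)x.size();
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);       // random 75/25 split
    const int n_train = (n * 3) / 4;

    // least-squares fit of "a" on the training set only :
    double sxy = 0, sxx = 0;
    for (int i = 0; i < n_train; ++i) { sxy += x[idx[i]] * y[idx[i]]; sxx += x[idx[i]] * x[idx[i]]; }
    const double a = sxy / sxx;
    if (fit_a) *fit_a = a;

    // measure error on the held-out verification set, compensated for model size :
    double sum_err = 0;
    const int n_verify = n - n_train;
    for (int i = n_train; i < n; ++i) { double e = y[idx[i]] - a * x[idx[i]]; sum_err += e * e; }
    const int dof = 1;                               // one model parameter
    return sum_err / (n_verify - dof);
}
```

Run many trials, then look at the mean and sdev of the returned errors across trials, as described above.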

03-31-13 | Market Equilibrium

I'm sure there's some standard economic theory of all this but hey let's reinvent the wheel without any background.

There's a fundamental principle of any healthy (*) market that the reward for some labor is equal across all fields - proportional only to standard factors like the risk factor, the scarcity of labor, the capital required for entry, etc. (* = more on "healthy" later). The point is that those factors have *nothing* to do with the details of the field.

The basic factor at play is that if some field changes and suddenly becomes much more profitable, then people will flood into that field, and the risk-equal-capital-return will keep going down until it becomes equal to other fields. Water flows downhill, you know.

When people like Alan Greenspan try to tell you that oh this new field is completely unlike anything we've seen in the past because of blah blah - it doesn't matter, they may have lots of great points that seem reasonable in isolation, but the equilibrium still applies. The pay of a computer programmer is set by the pay of a farmer, because if the difference were out of whack, the farmer would quit farming and start programming; the pay of programmers will go down and the wages of farmers will go up, then the price of lettuce will go up, and in the end a programmer won't be able to buy any more lettuce than anyone else in a similar job. ("similar" only in terms of risk, ease of entry, rarity of talent, etc.)

We went through a drive-through car wash yesterday and Tasha idly wondered how much the car wash operator makes from an operation like that. Well, I bet it's about the same as a quick-lube place makes, and that's about the same as a dry cleaner, and it's about the same as a pizza place (which has less capital outlay but more risk), because if one of them was much more profitable, there would be more competition until equilibrium was reached.

Specifically I've been thinking about this because of the current indie game boom on the PC, which seems to be a bit of a magic gold rush at the moment. That almost inevitably has to die out, it's just a question of when. (so hurry up and get your game out before it does!).

But of course that leads us into the issue of broken markets, since all current game sales avenues are deeply broken markets.

Equilibrium (like most naive economic theory) only applies to markets where there's fluidity, robust competition, no monopolistic control, free information, etc. And of course those don't happen in the real world.

Whenever a market is not healthy, it provides an opportunity for unbalanced reward, well out of equilibrium.

Lack of information can particularly be a factor in small niches. There can be a company that does something random like make height-adjustable massage tables. If they're a private operation and nobody really pays attention to them, they can have super high profit levels for something that's not particularly difficult - way out of equilibrium. If other people knew how easy that business was, lots of others would enter, but due to lack of information they don't.

Patents and other such mechanisms create legally enforced distortions of the market. Of course things like the cable and utility systems are even worse.

On a large scale, government distortion means that huge fields like health care, finance, insurance, oil, farming, etc. are all more profitable than they should be.

Perhaps the biggest issue in downloadable games is the oligopoly of App Store and Steam. This creates an unhealthy market distortion and it's hard to say exactly what the long term effect of that will be. (of course you don't see it as "unhealthy" if you are the one benefiting from the favor of the great powers; it's unhealthy in a market fluidity and fair competition sense, and may slow or prevent equilibrium)

Of course new fields are not yet in equilibrium, and one of the best ways to "get rich quick" is to chase new fields. Software has been out of equilibrium for the past 50 years, and is only recently settling down. Eventually software will be a very poorly paid field, because it requires very little capital to become a programmer, it's low risk, and there are lots of people who can do it.

Note that in *every* field the best will always rise to the top and be paid accordingly.

Games used to be a great field to work in because it was a new field. New fields are exciting, they offer great opportunities for innovation, and they attract the best people. Mature industries are well into equilibrium and the only chances for great success are through either big risk, big capital investment, or crookedness.

03-31-13 | Some GDC Observations

From my very limited view of GDC standing at the RAD booth.

1. Programming is dead. There were basically zero programming talks at GDC this year. That's sad, but also perfectly reasonable since programming is not the problem any more (*). (* = assuming that you just want to make the same old shit with different graphics)

2. Piece of shit mobile games that people have thrown together in a month look better than AAA games 10 years ago. It's not just that GPU's are so much better, but the free engines are really amazing these days, and the content pipes are so much better, and there are so many more decent 3d artists that can just make tons of content.

3. Game developers look like human beings now. If you looked at a GDC when I first started going, we were all classic troglodyte nerds; unwashed sweatshirts and open backpacks with slide-rules falling out. We were all vampirically pale from being locked in a dark box surrounded by our giant CRTs. (more generally I'm noticing that the average fitness level (on the west coast anyway) is way up in the past 5 years or so).

4. Mobile is dead, downloadable is king. I do an unscientific random sampling every year just by asking the people who stop by the RAD booth what they're working on. For the past few years it has been mobile mobile "we're making a game for ios and android", tons of kids and startups and indies trying to get into mobile. That seems to be gone, and the new gold rush is "downloadable" (PC, XBLA, etc).

5. Games are tacky and tasteless. One of the worst things for me standing at the booth is just hearing and seeing games all day. I don't play games much, I never watch TV with commercials, and I never watch things like cable news with all the excessive HUD and overstimulation, I find all that stuff abusive of my senses. Games are stuck in this awful "bling bling whoosh blammo" flashing and fast-cuts and just really tacky aesthetic. It's just like TV ads, or a bit like standing in the slot machine section of a casino (which is surely some level of hell).

6. I saw one really amazing game at GDC that stood out from the rest. It had all the players instantly smiling and laughing. It was fun for kids and adults. It created a feeling of group affinity. Everyone around wanted to join in. It was even beneficial to the body. It was an inflatable ball. Personally I had the "holy shit what we make is total crap" (actually worse than crap, because it's actively harmful to the body and mind) epiphany some 10+ years ago, but it just struck me so hard standing there with all these shit games around and people having so much more fun in the most basic game in the non-electronic world.

03-31-13 | Index - Game Threading Architecture

Gathering the series for an index post :

cbloom rants 08-01-11 - A game threading model
cbloom rants 12-03-11 - Worker Thread system with reverse dependencies
cbloom rants 03-05-12 - Oodle Handle Table
cbloom rants 03-08-12 - Oodle Coroutines
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts
cbloom rants 12-18-12 - Async-Await ; Microsoft's Coroutines
cbloom rants 12-21-12 - Coroutine-centric Architecture
cbloom rants 12-21-12 - Coroutines From Lambdas
cbloom rants 12-06-12 - Theoretical Oodle Rewrite Continued
cbloom rants 02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

I believe this is a good architecture, using the techniques that we currently have available, without doing anything that I consider bananas like writing your own programming language (*). Of course if you are platform-specific or know you can use C++11 there are small ways to make things more convenient, but the fundamental architecture would be about the same (and assuming that you will never need to port to a broken platform is a mistake I know well).

(* = a lot of people that I consider usually smart seem to think that writing a custom language is a great solution for lots of problems. Whenever we're talking about "oh reflection in C is a mess" or "dependency analysis should be automatic", they'll throw out "well if you had the time you would just write a custom language that does all this better". Would you? I certainly wouldn't. I like using tools that actually work, that new hires are familiar with, etc. etc. I don't have to list the pros of sticking with standard languages. In my experience every clever custom language for games is a huge fucking disaster and I would never advocate that as a good solution for any problem. It's not a question of limited dev times and budgets.)

03-31-13 | Endian-Independent Structs

I dunno, maybe this is common practice, but I've never seen it before.

The easy way to load many file formats (I'll use a BMP here to be concrete) is just to point a struct at it :

struct BITMAPFILEHEADER
{
    U16 bfType; 
    U32 bfSize; 
    U16 bfReserved1; 
    U16 bfReserved2; 
    U32 bfOffBits; 
} __attribute__ ((__packed__));


if ( bmfh->bfType != 0x4D42 )
    ERROR_RETURN("not a BM",0);


but of course this doesn't work cross platform.

So people do all kinds of convoluted things (which I have usually done), like changing to a method like :

U16 bfType = Get16LE(&ptr);
U32 bfSize = Get32LE(&ptr);

or they'll do some crazy struct-parse fixup thing which I've always found to be bananas.

But there's a super trivial and convenient solution :

struct BITMAPFILEHEADER_LE
{
    U16LE bfType; 
    U32LE bfSize; 
    U16LE bfReserved1; 
    U16LE bfReserved2; 
    U32LE bfOffBits; 
} __attribute__ ((__packed__));

where U16LE is just U16 on little-endian platforms and is a class that does bswap on itself on big-endian platforms.

Then you can still just use the old struct-pointing method and everything just works. Duh, I can't believe I didn't think of this earlier.
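On a big-endian target, U16LE is just a tiny class, something like this sketch (a toy version I'm writing here, not the actual Oodle type; a real one also wants U32LE and the rest of the operators, and on little-endian platforms it's simply a typedef to U16) :

```cpp
#include <cstdint>

// Toy sketch of U16LE for a big-endian target. On a little-endian
// platform you'd just "typedef uint16_t U16LE;" and it compiles away.
struct U16LE
{
    uint16_t raw; // bytes stored in little-endian file order

    operator uint16_t() const
    {
        // byte swap back to native on read
        return (uint16_t)((raw >> 8) | (raw << 8));
    }
    U16LE & operator=(uint16_t v)
    {
        raw = (uint16_t)((v >> 8) | (v << 8));
        return *this;
    }
};
```

With that, pointing the packed struct at the file bytes reads correctly on either endianness.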

Similarly, here's a WAV header :

struct WAV_header_LE
{
    U32LE FOURCC_RIFF; // RIFF Header 
    U32LE riffChunkSize; // RIFF Chunk Size 
    U32LE FOURCC_WAVE; // WAVE Header 
    U32LE FOURCC_FMT; // FMT header 
    U32LE fmtChunkSize; // Size of the fmt chunk 
    U16LE audioFormat; // Audio format 1=PCM,6=mulaw,7=alaw, 257=IBM Mu-Law, 258=IBM A-Law, 259=ADPCM 
    U16LE numChan; // Number of channels 1=Mono 2=Stereo 
    U32LE samplesPerSec; // Sampling Frequency in Hz 
    U32LE bytesPerSec; // bytes per second 
    U16LE blockAlign; // normally NumChan * bytes per sample 
    U16LE bitsPerSample; // Number of bits per sample 
} __attribute__ ((__packed__));


For file-input type structs, you just do this and there's no penalty. For structs you keep in memory you wouldn't want to eat the bswap all the time, but even in that case this provides a simple way to get the swizzle into native structs by just copying all the members over.

Of course if you have the Reflection-Visitor system that I'm fond of, that's also a good way to go. (cursed C, give me a "do this macro on all members").
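(The closest thing cursed C does give you is the X-macro pattern : list the members once in a macro, then stamp out the struct and any per-member operation from that one list. A sketch, with a made-up abbreviated member list :)

```c
#include <stdint.h>

typedef uint16_t U16;
typedef uint32_t U32;

/* hypothetical X-macro sketch : the member list is written once,
   then both the struct and a "visit all members" op come from it */
#define WAV_FMT_MEMBERS(X) \
    X(U16, audioFormat)    \
    X(U16, numChan)        \
    X(U32, samplesPerSec)

#define DECLARE_MEMBER(type, name) type name;
struct WAV_fmt { WAV_FMT_MEMBERS(DECLARE_MEMBER) };

/* example visitor : total up the size of all members */
#define ADD_SIZEOF(type, name) total += (unsigned) sizeof(type);
static unsigned wav_fmt_members_size(void)
{
    unsigned total = 0;
    WAV_FMT_MEMBERS(ADD_SIZEOF)
    return total;
}
```

The same member list can stamp out a bswap-all-members function, a printer, etc.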

03-30-13 | Error codes

Some random rambling on the topic of returning error codes.

Recently I've been fixing up a bunch of code that does things like

void MutexLock( Mutex * m )
{
    if ( ! m ) return;
    ...
}

yikes. Invalid argument and you just silently do nothing. No thank you.

We should all know that silently nopping in failure cases is pretty horrible. But I'm also dealing with a lot of error code returns, and it occurs to me that returning an error code in that situation is not much better.

Personally I want unexpected or unhandleable errors to just blow up my app. In my own code I would just assert; unfortunately that's not viable in OS code or perhaps even in a library.

The classic example is malloc. I hate mallocs that return null. If I run out of memory, there's no way I'm handling it cleanly and reducing my footprint and carrying on. Just blow up my app. Personally whenever I implement an allocator if it can't get memory from the OS it just prints a message and exits (*).

(* = aside : even better is "functions that don't fail" which I might write more about later; basically the idea is the function tries to handle the failure case itself and never returns it out to the larger app. So in the case of malloc it might print a message like "tried to alloc N bytes; (a)bort/(r)etry/return (n)ull?". Another common case is when you try to open a file for write and it fails for whatever reason, it should just handle that at the low level and say "couldn't open X for write; (a)bort/(r)etry/change (n)ame?" )
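Something like this sketch ("xmalloc" is my name for it here, and I've left out the interactive abort/retry prompt) :

```c
#include <stdio.h>
#include <stdlib.h>

/* sketch of a "function that doesn't fail" allocator : never returns
   null; on failure it reports and exits rather than making every
   caller handle an error it can't recover from anyway */
static void * xmalloc(size_t bytes)
{
    void * p = malloc(bytes);
    if ( p == NULL )
    {
        /* a fancier version would prompt (a)bort/(r)etry/(n)ull here */
        fprintf(stderr, "xmalloc: failed to get %zu bytes\n", bytes);
        exit(EXIT_FAILURE);
    }
    return p;
}
```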

I think error code returns are okay for *expected* and *recoverable* errors.

On functions that you realistically expect to always succeed and will not check error codes for, they shouldn't return error codes at all. I wrote recently about wrapping system APIs for portable code ; an example of the style of level 2 wrapping that I like is to "fix" the error returns.

(obviously this is not something the OS should do, they just have to return every error; it requires app-specific knowledge about what kind of errors your app can encounter and successfully recover from and continue, vs. ones that just mean you have a catastrophic unexpected bug)

For example, functions like lock & unlock a mutex shouldn't fail (in my code). 99% of the user code in the world that locks and unlocks mutexes doesn't check the return value, they just call lock and then proceed assuming the lock succeeded - so don't return it :

void mypthread_mutex_lock(mypthread_mutex_t *mutex)
{
    int ret = pthread_mutex_lock(mutex);
    if ( ret != 0 )
        abort(); // unexpected failure : blow up right here
}

When you get a crazy unexpected error like that, the app should just blow up right at the call site (rather than silently failing and then blowing up somewhere weird later on because the mutex wasn't actually locked).

In other cases there are a mix of expected failures and unexpected ones, and the level-2 wrapper should differentiate between them :

bool mysem_trywait(mysem * sem)
{
    for(;;)
    {
        int res = sem_trywait( sem );
        if ( res == 0 ) return true; // got it
        int err = errno;
        if ( err == EINTR )
        {
            // UNIX is such balls; interrupted, just loop and try again
        }
        else if ( err == EAGAIN )
        {
            // expected failure, no count in sem to dec :
            return false;
        }
        else
        {
            // crazy failure; blow up :
            abort();
        }
    }
}
(BTW best practice these days is always to copy "errno" out to an int, because errno may actually be #defined to a function call in the multithreaded world)

And since I just stumbled into it by accident, I may as well talk about EINTR. Now I understand that there may be legitimate reasons why you *want* an OS API that's interrupted by signals - we're going to ignore that, because that's not what the EINTR debate is about. So for purposes of discussion pretend that you never have a use case where you want EINTR and it's just a question of whether the API should put that trouble on the user or not.

I ranted about EINTR at RAD a while ago and was informed (reminded) this was an ancient argument that I was on the wrong side of.

Mmm. One thing certainly is true : if you want to write an operating system (or any piece of software) such that it is easy to port to lots of platforms and maintain for a long time, then it should be absolutely as simple as possible (meaning simple to implement, not simple in the API or simple to use), even at the cost of "rightness" and pain to the user. That I certainly agree with; UNIX has succeeded at being easy to port (and also succeeded at being a pain to the user).

But most people who argue on the pro-EINTR side of the argument are just wrong; they are confused about what the advantage of the pro-EINTR argument is (for example Jeff Atwood takes off on a general rant against complexity ; I think we all should know by now that huge complex APIs are bad; that's not interesting, and that's not what "Worse is Better" is about; or Jeff's example of INI files vs the registry - INI files are just massively better in every way, it's not related at all, there's no pro-con there).

(to be clear and simple : the pro-EINTR argument is entirely about simplicity of implementation and porting of the API; it's about requiring the minimum from the system)

The EINTR-returning API is not simpler (than one that doesn't force you to loop). Consider an API like this :

U64 system( U64 code );

doc :

if the top 32 bits of code are 77 this is a file open and the bottom 32 bits specify a device; the
return values then are 0 = call the same function again with the first 8 chars of the file name ...
if it returns 7 then you must sleep at least 1 milli and then call again with code = 44 ...
etc.. docs for 100 pages ...

what you should now realize is that *the docs are part of the API*. (that is not a "simple" API)

An API that requires you to carefully read about the weird special cases and understand what is going on inside the system is NOT a simple API. It might look simple, but it's in disguise. A simple API does what you expect it to. You should be able to just look at the function signature and guess what it does and be right 99% of the time.

Aside from the issue of simplicity, any API that requires you to write the exact same boiler-plate every time you use it is just a broken fucking API.

Also, I strongly believe that any API which returns error codes should be usable if you don't check the error code at all. Yeah yeah in real production code of course you check the error code, but for little test apps you should be able to do :

int fd = open("blah");
read(fd,buf,size);
close(fd);



and that should work okay in my hack test app. Nope, not in UNIX it doesn't. Thanks to its wonderful "simplicity" you have to call "read" in a loop because it might decide to return before the whole read is done.
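For the record, the boiler-plate loop that UNIX forces on you looks something like this (a minimal sketch handling EINTR and short reads; not production code) :

```c
#include <errno.h>
#include <unistd.h>

/* loop until 'count' bytes are read, retrying on EINTR and on short
   reads; returns bytes read (may be < count at EOF), or -1 on a
   real error -- this is the exact-same-boiler-plate-every-time that
   a sane API would do for you */
static ssize_t read_full(int fd, void * buf, size_t count)
{
    size_t done = 0;
    while ( done < count )
    {
        ssize_t got = read(fd, (char *)buf + done, count - done);
        if ( got < 0 )
        {
            if ( errno == EINTR ) continue; /* interrupted; try again */
            return -1;
        }
        if ( got == 0 ) break; /* EOF */
        done += (size_t)got;
    }
    return (ssize_t)done;
}
```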

Another example that occurs to me is the reuse of keywords and syntax in C. Things like making "static" mean something completely different depending on how you use it makes the number of special keywords smaller. But I believe it actually makes the "API" of the language much *more* complex. Instead of having intuitive and obvious separate clear keywords for each meaning that you could perhaps figure out just by looking at them, you instead have to read a bunch of docs and have very technical knowledge of the internals of what the keywords mean in each usage. (there are legitimate advantages to minimizing the number of keywords, of course, like leaving as many names available to users as possible). Knowledge required to use an API is part of the API. Simplicity is determined by the amount of knowledge required to do things correctly.

03-26-13 | Oodle 1.1 and GDC

Hey it's GDC time again, so if you're here come on by the RAD booth and say "hi" (or "fuck you", or whatever).

The Oodle web site just went live a few days ago.

Sometimes I feel embarrassed (ashamed? humiliated?) that it's taken me five years to write a file IO and data compression library. Other times I think I've basically written an entire OS by myself (and all the docs, and marketing materials, and a video compressor, and aborted paging engine, and a bunch of other crap) and that doesn't sound so bad. I suppose the truth is somewhere in the middle. (perhaps with Oodle finally being officially released and selling, I might write a little post-mortem about how it's gone, try to honestly look back at it a bit. (because lord knows what I need is more introspection in my life)).

Oodle 1.1 will be out any day now. Main new features :

Lots more platforms.  Almost everything except mobile platforms now.

LZNIB!  I think LZNIB is pretty great.  8X faster to decode than ZLIB and usually
makes smaller files.

Other junk :
All the compressors can run parallel encode & decode now.
Long-range-matcher for LZ matching on huge files (still only in-memory though).
Incremental compressors for online transmission, and faster resets.

Personally I'm excited the core architecture is finally settling down, and we have a more focused direction to go forward, which is mainly the compressors. I hope to be able to work on some new compressors for 1.2 (like a very-high-compression option, which I currently don't have), and then eventually move on to some image compression stuff.

03-26-13 | Simulating Deep Yield with a Wait

I'm becoming increasingly annoyed at my lack of "deep yield" for coroutines.

Any time you are in a work item, if you decide that you can get some more parallelism by doing a branch-merge inside that item, you need deep yield.

Remember you should never ever do an OS wait on a coroutine thread (with normal threads anyway; on a WinRT threadpool thread you can). The reason is the OS wait disables that worker thread, so you have one less. In the worst case, it leads to deadlock, because all your worker threads can be asleep waiting on work items, and with no worker threads they will never get done.

Anyway, I've cooked up a temporary work-around, it looks like this :

I'm in some function and I want to branch-merge

If I'm not on a worker thread
  -> just do a normal branch-merge, send the work off and use a Wait for completion

If I am on a worker thread :

inc target worker thread count
if # currently live worker threads is < target count
  start a new worker thread (either create or wake from pool)

now do the branch-merge and use OS Wait
dec the target worker thread count

on each worker thread, after completing a work item and before popping more work :
if target worker thread count < currently live count
  stop self (go back into a sleeping state in the pool)

this is basically using OS threads to implement stack-saving deep yield. It's not awesome, but it is okay if deep yield is rare.
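The counter bookkeeping can be sketched with a couple of atomics. This is a toy version with my own names; the actual thread create/wake and the work-popping loop are elided :

```cpp
#include <atomic>

// toy sketch of the deep-yield-via-OS-wait bookkeeping (names mine)
static std::atomic<int> g_target_workers{4};
static std::atomic<int> g_live_workers{4};

// call on a worker thread just before doing a blocking OS Wait :
void worker_will_block()
{
    g_target_workers.fetch_add(1);
    if ( g_live_workers.load() < g_target_workers.load() )
    {
        // start_or_wake_worker(); // hypothetical : create or wake from pool
        g_live_workers.fetch_add(1);
    }
}

// call after the Wait completes :
void worker_done_blocking()
{
    g_target_workers.fetch_sub(1);
}

// on each worker, after completing an item and before popping more :
bool worker_should_stop()
{
    if ( g_target_workers.load() < g_live_workers.load() )
    {
        g_live_workers.fetch_sub(1);
        return true; // go back to a sleeping state in the pool
    }
    return false;
}
```

(a real version needs the check-and-decrement to be a single atomic op to avoid two workers both deciding to stop; this is just the shape of it)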

03-23-13 | AI Rambles

With GDC coming up I've been thinking about the state of game technology in general. First a bit of a rant on that.

I am so fucking bored of graphics. Graphics are not the damn problem. I'm completely appalled by the derivative repetitive boring games you all keep making. I don't want to play "Shoot People in the Face 227" or "Space Marines 154" or "Slide Blocks to Make them Go Bling N" or "Cute Creatures Jump Around on Blocks N". Barf, boring. And making them all shiny with new graphics is just gilding the turd. Stop working on graphics.

Games have huge tech problems that nobody seems to want to work on. One that I have wanted to work on for a long time is animation. And by "animation" I don't really mean playing back clips, which fundamentally looks like garbage, but making characters move naturally, able to transition movements the way their body should, respond to surface variations and so on. Game animation just looks so awful, and it's becoming more uncanny as the graphics get better.

(in fact if we were smart we would have done it the other way around. Every cartoonist for a hundred years has known that it's actually ok for the visuals to look unrealistic if the animation and sound are really good. Human perception cares more about motion than the static appearance of things.)

Anyhoo, the other big one is AI. And by "AI" I don't mean playing scripts, or moving to designer-placed cover spots. Even some of the more sophisticated game AI systems are really just fancy whack-a-mole. You can see the AI's run to one spot, do a pre-programmed routine, run to another spot, pop out of cover so the player can shoot me, pop back in cover. Now, certainly there are merits to whack-a-mole AI. If you're making a platformer you don't want the enemy to do surprising things, you just want them to walk back and forth on a set pattern that the player can pick up easily. They're not really AI at all, they're rigid bodies with an animal painted on them.

These AI's never surprise you, they never make you laugh, they never make you want to play again because they might do something new. They feed off your energy and don't give anything back, like a bad conversation partner.

So it made me realize that game AIs are actually more interesting when the game is very simple. It might naively seem like a big complex sandbox 3d world has got a more complex AI, but really that complex world means that the AI no longer understands what it's doing. Your only hope is to give it simple rules to follow about what it can do in that world.

In contrast, AI for simple game systems (chess, checkers, backgammon, poker) can do amazing things that the human programmer never anticipated. There's a funny thing that happens with computer algorithms where a cold rational scientific brute-force search of a mathematical problem space actually leads to behavior that's more human than the shitty heuristic decision-tree type programming that's explicitly trying to simulate human behavior.

For example, when I was writing poker AI, I was really amazed at the "creative" plays that a simulation-based bot makes. (for review : a standard UAlberta-style poker bot works by building a model of the opponent based on observation of previous action; it then simulates all the possibilities for future cards and imagines what the opponent will do in each situation; it sums the EV over all paths for each of its own actions, and chooses the action that maximizes EV).
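The decision rule at the core of that is trivially small. A toy illustration with a one-parameter opponent model (all names and numbers here are mine; a real UAlberta-style bot simulates full action/card trees, this is just the "pick the max-EV action" kernel) :

```c
/* toy opponent model : he folds to a river bet with modeled
   probability p_fold; our hand is junk, so checking has EV ~ 0 */
static double ev_bluff(double p_fold, double pot, double bet)
{
    /* win the pot when he folds, lose the bet when he calls */
    return p_fold * pot - (1.0 - p_fold) * bet;
}

/* returns 1 = bluff, 0 = check : just take the max-EV action */
static int should_bluff(double p_fold, double pot, double bet)
{
    return ev_bluff(p_fold, pot, bet) > 0.0;
}
```

Note how sharply it turns : with a pot-sized bet the bluff goes +EV as soon as p_fold crosses 0.5, which is exactly why the bot jams every hand against an opponent who folds even slightly too much.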

At the simplest level, it figures out things like check-raising when you tend to bet checked flops too much. But it did even weirder things. For example the bot would very quickly become hyper-aggressive against an opponent that folds even slightly too much; it adjusted faster and way more severely than any human. I would play against it sometimes with our cards face up so that I could make sure it was doing sane things, and I would see it make a huge check-raise bluff on the river with junk. My first thought is "I have a bug" and I'd go looking into the stats of the model, and found that there was no bug, it's just that the AI had learned that I thought a big river raise meant strength, so I was folding to them a lot, and therefore the simulation will jam almost every hand.

This type of poker AI is not the game theoretic equilibrium solution. It's assuming that the opponent plays by some scheme which may not be optimal, and that its own strategy is not face up. That can lead it to make mistakes. One I've long been aware of is that it doesn't hedge correctly. Normal humans hedge all the time in their poker play, perhaps too much; you will often suspect that someone is bluffing a huge percent of the time, but you aren't sure. A non-hedging AI would immediately start making very light call-downs, but a cautious human will weight in some factor for the model being wrong and play with a blended strategy that's not disastrous if the model is wrong (like only doing the light call-down in small pots, or waiting for a call-down with a hand that has some chance of being best even if the model is wrong).

Continuing the random rambling train of thought, I just realized (re-realized?) that one of the flaws with this style of poker AI is that it doesn't anticipate the reaction to its moves. Of course it does anticipate the reaction just in terms of "if I bet, what hands will he call with or raise with", but it is evaluating based on the *past* model of the opponent. After you make your bet, the opponent sees it and adjusts their view of you, so you need to be anticipating how their play style changes. For example in the case I mentioned above - when someone is playing pretty weak/tight the bot rapidly becomes hyper-aggressive, which is mostly good, but the bot never gets the idea that "hey he can see I'm raising every single street of every hand, he's going to adjust and call me down more".

Anyway, bringing it back to games, it occurred to me that it would be interesting to try some really simple 2d games, and give them a mathematical solving AI, instead of the usual heuristic crap we do. Like, let's face facts - we can't actually make games in these big free form 3d worlds, it's too complex. Our ability to do the graphics has gotten way beyond every other aspect. We need to back up and go to like Ultima-style 2d tile-based games. Now you have a space where the AI can just explore future actions, and things like advancing on the player by moving from cover to cover just pops out of the behavior automatically because it maximizes EV, not because it was explicitly coded.

(I'm not contending that this is the "right way" to make games or that it will necessarily make good games, I just thought it was interesting)

03-19-13 | Windows Sleep Variation

Hmm, I've discovered that Sleep(n) behaves very differently on my three Windows boxes.

(Also remember there are a lot of other issues with Sleep(n) ; the times are only reliable here because this is in a no-op test app)

This actually started because I was looking into Linux thread sleep timing, so I wrote a little test to just Sleep(n) a bunch of times and measure the observed duration of the sleep.

(Of course on Windows I do timeBeginPeriod(1) and bump my thread to very high priority (and timeGetDevCaps says the minp is 1)).
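The measurement loop, sketched portably with std::chrono (the actual Windows test used Sleep() with timeBeginPeriod(1) and a high-priority thread; struct and function names here are mine) :

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <thread>

struct SleepStats { double average, sdev, mn, mx; };

// sleep 'millis' repeatedly and measure the observed duration
SleepStats measure_sleep(int millis, int trials)
{
    using clock = std::chrono::steady_clock;
    double sum = 0, sumsq = 0, mn = 1e9, mx = 0;
    for (int i = 0; i < trials; i++)
    {
        auto t0 = clock::now();
        std::this_thread::sleep_for(std::chrono::milliseconds(millis));
        auto t1 = clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        sum += ms; sumsq += ms * ms;
        mn = std::min(mn, ms);
        mx = std::max(mx, ms);
    }
    double avg = sum / trials;
    SleepStats s;
    s.average = avg;
    s.sdev = std::sqrt(std::max(0.0, sumsq / trials - avg * avg));
    s.mn = mn; s.mx = mx;
    return s;
}
```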

Anyway, what I'm seeing is this :

Win7 :
sleep(1) : average = 0.999 , sdev = 0.035 ,min = 0.175 , max = 1.568
sleep(2) : average = 2.000 , sdev = 0.041 ,min = 1.344 , max = 2.660
sleep(3) : average = 3.000 , sdev = 0.040 ,min = 2.200 , max = 3.774

Sleep(n) averages n
duration in [n-1,n+1]

WinXP :
sleep(1) : average = 1.952 , sdev = 0.001 ,min = 1.902 , max = 1.966
sleep(2) : average = 2.929 , sdev = 0.004 ,min = 2.665 , max = 2.961
sleep(3) : average = 3.905 , sdev = 0.004 ,min = 3.640 , max = 3.927

Sleep(n) averages (n+1)
duration very close to (n+1) every time (tiny sdev)

Win8 :
sleep(1) : average = 2.002 , sdev = 0.111 ,min = 1.015 , max = 2.101
sleep(2) : average = 2.703 , sdev = 0.439 ,min = 2.017 , max = 3.085
sleep(3) : average = 3.630 , sdev = 0.452 ,min = 3.003 , max = 4.130

average no good
Sleep(n) minimum very precisely n
duration in [n,n+1] (+ a little error)
rather larger sdev

it's like completely different logic on each of my 3 machines. XP is the most precise, but it's sleeping for (n+1) millis instead of (n) ! Win8 has a very precise min of n, but the average and max is quite sloppy (sdev of almost half a milli, very high variation even with nothing happening on the system). Win7 hits the average really nicely but has a large range, and is the only one that will go well below the requested duration.

As noted before, I had a look at this because I'm running Linux in a VM and seeing very poor performance from my threading code under Linux-VM. So I ran this experiment :

Sleep(1) on Linux :

native : average = 1.094 , sdev = 0.015 , min = 1.054 , max = 1.224
in VM  : average = 3.270 , sdev =14.748 , min = 1.058 , max = 656.297

in VM2 : average = 1.308 , sdev = 2.757 , min = 1.052 , max = 154.025

obviously being inside a VM on Windows is not being very kind to Linux's threading system. On the native box, Linux's sleep time is way more reliable than Windows (small min-max range) (and this is just with default priority threads and SCHED_OTHER, not even using a high priority trick like with the Windows tests above).

added "in VM2". So the VM threading seems to be much better if you let it see many fewer cores than you have. I'm running on a 4 core (8 hypercore) machine; the base "in VM" numbers are with the VM set to see 4 cores. "in VM2" is with the VM set to 2 cores. Still a really bad max in there, but much better overall.

03-16-13 | Writing Portable Code Rambles

Some thoughts after spending some time on this (still a newbie). How I would do it differently if I started from scratch.

1. Obviously you all know the best practice of using your own data types (S32 or whatever) and making macros for any kind of common operation that the standards don't handle well (like use a SEL macro instead of ?: , make a macro for ROT, etc). Never use bit-fields, make your own macros for manipulating bits within words. You also have to make your own whole macro meta-language for things not quite in the language, like data alignment, restrict/alias, etc. etc. (god damn C standard people, spend some time on the actual problems that real coders face every day. Thanks mkay). That's background and it's the way to go.

Make your own defines for SIZEOF_POINTER since stupid C doesn't give you any way to check sizeof() in a macro. You probably also want SIZEOF_REGISTER. You need your own equivalent of ptrdiff_t and intptr_t. Best practice is to use pointer-sized ints for all indexing of arrays and buffer sizes.

(one annoying complication is that there are platforms with 64 bit pointers on which 64-bit int math is very slow; for example they might not have a 64-bit multiply at all and have to emulate it. In that case you will want to use 32-bit ints for array access when possible; bleh)
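To be concrete, a few toy entries of that macro meta-language (real versions need a per-compiler #if ladder for things like alignment and restrict) :

```c
/* sketch of the macro meta-language entries described above */
#define SEL(c, a, b)   ((c) ? (a) : (b))
#define ROT32(x, k)    (((x) << (k)) | ((x) >> (32 - (k))))

/* SIZEOF_POINTER : sizeof() can't be used in #if, so define it from
   compiler-provided macros instead */
#if defined(_WIN64) || defined(__LP64__) || defined(__x86_64__)
#define SIZEOF_POINTER 8
#else
#define SIZEOF_POINTER 4
#endif
```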

Avoid using "wchar_t" because it is not always the same size. Try to explicitly use UTF16 or UTF32 in your code. You could make your own SIZEOF_WCHAR and select one or the other on the appropriate platform. (really try to avoid using wchar at all; just use U16 or U32 and do your own UTF encoding).

One thing I would add to the macro meta-language next time is to wrap every single function (and class) in my code. That is, instead of :

int myfunc( int args );

becomes :

FUNC1 int FUNC2 myfunc(int args );

or even better :

FUNC( int , myfunc , (int args) );

this gives you lots of power to add attributes and other munging as may be needed later on some platforms. If I was doing this again I would use the last style, and I would have two of them, a FUNC_PUBLIC and FUNC_PRIVATE to control linkage. Probably should have separate wrapper macros for the proto and the body.

While you're at it you may as well have a preamble in every func too :

FUNC_PUBLIC_BODY( int , myfunc , (int args) )


which lets you add automatic func tracing, profiling, logging, and so on.

I wish I had made several different layers of platform Id #defines. The first one you want is the lowest level, which explicitly Id's the current platform. These should be exclusive (no overlaps), something like OODLE_PLATFORM_X86X64_WIN32 or OODLE_PLATFORM_PS3_PPU.

Then I'd like another layer that's platform *groups*. For me the groups would probably be OODLE_PLATFORM_GROUP_PC , GROUP_CONSOLE, and GROUP_EMBEDDED. Those let you make gross characterizations like on "GROUP_PC" you use more memory and have more debug systems and such. With these mutually exclusive platform checks, you should never use an #else. That is, don't do :

#if OODLE_PLATFORM_X86X64_WIN32
.. some code ..
#else
.. fallback ..
#endif
it's much better to explicitly enumerate which platforms you want to go to which code block, and then have an
#error new platform
at the end of every check. That way when you try building on new platforms that you haven't thought carefully about yet, you get nice compiler notification about all the places where you need to think "should it use this code path or should I write a new one". Fallbacks are evil! I hate fallbacks, give me errors.

Aside from the explicit platforms and groups I would have platform flags or caps which are non-mutually exclusive. Things like PLATFORM_FLAG_STDIN_CONSOLE.

While you want the raw platform checks, in end code I wish I had avoided using them explicitly, and instead converted them into logical queries about the platform. What I mean is, when you just have an "#if some platform" in the code, it doesn't make it clear why you care that's the platform, and it doesn't make it reusable. For example I have things like :

#if OODLE_PLATFORM_X86X64
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif
what I should have done is to introduce an abstraction layer in the #if that makes it clear what I am checking for, like :

// classify platforms, in one place :
#if OODLE_PLATFORM_X86X64
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 1
#elif OODLE_PLATFORM_PS3_PPU
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 0
#else
#error classify me
#endif

// usage :
#if PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif

then it's really clear what you want to know and how to classify new platforms. It also lets you reuse that toggle in lots of places without code duping the fiddly bit, which is the platform classification.

Note that when doing this, it's best to make high level usage-specific switches. You might be tempted to try to use platform attributes there. Like instead of "PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS" you might want to use "PLATFORM_SWITCH_UNALIGNED_READ_PENALTY" . But that's not actually what you want to know, you want to know if on my particular application (LZ string match) it's better to use big words or not, and that might not match the low level attribute of the CPU.

It's really tempting to skip all this and abuse the switches you can see (lord knows I do it); I see (and write) lots of code that does evil things like using "#ifdef _MSC_VER" to mean something totally different like "is this x86 or x64" ? Of course that screws you when you move to another x86 platform and you aren't detecting it correctly (or when you use MSVC to make PPC or ARM compiles).

Okay, that's all pretty standard, now for the new bit :

2. I would opaque out the system APIs in two levels. I haven't actually ever done this, so grains of salt, but I'm pretty convinced it's the right way to go after working with a more standard system.

(for the record : the standard way is to make a set of wrappers that tries to behave the same on all systems, eg. that tries to hide what system you are on as much as possible. Then if you need to do platform-specific stuff you would just include the platform system headers and talk to them directly. That's what I'm saying is not good.)

In the proposed alternative, the first level would just be a wrapper on the system APIs with minimal or no behavior change. That is, it's just passing them through and standardizing naming and behavior.

At this level you are doing a few things :

2.A. Hiding the system includes from the rest of your app. System includes are often in different places, and often turn on compiler flags in nasty ways. You want to remove that variation from the rest of your code so that your main codebase only sees your own wrapper header.

2.B. Standardizing naming. For example the MSVC POSIX funcs are all named wrong; at this level you can patch that all up.

2.C. Fixing things that are slightly different or don't work on various platforms where they really should be the same. For example things like pthreads are not actually all the same on all the pthreads platforms, and that can catch you out in nasty ways. (eg. things like sem_init always failing on Mac).

Note this is *not* trying to make non-POSIX platforms look like POSIX. It's not hiding the system you're on, just wrapping it in a standard way.

2.D. I would also go ahead and add my own asserts for args and returns in this layer, because I hate functions that just return error codes when there's a catastrophic failure like a null arg or an EHEAPCORRUPT or whatever.

So once you have this wrapper you no longer call any system funcs directly from your main codebase, but you still would be doing things like :


    #if OODLE_PLATFORM_X86X64_WIN32

    HANDLE h = platform_CreateFile( ... );

    #elif OODLE_PLATFORM_LINUX

    int fd = platform_open( name , flags );

    #else

    #error unknown platform

    #endif

that is, you're not hiding what platform you're on, you're still letting the larger codebase get to the low level calls, it's just the mess of how fucked they are that's hidden a bit.

3. You then have a second level of wrapping which tries to make same-action interfaces that don't require you to know what platform you're on. The second level is written on top of the first.

The second level wrappers should be as high level as necessary to opaque out the operation. For example rather than having "make temp file name" and "open file" you might have "open file with temp name", because on some platforms that can be more efficient when you know it is a high-level combined op. You don't just have "GetTime" you have "GetTimeMonotonic" , because on some platforms they have an efficient monotonic clock for you, and on other platforms/hardwares you may have to do a lot of complicated work to ensure a reliable clock (that you don't want to do in the low level timer).

When a platform can't provide a high-level function efficiently, rather than emulate it in a complex way I'd rather just not have it - not a stub that fails, but no definition at all. That way I get a compile error and in those spots I can do something different, using the level 1 APIs.

The first level wrappers are very independent of the large code base's usage, but the second level wrappers are very much specifically designed for their usage.

To be clear about the problem of making platform-hiding second layer wrappers, consider something like OpenFile(). What are the args to that? What can it do? It's hopeless to make something that works on all platforms without greatly reducing the capabilities of some platforms. And the meaning of various options (like async, temporary, buffered, etc.) all changes with platform.

If you wanted to really make a general purpose multi-platform OpenFile you would have to use some kind of "caps" query system, where you first do something like OpenFile_QueryCaps( OF_DOES_UNBUFFERED_MEAN_ALIGNMENT_IS_REQUIRED ) and it would be an ugly disaster. (and it's retarded on the face of it, because really what you're doing there is saying "is this win32" ?). The alternative to the crazy caps system is to just make the high level wrappers very limited and specific to your usage. So you could make a platform-agnostic wrapper like OpenFile_ForReadShared_StandardFlagsAndPermissions(). Then the platforms can all do slightly different things and satisfy the high level goal of the imperative in the best way for that platform.

A good second level has as few functions as possible, and they are as high level as possible. Making them very high level allows you to do different compound ops on the platform in a way that's hidden from the larger codebase.

03-10-13 | Two LZ Notes

Note 1 : on rep matches.

"Rep matches" are a little weird. They help a lot, but the reason why they help depends on the file you are compressing. (rep match = repeat match, gap match, aka "last offset")

On text files, they work as interrupted matches, or "gap matches". They let you generate something like :

stand on the floor
stand in the door

stand in the door
[stand ][i][n the ][d][oor]

[off 19, len 6][1 lit][rep len 6][1 lit][off 18, len 3]

that is, you have a long match of [stand on the ] but with a gap at the 'o'.

Now, something I observed was that more than one last offset continues to help. On text the main benefit from having two last offsets is that it lets you use a match for the gap. When the gap is not just one character but a word, you might want to use a match to put that word in, in which case the continuation after the gap is no longer the first last-offset, it's the second one. eg.

how to work with animals
how to cope with animals

[how to ][cope][ with animals]
[off 25 ][off 32][off 25 (rep2)]

You could imagine alternative coding structures that don't require keeping some number of "last offsets". (oddly, the last offset maintenance can be a large part of decode time, because maintaining an MTF list is something that CPUs do incredibly poorly). For example you could code with a scheme where you just send the entire long match, and then any time you send a long match you send a flag for "are there any gaps", and if so you then code some gaps inside the match.

The funny thing is, on binary files "last offsets" do something else which can be more important. They become the most common offsets. In particular, on highly structured binary data, they will generally be some factor of the structure size. eg. on a file that has a struct size of 36, and that struct has dwords and such in it, the last offsets will generally be things like 4,8,16,36, or 72. They provide a sort of dictionary of the most common offsets so that you can code those smaller. You are still getting the gap-match effect, but the common-offset benefit is much bigger on these files.

(aside : word-replacing transform on text really helps LZ (and everything) by removing the length variance of tokens. In particular for LZ77, word length variance breaks rep matches. There are lots of common occurrences of a single replaced word in a phrase, like : "I want some stuff" -> "I want the stuff". You can't get a rep match here of [ stuff] because the offset changed when the substituted word was a different length. If you do WRT first, then gap matches get these.)

Note 2 : on offset structure.

I've had it in the back of my head for quite some time now to do an LZ compressor specifically designed for structured data.

One idea I had was to use "2d" match offsets. That is, send a {dx,dy} where dx is within the record and dy is different records. Like imagine the data is in a table, dy is going back rows, dx is an offset on the row. You probably want to mod dx around the row so its range is always the same, and special case dy=0 (matches within your own record).

It occurred to me that the standard way of sending LZ offsets these days actually already does this. The normal way that good LZ's send offsets these days is to break it into low and high parts :

low  = offset & 0x7F;
high = offset >> 7;

or similar; then you send "high" using some kind of "NoSB" scheme (Number of Significant Bits is entropy coded, and the bits themselves are sent raw), and you send "low" with an order-0 entropy coder.

But this is just a 2d structured record offset for a particular power-of-2 record size. It's why when I've experimented with 2d offsets I haven't seen huge wins - because I'm already doing it.

There is some win to be had from custom 2d-offsets (vs. the standard low/high bits scheme) when the record size is not a power of two.

03-06-13 | Sympathy for the Library Writer

Over the years of being a coder who was a library-consumer and not a library-writer, I've done my share of griping about annoying API's or what I saw as pointless complication or inefficiency. Man, I've been humbled by my own experience trying to write a public library. It is *hard*.

The big problem with libraries is that you don't control how they're used. This is in contrast to game engines. Game engines are not libraries. I've worked on many game engines over the years, including ones that went out to large free user bases (Genesis 3d and Wild Tangent), and they are much much easier than libraries.

The difference is that game engines generally impose an architecture on the user. They force you to use it in a certain way. (this is of course why more advanced developers despise them so much; it sucks to have some 3rd party telling you your code architecture). It's totally acceptable if a game engine only works well when you use it in the approved way, and is really slow if you abuse it, or it could even crash if you use it oddly.

A library has to be flexible about how it's used; it can't impose a system on the user, like a certain threading model, or a certain memory management model, or even an error-handling style.

Personally when I do IO for games, I make a "tool path" that just uses stdio and is very simple and flexible, does streaming IO and text parsing and so on, but isn't shipped with the game, and I make a "game path" that only does large-block async IO that's pre-baked so you can just point at it. I find that system is powerful enough for my use, it's easy to write and use. It means that the "tool path" doesn't have to be particularly fast, and the fast game path doesn't need to support buffered character IO or anything other than big block reads.

But I can't force that model on clients, so I have to support all the permutations and I have to make them all decently fast.

A lot of times in the past I've complained about over-complicated APIs that have tons of crazy options that nobody ever needs (look at the IJG jpeg code for example). Well, now I see that often those complicated APIs were made because somebody (probably somebody important) needed those options. Of course as the library provider you can offer the complex interface and also simpler alternatives, but that has its own pitfalls of making the API bigger and more redundant (like if you offer OpenFileSimple and OpenFileComplex); in some ways it's better to only offer the complex API and make the user wrap it and reduce the parameter set to what they actually use.

There's also a sort of "liability" issue in libraries. Not exactly legal liability, but program bad behavior liability. Lots of things that would make the library easier to use and faster are naughty to do automatically. For example Oodle under Vista+ can run faster with elevated privilege, to get access to some of the insecure file APIs (like extending without zeroing), but it would be naughty for me to do that automatically, so instead I have to add an extra step to make the client specifically ask for that.

Optimization for me has really become a nightmare. At first I was trying to make every function fast, but it's impossible, there are just too many entry points and too many usage patterns. Now my philosophy is to make certain core functions fast, and then address problems in the bigger high level API as customers see issues. I remember as a game developer always being so pissed that all the GL drivers were specially optimized for Id. I would want to use the API in a slightly different style, and my way would be super slow, not for any good reason but just because it hadn't gotten the optimization loving of the important customer's use case.

I used to also rail about the "unnecessary" argument checking that all the 3d APIs do. It massively slows them down, and I would complain that I had ensured the arguments were good so just fucking pass them through, stop slowing me down with all your validation! But now I see that if you really do that, you will just constantly be crashing people as they pass in broken args. In fact arg validation is often the way that people figure out the API, either because they don't read the docs or because the docs are no good.

(this is not even getting into the issue of API design which is another area where I have been suitably humbled)

ADDENDUM : I guess I should mention the really obvious points that I didn't make.

1. One of the things that makes a public library so hard after release is that you can't refactor. The normal way I make APIs for myself (and for internal teams) is to sort of make an effort at a good API the first time, but it usually sucks, and you rip it out and go through big scourges of find-rep. That only works when you control all the code, the library and the consumer. It's only after several iterations that the API becomes really nice (and even then it's only nice for that specific use case, it might still suck in the wild).

2. APIs without users almost always suck. When someone goes away in a cave and works on a big new fancy library and then shows it to the world, it's probably terrible. This is a problem that I think everyone at RAD faces. The code of mine that I really like is stuff that I use over and over, so that I see the flaws and when I want it to be easier to use I just go fix it.

3. There are two separate issues about what makes an API "good". One is "is it good for the user?" and one is "is it good for the library maintainer?". Often they are the same but not always.

Anyway, the main point of this post is supposed to be : the next time you complain about a bad library design, there may well be valid reasons why it is the way it is; they have to balance a lot of competing goals. And even if they got it wrong, hey it's hard.

03-05-13 | Obama

Sometimes when I see Obama making a speech (eg. recently on the sequester, and before on gun control, and health care), it strikes me that he's addressing the country and the opposing legislature as if he can convince them through logic and reasoning and good discussion of the issues.

I think maybe Obama just doesn't understand politics. Perhaps because of his youth and lack of experience in serious elected office, he seems to think he can just make a good speech to the public and the legislature will somehow see the light and bow to his finely reasoned and rationally based argument. LOL, silly Obama. The only way to actually accomplish anything progressive is through strong-arm backroom deals and dirty tricks (see eg. LBJ and FDR). You can't just beat crooks like the NRA and AMA by being *right* ; the moral high ground or rational correctness never got anyone anywhere.

Either that or he's super clever and knows that none of his stuff will ever pass, and he's just trying to make a show to look a bit progressive while intentionally only succeeding on the very pro-big-business measures.

Sometimes when I see a bit of Fox News or some Tea Party demonstration or whatever, I imagine Mr. Burns is standing just off screen whispering "and what about the taxes" or "it's big government that's doing this to you!". Can't you see that these talking points have been written by think tanks and your angry mob is just a puppet in their game?

03-05-13 | Immigration Reform

This is something that everyone with a clue has been saying forever, and I'm quite sure it's not going to happen, and it's really too late to take advantage of our huge lead, but anyway :

1. Open up immigration for anyone with a PhD , MD , etc.. Not just visas - give them citizenships. You want them to stay and make their business here.

(not really on topic, but if the AMA wasn't such a bunch of fuckers we would have a super-fast-track for MD's in other countries to get a US MD)

2. Instant citizenship for any immigrant who goes to a US PhD program and graduates.

(* obviously some difficulty here because colleges would pop up just to take money and make citizens, so there has to be some control on this)

Anybody who's gone to an American science PhD program knows just how completely insane our policy is and how much amazing human talent we are letting slip away. It's fucking retarded that so many Indian and Chinese and Russian (and others) scientists come to America, get PhDs, and leave because they can't stay. Now, as I said it's already too late, and our small-mindedness and intransigence has already fucked us, because they now have decent tech economies to go home to. If we'd done this 10 years ago we could have been the tech leader of the world for a long time.

3. No limit on H1B visas. Fast track to citizenship for H1B workers. Certainly anyone who works in software knows how stupid this is. Don't let American tech companies hire the best workers in the world, and then even when we do get to hire them, don't let them stay and become assimilated US citizens. Good system guys.

03-01-13 | Finance and Realism

interfluidity is pretty great. It's like if I wrote about economics and actually knew what I was talking about.

Now some ranting.

The financial system of the world is in a very sick state. Since the Great Recession, things have actually gotten worse. Zero reforms have been passed to prevent instability and counter-humanist actions by the big banks. What's worse is that there has been even *more* consolidation, so the reins of power are now in the hands of the very few, and government is almost helpless against them. Furthermore, Europe's troubles have provided a great opportunity for the big banks to redesign the finances of many european countries in their favor.

There *will* be another major collapse similar to the housing crash. I have no idea how long it will take or how it will happen exactly, but with the current economic climate it's inevitable. When government wants nothing but growth and has no stomach for regulation, the inevitable result is bubble and crash.

When there are crises, there are great opportunities for change. How have they been used?

LTCM -> should have been a wake up call about the danger of huge leverage and computer trading.  Nothing done.

Dot com bubble -> chance to break investment banks from financial advisors, make televised stock pumping illegal, etc.  Nothing done.

9/11 -> used to strip Americans of civil liberties and provide justification for the Cheney/Rumsfeld project in Iraq.

Obama election -> great opportunity to restore some transparency, rule of law, and civil rights.  Opportunity not taken.

Great Recession -> chance to restore bank separation and generally improve regulation of the financial system.  Opportunity not taken.

Collapse of Iceland, Ireland, Italy, Spain, etc. -> Goldman is there literally writing the conditions of the bailouts.

Reduced revenue for all levels of US government -> small government schemers use it to tighten the one way ratchet.

The forces of evil are far more clever about using crises. Of course they are, they have huge advantages when it comes to acting decisively in a moment of crisis (lack of morals, political power, unified organization, etc.).

The entire productive world economy is now functioning to subsidize the financial sector (and brand owners & patent holders). We let this happen partly because we are all fools, and partly because they control the government, so the system is designed to make it that way.

It's absurd to think of capitalism as any kind of fair rational playing field; if capitalism was left on its own unfettered, it would very quickly become a world oligopoly, in which a few players controlled everything. (in a game design sense, capitalism is a game which is badly afflicted by "runaways" ; once one player starts winning, they become even more powerful, and soon the other players have no hope other than a massive blunder). The only way to make capitalism work decently is with a robust regulatory structure which crafts the system so that it is okay for workers and consumers. A capitalist economy is a lot like a game system, the way it plays is a direct result of the rules. We are sculptors of our own capitalism environment; it is a human political creation.

Anyway, ranting about it is pointless of course and I gave up on all that long ago, because it won't get better. Money wins. A rational realist only has one option : if the system is going to stay this way you have to work in finance (or try to get some bullshit patent so you can sell your tech company). If you don't choose to work in finance, you are choosing to subsidize people who work in finance, which is a silly thing to do.

Long ago I decided it was dumb to be angry about the way the world works, and it was pointless to try to change it. So when you identify something, the only rational thing to do is to use it for your benefit. But I can't bring myself to actually do that for some reason.

Sometimes I wonder if the whole idea of being "moral" is a trick that was used to brainwash us into being good obedient and controllable pawns. It's largely the churches that created this idea of it being so great to live a quiet life of moral goodness, when at the same time the churches themselves were engaged in immoral, ruthless power grabs. It's like one of the monkeys in the tribe convinces everyone that you shouldn't hit other monkeys when they steal your food, and then he proceeds to steal all your food. Very clever, monkey, very clever.

(Obviously those in power use the "moral" argument in a transparent and disgusting way to suppress opposition; like claiming that government workers who speak out about the evils of government are "traitors" or that questioning the merits of going to war is cowardly or unpatriotic, or that regulating the financial sector in a recession is "irresponsible". Of course the way kids are taught to "respect authority" and such is just to keep them in line. Of course the entire public school system is a way of converting individuals into mindless worker drones. But those specific things are really what I'm talking about in the above paragraph).

03-01-13 | Zopfli

zopfli seems to make small zips. There's no description of the algorithm so I can't comment on it. But hey, if you want small zips it seems to be the current leader.

(update : I've had a little look, and it seems to be pretty straightforward, it's an optimal parser + huff reset searcher. There are various other prior tools to do this (kzip,deflopt,defluff,etc). It's missing some of the things that I've written about before here, such as methods of dealing with the huff-parse feedback; the code looks pretty clean, so if you want a good zip-encoder code it looks like a good place to start.)

I've written these things before, but I will summarize here how to make small zips :

1. Use an exact (windowed) string matcher.

cbloom rants 09-24-12 - LZ String Matcher Decision Tree

2. Optimal parser. Optimal parsing zip is super easy because it has no "repeat match", so you can use plain old backwards scan. You do have the huffman code costs, so you have to consider at least one match candidate for each codeword length.

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 09-04-12 - LZ4 Optimal Parse

3. Deep huffman reset search. You can do this pretty easily by using some heuristics to set candidate points and then building a bottom-up tree. Zopfli seems to use a top-down greedy search. More huffman resets makes decode slower, so a good encoder should expose some kind of space-speed tradeoff parameter (and/or a maximum number of resets).

cbloom rants 06-15-10 - Top down - Bottom up
cbloom rants 10-02-12 - Small note on Adaptive vs Static Modeling

4. Multi-parse. The optimal parser needs to be seeded in some way, with either initial code costs or some kind of heuristic parse. There may be multiple local minima, so the right way to do it is to run 4 seeds (or so) simultaneously with different strategies.

cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies

5. The only unsolved bit : huffman - parse feedback. The only solution I know to this is iteration. You should use some tricks like smoothing and better handling of the zero-frequency symbols, but it's just heuristics and iteration.

One cool thing to have would be a cheap way to compute incremental huffman cost.

That is, say you have some array of symbols. The symbols have a corresponding histogram and huffman code. The full huffman cost is :

fullcost(symbol set) = cost{ transmit code lengths } + sum[n] { codelen[n] * count[n] }
that is, the cost to send the code lengths + the cost of sending all the symbols with those code lengths.

You'd like to be able to do an incremental update of fullcost. That is, if you add one more symbol to the set, what is the delta of fullcost ?

*if* the huffman code lengths don't change, then the delta is just +codelen[symbol].

But, the addition of the symbol might change the code lengths, which causes fullcost to change in several ways.

I'm not sure if there's some clever fast way to do incremental updates; like when adding the symbol pushes you over the threshold to change the huffman tree, it often only changes some small local part of the tree, so you don't have to re-sum your whole histogram, just the changed part. Then you could slide your partition point across an array and find the optimal point quite quickly.

Now some ranting.

How sad is it that we're still using zip?

I've been thinking about writing my own super-zipper for many years, but I always stop myself because WTF is the point? I don't mean for the world, I guess I see that it is useful for some people, but it does nothing for *me*. Hey I could write some thing and probably no one would use it and I wouldn't get any reward from it and it would just be another depressing waste of some great effort like so many other things in my life.

It's weird to me that the best code in the world tends to be the type of code that's given away for free. The little nuggets of pure genius, the code that really has something special in it - that tends to be the free code. I'm thinking of compressors, hashers, data structures, the real gems. Now, I'm all for free code and sharing knowledge and so on, but it's not equal. We (the producers of those gems) are getting fucked on the deal. Apple and the financial service industry are gouging me in every possible immoral way, and I'm giving away the best work of my life for nothing. It's a sucker move, but it's too late. The only sensible play in a realpolitik sense of your own life optimization is to not work in algorithms.

Obviously anyone who claims that patents provide money to inventors is either a liar (Myhrvold etc.) or just has no familiarity with actual engineering. I often think about LZ77 as a case in point. The people who made money off LZ77 patents were PK and Stac, both of whom contributed *absolutely nothing*. Their variants were completely trivial obvious extensions of the original idea. Of course the real inventors (L&Z, and the modern variant is really due to S&S) didn't patent and got nothing. Same thing with GIF and LZW, etc. etc. perhaps v42 goes in there somewhere too; not a single one of the compression-patent money makers was an innovator. (and this is even ignoring the larger anti-patent argument, which is that things like LZ77 would have been obvious to any number of researchers in the field at the time; it's almost always impossible to attribute scientific invention/discovery to an individual)

02-23-13 | Threading APIs that would be ideal

If you were writing an OS from scratch right now, what low level threading primitives should you provide? I contend they are rather different than the norm.

1. A low-level keyed event with double-checked wait.

Futex and NT's keyed event are both pretty great, but the ideal low level wait should be double-checked. I believe it should be something like :

typedef HANDLE Waitset;
typedef HANDLE wait_handle;

Waitset CreateWaitset();
void DestroyWaitset( Waitset ws );

wait_handle Waitset_PrepareWait( Waitset ws , U64 key );

void Waitset_CancelWait( Waitset ws , wait_handle h );
void Waitset_Wait( Waitset ws , wait_handle h );

void Waitset_Signal( Waitset ws , U64 key );

Now, key of course could be a pointer, but there's no reason for it to be particularly. This is easily a superset of futex; if you want you could just have one global Waitset object, and key could be an int pointer, and you could check *ptr in between PrepareWait and Wait, that would give you futex. But you can do much more with this.

I prefer having a "waitset" object to put the waits on (like KeyedEvent), not just making it global/static (like futex). The advantage is 1. efficiency and 2. multiple meanings for a single "key". It's more efficient because you can have different waitsets for different uses, which makes each one cover fewer waits, which makes all the lookups faster. (that is, rather than 100 global waits pending, maybe you have 10 on 10 different waitsets). The other advantage is that you can reuse the same value for key without it confusing the system. You could have one Waitset where key is a pointer, and another where key is an internal handle number, etc.
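To make the semantics concrete, here's a toy user-space model of the proposed API, built on std::mutex and std::condition_variable. This is my own illustration, not a real OS interface; in particular the list-iterator "wait_handle" and the single shared cv are sketch-level simplifications (a real kernel would park each waiter on its own event).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <list>
#include <mutex>

struct Waiter {
    uint64_t key;
    bool signaled = false;
};

struct Waitset {
    std::mutex mx;
    std::condition_variable cv;   // one shared cv ; fine for a sketch
    std::list<Waiter> waiters;

    // register interest in key *before* double-checking your condition
    std::list<Waiter>::iterator PrepareWait(uint64_t key) {
        std::lock_guard<std::mutex> lk(mx);
        return waiters.insert(waiters.end(), Waiter{key});
    }
    // the double-check found the condition true : no need to wait
    void CancelWait(std::list<Waiter>::iterator h) {
        std::lock_guard<std::mutex> lk(mx);
        waiters.erase(h);
    }
    // block until our key is signaled ; the wakeup can't be lost
    // because we registered before the check
    void Wait(std::list<Waiter>::iterator h) {
        std::unique_lock<std::mutex> lk(mx);
        cv.wait(lk, [&]{ return h->signaled; });
        waiters.erase(h);
    }
    // wake one waiter pending on this key, if any
    void Signal(uint64_t key) {
        std::lock_guard<std::mutex> lk(mx);
        for (Waiter &w : waiters)
            if (w.key == key && !w.signaled) { w.signaled = true; break; }
        cv.notify_all();
    }
};
```

Futex-style usage is then : h = ws.PrepareWait((U64)&x); then check x yourself; CancelWait(h) if it's already set, Wait(h) if not.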

2. A proper cond_var with waker-side condition checking.

First of all, a decent cond_var API combines a lot of the disjoint junk in the posix API. It should include the mutex, because that allows for vastly more efficient implementation :

    class condition_var
    {
    public:
        void lock();
        void unlock();

        // the below are always called with lock held :

        void unlock_wait_lock();
        void signal_unlock();
        void broadcast_unlock();
    };


The basic usage of this cv is like :


    cv.lock();
    while( ! condition )
        cv.unlock_wait_lock();

    .. do stuff with condition true ..

    cv.unlock();


A good implementation should do the compound ops (signal_unlock, etc) atomically. But I wouldn't require that because it's too hard.

But that's just background. What you really want is to put the condition check in the API. It should be :

        void wait_lock( [] { wake condition } );

The spec of the API is that "wake condition" is some code that will be run with the mutex locked, and when the function exits you will own the mutex and the condition is true. Then client usage is like :

    cv.wait_lock( [&]{ return condition; } );

    .. do stuff with condition true ..


which allows for much more efficient implementation. The wake condition of the waiter list can be evaluated easily inside signal_unlock(), because that's always called with the mutex held.
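A minimal sketch of what that combined API could look like on top of the standard primitives. This is my own illustration; a real implementation would do the compound ops atomically and keep the waiter list where the signaller can evaluate predicates, rather than leaning on std::condition_variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

class condition_var {
    std::mutex mx;
    std::condition_variable cv;
public:
    void lock()   { mx.lock(); }
    void unlock() { mx.unlock(); }

    // called with lock held : unlock, sleep, re-lock (posix-style building block)
    void unlock_wait_lock() {
        std::unique_lock<std::mutex> lk(mx, std::adopt_lock);
        cv.wait(lk);
        lk.release();   // leave the mutex locked on return
    }
    // called with lock held
    void signal_unlock()    { cv.notify_one(); mx.unlock(); }
    void broadcast_unlock() { cv.notify_all(); mx.unlock(); }

    // the predicate-in-the-API wait : returns with the lock held and pred() true
    template <typename Pred>
    void wait_lock(Pred pred) {
        std::unique_lock<std::mutex> lk(mx);
        cv.wait(lk, pred);
        lk.release();
    }
};
```

Note that here the predicate is still checked wakee-side (std gives us no other option); the point of the API shape is that an implementation which owns the mutex *could* evaluate it in signal_unlock() instead.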

02-23-13 | Threading Patterns - Wake Polling

Something I've written about a lot but never given a solid name to.

When a thread is waiting on some condition, your goal should be to only wake it up if that condition is actually true - that is, the thread really gets to run. In reverse order of badness :

1. Wakeup condition polling. This is the worst and is very common. You're essentially just using the thread wakeup to say "hey your condition *might* be set, check it yourself". The suspect code looks something like :

while( ! condition )
    wait();

These threads can waste a ton of cycles just waking up, checking their condition, then going back to sleep.

One of the common ways to get nasty wake-polling is when you are trying to just wake one thread, but you have to do a broadcast due to the possibility of a missed wakeup (as in the naive semaphore from waitset ).

Of course any usage of cond_var is a wake-poll loop. I really don't like cond_var as an API or a programming pattern. It encourages you to write wakee-side condition checks. Whenever possible, waker-side condition checks are better. (See previous notes on cond vars such as : In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition").

(ADDENDUM : in fact I dislike cond_var so much I wrote a proposal on an alternative cond_var api ).

Now it's worth breaking this into two sub-categories :

1.A. Wake-polling when it is extremely likely that you get to run immediately.

This is super standard and is not that bad. At root, what's happening here is that under normal conditions, the wakeup means the condition is true and you get to run. The loop is only needed to catch the race where someone stole your wakeup.

For example, the way Linux implements semaphore on futex is a classic wake-poll. The core loop is :

        for(;;)
        {
            if ( try_wait() ) break;

            futex_wait( & sem->value, 0 ); // wait if sem value == 0
        }

If there's no contention, you wake from the wait and get to try_wait (dec the count) and proceed. The only time you have to loop is if someone else raced in and dec'ed the count before you. (see also in that same post a discussion of why you actually *want* that race to happen, for performance reasons).

The reason this is okay is because the futex semaphore only has to do a wake 1 when it signals. If it had to do a broadcast, this would be a bad loop. (and note that the reason it can get away with a wake 1 is due to the special nature of the futex wait, which ensures that the single thread signal actually goes to someone who needs it!) (see : cbloom rants 08-01-11 - Double checked wait ).

1.B. Wake-polling when it is unlikely that you get to run.

This is the really bad one.

As I've noted previously ( cbloom rants 07-26-11 - Implementing Event WFMO ) this is a common way for people to implement WFMO. The crap implementation basically looks like this :

while ( any events in array[] not set )
    wait on an unset event in array[]

What this does is : any time one of the events in the set triggers, it wakes up all the waiters on that event; each waiter checks the array, finds that not all events are set, and goes back to sleep.

Obviously this is terrible, it causes bad "thread thrashing" - tons of wakeups and immediate sleeps just to get one thread to eventually run.

2. "Direct Handoff" - minimal wakes. This is the ideal; you only wake a thread when you absolutely know it gets to run.

When only a single thread is waiting on the condition, this is pretty easy, because there's no issue of "stolen wakeups". With multiple threads waiting, this can be tricky.

The only way to really robustly ensure that you have direct handoff is by making the wakeup ensure the condition.

At the low level, you want threading primitives that don't give you unnecessary wakeups. eg. we don't like the pthreads cond_var that has you call :

    cv.wait();
    mutex.lock();

as two separate calls, which means you can wake from the condvar and immediately fail to get the mutex and go back to sleep. Prefer a single call :

    cv.wait_lock( mutex );

which only wakes you when you get a cv signal *and* can acquire the mutex.

At the high level, the main thing you should be doing is *waker* side checks.

eg. to do a good WFMO you should be checking for all-events-set on the *waker* side. To do this you must create a proxy event for the set when you enter the wait, register all the events on the proxy, and then you only signal the proxy when they are all set. When one of them is set, it does the checking. That is, the checking is moved to the signaller. The advantage is that the signalling thread is already running.
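A sketch of that waker-side idea, with all locking stripped out for clarity. Proxy, Event, and WaitAllBegin are illustrative names, not a real API; the point is that the countdown check runs on the signalling thread, so the proxy only ever fires when the whole set is satisfied.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Proxy {
    int remaining = 0;
    bool fired = false;   // a real version would wake the waiting thread here
};

struct Event {
    bool set = false;
    std::vector<std::shared_ptr<Proxy>> proxies;

    void Set() {
        if (set) return;
        set = true;
        // the wait condition is evaluated by the *signaller* :
        for (auto &p : proxies)
            if (--p->remaining == 0)
                p->fired = true;   // direct handoff ; waiter never re-checks
        proxies.clear();
    }
};

// on entering the wait, make one proxy and register it on every unset event
std::shared_ptr<Proxy> WaitAllBegin(std::vector<Event*> const &events) {
    auto p = std::make_shared<Proxy>();
    for (Event *e : events)
        if (!e->set) { p->remaining++; e->proxies.push_back(p); }
    if (p->remaining == 0) p->fired = true;   // everything was already set
    return p;
}
```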

02-23-13 | Threading - Reasoning Behind Coroutine Centric Design

cbloom rants 12-21-12 - Coroutine-centric Architecture is a proposed architecture.

Why do I think it should be that way? Let's revisit some points.

1. Main thread should be a worker and all workers should be symmetric. That is, there's only one type of thread - worker threads, and all functions are work items. There are no special-purpose threads.

The purpose of this is to minimize thread switches, and to make waits be immediate runs whenever possible.

Consider the alternative. Say you have a classic "main" thread and a worker thread. Your main thread is running along and then decides it has to Wait() on a work item. It has to sleep the thread pending a notification from the worker thread. The OS has to switch to the worker, run the job, notify, then switch you back.

With fully symmetric threads, there is no actual thread wait there. If the work item is not started, or is in a yield point of a coroutine, you simply pop it and run it immediately. (of course your main() also has to be a coroutine, so that it can be yielded out at that spot to run the work item). Symmetric threads = less thread switching.
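The "wait means run it yourself" rule can be sketched like this (single-threaded and lock-free for clarity; the names are illustrative, and a real system would handle the already-started case with a coroutine yield) :

```cpp
#include <cassert>
#include <deque>
#include <functional>

struct Job {
    std::function<void()> work;
    bool started = false, done = false;
};

static std::deque<Job*> g_queue;   // shared work queue (locks omitted)

Job* StartJob(std::function<void()> fn) {
    Job *j = new Job{std::move(fn)};
    g_queue.push_back(j);
    return j;   // ownership would be GC'ed/refcounted in the real system
}

void RunJob(Job *j) {
    j->started = true;
    j->work();
    j->done = true;
}

void WaitJob(Job *j) {
    if (!j->done && !j->started) {
        // nobody picked it up yet : pop it and run it right here,
        // so the "wait" costs zero thread switches
        for (auto it = g_queue.begin(); it != g_queue.end(); ++it)
            if (*it == j) { g_queue.erase(it); break; }
        RunJob(j);
    }
    // if it's already running on another worker, a real system
    // would yield this coroutine instead of blocking the thread
}
```

Note this is exactly the "scales down to 1 thread" path : with a single thread, every WaitJob just runs the job inline.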

There are other advantages. One is that you're less affected by the OS starving one of your threads. When your threads are not symmetric, if one is starved (and is the bottleneck) it can ruin your throughput; one crucial job or IO can't run and then all the other threads back up. With symmetric threads, someone else grabs that job and off you go.

Symmetric threads are self-balancing. Any time you decide "we have 2 threads for graphics and 1 for compute" you are assuming you know your load exactly, and you can only be wrong. Symmetric threads maximize the utilization of the cpu. (Note that for cache coherence you might want to have a system that *prefers* to keep the same type of job on the same thread, but that's only a soft preference and it will run other jobs if there are none of the right type).

Symmetric threads scale cleanly down to 1. This is a big one that I think is important. Even just for debugging purposes, you want to be able to run your system non-threaded. With asymmetric threads you have to have a custom "non-threaded" pathway, which leads to bugs and means you aren't testing the same threaded pathway. The symmetric thread system scales down to 1 thread using the same code as always - when you wait on a job, if it hasn't been started it's just run immediately.

It's also much easier to have deadlocks in asymmetric thread systems. If an IO job waits on a graphics job, and a graphics job waits on an IO job, you're in a tricky situation; of course you shouldn't deadlock as long as there are no circular dependencies, but if one of those threads is processing in FIFO order you can get in trouble. It's just better to have a system where that issue doesn't even arise.

2. Deep yield.

Obviously if you want to write real software, you can't be returning out to the root level of the coroutine every time you want to yield.

In the full coroutine-centric architecture, all the OS waits (mutex locks, etc) should be coroutine yields. The only way to do that is if they can call yield() internally and it's a full stack-saving deep yield.

Of course you should be able to spawn more coroutines from inside your coroutine, and wait on them (with that wait being a yield). That is, aside from the outer branch-merge, each internal operation should be able to do its own branch-merge, and yield its thread to its sub-items.

3. Everything GC.

This is just the only reasonable way to write code in this system. It gives you a race-free way to ensure that object lifetimes exceed their usage.

The simple string crash from the last post is just so easy to write. The problem is that without GC you inevitably try to be "clever" and "efficient" (really "dumb" and "pointless") about your lifetime management. That is, you'll write things like :

void func1()
{
    char name[256];
    .. file name ..

    Handle h = StartJob( LoadAndDecompress, name );

    Wait( h );
}

which is okay, because it waits on the async op inside the lifetime of "name". But of course a week later you change this function to :

Handle func1()
{
    char name[256];
    .. file name ..

    Handle h = StartJob( LoadAndDecompress, name );

    return h;
}

with the wait done externally, and now it's a crash. Manual lifetime management in heavily-threaded code is just not reasonable.
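A sketch of the kind of ownership that fixes it : the job holds a shared_ptr copy of its arguments, so the caller's stack frame no longer matters. (Names here are illustrative; g_loaded just stands in for whatever the decompressor would consume.)

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

struct Job2 { std::function<void()> work; };

std::string g_loaded;   // stands in for the decompressor's input

// the job owns a copy of the filename via shared_ptr, so it stays valid
// no matter how long the job is deferred or who waits on it
std::shared_ptr<Job2> StartLoadJob(const char *filename) {
    auto name = std::make_shared<std::string>(filename);
    auto job  = std::make_shared<Job2>();
    job->work = [name]{ g_loaded = *name; };   // closure keeps name alive
    return job;
}
```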

The other compelling reason is that you want to be able to have "dangling" coroutines, that is you don't want to have to wait on them and clean them up on the outside, just fire them off and they clean up after themselves when they finish. This requires some kind of ref-counted or GC'ed ownership of all objects.

4. A thread per core.

With all your "threading" as coroutines and all your waits as "yields", you no longer need threads to share the cpu time, so you just make one thread per core and leave it there.

I wanted to note an exception to this - OS signals that cannot be converted to yields, such as IO. In this case you still need to do a true OS Wait that would block a thread. This would stop your entire worker from running, so that's not nice.

The solution is to have a separate pool of threads for running the small set of OS functions that do internal thread waits. That is, you convert :

    ReadFile( args )

into :

    yield RunOnThreadPool( ReadFile, args );

This separate pool of threads is in addition to the one per core (or it could just be all one pool, and you make new threads as needed to ensure that #cores of them are running).
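That conversion could be sketched with std::async standing in for the side pool (illustrative only; a real system would use a dedicated pool, and the wait on the future would be a coroutine yield rather than a block) :

```cpp
#include <cassert>
#include <future>
#include <string>

// stands in for ReadFile / pread : blocks its thread while the IO runs
std::string SlowBlockingRead(const std::string &path) {
    return "contents of " + path;
}

// run the blocking call on a side thread ; the worker waits on the future
// instead of blocking itself inside the OS call
std::future<std::string> ReadFileAsync(const std::string &path) {
    return std::async(std::launch::async, SlowBlockingRead, path);
}
```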

02-18-13 | Don't write spaghetti

It never works out. I usually even warn myself about it (by writing comments to myself), but it still catches me out. I also usually give myself a todo item like "hmm this smells funny revisit it", but of course the todos just pile up in a never-ending heap, and little old ones like that get buried.

void DoLZDecompress(const char *filename,...)
{
    struct CommandInfo i;
    i.data = (void *)filename;
    // warning : passing string pointer (not copying) to another thread, make sure it's const / sticks around!
    StartJob( &i );
}

Yup, that's a crash.

void OodleIOQ_SetKickImmediate( bool kick );
/* kick state is global ; hmm should really be per-thread ; makes it a race */

Yup, that's a problem, which leads to the later deadlock :

void Oodle_Wait( Handle h )
    // @@ ? can this handle depend on un-kicked items, and hence never complete ?
    //  I used to check for that with normal deps but it's hard now with the "permits"

Coding crime doesn't pay. Spaghetti always gets you in the end, with its buggy staining sauce.

Whenever I have one of those "hmm this smells funny, I'm worried about the robustness of this" , yep, it's a problem.

One of my mortal enemies are the "don't worry about it, it'll be fine" people. No it will fucking not be fine. You know what will happen? It'll be a nasty god damn race bug, which I will wind up fixing while the "don't worry about it, it'll be fine" guy is watching lolcatz or browsing facebook.

02-18-13 | The Myth of Future Value

I believe that most people (aka me) grossly overvalue future rewards when weighing the merits of various options.

I've been thinking about this a lot over the last few days, and have come to it simultaneously from several different angles.

For the past month or so I've been going over my finances, reviewing my spending, because I'm not happy with the amount I'm saving, and I'm trying to figure out where the money is leaking. Obviously there are big expenses like cars and vacations, but those I've budgeted for, they're not the leak (*) (**), but there's still a large general money leakage that I want to track down. It turns out a lot of it is buying stuff for the house or for productivity, stuff that on its own I can justify, but overall adds up to a big waste.

A lot of that waste is things that I tell myself will "pay off someday". Like I need some rope for around the house; hey look it's a much better deal if I buy it in a 500 foot spool. I'll use it eventually so that's the better buy. Or, I need to set a bolt in concrete; sure a hammer drill is expensive, but I'll use it the rest of my life, so it will be a good value some day (better than renting one for this one job). etc. Lots of stuff where the idea is that in the long term it will be a good value.

Now I certainly haven't hit the "long term" yet, but I can already see the flaw in that logic. There are lots of reasons why that imagined long term value might never come. I might never wind up using the stuff. It might get damaged over time from sitting, or flood or who knows what. I also essentially pay a tax to store it, having stuff is not free. I pay a tax on it any time I move. Maybe I won't want to do DIY in the future and will just hire out the jobs and so won't use it. There are a lot of costs and uncertainty about this future value which make it much less valuable than it naively appears.

Perhaps computer stuff is an even easier example; like I sort of need a USB hub; I could live without it and just unplug stuff to make room depending on what device I want to use at the moment. You could easily convince yourself that the value is high because "even if I don't really need it now I'll use it someday". But of course there's any number of reasons why you might not use it some day.

(* = aside : expensive cars actually aren't that expensive; if you're careful about how you buy and sell, and look for cars that are on a pretty flat part of the depreciation curve, you can get a "$100k car" that actually only costs $5k a year. That's not really a big expense in the scheme of things. However that also doesn't mean it's free; the big cost is the time spent buying and selling; if you actually want it to be low cost you have to spend a lot of time on the transaction to get good value, and for people like me that's excruciatingly painful; for people who like wheeling-and-dealing, they can do pretty well, getting almost free stuff that they are just holding temporarily between sales)

(** = more aside, and actually there is a spending leak that I have that's associated to cars and vacations; I, like most, and perhaps less than average, fall prey to the sunk cost fallacy. The sunk cost fallacy is the idea that because you've spent a bunch on something, you should stick with it and spend some more. Like I've spent this much to go on vacation, I shouldn't cheap out on the dining or whatever. Or I've got an expensive car, I should buy the expensive tires. But that of course is not true. Each decision should be evaluated independently for its value; the fact that you have a large sunk cost only matters in that it changes your current situation. You don't keep chasing your flush just because you've already called some big bets (though obviously your past calls do affect the pot size which affects your current decision)).

Of course home improvements are a classic of false future value. I'm not foolish enough to think I'll get any resale value benefit, but I do fall prey to thinking "I'll enjoy this for many years" when in fact I might not.

I was thinking about buying a really good mattress that's supposed to last 30 years vs one that will only last 5. In theory the long-life one is a much better deal, but there are any number of reasons why that might not be the case. It might not last like it's supposed to; you might pee and poo on it; you might want a different size mattress. By making an "investment" what you've done is commit yourself to something, you've removed flexibility, which is a cost.

Of course if you ever decide you want to travel the world and live in apartments again, all the buying of stuff is a big liability.

Getting away from just "accumulating stuff" now :

I've been thinking lately about my career arc. All through my young-adulthood I was carefully building up my value as a software development employee. I was improving my skills, improving my profile, networking, all that stuff, going up the career ladder. During that time I was not getting paid particularly well. I took jobs based on them being good opportunities for my larger career, not for their immediate financial reward.

The problem is that the big payoff never came. When Oddworld went under I was at the point where I could have moved on to CTO-level jobs at major studios, but I decided I didn't want to do that. The stress was ruining my life (and various other things that I've blogged about back then). The point is that this "future value" I had been building suddenly became zero. If you actually want to redeem that future value, you are locking yourself in to a path, which is a major cost you are paying, giving up flexibility in your life. And in careers there are so many factors outside your control; perhaps your specialty will become less prominent in a few years. Lots of people have done things like getting a JD only to find the law job market has dried up by the time they graduate.

Saving money in general is questionable now. The governments of the world have demonstrated that they don't care about the integrity of the world financial systems, so socking money away for the future has immense risk associated with it. (I don't put much credence in the complete currency collapse alarmists, but I do believe that a long period of negative inflation-adjusted returns is very likely). In the old days we glorified the good salaryman, who worked hard and saved some money, putting the joy of today aside to build a life for themselves and their family tomorrow.

Of course we can relate this all to poker, in old skool cbloom-rants style.

One of the first big realizations I had on my own as I was getting better and moving beyond book TAG play is that implied odds are massively overvalued by most players. "implied odds" is the term used for the imagined future value that you will get if you hit a big hand. Like if I call with a flush draw, it might be a bad value based on the immediate odds, but if I hit I'll make some more money, which makes the call worth it. The problem is that there are a wide variety of reasons why you might not get paid off even if your flush comes (scare cards, or your opponent never had a strong calling hand to begin with). Or your flush might come and he might have a better flush (negative implied odds). If you realistically weight all those undesirable outcomes, the result is that the true effect of implied odds is very small. eg. on the turn you have a 16% chance to improve, you can call a bet for zero EV if the bet is about 20% of pot size. The action of implied odds is very small; you can only call a bet that's maybe 25% of pot size; really not much more. Certainly not the 30-35% that people talk themselves into believing is correct. (and of course in no limit holdem you have to adjust for position; out of position you should consider your implied odds to be zero, chasing a draw out of position is so very bad). What I'm getting at is that the imagined future value of your current investment is far lower than you imagine.
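A quick arithmetic check of the turn example, assuming "pot size" means the total pot including the bet being faced (conventions vary; that's the reading that makes the numbers line up) :

```cpp
#include <cassert>

// bet b into prior pot p ; you call b to win (p + b) with equity q ;
// zero EV when  q*(p + b) = (1 - q)*b   =>   b = q*p / (1 - 2q)
double breakeven_bet_fraction(double q) {
    double p = 1.0;
    double b = q * p / (1.0 - 2.0 * q);
    return b / (p + b);   // as a fraction of the total pot including the bet
}
```

With q = 1/6 (about 16%, a flush draw with one card to come) this gives exactly 0.20, i.e. the "about 20% of pot size" above, with zero implied odds assumed.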

(sort of not related but "implied odds" is also a good example of the "rationalization trap". Whenever a complicated logical exercise justifies behavior that your naughty irresponsible side secretly wants to do (like chasing draws), you should be very skeptical of that logic. Whenever you read that "a little red wine is healthy" you should be very skeptical. Whenever the result of a "study" is exactly what people want to hear, beware).

This isn't really related to the "future value" mistake, but I've been mulling another spending fallacy, which is the "value of an hour" fallacy. Sometimes I'll do something like buy a tool or hire a helper because it only costs $50 and saves me an hour of work; my hour is worth more than $50, so that's a good deal, right? I'm not so sure. I feel like that line of reasoning is just a way of rationalizing more spending, but I haven't quite found the flaw in it yet.

02-17-13 | Rambles on Mattresses and Retail

I'm finally getting around to trying to buy a new mattress, after the last new mattress that I bought turned out to be a dud (don't buy an S brand).

One of the better stores around here is "Bedrooms and More", which sounds just like a national chain sleaze-o-rama mattress trash peddler, but is actually not. The owner does some funny rants online and he suggests that the real shittening of the S-brands is due to private equity. Interesting idea; certainly there's no doubt that the S-brands have gone to total shit.

Of course we should be mad at the corporate overlords for sending product quality to shit, and generally using dishonest schemes to maximize short term profit. But I'm also angry at consumers for letting it happen. The only way to direct good behavior is to punish people who behave badly. And that just doesn't happen, neither in social life, nor capitalist life. People are amazingly (foolishly?) forgiving. Your only weapon as a consumer is your money.

Hanging out on Porsche forums a few years ago (zomg what was I thinking), I kept having my mind blown by how short-memoried everyone was. Even people who were pretty realistic about what fuckers Porsche had been in the past were all ready to buy the new model. (background : back in the early 90's, Porsche almost went bankrupt; they were completely restructured to focus more on marketing and profit, and less on quality. They intentionally drove the quality of their products down to the absolute minimum (actually, below minimum). This was the era of the Boxster and then the 996, and the early cars that were made were complete junk, some of the worst made cars for any money (worse than a Tata or god knows, it's hard to even think of an example of a horribly made car any more), the engine blocks were porous, the cylinders were out of round, there was cheap plastic inside the engine, and of course terrible cheap plastic everywhere in the interior, it was just a total clusterfuck). The rational consumer response should have been : whoah you guys are lying fuckers, we are never going to buy anything from you again. Instead most of the people were just incredibly forgiving and short-memoried, like yeah that was bad, but they'll fix it in the next model!

Wouldn't it be nice if products that cost more were actually better? Then you could just look at the range of products, pick your price-quality tradeoff point, and buy one. It would still be a tough decision, you'd have to weight how much you want to spend on this thing, but you would at least know that spending more got you something better. In the real world, that's not remotely the case. It's so nice when you go shopping in a video game RPG and you can just buy the more expensive sword and know it's better (and it's so fucking retarded when video games designers throw in wild-cards of expensive items that suck or really cheap items that are great, you dumb assholes, you don't get it, the game world should not make me do all the stressful shit I have to do in the real world).

I've always wanted a grocery store that actually selected its products for a good cost/quality tradeoff. That is, a good store should only sell things on the Pareto Frontier. Why the fuck do you have 50 different olive oils? I have no fucking clue what all these olive oils are, don't offer them to me. You (the retailer) should be an expert in this product (and also act as an agglomerator of customer feedback). There should only be like 4 olive oils to choose from, at various cost/quality tradeoffs (and also some for finishing vs. cooking oils, but let's pretend right now that there's only one axis of "quality" for olive oil), so I can just choose how much I want to spend and get the best oil at that price.

I had a funny self-realization moment at Soaring Heart when the salesman was saying how everything was made locally and they pay health care and benefits for their workers, and I instinctively/subconsciously thought "yeuck, that means bad value". Apparently my subconscious wants to buy products made in sweatshops. More generally I've got a major bias against ever giving money to someone who seems to be living well. When I see a realtor in a fancy car or a contractor who gets a swim and massage daily, fuck you I'm not giving you money, I want my pay to you to be barely enough to support human life, you should be in miserable subsistence conditions, not living it up! I guess I'm also biased against anything made in America; my mental image of Seattle mattress builders is not great skilled workers (like New Yankee Workshop), but something like failed philosophy PhDs who smoke weed while they work and don't know WTF they're doing (like Workaholics).

02-16-13 | The Reflection Visitor Pattern

Recent blog post by Maciej ( Practical C++ RTTI for games ) set me to thinking about the old "Reflect" visitor pattern.

"Reflect" is in my opinion clearly the best way to do member-enumeration in C++. And yet almost nobody uses it. A quick reminder : the reflection visitor pattern is that every class provides a member function named Reflect which takes a templated functor visitor and applies that visitor to all its members; something like :

class Thingy
{
    type1 m_x;
    type2 m_y;

    template <typename functor>
    void Reflect( functor visit )
    {
        // (for all members)
        visit( m_x );
        visit( m_y );
    }
};
with Reflect you can efficiently generate text IO, binary IO, tweak variable GUIs, etc.

(actually instead of directly calling "visit" you probably want to use a macro like #define VISIT(x) visit(x,#x))

A typical visitor is something like a "ReadFromText" functor. You specialize ReadFromText for the basic types (int, float), and for any type that doesn't have a specialization, you assume it's a class and call Reflect on it. That is, the fallback specialization for every visitor should be :

struct ReadFromText
{
    template <typename visiting>
    void operator () ( visiting & v )
    {
        v.Reflect( *this );
    }
};
The standard alternative is to use some macros to mark up your variables and create a walkable set of extra data on the side. That is much worse in many ways, I contend. You have to maintain a whole type ID system, you have to have virtuals for each type of class IO (note that the Reflect pattern uses no virtuals). The Reflect method lets you use the compiler to create specializations, and get decent error messages when you try to use new visitors or new types that don't have the correct handlers.

Perhaps the best thing about the Reflect system is that it's code, not data. That means you can add arbitrary special case code directly where it's needed, rather than trying to make the macro-cvar system handle everything.
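Putting the pieces together, here's a compilable miniature of the whole pattern (the types and the counting visitor are my own illustration) : the visitor has overloads for the basic types, plus the fallback that assumes anything else is a class and recurses into its Reflect.

```cpp
#include <cassert>

struct Vec2 {
    float x = 1, y = 2;
    template <typename Functor>
    void Reflect(Functor &visit) { visit(x); visit(y); }
};

struct Thingy {
    int  m_x = 7;
    Vec2 m_y;
    template <typename Functor>
    void Reflect(Functor &visit) { visit(m_x); visit(m_y); }
};

// a visitor : overloads for basic types, fallback recurses into Reflect
struct CountMembers {
    int count = 0;
    void operator()(int &)   { count++; }
    void operator()(float &) { count++; }
    template <typename Visiting>
    void operator()(Visiting &v) { v.Reflect(*this); }
};
```

Note there's not a virtual in sight : overload resolution picks the int/float handlers, and any unhandled type either recurses or gives a compile error right at the point of use.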

Of course you can go farther and auto-generate your Reflect function, but in my experience manual maintenance is really not a bad problem. See previous notes :

cbloom rants 04-11-07 - 1 - Reflection
cbloom rants 03-13-09 - Automatic Prefs
cbloom rants 05-05-09 - AutoReflect

Now, despite being pro-Reflect I thought I would look at some of the drawbacks.

1. Everything in headers. This is the standard C++ problem. If you truly want to be able to Reflect any class with any visitor, everything has to be in headers. That's annoying enough that in practice in a large code base you probably want to restrict to just a few types of visitor (perhaps just BinIO,TextIO), and provide non-template accessors for those.

This is a transformation that the compiler could do for you if C++ was actually well designed and friendly to programmers (grumble grumble). That is, we have something like

    template <typename functor>
    void Reflect( functor visit );

but we don't want to eat all that pain, so we tell the compiler which types can actually ever visit us :

    void Reflect( TextIO & visit );
    void Reflect( BinIO & visit );

and then you can put all the details in the body. Since C++ won't do it for you, you have to do this by hand, and it's annoying boiler-plate, but could be made easier with macros or autogen.

2. No virtual templates in C++. To call the derived-class implementation of Reflect you need to get down there in some ugly way. If you are specializing to just a few possible visitors (as above), then you can just make those virtual functions and it's no problem. Otherwise you need a derived-class dispatcher (see cblib and previous discussions).

3. Versioning. First of all, versioning in this system is not really any better or worse than versioning in any other system. I've always found automatic versioning systems to be more trouble than they're worth. The fundamental issue is that you want to be able to incrementally do the version transition (you should still be able to load old versions during development), so the objects need to know how to load old versions and convert them to new versions. So you wind up having to write custom code to adapt the old variables to the new, stuff like :

if ( version == 1 )
{
    // used to have member m_angle ; load it and convert
    double m_angle = 0;
    visit(m_angle);
    m_angleCos = cos(m_angle);
}

now, you can of course do this without explicit version numbers, which is my preference for simple changes. eg. when I have some text prefs and decide I want to remove some values and add new ones, you can just leave code in to handle both ways for a while :


#ifndef FINAL
if ( visitor.IsRead() )
{
    double m_angle = 0;
    visitor(m_angle);   // a NOP if m_angle isn't in the stream
    m_angleCos = cos(m_angle);
}
#endif

where I'm using the assumption that my IO visitor is a NOP on variables that aren't in the stream. (eg. when loading an old stream, m_angleCos won't be found and so the value from m_angle will be loaded, and when loading a new stream the initial filling from m_angle will be replaced by the later load from m_angleCos).

Anyway, the need for conversions like this has always put me off automatic versioning. But that also means that you can't use the auto-gen'ed reflection. I suspect that in large real-world code, you would wind up doing lots of little special case hacks which would prevent use of the simple auto-gen'ed reflection.

4. Note that macro-markup and Reflect() both could provide extra data, such as min & max value ranges, version numbers, etc. So that's not a reason to prefer one or the other.

5. Reflect() can be abused to call the visitor on values that are on the stack or otherwise not actually data members. Mostly that's a big advantage, it lets you do conversions, and also serialize in a more human-friendly format (for text or tweakers) (eg. you might store a quaternion, but expose it to tweak/text prefs as euler angles) (*).

But, in theory with a macro-markup cvar method, you could use that for layout info of your objects, which would allow you to do more efficient binary IO (eg. by identifying big blocks of data that can be read in binary without any conversions).

(* = whenever you expose a converted version, you should also store the original form in binary so that write-then-read is a guaranteed nop ; this is of course true even for just floating point numbers that aren't printed to all their places, which is something I've talked about before).

I think this potential advantage of the cvar method is not a real advantage. Doing super-efficient binary IO should be as close to this :

void * data = Load( one file );
GameWorld * world = (GameWorld *) data;

as possible. That's going to require a whole different pathway for IO that's separate from the cvar/reflect pathway, so there's no need to consider that as part of the pro/con.

6. The End. I've never used the Reflect() pattern in the real world of a large production codebase, so I don't know how it would really fare. I'd like to try it.

02-05-13 | Some Media Reviews

"The Rings of Saturn" (WG Sebald) is the most incredible book I've read in a long time. It's like one big rambling stream of consciousness aside; there's no story, it's not really about anything in particular, there's no paragraphs - all things that I normally hate - and yet it's totally compelling. It's a real "page turner", an easy read, I love his writing and I just wanted to consume more and more of it. Touched me deeply, amazing book.

(followed up by reading "The Emigrants" which is good, but much more of a normal book, it's terrestrial, not so oddly magical and other-worldly as "The Rings of Saturn").

I've just discovered "The Sky at Night" (ex of Sir Patrick Moore) on BBC. What a marvelous show. I don't really even care much about astronomy, and yet here is a show with real scientists, talking to each other about things they actually understand, and talking at a very high level and not really dumbing it down much for the audience. I've never seen anything like it on television before, actual intelligent people talking to each other, it's amazing. I love Patrick's interviewing style, the way he just blurts things out; he reminds me so much of the real scientists I've known in my life who are super direct and straight to the facts without dancing around the point. It's best to watch early episodes before he got too old/ill.

They did a demo of the Higgs spontaneous symmetry breaking on The Sky at Night which is the best I've seen. Take a wine bottle (with a good hump at the bottom; those familiar will recognize that the hump is the key to symmetry breaking) and put a piece of cork or a ball or something inside. Now shake the bottle vigorously. At high energy like that (bottle shaking), the location of the cork is random, so the whole assemblage still has rotational symmetry. Now stop shaking (low energy) and the cork will settle somewhere - not in the middle of the wine bottle where the hump is, but off to one side in the trough. Symmetry broken.

"The Loneliest Planet" is a really terrible movie and I don't recommend it (jesus christ the scenes where they sit around the camp fire and say the same words over and over are excruciating torture), but it has these few scenes that are some of the most beautiful I've ever seen in a movie - the scenes with the wide static shots that the characters slowly walk across, they're staggering, breath-taking.

"Hello Lonesome" was good. The director obviously knows loneliness; it reminded me a lot of various times in my life; the weirdness of being alone for a long time, the sad joy - you do whatever you want, but it all kind of sucks. The long idle times, so much free time, rambling around your property, hitting fruit with a baseball bat (me, not the movie).

also watchable : Summer Hours (quiet, nothing ever happens, and yet very adult interactions, somehow compelling), Bernie (irresistible, charming), Bonsai (simple little movie that reminds me of life at that age; tasteful), Youth in Revolt (much better teeny rom-com than the more well known teeny rom-coms), Breaking and Entering (some great characters; in the end it's a movie about sad and lonely people)

GAYLE is crazy funny. Weird as hell, wtf is going on, and yet it's the most biting mockery of normal suburban life.

"Utopia" is pretty retarded; the plot is standard unrealistic conspiracy crap, straight out of that awful graphic-novel type of writing; there's no part of it that's clever or insightful or well thought out (so far), and the characters are pretty awful to boot. I don't really care for the torture-porn stuff either (just skip it). All that said, the look of it is just super beautiful, amazing art direction, subtle and realistic but strikingly odd; every shot has these dramatic geometric forms and colors in it. And there's an eerie stillness to it, lots of long holds. It's so good to look at, and good sounds too.

(a lot of recent British stuff is just staggeringly good looking. See also "The Tower", "Red Riding", "Wallander".)

The new season of Top Gear is finally here, and good god is it painfully bad. I guess I should face the fact that it's been awful for many years now, but I was clinging on, hoping it would perk back up (as it occasionally has done, eg. the Bolivia Special). You develop this almost pavlovian response to things that have given you pleasure in the past; the sound of a beer bottle top popping off, the smell of coffee, that Top Gear opening theme song, it starts the pleasure molecules bursting in my brain, even if I don't actually want a beer, and I know that Top Gear is going to suck, there's still this vestigial fondness I have for it. The best part of the series so far has been the meta-comedy moment of James May falling asleep on the show because he too was so bored of it.

ADDENDUM : "Beasts of the Southern Wild" is amazing! joyous, sad, hard to watch, thrilling, it's a rich emotional feast, but it's also an incredible work of art. There's obvious allegory, but its characters aren't unrealistic victims without faults. More than anything I think it's a modern fairy tale (of the old style); not "fairy tale" like in the shit Disney sense meaning "princess, happy ending, dreams come true" but in the original sense, like Grimm's and all the old stories that were terrifying ; they were fantasies, but grounded in real world horrors, and often were obvious warning messages. Real fairy tales are magical and beautiful but also scary and sad.

01-31-13 | Ugh ugh I hate the web

So Blogger randomly changed a bunch of shit a while ago.

One of the consequences of this is that the layout of "cbloom rants" can no longer be achieved or maintained with the new blogger layout, which means I can't edit it without losing it completely. (the existing layout does seem to keep working as long as I don't touch it, because they keep the raw HTML of the layout).

Another nasty one I just discovered is that a key setting that I rely upon is no longer there. Under "Settings->Formatting" there used to be a setting for "Convert Line Breaks" which defaults to Yes and causes any LF to be turned into an HTML BR code. I set that to "No" for cbloom rants so that it doesn't crud up my html when I send it over the Blogger API. (god dammit just let me put up HTML and stop fucking with it).

The odd thing is that the "No" setting (of "Convert Line Breaks") for cbloom rants appears to have stuck even though that setting has disappeared. That's fine with me I guess, though I wouldn't be surprised if it just stops working at some point when they revise the service again. The problem is I'm trying to set up a new blog and I can't get that setting any more.

(I of course have a workaround, which is removing LF's before I upload posts. The workaround sucks a little because I like to be able to download my posts back down and have them match the way I wrote them, which of course was with LF's in it for my readability during composition. The point is not the specific issue, it's god damn it don't push updates on me ever never ever unless I ask for them.)

Software updates are incredibly harmful. The benefit from changing *anything* has to be really massive for it to be a win. I'm so sick of getting new versions of crap pushed on me. At least with non-web software you can try to hold onto old versions as long as possible so that you can keep your valuable knowledge and its connections to your automation suite.

01-28-13 | Importing Eudora MBX's to Gmail

I'd like to import all my old Eudora mail to gmail, to get it all together in one place, and for searchability.

(my current mail solution is to use Eudora POP on my local machine, but forward all my mail through gmail for spam filtering and archiving and searchability; it's working pretty well finally).

Gmail does not offer any "import from local disk" options. Sigh. There appear to be a few ways to do this :

1. Change my gmail temporarily to IMAP. Get all my Eudora MBX's into an IMAP client (something like Outlook; perhaps requiring an MBX to PST conversion step or something). Open the IMAP client and connect to gmail; drag the mail from the Eudora boxes to the gmail boxes.

Should work in theory, but a bit scary, and extremely slow (moving mail on IMAP is ungodly slow).

(Also, when I switch back to POP, is it going to redeliver me all that mail that I just uploaded? That would double-suck.)

2. Make a POP server somewhere. Convert the mbx's to mbox's to maildirs and dump them on the POP server for it to serve up. Tell gmail to grab mail from that POP server.

One issue is where I could get a POP server with a public IP and admin access. The other is that any time I try to do networking stuff it's a massive fail of mysterious problems and no error messages.

3. Get a Google Apps gmail account (different from regular gmail account for unknowable reasons). Import MBX's to Outlook. Use "Google Apps Migration for Microsoft Outlook" to import mail to Apps mail account. Use gmail fetcher to bring mail from apps-gmail to my normal gmail.

(similar alternative : get apps gmail. Convert mbx to mbox. Find a Mac. Use "Google Email Uploader for Mac" to upload the mbox. Transfer mail from apps-mail to normal mail).

(I could also use gmail API to write my own importer, but that also requires an Apps gmail, so may as well just use the existing importers in method 3)

It's all such a hassle that I'm once again tempted to just write my own damn email client. Sigh I wish I'd done that long ago, but it's always the local optimization to not do it. I'm so fucking sick of getting penis emails. Hello spam filterers, *penis* -> spam. You're welcome. And of course if I used my own email client, my private property (words) wouldn't be data-mined to serve me ads (you bastards).

(oddly gmail does remarkably well at spam detection on the cases that would be hard for me to do with simple filters; things like bank phishing mails that are designed to look exactly like legitimate mails from my bank; I don't think I could give that up, so I'd still be stuck with routing mail through gmail even if I had my own client).

01-27-13 | Kauai

We took a vacation from the Big Island vacation for a few days to go to Kauai. I'd never been to another Hawaiian island so it was interesting to compare. It also gave us a chance to stay right on the beach, which we're not doing this trip, which was nice for a little while, getting that salt air in the bedroom and sitting on the water late at night. Anyhoo, thoughts on Kauai :

1. Yes it's beautiful. It looks more like Vietnam or Thailand and their limestone karst stuff, all old and weathered and crumbly with these random protrusions and such. (it's cool how you can travel the Hawaiian islands from south to north and visually see geographic time passing at a rate of 100,000 years per island hop). It actually wasn't as lush as I expected given all the hype about how wet it was and the incredible lushness. It's no more jungley beautiful than the Hamakua Coast near Hilo is really. My favorite parts of Kauai were the northern coast, and also just south of Lihue around the Hulemalu Road area (which would be a pretty sweet bike ride; good pavement, no cars).

2. There're sweet beaches all over the island. Like you almost don't have to seek them out there's another one around every corner, and most with very few people on them. None of them looked really perfect the way Mauna Kea and Hapuna are just ridiculously perfect in every way (clear water, no rocks, bottom drops neither too fast nor too slow, no rips, etc), but they were uncrowded and more sort of charming in a rustic way and often have cool surrounding cliffs and pretty settings.

3. The traffic sucks. The island is small, which is cool for a vacation (actually I love staying on tiny islands, like Ko Hai, Caye Caulker, or the Isla Mujeres of my childhood; islands where you can walk from one side to the other in half an hour or so). However, despite the smallness it takes forever to get anywhere because it's constant gridlock. Sitting in traffic fucking blows, and this alone is almost enough to put me off Kauai.

4. The human development on Kauai is repellent. The cities are all really ugly (though that seems to be standard all over Hawaii); most of the island is strip malls and run down shopping centers and fast food and such. Then the alternative are these fancy manicured suburban/golf developments like Poipu and Princeville which are disgusting in a different way. Between the two, the human hand on Kauai has scarred it with an ugliness that is quite tragic.

5. It's extremely tourist-oriented. Every restaurant is for tourists (which means rotten food and weird phoney-nice service), the place is covered in tourist crap shops (t-shirts, mac nuts, koa, etc). It has no feeling of being its own place independent from the tourists. It also has a big port where cruise ships drop off hordes. Part of the problem with that is that Kauai is so small it can't really handle the appearance of 5000 people in one day.

6. The Na Pali Coast trail (Kalalau) is pretty cool. We made it 6 miles in before turning around (just into Hanakoa Valley, which was the best part of the trail that we saw) (pretty impressive for a pregnant lady). It's definitely not the most beautiful hike ever (as some say); there are lots of hikes in WA that are better scenery and not so jam packed with ding-dongs. It is sweet to be able to take a dip in the rivers along the way and swim at the beach afterward. Much like the Big Island, there's too much private property and not very much development of good trails, so you see all this beautiful stuff around but you can't really get to it (unless you want to trespass and bushwhack, which you certainly can do).

7. I think it would be a pretty great place for a surf vacation. One of the good things about it from that standpoint is there are decent beaches facing every cardinal direction, so you can pick your spot to match the swell, and because it's small it doesn't take forever to get there. I could see maybe going back to Kauai some day for an intensive "finally learn to really surf" vacation, based around Hanalei or something (and never leaving that area).

8. For anyone considering going to Kauai - don't go in winter. We got super lucky with no big storms during our short trip, but generally Kauai is pounded in winter with big waves and lots of rain. You can always go hide in the dry south, but since the north is the best part of the island it's just better to go when it's not storm season.

Overall it made me miss our Big Island home, and I'm happy to be back.

I guess I'm a little negative about Kauai because I was super tired the whole time from not sleeping well. I also realized that I kind of hate vacation these days. I like workcation where I rent a house for a while and settle in and can cook my own food and bring my bike and get to do what I like (bike, swim, work). I don't really like sight seeing, just going from place to place and going yep I saw that; it feels so pointless, and it's kind of all the same experience no matter what sight it is you're seeing. I hate hotels, the invariably awful beds and pillows, the ice makers and elevators and other guests, the nasty decor and bad air, the attendants angling for tips. I hate restaurants, I'm so sick of restaurants. I wish I could just buy some proper ingredients that are actually fresh and okay quality, and have them cooked simply at the time that I order. Instead you get frozen super-low-grade Sysco garbage that's been pre-cooked and then warmed to order and covered in some nasty "sauce", it's just revolting the filth that they pass off as food all over America. (and the fancy expensive restaurants are not much better). And you have to sit around forever while the waiter does god knows what and try to act nice and make the most of it while poisonous filth is flopped down in front of you.

I like the idea of vacations that are for a certain activity that you like. Not going to see sights or relax, but to go hiking in some place that's really great for hiking, or to go biking, or surfing, or whatever you like. I sort of did this with the CA work/bike-cation, and it was rocking good. I'd like to do it more, but it's hard to find good information. A lot of the "epic hikes" or "great bike rides" are actually total shit; the rating is done by people who don't know WTF they're talking about. (same is true of "great beaches", which are often total crap beaches except for their white sand or something stupid like that). For example, I know that Hwy 1 in CA is on many a list of epic rides, and having lived there for a long time I know that's totally retarded; not only is it not epic, it's barely even tolerable, like I would never ride it by choice (I only rode it when necessary to connect a loop between other roads), and in the same area where they recommend Hwy 1 there are probably 30 rides that are much better. So anyway, actually finding solid information on places that are good "destination biking" is very difficult.

I'm also getting more sensitive about travelling places where the tourism is sort of a form of exploitation. In Hawaii the bad vibes are mild, but they're definitely there. We stole these islands from the Hawaiians, and now they are mostly pretty poor and get to watch rich tourists come in and buy up their best land and crowd up their favorite local spots. But despite that Hawaii is immensely better than other beach destinations I've gone to. In Mexico & Central America you get to see the abject poverty of people whose lives have been destroyed by government corruption and "free trade" (which is a transparent absurdity when we own all the patents and subsidize our exports and fuel costs); most of the beach developments were the result of the government evicting the people who rightfully lived there with minimal compensation; you used to be able to get away from the Zona Hoteleria areas and find sweet little towns that were still pretty untouched, but that's increasingly hard. In Thailand you're surrounded by the sex tourists and the cheap-booze backpacking set, who generally sleaze the place up (but it's better when you get away from the tourist-heavy areas).

Anyhoo, some photos from Kauai :

(including "tree canopy" and "how to look at tree canopy")

01-17-13 | What Happened to Tech Blogs?

I feel like the internet is dying. There's less and less legitimate content, and more and more fluff and self-promotional ignorant useless crap. It's becoming harder and harder to find solid information that's written by people who actually know anything about what they're posting about.

The information on the internet is now almost entirely one of :

1. Advertising. Sometimes even subtly hidden advertising (there are now tons of "blogs" that are actually advertisements, and a lot of the posters on web forums are actually advertisers who are more or less clever about it).

2. Ignorant. Stuff like answers.yahoo and eHow and Yelp and so on are once in a while written by someone who knows their topic, but usually not. Reading these sites is often more harmful than helpful.

Oddly, the vast majority of blogs about things like cooking, cars, home improvement, or any DIY hobbies are not written by people who actually do those things and know anything about them. They're usually written by housewives or techie nerds who just want some attention or love blogging or god knows why they do it. It should sort of be harmless for ignorant people to write about their adventures building a shack for the first time (lord knows I do it), but it's actually not harmless. For one thing, they tend to become popular and so become the leading search results, ahead of much better information which is drier or not so cutesy. For another, the writers often present themselves as more well informed than they actually are, and they often misrepresent the success of their endeavour.

3. Self-vertising. Even some of the better blogs are just ways to self-promote or otherwise make money. This can be okay and there can still be good information from the self-vertisers, but they also do a lot of padding, a lot of repetition, and heavily distort the truth to make themselves seem more important. The tech self-vertisers tend to be annoyingly pedantic and act like experts when they are not. They almost never do the helpful thing and link to their (better) original sources. They often use the same style as pundits or paid "experts" in that they present their solution as The One True Way to give it extra legitimacy, when in fact the truth is more nuanced (maybe there are disadvantages that they don't talk about, or equally valid other solutions that already exist, or uncertainties in the parameters). Part of the problem with the self-vertisers is that they all mutually promote and are very active about SEO, so they become the primary visible voices. Also to pad their posting they tend to grab "facts" from other sources and repeat them, which creates a bad false sense of confidence in those nonsensities because they are being repeated all over.

Somewhat related to this are the lunatics with some kind of agenda. They aren't exactly advertising, but they are rabid about some point and so spam the web with their "facts" which are just creations designed to prove their point. It makes it almost impossible to find information about controversial topics, because these people are so active that they dominate search results.

4. Communities. I used to get some of my best information from web communities/forums. The great thing about them is that you can find these individual posters that hang out on them who are actually true experts in the field; like if you're searching for home improvement stuff you can find guys in web communities who are actual long term builders and provide solid facts; or for car info you can find people who actually build or race cars and know WTF they're talking about. However, it takes a lot of work to find those guys; they generally are not the most frequent posters, they tend to pop in and snipe some amazing wisdom once in a while and then disappear. You have to do a lot of scrounging around, and read multiple posts from each poster to try to assess the credibility of the individual user.

But I've been noticing something really nasty about web communities recently. They tend to get into this kind of rigid group-think which can lead them to constantly repeat certain "facts" despite there being no substance to them. What happens is some strong personality on the forum promotes some fact and everyone gives it a "thumbs up" , they start repeating it every time someone asks that question, and it winds up in the FAQ. Posters on web communities are highly motivated by the approval of their peers; they act like a pack of high schoolers who are constantly looking around to make sure everyone else thinks they're cool. There's very little independent thinking and willingness to challenge the group-think. There's lots of high-fiving.

The truly wise tend to be humble and a bit soft-spoken. That's all well and good, but in the juvenile shouting match which is the modern internet, it's the people who are unashamed to loudly pontificate and bully about things they know not much of who are heard.

Try searching for something like "Calphalon" or "Big Island Waterfall" and see how many results you can find that aren't one of those 4 groups. Sure there's still signal out there but it's getting drowned in the noise.

Anyhoo. One of the symptoms of the dying internet that I've noticed is that there are basically no tech blogs for me to read any more. Maybe I'm just out of the loop? Are you all blogging on facebook now, or some other closed system that I refuse to join?

A few years ago, I felt like I was getting really superb quality tech blogs in my RSS on an almost daily basis, and now that has slowed to a trickle of maybe one a week or one a month. The vast majority of people that I liked and followed are not posting any more. What gives?

I understand that a lot of people who blog do it for a while, but lose steam and their blog goes silent. But there should be new people picking up the mantle; maybe I just haven't been active enough about figuring out who the good new bloggers are.

For reference, my tech blog subscriptions :

<opml version="1.0">
        <title>cbloom subscriptions in Google Reader</title>
        <outline text=".mischief.mayhem.soap."
            title=".mischief.mayhem.soap." type="rss"
            xmlUrl="http://msinilo.pl/blog/?feed=rss2" htmlUrl="http://msinilo.pl/blog"/>
        <outline text="1024cores" title="1024cores" type="rss"
            xmlUrl="http://feeds.feedburner.com/1024cores" htmlUrl="http://blog.1024cores.net/"/>
        <outline text="A random walk through geek-space"
            title="A random walk through geek-space" type="rss"
            xmlUrl="http://api.live.net/Users(4929737823860505484)/Main?$format=rss20" htmlUrl="http://sebastiansylvan.wordpress.com"/>
         <outline text="Amit's Thoughts" title="Amit's Thoughts"
            xmlUrl="http://amitp.blogspot.com/feeds/posts/default" htmlUrl="http://amitp.blogspot.com/"/>
        <outline text="Aras' website" title="Aras' website" type="rss"
            xmlUrl="http://aras-p.info/atom.xml" htmlUrl="http://aras-p.info/"/>
        <outline text="Atom" title="Atom" type="rss"
            xmlUrl="http://farrarfocus.blogspot.com/feeds/posts/default" htmlUrl="http://farrarfocus.blogspot.com/"/>
        <outline text="Attractive Chaos" title="Attractive Chaos"
            xmlUrl="http://attractivechaos.wordpress.com/feed/" htmlUrl="http://attractivechaos.wordpress.com"/>
         <outline text="Beautiful Pixels" title="Beautiful Pixels"
            xmlUrl="http://feeds.feedburner.com/BeautifulPixels" htmlUrl="http://beautifulpixels.blogspot.com/"/>
        <outline text="Birth of a Game" title="Birth of a Game"
            xmlUrl="http://uber.typepad.com/birthofagame/atom.xml" htmlUrl="http://uber.typepad.com/birthofagame/"/>
        <outline text="bitsquid: development blog"
            title="bitsquid: development blog" type="rss"
            xmlUrl="http://bitsquid.blogspot.com/feeds/posts/default" htmlUrl="http://bitsquid.blogspot.com/"/>
        <outline text="bouliiii's blog" title="bouliiii's blog"
            xmlUrl="http://bouliiii.blogspot.com/feeds/posts/default" htmlUrl="http://bouliiii.blogspot.com/"/>
        <outline text="Braid" title="Braid" type="rss"
            xmlUrl="http://braid-game.com/news/?feed=rss2" htmlUrl="http://braid-game.com/news"/>
        <outline text="Breaking Eggs And Making Omelettes"
            title="Breaking Eggs And Making Omelettes" type="rss"
            xmlUrl="http://multimedia.cx/eggs/feed/" htmlUrl="http://multimedia.cx/eggs"/>
        <outline text="C++Next" title="C++Next" type="rss"
            xmlUrl="http://cpp-next.com/feed/" htmlUrl="http://cpp-next.com"/>
        <outline text="c0de517e" title="c0de517e" type="rss"
            xmlUrl="http://c0de517e.blogspot.com/feeds/posts/default" htmlUrl="http://c0de517e.blogspot.com/"/>
        <outline text="Canned Platypus" title="Canned Platypus"
            type="rss" xmlUrl="http://pl.atyp.us/wordpress/?feed=rss2" htmlUrl="http://pl.atyp.us/wordpress"/>
        <outline text="cbloom rants" title="cbloom rants" type="rss"
            xmlUrl="http://feeds.feedburner.com/CbloomRants" htmlUrl="http://cbloomrants.blogspot.com/"/>
        <outline text="Cessu's blog" title="Cessu's blog" type="rss"
            xmlUrl="http://cessu.blogspot.com/feeds/posts/default" htmlUrl="http://cessu.blogspot.com/"/>
         <outline text="CodeItNow" title="CodeItNow" type="rss"
            xmlUrl="http://www.rorydriscoll.com/feed/" htmlUrl="http://www.rorydriscoll.com"/>
        <outline text="Coder Corner" title="Coder Corner" type="rss"
            xmlUrl="http://www.codercorner.com/blog/?feed=rss2" htmlUrl="http://www.codercorner.com/blog"/>
        <outline text="copypastepixel" title="copypastepixel" type="rss"
            xmlUrl="http://copypastepixel.blogspot.com/feeds/posts/default" htmlUrl="http://copypastepixel.blogspot.com/"/>
        <outline text="Corensic" title="Corensic" type="rss"
            xmlUrl="http://corensic.wordpress.com/feed/" htmlUrl="http://corensic.wordpress.com"/>
        <outline text="Diary of a Graphics Programmer"
            title="Diary of a Graphics Programmer" type="rss"
            xmlUrl="http://diaryofagraphicsprogrammer.blogspot.com/feeds/posts/default" htmlUrl="http://diaryofagraphicsprogrammer.blogspot.com/"/>
        <outline text="Diary Of An x264 Developer"
            title="Diary Of An x264 Developer" type="rss"
            xmlUrl="http://x264dev.multimedia.cx/?feed=atom" htmlUrl="http://x264dev.multimedia.cx/"/>
        <outline text="direct to video" title="direct to video"
            type="rss" xmlUrl="http://directtovideo.wordpress.com/feed/" htmlUrl="http://directtovideo.wordpress.com"/>
        <outline text="el trastero" title="el trastero" type="rss"
            xmlUrl="http://www.iquilezles.org/blog/?feed=rss2" htmlUrl="http://www.iquilezles.org/blog"/>
        <outline text="EntBlog" title="EntBlog" type="rss"
            xmlUrl="http://feeds2.feedburner.com/EntBlog" htmlUrl="http://entland.homelinux.com/blog"/>
        <outline text="EnterTheSingularity" title="EnterTheSingularity"
            xmlUrl="http://enterthesingularity.blogspot.com/feeds/posts/default?alt=rss" htmlUrl="http://enterthesingularity.blogspot.com/"/>
         <outline text="Fast Data Compression"
            title="Fast Data Compression" type="rss"
            xmlUrl="http://fastcompression.blogspot.com/feeds/posts/default" htmlUrl="http://fastcompression.blogspot.com/"/>
        <outline text="fixored?" title="fixored?" type="rss"
            xmlUrl="http://www.sjbrown.co.uk/feed/" htmlUrl="http://www.sjbrown.co.uk"/>
        <outline text="Game Angst" title="Game Angst" type="rss"
            xmlUrl="http://gameangst.com/?feed=rss2" htmlUrl="http://gameangst.com"/>
        <outline text="Game Rendering" title="Game Rendering" type="rss"
            xmlUrl="http://www.gamerendering.com/feed/atom/" htmlUrl="http://www.gamerendering.com/"/>
        <outline text="Game Rendering" title="Game Rendering" type="rss"
            xmlUrl="http://www.gamerendering.com/feed/" htmlUrl="http://www.gamerendering.com"/>
        <outline text="GameArchitect" title="GameArchitect" type="rss"
            xmlUrl="http://gamearchitect.net/feed/" htmlUrl="http://gamearchitect.net"/>
        <outline text="Gamedev Coder Diary" title="Gamedev Coder Diary"
            type="rss" xmlUrl="http://gamedevcoder.wordpress.com/feed/" htmlUrl="http://gamedevcoder.wordpress.com"/>
        <outline text="Graphic Rants" title="Graphic Rants" type="rss"
            xmlUrl="http://graphicrants.blogspot.com/feeds/posts/default" htmlUrl="http://graphicrants.blogspot.com/"/>
        <outline text="Graphics Runner" title="Graphics Runner"
            xmlUrl="http://graphicsrunner.blogspot.com/feeds/posts/default" htmlUrl="http://graphicsrunner.blogspot.com/"/>
        <outline text="Graphics Size Coding"
            title="Graphics Size Coding" type="rss"
            xmlUrl="http://sizecoding.blogspot.com/feeds/posts/default" htmlUrl="http://sizecoding.blogspot.com/"/>
        <outline text="Gustavo Duarte" title="Gustavo Duarte" type="rss"
            xmlUrl="http://feeds2.feedburner.com/GustavoDuarte" htmlUrl="http://duartes.org/gustavo/blog"/>
        <outline text="Hardwarebug" title="Hardwarebug" type="rss"
            xmlUrl="http://hardwarebug.org/feed/" htmlUrl="http://hardwarebug.org"/>
        <outline text="hbr" title="hbr" type="rss"
            xmlUrl="http://brnz.org/hbr/?feed=rss2" htmlUrl="http://brnz.org/hbr"/>
        <outline text="Humus" title="Humus" type="rss"
            xmlUrl="http://www.humus.name/rss.xml" htmlUrl="http://www.humus.name"/>
        <outline text="I am an extreme moderate"
            title="I am an extreme moderate" type="rss"
            xmlUrl="https://extrememoderate.wordpress.com/feed/" htmlUrl="https://extrememoderate.wordpress.com"/>
        <outline text="I Get Your Fail" title="I Get Your Fail"
            type="rss" xmlUrl="http://feeds.feedburner.com/IGetYourFail" htmlUrl="http://igetyourfail.blogspot.com/"/>
        <outline text="Ignacio Casta├▒o" title="Ignacio Casta├▒o"
            type="rss" xmlUrl="http://castano.ludicon.com/blog/feed/" htmlUrl="http://www.ludicon.com/castano/blog"/>
        <outline text="Industrial Arithmetic"
            title="Industrial Arithmetic" type="rss"
            xmlUrl="http://industrialarithmetic.blogspot.com/feeds/posts/default" htmlUrl="http://industrialarithmetic.blogspot.com/"/>
          <outline text="John Ratcliff's Code Suppository"
            title="John Ratcliff's Code Suppository" type="rss"
            xmlUrl="http://codesuppository.blogspot.com/feeds/posts/default" htmlUrl="http://codesuppository.blogspot.com/"/>
        <outline text="Just Software Solutions Blog"
            title="Just Software Solutions Blog" type="rss"
            xmlUrl="http://www.justsoftwaresolutions.co.uk/index.rss" htmlUrl="http://www.justsoftwaresolutions.co.uk/blog/"/>
        <outline text="Lair Of The Multimedia Guru"
            title="Lair Of The Multimedia Guru" type="rss"
            xmlUrl="http://guru.multimedia.cx/feed/" htmlUrl="http://guru.multimedia.cx"/>
        <outline text="Larry Osterman's WebLog"
            title="Larry Osterman's WebLog" type="rss"
            xmlUrl="http://blogs.msdn.com/larryosterman/rss.xml" htmlUrl="http://blogs.msdn.com/b/larryosterman/"/>
        <outline text="Level of Detail" title="Level of Detail"
            type="rss" xmlUrl="http://levelofdetail.wordpress.com/feed/" htmlUrl="http://levelofdetail.wordpress.com"/>
        <outline text="level of detail" title="level of detail"
            type="rss" xmlUrl="http://www.jshopf.com/blog/?feed=rss2" htmlUrl="http://jshopf.com/blog"/>
        <outline text="Light is beautiful" title="Light is beautiful"
            xmlUrl="http://feeds.feedburner.com/LightIsBeautiful?format=xml" htmlUrl="http://lousodrome.net/blog/light"/>
        <outline text="Lightning Engine" title="Lightning Engine"
            xmlUrl="http://feeds2.feedburner.com/LightningEngine" htmlUrl="http://blog.makingartstudios.com"/>
        <outline text="Lost in the Triangles"
            title="Lost in the Triangles" type="rss"
            xmlUrl="http://feeds.feedburner.com/LostInTheTriangles" htmlUrl="http://aras-p.info/"/>
        <outline text="Mark's Blog" title="Mark's Blog" type="rss"
            xmlUrl="http://blogs.technet.com/markrussinovich/rss.xml" htmlUrl="http://blogs.technet.com/b/markrussinovich/"/>
        <outline text="meshula.net" title="meshula.net" type="rss"
            xmlUrl="http://meshula.net/wordpress/?feed=rss2" htmlUrl="http://meshula.net/wordpress"/>
        <outline text="Miles Macklin's blog"
            title="Miles Macklin's blog" type="rss"
            xmlUrl="http://blog.mmacklin.com/feed/" htmlUrl="http://blog.mmacklin.com"/>
        <outline text="Mod Blog" title="Mod Blog" type="rss"
            xmlUrl="http://www.modularpeople.com/blog/?feed=rss2" htmlUrl="http://www.modularpeople.com/blog"/>
        <outline text="Molecular Musings" title="Molecular Musings"
            xmlUrl="http://molecularmusings.wordpress.com/feed/" htmlUrl="http://molecularmusings.wordpress.com"/>
        <outline text="Monty" title="Monty" type="rss"
            xmlUrl="http://xiphmont.livejournal.com/data/rss" htmlUrl="http://xiphmont.livejournal.com/"/>
        <outline text="My Green Paste, Inc."
            title="My Green Paste, Inc." type="rss"
            xmlUrl="http://mygreenpaste.blogspot.com/feeds/posts/default" htmlUrl="http://mygreenpaste.blogspot.com/"/>
        <outline text="Nerdblog.com" title="Nerdblog.com" type="rss"
            xmlUrl="http://www.nerdblog.com/feeds/posts/default" htmlUrl="http://www.nerdblog.com/"/>
         <outline text="nothings' projects" title="nothings' projects"
            type="rss" xmlUrl="http://nothings.org/projects/?feed=rss2" htmlUrl="http://nothings.org/projects"/>
        <outline text="Nynaeve" title="Nynaeve" type="rss"
            xmlUrl="http://www.nynaeve.net/?feed=rss2" htmlUrl="http://www.nynaeve.net"/>
        <outline text="onepartcode.com" title="onepartcode.com"
            type="rss" xmlUrl="http://onepartcode.com/main/index.rss" htmlUrl="http://onepartcode.com/main"/>
        <outline text="Online Game Techniques"
            title="Online Game Techniques" type="rss"
            xmlUrl="http://onlinegametechniques.blogspot.com/feeds/posts/default" htmlUrl="http://onlinegametechniques.blogspot.com/"/>
        <outline text="Pete Shirley's Graphics Blog"
            title="Pete Shirley's Graphics Blog" type="rss"
            xmlUrl="http://psgraphics.blogspot.com/feeds/posts/default" htmlUrl="http://psgraphics.blogspot.com/"/>
        <outline text="Pixels, Too Many.." title="Pixels, Too Many.."
            type="rss" xmlUrl="http://pixelstoomany.wordpress.com/feed/" htmlUrl="http://pixelstoomany.wordpress.com"/>
        <outline text="Preshing on Programming"
            title="Preshing on Programming" type="rss"
            xmlUrl="http://preshing.com/feed" htmlUrl="http://preshing.com"/>
        <outline text="Ray Tracey's blog" title="Ray Tracey's blog"
            xmlUrl="http://raytracey.blogspot.com/feeds/posts/default" htmlUrl="http://raytracey.blogspot.com/"/>
        <outline text="Real-Time Rendering" title="Real-Time Rendering"
            xmlUrl="http://www.realtimerendering.com/blog/feed/" htmlUrl="http://www.realtimerendering.com/blog"/>
        <outline text="realtimecollisiondetection.net - the blog"
            title="realtimecollisiondetection.net - the blog" type="rss"
            xmlUrl="http://realtimecollisiondetection.net/blog/?feed=atom" htmlUrl="http://realtimecollisiondetection.net/blog"/>
        <outline text="Reenigne blog" title="Reenigne blog" type="rss"
            xmlUrl="http://www.reenigne.org/blog/feed/" htmlUrl="http://www.reenigne.org/blog"/>
        <outline text="RenderWonk" title="RenderWonk" type="rss"
            xmlUrl="http://renderwonk.com/blog/index.php/feed/" htmlUrl="http://renderwonk.com/blog"/>
        <outline text="ridiculous_fish" title="ridiculous_fish"
            type="rss" xmlUrl="http://ridiculousfish.com/blog/feed/" htmlUrl="http://ridiculousfish.com/blog/"/>
        <outline text="Sanders' blog" title="Sanders' blog" type="rss"
            xmlUrl="http://sandervanrossen.blogspot.com/feeds/posts/default?alt=rss" htmlUrl="http://sandervanrossen.blogspot.com/"/>
         <outline text="Self Shadow" title="Self Shadow" type="rss"
            xmlUrl="http://blog.selfshadow.com/feed/" htmlUrl="http://blog.selfshadow.com/"/>
        <outline text="Some Assembly Required"
            title="Some Assembly Required" type="rss"
            xmlUrl="http://assemblyrequired.crashworks.org/feed/" htmlUrl="http://assemblyrequired.crashworks.org"/>
        <outline text="stinkin' thinkin'" title="stinkin' thinkin'"
            xmlUrl="http://stinkygoat.livejournal.com/data/rss" htmlUrl="http://stinkygoat.livejournal.com/"/>
        <outline text="Stuart Denman" title="Stuart Denman" type="rss"
            xmlUrl="http://www.stuartdenman.com/feed/" htmlUrl="http://www.stuartdenman.com"/>
        <outline text="Stumbling Toward 'Awesomeness'"
            title="Stumbling Toward 'Awesomeness'" type="rss"
            xmlUrl="http://www.chrisevans3d.com/pub_blog/?feed=atom" htmlUrl="http://www.chrisevans3d.com/pub_blog"/>
         <outline text="Syntopia" title="Syntopia" type="rss"
            xmlUrl="http://blog.hvidtfeldts.net/index.php/feed/" htmlUrl="http://blog.hvidtfeldts.net"/>
        <outline text="S├ębastien Lagarde" title="S├ębastien Lagarde"
            type="rss" xmlUrl="https://seblagarde.wordpress.com/feed/" htmlUrl="https://seblagarde.wordpress.com"/>
        <outline text="The Atom Project" title="The Atom Project"
            xmlUrl="http://www.farrarfocus.com/atom/index.atom" htmlUrl="http://www.farrarfocus.com/atom/"/>
        <outline text="The Danger Zone" title="The Danger Zone"
            type="rss" xmlUrl="http://mynameismjp.wordpress.com/feed/" htmlUrl="http://mynameismjp.wordpress.com"/>
        <outline text="The Data Compression News Blog"
            title="The Data Compression News Blog" type="rss"
            xmlUrl="http://www.c10n.info/feed" htmlUrl="http://www.c10n.info"/>
        <outline text="The Fifth Column" title="The Fifth Column"
            xmlUrl="http://thefifthcolumn.com/blog/?feed=rss2" htmlUrl="http://thefifthcolumn.com/blog"/>
        <outline text="The Ladybug Letter" title="The Ladybug Letter"
            type="rss" xmlUrl="http://www.ladybugletter.com/?feed=atom" htmlUrl="http://www.ladybugletter.com/"/>
        <outline text="The ryg blog" title="The ryg blog" type="rss"
            xmlUrl="http://fgiesen.wordpress.com/feed/" htmlUrl="http://fgiesen.wordpress.com"/>
        <outline text="The software rendering world"
            title="The software rendering world" type="rss"
            xmlUrl="http://winden.wordpress.com/feed/" htmlUrl="http://winden.wordpress.com"/>
        <outline text="The Witness" title="The Witness" type="rss"
            xmlUrl="http://the-witness.net/news/feed/" htmlUrl="http://the-witness.net/news"/>
        <outline text="Transcendental Technical Travails"
            title="Transcendental Technical Travails" type="rss"
            xmlUrl="http://t-t-travails.blogspot.com/feeds/posts/default" htmlUrl="http://t-t-travails.blogspot.com/"/>
        <outline text="Treatise on Graphics Programming"
            title="Treatise on Graphics Programming" type="rss"
            xmlUrl="http://www.wolfgang-engel.info/blogs/?feed=rss2" htmlUrl="http://www.wolfgang-engel.info/blogs"/>
        <outline text="UMBC Games, Animation and Interactive Media"
            title="UMBC Games, Animation and Interactive Media"
            type="rss" xmlUrl="http://gaim.umbc.edu/feed/" htmlUrl="http://gaim.umbc.edu"/>
        <outline text="View" title="View" type="rss"
            xmlUrl="http://view.eecs.berkeley.edu/blog/rss.php?ver=2" htmlUrl="http://view.eecs.berkeley.edu/blog/"/>
         <outline text="VirtualBlog" title="VirtualBlog" type="rss"
            xmlUrl="http://www.virtualdub.org/blog/rss.xml" htmlUrl="http://virtualdub.org/blog/index.php"/>
         <outline text="Voxelium" title="Voxelium" type="rss"
            xmlUrl="http://voxelium.wordpress.com/feed/" htmlUrl="http://voxelium.wordpress.com"/>
            text="What your mother never told you about graphics development"
            title="What your mother never told you about graphics development"
            xmlUrl="http://zeuxcg.blogspot.com/feeds/posts/default" htmlUrl="http://zeuxcg.blogspot.com/"/>
            text="What your mother never told you about graphics development"
            title="What your mother never told you about graphics development"
            type="rss" xmlUrl="http://zeuxcg.org/feed/" htmlUrl="http://zeuxcg.org"/>
        <outline text="Work Without Dread" title="Work Without Dread"
            xmlUrl="http://workwithoutdread.blogspot.com/feeds/posts/default" htmlUrl="http://workwithoutdread.blogspot.com/"/>
        <outline text="Zack Rusin" title="Zack Rusin" type="rss"
            xmlUrl="http://zrusin.blogspot.com/feeds/posts/default" htmlUrl="http://zrusin.blogspot.com/"/>
        <outline text="ZigguratVertigo's Hideout"
            title="ZigguratVertigo's Hideout" type="rss"
            xmlUrl="http://zigguratvertigo.com/feed/" htmlUrl="http://colinbarrebrisebois.com"/>
        <outline text="  Bartosz Milewski's Programming Cafe"
            title="  Bartosz Milewski's Programming Cafe" type="rss"
            xmlUrl="http://bartoszmilewski.wordpress.com/feed/" htmlUrl="http://bartoszmilewski.com"/>

01-15-13 | Kids

Some random thoughts on my impending kid-having-ness :

(note for dumb people : we're not here to talk about boring obvious shit like "kids make you sleep less" or "many parents live out their frustrated life goals through their kids". That's an obvious given as a baseline that should not need to be said; on this blog we try to talk about the things that are past the baseline, though many readers seem to not get that and want to chime in with the material that was a prerequisite for this course; get out of here and go back to reading "Excessive DOF Photos of Crappy Food" or "The New Old Coding Bore" or "Precious Twee Artisanal All-Organic Parenting" or whatever banal blog you usually read)

1. Kids automatically make you cooler. They're like a +1 modifier on anything you do. Like if you're just some single guy and you're in good shape and do triathlons or whatever, who cares, you're kind of an obsessive dweeb. But if you're a good family-man dad and you do the same, then you're amazing cool fit dad. (of course there's valid reason for this +1, because it's so much harder to do anything once you have kids, they're such a huge energy-suck)

(I've long been aware that I have some sort of bad jealousy tick where I really hate awesome dads; whenever I meet a dad who's super-fit and has great kids and also has a great job and builds robots or writes books on the side, I'm just filled with loathing; I'm not entirely sure but I assume that instant gut loathing comes from jealousy; I also think those guys are liars/phonies. Like, I think they must actually be terrible dads, it's just not possible to do all those things and spend enough time with your kids; why aren't you exhausted and frazzled? perhaps they have very self-sacrificing wives who are actually doing all the work at home, and/or they aren't actually putting in the work at their job; something is amiss, my spidey senses tingle)

2. Kids let you do things you suck at without feeling awkward. Say you suck at skiing; if you just go as a single man and take beginners lessons and have to ski the tow-rope bunny slope, you feel embarrassed and most people can't get over it (of course if you do it anyway, you actually are super cool, and it's the people who look down on you that are fucking retard losers, but I digress). With kids you can go and ski the bunny slope with them and nobody looks at you funny. If you go ice skate for the first time as a single adult and are falling all over and wobbly you're a weirdo, but if you do it with your kids, you're a cool dad.

(one of the great tragedies of life is that people stop doing new things around 20 because they don't want to look like a beginner; they also lose all humility and never want to admit that they are a beginner at something. It's super dumb and I've been trying to get past it for the last 20 years or so. It's so funny seeing men at track days or at home improvement stores; they obviously don't know a thing about cars or construction (like I don't), but they can't just admit it and go "yeah I'm a newbie, can you help me?" they have to act all macho-man and pretend to be in their element like "I need a ball-peen wrench to adjust the timing on my carburetor." Um, let's back up and try again.)

3. Kids let you do things that are dweeby to do as single people, like go to the zoo or ride in a carriage. Part of the issue is that those activities are just not quite interesting enough on their own, but when you have the +1 enjoyment modifier of seeing it through your kids' eyes that pushes them over the threshold of worth-doing-ness. I've always loved factory tours and those living-history museums where you can see how stained glass is made or whatever, also science museums (particularly interactive ones), but they just aren't quite worth doing as an adult. Kids remove the difficult embarrassingness of everyone around you thinking "why are you here? it's only really old people and families, childless adults are not allowed".

4. Kids give you an excuse to be a selfish inconsiderate asshole. This is not a good thing and lots of parents over-do it. (it starts with pregnant moms who use the pregnancy as an excuse to be selfish bitches way beyond what's necessary or appropriate). Things like we can be loud at the symphony because we have kids, or we can cut in line because we're pregnant, or let's take the best seats and spread out all over, or let's take all the chairs at the hotel pool and then leave a giant mess behind us, etc. People know that kids make it much harder for others to go "hey fucker, you're out of line" and they abuse that advantage.

5. Kids let you play. I'm super excited about this. For a long time I've known that what I really need in my life is *play* , not sports, not games, but just joyous pointless movement. Adults are so fucking uptight and trying to act cool and impress each other all the time that they can never just play (actually I had a pretty sweet thing going for a while with Ryan where we could play a bit, but that was rare). Of course there's a whole industry of "ecstatic dance" and shit like that which is basically adults paying someone to let them play, which is so sad and bizarre; you have these uptight type-A business assholes who are total fuckers to everyone all day long, and then they go in a room and listen to a teacher tell them to run around in circles and stick their tongue out; super bizarre disconnect there. Anyhoo, kids let you go to the park and run around and roll in the grass and jump on things and nobody thinks you're a weirdo. (alternatively : move to San Francisco; fucking wonderful place SF, but all the gentry and computerists are ruining it)

(I guess those funny-dress-up runs are also societal concoctions to let adults play; but they ruin it by being a regimented precisely specified play; you're still just trying to fit in and do what you're supposed to. Oh crap, I wore a tutu and everyone else is wearing a cape! And it's still competitive and judgemental - ooh look, that guy is really relaxing well. Adulthood is so bizarre.)

6. Kids let you not have friends. They let you turn inward and just hang out with family. And of course you get some socializing through your kids doing things and hanging out with other parents. You don't have to make any effort to make adult relationships work, which is a pain in the ass. Kids let you just stay home with your family without being a weird lonely hermit. Of course this is also a danger if you take it too far; you see these families that are so drawn in and almost afraid of other adults that when they're out in public they hardly even look up at the world around them.

7. Kids let you feel okay about sucking. If you're not really doing anything with your life and you're just kind of a rotten human being, but you have kids - then you can think "I devote my life to my kids, they are my pride and joy, at least I've made them, they are my life's work". They provide a +1 smugness bonus.

8. Kids give you a new thing to be ridiculously analytical and obsessive and introspective about. Most type-A nerds have kind of gotten bored of thinking about life by the time they hit 25 or so. We've already thought and over-thought everything that we do in life ("what is actually going on in the little social interaction with the grocery store checkout person? should I make minimal polite smalltalk, or should I try to say something unexpected to cheer them up? do I feel bad for this person whose life has obviously taken such a wrong turn? am I trying to make them feel like my equal and not my servant?" etc). We've made charts and graphs about how various influences in our life affect our productivity, and it's just all old hat. ("should I turn the other cheek, or should I get aggressive back at this asshole? Turning the other cheek is a local optimization of my own happiness, but that does not create a social game-theory structure which directs overall behavior in a good way. Oh wait I've had this same thought like 100 times before"). Kids give you a whole new set of things to be anal and nerdy about, read books and think about cause-effect and blah blah blah. Of course this blog post is a sign that I've already begun.

01-14-13 | Hawaii Workcation 2013

Photos from the first few days here. Tasha's bro visited so we did a bit of travelling around sightseeing. Starting with the rental house, my office, and then some excursions :

Man it feels great to be here. The house is incredible, just as we hoped, tons of windows and a big view of Mauna Kea with not a single neighbor around. I feel alive, young, virile, lithe. I love the sun and the sweat. I love the trees and the good vibes.

I packed my bike this year (mild hassle (and the damn TSA opened my box and disturbed my careful packing)), looking forward to getting some good rides.

BTW you may notice that the correct ergonomic position for a "laptop" is about three feet above the human lap.

I can't wait for Tasha to pop the kids out so we can travel with them and play on the beach and run around in the trees.

01-01-13 | Chicken Coop Learnings

Some hindsight and lessons learned after living a while with my first coop, some mistakes made and things I'll do differently the next time. In all cases I'm assuming a backyard-size flock, 10 birds or less. Obviously different considerations apply to large-scale coops. Also I'm assuming that you live somewhere relatively warm (winters above 20 degrees); in the super-cold different considerations apply.

1. Chickens don't need a big coop. They don't like to be inside, they like to be outside (as noted above, I'm assuming a decently warm climate). The coop is just for sleeping and laying. Almost all the coop designs you'll see on the internet, and all the fancy ones you can buy, are much too big. Not only is it a waste of time and materials to build a big coop, it's a huge disadvantage because it takes up more space and is more work to clean and is harder to move.

2. Don't build a coop you can walk inside. As per #1, the coop should be small, and it should be high (chickens like to be up high to sleep). All you need is a small raised box. You do not need a door for humans or a floor at human height. Do, however, put an entire wall or roof on hinges so that you can open up the whole thing and easily reach every corner.

3. Don't over-engineer. Because the coop only needs to bear chicken-weight not human-weight, there's no need to use 2x4's or half inch plywood, you can use much lighter and smaller construction materials. Again most of the internet designs and coops you can buy are just way off here, way over-engineered. (it does need to be strong enough to be wind-proof and dog-proof; dogs are by far the biggest hazard to urban chickens).

Even if you want a movable coop, you don't really need wheels if you use suitably light building materials and are moderately athletic. It's very easy to just pick up a small coop and move it around the yard as needed.

4. Paint. I painted the inside of the coop, and some sites & people consider this silly and frou-frou, but I think it was a good call and would do it again. A thick coat of high-gloss paint gives great waterproofing and a smooth surface, which makes for much easier cleanup and longer life.

5. Rain/Snow. In contrast to #3, you should *not* cut corners in following good practices for weather-proofing. In particular, don't leave exposed edges of plywood or sheathing (they delaminate very easily), do use good shingle-principle for roofing (overlap and cover holes), use a proper drip-edge to prevent water wrapping around, etc.

6. Doors. I put a bunch of doors in the coop and one thing I didn't really consider was that all the poop and shavings and such will constantly be getting in the door jamb, which will prevent closure if it's a tight fit. One option is just to intentionally make a sloppy door that's a loose fit; another is to put some kind of trough near the door so that closing it pushes out the crud into the gap. Many designs, including mine, feature a door hinged at the bottom, so that when it opens it becomes a ramp. This seems clever but is not a very functional door because of the poop-in-the-hinges problem, it just becomes a static ramp. Probably the best type of door is top-hinged, with a raised bottom sill to prevent crud building up there. There's just not a lot of need for doors though; if you make the whole coop open for cleanability (such as via a hinged or removable roof), you can just use that to get the eggs as well; there's no need for the cute little nesting boxes with individual doors that people do.

7. The roost is the backbone of the coop. The chickens will spend 90% of their indoor time on the roosts, so locating the roost is the most important aspect of the design. The coop is really just the roost and the nesting boxes, the chickens want to spend their time outside in the run or free ranging, not on the floor of the coop.

8. The Poop Trough. Because of #7, I've found that almost all the chicken poop that's inside the coop is in a perfect straight line under the roost. I think you could take advantage of that and put an angled trough under the roost so that the poop was super easy to clean out. Another option would be a line of wire mesh instead of solid floor under the roost, perhaps with a removable trough under the wire mesh.

9. Rats. You have to decide from the beginning if you want to try to make a rat-proof coop. Doing so is a major undertaking and requires careful design. For example, chicken wire is not rat-proof. To make a rat-proof coop, first you need a solid stone foundation (for a small coop the easiest way to rat-proof the floor is just to cover the whole floor with pavers or bricks; for a larger coop you wouldn't want to do that, so you have to dig down at least 1 foot underground and surround the perimeter with rat-proof wire mesh or concrete blocks; rats are excellent diggers). Then the entire coop must be surrounded with hardware cloth (wire mesh) or similar. Rats are also superb climbers and jumpers, so vertical barriers will not stop them (you need a closed roof).

Some people try to rat-proof by putting wire on the floor (rather than a solid paver floor or burying a barrier around the perimeter). This is not a great idea. What will happen is the rats will still dig under the coop and create a network of tunnels under the wire floor. The chickens knock their feed all around, so lots (most) of it will fall through the wire mesh into the gap below it, and the rats will have a party living in the dirt under the wire floor. This might be okay with you (at least the rats are not actually in the chicken's space) but I think that overall wire on the floor is actually worse than nothing.

10. Feeders. Lots of people advocate these big automatic hanging feeders that you can fill with feed and it will drop down to let out more. Unless you have made a seriously rat-proof coop, these things are a terrible idea. Rats with an unlimited supply of food like that will multiply incredibly rapidly. You're going to want to visit the chickens every day anyway, so I see no advantage to these gravity feeders, just give them their ration each day so that there aren't a lot of left-overs for the vermin.

11. The Run. You have to decide up front whether you are going to free-range the chickens or not. If you are going to free-range them, then you don't need any run at all, just let them out in the yard. If not, then you need a big run. A tiny run (like under the popular commercial A-frame "chicken tractor") is pointless and cruel. If I had a decent amount of land I would build a simple run by just putting in some posts and wrapping it in chicken wire. (obviously this run is not rat proof). There's no need to cover the top of a large run (assuming as above you do not use a big feeder which would attract other birds).

12. Free ranging in your yard kind of sucks. Chickens love to dig in soft soil, so will go after your new plantings and vegetable beds and dig up your seedlings. They like to sit on railings and handles and poop. You will have poop all over everything. It's not awesome. On the other hand, it is very easy. They will eat a better diet without you having to carefully manage the supplements in their feed. They also naturally return to their coop at night so you don't really have to do any work to get them in and out, they do it themselves.

13. The poop pile. If you are going to try to reuse the poop and shavings you get when you clean out the coop as manure, you need to locate a spot for the poop to rest. You will get a *lot* of waste out of the coop, so you need a big spot, and you need at least two piles so you can cycle the new into the old (like compost; poop needs 2-3 months rest before use). The poop pile should not be near the coop (or run) and should also not be near your planting beds to avoid pest and pathogen transfer. It can be hard to find a good location for the poop pile in an urban yard, so you may want to abandon this idea and just throw out the poop. The poop pile will also attract rats and flies (but of course so will composting); it may also attract justifiably irate neighbors.

12-22-12 | Data Considered Harmful

I believe that the modern trend of doing very superficial data analysis to prove a point or support an argument is extremely harmful. It gives arguments a false appearance of scientific basis that is in fact spurious.

I've been thinking about this for a while, but this washingtonpost blog about the correlation of video games and gun violence recently popped into my blog feed, so I'll use it as an example.

The Washington Post blog leads you to believe that the data shows an unequivocal lack of correlation between videogames and gun violence. That's retarded. It only takes one glance at the chart to see that the data is completely dominated by other factors, like probably most strongly the gun ownership rate. You can't possibly try to find the effect of a minor contributing factor without normalizing for other factors, which most of these "analyses" fail to do, which makes them totally bogus. Furthermore, as usual, you would need a much larger sample size to have any confidence in the data, and you'd have to question the selection of data that was done. Also the entire thing being charted is wrong; it shouldn't be video game spending per capita, it should be video games played per capita (especially with China on there), and it shouldn't be gun-related murders, it should be all murders (because the fraction of murders that is gun related varies strongly by gun control laws, while the all murders rate varies more directly with the level of economic and social development in a country).

(Using data and charts and graphs has been a very popular way to respond to the recent shootings. Every single one that I've seen is complete nonsense. People just want to make a point that they've previously decided, so they trot out some data to "prove it" or make it "non-partisan" as if their bogus charts somehow make it "factual". It's pathetic. Here's a good example of using tons of data to show absolutely nothing . If you want to make an editorial point, just write your opinion, don't trot out bogus charts to "back it up". )

It's extremely popular these days to "prove" that some intuition is wrong by finding some data that shows a reverse correlation. (blame Freakonomics, among other things). You get lots of this in the smarmy TED talks - "you may expect that stabbing yourself in the eye with a pencil is harmful, but in fact these studies show that stabbing yourself in the eye is correlated to longer life expectancy!" (and then everyone claps). The problem with all this cute semi-intellectualism is that it's very often just wrong.

Aside from just poor data analysis, one of the major flaws with this kind of reasoning is the assumption that you are measuring all the inputs and all the outputs.

An obvious case is education, where you get all kinds of bogus studies that show such-and-such program "improves learning". Well, how did you actually measure learning? Obviously something like cutting music programs out of schools "improves learning" if you measure "learning" in a myopic way that doesn't include the benefits of music. And of course you must also ask what else was changed between the measured kids and the control (selection bias, novelty effect, etc; essentially all the studies on charter schools are total nonsense since any selection of students and new environment will produce a short term improvement).

I believe that choosing the wrong inputs and outputs is even worse than the poor data analysis, because it can be so hidden. Quite often there are some huge (bogus) logical leaps where the article will measure some narrow output and then proceed to talk about it as if it was just "better". Even when your data analysis was correct, you did not show it was better, you showed that one specific narrow output that you chose to measure improved, and you have to be very careful to not start using more general words.

(one of the great classic "wrong output" mistakes is measuring GDP to decide if a government financial policy was successful; this is one of those cases where economists have in fact done very sophisticated data analysis, but with a misleadingly narrow output)

Being repetitive : it's okay if you are actually very specific and careful not to extrapolate. eg. if you say "lowering interest rates increased GDP" and you are careful not to ever imply that "increased GDP" necessarily means "was good for the economy" (or that "was good for the economy" meant "was good for the population"); the problem is that people are sloppy, in their analysis and their implications and their reading, so it becomes "lowering interest rates improved the well-being of the population" and that becomes accepted wisdom.

Of course you can transparently see the vapidity of most of these analyses because they don't propagate error bars. If they actually took the errors of the measurement, corrected for the error of the sample size, propagated it through the correlation calculation and gave a confidence at the end, you would see things like "we measured a 5% improvement (+- 50%)" , which is no data at all.

I saw Brian Cox on QI recently, and there was some point about the US government testing whether heavy doses of LSD helped schizophrenics or not. Everyone was aghast but Brian popped up with "actually I support data-based medicine; if it had been shown to help then I would be for that therapy". Now obviously this was a jokey context so I'll cut Cox some slack, but it does in fact reflect a very commonly held belief these days (that we should trust the data over our common sense telling us it's a terrible idea). And it's just obviously retarded on the face of it. If the study had shown it to help, then obviously something was wrong with the study. Medical studies are almost always so flawed that it's hard to believe any of them. Selection bias is huge, novelty and placebo effects are huge; but even if you really have controlled for all that, the other big failure is that they are too short term, and the "output" is much too narrow. You may have improved the thing you were measuring for, but done lots of other harm that you didn't see. Perhaps they did measure a decrease in certain schizophrenia symptoms (but psychotic lapses and suicides were way up; oops, that wasn't part of the output we measured).

Exercise/dieting and child-rearing are two major topics where you are just bombarded with nonsense pseudo-science "correlations" all the time.

Of course political/economic charts are useless and misleading. A classic falsehood that gets trotted out regularly is the charts showing "the economy does better under democrats" ; for one thing the sample size is just so small that it could be totally random ; for another the economy is more affected by the previous president than the current ; and in almost every case huge external factors are massively more important (what's the Fed rate, did Al Gore recently invent the internet, are we in a war or an oil crisis, etc.). People love to show that chart but it is *pure garbage* , it contains zero information. Similarly the charts about how the economy does right after a tax raise or decrease; again there are so many confounding factors and the sample size is so tiny, but more importantly tax raises tend to happen when government receipts are low (eg. economic growth is already slowing), while tax cuts tend to happen in flush times, so saying "tax cuts lead to growth" is really saying "growth leads to growth".

What I'm trying to get at in this post is not the ridiculous lack of science in all these studies and "facts", but the way that the popular press (and the semi-intellectual world of blogs and talks and magazines) use charts and graphs to present "data" to legitimize the bogus point.

I believe that any time you see a chart or graph in the popular press you should look away.

I know they are seductive and fun, and they give you a vapid conversation piece ("did you know that christmas lights are correlated with impotence?") but they in fact poison the brain with falsehoods.

Finally, back to the issue of video games and violence. I believe it is obvious on the face of it that video games contribute to violence. Of course they do. Especially at a young age, if a kid grows up shooting virtual men in the face it has to have some effect (especially on people who are already mentally unstable). Is it a big factor? Probably not; by far the biggest factor in violence is poverty, then government instability and human rights, then the gun ownership rate, the ease of gun purchasing, etc. I suspect that the general gun glorification in America is a much bigger effect, as is growing up in a home where your parents had guns, going to the shooting range as a child, rappers glorifying violence, movies and TV. Somewhere after all that, I'm sure video games contribute. The only thing we can actually say scientifically is that the effect is very small and almost impossible to measure due to the presence of much larger and highly uncertain factors.

(of course we should also recognize that these kind of crazy school shooting events are completely different than ordinary violence, and statistically are a drop in the bucket. I suspect the rare mass-murder psycho killer things are more related to a country's mental health system than anything else. Pulling out the total murder numbers as a response to these rare psychotic events is another example of using the wrong data and then glossing over the illogical jump.)

I think in almost all cases if you don't play pretend with data and just go and sit quietly and think about the problem and tap into your own brain, you will come to better conclusions.

12-21-12 | File Handle to File Name on Windows

There are a lot of posts about this floating around, most not quite right. Trying to sort it out once and for all. Note that in all cases I want to resolve back to a "final" name (that is, remove symlinks, substs, net uses, etc.) I do not believe that the methods I present here guarantee a "canonical" name, eg. a name that's always the same if it refers to the same file - that would be a nice extra step to have.

This post will be code-heavy and the code will be ugly. This code is all sloppy about buffer sizes and string over-runs and such, so DO NOT copy-paste it into production unless you want security holes. (a particular nasty point to be wary of is that many of the APIs differ in whether they take a buffer size in bytes or chars, which with unicode is different)

We're gonna use these helpers to call into windows dlls :

template <typename t_func_type>
t_func_type GetWindowsImport( t_func_type * pFunc , const char * funcName, const char * libName , bool dothrow)
{
    if ( *pFunc == 0 )
    {
        HMODULE m = GetModuleHandle(libName);
        if ( m == 0 ) m = LoadLibrary(libName); // adds extension for you
        ASSERT_RELEASE( m != 0 );
        t_func_type f = (t_func_type) GetProcAddress( m, funcName );
        if ( f == 0 && dothrow )
            throw funcName;
        *pFunc = f;
    }
    return (*pFunc); 
}

// GET_IMPORT can return NULL
#define GET_IMPORT(lib,name) (GetWindowsImport(&STRING_JOIN(fp_,name),STRINGIZE(name),lib,false))

// CALL_IMPORT throws if not found
#define CALL_IMPORT(lib,name) (*GetWindowsImport(&STRING_JOIN(fp_,name),STRINGIZE(name),lib,true))
#define CALL_KERNEL32(name) CALL_IMPORT("kernel32",name)
#define CALL_NT(name) CALL_IMPORT("ntdll",name)

I also make use of the cblib strlen, strcpy, etc. on wchars. Their implementation is obvious.

Also, for reference, to open a file handle just to read its attributes (to map its name) you use :

    HANDLE f = CreateFile(from,
                    0, // open for attribute access only, no read/write needed
                    FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
                    NULL,
                    OPEN_EXISTING,
                    FILE_FLAG_BACKUP_SEMANTICS, // needed to open directories
                    NULL);

(also works on directories).

Okay now : How to get a final path name from a file handle :

1. On Vista+ , just use GetFinalPathNameByHandle.

GetFinalPathNameByHandle gives you back a "\\?\" prefixed path, or "\\?\UNC\" for network shares.

2. Pre-Vista, lots of people recommend mem-mapping the file and then using GetMappedFileName.

This is a bad suggestion. It doesn't work on directories. It requires that you actually have the file open for read, which is of course impossible in some scenarios. It's just generally a non-robust way to get a file name from a handle.

For the record, here is the code from MSDN to get a file name from handle using GetMappedFileName. Note that GetMappedFileName gives you back an NT-namespace name, and I have factored out the bit to convert that to Win32 into MapNtDriveName, which we'll come back to later.

BOOL GetFileNameFromHandleW_Map(HANDLE hFile,wchar_t * pszFilename,int pszFilenameSize)
{
    BOOL bSuccess = FALSE;
    HANDLE hFileMap;

    pszFilename[0] = 0;

    // Get the file size.
    DWORD dwFileSizeHi = 0;
    DWORD dwFileSizeLo = GetFileSize(hFile, &dwFileSizeHi); 

    if( dwFileSizeLo == 0 && dwFileSizeHi == 0 )
    {
        lprintf(("Cannot map a file with a length of zero.\n"));
        return FALSE;
    }

    // Create a file mapping object.
    hFileMap = CreateFileMapping(hFile, 
                    NULL,
                    PAGE_READONLY,
                    0,
                    1, // only need to map one byte
                    NULL);

    if (hFileMap) 
    {
        // Create a file mapping to get the file name.
        void* pMem = MapViewOfFile(hFileMap, FILE_MAP_READ, 0, 0, 1);

        if (pMem) 
        {
            if (GetMappedFileNameW(GetCurrentProcess(), 
                    pMem,
                    pszFilename,
                    pszFilenameSize))
            {
                //pszFilename is an NT-space name :
                //pszFilename = "\Device\HarddiskVolume4\devel\projects\oodle\z.bat"

                wchar_t temp[2048];
                // ... split the "\Device\..." drive prefix off pszFilename,
                //  map it with MapNtDriveName (detailed later), build the
                //  result in temp, and copy it back to pszFilename ...

                bSuccess = TRUE;
            }
            UnmapViewOfFile(pMem);
        }
        CloseHandle(hFileMap);
    }

    return bSuccess;
}


3. There's a more direct way to get the name from file handle : NtQueryObject.

NtQueryObject gives you the name of any handle. If it's a file handle, you get the file name. This name is an NT namespace name, so you have to map it down of course.

The core code is :


typedef enum _OBJECT_INFORMATION_CLASS {
    ObjectBasicInformation, ObjectNameInformation, ObjectTypeInformation, ObjectAllInformation, ObjectDataInformation
} OBJECT_INFORMATION_CLASS;

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING;

typedef struct _OBJECT_NAME_INFORMATION {
    UNICODE_STRING Name;
    WCHAR NameBuffer[1];
} OBJECT_NAME_INFORMATION;

typedef NTSTATUS (NTAPI * t_NtQueryObject)(
    IN HANDLE ObjectHandle, IN OBJECT_INFORMATION_CLASS ObjectInformationClass, OUT PVOID ObjectInformation, IN ULONG Length, OUT PULONG ResultLength );
static t_NtQueryObject fp_NtQueryObject = 0;

    char infobuf[4096];
    ULONG ResultLength = 0;

    NTSTATUS status = CALL_NT(NtQueryObject)( hFile,
                        ObjectNameInformation,
                        infobuf, sizeof(infobuf),
                        &ResultLength );
    ASSERT_RELEASE( status == 0 ); // STATUS_SUCCESS

    OBJECT_NAME_INFORMATION * pinfo = (OBJECT_NAME_INFORMATION *) infobuf;

    wchar_t * ps = pinfo->NameBuffer;
    // pinfo->Name.Length is in BYTES , not wchars
    ps[ pinfo->Name.Length / 2 ] = 0;

    lprintf("OBJECT_NAME_INFORMATION: (%S)\n",ps);

which will give you a name like :

    OBJECT_NAME_INFORMATION: (\Device\HarddiskVolume1\devel\projects\oodle\examples\oodle_future.h)

and then you just have to pull off the drive part and call MapNtDriveName (mentioned previously but not yet detailed).

Note that there's another call that looks appealing :

NTSYSAPI NTSTATUS NTAPI NtQueryInformationFile(
    IN HANDLE FileHandle, OUT PIO_STATUS_BLOCK IoStatusBlock,
    OUT PVOID FileInformation, IN ULONG Length,
    IN FILE_INFORMATION_CLASS FileInformationClass ); // use FileNameInformation

but NtQueryInformationFile seems to always give you just the file name without the drive. In fact it seems possible to use NtQueryInformationFile and NtQueryObject to separate the drive part and path part.

That is, you get something like :

t: is substed to c:\trans

LogDosDrives prints :

T: : \??\C:\trans

we ask about :

fmName : t:\prefs.js

we get :

NtQueryInformationFile: "\trans\prefs.js"
NtQueryObject: "\Device\HarddiskVolume4\trans\prefs.js"

If there was a way to get the drive letter, then you could just use NtQueryInformationFile , but so far as I know there is no simple way, so we have to go through all this mess.

On network shares, it's similar but a little different :

y: is net used to \\charlesbpc\C$

LogDosDrives prints :

Y: : \Device\LanmanRedirector\;Y:0000000000034569\charlesbpc\C$

we ask about :

fmName : y:\xfer\path.txt

we get :

NtQueryInformationFile: "\charlesbpc\C$\xfer\path.txt"
NtQueryObject: "\Device\Mup\charlesbpc\C$\xfer\path.txt"

so in that case you could just prepend a "\" to NtQueryInformationFile , but again I'm not sure how to know that what you got was a network share and not just a directory, so we'll go through all the mess here to figure it out.

4. MapNtDriveName is needed to map an NT-namespace drive name to a Win32/DOS-namespace name.

I've found two different ways of doing this, and they seem to produce the same results in all the tests I've run, so it's unclear if one is better than the other.

4.A. MapNtDriveName by QueryDosDevice

QueryDosDevice gives you the NT name of a dos drive. This is the opposite of what we want, so we have to reverse the mapping. The way is to use GetLogicalDriveStrings which gives you all the dos drive letters, then you can look them up to get all the NT names, and thus create the reverse mapping.

Here's LogDosDrives :

void LogDosDrives()
{
    #define BUFSIZE 2048
    // Translate path with device name to drive letters.
    wchar_t szTemp[BUFSIZE];
    szTemp[0] = '\0';

    // GetLogicalDriveStrings
    //  gives you the DOS drives on the system
    //  including substs and network drives
    if (GetLogicalDriveStringsW(BUFSIZE-1, szTemp)) 
    {
      wchar_t szName[MAX_PATH];
      wchar_t szDrive[3] = L" :";

      wchar_t * p = szTemp;

      do
      {
        // Copy the drive letter to the template string
        *szDrive = *p;

        // Look up each device name
        if (QueryDosDeviceW(szDrive, szName, MAX_PATH))
        {
            lprintf("%S : %S\n",szDrive,szName);
        }

        // Go to the next NULL character.
        while (*p++);

      } while ( *p); // double-null is end of drives list
    }
}


LogDosDrives prints stuff like :

A: : \Device\Floppy0
C: : \Device\HarddiskVolume1
D: : \Device\HarddiskVolume2
E: : \Device\CdRom0
H: : \Device\CdRom1
I: : \Device\CdRom2
M: : \??\D:\misc
R: : \??\D:\ramdisk
S: : \??\D:\ramdisk
T: : \??\D:\trans
V: : \??\C:
W: : \Device\LanmanRedirector\;W:0000000000024326\radnet\raddevel
Y: : \Device\LanmanRedirector\;Y:0000000000024326\radnet\radmedia
Z: : \Device\LanmanRedirector\;Z:0000000000024326\charlesb-pc\c


Recall from the last post that "\??\" is the NT-namespace way of mapping back to the win32 namespace. Those are substed drives. The "net use" drives get the "Lanman" prefix.

MapNtDriveName using QueryDosDevice is :

bool MapNtDriveName_QueryDosDevice(const wchar_t * from,wchar_t * to)
{
    #define BUFSIZE 2048
    // Translate path with device name to drive letters.
    wchar_t allDosDrives[BUFSIZE];
    allDosDrives[0] = '\0';

    // GetLogicalDriveStrings
    //  gives you the DOS drives on the system
    //  including substs and network drives
    if (GetLogicalDriveStringsW(BUFSIZE-1, allDosDrives)) 
    {
        wchar_t * pDosDrives = allDosDrives;

        do
        {
            // Copy the drive letter to the template string
            wchar_t dosDrive[3] = L" :";
            *dosDrive = *pDosDrives;

            // Look up each device name
            wchar_t ntDriveName[BUFSIZE];
            if ( QueryDosDeviceW(dosDrive, ntDriveName, ARRAY_SIZE(ntDriveName)) )
            {
                size_t ntDriveNameLen = strlen(ntDriveName);

                if ( strnicmp(from, ntDriveName, ntDriveNameLen) == 0
                         && ( from[ntDriveNameLen] == '\\' || from[ntDriveNameLen] == 0 ) )
                {
                    // found it : dos drive letter + the rest of the path
                    strcpy(to,dosDrive);
                    strcat(to,from + ntDriveNameLen);
                    return true;
                }
            }

            // Go to the next NULL character.
            while (*pDosDrives++);

        } while ( *pDosDrives); // double-null is end of drives list
    }

    return false;
}

4.B. MapNtDriveName by IOControl :

There's a more direct way using DeviceIoControl. You just send a message to the "MountPointManager" which is the guy who controls these mappings. (this is from "Mehrdad" on Stackoverflow) :

struct MOUNTMGR_TARGET_NAME { USHORT DeviceNameLength; WCHAR DeviceName[1]; };
struct MOUNTMGR_VOLUME_PATHS { ULONG MultiSzLength; WCHAR MultiSz[1]; };

#define MOUNTMGRCONTROL_TYPE ((ULONG) 'm')
#define IOCTL_MOUNTMGR_QUERY_DOS_VOLUME_PATH \
    CTL_CODE(MOUNTMGRCONTROL_TYPE, 12, METHOD_BUFFERED, FILE_ANY_ACCESS)

union ANY_BUFFER {
    MOUNTMGR_TARGET_NAME TargetName;
    MOUNTMGR_VOLUME_PATHS TargetPaths;
    char Buffer[4096];
};

bool MapNtDriveName_IoControl(const wchar_t * from,wchar_t * to)
{
    ANY_BUFFER nameMnt;
    int fromLen = strlen(from);
    // DeviceNameLength is in *bytes*
    nameMnt.TargetName.DeviceNameLength = (USHORT) ( 2 * fromLen );
    strcpy(nameMnt.TargetName.DeviceName, from );
    HANDLE hMountPointMgr = CreateFile( ("\\\\.\\MountPointManager"),
        0, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
        OPEN_EXISTING, 0, NULL);
    ASSERT_RELEASE( hMountPointMgr != 0 );
    DWORD bytesReturned;
    BOOL success = DeviceIoControl(hMountPointMgr,
        IOCTL_MOUNTMGR_QUERY_DOS_VOLUME_PATH, &nameMnt,
        sizeof(nameMnt), &nameMnt, sizeof(nameMnt),
        &bytesReturned, NULL);
    CloseHandle(hMountPointMgr);

    if ( success && nameMnt.TargetPaths.MultiSzLength > 0 )
    {
        strcpy(to,nameMnt.TargetPaths.MultiSz);
        return true;    
    }
    else
        return false;
}

5. Fix MapNtDriveName for network names.

I said that MapNtDriveName_IoControl and MapNtDriveName_QueryDosDevice produced the same results and both worked. Well, that's only true for local drives. For network drives they both fail, but in different ways. MapNtDriveName_QueryDosDevice just won't find network drives, while MapNtDriveName_IoControl will hang for a long time and eventually time out with a failure.

We can fix it easily though because the NT path for a network share contains the valid win32 path as a suffix, so all we have to do is grab that suffix.

bool MapNtDriveName(const wchar_t * from,wchar_t * to)
{
    // hard-code network drives :
    //  the tail of the NT path is a valid win32 path, we just need to
    //  turn the device name into "\" so "\server\share" becomes "\\server\share"
    if ( strisame(from,L"\\Device\\Mup") || strisame(from,L"\\Device\\LanmanRedirector") )
    {
        strcpy(to,L"\\");
        return true;
    }

    // either one :
    //return MapNtDriveName_IoControl(from,to);
    return MapNtDriveName_QueryDosDevice(from,to);
}

This just takes the NT-namespace network paths, like :

\Device\Mup\charlesbpc\C$\xfer\path.txt

and maps them to win32 network paths :

\\charlesbpc\C$\xfer\path.txt
And we're done.

12-21-12 | File Name Namespaces on Windows

A little bit fast and loose but trying to summarize some insanity from a practical point of view.

Windows has various "namespaces" or classes of file names :

1. DOS Names :

"c:\blah" and such.

Max path of 260 including drive and trailing null. Different cases refer to the same file, *however* different unicode encodings of the same character do *NOT* refer to the same file (eg. things like "accented e" and "e + accent previous char" are different files). See previous posts about code pages and general unicode disaster on Windows.

I'm going to ignore the 8.3 legacy junk, though it still has some funny lingering effects on even "long" DOS names. (for example, the longest path name length allowed is 244 characters, because they require room for an 8.3 name after the longest path).

2. Win32 Names :

This includes all DOS names plus all network paths like "\\server\blah".

The Win32 APIs can also take the "\\?\" names, which are sort of a way of peeking into the lower-level NT names.

Many people incorrectly think the big difference with the "\\?\" names is that the length can be much longer (32768 instead of 260), but IMO the bigger difference is that the name that follows is treated as raw characters. That is, you can have "/" or "." or ".." or whatever in the name - they do not get any processing. Very scary. I've seen lots of code that blindly assumes it can add or remove "\\?\" with impunity - that is not true!

"\\?\c:\" is a local path

"\\?\UNC\server\blah" is a network name like "\\server\blah"

Assuming you have your drives shared, you can get to yourself as "\\localhost\c$\"

I think the "\\?\" namespace is totally insane and using it is a Very Bad Idea. The vast majority of apps will do the wrong thing when given it, and many will crash.

3. NT names :

Win32 is built on "ntdll" which internally uses another style of name. They start with "\" and then refer to the drivers used to access them, like :

\Device\HarddiskVolume1\devel\projects\oodle\examples\oodle_future.h

In the NT namespace network shares are named :

Pre-Vista :

\Device\LanmanRedirector\<some per-user stuff>\server\share

Vista+ : Lanman way and also :

\Device\Mup\server\share

And the NT namespace has a symbolic link to the entire Win32 namespace under "\Global??\" , so

\Global??\c:\blah

is also a valid NT name, (and "\??\" is sometimes valid as a short version of "\Global??\").

What fun.

12-21-12 | Coroutine-centric Architecture

I've been talking about this for a while but maybe haven't written it all clearly in one place. So here goes. My proposal for a coroutine-centric architecture (for games).

1. Run one thread locked to each core.

(NOTE : this is only appropriate on something like a game console where you are in control of all the threads! Do not do this on an OS like Windows where other apps may also be locking to cores, and you have the thread affinity scheduler problems, and so on).

The one-thread-per-core set of threads is your thread pool. All code runs as "tasks" (or jobs or whatever) on the thread pool.

The threads never actually do ANY OS Waits. They never switch. They're not really threads, you're not using any of the OS threading any more. (I suppose you still are using the OS to handle signals and such, and there are probably some OS threads that are running which will grab some of your time, and you want that; but you are not using the OS threading in your code).

2. All functions are coroutines. A function with no yields in it is just a very simple coroutine. There's no special syntax to be a coroutine or call a coroutine.

All functions can take futures or return futures. (a future is just a value that's not yet ready). Whether you want this to be totally implicit or not is up to your taste about how much of the operations behind the scenes are visible in the code.

For example if you have a function like :

int func(int x);

and you call it with a future<int> :

future<int> y;

it is promoted automatically to :

future<int> func( future<int> x )
{
    yield x;
    return func( x.value );
}

When you call a function, it is not a "branch", it's just a normal function call. If that function yields, it yields the whole current coroutine. That is, it works just like threads and waits, but with coroutines and yields.

To branch I would use a new keyword, like "start" :

future<int> some_async_func(int x);

int current_func(int y)
{
    // execution will step directly into this function;
    // when it yields, current_func will yield

    future<int> f1 = some_async_func(y);

    // with "start" a new coroutine is made and enqueued to the thread pool
    // my coroutine immediately continues to the f1.wait
    future<int> f2 = start some_async_func(y);

    return f1.wait();
}

"start" should really be an abbreviation for a two-phase launch, which allows a lot more flexibility. That is, "start" should be a shorthand for something like :

start some_async_func(y);

// is short for the two-phase :

coro * c = new coro( some_async_func(y) );
start( c );

because that allows batch-starting, and things like setting dependencies after creating the coro, which I have found to be very useful in practice. eg :

coro * c[32];

for(i in 32)
{
    c[i] = new coro( );
    if ( i > 0 )
        c[i-1]->depends( c[i] );
}

start_all( c, 32 );

Batch starting is one of those things that people often leave out. Starting tasks one by one is just like waiting for them one by one (instead of using a wait_all), it causes bad thread-thrashing (waking up and going back to sleep over and over, or switching back and forth).

3. Full stack-saving is crucial.

For this to be efficient you need a very small minimum stack size (4k is probably good) and you need stack-extension on demand.

You may have lots of pending coroutines sitting around and you don't want them gobbling all your memory with 64k stacks.

Full stack saving means you can do full variable capture for free, even in a language like C where tracking references is hard.

4. You stop using the OS mutex, semaphore, event, etc. and instead use coroutine variants.

Instead of a thread owning a lock, a coroutine owns a lock. When you block on a lock it's a yield of the coroutine instead of a full OS wait.

Getting access to a mutex or semaphore is an event that can trigger coroutines being run or resumed. eg. it's a future just like the return from an async procedural call. So you can do things like :

future<int> y = some_async_func();

yield( y , my_mutex.when_lock() );

which yields your coroutine until the joint condition is met that the async func is done AND you can get the lock on "my_mutex".

Joint yields are very important because they prevent unnecessary coroutine wakeup. Coroutine thrashing is not nearly as bad as thread thrashing (avoiding thread thrashing is one of the big advantages of the coroutine-centric architecture, perhaps the biggest), but it's still worth avoiding.

You must have coroutine versions of all the ops that have delays (file IO, networking, GPU, etc) so that you can yield on them instead of doing thread-waits.

5. You must have some kind of GC.

Because coroutines will constantly be capturing values, you must ensure their lifetime is >= the life of the coroutine. GC is the only reasonable way to do this.

I would also go ahead and put an RW-lock in every object as well since that will be necessary.

6. Dependencies and side effects should be expressed through args and return values.

You really need to get away from funcs like

void DoSomeStuff(void);

that have various un-knowable inputs and outputs. All inputs & outputs need to be values so that they can be used to create dependency chains.

When that's not directly possible, you must use a convention to express it. eg. for file manipulation I recommend using a string containing the file name to express the side effects that go through the file system (eg. for Rename, Delete, Copy, etc.).

7. Note that coroutines do not fundamentally alter the difficulties of threading.

You still have races, deadlocks, etc. Basic async ops are much easier to write with coroutines, but they are no panacea and do not try to be anything other than a nicer way of writing threading. (eg. they are not transactional memory or any other auto-magic).

to be continued (perhaps) ....

Add 3/15/13 : 8. No static size anything. No resources you can run out of. This is another "best practice" that goes with modern thread design that I forgot to list.

Don't use fixed-size queues for thread communication; they seem like an optimization or simplification at first, but if you can ever hit the limit (and you will) they cause big problems. Don't assume a fixed number of workers or a maximum number of async ops in flight, this can cause deadlocks and be a big problem.

The thing is that a "coroutine centric" program is no longer so much like a normal imperative C program. It's moving towards a functional program where the path of execution is all nonlinear. You're setting a big graph to evaluate, and then you just need to be able to hit "go" and wait for the graph to close. If you run into some limit at some point during the graph evaluation, it's a big mess figuring out how to deal with that.

Of course the OS can impose limits on you (eg. running out of memory) and that is a hassle you have to deal with.

12-21-12 | Coroutines From Lambdas

Being pedantic while I'm on the topic. We've covered this before.

Any language with lambdas (that can be fired when an async completes) can simulate coroutines.

Assume we have some async function call :

future<int> AsyncFunc( int x );

which sends the integer off over the net (or whatever) and eventually gets a result back. Assume that future<> has an "AndThen" which schedules a function to run when it's done.

Then you can write a sequence of operations like :

future<int> MySequenceOfOps( int x1 )
{
    future<int> f1 = AsyncFunc(x1);

    return f1.AndThen( [](int x2){

        x2 *= 2;

        future<int> f2 = AsyncFunc(x2);

        return f2.AndThen( [](int x3){

            x3 --;

            return x3;

        } );
    } );
}

with a little munging we can make it look more like a standard coroutine :

#define YIELD(future,args)  return future.AndThen( [](args){

future<int> MySequenceOfOps( int x1 )
{
    future<int> f1 = AsyncFunc(x1);

    YIELD(f1,int x2)

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    YIELD(f2,int x3)

    x3 --;

    return x3;

    } );
    } );
}


the only really ugly bit is that you have to put a bunch of scope-closers at the end to match the number of yields.

This is really what any coroutine is doing under the hood. When you hit a "yield", what it does is take the remainder of the function and package that up as a functor to get called after the async op that you're yielding on is done.

Coroutines from lambdas have a few disadvantages, aside from the scope-closers annoyance. It's ugly to do anything but simple linear control flow. The above example is the very simple case of "imperative, yield, imperative, yield" , but in real code you want to have things like :

if ( bool )


while ( some condition )


which while probably possible with lambda-coroutines, gets ugly.

An advantage of lambda-coroutines is that if you're in a language with variable-capturing lambdas, you get that capture in your coroutines as well.

12-18-12 | Async/Await ; Microsoft's Coroutines

As usual I'm way behind in knowing what's going on in the world. Lo and behold, MS have done a coroutine system very similar to mine, which they are now pushing as a fundamental technology of WinRT. Dear lord, help us all. (I guess this stuff has been in .NET since 2008 or so, but with WinRT it's now being pushed on C++/CX as well)

I'm just catching up on this, so I'm going to make some notes about things that took a minute to figure out. Correct me where I'm wrong.

For the most part I'll be talking in C# lingo, because this stuff comes from C# and is much more natural there. There are C++/CX versions of all this, but they're rather more ugly. Occasionally I'll dip into what it looks like in CX, which is where we start :

1. "hat" (eg. String^)

Hat is a pointer to a ref-counted object. The ^ means inc and dec ref in scope. In cbloom code String^ is StringPtr.

The main note : "hat" is a thread-safe ref count, *however* it implies no other thread safety. That is, the ref-counting and object destruction are thread safe / atomic, but derefs are not :

Thingy^ t = Get(); // thread safe ref increment here
t->var1 = t->var2; // non-thread safe var accesses!

There is no built-in mutex or anything like that for hat-objects.

2. "async" func keyword

Async is a new keyword that indicates a function might be a coroutine. It does not make the function into an asynchronous call. What it really is is a "structify" or "functor" keyword (plus a "switch"). Like a C++ lambda, the main thing the language does for you is package up all the local variables and function arguments and put them all in a struct. That is (playing rather loosely with the translation for brevity) :

async void MyFunc( int x )
{
    string y;

    stuff();
}

[ is transformed to : ]

struct MyFunc_functor
{
    int x;
    string y;

    void Do() { stuff(); }
};

void MyFunc( int x )
{
    // allocate functor object :
    MyFunc_functor * f = new MyFunc_functor();
    // copy in args :
    f->x = x;
    // run it :
    f->Do();
}
So obviously this functor that captures the function's state is the key to making this into an async coroutine.

It is *not* stack saving. However for simple usages it is the same. Obviously crucial to this is using a language like C# which has GC so all the references can be traced, and everything is on the heap (perhaps lazily). That is, in C++ you could have pointers and references that refer to things on the stack, so just packaging up the args like this doesn't work.

Note that in the above you didn't see any task creation or asynchronous func launching, because it's not. The "async" keyword does not make a function async, all it does is "functorify" it so that it *could* become async. (this is in contrast to C++11 where "async" is an imperative to "run this asynchronously").

3. No more threads.

WinRT is pushing very hard to remove manual control of threads from the developer. Instead you have an OS thread pool that can run your tasks.

Now, I actually am a fan of this model in a limited way. It's the model I've been advocating for games for a while. To be clear, what I think is good for games is : run 1 thread per core. All game code consists of tasks for the thread pool. There are no special purpose threads, any thread can run any type of task. All the threads are equal priority (there's only 1 per core so this is irrelevant as long as you don't add extra threads).

So, when a coroutine becomes async, it just enqueues to a thread pool.

There is this funny stuff about execution "context", because they couldn't make it actually clean (so that any task can run any thread in the pool); a "context" is a set of one or more threads with certain properties; the main one is the special UI context, which only gets one thread, which therefore can deadlock. This looks like a big mess to me, but as long as you aren't actually doing C# UI stuff you can ignore it.

See ConfigureAwait etc. There's lots of control you might want that's intentionally missing. Things like how many real threads are in your thread pool; also things like "run this task on this particular thread" are forbidden (or even just "stay on the same thread"; you can only stay on the same context, which may be several threads).

4. "await" is a coroutine yield.

You can only use "await" inside an "async" func because it relies on the structification.

It's very much like the old C-coroutines using switch trick. await is given an Awaitable (an interface to an async op). At that point your struct is enqueued on the thread pool to run again when the Awaitable is ready.
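To see what the structify+switch amounts to, here's the transformation done by hand in Python. This is a toy sketch with invented names : the function's locals become fields of an object, and each await point becomes a case in a step function that the scheduler re-enters :

```python
# Hand-rolled version of what "async/await" compiles to : locals live in
# a functor object, and step() dispatches on a state variable -- the
# moral equivalent of the old C switch trick.  All names are invented.
class MySequenceOfOps:
    def __init__(self, x1):
        self.state = 0
        self.x = x1        # captured "locals"
        self.result = None

    def step(self, awaited_value=None):
        # re-entered each time an awaited op completes
        if self.state == 0:
            self.state = 1
            return ("await", self.x)       # yield : await AsyncFunc(x)
        elif self.state == 1:
            self.x = awaited_value * 2
            self.state = 2
            return ("await", self.x)       # yield : await AsyncFunc(x)
        else:  # state 2 : final resumption
            self.result = awaited_value - 1
            return ("done", self.result)

# trivial "scheduler" : run the coroutine, echoing awaited ops back
def run(coro):
    tag, val = coro.step()
    while tag == "await":
        tag, val = coro.step(val)   # pretend the async op returned val
    return val

final = run(MySequenceOfOps(3))   # (3*2) - 1 = 5
```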

"await" is a yield, so you may return to your caller immediately at the point that you await.

Note that because of this, "async/await" functions cannot have return values (* except for Task which we'll see next).

Note that "await" is the point at which an "async" function actually becomes async. That is, when you call an async function, it is *not* initially launched to the thread pool, instead it initially runs synchronously on the calling thread. (this is part of a general effort in the WinRT design to make the async functions not actually async whenever possible, minimizing thread switches and heap allocations). It only actually becomes an APC when you await something.

(aside : there is a hacky "await Task.Yield()" mechanism which kicks off your synchronous invocation of a coroutine to the thread pool without anything explicit to await)

I really don't like the name "await" because it's not a "wait" , it's a "yield". The current thread does not stop running, but the current function might be enqueued to continue later. If it is enqueued, then the current thread returns out of the function and continues in the calling context.

One major flaw I see is that you can only await one async; there's no yield_all or yield_any. Because of this you see people writing atrocious code like :

await x;
await y;
await z;

Now they do provide a Task.WhenAll and Task.WhenAny, which create proxy tasks that complete when the desired condition is met, so it is possible to do it right (but much easier not to).
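For reference, the same all/any combinators exist in other futures systems; here's the right-way version sketched with Python's asyncio (this is Python's API, not the .NET one) : gather is the WhenAll analog, wait with FIRST_COMPLETED is the WhenAny analog :

```python
import asyncio

async def slow(n, delay):
    await asyncio.sleep(delay)
    return n

async def main():
    # WhenAll analog : one awaitable that completes when all complete
    all_results = await asyncio.gather(
        slow(1, 0.02), slow(2, 0.01), slow(3, 0.0))

    # WhenAny analog : one awaitable that completes when the first does
    tasks = [asyncio.create_task(slow(n, d))
             for n, d in ((10, 0.05), (20, 0.0))]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED)
    first = done.pop().result()
    for t in pending:
        t.cancel()
    return all_results, first

results, first = asyncio.run(main())
```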

Of course "await" might not actually yield the coroutine; if the thing you are awaiting is already done, your coroutine may continue immediately. If you await a task that's not done (and also not already running), it might be run immediately on your thread. They intentionally don't want you to rely on any certain flow control, they leave it up to the "scheduler".

5. "Task" is a future.

The Task< > template is a future (or "promise" if you like) that provides a handle to get the result of a coroutine when it eventually completes. Because of the previously noted problem that "await" returns to the caller immediately, before your final return, you need a way to give the caller a handle to that result.

IAsyncOperation< > is the lower level C++/COM version of Task< > ; it's the same thing without the helper methods of Task.

IAsyncOperation.Status can be polled for completion. IAsyncOperation.GetResults can only be called after completed. IAsyncOperation.Completed is a callback function you can set to be run on completion. (*)

So far as I can tell there is no simple way to just Wait on an IAsyncOperation (you can "await" it). Obviously they are trying hard to prevent you from blocking threads in the pool. The method I've seen is to wrap it in a Task and then use Task.Wait().

(* = the .Completed member is a good example of a big annoyance : they play very fast-and-loose with documenting the thread safety semantics of the whole API. Now, I presume that for .Completed to make any sense it must be a thread-safe accessor, and it must be atomic with Status. Otherwise there would be a race where my completion handler would not get called. Presumably your completion handler is called once and only once. None of this is documented, and the same goes across the whole API. They just expect it all to magically work without you knowing how or why.)
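That trio (Status / GetResults / Completed) is just the standard future interface; for comparison, here it is in Python's concurrent.futures (Python's API, not WinRT's; note Python's result() does give you the blocking Wait that IAsyncOperation lacks) :

```python
from concurrent.futures import ThreadPoolExecutor

calls = []  # records what the completion callback saw

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(lambda: 6 * 7)   # enqueue to the thread pool
    # ~ .Completed : fires once when the future completes (documented to
    # fire immediately if the future is already done when registered)
    fut.add_done_callback(lambda f: calls.append(f.result()))
    result = fut.result()    # ~ GetResults, except it blocks (a Wait)
    finished = fut.done()    # ~ .Status polling
```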

(it seems that .NET used to have a Future< > as well, but that's gone since Task< > is just a future and having both is pointless (?))

So, in general if I read it as :

"async" = "coroutine"  (hacky C switch + functor encapsulation)

"await" = yield

"Task" = future

then it's pretty intuitive.

What's missing?

Well there are some places that are syntactically very ugly, but possible. (eg. working with IAsyncOperation/IAsyncInfo in general is super ugly; also the lack of simple "await x,y,z" is a mistake IMO).

There seems to be no way to easily automatically promote a synchronous function to async. That is, if you have something like :

int func1(int x) { return x+1; }

and you want to run it on a future of an int (Task< int >) , what you really want is just a simple syntax like :

future<int> x = some async func that returns an int

future<int> y = start func1( x );

which makes a coroutine that waits for its args to be ready and then runs the synchronous function. (maybe it's possible to write a helper template that does this?)
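Such a helper is only a few lines in most futures systems; here's a sketch with Python's asyncio, where the name "start" is made up to match the pseudocode above :

```python
import asyncio

def start(fn, *awaitables):
    # promote a plain synchronous fn to a coroutine that waits for its
    # args to be ready and then runs fn; returns a future (Task)
    async def runner():
        args = await asyncio.gather(*awaitables)
        return fn(*args)
    return asyncio.ensure_future(runner())

def func1(x):            # an ordinary synchronous function
    return x + 1

async def some_async_int():   # stand-in for "some async func"
    await asyncio.sleep(0)
    return 41

async def main():
    x = asyncio.create_task(some_async_int())  # future<int> x
    y = start(func1, x)                        # future<int> y = start func1(x)
    return await y

result = asyncio.run(main())   # func1 ran only once x was ready
```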

Now it's tempting to do something like :

future<int> x = some async func that returns an int

int y = func1( await x );

and you see that all the time in example code, but of course that is not the same thing at all and has many drawbacks (it waits immediately even though "y" might not be needed for a while, it doesn't allow you to create async dependency chains, it requires you are already running as a coroutine, etc).

The bigger issue is that it's not a real stackful coroutine system, which means it's not "composable", something I've written about before :
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

Specifically, a coroutine cannot call another function that does the await. This makes sense if you think of the "await" as being the hacky C-switch-#define thing, not a real language construct. The "async" on the func is the "switch {" and the "await" is a "case ". You cannot write utility functions that are usable in coroutines and may await.

To call functions that might await, they must be run as their own separate coroutine. When they await, they block their own coroutine, not your calling function. That is :

int helper( bool b , AsyncStream s )
{
    if ( b )
        return 0;

    int x = await s.Get<int>();
    return x + 10;
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = helper( true, s );
    return x;
}

The idea here is that "myfunc1" is a coroutine, it calls a function ("helper") which does a yield; that yields out of the parent coroutine (myfunc1). That does not work and is not allowed. It is what I would like to see in a good coroutine-centric language. Instead you have to do something like :

async Task<int> helper( bool b , AsyncStream s )
{
    if ( b )
        return 0;

    int x = await s.Get<int>();
    return x + 10;
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = await helper( true, s );
    return x;
}

Here "helper" is its own coroutine, and we have to block