

04-05-12 | DXT is not enough - Part 2

As promised last time, a bit of rambling on the future.

1. R-D optimized DXTC. Sticking with DXT encoding, this is certainly the right way to make DXTC smaller. I've been dancing around this idea for a while, but it wasn't until CRUNCH came out that it really clicked.

Imagine you're doing something like DXT1 + LZ. The DXT1 creates a 4 bpp (bits per pixel) output, and the LZ makes it smaller, maybe to 2-3 bpp. But, depending on what you do in your DXT1, you get different output sizes. For example, obviously, if you make a solid color block that has all indices of 0, then that will be smaller after LZ than a more complex block.

That is, we think of DXT1 as being a fixed size encoding, so the optimizers I wrote for it a while ago were just about optimizing quality. But with a back end, it's no longer a fixed size encoding - some choices are smaller than others.

So the first thing you can do is just to consider size (R) as well as quality (D) when making a choice about how to encode a block for DXTC. Often there are many ways of encoding the same data with only very tiny differences in quality, but they may have very different rates.

One obvious case is when a block only has one or two colors in it; the smallest encoding would be to just send those colors as the end points, and then your indices are only 0 or 1 (selecting the ends). Often a better quality encoding can be found by sending the end point colors outside the range of the block, and using indices 2 and 3 to select the interpolated 1/3 and 2/3 points.

Even beyond that you might want to try encodings of a block that are definitely "bad" in terms of quality, eg. sending a solid color block when the original data was not solid color. This is intentionally introducing loss to get a lower bit rate.

The correct way to do this is with an R-D optimized encoder. The simplest way to do that is using Lagrange multipliers and optimizing the cost J = R + lambda * D.

There are various difficulties with this in practice; for one thing, exposing lambda to clients is unintuitive. Another is that (good) DXTC encoding is already quite slow, so making the optimization metric J instead of D makes it even slower. For many simple back-end coders (like LZ), it's hard to measure R for a single block. And adaptive back-ends make parallel DXTC solvers difficult.
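To make the Lagrange step concrete, here's a minimal sketch (mine, not the Oodle code) of choosing among candidate encodings of one 4x4 block by minimizing J. The candidate struct and the rate estimate are stand-ins; you'd plug in your real encoder and your real back-end rate model :


#include <vector>
#include <cfloat>

struct BlockCandidate
{
    unsigned char dxt1[8];  // two 565 endpoints + 16 2-bit indices
    double D;               // distortion, eg. SSD vs. the original 4x4 block
    double R;               // estimated bits after the back end (LZ or whatever)
};

// pick the candidate minimizing J = R + lambda * D ;
// larger lambda weights quality more, smaller lambda weights rate more
int choose_candidate( const std::vector<BlockCandidate> & candidates, double lambda )
{
    int best = -1;
    double bestJ = DBL_MAX;
    for ( int i = 0; i < (int)candidates.size(); i++ )
    {
        double J = candidates[i].R + lambda * candidates[i].D;
        if ( J < bestJ ) { bestJ = J; best = i; }
    }
    return best;
}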

2. More generally we should ask why we are stuck with trying to optimize DXTC at all. I believe the answer is the preferential treatment that DXTC gets from current hardware. How could we get away from that?

I believe you could solve it by making the texture fetch more programmable. Currently texture fetch (and decode) is one of the few bits of GPUs that is still totally fixed function. DXTC encoded blocks are fetched and decoded into a special cache on the texture unit. This means that DXTC compressed textures can be rendered from directly, and also that rendering with DXTC compressed textures is actually faster than rendering from RGB textures due to the decreased memory bandwidth needs.

What we want is future hardware to make this part of the pipeline programmable. One possibility is like this : Give the texture unit its own little cache of RGB 4x4 blocks that it can fetch from. When you try to read a bit of texture that's not in the cache, it runs a "texture fetch shader" similar to a pixel shader or whatever, which outputs a 4x4 RGB block. So for example a texture fetch shader could decode DXTC. But it could also decode JPEG, or whatever.
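To make the "texture fetch shader" idea concrete, here's a plain CPU sketch (mine) of what such a shader would do for DXT1 : take one 8-byte block and produce a 4x4 block of RGB pixels. The interpolant rounding here is just one of several approximations real hardware uses :


#include <stdint.h>

static void decode_565( uint16_t c, uint8_t rgb[3] )
{
    rgb[0] = (uint8_t)( ( ( c >> 11 ) & 31 ) * 255 / 31 );
    rgb[1] = (uint8_t)( ( ( c >>  5 ) & 63 ) * 255 / 63 );
    rgb[2] = (uint8_t)( ( ( c       ) & 31 ) * 255 / 31 );
}

// block = 8 bytes : c0 (16 bits LE), c1 (16 bits LE), then 32 bits of 2-bit indices
// out   = 16 pixels * 3 bytes, row major
void dxt1_decode_block( const uint8_t block[8], uint8_t out[16*3] )
{
    uint16_t c0 = (uint16_t)( block[0] | ( block[1] << 8 ) );
    uint16_t c1 = (uint16_t)( block[2] | ( block[3] << 8 ) );

    uint8_t palette[4][3];
    decode_565( c0, palette[0] );
    decode_565( c1, palette[1] );

    for ( int k = 0; k < 3; k++ )
    {
        if ( c0 > c1 )  // four-color mode : 1/3 and 2/3 interpolants
        {
            palette[2][k] = (uint8_t)( ( 2*palette[0][k] + palette[1][k] + 1 ) / 3 );
            palette[3][k] = (uint8_t)( ( palette[0][k] + 2*palette[1][k] + 1 ) / 3 );
        }
        else            // three-color mode : midpoint plus black
        {
            palette[2][k] = (uint8_t)( ( palette[0][k] + palette[1][k] ) / 2 );
            palette[3][k] = 0;
        }
    }

    uint32_t indices = (uint32_t)block[4] | ( (uint32_t)block[5] << 8 )
                     | ( (uint32_t)block[6] << 16 ) | ( (uint32_t)block[7] << 24 );
    for ( int i = 0; i < 16; i++ )
    {
        int sel = ( indices >> ( 2*i ) ) & 3;
        out[i*3+0] = palette[sel][0];
        out[i*3+1] = palette[sel][1];
        out[i*3+2] = palette[sel][2];
    }
}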


03-29-12 | Computer Notes to Self

1. If your TEMP env var is set to anything other than "C:\Users\you\AppData\Local\Temp", some stupid apps (eg. Windows Installer) may fail. This failure can show up in some weird ways such as "access denied" type errors.

2. Some dumb apps can fail when run on a subst'ed drive (such as Installer).

3. Windows crash dumps don't work unless you have enough virtual memory. They claim 16M is enough.

4. Once in a while I run Procmon and filter only for writes to see if there is any fucking rogue service that's thrashing my disk (such as Indexing or Superfetch or any of that bloody rot). This time I found that IpHlpSvc was logging tons of shite. You can disable it thusly :

regedit
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Tracing\IpHlpSvc\Enable-FileTracing
value 0

5. The basic process for examining a crash dump is this :


Run WinDbg

Set symbol search path to :

"SRV*c:\windows\symbols*http://msdl.microsoft.com/download/symbols"

(if you do it after loading the .dmp, then use the command ".reload" )

Load the .dmp, probably from "c:\windows\minidump"

command :

!analyze -v

command "lmv" will list drivers with info

6. Windows comes with a "driver verifier" (verifier.exe). It's pretty cool. If you enable all the checks on all your drivers, it will make your computer too slow to be usable. What I do is enable it for all the non-Microsoft drivers, and that seems to be fast enough to stand. What it does is sort of stress the drivers so that when one of them does something bad, you get a blue screen and crash dump rather than just a hard freeze with no ability to debug. It also enables lots of memory corruption and overrun checks on the drivers (it seems to force a debug allocator on them which puts guard pages around allocs; you may wind up with BSODs due to memory trashes even on a system that is apparently stable).

7. I wanted to reduce the number of drivers I had to examine to just the ones I actually use, and was somewhat surprised to find almost a hundred drivers installed on my machine but disabled. The biggest culprit is USB; every time you plug something in, it installs a custom driver and then you get it forever. You can get rid of them thusly :

SET DEVMGR_SHOW_NONPRESENT_DEVICES=1

open Device Manager
Menu -> View -> Show hidden devices

now you should see lots of crud ghosted out.


03-28-12 | The Internet is Broken Part 714

I've given up on Google; the search seems to get worse every day. It's so annoying the way it "fixes" what I'm searching for. (eg. I was doing some searches about "flashing" (the metal strips used to keep water out) and it wants to give me results about fucking Flash).

So I'm trying a full switch to DuckDuckGo. A while ago the horribleness of Google as a home page motivated me to make my own home page. Now the search box in my custom home page is :


<form method="get" action="https://duckduckgo.com/" name="z2">

    <input type="hidden" name="kb" value="n"/>
    <input type="hidden" name="ki" value="-1"/>
    <input type="hidden" name="kp" value="-1"/>
    <input type="hidden" name="kz" value="-1"/>

<input type="text" name="q" size="70"
 maxlength="255" value="" />

        <select name="sites">
          <option value="" SELECTED>Duck</option>
          <option value="cbloomrants.blogspot.com">cbloomrants</option>
          <option value="thepiratebay.se">piratebay</option>
          <option value="www.youtube.com">youtube</option>
          <option value="en.wikipedia.org">wikipedia</option>
          <option value="www.amazon.com">amazon</option>
          <option value="citeseerx.ist.psu.edu">citeseerx</option>
        </select>
<!-- <input type="submit" value="site" /> -->

</form>

So far the experience is kind of meh. On the plus side, Duck doesn't "fix" your search, and it doesn't give your phone number to marketers every time you use the internet. On the minus side, it's slower and the quality of search results is slightly lower. (though, while it has slightly less good results, it also doesn't have so many fake advertising/spam/stub results as Google, which is very prone to search-jackers).

For my reference, the way you set your Firefox URL line to be an "I Feel Ducky" search is thusly :


about:config
keyword.url
"https://duckduckgo.com/?q=! "
      (note the space at the end)

In other "the internet is fucked" news, OMPF went away a while ago and took lots of good information with it. Sadly I can't get archive.org to find it's data. (and of course OMPF.org is domain-parked with some fucking spam site, and archive.org has made a very poor choice of defaulting to giving you the current site when they don't have it in the archive; so if you go trying various URL's to try to find the goods, you get handed to the spammers; great).

In particular, at the moment, I want these :


http://ompf.org/forum/viewtopic.php?f=11&t=836
http://ompf.org/forum/viewtopic.php?p=5916&sid=286260f501881f2429c87f3f31f639d7#p5916

but more generally I would like a full download of ompf.org at its latest good time.

And the more general point about internet fuckage (which I have made before) is - What fucking good is a URL ? The content keeps changing or even disappears. It's like giving directions to a sand painting in the Sahara. Oh thanks, by the time I actually get around to wanting to read it, it will be gone.

Is there no fucking way to download web pages? Am I the only person in the world who thinks the internet was better when it was plain text? (this is without even getting into the horror that is Flash or even HTML5 or web pages with mouse-over menus; anybody who puts mouse-over menu drop-downs in a web page should get punched in the cock)

And finally in this edition of "god I hate the internet and yet I cannot stop suckling on its sweet sweet teat" is the change to piratebay. Piratebay now only provides magnet links in some misguided attempt to delay prosecution. (it is inevitable that they will be shut down by the forces of American business, acting through their proxy the American government). This is personally very annoying for me, because I download torrent files from various machines but then only actually run bittorrent on my server; with magnet links you don't get the torrent files to pass around (unless you actually run a torrent client and manually extract the torrent file). Bleh.

(BTW / semi-unrelated rant / the entire way that computer security is designed is so fucking retarded for the way 99.99% of the world actually uses computers. Having separate user accounts is useless and creates all kinds of annoyance and problems. But if you just had a "sandboxed web browser" and a non-sandboxed web browser, that would eliminate almost every security breach that real people actually face. How about just a "hmm, this exe I downloaded might be iffy; run it in supervised mode where anything it tries to do to my registry or settings or whatever I get to manually okay". WTF, no of course you don't get the simple obvious solutions like that)

(and why in fuck do we still let web browsers make popups, redirect, or resize windows? How can you possibly think that's a good idea? It's like having a feature in an iPod to let the music electrocute you. Some jackass is of course going to use that feature in a bad way, so just fucking take it out.)


03-27-12 | DXT is not enough

I've been sent this GDC talk a few times now, so I will write some notes about it. (BTW thanks to all who sent it; I don't really follow game dev news much, so I rely on you all to tell me when there's something interesting I should know about).

There's nothing wrong with the general subject matter of the talk, and the points are along the right track in a vague sort of way, but there's just absolutely no effort put into it. I put more effort into the average blog post. If you aren't going to actually do the testing against other methods and measurement on a real (public) data set and see if your ideas are actually good, then just don't do a talk.

Anyhoo, a quick summary of the talk in plain text :

JPEG and then realtime DXTC is kind of bonkers (I agree). To make DXTC smaller, he applies Zip, then pre-huffman, pre-delta of colors, rearranging the colors & indices to be in two separate blocks, and then "codebooks", and finally 8x8 and even 16x16 blocks.

There are a lot of problems with this talk. The first is the assumption that you are using Zip on the back end. (BTW Zip is not a type of LZW at all). Zip is ancient and has a tiny window, there's no reason to use zip. If you just use a better back end, most of what he does next is irrelevant. Essentially a lot of what he does (such as the codebooks and the pre-huffman) are just ways of extending Zip, effectively making the sliding window larger.

Second, whenever you are doing these things you need to consider the memory use and processor use tradeoffs.

For example, reorganizing the DXTC data to separate the colors and the indices does in fact help. (I do it in Oodle, optionally). But that doesn't make it a clear win. It actually takes a huge amount of CPU. Just swizzling memory around like that can be slower than a very advanced LZ decompressor. (unless you are lucky enough to be on a PC which has an amazingly big cache, amazingly fast memory, and an amazing out-of-order processor that can hide cache misses). So you have to consider what is the speed cost of doing that reorg vs. other ways you could use the CPU time to improve compression (eg. running LZMA or LZX or whatever instead of Zip). Even on a PC, the reorg will ruin large block write-combining. For me, reorg took me from 500 MB/s to 300 MB/s or so, and the gain is only a few percent, not obviously worth it. (my back end is much better than Zip so the gains are much smaller, or not there at all).
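For reference, the reorg itself is trivial : a DXT1 block is 8 bytes, 4 bytes of endpoint colors followed by 4 bytes of indices, and the reorg just deinterleaves them into two streams so the back end sees two more self-similar sequences. A minimal sketch (the details of Oodle's actual reorg may differ) :


#include <stdint.h>
#include <string.h>

void dxt1_reorg( const uint8_t * blocks, int num_blocks,
                 uint8_t * colors_out, uint8_t * indices_out )
{
    for ( int i = 0; i < num_blocks; i++ )
    {
        memcpy( colors_out  + i*4, blocks + i*8    , 4 );   // endpoint colors
        memcpy( indices_out + i*4, blocks + i*8 + 4, 4 );   // 2-bit selectors
    }
}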

The only real idea in the talk is going to 8x8 blocks. That is in fact a valid thing to do, but here the rigor of the talk completely falls apart, and instead we get "look at this blurry slide in a terrible lighting environment, can you see the loss?". Errmmm, okay. To be fair it's no worse than the typical GDC graphics talk where you get "look at this picture of my global illumination technique, can you see any artifacts?" ; ermm, well you have chosen the scene to show me, and I'm a hundred feet away looking at a slide, and I can't zoom in or examine the areas I think look funny, so of course it should look good, but in fact, yes I do see lighting artifacts!

Any time you start introducing loss you have to ask : how does this loss I'm injecting compare to other ways I could reduce bitrate and increase loss? An easy one to check is just to halve the resolution of your image (in both dimensions). That's a 4:1 compression, and quite often looks just fine visually (eg. on smooth data it is one of the best possible ways to create loss). And of course since we're in this domain you need to compare against JPEG-DXTC.

CRUNCH addresses this subject much better, even doing some actual tests, and it has some much more interesting ideas.

See my previous writings on DXTC in general.

Now some actual rigor :

DXT1 is a 4 bpp (bits per pixel) format. Additional lossless compression can get you below 4 bpp, but getting to 1 bpp is unrealistic. Here I will show the results for compressors of increasing compression : zip, rrlzhlw, and lzma. The "reorg" here is just separating colors and indices; other reorgs do not help if the back end compressor is rrlzhlw or better.

file : zip , rrlzhlw , lzma , reorg zip , reorg rrlzhlw , reorg lzma  (all in bits per pixel)
kodim01.bmp 3.187 2.962 2.786 2.98 2.794 2.683
kodim02.bmp 2.984 2.738 2.574 2.703 2.575 2.484
kodim03.bmp 2.768 2.534 2.373 2.494 2.344 2.254
kodim04.bmp 3.167 2.931 2.751 2.913 2.727 2.625
kodim05.bmp 3.463 3.272 3.155 3.238 3.108 2.999
kodim06.bmp 3.039 2.827 2.626 2.755 2.635 2.514
kodim07.bmp 2.862 2.622 2.489 2.634 2.469 2.366
kodim08.bmp 3.416 3.197 3.073 3.211 3.041 2.936
kodim09.bmp 2.919 2.701 2.497 2.658 2.525 2.4
kodim10.bmp 3.074 2.838 2.644 2.803 2.638 2.525
kodim11.bmp 3.001 2.827 2.655 2.791 2.668 2.542
kodim12.bmp 2.86 2.645 2.446 2.583 2.451 2.343
kodim13.bmp 3.517 3.331 3.182 3.299 3.159 3.042
kodim14.bmp 3.296 3.104 2.94 3.078 2.922 2.803
kodim15.bmp 3.067 2.835 2.675 2.798 2.632 2.547
kodim16.bmp 2.779 2.565 2.362 2.543 2.401 2.276
kodim17.bmp 3.077 2.849 2.659 2.788 2.653 2.544
kodim18.bmp 3.495 3.315 3.181 3.255 3.106 3.025
kodim19.bmp 3.09 2.878 2.685 2.827 2.698 2.571
kodim20.bmp 2.667 2.486 2.302 2.406 2.308 2.22
kodim21.bmp 3.087 2.893 2.7 2.804 2.712 2.582
kodim22.bmp 3.39 3.213 3.046 3.168 3.005 2.901
kodim23.bmp 3.221 2.985 2.826 2.949 2.758 2.646
kodim24.bmp 3.212 2.986 2.86 3.009 2.826 2.724
clegg.bmp 2.987 2.75 2.598 2.712 2.576 2.459
FRYMIRE.bmp 1.502 1.318 1.224 1.417 1.3 1.209
LENA.bmp 3.524 3.332 3.209 3.304 3.136 3.062
MONARCH.bmp 3.28 3.055 2.916 2.999 2.835 2.741
PEPPERS.bmp 3.381 3.2 3.073 3.131 2.962 2.881
SAIL.bmp 3.425 3.234 3.123 3.197 3.047 2.967
SERRANO.bmp 1.601 1.39 1.289 1.484 1.352 1.26
TULIPS.bmp 3.511 3.27 3.164 3.227 3.061 2.974
total 97.849 91.083 86.083 90.158 85.424 82.105
gain from reorg : 7.691 5.659 3.978

What you should be able to see : reorg zip is roughly the same as rrlzhlw (without reorg), and reorg rrlzhlw is about the same as lzma (without reorg). Note that reorg is *slow* ; rrlzhlw without reorg decodes quite a bit faster than zip with reorg, so speed is not a good reason to prefer that. (I suppose simplicity of coding is one advantage it has). The gain from reorging decreases as you go to better back-ends.

I should also point out that doing something like reorg lzma is kind of silly. If you really want the maximum compression of DXTC textures, then surely a domain-specific context-based arithmetic coder will do better, and be faster too. (see for example "Lossless Compression of Already Compressed Textures" , Strom and Wennersten ; not a great paper, just the very obvious application of normal compression techniques to ETC (similar to DXTC) texture data).

In the next post I'll ramble a bit about future possibilities.


03-12-12 | Comparing Compressors

It's always hard to compare compressors fairly in a way that's easily understood by the layman. I think the basic LZH compressors in Oodle are very good, but do they compress as much as LZMA ? No. Are they as fast as LZO? No. So if I really make a fair comparison chart that lists lots of other compressors, I will be neither the fastest nor the highest ratio.

(The only truly fair way to test, as always, is to test in the actual target application, with the actual target data. Other than that, it's easiest to compare "trumps", eg. if compressor A has the same speed as B, but more compaction on all files, then it is just 100% better and we can remove B from consideration)

I wrote before about the total time method of comparing compressors. Basically you assume the disk has some given speed D. Then you see what is the total time to load the compressed file (eg. compressed size/D) and the time to do the decompression.

"Total time" is not really the right metric for various reasons; it assumes that one CPU is fully available to compression and not needed for anything else. It assumes single threaded operation only. But the nice thing about it is you can vary D and see how the best compressor changes with D.

In particular, for two compressors you can solve for the disk speed at which their total time is equal :


D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

disk speed where two compressors are equal :

D = C1 * C2 * (R1 - R2) / (C1 - C2)

(to derive it : set the per-byte total times equal, R1/D + 1/C1 = R2/D + 1/C2 , so (R1 - R2)/D = 1/C2 - 1/C1 = (C1 - C2)/(C1*C2) , and solve for D)

at lower disk speeds, the one with more compression is preferred, and at higher disk speeds the faster one with lower compression is preferred.

The other thing you can do is show "effective speed" instead of total time. If you imagine the client just gets back the raw file at the end, they don't know if you just loaded the raw file or if you loaded the compressed file and then decompressed it. Your effective speed is :


D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

S = 1 / ( R/D + 1/C )

So for example, if your compressor is "none", then R = 1.0 and C = infinity, so S = D - your speed is the disk speed.

If we have two compressors that have a different ratio/speed tradeoff, we can compare them in this way. I was going to compare my stuff to Zip/Zlib, but I can't. On the PC I'm both faster than Zip and get more compression, so there is no tradeoff. (*1) (*2)

(*1 = this is not anything big to brag about, Zip is ancient and any good modern compressor should be able to beat it on both speed and ratio. Also Zlib is not very well optimized; my code is also not particularly optimized for the PC, I optimize for the consoles because they are so much slower. It's kind of ironic that some of the most pervasive code in the world is not particularly great).

(*2 = actually there are more dimensions to the "Pareto space"; we usually show it as a curve in 2d, but there's also memory use, and Zip is quite low in memory use (which is why it's so easy to beat - all you have to do is increase the window size and you gain both ratio and speed (you gain speed because you get more long matches)); a full tradeoff analysis would be a manifold in 3d with axes of ratio, speed, and memory use)

Anyhoo, on my x64 laptop running single threaded and using the timing technique here I get :


zlib9 : 24,700,820 ->13,115,250 =  1.883 to 1, rate= 231.44 M/s

lzhlw : 24,700,820 ->10,171,779 =  2.428 to 1, rate= 256.23 M/s

rrlzh : 24,700,820 ->11,648,928 =  2.120 to 1, rate= 273.00 M/s

so we can at least compare rrlzh (the faster/simpler of my LZH's) with lzhlw (my LZH with Large Window).

The nice thing to do is to compute the effective speed S for various possible disk speeds D, and make a chart :
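Concretely, it's just this computation; a sketch of mine using the measurements above (the chart itself was generated separately, so exact numbers may differ slightly) :


#include <stdio.h>

int main()
{
    // lzhlw : 24,700,820 -> 10,171,779 at 256.23 MB/s
    // rrlzh : 24,700,820 -> 11,648,928 at 273.00 MB/s
    double R1 = 10171779.0 / 24700820.0, C1 = 256.23;   // lzhlw
    double R2 = 11648928.0 / 24700820.0, C2 = 273.00;   // rrlzh

    for ( double D = 1; D <= 1024; D *= 2 )
    {
        double S1 = 1.0 / ( R1/D + 1.0/C1 );   // effective speed of lzhlw
        double S2 = 1.0 / ( R2/D + 1.0/C2 );   // effective speed of rrlzh
        printf( "D %7.1f : raw %7.1f  lzhlw %7.1f  rrlzh %7.1f\n", D, D, S1, S2 );
    }

    // disk speed where the two total times are equal :
    double Dcross = C1 * C2 * ( R1 - R2 ) / ( C1 - C2 );
    printf( "crossover D = %.1f MB/s\n", Dcross );
    return 0;
}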

On the left is effective speed vs. disk speed, on the right is a log-log plot of the same thing. The blue 45 degree line is the "none" compressor, eg. just read the uncompressed file at disk speed. The axis is MB/sec, and here (as is most common for me) I use M = millions, not megas (1024*1024) (but the numbers I was showing at GDC were megas, which makes everything seem a little slower).

We see that on the PC, lzhlw is the better choice at any reasonable disk speed. They are equal somewhere around D = 280 MB/sec, but it's kind of irrelevant because at that point they are worse than just loading uncompressed.

The gap between lines in a log-log plot is the *ratio* of the original numbers; eg. the speedup multiplier for LZH over RAW is maximum at the lowest speeds (1 MB/sec, = 0 on the log-log chart) and decreases as the disk gets faster.


03-10-12 | GDC Stereotypes

Thanks to everyone I know who's stopped by, you've been a nice break from the usual cast of characters I get :

1. The student. Walks around with a glazed look in their eyes. Okay, students are young and I don't expect them to be experts, but my god they just seem so dumb. I thought kids were supposed to be really precocious about computers these days? What exactly are they teaching them in these programs? They also seem so incredibly young and childish; are these kids not college graduates?

2. The card-taker. These people may annoy me the most. They come up and rifle through the business cards. Some just take one of each, others dig through looking for something specific. If you ask them what they want they say "nothing" or "just taking some cards". Why?!? Arg put my card down you spammer. Even more boggling are the guys who actually take the information handouts about the products, but won't talk to you. Do you have any questions? Are you interested in the products? No? Then why the fuck are you taking the info sheets !? It's like they have this idea that it's a "successful GDC" if they come home with a big stack of papers and cards.

3. The competitor. We quite frequently get competitors who come over and pretend to be customers. They sort of act like a normal interested customer at first, just asking generic questions, but then the questions start getting more and more detailed and they're obviously trying to dig into implementation details, and we're like hmmm wait a minute "who do you work for?" oh, shit. I actually don't mind talking to competitors, but FUCK YOU just tell me the truth up front and I will be happy to talk to you, don't pretend to be a customer, and especially don't play dumb and make me explain a bunch of basics that you already know.

4. The expert. Every so often you get an "expert" who wants to educate you about your product. Oh really we should be using slerp for our animations because that's the "correct" way to interpolate splines? Oh really I should be using JPEG2000 for my texture compression? Well, thanks for the tip cock-hole, now go away and get some fucking humility you retard. I told some guy that we're better than Zip in most ways and a better balance for games and he was like "I dunno, Zip is pretty much the sweet spot"; oh, okay, I haven't like run that analysis and carefully studied it a hundred times or anything, thanks for telling me.

5. The curious. "Oh, I'm actually a musician, I was just reading your product info because I'm curious". Okey doke, fine, but REALLY ? This is the interesting thing to you? Our janky ass boring demos? There are fucking games and crazy graphics demos all over the place, and you choose this to stop and look at for five minutes? Why !? A lot of people stop and look for a second and move on, okay, that's fine, that's a normal rational response. But these people I just want to ask Why (actually I did ask a few why, and it's very disappointing they just hem and haw and shuffle away).

6. The humble. "The humble" is like when the CTO of EA walks up to the booth and just acts like a normal goofy question asker. They don't really give off great vibes of competence or importance, but god dammit, you are fucking important, let me know who you are! You can just say "hi, I'm important" , or like wear a special hat or something. Humility is terrible. The same thing happens with programmers who actually know their shit, you just say "hey I'm a real programmer, I know my shit" and then I will talk to you at a higher level and we'll both be happier.


After the show was over I went to the crazy-overpriced Italian deli near Moscone (which is not bad BTW, maybe the best quick food option near there), and the obvious was rubbed in my face. The weirdos who come to the booth are just the same weirdos that go anywhere.

I've never had a service job, waiting tables or whatever, my first job ever was software. When people come up to the booth and I step up and say "do you have any questions about RAD products?" it's almost exactly the same as if I was saying "welcome to Bennigan's, can I get you some drinks?". You have no idea if the people are cool or what, but most of the time they are fucking weirdos.

At the deli while I was sitting eating my sandwich, I kept seeing people who would walk in and look intently at the deli counter, and the nice guy behind the counter would be like "can I get you anything?" and the person would just be totally silent, then walk out weirdly without saying anything.


03-09-12 | Good Things about GDC

from the perspective of a booth monkey.

1. It's good in life to occasionally go through a very unpleasant ordeal. Kind of like how native Americans do the Sundance or whatever (good god I am a huge douche for making that analogy), it cleanses the soul and makes you stronger and also makes you appreciate your normal life when you see how bad it could be.

2. Visiting old friends and having to answer "so, what have you been up to?" or "what's new?". It's rare to be faced with those questions and they are always a moment of crisis and great depression for me. I think "hmm, what have I been up to?" and realize the answer is not fucking much. Have I been living in India? no. Have I been building cars and racing them? no. Have I gotten married or had kids? no. Have I been dancing or biking or anything that I love? no. Have I started my own company? or written some great freeware? or made an autonomous walking robot? no. What the fuck have I been up to? It makes you realize you are not doing much with your life and might shake you into taking some action.

(the worst time for this question for me was during my unemployment period between Oddworld and RAD; people would ask if I'd been travelling the world, or having lots of sex, or whatever they think a rich, great-looking single playboy is supposed to be doing when they have no job; I would always have to answer, nope, not doing any of that)

3. It is sort of useful to get a barometer of who the game industry is these days. Just from the people who stop by to talk to us, and who we see in general, you can kind of get a random sampling of the industry. The big thing this year seems to be a lot more actual mobile developers, and a lot more semi-corporate "indies" (basically just meaning small teams, small budgets; the term "indie" in game dev is losing meaning, it's becoming like "indie" in music or movies, just meaning not the big-budget highly produced mainstream stuff).

4. Meeting people who read my blog has been nice; I've been surprised at how many came by, especially people from the olden days when I was writing .txt files about 3d pipelines and vipm and such. It's hard to tell how many people actually read this stuff, and some times I start thinking "what's the point" (actually the main point is that it clarifies my thoughts and makes me write it down for myself) so anyhoo, it's cool.

5. San Francisco is so fucking great, I miss it terribly. I'm sure a lot of it is just that it's sunny here in SF now and it's still bleak in Seattle, so any place sunny looks good to me. And I'm sure some of it is that I miss the carefree unemployed fun in my life back when I lived in SF. But most of it is just that I really love this city. I love the people who live here, the gays, the hipsters, the crack-heads in the loin, the Lebanese grocers and the old hippies-cum-yuppies, the latinos in the mission, it's great, it's kind of amazing that it has resisted gentrification so well. I love how walkable and bikable it is, I love the row houses and the hills, the views, the clubs, the ethnic dives and the temples of haute cuisine. It's really such a wonderful mix. I think most GDC attendants don't really get out of the Sodo/Union Square area, which is a shame because that part of the city is the absolute worst part of the city *by far*. When I lived in SF I never went to that neighborhood, it's full of douchey bankers and computer programmers and lots of foreign tourists shopping at horrible mall chain stores.

6. Quiet time. Oddly, standing there on the show floor in the middle of the chaos of crowds and blasting music has been a rare peaceful time for me recently. I've been consuming too much media recently, either watching movies or reading magazines or browsing the internet or whatever, any time that I'm not working or doing home improvement, I'm consuming media. At almost no moment in my day am I just sitting with my own thoughts and nothing to do. On the show floor my mind would wander and I realized that I haven't had that time, and that sitting around philosophizing is one of the defining things about me (and what led to my rants originally - an excess of shit swirling around in my head that had to be let out somewhere, like a puss-filled cyst that I popped on the tongue of the internet (like a teenage cum-overloaded cock that I shot on the face of the internet (like an american tourist's butt in India that I explosive-diarrhead all over the squat toilet that is the internet))).


03-08-12 | Oodle Coroutines

I wrote a year ago ( 03-11-11 | Worklets , IO , and Coroutines ) about coroutines for doing tasks with external delays. (a year ago! wtf)

This is a small followup to say that yes, in fact coroutines are awesome for this. I never bothered to try to preserve local variables, it's not worth it. You can just store data that you want to save across a "yield" into a struct. In Oodle whenever you are doing a threaded task, you always have a Work context, so it's easy to just stuff your data in there.

I've always been a big fan of coroutines for game scripting languages. You can do things like :


Walk to TV
Turn on TV
if exists Couch
  Walk to Couch

etc. You just write it like linear imperative code, but in fact some of those things take time and yield out of the coroutine.

So obviously the same thing is great for IO or SPU tasks or whatever. You can write :


Vec3 * array = malloc(...);

io_read( array , ... );  //! yields

Mat3 m = camera.view * ...;

spu_transform(array, m); //! yields

object->m_array = array;

and it just looks like nice linear code, but actually you lose execution there at the ! marks and you will only proceed after that operation finishes.

To actually implement the coroutines I have to use macros, which is a bit ugly, but not intolerable. I use the C switch method as previously described; normally I auto-generate the labels for the switch with __COUNTER__ so you can just write :


 .. code ..

YIELD

 .. code ..

and the YIELD macro does something like :

  N = __COUNTER__;
  work->next = N;
  return eCoroutine_Refire;
  }
case N:
  {

(note the braces which mean that variables in one yield chunk are not visible to the next; this means that the failure of the coroutine to maintain local variables is a compile error and thus not surprising).
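For reference, a minimal self-contained version of that macro scheme (this is my sketch, not the Oodle macros; it uses __LINE__ for the labels instead of __COUNTER__, since both uses of __LINE__ inside one macro expansion give the same value) :


#include <stdio.h>

enum ECoroutineResult { eCoroutine_Done, eCoroutine_Refire };

struct Work
{
    int next;   // resume label ; 0 means start from the top
    // data that must survive across yields goes here, not in locals
};

#define COROUTINE_START  switch ( work->next ) { case 0: {
#define YIELD            work->next = __LINE__; return eCoroutine_Refire; } case __LINE__: {
#define COROUTINE_END    } } return eCoroutine_Done;

ECoroutineResult WorkFunc( Work * work )
{
    COROUTINE_START

    printf("first chunk\n");    // eg. fire an async op and set it as a dependency here

    YIELD;

    printf("second chunk\n");   // we resume here when the work item is re-fired

    COROUTINE_END
}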

The exception is if you want to jump around to different blocks of the coroutine, then you need to manually specify a label and you can jump to that label.

Note that yielding without a dependency is kind of pointless; these coroutines are not yielding to share the CPU, they are yielding because they need to wait on some async handle to finish. So generally when you yield it's because you have some handle (or handles) to async tasks.

The way a yielding call like "spu_transform(array, m);" in the previous example has to be implemented is by starting an async spu task, and then setting the handle as a dependency. It would be something like :


#define spu_transform(args) \
  Handle h = start_spu_transform(args); \
  Work_SetDeps(work,h); \
  YIELD

The coroutine yield will then stop running the work, and the work now has a dependency, so it won't resume until the dependency is done. eg. it waits for the spu task to complete.

I use coroutines basically every time I have to do some processing on a file. For one thing, to minimize memory use I need to stream the file through a double-buffer. For another, you often need to open the file before you start processing, and that needs to be part of the async operation chain as well. So a typical processing coroutine looks something like :


int WorkFunc( Work * work )
{
COROUTINE_START

  Handle h = ioq_open_file( work->filename );

COROUTINE_YIELD_TO(h)

  if open failed -> abort

  get file size

  h = start read chunk 0
  chunkI = 1; // the next chunk is 1

YIELD(h)

  // read of chunkI^1 just finished

  // start the next read :  
  h = ioq_read( chunk[chunkI] );
  chunkI ^= 1;

  // process the chunk we just received :
  process( chunk[chunkI] );

  if done
  {
    ioq_close_file();
    return
  }

YIELD_REPEAT
}

where "YIELD_REPEAT" means resume at the same label so you repeat the current block.

The last block of the coroutine runs over and over, ping-ponging on the double buffer, and yields if the next IO is not done yet when the processing of each block is done.


03-06-12 | The Worker Wake and Semaphore Delay Issue

Let's say you have a pool of worker threads, and some of them may be asleep. How should you wake them up?

The straightforward way is to use a semaphore which counts the work items, and wait the worker threads on the semaphore. Workers will go to sleep when there is no work for them, and wake up when new work is pushed.

But this is rather too aggressive about waking workers. If you push N items (N less than the number of worker threads) it will wake N workers. But by the time some of those workers wake there may be nothing for them to do.

Let's look at a few specific issues.

First of all, when you're making a bunch of work items, you might want to delay inc'ing the semaphore until you have made all the items, rather than inc'ing it each time. eg. instead of :


1 : incremental push :

push item A
inc sem
push item B 
inc sem
...

instead do

2 : batch push :

push item A
push item B
inc sem twice

There are a few differences. The only disadvantage of batch pushing is that the work doesn't start getting done until all the pushes are done. If you're creating a lot of jobs and there's a decent amount of processing to get them started, this adds latency.

But what actually happens with incremental push? One possibility is like this :


bad incremental push :

push item A
inc sem

sem inc causes work thread to wake up
pusher thread loses execution

worker pops item A
worker does item A
worker sees empty queue and goes to sleep

pusher thread wakes up

push item B 
inc sem
...

That's very bad. The possible slight loss of efficiency from batch push is worth it to avoid this kind of bad execution flow.

There's a related issue when you are creating work items from a worker thread itself. Say a work item does some computation and also creates another work item :


Worker pops item A
does some stuff
push new work item B
inc sem
do some other stuff
item A done

Is this okay? Typically, no.

The problem is if the other worker threads are asleep, that inc sem might wake one up. Then the original worker finishes item A and sees nothing else to do and goes to sleep. It would have been much better if the worker just stayed awake and did item B itself.

We can fix this pretty easily. For work items pushed on worker threads, I typically use "batch push" (that is, delayed semaphore increment) with an extra wrinkle - I delay it up until my own thread tries to do a semaphore decrement.

That is, the way a worker thread runs is something like :


decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

decrement semaphore (wait if count <= 0)
pop item
do work item ...

instead we do :

decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

 push new work items but don't post the semaphore
 instead set N = number of incs to sem that are delayed

decrement semaphore AND add N (*)
pop item
do work item ...

The key operation is at (*) , where I post the sem for any work items I made, and also try to dec one.

The gain can be seen from a special case - if I made one work item, then the operation at (*) is a nop - I had one increment to the sem delayed, and I want to take one for myself, so I just take the one I had delayed. (if I made two, I post one and take one for myself). etc.
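In code, the operation at (*) is simple; a little sketch of mine (not the Oodle code) using a C++20 counting_semaphore, where N is the number of sem increments this worker delayed while pushing new items :


#include <semaphore>

void semaphore_dec_and_add( std::counting_semaphore<> & sem, int N )
{
    if ( N >= 1 )
    {
        // keep one credit for ourselves and post the rest for other workers ;
        // N == 1 is the nop case : the single delayed inc is the one we take
        if ( N > 1 )
            sem.release( N - 1 );
    }
    else
    {
        // we made no new work ; wait for someone else's item (may sleep)
        sem.acquire();
    }
}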

There is one extra little optimization you can do for the edge case - if you have some threads that are creating work items and some threads doing them, there is a sort of "performance race" between them. You really want them to be running alongside each other, with the creator feeding the poppers and neither getting too far ahead. If the poppers are running slightly faster than the creators, you can fall off a huge performance cliff when the poppers see no work available and go into an OS sleep. Now, obviously you use a spin in your semaphore, but an enhanced version is something like this :


delayed/batched work creation :

push various items
...
inc sem N times


work popper :

spin { try pop queue }
try dec sem
if didn't get pop , dec sem (may wait)

In words, the work popper can "shortcut" the delayed sem inc. That is, the pusher has created a delay between the queue push and the sem inc, but the delay only applies to thread wakeups!! (this is the important point). The delay does not apply to the work being available to already running worker threads.

That is, if the pusher is using delay sem incs, and the popper is using sem shortcuts - then an active pusher makes work available to active workers as soon as possible. The thing that is delayed is only thread wakeup, so that we can avoid waking threads that don't need to wake up, or threads that will steal the execution context from the pusher, etc.
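A sketch of the popper side with the shortcut (mine; Queue, Item, and the work callback are stand-ins for whatever lock-free queue and job type you actually have) :


#include <semaphore>

template < typename Queue, typename Item, typename WorkFn >
void worker_pop_loop( Queue & queue, std::counting_semaphore<> & sem, WorkFn do_work )
{
    const int kSpinCount = 100;  // tunable
    for (;;)
    {
        Item it;
        bool got = false;
        for ( int spin = 0; spin < kSpinCount; spin++ )
            if ( queue.try_pop( &it ) ) { got = true; break; }

        if ( got )
        {
            // shortcut : we may have popped an item whose sem post is still
            // delayed ; consume a credit if one is there, otherwise some worker
            // takes a spurious wake later, which is the accepted cost
            sem.try_acquire();
            do_work( it );
            continue;
        }

        // no work visible ; block on the semaphore (thread may sleep here)
        sem.acquire();

        // the matching item may have been shortcut away by another worker,
        // so a failed pop here is just a spurious wake ; loop and retry
        if ( queue.try_pop( &it ) )
            do_work( it );
    }
}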

Here's an example of how things can go wrong if you aren't careful about these things :

Each work item is creating a followup work item, but that wakes up the other worker thread, who quickly grabs the item, and I go back to sleep.

(you might ask why the work item is creating a followup instead of just doing more work; it's because the followup is dependent on an IO; in this particular case the IO is running faster than the computation, so the IO dependency for each item is already done, and they can be run immediately)

With delayed push & early pop it's all cleaned up :


03-06-12 | Oodle Handle Table : WFMO

The other big advantage of a unified handle system is that you can do things like a true WFMO (wait for multiple objects).

Often you have the case that you have launched several asynchronous operations (let's call them A,B, and C), and you wish to do something more, but only after all three are done.

You can always do this manually by just waiting on all three :


Wait(A);
Wait(B);
Wait(C);
*- now all three are done

(note : the invariant at * is only true if the state transition from "not done" to "done" is one way; this is an example of what I mentioned last time that reasoning and coding about these state transitions is much easier if it's one way; if it was not then you would have absolutely no idea what the state of the objects was at *).

Obviously the disadvantage of this is that your thread may have to wake up and go to sleep several times before it reaches *. You can minimize this by waiting first on the item you expect to finish last, but that only works in some cases.

The overhead of these extra thread sleeps/wakes is enormous; the cost of a thread transition is on the order of 5000-50,000 clocks, whereas the cost of an OS threading call like signalling an event or locking a mutex is on the order of 50-500 clocks. So it's worth really working on these.

So to fix it we use a true WFMO call like WFMO(A,B,C).

What WFMO does is essentially just use the forward permits system that the unified handle table uses for all dependencies. It creates a pseudo-handle H which does nothing and has no data :


WFMO(A,B,C) :

create handle H
set H depends on {A,B,C}
  sets A to permit H , etc.
  sets H needs 3 permits to run
Wait(H);
free handle H

The result is just a single thread sleep, and then when your thread wakes you know all ABC are done and there is no need to poll their status and possibly sleep again.
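Without the handle table you can get the same single-sleep behavior with a countdown; a minimal sketch (names are mine; Oodle does this through the permit system instead, and this version only supports one waiting thread) :


#include <atomic>
#include <semaphore>

struct WaitForAll
{
    std::atomic<int>      remaining;
    std::binary_semaphore done{0};

    explicit WaitForAll( int count ) : remaining(count) {}

    // called by each async op as it finishes (its "permit" firing)
    void on_one_done()
    {
        if ( remaining.fetch_sub( 1, std::memory_order_acq_rel ) == 1 )
            done.release();   // the last one in wakes the waiter
    }

    // the waiting thread sleeps at most once ; when it wakes, all ops are done
    void wait() { done.acquire(); }
};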


03-05-12 | Oodle Handle Table

A few months ago I created a new core structure for Oodle which has worked out really well and tied the whole concept together. I'm going to make a few notes about it here.

The foundation is a lock-free weak reference handle table. One of the goals of Oodle library design was that I didn't want to force a coding structure on the client. eg. if it was me I would use a PagingResource base class with virtual OnLoad functions or whatever, and probably reference counting. But I understand many people don't want to do that, and one of the nice things about RAD products is that they fit into *your* engine structure, they don't force an engine structure on you. So initially I was trying to write the Oodle systems so that each system is very independent and doesn't force any overall concept on you. But that was a bit of a mess, and bug prone; for example object lifetime management was left up to the client, eg. if you fired an async task off, you could delete it before it was done, which could be a crash if the SPU was still using that memory.

The weak reference handle table fixes those bugs in a relatively low overhead way (and I mean "low overhead" both in terms of CPU time, but more importantly in terms of coding structure or intellectual burden).

Handles can be locked (with a mutex per handle (*)) which gives you access to any payload associated with them. It also prevents deletion. Simple status checks, however, don't require taking the lock. So other people can monitor your handle while you work on it without blocking.

(* = not an OS mutex of course, but my own; it's one of the mutexes described in the previous series. The mutex in Oodle uses only a few bits per handle, plus more data per thread which gives you the actual OS waits; game workloads typically involve something like 1000's of objects but only 10's of threads, so it's much better to use a per-thread "waitset" type of model)

One of the crucial things has been that the state of handles can basically only make one state transition, from "not done" to "done". Once they are done, they can never go back to not done; if you decide you have more followup work you create a new handle. This is as opposed to a normal Event or something which can flip states over and over. The nice thing about the single state transition is it makes waiting on that event much simpler and much easier to do race-free. There's no issue like "eventcount" where you have to do a prepare_wait / commit_wait.

The state of the handle is in the same word as the GUID which tells you if the handle is alive. So you only have to do one atomic op and it tells you if the weak reference is alive, and if so what the status is.
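A tiny sketch of that idea (my guess at the flavor, not Oodle's actual layout) : pack the liveness guid and the done bit into one word, so a weak reference check is a single atomic load :


#include <atomic>
#include <stdint.h>

// low bit = status (0 = not done, 1 = done) ; high 63 bits = guid
static inline uint64_t make_handle_word( uint64_t guid, bool done )
{
    return ( guid << 1 ) | ( done ? 1 : 0 );
}

struct HandleSlot
{
    std::atomic<uint64_t> word;
};

// returns true if the handle identified by guid is still alive, and fills
// *done with its status ; one atomic load, no lock taken
static inline bool check_weak_ref( const HandleSlot & slot, uint64_t guid, bool * done )
{
    uint64_t w = slot.word.load( std::memory_order_acquire );
    if ( ( w >> 1 ) != guid )
        return false;        // slot was recycled ; the weak ref is dead
    *done = ( w & 1 ) != 0;
    return true;
}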

The other big thing is that "forward dependencies" (that I call "permits") are built into the handle table, so all the systems automatically get dependencies and can wait on each other. (as described previously here : Worker Thread system with reverse dependencies )

You can make any kind of handle, eg. some async SPU job, and then mark it as depending on an IO, so the SPU job will get kicked when the IO is done.

A little example image :

The row of pink is the IO thread doing a bunch of reads. There are pre-made SPU jobs to decompress each chunk, and you can see they fire as each read completes. Then another thread (the brown below the 4 SPUs) waits on each SPU handle and processes completion for them.

This all comes from


(for each chunk:)
OodleAsyncHandle h1 = fire read
OodleAsyncHandle h2 = fire SPU decompress , depends on h1

add h2 to list
wait for all handles in list


Aside :

There's only one disadvantage I haven't figured out yet. If something depends on an SPU job, then when the SPU work item completes it needs to check if it "permits" anything else, and if so mark them as ready to run.

The problem is that it can't do that work from the SPU. (actually it probably could do some of it from the SPU, by chasing pointers and doing DMA's for each chunk, and using the SPU atomic cache line stuff; but in the end it will run into a problem that it needs to call some system call on the PPU).

So when an SPU job completes I need to send back to the PPU a message that "hey this handle is done see if it permits anything and do the processing for that". The problem is there seems to be no good mechanism for that on the PS3, which kind of blows my mind. I mean, I can just tack my messages onto a lock-free queue from the SPU, that part is fine, but then I need a way to signal the PPU (or run an interrupt on the PPU) to get the PPU to process that queue (and the threads that need the processing might be asleep and need to get woken), and that is where it's not great.


03-04-12 | GDC

Just a note to say I'll be at GDC this year, hawking Oodle at the RAD booth.

I'm sure I will be tired and bored, so stop by and say hi.

BTW a plea to speakers : if you are doing a GDC talk, if you actually have some useful information to share, write it down! The written word on the internet can reach many more people than speech at a conference, and it lasts forever and provides a reference for game developers in the future.

And no, putting up your slides doesn't count. Slides are crap, throw them away. Don't publish your slides. I would be very happy if I never saw a single PPT file on the internet ever again. You used the slides in your talk, that's fine, that's what they are for - your talk. They are NOT a readable useful form of the information that is worth publishing. Write some real text in paragraphs and publish that. And no, video or audio of the talk don't count either. I know you're busy, but you really don't have time to write a few paragraphs? If your research is substantial, then you spent months and months on it, and now you can't put in a few more hours to write it down properly? Come on, do it.


03-03-12 | Stranger's Wrath HD

I just noticed that Stranger's Wrath HD for PS3 (on PSN) finally came out a few months ago. So I bought it and started playing through.

It seems fine from what I've seen so far. No major bugs or framerate problems or anything like that. It supposedly has some "HD" up-resing, but I can't see any, it looks just like the XBox, and that's not entirely bad. (actually the only difference I notice is that the menus seem to have been redone and they use a horrible janky-looking font; I don't think that we had such terrible looking text on the XBox). (maybe the content is up-res'ed and I just can't tell because the screen resolution is also higher?)

Anyhoo, I'm pleased to say it holds up decently still. Some of the realtime art is fantastic, for such an old game it still looks really good, mainly due to all the use of "decorators". The first 30 minutes of play are pretty terrible, but it gets better.

Every time Stranger jumps and does the Al Pacino hoo-ah I bust out laughing. WTF were we thinking? And lots of his monologues and his vocal pacing is just super bizarre. That's a good thing, bizarre is good.

In unrelated news, I tried Rage on PS3 ; holy texture popping batman! I do like the art style; the lighting is like super blown out, really flat looking; it looks cool, it reminds me of Anime. Actually the best thing about Rage is the menus; the menus are superb, really fast, responsive, circular, simple, no loading or unnecessary animation or whatever; every game developer should copy the Rage menus.


02-28-12 | WA Sales Tax Deductions

In WA because we have no state income tax, you will deduct sales tax paid from your federal income taxes instead.

Since I'm not completely fucking insane, I don't keep every single receipt for every purchase all year. Fortunately you can just use the IRS formula and take a computed sales tax deduction. On top of the sales tax formula, you can also deduct the actual sales tax on certain additional items.

Cars is a big one.

Major home improvements are a gray area that you could probably stretch if you wanted to. You can deduct sales tax for materials (not labor) (including if a sub-contractor bought the materials). But it's supposed to only be for "substantial additions", not repairs.

Another little one I noticed is that the RTA portion of vehicle registration in WA is based on a percentage of the car's value (the rest of the registration fee is not; it's size/weight based). Any fee that's a percent of value counts as a sales tax, so this is deductible. (it's about $200 for me).

BTW I continue to be a big fan of "TaxAct" for foolish people like me still doing their own taxes. It's non-web-based, and lets you just enter directly in the forms instead of going through some nonsense Q&A. However, it does seem unable to import data directly from Vanguard which is a bit annoying. That would be less annoying if you could copy-paste a 1099 form and just change a few spots, but it seems you can't do that either. So lots of typing (or manual ctrl-c-ctrl-v'ing) is required.

BTW2 you might be tempted by the real estate excise tax (1.78% in Seattle). My research says that this is technically charged to the seller, and furthermore even if you are the seller apparently it is an "excise tax" not a "sales tax", and therefore not deductible. (it can be subtracted from the "basis" but that is rarely relevant due to the large real estate capital gains dead zone).


02-03-12 | Some Music

that I have enjoyed recently.

For some reason my auto-href generator does reverse alphabetical order :

Zoot Woman - It's Automatic - YouTube
We Are Trees - Sunrise Sunset - YouTube
Vondelpark - Hippodrome - YouTube
Thundercat The Golden Age of Apocalypse - For Love I Come - YouTube
The Maccabees - Pelican on Vimeo
The Answer - Why You Smile ( 60's Garage - Psych ballad ) - YouTube
Terry Richardson (Baron von Luxxury)
Submotion Orchestra- Finest Hour - YouTube
Siriusmo - Femuscle - YouTube
Possum Dixon - Watch The Girl Destroy Me - YouTube
Porcelain Raft - Unless You Speak From Your Heart (Official Video) - YouTube
Phantogram - You Are The Ocean - YouTube
Neon Indian - Polish Girl - YouTube
Memoryhouse - To The Lighthouse (Millionyoung Remix) - YouTube
Marble Sounds - Good Occasions - YouTube
Let's Buy Happiness - Fast Fast - YouTube
Gardens & VIlla - Black Hills (Official Video) - YouTube
Electric Ocean People - North - YouTube
Cotton Jones - Blood Red Sentimental Blues - YouTube
Bruno Pronsato - Feel Right - YouTube
Breakbot - make you mine - YouTube
Blouse Into black - YouTube
Blood Orange - Bad Girls - YouTube

Mostly listening to mixes from AOR Disco. Not all good, but some really great, like Psychemagik - Valley of Paradise


01-27-12 | Some Game Reviews

Mass Effect 2 - it's pretty good looking (mmm space camel toe), but this is not a game; it's like semi-interactive fiction or something. I just feel like nothing that I do (in the game, though also in life) actually matters; all the "challenges" are absurdly easy, and your hand is held. Then it goes into a long cinematic talky blah blah. This type of game is after my era and I really don't understand it. It's not good gameplay, and it's also just not a good movie; it's sort of like a bad game combined with a bad movie, and people seem to just love that. I'd rather just watch a good movie where I don't have to hold a controller in my hand and press A once in a rare while.

Skyrim - my god, I am surprised that this game is so widely praised. I get that there's definitely a niche market that wants to collect thirty snails so you can make dye so you can tie-dye shirts to sell in the town fair. But I'm shocked that that market is so large. Also, the general game mechanics are so fucking janky, it feels like Everquest or something built by amateurs on a free 3d engine. There's weird pointy collision geometry that you get stuck on, the graphics are full of artifacts, wtf.

Gran Turismo 5 - whoah this game is super disappointing. First of all the graphics are surprisingly terrible. How hard is it to render a race track? The basic stuff, the pavement, trees, etc. just don't look good and you wind up seeing big zoomed-in pixels all the time. The menu flow of the game is atrocious, and the load times are totally unforgivable. Typical game session goes like this - click through some damn movies, click through several fucking popup boxes about how I need to set the date & time and connect to the network, okay, finally the main menu (which takes a while to load); click into a sub-menu (loading...), again and again, like 5 or more menus before I get to race (each menu taking quite some time to load), and then *loading* ... it loads each fucking opponent car one by one and they take several seconds each, WTF. Every time you go into a menu it loads, and then just to back out to the previous menu it has to reload it! WTF WTF. But by far the most disappointing thing is the physics; I thought that with a wheel it would actually behave like a real car, and it just doesn't. Which brings me to -

Logitech Driving Force GT Wheel - GT5 was sucking so bad I thought I would try it with a wheel, so I bought this piece of crap. The force feedback is completely terrible; it's a nice idea, it should give you some of the feedback you get from a real steering wheel, which should give you information about if your tires are sliding or in the air or whatever. Nope, none of that. Instead you just get the motor seemingly randomly turning on and moving around the wheel really hard, like you have to fight it with fierce muscle flexing. It's garbage. So I found this trick that you can unplug the power from it (after it does its self-calibrating thing) and that disables the horrible motor jamming you up, and then it's sort of okay, obviously no feel or response in the wheel, but that's better than horribly wrong random force feedback.

Of course all that shittiness pales in comparison to the basic shittiness of the entire modern console gaming experience. I only play games like once a year, so when I do decide "hey let's try this game out" the experience is - "sorry, you need to update your system software" ; urgh , wtf, okay ... then I do it and start the game - "this game needs to download an update" ; urgh, wtf, okay ... then "this game must be installed to the harddrive to run; it needs 18.4 GB" ; urgh wtf, okay... finally it's installed and starts up - "this game requires a newer version of the system software"... WTF !? I already did that and you didn't give me the latest version? (drop out to the main menu), yep, there's a newer version, get that. By this point I fucking hate the game and have no interest in playing it.


01-24-12 | Protectionism Part 3

One of the stranger things that's happened in the past 20 years or so is that the religion of free-marketism has become widely accepted dogma. I say "the religion" because these people believe in it beyond rationality. For example, the widely known ways in which the mathematical abstraction of ideal free markets does not apply to the real world (such as rational actors, perfect information, shared goods, etc.) are waved off as insignificant.

Part of this weird mass delusion is the belief that "protectionism failed" or "the welfare state failed" ; you will frequently see these assertions in supposedly factual news articles. Papers like the NYT which will parrot ridiculous government press releases without question will also say things like "free markets have clearly proven themselves" or "nobody wants a return to a welfare state, that model has clearly failed".

Huh, really? I find that interpretation highly dubious; certainly not something that can be dropped as an aside as if it's incontrovertible. During the golden age of America (roughly 1950-70) we were highly protectionistic and we had our most generous welfare state. So, where was the failure? I'm sure the popular memory is somewhat fixated on the bad times of the 70's-80's that were blamed on corrupt unions and inefficient protected businesses and supposedly saved by Reagan/Thatcher free marketism. But there are numerous complicating factors that make that interpretation questionable at best.

The only indisputable success of the free market era was the dot com boom, but that would probably have happened anyway in a more controlled market system (and we may have avoided some major crashes due to excessive speculation and capital liquidity, such as the Southeast Asian crash and the South American crash).

In the mean time, highly protectionist / socialist states have been doing very well (eg. most of the Scandinavian and German-speaking states). I don't mean to imply that they are a good model for us to copy, but if you're just looking at world evidence for certain economic/political schemes, it seems there are more pro-socialist examples than there are good "free market" examples.


I believe that part of the problem is the mathematical appeal of free markets in theory. You can set up these toy problems with idealized conditions, and talk about what creates the "global optimum" for all.

There's nothing wrong with playing with toy models, except that people then think that it applies to the real world. A serious economist would be able to list the various ways that the idealized models don't match reality, but then you start drawing yield curves and run simulations and cook up formulas and it all looks so nice that by the time you get to the end you forget that it doesn't actually connect to reality.


01-18-12 | Stop SOPA but don't stop there

I'm impressed with the breadth of support for this cause, and it is a good one. But at the same time I find it a bit disturbing that people can get mobilized for this, but can't get mobilized for things that are actually much worse, like NDAA, the Patriot Act, campaign finance reform, lobbyist reform, the crushing of consumer protection or Glass-Steagall or media ownership rules, etc. etc.

In the end, the reason that SOPA exists is because corporate interest groups are literally writing our laws. If you stop SOPA it's like cutting off the top of a dandelion - as long as the roots are in the ground it will just come back.

Lobbyists are persistent and clever; they will slip the laws they want in as a rider on a budget bill, or they'll just keep trying until the opposition movement peters out.

We have to attack the cause.

A very similarly evil act, the Research Works Act (RWA) is moving forward. In brief, the RWA makes it illegal for the government to require that researchers that receive government funding make their results available for free (eg. on the internet). The RWA was written by the AAP (Association of American Publishers), which is made up of the IEEE and ACM and all the usual suspects, who want to have exclusive rights to research paper copyrights.

It should be noted that even before the RWA has passed, we currently have pathetically weak open access to research requirements. Only research that receives direct grant funding from something like the NIH is required to supply open access. But almost all research is actually government funded, because the researchers are paid by colleges, and the colleges get government funding (and of course all research is based on past research, which is a public good, etc). But this type of funding does not require the documents to be open access. So, first of all :

Colleges : require all of your professors' publications to be open, NOW!

Apparently Harvard has sort of done this but not many others have followed suit.

(colleges also have no right to own patents or create public/private partnerships for business development of professors' work. That research is owned by all of us. But that's another topic...)

The IEEE and ACM's standard publication terms give them exclusive ownership of the copyright, which they really have no business getting, since their contribution to the work is near zero (eg. peer review is done by unpaid volunteers, they provide almost zero paid editing, etc.; they aren't like a book publisher that actually does something).

Authors : stop publishing your papers in works that take your copyright! Make all your works "open access" only!

Pick an Open Access Journal instead.

Some of the blog posts I've read about the RWA have been very naive in my opinion ; they seem to think that the scholarly organizations like the ACM or IEEE are better than the RWA and aren't specifically the people behind it. I think it should be absolutely clear by now that the RWA is exactly in line with the sleazy character of these usurious organizations.

Scholars : quit the ACM and IEEE right now.

It's very simple, these publishers are bad for science, and they serve no purpose at all in a world where the internet is the most important form of publication.

But more generally it's just another case of corporations writing the laws and pushing an agenda that's bad for America. We may stop SOPA and RWA, but it will just happen over and over again until we fix the problem. Stop corporate personhood. Make lobbying illegal. Make it illegal for corporations to write laws.


01-16-12 | Some TV Reviews

Recommended :

"The Hour" - period TV is getting a bit old already; this one is sort of lumpy, it lurches around between almost being really good and getting itself stuck in plot deadends and thin characters. It also sort of builds up a lot of promise that doesn't really pay off, but overall I enjoyed it.

"State of Play" - this is old, but I just finally saw it. Yeah it's good, worth watching. I like Bill Nighy a lot. You can tell that the writer Paul Abbott uses that horrible lazy style of writing where he doesn't actually have a plan, he just sort of keeps stringing together one scene at a time to keep it exciting; the writing is not so much logical as it is crafted to create a pacing, to create a cliff-hanger, etc. It can be very effective (just ask JJ Abrams), but in the end if you stop and look back you see that the whole thing has no substance, and the work has no long term impact.

(aside / general rant : I feel like most TV writers these days have got the formula down pretty well for building up a complex story, generating tension, stroking the right emotional peaks at the right time; they keep adding little plot twists, get you more and more excited, and then it all just sort of peters out because they never really had a plan for where to go with it all, and the ending feels forced and rushed; it's kind of like going on a date and having really exciting flirting and foreplay, and then the sex is over in a few seconds and you're like WTF, maybe you could like actually plan out the overall arc a bit more)

"Boss" - pretty great; rich atmosphere, lots of scheming; the long term cross-episode plot is a bit weak, but Kelsey Grammer is fantastic.

"Workaholics" - something is a little off about this show (like none of the stars are actually very good comic actors (the way eg. Charlie Kelly is) and I sort of get the impression that they are like the douchebags that they're making fun of, which creates a weird vibe); nevertheless it's just about the only actually funny show on all of TV at the moment (I hate 30 Rock and Parks & Rec). Every once in a while it really gets it right and is absolutely hillarious.

"Human Planet" - like Planet Earth but with people. Obvious must see.

Adam Curtis Documentaries : google for "adam curtis collection" and get the torrent. These are some of the most beautiful films I've ever seen. Incredible archival footage, nice music, good taste and something exciting, something inspiring. These make me feel like there are other human beings out there who are going about their lives with some intelligence and some creativity, and that gives me great solace. The actual substance of his narration is a bit questionable at times, but I forgive that, and in any case it will make you think about the world. Probably start with "Machines of Loving Grace".

Meh-commended :

"The Take" - well made (all BBC stuff now seems to be incredibly well made; as opposed to 10 years ago when it was way behind American TV; it used to all look like it was shot on handicam, now it all looks as good as the best HBO stuff) ... but crap. The story just has no point to it other than "criminals are bad". The guy who plays Freddie is quite good and carries it for a while, but even his act gets repetitive and I just wound up bored by the end. "Whitechapel" is roughly the same, very generic crime junk, nothing exciting or new or fun about it ("Collision" is even worse; there are an awful lot of BBC police procedurals and most of them are nothing special). Watch "The Shadow Line" and "Red Riding" again instead.

"Downton Abbey" - mmm I dunno; it's beautiful and all, but pretty vapid. Most of the drama is created by the fact that the despicable characters are kept around when they should just be told to bloody well fuck off. It's a pretty common TV method, and I just never enjoy it. It's also written very much like a serial soap opera; the evil characters are almost Susan Luchis of absurd scheming, and you have your naifs and your star crossed lovers. And the overall message is very scummy conservative nonsense.

"Billy Connolly - Journey To The Edge Of The World" : one of the better travel shows I've seen in a while (Riku and Tunna, come back!). His good spirit is infectious, though it does get tiresome; you have to watch in small doses. Didn't like his Route 66 very much though.

Crap :

"Doctor Who" (the new series) - I was pretty sure this would be terrible so put off watching it for a long time; yep, it's terrible. So cheesy, and the pseudo-science is painful. I guess it appeals to the Babylon 5 / Buffy crowd, which is a surprisingly large segment of the scifi viewers (personally I've never seen the appeal of the campy/cheesy in scifi). I watched S01E01 and hated it, so I tried S02E01 to see if it got better in the next year; nope, if any thing Tennant is worse than Ecclestone, he makes it even cheesier; IMO Ecclestone had the right attributes for The Doctor, which is a bit befuddled but also a bit menacing, whereas Tennant has no creepiness.

"Southland" - junk, generic cop crap, all the characters are unpleasant, the stories are super generic, I just don't see anything appealing in this show and am baffled at its high ratings.

"Stephen Fry in America" - meh, he doesn't actually do much driving and seeing America; instead it's these formulaic set up events where he talks to someone who is doing something stereotypically American and thus "learns something" (you never learn anything when you have constructed a pantomime that is just supposed to portray the stereotype you already had (just ask GWB's CIA)). I quite like Stephen Fry but this was not very entertaining.

Michael Wood Documentaries : terrible. Too bad cuz there are lots of them. Similar thing for Brian Cox science shows. I was shocked to read that Cox is actually a professor; from watching his shows I did not get the impression that he understands physics at all.

"Sons of Anarchy" : sort of cartoony silly unpleasant people doing unpleasant things. I have no idea why this show is rated so highly by critics. I dunno, I guess a lot of people like reality TV, which is basically dumb unpleasant people being dumb and unpleasant with no point to it.

"Community" : the protagonist is so profoundly unlikeable that I found nothing funny in this. John Oliver is the best part, and that's not a good thing (because John Oliver is not very funny). Also highly rated and not sure why.

"Homeland" : extremely unpleasant to watch, and also mentally dangerous right wing propaganda (and yes I know the "twists" and no that doesn't change the overall message). If you enjoy this or "24" there is something seriously wrong with you.

New shows that look promising :

"Luck" , "On Freddie Roach" , "Key & Peele"

I've also tried a few shows like "Cougar Town" and "Happy Endings" because they get very high metacritic or IMDB ratings, even though my spidey senses knew they were going to be awful. Wow, it kind of blows my mind that people are still making this super-sitcomy crap. Like, the characters all stand side-by-side last-supper style while they talk to each other so you can get everyone in frame. One of them is always the "wacky" one who starts doing the robot dance randomly and everybody just thinks it's funny, not incredibly strange. And there's some big drama and confusion and everyone cries then makes a big gesture like professing their love over the speakers at Karaoke night and then Urkel stumbles in and the gang all says "that's Urkel!". WTF TV, WTF.


01-09-12 | Protectionism Part 2

Nobody likes the idea of protectionism, because it conjures ideas of jingoism, as well as of corrupt, inefficient businesses locking in their markets through political deals. But in fact basically every country in the world still engages in heavy protectionism, in the form of subsidizing local business in one way or another.

Subsidies can take many forms. Of course there are direct subsidies (Airbus, Boeing, etc.) and these are generally disliked. There are tax break incentives, which almost every big business in America gets. Obviously there are tons of small business incentives and employment incentives and so on.

But there are also more subtle and indirect forms of subsidy.

China (and other Asian countries) are big fans of the "business development zone" ; these are areas they construct for certain industries, provide tax breaks, and build up infrastructure for power, transport, etc.

Government paid health care is perhaps the biggest subsidy any country offers. It's a general subsidy for employment (notably, not for business). In the American model, employers pay for health care - only for employed people. With government paid health care, the employers are still paying for health care (through taxes) - but they are paying whether they employ the people or not.

To make that more clear - imagine a system where everyone was paid $100k by the government, and then corporations had to pay taxes to cover that - whether you were an employee or not. Then there's no such thing as "saving money by laying people off" ; the cost per employee is the same whether you hire them or not, so you may as well hire them. Obviously it's too much of a market distortion for the government to just pay all of everyone's expenses whether they are employed or not, but paying some amount regardless of employment amounts to a subsidy for employment. The more basic social welfare of a person is paid by the government from the general tax fund, the more incentive there is to hire people.

So social welfare (health care being the biggest one) is actually a subsidy for local employment. There are other things that reduce the cost of an employee, such as government child care, or good public transit (which lets you pay less because employees don't need cars).

Another big one is education. But of course subsidized education is a form of development of a local business resource (people). Some countries have well developed industry-education partnerships to provide students with the skills needed.

etc.

I believe that one of the problems in America is that we are in fact engaging in heavy protectionism to this day, but we are not doing it in a very smart way. Defense contractors probably get the biggest subsidy (actually I take that back, finance gets the biggest subsidy by a huge huge margin; finance has been getting around a trillion dollars in direct subsidy while defense only gets a few hundred billion), but so does aviation, mineral/mining, lumber, agriculture. They are generally in the form of direct subsidies (eg. have some free money), which is a very bad way to do it. Direct subsidies tend to go straight into rich people's pockets, they don't promote employment. Subsidies can be crafted in more clever ways, and the best way is not to favor one particular industry over another, but rather to favor employment over outsourcing and let the market decide what type of employment is best.

Of course the US government does heavily subsidize certain behaviors, so the idea of subsidy to affect the market is not at all exotic. The subsidies for real estate investment are massive; even ignoring TARP and the FM's and such, the mortgage interest deduction and capital gains exclusion are huge behavior modifiers for very questionable benefit. The lower capital gains tax (vs. income tax) and dividend tax rate are huge subsidies for investors. And of course the tax code in general is a huge subsidy for corporations (vs. individuals) and particularly for multi-national corporations. Why are we massively subsidizing all those things, and not employment and local business?


01-09-12 | LZ Optimal Parse with A Star Part 5

Wrapping up the series with lots of numbers.

Previous parts :

cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1

I'm delirious with fever right now so I might write something inane, but I'm so bored of lying in bed so I'm trying to wrap this up. Anyhoo..

So first of all we have to talk a bit about what we're comparing the A Star parse to.

"Normal" is a complex forward lazy parse using heuristics to guide parsing, as described in Part 1. "Fast" is like Normal but uses simpler heuristics and simpler match finder.

"Chain" is more interesting. Chain is a complex "lazy"-type parser which considers N decisions ahead (eg. Chain 4 considers 4 decisions ahead). It works thusly :

Chain Parse : first do a full parse of the file using some other parser; this provides a baseline cost to end from each point. Now do a forward parse. At each position, consider all match and literal options. For each option, step ahead by that option and consider all the options at the next position. Add up the cost of each coding step. After N steps (for chain N) add on the cost to end from the first baseline parse. Go back to the original position and finalize the choice with the lowest cost. Basically it's a full graph walk for N steps, then use an estimate of the cost to the end from the final nodes of that sub-graph.
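
To make that concrete, here's a minimal sketch of the Chain-N cost evaluation in C++ (just a sketch, not my actual code ; Choice, EnumerateChoices and CostOfChoice are placeholders for whatever your match finder and back-end coder provide, and real code would also have to thread the adaptive coder state through each step) :

#include <vector>
#include <algorithm>
#include <cfloat>

struct Choice { int len; }; // a literal (len = 1) or a match of length len

// placeholders - your match finder / coder supply these :
std::vector<Choice> EnumerateChoices(int pos);  // capped at ~8 options, always includes the literal
double CostOfChoice(int pos, const Choice & c); // bits to code this choice with current stats

// cost from pos to the end : walk the full graph for stepsLeft decisions,
// then fall back to the baseline (backwards LZSS optimal parse) estimate
double ChainCost(int pos, int stepsLeft, const std::vector<double> & baselineCostToEnd)
{
    if ( pos >= (int) baselineCostToEnd.size() ) return 0;
    if ( stepsLeft == 0 ) return baselineCostToEnd[pos];

    double best = DBL_MAX;
    for ( const Choice & c : EnumerateChoices(pos) )
    {
        double j = CostOfChoice(pos,c) + ChainCost(pos + c.len, stepsLeft-1, baselineCostToEnd);
        best = std::min(best,j);
    }
    return best;
}

At each position you evaluate ChainCost through each first-step option, finalize the cheapest one, advance by its length, and repeat.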

To make Chain parsing viable you have to reduce the number of match options to a maximum of 8 or so. Still Chain N has a complexity of 8^N , so it becomes slow very quickly as N grows.

Chain forward parse is significantly better than LZSS style backwards optimal parse for these LZ coders that have important adaptive state. The baseline parse I use for Chain actually is a backwards LZSS optimal parse, so you can see how it does by looking at the "Chain 0" results.


First overall results. Chain 6 is the most amount of steps I can run in reasonable time, and AStar 2048 means the quantum length for dividing up the file for AStar was 2048.

raw Fast Normal Chain 6 AStar 2048
lzt00 16914 5179 5016 4923 4920
lzt01 200000 198313 198321 198312 198312
lzt02 755121 181109 177792 173220 173315
lzt03 3471552 1746443 1713023 1698949 1690655
lzt04 48649 13088 12412 10407 10249
lzt05 927796 368346 367598 355804 354230
lzt06 563160 352827 351051 344721 343173
lzt07 500000 226533 215996 209133 208566
lzt08 355400 250503 249987 230541 230220
lzt09 786488 302927 287479 268544 265525
lzt10 154624 11508 10958 10307 10291
lzt11 58524 20553 19628 19139 19087
lzt12 164423 29001 26488 23966 23622
lzt13 1041576 935484 931415 924510 922745
lzt14 102400 47690 47298 46417 46350
lzt15 34664 10832 10688 10269 10260
lzt16 21504 10110 10055 9952 9927
lzt17 53161 19526 18514 17971 17970
lzt18 102400 64280 63251 59772 59635
lzt19 768771 322951 288872 269132 269162
lzt20 1179702 888881 872315 856369 855588
lzt21 679936 91677 88011 83529 83184
lzt22 400000 287715 284378 279674 279459
lzt23 1048576 807253 804048 798369 798334
lzt24 3471552 1418076 1411387 1399197 1388105
lzt25 1029744 113085 107882 97320 100175
lzt26 262144 212445 210836 207701 207552
lzt27 857241 237253 235137 222023 220837
lzt28 1591760 332660 308940 260547 252808
lzt29 3953035 1193914 1180823 1147160 1135603
lzt30 100000 100001 100001 100001 100001
sum: 10800163 10609600 10337879 10289860


Now the number of chain steps for the chain parser. In the table below, U is the raw (uncompressed) size, N is the Normal parse from the table above, and O0 - O6 are Chain with 0 to 6 steps :

U N O0 O1 O2 O3 O4 O5 O6
lzt00 16914 5016 5024 4922 4922 4922 4922 4923 4923
lzt01 200000 198321 198321 198312 198312 198312 198312 198312 198312
lzt02 755121 177792 177877 175905 174835 174073 173759 173509 173220
lzt03 3471552 1713023 1712337 1704417 1703873 1702651 1701635 1700282 1698949
lzt04 48649 12412 11315 10516 10481 10457 10427 10416 10407
lzt05 927796 367598 368729 365743 364332 360630 356403 355968 355804
lzt06 563160 351051 350995 346856 345500 344778 344739 344702 344721
lzt07 500000 215996 215644 211336 209481 209259 209244 209138 209133
lzt08 355400 249987 249372 239375 237320 231554 231435 233324 230541
lzt09 786488 287479 284875 280683 275679 270721 269754 269107 268544
lzt10 154624 10958 10792 10367 10335 10330 10311 10301 10307
lzt11 58524 19628 19604 19247 19175 19225 19162 19159 19139
lzt12 164423 26488 25644 24217 24177 24094 24108 24011 23966
lzt13 1041576 931415 931415 929713 927841 926162 924515 924513 924510
lzt14 102400 47298 47300 46518 46483 46461 46437 46429 46417
lzt15 34664 10688 10656 10317 10301 10275 10278 10267 10269
lzt16 21504 10055 10053 9960 9966 9959 9952 9948 9952
lzt17 53161 18514 18549 17971 17970 17974 17971 17973 17971
lzt18 102400 63251 63248 59863 59850 59799 59790 59764 59772
lzt19 768771 288872 281959 277661 273316 269157 269141 269133 269132
lzt20 1179702 872315 872022 868088 865376 863236 859727 856408 856369
lzt21 679936 88011 88068 84848 83851 83733 83674 83599 83529
lzt22 400000 284378 284297 281902 279711 279685 279689 279696 279674
lzt23 1048576 804048 804064 802742 801324 799891 798367 798368 798369
lzt24 3471552 1411387 1410226 1404736 1403314 1402345 1401064 1400193 1399197
lzt25 1029744 107882 107414 99839 100154 99710 98552 98132 97320
lzt26 262144 210836 210855 207775 207763 207738 207725 207706 207701
lzt27 857241 235137 236568 233524 228073 223123 222884 222540 222023
lzt28 1591760 308940 295072 286018 276905 273520 269611 264726 260547
lzt29 3953035 1180823 1183407 1180733 1177854 1170944 1162310 1152482 1147160
lzt30 100000 100001 100001 100001 100001 100001 100001 100001 100001
sum: 10609600 10585703 10494105 10448475 10404719 10375899 10355030 10337879

Some notes : up to 6 (the most I can run) more chain steps is better - for the sum, but not for all files. In some cases, more steps is worse, which should never really happen, but it's an issue of approximate optimal parsers I'll discuss later. (*)

On most files, going past 4 chain steps helps very little, but on some files it seems to monotonically keep improving. For example lzt29 stands out. Those files are ones that get helped the most by AStar.


Now the effect of quantum size on AStar. In all cases I only output codes from the first 3/4 of each quantum.

raw 256 512 1024 2048 4096 8192 16384
lzt00 16914 4923 4923 4920 4920 4920 4921 4921
lzt01 200000 198312 198312 198312 198312 198312 198314 198314
lzt02 755121 175242 173355 173368 173315 173331 173454 173479
lzt03 3471552 1699795 1691530 1690878 1690655 1690594 1690603 1690617
lzt04 48649 10243 10245 10234 10249 10248 10241 10241
lzt05 927796 357166 354629 354235 354230 354233 354242 354257
lzt06 563160 346663 343202 343139 343173 343194 343263 343238
lzt07 500000 209934 208669 208584 208566 208556 208553 208562
lzt08 355400 228389 229447 229975 230220 230300 230374 230408
lzt09 786488 266571 265564 265487 265525 265559 265542 265527
lzt10 154624 10701 10468 10330 10291 10273 10273 10272
lzt11 58524 19139 19123 19096 19087 19085 19084 19084
lzt12 164423 23712 23654 23616 23622 23628 23630 23627
lzt13 1041576 923258 922853 922747 922745 922753 922751 922753
lzt14 102400 46397 46364 46351 46350 46350 46348 46350
lzt15 34664 10376 10272 10260 10260 10251 10258 10254
lzt16 21504 9944 9931 9926 9927 9927 9927 9927
lzt17 53161 17937 17970 17968 17970 17969 17969 17969
lzt18 102400 59703 59613 59632 59635 59637 59640 59640
lzt19 768771 269213 269151 269128 269162 269193 269218 269229
lzt20 1179702 855992 855580 855478 855588 855671 855685 855707
lzt21 679936 83882 83291 83215 83184 83172 83171 83169
lzt22 400000 279803 279368 279414 279459 279605 279630 279647
lzt23 1048576 798325 798319 798321 798334 798354 798357 798358
lzt24 3471552 1393742 1388636 1388031 1388105 1388317 1388628 1388671
lzt25 1029744 97910 101246 101302 100175 100484 100272 100149
lzt26 262144 207779 207563 207541 207552 207559 207577 207576
lzt27 857241 222229 220832 220770 220837 220773 220756 220757
lzt28 1591760 256404 253257 252933 252808 252737 252735 252699
lzt29 3953035 1136193 1135442 1135543 1135603 1135710 1135689 1135713
lzt30 100000 100001 100001 100001 100001 100001 100001 100001
sum: 10319878 10292810 10290735 10289860 10290696 10291106 10291116

The best sum is at 2048, but 1024 is a lot faster and almost the same.

Again, as the previous note at (*), we should really see just improvement with larger quantum sizes, but past 2048 we start seeing it go backwards in some cases.


Lastly a look at where the AStar parse is spending its time. This is for a 1024 quantum.

The x axis here is the log2 of the number of nodes visited to parse a quantum. So, log2=20 means a million nodes were needed to parse that quantum. So for speed purposes a cell one to the right is twice as bad. The values in the cells are the percentage of quanta in the file that needed that number of nodes.

(note : log2=20 means one million nodes were visited to output 768 bytes worth of codes, so it's quite a lot)

log2 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
lzt00 0 0 0 18.18 59.09 18.18 4.55
lzt01 3.75 0.75 41.2 34.08 13.86 5.62 0.75
lzt02 1.81 1.36 25.37 34.09 13.59 13.02 8.15 1.93 0.23 0.23
lzt03 1.46 1.18 17.51 18.46 14.16 13.17 6.95 4.81 3.66 4.81 9.54 2.79 0.96 0.11 0.03
lzt04 1.67 0 0 1.67 0 21.67 5 18.33 3.33 10 16.67 16.67 5
lzt05 0.59 0.25 4.41 10.77 9.92 18.32 13.23 10.09 9.67 6.02 12.47 3.22 0.51 0.08 0.08
lzt06 0.8 0.93 6.81 23.77 14.69 16.96 21.09 11.48 2.67 0.8
lzt07 0.46 0.46 8.66 7.88 6.8 15.3 17 14.53 5.56 9.58 8.19 4.79 0.31 0.31
lzt08 0 0 0 0 0 0 1.68 1.68 1.47 27.67 53.88 11.95 1.68
lzt09 0.29 0.48 0.76 0.86 0.95 3.9 28.07 47.76 16.18 0.38
lzt10 0 0.56 10.17 12.99 9.04 9.04 10.17 41.24 4.52 0.56 1.13
lzt11 0 0 7.89 10.53 14.47 17.11 6.58 9.21 21.05 10.53 2.63
lzt12 0 0 0 0 0 4.27 28.91 59.24 7.58
lzt13 0 0 0.07 0.14 0.57 1.72 3.36 5.72 39.24 42.03 7.08 0.07
lzt14 0 0.83 0 2.5 8.33 34.17 42.5 5 2.5 1.67 0.83 0 0.83
lzt15 0 2.27 4.55 15.91 13.64 15.91 13.64 6.82 11.36 11.36 4.55
lzt16 0 0 3.57 0 14.29 42.86 32.14 3.57
lzt17 1.39 1.39 2.78 1.39 4.17 75 13.89
lzt18 0 0 0 0 0 0.72 0 2.17 2.9 11.59 56.52 23.19 2.9
lzt19 0 0 1.26 2.81 0.39 7.56 87.11 0.87
lzt20 0 0.13 2.08 2.02 4.29 67.07 24.29 0.06
lzt21 0.2 0.78 6.07 6.07 5.28 19.77 35.62 22.9 1.96 0.2 0.2
lzt22 0 0.56 2.98 5.59 26.82 62.94 1.12
lzt23 0 0 0 0 0 0.07 1.35 2.63 0.92 70.88 23.15 0.14 0.36 0.5
lzt24 0.44 0.61 4.14 37.41 7.62 12.68 12.72 8.52 6.11 5.19 3.11 0.94 0.31 0.04
lzt25 0.22 0.43 1.52 1.74 2.68 6.44 15.69 27.19 30.22 13.09 0.72
lzt26 0 0 0 1.15 3.15 2.58 77.65 14.61 0.57
lzt27 0.61 0.1 7.55 6.53 1.22 4.39 5 4.08 7.76 44.8 16.43 1.43
lzt28 0.25 0.1 3.71 0.94 0.74 6.77 15.56 10.08 10.97 14.82 18.68 11.41 4.05 1.24 0.1
lzt29 0.3 0.73 1.61 22.37 5.28 6.16 26.34 2.97 0.48 0.85 19.63 12.47 0.73
lzt30 3.7 0.74 47.41 34.07 12.59 0.74

Well there's no easy answer, the character of the files are all very different.

In many cases the A Star parse is reasonably fast (comparable to Chain 3 or something). But in some cases it's quite slow, eg. lzt04, lzt08, lzt28.


Okay, I think that's all the data. We have one point to discuss :

(*) = in all these types of endeavors, we see these anomalies where, as we give the optimizer more space to make decisions, it gets better for a while, then starts getting worse. I saw the same thing, but more extreme, with video coding.

Basically what causes this is that you aren't optimizing for your real final goal. If you were optimizing for the total output size, then giving it more freedom should never hurt. But you aren't. With Chain N or with A Star in both cases you are optimizing just some local portion, and it turns out that if you let it make really aggressive decisions trying to optimize the local bit, that can hurt overall.

A similar issue happens with a Huffman optimal parse, because you are using the Huffman code lengths from the previous parse to do the current parse. That's fine as long as your parse is reasonably similar, but if you let the optimal parser really go nuts, it can start to get pretty far off those statistics, which makes it wrong, so that more optimizing actually gives worse results.
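
Schematically that feedback loop is something like this (a sketch ; the types and functions here are placeholders, not real code) :

// statistics feedback in a Huffman-costed optimal parse (sketch) :
Histo    histo    = InitialHisto(data);             // eg. from a heuristic first parse
CodeLens codeLens = BuildHuffmanLengths(histo);
for(int iter=0; iter<numIters; iter++)
{
    Parse p  = OptimalParse(data, codeLens);         // tokens are costed with the *old* lengths
    codeLens = BuildHuffmanLengths( Histogram(p) );  // lengths re-measured on the parse just made
}

If the parser is allowed to be too aggressive, the statistics of the new parse can drift far from the codeLens it was costed with, and the loop stops converging to anything good.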

With video coding the main issue I had was that the optimization was generally local (eg. just on one macro block at a time or some such), but it of course affects the future as a source for motion compensation (and in other ways), and it turns out if you do really aggressive optimization on the local decisions, that can wind up hurting overall.

A similar thing can happen in image and video coding if you let optimization proceed very aggressively, because you have to use some simple analytic criterion (such as RMSE - though even if you use a fancier metric the same problems arise). The issue is that the coder can wind up finding strange states that are a good trade-off for RMSE, but wind up looking just horrible visually.

Obviously the correct solution is to optimize with the true final goal in mind. But that's not always possible, either computationally, or because the final goal is subjective.

Generally the solution is to moderate the optimization in some way. You have some heuristic idea of what kind of solutions will provide good globally optimal solutions. (for example, in image/video coding, you might require that the bit rate allocation not create too big of a difference between adjacent blocks). So you sort of want to guide your optimization to start around where you suspect the answer to be, and then you tune it so that you don't allow it to be too aggressive in making whatever decision it thinks is locally optimal.


01-07-12 | Protectionism

There are some basic economics that I just don't understand, and a lot of the time the accepted "right answer" conflicts with common sense.

One example is it seems to me that buying something locally made is better for the local area than buying something made far away. (for "locality" you may substitute "state" or "nation" or whatever region you want to divide things by).

For me this conjures bad memories of the anti-"Jap" "USA USA" crowd of the '80's that had "buy American" bumper stickers and such ; something in my moral fibers says you should buy the best quality cheapest product. But I don't think that's true.

Consider for the moment the case that there are two products, identical in price and quality. One is locally made, one is foreign made. I contend it is better for the local area (and usually better for you personally) to buy the locally made product. When you do that, the money goes to someone who lives nearby, who spends that money again, and that person spends it, again, etc. This makes the local area prosperous.

(in a purely selfish sense, whether or not making the local area prosperous is good for you or not depends on the details of your situation; if you are a merchant or an altruist, it is good for you; but if your business is international and you would prefer local property values to be low, it might be bad for you; we will assume for the moment that you want the local area to benefit).

So there is some value to buying local and keeping money and industry circulating locally.

So, even if the local product is somewhat more expensive, it still might be better overall if you bought that instead of the foreign product. You have to weigh the benefit of both; the region gains some utility from access to cheaper foreign products, but that is traded off against not circulating that money around the local economy. eg. there's some break even point (in terms of overall utility) ; maybe if the local product costs 20% more, that's the actual break even point.

Of course consumers should not have to make that decision themselves, they should just be able to buy the cheapest product. The correct way to fix that is with government - one of the valuable things that government can do is to make apparent price equal to actual price (eg. to make price proportional to utility, or to move long term costs forward, etc.), or to use laws to bias pricing so that logical purchasing decisions lead to the greatest overall utility (eg. putting penalties on products that help you but hurt others).

The obvious way to make the prices match utility here is either with tariffs on imports, or subsidies for local production. This is called "protectionism" to attack it, but it seems to me it's just a way of getting the benefit of circulating those dollars locally.

I'm a little disturbed by my conclusion because it's awfully close to the anti-globalization crackpots who claim that modern government financial policy benefits "wall street not main street" (and other slogans).


Granted, in reality, it's too late to go back to pre-1990's protectionism. The cat is out of the bag. And of course in reality protectionism degrades into political gifts for corrupt corporations. But we can ignore those issues for the theoretical discussion.

Also, if you are an extremely altruistic chap you might question the whole goal of maximizing the benefit to your locality (nation/state/fiefdom/whatever). You might say the goal of policies should be to maximize the good for the world. But for the moment let's ignore that and assume that the government of a nation should act to maximize benefit for that nation.


01-06-12 | Surveying is a Powder Keg

So I got my property surveyed a while ago, because there were some boundaries I wasn't totally sure about, and wanted to see how much space I had for fences, etc.

I should have realized this but did not - surveying is a powder keg. When the surveyor comes out and puts out his stakes and flags, it's like a siren call for neighborhood crazies to come around and dispute the line.

Basically people are retarded and unreasonable, and before just calmly talking to you, they assume you will ask them to take down their fence which is across the line (or whatever). Depending on where you live, the exact force of crazy takes different forms; out in the country you might get shot at; over in the suburbs you might get served with a notice of adverse possession. All just because you hired a surveyor, before you even consider doing anything about it. The crazies seem to see the surveyor's flag as a declaration of war.

The mistake I made was I got the survey and then I wanted some time to think about how I was going to fence things, I didn't just jam up the fences right away. That gave the crazies time, and this is what they did :

(pink flag stake is official from the surveyor of course, and crazy neighbor has put up his own line two feet into my property from the official mark)

Well done, crazy neighbor.

A bit of forehead vein popping and yelling at them got them to remove the post, but I expect more complications of this issue before it is done.

(BTW yelling at people is never satisfying in real life the way it is in fiction. In fiction there's this myth that people are actually good and reasonable, and were just doing something bad for a moment, and when you yell at them they realize their mistake and reform; like if Gordon Ramsay walked in at the right moment and yelled at Hitler he would be like "oh gosh, you're right, I'm so ashamed, I'll try to do better". Furthermore in fiction, they respond to the yelling either by yelling back, giving a satisfying argument, or by accepting the scolding and apologizing. In real life that never happens, what really happens is they try to change the subject, or turn it around and somehow blame you, or make excuses, or bring up random other points that don't matter to the issue (*), and it just leaves you feeling derailed).

(* = this is maybe the most common and most effective response - the completely random diversion into some other story that just leaves me going "WTF?" and totally takes the wind out of my sails. The other super effective tactic that I've noticed from like car salesman and contractors and such is to just completely stone wall you, like they tack on some $500 extra fee that they specifically agreed they wouldn't do, and you're like "hey WTF is this fee that you said wouldn't be there" and they just act like they did nothing wrong and of course you're going to go along with them, like "yep that's the $500 extra rapeage surcharge, so you can pay now by giving me your bank account and social security number..." , wait, what? I'm in the middle of yelling at you, you can't just act like your way is the only way).


01-06-12 | Nice Wiring Bub

I took off a horrible track light (blyeck, so tacky, track lights and can lights are the worst), and underneath I found this gem :

at first I thought it was just a big wad of masking tape on the end of the wire (bad enough, not an actual insulator, and a fire hazard), but upon peeling the masking tape I found this :

Oh, of course. You spliced on an extra 1 inch of wire, no wire nut, wrapped in masking tape - and the thing that most boggles my mind is that the wire ends are not even twisted in any kind of sane way, they are just randomly balled up around each other. Not to mention the original wires are plenty long without the splice.

Pretty impressive piece of fail.

BTW one of the hazards of old knob & tube wires is that the insulation is only rated to 60 C, but newer light fixtures are allowed to heat up to 90 C (which new Romex can handle). So you need to be careful when installing new light fixtures, and at the very least don't over-bulb (*). One way to solve this (without a ton of rewiring) is to back up the knob and tube a few feet away, put a junction box there, then run new Romex for the last few feet.

(* = I just love to over-bulb ; I can never get enough light; I used to put 100 W's in everything whenever I moved into an apartment. Here I might be a bit more careful about that, because they do generate a lot more heat (BTW I despise the gross inhumane light of CFL's, but one advantage of them with old wiring is they draw much less power (which keeps the wires cool) and they themselves are cool (which doesn't heat up the light fixture and box)). I'm not a huge fan of dimmers (fucking 75 W is already dark enough, I don't need any less than that), but if I could install an *amplifier* that let me over-drive the bulbs that would be sweet (but not in my house, which is apparently wired by paper clips and masking tape)).

It's kind of scary what kinds of disasters can be hiding inside your walls that you don't know about upon purchase (or maybe ever, until they cause water damage or a fire or whatever). I really like doing home improvement work in the garage and the basement, because the walls are unfinished so I can see where the studs are, which is so handy, I can see all the wire runs and junction boxes. It's totally superior.

Covering up your walls is super over-rated. I think if I was designing a modern house it would be all Pompidou Center style with color-coded pipes running around where I could directly access the electricity, water, etc.

If you want a more old fashioned home look, you could still do all your major wire runs around the ceiling and then cover them with a removable wood crown molding piece. That way if you want to get into the wires, you just pop off the crown molding and you have a wooden box for access.


01-04-12 | Two laws you should hate

NDAA : makes legal the GWB/Obama policy of indefinite detainment (outside of war zone, Geneva convention, or any legal jurisdiction), and unilateral assassination orders -

Obama’s Signing Statement on NDAA: I have the power to detain Americans… but I won’t - Alex Jones' Infowars
The NDAA's historic assault on American liberty Jonathan Turley Comment is free guardian.co.uk
The Hit List The Public Applauds As President Obama Kills Two Citizens As A Presidential Prerogative « JONATHAN TURLEY
Senate Votes Overwhelmingly To Allow Indefinite Detention of Citizens « JONATHAN TURLEY

SOPA : basically makes free speech on the internet impossible, by making site hosts legally liable for any content posted on them. Basically allows private companies to censor the internet at will.

Stop Online Piracy Act - Wikipedia, the free encyclopedia
SOPA for Dummies - Google Docs
House takes Senate's bad Internet censorship bill, tries making it worse

SOPA might not pass in its worst form, but some lobbyists are pushing very hard for something like this, so the internet is going to get censored unless we fight it very hard.


01-04-12 | Double Pane Glass is a scam

"Replacement Windows" are shit sold by the window industry to sucker homeowners. The tiny gap (typically less than 1 inch) in a standard double pane IGU (integrate glass unit) is no better (and sometimes worse) than a traditional window + storm.

Throwing out perfectly good lovely old windows for "environmental" reasons is of course retarded; if you want more air proofing and don't already have storm windows, just get some good storms and you're done.

Replacement windows are almost always uglier than good old wood windows, which architecturally fit the house and have nice wavy old glass.

Furthermore, they cannot be maintained and repaired in the same way as an old window + storm. When an IGU fails (which they do in 10-20 years typically) it cannot be repaired, it has to be replaced. Old wood windows can be easily taken apart, cleaned, resealed, and can last 100 years. (Vinyl windows and caulk and foam seal strips and so on are similarly problematic - they seem great at first, but they all decay badly in sun and weather, so have to be replaced regularly and can't really be maintained).

Replacement windows are usually shoved inside the existing window framing, and add their own thin frame which makes the opening smaller and adds an extra ugly architectural detail.

It's a standard "sustainable" bullshit corporate ripoff. To sell you some new crap and get you to throw away your perfectly good old windows. (it's like the wonderful irony of "sustainable christmas trees").

This guy addresses the issue in much more detail.


01-04-12 | Police Brutality

Much has been made of the outrageous treatment of Occupy protestors by police. But I believe it's been small potatoes compared to the rampant, systemic brutality which pervades our nation's police departments. It's fallen out of the news because we're bored of it (we've seen black guys getting beaten by cops a million times) and because it doesn't affect the wealthy, but it has really not gotten better (or not enough, anyway).

Police in America are de-facto above the law. They violate human rights at will, with rarely a punishment greater than suspension or transfer.

Here in Seattle things have gotten shockingly bad, so bad in fact that even the DOJ has made an official report about how bad our police department is. (original here) . What is Seattle's official response? Not to do anything about it, it's to question the methodology of the report. Shameful.

To Protect & Skull-Fuck - Page 1 - News - Seattle - Seattle Weekly
SPDecay - Page 1 - News - Seattle - Seattle Weekly
Seattle sues attorney over public records request Local & Regional Seattle News, Weather, Sports, Breaking News KOMO News
Seattle Police Department Sued by KOMO News for Not Releasing Dash-Cam Videos - Seattle News - The Daily Weekly
Seattle Police A Department in Denial - Page 1 - News - Seattle - Seattle Weekly

One of the few ways that people are getting any justice these days is by getting the dash cam footage to prove that the cops' lies are in fact lies. (see for example Ian Birk lying about John T. Williams "lunging" at him (eerily similar to the very tragic case of Otto Zehm in Spokane in which officers also lied and claimed he "lunged" (with a soda pop bottle, which led them to kill him))). The result of course is that SPD is doing what it can to stop the dash cam system. They are now suing to stop releases of footage under public records disclosures, and have "accidentally" deleted many thousands of hours of footage.

Police Chief Diaz needs to be fired.

But in the larger picture, the big problem is the stone wall and loyalty attitude of police departments; that when there is a case where a police officer may have killed a civilian without cause, their attitude is not to investigate and apologize, it's to cover it, draw ranks, support the officer, etc. This attitude makes not just the few bad cops responsible, but every cop who treats his compatriots as beyond reproach or above the law. Loyalty to evil is not admirable. (ask Joe Paterno).

Following a rash of unjustified killings in the 80's, many laws were passed that make it somewhat more difficult for police officers to use their guns. But the gap has been filled by stompings, clubbings, and taserings.

spokane police abuses summary

previous post at cbloomrants

Of course part of the problem is that there is a decent portion of the population that thinks "tough policing" is a good thing.


12-20-11 | Grocery Store Lines

Grocery store lines are a microcosm for how fucking shitty almost every human is, in almost every way.

First of all, you have the fact that nobody shows any basic human courtesy to each other. I almost never see someone with a ton of items let ahead the person with one item. But of course when I let the person with one item go ahead of me, they invariably do something fucked up like ask for a pack of cigarettes (which always takes forever in US grocery stores) or pay with food stamps or some shit. (aside : why is it always such a clusterfuck to pay with food stamps in some groceries? they must have done it a million times, but the checker always acts like someone handed them monopoly money, and the manager has to get called, and forms are filled out, wtf). Of course the people who are paying with coupons and civil war scrip never warn the person lining up behind them that maybe they should pick a different line.

But when the lines get long you really start to see people's souls.

There's the people who stand around and chat right in the middle of the lines. I watch people over and over asking "are you in line? oh, no? okay". Hmm, maybe you should get the fuck out of the line area to have your chit chat!

Then there's the people who can't seem to run a line in a reasonable direction and wind up blocking all the aisles or running it into another line. Invariably it takes a manager to come over and tell people to "please line up over here" since god knows they aren't going to sort themselves out.

Then you get the people who start stamping around and huffing and quickly looking from one side to another like this is the greatest injustice since slavery. You can just see wheels spinning in their heads about how "ridiculous this is" and so on.

There are the people who think that being really pushy in the back of the line is going to speed things up. We're eight people away from the register and they keep jamming their cart into my feet because the person three ahead of us moved. I get out of line to grab a magazine (leaving my cart) and they push into the gap where my body was. Whoah, slow down dick-face, we're still twenty feet from the register, you can chill a little now.

On the flip side then is the people who are absurdly slow about getting their checkout done with. (and of course to double-down on dickishness, it's often the same people who were impatient and pushy when they were way back in line (*)).

There are two general classes of people who fail to check out quickly :

1. The epically incompetent. These people pay with a check, either because they are ancient geezers (excusable) or because they think cards are somehow inferior or checks are more convenient (inexcusable). They might go digging around in their purse for half an hour trying to find exact change, or somehow still don't know they can scan their card before the checker is done.

2. The intentionally slow. These people think everyone needs to chill and slow down; what's the rush? They might chat with the checker a bit. They think everyone else is rotten for being in such a hurry. OMG you epic douchebag; it's fine if you want to live a slow, relaxed, life, but it's not okay to impose that on everyone behind you in line. You probably drive slowly too and think that everyone behind you is in the wrong for wanting to go faster. You probably have a "keep your laws off my body" bumper sticker, and fail to see that your own behavior is the same kind of selfish forcing of your values on others.

(* = the double-dick seems to be the norm for airplane passengers; who are reliably annoying and pushy and do a lot of huffing when they are way back in line, but when they actually get up to the TSA guy they still have their shoes on, don't have their ID in hand, are drinking a gallon of liquid, and act like it's some big surprise. Same thing with the overhead bin stowage of course).


12-19-11 | SRAM

SRAM is rolling out a big promotional campaign this year, trying to convince people that their components are actually superior.

They've signed up lots of the pro teams. Just as with car racing, do not be misled by what the pros use. I see lots of morons on forums saying "well the race team uses this, it must be great". No, the reason the race team uses it is because they are paid to use it.

SRAM double-tap shifters are fucking *awful*. Absolutely retarded. Imagine your right mouse button was taken away and instead you had to double-tap the left to accomplish that function. Yep, it's horrible.

Double-press GUI is always horrible and always should only be used as a last resort. We use it sometimes in games because there just aren't enough buttons on console controllers, but a smart game designer knows that only the secondary functions should go on double-tap buttons (the same goes for "hold" buttons) and the twitch functions should be on their own dedicated button.

Actually it's even worse than that. They don't do the right thing when you're at the edge of the gear shift limits. So like if you are at the low end, you can't go any lower (which you would accomplish by double-tapping), it will still let you single tap (to up-shift). So you're riding up a steep hill and you want a lower gear, you go to double-tap, and oh fuck half way through it the lever won't let you do the double-tap, but you've already single-tapped. There's no way to back out of it, when you let go it will up-shift you and you'll be fucked.

The STI system is just one million billion times better. But it's patented. That's why all these shift levers are so dang expensive, because they're patented.

It's also why there has to be a new lever system every year, a new size of bottom bracket, a new headset system - it's so that the manufacturer can patent it and/or make an exclusive line so that they can rip you off. The old system was perfectly fine functionally, the problem with it was that generic brands were starting to come out with cheap decent components for that system. We can't have that.

It's just like medicine of course, though with medicine it's much more diabolical.

Certainly with medicine it's obvious that there should be laws that prevent the pointless pushing of the new expensive product when it's not actually any better than cheap old solutions.

But I think it would actually be in the world's best interest to have a similar law for everything. It would be hard to phrase and hard to enforce, but the idea is something like - you must make components that are compatible with others on the market unless the incompatibility is for a necessary functional reason. It's actually much better for the free market and competition if products can plug and play and the consumer can choose based on price and functionality, not compatibility with some bullshit proprietary interface.

One that annoys me is car parts; most of the car parts for a Porsche or BMW or whatever are actually identical to the ones for a cheaper car (like a VW for example) - but they intentionally make the interface ever so slightly different so that you can't just go buy the cheaper part. The parts are all made by Bosch or whoever major part supplier anyway, it's not like you get a better brand of part for the money. The interesting thing to me is that the car maker doesn't really benefit from this, it's the part maker who does, so there must be some kind of collusion where the car maker gets a kickback in exchange for using the proprietary part.

Maybe the most obvious example is car wheels. Wheels are wheels, there's no need for them to be car specific, but the auto manufacturers intentionally use different bolt spacings (5x130, 4x110, etc) so that you can't go buy cheap mass market wheels for your fancy car. You can cross-shop the exact same wheel with different bolt spacings, and the price difference can be 2X or more.


12-17-11 | LZ Optimal Parse with A Star Part 4

Continuing ...
Part 1
Part 2
Part 3

So we have our A star parse from last time.

First of all, when we "early out" we still actually fill out that hash_node. That is, you pop a certain "arrival", then you evaluate the early out conditions and decide this arrival is not worth pursuing. You need to make a hash_node and mark it as a dead end, so that when you pop earlier arrivals that see this node, they won't try to visit it again.
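
For concreteness, the node record might look something like this (a sketch ; the exact key and fields depend on how much adaptive coder state you track) :

struct hash_node
{
    uint64  key;          // position + relevant adaptive coder state, hashed together
    float   cost_to_end;  // filled in once the sub-graph below this node is solved
    bool    is_solved;    // cost_to_end is valid
    bool    is_dead_end;  // the arrival for this node was early-outed
};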

One option would be to use a separate hash of just bools that mark dead ends. This could be a super-efficient smaller hash table of bit flags or bloom filters or something, which would save memory and perhaps speed.

I didn't do this because you can get some win from considering parses that have been "early outed". What you do is when you decide to "early out" an arrival, you will not walk to any future nodes that are not yet done, but you *will* consider paths that go to nodes that were already there. In pseudo-code :


pop an arrival

check arrival early outs ; don't return, just set an "early out" flag

for all coding choices at current pos
{
  find next_node (the node this choice steps to)
  if next_node already exists in the hash
    compute cost to end through it   // allowed even when early-outed
  else
    if ! early out flag
      create next_node and push it on the arrivals stack
    // else : early-outed arrivals don't create new nodes,
    // they only reuse nodes that are already in the graph
}

So the early out stops you from creating any new nodes in the graph walk that you wouldn't have visited anyway, but you can still find new connections through that graph. What this lets you do in practice is drive the early out thresholds tighter.

The other subtlety is that it helps a lot to actually have two (or more) stages of early out. Rather than just stop considering all exit coding choices once you don't like your arrival, you have a couple of levels. If your arrival looks sort of bad but not terrible, then you still consider some of the coding choices. Instead of considering 8 or 16 coding choices, you reduce it to 2 or 4 which you believe are likely advantageous.

The exact details depend on the structure of your back end coder, but some examples of "likely advantageous" coding choices that you would consider in the intermediate early out case : if you have a "repeat recent offset" structure like LZX/LZMA, then those are obvious things to include in the "likely advantageous" set. Another one might be RLE or continue-previous type of match codes. Another would be if the literal codes below a certain number of bits with the current statistics. Also the longest match if it's longer than a certain amount.

Okay, so our A star is working now, but we have a problem. We're still just not getting enough early outs, and if you run this on a big file it will (sometimes) take forever.

The solution is to use another aspect we expect from our LZ back end, which is "semi-locality". Locality means that a decision we make now will not have a huge effect way in the future. Yes, it has some effect, because it may change the state and that affects the future, but over time the state changes so many times and adapts to future coding that the decision 4000 bytes ago doesn't matter all that much.

Another key point is that the bad (slow) case occurs when there are lots of parses that cost approximately the same. Because of our early out structure, if there is a really good cheap parse we will generally converge towards it, and then the other choices will be more expensive and they will early out and we won't consider too many paths. We only get into bad degeneracy if there are lots of parses with similar cost. And the thing is, in that case we really don't care which one we pick. So when we find an area of the file that has a huge branching factor that's hard to make a decision about, we are imperfect but it doesn't cost us much overall.

The result is that we can cut up the file to make the parse space tractable. What I do is work in "quanta". You take the current chunk of the file as your quantum and parse it as if it was its own little file. The parse at the beginning of the quantum will be mostly unaffected by the quantum cut, but the parse at the end will be highly affected by the false EOF, so you just throw it out. That is, advance through the first 50% or 75% of the parse, and then start the next quantum there.

There is one special case for the quantum cutting which is long matches that extend past the end of the quantum. What you would see is when outputting the first 50% of the parse, the last code will be a match that goes to the end of the quantum. Instead I just output the full length of the match. This is not ideal but the loss is negligible.

For speed you can go even further and use adaptive quantum lengths. On highly degenerate parts of the file, there may be a huge node space to parse that doesn't get early-out'ed. When you detect one of these, you can just reduce the quantum length for that part of the file. eg. you start with a quantum length of 4096 ; if as you are parsing that quantum you find that the hash table occupancy is beyond some threshold (like 1 million nodes for example), you decide the branching factor is too great and reduce the quantum length to 2048 and resume parsing on just the beginning of that chunk. You might hit 1 million nodes again, then you reduce to 1024, etc.
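
To make that concrete, the driving loop might look something like this. This is a hedged sketch only : parse_quantum and emit_parse_prefix are hypothetical stand-ins for the real encoder, ParseResult is just a placeholder, and the constants are the example numbers from above.


#include <algorithm>
#include <cstddef>

struct ParseResult { /* chosen codes, costs, etc. for one quantum */ };

// hypothetical : parses [start,start+len) as if it were its own little file;
// returns false if the node budget was blown before the quantum finished
bool parse_quantum(const unsigned char * buf, size_t start, size_t len,
                   size_t max_nodes, ParseResult * result);

// hypothetical : outputs only the first keep_len bytes worth of the parse
void emit_parse_prefix(const ParseResult & result, size_t start, size_t keep_len);

void optimal_parse_file(const unsigned char * buf, size_t file_len)
{
    const size_t max_nodes = 1u << 20;   // ~1 million nodes before giving up
    size_t quantum_len = 4096;
    size_t pos = 0;

    while (pos < file_len)
    {
        size_t len = std::min(quantum_len, file_len - pos);

        ParseResult result;
        if ( ! parse_quantum(buf, pos, len, max_nodes, &result) )
        {
            // branching factor too great : halve the quantum and
            // re-parse just the beginning of this same chunk
            if ( quantum_len > 256 ) { quantum_len /= 2; continue; }
            // else accept whatever we got
        }

        // keep only the first ~75% of the parse; the tail is polluted by the
        // false EOF at the quantum boundary (a match that runs past the kept
        // prefix is just emitted at its full length)
        size_t keep_len = ( pos + len >= file_len ) ? len : (len*3)/4;
        if ( keep_len == 0 ) keep_len = len;
        emit_parse_prefix(result, pos, keep_len);
        pos += keep_len;

        quantum_len = 4096;   // back to the full quantum for the next chunk
    }
}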

That's it! Probably a followup post with some results numbers and maybe some more notes about subtle issues. I could also do several long posts about ideas I tried that didn't work which I think are sort of interesting.


12-17-11 | LZ Optimal Parse with A Star Part 3

Continuing ...
Part 1
Part 2

At the end of Part 2 we looked at how to do a forward LZSS optimal parse. Now we're going to add adaptive "state" to the mix.

Each node in the walk of parses represents a certain {Pos,State} pair. There are now too many possible nodes to store them all, so we can't just use an array to store every {Pos,State} node we might visit. Hopefully we will not visit them all, so we store the nodes we do visit in a hash table.

We are parsing forward, so for any node we visit (a {Pos,State} will be called a "node") we know how we got there. There can be many ways of reaching the same node, but we only care about the cheapest one. So we only need to store one entering link into each node, and the total cost from the beginning of the path to get to that node.

If you think about the flow of how the forward LZSS parse completes, it's sort of like an ice tendril reaching out which then suddenly crystallizes. You start at the beginning and you are always pushing the longest length choice first - that is, you are taking big steps into the parse towards the end without filling in all the gaps. Once you get to the end with that first long path (which is actually the greedy parse - the parse made by taking the longest match available at each step), then it starts popping backwards and filling in all the gaps. It then does all the dense work, filling backwards towards the beginning.

So it's like the parse goes in two directions - reaching from the beginning to get to the end (with nodes that don't have enough information), and then densely bubbling back from the end (and making final decisions). (if I was less lazy I would make a video of this).

Anyhoo, we'll make that structure more explicit. The hash table, for each node, stores the cost to get to the end from that node, and the coding choice that gives that cost.

The forward parse uses entry links, which I will henceforth call "arrivals". This is a destination node (a {pos,state}), and the cost from the beginning. (you don't need to store how you got here from the beginning since that can be reproduced at the end by rewalking from the beginning).


Full cost of parse through this node =

arrival.cost_from_head + hash_node.cost_to_tail

Once a node has a cost in the hash table, it is done, because it had all the information it needed at that node. But more arrivals can come in later as we fill in the gaps, so the full cost from the beginning of the parse to the end of the parse is not known.
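
To pin down the bookkeeping, the two records might look something like this. A hedged sketch with illustrative names (this is not the real code); "State" stands in for whatever adaptive coder state you carry.


struct State;             // recent offsets, adaptive statistics, etc. (packed small, ~32 bytes)

// keyed by {pos,state} in the hash table; in this basic version a node is
// only inserted once its cost is final, so presence in the table means "done" :
struct hash_node
{
    int cost_to_tail;     // optimal cost from this {pos,state} to the end of the parse
    int best_choice;      // which coding choice achieves that cost
};

// what gets pushed on the stack of arrivals; how we got to a node from the head :
struct arrival
{
    int     pos;
    State * state;
    int     cost_from_head;   // cost of the path from the start of the parse to here
};

// a full parse through a node costs : arrival.cost_from_head + hash_node.cost_to_tail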

Okay, so let's start looking at the parse, based on our simple LZSS pseudo-code from last time :


hash table of node-to-end costs starts empty
stack of arrivals from head starts empty

Push {Pos 1,state initial} on stack of arrivals

While stack is not empty :

pop stack; gives you an arrival to node {P,state}

see if node {P,state} is already in the hash
if so
{
  total cost is arrival.cost_from_head + hash_node.cost_to_tail
  done with this arrival
  continue (back to stack popping);
}

For each coding choice {C} at the current pos
{
  find next_state = state transition from cur state after coding choice C
  next_pos = P + C.len
  next_node = {next_pos,next_state}

  if next_node is in the hash table :
  {
    compute cost to end from code cost of {C} plus next_node.cost_to_tail
  }
  else
  {
    push next_node to the arrivals stack (*1)
  }
}

if no pushes were done
{
  then processing of current node is done
  choose the best cost to end from the choices above
  create a node {P,state} in the hash with that cost
}

(*1 = if any pushes are done, then the current node is also repushed first (before other pushes). The pushes should be done in order from lowest pos to highest pos, just as with LZSS, so that the deep walk is done first).

So, we have a parse, but it's walking every node, which is way too many. Currently this is a full graph walk. What we need are some early outs to avoid walking the whole thing.

The key is to use our intuition about LZ parsing a bit. Because we step deep first, we quickly get one parse for the whole segment (the greedy parse). Then we start stepping back and considering variations on that parse.

The parse doesn't collapse the way it did with LZSS because of the presence of state. That is, say I parsed to the end and now I'm bubbling back and I get back to some pos P. I already walked the long length, so I'm going to consider a shorter one. With LZSS, when I walk to the shorter one, the nodes I need would already be done. But now the nodes aren't done, though importantly the positions have been visited. That is -


At pos P, state S
many future node positions are already done
 (I already walked the longest match length forward)

eg. maybe {P+3, S1} and {P+5, S2} and {P+7, S3} have been done

I take a shorter length now; eg. to {P+2,S4}

from there I consider {P+5, S5}

the node is not done, but a different state at P+5 was done.

If the state didn't matter, we would be able to reuse that node and collapse back to O(N) like LZSS.

Now of course state does matter, but crucially it doesn't matter *that much*. In particular, there is sort of a limit on how much it can help.

Consider for example if "state" is some semi-adaptive statistics. Those statistics are adaptive, so if you go far enough into the future, the state will adapt to the coding parse, and the initial state won't have helped that much. So maybe the initial state helps a lot for the next 8 coding steps. And maybe it helps at most 4 bits each time. Then having a better initial state can help at most 32 bits.

When you see that some other parse has been through this same position P, albeit with different state at this position, if that parse has completed and has a total cost, then we know it is the optimal cost through that node, not just the greedy parse or whatever. That is, whenever a hash node has a cost_to_tail, it is the optimal parse cost to tail. If there is a good parse later on in the file, the optimal parse is going to find that parse, even if it starts from a non-ideal state.

This is the form of our early outs :


When you pop an arrival to node {P,S} , look at the best cost to arrive to pos P for any state, 

if arrival.cost_from_head - best_cost_from_head[P] > threshold
  -> early out

if arrival.cost_from_head + best_cost_to_tail[P] > best_cost_total + threshold
  -> early out

where we've introduced two arrays that track the best seen cost to head & tail at each pos, regardless of state. We also keep a best total cost, which is initially set to infinity until we get through a total parse, and then is updated any time we see a new whole-walk cost.
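
In code, the two tests might look roughly like this. A hedged sketch : the arrays and threshold are exactly the per-position bests and slack described above, and all names are illustrative.


const int COST_INFINITE = 0x7FFFFFFF;

bool should_early_out(int pos, int cost_from_head,
                      const int * best_cost_from_head,  // best cost-from-head seen at each pos, any state
                      const int * best_cost_to_tail,    // best cost-to-tail seen at each pos, any state
                      int best_cost_total,              // best whole-parse cost seen so far
                      int threshold)
{
    // some other state already arrived at this position much more cheaply :
    if ( cost_from_head - best_cost_from_head[pos] > threshold )
        return true;

    // even with the cheapest finish ever seen from this position, a parse
    // through this arrival can't beat the best whole parse by enough :
    if ( best_cost_to_tail[pos] != COST_INFINITE && best_cost_total != COST_INFINITE &&
         cost_from_head + best_cost_to_tail[pos] > best_cost_total + threshold )
        return true;

    return false;
}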

This is just A star. From each node we are trying to find a lower bound for the cost to get to the end. What we use is previous encodings from that position to the end, and we assume that starting from a different state can't help more than some amount.

Next time, some subtleties.


12-17-11 | LZ Optimal Parse with A Star Part 2

Okay, optimal parsing with A star. (BTW "optimal" parsing here is really a misnomer that goes back to the LZSS backwards parse where it really was optimal; with a non-trivial coder you can't really do an optimal parse, we really mean "more optimal" (than greedy/lazy type parses)).

Part 1 was just a warmup, but may get you in the mood.

The reason for using A Star is to handle LZ parsing when you have adaptive state. The state changes as you step through the parse forward, so it's hard to deal with this in an LZSS style backwards parse. See some previous notes on backwards parsing and LZ here : 1 , 2 , 3

So, the "state" of the coder is something like maybe an adaptive statistical mode, maybe the LZMA "markov chain" state machine variable, maybe an LZX style recent offset cache (also used in LZMA). I will assume that the state can be packed into a not too huge size, maybe 32 bytes or so, but that the count of states is too large to just try them all (eg. more than 256 states). (*1)

(*1 - in the case that you can collapse the entire state of the coder into a reasonably small number of states (256 or so) then different approaches can be used; perhaps more on this some day; but basically any adaptive statistical state or recent offset makes the state space too large for this).

Trying all parses is impossible even for the tiniest of files. At each position you have something like 1-16 options. (actually sometimes more than 16, but you can limit the choices without much penalty (*2)). You always have the choice of a literal, when you have a match there are typically several offsets, and several lengths per offset to consider. If the state of the coder is changed by the parse choice, then you have to consider different offsets even if they code to the same number of bits in the current decision, because they affect the state in the future.

(*2 - the details of this depend on the back end of the coder; for example if your offset coder is very simple, something like just Golomb type (NOSB) coding, then you know that only the shortest offset for a given length needs to be considered; another simplification, used in LZMA, is that only the longest length for a given offset is considered; in some coders it helps to consider shorter length choices as well; in general for a match of length L you need to consider all lengths in [2,L], but in practice you can reduce that large set by picking a few "inflection points" (perhaps more on this some day)).

Okay, a few more generalities. Let's revisit the LZSS backwards optimal parser. It came from a forward style parser, which we can implement with "dynamic programming" ; like this :


At pos P , consider the set of possible coding choices {C}

For each choice (ci), find the cost of the choice, plus the cost after that choice :
{

  Cost to end [ci] = Current cost of choice C [ci] + Best cost to end [ P + C[ci].len ]

}

choose ci as best Cost to end
Best code to end[ P ] = Cost to end [ best ci ]

You may note that if you do this walking forward, then the "Best cost to end" at the next position may not be computed yet. If so, then you suspend the current computation and step ahead to do that, then eventually come back and finish the current decision.

Of course with LZSS the simpler way to do it is just to parse backwards from the end, because that ensures the future costs are already done when you need them. But let's stick with the forward parse because we need to introduce adaptive state.
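
For reference, the backwards version is just a reverse loop filling a cost array. A hedged sketch : cost_of_literal, cost_of_match and matches_at are hypothetical stand-ins for your coder's bit costs and your string matcher, and matches are assumed to fit within the file.


#include <vector>

struct Match  { int len; int offset; };
struct Choice { int cost_to_end; Match match; bool is_literal; };

int cost_of_literal(int pos);                              // hypothetical coder cost
int cost_of_match(const Match & m);                        // hypothetical coder cost
int matches_at(int pos, Match * matches, int max_matches); // fills matches[], returns count

std::vector<Choice> lzss_backward_parse(int file_len)
{
    std::vector<Choice> best(file_len + 1);
    best[file_len].cost_to_end = 0;   // past the end costs nothing

    for (int pos = file_len - 1; pos >= 0; pos--)
    {
        // a literal is always possible :
        best[pos].is_literal  = true;
        best[pos].cost_to_end = cost_of_literal(pos) + best[pos+1].cost_to_end;

        // try each match; the future cost at pos+len is already final
        // because we walk backwards :
        Match matches[16];
        int n = matches_at(pos, matches, 16);
        for (int i = 0; i < n; i++)
        {
            int c = cost_of_match(matches[i]) + best[pos + matches[i].len].cost_to_end;
            if ( c < best[pos].cost_to_end )
            {
                best[pos].cost_to_end = c;
                best[pos].match       = matches[i];
                best[pos].is_literal  = false;
            }
        }
    }
    return best;   // then walk forward from pos 0 taking best[] choices to emit the parse
}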

The forward LZSS parse (with no state) is still O(N) just like the backward parse (this time cost assumes the string matching is free or previously done, and that you consider a fixed number of match choices, not proportional to the number of matches or length of matches, which would ruin the O(N) property) - it just requires more book keeping.

In full detail a forward LZSS looks like this :


Set "best cost to end" for all positions to "uncomputed"

Push Pos 1 on stack of needed positions.

While stack is not empty :

pop stack; gives you a pos P

If any of the positions that I need ( P + C.len ) are not done :
{
  push self (P) back on stack
  push all positions ( P + C.len ) on stack
    in order from lowest to highest pos
}
else
{
  make a choice as above and fill "best cost to end" at pos P
}

If you could not make a choice the first time you visit pos P, then because of the order that we push things on the stack, when you come back and pop P the second time it's guaranteed that everything needed is done. Therefore each position is visited at most twice. Therefore it's still O(N).

We push from lowest to highest len, so that the pops are highest pos first. This makes us do later positions first; that way earlier positions are more likely to have everything they need already done.

Of course with LZSS this is silly, you should just go backwards, but we'll use it to inspire the next step.

To be continued...


12-12-11 | Things I want and cannot find

1. A true "sauna" near Seattle. A hot room right next to a lake, so I can steam it up and then go jump in the cold lake. We're in the perfect place for it, we have lots of nice cold swimmable lakes, and there are tons of Swedes around here, and yet I can't find one.

There are plenty of "saunas" at spas and health clubs, but without the lake swim it's fucking bullshit. I imagine some rich guy has one at his lakefront home, but I can't get the hookup.

2. A secluded cabin rental. I like the idea of going out in the woods and writing code alone. I can wear some flannel and chop wood for the fire like a real Northwesterner. But all the cabin rentals I can find are in sort of "cabin communities" or near a highway, or some shit. I want a place where you can look out of the big picture window and just see scenery for miles.

3. A good river swim. I found a bunch in CA but I can't find any up here. An ideal river swim has a nice deep "hole" due to rocks or waterfall or something. It should be a 4-5 mile hike from the closest parking to cut down on traffic (or be up a rough road or something, not just right off a highway). Ideally it should not be straight out of snow melt so that it's not ball-breaking freezing cold even in the middle of summer.

4. A nice place to ride. A country lane, no cars, good pavement. God damn I miss this.


12-12-11 | Sense

One of the most important skills in an employee is the sense to know when to ask for help and when not to. To know when they should just make a decision on their own vs. ask to make sure their choice is okay. To know when they need to call a meeting about something vs. when not to disturb others about it.

It's incredibly rare actually to find someone who has the sense to get this just right. I think it's very undervalued.

When you're a manager, the most awesome thing you can have is an employee you can trust. That means no unpleasant surprises. If you give them a task, it will be done on time, or you will be notified with enough notice to take action. You won't find out that they're slipping when it's too late to do anything about it. You won't have them claim to be done and then upon inspection find out that they've done it all wrong. You can just assign the task off and then you don't have to worry about it any more. You don't have to follow up and keep pinging them for status updates.

Someone with a great deal of sense will just know to give you status updates at the appropriate intervals. Not too often that they waste your time, but not too infrequently - they should always come in just before you start wondering "WTF happened to this task?".

One of the most crucial things is knowing what decisions they need to get approval for. It sucks to have an employee who asks about every little thing. "should I put this button here or here? should I make another file for this code or put it in this file?" Just make a fucking decision yourself, I don't care! But it also sucks to have someone go off and do all kinds of crazy shit without asking, like "oh yeah I ripped out the old animation system and am doing a new one" ; uh, you did what? and you didn't ask me first? WTF. Both are very common.

Of course the definition of the "right amount of approval" depends on the manager, and a key part of having good "sense" is actually social adaptation - it's about adapting to your situation and learning what is wanted of you. Many of the type-A left-brain coders never get this; part of your job as an employee is always interacting with other human beings, even if it's only with your boss, and there is no rational absolute answer about the right way to communicate, you have to feel it out and adapt.

Of course part of the role of a good manager is to teach these things, and to help people who may have good skills but not much "sense".

It's actually more annoying in personal life than in business life. For example you're having a dinner party and somebody volunteers to bring the wine, and then they show up with none, or they show up with a box of ripple. WTF dude, I could have just gotten it myself, if you're going to drop the ball, you need to notify someone with sufficient warning.

The annoying thing about the non-business world is you can't check up on them; like "hey can you give me a status update on that wine purchasing?" because you would be considered a huge dick.


A lot of this goes along with what I call "basic professionalism". Like if I assign you a crucial task that I need done today, don't go home without checking in with me and telling me it's done or not. If you think I assigned you too much and you can't get it done in time, don't go pout, come and tell me about it.

Another aspect of "basic professionalism" is knowing when to shut up. Like if you think the company is going in the wrong direction - raise the issue to your managers, that's good, if you have a good boss they want that feedback. But after they call a meeting and everyone disagrees with you and the decision is made to go on the path you don't like - it's time to shut up about it. We don't want to hear complaints every day.

A related aspect is knowing who it's appropriate to say things to. When we have someone from the publisher touring the studio, that is not the time to point out that you don't like the design of the lead character.

"Basic professionalism" is sort of a level below having good "sense" but it's also actually surprisingly hard to find.


One of the worst situations is to have someone who is not great about "sense" or "basic professionalism" but is touchy about it. Most people are not perfect on these points, and that's okay, but if you're not then you need a certain amount of supervision. That's just the way work gets done, but some people act like it's a personal affront to be monitored.

Like they occasionally drop the ball on tasks, you decide, okay I just have to ask for daily status reports. Then they get all pissy about it, "don't you trust me" or it's "too much beaurocracy" blah blah.

Or if they don't come to you and ask questions at the appropriate time, then you have to pre-screen all their approaches. Like sometimes you assign them a task and they'll just go off and start doing it wrong without saying anything. Now what you have to do is when you assign a task you have to say "can you tell me how you're going to approach this?" to make sure they don't say something nutso.


12-09-11 | Kittens

We want to get a kitten (since we have a stable house now), and I would like to just get a kitten from a home but WTF they don't exist any more.

When I was a kid, every couple of weeks some family in the neighborhood would have kittens and put out a sign. You could go to their house and see the kittens play and pick one. You could see if they were coming from a good home where they got socialized. You could see how old they were to know they weren't separated from their mom at too young an age.

There just isn't any of this anymore. It seems to have all been corporatized into kitten adoption centers.

Yeah yeah yeah you should adopt a deformed adult cat that drools and has mange. No thanks. Part of the reason why I can't just find a normal home to adopt from is all the pet-adoption-nazis make it so that you can't use craigslist (or whatever) to find pets.

Also all the adoption agencies have strict spay/neuter at time of adoption rules. When I was a kid, when we got a cat, sometimes we would not spay right away so that we could get a batch of kittens, keep one, and give the rest away. It was delightful to have a line of generations and a bunch of kittens to play with. That tradition seems to be all gone. The result is that the babies only come from strays or weirdos outside the system. It's sort of like if all law abiding citizens were castrated, then the only children in the world would be from criminals.

Boo.


Also, in other cat news, it turns out our professional cat sitter grossly overfed our cat while we were away in Hawaii. This despite verbal and written instructions on the correct amount to feed her. So she's sickly obese on our return.

How fucking hard is it to follow basic instructions? Jesus christ, I'm trying not to be a rich old crank, but I can't help thinking things like "it's so hard to find good help" and that the poor are poor because they're fucking retarded. (*). Half a cup means fucking half a cup, not "oh, I'll feed her until she stops eating like she's starving". You are the fucking help, you don't get to make your own decisions when I gave you specific orders.

What makes it even worse is that she (the cat sitter) gave us the usual condescending "I know so much about cats" bullshit when we interviewed her. Hey lady, I've watched an episode of the Dog Whisperer too, I'm not impressed by your amateur pet psychiatry.

* = you would think that it should be easy to find someone who could like get your groceries for you, or build you a fence, or pay your bills, or whatever, but it's actually really hard. It's amazing how badly the average person will fuck up the most basic assignments. To get someone that is smart enough that you can trust them to do those things, you have to hire someone in the top 1% of intellects, someone who could make $100k a year. It's actually sort of easy to hire someone really smart who costs a lot of money, and it's easy to hire someone who is just manual labor that you have to constantly supervise, but to hire someone in between that you can trust enough not to supervise but doesn't cost a fortune is hard because people are epic fuck ups.


12-08-11 | Some Semaphores

In case you don't agree with Boost that Semaphore is too "error prone" , or if you don't agree with C++0x that semaphore is unnecessary because it can be implemented from condition_var (do I need to point out why that is ridiculous reasoning for a library writer?) - here are some semaphores for you.

I've posted a fastsemaphore before, but here's a more complete version that can wrap a base semaphore.


template< typename t_base_sem >
class fastsemaphore_t
{
private:
    t_base_sem m_base_sem;
    atomic<int> m_count;

public:
    fastsemaphore_t(int count = 0)
    :   m_count(count)
    {
        RL_ASSERT(count > -1);
    }

    ~fastsemaphore_t()
    {
    }

    void post()
    {
        if (m_count($).fetch_add(1,mo_acq_rel) < 0)
        {
            m_base_sem.post();
        }
    }

    void post(int count)
    {
        int prev = m_count($).fetch_add(count,mo_acq_rel);
        if ( prev < 0)
        {
            int num_waiters = -prev;
            int num_to_wake = MIN(num_waiters,count);
            // use N-wake if available in base sem :
            // m_base_sem.post(num_to_wake);
            for(int i=0;i<num_to_wake;i++)
            {
                m_base_sem.post();
            }
        }
    }
    
    bool try_wait()
    {
        // see if we can dec count before preparing the wait
        int c = m_count($).load(mo_acquire);
        while ( c > 0 )
        {
            if ( m_count($).compare_exchange_weak(c,c-1,mo_acq_rel) )
                return true;
            // c was reloaded
            // backoff here optional
        }
        return false;
    }
        
    void wait_no_spin()
    {
        if (m_count($).fetch_add(-1,mo_acq_rel) < 1)
        {
            m_base_sem.wait();
        }
    }
    
    void wait()
    {
        int spin_count = 1; // ! set this for your system
        while(spin_count--)
        {
            if ( try_wait() ) 
                return;
        }
        
        wait_no_spin();
    }
    
    
    int debug_get_count() { return m_count($).load(); }
};

when m_count is negative, its magnitude is the number of waiters (plus or minus threads that are about to wait, or about to be woken).

Personally I think the base semaphore that fastsem wraps should just be your OS semaphore and don't worry about it. It only gets invoked for thread wake/sleep so who cares.

But you can easily make Semaphore from CondVar and then put fastsemaphore on top of that. (note the semaphore from condvar wake N is not awesome because CV typically doesn't provide wake N, only wake 1 or wake all).

Wrapping fastsem around NT's Keyed Events is particularly trivial because of the semantics of the Keyed Event Release. NtReleaseKeyedEvent waits for someone to wake if there is no one. I've noted in the past that a Win32 event is a lot like a semaphore with a max count of 1 ; a problem with building a Semaphore from a normal Event would be that if you Set it when it's already Set, you effectively run into the max count and lose your Set, but this is impossible with KeyedEvent. With KeyedEvent you get exactly one wake from Wait for each Release.

So, if we wrap up keyed_event for convenience :


struct keyed_event
{
    HANDLE  m_keyedEvent;

    enum { WAITKEY_SHIFT = 1 };

    keyed_event()
    {
        NtCreateKeyedEvent(&m_keyedEvent,EVENT_ALL_ACCESS,NULL,0);
    }
    ~keyed_event()
    {
        CloseHandle(m_keyedEvent);
    }

    void wait(intptr_t key)
    {
        RL_ASSERT( (key&1) == 0 );
        NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(key),FALSE,NULL);
    }

    void post(intptr_t key)
    {
        RL_ASSERT( (key&1) == 0 );
        NtReleaseKeyedEvent(m_keyedEvent,(PVOID)(key),FALSE,NULL);
    }
};

Then the base sem from KE is trivial :


struct base_semaphore_from_keyed_event
{
    keyed_event ke;

    base_semaphore_from_keyed_event() { }
    ~base_semaphore_from_keyed_event() { }
    
    void post() { ke.post((intptr_t)this); }
    void wait() { ke.wait((intptr_t)this); }
};

(note this is a silly way to use KE just for testing purposes; in practice it would be shared, not one per sem - that's sort of the whole point of KE).

(note that you don't ever use this base_sem directly, you use it with a fastsemaphore wrapper).

I also revisited the semaphore_from_waitset that I talked about a few posts ago. The best I can come up with is something like this :


class semaphore_from_waitset
{
    waitset_simple m_waitset;
    std::atomic<int> m_count;

public:
    semaphore_from_waitset(int count = 0)
    :   m_count(count), m_waitset()
    {
        RL_ASSERT(count >= 0);
    }

    ~semaphore_from_waitset()
    {
    }

public:
    void post()
    {
        m_count($).fetch_add(1,mo_acq_rel);
        m_waitset.notify_one();
    }

    bool try_wait()
    {
        // see if we can dec count before preparing the wait
        int c = m_count($).load(mo_acquire);
        while ( c > 0 )
        {
            if ( m_count($).compare_exchange_weak(c,c-1,mo_acq_rel) )
                return true;
            // c was reloaded
        }
        return false;
    }

    void wait(wait_thread_context * cntx)
    {
        for(;;)
        {
            // could spin a few times on this :
            if ( try_wait() )
                return;
    
            // no count available, get ready to wait
            waiter w(cntx);
            m_waitset.prepare_wait(&w);
            
            // double check :
            if ( try_wait() )
            {
                // (*1)
                m_waitset.retire_wait(&w);
                // pass on the notify :
                int signalled = w.flag($).load(mo_acquire);
                if ( signalled )
                    m_waitset.notify_one();
                return;
            }
            
            w.wait();
            m_waitset.retire_wait(&w);
            // loop and try again
        }
    }
    
    void wait()
    {
        wait_thread_context cntx;
        wait(&cntx);
    }
};

The funny bit is at (*1). Recall before we talked about a race that can happen if two threads post and two other threads pop. If one of the poppers gets through to *1 , it dec'ed the sem but is still in the waitset, one pusher might then signal this thread, which is a wasted signal, and the other waiter will not get a signal, and you have a "deadlock" (not a true deadlock, but an unexpected permanent sleep, which I will henceforth call a deadlock).

You can fix that by detecting if you received a signal while you were in the waitset. That's what's done here now. While it is not completely ideal from a performance perspective, it's a rare race case, and even when it happens the penalty is small. I still don't recommend using semaphore_from_waitset unless you have a comprehensive waitset-based system.

(note that in practice you would never make a wait_thread_context on the stack as in the example code ; if you have a waitset-based system it would be in the TLS)

Another note :

I have mentioned before the idea of "direct handoff" semaphores. That is, making it such that thread wakeup implies you get to dec count. For example "base_semaphore_from_keyed_event" above is a direct-handoff semaphore. This is as opposed to "optimistic" semaphores, in which the wakeup just means "you *might* get to dec count" and then you have to try_wait again when you wake up.

Direct handoff is neat because it guarantees a minimum number of thread wakeups - you never wake up a thread which then fails to dec count. But they are in fact not awesome. The problem is that you essentially have some of your semaphore count tied up in limbo while the thread wakeup is happening (which is not a trivial amount of time).

The scenario is like this :


1. thread 1 does a sem.wait

2. thread 2 does a sem.post 
  the sem is "direct handoff" the count is given to thread 1
  thread 1 starts to wake up

3. thread 3 (or thread 2) now decides it can do some consuming
  and tries a sem.wait
  there is no sem count so it goes to sleep

4. thread 1 wakes up and processes its received count

You have actually increased latency to process the message posted by the sem, by the amount of time between steps 3 and 4.

Basically by not pre-deciding who will get the sem count, you leave the opportunity for someone else to get it sooner, and sooner is better.

Finally let's have a gander at the Linux sem : sem_post and sem_wait

If we strip away some of the gunk, it's just :


sem_post()
{

    atomic_add( & sem->value , 1);

    atomic_full_barrier (); // (*1)

    int w = sem->nwaiters; // (*2)

    if ( w > 0 )
    {
        futex_wake( & sem->value, 1 );  // wake 1
    }

}

sem_wait()
{
    if ( try_wait() ) return;

    atomic_add( & sem->waiters , 1);

    for(;;)
    {
        if ( try_wait() ) break;

        futex_wait( & sem->value, 0 ); // wait if sem value == 0
    }

    atomic_add( & sem->waiters , -1);
}

Some quick notes : I believe the barrier at (*1) is unnecessary ; they should be doing an acq_rel inc on sem->value instead. However, as noted in the previous post about "producer-consumer" failures, if your producer is not strongly synchronized it's possible that this barrier helps hide/prevent bugs. Also at (*2) in the code they load nwaiters with plain C which is very sloppy; you should always load lock-free shared variables with an explicit load() call that specifies memory ordering. I believe the ordering constraint there is the load of nwaiters needs to stay after the store to value; the easiest way is to make the inc on value be an RMW acq_rel.

The similarity with waitset should be obvious, but I'll make it super-clear :


sem_post()
{

    atomic_add( & sem->value , 1);
    atomic_full_barrier ();

    // waitset.notify_one :
    {
        int w = sem->nwaiters;
        if ( w > 0 )
        {
            futex_wake( & sem->value, 1 );  // wake 1
        }
    }
}

sem_wait()
{
    if ( try_wait() ) return;

    // waitset.prepare_wait :
    atomic_add( & sem->waiters , 1);

    for(;;)
    {
        // standard double-check :
        if ( try_wait() ) break;

        // waitset.wait()
        // (*3)
        futex_wait( & sem->value, 0 ); // wait if sem value == 0
    }

    // waitset.retire_wait :
    atomic_add( & sem->waiters , -1);
}

It's exactly the same, but with one key difference at *3 - the wait does not happen if count is non-zero, which means we never go to sleep and consume a wakeup from futex_wake that we didn't actually need. This removes the need for the re-pass that we had in the waitset semaphore.

This futex semaphore is fine, but you could reduce the number of atomic ops by storing count & waiters in one word.
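
For example, something along these lines. A hedged sketch only : the 16/16 bit split is arbitrary, futex_wait/futex_wake are the same pseudo-wrappers as in the stripped-down code above, and count overflow past 64K is ignored.


#include <atomic>
#include <cstdint>

void futex_wait(std::atomic<uint32_t> * addr, uint32_t expected);  // sleeps only while *addr == expected
void futex_wake(std::atomic<uint32_t> * addr, int count);

struct packed_futex_sem
{
    // low 16 bits = available count, high 16 bits = number of waiters
    std::atomic<uint32_t> state{0};

    void post()
    {
        uint32_t prev = state.fetch_add(1, std::memory_order_acq_rel);  // count++
        if ( (prev >> 16) != 0 )       // there are registered waiters
            futex_wake(&state, 1);
    }

    void wait()
    {
        for(;;)
        {
            uint32_t s = state.load(std::memory_order_acquire);
            if ( (s & 0xFFFF) != 0 )
            {
                // count available : try to take it
                if ( state.compare_exchange_weak(s, s-1, std::memory_order_acq_rel) )
                    return;
                continue;
            }
            // no count : register as a waiter, then sleep while the word is unchanged
            if ( ! state.compare_exchange_weak(s, s + 0x10000, std::memory_order_acq_rel) )
                continue;
            futex_wait(&state, s + 0x10000);
            // woke (or the word changed before we slept) : unregister and retry
            state.fetch_sub(0x10000, std::memory_order_acq_rel);
        }
    }
};

The win is that post is a single RMW (plus a wake only when needed), and the uncontended wait is a single load and CAS; the waiter registration only happens on the slow path.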


12-05-11 | Surprising Producer-Consumer Failures

I run into these a lot, so let's have a quick glance at why they happen.

You're trying to do something like :


Thread1 :

Produce 1
sem.post

Thread2 :

Produce 2
sem.post

Thread 3 :

sem.wait
Consume 1

Thread 4 :

sem.wait
Consume 2

and we assert that the Consume succeeds in both cases. Produce/Consume use a queue or some other kind of lock-free communication structure.

Why can this fail ?

1. A too-weak semaphore. Assuming our Produce and Consume are lock-free and not necessarily synchronized on a single variable with something strong like an acq_rel RMW op, we are relying on the semaphore to synchronize publication.

That is, in this model we assume that the semaphore has something like an "m_count" internal variable, and that both post and wait do an acq_rel RMW on that single variable. You could certainly make a correct counting semaphore which does not have this behavior - it would be correct in the sense of controlling thread flow, but it would not provide the additional behavior of providing a memory ordering sync point.

You usually have something like :


Produce :
store X = A
sem.post // sync point B

Consume:
sem.wait // sync point B
load X  // <- expect to see A

you expect the consume to get what was made in the produce, but that is only guaranteed if the sem post/wait acts as a memory sync point.

There are two reasons I say sem should act like it has an internal "m_count" which is acq_rel, not just release at post and acquire at wait as you might think. One is you want sem.wait to act like a #StoreLoad, so that the loads which occur after it in the Consume will see preceding stores in the Produce. An RMW acq_rel is one way to get a #StoreLoad. The other is that by using an RMW acq_rel on a single variable (or behaving as if you do), it creates a total order on modifications to that variable. For example if T3 sees T1.post and T2.post and then does its T3.wait, T4 cannot see T1.post, T3.wait, T4.wait or any other funny order.

Obviously if you're using an OS semaphore you aren't worrying about this, but there are lots of cases where you use this pattern with something "semaphore-like" , such as maybe "eventcount".

2. You're on POSIX and forget that sem.wait has spurious wakeups on POSIX. Oops.

3. Your queue can temporarily appear smaller than it really is.

Say, as a toy example, adding a node is done something like this :


new_node->next = NULL;

old_head = queue->head($).exchange( new_node );
// (*)
new_node->next = old_head;

There is a moment at (*) where you have truncated the queue down to 1 element. Until you fix the next pointer, the queue has been made to appear smaller than it should be. So pop might not get the items it expects to get.

This looks like a bad way to do a queue, but actually lots of lock free queues have this property in more or less obvious ways. Either the Push or the Pop can temporarily make the queue appear to be smaller than it really is. (for example a common pattern is to have a dummy node, and if Pop takes off the dummy node, it pushes it back on and tries again, but this causes the queue to appear one item smaller than it really is for a while).

If you loop, you should find the item that you expected in the queue. However, this is a nasty form of looping because it's not just due to contention on a variable; if in the example above the thread is swapped out while it sits at point (*), then nobody can make progress on this queue until that thread gets time.

The result I find is that ensuring that waking from sem.wait always implies there is an item ready to pop is not worth the trouble. You can do it in isolated cases but you have to be very careful. A much easier solution is to loop on the pop.
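
In code the consumer side is then just this (a hedged fragment in the same stripped-down style as above; pop() is whatever lock-free pop you have, returning NULL when the queue currently looks empty) :


consume()
{
    sem.wait();   // one post == one item will eventually be poppable

    for(;;)
    {
        item = queue.pop();
        if ( item ) return item;

        // the item is in the queue but temporarily invisible (a producer or
        // another popper is mid-operation) : just retry, optionally yield/pause
    }
}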


12-03-11 | RAD - Hawaii Branch

It's a pretty nice place to work. The ergonomics of the picnic table are not half bad actually. Very glad I brought my keyboard; wish the laptop screen was bigger.


12-03-11 | Worker Thread system with reverse dependencies

In the previous episode we looked at a system for doing work with dependencies.

That system is okay; I believe it works, but it has two disadvantages : 1. It requires some non-standard synchronization primitives such as OR waits, and 2. There is a way that it can fail to do work as soon as possible; that is, there is the possibility for moments when work could be done but the worker that could do it is asleep. It's one of our design goals to not let that happen so let's see why it happens :

The problem basically is the NR (not ready) queue. When we have no RTR (ready to run) work, we pop one item from the NR queue and wait on its dependencies. But there could be other items later in the NR queue which become ready sooner. If the items in the NR queue become ready to run in order, this doesn't occur, but if they can become ready in different orders, we could miss out on chances to do work.

Anyhoo, both of these problems go away and everything becomes much simpler if we reformulate our system in terms of "forward dependencies" instead of "backward dependencies".

Normal "dependencies" are backwards; that is, A depends on B and C, which were created earlier in time. The opposite direction link I will call "permits" (is there a standard term for this?). That is, B and C permit A. A needs 2 permissions before it can run.

I propose that it is conceptually easier to set up work in terms of "dependencies", so the client still formulates work items with dependencies, but when they are submitted to be run, they are converted into "permissions". That is, A --> {B,C} is changed into B --> {A} and C --> {A}.

The main difference is that there is no longer any "not ready" queue at all. NR items are not held in any global list, they are only pointed to by their dependencies. Some dependency back in the tree should be ready to run, and it will then be the root that points through various NR items via permission links.

With no further ado, let's look at the implementation.

The worker thread becomes much simpler :


worker :

wait( RTR_sem );

pop RTR_queue and do work

that's it! Massively simpler. All the work is now in the permissions maintenance, so let's look at that :

How do we maintain permissions? Each item which is NR (not ready) has a (negative) count of the # of permissions needed before it can run. Whenever an item finishes, it walks its permission list and incs the permit count on the target item. When the count reaches zero, all permissions are done and the item can now run.

A work item now has to have a list of permissions. In my old system I had just a fixed size array for dependencies; I found that [3] was always enough; it's simply the nature of work that you rarely need lots of dependencies (and in the very rare cases that you do need more than 3, you can create a dummy item which only marks itself complete when many others are done). But this is not true for permissions, there can be many on one item.

For example, a common case is you do a big IO, and then spawn lots of work on that buffer. You might have 32 work items which depend on the IO. This only needs [1] when expressed as dependencies, but [32] when expressed as permissions. So a fixed size array is out and we will use a linked list.
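
Concretely, the item might look something like this (a hedged sketch; the names are illustrative, and the mutex shown as a member would in practice be the indexed/external lock mentioned below) :


#include <atomic>
#include <mutex>

struct work_item
{
    void (*func)(void * data);    // the work to do
    void * data;

    std::atomic<int> permits;     // negative # of unsatisfied dependencies; 0 = ready to run

    bool        is_done;          // protected by lock
    work_item * permits_head;     // singly-linked list of items I permit when I complete (protected by lock)
    work_item * next_permit;      // link used when this item sits in another item's permits list

    std::mutex  lock;             // in practice an indexed/external lock, not a member
};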

The maintenance looks like this :


submit item for work :

void submit( work_item * wi , work_item * deps[] , int num_deps )
{

    wi->permits = - num_deps;

    if ( num_deps == 0 )
    {
        RTR_queue.push( wi );
        RTR_sem.post();
        return;
    }

    for(int i=0;i<num_deps;i++)
    {
        deps[i]->lock();

        if ( ! deps[i]->is_done )
        {
            deps[i]->permits_list.push( wi );
        }
        else
        {
            int prev = wi->permits.fetch_add(1); // needs to be atomic
            if ( prev == -1 ) // permitted (do this also if num_deps == 0)
            {
                RTR_queue.push( wi );
                RTR_sem.post();
            }
        }

        deps[i]->unlock();
    }

}


when an item is completed :

void complete( work_item * wi )
{
    wi->lock();

    set wi->is_done

    swap wi->permits_list to local permits_list

    wi->unlock();

    for each p in permits_list
    {
        int prev = p->permits.fetch_add(1);

        if ( prev == -1 )
        {
            // p is now permitted

            RTR_queue.push( p );
            RTR_sem.post();
        }
    }
}

the result is that when you submit not-ready items, they go into the permits lists of their dependencies; then as their dependencies get done, their permits counts inc up towards zero; when a count hits zero the item goes into the RTR queue and gets picked up by a worker.

The behavior is entirely the same as the previous system except that workers who are asleep because they have no RTR work can wake up when any NR item becomes RTR, not just when the single one they popped becomes RTR.

One annoyance with this scheme is you need to lock the item to maintain the permits_list ; that's not really a big deal (I use an indexed lock system similar to Nt Keyed Events, I don't actually put a lock object on each item), but I think it's possible to maintain that list correctly and simply lock free, so maybe we'll revisit that.

ADDENDUM : hmm , not easy to do lock free. Actually maintaining the list is not hard, and even doing it and avoiding races against the permitted count is not hard, the problem is that the list is in the work item and items can be deleted at any time, so you either need to hold a lock on the item to prevent deletion, or you need something like RCU or SMR.


12-02-11 | Natural Expression

It's so nice when you find the "natural" way to express a coding problem. All of a sudden everything becomes so much simpler and the answers just start popping out at you. Like oh, and I can do this here, and this automatically happens just the way I wanted. Tons of old code just disappears that was trying to solve the problem in the "un-natural" way.

It doesn't change the code; in the end it all becomes assembly language and it can do the same thing, but changing the way you write it can change the way you think about it. Also when you find a simple elegant way to express things, it sort of makes it feel "right", whereas if you are getting the same thing done through a series of kludges and mess, it feels horrible, even though they are accomplishing the same thing.

It reminds me of physics. I think some of the greatest discoveries of the past century in physics were not actually discoveries of any phenomenon, but just ways to write the physics down. In particular I cite Dirac's Bra-Ket notation and Feynman's path integrals.

Neither one added any new physics. If you look at it in a "positivist" view point, they did nothing - the actual observable predictions were the same. The physics all existed in the equations which were already known. But they opened up a new understanding, and just made it so much more natural and easier to work with the equations, and that can actually have huge consequences.

Dirac's bra ket for example made it clear that quantum mechanics was about Hilbert spaces and Operators. Transformation between different basis spaces became a powerful tool, and very useful and elegant things like raising and lowering operators popped out. Quantum mechanics at the time was sort of controversial (morons like Einstein were still questioning it), and finding a clear elegant solid way to write it down made it seem more reasonable. (physicists have a semi-irrational distrust of any physical laws that are very complicated or vague or difficult to compute with; they also have a superstition that if a physical law can be written in a very concise way, it must be true; eg. when you write Maxwell's equations as d*F = J).

Feynman's path integrals came along just at a time when Quantum Field Theory was in crisis; there were all these infinities which make the theory impossible to calculate with. There were some successful computations, and it just seemed like the right way to extend QM to fields, so people were forging ahead, but these infinities made it an incomplete (and possibly wrong) theory. The path integral didn't solve this, but it made it much easier to see what was actually being computed in the QFT equations - rather than just a big opaque integral that becomes infinity and you don't know why, the path integral lets you separate out the terms and to pretend that they correspond to physical particles flying around in many different ways. It made it more obvious that QFT was correct, and what renormalization was doing, and the fact that renormalization was a physically okay way to fix the infinities.

(while I say this is an irrational superstition, it has been the fact that the laws of physics which are true wind up being expressible in a concise, elegant way (though that way is sometimes not found for a long time after the law's discovery); most programmers have the same superstition: when we see very complex solutions to problems we tend to turn up our noses with distaste; we imagine that if we just found the right way to think about the problem, a simple solution would be clear)

(I know this history is somewhat revisionist, but a good story is more important than accuracy, in all things but science)

Anyhoo, it's nice when you get it.


11-30-11 | Some more Waitset notes

The classic waitset pattern :

check condition

waiter w;
waitset.prepare_wait(&w);

double check condition

w.wait();

waitset.retire_wait(&w);

lends itself very easily to setting a waiter flag. All you do is change the double check into a CAS that sets that flag. For example say your condition is count > 0 , you do :

if ( (count&0x7FFFFFFF) == 0 )
{
    waiter w;
    waitset.prepare_wait(&w);

    // double check condition :
    int c = count.fetch_or( 0x80000000 ); // set waiter flag and double check
    if ( (c&0x7FFFFFFF) == 0 )
        w.wait();

    waitset.retire_wait(&w);
}

then in notify, you can avoid signalling when the waiter flag is not set :

// publish :
int c = count.atomic_inc_and_mask(1,0x7FFFFFFF);
// notify about my publication if there were waiters :
if ( c & 0x80000000 )
  waitset.notify();

(note : don't be misled by using count here; this is still not a good way to build a semaphore; I'm just using an int count as a simple way of modeling a publish/consume.)


I was being obtuse before when I wrote about the problems with waitset OR. It is important to be aware of those issues when working with waitsets, because they are inherent to how waitsets work and you will encounter them in some form or other, but of course you can do an OR if you extend the basic waitset a little.

What you do is give waiter an atomic bool to know if it's been signalled, something like :


struct waiter
{
  atomic<bool> signalled;
  os_handle  waitable_handle;
}

(a "waiter" is a helper which is how you add your "self" to the waitset; depending on the waitset implementation, waitable_handle might be your thread ID for example).

Then in the waitset notify you just do :


if ( w->signalled.exchange(true) == false )
{
   Signal( w->waitable_handle );
}
else
    step to next waiter in waitset and try him again.

That is, you try to only send the signal to handles that need it.

If we use this in the simple OR example from a few days ago, then both waiting threads will wake up - two notify_ones will wake two waiters.

While you're at it, your waiter struct may as well also contain the origin of the signal, like :


if ( w->signalled.exchange(true) == false )
{
    // non-atomic assignment :
    w->signal_origin = this; // this is a waitset
    Signal( w->waitable_handle );
}

That way when you wake from an OR wait you know why.

(note that I'm assuming your os_handle only ever does one state transition - it goes from unsignalled to signalled. This is the correct way to use waitset; each waiter() gets a new waitable handle for its lifetime, and it only lives for the length of one wait. In practice you actually recycle the waiters to avoid creating new ones all the time, but you recycle them safely in a way that you know they cannot be still in use by any thread (alternatively you could just have a waiter per thread in its TLS and reset them between uses))

(BTW of course you don't actually use atomic bool in real code because bool is too badly defined)


11-30-11 | Basic sketch of Worker Thread system with dependencies

You have a bunch of worker threads and work items. Work items can be dependent on other work items, or on external timed events (such as IO).

I've had some trouble with this for a while; I think I finally have a scheme that really works.

There are two queues :


RTR = ready to run : no dependencies, or dependencies are done

NR = not ready ; dependencies still pending

Each queue has an associated semaphore to count the number of items in it.

The basic work popping that each worker does is something like :


// get all available work without considering sleeping -
while( try_wait( RTR_sem ) )
{
    pop RTR_queue and do work
}

// (optionally spin a few times here and check RTR_sem)

// I may have to sleep -

wait( RTR_sem OR NR_sem ); // (*1)

if ( wakeup was from RTR_sem )
{
    pop RTR_queue and do work
}
else
{
    NRI (not ready item) = pop NR_queue
    deps = get dependencies that NRI needs to wait on

    wait( deps OR RTR_sem ); // (*3)

    if ( wakeup was from RTR_sem )
    {
        push NRI back on NR_queue and post NR_sem  // (*4)
        pop RTR_queue and do work
    }
    else
    {
        wakeup was because deps are now done
        NRI should be able to run now, so do it
        (*2)
    }  
}

*1 : the key primitive here is the ability to do a WFMO OR wait, and to know which one of the items signalled you. On Windows this is very easy, it's just WaitForMultipleObjects, which returns the guy who woke you. On other platforms it's trickier and probably involves rolling some of your own mechanisms.

Note that I'm assuming the semaphore Wait() will dec the semaphore at the time you get to run, and the OR wait on multiple semaphores will only dec one of them.

*2 : in practice you may get spurious wakeups or it may be hard to wait on all the dependencies, so you would loop and recheck the deps and possibly wait on them again.

How this differs from my previous system :

My previous system was more of a traditional "work stealing" scheme where each worker had its own queue and would try to just push & pop works from its own queue. This was lower overhead in the fast path (it avoids having a single shared semaphore that they have to contend on, for example), but it had a lot of problems.

Getting workers to go to sleep & wake up correctly in a work stealing scheme is a real mess. It's very hard to tell when you have no work to do, or when you have enough work that you need to wake a new worker, because you don't atomically maintain a work count (eg. a semaphore). You could fix this by making an atomic pair { work items, workers awake } and CAS that pair to maintain it, but that's just a messy way of being a semaphore.

The other problem was what happens when you have dependent work. You want a worker to go to sleep on the dependency, so that it yields CPU time, but wakes up when it can run. I had that, but then you have the problem that if somebody else pushes work that can immediately run, you want to interrupt that wait on the dependency and let the worker do the ready work. The semaphore OR wait fixes this nicely.

If you're writing a fractal renderer or some such nonsense then maybe you want to make lots of little work items and have minimal overhead. But that's a very special purpose rare case. Most of the time it's much more important that you do the right work when possible. My guiding principles are :

If there is no work that can be done now, workers should go to sleep (yield CPU)
If there is work that can be done now, workers should wake up
You should not wake up a worker and have it go back to sleep immediately
You should not have work available to do but the workers sleeping

Even in the "fractal renderer" sort of case, where you have tons of non-dependent work items, the only penalty here is one extra semaphore dec per item, and that's just a CAS (or a fetch_add) assuming you use something like "fastsemaphore" to fast-path the case of being not near zero count.

There is one remaining issue, which is when there is no ready-to-run work, and the workers are asleep on the first semaphore (they have no work items). Then you push a work item with dependencies. What will happen in the code sketch above is that the worker will wake up, pop the not ready item, then go back to sleep on the dependency. This violates article 3 of the resolution ("You should not wake up a worker and have it go back to sleep immediately").

Basically from *1 to *3 in the code is a very short path that wakes from one wait and goes into another wait; that's always bad.

But this can be fixed. What you need is "wait morphing". When you push a not-ready work item and you go into the semaphore code that is incrementing the NR_sem , and you see that you will be waking a thread - before you wake it up, you take it out of the NR_sem wait list, and put it into the NRI's dependency wait list. (you leave it waiting on RTR_sem).

That is, you just leave the thread asleep, you don't signal it to wake it up, it stays waiting on the same handle, but you move the handle from NR_sem to the dependency. You can implement this a few ways. I believe it could be done with Linux's newer versions of futex which provide wait morphing. You would have to build your semaphore and your dependency waiting on futex, which is easy to do, then wait morph to transfer the wait. Alternatively if you build them on "waitset" you simply need to move an item from one waitset to the other. This can be done easily if your waitset uses a mutex to protect its internals, you simply lock both mutexes and move the waitable handle with both held.
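
(A little sketch of that waitset-to-waitset move, assuming the waitset internals are just a mutex-protected list of event handles - made-up names, not the actual code from my system :)


#include <windows.h>
#include <mutex>
#include <list>

struct waitset
{
    std::mutex        lock;
    std::list<HANDLE> waiters;   // one auto-reset event per waiting thread
};

// "wait morph" : move one sleeping waiter from 'from' to 'to' without waking it
void morph_one_waiter(waitset & from, waitset & to)
{
    std::scoped_lock guard(from.lock, to.lock); // locks both without deadlock
    if ( ! from.waiters.empty() )
    {
        HANDLE h = from.waiters.front();
        from.waiters.pop_front();
        to.waiters.push_back(h);
        // h is never signalled here ; the thread stays asleep, but is now
        // eligible to be woken by 'to' instead of 'from'
    }
}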

The net result with wait morphing is very nice. Say for example all your workers are asleep. You create a work item that is dependent on an IO and push it. None of the workers get woken up, but one of them is changed from waiting on work available to waiting on the dependency. When the IO completes it wakes that worker and he runs. If somebody pushed a ton of work in the meantime, all the workers would be woken and they would do that work, and the dependent work would be pushed back on the NR queue and set aside while they did RTR work.

ADDENDUM : at the spot marked (*4) :


push NRI back on NR_queue and post NR_sem // (*4)
pop RTR_queue and do work

In real code you need to do something a bit more complex here. What you do is something like :

if ( NRI is ready ) // double check
{
  RTR_sem.post() // we woke from RTR_sem , put it back
  do NRI work
}
else
{
  push NRI onto NR_lifo and post NR_sem
  pop RTR_queue and do work
}

we've introduced a new queue, the NR_lifo, which is a LIFO (eg. a stack). Now whenever you get an NR_sem post, you do :

// NR_sem just returned from wait so I know an NR item is available :

NRI = NR_lifo.pop()
if ( NRI == NULL )
  NRI = NR_queue.pop()

the item must be in one or the other and we prefer to take from the LIFO first. Basically the LIFO is a holding area for items that were popped off the FIFO and were not yet ready, so we want to keep trying to run those before we go back to the FIFO. You can use a single semaphore to indicate that there is an item in either queue.


11-28-11 | Some lock-free rambling

It helps me a lot to write this stuff down, so here we go.

I continually find that #StoreLoad scenarios are confusing and catch me out. Acquire (#LoadLoad) and Release (#StoreStore) are very intuitive, but #StoreLoad is not. I think I've covered almost this exact situation before, but this stuff is difficult so it's worth revisiting many times. (I find low level threading to be cognitively a lot like quantum mechanics, in that if you do it a lot you become totally comfortable with it, but if you stop doing it even for a month it is super confusing and bizarre when you come back to it, and you have to re-work through all the basics to convince yourself they are true).

(Aside : fucking Google verbatim won't even search for "#StoreLoad" right. Anybody know a web search that is actually verbatim? A whole-word-only option would be nice too, and also a match case option. You know, like basic text search options from like 1970 or so).

The classic case for needing #StoreLoad is WFMO. The very simple scenario goes like this :


bool done1 = false;
bool done2 = false;

// I want to do X() when done1 & done2 are both set.

Thread1:

done1 = true;
if ( done1 && done2 )
    X();

Thread2:

done2 = true;
if ( done1 && done2 )
    X();

This doesn't work.

Obviously Thread1 and Thread2 can run in different orders so done1 and done2 become set in random order. You would expect that one thread or the other must see them both set, but neither might; the reason is that the memory visibility can be reordered. This is a pretty clear illustration of the thing that trips up many people - threads can interleave both in execution order and in memory visibility order.

In particular the bad execution case goes like this :


done1 = false, done2 = false

T1 sets done1 = true
  T1 sees done1 = true (of course)
  T2 still sees done1 = false (store is not yet visible to him)

T2 sets done2 = true
  T2 sees done2 = true
  T1 still sees done2 = false

T1 checks done2 for (done1 && done2)
  still sees done2 = false
  doesn't call X()

T2 checks done1
  still sees done1 = false
  doesn't call X()

later
T1 sees done2=true
T2 sees done1=true

When you write it out it's obvious that the issue is the store visibility is not forced to occur before the load. So you can fix it with :

Thread1:

done1 = true;
#StoreLoad
if ( done1 && done2 )
    X();

As noted previously there is no nice way to make a StoreLoad barrier in C++0x. The best method I've found is to make the loads into fetch_add(0,acq_rel); that works by making the loads also be stores and using a #StoreStore barrier to get store ordering. (UPDATE : using atomic_thread_fence(seq_cst) also works).
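
(The UPDATE version written out in C++11 atomics, just to be concrete - done1/done2/X are the names from the example above :)


#include <atomic>

std::atomic<bool> done1(false), done2(false);

void X(); // whatever should run once both flags are set

void Thread1()
{
    done1.store(true,std::memory_order_relaxed);
    // the #StoreLoad : a seq_cst fence between my store and my loads
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if ( done1.load(std::memory_order_relaxed) && done2.load(std::memory_order_relaxed) )
        X();
}

void Thread2()
{
    done2.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if ( done1.load(std::memory_order_relaxed) && done2.load(std::memory_order_relaxed) )
        X();
}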


The classic simple waitset that we have discussed previously is a bit difficult to use in more complex ways.

Refresher : A waitset works with a double-check pattern, like :


signalling thread :

set condition
waitset.notify();

waiting thread :

if ( ! condition )
{
    waitset.prepare_wait()

    // double check :
    if ( condition )
    {
        waitset.cancel_wait();
    }
    else
    {
        waitset.wait();
    }
}

we've seen in the past how you can easily build a condition var or an eventcount from waitset. In some sense waitset is a very low level primitive and handy for building higher level primitives from. Now on to new material.

You can easily use waitset to perform an "OR" WFMO. You simply add yourself to multiple waitsets. (you need a certain type of waitset for this which lets you pass in the primitive that you want to use for waiting). To do this we slightly extend the waitset API. The cleanest way is something like this :


instead of prepare_wait :

waiter create_waiter();
void add_waiter( waiter & w );

instead of wait/cancel_wait :

~waiter() does cancel/retire wait 
waiter.wait() does wait :

Then an OR wait is something like this :

signal thread 1 :

set condition1
waitset1.notify();

signal thread 2 :

set condition2
waitset2.notify();


waiting thread :

if ( condition1 ) // don't wait

waiter w = waitset1.create_waiter();

// double check condition1 and first check condition2 :

if ( condition1 || condition2 ) // don't wait
  // ~w will take you out of waitset1

waitset2.add_waiter(w);

// double check :

if ( condition2 ) // don't wait

// I'm now in both waitset1 and waitset2
w.wait();

Okay. This works fine. But there is a limitation which might not be entirely obvious.

I have intentionally not made it clear if the notify() in the signalling threads is a notify_one (signal) or notify_all (broadcast). Say you want it to be just notify_one , because you don't want to wake more threads than you need to. Say you have this scenario :


X = false;
Y = false;

Thread1:
X = true;
waitsetX.notify_one();

Thread2:
Y = true;
waitsetY.notify_one();

Thread3:
wait for X || Y

Thread4:
wait for X || Y

this is a deadlock. The problem is that both of the waiter threads can go to sleep, but the two notifies might both go to the same thread.

This is a general difficult problem with waitset and is why you generally have to use broadcast (for example eventcount is built on waitset broadcasting).

You may think this is an anomaly of trying to abuse waitset to do an OR, but it's quite common. For example you might try to do something seemingly simple like build semaphore from waitset.


class semaphore_from_waitset
{
    waitset_simple m_waitset;
    std::atomic<int> m_count;

public:
    semaphore_from_waitset(int count = 0)
    :   m_count(count), m_waitset()
    {
        RL_ASSERT(count >= 0);
    }

    ~semaphore_from_waitset()
    {
    }

public:
    void post()
    {
        m_count($).fetch_add(1,mo_acq_rel);
        // broadcast or signal :
        // (*1)
        //m_waitset.notify_all();
        m_waitset.notify_one();
    }

    bool try_wait()
    {
        // see if we can dec count before preparing the wait
        int c = m_count($).load(mo_acquire);
        while ( c > 0 )
        {
            if ( m_count($).compare_exchange_weak(c,c-1,mo_acq_rel) )
                return true;
            // c was reloaded
        }
        return false;
    }

    void wait(HANDLE h)
    {
        for(;;)
        {
            if ( try_wait() )
                return;
    
            // no count available, get ready to wait
            ResetEvent(h);
            m_waitset.prepare_wait(h);
            
            // double check :
            if ( try_wait() )
            {
                m_waitset.retire_wait(h);
                // (*2)
                // pass on the notify :
                m_waitset.notify_one();
                return;
            }
            
            m_waitset.wait(h);
            m_waitset.retire_wait(h);
            // loop and try again
        }
    }
};

it's totally straightforward in the waitset pattern, except for the broadcast issue. If *1 is just a notify_one, then at *2 you must pass on the notify. Alternatively if you don't have the re-signal at *2 then the notify at *1 must be a broadcast (notify_all).

Now obviously if you have 10 threads waiting on a semaphore and you inc the count by 1, you don't want all 10 threads to wake up so that just 1 of them can dec the count and get to execute. The re-signal method will wake 2 threads, so it's better than broadcast, but still not awesome.

(note that this is easy to fix if you just put a mutex around the whole thing; or you can implement semaphore without waitset; the point is not to reimplement semaphore in a bone-headed way, the point is just that even very simple uses of waitset can break if you use notify_one instead of notify_all).

BTW the failure case for semaphore_from_waitset with only a notify_one and no resignal (eg. if you get the (*1) and (*2) points wrong) goes like this :



    T1 : sem.post , sem.post
    T2&T3 : sem.wait

    execution like this :

    T2&3 both check count and see zero
    T1 now does one inc and notify, no one to notify yet
    T2&3 do prepare_wait
    T2 does its double-check, sees a count and takes it (does not retire yet)
    T3 does its double-check, sees zero, and goes to sleep
    T1 now does the next inc and notify
    -> this is the key problem
    T2 can get the notify because it is still in the waiter list
        (not retired yet)
    but T3 needs the notify

The key point is this spot :

            // double check :
            if ( try_wait() )
            {
                // !! key !!
                m_waitset.retire_wait(h);

you have passed the double check and are not going to wait, but you are still in the waiter list. This means you can be the one thread chosen to receive the signal, but you don't need it. This is why resignal works.


11-25-11 | Sustainability

I've always had a sour feeling about the "sustainability" movement, but I haven't been quite sure why exactly. As a knee-jerk reaction, I feel uneasy about any of the cultish movements where people get overly devoted to a narrow worldview, and tend to get into a dogma where adherence to the movement is more important than logically pursuing the original goals. So for example there are lots of current movements which I basically agree with, like "nose to tail" and "locavorism" and "minimalism" and so on, I think the basic ideas are great, but the movements themselves tend to be weird and actually kind of ruin the idea that I like so much by making it dogmatic.

(eg. if you eat pig's ears because you like pig's ears, that's cool. If you eat pig's ears because you got the whole pig and don't want to throw parts away, that's cool. If you eat pig's ears because you are trying to be a good "nose to tail"'er , that's fucking stupid.)

Anyhoo, I had a few realizations about what it is that bothers me so much about "sustainability". First the obvious ones that I've known for a while :

"sustainability" is so expensive that it's only accessible to 10% of the population. When the vast majority of the population can't afford those products, they are inherently unsustainable, as in they do not support human life and they do not signficantly reduce the amount of factory farming , clear cutting , etc. A lifestyle which is only accessible to the rich cannot transform the earth.

The majority of "sustainable" products are unproven and may in fact not be sustainable, it's just a marketing word that doesn't correspond to any fact of actual low long-term impact on the earth. The fact is that the central valley of california and the fields of iowa have sustained "unsustainable" factory farming for the past 100 years or so, and despite predictions of imminent collapse, they are still feeding hundreds of millions of people for very low prices. On the other hand we get new coconut charcoal and bamboo or hemp or whatever which we don't really know how it will affect the earth in mass production on the long term.

Buying a bunch of new products because they are "sustainable" is of course highly ironic. The most destructive thing that modern society does is buy new junk every time there's a new trend, and this appears to be just another new trend, people throw out their unfashionable "unsustainable" stuff to buy new approved stuff, and will throw that out for the next trend. (also ironically, "minimalism" generally tends to involve buying new stuff).

High-paid low-yield gentleman farmers are inherently unsustainable. You cannot support a 7+ billion person planet with a good quality of life if the cost of a piece of lumber or a piece of fruit is so high and takes so much labor. Our quality of life (per capita) is entirely based on the fact that those things are so cheap and easy, so that we can spend more time producing TV shows and iPods. Now the more extreme hippie-ish end of the sustainability movement might espouse a true back-to-the-land lifestyle change where in fact people do spend more time laboring and don't get TV shows and iPods, but that is a small fringe, the main thrust wants the spoils of civilization.

Now a little equivocation. Buying "sustainable" junk is obviously a form of charity. When you spend much more on a "sustainable" version of a product, you are essentially donating the difference. Where does that donation go? Some (I suspect small) portion of it actually goes to benefiting the earth. Most of the rest goes to profit of the product maker. On that level, it is a very bad form of charity; your charity dollars would have a much greater direct benefit on the earth if you just bought normal products and donated the difference to direct action.

But it is a bit more complicated than that. For the most part I'm not delighted by exercising political expression through purchasing (it's far too easy to manipulate and take advantage of, and in the end the only thing that They care about is that you keep buying things like a good little consumer, so you really aren't winning) - however I can't deny that it does sometimes work. When industry sees that lots of consumers are willing to waste their dollars on "green" products, they do sometimes change their practices for the better, and the net result can be a greater impact than the amount of charity dollars suggest. That is, there is a sort of leverage if businesses think that the "political buyers" will continue spending lots of money far into the future, the businesses will make a change based on lots of *future* dollars. Thus something like only a few tens of millions of dollars in charity spending can actually create a hundred-million dollar product line transformation.

As an aside, I should note that there are lots of small scale "sustainable" endeavors that are basically irrelevant because they are inherently small scale and cannot ever have a significant effect on the planet. For example the reclaimed lumber movement, it's okay, I have no major objection to it (though it's not entirely clear that it's the best value for your charity lumber eco dollars), but it's just irrelevant to any large scale analysis because it can't significantly reduce commercial lumber use. The only sustainable businesses that matter are the ones that have the possibility to go large scale.

Anyhoo, the thing that occurred to me last night was that the large scale sustainable industry is basically built on the back of unsustainable industry.

What I mean is, the large-scale mass produced "sustainable" industry (eg. bamboo flooring, "sustainable" chocolate, etc) is largely about making products in the 3rd world and exporting them to the 1st world. First of all this is sort of inherently unsustainable and hypocritical because it relies on a massive income gap for affordability, essentially you have to have people in subsistence living conditions to subsidize this product, and a good liberal who is spending their charity dollars to direct the world towards a better future should not include that in their better future. But more directly, the workers in those sustainable factories could not live a decent life on their low wages without unsustainable industry. The only way they can be paid so low is because they can get cheap factory farm corn to eat, and cheap sneakers and clothes and everything they need to live. If they had to buy the expensive sustainable junk, they would have to have huge wages, which would make the product even more expensive, which would make it impossible.


11-23-11 | This is not okay

Fuck this shit. I'm going to Hawaii.


11-22-11 | The Mature Programmer

1. The Mature Programmer

The mature programmer manages their own time and productivity well. The MP knows that maintenance is as much work as the initial writing and code always takes longer than you think. The MP knows that any changes to code can introduce bugs, no matter how seemingly trivial. The MP knows that premature optimization is foolish and dangerous. The MP knows that sexy coding like writing big complex systems from scratch is rarely the best way to go. The MP does not get into ego competitions about who has the prettiest code. The MP achieves the best final result in the minimum amount of time.

When I started at Oddworld, I was watching lots of game companies get into dick-waving contests about who had the greatest home-rolled graphics engine, and I was also watching lots of indie developers spend massive amounts of time on their "one true" product and never actually ship it. I resolved that we would not fall into those traps - we would be humble and not reach too far, we would not let our egos stop us from licensing code or using old fashioned solutions to problems, we would stay focused on the end product - any sexy code that didn't produce a visible benefit in the actual shipping game was nixed. For the most part I think we succeeded in that (there were a few digressions that were mainly due to me).

But the way of the Mature Programmer can be a trap which comes back to bite you.

The problem is that writing code in this way is not very much fun. Sure there's the fun of making the product - and if you're working on a game and believe in the game and the team, then just seeing the good product come out can give you motivation. But if you don't have that, it can be a real slog.

Most of us got into programming not for the end products that we create, but because the programming itself is a joy. Code can be beautiful. Code can be a clever, artistic, exciting creation, like a good mathematical proof. The Mature Programmer would say that "clever code is almost always dangerous code". But fuck him. The problem is that when you get carried away with being "mature" you suck the joy right out of coding.

You need to allow yourself a certain amount of indiscretions to keep yourself happy with your code. Sure those templates might not actually be a good idea, but you enjoy writing the code that way - fine, do it. Yes, you are optimizing early and it just makes the code harder to maintain and harder to read and more buggy - but you love to do that, fine, do it.

Obviously you can't go overboard with this, but I think that I (and many others) have gone overboard with being mature. Basically in the last ten years of my evolution as a coder I have become less of a wild card "hot shot" and more of a productivity manager, an efficient task analyzer and proactive coordinator of code-actualizing solutions. It's like a management bureaucracy of one inside my head. It's horrible.

I think there are two factors to consider : first is that being "mature" and productive can cause burnout which winds up hurting your productivity, or it can just make coding unpleasant so you spend fewer hours at it. Most "mature" coders brag about the fact that they can get as much done in 6 hours as they used to do in 14. But those 14 hours were FUN, you coded that long because you loved it, you couldn't get to sleep at night because you wanted to code more; now the 6 hours is all sort of unpleasant because instead of rolling your own solution you're just tying together some java and perl packages. Second is that being productive is not the only goal. We are coding to get some task done and to make money, but we're also coding because we enjoy it, and actually being less productive but enjoying your coding more may be a net +EV.

2. The healthy opposition of a producer

Many programmers in normal coding jobs hate having the interference of a producer (or corporate management, or the publisher, or whatever). This person enforces stupid schedules and won't let us do the features we want, and urrgh we hate them! These coders long to be able to make their own schedules and choose their own tasks and be free.

It's actually very healthy and much more relaxing in many ways to have that opposition. When you have to schedule yourself or make your own decisions about tasks, you become responsible for both the creative "reach for the sky" side and the responsible "stay in budget" side. It's almost impossible to do a good job of both sides. This can happen if you are an indie or even if you are a powerful lead with a weak producer.

Most creative industries know that there is a healthy opposition in having the unconstrained creative dreamer vs. the budget-enforcing producer. You don't want the dreamer to be too worried about schedules or what's possible - you just want them to make ideas and push hard to get more.

When you have to cut features or crunch or whatever, it's nice to have that come from outside - some other force makes you do it and you can hate them and get on with it. It's nice to have that external force to blame that's not on your team; it gives you a target of your frustration, helps bonding, and also gives you an excuse to get the job done (because they told you to).

When you have to balance dreams vs schedules on your own, it adds an intellectual burden to every task - as you do each task you have to consider "is this worth the time? is this the right task to do now? should I do a simpler version of this?" which greatly reduces your ability to focus just on the task itself.

3. Coding standards

It's kind of amazing to me how many experienced programmers still just don't understand programming. The big difficulty in programming is that the space of ways to write something is too large. We can get lost in that space.

One of the problems is simply the intellectual overload. Smart coders can mistakenly believe that they can handle it, but it is a burden on everyone. Every time you write a line of code, if you have to think "should I use lower case or mixed caps?" , or "should I make this a function or just write it in line?" , your brain is spending masses of energy on syntactic decisions and doesn't have its full power for the functionality. Strict coding standards are actually an intellectual relief because they remove all those decisions and give you a specific way to do the syntax. (The same of course goes for reading other people's code - your eyes can immediately start looking at the functionality, not try to figure out the current syntax)

The other big benefit of coding standards is creating a "meta language" which is smaller than the parent language and enforces certain invariants. By doing that you again reduce the space that the brain has to consider. For example you might require that all C macros behave like functions (eg. don't eat scopes and don't declare variables). Now when I see one I know I don't have to worry about those things. Or you might require that globals are never externed and only get accessed through functions called "GetGlobal_blah". It doesn't really matter what they are as long as they are simple, clear, uniform, and strictly enforced, because only if they are totally reliable can you stop thinking about them.
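
(A concrete example of the "macros behave like functions" rule - these macros are made up for illustration :)


// bad : introduces a scope, so " if ( cond ) SWAP_BAD(a,b); else ... " breaks,
// because the ; after the closing brace eats the else
#define SWAP_BAD(a,b)   { int t = (a); (a) = (b); (b) = t; }

// ok : behaves like a single statement and declares nothing in the caller's scope
#define SWAP_INT(a,b)   do { int swap_tmp_ = (a); (a) = (b); (b) = swap_tmp_; } while(0)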

4. The trap of "post-Mature Programmer" ism.

Many great coders of my generation have gone through the strict clean rules-following coder phase and have moved onto the "post" phase. The "post-mature programmer" knows the importance of following strict coding style rules or not indulging themselves too much, but also sees the benefit of bending those rules and believes that they can be a bit more free about deciding on what to do for each situation.

I believe that they/we mostly get this wrong.

The best analogy I can think of is poker. Most successful poker players go through several phases. First you think you're ever so clever and you can bluff and trap people and play all sorts of weird lines. Once you move up levels and start playing serious poker this delusion is quickly wiped out and you realize you need to go back to fundamentals. So then most people will go through the TAG "standard line" phase where they learn the right thing to do in each situation and the standard way to analyze hands, and they will be quite successful with this. (note that "standard line" doesn't mean nitty, it involves things like squeeze plays and even check-shove river bluffs, but it's based on playing a balanced range and studying EV). But then they are so successful with their solid play that they start to think they can get away with "mixing it up", playing hands that are almost certainly not profitable because they think they are good enough post-flop to make up for it (eg. Durrr style), or imagining that by playing some minus EV hands it helps their image and pays off later.

This is almost always wrong. Limping AA is almost always wrong, opening 72o UTG is almost always wrong - maybe you've done some analysis and you've decided it's the right thing at this table at this moment (for example limping AA because the people behind you attack limpers way too much and they think you would never limp AA so they will get stuck easily). It's wrong.

(telling yourself that your current bad play is made up for with later "image value" is one of the great rationalizations that poker players use as an excuse to justify their bad play. Programmers do the same with a set of excuses like "performance" that are really just rationalizing justifications for their bad practices; with poker, EV in the hand is worth more than EV in the bush; that is, the later image value you might win is so small and dubious and depends on various things working out just right that it's almost never correct to give up known current value for possible future value. (a particularly simple case of this is "implied odds" which bad players use as an excuse to chase hands they shouldn't))

The problem is that when you open yourself up to making any possible move at any moment, there is simply too much to consider. You can't possibly go through all those decisions from first principles and make the right choice. Even if you could, there's no way you can sustain it for thousands of hands. You're going to make mistakes.

The same is true in coding; the post-MP knows the value of encapsulating a bit of functionality into a struct + helpers (or a class), but they think "I'm smart enough, I can decide not to do that in this particular case". No! You are wrong. I mean, maybe you are in fact right in this particular case, but it's not a good use of your brain energy to make that decision, and you will make it wrong sometimes.

There is a great value in having simple rules. Like "any time I enter a pot preflop, I come in for a raise". It may not always be the best thing to do, but it's not bad, and it saves you from making possibly big mistakes, and most importantly it frees up your brain for other things.

The same thing happens with life decision making. There's a standard set of cliches :

Don't marry the first person you sleep with
Don't get in a serious relationship off a rebound
Don't buy anything if the salesman is pushing it really hard
Take a day to sleep on any big decision
Don't lend money to poor friends
etc.
you may think "I'm smart, I'm mature, I don't need these rules, I can make my own decision correctly based on the specifics of the current situation". But you are wrong. Sure, following the rules you might miss out on the truly optimum decision once in a while. But it's foolish arrogance to think that your mind is so strong that you don't need the protection and simplicity that the rules provide.

In poker the correct post-solid-player adjustment is very very small. You don't go off making wild plays all the time, that's over-confidence in your abilities and just "spew". A correctly evolved player basically sticks to the solid line and the standard way of evaluating, but knows how to identify situations where a very small correction is correct. Maybe the table is playing too tight preflop, so in the hijack position you start opening the top 35% of hands instead of the top 25% of hands. You don't just start opening every hand. You stay within the scope of the good play that you understand and can do without rethinking your whole approach.

The same is true in programming I believe; the correct adjustment for post-mature coding is very small; you don't have to be totally dogmatic about making every member variable private, but you also don't just stop encapsulating classes at all.


11-09-11 | Weird shite about Exceptions in Windows

What happens when an exception is thrown in Windows ? (please fill in any gaps, I haven't researched this in great detail).

1. The VectoredExceptionHandlers are called. One of these you may not be aware of is the "first chance" exception handler that the MSVC debugger installs. If you have the flags set in a certain way, this will cause you to breakpoint at the spot of the throw without passing the exception on to the SEH chain.

2. The list of __except() handlers is walked and those filters are invoked; if the filter takes the exception then they handle it.

* of note here is the change from x86 to x64. Under x86 SEH handlers were made on the stack and then tacked onto the list as you descended (basically the __try corresponds to tacking on the handler); under x64 that is all removed and the SEH filter walk relies on being able to trace back up the function call stack. Normally there's no difference, however under x64 if your function call stack can't be walked for some reason, then your SEH filters won't get called! This can happen for a few reasons; it can happen due to the 32-64 thunk layer, it can happen if you manually create some ASM or "naked" functions and don't maintain the stack trace info correctly, and it can happen of course if you stomp the return addresses in the stack. See for example : The case of the disappearing OnLoad exception – user-mode callback exceptions in x64 at Thursday Night . (stomping the stack can of course ruin SEH on x86 as well since the exception registration structures are on the stack).

More info on the x64 SEH trace : at osronline and at nynaeve .

3. If no filter wanted the exception, it goes up to the UnhandledExceptionFilter. In MSVC's CRT this is normally set to __CxxUnhandledExceptionFilter, that function itself will check if a debugger is present and do different things (eg. breakpoint).

4. If UnhandledExceptionFilter still didn't handle the exception and passes it on, the OS gets it and you get the application critical error popup box. Depending on your registry settings this may invoke automatic debugging. However as noted here : SetUnhandledExceptionFilter and VC8 - Jochen Kalmbach's WebLog there is a funny bypass where the OS will pass the exception directly to the OS handler and not call your filter.

Automatic debugging is controlled by

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug].  
When it was first introduced it defaulted to Dr Watson. At some point (Vista?) that was changed to point to the Windows Troubleshooting engine instead. I believe that when you install DevStudio this registry key is changed to point to vsjitdebugger. The key "Auto" is set to 0 by default which means ask before popping into the debugger.

To clarify a bit about what happens with unhandled exceptions : your unhandled exception callback is not called first, and is not necessarily called at all. After all the SEH filters get their chance, the OS calls its own internal "UnhandledExceptionFilter" - not your callback. This OS function checks if you are in a debugger and might just pass off the exception to the debugger (this is *not* the "first chance" check which is done based on your MSVC check boxes). This function also might just decide that the exception is a security risk and pass it straight to the AeDebug. If none of those things happen, then your filter may get called. (this is where the CRT CxxUnhandledExceptionFilter would get called if you didn't install anything).
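
(For reference, installing your own top-level filter is just the standard Win32 call below; whether it actually gets called is subject to all the caveats above :)


#include <windows.h>
#include <stdio.h>

static LONG WINAPI MyUnhandledFilter(EXCEPTION_POINTERS * pExc)
{
    fprintf(stderr, "unhandled exception %08X at %p\n",
        (unsigned int) pExc->ExceptionRecord->ExceptionCode,
        pExc->ExceptionRecord->ExceptionAddress);
    // a real filter would write a minidump here, then :
    return EXCEPTION_EXECUTE_HANDLER; // swallow it : process exits without the popup
}

int main()
{
    SetUnhandledExceptionFilter(MyUnhandledFilter);
    // ... run the app ...
    return 0;
}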

Another note : the standard application error popup box just comes from UnhandledExceptionFilter. One of the ways you can get a silent application exit with no popup is if the OS detects that your SEH chain is corrupted, it will just TerminateProcess on your ass and drop you out. Similarly if you do something bad from inside one of your exception handlers. (another way you can get a silent TerminateProcess is if you touch things during thread or process destruction; eg. from a DLL_THREAD_DETACH or something like that, if you try to enter crit secs that are being destroyed you can get a sudden silent process exit).


Some links :

DebugInfo.com - Unexpected user breakpoint in NTDLL.DLL
Under the Hood New Vectored Exception Handling in Windows XP
SetUnhandledExceptionFilter Anti Debug Trick « Evilcodecave’s Weblog
SetUnhandledExceptionFilter and VC8 - Jochen Kalmbach's WebLog
SetErrorMode function
C++ tips AddVectoredExceptionHandler, AddVectoredContinueHandler and SetUnhandledExceptionFilter - Zhanli's tech notes - Sit
A Crash Course on the Depths of Win32 Structured Exception Handling, MSJ January 1997

A bit of interesting stuff about how the /RTC run time checks are implemented :

Visual C++ Debug Builds–”Fast Checks” Cause 5x Slowdowns Random ASCII

A bit about stack sizes on windows, in particular there are *two* thread stack sizes (the reserved and initial commit) and people don't usually think about that carefully when they pass a StackSize to CreateThread :

Thread Stack Size

Not directly related but interesting :

Pushing the Limits of Windows Processes and Threads - Mark's Blog - Site Home - TechNet Blogs
Postmortem Debugging Dr Dobb's
John Robbins' Blog How to Capture a Minidump Let Me Count the Ways
Collecting User-Mode Dumps
Automatically Capturing a Dump When a Process Crashes - .NET Blog - Site Home - MSDN Blogs


11-08-11 | Differences Running in Debugger

Bugs that won't repro under debugging are the worst. I'm not talking about "debug" vs "release" builds; I mean the exact same exe, run in the debugger vs. not in the debugger.

What I'd like is to assemble a list of the differences between running under the debugger and not under the debugger. I don't really know the answers to this so this is a community participation post. eg. you fill in the blanks.

Differences in running under the debugger :

1. Timing. A common problem now with heavily threaded apps are bugs due to timing variation. But where do the timing differences come from exactly?

1.a. OutputDebugString. Duh, affects timing massively. Similarly anything you do dependent on IsDebuggerPresent().

1.b. VC-generated messages about thread creation etc. These obviously affect timing. You can disable them being shown by right-clicking in the output window of the debugger, but the notification is still being sent so you can never completely eliminate the timing difference for creating/destroying threads. (and the debugger does a lot more work for thread accounting anyway, so create/destroy will always have significant timing variation).

2. Exceptions. (not C++ exceptions, which are handled pretty uniformly, but more the low level SEH exceptions like access violations and such). Obviously in the debugger you can toggle the handling of various exceptions and that can change behavior. One thing I'm not sure of is if there are any registry settings or other variables that control exception behavior in NON-debugged runs? (* more on this in another post)

3. Stack. Is there a difference here? Not that I know of.

4. Debug Heap. This is probably the biggest one. Processes run in the debugger on windows *always* get the debug heap, even if you didn't ask for it. You can turn this off by setting _NO_DEBUG_HEAP as an environment variable or starting MSVC with -hd. See Behavior of Spawned Processes .

Note that this isn't coming from MSVC, it's actually in ntdll. When you create your process heap, ntdll does a "QueryInformationProcess" and sees if it's being debugged, and if so it stuffs in the debug heap. The important thing is that this is at heap creation time, which leads to a solution.

5. Child Process issues. Because the debugged process is a child process of the debugger, it inherits its process properties. (the same issue can occur for running under "cmd" vs. spawning from explorer). Two specifics are "permissions" and environment variables. Another inherited value is the "ErrorMode" as in "GetErrorMode/SetErrorMode".

There's a solution to #4 and #5 which is this :

Start your app outside of the debugger. Make it do an int 3 so it pauses. Then attach the debugger. You can now debug but you don't get some of the ran-from-debugger differences.
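
(Something like this works for the pause-and-attach trick; the flag name is made up :)


#include <windows.h>
#include <intrin.h>
#include <string.h>
#include <stdio.h>

// call early in main() ; run the exe normally, then attach the debugger and continue
static void WaitForDebuggerIfRequested(int argc, char ** argv)
{
    if ( argc > 1 && strcmp(argv[1],"-waitdebugger") == 0 )
    {
        printf("pid %lu : attach the debugger now...\n", GetCurrentProcessId());
        while ( ! IsDebuggerPresent() )
            Sleep(100);
        __debugbreak(); // the int 3 ; drops you into the freshly attached debugger
    }
}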

(note to self about attaching : for some reason the MSVC "attach to running process" seems to fail a lot; there are other ways to do it though, when you get an int 3 message box popup you can click "debug" there, or from task manager or procexp you can find the task and click "debug" there).


11-03-11 | BoolYouMustCheck

I've got a lot of functions that return error codes (rather than throw or something). The problem with that is that it's very easy to just not check the error code and then you have incorrect code that can possibly break in a nasty way if the error case is hit and not detected.

One way to test this is like this :


class BoolYouMustCheck
{
private:
    bool m_b;
    mutable bool m_checked;

public :

    //BoolYouMustCheck() : m_b(false), m_checked(false) { }
    BoolYouMustCheck(bool b) : m_b(b), m_checked(false) { }
    
    ~BoolYouMustCheck()
    {
        ASSERT( m_checked );
    }
    
    operator bool () const
    {
        m_checked = true;
        return m_b;
    }

};

it's just a proxy for bool which will assert if it is assigned and never read.

So now you can take a function that returns an error condition, for example :


bool func1(int x)
{
    return ( x > 7 );
}

normally you could easily just call func1() and not check the value. But you change it to :

BoolYouMustCheck func2(int x)
{
    return ( x > 7 );
}

(in practice you probably just want to do #define bool BoolYouMustCheck)

Now (assuming the #define is in place, so func1 also returns BoolYouMustCheck) you get :


{

    int y = clock();

    // asserts:
    func1(y);

    // asserts :
    bool b1 = func1(y);
    
    // okay :
    bool b2 = func1(y);
    if ( b2 )
        y++;
    
    // okay :
    if ( func1(y) )
        y++;
        
    return y;
}

which is kind of nice.

The only ugly thing is that the assert can be rather far removed from the line of code that caused the problem. In the first case (just calling func1 and doing nothing with the return value), you get an assert right away, because the returned class is destructed right away. But in the second case where you assign to b1, you don't get the assert until the end of function scope. I guess you could fix that by taking a stack trace in the constructor.

(note : if you want to intentionally ignore the return value b1 you can just add a line like (int) b1; to suppress the assert.)


11-03-11 | The difficulty of school reform

I'm so opposed to top-down metric based "reform" that I figured I should talk about what I think is a better alternative.

First of all there is no doubt that American public schools are sick. There are lots of good teachers and good classes, but also lots of bad. In my opinion, we don't need massive structural reform, we need a way to get rid of the bad teachers.

(almost always a cluster of bad teachers goes with a bad principal and often a bad superintendent too; they tend to be teachers with seniority who just don't care much anymore, and they all just want to maintain the status quo)

I do believe that charters are not the answer. There's nothing wrong with private schools, but they should be private. I don't believe that federal money should go to private institutions, almost ever, because it leads to corruption, and it also just sucks funding out of the public school system. The charters almost always wind up being a way to discriminate among entrants in some way (even just by desire to go to them), and are also often just a way to sneak around the teachers' union. Anyhoo.

I think the answer is motivating teachers and rewarding good teachers, and also being able to fire bad teachers. If teachers are motivated to succeed, and principals are motivated to hire good teachers and fire bad ones, you have a more free labor market and things will improve.

But how do you do that? This is where the trouble comes in.

I believe standardized test performance is a terrible way to measure success. Most simple metrics like this would be similarly bad.

Judgement by a panel of peers doesn't work, because the teachers get into collusion and just say everyone is great. Perhaps this could be improved by making teachers grade each other, and forcing the grade to be on a curve so there are guaranteed to be winners and losers. But this would just degenerate into a game of "Survivor" where the old guard makes alliances to vote for each other and so on.

I believe the best answer is to let parents grade the teachers. Schools are one of the few areas where I think local government is actually better than top-down federal government, because it's one of the few areas where the local people actually pay attention to what's happening and get involved. (on the other hand, I think local school funding is probably unconstitutional and needs to be abolished; it creates great inequality to this day, despite many court rulings trying to redistribute funding (such as the "robin hood" ruling in Texas))

One idea is to let parents apply for what school they want their kids in and what specific teacher they want. Kids are then assigned by lottery, but you count the number of applications each teacher gets and that's their score. It's basically measuring demand as if teaching was a good. Teachers with high scores get raises and teachers with low scores get fired.

Now you obviously have to control for things like teachers just giving all A's, so people apply because it's the "easy" teacher. One solution might be to force all classes to be graded on a bell. That would actually balance out the social stratification of classes because the grade-grubber kids might want to avoid the most prestigious classes (since they would be full of smart kids and very hard to do well on with a bell curve).

That's all sort of okay I think, but there's a big problem, which is that it biases strongly against areas where the parents don't give a shit. And those are the most problematic areas.


11-02-11 | StringMatchTest Release

Code for my string match testbed discussed previously. I'm not gonna do the work to turn this into a clean standalone, so it's a big mess and you can take what you like out of it.

stringmatchtest.zip (45k)

Note : the stringmatchtest.vcproj project refers to some files that are not included in this distribution. Just delete them from the project.

Requires cblib.zip (633k)

You may also need STLPort (I haven't tried building with the VC STL , I use STLPort 5.1.5 or 5.2.1). (BTW I had to modify the STLPort headers to make it build on VS 2008 ; the mods should be obvious).

Tested with VC 2005 and 2008. Does not build with VC 2010 currently.

The most interesting bit is probably in test_suffixarray, which implements the three suffix-array based string searchers previously described on this blog. See previous posts :

cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression

cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - String Match Results Part 6

StringMatchTest includes :


/*
 * divsufsort.c for libdivsufsort-lite
 * Copyright (c) 2003-2008 Yuta Mori All Rights Reserved.
 *

/* LzFind.c -- Match finder for LZ algorithms
2009-04-22 : Igor Pavlov : Public domain */

/*
    MMC (Morphing Match Chain)
    Match Finder
    Copyright (C) Yann Collet 2010-2011

StringMatchTest like all cbloom.com software is released under zlib license (basically free for all uses).


11-02-11 | I need

I need some light entertainment that won't actively damage my brain.

Ideally like a web feed or something so I get a few minutes of mild diversion every day.

Nothing serious or political or overly technical that will take real concentration or make me angry.

But also nothing that will insult my intelligence or subtly put filth into my brain.

As an example of bad ones : I rather enjoy architecture and design, but I find almost all the design blogs are way too consumerist, pushing the constant purchasing of new crap just because it's the new thing, and that makes me ill ; another example is my current addiction, which is car news sites, which subconsciously fills my brain with all kinds of horrible ideas, like drifting is cool, I should make my exhaust louder, and so on.


10-31-11 | Photos , Mostly Enchantments

Colchuk lake is a beautiful turquoise :

I'm in love with this rock face. It towers over Colchuk and feels like a real living being, it has such presence, and you're not sure if it's protective or menacing; The super-difficult barely-a-trail up to the enchantments is on the left :

Sun shining through larches :

The actual enchantments area is a weird top-of-the-world wasteland :

This is from a hike to Snow Lake earlier in the year; I noticed these vortices sheeting off a rock in the river; the river had perfect laminar flow and the rock edge disturbance was shedding this regular "street" of round vortices that then acted as lenses. The lenses were so perfect they were creating caustic rings of light focused and defocused on the river bed. You can see some of the lenses in the lower left of the photo and the light through them hits the creek bed near the top of the photo.

This guy visited our back yard a few days ago. Pearlescent feathers. No idea what he is, but he seemed real tame like he was probably a pet.

This is the view from my new home office (it's downtown Seattle in the distance). We're in a slight microclimate where we get fractionally less wetness (it's not like a real San Francisco microclimate that's dramatically different; I think we get 90-95% of the wetness); the storms hit us later and leave us sooner, so I get to watch them roll in to Seattle, and I get to watch them clear up. The result is a lot of rainbows. I took a photo of the first one I saw. (BTW whoever has that house with the red roof, I thank you, it really adds some spice to my view).


10-31-11 | Small poker note

This year's WSOP final table has perhaps the best tournament poker players ever at a WSOP final table. Maybe at any major live tournament (?). I don't really follow tournaments much, but there's only one fish at the table, and he's not even a huge fish, he's just a "solid" old player, and everyone else is an internet kid, which means that they actually know what things like "fold equity" and "tournament chip EV vs. real dollar EV" is.

Martin Staszko (40.1 million in chips)
Eoghan O'Dea (33.9 million)
Matt Giannetti (24.7 million)
Phil Collins (23.8 million)
Ben Lamb (20.8 million)
Badih Bounahra (19.7 million)
Pius Heinz (16.4 million)
Anton Makiievskyi (13.8 million)
Samuel Holden (12.3 million)

I'm sure it will be horrible TV ; for one thing, ESPN will just show the all-ins which is horrible boring poker broadcasting. But beyond that, the young internet players are just SOoooo boring to watch. God, cash in some of your poker winnings and buy a personality, please. You may as well point a camera at me while I'm coding.

It's too bad that Daniel Negreanu doesn't have enough humility to buy some lessons from JMan or someone good, because it's so much more entertaining to see someone who actually interacts at the table. And his live metagame skills are very good, so if he would just play better technically he could do well.


10-31-11 | What are these pipes ?

So there's this low area along the path by my house that gathers water. I went out to fix it and digging around I found this colony of pipes :

The two small ones are two inch diameter, the big one is four inch; they're black PVC with hammered in caps (that I can easily pull out by hand). I dug down another foot or so and from what I could see they just continue straight down.

Pretty much all my utilities flow past that spot so they could be related to almost anything. In particular the sewer goes past there so I think they might be an outside sewer access. Kind of weird that there are three of them though instead of just one.

I don't think that we have any kind of french drain system, though it would make sense to have a low point there with a french drain drawing the water off.

Anyhoo, kind of curious what they are before I cover up that spot with a bunch of rock and dirt to raise it up. Photos backing up :


10-31-11 | I hate the web

Google search no longer supports +. The Google response does a great job of making it worse by responding in super-douche corporate bullshit speak, like "actually you didn't want that feature, we know better than you, it's better without it" and also the power-douche "we hear your concerns and are glad for your feedback but fuck you we're going to ignore you".

Google has done some similarly epically retarded shit with Google+ . Like, hey dumb asses, if businesses want to be on Google+ and you don't have the business accounts features done yet, why don't you just let them keep using the normal Google+ and migrate them over when you have the business features done? Oh no, let's kick everyone off and cause a big shit storm because we know the "right way" and you will thank us because it's "worth the wait". Epically retarded.

Anyway, Google Reader got a new look and it seems to be neither good nor bad, but it PISSES ME OFF.

I fucking hate it when shit that I use as a basic part of my day changes under me randomly. It's like if somebody semi-randomly periodically came into your house and swapped out your clothes or your appliances. Sometimes they fuck up some feature that you really liked. But most of all it's just distracting and annoying and ruins your familiarity with a tool.

Web software in general sucks because of this. Even ignoring all the bullshit about how slow it is, or the fact that you need a live connection, or the fact that you can't download web pages properly, etc. it sucks because people are constantly changing it out under your feet.


10-27-11 | Tiny LZ Decoder

I get 17 bytes for the core loop (not including putting the array pointers in registers because presumably they already are there if you care about size).

My x86 is rusty but certainly the trick to being small is to use the ancient 1 byte instructions, which conveniently the string instructions are. For example you might be tempted to read out length & offset like this :


        mov cl,[esi]    // len
        mov al,[esi+1]  // offset
        add esi,2

but it's smaller to do

        lodsb  // len
        mov cl,al
        lodsb  // offset

because it keeps you in 1 byte instructions. (and of course any cleverness with lea is right out). (naively just using lodsw and then you have len and offset in al and ah is even better, but in practice I can't make that smaller)

Anyhoo, here it is. I'm sure someone cleverer with free time could do better.


__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        movsb   // literal
        lodsb   // len
        mov cl,al
        lodsb   // offset
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // match
        pop esi
        cmp edi,edx
        jne more
    }

}

------------------------------------------------------

    more:
        movsb   // literal
00401012  movs        byte ptr es:[edi],byte ptr [esi] 
        lodsb   // len
00401013  lods        byte ptr [esi] 
        mov cl,al
00401014  mov         cl,al 
        lodsb   // offset
00401016  lods        byte ptr [esi] 
        push esi
00401017  push        esi  
        mov esi,edi
00401018  mov         esi,edi 
        sub esi,eax
0040101A  sub         esi,eax 
        rep movsb   // match
0040101C  rep movs    byte ptr es:[edi],byte ptr [esi] 
        pop esi
0040101E  pop         esi  
        cmp edi,edx
0040101F  cmp         edi,edx 
        jne more
00401021  jne         more (401012h) 

Also obviously you would get much better compression with a literal run length instead of a single literal every time, and it only costs a few more bytes of instructions. You would get even better compression if the run len could be either a match len or a literal run len and that's just another few bytes. (ADDENDUM : see end)


A small "Predictor/Finnish" is something like this :


    __asm
    {
ByteLoop:   lodsb   // al = *esi++ // control byte
            mov edx, 0x100
            mov dl, al
BitLoop:
            shr edx, 1
            jz  ByteLoop
            jnc zerobit
            lodsb
            mov [ebx], al
zerobit:    mov al, [ebx]
            mov bh, bl
            mov bl, al
            stosb  // *edi++ = al
            jmp BitLoop
    }

the fast version of Finnish of course copies the bit loop 8 times to avoid looping but you can't do that if you want to be small.

I'm quite sure this could be smaller using some clever {adc esi} or such. Also the sentinel bit looping is taking a lot of instructions and so on.

Note the crucial trick of "Finnish" is that the hash table must be 64k in size and 64k aligned, so you can do the hash update and the table address computation just by cycling the bottom two bytes of ebx. (I use the name "Predictor" for the generic idea of the single bit prediction/literal style compressor; "Finnish" is the specific variant of Predictor that uses this bx trick).

(note that this is not remotely the fast way to write these on modern CPU's)
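
(For clarity, here's a straight C-style version of what that decoder does, as I read the asm - the table must be 64k, but the alignment only matters for the bx trick, and the initial context is just assumed to be zero :)


// table[context] holds the predicted byte for a two-byte context
static unsigned char table[65536];

static void PredictorDecode(unsigned char * to, const unsigned char * fm, unsigned char * to_end)
{
    unsigned int context = 0;  // (second-to-last output byte << 8) | last output byte, like bh/bl
    while ( to < to_end )
    {
        unsigned int control = 0x100 | *fm++;  // 8 flag bits plus a sentinel bit on top
        for(;;)
        {
            unsigned int bit = control & 1;
            control >>= 1;
            if ( control == 0 )
                break;                  // that last bit was the sentinel ; fetch a new control byte
            unsigned char out;
            if ( bit )
            {
                out = *fm++;            // miss : literal follows, teach the table
                table[context] = out;
            }
            else
            {
                out = table[context];   // hit : emit the predicted byte
            }
            *to++ = out;
            context = ((context << 8) | out) & 0xFFFF;  // roll the two-byte context
            if ( to == to_end )
                break;                  // (the asm doesn't bother checking inside the bit loop)
        }
    }
}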


ADDENDUM : Better (in the sense of more compression) LZ decoder in 22 bytes (core loop only) :


__declspec(noinline) void TinyLZDecoder(char * to,char * fm,char * to_end)
{
    __asm
    {
        mov esi,fm
        mov edi,to
        mov edx,to_end
        xor eax,eax
        xor ecx,ecx
    more:
        mov cl,[esi]    // len
        inc esi
        shr cl,1        // bottom bit is flag
        jc literals
        
    //match:
        lodsb   // offset -> al
        push esi
        mov esi,edi
        sub esi,eax
        rep movsb   // copy match
        pop esi
    
        // ecx is zero, just drop through
    literals:
        rep movsb  // copy literal run
    
        cmp edi,edx
        jne more
    }

}

Note that obviously if your data is bigger than 256 bytes you can use a two byte match offset by doing lodsw instead of lodsb.
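
For reference, here's my C reading of the stream format that decoder implies (a sketch, not the original source) : the control byte's bottom bit selects literal run vs. match and the remaining bits are the length.

void TinyLZDecoder_C(unsigned char * to,unsigned char * to_end,const unsigned char * fm)
{
    while ( to < to_end )
    {
        unsigned int control = *fm++;
        unsigned int len = control >> 1;        // shr cl,1
        if ( control & 1 )
        {
            // literal run straight from the compressed stream :
            while ( len-- ) *to++ = *fm++;
        }
        else
        {
            // match : one byte offset, copied byte by byte so overlaps behave like rep movsb
            unsigned int offset = *fm++;
            unsigned char * src = to - offset;
            while ( len-- ) *to++ = *src++;
        }
    }
}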

x86 of course is a special case that's particularly amenable to small LZ decoders; it's not really fair, all other instruction sets will be much larger. (which is sort of ironic because RISC is supposed to make instruction encoding smaller) (addendum : not actually true, that last bit)


10-27-11 | Metrics

The best thing you can ever have in software development is a good metric that you are trying to optimize for. A repeatable test case that produces a score and all you have to do is maximize that score.

However, having a performance metric that doesn't exactly match what you want to optimize can be very harmful. You have to be very careful about how you set your metric and over-training for it.

If you make a speed test where you run the same bit of code over and over a thousand times, you wind up creating code that overlaps well with itself and runs fast when it's hot in cache, and maybe code that factors out to a precompute then repeat - not necessarily things that you wanted.

If you set a compression challenge based on the Calgary Corpus, you wanted to get just great compressors, but instead you get compressors specifically tuned for those files (mainly english text).

An example that has misled many people is automated financial trading software. It might seem that that is the ideal case - you get a score, how much money it makes - and you just optimize it to make more money and let it run. But that's not right, because there are other factors, such as risk. If you just train the software to maximize EV it can wind up learning to do very strange things (like circular arbitrage trades that require huge leverage to squeeze tiny margins; this is part of what killed LTCM for example).

The only time you can really go nuts optimizing for a metric is when the metric is the real final target of your application. Otherwise, you have to be really careful and take the metric with a grain of salt; you optimize for the metric, but also keep an eye on your code and use your human judgement to decide if the changes you're making are good general changes or are just over-specific to the metric.

Anyone in software should know this.

Which is what makes it particularly disturbing that the Gates Foundation supports moronic metric-based education.

When you set simple performance metrics for big bureaucracies, you don't make things better. You make the bureaucracies better at optimizing those metrics. And since they have limited resources and limited amounts of time and energy, that typically makes everything else worse.

Granted, Gates is not so moronic as to advocate "teaching to the test", but even a more complicated cocktail of metrics (which they have yet to define, instead pouring money into metrics research) will not be any different. If you're going to pay and hire and fire people based on metrics you create a horrible situation where any creative thought is punished.

(I think Gates' opposition to small class sizes reflects an over-attention to test results (which have been shown to not correlate strongly to class size) and a lack of common sense about what actually happens in a class room)

The irony is that it's just like the way that horrible teachers grade their students. It's like those essay questions on AP exams where they don't actually read your essay and appreciate what you're saying at all, the grader just looks for the key words that you're supposed to have used if your answer is correct, so you could actually write something that doesn't make sense at all, as long as it has the correct "thesis/evidence/summary" structure you get full points.

It gets me personally hot and bothered. My experience of American public schools was that they were generally absurdly soul-crushing in a bureaucratic Kafka-esque way; like you would be tested for your creativity and independent thought process, and the way that was done was you had to recite back the specific problem solving steps that you had been told were the method. In that general depressing stupidity, I was lucky enough to have a few teachers that really touched me and helped me through life because they just engaged me as a human being and were flexible about how I was allowed to go about things. In terms of objective evaluations I'm sure many of those special teachers would have done very poorly. They often spent entire class sessions rambling on about their personal lives or about politics - and that was great! That was so much more valuable to a child than taking turns reading out of the textbook or following the official lesson plan.


10-26-11 | Some Things I Find Appalling

The US continues to be the largest provider of arms around the world, including to questionable third world countries and private militias.

The FBI continues to use entrapment techniques on suspected possible terrorists in which they provide a more radical undercover agent who provides the arms and encouragement. (just like the 60's, man)

The FBI continues to spy on non-criminals inside the US.

The US government creates semi-hidden propaganda to sell its policies to US citizens.

US journalists are not allowed to cover our own wars any more. Don't be misled by "embeds" or other government-provided "news" footage.

The executive continues to hide its actions under the cloak of "privilege" or "national security" , way beyond what is remotely reasonable.

Private companies are paid to imprison our citizens. Privatization of prisons is just insane, but of course it's only natural when you have private military forces, which are not only illegal but paid for by our own government. WTF.

We continue to use terrorism as a thin excuse for deporting or imprisoning ("detention") thousands of immigrants.

etc.


10-26-11 | The Eight Month Cruise

If you're buying a home in Seattle, you should always try to do so in early spring. This will give you a few months of wetness to see any problems, and then you'll have the whole summer to deal with them before the wet sets in again. (it also lets you see the homes during rains, which lets you look for water incursions while you shop). I tried to time it that way, but the home shopping took too long and I didn't wind up buying until late summer.

The problem with Seattle is that once it starts raining (around Oct 1 pretty reliably) it literally does not stop for the next 8 months. Sure maybe it stops for a day or two, but never long enough for the whole house to dry out and then give you a big chunk of dry days to do something like replace the roof or paint the exterior.

Fortunately I don't have any problem so large as that, but even for minor things it's damn annoying. For example some time around September I realized that I really need to get a coat of waterproofer on all my decking. Oh well, it's gonna have to wait 8 months. There's a couple of spots I need to touch up exterior paint, but you can't really do a good job of painting without a solid 5 days of dry and decent warmth.

It's almost like our houses are boats, and we go on an 8 month aquatic voyage every year. You really need to use those 4 months as a chance to get your boat up on dry dock, scrub off the moss, dig out the dry rot, apply epoxy, sand and paint, etc.


10-26-11 | Tons

In the US, a "ton" = 2000 pounds.

In the UK a "ton" is 2240 pounds (which comes from twenty "hundredweights" where a "hundredweight" is eight stone, and a stone is 14 pounds, WTF Britain).

A "metric ton" is obviously 1000 kg. In the UK this is officially called a "tonne" which you will see in technical documents, but I don't see that used much in casual writing, and it's certainly confusing when spoken since it sounds the same. (but a UK ton is very close to a metric ton (2204.6 pounds) so the mixup here surely happens all the time and is not a huge problem).

(when you hear someone in the UK phonetically say "ton" do they mean "tonne" or imperial ton?)

To differentiate the US ton vs UK ton they can be called "short ton" or "long ton".


On a related note, a pint is not a pound *anywhere* in the world.

In the UK, 1 oz by volume of water = 1 oz of weight. But a "pint" in the UK is 20 oz. So a pint is 1.25 pounds (a gallon is exactly 10 pounds)

In the US, 1 oz by volume of water = 1.041 oz of weight, so a pint = 1.041 pounds. (and a gallon = 8.33 pounds).

(neither liquid ounce is anything neat in terms of volume; the only nice whole number unit is the US gallon which is 231 cubic inches)

(the weight measures are the same in the US and UK, it's the US volume measure which went weird (1.041), and I believe it was done in order to make the gallon an integer number of cubic inches)

If you want to get technical, a (US) "pint's a pound" at some high temperature. (...some digging...) actually it's very close just before boiling. It looks like 98 C water is almost exactly a pound per pint (US).

Actually there is a sort of cute book-end of the ranges of water density there :

Very close to freezing (4 C) water is 1 g/ml , and very close to boiling (98 C) it's a pound per (US) pint. The difference is a factor of about 0.96.


10-24-11 | We Cut Costs ...

"We cut costs, and pass the savings on to .... us (!?) "

That's the motto of the modern era. Your Jeff Bezoses and Steve Jobs of the world are like a late night "mattress king" but with much less integrity, since the mattress king also slashes prices.

Music and books are digital. The producers save 90% of their costs. No manufacturing. No distribution. No retail space. Where are the savings? In their pocket.

Nobody has a physical location anymore where you can go and get help. Even the horrible call centers are often just automated, or moved to the web. Savings for consumers? Nope.

It struck me when I got an "Orca card" , the local transit pass. I have to load up an e-wallet so they get to hold onto a bunch of my money all the time. All my fares are prepaid and processed in one big transaction. They save massive amounts (vs. collecting physical money to pay for rides). Where are the savings? Not for me of course.

It's gotten so bad that when a new technology comes around and the savings are not passed on to the consumer we don't even blink. They never are.


10-24-11 | LZ Optimal Parse with A Star Part 1

First two notes that aren't about the A-Star parse :

1. All good LZ encoders these days that aren't optimal parsers use complicated heuristics to try to bias the parse towards a cheaper one. The LZ parse is massively redundant (many parses can produce the same original data) and the cost is not the same. But in the forward normal parse you can't say for sure which decision is cheapest, so they just make some guesses.

For example the two key crucial heuristics in LZMA are :


// LZMA lastOffset heuristic :
  if (repLen >= 2 && (
        (repLen + 1 >= mainLen) ||
        (repLen + 2 >= mainLen && mainDist >= (1 << 9)) ||
        (repLen + 3 >= mainLen && mainDist >= (1 << 15))))
  {
    // take repLen instead of mainLen
  }

// LZMA lazy heuristic :

        // Truncate current match if match at next position will be better (LZMA's algorithm)
        if (nextlen >= prevlen && nextdist < prevdist/4 ||
            nextlen == prevlen + 1 && !ChangePair(prevdist, nextdist) ||
            nextlen > prevlen + 1 ||
            nextlen + 1 >= prevlen && prevlen >= MINLEN && ChangePair(nextdist, prevdist))
        {
             return MINLEN-1;
        } else {
             return prevlen;
        }

One is choosing a "repeat match" over a normal match, and the next is choosing a lazy match (literal then match) over greedy (immediate match).

My non-optimal LZ parser has to make similar decisions; what I did was set up a matrix and train it. I made four categories for a match based on what the offset is : {repeat, low, medium, high } , so to decide between two matches I do :


  cat1 = category of match 1 offset
  cat2 = category of match 2 offset
  diff = len1 - len2

  return diff >= c_threshold[cat1][cat2]
  
so c_threshold is a 4x4 matrix of integers. The values of threshold are in the range [0,3] so it's not too many to just enumerate all possibilities of threshold and see what's best.
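
Fleshed out, the decision looks something like this. This is only a sketch : the category boundaries here are borrowed from the LZMA breakpoints quoted above just for illustration, and the real c_threshold values come from training, not from me :

enum { CAT_REPEAT=0, CAT_LOW, CAT_MEDIUM, CAT_HIGH };

static int OffsetCategory(int offset,int repeatOffset)
{
    if ( offset == repeatOffset ) return CAT_REPEAT;
    if ( offset < (1<<9) )        return CAT_LOW;
    if ( offset < (1<<15) )       return CAT_MEDIUM;
    return CAT_HIGH;
}

// returns true if match 1 should be preferred over match 2 :
static bool Match1IsBetter(int len1,int offset1,int len2,int offset2,int repeatOffset)
{
    static const int c_threshold[4][4] = { { 0 } }; // the real values in [0,3] come from training

    int cat1 = OffsetCategory(offset1,repeatOffset);
    int cat2 = OffsetCategory(offset2,repeatOffset);
    int diff = len1 - len2;

    return diff >= c_threshold[cat1][cat2];
}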

Anyway, the thing about these heuristics is that they bias the parse in a certain way. They assume a certain cost tradeoff between literals vs. repeat matches vs. normal matches, or whatever. When the heuristic doesn't match the file they do badly. Otherwise they do amazingly well. One solution might be having several heuristics trained on different files and choosing the one that creates the smallest output.

Also I should note - it's not trivial to tell when you have the heuristic wrong for the file. The problem is that there's a feedback loop between the parse heuristic and the statistical coder. That causes you to get into a local minimum, and you might not see that there's a better global minimum which is only available if you make a big jump in parsing style.

Repeating that more explicitly : the statistical coder will adapt to your parse heuristic; say you have a heuristic that prefers low offsets (like the LZMA heuristic above), that will cause your parse to select more low offsets, that will cause your statistical backend to code low offsets smaller. That's good, that's what you want, that's the whole point of the heuristic, it skews the parse intentionally to get the statistics going in the right direction. The issue then is that if you try to evaluate an alternative parse using the statistics that you have, it will look bad, because your statistics are trained for the parse you have.

2. It's crazy that LZ compression can do so well with so much redundancy in the code stream.

Think about it this way. Enumerate all the compressed bit streams that are created by encoding all raw bit streams up to length N.

Start with the shortest compressed bit stream. See what raw bits that decodes to. Now there may be several more (longer) compressed bit streams that decode to that same raw bit stream. Mark those all as unused.

Move to the next shortest compressed bit stream. First if there is a shorter unused one, move it into that slot. Then mark all other output streams that make the same raw bits as unused, and repeat.

For example a file like "aaaaaaa" has one encoding that's {a, offset -1 length 6} ; but there are *tons* of other encodings, such as {a,a,[-1,5]} or {a,[-1,3],[-1,3]} or {a,[-1,2],[-2,4]} etc. etc.

All the encodings except the shortest are useless. We don't want them. But even the ones that are not the best are quite short - they are small encodings, and small encodings take up lots of code space (see Kraft Inequality for example - one bit shorter means you take up twice the code space). So these useless but short encodings are real waste. In particular, there are other raw strings that don't have such a short encoding that would like to have that output length.

Anyhoo, surprisingly (just like with video coding) it seems to be good to add even *more* code stream redundancy by using things like the "repeat matches". I don't think I've ever seen an analysis of just how much wasted code space there is in the LZ output, I'm curious how much there is.

Hmm we didn't actually get to the A Star. Next time.


10-18-11 | StringMatchTest : Hash 1b

For reference :

The good way to do the standard Zip-style Hash -> Linked List for LZ77 parsing.

There are two tables : the hash entry point table, which gives you the head of the linked list, and the link table, which is a circular buffer of ints which contain the next position where that hash occurred.

That is :


  hashTable[ hash ]  contains the last (most recent preceding) position that hash occurred
  chainTable[ pos & (window_size-1) ]  contains the previous position (before pos) where the hash at pos occurred

To walk the table you do :

  i = hashTable[ hash ];
  while ( i in window )
    i = chainTable[ i & (window_size-1) ]

To update the table you do :

  head = hashTable[ hash ];
  hashTable[hash] = pos;
  chainTable[ pos & (window_size-1) ] = head;

And now for some minor details that are a bit subtle. We're going to go through "Hash1" from StringMatchTest which I know I still haven't posted.

int64 Test_Hash1(MATCH_TEST_ARGS)
{
    uint8 * buf = (uint8 *)charbuf;
    const int window_mask = window_size-1;
        
    vector<int> chain; // circular buffer on window_size
    chain.resize(window_size);
    int * pchain = chain.data();
    
    const int hash_size = MIN(window_size,1<<20);
    const int hash_mask = hash_size-1;
    
for small files or small windows, you can only get good per-byte speed if you make the hash size proportional to the window size, hence that MIN. (what you can't see is that outside of Test_ I also make window_size no bigger than the smallest power of 2 that encloses the file size).


    vector<int> hashTable; // points to pos of most recent occurance of this hash
    hashTable.resize(hash_size);
    int * phashTable = hashTable.data();
    
    memset(phashTable,0,hash_size*sizeof(int));

As noted previously, for large hashes you can get a big win by using a malloc that gives you zeros. I don't do it here for fairness to the other tests. I do make sure that my initialization value is zero so you can switch to VirtualAlloc/calloc.

    int64 total_ml = 0;
    
    // fiddle the pointers so that position 0 counts as being out of the window
    int pos = window_size+1;
    buf -= pos;
    ASSERT( (char *)&buf[pos] == charbuf );
    size += pos;

I don't want to do two checks in the inner loop for whether a position is a null link vs. out of the window. So I make the "null" value of the linked list (zero) be out of the window.

    for(;pos<(size-TAIL_SPACE_NEEDED);)
    {
        MATCH_CHECK_TIME_LIMIT();

        // grab 4 bytes (@ could slide here)
        uint32 cur4 = *((uint32 *)&buf[pos]);
        //ASSERT( cur4 == *((uint32 *)&buf[pos]) );

On PC's it's fastest just to grab the unaligned dword. On PowerPC it's faster to slide bytes through the dword. Note that endian-ness changes the value, so you may need to tweak the hash function differently for the two endian cases.

        // hash them 
        uint32 hash = hashfour(cur4) & hash_mask;
        int hashHead =  phashTable[hash];
        int nextHashIndex = hashHead;
        int bestml = 0;
        int windowStart = pos-window_size;
        ASSERT( windowStart >= 0 );
        
        #ifdef MAX_STEPS
        int steps = 0;
        #endif

Start the walk. Not interesting.

        while( nextHashIndex >= windowStart )
        {
            uint32 vs4 = *((uint32 *)&buf[nextHashIndex]);
            int hi = nextHashIndex&window_mask;
            if ( vs4 == cur4 )
            {
                int ml = matchlenbetter(&buf[pos],&buf[nextHashIndex],bestml,&buf[size]);
                    
                if ( ml != 0 )
                {
                    ASSERT( ml > bestml );
                    bestml = ml;
                    
                    // meh bleh extra check cuz matchlenbetter can actually go past end
                    if ( pos+bestml >= size )
                    {
                        bestml = size - pos;
                        break;
                    }
                }
            }

            #ifdef MAX_STEPS
            if ( ++steps > MAX_STEPS )
                break;
            #endif
                                
            nextHashIndex = pchain[hi];
        }

This is the simple walk of the chain of links. Min match len is 4 here which is particularly fast, but other lens can be done similarly.

"MAX_STEPS" is the optional "amortize" (walk limit) which hurts compression in an unbounded way but is necessary for speed.

"matchlenbetter" is a little trick ; it's best to check the character at "bestml" first because it is the most likely to differ. If that char doesn't match, we know we can't beat the previous match, so we can stop immediately. After that I check the chars in [4,bestml) to ensure that we really do match >= bestml (the first 4 are already checked) and lastly the characters after, to extend the match.

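A sketch of what "matchlenbetter" might look like under those assumptions (hypothetical; note that this version stops at "end", whereas the real one can run a little past it, which is why the caller above has that extra clamp) :

// returns 0 if this match can't beat bestml, else a match length > bestml
static int matchlenbetter(const uint8 * p,const uint8 * m,int bestml,const uint8 * end)
{
    // the byte at bestml is the most likely one to differ; if it does,
    //  we can't beat the previous best match, so get out immediately :
    if ( p[bestml] != m[bestml] ) return 0;

    // confirm [4,bestml) ; the first 4 bytes were already checked by the dword compare :
    for(int i=4;i<bestml;i++)
        if ( p[i] != m[i] ) return 0;

    // extend past bestml :
    int len = bestml+1;
    while ( p+len < end && p[len] == m[len] )
        len++;

    return len;
}
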
The remainder just updates the hash and is not interesting :


        ASSERT( bestml == 0 || bestml >= 4 );
        total_ml += bestml;
        
        // add self :
        //  (this also implicitly removes the node at the end of the sliding window)
        phashTable[hash] = pos;
        int ci = pos & window_mask;
        pchain[ci] = hashHead;
                
        if ( greedy && bestml > 0 )
        {
            int end = pos+bestml;
            pos++;
            ASSERT( end <= size );
            for(;pos<end;pos++)
            {               
                uint32 cur4 = *((uint32 *)&buf[pos]);
                
                // hash them 
                uint32 hash = hashfour(cur4) & hash_mask;
                int hashHead =  phashTable[hash];
                phashTable[hash] = pos;
                int ci = pos & window_mask;
                pchain[ci] = hashHead;      
            }
            pos = end;
        }
        else
        {
            pos++;
        }
    }
    
    return total_ml;
}

Note that for non-greedy parsing you can help the O(N^2) in some cases by setting bestml to lastml-1. This helps enormously in practice because of the heuristic of matchlenbetter but does not eliminate the O(N^2) in theory. (the reason it helps in practice is because typically when you find a very long match, then the next byte will not have any match longer than lastml-1).

(but hash1 is really just not the right choice for non-greedy parsing; SuffixArray or SuffixTrie are vastly superior)


10-18-11 | To wrap it up or move on

One of the things I've really struggled with in the last few years at RAD is the trade off between wasting time on a side shoot of the main code line vs. really wrapping something up while your focus is on it.

Generally after I've spent a week or two on a topic I start feeling like "okay that's enough time on this I need to move on" ; I guess that's my internal project manager watching my schedule. But on the other hand, there's such a huge advantage to staying on a topic while it fills your mind. You just lose so much momentum if you have to come back to it later.

For example I wish I had finished my JPEG decoder back when I was working on it, that was an important piece of software I believe, but I felt like I needed to move on to more practical tasks, and now it's all out of my head and would take me several weeks to figure out all the nuances again.

Currently I'm working on a new way to do LZ optimal parsing, and I feel like I need to move on because it's not that important to my product and I need to get onto more practical tasks, but at the same time I hate to leave it now because I feel close to a breakthrough and I know that if I stop now I may never come back to it, and if I do it will take forever to get back into the flow.


10-18-11 | Occupy Personal Responsibility

I am happy to see the "occupy" movements ; it's nice just to see people trying to do something about our fucked up politics.

Using "the top 1%" as your scapegoat is very clever, because it's such a narrowly defined group that it can actually get some majority support behind it. Past democratic/populist movements have tried to blame "the rich" or the top 10% , but that never works since 20-30% of people think they are in the top 10% (or will be soon), and then another 20% will oppose you just because you're a democrat, and another 20% will oppose any kind of redistribution, and the result is that you can't get a majority. The "top 1%" is a mysterious group that nobody personally knows, they live in crystal castles and somehow screw us all over.

But that's where I find the whole movement to be rather depressing. It's not actually a new movement towards more realistic, humble governance. It's yet another call for a free lunch.

The real problem with American politics goes back to the voters. Nobody is ever willing to see the big picture and do what's good for the country. Everyone wants their taxes lower. They want their services increased. It's somebody else who can get higher taxes and get a service cut.

And the whole Occupy/1%er movement is the same thing. It's not our fault that the country is so fucked and we don't have jobs. It couldn't be because we act like a herd of buffalo bidding up houses and jumping on stocks just before they tank. It couldn't be because we ran up massive personal debts to buy imported crap. It couldn't be because we chose to get useless educations. It couldn't be because we refuse to raise taxes. It couldn't be because we slash education and infrastructure spending that would help our country develop. No, it's those top 1% ! They're somehow manipulating markets and controlling government and screwing us all over!

(not related to my main point, but another funny bit is that the *actual* poor are completely missing from this movement; it's always the middle class or maybe lower-middle class who are in a bit of a hard spot; if you watch the news about foreclosures, it's always some white people in a suburban 3000 square foot tract home whining about their foreclosure. We have 15-20% of our population in serious fucking big time poverty (the official poverty line is crazy low; $10k a year for one person, $20k for a family of four, a more reasonable definition would make the number even higher). This is not a small group, but they are completely invisible from modern politics, the news, and all these populist movements. That's very intentional, I believe, because the democrats/populists know that the truly poor are political poison. When you get a bunch of blacks, homeless, immigrants, etc. in your rallies, that's when the Republican opposition calcifies against you. Plus it's just not news; we know we have massive embarrassing ghettos where people are barely scraping by, and we don't care and we don't want to hear about it).


10-17-11 | Crap Products

Am I the only person in the world that's just disgusted by how the moron marketing/design people are ruining all modern cars? This is not the correct amount of glass in a car :

Oh sure, I can sacrifice something minor like seeing *out* of the car for something more important like how "aggressive" it looks (that seems to be the key word for car marketing morons at the moment). I guess with high end cars like the LFA it actually is more important to the average buyer how it looks from the outside than how it is to drive, but sadly this stupidity has infected all levels of car design.


Modern bed designers either are intentionally out to bruise a lot of shins or they're just morons. I like to think it's the former. They love the fact that all the pretentious douches who just buy things because they're "stylish" without thinking about whether they're actually good are walking around with constantly bruised shins.

If you google "modern platform bed" almost every single one is a shin minefield.

I picked the particular one because it exhibits another piece of retarded design - the integrated nightstand. It seems sort of like a nice idea at first; some of them have little cubbies or nooks in the headboard where you can stash your book, a lamp, a glass of water. But then I started thinking...

Do you people never have sex? When you do, is it lying almost perfectly still with a sheet between you?

Also on the bed topic - "Eco Leather" ? Seriously? Hmm, let's see... 70% polyurethane. That's plastic my friend. I believe that's "pleather", or "leatherette". Fucking eco leather. May as well call it "iLeather".

(there's some inconsistency about what the term "eco leather" means : the smugstainability junta want "eco leather" to mean recycled leather or some such shit, the ridiculously-obsessive-mom junta want "eco leather" to mean leather processed without chemicals so it's safe for their ever-so-fragile babies, but in fact the furniture manufacturers of the world have just said "nope, fuck you, it means plastic").


10-17-11 | Sensor Dry

Do the laundry manufacturers of the world really believe that "slightly damp" is the correct amount of dryness?

So far as I can tell all the fancy "Energy Star" low-energy dryers just work by NOT FUCKING DRYING. Oh, big woop you saved energy because you just fucking didn't heat up the clothes. I can make a dryer from 1960 use zero watts if I'm allowed to not actually dry the clothes.

It's so disrespectful. It treats me, the user of the device, like a fucking piece of shit. I told you to do something, but who the fuck am I? Just some moron consumer. I'm sure the engineers know what I want better than I do, so they'll detect that the clothes are dry and stop the machine. NO! I didn't fucking tell you to stop it, so don't stop it.

My range hood has a similar thing; a "heat sensor" feature which automatically detects a certain temperature and kicks it on high. FUCK YOU. You will run when I tell you to run, you bitch. If I want to boil a pot of water and leave it on the stove and not run the hood then I fucking get to, it's my god damn house. What a fucking cock sucking feature. And of course none of these things can be turned off.

All modern cars are essentially the same way. You wanted to do an abrupt engine brake to get some off-throttle oversteer? Nope, sorry, that throttle release is low-passed. You want to abruptly jump on the throttle to speed away? Nope, sorry, the ignition timing was advanced to save fuel. You wanted to get the weight unbalanced side to side to initiate a flick? Nope, sorry, the magnetic suspension detected the sway and adjusted to stop it. We (the engineers, the ECU, the manufacturer) know better than you. You don't get the freedom to do what you want with your own tools.

Of course they're right most of the time. People are morons. But we should be allowed to be morons. I hate this shit.

It's an unfortunate truth that any time you get software involved, things become shit.

Part of the problem, I believe, is that software allows products to be designed by the marketers.

When products were designed by engineers and scientists, and had to be pressed out of metal, and some big custom machine had to be made to do the pressing, there was a long turn around time, and they just tried to design it to work as well as it could. Some marketer could come in and say "can we make it automatically turn off after 10 minutes?" and the engineer would say "well, that would require this extra part, and it would take 6 months to change the machines, ..." so it wouldn't get done. You couldn't risk chasing the latest trend because by the time your products got through the cycle the trend would be different.

Software is just too easy to change. And programmers are cheap and easy to replace, and have no personal ethics about the code they're writing anyway.

So some moron Producer/Marketer can call a meeting at the last minute and say "what if we add vending machines that sell Sobe drink powerups?" or "what if we sell hats?" , the engineer says "yeah, umm, I guess we could do that, I don't really think it's a good idea..." but it's not your job to have ideas, just go write the code.


10-12-11 | Post-backpacking Recalibration

1. 65 degree house is fucking blazing hot at night.

2. Of course I'll walk a mile in the rain to get lunch. No big deal.

3. Everything is fucking DELICIOUS! Jar of spaghetti sauce, om nom nom. Beef stew yum.

4. None of the usual hesitation in sitting on the semi-shit-covered office toilet after having had to sit on the very-shit-covered wilderness toilet.


10-12-11 | Some random politico-economics

I believe it's possible to slant recent events such that Greenspan is the central villain. He presided over both the .com bubble and the real estate bubble, fueling both with super low interest rates and his approval (eg. the "new economy" that can grow forever!). He presided over a Fed that did nothing to control the banks it was supposed to be monitoring. Perhaps most damning though is how he steered the Clinton presidency, which led to the creation of this entire modern financial era.

One version of the myth is that Clinton arrived at the white house all liberal and bright eyed, only to find it infested by a den of wolves who forced him into realpolitik compromises. One of those wolves was Greenspan, who met with Clinton and said something along the lines of "if you don't cut federal spending, I will raise interest rates". Which is basically a threat - if you don't do what I like (small government) then I will destroy your presidency by constricting the economy in an already recessed time. I believe the fact of this threat is public record though you may believe Greenspan's version that the reason behind it was "advice" that runaway government debt would force his hand to raise interest rates. In any case, what followed was deregulation, low interest rates, and the inevitable disaster.

Anyhoo, I don't really believe that reading of history, but it makes a good story.

One thing that strikes is all the articles these days that describe the "Great Recession" as a "unique time in American history" or "unprecedented income inequality". Uhh, hello? Did you go to high school? Maybe if you qualified that with the addendum "since WW2". Basically it was completely standard pre-1910 American economics, with the only exception being that our new modern government caught the crash and smoothed it into a recession rather than just letting the bubble pop. If they hadn't caught the crash it would have been a generic "Panic of XXXX" (1857, 1873, 1884, 1890, 1893, 1896, 1907, 1910, 1914, 1920, 1929) many of which were unsurprisingly similar to our current fuckup (massive speculation in a bubbled asset leading to a crash in the value of that asset which leads to a liquidity crisis and bank failures); we could have dropped our monocle in shock as the robber-baron speculators caused a run on the banks.

Ever since the 80's we've been talking about how we need to transform America into a new "information economy" or "service economy". With NAFTA, etc. whenever we lost jobs, the response would be we need new types of jobs, new education etc. I don't recall much effort by anyone to stop and ask - do we want to live in a service economy? I mean, at the moment we are even failing to have a healthy service economy, in the sense that there's lots of unemployment, so all the cries are just to *fix* the service economy. But really, even if it was fixed, if there were call center jobs and retail jobs for everyone, that's a pretty fucking bad world to live in. An "information economy" is inherently bad for almost everyone in it, because "intellectual property" is owned by the very few and is where all the money is.

I believe that raising capital gains taxes to match income taxes would help. Not only is it just inherently fair to tax all income the same way and would eliminate the tax rate dip of the super rich vs the merely rich (a low capital gains tax is a regressive benefit to the super rich, since the poor don't make any income from capital gains), it would greatly reduce capital movement.

I also think that "retirement accounts" and all that shit should go away. It's an unnecessary complexity. Instead just make the capital gains tax go down based on the length held. Maybe reduce by 1% per year held, so after 30 years the tax goes to zero. This further encourages "buy and hold" type investing.

I'm more and more convinced that massive rapid capital movement isn't actually good for anyone but the very top echelon of financiers. It creates panics and crashes in small emerging markets. It creates bubbles, and it sucks profit out of markets. Obviously capital markets are beneficial to get money to companies that need it. All structures that are permitted in society need to justify themselves as being beneficial to the greater good.

I believe that allowing bankruptcy to wipe out student loans would provide a valuable incentive to colleges to keep tuition low, and to keep education useful. Right now it's far too easy to go and get a $200k education in the cultural impact of deconstruction in gay/lesbian underwater basket weaving, and wind up in a position with no job and no hope to ever pay it back. Obviously it's the fault of the student for doing that, but they're children, they shouldn't be expected to make such a huge financial life decision in the summer after high school when someone is handing them a huge blank check. If the colleges had to underwrite the student loans themselves, they would have an inherent interest to provide good useful educations, and to simply refuse to admit students that shouldn't be going to college.

I believe the most basic fundamental thing that we should do to fix the US economy (aside from the obvious things like eliminating off-book derivatives and restoring bank/investment separation and so on) is to make it cheaper to employ Americans.

It's simply too expensive to employ Americans. You can try to prop up our manufacturing sector in various ways, but you have to get to the root cause. It's insane how expensive labor is here, even only semi-skilled labor, vs. buying manufactured products.

I believe the best way to reduce the cost of employees is to get rid of all the additional costs to the employer. Health care, workers' comp, pensions, 401k's, all of this nonsense that employers in America have to administer and pay for. If these were provided by the government, the total cost for them would be the same, but it would be paid by corporate profits whether you hired those workers or not, so you may as well hire them here rather than outsourcing. You would greatly reduce the incremental cost of adding a new employee, which would encourage business to hire. Furthermore, without having to worry about so much hiring and firing paperwork, businesses would be more willing to hire in times of uncertain growth.

Ideally you would get to a scenario of no hiring/firing paperwork at all. This would make it far easier for businesses to find the best employees, and easier for employees to move on to better jobs without fear of being without health care for a while or whatever.

You might note that the heavy social welfare countries don't exactly have lively agile businesses. I believe that is not the fault of the government providing social care, but rather because those countries tend to also have very heavy regulatory structures that make doing business difficult and expensive. My proposal is to make small businesses much much easier to start and run, and much cheaper.

There's this whole modern movement to "save money" (for the government) by pushing the maintenance of social welfare programs onto the businesses (health care, retirement, etc.) But that doesn't actually save money for society, it just makes someone else pay it. The proponents will say they "reduced taxes" but they also reduced your pay check because the business now has to cover those costs. And it's worse than that, because it increases the cost per employee to the business, it makes them prefer to hire fewer people, or outsource some work, or just buy pre-made pieces from overseas.


10-12-11 | Subaru VIP program

FYI, just found out about this. Buy a Subaru for 2% under invoice with no haggling. WRX for $24,300 for example. Easy to get "VIP" status by joining some charity or other but requires six months of lead time.

Info at legacygt.com and cars101.com


10-12-11 | Bad profiling

I've seen quite a few posts recently that purport to do some profiling and show you which option is faster. The worst is probably "spin locks are faster than mutexes" but this kind of cache line study is not awesome either.

(Bouliii's test is not actually a demonstration of cache line sharing at all; it's a demonstration that threading speedup depends on data independence; to demonstrate cache line effects you should have lots of threads atomically incrementing *different* variables that are within the same cache line, not *one* variable; if you did that you would see that cache line sharing causes almost as much slow down as variable sharing)

This kind of profiling is deeply flawed and I believe is in fact worse than nothing.

Timing threading primitives in isolation or on synthetic task sets is not useful.

Obviously if all you measure is the number of clocks to take a lock, then the simplest primitives (like spinlock) will appear the fastest. Many blogs in the last few years have posted ignorant things like "critical section takes 200 clocks and spin lock is only 50 clocks so spin lock is faster". Doof! (bang head). What you care about in real usage is things like fairness, starvation, whether the threads that have CPU time are the ones that are able to get real work done, etc.

So say you're not completely naive and instead you actually cook up some synthetic work for some worker threads to do and you test your primitives that way. Well, that's better. It does tell you what the best primitive is for *that particular work load*. But that might not reflect real work at all. In particular homogenous vs. heterogenous threads (worker threads that are interchangeable and can do any work vs. threads that are dedicated to different things) will behave very differently. What other thread loads are on the system? Are you measuring how well your system releases CPU time to other threads when yours can't do any more? (a fair threading benchmark should always have a low priority thread that's counting the number of excess cycles that are released). (spinning will always seem faster than giving up execution if you are ignoring the ability of the CPU to do other work)

Furthermore the exact sharing structure and how the various workers share cache lines is massively important to the results.

In general, it is very easy for bad threading primitives to look very good in synthetic benchmarks, but to be disastrous in the real world, because they suffer from things like thundering herd or priority inversion or what I call "thread thrashing" (wasted or unnecessary thread context switches).

You may have noticed that when I posted lots of threading primitives a month or two ago there was not one benchmark. But what I did talk about was - can this threading primitive spin the CPU for a long time waiting on a data item? (eg. for an entire time slice potentially) ; can this threading primitive suffer from priority inversion or even "priority deadlock" (when the process can stall until the OS level set manager saves you) ; how many context switches are needed to transfer control, is it just one or are there extra? etc. these are the questions you should ask.

This has been thoroughly discussed over the years w.r.s.t. memory allocators, so we should all know better by now.

Also, as has been discussed thoroughly wrst allocators, it is almost impossible to cook up a thorough benchmark which will tell you the one true answer. The correct way to decide these things is :

1. Understand the issues and what the trade-offs are.

2. Make sure you pick an implementation that does not have some kind of disastrous corner case. For example make sure your allocator has decent O() behavior with no blow-up scenario (or understand the blow-up scenario and know that it can't happen for your application). Particularly with threading I believe that for 99.99% of us (eg. everyone except LRB) it's more important to have primitives that work correctly than to save 10 clocks of micro efficiency.

3. Try to abstract them cleanly so that they are easily swappable and test them in your real final application with your real use case. The only legitimate test is to test in the final use case.


10-11-11 | Comcast

Comcast keeps calling me with some horrible auto-dialer trying to tell me something. Unfortunately, if I just ignore it or screen it, it leaves me an automated message on my voicemail.

I try to ignore it for a while, but constantly having voicemails to go through is damn annoying so I finally call them up.

Enter your damn phone number. Bleep bloop. Urg I hate you. Mash buttons through the menu. Urg.

Me : "Why the fuck are you calling me over and over?"

Comcast : "Sir that's because you have a late balance due ... "

Me (interrupting) , "Umm, my account is on credit card pay, why the fuck do I have a late balance?"

Comcast : "Sir that's because credit card payments aren't processed for 2-3 months after setting up the account"

Me : "Umm, okay so can I just pay the balance now so they stop pestering me?"

Comcast : "No it will be charged automatically in the next billing cycle"

Me : "So can you stop calling me?"

Comcast : "Unfortunately I can't do that, but I can give you a number to call ..."

Me : (sighs and rolls eyes)

Comcast : "..it's 877-824-2288"

Me : "Umm.. that's the generic comcast customer support number"

Blah. Boring, I know. So frustrating. I have absolutely no recourse, I can't pick a different internet provider, and I can't fight with these people because they just send you around in circles and you never get to talk to anyone with power. You always get some damn call center person who says "I'm sorry I can't do that".


10-06-11 | Fiberglass

Don't put this shit in your house. It's toxic, it's poison, it's the modern day asbestos. There are plenty of alternatives.

Even in the attic or walls, sure it's sealed up most of the time, but any time you have to go in there to work you stir it up, then you get glass shards in the air which get in your eyes and lungs, so you have to wear safety suits and respirators and so on just to do basic maintenance work.

If anything ever goes wrong with it, it's a nightmare to dispose of.

Worst of all is using it to wrap ducts. The problem is that all ducts will leak eventually. Maybe not right away, but in 10 years cracks will form. Then the air return ducts will start sucking in at the seams, and eventually that will be sucking glass fibers in, then blowing them out all over the house.

Just say no to toxic shit in your home.


10-03-11 | Amortized Hashing

The "hash1" simple Zip-style matcher was very fast. So why don't we love it?

The problem is amortized hashing. "Amortized" = we just stop walking after A steps. This limits the worst case. Technically it makes hash1 O(N) (or O(A*N)).

Without it, hash1 has disastrous worst cases, which makes it just not viable. The problem is that the "amortize" can hurt compression quite a bit and unpredictably so.

One perhaps surprising thing is that the value of A (the walk limit) doesn't actually matter that much. It can be 32 or 256 and either way you will save your speed from the cliff. Surprisingly even an A of 1024 on a 128k window helps immensely.


not "amortized" :

 file,   encode,   decode, out size
lzt00,  225.556,   35.223,     4980
lzt01,  143.144,    1.925,   198288
lzt02,  202.040,    3.067,   227799
lzt03,  272.027,    2.164,  1764774
lzt04,  177.060,    5.454,    11218
lzt05,  756.387,    3.940,   383035
lzt06,  227.348,    6.078,   423591
lzt07,  624.699,    4.179,   219707
lzt08,  311.441,    7.329,   261242
lzt09,  302.741,    3.746,   299418
lzt10,  777.609,    1.647,    10981
lzt11, 1073.232,    4.999,    19962
lzt12, 3250.114,    3.134,    25997
lzt13,  101.577,    5.644,   978493
lzt14,  278.619,    6.026,    47540
lzt15, 1379.194,    9.396,    10911
lzt16,  148.908,   12.558,    10053
lzt17,  135.530,    5.693,    18517
lzt18,  171.413,    6.003,    68028
lzt19,  540.656,    3.433,   300354
lzt20,  109.678,    5.488,   900001
lzt21,  155.648,    3.605,   147000
lzt22,  118.776,    6.671,   290907
lzt23,  103.056,    6.350,   822619
lzt24,  218.596,    4.439,  2110882
lzt25,  266.006,    2.498,   123068
lzt26,  118.093,    7.062,   209321
lzt27,  913.469,    4.340,   250911
lzt28,  627.070,    2.576,   322822
lzt29, 1237.463,    4.090,  1757619
lzt30,   75.217,    0.646,   100001

"amortized" to 128 steps :

 file,   encode,   decode, out size
lzt00,  216.417,   30.567,     4978
lzt01,   99.315,    1.620,   198288
lzt02,   85.209,    3.556,   227816
lzt03,   79.299,    2.189,  1767316
lzt04,   90.983,    7.073,    12071
lzt05,   86.225,    4.715,   382841
lzt06,   91.544,    6.930,   423629
lzt07,  127.232,    4.502,   220087
lzt08,  161.590,    7.725,   261366
lzt09,  119.749,    4.696,   301442
lzt10,   55.662,    1.980,    11165
lzt11,  108.619,    6.072,    19978
lzt12,  112.264,    3.119,    26977
lzt13,  103.460,    6.215,   978493
lzt14,   87.520,    5.529,    47558
lzt15,   98.902,    7.568,    10934
lzt16,   90.138,   12.503,    10061
lzt17,  115.166,    6.016,    18550
lzt18,  176.121,    5.402,    68035
lzt19,  272.349,    3.310,   304212
lzt20,  107.739,    5.589,   900016
lzt21,   68.255,    3.568,   147058
lzt22,  108.045,    5.867,   290954
lzt23,  108.023,    6.701,   822619
lzt24,   78.380,    4.631,  2112700
lzt25,   93.013,    2.554,   123219
lzt26,  108.348,    6.143,   209321
lzt27,  103.226,    3.468,   249081
lzt28,  145.280,    2.658,   324569
lzt29,  199.174,    4.063,  1751916
lzt30,   75.093,    1.019,   100001

The times are in clocks per byte. In particular let's look at some files that are really slow without amortize :

no amortize :

 file,   encode,   decode, out size
lzt11, 1073.232,    4.999,    19962
lzt12, 3250.114,    3.134,    25997
lzt15, 1379.194,    9.396,    10911
lzt27,  913.469,    4.340,   250911
lzt29, 1237.463,    4.090,  1757619

amortize 128 :

 file,   encode,   decode, out size
lzt11,  108.619,    6.072,    19978
lzt12,  112.264,    3.119,    26977
lzt15,   98.902,    7.568,    10934
lzt27,  103.226,    3.468,   249081
lzt29,  199.174,    4.063,  1751916

Massive speed differences. The funny thing is the only file where the compression ratio changes drastically is lzt12. It's changed by around 4%. (lzt29 is a bigger absolute difference but only 0.34%)

So amortized hashing saved us massive time, and only cost us 4% on one file in the test case. Let me summarize the cases. There are three main classes of file :

1. 80% : Not really affected at all by amortize or not. These files don't have lots of degeneracy so there aren't a ton of links in the same hash bucket.

2. 15% : Very slow without amortize, but compression not really affected. These files have some kind of degeneracy (such as long runs of one character) but the most recent occurrences of that hash are the good ones, so the amortize doesn't hurt compression.

3. 5% : Has lots of hash collisions, and we do need the long offsets to make good matches. This case is rare.

So obviously it should occur to us that some kind of "introspective" algorithm could be used. Somehow monitor what's happening and adjust your amortize limit (it doesn't need to be a constant) or switch to another algorithm for part of the hash table.

The problem is that we can't tell between classes 2 and 3 without running the slow compressor. That is, it's easy to tell if you have a lot of long hash chains, but you can't tell if they are needed for good compression or not without running the slow mode of the compressor.


10-02-11 | How to walk binary interval tree

I noted previously that SA3 uses a binary interval tree. It's obvious how that works but it always takes me a second to figure it out so let's write it down.

This is going to be all very similar to previous notes on cumulative probability trees (and followup ).

Say you have an array of 256 entries. You build a min tree. Each entry in the tree stores the minimum value over an interval. The top entry covers the range [0,255] , then you have [0,127][128,255] , etc.

Say you're at index i and you want to find the next entry which has a value lower than yours. That is,


find j such that A[j] < A[i] and j > i

you just have to find an interval in the min tree whose min is < A[i]. To make the walk fast you want to step by the largest intervals possible. (once you find the interval that must contain your value you do a normal binary search within that interval)

You can't use the interval that overlaps you, because you only want to look at j > i.

It's easy to do this walk elegantly using the fact that the binary representation of integers is a sort of binary interval tree.

Say we start at i = 37 (00100101). We need to walk from 37 to 256. Obviously we want to use the [128,256) range to make a big step. And the [64,128). We can't use the [32,64) because we're inside that range - this corresponds to the top on bit of 37. We can use [48,64) and [40,48) because those bits are off. We can't use [36,40) but we can use [38,40) (and the bottom on bit corresponds to [37,38) which we are in).

Doing it backwards, you start from whatever index (such as i=37). Find the lowest on bit. That is the interval you can step by (in this case 1). So step by 1 to i=38. Stepping by the lowest bit always acts to clear that bit and push the lowest on bit higher up (sometimes more than 1 level). Now find the next lowest on bit. 38 = 00100110 , so step by 2 to 40 = 00101000 , now step by 8 to 48 = 00110000 , now step by 16 to 64 = 01000000. etc.

In pseudo code :


Start at index i

while ( i < end )
{
  int step = i & (-i); // isolate bottom bit
  // I'm looking at the range [i,i+step]
  int level = BitPos(step);
  check tree[level][i>>level];
  i += step;
}
 
Now this should all be pretty obvious, but here comes the juju.

I've written tree[][] as if it is laid out in the intuitive way, that is :


tree[0] has 256 entries for the 1-step ranges
tree[1] has 128 entries for 2-step ranges
...

The total is 512 entries which is O(N). But notice that tree[0] is only actually ever used for indexes that have the bottom bit on. So the half of them that have the bottom bit off are not needed. Then tree[1] is only used for entries that have the second bit on (but bottom bit off). So the tree[1] entries can actually slot right into the blanks of tree[0], and half the blanks are left over. And so on...

It's a Fenwick tree!

So our revised iteration is :


// searching :
Start at index i
(i starts at 1)

while ( i < end )
{
  int step = i & (-i); // isolate bottom bit
  // I'm looking at the range [i,i+step]
  check fenwick_tree[i];
  i += step;
}

// building :

for all i
{
  int step = i & (-i); // isolate bottom bit
  fenwick_tree[i] = min on range [i,i+step]
}

(in practice you need to build with a binary recursion; eg. level L is built from two entries of level L-1).
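
To make this concrete, here's a small C sketch of the forward case (my own fleshing-out, not from the original post) : it uses the naive build instead of the binary recursion, assumes "size" is a power of two, and starts the walk at i+1 since we only want j > i.

// fenwick[i] holds the min of A over [ i, i + (i&-i) )
static void BuildForwardMinTree(const int * A,int * fenwick,int size)
{
    for(int i=1;i<size;i++)
    {
        int step = i & (-i);
        int m = A[i];
        for(int j=i+1;j<i+step;j++)  // naive O(N log N) ; better is level L from two level L-1 entries
            if ( A[j] < m ) m = A[j];
        fenwick[i] = m;
    }
}

// find the lowest j > i with A[j] < A[i] , or -1 if there is none :
static int FindNextLower(const int * A,const int * fenwick,int size,int i)
{
    int val = A[i];
    i++;
    while ( i < size )
    {
        int step = i & (-i);    // isolate bottom bit
        if ( fenwick[i] < val )
        {
            // the answer is somewhere in [ i, i+step ) ; binary search in a real version, just scan here :
            for(int j=i;j<i+step;j++)
                if ( A[j] < val ) return j;
        }
        i += step;
    }
    return -1;
}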

Note that to walk backwards you need the opposite entries. That is, at level 7 (steps of 128) you only need [128,256) to walk forward, never [0,128) because a value in that range can't take that step. To walk backwards, you only need [0,128) , never [128,256). So in fact to walk forward or backward you need all the entries. When we made the "Fenwick compaction" for the forward walk, we threw away half the values - those are exactly the values that need to be in the backward tree.

For the three bit case , range [0,8) , the trees are :


Forward :

+-------------------------------+
|              0-8              |
+-------------------------------+
|       ^       |      4-8      |
+---------------+---------------+
|   ^   |  2-4  |   ^   |  6-8  |
+---------------|---------------+
| ^ |1-2| ^ |3-4| ^ |5-6| ^ |7-8| 
+-------------------------------+

where ^ means go up a level to find your step
the bottom row is indexed [0,7] and 8 is off the end on the right
so eg if you start at 5 you do 5->6 then 6->8 and done

Backward :

+-------------------------------+
|              8-0              |
+-------------------------------+
|      4-0      |       ^       |
+---------------+---------------+
|  2-0  |   ^   |  6-4  |   ^   |
+---------------|---------------+
|1-0| ^ |3-2| ^ |5-4| ^ |7-6| ^ | 
+-------------------------------+

the bottom row is indexed [1,8] and 0 is off the end of the left
so eg if you start at 5 you go 5->4 then 4->0 and done

You collapse the indexing Fenwick style on both by pushing the values down the ^ arrows. It should be clear you can take the two tables and lay them on top of each other to get a full set.


10-01-11 | Used Prices

I really like the idea of buying used. Save some money, maybe prevent some goods from going to the land fill, fuck the retailer, etc. But used prices are just really fucking out of whack.

Used goods need to have a *huge* discount. Just look at the utility. When I see something for sale on craigslist,

I can't just click "buy it" and it shows up at my door. So I need a big discount for that.

It's usually badly described and photographed, so I have to do a bunch of research. Discount.

I can't pay with CC at all so I have to go fetch some cash. Discount.

The seller has a 50% chance of being a flake in some way, like telling me it's been sold after I show up for my appointment, or putting up the wrong item. Discount.

If it's broken or wrong or whatever, I can't return it. Big discount for this.

There's always an information gap problem when buying used - the seller knows more about the item than you. This fucks you with new items too, of course, but it's worse with used. eg. there's a chance that it's in worse condition than you know. Discount for this.

Even if some of the bad eventualities don't happen on a particular purchase, the price needs to be discounted by the probability that they happen times the cost to you when they do. eg. if your $100 purchase has a 10% chance of being fucked and worthless, you need a $10 discount on all purchases PLUS a discount for the trouble in that event, which is probably another $100 times 10%, so $20 total discount.

The result when you add up all the factors is that something that sells for $1000 new needs to be under $500 or so for it to be +EV to buy used. And it just isn't. In fact it's usually over $800. Used prices are uniformly too high.

Part of the problem is that the price of goods has gotten so out of whack with the price of labor.


10-01-11 | Seattle Stop Shitting on my Face

I've been thinking about upgrading my neoprene gear so that I can swim in the lake through the fall/spring, not just the summer.

I fucking hate swimming laps in a pool during official lap swim hours, with all your "rules" and your system keeping me down. And it just doesn't make sense to go into some nasty indoor crowded box when I'm literally surrounded by miles of beautiful open water.

But there's a big problem with this idea. Seattle is shitting on my face.

The fall/spring, when a wet suit would help, is when it rains. When it rains, the sewers overflow and drain into the lake. Then you get itchy bumps and vomiting and so on.

Seattle is basically doing nothing about it. There are some little programs to do "rain gardens", but those are sort of like using a tampon to stop elephant piss. What we need is serious fucking civil engineering. ("rain gardens" are also better known as "mosquito breeders"; we Houstonians are always amazed and delighted about the lack of mosquitos here; it's because Seattle is hilly and surrounded by lakes so the water doesn't pool, but the city is doing their best to ruin that). (a much bigger impact than piddly residential rain gardens would be to outlaw concrete parking lots; grass/gravel/lattice parking lots work perfectly fine for holding cars).

Of course this is a problem that is occurring all over the US. I hear NY has a major sewer problem as well. The population and development of most US cities has outstripped their infrastructure, and in our shitty faux-libertarian plutocracy of course there's no money for basic civil engineering. Only the heavy hand of EPA orders is forcing these dumb ass local governments to do anything.

The real solution is something like :

1. Separate the sewage and rain runoff. Run sewage to treatment plants in a closed system so it can never get out. (probably not realistic; alternatively, add a new clean storm water only system)

2. Since you will allow non-sewer storm water to drain to the lake, make the pollutants that run off to the lakes illegal. Fertilizers, pesticides, etc. everything that's water soluble and washes into the lakes is illegal right fucking now.


10-01-11 | More Reliable Timing on Windows

When profiling little code chunks on Windows, one of the constant annoyances is the unreliability of times due to multithreading.

Historically the way you address this is run lots of trials (like 100) and take the MIN time of any trial.

(* important note : if you aren't trying to time "hot cache" performance, you need to wipe the cache between each run. I dunno if there's an easy instruction or system call that would invalidate all cache pages; what I usually do is have a routine that goes and munges over some big array).

It's a bit better these days because of many cores. Now you can quite often find a core which is unmolested by annoying services popping up and stealing CPU time and messing up your profile. But sometimes you get unlucky, and your process runs on an IdealProc that has some other shite.

So a simple loop helps :


#include <windows.h>
#include <intrin.h>     // __rdtsc
// (uint64 and MIN assumed defined elsewhere, eg. typedef unsigned __int64 uint64)

template <typename t_func>
uint64 GetFuncTime( t_func * pfunc )
{
    HANDLE proc = GetCurrentProcess();
    HANDLE thread = GetCurrentThread();
    
    DWORD_PTR affProc,affSys;
    GetProcessAffinityMask(proc,&affProc,&affSys);
    
    uint64 tick_range = 1ULL << 62;
    
    // run the function once on each core the process is allowed to use,
    // and keep the fastest run :
    for(int rep=0;rep<24;rep++)
    {
        DWORD_PTR mask = ((DWORD_PTR)1)<<rep;
        if ( mask & affProc )
            SetThreadAffinityMask(thread,mask);
        else
            continue;   

        uint64 t1 = __rdtsc();
        (*pfunc)();
        uint64 t2 = __rdtsc();

        uint64 cur_tick_range = t2 - t1;
        tick_range = MIN(tick_range,cur_tick_range);
    }

    // restore the original affinity (a blanket 0xFFFFFFFF can fail if the
    // process affinity is restricted) :
    SetThreadAffinityMask(thread,affProc);

    return tick_range;
}

which makes it reasonably probable that you get a clean run on some core. For published results you will still want to repeat the whole thing N times.
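For example, repeating the whole thing and keeping the min might look like this (SomeCandidateFunc is just a placeholder for whatever you're timing) :

uint64 best = (uint64)-1;
for(int n=0;n<10;n++)
    best = MIN( best, GetFuncTime( &SomeCandidateFunc ) );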


10-01-11 | String Match Results Part 6

You knew that couldn't be the end.

SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

(ADDENDUM : this is super unclear, see more at end)

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :


totals:
Test_MMC2 : DNF 
Test_LzFind2 : 506.954418 , 1355.992865 
Test_SuffixArray3 : 506.954418 , 514.931740 
Test_MMC1 : 13.095260 , 1298.507490 
Test_LzFind1 : 12.674177 , 226.796123 
Test_Hash1 : 503.319301 , 1094.570022 

Finally greedy parse, 17 bit window :


totals:
Test_MMC2 : 0.663028 , 110.373098 
Test_LzFind2 : DNF 
Test_SuffixArray3 : 0.663036 , 236.896551 
Test_MMC1 : 0.663028 , 222.626069 
Test_LzFind1 : 0.662929 , 216.912409 
Test_Hash1 : 0.662718 , 62.385071 

average match length :

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals:
Test_SuffixArray3 : 0.630772 , 239.280605 
Test_LzFind1 : 0.630688 , 181.430093 
Test_MMC2 : 0.630765 , 88.413339 
Test_Hash1 : 0.630246 , 51.980073 

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

ADDENDUM : an email I wrote trying to explain this better :


First a reminder of how the normal suffix array searcher works :

We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors

The problem with windowed matching is our neighbors may be way out of the window

So what we want is a way to step more than 1 away to find the closest neighbor in the sort order that is within our desired window.

So really all we are doing is adding another search structure on top of the suffix array.  It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1] 

To do this I just build a tree.  It's indexed by sort position, and its content is file positions.

Level L of the tree contains nodes that cover an interval of size 1<<L 

That is, level 0 has intervals of size 1, that's just

Tree_0[S] = sortIndexInverse[S] 
  (eg. it's just the file position of that sort pos)
  (in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)

Tree_1[i] covers steps of size 2, that is :
  file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
  I store the min and max of that

Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]

Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.

Also see the following blog on how I walk binary interval trees and the released source code for StringMatchTest contains the full code for this matcher.
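To make the addendum concrete, here's a minimal sketch of that tree written as a standard recursive segment tree (this is not the actual StringMatchTest code, and not the iterative interval walk of the previous post; names like MinMaxTree and FindNextInRange are mine). Note the min/max test can let the recursion descend into a subtree that turns out to contain nothing in the window; it just falls through to the sibling in that case :

#include <vector>
#include <algorithm>
#include <climits>

struct MinMaxTree
{
    int n;
    std::vector<int> minv, maxv;   // binary interval tree over file positions

    // filePosBySortIndex[s] = file position of the suffix at sort position s
    // (ie. sortIndexInverse[] from the email above)
    void Build(const std::vector<int> & filePosBySortIndex)
    {
        n = (int) filePosBySortIndex.size();
        minv.assign(4*n, INT_MAX);
        maxv.assign(4*n, INT_MIN);
        if ( n > 0 ) BuildRec(1, 0, n-1, filePosBySortIndex);
    }

    void BuildRec(int node, int l, int r, const std::vector<int> & v)
    {
        if ( l == r ) { minv[node] = maxv[node] = v[l]; return; }
        int m = (l+r)/2;
        BuildRec(2*node, l, m, v);
        BuildRec(2*node+1, m+1, r, v);
        minv[node] = std::min(minv[2*node], minv[2*node+1]);
        maxv[node] = std::max(maxv[2*node], maxv[2*node+1]);
    }

    // smallest sort position >= from whose file position is in [lo,hi], or -1 if none
    int FindNextInRange(int from, int lo, int hi) const
    {
        if ( n == 0 ) return -1;
        return FindRec(1, 0, n-1, from, lo, hi);
    }

    int FindRec(int node, int l, int r, int from, int lo, int hi) const
    {
        if ( r < from ) return -1;                            // entirely before the start
        if ( maxv[node] < lo || minv[node] > hi ) return -1;  // no file pos in [lo,hi] possible here
        if ( l == r ) return l;                               // leaf : its file pos is in range
        int m = (l+r)/2;
        int res = FindRec(2*node, l, m, from, lo, hi);
        if ( res >= 0 ) return res;
        return FindRec(2*node+1, m+1, r, from, lo, hi);
    }
};

For the windowed matcher you'd query FindNextInRange(S+1, F-window, F-1) (plus a mirrored FindPrevInRange for the other direction), and then take the match length from the pair-match-length walk as usual.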


09-30-11 | String Match Results Part 5 + Conclusion

Finally for completeness, some of the matchers from Tornado in FreeArc. These are basically all standard "cache table" style matchers, originally due to LZRW, made popular by LZP and LZO. The various Tornado settings select different amounts of hash rows and ways.

As they should, they have very constant time operation that goes up pretty steadily from Tornado -3 to -7, because there's a constant number of hash probes per match attempt.


totals : match len : clocks
Test_MMC1 : 0.663028 , 231.254818 
Test_Hash1 : 0.662718 , 64.888003 
Test_Tornado_3 : 0.630377 , 19.658834 
Test_Tornado_4 : 0.593174 , 28.456055 
Test_Tornado_5 : 0.586540 , 40.546146 
Test_Tornado_6 : 0.580042 , 56.841156 
Test_Tornado_7 : 0.596584 , 141.432393 

There may be something wrong with my Tornado wrapper as the -3 matcher actually finds the longest total length. I dunno. The speeds look reasonable. I don't really care much about these approximate matchers because the loss is hard to quantify, so there you go (normally when I see an anomaly like that I would investigate it to make sure I understand why it's happening).

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin


Conclusion : I've got to get off string matching so this is probably the end of posts on this topic.

MMC looks promising but has some flaws. There are some cases that trigger a slowness spike in it. Also it has some bad O(N^2) behavior with unbounded match length ("MMC2") so I have to run it with a limit ("MMC1") which removes some of its advantage over LzFind and Hash1 and other approximate matchers. (without the limit it has the advantage of being exact). It's also GPL at the moment, which is a killer.

LzFind doesn't have anything going for it really.

For approximate/small-window matching I don't see any reason to not use the classic Zip hash chain method. I tried a few variants of this, like doing a hash chain to match the first 4 bytes and then link listing off that, and all the variants were worse than the classic way.

For large window / exact matching / optimal parsing, a correct O(N) matcher is the way to go. The suffix-array based matcher is by far the easiest for your kids to implement at home.


09-30-11 | String Match Results Part 4

Okay, finally on to greedy parsing. Note with greedy parsing the average match length per byte is always <= 1.0 (it's actually the % of bytes matched in this case).

Two charts for each : the first is clocks per byte, the second is average match length. Note that Suffix5 is just for reference and is neither windowed nor greedy.

got arg : window_bits = 16

got arg : window_bits = 17

got arg : window_bits = 18

Commentary :

Okay, finally MMC beats Suffix Trie and LzFind, this is what it's good at. Both MMC and LzFind get slower as the window gets larger. Surprisingly, the good old Zip-style Hash1 is significantly faster and finds almost all the matches on these files. (note that LzFind1 and Hash1 both have search limits but MMC does not)


test set :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin


The same matchers ; greedy, 16 bit window, on the stress tests :

LzFind does not do well at all on the stress tests. (note that LzFind1 and MMC1 are length-limited; LzFind1 and Hash1 are "amortized" (step limited)).

0 = stress_all_as 
1 = stress_many_matches 
2 = stress_search_limit 
3 = stress_sliding_follow 
4 = stress_suffix_forward 
5 = twobooks


09-30-11 | String Match Results Part 3

Still doing "optimal" (non-greedy parsing) but now lets move on to windowed & non-exact matching.

Windowed, possibly approximate matching.

Note : I will include the Suffix matchers for reference, but they are not windowed.

16 bit window :

Clocks per byte :

Average Match len :

This is what LzFind is designed for and it's okay at it. It does crap out pretty badly on the rather degenerate "particles.max" file, and it also fails to find a lot of matches. (LZFind1 has a maximum match length of 256 and a maximum of 32 search steps, which are the defaults in the LZMA code; LzFind2 which we saw before has those limits removed (and would DNF on many of these files)).

lztest is :

0 = almost_incompressable
1 = bigship1.x
2 = Dolphin1.x
3 = GrannyRocks_wot.gr2
4 = Gryphon_stripped.gr2
5 = hetero50k
6 = movie_headers.bin
7 = orange_purple.BMP
8 = particles.max
9 = pixo_run_animonly_stripped.gr2
10 = poker.bin
11 = predsave.bin
12 = quick.exe
13 = RemoteControl_stripped.gr2
14 = ScriptVolumeMgr.cpp


09-30-11 | String Match Results Part 2b

Still on optimal parsing, exact matching, large window :

Chart of clocks per byte, on each file of a test set :

On my "lztest" data set :


0 = almost_incompressable
1 = bigship1.x
2 = Dolphin1.x
3 = GrannyRocks_wot.gr2
4 = Gryphon_stripped.gr2
5 = hetero50k
6 = movie_headers.bin
7 = orange_purple.BMP
8 = particles.max
9 = pixo_run_animonly_stripped.gr2
10 = poker.bin
11 = predsave.bin
12 = quick.exe
13 = RemoteControl_stripped.gr2
14 = ScriptVolumeMgr.cpp

"lztest" is not a stress test set, it's stuff I've gathered that I think is roughly reflective of what games actually compress. It's interesting that this data set causes lots of DNF's (did not finish) for MMC and LzFind.

Suffix5 (the real suffix trie) is generally slightly faster than the suffix array. It should be, of course, if I didn't do a bonehead trie implementation, since the suffix array method basically builds a trie in the sort, then reads it out to sorted indexes, and then I convert the sorted indexes back to match lengths.

Good old CCC (Calgary Compression Corpus) :


0 = BIB
1 = BOOK1
2 = BOOK2
3 = GEO
4 = NEWS
5 = OBJ1
6 = OBJ2
7 = PAPER1
8 = PAPER2
9 = PAPER3
10 = PAPER4
11 = PAPER5
12 = PAPER6
13 = PIC
14 = PROGC
15 = PROGL
16 = PROGP
17 = TRANS

I won't be showing results on CCC for the most part because it's not very reflective of real world modern data, but I wanted to run on a set where MMC and LzFind don't DNF too much to compare their speed when they do succeed. Suffix Trie is almost always very close to the fastest except on paper4 & paper5 which are very small files.


0 = BIB
1 = BOOK1
2 = BOOK2
3 = GEO
4 = NEWS
5 = OBJ1
6 = OBJ2
7 = PAPER1
8 = PAPER2
9 = PROGC
10 = PROGL
11 = PROGP
12 = TRANS

Two new tests in the mix.

Test_Hash1 : traditional "Zip" style fixed size hash -> linked list. In this run there's no chain limit so matching is exact.

Test_Hash3 : uses cblib::hash_table (a reprobing ("open addressing" or "closed hashing", I prefer reprobing)) to hash the first 4 bytes then a linked list. I was surprised to find that this is almost the same speed as Hash1 and sometimes faster, even though it's a totally generic template hash table (that is not particularly well suited to this usage).


09-30-11 | BMWs

I think the E46 M3 is almost the most perfect car ever made. Great engine, plenty of room, great handling (after you dial out the factory understeer); comfortable enough to drive anywhere, but tight enough to toss. I wish it wasn't quite so ugly, and I wish it weighed about 300 pounds less, and I wish it didn't have a leather interior which is invariably gross by now (seriously? can we fucking get technical fabric in cars already? it's breathable, near waterproof, doesn't get hot or cold, doesn't get ruined by water or sun, it's been the obvious correct material for car interiors for like 10 years now, stop being such ass hats).

E46 M3 prices from cars.com :

Progression of M3 power-to-weights over the years : (with the 1M thrown in since it's the real small M sedan of the present)


E30 2865 / 250 (*) = 11.5 (pounds per hp)
E36 3220 / 316 = 10.2
E46 3415 / 338  3443? = 10.1
1M  3461 / 335  3329? 3296? = 10.1 (**)
E92 3704-3900 / 414 = 9.2

correct/comparable weights are hard to find. Both the manufacturer weights and magazine weights are not reliable. You need the regulatory DIN weight or something but I can't find a good source for that (Wikipedia is no good). Anyway, the details may be a bit off, but the power-to-weights of M cars have changed surprisingly little over the years. The powers have gone up, but so have the weights, only slightly slower.

(* = the E30 S14 engine made more like 200 hp stock, but can be reasonably easily brought up to "race spec" 250 hp using modern electronic fuel injection and a few other cheap mods; unlike the other engines, the S14 was actually raced and is reliable even at higher output).
(** = the 1M can easily make 380 hp or so with standard turbo mods).

Before about 2001 the US versions of most BMW's were crippled. Either worse versions of the engines or completely different engines.

The Z3 "M coupe" shooting-brake is supposedly one of the best handling cars ever made (up there with the NSX and I'm not sure what else). The good one is the late model 2001-2002 which got the E46 M3 engine. Unfortunately the public has figured this out and they're now a bit of a collectors item for enthusiasts; prices have stabilized in the $30k-40k range, around double prices of the earlier small engine Z3 M Coupe. I'm a big fan of how ugly it is, and I love that it has the practicality of a wagon, but I hear the cockpit is a bit small.

The later M coupe is extremely comparable to a Cayman :


E86 Z4 M Coupe (2006-2008) : 330 hp, 3200 pounds = 9.7 (pounds per hp)
997.2 Cayman   (2009-2011) : 320 hp, 3000 pounds = 9.4

The M Coupe makes better noises and the engine is easier to tune up, it's also more analog, more raw. The Cayman actually has a lot more cabin room and luggage room ; the M is rather uncomfortably cramped, and I didn't love the feel of sitting over the rear axle in a car with a huge hood. The M will certainly depreciate less, and is marginally less douchey.

There's a large spec E30 (E30 325i) amateur race class. It's a very cheap race class to get into with a very strict spec, it looks like a lot of fun. Maybe I'll do something like that in my retirement. Cars like that are called "momentum cars" by racers because they have very little acceleration; not much fun on the street, but they can still be great on a track because it takes a lot of skill to get the right line to keep speed through corners in traffic.


09-30-11 | Don't use memset to zero

A very common missed optimization is letting the OS zero large chunks of memory for you.

Everybody just writes code like this :


U32 * bigTable = (U32 *) malloc(20<<20);
memset(bigTable,0,20<<20);

but that's a huge waste. (eg. for large hash table on small files the memset can dominate your time).

Behind your back, the operating system is actually running a thread all the time as part of the System Idle Process which grabs free pages and writes them with zero bytes and puts them on the zero'ed page list.

When you call VirtualAlloc, it just grabs a page from the zeroed page list and hands it to you. (if there are none available it zeroes it immediately).

!!! Memory you get back from VirtualAlloc is always already zeroed ; you don't need to memset it !!!

The OS does this for security, so you can never see some other app's bytes, but you can also use it to get zero'ed tables quickly.
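A minimal sketch (Windows only; error checking omitted, and "U32" as in the snippet above) :

#include <windows.h>

U32 * bigTable = (U32 *) VirtualAlloc( NULL, 20<<20, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE );
// no memset needed - these pages come off the zeroed page list

// ... use bigTable ...

VirtualFree( bigTable, 0, MEM_RELEASE );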

(I'm not sure if any stdlib has a fast path to this for "calloc" ; if so that might be a reason to prefer that to malloc/memset; in any case it's safer just to talk to the OS directly).

ADDENDUM : BTW to be fair none of my string matchers do this, because other people's don't and I don't want to win from cheap technicalities like that. But all string match hash tables should use this.


09-30-11 | String Match Results Part 2

The first and simplest set of results are the ones where non-O(N) algorithms make themselves known.

Optimal parsing, large window.

The candidates are :

SuffixArray1 : naive matcher built on divsufsort
SuffixArray2 : using the forward-offset elimination from this post
Suffix2 : Suffix Trie with follows and path compression but missing the bits in this post
Suffix3 : Suffix Trie without follow
Suffix5 : fully working Suffix Trie
MMC1 : MMC with max match length of 256
MMC2 : MMC with no limit
LzFind1 : LzFind (LZMA HC4 - binary tree) with max ML of 256 and step limit of 32
LzFind2 : LzFind with no max ML or step limit

Note : LzFind was modified from LZMA to not record all matches, just the longest, to make it more like the competitors. MMC was modified to make window size a variable.

In all cases I show :


Test_Matcher : average match length per byte , average clocks per byte

And with no further ado :

got arg : window_bits = 24
working on : m:\test_data\lz_stress_tests

loading file : m:\test_data\lz_stress_tests\stress_many_matches
Test_SuffixArray1 : 32.757760 , 164.087948 
Test_SuffixArray2 : 32.756953 , 199.878476 
Test_Suffix2 : 32.757760 , 115.846130 
Test_Suffix3 : 31.628279 , 499.722569 
Test_Suffix5 : 32.757760 , 184.172167 
Test_MMC2 : 32.757760 , 1507.818166 
Test_LzFind2 : 32.757760 , 576.154370 

loading file : m:\test_data\lz_stress_tests\stress_search_limit
Test_SuffixArray1 : 823.341331 , 182.789064 
Test_SuffixArray2 : 823.341331 , 243.492241 
Test_Suffix2 : 823.341331 , 393.930504 
Test_Suffix3 : 807.648294 , 2082.447274 
Test_Suffix5 : 823.341331 , 91.699276 
Test_MMC2 : 823.341331 , 6346.400206 
Test_LzFind2 : 823.341331 , 1807.516994 

loading file : m:\test_data\lz_stress_tests\stress_sliding_follow
Test_SuffixArray1 : 199.576550 , 189.029462 
Test_SuffixArray2 : 199.573198 , 220.316868 
Test_Suffix2 : 199.576550 , 95.225780 
Test_Suffix3 : 198.967622 , 2110.521111 
Test_Suffix5 : 199.576550 , 106.019526 
Test_MMC2 : 199.576550 , 36571.382020 
Test_LzFind2 : 199.576550 , 1249.184412 

loading file : m:\test_data\lz_stress_tests\stress_suffix_forward
Test_SuffixArray1 : 5199.164464 , 6138.802402 
Test_SuffixArray2 : 5199.164401 , 213.675569 
Test_Suffix2 : 5199.164464 , 12901.429712 
Test_Suffix3 : 5199.075953 , 32152.812339 
Test_Suffix5 : 5199.164464 , 145.684678 
Test_MMC2 : 5199.016562 , 6652.666440 
Test_LzFind2 : 5199.164464 , 11739.369336 

loading file : m:\test_data\lz_stress_tests\stress_all_as
Test_SuffixArray1 : 21119.499148 , 40938.612689 
Test_SuffixArray2 : 21119.499148 , 127.520147 
Test_Suffix2 : 21119.499148 , 88178.094886 
Test_Suffix3 : 21119.499148 , 104833.677202 
Test_Suffix5 : 21119.499148 , 119.676823 
Test_MMC2 : 21119.499148 , 25951.480871 
Test_LzFind2 : 21119.499148 , 38581.431558 

loading file : m:\test_data\lz_stress_tests\twobooks
Test_SuffixArray1 : 192196.571348 , 412.356092 
Test_SuffixArray2 : 192196.571348 , 420.437773 
Test_Suffix2 : 192196.571348 , 268.524287 
Test_Suffix3 : DNF
Test_Suffix5 : 192196.571348 , 292.777726 
Test_MMC2 : DNF 
Test_LzFind2 : DNF 

(DNF = Did Not Finish = over 100k clocks per byte).

Conclusion : SuffixArray2 and Suffix5 both actually work and are correct with no blowup cases.

SuffixArray1 looks good on the majority of files (and is slightly faster than SuffixArray2 on those files), but "stress_suffix_forward" clearly calls it out and shows the break down case.

Suffix2 almost works except on the degenerate tests due to failure to get some details of follows quite right ( see here ).

Suffix3 just shows that a Suffix Trie without follows is some foolishness.

We won't show SuffixArray1 or Suffix2 or Suffix3 again.

MMC2 and LZFind2 both have bad failure cases. Both are simply not usable if you want to find the longest match at every byte. We will revisit them later in other usages though and see that they are good for what they're designed for.

I've not included any of the hash chain type matchers in this test because they all obviously crap their pants in this scenario.


09-30-11 | String Match Results Part 1

I was hoping to make some charts and graphs, but it's just not that interesting. Anyhoo, let's get into it.

What am I testing? String matching for an LZ-type compressor. Matches must start before current pos but can run past current pos. I'm string matching only, not compressing. I'm counting the total time and total length of matches found.

I'm testing match length >= 4. Matches of length 2 & 3 can be found trivially by table lookup (though on small files this is not a good way to do it). Most of the matchers can handle arbitrary min lengths, but this is just easier/fairer for comparison.

I'm testing both "greedy" (when you find a match step ahead its length) and "optimal" (find matches at every position). Some matchers like the suffix tree ones don't really support greedy parsing, since they have to do all the work at every position even if you don't want the match there.

I'm testing windowed and non-windowed matchers.

I'm testing approximate and non-approximate (exact) matchers. Exact matchers find all matches possible, approximate matchers find some amount less. I'm not sure the best way to show the approximation vs. speed trade off. I guess you want a "pareto frontier" type of graph, but what should the axes be?

Also, while I'm at it, god damn it!

MAKE YOUR CODE FREE PEOPLE !!

(and GPL is not free). And some complicated personal license is a pain in the ass. I used to do this myself, I know it's tempting. Don't fucking do it. If you post code just make it 100% free for all uses. BSD license is an okay choice.

Matchers I'm having trouble with :


Tornado matchers from FreeArc - seem to be GPL (?)

MMC - GPL

LzFind from 7zip appears to be public domain. divsufsort is free. Larsson's slide is free.


09-29-11 | Suffix Tries 3 : On Follows with Path Compression

Some subtle things that it took me a few days to track down. Writing for my reference.

1. Follows updates should be a bit "lazy". With path compression you aren't making all the nodes on a suffix. So when you match at length 5, the follow at length 4 might not exist. (I made a small note on the consequences of this previously.) Even if the correct follow node doesn't exist, you should still link in to the next longest follow node possible (eg. length 3 if a 4 doesn't exist). Later on the correct follow might get made, and then if possible you want to update it. So you should consider the follows links to be constantly under lazy update; just because a follow link exists it might not be the right one, so you may want to update it.

eg. say you match 4 bytes of suffix [abcd](ef..) at the current spot. You want to follow to [bcd] but there is no length 3 node of that suffix currently. Instead you follow to [bc] (the next best follow available) , one of whose children is [dxy], you now split the [dxy] to [d][xy] and add [ef] under [d]. You can then update the follow from the previous node ([abcd]) to point at the new [bc][d] node.

2. It appears that you only need to update one follow per byte to get O(N). I don't see that this is obvious from a theoretical standpoint, but all my tests pass. Say you trace down a long suffix. You may encounter several nodes that don't have fully up to date follow pointers. You do not have to track them all and update them all at the next byte. It seems you can just update the deepest one (not the deepest node, but the deepest node that needs an update). (*)

3. Even if your follow is not up to date, you can still use the guaranteed (lastml-1) match len to good advantage. This was a big one that I missed. Say you match 4096 bytes and you take the follow pointer, and it takes you to a node of depth 10. You've lost a lot of depth - you know you must match at least 4095 bytes and you only have 10 of them. But you still have an advantage. You can descend the tree and skip all string compares up to 4095 bytes. In particular, when you get to a leaf you can immediately jump to matching 4095 of the leaf pointer.

4. Handling of EOF in suffix algorithms is annoying; it needs to act like a value outside the [0,255] range. The most annoying case is when you have a degenerate suffix like aaaa...aaaEOF , because the "follow" for that suffix might be itself (eg. what follows aaa... is aa..) depending on how you handle EOF. This can only happen with the degenerate RLE case so just special casing the RLE-to-EOF case avoids some pain.

(* = #2 is the thing I have the least confidence in; I wonder if there could be a case where the single node update doesn't work, or if maybe you could get non-O(N) behavior unless you have a more clever/careful update node selection algorithm)


09-28-11 | String Matching with Suffix Arrays

A suffix sorter (such as the excellent divsufsort by Yuta Mori) provides a list of the suffix positions in an array in sorted order. Eg. sortedSuffixes[i] is the ith suffix in order.

You can easily invert this table to make sortLookup such that sortLookup[ sortedSuffixes[i] ] == i . eg. sortLookup[i] is the sort order for position i.
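In code the inversion is just one pass over the sort order :

for(int i=0;i<N;i++)
    sortLookup[ sortedSuffixes[i] ] = i;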

Now at this point, for each suffix sort position i, you know that the longest match with another suffix is either at i-1 or i+1.

Next we need the neighboring pair match lengths for the suffix sort. This can be done in O(N) as previously described here . So we now have a sortSameLen[] array such that sortSameLen[i] tells you the match length between (sorted order) elements i and i+1.

Using just these you can find all the match lengths for any suffix in the array thusly :


For a suffix starting at index pos :
  find its sort order : sortIndex = sortLookup[pos]
  in each direction (+1 and -1) :
    current_match_len = infinite
    repeat :
      step to the next sort index
      current_match_len = MIN(current_match_len,sortSameLen[sort index])

Okay. This is all old news. But it has a problem that has been discussed previously .

When matching strings for LZ and such, we don't want the longest match in the array, we want the longest match that occurs earlier. Handled naively this ruins the great O() performance of suffix array string matching. But you can do better.

Run Algorithm Next Index with Lower Value on the sortedSuffix[] array. This provides an array nextSuffixPreceding[]. This is exactly what you need - it provides the next closest suffix with a preceding index.

Now instead of the longest match being at +1 and -1, the longest match is at nextSuffixPreceding[i] and priorSuffixPreceding[i].

There's one last problem - if my current suffix is at position pos, and I look up si = sortIndex[pos] and from that nextSuffixPreceding[si] - I need to walk up to that position one by one doing MIN() on the adjacent pair match lengths (sortSameLen). That ruins my O() win.

But there's a solution - simply build the match length as well when you run "next index with lower value". This can be done easily by tracking the match length back to the preceding "fence". This adds no complexity to the algorithm.

The total sequence of operations is :


sort suffixes : O(NlogN) or so

build sort lookup : O(N)

build sort pair same len : O(N)

build next/prior pos preceding with match lengths : O(N)

now to find a match length :

at position "pos"
si = sortLookup[pos]
for each direction (following and preceding)
  matchpos = nextSuffixPreceding[si]
  matchlen = nextSuffixPreceding_Len[si]

that is, the match length lookup is a very simple O(1) per position (or O(N) for all positions).

One minor annoyance remains, which is that the suffix array string searcher does not provide the lowest offset for a given length of match. It gives you the closest in suffix order, which is not what you want.


09-28-11 | Algorithm : Next Index with Lower Value

You are given an array of integers A[]

For each i, find the next entry j (j > i) such that the value is lower (A[j] < A[i]).

Fill out B[i] = j for all i.
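For example, A = { 5, 2, 6, 1, 4 } gives B = { 1, 3, 3, none, none } : A[1]=2 is the first later value below A[0]=5, A[3]=1 is the first later value below both A[1]=2 and A[2]=6, and nothing after A[3] or A[4] is lower.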

For array size N this can be done in O(N).

Here's how :

I'll call this algorithm "stack of fences". Walk the array A[] from start to finish in one pass.

At i, if the next entry (A[i+1]) is lower than the current (A[i]) then you have the ordering you want immediately and you just assign B[i] = i+1.

If not, then you have a "fence", a value A[i] which is seeking a lower value. You don't go looking for it immediately, instead you just set the current fence_value to A[i] and move on via i++.

At each position you visit when you have a fence, you check whether the current A[i] < fence_value. If so, you set B[fence_pos] = i ; you have found the successor to that fence.

If you have a fence and find another value which needs to be a fence (because it's lower than its successor) you push the previous fence on a stack, and set the current one as the active fence. Then when you find a value that satisfies the new fence, you pop off the fence stack and also check that fence to see if it was satisfied as well. This stack can be stored in place in the B[] array, because the B[] is not yet filled out for positions that are fences.

The pseudocode is :


int fence_val = -1; // no fence active yet ; A[] values assumed non-negative (eg. positions)
int fence_pos = -1;

for(int i=1;i<size;i++)
{
    int prev = A[i-1];
    int cur = A[i];

    if ( cur > prev )
    {
        // prev needs a later lower value :
        // make a new fence and push the old fence on the stack (stored in place in B[])
        B[i-1] = fence_pos;
        fence_pos = i-1;
        fence_val = prev;
    }
    else
    {
        // descending, cur is the answer for prev :
        B[i-1] = i;

        // cur may also satisfy fences waiting on the stack :
        while( fence_pos >= 0 && cur < fence_val )
        {
            int prev_fence = B[fence_pos];
            B[fence_pos] = i;
            fence_pos = prev_fence;
            fence_val = ( fence_pos >= 0 ) ? A[fence_pos] : -1;
        }
    }
}

// any fences left on the stack (and the last element) never see a lower value ;
// mark them with an out-of-range index :
while( fence_pos >= 0 )
{
    int prev_fence = B[fence_pos];
    B[fence_pos] = size;
    fence_pos = prev_fence;
}
B[size-1] = size;

This is useful in string matching, as we will see forthwith.


09-28-11 | Rugby

God damn rugby is a fucking joyous sport to watch.

1. No commercials. I just can't watch sports with commercials any more. Right when you start to get into the action you have to see some shit about Geico or Aflac or some shit. OMG what could be more of a scam than supplemental insurance. Maybe I should insure my insurance rate. And insure against my insurance company going bankrupt. And some supplemental umbrella insurance. Now I'm upset, no more watching commercials.

2. No breaks between plays. No instant replay. No timeouts. Just action action all the time.

3. Advantage. I think I did a post about this long ago, but the "advantage" rule for penalties is just so much fucking win. It means that you don't have to stop play for every piddly penalty, you let play keep going as long as the penalizing team has not gained an advantage from the penalty (more precisely, play continues as long as the team that was infringed upon is at an advantage compared to the position they were in at the time of the penalty). This sounds complex but is not and is just 100% win.

4. The refs. The refs in rugby are just uniformly superb. I'm not quite sure why they're so much better than any other sport. They have more autonomy and more freedom to make judgement calls, and they seem to do so well. One aspect perhaps is that most rugby refs have played a bit at the professional level, which I think is rare in other sports.

5. The game is (usually) not decided by penalties. I just can't watch basketball or soccer because of this. Penalties should encourage players to stick to the spirit of the game, the game shouldn't become all about the points you can get off penalties. It ruins the game.

6. The players aren't divers or whiners (mostly). In other sports you see the players taking dives, trying to draw fouls, or going and begging to the ref after plays. WTF, do you have no self respect? You're a grown man and you're diving and whining? WTF. I wonder if they secretly practice flopping in basketball and soccer training camps, or is that something (like taking steroids) that you're supposed to figure out on your own in a kind of nudge-nudge-wink-wink way. Maybe a veteran player takes the rookie under his wing and does some supplemental flop and beg practice. I don't know how you can be a fan of a player like Robert Horry or Zidane; oh yeah, I really admire the way they fake being fouled, it's so graceful.

Anyway, I feel like most ruggers just want to get back to play. They want to win the game by smashing through their opponents with ball in hand, not by begging to the ref. And that I respect.

(I must say, watching the recent NZ-France World Cup match I was absolutely disgusted by the sleazy play of the French. Several soccer-style dives trying to draw penalty, one attempt at drawing "obstruction", and the very sleazy try by kicking off while the ref was talking, just absolutely scummy soccer-style tactics, I hope they get their eyes poked in every ruck).

7. Toughness. It's great to watch some big men just brutalize each other. This used to be part of the appeal of American Football but they've all become such delicate flowers now ; oo I have a stubbed toe better get off the field. Back in the day you had guys like Ronnie Lott who gave up a chunk of their thumb to stay on the field. That sort of macho insanity still exists in rugby ; if you get a sprain or even a broken bone or something of course you fucking stay on the field, what are you a pussy? You get off the field when your team doesn't need you any more.

Most of the WC games I've seen so far have been very pretty, good games of rugby, with good discipline and a few flashes of beautiful ball movement and big runs. That's not always the case, though, it can degenerate into a very ugly game. With unskilled teams you get scrums that don't hold together and constantly have to restart, you get lots of bobbled and dropped balls, they can't put together phases, and those games are no fun to watch.


09-27-11 | S2000 Prices

No surprise - S2000's have held their value very very well. For one thing, they're pretty rare, for another, they're great, and most importantly, they're a Honda, from the tail end of the glory days, when Honda was just giving away massive amounts of value. Honda was making $50k cars and selling them for $25k.

(aside : there's some sort of vague sense in which I believe Honda from something like the mid-80's to the mid 90's made the greatest cars of all time; obviously not in terms of actually comparing them to modern cars, so that forces me to qualify it as "compared to their era" and then you get some moron who says the dupendinger hummdorf from 1910 was a bigger improvement over its era, which, okay, I have no fucking idea if that's true or not, and I don't care (and I doubt it), so I hate all that "good for its time" kind of rating. But anyway, when you look at the line from the CRX, the NSX, all the Type-R cars, just staggeringly good, so well made, reliability, perfectly tuned to give the driver the feedback and control he needs, and way way underpriced, so much more car for the money than the competition; it's a damn shame that that time is passed)

On the other hand, RX8 prices are falling fast, and something like a 2007 RX8 for $15k is looking pretty attractive -

Sometimes I go off on these flights of fancy about getting a 240Z or an E30 M3 or some cool old car like that, for the wonderful analog-ness of it, the lack of driver aids, light weight, direct steering. But when you can get an RX8 for the same price, come on, it's just better, much better. And then you don't have to deal with it being in the shop all the time. I like cars and all, but my tolerance for dealing with mechanics is very close to zero.


09-27-11 | God Damn It

When you are drilling a hole in a wood cabinet, first of all, just stop, don't fucking do it. But if you insist on continuing, god damn it make it centered and level and straight and all that.

If you're nailing or screwing into the exterior wood of a house, again, just stop. Do you really need to put screws in to hang your fucking garden party lights? No, I don't think you do. But if you insist, fucking caulk them or something so you aren't just piercing the waterproof skin and providing no protection.

If you're cutting a hole all the way through a roof or a wall to put a vent, fucking stop for a second and make sure you're doing it in the right place, make it level and centered.

You can't fucking undo these things.


09-27-11 | String Match Stress Test

Take a decent size file like "book1" , do :

copy /b book1 + book1 twobooks

then test on "twobooks".

There are three general classes of how string matchers respond to a case like "twobooks" :

1. No problemo. Time per byte is roughly constant no matter what you throw at it (for both greedy and non-greedy parsing). This class is basically only made up of matchers that have a correct "follows" implementation.

2. Okay with greedy parsing. This class craps out in some kind of O(N^2) way if you ask them to match at every position, but if you let them do greedy matching they are okay. This class does not have a correct "follows" implementation, but does otherwise avoid O(N^2) behavior. For example MMC seems to fall into this class, as does a suffix tree without "follows".

Any matcher with a small constant number of maximum compares can fall into this performance class, but at the cost of an unknown amount of match quality.

3. Craps out even with greedy parsing. This class fails to avoid the O(N^2) trap that happens when you have a long match and also many ways to make it. For example simple hash chains without an "amortize" limit fall in this class. (with non-greedy parsing they are O(N^3) on degenerate cases like a file that's all the same char).


Two other interesting stress tests I'm using are :

Inspired by ryg, "stress_suffix_forward" :

4k of aaaaa...
then paper1
then 64k of aaaa...
obviously when you first reach the second part of "aaaa..." you need to find the beginning of the file, but a naive suffix sort will have to look through 64k of following a's before it finds it.

Another useful one to check on the "amortize" behavior is "stress_search_limit" :

book1
then, 128 times :
  128 random bytes
  the first 128 bytes of book1
book1 again
obviously when you encounter all of book1 for the second time, you should match the head of the file, but matchers which use some kind of search limit will see the 128 byte matches first and may never get back to the really long one.
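For reference, a minimal sketch of generating this one (file names and helpers are whatever you like; this is not the actual test harness) :

#include <stdio.h>
#include <stdlib.h>
#include <vector>

static std::vector<unsigned char> ReadWholeFile(const char * name)
{
    std::vector<unsigned char> v;
    FILE * f = fopen(name,"rb");
    if ( ! f ) return v;
    fseek(f,0,SEEK_END); long len = ftell(f); fseek(f,0,SEEK_SET);
    v.resize(len);
    fread(&v[0],1,len,f);
    fclose(f);
    return v;
}

int main()
{
    std::vector<unsigned char> book1 = ReadWholeFile("book1");
    FILE * out = fopen("stress_search_limit","wb");
    if ( book1.size() < 128 || ! out ) return 1;

    fwrite(&book1[0],1,book1.size(),out);      // book1
    for(int rep=0;rep<128;rep++)
    {
        for(int i=0;i<128;i++)
            fputc(rand()&0xFF,out);            // 128 random bytes
        fwrite(&book1[0],1,128,out);           // the first 128 bytes of book1
    }
    fwrite(&book1[0],1,book1.size(),out);      // book1 again

    fclose(out);
    return 0;
}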


09-26-11 | Tiny Suffix Note

Obviously there are lots of analogies between suffix tries and suffix arrays.

This old note about suffix arrays which provides O(N) neighbor pair match lengths is exactly analogous to using "follow pointers" for O(N) string matching in suffix tries.

(their paper also contains a proof of O(N)'ness , though it is obvious if you think about it a bit; see comments on previous post about this).

Doing Judy-ish stuff for a suffix tree is exactly analogous to the "introspective" stuff that's done in good suffix array sorters like divsufsort.

By Judy-ish I mean using a variety of tree structures and selecting one for the local area based on its properties. (eg. nodes with > 100 children switch to just using a radix array of 256 direct links to kids).

Suffix tries are annoying because it's easy to slide the head (adding nodes) but hard to slide the tail (removing nodes). Suffix arrays are even worse in that they don't slide at all.

The normal way to adapt suffix arrays to LZ string matching is just to use chunks of arrays (possibly a power-of-2 cascade). There are two problems I haven't found a good solution to. One is how to look up a string in the chunk that it is not a member of (eg. a chunk that's behind you). The other is how to deal with offsets that are in front of you.

If you just put your whole file in one suffix array, I believe that is unboundedly bad. If you were allowed to match forwards, then finding the best match would be O(1) - you only have to look at the two slots before you and after you in the sort order. But since we can't match forward, you have to scan. The pseudocode is like this :


do both forward and backward :
start at the sort position of the string I want to match
walk to the next closest in sort order (this is an O(1) table lookup)
if it's a legal match (eg. behind me) - I'm done, it's the best
if not, keep walking

the problem is the walk is unbounded. When you are somewhere early in the array, there can be an arbitrary number (by which I mean O(N)) of invalid matches between you and your best match in the sort order.

Other than these difficulties, suffix arrays provide a much simpler way of getting the advantages of suffix tries.

Suffix arrays also have implementation advantages. Because you separate the suffix string work from the rest of your coder it makes it easier to optimize each one in isolation, you get better cache use and better register allocation. Also, the suffix array can use more memory during the sort, or use scratch space, while a trie has to hold its structure around all the time. For example some suffix sorts will do things like use a 2-byte radix in parts of the sort where that makes sense (and then they can get rid of it and use it on another part of the sort), and that's usually impossible for a tree that you're holding in memory as you scan.


09-25-11 | More on LZ String Matching

This might be a series until I get angry at myself and move on to more important todos.

Some notes :

1. All LZ string matchers have to deal with this annoying problem of small files vs. large ones (and small windows vs large windows). You really want very different solutions, or at least different tweaks. For example, the size of the accelerating hash needs to be tuned for the size of data or you can spend all your time initializing a 24 bit hash to find matches in 10 byte file.

2. A common trivial case degeneracy is runs of the same character. You can of course add special case handling of this to any string matcher. It does help a lot on benchmarks of course, because this case is common, but it doesn't help your worst case in theory because there are still bad degenerate cases. It's just very rare to have long degenerate matches that aren't simple runs.

One easy way to do this is to special case just matches that start with a degenerate char. Have a special index of [256] slots which correspond to starting with >= 4 of that char.
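Something like this (a sketch only; the names and the exact ">= 4 repeats" trigger are mine) :

static int lastRunPos[256]; // most recent position where >= 4 repeats of that byte started, or -1

void InitRunIndex()
{
    for(int i=0;i<256;i++) lastRunPos[i] = -1;
}

// call at each position as the matcher slides forward :
void InsertRunIndex(const unsigned char * buf, int pos, int size)
{
    if ( pos+3 < size &&
         buf[pos]==buf[pos+1] && buf[pos]==buf[pos+2] && buf[pos]==buf[pos+3] )
        lastRunPos[ buf[pos] ] = pos;
}

// at search time : if the current string starts with >= 4 repeats of buf[pos],
// this gives a candidate match without touching the main search structure
int LookupRunIndex(const unsigned char * buf, int pos, int size)
{
    if ( pos+3 < size &&
         buf[pos]==buf[pos+1] && buf[pos]==buf[pos+2] && buf[pos]==buf[pos+3] )
        return lastRunPos[ buf[pos] ];
    return -1;
}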

3. A general topic that I've never seen explored well is the idea of approximate string matching.

Almost every LZ string matcher is approximate, they consider less than the full set of matches. Long ago someone referred to this as "amortized hashing" , which refers to the specific implementation of a hash chain (hash -> linked list) in which you simply stop searching after visiting some # of links. (amortize = minimize the damage from the worst case).

Another common form of approximate string searching is to use "cache tables" (that is, hash tables with overwrites). Many people use cache tables with a few "ways".

The problem with both these approaches is that the penalty is *unbounded*. The approximate match can be arbitrarily worse than the best match. That sucks.

What would be ideal is some kind of tuneable and boundable approximate string match. You want to set some amount of loss you can tolerate, and get more speedup for more loss.

(there are such data structures for spatial search, for example; there are nice approximate-nearest-neighbors and high-dimensional-kd-trees and things like that which let you set the amount of slop you tolerate, and you get more speedup for more slop. So far as I know there is nothing comparable for strings).

Anyhoo, the result is that algorithms with approximations can look very good in some tests, because they find 99% of the match length but do so much faster. But then on another test they suddenly fail to find even 50% of the match length.


09-24-11 | Suffix Tries 2

Say you have a suffix trie with path compression.

So, for example if you had "abxyz" , "abymn" and "abxyq" then you would have :


[ab]   (vertical link is a child)
|
[xy]-[ymn]  (horizontal link is a sibling)
|
z-q

only the first character is used for selecting between siblings, but then you may need to step multiple characters to get to the next branch point.

(BTW I just thought of an interesting alternative way to do suffix tries in a b-tree/judy kind of way. Make your node always have 256 slots. Instead of always matching the first character to find your child, match N. That way for sparse parts of the tree N will be large and you will have many levels of the tree in one 256-slot chunk. In dense parts of the tree N becomes small, down to 1, in which case you get a radix array). Anyhoo..

So there are substrings that don't correspond to any specific node. For example "abx" is between "ab" and "abxy" which have definite spots in the tree. If you want to add "abxr" you have to first break the "xy" and then add the new node.

Okay, this is all trivial and just tree management, but there's something interesting about it :

If you have a "follow" pointer and the length you want does not correspond to a specific node (ie it's one of those between lengths), then there can be no longer match possible.

So, you had a previous match of length "lastml". You step to the next position, you know the best match is at least >= lastml-1. You use a follow pointer to jump into the tree and find the node for the following suffix. You see that the node does not have length "lastml-1", but some other length. You are done! No more tree walking is needed, you know the best match length is simply lastml-1.

Why is this? Consider if there was a longer match possible. Let's say our string was "sabcdt..." at the last position we matched 5 ("sabcd"). So we now have "abcdt..." and know match is >= 4. We look up the follow node for "abcd" and find there is no length=4 node in the tree. That means that the only path in the tree had "dt" in it - there has been no character other than "t" after "d" or there would be a branching node there. But I know that I cannot match "t" because if I did then the previous match would have been longer. Therefore there is no longer match possible.

This turns out to be very common. I'm sure if I actually spent a month or so on suffix tries I would learn lots of useful properties (there are lots of papers on this topic).


09-24-11 | Suffix Tries 1

To make terminology clear I'm going to use "trie" to mean a tree in which as you descend the length of character match always gets longer, and "suffix trie" to indicate the special case where a trie is made from all suffixes *and* there are "follow" pointers (more on this later).

Just building a trie for LZ string searching is pretty easy. Using the linked-list method (which certainly has disadvantages), internal nodes only need a child & sibling pointer, and some bit of data. If you always descend one char at a time that data is just one char. If you want to do "path compression" (multi-char steps in a single link) you need some kind of pointer + length.

(it's actually much easier to write the code with path compression, since when you add a new string you only have to find the deepest match in the tree then add one node; with single char steps you may have to add many nodes).

So for a file of length N, internal nodes are something like 10 bytes, and you need at most N nodes. Leaves can be smaller or even implicit.
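A sketch of such a node with path compression (field names are mine) :

struct TrieNode
{
    TrieNode * child;           // subtree that extends this substring
    TrieNode * sibling;         // next alternative at this depth
    const unsigned char * edge; // pointer into the window for the multi-char edge
    int len;                    // number of chars on the edge into this node
};

(with 32-bit offsets in place of the pointers, and a single char instead of edge/len when you don't do path compression, you get to roughly the 10 bytes per node quoted above).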

With just a normal trie, you have a nice advantage for optimal parsing, which is that when you find the longest match, you also automatically walk past all shorter matches. At each node you could store the most recent position that that substring was seen, so you can find the lowest offset for each length of match for free. (this requires more storage in the nodes plus a lot more memory writes, but I think those memory writes are basically free since they are to nodes in cache anyway).

The Find and Insert operations are nearly identical so they of course should be done together.

A trie could be given a "lazy update". What you do is on Insert you just tack the nodes on somewhere low down in the tree. Then on Find, when you encounter nodes that have not been fully inserted you pick them up and carry them with you as you descend. Whenever you take a path that your baggage can't take, you leave that baggage behind. This could have advantages under certain usage patterns, but I haven't actually tried it.

But it's only when you get the "follow" pointers that a suffix trie really makes a huge difference.

A follow pointer is a pointer in the tree from any node (substring) to the location in the tree of the substring without the first character. That is, if you are at "banana" in the tree, the follow pointer should point at the location of "anana" in the tree.

When you're doing LZ compression and you find a match at pos P of length L, you know that at pos P+1 there must be a match of at least length L-1 , simply by using the same offset and matching one less character. (there could be a longer match, though). So, if you know the suffix node that was used to find the match of length L at pos P, then you can jump in directly to match of length L-1 at the next position.

This is huge. Consider for example the fully degenerate case, a file of length N of all the same character. (yes obviously there are special case solutions to the fully degenerate case, but that doesn't fix the problem, it just makes it more complex to create the problem). A naive string matcher is actually O(N^3) !!

For each position in the file (*N)
Consider all potential matches (*N)
Compare all the characters in that potential match (*N)

A normal trie makes this O(N^2) , because comparing the characters in the string is combined with finding all potential matches, so the tree descent + string compares combined are just O(N) per position.

But a true suffix trie with follow pointers is only O(N) for the whole parse. Somewhere early on would find a match of length O(N) and then each subsequent one just finds a match of L-1 in O(1) time using the follow pointer. (the O(N) whole parse only works if you are just finding the longest length at each position; if you are doing the optimal parse where you find the lowest offset for each length it's O(N^2))

Unfortunately, it seems that when you introduce the follow pointer this is when the code for the suffix trie gets rather tricky. It goes from 50 lines of code to 500 lines of code, and it's hard to do without introducing parent pointers and lots more tree maintenance. It also makes it way harder to do a sliding window.


09-23-11 | Morphing Matching Chain

"MMC" is a lazy-update suffix tree.

mmc - Morphing Match Chain - Google Project Hosting
Fast Data Compression MMC - Morphing Match Chain
Fast Data Compression BST Binary Search Tree
encode.ru : A new match searching structure
Ultra-fast LZ

(I'm playing a bit loose with the term "suffix tree" as most people do; in fact a suffix tree is a very special construction that uses the all-suffixes property and internal pointers to have O(N) construction time; really what I'm talking about is a radix string tree or patricia type tree). (also I guess these trees are tries)

Some background first. You want to match strings for LZ compression. Say you decide to use a suffix tree. At each level of the tree, you have already matched L characters of the search string; you just look up your next character and descend that part of the tree that has that character as a prefix. eg. to look up string str, if you've already descended to level L, you find the child for character str[L] (if it exists) and descend into that part of the tree. One way to implement this is to use a linked list for all the characters that have been seen at a given level (and thus point to children at level +1).

So your nodes have two links :


child = subtree that matches at least L+1 characters
sibling = more nodes at current level (match L characters)

the tree for "bar","band",bang" looks like :

b
|  (child links are vertical)
a
|
r-n  (sibling links are horizontal)
| |
* d-g
  | |
  * *

where * means leaf or end of string (and is omitted in practice).
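
To make the descent concrete, here's a minimal sketch of walking one level of that child/sibling structure. This is my own illustration rather than anybody's real code; Node, buf (the window contents) and the pos field are assumed names.

struct Node { struct Node * sibling; struct Node * child; int pos; };

// we have already matched L characters of "str"; walk the sibling list at
// this level looking for the node whose character at offset L equals str[L],
// and return its child list (the level that matches L+1 characters)
struct Node * DescendOneLevel(struct Node * level, const unsigned char * buf,
                              const unsigned char * str, int L)
{
    for ( struct Node * n = level; n != NULL; n = n->sibling )
    {
        if ( buf[ n->pos + L ] == str[L] )
            return n->child;    // everything under here matches L+1 characters
    }
    return NULL;                // no string in the tree shares this prefix
}

Repeating this once per level is what folds "find all potential matches" and "compare all the characters" into a single descent.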

Okay, pretty simple. This structure is not used much in data compression because we generally want sliding windows, and removal of strings as they fall out of the sliding window is difficult.

(Larsson and others have shown that it is possible to do a true sliding suffix tree, but the complexity has prevented use in practice; this would be a nice project if someone wants to make an actual fast implementation of the sliding suffix trie)

Now let's look at the standard way you do a hash table for string matching in the LZ sliding window case.

The standard thing is to use a fixed size hash to a linked list of all strings that share that hash. The linked list can just be an array of positions where that hash value last occured. So :


pos = hashTable[h] contains the position where h last occured
chain[pos] contains the last position before pos where that same hash h occurred

the nice thing about this is that chain[] can just be an array of the size of the sliding window, and you modulo the lookup into it. In particular :

//search :
h = hash desired string
next = hashTable[h];
while ( next > cur - window_size )
{
  // check match len of next vs cur
  next = chain[next & (window_size-1) ];
}

note that the links can point outside the sliding window (eg. either hashTable[] or chain[] may contain values that go outside the window), but we detect those and know our walk is done. (the key aspect here is that the links are sorted by position, so that when a link goes out of the window we are done with the walk; this means that you can't do anything like MTF on the list because it ruins the position sort order). Also note that there's no check for null needed because we can just initialize the hash table with a negative value so that null is just a position outside the window.
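
Putting those pieces together, here's a hedged sketch of the whole search. hashTable[], chain[] and window_size are as described above; the hash function, the sizes, the buffer names and FindLongestMatch itself are just illustrative assumptions, not a tuned matcher.

enum { window_size = 1<<16, hash_size = 1<<15 };

static int hashTable[hash_size];   // last position where each hash occurred;
                                   // initialize to a large negative value so
                                   // "never seen" reads as outside the window
static int chain[window_size];     // previous position with the same hash

static int Hash(const unsigned char * p)   // hypothetical 3-byte hash
{
    return ((p[0]<<10) ^ (p[1]<<5) ^ p[2]) & (hash_size - 1);
}

// find the longest match for the string at cur, walking the chain until a
// link falls out of the sliding window
static int FindLongestMatch(const unsigned char * buf, int cur, int bufEnd,
                            int * pMatchPos)
{
    int h = Hash(buf + cur);           // (a real matcher would guard the end of the buffer)
    int next = hashTable[h];
    int bestLen = 0;

    while ( next > cur - window_size ) // links are position-sorted; out of window = done
    {
        int len = 0;
        while ( cur + len < bufEnd && buf[next + len] == buf[cur + len] )
            len++;

        if ( len > bestLen ) { bestLen = len; *pMatchPos = next; }

        next = chain[ next & (window_size-1) ];
    }
    return bestLen;
}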

To add to the hash table when we slide the window we just tack onto the list :


// add string :
chain[ cur & (window_size)-1 ] = hashTable[h];
hashTable[h] = cur;

and there's the sort of magic bit - we also removed a node right there. We actually popped the node off the back of the sliding window. That was okay because it must have been the last node on its list, so we didn't corrupt any of our lists.

That's it for hash-chain review. It's really nice how simple the add/remove is, particularly for "Greedy" type LZ parsers where you do Insert much more often than you do Find. (there are two general classes of LZ parsers - "Optimal" which generally do a Find & Insert at every position, and "Greedy" which when they find a match, step ahead by the match len and only do Inserts).

So, can we get the advantages of hash chains and suffix trees?

Well, we need another idea, and that is "lazy updates". The idea is that we let our tree get out of sorts a bit, and then fix it the next time we visit it. This is a very general idea and can be applied to almost any tree type. I think the first time I encountered it was in the very cool old SurRender Umbra product, where they used lazy updates of their spatial tree structures. When objects moved or spawned they got put on a list on a node. When you descend the tree later on looking for things, if a node has child nodes you would take the list of objects on the node and push them to the children - but then you only descend to the child that you care about. This can save a lot of work under certain usage patterns; for example if objects are spawning off in some part of the tree that you don't visit, they just get put in a high up node and never pushed down to the leaves.

Anyhoo, so our suffix tree requires a node with two links. Like the hash table we will implement our links just as positions :

struct SuffixNode { int sibling; int child; };

like the hash table, our siblings will be in order of occurrence, so when we see a position that's out of the window we know we are done walking.

Now, instead of maintaining the suffix tree when we add a node, we're just going to tack the new node on the front of the list. We will then percolate the update through the next time we visit that part of the tree. So when you search the tree, you can first encounter some unmaintained nodes before you get to the maintained section.
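
A minimal sketch of that lazy insert, using the position-as-link SuffixNode above. nodes[] (one node per window position) and levelHead are my own assumed names; real MMC routes this through its hash-chain style tables rather than an explicit tree, but the idea is the same.

static struct SuffixNode nodes[window_size];   // one node per window position

// tack the string at cur onto the front of a level's sibling list and do no
// other maintenance; the caller records cur as the new head of that level
void LazyInsert(int cur, int levelHead)
{
    struct SuffixNode * n = &nodes[ cur & (window_size-1) ];
    n->sibling = levelHead;   // the old head becomes our sibling
    n->child   = -1;          // unmaintained : no child link yet
}

All the deferred cost lands on Find, which is exactly what you want for Greedy parsers that Insert far more often than they Find.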

For example, say we had "bar" and "band" in our tree, and we add "bang" at level 2 , we just stick it on the head and don't descend the tree to put it in the right place :


b
|  (child links are vertical)
a
|
NG-r-n  (sibling links are horizontal)
     |
     d

(caps indicates unmaintained portion)

now the next time we visit the "ba" part of the tree in a retrieval, we also do some maintenance. We remember the first time we see each character (using a [256] array), and if we see that same character again we know that it's because part of the tree was not maintained.

Say we come in looking for "bank". If we see a node with an "n" (that's a maintained n) we know we are done and we go to the child link - there can't be any more n's behind that node. If we see an "N" (no child link), we remember it but we have to keep walking siblings. We might see more "N"s and we are done if we see an "n". Then we update the links. We remove the "n" (of band) from the sibling link and connect it to the "N" instead :


b
|  (child links are vertical)
a
|
n-r
|   
g---d

And this is the essence of MMC (lazy update suffix trie = LUST).

A few more details are significant. Like the simple hash chain, we always add nodes to the front of the list. The lazy update also always adds nodes to the head - that is, the branch that points to more children is always at the most recent occurrence of that substring. eg. if you see "danger" then "dank" then "danish" you know that the "dan" node is either unmaintained, or points at the most recent occurrence of "dan" (the one in "danish"). What this means is that the simple node removal method of the hash chain works - when the window slides, we just let nodes fall out of the range that we consider valid and they drop off the end of the tree. We don't have to worry about those nodes being an internal node to the tree that we are removing, because they are always the last one on a list.

In practice the MMC incremental update becomes complex because you may be updating multiple levels of the tree at once as you scan. When you first see the "NG" you haven't seen the "n" yet and you don't want to scan ahead in the list right away, you want to process it when you see it; so you initially promote NG to a maintained node, but link it to a temporary invalid link that points back to the previous level. Then you keep walking the list and when you see the "n" you fix up that link to complete the maintenance.
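
As a very rough illustration of just the detection half of that (not the relinking, which is where the real complexity lives), a level walk using the [256] first-seen array might look like this. It's a simplification of MMC, not its code; nodes[], window_size, buf and str are the same assumed names as above.

// walk one level; remember the most recent occurrence of each next-character,
// spot unmaintained duplicates, and stop at a maintained node that matches
int WalkLevelWithMaintenance(const unsigned char * buf, const unsigned char * str,
                             int L, int head, int cur)
{
    int firstSeen[256];
    for ( int i = 0; i < 256; i++ ) firstSeen[i] = -1;

    for ( int pos = head; pos > cur - window_size;
              pos = nodes[ pos & (window_size-1) ].sibling )
    {
        int c = buf[ pos + L ];
        int maintained = ( nodes[ pos & (window_size-1) ].child >= 0 );

        if ( firstSeen[c] < 0 )
        {
            firstSeen[c] = pos;   // most recent occurrence of c at this level
        }
        else
        {
            // duplicate : the node at firstSeen[c] was unmaintained; this is
            // where real MMC promotes it, hangs this older node's subtree
            // under it, and unlinks the older node from the sibling list
        }

        if ( c == str[L] && maintained )
            return pos;           // done : there can't be more matches behind this
    }
    return -1;                    // walked out of the sliding window
}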

It does appear that MMC is a novel and interesting way of doing a suffix trie for a sliding window.


09-22-11 | Roku / Amazon Streaming

Not quite ready for prime time. Setup was real easy, that's good. But some things aren't quite right.

(tangential rant : why in the fuck do you bastards still make those fucking plugs with the DC box built onto the plug so that it blocks other outlets !? god dammit, you must know that it fucking sucks and you just don't give a shit. I have to run extension cords just to get the big plugs spaced out away from the power strip or UPS because they would all run into each other). (oh, and you fuckers can't decide if you want to put the protuberance on the side of the plugs or below the plugs, so no matter which way the power strip orients the plugs, there will be some fucking device that doesn't work with it)

(oh, and I decided not to get a Roku a few days ago because I don't want to buy more electronic shit for no good reason, but then I went on the PS3 to watch some Netflix and got the fucking "a system software update is needed" AGAIN; of course the fucking morons can't check that before you get into the Netflix app, so then you have to reboot back out to the dashboard, and they can't just fucking start the update for you then, you have to manually go fucking dig all over the massive convoluted menu to find it; the week before they logged me out of PSN to force me to agree to some new license agreement; WTF WTF have you got no concept of user experience? why are you morons all so bad at taking my money? the fucking plumber won't return my calls, fucking PS3 has chased me right off their console, wtf wtf).

1. The remote sucks balls. It's like a Fischer Price My First Remote. The buttons feel horrible, really stiff and clunky, and they're way too far apart from each other, so you can't just use one thumb and move it around without straining. The whole ergonomics of it are just awful. It has too much weight in the bottom which makes it take extra muscle to balance. The ok button should be in the middle of the arrows. It should be hourglass shaped. Fucking copy the Tivo remote god dammit.

2. Navigation on the Roku is super super slow. Hit "home" and wait, and wait, and wait, and wait. Okay, there it went. Of course it's better than going from PS3 Netflix to System Settings or something like that (which requires a reboot), and all these devices seem to be unreasonably slow (the god damn TiVo was always frustratingly slow, especially because it was only slow because it was wasting time loading animations, fucking give me a "plain text" option so I can get god damn fast menus!). (* addendum : it seems to have sped itself up; maybe it was doing some background task? dunno; it's still not super fast but it's tolerable; some players seem to be faster than others, Amazon seems to be a particularly slow one).

3. Amazon streaming is just unusable. There's no "queue" type of thing where I can select a subset of stuff I want to watch from my computer, and then choose one of that subset from the Roku. This just makes the whole thing complete shit, because I'm not going to browse through a thousand titles with the fucking arrow buttons on the roku. Mouse and keyboard are good tools for browsing, fucking arrow buttons are not acceptable. (there is a sort of "queue" for things you purchase, just not for the free streaming stuff).

Oh well. Maybe if Amazon really does buy Netflix it will all be fixed.

What I really want is a premium subscription service to torrents. I'd like to pay $5 a month or something, and for that I get to choose what movies and TV shows I'd like, and some kid in Russia finds the best torrent for each movie and TV show and feeds them out to the subscribers. Obviously I can use EZRSS or something right now, but it's just flakey enough that I have to manage it by hand a lot, and I would pay to not have to do that.


09-22-11 | Sports Car Tires

I fucking went through my rear tires already, in just about 1 year or about 8000 miles. It's somewhat common for modern 911's to wear out the inner edge of the rear tires very fast, because you run a lot of rear negative camber, they're heavy in the rear, and of course you tend to drive around like a maniac.

I of course knew that tires for this car would be a lot more expensive, but it's a bit more subtle than that. If you just look at tire prices you might think they are 2-4X more expensive. They aren't just 2X or 4X more expensive, they're actually something like 10X more expensive. Here's why :

1. Just the basic tire is something like 3X more expensive due to being a large/rare size. ($300 a tire instead of $100 a tire)

2. But you don't want to buy el cheapo tires like you did for your commuter car, you want some nice performance tires, right? So now we're talking 4-5X more expensive.

3. And you can't get those tires at Big O or Walmart or whatever, so you have to go to a specialty shop, so the install is more expensive.

4. But the biggest factor is that you go through them much much faster. For one thing, they're a poor-treadwear soft compound, but it's also just the driving style. You're literally "burning rubber" all the time, and if you like to go fool around and slide some drifts or donuts, that can burn through a set even faster.

5. Driving street tires on the track can also wreck them in one session, because they can't handle the heat cycles; you'll literally get melted rubber, usually on the outside edges if you're cornering hard, and it can just come off in chunks. (obviously if you're serious you have special track tires and you expect to go through them fast, but some people are under the misconception that they can take their street car to the track once in a while and it will be okay; well, yeah, it will probably be okay, but it will cost a lot more than you think)

The result is that tires are costing me almost $2000/year, which is rather more than I expected.

(basically all the same things could be said for brakes, though they don't wear quite so fast, and track days and donuts don't destroy them as instantly as they destroy tires (on some cars track days can destroy brakes because they get too hot and you can crack pads or even wreck calipers, but Porsches have pretty good brake cooling))


Anyhoo. I'm getting mildly annoyed with the car. My tires are shot and I can't get replacements in for a week cuz they're rare and have to be ordered (I'm sure I could find them at some shop around town if I wanted to make a million phone calls).

It would be nice to have a car that you could just find parts for anywhere. That you could break down in the middle of the hick middle of the country and find a mechanic who could fix it. I like having a car that's fun to drive but I don't like having a car that's a prima donna.

One of the advantages of the Lotus line is that you can take them to a Toyota mechanic. I wrote a post once about how small-car-maker engine production is super retarded, but I think I didn't post it.


09-21-11 | Four Myths

I believe that the modern white middle-upper class male is highly susceptible to these four myths.

1. "Stay out of it". Politics is a mess. It's frustrating. It's largely controlled by corporate spending and the outrageous emotional ideology of fringe crackpots who scream on talk radio. What can you do about it? It's better just to stay out of it. Sure, Fox News is telling insane lies all the time and shouldn't be allowed to mascarade as news, but what can you do about it? Sure, Corporations should not be counted as people under the 1st ammendment. It's just too frustrating and gives us a headache and distracts from our work. So let's be good consumers and go back to arguing about universal remotes and make some more products that people don't need.

2. "Happiness". One of the great mythical movements of the last twenty years or so has been a sort of spiritual glorification of the pursuit of happiness. Atheists who are increasingly disillusioned with the idea of doing something "significant" with their lives are grasping for a central concept to build their lives around, and what they have found is "just be happy".

I believe this is worth restating. Human beings need to believe that their life is for something; that there is a purpose, something to base your actions on day to day, that you're not just ticking off time until you die. Modern ultra-rational man finds it hard to believe in the purposes of older days; obviously religion is out, but even things like "write the great American novel" or "make a difference in the world" are hard for cynical modern man to build his life around, because he starts thinking "what's the point of that really?", all it does is make other people like you, maybe it helps other people but what's the point of helping other people really? This reductivist reasoning can destroy any "life purpose". So after several crises, modern man finds "happiness". I should just do what makes me happy.

Now, the "happiness" pursuers don't just go and do drugs and party or whatever; if you are aware of the happiness movement at all, it's somewhat sophisicated in the sense that it is looking for longer term deeper happiness, which might come from connection to your community, or building something with your hands, or traveling somewhere you are a bit afraid to go, etc. But the reason for it at the core is not doing something for the world, it is entirely selfish.

Because this modern happiness movement is somewhat more sophisticated than plain old gluttony and self-indulgence, it can be a bit hard to see, but in fact it is still exactly what the ruling elite want you to do. They want you to focus on your self, not on the world around you. They want you to avoid difficulty that might make you unhappy. They certainly don't want people dedicating their lives to changing the world or making it a better place.

3. "Identity Liberty". There have been substantial political gains in individual liberties in the last forty years; and there is more freedom and acceptance of "alternative lifestyles" and identities. While this is good, and I don't want to diminish the importance of greater rights for The Gays or whatever, it is really a side issue that has taken center stage. When you ask a liberal how you think the world has done in the last 40 years, they will inevitably bring this up as the major positive.

The thing is, the ruling elite really don't give a rats ass about "identity liberty". It's a distraction. It's a gladiator fight in the coliseum to keep the rabble occupied while they keep raping you. They don't give a shit about gays or abortion or any of that shit. They care about the structure of power. And while we are fighting about whether there should be a Native American monument at Little Big Horn they have been putting wall street bankers in power at Treasury, the SEC, Fannie, the Fed, etc. They have been giving corporations greater power than humans or countries via NAFTA, Citizens United, etc. There is no more check on executive power or journalistic oversight. The entire congressional law-making process has become a joke.

It's like we're squabbling over the 2 men from Australia and they've just locked up the 7 bonus armies from the Americas. The most important thing is the structure of power, because the capacity for liberty flows from that.

4. "Meritocracy".

bonus : 5. "Anti-unionism".

... I got bored of this post. I'm gonna go watch TV and drink beer. Rugby world cup is on, woot!


09-20-11 | Sensory Pollution

My TV has a red light to indicate that it's OFF. Everything has lights to indicate they're on, including the power strips and UPS and such. The alarm sensors have a mess of lights.

Everything beeps when you press buttons. Gas air blower and lawn mower. Truck backing up, beep beep beep. Car alarms going off. Fucking beeps and honks when cars lock and unlock.

Fucking car headlights are way too bright. Some annoying cars now have this sparkle flashy thing when they brake.

It's an assault. Literally, it beats on your brain with a cudgel of "look at me!". No, god dammit, I want to look at what I choose to pay attention to.


09-19-11 | Game Theory

When your opponent in poker is playing like an absolute moron, you don't think "god damn this guy", you think "how do I take advantage of it". If you don't want to be quite so callous, another way to say it is : you must accept the reality of the situation you are given, and then think how can you best act in that reality. You shouldn't stick to a plan (like playing straightforward tight poker) and think that the world should go along with your plan, just because you are doing things "right" (in the naive sense) doesn't mean the world is obligated to go along with it and let you win.

WA landlord-tenant law is absurdly pro-landlord. Response : don't rent, be a landlord. (CA and NY law is very pro-tenant, response is the opposite of course).

Landlords don't actually charge you enough for move-out deposit subtractions. I'm constantly pissed off by the fact that they charge me for bullshit that is totally inappropriate (like charging me for cleaning even after I've hired professional cleaners). The thing is, they might charge you the $150 cleaning fee, but they don't charge you $100 for the pain in the butt of hiring the cleaners and letting them in and out of the house. Response : don't clean your rentals, just pay the charge. (further response : don't agree to more than $500 or so security deposit)

Service men who work on plaster, fiberglass, or any of those other nasty toxic substances don't charge nearly enough for the trouble of it. They basically just charge normal low-skill labor rates, no extra fees for the life-shortening or discomfort. Response : never do this work yourself, never work with toxic substances, chemicals or fine particles, always hire someone else to do it.


Home maintenance is one of those unstable equilibria of implicit contract (like the "golden rule"). What I mean is, in home maintenance you have two options :

1. Fix things properly so that they last. or 2. Fix things just well enough so that they will probably be okay for 5-10 years.

I'm really talking about things that are hard to go back and change later, that are much cheaper to do well when you have the chance. Like you have a wall open, do you use high quality studs and put in extra wiring so that you won't need to open it again later, or do you just do the minimum for the moment? Or you have the foundation exposed, do you just fill a crack with vinyl crack filler or really properly fix the foundation for the future? Or you're doing framing, do you use dense high-quality treated wood that will resist rot for a long time, or the cheapest wood that passes code?

Let us assume that over 50+ years, the more robust choice (#1) will be much better, but over 1-20 years, the cheap out choice will be better.

For you personally, chances are that the cheap-out way (#2) will be +EV , because chances are you won't live in the same house super long. But for society as a whole, if everyone did the #1 choice and fixed things properly, we would all be better off. You wouldn't come in to a home and find deferred maintenance and crappy short-term patches. Your good quality work might not pay off for you (because you move out), but the next person would inherit it, and you would inherit the good work they had done.

The problem is that cheating on the social contract is always +EV for you personally, though it may be -EV for the group.

A good example in poker arises when many pros are at a table with a whale. The most +EV way for the pros to play is to all mostly avoid each other and go after the whale, but don't make it super obvious, and don't do things that annoy him and might chase him away. But the problem is that for any one pro, you can in the short term (local maxima) increase your EV by also going after the other pros and by really baiting the whale, for example isolation raising big any time the whale enters the pot, and re-raising other people's light isolations. The problem is once all the pros start doing this, they wind up shutting the whale out of a lot of pots and playing too many pots just with each other, and the net EV of the pros goes way down.


09-19-11 | Netflix Super-Self-Crapulation

Wow, great example of an "apology" that makes things so much worse. Paraphrase :

"We've listened to your complaints and decided that we don't give a shit, so we're going to continue in that vein even more! We will be going ahead with our corporate strategy to fuck you over; our long term plan to gradually phase out physical DVD's isn't going fast enough for our quarter-by-quarter stock growth expectations, oh and I'm going to do the massive-douche thing of pretending that fucking you over is somehow good for you".

Original :

I messed up. I owe you an explanation.

It is clear from the feedback over the past two months that many members felt we lacked respect and humility in the way we announced the separation of DVD and streaming and the price changes. That was certainly not our intent, and I offer my sincere apology. Let me explain what we are doing.

For the past five years, my greatest fear at Netflix has been that we wouldn't make the leap from success in DVDs to success in streaming. Most companies that are great at something – like AOL dialup or Borders bookstores – do not become great at new things people want (streaming for us). So we moved quickly into streaming, but I should have personally given you a full explanation of why we are splitting the services and thereby increasing prices. It wouldn’t have changed the price increase, but it would have been the right thing to do.

So here is what we are doing and why.

Many members love our DVD service, as I do, because nearly every movie ever made is published on DVD. DVD is a great option for those who want the huge and comprehensive selection of movies.

I also love our streaming service because it is integrated into my TV, and I can watch anytime I want. The benefits of our streaming service are really quite different from the benefits of DVD by mail. We need to focus on rapid improvement as streaming technology and the market evolves, without maintaining compatibility with our DVD by mail service.

So we realized that streaming and DVD by mail are really becoming two different businesses, with very different cost structures, that need to be marketed differently, and we need to let each grow and operate independently.

It’s hard to write this after over 10 years of mailing DVDs with pride, but we think it is necessary: In a few weeks, we will rename our DVD by mail service to “Qwikster”. We chose the name Qwikster because it refers to quick delivery. We will keep the name “Netflix” for streaming.

Qwikster will be the same website and DVD service that everyone is used to. It is just a new name, and DVD members will go to qwikster.com to access their DVD queues and choose movies. One improvement we will make at launch is to add a video games upgrade option, similar to our upgrade option for Blu-ray, for those who want to rent Wii, PS3 and Xbox 360 games. Members have been asking for video games for many years, but now that DVD by mail has its own team, we are finally getting it done. Other improvements will follow. A negative of the renaming and separation is that the Qwikster.com and Netflix.com websites will not be integrated.

There are no pricing changes (we’re done with that!). If you subscribe to both services you will have two entries on your credit card statement, one for Qwikster and one for Netflix. The total will be the same as your current charges. We will let you know in a few weeks when the Qwikster.com website is up and ready.

For me the Netflix red envelope has always been a source of joy. The new envelope is still that lovely red, but now it will have a Qwikster logo. I know that logo will grow on me over time, but still, it is hard. I imagine it will be similar for many of you.

I want to acknowledge and thank you for sticking with us, and to apologize again to those members, both current and former, who felt we treated them thoughtlessly.

Both the Qwikster and Netflix teams will work hard to regain your trust. We know it will not be overnight. Actions speak louder than words. But words help people to understand actions.

Respectfully yours,

-Reed Hastings, Co-Founder and CEO, Netflix

p.s. I have a slightly longer explanation along with a video posted on our blog, where you can also post comments.

Lesson for myself :

Never ever NEVER put any work into a web site. Do not post to forums. Do not write reviews. Do not keep lists of movies. If you do not own the content, they will fuck you. Be it censorship (Yelp, Amazon, CNET), using your work for advertising profit (everyone), deleting your work or shutting down the site, introducing new "features" or revising the site in a way that breaks it, or just otherwise fucking you.

I know this, but I get sucked into thinking "oh it'll be fine, just this one time". It's also one of those things where everyone else is doing it, so I start to think "hmm maybe it's fine, maybe I'm being unreasonable". Nope. Everyone else is wrong. Make your own correct decision, and that is do not give control of your personal content to anyone else.

ADDENDUM :

Somehow I completely missed the fact that Amazon Prime subscribers get free streaming, just found out today (that's why it's occasionally useful to talk to other human beings). WTF Amazon, way to go informing me. You do a great job of letting me know about the fucking Amazon Visa that I don't want, but not this. Wow. And Roku streams Amazon Prime. Goodbye Netflix!

ADDENDUM 2 :

I'm surprised that a lot of people don't get why this is such a massive fuckup. The greatest asset that Netflix has is a large user base that has invested personal time into the site, writing reviews, tracking what movies they've seen and what they want to see. It's a "Yelp" or "Myspace" for movies. (they've already massively fucked up on this by failing to develop the community features and such, but whatever).

When they split the pricing earlier this year, it caused a lot of us to switch to streaming only, at which time we discovered that with streaming only you couldn't even *see* the movies that weren't available for streaming. That's such a massive fuckup right there. I should still be able to mark what movies I want to watch and which I have seen already. Having users on your site storing their movie-watching preferences is what gives you value. It's what makes them committed to your site.

Now they're completely splitting the streaming and non-streaming into two sites, so that I no longer have one place to go and store my movie watching desires (and hopes and dreams).

ADDENDUM 3 : Netflix also silently deleted your "Saved" section of the Instant Queue recently. Hope you didn't have any data in there you wanted.

this comic is alright.

They're being so massively retarded that it has to be intentional. It makes me think this speculation might be true.


09-15-11 | DIY

I am so fucking sick of doing this shit, it's such a waste of time. I wish there were people you could hire who would do this shit for you, but it just doesn't seem to exist. I thought I found the ideal thing - Seward Park Repairs , right in my neighborhood, just call them up they subcontract out and take care of whatever. (the biggest problem with hiring people to do shit is the amount of time it takes to find them and call them up and then vet them and so on). But then I found a forum post by the owner of Seward Park Repairs where he says it's okay to vent a bathroom exhaust fan into the attic. WTF David, if that's the kind of shoddy ass work you do, I'm glad I found out first. Wow, how are you people all so epically incompetent.

I really want a grounds manager to just take care of this shit for me. Higgins, this door is sticking, have someone take care of it, of course sir. I guess I need to make more money, but that's not possible as long as I'm wasting all my time on fucking DIY bullshit!

There is this stupid dangerous machismo of "I could do that". Harrumph, I won't hire someone, I'm a man, I can do that myself. Of course other people will judge you and pressure you in this way as well; harrumph why'd you hire someone to put in that vent fan, can't you do that yourself? don't you have a penis?

Well, that's fucking stupid. Just because you *can* do something doesn't mean you should. Just because lots of stupid people think that it's admirable to do things yourself doesn't mean that it actually is.

What's admirable is making the right decision for yourself, regardless of what others think is right. This is obvious, but it's a very difficult way to actually live.

It would be so nice to be able to hire someone to take care of things and be able to trust that they will do a decent job. I don't even mean a builder, it could just be anyone off the street and they could subcontract everything. All they have to be able to do is basic research and decision making and phone calls. Unfortunately that takes a ton of intelligence and is very hard to find. It's hard to find even among highly skilled programmers.

A crucial aspect is the "what needs approval" question. You need an employee who has the sense to know what they should ask you about, and what they should just decide themselves and not bug you. Both ways of getting it wrong are bad; you don't want someone who just goes off and makes a bunch of decisions and you find out too late that you got the gold-plated Versace bathroom set ; but you also don't want someone who comes to you every five minutes with every question. The vast majority of programmers I've worked with fall towards one or the other side; it's very rare when you find someone who has that sense of judgement to know hey this is important I better ask about it, or hey I should just keep on trucking and make my own call on this.

I have a whole rant percolating about how shockingly hard it is to find what I consider "basic professionalism". You want to know why you're all out of work? When you get a business call, return it right away. When you have an appointment, make it, and if you can't then you call and tell them you'll be late *before* the appointment time. When you say that you will do something, you fucking get it done or you let them know very early on that you can't do it. If you are given a task that you don't know how to do right, you say so, you don't just try to fake it and fuck it up.


09-15-11 | Spray Park

Spray Park is one of those easily accessible and outrageously beautiful places that are a real bonus to living here. You hike through a bunch of typical northwest forest, up a decently hard hill, and then suddenly emerge into this sub-alpine meadow wonderland of little wildflowers and new growth, backed by the giant mountain. Further up you can get into the true alpine barren lands, which are sort of calming in their emptiness. Eventually you get up into big snow fields, where I've seen the crazy outdoors people skiing in the middle of summer.


09-11-11 | Shitty Product Design

I feel like a lot of "design" is making product worse. With lots of products there is a well known good way to make them, just fucking leave it alone, stop changing things for no good reason, you're fucking them up.

A good example is "stemless wine glasses". Uh, WTF, you moron, the stem is there for a reason. You just made it look trashy and made it much worse. Oh yes, I want finger prints on my wine glass. Yes, I love to warm up my wine when I hold it. Oh yes, I don't want to be able to hold it up to the light properly.

(Classic designs are not always right though. I really don't see the point to double hung sash windows. Why in the fuck do I need to be able to open the top part? When do you want the top sash open and the bottom closed that you couldn't just have the bottom open instead? In my experience with these fuckers, the only thing the top sash is good for is letting in massive air leaks, or falling down slightly and getting stuck and being incredibly hard to slide back up because it has no handles or anything.)

And now for a photo tour of horrible designs around my home :

Pushing a window open with my hand was much too difficult. It only took a second and was easily adjustable, but I had to lean over. I'd much rather crank on this fucking floppy handle for five minutes. The result is that sometimes I want the window open but I just can't be bothered to take the time to crank it a million times, so I just open another window that isn't on a "convenient" crank.

These kinds of faucets are horrible in many ways. The mode clicker at the end is always flakey (this one happens to work mostly), the fucking retractable head is totally unnecessary and just makes it wobbly and lower flow, but worst of all is the fucking joystick water control. It makes precisely setting the flow volume and temperature almost impossible. The variant of the joystick which pervades cheap hotel showers is the real pinnacle of shitty water control design. Fucking hot/cold knobs worked great. They're easy to adjust precisely, they hold their position and can't be easily bumped into scalding or freezing, they aren't fighting gravity so they don't slip. If you really want to change it, you could do flow & temperature knobs, but don't fucking abandon the two-knob design! Knobs are perfect! I think water control peaked when the two faucets for hot and cold got combined into one faucet, but you still had the two knobs, it's been down-hill ever since.

I've done this one before, but it was in my sight so let's do it again. The Melitta single cup dripper is such a clear case of taking a near-perfect product that does the task it is designed for, and just fucking it up for no reason. (the only thing I would change about the original single cup dripper is I'd make it a bit bigger, because I like to use an obscene unreasonable amount of grounds for one cup of coffee).

Product designer : Why would anyone want to grab a drawer handle from above? I know, let's seal off the top and make a big surface to get dirty!

Quick - find the Stop button. Too late, your burrito exploded. Thank god for the "vegetable" feature. I'm sure glad there's a "hold warm" and "light timer" feature. And WTF is "cook" ? I bet they could cram some more buttons on there and it would be even more deluxe.

You could do the stop button test again. Think about how much your hand has to move around the panel just to set it to bake at 350. There's just no fucking thought about usability. Touch pads like this in general are just horrible interface devices, and sadly are getting more and more common (see for example washer/dryer rant). Two knobs is the perfect interface for an oven. One for temperature and one for function (off/bake/broil) (physical radio buttons would also be okay for function).

WTF are you product designers thinking? Do you actually think you're doing good work? You're not. You're making shit worse. You should be ashamed. You should feel humiliated and miserable every day at work as you take good products and make them trendy or "modernize" them or make them slightly cheaper to mass produce.

There needs to be something like a hippocratic oath for product designers ; "First, don't make it worse".


09-11-11 | Walk to the Lake

In pictures :

Discovered this hole in my bathroom ceiling. The gray at the top is the bottom of the upstairs bathtub, so I assume this was a water overflow that soaked through, so they just cut out the rotten bits all the way through three layers of floor. So that was most of a day wasted fixing this, and then repainting the ceiling. The worst part was that the hole they cut was a ragged mess so I had to square it up, and the ceiling is old plaster and lath which is a pain in the ass to cut cleanly.

Leaving home now. Back yard is a nice place to sit. It's great to be able to walk down to the lake at the end of a long hot day of breathing plaster dust and paint fumes.

Fucking neighbor is running a drain pipe from his gutter onto my property. More annoying shit to deal with.

I feel like the blackberries have been especially good this year. Maybe it's just because I've moved to an area that has a lot more wild land than Cap Hill. In Seattle if you ignore a patch of land for a few minutes it becomes instantly covered in blackberries. There's a pretty good patch of blackberries on almost every block around here, so you can take a stroll and snack as you go. I love just the smell of them, they make the air sweet and rich. I love how they come ripe at different times based on sun exposure and microclimate, so that the ripe season lasts over a month, you just pick from different sides of the block.

One of the trees on my parking strip is growing into the power lines. In Seattle it's the home owner's responsibility to keep their trees out of the lines. Texas and CA are not like that, the city or power company does it. Seattle also has no street sweeping (except right down town). And there are very few street lights (home owners are suggested to have a bright porch light).

On the way down to the lake now; there's been this congregation of sail boats next to I-90 almost every day this summer.

My local swim spot. Nice and un-crowded. Unfortunately there's also a sewage pipe near here (there seems to be one at almost every swim spot, including the official ones with life guards, I'm not quite sure why that is, I guess there's a mutual correlation that swim spots and sewer lines both tend to be on large patches of public land). ( see here for map )


09-11-11 | Consumer Choice

Capitalism just doesn't work.

When I went in to buy a washer+dryer a while ago, I told the guy exactly what I wanted : modern efficiency and quiet and good function, but no fancy features, no computers, just a physical knob and hard switches. Nope, sorry, that doesn't exist. Well, fuck, okay, can I try out the ones you have to see which has the least annoying fucking stupid computer? Nope, they're not plugged in. Well, fuck.

We went to buy a little inflatable boat the other day to paddle around the lake. Going in I thought - the main thing I'd like is for it to have a normal Schrader valve (like a car tire or old American bicycle) so that I can use the nice pumps I have instead of the shitty flimsy plastic pumps that they give you. Nope, not one.

How am I supposed to pick the product I want and steer the market when there's not one good choice?

I decided to just bite the bullet and pay Netflix $8 a month just to let me record movies that I'm interested in watching on instant some day.

Of course much worse is that Comcast is fucking me over and by government-regulated monopoly I have no recourse at all. They get to punch me in the face and I have to say "thank you sir can I have another? please don't take my internets away".

In other stupid product news, I've been constantly annoyed that my fucking TV insists on showing "Air/Cable" as an input option when I have nothing plugged in there. I have to toggle inputs between my computer & PS3, and sometimes I accidentally stop on "Air/Cable" , at which point I'm beaten about the ears with brutal static. Fucking god damn it. First of all, you know there's no signal, it fucking says "no signal" right there, maybe you could show me a silent black screen instead of audible static, hmm? okay? Second of all, when there's no fucking signal, maybe you could just disable that input option the same way you disable the other inputs when nothing is plugged in them hmm? Actually the worst case is when the PC and PS3 are both off and I turn on the TV, then it insists on showing me static and I can't change the input source at all. Anyway, the conclusion is that I'm buying an antenna just so I get a signal instead of static. Fucking hell.


09-10-11 | You Fuckers

If you have set your blog to not deliver a full copy of the post by RSS, I'm unsubscribing. You can't bully me into clicking through. Provide it in a way that's nice to your reader.

Fucking cars that beep their horn when they lock & unlock is my latest nemesis. A quiet little chirp from a special-purpose speaker is sort of okay, though really it should come from the *key* not the *car* and it should be much quieter, the purpose is for the owner with the key to hear it, not the whole world (and you know, really if the key just had a separate lock & unlock button (not a state toggle) and flashed the lights, then it could avoid making any sound at all, which would be preferable). But many car makers have fucking cheaped out in the shittiest pettiest way. They thought hey, we already have a noise maker for the horn, we don't need to add another one to do the lock/unlock chirp, we'll just beep the horn and save a dollar per car. WTF, not okay. Try sitting outside a coffee shop in a strip mall, it's a fucking chorus of honking horns as people get in and out of their horrible cars, BEEP, BEEP BEEP, BEEP, ack, WTF. What if you drove in near where someone was sitting and just honked your horn for no reason? Do you think that would be okay? No, of course not, it would be a huge dick move, fucking honking your horn in a parking lot for no reason, but that is exactly what you're doing. Fucking hell.

Car alarms are fucking infuriating because I shouldn't have to be writing this. We all heard the comedians in the 80's (the guys who did "toilet seat up" and "what's the deal with airline food" and "I can't program my VCR") who did the jokes about "what's the point of a car alarm? it's just to annoy your neighbors; you're not actually running outside to see a crook when your alarm goes off? am I right people?" and we all laughed and thought "ha ha he's right car alarms are pointless" - BUT THEN YOU KEPT BUYING THEM! Why !? You just laughed and recognized they serve no purpose but to annoy, so why do you keep buying them? Jesus christ.

(The worst case actually is riding the fucking ferries up here, where the vibration sets off lots of people's alarms, so you get a massive racket in the ferry, and then you get the loudspeaker doing "will the owner of a blue BMW please turn off their alarm", several times per trip)

Jet skis on lake washington are getting annoying. It's not that jet skis are inherently evil, it's that they attract massive douche bags. Jet ski owners (like Harley douchebags) seem to think it's cool to run them un-muffled making way too much noise. And of course they won't just go out to the middle of the lake, they have to come buzz the shore to show off, oo look at you on your douche-mobile, speeding around way too close to kayaks and sail boats, buy a fucking muffler and get away from non-motorized vehicles you fucking dick.

Sometimes when I'm sitting by the lake I imagine what it would be like if there were no motor boats at all on the lake, only sail boats. Delightfully peaceful and picturesque. What if there was a green parkway all around the lake. What if there were cafes with outdoor tables and chairs. What if Seattle didn't dump its sewage in the lake? I feel like Seattle is one of the more naturally beautiful urban settings in the world, but we sort of waste it.


09-06-11 | Connecting Wires

This info was a bit hard to find, so summary :

Basic connection of "solid" (typically 14 gauge) to solid wire (by solid I mean single core) : strip a good inch of each. Lay metal bits side by side, grab the tips with pliers and twist clockwise (same direction you will screw on the wire nut). The twisted wires should appear to be like a threaded screw in the direction the nut will tighten on. If after twisting the ends are not neat, don't try to crimp with pliers, instead snip off the forked bit down to a neat nub with wire cutter. Screw on wire nut to hand tight. Wrap electrical tape clockwise (tightening the wire nut, not loosening) around the nut and then around the wires.

Connecting "stranded" to stranded (stranded = many small cores making up the wire). Strip about an inch or slightly less. Lay both wires side by side and fan them out flat (undo any twisted). Mush they fanned out wires together to make like one big stranded wire. It may help to tape the insulated part together just to hold the wires in place as you do this. With your fingers gently twist the big stranded wire clockwise. This is just so it forms a neat tip, not to create threads. Screw on wire nut & tape.

Connecting stranded to solid : this is by far the weakest of the three connections, and ideally you would use solder and/or crimp connection or perhaps a screw terminal block. Another option is to solder the end of the stranded wire to make it effectively a solid end, then wire nut it. But assuming you don't want to do any of that, you do this : strip stranded wire to 1.25 inches, solid to 1 inch. Twist stranded wire by itself to give it a neat solid end. Lay the two wires side by side with the stranded extended slightly past the solid, 1/8" or so. Tape the insulation of the two wires together just to hold them together. Do not twist the wires with each other. Screw on wire nut, then wrap in tape.

In all cases you can test the connection by giving a little tug, there should be no feeling of looseness (and if the little tug was enough to wreck it, it was no good). Wire nut connections only work great between wires of roughly the same gauge.


09-03-11 | Bullshit

WTF is up with "flat feet" ? It's very strange that back in the early 20th century, when half the world went to war, and every able bodied young man was sent out to die, you could avoid all that just by having a low arch. It seems like a scam, like maybe Bush the First had flat feet and it was a way to get out of WWI, or perhaps it was some voodoo animist belief that the flat-footed were bad luck, I dunno, it's very strange. Could I get out of war service because of a deviated septum? it does make it hard to breathe. What about bunions? WTF, why flat feet?

Nails are fucking bullshit. When you build something with nails you're basically saying "I expect this to pull apart in 1-5 years". Screws are slightly better, but still just a friction bond that can easily wiggle free. If you actually are building shit to stay together, you use bolts or proper wood fittings (dovetails or dowels or such like).

Of course nails do have their place - as temporary non-load-bearing tacks to hold together a structure so that the major pieces can bear the load. When a house is framed, for example, nails are used to hold the boards together, but the nails are not expected to bear the loads, they just put the boards in the right place and then the loads are tensile and compressive through the boards. It's similar to the role of rivets in an iron bridge - they should not be load bearing, they just ensure that two I-beams meet up correctly, and the load is all in the beams.

Anyway, the problem with nails is that people don't understand them (and/or get lazy) and use them incorrectly. This goes not just for DIY'ers but also contractors and expensive home builders, who wind up doing things like building railings and fences that are held together by nails such that when you lean on them the force acts to push the nail out.

Shitty nail-built bullshit.

All the shopping at Lowes / Home Depot / etc. really has illuminated my understanding of the depressing shitty quality of modern American construction. All your contractors shop at these places, when you go to your average shitty tract home, most of the home is this stuff. There just isn't a single good quality piece in the place. It's all asian-made super cheapo crap. The light switch covers are bendy plastic. The yard tables are wobbly thin metal. All the pieces are just shitty. Even if you're a contractor who wants to do better work - where do you get your supplies? There is no better choice. It's not like there's the cheap shit and then some more expensive actually well made choices, no, it's just not there at all, there is no well-made choice, you only get shit.

Drills are fucking bullshit. You take this long proboscis and of course you can't possibly line it up straight. WTF , there should be a flat metal plate around the bit which is attached by pistons to the drill body, so that you can set the plate flush and get a good perpendicular hole.

But maybe it doesn't matter because wood is fucking bullshit. You might think if you buy a 2x4 you get a piece of wood which is 2x4. Nope. It's actually 1.5 x 3.5, because they measure the size before planing. But it's worse than that, if you buy a bunch of 2x4's, they will all be slightly different sizes, so if you try to make a flat tray from them or something it will be all uneven. But even worse than the variation in sizes, they will be warped and twisted and all out of whack in ways that make clean building seem absolutely impossible to me. I suppose higher grades of wood are probably milled more uniformly.


09-02-11 | Old Wiring

I'm replacing a bunch of our 2 prong outlets with 3 prongs. For my computer, I'm going to try to actually ground it properly, but for the rest I'm just leaving the ground attached to nothing.

All the electrician manuals say if the "receptacle" (that's the fancy name for "holes") is not actually grounded then you must use a two prong outlet, so that the user doesn't think it's grounded when it's not. Oh noes! You lied to me about grounding! What the hell am I going to do if I have a 3 prong dealy to plug in and only 2 prong outlets, I'm not going to say "oh well, I guess I can't use this because I don't actually have a ground", I'm just going to use one of those little adapter dealies. (*)

Oh, and that little tab on the adapter where you can run your own ground is a huge lol. Yeah right, adapter, like I'm going to plug you into the wall and then run a wire outside the wall out my window and hammer in a 6 foot iron spike so I can be properly grounded.

I mean, when the fuck do you need grounding anyway? Our electric service is not sending massive lightning surges into the house at random intervals.

The worst thing about the two prong holes is just that plugs don't stay in them. The grounding prong is really just to hold your fucking plug in the holes. The only time I've ever seen scary sparks from receptacles is because of the two prong dealies not making good contact and pulling halfway out and bending the prongs and so on.

Our house has a mix of old knob & tube wiring and newer stuff. Any electrician who works to code is not allowed to touch the old stuff, their only allowed action is to replace it. So I either have to do the work myself, or hire someone who will work under the table.

It turns out that hiring people to work under the table is not actually hard at all. So far I have yet to encounter a single contractor who insists on working to code; in fact they all say something like "I could do this to code but it will cost you 25% more". Okie doke. (I imagine some national chain guys would insist on doing it by the book).

* = you see the same sort of daft behavior from library writers all the time. They think they are being "rigorous" and "safe" by not providing a function to the user which is maybe a bit dangerous to use, or maybe doesn't do exactly what you would hope it does, but it's retarded. What do you think the user is going to do? Not write code in a way that needs that function? Pfft, of course not, they will just write their own version of the facility you failed to provide, but their version will be *much worse*, so by being "safe" you have actually made the final product worse.

On a semi-related note, I just got a washer & dryer, and of course the delivery guys won't hook up the gas. It's funny/ironic that because of liability fear, they won't help you with the bit that's actually dangerous and could use the touch of someone with experience (the main issue is knowing how tight to tighten the fittings).

ADDENDUM : well I tried to ground the computer outlets, but it was way too much trouble. I would need a flex drill extension, which is not that hard, but then I would have to worry about what exactly it is that I'm drilling through inside the wall that I can't see (maybe some dental mirrors would make that safer), then after that there's a hole you have to try to thread a wire through it from outside the wall. No thank you. Grounding is over-rated.

Some A-hole at some point ran new romex into the walls, but didn't hook up the ground line of the romex to anything, and then installed two-prong receptacles. That's legal code and all, but it's fucking shitty. It means I can put in a three-prong receptacle and hook up to the romex but I have no ground. And I don't know where the other end of that line is to get to it and try to attach it to something. All the lines go through the basement so it would have been easy to just run it over to a water pipe. They put in a new duplex box so they had the wall open and could have done it but didn't. Fuckers.


08-30-11 | cb's guide to using the Comcast self-install

Comcast now offers self-install for cable modem, which is awesome. Unfortunately, their suggested process is not awesome. My guide :

1. Call in (or use the online live chat) to sign up for your new service. They will tell you to go to your local comcast store and buy a self-install kit. Don't. They will tell you to call in to tech support to speak to someone when you do the self-install. Don't.

2. Just plug in your cable modem, router, and computer.

3. Open a web browser. Any connection you try to make should automatically be showing the Comcast self-install page.

4. Click through various obvious prompts until it says "Activating... (please wait 10 minutes)". Wait for a long time. For me, this page never moved on, so if that happens to you proceed to the next step :

5. Close the browser and re-open it. You should get a prompt about "resume activation in progress". Select "create a new login" and fill in the blanks.

6. You will now get a page that says "continue to download the Comcast desktop package" ; there's no option to not do it. Go ahead and click continue and you will get a download popup window with the "Run, Download, Cancel" options. Hit Download or Cancel (not Run!!), and then just don't install it. This software is nasty malware that tries to install a toolbar and change your home page and all that shit.

7. Close the browser and open it again - you should now be live on the net.


08-30-11 | Big Pile of Junk

I dunno why I thought a home you buy would be clean; of course not, it's not required in the contract so only a sucker like me would do it (cleaning).

This is what we (or rather, our employees) pulled out of the house :

There was some crazy junk up in the attic and garage from the 40's / 50's. Lots of old metal bits that looked like turbine parts or something. I should've taken more photos but I was in a crazed rush with all the shit I was trying to get done before move-in.

Another thing that's obvious that I didn't realize is that between the inspection and key exchange, many sellers badly neglect the house. No yard maintenance, it might just sit vacant and dirty; sometimes they even actually trash it. It seems to me that a last minute walk-through before signing the closing papers would be a good idea, but nobody does that.

Anyway, paint is up, house is clean, lots of little shit left to do still, but I think the crazed part is over. I intentionally bought a house in very good shape to avoid living in a "fixer upper", but there's still just so much little shit to do.

Of course a lot of the problem is that I find myself succumbing to "home improver's disease". I find my eyes just scanning the room, and when they alight on something that isn't right the thought "I should fix that" pops into my head. All that little shit that you would just ignore in a rental like "those cabinet pulls are really ugly" or "that toilet paper roll holder is kind of broken"; my eyes won't just scan over it and keep moving, they get stuck, like they have friction and just catch on it and my brain mentally sticks it on the todo list. This is a real disease that I have to resist.

A home is a bit like game development in that the todo list is essentially infinite. For games I've always liked to categorize todos into three groups : 1. must do , 2. would really like, and 3. wish list. You work on tasks in strict priority order generally, first all #1's , then only do #2's when there are no #1's. When new tasks come up you do an initial categorization, but you have to be flexible and move them up and down the list over time as things change (mainly you move them down the list as the #1's get too numerous).

The reality of game development is that you basically never get to any of the #3's, so in fact #3 is just a way of writing down something that you won't do (though it could move up the list under later scrutiny). And in fact you won't do most of the #2's either.


08-23-11 | The Locker Room

The locker room is the most terrifying place for the awkward male. I'm not talking about people who are scared of showing their penis, that's a very ordinary and easy and boring fear. I'm talking about the social awkwardness of it.

The big problem is the two archetypes of locker room monster :

There's the "never nudes". These guys shower in their bathing suit, then when they have to change, they huddle into a little ball in a corner and quickly slip one thing off and another on (or use a towel to cover up even then). They're tense and nerdy and you don't want to be like them.

Then there's the way-too-comfortable guys that walk around totally naked for way longer than is necessary to do their business of changing. The worst of these is the old closet gay guys. They're usually 60+ , have a giant white bush that thank god has overgrown whatever cock they once had, so that they now just have an androgynous mound of hair (and yes, I have looked at old man bush, you can't really avoid it when they come up to you and talk to you when you're sitting down and they're standing completely naked; okay old man, you got me to look at your bush, now go away).

So you have to try to straddle the line between too much showing and too little showing, which creates the painful awkwardness. The ideal man should not be afraid of anyone seeing his naked body, but he also shouldn't inflict it on those who don't want to see it, nor should he make himself the eye candy of perverted old locker room prowlers.

Shower? Definitely bathing suit off. On is way too anal. But mostly facing the shower head. Occasionally turn around to face the room just to show that it's no big deal. Only the absolute minimum of penis cleaning. But don't avoid touching it altogether either.

The walk to the locker is a difficult moment. To wrap the towel or not? Less than twenty feet away - no wrap. Twenty feet or more - wrap.

Dressing - reasonably quick to get the underwear on, but not a frantic dash. Don't stand in a corner, but also don't face the room.

It's a constant tightrope act to behave correctly; I'm worried the whole time that I'm showing too much or hiding too much.


08-23-11 | Painting

People who paint their house themselves are making a mistake. When you count the materials, the tools/equipment, all the time to buy things and learn how to do it (*), and then do it, it's clearly a huge net loss even if you make almost $0 per hour (by almost $0 I mean less than $50). But most of all, you do a shitty job. (* = most people don't actually spend much time learning how to do it, they think they can just slap some paint up however). When shopping for houses, I would say a good 90% had some DIY painting and it was almost always epically shitty. Streaky with obvious brush marks, too thin with the previous color showing through, not primed right and peeling, or perhaps most commonly, not prepped right, so they just painted right over nails and tape and bad splotchy patch jobs without prepping the walls at all. When you depreciate the value for how shitty the job is, the DIY paint job is worth about as much as a kick in the nuts.

In general, the whole Home Depot / DIY movement is a real fucking tragedy. Not just for the peace and quiet of neighborhoods, nor just for the quality of dinner party conversation, but most of all for the innocent houses which have your shitty amateur work inflicted on them.

I've been using day labor painters, but I don't particularly recommend it. I think it was good for me, because I got a lot of really nasty clean-up work done as part of the paint prep (which normal pro painters would refuse to do), but if it was just for the painting, not so great. The main problem is that you have to be there to supervise all the time, which is a massive time cost.

In general, I hate dealing with anyone who is paid hourly. I like to pay by the job, and if you take longer, or if you fuck it up and have to redo part of it, then you should get paid *less* for inconveniencing me, not more!

Anyway, some random painting tips to self even though I swear not to do this ever again :

1. Buy way more of everything than you think you need. Way more. I mean *way* more. You think you need 2 rolls of blue tape, buy 20. If you don't use them, you can just return them. If you don't you will wind up having to run to the store to buy more of something, which is a huge time waster. Buy lots of brushes, tools, equipment, all kinds of shit that you think you probably won't need, just to have it on hand in case you do.

2. Cover all the floors. I know you think you are saving time by only covering the floors near the walls you are painting, but you will get paint in places you don't expect, and then spend way more time cleaning it up than if you just covered all the floors.

3. When buying more of a certain paint, make sure you check the code # on the can. Don't just buy more "Brand X latex white" , because there might actually be 4 variants of that which are not clearly labelled as being different. Every paint can has a code to uniquely identify it.

4. IMO avoid oil-based paints and primers for interiors. The thickness/durability benefits are not really worth the cleanup pain vs. modern good water/latex based paints. (painting a boat or some such shit might be a different story)

5. Foam brushes are good for tiny touch-up jobs. For painting edges of walls, or anything where you are applying a decent amount of paint, a good quality bristle brush is the way to go.

In general the mistakes are so predictable and obvious and yet you will almost certainly make them : eg. time saved by doing less prep costs you more time in the end, money spent on cheap equipment is lost back in wasted time, etc.

Also in general, I think it's very fucked up that so many home owners take all the time to learn to do this shit and DIY it; WTF; isn't this what civilization and capitalism is for? specialization for increased efficiency? where did it go wrong?


08-22-11 | Dicks are Rewarded

Sitting out in the lovely warm evening at a restaurant the other day, I beheld this scene :

The outdoor tables are somewhat limited and obviously desirable. A few people are sort of milling around the hostess in a disorganized line when a table becomes available. One of the guys just walks out of the line and sits down at the table (you're supposed to wait to be seated). His girlfriend doesn't follow because she knows it's wrong and he says "come on, just sit down, we'll tell the hostess we took it".

The point of this story is not what happens next (spoiler : they get the table and the worst consequence to them is some lifted eyebrows). It's that the *worst* possible outcome would be if someone said something and they had to go back in line like everyone else. That makes it only +EV, there's no -EV line. The waiters and hostess are generally too accommodating to say anything, everyone else in line is too much of a pussy, so even the zero-EV line rarely occurs for the violator.

Once in a rare while in these kind of situations I will say something; I was in sort of a vague line at a grocery store the other day and someone cut in front of me and I was like "umm, there is a line" and they got in - but that's just the outcome that they would have gotten if they were not a dick!! it's no penalty, there is only reward for being a dick.

I guess taking 20 items to the 8 item or less line is the simplest example; I have witnessed it many times and never once have I seen any checker or other patron say anything, and even if they did it would just mean waiting in line like a normal person. To really make things right you need to take his groceries and smash them in his face. People would call you "psycho" or something but in reality you're just trying to ensure the probability-weighted cost-benefit of being a dick is negative.


Not exactly the same, but related, I'm very annoyed/jealous by/of people who can be dicks without even being aware of it. They take so much benefit from the world because dicks always win (or at least break even).

The other day I was sitting in the park enjoying another lovely warm evening when some guys started playing football. They were outrageously bad at it, like just throwing long bombs with absolutely no control, running all over the place right through other people. Everyone was giving them the stink eye, but they were just laughing, loving it.

You bastards. When I play sports in the park I am hyper aware of all the people who are annoyed by it, and I'm super careful to carve out a little patch where it's highly unlikely that an errant throw will impinge on someone's peaceful park sitting. And even though I am super careful and considerate about that, I still get stink-eyes from dumb fuckers who decide to sit and read right in the middle of the play field, curse all of you people and your indiscriminate stink-eyes.


08-22-11 | Rambling

Usually the most annoying thing about dogs at the park is actually their owners. The dog is behaving just fine, but the owner is going on and on in a string of blather like "come here choo choo , sit choo choo, good boy, isn't he so cute? get your stick choo choo" , I want to walk up to the owner and go "sit human, quiet human, good human".

I'm generally anti-"experientialist" (that is, the school of thought that you can't have a valid opinion on something without experiencing it first hand) but I'm pretty sure that all the anti-immigration nut-jobs have never actually talked to an immigrant in a serious way.

I fucking hate call centers that automatically route me to a regional call center based on my phone's area code. They never warn you that they are doing so, or give you a chance to opt out of it. Or god forbid, look up my phone number's account info and fucking see my address. Oh no, I have to be like 10 minutes into describing my problem to the fucking wrong state guy and he's like "oh, you're in WA, let me transfer you..." no!!

Also, fuck you call center people who can see my phone number but ask me for it anyway. I've only discovered this because I like to give people my Google Voice number, but I'm calling from my cell phone, so the caller ID shows something different. They say "what's your phone number?" and I say "XXX" and they say "I see you're calling from YYY" and I'm like WTF, first of all, yeah, so what, I told you XXX just fucking write it down you monkey, don't question me, and second of all, you could've just started with, "can I use the number you're calling from?" that would save the average person a lot of time.

Also, if you are fucking confused by the fact that my phone area code is not a Seattle area code, then I have some "it's so hard to program a VCR" jokes that you might like.

I enjoyed the Stieg Larsson books, but I was disappointed by the cliff they went off after the first one. I thought the first was a good beach page-turner that was sort of remarkable for how simple it was, most of the book is about researching in files of paper, and it's very sort of compact and old fashioned, despite how he constantly says it's "not a locked room mystery", of course it is sort of. But then after the first book it becomes ridiculous super heroes and super villains and stupid action, still a good page turner, but without the quiet charm.

I enjoyed "A small death in Lisbon", but I was a bit disturbed by just how much relish the author seems to take in his characters more prurient behavior. It felt a bit like watching someone masturbate to scat porn. (hmm, not that ever watching someone masturbate is a good thing... my analogy is a bit off...)

"The Shadow Line" is like so fucking great right up until the big reveal at the end. I was so excited to see a cop show that's actually about a big conspiracy, some intrigue. God I am sick of the minute fucking clinical analysis of these boring ass petty crimes that are on all cop shows these days. Give me big government schemes, underworlds, shadowy figures, and skip the fucking CSI pseudo-scientific bullshit. I think the casting is also superb, particularly Rafe Spall as Jay Wratten just absolutely gives me the creeps; actually everyone in it is awesome except for the females who are pretty uniformly terrible. Anyhoo, it's all going great and then ... WTF ? This is what it's all about? Are you fucking kidding me? So disappointing.

BTW quick modern BBC cliche cheat sheet :

Black guy with white female.

Females who act super cold/professional/masculine/emotionless (but then occasionally turn to mush).

Random/inappropriate yelling in bureaucratic meetings.

I'm not sure if the yelling started with Waking the Dead, or maybe Gordon Ramsay? I dunno, but Brits really like to watch a man lose his temper. Grow a fucking sack, Britain. Trevor Eve (or Chiwetel Ejiofor, or Idris Elba, or whoever you have yelling at you on this hour's show) is not your daddy and he's not going to tell you to straighten up and fly right. Is this your idea of admirable behavior? Who thinks this is a good leader? And it's so outrageously unrealistic. The person receiving the yelling always just sort of sits and takes it with a look on their face that's either like "mmm, saucy" or "well I've really been straightened out".


08-21-11 | Offroading

As usual the mainstream press is completely moronic about actual offroading. They usually test offroaders on slippery grass or some nonsense and talk about the 4wd traction. Let me tell you what actually matters :

(* I should note up front that I am not talking about "severe" off-roading, like boulder-climbing or some such, which no production car is suited for, I'm talking about un-graded dirt/rock roads and such; I'm also assuming you're not doing something retarded like going off-roading when it's muddy, which is not so much foolish because you might get stuck, but ass-holish because it destroys roads and erodes hillsides)

1. Reliability. By far the most important thing is to have a vehicle that won't break down in the middle of nowhere, where cell phones won't work and AAA won't come. Above all else this is why the Toyota Truck is the greatest off-roader of all time (and the Honda Civic is the #2 off-roader (not really, but the point is that a reliable car with worse capabilities is much better than a very capable car that could crap out at any time)). Shit like old Land Rovers or International Scouts or what have you are actually terrible off roaders due to high probability of crapping out.

2. Comfort. This is one that I never see mentioned, but in fact the limiting factor on most bad roads is just your comfort. Most bad mountain roads are not impassable - in fact you could probably make it in a Honda Civic or some such, but when they are rutted, wash-board, rocky, pot-holed, it will be back-breaking hell and you will have to go 5 mph in a Civic. To realistically be able to go 50 miles on a bad back road, the most important thing is comfort, you need long travel suspension and a very soft ride. (BTW longer travel is almost always better with suspension). This is actually a bit hard to find these days as everything has gotten "sporty". I'm not sure what the great choices are for this, since the retarded car journalist corps doesn't correctly review offroaders for comfort.

3. Clearance. Far more important than 4wd or any such nonsense is clearance. For fording streams, getting over logs, big rocks, the things that will actually stop you - clearance is king. Cars with good 4wd but terrible clearance are not really great offroaders (yes, that means you, subaru).

Traction and power almost never come in to it for real-world offroading. When it does, it's better to have a Honda Civic with a good winch than a 4x4 without one.


08-18-11 | Rainier Watch

"Sunrise" is peaking right now; it looks like this from afar :

It's sunny, the snow is in patches now, the wild flowers are out in force, carpeting the hill sides in blankets of colors, the meadows are covered by the explosion of spring growth that pops out of the frozen earth.

Go right now, go Friday (the weekend up there is a nightmare), by Monday it may be too late. There are some really great scrambles in the Palisades area, best scrambles of my life, technical, fun, great views for rewards, email me if you want details.

(and no, the real mountain doesn't have "autopano" written on it)

A couple of "fuck yous" to the world :

Fuck you park service, who after hiking 4 miles out away from everyone, sticks all the camp sites right on top of each other. Yeah yeah I get why they do it so save your smart-ass-but-not-actually-smart comments.

Fuck you lazy fat tourists who can't be bothered to go more than half a mile from their cars, but do just love to go off trail and stomp around in the delicate alpine meadow with their fat ugly asses in their unnecessary trekking suits.

Fuck you REI for sticking your giant logo on everything so that I stare right at it as I go to sleep and it subliminally burns into my fucking brain. God dammit I want to see nature, not advertising, it's like fucking Times Square out in the woods these days with the brands on everything. And ..

Fuck you REI for making all your shit bright orange or whatever other heinous color. I guess it's intentional so it's easy to see, but it makes a couple of tents by a lake so much more of an eyesore than it needs to be. If your shit was just green and brown the woods would look so much better.

.. oh I can't really do the "fuck yous" justice right now. The world is a beautiful place out there in the wild.


08-15-11 | More finance

I've gotta get off this topic soon because it's a huge distraction and also just very depressing. The sad thing is that for most of us, the most +EV life move is just to close our eyes and say "la la la" and pretend we don't know about any of this.

It's funny that Geithner was sold to us as a government technocrat who wasn't a "financial insider" when nothing could be further from the truth. While it's true he has worked in government, he is deeply connected to Citi and the whole modern bubble. He rose under Summers and Rubin, who are responsible for Gramm–Leach–Bliley that deregulated finance (as well as lots of other deregulations during that time, such as keeping derivatives off book, removing SEC oversight of many institutions and lowering reserve requirements). Geithner was offered the CEO job of Citi but turned it down. Geithner's appointment to the NY Fed was backed by Citi (the NY Fed was amusingly appointed by the banks it was supposed to regulate, it's long been corrupt by design).

Woodward -- Behind the Boom
What’s behind the ICE arrests of 30 after an immigration raid in Ellensburg, WA (Courtesy of NNIR) « El Comite Pro-Reforma M
What's Obama Doing to Your Taxes - Political Hotsheet - CBS News
What Barack Obama Needs to Know About Tim Geithner, the AIG Fiasco and Citigroup The Big Picture
U.S. Says Rendition to Continue, but With More Oversight - NYTimes.com
The Worden Report Pay No Attention to the Banker behind the Curtain
The Worden Report An Institutional Conflict of Interest at the New York Federal Reserve
The Washington Monthly
The Fed And The Treasury Had A Funny Way Of Guilt-Tripping Sheila Bair
The Big Picture 2
The Big Picture 1
TaxVox » Blog Archive » Why Nobody Noticed Obama’s Tax Cuts
Steamy History of the American Economy during the Clinton Administration
Office of U.S. Trustee vs. Harmon - Witness Timothy Geither
Long-Term Capital Management - Wikipedia, the free encyclopedia
How Citigroup Unraveled Under Geithner’s Watch - ProPublica
Goldman Connection at NY Fed Major Conflict of Interest - Seeking Alpha
F.B.I. Giving Agents New Powers in Revised Manual - NYTimes.com
Deportation of illegal immigrants increases under Obama administration
Deficit and Spending Increase Under Obama - WSJ.com
Daily Kos Taxes lowest in 60 years, thanks to Democrats and Obama


08-14-11 | A note on convex hull simplification

I wrote this in email and thought it worth recording.

A while ago I wrote a post that was mainly about OBB algorithms, but with a little note about convex hull simplification.

It's a little unclear, so I clarified :

My algorithm is very simple and by no means optimal.

I construct a standard (exact) convex hull, then make a mesh from it. I then run a normal mesh simplifier (see for example Garland Heckbert Quadric Error Metrics) to simplify the CH as if it was a mesh. This can ruin inclusion. I then fix it by taking all the face planes of the simplified mesh and pushing them out past any vert in the original mesh.
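
In code terms that fix-up is just one max-projection pass per face plane of the simplified hull. A minimal sketch (not the actual Galaxy4 code; Vec3/Plane here are illustrative stand-ins, with a plane stored as (normal, d) and "inside" meaning Dot(n,p) <= d) :

#include <vector>
#include <algorithm>

struct Vec3 { float x, y, z; };
static float Dot(const Vec3 & a, const Vec3 & b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Plane { Vec3 n; float d; };  // half-space : points p with Dot(n,p) <= d are inside

// After simplifying the hull mesh, push each face plane out past every
// original vert, restoring the containment the simplifier may have broken :
void FixHullContainment(std::vector<Plane> & planes, const std::vector<Vec3> & origVerts)
{
    for ( Plane & pl : planes )
    {
        float maxD = pl.d;
        for ( const Vec3 & v : origVerts )
            maxD = std::max(maxD, Dot(pl.n, v));
        pl.d = maxD;  // plane now contains every original vert
    }
}

Note this can only push planes outward, so containment is guaranteed, but the final error can end up a bit larger than whatever the mesh simplifier thought it was making.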

Stan's (Melax - Convex Hull Simplification With Containment By Successive Plane Removal) way is similar but better. He uses a BSP engine to create the hull. First he finds a normal convex hull. Then he considers only the planes that make up that hull. The working hull is the volume that is on the "front" side of all planes. He then considers removing planes one by one. When you remove a plane, the cost to remove it is the volume that is added to the hull, which is the volume of the space that is on the back side of that plane but is on the front side of all other planes. You create a heap to do this so that the total cost to simplify is only N log N. This requires good BSP code which I don't have, which is why I used the mesh-simplifier approach.

An alternative in the literature is the "progressive hull" technique. This is basically using PM methods but directly considering the mesh as a hull during simplification instead of fixing it after the fact as I do. Probably a better way is to use a real epsilon-hull finder from the beginning rather than finding the exact hull and then simplifying.

My code is in Galaxy4 / gApp_HullTest which is available here ; You should be able to run "Galaxy4.exe hull" ; Hit the "m" key to see various visualizations ; give it a mesh argument if you have one (takes .x, .m, .smf etc.)

BTW to summarize : I don't really recommend my method. It happens to be easy to implement if you have a mesh simplifier lying around. Stan's method is also certainly not optimal but is easy to implement if you have good BSP code lying around (and is better than mine (I suspect)).

The technique I actually prefer is to just use k-dops. k-dops are the convex hull made from the touching planes in a fixed set of k directions. Maybe find the optimal OBB and use that as the axis frame for the k directions. Increase k until you are within the desired error tolerance (or k exceeds the number of faces in the exact hull).
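
The k-dop construction is trivial - one max projection per direction. Sketch only (same illustrative Vec3/Plane as the sketch above, not shipping code) :

#include <vector>
#include <algorithm>
#include <cfloat>

struct Vec3 { float x, y, z; };
static float Dot(const Vec3 & a, const Vec3 & b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Plane { Vec3 n; float d; };  // half-space : Dot(n,p) <= d

// dirs : the k fixed directions (eg. box axes + diagonals, ideally in the OBB frame)
std::vector<Plane> MakeKDop(const std::vector<Vec3> & verts, const std::vector<Vec3> & dirs)
{
    std::vector<Plane> planes;
    planes.reserve(dirs.size());
    for ( const Vec3 & n : dirs )
    {
        float maxD = -FLT_MAX;
        for ( const Vec3 & v : verts )
            maxD = std::max(maxD, Dot(n, v));
        planes.push_back( Plane { n, maxD } );  // touching plane in direction n
    }
    return planes;
}

To increase k you just add directions and rebuild; compare against the exact hull (extra volume, or distance from the k-dop planes to the hull) to decide when you're within the error tolerance.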

ASIDE : I have some BSP code but I hate it; I hate all floating point geometry code. I love integer geometry code. The problem with integers in BSP's is that clipping creates rational points. Maybe I'll write some BSP routines based on rational Vec3's. The only big problem is that the precision requirement goes up with each clip. So you either need arbitrary precision rationals or you have to truncate the precision at some maximum, and then handle the errors created by that (like the truncated point could move onto the back side of a plane that you said you were in front of). (this is better than the errors in floating points, because at least the truncated point is at a definite well defined location, floating points move around depending on how you look at them, those wiggly bastards) (I'm tempted to say that they're like quantum mechanics in that they change when you measure them, except that they really aren't at all, and that's the type of pseudo-scientific-mumbo-jumbo that pseudo-intellectual fucktards love and I so despise, so no, I won't say it).


08-12-11 | A review of Obama so far

It's amazing to think back to the Obama election and how ecstatic so many people were, and how sad the reality has turned out to be. I think Obama has got to be the most disappointing president of all time. Let's try to run through the score card and remind ourselves of what's happened.

1. Military Policy. No significant de-escalation of American involvement abroad. No significant cuts of spending on programs which are irrelevant to the modern world (such as large weapons programs). No merging of services to reduce waste. Continued extra-legal prosecution of war, such as assassinations of people in non-combat zones.

2. Immigration. One of those sad cynical things that came out of the GWB white house was using 9/11 as an excuse to crank up enforcement of illegal immigration under DHS ; obviously the real motive was always getting them dirty Mexicans out of our country, but it was politically difficult before 9/11 ; so the anti-Mexicans saw a great opportunity to sneak it in as "protecting our borders from terrorists". Unfortunately this has only gotten worse under Obama, including breaking the veil between INS (ICE) and law enforcement; historically law enforcement had an agreement to not enforce immigration because they wanted immigrants to be able to interact with the police on normal legal issues, but that is now gone.

3. Domestic spying. Only continues to get worse. The FBI was recently given new powers to snoop without warrants; ever since 9/11 the FBI has amped up its monitoring of clearly non-terrorist organizations, claiming that all sorts of groups like environmental activists are "domestic terrorists" (hint : corporate property damage is not "terrorism"), and under Obama that just got massively worse, we're basically back to Hoover levels of sneaky FBI that can now monitor anyone they want without probable cause. Obama continues going around FISA. Obama continues to use "state secrets" and national security defences to make unquestionable arrests. Abroad, "Extraordinary rendition" (aka passing the torture buck) continues.

4. Taxes. No indication that Obama will get realistic on this or close any loop holes or the sweetheart deals for various special interests (like financiers). The GWB tax cuts seem to be near permanent at this point, except maybe on the very richest bracket. Amusingly, the majority of Republicans believe that Obama has raised taxes (he hasn't - in fact he cut them for the vast majority of Americans).

5. Health Care. This was a small step in the right direction, but completely lacking the teeth to make it actually work (which requires government control of the outrageous costs of the private medical industry). Sadly "Obamacare" is really just another case of what Dems and Reps have been doing for many years now - creating government laws that guarantee massive profits to private industries and force you to use private services without controlling their costs. (and it looks like this trend will only get worse in the next N years as there is talk of creating publicly-guaranteed private-profit-takers out of the Mortgage Macs, for social security, etc.; this is just the most beloved form of modern "reform")

6. Accountability. Fools thought some of the lawsuits and investigations into the questionable acts of the GWB administration may have gotten some help or at least been allowed to proceed, but they were not. The opacity of the executive branch started by GWB continues and perhaps has only gotten worse.

7. Funding for basic services. This is one of the saddest things to me and something that often gets lost in the shuffle of all the other disasters. The sort of very basic things that I think everyone except real nut jobs agrees that government should be doing are getting cut because they're the easiest things to cut. Obama is doing nothing about the big costs (Medicare & Defense), but is supporting cuts to "Non-defense discretionary spending". That sounds like just bureaucratic waste, but in fact "non-defense discretionary spending" is the meat of what makes government good and useful - it's education, housing, social programs, etc. And it's even worse at the state level. Here in WA we are cutting ferries, buses, homeless programs, mental illness programs, health care for the poor, animal shelters, etc. - you name it, it's being slashed severely.


08-12-11 | The standard cinit trick

Sometimes I like to write down standard tricks that I believe are common knowledge but are rarely written down.

Say you have some file that does some "cinit" (C++ class constructors called before main) time work. A common example is like a factory that registers itself at cinit time.

The problem is if nobody directly calls anything in that file, it will get dropped by the linker. That is, if all uses are through the factory or function pointers or something like that, the linker doesn't know it gets called that way and so drops the whole thing out.

The standard solution is to put a reference to the file in its header. Something like this :


Example.cpp :

int example_cpp_force = 0;

AT_STARTUP( work I wanted to do );


Example.h :

extern int example_cpp_force;

AT_STARTUP( example_cpp_force = 1 );

where AT_STARTUP is just a helper that puts the code into a class so that it runs at cinit, it looks like this :

#define AT_STARTUP(some_code)   \
namespace { static struct STRING_JOIN(AtStartup_,__LINE__) { \
STRING_JOIN(AtStartup_,__LINE__)() { some_code; } } STRING_JOIN( NUMBERNAME(AtStartup_) , Instance ); };
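
If you don't have the string-joining helpers lying around, a fully self-contained version looks something like this (just a sketch, the names are illustrative) :

// token-paste __LINE__ onto a name so each use of AT_STARTUP gets a unique
// struct whose constructor runs at cinit time :
#define STRING_JOIN2(a,b)  a##b
#define STRING_JOIN(a,b)   STRING_JOIN2(a,b)

#define AT_STARTUP(some_code) \
    namespace { \
    static struct STRING_JOIN(AtStartup_,__LINE__) { \
        STRING_JOIN(AtStartup_,__LINE__)() { some_code; } \
    } STRING_JOIN(AtStartupInstance_,__LINE__); \
    }

// usage :
#include <cstdio>
AT_STARTUP( printf("runs before main\n") );

int main() { return 0; }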

Now Example.obj will be kept in the link if any file that includes Example.h is kept in the link.

This works so far as I know, but it's not really ideal (for one thing, if Example.h is included a lot, you get a whole mess of little functions doing example_cpp_force = 1 in your cinit). This is one of those dumb little problems that I wish the C standards people would pay more attention to. What we really want is a way within the code file to say "hey never drop this file from link, it has side effects", which you can do in certain compilers but not portably.
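
For example, on MSVC you can embed a /include directive for the linker via pragma, something like this (a sketch - the exact symbol decoration depends on the target, which is part of why it's not portable) :

// MSVC-only sketch : force the linker to keep Example.obj by force-including
// one of its symbols. extern "C" keeps the name unmangled; note that x86 C
// symbols get a leading underscore ("_example_cpp_force") while x64 does not.
extern "C" int example_cpp_force;   // defined in Example.cpp
#pragma comment(linker, "/include:example_cpp_force")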


08-12-11 | The IMF in America

Quick recap of what the IMF does :

When a third world country gets into bad financial trouble, the IMF jumps in. The purpose is not to help them, it's to force them to accept terms that they never would agree to if they weren't desperate. Once the IMF money arrives, it's not allowed to go to social programs or price supports that would actually help the people in the ailing country, it goes to stabilizing the financial market of that country, and the rules they force you to accept ensure that capital can freely flow out.

The end result is that what the IMF does is ensure that western finance companies are able to get their money out of the ailing countries and recoup at least some of their losses. This is what happened with the Asian collapse, this is what happened with Ireland, etc. And of course the conditions ensure that the finance giants get to play freely in that country in the future.

It occurs to me that this is largely what the US Government has done with the Mortgage Crisis. Basically they are treating the average American as a separate 3rd world country that they don't really care about, and rather than doing something that would directly help the people who have been affected by the crisis, instead they are spending massive amounts of money on propping up the financial markets so that the large financiers can get their money out and not take too much of a loss.

Imagine you had $3 trillion to help America out of the housing crisis. You could directly subsidize home owners' losses - buy their house from them at book value and sell it back to them at market value. You could do something in between, take the difference between book and market value and split it into a few pieces, subsidize part, make the home owner eat part, and make the mortgage holder eat part. Or you could directly subsidize the finance companies that hold the bad mortgages - buy the mortgage at book and some day sell it back at market.

Our government chose the last of these options. (there are some programs to do the intermediate option, but they haven't ever gotten off the ground, and even if they did their scale is so miniscule as to be irrelevant, something like $50 billion vs. the $2-3 trillion for the third option)

Obviously the claimed intention of the IMF is to prop up the failing economy to keep it on its feet. But they don't put in any restrictions against capital flight - quite the opposite, they *forbid* the government from stopping capital flight - so the result is highly predictable - the western investors save their own bacon.

The exact same thing has happened with mortgages. The claimed reason for propping up the mortgage market was that it would create confidence and keep the mortgage market working so that people would still be able to buy and sell homes, etc. Of course that hasn't happened, what has happened is that private MBS trading has trickled to near zero, and the banks have used the subsidized market to unwind their holdings and preserve their profits.


08-11-11 | Free Internet

I mean "free" in a liberty sense, not a monetary sense.

Recent Seattle Weekly article got me thinking about trying to encrypt and anonymize all my internet access. The whole torrent model is just like fish in a barrel for copyright trolls. You can just hop on the net and get a list of infringers any time you want.

So, for whatever reason, say you want to be able to work on the net and do as you please without your actions being monitored.

Apparently the major US-based services like FindNot and Anonymizer are not to be trusted (they provide logs to the US government and to subpoenas by the RIAA etc).

Really what you want is something like Tor that takes all your traffic and bounces it around a bunch of other machines and then puts out portions of requests from all over. Currently none of those services seem to be quite ready for prime time; Tor for example kicks you out if you try to do high-bandwidth things like torrents.

Some links :

Web-based DNS Randomness Test DNS-OARC
Tor Project Anonymity Online
SwissVPN - Surf the safer way!
Public IP Swiss VPN - Page 2 - Wilders Security Forums
OneSwarm - Private P2P Data Sharing
I2P - Wikipedia, the free encyclopedia
How To Not Get Sued for File Sharing Electronic Frontier Foundation
Free Anonymous BitTorrent Becomes Reality With BitBlinder TorrentFreak
Chilling Effects Clearinghouse
Anonymous P2P - Wikipedia, the free encyclopedia

In general I'm not sure if dark-nets like Tor can survive. I don't trust the internet providers or the US government to allow you to have that freedom. I suspect that if they ever caught on en masse they would be blocked by the standard extra-judicial mechanisms that they used to shut down online poker and funding WikiLeaks (where the government nicely asks the service provider to block that traffic and the provider complies, even though it's not clear the law is on their side).

The only way to get past that (and into places like china) is to hide encrypted packets inside benign packets. That may be fine for little text messages, but you can never get high bandwidth that way.


08-11-11 | Inflation - 4

More links :

Recent Decisions of the Federal Open Market Committee A Bridge to Fiscal Sanity (Acknowledging Henry B. Gonzalez and Winston
Quantitative easing - Wikipedia, the free encyclopedia
Fed Quantitative Easing Personal Finance Mastery
US Daily Index » The Billion Prices Project @ MIT
The June GDP deflator in the US Conspiracy theory edition « The visible hand in economics
Speculative-Investor.com
Sizing Up Sarah - Up and Down Wall Street - Alan Abelson - Barrons.com
Quantitative Easing A Beginner's Guide InvestingAnswers
PIMCO Investment Outlook - Skunked
Money, Credit, Inflation and Deflation
Mish's Global Economic Trend Analysis True Money Supply (TMS) vs. Austrian Money Supply (AMS or M Prime) Update
Michael Pollaro - The Contrarian Take - Forbes 1
Michael Pollaro - The Contrarian Take - Austrian Money Supply
Jesse's Café Américain Austrian Economics True Money Supply, Deflation and Inflation
Inflation Update TMS (True Money) Status
Inflation Or Deflation Follow the Money Supply (Guest Post) EconMatters
Inflation Charting the Economy, Part 4 Robert Kientz FINANCIAL SENSE
Guest Post U.S. Dollar Money Supply Is Underreported ZeroHedge
Exploring Inflation Over the Past 10 Years Through Charts - Seeking Alpha
Charting the Course to $7 Gas - J. Kevin Meaders - Mises Daily

Now some random hand waving and incoherent thoughts.

The alternative money supply metrics like TMS or "M1 + deposit currency" show that the money supply has massively increased since the great recession (roughly the same as the fed "BASE"). Historically TMS tracks inflation very closely. So far it appears not to be doing so. One possible explanation is that we basically continue to be in a recession, which is driving prices down, while also inflating the currency to keep prices stable.

Of course M2 and M3 are still way down due to lack of money multiplier in our recessed economy. In theory if credit picks up again and M3 takes off, the Fed will crank back on the money supply and keep things under control. But I believe there are reasons to be skeptical that the Fed will ever really crack down on the loose monetary policy.

The PIMCO newsletter is surprisingly bleak -

the only way out of the dilemma, absent very large entitlement cuts, is to default in one (or a combination) of four ways: 1) outright via contractual abrogation – surely unthinkable, 2) surreptitiously via accelerating and unexpectedly higher inflation – likely but not significant in its impact, 3) deceptively via a declining dollar – currently taking place right in front of our noses, and 4) stealthily via policy rates and Treasury yields far below historical levels – paying savers less on their money and hoping they won’t complain.

Basically the US has got itself into massive debt; I don't really agree that entitlements are the big problem, certainly not in the short term, but anyhoo. When you're a debtor, inflation is great for you - it makes your debt smaller. We fund the debt by selling treasuries. Our debt is much cheaper to fund if we can offer a very low return on treasuries. So the best option for the US government is clearly to inflate the currency and devalue the dollar (reduces our debt) while claiming that inflation is low (keeps treasury yields low).

In other crackpottery, QE is very strange. My very crude cliff notes :


To inject cash into the economy, the Fed bought treasuries from private holders
this takes out non-cash "paper" and adds to the money supply

The government is in debt; to finance that debt it sells treasuries
this gives the government cash in exchange for paper

So during QE, banks bought bonds from Treasury, then sold them to the Fed

Basically the Fed was just giving cash to Treasury, but passing it through banks so they could take a piece of profit

Oddly the US apparently has a law that prevents the Fed from directly buying Treasuries and thus supporting the government debt; also Bernanke and such claim they are not "printing money to monetize the debt" but really passing the money through the open market doesn't change much other than giving a slice of profit to the banks. Says the Dallas Fed : "For the next eight months, the nation’s central bank will be monetizing the federal debt."

Furthermore there is good evidence that QE largely backfired at injecting money into the economy. The problem is that with the fed funds rate at 0% and Treasuries at 2-2.5% , banks can take out fed funds, buy treasuries, then sell them to the fed under QE. Free money for the banks and the Fed gets to pay for government debt, yay.

This is certainly part of why bond yields are so low; they don't need to return much when cash is free. Apparently Japan has been through all this before. There's evidence that QE schemes basically never work; when the fed funds rate is near zero, all possible profitable investments that can be made already have been made, so why in the world would injecting more liquidity into the market help? The only explanation I can see is to intentionally devalue the currency or cause inflation. (of course as we'll note later, QE was not only a monetary policy, it was also a direct and corrupt subsidy for toxic asset holders)

Side note : the Dallas Fed uses Trimmed Mean PCE instead of Core PCE. If you read the press releases from the Dallas Fed or St. Louis Fed it is encouraging that there are still technocrats in government that are trying to be reasonably honest and do their jobs well. Of course they tend to get squashed by the people in power, but still ...

What is QE really doing? Propping up stock values and other investments (particularly MBS'es). Giving banks free profits. Creating an outflow of money from the US to emerging markets. Sending up commodity prices.

"QE effects on commodity markets have been significant. Between August 2011 and January 2011, commodity prices (as measured by CRB Index) rose by 14%. Oil prices have increased by around 20% and average gasoline prices have increased around 15%. Food prices (as measured by the CRB Food Index) have increased 12%, with some individual foodstuffs rising more sharply."

It's unclear to me how you can defend dumping liquidity into the system when banks were already sitting on massive amounts of liquidity with nothing to do with it other than hold it in the Fed (*) or treasuries. It's like watering a sick plant that's already soaked in water.

* = this is one of the funnier quirks; banks actually hold their large cash balance at the federal reserve, so when they got massive cash injections from QE the main thing that happened was that their balance of cash sitting in the fed went sky high, and the fed pays interest to the banks on that deposit - banks now have excess reserves (cash beyond their reserve requirement) around $1.5 Trillion sitting in the Fed, up from only $2 billion in 2007. If your goal is to get banks to lend more, why do you want to make it more attractive for them to leave cash sitting in reserves? This was yet another change in 2008 as part of the massive experiment of letting the Fed tinker with the economy in unprecedented ways. Some links :

Why pay interest on excess reserves - William J. Polley
macroblog “Why is the Fed Paying Interest on Excess Reserves”
FRB Press Release--Board announces that it will begin to pay interest on depository institutions required and excess reserve
Fed Paying Interest on Reserves A Primer - Real Time Economics - WSJ
Dudley Seeing Interest on Reserves as Tool of Choice Sparks New Fed Debate - Bloomberg

One common thread that I don't understand is why even a tiny bit of deflation is considered such a huge disaster that almost anything will be done to prevent it.

More links :

Yellen Defends QE Is She Right - Seeking Alpha
Satyajit Das Economic Uppers & Downers « naked capitalism
Mish's Global Economic Trend Analysis US Treasury Bull Market Not Over; Record Low Yields; Shades of Japan; Why QE3 Totally
How To Make $4 Trillion Vanish In A Flash Why Another Financial Crash Is Certain « The Oldspeak Journal
FOMC Meeting Participants See QE2 Devaluing the Dollar Global Economic Intersection
fed_all_short_stacked.png (PNG Image, 722x519 pixels)
Fed QE and SPX
Bernanke's Dilemma Hyperinflation and the U.S. Dollar - Seeking Alpha
Flim-Flam Economics » Monty Pelerin's World
BERNANKE IS SATAN « The Burning Platform

I can't imagine any way to justify the Fed purchasing MBS's and other such assets. The role of the Fed is supposed to be monetary policy, not subsidizing bad investments or otherwise interfering in markets. The Fed is not the SEC or Treasury. They just slipped it in as part of "QE" which is supposed to just be increasing the money supply and bought $1.25 trillion of MBS's without even attempting a fair valuation of them. That's basically an illegal expansion of the TARP program. The amount of toxic MBS that the Fed owns now dwarfs what Treasury holds.

Another odd footnote is the fact that banks are no longer required to mark-to-market MBS's and similar products, so we have no idea what's really on their balance sheets.

Another funny one is that Fannie/Freddie have gone from about 20% of the mortgage market to about 80% since the collapse.

If it is the Fed's role to moderate volatility, then it must also slow growth in over-exuberant times. It has shown no willingness to do so in recent years and I can't imagine that it ever will, given the political corruption and collusion that drives decision making. It's like having a train conductor that stokes the fire when you slow down but refuses to use the brakes when you get going too fast.

More links :

The Mess That Greenspan Made Is all this exit strategy talk warranted
Market Talk » Mark-to-Market
Mark-to-market accounting - Wikipedia, the free encyclopedia
Initial Fed Audit Shows Web of Conflict of Interest FDL News Desk
Hussman Funds - Weekly Market Comment Things I Believe - December 20, 2010
Hussman Funds - Weekly Market Comment The Recklessness of Quantitative Easing - October 18, 2010
AEI Relief from Mark-to-Market Accounting


08-09-11 | Threading Links

For reference, some of the links I consulted for the recent postings :

[concurrency-interest] fast semaphore
[C++] Chris M Thomasson - Pastebin.com
[C++] Chris M Thomasson - Pastebin.com -rwmutex eventcount
[C++] Chris M Thomasson - Pastebin.com - wsdequeue
[C++] Chris M Thomasson - Pastebin.com - semaphore and mpmc
[C++] Chris M Thomasson - Pastebin.com - mpsc in relacy
[C++] Chris M Thomasson - Pastebin.com - eventcount from cond_Var
[C++] Chris M Thomasson - Pastebin.com - cond_Var from waitset
[C#] Chris M Thomasson - Pastebin.com - eventcount in C#
yet another win32 condvar implementation - comp.programming.threads Computer Group
yet another (tiny) implementation of condvars - comp.programming.threads Google Groups
Would this work on even one platform - about mutex reordering
Windows NT Keyed Events
Win32 Kernel Experimental WaitLock-Free Fast-Path Event-Count for Windows... anZ2dnUVZ, InterlockedLoadFence, and aPOdnXp1l6
win32 condvar futex - NOT! - 29464
Win32 condition variables redux - Thomasson thread list version
Win32 condition variables redux - comp.programming.threads Google Groups
Usenet - Lock-free queue SPMC + MPMC
Usenet - Condition variables signal with or without mutex locked
Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
ThreadSanitizer - data-race-test - ThreadSanitizer is a Valgrind-based detector of data races - Race detection tools and mor
Thin Lock vs. Futex «   Bartosz Milewski's Programming Cafe
The Inventor of Portable DCI-aka-DCL (using TSD) is... ;-) - comp.programming.threads Google Groups
TEREKHOV - Re win32 conditions sem+counter+event = broadcast_deadlock + spur.wake
TBB Thomasson's MPMC
TBB Thomasson - rwmutex
TBB Thomason aba race
TBB Raf on spinning
TBB eventcount posting Dmitry's code
TBB Download Versions
TBB Dmitry on memory model
Task Scheduling Strategies - Scalable Synchronization Algorithms Google Groups
Subtle difference between C++0x MM and other MMs - seq_cst fence weird
Strong Compare and Exchange
Strategies for Implementing POSIX Condition Variables on Win32
Starvation-free, bounded- ... - Intel® Software Network
spinlocks XXXKSE What to do
Spinlocks and Read-Write Locks
SourceForge.net Repository - [relacy] Index of relacy_1_0rrdinttbb_eventcount
Some notes on lock-free and wait-free algorithms Ross Bencina
So what is a memory model And how to cook it - 1024cores
Sleeping Read-Write Locks
Simple condvar implementation for Win32 - comp.programming.threads Google Groups
Simple condvar implementation for Win32 (second attempt)
SignalObjectAndWait Function (Windows)
sequential consistency « Corensic
search for Thomasson - Pastebin.com
search for Relacy - Pastebin.com
sched_setscheduler
Scalable Synchronization
Scalable Synchronization MCS lock
Scalable Synchronization Algorithms Google Groups
Scalable Queue-Based Spin Locks with Timeout
Relacy Race Detector - 1024cores
really simple portable eventcount... - comp.programming.threads Google Groups
really simple portable eventcount... - 2
really simple portable eventcount... - 1
re WaitForMultipleObjects emulation with pthreads
Re sem_post() and signals
Re Portable eventcount (try 2)
Re Intel x86 memory model question
Re C++ multithreading yet another Win32 condvar implementation
race-condition and sub-optimal performance in lock-free queue ddj code...
Race in TBB - comp.programming.threads Google Groups
QPI Quiescence (David Dice's Weblog)
pthread_yield() vs. pthread_yield_np()
pthread_cond_ implementation questions - comp.programming.threads Google Groups
POSIX Threads (pthreads) for Win32
Porting of Win32 API WaitFor to Solaris Platform
Portable eventcount
Portable eventcount - Scalable Synchronization Algorithms Google Groups
Portable eventcount - comp.programming.threads Google Groups
Portable eventcount (try 2) - comp.programming.threads Google Groups
Parallel Disk IO - 1024cores
Obscure Synchronization Primitives
New implementation of condition variables on win32
my rwmutex algorithm for Linux... - this is good
Mutexes and Condition Variables using Futexes
Multithreading in C++0x part 1 Starting Threads Just Software Solutions - Custom Software Development and Website Developmen
Multithreaded File IO Dr Dobb's Journal
Multi-producermulti-consumer SEH-based queue – Intel Software Network Blogs - Intel® Software Network
MSDN Compound Synchronization Objects
MPMC Unbounded FIFO Queue w 1 CASOperation. No jokes. - comp.programming.threads Computer Group
Memory Consistency Models
low-overhead mpsc queue - Scalable Synchronization Algorithms
Lockless Inc Articles on computer science and optimization.
Lockingunlocking SysV semaphores - comp.unix.programmer Google Groups
Lockfree Algorithms - 1024cores
lock-free read-write locks - comp.programming.threads Google Groups
Lock-free bounded fifo-queue on top of vector - comp.programming.threads Google Groups
Linux x86 ticket spinlock
JSS Petersons
JSS Dekker
Joe Seighs awesome rw-spinlock with a twist; the beauty of eventcounts... - comp.programming.threads Google Groups
joe seigh on eventcount fences
Joe Seigh Fast Semaphore
Joe Duffy's Weblog - keyed events
Implementing a Thread-Safe Queue using Condition Variables (Updated) Just Software Solutions - Custom Software Development a
How to use priority inheritance
High-Performance Synchronization for Shared-Memory Parallel Programs University of Rochester Computer Science
good discussion of work stealing
good discussion of a broken condvar implementation
git.kernel.org - linuxkernelgittorvaldslinux-2.6.gitcommit
GCC-Inline-Assembly-HOWTO
futex(2) - Linux manual page
FlushProcessWriteBuffers Function (Windows)
First Things First - 1024cores
Fine-grained condvareventcount
fast-pathed mutex with eventcount for the slow-path... - comp.programming.threads Google Groups
experimental fast-pathed rw-mutex algorithm... - comp.programming.threads Google Groups
eventcount needs storeload
eventcount example of seq_cst fence problem
Effective Go - The Go Programming Language
duffy page that's down meh
Dr. Dobb's Journal Go Parallel QuickPath Interconnect Rules of the Revolution Dr. Dobb's and Intel Go Parallel Programming
Don’t rely on memory barriers for synchronization… Only if you don’t aware of Relacy Race Detector! – Intel Software Network
dmitry's eventcount for TBB
Distributed Reader-Writer Mutex - 1024cores
Discussion of Culler Singh sections 5.1 - 5.3
Developing Lightweight, Statically Initializable C++ Mutexes Dr Dobb's Journal
Derevyago derslib mt_threadimpl.cpp Source File
Derevyago - C++ multithreading yet another Win32 condvar implementation - comp.programming.threads Google Groups
Dekker's algorithm - Wikipedia, the free encyclopedia
David's Wikiblog
data-race-test - Race detection tools and more - Google Project Hosting
condvars signal with mutex locked or not Loïc OnStage
Concurrent programming on Windows - Google Books
concurrency-induced memory-access anomalies - comp.std.c Google Groups
CONCURRENCY Synchronization Primitives New To Windows Vista
comp.programming.threads Google Groups
comp.lang.c++ Google Groups - thomasson event uses
Common threads POSIX threads explained, Part 3
Chris M. Thomasson - Pastebin.com
Chris M. Thomasson - Pastebin.com - win_condvar
Chapter 22. Thread - Boost 1.46.1
cbloom rants 07-18-10 - Mystery - Does the Cell PPU need Memory Control -
cbloom rants 07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -
Causal consistency - Wikipedia, the free encyclopedia
C++1x lock-free algos and blocking - comp.lang.c++ Google Groups
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups
C++0x memory_order_acq_rel vs memory_order_seq_cst
C++ native-win32 waitset class for eventcount... - comp.programming.threads Google Groups
C++ native-win32 waitset class for eventcount... - broken for condvar
C++ N1525 Memory-Order Rationale - nice
C++ multithreading yet another Win32 condvar implementation
Bug-Free Mutexs and CondVars w EventCounts... - comp.programming.threads Google Groups
Break Free of Code Deadlocks in Critical Sections Under Windows
Boost rwmutex 2
Boost rwmutex 1
boost atomics Usage examples - nice
Blog Archive Just Software Solutions - Custom Software Development and Website Development in West Cornwall, UK
Atomic Ptr Plus Project
Asymmetric Dekker
appcoreac_queue_spsc - why eventcount needs fence
AppCore A Portable High-Performance Thread Synchronization Library
Advanced Cell Programming
A word of caution when juggling pthread_cond_signalpthread_mutex_unlock - comp.programming.threads Google Groups
A theoretical question on synchronization - comp.programming.threads Google Groups
A race in LockSupport park() arising from weak memory models (David Dice's Weblog)
A garbage collector for C and C++
A futex overview and update [LWN.net]
A Fair Monitor (Condition Variables) Implementation for Win32


08-09-11 | Inflation - 3

Some reference :

Response to BLS Article on CPI Misconceptions
Consumer price index - Wikipedia, the free encyclopedia
WSJ piece on hedonic price adjustments
Consumer Price Index
Chapter 11—Money and Its Purchasing Power (continued) - - Mises Institute
Bill Gross Claims the CPI is Understated, But Is He Right - Seeking Alpha
An inflation debate brews over intangibles at the mall
United States Consumer Price Index - Wikipedia, the free encyclopedia
Trudy Lieberman Entitlement Reform Archive CJR
Shadow Government Statistics Home Page
Meet the Fed's Elusive New Inflation Target - TheStreet
Inflation The Concise Encyclopedia of Economics Library of Economics and Liberty
How BLS Measures Price Change for Medical Care Services in the Consumer Price Index
Higher Education Price Indices
God Punishes Us When We (Collectively) Vote Republican, Part 5 Angry Bear - Financial and Economic Commentary
Consumer Price Index, a rant
Charts College Tuition vs. Housing Bubble » My Money Blog
Chained Cpi Social Security, CPI, Michael Hiltzik Using 'chained CPI' to determine Social Security payments would rip off ne
cbloom rants 5-14-05 - 1
cbloom rants 12-27-08 - Financial Quackery
cbloom rants 09-17-07 - 2

Some of these guys have the whiff of crackpottery which should give us a bit of pause. Nevertheless...

We can track down a few of the strange problems that I identified last time.

Education basically is miscounted : "The inclusion of financial aid has added to the complexity of pricing college tuition. Many selected students may have full scholarships (such as athletic), and therefore their tuition and fixed fees are fully covered by scholarships. Since these students pay no tuition and fees, they are not eligible for pricing." Discounting financial aid makes some sense if you are trying to measure the consumer's expense, but not if you are trying to measure the cost of the good; just because someone else paid for part of it doesn't make it cheaper. But really I imagine the biggest problem with education cost is that they effectively count college as being free for people who can't afford college. That is, people who can't afford it don't buy it, so it's not in the basket (doesn't contribute to the "quantity" in the CPI metric). A better way to measure inflation would be to assume that everyone would go to college if they could afford it. (also it seems that non-accredited technical school type places are not counted at all)

Health care is simply not counted at all, by design : "The weights in the CPI do not include employer-paid health insurance premiums or tax-funded health care such as Medicare Part A and Medicaid" The only thing they count is out-of-pocket / discretionary health care expenses, which are obviously just a tiny fraction of the total.

Real estate has the funny owner's equivalent rent thing which makes it very hard to tell if that is being gamed or not.

Obviously anything based on "core" inflation (without food or energy) is ridiculous. The standard argument that those fluctuate too much seasonally is absurd, you could just use a seasonally-adjusted moving average, you don't need to remove them completely.

The other really obviously fishy parts are :

"Substitution". A while ago the CPI was changed to use geometric averages of prices within a category. This seems pretty innocuous, but it basically causes a down-weighting of higher priced items. And in fact the geometric mean is always lower than the arithmetic mean, so this change can only make inflation seem lower, which is a dirty trick. For example :


(1+1+8)/3 = 3.333

(1*1*8)^(1/3) = 2.0

pretty big difference even though they are both "means". Now, they hand wave away and say that this reflects consumers' ability to choose and substitute cheaper products. But it is totally unscientific.

Furthermore, newer measures like the chained CPI (C-CPI-U) or the Fed's PCE also explicitly include substitution. This just seems like it obviously does not reflect inflation. When a product gets expensive and the consumer substitutes for something cheaper, they are by definition getting something of lower utility (because it wasn't their first choice), so you can't say that no inflation happened; they are getting less for their money.

"Hedonics". These are poorly documented pure bullshit ways of pretending inflation is lower by claiming that we got more for our money. This is just pure nonsense for various reasons :

1. The whole definition of "better" is so vague and open to interpretation that it has no business in a metric. For example they consider air travel to be massively improved since the 70's. Sure it's safer, more efficient, but also much much less pleasant. Personally I think that the same trip is actually worth much less now than it was in the past, but they say it's worth much more. Similarly for the quality of buildings and clothing and cars and so on; yes, they're safer, faster, more durable, whatever, but they aren't hand crafted, they aren't made of hard wood and steel and chrome; I think most of those things are actually much crappier now than ever, made more cleverly but also more cheaply. Anyway, it just has no business in there. The idea that you can measure the hedonic quality of some product and say it improved by 0.1% from April to May in 2010 is just absurd.

2. Using quality of goods at all just isn't right to begin with. Inflation should be a measure of the cost to buy a standard set of goods at the expected quality level of the era. Just because technology gets better over time doesn't mean you can discount the inflation! For example if computers get 50% better every year and our money is inflating by 50% would you say the cost of computers is not changing? Of course not, the cost is going up 50% , yes they are also getting better but that is not part of the discussion.

It's just wrong on the face of it. If a median decent car was $5k in the 70's and now is $25k , then the price of cars has gone up by 5X. But oh no they say, you have air bags and more power and fuel economy and so on, the modern car is 5X better, so in fact there has been no inflation at all. Well, wait a minute. I *expect* the quality of life and technology to go up over time. Are you telling me that in a world with 0% inflation that technology does not get better over time? That's a strange way to measure things. And it's not really what you want to know when you ask about inflation. You want to know how much money do I need to afford a decent house, car, food, etc. at the expected standard of the time that I buy it.

Obviously we wish there was some item that had absolute constant value that we could measure against. Also obviously measuring inflation is very complicated and we are only scratching the surface. But it's very fishy. Rotten fishy.


08-09-11 | Inflation - 2

The government has made several significant changes to how it counts inflation. A big one occurred in 1995 (Boskin commission), another happened just in the last two years (chained CPI for COLA), and at some point the Fed changed its core measure (to PCE).

In all cases they claimed to be making the inflation measure "more accurate" , and in all cases, the inflation rate was revised downward.

Now, ignoring the details of the changes for now, it should be clear the government has a very strong interest in reporting a low inflation number.

1. It makes their administration look better. If inflation is low, then inflation-adjusted GDP looks better. There are lots of horrifying statistics about inflation-adjusted median income that are very embarrassing to the US government, and you can make that go away by having a lower inflation number.

2. Lots of federal costs have automatic COLA (cost of living adjustments) like Social Security, Medicare, federal employee pensions and wages, etc. A lower inflation number directly decreases the amount they have to pay out.

3. They pay out less on TIPS (inflation-indexed Treasury bonds).

Whenever governments can lie for their own benefit, they tend to, so it would be *extremely* surprising if the reported inflation was actually correct.


08-09-11 | Randomness and Fault

Recent comment ranting has made me think of something that frequently annoys me.

I get quite aggravated when people invite a certain negative outcome on themselves and then act like it's random or unpredictable or "shit happens" or "just roll with it" or whatever.

There are three separate but similar categories of this : 1. Risky Behavior, 2. Intentional Ignorance, and 3. Futility of Fighting the System.

1. Risky Behavior : these people act like because something is probabilistic, their behavior has no effect on the outcome.

A classic example is risky drivers; someone might be speeding, talking on the phone, not paying attention to the road. They have an accident, and act like "accidents happen, it's random". No, it's not. You chose your behavior, and your behavior increased the probability of an accident. You just (probabilistically) crashed your car on purpose. It was a willful intentional choice to be risky.

More benign cases happen all the time; maybe you have a friend over and they're clearing plates from the table and are carrying way too many at once. Of course they drop one and break it. You are supposed to act like "ha ha, no big deal, accidents happen". But it wasn't an accident. They just (probabilistically) threw your plate into the ground.

Now I don't actually mind if someone comes over and breaks my plate, no big deal it's a fucking plate (it's a whole 'nother rant about how stupid it is to buy expensive plates and get upset when they break), but don't act like it was random, sure there was an element of chance, but it was your actions that (probabilistically) caused it.

It's particularly annoying when the person who has the "accident" told me to "chill out, it'll be fine" or whatever when I warned them to be aware of the risk.

Of course this happens in coding all the time too. I tend to be very cautious in my coding; I'd rather spend time testing and asserting now than have problems later. Inevitably I get into situations where someone on the team is having a nasty, hard-to-reproduce bug. They act like "bugs happen" and it's sort of a random act of god. Did you robustly assert your code? Do you have unit tests? Did you separate out classes that have strict invariants? No? Then you just (probabilistically) chose to have bugs in your code, don't act like they're random.

(there's a separate issue of whether the precautions are actually worth it or not; there's a spectrum of behavior from having to be super careful in advance so that you never have problems in the future (eg. NASA) vs. just being sloppy and fast and accepting a high probability of risk (eg. Game Jam)).

Just because something has a probabilistic element doesn't mean there's no correlation to your actions, or that you're not to blame when things go bad.

2. Intentional Ignorance : this is choosing not to do the research that you easily could have done and thus getting into a bad situation. Now, there's nothing wrong with that per se, that's a life choice and has different trade offs. The thing that annoys me is when people act like they "couldn't have known" or it's perfectly normal not to have known. Not true, you could have easily known.

Say you're visiting a strange town and you go out to eat somewhere and it sucks. It's not random that it sucked - it's because you didn't do any research (probabilistically). Okay, that's fine if that's the choice you want to make, but don't act like it's not your fault - it is a direct result of your choice to not do research that it sucked.

3. Futility of Fighting the System : this is perhaps the most naive and self-defeating variant, and mainly affects the young or the poor (except when it comes to voting, in which case it surprisingly runs across all demographics).

These people act like it doesn't matter what they do, that somehow their bank or cell phone carrier or the cops or whatever will find a way to screw them. Basically they refuse to recognize the cause/effect connection between their own actions and the outcomes.

A lot of this is because of the same failure to connect cause/effect in probabilistic situations. Maybe this person tried to be really careful one month and do everything right, and they still got some absurd bank fee or roaming charge or whatever, so they conclude that "you can't win" and "what I do doesn't matter". They don't see that their actions might reduce the probability of fuckage even if they don't eliminate it.

(of course to some extent this is just an excuse; they really know the truth, but they pretend not to because they don't want to be accountable for their own actions, they want to be able to fuck up and act like they're not to blame, that "it doesn't matter what I do, the system fucks me anyway").

Amazingly even smart people will talk this way about voting, that it "doesn't matter who I vote for the politicians always fuck us" ; well yes, there will be fuckage no matter what, but don't be retarded, of course you can affect the probability of fuckage through your actions. Just because it's not deterministic doesn't mean you are divorced from responsibility.

A lack of deterministic feedback is of course what makes poker so hard for many people. Almost everyone learns well when there is immediate deterministic feedback on whether their action is right or not. (this isn't saying much, dogs and monkeys also learn well under those conditions). Many people struggle when the feedback is randomized or unclear or very delayed. For example when you try a new line in poker, like maybe you try three-betting from the blinds with medium range hands, if it goes badly a few times most people will conclude "that was a bad idea" and won't try it any more. It's very hard for these people to learn and get better because they're just looking at what they did in the instant and whether it paid off.


08-09-11 | The Lobster

(this coinage is so obvious I must have stolen it from somewhere, anyway...)

I've been thinking a lot recently about "the lobster".

I've always thought it was bizarre how you can pull into any podunk town in America and go to the scary local diner / steak house, and there will be the regular items - burger, chicken fried steak, what have you, all under $10, and then there's the lobster, for $30, ridiculously overpriced, tucked in the corner of the menu with decorative squiggles around it (as if it needs velvet ropes to separate the VIP section of the menu from the plebian fare).

The thing is, the lobster is not actually good. They probably can't remember the last time anybody actually ordered the lobster. No local would; if the waitress likes you she would warn you not to get it, and the chefs roll their eyes when the order comes in. Why is it on the menu at all?

I guess it's just there as a trap, for some sucker who doesn't know better, for someone wanting to show off the money they just won, or someone on an expense account to waste money on. You're really just humiliating yourself when you order it, and the restaurant is laughing at you.

I think most people know that you don't actually ever order the lobster in restaurants (other than lobster-specializing places in like Maine or something). But "the lobster" can pop up in many other guises. Expensive watches are obvious lobsters, expensive cars can be less obvious lobsters (is a Maserati a lobster? an Alfa? an Aston? a Porsche?), certainly some of the options and special editions are obvious lobsters, for example the recent Porsche "Speedster" special edition that cost $250k and was just a regular Carrera other than a few colored bits, that's clearly a lobster and Porsche laughs and rolls their eyes at the Seinfelds of the world who are stupid enough to buy the Porsche lobster just because it was on the menu with squiggly lines around it.

I feel like a lot of salesmen try to slip the lobster on you when you're not paying attention. Like when the contractor asks if you want your counters in wood or stone or italian marble - hey wait, contractor, that's the lobster! okay, yeah, you got me, I don't even know where to get italian marble but I thought I'd try to slip it in there. Home improvement in general is full of lobsters. Home theatre stores usually carry a lobster; car wheels ("rims") are rife with lobsters.

The thing that makes the nouveau riche so hilarious is they are constantly getting suckered into buying the lobster and then have the stupidity to brag about it. Ooo look at my gold plated boat ; you fool, you bought the lobster, hide your shame!


One of the things that's so satisfying about video games is that you get a clear reward for more work. You kill some monsters, you get experience, you go up a level; you collect 200 gems, now you can buy the red shield, and it is objectively better than the blue shield you had before. It's very simple and satisfying.

Life is not so clear. More expensive things are not always better. Doing more work doesn't necessarily improve your life. This can be frustrating and confusing.

One of the things that makes me lose it is video game designers who think it's a good idea to make games more realistic in this sense, like providing items in the stores that are expensive but not actually very good. No! I don't want to have to try to suss out "the lobster" in the video game blacksmith, you want video game worlds to be an escapist utopia in which it's always clear that spending more money gets you better stuff. (the other thing I can't stand is games that take away your items; god dammit, don't encourage me to do the work for that if you're going to take it away, don't inject the pains of real life into games, it does not make them better!)


08-08-11 | Some Video Watching

"Thick as Thieves" is the most surprising movie I've seen in a long time. It's fun, smart, funny, it's what movies should be and are so rarely. It's directed by the director of Black Dynamite, and stars Alec Baldwin, but despite those two things it's quite subtle, so much more subtle than modern crap. It lets you find the joke rather than cramming it down your throat, and that makes it much funnier.

"Vengo" is not really a movie with a plot, so much as a snapshot of a (stereotypical/artificial) world of flamenco. I enjoyed it. It reminded me of the music scenes in Kusturica movies (but with less humor), in the sense that seeing the music in a stereotypical fictional setting somehow makes it better, makes you feel you appreciate what it's like for the people who live with that music as the backbone of their lives.

Louie has completely gone off the rails. E1 and E2 of Season 2 were shit, then E3 (buying a house) was back to funny, and then he completely fucking lost the point again in E4+. Uh, hello, fucking Louie CK, people are watching you because they want to laugh at the absurdity and miseries of life, you're supposed to point them out and then make them funny, that provides relief. You are not just supposed to document your miserable little (fake) life in a pretentious attempt at "verite". It's sad because it really could be a good show if he would just get his head out of his ass and tell more jokes.

Game of Thrones is great, like maybe the best fantasy TV series ever (not a lot of competition there, though (actually I can't think of a single one)). By far the best thing about it is the costumes. The costumes are stellar. The sets are great (except for the rare CG set; and for some reason the matte paintings also suck pretty bad, they have poor matte artists and poor matte-foreground integration). Most of the casting is very good. The real letdown is the pathetic George RR Martin fantasy world. WTF. Cold north with frozen ancient evil. Across the narrow sea is the desert full of arab/mongols. Scottish defenders of Hadrian's wall. WTF it's Europe / Tolkien / Anne McCaffrey ; it's like the most generic uncreative fantasy world I think I have ever seen. I find fantasy proxies for Europe to be really boring. (so much worse than fiction in which you take the real Europe and imagine there might have been hidden magic in it). Anyway, I enjoyed the TV show quite a bit.

Justified Season 2 is better than Season 1. There's one or two really bad one-off "episodic" episodes (wow that's a horrible sentence) that are like "random crime happens and good ole Raylan Givens saves the day, and everything is back to normal by the end" but fortunately those are few. Of course the way they make excuses for the Marshals to be involved in everything is retarded, but if you just ignore all that it's not bad. Actually, objectively it has a lot of flaws but I guess I like the actors enough to forgive them.

Zen is pretty retarded. Why is Italy full of Brittish people? (and why can I not spell British?) The cases are all like super typical horrible mystery writing, which goes like this :

crazy crime happens
intrigue and politics get in the way
it gets more and more complicated
hero gets attacked or in trouble
sexy females randomly introduced to the plot
solution looks hopeless
...
random coincidental shit happens and presto case is solved
And you've got the typical stupid shit like bad guys who can't hit you with a gun from ten feet, but people on the hero's side are always perfect sharp-shooters, etc. However, it's almost worth watching just for the gooey cheesiness of it. The soundtrack is straight out of the 80's smooth jazz collection, it's what a "playa" would put on to seduce the ladies. Then there's all these ridiculous self-conscious shots that focus on the cool clothes or the cool cars while we listen to the funky bass line, omg.


08-08-11 | Inflation - a sanity check


Fuel prices have risen faster than (nominal) inflation.

Education - much faster than inflation.

Housing faster than inflation.

Food faster than inflation.

Health care faster than inflation.

Gold, copper, etc. - much faster than inflation.

Foreign currency - faster than inflation.

Ummm....

Something is wrong with this picture.


08-08-11 | Vets are Assholes

Some people seem to have a failure to grasp what is reasonable behavior and what is reasonable to nag about.

I've complained about my current landlord before, who just has no concept about what normal wear and tear by a renter is like. I fucking mow and weed and water and repair shit and oil the counters and all this shit, and yet they send me monthly pesters like "I drove by and the grass was looking a little tall, better get out there and trim it" ; or "it might freeze tonight, better wrap all the outside faucets" ; WTF , do you have any clue what bad renters do to houses? I'm not lighting the house on fire. I'm not turning on the water and letting it run (when the owner pays the utilities), I'm not selling crack out of the house, you should be fucking happy.

One of the more unfortunate bosses I ever had was on the IEEE board for code style guidelines (that's got to be a major red flag right there). I'm all for some level of code uniformity and cleanliness, but he would literally review every checkin I made each night and send me a mail with things like "variables at line 97 aren't lined up in the same column; there's a blank line on 416 that should be deleted" ... are you kidding me? The code is basically so clean compared to the spaghetti mess it could be; there's this failure to grasp what is and isn't a minor transgression.

Vets seem to consistently suffer from this problem.

Every time I have to take my cats in to the vet, I get some kind of condescending lecture from an asshole vet.

Back in CA I got a long lecture about how I shouldn't let my cats run around outside because they get diseases and injuries and so on. Okay, mister vet, I'm sure that's true for humans too, so why don't you just stay in your house and never leave and then we won't have to deal with your uptight ass.

I also got a lecture about putting the food too close to the litter. Which, hey actually is a bad thing to do, and I didn't know it, so it's good that I learned, but the vet never tells you things in the style of "oh, yeah, you might not know this and here's a tip" , it's always like "you are a rotten human being who is intentionally abusing your animals".

With Chi Chi here we always get a lecture about how she was declawed. Well fuck you vet, we didn't do it, we adopted an adult cat from a shelter that was previously declawed, so nyah.

This last time we got a lecture about how fat she is. Yeah she's slightly overweight, but she's not one of those ultra-obese cats that owners actually should be lectured about.


I wrote this a month ago but didn't post it, because it's just whiney and boring and I'm trying to avoid posting things like that (I write a lot of shit that I don't post, despite how it may seem, this blog is not a direct stream of defecation from the anus of my mind; when I'm being wise I don't hit "publish" right after I write something, and 99% of the time when I revisit it a few hours later I decide it should go in the bin).

But recent events have reminded me of it so I dug it back up.

We went backpacking recently, so I was reading lots of hiking books, and I have to say - "The Mountaineers" are assholes. Every single hike description is a fucking diatribe against trail users that they don't approve of. It's not once in a while, it's every single fucking description, they just can't resist being snarky and nasty on every page. It's so bad that I can barely stand to read the books despite them being clearly the best reference material.

And it's the same kind of out of touch ranting. There are plenty of very obvious bad trail users that deserve to be complained about - people who leave trash in the wilderness are probably the worst, people who bring up boom boxes or dogs in no-dogs-allowed areas, people who trample the flowers off trail, etc. But that's not what The Mountaineers complain about. The things they complain about are :


People who camp near lakes (because it's too high use)

People who have camp fires (even in allowed fire places) ; (they mock us as "kumbayah-ers")

People who only backpack one or two nights (you're scum if you don't get into deep wilderness)

People who want a road that takes them to easy access

etc.

it's like so fucking out of touch with what is actually a sin on the trail.


08-06-11 | A case for OnLive ?

Reading Sanders' last post he brings up an interesting point that might actually be a case for OnLive.

I've long been a sceptic/critic of OnLive. Basically I think that taking standard games and running them over a network where you add 200 ms of latency and get no benefit is totally fucking retarded.

But a game that is custom-made for an OnLive / cloud is kind of interesting.

Particularly an MMO, because 1. latency isn't that big of a deal ; 2. they already have horrible latency so players are used to it, and 3. lots of players are in the same room at the same time, so you can share computer power on the server. Non-networked games where each player is in an independent world are much less compelling.

Also, if you're thinking of running a Rage-like texture cache, doing local loads and recompresses is sort of like running an OnLive server on your local machine and serving yourself up compressed data - it adds latency, adds compression artifacts, and generally is very undesirable if there was another choice.

In particular I imagine a use case like this which makes some sense to me :


MMO game
WoW style gameplay that's not super latency critical
cloud-style computing that dynamically puts more servers where needed
Huge number of players can be in the same room and there's no slow down
  (more servers just contribute to processing that area)

Non-GPU renderer; like maybe REYES or a ray tracer
Super high source content sizes stored on shared servers
  (how to generate massive amounts of source content is unknown)
Since you're just sending frames back to clients, render quality is unlimited
  just requires more servers

Could do real-time GI since you can put lots of servers on it
  (and the result is shared for lots of players so the cost is not prohibitive)
or just have massive pre-baked lightmaps with time-of-day variation
  (something like a spherical harmonic per texel, and store 24 of them, one for each hour)
  since back-end storage size is unlimited

the OnLive-style serving of the images actually is an advantage in that scenario. In practice, no game company has the know-how to manage such complex servers. And the cost per player is too high. And being able to deliver massively more content just creates a big problem of how to create that content. etc.


08-02-11 | Coder Dictionary : "to checker"

To checker : verb, intransitive : to descend into the technical minutia of a subject which is only tangentially related to your project ; to work very hard and yet make very little progress toward shipping ; to stretch a few day task into many months.

I've definitely been checkering a bit for the last month on all this lockfree shit. However, I contend it was only a "half checker" because what I've been doing is actually useful to posterity (I think), whereas to complete a "full checker" you have to go off into technical minutia about topics that help *nobody* and make your friends completely exasperated.

(I kid, I kid)


08-01-11 | Double checked wait

Something that we have touched on a few times is the "double checked wait" pattern. It goes like this :
consumer :

if ( not available )
{
    prepare_wait();

    if ( not available )
    {
        wait();
    }
    else
    {
        cancel_wait();
    }
}

producer :

make available
signal_waiters();
now, why do we do this? Well, if you did just a naive check like this :

consumer :

if ( not available )
{
    // (*1)
    wait();
}

producer :

make available
signal_waiters();

you have a race. What happens is you check available and see none, so you step in to *1 ; then the producer runs, publishes whatever and signals - but there are no waiters yet so the signal is lost. Then you go into the wait() and deadlock. This is the "lost wakeup" problem.

So, the double check avoids this race. What must the semantics of prepare_wait & wait be for it to work? It's something like this :

Any signal that happens between "prepare_wait" and "wait" must cause "wait" to not block (either because the waitable handle is signalled, or through some other mechanism).

Some implementations of a prepare_wait/wait mechanism may have spurious signals; eg. wait might not block even though you shouldn't really have gotten a signal; because of that you will usually loop in the consumer.
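
To be concrete, here's a minimal sketch of the consumer loop with a generic prepare_wait/wait/cancel_wait API (the names are the ones used above; "available" stands in for whatever condition you're actually waiting on) :

    for(;;)
    {
        if ( available )
            break;

        prepare_wait();

        if ( available )
        {
            cancel_wait();
            break;
        }

        wait(); // may wake spuriously; the loop just re-checks the condition
    }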

Now let's look at a few specific solutions to this problem :

condition variables

This is the locking solution to the race. It doesn't use double-checked wait, instead it uses a mutex to protect the race; the naive producer/consumer is replaced with :


consumer :

mutex.lock();
if ( not available )
{
    unlock_wait_lock(); // atomically : unlock the mutex, wait, then re-lock when woken
}
mutex.unlock();

producer :

mutex.lock();
make available
signal_waiters();
mutex.unlock();

which prevents the race because you hold the mutex in the consumer across the condition check and the decision to go into the wait.
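
For reference, the same thing written out with C++11 std::mutex / std::condition_variable looks like this (just a sketch; "available" is a placeholder flag, and cv.wait(lock) is exactly the "unlock_wait_lock" above) :

    #include <mutex>
    #include <condition_variable>

    std::mutex mtx;
    std::condition_variable cv;
    bool available = false;

    // consumer :
    void consume()
    {
        std::unique_lock<std::mutex> lock(mtx);
        while ( ! available )   // loop handles spurious wakeups
            cv.wait(lock);      // unlocks the mutex, waits, re-locks on wake
        // ... consume, still holding the lock ...
    }

    // producer :
    void produce()
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            available = true;   // make available, with the mutex held
        }
        cv.notify_all();        // signal_waiters
    }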

waitset

Any simple waitset can be used in this scenario with a double-checked wait. For example a trivial waitset based on Event is like this :


waitset.prepare_wait :
    add current thread's Event to list of waiters

waitset.wait :
    WaitForSingleObject(my Event)

waitset.signal_waiters :
    signal all events in list of waiters

for instance, "waitset" could be a vector of handles with a mutex protecting access to that vector. This would be a race without the prepare_wait and double checking.

In this case we ensure the double-checked semantics works because the current thread is actually added to the waitset in prepare_wait. So any signal that happens before we get into wait() will set our Event, and our wait() will not actually block us, because the event is already set.
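
Here's roughly what that looks like fleshed out on Win32 (a sketch only, not production code; assumes each waiting thread brings its own auto-reset Event) :

    #include <windows.h>
    #include <mutex>
    #include <vector>
    #include <algorithm>

    struct waitset
    {
        std::mutex          m_mutex;
        std::vector<HANDLE> m_waiters;

        void * prepare_wait(HANDLE myEvent)
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_waiters.push_back(myEvent); // register *before* re-checking the condition
            return (void *) myEvent;
        }
        void cancel_wait(void * h)
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            auto it = std::find(m_waiters.begin(),m_waiters.end(),(HANDLE)h);
            if ( it != m_waiters.end() )
                m_waiters.erase(it);
        }
        void wait(void * h)
        {
            WaitForSingleObject((HANDLE)h, INFINITE);
            cancel_wait(h); // remove ourselves now that we've woken
        }
        void signal_waiters()
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            for ( HANDLE h : m_waiters )
                SetEvent(h); // wake everyone who was registered
        }
    };

(a signal that races with cancel_wait can leave an Event set, which just produces a spurious wakeup next time - the consumer loop tolerates that.)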

eventcount

Thomasson's eventcount accomplishes the same thing but in a different way. A simplified version of it works like this :


eventcount.prepare_wait :
    return key = m_count

eventcount.wait :
    if ( key == m_count )
        Wait(event)

eventcount.signal_waiters :
    m_count++;
    signal event;

(note : event is a single shared broadcast event here)

in this case, prepare_wait doesn't actually add you to the waitset, so signals don't go to you, but it still works, because if signal was called in the gap, the count will increase and no longer match your key, so you will not do the wait.

That is, it specifically detects the race - it sees "was there a signal between when I did prepare_wait and wait?" , and if so, it doesn't go into the wait. The consumer should loop, so you keep trying to enter the wait until you get to check your condition without a signal firing.
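
So a consumer using the eventcount looks something like this (sketch; "ec" is an eventcount as above, "available" is your real condition) :

    // consumer :
    for(;;)
    {
        if ( available )
            break;

        int key = ec.prepare_wait();    // key = current count

        if ( available )
            break;  // no cancel needed; prepare_wait didn't register us anywhere

        ec.wait(key);   // only blocks if no signal bumped the count since prepare_wait
    }

    // producer :
    make available
    ec.signal_waiters();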

futex

It just occurred to me yesterday that futex is actually another solution to this exact same problem. You may recall - futex does an internal check of your pointer against a value, and only goes into the wait if the value matches.

producer/consumer with futex is like this :


consumer :

if ( value == not_available )
{
    futex_wait(&value,not_available);
}

producer :

value = available
futex_signal(&value);

this may look like just a single wait at a glance, but if we blow out what futex_wait is doing :

consumer :

if ( value == not_available )
{
    //futex_wait(&value,not_available);

    futex_prepare_wait(&value);
    if ( value == not_available )
        futex_commit_wait(&value);
    else
        futex_cancel_wait(&value);
}

producer :

value = available
futex_signal(&value);

we can clearly see that futex is just double-checked-wait in disguise.

That is, futex is really our beloved prepare_wait -> wait pattern , but only for the case that the wait condition is of the form *ptr == something.


Do we like the futex API? Not really. I mean it's nice that the OS provides it, but if you are designing your own waitset you would never make the API like that. It confines you to only working on single ints, and your condition has to be int == value. A two-call API like "prepare_wait / wait" is much more flexible, it lets you check conditions like "is this lockfree queue empty" which are impossible to do with futex (what you wind up doing is just doing the double-check yourself and use futex just as an "Event", either that or duplicating the condition into an int for futex's benefit (but that is risky, it can race if not done right, so not recommended)).
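
For example, to wait for "this lockfree queue is non-empty" with futex, you wind up keeping a separate counter that the producer bumps on every push, and doing the double check yourself (a sketch, ignoring the memory-ordering details; queue, item and wake_count are placeholder names) :

    // consumer :
    for(;;)
    {
        if ( queue.pop(&item) )
            break;

        unsigned gate = wake_count;     // "prepare_wait" : snapshot the counter

        if ( queue.pop(&item) )
            break;                      // double check

        futex_wait(&wake_count, gate);  // sleeps only if no push bumped the counter in the gap
    }

    // producer :
    queue.push(item);
    wake_count++;                       // must be an atomic increment
    futex_signal(&wake_count);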

BTW some of the later extensions of futex are very cool, like bitset waiting and requeue.


08-01-11 | Non-mutex priority inversion

An issue I don't see discussed much is non-mutex priority inversion.

First a review of mutex priority inversion. A low priority thread locks a mutex, then loses execution. A high priority thread then tries to lock that mutex and blocks. It gives up its time slice, but a bunch of medium priority threads are available to run, so they take all the time and the low priority thread doesn't get to run. We call it "priority inversion" because the high priority thread is getting CPU time as if it was the same as the low priority thread.

Almost all operating systems have some kind of priority-inversion-protection built into their mutex. The usual mechanism goes something like this : when you block on a mutex, find the thread that currently owns it and either force execution to go to that thread immediately, or boost its priority up to the same priority as the thread trying to get the lock. (for example, Linux has "priority inheritance").

The thing is, there are plenty of other ways to get priority inversion that don't involve a mutex.

The more general scenario is : a high priority thread is waiting on some shared object to be signalled ; a low priority thread will eventually signal that object ; medium priority threads take all the time so the low priority thread can't run, and the high priority thread stays blocked.

For example, this can happen with Semaphores, Events, etc. etc.

The difficulty is that in these cases, unlike with mutexes, the OS doesn't know which thread will eventually signal the shared object to let the high priority thread go, so it doesn't know who to boost.
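
A tiny Win32 sketch of the scenario (hypothetical code purely to illustrate; the priorities and the busy work would come from your actual app) :

    #include <windows.h>

    static HANDLE s_done;

    static DWORD WINAPI low_thread(LPVOID)      // runs at THREAD_PRIORITY_LOWEST
    {
        // ... long computation ...
        SetEvent(s_done);                       // eventually signals
        return 0;
    }
    static DWORD WINAPI medium_thread(LPVOID)   // one per core, normal priority
    {
        for(;;) { /* busy work, eats all the CPU */ }
    }

    // high priority thread :
    //   WaitForSingleObject(s_done, INFINITE);
    //     <- can stall for a very long time : the OS has no idea that low_thread
    //        is the one that needs to run, so nothing gets boosted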

Windows has panic mechanisms like the "balance set manager", which looks for any thread that is ready to run (not waiting on a waitable handle) but is getting no CPU time, and forces it to get some. This will save you if you are in one of these non-mutex priority-inversions, but it takes quite a long time to kick in, so it's really a last-ditch panic save; if it happens you regret it.

Sometimes I see people talking about mutex priority inversion as if that's a scary issue; it's really not on any modern OS. But non-mutex priority inversion *is*.

Conclusion : beware using non-mutex thread flow control primitives on threads that are not of equal priority !


08-01-11 | A game threading model

Some random ideas.

There is no "main" thread at all, just a lot of jobs. (there is a "main job" in a sense, that runs once a frame and kicks off the other jobs needed to complete that frame)

Run 1 worker thread per core; all workers just run "jobs", they are all interchangeable. This is a big advantage for many reasons; for example if one worker gets swapped out (or some outside process takes over that CPU), the other workers just take over for it, there is never a stall on a specific thread that is swapped out. You don't have to switch threads just to run some job, you can run it directly on yourself. (caveat : one issue is the lost worker problem which we have mentioned before and needs more attention).

You also need 1 thread per external device that can stall (eg. disk IO, GPU IO). If the API's to these calls were really designed well for threading this would not be necessary - we need a thread per device simply to wrap the bad API's and provide a clean one out to the workers. What makes a clean API? All device IO needs to just be enqueue'd immediately and then provide a handle that you can query for results or completion. Unfortunately real world device IO calls can stall the calling thread for a long time in unpredictable ways, so they are not truly async on almost any platform. These threads should be high priority, do almost no CPU work, and basically just act like interrupts.

A big issue is how you manage locking game objects. I think the simplest thing conceptually is to do the locking at "game object" granularity, that may not be ideal for performance but it's the easiest way for people to get it right.

You clearly want some kind of reader/writer lock because most objects are read many more times than they are written. In the ideal situation, each object only updates itself (it may read other objects but only writes itself), and you have full parallelism. That's not always possible, you have to handle cross-object updates and loops; eg. A writes A and also writes B , B writes B and also writes A ; the case that can cause deadlock in a naive system.

So, all game objects are referenced through a weak-reference opaque handle. To read one you do something like :

    const Object * rdlock(ObjectHandle h)
and then rely on C's const system to try to ensure that people aren't writing to objects they only have read-locked (yes, I know const is not ideal, but if you make it a part of your system and enforce it through coding convention I think this is probably okay).

In implementation rdlock internally increments a ref on that copy of the object so that the version I'm reading sticks around even if a new version is swapped in by wrlock.

There are various ways to implement write-lock. In all cases I make wrlock take a local copy of the object and return you the pointer to that. That way rdlocks can continue without blocking, they just get the old state. (I assume it's okay for reads to get one-frame-old data) (see note *). wrunlock always just exchanges in the local object copy into the table. rdlocks that were already in progress still hold a ref to the old data, but subsequent rdlocks and wrlocks will get the new data.
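
A minimal sketch of one way to do that versioning with the C++11 atomic shared_ptr free functions (Object, ObjectHandle and the fixed-size table are assumptions here, just to show the shape; this only shows the copy-and-publish part, not the exclusion or revision-check part of wrlock discussed below) :

    #include <memory>

    struct Object { /* mutable game object state */ };
    typedef int ObjectHandle;   // index into the handle table

    static std::shared_ptr<const Object> s_table[1024];

    // rdlock : grab a ref to the current version; it stays alive even if a
    // writer swaps a new version into the table while we're reading
    std::shared_ptr<const Object> rdlock(ObjectHandle h)
    {
        return std::atomic_load( &s_table[h] );
    }

    // wrlock : take a local copy of the current version to mutate
    std::shared_ptr<Object> wrlock(ObjectHandle h)
    {
        return std::make_shared<Object>( *std::atomic_load( &s_table[h] ) );
    }

    // wrunlock : exchange the local copy in; readers already in progress keep the old version
    void wrunlock(ObjectHandle h, std::shared_ptr<Object> p)
    {
        std::atomic_store( &s_table[h], std::shared_ptr<const Object>(std::move(p)) );
    }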

One idea is like this : Basically semi-transactional. You want to build up a transaction then commit it. Game object update looks something like this :

    Transaction t;
    vector<ObjectHandle> objects_needed;
    objects_needed = self; 
    for(;;)
    {
        wrlock on all objects_needed;

        .. do your update code ..
        .. update code might find it needs to write another object, then do :

        add new_object to objects_needed
        if ( ! try_wrlock( new_object ) )
            continue; // aborts the current update and will restart with new_object in the objects_needed set

        wrunlock all objects locked
        if ( unlocks committed )
            break; // update done
    }

(in actual C++ implementation the "continue" should be a "throw" , and the for(;;) should be try/catch , because the failed lock could happen down inside some other function; also the throw could tell you what lock caused the exception).

There's two sort of variants here that I believe both work, I'm not sure what the tradeoffs are :

1. More mutex like. wrlock is exclusive, only one thread can lock an object at a time. wrunlock at the end of the update always proceeds unconditionally - if you got the locks you know you can just unlock them all, no problem. The issues is deadlock for different lock orders, we handle that with the try_lock, we abort all the locks and go back to the start of the update and retake the locks in a standardized order.

2. More transaction like. wrlock always proceeds without blocking, multiple threads can hold wrlock at the same time. When you wrunlock you check to see that all the objects have the same revision number as when you did the wrlock, and if not then it means some other commit has come in while you were running, so you abort the unlock and retry. So there's no abort/retry at lock time, it's now at unlock time.

In this simplistic approach I believe that #1 is always better. However, #2 could be better if it checked to see if the object was not actually changed (if it's a common case to take a wrlock because you thought you needed it, but then not actually modify the object).

Note that in both cases it helps to separate a game object's mutable portion from its "definition". eg. the things about it that will never change (maybe its mesh, some AI attributes, etc.) should be held to the side somehow and not participate in the wrlock mechanism. This is easy to do if you're willing to accept another pointer chase, harder to do if you want it to just be different portions of the same continuous memory block.

Another issue with this is if the game object update needs to fire off things that are not strictly in the game object transaction system. For example, say it wants to start a Job to do some path finding or something. You can't fire that right away because the transaction might get aborted. So instead you put it in the "Transaction t" thing to delay it until the end of your update, and only if your unlocks succeed do the jobs and such get run.

(* = I believe it's okay to read one frame old data. Note that in a normal sequential game object update loop, where you just do :


for each object
    object->update();

each object is reading a mix of old and new data; if it reads an item in the list before itself, it reads new data, if it reads an item after itself, it reads old data; thus whether it gets old or new data is a "race" anyway, and your game must be okay with that. Any time you absolutely must read the most recent data you can always do a wrlock instead of a rdlock ;

You can also address this in the normal way we do in games, which is to separate objects into a few groups and update them in chunks like "phase 1", then "phase 2", etc. ; objects that are all within the same phase can't rely on their temporal order, but objects in a later phase do know that they see the latest version of the earlier phase. This is the standard way to make sure you don't have one-frame-latency issues.

*).

The big issue with all this is how to ensure that you are writing correct code. The rules are :

1. rdlock returns a const * ; never cast away const

2. game object updates must only mutate data in game objects - they must not mutate global state or anything outside of the limitted transaction system. This is hard to enforce; one way might be to make it absolutely clear with a function name convention which functions are okay to call from inside object updates and which are not.

For checking this, you could set a TLS flag like "in_go_update" when you are in the for {} loop, then functions that you know are not safe in the GO loop can just do ASSERT( ! in_go_update ); which provides a nice bit of safety.
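
Something like this (a sketch; ASSERT is whatever your assert macro is) :

    // set by the worker around the game object update loop :
    thread_local bool in_go_update = false;

    void SomethingNotSafeInGOUpdate()
    {
        ASSERT( ! in_go_update );   // catches illegal calls from inside a GO update
        // ...
    }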

3. anything you want to do in game object update which is not just mutating some GO variables needs to be put into the Transaction buffer so it can be delayed until the commit goes through. Delayed transaction stuff cannot fail; eg. it doesn't get to participate in the retry/abort, so it must not require multiple mutexes that could deadlock. eg. they should pretty much always just be Job creations or destructions that are just pushes/pops from queues.

Another issue that I haven't touched on is the issue of dependencies. A GO update could be dependent on another GO or on a Job completion. You could use the freedom of the scheduling order to reschedule GOs whose dependencies aren't done for later in the tick, rather than stalling.


07-31-11 | An example that needs seq_cst ?

No, not really. I thought I found the great white whale - an algorithm that actually needs sequential consistency - but it turned out to be our old friend the StoreLoad problem.

It's worth having a quick look at because some of the issues are ones that pop up often.

I was rolling a user-space futex emulator. To test it I wrote a little mutex. A very simple mutex based on a very simplified futex might look like this :


struct futex_mutex2
{
    std::atomic<int> m_state;

    futex_mutex2() : m_state(0)
    {
    }
    ~futex_mutex2()
    {
    }

    void lock(futex_system * system)
    {
        if ( m_state($).exchange(1,rl::mo_acq_rel) )
        {
            void * h = system->prepare_wait();
            while ( m_state($).exchange(1,rl::mo_acq_rel) )
            {
                system->wait(h);
            }
            system->retire_wait();
        }
    }

    void unlock(futex_system * system)
    {
        m_state($).store(0,rl::mo_release);

        system->notify_one();
    }
};

(note that the actual "futexiness" of it is removed now for simplicity of this test ; also of course you should exchange state to a contended flag and all that, but that hides the problem, so that's removed here).

Then the super-simplified futex system (with all the actual futexiness removed, so that it's just a very simple waitset) is :


//#define MO    mo_seq_cst
#define MO  mo_acq_rel

struct futex_system
{
    HANDLE          m_handle;
    atomic<int>     m_count;

    /*************/
    
    futex_system()
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    
        m_count($).store(0);
    }
    ~futex_system()
    {
        CloseHandle(m_handle);  
    }
        
    void * prepare_wait( )
    {
        m_count($).fetch_add(1,MO);
        
        return (void *) m_handle;
    }

    void wait(void * h)
    {
        WaitForSingleObject((HANDLE)h, INFINITE);
    }

    void retire_wait( )
    {
        m_count($).fetch_add(-1,MO);
    }
    
    void notify_one( )
    {
        if ( m_count($).load(mo_acquire) == 0 ) // mo_seq_cst
            return;
        
        SetEvent(m_handle);
    }
};

So I was finding that it didn't work unless MO was seq_cst (and the load too).

The first point of note is that when I had the full futex system in there which had some internal std::mutexes - there was no bug, the ops on count($) didn't need to be seq cst. That's a common and nasty problem - if you have some ops internally that are seq_cst (such as mutex lock unlock), it can hide the fact that your other atomics are not memory ordered correctly. It was only when I removed the mutexes that the problem revealed itself, but it was actually there all along.

We've discussed this before when we asked "do mutexes need to be seq cst" ; the answer is NO if you just want them to provide mutual exclusion. But if you want them to act like an OS mutex, then the answer is YES. And the issue is that people can write code that is basically relying on the OS mutex being a barrier that provides more than just mutual exclusion.

The next point is that when I reduced the test down to just 2 threads, I still found that I needed seq_cst. That should be a tipoff that the problem does not actually arise from a need for total order. A true seq_cst problem should only show up when you go over 2 threads.

The real problem of course was here :


    void unlock(futex_system * system)
    {
        m_state($).store(0,rl::mo_release);

        //system->notify_one();

        #StoreLoad

        if ( system->m_count($).load(mo_acquire) == 0 ) // mo_seq_cst
            return;
        
        SetEvent(system->m_handle);
    }
};

we just need a StoreLoad barrier there. It should be obvious why we need a StoreLoad there but I'll be very explicit :

same as :

    void unlock(futex_system * system)
    {
        m_state($).store(0,rl::mo_release);

        int count = system->m_count($).load(mo_acquire);

        if ( count == 0 )
            return;
        
        SetEvent(system->m_handle);
    }

same as :

    void unlock(futex_system * system)
    {
        int count = system->m_count($).load(mo_acquire);

        // (*1)

        m_state($).store(0,rl::mo_release);

        if ( count == 0 )
            return;
        
        SetEvent(system->m_handle);
    }

so now at (*1) we have already loaded count and got a 0 (no waiters); then the other thread trying to lock the mutex sees state == 1, locked, so it incs count and goes to sleep, and we return, and we have a deadlock.
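
In C++11 terms the practical fix is a seq_cst fence between the store and the load, which is the closest you can get to expressing just the #StoreLoad (a sketch in plain std:: style, dropping the relacy $ decorations) :

    void unlock(futex_system * system)
    {
        m_state.store(0, std::memory_order_release);

        // #StoreLoad : the store to m_state must be visible before we load m_count
        std::atomic_thread_fence(std::memory_order_seq_cst);

        if ( system->m_count.load(std::memory_order_acquire) == 0 )
            return;

        SetEvent(system->m_handle);
    }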

As noted in the first post on this topic, there's no way to express only #StoreLoad in C++0x , so you wind up needing seq cst. Note that the case we cooked up here is almost identical to Thomasson's problem with "event count" so you can read about that :

Synchronization Algorithm Verificator for C++0x - Page 2
really simple portable eventcount... - comp.programming.threads Google Groups
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups
C++0x memory_order_acq_rel vs memory_order_seq_cst
appcoreac_queue_spsc - comp.programming.threads Google Groups


07-30-11 | A look at some bounded queues - part 2

Okay, let's look into making an MPMC bounded FIFO queue.

We can use basically the same two ideas that we worked up last time.

First let's try to do one based on the read and write indexes being atomic. Consider the consumer; the check for empty now is much more race prone, because there may be another consumer simultaneously reading, which could turn the queue into empty state while you are reading. Thus we need a single atomic moment to detect "empty" and reserve our read slot.

The most brute-force way to do this kind of thing is always to munge the two variables together. In this case we stick the read & write index into one int together. Now we can atomically check "empty" in one go. We're going to put rdwr in a 32-bit int and use the top and bottom 16 bits for the read index and write index.

So you can reserve a read slot something like this :


    nonatomic<t_element> * read_fetch()
    {
        unsigned int rdwr = m_rdwr($).load(mo_acquire);
        unsigned int rd;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            int wr = rdwr & 0xFFFF;
            
            if ( wr == rd ) // empty
                return NULL;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,rdwr+(1<<16),mo_acq_rel) )
                break;
        }
                
        nonatomic<t_element> * p = & ( m_array[ rd % t_size ] );

        return p;
    }

but this doesn't work by itself. We have succeeded in atomically checking "empty" and reserving our read slot, but now the read index no longer indicates that the read has completed, it only indicates that a reader reserved that slot. For the writer to be able to write to that slot it needs to know the read has completed, so we need to publish the read through a separate read counter.

The end result is this :


template <typename t_element, size_t t_size>
struct mpmc_boundq_1_alt
{
//private:

    // elements should generally be cache-line-size padded :
    nonatomic<t_element>  m_array[t_size];
    
    // rdwr counts the reads & writes that have started
    atomic<unsigned int>    m_rdwr;
    // "read" and "written" count the number completed
    atomic<unsigned int>    m_read;
    atomic<unsigned int>    m_written;

public:

    mpmc_boundq_1_alt() : m_rdwr(0), m_read(0), m_written(0)
    {
    }

    //-----------------------------------------------------

    nonatomic<t_element> * read_fetch()
    {
        unsigned int rdwr = m_rdwr($).load(mo_acquire);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == rd ) // empty
                return NULL;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,rdwr+(1<<16),mo_acq_rel) )
                break;
        }
        
        // (*1)
        rl::backoff bo;
        while ( (m_written($).load(mo_acquire) & 0xFFFF) != wr )
        {
            bo.yield($);
        }
        
        nonatomic<t_element> * p = & ( m_array[ rd % t_size ] );

        return p;
    }
    
    void read_consume() 
    {
        m_read($).fetch_add(1,mo_release);
    }

    //-----------------------------------------------------

    nonatomic<t_element> * write_prepare()
    {
        unsigned int rdwr = m_rdwr($).load(mo_acquire);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == ((rd + t_size)&0xFFFF) ) // full
                return NULL;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,(rd<<16) | ((wr+1)&0xFFFF),mo_acq_rel) )
                break;
        }
        
        // (*1)
        rl::backoff bo;
        while ( (m_read($).load(mo_acquire) & 0xFFFF) != rd )
        {
            bo.yield($);
        }
        
        nonatomic<t_element> * p = & ( m_array[ wr % t_size ] );
        
        return p;
    }
    
    void write_publish()
    {
        m_written($).fetch_add(1,mo_release);
    }

    //-----------------------------------------------------
    
    
};

We now have basically two read counters - one is the number of read_fetches and the other is the number of read_consumes (the difference is the number of reads that are currently in progress). Now we have the complication at the spot :

(*1) : after we reserve a read slot - we will be able to read it eventually, but the writer may not yet be done, so we have to wait for him to do his write_publish and let us know which one is done. Furthermore, we don't keep track of which thread is writing this particular slot, so we actually have to wait for all pending writes to be done. (if we just waited for the write count to increment, a later slot might get written first and we would read the wrong thing)

Now, the careful reader might think that the check at (*1) doesn't work. What they think is :

You can't wait for m_written to be == wr , because wr is just the write reservation count that we saw when we grabbed our read slot. After we grab our read slot, several writes might actually complete, which would make m_written > wr ! And we would infinite loop!

But no. In fact, that would be an issue if this was a real functioning asynchronous queue, but it actually isn't. This queue actually runs in lockstep. The reason is that at (*1) in read_fetch, I have already grabbed a read slot, but not done the read. What that means is that no writers can progress because they will see a read in progress that has not completed. So this case where m_written runs past wr can't happen. If a write is in progress, all readers wait until all the writes are done; then once the reads are in progress, any writers trying to get in wait for the reads to get done.

So, this queue sucks. It has an obvious "wait" spin loop, which is always bad. It also is an example of "apparently lockfree" code that actually acts just like a mutex. (in fact, you may have noticed that the code here is almost identical to a ticket lock - that's in fact what it is!).
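
For comparison, here is a minimal ticket lock sketch (plain std::atomic and std memory orders rather than the Relacy-style wrappers used above); the ticket counter plays the same role as the rdwr reservation counter, and the "serving" counter plays the same role as m_read/m_written :

#include <atomic>

struct ticket_lock
{
    std::atomic<unsigned int> m_next;    // next ticket to hand out (the reservation counter)
    std::atomic<unsigned int> m_serving; // ticket currently being served (the publication counter)

    ticket_lock() : m_next(0), m_serving(0) { }

    void lock()
    {
        // grab a ticket (analogous to the CAS that reserves a slot) :
        unsigned int me = m_next.fetch_add(1,std::memory_order_relaxed);
        // spin-wait until it's my turn (the same kind of wait as at (*1)) :
        while ( m_serving.load(std::memory_order_acquire) != me )
        {
        }
    }

    void unlock()
    {
        // analogous to read_consume/write_publish :
        m_serving.fetch_add(1,std::memory_order_release);
    }
};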

How do we fix it? Well one obvious problem is when we wait at *1 we really only need to wait on that particular item, instead of all pending ops. So rather than a global read count and written count that we publish to notify that we're done, we should have a flag or a count in the slot, more like our second spsc bounded queue.

So we'll leave the "rdwr" single variable for where the indexes are, and we'll just wait on publication per slot :


template <typename t_element, size_t t_size>
struct mpmc_boundq_2
{
    enum { SEQ_EMPTY = 0x80000 };

    struct slot
    {
        atomic<unsigned int>    seq;
        nonatomic<t_element>    item;
        char pad[ LF_CACHE_LINE_SIZE - sizeof(t_element) - sizeof(unsigned int) ];
    };

    slot m_array[t_size];

    atomic<unsigned int>    m_rdwr;

public:

    mpmc_boundq_2() : m_rdwr(0)
    {
        for(int i=0;i<t_size;i++)
        {
            int next_wr = i& 0xFFFF;
            m_array[i].seq($).store( next_wr | SEQ_EMPTY , mo_seq_cst );
        }
    }

    //-----------------------------------------------------

    bool push( const t_element & T )
    {
        unsigned int rdwr = m_rdwr($).load(mo_relaxed);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == ((rd + t_size)&0xFFFF) ) // full
                return false;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,(rd<<16) | ((wr+1)&0xFFFF),mo_relaxed) )
                break;
        }
        
        slot * p = & ( m_array[ wr % t_size ] );
        
        // wait if reader has not actually finished consuming it yet :
        rl::backoff bo;
        while ( p->seq($).load(mo_acquire) != (SEQ_EMPTY|wr) )
        {
            bo.yield($);
        }
        
        p->item($) = T; 
        
        // this publishes that the write is done :
        p->seq($).store( wr , mo_release );
        
        return true;
    }

    //-----------------------------------------------------
    
    bool pop( t_element * pT )
    {
        unsigned int rdwr = m_rdwr($).load(mo_relaxed);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == rd ) // empty
                return false;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,rdwr+(1<<16),mo_relaxed) )
                break;
        }
        
        slot * p = & ( m_array[ rd % t_size ] );
        
        rl::backoff bo;
        while ( p->seq($).load(mo_acquire) != rd )
        {
            bo.yield($);
        }
        
        *pT = p->item($);
        
        // publish that the read is done :
        int next_wr = (rd+t_size)& 0xFFFF;
        p->seq($).store( next_wr | SEQ_EMPTY , mo_release );
    
        return true;
    }
    
};

Just to confuse things I've changed the API to a more normal push/pop, but this is identical to the first queue except that we now wait on publication per slot.

So, this is a big improvement. In particular we actually get parallelism now: while one write is stuck waiting because the read on its slot is still pending, another write can go ahead and proceed.

"mpmc_boundq_1_alt" suffered from the bad problem that if one reader swapped out during its read, then all writers would be blocked from proceeding (and that would then block all other readers). Now, we no longer have that. If a reader is swapped out, it only blocks the write of that particular slot (and of course blocks if you wrap around the circular buffer).

This is still bad, because you basically have a "wait" on a particular thread and you are just spinning.

Now, if you look at "mpmc_boundq_2" you may notice that the operations on the "rdwr" indexes are actually relaxed memory order - they need to be atomic RMW's, but they actually are not the gate for access and publication - the "seq" variable is now the gate.

This suggests that we could make the read and write indexes separate variables that are only owned by their particular side - like "spsc_boundq2" from the last post , we want to detect the full and empty conditions by using the "seq" variable in the slots, rather than looking across at the reader/writer indexes.

So it's obvious we can do this a lot like spsc_boundq2 ; the reader index is owned only by reader threads; we have to use an atomic RMW because there are now many readers instead of one. Publication and access checking is done only through the slots.

Each slot contains the index of the last access to that slot + a flag for whether the last access was a read or a write :


template <typename t_element, size_t t_size>
struct mpmc_boundq_3
{
    enum { SEQ_EMPTY = 0x80000 };
    enum { COUNTER_MASK = 0xFFFF };

    struct slot
    {
        atomic<unsigned int>    seq;
        nonatomic<t_element>    item;
        char pad[ LF_CACHE_LINE_SIZE - sizeof(t_element) - sizeof(unsigned int) ];
    };

    // elements should generally be cache-line-size padded :
    slot m_array[t_size];
    
    atomic<unsigned int>    m_rd;
    char m_pad[ LF_CACHE_LINE_SIZE ];
    atomic<unsigned int>    m_wr;

public:

    mpmc_boundq_3() : m_rd(0), m_wr(0)
    {
        for(int i=0;i<t_size;i++)
        {
            int next_wr = i & COUNTER_MASK;
            m_array[i].seq($).store( next_wr | SEQ_EMPTY , mo_seq_cst );
        }
    }

    //-----------------------------------------------------

    bool push( const t_element & T )
    {
        unsigned int wr = m_wr($).load(mo_acquire);
        slot * p = & ( m_array[ wr % t_size ] );
        rl::backoff bo;
        for(;;)
        {       
            unsigned int seq = p->seq($).load(mo_acquire);
            
            // if it's flagged empty and the index is right, take this slot :
            if ( seq == (SEQ_EMPTY|wr) )
            {
                // try acquire the slot and advance the write index :
                if ( m_wr($).compare_exchange_weak(wr,(wr+1)& COUNTER_MASK,mo_acq_rel) )
                    break;
            
                // contention, retry
            }
            else
            {
                // (*2)
                return false;
            }
            
            p = & ( m_array[ wr % t_size ] );

            // (*1)
            bo.yield($);
        }

        RL_ASSERT( p->seq($).load(mo_acquire) == (SEQ_EMPTY|wr) );
        
        // do the write :
        p->item($) = T; 
        
        // this publishes it :
        p->seq($).store( wr , mo_release );
        
        return true;
    }

    //-----------------------------------------------------
    
    bool pop( t_element * pT )
    {
        unsigned int rd = m_rd($).load(mo_acquire);
        slot * p = & ( m_array[ rd % t_size ] );
        rl::backoff bo;
        for(;;)
        {       
            unsigned int seq = p->seq($).load(mo_acquire);
            
            if ( seq == rd )
            {
                if ( m_rd($).compare_exchange_weak(rd,(rd+1)& COUNTER_MASK,mo_acq_rel) )
                    break;
            
                // retry
            }
            else
            {
                return false;
            }
            
            p = & ( m_array[ rd % t_size ] );

            bo.yield($);
        }
                            
        RL_ASSERT( p->seq($).load(mo_acquire) == rd );
        
        // do the read :
        *pT = p->item($);
        
        int next_wr = (rd+t_size) & COUNTER_MASK;
        p->seq($).store( next_wr | SEQ_EMPTY , mo_release );
    
        return true;
    }
    
};

So our cache line contention is pretty good. Only readers pass around the read index; only writers pass around the write index. The gate is on the slot that you have to share anyway. It becomes blocking only when near full or near empty. But all is not roses.

Some notes :

(*1) : the yield loop here might look analogous to before, but in fact you only loop here for RMW contention - that is, this is not a "spin wait" , it's a lockfree-contention-spin.

(*2) : when do we return false here? When the queue is full. But that's not all. We also return false when there is a pending read on this slot. In fact our spin-wait loop still exists and it's just been pushed out to the higher level.

To use this queue you have to do :


while ( ! push(item) ) {
    // wait-spin here
}

which is the spin-wait. The wait on a reader being done with your slot is inherent to these methods and it's still there.

What if we only want to return false when the queue is full, and spin for a busy wait ? We then have to look across at the reader index to check for full vs. read-in-progress. It looks like this :


    bool push( const t_element & T )
    {
        unsigned int wr = m_wr($).load(mo_acquire);
        rl::backoff bo;
        for(;;)
        {
            slot * p = & ( m_array[ wr % t_size ] );
        
            unsigned int seq = p->seq($).load(mo_acquire);
            
            if ( seq == (SEQ_EMPTY|wr) )
            {
                if ( m_wr($).compare_exchange_weak(wr,(wr+1)& COUNTER_MASK,mo_acq_rel) )
                    break;
            
                // retry spin due to RMW contention
            }
            else
            {
                // (*3)
                if ( seq <= wr ) // (todo : doesn't handle wrapping)
                {
                    // full?
                    // (*4)
                    unsigned int rd = m_rd($).load(mo_acquire);
                    if ( wr == ((rd+t_size)&COUNTER_MASK) )
                        return false;
                }
                            
                wr = m_wr($).load(mo_acquire);

                // retry spin due to read in progress
            }
            
            bo.yield($);
        }
        
        slot * p = & ( m_array[ wr % t_size ] );
        
        // wait if reader has not actually finished consuming it yet :
        RL_ASSERT( p->seq($).load(mo_acquire) == (SEQ_EMPTY|wr) );
        
        p->item($) = T; 
        
        // this publishes it :
        p->seq($).store( wr , mo_release );
        
        return true;
    }

which has the two different types of spins. (notes on *3 and *4 in a moment)

What if you try to use this boundedq to make a queue that blocks when it is empty or full?

Obviously you use two semaphores :


template <typename t_element, size_t t_size>
class mpmc_boundq_blocking
{
    mpmc_boundq_3<t_element,t_size> m_queue;

    fastsemaphore   pushsem;
    fastsemaphore   popsem;

public:
    
    mpmc_boundq_blocking() : pushsem(t_size), popsem(0)
    {
    }
    
    void push( const t_element & T )
    {
        pushsem.wait();
    
        bool pushed = m_queue.push(T);
        RL_ASSERT( pushed );
        
        popsem.post();
    }
    
    void pop( t_element * pT )
    {
        popsem.wait();
        
        bool popped = m_queue.pop(pT);
        RL_ASSERT( popped );
                
        pushsem.post(); 
    }
    
};

now push() blocks when it's full and pop() blocks when it's empty. The asserts are only correct when we use the modified push/pop that correctly checks for full/empty and spins during contention.

But what's up with the full check? At (*3) we see that the slot we are trying to write still holds the sequence number of a previous write that has not been read yet. So we must be full already, right? We have semaphores telling us whether slots are available, why are they not reliable? Why do we need (*4) as well?

Because reads can go out of order.


    say the queue was full
    then two reads come in and grab slots, but don't finish their read yet
    then only the second one finishes its read
    it posts the semaphore
    so I think I can write
    I grab a write slot
    but it's not empty yet
    it's actually a later slot that's empty
    and I need to wait for this reader to finish
    
Readers with good memory might remember this issue from our analysis of Thomasson's simple MPMC which was built on two semaphores - and had this exact same problem. If a reader is swapped out, then other readers can post the pushsem, so writers will wake up and try to write, but the first writer can't make progress because its slot is still in use by a swapped out reader.

Note that this queue that we've wound up with is identical to Dmitry's MPMC Bounded Queue .

There are a lot of hidden issues with it that are not at all apparent from Dmitry's page. If you look at his code you might not notice that : 1) enqueue doesn't only return false when full, 2) there is a cas-spin and a wait-spin together in the same loop, 3) while performance is good in the best case, it's arbitrarily bad if a thread can get swapped out.

Despite their popularity, I think all the MPMC bounded queues like this are a bad idea in non-kernel environments ("kernel-like" to me means you have manual control over which threads are running; eg. you can be sure that the thread you are waiting on is running; some game consoles are "kernel-like" BTW).

(in contrast, I think the spsc bounded queue we did last time is quite good; in fact this was a "rule of thumb" that I posted back in the original lockfree series ages ago - whenever possible avoid MPMC structures, prefer SPSC, even multi-plexed SPSC , hell even two-mutex-protected SPSC can be better than MPMC).
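
To be concrete about that last option, here is a rough sketch of a "two-mutex-protected SPSC" wrapper (using the spsc_boundq2 from the post below; the wrapper itself and its name are mine). One mutex serializes all producers and another serializes all consumers, so each side only ever contends with itself :

#include <mutex>

template <typename t_element, size_t t_size>
struct two_mutex_boundq
{
    spsc_boundq2<t_element,t_size> m_q;
    std::mutex m_push_mutex;    // serializes producers
    std::mutex m_pop_mutex;     // serializes consumers

    bool push( const t_element & T )
    {
        std::lock_guard<std::mutex> lock(m_push_mutex);
        nonatomic<t_element> * p = m_q.write_prepare();
        if ( ! p ) return false; // full
        *p = T;                  // (with the relacy wrappers you'd go through ($))
        m_q.write_publish();
        return true;
    }

    bool pop( t_element * pT )
    {
        std::lock_guard<std::mutex> lock(m_pop_mutex);
        nonatomic<t_element> * p = m_q.read_fetch();
        if ( ! p ) return false; // empty
        *pT = *p;
        m_q.read_consume();
        return true;
    }
};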


07-29-11 | A look at some bounded queues

A common primitive is a FIFO queue built on an array, so it doesn't do allocations, doesn't have to worry about ABA or fencing the allocator. Let's have a look (code heavy).

In all cases we use an array that we treat circularly. We track a read index and a write index. The queue is empty if read == write. The queue is full if (write - read) == size. (the exact details of checking these conditions race-free depend on the queue).

As always, the SPSC case is simplest.

The first method makes the read & write indexes shared variables, and the actual array elements can be non atomic (the read/write indexes act as mutexes to protect & publish the contents).


template <typename t_element, size_t t_size>
struct spsc_boundq
{
    // elements should generally be cache-line-size padded :
    nonatomic<t_element>  m_array[t_size];
    
    typedef int index_type;
    //typedef char index_type; // to test wrapping of index

    char m_pad1[LF_CACHE_LINE_SIZE];    
    atomic<index_type>  m_read;
    char m_pad2[LF_CACHE_LINE_SIZE];
    atomic<index_type>  m_write;

public:

    spsc_boundq() : m_read(0), m_write(0)
    {
    }

    //-----------------------------------------------------

    nonatomic<t_element> * read_fetch()
    {
        // (*1)
        index_type wr = m_write($).load(mo_acquire);
        index_type rd = m_read($).load(mo_relaxed);
        
        if ( wr == rd ) // empty
            return NULL;
        
        nonatomic<t_element> * p = & ( m_array[ rd % t_size ] );    
        return p;
    }
    
    void read_consume()
    {
        // (*2) cheaper than fetch_add :
        index_type rd = m_read($).load(mo_relaxed);
        m_read($).store(rd+1,mo_release);
    }
    
    //-----------------------------------------------------
    
    nonatomic<t_element> * write_prepare()
    {
        // (*1)
        index_type rd = m_read($).load(mo_acquire);
        index_type wr = m_write($).load(mo_relaxed);
        
        if ( (index_type)(wr - rd) >= t_size ) // full
            return NULL;
        
        nonatomic<t_element> * p = & ( m_array[ wr % t_size ] );    
        return p;
    }
    
    void write_publish()
    {
        // cheaper than fetch_add :
        index_type wr = m_write($).load(mo_relaxed);
        m_write($).store(wr+1,mo_release);
    }
};

Instead of copying out the value, I've separated the "fetch" and "consume" so that a reader does :

  p = read_fetch();
  do whatever you want on p, you own it
  read_consume();

of course you can always just copy out p at that point if you want, for example :

t_element read()
{
    nonatomic<t_element> * p = read_fetch();
    if ( ! p ) throw queue_empty;
    t_element copy = *p;
    read_consume();
    return copy;
}

but that's not recommended.
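
The write side is symmetric; a sketch of the copy-in version, returning false when full, with the same caveat :

bool write( const t_element & x )
{
    nonatomic<t_element> * p = write_prepare();
    if ( ! p ) return false;  // full
    *p = x;
    write_publish();
    return true;
}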

Notes :

*1 : this is the only subtle part of this queue; note the order of the loads is different for the read side and the write side. The crucial thing is that if there is a race there, you need the read side to err on the side of saying it's empty, while the write side errs on the side of saying it's full.

(this is a general theme of lockfree algorithm design; you don't actually eliminate races, there are of course still races, but you make the races benign, you control them. In general you can't know cross-thread conditions exactly, there's always a fuzzy area and you have to put the fuzz in the right place. So in this case, the reader thread cannot know "the queue is empty or not" , it can only know "the queue is definitely not empty" vs. "the queue *might* be empty" ; the uncertainty got put into the empty case, and that's what makes it usable).

*2 : you can just do a load and a store instead of an RMW here because this is SPSC - we know nobody else can be updating the index.

So, this implementation has no atomic RMW ops, only loads and stores to shared variables, so it's reasonably cheap. But let's look at the cache line sharing.

It's not great. Every time you publish or consume an item, the cache line containing "m_write" has to be transferred from the writer to the reader so it can check the empty condition, and the cache line containing "m_read" has to be transferred from the reader to the writer. Of course the cache line containing t_element has to be transferred as well, and we assume t_element is cache line sized.

(note that this is still better than if both threads were updating both variables - or if you fail to put them on separate cache lines; in that case they have to trade off holding the cache line for exclusive access, they swap write access to the cache line back and forth; at least this way each line is always owned-for-write by the same thread, and all you have to do is send out an update when you dirty it) (on most architectures the difference is very small, I believe)

We can in fact do better. Note that the reader only needs to see "m_write" to check the empty condition. The cache line containing the element has to be moved back and forth anyway, so maybe we can get the empty/queue condition into that cache line?

In fact it's very easy :


template <typename t_element, size_t t_size>
struct spsc_boundq2
{
    struct slot
    {
        nonatomic<t_element>    item;
        atomic<int>             used;
        char pad[LF_CACHE_LINE_SIZE - sizeof(t_element) - sizeof(int)];
    };
    
    slot  m_array[t_size];
    
    typedef int index_type;
    //typedef char index_type; // to test wrapping of index

    char m_pad1[LF_CACHE_LINE_SIZE];    
    nonatomic<index_type>   m_read;
    char m_pad2[LF_CACHE_LINE_SIZE];    
    nonatomic<index_type>   m_write;

public:

    spsc_boundq2() : m_read(0), m_write(0)
    {
        for(int i=0;i<t_size;i++)
        {
            m_array[i].used($).store(0);
        }
    }

    //-----------------------------------------------------

    nonatomic<t_element> * read_fetch()
    {
        index_type rd = m_read($);
        slot * p = & ( m_array[ rd % t_size ] );    
        
        if ( ! p->used($).load(mo_acquire) )
            return NULL; // empty
        
        return &(p->item);
    }
    
    void read_consume()
    {
        index_type rd = m_read($);
        slot * p = & ( m_array[ rd % t_size ] );    
        
        p->used($).store(0,mo_release);
        
        m_read($) += 1;
    }
    
    //-----------------------------------------------------
    
    nonatomic<t_element> * write_prepare()
    {
        index_type wr = m_write($);
        slot * p = & ( m_array[ wr % t_size ] );    
        
        if ( p->used($).load(mo_acquire) ) // full
            return NULL;
        
        return &(p->item);
    }
    
    void write_publish()
    {
        index_type wr = m_write($);
        slot * p = & ( m_array[ wr % t_size ] );    
        
        p->used($).store(1,mo_release);
        
        m_write($) += 1;
    }
};

Note that m_read & m_write are now non-atomic - m_read is owned by the reader thread and m_write by the writer; they are not shared.

The shared variable is now "used" which is in each element. It's simply a binary state that acts like a mutex; it indicates whether it was last read or last written and it toggles back and forth as you progress. If you try to read a slot that was last read, you're empty; if you try to write a slot you last wrote, you're full.

This version has much better cache line sharing behavior and is the preferred SPSC bounded queue.

Next time : MPMC variants.


07-29-11 | Semaphore Work Counting issues

Say you are using something like "fastsemaphore" to count work items that you have issued to some worker threads. Some issues that you may want to consider :

1. "thread thrashing". Say you are issuing several work items, so something like :

queue.push();
sem.post();

queue.push();
sem.post();

queue.push();
sem.post();
this can cause bad "thread thrashing" where a worker thread wakes up, does a work item, sees the queue is empty, goes back to sleep, then you wake up, push the next item, wake the worker, etc.. This can happen for example if the work items are very tiny and your work issuer is not getting them out fast enough. Or it can happen just if the worker has >= the priority of the issuer, especially on Windows where the worker may have some temporary priority boost (for example because it hasn't run in a while), then your sem.post() might immediately swap out your thread and swap in the worker, which is obviously very bad.

The solution is just to batch up the posts, like :

queue.push();
queue.push();
queue.push();

sem.post(3);
note that just calling post three times in a row doesn't necessarily do the trick, you need the single post of 3.

2. If you have several workers and some are awake and some are asleep, you may wish to spin a bit before waking the sleeping workers to see if the awake ones took the work already. If you don't do this, you can get another type of thrashing where you post a work, wake a worker, a previously running worker finishes his job and grabs the new one, now the newly awake worker sees nothing to do and goes back to sleep.

You can handle this by spinning briefly between the sem increment and the actual thread waking to see if someone grabs it in that interval. Note that this doesn't actually fix the problem of course, because this is an inherent race situation. Because the thread wake takes a long time, it is still quite possible that the work queue is empty by the time the new worker wakes up. (to do better you would have to have more information about how long the work item takes, what other work there is to do, etc.)

3. A related case is when a worker sees no work to do and is thinking about going to sleep; he can spin there between seeing the queue empty and actually sleeping to see if some work becomes available during that interval.

I should note that this kind of spinning for optimization is not an unambiguous win, and it's very hard to really measure.

In benchmark/profiling scenarios it can seem to increase your performance a lot. But that's a bit unrealistic; in benchmark scenarios you would do best by giving all your threads infinite priority and locking them to cores, and never letting them go to sleep.

Basically the spinning in these cases takes away a bit of time from other threads. Depending on what other threads you have to run, you can actually hurt performance.


07-29-11 | Spinning

It's important to distinguish the two different reasons for spinning in threaded code; it's unfortunate that we don't have different names for them.

Spinning due to a contended RMW operation is not generally horrible, though this can be a bit subtle.

If you know that the spin can only happen when some other thread made progress, that's the good case. (though you can still livelock if the "progress" that the other thread made is in a loop)

One source of problems is that on LL/SC architectures the reservation is usually for the whole cache line. That means somebody else could be fiddling with things on your cache line and you might abort your SC without ever making progress. eg. something like :


atomic<int> lock;
int spins; // on same cache line

many threads do :

for(;;)
{
  if ( CAS(lock,0,1) ) break;

  spins++;
}

this looks like just a spinlock that counts its spins, but in fact it could be a livelock because the write to "spins" invalidates the cacheline of someone trying to do the CAS, and none of the CAS's ever make progress.

There are also plenty of "lockfree" code snippets posted around the net that are actually more like spin-locks, which we shall discuss next.

The other type of spinning is waiting for some other thread to set some condition.

I believe that this type of spinning basically has no place in games. It should really only be used in kernel environments where you can ensure that the thread you are waiting on is currently running. If you can't do that, you might be spinning on a thread which is swapped out, which could be a very long spin indeed.

These can come up in slightly subtle places. One example we've seen is this MPMC queue ; the spin-backoff there is actually a wait on a specific thread, and is thus very bad.

Any time you are actually spinning waiting on a specific thread to do something, you need to actually use an OS Wait on that thread to make sure it gets CPU time.

Relacy's context-bound test will tell you any time you have a spin without a backoff.yield call , so that's nice. Then you have to look at any place you do a backoff.yield and try to see if it's waiting for a thread to set a condition, or only spinning if another thread made progress during that time.

Interestingly a simple spinlock illustrates both types of spins :


for(;;)
{
    if ( lock == 1 )
    {
        // spin wait, bad
        continue;
    }

    if ( ! cas(lock,0,1) )
    {
        // spin contention, okay
        continue;
    }

    break;
}

Sometimes "spin waits" can be well hidden in lockfree code. The question you should ask yourself is :

If one of the threads is swapped out (stops running), is my progress blocked?

If that is true, it's a spin wait, and that's bad.

To be clear, whenever what you are really doing is a "wait" - that is, I need this specific other thread to make progress before I can - you really want to explicitly enforce that; either by using an OS wait, or by knowing what thread it is you're waiting on, etc.

A related topic is that "exponential backoff" type stuff in spins is over-rated. The idea of backoff is that you spin, doing something like :


first 10 times - mm_pause()
next 5 times - SwitchToThread()
then Sleep(1), then Sleep(2) , etc..
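
Concretely, that schedule might look something like this (a sketch; the thresholds are arbitrary and the Windows calls are just the obvious ones) :

#include <windows.h>
#include <emmintrin.h>  // _mm_pause

struct spin_backoff
{
    int m_count;

    spin_backoff() : m_count(0) { }

    void yield()
    {
        if ( m_count < 10 )
            _mm_pause();           // stay on the CPU, just be polite to the core
        else if ( m_count < 15 )
            SwitchToThread();      // give the rest of the timeslice to another ready thread
        else
            Sleep( m_count - 14 ); // actually sleep, for longer and longer
        m_count++;
    }
};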

this is sort of okay, because if the other thread is running you act like a spin and make progress quickly, while if the other thread is asleep you will eventually go to sleep and give it some CPU time.

But in reality this is shit. The problem is that if you *ever* actually hit the Sleep(1) , that's a disaster. And of course you could have to sleep much longer. Say for example you have 10 threads and you are waiting on a specific one (without using a proper OS wait). When you Sleep, you might not get the thread you want to run; you might have to sleep all your other threads too to finally get it to go.

It's just sloppy threading programming and the performance is too unpredictable for games.


07-29-11 | The problem with software

The problem with software is that it allows you to do lots of complicated things that you probably shouldn't.

Why is adaptive power steering in cars horrible? Why is an integrated computer-controlled console horrible? Why is a computer-controlled refrigerator or dishwasher always horrible? Or a CPU running your stereo with a touch screen.

There's no inherent reason that computers should make these things worse. But they always do. Computers always make things much much worse.

The reason is that the designers just can't resist the temptation to fuck things up. They could just make it so that when you turn it on, it instantly shows you buttons for the things they want to do. But no, they think "hey, we could show a video here and play some music when you turn it on". NO! Don't do it!. Or, when you turn your steering wheel they could just turn the car wheels by a proportional amount, but they think "well we've got all this computer power, how about if we detect if you've been driving aggressively recently and dynamically adjust the maps". NO! Don't do it!

Furthermore, in practice adding computers is almost always done as a cost-cutting mechanism. They see it as an opportunity to make the mechanical function of the device much worse and then compensate for it through software optimization. (see for example, removal of mechanical differentials and replacement with computer "e-diffs"). It doesn't actually work.

I was thinking about CG in movies and how uniformly horrible it is, and I think it's roughly the same problem. It's a sad fact that models still look massively better than CG (for rigid objects anyway, buildings and space ships and such). I've been watching "Game of Thrones" and it looks absolutely beautiful most of the time - the costumes are great, the sets are great - and then there's some fucking CG shot and I want to vomit. The space shots of the Enterprise in TNG from like the late 80's still look amazing (when they aren't firing lazers or any of the bad CG overlays) - just the model shots.

Part of the advantage of models is that it forces you to be conservative. With CG you can choose to make your spiderman have weird stretchy motion - but you shouldn't. You can choose to make a chrome spaceship - but you shouldn't. You can choose to make the camera fly inside an exploding skull - but you shouldn't.

I also think there may have been an advantage with models in that they were difficult to make, and that meant you had to hire actual artists and craftsmen who had honed their skills and had some taste, as opposed to CG where the barrier to entry is so low and it's so easy to change, it makes it much easier for a director to fuck things up, or for teenagers with no artistry to make the shots.


07-27-11 | Car News

The continuing series in which I pick up the slack for the fucking pathetically awful car journalists of the world, who cannot even get the basic facts that are in the fucking press release right, much less actually tell you anything subtle or revelatory about cars.

08+ Subaru WRX's have open front & rear diffs and a viscous center diff. They use an "e-diff" (aka fucking horrible computer-controlled braking and throttle cuts) to limit side-to-side wheelspin. So basically the famous Subaru "4wd" is no 4wd at all. A viscous center diff + open front and rear diffs is a real shit system, it's almost useless and definitely not fun to drive, for example it doesn't allow throttle steering ; pre-08 WRX's had proper LSD's for front & rear at least optional. It does seem you can swap out the front & rear diffs to Quaifes reasonably cheaply, though this is not a common mod (most of the modders, as always, are posers who don't actually want a good car). This seems to be part of the general de-rallification and cheapening of the WRX post 08.

So in summary, modern Subarus (a brand whose whole image is having good AWD) have terrible AWD systems. (the STI is still okay).

I like the way the BMW 1M is basically mechanically unstable - without traction control it's probably too oversteer biased to sell in the modern world - but you can still use it safely because of the electronic protection. IMO that's a nice way to build a modern sports car (which have to be sold understeer biased generally), as long as you can turn the electronics all the way off to have fun. I'm not delighted by the reports that it still has the 135 numb steering, the electronic variable power steering, the ECU mapped throttle, all that nonsense.

But I'm also disenchanted of stiff cars like the 1M or Cayman R. My car is already way too stiff, and those are much stiffer. Making a car handle well by making it stiff is just not interesting.

The next Porsche 911 (2012 991) will finally have electric power steering, which is a bit of a bummer. Almost all modern cars do, but it generally feels like ass. The BMW M cars have a hybrid system with an electric pump still running a hydraulic system (BMW non-M's are fully electric). The reason for all this is that apparently the power steering pump is a massive parasitic drain, almost 5% of the engine's power, which is bad for acceleration as well as fuel economy. But hydraulic power steering just feels good! I've never driven a car with direct mechanical rack steering, I bet that's wonderful.

I think the new Z06 vette is probably the greatest performance bargain right now; they've moved a lot of the ZR1 goodies down the line. Still, the real bargain is probably in buying a base vette (or a "grand sport") and upgrading a few parts yourself; change the brakes, change the suspension, install a dry sump, and you're good to go, and it's much cheaper. It's just staggeringly good, there's really nothing even close for the price, or really for any price! Huge power, RWD, good balance, front mid-engine, light weight (just over 3100). The base Vette is an LS3, Z06 (C6 not C5) is an LS7, ZR1 is an LS9. Newer "Grand Sport" LS3's and Z06/ZR1 have a dry sump. The older cars didn't have a dry sump and it was a frequent cause of engine explosion :

Let me ask another way, who's LS3 HASNT blown up on the track - Corvette Forum
Let me ask another way, who's LS3 HASNT blown up on the track - Page 7 - Corvette Forum
2009 Z06 dry sump vs ARE dry sump - LS1TECH

(this is not unusual, most factory sports cars have some kind of problem that will make them blow up under track use (non-GT3 Porsches for example, and Caymans in particular)) (apparently even the new vettes with factory dry sump can blow up if driven hard enough with slicks, but if you're going into real racing, you didn't expect the factory oiling system to handle it, did you?)

A lot of the doofy car press who like to just repeat cliches without thinking about it like to complain about shit like the "leaf springs" in the vette. If you really don't like them, you can swap the suspension for a few grand, and the car is still cheap as hell. Most of those fancy European suspensions are just Bilstein or KW or whatever off-the-shelf kits, so you can easily have the same. Other doofs like to complain about the "push rod block" with its supposedly ancient technology, or that it takes "umpteen liters of engine" to make power that Europeans make in many fewer litres. That's silly. Who cares how many litres of displacement an engine has? (it's funny that the same doofies will turn around and brag about the giant 6.3L engine in their Merc ; wait, is displacement good or bad? I forget? Oh, actually it's irrelevant and you're an inconsistent moron). The thing that's important is how much does the engine weigh, and how big is it, and in fact, those big 7 L vette engines are physically quite small and light compared to even 4 L european engines.

(part of the reason most euro engines are small is that they tax by displacement, which is retarded; taxing by emissions makes sense, but attaching rules to things that aren't the final metric just creates artificial incentives and stifles creative solutions)

The reason vette engines can be so small and light is because of the magic of pushrods. A modern DOHC euro engine has 4 cam shafts - two for each cylinder bank in a V - with chains running from the crankshaft to the cams to keep timing. The cams are mounted on the tops of the cylinder banks, and this all makes the engine quite large, heavy, and complex. (timing chain guide or tension problems are a common issue with all modern euro engines). In contrast, a push rod motor only has one cam shaft that is directly mounted right next to the crankshaft, between the V's of the cylinder banks. This makes it very compact, and eliminates the chain timing problems.

DOHC engines are complex and have delicate long chains, and are volumetrically much larger than their displacement :

Vette LS engines are very compact, with a low center of mass :

See also :
Pushrod Engine - YouTube
3D animation of a fuel injected V8 - YouTube
Nice comparison diagrams

The disadvantage of pushrods is that you can't run 4 valves per cylinder and you can't run complicated adaptive valve timing, but who cares? The drawback is that it's hard to make them low emission / high mileage and also big power ; certainly the car nuts who intentionally ruin their emission systems shouldn't care. Fetishization of the number of valves or whatever is retarded; the only thing that matters is power to weight. If the power to weight is good, it's a good engine. And the GM LS engine has excellent power to weight, is very cheap to make and maintain, has a wide power band, and is durable if oiled properly.

If there's a valid complaint about the LS engines (other than environmental), it's that flat power curves just somehow don't feel as good as peaky ones. It's satisfying to have to wring out an engine to the top of its rev range; an engine that just burbles and takes off at any speed is just too easy. This is one of those cases where I feel like only the Japanese really get it right with cars like the S2000, having to make lots of shifts and get the revs screaming is part of the visceral, tactile, involved fun of a sports car.

It's the very compact geometry of the LS engine that makes it possible to swap into old cars that had much smaller displacement engines originally (Porsche 914's, Nissan 240's, Mazda RX7's, etc.). A big DOHC V8 doesn't fit in those cars but the pushrod LS does. (that's not the only reason for its popularity of course - it's very cheap, powerful, easy to work on, etc.).


07-27-11 | CALL_IMPORT

Syntactically friendly way to call manual Windows imports :

template <typename t_func_type>
t_func_type GetWindowsImport( t_func_type * pFunc , const char * funcName, const char * libName )
{
    if ( *pFunc == 0 )
    {
        HMODULE m = GetModuleHandle(libName);
        if ( m == 0 ) m = LoadLibrary(libName); // adds extension for you
        ASSERT_ALWAYS( m != 0 );
        t_func_type f = (t_func_type) GetProcAddress( m, funcName );
        // not optional : (* should really be a throw)
        ASSERT_ALWAYS( f != 0 );
        *pFunc = f;
    }
    return (*pFunc); 
}

#define CALL_IMPORT(lib,name) (*GetWindowsImport(&RR_STRING_JOIN(fp_,name),RR_STRINGIZE(name),lib))
#define CALL_KERNEL32(name) CALL_IMPORT("kernel32",name)

so then instead of doing

InitializeConditionVariable(&cv);

you do

// put this somewhere :
VOID (WINAPI *fp_InitializeConditionVariable) (PCONDITION_VARIABLE ) = NULL;

// call like this :
CALL_KERNEL32(InitializeConditionVariable)(&cv);

which is not too bad. (of course you can hide the difference completely by doing

#define InitializeConditionVariable CALL_KERNEL32(InitializeConditionVariable)

so that the client code looks identical to a real lib call; that way you can just have something like an #if PLATFORMSDK < 7.1 somewhere that makes the imports for you, and the client code doesn't have to change at all when it goes from being a lib import to a GetProcAddress manual import.)

Of course if you are using real C++ then when GetProcAddress fails to find the function it should throw.

Also : warning : if you use this on non-built-in-libs (eg. if you used it on something like "psapi" as opposed to "kernel32" or "user32" or whatever) then there is actually a race that could cause a crash. The problem is that GetModuleHandle doesn't inc the ref on the lib, so it could get unloaded while you are calling it. A more fully correct implementation would return a proxy object that holds a ref on the lib on the stack, that way the lib is kept ref'ed for the duration of the function call.
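
A sketch of what that proxy could look like (the names here are mine; LoadLibrary/FreeLibrary do the actual ref counting) :

#include <windows.h>

struct module_ref
{
    HMODULE m_mod;

    explicit module_ref( const char * libName )
    {
        m_mod = LoadLibraryA(libName);   // bumps the module ref count (or loads it)
    }
    ~module_ref()
    {
        if ( m_mod ) FreeLibrary(m_mod); // drops the ref when the proxy leaves scope
    }
};

The CALL_IMPORT macro would then construct one of these temporaries around the call instead of relying on GetModuleHandle alone.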


07-27-11 | Some random ranting

Print to PDF is easy peasy without installing anything; there are various PDF writer printer drivers, but I've found them to be generally flakey and generally a pain. Instead just select print to the built-in Microsoft XPS Document writer, then use one of the web XPS to PDF converters.

There's god damn traffic on the 520 every day now because of the construction on the *side* of the road. Not on the road, mind you, they don't actually block traffic at all, but there's lots of cranes and shite around preparing for the new bridge building, and all the fucking morons freak out and slam on their brakes. It used to be that traffic cleared up by around 10 AM and I could shoot in to the office, but now it lingers until 11 or so. And worst of all when I do get stuck in this traffic I'm just filled with rage because it's so fucking stupid and pointless, there is no god damn reason for these traffic jams they aren't actually blocking the roads you morons!

(every time the god damn toll signs change on the 520 that also leads to a wave of unnecessary stopping ; god fucking dammit, when you are driving on the freeway you don't fucking slam on your brakes to read a sign. If you want to find out about the impending toll get on the fucking internet and look it up AFTER you get off the freeway)

The major bridges here now have adaptive speed limits. That is, the speed limit is on a digital screen and it changes based on traffic conditions. So if traffic is bad it goes down to 45 or whatever. This is so fucking retarded. First of all, if traffic is bad, you have to go slow anyway. I understand the point is to get you to slow down before you get to the traffic to try to smooth things out, but there's no fucking way that's going to work, you're on a freeway where people are used to going 60, they're not going to slow down to 45 just because there's traffic miles ahead. In reality what it does is make you constantly watch for the speed limit signs to see what the limit is today, rather than watching the road. And it opens up the door for the ticketing cops to really fuck you when you're driving along on the freeway at a perfectly reasonable speed and little did you know the limit actually was 45 not the 55 or 60 that it normally is. Hey, let's just randomly change the laws each day of the week.

I often complain about how getting a ticket in the US wrecks your car insurance rates. The $400 ticket is pretty irrelevant compared to the $5000 in increased insurance cost over the next few years. But if you have a reasonable amount of cash there is actually a solution to this - either self insure completely (possible but kind of a pain in the butt because you have to file various forms and set up the right kind of savings account), or just get the absolute minimum of insurance from the cheapest/shadiest provider and just plan on using cash if you ever have a problem.

I have noticed that people who build chicken coops often put them at the very edge of their own property, eg. as close to their neighbors as possible and as far as possible from their own home. This is of course being a huge fucking dick. If you want to have questionable barn-yard animals in a dense urban area, you should accept all the negatives (smell, noise) for yourself, not inflict them on others.

(of course I feel the same way about dogs; for example if a dog won't stop yapping at all hours, the only appropriate place to keep it is inside an isolation tank with its owner; chickens really aren't any noisier or stinkier than dogs (as long as you don't have roosters or particularly noisy varieties of chickens)). (and of course dog owners are just as bad about the placement of their dog run ; they always put it as far as possible from their own bedroom).

I tend to be way overly considerate about how my actions affect others; I've always had the philosophy that anything you do that's voluntary (eg. for your pleasure, not for your survival) should have no ill effects whatsoever on others. But this is really just a form of self-sabotage, because nobody else treats me this way, and nobody really notices that I'm doing it; for example the neighbors don't appreciate the fact that I would like to crank up the music loud but I don't because that would be an annoyance to them. Similarly, if you make a mistake, I've always believed that you should suffer the consequences for yourself. You don't try to minimize the effect on yourself or pass it off to others; when you do a shit, you eat it.

This ( Backyard Chicken Economics Are They Actually Cost-Effective the GoodEater Collaborative ) kind of self-serving economic rationalization is super retarded, and sadly quite common. There's a formula for this particular form of insanity that they seem to follow quite closely. They'll be oddly over-specific about certain things and count costs down to the penny, and then just wave their hands and completely not count things. It's exactly like the wonderfully un-self-aware "minimalist" movement that's going on now, where they exactly count their possessions, and then wave their hands over "cooking gear" as one item, or "I also borrow things from my wife" ; wait, what? you're not counting your wife's stuff? The backyard chicken rationalizer just randomly doesn't count things that he "got from a neighbor" ; wait what? You can't do that. In any case, the whole analysis is retarded because he doesn't count opportunity cost for your time or the land (the land in particular is quite significant). In reality raising chickens in your yard is a *massive* money loser and the only reason to do it is if you like it. The whole article should have been "chickens cost a lot, just like any pet, so you should have them if you like them. the end" instead of this inane long-winded pointless rationalization with charts and graphs.

(Also : "Joshua Levin is a Senior Program Officer at the World Wildlife Fund, specializing in finance and agricultural commodities" ; hmm but he doesn't realize that using organic feed or not changes the value of the eggs? or that small-scale agriculture in any area with outrageously high land and labor costs is just doomed; or that going super-upscale is the only remote chance that it has to be close to break even; holy crap the WWF is in bad shape).


07-26-11 | Pixel int-to-float options

There are a few different reasonable ways to turn pixel ints into floats. Let's have a glance.

Pixels arrive as ints in [0,255]. When you put your ints in floats there is then a range of floats which corresponds to each int value. The total float range shown is the range of values that will map back to [0,255]. In practice you usually clamp, so in fact further out values will also map to 0 or 255.

I'll try to use the scientific notation for ranges, where [ means "inclusive" and ( means "not including the end value". Where an endpoint lands on a 0.5 I will always use ( because the rounding behavior of floats at exactly 0.5 is not pinned down and varies between implementations.

On typical images, exact preservation of black (int 0) and white (int 255) is more important than any other value.



int-to-float :  f = i;

float-to-int :  i = round( f ) = floor( f + 0.5 );

float range is (-0.5,255.5)
black : 0.0
white : 255.0

commentary : quantization buckets are centered on each integer value. Black can drift into negatives, which may or may not be an annoyance.



int-to-float :  f = i + 0.5;

float-to-int :  i = floor( f );

float range is [0.0,256.0)
black : 0.5
white : 255.5

commentary : quantization buckets span from one integer to the next. There's some "headroom" below black and above white in the [0,256) range. That's not actually a bad thing, and one interesting option here is to actually use a non-linear int-to-float. If i is 0, return f = 0, and if i is 255 return f = 256.0 ; that way the full black and full white are pushed slightly away from all other pixel values.
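
A sketch of that non-linear variant (interior values keep f = i + 0.5, but exact black and white get pushed out to 0.0 and 256.0) :

#include <math.h>

float int_to_float_pushed( int i )
{
    if ( i == 0 )   return 0.f;
    if ( i == 255 ) return 256.f;
    return i + 0.5f;
}

int float_to_int_pushed( float f )
{
    int i = (int) floorf( f );  // same bucket rule as the plain version
    return ( i < 0 ) ? 0 : ( ( i > 255 ) ? 255 : i ); // clamp to [0,255]
}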



int-to-float :  f = i * (256/255.0);

float-to-int :  i = round( f * (255/256.0) );

float range is (-0.50196,256.50196)

black : 0.0
white : 256.0

commentary : scaling white to be 256 is an advantage if you will be doing things like dividing by 32, because it stays an exact power of 2. Of course instead of 256 you could use 1.0 or any other power of two (floats don't care), the important thing is just that white is a pure power of two.


other ?

ADDENDUM : oh yeah; one issue I rarely see discussed is maximum-likelihood filling of the missing bits of the float.

That is, you treat it as some kind of hidden/bayesian process. You imagine there is a mother image "M" which is floats. You are given an integer image "I" which is a simple quantization from M ; I = Q(M). Q is destructive of course. You wish to find the float image F which is the most likely mother image under the probability distribution given I is known, and a prior model of what images are likely.

For example if you have ints like [ 2, 2, 3, 3 ] that most likely came from floats like [ 1.9, 2.3, 2.7, 3.1 ] or something like that.

If you think of the float as a fixed point and you only are given the top bits (the int part), you don't have zero information about what the bottom bits are. You know something about what they probably were, based on the neighbors in the image, and other images in general.

One cheezy way to do this would be to run something like a bilateral filter (which is all the rage in games these days (all the "hacky AA" methods are basically bilateral filters)) and clamp the result to the quantization constraint. BTW this is the exact same problem as optimal JPEG decompression which I have discussed before (and still need to finish).
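
To be concrete about "clamp the result to the quantization constraint", something like this (a sketch, using the f = i convention from the first option, so the bucket for i is (i-0.5,i+0.5)) :

float clamp_to_quantization_bucket( float filtered, int i )
{
    float lo = i - 0.5f;
    float hi = i + 0.5f;
    return ( filtered < lo ) ? lo : ( ( filtered > hi ) ? hi : filtered );
}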

This may seem like obscure academics to you, but imagine this : what if you took a very dark photograph into photoshop and multiplied up the brightness X100 ? Would you like to see pixels that step by 100 and look like shit, or would you like to see a maximum likelihood reconstruction of the dark area? (And this precision matters even in operations where it's not so obvious, because sequences of different filters and transforms can cause the integer step between pixels to magnify)


07-26-11 | Implementing Event WFMO

One of the Windows API's that's quite lovely and doesn't exist on any other platforms is WaitForMultipleObjects (WFMO).

This allows a thread to sleep on multiple waitable handles and only get awoken when all of them are set.

(WFMO also allows waking on "any", but waking on any is trivial and easy to simulate on other platforms, so I won't be talking about the "any" choice, and will treat WFMO as "wait_all")

Many people (such as Sun (PDF) ) have suggested simulating WFMO by polling in the waiting thread. Basically the suggestion is that the waiter makes one single CV to wait on. Then he links that CV into all the events that he wants to wait on. Then when each one fires, it triggers his CV, he wakes up and checks the WFMO list, and if it fails he goes back into a wait state.

This is a fine way to implement "wait any" (and is why it's trivial and I won't discuss it), but it's a terrible way to implement "wait all". The waiting thread can wake up many times and check the conditions and just go right back to sleep.

What we want is for the signalling thread to check the condition, and only wake the WFMO waiting thread if the full condition state is met.

Events can be either auto-reset or manual-reset, and if they are auto-reset, the WFMO needs to consume their signal when it wakes. This makes it a bit tricky because you don't want to consume a signal unless you are really going to wake up - eg. all your events are on. Events can also turn on then back off again (if some other thread waits on them), so you can't just count them as they turn on.

The first thing we need to do is extend our simple "event" that we posted last time by adding a list of monitors :


struct event_monitor
{
    // *2
    virtual bool got_signal( unsigned int mask ) = 0;
};

struct event
{
    std::mutex  m;
    std::condition_variable cv;
    VAR_T(bool) m_set;
    VAR_T(bool) m_auto_reset;
    
    struct info { event_monitor * mon; unsigned int mask; };
    
    struct match_mon { match_mon(event_monitor * mon) : m_mon(mon) { } event_monitor * m_mon; bool operator () (const info & rhs) const { return m_mon == rhs.mon; } };
    
    std::mutex  m_monitors_mutex;
    std::list<info> m_monitors;
    
    event(bool auto_reset) : m_auto_reset(auto_reset)
    {
        m_set($) = false;
    }
    
    ~event()
    {
    }
    
    void signal()
    {
        m.lock($);
        m_set($) = true;    
        if ( m_auto_reset($) )
            cv.notify_one($); // ??
        else
            cv.notify_all($);       

        m.unlock($);
        
        // (*1)
        // can't be done from inside mutex, that's deadlock     
        notify_monitors();
    }
        
    void wait()
    {
        m.lock($);
        while ( ! m_set($) )
        {
            cv.wait(m,$);
        }
        if ( m_auto_reset($) )
            m_set($) = false;
        m.unlock($);
    }
    
    //-------------------------
    
    void notify_monitors()
    {
        m_monitors_mutex.lock($);
        for( std::list<info>::iterator it = m_monitors.begin();
            it != m_monitors.end(); ++it )
        {
            info & i = *it;
            if ( i.mon->got_signal(i.mask) )
                break;
        }
        m_monitors_mutex.unlock($);
    }
    
    void add_monitor(event_monitor * mon,unsigned int mask)
    {
        m_monitors_mutex.lock($);
        info i = { mon, mask };
        m_monitors.push_back( i );
        m_monitors_mutex.unlock($);
    }
    void remove_monitor(event_monitor * mon)
    {
        m_monitors_mutex.lock($);
        m_monitors.remove_if( match_mon(mon) );
        m_monitors_mutex.unlock($);
    }
};

which is trivial enough.

*1 : Note that for a "wait_any" monitor you would prefer to do the notify from inside the mutex, because that way you can be sure it gets the signal and consumes it (if auto-reset). For "wait_all" you need to notify outside the mutex, for reasons we will see shortly.

*2 : each monitor has a bit mask associated with it, but you can ignore this for now.

So now we can construct a WFMO wait_all monitor that goes with this event. In words : we create a single CV for the waiting thread to sleep on. We receive ->got_signal calls from all the events that we are waiting on. They check for the condition being met, and then only wake the sleeping thread when it is all met. To ensure that the events really are all set at the same time (and properly consume auto-reset events) we have to hold the mutex of all the events we're waiting on to check their total state.


struct wfmo_wait_all : public event_monitor
{
    std::mutex  m; 
    std::condition_variable cv;
    
    std::vector<event *>    m_events;
    VAR_T(bool) m_wait_done;

    void wait( event ** pEvents, int numEvents )
    {
        m.lock($);
        m_wait_done($) = false;
        m_events.resize(numEvents);
        for(int i=0;i<numEvents;i++)
        {
            m_events[i] = pEvents[i];
            m_events[i]->add_monitor(this, 0 );
        }
        
        // sort for consistent order to avoid deadlock :
        std::sort(m_events.begin(),m_events.end());
        
        // must check before entering loop :
        update_wait_done();
        
        // loop until signal :
        while ( ! m_wait_done($) )
        {
            cv.wait(m,$); // unlock_wait_lock(cv,m)
        }
        
        m_events.clear();
        
        m.unlock($);
        
        // out of lock :
        // because notify_monitors take the lock in the opposite direction
        for(int i=0;i<numEvents;i++)
        {
            pEvents[i]->remove_monitor(this);
        }
    }
        
    bool got_signal( unsigned int mask )
    {
        // update our wait state :
        m.lock($);
        
        if ( ! m_wait_done($) )
        {
            update_wait_done();
        }
                    
        bool notify = m_wait_done($);
        
        m.unlock($);
        
        if ( notify )
            cv.notify_one($);
            
        return false;
    }
    
    
    // set m_wait_done
    // called with mutex locked
    void update_wait_done()
    {
        RL_ASSERT( m_wait_done($) == false );
    
        int numEvents = (int) m_events.size();
    
        for(int i=0;i<numEvents;i++)
        {
            m_events[i]->m.lock($);
            
            if ( ! m_events[i]->m_set($) )
            {
                // break out :
                for(int j=0;j<=i;j++)
                {
                    m_events[j]->m.unlock($);
                }
                return;
            }
        }
        
        m_wait_done($) = true;
        
        // got all locks and all are set
        
        for(int i=0;i<numEvents;i++)
        {
            if ( m_events[i]->m_auto_reset($) ) // consume it
                m_events[i]->m_set($) = false;
            
            m_events[i]->m.unlock($);
        }
    }   
};

Straightforward. There are a few funny spots where you have to be careful about the order you take mutexes to avoid deadlocks. (as usual, multiple mutexes are pains in the butt).

We can also try to optimize this. We'll use the mask from (*2) in the event - the one I told you to ignore before.

Each event in the WFMO set is associated with a bit index, so if we make the signal from each a bit mask, we are waiting for all bits to be on. Because events can turn on and off, we can't use this bit mask as our wait condition reliably, but we can use it as a conservative optimization. That is, until the bit mask is full we know our WFMO can't be done. Once the bit mask is full, it still might not be done if there's a race and an event turns off, but then we'll check it more carefully.

The result looks like this :


struct wfmo_wait_all : public event_monitor
{
    std::mutex  m;
    std::condition_variable cv;
    
    std::vector<event *>    m_events;
    VAR_T(bool) m_wait_done;

    std::atomic<unsigned int> m_waiting_mask;

    void wait( event ** pEvents, int numEvents )
    {
        m.lock($);
        m_wait_done($) = false;
        // (*1) :
        const unsigned int all_bits_on = (unsigned int)(-1);
        m_waiting_mask($) = all_bits_on;
        m_events.resize(numEvents);
        for(int i=0;i<numEvents;i++)
        {
            m_events[i] = pEvents[i];
        }
        // sort for consistent order to avoid deadlock :
        std::sort(m_events.begin(),m_events.end());
        
        for(int i=0;i<numEvents;i++)
        {
            m_events[i]->add_monitor(this, 1UL<<i );
        }
        
        // must check before entering loop :
        update_wait_done();
        while ( ! m_wait_done($) )
        {
            cv.wait(m,$);
        }
        
        m_events.clear();
        
        m.unlock($);
        
        // out of lock :
        for(int i=0;i<numEvents;i++)
        {
            pEvents[i]->remove_monitor(this);
        }
    }
        
    bool got_signal( unsigned int mask )
    {
        // this is just an optimistic optimization -
        //  if we haven't seen a signal from each of the slots we're waiting on,
        //  then don't bother checking any further
        
        const unsigned int all_bits_on = (unsigned int)(-1);
        unsigned int prev_mask = m_waiting_mask($).fetch_or(mask);
        // (*2)
        if ( (prev_mask|mask) != all_bits_on )
            return false;
        
        // update our wait state :
        m.lock($);
        
        if ( m_wait_done($) )
        {
            m.unlock($);
            return false;
        }
                
        update_wait_done();
                
        bool notify = m_wait_done($);
        
        m.unlock($);
        
        if ( notify )
            cv.notify_one($);
            
        return false;
    }
    
    
    // set m_wait_done
    // called with mutex locked
    void update_wait_done()
    {
        int numEvents = (int) m_events.size();
    
        const unsigned int all_bits_on = (unsigned int)(-1);
        unsigned int waiting_mask = all_bits_on;

        for(int i=0;i<numEvents;i++)
        {
            m_events[i]->m.lock($);
            
            if ( ! m_events[i]->m_set($) )
            {
                // this one is off :
                waiting_mask ^= (1UL<<i);
            }
        }
        
        if ( waiting_mask == all_bits_on )
        {
            m_wait_done($) = true;
        }       
        else
        {
            m_wait_done($) = false;
        }
        
        // this store must be done before the events are unlocked
        //  so that they can't signal me before I set this :
        m_waiting_mask($).store(waiting_mask);

        // got all locks and all are set
        
        for(int i=0;i<numEvents;i++)
        {
            if ( m_wait_done($) )
            {
                if ( m_events[i]->m_auto_reset($) ) // consume it
                    m_events[i]->m_set($) = false;
            }
            
            m_events[i]->m.unlock($);
        }
    }   
};

*1 : waiting_mask is zero in each bit slot for events that have not been seen, 1 for events that have been seen (or bits outside the array size). We have to start with all bits on in case we get signals while we are setting up; we don't want them to early out in *2.

*2 : this is the optimization point. We turn on the bit when we see an event, and we wait for all bits to be on before checking if the WFMO is really done. The big advantage here is we avoid taking all the event mutexes until we at least have a chance of really being done. We only turn the event bits off when we hold the mutexes and can be sure of seeing the full state.

It goes without saying (and yet I seem to always have to say it) that this only works for a number of events up to the number of bits in an unsigned int, so in real production code you would want to enforce that limit more cleanly. (because this is an optimistic check, you can simply not include events that exceed the number of bits in the bit mask, or you could keep a bool per event and count the number of events that come on instead).

So, anyhoo, that's one way to do a proper WFMO (that doesn't wake the sleeping thread over and over) without Windows. WFMO in Windows works with events, mutexes, semaphores, etc. so if you want all that you would simply add the monitor mechanism to all your synchronization primitives.

BTW an alternative implementation would be for the event to signal its monitor on every state transition (both up/on and down/off). Then the WFMO monitor could keep an accurate bit mask all the time. When you get all bits on, you then have to consume all the auto-reset events, and during that time you have to block anybody else from consuming them (eg. block state transitions down). One thing that makes this tricky is that there can be multiple WFMO's watching some of the same events (but not exactly the same set of events), and you can get into deadlocks between them.


07-25-11 | Semaphore from CondVar

Semaphore from CondVar is quite trivial :

struct semaphore_from_condvar
{
private:
    VAR_T(int) m_count;
    t_cond_var m_cv;
public:
    semaphore_from_condvar(int initialCount) : m_count(initialCount)
    {
    }
    
    void post() // "increment"
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        VAR(m_count) ++;
        m_cv.signal_unlock(g);
    }
    
    void post(int count)
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        VAR(m_count) += count;
        m_cv.broadcast_unlock(g);
    }   
    
    void wait() // "decrement"
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        while( VAR(m_count) <= 0 )
        {
            m_cv.unlock_wait_lock(g);
        }
        VAR(m_count)--;
        m_cv.unlock(g);
    }
};

the only thing that's less than ideal is when you have lots of waiters and try to wake N of them. Ideally you would wake exactly N threads. Here we have to wake them all, they will all try to dec count, N will pass through, and the rest will go back to sleep. All with heavy contention on the CV lock.

I noted once that a Windows auto-reset "Event" acts just like a semaphore with a max count of 1. We can see that very explicitly if we do an implementation of event using condvar :


struct event_from_condvar
{
private:
    VAR_T(int) m_count;
    t_cond_var m_cv;
public:
    event_from_condvar() : m_count(0) // start unsignalled, or wait() reads garbage
    {
    }
    
    void signal()
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        VAR(m_count) = 1;
        m_cv.signal_unlock(g);
    }
        
    void wait()
    {
        t_cond_var::guard g;
        m_cv.lock(g);
        while( VAR(m_count) == 0 )
        {
            m_cv.unlock_wait_lock(g);
        }
        RL_ASSERT( VAR(m_count) == 1 );
        VAR(m_count) = 0;
        m_cv.unlock(g);
    }
};

which is a very silly way to implement event, but may be useful if you are limited to C++0x only.

(the lack of low level wake/wait primitives in C++0x is a bit annoying; perhaps one solution is to use condition_variable_any - the templated variant of CV - and give it a mutex that is just a NOP in lock/unlock ; that would let you use the notify() and wait() mechanisms of CV without all the mutex luggage that you don't need. But it remains to be seen what actual implementations do).
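
For what it's worth, here is a minimal sketch of that idea (my addition, all names made up). Whether it actually removes any overhead depends entirely on how the library implements condition_variable_any, and since there is no real lock, the caller is entirely responsible for the missed-wakeup races discussed above :

#include <condition_variable>

// a "lock" that satisfies the BasicLockable interface but does nothing :
struct nop_lock
{
    void lock()   { }
    void unlock() { }
};

struct raw_notifier
{
    std::condition_variable_any cv;
    nop_lock nl;

    // with no real mutex, nothing stops a notify from slipping in just before
    // the wait ; the caller has to handle that race itself (eg. with an epoch check)
    void wait()       { cv.wait(nl); }
    void notify_one() { cv.notify_one(); }
};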

More on events in the next post.


07-24-11 | A cond_var that's actually atomic - part 2

Last time I noted that I thought the mutex for the waiter list was not needed and an idea on how to remove it. In fact it is easy to remove in exactly that manner, so I have done so.

First of all, almost everywhere in the cond_var (such as signal_unlock), we know that the external mutex is held while we mess with the waiter list. So it is protected from races by the external mutex. There is one spot where we have a problem :

The key section is in unlock_wait_lock :


    void unlock_wait_lock(guard & g)
    {
        HANDLE unlock_h;
        
        HANDLE wait_handle = alloc_event();

        {//!
        m_waiter_mutex.lock($);
        
            // get next lock waiter and take my node out of the ownership chain :
            unlock_h = unlock_internal(g,NULL);

            // (*1)

            // after unlock, node is now mine to reuse :
            mcs_node * node = &(g.m_node);
            node->m_next($).store( 0 , std::mo_relaxed );
            node->m_event($).store( wait_handle , std::mo_relaxed );
        
            // put on end of list :
            if ( m_waiters($) == NULL )
            {
                m_waiters($) = node;
            }
            else
            {
                // dumb way of doing FIFO but whatever for this sketch
                mcs_node * parent = m_waiters($);
                while ( parent->m_next($).load(std::mo_relaxed) != NULL )
                    parent = parent->m_next($).load(std::mo_relaxed);
                parent->m_next($).store(node,std::mo_relaxed);
            }

            // (*2)         

        m_waiter_mutex.unlock($);
        }//!

        if ( unlock_h )
        {
            //SetEvent(unlock_h);
            SignalObjectAndWait(unlock_h,wait_handle,INFINITE,FALSE);
        }
        else
        {   
            WaitForSingleObject(wait_handle,INFINITE);
        }
        
        // I am woken and now own the lock
        // my node has been filled with the guy I should unlock
        
        free_event(wait_handle);
        
        RL_ASSERT( g.m_node.m_event($).load(std::mo_relaxed) == wait_handle );
        g.m_node.m_event($).store(0,std::mo_relaxed);
        
    }

The problem is at (*1) : the mutex may become unlocked. If there was a waiter for the mutex, it is not actually unlocked, we just retrieve unlock_h, but we haven't set the event yet, so ownership is not yet transferred. The problem is if there was no waiter, then we set tail to NULL and someone else can jump in here, and do a signal, but we aren't in the waiter list yet, so we miss it. The waiter mutex fixes this.

Your first idea might be - move the unlock() call after the waiter list maintenance. But you can't do that, because the mcs_node in the guard "g" which I have on the stack is being used in the lock chain, and I wish to repurpose it for the waiter list.

So, one simple solution would be just to have the stack guard hold two nodes. One for the lock chain and one for the waiter chain. Then we don't have to repurpose the node, and we can do the unlock after building the waiter list (move the unlock down to *2). That is a perfectly acceptable and easy solution. It does make your stack object twice as big, but it's still small (a node is two pointers, so it would be 4 pointers instead). (you might also need a flag to tell unlock() whether "me" is the lock node or the wait node).
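
To be concrete, the "two node" guard would just look something like this (my sketch, not code from the post) :

struct guard
{
    mcs_node    m_lock_node;    // lives in the mutex ownership chain
    mcs_node    m_wait_node;    // lives in the condvar waiter list
    bool        m_waiting;      // tells unlock() which node is currently "me"
};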

But we can do it without the extra node, using the dummy owner idea I posted last time :


        {//!
            mcs_node dummy;

            // remove our node from the chain, but don't free the mutex
            //  if no waiter, transfer ownership to dummy       
            unlock_h = unlock_internal(&(g.m_node),&dummy);

            // after unlock, stack node is now mine to reuse :
            mcs_node * node = &(g.m_node);
            node->m_next($).store( 0 , std::mo_relaxed );
            node->m_event($).store( wait_handle , std::mo_relaxed );
        
            // put on end of list :
            if ( m_waiters($) == NULL )
            {
                m_waiters($) = node;
            }
            else
            {
                // dumb way of doing FIFO but whatever for this sketch
                mcs_node * parent = m_waiters($);
                while ( parent->m_next($).load(std::mo_relaxed) != NULL )
                    parent = parent->m_next($).load(std::mo_relaxed);
                parent->m_next($).store(node,std::mo_relaxed);
            }
            
            if ( unlock_h == 0 )
            {
                // ownership was transfered to dummy, now give it to the
                //  successor to dummy in case someone set dummy->next
                unlock_h = unlock_internal(&dummy,NULL);                
            }
        }//!

(parts outside of ! are the same).

Anyway, full code is here :

at pastebin

.. and that completes the proof of concept.

(BTW don't be a jackass and tell me the FIFO walk to the tail is horrible. Yeah, I know, obviously you should keep a tail pointer, but for purposes of this sketch it's irrelevant.)


07-20-11 | A cond_var that's actually atomic

So, all the cond vars that we've seen so far (and all the other implementations I've seen) don't actually do { signal_unlock } atomically.

They make cond var work even though it's not atomic, in ways discussed previously, ensuring that you will only ever get false wakeups, never missed wakeups. But false wakeups are still not ideal - they are a performance bug. What we would really like is to minimize thread switches, and also ensure that when you set a condition, the guy who wakes up definitely sees it.

For example, normal cond var implementations require looping in this kind of code : (looping implies waking up and then going back to sleep)


thread 1 :
  cv.lock();
  x = 3;
  cv.signal_unlock();

thread 2 :
  cv.lock();
  if ( x != 3 ) // normal cond var needs a while here
    cv.unlock_wait_lock();
  ASSERT( x == 3 );
  cv.unlock();

(obviously in real code you will often have multiple signals and other things that can cause races, so you generally will always want the while to loop and catch those cases, even if the condvar doesn't inherently require it).

Furthermore, if you jump back in and mess with the condition, like :


thread 1 :
  cv.lock();
  x = 3;
  cv.signal_unlock();

  cv.lock();
  x = 7;
  cv.unlock();

most cond_vars don't guarantee that thread 2 necessarily sees x == 3 at all. The implementation is free to send the signal, thread 2 wakes up but doesn't get the lock yet, thread 1 unlocks then relocks, sets x = 7 and unlocks, now thread 2 gets the lock finally and sees x is not 3. If signal_unlock is atomic (and transfers ownership of the mutex directly to the waiter) then nobody can sneak in between the signal and when the receiver gets to see the data that triggered the signal.

One way to do it is to use a mutex implementation in which ownership of the mutex is transferred directly by Event signal. For example, the per-thread-event-mutex from an earlier post. To unlock this mutex, you first do some maintenance, but the actual ownership transfer happens in the SetEvent(). When a thread owns the mutex, anyone else trying to acquire the mutex is put into a wait state. When one of them wakes up it is now the owner (they can only be woken from unlock).

With this style of mutex, we can change our ownership transfer. Instead of unlocking the mutex and handing it off to the next waiter to acquire the mutex, you hand it off to the next waiter directly on the signal.

In pseudo-code it looks like this :


lock :
    if owner == null, owner = me
    else
        add me to lock-waiter list
        wait me

unlock :
    set owner = pop next off lock-waiter list
    wake owner

unlock_wait_lock :
    add me to signal-waiter list
    set owner = pop next off lock-waiter list
    wake owner
    wait me
    // mutex is locked by me when I wake up

signal_unlock :
    set owner = pop next off signal-waiter list
    wake owner

reasonably easy. Conceptually simple ; it's just like a normal mutex, except that instead of one list of threads waiting for the lock, there are two lists - one of threads waiting for the lock to come from "unlock" and one waiting for the lock to come from "signal_unlock". This is (almost (*)) the absolute minimum of thread transfers; signal_unlock atomically unlocks the mutex and also unlocks a waiter in the same operation. unlock_wait_lock then has to do an unlock and then when it wakes from wake it owns the mutex.

Note that there is still one non-atomic gap, in between "unlock" and "wait_lock" in the unlock_wait_lock step. You can use SignalObjectAndWait there on Windows, but as noted previously that is not actually atomic - though it is maybe less likely to thread switch in the gap. (* = this is the only spot where we can do a thread transfer we don't want; we could wake a thread and in so doing lose our time slice, then if we get execution back we immediately go into a wait)

Anyway, here is a working version of an atomically transferring cond var built on an MCS-style mutex. Some notes after the code.

at pastebin

Notes :

1. broadcast is SCHED_OTHER of course. TODO : fix that. It also has to relock the mutex each time around the loop in order to transfer it to the next waiter. That means broadcast can actually thread-thrash a lot. I don't claim that this implementation of broadcast is good, I just wanted to prove that broadcast is possible with this kind of condvar. (it's really most natural only for signal)

2. I use a mutex to protect the waitlist. That was just for ease and simplicity because it's already difficult and complicated without trying to manage the waitlist lockfree. TODO : fix that.

(I think you could do it by passing in a dummy node to unlock_internal in unlock_wait_lock instead of passing in null; that keeps the mutex from becoming free while you build the wait list; then after the wait list is built up, get the dummy node out; but this is not quite trivial ; for the most part the external mutex protects the waitlist so this is the only spot where there's an issue).

(as usual the valid complaint about the mutex is not that it takes 100-200 clocks or whatever, it's that it could cause unnecessary thread switches)

3. As noted in the code, you don't actually have to alloc & free the event handles, they can come from the TLS since there is always just one per thread and it's auto-reset.
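
For example (my sketch, assuming MSVC-style TLS), alloc_event/free_event could just become :

__declspec(thread) HANDLE t_thread_event = 0;

HANDLE alloc_event()
{
    if ( ! t_thread_event )
        t_thread_event = CreateEvent(NULL,FALSE,FALSE,NULL); // auto-reset, unsignalled
    return t_thread_event;
}

void free_event(HANDLE) { } // nothing to free ; the event lives as long as the thread
                            // (a real version should CloseHandle it at thread exit)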

Anyway, I think it's an interesting proof of concept. I would never use atomics that are this complex in production code because it's far too likely there's some subtle mistake which would be absolute hell to track down.


07-20-11 | Some condition var implementations

Okay, time for lots of ugly code posting.

Perhaps the simplest condvar implementation is FIFO and SCHED_OTHER using an explicit waiter list. The waiter list is protected by an internal mutex. It looks like this :

(BTW I'm putting the external mutex in the condvar for purposes of this code, but you may not actually want to do that ; see previous notes )


struct cond_var_mutex_list
{
    std::mutex  external_m;
    std::mutex  internal_m;
    // (*1)
    std::list<HANDLE>   waitset;

    cond_var_mutex_list() { }
    ~cond_var_mutex_list() { }
    
    void lock() { external_m.lock($); }
    void unlock() { external_m.unlock($); }
    
    // (*1) should be the event from TLS :
    HANDLE get_event()
    {
        return CreateEvent(NULL,0,0,NULL);
    }
    void free_event(HANDLE h)
    {
        CloseHandle(h);
    }
    
    void unlock_wait_lock() 
    {
        HANDLE h = get_event();
        
        // taking internal_m lock prevents us from racing with signal
        {
        internal_m.lock($);

        waitset.push_back(h);
        
        internal_m.unlock($);
        }
        
        // (*2)
        external_m.unlock($);
        
        WaitForSingleObject(h,INFINITE);
        
        free_event(h);

        // I will often wake from the signal and immediately go to sleep here :
        external_m.lock($);
    }
    
    // could return if one was signalled
    void signal()
    {
        HANDLE h = 0;
        
        // pop a waiter off the front, if any :
        {
        internal_m.lock($);
        
        if ( ! waitset.empty() )
        {
            h = waitset.front();
            waitset.pop_front();
        }
        
        internal_m.unlock($);
        }
        
        if ( h == 0 )
            return;
    
        SetEvent(h);        
    }
    
    // could return # signalled
    void broadcast()
    {
        std::list<HANDLE> local_waitset;
        
        // grab local copy of the waitset
        // this enforces wait generations correctly
        {
        internal_m.lock($);
        
        local_waitset.swap(waitset);

        internal_m.unlock($);
        }

        // (*3)     

        // set events one by one;
        // this is a bit ugly, SCHED_OTHER and thread thrashing
        while( ! local_waitset.empty() )
        {
            HANDLE h = local_waitset.front();
            local_waitset.pop_front();
            SetEvent(h);
        }   
    }

};

I think it's pretty trivial and self-explanatory. A few important notes :

*1 : We use std::list here for simplicity, but in practice a better way would be to have a per-thread struct which contains the per-thread event and a forward & back pointer for linking. Then you don't have any dynamic allocations at all. One per-thread event here is all you need because a thread can only be in one wait at a time. Also there's no event lifetime issue because each thread only waits on its own event (we'll see issues with this in later implementations). (see for example Thomasson's sketch of such a scheme, but it's pretty self-explanatory ; a small sketch of the idea is also below, after these notes)

*2 : This is the crucial line of code for cond-var correctness. The external mutex is unlocked *after* the current thread is put in the waitset. This means that after we unlock the external mutex, even though we don't atomically go into the wait, we won't miss a signal that happens between the unlock and the wait.

*3 : This is where the "wait generation" is incremented. We swap the waiter set to a local copy and will signal the local copy. At this point new waiters can come in, and they will get added to the member variable waitset, but they don't affect our generation.
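
Regarding *1, a minimal sketch of that per-thread node (my addition, not Thomasson's actual code) would be something like :

// one of these per thread, e.g. in TLS ; the condvar links these instead of
// pushing HANDLEs onto a std::list, so waiting allocates nothing :
struct waiter_node
{
    HANDLE        m_event;  // this thread's auto-reset event, created once
    waiter_node * m_next;
    waiter_node * m_prev;
};

push_back/pop_front then become a couple of pointer assignments under internal_m, and get_event()/free_event() go away entirely.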

The nice thing about this style of implementation is that it only needs mutex and auto-reset events, which are probably the most portable of all synchronization primitives. So you can use this on absolutely any platform.

The disadvantage is that it's SCHED_OTHER (doesn't respect OS priorities) and it can have rather more thread switches than necessary.


The next version we'll look at is Thomasson's two-event cond_var. There are a lot of broken versions of this idea around the net, so it's instructive to compare to (what I believe is) a correct one.

The basic idea is that you use two events. One is auto-reset (an auto-reset event is just like a semaphore with a max count of 1); the other is manual reset. signal() sets the auto-reset event to release one thread. broadcast() sets the manual-reset event to release all the threads (opens the gate and leaves it open). Sounds simple enough. The problem is that manual reset events are fraught with peril. Any time you see anyone say "manual reset event" you should think "ruh roh, race likely". However, handling it in this case is not that hard.

The easy way is to use the same trick we used above to handle broadcast with generations - we just swap out the "waitset" (in this case, the broadcast event) when we broadcast(). That way it is associated only with previous waiters, and new waiters can immediately come in and wait on the next generation's manual reset event.

The only ugly bit is handling the lifetime of the broadcast event. We want it to be killed when the last member of its generation is woken, and to get this right we need a little ref-counting mechanism.

So, here it is, based on the latest version from Thomasson of an idea that he posted many times in slightly different forms :


class thomasson_win_condvar
{
    enum { event_broadcast=0, event_signal = 1 };

    struct waitset
    {
        HANDLE m_events[2];
        std::atomic<int> m_refs;

        waitset(HANDLE signalEvent) : m_refs(1)
        {
            // signalEvent is always the same :
            m_events[event_signal] = signalEvent;

            // broadcast is manual reset : (that's the TRUE)
            m_events[event_broadcast] = CreateEvent(NULL, TRUE, FALSE, NULL);
        }
        
        ~waitset()
        {
            RL_ASSERT( m_refs($) == 0 );
    
            //if ( m_events[event_broadcast] )
            CloseHandle(m_events[event_broadcast]);
        }
    };


private:
    VAR_T(waitset*) m_waitset;
    CRITICAL_SECTION m_internal_mutex;
    CRITICAL_SECTION m_external_mutex;
    HANDLE m_signal_event;


public:
    thomasson_win_condvar()
    :   m_waitset(NULL)
    {
        m_signal_event = CreateEvent(NULL,0,0,NULL);
        InitializeCriticalSection(&m_internal_mutex);
        InitializeCriticalSection(&m_external_mutex);
    }

    ~thomasson_win_condvar()
    {
        RL_ASSERT( VAR(m_waitset) == NULL );
        CloseHandle(m_signal_event);
        DeleteCriticalSection(&m_internal_mutex);
        DeleteCriticalSection(&m_external_mutex);
    }


    void dec_ref_count(waitset * w)
    {
        EnterCriticalSection(&m_internal_mutex);
        // if I took waitsets refs to zero, free it

        if (w->m_refs($).fetch_add(-1, std::mo_relaxed) == 1)
        {
            std::atomic_thread_fence(std::mo_acquire,$);
            // clear the shared pointer before freeing, so we never look at a dangling pointer :
            if ( w == VAR(m_waitset) )
                VAR(m_waitset) = NULL;
            delete w;
        }

        LeaveCriticalSection(&m_internal_mutex);
    }

    void inc_ref_count(waitset * w)
    {
        if ( ! w ) return;

        w->m_refs($).fetch_add(1,std::mo_relaxed);

        LeaveCriticalSection(&m_internal_mutex);
    }
        
public:
    void lock ()
    {
        EnterCriticalSection(&m_external_mutex);
    }
    void unlock ()
    {
        LeaveCriticalSection(&m_external_mutex);
    }

    void unlock_wait_lock()
    {
        waitset* w;
        
        {
        EnterCriticalSection(&m_internal_mutex);

        // make waitset on demand :
        w = VAR(m_waitset);

        if (! w)
        {
            w = new waitset(m_signal_event);
            VAR(m_waitset) = w;
        }
        else
        {
            inc_ref_count(w);
        }
        
        LeaveCriticalSection(&m_internal_mutex);
        }

        // note unlock of external after waitset update :
        LeaveCriticalSection(&m_external_mutex);

        // wait for *either* event :
        WaitForMultipleObjects(2, w->m_events, false, INFINITE);

        EnterCriticalSection(&m_external_mutex);
        
        dec_ref_count(w);
    }


    void broadcast()
    {
        EnterCriticalSection(&m_internal_mutex);

        // swap waitset to local state :
        waitset* w = VAR(m_waitset);

        VAR(m_waitset) = NULL;

        inc_ref_count(w);
        
        LeaveCriticalSection(&m_internal_mutex);

        // at this point a new generation of waiters can come in,
        //  but they will be on a new waitset

        if (w)
        {
            SetEvent(w->m_events[event_broadcast]);

            // note : broadcast event is actually never cleared (that would be a tricky race)
            // instead the waitset it used is deleted and not used again
            // a new waitset will be made with an un-set broadcast event

            dec_ref_count(w);
        }
    }


    void signal()
    {        
        EnterCriticalSection(&m_internal_mutex);

        waitset* w = VAR(m_waitset);

        inc_ref_count(w);
        
        LeaveCriticalSection(&m_internal_mutex);

        if (w)
        {
            SetEvent(w->m_events[event_signal]);

            dec_ref_count(w);
        }
    }

};

I don't think there's anything too interesting to say about this, the interesting bits are all commented in the code.

Basically the trick for avoiding the evilness of a manual reset event is just to make a new one after you set it to "open" and never try to set it to "closed" again. (of course you could set it to closed and recycle it through a pool instead of allocating a new one each time).

This code can be simplified/optimized in various ways, for example when you signal() you don't actually need to make or delete a waitset at all.

I believe you could also get rid of m_internal_mutex completely with a bit of care. Actually it doesn't take any care; if you require that signal() and broadcast() are always called from within the external lock, then the internal mutex isn't needed at all (the external lock serves to protect the things that it protects, namely the waitset).
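
A sketch of what signal() shrinks to under that rule (my addition ; I'm assuming the caller holds the external mutex, so m_waitset can't be swapped or deleted out from under us, because waiters only dec_ref after they re-acquire the external mutex) :

    // only valid if the caller holds m_external_mutex :
    void signal_locked()
    {
        waitset * w = VAR(m_waitset);
        if ( w )
            SetEvent( w->m_events[event_signal] );
        // no ref count juggling needed here, because nothing can delete w
        // while we hold the external mutex
    }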


The Terekhov condvar in pthreads-win32 (reportedly) uses a barrier to block entry to "wait" for new waiters after you "broadcast" but before all the waiters have woken up. It's a gate that's closed when you broadcast, the waiter count is remembered, and it's opened after they all wake up. This works but does cause thread thrashing; waiters who were blocked will go to sleep on the barrier, then wake up and rush in and immediately go to sleep in the wait on the condvar. (caveat : I haven't actually looked at the pthreads-win32 code other than to see that it's huge and complex and I didn't want to read it)

Doug Schmidt wrote the nice page on Strategies for Implementing POSIX cond vars on Win32 (which describes a lot of bad or broken ways (such as using PulseEvent or using SetEvent and trying to count down to reset it)). The way he implemented it in his ACE package is sort of similar to Terekhov's blocking mechanism. this extraction for qemu is a lot easier to follow than the ACE code and uses the same technique. At first I didn't think it worked at all, but the secret is that it blocks wake-stealers using the external mutex. The key is that in this implementation, "broadcast" has to be called inside the external mutex held. So what happens is broadcast wakes a bunch of guys, then waits on their wakes being done - it's still holding the external mutex. The waiters wake up, and dec the count and then try to lock the mutex and block on that. Eventually they all wake up and set an event so the broadcaster is allowed to resume. Now he leaves broadcast and unlocks the mutex and the guys who were woken up can now run. Stolen wakeups are prevented because the external mutex is held the whole time, so nobody can get into the cv to even try to wait.

I'm really not a fan of this style of condvar implementation. It causes lots of thread thrashing. It requires every single one of the threads broadcasted-to to wake up and go back to sleep before any one can run. Particularly in the Windows environment where individual threads can lose time for a very long time on multi-core machines, this is very bad.

Thomasson's earlier waitset didn't swap out the waitset in notify_all so it didn't get generations right. (he did later correct versions such as the one above)

Derevyago has posted a lot of condvars that are broken (manual reset event with ResetEvent() being used is immediately smelly and in this case is in fact broken). He also posted one that works which is similar to my first one here (FIFO, SCHED_OTHER, manual wait list).

Anthony Williams posted a reasonably simple sketch of a condvar ; it uses a manual reset event per generation which is swapped out on broadcast ; it's functionally identical to the Thomasson condvar, except that Anthony maintains a linked list of waiters instead of just inc/dec'ing a refcount. Anthony didn't provide signal() but it's trivial to do so by adding another manual-reset event.

Dmitry's fine grained eventcount looks like a nice way to build condvar, but I'm scared of it. If somebody ever makes a C++0x/Relacy version of that, let me know.


07-20-11 | Some notes on condition vars

Let's talk about condition vars a bit. To be clear I won't be talking about the specific POSIX semantics for "condition_var" , but rather the more general concept of what a CV is.

As mentioned before, a CV provides a mechanism to wait on a mutex-controlled variable. I've always been a bit confused about why you would want CV because you can certainly do the same thing with the much simpler concept of "Event" - but with event there are some tricky races (see waiting on thread events for example) ; you can certainly use Event and build up a system with a kind of "register waiter then wait" mechanism ; but basically what you're doing is building a type of eventcount or CV there.

Anyhoo, the most basic test of CV is something like this :


before : x = 0;

thread 1:
    {
        cv.lock();
        x += 1;
        cv.signal_unlock();
    }

thread 2:
    {
        cv.lock();
        while ( x <= 0 )
        {
            cv.unlock_wait_lock();
        }
        x = 7;
        cv.unlock();
    }

after : x == 7;

in words, thread 2 waits for thread 1 to do the inc, then sets x = 7. Now, what if our CV was not legit? For example what if our unlock_wait_lock was just :
cv.unlock_wait_lock() :
    cv.unlock();
    cv.wait();
    cv.lock();
this code would not work. Why? What could happen is thread 2 runs first and sees x = 0, so it does the unlock. Then thread 1 runs and does x+= 1 but there's no one to signal so it just exits. Then thread 2 does the wait and deadlocks. ruh roh.

In this case, it could be fixed simply by using a Semaphore for the signal and making sure you always Post (inc) even if there is no waiter (that's a way of counting the signals). But that doesn't actually fix it - if you had another thread running that could consume wakeups for this CV, it could consume all those wakeups and you would still go into the wait and sleep.

So, it is not required that "unlock_wait_lock" actually be atomic (and in fact, it is not in most CV implementations), but you must not go into the wait if another thread could get in between the unlock and the wait. There are a few ways to accomplish this. One is to use some kind of prepare_wait , something like :

cv.unlock_wait_lock() :
    cv.prepare_wait();
    cv.unlock();
    cv.retire_wait();
    cv.lock();
prepare_wait is inside the mutex so it can run race-free (assuming signal takes the same mutex) ; it could record the signal epoch (the GUID for the signal), and then retire_wait will only actually wait if the epoch is the same. Calls to signal() must always inc the epoch even if there is no waiter. So, with this method if someone sneaks in after your unlock but before your wait, you will not go into the wait and just loop around again. This is one form of "spurious wakeup" and is why unlock_wait_lock should generally be used in a loop. (note that this is not a bad "thread thrashing" type of spurious wakeup - you just don't go to sleep).
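
To make that concrete, here's a rough sketch of the epoch idea (my addition - and note it cheats by using WaitOnAddress, a Windows 8 API that didn't exist when this was written, because it is exactly a "sleep only if the value hasn't changed" primitive) :

struct epoch_waiter
{
    volatile LONG m_epoch;  // incremented by every signal(), even with no waiter

    // prepare_wait : called with the mutex held, just records the epoch
    LONG prepare_wait() { return m_epoch; }

    // retire_wait : called after the mutex is unlocked ;
    // only sleeps if no signal() has happened since prepare_wait
    void retire_wait(LONG saw_epoch)
    {
        WaitOnAddress((volatile VOID *)&m_epoch, &saw_epoch, sizeof(LONG), INFINITE);
    }

    void signal()
    {
        InterlockedIncrement(&m_epoch);
        WakeByAddressSingle((PVOID)&m_epoch);
    }
};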

This seems rather trivial, but it is really the crucial aspect of CV - it's why CV works and why we're interested in it. unlock_wait_lock does not need to be atomic, but it sort of acts like it is, in the sense that a lost signal is not allowed to pass between the unlock and the wait.


The next problem that you will see discussed is the issue of "wait generations". Basically if you have N waiters on a CV, and you go into broadcast() (aka notify_all) , it needs to signal exactly those N waiters. What can happen is a new waiter could come in and steal the wakeup that should go to one of the earlier waiters.

As usual I was a bit confused by this, because it's specific to only certain types of CV implementation. For example, if broadcast blocks new waiters from waiting on the CV, there is no need for generations. If your CV signal and wait track their "wait epoch", there is no problem with generations (the epoch defines the generation, if you like). If your CV is FIFO, there is no problem with generations - you can let new waiters come in during the broadcast, and the correct ones will get the wakeup.

So the generation issue mainly arises if you are trying to use a counting semaphore to implement your CV. In that case your broadcast might do something like count the number of waiters (its N), then Post (inc) the semaphore N times. Then another waiter comes in, and he gets the dec on the semaphore instead of the guy you wanted.

The reason that people get themselves into this trouble is that they want to make a CV that respects the OS scheduler; obviously a generic CV implementation should. When you broadcast() to N waiters, the highest priority waiter should get the first wakeup. A FIFO CV with an event per thread is very easy to implement (and doesn't have to worry about generations), but is impossible to make respect OS scheduling. The only way to get truly correct OS scheduling is to make all the threads wait on the same single OS waitable handle (either a Semaphore or Event typically). So, now that all your waiters are waiting on the same handle, you have the generation problem.

Now also note that I said before that using epoch or blocking new waiters from entering during broadcast would work. Those are conceptually simple but in practice quite messy. The issue is that the broadcast might actually take a long time - it's under the control of the OS and it's not over until all waiters are awake. You would have to inc a counter by N in broadcast and then dec it each time a thread wakes up, and only when it gets to zero can you allow new waiters in. Not what you want to do.

I tried to think of a specific example where not having wait generations would break the code ; this is the simplest I could come up with :


before : x = 0

thread 0 :

    cv.lock();
    x ++;
    cv.unlock();
    cv.broadcast();

thread 1 :
    cv.lock();
    while ( x == 0 )
        cv.unlock_wait_lock();
    x = 7;
    cv.signal();
    cv.unlock();

thread 2 :
    cv.lock();
    while ( x != 7 )
        cv.unlock_wait_lock();
    x = 37;
    cv.unlock();

after : x == 37

So the threads are supposed to run in sequence, 0,1,2 , and the CV should control that. If thread 2 gets into its wait first, there can't be any problems. The problem arises if thread 1 gets into its wait first but the wakeup goes to thread 2. The bad execution sequence is like this :


thread 1 : cv.lock()
thread 1 : x == 0
thread 1 : cv.unlock_wait ...

thread 0 : cv.lock()
thread 0 : x++;
thread 0 : cv.unlock()
thread 0 : cv.broadcast .. starts .. sees 1 waiter ..

thread 2 : cv.lock
thread 2 : x != 7
thread 2 : cv.unlock_wait - I go into wait

thread 0 : .. cv.broadcast .. wakes thread 2

deadlock

You can look in boost::thread for a very explicit implementation of wait generations. I think it's rather over-complex but it does show you how generations get made (granted some of that is due to trying to match some of the more arcane requirements of the standard interface). I'll present a few simple implementations that I believe are much easier to understand.


A few more general notes on CV's before we get into implementations :

In some places I will use "signal_unlock" as one call instead of signal() and unlock(). One reason is that I want to convey the fact that signal_unlock is trying to be as close to atomic as possible. (we'll show one implementation where it actually is atomic). The other reason is that the normal CV API that allows signal() outside the mutex can create additional overhead. With the separate API you have either :

lock
..blah..
signal
unlock
which can cause "thread thrashing" if the signal wakes the waiter and then immediately blocks it on the lock. Or you can have :
lock
..blah..
unlock
signal
the problem with this for the implementor is that now signal is outside of the mutex and thus the CV is not protected; so you either have to take the mutex inside signal() or protect your guts in some other way; this adds overhead that could be avoided if you knew signal was always called with the mutex held. So both variants of the separate API are bad and the merged one is preferred.

The other issue, which you may have noticed, is that I combined my mutex and CV. This is slightly more elegant for the user, and is easier for the implementor, because some CV implementations could use knowledge of the guts of the mutex they are attached to. But it's not actually desirable.

The reason it's not great is because you do want to be able to use multiple CV's per mutex. (I've heard it said that you might want to use different mutexes with the same CV, but I can't think of an example where you would actually want to do that).

eg. in our broadcast generation example above, it would actually be cleaner to use a single mutex to protect "x" but then use two different CV's - one that you signal when x = 1, and another that you signal when x = 7. The advantage of this is that the CV signalled state corresponds directly to the condition that the waiter is waiting on.

In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition" , because it reduces unnecessary wakeups. The same could be said for Events or Semaphores or any wakeup mechanism - using wakeups to kick threads to check themselves is not desirable.

We're going to go off on a tangent a bit now.

What would be ideal in general is to actually be able to sleep a thread on a predicate. Instead of doing this :


    cv.lock();
    while ( x != 7 )
        cv.unlock_wait_lock();

you would do something like :

    cv.wait( { x == 7 } );

where it's implied that the mutex is locked when the predicate is checked. This is actually not hard to implement. In your signal() implementation, you can walk the list of waiters in the waitset (you are holding the mutex during signal) and just call predicate() on each waiter and see if he should wake up.
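
A rough sketch of that walk (my addition, not Dmitry's actual code ; the names are made up). signal() holds the mutex, so touching the waitset is race-free :

#include <windows.h>
#include <list>
#include <functional>

struct pred_waiter
{
    HANDLE                  m_event;      // per-thread auto-reset event
    std::function<bool()>   m_predicate;  // e.g. [&]{ return x == 7; }
};

// called from signal() with the external mutex held :
void wake_matching_waiters( std::list<pred_waiter *> & waitset )
{
    for( std::list<pred_waiter *>::iterator it = waitset.begin(); it != waitset.end(); )
    {
        if ( (*it)->m_predicate() )      // his condition is now true
        {
            SetEvent( (*it)->m_event );  // wake only him
            it = waitset.erase(it);
        }
        else
        {
            ++it;
        }
    }
}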

The big advantage is that only the threads that actually want to wake up get woken. Dmitry has done an implementation of something like this and calls it fine grained condvar/eventcount (also submitted to TBB as fine grained concurrent monitor ).

As noted before, using multiple CV's can be a primitive way of getting more selective signalling, if you can associate each CV with a certain condition and wait on the right one.

A nice version of this that some operating systems provide is an event with a bit set. They use something like a 32-bit mask so you can set 32 conditions. Then when you Wait() you wait on a bit mask which lets you choose various conditions to wake for. Then signal passes a bit mask of which conditions to match. This is really a very simplified version of the arbitrary predicate case - in this case the predicate is always (signal_bits & waiting_bits) != 0 (or sometimes you also get the option of (signal_bits & waiting_bits) == waiting_bits so you can do a WAIT_ANY or WAIT_ALL for multiple conditions).

The bit-mask events/signals would be awesome to have in general, but they are not available on most OS'es so oh well. (of course you can mimic it on windows by making 32 Events and using WFMO on them all).

Next up, some condvar implementations...


07-19-11 | Should I take points on my mortgage?

I'm pretty sure the answer is "no".

As usual all the web sites are worthless and all the "real estate professionals" are morons, they just spout the bullshit that was in their training packet that says something about "points pay themselves off after N years so it depends on how long you will live there" ; great, thanks.

I tend to make mistakes on this stuff, so you can check me and let me know if I'm right in these calculations.

The big thing that everyone seems to fuck up is that there are two different factors changing the value of money :

1. inflation is making the mortgage payments actually cost less than they seem to, that is, later payments actually cost you much less in today's dollars, and

2. opportunity cost : market appreciation is making points cost more than they seem to. That is, any dollar spent on points could go in the market and earn something (presumably slightly more than inflation).

Running these numbers :


30-year fixed rate loan
417k loan amount (maximum conforming loan)
4.50% base APR
0.125 APR per point
points cost 1% of loan
inflation 3% annual
market appreciation 1.5% after inflation (same as APR)

The total amount paid after 5,10,20,30 years , with 0.0-1.5 points :
(in today's dollars)

years 5 10 20 30
0.0 117880.594414 219360.356317 381927.354991 502405.045225
0.5 119265.482032 220177.748517 381947.163920 501998.542376
1.0 120653.534938 221001.030904 381977.228228 501605.529932
1.5 122044.768524 221830.232121 382017.597785 501226.073492

The normal numbers that you see everywhere don't include inflation or appreciation and look something like this :

years 5 10 20 30
0.0 126772.664518 253545.329037 507090.658074 760635.987111
0.5 127930.215974 253775.431949 505465.863897 757156.295846
1.0 129091.171485 254012.342971 503854.685941 753697.028912
1.5 130255.547604 254256.095209 502257.190418 750258.285627

Which leads you to believe that points are actually a big savings after 15 years or whatever. In reality it seems to me that they hurt you up front and basically never make it back. (after 30 years they finally do, but the liquid investment has added utility which makes it still the winner)
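
For reference, the shape of the computation is something like this (my sketch ; the exact model behind the tables above may differ slightly, so don't expect to reproduce the figures to the penny) :

#include <math.h>

// total cost after "years", expressed in today's dollars,
// for a given number of points on a 30-year fixed loan :
double cost_in_todays_dollars( double points, double years )
{
    const double loan    = 417000.0;
    const double apr     = 0.045 - 0.00125 * points;   // 0.125% off the APR per point
    const double r       = apr / 12.0;                 // monthly loan rate
    const double n       = 30.0 * 12.0;                // total number of payments
    const double payment = loan * r / ( 1.0 - pow(1.0 + r, -n) );

    // deflate each payment back to today at 3% annual inflation :
    const double inf_m = 0.03 / 12.0;
    double paid = 0.0;
    for(int m=1; m <= (int)(years*12.0); m++)
        paid += payment / pow(1.0 + inf_m, m);

    // points are paid up front ; their opportunity cost grows at the
    // real (after-inflation) market return of 1.5% :
    const double points_cost = 0.01 * points * loan * pow(1.015, years);

    return paid + points_cost;
}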

As usual the exact right answer depends highly on the assumed figures, and there's huge uncertainty about them. However I think we can reliably say that points are grossly over-valued in the popular literature.

The next question is - if I have some extra cash, should I pay more down payment or put it in the market? (more down payment is equivalent to early prepayment, so this is equivalent to asking if you should early prepay).

Same assumptions as above.
Extra cash available is 0.5-2.0% of loan amount
Note : rows of this table are not directly comparable

Blue columns : extra cash put towards reducing loan amount
Black columns : extra cash saved and appreciated (counted as reduction of amount paid after N years)
cash 5 5 10 10 20 20 30 30
0.0 117880.594414 117880.594414 219360.356317 219360.356317 381927.354991 381927.354991 502405.045225 502405.045225
2085.0 117291.191442 115633.311216 218263.554536 216938.158824 380017.718216 379113.426599 499893.019999 499136.033092
4170.0 116701.788470 113386.028018 217166.752754 214515.961330 378108.081441 376299.498207 497380.994773 495867.020958
6255.0 116112.385498 111138.744819 216069.950972 212093.763837 376198.444666 373485.569815 494868.969547 492598.008825
8340.0 115522.982526 108891.461621 214973.149191 209671.566343 374288.807892 370671.641423 492356.944321 489328.996691

Keeping the cash in the market is better over all time scales.

What if the market only matches inflation and doesn't beat it by that 1.5% ?

X 5 5 10 10 20 20 30 30
0.0 117880.594414 117880.594414 219360.356317 219360.356317 381927.354991 381927.354991 502405.045225 502405.045225
2085.0 117291.191442 115795.594414 218263.554536 217275.356317 380017.718216 379842.354991 499893.019999 500320.045225
4170.0 116701.788470 113710.594414 217166.752754 215190.356317 378108.081441 377757.354991 497380.994773 498235.045225
6255.0 116112.385498 111625.594414 216069.950972 213105.356317 376198.444666 375672.354991 494868.969547 496150.045225
8340.0 115522.982526 109540.594414 214973.149191 211020.356317 374288.807892 373587.354991 492356.944321 494065.045225

Now over 30 years paying off the loan early is better. But again it's so minor that if you count the utility of liquidity it just never makes sense.

One exception to all this would be if the market actually inverts for a while (returns less than inflation, or negative inflation-adjusted). It seems to me that is a real valid concern over the next 10-20 years, but when that happens pretty much all the "common wisdom" goes out the window.


07-19-11 | The internet just doesn't work

Joe Duffy's page at bluebyte is down or gone. (has been for weeks).

Which reminds me that the internet just doesn't work.

I mean, as a way of stealing money from stupid people, it works great, which I suppose is what the people behind the modern internet are really interested in. But as a way of presenting information in a simple, efficient, permanent, archivable format, it's shit.

Whenever I go back to one of my posts from a few years ago, which I carefully linked to good info - half the links don't work anymore.

It's just worse than fucking five-hundred-year-old technology (books). I can buy a book and put it on my shelf and it doesn't disappear in the night.

Of course it's even worse if you make fancy pages that use AJAX or Flash (I don't even know what the new widget flavor of the month is) or proprietary formats like PPT or whatever, since that stuff will be a huge pain to keep working 10-20 years from now.

Anyway, pursuant to this I thought I should go and actually download some of my favorite pages. Unfortunately it's much harder than you might think.

Anything at Google Groups is a good example of the problem.

It sure looks like just a bunch of plain text. Oh no. It's running through some kind of crazy Google mumbo-jumbo. If you just use Firefox's "Save Page" you get 600k of shit for that tiny bit of text - and it fails to download it even remotely correctly. (but at least it does get the primary text)

If you use HTTrack to try to mirror the whole page, it downloads about 1000k of shit and fails to get a readable page AT ALL.

OMG this should not be so difficult. Text, people, text!


07-18-11 | MCS list-based lock

We did CLH before, but MCS really is better, so let's go through it.

MCS is another list/node based lock from the ancient days. The nodes are now linked forwards from head to tail. The "head" currently owns the mutex and the tail is the next slot for someone trying to enter the lock.

Unlike CLH, as soon as you leave the mutex in MCS, your node is no longer in use in the chain, so you can free it right away. This also means you can just keep your node on the stack, and no allocations or any goofy gymnastics are needed.

Also like CLH, MCS has the nice performance advantage of only spinning on a local variable (do cache line padding to get the full benefit of this).

The way it works is quite simple. Each node has a "gate" flag which starts "blocked". You spin on your own gate being open. The previous lock-holder will point at you via his "next" pointer. When he unlocks he sets "next->gate" to unlocked, and that allows you to run.

The code is :


// mcs on stack

struct mcs_node
{
    std::atomic<mcs_node *> next;
    std::atomic<int> gate;
    
    mcs_node()
    {
        next($).store(0);
        gate($).store(0);
    }
};

struct mcs_mutex
{
public:
    // tail is null when lock is not held
    std::atomic<mcs_node *> m_tail;

    mcs_mutex()
    {
        m_tail($).store( NULL );
    }
    ~mcs_mutex()
    {
        ASSERT( m_tail($).load() == NULL );
    }
    
    class guard
    {
    public:
        mcs_mutex * m_t;
        mcs_node    m_node; // node held on the stack

        guard(mcs_mutex * t) : m_t(t) { t->lock(this); }
        ~guard() { m_t->unlock(this); }
    };
    
    void lock(guard * I)
    {
        mcs_node * me = &(I->m_node);
        
        // set up my node :
        // not published yet so relaxed :
        me->next($).store(NULL, std::mo_relaxed );
        me->gate($).store(1, std::mo_relaxed );
    
        // publish my node as the new tail :
        mcs_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        if ( pred != NULL )
        {
            // (*1) race here
            // unlock of pred can see me in the tail before I fill next
            
            // publish me to previous lock-holder :
            pred->next($).store(me, std::mo_release );

            // (*2) pred not touched any more       

            // now this is the spin -
            // wait on predecessor setting my flag -
            rl::linear_backoff bo;
            while ( me->gate($).load(std::mo_acquire) )
            {
                bo.yield($);
            }
        }
    }
    
    void unlock(guard * I)
    {
        mcs_node * me = &(I->m_node);
        
        mcs_node * next = me->next($).load(std::mo_acquire);
        if ( next == NULL )
        {
            mcs_node * tail_was_me = me;
            if ( m_tail($).compare_exchange_strong( tail_was_me,NULL,std::mo_acq_rel) )
            {
                // got null in tail, mutex is unlocked
                return;
            }
            
            // (*1) catch the race :
            rl::linear_backoff bo;
            for(;;)
            {
                next = me->next($).load(std::mo_acquire);
                if ( next != NULL )
                    break;
                bo.yield($);
            }
        }

        // (*2) - store to next must be done,
        //  so no locker can be viewing my node any more        

        // let next guy in :
        next->gate($).store( 0, std::mo_release );
    }
};

there are two subtle "moments" (in Dmitry's terminology) which have been marked in the code.

The main one is (*1) : this is actually exactly analogous to the "unlocking in progress" state that we talked about with the list-event mutex that I designed earlier. In this case it's a "locking in progress". The state is indicated when "m_tail" has been changed, but "->next" has not yet been filled. Because m_tail is exchanged first (it is the atomic sync point for the lock), your node is published before pred->next is set. So the linked list can be broken into pieces and invalid during this phase. But the unlocker can easily detect it and spin to wait for this phase to pass.

The other important point is (*2) - this is what allows you to keep the node on the stack. Your node can be held by another thread, but by the time you get to *2 you know it must be done with you.

So, it has the FIFO, SCHED_OTHER, etc. complaints like previous mutexes. But it has a lot of advantages. The mutex itself is tiny, just one pointer. The fast path is one exchange and one CAS ; that's not the best, but it's okay. But the real advantage is its flexibility.

You can change the spin to a Wait() quite trivially. (just store an Event in the node instead of the "gate" flag)
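
Roughly (my sketch, not tested) the change looks like this :

// node variant where the gate is a per-thread auto-reset event instead of a flag :
struct mcs_node_waitable
{
    std::atomic<mcs_node_waitable *> next;
    HANDLE event;   // e.g. pulled from TLS, auto-reset
};

// in lock(), replace the spin on me->gate with :
//     WaitForSingleObject( me->event, INFINITE );
// in unlock(), replace next->gate.store(0) with :
//     SetEvent( next->event );
// the auto-reset event stays signalled if unlock runs before the waiter
// reaches the wait, so the handoff race is handled the same way as the flag.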

You can spin and try to CAS m_tail to acquire the lock a bit before waiting if you like.

You can combine spinners and waiters! That's unusual. You can use the exact same mutex for a spinlock or a Wait lock. I'm not sure that's ever a good idea, but it's interesting. (one use for this is if you fail to get a handle to wait on, you can just spin, and your app will degrade somewhat gracefully).

Cool!


07-18-11 | cblib Relacy

Announce : cblib now has its own version of "Relacy".

This is no replacement for the original - go get Dmitry's Relacy Race Detector if you are serious about low level threading, you must have this. It's great. (I'm using 2.3.0).

Now in cblib is "LF/cblibRelacy.h" which is a look-alike for Relacy.

What you do is write code for Relacy just like you normally would. That is, you write C++0x but you add $ on all the shared variable accesses (and use rl::backoff and a few other things). You set up a test class and run rl::simulate.

Now you can change your #include "relacy.h" to #include "cblib/lf/cblibRelacy.h" and it should just transparently switch over to my version.

What does my version do differently?

0. First of all, it is no replacement for the real Relacy, which is a simulator that tries many possible races; cblibRelacy uses the real OS threads, not fibers, so the tests are not nearly as comprehensive. You need to still do your tests in the real Relacy first to check that your algorithm is correct.

1. It gives you actually usable compiled code. eg. if you take the clh_mutex or anything I've posted recently and combine it with cblibRelacy.h , you have a class you can go and use. (but don't literally do that)

2. It runs its tests on the actual compiled code. That means you aren't testing in a simulator which might hide problems with the code in the real world (eg. if your implementation of atomics has a bug, or if there's a priority inversion caused by the scheduling of real OS threads).

3. Obviously you can test OS primitives that Relacy doesn't support, like Nt keyed events, or threads waiting on IO, etc.

4. It can run a lot more iterations than the real Relacy because it's using real optimized code; for example I found bugs with my ticket locks when the ticket counters overflowed 16 bits, which is way more iterations than you can do in real Relacy.

How does it work :

I create a real Win32 thread for each of the test threads. The atomic ops translate to real atomic ops on the machine. I then run the test many times, and try to make the threads interleave a bit to make problems happen. The threads get stopped and started at different times and in different orders to try to get them to interleave a bit differently on each test iteration.

Optionally (set by cblibRelacy_do_scheduler), I can also use the Relacy $ points to juggle my scheduling, just the same way Relacy does with fibers. Wherever a $ occurs (access to a shared variable), I randomize an operation and might spin a bit or sleep the thread or some other things. This gives you way more interleaving than you would get just from letting the OS do thread switching.
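
To give a feel for it, the juggling at a $ point is something along these lines (purely illustrative - this is not the actual cblibRelacy code, and the constants are arbitrary) :

// called at each $ point when cblibRelacy_do_scheduler is on :
void schedule_point()
{
    switch ( rand() & 15 )
    {
    case 0:  SwitchToThread(); break;   // let another ready thread run
    case 1:  Sleep(0); break;           // yield to threads of equal priority
    case 2:  Sleep(1); break;           // actually give up the time slice
    case 3:  for(int i=0;i<100;i++) YieldProcessor(); break;  // brief spin
    default: break;                     // most of the time, do nothing
    }
}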

Now as I said this is no substitute for the real Relacy, you'll never get as many fine switches back and forth as he does (well, you could of course, if you made the $ juggle your thread almost always, but that would actually make it much slower than Relacy because he uses fibers to make all the switching less expensive).

One important note - cblibRelacy will not stress your test very well unless you run it with more threads than you have cores. The reason is that if your system is not oversubscribed, then the SwitchToThread() that I use in $ will do nothing.

Also, don't be a daft/difficult commenter. This code is intended as a learning tool, it's obviously not ready to be used directly in a production environment (eg. I don't check any OS return codes, and I intentionally make failures be asserts instead of handling them gracefully). If you want some of these primitives, I suggest you learn from them then write your own versions of things. Or, you know, buy Oodle, which will have production-ready multi-platform versions of lots of stuff.

ADDENDUM : I should also note to be clear : Relacy detects races and other failure types and will tell you about them. cblibRelacy just runs your code. That means to actually make it a test, you need it to do some asserting. For example, with real Relacy you can test your LF Stack by just pushing some nodes and popping some nodes. With cblibRelacy that wouldn't tell you much (unless you crash). You need to write a test that pushes some values, then pops some value and asserts that it got the same things out.
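
For example, a stack test might look something like this (just a sketch - "lf_stack" and its push/pop are stand-ins for whatever structure you're actually testing) :

// each thread pushes known values and pops the same number back out ;
// the totals are compared at the end, so lost or duplicated nodes show up
std::atomic<long long> g_pushed_sum(0);
std::atomic<long long> g_popped_sum(0);

void test_thread(lf_stack * stack, int thread_index)
{
    for(int i=0;i<100;i++)
    {
        int value = thread_index*1000 + i;
        stack->push(value);
        g_pushed_sum.fetch_add(value);

        int popped;
        while ( ! stack->pop(&popped) ) { }   // must succeed eventually
        g_popped_sum.fetch_add(popped);
    }
}

// after joining all the threads :
//   ASSERT( g_pushed_sum.load() == g_popped_sum.load() );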


07-17-11 | CLH list-based lock

The multi-way ticket lock we just did is very similar to some classic spin locks. I found this nice page : scalable synchronization pseudocode ( and parent page at cs.rochester ) ( and similar material covered here, but with nice drawings : Mutual Exclusion: Classical Algorithms for Locks (PDF) ).

I wanted to see how the classic MCS queue lock compares to my per-thread-mpsc lock; the answer is that they don't have much in common. The classic queue locks are really closer kin to the multi-way ticket lock. I'll try to show that now. The MCS lock is probably more well known, but the CLH lock is simpler, so I'll deal with that.

The idea of these locks is to avoid the heavy cache contention inherent to the basic single-variable gate locks. To solve that, the idea is to use a distributed gate; basically one gate variable per waiter, and it's the responsibility of the unlocker to open the gate for the next waiter. So there has to be some kind of linked list so that the unlocker can find the next waiter. And these locks will be inherently FIFO and SCHED_OTHER and all that. (these are really only appropriate for kernels or kernel-like environments)

The CLH algorithm is usually described as a linked list, with the "head" of the list being the node that currently has access to the mutex, and the "tail" being the variable held in the lock struct. When new waiters come in, they tack onto the tail, thus it's FIFO.

There's a node for each waiter, and each node contains the gate for the guy after me :


struct clh_node
{
    // list is linked from tail backwards :
    std::atomic<clh_node *> prev;
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node()
    {
        prev($).store(0);
        succ_must_wait($).store(0);
    }
};

we also need a way of providing a node per-thread - in fact *per-lock*! This is different than my event-queue-mutex, which just needs a node *per-thread*; the reason is that the nodes in CLH keep getting used even after you unlock, so you can't just reuse them. However, you can free some node when you unlock - just not necessarily the one you passed in. So anyhoo, we need some struct to pass in this node for us, here it is :

struct ThreadNode
{
    std::atomic<clh_node *> pNode;
    
    ThreadNode()
    {
        pNode($).store( new clh_node );
    }
    ~ThreadNode()
    {
        // note that the pNode I delete might not be the one I created
        //  so don't try to hold it by value
        clh_node * n = pNode($).exchange(0, std::mo_relaxed);
        delete n;
    }
};

this could be in the TLS, or it could be in the mutex::guard , or whatever.

Okay, now that we have our helpers we can write the code. When the mutex is held, the tail node will have succ_must_wait = 1 , when you take the lock you stick yourself on the tail and then wait on your predecessor. To unlock the mutex you just set succ_must_wait = 0 on yourself, and that allows the guy after you to go :


struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_tail;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_tail($).store( new clh_node );
    }
    ~clh_mutex()
    {
        clh_node * n = m_tail($).exchange(0);
        delete n;
    }
    
    void lock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        //me->prev($).store(0, std::mo_relaxed );
        clh_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        me->prev($).store(pred, std::mo_relaxed );
        
        // wait on predecessor's flag -
        //  this is why pred can't free himself
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
    }
    
    void unlock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
        
        clh_node * pred = me->prev($).load(std::mo_relaxed);
        me->succ_must_wait($).store( 0, std::mo_release );
        // take pred's node :
        //  this leaves my node allocated, since succ is still looking at it
        I->pNode($).store( pred, std::mo_relaxed );
    }

};

okay, I think this is reasonably self-explanatory. BTW the reason the classical locks are the way they are is often to avoid test-and-set ops, which older architectures either didn't have or made very expensive; here we use only one exchange, the rest is just loads and stores.

That matches the classical algorithm description, but it's a lot more expensive than necessary. The first thing you might notice is that we don't actually need to store the linked list at all. All we need to do is get "pred" from lock to unlock. So you can either store it in the mutex struct, or put it in the "guard" (ThreadNode in this case) ; I think putting it in the guard is better, but I'm going to put it in the mutex right now because it's more analogous to our next step :


struct clh_node
{
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node() { succ_must_wait($).store(0); }
};

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_lock;
    std::atomic<clh_node *> m_lock_pred;
    std::atomic<clh_node *> m_lock_holder;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store( new clh_node );
        m_lock_pred($).store( 0 );
        m_lock_holder($).store( 0 );
    }
    ~clh_mutex()
    {
        clh_node * n = m_lock($).exchange(0);
        delete n;
    }

    clh_node * alloc_slot()
    {
        return new clh_node;
    }
    void free_slot(clh_node * p)
    {
        delete p;
    }
    
    void lock()
    {
        clh_node * me = alloc_slot();
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        clh_node * pred = m_lock($).exchange(me, std::mo_acq_rel);
        
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        m_lock_holder($).store(me, std::mo_relaxed );
        m_lock_pred($).store(pred, std::mo_relaxed );
    }
    
    void unlock()
    {
        clh_node * me = m_lock_holder($).load(std::mo_relaxed);
        
        clh_node * pred = m_lock_pred($).load(std::mo_relaxed);
        
        me->succ_must_wait($).store( 0, std::mo_release );

        free_slot( pred );
    }

};

and rather than pass in the nodes I just bit the bullet and allocated them. But now the obvious thing to do is make alloc_slot and free_slot just take & return nodes from an array. But then "me" is just stepping a pointer through an array. So our "linked list" should just be a sequence of adjacent elements in an array :

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    #define NUM_WAYS    16
    // should be cache line sized objects :
    std::atomic<int> succ_must_wait[NUM_WAYS];
    std::atomic<int> m_lock;
    VAR_T(int) m_lock_pred;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store(0);
        succ_must_wait[0]($).store(0);
        for(int i=1;i<NUM_WAYS;i++)
        {
            succ_must_wait[i]($).store(1);
        }
        m_lock_pred($) = 0;
    }
    ~clh_mutex()
    {
    }

    void lock()
    {   
        int pred = m_lock($).fetch_add(1, std::mo_acq_rel);
        pred &= (NUM_WAYS-1);
        
        rl::linear_backoff bo;
        while ( succ_must_wait[pred]($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        // m_lock_pred just remembers my index until unlock
        //  could be a local
        m_lock_pred($) = pred;
    }
    
    void unlock()
    {
        int pred = m_lock_pred($);
        int me = (pred+1)&(NUM_WAYS-1);
        
        // recycle this slot :
        succ_must_wait[pred]($).store(1, std::mo_relaxed);
        
        // free my lock :
        succ_must_wait[me]($).store( 0, std::mo_release );
    }

};

(as usual, m_lock_pred doesn't really belong as a member variable in the lock).

But this is exactly "Anderson's array-based queue lock" that we mentioned at the end of the ticket-lock post, and it's also just a CLH lock with the nodes stuck in an array. This suffers from the big problem that you must have enough array entries for the threads that will touch the lock or it doesn't work (what happens is multiple threads can get into the mutex at the same time, eg. it doesn't actually provide mutual exclusion).

I don't think this is actually useful for anything, but there you go.


07-17-11 | Atman's Multi-way Ticket Lock

But first - Atman sent me a note about the ticket lock which is related and worth sharing.

Imagine we're in a "high performance computing" type environment, we have a mess of threads running, locked on the processor, which is the appropriate time to use something like a ticket lock. Now, they all try to get the lock. What actually happens?

N threads run into the mutex and 1 gets the lock. So the remaining (N-1) go into their spin loops :


while ( m_serving($).load(std::mo_acquire) != my_ticket ) ;

but what is this actually doing? When you try to load that shared variable, what it does is something like :

    is this cache line current?
    okay read the variable from my copy of the cache line

now the first guy holding the mutex unlocks it. This marks the cache line dirty and that is propagated to all the other cores. Now the remaining (N-1) spinners try to read m_serving again. But this time it does :

    is this cache line current?
    no it's not, get the new copy of the cache line
    okay read the variable from my copy of the cache line

the cache line had to be copied around (N-1) times. You can see the pattern, and to do N unlocks with N waiters you wind up doing N^2 cache line copies. Obviously this is not okay for N large.

(note that this is why putting a backoff pause in your spin loop can actually be a performance advantage even on non-hyperthreaded cores - it reduces cache line traffic ; also in the special case of the ticket lock, the waiters actually know how far they are from the front of the list, so they can do "proportional backoff" by subtracting "now serving" from "my_ticket" and pausing proportionally to that difference).
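
A sketch of what proportional backoff looks like in the ticket lock's spin loop (same $-annotated style as these posts; the pause count per waiter is just an arbitrary tuning constant) :

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);

        for(;;)
        {
            unsigned int cur = m_serving($).load(std::mo_acquire);
            if ( me == cur )
                return;

            // I know exactly how many lockers are ahead of me :
            unsigned int ahead = me - cur;  // unsigned subtract handles wrap

            // pause proportional to my distance from the front of the line :
            for(unsigned int i=0;i<ahead*50;i++)
                YieldProcessor();
        }
    }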

Okay, but we can do better. Leave m_ticket as a single gate variable, but split m_serving into many "ways" (ways in the cache sense). So depending on your ticket # you look at a different serving number. This is just like a very large bakery - rather than have a single "now serving" sign, we have one for odd tickets and one for even tickets ; you stand on the side of the room with the appropriate sign for you and just read that sign.

The code is :


struct ticket_mutex_ways
{
    enum { NUM_WAYS = 16 };
    std::atomic<unsigned int> m_ticket;
    // note : m_serving objects should be cache-line-size padded :
    std::atomic<unsigned int> m_serving[NUM_WAYS];
    VAR_T(int)  m_lock_holder;
    
    ticket_mutex_ways()
    {
        m_ticket($).store( 0 );
        for(int i=0;i<NUM_WAYS;i++)
            m_serving[i]($).store( 0 );
    }
    ~ticket_mutex_ways()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel); // mo_acquire ; *1
    
        int way = me % NUM_WAYS;
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving[way]($).load(std::mo_acquire);
            if ( me == cur )
                break;
        
            bo.yield($);
        }
        
        m_lock_holder($) = me;
    }

    void unlock()
    {
        int next = m_lock_holder($) + 1;
        int way = next % NUM_WAYS;
        
        m_serving[way]($).store(next,std::mo_release);
    }
};

the key thing is that in the spin loop you are only touching the serving variable in your way, and there is no cache contention with up to NUM_WAYS lockers. (as noted - you need cache line size padding between the variables)

(*1 standard caveat here - this only needs to be acquire, but then you need other mechanisms to protect your mutex, so beware)

Note that "m_lock_holder" really doesn't belong in the mutex structure; it's an annoyance that I have to put it there; it should just be held on the stack until the unlock. If you use some kind of "guard" class to wrap your mutex lock/unlock it would be more appropriate to store this in the guard. (in fact it is probably good class design to make lock & unlock take the "guard" as a parameter, because that allows more flexibility).

This is pretty cool, I think it's about as fast as you can get if you don't care about your mutex being rather large. One nice thing about it is that you don't need to know your number of CPUs. There are a lot of similar algorithms that break unless NUM_WAYS is >= number of threads. (for example, you can do basically the same thing but just use a bool in each way to indicate locked or not, and that works fine as long as num threads is < num ways; that would be "Anderson's array-based queue lock" BTW). With Atman's algorithm, you can even choose to make NUM_WAYS smaller than your thread count, and you will still be fast as long as the number of contenders is less than the number of ways.


07-17-11 | Per-thread event mutexes

Another class of mutex implementations that we haven't talked about yet are those based on a list of waiting threads. Again this is a general pattern which is useful in lots of threading primitives, so it's useful to talk about.

The basic common idea is that you have some kind of node per thread. This can be in the TLS, it can be on the stack (the stack is a form of TLS, BTW), or if you're the OS it is in the OS thread structure. (btw most implementations of mutexes and waitlists inside the kernel take this form, using a node inside the OS thread structure, but of course they are much simpler because they are in control of the thread switching).

A common advantage of this kind of scheme is that you only need as many waitable handles as threads, and you can have many many mutexes.

So a pseudo-code sketch is something like :


per-thread node for linked list

lock(mutex,node) :
  try to acquire mutex
  if failed, add node to waiting list

unlock(mutex) :
  pop node off waiting list
  if not null, set owner to him
  else set owner to none

If I'm building my own mutex in user mode and don't want to use a previously existing mutex, I need to make the linked list of waiters lock-free. Now, note that the linked list I need here is actually "MPSC" (multi-producer single-consumer), because the consumer is always the thread that currently holds the mutex. (being SC doesn't mean the same thread always has to consume, it means there is only ever one consumer at a time, ensured by some exclusion mechanism (such as a mutex)).

If I'm going to go to the trouble of managing this list, then I want my unlock to provide "direct handoff" - that is, guarantee that a waiter gets the lock and no new locker can sneak in and grab it. I also want my threads to really be able to go to sleep and wake up, since we saw with the ticket lock that if you don't control thread waking with this kind of mutex, you can get the problem that the thread you are handing to is swapped out and that blocks all the other threads from proceeding.

Now I haven't yet specified that the list has to be ordered, but sure let's make it a FIFO so we have a "fair" mutex, and I'll go ahead and use an MPSC_FIFO queue because I have one off the shelf that's tested and ready to go.

So, we obviously need some kind of prepare_wait/doublecheck/wait mechanism in this scenario , so we try something like :


lock(mutex,node) :
{
 try acquire mutex ; if succeeded , return

 prepare_wait : push node to MPSC_FIFO :

 double check :
 try acquire mutex
 if succeeded :
   cancel_wait
   return

 wait;
}

unlock() :
{
 node = pop MPSC_FIFO
 if no node
   set mutex unlocked , return

 set owner to node
 wake node->thread
}

(this doesn't work). But we're on the right track. There's one race that we handle okay :

locker : tries to acquire mutex, can't

unlocker : pop FIFO , gets nada
unlocker : set mutex unlocked

locker : push to FIFO
locker : double check acquire
locker : gets it now , cancel wait

that's okay. But there's another race that we don't handle :

locker : tries to acquire mutex, can't

unlocker : pop FIFO , gets nada

locker : push to FIFO
locker : double check acquire , doesn't get it

unlocker : set mutex unlocked

locker : wait

ruh-roh ! deadlock.

So, there's one ugly way to fix this. In the unlock, before the pop, you change the mutex status from "locked" to "unlocking in progress". Then when the locker does the double-check , he will fail to get the mutex but he will see the "unlocking in progress" flag, and he can handle it (one way to handle it is by spinning until the state is no longer "in progress").

But this is quite ugly. And we would like our mutex to not have any spins at all, and in particular not spins that can be blocked for the duration of a thread switch. (though to some extent this is inherent in mutexes, so it's not a disaster here, it is something to beware of generally in lockfree design - you don't want to design with exclusive states like "unlock in progress" that block other people if you get swapped out).

So, another way to handle that second bad race is for the double-check to be mutating. If double-check is a fetch_add , then when unlocker sets the mutex to unlocked, there will be a count in there indicating that there was a push after our pop.

Thus we change the "gate" to the mutex to be a counter - we use a high bit to indicate locked state, and the double check does an inc to count waiters :


struct list_mutex3
{
    enum
    {
        UNLOCKED = 0,
        LOCKED = (1<<30),
        WAITER = 1
    };
    
    std::atomic<int> m_state;
    MPSC_FIFO   m_list;

    list_mutex3()
    {
        m_state($).store( UNLOCKED );
        MPSC_FIFO_Open(&m_list);
    }
    ~list_mutex3()
    {
        MPSC_FIFO_Close(&m_list);
    }

    void lock(ThreadNode * tn)
    {
        // optionally do a few spins before going to sleep :
        const int spin_count = 10;
        for(int spins=0;spins<spin_count;spins++)
        {
            int zero = UNLOCKED;
            if ( m_state($).compare_exchange_strong(zero,LOCKED,std::mo_acq_rel) )
                return;
                
            HyperYieldProcessor();
        }
                  
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
        
        // double check :
        int prev = m_state($).fetch_add(WAITER,std::mo_acq_rel);
        if ( prev == UNLOCKED )
        {
            // I got the lock , but set the wrong bit, fix it :
            m_state($).fetch_add(LOCKED-WAITER,std::mo_release);
            // remove self from wait list :
            cancel_wait(tn);
            return;
        }

        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-LOCKED,std::mo_release);
        ASSERT( prev >= LOCKED );
        if ( prev == LOCKED )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = MPSC_FIFO_Pop(&m_list);
        // there must be one because the WAITER inc is after the push
        ASSERT( pNode != NULL );

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;

        // we turned off locked, turn it back on, and subtract the waiter we popped :
        prev = m_state($).fetch_add(LOCKED-WAITER,std::mo_release);
        ASSERT( prev < LOCKED && prev >= WAITER );
        SetEvent(tn->m_event);
    }
    
    void cancel_wait(ThreadNode * tn)
    {
        MPSC_FIFO_Fetch( &m_list);
        MPSC_FIFO_Remove(&m_list,tn);
    }
};

I think it's reasonably self-explanatory what's happening. Normally when the mutex is locked, the LOCKED bit is on, and there can be some number of waiters that have inc'ed the low bits. The unlock is reasonably fast because it checks the waiter count and doesn't have to bother with queue pops if there are no waiters.

In the funny race case, what happens is the LOCKED bit turns off, but WAITER gets inc'ed at the same time, so the mutex is still blocked from entry (because the initial CAS to enter is checking against zero, it's not checking the LOCKED bit). During this funny phase that was previously a race, now the unlocker will see that the double-check has happened (and failed) and will proceed into the pop-signal branch.

Remember that fetch_add returns the value *before* the add. (this sometimes confuses me because the Win32 InterlockedIncrement returns the value *after* the increment).
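
Just to spell that difference out (tiny illustrative snippet) :

std::atomic<int> x(5);
int prev = x.fetch_add(1);              // prev == 5 , x is now 6

LONG y = 5;
LONG post = InterlockedIncrement(&y);   // post == 6 , y is now 6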

cancel_wait is possible because at that point we own the mutex, thus we are the SC for the MPSC and we can do whatever we want to it. In particular my implementation of MPSC uses Thomasson's trick of building it from MPMC stack and then using exchange to grab all the nodes and reverse them to FIFO order. So Fetch does the exchange and reverse, and then I have a helper that can remove a node. (obviously those should be combined for efficiency). You should be able to do something similar with most MPSC implementations (*).

(* = Dmitry has a very clever MPSC implementation ( here : low-overhead mpsc queue - Scalable Synchronization Algorithms ) which you cannot use here. The problem is that Dmitry's MPSC can be temporarily made smaller by a Push. During Push, it goes through a phase where previously pushed nodes are inaccessible to the popper. This is fine if your popper is something like a worker thread that just spins in a loop popping nodes, because it will eventually see them, but in a case like this I need the guarantee that anything previously pushed is definitely visible to the popper).
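
For reference, a minimal sketch of that stack-plus-reverse construction (just the core idea, not my actual MPSC implementation; plain std::atomic, no $ annotations, and no Remove support) :

struct snode
{
    std::atomic<snode *> next;
};

struct mpsc_from_stack
{
    std::atomic<snode *> m_head;   // LIFO stack head

    mpsc_from_stack() : m_head(NULL) { }

    // multi-producer push : ordinary lock-free stack push
    void Push(snode * n)
    {
        snode * head = m_head.load(std::memory_order_relaxed);
        do
        {
            n->next.store(head, std::memory_order_relaxed);
        } while ( ! m_head.compare_exchange_weak(head, n,
                    std::memory_order_release, std::memory_order_relaxed) );
    }

    // single-consumer fetch : grab the whole stack with one exchange,
    // then reverse it to FIFO order ; the exchange guarantees everything
    // pushed before it is visible to the popper
    snode * Fetch()
    {
        snode * lifo = m_head.exchange(NULL, std::memory_order_acquire);
        snode * fifo = NULL;
        while ( lifo )
        {
            snode * next = lifo->next.load(std::memory_order_relaxed);
            lifo->next.store(fifo, std::memory_order_relaxed);
            fifo = lifo;
            lifo = next;
        }
        return fifo;
    }
};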

In the fast path (no contention), lock is one CAS and unlock is one fetch_add (basically the same as a CAS). That's certainly not the cheapest mutex in the world but it's not terrible.

Now, clever readers may have already noticed that we don't actually need the LOCKED bit at all. I left it in because it's a nice illustration of the funny state changes that happen in the race case, but in fact we can set LOCKED=1 , and then all our adds of (LOCKED-WAITER) go away, which gives us the simpler code :


    void lock(ThreadNode * tn)
    {   
        int zero = 0;
        if ( m_state($).compare_exchange_strong(zero,1,std::mo_acq_rel) )
            return;
                
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
                    
        // inc waiter count :
        int prev = m_state($).fetch_add(1,std::mo_acq_rel);
        if ( prev == 0 )
        {
            // remove self from wait list :
            cancel_wait(tn);
            return;
        }
                
        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-1,std::mo_release);
        if ( prev == 1 )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = MPSC_FIFO_Pop(&m_list);
        // there must be one because the WAITER inc is after the push
        ASSERT( pNode != NULL );

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;
        SetEvent(tn->m_event);
    }

and it's obvious that m_state is just an entry count.

In fact you can do an even simpler version that doesn't require cancel_wait :


    void lock(ThreadNode * tn)
    {                       
        // inc waiter count :
        int prev = m_state($).fetch_add(1,std::mo_acq_rel);
        if ( prev == 0 )
        {
            // got the lock
            return;
        }
                
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
                
        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-1,std::mo_release);
        if ( prev == 1 )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = NULL;
        rl::backoff bo;
        for(;;)
        {
            pNode = MPSC_FIFO_Pop(&m_list);
            if ( pNode ) break;
            bo.yield($);
        }

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;
        SetEvent(tn->m_event);
    }

where you loop in the unlock to catch the race. This last version is not recommended, because it doesn't allow spinning before going to sleep, and requires a loop in unlock.

One more note : all of these suffer from what Thomasson calls "SCHED_OTHER". SCHED_OTHER is a Linux term for one of the schedulers in that OS. What it means in this context is that we are not respecting thread priorities or any more exotic scheduling that the OS wants, because each thread here is waiting on its own event (and by "event" I mean "generic OS waitable handle"). If what you really want is a FIFO mutex then that's fine, you got it, but usually you would rather respect the OS scheduler, and to do that you need all your waiting threads to wait on the same handle.


07-16-11 | Ticket FIFO Mutex

The Linux kernel internally uses a FIFO spinlock that they call "ticket lock". A ticket or "bakery" algorithm is quite a common pattern so we'll have a glance.

The analogy is the easiest way to understand it. There's an atomic ticket machine, when you walk into the shop you grab a ticket (and the machine increments itself). On the wall is a "now serving" sign that counts up as people turn in their tickets.

This can be implemented most obviously using two ints :


struct ticket_mutex2
{
    // (*0)
    std::atomic<unsigned int> m_ticket;
    std::atomic<unsigned int> m_serving;

    ticket_mutex2()
    {
        m_ticket($).store( 0 );
        m_serving($).store( 0 );
    }
    ~ticket_mutex2()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving($).load(std::mo_acquire);
            if ( me == cur )
                return;
        
            bo.yield($);
            
            // (*1)
        }
    }

    void unlock()
    {
        // (*2)
        // my ticket must match m_serving
        // (*3)
        //m_serving($).fetch_add(1,std::mo_release);
        unsigned int cur = m_serving($).load(std::mo_relaxed);
        m_serving($).store(cur+1,std::mo_release);
    }
};

*0 : obviously you could put the two counters into words and mush them in one int (Linux on x86 used to put them into bytes and mush them into one word), but it's actually a better demonstration of the algorithm to have them separate, because it's a weaker constraint. Lockfree algorithms always continue to work if you mush together variables into larger atomic pieces, but rarely continue to work if you separate them into smaller independent atomic pieces. So when you're trying to show the fundamental requirements of an algorithm you should use the minimum mushing-together required.

(BTW I don't remotely claim that any of the things I've posted have the minimum synchronization constraints required by the algorithm, but that is always the goal).

*1 : you might be tempted to put a Wait here using eventcount or something, but you can't. The problem is if multiple threads go to sleep there, only the one thread that has the next ticket will be able to take the lock. So if you use a generic waitset, you might wake the wrong thread, it won't be able to get in, and you will deadlock. More on this in a moment.

*2 : m_serving is actually protected by the mutex, it is only ever modified by the mutex holder. m_ticket is actually the barrier variable for acquiring the lock. When you get the lock you could store your ticket id as a member in the lock struct and at unlock it will be equal to m_serving.

*3 : you can of course use an atomic increment on serving but because of *2 it's not necessary, and a simple load & inc is cheaper on some architectures (and as per *0, it's a weaker constraint so we prefer to demonstrate its correctness here).

Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair) which is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if the thread that is next in line gets swapped out, then no currently running threads can get the lock, and we don't have any wakeup mechanism to get that sleeping thread to take the lock so we can make progress. This is okay in the kernel because the kernel is controlling which threads are awake or asleep, so obviously it won't put a thread to sleep that is currently spinning trying to get the lock.

So if we want to turn this into a FIFO lock that works in user space, we have to have a sleep/wakeup mechanism.

I don't think this is actually an awesome way to write your own FIFO lock, but it's a nice demonstration of the usefulness of NT's Keyed Events, so I'm going to do that.

You need to get the secret functions :


template <typename ret,typename T1,typename T2,typename T3,typename T4>
ret CallNtInternal(const char * funcName,T1 arg1,T2 arg2,T3 arg3,T4 arg4)
{
    typedef ret NTAPI t_func(T1,T2,T3,T4);

    t_func * pFunc = (t_func*) GetProcAddress( LoadLibrary( TEXT("ntdll.dll") ), funcName );
    ASSERT_RELEASE( pFunc != NULL );

    return (*pFunc) (arg1,arg2,arg3,arg4);
}

#define MAKE_NTCALL_4(ret,func,type1,type2,type3,type4) ret func(type1 arg1,type2 arg2,type3 arg3,type4 arg4) { return CallNtInternal<ret>(#func,arg1,arg2,arg3,arg4); }

MAKE_NTCALL_4( LONG,NtCreateKeyedEvent,OUT PHANDLE, IN ACCESS_MASK, IN PVOID, IN ULONG );
MAKE_NTCALL_4( LONG,NtReleaseKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER ); 
MAKE_NTCALL_4( LONG,NtWaitForKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER );

and then you can make the lock :


struct ticket_mutex2_keyed
{
    std::atomic<unsigned int> m_state;
    // ticket is bottom word
    // now serving is top word

    HANDLE  m_keyedEvent;

    // keyed event must have bottom bit off :
    enum { WAITKEY_SHIFT = 1 };

    ticket_mutex2_keyed()
    {
        m_state($).store( 0 );
        NtCreateKeyedEvent(&m_keyedEvent,EVENT_ALL_ACCESS,NULL,0);
    }
    ~ticket_mutex2_keyed()
    {
        CloseHandle(m_keyedEvent);
    }

    void lock()
    {
        // grab a ticket and inc :
        unsigned int prev = fetch_add_low_word(m_state($),1);
    
        // if ticket matches now serving I have the lock :
        if ( top_word_matches_bottom(prev) )
            return;
    
        // wait on my ticket :
        unsigned int ticket = prev&0xFFFF;
        intptr_t waitKey = (ticket<<WAITKEY_SHIFT);
        NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
    }

    void unlock()
    {
        // inc now serving :
        unsigned int prev = m_state($).fetch_add((1<<16),std::mo_release);

        // get a local copy of the "now serving" that I published :
        prev += (1<<16);

        // if lock was not made open to new entries :       
        if ( ! top_word_matches_bottom(prev) )
        {
            // wake up the one after me in the sequence :
            unsigned int next = (prev>>16);
            intptr_t waitKey = (next<<WAITKEY_SHIFT);
            NtReleaseKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
        }
    }
};

Note that we have had to push together our two state variables now, because previously the unlock only touched the "now serving" counter, but now it has to also check against the ticket counter to see if there are any people waiting.

Also note that we are taking advantage of the fact that ReleaseKeyedEvent is blocking. If the Release happens before the Wait, the signal is not lost - the unlocking thread blocks until the Wait is entered.

Exercise for the reader : make it possible for lock to spin a while before going into the wait.

I made use of these self-explanatory helpers :


bool top_word_matches_bottom( unsigned int x )
{
    unsigned int t = _lrotl(x,16);
    return t == x;
}

unsigned int fetch_add_low_word(std::atomic<unsigned int> & state,int inc)
{
    unsigned int old = state($).load(std::mo_relaxed);
    while ( ! state($).compare_exchange_weak(old,((old+inc)&0xFFFF) | (old & 0xFFFF0000),std::mo_acq_rel) ) { }
    return old;
}

which do what they do.

Obviously on Linux you could use futex, but there are too many platforms that have neither KeyedEvent nor futex, which makes using them not very attractive.

Some links :

Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
spinlocks XXXKSE What to do
Linux x86 ticket spinlock
git.kernel.org - linux/kernel/git/torvalds/linux-2.6.git commit
futex(2) - Linux manual page


07-15-11 | Review of many Mutex implementations

This is gonna be a long one. The point of this is not that you should go off and implement your own mutex (don't). The point is that it's educational to understand this simple case, because the issues will be the same in other domain-specific threading problems. A lot of the time people think they are being "safe" by using the OS mutex, but then they still do some atomic CAS on a bool and think "it's no big deal, it's just a CAS and loop", and are basically creating all the same issues of races and livelocks and thrashing without being careful about it.

So, I'm going to present a (hopefully) working implementation of a mutex/lock and then talk about the issues with that type of implementation.

classic single-variable CAS spinlock :


class spinmutex
{
public:

    spinmutex() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        unsigned int prev = 0;
        // (*1)
        while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acquire) )
        //while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acq_rel) )
        {
            b.yield($);
            prev = 0;
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

*1 : I believe the CAS only needs to be acquire, but then you do need some other mechanism to keep stores from moving out the top of the mutex (such as a #loadstore which C++0x doesn't provide), and some way to prevent mutexes from moving to overlap each other (which could lead to deadlock). So it's probably easiest to just make the CAS be acq_rel even though it doesn't need to be. (see previous post on the barriers that we think a mutex needs to provide). Most of the mutexes here have this issue and we won't mention it again.

For some reason people love to implement the basic spinlock with CAS, but in fact you can do it just with exchange :


class spinmutex2
{
public:

    spinmutex2() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        while ( m_lock($).exchange(1,std::mo_acquire) )
        {
            b.yield($);
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

which is cheaper on some platforms.

So, there are a few problems with spinmutex. The most obvious is that you have to just spin, the threads which can't get in don't go to sleep. The other problem is that it doesn't respect OS scheduling directives (thread priorities) and it's quite un-"fair", in that it doesn't order access at all, and in fact greatly favors the last thread in, since it's most likely to be getting CPU time.

So we want to make it sleep. The pattern for making lock-free primitive sleep is to change :


while ( ! trylock() ) { spin }

to :

if ( ! trylock() )
{
  register desire to wait

  if ( trylock() ) return (cancel wait)

  wait;
}

that is, a double-checked wait. (and then perhaps loop, depending on whether waking from wait implies the condition). The reason you need to do this is that before the "register waiter" has finished putting you in the wait list, the lock may become open. If you didn't try to acquire the lock again, you would miss the wake signal.

So the easiest way to transform our spin mutex into one that sleeps is with eventcount :

eventcount sleeping exchange lock :


class ecmutex1
{
public:

    ecmutex1() : m_lock(0)
    {
    }
    
    void lock()
    {
        while ( m_lock.exchange(1,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            if ( m_lock.exchange(1,std::memory_order_acquire) == 0 )
                return; // got the lock
            
            // wait for signal :
            m_ec.wait(ec_key);
            // now retry
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock.load(std::memory_order_relaxed) == 1 );
        m_lock.store(0,std::memory_order_release);
        m_ec.signal();
    }

private:
    //std::atomic<unsigned int> m_lock;
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
};

now, this is okay, but there are a few problems.

One is that the signal for the eventcount is just a "hey wake up and see if you can get the lock" , it's not a "hey wake up you have the lock". That means it can suffer from what I call "thrashing" or spurious wakeup (this is not technically "spurious wakeup" , a true "spurious wakeup" would be a wakeup that didn't come from calling signal()). You might wake a thread, it fails to get the lock, and goes right back to sleep. So that sort of sucks.

This issue is closely related to a fairness problem; we might be able to ensure some level of fairness through eventcount, but that is ruined by the fact that spinning threads can jump in and grab the lock before the one we signalled.

Another issue is that we are calling "signal" every time even when there is no waiter. This is a minor issue because your eventcount probably checks for waiters and does nothing (if it's a good implementation - a bad implementation might implement signal by immediately taking a mutex, in which case you really want to avoid calling it if you have no waiters).

Anyway, Thomasson showed how to improve this last little bit of inefficiency. You use one bit to flag locking and one bit to flag waiting, and you only need to signal the eventcount if the waiting bit is on :


class ecmutex2
{
public:

    enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITING = (LOCKED|2) };

    ecmutex2() : m_lock(UNLOCKED)
    {
    }
    
    void lock()
    {
        unsigned int prev = 0;
        // this CAS could be a bit-test-and-set :
        while ( ! m_lock.compare_exchange_strong(prev,LOCKED,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            // change LOCKED->LOCKED_WAITING (and then we will wait)
            // or change UNLOCKED->LOCKED_WAITING (and we take the lock)
            prev = m_lock.exchange(LOCKED_WAITING,std::memory_order_acquire);
            if ( prev == UNLOCKED )
                return;
                
            m_ec.wait(ec_key);
            
            // now retry
            prev = 0;
        }
    }
    
    void unlock()
    {
        unsigned int local = m_lock.load(std::memory_order_relaxed);
        RL_ASSERT( local & LOCKED );
        m_lock.store(UNLOCKED,std::memory_order_release);
        // could always signal :
        //m_ec.signal();
        // faster because it avoids an atomic :
        unsigned int check = m_lock.load(std::memory_order_relaxed);
        if ( (local|check) & LOCKED_WAITING )
        {
            m_ec.signal();
        }
    }

private:
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
    
};

you have to use a CAS (not an exchange) to take the lock initially, because you can't turn off the WAITING bit. The entire advantage of this method is the fact that in the uncontended case (no waiters), unlock only does a load_relaxed instead of the atomic op needed in eventcount to check if signal is necessary.

Note : in some cases it may be an improvement to spin a bit before going to sleep in the lock() side. It also can be an optimization to spin a bit before signalling in the unlock (to see if the WAITING flag turns off) - however, both of these hurt fairness, they make the mutex more LIFO than FIFO, which can indeed be an optimization in many cases, but is also dangerous (more notes on this issue elsewhere). If a thread was already asleep on the mutex, it will tend to stay asleep forever if there are other awake threads that keep trading the mutex around.

Anyhoo, you can implement the exact same thing using windows Event instead of eventcount :

Three-state mutex using Event :


// Thomasson's simple mutex based on windows event :
struct win_event_mutex
{
    std::atomic<int> m_state; // = 0
    HANDLE m_waitset; // auto reset event; set to false

    win_event_mutex()
    {
        m_state($) = 0;
        m_waitset = CreateEvent(NULL,0,0,NULL);
    }
    ~win_event_mutex()
    {
        CloseHandle(m_waitset);
    }

    void lock()
    {
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            while ( m_state($).exchange(2,rl::mo_acquire) )
            {
                WaitForSingleObject(m_waitset, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,rl::mo_release) == 2 )
        {
            SetEvent(m_waitset);
        }
    }
};

the three states are again "0 = unlocked", "1 = locked (exclusive)" , "2 = contended (locked with waiter)".

(I got this from Thomasson but I believe it's actually an old algorithm; I've seen it discussed in many blogs. There is a slightly subtle state transition where m_state can be 2 (contended) and then someone comes in to lock() and exchanges it to 1 (locked, uncontended); that seems to be bad, because there is a Waiter which might now miss a signal (because we turned off the contended flag), but in fact it's okay because if that happens we will then step in and take the lock in the contended state (by exchanging in 2) and when we unlock we will signal the waiter. So this is another way of doing "unfair" acquisition (the later-entering thread gets the lock even though there was already a waiter) but it is not a lost wakeup).

The unlock is slightly more expensive because it's an exchange instead of just a store. This mutex is "fair" (as fair as Win32 native primitives ever are) for waiters that actually get into the waitset, because all the threads wait on the same Event and thus will get the OS priorities and boosts and so on. But it still doesn't hand off the lock in the wakeup and so on. (I guess I'll call respecting the OS scheduler "pseudo-fair" ; if the mutex implementation is at least as fair as the OS mutex)

BTW this mutex is very similar to the futex-based mutex that Bartosz described here ; in general anywhere you see Win32 event you could use futex on Linux, since futex is a superset of event.
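
To make that correspondence concrete, here is a minimal sketch of an auto-reset event built on futex (Linux only, illustrative, no error checking) :

#include <atomic>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

struct futex_event   // behaves like a Win32 auto-reset Event
{
    std::atomic<int> m_signalled;   // 0 or 1

    futex_event() : m_signalled(0) { }

    void set()   // ~ SetEvent
    {
        m_signalled.store(1, std::memory_order_release);
        syscall(SYS_futex, (int *)&m_signalled, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    void wait()  // ~ WaitForSingleObject(...,INFINITE)
    {
        // auto-reset : consume the signal on the way out
        while ( m_signalled.exchange(0, std::memory_order_acquire) == 0 )
        {
            // sleeps only while the value is still 0 ; the kernel re-checks it
            // atomically, so a set() racing with this is not lost
            syscall(SYS_futex, (int *)&m_signalled, FUTEX_WAIT, 0, NULL, NULL, 0);
        }
    }
};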

We're going to take a diversion now and look at some other topics in mutex design - in particular, avoiding allocation of OS events unless/until they're actually needed (Win32 CRITICAL_SECTION does this, for example).

Event mutex that makes event on demand :


// Thomasson's simple mutex based on windows event :
// version that does event creation on demand
struct win_event_mutex_ondemand
{
    std::atomic<int> m_state; // = 0
    std::atomic<HANDLE> m_waitset; // auto reset event; set to false

    win_event_mutex_ondemand()
    {
        m_state($) = 0;
        m_waitset($) = 0;
    }
    ~win_event_mutex_ondemand()
    {
        if ( m_waitset($) != 0 )
            CloseHandle(m_waitset($));
    }

    void lock()
    {
        if ( m_state($).exchange(1,std::mo_acq_rel) )
        {
            HANDLE h = m_waitset($).load(std::mo_acquire);
            if ( h == 0 )
            {
                HANDLE newH = CreateEvent(NULL,0,0,NULL);
                if ( m_waitset($).compare_exchange_strong(h,newH,std::mo_acq_rel) )
                {
                     h = newH;
                }
                else
                {
                    // loaded h
                    RL_ASSERT( h != 0 );
                    CloseHandle(newH);
                }
            }
            RL_ASSERT( h != 0 );
            while ( m_state($).exchange(2,std::mo_acq_rel) )
            {
                WaitForSingleObject(h, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,std::mo_acq_rel) == 2 )
        {
            HANDLE h = m_waitset($).load(std::mo_relaxed);
            RL_ASSERT(h != 0 );
            SetEvent(h);
        }
    }
};

This is just the same as the previous win_event_mutex , except that it makes the event on first use, using the modern speculative-creation singleton method.

This works fine, unfortunately it is difficult to ever free the event, so once we make it we have it forever. If your goal is to do something like have 4k mutexes that only use 32 OS events, you can't do it this way. (in general you only need as many waitable handles as you have threads, and you might want to have many more lockable objects than that).

I implemented one way of making a mutex that releases its event when not needed, but it's a bit ugly :

Event mutex that only holds event during contention :


struct win_event_mutex2
{
    struct state // 64 bit double-word 
    {
        // two 32 bit words :
        int lock; HANDLE event; 

        state(int l,HANDLE h) : lock(l), event(h) { } 
        state() : lock(0), event(0) { } 
        bool operator == (const state & rhs) const { return lock == rhs.lock && event == rhs.event; }
    };
    
    std::atomic<state> m_state; // = 0

    win_event_mutex2()
    {
    }
    ~win_event_mutex2()
    {
        state local = m_state($);
        if ( local.event != 0 )
            CloseHandle(local.event);
    }

    void lock()
    {
        HANDLE newH = 0;
            
        state oldState = m_state($).load(std::mo_acquire);
        state newState;
        
        for(;;)
        {
            // increment the lock count :
            newState = oldState;
            newState.lock = oldState.lock+1;
            
            // if there is contention, make sure there is an event to wait on :
            if ( newState.lock > 1 && newState.event == 0 )
            {
                if ( newH == 0 )
                {
                    newH = CreateEvent(NULL,0,0,NULL);
                }
                
                newState.event = newH;              
            }
            else if ( newState.lock == 1 && newState.event == newH )
            {
                newState.event = 0;
            }

            // try to swap in the lock count and event handle at the same time :
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;

        }
        
        if ( newH && newH != newState.event )
        {
            // I made an event but didn't use it
            CloseHandle(newH);
        }
                
        if ( oldState.lock == 0 )
        {
            // I got the lock
            RL_ASSERT( newState.lock == 1 );
            return;
        }
        
        // lock is contended :
        RL_ASSERT( newState.lock > 1 );
        RL_ASSERT( newState.event != 0 );

        WaitForSingleObject(newState.event, INFINITE);
        
        // I own the mutex now!
    }

    void unlock()
    {
        state oldState = m_state($).load(std::mo_acquire);
        RL_ASSERT( oldState.lock >= 1 );
        state newState(0,0);
        
        // at this moment I own the mutex
        
        for(;;)
        {
            // release the lock, and if we're no longer contended remove the event
            RL_ASSERT( oldState.lock >= 1 );
            newState = oldState;
            newState.lock--;
        
            if ( newState.lock == 0 && newState.event != 0 )
            {
                newState.event = 0;
            }
        
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;
        }

        if ( oldState.event )
        {
            if ( newState.event == 0 )
            {
                RL_ASSERT( newState.lock == 0 );

                CloseHandle(oldState.event);
            }
            else
            {
                RL_ASSERT( newState.lock > 0 );
                SetEvent(oldState.event);
            }
        }
    }
};

This is always the cop-out method for implementing lock free algorithms - take the two variables that you need to stay in sync (in this case the lock count and the presence of a waitable event) - and just mush them together into a bigger word and CAS it atomically. That way you don't have to think carefully about all the funny state transition possibilities. I'm sure someone could do a better version of this that's not so expensive in atomic ops on the fast path (no contention).

(btw it's safer / more portable to make "state" just be a uint64 and do the packing manually with shifts and ors, don't use a struct inside std::atomic, it causes too many headaches and is too risky)
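
A sketch of the manual packing (illustrative; it assumes the event handle fits in 32 bits - fine on 32-bit builds, and on Win64 you could either rely on user-mode handles only using the low 32 bits or store an index into a handle table instead) :

typedef unsigned long long state_t;   // m_state becomes std::atomic<state_t>

// lock count in the low 32 bits, event handle in the high 32 bits :
state_t pack_state(unsigned int lock_count, HANDLE event)
{
    return (state_t)lock_count | ( (state_t)(ULONG_PTR)event << 32 );
}
unsigned int state_lock_count(state_t s)
{
    return (unsigned int)( s & 0xFFFFFFFFull );
}
HANDLE state_event(state_t s)
{
    return (HANDLE)(ULONG_PTR)(unsigned int)( s >> 32 );
}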

(to be clear, the problem with this algorithm is the no-contention fast path is way more expensive than any of our previous mutexes).

Also note : you shouldn't actually use CreateEvent/CloseHandle with something like this, you should have a recycling pool of OS Events that you alloc/free from an event cache of some kind. As I said before, you only need one per thread. If you do use a pool like this, you have to be a bit careful about whether they can be recycled in signalled state, and whether you want to try to Reset them at some point (beware someone could still be in the process of waking up from it), or just make your algorithm work okay with a pre-signalled event, or something.
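
A rough sketch of the kind of event cache I mean (illustrative only - a real one would be sized to the thread count and would deal with the signalled-recycling question explicitly) :

struct event_cache
{
    enum { CACHE_SIZE = 32 };
    std::atomic<HANDLE> m_slots[CACHE_SIZE];   // NULL = empty slot

    event_cache()
    {
        for(int i=0;i<CACHE_SIZE;i++)
            m_slots[i].store(NULL, std::memory_order_relaxed);
    }

    HANDLE alloc()
    {
        for(int i=0;i<CACHE_SIZE;i++)
        {
            HANDLE h = m_slots[i].exchange(NULL, std::memory_order_acq_rel);
            if ( h )
                return h;   // beware : may come back in signalled state
        }
        return CreateEvent(NULL,FALSE,FALSE,NULL);   // cache empty, make a new auto-reset event
    }

    void release(HANDLE h)
    {
        for(int i=0;i<CACHE_SIZE;i++)
        {
            HANDLE prev = NULL;
            if ( m_slots[i].compare_exchange_strong(prev, h, std::memory_order_acq_rel) )
                return;
        }
        CloseHandle(h);   // cache is full, just close it
    }
};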

There is one way this mutex is better than any previous one - when a thread receives a wakeup, it is never bogus; it automatically has the lock when it is woken up. Also under contention there can be no "stolen locks" from threads that jump the queue and quickly grab the atomic - one of the sleeping threads will always be woken.

(btw "pseudo-fairness" like this is not always better; in the case that all your threads are equivalent workers, you actually want a LIFO mutex, because LIFO keeps caches hotter and makes thread switches less likely. However, some code can be stalled indefinately by LIFO mutexes, so they are very dangerous in the general case.)

Okay, that's enough mutexes for one post, we'll do some followups in a later one.


07-14-11 | Some obscure threading APIs

Futex :

Futex from a Win32 programmer's perspective is an enhanced version of the windows Event. It has absolutely nothing to do with a "fast user space mutex". It's a much lower level primitive - it's a "waitset" if you like (more about "waitset" in a future post). Basically it lets you put a thread to sleep with a Wait(), or wake up one or more threads. It has several advantages over Win32 Event which make it very nice.

Futex is associated with the address of an int. This means you don't have to actually create a Futex object, any int in your system can be used as the handle for a waitable event. This is nice.

Futex Wait atomically checks the int vs. a certain value. The basic futex Wait op is :

atomically {
  if ( *address == check_value ) Wait();
  else return;
}

this atomic check before waiting is exactly what you want for implementing lots of threading primitives (mutex, conditionvar, etc.) so that's very handy. The normal thing you would write is something like :

thread 1:

if ( queue empty / mutex locked / whatever )
{
    Wait();
}

thread 2:

push queue / unlock mutex / whatever
Signal();

but that contains a race. You can fix it very easily with futex by passing in the condition to check to the Wait, like :

thread 1 :
if ( num_items == 0 )
    futex_wait( &num_items, 0 );

thread 2 :
num_items++;
Signal();

(note that the alternative to this is a prepare_wait/wait pair with a double-check of the condition after the prepare_wait, and signal applies to anyone who has prepared, not anyone who has waited)

Futex Wake can wake up N threads (sadly Win32 event only provides "wake 1" in the robust auto-reset mode). That's nice.

Some reference :
Futexes are tricky (PDF) (no they're not, BTW)
Thin Lock vs. Futex «   Bartosz Milewski's Programming Cafe
Mutexes and Condition Variables using Futexes
futex(2) - Linux manual page

Now some Win32 :

SignalObjectAndWait :

SignalObjectAndWait (Win2k+).

This seems pretty hot, because you would think it was atomic (signal and wait in one kernel op), but it is *NOT* :

Note that the "signal" and "wait" are not guaranteed to be performed as an atomic operation. Threads executing on other processors can observe the signaled state of the first object before the thread calling SignalObjectAndWait begins its wait on the second object.

which means it's useless. It's just the same as calling Signal() then Wait(). Maybe it's less likely that you get swapped out between the two calls, which would reduce thread thrashing and also reduce the number of system calls, but it does not help you with correctness issues (an actually atomic SignalAndWait would let you implement things differently and solve some lost-wakeup problems).


thread1 :
  Signal(event1);
  /* (!1) */
  Wait(event2);

thread 2:
  Wait(event1);
  Signal(event2); // (!2)
  Wait(event2); // ! stolen wakeup

So first of all, the separate Signal & Wait is a performance bug : if thread1 loses the CPU at (!1) (or immediately upon signalling), thread2 gets to run, and when execution transfers back to thread1 it just immediately goes to sleep in its Wait. That's lame, and it's why you prefer to do these things atomically. But it's even worse in this case, because at (!2) we intended that Signal to go to thread1 and wake him up, but he isn't in his sleep yet; thread2 then does its own Wait(event2) and steals the wakeup, so thread1 sleeps forever. Because SignalAndWait is not atomic we have a race. If it was atomic, code like this could actually be correct.

Windows NT Keyed Events (NtWaitForKeyedEvent/NtReleaseKeyedEvent) :

This is an NT internal API (it's in NTDLL since Win XP, so despite being internal-only it's actually more usable than the Win2k+ or Vista+ stuff (revision : duh, brain fart, not true, 2k+ is fine, only Vista+ stuff is problematic)). It's a very nice extension to the Win32 Event. Basically it lets you associate a value with the Wait and Signal.

NtWaitForKeyedEvent : analogous to Wait() on an Event, but also takes a value (the value is PVOID but it is just a value, it doesn't deref). Execution is stopped until a signal is received with that particular value.

NtReleaseKeyedEvent : analogous to SetEvent() (on an auto-reset event) - eg. it wakes exactly one thread - except that it actually blocks if no thread is waiting - that is, this always wakes exactly 1 thread (never zero, which SetEvent can do). Release also takes a value and only wakes threads waiting on that value.

So for example if you want you can use this to make a primitive implementation of futex. You have one global event (futex_event), then FutexWait(address) does WaitForKeyedEvent(futex_event,address) , using address as the value to wait for, and FutexWake(address) does ReleaseKeyedEvent(futex_event,address). (though obviously this is not a proper futex because it can't broadcast and can't check a value and so on).
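
In code, that primitive futex-from-keyed-event looks something like this (a rough sketch ; the Nt* signatures here are the commonly published ones - see the lockless inc reference below - and they have to be pulled out of ntdll by hand since they aren't in the import libs) :

#include <windows.h>

typedef LONG (NTAPI * tNtCreateKeyedEvent)(HANDLE * handle, ACCESS_MASK access, void * attr, ULONG flags);
typedef LONG (NTAPI * tNtKeyedEventOp)(HANDLE handle, PVOID key, BOOLEAN alertable, PLARGE_INTEGER timeout);

static HANDLE          s_futex_event = 0;
static tNtKeyedEventOp s_NtWaitForKeyedEvent = 0;
static tNtKeyedEventOp s_NtReleaseKeyedEvent = 0;

void FutexInit()
{
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    tNtCreateKeyedEvent create = (tNtCreateKeyedEvent) GetProcAddress(ntdll,"NtCreateKeyedEvent");
    s_NtWaitForKeyedEvent = (tNtKeyedEventOp) GetProcAddress(ntdll,"NtWaitForKeyedEvent");
    s_NtReleaseKeyedEvent = (tNtKeyedEventOp) GetProcAddress(ntdll,"NtReleaseKeyedEvent");
    create(&s_futex_event, EVENT_ALL_ACCESS, NULL, 0);
}

// keys must have the low bit clear ; an int's address is at least 4-aligned, so that's fine :
void FutexWait(void * address) { s_NtWaitForKeyedEvent(s_futex_event, address, FALSE, NULL); }
void FutexWake(void * address) { s_NtReleaseKeyedEvent(s_futex_event, address, FALSE, NULL); }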

More usefully, you can create a mutex which applies to an array! Something like :


template<typename T>
struct lockable_vector
{
  HANDLE      vector_event;   // one keyed event shared by every slot
  vector<T>   items;
  vector<int> contention;

  T * lock(int i)
  {
    if ( contention[i] ) NtWaitForKeyedEvent(vector_event, (PVOID)(intptr_t)i, FALSE, NULL);
    contention[i]++;
    return &items[i];
  }

  void unlock(int i)
  {
    contention[i]--;
    if ( contention[i] ) NtReleaseKeyedEvent(vector_event, (PVOID)(intptr_t)i, FALSE, NULL);
  }
};

(obviously this is just a rough sketch; you have to update "contention" properly atomically so that you only do the Wait and Release in the right cases) (I'll post a working version of this sometime soon).

(small issue with this : the lowest bit of the key must be unset ; apparently they use it as a flag bit, so you need a 2* or something to use it this way, or just give it aligned pointer addresses as the key)

The point is - you only have one kernel event, and you can mutex on any item in your vector; you can resize the vector and don't have to create/destroy mutexes. Cool! In the typical scenario that you have maybe 2-4 threads accessing an array of 4k items, this is a big win. (in fact this is something that's important in Oodle so I have a few implementations of how to do this without WaitForKeyedEvent).

KeyedEvent is implemented in the obvious way - when a thread in Win32 waits on a handle it gets stuck in that handle's data structure on a linked list. They just added a "wait_value" field to that linked list. So now when you do ReleaseKeyedEvent instead of just popping off the first thread in the list, it walks through the list in the kernel and tries to find a thread whose "wait_value" matches your signal value.

Some reference :
Slim Reader Writer Locks - Under The Hood - Matt Pietrek - Site Home - MSDN Blogs
NtCreateKeyedEvent
Keyed Events (lockless inc)
Concurrent programming on Windows - Google Books


07-14-11 | compare_exchange_strong vs compare_exchange_weak

The C++0x standard was revised a while ago to split compare_exchange (aka CAS) into two ops. A quick note on the difference :

bool compare_exchange_weak( T * ptr, T * old, T newval );

bool compare_exchange_strong( T * ptr, T * old, T newval );

(BTW in the standard "old" is actually a reference, which is a damn shitty thing to do because it makes it very non-transparent that "old" gets mutated by these functions, so I am showing it as a pointer).

both try to do :


atomically {
  if ( *ptr == *old ) { *ptr = newval; return true; }
  else { *old = *ptr; return false; }
}

the difference is that compare_exchange_weak can also return false for spurious failure. (the original C++0x definition of CAS always allowed spurious failure; the new thing is the _strong version which doesn't).

If it returns due to spurious failure, then *old might be left untouched (and in fact, *ptr might be equal to *old but we failed anyway).

If spurious failure can only occur due to contention, then you can still guarantee progress. In fact in the real world, I believe that LL-SC architectures cannot guarantee progress, because you can get spurious failure if there is contention anywhere on the cache line, and you need that contention to be specifically on your atomic variable to guarantee progress. (I guess if you are really worried about this, then you should ensure that atomic variables are padded so they get their own cache line, which is generally good practice for performance reasons anyway).

On "cache line lock" type architectures like x86, there is no such thing as spurious failure. compare_exchange just maps to "cmpxchg" instruction and you always get the swap that you want. (it can still fail of course, if the value was not equal to the old value, but it will reload old). (BTW it's likely that x86 will move away from this in the future, because it's very expensive for very high core counts)

compare_exchange_weak exists for LL-SC (load linked/store conditional) type architectures (Power, ARM, basically everything except x86), because on them compare_exchange_strong must be implemented as a loop, while compare_exchange_weak can be non-looping. For example :

On ARM, compare_exchange_weak is something like :

compare_exchange_weak:

  ldrex     // load with reservation
  teq       // test equality
  strexeq   // store if equal
and the op can fail for two reasons - either the values weren't equal (so the strexeq is skipped and nothing is stored), or the store-exclusive itself failed because the reservation was lost (because someone else touched our cache line).

To implement compare_exchange_strong you need a loop :

compare_exchange_strong:

  T expected = *old;
  do {
    if ( compare_exchange_weak(ptr,old,newval) ) return true;
  } while ( *old == expected );  // only retry if the failure was spurious
  return false;

(note that you must only loop on *spurious* failure - compare_exchange_weak reloads *ptr into *old for you when the failure is genuine, so if *old has changed from the value you started with, that's a real mismatch and strong has to return false rather than keep looping; also note you don't need to put a (*old = *ptr) in the loop yourself, the weak op does that reload for you).

The funny bit is that when you use compare_exchange you often loop anyway. For example say I want to use compare_exchange_strong to increment a value, I have to do :


cur = *ptr;
while( ! compare_exchange_strong(ptr,&cur,cur+1) ) { }

(note it's a little subtle that this works - when compare_exchange_strong fails, it's because somebody else touched *ptr, so we then reload cur (this is why "old" is passed by address), so you then recompute cur+1 from the new value; so with the compare_exchange_strong, cur has a different value each time around this loop.)

But on an LL-SC architecture like ARM this becomes a loop on a loop, which is dumb when you could get the same result with a single loop :


cur = *ptr;
while( ! compare_exchange_weak(ptr,&cur,cur+1) ) { }

Note that with this loop now cur does *not* always take a new value each time around the loop (it does when it fails due to contention, but not when it fails just due to reservation-lost), but the end result is the same.

So that's why compare_exchange_weak exists, but you might ask why compare_exchange_strong exists. If we always use loops like this, then there's no need for it. But we don't always use loops like this, or we might want to loop at a much higher level. For example you might have something like :

bool SpinLock_TryLock(int * lock)
{
  int zero = 0;
  return compare_exchange_strong(lock,&zero,1);
}
which returns false if it couldn't get the lock (and then might do an OS wait) - you don't want to return false just because of a spurious failure. (that's not a great example, maybe I'll think of a better one later).

(BTW I think the C++0x stuff is a little bit broken, like most of C standardization, because they are trying to straddle this middle ground of exposing the efficient hardware-specific ways of doing things, but they don't actually expose enough to map directly to the hardware, and they also aren't high level enough to separate you from knowing about the hardware. For example none of their memory model actually maps directly to what x86 provides, therefore there are some very efficient x86-specific ways to do threading ops that cannot be expressed portably in C++0x. Similarly on LL-SC architectures, it would be preferable to just have access to LL-SC directly.

I'd rather see things in the standard like "if LL-SC exist on this architecture, then they can be invoked via __ll() and __sc()" ; more generally I wish C had more conditionals built into the language, that would be so much better for real portability, as opposed to the current mess where they pretend that the language is portable but it actually isn't so you have to create your own mess of meta-language through #defines).


07-14-11 | Good advice

Never give your real email or phone number to :

Realtors

Car dealers

Mortgage brokers

Web sites

Unfortunately the tools for this aren't as nice as they should be. For example Google Voice sort of intentionally makes it difficult to use them as an anonymizer. eg. you can't have a GVoice # that doesn't correspond to a physical phone, and it's a pain in the ass to dial out with your phone and make it route through the GVoice # so that people see that number in their caller id.

Similarly, to sign up for things you often have to give an email that they check, so you can't just make one up. Ideally you would be able to give a temporary email address, which routes through to your real one for a week or so and then expires. Email providers could easily give this to you (and just limit the # of outgoing mails to prevent spammers from using it). But they don't actually want to protect you from advertisers.


07-14-11 | ARM Atomics

Some quick notes for reference in the future.

In general when I'm porting atomics to a new platform, what I would really like is a document from the hardware maker that describes their memory semantics and cache model in detail. Unfortunately that can be very hard to find. You also need to know how the compiler interacts with that memory model (eg. what does volatile mean, how do I generate compiler reorder barriers, etc). Again that can be very hard to find, particularly because so many compilers now are based on GCC and the GCC guys are generally stubborn punk-asses about clearly defining how they behave in the "undefined" parts of C which are so crucial to real code.

So assuming you can't find any decent documentation, the next place to look is some of the large cross-platform codebases. Probably the best one of all is Linux, because it's decently well tested. Unfortunately you can't just copy code from these since most of them are GPL, but you can use them as educational material to figure out the memory semantics of a platform. So some things we learn from Linux straight off :

1. Before ARMv6 you're fucked. There are no real atomic ops (SWP is not good enough) so you have to use some locking/critical-section mechanism to do atomics. Linux in the kernel does this by blocking interrupts, doing the atomic, then turning them back on. If you're not in the kernel, on Linux there's a secret syscall function pointer you can use, or non-Linux you have to use SWP to implement a spinlock which you then use to do CAS and such.

2. With ARMv6 you can use ldrex/strex , which seems to be your standard LL-SC kind of thing.

3. If you're SMP you need full memory barriers for memory ordering.

One thing I don't know is whether any of the Apple/Android consumer ARM multi-core chips are actually SMP ; eg. do they have separate caches, or are they shared single cache with multiple execution units?

Some reference I've found :

[pulseaudio-discuss] Atomic operations on ARM
Wandering Coder - ARM
QEmu - Commit - ViewGit
pulseaudio-discuss Atomic operations on ARM 1
Old Nabble - gcc - Dev - atomic accesses
Linux Kernel Locking Techniques
Linux Kernel ARM atomics
Debian -- Details of package libatomic-ops-dev in sid
Data alignment Straighten up and fly right
Broken ARM atomic ops wrt memory barriers (was [PATCH] Add cmpxchg support for ARMv6+ systems) - Patchwork
Außergewöhnlicher Migrationsdruck
Atomic - GCC Wiki
ARM Technical Resources
ARM Information Center

ARM RealView compiler has some interesting intrinsics that are not documented very well :


ARM RealView has :
    __force_stores
    __memory_changed
    __schedule_barrier
    __yield
    __strexeq

one trick in this kind of work is to find a compiler that has intrinsics you want and then just look at what assembly is generated so that you can see how to generate the op you want on that platform.

(but beware, because the intrinsics are not always correct; in particular the GCC __sync ops are not all right, sometimes have bugs, and sometimes their behavior is "correct" but doesn't match the documentation; it's very hard to find correct documentation on what memory semantics the GCC __sync ops actually guarantee).

Anyway, maybe I'll update this when I get some more information / do some more research.


07-13-11 | Good threading design for games

I want to back up and talk a bit about how threading should be used in modern games and what makes good or bad threading design in general.

Games are sort of a weird intermediate zone. We aren't quite what would be considered "real-time" apps (a true "real-time" app sets itself to the maximum possible priority and basically never allows any thread switches), nor are we quite "high performance computing" (which emphasizes throughput over latency, and also takes over the whole machine), but we also aren't just normal GUI apps (that can tolerate long losses of cpu time). We sort of have to compromise.

Let me lay out a few principles of what I believe is good threading for games, and why :

1. Make threads go to sleep when they have nothing to do (eg. don't spin). A true real-time or high-performance app might have worker threads that just spin on popping work items off a queue. This allows them to have few-hundred-clock latencies guaranteed. This is not appropriate for games. You are pretty much always running on a system that you need to share. Obviously on the PC they may have other apps running, but even on consoles there are system threads that you need to share with and don't have full control over. The best time to do this is when your threads are idle, so make your threads really go to sleep when they are idle.

2. Never use Sleep(n). Not ever. Not even once. Not even in your spin-backoff loops. In literature you will often see people talk of exponential backoffs and such. Not for games.

The issue is that even a Sleep(1) is 1 milli. 1 milli is *forever* in a game. If the sleep is on the critical path to getting a frame out, you can add 1 milli to your frame time. If you hit another sleep along the critical path you have another millisecond. It's no good.

3. All Waits should be on events/semaphores/what have you (OS waitable handles) so that sleeping threads are woken up when they can proceed, not on a timer.

4. Thread switches should be your #1 concern. (other than deadlocks and races and such things that are just bugs). Taking a mutex is 100 clocks or so. Switching threads is 10,000 clocks or so. It's very very very bad. The next couple of points are related to this :

5. "Spurious" or "polling" wakeups are performance disasters. If a thread has to wake up, check a condition, and then goes back to sleep, you just completely wasted 10,000 clocks. Ensure that threads are waiting on a specific condition, and they are only woken up when that condition is true. This can be a bit tricky when you need to wait on multiple conditions (and don't have Win32's nice WaitForMultiple to help you).

6. You should sacrifice some micro-efficiency to reduce thread thrashing. For example, consider the blocking SPSC queue using a semaphore that we talked about recently. If you are using that to push work items that take very little time, you can have the following pattern :

  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
constantly switching. Big disaster. There are a few solutions. The best in this case would be to detect that the work item was very trivial and just do it immediately on the main thread rather than pushing it to a worker. Another would be to batch up a packet of work and send all of them at once instead of posting the semaphore each time. Another is to play with priorities - bump up your priority while you are making work items, and then bump it back down when you want them all to fire.

Basically doing some more work and checks that reduce your max throughput but avoids some bad cases is the way to go.
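
For example the first solution might look something like this (a rough sketch ; the names and the cost threshold are hypothetical, and "queue" is assumed to be a blocking work queue whose push() posts a semaphore and may wake a worker) :

struct WorkItem
{
    void (*func)(void *);
    void * data;
    int    estimated_clocks;   // caller's rough guess at the cost of this item
};

const int c_trivial_work_threshold = 5000;  // well under one thread switch (~10,000 clocks)

template <typename t_queue>
void QueueWork(t_queue * queue, WorkItem * item)
{
    if ( item->estimated_clocks < c_trivial_work_threshold )
    {
        // cheaper to just run it on the calling thread than to wake a worker
        item->func(item->data);
        return;
    }

    queue->push(item);
}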

7. Mutexes can cause bad thread thrashing. The problem with mutexes (critical sections) is not that they are "slow" (oh no, 100-200 clocks big whoop) in the un-contended case, it's what they do under contention. They block the thread and swap it out. That's fine if that's actually what you want to do, but usually it's not.

Consider for example a queue that's mutex protected. There are a bunch of bad articles about the performance of mutex-protected queues vs. lock-free queues. That completely misses the point. The problem with a mutex-protected queue is that when you try to push an item and hit contention, your thread swaps out. Then maybe the popper gets an item and your thread swaps back in. Thread swapping in & out just to push an item is terrible. (obviously mutexes that spin before swapping try to prevent this, but they don't prevent the worst case, which means you can have unpredictable performance, and in games the most important thing is generally the worst case - you want to minimize your maximum frame time)

Basically, mutexing is not a good reason to swap threads. You should swap threads when you are out of work to do, not just to let someone get access to a blocked shared variable, and then you'll run some more when they're done.

8. Over-subscription is bad. Over-subscription (having more threads than cores) inherently creates thread switching. This is okay if the threads are waiting on non-computational signals (eg. IO or network), but otherwise it's unnecessary. Some people think it's elegant to make a thread per operation type, eg. a ray cast thread, a collision thread, a particle thread, etc. but in reality all that does is create thread switch costs that you don't need to have.

In video games, you basically don't want the time-slicing aspect of threads to ever happen (in the absence of other apps). You want a thread to run until it has to wait for something. Over-subscribing will just cause time-slice thread switching that you don't need. You don't want your render thread and your physics thread to time-slice and swap back and forth on one core. If they need to share a core, you want all your physics work to run, then all your render work to run, sequentially.

9. Generally a thread per task is bad. Prefer a thread per core. Make a task system that can run on any thread. You can then manually sequence tasks on the cores by assigning them to threads. So in the example above instead have a physics task and a render task, which might be assigned to different cores, or might be assigned to one thread on one core and run sequentially.


07-13-11 | Houses

Some rambling. I got to "mutual acceptance" on a house, so we're in inspection. It's very hard to stay logical, because it's all such a pain in the ass that I just want to get it over with, so I want to ignore any flaws found at this point. In all these kinds of scenarios, as a buyer you get trapped once you have invested a certain amount of time.

Mortgage brokers are just horrible lying crooks. They're very similar to used car salesman really. They're selling a product, and they could just be honest and fair and still make plenty of profit. But no, they want that profit, PLUS as much more as they can lie and steal from you. So mortgage brokers will take the market rate and then crank it up (and keep the difference for themselves). They'll secretly sneak more profit for themselves into all the little fees, every single line item fee for closing they try to inflate by another 25% or so and keep the profit. On each of these items you can fight them, it's pretty trivial you just say "hey this rate isn't very good I found a better one elsewhere" and the broker suddenly goes "oh, okay, I can adjust this.." , but it's just such a fucking sleazy ordeal.

And the worst thing is that mortgages are a generic commodity. After the broker issues it to you, it's just sold on the market (or maybe they're doing it on behalf of some big mortgage block). You're not actually getting customized service, it should just be like buying barrels of oil. But it's not because they're crooks.

And dealing with mainstream banks doesn't make it any better, in fact if anything it's worse because they just bold-facedly charge you over market rate while making absurd claims that it's worth paying them extra because of their "stability" or "integrity" or "reputation" or whatever.

Buying a house is one of those things like buying a car where the crooks try to sneak in little charges that you ignore because the overall purchase is so large. You think "the house is $500k what do I care about $1000 in fees". But of course that's retarded, the $1000 in ripping off is totally independent of the house purchase, the 500k for the house is a sunk cost and you have to choose to pay the $1000 extra or not.

The sick thing is that so many consumers are happy to let them get away with it. On the car advice forums you will constantly see smug retards giving advice like "oh the $1000 oil change is worth it, why would you cheap out and go to a discount place when you spent $100k on your car?" ; umm, no, fucking moron, just because the car is expensive doesn't make all the associated ripoffs okay. Similarly with a house, you get a lot of "it's a huge purchase why would you cheap out on your survey fee" ; umm, no.

Insurance has got to be the biggest scam in the world. So far as I can tell, in the real world, insurance works like this :

I pay them massive amounts of money to protect me in case something bad happens. When that bad thing happens, I get not only the misfortune of the bad event, but I also get the excruciating displeasure of having to fight a beaurocracy which is doing their best to screw me over and weasel their way out of paying the fair amount they owe me.

Title insurance is just insanely overpriced. This is from an article which is in *favor* of title insurance :

"According to the American Land Title Association, which represents insurers, about 5 cents of every dollar collected in premiums is paid out in claims"

I suppose there are types of insurance that are even more profitable for the insurer, but not many. (I imagine the insurance on movie stars and shit like that is pretty much free money for insurers).

I think I'm going to go with a 30 year fixed, but I'm sort of contemplating a 20 year fixed.

The interest rates are very close right now, so that's not a good reason to go 20 year. The reason I'm considering it is to front-load my interest payments. Right now while I'm employed I get maximum benefit from the mortgage interest deduction. In 10-15 years hopefully I'm not working full time and it doesn't help so much.

Amusingly, every real estate professional I talk to has absolutely no concept of how money works. My realtor told me that rather than paying off his mortgage early, he puts the money in T-bills and gets 2% for it. Umm, hello. Assuming your interest rate is more than 2%, you would get a better return by paying off debt. Paying off debt is just like making an investment at the rate of your debt.

More generally, even smart people believe that they can "get a better rate of return in the market rather than by paying off debt". LOL no, maybe yes sometimes, but on average, no, obviously not. Just think about it for a second.

Banks want to put money in home loans because it's a great way to get a return. If there were better places to put money, they would gladly put the money there instead. There may certainly be investments with higher return, but they also have higher risk, so the risk-adjusted rate of return is the same (actually in general it's worse).

There are of course exceptions; if you lock in a low home loan rate and then the market heats up, you want your money in the market. The biggest issue though is that money paid into your house is no longer liquid, while money in the market always has added utility due to its liquidity.

Inflation is a very complicated aspect of home buying that I don't see discussed much. For example, in the issue of "should I buy points" or not, you can find lots of calculators, but they are all full of shit because they don't count inflation ( one exception is here ). Inflation makes points much less desirable, because they make you pay more today (in valuable dollars) and less in the future (in shit-paper dollars). In fact inflation also greatly reduces the difference between 20 and 30 year loans.


400k loan
4.5% APR
3% inflation

30 years :

total payment = $729,626
inflation adj = $480,259

20 years :

total payment = $607,343
inflation adj = $455,985

lots of calculators will show you the 20 yr vs 30 yr numbers and you go "ZOMG $120k more" , but that 120k is way in the future, when $120k won't even buy you a computer, so who cares? The difference in inflation-adjusted dollars is tiny (this is due to the fact that mortgage rates are so close to inflation right now).
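
To be explicit about where those inflation-adjusted numbers come from (a rough check ; I'm just discounting each fixed monthly payment back to today at the inflation rate, which lands within rounding of the numbers above) :

monthly payment :  M = L*r / (1 - (1+r)^-n)          (r = APR/12 , n = months)
adjusted total  :  M * (1 - (1+i/12)^-n) / (i/12)    (i = annual inflation)

30 yr : M = 400000*0.00375 / (1 - 1.00375^-360) = ~2027
        adjusted = 2027 * (1 - 1.0025^-360) / 0.0025 = ~480k

20 yr : M = 400000*0.00375 / (1 - 1.00375^-240) = ~2531
        adjusted = 2531 * (1 - 1.0025^-240) / 0.0025 = ~456k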

The longer loan term also has a lot of added value; if you have uncertainty about the inflation rate, then any unexpected spikes in inflation massively help the 30 year term. Also, you might die. If you die, early payments were just a waste of money.

Another thing I haven't seen mentioned much is that points (and APR variations in general) matter more when they are higher. The difference between 8% and 9% interest is much bigger than the difference between 1% and 2%.


eval (1.09)^^30 - (1.08)^^30 = 3.20502
eval (1.02)^^30 - (1.01)^^30 = 0.46351

So the lower rates are, the less you should care about getting low rates. (* this is not really applicable to mortgages, see comments)

If you're a gambler, it seems to me like a great time to take out a huge loan. Rates are low, risk of inflation in the future is high, prospects for other investments are poor. I guess the only big risk at the moment is the problem that home values are probably still about 25% too high even despite falling a lot, but it's very hard to really objectively price things when the market is bananas.


07-10-11 | Mystery : Do you ever need Total Order (seq_cst) ?

The seq_cst memory model ops in C++0x are very mysterious to me. In particular, when exactly do you need them and why? Let's talk about some issues.

First of all, one quirk is that C++0x has no way of expressing a #StoreLoad barrier. Because of that, you often wind up having to use seq_cst mem ops just to get a StoreLoad. (for example in "eventcount" seq_cst is needed, but that's just to get the StoreLoad). But this is rather unfortunate because seq_cst is actually much heavier than StoreLoad. It also enforces "total order".

( there are some more quirks in the details of C++0x ; for example an acq_rel or seq_cst is really only the barrier you think it is if it's an "exchange" (read and write). That is, a store marked acq_rel is not actually acq_rel. Similarly, seq_cst fences don't act like you think they do (see Dmitry's posts on this); in particular you cannot just translate a #StoreLoad from another language into a fence(seq_cst) in C++0x (UPDATE : well, usually you can); also seq_cst in C++0x only schedules against other atomic ops with ordering constraints; they do not schedule against non-atomic memory loads, so in that sense it is not like a memory barrier at all ) (UPDATE : this is basically true, but see clarification in newer posts here : cbloom rants 05-31-12 - On C++ Atomic Fences Part 2 and cbloom rants 05-30-12 - On C++ Atomic Fences ).

Let's compare against a theoretical "all_barriers" op, which has LoadLoad,StoreStore,StoreLoad, and LoadStore barriers on both sides of it. seq_cst is the same thing, but also enforces total order. Why would we ever want that?

What does total order do? Well the classic toy example is something like this :


int a,b,c,d = 0,0,0,0;

thread1 : 
  a = 1;
  b = 1;

thread2 :
  c = 1;
  d = 1;

thread3 :
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

thread4:
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

if all the ops are "all_barriers" then we know that the write to b must be after the write to a, and the write to d must be after the write to c, but the ops to {ab} can schedule against the ops to {cd} in any way - and in particular thread 3 and 4 can see them in different ways. So for example with all_barriers, this is a valid output :

thread3 :
 sees {abcd}
  1,0,0,0

thread4 :
 sees {cdab}
  0,0,1,1

if you use seq_cst the order of the stores on the bus is a single linear order. eg. maybe it's {acbd} or whatever, the point is it's the same for all observers. In particular, if we pretend that thread3 and 4 can run instantly and atomically then they would print the same thing.

ADDENDUM : An actual working demonstration goes like this :

shared int a,b = 0,0;
result int c,d = 0,0;

thread1 :
  a = 1;

thread2 : 
  b = 1;

thread3 :
  local int A = a;
  local int B = b;
  if ( A == 1 && B == 0 )
    c = 1;

thread4 :
  local int B = b;
  local int A = a;
  if ( B == 1 && A == 0 )
    d = 1;

// invariant :

if memory order is seq_cst then (c+d) == 2 is impossible
any other memory order and (c+d) == 2 is possible

that is, threads 1 & 2 independently write to a & b. If you use a total order, then thread 3 and 4 must see either {ab} or {ba} - both are valid, but the point is it's the same. If you don't use total order then they could see the order differently.

We then check the order in a race-free way. c=1 can only be set if the order was {ab} , d=1 can only be set if the order was {ba} , therefore with seq_cst it's impossible to see both c=1 and d=1.

(end of addendum).

(note that if you only talk about two threads, then this "total order" issue never comes up, and acq_rel is the same as seq_cst; it only becomes different when three or more threads are watching each other)

But this is quite a weird thing to require; does it ever actually matter?

(just to confuse things, Intel recently changed the spec of the MFENCE instruction to enforce "causal consistency" instead of "sequential consistency" ; presumably this is part of the push to higher core counts and the goal is to get away from the very expensive system-wide sequence point that sequential consistency requires. "causal consistency" provides a total order only for operations which are "causally related" - eg. at the same memory location, or tied together by being used together in an atomic op).

(briefly : in C++0x a seq_cst fence acts to strictly order only other seq_cst ops; it also acts as an acq_rel fence; it does not act to order ops that are not seq_cst ; for example it is not a #StoreStore barrier for two relaxed stores, one before and one after the fence (unless they are seq_cst themselves) ; part of the confusion comes from the fact that in practice on most platforms a seq_cst fence is implemented with an instruction like MFENCE which is in fact a stronger barrier - it orders all ops as being strictly before the mfence or strictly after).

(as usual the C++ language lawyer guys are spiraling off into a pit of abstraction and mildly fucking up; I'll do a followup post on how it should have been done).

Dekker's mutex is sometimes cited as an example that needs sequential consistency, because it is doing that sort of weird stuff where it uses multiple variables and needs the writes to them to go in the right order, but it actually doesn't need seq_cst. All it needs are #StoreLoad barriers. ( see here - though I think this code is actually broken)

Unfortunately StoreLoad is very awkward to express in C++0x ; one way to do it is to change the load into an AtomicIncrement by 0, so that it's a read-modify-write (RMW), then you can use a #StoreStore (which is a normal "release" constraint). For example Dekker's lock can be expressed like this :


    void lock(int tid)
    {
        int him = tid^1;
        rl::linear_backoff b1;
        
        flag[tid]($).store(1,std::mo_relaxed);

        // #StoreLoad here

        while ( flag[him]($).fetch_add(0,std::mo_acq_rel) )
        {       
            b1.yield($);
            if ( turn($).load(std::mo_relaxed) != tid )
            {
                rl::linear_backoff b2;
                flag[tid]($).store(0,std::mo_relaxed);
                while ( turn($).load(std::mo_relaxed) != tid )
                {
                    b2.yield($);
                }   
                flag[tid]($).store(1,std::mo_relaxed);

                //#StoreLoad here
            }
        }
    }
    
    void unlock(int tid)
    {
        turn($).store( tid^1  , std::mo_release);
        flag[tid]($).store( 0 , std::mo_release);
    }

(backoffs are needed to make Relacy work). Now obviously the RMW is an unfortunate stupid expense, it requires a chunk of CPU time to hold that cache line, but it might be better than a seq_cst op, depending on how that's implemented.

So what I'm struggling with is imagining a situation that actually needs "total order" (and "needs" in a fundamental sense, not just because the coder was a dummy doing silly things). That is, you can easily cook up bad code that needs total order because it is making dumb assumptions about disparate memory ops becoming visible in the same order on different processors, but that's just bad code.


07-09-11 | LockFree : Thomasson's simple MPMC

Warming back up into this stuff, here's some very simple algos to study.

first fastsemaphore :


class fastsemaphore
{
    rl::atomic<long> m_count;
    rl::rl_sem_t m_waitset;

public:
    fastsemaphore(long count = 0)
    :   m_count(count)
    {
        RL_ASSERT(count > -1);
        sem_init(&m_waitset, 0, 0);
    }

    ~fastsemaphore()
    {
        sem_destroy(&m_waitset);
    }

public:
    void post()
    {
        if (m_count($).fetch_add(1) < 0)
        {
            sem_post(&m_waitset);
        }
    }

    void wait()
    {
        if (m_count($).fetch_add(-1) < 1)
        {
            // loop because sem_wait returns non-zero for spurious failure
            while (sem_wait(&m_waitset));
        }
    }
};

Most code I post will be in Relacy notation, which is just modified C++0x. Note that C++0x atomics without explicit memory ordering specifications (such as the fetch_adds here) default to memory_order_seq_cst (sequential consistency).

Basically your typical OS "semaphore" is a very heavy kernel-space object (on Win32 for example, semaphores are cross-process). Just doing P or V on it even when you don't modify wait states is very expensive. This is just a user-space wrapper which only calls to the kernel semaphore if it is at an edge transition that will cause a thread to either go to sleep or wake up.

So this is a simple thing that's nice to have. Note that when m_count is positive it's the number of counts that can be taken without blocking, and when it's negative it's the (minus) number of threads that are sleeping on that semaphore. (between post and wakeup a thread can be sleeping but no longer counted, so we should say that threads which are not counted in minus m_count are either running or pending-running).

( see here )

Now we can look at Thomasson's very simple MPMC bounded blocking queue :


template<typename T, std::size_t T_depth>
class mpmcq
{
    rl::atomic<T*> m_slots[T_depth];
    rl::atomic<std::size_t> m_push_idx;
    rl::atomic<std::size_t> m_pop_idx;
    fastsemaphore m_push_sem;
    fastsemaphore m_pop_sem;

public:
    mpmcq() 
    :   m_push_idx(T_depth), 
        m_pop_idx(0),
        m_push_sem(T_depth),
        m_pop_sem(0)
    {
        for (std::size_t i = 0; i < T_depth; ++i)
        {
            m_slots[i]($).store(NULL);
        }
    }

public:
    void push(T* ptr)
    {
        m_push_sem.wait();

        std::size_t idx = m_push_idx($).fetch_add(1) & (T_depth - 1);

        rl::backoff backoff;

        while (m_slots[idx]($).load())
        {
            backoff.yield($);
        }

        RL_ASSERT(! m_slots[idx]($).load());

        m_slots[idx]($).store(ptr);

        m_pop_sem.post();
    }


    T* pop()
    {
        m_pop_sem.wait();

        std::size_t idx = m_pop_idx($).fetch_add(1) & (T_depth - 1);

        T* ptr;
        rl::backoff backoff;

        while ( (ptr = m_slots[idx]($).load()) == NULL )
        {
            backoff.yield($);
        }
        
        m_slots[idx]($).store(NULL);

        m_push_sem.post();

        return ptr;
    }
};

First let's understand what's going on here. It's just an array of slots with a reader index and writer index that loop around. "pop_sem" counts the number of filled slots - so the popper waits on that semaphore to see filled slots be non-zero. "push_sem" counts the number of available slots - so the pusher waits on that being greater than zero to be able to fill a slot.

So the producer and consumer both nicely go to sleep and wake each other when they should. Also because we use "fastsemaphore" they have reasonably low overhead when they are in the non-sleeping case.

Now, why is the weird backoff logic there? It's because of the "M" (for multiple) in MPMC. If this was an SPSC queue then it could be much simpler :


    void push(T* ptr)
    {
        m_push_sem.wait();

        std::size_t idx = m_push_idx($).fetch_add(1) & (T_depth - 1);

        RL_ASSERT(! m_slots[idx]($).load());

        m_slots[idx]($).store(ptr);

        m_pop_sem.post();
    }


    T* pop()
    {
        m_pop_sem.wait();

        std::size_t idx = m_pop_idx($).fetch_add(1) & (T_depth - 1);

        /* (*1) */

        T* ptr = m_slots[idx]($).exchange(NULL);
        RL_ASSERT( ptr != NULL );

        m_push_sem.post();

        return ptr;
    }

which should be pretty obviously correct for SPSC.

But, now consider you have multiple consumers and the queue is completely full.

Consumer 1 gets a pop_idx = 2. But then at (*1) it swaps out and doesn't run any more.

Consumer 2 gets a pop_idx = 3 and runs through and posts to the push semaphore.

Now a producer runs and gets push_idx = 2. It believes there is an empty slot it can write to, but it looks in slot 2 and there's still something there (because consumer 1 hasn't cleared its slot yet). So, it has to do the backoff-yield loop to give consumer 1 some CPU time to let it run.

So the MPMC with backoff-yield works, but it's not great. As long as the queue is near empty it works reasonably well, but when it's full it acts like a mutex-based queue, in that one consumer being swapped out can block all your pushers from running (and because it's just a busy wait here, the normal OS hacks to rescue you (like Windows priority boosts) won't work here (this kind of thing is exactly why the Windows scheduler has so many hacks and why despite your whining you really do want it to be like that)).


07-09-11 | TLS for Win32

So as noted in previous discussion of TLS , there's this annoying problem that it's broken for DLL's in Win32 (pre-Vista).

The real annoyance as a library writer is that even if you compile a .lib, somebody might want to use your lib in a DLL, and you can't know that in advance (I guess you could build a separate version of your lib for people who want to put it in a DLL), and even if you do make a DLL version it's annoying to the client to have to hook yourself up to the DLL_PROCESS_ATTACH to set up your TLS (if you want to use the MS-recommended way of doing TLS in DLLs). It just doesn't work very well for modular code components. The result is that if you are writing code that is supposed to always work on Win32 you have to do your own TLS system.

(same is true for Xenon XEX's and maybe PS3 PRX's though I'm not completely sure about that; I'm not aware of any other platforms that are broken like this, but there probably are some).

Anyway, so you want TLS but you can't use the compiler's built-in "__thread" mechanism. You can do something like this :



#define TLSVAR_USE_CINIT
//#define TLSVAR_USE_COMPILER_TLS

// T has to be castable to void *
template <typename T>
struct TLSVar
{
public:

    // shared between threads :
    uint32 m_tls_index;
        
    // AllocIndex is thread-safe
    // it does wait-free speculative singleton construction
    static uint32 AllocIndex(volatile uint32 * pSharedIndex)
    {
        uint32 index = LoadRelaxed(pSharedIndex);
        if ( index == TLS_OUT_OF_INDEXES )
        {
            index = TlsAlloc();
            // store my index :
            uint32 oldVal = AtomicCMPX32(pSharedIndex,TLS_OUT_OF_INDEXES,index);
            if ( oldVal != TLS_OUT_OF_INDEXES )
            {
                // already one in there
                TlsFree(index);
                index = oldVal;
            }
        }
        return index;
    }

    #ifdef TLSVAR_USE_CINIT
    TLSVar() : m_tls_index(TLS_OUT_OF_INDEXES)
    {
        AllocIndex(&m_tls_index);
    }
    #endif
    
    T & Ref()
    {
        #ifndef TLSVAR_USE_CINIT
        AllocIndex(&m_tls_index);
        #endif
        
        // initial value in TLS slot :
        //  this has to be done once per thread
        LPVOID tls = TlsGetValue(m_tls_index);
        if ( tls == NULL )
        {
            T * pT = new T;
            tls = (LPVOID)pT;
            TlsSetValue(m_tls_index,tls);
        }
        
        return *((T *) tls);
    }
    
    operator T & ()
    {
        return Ref();
    }
    
    void operator = (const T t)
    {
        Ref() = t;
    }
    
};

#ifdef TLSVAR_USE_COMPILER_TLS

#ifdef _MSC_VER
#define TLS_VAR(type,name)  __declspec(thread) type name = (type)0;
#else
#define TLS_VAR(type,name)  __thread type name = (type)0;
#endif

#else // TLSVAR_USE_COMPILER_TLS

#ifdef TLSVAR_USE_CINIT
#define TLS_VAR(type,name) TLSVar<type> name;
#else
// use static initializer, not cinit :
#define TLS_VAR(type,name) TLSVar<type> name = { TLS_OUT_OF_INDEXES };
#endif

#endif // TLSVAR_USE_COMPILER_TLS


A few notes :

I made it able to work with cinit or without. The cinit version is somewhat preferable. I'm not sure if cinit always works on all platforms with modular code loading, so I made it optional.

AllocIndex uses the preferred modern way of instantiating shared singletons. It is "wait free", which means all threads always make progress in bounded time. In the case of contention there is an unnecessary alloc and free, which is unlikely and usually not a big deal. Whenever an extra alloc/free is not a big deal, this is the best way to do a singleton. If the extra alloc/free is a big deal, then a blocking approach is preferred.

Some platforms have a small limit on the number of TLS slots. If you use the compiler __thread mechanism, all your TLS variables get put together in a struct that goes in one TLS slot. If you can't use that, then it's probably best to do the same thing by hand - make a struct that contains everything you want to be thread-local and then just use a single slot for the struct. Unfortunately this is ugly for software engineering as many disparate systems might want to use TLS and they all have to share a globally visible struct def.
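
eg. something like this (a sketch ; the member names are made up, and it just reuses the same TlsGetValue/TlsSetValue pattern and AllocIndex singleton from above) :

// all thread-local state for every system goes in one struct, in one TLS slot :
struct ThreadLocals
{
    int     allocator_bucket;   // hypothetical example members
    char *  log_buffer;
    // ... etc for each system that wants TLS
};

static volatile uint32 s_tls_index = TLS_OUT_OF_INDEXES;

ThreadLocals * GetThreadLocals()
{
    uint32 index = TLSVar<int>::AllocIndex(&s_tls_index);
    ThreadLocals * p = (ThreadLocals *) TlsGetValue(index);
    if ( p == NULL )
    {
        p = new ThreadLocals;       // first use on this thread
        TlsSetValue(index, p);
    }
    return p;
}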

Handling freeing at thread shutdown is an annoyance. The pthreads tls mechanism lets you register a function callback for each tls slot which can do freeing at thread shutdown. I'm sure there's some way to get a thread-shutdown callback in Windows. Personally I prefer to use a model where all my threads live for the lifetime of the app (there are no short-lifetime threads), so I just don't give a shit about cleaning up the TLS, but that may not be acceptable to everyone, so you will have to deal with this.


07-08-11 | Event Count and Condition Variable

If you have either event_count or condition_variable, it's pretty straightforward to get the other from the one you have.

eventcount from condition_variable :

by Chris M Thomasson :

class eventcount {
public:
  typedef unsigned long key_type;

private:
  mutable rl::atomic<key_type> m_count;
  rl::mutex m_mutex;
  rl::condition_variable m_cond;

  void prv_signal(key_type key) {
    if (key & 1) {
      m_mutex.lock($);
      while (! m_count($).compare_exchange_weak(key, (key + 2) & ~1,
        rl::memory_order_seq_cst));
      m_mutex.unlock($);
      m_cond.notify_all($);
    }
  }

public:
  eventcount() {
    m_count($).store(0, rl::memory_order_relaxed);
  }

public:
  key_type get() const {   // aka prepare_wait
    return m_count($).fetch_or(1, rl::memory_order_acquire);
  }

  void signal() {  // aka notify_one
    prv_signal(m_count($).fetch_add(0, rl::memory_order_seq_cst));
  }

  void signal_relaxed() {
    prv_signal(m_count($).load(rl::memory_order_relaxed));
  }

  void wait(key_type cmp) {
    m_mutex.lock($);
    if ((m_count($).load(rl::memory_order_seq_cst) & ~1) == (cmp & ~1)) {
      m_cond.wait(m_mutex, $);
    }
    m_mutex.unlock($);
  }
};
and condition variable from event count :
by Dmitry V'jukov :

class condition_variable
{
    eventcount ec_;

public:
    void wait(mutex& mtx)
    {
        int count = ec_.prepare_wait();
        mtx.unlock();
        ec_.wait(count);
        mtx.lock();
    }

    void signal()
    {
        ec_.notify_one();
    }

    void broadcast()
    {
        ec_.notify_all();
    }
};
(note this is a simplified condition variable without all the POSIX compliance crud).

In C++0x you have condition_variable at the stdlib level, so that is probably the best approach for the future. Unfortunately that future is still far away. On Pthreads you also have condition_variable (though a rather more complex one). Unfortunately, on Win32 (pre-Vista) you don't have condition_variable at all, so you have to build one of these from OS primitives.

(BTW there are various sources for good condition_var implementations for Win32, such as boost::thread and Win32 pthreads by Alex Terekhov).

ADDENDUM : really eventcount is the more primitive of the two; it's sort of a mistake that C++0x has provided "condition_var" as a primitive. They are not trying to provide a full set of OS-level thread control types (eg. they don't provide semaphore, event, what have you) - they are trying to provide the minimal basic set, and they chose condition_var. They should have done mutex and eventcount, as you can build everything from that.

(actually there's something perhaps even more primitive than eventcount which is "waitset" which can be easily used to build any of the basic blocking thread control devices).


07-08-11 | Who ordered Event Count ?

Say you have something like a lock-free queue , and you wish to go into a proper OS-level thread sleep when it's empty (eg. not just busy wait).

Obviously you try something like :


popper :

item = queue.pop();
if ( item == NULL )
{
  Wait(handle);
}


pusher :

was_empty = queue.push();
if ( was_empty )
{
  Signal(handle);
}

where we have extended queue.push to atomically check if the queue was empty and return a bool (this is easy to do for most lock-free queue implementations).

This doesn't work. The problem is that between the pop() and the Wait(), somebody could push something on the queue. That means the queue is non-empty at that point, but you go to sleep anyway, and nobody will ever wake you up.

Now, one obvious solution is to put a Mutex around the whole operation and use a "Condition Variable" (which is just a way of associating a sleep/wakeup with a mutex). That way the mutex prevents the state of the queue from changing while you decide to go to sleep. But we don't want to do that today because the whole point of our lock-free queue is that it's lock-free, and a mutex would spoil that. (actually I suppose the classic solution to this problem would be to use a counting semaphore and inc it on each push and dec it on each pop, but that has even more overhead if it's a kernel semaphore). Basically we want these specific fast paths and slow paths :


fast path :
  when popper can get an item without sleeping
  when pusher adds an item and there are no waiters

slow path :
  when popper has no work to do and goes to sleep
  when pusher needs to wake a popper

we're okay with the sleep/wakeup part being slow because that involves thread transitions anyway, so it's always slow and complicated. But in the other case where a core is just sending a message to another running core it should be mutex-free.

So, the obvious answer is to do a kind of double-check, something like :


popper :

item = queue.pop(); 
if ( item ) return item;  // *1
atomic_waiters ++;   // *2
item = queue.pop();  // *3
if ( item ) { atomic_waiters--; return item; }
if ( item == NULL )
{
  Wait(handle);
}
atomic_waiters--;


pusher :

queue.push();
if ( atomic_waiters > 0 )
{
  Signal(handle);
}

this gets the basics right. First popper has a fast path at *1 - if it gets an item, it's done. It then registers the fact that it's going to wait to a shared variable at *2. Then it double checks at *3. The double check is not an optimization to avoid sleeping your thread, it's crucial for correctness. The issue is that the popper could swap out between *1 and *2, and the pusher could then run completely, and it will see waiters == 0 and not do a signal. So the double-check at *3 catches this.

There's a performance bug with this code as written - if the queue goes empty and then you do lots of pushes, all those pushes send the signal. You might be tempted to fix that by moving the "atomic_waiters--" line to the pusher side (before the signal), but that creates a race. You could fix that but then you spot a bigger problem :

This code doesn't work at all if you have multiple pushers or poppers. The problem is "lost wakeups". Basically if there are multiple poppers going into wait at the same time, the pusher may think it's done the wakeups it needs to, but it hasn't, and a popper goes into a wait forever.

To fix this you need a real proper "event count". What a proper event_count does is register the waiter at a certain point in time. The usage is like this :


popper :

item = queue.pop(); 
if ( item ) return item;
count = event_count.get_event_count();
item = queue.pop();
if ( item ) { event_count.cancel_wait(count); return item; }
if ( item == NULL )
{
  event_count.wait_on_count(count);
}


pusher :

queue.push();
event_count.signal();

Now, as before get_event_count() registers in an atomic visible variable that I want to wait on something (most people call this function prepare_wait()), but it also records the current "event count" to identify the wait (this is just the number of pushes, or the number of signals if you like). Then wait_on_count() only actually does the wait if the event_count is still on the same count as when I did get_event_count - if the internal counter has advanced the wait is not done. signal() is a fast nop if there are no waiters, and increments the internal count.

This eliminates the lost wakeup problem, because if the "event_count" has advanced (and signaled some other popper, and won't signal again) then you will simply not go into the Wait.

Basically it's exactly like a Windows Event in that you can wait and signal, but with the added feature that you can record a place in time on the event, and then only do the Wait if that time has not advanced between your recording and the call to Wait.

It turns out that event_count and condition variables are closely related; in particular, one can very easily be implemented in terms of the other. (I should note that the exact semantics of pthread cond_var are *not* easy to implement in this way, but a "condition variable" in the broader sense need not comply with their exact specs).

Maybe in the future I'll get into how to implement event_count and condition_var.

BTW eventcount is the elegant solution to the problem of Waiting on Thread Events discussed previously.


ADDENDUM : if you like, eventcount is a way of doing Windows' PulseEvent correctly.

"Event" on Win32 is basically a "Gate" ; it's either open or closed. SetEvent makes it open. When you Wait() on the Event, if the gate is open you walk through, if it's closed you stop. The normal way to use it is with an auto-reset event, which means when you walk through the gate you close it behind yourself (atomically).

The idea of the Win32 PulseEvent API is to briefly open the gate and let through someone who was previously waiting on the gate, and then close it again. Unfortunately, PulseEvent is horribly broken by design and almost always causes a race, which leads most people to recommend against ever using PulseEvent (or manual reset events). (I'm sure it is possible to write correct code using PulseEvent, for example the race may be benign and just be a performance bug, but it is wise to follow this advice and not use it).

For example the standard Queue code using PulseEvent :

popper :
  node = queue.pop();
  if ( ! node ) Wait(event);

pusher :
  queue.push(node);
  PulseEvent(event);

is a totally broken race (if the popper is between getting a null node and entering the wait, it doesn't see the event), and most PulseEvent code is similarly broken.

eventcount's Signal is just like PulseEvent - it only signals people who were previously waiting, it doesn't change the eventcount into a signalled state. But it doesn't suffer from the race of PulseEvent because it has a consistent way of defining the moment in "time" when the event fires.


07-06-11 | Who ordered Condition Variables ?

I'm getting back into some low level threading stuff for a week or two and I'll try to write about it because it's very confusing and I always forget the basics.

Condition Variables are explained very strangely around the net. You'll find sites that say they "let you wait on a variable being set to certain value" (not true), "let you avoid polling" (not true), or "the mutex is a left-over from pre-pthreads implementations" (not true).

What a "Condition Variable" really is is a way to receive a Signal and enter a Mutex at the same time ("atomically" if you like).

Why do we want this?

The typical case is we are waiting on some state. When the state is not true, we want our thread to go into a real sleep. So we use an OS Wait() on the thread that's waiting for that state, and an OS Signal() when the state is set to wake the thread. (Wait and Signal might be "Event" on win32, or a semaphore in pthreads, or a futex, etc). Basically :


Waiting thread :

if ( state I want is not set )
{
  Wait(handle);
  // state I want should be set now
  // **
}


Signalling thread :

if ( I changed state to the one wanted )
  Signal(handle)

simple enough. The problem is, the invariant at (**) is not true. The state you want is NOT necessarily set there, because there's a race. Immediately after you receive the signal, you could be swapped out, and someone else could change the state, and then it would not be what you wanted.

So the obvious thing is to put a mutex around "state". To be less abstract I'll talk about a one item queue (aka a mailbox).


Waiting thread :

Lock mutex;
if ( mailbox empty )
{
  atomically { Unlock(mutex);   Wait(handle); (**) Lock(mutex) }
  // mailbox must be full now
}

Signaling thread :

Lock mutex
if ( mailbox empty )
{
  mailbox = some stuff;
  Signal(handle); Unlock(mutex);
}

now when the mailbox filler signals the handle, the waiter immediately wakes up and tries to lock the mutex and can't; the filler then unlocks and the waiter can run, and it is guaranteed to have an item.

Note that the line marked {atomically} *must* be atomic, otherwise you have a race at (**) just like before.

And this :

  atomically { Unlock(mutex);   Wait(handle); (**) Lock(mutex) }
is exactly what pthread_cond_wait() is.

Personally, rather than introducing a new synchronization data type I would have preferred to just get a function that does unlock_wait_lock(). The "cond_var" in pthreads has nothing to do with a "condition", it's just a waitable handle (and associated mutex and other wiring) in Windows lingo.

There's one more point worth talking about. In the thread that filled the mailbox, we did :


lock mutex
set state
signal
unlock

That's good because it guarantees that the receiver gets the state it wants and is race free. It's bad because it causes some unnecessary "thread thrashing".

(thread thrashing is a lot like input latency that I mentioned a while ago; it's something you just want to constantly watch and be vigilant about tracking and removing. Any time a thread wakes up and does nothing and goes back to sleep, you are "thrashing" and just wasting massive amounts of CPU. You want to minimize the number of useless thread wakeups)

The alternative is :


lock mutex
set state
unlock
(**)
signal

now, there's a race at ** where the state can get changed before you signal, so the invariant in the receiver is no longer true.

In most cases, however, this is not actually bad, and this form is actually preferred because of its increased efficiency. You have to change the receiver to a "double-check" type of pattern. Something like :


Waiting thread :

Lock mutex;
if ( mailbox empty )
{
  retry:
  cond_wait(mutex,handle); //  atomically { Unlock(mutex); Wait(handle); Lock(mutex) }
  // mailbox may or may not be full now
  if ( mailbox empty ) goto retry;
  // now do work
}

Signaling thread :

Lock mutex
if ( mailbox empty )
{
  mailbox = some stuff;
  Unlock(mutex);
  // intentional race here
  Signal(handle);
}

In general I believe it's a safer design to treat the signal as meaning "wake up and check this condition" instead of "wake up and this condition is definitely set". Then you engineer to minimize the number of wakeups when the condition is not set.
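
For concreteness, here's the same mailbox written against actual pthreads calls; a minimal sketch (the function names are mine), with the signal intentionally outside the mutex as just described :

#include <pthread.h>

static pthread_mutex_t s_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  s_cond    = PTHREAD_COND_INITIALIZER;
static void *          s_mailbox = NULL;   /* the one-item "queue" */

/* waiter : sleep until the mailbox is full, then take the item */
void * Mailbox_Get(void)
{
    pthread_mutex_lock(&s_mutex);
    while ( s_mailbox == NULL )                 /* the "double-check" ; re-test after every wake */
        pthread_cond_wait(&s_cond, &s_mutex);   /* atomically { unlock ; wait ; lock } */
    void * item = s_mailbox;
    s_mailbox = NULL;
    pthread_mutex_unlock(&s_mutex);
    return item;
}

/* filler : set the state, then signal outside the mutex (the "intentional race" form) */
void Mailbox_Put(void * item)
{
    pthread_mutex_lock(&s_mutex);
    s_mailbox = item;                           /* simplified : assumes the slot was empty */
    pthread_mutex_unlock(&s_mutex);
    pthread_cond_signal(&s_cond);
}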

BTW a better design of the primitive would have allowed the signalling thread to do


atomically { Unlock(mutex); Signal(handle); }

which would be the ideal thing. Unfortunately a normal cond_var is not expressed this way. However, apparently on some modern UNIXes cond_var actually *acts* this way. What you do is signal inside the mutex, but the other thread isn't actually woken up until you unlock the mutex. Unfortunately this is a hidden optimization (they should have just provided unlock_and_signal() as one call) and you can't rely on it if you're cross-platform.

Some links on this topic :
Usenet - Condition variables signal with or without mutex locked
condvars signal with mutex locked or not - Loïc OnStage
A word of caution when juggling pthread_cond_signal/pthread_mutex_unlock - comp.programming.threads - Google Groups


07-05-11 | Huh?

I watched the 20 minute Carmack interview (thanks Nino) which was pretty damn excruciating, god dammit I want text transcripts, text text text, text is for information, video is for fucking TV news. Anyway, a few things struck me as strange.

One is that he talks about how important it was to get away from dark corridor shooters with monsters jumping out at you, and yet - you can watch the main gameplay trailer they've released and admire all the ... dark corridors with monsters jumping out at you.

But the main thing that made me go "huh?" is when he's talking about how they need to avoid redoing the entire engine for every game, and what they might save development time on in the future, he says something like "the AI, the animation, is basically good enough" (so they wouldn't be changed for future games).

Uh, what? Maybe if you want to make games where mindless monsters pop out on scripted paths and animate around awkwardly and unnaturally, then yes, AI and animation are done. But in any more general sense, no, they're not even remotely close. Time would be much better spent if they never rev'ed the graphics engine again and instead focused on AI and animation.

Valve is a good example; their graphics engine is actually pretty archaic now, but their animation system is very good, and their games look great because of it. Motion is hugely important; and of course valve's animation is still way behind where it should be (animation needs to become more code-driven, less canned, so that it can be more dynamic to weight transfer and surface and other world interactions, and also just more varied, more emotional). I think motion is maybe the most important step in the uncanny valley. If you have very natural motion even on a stick figure it looks startlingly real.

(Valve's characters still look like puppets that play one animation, then the next, sort of like the mechatronic Country Bear Jamboree kind of thing; they're very clever to use robots or toon shading to hide the uncanny valley).

Nobody is even close on AI. I don't expect game AI's that can talk to you or learn, but the goal should be very simple : playing a networked game against AI's should be just as fun as playing against humans. This is the "Turing test" for game AI if you like; if you shut off voice chat and play your Halo or Starcraft or whatever vs. an opponent and don't get told if it's AI or human, you shouldn't be able to tell. The AI should surprise you and experiment and sometimes make mistakes and make you laugh and impress you and do all those things that humans can do. Of course it should be able to; our sights are way too low.

Software developers often get stuck in the abstraction, they wind up comparing to their peers and forget about the absolute target.


06-28-11 | String Extraction

So I wrote a little exe to extract static strings from code. It's very simple.

StringExtractor just scans some dirs of code and looks for a tag which encloses a string with parens. eg. :

    MAKE_STRING( BuildHuff )
it takes all the strings it finds in that way and makes a table of indexes and contents, like :


enum EStringExtractor
{
    eSE_Null = 0,
    eSE_BuildHuff = 1,
    eSE_DecodeOneQ = 2,
    eSE_DecodeOneQ_memcpy = 3,
    eSE_DecodeOneQ_memset = 4,
    eSE_DecodeOneQuantum = 5, ...


const char * c_strings_extracted[] = 
{
    0,
    "BuildHuff",
    "DecodeOneQ",
    "DecodeOneQ_memcpy",
    "DecodeOneQ_memset",
    "DecodeOneQuantum", ...

it outputs this to a generated .C and .H file, which you can then include in your project.

The key then is what MAKE_STRING means. There are various ways to set it up, depending on whether you are replacing an old system that uses char * everywhere or not. Basically you want to make a header that's something like :


#if DO_STRINGS_RAW

#define MAKE_STRING( label )   (string_type)( #label )
#define GET_STRING(  index )   (const char *)( index )

#else

#include "code_gen_strings.h"

#define MAKE_STRING( label )  (string_type)( eSE_ ## label )
#define GET_STRING(  index )  c_strings_extracted[ (int)( index ) ]

#endif

(string_type can either be const char * to replace an old system, or if you're doing this from scratch it's cleaner to make it a typedef).

If DO_STRINGS_RAW is on, you run with the strings in the code as normal. With DO_STRINGS_RAW off, all static strings in the code are replaced with indexes and the table lookup is used.
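
For example (a hypothetical use site; Log and the header name are made up), the call site is identical in both modes :

#include "code_gen_strings_config.h"  // hypothetical header containing the defines above

void BuildHuff_Example()
{
    // compiles to the enum value eSE_BuildHuff normally, or to the literal "BuildHuff" with DO_STRINGS_RAW
    string_type tag = MAKE_STRING( BuildHuff );

    // the table lookup (or pass-through) only happens where the text is actually needed
    Log( "begin %s\n", GET_STRING( tag ) );
}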

It's important to me that the code gen doesn't actually touch any of the original source files, it just makes a file on the side (I hate code gen that modifies source because it doesn't play nice with editors); it's also important to me that you can set DO_STRINGS_RAW and build just fine without the code gen (I hate code gen that is a necessary step in the build).

Now, why would you do this? Well, for one thing it's just cleaner to get all static strings in one place so you can see what they are, rather than having them scattered all over. But some real practical benefits :

You can make builds that don't have the string table; eg. for SPU or other low-memory console situations, you can run the string extraction to turn strings into indices, but then just don't link the table in. Now they can send back indices to the host and you can do the mapping there.

You can load the string table from a file rather than building it in. This makes it optional and also allows localization etc. (not a great way to do this though).

For final builds, if you are using these strings just for debug info, you can easily get rid of all of them in one place just by #defining MAKE_STRING and GET_STRING to nulls.

Anyhoo, here's the EXE :

stringextractor.zip (84k)

(stringextractor is also the first cblib app that uses my new standardized command line interface; all cblib apps in the future will have a common set of -- options; also almost all cblib apps now take either files or dirs on the command line and if you give them dirs they iterate on contents).

(stringextractor also importantly uses a helper to not change the output file if the contents don't change; this means that it doesn't mess up the modtime of the generated file and cause rebuilds that aren't necessary).

Obviously one disadvantage is you can't have spaces or other non-C-compatible characters in the string. But I guess you could fix this by using printf style codes and do printf when you generate the table.


06-24-11 | Regression

Oodle now can run batch files and generate this :

                    test_cdep.done  test_huff.done  test_ioqueue.done  test_lzp.done   test_oodlelz.done
r:\test_done_xenon
r:\test_done_ps3    pass            pass            pass               pass : 128.51   pass : 450.50
r:\test_done_win32  pass            pass            fail               pass : 341.94   pass : 692.58

                    test_cdep.done  test_huff.done  test_ioqueue.done  test_lzp.done   test_oodlelz.done
r:\test_done_xenon
r:\test_done_ps3    pass            pass            pass               pass : 128.03   pass : 450.73
r:\test_done_win32  pass            pass            pass               pass : 335.55   pass : 686.90

Yay!

Something that's important for me is doing constant runs of the speeds of the optimized bits on all the platforms, because it's so easy to break the optimization with an innocuous check-in, and then you're left trying to find what slowed you down.

Two niggles continue to annoy me :

1. Damn Xenon doesn't have a command line interface (by which I mean you can't actually interact with running programs from a console; you can start programs; the specific problem is that you can't tell if a program is done or still running or crashed from a console). I have my own hacky work-around for this which is functional but not ideal. (I know I could write my own nice xenon-runner with the Dm API's but I haven't bitten that off yet).

2. Damn PS3TM doesn't provide "force connect" from the command line. They provide most of the commands as command line switches, but not the one that I actually want. Because of this I frequently have problems connecting to the PS3 during the regression run, and I have to open up the damn GUI and do it by hand. This is in fact the only step that I can't automate and that's annoying. I mean, why do they even fucking bother with providing the "connect" and "disconnect" options? They never fucking work, the only thing that works is "force disconnect". Don't give me options that just don't work dammit.

(and the whole idea of PS3TM playing nice and not disconnecting other users is retarded because it doesn't disconnect idle people, so when someone else is "connected" that usually means they were debugging two days ago and just never disconnected)

(there is a similar problem with the Xenon (similar in the sense that it can't be automated); it likes to get itself into a state where it needs to be cold-booted by physically turning it off; I'm not sure why the "cold boot" xbreboot is not the same as physically power cycling it, so that's mildly annoying too).


06-23-11 | Map File Graphviz

What I want :

Something that parses the .map and obj's and creates a graph of the size of the executable. Graph nodes should be the size they take in the exe, and connections should be dependencies.

I have all this code for generating graphviz/dotty because I do it for my allocation grapher, but I don't know a good way to get the dependencies in the exe. Getting the sizes of things in the MAP is relatively easy.

To be clear, what you want to see is something like :

s_log_buf is 1024 bytes
s_log_buf is used by Log() in Log.cpp
Log() is called from X and Y and Z
...
just looking at the .map file is not enough, you want to know why a certain symbol got dragged in. (this happens a lot, for example some CRT function like strcpy suddenly shows up in your map and you're like "where the fuck did that come from?")

Basically I need the static call-graph or link tables, a list of the dependencies from each function. The main place I need this is on the SPU, because it's a constant battle to keep your image size minimal to fit in the 256k.

I guess I can get it from "objdump" pretty easily, but that only provides dependency info at the obj level, not at the function level, which is what I really want.

Any better solutions?


06-21-11 | Houses

Well I put an offer on a house, but it doesn't look like I'll get it. The multiple-offer scenario is very strange, I always thought it was like an actual auction, like I put in an offer, they tell me what the highest other offer is and I get a chance to beat it. Not so. The seller can basically tell you anything they want to manipulate you, which is a real turn off to me and makes me just want to walk away. (in general in life I'm horrible at dealing with interactions and problems with people - my only weapon is to walk away). This seller basically said "you need to offer a lot more" but wouldn't say what the actual other offers were, so are they just trying to trick me into offering too much? Fuck that.

It's difficult keeping focused on the house hunt. We put a ton of time into looking at that house; I think I visited it 5 times to check out the neighbors and neighborhood at different times of day (you have to try to random sample for dogs yapping, children playing drums, retired people doing amateur construction projects, etc.). It's so much work that it's probably +EV for me to actually get a "bad deal" on the house. I think I might be too worried about making a good investment and could make a mistake of investing too much time searching and not getting what I really want.

Anyway, writing down some things that I've been mulling.

Houses are More Stress than Renting

I've heard this from a lot of home owners, and I'm sure it's true, but it's also largely just in your head. For example, home owners stress about the value of their home when the market goes up and down. But you don't have to, you can just not worry about that and just think of your mortgage as rent. It's not actually affecting you in any way day to day. Similarly, home owners stress about doing maintenance and home improvements; they always have a big todo list of stuff to do to the house (fix the squeaky door, find the basement leak, replace the roof shingles, etc.), but you don't have to do those things. I've lived in a lot of rental houses over the last 15 years, and I have never seen a home owner do maintenance on *any* of them *ever*. Not even things like pruning or gutter cleaning that should be done annually. So clearly houses hold up just fine if you do no maintenance at all. The reason rentals are relaxing is because you see problems with the house and you just think "meh, not my problem". But you could treat the home you own the same way. I'm a total uptight type-A so this is a major trap for me that I have to avoid. This relates to ...

Priorities

It's one of the sad/funny truths of life that most people actively make their own lives worse. Maybe you live in the city in a shitty apartment, or even in a group house; it's uncomfortable, it's noisy, but you hang out with your friends, you're surrounded by the vitality of the city, you're actually very happy. But when you get money, you go buy a comfortable suburban house. Now you're far away from your friends and you don't see them any more, you have to drive everywhere, you spend your time mowing the lawn and watching TV and home improving; you're more comfortable, but your life is actually much worse.

Most people are horrible decision makers when it comes to their own happiness. Humans tend to prioritize elimination of discomfort, and in reality that is not actually important to happiness. (for example, stupid people spend their time buying the newest fancy hiking gear so they can be comfortable if they ever actually go, smart people just don't care and go hiking in the rain anyway, it may be uncomfortable, but afterward you forget the discomfort and have only the happiness).

I think that in general, home buyers over-value the actual house. Whether it's cute, whether it's in good condition, etc. These things don't actually matter that much to your quality of life. Oh no, I don't have cute decorative wood trim in my living room, my life sucks. No, actually these are probably the least important things.

Perhaps the most under-valued thing of all is picking a location near your family and friends. People always think "I don't want to sacrifice picking the ideal house for this because they might move anyway" or "we can still hang out even if I live further away" , but in fact you won't still hang out, and that is a major quality of life loss.

I am as susceptible to this as anyone; the problem is you see a charming beautiful house and it fills you with visions of how good your life would be there, but in reality that is just an illusion (it's sort of like marrying a beautiful woman - in the long run, the beauty is not the important thing to quality of life, it's how you get along, whether she's understanding and compassionate and reasonable and fun, etc.)

I've been trying to figure out what actually matters to me. It's something like this (not in order) :


1. Bedroom where you can open the windows and it's not super noisy

2. No neighbor problems ; ideally see as little of the neighbors as possible ; no apartments or home
improvers.

3. Some private space where I can be naked in the sunshine

4. Some garage space where I can work on my car and bikes in peace

5. Sunny land for a garden

6. Walkable to groceries (about 0.5 miles max)

7. Good room for a home office ; has to be reasonably isolated from rest of the house

8. Not intolerable commute to kirkland ; even though I only do it two days a week, it does create a massive
amount of anger

Decision Making

I find making these kind of large extended decisions very difficult. The problem is you sort of go off chasing lines of thought and don't reset.

You start with certain criteria in mind. Then something comes up and it makes you think "okay maybe I should consider more expensive places too", so now you have to consider them and weigh that in too. Then something else comes up and you chase another thread, then you visit one house and it has a nice out-building and you have to weigh that against the original factors.

Your mind is getting more and more cluttered with all these cases and how to weigh them against each other; you no longer have a fresh perspective to think "do I really want this?"

One solution is to give yourself strict rules at the start (X dollar range, Y location range, Z square foot range) and absolutely stick to them, because as you get more and more confused down the line you will be tempted to violate those ranges and you might make a mistake simply because your decision making capacity has got shot.

Value

How do you get good value from a large purchase? (for most people this is just a car or a house)

Well some people can do it by smooth talking or manipulating the seller. I'm never going to succeed at that so let's talk about other ways.

One is to find a desperate seller. Sometimes you just get lucky with your timing and find a seller who needs to move right now and will take a low price. Basically you just go around to lots of sellers and give them low-ball offers, and eventually someone will bite. You can't be too picky about what you get this way, though, because the chance of finding a desperate seller in the house you want is very low. (it's easier with cars than houses since there are lots of identical cars of a given model).

Probably the best way, though, is to find properties where the market's valuation doesn't match your own valuation. Of course that sucks for resale value, but if you just want to maximize the value to you, your best bet is to look for ways that your personal valuation mismatches the market. A few for me :

I hate finished basements, but they count as square footage, so houses with finished basements
are massively overvalued for me.

I super-value some privacy in the yard.

I don't care about square footage of the house that much in particular, so large houses are over-valued
and small houses are under-valued.  A lot of people look at $/sq-ft way too much.

Blah blah blah I'm bored of this topic.


06-17-11 | C casting is the devil

C-style casting is super dangerous, as you know, but how can you do better?

There are various situations where I need to cast just to make the compiler happy that aren't actually operational casts. That is, if I was writing ASM there would be no cast there. For example something like :

U16 * p;
p[0] = 1;
U8 * p2 = (U8 *)p;
p2[1] = 7;
is a cast that changes the behavior of the pointer (eg. "operational"). But, something like :
U16 * p;
*p = 1;
U8 * p2 = (U8 *)p;
p2 += step;
p = (U16 *) p2;
*p = 2;
is not really a functional cast, but I have to do it because I want to increment the pointer by some step in bytes, and there's no way to express that in C without a cast.

Any time I see a C-style cast in code I think "that's a bug waiting to happen" and I want to avoid it. So let's look at some ways to do that.

1. Well, since we did this as an example already, we can hide those casts with something like ByteStepPointer :


template<typename T>
T * ByteStepPointer(T * ptr, ptrdiff_t step)
{
    return (T *)( ((intptr_t)ptr) + step );
}

our goal here is to hide the nasty dangerous casts from the code we write every day, and bundle it into little utility functions where it's clear what the purpose of the cast is. So now we can write out example as :
U16 * p;
*p = 1;
p = ByteStepPointer(p,step);
*p = 2;
which is much prettier and also much safer.

2. The fact that "void *" in C++ doesn't cast to arbitrary pointers the way it does in C is really fucking annoying. It means there is no "generic memory location" type. I've been experimenting with making the casts in and out of void explicit :


template<typename T>
T * CastVoid(void * ptr)
{
    return (T *)( ptr );
}

template<typename T>
void * VoidCast(T * ptr)
{
    return (void *)( ptr );
}

but it sucks that it's so verbose. In C++0x you can do this neater because you can template specialize based on the left-hand-side. So in current C++ you have to write
Actor * a = CastVoid<Actor>( memory );
but in 0x you will be able to write just
Actor * a = CastVoid( memory );

There are a few cases where you need this, one is to call basic utils like malloc or memset - it's not useful to make the cast clear in this case because the fact that I'm calling memset is clear enough that I'm treating this pointer as untyped memory; another is if you have some generic "void *" payload in a node or message.

Again you don't want just a plain C-style cast here, for example something like :

Actor * a = (Actor *) node->data;
is a bug waiting to happen if you change "data" to an int (among other things).
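
With the explicit helper the intent is visible, and if "data" later stops being a pointer the call simply fails to compile instead of silently reinterpreting it :

Actor * a = CastVoid<Actor>( node->data );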

3. A common annoying case is having to cast signed/unsigned. It should be obvious that when I write :

U32 set = blah;
U32 mask = set & (-set);
that I want the "-" operator to act as (~set + 1) on the bits and I don't care that it's unsigned, but C won't let you do that. (see previous rants about how what I really want in this scenario is a "#pragma requires(twos_complement)" ; warning me about the sign is fucking useless for portability because it just makes me cast, if you want to make a real portable language you have to be able to express capabilities of the platform and constraints of the algorithm).

So, usually what you want is a cast that gives you the signed type of the same register size, and that doesn't exist. So I made my own :


static inline S8  Signed(U8 x)  { return (S8) x; }
static inline S16 Signed(U16 x) { return (S16) x; }
static inline S32 Signed(U32 x) { return (S32) x; }
static inline S64 Signed(U64 x) { return (S64) x; }

static inline U8  Unsigned(S8 x)  { return (U8) x; }
static inline U16 Unsigned(S16 x) { return (U16) x; }
static inline U32 Unsigned(S32 x) { return (U32) x; }
static inline U64 Unsigned(S64 x) { return (U64) x; }

So for example, this code :
mask = set & (-(S32)set);
is a bug waiting to happen if you switch to 64-bit sets. But this :
mask = set & (-Signed(set));
is robust. (well, robust if you include a compiler assert that you're 2's complement)

4. Probably the most common case is where you "know" a value is small and need to put it in a smaller type. eg.

int x = 7;
U8 small = (U8) x;
But all integer-size-change casts are super unsafe, because you can later change the code such that x doesn't fit in "small" anymore.

(often you were just wrong or lazy about "knowing" that the value fit in the smaller type. One of the most common cases for this right now is putting file sizes and memory sizes into 32-bit ints. Lots of people get annoying compiler warnings about that and think "oh, I know this is less than 2 GB so I'll just C-style cast". Oh no, that is a huge maintenance nightmare. In two years you try to run on a larger file and suddenly you have bugs all over and you can't find them because you used C-style casts. Start checking your casts!).

You can do this with a template thusly :


// check_value_cast just does a static_cast and makes sure you didn't wreck the value
template <typename t_to, typename t_fm>
t_to check_cast( const t_fm & from )
{
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from );
    return to;
}
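
Usage with the template as-is :

int x = 7;
U8 small = check_cast<U8>( x );  // asserts in debug if x doesn't survive the round trip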

but it is so common that I find the template a bit excessively verbose (again C++0x with LHS specialization would help, you could then write just :
small = check( x );
or
small = clamp( x );
which is much nicer).

To do clamp casts with a template is difficult. You can use std::numeric_limits to get the ranges of the dest type :

template <typename t_to, typename t_fm>
t_to clamp_cast( const t_fm & from )
{
    t_to lo = std::numeric_limits<t_to>::min();
    t_to hi = std::numeric_limits<t_to>::max();
    if ( from < lo ) return lo; // !
    if ( from > hi ) return hi; // !
    t_to to = static_cast<t_to>(from);
    RR_ASSERT( static_cast<t_fm>(to) == from ); 
    return to;
}
however, the compares inherent (at !) in clamping are problematic, for example if you're trying to clamp_cast from signed to unsigned you may get warnings there (you can also get the unsigned compare against zero warning when lo is 0). (? is there a nice solution to this ? you want to cast to the larger range of the two types for the purpose of the compare, so you could make some template helpers that do the compare in the wider of the two types, but that seems a right mess).

Rather than try to fix all that I just use non-template versions for our basic types :


static inline U8 S32ToU8Clamp(S32 i)    { return (U8) CLAMP(i,0,0xFF); }
static inline U8 S32ToU8Check(S32 i)    { ASSERT( i == (S32)S32ToU8Clamp(i) ); return (U8)i; }

static inline U16 S32ToU16Clamp(S32 i)  { return (U16) CLAMP(i,0,0xFFFF); }
static inline U16 S32ToU16Check(S32 i)  { ASSERT( i == (S32)S32ToU16Clamp(i) ); return (U16)i; }

static inline U32 S64ToU32Clamp(S64 i)  { return (U32) CLAMP(i,0,0xFFFFFFFFUL); }
static inline U32 S64ToU32Check(S64 i)  { ASSERT( i == (S64)S64ToU32Clamp(i) ); return (U32)i; }

static inline U8 U32ToU8Clamp(U32 i)    { return (U8) CLAMP(i,0,0xFF); }
static inline U8 U32ToU8Check(U32 i)    { ASSERT( i == (U32)U32ToU8Clamp(i) ); return (U8)i; }

static inline U16 U32ToU16Clamp(U32 i)  { return (U16) CLAMP(i,0,0xFFFF); }
static inline U16 U32ToU16Check(U32 i)  { ASSERT( i == (U32)U32ToU16Clamp(i) ); return (U16)i; }

static inline U32 U64ToU32Clamp(U64 i)  { return (U32) CLAMP(i,0,0xFFFFFFFFUL); }
static inline U32 U64ToU32Check(U64 i)  { ASSERT( i == (U64)U64ToU32Clamp(i) ); return (U32)i; }

static inline S32 U64ToS32Check(U64 i)  { S32 ret = (S32)i; ASSERT( (U64)ret == i ); return ret; }
static inline S32 S64ToS32Check(S64 i)  { S32 ret = (S32)i; ASSERT( (S64)ret == i ); return ret; }

which is sort of marginally okay. Maybe it would be nicer if I left off the type it was casting from in the name.


06-16-11 | Optimal Halve for Doubling Filter

I've touched on this topic several times in the past . I'm going to wrap up a loose end.

Say you have some given linear doubling filter (linear in the operator sense, not that it's a line). You wish to halve your image in the best way such that the round trip has minimum error.

For a given discrete doubling filter (non-interpolating) find the optimal halving filter that minimizes L2 error. I did it numerically, not analytically, and measured the actual error of down->up vs. original on a large test set.

I generated halving filters for half-widths of 3,4, and 5. Large filters always produce lower error, but also more ringing, so you may not want the largest width halving filter.


upfilter :  linear  :
const float c_filter[4] = { 0.12500, 0.37500, 0.37500, 0.12500 };

 downFilter : 
const float c_filter[6] = { -0.15431, 0.00162, 0.65269, 0.65269, 0.00162, -0.15431 };
fit err = 17549.328

 downFilter : 
const float c_filter[8] = { 0.05429, -0.21038, -0.01115, 0.66724, 0.66724, -0.01115, -0.21038, 0.05429 };
fit err = 17238.310

 downFilter : 
const float c_filter[10] = { 0.05159, 0.00138, -0.21656, -0.00044, 0.66402, 0.66402, -0.00044, -0.21656, 0.00138, 0.05159 };
fit err = 16959.596

upfilter :  mitchell1  :
const float c_filter[8] = { -0.00738, -0.01172, 0.12804, 0.39106, 0.39106, 0.12804, -0.01172, -0.00738 };

 downFilter : 
const float c_filter[6] = { -0.13475, 0.02119, 0.61356, 0.61356, 0.02119, -0.13475 };
fit err = 17496.548

 downFilter : 
const float c_filter[8] = { 0.05595, -0.19268, 0.00985, 0.62688, 0.62688, 0.00985, -0.19268, 0.05595 };
fit err = 17131.069

 downFilter : 
const float c_filter[10] = { 0.05239, 0.00209, -0.19664, 0.01838, 0.62379, 0.62379, 0.01838, -0.19664, 0.00209, 0.05239 };
fit err = 16811.168

upfilter :  lanczos4  :
const float c_filter[8] = { -0.00886, -0.04194, 0.11650, 0.43430, 0.43430, 0.11650, -0.04194, -0.00886 };

 downFilter : 
const float c_filter[6] = { -0.09637, 0.05186, 0.54451, 0.54451, 0.05186, -0.09637 };
fit err = 17332.452

 downFilter : 
const float c_filter[8] = { 0.04290, -0.14122, 0.04980, 0.54852, 0.54852, 0.04980, -0.14122, 0.04290 };
fit err = 17054.006

 downFilter : 
const float c_filter[10] = { 0.03596, 0.00584, -0.13995, 0.05130, 0.54685, 0.54685, 0.05130, -0.13995, 0.00584, 0.03596 };
fit err = 16863.054

upfilter :  lanczos5  :
const float c_filter[10] = { 0.00551, -0.02384, -0.05777, 0.12982, 0.44628, 0.44628, 0.12982, -0.05777, -0.02384, 0.00551 };

 downFilter : 
const float c_filter[6] = { -0.08614, 0.07057, 0.51557, 0.51557, 0.07057, -0.08614 };
fit err = 17323.692

 downFilter : 
const float c_filter[8] = { 0.05112, -0.13959, 0.06782, 0.52065, 0.52065, 0.06782, -0.13959, 0.05112 };
fit err = 16899.712

 downFilter : 
const float c_filter[10] = { 0.04554, 0.00403, -0.13655, 0.06840, 0.51857, 0.51857, 0.06840, -0.13655, 0.00403, 0.04554 };
fit err = 16566.352

------------------------------


06-14-11 | A simple allocator

You want to be able to allocate slots, free slots, and iterate on the allocated slot indexes. In particular :


int AllocateSlot( allocator );
void FreeSlot( allocator , int slot );
int GetNextSlot( iterator );

Say you can limit the maximum number of allocations to 32 or 64, then obviously you should use bit operations. But you also want to avoid variable shifts. What do you do ?

Something like this :


static int BottomBitIndex( register U32 val )
{
    ASSERT( val != 0 );
    #ifdef _MSC_VER
    unsigned long b = 0;
    _BitScanForward( &b, val );
    return (int)b;
    #elif defined(__GNUC__)
    return __builtin_ctz(val); // ctz , not clz
    #else
    #error need bottom bit index
    #endif
}

int __forceinline AllocateSlot( U32 & set )
{
    U32 inverse = ~set;
    ASSERT( inverse != 0 ); // no slots free!
    int index = BottomBitIndex(inverse);
    U32 mask = inverse & (-inverse);
    ASSERT( mask == (1UL<<index) );
    set |= mask;
    return index;
}

void __forceinline FreeSlot( U32 & set, int slot )
{
    ASSERT( set & (1UL<<slot) );
    set ^= (1UL<<slot);
}

int __forceinline GetNextSlot( U32 & set )
{
    ASSERT( set != 0 );
    int slot = BottomBitIndex(set);
    // turn off bottom bit :
    set = set & (set-1);
    return slot;
}

/*

// example iteration :

    U32 cur = set;
    while(cur)
    {
        int i = GetNextSlot(cur);
        lprintfvar(i);
    }

*/

However, this uses the bottom bit index, which is not as portably fast as using the top bit index (aka count leading zeros). (there are some platforms/gcc versions where builtin_ctz does not exist at all, and others where it exists but is not fast because there's no direct instruction set correspondence).

So, the straightforward version that uses shifts and clz is probably better in practice.

ADDENDUM : Duh, version of same using only TopBitIndex and no variable shifts :


U32 __forceinline AllocateSlotMask( U32 & set )
{
    ASSERT( (set+1) != 0 ); // no slots free!
    U32 mask = (~set) & (set+1); // find lowest off bit
    set |= mask;
    return mask;
}

void __forceinline FreeSlotMask( U32 & set, U32 mask )
{
    ASSERT( set & mask );
    set ^= mask;
}

U32 __forceinline GetNextSlotMask( U32 & set )
{
    ASSERT( set != 0 ); // iteration over when set == 0
    U32 mask = set & (-set); // lowest on bit
    set ^= mask;
    return mask;
}

int __forceinline MaskToSlot( U32 mask )
{
    int slot = TopBitIndex(mask);
    ASSERT( mask == (1UL<<slot) );
    return slot;
}

(note the forceinline is important because the use of actual references is catastrophic on many platforms (due to LHS), we need these to get compiled like macros).
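
(TopBitIndex used by MaskToSlot isn't shown; a minimal sketch in the same style as BottomBitIndex, again assuming MSVC or GCC :)

static int TopBitIndex( register U32 val )
{
    ASSERT( val != 0 );
    #ifdef _MSC_VER
    unsigned long b = 0;
    _BitScanReverse( &b, val );
    return (int)b;
    #elif defined(__GNUC__)
    return 31 - __builtin_clz(val); // clz counts zeros from the top, so flip it
    #else
    #error need top bit index
    #endif
}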


06-14-11 | How to do input for video games

1. Read all input in one spot. Do not scatter input reading all over the game. Read it into global state which then applies for the time slice of the current frame. The rest of the game code can then ask "is this key down" or "was this pressed" and it just checks the cached state, not the hardware.

2. Respond to input immediately. Generally what that means is you should have a linear sequence of events that is something like this :

Poll input
Do actions triggered by input (eg. fire bullets)
Do time evolution of player-action objects (eg. move bullets)
Do environment responses (eg. did bullets hit monsters?)
Render frame
(* see later)

3. On a PC you have to deal with the issue of losing focus, or pausing and resuming. This is pretty easy to get correct if you obeyed #1 - read all your input in one spot, it just zeros the input state while you are out of focus. The best way to resume is when you regain focus you immediately query all your input channels to wipe any "new key down" flags, but just discard all the results. I find a lot of badly written apps that either lose the first real key press, or incorrectly respond to previous app's keys when they didn't have focus.

( For example I have keys like ctrl-alt-q that toggle focus around for me, and badly written apps will respond to that "q" as if it were for them, because they just ask for the global "new key down" state and they see a Q that wasn't there the last time they checked. )
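
A sketch of the resume path (all of these function names are hypothetical, the point is just the order of operations) :

void OnRegainFocus()
{
    // poll every input channel once so the per-key "newly pressed" edges get computed ...
    PollAllInputIntoGlobalState();

    // ... then throw those edges away, so keys pressed while unfocused don't trigger actions
    ClearNewKeyDownFlags();
}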

4. Use a remapping/abstraction layer. Don't put actual physical button/keys all around your app. Even if you are sure that you don't want to provide remapping, do it anyway, because it's useful for you as a developer. That is, in your player shooting code don't write

  if ( NewButtonDown(X) ) ...
instead write
  if ( NewButtonDown(BUTTON_SHOOT) ) ...
and have a layer that remaps BUTTON_SHOOT to a physical key. The remap can also do things like taps vs holds, combos, sequences, etc. so all that is hidden from the higher level and you are free to easily change it at a later date.

This is obvious for real games, but it's true even for test apps, because you can use the remapping layer to log your key operations and provide help and such.
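
A minimal sketch of such a layer (the enum, the table, and the physical-key query are all hypothetical names) :

enum EButton { BUTTON_SHOOT, BUTTON_JUMP, BUTTON_COUNT };

// filled from defaults or from a user remapping config :
static int s_buttonToPhysicalKey[BUTTON_COUNT] = { /* KEY_X, KEY_A, ... */ };

bool NewButtonDown( EButton b )
{
    // NewPhysicalKeyDown reads the cached per-frame input state from point 1
    return NewPhysicalKeyDown( s_buttonToPhysicalKey[b] );
}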

(*) extra note on frame order processing.

I believe there are two okay frame sequences and I'm not sure there's a strong argument in one way or the other :


Method 1 :

Time evolve all non-player game objects
Prepare draw buffers for non-player game objects
Get input
Player responds to input
Player-actions interact with world
Prepare draw buffers for player & stuff just made
Kick render buffers

Method 2 :

Get input
Player responds to input
Player-actions interact with world
Time evolve all non-player game objects
Prepare draw buffers for player & stuff just made
Prepare draw buffers for non-player game objects
Kick render buffers

The advantage of Method 1 is that the time between "get input" and "kick render" is absolutely minimized (it's reduced by the amount of time that it takes you to process the non-player world), so if you press a button that makes an explosion, you see it as soon as possible. The disadvantage is that the monsters you are shooting have moved before you do input. But, there's actually a bunch of latency between "kick render" and getting to your eye anyway, so the monsters are *always* ahead of where you think they are, so I think Method 1 is preferable. Another disadvantage of Method 1 is that the monsters essentially "get the jump on you" eg. if they are swinging a club at you, they get to do that before your "block" button reaction is processed. This could be fixed by doing something like :

Method 3 :

Time evolve all non-player game objects (except interactions with player)
Prepare draw buffers for non-player game objects
Get input
Player responds to input
Player-actions interact with world
Non-player objects interact with player
Prepare draw buffers for player & stuff just made
Kick render buffers

this is very intentionally not "fair" between the player and the rest of the world, we want the player to basically win the initiative roll all the time.

Some game devs have this silly idea that all the physics needs to be time-evolved in one atomic step which is absurd. You can of course time evolve all the non-player stuff first to get that done with, and then evolve the player next.


06-14-11 | ProcessSuicide

The god damn lagarith DLL has some crash in its shutdown, so any time I play an AVI with an app that uses lagarith, it hangs on exit.

(this is one of the reasons that I need to write my own lossless video format; the other reason is that lagarith can't play back at 30 fps even on ridiculously fast modern machines; and the other standard, HuffYUV, frequently crashes for me and is very hard to make support RGB correctly)

Anyhoo, I started using this to shut down my app, which doesn't have the stupid "wait forever for hung DLL's to unload" problem :


void ProcessSuicide()
{
    DWORD myPID = GetCurrentProcessId();

    lprintf("ProcessSuicide PID : %d\n",myPID);

    HANDLE hProcess = OpenProcess (PROCESS_ALL_ACCESS, FALSE, myPID); 
        
    if ( hProcess == NULL ) // OpenProcess returns NULL on failure, not INVALID_HANDLE_VALUE
    {
        lprintf("Couldn't open my own process!\n");
        // ?? should probably do something else here, but never happens
        return;
    }
        
    TerminateProcess(hProcess,0);
    
    // ... ?? ... should not get here
    ASSERT(false);
    
    CloseHandle (hProcess);
}

At first I thought this was a horrible hack, but I've been using it for months now and it doesn't cause any problems, so I'm sort of tempted to call it not a hack but rather just a nice way to quit your app in Windows and not ever get that stupid thing where an app hangs in shutdown (which is a common problem for big apps like MSDev and Firefox).


06-11-11 | God damn YUV

So I've noticed for the last year or so that x264 videos I was making as test/reference all had weirdly shifted brightness values. I couldn't figure out why exactly and forgot about it.

Yesterday I finally adapted my Y4M converter (which does AVI <-> Yuv4MPEG with RGB <-> YUV color conversion and up/down sample, and uses various good methods of YUV, such as out of gamut chroma spill, lsqr optimized conversion, etc.). I added support for the "rec601" (JPEG) and "bt709" (HDTV) versions of YUV (and by "YUV" I mean YCbCr in gamma-encoded space), with both 0-255 and 16-235 range support. I figured I would stress test it by trying to use it in place of ffmpeg in my h264 pipeline for the Y4M conversion. And I found the old brightness problem.

It turns out that when I make an x264 encode and then play it back through DirectShow (with ffdshow), the player is using the "BT 709" yuv matrix (in 16-235 range) (*). When I use MPlayer to play it back and write out frames, it's using the "rec 601" yuv matrix (in 16-235 range).

(*
this appears to be because there's nothing specified in the stream and ffdshow will pick the matrix based on the resolution of the video - so that will super fuck you, depending on the size of the video you need to pick a different matrix (it's trying to do the right thing for HDTV vs SDTV standard video). Their heuristic is :

width > 1024 or height >= 600: BT.709
width <=1024 and height < 600: BT.601
*)

(in theory x264 doesn't do anything to the YUV planes - I provide it y4m, and it just works on yuv as bytes that it doesn't know anything about; the problem is the decoders which are doing their own thing).

The way I'm doing it now is I make the Y4M myself in rec601 space, let x264 encode it, then extract frames with mplayer (which seems to always use 601 regardless of resolution). If there was a way to get the Y4M directly out of x264 that would make it much easier because I could just do my own yuv->rgb (the only way I've found to do this is to use ffmpeg raw output).

Unfortunately Y4M itself doesn't seem to have any standardized tag to indicate what kind of yuv data is in the container. I've made up my own ; I write an xtag that contains :


yuv=pc.601
yuv=pc.709
yuv=bt.601
yuv=bt.709

where "bt" implies 16-235 luma (16-240 chroma) and "pc" implies 0-255 (fullrange).

x264 has a bunch of --colormatrix options to tag the color space in the H264 stream, but apparently many players don't respect it, so the recommended practice is to use the color space that matches your resolution (eg. 709 for HD and 601 for SD). (the --colormatrix options you want are bt709 and bt470bg , I believe).

Some notes by other people :


TV capture "SD" mpeg2 720x576i -> same res in mpe4, so use --colormatrix bt601 --fullrange ?
TV capture "HD" mpeg2 1440x1080i -> same res in mpe4, so use --colormatrix bt709 --fullrange ?

look at table E-3 (Colour Primaries) in the H.264 spec:

bt470bg = bt601 625 = bt1358 625 = bt1700 625 (PAL/SECAM)
smpte170m = bt601 525 = bt1358 525 = bt1700 NTSC

(yes, PAL and NTSC have different bt601 matrices here)

yup there's only:
--colormatrix <string> Specify color matrix setting ["undef"]
- undef, bt709, fcc, bt470bg, smpte170m, smpte240m, GBR, YCgCo

ADDENDUM : god damn the color matrix change in bt.709 is so retarded. While in theory the phosphors of HDTVs match 709 better than 601, that is actually pretty irrelevant, since YCbCr is run in gamma-corrected space, and we do the chroma sub-sample, and so on ( see Mag of nonconst luminance error - Charles Poynton ). The actual practical effect of the 709 new matrix is that we're watching lots of videos with badly shifted brightness and saturation. In reality, it just made video quality much much worse.
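
(For reference, the two luma weightings; these constants are straight from the specs, and the chroma rows of each matrix follow from them :)

rec.601 : Y' = 0.299  R' + 0.587  G' + 0.114  B'
bt.709  : Y' = 0.2126 R' + 0.7152 G' + 0.0722 B'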

(I also don't understand the 16-235 range that was used in MPEG. Yeah yeah, NTSC needs the top and bottom of the signal for special codes, fine, but why does that have to be hard-coded into the digital signal? The special region at top and bottom is an *analog* thing. The video could have been full range 0-255 in the digital encoding, and then in the DAC output you just squish it into the middle 7/8 of the signal band. Maybe there's something going on that I don't understand, but it just seems like terrible software engineering design to take the weird quirk of one system (NTSC analog output) and push that quirk back up the pipeline to affect something (digital encoding format) that it doesn't need to).


06-08-11 | Tech Todos

Publicly getting my thoughts together :

1. Oodle. Just finish it! God damn it.

2. JPEG decoder. I got really close to having this done, need to finish it. The main thing left that I want to do is work on the edge-adaptive-bilateral filter a bit more; currently it's a bit too strong on the non-artifact areas, I think I can make it more selective about only working on the ringing and blockiness. The other things I want are chroma-from-luma support and a special mode for text/graphics.

3. Byte-wise LZ with optimal parse. This has been on my list for a long time. I'm not really super motivated though. But it goes along with -

4. LZ string matcher test. Hash tables, hash->list, hash->bintree, hash->MMC, suffix trees, suffix arrays, patricia tries, etc. ? Would be nice to make a thorough test bed for this. (would also help the Oodle LZ encoder which is currently a bit slow due to me not spending any time on the string matcher).

5. Cuckoo hash / cache aware hash ; I did a bunch of hash testing a while ago and want to add this to my tests. I'm very curious about it, but this is kind of pointless.

6. Image doubler / denoiser / etc ; mmm meh I've lost my motivation for this. It's a big project and I have too many other things to do.

7. Make an indy game. Sometimes I get the craving to do something interactive/artistic. I miss being able to play with my work. (I also get the craving to make a "demo" which would be fun and is rather less pain in the butt than making a full game). Anyhoo, probably not gonna happen, since there's just not enough time for this.

ADDENDUM : some I forgot :

8. Finish my video codec ; I still want to redo my back end coder which was really never intended for video; maybe support multiple sizes of blocks; try some more perceptual metrics for encoder decision making; various other ideas.

9. New lossy image codec ; I have an unfinished one that I did for RAD, but I think I should just scrap it and do the new one. I'm interested in directional DCT. Also simple highly asymmetric schemes, such as static classes that are encoder-optimized (instead of adaptive models; adaptive models are very bad for LHS). More generally, I have some ideas about trying to make a codec that is more explicitly perceptual, it might be terrible in rmse, but look better to the human eye; one part of that is using the imdiff metrics I trained earlier, another part is block classification (smooth,edge,detail) and special coders per class.


06-08-11 | Pots 4

Last bunch for a while since I'm taking the summer off.

Large round bowl; rim is oribe, but not just applied over base glaze, I left some space bare at the rim so it would get better grip. The dark pattern on the side is kind of interesting. I dug grooves by chattering with a cutting tool, then applied an iron slip to the outside of the pot, then after it dried sanded away the slip. The result was just iron left in the chatter holes. Then the pot is glazed with yellow salt, which reacts with iron by darkening. So the outside is smooth but shows the cut grooves as dark spots.

Vase with triangular opening; Oribe glaze, did something weird that I didn't do on purpose at all, not sure how this happened, and I love that unpredictability :

Cylindrical vase ; Shino top oribe bottom ; top is dipped, then I poured on some extra layers of shino that create the white patterns where it's thick. The middle band is cobalt stain which I waxed over before glazing to create the clean boundaries; cobalt without a stain over it turns black in firing, if you put a clear over it, it would be brilliant blue.

Attempt at more geometric forms; meh

Medium size bowl; very round but a bit heavy. There's a ring of unglazed cobalt around the rim on this too; that was a mistake, I should have put clear over it to bring out the blue, you can only see a tiny band of what the blue would have been like. Oribe rim and a little pour of it into the bottom, just pour a tiny bit so you don't have to pour out excess.

Medium size bowl. Inside yellow salt, outside lung chuan. I did the outside by doing first a very thin watered down glaze layer (it's almost invisible but takes off the naked harshness of bare clay; I thinned a bit too much, just a tiny dash of water goes a long way). Then I dripped glaze while the pot spun to create very lumpy thick random application, sort of like sand castles. I think there's potential in this technique, will explore it further.

Small bowl; glaze base lung chuan, oribe splatter application by flicking a paint brush loaded with glaze. Not bad. Maybe stain splatter under clear glaze would be better.


06-07-11 | How to read an LZ compressed file

An example of the kind of shite I'm doing in Oodle these days.

You have an LZ-compressed file on disk that you want to get decompressed into memory as fast as possible. How do you do this ?

Well, first of all, you make your compressor write in independent chunks so that the decompressor can run on multiple chunks at the same time with threads. But to start you need to know where the chunks are in the file, so the first step is :


1. Fire an async read of the first 64k of the file to get the header.

the header will tell you where all the independent chunks are. (aside : in final Oodle there may also be an option to agglomerate all the headers of all the files, so you may already have this first 64k in memory).

So after that async read is finished, you want to fire a bunch of decomps on the chunks, so the way to do this is :


2. Make a "Worklet" (async function callback) which parses the header ; set the Worklet to run when the IO op #1
finishes.

I used to do this by having the WorkMgr get a signal from IO thread (which still happens) but I now also have a mechanism to just run Worklets directly on the IO thread, which is preferable for Worklets that are super trivial like this one.

Now, if the file is small you could just have your Worklet #2 read the rest of the file and then fire async works on each one, but if the file is large that means you are waiting a long time for the IO before you start any decomp work, so that's not ideal, instead what we do is :


3. In Worklet #2, after parsing header, fire an async IO for each independent compressed chunk.  For each chunk, create
a decompression Worklet which is dependent on the IO of that chunk (and also neighbors, since due to IO
sector alignment the compression boundaries and IO boundaries are not quite the same).

So what this will do is start a bunch of IO's that then retire one by one, as each one retires it starts up the decomp task for that chunk. This means you start decompressing almost immediately and for large files you keep the CPU and IO busy the whole time.

Finally the main thread needs a way to wait for this all to be done. But the handles to the actual decompression async tasks don't exist until async task #2 runs, so the main thread can't wait on them directly. Instead :


4. At the time of initial firing (#1), create an abstract waitable handle and set it to "pending" state; then
pass this handle through your async chain.  Task #2 should set it to needing "N to go", since it's the first
point that knows the count, and then the actual async decompresses in #3 should decrement that counter.  So
the main thread can wait on it being "0 to go".

You can think of this as a semaphore, though in practice I don't use a semaphore because there are some OS's where that's not possible (sadly).
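
A sketch of the counter itself on Win32 (names are hypothetical, and the real thing also has to wake anyone blocked on it when the count reaches zero) :

typedef struct { volatile LONG count; } AsyncCounter;

void Counter_SetPending( AsyncCounter * c )         { c->count = -1; }                      // at initial firing (#1)
void Counter_SetNToGo( AsyncCounter * c, LONG n )   { c->count = n; }                       // task #2, after parsing header
void Counter_OneDone( AsyncCounter * c )            { InterlockedDecrement( &c->count ); }  // each decompress in #3
BOOL Counter_IsDone( const AsyncCounter * c )       { return c->count == 0; }               // what the main thread waits on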

What the client sees is just :


AsyncHandle h = OodleLZ_StartDecompress( fileName );

Async_IsPending(h); ?

Async_Block(h);

void * OodleLZ_GetFinishedDecompress( h );

if they just want to wait on the whole thing being done. But if you're going to parse the decompressed file, it's more efficient to only wait on the first chunk being decompressed, then parse that chunk, then wait on the next chunk, etc. So you need an alternate API that hands back a bunch of handles, and then a streaming File API that does the waiting for you.
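
(What that streaming path might look like to the client; these calls are hypothetical, not the actual Oodle API :)

AsyncHandle * chunks;
int numChunks;
OodleLZ_StartDecompress_PerChunk( fileName, &chunks, &numChunks );

for(int i=0;i<numChunks;i++)
{
    Async_Block( chunks[i] );                        // wait only for chunk i
    void * mem = OodleLZ_GetFinishedChunk( chunks[i] );
    ParseChunk( mem );                               // overlaps with decompression of chunk i+1
}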


06-04-11 | Keep Case

I've been meaning to do this for a long time and finally got off my ass.

TR (text replace) and zren (rename) in ChukSH now support "keep case".

Keep case is pretty much what you always want when you do text replacement (especially in source code), and everybody should copy me. For example when I do a find-replace from "lzp1f" -> "lzp1g" what I want is :


lzp1f -> lzp1g  (lower->lower)
LZP1F -> LZP1G  (upper->upper)
Lzp1f -> Lzp1g  (first cap->first cap)
Lzp1G -> Lzp1G  (mixed -> mixed)

The kernel that does this is matchpat in cblib which will handle rename masks like : "poop*face" -> "shit*butt" with keep case option or not.

In a mixed-wild-literal renaming spec like that, the "keep case" applies only to the literal parts. That is, "poop -> shit" and "face -> butt" will be applied with keep-case independently , the "*" part will just get copied.

eg :


Poop3your3FACE -> Shit3your3BUTT

Also, because keep-case is applied to an entire chunk of literals, it can behave somewhat unexpectedly on file renames. For example if you rename

src\lzp* -> src\zzh*

the keep-case will apply to the whole chunk "src\lzp" , so if you have a file like "src\LZP" that will be considered "mixed case" not "all upper". Sometimes my intuition expects the rename to work on the file part, not the full path. (todo : add an option to separate the case-keeping units by path delims)

The way I handle "mixed case" is I leave it up to the user to provide the mixed case version they want. It's pretty impossible to get it right automatically. So the replacement text should be provided in the ideal mixed case capitalization. eg. to change "HelpSystem" to "QueryManager" you need to give me "QueryManager" as the target string, capitalized that way. All mixed case source occurrences of "HelpSystem" will be changed to the same output, eg.


helpsystem -> querymanager
HELPSYSTEM -> QUERYMANAGER
Helpsystem -> Querymanager
HelpSystem -> QueryManager
HelpsYstem -> QueryManager
heLpsYsTem -> QueryManager
HeLPSYsteM -> QueryManager

you get it.

The code is trivial of course, but here it is for your copy-pasting pleasure. I want this in my dev-studio find/replace-in-files please !


// strcpy "putString" to "into"
//  but change its case to match the case in src
// putString should be mixed case , the way you want it to be if src is mixed case
void strcpyKeepCase(
        char * into,
        const char * putString,
        const char * src,
        int srcLen);

void strcpyKeepCase(
        char * into,
        const char * putString,
        const char * src,
        int srcLen)
{   
    // okay, I have a match
    // what type of case is "src"
    //  all lower
    //  all upper
    //  first upper
    //  mixed
    
    int numLower = 0;
    int numUpper = 0;
    
    for(int i=0;i<srcLen;i++)
    {
        ASSERT( src[i] != 0 );
        if ( isalpha(src[i]) )
        {
            if ( isupper(src[i]) ) numUpper++;
            else numLower++;
        }
    }
    
    // non-alpha :
    if ( numLower+numUpper == 0 )
    {
        strcpy(into,putString);
    }
    else if ( numLower == 0 )
    {
        // all upper :
        while( *putString )
        {
            *into++ = toupper( *putString ); putString++;
        }
        *into = 0;
    }
    else if ( numUpper == 0 )
    {
        // all lower :
        while( *putString )
        {
            *into++ = tolower( *putString ); putString++;
        }
        *into = 0;
    }
    else if ( numUpper == 1 && isalpha(src[0]) && isupper(src[0]) )
    {
        // first upper then low
        
        if( *putString ) //&& isalpha(*putString) )
        {
            *into++ = toupper( *putString ); putString++;
        }
        while( *putString )
        {
            *into++ = tolower( *putString ); putString++;
        }
        *into = 0;
    }
    else
    {
    
        // just copy putString - it should be mixed 
        strcpy(into,putString);
    }
}
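
For example, exercising the three simple paths :

char out[64];

strcpyKeepCase( out, "lzp1g", "LZP1F", 5 );  // src all upper  -> out = "LZP1G"
strcpyKeepCase( out, "lzp1g", "Lzp1f", 5 );  // first cap      -> out = "Lzp1g"
strcpyKeepCase( out, "lzp1g", "lzp1f", 5 );  // all lower      -> out = "lzp1g"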


ADDENDUM : on a roll with knocking off stuff I've been meaning to do for a while ...

ChukSH now also contains "fixhtmlpre.exe" which fixes any less-than signs that are found within a PRE chunk.

Hmm .. something lingering annoying going on here. Does blogger convert and-l-t into less thans?

ADDENDUM : yes it does. Oh my god the web is so fucked. I've been doing a bit of reading and it appears this is a common and atrocious hack. Basically the problem is that people use XML for the markup of the data transfer packets. Then they want to send XML within those packets. So you have to form some shit like :


<data> I want to send <B> this </B> god help me </data>

but putting the less-thans inside the data packet is illegal XML (it's supposed to be plain text), so instead they send

<data> I want to send &-l-tB> this &-l-t/B> god help me </data>

but they want the receiver to see a less-than, not the characters &-l-t , so the receiver parses those codes back into less-than and then treats the data received as its own hunk of XML with internal markups.

Basically people use it as a way to send codes that the current parser will ignore, but the next parser will see. There are lots of pages about how this is against compliance standards but nobody cares and it seems to be widespread.

So anyway, the conclusion is : just changing less thans to &-l-t works fine if you are just posting html (eg. for rants.html it works fine) but for sending to Blogger (or probably any other modern XML-based app) it doesn't.

The method I use now which seems to work on Blogger is I convert less thans to


<code><</code>

How is there not a fucking "literal" tag ? (There is one called XMP but it's deprecated and causes line breaks, and it's really not just a literal tag around a bunch of characters, it's a browser format mode change)

