10-22-10 | Some notes on Chroma Sampling

First some motivation before we dig into this :

Take a bmp, convert it to YUV 420 (specifically YCbCr 601) with downsampled chroma, upsample and convert back to RGB, and measure RMSE. (Yes, I know RGB RMSE is not really what you want; more on this later.)

Testing "my_soup" :

ffmpeg bmp -> y4m -> bmp , default options :
rmse : 5.7906 

ffmpeg -sws_flags +accurate_rnd+full_chroma_int+full_chroma_inp
rmse : 2.3052

my_y4m : 

RGB -> chroma then box :
rmse : 3.3310

box then RGB -> chroma :
rmse : 3.3310

box rgb FindBestY :
rmse : 3.1129

lsqr solved YCbCr for box :
rmse : 3.1129

box rgb (decoder spill) :
rmse : 3.0603

linear-down linear-up :
rmse : 2.7562

down-for-linear-up :
rmse : 2.0329

down-for-linear-up _RGB FindBestY :
rmse : 1.8951

solve lsqr for linear up :
//float solution rmse : 1.6250
rmse : 1.7400

Clearly there is a lot of win to be had from good chroma sampling. Let's talk a bit about what's going on.

(BTW the ffmpeg results don't directly compare to my_y4m, they're just there for reference and sanity check; for example I use symmetric-centered (JPEG style) 420 subsampling and I think they use offset MPEG-style 420 subsampling ; I'm also still not sure if they are in 16-235 or 0-255 , I am definitely in 0-255 ).

First of all just in terms of basic filtering, you might do your operation like this :

"encode" :

RGB -> floats
float RGB -> matrix multiply -> YUV
UV plane -> filters -> downsampled
quantize & clamp to [0,255] ints
transmit YUV 420

"decode" :

UV plane -> floats -> filters -> upsampled
YUV -> matrix multiply -> float RGB
quantize & clamp to [0,255] ints
write RGB bmp
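The matrix and quantize steps of that pipeline can be sketched like this. It is a minimal sketch, assuming the full-range (0-255) BT.601 constants used by JFIF/JPEG; the names `Rgb`, `YCbCr`, `RgbToYCbCr601`, `YCbCr601ToRgb` and `QuantizeClamp` are made up for illustration and are not from my_y4m.

```cpp
#include <algorithm>
#include <cmath>

struct Rgb   { double r, g, b; };
struct YCbCr { double y, cb, cr; };

// Full-range (0-255) BT.601 forward transform, the JFIF/JPEG flavor.
YCbCr RgbToYCbCr601(const Rgb& c)
{
    YCbCr o;
    o.y  =         0.299    * c.r + 0.587    * c.g + 0.114    * c.b;
    o.cb = 128.0 - 0.168736 * c.r - 0.331264 * c.g + 0.5      * c.b;
    o.cr = 128.0 + 0.5      * c.r - 0.418688 * c.g - 0.081312 * c.b;
    return o;
}

// Matching inverse. The result is NOT clamped, so an out-of-gamut
// YCbCr shows up as an RGB component outside [0,255].
Rgb YCbCr601ToRgb(const YCbCr& c)
{
    Rgb o;
    o.r = c.y + 1.402 * (c.cr - 128.0);
    o.g = c.y - 0.344136 * (c.cb - 128.0) - 0.714136 * (c.cr - 128.0);
    o.b = c.y + 1.772 * (c.cb - 128.0);
    return o;
}

// The "quantize & clamp to [0,255] ints" step.
int QuantizeClamp(double v)
{
    const long q = std::lround(v);
    return int(std::min(255L, std::max(0L, q)));
}
```

The chroma down/up filters sit between these two functions; everything interesting in the rest of this post is about what happens there.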

So, you might think that's a reasonable way to go, and try various filters. I did experiments on this before (see this blog post and in particular the comment at the end). I found that fancy filters don't really help much, that the best thing is just doing bilinear reconstruction and a special optimized downsample filter that I call "down for bilinear up".

But once you think about it, there's no reason that we should follow this particular process for making our YUV. In particular, downsampling in chroma space does some very weird things that you might not expect. A lot of the weirdness comes from the fact that the RGB we care about is clamped in the range [0-255]. And our advantage is that we have Y at higher resolution.

So let's start looking at better ways to do chroma sampling.

Our first good link is this : Twibright Luminaplex and Hyperluma

His writing is very weird, so I'll paraphrase briefly. One of the problems with chroma subsampling is that it's not light linear. eg. averaging Cb and Cr does not produce a resulting color which is the average of what your eye will see. Instead of subsampling CbCr , you should instead solve for the YCbCr which produces the light-linear color that you would see for the average of those 2x2 chromas. The easiest way to do this is just to subsample CbCr in some way, and then instead of computing Y from the original RGB, you turn it into a solve to find the Y, given CbCr, that produces the best result.

The next good idea from the "Twibright" page is just to abandon the idea of computing the YCbCr from a matrix in general. We know what the decoder will do to upsample, so instead we just take the idea that our encoder should output the coefficients which will make the best result after upsampling. In the "Hyperluma" algorithms, he sets his goal as preserving constant actual luma ( luma is not "Y" , it's actual perceived brightness). Basically he does the chroma subsample using RGB, and then given the low-res chroma and the original high-res RGB, solve for the Y that gives you the correct luma.

The "Luminaplex" algorithm takes this one step further and does a brute force optimization. In the end with compression if you want absolute optimality it always comes down to this - wiggle the discrete values, see what the output is, and take the best. (we saw this with DXT1 as well).

Implementing these ideas gives me the "down-for-linear-up _RGB FindBestY" results above.
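A minimal sketch of the "find the best Y given the chroma" step, done the brute force way. This is an assumption about the shape of the idea, not the actual FindBestY code: it takes the integer chroma the decoder will see at this pixel (after upsampling), tries all 256 Y values through the full-range 601 decode with round-and-clamp, and keeps the one with the smallest squared RGB error. Twibright targets perceived luma instead; the RGB error here is my simplification. `ToByte` is a hypothetical helper.

```cpp
#include <climits>
#include <cmath>

// Round to nearest and clamp to [0,255], like the decoder's final step.
static int ToByte(double v)
{
    const long q = std::lround(v);
    return int(q < 0 ? 0 : (q > 255 ? 255 : q));
}

// Given the target RGB and the (already upsampled) integer chroma at this
// pixel, return the Y in [0,255] that minimizes squared RGB error after
// the decoder's round-and-clamp.
int FindBestY(int r, int g, int b, int cb, int cr)
{
    // Chroma contribution to each channel (full-range 601 inverse).
    const double dr =  1.402    * (cr - 128);
    const double dg = -0.344136 * (cb - 128) - 0.714136 * (cr - 128);
    const double db =  1.772    * (cb - 128);

    int  bestY   = 0;
    long bestErr = LONG_MAX;
    for (int y = 0; y <= 255; ++y)
    {
        const long er = ToByte(y + dr) - r;
        const long eg = ToByte(y + dg) - g;
        const long eb = ToByte(y + db) - b;
        const long err = er * er + eg * eg + eb * eb;
        if (err < bestErr) { bestErr = err; bestY = y; }
    }
    return bestY;
}
```

Because the clamp is inside the loop, this handles the cases where the closed-form "average of the three channel residuals" answer would be wrong.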

On the Twibright page he claims that he can implement Luminaplex for box upsampling and still get great results. I found that not to be true on most real images; you need at least bilinear upsampling. To solve for the optimal YCbCr for a given RGB target image, you have a large coupled linear system. That is, for each pixel, you have to consider the current CbCr and also the neighbors, and for those neighbors, you have to consider their neighbors. This is a sparse semi-diagonal linear system. In particular, you are solving a linear system like this :

find x to minimize | A x - b |

b = the target RGB values
 (b has 3 * # of pixels values)

x = the YCbCr values for the output
 w*h of Y, (w*h)/4 each of Cb and Cr

A = the YCbCr -> RGB matrix plus upsample filter
    3*w*h rows
    (3/2)*w*h columns

for each R and B row, there are 5 non-zero entries
  (1 for Y and 4 for Cr or Cb)
for each G row there are 9 non-zero entries
  (1 for Y and 4 for Cr and 4 for Cb)

You can solve this reasonably easily with a standard sparse matrix linear solver. Note that in this form we are directly solving for minimum RGB RMSE , but you can of course solve for other metrics (that are linear transformations of RGB anyway). The fours in the number of terms come from the four taps of the bilinear upsample filter; bigger filters have more non-zero terms and so make the matrix much fatter in memory and slower to solve.

If you implement this you get "solve lsqr for linear up" , in the results above. Note that this has the problem mentioned in the last post. I actually want to solve for discrete and clamped YCbCr, but it's too hard, so I just solve the equation as if they are continuous, and then round to the nearest int and clamp to 0-255. To improve this, I actually re-find Y from Chroma after the round-and-clamp. The loss from going discrete is this bit :

//float solution rmse : 1.6250
rmse : 1.7400

I believe that this is as good as you can do for a decoder which operates in the simple linear way. That is, up to now we have assumed that we have a generic decoder, so we can use these techniques to make optimal baseline-compatible JPEGs or H264 videos or whatever. But there are things we can do on the decode side as well, so let's look at that.

The most interesting link I've seen is this page by Glenn Chan : Towards Better Chroma Subsampling

Glenn makes the realization that many of the YCbCr produced by normal subsampling are actually impossible. That is, they can't come from any RGB in [0-255]. When you see an impossible YCbCr, you know it was caused by downsampling, which means you know that some chroma was put on your pixel that should have been on a neighbor's pixel. For the moment let's pretend we have box sampling. In a 2x2 block we have some black pixels and some red pixels. When you box downsample you will get the average of the red chroma (call it Cr = 1) and the black chroma (Cr = 0). When you box upsample you will get Cr = 0.5 over all four pixels. But now you have a pixel with Y = 0 and Cr = 0.5 ; the only way to make a zero Y but with some red chroma would be for it to have negative G and/or B. So this must be a mistake - when we see Y = 0 and Cr = 0.5, we know that the chroma on this pixel must have "spilled" onto us from our neighbor incorrectly. To fix it, we just take our unwanted Cr and push it over to our neighbor, and we get a perfect result - the Y = 0 gets Cr = 0 and is black, and the Cr = 0.5 red pixel gets pushed up to Cr = 1.
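The "impossible YCbCr" test is easy to sketch: decode without clamping and see if any channel leaves [0,255]. This assumes full-range 601 with chroma stored around 128 (so the Cr = 0.5 of the example corresponds to something like 128 + 64 here); `IsLegalYCbCr` and its `tol` slack for rounding noise are my own inventions, not Glenn's code.

```cpp
// True if (y, cb, cr) decodes to an RGB inside [0,255], give or take
// a small tolerance to allow for quantization of the YCbCr values.
bool IsLegalYCbCr(double y, double cb, double cr, double tol = 0.5)
{
    const double r = y + 1.402 * (cr - 128.0);
    const double g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0);
    const double b = y + 1.772 * (cb - 128.0);

    auto inRange = [tol](double v) { return v >= -tol && v <= 255.0 + tol; };
    return inRange(r) && inRange(g) && inRange(b);
}
```

For the black pixel that picked up red chroma, G comes out far below zero, so the pixel is flagged and its chroma is a candidate to spill back to the neighbor.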

Glenn works out how much chroma a pixel can hold for a given Y. One way to think about this is to think of the RGB->YCbCr as a rotation (+ shear and scale, but you can think of it as rotation for our purposes). You've taken the RGB axial box and have put a new box around it in a rotated space. To completely cover the range of the original box, we have to use a much bigger box in this new space. The result is a large amount of empty space in the YCbCr box which did not come from the original RGB box. Handling this better is a general problem for all compressors with color conversions - we often code YCbCr as if they have full range, but in fact after we have seen Y we know that the range for CbCr might be much smaller.

There's another way of getting the same result, which is to use the fact that we know our Y is more reliable than our CbCr. That is, use your YCbCr to reproduce RGB. Now see if the RGB are all in [0-255] , if they are you are fine. If not, you have to clamp them. Now recompute Y from RGB (something like 0.2R + 0.7G + 0.1B). Because of the clamping, this will now be wrong, eg. not match the transmitted Y. So what we are doing is ensuring that the Y of the output RGB is equal to the Y we transmitted. To achieve that, we adjust CbCr so that we are not clamping the RGB.

On some very bad cases, the win from the "spill" decoder is massive :

on "redlines" :
alternating vertical black and red lines :

ffmpeg default :
rmse : 144.9034

ffmpeg -sws_flags +accurate_rnd+full_chroma_int+full_chroma_inp
rmse : 94.7088

my_y4m box filter :
rmse : 101.9621

my_y4m bilinear :
rmse : 101.3658

my_y4m box spill :
rmse : 7.5732


The main limitation of Glenn's method is that it only helps when you are pushing pixels into illegal values. eg. the black next to red example above was helped enormously, but if it was instead grey next to red, then no illegal value would have been made and we would have done nothing. (eg. on my_soup it only gave us 3.1129 -> 3.0603)

The other problem with Glenn's method is that it is rather slow in the decoder, too slow for something like video (but certainly okay for a JPEG loader in Photoshop or something like that).

There are some simplifications/approximations of Glenn's method which would probably be viable in video.

One is to compute an approximate "chroma capacity" based on Y, and then for each 2x2 box upsample, instead of putting the chroma onto each pixel with equal weight, you do it weighted by the chroma capacity. Chroma capacity is a triangle function of Y, something like min( Y, 255-Y ). So a 2x2 box upsample adjusted for capacity is just :

given subsampled chroma Cr & Cb
and non-subsampled Y's (Y_0,1,2,3)

unadjusted box upsample is just :
Cr_n = Cr
Cb_n = Cb

adjusted box upsample is :

CC_n = min( Y_n, 255 - Y_n )
CC_n *= 4 / ( CC_0 + CC_1 + CC_2 + CC_3 )

Cr_n = CC_n * Cr
Cb_n = CC_n * Cb

(this is similar to his "proportion" method but much simplified). On the synthetic case of the red and black quad, this produces the same results as the more expensive method. On real images it's slightly worse.
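The adjusted box upsample above, written out as a sketch. It assumes the chroma values are signed and centered on zero (stored value minus 128), and it adds one thing the pseudocode does not cover: when all four capacities are zero (every Y is 0 or 255) it falls back to the plain box upsample. `Chroma` and `UpsampleByCapacity` are illustrative names.

```cpp
#include <algorithm>
#include <array>

struct Chroma { double cb, cr; };  // signed, i.e. stored value minus 128

// Spread one subsampled chroma sample over a 2x2 block, weighting each
// pixel by its approximate chroma capacity min(Y, 255 - Y).
std::array<Chroma, 4> UpsampleByCapacity(const std::array<int, 4>& y, Chroma c)
{
    std::array<double, 4> cc;
    double sum = 0.0;
    for (int n = 0; n < 4; ++n)
    {
        cc[n] = std::min(y[n], 255 - y[n]);
        sum += cc[n];
    }

    std::array<Chroma, 4> out;
    for (int n = 0; n < 4; ++n)
    {
        // Weights average to 1 over the block, so total chroma is preserved.
        const double scale = (sum > 0.0) ? 4.0 * cc[n] / sum : 1.0;
        out[n] = Chroma{ scale * c.cb, scale * c.cr };
    }
    return out;
}
```

On a block with two black pixels (Y = 0) and two red pixels, the black pixels get zero chroma and the red pixels get double the averaged chroma, which is the original value.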

Another approach to accelerating Glenn's method would be to just go ahead and do the YCbCr->RGB on each pixel, and then when you do the clamp (which you must do anyway), use that branch to spill to neighbors, and compute the spill amount directly from how far your RGB is pushed out of [0..255] , eg. if B is -4 , then (-4 / Cb_to_B) worth of Cb goes onto the neighbor.

I've only implemented the "spill" method for box sampling, but you can do it for any type of upsampling filter. It's a little awkward though, as you have to implement your upsampler in a sort of backwards way; rather than iterating on the high res pixels and sampling from the subsampled CbCr plane with some filter and accumulating into each output pixel only once, instead you need to iterate over the low res subsampled CbCr and create the filter output and add it into various target pixels.

There's one final method to look at which we've not implemented yet. Glenn's method is of course a form of "luma aided chroma upsample", but it's not what I'm usually referring to when I say that. What we usually mean is using the luma edge information to select the chroma upfilter. That is, rather than always doing bilinear chroma upsample or bicubic or sinc or whatever, you do a decision tree on the luma pixels and choose one of several filters which have various shapes. This is actually a variant of the old "super resolution" problem. We have a low res signal and wish to make the best possible high res output, possibly with the help of some side information. In this case we have the high res Y plane as side information which we believe is well correlated; in many of the "super resolution" papers in the literature what they have is previous video frames, but the techniques are quite similar. I've seen some papers on this but no implementation (though I hear some TVs actually have a hacky version of this called "warpscan" or something like that).

Training filters for various edge scenarios is relatively straightforward, so I might do that in the next few days.

BTW one scary issue for all these fancy filters is if you don't control the decoder or know its exact spec. In particular, TVs and such are now doing fancy chroma upsampling of their own, which means your manipulations could make things worse.

Also something I have observed with chroma sampling is that minimizing any simple analytic metric like RMSE can lead to ringing. This is a very generic issue in lossy compression (it's roughly the same reason that you get ringing in wavelets). In order to reduce a very large single pixel error, the encoder will apply something like a sinc filter shape to that pixel. That might cut the error at that pixel from 100 to 50, which is a huge win, but it also adds a low magnitude ringing around that pixel (maybe magnitude 5 or so). In RMSE terms this is a big win, but visually it's very bad to create that ringing around the single bad pixel, better to just leave its big error alone. (nonlinear error metrics might help this, but nonlinear error metrics don't lead to simple to solve linear matrix equations)

The best links :

Twibright Luminaplex and Hyperluma
Towards Better Chroma Subsampling

Some good links related to chroma sampling and color :

psx h4x0rz in teh wired YCbCr to RGB Conversion Showdown
psx h4x0rz in teh wired Immaculate Decoding
Marty Reddy Color FAQ
Chroma Sampling An Investigation
hometheaterhifi - chroma upsampling error
COMPSCI708S1T CBIR - Colour Features
CiteSeerX — Optimal image scaling using pixel classification

And some really unrelated links that I just happened to have found in the last few days :

LIVE lab on image quality
Live Chart Playground - Google Chart Tools Image Charts (aka Chart API) - Google Code
JPEG Post Processing
Shape-Adaptive Transforms Filtering Pointwise SA-DCT algorithms
Perl jpegrescan - Dark Shikari - Pastebin.com

10-20-10 | Discrete Math is the Bane of Computer Science

We are constantly faced with these horrible discrete optimization problems.

The specific example I've been working on today is YCbCr/YUV optimization. Say you know the reconstruction matrix (YUV->RGB) and you want to find the optimal YUV coefficients for some given RGB ?

Well the standard answer is you invert the reconstruction matrix and then multiply that by your RGB. That is wrong!

In general what we are trying to solve here is a problem something like this :

13 * x + 17 * y = 200;
11 * x +  7 * y = 100;

find x & y to minimize the error
x & y are integers
and bounded in [-100,100]

I guess these types of problems are "integer programming" (this particular case is integer linear programming since my example is linear, but sometimes we have to do it on nonlinear problems), and integer programming is NP-hard , so in general there is no known efficient exact method; you are left with brute force or some smarter form of search.

(more generally, minimize error E(x,y,..) subject to various linear constraints)

The usual approach is just to wave your hands around and pretend that your variables are actually continuous, work out the math for the solution as if they were, and then round to ints at the end.

The problem is that that can be arbitrarily far from the optimal solution. Obviously in some simple problems you can bound the error of using this approach, or you may even know that this is the minimum error, but in nasty problems it might be way off.
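The toy problem above is small enough to just try everything, which makes a nice sanity check on the round-the-continuous-answer approach. This sketch assumes "error" means the sum of squared residuals of the two equations; `IntSolution`, `SqErr`, `SolveByRounding` and `SolveBruteForce` are names I made up for it.

```cpp
#include <algorithm>
#include <climits>
#include <cmath>

struct IntSolution { int x, y; long err; };

// Sum of squared residuals of the two equations.
long SqErr(int x, int y)
{
    const long e0 = 13L * x + 17L * y - 200;
    const long e1 = 11L * x +  7L * y - 100;
    return e0 * e0 + e1 * e1;
}

// Pretend x & y are continuous (Cramer's rule), then round and clamp.
IntSolution SolveByRounding()
{
    const double det = 13.0 * 7.0 - 17.0 * 11.0;
    const double xf  = (200.0 * 7.0 - 17.0 * 100.0) / det;
    const double yf  = (13.0 * 100.0 - 11.0 * 200.0) / det;
    const int x = std::min(100, std::max(-100, int(std::lround(xf))));
    const int y = std::min(100, std::max(-100, int(std::lround(yf))));
    return IntSolution{ x, y, SqErr(x, y) };
}

// Try every integer pair in [-100,100]^2.
IntSolution SolveBruteForce()
{
    IntSolution best{ 0, 0, LONG_MAX };
    for (int x = -100; x <= 100; ++x)
        for (int y = -100; y <= 100; ++y)
        {
            const long err = SqErr(x, y);
            if (err < best.err) best = IntSolution{ x, y, err };
        }
    return best;
}
```

The continuous solution is (3.125, 9.375), which rounds to (3, 9) with squared error 80. The brute force search finds (4, 9) with squared error 74, so even on this tiny well-behaved example rounding does not land on the optimum.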

We saw this same issue before in DXT1

I've never really seen many pages or much reference on this kind of problem. I did at one point decide that I was sick of the approximate hacky solutions and tried to find some software package to do this for real and came up empty on real usable ones.

There are a few hacky tricks you can use :

1. Find the continuous solution, then for each float value try clamping it to various integers nearby. See which of these rounds is best. The problem is for an N variable problem this takes 2^N tries if all you try is floor and ceil, and really you'd like to try going +1 and -1 beyond that a few times.

2. You can always remove one linear variable. That is, if you can reduce the problem to just something like a x = b, then the optimal integer solution is just round(b/a). What that means is if you use some other approach for all but one variable (such as brute force search), then you can solve the last variable trivially.

3. Often the importance of the various variables is not the same, their coefficients in the error term may vary quite a bit. So you may be able to brute force search on just the term with the largest contribution to the error, and use a simpler/approximate method on the other terms.
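Trick 2 can be sketched on the same toy problem (13x + 17y = 200, 11x + 7y = 100, squared error, both variables in [-100,100]): brute force x only, and for each x the remaining problem is a one-variable least squares in y whose best integer is the rounded (and clamped) continuous minimum. `Candidate`, `ResidualSq` and `SolveEliminateY` are hypothetical names.

```cpp
#include <algorithm>
#include <climits>
#include <cmath>

struct Candidate { int x, y; long err; };

long ResidualSq(int x, int y)
{
    const long e0 = 13L * x + 17L * y - 200;
    const long e1 = 11L * x +  7L * y - 100;
    return e0 * e0 + e1 * e1;
}

// Search x exhaustively; solve y in closed form for each x.
Candidate SolveEliminateY()
{
    Candidate best{ 0, 0, LONG_MAX };
    for (int x = -100; x <= 100; ++x)
    {
        // With x fixed, minimize (17y - b0)^2 + (7y - b1)^2 over y.
        const double b0 = 200.0 - 13.0 * x;
        const double b1 = 100.0 - 11.0 * x;
        const double yf = (17.0 * b0 + 7.0 * b1) / (17.0 * 17.0 + 7.0 * 7.0);

        // A 1D quadratic is symmetric about its minimum, so the nearest
        // in-range integer is the optimal one.
        const int y = std::min(100, std::max(-100, int(std::lround(yf))));

        const long err = ResidualSq(x, y);
        if (err < best.err) best = Candidate{ x, y, err };
    }
    return best;
}
```

This does 201 evaluations instead of 201*201 and still returns the true integer optimum (4, 9) with error 74, because the last variable is solved exactly rather than approximately.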

10-18-10 | How to make a Perceptual Database

Well, I'm vaguely considering making my own perceptual test database since you know it's been like 20 years since it was obvious we needed this and nobody's done it. I'm going to brainstorm randomly about how it should be.

EDIT : hmm whaddaya know, I found one.

You gather something like 20 high quality images of different characteristics. The number of base images you should use depends on how many tests you can run - if you won't get a lot of tests don't use too many base images.

Create a bunch of distorted images in various ways. For each base image, you want to make something like 100 of these. You want distortions at something like 8 gross quality levels (something like "bit rates" 0.125 - 2.0 in log scale), and then a variety of distortions that look different at each quality level.

How you make these exactly is a question. You could of course run various compressors to make them, but that has some weird bias built in as testers are familiar with those compressors and their artifacts, so may have predispositions about how they view them. It might be valuable to also make some synthetically distorted images. Another idea would be to run images through multiple compressors. You could use some old/obscure compressors like fractals or VQ. One nice general way to make distortions is to fiddle with the coefficients in the transformed space, or to use transforms with synthesis filters that don't match the analysis (non-inverting transforms).

The test image resolution should be small enough that you can display two side by side without scaling. I propose that a good choice is 960 x 1080 , since then you can fit two side by side at 1920x1080 , which I believe is common enough that you can get a decent sample size. 960 divides by 64 evenly but 1080 is actually kind of gross (it only divides up to 8), so 960x1024 might be better, or 960x960. That is annoyingly small for a modern image test, but I don't see a way around that.

There are a few different types of test possible :

Distorted pair testing :

The most basic test would be to show two distorted images side by side and say "which looks better". Simply have the user click one or the other, then show another pair. This lets testers go through a lot of images very quickly which will make the test data set larger. Obviously you randomize which images you show on the left or right.

To pick two distorted images that are useful to test against each other, you would choose two images which are roughly the same quality under some analytic metric such as MS-SSIM-SCIELAB. This maximizes the amount of information you get out of each test, because when you put up images where one is obviously better than the other you aren't learning anything (* - but this is a useful way to test the user for sanity: occasionally put up some image pairs that are purely randomly chosen, that way you get some comparisons where you know the answer and can test the viewer).

Single image no reference testing :

You just show a single distorted image and ask the viewer to rate its "quality" on a scale of 0-10. The original image is not shown.

Image toggle testing :

The distorted and original image are shown on top of each other and toggled automatically at N second intervals. The user rates it on a scale of 0-10.

Double image toggle testing :

Two different distorted images are chosen as in "distorted pair testing". Both are toggled against the original image. The user selects which one is better.

When somebody does the test, you want to record their IP or something so that you make sure the same person isn't doing it too many times, and to be able to associate all the numbers with one identity so that you can throw them out if they seem unreliable.

It seems like this should be easy to set up with the most minimal bit of web programming. You have to be able to host a lot of bandwidth because the images have to be uncompressed (PNG), and you have to provide the full set for download so people can learn from it.

Obviously once you have this data you try to make a synthetic measure that reproduces it. The binary "this is better than this" tests are easier to deal with than the numeric (0-10) ones - you can directly test against them. With the numeric tests you have to control for the bias of the rating on each image and the bias from each user (this is a lot like the Netflix Prize actually, you can see also papers on that).

10-16-10 | Image Comparison Part 11 : Some Notes on the Tests

Let's stop for a second and talk about the tests we've been running. First of all, as noted in a comment, these are not the final tests. This is the preliminary round for me to work out my pipeline and make sure I'm running the compressors right, and to reduce the competitors to a smaller set. I'll run a final round on a large set of images and post a graph for each image.

What is this MS-SSIM-SCIELAB exactly? Is it a good perceptual metric?

SCIELAB (see 1 and 2 ) is a color transform that is "perceptually uniform" , eg. a delta of 1 has the same human visual importance at all locations in the color cube, and is also spatially filtered to account for the difference in human chroma spatial resolution. The filter is done in the opponent-color domain, and basically the luma gets a sharp filter and the chroma gets a wide filter. Follow the blog posts for more details.

SCIELAB is pretty good at accounting for one particular perceptual factor of error measurement - the difference in importance of luma vs. chroma and the difference in spatial resolution of luma vs. chroma. But it doesn't account for lots of other perceptual factors. Because of that, I recognize that using SCIELAB is somewhat distorting in the test results. What it does is give a bonus to compressors that are perceptually tuned for this particular issue, and doesn't care about other issues. More on this later. (*1) (ADDENDUM : of course using SCIELAB also gives an advantage to people who use a colorspace similar to LAB, such as old JPEG YUV, and penalizes people who use YCoCg which is not so close. Whether or not you consider this to be fair depends on how accurate you believe LAB to be as an approximation of how the eye sees).

MS-SSIM is multi-scale SSIM and I use it on the SCIELAB data. There are a few issues about this which are problematic.

How do you do SSIM for multi-component data?

It's not that clear. You can just do the SSIM on each component and then combine - probably with an arithmetic mean, but if you look at the SSIM a bit you might get other ideas. The basic factor in SSIM is a term like :

ss = 2 * x*y / (x*x + y*y )

If x and y are the same, this is 1.0 , the more different they are, the smaller it is. In fact this is a normalized dot product or a "triangle metric". That is :

ss = ( (x+y)^2 - x*x - y*y ) / (x*x + y*y )

if you pretend x and y are the two short edges of a triangle, this is the length squared of the long edge between them minus the length squared of each short edge (all divided by the sum of the squared short edge lengths).

Now, this has just been scalar, but to go to multi-component SSIM you could easily imagine the right thing to do is to go to vector ops :

ss = 2 * Dot(x,y) / ( LenSqr(x) + LenSqr(y) )

That might in fact be a good way to do n-component SSIM, it's impossible to say without doing human rating tests to see if it's better or not.
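The vector form is tiny in code. A sketch, assuming 3-component data; `Color3` and `SsimTermVec` are made-up names, and treating two all-zero vectors as identical (returning 1) is my choice for the 0/0 case.

```cpp
#include <array>

using Color3 = std::array<double, 3>;

double Dot(const Color3& a, const Color3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

double LenSqr(const Color3& a) { return Dot(a, a); }

// Vector version of the basic SSIM factor: 1 when x == y, 0 when they
// are orthogonal, negative when they point in opposing directions.
double SsimTermVec(const Color3& x, const Color3& y)
{
    const double denom = LenSqr(x) + LenSqr(y);
    return (denom > 0.0) ? 2.0 * Dot(x, y) / denom : 1.0;
}
```

Note that unlike the per-component version, this lets an error in one component be partly hidden by large matching values in the other components.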

Now while we're at it let's make a note on this basic piece of the SSIM.

ss = 2 * x*y / (x*x + y*y ) = 1 - (x-y)^2 /  ( x*x + y*y )

we can see that all it's done is take the normal L2 MSE term , and scale it by the inverse magnitude of the values. The reason they do this is to make SSIM "scale independent" , that is if you replace x and y with sx and sy you get the same number out for SSIM. But in fact what it does is make errors in low values much more important than errors in high values.

ssim :

    delta from 1->3

    2*1*3 / ( 1*1 + 3*3 ) = 6 / 10 = 0.6

    delta from 250->252

    2*250*252 / ( 250*250 + 252*252 ) = 0.999968

Now certainly it is true that a delta of 2 at value 1 is more important than a delta of 2 at value 250 (so called "luma masking") - but is it really *this* much more important? In terms of error (1 - ssim), the difference is 0.40 vs. 0.000032 , or 1250000 % greater. I think not. My conjecture is that this scale-independence aspect of SSIM is wrong, it's over-counting low value errors vs. high-value errors. (ADDENDUM : I should note that real SSIM implementations have a hacky constant term added to the numerator and denominator which reduces this exaggeration)

As usual in the SSIM papers they show that SSIM is better at detecting true visual quality than a straw man opponent - pure RMSE. But what is in SSIM ? It's plain old MSE with a scaling to make low values count more, and it's got a local "detail" detection term in the form of block sdevs. So if you're going to make a fair comparison, you should test against something similar. You could easily do RMSE on scaled values (perhaps log-scale values, or simple sqrt-values) to make low value errors count more, and you could easily add a detail preservation term by measuring local activity and adding an RMSE-like term for that.

What's the point of even looking at the RMSE numbers? If we just care about perceptual quality, why not just post that?

Well, a few reasons. One, as noted previously, we don't completely trust our perceptual metric, so having the RMSE numbers provides a bit of a sanity check fallback for that. For another, it lets us sort of check on the perceptual tuning of the compressor. For example if we find something that does very well on RGB-RMSE but badly on the perceptual metric, that tells us that it has not been perceptually tuned; it might actually be an okay compressor if it has good RMSE results. Having multiple metrics and multiple bit rates and multiple test images sort of lets you peer into the function of the compressor a bit.

What's the point of this whole process? Well there are a few purposes for me.

One is to work out a simple reproducible pipeline in which I can test my own compressors and get a better idea of whether they are reasonably competitive. You can't just compare against other people's published results because so many of the tests are done on bad data, or without enough details to be reproducible. I'd also like to find a more perceptual metric that I can use.

Another reason is for me to actually test a lot of the claims that people bandy about without much support, like is H264-Intra really a very good still image coder? Is JPEG2000 really a lame duck that's not worth bothering with? Is JPEG woefully old and easy to beat? The answers to those and other questions are not so clear to me.

Finally, hopefully I will set up an easy to reproduce test method so that anyone at home can make these results, and then hopefully we will see other people around the web doing more responsible testing. Not bloody likely, I know, but you have to try.

(*1) : you can see this for example in the x264 -stillimage results, where they are targeting "perceptual error" in a way that I don't measure. There may be compressors for example which are successfully targeting some types of perceptual error and not targeting the color-perception issue, and I am unfairly showing them to be very poor.

However, just because this perceptual metric is not perfect doesn't mean we should just give up and use RMSE. You have to use the best thing you have available at the time.

Generally there are two classes of perceptual error which I am going to just brazenly coin terms for right now : ocular and cognitive.

The old JPEG/ SCIELAB / DCTune type perceptual error optimization is pretty much all ocular. That is, they are involved in studying the spatial resolution of rods vs. cones, the ocular nerve signal masking of high contrast impulses, the thresholds of visibility of various DCT shapes, etc. It's sort of a raw measure of how the optical signal gets to the brain.

These days we are more interested in the "cognitive" issues. This is more about things like "this looks smudgey" or "there's ringing here" or "this straight line became non-straight" or "this human face got scrambled". It's more about the things that the active brain focuses on and notices in an image. If you have a good model for cognitive perception, you can actually make an image that is really screwed up in an absolute "ocular" sense, but the brain will still say "looks good".

The nice thing about the ocular perceptual optimization is that we can define it exactly and go study it and come up with a bunch of numbers. The cognitive stuff is way more fuzzy and hard to study and put into specific metrics and values.

Some not very related random links :

Perceptual Image Difference Utility
New cjpeg features
NASA Vision Group - Publications
JPEG 2000 Image Codecs Comparison
IJG swings again, and misses Hardwarebug
How-To Extract images from a video file using FFmpeg - Stream #0
Goldfishy Comparison WebP, JPEG and JPEG XR

A brief note on this WebP vs. JPEG test :

Real world analysis of google’s webp versus jpg English Hard

First of all he uses a broken JPEG compressor (which is then later fixed). Second he's showing these huge images way scaled down, you have to dig around for a link to find them in their native sizes. He's using old JPEG-Huff without a post-unblock ; okay, that's fine if you want to compare against ancient JPEG, but you could easily test against JPEG-Arith with unblocking. But the real problem is the file sizes he compresses to. They're around 0.10 - 0.15 bpp ; to get the JPEGs down to that size he had to set "quality" to something like 15. That is way outside of the functional range of JPEG. It's abusing the format - the images are actually huge, then compressed down to a tiny number of bits, and then scaled down to display.

Despite that, it does demonstrate a case where WebP is definitely significantly better - smooth gradients with occasional edges. If WebP is competitive with JPEG on photographs but beats it solidly on digital images, that is a reasonable argument for its superiority.

10-18-10 | Frustum and RadiusInDirection

Ryg has a note on Zeux' Frustum culling series , both are good reads.

The thing that struck me when reading it is that you can get almost directly to the fastest possible code from a very generic C++ framework.

The way I've done frustum culling for a while is like this :

template < class t_Volume >
Cull::EResult Frustum::Cull(const t_Volume & vol) const
{
    Cull::EResult eRes = Cull::eIn;

    for(int i=0;i < m_count;i++)
    {
        const Plane::ESide eSide = PlaneSide(vol,m_planes[i]);
        if ( eSide == Plane::eBack )
            return Cull::eOut;
        else if ( eSide == Plane::eIntersecting )
            eRes = Cull::eCrossing;
        // else Front, leave eRes alone
    }

    return eRes;
}

now any primitive class which implements "PlaneSide" can be culled. (Cull is trinary - it returns all in, all out, or crossing, similarly PlaneSide is trinary, it returns front or back or crossing).

Furthermore, PlaneSide can be overridden for classes that have their own idea of how it should be done, but almost always you can just use the generic PlaneSide :

template < class t_Volume >
Plane::ESide PlaneSide(const t_Volume & vol, const Plane & plane)
{
    const float dist = plane.DistanceToPoint( vol.GetCenter() );
    const float radius = vol.GetRadiusInDirection( plane.GetNormal() );

    if ( dist > radius )
        return Plane::eFront;
    else if ( dist < - radius )
        return Plane::eBack;
    else
        return Plane::eIntersecting;
}

For a volume to be generically testable against a plane it has to implement GetCenter() and GetRadiusInDirection().

GetRadiusInDirection(dir) tells you the half-width of the span of the volume along direction "dir". The neat thing is that GetRadiusInDirection turns out to be a pretty simple and very useful function for most volumes.

Obviously the implementation for Sphere is the simplest, because RadiusInDirection is the same in all directions :

Sphere::GetCenter() { return m_center; }
Sphere::GetRadiusInDirection(const Vec3 & dir) { return m_radius; } // dir is ignored

for an AxialBox, if you store your box as {m_center, m_halfExtent} so that min & max are { m_center - m_halfExtent, m_center + m_halfExtent } then the implementation is :

AxialBox::GetCenter() { return m_center; }
inline float AxialBox::GetRadiusInDirection(const Vec3 & dir) const
{
    return
         fabsf(dir.x) * m_halfExtent.x +
         fabsf(dir.y) * m_halfExtent.y +
         fabsf(dir.z) * m_halfExtent.z;
}

if you now compile our test, everything plugs through - you have Ryg's method 4c. (without the precomputation of absPlane however).
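To see the pieces plug together, here is a minimal self-contained sketch of the same idea. The Vec3, Plane and ECullResult types are hypothetical stand-ins (the real classes aren't shown in this post), planes are assumed to have unit normals pointing into the frustum, and MakeSlabX is just a two-plane toy "frustum" for trying it out :

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical minimal types, not the real library classes.
struct Vec3 { float x, y, z; };

inline float Dot(const Vec3 & a, const Vec3 & b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Plane
{
    enum ESide { eFront, eBack, eIntersecting };
    Vec3  normal;  // assumed unit length, pointing to the front (inside) side
    float offset;  // plane equation : Dot(normal,p) + offset = 0

    float DistanceToPoint(const Vec3 & p) const { return Dot(normal, p) + offset; }
    const Vec3 & GetNormal() const { return normal; }
};

struct Sphere
{
    Vec3  m_center;
    float m_radius;
    const Vec3 & GetCenter() const { return m_center; }
    float GetRadiusInDirection(const Vec3 &) const { return m_radius; }
};

struct AxialBox
{
    Vec3 m_center, m_halfExtent;
    const Vec3 & GetCenter() const { return m_center; }
    float GetRadiusInDirection(const Vec3 & dir) const
    {
        return std::fabs(dir.x) * m_halfExtent.x +
               std::fabs(dir.y) * m_halfExtent.y +
               std::fabs(dir.z) * m_halfExtent.z;
    }
};

// Generic plane test : works for anything with GetCenter and GetRadiusInDirection.
template < class t_Volume >
Plane::ESide PlaneSide(const t_Volume & vol, const Plane & plane)
{
    const float dist   = plane.DistanceToPoint( vol.GetCenter() );
    const float radius = vol.GetRadiusInDirection( plane.GetNormal() );
    assert( radius >= 0.f );
    if ( dist >  radius ) return Plane::eFront;
    if ( dist < -radius ) return Plane::eBack;
    return Plane::eIntersecting;
}

enum ECullResult { eIn, eOut, eCrossing };

struct Frustum
{
    std::vector<Plane> m_planes;

    template < class t_Volume >
    ECullResult Cull(const t_Volume & vol) const
    {
        ECullResult eRes = eIn;
        for (const Plane & plane : m_planes)
        {
            const Plane::ESide eSide = PlaneSide(vol, plane);
            if ( eSide == Plane::eBack ) return eOut;
            if ( eSide == Plane::eIntersecting ) eRes = eCrossing;
        }
        return eRes;
    }
};

// Toy "frustum" for experiments : the slab -1 <= x <= 1 as two inward-facing planes.
inline Frustum MakeSlabX()
{
    Frustum f;
    f.m_planes.push_back( Plane{ Vec3{ 1.f, 0.f, 0.f }, 1.f } ); // x >= -1
    f.m_planes.push_back( Plane{ Vec3{-1.f, 0.f, 0.f }, 1.f } ); // x <=  1
    return f;
}
```

Calling MakeSlabX().Cull(...) on a Sphere or an AxialBox goes through the same template; the only per-volume code is the two accessors.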

Of course because we are generic and geometric, it's obvious how to extend to more primitives. For example we can make our Frustum work on oriented bounding boxes just like this :

float OrientedBox::GetRadiusInDirection(const Vec3 & dir) const
{
    const float radius = 
        fabsf((dir * m_axes.GetRowX()) * m_radii.x) +
        fabsf((dir * m_axes.GetRowY()) * m_radii.y) +
        fabsf((dir * m_axes.GetRowZ()) * m_radii.z);

    return radius;
}

(I store my OrientedBox as an m_center, an orthonormal rotation matrix m_axes, and the extent of each of the 3 axes in m_radii ; obviously you could speed up this query by storing the axes scaled by their length, but that makes some of the other operations more difficult so YMMV).
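A rough sketch of the "axes scaled by their length" variant, to show what the trade looks like. The names here (ScaledOrientedBox, Scale, Dot, the separate m_axisX/Y/Z members) are made up for illustration and the Vec3 is a stand-in; the only claim is that the two queries return the same number :

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type.
struct Vec3 { float x, y, z; };

inline float Dot(const Vec3 & a, const Vec3 & b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
inline Vec3  Scale(const Vec3 & v, float s) { return Vec3{ v.x*s, v.y*s, v.z*s }; }

// Stored like the post describes : orthonormal axes plus a radius per axis.
struct OrientedBox
{
    Vec3 m_center;
    Vec3 m_axisX, m_axisY, m_axisZ; // stand-ins for the rows of m_axes
    Vec3 m_radii;

    float GetRadiusInDirection(const Vec3 & dir) const
    {
        return std::fabs(Dot(dir, m_axisX) * m_radii.x) +
               std::fabs(Dot(dir, m_axisY) * m_radii.y) +
               std::fabs(Dot(dir, m_axisZ) * m_radii.z);
    }
};

// Variant that bakes the radii into the axes once, saving three multiplies per query.
struct ScaledOrientedBox
{
    Vec3 m_center;
    Vec3 m_scaledX, m_scaledY, m_scaledZ;

    explicit ScaledOrientedBox(const OrientedBox & b)
        : m_center(b.m_center),
          m_scaledX(Scale(b.m_axisX, b.m_radii.x)),
          m_scaledY(Scale(b.m_axisY, b.m_radii.y)),
          m_scaledZ(Scale(b.m_axisZ, b.m_radii.z))
    {
        assert( b.m_radii.x >= 0.f && b.m_radii.y >= 0.f && b.m_radii.z >= 0.f );
    }

    float GetRadiusInDirection(const Vec3 & dir) const
    {
        return std::fabs(Dot(dir, m_scaledX)) +
               std::fabs(Dot(dir, m_scaledY)) +
               std::fabs(Dot(dir, m_scaledZ));
    }
};
```

The cost is that anything which wants the unit axes or the radii back has to renormalize, which is the "makes some of the other operations more difficult" part.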

Similarly tests against cylinders and lozenges and k-Dops or convex hulls or what-have-you are pretty straightforward.

We can use GetRadiusInDirection for other things too. Say you want to find the AxialBox that wraps the old AxialBox but in a new rotated orientation ? Well with GetRadiusInDirection it's very obvious how to do it - you just take the new basis axes and query for the slab spans along them :

AxialBox AxialBox::Rotate( const Matrix & mat ) const
{
    AxialBox ret;
    ret.m_center = m_center;

    ret.m_halfExtent.x = GetRadiusInDirection( mat.GetColumnX() );
    ret.m_halfExtent.y = GetRadiusInDirection( mat.GetColumnY() );
    ret.m_halfExtent.z = GetRadiusInDirection( mat.GetColumnZ() );

    return ret;
}

And you will find that this is exactly the same as what Zeux works out for AABB rotation . But since we are doing this with only GetCenter() and GetRadiusInDirection() calls - it's obvious that we can use this to make an AxialBox around *any* volume :

template < class t_Volume >
AxialBox MakeAxialBox( const t_Volume & vol, const Matrix & mat )
{
    AxialBox ret;
    ret.m_center = vol.GetCenter();

    ret.m_halfExtent.x = vol.GetRadiusInDirection( mat.GetColumnX() );
    ret.m_halfExtent.y = vol.GetRadiusInDirection( mat.GetColumnY() );
    ret.m_halfExtent.z = vol.GetRadiusInDirection( mat.GetColumnZ() );

    return ret;
}

The nice thing about generic programming is it gives you a set of interfaces which provide a contract for geometric volumes, and then anything that implements them can be used in certain functions. You wind up doing this kind of thing where you write a routine just for AxialBox rotations, but then you see "hey I'm only using calls that are in the generic Volume spec so this is general".
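As a quick sanity check of that "it's general" claim, here is a small sketch that runs the same MakeAxialBox on a Sphere and on an AxialBox. The Matrix here is a hypothetical three-column stand-in and MakeRotationZ is a helper invented for the example :

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal types for illustration.
struct Vec3 { float x, y, z; };

struct Matrix
{
    Vec3 colX, colY, colZ; // basis directions stored as columns
    const Vec3 & GetColumnX() const { return colX; }
    const Vec3 & GetColumnY() const { return colY; }
    const Vec3 & GetColumnZ() const { return colZ; }
};

struct Sphere
{
    Vec3  m_center;
    float m_radius;
    const Vec3 & GetCenter() const { return m_center; }
    float GetRadiusInDirection(const Vec3 &) const { return m_radius; }
};

struct AxialBox
{
    Vec3 m_center, m_halfExtent;
    const Vec3 & GetCenter() const { return m_center; }
    float GetRadiusInDirection(const Vec3 & dir) const
    {
        return std::fabs(dir.x) * m_halfExtent.x +
               std::fabs(dir.y) * m_halfExtent.y +
               std::fabs(dir.z) * m_halfExtent.z;
    }
};

// Same body as in the post : query the slab span along each basis direction.
template < class t_Volume >
AxialBox MakeAxialBox( const t_Volume & vol, const Matrix & mat )
{
    AxialBox ret;
    ret.m_center = vol.GetCenter();
    ret.m_halfExtent.x = vol.GetRadiusInDirection( mat.GetColumnX() );
    ret.m_halfExtent.y = vol.GetRadiusInDirection( mat.GetColumnY() );
    ret.m_halfExtent.z = vol.GetRadiusInDirection( mat.GetColumnZ() );
    return ret;
}

// Helper for the example : basis rotated about Z by the given angle.
inline Matrix MakeRotationZ(float radians)
{
    assert( std::isfinite(radians) );
    const float c = std::cos(radians), s = std::sin(radians);
    return Matrix{ Vec3{c, s, 0.f}, Vec3{-s, c, 0.f}, Vec3{0.f, 0.f, 1.f} };
}
```

With a 45 degree basis, a sphere of radius 2 comes back with half extents of 2 on every axis, and a unit-half-extent cube comes back with sqrt(2) in x and y and 1 in z, which is what you'd expect geometrically.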

Now I'm not claiming by any means that you can make C++ generic templates and they will be competitive with hand-tweaked code. For example in the case of AABB vs. Frustum you probably want to precompute absPlane , and you probably want to special case to a 5-plane frustum and unroll it (or SIMD it). Obviously when you want maximum speed, you want to look at the assembly after the C++ compiler has had its turn and make sure it's got it right, and you may still want to SIMD and whatever else you might want to do.
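For what it's worth, the absPlane precompute might look something like this sketch. I'm assuming "absPlane" means the component-wise absolute value of the plane normal, cached once per frustum plane; CullPlane, MakeCullPlane and IsBoxOutside are names invented here, and box half extents are assumed non-negative :

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal types for illustration.
struct Vec3 { float x, y, z; };

inline float Dot(const Vec3 & a, const Vec3 & b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct AxialBox { Vec3 m_center, m_halfExtent; };

// Plane with the absolute value of its normal cached alongside it.
struct CullPlane
{
    Vec3  normal;    // points into the frustum
    float offset;    // Dot(normal,p) + offset = 0 on the plane
    Vec3  absNormal; // computed once when the frustum is built
};

inline CullPlane MakeCullPlane(const Vec3 & normal, float offset)
{
    return CullPlane{ normal, offset,
        Vec3{ std::fabs(normal.x), std::fabs(normal.y), std::fabs(normal.z) } };
}

// Returns true if the box is entirely behind some plane (out-only test, no crossing result).
inline bool IsBoxOutside(const AxialBox & box, const CullPlane * planes, int count)
{
    assert( planes != nullptr || count == 0 );
    for (int i = 0; i < count; i++)
    {
        const float dist   = Dot(planes[i].normal, box.m_center) + planes[i].offset;
        const float radius = Dot(planes[i].absNormal, box.m_halfExtent); // no fabsf per box
        if ( dist < -radius )
            return true;
    }
    return false;
}
```

The per-box work drops to two dot products and a compare per plane, which is also a shape that unrolls or SIMDs easily.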

But as in all optimization, you want to start your assembly work from the right algorithm, and often that is the one that is most "natural". Interestingly the interfaces which are most natural for generic programming are also often ones that lead to fast code (this isn't always the case, just like the true physics equations aren't always beautiful, but it's a good starting point anyway).

BTW a few more notes on Frustum culling. Frustum culling is actually a bit subtle, in that how much work you should do on it depends on how expensive the object you are culling is to render. If the object will take 0.0001 ms to just render, you shouldn't spend much time culling it. In that case you should use a simpler approximate test - for example a Cone vs. Sphere test. Or maybe no test at all - combine it into a group of objects and test the whole group for culling. If an object is very expensive to render (like maybe it's a skinned human with very complex shaders), it takes 1 ms to render, then you want to cull it very accurately indeed - in fact you may want to test against an OBB or a convex hull of the object instead of just an AABB.

Another funny thing about frustum culling in game usage is that the planes are not all equal. That is, our spaces are not isotropic. We usually have lots of geometry in the XY plane and not much above or below you. You need to take advantage of that. For example using an initial hierarchy in XY can help. Or if you are using a non-SIMD cull with early outs, order your plane tests to maximize speed (eg. the top and bottom planes of the frustum should be the last ones checked as they are least important).

10-16-10 | Image Comparison Part 10 : x264 Retry

Well I've had a little bit more success.

I still can't get x264 to do full range successfully; or maybe I did, but then I can't figure out how to make the decoder respect it.

I think the thing to do is make an AVISynth script containing something like :

ConvertToYV12( matrix="pc.709")

The pc YV12's are supposed to be the full 0-255 ones, and then on the x264 command line you also do "--fullrange on --colormatrix bt709" , which are just info tags put into the stream, which theoretically the decoder is supposed to see so that it can do the inverse colorspace transform correctly, but that doesn't seem to be working. Sigh!

One difficulty I have is that a lot of programs don't handle these one frame videos right. MPlayer refuses to extract any frames from it, Media Player Classic won't show me the one frame. FFmpeg does succeed in outputting the one frame, so it's what I'm using to decode right now.

Anyway these are some of the links that don't actually provide an answer :

Convert - Avisynth
Re FFmpeg-user - ffmpeg & final cut pro best format question
new x264 feature VUI parameters - Doom9's Forum
MPlayer(1) manual page
Mark's video filters
libav-user - Conversion yuvj420P to RGB24
H.264AVC intra coding and JPEG 2000 comparison
H.264 I-frames for still images [Archive] - Doom9's Forum
FFmpeg-user - ffmpeg & final cut pro best format question
FFmpeg-user - Converting high-quality raw video to high-quality DVD video
FFmpeg-devel - MJPG decoder picture quality
FFmpeg libswscaleoptions.c Source File
YCbCr - Wikipedia, the free encyclopedia

log rmse :

ms-ssim-scielab :

There's still a large constant error you can see in the RMSE graph that I believe is due to the [16-235] problem.

It should not be surprising that --tune stillimage actually hurts in both our measures, because it is tuning for "psy" quality in ways that we don't measure here. In theory it is actually the best looking of the three.

NOTE : This is counting the sizes of the .x264 output including all headers, which are rather large.


call x264 -o r:\t.mkv r:\my_soup.avs --preset veryslow --tune psnr %*
call ffmpeg -i t.mkv -sws_flags +bicubic+accurate_rnd+full_chroma_int -vcodec png tf.png


call ffmpeg -sws_flags +bicubic+accurate_rnd+full_chroma_int+full_chroma_inp -i my_soup.avi -vcodec libx264 -fpre c:\progs\video_tools\ffmpeg-latest\presets\libx264-lossless_slow.ffpreset r:\t.mkv
call ffmpeg -sws_flags +bicubic+accurate_rnd+full_chroma_int+full_chroma_inp -i r:\t.mkv -vcodec png r:\tf.png

I think I'll just write my own Y4M converter, since having my own direct Y4M in/out would be useful for me outside of this stupid test.

ADDENDUM : well I did.

added to the chart now is x264 with my own y4m converter :

I actually was most of the way there with the improved ffmpeg software scaler flags. I was missing the main issue - the reason it gets so much worse than our jpegfnspaq line at high bit rate is because "fns" is short for "flat no sub" and the no sub is what gets you - all subsampled codecs get much worse than non-subsampled codecs at high bit rates.

Even using my own Y4M converter is a monstrous fucking pain in the ass, because god damn FFMPEG won't just pass through the YUVJ420P data raw from x264 out to a Y4M stream - it prints a pointless error and refuses to do it. That needs to be fixed god damn it. The workaround is to make it output to "rawvideo" with yuvj420p data, and then load that same raw video but just tell it it has yuv420p data in it to get it to write the y4m. So my test bat for x264 is now :

call dele r:\t.*
call dele r:\ttt.*
c:\src\y4m\x64\release\y4m.exe r:\my_soup.bmp r:\t.y4m
r:\x264.exe -o r:\t.mkv r:\t.y4m --fullrange on --preset veryslow --tune psnr %*
call ffmpeg.bat -i r:\t.mkv -f rawvideo r:\ttt.raw
call ffmpeg.bat -pix_fmt yuv420p -s 1920x1200 -f rawvideo -i r:\ttt.raw r:\ttt.y4m
c:\src\y4m\x64\release\y4m.exe r:\ttt.y4m r:\ttt.bmp
namebysize r:\ttt.bmp r:\t.mkv r:\xx_ .bmp

The big annoyance is that I have the resolution hard-coded in there to make rawvideo work, so I can't just run it on arbitrary images.
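One way around the hard-coded size would be to read it back out of the header of the Y4M that was just written from the bmp. This is only a sketch under the assumption that the stream header is the usual single text line ("YUV4MPEG2" followed by space-separated tags like W and H); Y4MSize, ParseY4MHeader and FormatSizeArg are names made up for this example :

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

struct Y4MSize
{
    int  width  = 0;
    int  height = 0;
    bool valid  = false;
};

// Parses the first line of a Y4M stream, e.g. "YUV4MPEG2 W1920 H1200 F25:1 Ip A1:1 C420jpeg".
// Tags other than W and H are ignored.
inline Y4MSize ParseY4MHeader(const std::string & headerLine)
{
    Y4MSize size;
    std::istringstream in(headerLine);
    std::string token;
    if ( !(in >> token) || token != "YUV4MPEG2" )
        return size; // not a Y4M header

    while ( in >> token )
    {
        if ( token.size() < 2 ) continue;
        if      ( token[0] == 'W' ) size.width  = std::atoi(token.c_str() + 1);
        else if ( token[0] == 'H' ) size.height = std::atoi(token.c_str() + 1);
    }
    size.valid = ( size.width > 0 && size.height > 0 );
    return size;
}

// Formats the size the way a WIDTHxHEIGHT command line argument wants it.
inline std::string FormatSizeArg(const Y4MSize & size)
{
    assert( size.valid );
    return std::to_string(size.width) + "x" + std::to_string(size.height);
}
```

A tiny tool built on this could print "1920x1200" for the bat file to paste into the rawvideo command, so the script would work on arbitrary images.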

FYI my_y4m is currently just doing PC.601 "YUV" which is the JPEG YCbCr. I might add support for all the matrices so that it can be a fully functional y4m converter.
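For reference, the full-range 601 transform that JPEG/JFIF uses looks like the sketch below. This is the standard JFIF matrix as I understand it, not the actual my_y4m source; RGB8, YCbCr8 and the function names are invented for the example, and it quantizes straight back to bytes, which is exactly the lossy step discussed elsewhere in these posts :

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

struct RGB8   { std::uint8_t r, g, b; };
struct YCbCr8 { std::uint8_t y, cb, cr; };

// Round to nearest and clamp into [0,255].
inline std::uint8_t QuantizeClamp(double v)
{
    assert( std::isfinite(v) );
    const long i = std::lround(v);
    return static_cast<std::uint8_t>( std::min(255L, std::max(0L, i)) );
}

// Full-range (0-255) BT.601 YCbCr as used by JFIF.
inline YCbCr8 RgbToYCbCr(const RGB8 & c)
{
    const double r = c.r, g = c.g, b = c.b;
    YCbCr8 out;
    out.y  = QuantizeClamp(         0.299    * r + 0.587    * g + 0.114    * b );
    out.cb = QuantizeClamp( 128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b );
    out.cr = QuantizeClamp( 128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b );
    return out;
}

inline RGB8 YCbCrToRgb(const YCbCr8 & c)
{
    const double y = c.y, cb = c.cb - 128.0, cr = c.cr - 128.0;
    RGB8 out;
    out.r = QuantizeClamp( y                 + 1.402    * cr );
    out.g = QuantizeClamp( y - 0.344136 * cb - 0.714136 * cr );
    out.b = QuantizeClamp( y + 1.772    * cb );
    return out;
}
```

Grays round trip exactly; saturated colors generally come back off by a level or so per channel because of the byte quantization in the middle.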

I was going to use the Y4M reader/writer code from MJPEGTOOLS , but it looks like it's fucking GPL which is a cancerous toxic license, so I can't use it. (I don't actually mind having to release my source code, but it makes my code infected by GPL, which then makes my code unusable by 90% of the world).

10-16-10 | Image Comparison Part 9 : Kakadu JPEG2000

Kakadu JPEG2000 (v6.4) can be tuned for visual quality or for MSE (-no_weights) , so we run both :

my_soup :

Performance in general is excellent, and we can see that they did a good job with their visual tuning (according to this metric anyway). KakaduMSE is slightly worse than jpeg_paq through the [-1,1] zone, but the visually tuned one is significantly better.

moses :

Moses is one of those difficult "noisy / texturey" type of images (like "barb") that people historically say is bad for wavelets, and indeed that seems to be the case. While Kakadu still stomps on JPEG, it's not by nearly as much as on my_soup.

The old MSU test says that ACDSee and Lurawave are better than Kakadu (v4.5) so maybe I'll try those, but they're both annoyingly commercial.

10-15-10 | Image Comparison Part 8 : Hipix

Hipix is a commercial lossy image tool. I hear it is based on H264 Intra so I wanted to see if it was a better version of that idea (x264 and AIC both having let me down).

Well, at this point we should be unsurprised that it sucks balls :

log rmse :

ms-ssim-scielab :

One note : the rightmost data point from hipix is their "perfect" setting, which is very far from perfect. It's only a little over 2 bpp and the quality is shit. I feel bad for any sucker customers who are saving images as hipix "perfect" and thinking they are getting good quality.

I started to think , man maybe my ms-ssim-scielab is just way off the mark? How can everyone be so bad? Any time your test is telling you things that are hard to believe, you need to reevaluate your test. So I went and looked at the images with my own eyes.

Yep, hipix is awful. JPEG just blows it away.

A sample from the hipix image closest to the 0 on the x axis , and a JPEG of the same size : (HiPix is 230244 bytes, JPEG is 230794 bytes)


HiPix :

Note the complete destruction of the wood grain detail in the hipix, as well as introduction of blockiness and weird smudge shapes. Note the destruction of detail in the plate rim, and the ruining of the straight line edge of the black bowl.

BTW when you are evaluating perceptual quality, you should *NOT* zoom in! JPEG is optimized for human visual system artifact perceptibility at the given scale of the image. JPEG intentionally allows nasty artifacts that look bad when you zoom in, but not when you look at the image in its normal size.

Conclusion : Hipix needs to immediately release a "new and improved HiPix v2.0 that's way better than the last!" by just replacing it with JPEG.

Since they don't offer a command line app I won't be testing this on any more images.

ADDENDUM : Well I ran two points on Moses :

The two points are "High" and "Perfect" and perfect is way not perfect.

10-15-10 | Image Comparison Part 7 : WebP

I thought I wasn't going to be able to do this test, because damn Google has only released webpconv for Linux (or you know, if you download Cygwin and build it yourself WTFBBQ). But I found these :

WebP for .NET

webp.zip solution for VC

... both of which are actually broken. The .NET one just fails mysteriously on me. The webp.zip one has some broken Endian stuff, and even if you fix that the BMP input & output is broken. So.. I ripped it out and replaced it with the cblib BMP in/out, and it seems to work.

(I didn't want to use the webp converter in ffmpeg because I've seen past evidence that ffmpeg doesn't do the color conversion and resampling right, and I wanted to use the Google-provided app to make sure that any bad quality was due only to them)

My build of WebP Win32 is here : webp.zip

Here are the results :

log rmse :

ms-ssim-scielab :

Now I am surprised right off the bat that the ms-ssim-scielab results are not terrible, but the rmse is not very good. I've read rumors in a few places that On2 tweaked WebP/WebM for RMSE/PSNR , so I expected different.

Looking at the RMSE curve it's clear that there is a bad color conversion going on. Either too much loss in the color convert, or bad downsample code, something like that. Any time there is a broken base color space, you will see the whole error curve is a bit flatter than it should be and offset upwards in error.

The perceptual numbers are slightly worse than jpeg-huff through the "money zone" of -1 to 1. Like all modern coders it does have a flatter tail so wins at very low bit rate.

(BTW I think JPEG's shortcoming at very low bit rate is due to its very primitive DC coding, and lack of deblocking filter, but I'm not sure).

BTW I also found this pretty nice Goldfishy WebP comparison

Also some pretty good links on JPEG that I've stumbled on in the last few days :
jpeg wizard.txt
ImpulseAdventure - JPEG Quality Comparison
ImpulseAdventure - JPEG Quality and Quantization Tables for Digital Cameras, Photoshop
ImpulseAdventure - JPEG Compression and JPEG Quality

Here's how the WebP test was done :

webp_test.bat :
call dele s:\*.bmp
call dele s:\*.webp
Release\webp -output_dir s:\ -format webp -quality %1 r:\my_soup.bmp
Release\webp -output_dir s:\ -format bmp s:\my_soup.webp
namebysize s:\my_soup.bmp s:\my_soup.webp s:\webp_ .bmp
call mov s:\webp_*.bmp r:\webp_test\

md r:\webp_test
call dele r:\webp_test\*
call webp_test 5
call webp_test 10
call webp_test 15
call webp_test 20
call webp_test 25
call webp_test 30
call webp_test 40
call webp_test 50
call webp_test 60
call webp_test 65
call webp_test 70
call webp_test 75
call webp_test 80
call webp_test 85
call webp_test 90
call webp_test 95
call webp_test 100
call mov s:\webp_*.bmp r:\webp_test\
imdiff r:\my_soup.bmp r:\webp_test -cwebp_imdiff.csv
transposecsv webp_imdiff.csv webp_trans.csv

BTW the webpconv app is really annoying.

1. It fails out mysteriously in lots of places and just says "error loading" or something without telling you why.

2. It takes an "output_dir" option instead of an output file name. I guess that's nice for some uses, but you need an output file name option for people who are scripting. (you can fix this of course by making your batch rename the input file to "webp_in" or something then you can rename the output at will)

3. It's got image format loaders for like 10 different formats, but they're all semi-broken. Don't do that. Just load one standard format (BMP is good choice) and support it *well* , eg. be really compliant with variants of the bitstream, and let the user convert into that format using ImageMagick or something like that.

4. It won't write the output files if they already exist, and there's no "force overwrite" option. This one had me absolutely pulling out my hair as I kept running it with different options and the output files stayed the same. (you can fix this of course by making your batch delete the output first)

Despite all this negativity, I actually do think the WebP format might be okay if it had a good encoder.

ADDENDUM : WebP on Moses :

On "my_soup" it looked like WebP was at least close to competitive, but on Moses it really takes itself out of the running.

10-15-10 | Image Comparison Part 6 : cbwave

"cbwave" is my ancient wavelet coder from my wavelet video proof of concept. It's much simpler than JPEG 2000 and not "modern" in any way. But I tacked a bunch of color space options onto it for testing at RAD so I thought that would be interesting to see :

cbwaves various colorspaces :

log rmse :

ms-ssim-scielab :

notes :

RMSE : Obviously no color transform is very bad. Other than that, KLT is surprisingly bad at high bit rate (something I noted in a post long ago). The other color spaces are roughly identical. This coder has the best RMSE behavior of any we've seen yet. This is why wavelets were so exciting when they first came out - this coder is incredibly simple, there's no RDO or optimizing at all, it doesn't do wavelet packets or bit planes, or anything, and yet it beats PAQ-JPEG (on rmse anyway).

MS-SSIM-SCIELAB : and here we see the disappointment of wavelets. The great RMSE behavior doesn't carry over to the perceptual metric. The best color space by far is the old "YUV" from JPEG, which has largely fallen out of favor. But we see that maybe that was foolish.

cbwave also has an option for downsampling chroma, but it's no good - it's just box downsample and box upsample, so these graphs are posted as an example of what bad chroma up/down sampling can do to you : (note that the problem only appears at high bit rates - at low bit rates the bad chroma sampling has almost no effect)
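To be concrete about what "box downsample and box upsample" means, here is a minimal sketch (not the cbwave source). I'm assuming 2x2 averaging on the way down and plain pixel replication on the way up, with even plane dimensions; ChromaPlane, BoxDown2x and BoxUp2x are names made up for the example :

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One chroma plane stored as row-major floats.
struct ChromaPlane
{
    int width, height;
    std::vector<float> samples;

    ChromaPlane(int w, int h)
        : width(w), height(h), samples(std::size_t(w) * std::size_t(h), 0.f) {}

    float & At(int x, int y)       { return samples[std::size_t(y) * width + x]; }
    float   At(int x, int y) const { return samples[std::size_t(y) * width + x]; }
};

// Box downsample : each output sample is the mean of a 2x2 block.
inline ChromaPlane BoxDown2x(const ChromaPlane & src)
{
    assert( src.width % 2 == 0 && src.height % 2 == 0 );
    ChromaPlane dst(src.width / 2, src.height / 2);
    for (int y = 0; y < dst.height; y++)
        for (int x = 0; x < dst.width; x++)
            dst.At(x, y) = 0.25f * ( src.At(2*x, 2*y)     + src.At(2*x + 1, 2*y) +
                                     src.At(2*x, 2*y + 1) + src.At(2*x + 1, 2*y + 1) );
    return dst;
}

// Box upsample : each low-res sample is replicated into a 2x2 block.
inline ChromaPlane BoxUp2x(const ChromaPlane & src)
{
    ChromaPlane dst(src.width * 2, src.height * 2);
    for (int y = 0; y < dst.height; y++)
        for (int x = 0; x < dst.width; x++)
            dst.At(x, y) = src.At(x / 2, y / 2);
    return dst;
}
```

A 2x2 block holding {0, 10, 20, 30} comes back as four 15s, so all the variation inside each block is simply gone no matter how many bits you spend, which fits the damage showing up at high bit rates.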

log rmse :

ms-ssim-scielab :

cbwave is a fixed pyramid structure wavelet doing daub97 horizontally and cdf22 vertically; the coder is a value coder (not bitplane) for speed. Some obvious things to improve it : fix the chroma subsample, try optimal weighting of color planes for perceptual quality, try daub97 vertical, try optimal per-image wavelet shapes, wavelet packets, directional wavelets, perceptual RDO, etc.

ASIDE : I'd like to try DLI or ADCTC , but neither of them support color, so I'm afraid they're out.

CAVEAT : again this is just one test image, so don't draw too many conclusions about what color space is best.

ADDENDUM : results on "moses.bmp" , a 1600x1600 with difficult texture like "barb" :

Again YUV is definitely best, KLT is definitely worst, and the others are right on top of each other.

10-14-10 | Image Comparison Part 5 : RAD VideoTest

VideoTest is my test video coder for RAD. It's based on "NewDCT" , in fact it has exactly the same DCT core, but it has a slightly better perceptual tuning, and it has a better RDO encoder.

log rmse :

scielab ms-ssim :

videotest vs. newdct is almost identical in rmse, but we did make a nice step up in perceptual measure.

I am finally beating JPEG ari in the perceptual measure, but it's a bit disturbing how much work I had to do! And of course PAQ JPEG still dominates.

The videotest I frame coder has pretty sophisticated RDO, but it's missing a lot of other things that modern coders have, it has no I predictors, no in-frame matches. It uses a little bit of a perceptual D measure for RDO, but not as well tweaked as x264 by a long shot.

videotest currently crashes for high bit rates; I remember I put some stupid fixed size buffer somewhere to get things working one day, and now I forget where it is :( So that's why there are no results for it above 2.0 bpp

10-14-10 | Xbox 360 vs my HTPC

I was looking at the Xbox 360 sitting on my HTPC, and it makes me really angry. For one thing, I have two perfectly good computers sitting right there that are totally redundant. Why is it so damn hard to play games on the PC?

But more than that, I was thinking my HTPC has got a dual-core AMD chip in it that was like $50, it's got an ATI Dx10 part that was again about $50. It's got a nice stereo-cabinet like case and it's very cool and quiet. I'm pretty sure I could make a console from off the shelf retail parts that would be faster than an Xbox 360 or PS3. Yeah yeah the 360/PS3 are faster in theory if you just count flops or something, but the massive advantage of a proper PC CISC OoO (out-of-order) core would make my homebrew console faster on most real world code.

Obviously those parts weren't so cheap when the 360 or PS3 were being developed. Also I guess the cost of all the little bits adds up : case, mobo, PSU, DVD drive, hard drive. Still, it's pretty upsetting that our consoles are these awful PowerPC chips with weird GPUs when a proper reasonable computer is so cheap.

10-14-10 | Image Comparison Part 4 : JPEG vs NewDCT

"NewDCT" is a test compressor that I did for RAD. Let's run it through our chart treatment.

For reference, I'll also include plain old JPEG huff , default settings, no PAQ.

log rmse :

scielab ms-ssim :

A few notes :

Note that the blue "jpeg" line is two different jpegs - the top one is jpegflatnosub , optimized for rmse, the bottom one is regular jpeg, optimized for perceptual metric. In contrast "newdct" is the same in both runs, which is semi-perceptual.

The first graph is mainly a demonstration of something retarded that people in literature and all over the net do all the time - they take standard jpeg_huff , which is designed for perceptual quality, and show PSNR/RMSE numbers for it. Obviously JPEG looks really bad when you do that and you say "it's easy to beat" , but you are wrong. It's retarded. Stop it.

In fact in the second graph we see that JPEG's perceptual optimization is so good that even shitty old jpeg_huff is competitive with my newdct above 1.0 bpp . Clearly I still have things to learn from JPEG.

I have no idea what's up with jpeg_paq going off the cliff for small file sizes; it becomes worse than jpeg_ari. Must be a problem in the PAQ jpeg stuff, or maybe an inherent weakness in PAQ on very small files that don't give it enough data to learn on.

Note that the three JPEG back ends always give us 3 horizontal points - they make the same output, only the file sizes are different. (I'm talking about the bottom chart, in the top chart there are two different jpegs and they make different output, as noted previously).

Below 0.50 bpp JPEG does in fact have a problem. All more modern coders have a straighter R/D line than JPEG, whose curve starts to slope down very fast. But, images generally look so bad down there that it's rather irrelevant. I noted before that the "money zone" is -1 to 1 in log bpp, that's where images look pretty good and you're getting a good value per bit.

10-14-10 | Image Comparison Part 3 : JPEG vs AIC

Testing Bilsen's AIC (AIC is a subset of H264 Intra without the good encoder of x264) :

Bilsen's AIC doesn't have the crippling low quality colorspace problem of x264, but JPEG just kills it on both metrics. Note that I use jpegflatnosub for rmse and jpeg default options for the perceptual metric.

I think we've already debunked the claims that JPEG is "easy to beat" or "not competitive with modern codecs" or that the "H264 Intra Predictors are a big advantage". (granted AIC is not the best modern codec by a long shot).

I should fill in some more details before I go further.

All the tests so far have been on one image, I made it with my camera by taking a RAW photo and scaling & cropping it from 4000x3000 down to 1920x1200 to reduce noise and improve chroma resolution. The image is called "my_soup" (maybe I'll post it somewhere for download). I will at some point run some tests on a bunch of images, because it's a bad idea to test on just one.

As I said before, the JPEG I'm using is just IJG , but I am losslessly recompressing the JPEGs with PAQ. I also tried the old JPEG -arith , and I found it's about half way between jpeg-huff and jpeg-PAQ, so I believe this is roughly a fair way of making the JPEG entropy coder back end "modern". I haven't really tried to optimize the JPEG encoding at all, for example there might be better quant matrices, or better options to give to IJG, and obviously you could easily add an unblock on the outside, etc. Without any of that stuff, JPEG is already competitive.

I should also take this chance to state the caveat : MS-SSIM-SCIELAB is in no way a proof of visual superiority. It's the best analytic metric I have handy that is pretty close to visual quality, but the only test we have for real visual quality at the moment is to look at the output with your own eyes.

The jpeg results are made like this :

jpegtest.bat :

c:\util\jpeg8b\cjpeg -dct float -optimize -quality %2 -outfile %1.jpg %1
paq8o8 -6 %1.jpg
call d %1.jpg*
c:\util\jpeg8b\djpeg -dct float -bmp -dither none -outfile de.bmp %1.jpg
namebysize de.bmp %1.jpg.paq8o8 jpeg_test_ .bmp

jpegtests.bat :

call jpegtest %1 5
call jpegtest %1 10
call jpegtest %1 15
call jpegtest %1 20
call jpegtest %1 25
call jpegtest %1 30
call jpegtest %1 40
call jpegtest %1 50
call jpegtest %1 60
call jpegtest %1 65
call jpegtest %1 70
call jpegtest %1 75
call jpegtest %1 80
call jpegtest %1 85
call jpegtest %1 90
call jpegtest %1 95
call jpegtest %1 100
call dele jpg_test\*
call mov jpeg_test_* jpg_test\
imdiff %1 jpg_test -c
call zr imdiff.csv jpg_imdiff.csv
transposecsv jpg_imdiff.csv jpg_trans.csv

10-14-10 | Xbox 360 vs PS3

I was about to buy a 360, then I thought hmm maybe I should consider a PS3 instead. To decide I made a list of games I'm interested in on each platform, which games are on both and which are exclusive. I came up with :

Both :
Assassin's Creed
Dragon Age
Elder Scrolls

Xbox360 :
Mass Effect

PS3 :
God of War

Hmm.. looks like there are actually more games for me on the PS3 than there are on the 360. In fact the 360 has only one exclusive game that I care about.

Generally the type of games I like just aren't made very much any more. I like "adventure" games or "light RPGs" , games that are mainly about wandering around exploring some big beautiful world, and not so much about fighting or puzzles or inventory management or any of that tedious crap.

It looks like I should get a PS3, not a 360. Did I miss something?

10-14-10 | Image Comparison Part 2

Well, I wanted to post JPEG vs x264 numbers, but there's a problem.

The first problem I had was that the headers out of x264 are very large. JPEG has about 240 bytes of header ; MP4 has about 1200 bytes of header for one frame of video only , the .x264 internal format has about 600 bytes of header. So I'm just doing a subtract to correct for that, but that's rather approximate and ugly since you can store side information in headers, etc. But whatever, that's not the big problem.
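The header correction is just a subtract before converting to bits per pixel, something like the sketch below. BitsPerPixel and Log2Bpp are names invented here, and taking the chart's "log bpp" axis to be log base 2 is my assumption (it lines up with 0.5 bpp sitting at -1, the bottom of the "money zone" mentioned in these posts) :

```cpp
#include <cassert>
#include <cmath>

// Payload bits per pixel after subtracting an approximate container/header size.
inline double BitsPerPixel(long fileBytes, long headerBytes, int width, int height)
{
    assert( width > 0 && height > 0 );
    assert( fileBytes >= headerBytes && headerBytes >= 0 );
    return 8.0 * double(fileBytes - headerBytes) / ( double(width) * double(height) );
}

// Chart x axis, assuming log base 2 so that 1 bpp maps to 0 and 0.5 bpp maps to -1.
inline double Log2Bpp(double bpp)
{
    assert( bpp > 0.0 );
    return std::log2(bpp);
}
```

For a 1920x1200 image, a 288240 byte file with 240 bytes of header works out to exactly 1.0 bpp, which lands at 0 on that axis.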

The big problem is that the "lossless" x264 is actually really bad (it's only lossless after color conversion). Here is the quality for lossless x264 :

rmse : 4.9177 , psnr : 34.3296
ssim : 0.9840 , perc : 88.5788%
scielab rmse : 3.140
scielab ssim : 0.9844 , angle : 88.7457%

I confirmed that this is caused by the chroma conversion & subsample by just making the y4m and then converting the y4m back to RGB, and I get the exact same numbers. You don't usually see this because people show PSNR in the Y color plane. That's absolutely terrible. For comparison, here's JPEG at 100% quality :

rmse : 2.1548 , psnr : 41.4967
ssim : 0.9928 , perc : 92.3420%
scielab rmse : 0.396
scielab ssim : 0.9997 , angle : 98.4714%

now JPEG q 100 isn't even lossless, but this is way more like what you'd expect. A 2.0 base rmse is about what you get from the "lossy" old school chroma conversion that JPEG does.
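If you want to convert between the rmse and psnr columns yourself, the numbers quoted in this post are consistent with psnr = 20 * log10( 256 / rmse ). Note that is a peak of 256 rather than the more common 255; that is inferred from the printed values, not from the imdiff source. A sketch (PsnrFromRmse is a made-up name) :

```cpp
#include <cassert>
#include <cmath>

// PSNR in dB from an RMSE measured on 8-bit data.
// The default peak of 256 is inferred from the numbers printed in this post;
// pass 255.0 for the more conventional definition.
inline double PsnrFromRmse(double rmse, double peak = 256.0)
{
    assert( rmse > 0.0 && peak > 0.0 );
    return 20.0 * std::log10( peak / rmse );
}
```

Plugging in 4.9177 gives about 34.33 and 2.1548 gives about 41.50, matching the two listings above to the precision printed.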

Unfortunately this large constant error ruins any attempt to measure x264's performance. You can see that the RMSE line for x264 is just offset way up :

x264 vs JPEG : log rmse vs log bpp

The slope is way better than JPEG so we think that if the color space wasn't throwing away so much information it would be much better.

So if anybody has a suggestion on how to run x264 without the destructive y4m color transform, I'd appreciate it. I believe the problem is that it's putting YUV back into bytes. I suspect that the color conversions and up/down sample are just not being done very well (I'm using ffmpeg) , maybe there's some -highquality setting that I don't know about that would make them better?

In any case, we can still see a few things. JPEG behaves very badly below 0.5 bpp while x264 degrades nicely into very low bit rates.

x264 vs JPEG : scielab SSIM :

And another point on graph scaling. SSIM is a dot product, and as such is nonlinear and distorting. In particular, in the functional range, SSIM values are usually between 0.980 and 0.990 , which is a tiny and confusing space.

One better way to map it is to turn the dot product into an angle (using acos). I then change the angle into a percent (100% = 0 degrees apart, 0% = 90 degrees apart). The "SSIM angle" plot looks like this :
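In code the mapping is about two lines. This is a sketch of the description above rather than the actual imdiff code; SsimAnglePercent is a made-up name, and clamping the input before the acos is my addition to keep rounding noise from producing a NaN :

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Map an SSIM value (treated as a cosine) to a percent :
// 100% = 0 degrees apart, 0% = 90 degrees apart.
inline double SsimAnglePercent(double ssim)
{
    assert( std::isfinite(ssim) );
    const double kHalfPi = 1.5707963267948966;
    const double c = std::min(1.0, std::max(-1.0, ssim)); // guard acos against rounding
    const double angle = std::acos(c);                    // radians, 0 when identical
    return 100.0 * ( 1.0 - angle / kHalfPi );
}
```

SSIM 0.9840 maps to roughly 88.6% and 0.9928 to roughly 92.4%, in line with the "perc" and "angle" columns printed above (those were presumably computed from unrounded SSIM values, so the last digits differ). The 0.980 - 0.990 band gets spread over about five percentage points instead of one hundredth of the axis.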

scielab SSIM angle :

In particular, it's slightly easier to see the separation between normal JPEG and the "flatnosub" (RMSE tuned) JPEG in the SSIM angle plot than in the regular SSIM plot. In regular SSIM everybody goes into this asymptote to 1 together and it makes a mess. This scielab MS-SSIM is semi-perceptual so it rewards JPEG over jpeg-flat-nosub.

You should basically ignore the x264 results here because of the aforementioned problem with large constant error.

ADDENDUM : you can repro the x264 problem like this :

ffmpeg -i my_soup.avi -vcodec libx264 -fpre presets\libx264-lossless_slow.ffpreset test.mkv
ffmpeg -i test.mkv -vcodec png out.png

imdiff my_soup.bmp out.png

and the result is :

rmse : 5.7906 , psnr : 32.9104
ssim : 0.9783 , perc : 86.7096%
scielab rmse : 2.755
scielab ssim : 0.9928 , angle : 92.3736%

or without even getting x264 involved :

ffmpeg -i my_soup.avi -pix_fmt yuv420p my_soup.y4m
ffmpeg -i my_soup.y4m -vcodec png uny4m.png

same results, which shows that the problem is the yuv420 conversion

I confirmed that my_soup.avi is in fact a perfect lossless RGB copy of my_soup.bmp ; Note that this result is even worse than the one I reported above. The one above was made by first converting the AVI to Y4M using Mencoder , so apparently that path is slightly higher quality than whatever ffmpeg is doing here.

ADDENDUM : I think ffmpeg/x264 use "broadcast standard" YUV in the 16-235 range instead of 0-255 , so that might be a large piece of the problem.
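To see why squeezing into 16-235 and back through bytes costs something, here is a small sketch for the luma channel only. It assumes the usual 219-step luma scaling and simple rounding; it says nothing about what ffmpeg actually does internally, and the function names are made up for illustration.

```cpp
#include <cmath>

// Full-range (0-255) luma to limited "broadcast" range (16-235), rounded to a byte.
// Assumes the usual 219-step scaling; this is an illustration, not ffmpeg's code path.
inline int full_to_limited(int y)
{
    return 16 + static_cast<int>(std::lround(y * 219.0 / 255.0));
}

// Inverse mapping, rounded and clamped back to 0-255.
inline int limited_to_full(int l)
{
    long v = std::lround((l - 16) * 255.0 / 219.0);
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return static_cast<int>(v);
}

// Count how many of the 256 full-range levels do not survive the byte round trip.
inline int count_lossy_levels()
{
    int lossy = 0;
    for (int y = 0; y < 256; ++y)
    {
        if (limited_to_full(full_to_limited(y)) != y)
            ++lossy;
    }
    return lossy;
}
```

Only 220 byte values are available in the limited range, so 36 of the 256 input levels cannot come back exactly no matter how carefully you round.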

latest attempt, still not good :

ffmpeg -f image2 -i my_soup.bmp -sws_flags +accurate_rnd+full_chroma_int -vcodec libx264 -fpre c:\progs\video_tools\ffmpeg-latest\presets\libx264-lossless_slow.ffpreset test.mkv
ffmpeg -i test.mkv -sws_flags +accurate_rnd+full_chroma_int -vcodec png uny4m_2.png
imdiff my_soup.bmp uny4m_2.png

10-13-10 | More RetardoGraphism

Funny that this pops up in my Google Reader just as I'm on a theme of writing about how people graph things wrong to abuse information.

We're shown this :

Wow! Move enabled is great !

Oh wait. He's testing std::sort on vector< set< int > > , which is pretty contrived and one of the classic "don't do" examples, but whatever, that's fine. Oh, and he's intentionally broken RVO (return value optimization) , which is pretty crucial to making the STL run fast.

Like hey, if you construct a synthetic example that does lots of copies without move and doesn't do them with move, it's faster!

Yeah, move is great, I'm all for it, but let's be clear what it's doing : it's making it easier to write code, and it's letting you use certain patterns that were previously verboten. It is *not* speeding up real world performant code. Existing real world high performance code *already* never copies large objects. You currently get around that by making heavy use of RVO , of swap() to avoid copies, and by using pointers to objects instead of objects by value. What move gives us is the ability to write fast code without worrying about those things.
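A small sketch of that point : RVO, member swap() and move all avoid the deep copy, and only a plain by-value copy pays for it. The Big type and its copy counter are hypothetical, made up for this illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical heavy type that counts how many deep copies are made.
struct Big
{
    static int copies;
    std::vector<int> data;

    explicit Big(std::size_t n) : data(n, 1) {}
    Big(const Big& o) : data(o.data) { ++copies; }
    Big& operator=(const Big& o) { data = o.data; ++copies; return *this; }
    Big(Big&& o) noexcept : data(std::move(o.data)) {}
    Big& operator=(Big&& o) noexcept { data = std::move(o.data); return *this; }

    void swap(Big& o) noexcept { data.swap(o.data); }   // the pre-move idiom
};
int Big::copies = 0;

// Returned by value : the compiler elides the copy (NRVO) or falls back to a move.
inline Big make_big(std::size_t n)
{
    Big b(n);
    return b;
}

inline int count_copies_demo()
{
    Big::copies = 0;
    Big a = make_big(1000);   // RVO or move, no deep copy
    Big b(10);
    b.swap(a);                // swap instead of assign, no deep copy
    Big c = std::move(b);     // move, no deep copy
    Big d = c;                // the only deep copy in this function
    (void)d;
    return Big::copies;
}
```

The first two lines are how fast code already avoided copies before move existed; move just lets you get the same result without thinking about it.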

The penalty of move (and rvalues in general) is yet more intellectual burden, something you have to be careful about and understand and implement in your new data types correctly, and yet another thing for people to do wrong.

10-12-10 | Image Comparison Part 1

First of all, let's talk about how we graph these things. For this comparo, I'm measuring RGB L2 error (aka MSE) using IJG JPEG. I'm comparing :

    jpeg : default settings (with -optimize)
    jpegflat : jpeg + flat quantization matrix
    jpegflatnosub : jpegflat + no subsample for chroma

what people usually plot is PSNR vs. bpp, which looks like this :

jpegs psnr vs bpp :

these psnr vs bpp graphs are total shit. I can't see anything. The area that you actually care about is around 0.5 - 1.5 bpp, and it's all crammed together. In particular it's impossible to tell if jpeg vs jpegflat is better, and I have no intuition for what the numbers mean. Please stop making these graphs right now.

(NOTE : bpp means bits per *pixel* not bits per byte; eg. uncompressed is 24 bpp)

IMO rmse is better than PSNR because it's more intuitive. 1 level of rmse is one pixel value, it's intuitive. Unfortunately, the plot is only slightly better :

jpegs rmse vs bpp :

What we want obviously is to expand the area around 1 bpp. The obvious thing should occur to us : our bpp scale is wrong. In particular, what we really care about are doublings of bpp - eg. 0.25 bpp, 0.5 bpp, 1.0 bpp, 2.0 bpp - the step from 0.25 to 0.50 is about the same importance as the step from 4.0 bpp to 8.0 bpp. Obviously we should use a log scale. A similar argument applies to rmse, so we have :

jpegs log rmse vs log bpp :

which is much clearer. It also is amazingly linear through the "money zone" of -1 to 1 (0.5 to 2.0 bpp) which is where JPEG performs well.
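For reference, a minimal sketch of how the plotted coordinates could be computed. It assumes log base 2, which is what makes 0.5 to 2.0 bpp land on -1 to 1; the struct and function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Bits per *pixel* of a compressed image (uncompressed 24-bit RGB gives 24).
inline double bits_per_pixel(std::size_t compressed_bytes, int width, int height)
{
    return 8.0 * static_cast<double>(compressed_bytes) /
           (static_cast<double>(width) * static_cast<double>(height));
}

// One point on the log-log plot. Base 2 is assumed, so each unit on either
// axis is one doubling.
struct LogLogPoint
{
    double log2_bpp;
    double log2_rmse;
};

inline LogLogPoint to_log_log(double bpp, double rmse)
{
    return LogLogPoint{ std::log2(bpp), std::log2(rmse) };
}
```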

BTW note of course PSNR is a log scale as well, it's just log rmse flipped upside down and then offset all weird by some constant. I don't like it as much, but PSNR vs. log bpp is okay :

PSNR vs log bpp :
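The "flipped and offset" relationship is just this, sketched with the peak value left as a parameter because the choice is an assumption : 255 is the common convention, but the imdiff output quoted in the addendum above (rmse 5.7906, psnr 32.9104) fits a peak of 256.

```cpp
#include <cmath>

// PSNR in dB from rmse. Equivalent to (constant) - 20*log10(rmse), ie. log rmse
// flipped upside down and offset. The peak value is an assumption left to the caller.
inline double psnr_from_rmse(double rmse, double peak)
{
    return 20.0 * std::log10(peak / rmse);
}
```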

Conclusion :

Plots vs log bpp are the way to go.

If you are showing L2 errors for JPEG you need to be using a flat quantization matrix with no subsampling of chroma. (note that I didn't optimize the relative scaling of the planes, which might change the results or improve them further).

Next post I'll move on to some perceptual measurements.

(BTW the JPEG numbers posted here are all with PAQ ; see later posts for full details)

10-12-10 | Corporate Tax

I don't understand corporate tax. What exactly does it tax? Obviously "corporate income", but what does that mean exactly? In particular, since salaries are deducted as expenses, the only things I can think of that are actually taxed are : 1. corporate profits that are kept as cash reserves, and 2. dividend payouts. And for (1) any corp worth its salt is going to find a way to defer that accounting against some future cost, so all I can see is dividend payouts. But furthermore, since the tax accounting of expenses is different than the real accounting, you have plenty of corps that pay out dividends and yet have "no profit" for tax purposes (eg. GE). So I'm a bit confused and I wonder if I'm missing some other category of money that counts as corporate profit.

Anyway, the thing that's been bothering me for a long time is that I've read a few times that sales or VAT taxes, eg. consumption taxes, don't hurt economies, but income taxes do. That just makes no sense to me, I don't understand it at all. It seems to me that for economic growth you want to maximize fluidity, the ease of money flow, and you want more transactions. Assuming there is some +EV utility from each exchange, more exchanges = growth. To encourage exchanges you should minimize friction. It seems to me that consumption taxes are a direct friction and should discourage purchases, and thus be very bad for economies.

So far as I can tell, the argument against corporate tax comes down to people like this : News N Economics O.K., so the Senate rebukes the VAT for what exactly

who pull some data and draw a pretty questionable line on it :

There are some major fallacies in this argument. First of all, correlation is not causation. Even if this did show that high corporate income tax rates are (negatively) correlated to GDP growth, that doesn't prove that lowering CIT helps growth. There are any number of possible reasons they could be correlated. One that seems obvious off hand from the graph is that it seems developing or poorer nations tend to have lower CIT for various reasons, and developing nations tend to have higher growth. If that hypothesis is correct, then the correlation of CIT to growth is purely incidental. Another possibility is that countries with healthy economies don't need high CITs to raise enough taxes. You can't just take two axes of a many-dimensional data set, you need to look at all the axes and do PCA to pull out the most correlated dimensions.

In any case, the graph is just showing the wrong thing for the X axis. It is showing the *nominal* CIT rate. That's bananas retarded bullshit. What you need to show is the *real* CIT rate. Perhaps even better is to show the dollars of CIT as a percentage of total tax collected - then we can see how having a large CIT as a fraction of your tax bundle affects growth.

This is clearly a case of disinformation, where people pull some numbers and make graphs and make it look like it was researched and studied and it's good hard facts - when in fact it is completely distorted bananas bullshit.

Interestingly, if you look at CIT collected as a fraction of GDP, the US actually has the 4th LOWEST CIT in the OECD - just 1.8 % , vs. a 2.5 % weighted average. (Germany, also nominally one of the highest, averages only 1%). CBO paper (PDF)

Obviously the vast majority of corporations find ways to wind up with no "income". (BTW there is an IRS watchdog that's supposed to catch executives who just check their profit number at year end and increase their salaries by that amount, but obviously there are plenty of legal ways to accomplish the same thing).

One reason is that the rules for what CIT applies to are not uniform across the OECD, so showing the nominal rate across countries where the word "corporation" doesn't even mean the same thing is kind of retarded. For example : in the US , over 50% of businesses are not subject to corporate tax , because they are sole owners who prefer to take the income as personal income tax : taxanalysts.com Featured Articles -- The Corporate Tax Conundrum

What The Top U.S. Companies Pay In Taxes - Forbes.com
The truth about tax burdens - OECD Observer
The Gap Between Statutory and Real Corporate Tax Rates
Tax Reform Would Pay Dividends
Roubini Global Economics - U.S. EconoMonitor
Putting U.S. Corporate Taxes in Perspective — Center on Budget and Policy Priorities
OECD Tax Database
Most U.S. firms paid no taxes over 7-year span - SFGate
Ezra Klein - The corporate tax shuffle
Dave Johnson Tax Tricks -- Do Corporations Pass Taxes on to Customers

There are a few canards going around that I think we can debunk.

1. High corporate income taxes discourage economic activity. It seems to me the opposite is true, high CIT encourages corps to pay out all profit as salaries or reinvest it in a deductible way. If anything that should be a stimulating motivator. Similarly, the idea that some executives are going to work less because of higher CIT is absurd since it simply doesn't apply to their income.

2. Corporations pass on income tax bills to their customers. As Dave Johnson wisely argued, this is just bunk. Corporations already price their goods as high as the market can tolerate. The floor of pricing is set by cost, and CIT is not a cost, since it applies to profit. Therefore CIT should have zero effect on prices to consumers. If anything CIT is a tax on investors who receive smaller dividends.

A few issues about high CIT that we should actually care about :

1. Very high nominal CIT coupled with extensive deductions forces corps to spend a lot of money on tax preparation. Obviously simplification of the tax code all around would be beneficial. Oddly, it is the corps themselves who have lobbied us into this morass of special exemptions.

2. The overlap of CIT and dividend tax is obviously a little weird. IMO dividends need to be taxed as normal income to close the loophole of major shareholders getting such low tax rates, and if that was fixed then CIT would have to come down. Or dividends could just be deducted from corp income, but then what would it be exactly?

3. Variation of CIT across countries for international corps creates strange incentives. Though of course tax variations across states and even special breaks in certain cities create weird distortions as well (see for example Microsoft's dodge in Nevada below).

Not related, but I also found this amazing piece of partisan claptrap from our own lovely government : The Economics of the Estate Tax
it starts out with "The estate tax, also known as a death tax," and then only gets better from there. With official studies like that from government committees, is it any wonder that we can't get our discussion above the level of absurdism ? BTW Also found some references on how the step-up in basis actually makes the estate tax a tax *break* for many heirs. Estate tax elimination could cost heirs
5 estate-tax myths that won't die - MSN Money

Also not related, but I found a huge site devoted to Microsoft's massive tax evasion : Microsoft Tax Dodge
Broke-ass Washington State Set to Give Microsoft $100 Million Annual Tax Cut and Amnesty for $1 Billion in Tax Evasion, Feed

10-11-10 | DeUnicode v1.0

DeUnicode v1.0 is up at : Binaries on cbloom.com .

usage :

DeUnicode v1.0 by cbloom
 DeUnicode [-opts] < dir >
-r : recurse
-t : special the
-a : ascii mode
-d : display only (don't do)
-q : quiet
-v : verbose

I highly recommend using -d for a while at first to make sure it's working right for you.

For some apps that don't handle even OEM / Console / "A" code page issues right, there's now a "-a" option to make the names into 7 bit ascii , which everything should work with.

Now if only I could run this on the entire internet ... my god I can't believe they let URL's be non-ascii...

BTW I changed my license (bpl.txt) from BSD to Zlib since it's more permissive.

10-11-10 | Windows 1252 to ASCII best fit

I'd like to construct a Windows 1252 to ASCII (7 bit) best fit visual character mapping (eg. accented a -> a , vertical bar linedraw -> | , etc.). I can't find it. ... okay I did it ..

const int c_windows1252_to_ascii[256] = 
{
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
 96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,
 35,102, 34, 46, 35, 35, 94, 35, 83, 60, 79, 35, 90, 35, 90, 35,
 35, 39, 39, 34, 34, 46, 45, 45,126, 84,115, 62,111, 35,122, 89,
 32, 33, 99, 35, 36, 89,124, 35, 35, 67, 97, 60, 35, 45, 82, 35,
 35, 35, 50, 51, 35, 35, 35, 46, 44, 49,111, 62, 35, 35, 35, 35,
 65, 65, 65, 65, 65, 65, 65, 67, 69, 69, 69, 69, 73, 73, 73, 73,
 68, 78, 79, 79, 79, 79, 79, 35, 79, 85, 85, 85, 85, 89, 35, 35,
 97, 97, 97, 97, 97, 97, 97, 99,101,101,101,101,105,105,105,105,
 35,110,111,111,111,111,111, 35,111,117,117,117,117,121, 35,121
};

( see this table as visible chars at cbloom.com )

this was generated by the Win32 functions and it's not perfect. It gets the accented chars right but it just gives up on the funny chars and it puts the "default" char in, which I set to "#" (35) which is probably not the ideal choice.

So anyway, it would be better to have a hand-tweaked table if you can find it.
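In the meantime, here is a minimal sketch of what a hand-tweaked table and its use could look like. The handful of overrides are my own illustrative picks for this example, not a complete or authoritative best-fit mapping, and make_demo_table / to_ascii are hypothetical names.

```cpp
#include <array>
#include <string>

// Build a 256-entry table : identity for 7-bit ASCII, a default char for everything
// else, then a few hand-picked overrides (illustrative only, far from complete).
inline std::array<unsigned char, 256> make_demo_table(unsigned char default_char)
{
    std::array<unsigned char, 256> t{};
    for (int i = 0; i < 256; ++i)
        t[i] = (i < 128) ? static_cast<unsigned char>(i) : default_char;

    t[0x85] = '.';   // horizontal ellipsis
    t[0x96] = '-';   // en dash
    t[0x97] = '-';   // em dash
    t[0xA0] = ' ';   // no-break space
    t[0xE9] = 'e';   // e with acute accent
    return t;
}

// Map every byte of a Windows 1252 string through the table.
inline std::string to_ascii(const std::string& in,
                            const std::array<unsigned char, 256>& table)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in)   // unsigned, so bytes >= 128 index the table correctly
        out.push_back(static_cast<char>(table[c]));
    return out;
}
```

Starting from a default-filled table and layering overrides on top makes it easy to see exactly which characters have been hand-tweaked and which are still falling through to "#".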

Some links I found that were not particularly helpful :

Character sets
Cast of Characters- ASCII, ANSI, UTF-8 and all that
ASCII Table 7-bit
ASCII Character Map
ANSI character set and equivalent Unicode and HTML characters
A tutorial on character code issues

Also, in further news of the printf with wide chars considered harmful front, I've discovered it can cause execution to break, not merely fail to convert the string well.

I get some wchar string from some perfectly reasonable source (such as MultiByteToWideChar or from a file name) and try to print it with printf %S (capital S for wide chars). The problem is at this point (output.c in the MSVC CRT) :

    e = _WCTOMB_S(&retval, L_buffer, _countof(L_buffer), *p++);
    if (e != 0 || retval == 0) {
        charsout = -1;

because it's decided that the wchar is no good for some reason. wctomb_s will fail "if the conversion is not possible in the current locale". It winds up failing the whole printf (which causes it to return -1 and set errno). WTF don't fail my entire printf because you can't map one of the wchars. So fucked.

(I also have no clue why this particular wchar was failing to convert; it was like a squiggly f looking thing, it showed up just fine in the MSVC watch window, but for some reason the CRT locale shit didn't like it).

see :

_set_invalid_parameter_handler (CRT)
_setmbcp (CRT)
setlocale, _wsetlocale (CRT)

Anyway, my recommended best practice remains "don't use wide chars in printf" , unless you use autoprintf and let it convert them to console code page for you. (note that wstrings are converted automatically, but raw wchars you have to call ToString() to make them convert)

If you use autoprintf and put this somewhere, it will handle the std string variants nicely :

inline const char * autoprintf_StringToChar (const std::string & rhs)
{
    return rhs.c_str();
}

inline const String ToString( const std::wstring & rhs)
{
    return autoPrintfWChar(rhs.c_str());
}
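If you aren't using autoprintf, the "don't fail the whole print over one bad wchar" idea can be sketched without any locale machinery at all. This is a much cruder stand-in than a real console code page conversion : it keeps 7-bit ASCII and replaces everything else. narrow_lossy is a hypothetical helper, not part of cblib or the CRT.

```cpp
#include <string>

// Convert a wide string to a narrow one without ever failing : 7-bit ASCII passes
// through, anything else becomes the fallback char. Hypothetical helper.
inline std::string narrow_lossy(const std::wstring& in, char fallback = '?')
{
    std::string out;
    out.reserve(in.size());
    for (wchar_t wc : in)
    {
        // Cast to unsigned so the test is correct whether wchar_t is signed or not.
        const bool is_ascii = static_cast<unsigned long>(wc) < 128u;
        out.push_back(is_ascii ? static_cast<char>(wc) : fallback);
    }
    return out;
}
```

You then print the result with a plain %s, so one unmappable character costs you a "?" instead of the whole line.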

10-11-10 | FFmpeg success

I've never gotten ffmpeg to work because I could never find a win32 build with everything integrated in it properly (for the win32 build to work easily you need to statically build in mingw, libavcodec, etc.). Anyway, I finally found one :

Automated FFmpeg Builds at arrozcru

And in particular it's got a mess of x264 presets included so you don't have to try to figure out that command line. In particular I can now do :

echo USAGE : [in] [speed] [out]
echo speed = medium,slow,fast,slower,etc.
%FFMPEG%\bin\ffmpeg -i %1 -threads 0 -acodec copy -vcodec libx264 -fpre %FFMPEG%\presets\libx264-%2.ffpreset %3

(my latest fucking video Babel problem is that god damn Youtube doesn't offer mp4 options for many videos anymore, they are all FLV, and my HTPC can't play FLV, so I have to convert youtubes friggle fracking frizzum). (BTW you can also change containers sometimes by using -sameq and not specifying anything else)

Some more reference :

ffmpeg.org FFmpeg FAQ
ffmpeg.org FFmpeg Documentation
VideoAudio Encoding Cheat Sheet for FFmpeg
Links - FFmpeg on Windows
FFmpeg x264 encoding guide robert.swain

BTW any time you are trying a format change pathway it's a good idea to check sync with something like this video

ADDENDUM : Well, as I predicted the Unicode Fuckitude on Windows command lines is coming true. FFmpeg won't open some videos with weird characters in file names, and Youtube in fact has a bunch of videos with weird chars in their names. The simplest solution is to run my Deunicode on your download directory periodically. Deunicode actually still allows 8-bit characters, I might make a -ascii option to restrict it even further to just 7 bit ascii. (8 bit characters work consistently on Windows but still have the annoying attribute that they may display differently on the CLI vs in Explorer).

BTW the problem is not really FFMpeg , it's the fucked up Windows Console CP thing. ( see previous ). The only way to make command line apps that work with unicode file names in Windows is to do the elaborate "GetUnicodeFileNameFromMatch" thing in cblib that I do. So far as I know I'm the only person in the whole world who actually does this (I guess not that many people actually use the command line in windows any more).

10-09-10 | Game Controls

I'm playing a bit of Xbox 360 for the first time ever on the loaner box from work (side note : jebus it makes a hell of a lot of racket, the DVD is crazy loud, the fans are crazy loud, it's pretty intolerable; I also get disc unreadable errors semi-frequently which I guess is because the work box is screwed, but between this and the Red Ring problems, I can only conclude that this console is built like a piece of shit, just epically bad manufacturing quality).

Anyway, one thing that I find very depressing is that the games have almost uniformly bad basic controls. This is the most fundamental aspect of any game and you all should be ashamed that you don't do it right.

Perhaps the most frustrating and retarded of them is that so many games just drop inputs. This happens because the game or the player is in some kind of state where you are not allowed to do that action. The result is that they just lose the button press. eg. say you have a jump and an attack. You can't do them at the same time. I jump, and then right about the time that I'm landing I hit attack. That needs to work whether or not I hit the attack button right before I land or right after I land.

There are various ways to do this, but an easy one is to store a little history of all inputs that have been seen in the last few millis (200 ms or so is usually enough), and whether or not they have been acted upon. Each frame, you act not just on the new inputs that are seen that frame, but also on inputs seen recently that have not been acted upon.
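A minimal sketch of that history buffer, assuming a millisecond clock supplied by the caller. The class, the Button enum and the callback shape are all hypothetical; a real game would hang this off its own input and state systems.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class Button { Jump, Attack };   // hypothetical button set

struct BufferedPress
{
    Button button;
    double time_ms;    // when the press was seen
    bool   consumed;   // has the game acted on it yet?
};

class InputBuffer
{
public:
    explicit InputBuffer(double window_ms = 200.0) : window_ms_(window_ms) {}

    void on_press(Button b, double now_ms)
    {
        presses_.push_back(BufferedPress{ b, now_ms, false });
    }

    // Call once per frame. can_act(button) says whether the current game state
    // allows the action; act(button) performs it. Returns how many presses were
    // acted upon this frame.
    template <typename CanAct, typename Act>
    int update(double now_ms, CanAct can_act, Act act)
    {
        // Forget presses that are older than the window.
        presses_.erase(
            std::remove_if(presses_.begin(), presses_.end(),
                [&](const BufferedPress& p) { return now_ms - p.time_ms > window_ms_; }),
            presses_.end());

        int acted = 0;
        for (BufferedPress& p : presses_)
        {
            if (!p.consumed && can_act(p.button))
            {
                act(p.button);
                p.consumed = true;
                ++acted;
            }
        }
        return acted;
    }

    // Number of recent presses still waiting to be acted upon.
    std::size_t pending() const
    {
        std::size_t n = 0;
        for (const BufferedPress& p : presses_)
            if (!p.consumed) ++n;
        return n;
    }

private:
    double window_ms_;
    std::vector<BufferedPress> presses_;
};
```

With this, an attack pressed a frame or two before landing just sits in the buffer until can_act says yes, instead of being dropped.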

For doing things like "holding down A makes you sprint" , you need to be checking for button "is down" *not* the edge event. Unbelievably, some major games do this wrong, and the result is that they can miss the "is down" event, so I'm running around holding A and I'm not sprinting. (even if you do all the processing right as above, the player might press A down during the inventory screen or something like that when you are ignoring it).

For simultaneous button press stuff, you also need to allow some slop of course. Say you want to do something when both A and B are pressed. When you see an "A down" event you need to immediately start doing the action for A to avoid latency, but then if you see a "B down" within 50 millis or whatever, then you need to cancel the A action and start an "AB simultaneous press" action.
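Sketched as a tiny state machine, assuming just two buttons and a caller-supplied millisecond clock. ChordDetector and Action are hypothetical names; the caller is expected to cancel the single-button action when the return value changes to AB.

```cpp
enum class Action { None, A, B, AB };   // hypothetical action set

class ChordDetector
{
public:
    explicit ChordDetector(double slop_ms = 50.0) : slop_ms_(slop_ms) {}

    // Feed a button-down event (is_a : true for "A down", false for "B down").
    // Returns the action that should be running after this event.
    Action on_down(bool is_a, double now_ms)
    {
        const Action single = is_a ? Action::A : Action::B;

        // Is the *other* single-button action running, and did it start recently?
        const bool other_recent =
            current_ != Action::None &&
            current_ != Action::AB &&
            current_ != single &&
            (now_ms - start_ms_) <= slop_ms_;

        if (other_recent)
        {
            current_ = Action::AB;    // cancel the single action, start the chord
        }
        else
        {
            current_ = single;        // start immediately, no added latency
            start_ms_ = now_ms;
        }
        return current_;
    }

private:
    double slop_ms_;
    double start_ms_ = 0.0;
    Action current_ = Action::None;
};
```

The key design choice is that A starts on the very first event; the chord is an upgrade applied after the fact, so single presses never pay the 50 ms wait.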

One pretty bad trend is that lots of people are using something like Havok for physics and they run their player motion through it. They are often lazy about this and just get it working the same way one of the Havok demos did it, and the result is incredibly janky player motion, where you stutter, get caught on shit in the world, etc. There's a mistaken belief that more finely detailed collision geometry is better. Not so. As much as I don't really love the old Quake super-slidey style of motion that feels like the world is all greased up, it's better than this stuttering and getting caught on shit. The player's collision proxy should be rotationally symmetric around the vertical axis - you shouldn't be able to get stuck on junk by rotating. A vertical lozenge is probably the best player collision proxy.

A context-dependent generic "action" button is a perfectly reasonable thing to do and lots of games do it now, *however* it should not be one of your major buttons. eg. when you are not at an "action" location, the action button needs to do something pretty irrelevant to gameplay, eg. it should not be your "jump" or "attack" button or something important. That is, just because you're at an action location, you shouldn't lose the ability to do something major that you

Automatic lockons for combat and such are okay, but you need to provide a way to get in and out of that mode, or a button to hold down to override it or something. Specifically, you should never take over control of the player's movement or camera without their having some ability to override it.

It's well known that for 1st person games you need to provide an "invert look" option, but IMO it's just as important to provide inverts for 3rd person cameras too, not just up-down but also left-right. Everyone has a different idea of what the right way for a 3rd person camera to move is, so just expose it, it's fucking trivial, do it.

It's very depressing to me that devs don't spend time on this basic shit.

It basically comes down to an incorrect philosophy of design. The incorrect philosophy is :

we have given the player a way in which it is possible for them to make the character do what they want

the correct philosophy is :

the character should at all times do what the player obviously wants it to do

and any time you fail in that, you have a major deficiency that should be fixed. Any time you see someone press a button and the character doesn't do what they wanted, don't coach the player on how to do it right - fix it so that the game just does the right thing.

A couple other minor points :

You have to be careful about the keys that you use in your menu system and what they do in the game world. This is a general important UI issue (on mouse-based interfaces, you should be aware of where buttons on popups lie on top of buttons underneath them - the issue is if a user clicks the popup multiple times and it disappears they may accidentally click beneath it, so don't put a "delete all my work" button under the popup).

In games in particular, if you have A = okay and B = back in your menu system, then you should make sure that if the user is hammering on "B" and comes out of the menu system and a press applies to the game world, that it won't fuck them.

Context-sensitive and tap/hold buttons and modifiers (eg. hold R1 changes the action of "A") are all okay for overloading, but *modal* controls are never okay. That is, if I press R1 it changes my mode and then my buttons do different things. Modal controls are absolutely awful because it removes the player's ability to act from muscle memory. It means I can't just hammer on "A" in a panic and get the expected result, I might wind up doing the wrong thing and going "fuck I'm in the wrong mode". Even worse are modal controls where the mode is not displayed clearly on screen in a reliable place because the game is trying to do an "immersive" HUD-less interface.

I imagine most people would agree that modal controls are terrible, but in fact many games do them. For example any time you press a button to go into a selection wheel for spell casting or dialog or something like that it's a modal control (a better way to do this is hold the button down to bring up the wheel, but really selection wheels are pretty fucking awful in general).

That is, for each action in the game there should be a unique physical player motion. eg. "fireball" should be hold L1 , dpad up, release. Actions should never move around on the wheel, they should be in a reliable place, and there should never be any conditionals in the action sequence to do a certain action.

Another type of modality is when certain actions are not allowed due to some state of the world. eg. maybe you're near friendlies so your attacks are disabled, or you're in a cut scene so you can't save. Again most games get this wrong and don't correctly acknowledge that they have created hidden modes that affect input. The modes need to be clearly displayed - often there's no reliable way to tell you are in a certain mode or not, for example a cut scene needs to be clearly separated from normal gameplay (eg. by bringing in vignetting or letterboxing or something) if it changes my ability to use controls. If I can't do my attacks or sprint or whatever, I should at least get some acknowledgement from the game that it received my input, and it's just not doing it.

10-09-10 | The right way to sell a used car

I've always thought the right way to buy a used car is from a mechanic, with a lifetime service warranty. That is, the mechanic finds the car, inspects it, and sets a price. There's a purchase price and a per year service subscription which covers all maintenance and any possible repair needed. That way he can't trick you and sell you a lemon, because he would have to cover the repairs. It lets the buyer just pick a car without agonizing over research. If you believe in capitalist markets, it would be much better for the system because it removes the problems of unequal information and increases fluidity, eg. if I want a Datsun 240Z I just go buy one and don't have to sweat researching it forever.

It's not that I don't want them to make a profit - they should price the car and the service plan at a level where they expect to make a profit obviously. But by having at least the option to buy the lifetime coverage it means they have to show you the real expected cost, and then you can make a decision.

Unless they're unified, it doesn't make sense. eg. a mechanic can't easily offer a lifetime service plan, because they don't know the state of the car that you want covered, they would have to do an extensive checkup on the car before and it would be hard to price against expected cost. Separately, it makes no sense to buy a 3rd party extended warranty when it's not the mechanic offering it, because the warranty company is only interested in you not getting reimbursed, and the mechanic hates to deal with the warranty company.

Of course this doesn't exist. The reason is that it removes *three* different ways for them to fuck you over and make not just a fair profit, but an exploitative profit. 1. used car sellers of course do want to sell you lemons at inflated prices, and this stops them, 2. mechanics of course want to overcharge you and do unnecessary work, and this stops them, and 3. warranty companies want to charge you a lot and then not actually pay out claims due to some technicality or exclusion.

But amazingly it looks like Hartech in the UK actually offers the comprehensive sale-to-service plan.

10-08-10 | Optimal Baseline JPEG

One of the things we are missing is a really good modern JPEG encoder/decoder. I mentioned most of this in the WebP post, but I thought it was important enough to repeat. This would be a great project if someone wants to do it; I'd like to, I think it's actually important, not just as a fair comparison between modern coders like x264 and good old JPEG, but also because it would actually be useful to people who care about JPEG images. (eg. a common use case is you have some old jpeg and you want to decode it as well as possible).

Using normal JPEG code streams, but trying to make the encoder & decoder as good as possible, you should do something like :

Encoder :

Decoder :

10-08-10 | Charity

I've been thinking about this Giving Pledge thing quite a lot since it was announced. Obviously it's good and all, but something bothers me about it.

The problem is that it perpetuates the myth that private charities are a more worthy recipient of money than government. Most of these people have avoided giving their fair share of taxes to the government their whole lives, and now they are giving gobs of it to charities.

That's not really what we need. It would have been more beneficial for longer if they had simply worked to get the high income tax rate raised.

The things that America really needs more of are basic government services. We need money for schools and teachers, we need money for the homeless and unemployment, we need money for prisoner rehabilitation and drug programs, we need money for libraries and roads.

What we don't really need is more money for "alternative education" and laptops for schools and "measurement based learning evaluation" and all that marginal shit that the Bill Gates foundation does.

10-08-10 | Old News

This is from 2006 but I just found it and it's fucking hilarious.

Skim the article, then read through the comments, and then a few pages down it gets AMAZING :

Why Windows Threads Are Better Than POSIX Threads

10-08-10 | Stress

There's a common over-simplification that stress is caused by "work" or "over-work" or just having a lot of todos on your plate. That's not really true. In fact having todos that you think are merited and that you can do yourself is not really stressful, you just knock them out, if one is eating at you, you just do it. So what is stressful?

Todos that are on your list but rely on someone else to do. When something is your responsibility or affects you or you're dependent on it for some later work, but you can't just do it yourself. Even if the other person is totally cool and reasonable it's stressful because you have to be like "eh, is this done yet?" and you feel bad about pestering them too much, etc.

Things that you know need to be done but aren't your responsibility or you can't make someone do it. For example when you're a coder and you're working on a game and you think the art or design or sound or whatever is really doing something wrong, and maybe you try to say something about it nicely but nobody pays attention to you, and day after day you just keep thinking "ugh this sucks it needs to be different" but you can't do anything about it. Or in code when you're a low level guy and you think the engine architecture or coding methods or something basic are totally fucked but the lead doesn't want to do anything about it.

Expectations that are not voiced or even opposite-voiced. One of the great stress causers is when your boss/lover doesn't tell you what they need from you, and then gets mad at you for not doing it. You wind up being like "ahh, I didn't know, what did I do?". Often expectations are even opposite-voiced, as in when your boss says "I don't expect you to work overtime or come in weekends" , but when somebody does do those things they get a bonus and the boss loves them, or when your lover says "I don't need you to buy me jewelry and flowers all the time" when in reality they do. Of course the boss/lover is not entirely to blame in these scenarios, a certain amount of opposite-speak is to be expected and it is the mature man who can ignore the literal words and listen to the actual meaning. However, when you are put in the situation that you have to guess what you're supposed to do, the penalty or reward can seem capricious and you wind up constantly in fear that you aren't doing something that you should be doing, which leads to extreme stress.

Tasks that are unclear, especially when you don't have the freedom to really do them the way you want, or you're told that you have more freedom than you really have. The worst thing is probably when the task-giver actually does have a clear idea of what they want, but doesn't tell you. They'll just tell you something vague like "optimize the game", so you go off and do some things, then come back and they go "oh no no, not that, more like this". Usually asking doesn't help, you'll be like "do you want this or this?" and they say "oh whatever, I don't care", but then you do something and they go "oh, not that".

Conflicts of reality / expectations that you know are false. This is very common in games where the producers and leads are telling everyone that you're going to ship in 6 months, but you've looked at the game and you know damn well it's more like 12 months. There's a conflict between what you know to be the reality and what they are claiming, and this causes constant stress, because they are giving you crunch-like tasks to wrap things up and you know they aren't the right tasks and the crunch is unwarranted, etc.

As an underling who is afflicted by these stressors, there's not really a whole lot that you can do about it. But as a stress causer, you can be aware of what you are doing to others and try to reduce it.

10-08-10 | The Greatest Con

The Greatest Con is when the powers that be convince people that what they really want is the thing that's good for The Powers. The oppressed take up the cause of their oppressor and become the advocates of their own destruction.

I see shit like this on the web all the time :

"Class action = thousands of those damaged getting 5 bucks each + one law firm getting millions."

Well you've really bought the lie my friend, that lawyers are evil and using class actions as a tool to reform corporate behavior is pointless. Of course along with that was the lie that the punitive damages were excessive so now we have limits all over which make most settlements against corporations just slaps on the wrist.

But there are tons of these diseases. It's sickening when people are so convinced of some bullshit that they actively fight to screw themselves. They believe in something that keeps themselves down and penalizes themselves, and the powers just laugh and laugh.

Poor people convinced that high taxes on the rich are "unfair" or that estate taxes are "double taxation".

Soldiers convinced that risking their lives for some ill-conceived cowboy bullshit is "noble".

Consumers everywhere convinced that they have to buy the latest disposable bullshit to be "green" or "fashionable".

Middle class people believing they need to invest in risky stocks or leverage up to buy big houses. Of course all this does is give money to brokers.

10-07-10 | Portable CRT God Dammit

What a god damn disaster. How is it that none of us have made our own portable wrapper for the CRT? I want a standard interface to the functions that do exist, and a standard way to ask for "does this exist", eg. something like :

#if fseek64_exists



int32 off32 = check_value_cast_throw<int32>(off64);


Mostly it should just be a wrapper that passes through to the underlying CRT, but for some things that are platform independent (eg. string stuff, sprintfs, etc.) it could just be its own implementation.
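As a rough sketch of the kind of helper I mean, here's one way a checked narrowing cast could look. The name check_value_cast_throw comes from the snippet above, but this body is just my assumption about what it would do (round-trip the value and throw if it doesn't survive); it's not taken from any existing library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical implementation: narrow 'value' to type To, and throw if the
// narrowed value does not represent the same number.
template <typename To, typename From>
To check_value_cast_throw(From value)
{
    const To narrowed = static_cast<To>(value);
    // The value must survive a round trip back to the source type ...
    const bool round_trips = (static_cast<From>(narrowed) == value);
    // ... and must not change sign (catches large unsigned -> signed wrap).
    const bool same_sign = ((narrowed < To(0)) == (value < From(0)));
    if (!round_trips || !same_sign)
        throw std::range_error("check_value_cast_throw: value does not fit");
    return narrowed;
}
```

So on a platform with only a 32-bit seek you would do the cast on the 64-bit offset first, and get a loud failure instead of silently seeking to the wrong place.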

I see that Sean has started a portable types header called sophist that gives you sized types (int16 etc) and some #defines to check to get info about your platform. That's a good start.

For speed work you'd like some more things like "size of register" and something like "intr" (an int the size of a register) (one big issue here is whether the 64 bit type fits in a register or not). Also things like "can type be used unaligned".

Obviously C99 would help a lot, but even it wouldn't be everything. You want the stuff above that tells you a bit more about your platform and exposes low-level ops a bit. You also want stuff that's at the "POSIX" library level as opposed to just the CRT, eg. dir ops & renames and truncate and chmod and all that kind of stuff.

Every time I do portability work I think "god damn I wish I just made my own portability library" but instead I don't do it and just hack enough to make the current project work. If I had just done it the clean way from the beginning I would have saved a lot of work and been happier and made something that was useful to other people. And.. I'm just doing it the hacky way yet again.

(actually Boost addresses a lot of this but is just sick over-complex and inconsistent in quality; Boost.Thread looks pretty good and has Win32 condition variables, for example). I also just randomly found this ptypes lib which is pretty good for Win32 vs POSIX implementations of threading stuff.

10-04-10 | Time Wasting

I can watch this rnickeymouse channel of motorcycles on "the snake" on Mulholland all day long. I wish he wouldn't spill the money shot right in the titles though, it would be more fun if you got to see the rider approaching and guess if he's gonna make it or not.

I love these MGdM music videos :
XCÈNTRIC That's not entertainment on Vimeo
LOS PLANETAS Alegrías del incendio on Vimeo
EL GUINCHO Bombay on Vimeo

I can't stop listening to the Whoa-B Dubstep Mix . I used to think I didn't like "Dubstep" because I'd only heard horrible party anthem shit like Rusko or overly noisy cacophonous shit like Gaslamp Killer, both of which I hate, but I like this deep smooth shit.

MIMIC-TPW nice animation of the Pacific weather coming at Seattle.

10-02-10 | WebP

Well, since we've done this in comments and emails and now some other people have gone over it, I'll go ahead and add my official take too.

Basically from what I have seen of WebP it's a joke. It may or may not be better than JPEG. We don't really know yet. The people who have done the test methodology obviously don't have image compression background.

If you would like to learn how to present a lossy image compressor's performance, it should be something like these :

Lossy Image Compression Results - Image Compression Benchmark
S3TC and FXT1 texture compression
H.264AVC intra coding and JPEG 2000 comparison

eg. you need to work on a good corpus of source images, you need to study various bit rates, you need to use perceptual quality metrics, etc. Unfortunately there is not a standardized way to do this, so you have to present a bunch of things (I suggest MS-SSIM-SCIELAB but that is nonstandard).

Furthermore, the question "is it better than JPEG" is the wrong question. Of course you can make an image format that's better than JPEG. JPEG is 20-30 years old. The question is : is it better than other lossy image formats we could make. It's like if I published a new sort algorithm and showed how much better it was than bubblesort. Mkay. How does it do vs things that are actually state of the art? DLI ? ADCTC ? Why should we like your image compressor that beats JPEG better than any of the other ones? You need to show some data points for software complexity, speed, and memory use.

As for the VP8 format itself, I suspect it is slightly better than JPEG, but this is a little more subtle than people think. So far as I can tell the people in the Google study were using a JPEG with perceptual quantization matrices and then measuring PSNR. That's a big "image compression 101" mistake. The thing about JPEG is that it is actually very well tuned to the human visual system (*1); that tuning of course actually hurts PSNR. So it's very easy to beat JPEG in terms of PSNR/RMSE but in fact make output that looks worse. (this is the case with JPEG-XR / HD-PHOTO for example, and sometimes with JPEG2000 ). At the moment the VP8 codec is not visually tuned, but some day it could be, and when it eventually is, I'm sure it could beat JPEG.

That's the advantage of VP8 over JPEG - there's a decent amount of flexibility in the code stream, which means you can make an optimizing encoder that targets perceptual metrics. This is also what makes x264 so good; I don't think Dark Shikari actually realizes this, but the really great thing about the predictors in the H264 I frames is not that they help quality inherently, it's that they give you flexibility in the encoder. That is, for example, if you are targeting RMSE and you don't do trellis quantization, then predictors are not a very big win at all. They only become a big win when you let your encoder do RDO and start making decisions about throwing away coefficients and variable quantization, because then the predictors give you different residual shapes, which give you different types of error after transform and quantization. That is, it lets the encoder choose what the error looks like, and if your encoder knows what kinds of errors look better, that is very strong. (it's also good just when targeting RMSE if you do RDO, because it lets the encoder choose residual shapes which are easier to code in an R/D sense with your particular transform/backend coder).

My first question when somebody says they can beat JPEG is "did you try the trivial improvements to JPEG first?". First of all, even with the normal JPEG code stream you can do a better encoder. You can do quantization matrix optimization (DCTune), you can do "trellis quantization" (thresholding output coefficients to improve R/D), you can sample chroma in various ways. With the standard code stream, in the decoder you can do things like deblocking filters and luma-aided chroma upsample. You should of course also use a good quality JPEG Encoder such as "JPEG Wizard" and a lossless JPEG compressor ( also here ). (PAQ8PX, Stuffit 14, and PackJPG all work by decoding the JPEG then re-encoding it with a new entropy encoder, so they are equivalent to replacing the JPEG entropy coder with a modern one).

(BTW this is sort of off topic, but note that the above "good JPEG" is still lagging behind what a modern JPEG would be like. Modern JPEG would have a new context/arithmetic entropy coder, an RDO bit allocation, perceptual quality metric, per-block variable quantization, optional 16x16 blocks (and maybe 16x8,8x16), maybe a per-image color matrix, an in-loop deblocker, perhaps a deringing filter. You might want a tiny bit more encoder choice, so maybe a few prediction modes or something else (maybe an alternative transform to choose, like a 45 degree rotated directional DCT or something, you could do per-region quantization matrices, etc).)

BTW I'd like to see people stop showing Luma-only SSIM results for images that were compressed in color. If you are going to show only luma SSIM results, then you need to compress the images as grayscale. The various image formats do not treat color the same way and do not allocate bits the same way, so you are basically favoring the algorithms that give less bits to chroma when you show Y results for color image compressions.

In terms of the web, it makes a lot more sense to me to use a lossless recompressor that doesn't decode the JPEG and re-encode it. That causes pointless damage to the pixels. Better to leave the DCT coefficients alone, maybe threshold a few to zero, recompress with a new entropy coder, and then when the client receives it, turn it back into regular JPEG. That way people get to still work with JPEGs that they know and love.

This just smells all over of an ill-conceived pointless idea which frankly is getting a lot more attention than it deserves just because it has the Google name on it. One thing we don't need is more pointless image formats which are neither feature rich nor big improvements in quality which make users say "meh". JPEG2000 and HD-Photo have already fucked that up and created yet more of a Babel of file types.

(footnote *1 : actually something that needs to be done is JPEG needs to be re-tuned for modern viewing conditions; when it was tweaked we were on CRT's at much lower res, now we're on LCD's with much smaller pixels, they need to do all that threshold of detection testing again and make a new quantization matrix. Also, the 8x8 block size is too small for modern image sizes, so we really should have 16x16 visual quantization coefficients).

10-01-10 | Some Car News

Lots of Cool news from Paris :

Paris 2010 2012 Ford Focus ST hatches early — Autoblog - finally euro "hot hatches" are coming to America. (obviously we have things like the Mini, GTI, Mazdaspeed3, but none of them really quite qualify). Unfortunately still no signs of the high end hot hatches. And WTF ever happened to the Fiat 500 that was supposed to take us by storm? By the time it comes out here we'll all be in hovercars. I'd love to see the Fiesta rally car made into a production model, but it looks like that won't happen, and even if it did it won't come to the US (FYI the Fiesta is a smaller Focus).

(BTW the right rules for WRC Rally cars : build whatever you want, but there must be a homologation version available in large quantities that must be sold for $40k or less; apparently they no longer require homologation)

Hyundai pledges to hit 40 mpg with 2012 Veloster, beat out Honda CR-Z — Autoblog - I'm glad to see some car manufacturers are pursuing the correct way to make green vehicles. Light weight, small, efficient engines. Hybrids are such a fucking scam, it makes me sick that they are underwritten by my tax dollars so that people can drive cars that are too big and too heavy while pretending to be green. The Mazda2 is also a cool car, only 2000 pounds which is absolutely shocking by modern standards. I'd love to see a Mazda2 with a high-revving 2 liter. The Fiesta ECOnetic for example is a great green car (of course not coming to America because we suck).

(Americans are basically completely retarded; we claim to want "green" and "sustainable" but have to buy the latest disposable thing all the time, we don't buy hatchbacks even though they are absolutely the most practical car design (station wagons either); we don't like diesel; we like giant trucks for no reason; we believe in angels and actually think Obama is a muslim , seriously WTF you are all fired, get out of my country).

2014 Lotus Elan First Look - 2010 Paris Auto Show - Automobile Magazine , all the new Lotuses look identical so I'll pick the one that is in the sweet spot I want - the Elan is a 2+2 mediumweight (2900 pounds or so) coupe that should absolutely kick the ass of the venerable Porsche 911. It's got more power, less weight, a better mid-engine layout. I'm sure it will have shitty electrics and reliability problems, but hey so do Porsches. It's basically the best of the Cayman (the weight and geometry) with the practicality (2+2) of the 911. It does cost more like a GT3 than a base 911. It replaces the Evora which is a piece of shit worst of all worlds monster; it's too small, too weak in the engine, and yet somehow also too heavy. Somehow the Elan massively fixes all of those issues.

We should also note that all the new Lotuses are disappointingly heavy, and most lack manual transmission options, and they have joined that unfortunate design trend of not giving you any windows. It's a shame that Lotus insists on building their own cars. It would be much better if they were just a division of Toyota, so they could focus just on design and let Toyota do the building (since they're made of Toyota bits anyway, and Toyota is much better at putting them together well).

In other news, it looks like there are some problems with the new Subaru WRX's blowing engines; something about stripped bearings and rods knocking. So far it seems that Subaru is being decent about replacing them under warranty, but NOT if you chipped your car and did larger turbos and all that. I keep thinking about buying one of these because they are epically practical around here, and N and the Porsche have taught me the joy of ripping up country dirt/gravel roads, which a WRX with rally mods would be awesome for. This is a warning to myself :

2009 WRX Engine Failure Poll - Page 2 - NASIOC
2009 Impreza WRX motor issues - Page 7 - NASIOC
09 WRX Engine Failures - ClubWRX Forum - Subaru Impreza WRX and STi Community and Forums

Also apparently the 2008's were just terrible in various ways; if you like the new WRX hatch (like I do), it looks like you have to go to 2010.

The more I read about Porsches the more it seems the M96 (996, Boxster 986) engines were just complete shit. Everyone knows about the common and disastrous IMS failures ( YouTube - Porsche IMS Bearing Failure Explained ), but there were also RMS failures, cylinder sleeve failures, timing chain failures, etc. etc. basically every part of the engine was made too cheaply and can blow to bits, and worst of all Porsche never owned up to it and consistently screws over their customers. It's absolutely shameful and yet the brand seems to have taken almost no hit from it. A few of the 996 owners who blew engines left the brand, but lots didn't, for example :

WTF, you're like an abused wife. Your engine just blew up and Porsche took a shit in your mouth, and now you want to buy almost the same car again?

In theory the 997 has upgraded bits that address the failures, but we'll see. Of course I think it's shitty that Porsche has done this, but it's really not that unusual for any manufacturer. BMW knows the HPFP's in most of the new 135/335/etc line are prone to failure, and they improved them over the years. Of course they don't just give you the improved version if you have an older car, and they will even deny the fact that it was a design flaw that was fixed. They'll claim the old one is just fine, but the new one is "even better". This is just standard practice. The thing that makes it really especially egregious with Porsche is that they make so much profit ($28k per car) that they could have easily covered the cost of an engine per car ($15k) and still made plenty. When you buy a $12k Hyundai you shouldn't really expect too much in dealer freebies (but you actually do get a lot!), while a $90k Porsche buyer might reasonably expect some love. No. Shame on you, I hope the Elan is great.

One last thing on the whole Porsche 996 debacle : LN Engineering and Flat 6 Innovations are the top resources for info on this. The thing is, the 986 Boxsters and 996 Carreras have taken such a huge hit from this in resale value, that they are actually a good deal again. You can get a 986 Boxster with a blown engine for $5k , and then send the engine up to Flat6 and have them rebuild it for you with all the new LNE replacement bits, maybe $10k for the rebuild, and you wind up spending $15k for a really great car, which actually has a strong reliable engine after those guys are done with it (or you can buy one that's not blown for $12k and spend $3k getting the upgraded bits put on and be in the same spot).

Just as another random example Cadillac NorthStar engines were similar disposable shit. Lots of car makers have these complete shit products, and various years of production that you absolutely should never buy. It's not surprising that the car makers try to hush it up, but where the fuck are our consumer advocates? Maybe if they got called on it more they would actually do something about it. Why doesn't Consumer Reports track this stuff better and say "x year of this model is a lemon, do not buy". There are hundreds of car review web sites, and you could go read them for days and not see mention of any of this kind of stuff.

Oh, and finally, here's a good tip : if you're buying a used car, get yourself an OBD2 reader. Every car now has a generic serial port called OBD2 you can plug into and pull history off the ECU. It will tell you if any faults have ever triggered in the history of the car. It should also tell the date that the ECU started recording, so you can tell if it's been wiped or replaced. An OBD2 reader only costs like $50 and you can get them at any auto parts shop.

If the car looks good, you might want to get a compression or leakdown test. You can do this pretty easily yourself (depending on how hard it is to get to the spark plugs on that car, some cars make it a fucking PITA), or it should cost about $100 to have a shop do it for you. In my experience, having a mechanic do an "inspection" of a used car is pretty useless. All they do is turn it on and listen for a second, the same thing you do (and they should pull the OBD2 codes, but you already did that so you don't need them to do it). A compression/leakdown test makes them actually get their hands dirty, and it will tell you if you have bad failures in the cylinders / valves / seals.

P.S. : SCCA car classes ( nice HTML table version from 2003 ) is a pretty cool way to tell how fast a car *really* is. Cars within a class should run roughly comparable times on an autocross course. Unlike the stupid magazine tests that are done by incompetent drivers in cars that are not track-prepped with retarded unequal tires (some on run flats and some on R-comps) in unequal weather conditions and with insufficient sample size, or stupid 0-60 times which are determined more by launch technique and gearing than anything else, the SCCA classes are built from significant statistical samples, and the cars are actually run by people who know how to drive them, know how to set them up, and are trying to get the best possible time in each car. And autocross is a good way to judge "quickness", which is what I really want in a street car (as opposed to "speed")

10-01-10 | Some Data Compression Corpora We Need Badly

If somebody wants a university project, these would be nice :

1. A lossless data compression corpus that is *broad* and also *representative*. That is, there are many types of data (probably 100-1000 files), some small, some large. Importantly the type of correlation structure in the data should be very diverse (eg. not just a ton of different English text files or executables). Too many of the corpora are simply too small, and even the ones that are reasonably large are too self-redundant, they wind up not containing a sample of a certain type of data that does occur in the wild.

Finally the thing that's really missing is there should be a weighting number assigned to each file such that they are given importance based on their chance of occurrence in the wild. To get these numbers you could do a few different things - download every archive on thepiratebay and sample what's inside them (this gives you a sampling of the type of files people actually put in archives), or maybe put a snooper on the internet backbone and sample the total set of all data that flies on the internet. The point is that this sampling should be based on the actual frequency of various data types, not just an ad hoc composite.

2. An image set with human quality metrics. Somebody needs to take a big set of test images (32-100), munge them in various ways by running them through various compressors (as well as other ways of damaging them that aren't well known compressors), and then get actual human visual ratings on the damaged versions. Then provide all the damaged versions (or code to produce them) with the human ratings.

If we had a test set like that, we could tweak our algorithmic approximations of human quality rating (eg. SSIM etc) until they reproduce what the actual humans say. This is not a test set for image compressors, it's a test set for image quality metric training, which is what we really need to take image compressors to the next level.

09-30-10 | Coder News

Ignacio wrote a nice article on GPU radiosity via hemicubes . I always wanted to try that, I'm jealous.

I suspect that you could be faster & higher quality now by doing GPU ray tracing instead of render-to-textures. The big speed advantage of GPU ray tracing would come from the fact that you can make the rays to sample into your lightmaps for lots of lightmap texels and run them all in one big batch. Quality comes from the fact that you can use all the fancy raytracing techniques for sampling, monte carlo methods, etc. and it's rather easier to do variable-resolution sampling (eg. start with 50 rays per lightmap texel and add more if there's high variance). In fact, you could set up all the rays to sample into your lightmap texels for your whole world, and then run N bounces by just firing the whole gigantic batch N times. Of course the disadvantage to this is you have to implement your whole renderer twice, once as a raytracer and once for normal rendering, avoiding that is the whole point of using the hemicube render to texture method.

Was talking to Dave and randomly mentioned that I thought the new Rvalue references were retarded because you can basically do everything they give you using RVO and swaps. But then I did some more reading and have changed my mind. Rvalue references are awesome!

Want Speed Pass by Value. « C++Next (Dave Abrahams series on value semantics)
C++ Rvalue References Explained (good intro)
A Brief Introduction to Rvalue References (Howard E. Hinnant, Bjarne Stroustrup, and Bronek Kozicki)
Rvalue References C++0x Features in VC10, Part 2 - Visual C++ Team Blog - Site Home - MSDN Blogs
rvalue reference (the proposal document)
InformIT C++ Reference Guide The rvalue Reference Proposal, Part I

Basically it lets you write templates that know an object is a temporary, so you can mutate it or move from it without making a copy. I think one problem we sometimes have as coders is that we think about our existing code and think "bah I don't need that" , but it's only because we have been so conditioned not to write code in that way because we don't have the right tools. C++0x makes it possible to do things in templates that you always wished you could do. That doesn't necessarily mean doing lots more complicated things, it means doing the little things and getting them exactly right. "auto" , "decltype" , "[[base_check]]", etc. all look pretty handy.
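To make the "know an object is a temporary" bit concrete, here's a minimal sketch using plain overloads rather than templates. NameList and its counters are made up for illustration; the point is just that the && overload gets picked for temporaries (and for things you explicitly std::move), so it can steal the string's storage instead of copying it.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container that counts which overload was taken.
struct NameList
{
    std::vector<std::string> names;
    int copied = 0;
    int moved = 0;

    // Called for lvalues: the caller still owns 's', so we must copy.
    void add(const std::string& s)
    {
        names.push_back(s);
        ++copied;
    }

    // Called for temporaries and std::move'd values: safe to steal the guts.
    void add(std::string&& s)
    {
        names.push_back(std::move(s));
        ++moved;
    }
};
```

Usage is something like list.add(a) for a copy versus list.add(a + "!") or list.add(std::move(a)) for a move; the caller's code barely changes, you just stop paying for copies of things that were about to die anyway.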

09-28-10 | Branchless LZ77 Decoder

It occurred to me that if you do an LZ77 where the literal/match flag is sent via a run length of literals, then a run of literals is the same as a match, but the source is the next few bytes of the compressed buffer, rather than some previous location in the decompressed buffer.

That is, you're decompressing from "comp" buffer into "dest" buffer. A "match" is just a copy from the "dest" buffer, and literals are just a copy from the "comp" buffer.

So, let's say we do a byte-wise LZ77 , use one bit to flag literal or match, then 7 bits for the length. Our branchless decoder is something like :

    U8 control = *comp++;
    int source = control >> 7;
    int length = (control & 127) + 1;
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
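    comp += (source<<1) + (length & (source-1)); // match : skip the 2 byte offset ; literals : skip the 'length' literal bytes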

Where "select(a,b,c)" = c ? b : a; or something like that, so a set flag picks the match pointer. (sometimes better implemented with a negative and; on the PC you might use cmov, on the spu you might use sel, etc.)
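Here's a small self-contained sketch of that decoder so you can see the whole thing run. A bunch of this is my assumption and not part of the scheme above: the function names, the little-endian byte order of the offset, the requirement of 2 readable padding bytes after the compressed data (because the offset is read speculatively even for literal runs), and doing the select on uintptr_t bits so the bogus match address is never formed as a pointer. The outer while loop and the byte copy loop are still branches, of course.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branch-free select on raw bits: returns b when c == 1, a when c == 0.
static inline uintptr_t select_bits(uintptr_t a, uintptr_t b, uintptr_t c)
{
    const uintptr_t mask = (uintptr_t)0 - c;   // all ones when c == 1, zero when c == 0
    return (a & ~mask) | (b & mask);
}

// Decodes comp_size bytes of 'comp' into 'dest', returns bytes written.
// Assumes: 2 readable padding bytes after comp_size, dest is big enough,
// offsets are stored little-endian.
size_t lz77_decode_branchless(const uint8_t* comp, size_t comp_size, uint8_t* dest)
{
    const uint8_t* const comp_end = comp + comp_size;
    uint8_t* const dest_start = dest;

    while (comp < comp_end)
    {
        const unsigned control = *comp++;
        const uintptr_t source = control >> 7;            // 1 = match, 0 = literal run
        const size_t length = (control & 127) + 1;

        // Always read the 16-bit offset; it is garbage for a literal run.
        const uintptr_t offset = comp[0] | ((uintptr_t)comp[1] << 8);
        assert(source == 0 || (offset >= 1 && offset <= (uintptr_t)(dest - dest_start)));

        const uintptr_t lit = (uintptr_t)comp;
        const uintptr_t mat = (uintptr_t)dest - offset;   // only used when source == 1
        const uint8_t* from = (const uint8_t*)select_bits(lit, mat, source);

        // Byte-wise copy so an overlapping match replicates its own output.
        for (size_t i = 0; i < length; i++)
            dest[i] = from[i];

        dest += length;
        // Match: skip the 2-byte offset. Literals: skip the literal bytes.
        comp += (source << 1) + (length & (source - 1));
    }
    return (size_t)(dest - dest_start);
}
```

For example the 7 bytes { 0x02, 'a','b','c', 0x85, 0x03, 0x00 } are a 3-literal run followed by a length-6 match at offset 3, which decodes to "abcabcabc".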

While this should be very fast, compression will not be awesome because the division for literals and matches is not ideal and 7 bits of length is a lot for literals but not enough for matches, and offsets are always 16 bits which is too little. We can do a slightly better version using a 256 entry lookup table for the control :

    U8 control = *comp++;
    int source = source_table[control];
    int length = length_table[control];
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
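    comp += (source<<1) + (length & (source-1)); // match : skip the 2 byte offset ; literals : skip the 'length' literal bytes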

for example with the table you could let the match lengths be larger and sparse. But it would probably be better to just have a branch that reads more bytes for long lengths.
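Just to illustrate what "larger and sparse" could mean, here's a sketch that fills in the two tables. The actual split is completely made up (32 control values for literal runs of 1..32, then matches with dense lengths 3..194 and a sparse tail stepping by 16); a real coder would tune this against data, and the encoder has to restrict itself to lengths that exist in the table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical table layout for the 256 control byte values.
struct ControlTables
{
    uint8_t  source[256];   // 0 = literal run, 1 = match
    uint16_t length[256];   // copy length for this control byte
};

ControlTables make_control_tables()
{
    ControlTables t = {};
    for (int c = 0; c < 256; c++)
    {
        if (c < 32)
        {
            // 32 control values : literal runs of 1..32 bytes
            t.source[c] = 0;
            t.length[c] = (uint16_t)(c + 1);
        }
        else
        {
            // 224 control values : matches
            const int m = c - 32;   // 0..223
            t.source[c] = 1;
            // dense lengths 3..194, then sparse lengths 195, 211, 227, ...
            t.length[c] = (uint16_t)((m < 192) ? (3 + m) : (195 + (m - 192) * 16));
        }
    }
    return t;
}
```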

Adding things like optionally larger offsets starts to make the branchless code complex enough that eating one branch is better. If you're willing to do a lot of variable shifts it's certainly possible, for example you could grab 1 control byte, look it up in various tables. The tables tell you some # of bits for match length, some # for match offset, and some # for literal run length (they add up to a multiple of 8 and use some portion of the control byte as well). Unfortunately variable shifting is untenably slow on many important platforms.

BTW one useful trick for reducing branches in your LZ decoder is to put the EOF case out in some rare case, rather than as your primary loop condition, and you change your primary loop to be an unconditional branch. On PC's this doesn't change much but on some architectures an unconditional branch is much cheaper than a conditional one, even if it's predictable. That is, instead of :

while ( ! eof )
{
   .. do one decode step ..
   if ( mode 1 )
   if ( mode 2 )
}

You do :

for(;;)
{
   .. do one decode step ..
   if ( mode 1 )
   if ( mode 2 )
   {
     if ( rare case )
     {
        if ( eof )
          break;
     }
   }
}

Also, obviously you don't actually use "memcpy" , but whatever you use for the copy has an internal branch. And of course we have turned our branch into a tight data dependency. On some platforms that's not much better than a branch, but on many it is much better. Unfortunately unrolling doesn't help the data dependency much because of the fact that LZ can copy from its own output, so you have to wait for the copy to be done before you can start the next one (there's also an inherent LHS here, though that stall is the least of our worries).

09-21-10 | Waiting on Thread Events Part 2

So we've seen the problem, let's look at some solutions.

First of all let me try to convince you that various hacky solutions won't work. Say you're using Semaphores. Instead of doing Up() to signal the event, you could do Increment(1000); That will cause the Down() to just run through immediately until it's pulled back down, so the next bunch of tests will succeed. This might in fact make the race never happen in real life, but the race is still there.

1. Put a Mutex around the Wait. The mutex just ensures that only one thread at a time is actually in the wait on the event. In particular we do :

WaitOnMessage_4( GUID )
{
    {
        MutexInScope( ReceiveMutex );
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;
    }

    for(;;)
    {
        MutexInScope( WaiterMutex );

        {
            MutexInScope( ReceiveMutex );
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                return;
        }

        Wait( ReceiveEvent );
    }
}

note that the WaiterMutex is never taken by the Worker thread, only by "main" threads. Also note that it must be held around the whole check for the GUID being done or you have another race. Also we have the separate ReceiveMutex for the pop/mark phase so that when we are in the wait we don't block other threads from checking status.

Now when multiple threads try to wait, the first will get in the mutex and actually wait on the Event, the second will try to lock the mutex and stall out there and wait that way.
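Here's a compilable sketch of that scheme using the standard C++ mutex and condition_variable. Everything named here (Receiver, AutoResetEvent, post_done, check_done) is made up for the example, the "FIFO" is just a deque behind the receive mutex rather than whatever lock-free queue you really use, and GUIDs are plain ints.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <set>

// Auto-reset event : signal() stays set until exactly one wait() consumes it.
class AutoResetEvent
{
public:
    void signal()
    {
        { std::lock_guard<std::mutex> lock(m_); set_ = true; }
        cv_.notify_one();
    }
    void wait()
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return set_; });
        set_ = false;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool set_ = false;
};

class Receiver
{
public:
    // Worker side : publish a finished GUID, then signal the shared event.
    void post_done(int guid)
    {
        { std::lock_guard<std::mutex> lock(receive_mutex_); fifo_.push_back(guid); }
        receive_event_.signal();
    }

    // Pop the FIFO, mark everything received as done, report on 'guid'.
    bool check_done(int guid)
    {
        std::lock_guard<std::mutex> lock(receive_mutex_);
        while (!fifo_.empty()) { done_.insert(fifo_.front()); fifo_.pop_front(); }
        return done_.count(guid) != 0;
    }

    void wait_on_message(int guid)
    {
        if (check_done(guid)) return;               // early out, no waiter mutex
        for (;;)
        {
            // Only one thread at a time gets to sit in the event wait.
            std::lock_guard<std::mutex> waiter(waiter_mutex_);
            if (check_done(guid)) return;           // must re-check under the waiter mutex
            receive_event_.wait();
        }                                           // waiter mutex released each time around
    }

private:
    std::mutex receive_mutex_;   // protects fifo_ and done_
    std::mutex waiter_mutex_;    // serializes waiters
    std::deque<int> fifo_;
    std::set<int> done_;
    AutoResetEvent receive_event_;
};
```

Note the event is sticky, so if the worker signals between the check and the Wait, the Wait falls straight through and the loop re-checks.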

The problem with this is that threads that can proceed don't always get to. It's not broken, you will make progress, but it's nowhere near optimal. Consider if N threads come in and wait on different GUIDs. Now the worker finishes something, so that wakes up a Wait and he goes around the loop which releases the Mutex - if you just so happened to wake up the one guy who could proceed, then it's great, he returns, but if you woke up the wrong guy, he will take the mutex and go back to sleep. (also note when the mutex unlocks it might immediately switch thread execution to one of the people waiting on locking it because of the Windows fair scheduling heuristics). So you only have a 1/N chance of making optimal progress, and in the worst case you might make all N threads wait until all N items are done. That's bad. So let's look at other solutions :

2. Put the Event/Semaphore on each message instead of associated with the whole Receive queue. You wait on message->event , and the worker signals each message's event when it's done.

This also works fine. It is only race-free if GUIDs can only be used from one thread; if multiple threads can wait on the same GUID you are back to the same old problem. This also requires allocating a Semaphore per message, which may or may not be a big deal depending on how fine grained your work is.

On the plus side, this lets the Wait() be on the exact item you want being done, not just anything being done, so your thread doesn't wake up unnecessarily to check if its item was done.

WaitOnMessage_5A( GUID )
{
    Event ev;

    {
        MutexInScope( ReceiveMutex );
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;

        ev = message(GUID)->event
    }

    Wait( ev );
}

we're glossing over the tricky part of this one which is the maintenance of the event on each message. Let's say we know that GUIDs are only touched by one thread at a time. Then we can do the event destruction/recycling when we receive a done message. That way we know that event is always safe for us to wait on outside of the mutex because the only person who will ever Clear or delete that event is us.

I think you can make this safe for multiple threads thusly :

WaitOnMessage_5B( GUID )
{
    Semaphore ev;

    {
        MutexInScope( ReceiveMutex );
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;

        message(GUID)->users ++;
        ev = message(GUID)->semaphore;
    }

    Down( ev );

    {
        MutexInScope( ReceiveMutex );
        pop everything on the receive FIFO
        mark everything I received as done
        ASSERT( the GUID I wanted is done );

        message(GUID)->users --;
        if ( message(GUID)->users == 0 )
            delete message(GUID)->semaphore;
    }
}

and you make worker signal the event by doing Up(1000); you don't have to worry about that causing weird side effects, since each event is only ever signalled once, and the maximum number of possible waits on each semaphore is the number of threads. Since the semaphore accesses are protected by mutex, I think you could also do lazy semaphore creation, eg. don't make them on messages by default, just make them when somebody tries to wait on that message.

3. Make the Event per thread. You either put the ReceiveEvent in the TLS, or you just make it a local on the stack or somewhere for each thread that wants to wait. Then we just use WaitOnMessage_3 , but the ReceiveEvent is now for the thread. The tricky bit is we need to let the worker thread know what events it needs to signal.

The easiest/hackiest way is just to have the worker thread always signal N events that you determine at startup. This way will in fact work fine for many games. A slightly cleaner way is to have a list of the events that need signalling. But now you need to protect access to that with a mutex, something like :

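WaitOnMessage_6( GUID , LocalEvent )
{
    {
        MutexInScope( EventListMutex );
        add LocalEvent to EventList
    }
    for(;;)
    {
        {
            MutexInScope( ReceiveMutex );
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                break;
        }
        Wait( LocalEvent );
    }
    {
        MutexInScope( EventListMutex );
        remove LocalEvent from EventList
    }
}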

and worker does :

    do work
    push message to receive FIFO
    {
        MutexInScope( EventListMutex );
        signal everything in EventList
    }

one nice thing about this is that you can push GUID with LocalEvent and then only signal events that want to hear about that specific message.

Note with the code written as-is it works but has very nasty thread thrashing. When the worker signals the events in the list, it immediately wakes up the waiting threads, which then try to lock the EventListMutex and immediately go back to sleep, which wakes the worker back up again. That's pretty shitty so a slightly better version is if the worker does :

    do work
    push message to receive FIFO
    {
        MutexInScope( EventListMutex );
        copy everything in EventList to TempList
    }
    -*-
    signal everything in TempList

this fixes our thread thrashing, but it now gives us the usual lock-free type of restrictions - events in TempList are no longer protected by the Mutex, so that memory can't be reclaimed until we are sure that the worker is no longer touching them. (in practice the easiest way to do this is just to use recycling pooled allocators which don't empty their pool until application exit). Note that if an event is recycled and gets a different id, this might signal it, but that's not a correctness error because extra signals are benign. (extra signals just cause a wasteful spin around the loop to check if your GUID is done, no big deal, not enough signals means infinite wait, which is a big deal).

NOTE : you might think there's a race at -*- ; the apparent problem is that the worker has got the event list in TempList, and it gets swapped out, then on some main thread I add myself to EventList, run through and go to sleep in the Wait. Then worker wakes up at -*- and signals TempList - and I'm not in the list to signal! Oh crap! But this can't happen because if that was the work item I needed, it was already put in the receive FIFO and I should have seen it and returned before going into the Wait.
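Here is a rough C++ sketch of the per-thread event version with the TempList copy. The assumptions are mine: the Event is an auto-reset event built from a mutex and condition variable, the receive FIFO is a deque under the ReceiveMutex, and events are held by std::shared_ptr so that the worker's TempList keeps them alive (one simple way to satisfy the "can't be reclaimed yet" restriction, standing in for the pooled allocator mentioned above). All the names are illustrative.

```cpp
#include <algorithm>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Auto-reset event (assumed stand-in for a Win32 Event).
class Event {
public:
    void Signal() {
        std::lock_guard<std::mutex> lock(m_);
        signalled_ = true;
        cv_.notify_one();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return signalled_; });
        signalled_ = false;  // auto-reset
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signalled_ = false;
};
using EventPtr = std::shared_ptr<Event>;

struct Queue {
    std::mutex receiveMutex;         // protects receiveFifo and done
    std::deque<int> receiveFifo;
    std::set<int> done;
    std::mutex eventListMutex;       // protects eventList
    std::vector<EventPtr> eventList;
};

bool PopAndCheck(Queue& q, int guid) {
    std::lock_guard<std::mutex> lock(q.receiveMutex);
    for (int g : q.receiveFifo) q.done.insert(g);
    q.receiveFifo.clear();
    return q.done.count(guid) != 0;
}

void WaitOnMessage_6(Queue& q, int guid, const EventPtr& localEvent) {
    {
        std::lock_guard<std::mutex> lock(q.eventListMutex);
        q.eventList.push_back(localEvent);  // register before the first check
    }
    while (!PopAndCheck(q, guid)) localEvent->Wait();
    std::lock_guard<std::mutex> lock(q.eventListMutex);
    q.eventList.erase(std::remove(q.eventList.begin(), q.eventList.end(), localEvent),
                      q.eventList.end());
}

void Worker(Queue& q, const std::vector<int>& work) {
    for (int guid : work) {
        {
            std::lock_guard<std::mutex> lock(q.receiveMutex);
            q.receiveFifo.push_back(guid);
        }
        std::vector<EventPtr> tempList;
        {
            std::lock_guard<std::mutex> lock(q.eventListMutex);
            tempList = q.eventList;
        }
        // -*- : signal outside the mutex so woken threads don't pile up on it
        for (const EventPtr& e : tempList) e->Signal();
    }
}

bool Demo() {
    Queue q;
    std::thread t1([&q] { WaitOnMessage_6(q, 1, std::make_shared<Event>()); });
    std::thread t2([&q] { WaitOnMessage_6(q, 2, std::make_shared<Event>()); });
    std::thread worker([&q] { Worker(q, {2, 1}); });
    t1.join();
    t2.join();
    worker.join();
    return q.done.size() == 2 && q.eventList.empty();
}
```

A waiter registers its event before its first check, so any message pushed after that check is followed by a copy of EventList that already contains the event.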

We can also of course get rid of the Mutex on EventList completely by doing the usual lockfree gymnastics; instead make it a message passing thing :

WaitOnMessage_6B( GUID , LocalEvent )
{
    Send Message {Add,LocalEvent,GUID}

    for(;;)
    {
        {
            MutexInScope( ReceiveMutex );
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                break;
        }

        Wait( LocalEvent );
    }

    Send Message {Remove,LocalEvent,GUID}
}

and worker does :

    pop message from "send FIFO" to get work
    do work
    push message to "receive FIFO"

    pop all event messages
        process Add/Remove commands 

    signal everything in EventList

but this isn't tested and I have very little confidence in it. I would want to Relacy this before using it in real code.

4. Just don't do it. Sometimes domain-specific solutions are best. In particular there's a very simple solution I can use for the current problem I'm having.

The thing that made me hit this issue is that I made my work-stealing Worklet system able to wait on IO's. In this case the "worker" is the IO service thread and the messages are IO requests. So now the main thread can wait on IO and so can the Worklets. It's important that they really go to a proper sleep if they are blocked on IO.

But I can solve it in a simple way in that case. The Worklets already have a way to go to sleep when they have no work to do. So all I do is if the Worklets only have work which depends on an uncompleted IO, they push their Work back onto the main work dispatch list and go to sleep on the "I have no work" event. Then, whenever the main thread receives any IO wakeup event, it immediately goes and checks the Worklet dispatch list and sees if anything was waiting on an IO completion. If it was, then it hands out the work and wakes up some Worklets to do it.

This solution is more direct and also has the nice property of not waking up unnecessary threads and so on.

09-21-10 | Waiting on Thread Events

A small note on a common threading pattern.

You have some worker thread which takes messages and does something or other and then puts out results. You implement the message passing in some way or other, either with a mutex or a lock-free FIFO or whatever, it doesn't matter for our purposes here.

So your main thread makes little work messages, sends them to the worker, he does them, sends them back. The tricky bit comes when the main thread wants to wait on a certain message being done. Let's draw the structure :

   main      ---- send --->     worker
  thread    <--- receive ---    thread

First assume our messages have some kind of GUID. If the pointers are never recycled it could be the pointer, but generally that's a bad idea, but a pointer + a counter would be fine. So the main thread says Wait on this GUID being done. The first simple implementation would be a spin loop :

WaitOnMessage_1( GUID )
{
    for(;;)
    {
        pop everything on the receive FIFO
        mark everything I received as done
        is the GUID I wanted in the done set?
           if so -> return

        yield processor for a while
    }
}


and that works just fine. But now you want to be a little nicer and not just spin, you want to make your thread actually go to sleep.

Well you can do it in Win32 using Events. (on most other OS'es you would use Semaphores, but it's identical here). What you do is you have the worker thread set an event when it pushes a message, and now we can wait on it. We'll call this ReceiveEvent, and our new WaitOnMessage is :

WaitOnMessage_2( GUID )
{
    for(;;)
    {
        Clear( ReceiveEvent );

        pop everything on the receive FIFO
        mark everything I received as done
        is the GUID I wanted in the done set?
           if so -> return

        Wait( ReceiveEvent );
    }
}


that is, we clear the receive event so it's unsignalled, we see if our GUID is done, if not we wait until the worker has done something, then we check again. We aren't sleeping on our specific work item, but we are sleeping on the worker doing something.

In the Worker thread it does :

    pop from "send FIFO"

    do work

    push to "receive FIFO"
    Signal( ReceiveEvent );

note the order of the push and the Signal (Signal after push) is important or you could have a deadlock due to a race because of the Clear. While the code as-is works fine, the Clear() is considered dangerous and is actually unnecessary - if you remove it, the worst that will happen is you will run through the Wait() one time that you don't need to, which is not a big deal. Also Clear can be a little messy to implement for Semaphores in a cross-platform way. In Semaphore speak, Wait() is Down() , Signal() is Up(), and Clear() is trylock().
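A minimal C++ sketch of the single-waiter version, with the Clear() dropped as suggested above. It assumes the Event is an auto-reset event built from a mutex and condition variable rather than a real Win32 Event, and the FIFO is just a mutex-protected deque. Event, Channel, Worker and Demo are illustrative names.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Auto-reset event (assumed stand-in for a Win32 Event).
class Event {
public:
    void Signal() {
        std::lock_guard<std::mutex> lock(m_);
        signalled_ = true;
        cv_.notify_one();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return signalled_; });
        signalled_ = false;  // auto-reset
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signalled_ = false;
};

struct Channel {
    std::mutex fifoMutex;         // protects receiveFifo (any thread-safe FIFO would do)
    std::deque<int> receiveFifo;  // GUIDs of finished messages
    Event receiveEvent;
    std::set<int> done;           // touched only by the single waiting thread
};

// Worker side: push first, then signal.
void Worker(Channel& ch, const std::vector<int>& work) {
    for (int guid : work) {
        // "do work" would go here
        {
            std::lock_guard<std::mutex> lock(ch.fifoMutex);
            ch.receiveFifo.push_back(guid);
        }
        ch.receiveEvent.Signal();
    }
}

// No Clear(): at worst we take one extra trip around the loop.
void WaitOnMessage_2(Channel& ch, int guid) {
    for (;;) {
        {
            std::lock_guard<std::mutex> lock(ch.fifoMutex);
            for (int g : ch.receiveFifo) ch.done.insert(g);
            ch.receiveFifo.clear();
        }
        if (ch.done.count(guid)) return;
        ch.receiveEvent.Wait();
    }
}

// One worker, one waiter.
bool Demo() {
    Channel ch;
    std::thread worker([&ch] { Worker(ch, {1, 2, 3}); });
    WaitOnMessage_2(ch, 3);
    worker.join();
    return ch.done.count(3) == 1;
}
```

This is only safe with a single waiting thread, which is exactly the limitation discussed next.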

Okay, so this is all fine and we're happy with ourselves, until one day we try to WaitOnMessage() from two different threads. I've been talking as if we have one main thread and one worker thread, but the basic FIFO queues work fine for N "main" threads, and we may well want to have multiple threads generating and waiting on work. So what's the problem ?

First of all, there's a race in the status check, so we put a Mutex around it :

WaitOnMessage_3A( GUID )
{
    for(;;)
    {
        {
            MutexInScope( ReceiveMutex );
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                return;
        }
        Wait( ReceiveEvent );
    }
}


you need that because one thread could come in and pop the queue but not process it, and then another thread comes in and sees its GUID is not done and goes to sleep incorrectly. We need the pop and the flagging done to act like they are atomic.

Now you might be inclined to make this safe by just putting the Mutex around the whole function. In fact that works, but it means you are holding the mutex while you go into your Wait sleep. Then if another thread comes in to check status, it will be forced to sleep too - even if its GUID is already done. So that's very bad, we don't want to block status checking while we are asleep.

Why is the _3A version still broken ? Consider this case : we have two threads, thread 1 and 2, they make their own work and send it off then each call WaitOnMessage which is :

WaitOnMessage_3( GUID )
{
    for(;;)
    {
        {
            MutexInScope( ReceiveMutex );
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                return;
        }

        [A]

        Wait( ReceiveEvent );   [B]
    }
}

Thread 1 runs through to [A] and then swaps out. Thread 2 runs through to [B], which waits on the event, the worker thread then does all the work, sets the event, then thread 2 gets the signal and clears the event. Now thread 1 runs again at [A] and immediately goes into an infinite Wait.

D'oh !

I think this is long enough just describing the problem, so I'll get at some solutions in the next post.

09-20-10 | A small followup on Fast Means

I wrote previously about various ways of tracking a local mean over time .

I read about a new way that is somewhat interesting : Probabilistic Sliding Windows. I saw this in Ryabko / Fionov where they call it "Imaginary Sliding Windows", but I like Probabilistic better.

This is for the case where you want to have a large window but don't want to store the whole thing. Recall a normal sliding window mean update is :

U8 window[WINDOW_SIZE];
int window_i = 0;
int accum = 0;

// add in head and remove tail :
accum += x_t - window[window_i];
window[window_i] = x_t;
window_i = (window_i + 1) & (WINDOW_SIZE-1);

mean = (accum/WINDOW_SIZE);

// WINDOW_SIZE is power of 2 for speed of course

that's all well and good except when you want WINDOW_SIZE to be 16k or something. In that case you can use a histogram probabilistic sliding window. You keep a histogram and accumulator :

int count[256] = { 0 };
int accum = 0;

at each step you add the new symbol and randomly remove something that you have :

// randomly remove something :
r = random_from_histogram( count );
count[r] --;
accum -= r;

//add new :
count[x_t] ++;
accum += x_t;

It's very simple and obvious - instead of knowing which symbol leaves the sliding window at the tail, we generate one randomly from the histogram that we have of symbols in the window.

Now, random_from_histogram is a bit of a problem. If you just do a flat random draw :

// randomly remove something :
    int r = randmod(256);
    if ( count[r] > 0 )
    {
        count[r] --;
        accum -= r;
    }

then you will screw up the statistics in a funny way; you will draw unlikely symbols too often, so you will skew towards more likely symbols. Maybe you could compute exactly what that skewing is and compensate for it somehow. To do an unbiased draw you basically have to do an arithmetic decode. You generate a random number in [0,N) , where N is the total of all the counts in the histogram (not accum, which is the sum of the symbol values), and then look it up in the histogram by cumulative count.
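A small C++ sketch of the unbiased draw, for illustration. ProbWindow, SymbolAtRank and Update are made-up names, and the uniform random number is passed in by the caller (in real use you would draw it from your RNG in [0,total) ) so the example stays deterministic. The lookup is a plain linear walk of the cumulative counts, which is the slow part the post is complaining about.

```cpp
#include <cassert>

struct ProbWindow {
    int count[256] = {0};  // histogram of the symbols currently "in" the window
    int total = 0;         // number of symbols in the histogram
    int accum = 0;         // sum of those symbols (mean = accum / total)
};

// Map a uniform draw in [0,total) to a symbol, in proportion to its count.
int SymbolAtRank(const ProbWindow& w, int target) {
    assert(target >= 0 && target < w.total);
    int cum = 0;
    for (int s = 0; s < 256; ++s) {
        cum += w.count[s];
        if (target < cum) return s;
    }
    return 255;  // not reached when target < total
}

// One step: once the window is full, evict a randomly drawn symbol, then add x.
// uniformDraw must be in [0,total) whenever the window is full.
void Update(ProbWindow& w, int windowSize, int x, int uniformDraw) {
    if (w.total >= windowSize) {
        int r = SymbolAtRank(w, uniformDraw);
        w.count[r]--;
        w.accum -= r;
        w.total--;
    }
    w.count[x]++;
    w.accum += x;
    w.total++;
}
```

A symbol with count c is evicted with probability c/total, which is what the flat randmod(256) draw gets wrong.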

Obviously this method is not fast and is probably useless for compression, but it is an interesting idea. More generally, probabilistic approximate statistics updates are an interesting area. In fact we do this already quite a lot in all modern compressors (for example there's the issue of hash table collisions). I know some of the PAQs also do probabilistic state machine updates for frequency counts.

There's also a whole field of research on this topic that I'd never seen before. You can find it by searching for "probabilistic quantiles" or "approximate quantiles". See for example : "Approximate Counts and Quantiles over Sliding Windows" or "On Probabilistic Properties of Conditional Medians and Quantiles" (PDF) . This stuff could be interesting for compression, because tracking things like the median of a sliding window or the N most common symbols in a window are pretty useful things for compressors.

For example, what I would really like is a fast probabilistic way of keeping the top 8 most common symbols in the last 16k, and semi-accurately counting the frequency of each and the count of "not top 8".

09-16-10 | A small followup on Cumulative Probability Trees

See original note . Then :

I mentioned the naive tree where each level has buckets of 2^L sums. For clarity, the contents of that tree are :

C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15
C0-C1 C2-C3 C4-C5 C6-C7 C8-C9 C10-C11 C12-C13 C14-C15
C0-C3 C4-C7 C8-C11 C12-C15
C0-C7 C8-C15

I said before :

But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[0][i-1];
i >>= 1;
if ( i&1 ) sum += Tree[1][i-1];
i >>= 1;
if ( i&1 ) sum += Tree[2][i-1];

This is O(logN) but rather uglier than we'd like. 

A few notes on this. It's easier to write the query if our array is 1 based. Then the query is equivalent to taking a value from each row where the bit of "i" is on. That is, line up the bits of "i" with the bottom bit at the top row.

That is :

sum = 0;

if ( i&1 ) sum += Tree[0][i];
if ( i&2 ) sum += Tree[1][i>>1];
if ( i&4 ) sum += Tree[2][i>>2];

Some SIMD instruction sets have the ability to take an N-bit mask and turn it into an N-channel mask. In that case you could compute Tree[n][i>>n] in each channel and then use "i" to mask all N, then do a horizontal sum.

Also the similarity to Fenwick trees and his way of walking the tree can be made more obvious. In particular, we obviously only have to do the sums where i has bits that are on. On modern CPU's the fastest way is just to always do 8 sums, but in the old days it was beneficial to do a minimum walk. The way you do that is by only walking to the bits that are on, something like :

sum = 0;
while ( i )
{
  int k = BitScan(i);
  sum += Tree[k][i>>k];
  i -= (1 << k);
}

we can write this more like a Fenwick query :

sum = 0;
while ( i )
{
  int b = i & -i; // isolate the bottom bit of i
  int k = BitScan(b); // b -> log2(b)
  sum += Tree[k][i>>k];
  i = i & (i-1); // same as i -= b; turns off bottom bit
}

but then we recall that we made the FenwickTree structure by taking these entries and merging them. In particular, one way to build a Fenwick Tree is to take entries from the simple table and do "if my bottom bit is zero, do i>>=1 and step to the next level of the table".

What that means is when we find the bottom bit pos is at bit 'k' and we look up at (i>>k) - that is the entry that will just be in slot "i" in the Fenwick structure :

sum = 0;
while ( i )
{
  sum += FenwickTree[i];
  i = i & (i-1); // turn off bottom bit
}

so the relationship and simplification is clear.
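To check the relationship, here is a small C++ sketch that builds the 1-based level table, builds the Fenwick array from it by the "bottom bit" rule, and runs both queries. The function names are made up for illustration, and it assumes the symbol count is a power of two.

```cpp
#include <cstddef>
#include <vector>

// counts is 1-based: counts[0] is unused, counts[1..n], n a power of two.
// tree[L][j] (j 1-based) holds the sum of counts ((j-1)<<L)+1 .. (j<<L).
std::vector<std::vector<int>> BuildLevelTree(const std::vector<int>& counts) {
    std::vector<std::vector<int>> tree;
    tree.push_back(counts);               // level 0 is the counts themselves
    while (tree.back().size() > 2) {      // stop once a level holds a single bucket
        const std::vector<int>& below = tree.back();
        std::vector<int> level(below.size() / 2 + 1, 0);
        for (size_t j = 1; j < level.size(); ++j)
            level[j] = below[2 * j - 1] + below[2 * j];
        tree.push_back(level);
    }
    return tree;
}

// Sum of counts[1..i] : take one entry from each row where the bit of i is on.
int QueryLevelTree(const std::vector<std::vector<int>>& tree, int i) {
    int sum = 0;
    for (int level = 0; (i >> level) != 0; ++level)
        if (i & (1 << level)) sum += tree[level][i >> level];
    return sum;
}

// Slot i of the Fenwick array is the table entry found at i's bottom set bit.
std::vector<int> BuildFenwick(const std::vector<std::vector<int>>& tree, int n) {
    std::vector<int> fenwick(n + 1, 0);
    for (int i = 1; i <= n; ++i) {
        int k = 0;
        while (((i >> k) & 1) == 0) ++k;  // position of the bottom set bit
        fenwick[i] = tree[k][i >> k];
    }
    return fenwick;
}

int QueryFenwick(const std::vector<int>& fenwick, int i) {
    int sum = 0;
    while (i) {
        sum += fenwick[i];
        i &= i - 1;  // turn off bottom bit
    }
    return sum;
}
```

Both queries return the same inclusive prefix sum; the Fenwick array just stores the n entries the walk can actually touch.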

Fast decoding would involve a branchless binary search. I leave that as an exercise for the reader.
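For what it's worth, one possible shape for that search, as a C++ sketch. This is the standard descent down a Fenwick array written with masks instead of ifs; whether the compiler emits truly branch-free code is not guaranteed. BuildFenwick and FenwickDecode are illustrative names, n is assumed to be a power of two, and target must be in [0,total).

```cpp
#include <vector>

// counts is 1-based (counts[0] unused); returns the Fenwick array for it.
std::vector<int> BuildFenwick(const std::vector<int>& counts) {
    int n = (int)counts.size() - 1;
    std::vector<int> fenwick(n + 1, 0);
    for (int i = 1; i <= n; ++i)
        for (int j = i; j <= n; j += j & -j)
            fenwick[j] += counts[i];
    return fenwick;
}

// Find the 1-based symbol s with cum(s-1) <= target < cum(s),
// where cum(s) is the sum of counts[1..s].
int FenwickDecode(const std::vector<int>& fenwick, int n, int target) {
    int pos = 0;
    for (int step = n >> 1; step > 0; step >>= 1) {
        int v = fenwick[pos + step];
        int mask = -(int)(v <= target);  // all ones if we step right, else zero
        pos += step & mask;
        target -= v & mask;
    }
    return pos + 1;
}
```

The loop always runs log2(n) steps, and symbols with a zero count are skipped naturally.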

Not that any of this is actually useful. If you're just doing order-0, "deferred summation" is better; if you're doing high order, then special structures for small/sparse contexts are better; and if you're doing low orders escaped from high order, it doesn't really work because you have to handle exclusion.

A better way to do excluded order-0 would be to have a 256-bit flag chunk for excluded or not excluded, then you just keep all the character counts in an array, and you sum them on demand with SIMD using the binary-tree summing method. At each step you sum adjacent pairs, so in step one you take the 256 counts and sum neighbors and output 128. Repeat. One tricky thing is that you have to output two sums - the sum above symbol and the sum below symbol (both excluded by the flags). And the decoder is a bit tricky, but maybe you can just do that sum tree and then binary search down it. Unfortunately the ability to do horizontal pair adds (or the way to swizzle into doing that) and the ability to do byte arithmetic is one of the areas where the SIMD instruction sets on the various platforms differ greatly, so you'd have to roll custom code for every case.

09-16-10 | Modern Arithmetic Coding from 1987

I was just reading through my old paper "New Techniques in Context Modeling and Arithmetic Encoding" because I wanted to read what I'd written about PPMCB (since I'm playing with similar things now and PPMCB is very similar to the way modern context mixers do this - lossy hashing and all that). (thank god I wrote some things down or I would have lost everything; I only wish I'd written more, so many little heuristics and experimental knowledge has been lost).

Anyway, in there I found this little nugget that I had completely forgotten about. I wrote :

Well what do you fucking know. This is a "carryless range coder" which was apparently rediscovered 20+ years later by the Russians. ( see also ). The method of reducing range to avoid the carry is due to Rubin :

F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers", IEEE Trans. Information Theory IT-25 (6) (1979), p. 672 - 675

And I found it in Cormack's DMC which amazingly is still available at waterloo : dmc.c

Cormack in 1987 wrote in the dmc.c code :

which is a little uglier than my version above, but equivalent (he has a lot of ugly +1/-1 stuff cuz he didn't get that quite right).

And actually in the Data Compression Using Dynamic Markov Modelling paper that goes with the dmc.c code, they describe the arithmetic coder due to Guazzo and it is in fact an "fpaq0p" style carryless arithmetic coder (it avoids carries by letting range get very small and only works on binary alphabets). I don't have the original paper due to Guazzo so I can't confirm that attribution. The one in the paper does bit by bit output, but as I've stressed before that doesn't characterize the coder, and Cormack then did bytewise output in the implementation.

Anyway, I had completely forgotten about this stuff, and it changes my previous attribution of byte-wise arithmetic coding to Schindler '98 ; apparently I knew about it in '95 and Cormack did it in '87. (The only difference between the Schindler '98 coder and the NTiCM '95 coder is that I was still doing (Sl*L)/R while Schindler moved to Sl*(L/R) which is a small but significant change).
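To show the difference between those two forms, here is a tiny C++ sketch. I'm reading Sl as the symbol's cumulative low, L as the current coder range and R as the total of the symbol counts; the function names are made up. Multiplying first needs a wide intermediate product (or tight limits on the bits in L and R), while dividing first stays in 32 bits but throws away a little of the range.

```cpp
#include <cstdint>

// (Sl*L)/R : exact scaling, but the product needs 64 bits here.
uint32_t ScaleMulFirst(uint32_t sl, uint32_t range, uint32_t total) {
    return (uint32_t)(((uint64_t)sl * range) / total);
}

// Sl*(L/R) : fits in 32 bits whenever sl <= total, at the cost of
// truncating range/total before the multiply.
uint32_t ScaleDivFirst(uint32_t sl, uint32_t range, uint32_t total) {
    return sl * (range / total);
}
```

With range = 1000, total = 300 and sl = 150, the first gives 500 and the second gives 450; the gap shrinks as range gets large relative to total.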

Apparently the "russian range coder" is actually a "Rubin arithmetic coder" (eg. avoid carries by shrinking range), and the "fpaq0p binary carryless range coder" is actually a "Guazzo arithmetic coder" (avoid carries by being binary and ensuring only range >= 2).

09-16-10 | Gasoline

Is it actually dangerous to keep your gas can near your furnace? Or is it just one of those paranoid wives' tales?

I remember a MythBusters where they showed that making gasoline explode is actually really difficult, like it has to be perfectly aerosolized and everything.

My only storage area is my basement where my heater is, but it does make me a little nervous.

09-15-10 | Problems at Seattle PD

Lots of problems in Seattle recently.

The latest one is this deaf wood carver dude John Williams who was shot under very questionable circumstances :

The Buck Stops with Nobody by Cienna Madrid - News - The Stranger, Seattle's Only Newspaper

Also this year was the lovely "Mexican Piss" beating where Seattle cops took turns randomly stomping on someone who hadn't done anything :

SeattleCrime.com News I'm going to beat the fucking Mexican piss out of you, homey! You feel me
I'm Going To Beat The Fucking Mexican Piss Out Of You! Seattle Police Brutalize Innocent Man Video

(though I should note that even if he *was* the "gang banger" that they thought he was (he wasn't) it still would be disgusting behavior; most people focus on the fact that he was innocent and it was a mistaken identity, but really any time you cuff a suspect and beat him up you should be fired)

(side note : there was a similar incident a while ago : Seattle officer punches girl in face during jaywalking stop ; and while clearly punching someone in the face for being rowdy is not very good police procedure (it's just not a good way to control someone), and he probably should have been mildly disciplined for that, overall this incident is not part of the "Seattle PD gone mad" problem; the jaywalker was clearly being too aggressive with the cop in this case (See Youtube : Seattle Cop Punches Girl in Face - Jaywalking Stop 6-15-10 ); also if a cop ever stops you for anything, make sure your friend gets video of it; though cops these days seem to not really care if they're on tape, they'll keep tasing you for doing nothing anyway, and get away with it ).

It was a bad week : Questions raised after five police killings in a week , Questions raised after 5 police killings in 1 week ; perhaps even more disturbing than the killing of John Williams was the two killings by Taser of guys who were just "creating a disturbance".

I'm absolutely terrified of Tasers. Their manufacturer has created this myth that they are safe, when in fact they are often lethal (and god knows what kind of long term damage they do even when they aren't lethal). In fact while Tasers don't kill quite so many people as police guns do, it's getting close - (see for example : 96 Taser-related Deaths in the US since January 2009 - Gordon Wagner - Open Salon ). The big problem is the rules for engagement with Tasers are not the same as with a gun. Cops are only supposed to use their gun if they deem it absolutely necessary to protect themselves or others, but Tasers can be used at will to subdue a suspect. This has led to rampant abuse of the Taser where it's used in cases where simple physical detention would have been just fine. Or even just letting the guy making a disturbance scream for a while until he settles down. Coincidentally perhaps, back in 2004 SeattlePI did a big piece on Tasers : Seattle Post-Intelligencer Tasers Under Fire .

See also : Analysis Taser-related deaths in US accelerating Raw Story

And if you want to really feel sick, watch a man who was not resisting at all get tasered and brutally restrained (and thus killed) : Taser Death Caught On Tape - CBS News Video

It certainly seems like Seattle PD has been using lethal force more readily than they should. The obvious thought is that there's a bit of a siege mentality after all the cop assassinations last year in the two separate cases of Chris Monfort and Maurice Clemmons :

Suspected cop killer slain by police
Seattle cop shot dead in unprovoked attack - U.S. news - Crime & courts - msnbc.com
Police Seattle police officer killed in 'assassination' KING5.com Seattle Area Local News
Local News Suspect in officer's slaying shot by police Seattle Times Newspaper
Local News Man charged with killing police officer is paralyzed Seattle Times Newspaper

09-14-10 | Threaded Stdio

Many people know that fgetc is slow now because we have to link with multithreaded libs, and so it does a mutex and all that. Yes, that is true, but the solution is also trivial, you just use something like this :

#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? ((char)*(_stream)->_ptr++) : _filbuf(_stream))

when you know that your FILE is only being used by one thread.

In general there's a nasty issue with multithreaded API design. When you have some object that you know might be accessed from various threads, should you serialize inside every access to that object? Or should you force the client to serialize on the outside?

If you serialize on the inside, you can have severe performance penalties, like with getc.

If you serialize on the outside, you make it very prone to bugs because it's easy to access without protection.

One solution is to introduce an extra "FileLock" object. So the "File" itself has no accessors except "Lock" which gives you a FileLock, and then the "getc" is on the FileLock. That way somebody can grab a FileLock and then locally do fast un-serialized access through that structure. eg:

instead of :

c = File.getc();

you have :

FileLock fl = File.lock();

c = fl.getc();
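A minimal C++ sketch of the FileLock idea. To keep it self-contained the "file" here is just an in-memory string, which is my assumption and not how a real stdio wrapper would look; File and FileLock are illustrative names.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>

class File {
private:
    std::string data_;   // stand-in for the real file contents
    size_t pos_ = 0;
    std::mutex mutex_;

public:
    explicit File(std::string contents) : data_(std::move(contents)) {}

    // Holds the File's mutex for its whole lifetime.
    class FileLock {
    public:
        explicit FileLock(File& f) : file_(f), guard_(f.mutex_) {}

        // No locking in here: the FileLock already owns the mutex.
        int getc() {
            if (file_.pos_ >= file_.data_.size()) return -1;  // end of file
            return (unsigned char)file_.data_[file_.pos_++];
        }

    private:
        File& file_;
        std::unique_lock<std::mutex> guard_;
    };

    // The only accessor on File itself.
    FileLock lock() { return FileLock(*this); }
};
```

The mutex is taken once in lock() and released when the FileLock goes out of scope, so a loop of fl.getc() calls pays for synchronization once instead of per character.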

Another good solution would be to remove the buffering from the stdio FILE and introduce a "FileView" object which has a position and a buffer. Then you can have multiple FileViews per FILE which are in different threads, but the FileViews themselves must be used only from one thread at a time. Accesses to the FILE are serialized, accesses to the FileView are not. eg :

FILE * fp is shared.

Thread1 has FileView f1(fp);
Thread2 has FileView f2(fp);

FileView contains the buffer

you do :

c = f1.getc();

it can get a char from the buffer without synchronization, but buffer refill locks.

(of course this is a mess with readwrite files and such; in general I hate dealing with mutable shared data, I like the model that data is either read-only and shared, or writeable and exclusively locked by one thread).

(The FileView approach is basically what I've done for Oodle; the low-level async file handles are thread-safe, but the buffering object that wraps it is not. You can have as many buffering objects as you want on the same file on different threads).

Anyway, in practice, macrogetc is a perfectly good solution 99% of the time.

Stdio is generally very fast. You basically can't beat it except in the trivial way (trivial = use a larger buffer). The only things you can do to beat it are :

1. Double-buffer the streaming buffer filling and use async IO to fill the buffer (stdio uses synchronous IO and only fills when the buffer is empty).

2. Have an accessor like Peek(bytes) to the buffer that just gives you a pointer directly into the buffer and ensures it has bytes amount of data for you to read. This eliminates the branch to check fill on each byte input, and eliminates the memcpy from fread.

(BTW in theory you could avoid another memcpy, because Windows is doing IO into the disk cache pages, so you could just lock those and get a pointer and process from them directly. But they don't let you do this for obvious security reasons. Also if you knew in advance that your file was not in the disk cache (and you were never going to use it again so you don't want it to get into the cache) you could do uncached IO, which is microscopically faster for data not in the cache, because it avoids that memcpy and page allocation. But again they don't let you ask "is this in the cache" and it's not worth sweating about).

Anyway, the point is you can't really beat stdio for byte-at-a-time streaming input, so don't bother. (on Windows, where the disk cache is pretty good, eg. it does sequential prefetching for you).

09-14-10 | Learnings

Sometimes you can only observe yourself by seeing things in others.

I'm watching a bit of the WSOP. My god it's such a stupid waste of time, there's no poker and the "characters" are just retards, but it does give me a certain type of "schadenfreude" - like pleasure : it gives me lots of opportunities to see other people and think "he sucks". It's the pleasure of judging and feeling superior to other people. The whole WSOP show is a parade of losers, and the fact that they are playing poker lets me sit there and judge their poker play and think "what a donkey" "I can't believe he made that play" "omg so bad I can't believe he didn't bet there" or "meh that play was okay but he could have made so much more by doing this". Blah blah blah. It's a very sleazy type of viewing pleasure.

Anyway, that's not the point. The thing that always strikes me on a personal level is when I see the smart, careful, analytical, well behaved, soft spoken guys. Your Allen Cunninghams, your Howard Lederers, your Erik Seidels. They're the only guys I actually respect and think they carry themselves well and might have something actually deep and interesting to say. And they are just boring as all hell. God get them off my TV they are a waste of space. If I had the choice to hang out with them, I would pass, because it would just be excruciating. That's something to learn from. But I probably won't.

I got another little lesson the other day listening to the Buzzing Fly show. Some retard had posted a comment on their web site chastising Ben for taking a break from DJ'ing to have a bite to eat. Ben took a moment out of the radio show to respond and justify himself. Of course Ben was completely in the right. But by even responding to the troll, he lowered himself. That's something to learn from.

09-14-10 | A small note on structured data

We have a very highly structured file at work. A while ago I discovered that I can improve compression of primitive compressors by transposing the data as if it were a matrix with rows of the structure size.

Apparently this is previously known :

J. Abel, "Record Preprocessing for Data Compression", 
Proceedings of the IEEE Data Compression Conference 2004, 
Snowbird, Utah, Storer, J.A. and Cohn, M. Eds. 521, 2004.
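The transpose itself is trivial; here is a C++ sketch just to pin down what is meant. TransposeRecords and UntransposeRecords are illustrative names, and copying a trailing partial record through unchanged is my own choice for handling the tail.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Treat data as rows of recordSize bytes and emit it column by column,
// so byte 0 of every record comes first, then byte 1 of every record, etc.
std::vector<uint8_t> TransposeRecords(const std::vector<uint8_t>& data, size_t recordSize) {
    size_t rows = data.size() / recordSize;
    std::vector<uint8_t> out;
    out.reserve(data.size());
    for (size_t col = 0; col < recordSize; ++col)
        for (size_t row = 0; row < rows; ++row)
            out.push_back(data[row * recordSize + col]);
    for (size_t i = rows * recordSize; i < data.size(); ++i)
        out.push_back(data[i]);  // trailing partial record, copied as-is
    return out;
}

// Inverse of TransposeRecords.
std::vector<uint8_t> UntransposeRecords(const std::vector<uint8_t>& t, size_t recordSize) {
    size_t rows = t.size() / recordSize;
    std::vector<uint8_t> out(t.size());
    for (size_t col = 0; col < recordSize; ++col)
        for (size_t row = 0; row < rows; ++row)
            out[row * recordSize + col] = t[col * rows + row];
    for (size_t i = rows * recordSize; i < t.size(); ++i)
        out[i] = t[i];
    return out;
}
```

You would run the primitive compressor on the transposed buffer and undo the transpose after decompression.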

A small note :

He suggests finding the period of the data by looking at the most frequent character repeat distance. That is, for each byte find the last time it occurred, count that distance in a histogram, find the peak of the histogram (ignoring distances under some minimum), that's your guess of the record length. That works pretty well, but some small improvements are definitely possible.
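A C++ sketch of that repeat-distance heuristic as I read it (this is not Abel's code; GuessRecordLength and its minDist/maxDist parameters are made up for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Histogram the distance from each byte back to its previous occurrence and
// return the most common distance in [minDist, maxDist] (minDist <= maxDist).
size_t GuessRecordLength(const std::vector<uint8_t>& data, size_t minDist, size_t maxDist) {
    std::vector<size_t> hist(maxDist + 1, 0);
    std::vector<size_t> lastPos(256, 0);  // position + 1 of last occurrence, 0 = never seen
    for (size_t i = 0; i < data.size(); ++i) {
        uint8_t b = data[i];
        if (lastPos[b] != 0) {
            size_t dist = i + 1 - lastPos[b];
            if (dist <= maxDist) hist[dist]++;
        }
        lastPos[b] = i + 1;
    }
    size_t best = minDist;
    for (size_t d = minDist; d <= maxDist; ++d)
        if (hist[d] > hist[best]) best = d;  // first strongest peak wins ties
    return best;
}
```

On a buffer made of 8-byte records where seven of the bytes are the same in every record, this returns 8.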

Let's look at some distance histograms. First of all, on non-record-structured data (here we have "book1") our expectation is that correlation roughly goes down with distance, and in fact it does :


(maximum is at 4)

On record-structured data, the peaks are readily apparent. This file ("struct72") is made of 72-byte structures, hence the peaks out at 72 and 144. But we also see strong 4-8-12 correlations, as there are clearly 4-byte words in the structs :

[chart : character repeat distance histogram of "struct72"]

The digram distance histogram makes the structure even more obvious, if you ignore the peak at 1 (which is not so much due to "structure" as just strong order-1 correlation), the peak at 72 is very strong :

[chart : digram repeat distance histogram of "struct72"]

When you actually run an LZ77 on the file (with min match len of 3 and optimal parse) the pattern is even stronger; here are the 16 most used offsets on one chunk :

 0 :       72 :      983
 1 :      144 :      565
 2 :      216 :      282
 3 :      288 :      204
 4 :      432 :      107
 5 :      360 :      106
 6 :      720 :       90
 7 :      504 :       88
 8 :      792 :       78
 9 :      648 :       77
10 :      864 :       76
11 :      576 :       73
12 :     1008 :       64
13 :     1080 :       54
14 :     1152 :       51
15 :     1368 :       49

Every single one is a perfect multiple of 72.
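As a quick sanity check on the table above, the greatest common divisor of the 16 listed offsets recovers the record size directly. This is just arithmetic on the numbers shown, not a claim about how any compressor detects structure; `common_period` is a made-up helper name.

```python
from functools import reduce
from math import gcd

# The 16 most used LZ77 offsets from the table above.
top_offsets = [72, 144, 216, 288, 432, 360, 720, 504,
               792, 648, 864, 576, 1008, 1080, 1152, 1368]


def common_period(offsets):
    """Greatest common divisor of all the offsets."""
    return reduce(gcd, offsets)
```

In practice a few stray non-multiple offsets would drag the gcd down to something useless, so this only works as a check on clean data like this table.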

A slightly more robust way to find structure than Jurgen's approach is to use the auto-correlation of the histogram of distances. This is a well known technique from audio pitch detection. You take the "signal" which is here our histogram of distance occurrences, and find the intensity of auto-correlation for each translation of the signal (this can be done using the fourier transform). You will then get strong peaks only at the fundamental modes. In particular, in our example "struct72" file you would get a peak at 72, and also a strong peak at 4, because in the fourier transform the smaller peaks at 8,12,16,20, etc. will all add onto the peak at 4. That is, it's correctly handling "harmonics". It will also correctly handle cases where a harmonic happened to be a stronger peak than the fundamental mode. That is, the peak at 144 might have been stronger than the one at 72, in which case just picking the biggest peak of the histogram would make you incorrectly think the fundamental record length was 144.
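Here is a rough sketch of the auto-correlation idea. It uses a direct O(N^2) sum rather than the fourier transform, subtracts the mean first (my assumption, to keep flat background from favoring small lags), and the function names and `min_lag` default are hypothetical.

```python
def autocorrelation(signal):
    """Direct (non-FFT) auto-correlation of a mean-centered signal, one value per lag."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    return [sum(centered[i] * centered[i + lag] for i in range(n - lag))
            for lag in range(n)]


def fundamental_period(hist, min_lag=2):
    """Lag >= min_lag with the strongest auto-correlation.

    hist is a list where hist[d] is the count of repeat distance d.
    """
    acf = autocorrelation(hist)
    return max(range(min_lag, len(acf)), key=lambda lag: acf[lag])
```

For example, a histogram with peaks of height 5, 9, 4, 3 at distances 72, 144, 216, 288 has its biggest raw peak at 144, but the auto-correlation is strongest at lag 72 because every adjacent pair of peaks contributes there.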

Transposing obviously helps with compressors that do not have handling of structured data, but it hurts compressors that inherently handle structured data themselves.

Here are some compressors on the struct72 file :

struct72                                 3,471,552

struct72.zip                             2,290,695
transpos.zip                             1,784,239

struct72.bz2                             2,136,460
transpos.bz2                             1,973,406

struct72.pmd                             1,903,783
transpos.pmd                             1,864,028

struct72.pmm                             1,493,556
transpos.pmm                             1,670,661

struct72.lzx                             1,475,776
transpos.lzx                             1,701,360

struct72.paq8o                           1,323,437
transpos.paq8o                           1,595,652

struct72.7z                              1,262,013
transpos.7z                              1,642,304

Compressors that don't handle structured data well (Zip,Bzip,PPMd) are helped by transposing. Compressors that do handle structured data specifically (LZX,PAQ,7Zip) are hurt quite a lot by transposing. LZMA (which is what 7Zip uses) is the best compressor I've ever seen for this kind of record-structured data. It's amazing that it beats PAQ considering it's an order of magnitude faster. It's also impressive how good LZX is on this type of data.

BTW I'm sure PAQ could easily be adapted to manually take a specification in which the structure of the file is specified by the user. In particular for simple record type data, you could even have a system where you give it the C struct, like :

struct Record
{
  float x,y,z;
  int   i,j,k;
};

and it parses the C struct specification to find what fields are where and builds a custom model for that.


It's actually much clearer with mutual information.

I(X,Y) = H(X) + H(Y) - H(XY)

In particular, the mutual information for offset D is :

I(D) = 2*H( order0 ) - H( bigrams separated by distance D )

This is much slower to compute than the character repeat period, but gives much cleaner data.
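A small sketch of the mutual information measurement, for anyone who wants to reproduce charts like these. One deviation to note : it uses the marginal entropies of the two halves of the pair sample rather than literally 2*H( order0 ), which is nearly the same thing on large files. Function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a histogram of counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def mutual_information(data: bytes, distance: int) -> float:
    """I(D) = H(X) + H(Y) - H(XY) over byte pairs separated by `distance`."""
    pairs = Counter(zip(data, data[distance:]))
    first = Counter(data[:len(data) - distance])   # X : the earlier byte
    second = Counter(data[distance:])              # Y : the later byte
    return entropy(first) + entropy(second) - entropy(pairs)
```

On the toy sequence 0,0,1,1 repeated, distance 1 gives roughly zero bits (the neighbor tells you nothing) and distance 4 gives one full bit.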

On "struct72" : (note : the earlier graphs showed [1,144] , here I'm showing [1,256] so you can see peaks at 72,144 and 216).


Here's "book1" for comparison :

[chart : mutual information vs. distance for "book1"]

(left column is actually in bits for these)

and hey, charts are fun, here's book1 after BWT :

[chart : mutual information vs. distance for "book1" after BWT]

09-14-10 | Challenges in Data Compression 2.5 : More on Sparsity

Sparsity & Temporality show up as big issues in lots of funny ways.

For example, they are the whole reason that we have to do funny hand-tweaks of compressors.

eg. exactly what you use as your context for various parts of the compressor will make a huge difference. In PPM the most sensitive part is the SEE ; in CM it looks like the most sensitive part is the contexts that are used for the mixer itself (or the APM or SSE) (not the contexts that are looked up to get statistics to mix).

In these sensitive parts of the coder, you obviously want to use as much context as possible, but if you use too much your statistics become too sparse and you start making big coding mistakes.

This is why these parts of the model have to be carefully hand tweaked; eg. use a few bits from here, compact it into a log2-ish scale, bucketize these things. You want only 10-16 bits of context or something but you want as much good information in that context as possible.

The problem is that the ideal formation of these contexts depends on the file of course, so it should be figured out adaptively.

There are various possibilities for hacky solutions to this. For example something that hasn't been explored very much in general in text compression is severely asymmetric coders. We do this in video coding for example, where the encoder spends a long time figuring things out then transmits simple commands to the decoder. So for example the encoder could do some big processing to try to figure out the ideal compaction of statistics and sends it to the decoder. (* maybe some of these do exist)

If sparsity wasn't an issue, you would just throw every bit of information you have at the model. But in fact we have tons of information that we don't use because we aren't good enough at detecting what information is useful and merging up information from various sources, and so on.

For example an obvious one is : in each context we generally store only something like the number of times each character has occurred. We might do something like scale the counts so that more recent characters count more. eg. you effectively do {all counts} *= 0.9 and then add in the count of the new character as ++. But actually we have way more information than that. We have the exact time that each character occurred (time = position in the file). And, for each time we've used the context in the past, we know what was predicted from it and whether that was right. All of that information should be useful to improve coding, but it's just too much because it makes secondary statistics too sparse.
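The "scale then increment" update mentioned above looks something like this as a sketch. Real coders use integer counts and cheaper rescaling; the float dictionary and the function names here are just assumptions for illustration.

```python
def update_counts(counts, symbol, decay=0.9):
    """Decay every stored count, then add one for the symbol just seen."""
    for s in counts:
        counts[s] *= decay
    counts[symbol] = counts.get(symbol, 0.0) + 1.0
    return counts


def probability(counts, symbol):
    """Estimated probability of `symbol` from the (decayed) counts."""
    total = sum(counts.values())
    return counts.get(symbol, 0.0) / total if total else 0.0
```

After seeing A, A, A, B the decayed counts are A = 2.439 and B = 1, so B gets about 0.29 instead of the 0.25 that plain counting would give. Note that everything about when the symbols occurred has been squashed into one number per symbol, which is the point being made above.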

BTW it might pop into your head that this can be attacked using the very old-school approaches to sparsity that were used in Rissanen's "Context" or DMC for example. Their approach was to use a small context, then as you see more data you split the context, so you get richer contexts over time. That does not work, because it is too conservative about not coding from sparse contexts; as I mentioned before, you cannot tell whether a sparse context is good or not from information seen in that context, you need information from an outside source, and what Context/DMC do is exactly the wrong thing - they try to use the counts within a context to decide whether it should be split or not.

09-12-10 | Winter

It's supposed to be a really cold, wet, snowy winter this year. The gray fall has been in full force for a week now and I'm depressed and body-wrecked already.

I have to decide whether to get snow/winter tires for the car, buy a beater winter car, or just move to the Caribbean for the season.

09-12-10 | Challenges in Data Compression 2 : Sparse Contexts

A "sparse" context is one with very little information in it. Exactly what defines "sparse" depends on the situation, but roughly it means it hasn't been seen enough times that you can trust the evidence in it. Obviously a large-alphabet context needs a lot more information to become confident than a binary context, etc.

In the most extreme case, a sparse context might only have 1 previous occurrence. Let's take the binary case for clarity, so we have {n0=1,n1=0} or {n0=0,n1=1} in a single context.

There are two issues with a sparse context :

1. What should our actual estimate for P0 (the probability of a zero) be? There's a lot of theory about ideal estimators and you can go read papers on the unbiased estimator or the KT estimator or whatever that will give you formulas like

P0 = (n0 + k) / (n0 + n1 + 2k)

k = 0 or 1/2 or whatever depending on the flavor of the month

but this stuff is all useless in the sparse case. At {n0=1,n1=0} the actual probability of a 0 might well be 99% (or it could be 10%) and these estimators won't help you with that.

2. Making an estimate of the confidence you have in the sparse context. That is, deciding whether to code from it at all, or how to weight it vs other contexts. Again the important thing is that the statistics in the context itself do not really help you solve this problem.
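To put numbers on the estimator formula from point 1, here it is evaluated at the sparse case {n0=1,n1=0}. The function name is hypothetical; the formula is the one given above.

```python
def estimate_p0(n0, n1, k):
    """P0 = (n0 + k) / (n0 + n1 + 2k). Undefined when n0 = n1 = k = 0."""
    return (n0 + k) / (n0 + n1 + 2 * k)
```

With one observed zero, k = 0 gives P0 = 1.0, k = 1/2 gives 0.75, and k = 1 gives 2/3. The answer is being set almost entirely by the choice of k, not by the data, which is another way of seeing that the counts in the context can't settle the question.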

Now let me take a second to convince you how important this problem is.

In classical literature this issue is dismissed because it is only a matter of "initialization" and the early phase, and over infinite time all your contexts become rich, so asymptotically it doesn't affect ratio at all. Not only is that dismissive of practical finite size data, but it's actually wrong even for very large data. In fact sparse contexts are not just an issue for the early "learning" phase of a compressor.

The reason is that a good compressor should be in the "learning" phase *always*. Basically to be optimal, you want to be as close to the end of sparse contexts as possible without stepping over the cliff into your statistics becoming unreliable. There are two reasons why.

1. Model growth and the edges of the model. The larger your context, or the more context information you can use, the better your prediction will be. As you get more data, the small contexts might become non-sparse, but that just means that you should be pushing out more into longer contexts. You can think of the contexts as tree structured (even if they actually aren't). Around the root you will have dense information, the further out you go to the leaves the sparser they will be.

For example, PPM* taught us that using the longest context possible is helpful. Quite often this context has only occurred once. As you get further into the file, your longest possible context simply gets longer, it doesn't get less sparse.

2. Locality and transients. Real files are not stationary and it behooves you to kill old information. Furthermore, even in dense contexts there is a subset of the most recent information which is crucial and sparse.

For example, in some context you've seen 50 A's and 20 B's. Now you see two C's. Suddenly you are in a sparse situation again. What you have is an insufficient sample situation. Should I still use those 50 A's I saw, or am I now 99% likely to make more C's ? I don't have enough statistics in this context to tell, so I'm in a sparse situation again.

There are various attempts at addressing this, but like all the problems in this series there's been no definitive attack on it.

09-12-10 | PPM vs CM

Let me do a quick sketch of how PPM works vs how CM works to try to highlight the actual difference, sort of like I did for Huffman to Arithmetic , but I won't do as good of a job.

PPM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

estimate escape probability from counts at each context :

esc_o4 = estimate_escape( order4 );
esc_o2 = estimate_escape( order2 );

code in order from most likely best to least :

if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
exclude counts_o4
if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
exclude counts_o2

update counts :
counts_o4 += sym;

Now let's do context mixing :

CM :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

estimate weights from counts at each context :

w_o4 = estimate_weight( order4 );
w_o2 = estimate_weight( order2 );

make blended counts :
counts = w_o4 * counts_o4 + w_o2 * counts_o2 + ...

now code :
arithmetic_code( counts );

update counts :
counts_o4 += sym;

It should be clear we can put them together :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

if ( CM )
    estimate weights from counts at each context :

    w_o4 = estimate_weight( order4 );
    w_o2 = estimate_weight( order2 );

    make blended counts :
    counts = w_o4 * counts_o4 + w_o2 * counts_o2 + ...

    // now code :
    arithmetic_code( counts );
else PPM
    estimate escape probability from counts at each context :

    esc_o4 = estimate_escape( order4 );
    esc_o2 = estimate_escape( order2 );

    code in order from most likely best to least :

    if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
    exclude counts_o4
    if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
    exclude counts_o2


update counts :
counts_o4 += sym;

In particular if we do our PPM in a rather inefficient way we can make them very similar :

make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

accumulate into blended_counts :
blended_counts = 0;

if ( CM )
    estimate weights from counts at each context :

    w_o4 = estimate_weight( order4 );
    w_o2 = estimate_weight( order2 );

else PPM
    estimate escape probability from counts at each context :

    esc_o4 = estimate_escape( order4 );
    esc_o2 = estimate_escape( order2 );

    do exclude :
    exclude counts_o4 from counts_o2
    exclude counts_o4,counts_o2 from counts_o1

    make weights :

    w_o4 = (1 - esc_o4);
    w_o2 = esc_o4 * (1 - esc_o2);



make blended counts :
blended_counts += w_o4 * counts_o4;
blended_counts += w_o2 * counts_o2;

arithmetic_code( blended_counts );

update counts :
counts_o4 += sym;

Note that I haven't mentioned whether we are doing binary alphabet or large alphabet or any other practical issues, because it doesn't affect the algorithm in a theoretical way.
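To make the "PPM as weights" form concrete, here's a minimal sketch that turns a chain of escape probabilities into blend weights and applies exclusion. It works on normalized probabilities rather than raw counts, assumes the last context in the chain never escapes, and all function names are hypothetical.

```python
def ppm_weights(escapes):
    """escapes[i] is the escape probability of context i, deepest first.

    The weight of a context is the probability of escaping from every deeper
    context and then not escaping from this one.
    """
    weights = []
    reach = 1.0  # probability that coding gets down to this context
    for esc in escapes:
        weights.append(reach * (1.0 - esc))
        reach *= esc
    return weights


def exclude(dist, seen):
    """Drop symbols already seen in deeper contexts and renormalize."""
    kept = {s: p for s, p in dist.items() if s not in seen}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


def blend(weights, distributions):
    """Weighted sum of per-context distributions."""
    blended = {}
    for w, dist in zip(weights, distributions):
        for sym, p in dist.items():
            blended[sym] = blended.get(sym, 0.0) + w * p
    return blended
```

With escapes of 0.25, 0.5 and 0 the weights come out as 0.75, 0.125, 0.125, which sum to one. The CM path would feed learned weights into the same blend() and skip the exclude() step.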

While I'm at it, let me take the chance to mark up the PPM pseudocode with where "modern" PPM differs from "classical" PPM : (by "modern" I mean 2002/PPMii and by "classical" I mean 1995/"before PPMZ").


make a few contexts of previous characters
  order4 = * ((U32 *)(ptr-4));
[modern] also make non-continuous contexts like
[modern] skip contexts : AxBx
[modern] contexts containing only a few top bits from each byte
[modern] contexts involving a word dictionary
[modern] contexts involving current position in the stream

look up observed counts from each context :

counts_o4 = lookup_stats( order4 );
counts_o2 = lookup_stats( order2 );
counts_o1 = lookup_stats( order1 );
counts_o0 = lookup_stats( order0 );

[modern] possibly rescale counts using a "SEE" like operator
[modern] eg. use counts as an observation which you then model to predict coding probabilities

estimate escape probability from counts at each context :

esc_o4 = estimate_escape( order4 );
esc_o2 = estimate_escape( order2 );
[modern] secondary estimate escape using something like PPMZ
[modern] also not just using current context but also other contexts and side information

code in order from most likely best to least :
[modern] use LOE to choose best order to start from, not necessarily the largest context
[modern] also don't skip down through the full set, rather choose a reduced set

if ( arithmetic_code( (1-esc_o4) * counts_o4 ) ) return; else arithmetic_code( esc_o4 );
exclude counts_o4
if ( arithmetic_code( (1-esc_o2) * counts_o2 ) ) return; else arithmetic_code( esc_o2 );
exclude counts_o2

update counts :
counts_o4 += sym;
[modern] do "partial exclusion" like PPMii, do full update down to coded context
[modern]   and then reduced update to parents to percolate out information a bit
[modern] do "inheritance" like PPMii - novel contexts updated from parents
[modern] do "fuzzy updates" - don't just update your context but also neighbors
[modern]   which are judged to be similar in some way

09-12-10 | Context Weighting vs Escaping

The defining characteristic of PPM (by modern understanding) is that you select a context, try to code from it, and if you can't you escape to another context. By contrast, context weighting selects multiple contexts and blends the probabilities. These are actually not as different as they seem, because escaping is the same as multiplying probabilities. In particular :

    context 0 gives probabilities P0 (0 is the deeper context)
    context 1 gives probabilities P1 (1 is the parent)
    how do I combine them ?

    escaping :

    P = (1 - E) * P0 + E * (P1 - 0)

    P1 - 0 = Probabilities from 1 with chars from 0 excluded

    weighting :

    P = (1 - W) * P0 + W * P1

    with no exclusion in P1

In particular, the only difference is the exclusion. Specifically, the probabilities of non-shared symbols are the same, but the probabilities of symbols that occur in both contexts are different. In particular the flaw with escaping is probably that it gives too low a weight to symbols that occur in both contexts. More generally you should probably be considering something like :

    I = intersection of contexts 0 and 1

    P = a * (P0 - I) + b * (P1 - I) + c * PI

that is, some weighting for the unique parts of each context and some weighting for the intersection.
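The escaping and weighting formulas above can be compared numerically with a tiny sketch. Renormalizing the excluded parent distribution is my assumption about what "P1 - 0" means in practice, and the function names are hypothetical.

```python
def combine_escaping(p0, p1, escape):
    """P = (1 - E) * P0 + E * (P1 with the symbols of context 0 excluded)."""
    excluded = {s: p for s, p in p1.items() if s not in p0}
    norm = sum(excluded.values())
    out = {s: (1.0 - escape) * p for s, p in p0.items()}
    for s, p in excluded.items():
        # Renormalize so the excluded parent distribution sums to one.
        out[s] = escape * p / norm
    return out


def combine_weighting(p0, p1, weight):
    """P = (1 - W) * P0 + W * P1, with no exclusion in P1."""
    out = {s: (1.0 - weight) * p for s, p in p0.items()}
    for s, p in p1.items():
        out[s] = out.get(s, 0.0) + weight * p
    return out
```

Take P0 = {a : 1} and P1 = {a : 0.5, b : 0.5} with E = W = 0.25. Escaping gives a = 0.75, b = 0.25; weighting gives a = 0.875, b = 0.125. The shared symbol "a" gets less probability under escaping, which is the suspected flaw described above.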

Note that PPMii is sort of trying to compensate for this because when it creates context 0, it seeds it with counts from context 1 in the overlap.

Which obviously begs the question : rather than the context initialization of PPMii, why not just look in your parent and take some of the counts for shared symbols?

(note that there are practical issues about how you compute your weight amount or escape probability, and how exactly you mix, but those methods could be applied to either PPM or CW, so they aren't a fundamental difference).

09-12-10 | The deficiency of Windows' multi-processor scheduler

Windows' scheduler is generally pretty good, but there appears to be a bit of shittiness with its behavior on multicore systems. I'm going to go over everything I know about it and then we'll hit the bad point at the end.

Windows as I'm sure you know is a strictly exclusive high priority scheduler. That means if there are higher priority threads that can run, lower priority threads will get ZERO cpu time, not just less cpu time. (* this is not quite true because of the "balanced set manager" , see later).

Windows' scheduler is generally "fair", and most of its lists are FIFO. The scheduler is preemptive and happens immediately upon events that might change the scheduling conditions. That is, there's no delay or no master scheduler code that has to run. For example, when you call SetThreadPriority() , if that affects scheduling the changes will happen immediately (eg. if you make the current thread lower than some other thread that can run, you will stop running right then, not at the end of your quantum). Changing processor affinity, Waits, etc. all cause immediate reschedule. Waits, Locks, etc. always put your thread onto a FIFO list.

The core of the scheduler is a list of threads for each priority, on each processor. There are 32 priorities, 16 fixed "realtime" priorities, and 16 "dynamic" priorities (for normal apps). Priority boosts (see later) only affect the dynamic priority group.

When there are no higher priority threads that can run, your thread will run for a quantum (or a few, depending on quantum boosts and other things). When quantum is done you might get switched for another thread of same priority. Quantum elapsed decision is now based on CPU cycles elapsed (TSC), not the system timer interrupt (change was made in Vista or sumfin), and is based on actual amount of CPU cycles your thread got, so if your cycles are being stolen you might run for more seconds than a "quantum" might indicate.

Default quantum is an ungodly 10-15 millis. But amusingly enough, there's almost always some app on your system that has done timeBeginPeriod(1), which globally sets the system quantum down to 1 milli, so I find that I almost always see my system quantum at 1 milli. (this is a pretty gross thing BTW that apps can change the scheduler like that, but without it the quantum is crazy long). (really instead of people doing timeBeginPeriod(1) what they should have done is make the threads that need more responsiveness be very high priority and then go to sleep when not getting input or events).

So that's the basics and it all sounds okay. But then Windows has a mess of hacks designed to fix problems. Basically these are problems with people writing shitty application code that doesn't thread right, and they're covering everyone's mistakes, and really they do a hell of a good job with it. The thing they do is temporary priority boosts and temporary quantum boosts.

Threads in the foreground process always get a longer quantum (something like 60 millis on machines with default quantum length!). Priority boosts affect a thread temporarily, each quantum the priority boost wears off one step, and the thread gets extra quantums until the priority boost is gone. So a boost of +8 gives you 8 extra quantums to run (and you run at higher priority). You get priority boosts for :

 IO completion (+variable)

 wakeup from Wait (+1)
   special boost for foreground process is not disableable

 priority inversion (boosts owner of locked resource that other thread is waiting on)
   this is done periodically in the check for deadlock
   boosts to 14

 GUI thread wakeup (+2) eg. from WaitMsg

 CPU starvation ("balanced set manager")
    temporary kick to 15

boost for IO completion is :
    disk is 1
    network 2
    key/mouse input 6
    sound 8
(boost amount comes from the driver at completion time)
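As a toy model of the decay rule described above (boost wears off one step per quantum), here's a sketch that lists the effective priority for each boosted quantum. Clamping at 15, the top of the dynamic range, is my assumption, and this is only a model of the description here, not of the real kernel code.

```python
def boosted_priorities(base, boost, ceiling=15):
    """Yield the effective priority for each quantum until the boost has worn off.

    ceiling=15 assumes boosts cannot push a thread out of the dynamic range.
    """
    current = min(base + boost, ceiling)
    while current > base:
        yield current
        current -= 1
```

So a priority 6 thread that gets the +8 sound boost runs 8 quanta at priorities 14 down to 7 before it is back at its base priority.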

Look at, for example, the GUI thread wakeup boost. What really should have been happening there is you should have made a GUI message processing thread that was super minimal and very high priority. It should have just done WaitMsg to get GUI input and then responded to it as quickly as possible, maybe queued some processing on a normal priority thread, then gone back to sleep. The priority boost mechanism is basically emulating this for you.

A particularly nasty example is the priority inversion boost. When a low priority thread is holding a mutex and a high priority thread tries to lock it, the high priority thread goes to sleep, but the low priority thread might never run if there are medium priority threads that want to run, so the high priority thread will be stuck forever. To fix this, Windows checks for this case in its deadlock checker. All of the "INFINITE" waits in windows are not actually infinite - they wait for 5 seconds or so (delay is settable in the registry), after which time Windows checks them for being stalled out and might give you a "process deadlocked" warning; in this check it looks for the priority inversion case and if it sees it, it gives the low priority thread the big boost to 14. This has the weird side effect of making the low priority thread suddenly get a bunch of quanta and high priority.

The other weird case is the "balanced set manager". This thing is really outside of the normal scheduler; it sort of sits on the side and checks every so often for threads that aren't running at all and gives them a temporary kick to 15. This kick is different than the others in that it doesn't decay (it would get 15 quanta which is a lot), it just runs a few quanta at 15 then goes back to its normal priority.

You can use SetThreadPriorityBoost to disable some of these (such as IO completion) but not all (the special foreground process stuff for example is not disabled by this, and probably not the balanced set or priority inversion stuff either is my guess).

I'm mostly okay with this boosting shit, but it does mean that actually accounting for what priority your thread has exactly and how long you expect it to run is almost impossible in windows. Say you make some low-priority worker thread at priority 6 and your main thread at priority 8. Is your main thread actually higher priority than the background worker? Well, no, not if he got boosted for any of various reasons, he might actually be at higher priority right now.

Okay, that's the summary of the normal scheduler, now let's look at the multi-processor deficiency.

You can understand what's going on if you see what their motivation was. Basically they wanted scheduling on each individual processor to still be as fast as possible. To do this, each processor gets its own scheduling data; that is there are the 32 priority lists of threads on each processor. When a processor has to do some scheduling, it tries to just work on its own list and schedule on itself. In particular, simple quantum expiration thread switches can happen just on the local processor without taking a system-wide lock. (* this was changed for Vista/Win7 or sumpin; old versions of the SMP scheduler took a system-wide spinlock to dispatch; see references at bottom).

(mutex/event waits take the system-wide dispatch lock because there may be threads from other processors on the wait lists for those resources)

But generally Windows does not move threads around between processors, and this is what can create the problems. In particular, there is absolutely zero code to try to distribute the work load evenly. That's up to you, or luck. Even just based on priorities it might not run the highest priority threads. Let's look at some details :

Each thread has the idea of an "ideal" processor. This is set at thread creation in a global round-robin and a processor round-robin. This is obviously a hacky attempt to balance load which is sometimes good enough. In particular if you create a thread per core, this will give you an "ideal" processor on each core, so that's what you want. It also does assign them "right" for hyperthreading, that is to minimize thread overlap, eg. it will assign core0-hyperthread0 then core1-hyperthread0 then core0-hyperthread1 then core1-hyperthread1. You can also change it manually with SetThreadIdealProcessor , but it seems to me it mostly does the right thing so there's not much need for that. (note this is different than Affinity , which forces you to run only on those processors , you don't have to run on your ideal proc, but we will see some problems later).
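The hyperthread-aware ordering described above is easy to write down. This only reproduces the order as stated in the text (all cores on hyperthread 0 first, then hyperthread 1); it is not Windows' actual assignment code and the function name is made up.

```python
def ideal_processor_order(num_cores, threads_per_core):
    """Round-robin order of (core, hyperthread) pairs that fills whole cores first."""
    return [(core, ht)
            for ht in range(threads_per_core)
            for core in range(num_cores)]
```

For two cores with two hyperthreads each this gives (0,0), (1,0), (0,1), (1,1), matching the core0-hyperthread0, core1-hyperthread0, core0-hyperthread1, core1-hyperthread1 sequence.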

When the scheduler is invoked and wants to run a thread - there's a very big difference between the cases when there exist any idle processors or not. Let's look at the two cases :

If there are any idle processors : it pretty much does exactly what you would want. It tries to make use of the idle processors. It prefers whole idle cores over just idle hyperthreads. Then from among the set of good choices it will prefer your "ideal", then the ones it last ran on and the one the scheduler is running on right now. Pretty reasonable.

If there are not any idle processors : this is the bad stuff. The thread is only scheduled against its "ideal" processor. Which means it just gets stuffed on the "ideal" processor's list of threads and then schedules on that processor by normal single-proc rules.

This has a lot of bad consequences. Say proc 0 is running something at priority 10. Proc 1-5 are all running something at priority 2. You are priority 6 and you want to run, but your ideal proc happened to be 0. Well, tough shit, you get stuck on proc 0's list and you don't run even though there's lots of priority 2 work getting done.
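A toy model of the placement rule makes the failure case above easy to poke at. It is heavily simplified (it ignores the last-run processor, the whole-core preference, and affinity), and the function name and data layout are hypothetical.

```python
def place_ready_thread(running, thread_priority, ideal):
    """Decide where a newly ready thread goes.

    running[i] is the priority currently running on processor i, or None if idle.
    Returns (processor, runs_now).
    """
    idle = [p for p, prio in enumerate(running) if prio is None]
    if idle:
        # Some processor is idle : prefer the ideal one, else take any idle one.
        proc = ideal if ideal in idle else idle[0]
        return proc, True
    # No idle processors : only the ideal processor is considered, so the
    # thread preempts there or waits there, whatever the other procs are doing.
    return ideal, thread_priority > running[ideal]
```

With proc 0 at priority 10 and procs 1-5 at priority 2, a priority 6 thread whose ideal proc is 0 gets (0, False) : it sits in the queue. The same thread with ideal proc 1 gets (1, True).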

This can happen just by bad luck, when you happen to run into other apps' threads that have the same ideal proc as you. But it can also happen due to Affinity masks. If you or somebody else has got a thread locked to some proc via the Affinity mask, and your "ideal" processor is set to try to schedule you there, you will run into them over and over.

The other issue is that even when threads are redistributed, it only happens at thread ready time. Say you have 5 processors that are idle, on the other processor there is some high priority thread gobbling the CPU. There can be threads waiting to run on that processor just sitting there. They will not get moved over to the idle processors until somebody kicks a scheduling event that tries to run them (eg. the high priority guy goes to sleep or one of the boosting mechanisms kicks in). This is a transient effect and you should never have long term scenarios where one processor is overloaded and other processors are idle, but it can happen for a few quanta.

References :

Windows Administration Inside the Windows Vista Kernel Part 1
Windows Administration Inside the Windows Vista Kernel Part 2
Sysinternals Freeware - Information for Windows NT and Windows 2000 - Publications
Inside the Windows NT Scheduler, Part 1 (access locked)
Available here : "Inside the Windows NT Scheduler" .doc version but make sure you read the errata

Stupid annoying audio-visual formatted information. (god damnit, just use plain text people) :

Mark Russinovich Inside Windows 7 Going Deep Channel 9
Dave Probert Inside Windows 7 - User Mode Scheduler (UMS) Going Deep Channel 9
Arun Kishan Inside Windows 7 - Farewell to the Windows Kernel Dispatcher Lock Going Deep Channel 9

(I haven't actually watched these, so if there's something good in them, please let me know, in text form).

Also "Windows Internals" book, 5th edition.

ADDENDUM : The real interesting question is "can we do anything about it?". In particular, can you detect cases where you are badly load-balanced and kick Windows in some way to adjust things? Obviously you can do some things to load-balance within your own app, but when other threads are taking time from you it becomes trickier.

09-12-10 | Challenges in Data Compression 1 : Finite State Correlations

One of the classic shortcomings of all known data compressors is that they can only model "finite context" information and not "finite state" data. It's a little obtuse to make this really formally rigorous, but you could say that structured data is data which can be generated by a small "finite state" machine, but cannot be generated by a small "finite context" machine. (or I should say, transmitting the finite state machine to generate the data is much smaller than transmitting the finite context machine to generate the data, along with selections of probabilistic transitions in each machine).

For example, maybe you have some data where after each occurrence of 011 it becomes more likely to repeat. To model that with an FSM you only need a state for 011, and it loops back to itself and increases P. To model it with finite contexts you need an 011 state, an 011011 , 011011011 , etc. But you might also have correlations like : every 72 bytes there is a dword which is equal to dword at -72 bytes xor'ed with the dword at -68 bytes and plus a random number which is usually small.

The point is not that these correlations are impossible to model using finite contexts, but the correct contexts to use at each spot might be infinitely large.

Furthermore, you can obviously model FSM's by hard-coding them into your compressor. That is, you assume a certain structure and make a model of that FSM, and then context code from that hard-coded FSM. What we can't do is learn new FSM's from the data.

For example, say you have data that consists of a random dword, followed by some unknown number of 0's, and then that same dword repeated, like

you can model this perfectly with an FCM if you create a special context where you cut out the run of zeros. So you make a context like
and then if you keep seeing zeros you leave the context alone, if it's not a zero you just use normal FCM (which will predict DEADBEEF). What you've done here is to hard-code the finite state structure of the data into your compressor so that you can model it with finite contexts.

In real life we actually do have this kind of weird "finite state" correlation in a lot of data. One common example is "structured data". "Structured data" is data where there is a strong position-based pattern. That is, maybe a sequence of 32 bit floats, so there's strong correlation to (pos&3), or maybe a bunch of N-byte C structures with different types of junk in that.
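To make the (pos&3) correlation concrete, here is a minimal sketch that measures the static order-0 cost of some bytes with and without the position in the context. The function names and the float ramp test data are made up for illustration, and it ignores the cost of sending the models, so treat the numbers as an upper bound on the win rather than a real compressor.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Empirical cost in bits of coding `data` with one static order-0 byte model
// per value of (pos & posMask). posMask = 0 is plain order-0.
double StaticCodedBits(const std::vector<uint8_t>& data, unsigned posMask)
{
    assert((posMask & (posMask + 1)) == 0); // mask must be 2^k - 1
    const size_t numCtx = (size_t)posMask + 1;
    std::vector<uint64_t> hist(numCtx * 256, 0);
    std::vector<uint64_t> total(numCtx, 0);
    for (size_t i = 0; i < data.size(); i++)
    {
        const size_t ctx = i & posMask;
        hist[ctx * 256 + data[i]]++;
        total[ctx]++;
    }
    double bits = 0.0;
    for (size_t ctx = 0; ctx < numCtx; ctx++)
    {
        for (size_t sym = 0; sym < 256; sym++)
        {
            const uint64_t n = hist[ctx * 256 + sym];
            if (n != 0)
                bits -= (double)n * std::log2((double)n / (double)total[ctx]);
        }
    }
    return bits;
}

// Hypothetical test data : a ramp of 32 bit floats in [1,2), stored as raw bytes.
std::vector<uint8_t> MakeFloatRamp(int count)
{
    std::vector<uint8_t> out((size_t)count * 4);
    for (int i = 0; i < count; i++)
    {
        const float f = 1.0f + 0.001f * (float)i;
        std::memcpy(&out[(size_t)i * 4], &f, 4);
    }
    return out;
}
```

Calling StaticCodedBits(data,3) against StaticCodedBits(data,0) on the float ramp shows the position-split model costing fewer bits, because the exponent byte is constant in its own slot. This only works because the period of 4 was hard-coded; nothing here discovers it from the data.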

Note that in this sense, the trivial mode changes of something like HTML or XML or even English text is not really "structured data" in our sense, even though they obviously have structure, because that structure is largely visible through finite contexts. That is, the state transitions of the structure are given to us in actual bytes in the data, so we can find the structure with only finite context modeling. (obviously English text does have a lot of small-scale and large-scale finite-state structure in grammars and context and so on).

General handling of structured data is the big unsolved problem of data compression. There are various heuristic tricks floating around to try to capture a bit of it. Basically they come down to hard coding a specific kind of structure and then using blending or switching to benefit from that structure model when it applies. In particular, 4 or 8 byte aligned patterns are the most common and easy to model structure, so people build specific models for that. But nobody is doing general adaptive structure detection and modeling.

09-11-10 | Some Car Links

I found a nice series of videos of Colin McRae talking about rally driving techniques :

YouTube - WRC - Colin McRae - Stage Starts & Finding Grip Lessons
YouTube - WRC - Colin McRae - Scandinavian Flick & Handbrake Lessons
YouTube - WRC - Colin McRae - Power Slide Lessons
YouTube - WRC - Colin McRae - Pace Note Lessons
YouTube - WRC - Colin McRae - Oversteer & Understeer Lessons
YouTube - WRC - Colin McRae - Left-Foot Braking Lessons
YouTube - WRC - Colin McRae - Jump Lessons
YouTube - WRC Handbrake turn

The onboard videos that WRC puts up are also pretty great. There are tons of them, here's just a starter : WRC Finland 2008: Loeb SS12

While I'm dropping car links, here's the gratuitous expert opinion to back up something I want to say : Sir Stirling Moss : Dirty driving is inexcusable . I think all the recent hero worship of Senna is pretty disgusting. Dirty players in all sports seem to always have their admirers. I grew up in the basketball glory years of Jordan and Magic and so on, but we still had clowns like Stockton and Laimbeer who were basically famous for being thugs or cheaters who played outside the spirit of the game, and a certain portion of the populace loved those guys. It's shitty.

It's becoming more obvious that the purported unintended acceleration in Toyotas basically consists of stupid grannies getting confused and stomping the wrong pedal (aka "driver error"). They've taken quite a massive hit to their corporate image for what seems to have been largely a media exaggeration.

09-09-10 | Misc

People who RAR their torrents are worse than Hitler.

I think Ricky Gervais fucking sucks. I tried to put up with his podcast because everyone says it's great, but all he does is make fun of Karl Pilkington and laugh like an asshole. I just downloaded his "Out of England" standup and it is excruciating. Obviously he's doing the character of an asshole in both of those, but it seems like he's sort of adopted a permanent performance art where he plays a non-funny jerk, and I just don't enjoy the meta-humor of that enough to put up with it.

I disconnected my cable TV a while ago and I'm mostly happy with that, but now that football season is upon us I am wishing I had it. Of course TV is just a stupid waste of time, but hell that's pretty much what life is, we're all just trying to waste time until we get the sweet sweet peace of death.

09-09-10 | The sorry state of compression on Wikipedia

It disturbs me a little bit that PAQ has a huge Wikipedia page but PPM , which is a far more significant and fundamental basic algorithm, has almost nothing. Presumably because nobody is actively working on PPM at the moment, and the people working on PAQ are pimping it.

Similarly there is no page at all for ROLZ or LZP (though there are some decent pages on German Wikipedia (ROLZ) and Polish Wikipedia (LZP) ).

A lot of Wikipedia is really well sorted these days (basic math and computer science is best) but there are gaps where people haven't gotten motivated. Compression is one of those gaps, somebody cool needs to go write a bunch of good articles.

Actually it's so shameful maybe I will write it, but I don't really want to get sucked into that rat hole and waste my time on that.

What a proper PPM entry should cover :

1. The basic operation of the PPM algorithm.

2. Theoretical description of finite context modeling, theory of asymptotic optimality for finite context markov sources. Probability structure of PPM as a cascade of models.

3. Exclusions (Update Exclusion, Lazy Exclusion) (Partial Update Exclusion (PPMii))

4. Zero frequency problem & Escape estimation. PPMA,B,C,D variants. PPMP and PPMX. Clearest in Korodi paper.

(Data Compression Using Adaptive Coding and Partial String Matching '84)
(A note on the PPM data compression algorithm '88)
(Probability estimation for PPM '95)
(Experiments on the Zero Frequency Problem '95)

5. PPM* (Unbounded Length Contexts for PPM '95)

6. Langdon's PPM/LZ equivalence

7. Modern PPM : PPMZ, PPMd, PPMii, Rkive

8. Local Order Estimation (PPMZ)

9. Secondary Estimation (PPMZ)

10. Blending (PPMii / Volf)

11. "Information Inheritance" (novel context initialization) (PPMii) ( "PPM: One Step to Practicality" - Shkarin )

12. Non-simple contexts (skip contexts, position contexts) (Rkive)


Furthermore, the PAQ page needs to be changed to be more of an Encyclopedia entry and less of a list of software version releases. In particular it should describe the algorithm and its relationship to earlier work, primarily CTW which it is obviously very similar to, and the Weighting/Mixing schemes of Volf et al. as well as the very early binary markov coders such as DMC.


While I'm at it, the LZ77 entry is pretty terrible (it makes a big deal about the overlap matches which is actually pretty irrelevant, and doesn't discuss coding methods or optimal parse or anything interesting), CTW is missing an entry, and the Arithmetic Coding entry is really bizarre and badly written. A proper arithmetic coding entry should briefly describe the theoretical [0,1] interval shrinking, and then start talking about practical implementation issues & history, CACM87, DCC95, approximate arithcoding like the skew coder and howard/vitter and ELS coders, the Q & QM coders, CABAC, and finishing with the "Range" coder. The entry on the range coder should be deleted. The example code on the range coder page is actually wrong because it doesn't deal with the underflow case, which is the whole interesting question.

09-06-10 | WeightedDecayRand

I'm sure I've talked to people about this many times but I don't think I've ever written it up. Anyway, here's one from the files of game developer arcana. I'm sure this is common knowledge, but one of those things that's too trivial to write an article about.

When you use random numbers in games, you almost never actually want random numbers. Rather you want something that *feels* sort of random to the player. In particular the biggest difference is that real random numbers can be very bunchy, which almost always feels bad to the player.

Common cases are like random monster spawns or loot drops. It's really fucking annoying when you're trying to cross some terrain in an Ultima type game and you get unlucky with the random number generator and it wants to spawn a random encounter on every single fucking tile. Or if you're killing some mobs who are supposed to drop a certain loot 1 in 10 times, and you kill 100 of them and they don't fucking drop it because you keep getting unlucky in rand().

What we want is not actually rand. If you design a monster to drop loot roughly 1 in 10 times, you want the player to get it after 5 - 20 kills. You don't really want them to get it on first kill, nor do you want them to sit around forever not getting it.

In all cases, to do this what we need is a random generator which has *state* associated with the event. You can't use a stateless random number generator. So for example with loot drop, the state remembers that it has dropped nothing N times and that can start affecting the probability of next drop. Basically what you want to do for something like that is start with a very low probability to drop loot (maybe even 0) and then have it ratchet up after each non-drop. Once the loot is dropped, you reset the probability back to 0. So it's still random, but a more controllable experience for the designer.
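The ratchet idea can be sketched in a few lines. This is just one way to do it, not code from any shipped game : the class name RatchetDrop, the integer percent representation and the fixed seed are all assumptions made for the example, and there is no decay in it at all.

```cpp
#include <cassert>
#include <random>

// Drop chance starts at 0, goes up by stepPercent after every non-drop,
// and resets to 0 as soon as the loot drops.
class RatchetDrop
{
public:
    explicit RatchetDrop(int stepPercent, unsigned seed = 1234) :
        m_stepPercent(stepPercent),
        m_chancePercent(0),
        m_rng(seed),
        m_roll(0, 99)
    {
        assert(stepPercent > 0 && stepPercent <= 100);
    }

    // Call once per kill; returns true if the loot drops.
    bool Kill()
    {
        if ( m_roll(m_rng) < m_chancePercent )
        {
            m_chancePercent = 0; // dropped, start over
            return true;
        }
        m_chancePercent += m_stepPercent; // ratchet up after a non-drop
        if ( m_chancePercent > 100 )
            m_chancePercent = 100;
        return false;
    }

private:
    int m_stepPercent;
    int m_chancePercent;
    std::mt19937 m_rng;
    std::uniform_int_distribution<int> m_roll;
};
```

With stepPercent = 10 the first kill can never drop and the 11th kill after a reset always does, so the designer gets a hard window instead of hoping rand() behaves.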

Now, this state generally should decay in some way. That is, if you kill mobs and get no loot and then go away and do a bunch of stuff, when you come back it should be back to baseline - it doesn't remember what happened a long time ago. And then the next important issue is how does that state decay? Is it by time or by discrete events? Whether time or event driven makes the most sense depends on the usage.

The simplest version of a stateful semi-random generator is simply one that forbids repeats. Something like :

int Draw()
{
    for(;;) // keep drawing until we get something other than the last result
    {
        int i = RandMod(m_num);
        if ( i != m_last )
        {
            m_last = i;
            return i;
        }
    }
}

which is an okay solution in some cases. But more generally you want to not just forbid repeats, rather allow them but make them less likely. And for N-ary events you don't just care about the last one, but all the last ones.

A simple binary stateful semirandom generator is like this :

int Draw()
{
    Tick();

    float p = frandunit() * ( m_p0 + m_p1 );
    if ( p < m_p0 )
    {
        m_p0 -= m_repeatPenalty;
        m_p0 = MAX(m_p0,0.f);
        return 0;
    }
    else
    {
        m_p1 -= m_repeatPenalty;
        m_p1 = MAX(m_p1,0.f);
        return 1;
    }
}

You should be able to see that this is a simple weighted coin flip and we penalize repeats by some repeat parameter. Here we have introduced the idea of a Tick() - the tick is some function that pushes the probabilities back towards 50/50 , either by time evolution or by a fixed step.

More generally you want N-ary with various parameters. Here's some code :


class WeightedDecayRand
{
public:

    // redrawPenalty in [0,1] - 0 is a true rand, 1 forbids repeats
    // restoreSpeed is how much chance of redraw is restored per second or per draw
    explicit WeightedDecayRand(int num,float redrawPenalty,float restoreSpeed);
    ~WeightedDecayRand();

    int Draw();
    void Tick(float dt);
    int DrawFixedDecay();
    int DrawTimeDecay(double curTime);

private:

    int     m_num;
    float * m_weights;
    float   m_redrawPenalty;
    float   m_restoreSpeed;
    float   m_weightSum;
    double  m_lastTime;
};

WeightedDecayRand::WeightedDecayRand(int num,float redrawPenalty,float restoreSpeed) :
    m_num(num),
    m_redrawPenalty(redrawPenalty),
    m_restoreSpeed(restoreSpeed),
    m_lastTime(0)
{
    m_weights = new float [num];
    for(int i=0;i < num;i++)
        m_weights[i] = 1.f;
    m_weightSum = (float) num;
}

WeightedDecayRand::~WeightedDecayRand()
{
    delete [] m_weights;
}

int WeightedDecayRand::Draw()
{
    float p = frandunit() * m_weightSum;
    for(int i=0;;i++)
    {
        if ( i == m_num-1 || p < m_weights[i] )
        {
            // accepted !
            m_weightSum -= m_weights[i];
            m_weights[i] -= m_redrawPenalty;
            m_weights[i] = MAX(m_weights[i],0.f);
            m_weightSum += m_weights[i];
            return i;
        }
        else
        {
            p -= m_weights[i];
        }
    }
}

void WeightedDecayRand::Tick(float dt)
{
    // restore every penalized weight back towards 1 and rebuild the sum
    m_weightSum = 0;
    for(int i=0;i < m_num;i++)
    {
        if ( m_weights[i] < 1.f )
        {
            m_weights[i] += dt * m_restoreSpeed;
            m_weights[i] = MIN(m_weights[i],1.f);
        }
        m_weightSum += m_weights[i];
    }
}

int WeightedDecayRand::DrawFixedDecay()
{
    int ret = Draw();
    Tick( 1.f );
    return ret;
}

int WeightedDecayRand::DrawTimeDecay(double curTime)
{
    if ( curTime != m_lastTime )
    {
        Tick( (float)(curTime - m_lastTime) );
        m_lastTime = curTime;
    }

    int ret = Draw();
    return ret;
}



09-06-10 | You win, Google

You've been desperately trying to get me to stop using you as my home page. The fading UI elements pissed me off, but I stayed. The ever-increasing stream of slow loading and distracting graphics couldn't chase me off either.

But this new animating shit has done it.

I thought briefly about doing a customized Google home, but then I have to enable auto-login and it takes all the time to login and blah blah, so fuck that. I'm just making my own, which has lots of advantages :

1. It's very simple and loads very fast.

2. I can keep it on my local disk so it doesn't have to load from the web at all until I actually choose a destination.

3. It never changes. Pointless change is infuriating to the serious thinker.

4. I can put whatever I want on it.

Here's my start : myhome prototype 1

I tried the AJAX Search API (as you might note from the vestigial title of my prototype page) but just having that script in the page makes it take 2-3 seconds to load. Maybe the cool thing would be to do my own AJAX search as a sub-page that's only loaded if I actually hit the "Search" button. The nice thing about AJAX search is I can configure how results are shown (eg., just the Web maam), and I don't have to go to a google.com page at all EVER !

I stole the basics from Simply Google , but I imagine there's probably something newer/better to steal web elements from.

It's already pretty sweet just to have a home page that loads instantly, so I think this is goodbye, Google.com home page.

09-06-10 | Cross Platform SIMD

I did a little SIMD "Lookup3" (Bob Jenkins' hash), and as a side effect, it made me realize that you can almost get away with cross platform SIMD these days. All the platforms do 16-byte SIMD, so that's nice and standard. The capabilities and best ways of doing things aren't exactly the same, but it's pretty close, and you can mostly cover the differences. Obviously to get super maximum optimal you would want to special case per platform, but even then having a base cross-platform SIMD implementation to start from would let you get all your platforms going easier and identify the key-spots to do platform specific work.

Certainly for "trivial SIMDifications" this works very well. Trivial SIMDification is when you have a scalar op that you are doing a lot of, and you change that to doing 4 of them at a time in parallel with SIMD. That is, you never do horizontal ops or anything else funny, just a direct substitution of scalar ops to 4-at-a-time vector ops. This works very uniformly on all platforms.

Basically you have something like :

U32 x,y,z;

    x += y;
    y -= z;
    x ^= y;

and all you do is change the data type :

simdU32 x,y,z;

    x += y;
    y -= z;
    x ^= y;

and now you are doing four of them at once.

The biggest issue I'm not sure about is how to define the data type.

From a machine point of view, the SIMD register doesn't have a type, so you might be inclined to just expose a typeless "qword" and then put the type information in the operator. eg. have a generic qword and then something like AddQwordF32() or AddQwordU16() . But this is sort of a silly argument. *All* registers are typeless, and yet we find data types in languages to be a convenient way of generating the right instructions and checking program correctness. So it seems like the ideal thing is to really have something like a qwordF32 type, etc for each way to use it.

The problem is how you actually do that. I'm a little scared that anything more than a typedef might lead to bad code generation. The simple typedef method is :

#if SPU
typedef qword simdval;
#elif XENON // platform macro name assumed; __vector4 is the Xbox 360 vector type
typedef __vector4 simdval;
#elif SSE
typedef __m128 simdval;
#endif

But the problem is if you want to make them have their type information, like say :

typedef __vector4 simdF32;
typedef __vector4 simdU32;

then when you make an "operator +" for F32 and one for U32 - the compiler can't tell them apart. (if for no good reason you don't like operator +, you can pretend that says "Add"). The problem is the typedef is not really a first class type, it's just an alias, so it can't be used to select the right function call.

Of course one solution is to put the type in the function name, like AddF32,AddU32,etc. but I think that is generally bad code design because it ties operation to data, which should be as independent as possible, and it just creates unnecessary friction in the non-simd to simd port.

If you actually make them a proper type, like :

struct simdF32 { __vector4 m; };
struct simdU32 { __vector4 m; };

then you can do overloading to get the right operation from the data type, eg :

RADFORCEINLINE simdU32 operator + ( const simdU32 lhs, const simdU32 rhs )
{
    return _mm_add_epi32(lhs,rhs);
}

RADFORCEINLINE simdF32 operator + ( const simdF32 lhs, const simdF32 rhs )
{
    return _mm_add_ps(lhs,rhs);
}

The problem is that there is some reason to believe that anything but the fundamental type is not handled as well by the compiler. That is, qword,__vector4, etc. get special very good handling by the compiler, and anything else, even a struct which consists of nothing but that item, gets handled worse. I haven't actually seen this myself, but there are various stories around the net indicating this might be true.

I think the typedef way is just too weak to actually be useful, I have to go the struct way and hope that modern compilers can handle it. Fortunately GCC has the very good "vector" keyword thingy, so I don't have to do anything fancy there, and MSVC is now generally very good at handling mini objects (as long as everything inlines).

Another minor annoying issue is how to support reinterprets. eg. I have this simdU32 and I want to use it as a simdU16 with no conversion. You can't use the standard C method of value at/address of, because that might go through memory which is a big disaster.

And the last little issue is whether to provide conversion to a typeless qword. One argument for that would be that things like Load() and Store() could be implemented just from qword, and then you could fill all your data types from that if they have conversion. But if you allow implicit conversion to/from typeless, then all your types can be directly made into each other. That usually confuses the hell out of function overloading among other woes.
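Here's a rough portable sketch of the struct-per-type approach with explicit named reinterprets and no implicit conversion between the types. RawVec is a made-up stand-in for the platform register type (__m128, __vector4, qword) and AddLanes emulates the add in scalar code; on a real target each operator would be a single intrinsic like the _mm_add_epi32 / _mm_add_ps calls above. The Load/Store/AsU32/AsF32 names are hypothetical too.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the platform's typeless 16-byte register type.
struct RawVec { alignas(16) unsigned char b[16]; };

// One real type per way of using the register, so overloading can pick the op.
struct simdU32 { RawVec m; };
struct simdF32 { RawVec m; };

// Scalar emulation of a 4-lane add; a real port would call one intrinsic here.
template <typename Lane>
static RawVec AddLanes(const RawVec & a, const RawVec & b)
{
    static_assert(sizeof(Lane) == 4, "expects 4 lanes of 4 bytes");
    Lane x[4], y[4];
    std::memcpy(x, a.b, 16);
    std::memcpy(y, b.b, 16);
    for(int i=0;i < 4;i++)
        x[i] = x[i] + y[i];
    RawVec r;
    std::memcpy(r.b, x, 16);
    return r;
}

inline simdU32 operator + ( const simdU32 & lhs, const simdU32 & rhs )
{
    simdU32 r; r.m = AddLanes<uint32_t>(lhs.m, rhs.m); return r;
}

inline simdF32 operator + ( const simdF32 & lhs, const simdF32 & rhs )
{
    simdF32 r; r.m = AddLanes<float>(lhs.m, rhs.m); return r;
}

inline simdU32 LoadU32(const uint32_t * p)
{
    assert(p != nullptr);
    simdU32 v; std::memcpy(v.m.b, p, 16); return v;
}

inline simdF32 LoadF32(const float * p)
{
    assert(p != nullptr);
    simdF32 v; std::memcpy(v.m.b, p, 16); return v;
}

inline void Store(uint32_t * p, const simdU32 & v) { std::memcpy(p, v.m.b, 16); }
inline void Store(float * p, const simdF32 & v)    { std::memcpy(p, v.m.b, 16); }

// Explicit reinterprets : same 16 bytes, different static type, no conversion.
inline simdU32 AsU32(const simdF32 & v) { simdU32 r; r.m = v.m; return r; }
inline simdF32 AsF32(const simdU32 & v) { simdF32 r; r.m = v.m; return r; }
```

The same "a + b" source line picks the integer or float add from the type alone, and because there is no implicit path between simdU32 and simdF32 the overloads stay unambiguous. Whether a given compiler keeps the wrapped value in a register as well as the bare type is exactly the open question above; this sketch says nothing about that.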

09-04-10 | Holy fuck balls

MSVC 2005 x64 turns this :

static RADFORCEINLINE void rrMemCpySmall(void * RADRESTRICT to,const void * RADRESTRICT from, SINTa size)
{
    U8 * RADRESTRICT pto = (U8 * RADRESTRICT) to;
    const U8 * RADRESTRICT pfm = (const U8 * RADRESTRICT) from;
    for(SINTa i=0;i < size;i++) pto[i] = pfm[i];
}

into :

call memcpy

god damn you, you motherfuckers. The whole reason I wrote it out by hand is because I know that size is "small" ( < 32 or so) so I don't want the function call overhead and the rollup/rolldown of a full memcpy. I just want you to generate rep movsb. If I wanted memcpy I'd fucking call memcpy god dammit. Friggle fracking frackam jackam.

In other news, I know it's "cool" to run with /W4 on MSVC or -Wall on GCC (or even -Wextra), but I actually think it's counterproductive. The difference between /W3 and /W4 is almost entirely warnings that I don't give a flying fuck about, shit like "variable is unused". Okay fine, it's not used, fucking get rid of it and don't tell me about it.

Shit like "variable initialized but not used" , "conditional is constant", "unused static function removed" is completely benign and I don't want to hear about it.

I've always been peer-pressured into running with max warning settings because it's the "right" thing to do, but actually I think it's just a waste of my time.

MoveFileTransacted is pretty good. There's other transactional NTFS shit but really this is the only one you need. You can write your changes to a temp file (on your ram drive, for example) then use MoveFileTransacted to put them back on the real file, and it's nice and safe.

BTW while I'm ranting about ram drives; how the fuck is that not in the OS? And my god people are getting away with charging $80 for them. AND even those expensive ones don't support the only feature I actually want - dynamic sizing. It should be the fucking size of the data on it, not some static predetermined size.

09-03-10 | Page file on Ramdrive = 3LIT3

One of the awesome new "tweaks" that computer buffoons are doing is making ramdisks and putting their windows swap file on the ram disk. Yes, brilliant, that speeds up your swap file alright. So awesome.

(BTW yes I do know that it actually makes sense in one context : some of the fancy ram drives can access > 3 GB RAM in 32 bit windows, in which case it gives you a way to effectively access your excess RAM in an easy way on the old OS; but these geniuses are on Win7-64).

(And of course the right thing to do is to disable page file completely; sadly Windows has not kept up with this modern era of large RAMs and the page file heuristics are not well suited to use as an emergency fallback).

What I have copied from the tweakers is putting my firefox shite on a ram disk :

SuperSpeed Firefox By Putting Profile and SQLite Database in RAMDisk Raymond.CC Blog
Guide Move 99% of All Firefox Writes off your SSD

I can get away with a 128M ram disk and it massively reduces the amount of writes to my SSD. Reducing writes to the SSD is nice for my paranoia (it also speeds up firefox startup and browsing), but the really strong argument for doing this is that I've caught SQLite eating its own ass a few times. I really hate it when I'm just sitting there doing nothing and all of a sudden my fan kicks in and my CPU shoots up, I'm like "WTF is going on god dammit". Well the last few times that's happened, it's been the SQL service fucking around on its files for no god damn good reason. I'm not sure if putting the firefox SQL DB files on the ramdisk actually fixes this, but it makes it cheaper when it does happen anyway.

Also, don't actually use the above linked methods. Instead do this :

Do "about:config" and add "browser.cache.disk.parent_directory" . Obviously also check browser.cache.disk.enable and browser.cache.disk.capacity ; capacity is in KB BTW.

Run "firefox -ProfileManager". Make a new profile and call it "ramprofile" or whatever. Change the dir of that profile to somewhere on the ramdisk. Look at your other profile (probably "default") and see what dir it's in. Make "ramprofile" the default startup profile.

Quit firefox and copy the files from your default profile dir to your ramprofile dir. Run firefox and you are good to go.

Do not set your ramdisk to save & restore itself (as some of the fancy ramdisks can do these days). Instead put the profile dir copy operation in your machine startup bat. It's somewhat faster to zip up your default profile dir and have your machine startup bat copy the zip over to the ramdisk and unzip it there.

There's another advantage of this, which is that your whole Firefox profile is now temporary for the session, unless you explicitly copy it back to your hard disk. That's nice. If you pick up tracking cookies or whatever while browsing, or some malicious script changes your home page or whatever, they go away when you reboot.

09-03-10 | LZ and Exclusions

I've been thinking a lot about LZ and exclusions.

In particular, LZMA made me think about this idea of excluding the character that follows from the previous match. eg. after a match, if the string we were matching from was ABCD... and we just wrote ABC of length 3, we know the next character is not a D. LZMA captures this information by using XOR (there are possibly other effects of the xor, I'm not sure).

An earlier LZ77 variant named "LZPP" had even more ideas about exclusions. I won't detail that, but I'll try to do a full treatment here.

There are various ways that exclusions arise from LZ77 coding. First of all we will simplify and assume that you always code the longest possible match at each location, and you always take a match if a match is possible. These assumptions are in fact not valid in modern optimal parse coders, but we will use them nonetheless because it makes it easier to illustrate the structure of the LZ77 code.

First let's look at the exclusion after matches. Any time you are in the character after a match, you know the next character cannot be the one that continues the match, or you would have just written a longer match. But, that is not just true of the match string you chose, but also of all the match strings you could have chosen.

Say for example you just wrote a match of length 3 and got the string [abc]. Now you are on the next character. But previously occurring in the file are [abcd] and [abce] and [abcf]. ALL of those following characters can be excluded, not just the one that was in your actual match string. In particular, in a context coding sense, if the match length was L, you are now effectively coding from a context of length L and you know the next character must be novel in that context.

Also note that this applies not only to literal coding, but also to match coding. All offsets you could code a match from that start with one of the excluded characters can be skipped from the coding alphabet.
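As a brute-force illustration of the after-match exclusion set (assuming the greedy longest-match parse described above), something like this collects every excluded character. ExcludedAfterMatch is a made-up name, and a real coder would get this from its match finder rather than rescanning the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// buf[0,pos) has been coded and its last matchLen bytes were sent as one match,
// which under the greedy assumption was the longest available.
// Returns every character that cannot be at buf[pos].
std::set<char> ExcludedAfterMatch(const std::string & buf, size_t pos, size_t matchLen)
{
    assert(matchLen > 0 && matchLen <= pos && pos <= buf.size());
    std::set<char> excluded;
    const size_t cur = pos - matchLen; // where the match we just coded begins
    for(size_t s=0;s < cur;s++)
    {
        // every earlier occurrence of the matched string was a candidate source,
        // so whatever followed it there would have made a longer match
        if ( buf.compare(s, matchLen, buf, cur, matchLen) == 0 )
            excluded.insert(buf[s + matchLen]);
    }
    return excluded;
}
```

For the buffer "abcd_abce_abcf_abc" with the final [abc] coded as a match of length 3, this gives {d,e,f}, matching the example above.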

There's another type of exclusion in LZ coding that is powerful, and that is exclusion due to failure to match. Say your min match length is 2. You've written a literal and now you are about to write another literal. You can exclude all characters which would have been a match at the last position. That is,

You've just written [a] as a literal. The digrams [ab], [ac], [ad] occur in the file. You can thus exclude {b,c,d} from the current literal, because if it was one of those you would have written a match at the previous spot.
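The after-literal case can be sketched the same brute-force way, here hard-wired to a min match length of 2 and the same greedy assumptions. ExcludedAfterLiteral is a hypothetical name for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// buf[0,pos) has been coded and buf[pos-1] was sent as a literal, and we are
// about to send another literal. With min match length 2, any character that
// ever followed an earlier copy of buf[pos-1] is excluded, since it would have
// produced a match at the previous position.
std::set<char> ExcludedAfterLiteral(const std::string & buf, size_t pos)
{
    assert(pos >= 1 && pos <= buf.size());
    std::set<char> excluded;
    const char prev = buf[pos - 1];
    for(size_t s=0;s + 1 < pos;s++)
    {
        if ( buf[s] == prev )
            excluded.insert(buf[s + 1]);
    }
    return excluded;
}
```

On "abacadxa" with the last [a] sent as a literal, the digrams [ab], [ac], [ad] are in the history, so the result is {b,c,d}.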

There are two primary states of an LZ coder : after a match and after a literal (there's also an implicit state which is inside a match, continuing the match).

Coding with general exclusions is difficult to do fast, and difficult to do with a bitwise binary style coder. You can code one exclusion using the xor method, but getting beyond that seems impossible. It looks like you have to go to a full general alphabet coder and a tree structure like Fenwick or some relative.

Another issue for LZ is the redundancy of offsets. In particular the most obvious issue is that when you write short matches, many offsets are actually identical, so having the full set to choose from is just wasting code space.

eg. say you code a match of length 2, then offsets which point to a string which is the same as another string up to length 2 are redundant.

There are two ways to take advantage of this :

1. Code length first. Walk offsets backwards from most recent (lowest offset) to highest. Each time you see a substring of the [0,length) prefix that you've never seen before, count that as offset++, otherwise just step over it without incrementing offset. Basically you are just considering a list of the unique [length] substrings that have occurred in the file, and they are sorted in order of occurrence.

2. Code offset first. Walk from that offset back towards the current position (from large offset down to small). Match each string against the chosen one at offset. The largest substring match length is the base match length. Code (length - base). This relies on the fact that if there was a match of that substring with a lower offset, we would have coded from the lower offset. So whenever we code a larger offset, we know it must be a longer match than we could have gotten somewhere later in the list.

These seem outrageously expensive, but they actually can be incorporated into an ROLZ type coder pretty directly. Method 1 can also incorporate exclusions easily, you just skip over offsets that point to a substring that begins with an excluded character. One disadvantage of method 1 is that it ruins offset structure which we will talk about in the next post.
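Method 1 can be sketched as a straight walk over the offsets, which is hopelessly slow but shows what the remapped offset is. UniqueOffsetIndex is a made-up name, and to keep it short this version assumes non-overlapping matches only (offset >= length), so the decoder could do the same walk from data it already has.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Length has already been coded. Walk offsets from the most recent outward and
// count only substrings of that length that have not been seen at a smaller offset.
// Returns the deduplicated index of `offset`, or -1 if that offset is redundant
// (the same substring is available closer, so it would never be coded).
int UniqueOffsetIndex(const std::string & buf, size_t pos, size_t len, size_t offset)
{
    assert(len > 0 && offset >= len && offset <= pos && pos <= buf.size());
    std::set<std::string> seen;
    int index = 0;
    for(size_t o=len;o <= offset;o++)
    {
        const bool isNew = seen.insert(buf.substr(pos - o, len)).second;
        if ( o == offset )
            return isNew ? index : -1;
        if ( isNew )
            index++;
    }
    return -1; // not reached, the loop always returns at o == offset
}
```

With the history "xyabxxabzz" and length 2, the [ab] at offset 8 duplicates the [ab] at offset 4, so it is skipped and the [ya] at offset 9 gets index 6 instead of 7.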

08-31-10 | LOOP

Every once in a while I think this is a good idea :

#define LOOP(var,count) for(int var=0;(var)<(int)(count);var++)
#define LOOPBACK(var,count) for(int var=(int)(count)-1;(var)>=0;var--)

but then I never wind up using it, and I find the code that does use it looks really ugly. Maybe if I got in the habit that ugliness would go away.

Certainly adding a "loop" keyword would've been a good idea in C99. Instead we have GCC trying to optimize out signed int wrapping, and many compilers now specifically look for the for(;;) construct and special-case handling it as if it was a loop() keyword.

In other minor news, I'm running with the "NoScript" addon now. I've used FlashBlock for a long time to much happiness, so this is just the next level. It does break 90% of web sites out there, but it has also made me shockingly aware of how many random sites are trying to run scripts that are highly questionable (mainly data-mining and ad-serving).

People sometimes ask me about laptops because I wrote about them before :

cbloom rants 05-07-10 - New Lappy
cbloom rants 04-15-10 - Laptop Part 2
cbloom rants 04-12-10 - Laptop search
cbloom rants 01-20-09 - Laptops Part 3

A small update : I'm still very happy with the Dell Precision 6500. They've updated it with some newer GPU options, so you can now get it with an ATI 7820 , which is a full DX11 part and close to as fast as you can get mobile. Other than that all my advice remains the same - get USB 3 of course, quad i7, LED backlighting, install your own SSD and RAM and do a clean windows install. Do NOT get RAID disks, and do NOT install any Intel or Dell drivers. I have no DPC latency problems or any shite like that. The only thing about it that's slightly less than perfect is the temperature / fan control logic is not quite right.

It looks like there's a problem with the NV GTX 280 / Dual chip / Powermizer stuff. See :

My Solution for Dell XPS M1530 DPC Latency
Dell, DPC Latency, and You - Direct2Dell - Direct2Dell - Dell Community
Dell Latitude DPC Latency Issues

So I recommend staying away from that whole family. The new Vaio Z looks really awesome for a small thin/light, however there is a bit of a nasty problem. They only offer it with SSD's in RAID, and they are custom non-replaceable SSD's, and it appears that the reason for the 4 SSD's in RAID is because they are using cheapo low end flash. There's lots of early reports of problems with this setup, and the fact that you have to run RAID and can't change the drives is a big problem. Also, last time I checked Windows still can't send through TRIM to RAID, so that is borked, but theoretically that will be addressed some day.

08-27-10 | TV Reviews

"Red Riding" is outrageously good. Like maybe the best short series ever on TV good.

"Treme" is outrageously bad. Like so bad that it makes me lose all faith in Simon, and makes me wonder if "The Wire" actually is good or if I was part of some collective delusion. It's a mess of characters I don't really care about, each with their own uninteresting unrealistic story. It's just awful writing, and everything feels so forced, it's just screaming "look how new orleans this is" all the time.

Ghost in the Shell Stand Alone Complex (the series) is really damn good. I like it way better than the movies, for example; the movies are obviously prettier, but also pretty vapid. The series is meaty, and totally sucks you in. Especially in "2nd Gig" it really gets rolling where after each episode you just have to immediately watch the next one to see what happens next. Every once in a while there's an episode that's not part of the main story line which is supposed to flesh out one of the side characters (Pozu, Saito, Tokosa) and those can be real stinkers. The episodes on the main story line are the good ones.

I rather enjoyed the "Jesse Stone" series, I'm somewhat embarrassed to admit. It's definitely very cheesy, mild, CBS fare that would please your grandma, but I found it had a really nice tone, a quietness, a really well crafted mood. There's a slowness to it, the camera hangs on scenery. In many ways it reminds me of the "Wallander" series which I also liked (though Wallander is much better). Both are basically terrible stereotypical cop stories. Oh, the cop is an alcoholic, has emotional demons, trouble with his ex-wife, he doesn't follow the rules, has trouble with the chief, has hunches, but he's great at what he does. How fucking cliche and boring. Both shows are saved by the light. The light is like another actor, a presence that pervades everything. In Wallander it's the thin, bright, sideways Swedish light. In Jesse Stone it's the clouds and fog and darkness, with shafts breaking through. The first few Jesse Stones are the good ones, the later ones are pretty weak.

Yikes, one nightmare I'd like to forget is "Gavin and Stacey". Sometimes I randomly grab something because it is high rated on Metacritic TV. Lately I have not been doing well with the British imports. "Gavin and Stacey" is like some awful modern Married with Children, where they're just crass awful people and that's supposed to be funny for some reason. "Ideal" was about some fat guy who talks with his mouth closed and smokes pot a lot and nothing funny ever happens.

I've been watching a bit of "Twin Peaks" ; I never saw it originally. It holds up surprisingly well. I think it's because the show always had a sort of weird cheesy sitcom/soap-opera gone horribly wrong vibe to it, and the dated production and video look go along with that. It's very inconsistent. I find that the episodes that were actually directed by David Lynch are really good, really creepy and ominous and exciting, but then the other episodes just get really boring. Lynch crafts these moments that are just so weird, but they're actually really little things that he does. One scene that really struck me was when Cooper is lying on the floor and the old man waiter at the hotel comes in and just keeps talking about the glass of milk, and the scene just keeps going and going and Lynch draws it out and nothing is really happening but you get more and more creeped out.

08-27-10 | Washington

The blackberries on Mercer Island are ripe now; you can smell the sweet heavy musk of them as you ride by. I love to stop and have a bite on my ride.

All the nu-fred yuppie trainees are pretty annoying. I have to remember to just ignore them. I could list all the dumb assholish things they do (tailing me too close, slamming on the brakes when I'm right behind them, running stop signs, etc.) , but it's the same thing drivers do, it's the same thing everyone does. They're all fucking assholes and retards, I can't let that get me down too much. One good move I've picked up lately in both my riding and driving is just to pull over and stop. Some fucking dick is riding my ass for no reason and annoying me. In my youth I would've just grit my teeth, or yelled at him, or something. Now I just pull over and stop for a while and let him get away from me, and go back to enjoying myself without them.

The mountains are covered in blankets of huckleberries now. I've written about them before . Now's the time! I think the easiest way to get to them is off Steven's Pass. You can actually just go to the ski resort and then hike south to Josephine Lake, which eliminates a lot of hill climbing, and you get into prime berry territory. They have such a bright, unique flavor. There's like notes of apricot or something; they kind of remind me of the flavor of "now n' laters" that has that tanginess.

Summer is almost over. This has been one of the worst summers of my life. I don't mean that bad things have happened, I mean the summer itself was shit. Normally I go inside through the winter, get fat and drink and get depressed, then the sun comes out and I go outdoors and run around naked and be free and get fit and happy. This year, summer didn't start until July 5 (I remember it well because July 4 was still rainy and gray). Now it appears to be fading away into fall already, and it was MIA through most of July. And I worked way too much through almost all of it. I never really got into "summer mode" at all, never got fit, never got that feeling of being free and running around. The closest was when we had that brief heat wave.

I do love heat waves here. It just gives you no choice but to get down to the lake and have a swim. Everyone cool who loves life is hanging out by the water, and it's just a grand old time. The water is very cold, but it's invigorating and perfect on a 90+ day. The life guards love to yell at me, and the samoans dive on top of each other, and the russian mobster-wannabes cruise around Kirkland.

08-27-10 | Cumulative Probability Trees

I've been thinking about Cumulative Probability Trees today; if you prefer, it's partial-sum trees. Basically you have a count for each index C[i] , and you want to be able to quickly get the sum of all counts between 0 and i = S[i]. You want to be able to query and update both in <= logN time. So for example just using S[i] as an array makes query be O(1), but then update is O(N), no good.

The standard solution is Fenwick Trees . They are compact (take no more room than the C[i] table itself). They are O(log(N)) fast. In code :

F[i] contains a partial sum of some binary range
the size of the binary range is equal to the bottom bit on of i
if i&1  - it contains C[i]
if i&3 == 2 - contains Sum of 2 ending with i (eg. C[i]+C[i-1] )
if i&7 == 4 - contains Sum of 4 ending with i
if i&0xF == 8 - contains Sum of 8 ending with i
(follow the link above to see pictures). To get C[i] from F[i] you basically have to get the partial sums S[i] and S[i-1] and subtract them (you can terminate early when you can tell that the rest of their sum is a shared walk). The logN walk to get S from F is very clever :

sum = 0;
while (i > 0)
{
    sum += F[i];
    i = i & (i-1);
}

The i & (i-1) step turns off the bottom bit of i, which is the magic of the Fenwick Tree structure being the same as the structure of binary integers. (apparently this is the same as i -= i & -i , though I haven't worked out how to see that clearly).
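On the i & -i remark : in two's complement -i is ~i + 1. Flipping the bits turns the trailing zeros of i into ones and the bottom set bit into a zero; adding 1 carries through those ones and turns that bit back on, while every higher bit stays flipped. So i & -i is exactly the bottom set bit, and subtracting it clears that bit, same as i & (i-1). Here is a minimal sketch of the query and update walks, assuming a 1-based F with F[0] unused; the function names are made up for illustration :

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch only : names are hypothetical. F is 1-based, F[0] is unused.

// S(i) = C[1] + ... + C[i]
static uint32_t FenwickQuery(const std::vector<uint32_t> & F, uint32_t i)
{
    uint32_t sum = 0;
    while (i > 0)
    {
        sum += F[i];
        i &= (i - 1); // turn off the bottom set bit
    }
    return sum;
}

// C[i] += delta : walk up by adding the bottom set bit
static void FenwickUpdate(std::vector<uint32_t> & F, uint32_t i, uint32_t delta)
{
    while (i < F.size())
    {
        F[i] += delta;
        i += i & (0u - i); // i & -i isolates the bottom set bit
    }
}

// checks that the two ways of clearing the bottom bit agree,
// and that the tree matches a brute-force running sum
static bool FenwickSelfCheck()
{
    for (uint32_t i = 1; i < 4096; i++)
        if ( (i & (i - 1)) != (i - (i & (0u - i))) ) return false;

    const uint32_t N = 256;
    std::vector<uint32_t> F(N + 1, 0);
    for (uint32_t i = 1; i <= N; i++)
        FenwickUpdate(F, i, (i * 7u) % 5u);

    uint32_t brute = 0;
    for (uint32_t i = 1; i <= N; i++)
    {
        brute += (i * 7u) % 5u;
        if ( FenwickQuery(F, i) != brute ) return false;
    }
    return true;
}
```

The update walk is the mirror image of the query walk : query strips the bottom bit to move to the range that precedes you, update adds the bottom bit to move to the next larger range that contains you.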

If you put F[0] = 0 (F starts indexing at 1), then you can do this branchless if you want :

sum = 0;
UNROLL8( sum += F[i]; i = i & (i-1); );

(for an 8-bit symbol, eg 256 elements in tree).

You can't beat this. The only sucky thing is that just querying a single probability is also O(logN). There are some cases where you want to query probability more often than you do anything else.

One solution to that is to just store the C[i] array as well. That doesn't hurt your update time much, and gives you O(1) query for count, but it also doubles your memory use (2*N ints needed instead of N).

One option is to keep C[i], and throw away the bottom level of the Fenwick tree (the odd indexes that just store C[i]). Now your memory use is (3/2)*N ; it's just as fast but a little ugly.

But I was thinking what if we start over. We have the C[i], what if we just build a tree on it?

The most obvious thing is to build a binary partial sum tree. At level 0 you have the C[i], at level 1 you have the sum of pairs, at level 2 you have the sum of quartets, etc :

showing the index that has the sum of that slot :


So update is very simple :

Tree[0][i] ++;
Tree[1][i>>1] ++;
Tree[2][i>>2] ++;

But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :

sum = 0;

if ( i&1 ) sum += Tree[0][i-1];
i >>= 1;
if ( i&1 ) sum += Tree[1][i-1];
i >>= 1;
if ( i&1 ) sum += Tree[2][i-1];

This is O(logN) but rather uglier than we'd like.

So what if we instead design our tree to be good for query. So we by construction say that our query for cumprob will be this :

sum = Tree[0][i];
sum += Tree[1][i>>1];
sum += Tree[2][i>>2];

That is, at each level of the tree, the index (shifted down) contains the amount that should be added on to get the partial sum that precedes you. That is, if i is >= 64 , then Tree[6][1] will contain the sum from [0..63] and we will add that on.

In particular, at level L, if (i>>L) is odd , it should contain the sum of the previous 2^L items. So how do we do the update for this?

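// the slot that sums the block containing i is the odd sibling,
// so we bump it whenever our own index at that level is even
// (level 0 does double duty : it holds C[i] and the pair sum)
Tree[0][i] ++;
if ( !(i&1) ) Tree[0][i+1] ++;
i >>= 1;
if ( !(i&1) ) Tree[1][i+1] ++;
i >>= 1;
if ( !(i&1) ) Tree[2][i+1] ++;

or :

Tree[0][i] ++;
if ( !(i&1) ) Tree[0][i+1] ++;
if ( !(i&2) ) Tree[1][(i>>1)+1] ++;
if ( !(i&4) ) Tree[2][(i>>2)+1] ++;

or branchless :

Tree[0][i] ++;
Tree[0][i|1] += (~i)&1;
i >>= 1;
Tree[1][i|1] += (~i)&1;
i >>= 1;
Tree[2][i|1] += (~i)&1;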

this is exactly complementary to the query in the last type of tree; we've basically swapped our update and query.

Now if you draw what the sums look like for this tree you get :

These are the table indexes :
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7
0 1 2 3
0 1

These are the table contents :
C0 C0-C1 C2 C2-C3 C4 C4-C5 C6 C6-C7 C8 C8-C9 C10 C10-C11 C12 C12-C13 C14 C14-C15
0 C0-C1 0 C4-C5 0 C8-C9 0 C12-C13
0 C0-C3 0 C8-C11
0 C0-C7
it should be obvious that you can take a vertical sum anywhere and that will give you the partial sum to the beginning. eg. if I take a vertical sum in the 13th slot I get (C12-C13)+(0)+(C8-C11)+(C0-C7) = (C0-C13) , which is what I want.

Now what if we slide everything left so we don't have all those zeros in the front, and we'll go ahead and stick the total sum in the top :
C0 C0-C1 C2 C2-C3 C4 C4-C5 C6 C6-C7 C8 C8-C9 C10 C10-C11 C12 C12-C13 C14 C14-C15
C0-C1 0 C4-C5 0 C8-C9 0 C12-C13 0
C0-C3 0 C8-C11 0
C0-C7 0

Now, for each range, we stuff the value into the top slot that it covers, starting with the largest range. So (C0-C7) goes in slot 7 , (C0-C3) goes in slot 3, etc :
C0 -v- C2 -v- C4 -v- C6 -v- C8 -v- C10 -v- C12 -v- C14 -v-
C0-C1 -v- C4-C5 -v- C8-C9 -v- C12-C13 -v-
C0-C3 -v- C8-C11 -v-
C0-C7 -v-

(the -v- is supposed to be a down-arrow meaning you get those cell contents from the next row down). Well, this is exactly a Fenwick Tree.

I'm not sure what the conclusion is yet, but I thought that was like "duh, cool" when I saw it.

08-27-10 | Jesus you Google Spam guys need to get your heads out of your asses

Blogger in all its wisdom has recently been marking some of your comments as Spam.

In all its user friendliness, it doesn't give me any notification of this. In fact, it CC's the comment to my email just like normal, but doesn't mark it in any way as being spam-filtered.

Because of this the comments can sit for a long time in the Blogger Spam pot until I realize "WTF why is that not showing up" and go fix it.

So if you posted something, but don't see it, that's probably why. You can just email me and say "WTF where's my comment".

It's pretty awesome that it thinks technical posts which have absolutely no spam-like properties at all are spam, but the small handful of times I actually have gotten spam comments they have been completely obvious, and allowed right through.

This is why I fucking hate web software. I had things working decently, it was fine, and then they fucking push some new feature on me that I'm not allowed to opt out of and it fucks up my life. God damn you.

I'm sure that someday they will break the Blogger API that I am using to autopost here, and when that happens I might just retire.

The really offensive thing is getting changes pushed on you that you didn't ask for. Then everybody asks "can we please have it the old way, I liked it just fine" and they tell you "no, eat your broccoli, this way is better". We got the same thing with Win 7 and Vista where random shit is changed and people are like "please just an option to have the old way" and they say "this way is better, you're not allowed to have the old way". Fuck you, you're not my mom. I'll make my own damn decisions about what's better for me or not. It pisses me off that this has become standard practice in software.

Basically being a fucking dick is now a carefully studied science. They call it "controlling the user experience" or "locking in eyeballs" or "tying together products" or "promoting certain user stories" or "rapid upgrade encouragement". It's fucking manipulation and it's not fucking cbloom endorsed.

In related news, the volume of spam coming to my email address has recently exploded. The Gmail spam filter is very unreliable, so I've made a point of going and examining all the spam periodically, but that's not really possible now that I'm getting 500 spams a day (up from 50-100 a week ago).

So if you send me an email and I don't reply when you think I should have, it's possible it got spam-marked.

It would be so fucking easy for gmail to fix this properly if they cared. For one thing, you could show me the spam probability in the spam folder, and let me sort by it, so I can just manually look at the ones you weren't sure about. You could also whitelist people I know. You could send back captcha challenges when someone's mail gets marked as spam so they at least know it.

Hell there are a million trivial solutions, it's not like it's a hard problem.

Maybe I'll write my own spam filter one day.

08-25-10 | ProfileParser

I realized a long time ago that the work I did for my AllocParser could basically be applied as-is to parsing profiles. It's the same thing, you have a counter (bytes or clocks), you want to see who counted up that counter, get hierarchical info, etc.

Also Sean and I talked long ago about how to do the lowest impact possible manually-instrumented profiler with full information. Basically you just record the trace and don't do any processing on it during recording. All you have to do is :

Profiler_Push(label)  *ptr++ = PUSH(#label); *ptr++ = rdtsc();
Profiler_Pop( label)  *ptr++ = POP( #label); *ptr++ = rdtsc();

#define PUSH(str)  (U64)(str)
#define POP(str)   (-(S64)(str))

where ptr is some big global array of U64's , and we will later use the stringized label as a unique id to merge traces. Once your app is done sampling, you have this big array of pushes and pops, you can then parse that to figure out all the hierarchical timing information. In practice you would want to use this with a scoper class to do the push/pop for you, like :

class rrScopeProfiler { rrScopeProfiler() { Push; } ~rrScopeProfiler() { Pop; } };

#define PROFILE(label)  rrScopeProfiler RR_STRING_JOIN(label,_scoper) ( #label );

Very nice! (obviously the pop marker doesn't have to provide the label, but it's handy for consistency checking).
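A minimal sketch of the record-then-parse idea. Several things here are assumptions made to keep it portable and self-contained : std::chrono stands in for rdtsc, the trace is a std::vector instead of a raw global U64 array, a bool marks push vs pop instead of negating the label pointer, and all the names (TraceEvent, ScopeProfiler, Profiler_Parse) are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch only, see the assumptions listed above.
struct TraceEvent { const char * label; bool push; uint64_t time; };

static std::vector<TraceEvent> g_trace;

static uint64_t Profiler_Now()
{
    using namespace std::chrono;
    return (uint64_t) duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch() ).count();
}

// recording does no processing at all, it just appends
static void Profiler_Push(const char * label) { g_trace.push_back({ label, true,  Profiler_Now() }); }
static void Profiler_Pop (const char * label) { g_trace.push_back({ label, false, Profiler_Now() }); }

struct ScopeProfiler
{
    const char * m_label;
    explicit ScopeProfiler(const char * label) : m_label(label) { Profiler_Push(m_label); }
    ~ScopeProfiler() { Profiler_Pop(m_label); }
};

#define PROFILE_JOIN2(a,b) a##b
#define PROFILE_JOIN(a,b)  PROFILE_JOIN2(a,b)
#define PROFILE(label) ScopeProfiler PROFILE_JOIN(label,_scoper)( #label )

// offline pass : replay the trace with a stack and accumulate
// inclusive time keyed by the full call path, eg. "/outer/inner"
static std::map<std::string,uint64_t> Profiler_Parse(const std::vector<TraceEvent> & trace)
{
    std::map<std::string,uint64_t> inclusive;
    std::vector<TraceEvent> stack;
    std::string path;
    for (const TraceEvent & e : trace)
    {
        if ( e.push )
        {
            stack.push_back(e);
            path += "/";
            path += e.label;
        }
        else
        {
            // consistency check : the pop label must match the open push
            assert( ! stack.empty() && std::string(stack.back().label) == e.label );
            inclusive[path] += e.time - stack.back().time;
            path.resize( path.size() - 1 - std::string(e.label).size() );
            stack.pop_back();
        }
    }
    return inclusive;
}
```

Usage is just PROFILE(decode); at the top of a scope, then Profiler_Parse(g_trace) after the run. Keying on the full path is what gives you the top-down hierarchy; keying on the leaf label alone would give the flat bottom-up view.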

(if you want to get fancy, the profiler push/pop array should really be write-combining memory, and the stores should be movnti's on x64 (*), that way the profiler push/pop wouldn't even pollute your cache, which makes it very low impact indeed)

(* BTW movnti is exposed as _mm_stream_si64 ; unfortunately, this does not exist in VC2005, you need 2008; the fact that they took away inline asm and then failed to expose all the instructions in intrinsics was a huge "fuck you" to all low-level developers; it was terrible in the early days, they've caught up a bit more with each rev ) (note that movntq is not the same, movnti comes from normal registers).

So I did this, and made my AllocParser able to parse that kind of input and turn it into AllocRecords. (the normal mode of AllocParser is to handle stack traces of memory allocations).

So now I can make tabview output for either top-down or bottom-up hierarchies, and also graphviz output like : lz_profile.svg .

There are a few minor things to clean up, like I'd like to be able to show either seconds or clocks, I'd like to be able to divide clocks by the total # of bytes to make it a clocks per byte, and if a node only has a self time, I'd like to delete the redundant +self node.

Another option is to show the hierarchy in a totally different graphviz style, using the boxes-within-boxes method. I tried that before for allocs and it was hideous, but it might be okay for timing. If I do that, then I could combine this with a global timeline to show thread profiles over time, and then use the arrows to connect dependencies.

Then I have to buy that old plotter I've been talking about ...

08-25-10 | Computing Dream

I'd like to get a box like an old Cray with all the LED's all over the outside. Then, have it run unit tests on my code modules over and over in a loop, constantly syncing from p4 and re-running them.

Each LED corresponds to one unit test; if it passes it shows green, if it fails to run it shows red, if it fails to compile it flashes red.

It also runs speed and memory use tests and shows them in equalizer-type graph info. It bleeps and bloops.

So you can just sit there and jam away and write code and keep checking it in, and you get instant visual feedback if you break a unit test, or slow something down.

Ooh, it would be cool if there were various display widgets and various outputs, and you could actually run physical wires to hook them up. Like oldschool telephone switchboards. So maybe I only have like 4 speed display graph panels, but I have 20 tests that can output speed information. I can reach over and plug the wire from the test I want to the panel I want to see it on.

08-25-10 | Disturbance

I'm very sensitive to disturbances when I work. I like absolute peace. I have trouble working even in the office at RAD, and I have my own very nice office with windows and fresh air and a door I can close, but it still badly hurts my productivity.

Part of it is just little bits of noise around. I've always had a bad case of the "prairie dog" prey instinct - whenever I hear a noise I have to stop and look around "what's that? what's that?" , I get all jittery and nervous. But even beyond that - when I'm at the office at odd times when there's nobody directly near me, I'm still affected. Just knowing that someone is in the same building as me bugs me. I can't relax, my butthole is all tight and I just can't get into the groove and dive into the code.

Working at home is often good, and N is very understanding about me needing to go in my room and be left alone, but still I feel the craving for more. Mainly it's the damn home improvers that plague me now. The worst thing is not knowing when it's going to happen. Getting my mind into a real sharp work state takes a lot of effort and forethought. It's sort of like a performer getting ready to be "on" for the camera - you have to psyche yourself up, make sure you're hydrated and have proper blood sugar, then the stage lights go on and I sit down at my DevStudio to shine ... and then the fucking neighbor starts running his roto-tiller or some shit and my performance is cancelled.

As I get older I realize that the artist's studio in the country is really the ideal setup. Of course we've always heard about these artists who have a country home, and then an outbuilding that they turn into a studio, so you can just retreat into your private space and be alone to work. I always thought "what an indulgence" or "what sensitive woosies" , but yeah, that would be sweet.

08-24-10 | AutoPrintf Release

Okay, I think AutoPrintf is close enough to final to release it. The nicer, more readable version is in : cblib.zip , but I understand you all like little autonomous files, so I grabbed the shit it needs and crammed it together in an ugly mess here :


(among other advantages, the cblib version doesn't pull vector.h or windows.h into the apf headers, both of which I consider to be very severe sins)

See older posts for a description of how it works and earlier not-good way of doing it and initial announcement .

The basic way it works is :

that's it, very simple!

Here's my main.cpp for an example of usage :

#include <float.h>
#include <stdio.h>
#include <string.h>

//#include "autoprintf.h"
#include "apf.h"

#include <windows.h>

struct Vec3 { float x,y,z; };

namespace apf 
{
inline const String ToString( const Vec3 & v )
{
    return StringPrintf("{%.3f,%.3f,%.3f}",v.x,v.y,v.z);
}

inline size_t autoArgConvert(const HWND arg)
{
    return (size_t)arg;
}

inline const String ToString( const HWND v )
{
    return StringPrintf("[%08X]",v);
}
}

int main(int argc,const char * argv[])
{
    //autoprintf("test bad %d",3.f);
    autoprintf("hello %-7a",(size_t)400,"|\n");
    autoprintf("percent %% %s","100%"," stupid","!\n");
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello %a ",L"unicode is balls"," \n");
    //autoprintf("hello %S ",L"unicode is balls"," \n");
    autoprintf("hello ",apf::String("world")," \n");
//  autoprintf("hello ",LogString()); // compile error
    autoprintf("top bit ",(1UL<<31)," \n");
    autoprintf("top bit %d",(1UL<<31)," \n");
    autoprintf("top bit %a",(1UL<<31)," \n");
    autoprintf("top bit %a\n",(size_t)(1UL<<31));
    HANDLE h1 = (HANDLE) 77;
    HWND h2 = (HWND) 77;
    autoprintf("HANDLE %a\n",h1);
    autoprintf("HWND %a\n",h2);
    char temp[1024];
    autosnprintf(temp,1023,"hello %a %a %a",4.f,7,apf::String("world"));
    Vec3 v = { 3, 7, 1.5f };
    autoprintf("vector ",v," \n");
    autoprintf("vector %a is cool %a\n",v,(size_t)100);
    return 0;
}

The normal way to make user types autoconvert is to add a ToString() call for your type, but you could also use autoArgConvert. If you use autoArgConvert, then you will wind up going through a normal %d or whatever.

One nice thing is that this autoprintf is actually even safer than my old safeprintf. If you mismatch primitive types, (eg. you put a %d in your format string but pass a float), it will check it using the same old safeprintf method (that is, a runtime failure). But if you put a std::string in the list when you meant to put a char *, you will get a compile error now, which is much nicer.

Everything in cblib now uses this (I made Log.h be an autoprintf) and I haven't noticed a significant hit to compile time or exe size since the templates are all now deterministic and non-recursive.

Yes it does a lot of dynamic allocation. Get with the fucking 20th century. And it's fucking printf. Printf is slow. I don't want to hear one word about it.

08-24-10 | Free Trail Maps

Don't support those Green Trails cocksuckers who charge $400 for their maps. (if you do have the Green Trails maps, I encourage you to post them for public use on the net; you fuckers have abused your right for me to respect your copyright; in fact I vaguely considered buying a set just to scan it and post it, but they're not even worth that).

It's most appalling because all the maps are based on the USGS data, which is paid for by our fucking tax dollars. Fortunately there are some perfectly good free map sites :

libremap.org : Libre Map Project - Free Maps and GIS data
digital-topo-maps.com : Free Printable Topo Maps - Instant Access to Topographic Maps
ACME Mapper 2.0

Of these, digital-topo-maps.com is the easiest to browse around cuz it just uses Google maps (actually, it seems to be like 10X faster than Google's own interface, so it's actually just a nice way to browse normal maps too).

Libre Maps is the most useful for hiking with because it has nice printer-ready pages in high quality.

Also, I sometimes forget that Google still has the Terrain maps because they are hidden under More... now. I'm sure somebody has done trail overlays for Google Terrain, but I haven't found it.

08-23-10 | Oh god dammit

1. BLAH I want to skip the DVD menu so bad. The worst ones are when it forces me to watch the same sequence THREE TIMES. They put some shit unskippable before the menu, then they use it on loop as the background for the menu, and then again after I hit play. WTF WTF.

2. When I call AAA it gives me fucking California because my phone is from CA. Seriously? People don't have cell phone numbers from other states? I get to wait on hold for five minutes, then request WA, then wait again.

3. Fucking grocery store near my house is phasing out human checkers for these fucking automated machines. In theory that's a nice idea, but in practice the things are so fucking broken that they are instant boiling blood and CHARLES SMASH rage attack. They just constantly freak out and go into "please return the item to the bagging area" ; god dammit I already put the item in the bagging area you fucking turd.

4. Fucking ticket I got for going 86 on a 65 mph freeway is costing me $1500 a year in raised insurance rate. Unbelievable. I understand it's a random tax and so on, which I'm sort of fine with, but the collusion with the insurance industry is criminal (Geico buys laser guns for police departments, car insurers pay lobbyists to support speed cameras, etc.). It'll be on my record for at least three years, so total cost to me is around $5000 , total profit for the municipality is maybe $100.

5. On the other hand, fuck you for crowing about red light cams being dismissed . When you run a red light, the camera should immediately fire a predator missile and blow up your car, you fucking dangerous self-righteous cock. I see a lot of people these days with photo-blocking plates on their cars too. You fucking shit head, if you run red lights you deserve punishment.

6. I got another fraudulent withdrawal from my First Mutual account from the EXACT SAME fraudulent merchant. I put in yet another fraud report, and I asked them WTF why didn't they block it since I had already reported fraud the last time. They said they won't block future withdrawals unless I do a $20 "stop payment" request. WTF WTF, I have to pay you to stop people from just withdrawing money from my account any time they want? I also asked about how these ACH's are authorized. They told me anybody with a merchant account can ACH withdraw from anybody else whenever they want. WTF WTF. I have to fill out a hundred forms and sign a bunch of shit and fax it and all that bullshit if I want to do an ACH from my *OWN* account, but other fuckers can just ACH any time they want straight out of my money without my permission.

7. The United States now has an overt policy of assassinating anyone they want in non-warzones without any judicial oversight or even the slightest proof of guilt necessary. Holy shit, at least in the past the CIA was secretive about their assassinations because they knew they were doing something wrong. Now we just blow up people. I also think the idea that we should care whether or not these assassinations are of American citizens or not is disgusting. They're human beings in a non-warzone with no proof of guilt and plenty of collateral damage. We all should be outraged, but we're so fucking whipped that we don't even blink any more.

08-23-10 | AutoPrintf v2

Okay, v2 works and confirmed doesn't give massive exe size bloat or compiler time death the way v1 does.

At the heart of v2 is a "fixed" way of doing varargs. The problem with varargs in C is that you don't get the types of the variables passed in, or the number of them. Well there's no need to groan about that because it's actually really trivial to fix. You just make a bunch of functions like :

template < typename T1, typename T2, typename T3 >
inline String autoToStringSub( T1 arg1, T2 arg2, T3 arg3)
{
    return autoToStringFunc( 3,
            safeprintf_type(arg1), safeprintf_type(arg2), safeprintf_type(arg3), 
            arg1, arg2, arg3, 0 );
}

for various number of args. Here autoToStringFunc(int nArgs, ...) is the basic vararg guy who will do all the work, and we just want to help him out a bit. This kind of adapter could be used very generally to make enhanced varargs functions. Here I only care about the "printf_type" of the variable, but more generally you could use type_info there. (you could also easily make abstract Objects to encapsulate the args and pass through an array of Objects, so that the called function wouldn't have to be a stupid C vararg function at all, but then it's harder to pass through to the old C funcs that still want varargs).
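A minimal self-contained sketch of that "enhanced varargs" adapter. ArgType, TypeOf, FormatTyped and Typed are hypothetical stand-ins for safeprintf_type, autoToStringFunc and autoToStringSub; this is not the cblib code, it only handles three basic types, and it just concatenates the args instead of parsing format strings.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Sketch only : all names here are hypothetical.
enum ArgType { ARG_INT, ARG_DOUBLE, ARG_CSTR };

// overloads resolved at compile time give the type tag of each arg
inline ArgType TypeOf(int)          { return ARG_INT; }
inline ArgType TypeOf(double)       { return ARG_DOUBLE; }
inline ArgType TypeOf(const char *) { return ARG_CSTR; }

// the plain varargs worker : receives the count, then all the type
// tags, then the args themselves, so it always knows what to read
static std::string FormatTyped(int nArgs, ...)
{
    va_list vl;
    va_start(vl, nArgs);

    ArgType types[8];
    assert( nArgs <= 8 );
    for (int i = 0; i < nArgs; i++)
        types[i] = (ArgType) va_arg(vl, int); // tags are passed as int

    std::string out;
    char buf[64];
    for (int i = 0; i < nArgs; i++)
    {
        switch (types[i])
        {
        case ARG_INT:
            std::snprintf(buf, sizeof(buf), "%d", va_arg(vl, int));
            out += buf;
            break;
        case ARG_DOUBLE: // floats arrive promoted to double
            std::snprintf(buf, sizeof(buf), "%.3f", va_arg(vl, double));
            out += buf;
            break;
        case ARG_CSTR:
            out += va_arg(vl, const char *);
            break;
        }
    }
    va_end(vl);
    return out;
}

// the thin template adapter : the only templated code per arg count
template < typename T1, typename T2, typename T3 >
inline std::string Typed( T1 a1, T2 a2, T3 a3 )
{
    return FormatTyped( 3,
            (int)TypeOf(a1), (int)TypeOf(a2), (int)TypeOf(a3),
            a1, a2, a3 );
}
```

All the real work lives in one non-template function, so adding more arg counts only costs one tiny forwarding template each. A type with no TypeOf overload (or an ambiguous one) is a compile error, which is the behavior you want.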

On top of this we have a type-change adapter :

template < typename T1, typename T2 >
inline String autoToString( T1 arg1, T2 arg2)
{
    return autoToStringSub( 
        autoprintf_StringToChar( autoArgConvert(arg1) ), 
        autoprintf_StringToChar( autoArgConvert(arg2) ));
}

autoToString calls down to autoToStringSub, and uses autoArgConvert. autoArgConvert is a template that passes through basic types and calls ToString() on other types. ToString is a template that knows the basic types, and the client can extend it by adding ToString for their own types. If they don't, it will be a compile error. StringToChar is a helper that turns a String into a char * and passes through anything else. We have to do it in that double-call way so that the String can get allocated and stick around as a temporary until our whole call is done.

The next piece is how to implement autoToStringFunc() , which takes "enhanced varargs". We need to figure out which pieces are format strings and do various types of printfs (including supporting %a for auto-typed printf). The only tricky part of this is how to step around in the varargs. Here is the only place we have to use a little bit of undefined behavior. First of all, think of the va_list as a pointer to a linked list. Calling va_arg essentially advances the pointer one step. That's fine and standard. But I assume that I can then take that pointer and pass it on as a va_list which is the remaining args (see note *).

The key way we deal with the varargs is with functions like this :

static inline void SkipVAArg(ESafePrintfType argtype, va_list & vl)
{
    switch(argtype)
    {
    case safeprintf_charptr:    { va_arg(vl,char *); return; }
    case safeprintf_wcharptr:   { va_arg(vl,wchar_t *); return; }
    case safeprintf_int32:      { va_arg(vl,int); return; }
    case safeprintf_int64:      { va_arg(vl,__int64); return; }
    case safeprintf_float:      { va_arg(vl,double); return; }
    case safeprintf_ptrint:     { va_arg(vl,int*); return; }
    case safeprintf_ptrvoid:    { va_arg(vl,void*); return; }
    default:
        // BAD
        safeprintf_throwsyntaxerror("SkipVAArg","unknown arg type");
        return;
    }
}

And the remainder is easy!

* : actually it looks like this is okay by the standard, I just have to call va_end after each function call then SkipArgs back to where I was. I believe this is pointless busywork, but you can add it if you want to be fully standard compliant.

08-22-10 | AutoPrintf v1

Well autoprintf v1 appears to be all working. The core element is a bunch of functions like this :

template < typename T1, typename T2, typename T3, typename T4 >
inline String autoToString( T1 arg1, T2 arg2, T3 arg3, T4 arg4 )
{
    return ToString(arg1) + autoToString( arg2,arg3,arg4);
}

template < typename T2, typename T3 >
inline String autoToString( const char *fmt, T2 arg2, T3 arg3 )
{
    autoFormatInfo fmtInfo = GetAutoFormatInfo(fmt);
    if ( fmtInfo.autoArgI )
    {
        String newFmt = ChangeAtoS(fmt,fmtInfo);
        if ( 0 ) ;
        else if ( fmtInfo.autoArgI == 1 ) return autoToString(newFmt.CStr(), ToString(arg2).CStr(),arg3);
        else if ( fmtInfo.autoArgI == 2 ) return autoToString(newFmt.CStr(), arg2,ToString(arg3).CStr());
        else return autoPrintf_BadAutoArgI(fmt,fmtInfo);
    }

         if ( fmtInfo.numPercents == 0 )    return ToString(fmt) + autoToString(arg2,arg3);
    else if ( fmtInfo.numPercents == 1 )    return StringPrintf(fmt,arg2) + autoToString(arg3);
    else if ( fmtInfo.numPercents == 2 )    return StringPrintf(fmt,arg2,arg3);
    else return autoPrintf_TooManyPercents(fmt,fmtInfo);
}

you have an autoToString that takes various numbers of template args. If the first arg is NOT a char *, it calls ToString on it then repeats on the remaining args. Any time the first arg is a char *, it uses the other specialization which looks in fmt to see if it's a printf format string, then splits the args based on how many percents there are. I also added the ability to use "%a" to mean auto-typed args, which is what the first part of the function is doing.

That's all dandy, but you should be able to see that for large numbers of args, it generates a massive amount of code.

The real problem is that even though the format string is usually a compile-time constant, I can't parse it at compile time, so I generate code for each arg being %a or not being %a, and for each possible number of percents. The result is something like 2^N codegens for N args. That's bad.

So, I know how to fix this, so I don't think I'll publish v1. I have a method for v2 that moves most of the work out of the template. It's much simpler actually, and it's a very obvious idea, all you have to do is make a template like :

autoprintf(T1 a1, T2 a2, T3 a3)
    autoPrintfSub( autoType(a1), autoArg(a1) ,autoType(a2), autoArg(a2) , .. )

where autoType is a template that gives you the type info of the arg, and autoArg does conversions on non-basic types for you, and then autoPrintfSub can be a normal varargs non-template function and take care of all the hard work.
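
A minimal sketch of that shape, to show why it kills the 2^N codegen : the template only tags and converts, and one ordinary varargs function does the work. Everything here is an assumption for illustration. The tag enum, the overload sets, the numArgs count parameter and the name autoToString3 are invented, std::string stands in for String, and format string parsing is left out entirely.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical type tags; the real autoType presumably carries more info.
enum AutoTag { kTagInt = 1, kTagDouble = 2, kTagCStr = 3 };

inline int autoType(int)                 { return kTagInt; }
inline int autoType(double)              { return kTagDouble; }
inline int autoType(const char *)        { return kTagCStr; }
inline int autoType(const std::string &) { return kTagCStr; }

// autoArg passes basic types through and converts non-basic ones
inline int          autoArg(int v)                 { return v; }
inline double       autoArg(double v)              { return v; }
inline const char * autoArg(const char * v)        { return v; }
inline const char * autoArg(const std::string & v) { return v.c_str(); }

// Non-template worker : walks (tag,value) pairs out of the varargs.
std::string autoPrintfSub(int numArgs, ...)
{
    std::string out;
    char buf[64];
    va_list vl;
    va_start(vl, numArgs);
    for (int i = 0; i < numArgs; ++i)
    {
        int tag = va_arg(vl, int);
        switch (tag)
        {
        case kTagInt:    std::snprintf(buf, sizeof(buf), "%d", va_arg(vl, int));    out += buf; break;
        case kTagDouble: std::snprintf(buf, sizeof(buf), "%g", va_arg(vl, double)); out += buf; break;
        case kTagCStr:   out += va_arg(vl, const char *); break;
        }
    }
    va_end(vl);
    return out;
}

// The only templated part : one tiny forwarding function per arg count.
template < typename T1, typename T2, typename T3 >
std::string autoToString3(const T1 & a1, const T2 & a2, const T3 & a3)
{
    return autoPrintfSub(3, autoType(a1), autoArg(a1),
                            autoType(a2), autoArg(a2),
                            autoType(a3), autoArg(a3));
}
```

The template instantiates once per combination of arg types, but each instantiation is just a single call, so the code size per call site stays small.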

... yep new style looks like it will work. It requires a lot more fudging with varargs, the old style didn't need any of that. And I'm now using undefined behavior, though I think it always works in all real-world cases. In particular, in v2 I'm now relying on the fact that I can do :

  va_arg(vl) .. a few types to grab some args from vl
  vsnprintf(  vl);

that is, I rely on the fact that va_arg advances me one step in the va_list, and that I then still have a valid va_list for remaining args which I can pass on. This is not allowed by the standard technically but I've never seen a case where it doesn't work (unless GCC decided to get pedantic and forcibly make it fail for no good reason).

08-21-10 | Adler32

Sean pointed out that I should try Adler32 for corruption checking. For reference I did some study of file hashes before and I'll use the same test set now, so you can compare to that. I'm using the Adler32 from zlib which looks like a decent implementation.

Testing on 10M arrays of average length 192 (random in [128,256]).

count : 10000000
totalBytes : 1920164768
clocks per byte :
burtle               : 1.658665
crc32                : 10.429893
adler32              : 1.396631
murmur               : 1.110712
FNV                  : 2.520380

So Adler is in fact decently fast, not as fast as Murmur but a bit faster than Burtle. (everything is crazy fast on my x64 lappy; the old post was on my work machine, everything is 2-3X faster on this beast; it's insane how much Core i7 can do per clock).

BTW I wasn't going to add Murmur and FNV to this test - I didn't test them before because they are really not "corruption detection" hashes, they are hashes for hash tables, in particular they don't really try to specifically guarantee that one bit flips will change the hash or whatever it is that CRC's guarantee, but after I saw how non-robust Adler was I figured I should add them to the test, and we will see that they do belong...

Now when I count collisions in the same way as before, a problem is evident :

collisions :
rand32               : 11530
burtle               : 0
crc32                : 11774
adler32              : 1969609
murmur               : 11697
FNV                  : 11703

note that as before, rand32 gives you a baseline on how many collisions a perfect 32 bit hash should give you - those collisions are just due to running into the limited space of the 32 bit word. Burtle here is a 64 bit hash and never collides. (I think I screwed up my CRC a little bit, it's colliding more than it should. But anyhoo). Adler does *terribly*. But that's actually a known problem for short sequences.
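
To see why short inputs hurt, here is the textbook byte-at-a-time Adler-32. This is a plain reference version, not zlib's unrolled one, and the name adler32_simple is made up. The low half "a" is just 1 plus the sum of the bytes mod 65521. For a 192 byte array that sum is at most 1 + 255*192 = 48961, so it never even wraps, and for random bytes it clusters around 192*127.5. The low 16 bits of the hash only cover a small slice of their range.

```cpp
#include <cstdint>
#include <cstddef>

// Reference Adler-32 : a = 1 + sum of bytes, b = sum of the running a values,
// both mod 65521 (largest prime below 2^16).
uint32_t adler32_simple(const uint8_t * data, size_t len)
{
    const uint32_t MOD = 65521;
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; ++i)
    {
        a = (a + data[i]) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}
```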

How does it do on longer sequences ? On arrays of random length between 2k and 4k (average 3k) :

num hashes : 10000000
totalBytes : 30722620564
clocks per byte :
burtle               : 1.644675
crc32                : 11.638417
adler32              : 1.346784
murmur               : 1.027105
FNV                  : 2.999243
collisions :
rand32               : 11530
burtle               : 0
crc32                : 11586
adler32              : 12335
murmur               : 11781
FNV                  : 11653

it's better, but still the worst of the group.
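
For reference, the rand32 baseline in these tables is in the ballpark of the standard birthday estimate. A small sketch follows; the function name is made up, and it counts colliding pairs, which may not be exactly how the collision counter in these tests works.

```cpp
#include <cmath>

// Expected number of colliding pairs among numHashes uniformly random
// hashBits-bit values : C(n,2) / 2^bits
double expectedCollidingPairs(double numHashes, int hashBits)
{
    return numHashes * (numHashes - 1.0) / 2.0 / std::pow(2.0, hashBits);
}
```

For 10M hashes in 32 bits this comes out a bit over 11641, with a spread of roughly a hundred either way, so counts like 11530 or 11774 are unremarkable and 12335 is on the high side.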

BTW I should note that the adler32 implementation does unrolling and rollup/rolldown and all that kind of stuff and none of the other ones do. So its speed advantage is a bit unfair. All these sort of informal speed surveys should be taken with a grain of salt, since to really fairly compare them I would have to spend a few weeks on each one making sure I got it as fast as possible, and of course testing on various platforms. In particular FNV and Murmur use multiplies which is a no-go, but you could probably use shift and add to replace the multiplies, and you'd get something like Bob's "One at a Time" hash.

So I figured I'd test on what is more like my real usage scenario.

In the RAD LZH , I compress 16k data quanta, and check the CRC of each compressed chunk before decompressing. So compressed chunks are between 0 and 16k bytes. Since they are compressed they are near random bits. Corruption will take various forms, either complete stompage with random shite, or some bit flips, or tail or head stomps. Complete stompage has been tested in the above runs (it's the same as checking the collision rate for two unrelated sequences), so I tested incremental stomps.

I made random arrays between 256 and 16k bytes long. I then found the hash of that array, did some randomized incremental stomping, and took the hash after the changes. If the hashes were the same, it counts as a collision. The results are :

numTests : 13068402
burtle               : 0 : 0.00000%
crc32                : 0 : 0.00000%
adler32              : 3 : 0.00002%
murmur               : 0 : 0.00000%
FNV                  : 0 : 0.00000%

Adler32 is the only one that fails to detect these incremental stomps. Granted the failure rate is pretty low (3/13068402) but that's not secure. Also, the hashes which are completely not designed for this (Murmur and FNV) do better. (BTW you might think the Adler32 failures are all on very short arrays; not quite, it does fail on a 256 byte case, then twice at 3840 bytes).

ADDENDUM : Ok I tested Fletcher32 too.

cycles :
rand32               : 0.015727
burtle               : 1.364066
crc32                : 4.527377
adler32              : 1.107550
fletcher32           : 0.697941
murmur               : 0.976026
FNV                  : 2.439253

large buffers :
num hashes : 10000000
totalBytes : 15361310411
rand32               : 11530
burtle64             : 0
crc32                : 11710
adler32              : 12891
fletcher32           : 11645
murmur               : 11792
FNV                  : 11642

small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32               : 11530
burtle64             : 0
crc32                : 11487
adler32              : 24377
fletcher32           : 11793
murmur               : 11673
FNV                  : 11599

difficult small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32               : 11530
burtle64             : 0
burtle32             : 11689
crc32                : 11774
adler32              : 1969609
fletcher32           : 11909
murmur               : 11665
FNV                  : 11703

Conclusion : Adler32 is very bad and unsafe. Fletcher32 looks perfectly solid and is very fast.

ADDENDUM 2 : a bit more testing. I re-ran the test of munging the array with incremental small changes of various types again. Running on lengths from 256 up to N, I get :

munge pattern 1 :
length : 6400
numTests             : 25069753
rand32               : 0
burtle64             : 0
burtle32             : 0
crc32                : 0
adler32              : 14
fletcher32           : 22
murmur               : 0
FNV                  : 0

munge pattern 2 :
length : 4096
numTests             : 31322697
rand32               : 0
burtle64             : 0
burtle32             : 0
crc32                : 0
adler32              : 9
fletcher32           : 713
murmur               : 0
FNV                  : 0

So I strike my conclusion that Fletcher is okay. Fletcher and Adler are both bad.

ADDENDUM 3 : Meh, it depends what kind of "corruption" you expect. The run above in which Fletcher is doing very badly includes some "munges" which tend to fill the array with lots of zeros, in which area it does very badly.

If you look at really true random noise type errors, and you always start your array full of random bits, and then you make random bit flips or random byte changes (between 1 and 7 of them), and then refill the array with rand, they perform as expected over a very large number of runs :

numTests : 27987536
rand32               : 3 : 0.00001%
burtle64             : 2 : 0.00001%
burtle32             : 2 : 0.00001%
crc32                : 1 : 0.00000%
adler32              : 1 : 0.00000%
fletcher32           : 2 : 0.00001%
murmur               : 2 : 0.00001%
FNV                  : 1 : 0.00000%
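
On the zero-fill sensitivity mentioned in ADDENDUM 3 : one well-known blind spot of Fletcher-style sums is that they work mod 65535, so a 16-bit word of 0xFFFF is congruent to 0x0000 and the two can't be told apart. I can't say whether that is what the munge test above is hitting. The sketch below is the plain textbook Fletcher-32 over 16-bit words (the name fletcher32_simple is made up, and this is not the implementation that was timed).

```cpp
#include <cstdint>
#include <cstddef>

// Textbook Fletcher-32 over 16-bit words, both sums mod 65535.
uint32_t fletcher32_simple(const uint16_t * words, size_t numWords)
{
    uint32_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < numWords; ++i)
    {
        sum1 = (sum1 + words[i]) % 65535;
        sum2 = (sum2 + sum1) % 65535;
    }
    return (sum2 << 16) | sum1;
}
```

A buffer of all 0x0000 words and a buffer of all 0xFFFF words of the same length give the same checksum with this function.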

08-21-10 | autoprintf

AH HAHA HA ! I think I finally have the one true printf solution. I can now do :

    autoprintf("hello world\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello ",String("world")," \n");

In particular, all of the following things work :

I'm gonna clean up the code a bit and try to extricate it from cblib (meh or maybe not) and I'll post it in a few days.

It does pretty much everything I've always wanted from a printf. There is one thing missing, which is formatting for arbitrary types. Currently you can only format the basic types, and the non-basic types go through a different system. eg. you can either do :

autoprintf("hello %5d ",anInt); 
autoprintf("hello ",(size_t)anInt); 
but you can't yet do
autoprintf("hello %5",(size_t)anInt); 
(note that the type specifier is left off, only format specifiers are on the %). I know how to make this work, but it makes the implementation a lot more complicated, so I might punt on it.

The more complicated version is to be able to pass through the format spec into the templated converter. For example, you might have a ToString() for your Vec3 type which makes output like ("{%f,%f,%f}",x,y,z) . With the current system you can do :

Vec3 v;
autoprintf("v = ",v);
and it will call your ToString, but it would be groovy if you could do :
Vec3 v;
autoprintf("v = %.1",v);
as well and have that format apply to the conversion for the type. But that's probably more complication than I want to get into.

Another thing that might be nice is to have an explicit "%a" or something for auto-typed, so you can use them at the end like normal printf args. eg :

autoprintf("hello %d %a %f\n", 3, String("what"), 7.5f );

08-20-10 | Deobfuscating LZMA

I've been trying to figure out LZMA for a while, if anyone can help please chime in.

LZMA is very good, and also very obscure. While the code is free and published, it's completely opaque and the algorithm is not actually described anywhere. In particular, it's very good on structured data - even better than PPM. And, superficially, it looks very much like any other LZ+arithmetic+optimal parse, which there were many of before LZMA, and yet it trounces them all.

So, what's going on in LZMA? First, a description of the basic coder. Most of LZMA is very similar to LZX - LZX uses a forward optimal parse, log2-ish encoded lengths and offsets, and the most recent 3 offsets in an MTF/LRU which are coded as special "repeat match" codes. (LZX was made public when Microsoft made it part of CAB and published a full spec ).

LZMA can code a literal, a match, or a recent offset match - one of the three most recent offsets (like LZX). This is pretty standard. It also has two coding modes that are unusual : "Delta Literal" coding, and the 0th most recent offset match can code a single character match.

Everything it codes is context-coded binary arithmetic coded. Literals are coded as their bits; the initial context is the previous character and a few bits of position, and as literal bits are coded they are shifted into the context for future bits (top to bottom). This is pretty standard.

Using a few bits of position as part of the context lets it have different statistics at each byte position in a dword (or whatever). This is very useful for coding structured data such as arrays of floats. This idea has been around for a long time, but older coders don't do it and it certainly is part of the advantage on array/structured data. The bottom bits of position are also used as part of the context for the match flag, and also the "is last match 0 long" flag. Other match-related coding events don't use it.

In theory you should figure out what the local repetition period is and use that; LZMA doesn't make any effort to do that and just always uses N bits of position (I think N=2 is a typical good value).

Lengths and Offsets are coded in what seems to be mostly a pretty standard log2-ish type coding (like Zip and others). Offsets are coded as basically the position of their MSB and then the remaining bits. The MSB is context-coded with the length of the match as context; this captures length-offset correlation. Then, the bottom 4 bits of the offset are sent, binary arithmetic coded on each other in reverse order (bottom bit first). This lets you capture things like a fixed structure in offsets (eg. all offsets are multiples of 4 or 8). The bits between the MSB and the bottom 4 are sent without compression.
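
A sketch of that decomposition, following the description in this paragraph rather than the actual LZMA source (which packs things into slots somewhat differently). The struct and function names are invented, and it assumes offset >= 16 so that all three pieces exist.

```cpp
#include <cstdint>

// Hypothetical split of an offset into the three pieces described above.
struct OffsetParts
{
    int      msbPos;        // position of the top set bit (context coded)
    int      numMiddleBits; // bits between the MSB and the bottom 4
    uint32_t middleBits;    // sent raw
    uint32_t low4;          // bottom 4 bits (context coded, low bit first)
};

inline OffsetParts splitOffset(uint32_t offset) // assumes offset >= 16
{
    OffsetParts p;
    p.msbPos = 31;
    while ( ((offset >> p.msbPos) & 1) == 0 ) --p.msbPos;
    p.low4 = offset & 15;
    p.numMiddleBits = p.msbPos - 4;
    p.middleBits = (offset >> 4) & ((1u << p.numMiddleBits) - 1);
    return p;
}

inline uint32_t joinOffset(const OffsetParts & p)
{
    return (1u << p.msbPos) | (p.middleBits << 4) | p.low4;
}
```

With an offset of 72 (a 72 byte struct stride) the bottom 4 bits are always 8, which is exactly the kind of fixed structure the low-bit contexts can learn.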

The binary match flags are context coded using the "state" , which is the position in an internal finite state machine. It is :

LZMA state machine :

Literal :

  state < 7 :
    normal literal
  state >= 7 :
    delta literal

  state [0-3] -> state = 0
  state [4-9] -> state -= 3 ([1-6])
  else state -= 6 [10-11] -> ([4-5])

Match :

   rep0 match, len 1 (short rep) :
     state ->   < 7 ? 9 : 11
   rep0 match, len > 1 :
     state ->   < 7 ? 8 : 11

   other rep match :
     state ->   < 7 ? 8 : 11

   normal match :
     state ->   < 7 ? 7 : 10

// or from Igor Pavlov's code :

static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4,  5,  6,   4, 5};
static const int kMatchNextStates[kNumStates]   = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]     = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

Okay, this is the first funny unique thing. State basically tells you what the last few coding operations have been. As you send matches, state gets larger, as you send literals, state gets smaller. In particular, after any literal encoding state is < 7, and after any match encoding it is >= 7. Then above and below that it tells you something about how many literals or matches you've recently encoded. For example :

initial state = 5
code a normal match -> 7
code a rep match -> 11
code a literal -> 5
code a literal -> 2
code a literal -> 0
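
The walk above can be reproduced directly from the four tables. The tables are copied from the listing above; the enum and the nextState / usesDeltaLiteral wrappers are just made-up names for illustration.

```cpp
static const int kNumStates = 12;
static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4,  5,  6,   4, 5};
static const int kMatchNextStates[kNumStates]   = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]     = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

// Hypothetical wrapper around the tables
enum LzmaOp { kOpLiteral, kOpMatch, kOpRep, kOpShortRep };

inline int nextState(int state, LzmaOp op)
{
    switch (op)
    {
    case kOpLiteral:  return kLiteralNextStates[state];
    case kOpMatch:    return kMatchNextStates[state];
    case kOpRep:      return kRepNextStates[state];
    default:          return kShortRepNextStates[state];
    }
}

// a literal coded from state >= 7 (right after a match) is a delta literal
inline bool usesDeltaLiteral(int state) { return state >= 7; }
```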

Now it's unclear to me whether this funny state machine thing is really a huge win as a context; presumably it is tweaked out to be an advantage, but other coders have used the previous match flags as context for the match flag coding (eg. was the last thing a match is one bit, take the last three that gives you 8 states of previous context), which seems to me to have about the same effect.

There is one funny and clever thing here though, and that's the "delta literal". Any time you code a literal immediately after a match, the state will be >= 7 so you will code a delta literal. After that state will fall below 7 so you will code normal literals. What is a delta literal ?

Delta literals are coded as :

    char literal = *ptr;
    char * lastPosPtr = ptr - lastOffset;
    char delta = literal ^ *lastPosPtr;

that is, the character is xor'ed with the character at the last coded match offset away from current pointer (not at the last coded pos, the last offset, that's important for structured data).

When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better. For another, these characters are in fact guaranteed to *NOT* match. If they did match, then the preceding match would have just been one longer. And that's what's cool about this.

Immediately after a match, you are in a funny position : you have a long preceding context which matches some other long context in the file (where the match was). From PPM* and LZP we know that long contexts are very strong predictors - but we also know that we have failed to match that character! If we just use the normal literal coder, we expect the most likely character to be the one that we just failed to match, so that would be a big waste of code space. So instead, we use this delta literal coder which will let us statistically exclude the zero.
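
A tiny sketch of just the xor step, with invented function names; the real coder then sends the delta through its context-coded bit models, which is not shown here. The point is that right after a match the two bytes differ, so the delta is never zero.

```cpp
#include <cstdint>
#include <cstddef>

// Encoder side : xor the literal with the byte lastOffset back.
inline uint8_t deltaLiteralEncode(const uint8_t * buf, size_t pos, size_t lastOffset)
{
    return static_cast<uint8_t>( buf[pos] ^ buf[pos - lastOffset] );
}

// Decoder side : the byte at pos - lastOffset is already decoded, so the
// same xor recovers the literal.
inline uint8_t deltaLiteralDecode(const uint8_t * buf, size_t pos, size_t lastOffset, uint8_t delta)
{
    return static_cast<uint8_t>( delta ^ buf[pos - lastOffset] );
}
```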

Okay, I think that's it for how the coding works. A few more tidbits :

The match finder in 7zip appears to be a pretty standard hash-to-binary-tree. It uses a hash to find the most recent occurrence of the first few chars of the current string, that points to a node in the binary tree, and then it walks the binary tree to find all matches. The details of this are a little bit opaque to me, but I believe it walks backwards in order, and it only finds longer matches as it walks back. That is, it starts at the lowest offset occurrence of the substring and finds the match length for that, then it steps to the next later one along the binary tree, and finds a match *if longer*. So it doesn't find all offsets, it presumes that larger offsets are only interesting if their matches are longer. (I'm a little unclear on this so this could be wrong).

One thing I can't figure out is how the binary tree is maintained with the sliding window.

ADDENDUM : I just found it described in "Handbook of Data Compression By David Salomon, Giovanni Motta, David (CON) Bryant". My description above of the binary tree was basically right. It is built in the "greedy" way : new nodes are added at the top of the tree, which means that when you are searching down the tree, you will always see the lowest offset possible for a given match length first, so you only need to consider longer lengths. Also since older nodes are always deeper in the tree, you can slide the window by just killing nodes and don't have to worry about fixing the tree. Of course the disadvantage is the tree can be arbitrarily unbalanced, but that's not a catastrophe, it's never worse than just a straight linked list, which is the alternative.

The big piece I'm missing is how the optimal parse works. It's a forward optimal parse which explores a limited branch space (similar to previous work that was done in Quantum and LZX). When it saves state in the optimal parse tree, it only updates the FSM "state" variable and the last 3 offsets, it doesn't update the whole context-arithmetic state. At each position it appears to consider the cost of either a literal, a match, or a "lazy" match (that's a literal and then the following match), but I haven't figured out the details yet. It seems to optimal parse in 4k chunks, maybe it updates the arithmetic state on those boundaries. I also see there are lots of heuristics to speed up the optimal parse, assumptions about certain coding decisions being cheaper than others without really testing them, hard-coded things like (if offset > (1 << 14) && length < 7) which surely helps. If anyone has figured it out, please help me out.

ADDENDUM : here's an illustration of how the special LZMA modes help on structured data. Say you have a file of structs; the structs are 72 bytes long. Within each struct are a bunch of uint32, floats, stuff like that. Within something like a float, you will have a byte which is often very correlated, and some bytes that are near random. So we might have something like :

[00,00,40,00] [7F 00 3F 71] ... 72-8 bytes ... [00,00,40,00] [7E 00 4C 2F]
... history ...                                * start here

we will encode :

00,00,40,00 : 
  4 byte match at offset 72
  (offset 72 is probably offset0 so this is a rep0 match)

7E :
  delta literal
  encode 7E ^ 7F = 1

00 :
  one byte match to offset 72 (rep0)

4C :
  delta literal
  encode 4C ^ 3F = 0x73

2F :
  regular literal

Also because of the position and state-based coding, if certain literals occur often in the same spot in the pattern, that will be captured very well.

Note that this is not really the "holy grail" of compression which is a compressor that figures out the state-structure of the data and uses that, but it is much closer than anything in the past. (eg. it doesn't actually figure out that the first dword of the structure is a float, and you could easily confuse it, if your struct was 73 bytes long for example, the positions would no longer work in simple bottom-bits cycles).

08-19-10 | Fuzz Testing

I'm "fuzz testing" my LZ decoder now, by which I mean making it never crash no matter how the data is corrupted.

The goal is to make this work without taking any speed hit. There are lots of little tricks to make this happen. For example, the LZ decode match copier is allowed to trash up to 8 bytes past where it thinks the end is. This lets me do a lot fewer bounds checks in the decode. To prevent actual trashing then, I just make the encoder never emit a match within 8 bytes of the end of a chunk. Similarly, the Huffman decoder can be made to always output a symbol in a finite number of steps (never infinite loop or access a table out of bounds). You can do this just by doing some checks when you build your decode tables, then you don't have to do any checks in the actual decode loop.

So, how do we make sure that it actually works? To prove that it is 100% fuzz resilient, you would have to generate every possible bit stream of every possible length and try decoding them all. Obviously that is not possible, so we can only try our best to find bad cases. I have a couple of strategies for that.

Random stomps. I just stomp on the compressed data in some random way and then run the decoder and see what happens (it should fail but not crash). I have a test loop set up to do this on a bunch of different files and a bunch of different stomp methods.

Just stomping random bytes in turns out to not be a very good way to find failures - that type of corruption is actually one of the easiest to handle because it's so severe. So I have a few different stomp modes : insert random byte, insert 00, insert FF, flip one bit, and the same for shorts, dwords, qwords. Often jamming in a big string of 00 or FF will find cases that any single byte insert won't. I randomize the location of the stomp but prefer very early position ones, since stomping in the headers is the hardest to handle. I randomize the number of stomps.

One useful thing I do is log each stomp in the form of C code before I do it. For example I'll print something like :

compBuf[ 906 ] ^= 1 << 3;
compBuf[ 61  ] ^= 1 << 3;
compBuf[ 461 ] ^= 1 << 4;

then if that does cause a crash, I can just copy-paste that code to have a repro case. I was writing out the stomped buffers to disk to have repro cases, but that is an unnecessary slowdown; I'm currently running 10,000+ stomp tests.
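
A minimal sketch of one stomp mode (flip one bit) that produces its own repro line in the format shown above. The function name and the use of std::vector are assumptions for illustration; the real test loop has more modes and picks position and bit at random.

```cpp
#include <cstdio>
#include <cstdint>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stomp : flips one bit in compBuf and returns the C statement
// that reproduces it. The line is built before the stomp so it can be logged
// even if the decoder blows up afterwards.
std::string stompFlipBit(std::vector<uint8_t> & compBuf, size_t pos, int bit)
{
    char line[64];
    std::snprintf(line, sizeof(line), "compBuf[ %zu ] ^= 1 << %d;", pos, bit);

    compBuf[pos] ^= static_cast<uint8_t>(1u << bit);
    return line;
}
```

Pasting the returned lines into a test in order replays exactly the same corruption on a fresh copy of the compressed buffer.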

(note to self : to do this, run main_lz -t -z -r1000)

Okay, so that's all very nice, but you can still easily miss failure cases. What I really want is something that gives me code coverage to tell that I've handled corrupted data in all the places where I read data. So I stole an idea from relacy :

Each place I get a byte (or bits) from the compressed stream, I replace :

U8 byte = *ptr++;

with :

U8 byte = *ptr++;
FUZZ(byte);

// I wanted to do this but couldn't figure out how to make it work :
// U8 byte = FUZZ( *ptr++ );

(and similar for getting bits). Now, what the FUZZ macros do is this :

The first time they are encountered, they register their location with the FuzzManager. They are then a disabled possible fuzz location. Each one is given a unique Id.

I then start making passes to try to fuzz at all possible locations. To do this, each fuzz location is enabled one by one, then I rerun the decompressor and see if that location was in fact hit. If a fuzz location is enabled, then the FUZZ macro munges the value and returns it (using all the munge modes above), and if it's disabled it just passes the byte through untouched.

Once I try all single-munges, I go back and try all dual munges. Again in theory you should try all possible multi-fuzz sequences, but that's intractable for anything but trivial cases, and also it would be very odd to have a problem that only shows up after many fuzzes.

As you make passes, you can encounter new code spots, and those register new locations that have to be covered.

Again, a nice thing I do is before each pass I log C code that will reproduce the action of that pass, so that if there is a problem you can directly reproduce it. In this case, it looks like :

Fuzz : 1/36

In order to have reproducibility, I use FILE/LINE to identify the fuzz location, not an index, since the index can change from run to run based on the code path taken. Also, note that I don't actually use FILE/LINE because I have FUZZ in macros and templates - I use __FUNCDNAME__ so that two versions of a template get different tags, and I use __COUNTER__ so that macros which cause multiple fuzzes to occur at the same original code line get different location numbers. eg. this works :

#define A()  do { U8 t = *ptr++; FUZZ(t); } while(0)
#define B()  A(); A();

template < int i > void func() { B(); }

int main()
{
    func< 0 >();
    func< 1 >();
    return 0;
}

// there should be 4 separate unique FUZZ locations registered :


I log :

rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318000) = 0;
rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318001) = 1;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318000) = 2;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318001) = 3;


As usual I'm not sure how to get the same thing in GCC. (maybe __PRETTY_FUNCTION__ works? dunno).

The actual FUZZ macro is something like this :

#define FUZZ_ID     __FILE__ "|" __FUNCDNAME__ , __LINE__*1000 + __COUNTER__

#define FUZZ( word )    do { static int s_fuzzIndex = rrFuzz_Register(FUZZ_ID); if ( rrFuzz_IsEnabled(s_fuzzIndex) ) { word = rrFuzz_Munge(word); } } while(0)

The only imperfection at the moment is that FUZZ uses a static to register a location, which means that locations that are never visited at all never get registered, and then I can't check to see if they were hit or not. It would be nice to find a solution for that. I would like it to call Register() in _cinit, not on first encounter.
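
For anyone who wants to play with the idea, here is a small self-contained sketch of the registration scheme. It is not the rrFuzz code : the fuzz* function names, the std::map registry, the single-enabled-index policy and the xor-1 munge are all stand-ins. On GCC/Clang __PRETTY_FUNCTION__ includes the template arguments, but it is a variable rather than a string literal, so it can't be pasted onto __FILE__ the way __FUNCDNAME__ is above; the sketch passes it as a separate argument instead.

```cpp
#include <map>
#include <string>
#include <utility>

#if defined(_MSC_VER)
#define FUZZ_FUNC __FUNCDNAME__
#else
#define FUZZ_FUNC __PRETTY_FUNCTION__
#endif

typedef std::pair<std::string, int> FuzzKey;

// location -> index registry (stand-in for the FuzzManager)
inline std::map<FuzzKey, int> & fuzzTable()   { static std::map<FuzzKey, int> t; return t; }
// index of the single location currently enabled, -1 for none
inline int & fuzzEnabledIndex()               { static int e = -1; return e; }

inline int fuzzRegister(const char * file, const char * func, int id)
{
    FuzzKey key(std::string(file) + "|" + func, id);
    std::map<FuzzKey, int> & t = fuzzTable();
    std::map<FuzzKey, int>::iterator it = t.find(key);
    if ( it != t.end() ) return it->second;
    int index = (int) t.size();
    t[key] = index;
    return index;
}

inline bool fuzzIsEnabled(int index) { return index == fuzzEnabledIndex(); }

// trivial stand-in munge : flip the low bit
template < typename T > inline T fuzzMunge(T v) { return static_cast<T>(v ^ 1); }

#define FUZZ( word ) do { \
    static int s_fuzzIndex = fuzzRegister(__FILE__, FUZZ_FUNC, __LINE__*1000 + __COUNTER__); \
    if ( fuzzIsEnabled(s_fuzzIndex) ) { word = fuzzMunge(word); } } while(0)

#define READ_BYTE( out, ptr ) do { out = *ptr++; FUZZ(out); } while(0)

// two FUZZ sites on one source line, in a template : four locations once
// both readTwo<0> and readTwo<1> have run
template < int i > unsigned char readTwo(const unsigned char * ptr)
{
    unsigned char a, b;
    READ_BYTE(a, ptr); READ_BYTE(b, ptr);
    return (unsigned char)(a + b);
}
```

This has the same limitation as noted above : a location only shows up in the table once control has actually reached it.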

Anyway, this kind of system is pretty handy for any code coverage / regression type of thing.

(note to self : to do this, define DO_FUZZ_TEST and run main_lz -t -r1000)

ADDENDUM : another practical tip that's pretty useful. For something small and complex like your headers, or your Huffman tree, or whatever, you might have a ton of consistency checks to do to make sure they're really okay. In that case, it's usually actually faster to just go ahead and run a CRC (*) check on them to make sure they aren't corrupted, then skip most of the validation checks.

On the primary byte stream we don't want to do that because it's too slow, but for headers the simplicity is worth it.

(*) not actually a CRC because doing byte-by-byte table lookups is crazy slow on some game platforms. There are other robust hashes that are faster. I believe Bob Jenkins' Lookup3 is probably the best and fastest, since we have platforms that can't do multiplies fast (ridiculous but true), so many of the hashes that are fast on x86 like Murmur2 are slow on consoles.

08-16-10 | Range Coder Revisited .. oh wait, nevermind

Hmm. I just wrote a long article on this and then as I was searching around for reference material, I discovered that I already covered the topic in full detail :

cbloom rants 10-05-08 - 1
cbloom rants 10-06-08 - 1
cbloom rants 10-08-08 - 1

So, WTF I'm going insane. Anyway, here are some more links :

encode.ru : How fast should be a range coder
ctxmodel.net : Context Modelling
CiteSeerX : Arithmetic coding , Langdon 79
Sachin Garg : 64-bit Range Coding and Arithmetic Coding

One random thing I should note is that if you have 64 bit registers, you can let range go between 2^32 and 2^64 , and output 32 bits at a time.

ADDENDUM : another random thing that occurs to me : if you're doing an fpaq0p-style sloppy binary arith coder where range is allowed to get quite small, you can actually do a few encodes or decodes in a row without checking for renormalization. What you would have to do is first do a *proper* renorm check that handles the underflow from straddling the middle case (which it normally doesn't handle) so that you are sure you have >= 24 bits in your range. Then, you can do several binary arithmetic codes, as long as the total probability shift is <= 24. For example, you could do two codes with 12 bits of probability precision, or 3 with 8 bits. Then you check renorm again. Probably the most sensible is doing two 12-bit precision codes, so you are able to do a renorm check once per two codes rather than every code. Of course then you do have to handle carries.

08-14-10 | Driver Aids

I think all the radar/laser automated cruise controls that are being put in cars now are extremely foolish and dangerous and should be illegal. It makes it possible for drivers who are already too lazy and distracted to completely stop paying attention to the road.

The whole situation with car safety (eg. raising door panels, Volvo's auto-stop research) is sort of like if you had a problem with people running around shooting each other, and your solution is to make everyone wear bullet proof vests. How about guns with built in digital cameras that detect if you're pointing them at a human and then run mood detection to tell if they're hostile or not and block firing. (of course arguing that cars shouldn't be safer is absurd).

I'm really rather bothered by the whole idea of an "accident". It's almost never actually an accident, it's usually gross misconduct by one (or more) parties. The fact that you just exchange insurance and it gets paid for completely distorts the reality of punishment that would lead to different behaviors. In particular there should be a party at fault and they should lose their license. Though this fantasy is a bit unrealistic since we know well that punishment for rare events doesn't actually change behavior.

08-12-10 | The Lost Huffman Paper

"On the Implementation of Minimum Redundancy Prefix Codes" by Moffat and Turpin , 1997 , is a brilliant work that outlines very clearly the best ways to write a huffman decoder. It's a completely modern, practical work, and basically nobody has added anything beyond this technique in the last 13 years.

However, the importance of it was missed when it came out. For many years afterwards people continued to publish "improvements" to Huffman decoding (such as Sub-linear Decoding of Huffman Codes Almost In-Place (1998) ) which are just pure useless shit (I don't mean to single out that paper, there were probably hundreds of shitty foolish pointless papers on "efficient huffman" written after Moffat/Turpin).

Most people in the implementation community also missed this paper (eg. zlib, JPEG, etc. people who make important use of huffman decodes have missed these techniques).

I missed it too. Recently we did a lot of work on Huffman decoding at RAD, and after trying many techniques and lots of brainstorming, we came up with what we thought was a brilliant idea :

Store the code in your variable bit input word left-justified in a register. The Huffman codes are numerically arranged such that for codes of any given length, leaves (symbols) are lower values than nodes (branches). Then, the code for the first branch of each codelen can be left-justified in a word, and your entire Huffman decode consists of :

while ( bits >= huff_branchCodeLeftAligned[codeLen] )
    codeLen++;

return ( (bits>>(WORD_SIZE-codeLen)) - baseCode[ codeLen ] );

(this returns a symbol in "canonical order" - that is most probable is 0 ; if your symbols are not in order from most to least probable, you need an additional table lookup to reorder them).

This is really incredibly fucking hot. Of course it's obvious that it can be improved in various ways - you can use a fast table to skip the first few steps, you can use a nextCodeLen table to skip blanks in the codeLen sequence, and you can use a binary search instead of linear search. For known-at-compile-time huffman trees you could even optimize the binary search for the probability distribution of the codes and generate machine code for the decoder directly.

All of those ideas are in the Moffat+Turpin paper.
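
To make the loop above concrete, here is a minimal sketch in C of how the two tables could be built from a count of codes per length and then used to decode one symbol. It is an illustration under assumptions, not the RAD or Moffat/Turpin code: HuffDecodeTables, BuildTables and DecodeSymbol are hypothetical names, WORD_SIZE is assumed to be 32, code lengths are capped at 16, the code is assumed to be complete, and the branch table is stored in 64 bits so that the value one past the all-ones word is representable.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_SIZE    32
#define MAX_CODE_LEN 16   /* assumption for this sketch */

typedef struct {
    /* first branch (non-leaf) code of each length, left-aligned in a 32-bit word;
       kept in 64 bits so that 2^32 ("no branches at this length") fits */
    uint64_t branchCodeLeftAligned[MAX_CODE_LEN + 1];
    /* first leaf code of each length minus the canonical index of its symbol */
    uint32_t baseCode[MAX_CODE_LEN + 1];
} HuffDecodeTables;

/* numCodesOfLen[len] = number of symbols with code length len (index 0 unused) */
static void BuildTables(HuffDecodeTables *t, const int numCodesOfLen[MAX_CODE_LEN + 1])
{
    uint32_t firstCode = 0;   /* code value of the first leaf at this length */
    uint32_t firstRank = 0;   /* canonical index of that leaf's symbol */
    t->branchCodeLeftAligned[0] = 0;
    t->baseCode[0] = 0;
    for (int len = 1; len <= MAX_CODE_LEN; len++) {
        assert(numCodesOfLen[len] >= 0);
        /* leaves take the low code values, so branches start right after them */
        uint32_t branchCode = firstCode + (uint32_t)numCodesOfLen[len];
        t->branchCodeLeftAligned[len] = (uint64_t)branchCode << (WORD_SIZE - len);
        t->baseCode[len] = firstCode - firstRank;
        firstRank += (uint32_t)numCodesOfLen[len];
        firstCode = branchCode << 1;   /* each branch has two children one level down */
    }
}

/* bits holds the next code left-justified; returns the symbol in canonical order */
static int DecodeSymbol(const HuffDecodeTables *t, uint32_t bits, int *pCodeLen)
{
    int codeLen = 1;
    while (codeLen < MAX_CODE_LEN && bits >= t->branchCodeLeftAligned[codeLen])
        codeLen++;
    *pCodeLen = codeLen;
    return (int)((bits >> (WORD_SIZE - codeLen)) - t->baseCode[codeLen]);
}
```

With code lengths {1,2,3,3} this gives the codes 0, 10, 110, 111 and decodes them to canonical symbols 0, 1, 2, 3. A real decoder would follow the return with a Consume of codeLen bits.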

ADDENDUM : this post happened to get linked up, so I thought I'd flesh it out and fill in some of the blanks I'm implying above, since I'm sure you blog readers aren't actually going and reading the Moffat/Turpin paper like you should.

Here are some other posts on Huffman codes :

cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence
cbloom rants 08-10-10 - Transmission of Huffman Trees
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 05-20-09 - Some Huffman notes
cbloom rants 05-18-09 - Lagrange Space-Speed Optimization (and 05-25-09 - Using DOT graphviz for some Huffman space-speed SVG's)

In particular : cbloom rants 05-22-09 - A little more Huffman describes a 1 bit at a time Huffman decoder with the code values right-justified in the word.

And cbloom rants 05-18-09 - Lagrange Space-Speed Optimization describes first Dmitry Shkarin's standard table-walking Huffman decoder and then a generalization of it; both use N-bit reads of right-justified code words and table stepping.

In all cases a practical Huffman decoder should use an initial table lookup to accelerate the first N bit step. (N usually 8-12 depending on application). The reality is that what you do after that is not super important because it is rare (the majority of Huffman codes are short). Because of this, there are advantages to using a right-justified "upside down" huffman code word, with the MSB at the bottom (I believe Zip does this) because it means you can get the first N bits by doing just an AND with a constant (eg. get the "top" 8 bits by doing &0xFF).

There are two key efficiency issues for Huffman decoder implementation : 1. Reducing branches, and reducing the dependency-chain that leads to branches. That is, doing math is not bad, but doing math to decide to branch is bad. 2. Avoiding variable shifts, because variable shifts are catastrophically slow on some important platforms.

Finally, let's look at a real implementation of the Moffat/Turpin Huffman decoder. The variable bit input word is stored in a word left-justified with top bits at the left. The first branch code at each code length is also left-aligned.

We start with table-accelerating the first N bits :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

In practice you might use two tables, and Consume() is an unavoidable variable shift.

Next you have to handle the cases of code lens > TABLE_N_BITS. In that case rather than the loop in the pseudo-code above, you would actually unroll :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS+1] )
    return symbolUnsort[ (bits>>(WORD_SIZE-(TABLE_N_BITS+1))) - baseCode[ (TABLE_N_BITS+1) ] ];
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS+2] )
    return symbolUnsort[ (bits>>(WORD_SIZE-(TABLE_N_BITS+2))) - baseCode[ (TABLE_N_BITS+2) ] ];

this does a branch on each code length, but avoids variable shifts. In some extreme cases (platforms with very slow variable shift, and huffman trees with very low minimum code lengths), this unrolled branching version can even be faster than the peek table, in which case you would simply set TABLE_N_BITS to zero.

In some cases, certain code lengths might not occur at all, and you can avoid checking them by having an additional table of which codelengths actually occur (in practice this rarely helps). This would look like :

if ( bits < huff_branchCodeLeftAligned[11] )
    return symbolUnsort[ (bits>>(WORD_SIZE-codeLenTable[11])) - baseCode[ 11 ] ];

where the 11 is not a codelen but the 11th code len which actually occurs, and you have the extra codeLenTable[] lookup.

Obviously you could just unroll directly starting at codelen=1 , and obviously this is also just a search. You are just trying to find where bits lies in the sorted huff_branchCodeLeftAligned table. So instead of just a linear search you could binary search. However note that lower code lens are much more likely, so you don't want to just binary search at the beginning. And the binary search makes the branches much less predictable, so it's not always a win. However, as Moffat/Turpin describes, in some cases, for example if you have a hard-coded Huffman tree, the huff_branchCodeLeftAligned can be constants and you can optimize the binary tree branch structure, so you can do codegen to make an optimal decoder, like :

if ( bits < 0xA01230000 )
{
  if ( bits < 0x401230000 )
    // decode codeLen = 4 
  else
    // decode codeLen = 5
}

There's one final issue. Because the bit word is left aligned in the machine word, we can't make any branchCode value for "all bits are branches". In particular, with this compare :

if ( bits < huff_branchCodeLeftAligned[11] )

when bits is all 1's (0xFFF...) we can't make a branchCodeLeftAligned that returns true. There are a few solutions for this, one is to use <= branchCodeMinusOne , but then you have to make sure that you start with codeLen >= minCodeLen , because below that branchCode is zero and you have a problem. Another solution is to make sure bits is never full; that is, if you have a 32 bit word, then only allow 31 bits (or less) in your variable bit register. The final solution is the one I prefer :

The actual case of bits = all 1's in a 32 bit register should only occur 1 in 4 billion times, so we don't have to handle it fast, we just have to handle it correctly. So I suggest you do the unrolled checks for decode, and then after you have checked all codelens up to the maximum allowed (24 or 32 or whatever your max codelen is), if bits was ~0 , it will not have decoded, so you can do :

if ( bits < huff_branchCodeLeftAligned[21] ) .. return decoded 21 bit code
if ( bits < huff_branchCodeLeftAligned[22] ) .. return decoded 22 bit code
if ( bits < huff_branchCodeLeftAligned[23] ) .. return decoded 23 bit code
if ( bits < huff_branchCodeLeftAligned[24] ) .. return decoded 24 bit code
// 24 is my max allowed code len

// failed to do any decode ! must be the bad case
assert( bits == (~0) );
// huff code must be maxCodeLen (not necessarily 24 - the maximum actually used in this huffman table)
// and symbol must be the last ( least likely one )
// return decoded maxCodeLen code;
return symbolUnsort[ numSymbols-1 ];

Finally note that in most cases it is slightly more efficient to use a "semi-huffman" code rather than a true huffman code. The semi-huffman code I propose is huffman up to codelen = 16 or so, and then simply flat after that. In most cases this affects compression ratio microscopically (because the long code lengths are very rare) but can reduce complexity a lot. How does it reduce complexity?

1. You don't have to do the proper length-limited huffman tree construction. Instead you just build a normal unlimited huffman code length set, and then anything with code length >= 16 is flagged as part of the semi-huffman set.

2. When you transmit your code lengths, you don't have to send the lengths > 16, you just send lengths in [0,16] (0 means doesn't occur).

3. Your unrolled decoder only has to go up to 16 branches (if you table-accelerate, you do 8 bits by table then 8 more branches).

4. Then in the final case instead of just handling the ~0 case you handle all the "long" symbols with a flat code.

08-11-10 | ROLZ and Links

I found this little tutorial on Fenwick Trees a little while ago. Peter's original paper is a better way to learn it, but the graphics on this page are really nice; you can really grok the structure of the tree best when you see it visually.

I also found this : Anatomy of ROLZ data archiver , which is the only actual algorithm description I've ever found of ROLZ , since Ilia doesn't write up his work. (there's also a brief description at the Russian Wikipedia ).

Anyway, it's pretty obvious how you would do ROLZ, but there are a few unexpected cool things on the "Anatomy of ROLZ data archiver" page.

1. The way he keeps the lists of offsets for each context by just stepping back through the history of the file already processed is pretty cool. It means there's no actual separate [context][maxoffsets] table at all, the offsets themselves are pointers back through a linked list. It also means that you can do sliding-window trivially.

2. In the BALZnoROLZ.txt file he has Ilia Muraviev's binary probability updater :

//This is predictor of Ilya Muraviev
class TPredictor {
    unsigned short p1, p2;
public:
    TPredictor(): p1(1 << 15), p2(1 << 15) {} 
    ~TPredictor() {}
    int P() {
        return (p1 + p2); 
    }
    void Update(int bit) { 
        if (bit) {
            p1 += (unsigned short)(~p1) >> 3; 
            p2 += (unsigned short)(~p2) >> 6; 
        }
        else {
            p1 -= p1 >> 3; 
            p2 -= p2 >> 6; 
        }
    }
};

First of all, let's back up a second, what is this? It's a probability update for binary arithmetic coding. A very standard way to do fast probability updates for binary arithmetic coding is to do :

#define PROB_ONE    (1<<14) // or whatever
#define PROB_UPD_SHIFT  (6) // or something

prob = PROB_ONE >> 1; // start at 1/2

if ( bit )
 prob += (PROB_ONE - prob) >> PROB_UPD_SHIFT;
else
 prob -= prob >> PROB_UPD_SHIFT;

what this is doing is when you get a zero bit :

prob *= (1 - 2^-PROB_UPD_SHIFT);

that's equivalent to a normal counting probability update if you put :

n1 = prob*N
n0 = N - n1

when I get a zero bit n0++ and N++

prob = n1 / N

so update is 

prob := prob*N / (N+1)

or prob *= N / (N+1)


N/(N+1) = (1 - 2^-PROB_UPD_SHIFT)

which means

N = (2^PROB_UPD_SHIFT - 1)

then you keep prob and reset N; that is, this update is equivalent to pretending you have such an n0 and N and you increment them and compute the new probability, but then you don't actually store N, so the next update will have the same weight (if N increased then each update has a smaller effect than the last). This is an IIR filter that acts a bit like a moving average of the last N. The larger N is, the bigger window we are effectively using. A small N adapts very quickly.

So Ilia's probability update is a 2^3-1 and 2^6-1 window size, and then averaged. That's a very simple and neat idea that never occurred to me - use two simple probability estimators, one that adapts very fast and one that adapts more slowly, and just blend them.
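
Here is a minimal C sketch of that two-rate estimator, restated from the class above so the arithmetic can be checked by hand. The names TwoRateProb, TwoRate_Init, TwoRate_P and TwoRate_Update are hypothetical; the shifts 3 and 6 and the 16-bit state are taken from the quoted code, and writing ~p as (0xFFFF - p) is my restatement of the same 16-bit operation.

```c
#include <stdint.h>

typedef struct {
    uint16_t fast;   /* P(bit==1) in 16 bits, update shift 3 (window about 2^3-1) */
    uint16_t slow;   /* P(bit==1) in 16 bits, update shift 6 (window about 2^6-1) */
} TwoRateProb;

static void TwoRate_Init(TwoRateProb *p)
{
    p->fast = (uint16_t)(1u << 15);   /* both start at one half */
    p->slow = (uint16_t)(1u << 15);
}

/* Blend by summing: the result is on a 17-bit scale, so 1<<16 means one half. */
static uint32_t TwoRate_P(const TwoRateProb *p)
{
    return (uint32_t)p->fast + (uint32_t)p->slow;
}

static void TwoRate_Update(TwoRateProb *p, int bit)
{
    if (bit) {
        /* move each estimate a fixed fraction of the way toward 0xFFFF */
        p->fast = (uint16_t)(p->fast + ((0xFFFFu - p->fast) >> 3));
        p->slow = (uint16_t)(p->slow + ((0xFFFFu - p->slow) >> 6));
    } else {
        /* move each estimate a fixed fraction of the way toward 0 */
        p->fast = (uint16_t)(p->fast - (p->fast >> 3));
        p->slow = (uint16_t)(p->slow - (p->slow >> 6));
    }
}
```

Starting from one half, a single zero bit moves the fast estimate from 32768 to 28672 and the slow one only to 32256, which is the fast/slow behavior being blended.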

08-11-10 | Huffman - Arithmetic Equivalence




coder transformation.

This is something well known by "practitioners of the art" but I've never seen it displayed explicitly, so here we go. We're talking about arbitrary-alphabet decoding here obviously, not binary, and static probability models mostly.

Let's start with our Huffman decoder. (a bit of review here or here or here ). For simplicity and symmetry, we will use a Huffman decoder that can handle code lengths up to 16, and we will use a table-accelerated decoder. The decoder looks like this :

// look at next 16 bits (but don't consume them)
U32 peek = BitInput_Peek(16);

// use peek to look in decode tables :
int sym = huffTable_symbol[peek];

// use symbol to get actual code length :
int bits = symbol_codeLength[sym];

// and remove that code length from the bit stream :
BitInput_Consume(bits);

this is very standard (more normally the huffTable would only accelerate the first 8-12 bits of decode, and you would then fall back to some other method for codes longer than that). Let's expand out what Peek and Consume do exactly. For symmetry to the arithcoder I'm going to keep my bit buffer right-aligned in a big-endian word.

int bits_bitLen = // # of bits in word
U32 bits_code = // current bits in word

BitInput_Peek(16) :
  ASSERT( bits_bitLen >= 16 );
  U32 ret = bits_code >> (bits_bitLen - 16);
  return ret;

BitInput_Consume(bits) :
  bits_bitLen -= bits;
  bits_code &= (1 << bits_bitLen)-1;
  while ( bits_bitLen < 16 )
    bits_code <<= 8;
    bits_code |= *byteStream++;
    bits_bitLen += 8;

it should be obvious what these do; _Peek grabs the top 16 bits of code for you to snoop. Consume removes the top "bits" from code, and then streams in bytes to refill the bits while we are under count. (to repeat again, this is not how you should actually implement bit streaming, it's slower than necessary).

Okay, now let's look at an Arithmetic decoder. (a bit of review here or here and here ). First let's start with the totally generic case. Arithmetic Decoding consists of getting the probability target, finding what symbol that corresponds to, then removing that symbol's probability range from the stream. This is :

AC_range = size of current arithmetic interval
AC_code  = value in range specified

Arithmetic_Peek(cumulativeProbabilityTotal) :
  r = AC_range / cumulativeProbabilityTotal;
  target = AC_code / r;
  return target;

Arithmetic_Consume(cumulativeProbabilityLow, probability, cumulativeProbabilityTotal)
  AC_range /= cumulativeProbabilityTotal;
  AC_code  -= cumulativeProbabilityLow * AC_range
  AC_range *= probability;

  while ( AC_range < minRange )
    AC_code <<= 8;
    AC_range <<= 8;
    AC_code |= *byteStream++;

Okay it's not actually obvious that this is a correct arithmetic decoder (the details are quite subtle) but it is; and in fact this is just about the fastest arithmetic decoder in the world (the only thing you would do differently in real code is share the divide by cumulativeProbabilityTotal so it's only done once).

Now, the problem of taking the Peek target and finding what symbol that specifies is actually the slowest part, there are various solutions, Fenwick trees, Deferred Summation, etc. For now we are talking about *static* coding, so we will use a table lookup.

To decode with a table we need a table from [0,cumulativeProbabilityTotal] which can map a probability target into a symbol. So when we get a value from _Peek we look it up in a table to get the symbol, cumulativeProbabilityLow, and probability.

To speed things up, we can use cumulativeProbabilityTotal = a power of two to turn the divide into a shift. We choose cumulativeProbabilityTotal = 2^16. (the longest code we can write with our arithmetic coder then has code length -log2(1/cumulativeProbabilityTotal) = 16 bits).

So now our static table-based arithmetic decode is :

Arithmetic_Peek() :
  r = AC_range >> 16;
  target = AC_code / r;

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbHigh = cumProbTable[sym+1];

  AC_range >>= 16;
  AC_code  -= cumProbLow * AC_range
  AC_range *= (cumProbHigh - cumProbLow);

  while ( AC_range < minRange )
    AC_code <<= 8;
    AC_range <<= 8;
    AC_code |= *byteStream++;

Okay, not bad, and we still allow arbitrary probabilities within the [0,cumulativeProbabilityTotal] , so this is more general than the Huffman decoder. But we still have a divide which is very slow. So if we want to get rid of that, we have to constrain a bit more :

Make each symbol probability a power of 2, so (cumProbHigh - cumProbLow) is always a power of 2 (< cumulativeProbabilityTotal). We will then store the log2 of that probability range. Let's do that explicitly :

Arithmetic_Peek() :
  r = AC_range >> 16;
  target = AC_code / r;

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];

  AC_range >>= 16;
  AC_code  -= cumProbLow * AC_range
  AC_range <<= cumProbLog2;

  while ( AC_range < minRange )
    AC_code  <<= 8;
    AC_range <<= 8;
    AC_code |= *byteStream++;

Now the key thing is that since we only ever >> to shift AC_range down or << to shift it up, if it starts as a power of 2, it stays a power of 2. So we will replace AC_range with its log2 :

Arithmetic_Peek() :
  r = AC_log2Range - 16;
  target = AC_code >> r;

int sym = arithTable_symbol[target];

int cumProbLow  = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];

  AC_code  -= cumProbLow << (AC_log2Range - 16);
  AC_log2Range += (cumProbLog2 - 16);

  while ( AC_log2Range < min_log2Range )
    AC_code  <<= 8;
    AC_log2Range += 8;
    AC_code |= *byteStream++;

we only need a tiny bit more now. First observe that an arithmetic symbol of log2Probability is written in (16 - log2Probability) bits, so let's call that "codeLen". And we'll rename AC_log2Range to AC_bitlen :

Arithmetic_Peek() :
  peek = AC_code >> (AC_bitlen - 16);

int sym = arithTable_symbol[peek];

int codeLen = sym_codeLen[sym];
int cumProbLow  = sym_cumProbTable[sym];

  AC_code   -= cumProbLow << (AC_bitlen - 16);
  AC_bitlen -= codeLen;

  while ( AC_bitlen < 16 )
    AC_code  <<= 8;
    AC_bitlen += 8;
    AC_code |= *byteStream++;

let's compare this to our Huffman decoder (just copying down from the top of the post and reorganizing a bit) :

BitInput_Peek() :
  peek = bits_code >> (bits_bitLen - 16);

// use peek to look in decode tables :
int sym = huffTable_symbol[peek];

// use symbol to get actual code length :
int codeLen = sym_codeLen[sym];

BitInput_Consume() :
  bits_code &= (1 << (bits_bitLen - codeLen))-1;
  bits_bitLen -= codeLen;

  while ( bits_bitLen < 16 )
    bits_code <<= 8;
    bits_bitLen += 8;
    bits_code |= *byteStream++;

you should be able to see the equivalence.

There's only a small difference left. To remove the consumed bits, the arithmetic coder does :

  int cumProbLow  = sym_cumProbTable[sym];

  AC_code   -= cumProbLow << (AC_bitlen - 16);

while the Huffman coder does :

  bits_code &= (1 << (bits_bitLen - codeLen))-1;

which is obviously simpler. Note that the Huffman remove can be written as :

  code = peek >> (16 - codeLen);

  bits_code -= code << (bits_bitLen - codeLen);

What's happening here - peek is 16 bits long, it's a window in the next 16 bits of "bits_code". First we make "code" which is the top "codeLen" of "peek". "code" is our actual Huffman code for this symbol. Then we know the top bits of bits_code are equal to code, so to turn them off, rather than masking we can subtract. The equivalent cumProbLow is code<<(16-codeLen). This is the equivalence of the Huffman code to taking the arithmetic probability range [0,65536] and dividing it in half at each tree branch.
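
As a quick check that masking and subtracting really remove the same bits, here is a small C sketch of both forms. RemoveByMask and RemoveBySubtract are hypothetical names; I'm assuming a right-aligned 32-bit buffer with no stray bits above bits_bitLen, and that bits_bitLen is the count before the code is consumed.

```c
#include <assert.h>
#include <stdint.h>

/* Huffman-style remove: mask off the top codeLen bits of the valid region. */
static uint32_t RemoveByMask(uint32_t bits_code, int bits_bitLen, int codeLen)
{
    assert(bits_bitLen >= 16 && bits_bitLen <= 31);
    assert(codeLen >= 1 && codeLen <= 16);
    return bits_code & ((1u << (bits_bitLen - codeLen)) - 1u);
}

/* Arithmetic-style remove: subtract the code shifted up to where it sits. */
static uint32_t RemoveBySubtract(uint32_t bits_code, int bits_bitLen, int codeLen)
{
    assert(bits_bitLen >= 16 && bits_bitLen <= 31);
    assert(codeLen >= 1 && codeLen <= 16);
    uint32_t peek = bits_code >> (bits_bitLen - 16);   /* next 16 bits */
    uint32_t code = peek >> (16 - codeLen);            /* top codeLen bits of peek */
    return bits_code - (code << (bits_bitLen - codeLen));
}
```

For example with bits_code = 0xABCDE, bits_bitLen = 20 and codeLen = 5, the code is 10101 binary and both functions leave 0x3CDE.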

The arithmetic coder had to look up cumProbLow in a table because it is still actually a bit more general than the Huffman decoder. In particular our arithmetic decoder can still handle probabilities like [1,2,4,1] (with cumProbTot = 8). Because of that the cumProbLows don't hit the nice bit boundaries. If you require that your arithmetic probabilities are always sorted [1,1,2,4], then since they are power of two and sum to a power of two, each partial power of two must be present, so the cumProbLows must all hit bit boundaries like the huffman codes, and the equivalence is complete.
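
The bit boundary condition can be tested directly. This is a rough C sketch (PowerOfTwoProbsToCodes is a hypothetical name) that walks the symbols in order, checks that each cumProbLow is a multiple of that symbol's power-of-two probability, and if so reads off the prefix code as cumProbLow shifted down by the log2 of the probability.

```c
#include <stdint.h>

/* log2Prob[i] is log2 of symbol i's probability; the total is 1 << totalBits.
   Returns 1 if every cumProbLow lands on a bit boundary and the probabilities
   sum to the total, and in that case fills codes[] and codeLens[].
   Returns 0 as soon as a cumProbLow is misaligned. */
static int PowerOfTwoProbsToCodes(const int *log2Prob, int numSyms, int totalBits,
                                  uint32_t *codes, int *codeLens)
{
    uint32_t cumProbLow = 0;
    for (int i = 0; i < numSyms; i++) {
        uint32_t prob = 1u << log2Prob[i];
        if (cumProbLow & (prob - 1u))
            return 0;                          /* not on a bit boundary */
        codeLens[i] = totalBits - log2Prob[i];
        codes[i] = cumProbLow >> log2Prob[i];  /* so cumProbLow == code << (totalBits - codeLen) */
        cumProbLow += prob;
    }
    return cumProbLow == (1u << totalBits);
}
```

With the sorted probabilities [1,1,2,4] (log2 values 0,0,1,2 and totalBits 3) this yields the codes 000, 001, 01, 1. With [1,2,4,1] it fails at the second symbol, whose cumProbLow of 1 is not a multiple of 2.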

So, you should now see clearly that a Huffman and Arithmetic coder are not completely different things. They are a continuum on the same scale. If you start with a fully general Arithmetic coder it is flexible, but slow. You then constrain it in various ways step by step, it gets faster and less general, and eventually you get to a Huffman coder. But those are not the only coders in the continuum, you also have things like "Arithmetic coder with fixed power of two probability total but non-power-of-2 symbol probabilities" which is somewhere in between in space and speed.

BTW not directly on topic, but I found this in my email and figure it should be in public :

Well, Adaptive Huffman is awful, nobody does it.  So you have a few options :

Static Huffman  -
    very fast
    code lengths must be transmitted
    can use table-based decode

Arithmetic with static probabilities scaled with total = a power of 2
    very fast
    can use table-based decode
    must transmit probabilities
    decode must do a divide

Arithmetic semi-adaptive
    "Deferred Summation"
    doesn't transmit probabilities

Arithmetic fully adaptive
    must use Fenwick tree or something like that
    much slower, coder time no longer dominates
      (symbol search in tree time dominates)

Arithmetic by binary decomposition
    can use fast binary arithmetic coder
    speed depends on how many binary events it takes to code symbols on average

It just depends on your situation so much. With something like image or
audio coding you want to do special-cased things like turn amplitudes
into log2+remainder, use a binary coder for the zero, perhaps do
zero-run coding, etc. stuff to avoid doing the fully general case of a
multisymbol large alphabet coder.

08-10-10 | One Really nice thing about the PS3

Is that PS3RUN takes command line args and my app just gets argc/argv. And I get stdin/stdout that just works, and direct access to the host disk through app_home/. That is all hella nice.

It means I can make command line test apps for regression and profiling and just run them and pass in file name arguments for testing on.

08-10-10 | HeapAllocAligned

How the fuck is there no HeapAllocAligned on Win32 ?

and yes I know I can use clib or do it myself or whatever, but still, WTF ?

08-10-10 | A small note on memset16

On the SPU I'm now making heavy use of a primitive op I call "memset16" ; by this I don't mean that it must be 16-byte aligned, but rather that it memsets 16-byte patterns, not individual bytes :

void rrSPU_MemSet16(void * ptr,qword pattern,int count)
{
    qword * RADRESTRICT p = (qword *) ptr;
    char * end = ((char *)ptr + count );

    while ( (char *)p < end )
    {
        *p++ = pattern;
    }
}

(and yes I know this could be faster, this is the simple version for readability).

The interesting thing has been that taking a 16-byte pattern as input actually makes it way more useful than normal byte memset. I can now also memset shorts and words, floats, doubles, and vectors! So this is now the way I do any array assignments when a chunk of consecutive elements are the same. eg. instead of doing :

float fval = param;
float array[16];

for(int i=0;i<16;i++) array[i] = fval;

you do :

float array[16];
qword pattern = si_from_float(fval);
rrSPU_MemSet16(array,pattern,sizeof(array));


In fact it's been so handy that I'd like to have it on other platforms, at least up to U32.
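
For other platforms, a portable 4-byte pattern version might look something like the C sketch below. MemSet32 and FillFloats are hypothetical names, not an existing API; the sketch assumes count is a byte count that is a multiple of 4 and that float is 32 bits, and it uses memcpy so it does not depend on alignment or break aliasing rules.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill count bytes at ptr with a repeating 4-byte pattern.
   Any tail shorter than 4 bytes is left untouched. */
static void MemSet32(void *ptr, uint32_t pattern, size_t count)
{
    unsigned char *p = (unsigned char *)ptr;
    for (size_t i = 0; i + sizeof(pattern) <= count; i += sizeof(pattern))
        memcpy(p + i, &pattern, sizeof(pattern));
}

/* Example use: assign the same float to n consecutive elements. */
static void FillFloats(float *array, size_t n, float fval)
{
    uint32_t pattern;
    memcpy(&pattern, &fval, sizeof(pattern));   /* assumes sizeof(float) == 4 */
    MemSet32(array, pattern, n * sizeof(float));
}
```

The same call fills shorts by repeating the 16-bit value twice in the pattern, eg. MemSet32(shorts, 0x12341234, sizeof(shorts)).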

08-10-10 | Transmission of Huffman Trees

Transmission of Huffman Trees is one of those peripheral problems of compression that has never been properly addressed. There's not really any research literature on it, because in the N -> infinity case it disappears.

Of course in practice, it can be quite important, particularly because we don't actually just send one huffman tree per file. All serious compressors that use huffman resend the tree every so often. For example, to compress bytes what you might do is extend your alphabet to [0..256] inclusive, where 256 means "end of block" , when you decode a 256, you either are at the end of file, or you read another huffman tree and start on the next block. (I wrote about how the encoder might make these block split decisions here ).

So how might you send a Huffman tree?

For background, you obviously do not want to actually send the codes. The Huffman code value should be implied by the symbol identity and the code length. The so-called "canonical" codes are created by assigning codes in numerical counting up order to symbols of the same length in their alphabetical order. You also don't need to send the character counts and have the decoder make its own tree, you send the tree directly in the form of code lengths.
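
As a concrete reference for that rule, here is a minimal C sketch that derives code values from code lengths alone. AssignCanonicalCodes is a hypothetical name and the cap of 16 is an assumption; the numbering follows the common convention (the one used by Deflate) where shorter codes get numerically lower values and symbols of equal length count upward in symbol order.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CODE_LEN 16   /* assumption for this sketch */

/* codeLens[i] == 0 means symbol i does not occur; its code is set to 0. */
static void AssignCanonicalCodes(const uint8_t *codeLens, int numSyms, uint32_t *codes)
{
    uint32_t countOfLen[MAX_CODE_LEN + 1] = {0};
    uint32_t nextCode[MAX_CODE_LEN + 1] = {0};

    for (int i = 0; i < numSyms; i++) {
        assert(codeLens[i] <= MAX_CODE_LEN);
        countOfLen[codeLens[i]]++;
    }
    countOfLen[0] = 0;   /* absent symbols take no code space */

    /* first code of each length: step past the previous length's codes, then append a bit */
    uint32_t code = 0;
    for (int len = 1; len <= MAX_CODE_LEN; len++) {
        code = (code + countOfLen[len - 1]) << 1;
        nextCode[len] = code;
    }

    /* hand out codes in counting order within each length */
    for (int i = 0; i < numSyms; i++)
        codes[i] = codeLens[i] ? nextCode[codeLens[i]]++ : 0;
}
```

For lengths {3,3,3,3,3,2,4,4} this produces 010, 011, 100, 101, 110, 00, 1110, 1111, so encoder and decoder agree on the codes without ever transmitting them.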

So in order to send a canonical tree, you just have to send the code lens. Now, not all symbols in the alphabet may occur at all in the block. Those technically have a code length of "infinite" but most people store them as code length 0 which is invalid for characters that do occur. So you have to code :

which symbols occur at all
which code lengths occur
which symbols that do occur have which code length

Now I'll go over the standard ways of doing this and some ideas for the future.

The most common way is to make the code lengths into an array indexed by symbol and transmit that array. Code lengths are typically in [1,31] (or even less [1,16] , and by far most common is [4,12]), and you use 0 to indicate "symbol does not occur". So you have an array like :

{ 0 , 0 , 0 , 4 , 5 , 7 , 6, 0 , 12 , 5 , 0, 0, 0 ... }

1. Huffman the huffman tree ! This code length array is just another array of symbols to compress - you can of course just run your huffman encoder on that array. In a typical case you might have a symbol count of 256 or 512 or so, so you have to compress that many bytes, and then your "huffman of huffmans" will have a symbol count of only 16 or so, so you can then send the tree for the secondary huffman in a simpler scheme.

2. Delta from neighbor. The code lens tend to be "clumpy" , that is , they have correlation with their neighbors. The typical way to model this is to subtract each (non-zero) code len from the last (non-zero) code len, thus turning them into deltas from neighbors. You can then take these signed deltas and "fold up" the negatives to make them unsigned and then use one of the other schemes for transmitting them (such as huffman of huffmans). (actually delta from an FIR or IIR filter of previous is better).

3. Runlens for zeros. The zeros (does not occur) in particular tend to be clumpy, so most people send them with a runlen encoder.

4. Runlens of "same". LZX has a special flag to send a bunch of codelens in a row with the same value.

5. Golomb or other "variable length coding" scheme. The advantage of this over Huffman-of-huffmans is that it can be adaptive, by adjusting the golomb parameter as you go. (see for example on how to estimate golomb parameters). The other advantage is you don't have to send a tree for the tree.

6. Adaptive Arithmetic Code the tree! Of course if you can Huffman or Golomb code the tree you can arithmetic code it. This actually is not insane; the reason you're using Huffman over Arithmetic is for speed, but the Huffman will be used on 32k symbols or so, and the arithmetic coder will only be used on the 256-512 or so Huffman code lengths. I don't like this just because it brings in a bunch more code that I then have to maintain and port to all the platforms, but it is appealing because it's much easier to write an adaptive arithmetic coder that is efficient than any of these other schemes.
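
The delta-from-neighbor step in item 2 can be sketched in a few lines of C. FoldSigned, UnfoldSigned and DeltaFoldCodeLens are hypothetical names; the sketch assumes zeros are sent by some other mechanism (such as the zero runlens of item 3) and that the starting "previous" length is a constant both sides agree on.

```c
#include <stdint.h>

/* Fold signed deltas to unsigned: 0,-1,+1,-2,+2,... -> 0,1,2,3,4,... */
static unsigned FoldSigned(int delta)
{
    return (delta >= 0) ? 2u * (unsigned)delta : 2u * (unsigned)(-delta) - 1u;
}

static int UnfoldSigned(unsigned folded)
{
    return (folded & 1u) ? -(int)((folded + 1u) / 2u) : (int)(folded / 2u);
}

/* Writes one folded delta per non-zero code length and returns how many were written. */
static int DeltaFoldCodeLens(const uint8_t *codeLens, int numSyms, int initialLen,
                             unsigned *folded)
{
    int prev = initialLen;
    int numOut = 0;
    for (int i = 0; i < numSyms; i++) {
        if (codeLens[i] == 0)
            continue;                  /* "does not occur" is assumed to be coded separately */
        folded[numOut++] = FoldSigned((int)codeLens[i] - prev);
        prev = codeLens[i];
    }
    return numOut;
}
```

On the example array above with an initial length of 8, the non-zero lengths 4,5,7,6,12,5 become the deltas -4,+1,+2,-1,+6,-7 and fold to 7,2,4,1,12,13, which then go to whichever entropy coder you picked.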

BTW That's a general point that I think is worth stressing : often you can come up with some kind of clever heuristic bit packing compression scheme that is close to optimal. The real win of adaptive arithmetic coding is not the slight gain in efficiency, it's the fact that it is *so* much easier to compress anything you throw at it. It's much more systematic and scientific, you have tools, you make models, you estimate probabilities and compress them. You don't have to sit around fiddling with "oh I'll combine these two symbols, then I'll write a run length, and this code will mean switch to a different coding", etc.

Okay, that's all standard review stuff, now let's talk about some new ideas.

One issue that I've been facing is that coding the huffman tree in this way is not actually very nice for the decoder to be able to very quickly construct trees. (I wind up seeing the build tree time show up in my profiles, even though I only build the tree 5-10 times per 256k symbols). The issue is that it's in the wrong order. To build the canonical huffman code, what you need is the symbols in order of codelen, from lowest codelen to highest, and with the symbols sorted by id within each codelen. That is, something like :

codeLen 4 : symbols 7, 33, 48
codeLen 5 : symbols 1, 6, 8, 40 , 44
codeLen 7 : symbols 3, 5, 22

obviously you can generate this from the list of codelens per symbol, but it requires a reshuffle which takes time.
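
For reference, the reshuffle is a counting sort over the code lengths. This is a minimal C sketch of it (SortSymbolsByCodeLen is a hypothetical name, and the cap of 16 and the 16-bit symbol ids are assumptions); it takes two passes over the symbols plus a prefix sum over the lengths, which is the cost being complained about.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CODE_LEN 16   /* assumption for this sketch */

/* Fills sortedSyms with the symbols that occur, ordered by code length and
   then by symbol id within each length. Returns how many symbols occur. */
static int SortSymbolsByCodeLen(const uint8_t *codeLens, int numSyms, uint16_t *sortedSyms)
{
    int start[MAX_CODE_LEN + 2] = {0};

    /* pass 1: histogram, stored one slot up so the prefix sum gives start offsets */
    for (int i = 0; i < numSyms; i++) {
        assert(codeLens[i] <= MAX_CODE_LEN);
        if (codeLens[i])
            start[codeLens[i] + 1]++;
    }
    /* start[len] = number of symbols with a shorter non-zero code length */
    for (int len = 1; len <= MAX_CODE_LEN; len++)
        start[len + 1] += start[len];
    int numUsed = start[MAX_CODE_LEN + 1];

    /* pass 2: scatter; walking symbols in id order keeps each group sorted */
    for (int i = 0; i < numSyms; i++)
        if (codeLens[i])
            sortedSyms[start[codeLens[i]]++] = (uint16_t)i;

    return numUsed;
}
```

Run on the example above, the output is 7, 33, 48, 1, 6, 8, 40, 44, 3, 5, 22.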

So, maybe we could send the tree directly in this way?

One approach is through counting combinations / enumeration . For each codeLen, you send the # of symbols that have that codeLen. Then you have to select the symbols which have that codelen. If there are M symbols of that codeLen and N remaining unclaimed symbols, the number of ways is N!/(M!*(N-M)!) , and the number of bits needed to send the combination index is log2 of that. Note in this scheme you should also send the positions of the "not present" codeLen=0 group, but you can skip sending the entire group that's largest. You should also send the groups in order of smallest to largest (actually in order or *complement* order, a group that's nearly full is as good as a group that's nearly empty).
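
To get a feel for the cost, here is a small C sketch that computes the combination count and the whole number of bits needed to index it. Choose and CombinationIndexBits are hypothetical names; the sketch rounds up to whole bits (an arithmetic coder could get closer to the exact log2) and assumes the count fits in 64 bits, which holds for small M but not for arbitrary N and M.

```c
#include <assert.h>
#include <stdint.h>

/* N! / (M! * (N-M)!) , computed incrementally so every intermediate is exact. */
static uint64_t Choose(unsigned n, unsigned m)
{
    assert(m <= n);
    if (m > n - m)
        m = n - m;                   /* a nearly full group costs the same as a nearly empty one */
    uint64_t c = 1;
    for (unsigned i = 1; i <= m; i++)
        c = c * (n - m + i) / i;     /* c is C(n-m+i, i) after this step */
    return c;
}

/* Whole bits needed to send an index in [0, Choose(n,m)). */
static int CombinationIndexBits(unsigned n, unsigned m)
{
    uint64_t last = Choose(n, m) - 1;   /* largest index */
    int bits = 0;
    while (last) {
        bits++;
        last >>= 1;
    }
    return bits;
}
```

For example picking 3 symbols out of 256 unclaimed ones has 2763520 combinations, which takes 22 bits, versus 24 bits to send three raw 8-bit symbol ids.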

I think this is an okay way to send huffman trees, but there are two problems : 1. it's slow to decode a combination index, and 2. it doesn't take advantage of modelling clumpiness.

Another similar approach is binary group flagging. For each codeLen, you want to specify which remaining symbols are of that codelen or not of that codelen. This is just a bunch of binary off/on flags. You could send them with a binary RLE coder, or the elegant thing would be Group Testing. Again the problem is you would have to make many passes over the stream and each time you would have to exclude already done ones.

(ADDENDUM : a better way to do this which takes more advantage of "clumpiness" is like this : first code a binary event for each symbol to indicate codelen >=1 (vs. codeLen < 1). Then, on the subset that is >= 1, code an event for is it >= 2, and so on. This is the same number of binary flags as the above method, but when the clumpiness assumption is true this will give you flags that are very well grouped together, so they will code well with a method that makes coherent binary smaller (such as runlengths)).

Note that there's another level of redundancy that's not being exploited by any of these coders. In particular, we know that the tree must be a "prefix code" , that is satisfy Kraft, (Sum 2^-L = 1). This constrains the code lengths. (this is most extreme for the case of small trees; for example with a two symbol tree the code lengths are completely specified by this; on a three symbol tree you only have one free choice - which one gets the length 1, etc).
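
The Kraft constraint is easy to state in integer arithmetic by scaling everything by 2^maxLen. Here is a minimal C sketch of the check (KraftSumIsOne is a hypothetical name and the cap of 16 is an assumption); a decoder can use the same sum to validate a received set of code lengths.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CODE_LEN 16   /* assumption for this sketch */

/* Returns 1 if Sum 2^-L over the non-zero code lengths equals exactly 1. */
static int KraftSumIsOne(const uint8_t *codeLens, int numSyms)
{
    uint64_t sum = 0;   /* the Kraft sum scaled by 2^MAX_CODE_LEN */
    for (int i = 0; i < numSyms; i++) {
        assert(codeLens[i] <= MAX_CODE_LEN);
        if (codeLens[i])
            sum += (uint64_t)1 << (MAX_CODE_LEN - codeLens[i]);
    }
    return sum == ((uint64_t)1 << MAX_CODE_LEN);
}
```

The lengths {1,2,3,3} pass, while {2,2,2} leaves a quarter of the code space unused and fails.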

Another idea is to use MTF for the codelengths instead of delta from previous. I think that this would be better, but it's also slower.
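
For clarity about what MTF on the code lengths means here, this is a rough C sketch of the transform, not something I've measured. MtfEncodeCodeLens is a hypothetical name, the cap of 16 is an assumption, and the list is assumed to start in the order 0,1,2,...,16 on both sides. The linear search for each length is where the extra time goes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CODE_LEN 16   /* assumption for this sketch */

/* Replaces each code length by its current position in a move-to-front list. */
static void MtfEncodeCodeLens(const uint8_t *codeLens, int numSyms, uint8_t *ranks)
{
    uint8_t order[MAX_CODE_LEN + 1];
    for (int v = 0; v <= MAX_CODE_LEN; v++)
        order[v] = (uint8_t)v;

    for (int i = 0; i < numSyms; i++) {
        assert(codeLens[i] <= MAX_CODE_LEN);
        int pos = 0;
        while (order[pos] != codeLens[i])
            pos++;
        ranks[i] = (uint8_t)pos;
        /* slide the entries in front of pos back by one and put this length first */
        memmove(order + 1, order, (size_t)pos);
        order[0] = codeLens[i];
    }
}
```

The lengths 4,4,5,4,5,5 come out as the ranks 4,0,5,1,1,0 : once a length has been seen, alternating between a couple of values only produces small ranks.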

Finally when you're sending multiple trees in a file you should be able to get some win by making the tree relative to the previous one, but I've found this is not trivial.

I've tried about 10 different schemes for huffman tree coding, but I'm not going to have time to really solve this issue, so it will remain neglected as it always has.

08-09-10 | More SPU

These are my favorite pages : CELL Instruction Reference and IBM Intrinsics Reference .


To make an int that lives in a vector, here are some rules :

1. spu_promote() is good (or si_from_int which is the same thing). eg.

vec_int4 v = spu_promote(i,0);

2. Store is bad. eg.

vec_int4 v;
v[0] = i;

3. Array initialization from a variable is bad. eg.

vec_int4 v = (vec_int4) { i };

The first one will give you a vector with i in the top word and *don't care* in the rest. The latter two will generate code that preserves the values in the bottom parts of the register. That will lead to a bunch of random cwd's and shufbs and shit like that popping up in your code in unexpected places.

Here's a helper class that you can use when you have an int on the stack which you need to force to actually be treated as a vector. When you look at your disasm and you see shufb's for no reason, toss these guys in to replace C ints.

struct RR_ALIGN_SPU rrSPU_VectorInt
{
    qword   m_qw;
    operator int() const { return ((vec_int4)m_qw)[0]; }
    void operator = (int x) { m_qw = si_from_int(x); }
};

(the main time you will need to do this by hand is for arrays, eg. int x[4]; will be shit, so use rrSPU_VectorInt x[4]);


So I wrote a mini profiler for the SPU that tracks some times and copies it back to the PPU. It's super lightweight, and for the most part I found that when I enabled it I lost maybe 0.1% of my speed. Today I disabled profiling and expected a little speed boost and found ... my app is now slower without profiling. I'm at 140 MB/sec without profiling, 142 MB/sec with profiling. So maybe I'll ship with the profiler enabled.

This is a typical but particularly ridiculous example of what I've seen all along - random changes cause something else to happen in the code gen which can either be good or bad - and it's not by a trivial amount either.

ADDENDUM : I found out what this one was. If I replace my profile macros with GCC_SCHEDULE_BREAK macros, I get the high speed. So the profiling was helping by causing scheduler breaks (due to the asm mtfb presumably).

08-06-10 | Forceinline

Forceinline is another thing like restrict that is really in the wrong place. What I want is for it to be at the *call* , or at least overrideable at the call.

For example say you write something like memcpy - (not as an intrinsic but as a function). Most of the time you're okay with it just being a function, but in some little routine that is a super hot spot you want to be able to say :

for ( .. a million .. )
{
  .. important stuff
  __forceinline memcpy(to,from,size);
}

(and the opposite __notinline). More generally Sean once mentioned just the idea of being able to mark up "yeah really optimize this" or "this is done often" parts of the code, so that could suffice as well.

At the moment the only way I know how to do this is some ugly shit like :

__forceinline void inl_myfunc()
{
 .. write code here ..
}

void call_myfunc() { inl_myfunc(); }

then clients can choose to call inl_myfunc or call_myfunc. Ugly. C99 cleaned up the inline/extern spec so you can get compilation of non-inlined "inline" functions in only one place, but it failed to let the client specify whether or not it should inline or not.
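Here is a compilable sketch of that inl_/call_ pair. The MY_FORCEINLINE / MY_NOINLINE macros and the copy_bytes names are made up for illustration; the attributes underneath are the usual MSVC and GCC/Clang spellings :

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER)
#define MY_FORCEINLINE __forceinline
#define MY_NOINLINE    __declspec(noinline)
#else
#define MY_FORCEINLINE inline __attribute__((always_inline))
#define MY_NOINLINE    __attribute__((noinline))
#endif

// The real body lives in the forced-inline version.
MY_FORCEINLINE void inl_copy_bytes(unsigned char* to, const unsigned char* from, size_t size)
{
    assert(size == 0 || (to != nullptr && from != nullptr));
    for (size_t i = 0; i < size; i++)
        to[i] = from[i];
}

// One out-of-line copy for everybody who doesn't care.
MY_NOINLINE void call_copy_bytes(unsigned char* to, const unsigned char* from, size_t size)
{
    inl_copy_bytes(to, from, size);
}
```

The hot loop calls inl_copy_bytes and everything else calls call_copy_bytes. The choice is still made by picking a name rather than by annotating the call, which is the ugly part.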

BTW it should be evidently clear that the standard compiler inlining heuristic of using complexity is totally wrong. Little shitlet functions that happen to call memcpy should *not* get it inlined, and my big complex LZ decoder function *should*. In fact there's just no way for the compiler to know when it's a good idea to inline or not because it doesn't have information about how often a spot in code is hit.

Restrict continues to cause me no end of annoyance; I'm working on some chunk of code that I know is all alias-free, but I look at the disasm and I see it's doing pointless loads and stores. Okay, WTF, I forgot to put restrict on something. Now I have to randomly browse around my code and type definitions and try to find the one that's missing restrict. That's fucking retarded for workflow. I should just be able to say __restrict { } over my chunk of code.

08-06-10 | Visual Studio File Associations

This is my annoyance of the moment. I rarely double-click CPP/H files, but once in a while it's handy.

On my work machine, I currently have VC 2003,2005,2008, and 2010 installed. Which one should the CPP/H file open in?

The right answer is obviously : whichever one is currently running.

And of course there's no way to do that.

Almost every day recently I've had a moment where I am working away in VC ver. A , and I double click some CPP file, and it doesn't pop up, and I'm like "WTF", and then I hear my disk grinding, and I'm like "oh nooes!" , and then the splash box pops up announcing VC ver. B ; thanks a lot guys, I'm really glad you started a new instance of your gargantuan hog of a program so that I could view one file.

Actually if I don't have a DevStudio currently running, then I'd like CPP/H to just open in notepad. Maybe I have to write my own tool to open CPP/H files. But even setting the file associations to point at my own tool is a nightmare. Maybe I have to write my own tool to set file associations. Grumble.

(for the record : I have 2008 for Xenon, 2005 is where I do most of my work, I keep 2003 to be able to build some old code that I haven't ported to 2005 templates yet, and 2010 to check it out for the future).

08-06-10 | More SPU

1. CWD instruction aligns the position you give it to the size of the type (byte,word,dword). God dammit. I don't see this documented anywhere. CWD is supposed to generate a shuffle key for word insertion at a position (regA). In fact it generates a shuffle key for insertion at position ( regA & ~3 ). That means I can't use it to do arbitrary four byte moves.

2. Shufb is not mod 32. Presumably because it has those special 0 and FF selections. This is not a huge deal because you can fix it by doing AND splats(31) , but that is enough slowdown that it ruins shufb as a fast path for doing byte indexing. (people use rotqby instead because that is mod 16).

A related problem for Shufb is that there's no Byte Add. If you had byte add, and shufb was mod 32, then you could generate grabbing a 16-byte portion of two quads by adding an offset.

In order to deal with this, you have to first mod your offset down to [0,15] so that you won't overflow, then you have to see if your original offset before modding had the 16 bit set, and if so, swap the reg order you pass to shuffle. If you had a byte add and shuffle was mod 32, you wouldn't have to do any of that and it would be way faster to do grabs of arbitrary qwords from two neighboring qwords.
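To make the mod-down-and-swap logic concrete, here is a plain scalar model of it. This is not SPU code and uses no intrinsics; Grab16 is a hypothetical name and the two 16-byte arrays just stand in for the two neighboring quads :

```cpp
#include <cassert>
#include <cstdint>

// Scalar model : read 16 bytes starting at `offset` (0..31) from the pair (a,b)
// seen as one 32 byte ring.
void Grab16(const uint8_t a[16], const uint8_t b[16], unsigned offset, uint8_t out[16])
{
    assert(offset < 32);
    unsigned lo = offset & 15;        // mod the offset down to [0,15]
    bool swapRegs = (offset & 16) != 0; // the original offset had the 16 bit set
    const uint8_t* first  = swapRegs ? b : a;
    const uint8_t* second = swapRegs ? a : b;
    for (unsigned i = 0; i < 16; i++)
    {
        unsigned idx = lo + i; // 0..30, never overflows past the pair
        out[i] = (idx < 16) ? first[idx] : second[idx - 16];
    }
}
```

With a mod 32 shuffle and a byte add, the whole thing would collapse to "add offset to the index mask, shuffle".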

(also there's no fast way to generate the typical shuffle mask {010203 ...}. )

3. You can make splats a few different ways; one is to rely on the "spu_splats" builtin, which figures out how to spread the value (usually using FSMB or some such variant). But you can also write it directly using something like :

(vec_int4) { -1,-1,-1,-1 }

which the compiler will either turn into loading a constant or some code to generate it, depending on what it thinks is faster.

Some important constants are hard to generate, the most common being (vec_uchar16){0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} , so you might want to load that into a register and then pass it around if you need it much. (if you are using the stdlib memcpy at all, you can also get it from __Shuffles).

4. Something that I always want with data compression and bit manipulation is a "shift right with 1 bits coming in". Another one is bit extract and bit insert , grab N bits from position M. Also just "mask with N bits" would always be nice, so I don't have to do ((1 << N)-1) (or whatever) to generate masks. Obviously the SPU is not intended for bit manipulation. I'm sure getting Blu-Ray working on it was a nightmare.
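For reference, these are the portable C++ versions of the ops I keep wishing were single instructions. The function names are made up, and the asserts spell out the ranges where the shifts are well defined :

```cpp
#include <cassert>
#include <cstdint>

// "mask with N bits" : the low n bits set, n in [0,32]
inline uint32_t MaskLowBits(unsigned n)
{
    assert(n <= 32);
    return (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
}

// bit extract : grab n bits starting at bit position pos
inline uint32_t ExtractBits(uint32_t v, unsigned pos, unsigned n)
{
    assert(pos < 32 && n <= 32);
    return (v >> pos) & MaskLowBits(n);
}

// bit insert : overwrite n bits at position pos with the low bits of `bits`
inline uint32_t InsertBits(uint32_t v, uint32_t bits, unsigned pos, unsigned n)
{
    assert(pos < 32 && n <= 32);
    uint32_t mask = MaskLowBits(n) << pos;
    return (v & ~mask) | ((bits << pos) & mask);
}

// shift right with 1 bits coming in at the top, s in [0,31]
inline uint32_t ShiftRightOnesIn(uint32_t v, unsigned s)
{
    assert(s < 32);
    return (s == 0) ? v : ((v >> s) | ~(0xFFFFFFFFu >> s));
}
```

Each one is a handful of ops instead of one, which adds up fast in a bit-reading inner loop.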

5. If you build with -fpic (to make your ELF relocatable), then all your static addresses are relative to r126, which means if you just reference some static in your inner loop it will do a ton of math all the time (the add r126 is not a big deal, but just fetching the address is a lot of work) ; as usual if you just copy it out to a local :

   combinedDecodeTable = s_combinedDecodeTable;

the compiler will "optimize" out the copy and will do all the math to compute s_combinedDecodeTable in your inner loop. So you have to use the "volatile trick" :

    volatile UINTa v_hack;
    v_hack = (UINTa) s_combinedDecodeTable;
    const CombinedDecodeTable * RADRESTRICT const combinedDecodeTable = (const CombinedDecodeTable * RADRESTRICT const) v_hack;


6. memcpy seems to be treated differently than __builtin_memcpy by the compiler. They both call the same code, but sometimes calling memcpy() doesn't inline, but __builtin_memcpy does. I don't know WTF is up with that (and yes I have -fbuiltins set and all that shit, but maybe there's some GCC juju I'm missing). Also, it is a decent SPU-optimized memcpy, but the compiler doesn't have a true "builtin" knowledge of memcpy for SPU the way it does on some platforms - eg. it doesn't know that for types that are inherently 16-byte aligned it can avoid the rollup/rolldown, it just always calls memcpy. So if you're doing a lot of memcpys of 16-byte aligned stuff you should probably have your own.

7. This one is a fucking nightmare : I put in some "scheduling barriers" (asm volatile empty) to help me look at the disassembly for optimization (it makes it much easier to trace through because shit isn't spread all over). After debugging a bit, I forgot to take out the scheduling barriers and ran my speed test again -

it was faster.

Right now my code is 5-10% faster with a few scheduling barriers manually inserted (19 clocks per symbol vs 21). That's fucking bananas, it means the compiler's scheduler is fucking up somehow. It sucks because there's no way for me to place them in a systematic way, I have to randomly try moving them around and see what's fastest.

8. Lots of little things are painfully important on the SPU. For example, obviously most of your functions should be inline, but sometimes you have to make them forceinline (and it's a HUGE difference if you don't because once a val goes through the stack it really fucks you), but then sometimes it's faster if it's not inline. Another nasty one is manual scheduling. Obviously the compiler does a lot of scheduling for you (though it fucks up as noted previously), but in many cases it can't figure out that two ops are actually commutative. eg. say you have something like A() which does shared++ , and B() also does shared++ , then the compiler can't tell that the sequence {A B} could be reordered to {B A} , so you have to manually try both ways. Another case is :


if ( x ) {
    B();
    .. more ..
} else {
    B();
    .. more ..
}

and again the compiler can't figure out that the B() could be hoisted out of the branches (more normally you would probably write it with B outside the branch, and the compiler won't bring it in). It might be faster inside the branches or outside, so you have to try both. Branch order reorganization can also help a lot. For example :

if ( x ) {
    if ( y )
        .. xy case
    else
        .. xNy case
} else {
    if ( y )
        .. Nxy case
    else
        .. NxNy case
}

obviously you could instead do the if(y) first then the if(x). Again the compiler can't do this for you and it can make a big difference.
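A tiny sketch of the two branch orders side by side, with made-up names. Both functions pick the same one of the four cases; which order is faster depends on how predictable x and y are in your data, so you still have to time both :

```cpp
#include <cassert>

enum Case { CASE_XY, CASE_XNY, CASE_NXY, CASE_NXNY };

// test x first, then y
Case DispatchXFirst(bool x, bool y)
{
    if (x)
        return y ? CASE_XY : CASE_XNY;
    else
        return y ? CASE_NXY : CASE_NXNY;
}

// same four cases, but test y first, then x
Case DispatchYFirst(bool x, bool y)
{
    if (y)
        return x ? CASE_XY : CASE_NXY;
    else
        return x ? CASE_XNY : CASE_NXNY;
}

// sanity check that the reorganization didn't change behavior
void CheckOrdersAgree()
{
    for (int x = 0; x < 2; x++)
        for (int y = 0; y < 2; y++)
            assert(DispatchXFirst(x != 0, y != 0) == DispatchYFirst(x != 0, y != 0));
}
```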

This kind of shit affects every platform of course, but on a nice decent humane architecture like a Core iX/x86 it's < 1%. On SPU it's 5-10% which is nothing to sneeze at and forces you to do this annoying shit.

9. One thing that's annoying in general about the SPU (and even to some extent the PPU) is that some code patterns can be *SO* slow that your non-inner-loop parts of the code dominate.

On the wonderful PC/x86 , you basically can just find your hot spots and optimize those little routines, and ignore all your glue and startup code, cuz it will be reasonably fast. Not true on the SPU. I had some setup code that built some acceleration tables that's only called once, and then an inner loop that's called two hundred thousand times. The setup code takes 10% of the total time. On the PC it doesn't even show up as a blip. (On the Xenon you can get this same sort of thing if your setup code happens to do signed-16-bit loads, variable shifts, or load-hit-stores). On the SPU my setup code that was so bad was doing lots of byte reads & writes, and lots of branches.

The point is on the PC/x86 you can find 10% of your code base that you have to worry about for speed, and the other 90% you just make work. On the SPU you have to be aware of speed on almost all of your code. Blurg that blows so bad.

10. More generally, I found it just crucial to have my SPU ELF reloader running all the time. Every time I touch the code in any way, I hit "build" and then look over and see my speed, because my test app is just running on the PS3 all the time reloading the SPU ELF and running the test. Any time you do the most innocuous thing - you move some code around, change some variable types - check the speed, because it may have made a surprisingly big difference. (*)

(* = there is a disadvantage to this style of optimization, because it does lead you into local minima; that is, I wind up being a greedy perturbation search in the code space; sometimes in order to find real minima you have to take a big step backwards).

11. For testing, be careful about what args you pass to system_init ; make sure you aren't sharing your SPUs with the management processes. If you see unreliable timing, something is wrong. (5,1) seems to work okay.

08-06-10 | Vibram

I fucking hate Vibram. I need to buy some damn hiking shoes (my hipster slipper-tennis-shoes are not ideal for backpacking), but it's basically impossible because vibram seems to have cornered the market. Having vibram under me feels like I'm walking on wet wood, or ice cubes, or something else unreasonably stiff and slippery.

I guess it's a durable material, but shoe soles (like car tires) have the property that durable and grippy are inherently opposite. I want the softest rubber you can get on my feet. In fact I would love to get just regular shoes that have that super grippy climbing shoe rubber that wears off in like a week. Fine, I'll replace them periodically, but at least in the mean time I will have soft sticky contact with the ground.

08-06-10 | The Casey Method

The "Casey Method" of dealing with boring coding tasks is to instead create a whole new code structure to do them in. eg. need to write some boring GUI code? Write a new system for making GUIs.

When I was young I would reinvent the wheel for everything stupidly. Then I went through a maturation phase where I cut back on that and tried to use existing solutions and be sensible. But now I'm starting to see that a certain judicious amount of "Casey Method" is a good idea for certain programmers.

For example, if I had to write a game level editor tool in C# the straightforward way, it would probably be the most efficient way, but that's irrelevant, because I would probably kill myself before I finished, or just quit the job, so the actual time to finish would be infinite. On the other hand, if I invented a system to write the tool for me, and it was fun and challenging, it might take longer in theory, and not be the most efficient/sensible/responsible/mature way to go, but it would keep me engaged enough that I would actually do it.

08-05-10 | P4 Shelf

The fact that I can't access the new "p4 shelve" from the old P4Win client makes it almost useless to me. P4V is ridiculously awful. It takes like 30 seconds to start up, I don't know WTF they're doing, but I know it's not okay.

Shelving with just one workspace as a way to save your work seems to work okay, but I've been using it to make temp checkins to go between my various computers, and it's not awesome for that. I guess what I really have to do is make real branches for that, but branches scare the bejeesus out of me.

08-04-10 | Initial Learnings of SPU

08-02-10 | Work

Work is a compulsion. I've been working way too much lately, it's hurting my back and my shoulder, making me depressed. There's no real pressure for me to do it, all the pressure comes from myself. For one thing I do feel like I need to get a lot of things done really quick. First I have to finish this optimization / cross-platform shit I'm doing, then I want to get my threading stuff cross-platform and tested better, then I need to get back to video and finish some things. I feel like I really need to finish this video stuff and I have to do it fast.

But more than that, work is like a mental tic that I sometimes indulge in. It's an autistic fugue. You go into this hole where all you can think about is technical issues, and it's horrible, but it also sort of feels good. Like taking a poo, or playing with a loose tooth. Then when I get into this state I just can't stop. I try to relax with N, but all I can think about is spu_shufb and should I be using _align_hint ? And does DMA invalidate ll-sc reservations? And I have to go back to work.

I imagine it's a bit like having OCD. It's not like the OCD guy really wants to count the lines in the wood grain. But if he walks into the room and doesn't count them, it just eats at his brain - "must count wood grain" - repeating over and over. It's not like working really makes me happy; after a day of solid work I don't feel good, in fact quite the opposite, my brain feels fried from fatigue, and my body is in great pain from sitting too much, but I can't resist it, and if I don't work I just keep thinking "work work work".

08-02-10 | Java will be faster than C

I think it's quite likely that in the next 10 years, Java and C# programs will be faster than C/C++ programs. The languages are simply better - cleaner, more well defined, more precise in their operation, and most importantly - much easier to parallelize. C/C++ is just too hard to make parallel for most tasks, it's not worth it for the average programmer. But Java/C# are very easy.

Certainly I think that the Appleophile bloggers who are enamored of "native code" are missing the big picture. It's a damn shame that so many simple utilitarian apps are being written for specific platforms, when we have pretty decent platform-independent languages.

Obviously speed parity has not been achieved yet, though a lot of it is 1. retarded Java programmers who are doing things like SetPixel() one by one instead of using higher level APIs, and 2. we're not actually on 8+ cores all the time yet.

08-02-10 | SPU Developing

Programming for the SPU might be really fun if it didn't take like 10 minutes to get the debugger loaded.

The whole MSVC editor-is-my-debugger thing is such a massive win that it really hurts to step back from it to the bad old days of separate debugger.

I did one thing that sort of helps - I run my test app in a loop, and when it finishes each test, I unload my SPU image, wait for a key press, then reload it and repeat. This lets me rebuild my SPU ELF and just reload it and repeat. This avoids a lot of test cycle time, but only works until you have a crash so it fails a lot.

The ability to just do stdin/stdout to/from a console from the PS3 is pretty awesome. It lets me write my test apps as if they were just command line apps and run them from my PC.

07-31-10 | GCC Scheduling Barrier

When implementing lock-free threading, you sometimes need a compiler scheduling barrier, which is weaker than a CPU instruction scheduling barrier, or a cache temporal ordering memory barrier.

There's a common belief that an empty volatile asm in GCC is a scheduling barrier :

 __asm__ volatile("")

but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec). What I believe it does do is it splits the "basic blocks" for compilation, but then after initial optimization there's another merge pass where basic blocks are combined and they can in fact then schedule against each other.
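For comparison, the form that GCC does document as a compiler-level memory barrier is the empty asm with a "memory" clobber, which is a different thing from the bare empty asm above. A minimal sketch, assuming GCC or Clang for the asm path and falling back to the C++11 std::atomic_signal_fence elsewhere; the COMPILER_BARRIER macro and the Publish / ReadPublished names are made up :

```cpp
#include <atomic>
#include <cassert>

#if defined(__GNUC__)
// empty asm that claims to touch memory : the compiler may not move
// loads or stores across it. It emits no CPU instruction at all.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
// standard compiler-only fence ; also generates no CPU fence
#define COMPILER_BARRIER() std::atomic_signal_fence(std::memory_order_seq_cst)
#endif

static int s_payload = 0;
static int s_ready = 0;

void Publish(int value)
{
    s_payload = value;
    COMPILER_BARRIER(); // keep the payload store above the flag store in the generated code
    s_ready = 1;
}

int ReadPublished()
{
    assert(s_ready == 1);
    COMPILER_BARRIER(); // keep the payload load below the flag check
    return s_payload;
}
```

This only constrains the compiler. On a weakly ordered CPU you still need a real memory barrier on top of it for cross-thread use.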

The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to give guarantees : gcc.gnu.org/bugzilla/show_bug.cgi?id=17884 or gcc-patches/2004-10/msg01048.html . In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile, but they claim that's a documentation error.

Now they also have the built-in "__sync" stuff . But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.

__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also do not document the semantics well. They say :

"In most cases, these builtins are considered a full barrier"

What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ? ( there also seem to be some bugs in some of the __sync implementations bugzilla/show_bug.cgi?id=36793 )

For example on PS3 __sync_val_compare_and_swap appears to be :

   .. loop ..

which means it is a full Seq_Cst operation like a lock xchg on x86. That would be cool if I actually knew it always was that on all platforms, but failing to clearly document the guaranteed memory semantics of the __sync operations makes them dangerous. (BTW as an aside, it looks like the GCC __sync intrinsics are generating isync for Acquire unlike Xenon which uses lwsync in that spot).

(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all, which pursuant to previous blog Mystery - Does the Cell PPU need Memory Control might actually be the correct thing).

I also stumbled on this thread where Linus rants about GCC .

I think the example someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually might be wrapping signed ints in your code right now and not know about it. Say I have some code like :

char x;
x -= y;
x += z;
What if x = -100 , y = 90 , and z = 130 ? You should have -100 - 90 + 130 = -60. But your intermediate was -190 which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and it will all do the right thing, but under gcc that is not guaranteed.

See details on signed-integer-overflow in GCC and notes about wrapv and wrapv vs strict-overflow

07-29-10 | Bothersome

I think I've been working too much and I'm stressed out and probably shouldn't be taking it out on the internet.

One of the reasons that I never talk to people is that they almost always bring the conversation down to a low level. It's one of my greatest frustrations. Of course it happens in politics all the time, I want to talk about something like how we could actually get better regulation of corporations; obviously just putting Glass-Steagall back in place would be good, but you have to look at the underlying reason why that went away - too much political influence of the big banks, too much importance put on GDP; but besides that you have to ask why the free market is not regulating itself better. Why are private funds investing in hedge funds that charge such high fees and don't actually beat the S&P 500? Maybe it's just ignorance, or maybe they're getting kickbacks. And why aren't shareholders making sure that executives do what's in the best interest of the company? I think this is one of the most important things, there's a big problem with corporate boards and the whole shareholder-election process that isn't being addressed; boards are full of cronies and are failing in their oversight. When's the last time you ever heard of a board shaking up a major company because it was being run badly? Never!

Anyway, I want to talk about something interesting, and instead I get, "but regulation is bad" , uhh , okay, that view is fine, but how about some actual content, if you want to disagree tell me something interesting. No, "mmm I don't like the idea of bigger government". Umm, okay, why not? how about some ideas about how it could be controlled in other ways. No. The conversation is brought down to a boring low level.

Or, even when you're talking to smart people, they will often go into this annoying pedantic correction mode; they don't want to let you be right about anything so they pick some little irrelevant detail to squabble about, like "urm you didn't actually mean GDP, you meant GNP, that's a common mistake". Umm, okay, maybe that's true, if it is true you could make your correction interesting by explaining yourself a bit, or you could just shut the fuck up because it's not germane to the point I'm trying to make and it just drags the conversation down into semantics.

I also really don't like talking to people about a topic that they obviously don't care enough to actually learn. When I find somebody who really knows something about a topic and can teach me, I am ecstatic, I want to pick their brain, first of all I want to get references and do the reading so I can get up on the background material because I don't want to waste their time going over stuff anybody could teach me. I enjoy teaching myself, but find that people who actually want to learn are very rare. Most people just want to rant about some topic they don't actually know anything about. I'll offer "well if you'd like to learn I can point you to.." oh no, I'll just rant without learning thank you. Okay, I'm done with this conversation. Or you get the people who think they're an expert and because of that don't want to listen to anything you have to say. I mean, I know perfectly well that I think I'm an expert on lots of things that I'm probably not, but I still want to learn. If you actually have new insight that I haven't figured out, that's fantastic, please give it to me.

In other stressful news, one of the neighbors just bought their child a drum kit.

Despite my past complaints about the god damn home improvers, it's actually a very quiet neighborhood. For one thing we aren't afflicted by that common Seattle blight of being near "musicians". (Seattle has perhaps the highest per-capita of amateur bands in the US, which sounds good in theory but is actually fucking terrible, because they practice). It's kind of amusing when you just go for walks around the neighborhood; in our neighborhood there's a guy who plays accordion about a block away who's quite good actually, and a guy who jams on electric guitar really loud about two blocks away. (though it doesn't beat where I lived in SF, where down the block from me some older guys would hang out in their garage and play really good jazz).

Giving a child drums should really be illegal. I mean, you could practice on those "Rock Band" style fake drums until you're decent; once you're decent the sound is not so bad, but there's something in particular about an instrument being played badly that is just excruciating and hard to tune out.

07-27-10 | 2d arrays

Some little things I often forget and have to search for.

Say you need to change a stack two dimensional array :

int array[7][3];

into a dynamically allocated one, and you don't want to change any code. Do you know how? It's :

int (*array) [3] = (int (*) [3]) malloc(sizeof(int)*3*7);

It's a little bit cleaner if you use a typedef :

typedef int array_t [3];

int array[7][3];

array_t array[7];

array_t * array = (array_t *) malloc(7*sizeof(array_t));

those are all equivalent (*).

You can take a two dimensional array as function arg in a few reasonable ways :

void func(int array[7][3]) { }

void func(int (*array)[3]) { }

void func(int array[][3]) { }

function arg arrays are always passed by address.

2-d arrays are indexed [row][col] , which means the first index takes big steps and the second index takes small steps in memory.

(* = if your compiler is some fucking rules nazi, they are microscopically not quite identical, because array[rows][cols] doesn't actually have to be rows*cols ints all in a row (though I'm not sure how this would ever actually not be the case in practice)).
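Putting those pieces together, here is a small self-contained sketch. The ROWS / COLS constants and the function names are made up; the same SumAll takes the stack array and the malloc'ed one without any change to the indexing code :

```cpp
#include <cassert>
#include <cstdlib>

enum { ROWS = 7, COLS = 3 };

// takes either a stack int[ROWS][COLS] or a malloc'ed int (*)[COLS]
int SumAll(int (*array)[COLS], int rows)
{
    int sum = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < COLS; c++)
            sum += array[r][c]; // [row][col] : the last index is the small step
    return sum;
}

int FillAndSumStack()
{
    int array[ROWS][COLS];
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            array[r][c] = r * COLS + c;
    return SumAll(array, ROWS);
}

int FillAndSumDynamic()
{
    int (*array)[COLS] = (int (*)[COLS]) std::malloc(sizeof(int) * COLS * ROWS);
    assert(array != nullptr);
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            array[r][c] = r * COLS + c;
    // the next row starts exactly COLS ints after the start of this one
    assert(&array[1][0] == &array[0][0] + COLS);
    int sum = SumAll(array, ROWS);
    std::free(array);
    return sum;
}
```

Both fill the array with 0..20, so both sums come out to 210.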

07-26-10 | Virtual Functions

Previous post on x86/PPC made me think about virtual functions.

First of all, let's be clear that any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not a moron and I'm doing the dynamic dispatch for a good reason, because I actually need to do different things on different objects. The issue is just how you implement it.

How C++ virtual functions are implemented on most compilers :

Why does this hurt ?

How can virtual calls be removed ?

07-26-10 | Jeebus

God damn my landlord is unreasonable. I hate all you people so fucking much. Just leave me alone please. I'm really sick of living in someone else's house. I want to be able to do whatever I want with my home.

I dunno, maybe I should just go ahead and buy a house up here. It's okay here I guess, though I don't know if I can stand the winters, or certain other drawbacks. Plus where I want to live if I'm single vs. married with kids is very different.

I know in my head that people who are successful in life are people who just choose a certain plan and commit to it as if they were sure, even though that's totally illogical and there's no reason to be sure of it. You have to act as if you are going to stick with something; every house you move into you should treat as if you are going to be there for life. People who hesitate or hedge tend to get nowhere.

07-26-10 | Code Issues

How do I make a .c/.cpp file that's optional? eg. if you don't build it into your project, then you just don't get the functionality in it, but if you do, then it magically turns itself on and gives you more goodies.

I'll give you a particular example to be concrete, though this is something I often want to do. In the RAD LZH stuff I have various compressors. One is a very complex optimal parser. I want to put that in a separate file. People should be able to just include rrLZH.cpp and it will build and run fine, but the optimal parser will not be available. If they build in rrLZHOptimal, it should automatically provide that option.

I know how to do this in C++. First rrLZH has a function pointer to the rrLZHOptimal which is statically initialized to NULL. The rrLZHOptimal has a CINIT class which registers itself and sets that function pointer to the actual implementation.
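A single-file sketch of that self-registration setup, with hypothetical names throughout (OptimalParseFunc, g_optimalParse, Compress and the fake "compressed sizes" are all made up). In a real build the two halves would be separate .cpp files and the pointer would be declared extern in a shared header :

```cpp
#include <cassert>
#include <cstddef>

// ---- the "rrLZH.cpp" half ----
typedef int (*OptimalParseFunc)(const unsigned char* data, size_t size);

// constant-initialized to null, so it is already valid before any constructor runs
OptimalParseFunc g_optimalParse = nullptr;

bool HasOptimalParser() { return g_optimalParse != nullptr; }

// returns a pretend compressed size ; falls back when the optional module is absent
int Compress(const unsigned char* data, size_t size, bool wantOptimal)
{
    assert(data != nullptr || size == 0);
    if (wantOptimal && g_optimalParse)
        return g_optimalParse(data, size);
    return (int)size; // stand-in for the normal parser
}

// ---- the "rrLZHOptimal.cpp" half ; leave it out of the build and nothing breaks ----
static int OptimalParseImpl(const unsigned char* /*data*/, size_t size)
{
    return (int)(size / 2); // stand-in for the real optimal parse
}

// CINIT-style object : its constructor runs during static init and registers the module
struct OptimalParserRegistrar
{
    OptimalParserRegistrar() { g_optimalParse = OptimalParseImpl; }
};
static OptimalParserRegistrar s_optimalParserRegistrar;
```

Note that nothing references s_optimalParserRegistrar by name, so a linker pulling objects out of a lib has no reason to keep it.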

This works just fine (it's a standard C++ self-registration paradigm), but it has a few problems in practice :

1. You can run into order-of-initialization issues if you aren't careful. (this is not a problem if you are a good C++ programmer and embrace proper best practices; in that case you will be initializing everything with singletons and so on).

2. It's not portable because of the damn linkers that don't recognize CINIT as a binding function call, so the module can get dropped if it's in a lib or whatever. (this is the main problem ; it would have been nice if C++0x had defined a way to mark CINIT constructions as being required in the link (not by default, but with a __noremove__ modifier or something)). There are various tricks to address this but I don't think any of them is very nice. (*)

In general I like this pattern a lot. The more portable version of this is to have an Install() function that you have to manually call. I *HATE* the Install function pattern. It causes a lot of friction to making a new project because you have to remember to call all the right Installs, and you get a lot of mysterious failures where some function just doesn't work and you have to go back and install the right thing, and you have to install them in the right order (C++ singleton installs mostly take care of order for you). etc.

(* : this is one of those issues that's very annoying to deal with as a library provider vs. an end-application developer. As an app developer you just decide "this is how we're doing it for our product" and you have a method that works for you. As a library developer you have to worry about people not wanting to use the method you have found that works, and how things might behave under various compilers and usage patterns. It sucks.)

ADDENDUM : the problem with the manually-calling Install() pattern is that it puts the selection of features in the wrong & redundant place. There is one place that I want to select my modules, and that is in my build - eg. which files I compile & link, not in the code. The problem with it being in the code is that I can't create shared & generic startup code that just works. I wind up having to duplicate startup code to every app, which is very very bad for maintainability. And of course you can't make a shared "Startup()" function because that would force you to link in every module you might want to use, which is the thing you want to avoid.

For the PS3 people : what would be the ideal way for me to expose bits of code that can be run on the SPU? I'm just not sure what people are actually using and how they would like things to be presented to them. eg. should I provide a PPU function that does all the SPU dispatching for you and do all my own SPU management? Is it better if I go through SPURS or some such? Or should I just provide code that builds for SPU and let you do your management?

I've been running into a problem with the MSVC compiler recently where it is incorrectly merging functions that aren't actually the same. The repro looks like this. In some header file I have a function sort of like this :

StupidFunction.h :

inline int StupidFunction()

Then in two different files I have :
A.cpp :

#define SOME_POUND_DEFINE  (0)
#include "StupidFunction.h"


and :
B.cpp :

#define SOME_POUND_DEFINE  (1)
#include "StupidFunction.h"


and what I get is that both printfs print the same thing (random whether it's 0 or 1 depending on build order).

If I put "static" on StupidFunction() it fixes this and does the right thing. I have no idea what the standard says about compilation units and inlines and merging and so on, so for all I know their behavior might be correct, but it's damn annoying. It appears that the exact definition of inline changed in C99, and in fact .cpp and .c have very different rules about inlines (apparently you can extern an inline which is pretty fucked up). (BTW the whole thing with C99 creating different rules that apply to .c vs .cpp is pretty annoying).

ADDENDUM : see comments + slacker.org advice about inline best practice (WTF, ridiculous) , and example of GCC inline rules insanity

In other random code news, I recently learned that the C preprocessor (CPP) is not what I thought.

I always thought of CPP as just a text substitution parser. Apparently that used to be the case (and still is the case for many compilers, such as Comeau and MSVC). But at some point some new standard was introduced that makes the CPP more tightly integrated with the C language. And of course those standards-nazis at GCC now support the standard.

The best link that summarizes it IMO is the GCC note on CPP Traditional Mode that describes the difference between the old and new GCC CPP behavior. Old CPP was just text-sub, New CPP is tied to C syntax, in particular it does tokenization and is allowed to pass that tokenization directly to the compiler (which does not need to retokenize).

I guess the point of this is to save some time in the compile, but IMO it's annoying. It means that abuse of the CPP for random text-sub tasks might not work anymore (that's why they have "traditional mode", to support that use). It also means you can't do some of the more creative string munging things in the CPP that I enjoy.

In particular, in every CPP except GCC, this works :

#define V(x) x
#define CAT(a,b)  V(a)V(b)

to concatenate two strings. Note that those strings can be *anything* , unlike the "##" operator which under GCC has very specific annoying behavior in that it must take a valid token on each side and produce a valid token as output (one and only one!).
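For contrast, here is a small sketch of the standard "##" route, which works everywhere but only when the pasted result is a single valid token. CAT_IMPL and PREFIX are names picked for the example.

```cpp
// Two-level paste so that macro arguments are expanded before ## is applied.
#define CAT_IMPL(a,b)  a##b
#define CAT(a,b)       CAT_IMPL(a,b)

#define PREFIX  my_

// PREFIX is expanded first, then pasted : this declares my_counter
int CAT(PREFIX,counter) = 3;

// "x" and "1" paste into the single identifier token x1
int CAT(x,1) = 7;
```

Something like CAT(+,-) is the kind of thing that breaks, because "+-" is not one token.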

In further "GCC strictness is annoying", it's fucking annoying that they enforce the rule that only ints can be constants. For example, lots of code bases have something like "offsetof" :

/* Offset of member MEMBER in a struct of type TYPE. */
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

well, that's illegal under GCC for no damn good reason at all. So you have to do :

/* Offset of member MEMBER in a struct of type TYPE. */
#ifndef __GNUC__
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#else
/* The cast to "char &" below avoids problems with user-defined
   "operator &", which can appear in a POD type.  */
#define offsetof(TYPE, MEMBER)                                  \
  (__offsetof__ (reinterpret_cast <size_t>                      \
                 (&reinterpret_cast <const volatile char &>     \
                  (static_cast<TYPE *> (0)->MEMBER))))
#endif /* C++ */

damn annoying. (code stolen from here ). The problem with this code under GCC is that a "type *" cannot be used in a constant expression.
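One way to sidestep the whole thing, sketched below, is to not define offsetof yourself and take the one from the standard header, which each compiler implements in whatever way it accepts as a constant expression. The Header struct is just an example type.

```cpp
#include <cstddef>  // offsetof, size_t

struct Header
{
    int   id;
    int   flags;
    char  name[16];
};

// offsetof from <cstddef> is an integral constant expression, so it can
// initialize constants and size arrays.
static const size_t kFlagsOffset = offsetof(Header, flags);
static const size_t kNameOffset  = offsetof(Header, name);
char g_scratch[ offsetof(Header, name) + 1 ];
```

That doesn't help for code bases that ship their own macro, but it does keep the compiler-specific part out of your code.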

A similar problem comes up in templates. On every other compiler, a const pointer can be used as a template value argument, because it's just the same as an int. Not on GCC! In fact because they actually implement the standard, there's a new standard for C++0x which is going to make NULL okay, but only NULL which is also annoying because there are places I would use arbitrary values. (see for example 1 or 2 ).

ADDENDUM : a concrete example where I need this is my in-place hash table template. It's something like :

template < typename t_key,t_key empty_val,t_key deleted_val >
class hash_table

that is, I hash keys of type t_key and I need a value for "empty" and "deleted" special keys for the client to set. This works great (and BTW is much faster than the STL style of hash_map for many usage patterns), but on GCC it doesn't work if t_key is "char *" because you can't template const pointer values. My work-around for GCC is to take those template args as ints and cast them to t_key type internally, but that fucking blows.

In general I like to use template args as a way to make the compiler generate different functions for various constant values. It's a much cleaner way than the #define / #include method that I used above in the static/inline problem example.
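A rough sketch of what that looks like for the StupidFunction case. The function body here is invented (the point is only that it depends on the constant), and CallFromA / CallFromB stand in for the two .cpp files.

```cpp
// The constant is a template argument, so each value is a distinct function
// with a distinct symbol; two instantiations can't be confused with each other
// the way two different inline functions that share one name can.
template <int t_value>
inline int StupidFunction()
{
    return t_value * 10 + 1;
}

int CallFromA() { return StupidFunction<0>(); } // what A.cpp would call
int CallFromB() { return StupidFunction<1>(); } // what B.cpp would call
```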

07-21-10 | x86

x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms. (addendum : I don't really mean x86 the ISA, I mean x86 as short hand for the family of modern processors that run x86; in particular P-Pro through Core i7).

It's not just that the chips are super fast and easy to use. It's that they actually encourage good software engineering. In order to make the in-order PPC chips run fast you have to abandon everything you've learned about good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals, every function has to be inline.

1. Complex addressing.

This is way more huge than I ever expected. There are two important subaspects here :

1.A. being able to do addition and simple muls in the addressing, eg. [eax + ecx*2]

1.B. being able to use memory locations directly in instructions instead of moving through registers.

These work together to make it so that on x86 you don't have to fuck around with loading shit out to temporaries. It makes working on variables in structs almost exactly the same speed as working on variables in a local.


  x = y;


  mov eax, ecx;


  x = s.array[i];


  mov eax, [eax + ecx*4 + 48h]

and those run at almost the same speed !

This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!

ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad, it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it. Instead on the PPC you have to worry about things like indexing byte arrays is faster than any other size, and if you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.
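To make the index-vs-pointer point concrete, here are the two loop shapes side by side. This is only a sketch (SumIndexed and SumPointer are made-up names); which one is actually faster on a given chip is something you have to measure.

```cpp
#include <cstddef>
#include <stdint.h>

// Index form : on x86 each load can be one instruction using [base + i*4]
// addressing, so you just write the obvious loop.
uint32_t SumIndexed( const uint32_t * data, size_t count )
{
    uint32_t sum = 0;
    for ( size_t i = 0; i < count; i++ )
        sum += data[i];
    return sum;
}

// Pointer form : the address is carried in the loop variable, so there is no
// separate integer computation feeding each load.
uint32_t SumPointer( const uint32_t * data, size_t count )
{
    uint32_t sum = 0;
    const uint32_t * end = data + count;
    for ( const uint32_t * p = data; p != end; ++p )
        sum += *p;
    return sum;
}
```

Both compute the same thing; the complaint above is that on the in-order PPC you have to care which one you wrote.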

2. Out of order execution.

Most people think of OOE as just making things faster and letting you be a bit lazy about code gen and stalls and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.

In particular, the entire method of referencing things through pointers is impossible in even moderately performant code without OOE.

The nasty thing that C++ (or any modern language really, Java ,C#, etc. are actually much much worse in this regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate, it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86 it actually doesn't hurt much. Why?

Typical C++ code may be something like :

.. stuff ..
.. stuff ..

this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it can get rearranged to :

..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
.. stuff ..

And as long as you have enough stuff to do in between it's no problem. Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).

ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing. But your code path is not static, and the caches have basically random delays because other threads might be using them too, so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling for these things. For example, to schedule as well as OOE can, you would have to do things like speculatively read ptr and *ptr even if it might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific) etc. Furthermore, the scheduling can only really compete when all your functions are inline; OOE sort of inlines your functions for you since it can schedule functions across the jump. etc. etc.

ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you when you do a switch() and also when you make virtual calls. It can only handle branch prediction for static branches, there's no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.

Final addendum :

Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone knows that and it's old news and boring.

What I have realized and what I'm trying to say is that these bad old chips are not only slow - much worse than that! They cause a regression in software engineering practice back to the bad old days when you have to worry about shit like whether you pre-increment or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. All the things we have made progress on in the last 20 years, since I started coding on Amigas and 286's where we had to worry about this shit, we moved into an enlightened age where algorithms were more important than micro bullshit, and now we have regressed.

At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about the PC/x86 performance at all, just 100% write your code for the PPC.

There are lots of little places where they differ - for example on x86 you should write code to make use of complex addressing, you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this might hurt a lot. Similarly there are quirks about how you organize your branches or what data types you use (PPC is very sensitive to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.

Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway.

Some other PPC notes :

1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS conflict detection). In particular, it can easily come up when you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.

2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables. For example, say you write a function like this :

int x = 0;

  for( ... one million .. )
    .. do lots of stuff using x ..
    x = blah;


the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right before the function returns :

//int x; // x = r7
r7 = 0;

  for( ... one million .. )
    .. do lots of stuff using r7 ..
    r7 = blah;

stack_x = r7

Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :

//int x; // x = r7
r7 = 0;
stack_x = r7;

  for( ... one million .. )
    .. do lots of stuff using r7 - stores to stack_x every time ! ..
    r7 = blah;
   stack_x = r7;


The conclusion is the same one I came to last time :

When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything you need as a function argument so you never have to load from globals or constants (even loading static constants seems to be compiled very badly, you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you never take the address of). Never call external functions.
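The shape I mean is roughly the following sketch : a dumb wrapper that reads the globals once, and a kernel that only ever sees arguments and locals. Settings, g_settings and the function names are invented for the example.

```cpp
#include <cstddef>

struct Settings { int scale; int bias; };
Settings g_settings = { 3, 1 }; // hypothetical global config

// Inner loop : everything arrives by value, all work is on locals whose
// address is never taken, and there are no calls out.
static int TransformSumKernel( const int * data, size_t count, int scale, int bias )
{
    int sum = 0;
    for ( size_t i = 0; i < count; i++ )
        sum += data[i] * scale + bias;
    return sum;
}

// Setup wrapper : reads the globals once, then hands plain values to the kernel.
int TransformSum( const int * data, size_t count )
{
    const int scale = g_settings.scale;
    const int bias  = g_settings.bias;
    return TransformSumKernel( data, count, scale, bias );
}
```

Whether a given compiler then actually keeps sum, scale and bias in registers is still up to the compiler, but this at least removes its excuses.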

07-18-10 | Mystery : Do Mutexes need More than Acquire/Release ?

What memory order constraints do Mutexes really need to enforce ?

This is a surprisingly unclear topic and I'm having trouble finding any good information on it. In particular there are a few specific questions :

1. Does either Mutex Lock or Unlock need to be Sequentially Consistent? (eg. a global sync/ordering point) (and followup : if they don't *need* to be Seq_Cst , is there a good argument for them to be Seq_Cst anyway?)

2. Does either Lock or Unlock need to keep memory accesses from moving IN , or only keep them from moving OUT ? (eg. can Lock just be Acquire and Unlock just be Release ?)

Okay, let's get into it a bit. BTW by "mutex" I mean "CriticalSection" or "Monitor". That is, something which serializes access to a shared variable.

In particular, it should be clear that instructions moving *OUT* is bad. The main point of the mutex is to do :

y = 1;


  load x
  x ++;
  store x;


y = 2;

and obviously the load should not move out the top nor should the store move out the bottom. This just means the Lock must be Acquire and the Unlock must be Release. However, the y=1 could move inside from the top, and the y=2 could move inside from the bottom, so in fact the y=1 assignment could be completely eliminated.

Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this question in a bit of detail, but it's fucking slides so it's hard to understand (god damn slides). Basically he argues that moving code into the Mutex (Java style) is fine, *except* if you allow a "try_lock" type function, which allows you to invert the mutex; with try_lock, then lock() must be a full barrier, but unlock() still doesn't need to be.

Joe Duffy mentions this subject but doesn't come to any conclusions. He does argue that it can be confusing if they are not full barriers . I think he's wrong about that and his example is terrible. You can always cook up very nasty examples if you touch shared variables inside mutexes and also outside mutexes. I would like to see an example where *well written code* behaves badly.

One argument for making them full barriers is that CriticalSection provides full barriers on Windows, so people are used to it, so it's good to give people what they are used to. Some coders may see "Lock" and think code neither moves in nor out. But on some platforms it does make the mutex much more expensive.

To be concrete, is this a good SpinLock ?

    while ( ! CAS( lock , 0 , 1 , memory_order_seq_cst ) ) { /* spin */ }


    StoreRelease( lock , 0 );

    // AtomicExchange( lock, 0 , memory_order_seq_cst );
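For reference, the fully minimal version of that lock written with C++0x-style std::atomic would look something like the sketch below : Acquire only on the way in, Release only on the way out. This is an illustration of the variant in question, not a recommendation, and the SpinLock type is made up for the example.

```cpp
#include <atomic>

// Minimal-ordering spin lock : Lock is Acquire only, Unlock is Release only.
struct SpinLock
{
    std::atomic<int> state;

    SpinLock() : state(0) { }

    void Lock()
    {
        int expected = 0;
        // Acquire ordering applies when the CAS succeeds. On failure the CAS
        // writes the current value into expected, so reset it and retry.
        while ( ! state.compare_exchange_weak( expected, 1,
                    std::memory_order_acquire, std::memory_order_relaxed ) )
        {
            expected = 0;
        }
    }

    void Unlock()
    {
        // Release : accesses inside the lock can't move below this store.
        state.store( 0, std::memory_order_release );
    }
};
```

With this version, code before Lock() and after Unlock() is free to move into the locked region, which is exactly the behavior being asked about.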

One issue that Joe mentions is the issue of fairness and notifying other processors. If you use the non-fencing Unlock, then you aren't immediately giving other spinning cores a chance to grab your lock; you sort of bias towards yourself getting the lock again if you are in high contention. IMO this is a very nasty complex issue and is a good reason not to roll your own mutexes; the OS has complex mechanisms to prevent live locks and starvation and all that shit.

For more concreteness - Viva64 has a nice analysis of Dmitriy V'jukov's implementation of the Peterson Lock . This is a specific implementation of a lock which does not have *any* sequence point; the Lock() is Acquire_Release ordered (so loads inside can't move up and stores before it can't move in) and Unlock is only Release ordered.

The question is - would using a minimally-ordering Lock implementation such as Dmitriy's cause problems of any kind? Obviously Dmitriy's lock is correct in the sense of providing mutual exclusion and data race freedom, so the issue is not that; it's more a question of whether it causes practical programming problems or severely unexpected behavior. What about interaction with file IO or other non-simple-memory-access resources? Is there a good reason not to use such a minimally-ordering lock?

07-18-10 | Mystery : Does the Cell PPU need Memory Control ?

Is memory ordering needed on the PPU at all ?

I'm having trouble finding any information about this, but I did notice a funny thing in Mike Acton's Multithreading Optimization Basics :


// Pentium
#define  AtomicStoreFence() __asm { sfence }
#define  AtomicLoadFence()  __asm { lfence }

// PowerPC
#define  AtomicStoreFence() __asm { lwsync }
#define  AtomicLoadFence()  __asm { lwsync }

// But on the PPU
#define  AtomicStoreFence() 
#define  AtomicLoadFence()

Now, first of all, I should note that his Pentium defines are wrong. So that doesn't inspire a lot of confidence, but Mike is more of a Cell expert than an x86 expert. (I've noted before that thinking sfence/lfence are needed on x86 is a common mistake; this is just another example of the fact that "You should not try this at home!" ; even top experts get the details wrong; it's pretty sick how many random kids on gamedev.net are rolling their own lock-free queues and such these days; just say no to lock-free drugs mmmkay). (recall sfence and lfence only have any effect on non-temporal memory such as write-combined or SSE; normal x86 doesn't need them at all).

Anyway, the issue on the PPU is you have two hardware threads, but only one core, and more importantly, only one cache (and only one store queue (I think)). The instructions are in order, all of the out-of-orderness of these PowerPC chips comes from the cache, so since they are on the same cache, maybe there is no out-of-orderness ? Does that mean that memory accesses act sequential on the PPU ?

Hmmm I'm not confident about this and need more information. The nice thing about Cell being open is there is tons of information about it from IBM but it's just a mess and very hard to find what you want.

Of note - thread switches on the Cell PPU are pretty catastrophically slow, so doing a lot of micro-threading doesn't really make much sense on that platform anyway.

ADDENDUM : BTW I should note that even if the architecture doesn't require memory ordering (such as on x86), doing this :

#define  AtomicStoreFence() 
#define  AtomicLoadFence()

is a bad idea, because the compiler can still reorder things on you. Better to do :

#define  AtomicStoreFence()  _CompilerWriteBarrier() 
#define  AtomicLoadFence()   _CompilerReadBarrier()
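If you don't have compiler-specific barrier intrinsics handy, one portable way to sketch the same idea is std::atomic_signal_fence, which constrains compiler reordering without issuing any hardware fence instruction. The wrapper names below are my own and the Publish function is just an example of where the barrier goes; treat it as a sketch, not a drop-in for the macros above.

```cpp
#include <atomic>

// Compiler-only barriers : atomic_signal_fence restricts what the compiler may
// reorder, but emits no hardware fence instruction.
inline void CompilerWriteBarrier() { std::atomic_signal_fence( std::memory_order_release ); }
inline void CompilerReadBarrier()  { std::atomic_signal_fence( std::memory_order_acquire ); }

#define AtomicStoreFence()  CompilerWriteBarrier()
#define AtomicLoadFence()   CompilerReadBarrier()

int          g_payload = 0;
volatile int g_ready   = 0;

void Publish( int value )
{
    g_payload = value;
    AtomicStoreFence(); // keep the compiler from sinking the payload store below the flag store
    g_ready = 1;
}
```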

07-18-10 | Mystery : Why no isync for Acquire on Xenon ?

The POWER architecture docs say that to implement Acquire memory constraint, you should use "isync". The Xbox 360 claims they use "lwsync" to enforce Acquire memory constraint. Which is right? See :

Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming

Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :

First of all review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions happen in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value, and stores can be delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control ordering you have :

lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).

lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you put an lwsync after every instruction you would have a nice x86-like semantics.

In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me to make a sync point with all other cores.

sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.

Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.

isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions to finish their loads before any future instructions start (it flushes pipelines).

isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the article linked above they say :

"PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes."

All that "Acquire" memory semantics needs to enforce is #LoadLoad. So lwsync certainly does give you acquire because it has a #LoadLoad, but it also does a lot more that you don't need.
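For anyone who wants the source-level picture of what Acquire is for, here is a small sketch using C++0x-style atomics. Which instruction the compiler emits for the acquire load on a given PowerPC (lwsync, or a branch plus isync) is exactly the open question above, and this code doesn't settle it. The names are invented for the example.

```cpp
#include <atomic>

int              g_data = 0;
std::atomic<int> g_flag( 0 );

void Produce( int value )
{
    g_data = value;
    // Release : the write to g_data can't move below this store.
    g_flag.store( 1, std::memory_order_release );
}

bool TryConsume( int * out )
{
    // Acquire : the read of g_data below can't move above this load.
    if ( g_flag.load( std::memory_order_acquire ) == 0 )
        return false;
    *out = g_data;
    return true;
}
```

If TryConsume sees the flag set, it is guaranteed to see the value that Produce wrote to g_data before setting it.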

ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered.

07-17-10 | Broken Games

Soccer is broken. There's too little scoring and it's too easy to play a very defensive style. The issue with low scores is not just lack of excitement (that aspect is debatable), the big problem is that in a game that's often 1:0 or 0:0 , it greatly increases the importance of bad calls and flukes. If the scores were more like 7-5 , then 1 slip up wouldn't matter so much. Statistically, the "best" team loses in soccer more often than any other major sport.

Anyway, it seems like it would be very easy to fix. You just do something to force an attacking style. One random idea I had is you could require that 3 forwards always stay on the opponent's side of midfield. This prevents you from drawing back everyone for defense, and means that the attackers can get a numbers advantage whenever they want to take that risk.

No Limit Hold'em is broken. It's far too profitable and easy to play a very tight style. The fix is very easy - antes. But almost nobody does it outside of the biggest games. (an extra blind would also work well).

Baseball is broken. Not in a game rule system way, I think it actually functions pretty well in that sense. Rather it is broken as a spectator sport because it is just too slow and drawn out. The fix for this is also very easy - put time limits on pitchers and batters. None of this fucking knocking the dirt off your shoes, throwing to first, then asking for more time, oh my god.

Basketball is broken. I wrote about that before so I won't repeat myself.

Rugby is broken. In a lot of ways actually. The rules of the scrum are very hard to enforce, so you constantly get collapsed scrums and balls not put in straight and so on, very messy, not fun to play or watch. The best part of the game is the free running, but it's very easy to win without a good running game at all, just by playing well in the set pieces and kicking, which is really ugly rugby. I don't have any great ideas on how to fix it, but it's definitely broken. Sevens is actually a much better game for the most part.

I guess it's pretty hard to change these things because they are established and have history and fans and so on, and any time you make a change a bunch of morons complain that you're mucking things up, but a bit of tweakage could seriously improve most sports.


Tennis is broken. Power and serving are rewarded too much over control, which makes matches boring. The French Open is generally the best tennis of the year to watch because it's slower. This could be easily fixed by limiting racket technology to 1980 levels or something.

Auto racing is horribly broken. I think F1 is hopeless and boring so I won't talk about that. The Le Mans / GT series are almost interesting, but the stupid rules just make it incomprehensible who has an advantage each year. Some manufacturer can happen to have a car that fits well with the current rule set, and then they dominate for a few years. In many of the series, the cars are so modified that they hardly share any parts with their street origins at all. Like currently the BMW M3's are struggling in the ALMS but winning in the ELMS because of tiny differences in the arcane rules (something about suspension and aero that's allowed).

I think the solution is very easy : let manufacturers bring anything they want, but it has to be available for the public to buy at some fixed price. So rather than all these classes that have all these rules (no 4WD for example in the Le Mans series, and minimum weights and so on) - get rid of the rules and have a $100k series and a $150k series. Let manufacturers make the best car they can make for that price, and if they want to take a loss and bring a car that's got more value than that, they can, as long as the public can buy it. This would really let us see what a $150k M3 can do vs a $150k R8.

07-16-10 | Content

Where the fuck is all this content that we are supposed to have in this age of the internet and vast media ?

News sites covering sports events should have a "spoiler free" mode. They should let you view the information in chronological order (past to present), and let you block how far ahead you want to see. eg. say I have game 3 of the NBA finals on tape, and before I watch it I want to catch up on what happened in game 2, I should be able to mark "don't show me game 3" and go read the news. I'm hitting that particular problem right now with the Tour de F.

Why is there no fucking decent blog of someone telling me news about the Tour? Yes, I know there are plenty of news sites like velonews or cyclingnews or whatever, that's not what I want. I also don't want a "tour diary" from a rider. I want a smart, funny, 3rd party who is following everything and can write about what happens and also some editorial info about the secret dramas. Where is my content?

For ages I've wanted a blog I could follow that was just a well-curated extraction of amusement. I like to see a funny photo or some hot chicks or whatever trashy internet amusement there is, but I don't want to have to slog through the mass of crap that you're bathed in when you go to the massive aggregator sites like milkandcookies or daily* or whatever. Like just one little high quality nugget once a day, why the fuck do I not have that?

The other thing I've wanted forever is a science news site that's targeted at science degree graduates, but not specialists in that exact topic. There's a big gap between popular reporting, which is just woefully low-level, often just wrong, or completely inane (like reporting crackpot fringe loonies as if they are real science), and the full-on rigor and impenetrability of actual research papers. There could be a middle ground, intended for intelligent scientific people, written by people who actually understand the full depth of what they're writing about. The only place I know of to get that is in college magazines; for example the Caltech Engineering&Science magazine that I occasionally get is actually a pretty good source for that depth of material.

In other news, the opening of the Montlake bridge almost every day of the summer so that a few fuckers can get their over-height sailboats through is a really ridiculous slap in the face of any kind of civic sense. I've been on a sailboat and gone from Lake Union to Lake Washington, and it is a delight, but you can get through just fine on a moderate size boat without raising the bridge. You have to almost intentionally get a really tall mast just so you can fuck up the lives of thousands of people when the bridge raising causes traffic to back up onto the 520 and leads to a massive traffic jam. It's really appalling.

07-13-10 | Tech Blurg

How do I atomically store or load 128 bit data on x64 ?

One option is just to use cmpxchg16b to do loads and stores. That's atomic, but seems a bit excessive. I dunno, maybe it's fine. For loads that's simple enough, you just do a cmpxchg16b with 0 and it gives you the value that was there. For stores it's a bit uglier because you have to do a loop and do at least two cmps (one to load, then one to store, which will only succeed if nobody else stored since the load).
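Here is the shape of the CAS-as-load and CAS-loop-as-store trick. To keep the sketch portable I've written it on a 64-bit std::atomic rather than a real 128-bit value; that substitution is mine, and the 128-bit version would need the compiler's 16-byte compare-exchange support. LoadViaCas and StoreViaCas are made-up names.

```cpp
#include <atomic>
#include <stdint.h>

// Load by CAS : compare against 0 and "store" 0. If the value was 0 nothing
// changes; otherwise the CAS fails and hands back the current value.
uint64_t LoadViaCas( std::atomic<uint64_t> & target )
{
    uint64_t expected = 0;
    target.compare_exchange_strong( expected, 0 );
    return expected;
}

// Store by CAS : one CAS to read the current value, then loop until a CAS
// from that value to the new value succeeds.
void StoreViaCas( std::atomic<uint64_t> & target, uint64_t value )
{
    uint64_t expected = LoadViaCas( target );
    while ( ! target.compare_exchange_weak( expected, value ) )
    {
        // expected was refreshed with the current value; just retry
    }
}
```

Note the load is not read-only : the CAS writes when the value happens to be 0, which is part of why this feels excessive for a plain load.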

The other option is to use the SSE 128 bit load/store. I *believe* that it is atomic (assuming no cache lines are straddled), however it is important to note that SSE memory ops on x86/x64 are weakly ordered, unlike normal memory ops which are all strongly ordered (every x86 load is #LoadLoad and every store is #StoreStore). So, to make strongly ordered 128 bit load/store from the SSE load store you have to do something like

load :
    sse load 128
    lfence

store :
    sfence
    sse store 128

or such. I'm not completely sure that's right though and I'm having trouble finding any information on this. What I need is load_acquire_128 and store_release_128. (yes I know MSVC has intrinsics for LoadAcq_128 and StoreRel_128, but those are only for Itanium). (BTW a lot of people mistakenly think they need to use lfence or sfence with normal code; no no, those are only for SSE and write combined memory).

(ADDENDUM : yes, I think this is correct; movdqa (a = aligned) appears to be the correct atomic way to load/store 128 bits on x86; I'm a little worried that getting the weaker SSE memory model involved will break some of the assumptions about the x86 behavior of access ordering).

In other news, the random differences between GCC and MSVC are fucking annoying. Basically it's the GCC guys being annoying cocks; you know MS is not going to copy your syntax, but you could copy theirs. If you would get your heads out of your asses and stop trying to be holier than Redmond, you would realize it's better for the world if you provide compatible declarations. Shit like making me do __attribute__((always_inline)) instead of just supporting __forceinline is just annoying and pointless. Also, you all need to fix up your damn stdlib to be more like MS. Extensions like vsnprintf should be named _vsnprintf (MS style, correct) (* okay maybe not).

You also can't just use #defines to map the MSVC stuff to GCC, because often the qualifiers have to go in different places, so it's a real mess. BTW not having pragma warning disable is pretty horrendous. And no, putting it on the command line is nowhere near equivalent, you want to be able to turn them on and off for specific bits of code where you know the warning is bogus or innocuous.
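For the one simple case of force-inline, a #define does get you most of the way, as long as you always put it in a spot both compilers accept (before the return type). A minimal sketch; MY_FORCEINLINE and add_one are made-up names, and this says nothing about the other qualifiers where placement really does differ :

```cpp
// Hypothetical portability macro (MY_FORCEINLINE is not from any library).
#if defined(_MSC_VER)
  #define MY_FORCEINLINE __forceinline
#elif defined(__GNUC__)
  // GCC wants the attribute in addition to the plain inline keyword.
  #define MY_FORCEINLINE inline __attribute__((always_inline))
#else
  // Unknown compiler : fall back to a plain inline hint.
  #define MY_FORCEINLINE inline
#endif

// The macro goes before the return type, which both compilers accept.
static MY_FORCEINLINE int add_one(int x)
{
    return x + 1;
}
```

Calling add_one(41) gives 42 on either compiler; only the inlining hint changes.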

The other random problem I have is the printf format for 64 bit int (I64d) appears to be MS only. God damn it.
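One way around the format problem, assuming your compiler ships the C99 format macros in <cinttypes> (older MSVC versions may not, in which case you need your own define there), is to build the format string from PRId64. A small sketch; format_i64 is a made-up helper name :

```cpp
#include <cinttypes>  // PRId64
#include <cstdint>    // std::int64_t
#include <cstdio>     // std::snprintf
#include <string>

// Formats a 64 bit int without hard-coding "%I64d" or "%lld".
// PRId64 expands to whatever length modifier this platform needs.
std::string format_i64(std::int64_t v)
{
    char buf[32]; // enough for the sign, 19 digits and the terminator
    std::snprintf(buf, sizeof(buf), "%" PRId64, v);
    return std::string(buf);
}
```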

07-12-10 | Corporate Inequality

One of the things that bothers me is that all the corporations I deal with basically have free rein to fuck me, and I have nothing I can do back. If I ever do anything wrong, they charge me fees, and they can fuck up severely and I get nothing.

I deposited a check at Wells Fargo once that was pretty old; I had misplaced it and just found it and went and deposited it. They charged me a $25 fee for depositing a check that was too old.

Recently I had some money taken from my First Mutual account through a fraudulent ACH. They of course reimbursed it - but WTF that is not enough. They allowed someone to take money out of my account when the person didn't even sign my name. The name signed is "Lindsey Meadows" or something and First Mutual just let it right on through. I should get to charge them $25 for their gross incompetence.

Anytime anyone bills you wrong, you get to wait for 30 minutes on the phone. If you are *lucky* they will fix the bill. What about fucking compensating me you cocks? But if I make a mistake and send in my monthly payment with a slightly wrong number on the check I get a fee.

I sent a bike by UPS and they checked it out and told me the rate was $65. Two weeks later I get a notice from them that they measured it during shipping and decided it was oversize and they were charging me an extra $70.

Just recently, UPS has completely fucked up delivery of two packages; one I sent they bounced back to me even though the address was completely correct; I had to talk to them on the phone and send it again. They gave me no reimbursement at all (they didn't charge me for the second shipping, but didn't reimburse the first), and of course no compensation for the trouble or the delay. Recently they completely lost a package that was shipped to me; again it was insured and everything so I'll get it fine, but I should be able to drop a $70 charge on them for failed delivery. Oh you screwed up delivery? Okay, that's a $50 fee you owe me. Oh you don't remember agreeing to that? It's on page 97 of the contract you had to sign to accept my package.

My garbage pickup company will occasionally just drop a $10 oversize garbage pickup fee on me. I never put out anything outside the bin, I have no idea what the fuck they are charging me for, maybe some neighbor sneaks shit in or it's just a mistake, but the fact that they can just tack on fees at will without my agreement is the problem.

Of course I signed away permission for them to do that somewhere in the contract. But that is no fucking excuse. Retarded anti-humanists will say "it's your own fault, you had the opportunity to read the contract and you chose to sign it". What? First of all, you can't be serious that I'm supposed to study every fucking contract that's put in my face. Second of all, if I actually didn't sign the contracts that I didn't agree with, I couldn't live anywhere since all the rental agreements are absurd, I couldn't have a phone, a credit card, a bank, utility service, I mean are you fucking stupid? Of course I have to deal with these companies, I have no choice but to sign abusive contracts.

I fucking hate our lack of freedom and independence.

There is no free market solution to these problems. For one thing, there is basically no significant competition in almost every service. Even in sectors that have apparent competition, like say car insurance or banking, sure there are various people to choose from, but in fact they are almost all identical. They all run their business the same way, and none of them is actually good to their consumers.

The only solution is strong government regulation. In particular there are two very simple consumer protection laws that I would like to see :

1. Elimination of non-voluntary fees. All charges to a consumer must be explicitly authorized (and no they cannot be preauthorized on contingencies).

This is extremely powerful, simple, and would make a huge difference. For example it solves bank overdraft abuse. When you try to make an overdraft, the bank has to contact you (by email and cell) and confirm that you want to accept the overdraft (and the $25 fee). If you say no, the charge just bounces (and there is no fee to you). Obviously the same thing would apply to cell phone abuse through roaming or overage minutes or whatever.

Now you the consumer might want to preauthorize $50 a month in fees to give you some wiggle room, but you could choose $0 preauthorized fees if you like.

2. Make user agreements illegal. This is a little trickier because I don't think you can quite ban them completely, so you have to say something about them being "minimal" and "transparent". Maybe you could require that the average person should be able to fully read and understand it within 60 seconds.

Agreements that protect the service provider from lawsuits or that specify settlement through arbitration should simply be illegal.

But this line of thinking is all irrelevant because nothing that significantly reduces corporations' power to fuck us over will ever be done.

07-12-10 | My Nifty P4

What's new in MyNifty :

    no stall for down networks
    catch lots more cases that need to do checkouts (especially for the vcproj)
    don't check files for being read-only before trying p4 edit (lets you fix mistakes)
    don't check files for being in project before doing p4 edit (lets you edit source controlled files not in your project)

The result should be that using VC with MyNifty, you basically don't even know that your files are in source control - everything should just autocheckout at the right time.

Go get the original NiftyPlugins at Code.Google and do the install.

Then download MyNifty.zip

Extract MyNifty.zip into the dir where the NiftyP4 DLL is installed. It should be something like :

    C:\Documents and Settings\charlesb\Application Data\Microsoft\MSEnvShared\Addins\

(you may want to save copies of the originals).

Run VC.

Tools -> Addin Manager
NiftyPerforce : enable & startup

Tools -> Options -> Environment -> Documents
"Allow Editing of Read only files " = yes !

Output pane should have a "NiftyPerforce" option; you should see a message there like "NiftyPerforce ( Info): MyNiftyP4! RELEASE"

Nifty should be set to : useSystemEnv = True, autoCheckoutOnEdit = false, autoCheckout on everything else = true. I recommend autoAdd and autoDelete = false.

Make sure your P4 ENV settings are done, or use P4CONFIG files.

07-10-10 | Clipless pedals

I think clipless pedals are fucking terrible. Yes, they are slightly more efficient, and it does feel kind of nice to be locked in, but for the average amateur cyclist, they are a big problem and way too many people use them.

First of all, the efficiency difference vs. Powergrips or toe clip-strap pedals is pretty small. Those also lock your foot in pretty well and let you spin. When people say "clipless pedals are a huge gain" they are comparing to platform pedals which is fucking ridiculous, you need to compare vs. toe clips. The nice thing about toe clips is you can leave them loose around town, and then when you get out on a long course, you just reach down and pull the strap tight and then you are locked in nice and neat.

The first problem with clipless pedals is ergonomic. Yes, they can be adjusted just right so that you have good geometry and they won't cause any pain - but that is a big pain in the ass, and the average amateur doesn't have the perfect adjustment. The result is knee and hip pain. The extra freedom of strap pedals lets you get a more comfortable position and avoid the pain.

The biggest problem with clipless pedals is that it turns the amateur road rider into a real dick-head. They aren't comfortable with clipping in and out, so they go to great measures to avoid it. They'll hang on to posts at red lights, run lights and stop signs, won't wait in a line of other cyclists, etc. They create a real hazard because they can't get their feet out easily.

Of all the dick cyclists on the road, the yuppie amateur road racer has to be the worst. They're the ones who are all wobbly and don't stop for pedestrians. They ride in big groups and don't get out of the way for cars. They run stop signs and act like they aren't doing anything wrong. They'll often ride way out in the road for no good reason and not get over to let cars by. Of course cyclists should take the lane when they need to for safety reasons, but that is not the case for these turds.

It really makes me sad when I see some out of shape people who are obviously trying to get into cycling, and the fucking shop has set them up with some harsh aluminum frame, with a way too aggressive forward position, and clipless pedals; they're obviously very uncomfortable on their bike, and also out of control.

07-10-10 | PowerPC Suxors

I finally have done my first hard core optimization for PowerPC and discovered a lot of weird quirks, so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).

X. The code gen is just not very good. I'm spoiled by MSVC on the PC, not only is the code gen for the PC quite good, but any mistakes that it makes are magically hidden by out of order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the best possible register assignments, those just get hidden and swallowed up by out-of-order when you have a branch or memory load to hide them.

In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.

Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own function.

X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because for larger objects it has to do an actual multiply or shift to compute the address. So for example if you have a struct of several byte members, you should use SOA (several byte arrays) instead of AOS (one array of large struct).
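To make the SOA vs AOS point concrete, here's a rough sketch of the two layouts. All the names (ParticleAOS, ParticlesSOA, the members) are invented for illustration; the only point is that in the SOA version the index is already the byte offset, so no multiply or shift is needed to form the address :

```cpp
#include <cstddef>
#include <cstdint>

// AOS : one array of a larger struct.
// Addressing p[i].flags needs i * sizeof(ParticleAOS) plus a member offset.
struct ParticleAOS
{
    std::uint8_t kind;
    std::uint8_t state;
    std::uint8_t flags;
};

// SOA : each member gets its own byte array, so flags[i] is just base + i.
struct ParticlesSOA
{
    std::uint8_t * kind;
    std::uint8_t * state;
    std::uint8_t * flags;
};

unsigned sum_flags_aos(const ParticleAOS * p, std::size_t count)
{
    unsigned sum = 0;
    for (std::size_t i = 0; i < count; i++)
        sum += p[i].flags;
    return sum;
}

unsigned sum_flags_soa(const ParticlesSOA & p, std::size_t count)
{
    unsigned sum = 0;
    for (std::size_t i = 0; i < count; i++)
        sum += p.flags[i];
    return sum;
}
```

Both functions compute the same thing; which one is actually faster on a given compiler is something you have to check in the generated assembly.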

X. Inline ASM kills optimization. You'd think that with the code gen being annoying and flaky you could win by doing some manual inline ASM, but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no optimization mode, the same way it did on the PC many years ago before that got fixed.

X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this). On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.
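A rough sketch of what "do it by hand" looks like for a linear scan : work in blocks, and kick off the fetch of the next block before chewing on the current one. MY_PREFETCH is a made-up wrapper; here it maps to GCC's __builtin_prefetch, where on the consoles you'd use the platform's own prefetch intrinsic. The block size of 32 ints assumes 4 byte ints and a 128 byte cache line, so treat it as a tuning knob :

```cpp
#include <cstddef>

// Hypothetical wrapper around the compiler's prefetch hint.
#if defined(__GNUC__)
  #define MY_PREFETCH(ptr) __builtin_prefetch(ptr)
#else
  #define MY_PREFETCH(ptr) ((void)(ptr)) // no-op where no hint is available
#endif

// Sums an array one block at a time, requesting the next block
// before working on the current one so the load has time to land.
long sum_with_prefetch(const int * data, std::size_t count)
{
    const std::size_t kBlock = 32; // assumed : 32 ints = one 128 byte line
    long sum = 0;
    std::size_t i = 0;
    while (i < count)
    {
        const std::size_t end = (count - i > kBlock) ? (i + kBlock) : count;
        if (end < count)
            MY_PREFETCH(&data[end]); // address is known early, data needed later
        for (; i < end; i++)
            sum += data[i];
    }
    return sum;
}
```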

X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz) is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs. (I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).

PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.

-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some kind of scope, like :

__speed_critical {
  .. code ..
}

and then it should warn about microcode and load hit stores and whatever else within that scope.

X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really wants to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course "register" keyword does nothing). This is really fucking annoying. It's also an aspect of #1 - whether or not it gets registered depends on the phase of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals, the compiler really wants to generate a load from the static base mem, it won't cache that.

Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy. The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't eliminate the copy) then back into a local, which then gets registered. Ugh.
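The volatile round trip looks something like this. The names (s_count, g_launder, sum_to_count) are invented for the sketch, and whether the local really ends up in a register still depends on the compiler, so check the assembly :

```cpp
// Hypothetical names throughout.
static int s_count = 1000;      // the static the compiler keeps reloading
static volatile int g_launder;  // scratch global, exists only to defeat copy elimination

int sum_to_count()
{
    g_launder = s_count;         // copy out through the volatile ...
    const int count = g_launder; // ... and back into a local; neither access can be removed

    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += i;
    return sum;
}
```

The cost is one extra store and load outside the loop, which is cheap next to a reload on every iteration.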

You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables, they wind up doing extra loads and adds to get the values out of static or stack mem and generate the offsets and so on.

X. Standard loop special casing. On Xenon they seem to special case the standard

for(int i=0;i < count;i++) { }

kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :

for(int i=0;;)
{
    if ( i == count ) break;
    .. code ..
    i++;
}

that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :

for(int i=count;i--;) { }

appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it wants.

X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make this go away.

X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at figuring out that some combined operation can be turned into one instruction. eg. something like this :

x = ( y >> 4 ) & 0xFF;

will get turned into one instruction. Obviously this only works for constant shifts.

X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually. The clean solution to this is to write your own SELECT, that's like :

#define SELECT(condition,val_if_true,val_if_false)  ( (condition) ? (val_if_true) : (val_if_false) )

and replace it with Mike's bit masky version for PPC.
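I don't have Mike's version to paste here, so this is just the generic mask trick, for the specific case of a signed less-than. The function names are made up, and it assumes that a - b doesn't overflow and that >> on a negative int is an arithmetic shift (true on the compilers I care about, but implementation-defined in older C++ standards) :

```cpp
#include <cstdint>

// Returns val_if_true when mask is all ones, val_if_false when mask is zero.
static inline std::int32_t select_mask(std::int32_t mask,
                                       std::int32_t val_if_true,
                                       std::int32_t val_if_false)
{
    return (val_if_true & mask) | (val_if_false & ~mask);
}

// All ones if a < b, else zero.
// Assumes no overflow in a - b and an arithmetic right shift.
static inline std::int32_t mask_less(std::int32_t a, std::int32_t b)
{
    return (a - b) >> 31;
}

// Branch-free equivalent of ( a < b ) ? x : y
static inline std::int32_t select_less(std::int32_t a, std::int32_t b,
                                       std::int32_t x, std::int32_t y)
{
    return select_mask(mask_less(a, b), x, y);
}
```

The nice thing about hiding it behind SELECT is that the PC build can keep the plain ? : and only the PPC build swaps in the masked form.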

07-09-10 | Backspace

#define _WIN32_WINNT 0x0501 
#include <windows.h>
#include <psapi.h>

// tiny string compare so we don't need the CRT
static bool strsame(const char * s1,const char * s2)
{
    for(int i=0;;i++)
    {
        if ( s1[i] != s2[i] )
            return false;
        if ( s1[i] == 0 )
            return true;
    }
}

// the compiler emits calls to memset and memcpy, so supply them when building without the CRT

// __declspec(dllimport) void RtlFillMemory( void *, size_t count, char );
extern "C" 
void * __cdecl memset(void *buf,int val,size_t count)
{
    char * p = (char *) buf;
    for(size_t i=0;i < count;i++)
    {
        p[i] = (char) val;
    }
    return buf;
}

//#undef RtlCopyMemory
//NTSYSAPI void NTAPI RtlCopyMemory( void *, const void *, size_t count );
extern "C" 
void * __cdecl memcpy(void *dst,const void *src,size_t count)
{
    char * d = (char *) dst;
    const char * s = (const char *) src;
    for(size_t i=0;i < count;i++)
    {
        d[i] = s[i];
    }
    return dst;
}

//int CALLBACK WinMain ( IN HINSTANCE hInstance, IN HINSTANCE hPrevInstance, IN LPSTR lpCmdLine, IN int nShowCmd )
int my_WinMain(void)
{
    bool isExplorer = false;

    HWND active = GetForegroundWindow();

    DWORD procid;
    DWORD otherThread = GetWindowThreadProcessId(active,&procid);

    if ( active )
    {
        // walk up to the top-level window :
        HWND cur = active;
        for(;;)
        {
            HWND p = GetParent(cur);
            if ( ! p ) break;
            cur = p;
        }

        char name[1024];
        name[0] = 0;

        if ( GetClassNameA(cur,name,sizeof(name)) )
        {
            //lprintf("name : %s\n",name);
            // the second class name is an assumption; the line was cut off after the ||
            isExplorer = strsame(name,"CabinetWClass") ||
                strsame(name,"ExploreWClass");
        }
    }

    if ( isExplorer )
    {
        //lprintf("sending alt-up\n");
        INPUT inputs[4] = { 0 }; // this calls memset
        inputs[0].type = INPUT_KEYBOARD;
        inputs[0].ki.wVk = VK_LMENU;
        inputs[1].type = INPUT_KEYBOARD;
        inputs[1].ki.wVk = VK_UP;
        // send keyups in reverse order :
        inputs[2] = inputs[1]; // this generates memcpy
        inputs[3] = inputs[0];
        inputs[2].ki.dwFlags |= KEYEVENTF_KEYUP;
        inputs[3].ki.dwFlags |= KEYEVENTF_KEYUP;

        SendInput(4,inputs,sizeof(INPUT));
    }
    else
    {
        //lprintf("sending backspace\n");
        // can't use SendInput here cuz it will run me again
        // find the actual window and send a message :
        DWORD myThread = GetCurrentThreadId();

        // share input state with the foreground thread so GetFocus sees its focus window
        AttachThreadInput(otherThread,myThread,TRUE);

        HWND focus = GetFocus();
        if ( ! focus )
            focus = active;

        // the HotKey thingy that I use will still send the KeyUp for Backspace,
        //  so I only need to send the key down :
        //  (some apps respond to keys on KeyUp and would process backspace twice- eg. Firefox)
        // also, if I send the KeyDown/Up together the WM_CHAR gets out of order oddly
        int vk = VK_BACK;
        int ScanKey = MapVirtualKey(vk, 0);
        const LPARAM lpKD = 1 | (ScanKey << 16);
        //const LPARAM lpKU = lpKD | (1UL << 31) | (1UL << 30);

        PostMessage(focus,WM_KEYDOWN,vk,lpKD);

        AttachThreadInput(otherThread,myThread,FALSE);
    }

    return 0;
}

extern "C"
int WinMainCRTStartup(void)
{
    return my_WinMain();
}

Bug fixed 07/12 : don't send key up message. Also, build without CRT and the EXE size is under 4k.

Download the EXE

07-08-10 | Remote Dev

I think the OnLive videogames-over-the-internet thing is foolish and unrealistic.

There is, however, something it would be awesome for : dev kits.

If I'm a multi-platform videogame developer, I don't want to have to buy dev kits of every damn system for every person in my company. They cost a fortune and it's a mess to administer. It would be perfectly reasonable to have a shared farm of devkits somewhere out on the net, you can get to them whenever you want and run your game and get semi-interactive gameplay ala OnLive.

It would be vastly superior to the current situation, wherein poor developers wind up buying 2 dev kits for their 40 person team and you have to constantly be going around asking if you can use it. Instead you would always get instant access to some dev kit in the world.

Obviously this isn't a big deal for 1st party, but for the majority of small devs who do little ports it would be awesome. It would also rock for me because then I could work from home and instantly test my RAD stuff on any platform (and not have to ask someone at the office to turn on the dev kit, or make sure no one is using it, or whatever).

07-07-10 | Counterpoint

Dave Moulton is usually right on the money about issues bike related, but when he rants about bicycle helmet laws, I think he's off the money.

Basically he contends that the real problem with bike-car safety is that car drivers do not take the responsibility of their powerful machine seriously enough. Yes, of course, I agree absolutely, but SO WHAT ?

We could have a lot of discussions about the way the world should be, but it's irrelevant and non-productive to pine for things that will never be. Yes, it would be nice if people paid attention to the road when they drove, didn't talk on cell phones, didn't drink coffee and talk to their passenger. A car is a deadly powerful killing machine, and stupid people forget that because it's so comfy and feels safe and is easy to drive and so on. Yes, I think most people would drive better if they had to drive something like an old open-roof roadster where you are exposed to the elements and feel vulnerable. But you are not going to change Americans' driving habits. People want to jump in their giant beast, mash the gas, watch TV while they're driving, and fuck you if you're a pedestrian or cyclist in their way.

Look, drivers are fucking dangerous morons. It doesn't matter if you're on a bike or not - they are constantly running red lights, pulling into crosswalks, not stopping for pedestrians, going the wrong way around roundabouts. Almost every single day I see some major violation of basic traffic laws, and even beyond that there are just constant violations of basic human sense and decency. One of the ones that's really getting my goat recently is how people around here love to blow right through a stop sign and come to a stop about ten feet past it, with their nose way out in the intersection (they do this intentionally because they want to get further out into the intersection to see to the sides; the reasonable thing to do would be to first stop at the stop sign and then pull forward to see). So when I'm driving or biking through the intersection, all I see is somebody who blows through a stop sign and is coming right at me (and then they stop just before hitting me).

I personally would love to see the elimination of the entire concept of the vehicular "accident". They are only rarely accidents; it's usually somebody fucking up. The person who crashed should not only have to pay the cost of the accident, but should get a punitive legal punishment such as license suspension or even prison time. For example the old lady who ran right through a red light and smashed my car should clearly have had her license taken away. In cases of hitting pedestrians you should get jail time. It's almost impossible for a pedestrian to ever be at fault, because even if they do jump out right in front of you - you should always expect them to do that when you are in an area with pedestrians, so you should be going slow and be ready to slam on the brakes. But this is never going to happen so it's a pointless rant.

As for the issue of mandatory helmets - I don't really think it's anything to get riled up about. As a cyclist of course you should choose to wear a helmet even if it's not a law. Obviously making it a law is political weakness. Oh shit cyclists are getting hit by cars - let's restrict the cyclists because god knows we're not gonna restrict the cars. Well duh, of course that's how politics works. But there are perfectly reasonable reasons to make helmets mandatory - the same reason seat belts in cars are mandatory - because it reduces the medical cost which is shared by society.

We can rant all we want about how drivers should pay more attention, be more courteous, be reasonable and intelligent, but it just won't ever happen.

What I would like to see is better ways for me as a cyclist to avoid cars, and me as a driver to avoid cyclists. Part of the problem is that the people who put down the bike lanes are real fucking morons. Right in my neighborhood we have bike lanes or "sharrows" right down the busiest arterial roads, when there are perfectly good quiet back streets that run parallel and would be much better routes for bikes. Personally when I ride, I take the back routes that are very low traffic, but the majority of cyclists are just as retarded as the retarded cars, and they take the road that is "bike recommended" even though it's much worse.

(BTW as a reminder, let me emphasize that the retardation equivalence of bikes and cars in no way excuses the cars from their sins; if you have a fight between someone with a feather and someone with a knife and they are both being dangerous morons with their weapon - the feather guy can be forgiven but the knife guy is a fucking selfish ignorant dick; you often hear self-righteous car morons go on about how the bikes "do bad things too" ; so fucking what? so what? he's poking you with his feather, just ignore him and be careful with your damn knife).

07-05-10 | Country Living

Probably because I've been reading Tanizaki (wonderful) recently, and also because my neighborhood has turned into a construction yard as all the god damn home-improvers have kicked into high gear for the summer, I have been fantasizing a lot recently about living out in the country.

The woods have a wonderful silence to them; the boughs are baffles, muffling sound, making the air heavy and still. I imagine having a clearing in the woods so a bit of light can get in. In the clearing is a Japanese style pavilion, dark thick wood braces and paper screens. It is empty of all clutter, my private space, quiet and peaceful, where I can just think and work and be alone.

There are actually lots of huge wooded properties for sale out not too far from Seattle. I think the best nearby place is out in the Snoqualmie river valley, around Duvall/Carnation/Novelty Hill. You can get 40 acres for around $600k which is pretty stonkering. 40 acres is enough that you can put a little building in the middle and not be able to see or hear a neighbor at all. It also seems like a pretty good investment. It's inevitable that the suburbs will get built out to there eventually, and then all that land could be worth a fortune. This is why I've never understood living in traditional suburbs; if you go just another ten miles out you get to real country where you can have big wild property with woods and gardens and isolation, for less money!

But then I start thinking - if I'm going to live in the middle of nowhere, why live in the middle of nowhere near Seattle? It's too far to really go into the city on any kind of regular basis, so I may as well just live in the countryside in CA or Spain or Argentina or somewhere with better weather.

Living in the country is really only okay if I'm married or something. If I'm single I have to be in the city. Even if I am with the woman I love, moving out to the country is sort of like retiring from life. It's changing gears to a very isolated, simple life. That's very appealing to me, but I don't think it's time for that phase of my life just yet.

Lately I have been taking lots of walks around Seattle U. It has pretty nice grounds, with lots of little hidden gardens tucked behind or between buildings where you can stroll or sit. I love the feeling on a college campus. You can just feel the seriousness in the air. Even when there are lots of kids around there's a feeling of quiet and solitude; maybe it's because the big buildings create a sort of echoing canyon that changes the sounds.

I miss having deep intellectual problems to work on that you really have to go and think about for a long time. Even though I'm sort of doing research right now, it's engineering research, where my time needs to be spent at the machine writing code for test cases, it's not theoretical research. It's really a delightful thing to have a hard theoretical problem to work on. You just keep it in the back of your mind and you chew on it for months. You try to come at it in different ways, you search for prior art papers about it. All the time you are thinking about it, and often the big revelation comes when you are taking a hike or something.

07-04-10 | Counterpoint 1

In which I reply to other people's blogs :

The Windows registry vs. INI files is surprisingly pro-registry.

First of all, using the Registry does not actually *fix* any issues with multiple instances that you can't fix pretty easily with INI files. In particular, the contention over the config data is the easy part of the problem. There's an inherent messy ambiguity with settings and multiple instances. If I change settings in instance 1 do I want those to be immediately reflected in instance 2? If so, the only way to fix this is to have all instances constantly checking for new settings (or get some notification). This problem is exactly the same with the registry or INI files. Sure the registry gives you nice safe atomic writes, but you can implement that yourself easily enough, or you could use an off the shelf database. So that is really not much of an argument. In fact, getting changes across instances with INI files could be done pretty neatly using a file change notification to cause reloads (I'm not sure if Windows provides a similar watcher notification mechanism for registry changes). (the system that most people use of dumping settings changes on program exit is equally broken with the registry or INI files).
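As a sketch of the "implement that yourself" part : the usual trick for safe writes is to write the whole new INI to a temp file and then rename it over the real one, so readers see either the old file or the new file and never a half-written one. The function names and the ".tmp" suffix are made up, and this assumes rename() replaces the destination atomically, which POSIX guarantees; on Windows you'd want something like MoveFileEx with MOVEFILE_REPLACE_EXISTING instead :

```cpp
#include <cstdio>    // std::rename
#include <fstream>
#include <iterator>
#include <string>

// Writes the new settings to a temp file, then renames it over the real one.
// Assumption : rename() atomically replaces an existing destination (POSIX behavior).
bool write_settings_atomically(const std::string & path, const std::string & contents)
{
    const std::string tmp = path + ".tmp"; // hypothetical temp naming scheme
    {
        std::ofstream out(tmp.c_str(), std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << contents;
        out.flush();
        if (!out) return false;
    } // the temp file is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Reads the whole settings file back as one string (empty if it can't be opened).
std::string read_settings(const std::string & path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

This only handles the write side; getting a second instance to notice the change is the separate notification problem.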

Second, storing things like last dialog positions and dumping structs and such is not really appropriate for INI files (or really even for the registry for that matter). The INI file is for things that the user might want to edit by hand, or copy to other machines, or whatever. That other junk is really a logically separate type of data. It's like the NCB in MSVC, which we all know you want to just wipe out from time to time. (in fact making it separate is nice because if I accidentally get my last dialog position off in outer space I can just delete that data). I think the official nice Win32 way to store this data is off in AppData somewhere, but I don't love that either.

Third, the benefits of the INI are massive and understated. 1. text editing is in fact a huge benefit over the registry. It lets you see all the options and edit them in a tool that is friendly and familiar. 2. it lets you do all the things you would do normally on files - eg. I can easily email my INI to friends, I can save backups of settings I made for some purpose, hell I can munge the INI from batch files, I can easily zip it up to save old versions, etc.

And this last one is by far the most important - making programs be "transportable" - that is, they rely on nothing but stuff in dirs under them - is just a massive win. It lets me rearrange my disk, copy programs around without running installers, save versions of programs, etc.

Back in the DOS days, whenever I finished a code project, we would make tape backups (lol tape) of the code * and all the compilers used to build it *. To do that all you had to do was include the dir containing the compiler. Five years later we could pull out projects that used some bizarro compilers that we didn't have any more, and it would all just work because they were fully transportable. The win of that is so massive it dominates the convenience of the registry for the developer.

Which brings us to the most important part : the convenience for the developer is not the issue here! It's what would be nicer for the user. And there INI is just all win. If it's more work for the developer to make that work, we should do that work.

07-05-10 | Counterpoint 2

In which I reply to other people's blogs :

Smartness overload ( and addendum ) is purportedly a rant against over-engineering and excessive "smartness".

But let's start there right away. Basically what he wants is to have less smartness in the development of the basic architectural systems, but that requires *MORE* smartness every single time you use the system. For example :

In many situations overgeneralization is a handy excuse for laziness. Managing object lifetimes is one of my pet peeves. It’s common to use single, “universal” system for all kinds of situations instead of spending 5 minutes and think.

He's anti-smartness but pro "thinking" each time you write some common code. My view is that "smartness" during development is very very bad. But by that I mean requiring the client (coder) to think and make the right decision each time they do something simple. That inevitably leads to tons of bugs. Having systems that are clear and uniform and simple is a massive win. When I'm trying to write some leaf code, I shouldn't have to worry about basic issues, they should be taken care of. I shouldn't have to write array code by hand every time I need an array, I should use a vector. etc.

Furthermore, he is arguing against general solutions. I don't see how you can possibly argue that having each coder cook up their own systems for lifetime management is a good idea. Uniformity is a massive massive win. Even if you wrote some manual lifetime control stuff that was great, when some co-worker goes into your code and tries to use things they will be lost and have problems. What if you need to pass objects between code that use different schemes? What a mess.

Yet, folks insist on using reference counted pointers or GC everywhere. What? Some of counters can be manipulated from multiple threads? Well, let’s make _all_ pointers thread-safe, instead of thinking for another 5 minutes and separating special cases. It may be tempting to have a solution that just works in every case, write it once and use everywhere. Sadly, in many cases it means unnecessary overhead.

Yes! It is very tempting to have a solution that just works in every case! And in fact, having that severely *reduces* the need for smartness, and severely reduces bugs. Yes, if the overhead is excessive that's a problem, but that can be dealt with without destroying good systems.

I think what he's trying to say is something along the lines of "don't use a jackhammer to hammer a nail" or something; that you shouldn't use some very heavy complex machinery when something simple would do the trick. Yes, of course I agree with that, but he also succumbs to the fallacy of taking that way too far and just being anti-jackhammer in general. The problem with that is that you wind up having to basically cook up the heavy machinery from scratch over and over again, which is much worse.

Especially with thread safety issues, I think it is very wrong-headed to suggest that coders should "think" and "not be lazy" in each occurrence of a problem and figure out what exactly they need to thread-protect and how they can do it minimally, etc. To write thread-safe code it is *crucial* to have basic systems and common paradigms that "just work everywhere". Now that doesn't mean that you have to make all smart pointers threadsafe. You could easily have something like "SingleThreadSmartPointer" and "ThreadSafeSmartPointer". An even better mechanism would be to design your threading system such that cross-thread smart pointers aren't necessary. Of course you want sensible efficient systems, but you also want them to package up common actions for you in a guaranteed safe way.
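One way the "two flavours of smart pointer" idea could look is sketched below. This is only an illustration under assumptions: all the names (`SingleThreadCount`, `ThreadSafeCount`, `RefCounted`, `SmartPointer`, `Texture`, `AsyncJob`) are hypothetical, and the intrusive-refcount design is one choice among several. The point is that the leaf coder picks a policy once per type and then never thinks about it again.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Hypothetical counter policies: same interface, different cost.
struct SingleThreadCount {
    int n = 0;
    void inc() { ++n; }
    bool dec() { return --n == 0; }  // true when the last ref went away
};

struct ThreadSafeCount {
    std::atomic<int> n{0};
    void inc() { n.fetch_add(1, std::memory_order_relaxed); }
    bool dec() { return n.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};

// Objects embed their counter (intrusive refcount).
template <typename Count>
struct RefCounted {
    Count refs;
    virtual ~RefCounted() = default;
};

// One pointer class; whether it is thread-safe is decided by the pointee type.
template <typename T>
class SmartPointer {
public:
    SmartPointer() = default;
    explicit SmartPointer(T* p) : p_(p) { if (p_) p_->refs.inc(); }
    SmartPointer(const SmartPointer& o) : p_(o.p_) { if (p_) p_->refs.inc(); }
    SmartPointer(SmartPointer&& o) noexcept : p_(std::exchange(o.p_, nullptr)) {}
    SmartPointer& operator=(SmartPointer o) noexcept { std::swap(p_, o.p_); return *this; }
    ~SmartPointer() { if (p_ && p_->refs.dec()) delete p_; }

    T* GetPointer() const { return p_; }
    T* operator->() const { assert(p_); return p_; }

private:
    T* p_ = nullptr;
};

// Example object types (hypothetical): the policy is picked once, per type.
struct Texture  : RefCounted<SingleThreadCount> {};
struct AsyncJob : RefCounted<ThreadSafeCount>  {};
```

Code that only ever touches `Texture` on one thread pays for a plain increment, while `AsyncJob` gets atomic counting, and neither call site looks any different.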

Finally, let's get to the real meat of the specific argument, which is about object lifetime management. He seems to be trashing a bogus smart pointer system in which people are passing around smart pointers all the time, which incurs lots of overhead. This is reminiscent of all the people who think the STL is incredibly slow, just because they are using it wrong. Nobody sensible has a smart pointer system like that. Smart people who use the boost:: pointers will make use of a mix of pointers - scoped_ptr, shared_ptr, auto_ptr, etc. for different lifetime management cases. Obviously the case where a single object always owns another object is trivial. You could use auto_ptr or even just a naked pointer if you don't care about automatic cleanup. The nice thing is that if I later decide I need to share that object, I can change it to shared_ptr, and it is easy to do so (or vice-versa). Even if something is a shared_ptr, you don't have to pass it as a smart pointer. You can require the caller to hold a ref and then pass things as naked pointers. Obviously little helper functions shouldn't take a smart pointer that has to inc and dec and refcount thread-safely, that's just bone headed bad usage, not a flaw of the paradigm.

Now, granted, by not using smart pointers everywhere you are introducing holes in the automaticity where bad coders can cause bugs. Duh. That is what good architecture design is all about - yes if we can make everything magically work everywhere without performance overhead we would love to, but usually we can't so we have to make a compromise. That compromise should make it very easy for the user to write efficient and mistake-free code. See later for more details.

Object lifetime management involves work one way or another. If you use smart pointers or some more lazy type of GC, the amount of work needed for the coder to do every time he works with shared objects is greatly reduced. This makes it easier to write leaf code and reduces bugs.

The idea of using an ID as a weak reference without a smart pointer is basically a no-go in game development IMO. Let me explain why :

First of all, you cannot ever convert the ID to a pointer *ever* because that object might go away while you are using the pointer. eg.

Object * ptr = GetObject( ID ) ; // checks object is alive

// !! in other thread Object is deleted !!

ptr->stuff(); // crash !

So, one solution to this is to only use ID's. This is of course what Windows and lots of other OS'es do for most of their objects. eg. HANDLE, HWND, etc. are actually weak reference ID's, and you can never convert those to pointers, the function calls all take the ID and do the pointer mapping internally. I believe this is not workable because we want to get actual pointers to objects for convenience of development and also efficiency.

Let me also point out that a huge number of windows apps have bugs because of this system. They do things like

HWND w = Find Window Handle somehow

.. do stuff on w ..

// !! w is deleted !!

.. do more stuff on w .. // !! this is no good !

I have some Windows programs that snoop other programs and run into this issue, and I have to wind up checking IsWindow(w) all over the place to tell if the window whose ID I'm holding has gone away. It's a mess and very unsafe (particularly because in early versions of windows the ID's can get reused within moderate time spans, so you actually might get a success from IsWindow but have it be a different window!).

Now, of course weak references are great, but IMO the way to make them safe and useful is to combine them with a strong reference. Like :

ObjectPtr ptr = GetObject( ID ); // checks existence of weak ref

the weak ref to pointer mapping only returns a smart pointer, which ensures it is kept alive while you have a pointer. This is just a form of GC of course.

By using a system like this you can be both very efficient and very safe. The system I use is roughly like this :

Object owners use smart pointers.  Of course you could just have a naked pointer or something, but the performance cost of using a
smart pointer here is nil and it just makes things more uniform which is good.

Weak references resolve to smart pointers.

Function calls take naked pointers.  The caller must own the object so it doesn't die during the call.  Note that this almost never
requires any thought - it is true by construction, because in order to call a function on an object, you had to get that object from
somewhere.  You either got it by resolving a weak pointer, or you got it from its owner.

This is highly efficient, easy to use, flexible, and almost never has problems. The only way to break it is to intentionally do screwy things like

Object * ptr = GetObject( ID ).GetPointer();
CallFunc( ptr );

which will get a smart pointer, get the naked pointer off it, and let the smart pointer die.
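A minimal sketch of that three-part scheme, using the standard library's `std::shared_ptr` / `std::weak_ptr` in place of whatever custom smart pointer a real engine would have. The `ObjectTable` class, its `Create` method and the `hits` field are hypothetical names made up for the illustration, and the table itself is not protected against concurrent access; only the weak-to-strong resolve is the point here.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>

struct Object {
    int hits = 0;
    void stuff() { ++hits; }
};

using ObjectID  = std::uint32_t;
using ObjectPtr = std::shared_ptr<Object>;

class ObjectTable {
public:
    // The owner keeps the returned strong ref; the table only holds a weak one.
    ObjectPtr Create(ObjectID* outId) {
        assert(outId);
        ObjectPtr p = std::make_shared<Object>();
        *outId = next_++;  // IDs are never reused in this sketch
        weak_[*outId] = p;
        return p;
    }

    // Weak references resolve to smart pointers: either you get a strong
    // ref that keeps the object alive, or you get null. Never a raw pointer.
    ObjectPtr GetObject(ObjectID id) const {
        auto it = weak_.find(id);
        return it == weak_.end() ? ObjectPtr() : it->second.lock();
    }

private:
    std::unordered_map<ObjectID, std::weak_ptr<Object>> weak_;
    ObjectID next_ = 1;
};

// Function calls take naked pointers; the caller holds the ref for the call.
void CallFunc(Object* o) { o->stuff(); }
```

Typical leaf code is then `if (ObjectPtr p = table.GetObject(id)) CallFunc(p.get());`, where `p` stays in scope for the duration of the call, so the object cannot die underneath it.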

Now, certainly lots of projects can be written without any complicated lifetime management AT ALL. In particular, many games throughout history have gotten away with having a single phase of the world tick when all object destruction happens; that lets you know that objects never die during the frame, which means you can use much simpler systems. I think if you are *sure* that you can use simpler systems then you should use them - using fancy systems when you don't need them is like using a hash table to implement an array with the index as the hash key. Duh, that's dumb. But if you *DO* need complicated lifetime management, then it is far far better to use a properly engineered and robust system than to do ad-hoc per-usage coding of custom solutions.

Let me make another more general point : every time you have to "think" when you write code is an opportunity to get it wrong. I think many "smart" coders overestimate their ability to write simple code correctly from scratch, so they don't write good robust architectural systems because they know they can just write some code to handle each case. This is bad software engineering IMO.

Actually this leads me into another blog post that I've been contemplating for a while, so we'll continue there ...

07-03-10 | Length-Limited Huffman Codes Heuristic

In the last post if you look at the comments you can find some comparison of optimal length limited vs. heuristic length limited.

I thought I would describe the heuristic algorithm. It is O(N) with no additional storage (it can work in place, which goes nicely with Moffat's in place Huffman codelen builder ). Here's the algorithm :

1. Build Huffman code lengths using Moffat INPLACE. You observe some of those code lengths are > maxCodeLen. We will work only on the code lengths, and we are given the symbol counts. We are given the symbol counts in sorted order (this was already done for INPLACE; if they were not originally sorted a simple NlogN sort will make them so).

2. Set all code lengths > max to be = maxCodeLen. We now have invalid code lengths, they are not "prefix". That is, they do not satisfy the Kraft inequality K <= 1 for decodability.

3. Compute the Kraft number, K = Sum { 2 ^ - L_i } ; we currently have K > 1 and want to shrink it down to K = 1 by increasing some code lengths.

4. PASS 1. Walk over the symbols in sorted order (from lowest count to highest) while K > 1. Do :

  while ( codeLen[ s ] < max && K > 1 )
  {
    codeLen[ s ] ++;

    // adjust K for change in codeLen
    K -= 2 ^ - codeLen[ s ]
  }

5. PASS 2. Walk over the symbols backwards (from highest to lowest count) while K < 1. Do :

  while ( (K + 2^-codeLen[ s ]) <= 1 )
  {
    // adjust K for change in codeLen
    K += 2 ^ - codeLen[ s ]

    codeLen[ s ] --;
  }

6. You now have a set of codelens with K = 1 and all codeLens <= max. Fini.
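A minimal C++ sketch of steps 2 through 6, under a few assumptions that are not in the post: the function name `LimitCodeLengths` is made up, the lengths are stored in ascending symbol-count order (index 0 is the rarest symbol), and K is kept as an integer scaled by 2^maxLen so no floating point is needed. It also assumes the number of symbols is at most 2^maxLen, otherwise no valid code exists.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// codeLen[i] is the Huffman length of the symbol with the i'th smallest count.
// On return all lengths are <= maxLen and the Kraft sum is <= 1.
void LimitCodeLengths(std::vector<int>& codeLen, int maxLen) {
    assert(maxLen > 0 && maxLen < 31);
    const std::uint64_t one = std::uint64_t(1) << maxLen;  // K == 1, scaled by 2^maxLen
    assert(codeLen.size() <= one);

    // Steps 2 and 3: clamp long codes, then compute the scaled Kraft sum.
    std::uint64_t K = 0;
    for (int& len : codeLen) {
        if (len > maxLen) len = maxLen;
        K += one >> len;  // 2^-len, scaled
    }

    // Pass 1: lengthen codes, lowest count first, until K <= 1.
    for (std::size_t s = 0; s < codeLen.size() && K > one; ++s) {
        while (codeLen[s] < maxLen && K > one) {
            ++codeLen[s];
            K -= one >> codeLen[s];
        }
    }

    // Pass 2: if we overshot, shorten codes, highest count first,
    // as long as the code stays prefix (K <= 1).
    for (std::size_t s = codeLen.size(); s-- > 0 && K < one; ) {
        while (codeLen[s] > 1 && K + (one >> codeLen[s]) <= one) {
            K += one >> codeLen[s];
            --codeLen[s];
        }
    }
}
```

For example, unlimited lengths {5,5,4,3,2,1} with maxLen = 3 clamp to {3,3,3,3,2,1} (K = 10/8). Pass 1 lengthens the last two symbols, which overshoots to K = 7/8, and pass 2 then shortens one of them again to give {3,3,3,3,2,2} with K = 1.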

Okay, so what's happening here ?

There's one forward pass and one backwards pass. First we truncate the code lengths that were too long. We are now in trouble and we need to find some other code lengths that we can make longer so that we are prefix again. The best code to make longer is the one with the lowest symbol count. It doesn't matter how long the current code length is, the cost of doing L += 1 is always the symbol count. So we just walk forward from the lowest symbol count. (*).

After step 4 you have a code with K <= 1 , if it's == 1 you're done, but sometimes it is < 1 because you bumped a lower codelen than necessary and you have a bit of space in your prefix code. To take advantage of this you want to find the highest count symbol whose length you can decrease and still have a prefix code.

As noted in the previous post this can be far from optimal, but in the standard case it just doesn't matter much because these are the very rare symbols.

footnotes :

(* while it is true that the cost is independent of L, the benefit to K is not independent of L, so adjusting shorter code lens is probably better. Instead of the minimum symbol count (C) you want to minimize the cost per benefit, which is C * 2^L . So you'd have to maintain a priority queue (**).)

(** it's actually more complex than that (I just tried it). In this step you will often be overshooting K, when considering overshooting you have to consider the penalty from doing len++ in the step that does the overshoot vs. how much you can get back by doing len-- elsewhere to come back to K=1. That is, you need to merge steps 4 and 5 such that you create a single priority queue which consists of some plain len++ ops, and also some ops that do one len++ plus some number of other len--'s, and pick the best of those options which doesn't overshoot K. Keep doing ops while K > 1 and you will wind up with K = 1. ).

Actually I wonder if this is a way to reconcile Huffman code building with Package-Merge ?

What would the correct priority queue op be for the (**) footnote ?

Say you're considering some op that does a len++ somewhere and overshoots K. You need to compensate with some amount of K value to correct. Say that value you need to correct is 2^-L. You can either do len-- on a code of length L, or you can do it on two codes of length L+1. Or one of length L+1 and two of length L+2.

Yep, I see it. Construct a priority queue for each length L. In the queue are symbols of code length L, and also pairs of items of length L+1 (an item is either a symbol or a pair). To correct K by 2^-L you pick the best item from the L'th queue.

But rather than mess with this making an initial K and then doing corrections, you can just start with all L = 0 and K = N and then consider doing L++ on each code, that is, so you start by taking the best items from the L = 1 list. Which is just the package-merge algorithm !

Note that seeing this equivalence relies on some properties of the package-merge algorithm that aren't obvious. When you are pulling nodes at the final list (the L = 1 list), you can pick a symbol; picking a symbol means its length was 0 and you are making it 1. That means that symbol was never picked before. (this is true because a coin i is never picked in an earlier list before it is made active in the final list). Or, if you don't pick a symbol, you can pick a pair from the next L list. This corresponds to doing L++ on those code lengths. The key thing is : if a tree item has child i at level L, then child i already occurs L times as a raw symbol. This must be true because the cost of the tree item containing child i is > the cost of child i itself, so at all those levels child i would have been chosen before the tree item.

For example :

L=3:   A  B

L=2:   A  B  {AB}  C

L=1:   A  B  {AB}  C  {AB|C}

At the point where we select {AB} in the L=1 list, A and B must already have occurred once so their length is already 1. So {AB} means change both their lengths from 1 to 2; this adds them to the active set on the 2 list.

07-02-10 | Length-Limited Huffman Codes

I have something interesting to write about Huffman decoders, but that will have to wait a bit.

In the mean time I finally wrote a length-limited huffman code builder. Everyone uses the "package-merge" algorithm (see Wikipedia , or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et al. ; the Larmore/Hirschberg paper is impenetrable).

Here's my summary :

Explicitly what we are trying to do is solve :

Cost = Sum[ L_i * C_i ]

C_i = count of ith symbol
L_i = huffman code length

given C_i, find L_i to minimize Cost

constrained such that L_i <= L_max


Sum[ 2 ^ - L_i ] = 1
(Kraft prefix constraint)

This is solved by construction of the "coin collector problem"

The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out
C_i is the numismatic value of the ith coin
L_i is the number of times the collector uses a coin of type i
so Cost = Sum[ L_i * C_i ] is his total cost.

For each value C_i, the coins have face value 2^-1, 2^-2, 2^-3, ...
If the coin collector pays out total face value of (N-1) , then he creates a Kraft correct prefix code

The coin collector problem is simple & obvious ; you just want to pay out from your 2^-1 value items ;
an item is either a 2^-1 value coin, or a pair of 2^-2 value items ; pick the one with lower numismatic value

The fact that this creates a prefix code is a bit more obscure
But you can prove it by the kraft property

If you start with all lengths = 0 , then K = sum[2^-L] = N
Now add an item from the 2^-1 list
if it's a leaf, L changes from 0 to 1, so K does -= 1/2
if it's a node, then it will bring in two nodes at a lower level
    equivalent to two leaves at that level, so L changes from 1 to 2 twice, so K does -= 1/2 then too
so if the last list has length (2N-2) , you get K -= 1/2 * (2N-2) , or K -= N-1 , hence K = 1 afterward

BTW you can do this in a dynamic programming sort of way where only the active front is needed; has same
run time but less storage requirements.

You start at the 2^-1 (final) list.  You ask : what's the next node of this list?  It's either a symbol or
  made from the first two nodes of the prior list.  So you get the first two nodes of the prior list.
When you select a node into the final list, that is committed, and all its children in the earlier lists
  become final; they can now just do their increments onto CodeLen and be deleted.
If you select a symbol into the final list, then the nodes that you looked at earlier stick around so you
  can look at them next time.
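A minimal sketch of the plain (non-lazy) package-merge described in this summary. It is not the in-place or "active front" variant: every item simply carries the list of symbols inside it, which is wasteful but makes the counting obvious. The names `Item` and `PackageMerge` are made up for the illustration, and the input counts are assumed to be already sorted ascending.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

struct Item {
    std::uint64_t cost;        // numismatic value: sum of the counts inside
    std::vector<int> symbols;  // one entry per coin contained in this item
};

// counts sorted ascending; returns one code length per symbol, all <= maxLen.
std::vector<int> PackageMerge(const std::vector<std::uint64_t>& counts, int maxLen) {
    const std::size_t n = counts.size();
    assert(n >= 2 && maxLen >= 1 && maxLen < 31);
    assert(n <= (std::size_t(1) << maxLen));
    assert(std::is_sorted(counts.begin(), counts.end()));

    std::vector<Item> leaves;
    for (std::size_t i = 0; i < n; ++i)
        leaves.push_back(Item{counts[i], std::vector<int>{int(i)}});

    // Start at the smallest face value (2^-maxLen): that list is coins only.
    std::vector<Item> list = leaves;
    for (int level = maxLen - 1; level >= 1; --level) {
        // Package: pair up adjacent items of the previous list (odd one is dropped).
        std::vector<Item> packages;
        for (std::size_t i = 0; i + 1 < list.size(); i += 2) {
            Item p{list[i].cost + list[i + 1].cost, list[i].symbols};
            p.symbols.insert(p.symbols.end(),
                             list[i + 1].symbols.begin(), list[i + 1].symbols.end());
            packages.push_back(std::move(p));
        }
        // Merge with this level's coins by cost; on ties the coin goes first.
        std::vector<Item> merged;
        std::merge(leaves.begin(), leaves.end(), packages.begin(), packages.end(),
                   std::back_inserter(merged),
                   [](const Item& a, const Item& b) { return a.cost < b.cost; });
        list = std::move(merged);
    }

    // Pay out 2N-2 items from the final (2^-1) list, total face value N-1.
    // A symbol's code length is the number of its coins that got used.
    std::vector<int> codeLen(n, 0);
    for (std::size_t i = 0; i < 2 * n - 2; ++i)
        for (int s : list[i].symbols) ++codeLen[s];
    return codeLen;
}
```

With counts {1,2,4,8,16,32} and maxLen = 3 this gives lengths {3,3,3,3,2,2}: the two most common symbols each lose one of their coins because cheaper packages push those coins out of the lists.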

Okay, so it all works fine, but it bothers me.

I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious, it's the obvious way to solve that problem. I can also see that the minimization of the real value cost in "coin collector problem" can be made equivalent to the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the proof that the codes built in this way are prefix. But it's all very indirect and round-about.

What I can't see is a way to understand the "package-merge" algorithm directly in terms of building huffman codes. Obviously you can see certain things that are suggestive - the making pairs of items with minimum cost is a lot like how you would build a huffman tree. The funny thing is that the pairing here is not actually building the huffman tree - the huffman tree is never actually made; instead we make code lengths by counting the number of times the symbol appears in the active set. Even that we can sort of understand intuitively - if a symbol has very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. So that sort of makes sense, but it just makes me even more unhappy that I can't quite see it.

07-02-10 | Bank Fraud

In the last month I've been the target of fraud twice. It's quite ridiculous how poor the security of our electronic banking system is. To do an ACH out of someone's bank account, all you need is an account number. !? WTF !? No password, nothing. In theory they require a signature, but in practice they don't actually check that. Furthermore, the signature only authorizes someone to do ACH's - it doesn't specify a limit or a specific transaction! So once you authorize someone they can keep running more transfers after the fact.

This is a ridiculous load of shit. The credit card companies are almost as bad. The "verified by Visa" shit is a vaguely decent start in the right direction, but it's still just pathetically easy to get someone's credit card number and use it at will.

The insane thing is that it would be so easy to fix. I've mentioned this idea before, but the simplest one that occurs to me which would work in the current system is to have temporary one use only account numbers. So when I want someone to do an ACH withdrawal, I ask my bank for a temp account number which will expire or is conditional, and I give that number to the merchant. Same thing for credit card numbers. But no. They say some bullshit about how it would be too expensive to retrofit more security into the system, while they swim in piles of our money.

It blows my mind how many people don't carefully check over their bill from everyone. At the grocery store, from your phone company (phone companies are such massive scamming lying crooks that the only viable option is to not have a phone contract at all IMO), from your bank, etc.

06-21-10 | RRZ PNG-alike notes

Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few cases that you cannot ever catch with a heuristic.

There is one good heuristic that works 90% of the time :

do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3

if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth filters, and simple LZ matchers.

If not, then it's some kind of weird synthetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such. But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending on whether the normal filter chose loco post-filter or not.

But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L winning by quite a lot, but the right mode is N and it wins by 10k.

ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse; at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way smaller.

Okay, some actually useful notes :

There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs. Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is to do entropy per row, and add a penalty for non-zero. You're welcome.
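A rough sketch of what an entropy-based per-row cost could look like. Two things here are assumptions, not something the post spells out: "penalty for non-zero" is read as a fixed extra cost per non-zero residual byte, and the entropy is a plain order-0 entropy of the row's residuals. The names `RowCost` and `ChooseFilter` are made up.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Order-0 entropy of one filtered row, in bits, plus a penalty for each
// residual that is not zero (assumed interpretation of "penalty for non-zero").
double RowCost(const std::vector<std::uint8_t>& residuals, double nonZeroPenaltyBits) {
    assert(!residuals.empty());
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t r : residuals) ++hist[r];

    const double n = double(residuals.size());
    double bits = 0.0;
    for (std::size_t c : hist)
        if (c) bits -= double(c) * std::log2(double(c) / n);

    const std::size_t nonZero = residuals.size() - hist[0];
    return bits + nonZeroPenaltyBits * double(nonZero);
}

// candidates[f] holds the residuals of this row under filter f;
// returns the index of the cheapest filter.
std::size_t ChooseFilter(const std::vector<std::vector<std::uint8_t>>& candidates,
                         double nonZeroPenaltyBits) {
    assert(!candidates.empty());
    std::size_t best = 0;
    double bestCost = RowCost(candidates[0], nonZeroPenaltyBits);
    for (std::size_t f = 1; f < candidates.size(); ++f) {
        const double cost = RowCost(candidates[f], nonZeroPenaltyBits);
        if (cost < bestCost) { bestCost = cost; best = f; }
    }
    return best;
}
```

The penalty weight is a tuning knob; no particular value is implied by the post.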

The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.

I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity, a minimum set of filters is preferable.

BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this problem; it requires more thought.

06-20-10 | PNG Comparo

Okay this is kind of bogus and I thought about not even posting it, but come on, you need the closure right? Everyone likes a comparo. First of all, why this is bogus : 1. PNG just cannot compete without better LOCO support. Here I am allowing the LOCO files, but they were not advpng/pngout optimized , and of course they're not really PNGs. 2. I have that crippling 256k chunking on my format. I guess if I wanted to do a fair comparo I should make a version of my shit which doesn't have LOCO and also remove my 256k chunking and compare that vs. no-loco PNG. God dammit now I have to do that.

Whatever, here you go :

RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco

name                            RRZ heuristic  RRZ best  png best
ryg_t.yello.01.bmp              392963         359799    373573
ryg_t.train.03.bmp              35195          31803     34260
ryg_t.sewers.01.bmp             421779         420091    429031
ryg_t.font.01.bmp               -              26911     22514
ryg_t.envi.colored03.bmp        -              95394     97203
ryg_t.envi.colored02.bmp        -              54662     55036
ryg_t.concrete.cracked.01.bmp   -              299963    309126
ryg_t.bricks.05.bmp             -              370459    375964
ryg_t.bricks.02.bmp             -              455203    465099
ryg_t.aircondition.01.bmp       -              20522     20320
ryg_t.2d.pn02.bmp               -              22147     24750
kodak_24.bmp                    559564         558085    572591
kodak_23.bmp                    479240         478041    483865
kodak_22.bmp                    574252         571301    580566
kodak_21.bmp                    549865         545584    547829
kodak_20.bmp                    -              429556    439993
kodak_19.bmp                    -              541424    545636
kodak_18.bmp                    -              618961    631000
kodak_17.bmp                    508672         504961    510131
kodak_16.bmp                    -              466277    481190
kodak_15.bmp                    506728         504213    516741
kodak_14.bmp                    581520         580301    590108
kodak_13.bmp                    -              677041    688072
kodak_12.bmp                    -              465297    477151
kodak_11.bmp                    -              510200    519918
kodak_10.bmp                    -              497400    500082
kodak_09.bmp                    -              491896    493958
kodak_08.bmp                    610524         610505    611451
kodak_07.bmp                    473500         473233    486421
kodak_06.bmp                    -              534037    540442
kodak_05.bmp                    624368         623341    638875
kodak_04.bmp                    -              522061    532209
kodak_03.bmp                    -              437765    464434
kodak_02.bmp                    -              500964    508297
kodak_01.bmp                    586328         582389    588034
bragzone_TULIPS.bmp             -              565997    591881
bragzone_SERRANO.bmp            -              103462    96932
bragzone_SAIL.bmp               613845         609953    623437
bragzone_PEPPERS.bmp            -              366611    376799
bragzone_MONARCH.bmp            508096         507937    526754
bragzone_LENA.bmp               -              467103    474251
bragzone_FRYMIRE.bmp            -              241899    230055
bragzone_clegg.bmp              -              444736    483056

PNG wins by a little bit on FRYMIRE , SERRANO , ryg_t.aircondition.01.bmp , ryg_t.font.01.bmp . I'm going to pretend that I don't know that because that's what sent me down this god damn pointless rabbit hole in the first place, I discovered that PNG beat me on a few files so I had to find out why and fix myself.

Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :

1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for end of buffer and do you have enough bits blah blah. The correct way to do that is to allocate your decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode to something).

2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C very neatly using template loop inversion which I wrote about previously. For maximum speed however you really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking awful pointless abortion that is the Paeth filter.

3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit, then stream out data in the final pixel format into client-ready packed plane. The zip window is only 32k and you only need one previous row to filter, so your whole set of read-write data should be less than 64k, and the data you stream out should be written to a separate buffer with NTQ write-combining style writes. Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs (X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer is only written write combined you can decode directly into locked textures.
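To make point 2 a bit more concrete, here is a scalar sketch of un-filtering a Paeth row with the bytes-per-pixel as a template parameter, so the inner loop has a fixed stride and the per-pixel-format dispatch happens once per row rather than once per byte. The predictor itself follows the PNG spec; the function names and the dispatch wrapper are hypothetical, and this is only a guess at the flavor of the "template loop inversion" idea, not a SIMD implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Paeth predictor as defined by the PNG spec: a = left, b = above, c = above-left.
inline std::uint8_t Paeth(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    const int p  = int(a) + int(b) - int(c);
    const int pa = std::abs(p - int(a));
    const int pb = std::abs(p - int(b));
    const int pc = std::abs(p - int(c));
    if (pa <= pb && pa <= pc) return a;
    return (pb <= pc) ? b : c;
}

// row holds filtered bytes on entry and raw bytes on exit;
// prev is the already un-filtered row above.
template <int BPP>
void UnfilterPaethRow(std::uint8_t* row, const std::uint8_t* prev, std::size_t bytes) {
    assert(bytes % BPP == 0);
    // First pixel: left and above-left are zero, so Paeth reduces to "above".
    for (std::size_t i = 0; i < std::size_t(BPP) && i < bytes; ++i)
        row[i] = std::uint8_t(row[i] + prev[i]);
    for (std::size_t i = BPP; i < bytes; ++i)
        row[i] = std::uint8_t(row[i] + Paeth(row[i - BPP], prev[i], prev[i - BPP]));
}

// Runtime dispatch happens once per row, outside the hot loop.
inline void UnfilterPaeth(std::uint8_t* row, const std::uint8_t* prev,
                          std::size_t bytes, int bpp) {
    switch (bpp) {
        case 1: UnfilterPaethRow<1>(row, prev, bytes); break;
        case 3: UnfilterPaethRow<3>(row, prev, bytes); break;
        case 4: UnfilterPaethRow<4>(row, prev, bytes); break;
        default: assert(false && "unsupported bytes per pixel in this sketch");
    }
}
```

The serial dependency on `row[i - BPP]` and the three-way compare in the predictor are what make Paeth so much less pleasant to vectorize than the other filters.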

My guess is that this should be in the neighborhood of 80 MB/sec.

06-20-10 | KZip

While futzing around on PNG I discovered that there is this whole community of people who are hard-core optimizing PNG's for distribution sizes. They have these crazy tools like pngcrush/optipng/advancecomp/pngout.

AdvPNG is just the Zip encoder from 7zip (Igor Pavlov). It's a semi-optimal parser using a forward-pass multi-lookahead heuristic like Quantum. Some day I need to try to read the source code to figure out what he's doing exactly, but god damnit the code is ugly and it's not documented. Dammit Igor.

PNGOUT is a lossless "deflate" (zip) optimizer. The engine is the same as KZip by Ken Silverman. Unfortunately Ken has not documented his algorithm at all anywhere. Come on Ken! Nobody is gonna pay you for KZip! Just write about it!

Anyway, my guess from reading his brief description and looking at what it does is : KZip has some type of optimal parser. Who knows what kind of optimal parser; my guess, knowing a bit about how Ken codes, is that it is probably not an actual Storer-Szymanski optimal parser, but rather some kind of heuristic, perhaps like 7zip/LZMA. KZip also clearly has some kind of Huffman split point optimizer (similar to what I just did ). Again just guessing from the command line options it looks like his Huffman splitter is single pass and is just based on a heuristic that detects changes in the statistics and decides to put a split there. Hmmm, I wish I'd found this months ago.

KZip appears to be the best zip optimizer in existence. Despite claims of being crazy slow I actually think it's quite fast by my standards. No surprise KZip is a lot smaller and faster than my optimal parser, but I do make smaller files. Ken ! Set your algorithm free! (ADDENDUM : whoops, that was only on PNG data; for some reason it's pretty fast on image data, but it's slow as balls in some other cases, not sure what's going on there; 7zip is a lot faster than kzip and the file sizes are usually very close (once in a while kzip does significantly better)).

For a while I've been curious to try my RAD LZ optimizer on a Zip token stream. It would be a nice sanity check to test it against KZip, but I'm not motivated enough to take the pain of figuring out how to write Zip tokens.

06-20-10 | Searching my PNG-alike

Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco, LZ min match len, and LZ compress "level" (level = how hard it looks for matches).

Here are the results for my PNG-alike with the modes :

0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now

The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes with LZ parameters optimized for each mode.

name | default LZ : 0 Loco Normal N+L Adaptive A+L | optimized LZ : 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376

One important note :

The "Normal" filters include the option to do a post-filter Loco conversion. This is different than the "loco space" option in the modes above. Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ. Loco built into a filter mode means, on each pixel do the filter delta, then do loco conversion. This can be integrated directly into the DPCM pixel delta code, so it's just considered a filter type. In particular, in "loco space", then the N,W,NW neighbors are already in loco colorspace. When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these would be equivalent.
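To see why the two orderings only agree when everything is linear, here is a minimal sketch. It assumes byte-wrapping (mod 256) arithmetic for both the LOCO transform and the filter delta, which is an assumption and not something spelled out above; `Pixel`, `toLoco`, `locoThenFilter` and `filterThenLoco` are hypothetical names, not the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using U8 = std::uint8_t;

struct Pixel { U8 c[3]; };   // R,G,B (or the three loco channels)

// LOCO : (R,G,B) -> (R-G, G, B-G), wrapping in bytes (assumed)
inline Pixel toLoco(Pixel p)
{
    Pixel q;
    q.c[0] = U8(p.c[0] - p.c[1]);
    q.c[1] = p.c[1];
    q.c[2] = U8(p.c[2] - p.c[1]);
    return q;
}

using Predictor = int (*)(int N, int W, int NW);

// linear predictor, no clamp
inline int predictW(int, int W, int) { return W; }

// non-linear predictor : gradient clamped to the range of the neighbors
inline int predictClampedGrad(int N, int W, int NW)
{
    int lo = std::min({N, W, NW});
    int hi = std::max({N, W, NW});
    return std::clamp(N + W - NW, lo, hi);
}

// per-channel byte delta : cur - pred(N,W,NW), mod 256
inline Pixel residual(Pixel cur, Pixel N, Pixel W, Pixel NW, Predictor pred)
{
    Pixel d;
    for (int i = 0; i < 3; i++)
        d.c[i] = U8(cur.c[i] - pred(N.c[i], W.c[i], NW.c[i]));
    return d;
}

// "loco space" : neighbors are already in loco colorspace when we filter
inline Pixel locoThenFilter(Pixel cur, Pixel N, Pixel W, Pixel NW, Predictor pred)
{
    return residual(toLoco(cur), toLoco(N), toLoco(W), toLoco(NW), pred);
}

// loco as part of the filter : delta in RGB, then convert the delta
inline Pixel filterThenLoco(Pixel cur, Pixel N, Pixel W, Pixel NW, Predictor pred)
{
    return toLoco(residual(cur, N, W, NW, pred));
}

inline void demoLocoOrdering()
{
    Pixel N = {{200, 10, 0}}, W = {{10, 200, 0}}, NW = {{10, 10, 0}}, cur = {{200, 200, 0}};

    // linear predictor : both orderings give the same delta
    Pixel a = locoThenFilter(cur, N, W, NW, predictW);
    Pixel b = filterThenLoco(cur, N, W, NW, predictW);
    assert(a.c[0] == b.c[0] && a.c[1] == b.c[1] && a.c[2] == b.c[2]);

    // clamped predictor : the clamp breaks linearity and the deltas differ
    Pixel c = locoThenFilter(cur, N, W, NW, predictClampedGrad);
    Pixel d = filterThenLoco(cur, N, W, NW, predictClampedGrad);
    assert(c.c[0] == 66 && d.c[0] == 0);
}
```

With the W predictor the subtraction commutes with the color transform, so the two paths agree byte for byte. As soon as there is a clamp (or any other non-linear step) in the predictor, the neighbors being in RGB vs. loco space changes what the clamp does, and the residuals diverge.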

Okay, what do we see?

It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression "level") helps a lot; as does increasing match len; on natural images a min match len around 8 or 10 is usually best. (or even more, I forbid a min match len > 10 because it starts to hurt decode speed).

Well we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode. It is often the case that the best mode after optimization is the same as the best mode before optimization, but not always. When it is not the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would make output of 393k instead of 360k.
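To put a number on that, here is a quick sketch of the "pick the mode on default LZ, then optimize only that mode" heuristic, measured against optimizing LZ for every mode. `ModeSizes`, `bestMode` and `twoStageLoss` are hypothetical names for illustration; the sizes you feed in are rows of the table above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// sizes for the six modes in table order : 0, Loco, Normal, N+L, Adaptive, A+L
using ModeSizes = std::array<long, 6>;

// index of the smallest size
inline std::size_t bestMode(const ModeSizes & sizes)
{
    return static_cast<std::size_t>(
        std::min_element(sizes.begin(), sizes.end()) - sizes.begin());
}

// bytes lost by choosing the mode from the default-LZ sizes and then only
// optimizing LZ for that mode, vs. optimizing LZ for all six modes
inline long twoStageLoss(const ModeSizes & defaultLZ, const ModeSizes & optimizedLZ)
{
    const std::size_t picked = bestMode(defaultLZ);
    const std::size_t ideal  = bestMode(optimizedLZ);
    assert(optimizedLZ[picked] >= optimizedLZ[ideal]);
    return optimizedLZ[picked] - optimizedLZ[ideal];
}

inline void demoTwoStage()
{
    // ryg_t.yello.01.bmp row from the table
    ModeSizes defaultLZ   = {458255, 435875, 423327, 427031, 438946, 431678};
    ModeSizes optimizedLZ = {372607, 359799, 392963, 395327, 415370, 401618};
    assert(bestMode(defaultLZ) == 2);     // Normal looks best with default LZ
    assert(bestMode(optimizedLZ) == 1);   // Loco is actually best after optimization
    assert(twoStageLoss(defaultLZ, optimizedLZ) == 33164);
}
```

On the yello row the heuristic picks Normal (423327 with default LZ) and ends up at 392963, while Loco optimizes down to 359799; that's the 33164 byte mistake. On something like kodak_24 the same computation gives a loss of only 1479 bytes, which is the "usually a small mistake" case.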

Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to best if you didn't get the best. Synthetic images are harder, they are very sensitive to the exact mode.

We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any heuristic that can detect when 0+loco needs to be checked. Similarly for adaptive + noloco.

Obviously there's a fundamental problem when the initial sizes are very close together, you can't really differentiate between the modes. When the sizes are very far apart then it is a reliable guess.

Let me briefly note things I could be searching that I'm not :

Rearranging pixels in various ways, eg. RGBRGBRGB -> RRRGGGBBB , or to whole planes; interleaving lines, different scan orders, etc. etc.

LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time on this so I'm not bothering with it.

Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.

Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.

Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it. Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other pixel formats that might help, blah blah.

Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks and compress them completely independently (no LZ window across the chunk boundary). This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.

06-20-10 | Some notes on PNG

Before we go further, let's have a look at PNG for reference.

Base PNG by default does :

Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs

Zlib strategy "FILTER" , which just means minMatchLength = 4
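For reference, the "minimum sum of abs" row heuristic can be sketched like this. The five filter types and the Paeth predictor follow the PNG spec; treating each residual byte as signed before taking the absolute value is the usual form of the heuristic and is assumed here. `filterByte` and `chooseRowFilter` are hypothetical names, and `prev` is assumed to be the same length as `cur` (all zeros for the first row).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using U8 = std::uint8_t;

// Paeth predictor from the PNG spec : a = left, b = above, c = upper-left
inline int paeth(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = std::abs(p - a);
    int pb = std::abs(p - b);
    int pc = std::abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// residual byte for PNG filter type 0..4 at byte x of the row
inline U8 filterByte(int type, const std::vector<U8> & cur,
                     const std::vector<U8> & prev, std::size_t x, std::size_t bpp)
{
    int a = (x >= bpp) ? cur[x - bpp] : 0;
    int b = prev[x];
    int c = (x >= bpp) ? prev[x - bpp] : 0;
    int pred = 0;
    switch (type)
    {
    case 1: pred = a; break;               // Sub
    case 2: pred = b; break;               // Up
    case 3: pred = (a + b) >> 1; break;    // Average
    case 4: pred = paeth(a, b, c); break;  // Paeth
    default: pred = 0; break;              // None
    }
    return U8(cur[x] - pred);
}

// pick the filter type whose residuals have the smallest sum of |signed byte|
inline int chooseRowFilter(const std::vector<U8> & cur,
                           const std::vector<U8> & prev, std::size_t bpp)
{
    assert(cur.size() == prev.size());
    int  bestType = 0;
    long bestSum  = -1;
    for (int type = 0; type <= 4; type++)
    {
        long sum = 0;
        for (std::size_t x = 0; x < cur.size(); x++)
        {
            int r = filterByte(type, cur, prev, x, bpp);
            sum += (r < 128) ? r : 256 - r;   // abs of the byte viewed as signed
        }
        if (bestSum < 0 || sum < bestSum) { bestSum = sum; bestType = type; }
    }
    return bestType;
}
```

Note this picks the filter from residual statistics only; it never runs the LZ, which is exactly why it can be fooled on synthetic images.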

So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest. The things you can play with are :

Filter 0..5

Zlib Strategy ; they have a few hard-coded but really it comes down to min match length

Zlib compression level (oddly highest level is not always best)

PNG scan options (progressive, interleaves, bottom up vs top down, etc.)

It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ compress, you cannot tell from statistics.

The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings. The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try) with LZ, then depending on which of those is best they try a few others, and then maybe a few more, then you're done. It's like based on which branch of the search space you walk off initially they know from testing where the optimum likely is.

"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though). (BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for purposes of this comparison).

name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056

There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.
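The LOCO conversion used for the "loco" column is tiny. A minimal sketch, assuming the differences simply wrap in bytes (mod 256), which is what keeps it byte-to-byte and still exactly invertible; `RGB`, `locoForward` and `locoInverse` are hypothetical names :

```cpp
#include <cassert>
#include <cstdint>

using U8 = std::uint8_t;

struct RGB { U8 r, g, b; };

// RGB -> (R-G), G, (B-G) , differences wrap mod 256
inline RGB locoForward(RGB p)
{
    return RGB{ U8(p.r - p.g), p.g, U8(p.b - p.g) };
}

// inverse : add G back, again mod 256
inline RGB locoInverse(RGB q)
{
    return RGB{ U8(q.r + q.g), q.g, U8(q.b + q.g) };
}

// check that the round trip is lossless on a coarse grid of colors
inline bool locoRoundTrips()
{
    for (int r = 0; r < 256; r += 5)
    for (int g = 0; g < 256; g += 5)
    for (int b = 0; b < 256; b += 5)
    {
        RGB p = { U8(r), U8(g), U8(b) };
        RGB back = locoInverse(locoForward(p));
        if (back.r != p.r || back.g != p.g || back.b != p.b)
            return false;
    }
    return true;
}
```

The wrap is what makes it non-range-expanding : R-G really needs 9 bits as a signed value, but since G is transmitted unchanged the decoder can undo the wrap with a plain byte add.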

BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. The thing is, yes basic PNG is easy to beat and it has a lot of mistakes in the design, but the basic idea is fine, they did a good job on the standard pretty quickly, but the thing that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it, and while base PNG is pretty bad, PNG after optimization is not bad at all.

06-20-10 | Filters for PNG-alike

The problem :

Find a DPCM pixel prediction filter which uses only N,W,NW and does not range-expand (eg. ubytes stay in ubytes). (eg. like PNG).

We certainly could use a larger neighborhood, we could use adaptive predictors that evaluate the neighborhood for edges/etc., we would wind up with GAP from CALIC or something newer. We want to keep it simple so we can have a very fast decoder.
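To make the "does not range-expand" constraint concrete, here is a minimal single-channel sketch of the DPCM loop : the predictor works in ints, but the residual is stored as a byte with wraparound, and the decoder adds the same prediction back mod 256. The clamped gradient is used as the example predictor; treating out-of-image neighbors as 0 and the names `dpcmEncode` / `dpcmDecode` are assumptions for illustration, not the actual codec.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using U8 = std::uint8_t;

// example predictor : gradient clamped to the range of N,W,NW
inline int clampedGrad(int N, int W, int NW)
{
    int lo = std::min({N, W, NW});
    int hi = std::max({N, W, NW});
    return std::clamp(N + W - NW, lo, hi);
}

// pixel fetch that returns 0 outside the image (assumed border rule)
inline int at(const std::vector<U8> & img, int w, int x, int y)
{
    if (x < 0 || y < 0) return 0;
    return img[std::size_t(y) * std::size_t(w) + std::size_t(x)];
}

// residual = (pixel - pred) mod 256 , so ubytes stay in ubytes
inline std::vector<U8> dpcmEncode(const std::vector<U8> & img, int w, int h)
{
    assert(img.size() == std::size_t(w) * std::size_t(h));
    std::vector<U8> res(img.size());
    for (int y = 0; y < h; y++)
    for (int x = 0; x < w; x++)
    {
        int pred = clampedGrad(at(img, w, x, y - 1), at(img, w, x - 1, y), at(img, w, x - 1, y - 1));
        res[std::size_t(y) * std::size_t(w) + std::size_t(x)] = U8(at(img, w, x, y) - pred);
    }
    return res;
}

// decoder sees only already-decoded N,W,NW, so it forms the same prediction
inline std::vector<U8> dpcmDecode(const std::vector<U8> & res, int w, int h)
{
    assert(res.size() == std::size_t(w) * std::size_t(h));
    std::vector<U8> img(res.size());
    for (int y = 0; y < h; y++)
    for (int x = 0; x < w; x++)
    {
        int pred = clampedGrad(at(img, w, x, y - 1), at(img, w, x - 1, y), at(img, w, x - 1, y - 1));
        std::size_t i = std::size_t(y) * std::size_t(w) + std::size_t(x);
        img[i] = U8(res[i] + pred);
    }
    return img;
}
```

Because the prediction only depends on N, W and NW, the decoder is a single pass with a handful of integer ops per pixel, which is the whole point of restricting the neighborhood.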

These filters :

    case 0: // 0 predictor is the same as NONE
        return 0;
    case 1: // N
        return N;
    case 2: // W
        return W;
    case 3: // gradient // this tends to win on synthetic images
        int pred = N + W - NW;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 4: // average
        return (N+W)>>1;
    case 5: // grad skewed towards average // this tends to win on natural images - before type 12 took over anyway
        int pred = ( 3*N + 3*W - 2*NW + 1) >>2;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 6: // grad skewed even more toward average
        int pred = ( 5*N + 5*W - 2*NW + 3) >>3;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 7: // grad skewed N
        int pred = (2*N + W - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 8: // grad skewed W
        int pred = (2*W + N - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 9: // new
        int pred = (3*N + 2*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 10:    // new
        int pred = (2*N + 3*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
    case 11: // new
        return (N+W + 2*NW + 1)>>2;
    case 12: // ClampedGradPredictor
        int grad = N + W - NW;
        int lo = RR_MIN3(N,W,NW);
        int hi = RR_MAX3(N,W,NW);
        return rr::clamp(grad,lo,hi);
    case 13: // median
        return Median3(N,W,NW);
    case 14:
        // semi-Paeth
        // but only pick N or W
        int grad = N + W - NW;
        // pick closer of N or W to grad
        if ( RR_ABS(grad - N) < RR_ABS(grad - W) )
            return N;
        else
            return W;

perform like this :

name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
ryg_t.train.03.bmp 56347 76447 70047 80619 76187 79811 77519 79771 76979 77399 76279 80383 75739 75875 74635
ryg_t.sewers.01.bmp 604843 475991 496067 468727 466363 455359 458111 458727 469203 456399 462727 496691 452663 495931 464915
ryg_t.font.01.bmp 42195 54171 53599 56135 73355 66955 70163 60199 63815 69499 70099 79983 59931 61219 56735
ryg_t.envi.colored03.bmp 316007 147315 136255 158063 139231 154551 148419 155379 149667 148683 143415 145871 148099 144851 153311
ryg_t.envi.colored02.bmp 111763 98835 87711 111367 93123 108867 100003 106275 95927 96391 97383 96907 99275 94371 100707
ryg_t.concrete.cracked.01.bmp 493143 416307 449899 426755 403779 402635 400923 408019 420291 398215 408299 437371 409055 440371 424087
ryg_t.bricks.05.bmp 568755 534267 524243 514563 509375 493923 497639 505267 499007 501079 497419 551687 499855 547935 511595
ryg_t.bricks.02.bmp 684515 590955 577207 557155 560391 537987 545019 551251 544555 549543 544635 602115 545919 600139 557659
ryg_t.aircondition.01.bmp 41595 33215 34279 33535 33199 32695 32683 32619 33207 32503 32987 35795 31507 35171 32243
ryg_t.2d.pn02.bmp 25419 27815 28827 29499 32203 32319 32443 29679 32531 32315 32575 34183 28307 31247 27919
ryg_gemein.bmp 797 801 173 153 565 585 553 521 205 633 529 813 153 825 141
kodak_24.bmp 843281 735561 760793 756509 744021 736037 735797 735989 753709 733105 741801 785077 719185 777905 727221
kodak_23.bmp 865985 593977 616885 617105 591501 592561 588837 595949 609437 586121 595793 620657 585453 617825 601713
kodak_22.bmp 919613 732613 756249 741441 719973 712757 711245 721365 732337 709621 719157 760065 706517 763481 724053
kodak_21.bmp 797741 753057 691653 746033 713469 716065 710321 737017 713845 720525 704129 744785 701733 744633 709445
kodak_20.bmp 624901 563261 538069 571633 540981 548329 541773 559437 548549 546565 540337 559545 539289 557105 545901
kodak_19.bmp 847781 724541 728337 732717 718309 712989 712217 716593 729505 712893 715053 752173 689953 756849 698881
kodak_18.bmp 958657 808577 820253 820429 783373 783089 777377 794965 799449 779117 782541 821717 783005 821025 802569
kodak_17.bmp 782061 664829 655845 685273 651025 657225 650477 666625 666645 653117 652293 680813 644849 675045 656597
kodak_16.bmp 664401 696493 595101 673961 646981 647441 643353 675157 640833 658721 631509 684517 628837 682217 629293
kodak_15.bmp 779289 639009 660137 677649 643985 651729 644809 650525 666229 641669 650377 673545 635441 662389 645817
kodak_14.bmp 912149 800681 742901 787273 754233 754325 749641 778777 758045 760761 744001 795029 743909 791341 755425
kodak_13.bmp 1015697 932969 898929 941109 890989 900533 890221 918073 905705 897865 888113 925757 894461 922701 907029
kodak_12.bmp 690537 641517 583717 641973 613461 617881 612461 635253 617913 620833 607177 644837 598393 638889 602213
kodak_11.bmp 773937 725081 674657 714169 697533 691425 690389 708197 695109 698369 685965 738609 671365 739385 676597
kodak_10.bmp 766701 650637 648349 665309 648997 641961 641481 651741 651785 643369 642133 685733 623437 684949 629193
kodak_09.bmp 703689 640025 628161 659637 635201 637657 633881 644193 647817 636281 634105 662125 615393 662801 621873
kodak_08.bmp 1026037 862013 888589 835329 884329 841405 859717 839193 861917 856097 865405 942073 792833 950005 792201
kodak_07.bmp 725525 654925 605753 629757 633945 622693 627053 643069 621433 634965 620645 673361 597821 666629 604337
kodak_06.bmp 796077 786789 680841 750253 731117 723657 722981 753989 714849 739405 709825 772685 703417 773613 705529
kodak_05.bmp 981245 850313 836165 845001 815801 814429 810709 832337 825073 815897 811729 852385 808993 853785 828145
kodak_04.bmp 845721 672369 679245 692469 659981 663625 657621 670629 678165 658845 661957 691945 653629 688253 670325
kodak_03.bmp 656841 605525 559081 619713 581713 596313 587633 607841 600725 592761 584449 601901 579221 592745 587409
kodak_02.bmp 761633 666601 663837 677385 648117 649785 644985 661253 662077 647553 647545 682089 640501 687305 654309
kodak_01.bmp 896361 850353 811873 828397 822169 807197 810217 827621 815245 818249 807665 869829 781173 875265 788725
bragzone_TULIPS.bmp 1052229 729141 757129 688121 703317 671417 683069 683653 696193 682109 691257 756241 675637 761477 704293
bragzone_SERRANO.bmp 150898 172286 160718 193602 255462 274994 278094 212478 254902 274478 277418 285990 171918 213214 167058
bragzone_SAIL.bmp 1004581 865725 826729 834385 811121 798037 798301 819669 807221 805785 796761 862209 795581 868857 813521
bragzone_PEPPERS.bmp 712087 469243 456511 438211 442695 428627 433227 437215 436595 436059 434359 471587 426191 474595 439575
bragzone_MONARCH.bmp 907737 670825 671745 644513 638201 623613 627513 640213 640485 630941 630941 676401 624925 679553 652849
bragzone_LENA.bmp 745299 484027 513747 506631 478875 481911 477131 481819 494431 474343 483007 498519 478559 498799 489203
bragzone_FRYMIRE.bmp 342567 432867 399963 481259 608567 664895 670755 544963 612075 666279 668187 693467 433755 456899 421263
bragzone_clegg.bmp 806117 691593 511625 518489 1208409 1105433 1167561 733217 951773 1158429 1165953 1289969 502845 1286365 505093

commentary :

The big surprise is that ClampedGradPredictor (#12) is the fucking bomb. In fact it's so good that it hides the behavior of other predictors. For example plain old Grad is never picked. Also predictor #5 (grad skewed towards average) was actually by far the best until #12 came along.

The other minor surprise is that W is actually best sometimes, and N is never best, and generally N is much worse than W. Now, it is no surprise that W is better than N - it is a well known fact that typical images have much stronger horizontal correlation than vertical, but I am surprised just how *big* the difference is.

More in the next post.

06-20-10 | Struct annoyance

It's a very annoying fact of coding life that references through a struct are much slower than loading the variable out to temps. So for example this is slow :

void rrMTF_Init(rrMTFData * pData)
{
    for (int i = 0; i < 256;  i++)
        pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
}

and you have to do this by hand :

void rrMTF_Init(rrMTFData * pData)
{
    U8 * pMTFimage = pData->m_MTFimage;
    U8 * pMTFmap = pData->m_MTFmap;
    for (int i = 0; i < 256;  i++)
        pMTFimage[i] = pMTFmap[i] = (U8) i;
}

Unfortunately there are plenty of cases where this actually makes a significant difference.

06-20-10 | Windows 7 Niggles

1. When you edit a file name in explorer it by default doesn't include the extension. I constantly do "F2 , ctrl-C" to grab a file name, and then find later that I have the fucking file name without extension. .URL files are the worst, they for some reason just absolutely refuse to show me their extension. God fucking dammit, LEAVE SHIT ALONE.

2. Fucking folders can't all be set to Details as far as I can figure out. Yes yes yes I know you can do "set all folders to look like this one". But that only applies to folders that you have ever visited at least once. When you go to some random folder which you have never visited before - BOOM you're looking at fucking icons again. I despise icons. There must be a way to set the default options for folders, but I can't find it on web searches.

3. Fucking backspace not going up dir in Explorer really chaps my hide. I need to fix that. I guess I'll do that after I write my own AllSnap replacement (Grrr).

06-19-10 | NiftyP4 and Timeout

I've fiddled with the NiftyP4 code and have it almost working perfectly for my needs (thanks Jim!). One small niggle remains :

When I lose my net connection from home to work, VC will still hang longer than I'd like. This appears to be coming from the P4 command. The problem is that P4 hangs, and that Nifty waits on P4. Those issues in more detail :

P4 stalls way too long when it can't connect to the server. So far as I can tell there is no way to set this timeout variable in P4 (??). (net searching is a bit hard because Perforce does have a timeout variable, but that is for controlling how long client login sessions on the server last before they are reset). I'd like to set this timeout to like 1 second, currently it's 10 seconds or something. I basically either have a fast connection or not, the long timeout is never right.

Nifty stalls on P4. This is so that when you do something like "save all", it gets notification of the VC command and can check out any necessary files before VC tries to save them. So it runs the P4 command and waits before returning to let VC do its save.

So my hack solution for the moment is to make Nifty only wait for 500 millis for the P4 command and just return if it isn't done by then. This will then give you a "file is not writeable" popup error kind of thing and saves you from the horrible stalled out DevStudio.

BTW some notes for working on addins :

The easiest way to find the names of VC commands seems to be from Keyboard mapping. It appears that they are basically not documented at all. If you can't get it to hook up from the Keyboard mapping command name, your next best option is to trap *every* command and log its name, then go into VC and do the things you want to find the name for and see what comes out of your logs. (see links below)

Disabling an addin from the Tools->Addins menu does not actually unload it (sadly). You have to close VC and restart it to make it unload so that you can update the DLL.

The easiest way to debug an addin is just to make a .AddIn file in "C:\Users\you\Documents\Visual Studio 2005\Addins" and point it at the DLL that you build from your AddIn project. Then you can set up F5 in your AddIn project to run devenv. That way you debug the devenv session that loads your AddIn and you can set breakpoints etc.

See also :

Using EnableVSIPLogging to identify menus and commands with VS 2005 + SP1 - Dr. eX's Blog - Site Home - MSDN Blogs
Source Checkout - niftyplugins - Project Hosting on Google Code
Resources about Visual Studio .NET extensibility
Many Visual studio events missing from Events.SolutionEvents
HOWTO Getting Project and ProjectItem events from a Visual Studio .NET add-in.
HOWTO Capturing commands events from Visual Studio .NET add-ins
Binding to Visual Studio internal Commands from add-in

06-19-10 | Ranting Wussies

The disappointing thing about most ranters is that when you actually get them in person with the subject of their mockery, they turn into these polite normal boring reasonable conciliatory people who see the validity of the other side's position. You see Jon Stewart do this a lot, of course BSNYC and his ilk do it, etc. Stick to your fucking guns you wussies.

06-19-10 | Amazon censors product reviews

It's hard for me to get my hackles too up about this, but it's a topic that's worth noting over and over, so I'm making myself write about it.

A while ago I decided since I'm buying all this stuff from Amazon, I should write some reviews so that other people can see what products are good and which aren't. I mainly wrote reviews for products that had zero reviews, and mainly in cases where the product was not clearly described or listed.

I thought nothing of it, until I went back and was reviewing some of my purchases and noticed that there was no review on some of the products I was pretty sure I had reviewed.

Well, it turns out that Amazon censored my review. Not only did they not publish it, but they *silently* don't publish it, they don't give you any notification or reason that it was declined, it just doesn't show up.

It took me a while emailing back and forth with support to get some answers, but it turns out most of the declines come down to a specific policy :

Amazon does not allow reviews to address errors in the listing.

Some of the reviews I wrote were about incorrect pictures or list prices. For example a picture might show a set of several brass wool scrubbers, but the shipped product is only one of them. I write a review to make it clear what you're getting. BOOM! Review does not show up.

Here are the current guidelines. In particular, the troublesome ones are :

# Details about availability, price, or alternate ordering/shipping
# Comments on other reviews visible on the page (because page visibility is subject to change without notice)

Off-topic information: 

* Feedback on the seller, your shipment experience or the packaging 

* Feedback about typos or inaccuracies in our catalog or product description

Granted, some amount of censorship is necessary or you would be flooded by spam and libel, but this goes rather too far. Not being allowed to write a review when what you get is not what they said you would get is a pretty big problem. I can also understand them not wanting reviews that say "this is available cheaper at Walmart" but all my reviews that mentioned price were things like "listing says retail price is $100, in fact retail price is around $40". It's also simply not true that they objectively enforce those standards; you will see plenty of positive reviews around Amazon that mention price in a positive way, things like "what a great product for only $10" - that review is allowed through, but if it's a negative review like "product is not worth $10" it won't be allowed through; all my reviews that mentioned price and were censored were negative reviews. For example one of my reviews that was blocked was about a listing that showed a picture of a box full of cleaning pads and showed a list price of $40; actual shipped product was one cleaning pad (not a box full) - review deleted.

Of course I should note that this is not unusual. Yelp of course censors reviews in roughly the same way ; both Amazon and Yelp really want you to write "stories" that attract "community" and generally make people hang out and buy product. They don't want factual information that helps customers. I had several reviews deleted from Yelp because they failed to be "personal" enough (they were short things like "this place sucks"). Yelp is also semi-corrupt, despite claims otherwise they are in fact in the pocket of sponsors, and will delete reviews or ban people who write too many negative reviews.

CNET and Chow are actually much worse. They will bald-facedly delete reviews and discussion threads that are critical of sponsors. Most of the advertising sponsored web forums, such as the 2+2 poker forums or car forums like 6speedonline are the same way as well. They will lock or delete threads that are critical of sponsors or the site administration.

In conclusion : the internet is fucked. It is now owned by corporations who censor and edit the content to create the message they want. You have to be very careful about what you read on these sites, because valuable negative information might have been deleted, and if you value your own content, you should not contribute to any one of these sites.

06-19-10 | Verio spam filters outgoing mail

Verio, the host for cbloom.com , filters your outgoing emails through some spam filter and rejects them. You, the paying customer, have no control over this, you cannot disable it or add whitelists or anything. In its wisdom it decides that emails like this are spam :

Brownies have been placed in the kitchen.  Make them vanish, please.

After much emailing with Verio tech support I get these copy-paste responses :

Outbound spam filters 

We apologize for any inconvenience you might have experienced with this issue.

In response to your concern. The outgoing mails might have some contents that is not allowed by your filters.

and when I ask if there's any way to disable or get around it :

Outbound spam filters 

I am sorry but no. There is no way to whitelist outgoing messages, we have no support for getting around our internal, outgoing spam filters

This is obviously retarded. Either you're a spammer or you're not. Spammers should have their accounts banned and the rest of us should be allowed to send fucking email. If you are considering getting an account with Verio, don't. (their prices are also terrible by modern standards).

06-19-10 | Gmail doesn't let you send mails to yourself

I always CC myself when I write an email that I think is interesting (especially with my multiple email personalities, I often CC from one to another). I've been confounded for a while that when I use gmail, it seems to not deliver the emails to myself. Not only that, but it gives no error message or delivery failure or anything, it just silently doesn't send them.

See here for more people complaining.

06-17-10 | Suffix Array Neighboring Pair Match Lens

I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match lengths. You can then find the match length between any two indexes (i,j) in the sort order as the MIN of all pair match lengths between them. I wrote about this before when I wrote about the LZ Optimal parse strategy, but to recap : in order to find all matches against any given suffix, you find the position of that suffix in the sort order, then walk to neighbors and keep doing MIN() with the pair match lengths.

Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :

from :

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park

Algorithm GetHeight
input: A text A and its suffix array Pos
for i:=1 to n do
    Rank[Pos[i]] := i
od
h:=0
for i:=1 to n do
    if Rank[i] > 1 then
        j := Pos[Rank[i]-1]
        while A[i+h] = A[j+h] do
            h := h+1
        od
        Height[Rank[i]] := h
        if h>0 then h := h-1 fi
    fi
od

The idea is obvious once you see it. You walk in order of the original character array index (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1], then, you must be able to match to a length of at least L-1 : step that same neighbor forward one char too, and it still matches you for L-1 chars and still sorts before you, so whatever your actual sorted neighbor is, it matches at least that much. For example :

On "mississippi" :

At index 1 :

ississippi

matches :

issippi

with length 4.

Now step forward one.

At index 2 :

ssissippi

I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one)
so I can start my string compare at 3

And here's the C code :

static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
        const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
    int n = sortWindowLen;
    const U8 * A = sortWindowBase;

    int h = 0;
    for(int i=0; i< n ;i ++)
    {
        int sort_i = sortIndexInverse[i];
        if ( sort_i > 0 )
        {
            int j = sortIndex[sort_i-1];
            int h_max = sortWindowLen - RR_MAX(i,j);
            // h carries over from the previous i, so the compare starts at h, not 0
            while ( h < h_max && A[i+h] == A[j+h] )
            {
                h++;
            }
            sortSameLen[sort_i-1] = h;

            if ( h > 0 ) h--;
        }
    }
}
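
For anyone who wants to poke at this without my types, here is a minimal standalone sketch of the same idea in plain C++. The names (NaiveSuffixSort, NeighborMatchLens, MatchLenBetween) are made up for illustration and are not from my codebase, and the suffix sort is a deliberately dumb stand-in just so the thing is self-contained. The last function is the "MIN of all pair match lengths between them" lookup described above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix sort (slow on purpose); only a stand-in so the sketch is self-contained.
static std::vector<int> NaiveSuffixSort(const std::string & s)
{
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// pairLen[k] = match length between the suffixes at sort positions k and k+1.
// Linear time : walk in text order and carry h forward, as in Kasai et al.
static std::vector<int> NeighborMatchLens(const std::string & s, const std::vector<int> & sa)
{
    int n = (int) s.size();
    assert((int) sa.size() == n);
    std::vector<int> rank(n), pairLen(n > 0 ? n - 1 : 0);
    for (int k = 0; k < n; k++) rank[sa[k]] = k;

    int h = 0;
    for (int i = 0; i < n; i++)
    {
        int k = rank[i];
        if (k > 0)
        {
            int j = sa[k - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            pairLen[k - 1] = h;
            if (h > 0) h--;
        }
    }
    return pairLen;
}

// Match length between sort positions a < b : MIN of the pair lengths between them.
static int MatchLenBetween(const std::vector<int> & pairLen, int a, int b)
{
    assert(0 <= a && a < b && b <= (int) pairLen.size());
    return *std::min_element(pairLen.begin() + a, pairLen.begin() + b);
}
```

On "mississippi" the suffix at text index 1 lands at sort position 3, and the pair length stored at position 2 is the 4 from the example; the suffix at text index 2 lands at sort position 10 and gets 3.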

06-17-10 | Friday Code Review

Today I bring you a piece of my own code that had a nasty bug. This is from my buffered file reader, which I did some fiddling on a while ago and didn't test hard enough (there are now so many ways to use my code that it's really hard to make sure I test all the paths, which is a rant for another day).

Here's the whole routine :

S64  OodleFile_ReadLarge(OodleFile * oof,void * into,S64 count)
{
    // use up the buffer I have :
    U8 * into_ptr = (U8 *)into;
    S64 avail = (S64)( oof->ptrend - oof->ptr );
    S64 useBuf = RR_MIN(avail,count);
    oof->ptr += useBuf;
    into_ptr += useBuf;
    count -= useBuf;
    rrassert( oof->ptr <= oof->ptrend );
    if ( count > 0 )
    {
        rrassert( oof->ptr == oof->ptrend );
        S64 totRead = 0;
        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            S32 read = oof->vfile->Read(into_ptr,curRead);
            if ( read == 0 )
                break; // fail !
            count    -= read;
            into_ptr += read;
        }
        return useBuf + totRead;
    }
    else
    {
        return useBuf;
    }
}

But the problem is here :

        S64 totRead = 0;
        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            S32 read = oof->vfile->Read(into_ptr,curRead);
            count    -= read;
            into_ptr += read;
            // totRead += read;  // doh !!
        }
        return useBuf + totRead;

That's benign enough, but I think this is actually a nice simple example of a habit I am trying to get into :

never use another variable when one of the ones you have will do

In particular this loop could be written without the redundant counter :

        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            S32 read = oof->vfile->Read(into_ptr,curRead);
            count    -= read;
            into_ptr += read;
        }
        // into_ptr was already stepped past the useBuf bytes, so this is the whole count
        return (S64)(into_ptr - (U8 *)into);

Don't step the pointer and also count how far it steps; do one or the other, and compute one from the other. I keep finding this over and over and it's one of the few ways that I still consistently make bugs : if you have multiple variables that need to be in sync, it's inevitable that that will get screwed up over time and you will have bugs.
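
Here is a small self-contained sketch of that habit, not the Oodle code : MemFile is a made-up in-memory stand-in for the real vfile, and kMaxIoSize is tiny on purpose so the loop actually goes around a few times. The only thing stepped is the pointer; the byte count is computed from it at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical in-memory "file"; Read returns the number of bytes read, 0 at end of data.
struct MemFile
{
    const uint8_t * data;
    int64_t size;
    int64_t pos;

    int32_t Read(void * into, int32_t count)
    {
        assert(count >= 0);
        int32_t n = (int32_t) std::min<int64_t>(count, size - pos);
        std::memcpy(into, data + pos, (size_t) n);
        pos += n;
        return n;
    }
};

static const int64_t kMaxIoSize = 4; // stand-in for the max size of one IO call

// Returns the number of bytes actually read, derived from the pointer alone.
static int64_t ReadLarge(MemFile * f, void * into, int64_t count)
{
    uint8_t * into_ptr = (uint8_t *) into;
    while (count > 0)
    {
        int32_t curRead = (int32_t) std::min(count, kMaxIoSize);
        int32_t read = f->Read(into_ptr, curRead);
        if (read == 0)
            break; // end of data
        count    -= read;
        into_ptr += read;
    }
    return (int64_t)(into_ptr - (uint8_t *) into);
}
```

There is no second counter to forget to update, so the class of bug above can't happen here.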

(BTW I just can't help going off topic of my own posts; oh well, here we go : I hate the fact that I have to have C-style casts in my code all over now, largely because of fucking 64 bit. Yeah yeah I could static_cast or whatever fucking thing but that is no better and much uglier to read. On the line with the _MIN , I know that the _MIN of two non-negative values fits in the smaller of the two types, so it's safe, but it's disgusting that I'm using my own semantic knowledge to validate the cast, that's very dangerous. I could use my check_value_cast here but it is tedious sticking that everywhere you do any 32-64 bit casting or pointer-int casting. On the output I could use ptrdiff_t or whatever you portability tards want, but that only pushes the cast to somewhere else (if I make Read return a ptrdiff_t then the outside caller has to turn that into S64 or whatever). I decided it was better if my file IO routines always just work with 64 bit file positions and sizes, even on 32 bit builds. Obviously running on 32 bits you can't actually do a read larger than 4 GB into a single buffer, so I could use size_t or something for the size of the read here, but I hate types that change based on the build, I much prefer to just say "file pos & size is 64 bit always".)

06-17-10 | Investment

I bought a Hattori knife a little while ago. It requires a lot of babying - it has some carbon steel so you have to keep it dry, the steel is quite soft so you have to make sure to only let it touch other soft things, you have to sharpen it with a whetstone every so often. It's an amazing knife, but you really have to invest a lot into it. The funny thing is that the fact that it requires so much work actually makes you more fond of it. It's like the knife needs you, it's great and that greatness comes from the work you put into it.

The Porsche is sort of the same way. I don't know if they actually do this intentionally, maybe they do if they're very clever, but the car is a bit temperamental, you have to baby it a bit, check the oil all the time, let it get up to temperature before thrashing it, etc. Of course that "quirkiness" is really "shittiness", a modern car should just run and not need to be babied, but that investment that you put in actually makes you feel closer to it, it makes you really fond of it.

It's well known that our attachment to children is based on the same principle. You put so much work into making a child that you then feel very committed to it.

Of course this is really all just a classic example of one of the major flaws in human intuitive reasoning - the fallacy of sunk cost. Just because you have previously put in lots of work to your child doesn't mean that you should continue to do so. You should ignore sunk cost and just evaluate future EV based on the current situation. If you're a rational human you should wake up each day and say to yourself "should I keep my kids today?".

BTW this is another rant but I do find it somewhat amusing that the most celebrated features of humanity are basically irrationality. There's only good decision making and bad decision making. Irrational decision making is bad. Anyone who is "emotional" or "sticks to their guns" or is "brave" or "puts their family before everything else" or whatever is a bad decision maker. That's not to say that a rational decision maker will never do anything that might be described as "brave" or wouldn't "put their family first", but they will do it based on considering the consequences of each choice and selecting one; anyone who does it based only on "feeling" is dangerous and not to be trusted.

06-15-10 | The end of ad-hoc computer programming

I think we are in an interesting period where the age of the "guerrilla" "seat of the pants" , ie. ad-hoc, programmer is ending. For the last 30 years, you could be a great programmer without much CS or math background if you had good algorithmic intuition. You could figure out things like heuristics for top-down or bottom-up tree building. You could dive into problems without doing research on the prior art and often make up something competitive. Game programmers mythologize these guys as "ninjas" or "hackers" or various other terms.

That's just about over. The state of the art in most areas of coding is getting so advanced that ad-hoc approaches simply don't cut the mustard. The competition is using an approach that is provably optimal, or provably within X% of optimal, or provably optimal in a time-quality tradeoff sense.

And there are so many strong solutions to problems out there now that cutting your own is more and more just foolish and hopeless.

I wouldn't say that the end of the ad-hoc programmer has come just yet, but it is coming. And that's depressing.

06-15-10 | Top down / Bottom up

"Top down bottom up" sounds like a ZZ Top song, but they are actually two different ways of attacking tree-like problems. Today I am thinking about how to decide where to split a chunk of data for Huffman compression. A quick overview of the problem :

You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes. The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously this lets you adapt to the local statistics better, but costs the space to send those code lengths. (BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes"). You want to pick the split points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).

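The issue is how do you pick the splits? For N characters the number of possible ways to split is 2^(N-1), because each of the N-1 gaps between characters either is a split point or isn't (eg. if N = 3, the ways are [012],[0][12],[01][2],[0][1][2]), so brute force is impossible for any real N.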

The standard way to solve this problem is either top down or bottom up. I always think of these in terms of Shannon-Fano tree building (top down) or Huffman tree building (bottom up).

The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely segregating each unique bit. I also sometimes call this the "greedy" approach.
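
A rough sketch of the top down way, under a made-up cost model : each segment pays a fixed kHeaderBits (a stand-in for sending the code lengths) plus its order-0 entropy (a stand-in for the actual Huffman coded size). Both the constant and the function names are assumptions for illustration; a real coder would measure real code lengths.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

static const double kHeaderBits = 16.0; // hypothetical cost of sending new code lengths

// Estimated bits to code data[begin,end) as its own segment.
static double SegmentCostBits(const std::string & data, size_t begin, size_t end)
{
    size_t counts[256] = { 0 };
    for (size_t i = begin; i < end; i++) counts[(unsigned char) data[i]]++;
    double total = (double)(end - begin);
    double bits = kHeaderBits;
    for (int s = 0; s < 256; s++)
        if (counts[s]) bits += counts[s] * std::log2(total / counts[s]);
    return bits;
}

// Top down : find the single best split of [begin,end); if it beats not splitting,
// take it and recurse on each half. Split positions are appended in increasing order.
static void SplitTopDown(const std::string & data, size_t begin, size_t end,
                         std::vector<size_t> & splits)
{
    assert(begin <= end && end <= data.size());
    double best = SegmentCostBits(data, begin, end);
    size_t bestPos = 0; // 0 is never a valid split, so it doubles as "none found"
    for (size_t p = begin + 1; p < end; p++)
    {
        double c = SegmentCostBits(data, begin, p) + SegmentCostBits(data, p, end);
        if (c < best) { best = c; bestPos = p; }
    }
    if (bestPos == 0) return; // no single split pays for its header : greedy stops here
    SplitTopDown(data, begin, bestPos, splits);
    splits.push_back(bestPos);
    SplitTopDown(data, bestPos, end, splits);
}
```

The early return is exactly where the "hidden splits" failure lives : if no single split pays for itself, the recursion never gets to look one level deeper.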

So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.
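
A rough sketch of the bottom up way, under the same kind of made-up cost model (a fixed hypothetical kHeaderBits per group plus order-0 entropy as a stand-in for Huffman coded size). One more assumption : since we are splitting an array, only adjacent groups are candidates to merge.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

static const double kHeaderBits = 16.0; // hypothetical cost of sending new code lengths

// Estimated bits to code data[begin,end) as its own group.
static double SegmentCostBits(const std::string & data, size_t begin, size_t end)
{
    assert(begin <= end && end <= data.size());
    size_t counts[256] = { 0 };
    for (size_t i = begin; i < end; i++) counts[(unsigned char) data[i]]++;
    double total = (double)(end - begin);
    double bits = kHeaderBits;
    for (int s = 0; s < 256; s++)
        if (counts[s]) bits += counts[s] * std::log2(total / counts[s]);
    return bits;
}

// Bottom up : start with each run of identical characters as a group, then keep
// merging the adjacent pair with the biggest gain until no merge is beneficial.
// Returns the surviving split positions.
static std::vector<size_t> SplitBottomUp(const std::string & data)
{
    if (data.empty()) return std::vector<size_t>();

    // group k is data[bounds[k], bounds[k+1])
    std::vector<size_t> bounds;
    for (size_t i = 0; i < data.size(); i++)
        if (i == 0 || data[i] != data[i - 1]) bounds.push_back(i);
    bounds.push_back(data.size());

    for (;;)
    {
        double bestGain = 0.0;
        size_t bestK = 0; // index of the boundary to remove; 0 means none found
        for (size_t k = 1; k + 1 < bounds.size(); k++)
        {
            double separate = SegmentCostBits(data, bounds[k - 1], bounds[k])
                            + SegmentCostBits(data, bounds[k], bounds[k + 1]);
            double merged = SegmentCostBits(data, bounds[k - 1], bounds[k + 1]);
            double gain = separate - merged;
            if (gain > bestGain) { bestGain = gain; bestK = k; }
        }
        if (bestK == 0) break;
        bounds.erase(bounds.begin() + bestK);
    }
    return std::vector<size_t>(bounds.begin() + 1, bounds.end() - 1);
}
```

This rescans every boundary after each merge, which is fine for a sketch; a real implementation would keep the candidate gains in a heap and only update the neighbors of the merged group.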

In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't think there is any strong reason why that should be true.

There is a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature (actually this is identical to category tree building in machine learning, since entropy is a pretty good way to rate the quality of category trees, and an "overfit" penalty for making too many splits is equivalent to the code cost transmission penalty here). One of the good tricks if using the top-down approach is to do an extra "non-greedy" step. When you hit the point where no more splits are beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get past the local minimum, eg. the classic case is trying to build a kd-tree for this kind of data :

  X  O

  O  X

Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you find a perfect tree.

Anyway, because this class of problems can't be solved exactly in any practical amount of time, most of the techniques are just heuristic and ad-hoc. For example in many cases it can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been a lot of great papers in the last 5 years out of the interactive ray tracing community.

06-15-10 | Neighbors

I'm really fucking sick of having neighbors. I hate when one neighbor smokes next door and the smell comes in my window. I hate when their dog yaps. I hate when they run the lawn-mower or leaf blower. I hate the fucking hippy hypocrites who give me the passive-aggressive cold shoulder. I hate the constant home renovation cock munchers who like to hammer a little bit each day.

I really don't like interacting with human beings very much. I wish I could have sanctuary where I was alone in my own environment with no fucking headaches and annoyances from outside forces.

Sometimes I think about moving out to the country so I can get a bunch of land and not have to see any neighbors, but my god country people are fucking unbearable. I don't mind the fact that they have very simple tastes - cheap beer, country music, muscle cars - but they're just rotten human beings, they are not generous of spirit and wide of mind. They give you dirty looks if you're an outsider; if you tell them their joke about date rape is offensive they say "what are you a fag or somethin?". They're the worst. Yes I'm talking about you Pomeroy.

Of course there are the country towns that have been colonized by retirees or "artists", liberal types from the city who now run a bead boutique and give dirty passive aggressive looks to anyone who hasn't made their house "quaint" enough. They're almost as bad as the real country folk.

N and I took a trip out to Eastern WA over the weekend and saw many striking and beautiful things. It gets bloody well hot and dry out there so I think it was just about the last chance to do it in comfort, so I'm glad we did. Probably best to go out there in May, which is their spring. We still got a lot of fresh spring greenery and a lot of wild flowers, but I do think we were a week or two late. We have an amazing knack of finding surprising off-the-map beautiful things together, it's almost magical, like we will suddenly decide to take some side road that we hadn't noticed before and suddenly we are in a strange different world of purple flowers and intense wind and rippling alfalfa.

I really despise commuting. For a while there the novelty of the new car made it okay again, but it's back to being just awful. The past few weeks I've been working mainly from home because I had a nice productive spurt and just wanted to go with it and knock out some code. Commuting just puts me in such a foul mood that I lose hours of productivity afterward before I can simmer down and think again, and then when I come home it sets me off again, such that when I get home N usually asks "what's wrong with you?" , oh I just fucking hate everyone and I'm really depressed about how fucking shitty humanity is because I had to drive with them, that's all.

I just don't know if I can function in this world. Sometimes I think about buying a house and having a stable job and a family and all that, but that life involves working regular hours and commuting and shit like that and I just don't think I can do it.

06-12-10 | Some Bike Notes

I found this interesting article on Optimizing Tire Pressure Drop (PDF) . Now I certainly do agree that too many people in the quest for "performance" (both on cars and bikes) think that stiff ride = fast. With that in mind they over-inflate their bike tires, and use too narrow bike tires, and on cars they use over-size rims on tires without enough rubber. Anyway I'm not sure I believe the article, because if I follow it, it suggests really crazy pressures :

I'm about 200 pounds.  Assuming the 60/40 weight distribution is right, that's 120 on the rear and 80 pounds
on the front.

My lightspeed has 23 mm tires front and rear, so the pressures should be roughly 80 psi front and 120 psi rear.
(it's roughly 1:1 for 23 mm which is very unclear from the stupid way they draw the graph)

My city bike has 25 mm tire rear and 28 mm tire front.  The pressures should be :
rear : 110 psi, front : 60 psi

I've been running something like 115 rear, 100 front on my lightspeed and 110 rear, 80 front on my city bike, so according to this article I need to drop the front pressure a lot more.

One problem I have with this analysis is that it assumes your weight distribution is static. In reality the 60/40 is only when I'm seated. When I am braking downhill my weight transfers forward a lot and with these low pressures the front tire will mush.

I've been having a lot of trouble on my blue bike because of the wheels I bought from Pricepoint. They're good components (Shimano hubs on Mavic Open Pro rims) but I think the spokes are shitty and the lacing job was done wrong somehow, because they keep coming loose; I tighten and true them, and then a week later they're loose again. Contrary to the popular press, I have found that the "pre-built" wheelsets (like Mavic Cosmos or whatever) are excellent quality and hold up great, but the "hand built" wheels that are supposed to be so superior are only as good as the build-up, which if you buy from some cheapo online place is probably not very good.

Anyway, because of my wheel trouble I swapped out my rear wheel for another that I have sitting around with an 8 speed cassette on it (my bike is normally a 10 speed). I switched the downtube lever to friction shifting (instead of index) and off you go. Friction shifting on these many-speed modern bikes is sort of interesting; the cogs are so close together and the shift is so smooth that it almost feels like a completely analog shift. You just move the lever and it silently slips into a very slightly different gear. You can just dial the lever to the gear ratio you want like an analog slider. I wonder when we will have continuous transmissions for bicycles; you could imagine just having a single cog with a ratcheting mechanism to get bigger or smaller, or perhaps more realistically a cone gear with a belt drive like some of the early CVT's on cars (a guide holds the belt to one part of the cone which sets the gear).

Anyway, friction shifting for 8+ gears is not awesome. On flat ground with no load it's fine, but if your bike has any flex at all, it will cause you to change gears when you stand up and dig up a hill, the gears are just too close together to avoid hop. I don't think you can friction shift past 6, maybe 7 gears.

IMO 8 gears was probably the perfect amount. I know some purists will say 5 was enough. Eh, not really. 5 is plenty if you are on flat terrain, sure, but for varied terrain you do want some very small gears, and also some big ones for flats (though I agree with Dave Moulton that the very big gears most bikes come with these days are mostly pointless; sure it's fun to go 30 mph on a downhill, but you could go 27 mph and it would be almost as good). At 8 you can have enough range and also fine enough steps in the "money zone" where you spend most of your time. Beyond 8, there's little gain from the additional gears, and you start having more problems because the cogs are so very close together : it's more finicky about having the index adjustment just right, if things aren't right it's easier to get hops into the wrong gear, and of course the chain has to be thinner and weaker.

06-10-10 | Ranting

I really hate programmers who whine about other people's code or the compiler or the OS or whatever, always blaming other people for their woes. 99% of the time their own code would not stand up to the standards that they are holding others to, and lots of the time the thing they are whining about is actually because of their own bad usage of a perfectly good system.

That said, I will now do some ranting.

Fucking external code and your use of unsigned ints! I've tracked down the last two (*) of my nasty bugs in porting to 64 bit and both were because of people using unsigned ints for no good reason. (* = I think; you can never know that you found your last bug, much like you can never prove a physical theory right).

One was in the BMP reader code that I stole from MSDN years ago after getting fed up with not being able to load some weird formats (RLE4 or whatever). (BTW this is another rant - every god damn video game codebase I've ever used has some broken ass home-rolled BMP reader in it that only handles the most friendly and vanilla BMP formats; fucking don't roll your own broken ass shit when there are perfectly good fully functional ones you can take). Anyway, because it's MS code it uses the awful DWORD and UINT and all that shit. It actually had a number of little bugs that I fixed over the years, such as this one :

<              dwSrcInc = ((dwWidth >> 3) + 3) & ~3;
->             dwSrcInc = (((dwWidth+7) >> 3) + 3) & ~3;

but the nasty one was the inherent type of dwSrcInc. BMP's of course can scan up or down so there's some code that does :

if ( normal ) dwSrcInc = pitch;
else dwSrcInc = - pitch;

Looks good, right? Well, sort of. Of course dwSrcInc is a DWORD. What happens when you put a negative in there ? Yeah, you're getting it. We then update the pointer like :

ptr += dwSrcInc;

Well, this works out the way you want if the pointer is 32 bits, because the pointer math is done like an unsigned int add and it wraps. But when the pointer is 64 bits, you're adding not quite 4 billion to your pointer. No good.
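
To see the numbers, here is a little sketch that does the address math on plain integers instead of real pointers (so the wraparound is well defined). The function names are made up; Ptr stands for the pointer-sized integer, uint32_t for a 32 bit build and uint64_t for a 64 bit build.

```cpp
#include <cassert>
#include <cstdint>

// The buggy way : the negative pitch is stored in a 32 bit unsigned (like a DWORD)
// and then added to the address.
template <typename Ptr>
static Ptr StepWithUnsignedInc(Ptr address, int32_t pitch)
{
    assert(pitch > 0); // the sketch only models stepping backward by a positive pitch
    uint32_t dwSrcInc = (uint32_t)(-pitch);
    return (Ptr)(address + dwSrcInc);
}

// The fixed way : keep the increment signed and wide enough before adding.
template <typename Ptr>
static Ptr StepWithSignedInc(Ptr address, int32_t pitch)
{
    assert(pitch > 0);
    int64_t srcInc = -(int64_t) pitch;
    return (Ptr)(address + (Ptr) srcInc);
}
```

With a 32 bit address, StepWithUnsignedInc wraps around and lands on address - pitch by luck. With a 64 bit address the increment is zero-extended, so you land at address + 2^32 - pitch instead.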

While I'm at it, this is the list of nasty not-obvious stuff to watch out for that I made :

64-bit dangers :

use of C-style casts, like :

    int x = (int) pointer;

use of hard-coded sizes in mallocs :

    void ** ptrArray = malloc( 4 * count );

structs with pointers that are IO'd as bytes
    or otherwise converted to bytes as in unions etc

use of 0xFFFFFFFF and similar

use of (1<<31) as a flag bit in pointers

struct packing changes

ptrdiff_t is signed
size_t is unsigned !!

unions of mismatched sizes do not warn !!
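
For the first two items and the signedness note, a sketch of what the pointer-size-safe versions might look like. The helper names are made up for illustration; uintptr_t, ptrdiff_t and size_t are the standard types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>

// instead of : int x = (int) pointer;
// keep the value in an integer type that is pointer sized on every build
static uintptr_t PointerToInteger(const void * p)
{
    return (uintptr_t) p;
}

// instead of : malloc( 4 * count );
// take the element size from the type so it follows the pointer size
static void ** AllocPointerArray(size_t count)
{
    assert(count > 0);
    return (void **) std::malloc( sizeof(void *) * count );
}

// ptrdiff_t is signed, size_t is unsigned; mixing them in compares is where it bites
static_assert(std::is_signed<ptrdiff_t>::value, "ptrdiff_t is signed");
static_assert(std::is_unsigned<size_t>::value, "size_t is unsigned");
static_assert(sizeof(uintptr_t) >= sizeof(void *), "uintptr_t holds a pointer");
```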

06-07-10 | Unicode CMD Code Page Checkup

I wrote a while ago about the horrible fuckedness of Unicode support on Windows :

cbloom rants 06-21-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-14-08 - 3

Part of the nastiness was that in Win32 , command line apps get args in OEM code page, but most Windows APIs expect files in ANSI code page. (see my pages above - simply doing OEM to ANSI conversions is not the correct way to fix that) (also SetFileAPIsToOEM is a very bad idea, don't do that).

Here is what I have figured out on Win64 so far :

1. CMD is still using 8 bit characters. Technically they will tell you that CMD is a "true unicode console". eh, sort of. It uses whatever the current code page is set to. Many of those code pages are destructive - they do not preserve the full unicode name. This is what causes the problem I have talked about before of the possibility of having many unicode names which show up as the same file in CMD.

2. "chcp 65001" changes your code page to UTF-8 which is a non-destructive code page. This only works with some fonts that support it (I believe Lucida Console works if you like that). The raster fonts do not support it.

3. printf() with unicode in the MSVC clib appears to just be written wrong; it does not do the correct code page translation. Even wprintf() passed in correct unicode file names does not do the code page translation correctly. It appears to me they are always converting to ANSI code page. On the other hand, WriteConsoleW does appear to be doing the code page translation correctly. (as usual the pedantic rule-morons will spout a bunch of bullshit about the fact that printf is just fine the way it is and it doesn't do translations and just passes through binary; not true! if I give it 16 bit chars and it outputs 8 bit chars, clearly it is doing translation and it should let me control how!)

expected : printf with wide strings (unicode) would do translation to the console code page 
(as selected with chcp) so that  characters show up right.
(this should probably be done in the stdout -> console pipe)

observed : does not translate to CMD code page, appears to always use A code page even when
the console is set to another code page

4. CMD has a /U switch to enable unicode. This is not what you think, all it does is make the output of built-in commands unicode. Note that your command line apps might be piped to a unicode text file. To handle this correctly in your own console app, you need to detect that you are being piped to unicode and do unicode output instead of converting to console CP. Ugly ugly.

5. CMD display is still OEM code page by default. In the old days that was almost never changed, but nowadays more people are in fact changing it. To be polite, your app should use GetConsoleOutputCP() , you should *NOT* use SetConsoleOutputCP from a normal command line app because the user's font choice might not support the code page you want.

6. CMD argc/argv argument encoding is still in the console code page (not unicode). That is, if you run a command line app from CMD and auto-complete to select a file with unicode name, you are passed the code page encoding of that unicode name. (eg. it will be bytewise identical to if you did FindFirstW and then did UnicodeToOem). This means GetCommandLineW() is still useless for command line apps - you cannot get back to the original unicode version of the command line string. It is possible for you to get started with unicode args (eg. if somebody runs you from CreateProcessW), in which case GetCommandLineW actually is useful, but that is so rare it's not really worth worrying about.

expected : GetCommandLineW (or some other method) would give you the original full unicode arguments
(in all cases)

observed : arguments are only available in CMD code page

7. If I then call system() from my app with the CMD code page name, it fails. If I find the Unicode original and convert it to Ansi, it is found. It appears that system() uses the ANSI code page (like other 8-bit file apis). ( system() just winds up being CreateProcess ). This means that if you just take your own command line that called you and do the same thing again with system() - it might fail. There appears to be no way to take a command line that works in CMD and run it from your app. _wsystem() seems to behave well, so that might be the cleanest way to proceed (presuming you are already doing the work of promoting your CMD code page arguments to proper unicode).

repro : write an app that takes your own full argc/argv array, and spawn a process with those same args 
(use an exe name that involved troublesome characters)

expected : same app runs again

observed : if A and Console code pages differ, your own app may not be found

8. Copy-pasting from CMD consoles seems to be hopeless. You cannot copy a chunk of unicode text from Word or something and paste it into a console and have it work (you would expect it to translate the unicode into the console's code page, but no). eg. you can't copy a unicode file name in explorer and paste it into a console. My oh my.

repro : copy-paste a file name from (unicode) Explorer into CMD

expected : unicode is translated into current CMD code page and thus usable for command line arguments

observed : does not work

9. "dir" seems to cheat. It displays chars that aren't in the OEM code page; I think they must be changing the code page to something else to list the files then changing it back (their output seems to be the same as mine in UTF8 code page). This is sort of okay, but also sort of fucked when you consider problem #8 : because of this dir can show file names which will then not be found if you copy-paste them to your command line!

repro : dir a file with strange characters in it.  copy-paste the text output from dir and type 
"dir <paste>" on the command line

expected : file is found by dir of copy-paste

observed : depending on code page, the file is not found

So far as I can tell there is no API to tell you the code page that your argc/argv was in. That's a pretty insane omission. (hmm, it might be GetConsoleCP , though I'm not sure about that). (I'm a little unclear about when exactly GetConsoleCP and GetConsoleOutputCP can not be the same; I think the answer is they are only different if output is piped to a file).

I haven't tracked down all the quirks yet, but at the moment my recommendation for best practices for command line apps goes like this :

1. Use GetConsoleCP() to find the input CP. Take your argc/argv and match any file arguments using FindFirstW to get the unicode original names. (I strongly advise using cblib/FileUtil for this as it's the only code I've ever seen that's even close to being correct). For arguments that aren't files, convert from the console code page to wchar.

2. Work internally with wchar. Use the W versions of the Win32 File APIs (not A versions). Use the _w versions of clib FILE APIs.

3. To printf, either just write your own printf that uses WriteConsoleW internally, or convert wide char strings to GetConsoleOutputCP() before calling into printf.
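
A rough sketch of step 3, not the cblib code : PrintWide is a made-up helper that uses WriteConsoleW when stdout really is a console, and otherwise converts to GetConsoleOutputCP() itself and writes bytes. It does not try to handle the "piped to a unicode file" case from the /U note. The non-Windows branch exists only so the sketch compiles elsewhere, and relies on the current C locale for the conversion.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cwchar>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper : print a wide string, translating to the console code page.
static bool PrintWide(const wchar_t * str)
{
    assert(str != nullptr);
    size_t len = std::wcslen(str);
    if (len == 0) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode))
    {
        // stdout is a real console : let the console do the translation
        DWORD written = 0;
        return WriteConsoleW(out, str, (DWORD) len, &written, NULL) != 0;
    }
    // stdout is redirected : convert to the console output code page ourselves
    UINT cp = GetConsoleOutputCP();
    int bytes = WideCharToMultiByte(cp, 0, str, (int) len, NULL, 0, NULL, NULL);
    if (bytes <= 0) return false;
    std::vector<char> buf((size_t) bytes);
    WideCharToMultiByte(cp, 0, str, (int) len, buf.data(), bytes, NULL, NULL);
    return std::fwrite(buf.data(), 1, buf.size(), stdout) == buf.size();
#else
    // fallback for other platforms, just so the sketch builds and runs
    std::vector<char> buf(len * MB_CUR_MAX + 1);
    size_t bytes = std::wcstombs(buf.data(), str, buf.size());
    if (bytes == (size_t) -1) return false;
    return std::fwrite(buf.data(), 1, bytes, stdout) == bytes;
#endif
}
```

GetConsoleMode failing on the stdout handle is the usual way to notice that output has been redirected, which is why it is used as the test here.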

For more information :

Console unicode output - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
windows cmd pipe not unicode even with U switch - Stack Overflow
Unicode output on Windows command line - Stack Overflow
INFO SetConsoleOutputCP Only Effective with Unicode Fonts
GetConsoleOutputCP Function (Windows)
BdP Win32ConsoleANSI

Addendum : I've updated cblib and chsh with new versions of everything that should now do all this at least semi-correctly.

BTW a semi-related rant :

WTF are you people who define these APIs not actually programmers? Why the fuck is it called "wcslen" and not "wstrlen" ? And how about just fucking calling it strlen and using the C++ overload capability? Here are some sane ways to do things :

typedef wchar_t wchar; // it's not fucking char_t

// yes int, not size_t mutha fucka
int wstrlen(const wchar * str) { return wcslen(str); }
int strlen(const wchar * str) { return wstrlen(str); }

// wsizeof replaces sizeof for wchar arrays
#define wsizeof(str)    (sizeof(str)/sizeof(wchar))
// best to just always use countof for strings instead of sizeof
#define countof(str)	(sizeof(str)/sizeof(str[0]))

fucking be reasonable to your users. You ask too much and make me do stupid busy work with the pointless difference in names between the non-w and the w string function names.

Also the fact that wprintf exists and yet is horribly fucked is pretty awful. It's one of those cases where I'm tempted to do a

#define wprintf wprintf_does_not_do_what_you_want

kind of thing.

06-07-10 | Exceptions

Small note about exceptions for those who are not aware :

x64 totally changes the way exceptions on the PC are implemented. There are two major paradigms for exceptions :

1. Push/pop tracking and unwinding. Each time you enter a function some unwind info is pushed, and when you leave it is popped. When an exception fires it walks the current unwind stack, calling destructors and looking for handlers. This makes executables a lot larger because it adds lots of push/pop instructions, and it also adds a speed hit everywhere even when exceptions are never thrown because it still has to push/pop all the unwind info. MSVC/Win32 has done it this way. ("code driven")

2. Table lookup unwinding. A table is made at compile time of how to unwind each function. When an exception is thrown, the instruction pointer is used to find the unwind info in the table. This is obviously much slower at throw time, but much faster when not throwing, and also doesn't bloat the code size (if you don't count the table, which doesn't normally get loaded at all). I know Gcc has done this for a while (at least the SN Systems version that I used a while ago). ("data driven")

All x64 code uses method #2 now. This is facilitated by the new calling convention. Part of that is that the "frame pointer omission" optimization is forbidden - you must always push the original stack and return address, because that is what is walked to unwind the call sequence when a throw occurs. If you write your own ASM you have to provide prolog/epilog unwind info, which goes into the unwind table to tell it how to unwind your stack.
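Whichever scheme the compiler uses, the observable behavior it has to produce is the same: when a throw happens, destructors of live locals run from the innermost frame outward until a handler is found. A small sketch of that (all names here are made up for illustration) :

```cpp
#include <stdexcept>
#include <string>

std::string gLog;

// logs its id on construction and "~id" on destruction
struct Tracer
{
    explicit Tracer(char id) : m_id(id) { gLog += m_id; }
    ~Tracer() { gLog += '~'; gLog += m_id; }
    char m_id;
};

void Inner() { Tracer c('c'); throw std::runtime_error("fail"); }
void Outer() { Tracer a('a'); Tracer b('b'); Inner(); }

std::string RunUnwindDemo()
{
    gLog.clear();
    try { Outer(); }
    catch (const std::runtime_error &) { gLog += '!'; } // handler runs after the unwind
    return gLog;
}
```

RunUnwindDemo() returns "abc~c~b~a!" : three constructions, then the unwind destroys c, b, a in that order before the handler runs. The two implementation styles differ only in where the "what needs destroying at this point" information lives: pushed at runtime, or looked up in a table by instruction pointer.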

This has been rough and perhaps slightly wrong. Read here for more :

X64 Unwind Information - FreiK's WebLog - Site Home - MSDN Blogs
Uninformed - vol 4 article 1
Introduction to x64 debugging, part 3 « Nynaeve
Introduction to x64 debugging, part 2 « Nynaeve
Exception Handling (x64)

BTW many people (Pierre, Jeff, etc.) think the overhead of Win32's "code driven" style of exception handling is excessive. I agree to some extent, however, I don't think it's terribly inefficient. That is, it's close to the necessary amount of code and time that you have to commit if you actually want to catch all errors and be able to handle them. The way we write more efficient code is simply by not catching all errors (or not being able to continue or shut down cleanly from them). That is we decide that certain types of errors will just crash the app. What I'm saying is if you actually wrote error checkers and recovery for every place you could have a failure, it would be comparable to the overhead of exception handling. (ADDENDUM : nah this is just nonsense)

06-07-10 | Cortazar

Playing with form is not really interesting in and of itself. If you do it well, it can enhance the story, but if you do it badly it can be a distraction, or just cheezy. I'm sure they think they're very clever when they come up with it - tell the story backwards in time, or tell it from the point of view of the protagonist's hat, or whatever, but no it's really not very clever, anybody can come up with that shit. The question I always ask myself, is "if this was told as just a straightforward narrative, would it be interesting?". The answer with Cortazar is no.

I'm always amazed that novelists actually finish their novels. Especially conceptual works. If it's a page-flipper story novel, then the story actually carries you (the writer) along with it, you're excited to see what happens next just like the reader is. But when it's just a dry game in form, I would get bored of writing it after 5 pages.

06-07-10 | Coding Style

I sometimes feel like I live in my own coding world, where I have these strange ideas that nobody else agrees with.

Pierre wrote a pretty good blog post a while ago (which I can't find now grrr) about how the standard maxim to "optimize late & optimize only hot spots" can be wrong. In particular, the general bloat of inefficiency everywhere can be a huge drag and not provide any easy targets. I agree with that to some extent (certainly at OddWorld we had the problem of some nasty generalized speed hit from various small inefficiencies). However, the opposite is also true - optimizing early or micro-optimizing can really make you do stupid things. I've written before about how it can get you into local minima traps where you don't see a better algorithm because your current bad one is so tweaked. One of the worst things you can do is super-optimize a bad algorithm. And on the flip side, it can actually be a huge boon to your coding if your low level stuff is really slow.

I was reminded of this recently when I got this introduction to x64 ASM with an example Huffman decoder. Despite lots of work in ASM, this Huffman decoder is pathetically bad, it actually walks through node pointers to decode, which is just about the worst possible way to do it. I don't mean to pick on that guy, it's example code and it's actually really nice example code, it was super useful to me in my x64 learnings, but I've seen so many bad Huffman ASM implementations, it has to be one of the all-time great examples of foolish premature optimization of bad algorithms. If you had a really slow bit input routine, you might be motivated to work on the *algorithm* to avoid bit inputs as much as possible. Then you would come up with what all the smart people do, which is to read ahead big chunks of bits and use a table to decode symbols directly. Usually the best way to optimize a slow function is to not call it. (I don't actually have a good reference on fast Huffman, but I did write a tiny bit earlier).
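To make "read ahead a chunk of bits and use a table" concrete, here is a minimal sketch. The 4-symbol code (A=0, B=10, C=110, D=111) and all the names are made up for illustration; a real decoder would build the table from the transmitted code lengths and refill the bit buffer as it goes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct TableEntry { char symbol; uint8_t length; };

static const int kMaxLen = 3; // longest code length in this toy code

// Fill a 2^kMaxLen table : every kMaxLen-bit pattern that begins with a
// given code maps to that code's symbol and length.
std::vector<TableEntry> BuildTable()
{
    struct Code { char symbol; uint32_t bits; uint8_t length; };
    const Code codes[] = { {'A',0x0,1}, {'B',0x2,2}, {'C',0x6,3}, {'D',0x7,3} };
    std::vector<TableEntry> table(1u << kMaxLen);
    for (const Code & c : codes)
    {
        uint32_t first = c.bits << (kMaxLen - c.length);
        uint32_t count = 1u << (kMaxLen - c.length);
        for (uint32_t i = 0; i < count; i++)
            table[first + i] = TableEntry{ c.symbol, c.length };
    }
    return table;
}

// bitBuffer holds the code bits MSB-first, left-aligned, zero padded.
std::string DecodeSymbols(uint64_t bitBuffer, int numSymbols,
                          const std::vector<TableEntry> & table)
{
    std::string out;
    for (int i = 0; i < numSymbols; i++)
    {
        uint32_t peek = (uint32_t)(bitBuffer >> (64 - kMaxLen)); // look at kMaxLen bits
        const TableEntry & e = table[peek];                      // one lookup per symbol
        out += e.symbol;
        bitBuffer <<= e.length;                                  // consume only the code's bits
    }
    return out;
}
```

The bit string 0 10 110 111 0 is 0x16E in ten bits, so `DecodeSymbols((uint64_t)0x16E << 54, 5, BuildTable())` gives "ABCDA". There is one table lookup and one shift per symbol, and no per-bit branching or pointer chasing.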

Of course there is also the flip side. Sometimes a slow underlying function can cause you to waste lots of time on optimization that's not really necessary. Probably the worst culprit I see of this is allocations. I see people doing masses of work to remove allocations and often it just doesn't make any sense. A small block alloc these days is around 20-50 clocks, often less than a divide. There are good reasons to remove allocs (mainly if you are on a low memory system and want very predictable memory use patterns), but speed is not one of them, and people who are using a very bad malloc back end with a malloc that is hundreds or thousands of clocks are just giving themselves a problem that isn't real.

Browsing around the web another one struck me. I tend to be *very* careful in the way I write code (compared to my game developer compadres anyway, I suppose I am wild and reckless compared to many of the systems devs). I used to be somewhat of a specialist in saving bad code bases, and in that spot I really hate to even try to work with the code I'm given. The first thing I do is start tracing through runs and see what's really happening, and then I start adding comments and asserts everywhere. Then I start wrapping up common functionality into classes that encapsulate something and enforce a rule so that I can be *sure* that what everyone thinks is happening really is happening. When I see code that does not *prove* to me that it's working right, I assume it's not working right.

This carefulness is multiplied many fold when it comes to threading. I am hyper careful and don't trust myself for the most basic things. I see a lot of people write simple threading code and do it without much in the way of comments or helper functions because they are "smart" and they can tell what's happening and don't need helpers. They just do some interlocked ops to coordinate the threads and then maybe do some non-protected accesses when they "know" they can't have races, etc. That's awesome until you have a mysterious problem like this. You could blame the min() for doing something weird, or blame the compiler for not being nice enough with its volatile function, but IMO this is the type of thing that would never happen if the code was written more carefully.

Even when I'm at my most lazy, instead of this :

    // pull available work with a limit of 5 items per iteration
    LONG work = min(gAvailable, 5);

I would write this :

    LONG avail = gAvailable;
    LONG work = min(avail, 5);

but this is also an example where the hated "volatile" is showing its ugliness. I hate that the "volatile" is far away on the variable decl and not right here where I need it. Sometimes I would write :

    LONG avail = *((volatile LONG *)&gAvailable);
    LONG work = min(avail, 5);

but even better if I had my threading helpers I would do something like :

    LONG avail = LoadShared_Relaxed(&gAvailable);
    LONG work = min(avail, 5);

where I explicitly load the shared variable using "Relaxed" memory ordering semantics (actually I don't know the usage case, but this should probably have been Acquire memory semantics). LoadShared_Relaxed is nothing but *ptr , so many people don't see the point of having a function call there at all, but it makes it absolutely clear what is happening. It also just makes it more verbose to touch the shared variable, which encourages you to load it out into a local temporary, which is good.

Another option which I often use is to make a Shared<> template , so gAvailable would be Shared< LONG > gAvailable. Then you have to access it with members like LoadRelaxed() or StoreRelease() , etc.
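A rough sketch of what such a Shared<> wrapper could look like. This version sits on top of std::atomic, which is an assumption on my part (the helpers described above are hand-rolled), and `PullWork` is a hypothetical caller.

```cpp
#include <algorithm>
#include <atomic>

// Wrapper that forces every access to name its memory ordering.
template < typename T >
class Shared
{
public:
    explicit Shared(T initial) : m_value(initial) { }

    T LoadRelaxed() const { return m_value.load(std::memory_order_relaxed); }
    T LoadAcquire() const { return m_value.load(std::memory_order_acquire); }
    void StoreRelease(T v) { m_value.store(v, std::memory_order_release); }

private:
    std::atomic<T> m_value; // no implicit conversions leak out of the class
};

Shared<long> gAvailable(12);

// pull available work with a limit of 5 items per iteration
long PullWork()
{
    long avail = gAvailable.LoadAcquire(); // exactly one load, into a local
    return std::min(avail, 5L);
}
```

Because there is no operator T, something like min(gAvailable, 5) simply does not compile; you are forced to load into a local first, which is the whole point.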

I treat threading code like a loaded gun. I don't point it at anyone's face, I store it without bullets, etc. You take these precautions not because they are absolutely necessary, but because they ensure you don't have bad surprises.

A lot of the times in rebuttal, people will show me their unsafely written code and say "well this works fine, tell me what's wrong with it". Umm, no, you don't understand the issue at all my friend. The point is that with unsafely written code, I have to use my brain to figure out whether it is working or not, and as we have seen, that is *very* very hard. Especially with threading and races. In fact the only way I could do it with any level of confidence is to instrument it and run it through Relacy or one of the other automatic race detectors.

Maybe I miss managing people and getting to pick on their code, so now I'm picking on random code from around the internet. Actually that might be fun, do a weekly code review of some random code I grab from the web ;)

06-02-10 | Words that are changing meaning

"Deprecated" :

The primary meaning is "belittled, insulted, strongly disapproved of". Old dictionaries will only have this meaning. I contend that this word is basically never used in this way anymore. (the new meaning is "obsolete / superseded by something else and further usage is discouraged" ).

"Pander" :

I was shocked to find that the primary definition of pander in most dictionaries is "Noun : Pimp" , and the primary verb meaning is "to act as a pimp ; eg. to provide for the sexual gratification of others". I don't know when that was the meaning but it sure isn't now. I contend that 99% of people would say the meaning of pander is to "cater to the desires of the audience or target group, without scruples" , as in politicians pandering to the NAARP or filmmakers pandering to retarded people who want sex and action. (I want sex and action, just not retarded sex and action).

06-02-10 | 64 bit transition

The broke ass 64/32 transition is pretty annoying. Some things that MS could/should have done to make it easier :

Don't sell Win7-32 for desktops. If you go to most PC makers (Dell, Lenovo, etc.) and spec a machine, the default OS will be Win7 - 32 bit. I'm sure lots of people don't think about that and just hit okay. That fucking sucks, they should be encouraging everyone to standardize to 64 bit.

Give me a 32+64 combined exe. This is trivial and obvious, I don't get why I don't have it. For most apps there's no need for a whole separate Program Files (x86) / Program Files install dir nonsense. Just let me make a .EXE that has the x86 and the x64 exes in it, and execute the right one. That way I can just distribute my app as a single EXE and clients get the right thing.

Let me call 32 bit DLL's from 64 bit code. As usual people write a bunch of nonsense about why you can't do it that just isn't true. I can't see any real good reason why I can't call 32 bit DLL's from 64 bit code. There are a variety of solutions to the memory addressing range problem. The key thing is that the memory addresses we are passing around are *virtual* addresses, not physical, and each app can have different virtual address page tables. You could easily make the 32-bit to 64-bit thunk act like a process switch, which swaps the page tables. It's perfectly possible for an app with 32 bits of address space to access 64 bits of physical memory (of course this is exactly what 32 bit apps running on 64 bit windows do). Okay, that would be pretty ugly, but there's a very simple solution - reserve the lower 4G of virtual address space in 64 bit apps for communication with 32 bit apps. That is, make all the default calls to HeapAlloc go to > 4G , but give me a flag to alloc in the lower 4 G. Then if I want to call to a 32 bit DLL I have to pass it memory from the lower 4G space. Yes obviously you can't just build your old app and attach to a 32 bit DLL and have it just work, but it would still give me a way to access it when I need it (and there are plenty of other reasons why you wouldn't be able to just link in the 32 bit dll without thinking, eg. the sizes of many of your types have changed).

Right now if I have some old 32 bit DLL that I can't recompile for whatever reason and I need to get to it from my 64 bit app, the recommended solution is to make a little 32 bit server app that calls the DLL for me and use interprocess communication with named memory maps to give its data to the 64 bit app. That's okay and all but they easily could have packaged that into a much easier thing to do.

This code project article is actually one of the best I've seen on the transition. It reminds me of one of the weird broke-ass things I've seen : some of my favorite command line apps are still old 32 bit builds. They still work ... mostly. The problem is that they go through the fucking WOW64 directory remapping, so if you try to do anything with them in the Windows system directories you can get *seriously* fucked up. I really don't like the WOW64 directory remap thing. It only exists to fix broke-ass apps that manually look for stuff in "c:\program files" or whatever. I feel it could have been off by default and then turned on only for exe's that need it. I understand their way makes most 32 bit apps work out of the box which is what most people want, so that all makes sense and is fine, but it is nasty that my perfectly good 32 bit apps don't actually get to see the true state of the machine for no good reason.

To give an example of the most insane case : I use my own "dir" app. Before I rebuilt it, I was using the old 32 bit exe version. It literally lies to me about what files are in a given dir because it gets the WOW64 remap. That's whack. The 32 bit exes work just fine on Win64 and the only thing forcing me to rebuild a lot of these things is just to make them avoid the WOW64.

06-02-10 | Some random Win64 junk

To build x64 on the command line, use

vcvarsall.bat amd64

Oddly the x64 compiler is still in "c:\program files (x86)" , not "c:\program files" where it's supposed to be. Also there's this "x86_amd64" thing, I was like "WTF is that?" , it's the cross-compiler to build x64 apps from x86 machines.

amd64 is a synonym for x64. IA64 is Itanium.

The actual "CL" command line to build x64 vs x86 can stay unchanged I think (not sure about this yet). For example you still link shell32.lib , and you still define "WIN32".

Another nasty case where 64 bit bites your ass is in the old printf non-typed varargs problem. If you use cblib/safeprintf this shouldn't be too bad. Fucking printf formatting for 64 bit ints is not standardized. One nice thing on MSVC is you can use "%Id" for size_t sized things, so it switches between 32 and 64. For real 64 bit use "%I64d" (on MSVC anyway).
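One way to keep the format strings in one place is to hide the specifiers behind macros. This is only a sketch: the macro names and `FormatSizes` are made up, and the non-MSVC branch assumes a C99/C++11 runtime that understands "%zu" and the PRId64 macro from <cinttypes>.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

#ifdef _MSC_VER
#define FMT_SIZE_T  "%Iu"       // MSVC : I prefix follows pointer size
#define FMT_INT64   "%I64d"     // MSVC : always 64 bit
#else
#define FMT_SIZE_T  "%zu"       // C99 size_t specifier
#define FMT_INT64   "%" PRId64  // expands to the right length modifier
#endif

// formats a size_t and a real 64 bit int without per-platform format strings
std::string FormatSizes(size_t count, int64_t big)
{
    char buf[128];
    std::snprintf(buf, sizeof(buf), "count=" FMT_SIZE_T " big=" FMT_INT64, count, big);
    return buf;
}
```

The adjacent string literals get pasted together by the compiler, so the call site reads the same on every target. It is still untyped varargs though, so a mismatched argument is still your problem.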

size_t really fucking annoys me because it's unsigned. When I do a strlen() I almost always want that to go into a signed int because I'm moving counters around that might go negative. So I either have to cast or I get tons of warnings. Really fucking annoying. So I use my own strlen32 and anyone who thinks I will call strlen on a string greater than 2 GB is smoking some crack.

Here are the MSC vers :

MSVC++ 9.0  _MSC_VER = 1500
MSVC++ 8.0  _MSC_VER = 1400
MSVC++ 7.1  _MSC_VER = 1310
MSVC++ 7.0  _MSC_VER = 1300
MSVC++ 6.0  _MSC_VER = 1200
MSVC++ 5.0  _MSC_VER = 1100

Some important revs : "volatile" changes meaning in ver 1400 ; most of the intrinsics show up in 1300 but more show up in 1400. Template standard compliance changes dramatically in 1400.

06-02-10 | Directives in the right place

I'm annoyed with "volatile" and "restrict" and all that shit because I believe they put the directive in the wrong place - on the variable declaration - when really what you are trying to do is modify certain *actions* , so the qualifier should be on the *actions* not the variables.

eg. if I write some matrix multiply routine and I know I have made it memory-alias safe, I want to say "restrict" on the *code* chunk, not on certain variables.

Same thing with "volatile" in the Java/C++0x/MSVC>=2005 sense, it's really a command for the load or the store, eg. "this store must actually be written to memory and must be ordered against other stores" , not a descriptor of a variable.

06-02-10 | NiftyP4 Problems

I've switched to NiftyP4, but it has some problems. On the plus side, you can download the code and it just fucking builds right out of the box! That is fucking amazing, kudos to you, I can't remember the last time I downloaded some open source thing and just built it without any hair pulling. One thing to be careful of is you have to uninstall the old install from the original installer, then you install your newly built one from the installer it makes and uninstall it from that installer.

So I'm going to try to fix some things in Nifty. It's annoyingly difficult to debug an add-in for visual studio, so I might have to use printf debugging. The problems I have are :

1. It randomly doesn't work sometimes. The spawnm p4 process fails. Usually if I shut down devenv and restart it this gets fixed. Need to track this down. "Failed to spawn process: System.ComponentModel.Win32Exception: The system cannot find the file specified"

2. When it can't connect to the p4 server it does hang devenv for too long. Would be nice to fix.

3. Checking out projects and solutions is pretty fucked. I'm sort of inclined to just make it check out all the vcproj's whenever you open a solution, and then revert unchanged when you close. Even just a key to check out all vcproj's in the solution would be handy, though ideally that would be triggered whenever you edit project properties. I guess if I had that I could make a macro that does "checkout all projects; open properties" and put that on my properties keyboard shortcut.

4. The toolbar is constantly querying status if it's visible so that it can tell whether to gray out the "p4 edit" buttons or not. That's kind of annoying. I might just make it so it doesn't gray those buttons based on whether the file is in p4 or not.

This is what I use to disconnect projects from source control without fucking everything up : (it's just text substitution to remove the source control lines from the projects).

c:\bat>type KillVcScc.bat
call p4 edit *.sln
call p4 edit *.vcproj
call spawnm killScc *.sln
call spawnm KillSCC *.vcproj
call spawnm_tr tr *.sln SourceCodeControl xxx
call dele *scc

c:\bat>type killscc.bat
if "%1"=="" endbat
call p4 edit %1
call zc -o %1 %1.sav
KillPrefix.exe Scc %1.sav %1.sav2
KillPrefix.exe CanCheckoutShared %1.sav2 %1

05-29-10 | Lock Free in x64

I mentioned long ago in the low level threading articles that some of the algorithms are a bit problematic with 64 bit pointers because we don't have large enough atomic operations.

The basic problem is that for many of the lock-free algorithms we need to be able to do a DCAS , that is a CAS of two pointer-sized values, or a pointer and a counter. When our pointer was 32 bits, we could use a 64 bit CAS to implement DCAS. If our pointer is 64 bits then we need a 128 bit CAS to implement DCAS the same way. There are various solutions to this :

1. Use 128 bit CAS. x64 has cmpxchg16b now which is exactly what you need. This is obviously simple and nice. There are a few minor problems :

1.A. There are no other 128 bit atomics, eg. Exchange and Add and such are missing. These can be implemented in terms of loops of CAS, but that is a very minor suckitude.

1.B. Early AMD64 chips do not have cmpxchg16b. You have to check for its presence with a CPUID call. If it doesn't exist you are seriously fucked. Fortunately these chips are pretty rare, so you can just use a really evil fallback to keep working on them : either disable threading completely on them, or simply run the 32 bit version of your app. The easiest way to do that is to have your installer check the CPUID flag and install the 32 bit x86 version of your app instead of the 64 bit version.

1.C. All your lock-free nodes become 16 bytes instead of 8 bytes. This does things like make your minimum alloc size 16 bytes instead of 8 bytes. This is part of the general bloating of 64 bit structs and mildly sucks. (BTW you can see this in winnt.h as MEMORY_ALLOCATION_ALIGNMENT is 16 on Win64 and 8 on Win32).

1.D. _InterlockedCompareExchange128 only exists on newer versions of MSVC so you have to write it yourself in ASM for older versions. Urg.

So #1 is an okay solution, but what are the alternatives ?

2. Pack {Pointer,Count} into 64 bits. This is of course what Windows does for SLIST, so doing this is actually very safe. Currently pointers on Windows are only 44 bits because of this. They will move to 48 and then 52. You can easily store a 52 bit pointer + a 16 bit count in 64 bits (the 52 bit pointer has the bottom four bits zero so you actually have 16 bits to work with). Then you can just keep using 64 bit CAS. This has no disadvantage that I know of other than the fact that twenty years from now you'll have to touch your code again.

3. You can implement arbitrary-sized CAS in terms of pointer CAS. The powerful standard paradigm for this type of thing is to use pointers to data instead of data by value, so you are just swapping pointers instead of swapping values. It's very simple, when you want to change a value, you malloc a copy of it and change the copy, and then swap in the pointer to the new version. You CAS on the pointer swap. The "malloc" can just be taking data from a recycled buffer which uses hazard pointers to keep threads from using the same temp item at the same time. This is a somewhat more complex way to do things conceptually, but it is very powerful and general, and for anyone doing really serious lockfree work, a hazard pointer system is a good thing to have. See for example "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS".
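A bare-bones sketch of option 3 : the shared state is a pointer to an immutable version, and an update is copy, modify, CAS the pointer. `BigValue`, `gCurrent` and `UpdateShared` are made-up names, std::atomic is my assumption, and the reclamation side (the hazard pointer recycling described above) is deliberately left out, so old versions are simply leaked here.

```cpp
#include <atomic>

struct BigValue { long a, b, c, d; }; // bigger than any native CAS

// readers load this pointer and read the version it points at
std::atomic<const BigValue *> gCurrent(new BigValue{0, 0, 0, 0});

// Apply modify() to a private copy, then try to publish the copy.
template < typename Fn >
void UpdateShared(Fn modify)
{
    const BigValue * oldPtr = gCurrent.load(std::memory_order_acquire);
    for (;;)
    {
        BigValue * newPtr = new BigValue(*oldPtr);   // copy the current version
        modify(*newPtr);                             // change only the copy
        if (gCurrent.compare_exchange_weak(oldPtr, newPtr,
                std::memory_order_acq_rel, std::memory_order_acquire))
            return;   // published ; the old version is NOT freed in this sketch
        delete newPtr; // lost the race ; oldPtr now holds the newer version, retry
    }
}
```

Never freeing old versions is what makes this sketch safe without more machinery: nobody can be reading freed memory, and a pointer value is never reused, so there is no ABA. A real version replaces the leak with recycling guarded by hazard pointers.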

You could also of course use a hybrid of 2 & 3. You could use a packed 64 bit {pointer,count} until your pointer becomes more than 52 bits, and then switch to a pointer to extra data.
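For the packed {pointer,count} side of that, a minimal sketch. It assumes what the text above assumes: pointers fit in 52 bits and are 16-byte aligned, so after dropping the bottom four zero bits the pointer takes 48 bits and leaves 16 for the count. All names are hypothetical and std::atomic stands in for whatever 64 bit CAS you have.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

const int kCountBits = 16;
const uint64_t kCountMask = (1ull << kCountBits) - 1;

// pointer in the top 48 bits, wrapping count in the bottom 16
inline uint64_t PackPtrCount(const void * ptr, uint32_t count)
{
    uint64_t p = (uint64_t)(uintptr_t) ptr;
    assert((p & 15) == 0);   // assumption : 16-byte aligned
    assert((p >> 52) == 0);  // assumption : at most 52 significant bits
    return ((p >> 4) << kCountBits) | (count & kCountMask);
}

inline void * UnpackPtr(uint64_t packed)
{
    return (void *)(uintptr_t)((packed >> kCountBits) << 4);
}

inline uint32_t UnpackCount(uint64_t packed)
{
    return (uint32_t)(packed & kCountMask);
}

// Swap in a new pointer and bump the count, using a plain 64 bit CAS.
inline bool CasHead(std::atomic<uint64_t> & head, uint64_t expected, const void * newPtr)
{
    uint64_t desired = PackPtrCount(newPtr, UnpackCount(expected) + 1);
    return head.compare_exchange_strong(expected, desired);
}
```

The count wraps at 16 bits, which is the usual tradeoff with this trick: it only has to be wide enough that the same {pointer,count} pair cannot reappear while some thread is stalled in the middle of its CAS loop.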

05-29-10 | Some more x64

Okay , MASM/MSDev support for x64 is a bit fucked. MSDev has built-in support for .ASM starting in VC 2005 which does everything for you, sets up custom build rule, etc. The problem is, it hard-codes to ML.EXE - not ML64. Apparently they have fixed this for VC 2010 but it is basically impossible to back-fix. (in VC 2008 the custom build rule for .ASM is in an XML file, so you can fix it yourself thusly )

The workaround goes like this :

Go to "c:\program files (x86)\microsoft visual studio 8\vc\bin". Find the occurrences of ML64.exe ; copy them to ML.exe . Now you can add .ASM files to your project. Go to the Win32 platform config and exclude them from build in Win32.

You now have .ASM files for ML64. For x86/32 - just use inline assembly. For x64, you extern from your ASM file.

Calling to x64 ASM is actually very easy, even easier than x86, and there are more volatile registers and the convention is that caller has to do all the saving. All of this means that you as a little writer of ASM helper routines can get away with doing very little. Usually your args are right there in {rcx,rdx,r8,r9} , and then you can use {rax,r10,r11} as temp space, so you don't even have to bother with saving space on the stack or any of that. See list of volatile registers

BTW the best docs are just the full AMD64 manuals .

For example here's a full working .ASM file :

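public my_cmpxchg64

.CODE

align 8
my_cmpxchg64 PROC 

  mov rax, [rdx]            ; rax = expected old value (*oldV)
  lock cmpxchg [rcx], r8    ; if [rcx] == rax then [rcx] = r8 , else rax = [rcx]
  jne my_cmpxchg64_fail
  mov rax, 1                ; success : return 1
  ret

align 8
my_cmpxchg64_fail:
  mov [rdx], rax            ; failure : write the value we saw back to *oldV
  mov rax, 0                ; and return 0
  ret
align 8
my_cmpxchg64 ENDP

END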


And how to get to it from C :

extern "C"  extern int my_cmpxchg64( uint64 * val, uint64 * oldV, const uint64 newV );

BTW one of the great things about posting things on the net is just that it makes me check myself. That cmpxchg64 has a stupid branch, I think this version is better :

align 8
my_cmpxchg64 PROC
  mov rax, [rdx]
  lock cmpxchg [rcx], r8
  sete cl
  mov [rdx], rax
  movzx rax,cl
  ret
my_cmpxchg64 ENDP

and you can probably do better. (for example it's probably better to just define your function as returning unsigned char and then you can avoid the movzx and let the caller worry about that)
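For reference, the contract of my_cmpxchg64 (return 1 on success; on failure write the value that was actually there into *oldV and return 0) is the same as a strong compare-exchange. A sketch of the equivalent in portable C++, assuming a compiler with std::atomic; `my_cmpxchg64_portable` is a made-up name and this is not what the ASM route above is for, it just pins down the semantics :

```cpp
#include <atomic>
#include <cstdint>

// Same contract as the ASM routine : returns 1 and stores newV if *val == *oldV,
// otherwise copies the observed value into *oldV and returns 0.
int my_cmpxchg64_portable(std::atomic<uint64_t> * val, uint64_t * oldV, uint64_t newV)
{
    return val->compare_exchange_strong(*oldV, newV) ? 1 : 0;
}
```

Having the failure path hand back the observed value is what lets the caller loop without doing a separate reload of *val.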

ADDENDUM : I just found a new evil secret way I'm fucked. Unions with size mismatches appear not to even be a warning of any kind. So for example you can silently have this in your code :

union Fucked
{
    struct
    {
        void * p1;
        int t;
    } s;
    uint64  i;
};

build in 64 bit and it's just hose city. BTW I think using unions as a datatype in general is probably bad practice. If you need to be doing that for some fucked reason, you should just store the member as raw bits, and then same_size_bit_cast() to convert it to the various types. In other words, the dual identity of that memory should be a part of the imperative code, not a part of the data declaration.
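A sketch of the "raw bits plus a checked cast" approach. `bit_cast_checked` is a hypothetical name, and it uses memcpy plus a C++11 static_assert rather than any particular cblib macro, so a size mismatch becomes a compile error instead of silent breakage.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of one type as another type of exactly the same size.
template < typename t_to, typename t_fm >
t_to bit_cast_checked(const t_fm & from)
{
    static_assert(sizeof(t_to) == sizeof(t_fm), "bit cast between types of different size");
    t_to to;
    std::memcpy(&to, &from, sizeof(to)); // memcpy sidesteps aliasing problems
    return to;
}
```

So the data member is just a uint64 of raw bits, and the code says `bit_cast_checked<double>(bits)` where it wants a double. Something like `bit_cast_checked<uint64_t>(somePointerAndIntStruct)` fails to compile on x64 if the struct has grown to 16 bytes, which is exactly the check the union never gave you.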

05-29-10 | x64 so far

x64 linkage that's been useful so far :

__asm__ cmpxchg8bcmpxchg16b - comp.programming.threads Google Groups
_InterlockedCompareExchange Intrinsic Functions
x86-64 Tour of Intel Manuals
x64 Starting Out in 64-Bit Windows Systems with Visual C++
Writing 64-bit programs
Windows Data Alignment on IPF, x86, and x86-64
Use of __m128i as two 64 bits integers
Tricks for Porting Applications to 64-Bit Windows on AMD64
The history of calling conventions, part 5 amd64 - The Old New Thing - Site Home - MSDN Blogs
Snippets lifo.h
Predefined Macros (CC++)
Physical Address Extension - PAE Memory and Windows
nolowmem (Windows Driver Kit)
New Intrinsic Support in Visual Studio 2008 - Visual C++ Team Blog - Site Home - MSDN Blogs
Moving to Windows Vista x64
Moving to Windows Vista x64 - CodeProject
Mark Williams Blog jmp'ing around Win64 with ml64.exe and Assembly Language
Kernel Exports Added for Version 6.0
Is there a portable equivalent to DebugBreak()__debugbreak - Stack Overflow
How to Log Stack Frames with Windows x64 - Stack Overflow
BCDEdit Command-Line Options
Available Switch Options for Windows NT Boot.ini File
AMD64 Subpage
AMD64 (EM64T) architecture - CodeProject
20 issues of porting C++ code on the 64-bit platform

One unexpected annoyance has been that a lot of the Win32 function signatures have changed. For example LRESULT is now a pointer not a LONG. This is a particular problem because Win32 has always made heavy use of cramming the wrong type into various places, eg. for GetWindowLong and stuffing pointers in LPARAM's and all that kind of shit. So you wind up having tons of C-style casts when you write Windows code. I have made good use of these guys :

// same_size_value_cast does a value cast between types of the same size
//  eg. it's not a bit cast
template < typename t_to, typename t_fm >
t_to same_size_value_cast( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    // just value cast :
    return (t_to) from;
}

// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to & same_size_bit_cast_p( t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    // cast through char * to make aliasing work ?
    char * ptr = (char *) &from;
    return *( (t_to *) ptr );
}

// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
// cast with union is better for gcc / Xenon :
template < typename t_to, typename t_fm >
t_to same_size_bit_cast_u( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    union _bit_cast_union
    {
        t_fm fm;
        t_to to;
    };
    _bit_cast_union converter = { from };
    // return by value ; a reference to the local union would dangle
    return converter.to;
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from );
    return to;
}

inline int ptr_diff_32( ptrdiff_t diff )
{
    return check_value_cast<int>(diff);
}

BTW this all has made me realize that the recent x86-32 monotony on PC's has been a delightful stable period for development. I had almost forgotten that it used to be always like this. Now to do simple shit in my code, I have to detect if it's x86 or x64 , if it is x64, do I have an MSC version that has the intrinsics I need? if not I have to write a got damn MASM file. Oh and I often have to check for Vista vs. XP to tell if I have various kernel calls. For example :

#if _MSC_VER > 1400

// have intrinsic

#elif _X86_NOT_X64_

// I can use inline asm
__asm { cmpxchg8b ... }

#elif _VISTA_OR_LATER_ // (placeholder name for the Vista check)

// kernel library call available

#else

// X64 , not Vista (or want to be XP compatible) , older compiler without intrinsic,
//  FUCK !

#error just use a newer MSVC version for X64 because I don't want to fucking write MASM routines

#endif

Even ignoring the pain of the last FUCK branch which requires making a .ASM file, the fact that I had to do a bunch of version/target checks to get the right code for the other paths is a new and evil pain.

Oh, while I'm ranting, fucking MSDN is now showing all the VS 2010 documentation by default, and they don't fucking tell you what version things became available in.

This actually reminds me of the bad old days when I got started, when processors and instruction sets were changing rapidly. You actually had to make different executables for 386/486 and then Pentium, and then PPro/P3/etc (not to mention the AMD chips that had their own special shiznit). Once we got to the PPro it really settled down and we had a wonderful monotony of well developed x86 on out-of-order machines that continued up to the new Core/Nehalem chips (only broken by the anomalous blip of Itanium that we all ignored as it went down in flames like the Hindenburg). Obviously we've had consoles and Mac and other platforms to deal with, but that was for real products that want portability to deal with, I could write my own Wintel code for home and not think about any of that. Well Wintel is monoflavor no more.

The period of CISC and chips with fancy register renaming and so-on was pretty fucking awesome for software developers, because you see the same interface for all those chips, and then behind the scenes they do magic mumbo jumbo to turn your instructions into fucking gene sequences that multiply and create bacterium that actually execute the instructions, but it doesn't matter because the architecture interface still just looks the same to the software developer.

05-28-10 | Foolishness

Many of the food blog snobistas descend into this foolishness of making everything at home when some things are just not a wise use of time. I mean I'm sure these homemade chicharrones are delicious and all, but fuck that's a lot of work when you could just walk over to the local Mexicatessen and buy a big bag that was also freshly fried in their house-made lard. And while you're there buy some carnitas and tortillas too and have a much better meal for cheaper and less work.

It's real foolishness when people say things like climbing is not a hard workout . I just have to roll my eyes pretty much anytime anybody talks about exercise because you just all don't get it. *Anything* is a workout if you make it a workout. There's no inherent difficulty level of any activity, it depends how hard you do it. You hear retards all the time saying "yoga's not a hard workout" , well maybe if you do it like a moron it's not, use some more intensity, make it more difficult for yourself if you need more work. I've heard plenty of people tell me "biking's not a hard enough workout". Oh really? Go faster, dumbass, or maybe try going up some hills.

My anger at the drivers around here grows and grows until it hits a boiling point where I just become depressed about how fucking stupid and selfish you all are. It really is amazing to me that people here constantly blow through yellows, roll through stop signs, and yet take forever to get moving when a light turns green and are busy-bodies about my speed (my speed which is almost always less than that of the SUV that's screeching around the corner at 90% of its limit so it would be unable to make a quick correction if anything surprising happened). What a boring topic, I apologize. At least in places like LA people are more consistently aggressive assholes; it's less hypocritical.

I've been working at home recently and it's been a great boost of productivity for me. It's so good to not have to worry about when I'm going to try to make the commute to avoid traffic, and it's awesome to be able to go directly from morning coffee to coding, which is the most productive instant of the day for me. There are two problems : 1. I got a nice standing desk all set up at work and I miss having it at home, I'm hurting my body spending too much time in a chair. 2. it's a little hard because N is also home many days, and there's a bit of difficult tension when I have to say "leave me alone I'm working".

I really don't like the way the financial meltdown narrative has been crafted by the media. One of the false narratives is the "black swan" story - that everything was being done with mathematical models and that it was a very unlikely but high impact event that was not accounted for in the models that caused the crisis. The other false narrative is that it was evil investment bankers at Goldman or similar that somehow caused it all. The reality is it was caused by ignorance and greed and corruption at almost every level of society. From presidents and congress who stripped regulation from the finance and mortgage industry, to the Fed keeping rates way too low and not monitoring banks well, to Fannie Mae et.al. underwriting too many loans, to Countrywide et.al. intentionally issuing loans they knew were bad to make more profit, to Goldman et.al. for packaging loans they knew were bad and selling them as safer than they really were, to the ratings agencies asleep at the wheel, to individual real estate investors getting in way over their heads trying to make an easy buck, etc. etc.

05-27-10 | Weird Compiler Error

Blurg just fought one of the weirder problems I've ever seen.

Here's the simple test case I cooked up :

void fuck()
{
#ifdef RR_ASSERT
#pragma RR_PRAGMA_MESSAGE("yes")
#else
#pragma RR_PRAGMA_MESSAGE("no")
#endif

    RR_ASSERT(true);
}


And here is the compiler error :

1>.\rrSurfaceSTBI.cpp(43) : message: yes
1>.\rrSurfaceSTBI.cpp(48) : error C3861: 'RR_ASSERT': identifier not found

Eh !? Serious WTF !? I know RR_ASSERT is defined, and then it says it's not found !? WTF !?

Well a few lines above that is the key. There was this :

#undef  assert
#define assert  RR_ASSERT

which seems like it couldn't possibly cause this, right? It's just aliasing the standard C assert() to mine. Not possibly related, right? But when I commented out that bit the problem went away. So of course my first thought is clean-rebuild all, did I have precompiled headers on by mistake? etc. I assume the compiler has gone mad.

Well, it turns out that somewhere way back in RR_ASSERT I was in a branch that caused me to have this definition for RR_ASSERT :

#define RR_ASSERT(exp)  assert(exp)

This creates a strange state for the preprocessor. RR_ASSERT is now a recursive macro. When you actually try to use it in code, the preprocessor apparently just bails and doesn't do any text substitution. But, the name of the preprocessor symbol is still defined, so my ifdef check still saw RR_ASSERT existing. Evil.
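Here's a minimal sketch of the same trap with made-up names (MY_ASSERT, my_assert and the stand-in function are hypothetical, not from the real headers). The standard preprocessor rule is that a macro name encountered while that same macro is being expanded is left alone, so the text comes out the other side as a plain identifier. In this sketch I declare a real function with that name first, just so the leftover call compiles and can be observed; in the actual bug there was no such function, hence the C3861.

```cpp
#include <cassert>

// Hypothetical names throughout; this only mimics the situation in the post.
static int g_leftoverCalls = 0;

// A real function that happens to share the macro's name.  It is defined
// before the macros exist so its own name is not macro-expanded.
void MY_ASSERT(bool) { ++g_leftoverCalls; }

#define my_assert       MY_ASSERT         // alias the "standard" name to mine
#define MY_ASSERT(exp)  my_assert(exp)    // ... and mine back to the standard one

// The macro name is still defined, so an #ifdef check passes.
#ifdef MY_ASSERT
static const bool kMacroIsDefined = true;
#else
static const bool kMacroIsDefined = false;
#endif

int CountLeftoverCalls()
{
    // MY_ASSERT(true) -> my_assert(true) -> MY_ASSERT(true), and there the
    // preprocessor stops : a macro is not re-expanded inside its own expansion.
    // What reaches the compiler is a call to the identifier MY_ASSERT.
    MY_ASSERT(true);
    return g_leftoverCalls;
}
```

So the ifdef says "yes" and the compiler still sees a bare identifier, which is exactly the confusing combination above.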

BTW the thing that kicked this off is that fucking VC x64 doesn't support inline assembly. ARGH YOU COCK ASS. Because of that we had long ago written something like

#ifdef _X86
#define RR_ASSERT_BREAK()  __asm int 3
#else
#define RR_ASSERT_BREAK()  assert(0)
#endif

which is what caused the difficulty.

05-27-10 | Loop Branch Inversion

A major optimization paradigm I'm really missing from C++ is something I will call "loop branch inversion". The problem is for code sharing and cleanliness you often wind up with cases where you have a lot of logic in some outer loops that find all the things you should work on, and then in the inner loop you have to do a conditional to pick what operation to do. eg :

LoopAndDoWork(query,workType)
{
    Make bounding area
    Do Kd-tree descent .. 
    loop ( tree nodes )
    {
        bounding intersection, etc.
        found an object
            DoPerObjectWork(object,workType);
    }
}

The problem is that DoPerObjectWork then is some conditional, maybe something like a switch on workType, or even worse - it's a function pointer that you call back.

Instead you would like the switch on workType to be on the outside. WorkType is a constant all the way through the code, so I can just propagate that branch up through the loops, but there's no way to express it neatly in C.

The only real option is with templates. You make DoPerObjectWork a functor and you make LoopAndDoWork a template. The other option is to make an outer loop dispatcher to constants. That is, make workType a template parameter instead of an integer :

template < int workType >
void t_LoopAndDoWork(query)

and then provide a dispatcher which does the branch outside :

    switch(workType)
    {
    case 0 : t_LoopAndDoWork<0>(query); break;
    case 1 : t_LoopAndDoWork<1>(query); break;
    }

this is an okay solution, but it means you have to reproduce the branch on workType in the outer loop and inner loop. This is not a speed penalty because the inner loop is a branch on constant which goes away, it's just ugly for code maintenance purposes because they have to be kept in sync and can be far apart in the code.

This is a general pattern - use templates to turn a variable parameter into a constant and then use an outer dispatcher to turn a variable into the right template call. But it's ugly.
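A compilable sketch of the whole pattern, under some loud assumptions : the "query" is just a vector of ints standing in for the kd-tree descent, and the two work types (sum, sum of squares) are made up for illustration. Only the names LoopAndDoWork, t_LoopAndDoWork, DoPerObjectWork and workType come from the discussion above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical work : workType 0 sums the objects, workType 1 sums their squares.
template <int workType>
inline int DoPerObjectWork(int object)
{
    // workType is a compile-time constant here, so this branch folds away.
    if (workType == 0) return object;
    else               return object * object;
}

// The shared outer-loop logic (stand-in for the kd-tree descent), written once.
template <int workType>
int t_LoopAndDoWork(const std::vector<int>& query)
{
    int total = 0;
    for (int object : query)
        total += DoPerObjectWork<workType>(object);
    return total;
}

// Outer dispatcher : the only run-time branch on workType.
int LoopAndDoWork(const std::vector<int>& query, int workType)
{
    switch (workType)
    {
    case 0 : return t_LoopAndDoWork<0>(query);
    case 1 : return t_LoopAndDoWork<1>(query);
    default: return 0; // unknown work type; a real version would assert here
    }
}
```

Note the maintenance wart is still visible : the case list in the dispatcher and the branch in DoPerObjectWork both have to know the set of work types.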

BTW when doing this kind of thing you often wind up with loops on constants. The compiler often can't figure out that a loop on a constant can be unrolled. It's better to rearrange the loop on constant into branches. For example I'm often doing all this on pixels where the pixel can have between 1 and 4 channels. Instead of this :

for(int c=0;c<channels;c++)
    DoStuff(c);

where channels is a constant (template parameter), it's better to do :

DoStuff(0);
if ( channels > 1 ) DoStuff(1);
if ( channels > 2 ) DoStuff(2);
if ( channels > 3 ) DoStuff(3);

because those ifs reliably go away.
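Spelled out as something you can compile, assuming a made-up per-channel operation (DoStuff here just adds a bias to one channel; BiasPixel is a hypothetical name) :

```cpp
#include <cassert>

// Hypothetical per-channel operation : add 'bias' to one channel of a pixel.
inline void DoStuff(int* pixel, int c, int bias) { pixel[c] += bias; }

template <int channels>
void BiasPixel(int* pixel, int bias)
{
    // channels is a template constant in [1,4]; each test below is decided at
    // compile time, so an optimizing build keeps only the surviving calls.
    DoStuff(pixel, 0, bias);
    if (channels > 1) DoStuff(pixel, 1, bias);
    if (channels > 2) DoStuff(pixel, 2, bias);
    if (channels > 3) DoStuff(pixel, 3, bias);
}
```

BiasPixel<3> touches channels 0,1,2 and leaves the fourth alone, with no loop counter anywhere.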

05-26-10 | Windows Page Cache

The correct way to cache things is through Windows' page cache. The advantages of doing this over using your own custom cache code are :

1. Automatically resizes based on amount of memory needed by other apps. eg. other apps can steal memory from your cache to run.

2. Automatically gives pages away to other apps or to file IO or whatever if they are touching their cache pages more often.

3. Automatically keeps the cache in memory between runs of your app (if nothing else clears it out). This is pretty immense.

Because of #3, your custom caching solution might slightly beat using the Windows cache on the first run, but on the second run it will stomp all over you.

To do this nicely, generally the cool thing to do is make a unique file name that is the key to the data you want to cache. Write the data to a file, then memory map it as read only to fetch it from the cache. It will now be managed by the Windows page cache and the memory map will just hand you a page that's already in memory if it's still in cache.

The only thing that's not completely awesome about this is the reliance on the file system. It would be nice if you could do this without ever going to the file system. eg. if the page is not in cache, I'd like Windows to call my function to fill that page rather than getting it from disk, but so far as I know this is not possible in any easy way.

For example : say you have a bunch of compressed images as JPEG or whatever. You want to keep uncompressed caches of them in memory. The right way is through the Windows page cache.

05-26-10 | Windows 7 Snap

My beloved "AllSnap" doesn't work on Windows 7 / x64. I can't find a replacement because fucking Windows has a feature called "Snap" now, so you can't fucking google for it. (also searching for "Windows 7" stuff in general is a real pain because solutions and apps for the different variants of windows don't always use the full name of the OS they are for in their page, so it's hard to search for; fucking operating systems really need unique code names that people can use to make it possible to search for them; "Windows" is awful awful in this regard).

I contacted the developer of AllSnap to see if he would give me the code so I could fix it, but he is ignoring me. I can tell from debugging apps when AllSnap is installed that it seems to work by injecting a DLL. This is similar to how I hacked the poker sites for GoldBullion, so I think I could probably reproduce that. But I dunno if Win7/x64 has changed anything about function injection and the whole DLL function pointer remap method.

BTW/FYI the standard Windows function injection method goes like this : Make a DLL that has some event handler. Run a little app that causes that event to trip inside the app you want to hijack. Your DLL is now invoked in that app's process to handle that event. Now you are running in that process so you can do anything you want - in particular you can find the function table to any of their DLL's, such as user32.dll, and stuff your own function pointer into that memory. Now when the app makes normal function calls, they go through your DLL.

05-25-10 | Thread Insurance

I just multi-threaded my video test app recently, and it was reasonably easy, but I had a few nagging bugs because of hidden ways they were touching shared memory without protection deep inside functions. Okay, so I found them and fixed them, but I'm left with a problem - any time I touch one of those deep functions, I could screw up the threading without realizing it. And I might not get any indication of what I did for weeks if it's a rare race.

What I would like is a way to make this more robust. I have very strong threading primitives, I want a way to make sure that I use them! In particular, I want to be able to mark certain structs as only touchable when a critsec is locked or whatever.

I think that a lot of this could be done with Win32 memory page protections. So far as I know there's no way to associate protections per-thread, (eg. to make a page read/write for thread A but no-access for thread B). If I could do that it would be super sweet.

One idea is to make the page no access and then install my own exception handler that checks what thread it is, but that might be too much overhead (and not sure if that would fail for other reasons).

The main usage is not for protected crit-sec'ed structs, that is really the easiest case to maintain because it's very obvious right there in the code that you need to take the critsec to touch the variables. The hard case to maintain is the ad hoc "I know this is safe to touch without protection". In particular I have a lot of code that runs like this :

Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A

Phase 2 : fire up threads.  They only read from A and do so without protection.  They each write to unique areas B,C,D.

Phase 3 : spin down threads.  Now main thread can write A and read B,C,D.

So what I would really like to do is :

Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A

-- set A memory to be read-only !
-- set B,C,D memory to be read/write only for their own thread

Phase 2 : fire up threads.  They only read from A and do so without protection.  They each write to unique areas B,C,D.

-- make A,B,C,D read/write only for main thread !

Phase 3 : spin down threads.  Now main thread can write A and read B,C,D.

The thing that this saves me from is when I'm tinkering in DoComplicatedStuff() which is some function called deep inside Phase 2 somewhere and I change it to no longer follow the memory access rule that it is supposed to be following. This is just my hate for having rules for code correctness that are not enforced by the compiler or at least by run-time asserts.
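Since per-thread page protections don't seem to exist, here's a sketch of the fallback : the same phase rules checked by a run-time assert instead of by the MMU. To be clear this swaps in a different technique from the page-protection idea, and everything in it (PhaseGuarded, BeginReadOnlyPhase, RunPhases, etc.) is a hypothetical name I made up. It only catches violations that go through the wrapper, so it's weaker than real memory protection.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical wrapper, not a Win32 facility : it checks the access rule
// with a run-time test instead of page protections.
template <typename T>
class PhaseGuarded
{
public:
    explicit PhaseGuarded(const T& init)
        : m_data(init), m_owner(std::this_thread::get_id()), m_readOnly(false) { }

    // Phase transitions, called by the owning (main) thread.
    void BeginReadOnlyPhase() { m_readOnly.store(true); }
    void EndReadOnlyPhase()   { m_readOnly.store(false); }

    // Reads are always allowed.
    const T& Read() const { return m_data; }

    // True if the calling thread may write right now.
    bool CanWrite() const
    {
        return !m_readOnly.load() && std::this_thread::get_id() == m_owner;
    }

    // Writes must go through here; a rule violation trips the assert.
    T& Write()
    {
        assert(CanWrite() && "write to shared data outside its allowed phase/thread");
        return m_data;
    }

private:
    T                 m_data;
    std::thread::id   m_owner;
    std::atomic<bool> m_readOnly;
};

// The three phases from above on a small vector A; returns what the workers computed.
int RunPhases()
{
    PhaseGuarded<std::vector<int>> A(std::vector<int>{});
    A.Write() = {1, 2, 3, 4};            // Phase 1 : main thread writes A

    A.BeginReadOnlyPhase();              // Phase 2 : workers only read A
    int B = 0, C = 0;                    // each worker writes only its own slot
    std::thread t1([&] { for (int v : A.Read()) B += v; });
    std::thread t2([&] { for (int v : A.Read()) C += v * 2; });
    t1.join(); t2.join();                // Phase 3 : spin down
    A.EndReadOnlyPhase();

    A.Write().push_back(B + C);          // main thread may write A again
    return A.Read().back();
}
```

If DoComplicatedStuff() deep inside Phase 2 ever calls Write() on A, the assert fires on the first run rather than showing up as a rare race weeks later.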

05-25-10 | State and the Web

There's a major way that the whole iPple device thing is taking us backwards. Plain old HTML (eg. not apps) is awesome in that it actually gets something really right :

Minimal state. Recordable state at every transition point. This lets you bookmark your point anywhere in your work, go backwards and forwards, save your spot and come back to it, etc.

This all goes back to the entire state being a little token that you can just grab and store off. Granted, lots of web pages fuck this up because they use some server-side shit and they don't show you all the public state or whatever fucking dick-ass thing they do. But good old fashioned Web gets this awesomely right.

It's actually a paradigm that I think more developers should espouse in their Win32 apps, both publicly and internally.

By "publicly" I mean you should expose it to the user - let the user drag off the current spot to a link, and let them restore. This should be in like every app. The full state of the app should be in an edit box somewhere that I can copy/paste or drag to the desktop. I should be able to double-click it to jump back into the app at that same point.

"Internally" I mean it's nice to make sure your state is some very simple plain C structures, so that you can just push & pop or save old versions of the state, like :

State save(curState);



curState = save;

this is actually one of the new things I'm doing in my Video Test framework and it has been awesomely useful.
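A slightly fuller sketch of that save/restore idiom, with a made-up State struct and a made-up operation (TryZoom and the fields are hypothetical, not from my actual Video Test code). The point is only that with plain data a snapshot is a copy and a rollback is an assignment :

```cpp
#include <cassert>

// Hypothetical app state : plain data only, so copy == snapshot.
struct State
{
    int   frame;
    float zoom;
    bool  paused;
};

// Try a change on the state; keep it only if 'accept' says so, else roll back.
template <typename Pred>
bool TryZoom(State& curState, float newZoom, Pred accept)
{
    State save(curState);      // snapshot is just a struct copy

    curState.zoom  = newZoom;  // ... do stuff that mutates curState ...
    curState.frame = 0;

    if (accept(curState))
        return true;

    curState = save;           // restore is just an assignment
    return false;
}
```

This only stays this easy as long as State holds no pointers or handles that need deep copies, which is a good reason to keep it that way.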

Yeah yeah the C++ way is to give every member a stream-in/stream-out, but it's too hard to maintain that robustly all the time.

This is actually related to another very important programming paradigm in general : minimize state, and avoid redundancy. Don't store variables that are computed from other variables. Don't copy values from one place to another. Always go get them at the original source. This is a massive bug reducer. Every time I see something like "this variable must be kept in sync with this variable" I think "why not just get rid of one of them?".

05-23-10 | Two Windows Woes

Slow net. My god WTF is wrong with Windows networking. (I don't mean the TCP/IP stack, I mean shared computer browsing). What the fuck is wrong with networking in general? Why are there such massive stalls? I mean for browsing my local network, how in the world can it take so fucking long to discover the machines on my fucking LAN !? And if a machine is not there, can't you just fail in like a millisecond !? I mean a fucking millisecond is FOREVER to send a packet of light out on some wires and get a reply back.

I do have one major practical problem with Windows slow networking : my file copy and dir listing routines are ungodly slow across the net. I know this can be done faster. TeraCopy for example is pretty fast, I would love to know what they are doing. The super brute force solution would be to just run my own file system client/server and send packets to my own port. For example if I want to get a dir listing, I just send one packet saying "list this dir" and the listener on the other side does it locally and then sends me back one big packet with the full dir listing. I could run that on TCP/IP and it would be like instant. So how do I get speed like that over proper Windows networking? Or maybe that is the way to go and I could just remote-run my listener app on any machine I want to talk to?

Kill stuck apps. WTF I know you are capable of killing stuck apps, because if I use my own "killproc" app I can kill them cleanly (or another nice way is to attach the debugger to the app and then kill it from there). But sometimes even fucking Task Manager refuses to kill it, and why can't I just kill it from the fucking X box. Okay maybe not the X box because that's just a GUI widget on the app, but let me fucking right-click in the non responding Window and say "yes really fucking kill the fuck out of this piece of shit app".

I might have to write my own app that uses IsHungAppWindow and then hard-kills whatever is not responding. I could put it on a hot key and it would save my bacon when the fucking Task Manager won't run (my god why is Task Manager still just a fucking app like everything else, which means that it can't always get enough CPU or screen display rights; there should be a machine monitor in the ctrl-alt-del screen that is always accessible).

05-23-10 | Misc

I went driving at Pacific Raceways with the Porsche DE ("Driver's Education" , which is the euphemism they use to make it sound safe, it's actually run your fucking suped-up old 911 around a race track at insane speeds and occasionally spin out and toss it into the bushes). Maybe I will write up some more details on it, but it was a fucking blast, the car was *amazing*, I was exhausted, I used about a year's worth of tires and brakes. I highly recommend it.

It is kind of funny to me how people take something that is pure hooliganism and laughs and adrenaline (driving cars fast) and have to turn it into something they can be anal about and practice and study and be "right" about and have "the way to do things". It always happens, I mean wine and food and such are the same way, the people who really love something become way too obsessive about it and make it way too analytical and lose focus on the simple joy of it.

Coding for RAD is kind of fucking me up. I have lots of bits of good code that I know I've written but I don't know where they are anymore. For example I know I wrote a bunch of careful stuff to try to sleep to framerate well but I can't find it anymore. Is it in my cblib stuff, or is it in my RAD/Oodle stuff, or is it in my RAD/shared stuff? Urg.

Dell lappy is pretty great. I dropped it and slightly dented the case. Metal cases feel awesome and give that impression of "quality" but in fact plastic is a pretty fucking amazing material to make things out of. Plastic does not get hot, it's super lightweight, it's very tough, and it has this amazing property that it can take an impact, deform, and then return to its original shape. (same goes for car interiors of course).

Part of the problem with plastic car exteriors was that the paints weren't good enough. That's no longer true, there are new amazing paints that can make plastic cars look like metal.

Top Chef Masters is pretty good, way better than the first season of TCM, though fucking Kelly Choi is a real drag (even worse than Padma; at least Padma actually hot, and it's amusing when she's stoned off her ass and says everything is delicious, whereas Kelly is freakish looking with her stick body and giant head, and makes that weird forced-smile face all the time; they both share the inability to just read a freaking cue card smoothly). (it's a real pet peeve of mine when people think that someone is hot just for being thin; thinness is correlated to hotness, but it is not causal!). (and of course anybody who's on TV at all will have a million weirdos who insist she's super hot).

05-21-10 | Video coding beyond H265

In the end movec-residual coding is inherently limited and inefficient. Let's review the big advantage of it and the big problem.

The advantage is that the encoder can reasonably easily consider {movec,residual} coding choices jointly. This is a huge advantage over just picking what's my best movec, okay now code the residual. Because movec affects the residual, you cannot make a good R/D decision if you do it separately. By using block movecs, it reduces the number of options that need to be considered to a small enough set that encoders can practically consider a few important choices and make a smart R/D decision. This is what is behind all current good video encoders.

The disadvantage of movec-residual coding is that they are redundant and connected in a complex and difficult to handle way. We send them independently, but really they have cross-information about each other, and that is impossible to use in the standard framework.

There are obviously edges and shapes in the image which occur in both the movecs and the residuals. eg. a moving object will have a boundary, and really this edge should be used for both the movec and residual. In the current schemes we send a movec for the block, and then the residuals per pixel, so we now have finer grain information in the residual that should have been used to give us finer movecs per pixel, but it's too late now.

Let's back up to fundamentals. Assume for the moment that we are still working on an 8x8 block. We want to send that block in the current frame. We have previous frames and previous blocks within the current frame to help us. There are (256^3)^64 = 2^1536 possible values for this block (64 pixels, 3 channels of 8 bits each). If we are doing lossy coding, then not all possible values for the block can be sent. I won't get into details of lossiness, so just say there are a large number of possible values for the pixels of the block; we want to code an index to one of those values.

Each index should be sent with a different bit length based on its probability. Already we see a flaw with {movec-residual} coding - there are tons of {movec,residual} pairs that specify the same index. Of course in a flat area lots of movecs might point to the same pixels, but even if that is eliminated, you could go movec +1, residual +3, or movec +3, residual +1, and both ways get to +4. Redundant encoding = bit waste.

Now, this bit waste might not be critically bad with current simple {movec,residual} schemes - but it is a major encumbrance if we start looking at more sophisticated mocomp options. Say you want to be able to send movecs for shapes, eg. send edges and then send a movec on each side. There are lots of possibilities here - you might just send a movec per pixel (this seems absurdly expensive, but the motion fields are very smooth so should code well from neighbors), or you might send a polygon mesh to specify shapes. This should give you much better motion fields, and then the information in the motion fields can be used to predict the residuals as well. But the problem is there's too much redundancy. You have greatly expanded the number of ways to code the same output pixels.

We could consider more modest steps as well, such as sticking with block mocomp + residual, but expanding what we can do for "mocomp". For example, you could use two motion vectors + arbitrary linear combination of the source blocks. Or you could do trapezoidal texture-mapping style mocomp. Or mocomp with a vector and scale + rotation. None of these is very valuable, there are numerous problems : 1. too many ways to encode for the encoder to do thorough R/D analysis of all of them, 2. too much redundancy, 3. still not using the joint information across residual & motion.

In the end the problem is that you are using a 6-d value {velocity,pixel} to specify a 3-d color. What you really want is a 3-d coordinate which is not in pixel space, but rather is a sort of "screw" in motion/pixel space. That is, you want the adjacent coordinates in motion/pixel space to be the ones that are closest together in the 6-d space. So for example RGB {100,0,0} and {0,200,50} might be neighbors in motion/pixel space if they can be reached by small motion adjustments.

Okay this is turning into rambling, but another way of seeing it is like this : for each block, construct a custom basis transform. Don't send a separate movec or anything - the axes of the basis transform select pixels by stepping in movec and also residual.

ADDENDUM : let me try to be more clear by doing a simple example. Say you are trying to code a block of pixels which only has 10 possible values. You want to code with a standard motion then residual method. Say there are only 2 choices for motion. It is foolish to code all 10 possible values for both motion vectors! That is, currently all video coders do something like :

Code motion = [0 or 1]
Code residual = [0,1,2,3,4,5,6,7,8,9]

Or in tree form :

   0 - [0,1,2,3,4,5,6,7,8,9]
   1 - [0,1,2,3,4,5,6,7,8,9]

Clearly this is foolish. For each movec, you only need to code the residual which encodes that resulting pixel block the smallest under that movec. So you only need each output value to occur in one spot on the tree, eg.

   0 - [0,1,2,3,4]
   1 - [5,6,7,8,9]

or something. That is, it's foolish to have two ways to encode the residual to reach a certain target when there were already cheaper ways to reach that target in the movec coding portion. To minimize this deficiency, most current coders like H264 will code blocks by either putting almost all the bits in the movec and very few in the residual, or the other way (almost none in the movec and most in the residual). The loss occurs most when you have many bits in the motion and many in the residual, something like :

   0 - [0,1,2]
   1 - [3,4,5,6]
   2 - [7,8]
   3 - [9]
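To put a number on the toy example : with 2 movecs and 10 residuals each there are 20 codewords for only 10 distinct outputs. Here's a rough sketch that counts that, under assumptions that are mine and not part of any real codec : each movec yields a single predicted value, the residual is added mod 10, and every codeword is equally likely. Real coders use skewed probabilities, so treat the result as an illustration of the redundancy, not a measurement.

```cpp
#include <cassert>
#include <cmath>
#include <set>

// Toy model with made-up numbers matching the example : a block takes one of
// numValues values, each movec yields some prediction, and the residual is
// added to it (mod numValues) to reach the final value.
struct ToyCounts { int codewords; int distinctOutputs; };

ToyCounts CountCodes(const int* predictions, int numMovecs, int numValues)
{
    std::set<int> outputs;
    int codewords = 0;
    for (int m = 0; m < numMovecs; m++)
        for (int r = 0; r < numValues; r++)
        {
            outputs.insert((predictions[m] + r) % numValues);
            codewords++;
        }
    return { codewords, (int)outputs.size() };
}

// Bits spent beyond the minimum if every codeword is equally likely.
double WastedBits(const ToyCounts& c)
{
    return std::log2((double)c.codewords) - std::log2((double)c.distinctOutputs);
}
```

With predictions {3,7} that's log2(20) - log2(10) = 1 bit of pure waste per block in this toy; the partitioned trees above are the ways of getting that bit back.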

The other huge fundamental deficiency is that the probability modeling of movecs and residuals is done in a very primitive way based only on "they are usually small" assumptions. In particular, probability modeling of movecs needs to be done not just based on the vector, but on the content of what is pointed at. I mentioned long ago there is a lot of redundancy there when you have lots of movecs pointing at the same thing. Also, the residual coding should be aware of what was pointed to by the movec. For example if the movec pointed at a hard edge, then the residual will likely also have a similar hard edge because it's likely we missed by a little bit, so you could use a custom transform that handles that better. etc.

ADDENDUM 2 : there's something else very subtle going on that I haven't seen discussed much. The normal way of sending {movec,residual} is actually over-complete. Mostly that's bad, too much over-completeness means you are just wasting bits, but actually some amount of over-completeness here is a good thing. In particular for each frame we are sending a little bit of extra side information that is useful for *later* frames. That is, we are sending enough information to decode the current frame to some quality level, plus some extra that is not really worth it for just the current frame, but is worth it because it helps later frames.

The problem is that the amount of extra information we are sending is not well understood. That is, in the current {movec,residual} schemes we are just sending extra information without being in control and making a specific decision. We should be choosing how much extra information to send by evaluating whether it is actually helpful on future frames. Obviously the last frames of the video (or a sequence before a cut) you shouldn't send any extra information.

In the examples above I'm showing how to reduce the overcomplete information down to a minimal set, but sometimes you might not want to do that. As a very coarse example say the true motion at a given pixel is +3, movec=3 to get to final pixel=7 , but you can code the same result smaller by using movec=1 - deciding whether to send the true motion or not should be done based on whether it actually helps in the future, but more importantly the code stream could collapse {3,7} and {1,7} so there is no redundant way to code if the difference is not helpful.

This becomes more important of course if you have a more complex motion scheme, like per-pixel motion or trapezoidal motion or whatever.

05-20-10 | Some quick notes on H265

Since we're talking about VP8 I'd like to take this chance to briefly talk about some of the stuff coming in the future. H265 is being developed now, though it's still a long ways away. Basically at this point people are throwing lots of shit at the wall to see what sticks (and hope they get a patent in). It is interesting to see what kind of stuff we may have in the future. Almost none of it is really a big improvement like "duh we need to have that in our current stuff", it's mostly "do the same thing but use more CPU".

The best source I know of at the moment is H265.net , but you can also find lots of stuff just by searching for video on citeseer. (addendum : FTP to Dresden April meeting downloads ).

H265 is just another movec + residual coder, with block modes and quadtree-like partitions. I'll write another post about some ideas that are outside of this kind of scheme. Some quick notes on the kind of things we may see :

Super-resolution mocomp. There are some semi-realtime super-resolution filters being developed these days. Super-resolution lets you take a series of frames and create an output that's higher fidelity than any one source. In particular given a few assumptions about the underlying source material, it can reconstruct a good guess of the higher resolution original signal before sampling to the pixel grid. This lets you do finer subpel mocomp. Imagine for example that you have some black and white text that is slowly translating. On any one given frame there will be lots of gray edges due to the antialiased pixel sampling. Even if you perfectly know the subpixel location of that text on the target frame, you have no single reference frame to mocomp from. Instead you create a super-resolution reference frame of the original signal and subpel mocomp from that.

Partitioned block transforms. One of the minor improvements in image coding lately, which is natural to move to video coding, is PBT with more flexible sizes. This means 8x16, 4x8, 4x32, whatever, lots of partition sizes, and having block transforms for that size of partition. This lets the block transform match the data better. Which also leads us to -

Directional transforms and trained transforms. Another big step is not always using an X & Y oriented orthogonal DCT. You can get a big win by doing directional transforms. In particular, you find the directions of edges and construct a transform that has its bases aligned along those edges. This greatly reduces ringing and improves energy compaction. The problem is how do you signal the direction or the transform data? One option is to code the direction as extra side information, but that is probably prohibitive overhead. A better option is to look at the local pixels (you already have decoded neighbors) and run edge detection on them and find the local edge directions and use that to make your transform bases. Even more extreme would be to do a fully custom transform construction from local pixels (and the same neighborhood in the last frame), either using competition (select from a set of transforms based on which one would have done best on those areas) or training (build the KLT for those areas). Custom trained bases are especially useful for "weird" images like Barb. These techniques can also be used for ...

Intra prediction. Like residual transforms, you want directional intra prediction that runs along the edges of your block, and ideally you don't want to send bits to flag that direction, rather figure it out from neighbors & previous frame (at least to condition your probabilities). Aside from finding direction, neighbors could be used to vote for or train fully custom intra predictors. One of the H265 proposals is basically GLICBAWLS applied to intra prediction - that is, train a local linear predictor by doing weighted LSQR on the neighborhood. There are some other equally insane intra prediction proposals - basically any texture synthesis or prediction paper over the last 10 years is fair game for insane H265 intra prediction proposals, so for example you have suggestions like Markov 2x2 block matching intra prediction which builds a context from the local pixel neighborhood and then predicts pixels that have been seen in similar contexts in the image so far.
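
To make the "train a local linear predictor by weighted LSQR" idea concrete, here is a minimal sketch, not the actual H265 proposal or the real GLICBAWLS code. It assumes a grayscale image stored as a list of rows, raster-order decoding, only two predictor inputs (the West and North neighbors), and a made-up 1/(1+distance) weighting; the names train_predictor and predict are hypothetical. The decoder can run the same fit because it only touches already-decoded pixels.

```python
def train_predictor(img, x, y, radius=3):
    """Fit p ~ a*W + b*N by weighted least squares over decoded pixels near (x, y).

    Assumes x >= 1 and y >= 1 so that both neighbors exist.
    """
    sww = swn = snn = swp = snp = 0.0
    width = len(img[0])
    for ty in range(max(1, y - radius), y + 1):
        for tx in range(max(1, x - radius), min(width, x + radius + 1)):
            if ty == y and tx >= x:
                break  # raster order: these pixels are not decoded yet
            # Assumed weighting: closer training pixels count more.
            wgt = 1.0 / (1 + abs(tx - x) + abs(ty - y))
            w, n, p = img[ty][tx - 1], img[ty - 1][tx], img[ty][tx]
            sww += wgt * w * w
            swn += wgt * w * n
            snn += wgt * n * n
            swp += wgt * w * p
            snp += wgt * n * p
    # Solve the 2x2 normal equations [sww swn; swn snn] [a b]^T = [swp snp]^T.
    det = sww * snn - swn * swn
    if abs(det) < 1e-9:
        return 0.5, 0.5  # degenerate neighborhood: fall back to averaging
    a = (swp * snn - snp * swn) / det
    b = (snp * sww - swp * swn) / det
    return a, b


def predict(img, x, y, radius=3):
    """Predict pixel (x, y) from its West and North neighbors."""
    a, b = train_predictor(img, x, y, radius)
    return a * img[y][x - 1] + b * img[y - 1][x]
```

On an image made of constant horizontal stripes the fit puts essentially all of its weight on the West neighbor, which is the "figure out the direction from neighbors" behavior without sending any direction bits.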

Unblocking filters ("loop filtering" WTF retarded name is that) are an obvious area for improvement. The biggest area for improvement is deciding when a block edge has been created by the codec and when it is in the source data. This can actually usually be figured out if the unblocking filter has access to not just the pixels, but how they were coded and what they were mocomped from. In particular, it can see whether the code stream was *trying* to send a smooth curve and just couldn't because of quantization, or whether the code stream intentionally didn't send a smooth curve (eg. it could have but chose not to).

Subpel filters. There are a lot of proposals on improved sub-pixel filters. Obviously you can use more taps to get better (sharper) frequency response, and you can add 1/8 pel or finer. The more dramatic proposals are to go to non-separable filters, non-axis aligned filters (eg. oriented filters), and trained/adaptive filters, either with the filter coefficients transmitted per frame or again deduced from the previous frame. The issue is that what you have is just a pixel sampled aliased previous frame; in order to do sub-pel filtering you need to make some assumptions about the underlying image signal; eg. what is the energy in frequencies higher than the sampling limit? Different sub-pel filters correspond to different assumptions about the beyond-nyquist frequency content. As usual orienting filters along edges helps.
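
As a small illustration of "more taps", here is a 1D sketch comparing a 2-tap bilinear half-pel sample against the 6-tap (1,-5,20,20,-5,1)/32 half-pel filter that H264 uses for luma. The function names and the clamp-at-the-edges border handling are assumptions for the example, not taken from any particular codec source.

```python
SIX_TAP = (1, -5, 20, 20, -5, 1)  # H264 luma half-pel taps, sum = 32


def half_pel_bilinear(row, i):
    """Half-pel sample between row[i] and row[i+1] using a rounded 2-tap average."""
    return (row[i] + row[i + 1] + 1) // 2


def half_pel_six_tap(row, i):
    """Half-pel sample between row[i] and row[i+1] using the 6-tap filter.

    Out-of-range taps are clamped to the nearest valid pixel (an assumed
    border rule), and the result is clipped to [0, 255].
    """
    n = len(row)
    acc = 0
    for k, tap in enumerate(SIX_TAP):
        j = min(max(i - 2 + k, 0), n - 1)
        acc += tap * row[j]
    val = (acc + 16) >> 5  # divide by 32 with rounding
    return min(max(val, 0), 255)
```

On flat data the two agree; on curved data the 6-tap filter gives a different answer because it is making a different assumption about what the signal does between the samples.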

Improved entropy coding. So far as I can tell there's nothing too interesting here. Current video coders (H264) use entropy coders from the 1980's (very similar to the Q-coder stuff in JPEG-ari), and the proposals are to bring the entropy coding into the 1990's, on the level of ECECOW or EZDCT.

05-19-10 | Some quick notes on VP8

The VP8 release is exciting for what it might be in two years.

If it in fact becomes a clean open-source video standard with no major patent encumbrances, it might be well integrated in Firefox, Windows Media, etc. etc. - eg. we might actually have a video format that actually just WORKS! I don't even care if the quality/size is really competitive. How sweet would it be if there was a format that I knew I could download and it would just play back correctly and not give me any headaches. Right now that does not exist at all. (it's a sad fact that animated GIF is probably the most portable video format of the moment).

Now, you might well ask - why VP8 ? To that I have no good answer. VP8 seems like a messy cock-assed standard which has nothing in particular going for it. The entropy encoder in particular (much like H264) seems badly designed and inefficient. The basics are completely vanilla, in that it is block based, block modes, movecs, transforms, residual coding. In that sense it is just like MPEG1 or H265. That is a perfectly fine thing to do, and in fact it's what I've wound up doing, but you could pull a video standard like that out of your ass in about five minutes, there's no need to license code for that. If in fact VP8 does dodge all the existing patents then that would be a reason that it has value.

The VP8 code stream is probably pretty weak (I really don't know enough of the details to say for sure). However, what I have learned of late is that there is massive room for the encoder to make good output video even through a weak code stream. In fact I think a very good encoder could make good output from an MPEG2 level of code stream. Monty at Xiph has a nice page about work on Theora. There's nothing really cutting edge in there but it's nicely written and it's a good demonstration of the improvement you can get on a fixed standard code stream just with encoder improvements (and really their encoder is only up to "good but still basic" and not really into the realm of wicked-aggressive).

The only question we need to ask about the VP8 code stream is : is it flexible enough that it's possible to write a good encoder for it over the next few years? And it seems the answer is yes. (contrast this to VP3/Theora which has a fundamentally broken code stream which has made it very hard to write a good encoder).

ADDENDUM : this post by Greg Maxwell is pretty right on.

ADDENDUM 2 : Something major that's been missing from the web discussions and from the literature about video for a long time is the separation of code stream from encoder. The code stream basically gives the encoder a language and framework to work in. The things that Jason / Dark Shikari thinks are so great about x264 are almost entirely encoder-side things that could apply to almost any code stream (eg. "psy rdo" , "AQ", "mbtree", etc.). The literature doesn't discuss this much because they are trapped in the pit of PSNR comparisons, in which encoder side work is not that interesting. Encoder work for PSNR is not interesting because we generally know directly how to optimize for MSE/SSD/L2 error - very simple ways like flat quantizers and DCT-space trellis quant, etc. What's more interesting is perceptual quality optimization in the encoder. In order to achieve good perceptual optimization, what you need is a good way to measure perceptual error (which we don't have), and the ability to try things in the code stream and see if they improve perceptual error (hard due to non-local effects), and a code stream that is flexible enough for the encoder to make choices that create different kinds of errors in the output. For example adding more block modes to your video coder with different types of coding is usually/often bad in a PSNR sense because all they do is create redundancy and take away code space from the normal modes, but it can be very good in a perceptual sense because it gives the encoder more choice.

ADDENDUM 3 : Case in point , I finally have noticed some x264 encoded videos showing up on the torrent sites. Well, about 90% of them don't play back on my media PC right. There's some glitching problem, or the audio & video get out of sync, or the framerate is off a tiny bit, or some shit and it's fucking annoying.

ADDENDUM 4 : I should be more clear - the most exciting thing about VP8 is that it (hopefully) provides an open patent-free standard that can then be played with and discussed openly by the development community. Hopefully encoders and decoder will also be open source and we will be able to talk about the techniques that go into them, and a whole new

05-13-10 | P4 with NiftyPerforce and no P4SCC

I'm trying using P4 in MSDev with NiftyPerforce and no P4SCC.

What this means is VC thinks you have no SCC connection at all, your files are just on your disk. You need to change the default NiftyPerforce settings so that it checks out files for you when you edit/save etc.

Advantages of NiftyPerforce without P4SCC :

1. Much faster startup / project load, because it doesn't go and check the status of everything in the project with P4.

2. No clusterfuck when you start unconnected. This is one of the worst problems with P4SCC, for example if you want to work on some work projects but can't VPN for some reason, P4SCC will have a total shit fit about working disconnected. With the NiftyPerforce setup you just attrib your files and go on with your business.

3. No difficulties with changing binding/etc. This is another major disaster with P4SCC. It's rare, but if you change the P4 location of a project or change your mappings or if you already have some files added to P4 but not the project, all these things give MSdev a complete shit-fit. That all goes away.

Disadvantages of NiftyPerforce without P4SCC :

1. The first few keystrokes are lost. When you try to edit a checked-in file, you can just start typing and Nifty will go check it out, but until the checkout is done your keystrokes go to never-never land. Mild suckitude. Alternatively you could let MSDev pop up the dialog for "do you want to edit this read only file" which would make you more aware of what's going on but doesn't actually fix the issue.

2. No check marks and locks in project browser to let you know what's checked in / checked out. This is not a huge big deal, but it is a nice sanity check to make sure things are working the way they should be. Instead you have to keep an eye on your P4Win window which is a mild productivity hit.

One note about making the changeover : for existing projects that have P4SCC bindings, if you load them up in VC and tell VC to remove the binding, it also will be "helpful" and go attrib all your files to make them writeable (it also will be unhelpful and not check out your projects to make the change to not have them bound). Then NiftyPerforce won't work because your files are already writeable. The easiest way to do this right is to just open your vcproj's and sln's in a text editor and rip out all the binding bits manually.

I'm not sure yet whether the pros/cons are worth it. P4SCC actually is pretty nice once it's set up, though the ass-pain it gives when trying to make it do something it doesn't want to do (like source control something that's out of the binding root) is pretty severe.


I found the real pro & con of each way.

Pro P4SCC : You can just start editing files in VC and not worry about it. It auto-checks out files from P4 and you don't lose key presses. The most important case here is that it correctly handles files that you have not got the latest revision of - it will pop up "edit current or sync first" in that case. The best way to use Nifty seems to be Jim's suggestion - put checkout on Save, do not checkout on Edit, and make files read-only editable in memory. That works great if you are a single dev but is not super awesome in an actual shared environment with heavy contention.

Pro NiftyP4 : When you're working from home over an unreliable VPN, P4SCC is just unworkable. If you lose connection it basically hangs MSDev. This is so bad that it pretty much completely dooms P4SCC. ARG actually I take that back a bit, NiftyP4 also hangs MSDev when you lose connection, though it's not nearly as bad.

05-12-10 | P4 By Dir

(ADDENDUM : see comments, I am dumb).

I mentioned this before :

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have some work code and some home code open at the same time they shit their pants. I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

But I'm having second thoughts, because putting little config shitlets in my source dirs is one of the things I hate about CVS. Granted it would be much better in this case - I would only need a handful of them in my top level dirs, but another disadvantage is my p4bydir app would need to scan up the dir tree all the time to find config files.
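
For reference, the "walk up the dir tree and take the first config you find" lookup is only a few lines. This is a sketch under assumptions: the config file name p4bydir.cfg and the function name find_config are made up for the example, and nothing here talks to P4.

```python
from pathlib import Path


def find_config(start, name="p4bydir.cfg"):
    """Return the nearest config file at or above `start`, or None if there is none."""
    here = Path(start).resolve()
    # Check the starting dir first, then each parent up to the drive root.
    for folder in (here, *here.parents):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None
```

The cost mentioned above is visible here: every lookup does one file-system probe per directory level, so a wrapper that runs on every p4 call would want to cache the result per directory.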

And there's a better way. The thing is, the P4 Client specs already have the information of what dirs on my local machine go with what depot mappings. The problem is the client spec is not actually associated with a server. What you need is a "port client user" setting. These are stored as favorites in P4Win, but there is no authoritative list of the valid/good "port client user" setups on a machine.

So, my new idea is that I store my own config file somewhere that lists the valid "port client user" sets that I want to consider in p4bydir. I load that and then grab all the client specs. I use the client specs to identify what dirs to map to where, and the "port client user" settings to tell what p4 environment to set for that dir.

I then replace the global p4.exe with my own p4bydir so that all apps (like NiftyPerforce) will automatically talk to the right connection whenever they do a p4 on a file.
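
The lookup at the heart of that idea could be sketched like this. It assumes the "port client user" sets have already been loaded and joined with each client spec's local root into plain dicts; the dict layout, the server names and the function name pick_connection are all hypothetical, and fetching client specs from the server is not shown.

```python
from pathlib import PureWindowsPath


def pick_connection(path, connections):
    """Return the connection whose client root contains `path`, or None.

    `connections` is a list of dicts with 'root', 'port', 'client' and 'user'
    keys. If roots are nested, the deepest matching root wins.
    """
    target = PureWindowsPath(path)
    best = None
    best_depth = -1
    for conn in connections:
        root = PureWindowsPath(conn["root"])
        # PureWindowsPath compares case-insensitively, like the Windows file system.
        if target == root or root in target.parents:
            depth = len(root.parts)
            if depth > best_depth:
                best, best_depth = conn, depth
    return best
```

A p4.exe replacement would call this with the file or working directory it was handed, set the port, client and user from the result, and then forward the command to the real p4.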

05-12-10 | Cleartype

Since I ranted about Cleartype I thought I'd go into a bit more detail. this article on Cleartype in Win7 is interesting, though also willfully retarded.

Another research question we’ve asked ourselves is why do some people prefer bi-level rendering over ClearType? Is it due to hardware issues or is there some other attribute that we don’t understand about visual systems that is playing a role. This is an issue that has piqued our curiosity for some time. Our first attempt at looking further into this involved doing an informal and small-scale preference study in a community center near Microsoft.

Wait, this is a research question ? Gee, why would I prefer perfect black and white raster fonts to smudged and color-fringed cleartype. I just can't imagine it! Better do some community user testing...

1. 35 participants. 2. Comments for bi-level rendering: Washed out; jiggly; sketchy; if this were a printer, I’d say it needed a new cartridge; fading out – esp. the numbers, I have to squint to read this, is it my glasses or it is me?; I can’t focus on this; broken up; have to strain to read; jointed. 3. Comments for ClearType: More defined, Looks bold (several times), looks darker, clearer (4 times), looks like it’s a better computer screen (user suggested he’d pay $500 more for the better screen on a $2000 laptop), sort of more blue, solid, much easier to read (3 times), clean, crisp, I like it, shows up better, and my favorite: from an elderly woman who was rather put out that the question wasn’t harder: this seems so obvious (said with a sneer.)

Oh my god, LOL, holy crap. They are obviously comparing Cleartyped anti-aliased fonts to black-and-white rendered TrueType fonts, NOT to raster fonts. They're probably doing big fonts on a high DPI screen too. Try it again on a 24" LCD with an 8 point font please, and compare something that has an unhinted TrueType and an actual hand-crafted raster font. Jesus. Oh, but I must be wrong because the community survey says 94% prefer cleartype!

Anyway, as usual the annoying thing is that in pushing their fuck-tard agenda, they refuse to acknowledge the actual pros and cons of each method and give you the controls you really want. What I would like is a setting to make Windows always prefer bitmap fonts when they exist, but use ClearType if it is actually drawing anti-aliased fonts. Even then I still might not use it because I fucking hate those color fringes, but it would be way more reasonable. Beyond that obviously you could want even more control like switching preference for cleartype vs. bitmap per font, or turning on and off hinting per font or per app, etc. but just some more reasonable global default would get you 90% of the way there. I would want something like "always prefer raster font for sizes <= 14 point" or something like that.

Text editors are a simple case because you just let the user set the font and get what they want, and it doesn't matter what size the text is because it's not laid out. PDF's and such I guess you go ahead and use TT all the time. The web is a weird hybrid which is semi-formatted. The problem with the web is that it doesn't tell you when formatting is important or not important. I'd like to override the basic firefox font to be my own choice nice bitmap font *when formatting is not important* (eg. in blocks of text like I make). But if you do that globally it hoses the layout of some pages. And then other pages will manually request fonts which are blurry bollocks.

CodeProject has a nice font survey with Cleartype/no-Cleartype screen caps.

GDI++ is an interesting hack to GDI32.dll to replace the font rendering.

Entropy overload has some decent hinted TTF fonts for programmers you can use in VS 2010.

Electronic Dissonance has the real awesome solution : sneak raster fonts into asian fonts so that VS 2010 / WPF will use them. This is money if you use VS 2010.

05-11-10 | Note from the Mail Man

For reference, this is the ferocious beast that is terrorizing the poor mailman :

LOL. It would actually be pretty damn sweet if I could stop getting mail. Don't think the duplex neighbor would like that though.

05-11-10 | Some New Cblib Apps

Coded up some new goodies for myself today and released them in a new cblib and chuksh .

RunOrActivate : useful with a hot key program, or from the CLI. Use RunOrActivate [program name]. If a running process of that program exists, it will be activated and made foreground. If not, a new instance is started. Similar to the Windows built-in "shortcut key" functionality but not horribly broken like that is.

(BTW for those that don't know, Windows "shortcut keys" have had huge bugs ever since Win 95 ; they sometimes work great, basically doing RunOrActivate, but they use some weird mechanism which causes them to not work right with some apps (maybe they use DDE?), they also have bizarre latency semi-randomly, usually they launch the app instantly but occasionally they just decide to wait for 10 seconds or so).

RunOrActivate also has a bonus feature : if multiple instances of that process are running it will cycle you between them. So for example my Win-E now starts an explorer, goes to existing one if there was one, and if there were a few it cycles between explorers. Very nice. Also works with TCC windows and Firefox Windows. This actually solves a long-time usability problem I've had with shortcut keys that I never thought about fixing before, so huzzah.
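
The run / activate / cycle decision can be sketched independently of any Windows calls. This is not the cblib code, just the logic as described above, with window handles modeled as plain integers and a hypothetical function name; enumerating the windows of a process and actually bringing one to the foreground are left out.

```python
def next_window(windows, foreground):
    """Decide what RunOrActivate should do.

    `windows` is the list of top-level window handles belonging to the target
    program, `foreground` is the currently active window handle.
    Returns the handle to activate, or None to mean "launch a new instance".
    """
    if not windows:
        return None  # nothing running: start the program
    if foreground in windows:
        # Already looking at one instance: cycle to the next, wrapping around.
        return windows[(windows.index(foreground) + 1) % len(windows)]
    return windows[0]  # running but not focused: activate the first instance
```

Keeping the window list in a stable order matters here; if the order changes between presses (for example if it follows Z-order), repeated presses can bounce between two windows instead of visiting all of them.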

WinMove : I've been using this forever, lets you move and resize the active window in various ways, either by manual coordinate or with some shorthands for "left half" etc. Anyway the new bit is I just added an option for "all windows" so that I can reproduce the Win-M minimize all behavior and Win-Shift-M restore all.

I think that gives me all Win-Key functions I actually want.

ADDENDUM : One slightly fiddly bit is the question of *which* window of a process to activate in RunOrActivate. Windows refuses to give you any concept of the "primary" window of a process, simply sticking to the assertion that processes can have many windows. However we all know this is bullshit because Alt-Tab picks out an isolated set of "primary" windows to switch between. So how do you get the list of alt-tab windows? You don't. It's "undefined", so you have to make it up somehow. Raymond Chen describes the algorithm used in one version of Windows.

05-09-10 | Some Win7 Shite

Perforce Server was being a pain in my ass to start up because the fucking P4S service doesn't get my P4ROOT environment variable. Rather than try to figure out the fucking Win 7 per-user environment variable shite, the easy solution is just to move your P4S.exe into your P4ROOT directory, that way when it sees no P4ROOT setting it will just use current directory.

New P4 Installs don't include P4Win , but you can just copy it from your old install and keep using it.

This is not a Win7 problem so much as a "newer MS systems" problem, but non-antialiased / non-cleartype text rendering is getting nerfed. Old stuff that uses GDI will still render good old bitmap fonts fine, but newer stuff that uses WPF has NO BITMAP FONT SUPPORT. That is, they are always using antialiasing, which is totally inappropriate for small text (especially without cleartype). (For example MSVC 2010 has no bitmap font support (* yes I know there are some workarounds for this)).

This is a huge fucking LOSE for serious developers. MS used to actually have better small text than Apple, Apple always did way better at smooth large blurry WYSIWYG text shit. Now MS is just worse all around because they have intentionally nerfed the thing they were winning at. I'm very disappointed because I always run no-cleartype, no-antialias because small bitmap fonts are so much better. A human font craftsman carefully choosing which pixels should be on or off is so much better than some fucking algorithm trying to approximate a smooth curve in 3 pixels and instead giving me fucking blue and red fringes.

Obviously anti-aliased text is the *future* of text rendering, but that future is still pretty far away. My 24" 1920x1200 that I like to work on is 94 dpi (a 30" 2560x1600 is 100 dpi, almost the same). My 17" lappy at 1920x1200 has some of the highest pixel density that you can get for a reasonable price, it's pretty awesome for photos, but it's still only 133 dpi which is shit for text (*). To actually do good looking antialiased text you need at least 200 dpi, and 300 would be better. This is 5-10 years away for consumer price points. (In fact the lappy screen is the unfortunate uncanny valley; the 24" at 1920x1200 is the perfect res where non-antialiased stuff is the right size on screen and has the right amount of detail. If you just go to slightly higher dpi, like 133, then everything is too small. If you then scale it up in software to make it the right size for the eye, you don't actually have enough pixels to do that scale up. The problem is that until you get above 200 dpi where you can do arbitrary scaling of GUI elements, the physical size of the pixel is important, and the 100 dpi pixel is just about perfect). (* = shit for anti-aliased text, obviously great for raster fonts at 14 pels or so).
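
The dpi figures are just the pixel diagonal divided by the physical diagonal. A quick sketch (the function name dpi is arbitrary, and it assumes square pixels and that the quoted screen size is the diagonal in inches):

```python
import math


def dpi(width_px, height_px, diagonal_in):
    """Pixels per inch of a panel, given its resolution and diagonal size in inches."""
    return math.hypot(width_px, height_px) / diagonal_in


for name, w, h, diag in [
    ('24" external', 1920, 1200, 24),
    ('30" panel', 2560, 1600, 30),
    ('17" lappy', 1920, 1200, 17),
]:
    print(f"{name}: {dpi(w, h, diag):.1f} dpi")
```

That gives roughly 94, 101 and 133 dpi, using the common 2560x1600 resolution for the 30" panel.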

( ADDENDUM : Urg I keep trying to turn on Cleartype and be okay with it. No no no it's not okay. They should call it "Clear Chromatic Aberration" or "Clearly the Developers who think this is okay are colorblind". Do they think our eyes only see luma !? WTF !? Introducing colors into my black and white text is just such a huge visual artifact that no amount of improvement to the curve shapes can make up for that. )

It's actually pretty sweet right now living in a world where our CPU's are nice and multi-core, but most apps are still single core. It means I can control the load on my machine myself, which is damn nice. For example I can run 4 apps and know that they will all be pretty nice and snappy. These days I am frequently keeping 3 copies of my video test app running various tests all the time, and since it's single core I know I have one free core to still fuck around on the computer and it's full speed. The sad thing is that once apps actually all go multi-core this is going to go away, because when you actually have to share cores, Windows goes to shit.

Christ why is the registry still so fucking broken? 1. If you are a developer, please please make your apps not use the registry. Put config files in the same dir as your .exe. 2. The Registry is just a bunch of text strings, why is it not fucking version controlled? I want a log of the changes and I want to know what app made the change when. WTF.

The only decent way to get environment variables set is with TCC "set /S" or "set /U".

"C:\Program Files (x86)" is a huge fucking annoyance. Not only does it break by muscle memory and break a ton of batch files I had that looked for program files, but now I have a fucking quandary every time I'm trying to hunt down a program.. err is it in x86 or not? I really don't like that decision. I understand it's needed for if you actually have an x86 and x64 version of the same app installed, but that is very rare, and you should have only bifurcated paths on apps that actually do have a dual install. (also because lots of apps hard code to c:\program files , they have a horrible hack where they let 32 bit apps think they are actually in c:\program files when they are in "C:\Program Files (x86)"). Blurg.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]


[HKEY_CURRENT_USER\Control Panel\Desktop]

Some links :

Types - Vista/Win7 has borked the "File Associations" setup. You need a 3rd party app like Types now to configure your file types (eg. to change default icons).

Shark007.net - Windows 7 Codecs - WMP12 Codecs - seem to work.

Pismo Technic Inc. - Pismo File Mount - nicest ISO mounter I've found (Daemon tools feels like it's made out of spit and straw).

Hot Key Plus by Brian Apps - ancient app that still works and I like because it's super simple.

Change Windows 7 Default Folder Icon - Windows 7 Forums ; presumably you have the Preview stuff for Folders turned off, so now make the icon not so ugly.

- how to move your perforce depot Annoyingly I used a different machine name for new lappy and thus a different clientview, so MSVC P4SCC fails to make the connection and wants to rebind every project. The easiest way to fix this is just to not use P4SCC and kill all your bindings and just use NiftyPerforce without P4SCC.

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have some work code and some home code open at the same time they shit their pants. I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

allSnap make all windows snap - AllSnap for x64/Win7 seems to be broken, but the old 32 bit one seems to work just fine still. (ADDENDUM : nope, old allsnap randomly crashes in Win 7, do not use)

KeyTweak Homepage - I used KeyTweak to remap my Caps Lock to Alt.

Firefox addons :

PDF Download
Stop Autoplay
Adblock Plus

ADDENDUM : I found the last few guys who are ticking my disk :

One that you obviously want to disable is Windows Media Player Media sharing service : Fix wmpnetwk.exe In Windows 7 . It just constantly scans your media dirs for shit to serve. Fuck you.

The next big culprit is the Windows Reliability stuff. Go ahead and disable the RAC scheduled task, but that's not the real problem. The nasty one is the "last alive stamp" which windows writes once a minute by default. This is to help diagnose crashes. You could change TimeStampInterval to 30 or so to make it once every 30 minutes, but I set it to zero to disable it. See :
Shutdown Event Tracker Tools and Settings System Reliability
How to disable hard disk thrashing with Vista - Page 7

And this is a decent summary/repeat of what I've said : How to greatly reduce harddisk grinding noises in Vista .

05-07-10 | New Lappy

I got my new lappy, a Dell Precision M6500 big 17" behemoth. I'm setting up Win 7 today which is mostly going smoothly.

First impressions of the M6500 : build quality is very nice, lightweight all metal like the Latitude series. The screen is pretty great, 1920x1200 matte, pretty bright, decent contrast, the only complaint is that it's not super viewing-angle-independent. It is a very nice bonus that the lappy LCD res is the same res that I run my 24" external, so I can switch between lappy LCD and external LCD and it doesn't hose my layouts (however, on the minus side of that equation, because the lappy LCD is so pixel dense, I have to run in large fonts on it, and switch back to small fonts for the external LCD, boo). The internal peripherals and the case are well designed for popping things in and out. It takes two 2.5" thin lappy disks and has 4 RAM slots. The disks are easy to get in and out, and 2 of the RAMs are easy, but the other two are under the keyboard and take a bit more work. The thing is pretty amazingly quiet, the only auditory annoyance is that the fan oscillates up and down too much, I wish I could program the fan to just be on all the time in low speed instead of jumping up and down. The keyboard action is very nice, like the Latitude series, but it is the standard fucking retarded 17" lappy thing of just using a normal 15" lappy keyboard and sticking it in the larger case. Jebus can't you people actually make a custom keyboard for the 17" form factor that takes advantage of the extra space to give me better layout!? Come on! Anyway, since I don't really use lappy as a lappy, this is mostly academic.

Win 7 reinstall went very smoothly - it autodetects all the hardware well enough that you can at least boot up. You then need to install a few things - the GPU drivers, the USB3 driver, the Touchpad driver. That's about it. Smoothest Windows install I've ever had, by far.

I had one minor annoyance during install - turns out the lappy came with 1066 MHz RAM and I bought 1333 MHz RAM to add. Well, when you do that the lappy boots right up and doesn't complain at all, and runs through memory check (and even the Windows heavy duty memory check) and doesn't find any errors. But it will in fact give you random RAM failures and blue-screen you. So I had to reformat the disk and pull the 1066 RAM and reinstall windows. Then I discovered that the lappy doesn't work with memory only in the C/D slots and not the A/B slots. So I pulled it apart again, and then finally got going. Sigh.

Make sure you switch the BIOS to AHCI, not Intel's fucking raid thing (*before* the Windows reinstall). Currently I am running the MS AHCI driver which supports TRIM. Unclear whether I will ever install the Intel driver.

I put in an Intel X25-M SSD. Holy shit this thing is so good. If you are a serious computer user and you are not on an SSD - GET ONE RIGHT NOW.

So far I have disabled : Indexing, ReadyBoot, Prefetch, Superfetch, Scheduled Defrag, Defender, Updater, System Restore, Page File, Hibernate, Wifi polling. That has cleaned things up nicely, but some fucking Win7 service is still pinging my disk every 10 seconds or so and I haven't tracked down what it is yet. (fuckers). When I am doing nothing on my computer, it should go to 0% CPU and never ever touch my disk.

Win7 is mostly good so far. As usual there's the annoyance that MS loves to randomly rename things and move them around. Perhaps the worst thing so far is that fucking Backspace is no longer "go up a dir" in Explorer. Yes, I know Alt-Up does that, but it should be fucking backspace god dammit! This was widely complained about during Beta, but like fuckers they refused to provide a config switch to let me have my damn old backspace behavior. I found an AHK solution to map Backspace to Alt-Up , but AHK is a fucking bloated beast of flakey crapware so I'm hoping to find another solution to this (probably just write my own).

The Win-# hot keys are almost a good thing, except that using a number which is their position in the task bar is fucking awful. You should let me assign my own hotkeys, that way I can use Win-V for VisualStudio and Win-F for Firefox or whatever, so that I can actually memorize the keypresses and be fast instead of having to count each time.

Finally, it annoys me to all fuck that MS refuses to give me the one feature that would make UAC usable : just a button to always allow promotion of a given app. When it says "do you want to allow?" it should be "Yes/No/Always" , not just "Yes/No". As is, the fucking Shareware "ProcessGuard" is much much better than UAC because with ProcessGuard you can actually say "always allow this app" or "always forbid this app" blah blah. It's so fucking obvious and such a major usability fuckup. The result is that 99% of power users just turn off UAC, whereas if you had a "always allow for this app" I would totally leave UAC on. I dunno, it just boggles my mind, it would be so easy to make UAC useful and functional. You just have to understand how computing works. I want to have a host of programs installed on my computer which I have marked trusted and let them do whatever they want. Then I want to be able to download random junk from the internet and run them in safe mode where they are forbidden from doing various things like installing stuff in startup or fucking with my windows dir. WTF, why do I not have this !?

Oh, I guess I'm going to try going to MSVC 2010. Since this is supposed to be my machine for the next 10 years, I'd rather just eat the pain of getting on new stuff all at once and then hopefully not have to do it ever again. We'll see about that...

ADDENDUM : Urg. I take it all back, Win 7 is a FUCKING ABORTION (later addendum : that might have been a slight exaggeration). I am in a constant hell of fucking file ownerships and user privileges and shit. Here's just a minor sample :

You Map a network drive. Everything seems fine and dandy. Now you open a command prompt in "run as administrator" mode. Your mapped drive is not there !? WTF !? Oh, brilliant fuckers that they are, the drive mapping is *PER USER* so the fucking administrator doesn't see it, so you have to remap it for the administrator account.

I install Perforce. Of course Perforce Server installs as "owned by" the Administrator account. So when I am logged in as my account if I try to do anything to those files it says "fuck you I'm going to pop up annoying boxes".

I'm trying to copy files from my old lappy's disk to my new one. Of course now Win 7 pays attention to the NTFS security tags and it sees those files are owned by some other user, so I get a bunch of random "Access Denied" messages with no explanation. Of course I can just go and do a fucking recursive "take ownership" of all those files, but that's really just a hack fix and if I plug that disk in somewhere else I'll have to do it again.

Jesus christ. Somebody give me an operating system where my files are just fucking files without some security or owner bullshit and I can run whatever I want and it just fucking works. I want Windows 95 please.

It's easy enough to disable UAC popups, but that's only the fucking tip of the iceberg. UAC is fucking you up back and forth all the time. Win 7 UAC also does some super nasty shit that most people don't know about, in that it remaps a bunch of virtual directory names *per user* (and also for 32 bit vs 64 bit and other compatibility modes), so depending on how your program is run it can see very different things on the same box. DO NOT WANT.

04-22-10 | Image Writings

I gathered my recent image compression posts to send someone, thought I'd dump them here too - (it's also quite possible I missed some since I just found these by googling myself)

05-13-09 - Image Compression Rambling Part 1
05-14-09 - Image Compression Rambling Part 2
05-15-09 - Image Compression Rambling Part 3 - HDR
05-26-09 - Some Study of DCT coefficients
05-27-09 - Optimal Local Bases
02-19-09 - PNG Sucks
05-15-09 - Trellis Quantization
01-14-10 - A small note on Trellis quantization
02-23-10 - Image Compresson - Color , ScieLab
03-03-10 - Image Compresson - Color , ScieLab - Part 2
07-06-09 - Small Image Compression Notes
07-07-09 - Small Image Compression Notes , Part 2
08-25-09 - Oodle Image Compression Looking Back
08-27-09 - Oodle Image Compression Looking Back Pictures
02-10-10 - Some little image notes
05-18-09 - Lagrange Space-Speed Optimization
01-12-10 - Lagrange Rate Control Part 1
01-12-10 - Lagrange Rate Control Part 2
01-13-10 - Lagrange Rate Control Part 3

12-08-08 - DXTC Summary
11-18-08 - DXTC Part 1
11-18-08 - DXTC Part 2
11-20-08 - DXTC Part 3
11-21-08 - DXTC Part 4
09-08-09 - DXTC Addendum
06-17-09 - DXTC More Followup
11-20-08 - Pointless
11-21-08 - More Texture Compression Nonsense
02-10-09 - Fixed Block Size Embedded DCT Coder

04-21-10 | Car Alignment

I got a "performance alignment" done for my car about a week ago. I'd heard it was the #1 best thing to do for your car if you are a serious driver (after or when you get good tires if your car doesn't come with them), but MY GOD I was not prepared for what an incredible difference it was. It was like "night and day" or "a whole new car" or whatever cliche you want to use to express the incredible difference in feel. It really felt like a different car. Before, the 997 felt powerful but a bit heavy and clumsy and mild, with the more aggressive alignment, it felt way sharper, much more "bite" for turn-in, much more grip in the corners, able to get more power down through the corner without losing grip, just awesome. I immediately went out and did some mad driving around the city because it just felt so damn good.

If you are a serious driver and are still running stock alignment, I highly recommend it. But do your research first.

Most people think of "alignment" as making the wheels straight to fix pulling issues. In reality alignment is much more than that and can give you a lot of parameters to tweak to play with the way the car handles. It's much cheaper than doing suspension mods, and often more effective. The main things you will want adjusted are toe and camber. I posted these links before :

Wheel Alignment A Short Course
Caster, Camber, Toe

But to repeat myself a bit, most stock alignments come with pretty significant toe-in and pretty minimal camber. What the toe-in does is make the car more stable, it acts to straighten itself. That means the front end doesn't wander under braking or when you hit bumps. Most pussy comfort drivers complain when cars are "wiggly" (which means they are "lively"), so manufacturers sell the cars with lots of toe-in. This makes them feel nice and stable on the freeway, but it sucks for making crisp turns. The other issue is camber - lots of negative camber means the wheels are tilted in; this will give you better contact patches when the car is in a lean through a corner. If you spend most of your time going straight on freeways or just turning slowly, then the stock alignment with minimal camber is fine, but if you do a lot of hard cornering, you will get much better grip with lots of negative camber. I now have my car set to zero toe and -1.2 degrees camber. Race cars will go even more extreme, sometimes even using toe *out* which makes them super lively and easy to turn-in, and tons of negative camber.

For Porsches you can ask for the "RoW Performance alignment" or just go to a good sport alignment shop and ask for an "aggressive street alignment" or something like that.

The next step on a Porsche is to get the GT3 "Lower Control Arms" which let you go to greater negative camber (-2 degrees seems good). You can hack this with "camber plates" but camber plates are like $50 and the LCA's are like $150 and labor will be $200 or so, so just go with the better LCA solution.

04-20-10 | Life

One of the shittiest things you can do is to dump on someone's happiness. Happiness is rare and difficult, and often involves lots of work to build up to it. When somebody finds a new hobby that is making them really happy and they always want to tell stories about it at lunch ("a funny thing happened last weekend at the model airplane field...") you just have to suck it up and feign interest and be a good listener. When somebody buys a new toy (like an iPad) you have to go along with the after-purchase show-off euphoria and do the proper oohs and ahs and feign jealousy. If you think that what they're doing or buying is actually awful and a waste of time, just shut the fuck up and keep it to yourself.

Similarly with events - if someone asks you out to do something, and they are clearly enjoying it but you are not - be a fucking good sport and try to get in the spirit and at least feign moderate happiness. Let them enjoy their time, don't make yourself into a distraction and annoyance by griping or wandering off or whatever. When you agree to go out with someone you implicitly agree that if they like it and you don't, you will go along with it. It's not that hard. And just tolerating it while obviously showing your annoyance and saying "I'm just here for you, let me know when we can leave" does not count as going along with it, it's still shitty.

Recently I've been thinking about this party I went to at the top of the condo tower on First Hill. My date at the time made me go despite my apprehension. As I expected, it was fucking awful, just a mob of people I didn't know with nothing to talk about, no dancing, no games, just stupid fucking boring people drinking and chit chatting about nonsense. It was excruciating, and like the dick that I am, even though I was "being a good sport" in my mind, I made it abundantly clear with my body language and constant wandering off that I was not happy. The funny thing is that many months later, the horrible party is actually one of the more memorable social events that I've attended in the last few years. It gives me something to talk about, the condo tower it was in is the tallest building around and the party was right at the top so it's a unique experience. It's weird this need to "do something" that we have; when I haven't been out much I get this feeling of stir craziness, that I'm wasting my life, you get depressed without knowing why exactly, then you go out to some event and it's just awful the whole time and you can't wait to leave, and then months later it is the thing you did that you remember, not all those days when you actually were happy and just stayed home or went for a bike ride or whatever it is that you actually enjoy.

People who suck at things are a real problem. In theory, it doesn't actually matter whether people are good at things or not - we put a lot of our self worth into our "skills" (I am so fucking great because I'm good at X), but in reality when I'm hanging out with someone I couldn't care less about their skills. What actually matters is their attitude and emotional intelligence and sense of humor and kindness and so on, that is what actually makes you a good person to be with. But in reality, people who suck at things are just a fucking drag. The problem is that *they* care that they suck. The result is that they are often in a funk because they screwed something up, or they are super afraid of being judged, or they have big insecurity problems. The result is that they bring you down because they don't want you to be so much better than them, they force you to hide your own skills. For example, I really don't mind if someone cooks for me and makes mediocre food; what makes it fun or not is their attitude, their conversation, their enthusiasm, in fact if they are excited about their food that is way more important than whether it is actually good or not (the converse is people who are really awesome cooks but act all humble and self-deprecating about it which is just annoying and unpleasant). However, that almost never works out because they imagine you are thinking horrible things about them and their food. Similarly with playing board games or sports with someone; when I toss a frisbee with someone, I don't care how great they are, yeah it's better if you can actually make some throws and catches, but enthusiasm and good attitude and hustle are way more important. The problem is that people who suck get down on themselves and get in a funk and then it's just annoying to be with them.

People who can suck and not care about it are very rare and very cool. In fact that is one of my great unattained goals for myself.

04-20-10 | Speeders Beware

Press Release :

Starting today, April 9, and extending through May 1st law enforcement agencies throughout
Washington will crack down on speeding with extra patrols on local roads, state highways 
and interstate freeways.

From April 9th through May 1st, extra speeding patrols will begin throughout Washington on:

1.      Fridays from 11 a.m. to 7 p.m.
2.      Saturdays from noon to 8 p.m.
3.      Sundays from noon to 8 p.m.

04-19-10 | PNWR Driver Skills Day

I went and did the Porsche club "Driver Skills" DS day last weekend. It's a prerequisite before you get to run on the real race track (aka "Driver Education" or DE), and I thought it would be a good way to get used to running the car closer to the limit, since I'm not very familiar with it yet. DS is held at Bremerton Raceway Park which is just an abandoned air strip. It's just a big tarmac and they put a bunch of cones on it and you do various exercises. (Porsche club also runs its autocross out there).

The day starts by stripping everything loose out of your car - everything! Things typically forgotten are floor mats and the tire kit in your trunk. Take it all out. It's good to bring a big tub or something to keep your gear in that you took out of your car.

Then there was a one hour ground school. If you know anything about race driving (eg. if you know what a "late apex" is or what the "traction budget" is) then this is pretty boring, but to their credit the guys speaking were actually pretty charismatic and moved quickly and made it not excruciating.

The rest of the day consisted of six different exercise stations. You spend an hour at each station then rotate to the next. At each station there are 10 cars or so and you take turns running the course, and a few volunteer instructors rotate through the cars so you almost always have an instructor in car with you. The instructors were uniformly great guys & gals - they volunteer and were very friendly and knowledgeable and had great attitudes despite us trying to make them vomit and standing in the rain all day.

Oh yeah, it was wet, it rained pretty hard all day. That made everything very slippery. In one way it was good to get to practice with the car in the wet, but it would have been nice to get some runs in the dry.

The sessions were :

Skidpad (donuts) : they wet the track and you go around in circles. You can really feel how you can throttle steer in this exercise, more throttle and you circle wider, less throttle you come in. Like most of the things you learn in DS, you should already know this in your brain, but you need to actually feel it in your car and hands for it to become intuition. I also got to play with the understeer/oversteer characteristics of my car. When I gradually take the throttle up past the limit of adhesion, my car goes into understeer and plows out; if I really punch the throttle hard, it will kick the rear out and go into an oversteer spin. I was never able to kick the rear out and control it, it's hard to do in the C4S because you have to kick the throttle so hard to get it to step out. The main mistake I was making was once I got into a spin I was reflexively putting the clutch in and letting off the throttle abruptly; what I need to do is just ease down on the throttle and try to catch it before bailing out.

Braking / accident avoidance : part 1 is go as fast as you can and then slam on the brakes to full stop as late as you can. This was a trip because it is absolutely amazing how fast the car can stop when you are fully on the brakes (even in the wet!). The goal of the exercise is to brake as late as possible and still stop before the cone. I kept braking way too soon because my intuition says "you have to brake now!" ; by the end I started getting closer, but I really needed more time to get used to it. Part 2 was accident avoidance; a surprise obstacle (a very brave volunteer) jumps out to one side, and you have to brake and steer away at the same time. Again I was just not braking hard enough here at first, because my gut says you can't steer if you're braking that hard; in fact you can!

Handling oval : run the car around in an oval to practice apexing. Get up to as much speed as possible before the turn, hard on the brakes without much turning, look for the apex, turn hard and try to power out past the apex, making a perfect "late apex" turn. Hard to do in practice. A few things I need to get better at on this - make sure you actually look and stare at the apex, it will be out your side window as you brake in; make sure you brake enough before turning in, you need to get well slow or you'll understeer and miss the apex. Basically what you're doing is prolonging your straights on both entry and exit, which is what gives you more speed; you brake late and very hard so you enter deep into the turn, then at very low speed you turn hard to aim back past the apex, then get on the power early and power out hard in a very gradual curve.

Slalom : weaving between cones and trying to go as fast as possible. The main thing here is smoothness and looking ahead. This felt pretty natural to me, it is a lot like skiing. The main thing is to look way ahead and have your line planned out, take the most gradual arc you can, don't jerk the car back and forth.

Advanced Slalom : like slalom, but cones are not in a straight line, so you have to look ahead more and use more planning. Pretty fun. The skill is a lot like slalom, being smooth, but also a bit like Autocross in visualizing the "empty space"; that is, don't see the cones and think that you have to drive past the cones, rather see all the empty space that the cones allow and pick your best line in that empty space - often there are straights hidden in the cones, and straights = speed.

+ bonus shifting training. I actually learned a lot from this. One thing that I was coached on all day by instructors was to keep your hands on the wheel! I habitually get my hand over to the shifter too often, sometimes I'll anticipate I might have to shift in a corner and I go move my hand over to the stick to prepare, but that means I'm not cornering as well, you need to stay on the wheel, then quickly pop over to the shifter and quickly pop back. I also do some funny unnecessary extra movements, like sometimes I move the shifter to neutral, let go for a half second, and then move it into the next gear. The main thing was some guidance on rev matching downshifts. I know how to do it, but it's great to just have someone watch you and pay attention to the RPM dial and the lurch of the car and tell you your mistake each time. The goal is to make a perfectly smooth downshift so that you can't feel it at all. The main thing is the Porsche engine is real "heavy" , you need to give it a real good kick of gas to get the revs up, and you need to do it well enough before you let the clutch out. A basic downshift should consist of : clutch in, select neutral, increase throttle, select gear, clutch out. The detail here is crucial - you increase throttle *before* you select gear. Another little trick I learned is that it's better to over-shoot the RPM for rev match than to undershoot it, so err on the side of too much gas.

Autocross : obviously the exercise that puts it all together, we got to do a few runs on a tiny autocross course. This was a fucking blast, and I plan to do some more Auto-X some day. You get to slam on the gas for the straights, hard on the brakes, make the turns, it involves all the skills. Like the advanced slalom course, a lot of the skill is in picking a good line, which means not following the cones - use the freedom of all that extra tarmac. Take your turns as wide and sweeping as you can, turn soft wiggles into straights. My main mistakes were : following the direction of the cones too much, not going as wide as I should after and before corners, coming into corners too fast - you have to really brake hard coming in, not getting on throttle hard enough or early enough in the straights - intuitively you see a 50 foot long straight before the next turn and your mind says "okay just coast in there for the straight" but what you need to do is floor it as late as possible and then slam on the brakes before the turn.

I learned a lot about my driving and my car's responses, so it was a huge success, and a lot of fun. It is a very long day - I woke up around 5 AM to get the ferry, so I was exhausted by the end, and the other drawback is that there is an awful lot of standing around. With 10 cars on each skill, that means you only spend 1/10 of the time actually driving and the rest is waiting. It would be so awesome to be the only one out there with just a real top instructor and tons of track time, you could improve so much.

There were also a huge variety of people there; it was only about 50% Porsches or maybe a bit less, lots of BMWs, and a few red herrings like a Hyundai Genesis and a Pontiac G8. There were a few real track cars that were trailered in. There were also people there who were not real speed drivers at all, who just wanted to get more comfortable with their cars.

My car is not really a good autocross car at all, it's too big and heavy, the 4WD is a disadvantage, and really the visibility of a convertible would help a lot. I was jealous of the people in Miatas and Elises and shit like that - a small light RWD car would be a fucking blast for AutoX, you can get up and down in speed quickly. Also the power of my car is really a disadvantage for me in a way - I'm not good enough to handle the car at the speeds it can do.

The ideal track car would be something really cheap so you don't have to worry about the abuse it will get, light, only medium power so it can't go fast enough to hurt me, RWD front engine. Put sticky tires on it and do some suspension mods. But there's no way I will ever track often enough for it to be worth having a "track car". It's another one of those things that you would ideally share between 10 friends or something.

04-15-10 | Laptop Part 2

Well I need to hurry up and get something so I decided to just go for the Thinkpad W510. And... Lenovo is completely out of stock on the reasonable screens ( "HD+" at 1600x900 ). Apparently they are indefinitely out of stock; they say "more than 4 weeks" but the reality is they have lost their supplier so they are just fucked.

So I'm looking at Dells. The Dell Latitude E6410 is very similar to the Thinkpad T410 ; Core i5 or i7 duals, nice magnesium case, 14.1" at 1440x900 matte (worse than my current lappy which is 1400x1050 but passable). NVidia 3100M. Proper hard drive options. All around it looks like a very nice general purpose lappy.

Dell Precision M4500 is the "mobile workstation" , equivalent to the Thinkpad W510. It's interesting how Dell and Lenovo seem to have lined up their products as direct competing analogues; there's some economic theory term for that. Anyway, it has options for dual or quad Core i5/i7, NV 1800 or 880 GPU, it can do two hard disks, screen is 15.6" at 1600x900 or 1920x1080 ; seems reasonable for me. One problem is it only has 2 RAM slots, so I'll max out at 8 GB, while the Thinkpad W510 has four. (Correction : no it can't really do two hard disks, it can run a second SSD in a mini slot, that's not as good as the 6500 which can do two real disks).

(the Dell M6500 is the even bigger one, but it doesn't offer a DX11 GPU either so there's really no point in stepping up to it; the best is an ATI FirePro M7740 which is a shitty business certified kind of thing on DX 10.1 ; it does have a proper 17" 16:10 screen in "WUXGA" (1920x1200) and you can get it factory built with RAID 0 2x500 GB 7200 RPM disks which is not bad. )

Dell has some retarded naming going on right now, watch out for that. eg the Latitude 6400 and 6500 are the old model, the xx10 are the new ones. The Dell Precision 4400 , 6400, etc are the old ones, the x500 are the new ones. WTF. Also the "Precision M" series is actually part of the "Latitude E" family and thus is compatible with the "E Series" docking solutions. But the docking changed from the 6400 to the 6410 generation so watch out for that. Yay. All these business Dell lappies do docking, though there are many reports of problems with that. It looks like those problems may be various drivers, but come on Dell, you need to get your shit together.

(BTW you might think the AlienWares would be even better, but no, not really. The best GPU you can get is an ATI HD 4870 which is DX 10. The screens are glossy and they don't have docking. Boo).

A few niggles to look at :

WiMAX ? So far as I know WiMax is not really rolled out in the US yet, but it is exciting for the future. Since I don't actually use my lappy on the go this is probably not for me. Maybe someday I'll have a WiMax router at home and cancel my cable.

Dual Core at something like 2.4 / 2.6 GHz or Quad Core at 1.6 / 1.73 GHz ? I suspect that the dual is actually faster for most current usage, but I'm tempted to go with the quad to give myself a better test environment for my multicore code. It is also nice the way multicore fixes Windows' broke-ass multi-tasking so that you can actually have 3 apps running and they are all fully responsive.

(BTW all you fucking app developers should not be gobbling so much CPU when your app is out of focus god damnit, especially gobbling CPU for fucking cutesy UI update shit when you are not in focus).

128 GB SSD or 500 GB 7200 RPM HD ? SSD is no doubt the win for a thin/light lappy, but for this type of workstation lappy I'm not so sure. There are reports of data loss, performance degradation over time, etc. that are a bit scary for something I hope to run for 5-10 years. The fast random access and not having to worry about defrag is a pretty huge win though. One annoyance is I would have to buy my own SSD third party and reinstall Windows.

Small addendum on SSDs : yeah they are the win. Going for a 160 GB Intel X25-M. Make sure you get a "G2" (gen2) as Gen 1 apparently has no TRIM. Also the X25-E is *way* faster, but crazy expensive and only 32 GB. That indicates though that SSD speeds are going to continue to shoot up and prices come down.

04-14-10 | Apples

I can't eat an apple without almost choking. There's always a moment after I've eaten about half the apple where the back-log of unchewed bits hits critical blockage volume in my throat; I struggle to force it down with hard swallowing, and then panic and try to find water quickly. An apple without a glass of water is almost a booby trap, if someone hands you an apple they might be trying to kill you.

One of the great mysteries of living in Seattle is that it is fucking impossible to find a decent apple up here. The apples in grocery stores here are uniformly disgusting - mealy, mushy, old, often with brown soft spots. The exact same chain stores carry the exact same apples in California and they are fresh and crisp and delightful. Now, you snobavores might suggest I get my apples at a local farmer's market. Sadly, the snobosphere has ruined farmers markets for apples. It's no longer cool enough to just get a nice farmer-grown Fuji or Gala or some shit, so they no longer sell those proper apples, it's all "heritage varietals".

Heritage apples are fucking inedible. They're astringent, dry, pasty. It's almost like taking a bite out of a raw potato, which I just realized is probably where "pomme de terre" came from. In French class people would always be like "apple of the earth? WTF are those Frenchies thinking, potatoes and apples are nothing alike" , but in fact old apples from before the bitterness was bred out and the sugar was bred up were like potatoes!

04-14-10 | Friendly Apps

A recent post by Multimedia Mike is about something that I advocate which I think most people don't do enough of : make your app easy for yourself to use. Developers too often think of usability as an annoyance which they have to do for end users, and they will struggle and deal with great annoyances when they use their own apps themselves. This is ridiculous, you are your most important user. Mike mentions a few good things, I'll repeat him a bit :

1. Make good defaults. Make things that you always want automatic. With every command line option, ask if you usually want it on or off, and make the default the thing you usually want, and then make the option to turn it off. Be aggressive about "refactoring" over time. If you find that you have to type a million options to run your app right, something is wrong. If you have trouble remembering your own options, something is wrong. Another detail : make options automatically turn each other on. For example when I enable "heavy prepass mocomp search" I now also automatically enable "save motion to disk cache" because I pretty much always want them together and I kept running mocomp search and forgetting to enable saving it to disk.

2. Make full logging and saving the default. You never want to run your app and have it produce some weird results or crash or something and have no record of it because you forgot to enable logging. Full logging should be on by default, and only disabled from the command line in weird cases. All my apps now automatically write logs to c:\logs\appname.log using argv[0] automatically. My video test app also writes a log for each run which is named with the date and time so that I have logs of every run ever. Each log also writes out info about the build and the command line options so that I am never left thinking "WTF run was this?". (Actually there's one thing I'm still not doing here that I really should do, which is to record the sync state of perforce that was used to build the current EXE ; we had that working at Oddworld and it is the fucking bomb). This is a variant of the Carmack adage that "no time spent visualizing your algorithm is ever wasted" ; my variant is something like "no amount of logging is too much".

2.1. Make clean versions of your logs and output! Just because you are logging tons of detail, that's not an excuse to just output a ton of shit that is impossible to parse. You might need a "detailed log" and a "summary log" ; certainly don't just spit it all to stdout. If you find yourself having to go through the log constantly to find the little bit of info you actually want, pull that bit out and format it right. Do computations for yourself in the app instead of doing them afterwards. eg. I found I kept going into my logs to get the total lagrange J, so make your app compute it for you and display a nice clean summary at the end. But try to avoid just numeric summaries, instead output CSV's and stuff with more detail, charts and graphs so that your human brain can see patterns and problems. You want to present things to yourself just like you would in a technical talk to your peers - make it clear and pretty, that helps you to parse the data on an intellectual level.

3. Never fail your app because the user forgot to do something obvious - instead make the app fix it. For example if the user gives me output file names which are in a directory that doesn't exist, I make the directory. I wrote a big rant about this long ago which still stands, about validating user options *before* you do your two hour CPU crunching. You should definitely do that, but even failing that you should simply not fail ever. For example if you for some reason cannot validate file names before your run, then if you fail to open the output file, just output to c:\fallback_output or something.

4. Make the app automatically do things that you always have to do. eg. I have to preprocess some video formats before I can read them - fold that into the app so it gets done for you automatically. This kind of stuff has merit in the short term just because it saves time and aggravation, but it has *loads* of merit in the long term when you come back to some work after months away from it and you can't remember how the fuck to make your app work any more - if it's all in the code and it's all automatic you don't have to remember the complicated process of how to use things right. Similarly, app parameter ranges should be clearly documented and preferably rescaled into reasonable ranges. Say you take a lagrange lambda parameter and reasonable ranges are like 0.0001 to 0.002 ; that's damn annoying, so rescale to expose it as [0,1] to the command line.

5. Be flexible in how you parse command line args. Write a proper flexible parser once and be done with it and stop writing cheap hacky parsers for every new app you write. For example I don't want to have to figure out whether your app wants "-i7" or "-i 7" or "-i=7" or "--i7" , just accept all of them and stop bothering me.
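Here is a small sketch of a parser that takes all of those spellings. It assumes you hand it the list of known option names; `parse_args` is a hypothetical helper, not a real library call, and it deliberately keeps values as strings.

```python
def parse_args(argv, names):
    """Parse options listed in `names`, accepting -i7, -i 7, -i=7, --i7, --i=7."""
    opts, rest = {}, []
    # Try longer names first so "-iter7" is not read as "-i" with value "ter7".
    ordered = sorted(names, key=len, reverse=True)
    i = 0
    while i < len(argv):
        arg = argv[i]
        name = None
        if arg.startswith("-"):
            body = arg.lstrip("-")
            name = next((n for n in ordered if body.startswith(n)), None)
        if name is None:
            rest.append(arg)                  # not an option we know: positional
        else:
            value = body[len(name):]
            if value.startswith("="):
                value = value[1:]             # "-i=7"
            elif value == "" and i + 1 < len(argv):
                i += 1                        # "-i 7": value is the next argument
                value = argv[i]
            opts[name] = value
        i += 1
    return opts, rest

print(parse_args(["in.bmp", "--i7"], ["i"]))
```

One known limitation of this sketch: a positional argument that starts with a dash and an option name (say a file called "-input.bmp") gets read as an option.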

People tend to cover up a lot of this stuff using perl scripts or batch files or whatever, which is okay to some extent *IF* you check in and document your scripts/batches the same way you do your code. The problem is that most people are very sloppy/lazy about their helper scripts, so they become broken and undocumented over time and you still have the problem where you come back to something after a year and are like "fuck how do I run this?".

04-12-10 | Laptop search

My beloved lappy has lost its video out, so I've been using it recently as an actual laptop. My god this blows. All of you who actually use laptops as laptops - beware! - you are destroying your neck and back. It's horrifically bad for you.

Anyway, now I'm forced to find a new one. This one has lasted me almost 10 years (!) and in many ways the technology is still cutting edge. (for example sadly you cannot beat the 1400x1050 matte LCD screen I have). I would like to find a lappy which will last me the next 10 years. Because the lappy is so important to me and the lifetime I expect of it is so long, price is basically no object. Hell, I should probably pay $10k if it really gave me a much better lappy, as this is my primary work computer; it is my artistic tool and my livelihood. Sadly there does not really exist such a thing as a superpremium laptop which is actually better.

Let's go through what I've learned so far :

CPU : Intel is in a code name obfuscation shit hole. The Core i3 / i5 / i7 seem to all be the same chip. These are all "Nehalem" variants , though Nehalem is technically only the 45 nm "Lynnfield" variant and "Clarkdale" is the 32 nm "Westmere" variant. Jebus. For mobile, you would prefer the 32 nm Clarkdale for lower power/heat. It seems Clarkdales are the 500 and 600 series, while Lynnfields are the 700+ series. The letter after the number actually contains the most info. "Q" means quad core, others are dual core. Most of them have hyperthreading, but a few are nerfed. "L" seems to mean low power. The "M" mobile chips draw 35 W , the "LM" draw 25. So for example i7 620 M = Clarkdale with two cores at 35 W, i7 720 QM = Lynnfield quad core at 45 W. See pcgameshardware for example. Correction : I guess the mobile variant of Clarkdale is called Arrandale, and it has better integrated graphics than Lynnfield. See Anand .

GPU : the GPU situation is sadly not much changed from one year ago. The ATI 5000 series is the only DX11 part. The 5400 is low power with full capability, but quite slow - even a 3800 is faster on many games. The 5600/5800 are the strongest parts, much much faster than a 5400. The number in the last tens indicates a revving up of speed within the genus, eg. 5430,5450,5470 are progressively faster. NVidia has renamed the 9400 the "Ion" but otherwise has not done much. Optimus seamless switching seems to have been picked up in very few laptops indeed. Reportedly the integrated Intel graphics is much better now. ATI's switchable and crossfire solutions are reported to be very poor with bad driver quality. See notebookcheck for example.

Screen : it is impossible to find a 4:3 any more (God's Own screen dimension). The best you can do is 16:10. There are new confusing acronyms for screen resolutions. For example Lenovo advertises some screens merely as "HD" or "HD+" ; it turns out "HD" is 1366x768 (ugh) and "HD+" is 1600x900 (meh). The best options are WSXGA+ at 1680x1050 for 15-16" or WXGA+ at 1440x900 for 14-15" . You can find matte though it's a bit hard. The only real standout screen I've seen is the Dell Studio XPS which has an RGB LED ; sadly it is crippled by extra glossy nerf-ware technology. IPS is the best panel technology (wide viewing angle) but they basically do not exist at all in laptops right now.

SSD : There still seems to be a massive variation in SSD brands . The Intel X25 series looks like the only safe bet. Most notebook makers won't tell you the brand of the SSD they put in (usually it's some kind of Samsung), so the safest thing is to spec with minimum drive and put in your own. Only with an Intel SSD and Windows 7 will you be sure to get proper "TRIM" support, which helps a lot.

Other junk : Intel 6200 or 6300 seems the win for Wifi. Everybody does gigabit ethernet now. Sadly quite a few don't have eSATA ports. Almost nobody has DVI out any more, but most have either DisplayPort or HDMI ; sadly if they have DisplayPort out you probably need a fancy adaptor for an older or non-top-of-the-line LCD.

Docking : since I basically just carry my laptop between home and work, I would really like a proper docking solution this time. Sadly, there is still no universal docking standard, and the off brands don't have docks. Many people try to pass off USB docks, but those are shit. (people are also trying to pass off external USB video cards, umm no). The only serious docks I could find are HP and Lenovo.

OS : I guess Win 7 is the way to go, and then may as well go 64 bit I guess.

Let's look at some concrete options :

Dell, Sony - these once proud brands seem to have crumbled into producers of fragile shit. Maybe some of their products are okay, but the Sony Vaio E series is flimsy junk, and there are widespread reports of spontaneously self-destructing Dells and zero customer service.

Asus, Acer, Toshiba. These guys now only make set-spec laptops that you cannot customize. Nope.

Thinkpad T410 : cool, quiet (33 db max), Core i5 or i7 duals, NV Quadro NVS 3100M (bollocks), 14.1" matte LED WXGA+ (1440x900) - but terrible brightness/contrast/color , solid cover latch (yay!), eSata, real docking port, great quality keyboard with real pgup and home in the right place. Conclusion : everything is the win about this except the shit GPU. The new keyboard is supposedly much worse than the old Thinkpad keyboard, but still miles ahead of anyone else, and I hardly ever use it anyway.

Thinkpad W510 : 15.6" "HD+" 1600x900 shitty widescreen resolution choice but better contrast/color than the T410. (the "FHD" 1920x1080 is indefinitely out of stock, and too small pixels anyway). NV Quadro FX 880M - better than the 3100M but not near the top in performance, and not DX11. Core i7 Lynnfield Quad chips. Moderate noise (30-40 dB).

Thinkpad W701 : 17" "WUXGA" 1920x1200 great RGB LED - strong color, the right res, all win. NV FX 2800M or 3800M. Big and heavy as hell. Also built in Wacom digitizer WTFBBQ.

HP 6540b : 15.6" 1600x900 matte, ATI HD 4550 , weak plastic case, no eSATA but has docking, loud (33-47 dB). Not really any major advantage over the Lenovos.

HP EliteBook 8540w : very similar to the W510 ; 15.6" HD+ or FHD , LED anti-glare. NV FX 880M or 1800M. HP does this annoying thing where their pre-configured models are around 50% of the price of the configurable ones. With an 8540w you might pay $1500 for a pre-config, or $3000 for the EXACT SAME spec custom configured. It forces you to find a pre-config that is close to what you want and then do the mods yourself.

HP EliteBook 8740w : very similar to the W701 ; 17" WUXGA "DreamColor" (quite possibly the same panel supplier?) ; same GPU choices, docking. Not sure how to differentiate vs. the W701. I also really don't care much about these super fancy screens as I hope to very rarely use my laptop screen.

Sager/Medion/Deviltech/etc : all the generic laptops now seem to be built on a Clevo base. You can get a top GPU (ATI HD5870) and all the other goodies you want. Another advantage is easy access to upgrade everything, no soldered down parts. Sadly, driver support for these things is notoriously iffy, and none of them have docks. AVADirect Clevo for around $1500 is similar to the $3000+ Thinkpad W701 but has the better 5870 GPU, and you can select your brand of SSD. The big loss with these things is shitty build quality and no docking. Plus they are more likely to have random weird bad performance problems like long DPC timeouts due to bad drivers / config.

04-09-10 | Food notes

How to heat tortillas for tacos : Put a stack of 4-5 tortillas dry in a fry pan on medium heat. Sprinkle with salt. Place a tiny dab of butter on the top tortilla (away from the heat). Just chill and let them sit. Randomly swap one of the interior ones to the bottom periodically. I've tried many techniques for this in the past, such as using more oil, using the oven, dry frying, all are much worse.

Salt & Pepper crab : boil crab until just slightly under done. Remove from heat and rinse in cold water. As soon as your hands can stand it, break it into bits, take off the top shell, break crabs in half, break off the legs, pinchers. Make sure to reserve the "tomalley" or tasty guts. Quickly dab all the crab pieces in a light dusting of flour. Heat wok on blazing high. Add oil, tons and gobs of garlic, saute a second, add bunch of green onions, add the crab, toss around to coat, add the tomalley, stir around, tiny dash of sweet soy, then tons of sea salt, just cracked black pepper and szechuan red pepper. Serve crabs and pour the saucey goodness all over them. The crab is in its shell, but you eat with your fingers and get the sauce on your hands which you then lick.

Best pork chops : prep pork chop for sear as usual. (standard sear goes like this : heat dry pan to blazing hot; dry meat thoroughly by patting with paper towels or air drying; brush meat with very very light coat of high-temp tolerant oil; season meat well with salt). Cut pork chop fat every inch so it doesn't curl and the fat crisps better. Put in pan. After one side is done, flip. This is the interesting bit. Right after you flip, apply a smear of room temperature butter to the top side of the chop. Then sprinkle on your "rub" - a bit of brown sugar, a bit of smoked paprika, pepper, thyme, za'atar, whatever you want. The rub and butter should be on the top side away from the heat while the bottom sears. Once it has gotten a good sear, toss the whole thing in the 325 oven. Let bake until the middle is done, about 10 minutes. During this time the rub and butter will drip all over the meat and make a pan sauce.

Lately I'm addicted to this simple salad : avocado slices, grapefruit supremes, mache and arugula. First make the supremes, squeeze out the grapefruit pith to get some juice for the dressing. Finish the dressing with OO, sherry vinegar, honey, salt. Dress the mache and arugula immediately so it has time to sit and soften a bit. Then assemble the salad, top with toasted pecan bits.

Gnocchi are quite easy to make. No need to measure ingredients really, just treat it like pie crust and do it by feel; add a tiny bit of flour to initially make the dough (1 cup), then when you work the dough you will add more flour as necessary until it's dry enough to roll. Work minimally and roll gently with hands moving steadily outward and past the edges. One thing to be careful of - they cook very very fast, so do everything else in your dinner first and finish the sauce for the Gnocchi and turn the pan with sauce in it down to minimum temp; when you simmer the gnocchi and they float to the top, toss in the sauce and then plate immediately.

04-08-10 | The Death of The Computer Utopia

The other day N was looking at some image on the web on her mac, and something was going wrong and she exclaimed "they won't let me set this image as my desktop!". My first thought was "preposterous!" It's a computer! If you can see the image, you can do whatever you want with it. Nobody can control what you do with your content on your computer. It is a wonderful free space, a space where you can live outside of strict boundaries and rules for how things are supposed to work. You can take any two bits you want and plug them together. If you so choose, your computer can be delightfully free of subscriptions, advertising, corporate control, everything that ruins the shitty rest of the world.

But then I realized - my god, I'm not sure that's true any more. Do Macs have some thing where they won't let certain images be set as desktops? Maybe they don't allow desktop images at all, because Steve Jobs believes it's bad for the Feng Shui of the UI. Maybe they have a deal with some copyright holders that checks for fee licensed media rights ?

The fact is, Apple is destroying the utopia that was the free and open computing space. And you all are lapping it up. I've been calling for a boycott of Apple for some time, but there seems to be no interest around the web. That's sort of weird and surprising to me, because I believe what Apple is doing right now is actually worse than anything Microsoft ever did, and there was massive anger about the draconian oppression that MS applied in their rise to power. I find it quite mad that MS is legally required to give competing application developers equal access to their operating system (so that eg. IE isn't allowed to be more closely meshed with the OS than any other web browser) due to the various suits against them, and yet right now Apple is exerting far more nefarious control - not only do competing apps not get equal access, they might not get to run AT ALL! That is just mind blowing.

Some people seem to think this is a good thing . I think that's quite mad and naive. Windows is a democracy, it's capitalist, it's a free space. If people choose to buy apps that are janky and slow and crash and have bad UI, it was their choice to do so. We should be forcing apps to get better by choosing to buy only the good ones. Instead, commentators applaud the dictator for cracking down. This is equivalent to the US Government mandating a certain look & feel for all homes being built. Oh, you want a tacky McMansion? No, I'm sorry that's against the law. Yes, the houses would look better, but the result is not the point!

I can't imagine ever buying a computing device that is not completely free and open for me to do anything I want with it. I can't imagine ever buying content (like music or an e-book) that isn't delivered to me in a form where I can do anything I want with it. Maybe it should be a law that all operating systems must be open ? Perhaps, but I would rather see that expressed in the marketplace - boycott all operating systems that aren't open!

The most disturbing thing to me is the blind consumer acceptance of this new paradigm. It just shocks me that there's not more uproar. I see this as the end of the delightful utopia that has been computing. Computing will soon become just like every other fucked up shitty aspect of life - telecom, health insurance, finance, etc. - where we have no real choices because they are all the same, they all fuck you with obfuscated rules and corporate dodges, where you have no power and no freedom, you can only sign up for some program that was pre-designed for you. Already the internet is becoming unfree - broadband providers are already detecting and throttling various content types such as torrents or streaming video; my email host Verio is now automatically spam filtering my outbound email and I'm not allowed to disable it; the government is snooping everything we send online - and people have swallowed it all.

04-07-10 | Video

Blurg, the complexity wheel turns. In the end, all the issues with video come down to two huge fundamental problems :

1. Lack of the true distortion metric. That is, we make decisions to optimize for some D, but that D is not really what humans perceive as quality. So we try to bias the coder to make the right kind of error in a black art hacky way.

2. Inability to do full non-greedy optimization. eg. on each coding decision we try to do a local greedy decision and hope that is close to globally optimal, but in fact our decisions do have major effects on the future in numerous and complex ways. So we try to account for how current decisions might affect the future using ugly heuristics.
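To make issue 2 concrete, here is a toy sketch of the local greedy decision, assuming the usual J = D + lambda * R and assuming (wrongly, which is the whole problem) that each block's D and R do not depend on earlier choices. The mode names and numbers are made up.

```python
def choose_mode(candidates, lam):
    """candidates: list of (mode_name, D, R). Return the one with the lowest local J."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

def greedy_encode(blocks, lam):
    """Pick the locally best mode for each block in order and sum J."""
    picks, total_j = [], 0.0
    for candidates in blocks:
        name, d, r = choose_mode(candidates, lam)
        picks.append(name)
        total_j += d + lam * r
    return picks, total_j

# Two hypothetical blocks, each with a cheap "skip" and a costly "inter" candidate.
blocks = [
    [("skip", 50.0, 1), ("inter", 10.0, 60)],
    [("skip", 5.0, 1), ("inter", 4.0, 80)],
]
picks, total_j = greedy_encode(blocks, 0.5)
print(picks, total_j)
```

In a real coder the candidates for block N change depending on what was picked for earlier blocks, and nothing in this loop accounts for that.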

These two major issues underlie all the difficulties and hacks in video coding, and they are unfortunately nigh intractable. Because of these issues, you get really annoying spurious results in coding. Some of the annoying shit I've seen :

A. I greatly improve my R/D optimizer. Global J goes up !! (lower J is better, it should have gone down). WTF happened !? On any one block, my R/D optimizer now has much more ability to make decisions and reach a J minimum on that block. The problem is that the local greedy optimization is taking my code stream to weird places that then hurt later blocks in ways I am not accounting for.

B. I introduce a new block type. I observe that the R/D chooser picks it reasonably often and global J goes down. All signs indicate this is good for coding. No, visual quality goes down! Urg. This can come from any number of problems, maybe the new block type has artifacts that are visually annoying. One that I have run into that's a bother is just that certain block types will have their J minimum on the R/D curve at very different places - the result of that is a lot of quality variation across the frame, which is visually annoying. eg. the block type might be good in a strict numerical sense, but its optimum point is at much higher or much lower quality than your other block types, which makes it stand out.

C. I completely screw up a block type, quality goes UP ! eg. I introduce a bug or some major coding inefficiency so a certain block type really sucks. But global quality is better, WTF. Well this can happen if that block type was actually bad. For one thing, block types can actually be bad for global J even if they are good for greedy local J, because they produce output that is not good as a future mocomp source, or even simply because they are redundant with other block types and are a waste of code space. A more complex problem which I ran into is that a broken block type can change the amount of bits allocated to various parts of the video, and that can randomly give you better bit allocation, which can make quality go up even though you broke your coder a bit. Most specifically, I broke my Intra ("I") block (no mocomp) coder, which caused more bits to go to I-like frames, which actually improved quality.

D. I improve my movec finder, so I'm more able to find truly optimal movecs in an R/D sense (eg. find the movec that actually optimizes J on the current block). Global J goes up (gets worse). The problem here is that optimizing the current movec can make that movec very weird - eg. make the movec far from the "true motion". That then hurts future coding greatly.

In most cases these problems can be patched with hacks and heuristics. The goal of hacks and heuristics is basically to try to address the first two issues. Going back to the numbering of the two issues, what the hacks do is :

1. Try to force distortion to be "good distortion". Forbid too much quality variation between neighboring blocks. Forbid block mode decisions that you somehow decide are "ugly distortion" even if they optimized J. Try to tweak your D metric to make visual quality better. Note that the D tweakage here is a pretty nasty black art - you are NOT actually trying to make a D that approximates a true human visual D, you are trying to make a D under which your codec will make decisions that produce good global output.

2. To account for the greedy/non-greedy problem, you try to bias the greedy decisions towards things that you guess will be good for the future. This guess might be based on actual future data from a previous run. Basically you decide not to make the choice that is locally optimal if you have reason to believe it will hurt too much in the future. This is largely based on intuition and familiarity with the codec.

Now I'll mention a few random particular issues, but really these themes occur again and again.

I. Very simple block modes. Most coders have something like a "direct block copy" mode, or even a "solid single color", eg. DIRECT or "skip" or whatever. These type of blocks are generally quite high distortion and very low rate. The problem occurs when your lambda is sort of near the threshold for whether to prefer these blocks or not. Oddly the alternate choice mode might have much higher rate and much lower distortion. The result is that a bunch of very similar blocks near each other in an image might semi-randomly select between the high quality and low quality modes (which happen to have very similar J's at the current lambda). This is obviously ugly. Furthermore, there's a non-greedy optimization issue with these type of block modes. If we compare two choices that have similar J, one is a skip type block with high distortion, another is some detailed block mode - the skip type is bad for information conveyance to the future. That is, it doesn't add anything useful for future blocks to refer to. It just copies existing pixels (or even wipes some information out in some cases).
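A toy illustration of the threshold problem, assuming J = D + lambda * R and made-up (D, R) numbers. Two modes have equal J at lambda = (D_skip - D_detailed) / (R_detailed - R_skip), and near that lambda tiny differences between neighboring blocks flip the choice.

```python
def crossover_lambda(skip, detailed):
    """skip, detailed: (D, R) pairs. Lambda at which both modes have the same J."""
    d_s, r_s = skip
    d_d, r_d = detailed
    return (d_s - d_d) / (r_d - r_s)

def pick(skip, detailed, lam):
    """Greedy choice between the two modes at the given lambda."""
    j_skip = skip[0] + lam * skip[1]
    j_det = detailed[0] + lam * detailed[1]
    return "skip" if j_skip <= j_det else "detailed"

detailed = (20.0, 82)                       # low distortion, high rate
lam = crossover_lambda((100.0, 2), detailed)

# Four neighboring blocks whose skip distortion differs only slightly.
neighbours = [97.0, 101.0, 99.0, 103.0]
picks = [pick((d, 2), detailed, lam) for d in neighbours]
print(lam, picks)
```

Nearly identical blocks alternate between a distortion of around 100 and a distortion of 20, which is exactly the semi-random quality speckle described above.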

II. Gradual changes need to be sent gradually. That is, if there is some part of the video which is slowly steadily changing, such as a slow cross fade, or very slow scale/rotate type motion, or whatever - you really need to send it as such. If you make a greedy best J decision, at low bit rate you will sometimes decide to send zero delta, zero delta, for a while because the difference is so small, and then it becomes too big where you need to correct it and you send a big delta. You've turned the gradual shift into a stutter and pop. Of course the decision to make a big correction won't happen all across the frame at the same time, so you'll see blocks speckle and move in waves. Very ugly.
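The stutter is easy to reproduce in a toy model. This sketch assumes a single pixel value fading by 1 per frame, and uses a fixed error threshold as a stand-in for the greedy J comparison (a simplification, not how a real coder decides).

```python
def greedy_fade(source, threshold):
    """Send zero delta while the error is small; send a full correction once it is not."""
    recon = []
    current = source[0]
    for target in source:
        if abs(target - current) > threshold:
            current = target          # the big delta: the "pop"
        recon.append(current)         # otherwise zero delta: the "stutter"
    return recon

source = list(range(10))              # slow steady fade: 0, 1, 2, ... 9
recon = greedy_fade(source, 3)
deltas = [b - a for a, b in zip(recon, recon[1:])]
print(recon)
print(deltas)
```

The source changes by 1 every frame; the reconstruction sits still for several frames and then jumps by 4.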

III. Rigid translations need to be preserved. The eye is very sensitive to rigid translations. If you just let the movec chooser optimize for J or D it can screw this up. One reason is that very small motions or movements of monotonous objects might slip to movec = 0 for code size purposes. That is, rather than send the correct small movec, it might decide that J is better by incorrectly sending a zero delta movec with a higher distortion. Another reason is that the actual best pixel match might not correspond to the motion, you can get anomalies, especially on sliding patterned or semi-patterned objects like grass. In these cases, it actually looks better to use the true motion movec even if it has larger numerical distortion D to do so. Furthermore there is another greedy/non-greedy issue. Sometimes some non-true-motion movec might give you the best J on the current block by reaching out and grabbing some random pixels that match really well. But that screws up your motion field for the future. That movec will be used to condition predictions of future movecs. So say you have some big slowly translating field - if everyone picks nice true motion movecs they will also be coherent, but if people just get to pick the best match for themselves greedily, they will be a big mess and not predict each other. That movec might also be used by the next frame, the previous B frame, etc.

IV. Full pel vs. half/quarter/sub-pel is a tricky issue. Sub-pel movecs often win in a strict SSD sense; this is partly because when the match is imperfect, sub-pel movecs act to sort of average two guesses together; they produce a blurred prediction, which is optimal under L2 norm. There are some problems with this though; sub-pel movecs act to blur the image, they can stand out visually as blurrier bits; they also act to "destroy information" in the same way that simple block modes do. Full pel movecs have the advantage of giving you straight pixel copies, so there is no blur or destruction of information. But full pel movecs can have their own problem if the true motion is subpel - they can produce wiggling. eg. if an area should really have movecs around 0.5 , you might make some blocks where the movec is +0 and some where it is +1. The result is a visible dilation and contraction that wiggles along, rather than a perfect rigid motion.
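A 1D toy of the "average two guesses" effect, assuming simple bilinear half-pel interpolation (real codecs use longer filters) and made-up pixel values. By convexity the SSD of the averaged prediction is never worse than the mean of the two full-pel SSDs, and with an imperfect match it is often far better than either.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length rows of pixels."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def full_pel(ref, pos, n):
    """Straight pixel copy starting at integer position pos."""
    return ref[pos:pos + n]

def half_pel(ref, pos, n):
    """Bilinear prediction at pos + 0.5: average of the two neighboring full-pel copies."""
    return [(ref[pos + i] + ref[pos + i + 1]) / 2.0 for i in range(n)]

ref = [10, 20, 40, 30, 10, 50]       # previous frame (made up)
target = [16, 28, 37, 18]            # current block (made up)

ssd_0 = ssd(full_pel(ref, 0, 4), target)     # movec +0
ssd_1 = ssd(full_pel(ref, 1, 4), target)     # movec +1
ssd_half = ssd(half_pel(ref, 0, 4), target)  # movec +0.5
print(ssd_0, ssd_1, ssd_half)
```

The half-pel prediction wins on SSD by a mile here, but it is a blur of two copies, which is the visual and information cost described above.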

V. A good example of all this evil is the movec search in x264. They observed that allowing very large movec search ranges actually decreases quality (vs. more local incremental searches). In theory if your movec chooser is using the right criterion, this should not be - more choices should never hurt, it should simply not choose them if they are worse. Their problem is twofold - 1. their movec chooser is obviously not perfect in that it doesn't account for current cost completely correctly, 2. of course it doesn't account for all the effects on the future. The result is that using some heuristic seed spots for the search which you believe are good coherent movecs for various reasons, and then doing small local searches actually gives you better global quality. This is a case where using "broken" code gives better results.

In fact it is a general pattern that using very accurate local decisions often hurts global quality, and often using some tweaked heuristic is better. eg. instead of using true code cost R in your J decision, you make some functional fit to the norms of the residuals; you then tweak that fit to optimize global quality - not to fit R. The result is that the fit can wind up compensating for the greedy/non-greedy and other weird factors, and the approximation can actually be better than the more accurate local criterion.

Did I say blurg?

04-06-10 | Poker

Psychological mistakes that even advanced players make in poker :

I need to change up / I'm too obvious. People think they can't keep doing a move because it's too obvious. Some TAG keeps raising, people keep reraising him, and he keeps folding to reraises. So he raises, it's your chance to reraise, you think "oh, I better not, it's too obvious". No! It keeps working, assume it will keep working until you get evidence otherwise.

One psychological case of this mistake is actually "pity". Sometimes you subconsciously pity an opponent. Some doofus is getting completely run over by bluffs and aggression. It comes your turn to play a hand with the fish, and you think "oh everyone is bluffing him, I better not, it's too obvious". In reality what you are doing is subconsciously pitying him and going easy on him.

Over-credit. A classic and common huge mistake is giving people too much credit. Never assume that they are making a smart decision. Say you raise in a spot where you know that you always have a good hand. For example, maybe you correctly play very tight out of position. So you raise OOP and someone reraises you. You think : "he knows I am playing very tight OOP and yet he reraised me anyway - therefore he must have a really good hand". No! You are giving too much credit.

Long time readers may recall the ridiculous list of care instructions that my landlord gave me on moving in. Well, the inevitable result of that has happened : I no longer pay attention to any of those requests. If they had given me a short reasonable list, I would have tried to abide by it, because I am very (overly) considerate, but when the request is just ridiculous and way too much, you wind up just saying fuck it.

I was reminded of it when watching the Louis Theroux special about Fresno ("Meth Town"). He's hanging out with some family and people are doing meth and Louis does his fucking asshole douchebag condescending thing that he does and takes the dad aside and is like "aren't you concerned about the kids being around this?". And the guy is just like "WTF?" Like, if you grow up where people are beating and killing each other and abusing kids and people are hungry and poor, your fucking middle class condescension about not showing the kids drug use is rather out of place.

The general concept applies to all laws, the idea that being too strict about a host of stupid laws actually makes people pay less attention to the important ones. When police are busting people in the ghetto all the time over stupid minor infractions, it makes them just generally think laws are stupid.

I see the same thing with the programmer-artist relationship in video games. Often the coders will make a set of directives about what good art should be like (no T-joints, clean manifolds, uv mapping with as little stretch and shear as possible, no reflections in uv mapping, etc. etc.) and it's just too much, it's unreasonable, and the artists just wind up ignoring it all. It's much better to make a minimal list of things that absolutely must be followed and enforce it strictly.

The Aura ("El Aura") was disappointing. I think it starts off really beautifully, the mood and camera work and all the stillness and portent is wonderful. The weirdness of the epileptic "aura" is interesting and you start to think this is going to be cool. But then it all just devolves into typical typical "Heist gone wrong". I'm so sick of fucking "heist gone wrong" flicks. There's always like an extra guard they forgot about, or an extra alarm switch, and then the robbers start freaking out, and someone gets shot. Blah blah blah.

I'm watching a bit of Peep Show again with N as she hasn't seen it before. It's really damn good, but it also makes me think : my god I am so sick of all this TV about people who just suck so bad. They are so negative and self-defeating and they don't support each other and it just leaves a sour taste in my mouth. I don't want to be like that, I don't want to be around other people who are like that, and I don't really even want to see it. It brings you down, it makes you feel like that is what normal behavior is. It's much better to surround yourself with people who are living well and being good to each other and being dynamic and positive and kind and intelligent and creative and so on.

04-04-10 | Porsche 997 Oil Change

Well I did my own oil change. It was a mild pain in my ass for a few stupid reasons which I will educate you about. I also don't have a garage which is really fucking annoying, this all would be so much more fun if I had my own space I could just get dirty and keep my tools in and everything, and stay out of the wet. It was threatening to rain all day so I had to hurry, which makes all mechanical things much less fun. (the correct way to do all manual labor jobs is with a whole lot of thinking about what you're gonna do next and taking breaks). If I had a real garage and work space where I could just hang out and drink beer and go "mmm hmmm" a lot, I would do a lot more little jobs myself.

Working on your only car is also inherently a pain in the ass because if you discover you don't have a part or a tool you need, you can't get in your car and go to the store for it. Urg.

There are a number of guides to doing this on the net (such as : here ) (I think maybe you have to be a member to see pictures or something, which is fucking bullshit BTW). It's all pretty standard and this is really one of the easiest cars in the world to do. A few more notes for dumb newbies like me :

You run your car before beginning to get the oil warm so it flows. It will be hot coming out, don't get burnt. Okay, I knew this, but one thing I didn't think about : the heat will make the bolts and such swell. When you take off the drain plug from the oil pan, it will be warm. If you leave it on your hex head like I did, it will cool down and contract and will become a real bitch to get off. To get it off at that point you'll probably have to heat it back up again. A hair dryer or your kitchen stove work fine (and you will then just throw out that drain plug and use a new one). Always use a new crush washer. Oh, also, as noted elsewhere - there's a lot of oil in this car and it comes out fast; make sure you leave the oil fill cap on when you first open the drain plug! (then remove the fill cap) Do NOT use one of those cute low profile oil pans or the integrated oil pan / jug things that use a funnel to fill - they can't take in the oil as fast as the car drops it and you could have a huge mess (luckily I also had a normal oil pan and swapped it out fast).

The oil filter housing can be very hard to get off. The correct tool is a 74.4 mm oil wrench with 14 flats. Porsche sells a branded tool for $40, you can get a generic one for $14 ; sometimes it's called a "Mercedes oil wrench". If it doesn't fit exactly, shim with some sandpaper. I did not have luck with plastic Jansco oil wrenches that auto parts shops sell - the plastic just warps if the filter housing is on too tight, you need a metal cup. To get off a stuck housing, you might want one of the 3-prong style oil filter wrenches. Removing a stuck housing is often destructive, so you may want to have a spare one on hand. It's a good idea to just go ahead and buy a spare housing, and hopefully you'll never need it, but if you do you'll be glad it's on hand. (ADDENDUM : the ideal tool is a SIR M0093 : like this ; FYI this thing needs a 22 mm socket )

If you're a perfectionist, you should pre-load the oil filter housing with oil. If you don't you may get a "check oil level" the first time you start the car up. This is not actually a big deal though.

The special tools are mostly 3/8" socket drive; you'll want a 3/8 torque wrench to get the tightness right.

I wish I'd had a second oil drain pan, or just some other pan I could put dirty stuff in that wasn't full of oil. A drain pan is like $8 , so just buy two. Obviously you want a big piece of cardboard or carpet or something under the whole operation, because wind and drips and shit will get out of the pan.

The Rhino Ramps for getting the car up in the air are in fact awesome. Very easy. They do slip if the surface they are on is not ideal, which is pretty scary when you're driving off and they slip. The best way to fix this is to put a bit of rubber mat or carpet under them. (but you don't actually need ramps - everything is so easy to get to on this car that just putting the back wheels on some 2x4's is enough clearance).

Disposing of the oil is in fact not a big deal at all. A bigger annoyance is just all the random bits of mess. Your tools get oily, and you're left with the oil drain pan which of course is oily. The worst part was actually pouring the oil from the drain pan into disposal containers (yes of course I use a big funnel). Next time I will take more care with that : 1. Have two drain pans as I said previously; put the disposal container in the second pan to catch any spill. 2. Have more/bigger containers than you need, so that you don't have to try to fill them near full. 3. Have transparent containers with secure caps (screw-top gallon milk jugs would be fine).

Some useful info :

Torques :
drain plug : 37 ft-lbs
filter : 19 ft-lbs.

Capacity :
Engine oil 997.1 S - 3.8 l Filling capacity without oil filter 8.25 litre
Engine oil 997.1 S - 3.8 l Filling capacity with oil filter 8.50 litre

But I recommend way underfilling, like maybe 7.5 quarts, then check the level and add 0.5 quarts and check again (after waiting a while). (for our purposes 1 quart ~ 1 liter is good enough). Your goal is to get the level near the middle of the indicator. (Beware : the electronic oil gauge is very temperamental, it could read very low, and you add oil, and discover you've over-filled).
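The numbers above are easy to sanity check with a few lines of Python. This is just a rough sketch : the capacity and torque figures are the ones listed above, the conversion constants are the standard US liquid quart and foot-pound, and the function names are made up for illustration.

```python
# Rough sketch of the "underfill, then top up" arithmetic and the torque units.
LITRES_PER_US_QUART = 0.946353   # US liquid quart
NM_PER_FT_LB = 1.3558179         # newton-metres per foot-pound

def quarts_to_litres(quarts):
    return quarts * LITRES_PER_US_QUART

def headroom_litres(capacity_litres, quarts_added):
    """Litres still missing below the stated capacity after adding quarts_added."""
    return capacity_litres - quarts_to_litres(quarts_added)

def ft_lb_to_nm(ft_lb):
    return ft_lb * NM_PER_FT_LB

if __name__ == "__main__":
    capacity = 8.50  # litres, with filter, from the capacity lines above
    for quarts in (7.5, 8.0, 8.5):
        print(f"{quarts} qt = {quarts_to_litres(quarts):.2f} l, "
              f"headroom {headroom_litres(capacity, quarts):.2f} l")
    for name, ft_lb in (("drain plug", 37), ("filter", 19)):
        print(f"{name} : {ft_lb} ft-lbs = {ft_lb_to_nm(ft_lb):.1f} N-m")
```

So a 7.5 quart first fill leaves roughly 1.4 litres of room below the with-filter capacity, which is why treating a quart as a litre is safe here : the error is always on the underfilled side.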

ADDENDUM : some related links :

Transaxle Oil Change (gear box & differential) - Rennlist Discussion Forums
TOOL Page (¯`·.¸(¯`·.¸ ZDMAK SPECIAL TOOL STORE ¸.·´¯)¸.·´¯)
Socket - Oil Filter Housing 27MM - 38 Drive & Other BMW Mini Cooper Parts at MiniCarParts.Net
Porsche Oil Filter Housing And Filter Assembly Filters and Maintenance & Stock Filters Maintenance
Porsche 997 - Oil Circulation - Page 1
Porsche 996 Oil Circulation - Page 1
Porsche 911 (997) LN Engineering Magnetic Oil Drain Plug
Pelican Technical Bulletin All About Motor Oil
Pelican Parts.com - Oil Filter Socket - 74mm
Pelican Parts - Product Information 000-721-920-40-OEM
Oils What motor oil should I use Which oil is best for my Porsche or aircooled engine
oil drain plug stripped - Rennlist Discussion Forums
Oil change on Carrera S's and a Brake Fluid Flush - Rennlist Discussion Forums
Oil Change Instructions
OCD Oil Change on the Red Dragon - 6speedonline.com Forums
How to remove stripped drain plug - RennTech.org Forums
How To Change Your Oil (The Real Down and Dirty)
HELP locating an oil filter wrench - Rennlist Discussion Forums
DIY Oil Change with Pics - Rennlist Discussion Forums
DIY Oil Change in the 997 - Rennlist Discussion Forums
DIY Oil Change in the 997 - Page 5 - Rennlist Discussion Forums
Changed the Oil in my 997.2 today PHOTOS - Rennlist Discussion Forums
Car maintenance bibles Oil Viscosity.
Amazon.com End Cap Oil Filter Wrench 76mm 14 Flutes Home Improvement

04-01-10 | Automotive Manual Transmissions

Hmm, I have also been oddly ignorant about how exactly a manual transmission works. Let's go through it.

The engine output shaft connects directly to the flywheel. The flywheel is pressed against the clutch plate, which is connected to the transmission on one shaft. The transmission has another shaft that connects to the driveshaft/axles/wheels. Inside the transmission the two shafts - one from the engine, one from the wheels, are mated by gears.

Usually the two main shafts in the transmission (engine shaft and wheel shaft in my incorrect parlance) are coaxial (for torque/inertia reasons) and they are mated by gears on a separate shaft, the "layshaft", but we aren't really concerned with the exact geometry and we'll just talk about the two shafts being connected by "transmission internal bits".

Let's get to the first misconception : disengaging the clutch pedal does NOT disconnect the transmission from the wheels nor does it take the transmission "out of gear". All the clutch does is disconnect the engine from the transmission. Clutch is friction based and can slip if the engine & transmission side are not turning at the same speed.

If the transmission is in gear and the clutch is depressed (disengaged), the engine is disconnected from the transmission, so the wheels will spin the transmission internal bits at the speed they want without resistance.

Simple Transmission - how it works

Inside the transmission you have the shaft from the engine and the shaft from the wheels. Sometimes they are connected using a "layshaft" which carries some gears to mate them together. I'll just refer to this stuff as "transmission internal bits". We conceptually think of the gears "engaging and disengaging" but on all modern transmissions the various gears are constantly meshed, and instead they are connected or disconnected from the shafts using dog clutches and syncromeshes. Let's get into that a bit :

One amusing thing I didn't realize is that the H pattern on your shifter actually directly mechanically moves the gear selection. I always thought that it just sent a signal to the transmission to pick those gears, but no it is directly human powered. On a typical 6-speed H shifter you have 1/2 vertically, 3/4, and 5/6. There are three gear clutches inside the transmission. The three horizontal positions of the shifter each control one of the three clutches. The middle vertical position on all three is neutral - clutch not engaged. So when your shifter is in neutral, the side-to-side position doesn't matter, all 3 clutches are disengaged. In this state the layshaft (internal transmission gears) does not couple the wheel shaft to the engine shaft.

When you move the shifter up and down on the 1/2 axis, it moves a double-sided clutch plate either up to mesh with the 1st gear or down to mesh with the 2nd gear (in the middle it meshes with neither and that's neutral). When it's fully in 1st or 2nd, it locks those gears to the axle using the dog clutch (a dog clutch is like the bumps on a microwave tray, just some bumps and grooves), and the wheels are locked to the shaft that goes to the flywheel in that gear ratio. In between, when the shifter arm is between a gear selection and neutral, the clutch plate for those gears is only partially engaged. The dog clutch also has a syncromesh on it (just called syncros usually). The syncro is a kind of spiral gear which causes the two parts to spin up to equal speed. Obviously the shaft to the wheels and the shaft to the flywheel can be at different speeds; when they're spinning at different speeds the dog clutch would never engage and just grind and bounce off each other; the syncromesh can slightly engage and cause them to match speeds as you move the lever.

The interesting thing is that your manual action of moving the shift lever in to select the gear is actually applying the force to bring the internal transmission clutch plates together, and as you push it in you are causing the syncromesh to mate and spin the pieces up together. This is totally separate from your left foot which controls the clutch on the flywheel that mates the engine to the whole transmission. If you shift really hard and fast you put a lot of strain on the syncros to make these parts match up. This is sort of separate from the normal "rev matching" discussion but we'll come back to that.

A key thing that I didn't really think about before is that there are actually 3 spinning parts of the equation. There's the engine, which is the RPM you see on your dash, there's the wheels, which you see on your speedo, and there's the transmission internal bits (layshaft if you like). You have no indicator of the transmission internal bits speed, but you can feel it when they're off. In particular :

neutral, clutch pedal out : trans spins with engine, decoupled from wheels
in gear, clutch pedal in : trans spins with wheels, decoupled from engine
neutral, clutch pedal in : trans connected to nothing, spins down slowly
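That little table is really just two independent switches, which you can write down as a toy Python function. This is only a sketch of the table above; the function name and the returned strings are made up for illustration.

```python
def trans_coupling(in_gear, clutch_pedal_in):
    """What the transmission internal bits are coupled to.

    in_gear         : a gear selector's dog clutch is locked (not neutral)
    clutch_pedal_in : pedal pressed, so the friction clutch at the flywheel is open
    """
    to_engine = not clutch_pedal_in  # friction clutch at the flywheel
    to_wheels = in_gear              # dog clutch on a gear selector
    if to_engine and to_wheels:
        return "locked to engine and wheels"
    if to_engine:
        return "spins with engine, decoupled from wheels"
    if to_wheels:
        return "spins with wheels, decoupled from engine"
    return "connected to nothing, spins down slowly"

if __name__ == "__main__":
    for in_gear in (False, True):
        for pedal_in in (False, True):
            print(f"in_gear={in_gear}, pedal_in={pedal_in} : "
                  f"{trans_coupling(in_gear, pedal_in)}")
```

The fourth combination (in gear, pedal out) is the normal driving state where all three spinning parts are locked together.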

Let's look at what actually happens when you go through a shift.

1. You're in 1st gear. Everything is locked together. The engine is spinning at a multiplier of the wheels rate based on the gear ratio.

2. You press in the clutch. This disengages the engine from the transmission at the flywheel. The engine is now spinning on its own (with much less resistance) and the transmission is still locked to the wheel speed.

3. You move the shifter downward towards 2nd gear. It will move through the neutral position (middle vertical position) where the internal transmission clutch is not selecting any gear. At this point the transmission internal bits are disconnected from the wheels, and are also still disconnected from the engine because the clutch is in.

4. (optional) At this moment you could now double-clutch. If you let the clutch out now, it will connect the transmission bits to the engine and get them in sync. You can now use the gas pedal to easily spin up the transmission internal bits without the wheels being connected. Then you push the clutch back in and keep moving the shifter down towards 2nd :

5. As you move the shift lever towards 2nd, the syncromesh on the transmission dog clutch starts to engage. This causes the transmission internal bits to spin up to the speed of the wheels (multiplied by the gear ratio). The syncros will grind a bit and get the transmission up to speed, then you finish the lever movement and the dog clutch engages. Now the transmission is spinning with the wheels.

6. Finally you let the clutch pedal out. This engages the engine to the transmission, which is already locked to the wheels. Here you get on the throttle to rev match so that the clutch doesn't grind too much on the engine flywheel.

Note : the "rev match" was only for the engine-transmission clutch, not the clutch (syncros) inside the transmission! If you did this series of steps and skipped the double-clutching, you are just mashing the syncros together without any rev matching for them! To address that you should let out the clutch as you move the lever, so that the engine can start to engage with the transmission, you can use your throttle to spin it up to rev match, and then as you slot in the gear selector the syncros will mesh without too much grinding.
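The "right" engine speed for a rev match falls straight out of the gear ratios : in gear, engine RPM is wheel RPM times the gear ratio times the final drive ratio. Here is a minimal Python sketch of that. The ratios used in the example are invented round numbers, not the real ones for any particular car.

```python
def engine_rpm(wheel_rpm, gear_ratio, final_drive):
    """Engine speed when everything is locked together in a gear."""
    return wheel_rpm * gear_ratio * final_drive

def rev_match_target(current_rpm, old_ratio, new_ratio):
    """Engine RPM that matches the same road speed in the new gear."""
    return current_rpm * new_ratio / old_ratio

if __name__ == "__main__":
    # Hypothetical ratios, for illustration only.
    first, second, final_drive = 3.9, 2.3, 3.4
    wheel_rpm = 450.0
    rpm_in_first = engine_rpm(wheel_rpm, first, final_drive)
    target = rev_match_target(rpm_in_first, first, second)
    print(f"1st : {rpm_in_first:.0f} rpm -> 2nd wants {target:.0f} rpm")
```

Note that on an upshift the target is lower than the current RPM, so the engine and the transmission internal bits have to slow down to match; on a downshift they have to speed up, which is where a throttle blip comes in.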

Now check out :

How Manual Transmissions Work

YouTube - How Manual Transmissions Work

Some interesting conclusions :

1. To get out of gear there is no reason to push in the clutch pedal. Disengaging a gear separates the dog clutches and syncromeshes - it is totally unrelated to the engine clutch at the flywheel. (not quite true : if you don't push in the clutch, selecting neutral slides the parts while they are under load, with the clutch in they move without the engine applying force through them)

2. You also don't strictly need the clutch pedal to get into gear. With the transmission in neutral, all you have to do is modulate the throttle to get the engine RPM to exactly the right spot for the current wheel speed, and then slip the shift lever into gear. The syncros will slip you right into gear if your rev match was good. Obviously this is very hard to do reliably, which is why you use the clutch pedal. Note that this is not really "clutchless gear shifting" - you are still using a clutch which is inside the transmission on the gear selectors, you just aren't using the big friction clutch plate between the engine and the transmission.

3. The feeling of shift "ease" in the shift lever is related to how mushy your syncros are. Low performance commuter cars usually have very soft forgiving syncros. I guess these are made of brass or something like that, they are designed to absorb the strain of different rotation speeds and let you slop around the shifter. High performance cars usually have very stiff syncros (traditionally made of steel as in the Porsche but I know newer cars like the BMW M3 are using carbon fiber syncros now). This causes more resistance when moving the shift lever to select gears - even with the clutch pedal fully in. You should shift slowly and firmly, and you can also use double clutching to get the transmission internal bits up to the right speed so that the syncros match. Porsche has "balk" type syncros for 1st and 2nd gear, which means rather than let you slot them together with a bad speed match and grind you up to speed, they just lock you out.

A nice article : Phil Ethier On shifting

I'll fill in a few cracks. The engine shaft that the pistons turn comes out and is connected to the "flywheel". This is just a big disk, which the clutch plate then presses against. You will often hear modders talk about going to a "light weight fly wheel". The weight of the flywheel is useful because it holds some rotational inertia from the engine RPM's, so when you let the clutch out to mate the engine with the wheel speed, the engine is not immediately jerked to the speed of the wheels. (this is a particularly bad problem when starting from a stop, since the wheels are not moving and you would stall your engine). A light weight fly wheel lets the engine spin up faster. It does not increase peak horsepower, since at that point everything is spinning at full speed, but it makes it faster to get to peak horsepower from a stop. It lets you change the engine RPM much faster when the clutch is not engaged, since then the only resistance is the flywheel. When the whole drivetrain is engaged, I can't imagine that the flywheel is a very significant part of the resistance or rotating weight (transmission and wheel weight and everything are much greater). See : on flywheels and YouTube - 3D animation of dual mass flywheel .

Oh, another thing that people like to do is go to a "short shifter". Recall the shift lever is a direct mechanical connection to the gear selectors, when you move it up and down, that moves the selectors back and forth, through a mechanical linkage (either cables or rigid arms). A short shifter just moves the mechanical pivot on the shift lever to change the lever ratio, so that a smaller movement of the shifter arm produces the same amount of movement of the gear selector in the transmission (at the cost of needing more force at the knob). Note that there do exist badly designed 3rd party short shifters which do not do this right, and they will screw up your syncros.
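The short shifter is just lever arithmetic, so here is a small Python sketch of it. It assumes the shift lever is a simple rigid lever with the knob above the pivot and the linkage attached below it; all the dimensions are invented for illustration.

```python
def hand_throw_mm(linkage_travel_mm, above_pivot_mm, below_pivot_mm):
    """How far the knob moves to move the linkage by linkage_travel_mm."""
    return linkage_travel_mm * above_pivot_mm / below_pivot_mm

def hand_force_n(linkage_force_n, above_pivot_mm, below_pivot_mm):
    """Force needed at the knob to put linkage_force_n into the linkage."""
    return linkage_force_n * below_pivot_mm / above_pivot_mm

if __name__ == "__main__":
    # Hypothetical lever geometry (knob-to-pivot, pivot-to-linkage), in mm.
    stock = (200.0, 50.0)
    short = (180.0, 70.0)
    for name, (above, below) in (("stock", stock), ("short", short)):
        print(f"{name} : throw {hand_throw_mm(10.0, above, below):.1f} mm, "
              f"force {hand_force_n(40.0, above, below):.1f} N")
```

With these made-up numbers the throw drops from 40 mm to about 26 mm while the force at the knob goes from 10 N to about 16 N : the selector in the transmission moves the same distance either way.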

ADDENDUM : let's go a bit further now and understand some fancy new technology.

First of all, the "Sequential Manual Transmission". There are two big differences with an SMT : 1. linear lever operation, and 2. no clutch pedal.

The first thing to understand is that the internal workings of an SMT are exactly the same as a normal MT. You have the clutch, syncros, gear selectors, etc. all the same.

The linear lever operation is accomplished by making the linear motion perform a ratcheted rotation of a drum. When you push forward it turns the drum one way, when you pull back it turns the drum the other way. The drum has grooves which fit to the gear selectors. The gear selectors are just like in a normal MT - 1st/2nd 3rd/4th kind of thing, it causes them to move back and forth against the syncros and dog clutches. See : here .

The next thing is clutchless operation. This is very simple in fact. They just use the same linear lever pull to engage and disengage the clutch. As you push forward the clutch is disengaged, the lever motion then also selects the gear, and as the lever finishes its movement the clutch is engaged again. A standard manual SMT just does this directly mechanically by hooking up the lever to the clutch; there are also electronic ones that use the lever to control a computer that moves the clutch (it's not an electronic motor to move dry clutches in this case, but rather hydraulic pressure to control wet clutches, but it's the same deal).

Now that we understand an SMT we can move to the new DCT (Dual Clutch Transmission). A DCT is very similar to an SMT. As usual I was a little confused about how a DCT works exactly because of silly commentators describing it in confusing ways, saying there are two gearboxes and the computer selects between them.

This is the best simple schematic diagram . There is still one flywheel being driven by the engine, and there is still just one drive shaft going to the wheels. Between them is the transmission. There are now two "layshafts". (the two layshafts are usually co-linear and share an axle so it's not evident that they are two separate shafts, but they are there). The two layshafts are both connected to the wheels by gears, so when their gear selectors are engaged they are being spun with the wheels (multiplied by the gear ratio). The layshafts have gear selectors on them just like a normal SMT - the gear selector can be in the middle (neutral) engaging no gear, it can move back and forth linearly to select one or another gear, meshing through a syncro and then locking a dog clutch to engage that gear.

The two layshafts are then mated to the engine drive flywheel through two different clutches. The two clutches are concentric. Rather than a big disk clutch as in a normal MT, they are concentric rings so they both have some access to the surface area of the flywheel. If clutch 1 is pressed to the flywheel, the engine spins layshaft 1 through the gear selected on layshaft 1, and layshaft 2 is then spun by the wheels through its gear selection, and not connected to the engine, so its clutch plate will be spinning at a very different speed (but not engaged). While clutch 1 is engaged, layshaft 2 can easily select different gears by moving its gear selectors. Its gears are not connected to the engine, so this is just like you moving the stick around in a manual transmission with the clutch pedal depressed - the gear selector will mesh through the syncros and spin up the internal transmission bits (layshaft 2) to match the wheel speed.

With layshaft 2 in some gear, the DCT can then just disengage clutch 1 and engage clutch 2. There is still a tiny transition time, and the engine has to very quickly adjust RPM's to match the different spinning speed of the new clutch. The DCT does not magically take steps out of a gear shift, it just changes the order. Instead of :

disengage clutch, move gear selector, engage clutch

It does : move gear selector on idle shaft, disengage clutch 1, engage clutch 2.

So the time to move the gear selector is hidden.
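To make the "same steps, different order" point concrete, here is a toy Python timeline. The step durations are invented placeholders, not measurements of any real gearbox; the only thing the sketch is meant to show is which steps happen while drive to the wheels is interrupted.

```python
# Each step is (name, duration in ms, does it interrupt drive to the wheels?).
# Durations are made up for illustration.
MANUAL = [
    ("disengage clutch", 100, True),
    ("move gear selector", 200, True),
    ("engage clutch", 100, True),
]
DCT = [
    ("move gear selector on idle shaft", 200, False),  # done ahead of time
    ("disengage clutch 1", 100, True),
    ("engage clutch 2", 100, True),
]

def total_ms(steps):
    """Total work done for the shift, regardless of when it happens."""
    return sum(ms for _, ms, _ in steps)

def interruption_ms(steps):
    """Time during which the engine is not driving the wheels."""
    return sum(ms for _, ms, interrupts in steps if interrupts)

if __name__ == "__main__":
    for name, steps in (("manual", MANUAL), ("DCT", DCT)):
        print(f"{name} : total {total_ms(steps)} ms, "
              f"interrupted {interruption_ms(steps)} ms")
```

Both lists add up to the same total work; the DCT just does the selector move on the idle shaft while the other clutch is still driving the car.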

Here's : a more detailed schematic , and nice actual technical diagrams of the Audi S-Tronic or Porsche PDK .

04-01-10 | Automotive Differentials

You hear people talk about the merit of an LSD (Limited Slip Diff) all the time, but I was a little confused about the whole thing, so here we go. First of all, the LSD is not your option vs. having NO diff. Basically every car now has a diff, they just might have an "open diff". Let's now watch this awesome video :

YouTube - How Differential Gear works (BEST Tutorial) (old promotional video from Chevy - watch this if nothing else!!)

And if you like check out some computer animations (BTW tons of these awesome mechanical animations on Youtube these days!! love them) :
YouTube - Gear Animation ( www.dizayn.tr.gg )
YouTube - diferencial catia

Okay, so an Open Diff is cool because it lets the wheels turn at different rates when you go around corners. The alternative back when that Chevy video was made was a locked rear axle with both wheels spinning at the same speed all the time, which would make the tires skitter badly around corners (BTW this is different than the "solid axle" that cheap American cars used to have (aka "live axle" or "beam axle"), which referred to the use of the drive axles as suspension, those cars still had differentials).

If you always had perfect tire traction to the road, an open diff would be perfect. Going around a corner, the outer wheel is easier to push and the inner wheel resists more, so the open diff spins it less, and all is merry. The problem occurs when you don't have perfect traction. In that case the slipping wheel has very little resistance, so the open diff just spins it like crazy and the wheel with traction barely moves.

The most obvious case where you would encounter this is on snow or ice when you get stuck - you hit the gas, and one of your wheels just spins and spins while the one with good traction doesn't budge. To address this for serious offroading, they make "locking diffs" which push another gear into the differential which locks the two sides together so that the two wheels are forced to spin at the same speed. If your goal is to go forward in a straight line regardless of traction, a locking/locked diff is ideal.

The issue with racing and cornering is that going through a corner with power you will often lose traction due to the heavy combined load of cornering and accelerating. When that happens an open diff will basically refuse to put down power (it will just spin the loose wheel). An LSD will still put down some power if one of the wheels has traction, which will let you accelerate out of the corner quicker.

Traction Circles great video by the most boring race driver in the world.

And now you can read these :

Limited slip differential - Wikipedia, the free encyclopedia
Differential (mechanical device) - Wikipedia, the free encyclopedia

BTW a decent AWD/4WD car will have an LSD of some kind between the front and rear wheels. See 4WD vs AWD for a good discussion.

Modern cars with open diffs use braking to limit wheel spin. Basically one wheel loses traction and goes nuts spinning, so the electronic traction control puts some brake on it, then both wheels feel resistance, so the open diff now gives power to both wheels which lets the one with traction move us.
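The open / locked / brake-assisted cases can be put into a very crude Python model. This is a sketch under big simplifying assumptions : an ideal frictionless open diff that always sends equal torque to both sides, a static "grip" number for how much torque each tire can resist before it spins, and made-up torque values. It is not a vehicle dynamics simulation.

```python
def open_diff_good_wheel(torque_in, grip_good, grip_bad, brake_on_bad=0.0):
    """Torque that reaches the wheel with traction through an open diff.

    An open diff sends the same torque to both sides, and that torque can't
    exceed what the easier-to-spin side resists with (its grip plus any brake
    torque the traction control applies to it).
    """
    per_side = min(torque_in / 2, grip_bad + brake_on_bad)
    return min(per_side, grip_good)

def locked_diff_good_wheel(torque_in, grip_good, grip_bad):
    """Torque that reaches the wheel with traction when the diff is locked.

    Both wheels are forced to the same speed, so whatever the slipping wheel
    can't take is available to the other one.
    """
    return min(grip_good, max(torque_in / 2, torque_in - grip_bad))

if __name__ == "__main__":
    torque_in, grip_good, grip_bad = 1000.0, 2000.0, 50.0  # hypothetical values
    print("open        :", open_diff_good_wheel(torque_in, grip_good, grip_bad))
    print("open+brake  :", open_diff_good_wheel(torque_in, grip_good, grip_bad, 400.0))
    print("locked      :", locked_diff_good_wheel(torque_in, grip_good, grip_bad))
```

With one wheel on ice the open diff only gets the tiny torque the slipping wheel resists with to the good wheel; braking the spinning wheel raises that resistance, and a locked diff sends nearly everything to the wheel that can use it. A real LSD sits somewhere between the open and locked numbers.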

The easiest way to make an LSD is just to take your open diff, put the whole thing in a box, and put some fluid in that box. Boom you have a viscous coupling LSD. (do not try this at home)

03-31-10 | More Porsche 997.1 Owners Notes

The front bumper has two radiators off to the lower sides in air scoops. These scoops fill up with leaves and pebbles and junk from the road. Some people have done a DIY to install mesh screens over these holes (the GT3 has the screens standard). I might do that someday, but for now I just stuck a vacuum cleaner nozzle in the scoops. Make sure you dig the nozzle way back in to the far corner, because there's a little pit back there that is just full of pebbles.

Around the luggage compartment under the hood are water drain channels. These are pretty shallow and if they have gunk in them the water will overflow and can get into the luggage. Make sure you keep leaves and junk out of them. On each side of the battery are little water drain holes to get water away from the battery. Again leaves and junk can easily block these holes which then causes the "brake booster" to suck in water. Check them every so often to make sure they are not clogged.

The oil pressure gauge seems worthless at first (because the oil temp is easier to read and they are semi-redundant), but actually it's the best way to check on the health of your engine and know when it is ready to thrash. When you start it up, it should read around max, 5 bar. Once the oil gets loosened up and is flowing right, it should settle in to about 1-2 bar at idle and 3-4 bar at high RPM. If it doesn't get up to 3-4 bar once warm at high RPM, you have an oil leak. My next oil change is going to be to 5w50.

There are a few known design flaws with the car (aside from the RMS/IMS elephant in the room) :

They seem to be prone to premature serpentine belt failure. This is not a huge problem, just something to be aware of. Changing the serpentine belt is very very easy on this car, like most of the minor maintenance. All you need is a 24 mm wrench. Take a photo of the belt routing before you remove it so that you can replace it the same way. The tensioner is self-adjusting, not manually tensioned, so you don't have to worry about tightening it right when you put it back on. One reason for failure seems to be that the pulleys can wear grooves . I would replace every 30-40k miles instead of the recommended 60k.

Another is that shift cables can break , particularly under heavy use. This is also pretty cheap to replace, though not a DIY, and obviously inconvenient if it happens out on the road. Generally the transmission is very "notchy" and heavy, which is cool and by design, it is meant to be able to handle a lot of torque to quickly shift and get power through. The syncros in the transmission that help you rev match are pretty minimal and hard, that is you need to rev match pretty well or they will balk, unlike a lot of comfort cars where the syncros are very forgiving and let you shift even with a bad rev match. The problem is if you are trying to jam the transmission without rev matching well enough, it puts heavy loads on those cables and they can't take it. Even if they don't break they are prone to get out of adjustment, particularly the 1st/2nd gear clutch control.

03-31-10 | Video Annoyances

Oo lord working with video is annoying.

I noticed over at Media Xiph.org they have some sample videos in Y4M 444. 444 means it has full res chroma. I have previously tested against x264 using some Y4M 420 sequences like "parkrun", but the problem with that is that it has pre-killed chroma, so it actually is biased against me because I don't kill chroma in the same way they do, so I'm sort of taking a chroma killing penalty twice. What I want is full res color data so that x264 and I can both kill chroma in our own ways and then compare to the original.

So I get the full res 444 Y4M data. Then the problems begin. Most Windows video tools won't read Y4M because it's not packaged in an AVI with a codec for DirectShow and all that. But the linuxy tools that are starting to dominate do read Y4M. *BUT* my favorite tools, x264 and mplayer/mencoder both don't read Y4M 444, they only read 420. Now the linuxy thing to do would be to pipe the video through y4mscaler, but then I'm killing the chroma. That's fine to do the x264 compress, but I need to get back to RGB full res for my compress.

So now FFMPEG does read Y4M 444. But for some fucking reason the ffmpeg on my machine refuses to show any command line help. All the things I'm supposed to be able to do to make it tell me what codecs it has and so on don't work. Yay. So I get to randomly search the web and try things in the dark until it works.

ffmpeg -i t:\park_joy_444_720p.y4m -vcodec rawvideo -pix_fmt bgr24 r:\park_joy_444_720p_rawvideo.avi

So we have raw video out, yay! That's awfully big so we want to compress it with something lossless. The only lossless codec that I have found which really works in all players and supports plain RGB right is Lagarith. (other people have had success with HuffYUV but I have trouble making it work with RGB and it randomly crashes on me sometimes).
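Just how big is "awfully big"? A quick Python sketch of the arithmetic, assuming 720p means 1280x720, 3 bytes per pixel for bgr24, and the 50 fps frame rate of park_joy :

```python
def raw_bytes_per_frame(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame; bgr24 is 3 bytes per pixel."""
    return width * height * bytes_per_pixel

def raw_bytes_per_second(width, height, fps, bytes_per_pixel=3):
    return raw_bytes_per_frame(width, height, bytes_per_pixel) * fps

if __name__ == "__main__":
    per_frame = raw_bytes_per_frame(1280, 720)
    per_second = raw_bytes_per_second(1280, 720, 50)
    print(f"{per_frame} bytes per frame, {per_second / 1e6:.1f} MB per second")
```

That is about 2.76 MB per frame and about 138 MB for every second of video, before any container overhead.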

BTW if you want 420 subsampled colorspace YUV, x264 is a great option for lossless video, and it's easy, just compress with -q0. In fact this is a good way to convert a 420 Y4M or YUV file into an MP4 which can then be loaded by Windows video tools. HOWEVER you may need to specify low profile and no bframes, and even then some Windows video tools will still have problems with x264 output. Also - Y4M is a raw YUV file with a header. A .YUV file has no header and you have to tell it the size. The easiest way is to name the file "blah_widthxheight.yuv" and then give it to x264, which will parse the file name for width & height.
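For reference, here is a minimal Python sketch of the two ways the frame size gets communicated : the Y4M header line (a "YUV4MPEG2" signature followed by space-separated tagged fields like W, H, F and C) and a width x height baked into a .yuv file name. The file-name regex is my own stand-in for illustration and is not how x264 actually parses names; both function names are made up.

```python
import re

def parse_y4m_header(line):
    """Parse the first line of a Y4M stream into a dict of the common fields."""
    tokens = line.split()
    if not tokens or tokens[0] != "YUV4MPEG2":
        raise ValueError("not a Y4M header")
    info = {}
    for tok in tokens[1:]:
        tag, value = tok[0], tok[1:]
        if tag == "W":
            info["width"] = int(value)
        elif tag == "H":
            info["height"] = int(value)
        elif tag == "F":                      # frame rate as numerator:denominator
            num, den = value.split(":")
            info["fps"] = (int(num), int(den))
        elif tag == "C":                      # chroma format, eg. 420jpeg or 444
            info["chroma"] = value
    return info

def size_from_yuv_name(name):
    """Pull (width, height) out of a name like blah_1280x720.yuv, else None."""
    m = re.search(r"_(\d+)x(\d+)\.yuv$", name)
    return (int(m.group(1)), int(m.group(2))) if m else None

if __name__ == "__main__":
    print(parse_y4m_header("YUV4MPEG2 W1280 H720 F50:1 Ip A1:1 C444\n"))
    print(size_from_yuv_name("blah_1280x720.yuv"))
```

A headerless .yuv file carries none of this, which is why the size has to come from the name or the command line.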

Unfortunately we cannot go straight to Lagarith in ffmpeg because FFMPEG is some linuxy thing that uses hard-coded "lib" system rather than using the Windows pluggable codec thing. So even though I have Lagarith on my machine and all the Windowsy video products can encode to it, ffmpeg can't. (I guess on Linux ffmpeg actually loads libs using the purported magic Linux lib system, but on Windows they are hard-compiled into FFMPEG; that is advantageous in that it means ffmpeg actually fucking works, unlike the apps that rely on DirectShow codecs to be set up right which is a fragile disaster, but it's disadvantageous in that you rely on the whims of ffmpeg builders to put in the right codecs). So anyway to transcode I have to use VirtualDub. Vdub is a pretty nice app for simple AVI transcoding, it's basically just an interface to the DirectShow filters.

(BTW I guess I could do this directly in ffmpeg by going through AVISynth since there is an ffmpeg-avisynth connection and then avisynth can get you to anything in DirectShow, but lord help me if I have to get into that snake pit).

I set up my config for transcoding in the VirtualDub GUI and then I save the settings out to a "VCF" file and then I can run it from the command line like this :

c:\progs\virtualdub>vdub /s r:\vdub_lagarith.vcf /p r:\park_joy_444_720p_rawvideo.avi r:\park_joy_444_720p_lagarith.avi /r

Okay, so now we have the lagarith RGB video. That is all I need to be an input to my coder, but to compress with x264 we need to go back through some hoops. The most reliable way I've found is to use MPlayer to turn that into Y4M :

mplayer -benchmark -ao null -vo yuv4mpeg:file=park_joy_444_720p_lagarith.y4m park_joy_444_720p_lagarith.avi

And then finally we can run x264 on the Y4M :

x264_2pass park_joy_444_720p_lagarith.y4m park_joy_444_720p_x264.mp4 --bitrate 10500 --preset slow

x264_2pass.bat :
set x264args=%3 %4 %5 %6 %7 %8 %9 %10 %11 %12 %13 %14 %15 %16 %17 %18 %19 %20 %21 %22 %23 %24 %25 %26 %27 %28 %29
x264 -o %2 %1 -p 1 -I 9999 %x264args%
x264 -o %2 %1 -p 3 -I 9999 %x264args%
REM x264 -o %2 %1 -p 3 -I 9999 %x264args%

(note I specifically do not want to start from the original Y4M and just use y4mscaler or something to get to x264, I want to start from the same RGB lagarith video that my coder will start with, since that's the most unbiased source type; eg. neither of us code in RGB and neither of us are tuned for RGB, so it's not biased one way or the other).

Yay ! Oh, and BTW x264 seems to often severely miss its bitrate, usually over. That bitrate 10500 is my attempt to make it hit 12500. Depending on the video you'll have to look at the output size and fiddle with that to make it hit the right size.
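The fiddling can be done a bit more systematically : measure the bitrate you actually got from the output size and duration, then scale the next request by how far off it was. A rough Python sketch, with made-up helper names, assuming the miss is roughly proportional (which it may not be) :

```python
def actual_kbps(file_bytes, duration_seconds):
    """Average bitrate of an encoded file in kbit/s (container overhead included)."""
    return file_bytes * 8 / duration_seconds / 1000

def next_request_kbps(requested_kbps, achieved_kbps, target_kbps):
    """Scale the request by how far the last encode missed the target."""
    return requested_kbps * target_kbps / achieved_kbps

if __name__ == "__main__":
    # Hypothetical example : asked for 12500, got a 10 second file of 18.6 MB.
    achieved = actual_kbps(18_600_000, 10)
    print(f"achieved {achieved:.0f} kbps, "
          f"next try {next_request_kbps(12500, achieved, 12500):.0f} kbps")
```

This is one step of a guess-and-check loop, not a guarantee; you still have to look at the output size after the re-encode.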

One annoying niggle is that somebody along the way seems to fuck up framerate a bit. On park_joy , which is at 50 fps, it seems to be perfect. However on the 29.97 fps videos (30000/1001), somebody is doing the math wrong and getting off sync, so on long videos it will have a frame slip, so that if you compare the original to the final output, you'll find a frame where they get out of sync by one frame and you have to step ahead in the source to get them back in perfect sync. I haven't tracked down who the culprit is or how to fix it, so I just live with manually stepping a bit to find sync when comparing. I also haven't found an app that can just change the time code on a video without touching the frame data, that is take a 29.97 fps video and make it claim to be 30 fps, but show the exact same frames, eg. change duration but keep the frame count the same.
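I don't know which tool is at fault, but you can at least work out how long different kinds of frame rate sloppiness would take to cause a one frame slip. Here is a small Python sketch using exact fractions; the two "wrong" rates tried (30 and 29.97 exactly) are just guesses at what a tool might assume, not a diagnosis.

```python
from fractions import Fraction

NTSC = Fraction(30000, 1001)  # the true "29.97" frame rate

def frames_until_one_frame_slip(true_fps, assumed_fps):
    """Frames after which timestamps computed at assumed_fps are off by
    one full frame duration of true_fps."""
    true_fps, assumed_fps = Fraction(true_fps), Fraction(assumed_fps)
    drift_per_frame = abs(1 / true_fps - 1 / assumed_fps)  # seconds per frame
    return (1 / true_fps) / drift_per_frame

if __name__ == "__main__":
    for label, assumed in (("30", Fraction(30)), ("29.97", Fraction(2997, 100))):
        frames = frames_until_one_frame_slip(NTSC, assumed)
        seconds = float(frames / NTSC)
        print(f"assuming {label} fps : slip after {frames} frames "
              f"(about {seconds:.0f} seconds)")
```

Treating 30000/1001 as exactly 30 slips a frame every 1001 frames (about 33 seconds), while treating it as exactly 29.97 takes 999999 frames (over 9 hours), so a slip that shows up within one long clip looks more like the first kind of error than the second.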

03-30-10 | The Worst

One of the weirdest/worst types of people are the guys who wear really wacky clothes, but then are just boring as hell. Like the accountant who toils away in the cube farm at your company, but wears purple sparkly dress shoes every day. Or maybe he has a whole collection of weird shoes. Often these guys focus on some particular weird accessory, like maybe rainbow suspenders, or bow ties, or hats, or whatever. They are into rockabilly or swing music or some other type of music that for a second you think "oh that's kind of cool" but then you realize it's completely uncool and uninteresting. (another variant of this loser is the guy who is constantly wearing clothing related to their hobby or interest, like wearing their swing dancing outfits to work, or wearing their league bowling shirt or whatever; oo how interesting, I'm just dying to ask you about your hobby which you are so obviously advertising). Their wacky clothes scream "talk to me!" but generally they are quiet and introverted. Obviously they are desperate for human interaction and to be interesting, but their actual personality can't make that happen, so they try to use the wacky clothing items to spark some conversation. Of course I categorically refuse to help them out by asking about their interest or wacky clothing item.

Sometimes I find myself dangerously close to this precipice. On the one hand, I think everyone is fucking boring and retarded, so I have no interest in actually talking to people and being outgoing. On the other hand, I think that just wearing what you're "supposed to" is fucking ridiculous and boring, so I am tempted to get wacky and break the rules of decorum to show that I am not going to play society's stupid conformist fashion game. The result is that I become the boring quiet guy in wacky clothes. That must be avoided.

Recently I've become much more aware of how childish and pouty I can be. When something doesn't go my way, I go into a quiet huff. Partly that's processing and thinking about the next course of action, but it's also excessive dwelling. It's also a subconscious way of showing the world that I have been bothered and you better not do that any more; I actually make a pouty face and glower, it's very obvious if I have a camera pointed at myself (which would be an awesome tool BTW). For some reason I am much more able to step outside my own consciousness and observe myself as I do this recently and realize "wow you are being really pouty".

Playing DDR the other night I realized I have a bad habit of clinging to easy things I am good at instead of pushing myself by doing hard things I'm not so good at. Years ago when I played DDR more seriously, I played almost the whole time on light mode, I moved up to the hardest songs like Paranoia, but would practice them over and over until I could get a 100% perfect, but I stayed on light mode. A few times I tried normal mode and it was really hard and I moved back to just perfecting light mode. I would have gotten much better much faster if I had just pushed myself more to move up in levels, even though it felt hard and frustrating. I realize now that I did the same thing when I was playing poker. I stayed at lower levels where I could completely dominate and make massive profit rates, where I was confident that I was the best at the table, rather than move up to levels where I might struggle a bit.

03-29-10 | DDR Pads

We fetched out the DDR pads and played a bit. Not a bad game, but I wish there were more games to play on the pads. Has anybody made a bunch of mini games that are playable on the DDR pads? (any game that just uses arrows is playable, but that equals roughly 0% of games). Maybe PS2 games are hopeless, but I could also play PC games. Are there any PC games that just use the arrow keys and nothing else? (one I can think of is Daleks) (also Boulder Dash)

Simon Says. Just a simpler version of "hit these buttons". It plays a sequence, you repeat it, it adds to the sequence, you repeat it, ad infinitum. Each direction you hit makes a note and the sequence winds up playing a song.

Track & Field. Various competition games that basically consist of slamming on left-right really fast and then press up to jump, etc. permutations thereof. Discus throw could require you to do up-left-down-right-up-left-down-right and so on so there are various patterns to practice, and hurdles would be fun.

Katamari Damacy type game (or Monkey Ball if you prefer) (or Marble Madness). Anyway the point is you steer some ball around by tapping the different arrows.

I tried to get some PC games working on the HTPC so we could play on the TV. Jeebus what a clusterfuck. Fucking video games are such a disaster. I've written this many times before and nobody is paying attention, but here goes again : all video games (and 3d apps in general) must launch initially in windowed mode and must be able to run at whatever the desktop resolution is. (you can launch in maximized windowed mode if you like but you better respond to Windows-M minimization requests and you better fucking let me Alt-Tab to other apps). Furthermore, you better fucking have keyboard controls at least on your menus (arrows and enter please), and you better fucking respect "Escape" and "Alt-X" or "Alt-F4" or whatever.

Basically any jump to "full screen" or another resolution is extremely intrusive and should never be done except when the user specifically requests it. I'll give you a few reasons why : 1. On the HTPC I remote-desktop in to set things up. Jumping to full screen fucks the remote desktop in various ways. 2. I always want to be running in my LCD's native res. My desktop is always at the LCD's native res. You should never just jump me to a res that's different. 3. On the HTPC I use analog TV out. It has to be calibrated for each res. When you jump to some other res, the signal is all stretched and offset and I have to go back to the ATI control panel and adjust it. 4. You don't know what resolutions are good for my card/display - I DO, and I chose them on the desktop, so just leave it alone. 5. "Full screen" is no different than a maximized window on most modern 3d cards. The habit of jumping to full screen for performance is archaic. (yes you might save one blit, but leave that as an option for enthusiasts, not the default). 6. I may well want to run your game and check my email at the same time. Let me. You do not control my experience, I do.

Also, gamepad support on the PC is just such fucking balls. Maybe if I had an MS branded gamepad and only played new games built on XInput it would be okay, but old games using DirectInput and random old gamepads = massive fail. Good luck even getting it detected and selected as the controller, then the sticks will be all crazy out of whack with fucked up deadzones and drifts, and then when you fix that you find the button mapping is all boofood and you have to go into config and map all the buttons to something sensible. Half an hour of setup to get a game to work.

Installing games on the HTPC in general is probably a bad idea, because they are generally so fucked up that I risk destabilizing the system, and I don't want to have to deal with setting it up again cuz it's working pretty well right now. This is why I never play video games. I spent an hour last night fucking around installing things and trying to configure them right.

03-29-10 | WA Roads for Biking and Driving

I've been doing a lot of scouting and joy riding recently.

Vashon - went to scope out the bike riding - mm meh, it really doesn't look like a great place to ride; for one thing when you get off the ferry you are immediately faced with a monster hill. I hate hills right at the start of rides when I'm not warm. Then once you get to the top, the rest of the island is pretty flat and pastoral. The shoulders are not great and there's a lot of fast traffic. Most of the route slips that I've seen take you on a loop around the island, around the west side and then back on Vashon Hwy on the east. That's fucking retarded, Vashon Hwy is awful to ride on, just do an out and back on the west side. There is one real nice bit of twisties for driving around Cedarhurst but otherwise pretty meh.

West Snoqualmie Valley Road in the Duvall - Carnation area. Way too much traffic for fast driving. Narrow road and no shoulder - does not look like a nice place to bike at all. Scenery is pretty good though - you get views of the Cascades and pastoral river valley farming stuff. It's an enjoyable sunday drive.

A bit further north of there you can do High Bridge Road and Woods Creek Road. Both pretty nice drives as the traffic is a lot lower; these are popular with local motorcyclists, so they may slow you down, and cops do watch them. Again not great bike rides, though Woods Creek might be okay despite lack of shoulder because it is pretty low traffic.

Snohomish is a pretty abysmal tourist trap, but there are still some decent sights there. There seems to be a hot air balloon launching every time I go by it.

A bit further north I hear Menzel Lake and Jordan Road are good, but haven't been up there yet. Also Mountain Loop Hwy up to Silverton - Bedal looks cool, but I think it's pretty rough dirt road through the highest part.

Renton - Maple Valley trail. This is a "rails to trails" conversion, so it's all flat, separated from the road. On a nice weekend it might get unbearably crowded near Renton, but the traffic thins out and it gets more scenic as you get closer to Maple Valley.

Green River Gorge Road / Green Valley Road - this is some pretty rural Washington. Not great for driving though, around the Gorge Road area the pavement is in really abysmal condition, so you have to slow to watch for killer pot holes, and then in the Valley area it's residential and trafficky so you have to / should take it slow. Gorge Road area seems promising for biking, though there are quite a few sport bikes and ricers and such joy riding out here so it may be a bit dangerous. The gorge itself is pretty fucking rad, really deep sheer cliffs down to rushing river, but it's hard to access (the small shitty state parks in the area don't provide any access to the good parts of the river); I hear that the locals know the access paths and guard their secrets.

There are some very promising driving areas further afield (mainly in the Mt Rainier - Mt St Helens area) but that will have to wait for summer.

It's sort of weird that it's easier to get away from people in San Francisco than it is here. Part of the issue is that SF was really amazing about saving wild land right near the city (the whole of Marin and the Peninsula). But I think another part of the problem is the Cascades. It's nice that we have the Cascades right here, but there are basically no roads in them, and even very few decent hiking trails at low elevations. So essentially we are trapped in a little strip of land on the west side of the Cascades and almost every bit of it is developed and suburbanized.

03-23-10 | Pinnacle

Bike Snob NYC is really the pinnacle of internet comedy.

I believe there are two major forms of comedy that have been primarily developed by the internet; obviously these existed before, but they have been honed and perfected and seen their heyday in recent years. One is the sardonic / snarky condescension comedy, done so well by Surly Gourmand, but this also goes back to stuff like "Old Man Murray" ; basically in this form you humorously point out how retarded everyone else is. The other major internet comedy form is the "meme" ; this comedy form primarily lives on web forums, 4chan and /b/ being the primary nexus, but most major forums create their own memes. In this form you create a neologism or repurpose a word into a reference, and then it is used repeatedly in humorous ways. The meme is really only super funny if you know the history of where it came from.

Bike Snob combines the snark and the meme to optimum effect. The really amazing thing is that he also pulls off the snark without bile, and he creates all his own memes rather than just stealing them from the net. When you first read Bike Snob you don't really get it, but after following for a few weeks, you see the creation of memes, then the development of their humor value, and then they become funny just any time he drops them.

Linkage :

Quantum Diaries Survivor is a pretty damn good physics blog.

I can never get enough of Denny regrade photos

Scans of Vintage Bicycle Advertisements - Diablo Scott - Picasa Web Albums

Name Your Porsche - in case you thought your Porsche wasn't douchey enough already, you can name it "Bruce". "Bruce" !!

McLaren Automotive officially launches itself and MP4-12C supercar - yeah this thing is pretty fucking hot. The funny thing is that for the masses of technology they put into these super high end cars, they get microscopic returns. It's technology porn really, it's technology for its own sake, to drool over, not to do anything. If you look at the C&D Lightning Lap results for example, the WRX and the Speed 3 are great values at about $25k for a 3:15 lap, and the closer you get to the top the more ridiculous your cost per second gets. Obviously you're paying for looks and feel and so on, the focus on power and speed is all wrong.

fastestlaps.com View topic - airdrag... factory claims vs car magazine tested numbers.. really sweet page of photos and results from the German magazine "Sport Auto". (note that the photos are fakes though still very interesting - the actual runs for Cd are done with much higher air speed and with the wings up, the photos are taken at low speed air flow where you get those nice laminar lines; the real high speed flows have a lot of turbulence wakes; if more photos like this exist, let me know). The key number is not "Cd" , the coefficient of drag, but rather "CdA" , and of course downforce is very important too. The GT3 has very impressive aerodynamics.

Compact Crank Overload - all about the "compact double". I highly recommend compact doubles for all bikes. Sadly my Litespeed is on a 130 BCD crank and I don't want to have to replace the whole crank so I'm fucked.

All About Motor Oil - in case the last one I posted wasn't enough

Aldo's Pic of the Day - classic cycling photos.

03-22-10 | Pleased

I replaced my engine air filter over the weekend, and it made me feel oh so pleased with myself. Oh look at me, I'm such a blue collar manly man, I'm all covered in engine grease and doing my own maintenance. Of course it's a very easy job, it should take 15 minutes or so, it took me an hour because I'm anal and I cleaned everything out (cleaned the throttle body too which I highly recommend, you should also clean the MAF on newer cars).

It also felt nice to give back a bit to the car. The car gives me so much, it gives me adrenaline rushes and erections and warm butt massages. If I just take all that pleasure and don't give back, I'm treating the car like a cheap lay. You have to show your car some commitment, do nice things for it, give it baths, buy it air filters, it shows you really care.

Working on cars would be pretty fun if you had your own garage with a lift and a full set of tools, including pneumatic drivers and impact sockets and all that shit. I'm kind of tempted to get into it, but it really doesn't make much sense. I could easily do my own oil change, it's quite easy to get to the oil drain bolt and the filter on my car, but then I'd still have to deal with taking the oil to a disposal site and all that shite. I changed my own spark plugs on my old Pontiac long ago (which was easy and worth it), and I changed the pads and rotors on my Prelude with the help of my uncle (which was a huge pain in the ass and not worth it). It all would be pleasant and easy if things worked out the way they are supposed to, but there's always some fucking problem where you don't have a certain tool you need, or you bought the wrong part, or some bolt is frozen, or some screw is stripped, and you have to run to the auto parts shop or get a mechanic to help with something and it kills all your theoretical savings. Really you do it yourself not to save money, but for the pleasure of touching your machine, and also because you know that when you do it you take your time and do it right, whereas mechanics rush and fuck things up and don't take care; if some part doesn't go back together quite right the mechanic just hammers it in and then zip ties it down.

We live cheek by jowl with our neighbors. It's strange, we're in an old neighborhood, all the houses are from around 1910, and I've seen photos from Vintage Seattle of when the area was being built (bleh I can't find the photo I'm thinking of, but this is a decent demonstration ) - there's tons of empty space all around, but they built the houses right next to each other. I'm not sure why they did that exactly back in the day; it wasn't because of the cost of land, it must have been for some practical reason, like maybe running water or sewer lines was really hard, so you saved that work by putting things together? Dunno.

Our neighbors are mostly very quiet, in particular the directly adjacent neighbors are wonderfully quiet, so the proximity is not a problem. But lately one of the slightly farther neighbors has started to violate the unwritten code of neighbor conduct : thou shalt pretend that each other does not exist. You're supposed to close your ears when you hear neighbors talking; if they are outside their house doing something, you pretend you don't see it and you certainly don't go over and talk to them. (note : there is an exception to the "I can't see you" rule, which is when you are on the twilight hour promenade, in that case all neighbors may exchange cordialities). One neighbor has crossed the line. When I go out on the balcony, sometimes they call out "hello" from their balcony. Fuck you, I won't say hello, I'm trying to pretend you don't exist! The other day N sneezed and neighbor yelled out "bless you". What !? That sneeze is not yours to bless. Fie!

There are also some neighborhood kids who have been playing baseball and other such hard-edged activities right next to my car. I don't have a garage, just a driveway, so I am not protected from the dings of mini basketballs. I feel like it's way too douchey to go out and say "don't play next to my Porsche", but I know it's just a matter of time before I get a dent from those fuckers. The car has got extremely thin body panels to save weight; the hood in particular is so thin you can dent it just by closing it if you aren't very gentle. Maybe I should get a Pit Bull and put it on a leash next to the car and blow a dog whistle any time a child comes near it.

Some of the neighbors are semi-hippyish. I love people who are anti-corporate, people who grow their own veg, people who are tolerant of different lifestyles, people who are into drugs and music, but the fact is most hippies are fucking dicks. Hippies/liberals are the worst kind of people - they act super chill and want everyone to get along - but only if everyone goes along with their idea of how things should be. They're extremely prejudiced and judgemental, if you wear "uptight" clothes they assume you're a square who will harsh their vibe. They're identical to "family values" conservatives in that respect who call for government to stay out of our personal lives, but what they really mean by that is they want government to enforce *their* choice of lifestyle. Hippies espouse "tolerance" but by that they mean tolerance for sexual variations and races, not tolerance for christians and capitalists. I guess extremists on the left and right are both rotten and inconsistent, however let's not get carried away with moral equivalence, someone who wants to force peace and love on everyone is not the same as someone who wants to force christianity and income inequality on everyone. Anyhoo, I can see the hippie neighbors scowl at me because I'm obviously a capitalist pig, and I'm part of the gentrifying yuppie force that is invading their old neighborhood.

03-20-10 | Offenses

If you're releasing a fucking sample mp3 track from your album, name it "artist - title.mp3" , don't fucking name it "02.mp3" you fucking tardball musicians. I guess a lot of people aren't even aware of the actual name of their music files any more because they use fucking iTunes or some shit that isn't file based. Fucking iTunes not only fucks me up if I try to use it, but even when I don't use it it fucks me up.

Google Maps print display needs more white background. The colors should only be for outlines and road, not the big colored fields they use for things like parks, water, and city blocks. I think they have single handedly made $100 M or so for HP.

When you're going straight on a road that has two lanes in each direction and you come to a red light and you are the first car there - get in the fucking left lane. I am constantly confounded by coming up to red lights where I want to turn right and some solitary cocksucker is sitting there in the right lane blocking me. Similarly if you come up to a red light on a one lane road that has a bit of space, shade to the left a bit so that people have room to turn right around you. This is just being considerate, that is, thinking about how your actions affect others. All the so-called "nice" people rarely give much thought to being actually considerate. Speaking of which :

When I go biking, obviously the worst people are the ones who try to run into you, or throw bottles at you, or honk at you to scare you, but those are pretty rare. Far more common are the "nice" fuckers who get it all wrong. Maybe the most annoying of all are the people who wave at you to go when it's their turn. I'll come up to a stop sign, and I come to a complete stop because there's a car already there in the perpendicular direction. The best thing the car could do is just fucking go quickly when it's their turn, that way I can go behind them smoothly. Instead they just don't fucking go, I look straight at them like "WTF?" and they wave at me to go. Fuck you you fucking stupid wanker, go when it's your turn. (the most extreme variation of this is people who do not even have a stop sign, I'm stopped with my foot down at a perpendicular stop sign and they come to a stop for me to go; you fucking retard, just go so I can go behind you). A similar stupid "nice" person move is the fucking perpetual hover; they want to pass but don't want to cut too close to you, so they just drive right behind you for like a mile. That's really fucking annoying because I have to be super aware of them all the time to protect myself, just make your fucking pass; usually these people are the retards who refuse to cross the yellow line ever.

People who do their own major home improvement projects are fucking cockmunchers. This old guy that lives next to us is doing a total rebuild of his house. Bang bang bang every day. I think he's been doing it for three years or so; certainly in the 6 months that I've lived here he's made no apparent progress. I think he puts up one board each day. Hire some fucking people and get it finished. Or you should have to pay your neighbors $100 a day for each day that you do construction on your house.

Carry some fucking cash people. If you don't have cash and there's a long line for some cheap service - get the fuck out of line and go home. No you don't get to buy a magazine with your credit card while we all stand in line behind you and tap our feet. And don't blame the fucking convenience store or movie theater for having a slow Visa network connection - you know perfectly well that a lot of places have slow card machines, that's why you carry some fucking cash. It takes two minutes to grab some cash at an ATM, and then you have it for weeks. It saves *you* time, and it's considerate. Oh and when you do go to the ATM, get at least $200. What are you a fucking child that can't carry cash because it makes you spend it too fast? It's also fucking retarded all these people who carry mace and flashlights and leathermen so they are "ready for an emergency" but don't carry cash; cash is the #1 most important survival tool.

Why the fuck are you people on cell phones all the time? Drivers, pedestrians, people in stores, I see them all the time just going through life seemingly constantly with a cell phone next to their head. What the fuck are you talking about? I'm way more interesting than you and I don't have anything to say, I just can't imagine what these people are blabbering on about all day every day.

Don't fucking speed in residential neighborhoods you cocks. Any time there is anything blocking your vision, such as parked cars, you should assume there is a child right behind it who might jump out in front of you. You should only speed in areas with clear visibility or where you are reasonably certain no one is going to jump out (eg. on freeways there are concrete pillars for overpasses that obscure vision, but I'm pretty sure no one is behind them). There's another Porsche in my neighborhood who goes ripping down the narrow streets here and it pisses me off especially because of the association with me. There are also a lot of intersections around here in the residential back streets that have no signage at all - no stop sign or anything, just a 4 way intersection. People go blazing through these intersections which is just so ridiculously stupid.

There are many fucking annoyances with C that I have written about previously. Most of my major annoyances are with the fact that it does not provide you mechanisms to protect yourself as a coder (for example writing "overload" next to a virtual function to make sure it overloads something, which protects you from changing the signature in the parent and not doing the same change to all children). One bug I frequently contend with is accidentally using the same variable name in an inner scope. eg. I have some global named "foo" and then I make an "int foo" inside a function, and then get confused about which one I'm dealing with. I wish I could make that a warning, and have a keyword, maybe reuse "overload" to indicate when I want to do it intentionally.
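For what it's worth, later C++ (C++11 and up) added an "override" specifier that covers the first wish, and GCC and Clang have a -Wshadow warning that catches the second. A minimal sketch follows; Parent, Child, name and shadowDemo are invented names for illustration, and there is still no keyword to mark a shadow as intentional, so that part of the wish is not covered here.

```cpp
#include <string>

struct Parent {
    virtual ~Parent() = default;
    virtual std::string name(int /*verbosity*/) const { return "parent"; }
};

struct Child : Parent {
    // 'override' makes this a compile error if Parent::name changes its
    // signature (or stops being virtual) and this copy is not updated.
    std::string name(int /*verbosity*/) const override { return "child"; }
};

int foo = 1;  // global

int shadowDemo() {
    int foo = 2;        // hides the global; g++/clang++ warn here with -Wshadow
    return foo + ::foo; // local foo plus the global, spelled explicitly
}
```

Compile with something like g++ -std=c++11 -Wshadow to see the warning on the inner foo; change the parameter type in Parent::name and the Child version stops compiling instead of silently becoming a new function.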

03-19-10 | Porsche 997 Owner Notes

Some things I discovered after purchase that I wish I'd known before and are hard to find out :

The car doesn't really like to be cold. For one thing it's sold on "summer tires", and below 45 F or so you really need to be on "winter tires". (winter tires does not mean "snow tires" ; the difference between summer and winter tires is the rubber compound; you want tire rubber to have just the right softness in the operating temperature range; winter tires are chemically created to be soft at low temperature, but you can't use them year round because they would get too soft in heat; there's also a difference in the depth and style of tread, but the rubber compound is the bigger issue; summer tires become hard like plastic below 40 F). But even beyond that, the engine is designed to be track friendly, which means its operating temperature range is very high. It's hard to make an engine that works well both at very low and very high temperatures; one issue is the motor oil in the engine; no oil can handle a temperature range from 0 F (cold winter start) all the way up to 300 F (heavy track use). Track racers use 15W50 , winter street users use 0W40 ; the engine's not really happy until it's over at least 175 F. This means if you want to go racing in the winter you need to warm it up a long while (5-10 minutes ; just watch your oil temp gauge).

Some Porsche engines eat oil, some don't. It's sort of a random chance whether you get an oil eater or not. Not really a huge deal either way, it just means adding a liter or two between oil changes (oh, BTW, the manufacturer recommended 10k or 15k miles between oil changes is bullshit - it's part of what causes blown engines; use 5k intervals!). What *is* annoying is the fucking electronic oil measurement. It refuses to give a reading unless the oil has descended into the oil pan, so you can't do it when the engine has been run at all, you have to let the car sit for 10-30 minutes to get a reading. But you also don't get an accurate reading if the car has sat for a long time, so you can't just do it in the morning after the car sits overnight. Also the car has to be level to get a reading. Very annoying. Each electronic oil tick is half a quart. You should try to keep it between the bottom and middle tick mark - not at the top, keeping it fully topped up will encourage it to burn oil faster. (Porsche officially says that burning up to 1 quart per 400 miles is "by design" ; most cars, like mine, seem to burn about 1 quart per few thousand or so, while a few cars don't burn any at all).

The rear parking sensor thing beeps. It's a bit annoying because it starts beeping when you are like 10 feet away from the thing behind you. I'm always like "really? you're beeping already?". I just discovered recently that it will actually go to a solid tone after beeping and you still have about 6 inches behind you at that point. The parking sensor also can't detect thin or very low objects, so it's better to just use your eyes and ignore it (preferably buy a car without that option).

The PCM computer thing is really awful for playing music. It does play MP3 CD's and it recognizes the directory structure and will show you folder names, that is handy. There is NO aux input for ipods or whatever. That is fucking retarded. You can get an aftermarket kit to plug in your ipod for $500 or so. The PCM audio can't do simple things like turn off the radio but leave the CD player on (eg. when you eject CDs it starts playing radio). You can't turn on & off the audio separately from the Nav/etc. It also can't pause CDs from the normal controls; I just discovered the other day that you can in fact pause CDs from the multi-function steering wheel by pressing on the volume wheel.

The Bose stereo is pretty fucked up; first of all you must turn off the "adaptive surround" or whatever it's called where they try to create surround sound from stereo. Then you have to tweak all the treble and bass settings drastically to try to get something decent out of it. One quirk is that there are separate settings for FM and CD mode, so you have to be playing in that mode to tweak it, and you have to do all the tweaks twice.

The rear subwoofer and the rear seats are both easily removed if you want to save weight for track days or make more room for cargo or whatever.

The default alignment on the car is very "mild" ; that is, it's very stable, keeps the car straight on the freeway when you let go of the wheel, and resists turning, making the car understeer slightly. This is nice for highway driving and many people will be happy with it. If you want more aggressive turn-in, the easiest way to fix that is just to get a more aggressive alignment; the biggest difference comes from getting more negative camber up front, but you can also get less toe in. You can just ask for the "rest of world performance alignment".

Something about the front alignment (maybe it's the caster ?) means that when you turn all the way to lock, you actually are up on the edges of the tire. Obviously you don't have great control when you're on the edges of the tire and it can feel squirrelly, especially if the tires are cold and stiff. They can "scrub" or "crab" and make some crunchy sounds. It's not bad once the tires are warm, but you still probably shouldn't slam on the gas at full lock. It's not a great car for gymkhana unless you do some suspension mods. (BTW getting a "performance alignment", as you should do, will mostly fix this)

It's hard to get the transmission back into 1st gear once you've been up to high speeds (after you slow back down). You can solve this by double clutching : put it in neutral, let out the clutch, put the clutch back in, put it in 1st. This is "by design" ; many Porsche drivers just keep the car in 2nd once it gets moving. You can't really use 1st gear once you get moving, which sort of sucks for low speed corners (such as in autocross) because you get into super low revs in 2nd gear.

Almost every part of the car is just held together by little plastic tabs, you know those bits that click together. This is kind of handy because it means everything is very easy to take apart and put back together, but it also means that it's not super solid feeling. The cabin has a lot of squeaks and rattles. You can fix these pretty easily by just popping out the offending piece, putting a bit of foam or felt tape under it and popping it back in. But trying to do that on every piece would drive you mad.

I often think that there's some horrible rattle from a loose piece of the car. In fact every time it has just been something I put in the car, a key chain, a quarter in the change tray, etc. The car gets a lot of vibration from the road because the suspension is pretty stiff, so anything you carry in the car will bounce around quite a lot.

The electricity is on inside the car all the time, even when it's shut off and the doors are locked. This means your cabin accessories plugged into the various 12V DC ports will stay on. Personally I find that annoying, others may like it. It also means the car will drain the battery if it sits a while; people who keep their Porsches in garages all the time usually have to buy a battery tender.

If the battery goes completely dead it's a pain to deal with cuz the hood is power operated. You have to first put jumpers on the fuse panel in the driver's footwell, then use the hood open switch on the key, not the one in the car. (there is also an emergency manual hood open wire, but it requires taking apart the right wheel well, so the fuse panel method is preferred when possible)

Like most cars these days, you can disable the seatbelt warning chime by plugging the seat belt in and out 15 times quickly (this makes the computer think it's broken and thus disables it). Or you can buy a Durametric (Professional) cable that will let you toggle all the option codes in the chip the way a dealer can. (it's OBDII which I guess is a standard interface that lots of cars are on these days, you can get a generic OBDII device for cheap that can at least read engine failure codes and clear service lights, but you can't toggle options with a generic device).

The Porsche as a daily driver is mostly great. The main problem with it is not the harsh ride or the low cargo space, it's that it wants you to really drive it. It begs you to hammer the throttle and swing through the curves. That's great, right? Well mostly yes, but not always. Some days you just want to turn off your brain for the commute and get to work without incident. Some days it's easier to deal with the traffic and annoyance if you have a sedative in the form of a non-performant car. I found that when I was driving the Nissan Versa rental car - you can't go fast even if you want to, and that actually is very relaxing. The Porsche is like a girlfriend who wants to have nasty hard sex every day. Sounds great, right? Well, after the 20th day in a row of fucking, you kind of just want to watch TV and be left alone, but she's still jumping on you and whipping you with her hair; at that moment you kind of wish you had a fat lazy girlfriend who just eats chips with you and sits on the couch.

03-18-10 | Physics

Gravity is a force that acts proportionally to the two masses. (let's just assume classical Newtonian gravity is in fact the way the universe works)

People outside of science often want to know "but why?" or "how exactly? what is the mechanism? what carries the force?" . At first this seems like a reasonable question, you don't just want to have these rules, you want to know where they come from, what they mean exactly. But if you think a bit more, it should be clear that these questions are absurd.

Let's say you know the fundamental physical laws. These are expressed as mathematical rules that tell you the behavior of objects. Say for example we lived in a world with only Newtonian dynamics and gravity and that is all the laws. Someone asks "but what *is* gravity exactly?". I ask : How could you ever know? What could there ever be that "is" gravity? If something was facilitating the force of gravity, there would have to be some description of that thing, some new law to describe it. That would mean some new rule to describe this thing that carried gravity. Then you would ask "well where does this rule for the carrier of gravity come from?" and you would need a new rule. Say you said "gravity is carried by the exchange of gravitons" ; then of course they could ask "why is there a graviton, what makes gravitons exactly, why do they couple in this way?" etc.

The fundamental physical laws cannot be explained by anything else.

That's almost a tautology because that's what I mean by "fundamental" - you take all the behavior of the universe, every little thing, like "I pushed on this rock with 10 pounds of force and it went 5 meters per second". You strip away every single law that can be explained with some other law. You strip and strip and finally you are left with a few laws that cannot be explained by anything else. These are the fundamental laws and there is no "why" or "how" for them. In fact the whole human question of "how" is imprecise; what we really should say is "what simpler physical law can explain this phenomenon?". And at some point there is no more answer to that.

Of course this is assuming that there *is* a fundamental physical law. Most physicists assume that to be true without questioning it, but I wrote here at cbloom.com long ago that in fact the set of physical laws might well be infinite - that is, maybe we will find some day that the electrical and gravitational force can be explained in terms of some new law which also adds some new behaviors at very small scale (if it didn't add new behaviors it would simply be a new expression of the same law and not count), and then maybe that new law is explained in terms of another new law which also adds new behaviors, etc. ad infinitum - a russian doll of physical laws that never ends. This is possible, and furthermore I contend that it is irrelevant.

There is a certain human need to know "why" the physical laws are as they are, or to know the "absolute" "fundamental" laws - but I don't believe there's really much merit to that at all. What if they do finally work out string theory, and it explains all known phenomena for a while, but then we find that there is a small error in the mass of the Higgs Boson on the order of ten to the minus one billion, which tells us there must be some other physical law that we don't yet know. The fact that string theory then is only a very good model of the universe and not the "absolute law" of the universe changes nothing except our own silly human emotions in response to it (and surely crackpots would rise up and say that since it's not "100% right" then there must be angels and thetans at work).

What if we found laws that explained all phenomena that we know of perfectly. We might well think those laws are the "absolute fundamental" laws of the universe. But how would we ever know? Maybe there are other phenomena that can't be explained by those laws that we simply haven't seen yet. Maybe those other phenomena could *never* be seen! (for example there may be another entire set of particles and processes which have zero coupling to our known matter). The existence of these unexplained phenomena does not reduce the merit of the laws you know, even though they are now "not complete" or "don't describe all of nature".

It's funny to think about how our intuition of "mass" was screwed up by the fact that we evolved on the earth in a high gravity environment where we inherently think of mass as "weight" - eg. something is heavy. There's this thing which I will call K. It's the coefficient of inertia, it's how hard something is to move when you apply a certain force to it. F = K A if you will. Imagine we grew up in outer space with lots of large electrical charges around. If we apply an electric field to two charges of different K, one moves fast and one moves slow, the difference is the constant K. It's a very funny thing that this K, this resistance to changes of motion, is also the coupling to the gravitational field.
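To spell out why that's funny, here's the arithmetic under the plain Newtonian assumptions above. The symbols q (charge), E (electric field), m_g (the coupling to gravity) and g (local gravitational field) are just my labels for this sketch; only K and F = K A come from the text.

```latex
\begin{align*}
F &= K a \\
\text{electric field:}\quad q E &= K a
  \;\Rightarrow\; a = \frac{q E}{K}
  \quad\text{(depends on } K\text{: bigger } K\text{, slower response)} \\
\text{gravity:}\quad m_g\, g &= K a
  \;\Rightarrow\; a = \frac{m_g}{K}\, g = g
  \quad\text{(when } m_g = K\text{, } K \text{ cancels)}
\end{align*}
```

So in the electric case two objects with the same charge but different K accelerate differently, while in the gravitational case K drops out entirely, which is exactly why everything falls at the same rate and why "heavy" and "hard to push" got fused in our intuition.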

03-17-10 | Porsche 911 Buying Guide

This one goes out to all the internet searchers in the world. I figure I should brain dump to the wild electronic blue.

ADDENDUM : this is really a 997.1 buying guide , not an old-911 buying guide (there are plenty of those around if that's what you want, look elsewhere) ; the first thing a buyer has to learn is that Porsche people don't ever call it a 911, you have to refer to the specific generation you are talking about, and when people say "911" they usually mean the very early models, pre-964.

Buy used. If you have any doubts, remember that Porsche makes $28247 per vehicle which is almost all the difference between new and 1 year old car prices. The cars have a pretty huge price cliff when they transition from "almost new" ( under 10k miles ) to "yuck the seat is soaked with someone else's ass sweat" ( over 10k miles ) ; you will get the best value by buying the most recent year you can get, but with higher than average miles. Also, high mileage on these cars is not actually a terrible thing; the absolutely worst thing you can do for these engines is to let them sit in a garage for months and then a nice day comes around and you go out and thrash it, which is what a lot of the low mile cars do. Also at high mileage it's more likely that a major engine problem (RMS/IMS) would have shown up by now. Also, Porsche buyers are nuts about minor cosmetic damage; that means you should of course buy a car that has some minor cosmetic flaws like paint scratches and bumper rash, and you should get a big discount for it.

Naming : "911" is the line of rear engine cars; it also refers to one specific generation of the line. Future types of 911 are either called the "2001 911" or by their numerical code name, eg. "996". Sometimes people will also just say "the coupe" which is understood to mean "current year 911 carrera".

History : Porsche repeatedly tried to kill the 911 and get traction with mid-engined (914) or front-engined (944,928) cars. Before 1999 the 911 was air-cooled (really "oil cooled" , which it still is, but now the oil is additionally cooled by water, while then the oil was just cooled by air). The old 911 is the longest-running production car after the Beetle, which is interesting since it is just a Beetle (this claim is pretty bogus). Most Porsche nuts idolize the early cars which is why you need to ignore the opinion of Porsche nuts. Older Porsches were also more rare and held value better; it was a weird cult car; the modern 996+ car is not like that any more at all, it is a mass market easy to drive car with huge depreciation; some of the old Porsche nuts don't understand or refuse to recognize that change.

1999 introduced the 996 (actually 98 but you will never see a 98). The current car (2010) is largely the same as that car - same platform, same basic engine. All modern 911's are ECU controlled with air sensors and drive-by-wire electronic throttle and all that kind of shit that makes them just like a BMW or something, but they do a good job of feeling simple and pure and manual. In general the advantage of the 911 over competing cars is just that it is an incredibly well sorted car which is well designed for people who like to drive. The power is not amazing by modern standards, but the way it puts down the power is fantastic - it has less drivetrain loss than average, less throttle delay, you sit low, it's just the right kind of stiff, the ergonomics are great for real driving. They have also kept the weight pretty low by modern standards - you can get a base 911 for under 3000 pounds with all the AC and carpets intact , most competitors are around 3500.

The lineup of variants is basically the same for all the years (though they're fickle about which variant is available in what year), so first I'll go over the variants :

Basically you pick each part of the car and put them together, so you pick :
roof type : {Carrera/Cabriolet/Targa}
+ engine type : {S , not S }
+ drive type : {2wd or 4wd}
Put them together, so eg. Carrera 4wd S = C4S
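If you want to see the whole option grid at once, here's a minimal sketch that just enumerates the three choices above. The list names and the `variants` helper are made up for illustration, and the labels follow this post's own "Carrera 4wd S" style, not Porsche's official badge names.

```python
from itertools import product

# The three independent choices described above (labels follow the post's wording).
ROOFS = ["Carrera", "Cabriolet", "Targa"]
DRIVES = ["2wd", "4wd"]
ENGINES = ["S", ""]  # "" means the base (non-S) engine

def variants():
    """Return every roof/drive/engine combination as a label like 'Carrera 4wd S'."""
    return [
        " ".join(part for part in (roof, drive, engine) if part)
        for roof, drive, engine in product(ROOFS, DRIVES, ENGINES)
    ]

if __name__ == "__main__":
    for name in variants():
        print(name)
```

That's 3 x 2 x 2 = 12 combinations on paper, though as noted not every one is sold in every year.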

Note : the Turbo, GT3, and GT2 are a separate car that is not a variant of this; I will address them at the end.

Carrera = Coupe - this is the good one. Lighter, more rigid.
Cabriolet = Convertible. This is for very fat bankers. If you want a convertible car there are many better choices.
Targa = glass top ; at some point the meaning of "Targa" changed from a T-top type of thing to the new variant which is a huge glass sliding roof. The glass roof is heavy and it's weight in a bad place - up high - which hurts handling, but I must say it is absolutely glorious inside there; it's all the advantages of a convertible with none of the disadvantages. It is a painfully uncool fat banker's car, but it's also really fucking great to be in. If you don't care about what other people think of you (which you clearly don't if you are considering a 911), this is a great car, though they are a bit rare thus hard to find cheap used. The rear glass also opens like a hatch back, which is especially great if you delete the rear seats (which you should).

911 "base" (Carrera/Cabriolet/Targa) - don't buy this, it's for retards and fat (but poor) bankers.

911 "S" - the "S" is a huge upgrade over the base for not much money, it's obligatory. You get a bigger engine, bigger brakes, bigger rims, better suspension, standard PASM and other sport options, better air intake, better exhaust, it's just upgrades all around, you really cannot pass up the "S".

"2" vs "4" (referred to as C2S or C4S for example for the Carrera 2WD S) - this was the hardest buying decision for me. The "4" gets a few bonuses, like pre-pressurized brake lines and larger brakes from the turbo, as well as the wide body and larger tires. The "4" also gets 100-150 more pounds of weight (though most of that is in the front which is not a terrible thing), and a bit of understeer and a slightly more numb feeling. In the real world, the 4 is actually faster in almost all scenarios in the hands of 99.99% of drivers, however the 2 *feels* faster because it is lighter and there's less drivetrain loss and lag - you get the engine in the back directly driving the wheels in the back which makes it super responsive. If you do get a "4" you should go for the 2009+ variant. I think 90% of the fat banker buyers should be getting the 4, because it is more practical and safer for the shitty driver; serious enthusiasts probably should get the 2 for the tighter feel and wacky tail-out fun times.

Now we'll go over the years. The 996/997 has been basically the same but tweaked over the years and there are some major differences :

The 996 generally (1998-2004) is the ugly stepchild of recent Porsches. It's very ugly, the front end is identical to the old Boxster front end, the interior feels to me just like a Dodge Stealth, and the steering feel and throttle and brakes and everything just feel a lot worse than the 997 (even though they are fundamentally the same).

996.1 : 1998-2001 : 3.4L engine (300 HP) : do not buy these. They are known for engine failures (RMS is just a small leak problem while IMS is a major engine failure) and the 2002 has a lot of upgrades.

996.2 : 2002-2004 : 3.6L (320 HP) : adds variable valve timing, better aero and stiffer chassis, lots of little improvements over 996.1 , mostly sorts out the engine failures. This car is much hated, though it's actually not a bad car, so it should be available for under $30k , and if you do a bunch of mods it could be a nice car. I still don't recommend it, because if you are thinking of a 996 you can get a GT3 or Turbo for so cheap (see later).

997.1 : 2006-2008 : 3.8L (S) (360 HP) : lots of improvements over the 996.2 - bigger engine, bigger rims, brakes, handling more sorted, clutch feels a lot better, better steering feel. Adds PASM adjustable sport suspension (you are getting an S of course so this is standard) which is very good. Oversteer is extremely well sorted now through suspension and alignment and tire sizing, if anything the cars slightly understeer out of the factory. PSM (electronic traction control) is very good - keeps you going straight without interfering too much.

997.2 : 2009-2010 : 3.8L DFI (385 HP) ; big engine improvement, moves to direct injection which gives more power, more economy, more torque. Fixes the IMS problems by having no IMS. The biggest improvements however are in the automatic (PDK) which gets the 7-speed double clutch from the Turbo, and the C4S which gets the electronic 4WD from the Turbo (PTM) (the old 4S had mechanical viscous clutch 4WD which is less effective as a 4WD but maybe is better at making the car feel like a RWD). If you're considering a C4S, especially an automatic, the 997.2 is a big win over the 997.1 ; some indications that the early engines are not sorted (eating a lot of oil); they also have tuned the suspension to be a bit more dead out of the factory, so the 997.2 is even less tail-happy and a bit more numb than the 997.1 , but you can easily have this fixed aftermarket if you want more snap. Also if you want an automatic (PDK), you must get a 997.2 with SC (Sports Chrono), the earlier automatics suck (tiptronic), and normally SC is worthless, but with PDK it is a must. Another huge change for 997.2 is that an LSD (Limited Slip Differential) is now standard (on S/PASM cars anyway, which you are of course getting), so that's a nice bonus. Right now these cars are hard to find cheap used, however you can still find unsold 2009's at dealers and they are offering big deals on them.

2011-2012 will bring a major overhaul and is currently codenamed the 991.

Buying options on the 997 : (a general note : options may cost a huge amount when you buy the car new, but on used cars they are worth almost $0 ; do not pay more for a car because of options ; if someone tries to tell you the car is worth $5k more because it has PCCB, tell him you're sorry he wasted his money on options that are worth nothing in resale).

Sports Chrono : must get this with PDK on the 997.2 (it gives you a faster-shifting more aggressive mode for the automatic; must have; also gives launch control; those cars also have the ability to switch the 4WD to 2WD on C4S models which is pretty cool). On other cars all this does is give you a "sport" button which remaps the throttle to be a steeper curve. Basically it just makes e-gas have a *= 2 factor on it, so that 100% throttle comes on quicker. That's retarded and actually hurts lap times because it gives you less fine throttle control.
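To make the "less fine control" point concrete, here's a toy sketch of that kind of remap. This is an assumption for illustration only: the real map is surely not a literal clamp-at-2x, and `sport_throttle` and `usable_pedal_travel` are hypothetical names, not anything from Porsche.

```python
def sport_throttle(pedal, gain=2.0):
    """Toy remap: pedal position in [0, 1] -> throttle opening in [0, 1].

    gain=2.0 is the "*= 2" caricature from the text; the output is clamped
    so the throttle can never open past 100%.
    """
    pedal = min(max(pedal, 0.0), 1.0)
    return min(1.0, gain * pedal)

def usable_pedal_travel(gain=2.0):
    """Fraction of pedal travel that still changes the throttle opening."""
    return min(1.0, 1.0 / gain)

if __name__ == "__main__":
    for p in (0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"pedal {p:.2f} -> normal {p:.2f}, sport {sport_throttle(p):.2f}")
    print("usable pedal travel in sport:", usable_pedal_travel())
```

With a gain of 2 the whole throttle range is squeezed into the first half of the pedal, so the same foot movement changes the throttle twice as much, and the second half of the pedal does nothing.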

PASM : this will be standard (you are buying an S right) and it's good.

19" rims : this will be standard (you are buying an S right) but it's not good. 18" rims are preferred - they are much lighter, let you buy cheaper tires, give a softer more comfortable ride, and are at worst neutral on performance (and might help).

Bose stereo : terrible, avoid. Adds weight and might make the audio quality worse. The car is so loud the stereo quality doesn't matter anyway.

Nav & PCM : terrible, avoid. Much improvement in 997.2 (PCM v3) , so if you need electronics get a 997.2 ; otherwise just use an aftermarket nav system, or pull over and use your iPhone.

Parking assist, dimming rear view mirror, headlight washers, TPMS, etc. etc. - none of this is good, avoid it all. Even if you think you want it, it seriously does not work well, avoid. Like it's not just overpriced, but having it is worse than not having it.

X51 package : performance package that boosts HP about +25 ; this is a very nice kit (exhaust, headers, intake, valve flaps, cylinder heads), which you should definitely get on a used car if you can find it cheap. Definitely not worth the $$ on a new car (but you aren't buying new cuz that's ridiculous with a Porsche). The OEM X51 kit is way way better than any aftermarket 3rd party intake/exhaust mods. (ADDENDUM : x51 is really a very good deal in a used car and you should get it if possible; it also includes a lot of upgrades to make the engine more reliable and handle track G's better, such as a better oil pan and extra oil scavenge pump, better cooling, better cylinder heads, etc. The value is really not about the small power boost, it's that the car can handle heavy use without blowing up.)

Tiptronic (automatic) : avoid because you're not a fat boring lazy banker. Also because the PDK in the newer cars makes this look like stone age dogshit, so these cars will suffer badly in the resale market. (If you do want an automatic, that's a good reason to get a 997.2 instead of a 997.1).

Ceramic brakes (PCCB) : this is a pretty "meh" option ; definitely don't pay more for it. In theory it's cool because they basically don't wear or fade with use, and for a street car they might be a cool thing, but they do crack under very high heat track use, and when they crack they cost $10k to replace, so maybe not an awesome thing for a car you will track (and if you don't track, then normal pads are fine).

Okay, now on to the GT3, GT2 and Turbo. These are significantly different than the other cars, even though they obviously look very similar. They have different suspension, engines, intake & oil cooling, and body panels. The Turbo generally has the latest technology, which then moves down to the base cars in the next model. Most significantly and what sets these apart as a family from the other 911 cars is that they are actually built on a different engine. All the 996/997 GT3/GT2/Turbo cars are based on the same engine : the Porsche GT1 race engine from 1996-1999. This engine was brought to the retail 911 for the 996 GT3 in 2003-2004 ; even though it is a 3.6L flat 6 like the normal 911, it is a totally different engine that revs higher, can take more heat, and has a true dry sump for oil cooling which lets it sustain over 1 lateral g without drying out. (technically it's the M64/GT1 engine, while the main line of 911's has an M96/M97 engine). There's not as big of a difference between 996 and 997 variants of these cars as there is with the base cars, because these cars have stayed with the same GT1-based engine the whole time (eg. a 996 GT3 vs. a 997 GT3 is not a huge difference, just body style, some suspension degrades, addition of electronic nannies, stuff like that). The main difference in 996-997 on these cars is the body style and the tech doodads. The GT named cars are quite clear - they are named by the race class they are designed for ; GT1 is the highest race class, then GT2, then GT3. (* addendum : the 997.2 Turbo now has the same DFI "9A1" engine as the rest of the 997.2 line, just with a Turbo stuck on it; the 997 GT3 is still the M64/GT1 engine, and the 996 and 997.1 Turbos also have an M64/GT1 engine).

Turbo : (often called the TT, as in "996 TT") a bit like a C4S with the better engine & a turbo stuck on it. Also the latest tech doodads to control all that power, fancy brakes and suspension and all that. Cabin is just like a C4S. Obviously an awesome machine, but I really don't like this car; it's not ideal for the track, and the problem is that it's not fun until you are over 100 miles an hour, which makes it pretty impractical in the US. This car is made for blazing on the autobahn. On the plus side, 996 Turbos can be had for quite cheap now.

GT3 : RWD, naturally aspirated, a bit like a stripped C2S with X51 but really a whole other beast. Stripped of weight and unnecessary doodads like computers. The lower tighter suspension makes this not a very practical street car. Also all that power in the rear-engine RWD layout makes this car much more tail happy than the C2S (which is tweaked to soften up the oversteer tendencies). The 997 GT3 has a lot of the modern tech doodads (PSM, PASM) which make it more liveable on the street. The 996 GT3 is the last really raw non-teched-out Porsche, and they can be had used for quite cheap, this is my recommendation for a real raw track car. vs a C2S the car has a much lower rotating weight due to light weight fly wheel, lighter rims and brakes. This is a great car, and a tempting buy because it basically takes a 911 and does all the mods to it that you wish you could do if you're a serious driver, but if you are buying a car that's mainly for the track and a bit uncomfortable on the road, there are better options (like a Cayman or Lotus if you just want fun, or a Nissan GTR if you really want speed, or a Corvette for RWD madness, or lots of other things that are all much cheaper). The only advantage of the 911 over its rivals is its practicality, and the GT3 throws that away.

GT2 : this is a GT3 with a turbo stuck on it, or a Turbo that's been converted to RWD and stripped of weight and made stiffer. This is a ridiculous dangerous car and you should not buy one. This is the one car that really carries on the 911 legacy of being an out of control killer.

Of course there are various other variants (GT3 RS, Cup, etc. etc.) but you aren't going to buy them so whatever. You should be able to get a 996 Turbo or GT3 in good shape for $50-60k now.

Some good guides :

Deutsch Nine history of the models
PorscheDoc's 996 buying guide
Grant Neal's 997 in detail

If you are shopping, do not trust any dealer or private guy about what variant or options a car has. Ask them for a photocopy of the first page of the service book, which will have the VIN and option codes. Then enter them here :

VIN code guide
VIN Decoder
Option code Guide
Option code decoder


Most people will recommend a PPI ("Pre-Purchase Inspection") when buying a used Porsche. I got a PPI and it was pretty worthless, because Porsche mechanics use it as a way to make free money from people who don't know Porsches, and they don't really care if you get a good car or not. So, I'll tell you what you should look for. I do recommend that you get a PPI, but rather than letting the mechanic just check the car out for you, ask for this specific information :

1. DME readout of CELs and overrevs. They scan this with an OBDII reader and should get a page of information. Have them send you an exact copy of what they get from reader, not just their opinion on it. What you want to see is no overrevs in range 4 or 5 (range 1-2 are okay, range 3 is iffy). (you can also buy your own OBD reader and do this yourself)

2. Compression test - this is the only way to tell if there's damage inside the engine. You want the compression number on each cylinder; this should only cost about $100, any more and they are ripping you off. In particular if the engine has been run too hot or oil-starved, the cylinder liners can be gouged or warped out of round, and then they will not seal properly and you'll see reduced compression on that cylinder.

3. Check for oil weep on the engine, particularly in the front. Obviously a seller can hide this by cleaning the engine, but they usually don't. Obvious oil weep near the engine-tranny mounting is a sure sign of RMS leak.

4. Check for bent suspension parts, cracked coil packs, and rust on the exhaust. These are all easily visible on the bottom of the car when it's lifted.
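If you end up comparing DME printouts from several cars, the overrev rule of thumb from item 1 is easy to write down. This is just a sketch of that rule as stated here: the input format (a dict of range number to event count) and the `assess_overrevs` name are made up, since the actual reader output varies by tool.

```python
def assess_overrevs(counts):
    """Apply the rule of thumb from item 1 to a DME overrev readout.

    counts: dict mapping overrev range number -> number of recorded events
            (hypothetical format; transcribe it from whatever the reader prints).
    Returns "bad" for anything in range 4 or 5, "iffy" for range 3,
    and "okay" if events only appear in ranges 1-2 (or not at all).
    """
    if any(counts.get(r, 0) > 0 for r in (4, 5)):
        return "bad"
    if counts.get(3, 0) > 0:
        return "iffy"
    return "okay"

if __name__ == "__main__":
    print(assess_overrevs({1: 120, 2: 3}))   # only ranges 1-2
    print(assess_overrevs({1: 40, 3: 1}))    # something in range 3
    print(assess_overrevs({1: 5, 4: 2}))     # something in range 4
```

It doesn't replace reading the actual printout (engine hours at the last event matter too), it just keeps you honest when a seller says "the overrevs are fine".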

ADDENDUM 3-1-2011 :

if for some reason I was buying a new 997 right now it would be a GTS for sure. The GTS is a 997.2 (9A1 DFI), it's RWD but with the wide body from the 4WD car, has the x51 Powerkit option standard for +20 hp, alcantara interior bits, some aero body bits, a nice rear-seat delete option like the GT3. In non-America it also comes with sunroof delete, but in the US that's not an option. Basically the GTS is a good set of options, and you actually get them at a discount, unlike the retarded "Sport Classic" or "Speedster" which are just 997's with a ridiculous markup for no reason. Personally I wish they would have done a bit more weight reduction for the GTS, but instead of a real lightweight enthusiast car, what they did was make the GTS the car that all 997's should have been all along. At 408 HP it's actually got the power to match its rivals. (the only thing not to like about the GTS are the absurd center-lock wheels, but you can opt for normal lugs).

03-16-10 | Video Reviews

Watching a bit of "The Inbetweeners" (BBC). Meh, it's passable so far as TV comedy goes (way better than Modern Family for example, but not as good as The League). But I'm really sick of the whole nerdy loser boys in high school genre that was made so big by Judd Apatow and that fat guy who's always in his movies and Michael Cera and all that. Uh, yeah we get it, high school boys are horny and terrible with girls and they get drunk and puke cuz they drank too much ha ha ha. And then they learn lessons about life aww look how fast they grow up. I find the whole need to relive childhood or laugh at our childhood very juvenile and tedious.

"Banlieue 13" (District 13/ District B13 - very bad translated titles) is fucking great. It's silly and frenetic and imaginative, everything that action movies should be. It's a bit strange that it completely flopped in the US, since Luc Besson is pretty well known and it's a fucking great movie, but maybe it's just a bit too French ; you won't really get how close to reality it is unless you've followed a little bit of French news in the last 10 years, and then it's got the French goofiness with silly cars and parkour and banging soundtrack. It's the best Parkour in a movie for sure.

Feynman's "Messenger Lectures" - ( "The Character of Physical Law" ) ; this is available from Bill Gates but that's a fucking stupid ploy to make you use stupid Silverlight , just download the torrents. God these are so good. You don't really need any strong physics or math background, these are aimed at college students, but Feynman always aims high and doesn't talk down, so you do have to be smart to follow. It's so satisfying and invigorating to watch something that's actually intellectually stimulating, it makes my brain start firing. I wish I could watch only things like this all the time.

High Stakes Poker is back on now and as usual is amazing. It sucks they ditched AJ Benza in the booth though, obviously he was just a talking bag of pus, but Gabe can't work solo, commentary is always a team; Gabe Kaplan alone is almost like John Madden alone - not good.

Full Tilt "Durrr" Million Dollar Challenge is also interesting. The first 6 episodes are a bit boring because he's playing against pretty weak opponents, but in episode 7 Zigmund sits in and they play some insane PLO. It's also funny to see that Durrr was playing these games all day long while Isildur was tearing him up every night; Durrr wound up losing around $3M to Isildur very quickly. It's very interesting to watch Durrr play, partly because he gives tons of action which makes things interesting, but also because he plays so weird that it creates all kinds of new situations and thought patterns. Some of the things that he does are just so exploitable, but people are afraid to do the right thing against him. For example, Durrr will often bluff when it's clear that the opponent has a very very strong hand; people think "he knows I have a strong hand, so he must not be bluffing" and fold, when really they need to just open their game way up against him. Similarly Durrr rarely folds to 3-bets. The correct counter play to that is to 3-bet a lot of good hands, and then be very happy to play big pots with anything decent post flop. The problem is people are so afraid of variance and big pots that they try to pot control and wind up giving Durrr the pot.

03-12-10 | Friday Linkage

YouTube - SeatofEmpire's Channel - cool documentary about Seattle as the manifestation of American capitalist-military exploitation. This is just previews but they are fun and well made.

YouTube - Heel & Toe - best video I've seen of heel & toe technique. For racing, through a corner you want to go from braking hard directly into gassing hard with no coast time in between. It's actually much more important to do this in a turbo car or any car with a high narrow power band - in that case you need to actually shift while you go through the corner so that you keep the car in the power band, you can't wait and shift right before you power out at the apex because the RPM will have dipped too low.

YouTube - minutegongcoughs's Channel - good old music

Wheel Alignment A Short Course
Caster, Camber, Toe
The new 911 has some weird shit going on with the alignment to keep the tail from sliding out. I had no idea what camber or toe were or how they affected vehicle dynamics; crazy interesting stuff. Some people who race 911's make the alignment more neutral so that they oversteer more.

John Sizemore - Weird_Weird_Science on Dailymotion - awesome awesome videos of extreme zooms on materials all the way down to the atomic scale. How do they do this?

Alan Turing - Wikipedia, the free encyclopedia
Turing biography
Of course we all know Turing from the Turing Machine and the Turing Test and his thoughts on AI and Computability, but he also did work on chemical pattern formation, abstract mathematics, and code breaking. I had no idea about his personal life story, though, which is quite shocking.

NOVA Sputnik Declassified A Tainted Legacy PBS - speaking of scientist biographies

Popular Science - Google Books Old Popular Science magazine is fucking awesome. Just the other day I was oiling up my bicycle and was wondering why we don't oil up cars the same way (oiling all the hinges and gears and such) - well of course in the past they did!

Pelican Technical Bulletin All About Motor Oil
A little bit archaic but pretty interesting.

Brian Beckman's Physics of Racing Series - really this is "Physics of Driving" ; it's mainly a review of the basic forces involved in cars. Pretty good stuff. I guess BB is at MS Research and may have contributed to "Forza 2" , though if it's anything like the typical MS Research video game "contribution" it means he sent them a link to his articles and they ignored it. (BTW just found this nice series of videos on basic car physics made by a dry narrator in a Seven).

Aircrew Survival Equipmentman 2 - Aviation theories and other practices - what to do when your plane goes down. Includes how to maintain or fabricate your parachute.

03-12-10 | Hot Douche Gasses

Aftermarket exhaust products for cars really make me angry.

The car company engineers do a really amazing job of making modern cars high performance and also low pollution. Modern catalytic converters and variable valve timing and computer controlled feedback loops with air sensors are real technological marvels. And yet some redneck with a blow torch thinks he can improve it by cutting some holes or putting on fatter pipes.

First of all, "freer flowing" or "reduced backpressure" is rarely a good thing to do to your car. Putting on fatter pipes rarely helps performance much, and if it does help it probably helps only at high RPM at the cost of performance at low RPM. The reasons are rather complicated, but basically the combustion chamber is designed to have just the right amount of exhaust exit flow. What that should be exactly depends on the car and the RPM, but typically you want the flow of exhaust to create a vacuum to help evict the cylinder after combustion; some engines also want the exhaust to stop moving and create a sort of wall as the exhaust valve closes and new air and fuel comes in. Changing the diameter of pipes can screw up these pressures, and the result is often torque loss at low RPM. At high RPM it's a simpler situation because the cylinder is exploding so many times and making so much gas you just want to get it out as quickly as possible.

( see for example 1 or 2 )

The latest trend in car mods is to do a full or partial bypass, just cutting a hole before the cats that dumps exhaust straight out the back of the car, sometimes on a switch, sometimes not. Even if it did help, it would still be dickish to ignore your fucking up the air quality just because you need 5 more horsepower. Some of the modders are just confused because cats actually did use to hurt performance back in 1950 or whenever they first came out, but modern engines and cats work marvelously (very large high-revving engines like race cars can still see a small benefit from not using them).

But worst of all it's just super douchey to make your car louder. I personally love to hear my engine, and actually I wouldn't mind it being a bit louder, but I want it loud just for me in the cabin, not for the whole rest of the world (and I'd like it on a switch too, so I can have quiet mode and loud mode). People who amp their exhaust are obviously doing it intentionally to be obnoxious. Hey pedestrians walking on the road, listen to this! Yeah fuck you for trying to have a conversation, look over here at my great fucking car! My penis might be tiny but my exhaust is huge, oh yeah!

In other douchey car news, my car has xenon head lights and I hate it. I'm kind of tempted to "smoke" the headlights just to reduce their brightness a bit. I can see people in front of me get annoyed when I pull up behind them. Of course I'm quite low so my sin is only one tenth the sin of a high-carriage car with xenons, but I still hate to be that guy. (In other "that guy" news, I used the carpool lane through Bellevue for the first time. I have no regrets).

ADDENDUM : another awesome example is what people do for air intakes on the 911. There are a few varieties of stupidity here.

One popular mod is like Fabspeed smooth red tube . The OEM air intake tube has a few very clever features; for one it is baffled to reduce noise and drag (contrary to Fab's claims, rough surfaces actually make less air resistance than smooth in many cases; see eg. the golf ball; the rough surface will create turbulence while the smooth surface will build up a big sticky boundary layer; of course the reality and details are incredibly complex and it's hard to say what's better).

But the main feature is the little tube that comes off the OEM intake. That little tube is a Helmholtz Resonator, which vibrates at a certain frequency as air passes over it (just like blowing over the mouth of a bottle). It's tuned so that that resonance cancels out the vibration resonance of the air coming in at low RPM. At high RPM they go out of tune and no longer cancel. The result is the car is nice and quiet at low RPM and then howls at high RPM. (see here for example)

The main thing that the air intake mods do is just break those clever features. Of course they could get the same effect by just putting a piece of duct tape over the resonator tube. And hell while you're at it stick a playing card in your air intake so you get some extra noise.

To some extent the whole "cold air intake" thing is another left-over from the 50's modder days when manufacturers actually got things wrong. Back then you had carbureted cars taking big gulps of hot air from inside the engine bay. If you just stuck on a tube that ran the air intake out to the outside of the car, it would let it breathe cold air, which increases engine efficiency (cold air is denser in oxygen mainly). Modern cars all route cold air to the air intake very efficiently, either using their own tubes or simply through controlling the air flow patterns when the car moves. Modern cars are not designed to breathe correctly when they are sitting still (eg. on a dyno) which is part of how these nonsense tuners can claim gains on the dyno (the main way is just faking the results, eg. running the "before" test on a cold engine and the "after" when it's all warmed up).

Anyway, it's a cheap way to get a little sound, which is what buyers really want. The funny thing is they can't admit that, everyone has to pretend it's about performance. Both buyers and sellers happily go along with the fraud, sellers post fake dynos and buyers claim they "feel a difference". (kind of like how ballet is really about seeing some hot half-naked people, but everyone has to pretend it's about the story).

Another popular mod is a BMC or K&N cotton oiled air filter. These filters do seem to in fact let in a tiny bit more air (maybe 1% more) which might increase power, but they also let in more particles, which might hurt engine life. It's almost a tautology that more air flow = less filtration, unless you actually increase the surface area of the filter. (see here for example). There's also some indication that those filters let in more air when clean, but less air when dirty (compared to a standard OEM paper filter). Many ricers want to use BMC or K&N filters because "racers use them". It's a completely different situation for a few reasons - 1. race engines have a lifetime of less than 1000 hours, so longevity isn't really a big concern, 2. they actually need as much air as they can get, most road engines are not air limited (or rather, they are, but the problem is not availability of air at the intake, it's driving it into the cylinders, and the solution to that is forced induction, not filters), and perhaps most importantly 3. the racers are sponsored by those filter companies.

03-10-10 | Distortion Measure

What are the things we might put in an ideal distortion measure? This is going to be rather stream-of-consciousness rambling, so beware. Our goal is to make output that "looks like" the input, and also that just looks "good". Most of what I talk about will assume that you are running "D" on something like 4x4 or 8x8 blocks and comparing it to "D" on other blocks, but of course it could be run on a gaussian windowed patch, just some way of localizing distortion on a region.

I'm going to ignore the more "macroscopic" issues of which frame is more important than another frame, or even which object within a frame is more important - those are very important issues I'm sure, but they can be added on later, and are beyond the scope of current research anyway. I want to talk about the microscopic local distortion rating D. The key thing is that the numerical value of D assigns a way to "score" one distortion against another. This not only lets you choose the way your error looks on a given block (choosing the one with lowest score obviously), it also determines how your bits are allocated around the frame in an R/D framework (bits will go to places that D says are more important).

It should be intuitively obvious that just using D = SSD or SAD is very crude and badly broken. One pixel step of numerical error clearly has very different importance depending on where it is. How might we do better ?

1. Value Error. Obviously the plain old metric of "output value - input value" is useful even just as a sanity check and regularizer ; it's the background distortion metric that you will then add your other biasing factors to. All other things being equal you do want output pixels to exactly match input pixels. But even here there's a funny issue of what measure you use. Probably something in the L-N norms, (L1 = SAD, L2 = SSD). The standard old world metric is L2, because if you optimize for D = L2, then you will minimize your MSE and maximize your PSNR, which is the goal of old literature.

The L-N norms behave differently in the way they rate one error vs another. The higher N is, the more importance it puts on the largest error. eg. L-infinity only cares about the largest error. L-2 cares more about big errors than small ones. That is, under L2 reducing an error of 100 to 99 is a bigger win than reducing an error of 1 to 0. Obviously you could also do hybrid things like use L1 and then add a penalty term for the largest error if you think minimizing the maximum error is important. I believe that *where* the error occurs is more important than what its value is, as we will discuss later.
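To make the 100->99 vs. 1->0 comparison concrete, here is a tiny Python sketch. It assumes the per-pixel cost is simply |e|^N and looks at how much that cost drops when a single error shrinks by one step. The name `marginal_gain` is made up for illustration; it is not from any codec.

```python
def marginal_gain(error_before, error_after, n):
    """Drop in the per-pixel cost |e|^n when a single error shrinks."""
    return abs(error_before) ** n - abs(error_after) ** n

for n in (1, 2, 4):
    big = marginal_gain(100, 99, n)   # shave one step off a large error
    small = marginal_gain(1, 0, n)    # remove a small error entirely
    print(f"N={n}: 100->99 gains {big}, 1->0 gains {small}")
```

With N=1 the two fixes are worth exactly the same; with N=2 the 100->99 fix is worth 199 against 1, and the gap keeps growing with N.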

2. DC Preservation. Changes in DC are very noticeable. Particularly in video, the eye is usually tracking mainly one or two foreground objects; what that means is that most of the frame we are only seeing with our secondary vision (I don't know a good term for this, it's not exactly peripheral vision since it's right in front of you, but it's not what your brain is focused on, so you see it at way lower detail). All this stuff that we see with secondary vision we are only seeing the gross properties of it, and one of those is the DC. Another issue is that if a bunch of blocks in the source have the same DC, and you change one of them in the output, that is sorely noticeable.

I'm not sure if it's most important to preserve the median or the mean or what exactly. Usually people preserve the mean, but there are certainly cases where that can screw you up. eg. if you have a big field of {80} with a single pixel spike on it, you want to preserve that background {80} everywhere no matter what the spike does in the output. eg. {80,80,255,80,80} -> {80,80,240,80,80} is better than making it go -> {83,83,240,83,83} even though the latter has better mean preservation.
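The spike example is easy to check numerically. This is just a quick Python sketch using the standard `statistics` module; the variable names are mine.

```python
from statistics import mean, median

source = [80, 80, 255, 80, 80]
keep_background = [80, 80, 240, 80, 80]   # spike clipped, background untouched
keep_mean = [83, 83, 240, 83, 83]         # background nudged up to repair the mean

for name, block in (("keep_background", keep_background), ("keep_mean", keep_mean)):
    mean_error = abs(mean(block) - mean(source))
    median_error = abs(median(block) - median(source))
    print(f"{name}: mean error {mean_error:.1f}, median error {median_error}")
```

The second candidate wins on mean error (0.6 vs 3) but moves the median from 80 to 83, which is exactly the background shift you don't want. A median-based DC term would rank the two the other way around.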

3. Edge Preservation. Hard edges, especially long straight lines or smooth curves, are very visible to humans and any artifact in them stands out. The importance of edges varies though; it has something to do with the length of the edge (longer edges are more major visual features) and with the contrast range of the region around the edge : eg. an edge that separates two very smooth sections is super visible, but an edge that's one of many in a bunch of detail is less important (preserving the detail there is important, but the exact shape of the edge is not). eg. a patch of grass or leaves might have tons of edges, but their exact shape is not crucial. An image of hardwood floor has tons of long straight parallel edges and preserving those exactly is very important. The border between objects is typically very important.

Obviously there's the issue of keeping the edges that were in the original and also the issue of not making new edges that weren't in the original. eg. introducing edges at block boundaries or from ringing artifacts or whatever. As with edge preservation, the badness of these introduced edges depends on the neighborhood - it's much worse to make them in a smooth patch than one that's already noisy. (in fact in a noisy patch, ringing artifacts are sort of what you want, which is why JPEG can look better than naive wavelet coders on noisy data).

4. Smooth -> Smooth (and Flat -> Flat). Changing smooth input to not smooth is very bad. Old coders failed hard on this by making block boundaries. Most new coders now handle this easily inherently either because they are wavelet or use unblocking or something. There are still some tricky cases though, such as if you have a smooth ramp with a bit of gaussian noise speckle added to it. Visually the eye still sees this as "smooth ramp" (in fact if you squint your eyes the noise speckle goes away completely). It's very important for the output to preserve this underlying smooth ramp; many good modern coders see the noise speckle as "detail" that should be preserved and wind up screwing up the smooth ramp.

5. Detail/Energy Preservation. The eye is very sensitive to whether a region is "noisy" or "detailed", much more so than exactly what that detail is. Some of the JPEG style "threshold of visibility" stuff is misleading because it makes you think the eye is not sensitive to high frequency shapes - true, but you do see that there's "something" there. The usual solution to this is to try to preserve the amount of high frequency energy in a block.

There are various sub-cases of this. There's true noise (or real life video that's very similar to true noise) in which case the exact pixel values don't matter much at all as long as the frequency spectrum and distribution of the noise is reproduced. There's detail that is pretty close to noise, like tree leaves, grass, water, where again the exact pixels are not very important as long as the character of the source is preserved. Then there's "false noise" ; things like human hair or burlap or bricks can look a lot like noise to naive analysis metrics, but are in fact patterned texture in which case messing up the pattern is very visible.

There are two issues here - obviously there's trying to match the source, but there's also the issue of matching your neighbors. If you have a bunch of neighboring source blocks with a certain amount of energy, you want to reproduce that same patch in the output - you don't want to have single blocks with very different energy, because they will stand out. Block energy is almost like DC level in this way.
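Here's a rough Python sketch of why plain SSD fights against energy preservation. It assumes "energy" means the sum of squared deviations from the block mean (the AC energy), which is only one of several possible definitions, and it uses a made-up 8-pixel block.

```python
def ac_energy(block):
    """Sum of squared deviations from the block mean (assumed energy measure)."""
    m = sum(block) / len(block)
    return sum((x - m) ** 2 for x in block)

def ssd(a, b):
    """Plain sum of squared differences between two blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

source = [90, 110, 95, 105, 100, 100, 92, 108]    # noisy texture around 100
blurred = [100] * 8                               # all detail wiped out
renoised = [108, 92, 105, 95, 100, 100, 110, 90]  # same values, different pixels

for name, out in (("blurred", blurred), ("renoised", renoised)):
    print(f"{name}: SSD {ssd(source, out)}, "
          f"energy {ac_energy(out)} (source {ac_energy(source)})")
```

SSD rates the flat block as much closer to the source (378 vs 1496), even though the re-noised block has exactly the source's energy and would probably look more like it on true noise. On "false noise" like hair or bricks the same trick would wreck the pattern, so an energy term can't be the whole story.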

6. Dynamic range / sdev Preservation. Of course related to previous metrics, but you can definitely see when the dynamic range of a region changes. On an edge it's very easy to see if a high contrast edge becomes lower contrast. Also in noise/detail areas the main things you notice are the DC, the amount of noise, and the range of the noise. One reason it's so visible is because of optical fusion and effects on DC brightness. That is, if you remove the bright specks from a background it makes the whole region look darker. Because of gamma correction, {80,120} is not the same brightness as {100,100}. Now theoretically you could do gamma-corrected DC preservation checks, but there are difficulties in trying to be gamma correct in your error metrics since the gamma remapping sort of does what you want in terms of making changes of dark values relatively more important; maybe you could do gamma-correct DC preservation and then scale it back using gamma to correct for that.
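The {80,120} vs {100,100} claim can be sanity checked with a few lines of Python. This sketch assumes a plain power-law gamma of 2.2 (not the exact sRGB curve) and assumes optical fusion averages light in linear space; the function names are made up.

```python
GAMMA = 2.2  # assumed simple power-law display gamma, not the exact sRGB curve

def to_linear(code_value):
    """Map an 8-bit code value to linear light in [0,1]."""
    return (code_value / 255.0) ** GAMMA

def to_code(linear):
    """Map linear light in [0,1] back to an 8-bit-scale code value."""
    return 255.0 * linear ** (1.0 / GAMMA)

def fused_brightness(pixels):
    """Code value of a flat patch emitting the same average linear light."""
    avg_linear = sum(to_linear(p) for p in pixels) / len(pixels)
    return to_code(avg_linear)

speckled = fused_brightness([80, 120])
flat = fused_brightness([100, 100])
print(f"speckled fuses to {speckled:.1f}, flat fuses to {flat:.1f}")
```

Under these assumptions the speckled pair fuses to something a bit brighter than the flat one (roughly 102 vs 100), so flattening the specks to their code-value mean darkens the region slightly, which matches the intuition above.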

It's unclear to me whether the important thing is the absolute [low,high] range, or the statistical width [mean-sdev,mean+sdev]. Another option would be to sort the values from lowest to highest and look at the distribution; the middle is the median, then you have the low and high tails on each side; you sort of want to preserve the shape of that distribution. For example the input might have the high values in a kind of gaussian falloff tail with most values near median and fewer as it gets higher; then the output should have a similar distribution, but exactly matching the high value is not important. The same block might have all of its low values at exactly 0 ; in that case the output should also have those values at exactly 0.

Whatever all the final factors are, you are left with how to scale them and combine them. There are two issues on scaling : power and coefficient. Basically you're going to combine the sub-distortions something like this :

D = Sum_n { Cn * Dn^Pn }

Dn = distortion sub type n
Cn = coefficient n
Pn = power n

The power Pn lets you change the units that Dn are measured in; it lets you change how large values of Dn contribute vs. how small values contribute. The coefficient Cn obviously just overall scales the importance of each Dn vs. the other D's.
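As code, the combination is a one-liner. This Python sketch just evaluates D = Sum_n { Cn * Dn^Pn } for lists of sub-distortions; the numbers in the example are placeholders, since picking real Cn and Pn is the open problem.

```python
def combined_distortion(sub_distortions, coefficients, powers):
    """Evaluate D = sum over n of Cn * Dn**Pn."""
    return sum(c * d ** p
               for d, c, p in zip(sub_distortions, coefficients, powers))

# Hypothetical example: a value-error term and a DC-change term.
dn = [4.0, 2.0]   # placeholder sub-distortions
cn = [1.0, 0.5]   # placeholder coefficients
pn = [1.0, 2.0]   # placeholder powers
print(combined_distortion(dn, cn, pn))   # 1.0*4.0 + 0.5*2.0**2
```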

It's actually not that hard to come up with a good set of candidate distortion terms like I did above, the problem is once you have them (the various Dn) - what are the Cn and Pn to combine them?

03-08-10 | Distortion and Bit Allocation

I now know that rate allocation is by far the most important thing in video. It's obviously important in a lot of things, but in video you just have so many bits and so much flexibility in where you put them, and there are lots of psychovisual phenomena that don't exist in images (due to motion, eye adaptation, feature tracking, etc., because the eye notices changes over time). In fact I conjecture that you could take a really shitty old coder like MPEG2 and make videos that beat anything currently in existence with just better rate allocation.

What can rate allocation do ?

1. Move bits to the source of predictions. That is, code some frame (or part of a frame) better than normal because it will be used as a mocomp source in the future. This is actually a purely mathematical win and would apply without any psychovisual consideration. A lot of people do this in semi-heuristic ways, but of course those can make lots of mistakes (for example there may be cases where increasing the bit assignment to a block might actually make it a worse source for the future, eg. the future might be a better match to the block with more distortion; also starving the future might cause it to no longer choose that block as a source, etc). Some people move bits around while holding all the block mode decisions and movecs constant, which at least lets you converge, but of course you should consider all possible bit moves and all possible mode changes.

2. Move bits from frame to frame to make some frames look better and some look worse. Move bits around the frame to make parts look better and parts look worse. In general choosing where to put your error.

There's also a related issue which is not exactly rate allocation but is very similar. In lossy coders like video coders you often have a choice of what your error looks like. That is, for the same distortion (in a numerical sense) you could make different shapes of error, through choosing different block modes, choosing different movecs, or more globally choosing quantizers or quantization matrices. This often ties into rate allocation because it involves how you make your free choices in the encoder :

3. What the distortion looks like. In particular, if you make some amount of error (in an SSD or SAD sense (aka L2 or L1 norm)) what does that error look like? what is the shape of it?

Now, in a lagrangian framework the main thing driving all these decisions is just the D metric in J = R + lambda D. If you change D, it changes where bits get put. D determines how important you think one type of error is vs. another type of error.

Just as an example, say you ran face detection on your video, then you could assign face regions to all your frames, and any error in the face region could be counted as extra important - if you put this into your "D" metric, then the lagrangian coder automatically gives those areas more bits. But that kind of example is rather banal. There are obviously tons of human error-importance issues that you could try to account for, having to do with what objects are most important in the frame, where the motion is, what kind of errors are particularly appalling, etc etc.
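A toy Python sketch of the face-region idea, using the J = R + lambda D form from above. Everything here is hypothetical: the candidate encodings, their rates and distortions, and the `importance` multiplier standing in for "this block is in a face region".

```python
def lagrangian_cost(rate, distortion, lam, importance=1.0):
    """J = R + lambda * D, with D scaled by a region importance weight."""
    return rate + lam * importance * distortion

def pick_best(candidates, lam, importance=1.0):
    """Choose the candidate encoding with the lowest J."""
    return min(candidates,
               key=lambda c: lagrangian_cost(c["rate"], c["dist"], lam, importance))

# Made-up encodings of one block: more bits buy less distortion.
candidates = [
    {"name": "cheap", "rate": 10, "dist": 40},
    {"name": "medium", "rate": 25, "dist": 15},
    {"name": "expensive", "rate": 60, "dist": 4},
]

background = pick_best(candidates, lam=1.0)               # ordinary block
face = pick_best(candidates, lam=1.0, importance=4.0)     # face-region block
print(background["name"], face["name"])
```

With these numbers the ordinary block settles on the medium encoding, while the same block flagged as a face picks the expensive one. Nothing in the coder changed except D, which is the whole point.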

Purely numerical error distribution can be important : say you have an error of 3 somewhere and an error of 20 somewhere else. You have bits to change each by 1. Should you change the 3 to 2 or the 20 to 19 ? Well, it depends on their neighborhoods, but I think more often than not you should do the 3->2. That will be more visually noticeable. Using L1 or L2 (or L-N for whatever other N's) causes you to make different decisions in these cases. Most simplistically you can see it as a continuum between minimizing the total abs error (L1) vs minimizing the maximum error (L-infinity). That is, the issue of whether you have clumpy error or spread out error is a pretty big one.
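The clumpy vs. spread-out issue shows up clearly if you hold L1 fixed and look at the other norms. A small Python sketch with two made-up 16-pixel error patterns:

```python
def l1(errors):
    """Total absolute error (SAD)."""
    return sum(abs(e) for e in errors)

def l2(errors):
    """Sum of squared errors (SSD)."""
    return sum(e * e for e in errors)

def linf(errors):
    """Largest single error."""
    return max(abs(e) for e in errors)

spread = [4] * 16           # a small error on every pixel
clumpy = [64] + [0] * 15    # the same total error dumped on one pixel

for name, errors in (("spread", spread), ("clumpy", clumpy)):
    print(f"{name}: L1 {l1(errors)}, L2 {l2(errors)}, Linf {linf(errors)}")
```

L1 can't tell the two apart (64 each), L2 rates the clumpy pattern 16x worse (4096 vs 256), and L-infinity sees 64 vs 4. Which of those matches what a human sees depends on the neighborhood, as noted above.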

The thing holding back development is a lack of a procedure for measuring real "quality". The problem is changing distortion to change your bit allocation for psychovisual purposes will by definition hurt your abstract measures. (hacky changes to D might hurt RMSE but help SSIM, but in that case I would say some of the change was not "psychovisual" - the part of the change which helps SSIM is in fact an analytical change to improve a certain metric). At some point you have to be able to make a decision that you will allocate bits in such a way that your video will look worse to computers, but will look better to humans. (with our current shitty computer analysis models).

x264 and others have a bit of a solution for this - they use a kind of "crowd sourcing" (bleck web 2.0 buzz word, I feel like I just vomited in my own mouth a little). They can put beta features in their code and they have mobs of fan-boys who will download betas and try them on lots of videos and then post results on the forums. This gives you lots of real human eyes saying "this looks better" or not for attempts at psychovisual. But I don't think you can really make big developments using that technique - you can only make small heuristic stabs in the dark and then find out if they were okay, because the turnaround time for results from the crowd is too long, and if you release too many dead ends for them to test they will stop doing it, so you have to be reasonably sure it is a good change before publishing it to the crowd, etc. It's not the kind of thing a researcher needs, which is a black box where I can throw videos and say "which looks better to a human".

The result is that we are mostly stabbing in the dark and occasionally getting lucky.

03-08-10 | Unrighteous Indignation

One of the things that makes me angriest in the world is people who think they're saintly and have the rules and righteousness behind them, but are just fucking wrong. There are a few varieties of this, one is just people who are purely incorrect and have the facts wrong, the other is people who choose to dickishly try to enforce the strict letter of the law on one behavior (that they don't like), but hypocritically are slack about it in other ways (that they do like), and really even if they are in fact always perfect rule abiders themselves, to not be flexible about when a rule is actually important or not is fucking dickish.

Riding a bike in the city you run into this kind of thing constantly. Every pedestrian and driver thinks they are an expert about bicycle traffic laws and they are eager to inform you with their incorrect and dickish "knowledge". It's a very Seattle/Scandinavian way of being an ignorant dick - to cloak your sour misanthropy in condescension and enforcement of rules. (I'd much rather be yelled at by a New York stereotype who says "hey! wassa matta you, I'm walkin' here!" or whatever)

About a week ago we were riding along. We stopped at a stop sign and waited for the cars to pass. On the other side of the road I noticed a woman standing at the corner talking on her cell phone, just sort of ambling, certainly not making a move to cross. So we take off and ride through the intersection. As we're passing, the woman says "you know pedestrians still have the right of way".

Eh... as usual it took me a minute to register; I thought she was just talking on her cell phone call, but about half way down the block I realized it was directed at us. It's such badly placed hate; if you want to yell out hate at cyclists you should at least pick a time when the cyclists are doing something remotely wrong, there are plenty of dick cyclists out there, I'm sure you won't have to wait long to take out your sour vile rotten soul on them. And, you know, learn to fucking be a pedestrian, and get off your damn cell phone too.

There are frequent difficulties riding down the road. Many cars seem to think that cyclists are legally obligated to make way, or to ride as far to the right as possible. This is not correct. Cyclists are legally required to ride as far to the right as is *safe* for them to do so, except when turning. Generally to be safe, you should ride about three feet to the left of a parked car. People in parked cars do not watch out as they should, and as a cyclist you need to pre-reserve your swerve space, since you can not know when you will need it.

A large amount of the problem with this is that drivers are just so fucking stupid about what rules are important to follow exactly and which are okay to be flexible on. Often I will be riding along and some car will come up behind me and just stay behind me - even though the other side of the road is completely empty and all he has to do is pull out a little bit and go around. People in most of the world have no problem with this and do it constantly, but here in the US there is this bizarre unwillingness to cross the yellow line; oh god forbid I pull out across the yellow line a little bit, it is holy and inviolable. Hell, a lot of the times it's actually a road with two or more lanes going the same direction, and the driver won't even change into the left lane. Nah, I want to be in the right lane, fuck this fucking cyclist cock blocking me, I couldn't possibly be bothered to just move to the left lane. Often roads have a wide turn lane down the middle that no one is using ; hello, the fucking turn lane is an ideal way to pass safely around cyclists, you only need to go about halfway out into it, people going the other direction can also come out halfway into it, everyone is happy, fucking don't be a retard.

The other day N and I were riding along on a narrow busy multi-lane street which is nevertheless a "bike corridor" here in hated bad-street Seattle. This street has parked cars, narrow lanes, and basically no space for bikes, so the only safe way to ride is to take a lane, which is not a big deal for sane people because cars have another lane going the same direction to just go around you. So since we're taking the lane we ride side by side, which makes it clearer that we are just taking the lane. (one of the most important aspects of riding safely as a cyclist is to make it very clear to cars what you are doing - you should not make timid or sudden moves, when you are taking a lane or turning left, it's good to clearly telegraph your intentions, then take the whole lane, or pull out all the way to the left; if someone is opening their door and you have to swerve around, swerve way around so you're more visible, etc.). Anyway, some self-righteous ignorant cock of a driver chose this moment to pull over and roll down his window and yell at us about how it was illegal to ride two abreast. Umm, first of all, no it's not, it's illegal to ride *more* than two abreast. Second of all, your situational awareness and ideas about what rules are important and when is completely fucked; even *if* that was the law, it would be a retarded time to yell at us, since we couldn't really get out of the way anyway even if we rode single file (and if you tried to get in a tight lane with us riding single file you would be a dangerous cock).

Anyway this all happened a while ago and I wasn't going to even write about it because it just makes me sad how fucking stupid and mean people are. It's not just that they're stupid, it's that they are almost intentionally stupid in a self-righteous selfish way; like they choose willfully to not actually know the law, or to not think about the other person's risks and rewards in a certain situation, they only want to follow some dumb rule without thinking situationally, and they want to be fucking right and lord their rightness over others. But I was researching again (in vain) to find a lawyer that can handle Oregon speeding tickets, and I stumbled across so many of this type of comment in web forums and blogs :

"You will lose your license for a while. Hopefully"

"You could lose your license for that and I hope you do."

"HOw do you figure going 102-106 mph isn’t reckless? Are you trying to kill somebody or are you just too immature to deserve a license?"

etc. you see this kind of thing posted all over the internet - people being holier than thou rule-touting fuckers. Do you all have no concept of what is actually dangerous? Going 100 on a freeway is really not dangerous at all (assuming low traffic, and assuming your car is in good kip - you check your tires often, have good brakes, and you have a car that is stable and maneuverable at speed). I dunno what freeways these people are driving on, but the freeways I drive on don't have any pedestrians, or hard turns, or oncoming traffic or parked cars. Going just the speed limit is way more dangerous if you are talking on your cell phone and drinking your big gulp. Going at the speed limit in a busy cyclist/pedestrian area is way more dangerous.

In related news, the Washington State House failed to pass a proper tough cell phone law. Talking on a cell phone is currently a secondary offense which basically means there is no law at all. Earlier the Senate passed a better version that at least makes it a primary offense, but IMO is actually still not tough enough as it allows hands-free talking and it still allows dialing by hand. (for those that don't know, driving while talking on a hands-free is roughly equivalent to driving while exhausted or driving drunk in terms of the effect on reaction time, braking and obstacle avoidance).

While I'm on the subject of car safety, I'll repeat my call for all interactive car computers to be banned, as well as Xenon headlights (damn blinding shit), also window tint (dumb fuckers can't see at night). It should also be illegal for children to bother their mother while she's driving; people need to pay damn attention to the road. I also had an idea on cell phones the other day after yet another incident where some dumb cell-phone talking pedestrian tried to walk in front of my car : cell phones are just dangerous in any kind of use with movement, not just driving. Maybe they should just put speed sensors inside all cell-phones and put them on a hard switch to turn off when they are in motion, so you have to stop and stand still (or pull over your car) to talk. Oh, and headlights and bumpers should be mandated to all be at the same height. Like fine if you want to drive a retarded SUV, go for it, but your headlights and bumpers are still going to be one foot above the pavement, not aimed directly at my brain.

I also randomly stumbled across this Bizarre WA Supreme Court Moving Violation Ruling in which they seem to basically rule that because the trooper did not see the defendant commit the crime, he couldn't issue a moving violation. What !? So if someone runs their car into a lamp post and tests positive for alcohol, you can't give them a DUI because you didn't actually see them driving? This seems to set a completely retarded precedent. Says the Supreme Court :

"Negligent driving in the second degree is a moving violation. 
For the infraction to be valid, the movement must have been made in the officer's presence."

more story here or WA supreme court blog

03-05-10 | Identity Vacations

I had this idea ages ago with N, but Won just reminded me :

I think 99% of people are horrible at taking vacations if their goal is happiness. They stress themselves out way too much and also fail to push themselves personally. eg. running around to see a bunch of tourist attractions is both stressful and not rewarding in any way. Also, sitting on a beach somewhere nice might feel relaxing at the time, but provides no long term happiness at all (though that's one case where the anticipation of it might really be the happiness - eg. people who live in the cold and take an annual winter beach holiday, the actual vacation does nothing for you really, but all those months in the cold you can look forward to it).

Anyway, if you just want a good normal vacation, you should both stress less and do more. If you're not retarded, that's quite easy, you just don't worry too much about planning, give yourself plenty of time so you're not in a rush, but when you get there actually do the interesting activities, don't just sit around and be lazy or be a pussy and be too afraid to try anything risky.

The longest term reward you can get from vacation is to push yourself personally in some way, eg. to learn something, or do something that's outside of your normal character, have experiences that change you significantly in a way that will stick with you for a long time. eg. take a vacation boating some islands and live aboard a ship with a teacher and learn to sail, or stay on a commune and farm and force yourself to be non-judgemental of the dirty hippies, or take a vacation where you are an S&M "gimp" in Berlin to expand your kink horizons, etc.

Anyhoo, I think the next trend in boutique vacations is "identity vacations" where you pay some package to travel somewhere and be some new character. They give you the right clothes, the right place to stay, and some accessories and a guide to integrate you into that lifestyle. So you might take a "I'm gay in San Francisco" vacation or "I sell jerk chicken and make dancehall music in Jamaica" or "I run a bar on the beach in Mexico" or "I'm a fixster in Brooklyn" or whatever.

03-05-10 | India Rails

We've been playing India Rails a bit recently (one of the Empire Builder train games). It's okay, but I think it's broken in a few ways which is a bit surprising since it is a much lauded and tweaked series, and India is supposedly the best of the rail series.

The win conditions seem out of kilter. There are two win conditions - connect all the major cities AND have 250 million in cash. The weird thing is these seem to be way way different in difficulty. Connecting the cities is pretty easy, in fact perhaps too easy, the game can be pretty short if you get lucky cards early, but getting 250 million in cash takes a long time and becomes a boring grind. It seems like a better win condition would be something like connecting a major city is worth 40 million in victory points, cash on hand is worth victory points, try to reach 300 million victory points.

Perhaps most broken is that there's just almost no interaction between the players. It seems to be one of a classic type of bad board game which you basically play in solitaire, and it's just a race to see which solitaire player wins first. That is a horrible horrible type of board game. I also don't see any real deep strategy, yes there is planning, but I can't imagine ever playing against someone and being amused by the clever trick they use against me. (basically the only interaction with other players I see is running track to block them ; yes there are optional rule additions that add a bit more interaction, such as open contracts and leaving dropped cargo in towns, but even that is pretty minimal).

I've had a hankering to play good board games recently and have been pretty disappointed. Carcassonne is decent and actually has some interesting play (almost all the interaction and complex strategy is about farms), but it's just too simple; I think it's a great kids game. Settlers is a really mediocre game; almost all the real thinking is just in the placement phase, and then the actual play is pretty rote and becomes just tedious after you play a lot. I suppose Settlers is more interesting if you really play the metagame of influencing people to make them trade with you advantageously (just like Monopoly is only an interesting game if you play the metagame), but I'm not a big fan of political metagame as a form of strategy.

So many of the "sophisticated" board games that you see are just a bunch of over-complicated rules that surround a play dynamic that's not very deep. You spend hours reading the manual and figuring out how to play, and then when you actually "get it", you discover the game is trivial (by "trivial" I mean a very smart person will know the correct move in each situation).

I know I should just play Chess or Poker or Go or something real, but it seems like there should be a better middle ground of games that are "fun" and not so serious as those, but also have something actually interesting and challenging to think about during play.

03-05-10 | Consumerism Happiness

Sometimes when I'm depressed I'll go on foolish Amazon shopping sprees; I never buy frivolous things, mainly it's tools or hepa filters or shit like that, but it's things I don't really need which I imagine will somehow make my life better when I have them. It does in fact give you a small happiness boost when you get your new toy and unwrap it and play with it, but that is fleeting (and doing it too often kills it). BTW Amazon Prime is really fucking me up; I never used to shop online much, hell I never used to shop much at all, but now it's just so fast and cheap and easy I just go click click and the thing shows up. Theoretically yes it is a "good deal" (in the sense that it saves me lots of time because I don't have to worry about what qualifies for Super Saver and I get my purchase quicker) but in reality it is a horrible deal because it changes your buying pattern so that you shop a lot more. Of course Amazon knows this which is why they push it so much, and I knew it too but thought I had the self control to not be suckered like all of you fools; well no, no I don't.

Anyway - while that sort of consumer happiness is empty and unfulfilling and short term and unproductive (much like the happiness of booze or sweets) - there is a type of consumer happiness that I believe is more profound : lusting for the unattainable super-desirable product.

In this form of happiness there is some wonderful product that you can't afford; you read about it, you put posters of it on your wall, you hang out on web forums and talk about it, you follow all the new version releases. You save and save and work hard and some day you buy one. That is happy times. Not because getting the product is so great, but the anticipation, and the hard work to get it. Having some goal, some desire, and working hard to get it, and finally achieving it - that is one of the greatest true pleasures in life. I think people foolishly think that to be happy their goal needs to be something actually important, curing cancer or whatever, in fact that's not true at all - all that matters is you really want the goal and it takes hard work and a lot of time to get it.

I used to have this kind of happiness with computer parts when I was a kid, I lusted after a video toaster or that 24-bit card for my Amiga, I read about all kinds of products, I saved and eventually I got a DCTV and I was pretty delighted, but when you actually get your goal, the real fun is over and you have to find a new thing.

It's hard to have this kind of happiness anymore, because when you get old and mature you realize that all products are actually pretty fucking irrelevant to your quality of life (like, yeah, a proper HDTV would be nice and all, but I can watch TV on my old thing just fine) - so it's impossible to really get too excited about lusting after some product, and if I had to really work hard and save up to buy something, mmm meh I just wouldn't buy it.

There's a weird drawback to becoming enlightened about happiness. I now know that happiness comes only from inside myself, external circumstances are really irrelevant. You might think that's liberating, but it's actually very difficult to deal with, because you no longer incorrectly believe that external activities will somehow magically make you happy, it's hard to get excited about doing them, and yet you need to do them and you need to be excited about it, because happiness comes from your own excitement.

03-03-10 | Science on TV

The state of science on TV is absolutely shameful. I find it really inspiring and energizing to watch a bit of mild science, and of course I don't expect them to actually be rigorous, but it seems they can't even manage to be correct.

Nova has gotten mostly unwatchable. Maybe this is part of the general destruction of PBS by the Republican Machine, which includes the enforcement of "balanced" opinion, and the broadcast of corporate shill self-help bullshit like Suze Orman. In my youth I vaguely remember watching Nova and seeing real interesting science. Now it seems like every Nova is some fucking archaeologist reconstructing something from history, like building some fucking boat or a catapult or whatever. While that is vaguely amusing, it's completely non-scientific, it doesn't prove anything (archaeologists seem to be almost all quacks), and it belongs on Mythbusters not Nova. (there is a good Nova once in a while still, I recall liking the Quest for Absolute Zero okay).

ScienceNow with Neil "look at me I'm a black scientist" deGrasse Tyson is equally unwatchable. It was vaguely watchable when it had Alan Alda, mainly because he didn't pretend to know anything about science, and they seemed to cover more technology developments which they're better at.

We tried to watch "Parallel Lives, Parallel Worlds" , which is vaguely about Hugh Everett and the Many Worlds interpretation, but mostly about some whiney emo singing douchebag who can't do math. I turned this off in disgust after about 20 minutes because even in the rare moments when they actually talked about real quantum mechanics, they were not only overly simplistic, but very often just wrong. This is a really interesting topic, and you could do a really good show about the outsider quacks who challenged the foundations of quantum mechanics, you could talk about EPR, Everett, and David Deutsch is a great character too.

A common flaw in almost every science program is to over-dramatize developments and conflicts in science. The shows like to present new findings as "shaking the foundations of everything that came before" and such nonsense, which is almost never actually the case.

The most recent offender in that regard is BBC's Secret Life of Chaos. It basically consists of showing some pretty pictures and then the bald buffoon comes on and says outrageous nonsense things like "and this tore down the fabric of Newtonian physics". Umm, no, not really, no it didn't. (A decent story of reaction-diffusion and morphogenesis is here or here )

In other pseudo-science news : we've been watching BBC's nature series "Life" , "Wild China" , and now "Yellowstone" ; all amazing, but they're fucking FAKE ! They stage shots, they use studio footage and splice it into location shots, they feed predators to get kill shots, it's just all fake. You can't trust a single shot they show. It's just a bunch of pretty pictures. ( They also use filters and other adjustments; for example in the Yellowstone video they show you the classic beauty shot of Grand Prismatic Spring which has obviously been saturation-pumped ; in real life it looks like : this . )

03-03-10 | Image Compression : Color , ScieLab : Part 2

Follow up to the last post on color .

First a correction : what I said about downsampling there is mostly wrong. I made the classic amateur's blunder of testing on too small a data set and drawing conclusions from it. I'm a little embarrassed to make that mistake, but hey this is a blog not a research journal. Any expectations of rigor are unfounded. For example this is one of the test images I ran on that convinced me that downsample was bad :

-i7 qtable ; CoCg optimized joint for min SCIELAB

downsample :

   262,144 ->    32,823 =  1.001 bpb =  7.986 to 1 (per pixel)
Q : 11.0000  Co scale = Cg Scale = 1.525
bits DC : 19636|5151|3832 , bits AC : 175319|38483|19879
bits DC = 10.9% bits AC = 89.1%
bits Y = 74.3% bits CoCg = 25.7%
rmse : 7.3420 , psnr : 30.8485
ssim : 0.9134 , perc : 73.3109%
scielab rmse : 2.200

no downsample :

   262,144 ->    32,679 =  0.997 bpb =  8.021 to 1 (per pixel)
Q : 12.0000  Co scale = Cg Scale = 0.625
bits DC : 19185|13535|9817 , bits AC : 160116|39407|19091
bits DC = 16.3% bits AC = 83.7%
bits Y = 68.7% bits CoCg = 31.3%
rmse : 6.9877 , psnr : 31.2781
ssim : 0.9111 , perc : 72.9532%
scielab rmse : 1.980

you can see that downsample is just much worse in every way, including severely worse in SCIELAB which doesn't care about chroma differences as much as luma. In this particular image, there's a lot of high detail color bits, and the downsampled version looks significantly worse, it's easy to pick out visually.
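
As a sanity check on the two logs above, here is a minimal Python sketch that recomputes the derived stats from the raw counts. The function names are made up for illustration, they are not from the codec. Three things are assumptions inferred from the printed values, not stated anywhere : that 262,144 is the pixel count (512x512), that the three numbers in "bits DC" / "bits AC" are in Y|Co|Cg order, and that PSNR uses a peak of 256 rather than the more usual 255 (256 is what reproduces 30.8485 from rmse 7.3420).

```python
import math

def bits_per_pixel(num_pixels, compressed_bytes):
    """Compressed size in bits divided by the number of pixels."""
    return 8.0 * compressed_bytes / num_pixels

def psnr_db(rmse, peak=256.0):
    """PSNR in dB. peak=256 is an assumption inferred from the logged values."""
    return 20.0 * math.log10(peak / rmse)

def bit_shares(dc_bits, ac_bits):
    """dc_bits and ac_bits are (Y, Co, Cg) tuples (assumed order).

    Returns (percent of bits spent on DC, percent of bits spent on Y).
    """
    total = sum(dc_bits) + sum(ac_bits)
    dc_pct = 100.0 * sum(dc_bits) / total
    y_pct = 100.0 * (dc_bits[0] + ac_bits[0]) / total
    return dc_pct, y_pct
```

For the downsample run, bit_shares((19636, 5151, 3832), (175319, 38483, 19879)) comes out at roughly 10.9% DC and 74.3% Y, which agrees with the log.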

However, in general this is not true, and in fact downsample is often a small win.

Without further ado I present lots of stats :

A = i0 Cg=1 Co=1
B = i0 Cg=0.6 Co=0.575
C = i7 Cg=0.6 Co=0.575
D = i4/i7 opt per image
E = i7 CoCg optimized independently per image
F = i7 CoCg optimized jointly per image downsampled

        |        A        |        B        |        C        |        D        |                E                |            F
file    |    rmse scielab |    rmse scielab |    rmse scielab |    rmse scielab |      Co      Cg    rmse scielab |   Co/Cg    rmse scielab
kodim01 | 12.6809  4.8898 | 12.5848  4.8413 | 12.6567  4.3415 | 12.7018   4.238 |   0.455   0.455  12.623  4.3153 |   1.225  12.486  4.2525
kodim02 |   6.235  2.1961 |  6.1733  2.1793 |  6.2836  2.0519 |  6.2544  1.9542 |    0.58    0.58  6.2285   1.978 |  1.3375  6.4866  1.9841
kodim03 |  4.0098  1.7135 |   3.974  1.7173 |  4.0621  1.5587 |  3.9778  1.5883 |   0.705    0.83  4.0853  1.5359 |     1.6  4.1235  1.6102
kodim04 |  6.3981  2.4661 |  6.3658  2.4929 |  6.4083  2.2579 |  6.4083  2.2579 |   0.705   0.705  6.4092   2.248 |  1.5625  6.3698  2.1977
kodim05 | 14.2903  7.2293 | 14.0531  7.1756 | 14.1613  6.5253 | 14.2296   6.452 |    0.58    0.58  14.167  6.5291 |  1.5625 13.9658  6.4326
kodim06 |  8.9416  3.6338 |   8.836  3.5923 |  8.9622  3.2131 |  9.0316  3.1608 |   0.455    0.58  8.9664  3.2184 |     1.3  8.8455  3.1733
kodim07 |   5.147   2.316 |  5.1145  2.1919 |  5.2338  2.0167 |  5.2388  1.9815 |    0.58    0.58   5.202  2.0047 |   1.225  5.1601  1.9462
kodim08 | 14.6964  7.5082 | 14.5479  7.5237 | 14.5675  6.8769 | 14.6411  6.7521 |    0.58    0.83 14.5726  6.8285 |  1.4875 14.3053   6.692
kodim09 |  4.4789  1.8149 |   4.439  1.8574 |  4.5303   1.675 |  4.5303   1.675 |   0.705   0.955  4.5467  1.6359 |  1.4125  4.5389  1.6906
kodim10 |  4.9926  2.0932 |  4.9477  2.1196 |  5.0678  1.9887 |  5.0398  1.9514 |    0.58   0.955  5.0585  1.9109 |     1.6  5.0449  1.9556
kodim11 |  7.9484  3.2677 |  7.9006  3.2315 |  8.0441  2.9234 |  8.0441  2.9234 |    0.58    0.58  8.0478  2.9276 |   1.375   7.939   2.858
kodim12 |  4.6495  1.8486 |  4.6326  1.8529 |  4.7335  1.6862 |  4.7259  1.6663 |    0.58   0.705  4.7041  1.6776 |  1.2625  4.7001  1.6457
kodim13 | 18.5372  8.3568 | 18.3502  8.2634 | 18.5334  7.2841 | 18.6579  7.1262 |   0.455    0.58 18.5013  7.2697 |  1.1125  18.381  7.2327
kodim14 |  11.076  4.8628 |  10.972  4.7473 | 11.0146  4.3268 |  11.064  4.2636 |    0.58    0.58 11.0151  4.3308 |     1.3 10.9818  4.3614
kodim15 |  5.8269  2.4099 |  5.8082  2.4665 |  5.9134  2.2246 |  5.8383  2.2457 |   0.705   0.705  5.9158  2.2098 |   1.525  5.8699  2.1497
kodim16 |   5.689  2.3266 |  5.6289  2.3199 |  5.7372  2.0534 |  5.7372  2.0534 |    0.58    0.58  5.7373   2.055 |   1.375  5.6667  2.0276
kodim17 |  5.5166  2.3244 |    5.47  2.2994 |  5.6716  2.0774 |  5.5853  2.0874 |   0.455   0.705  5.6523  2.0574 |  1.4125  5.6014   2.037
kodim18 | 10.8501  4.8609 | 10.7131  4.7903 | 10.9517  4.3169 | 10.9639  4.2627 |    0.58   0.705 10.9266  4.3006 |  1.3375 10.8048  4.2189
kodim19 |  7.1545  2.8338 |  7.0872  2.8518 |  7.2311  2.4977 |  7.2637  2.4362 |    0.58   0.705  7.2158  2.4758 |  1.5625  7.1314  2.4396
kodim20 |  4.7872  1.8258 |  4.7183  1.8042 |  4.9208  1.6441 |   4.863  1.6524 |   0.455    0.83  4.9265  1.6306 |  1.1875  4.9427   1.656
kodim21 |  7.7757  3.3671 |  7.6338  3.3427 |  7.9293  3.0078 |  7.8541  3.0018 |   0.705   0.705  7.9204    2.95 |     1.3  7.7688  2.9302
kodim22 |   8.279  3.2205 |  8.1788  3.1253 |  8.3292  2.8656 |  8.3542  2.8114 |   0.455    0.58  8.3026  2.8379 |    1.45   8.267  2.8436
kodim23 |   3.917  1.5567 |  3.8968  1.5138 |   3.953  1.4315 |   3.961  1.4157 |    0.58    0.58  3.9481  1.4146 |     1.6  4.3382   1.573
kodim24 | 10.9877  5.2479 | 10.8105  5.0477 | 11.0256  4.6141 | 11.0435  4.5882 |   0.455   0.455 11.0413  4.6005 |  1.3375 10.9372   4.503
total   |  194.86   84.17 |  192.84   83.35 |  195.92   75.46 |  196.01   74.54 |                  195.71   74.94 |          194.65   74.41

explanation :

output bit rate 1 bpb in all cases
parameters are optimized to minimize E = ( 2 * SCIELAB + 1 * RMSE )
RMSE is on RGB
SCIELAB is perceptual color difference metric

i0 = flat quantization matrix
i7 = tweaked perceptual quantization matrix to minimize E
i4/i7 = optimized blend of flat to perceptual matrices
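
To make the objective concrete, a tiny Python sketch of E and of picking the setting with the lowest E. The helper names are hypothetical; only the 2:1 weighting comes from the definition above. The numbers in `runs` are the rmse / scielab pairs from the two single-image logs earlier in this post.

```python
def combined_error(rmse, scielab, w_scielab=2.0, w_rmse=1.0):
    """E = 2 * SCIELAB + 1 * RMSE, the quantity the parameters are tuned to minimize."""
    return w_scielab * scielab + w_rmse * rmse

def pick_best(results):
    """results maps a setting name to (rmse, scielab); returns the name with lowest E."""
    return min(results, key=lambda name: combined_error(*results[name]))

# (rmse, scielab) from the two logs for the single test image
runs = {
    "downsample": (7.3420, 2.200),
    "no downsample": (6.9877, 1.980),
}
```

On that one image pick_best(runs) selects "no downsample" (E of about 10.95 vs about 11.74), which is the result that misled me; the real optimizer would run something like this over a whole grid of Q and chroma scale values, not two hand-picked runs.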

The table reads roughly left to right in terms of decreasing perceptual error.  

"i0 Cg=1 Co=1" : flat q-matrix, standard lossless YCoCg transform without extra scaling

"i0 Cg=0.6 Co=0.575" ; optimize CoCg scale for E ; interestingly this also helps RMSE

"i7 Cg=0.6 Co=0.575" ; non-flat constant Q-matrix ; hurts RMSE a bit, helps SCIELAB a lot

"i4/i7 opt per image" ; per-image non-flat Q-matrix ; not a big difference

"i7 CoCg optimized independently per image" : independently optimize Co and Cg for each image

"i7 CoCg optimized jointly per image downsampled" : downsample test, CoCg optimized with Co=Cg

On the full kodak set, downsampling is a slight net win. There are a few cases (kodim03,kodim23) where it hurts a lot like I saw before, but in most cases it is a slight win or close to neutral. The conclusion is that given the speed benefit, you should downsample. However there are occasional cases where it will hurt a lot.

I think most of the results are pretty intuitive and not extremely dramatic.
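
For reference, a minimal Python sketch of what "standard lossless YCoCg" usually means : the lifting form generally known as YCoCg-R. That this is exactly the variant used in these tests is an assumption, and the function names are made up. The Co and Cg outputs are the planes that the "Co scale" / "Cg scale" values would then multiply.

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward lifting transform on integers (exactly invertible)."""
    co = r - b
    t = b + (co >> 1)   # >> on Python ints floors, also for negatives
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_r_to_rgb(y, co, cg):
    """Inverse : undo the lifting steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Because each lifting step is undone exactly, the round trip is lossless for any integer input; the cost is that Co and Cg need one more bit of range than the RGB inputs.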

It's a little non-intuitive what exactly is going on with the per-image customized chroma scales. Your first thought might be "well those images have different colors in them, so the color space scale is adapting to the color content in the image". That's not so. For one thing, more or less content of a certain color doesn't mean you need a different color space - it just means that that band of the color space will get more energy, and thus more bits. e.g. an image that has lots of "Co" component colors will simply have more energy in the Co plane - that doesn't mean scaling Co either up or down will help it.

If you think about the scaling another way it's more obvious what's going on. Scaling the color planes is equivalent to using different quantizers per plane. Optimizing the scalings is equivalent to doing an R/D optimization of the quantizer of each plane. Thus we see what the scaling is doing : it's taking bits away from hard to code planes and moving them to easier to code planes (in an R/D slope sense).
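
The equivalence is easy to check numerically. A minimal Python sketch, assuming a plain uniform rounding quantizer (the real codec quantizes DCT coefficients, but the transform is linear so the same argument applies); the function names are hypothetical. Scaling a chroma value by s and quantizing with the shared step Q gives the same index and the same reconstruction as leaving the value alone and quantizing with step Q / s.

```python
def quantize(x, step):
    """Uniform quantizer : nearest integer index for the given step size."""
    return round(x / step)

def code_with_scale(x, scale, step):
    """Scale the value, quantize with the shared step, then undo the scale on decode."""
    index = quantize(x * scale, step)
    return index, index * step / scale

def code_with_plane_step(x, plane_step):
    """No scaling; the plane simply gets its own quantizer step."""
    index = quantize(x, plane_step)
    return index, index * plane_step
```

With the values from the downsample log (Q = 11, Co scale = Cg scale = 1.525), code_with_scale(x, 1.525, 11.0) and code_with_plane_step(x, 11.0 / 1.525) agree, so a chroma scale of 1.525 acts like a chroma quantizer step of about 7.2 against a luma step of 11. (Inputs that land exactly on a rounding tie can differ by floating point noise; that doesn't change the argument.)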

In particular, when I visually inspected some of the more extreme cases (cases where the per-image optimized scales were a big win vs. a constant overall scale, such as kodim10) what I found was that the optimized scalings were taking bits *away* from the dominant colors. One very obvious case was on photos of the ocean. The ocean is mostly one color and is very hard to code (expensive in an R/D sense) because it's all choppy and random. The optimized scaling took bits away from the ocean and moved them to other colors that had more R/D payoff.

(BTW rambling a bit : I've noticed that x264 Psy VAQ tends to do the same kind of thing - it takes bits away from areas that are really noisy mess, such as water, and moves them to areas that have smooth pattern and edges. Intuitively you can guess if an area is a mess and just really hard to code then you should just say "fuck it" and starve it for bits even if MSE R/D tells you it wants bits. I think also that improving an area from an RMSE of 4 to 2 is better than improving from 10 to 7, even though it's less of a distortion win. Visually there's a big difference that occurs when an area goes from "looks good" to "looks noisy" , but not much of a difference when an area goes from "looks bad" to "looks really bad").

So this is in fact not really a surprising result. We know already that heavy R/D bit allocation can do wonders for lossy compressors. There are lots more areas to explore - optimization of every coefficient in the quantization matrix, optimization of the color transform, optimization of the transform basis functions, etc. etc. - and in each case you need to be clever about the way you encode the extra rate control side information.

ADDENDUM : I thought I should write up what I think are the useful takeaway conclusions :

1. It is crucial to do the right kind of scaling to Co/Cg (or chroma more generally) depending on whether you downsample or not. In particular the way most people just turn downsample on or off and don't compensate by scaling chroma is a mistake, ie. not a fair comparison, because their scaling will be tuned for one or the other.

2. Downsample vs. no-downsample is pretty close to neutral. If you downsample for speed, that's probably fine. There are rare cases where it does hurt a whole lot though.

3. Using a non-flat Q matrix does in fact help perceptual quality significantly. And it doesn't hurt RGB RMSE nearly as much as it helps SCIELAB (helps SCIELAB by 10.35 % , hurts RMSE by 1.58 % ).

4. It does appear acceptable to use global tweaked values all the time rather than custom tweaking to each image. Custom tweaks do give you another bit of benefit, but it's not huge, thus not worth the very slow optimization step. (see DCTune eg)

03-01-10 | File Locks

There are so many dumb fucking things in our lives that we just get used to and accept and then don't even think about any more. But they're fucking dumb and we shouldn't let people get away with them.

The thing that's bugging me today is the way exe's are locked while they are running. WTF that's retarded, the whole thing is copied into RAM, there's no reason it needs to be kept on disk any more.

Of course it's standard practice when you're developing something to use something like run_prog.bat :

copy prog.exe prog_current.exe
prog_current %*

so that you can then still write prog.exe , but come on, this is fucking stupid.

In other similar stupidity : WTF I can't delete a directory just because I have an open CMD in that dir ? Just invalidate that path and let CMD handle it.

03-01-10 | Car Pool Lane

For some reason I just can't bring myself to cheat in the carpool lane. It's some weird stupid sense of morality. I've been working hard most of my life to remove that behavior from my personality - following moral codes without good reason, or not wanting to disturb others. Certainly it's great to be moral, but you h