#define LOOP(var,count) for(int var=0;(var)<(int)(count);var++)
#define LOOPBACK(var,count) for(int var=(int)(count)-1;(var)>=0;var--)
but then I never wind up using it, and I find the code that does use it looks really ugly. Maybe if I got in the habit that ugliness would go away.
Certainly adding a "loop" keyword would've been a good idea in C99. Instead we have GCC trying to optimize out signed int wrapping, and many compilers now specifically look for the for(;;) construct and special-case handling it as if it was a loop() keyword.
In other minor news, I'm running with the "NoScript" addon now. I've used FlashBlock for a long time to much happiness, so this is just the next level. It does break 90% of web sites out there, but it has also made me shockingly aware of how many random sites are trying to run scripts that are highly questionable (mainly data-mining and ad-serving).
People sometimes ask me about laptops because I wrote about them before :
cbloom rants 05-07-10 - New Lappy
cbloom rants 04-15-10 - Laptop Part 2
cbloom rants 04-12-10 - Laptop search
cbloom rants 01-20-09 - Laptops Part 3
A small update : I'm still very happy with the Dell Precision 6500. They've updated it with some newer GPU options, so you can now get it with an ATI 7820 , which is a full DX11 part and close to as fast as you can get mobile. Other than that all my advice remains the same - get USB 3 of course, quad i7, LED backlighting, install your own SSD and RAM and do a clean windows install. Do NOT get RAID disks, and do NOT install any Intel or Dell drivers. I have no DPC latency problems or any shite like that. The only thing about it that's slightly less than perfect is the temperature / fan control logic is not quite right.
It looks like there's a problem with the NV GTX 280 / Dual chip / Powermizer stuff. See :
My Solution for Dell XPS M1530 DPC Latency
Dell, DPC Latency, and You - Direct2Dell - Direct2Dell - Dell Community
Dell Latitude DPC Latency Issues
So I recommend staying away from that whole family. The new Vaio Z looks really awesome for a small thin/light, however there is a bit of a nasty problem. They only offer it with SSD's in RAID, and they are custom non-replaceable SSD's, and it appears that the reason for the 4 SSD's in RAID is because they are using cheapo low end flash. There's lots of early reports of problems with this setup, and the fact that you have to run RAID and can't change the drives is a big problem. Also, last time I checked Windows still can't send through TRIM to RAID, so that is borked, but theoretically that will be addressed some day.
"Treme" is outrageously bad. Like so bad that it makes me lose all faith in Simon, and makes me wonder if "The Wire" actually is good or if I was part of some collective delusion. It's a mess of characters I don't really care about, each with their own uninteresting unrealistic story. It's just awful writing, and everything feels so forced, it's just screaming "look how new orleans this is" all the time.
Ghost in the Shell Stand Alone Complex (the series) is really damn good. I like it way better than the movies, for example; the movies are obviously prettier, but also pretty vapid. The series is meaty, and totally sucks you in. Especially in "2nd Gig" it really gets rolling where after each episode you just have to immediately watch the next one to see what happens next. Every once in a while there's an episode that's not part of the main story line which is supposed to flesh out one of the side characters (Pazu, Saito, Togusa) and those can be real stinkers. The episodes on the main story line are the good ones.
I rather enjoyed the "Jesse Stone" series, I'm somewhat embarrassed to admit. It's definitely very cheesy, mild, CBS fare that would please your grandma, but I found it had a really nice tone, a quietness, a really well crafted mood. There's a slowness to it, the camera hangs on scenery. In many ways it reminds me of the "Wallander" series which I also liked (though Wallander is much better). Both are basically terrible stereotypical cop stories. Oh, the cop is an alcoholic, has emotional demons, trouble with his ex-wife, he doesn't follow the rules, has trouble with the chief, has hunches, but he's great at what he does. How fucking cliche and boring. Both shows are saved by the light. The light is like another actor, a presence that pervades everything. In Wallander it's the thin, bright, sideways Swedish light. In Jesse Stone it's the clouds and fog and darkness, with shafts breaking through. The first few Jesse Stones are the good ones, the later ones are pretty weak.
Yikes, one nightmare I'd like to forget is "Gavin and Stacey". Sometimes I randomly grab something because it is high rated on Metacritic TV. Lately I have not been doing well with the British imports. "Gavin and Stacey" is like some awful modern Married with Children, where they're just crass awful people and that's supposed to be funny for some reason. "Ideal" was about some fat guy who talks with his mouth closed and smokes pot a lot and nothing funny ever happens.
I've been watching a bit of "Twin Peaks" ; I never saw it originally. It holds up surprisingly well. I think it's because the show always had a sort of weird cheesy sitcom/soap-opera gone horribly wrong vibe to it, and the dated production and video look go along with that. It's very inconsistent. I find that the episodes that were actually directed by David Lynch are really good, really creepy and ominous and exciting, but then the other episodes just get really boring. Lynch crafts these moments that are just so weird, but they're actually really little things that he does. One scene that really struck me was when Cooper is lying on the floor and the old man waiter at the hotel comes in and just keeps talking about the glass of milk, and the scene just keeps going and going and Lynch draws it out and nothing is really happening but you get more and more creeped out.
All the nu-fred yuppie trainees are pretty annoying. I have to remember to just ignore them. I could list all the dumb assholish things they do (tailing me too close, slamming on the brakes when I'm right behind them, running stop signs, etc.) , but it's the same thing drivers do, it's the same thing everyone does. They're all fucking assholes and retards, I can't let that get me down too much. One good move I've picked up lately in both my riding and driving is just to pull over and stop. Some fucking dick is riding my ass for no reason and annoying me. In my youth I would've just grit my teeth, or yelled at him, or something. Now I just pull over and stop for a while and let him get away from me, and go back to enjoying myself without them.
The mountains are covered in blankets of huckleberries now. I've written about them before. Now's the time! I think the easiest way to get to them is off Stevens Pass. You can actually just go to the ski resort and then hike south to Josephine Lake, which eliminates a lot of hill climbing, and you get into prime berry territory. They have such a bright, unique flavor. There's like notes of apricot or something; they kind of remind me of the flavor of "Now and Laters" that has that tanginess.
Summer is almost over. This has been one of the worst summers of my life. I don't mean that bad things have happened, I mean the summer itself was shit. Normally I go inside through the winter, get fat and drink and get depressed, then the sun comes out and I go outdoors and run around naked and be free and get fit and happy. This year, summer didn't start until July 5 (I remember it well because July 4 was still rainy and gray). Now it appears to be fading away into fall already, and it was MIA through most of July. And I worked way too much through almost all of it. I never really got into "summer mode" at all, never got fit, never got that feeling of being free and running around. The closest was when we had that brief heat wave.
I do love heat waves here. It just gives you no choice but to get down to the lake and have a swim. Everyone cool who loves life is hanging out by the water, and it's just a grand old time. The water is very cold, but it's invigorating and perfect on a 90+ day. The lifeguards love to yell at me, and the Samoans dive on top of each other, and the Russian mobster-wannabes cruise around Kirkland.
The standard solution is Fenwick Trees . They are compact (take no more room than the C[i] table itself). They are O(log(N)) fast. In code :
F[i] contains a partial sum of some binary range
the size of the binary range is equal to the bottom set bit of i
  if i&1      - it contains C[i]
  if i&3 == 2 - contains Sum of 2 ending with i (eg. C[i]+C[i-1] )
  if i&7 == 4 - contains Sum of 4 ending with i
  if i&F == 8 - contains Sum of 8 ending with i
(follow the link above to see pictures). To get C[i] from F[i] you basically have to get the partial sums S[i] and S[i-1] and subtract them (you can terminate early when you can tell that the rest of their sum is a shared walk). The logN walk to get S from F is very clever :
sum = 0;
while (i > 0)
{
    sum += F[i];
    i = i & (i-1);
}
The i & (i-1) step turns off the bottom bit of i, which is the magic of the Fenwick Tree structure being
the same as the structure of binary integers. (apparently this is the same as i -= i & -i , though I haven't
worked out how to see that clearly).
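One way to see that the two bit tricks agree: in two's complement, -i is ~i + 1, so i & -i isolates the lowest set bit of i, and subtracting it clears that bit. Meanwhile i - 1 flips the lowest set bit and all the zeros below it, so i & (i-1) clears exactly the same bit. Here is a minimal C++ sketch of the whole structure, assuming 1-based indexing with F[0] unused. The names Fenwick, add, prefixSum and lowestBit are made up for illustration :

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lowest set bit of i (same value as i & -i, written without negating an unsigned).
inline size_t lowestBit(size_t i) { return i & (~i + 1); }

// Fenwick tree over counts C[1..n]; F[0] is unused and stays 0.
struct Fenwick
{
    std::vector<uint32_t> F;

    explicit Fenwick(size_t n) : F(n + 1, 0) {}

    // C[i] += delta : walk upward by adding the lowest set bit.
    void add(size_t i, uint32_t delta)
    {
        assert(i >= 1 && i < F.size());
        for (; i < F.size(); i += lowestBit(i))
            F[i] += delta;
    }

    // S[i] = C[1] + ... + C[i] : walk downward by clearing the lowest set bit.
    uint32_t prefixSum(size_t i) const
    {
        assert(i < F.size());
        uint32_t sum = 0;
        while (i > 0)
        {
            sum += F[i];
            i = i & (i - 1);
        }
        return sum;
    }
};
```

A single count is then prefixSum(i) - prefixSum(i-1), which is the O(logN) query cost discussed below.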
If you put F[0] = 0 (F starts indexing at 1), then you can do this branchless if you want :
sum = 0;
UNROLL8( sum += F[i]; i = i & (i-1); );
(for an 8-bit symbol, eg 256 elements in tree).
You can't beat this. The only sucky thing is that just querying a single probability is also O(logN). There are some cases where you want to query probability more often than you do anything else.
One solution to that is to just store the C[i] array as well. That doesn't hurt your update time much, and gives you O(1) query for count, but it also doubles your memory use (2*N ints needed instead of N).
One option is to keep C[i], and throw away the bottom level of the Fenwick tree (the odd indexes that just store C[i]). Now your memory use is (3/2)*N ; it's just as fast but a little ugly.
But I was thinking what if we start over. We have the C[i], what if we just build a tree on it?
The most obvious thing is to build a binary partial sum tree. At level 0 you have the C[i], at level 1 you have the sum of pairs, at level 2 you have the sum of quartets, etc :
showing the index that has the sum of that slot :
0:01234567
1:00112233
2:00001111
3:00000000
So update is very simple :
Tree[0][i] ++;
Tree[1][i>>1] ++;
Tree[2][i>>2] ++;
...
But querying a cumprob is a bit messy. You can't just go up the tree and add, because you may already be counted in a parent. So you have to do something like :
sum = 0;
if ( i&1 ) sum += Tree[0][i-1];
i>>=1;
if ( i&1 ) sum += Tree[1][i-1];
i>>=1;
if ( i&1 ) sum += Tree[2][i-1];
..
This is O(logN) but rather uglier than we'd like.
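The update and query for this plain pair-sum tree can be sketched in C++ as below. This is only a sketch under some assumptions: the count of symbols is a power of two, indexes are 0-based, and the query returns the sum of everything strictly below i. PairSumTree, increment and sumBelow are made-up names :

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tree[L][j] holds the sum of the 2^L counts C[j<<L] .. C[((j+1)<<L) - 1].
struct PairSumTree
{
    std::vector<std::vector<uint32_t>> Tree;

    // n must be a power of two in this sketch.
    explicit PairSumTree(size_t n)
    {
        assert(n > 0 && (n & (n - 1)) == 0);
        for (size_t width = n; width >= 1; width >>= 1)
            Tree.emplace_back(width, 0);
    }

    // C[i]++ : bump the one slot that covers i at every level.
    void increment(size_t i)
    {
        assert(i < Tree[0].size());
        for (size_t L = 0; L < Tree.size(); L++)
            Tree[L][i >> L]++;
    }

    // Sum of C[0] .. C[i-1]. At each level, an odd index means the
    // left sibling is entirely below i and not yet counted.
    uint32_t sumBelow(size_t i) const
    {
        assert(i <= Tree[0].size());
        uint32_t sum = 0;
        for (size_t L = 0; L < Tree.size(); L++)
        {
            if (i & 1) sum += Tree[L][i - 1];
            i >>= 1;
        }
        return sum;
    }
};
```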
So what if we instead design our tree to be good for query? So we by construction say that our query for cumprob will be this :
sum = Tree[0][i];
sum += Tree[1][i>>1];
sum += Tree[2][i>>2];
...
That is, at each level of the tree, the index (shifted down) contains the amount that should be added on to get the partial sum that precedes you. That is, if i is >= 64 (and < 128), then Tree[6][1] will contain the sum from [0..63] and we will add that on.
In particular, at level L, if (i>>L) is odd, it should contain the sum of the previous 2^L items. So how do we do the update for this?
Tree[0][i] ++;
i >>= 1;
if ( i&1 ) Tree[1][i] ++;
i >>= 1;
if ( i&1 ) Tree[2][i] ++;
...

or

Tree[0][i] ++;
if ( i&2 ) Tree[1][i>>1] ++;
if ( i&4 ) Tree[2][i>>2] ++;
...

or

Tree[0][i] ++;
i >>= 1;
Tree[1][i] += i&1;
i >>= 1;
Tree[2][i] += i&1;
...
this is exactly complementary to the query in the last type of tree; we've basically swapped our update and query.
Now if you draw what the sums look like for this tree you get :
These are the table indexes :
| 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
|    0    |    1    |    2    |    3    |    4    |    5    |    6    |    7    |
|         0         |         1         |         2         |         3         |
|                   0                   |                   1                   |
|                                       0                                       |
These are the table contents :
|   C0    |  C0-C1  |   C2    |  C2-C3  |   C4    |  C4-C5  |   C6    |  C6-C7  |   C8    |  C8-C9  |   C10   | C10-C11 |   C12   | C12-C13 |   C14   | C14-C15 |
|         0         |       C0-C1       |         0         |       C4-C5       |         0         |       C8-C9       |         0         |      C12-C13      |
|                   0                   |                 C0-C3                 |                   0                   |                C8-C11                 |
|                                       0                                       |                                     C0-C7                                     |
|                                                                               0                                                                               |
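To check the table contents above, here is a minimal C++ sketch of one reading of this query-first tree. It is only a sketch under stated assumptions: the symbol count is a power of two, row 0 holds C[i] itself plus the previous item in odd slots (as the first table row shows), the always-zero single-slot top row is dropped, and the update bumps the odd sibling slot at each level, since that is what the table implies. The names QueryFirstTree, increment and sumThrough are made up :

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QueryFirstTree
{
    // Tree[L][j] for odd j holds the sum of the 2^L counts just before
    // block j; even slots above row 0 stay zero.
    std::vector<std::vector<uint32_t>> Tree;

    explicit QueryFirstTree(size_t n)
    {
        assert(n >= 2 && (n & (n - 1)) == 0);
        for (size_t width = n; width >= 2; width >>= 1)
            Tree.emplace_back(width, 0);
    }

    // C[i]++
    void increment(size_t i)
    {
        assert(i < Tree[0].size());
        Tree[0][i]++;   // every slot of row 0 counts itself
        for (size_t L = 0; L < Tree.size(); L++)
        {
            size_t j = i >> L;
            if ((j & 1) == 0) Tree[L][j + 1]++;   // odd sibling counts the items before it
        }
    }

    // C[0] + ... + C[i], with no branches in the query.
    uint32_t sumThrough(size_t i) const
    {
        assert(i < Tree[0].size());
        uint32_t sum = 0;
        for (size_t L = 0; L < Tree.size(); L++)
            sum += Tree[L][i >> L];
        return sum;
    }
};
```

With C[i] = i+1 for 16 symbols, slot 1 of row 2 ends up holding C0-C3 and slot 1 of row 3 holds C0-C7, matching the table.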
Now what if we slide everything left so we don't have all those zeros in the front, and we'll go ahead and
stick the total sum in the top :
|   C0    |  C0-C1  |   C2    |  C2-C3  |   C4    |  C4-C5  |   C6    |  C6-C7  |   C8    |  C8-C9  |   C10   | C10-C11 |   C12   | C12-C13 |   C14   | C14-C15 |
|       C0-C1       |         0         |       C4-C5       |         0         |       C8-C9       |         0         |      C12-C13      |         0         |
|                 C0-C3                 |                   0                   |                C8-C11                 |                   0                   |
|                                     C0-C7                                     |                                       0                                       |
|                                                                            C0-C15                                                                             |
Now, for each range, we stuff the value into the top slot that it covers, starting with the largest range. So (C0-C7) goes in slot 7 , (C0-C3) goes in slot 3, etc :
|   C0    |   -v-   |   C2    |   -v-   |   C4    |   -v-   |   C6    |   -v-   |   C8    |   -v-   |   C10   |   -v-   |   C12   |   -v-   |   C14   |   -v-   |
|       C0-C1       |        -v-        |       C4-C5       |        -v-        |       C8-C9       |        -v-        |      C12-C13      |        -v-        |
|                 C0-C3                 |                  -v-                  |                C8-C11                 |                  -v-                  |
|                                     C0-C7                                     |                                      -v-                                      |
|                                                                            C0-C15                                                                             |
(the -v- is supposed to be a down-arrow meaning you get those cell contents from the next row down). Well, this is exactly a Fenwick Tree.
I'm not sure what the conclusion is yet, but I thought that was like "duh, cool" when I saw it.
In all its user friendliness, it doesn't give me any notification of this. In fact, it CC's the comment to my email just like normal, but doesn't mark it in any way as being spam-filtered.
Because of this the comments can sit for a long time in the Blogger Spam pot until I realize "WTF why is that not showing up" and go fix it.
So if you posted something, but don't see it, that's probably why. You can just email me and say "WTF where's my comment".
It's pretty awesome that it thinks technical posts which have absolutely no spam-like properties at all are spam, but the small handful of times I actually have gotten spam comments they have been completely obvious, and allowed right through.
This is why I fucking hate web software. I had things working decently, it was fine, and then they fucking push some new feature on me that I'm not allowed to opt out of and it fucks up my life. God damn you.
I'm sure that someday they will break the Blogger API that I am using to autopost here, and when that happens I might just retire.
The really offensive thing is getting changes pushed on you that you didn't ask for. Then everybody asks "can we please have it the old way, I liked it just fine" and they tell you "no, eat your broccoli, this way is better". We got the same thing with Win 7 and Vista where random shit is changed and people are like "please just an option to have the old way" and they say "this way is better, you're not allowed to have the old way". Fuck you, you're not my mom. I'll make my own damn decisions about what's better for me or not. It pisses me off that this has become standard practice in software.
Basically being a fucking dick is now a carefully studied science. They call it "controlling the user experience" or "locking in eyeballs" or "tying together products" or "promoting certain user stories" or "rapid upgrade encouragement". It's fucking manipulation and it's not fucking cbloom endorsed.
In related news, the volume of spam coming to my email address has recently exploded. The Gmail spam filter is very unreliable, so I've made a point of going and examining all the spam periodically, but that's not really possible now that I'm getting 500 spams a day (up from 50-100 a week ago).
So if you send me an email and I don't reply when you think I should have, it's possible it got spam-marked.
It would be so fucking easy for gmail to fix this properly if they cared. For one thing, you could show me the spam probability in the spam folder, and let me sort by it, so I can just manually look at the ones you weren't sure about. You could also whitelist people I know. You could send back captcha challenges when someone's mail gets marked as spam so they at least know it.
Hell there are a million trivial solutions, it's not like it's a hard problem.
Maybe I'll write my own spam filter one day.
Also Sean and I talked long ago about how to do the lowest impact possible manually-instrumented profiler with full information. Basically you just record the trace and don't do any processing on it during recording. All you have to do is :
Profiler_Push(label) *ptr++ = PUSH(#label); *ptr++ = rdtsc();
Profiler_Pop( label) *ptr++ = POP( #label); *ptr++ = rdtsc();
#define PUSH(str) (U64)(str)
#define POP(str) -(S64)(str)
where ptr points into some big global array of U64's , and we will later use the stringized label as a unique id to merge traces. Once your app is done sampling, you have this big array of pushes and pops, you can then parse that to figure out all the hierarchical timing information. In practice you would want to use this with a scoper class to do the push/pop for you, like :
class rrScopeProfiler { rrScopeProfiler() { Push; } ~rrScopeProfiler() { Pop; } };
#define PROFILE(label) rrScopeProfiler RR_STRING_JOIN(label,_scoper) ( #label );
Very nice! (obviously the pop marker doesn't have to provide the label, but it's handy for consistency
checking).
(if you want to get fancy, the profiler push/pop array should really be write-combining memory, and the stores should be movnti's on x64 (*), that way the profiler push/pop wouldn't even pollute your cache, which makes it very low impact indeed)
(* BTW movnti is exposed as _mm_stream_si64 ; unfortunately, this does not exist in VC2005, you need 2008; the fact that they took away inline asm and then failed to expose all the instructions in intrinsics was a huge "fuck you" to all low-level developers; it was terrible in the early days, they've caught up a bit more with each rev ) (note that movntq is not the same, movnti comes from normal registers).
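Here is a minimal, portable C++ sketch of the record-now, parse-later idea. It is not the actual implementation: it swaps rdtsc for std::chrono::steady_clock, does no bounds checking on the trace buffer, and assumes string-literal addresses are positive when viewed as a signed 64-bit value, so that a negated address can mark a pop. The prof namespace and the names in it are made up for illustration :

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

namespace prof
{
    static uint64_t g_trace[1 << 16];   // big global array; no overflow check in this sketch
    static uint64_t * g_ptr = g_trace;

    // Stand-in for rdtsc : nanoseconds from a monotonic clock.
    inline uint64_t Now()
    {
        using namespace std::chrono;
        return (uint64_t) duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
    }

    // Recording is just two stores per event : a tag and a time.
    inline void Push(const char * label) { *g_ptr++ = (uint64_t)(uintptr_t)label; *g_ptr++ = Now(); }
    inline void Pop (const char * label) { *g_ptr++ = (uint64_t)(-(int64_t)(uintptr_t)label); *g_ptr++ = Now(); }

    struct Scope
    {
        const char * label;
        explicit Scope(const char * l) : label(l) { Push(label); }
        ~Scope() { Pop(label); }
    };

    // After the run : walk the raw trace and total inclusive time per label.
    inline std::map<std::string, uint64_t> Parse()
    {
        std::map<std::string, uint64_t> totals;
        std::vector<uint64_t> startTimes;   // stack of open pushes
        for (const uint64_t * p = g_trace; p < g_ptr; p += 2)
        {
            int64_t tag = (int64_t)p[0];
            if (tag > 0)
            {
                startTimes.push_back(p[1]);
            }
            else
            {
                const char * label = (const char *)(uintptr_t)(uint64_t)(-tag);
                totals[label] += p[1] - startTimes.back();
                startTimes.pop_back();
            }
        }
        return totals;
    }
}

#define PROFILE(label) prof::Scope label##_scoper(#label)
```

Keying the totals by the label text (rather than the pointer) is what merges traces from the same label used in different places.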
So I did this, and made my AllocParser able to parse that kind of input and turn it into AllocRecords. (the normal mode of AllocParser is to handle stack traces of memory allocations).
So now I can make tabview output for either top-down or bottom-up hierarchies, and also graphviz output like : lz_profile.svg .
There are a few minor things to clean up, like I'd like to be able to show either seconds or clocks, I'd like to be able to divide clocks by the total # of bytes to make it a clocks per byte, and if a node only has a self time, I'd like to delete the redundant +self node.
Another option is to show the hierarchy in a totally different graphviz style, using the boxes-within-boxes method. I tried that before for allocs and it was hideous, but it might be okay for timing. If I do that, then I could combine this with a global timeline to show thread profiles over time, and then use the arrows to connect dependencies.
Then I have to buy that old plotter I've been talking about ...
Each LED corresponds to one unit test; if it passes it shows green, if it fails to run it shows red, if it fails to compile it flashes red.
It also runs speed and memory use tests and shows them in equalizer-type graph info. It bleeps and bloops.
So you can just sit there and jam away and write code and keep checking it in, and you get instant visual feedback if you break a unit test, or slow something down.
Ooh, it would be cool if there were various display widgets and various outputs, and you could actually run physical wires to hook them up. Like oldschool telephone switchboards. So maybe I only have like 4 speed display graph panels, but I have 20 tests that can output speed information. I can reach over and plug the wire from the test I want to the panel I want to see it on.
Part of it is just little bits of noise around. I've always had a bad case of the "prairie dog" prey instinct - whenever I hear a noise I have to stop and look around "what's that? what's that?" , I get all jittery and nervous. But even beyond that - when I'm at the office at odd times when there's nobody directly near me, I'm still affected. Just knowing that someone is in the same building as me bugs me. I can't relax, my butthole is all tight and I just can't get into the groove and dive into the code.
Working at home is often good, and N is very understanding about me needing to go in my room and be left alone, but still I feel the craving for more. Mainly it's the damn home improvers that plague me now. The worst thing is not knowing when it's going to happen. Getting my mind into a real sharp work state takes a lot of effort and forethought. It's sort of like a performer getting ready to be "on" for the camera - you have to psyche yourself up, make sure you're hydrated and have proper blood sugar, then the stage lights go on and I sit down at my DevStudio to shine ... and then the fucking neighbor starts running his roto-tiller or some shit and my performance is cancelled.
As I get older I realize that the artist's studio in the country is really the ideal setup. Of course we've always heard about these artists who have a country home, and then an outbuilding that they turn into a studio, so you can just retreat into your private space and be alone to work. I always thought "what an indulgence" or "what sensitive woosies" , but yeah, that would be sweet.
(among other advantages, the cblib version doesn't pull vector.h or windows.h into the apf headers, both of which I consider to be very severe sins)
See older posts for a description of how it works and earlier not-good way of doing it and initial announcement .
The basic way it works is :
that's it, very simple!
Here's my main.cpp for an example of usage :
#include <float.h>
#include <stdio.h>
#include <string.h>
//#include "autoprintf.h"
#include "apf.h"
#include <windows.h>
struct Vec3 { float x,y,z; };
namespace apf
{
    inline const String ToString( const Vec3 & v )
    {
        return StringPrintf("{%.3f,%.3f,%.3f}",v.x,v.y,v.z);
    }
    /*
    inline size_t autoArgConvert(const HWND arg)
    {
        return (size_t)arg;
    }
    */
    inline const String ToString( const HWND v )
    {
        return StringPrintf("[%08X]",v);
    }
};
int main(int argc,const char * argv[])
{
    //MakeAutoPrintfINL();
    //autoprintf("test bad %d",3.f);
    autoprintf("hello %-7a",(size_t)400,"|\n");
    //*
    //autoprintf("100%");
    autoprintf("percent %% %s","100%"," stupid","!\n");
    autoprintf("hello ","world %.1f",3.f,"\n");
    autoprintf("hello ",7," world\n");
    autoprintf("hello %03d\n",7);
    autoprintf("hello %d",3," world %.1f\n",3.f);
    autoprintf("hello ",(size_t)400,"\n");
    autoprintf("hello ",L"unicode is balls"," \n");
    autoprintf("hello %a ",L"unicode is balls"," \n");
    //autoprintf("hello %S ",L"unicode is balls"," \n");
    autoprintf("hello ",apf::String("world")," \n");
    // autoprintf("hello ",LogString()); // compile error
    autoprintf("top bit ",(1UL<<31)," \n");
    autoprintf("top bit %d",(1UL<<31)," \n");
    autoprintf("top bit %a",(1UL<<31)," \n");
    autoprintf("top bit %a\n",(size_t)(1UL<<31));
    HANDLE h1 = (HANDLE) 77;
    HWND h2 = (HWND) 77;
    autoprintf("HANDLE %a\n",h1);
    autoprintf("HWND %a\n",h2);
    char temp[1024];
    autosnprintf(temp,1023,"hello %a %a %a",4.f,7,apf::String("world"));
    printf("%s\n",temp);
    Vec3 v = { 3, 7, 1.5f };
    autoprintf("vector ",v," \n");
    autoprintf("vector %a is cool %a\n",v,(size_t)100);
    /**/
    return 0;
}
The normal way to make user types autoconvert is to add a ToString() call for your type, but you could also use autoArgConvert. If you use autoArgConvert, then you will wind up going through a normal %d or whatever.
One nice thing is that this autoprintf is actually even safer than my old safeprintf. If you mismatch primitive types, (eg. you put a %d in your format string but pass a float), it will check it using the same old safeprintf method (that is, a runtime failure). But if you put a std::string in the list when you meant to put a char *, you will get a compile error now, which is much nicer.
Everything in cblib now uses this (I made Log.h be an autoprintf) and I haven't noticed a significant hit to compile time or exe size since the templates are all now deterministic and non-recursive.
Yes it does a lot of dynamic allocation. Get with the fucking 20th century. And it's fucking printf. Printf is slow. I don't want to hear one word about it.
It's most appalling because all the maps are based on the USGS data, which is paid for by our fucking tax dollars. Fortunately there are some perfectly good free map sites :
libremap.org : Libre Map Project - Free Maps and GIS data
digital-topo-maps.com : Free Printable Topo Maps - Instant Access to Topographic Maps
ACME Mapper 2.0
Of these, digital-topo-maps.com is the easiest to browse around cuz it just uses Google maps (actually, it seems to be like 10X faster than Google's own interface, so it's actually just a nice way to browse normal maps too).
Libre Maps is the most useful for hiking with because it has nice printer-ready pages in high quality.
Also, I sometimes forget that Google still has the Terrain maps because they are hidden under More... now. I'm sure somebody has done trail overlays for Google Terrain, but I haven't found it.
2. When I call AAA it gives me fucking California because my phone is from CA. Seriously? People don't have cell phone numbers from other states? I get to wait on hold for five minutes, then request WA, then wait again.
3. Fucking grocery store near my house is phasing out human checkers for these fucking automated machines. In theory that's a nice idea, but in practice the things are so fucking broken that they are instant boiling blood and CHARLES SMASH rage attack. They just constantly freak out and go into "please return the item to the bagging area" ; god dammit I already put the item in the bagging area you fucking turd.
4. Fucking ticket I got for going 86 on a 65 mph freeway is costing me $1500 a year in raised insurance rate. Unbelievable. I understand it's a random tax and so on, which I'm sort of fine with, but the collusion with the insurance industry is criminal (Geico buys laser guns for police departments, car insurers pay lobbyists to support speed cameras, etc.). It'll be on my record for at least three years, so total cost to me is around $5000 , total profit for the municipality is maybe $100.
5. On the other hand, fuck you for crowing about red light cams being dismissed . When you run a red light, the camera should immediately fire a predator missile and blow up your car, you fucking dangerous self-righteous cock. I see a lot of people these days with photo-blocking plates on their cars too. You fucking shit head, if you run red lights you deserve punishment.
6. I got another fraudulent withdrawal from my First Mutual account from the EXACT SAME fraudulent merchant. I put in yet another fraud report, and I asked them WTF why didn't they block it since I had already reported fraud the last time. They said they won't block future withdrawals unless I do a $20 "stop payment" request. WTF WTF, I have to pay you to stop people from just withdrawing money from my account any time they want? I also asked about how these ACH's are authorized. They told me anybody with a merchant account can ACH withdraw from anybody else whenever they want. WTF WTF. I have to fill out a hundred forms and sign a bunch of shit and fax it and all that bullshit if I want to do an ACH from my *OWN* account, but other fuckers can just ACH any time they want straight out of my money without my permission.
7. The United States now has an overt policy of assassinating anyone they want in non-warzones without any judicial oversight or even the slightest proof of guilt necessary. Holy shit, at least in the past the CIA was secretive about their assassinations because they knew they were doing something wrong. Now we just blow up people. I also think the idea that we should care whether or not these assassinations are of American citizens or not is disgusting. They're human beings in a non-warzone with no proof of guilt and plenty of collateral damage. We all should be outraged, but we're so fucking whipped that we don't even blink any more.
At the heart of v2 is a "fixed" way of doing varargs. The problem with varargs in C is that you don't get the types of the variables passed in, or the number of them. Well there's no need to groan about that because it's actually really trivial to fix. You just make a bunch of functions like :
template < typename T1, typename T2, typename T3 >
inline String autoToStringSub( T1 arg1, T2 arg2, T3 arg3)
{
    return autoToStringFunc( 3,
        safeprintf_type(arg1), safeprintf_type(arg2), safeprintf_type(arg3),
        arg1, arg2, arg3, 0 );
}
for various number of args. Here autoToStringFunc(int nArgs, ...) is the basic vararg guy who will do
all the work, and we just want to help him out a bit. This kind of adapter could be used very generally
to make enhanced varargs functions. Here I only care about the "printf_type" of the variable, but
more generally you could use type_info there. (you could also easily make abstract Objects to encapsulate
the args and pass through an array of Objects, so that the called function wouldn't have to be a stupid C
vararg function at all, but then it's harder to pass through to the old C funcs that still want varargs).
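As a self-contained illustration of this adapter trick (not the cblib code), here is a minimal C++ sketch: a template front end passes the arg count and one type tag per arg ahead of the args themselves, so the C vararg function knows what to pull with va_arg. Everything named here (ArgType, TypeOf, DescribeArgs, Describe) is hypothetical, and it only handles int, double and C strings :

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

enum ArgType { kInt, kDouble, kCStr };

// Overloads pick the tag; float and char land on the same tag as the
// type they promote to when passed through "...".
inline ArgType TypeOf(int)          { return kInt; }
inline ArgType TypeOf(double)       { return kDouble; }
inline ArgType TypeOf(const char *) { return kCStr; }

// The basic vararg guy : reads nArgs type tags, then nArgs values.
inline std::string DescribeArgs(int nArgs, ...)
{
    assert(nArgs >= 0 && nArgs <= 8);
    va_list vl;
    va_start(vl, nArgs);

    ArgType types[8];
    for (int i = 0; i < nArgs; i++)
        types[i] = (ArgType) va_arg(vl, int);

    std::string out;
    char buf[64];
    for (int i = 0; i < nArgs; i++)
    {
        if (i > 0) out += ' ';
        switch (types[i])
        {
        case kInt:    std::snprintf(buf, sizeof(buf), "%d",   va_arg(vl, int));    out += buf; break;
        case kDouble: std::snprintf(buf, sizeof(buf), "%.1f", va_arg(vl, double)); out += buf; break;
        case kCStr:   out += va_arg(vl, const char *); break;
        }
    }
    va_end(vl);
    return out;
}

// The adapter : one of these per arg count.
template <typename T1, typename T2, typename T3>
inline std::string Describe(T1 a1, T2 a2, T3 a3)
{
    return DescribeArgs(3, (int)TypeOf(a1), (int)TypeOf(a2), (int)TypeOf(a3), a1, a2, a3);
}
```

Passing a type with no TypeOf overload (a std::string, say) fails to compile, which is the behavior you want from the adapter.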
On top of this we have a type-change adapter :
template < typename T1, typename T2 >
inline String autoToString( T1 arg1, T2 arg2)
{
    return autoToStringSub(
        autoprintf_StringToChar( autoArgConvert(arg1) ),
        autoprintf_StringToChar( autoArgConvert(arg2) ));
}
autoToString calls down to autoToStringSub, and uses autoArgConvert. autoArgConvert is a template that passes through basic types
and calls ToString() on other types. ToString is a template that knows the basic types, and the client can
extend it by adding ToString for their own types. If they don't, it will be a compile error.
StringToChar is a helper that turns a String into a char * and passes through anything else. We have to do it
in that double-call way so that the String can get allocated and stick around as a temporary until our whole call is done.
The next piece is how to implement autoToStringFunc() , which takes "enhanced varargs". We need to figure out which pieces are format strings and do various types of printfs (including supporting %a for auto-typed printf). The only tricky part of this is how to step around in the varargs. Here is the only place we have to use a little bit of undefined behavior. First of all, think of the va_list as a pointer to a linked list. Calling va_arg essentially advances the pointer one step. That's fine and standard. But I assume that I can then take that pointer and pass it on as a va_list which is the remaining args (see note *).
The key way we deal with the varargs is with functions like this :
static inline void SkipVAArg(ESafePrintfType argtype, va_list & vl)
{
    switch(argtype)
    {
    case safeprintf_charptr:  { va_arg(vl,char *); return; }
    case safeprintf_wcharptr: { va_arg(vl,wchar_t *); return; }
    case safeprintf_int32:    { va_arg(vl,int); return; }
    case safeprintf_int64:    { va_arg(vl,__int64); return; }
    case safeprintf_float:    { va_arg(vl,double); return; }
    case safeprintf_ptrint:   { va_arg(vl,int*); return; }
    case safeprintf_ptrvoid:  { va_arg(vl,void*); return; }
    default:
        // BAD
        safeprintf_throwsyntaxerror("SkipVAArg","unknown arg type");
        return;
    }
}
And the remainder is easy!
* : actually it looks like this is okay by the standard, I just have to call va_end after each function call then SkipArgs back to where I was. I believe this is pointless busywork, but you can add it if you want to be fully standard compliant.
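For anyone who does want the by-the-book version, one option on compilers that have va_copy (C99 / C++11) is to hand a copy of the partly-consumed va_list to the callee, so the original cursor stays usable afterwards. This is a minimal sketch of that alternative, not what autoprintf does; SkipOneThenFormat is a made-up name :

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Skips one leading int vararg, then formats the remaining args with fmt.
inline std::string SkipOneThenFormat(const char * fmt, ...)
{
    va_list vl;
    va_start(vl, fmt);
    int skipped = va_arg(vl, int);   // advance past the first vararg
    (void) skipped;

    va_list rest;
    va_copy(rest, vl);               // independent cursor over the remaining args
    char buf[256];
    std::vsnprintf(buf, sizeof(buf), fmt, rest);
    va_end(rest);

    // vl was never handed to another function, so it is still valid here
    // and could keep being stepped with va_arg.
    va_end(vl);
    return buf;
}
```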
template < typename T1, typename T2, typename T3, typename T4 >
inline String autoToString( T1 arg1, T2 arg2, T3 arg3, T4 arg4 )
{
    return ToString(arg1) + autoToString( arg2,arg3,arg4);
}
template < typename T2, typename T3 >
inline String autoToString( const char *fmt, T2 arg2, T3 arg3 )
{
    autoFormatInfo fmtInfo = GetAutoFormatInfo(fmt);
    if ( fmtInfo.autoArgI )
    {
        String newFmt = ChangeAtoS(fmt,fmtInfo);
        if ( 0 ) ;
        else if ( fmtInfo.autoArgI == 1 ) return autoToString(newFmt.CStr(), ToString(arg2).CStr(),arg3);
        else if ( fmtInfo.autoArgI == 2 ) return autoToString(newFmt.CStr(), arg2,ToString(arg3).CStr());
        else return autoPrintf_BadAutoArgI(fmt,fmtInfo);
    }
    if ( fmtInfo.numPercents == 0 ) return ToString(fmt) + autoToString(arg2,arg3);
    else if ( fmtInfo.numPercents == 1 ) return StringPrintf(fmt,arg2) + autoToString(arg3);
    else if ( fmtInfo.numPercents == 2 ) return StringPrintf(fmt,arg2,arg3);
    else return autoPrintf_TooManyPercents(fmt,fmtInfo);
};
you have an autoToString that takes various numbers of template args. If the first arg is NOT a char *,
it calls ToString on it then repeats on the remaining args. Any time the first arg is a char *, it uses
the other overload which looks in fmt to see if it's a printf format string, then splits the args
based on how many percents there are. I also added the ability to use "%a" to mean auto-typed args,
which is what the first part of the function is doing.
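The post doesn't show GetAutoFormatInfo, so here is a minimal sketch of what a scanner like that might do. The struct name, the field meanings (autoArgI as a 1-based index of the first %a) and the set of skipped flag characters are my assumptions, not the original code; it ignores things like "%*d" and length modifiers in front of %a.

```cpp
#include <assert.h>
#include <string.h>

// Hypothetical stand-in for the post's autoFormatInfo; field meanings are assumed.
struct FormatScan
{
    int numPercents; // conversions that consume an argument ("%%" is not counted)
    int autoArgI;    // 1-based index of the first "%a" conversion, 0 if none
};

inline FormatScan ScanFormat(const char *fmt)
{
    FormatScan info = { 0, 0 };
    for (const char *p = fmt; *p; ++p)
    {
        if (*p != '%') continue;
        if (p[1] == '%') { ++p; continue; } // literal percent sign, consumes nothing
        info.numPercents++;
        // step over flags, width and precision to reach the conversion character
        const char *q = p + 1;
        while (*q && strchr("-+ #0123456789.", *q)) ++q;
        if (*q == 'a' && info.autoArgI == 0) info.autoArgI = info.numPercents;
    }
    return info;
}
```

With something like this, "hello %d %a %f\n" reports three percents with the auto arg in slot 2, and a plain "hello " reports zero percents, so it gets treated as a string to append rather than a format.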
That's all dandy, but you should be able to see that for large numbers of args, it generates a massive amount of code.
The real problem is that even though the format string is usually a compile-time constant, I can't parse it at compile time, so I generate code for each arg being %a or not being %a, and for each possible number of percents. The result is something like 2^N codegens for N args. That's bad.
So, I know how to fix this, so I don't think I'll publish v1. I have a method for v2 that moves most of the work out of the template. It's much simpler actually, and it's a very obvious idea, all you have to do is make a template like :
autoprintf(T1 a1, T2 a2, T3 a3)
{
autoPrintfSub( autoType(a1), autoArg(a1) ,autoType(a2), autoArg(a2) , .. )
}
where autoType is a template that gives you the type info of the arg, and autoArg does conversions on
non-basic types for you,
and then autoPrintfSub can be a normal varargs non-template function and take care of all the hard work.
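To make the v2 idea concrete, here is a rough sketch of the shape, assuming only three basic types and no format strings at all. The names autoType, autoArg and autoPrintfSub come from the text above, but their bodies, the ArgType enum, the kArgEnd terminator and the autoFormat3 wrapper are all made up for illustration. The point is only that the template is a thin shim and the one non-template function does the work.

```cpp
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string>

enum ArgType { kArgEnd = 0, kArgInt, kArgDouble, kArgCStr }; // hypothetical type tags

inline ArgType autoType(int)          { return kArgInt; }
inline ArgType autoType(double)       { return kArgDouble; } // float promotes to double
inline ArgType autoType(const char *) { return kArgCStr; }

// Basic types pass straight through; a real version would convert non-basic types here.
template <typename T> inline T autoArg(T v) { return v; }

// Non-template worker : reads (type tag, value) pairs until it sees kArgEnd.
inline std::string autoPrintfSub(int firstType, ...)
{
    std::string out;
    char buf[64];
    va_list vl;
    va_start(vl, firstType);
    for (int type = firstType; type != kArgEnd; type = va_arg(vl, int))
    {
        switch (type)
        {
        case kArgInt:    snprintf(buf, sizeof(buf), "%d", va_arg(vl, int));    out += buf; break;
        case kArgDouble: snprintf(buf, sizeof(buf), "%g", va_arg(vl, double)); out += buf; break;
        case kArgCStr:   out += va_arg(vl, const char *); break;
        default:         va_end(vl); return out + "<bad arg type>";
        }
    }
    va_end(vl);
    return out;
}

// The only templated part : one tiny instantiation per combination of arg types.
template <typename T1, typename T2, typename T3>
inline std::string autoFormat3(T1 a1, T2 a2, T3 a3)
{
    return autoPrintfSub((int)autoType(a1), autoArg(a1),
                         (int)autoType(a2), autoArg(a2),
                         (int)autoType(a3), autoArg(a3), (int)kArgEnd);
}
```

So autoFormat3("x = ", 7, " ok") gives "x = 7 ok", and the code generated per call site is just the argument pushes.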
... yep new style looks like it will work. It requires a lot more fudging with varargs, the old style didn't need any of that. And I'm now using undefined behavior, though I think it always works in all real-world cases. In particular, in v2 I'm now relying on the fact that I can do :
va_start(vl)
va_arg(vl) .. a few times to grab some args from vl
vsnprintf( vl);
that is, I rely on the fact that va_arg advances me one step in the va_list, and that I then still have a valid va_list for remaining args which I can pass on. This is not allowed by the standard technically but I've never seen a case where it doesn't work (unless GCC decided to get pedantic and forcibly make it fail for no good reason).
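A small sketch of that pattern: pull one arg off the front with va_arg, then hand the rest of the list to vsnprintf. PrefixedFormat is a made-up function for illustration. My reading of the C standard is that passing the va_list on like this is fine as long as the caller does nothing with it afterwards except va_end (which is all this does), but treat that as my assumption rather than gospel.

```cpp
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string>

// Hypothetical example : the first vararg is an int tag, the rest belong to fmt.
inline std::string PrefixedFormat(const char *fmt, ...)
{
    char buf[256];
    va_list vl;
    va_start(vl, fmt);
    int tag = va_arg(vl, int);                     // step past the first arg
    int n = snprintf(buf, sizeof(buf), "[%d] ", tag);
    vsnprintf(buf + n, sizeof(buf) - n, fmt, vl);  // vl now describes the remaining args
    va_end(vl);                                    // vl is not touched again after the call
    return buf;
}
```

PrefixedFormat("%s=%d", 7, "x", 42) produces "[7] x=42". Where it gets dicey is if you want to keep using vl after the callee returns; that's where a va_copy (or re-walking the list) would come in.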
Testing on 10M arrays of average length 192 (random in [128,256]).
count : 10000000
totalBytes : 1920164768
clocks per byte :
burtle : 1.658665
crc32 : 10.429893
adler32 : 1.396631
murmur : 1.110712
FNV : 2.520380
So Adler is in fact decently fast, not as fast as Murmur but a bit faster than Burtle. (everything is crazy fast on my x64 lappy; the old post was on my work machine, everything is 2-3X faster on this beast; it's insane how much Core i7 can do per clock).
BTW I wasn't going to add Murmur and FNV to this test - I didn't test them before because they are really not "corruption detection" hashes, they are hashes for hash tables, in particular they don't really try to specifically guarantee that one-bit flips will change the hash or whatever it is that CRC's guarantee, but after I saw how non-robust Adler was I figured I should add them to the test, and we will see that they do belong...
Now when I count collisions in the same way as before, a problem is evident :
collisions :
rand32 : 11530
burtle : 0
crc32 : 11774
adler32 : 1969609
murmur : 11697
FNV : 11703
note that as before, rand32 gives you a baseline on how many collisions a perfect 32 bit hash should give you - those collisions are just due to running into the limited space of the 32 bit word. Burtle here is a 64 bit hash and never collides. (I think I screwed up my CRC a little bit, it's colliding more than it should. But anyhoo). Adler does *terribly*. But that's actually a known problem for short sequences.
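As a sanity check on that rand32 baseline, the usual birthday estimate for n items thrown into m = 2^bits slots is about n*(n-1)/(2*m) colliding pairs. The helper below is mine, and it assumes the test counts collisions in roughly that pairwise way (the post doesn't say exactly how they are counted).

```cpp
#include <assert.h>
#include <math.h>

// Expected number of colliding pairs among n uniformly random hashes of hashBits bits.
inline double ExpectedCollidingPairs(double n, int hashBits)
{
    double m = ldexp(1.0, hashBits); // 2^hashBits
    return n * (n - 1.0) / (2.0 * m);
}
```

For n = 10M and 32 bits this comes out a bit over 11,600, with random scatter on the order of a hundred either way, so everything in the 11.5k - 11.8k range above is behaving like an ideal 32-bit hash. For 64 bits the expectation is a few millionths of a collision, which is why the 64-bit Burtle never collides.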
How does it do on longer sequences ? On arrays of random length between 2k and 4k (average 3k) :
num hashes : 10000000
totalBytes : 30722620564
clocks per byte :
burtle : 1.644675
crc32 : 11.638417
adler32 : 1.346784
murmur : 1.027105
FNV : 2.999243
collisions :
rand32 : 11530
burtle : 0
crc32 : 11586
adler32 : 12335
murmur : 11781
FNV : 11653
it's better, but still the worst of the group.
BTW I should note that the adler32 implementation does unrolling and rollup/rolldown and all that kind of stuff and none of the other ones do. So its speed advantage is a bit unfair. All these sort of informal speed surveys should be taken with a grain of salt, since to really fairly compare them I would have to spend a few weeks on each one making sure I got it as fast as possible, and of course testing on various platforms. In particular FNV and Murmur use multiplies which is a no-go, but you could probably use shift and add to replace the multiplies, and you'd get something like Bob's "One at a Time" hash.
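For reference, here is the plain textbook Adler-32 (not the unrolled version used in the timings), which makes the short-buffer problem easy to see. The low 16 bits are just 1 plus the sum of the bytes, mod 65521. On a buffer of 256 bytes or less that sum can't even reach the modulus, so it never wraps, and for random bytes it clusters near its mean (a standard deviation of roughly a thousand at length 192, if I assume independent uniform bytes). So a big chunk of the 32 bits is carrying very little information.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Straightforward reference Adler-32, one modulo per byte (slow but obviously correct).
inline uint32_t Adler32(const uint8_t *data, size_t len)
{
    const uint32_t kMod = 65521; // largest prime below 2^16
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; ++i)
    {
        a = (a + data[i]) % kMod; // running byte sum
        b = (b + a) % kMod;       // running sum of the sums
    }
    return (b << 16) | a;
}
```

Even in the worst case of 192 bytes of 0xFF, the low word is only 1 + 192*255 = 48961, still under 65521.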
So I figured I'd test on what is more like my real usage scenario.
In the RAD LZH , I compress 16k data quanta, and check the CRC of each compressed chunk before decompressing. So compressed chunks are between 0 and 16k bytes. Since they are compressed they are near random bits. Corruption will take various forms, either complete stompage with random shite, or some bit flips, or tail or head stomps. Complete stompage has been tested in the above runs (it's the same as checking the collision rate for two unrelated sequences), so I tested incremental stomps.
I made random arrays between 256 and 16k bytes long. I then found the hash of that array, did some randomized incremental stomping, and took the hash after the changes. If the hashes were the same, it counts as a collision. The results are :
numTests : 13068402
burtle : 0 : 0.00000%
crc32 : 0 : 0.00000%
adler32 : 3 : 0.00002%
murmur : 0 : 0.00000%
FNV : 0 : 0.00000%
Adler32 is the only one that fails to detect these incremental stomps. Granted the failure rate is pretty low (3/13068402) but that's not secure. Also, the hashes which are completely not designed for this (Murmur and FNV) do better. (BTW you might think the Adler32 failures are all on very short arrays; not quite, it does fail on a 256 byte case, then twice at 3840 bytes).
ADDENDUM : Ok I tested Fletcher32 too.
cycles :
rand32 : 0.015727
burtle : 1.364066
crc32 : 4.527377
adler32 : 1.107550
fletcher32 : 0.697941
murmur : 0.976026
FNV : 2.439253

large buffers :
num hashes : 10000000
totalBytes : 15361310411
rand32 : 11530
burtle64 : 0
crc32 : 11710
adler32 : 12891
fletcher32 : 11645
murmur : 11792
FNV : 11642

small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32 : 11530
burtle64 : 0
crc32 : 11487
adler32 : 24377
fletcher32 : 11793
murmur : 11673
FNV : 11599

difficult small buffers :
num hashes : 10000000
totalBytes : 1920164768
rand32 : 11530
burtle64 : 0
burtle32 : 11689
crc32 : 11774
adler32 : 1969609
fletcher32 : 11909
murmur : 11665
FNV : 11703

Conclusion : Adler32 is very bad and unsafe. Fletcher32 looks perfectly solid and is very fast.
ADDENDUM 2 : a bit more testing. I re-ran the test of munging the array with incremental small changes of various types again. Running on lengths from 256 up to N, I get :
munge pattern 1 :
length : 6400
numTests : 25069753
rand32 : 0
burtle64 : 0
burtle32 : 0
crc32 : 0
adler32 : 14
fletcher32 : 22
murmur : 0
FNV : 0

munge pattern 2 :
length : 4096
numTests : 31322697
rand32 : 0
burtle64 : 0
burtle32 : 0
crc32 : 0
adler32 : 9
fletcher32 : 713
murmur : 0
FNV : 0
So I strike my conclusion that Fletcher is okay. Fletcher and Adler are both bad.
ADDENDUM 3 : Meh, it depends what kind of "corruption" you expect. The run above in which Fletcher is doing very badly includes some "munges" which tend to fill the array with lots of zeros, in which area it does very badly.
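One well-known blind spot of Fletcher that fits this kind of munge: the sums are taken mod 65535, so a 16-bit word of 0x0000 and a word of 0xFFFF contribute exactly the same thing. I don't know that this is the specific failure the munge test hit, so take it as a plausible illustration only. The sketch below is a plain reference Fletcher-32 over 16-bit words with both sums starting at zero (real implementations differ on initial values and on how they batch the modulo).

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Reference Fletcher-32 over 16-bit words; initial sums of 0 are an assumption here.
inline uint32_t Fletcher32(const uint16_t *words, size_t count)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < count; ++i)
    {
        a = (a + words[i]) % 65535; // 0xFFFF is congruent to 0 here
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}
```

So stomping a word of FFFF with 0000 (or the reverse) is invisible to it, while any other single-word change is caught.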
If you look at really true random noise type errors, and you always start your array full of random bits, and then you make random bit flips or random byte changes (between 1 and 7 of them), and then refill the array with rand, they perform as expected over a very large number of runs :
numTests : 27987536
rand32 : 3 : 0.00001%
burtle64 : 2 : 0.00001%
burtle32 : 2 : 0.00001%
crc32 : 1 : 0.00000%
adler32 : 1 : 0.00000%
fletcher32 : 2 : 0.00001%
murmur : 2 : 0.00001%
FNV : 1 : 0.00000%
In particular, all of the following things work :
autoprintf("hello world\n");
autoprintf("hello ",7," world\n");
autoprintf("hello %03d\n",7);
autoprintf("hello ","world %.1f",3.f,"\n");
autoprintf("hello %d",3," world %.1f\n",3.f);
autoprintf("hello ",(size_t)400,"\n");
autoprintf("hello ",L"unicode is balls"," \n");
autoprintf("hello ",String("world")," \n");
I'm gonna clean up the code a bit and try to extricate it from cblib (meh or maybe not) and I'll post it in a few days.
It does pretty much everything I've always wanted from a printf. There is one thing missing, which is formatting for arbitrary types. Currently you can only format the basic types, and the non-basic types go through a different system. eg. you can either do :
autoprintf("hello %5d ",anInt);
or
autoprintf("hello ",(size_t)anInt);
but you can't yet do
autoprintf("hello %5",(size_t)anInt);
(note that the type specifier is left off, only format specifiers are on the %). I know how to make this work, but it makes the implementation
a lot more complicated, so I might punt on it.
The more complicated version is to be able to pass through the format spec into the templated converter. For example, you might have a ToString() for your Vec3 type which makes output like ("{%f,%f,%f}",x,y,z) . With the current system you can do :
Vec3 v;
autoprintf("v = ",v);
and it will call your ToString, but it would be groovy if you could do :
Vec3 v;
autoprintf("v = %.1",v);
as well and have that format apply to the conversion for the type. But that's probably more complication than I want to get into.
Another thing that might be nice is to have an explicit "%a" or something for auto-typed, so you can use them at the end like normal printf args. eg :
autoprintf("hello %d %a %f\n", 3, String("what"), 7.5f );
LZMA is very good, and also very obscure. While the code is free and published, it's completely opaque and the algorithm is not actually described anywhere. In particular, it's very good on structured data - even better than PPM. And, superficially, it looks very much like any other LZ+arithmetic+optimal parse, which there were many of before LZMA, and yet it trounces them all.
So, what's going on in LZMA? First, a description of the basic coder. Most of LZMA is very similar to LZX - LZX uses a forward optimal parse, log2-ish encoded lengths and offsets, and the most recent 3 offsets in an MTF/LRU which are coded as special "repeat match" codes. (LZX was made public when Microsoft made it part of CAB and published a full spec ).
LZMA can code a literal, a match, or a recent offset match - one of the three most recent offsets (like LZX). This is pretty standard. It also has two coding modes that are unusual : "Delta Literal" coding, and the 0th most recent offset match can code a single character match.
Everything it codes is context-coded binary arithmetic coded. Literals are coded as their bits; the initial context is the previous character and a few bits of position, and as literal bits are coded they are shifted into the context for future bits (top to bottom). This is pretty standard.
Using a few bits of position as part of the context lets it have different statistics at each byte position in a dword (or whatever). This is very useful for coding structured data such as arrays of floats. This idea has been around for a long time, but older coders don't do it and it certainly is part of the advantage on array/structured data. The bottom bits of position are also used as part of the context for the match flag, and also the "is last match 0 long" flag. Other match-related coding events don't use it.
In theory you should figure out what the local repetition period is and use that; LZMA doesn't make any effort to do that and just always uses N bits of position (I think N=2 is a typical good value).
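Here's a sketch of how I understand the literal context to be formed: a few low bits of position plus the top bits of the previous byte pick a set of probabilities, and then within that set each bit coded so far is shifted into a bit-tree index. The parameter names lc and lp are what I believe the LZMA SDK calls them; the function names and the exact packing are my assumptions, and the actual arithmetic coding of each bit is left out.

```cpp
#include <assert.h>
#include <stdint.h>

// Pick the probability set for a literal : lp low bits of position, lc high bits of prev byte.
inline unsigned LiteralContext(unsigned pos, uint8_t prevByte, int lc, int lp)
{
    unsigned posBits  = pos & ((1u << lp) - 1);
    unsigned prevBits = (unsigned)prevByte >> (8 - lc);
    return (posBits << lc) + prevBits;
}

// Record which bit-tree slot each of the 8 bits of a literal is coded with (top bit first).
inline void LiteralBitContexts(uint8_t literal, unsigned out[8])
{
    unsigned ctx = 1; // leading 1 marks how many bits have been shifted in so far
    for (int i = 7; i >= 0; --i)
    {
        unsigned bit = (literal >> i) & 1u;
        out[7 - i] = ctx;        // the probability slot used for this bit
        ctx = (ctx << 1) | bit;  // already-coded bits become context for the next one
    }
}
```

With lp = 2, positions 0,1,2,3 of each dword get separate statistics, which is the thing that helps on arrays of floats.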
Lengths and Offsets are coded in what seems to be mostly a pretty standard log2-ish type coding (like Zip and others). Offsets are coded as basically the position of their MSB and then the remaining bits. The MSB is context-coded with the length of the match as context; this captures length-offset correlation. Then, the bottom 4 bits of the offset are sent, binary arithmetic coded on each other in reverse order (bottom bit first). This lets you capture things like a fixed structure in offsets (eg. all offsets are multiples of 4 or 8). The bits between the MSB and the bottom 4 are sent without compression.
The binary match flags are context coded using the "state" , which is the position in an internal finite state machine. It is :
LZMA state machine :
Literal :
state < 7 :
normal literal
state >= 7 :
delta literal
state [0-3] -> state = 0
state [4-9] -> state -= 3 ([1-6])
else state -= 6 [10-11] -> ([4-5])
Match :
rep0
len 1 :
state -> < 7 ? 9 : 11
len > 1 :
state -> < 7 ? 8 : 11
rep12
state -> < 7 ? 8 : 11
normal
state -> < 7 ? 7 : 10
// or from Igor Pavlov's code :
static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5};
static const int kMatchNextStates[kNumStates] = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates] = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};
Okay, this is the first funny unique thing. State basically tells you what the last few coding operations have been. As you send matches, state gets larger, as you send literals, state gets smaller. In particular, after any literal encoding state is < 7, and after any match encoding it is >= 7. Then above and below that it tells you something about how many literals or matches you've recently encoded. For example :
initial state = 5
code a normal match -> 7
code a rep match -> 11
code a literal -> 5
code a literal -> 2
code a literal -> 0
Now it's unclear to me whether this funny state machine thing is really a huge win as a context; presumably it is tweaked out to be an advantage, but other coders have used the previous match flags as context for the match flag coding (eg. was the last thing a match is one bit, take the last three that gives you 8 states of previous context), which seems to me to have about the same effect.
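The example walk above can be replayed directly from the four tables. The tables are the ones quoted from Igor Pavlov's code; the LzmaOp enum and the two helper functions are just my wrappers for illustration.

```cpp
#include <assert.h>

enum LzmaOp { kOpLiteral, kOpMatch, kOpRep, kOpShortRep }; // hypothetical wrapper enum

static const int kNumStates = 12;
static const int kLiteralNextStates[kNumStates]  = {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5};
static const int kMatchNextStates[kNumStates]    = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]      = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates] = {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

// Advance the state after coding one operation.
inline int NextState(int state, LzmaOp op)
{
    switch (op)
    {
    case kOpLiteral:  return kLiteralNextStates[state];
    case kOpMatch:    return kMatchNextStates[state];
    case kOpRep:      return kRepNextStates[state];
    default:          return kShortRepNextStates[state];
    }
}

// The last operation was a match of some kind, so the next literal is a delta literal.
inline bool NextLiteralIsDelta(int state) { return state >= 7; }
```

Starting from 5 : match gives 7, rep gives 11, then three literals give 5, 2, 0, same as the walk above.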
There is one funny and clever thing here though, and that's the "delta literal". Any time you code a literal immediately after a match, the state will be >= 7 so you will code a delta literal. After that state will fall below 7 so you will code normal literals. What is a delta literal ?
Delta literals are coded as :
char literal = *ptr;
char * lastPosPtr = ptr - lastOffset;
char delta = literal ^ *lastPosPtr;
that is, the character is xor'ed with the character at the last coded match offset away from current pointer (not at the last coded pos,
the last offset, that's important for structured data).
When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better. For another, these characters are in fact guaranteed to *NOT* match. If they did match, then the preceding match would have just been one longer. And that's what's cool about this.
Immediately after a match, you are in a funny position : you have a long preceding context which matches some other long context in the file (where the match was). From PPM* and LZP we know that long contexts are very strong predictors - but we also know that we have failed to match that character! If we just use the normal literal coder, we expect the most likely character to be the one that we just failed to match, so that would be a big waste of code space. So instead, we use this delta literal coder which will let us statistically exclude the zero.
Okay, I think that's it for how the coding works. A few more tidbits :
The match finder in 7zip appears to be a pretty standard hash-to-binary-tree. It uses a hash to find the most recent occurrence of the first few chars of the current string, that points to a node in the binary tree, and then it walks the binary tree to find all matches. The details of this are a little bit opaque to me, but I believe it walks backwards in order, and it only finds longer matches as it walks back. That is, it starts at the lowest offset occurrence of the substring and finds the match length for that, then it steps to the next later one along the binary tree, and finds a match *if longer*. So it doesn't find all offsets, it presumes that larger offsets are only interesting if their matches are longer. (I'm a little unclear on this so this could be wrong).
One thing I can't figure out is how the binary tree is maintained with the sliding window.
ADDENDUM : I just found it described in "Handbook of Data Compression By David Salomon, Giovanni Motta, David (CON) Bryant". My description above of the binary tree was basically right. It is built in the "greedy" way : new nodes are added at the top of the tree, which means that when you are searching down the tree, you will always see the lowest offset possible for a given match length first, so you only need to consider longer lengths. Also since older nodes are always deeper in the tree, you can slide the window by just killing nodes and don't have to worry about fixing the tree. Of course the disadvantage is the tree can be arbitrarily unbalanced, but that's not a catastrophe, it's never worse than just a straight linked list, which is the alternative.
The big piece I'm missing is how the optimal parse works. It's a forward optimal parse which explores a limited branch space (similar to previous work that was done in Quantum and LZX). When it saves state in the optimal parse tree, it only updates the FSM "state" variable and the last 3 offsets, it doesn't update the whole context-arithmetic state. At each position it appears to consider the cost of either a literal, a match, or a "lazy" match (that's a literal and then the following match), but I haven't figured out the details yet. It seems to optimal parse in 4k chunks, maybe it updates the arithmetic state on those boundaries. I also see there are lots of heuristics to speed up the optimal parse, assumptions about certain coding decisions being cheaper than others without really testing them, hard-coded things like (if offset > (1 << 14) && length < 7) which surely helps. If anyone has figured it out, please help me out.
ADDENDUM : here's an illustration of how the special LZMA modes help on structured data. Say you have a file of structs; the structs are 72 bytes long. Within each struct are a bunch of uint32, floats, stuff like that. Within something like a float, you will have a byte which is often very correlated, and some bytes that are near random. So we might have something like :
[00,00,40,00] [7F 00 3F 71] ... 72-8 bytes ... [00,00,40,00] [7E 00 4C 2F]
... history ...                                * start here

we will encode :
00,00,40,00 : 4 byte match at offset 72 (offset 72 is probably offset0 so this is a rep0 match)
7E : delta literal encode 7E ^ 7F = 1
00 : one byte match to offset 72 (rep0)
4C : delta literal encode 4C ^ 3F = 0x73
2F : regular literal
Also because of the position and state-based coding, if certain literals occur often in the same spot in the pattern, that will be captured very well.
Note that this is not really the "holy grail" of compression which is a compressor that figures out the state-structure of the data and uses that, but it is much closer than anything in the past. (eg. it doesn't actually figure out that the first dword of the structure is a float, and you could easily confuse it, if your struct was 73 bytes long for example, the positions would no longer work in simple bottom-bits cycles).
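The struct walkthrough above can be checked mechanically with a toy version. To keep it small I've shrunk the struct to 8 bytes, so the repeat offset is 8 instead of 72; the byte values are the ones from the illustration, and the helper names and kSample are made up.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Two 8-byte "structs" : the second one is where we start encoding.
static const uint8_t kSample[16] = {
    0x00, 0x00, 0x40, 0x00,  0x7F, 0x00, 0x3F, 0x71,   // history
    0x00, 0x00, 0x40, 0x00,  0x7E, 0x00, 0x4C, 0x2F }; // data being coded

// Length of the match at pos against the data `offset` bytes back.
inline size_t MatchLen(const uint8_t *buf, size_t pos, size_t offset, size_t end)
{
    size_t len = 0;
    while (pos + len < end && buf[pos + len] == buf[pos + len - offset]) ++len;
    return len;
}

// Delta literal : xor with the byte at the last match offset. Right after a maximal
// match this can never be zero, because a zero would have extended the match.
inline uint8_t DeltaLiteral(const uint8_t *buf, size_t pos, size_t lastOffset)
{
    return (uint8_t)(buf[pos] ^ buf[pos - lastOffset]);
}
```

At position 8 you get a 4 byte match, then delta literal 0x01, then a 1 byte rep0 match, then delta literal 0x73, and finally 2F goes out as a normal literal.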
The goal is to make this work without taking any speed hit. There are lots of little tricks to make this happen. For example, the LZ decode match copier is allowed to trash up to 8 bytes past where it thinks the end is. This lets me do a lot fewer bounds checks in the decode. To prevent actual trashing then, I just make the encoder never emit a match within 8 bytes of the end of a chunk. Similarly, the Huffman decoder can be made to always output a symbol in a finite number of steps (never infinite loop or access a table out of bounds). You can do this just by doing some checks when you build your decode tables, then you don't have to do any checks in the actual decode loop.
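Roughly what a "sloppy" match copier looks like, as a sketch and not the actual RAD code. It copies in 8 byte chunks with no per-byte end check, so it can write up to 7 bytes past the end of the match; the caller has to guarantee that slack exists (which is what the encoder-side rule buys you). Handling small offsets with a plain byte loop is my simplification.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Copy `len` bytes from `offset` bytes back. May write up to 7 bytes past dst+len.
inline void CopyMatchSloppy(uint8_t *dst, size_t offset, size_t len)
{
    const uint8_t *src = dst - offset;
    if (offset >= 8)
    {
        // chunks never overlap themselves because src is at least 8 bytes behind dst
        uint8_t *end = dst + len;
        while (dst < end)
        {
            memcpy(dst, src, 8);
            dst += 8; src += 8;
        }
    }
    else
    {
        // overlapping match (eg. offset 1 = run of one byte) : must go byte by byte
        for (size_t i = 0; i < len; ++i) dst[i] = src[i];
    }
}
```

The only check in the loop is the chunk-level dst < end, and the bounds check against the output buffer can be done once per match instead of once per byte.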
So, how do we make sure that it actually works? To prove that it is 100% fuzz resilient, you would have to generate every possible bit stream of every possible length and try decoding them all. Obviously that is not possible, so we can only try our best to find bad cases. I have a couple of strategies for that.
Random stomps. I just stomp on the compressed data in some random way and then run the decoder and see what happens (it should fail but not crash). I have a test loop set up to do this on a bunch of different files and a bunch of different stomp methods.
Just stomping random bytes in turns out to not be a very good way to find failures - that type of corruption is actually one of the easiest to handle because it's so severe. So I have a few different stomp modes : insert random byte, insert 00, insert FF, flip one bit, and the same for shorts, dwords, qwords. Often jamming in a big string of 00 or FF will find cases that any single byte insert won't. I randomize the location of the stomp but prefer very early position ones, since stomping in the headers is the hardest to handle. I randomize the number of stomps.
One useful thing I do is log each stomp in the form of C code before I do it. For example I'll print something like :
compBuf[ 906 ] ^= 1 << 3;
compBuf[ 61 ] ^= 1 << 3;
compBuf[ 461 ] ^= 1 << 4;
then if that does cause a crash, I can just copy-paste that code to have a repro case. I was writing out the stomped buffers to disk to have repro cases, but that is an unnecessary slowdown; I'm currently running 10,000+ stomp tests.
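A minimal sketch of the log-as-C-code trick for the single bit flip mode. StompBitAndLog is a made-up helper; it builds the repro line first, then applies the stomp, and returns the line so the caller can print it before running the decoder.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string>

// Flip one bit of compBuf and return a line of C code that reproduces the flip.
inline std::string StompBitAndLog(uint8_t *compBuf, size_t pos, int bit)
{
    char line[64];
    snprintf(line, sizeof(line), "compBuf[ %u ] ^= 1 << %d;", (unsigned)pos, bit);
    compBuf[pos] ^= (uint8_t)(1 << bit);
    return line;
}
```

The important part is the ordering in the test loop : print (and flush) the returned line before you call the decoder, otherwise the one stomp that crashes is exactly the one you never see.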
(note to self : to do this, run main_lz -t -z -r1000)
Okay, so that's all very nice, but you can still easily miss failure cases. What I really want is something that gives me code coverage to tell that I've handled corrupted data in all the places where I read data. So I stole an idea from relacy :
Each place I get a byte (or bits) from the compressed stream, I replace :
U8 byte = *ptr++;
with
U8 byte = *ptr++; FUZZ(byte);
// I wanted to do this but couldn't figure out how to make it work :
// U8 byte = FUZZ( *ptr++ );
(and similar for getting bits). Now, what the FUZZ macros do is this :
The first time they are encountered, they register their location with the FuzzManager. They are then a disabled possible fuzz location. Each one is given a unique Id.
I then start making passes to try to fuzz at all possible locations. To do this, each fuzz location is enabled one by one, then I rerun the decompressor and see if that location was in fact hit. If a fuzz location is enabled, then the FUZZ macro munges the value and returns it (using all the munge modes above), and if it's disabled it just passes the byte through untouched.
Once I try all single-munges, I go back and try all dual munges. Again in theory you should try all possible multi-fuzz sequences, but that's intractable for anything but trivial cases, and also it would be very odd to have a problem that only shows up after many fuzzes.
As you make passes, you can encounter new code spots, and those register new locations that have to be covered.
Again, a nice thing I do is before each pass I log C code that will reproduce the action of that pass, so that if there is a problem you can directly reproduce it. In this case, it looks like :
Fuzz : 1/36
rrFuzz_StartPass_Select(".\compress\rrLZHDecompress.cpprrLZHDecoder_DecodeSome",351010,3,1,0x28502CBB);
In order to have reproducibility, I use FILE/LINE to identify the fuzz location, not an index, since the index can change from run to run
based on the code path taken. Also, note that I don't actually use FILE/LINE because I have FUZZ in macros and templates - I use __FUNCDNAME__
so that two versions of a template get different tags, and I use __COUNTER__ so that macros which cause multiple fuzzes to occur at the same
original code line get different location numbers. eg. this works :
#define A() do { U8 t = *ptr++; FUZZ(t); } while(0)
#define B() A(); A();
template < int i > void func() { B(); }
void main()
{
func< 0 >();
func< 1 >();
}
// there should be 4 separate unique FUZZ locations registered :
/*
I log :
rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318000) = 0;
rrFuzz_Register(".\main_lz.cpp|??$func@$0A@@@YAXXZ",1318001) = 1;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318000) = 2;
rrFuzz_Register(".\main_lz.cpp|??$func@$00@@YAXXZ",1318001) = 3;
*/
As usual I'm not sure how to get the same thing in GCC. (maybe __PRETTY_FUNCTION__ works? dunno).
The actual FUZZ macro is something like this :
#define FUZZ_ID __FILE__ "|" __FUNCDNAME__ , __LINE__*1000 + __COUNTER__
#define FUZZ( word ) do { static int s_fuzzIndex = rrFuzz_Register(FUZZ_ID); if ( rrFuzz_IsEnabled(s_fuzzIndex) ) { word = rrFuzz_Munge(word); } } while(0)
The only imperfection at the moment is that FUZZ uses a static to register a location, which means that locations that are never visited at all never get registered, and then I can't check to see if they were hit or not. It would be nice to find a solution for that. I would like it to call Register() in _cinit, not on first encounter.
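For anyone who wants to play with the idea outside of MSVC, here is a stripped-down portable sketch. Everything in it (FuzzSite, FuzzSites, the Fuzz_ functions, the toy DecodeTwo) is invented for illustration and is not the rrFuzz code. It only uses __FILE__ and __LINE__ for the id, so it does not get the per-template and per-macro-expansion uniqueness that __FUNCDNAME__ and __COUNTER__ give above, and it has a single munge mode.

```cpp
#include <assert.h>
#include <stdint.h>
#include <string>
#include <vector>

struct FuzzSite { std::string id; bool enabled; int hits; };

inline std::vector<FuzzSite> &FuzzSites() { static std::vector<FuzzSite> s; return s; }

// Called once per site, the first time execution reaches it.
inline int Fuzz_Register(const char *file, int line)
{
    FuzzSite site = { std::string(file) + "|" + std::to_string(line), false, 0 };
    FuzzSites().push_back(site);
    return (int)FuzzSites().size() - 1;
}

// Counts the visit (so the driver can tell the site was reached) and reports the enable flag.
inline bool Fuzz_IsEnabled(int index)
{
    FuzzSites()[index].hits++;
    return FuzzSites()[index].enabled;
}

inline uint8_t Fuzz_Munge(uint8_t v) { return (uint8_t)(v ^ 0xFF); } // one munge mode only

#define FUZZ(word) do { \
    static int s_fuzzIndex = Fuzz_Register(__FILE__, __LINE__); \
    if (Fuzz_IsEnabled(s_fuzzIndex)) { word = Fuzz_Munge(word); } } while (0)

// Toy "decoder" with two fuzzable reads.
inline int DecodeTwo(const uint8_t *ptr)
{
    uint8_t a = *ptr++; FUZZ(a);
    uint8_t b = *ptr++; FUZZ(b);
    return a + b;
}
```

A driver would run the decoder once clean to register sites, then loop over FuzzSites(), enable one at a time, rerun, and check that hits went up. It has the same imperfection as the real thing : a site that is never reached never registers.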
Anyway, this kind of system is pretty handy for any code coverage / regression type of thing.
(note to self : to do this, define DO_FUZZ_TEST and run main_lz -t -r1000)
ADDENDUM : another practical tip that's pretty useful. For something small and complex like your headers, or your Huffman tree, or whatever, you might have a ton of consistency checks to do to make sure they're really okay. In that case, it's usually actually faster to just go ahead and run a CRC (*) check on them to make sure they aren't corrupted, then skip most of the validation checks.
On the primary byte stream we don't want to do that because it's too slow, but for headers the simplicity is worth it.
(*) not actually a CRC because doing byte-by-byte table lookups is crazy slow on some game platforms. There are other robust hashes that are faster. I believe Bob Jenkins' Lookup3 is probably the best and fastest, since we have platforms that can't do multiplies fast (ridiculous but true), so many of the hashes that are fast on x86 like Murmur2 are slow on consoles.
cbloom rants 10-05-08 - 1
cbloom rants 10-06-08 - 1
cbloom rants 10-08-08 - 1
So, WTF I'm going insane. Anyway, here are some more links :
encode.ru : How fast should be a range coder
ctxmodel.net : Context Modelling
CiteSeerX : Arithmetic coding , Langdon 79
Sachin Garg : 64-bit Range Coding and Arithmetic Coding
One random thing I should note is that if you have 64 bit registers, you can let range go between 2^32 and 2^64 , and output 32 bits at a time.
ADDENDUM : another random thing that occurs to me : if you're doing an fpaq0p-style sloppy binary arith coder where range is allowed to get quite small, you can actually do a few encodes or decodes in a row without checking for renormalization. What you would have to do is first do a *proper* renorm check that handles the underflow from straddling the middle case (which it normally doesn't handle) so that you are sure you have >= 24 bits in your range. Then, you can do several binary arithmetic codes, as long as the total probability shift is <= 24. For example, you could do two codes with 12 bits of probability precision, or 3 with 8 bits. Then you check renorm again. Probably the most sensible is doing two 12-bit precision codes, so you are able to do a renorm check once per two codes rather than every code. Of course then you do have to handle carries.
The whole situation with car safety (eg. raising door panels, Volvo's auto-stop research) is sort of like if you had a problem with people running around shooting each other, and your solution is to make everyone wear bullet proof vests. How about guns with built in digital cameras that detect if you're pointing them at a human and then run mood detection to tell if they're hostile or not and block firing. (of course arguing that cars shouldn't be safer is absurd).
I'm really rather bothered by the whole idea of an "accident". It's almost never actually an accident, it's usually gross misconduct by one (or more) parties. The fact that you just exchange insurance and it gets paid for completely distorts the reality of punishment that would lead to different behaviors. In particular there should be a party at fault and they should lose their license. Though this fantasy is a bit unrealistic since we know well that punishment for rare events doesn't actually change behavior.
However, the importance of it was missed when it came out. For many years afterwards people continued to publish "improvements" to Huffman decoding (such as Sub-linear Decoding of Huffman Codes Almost In-Place (1998) ) which are just pure useless shit (I don't mean to single out that paper, there were probably hundreds of shitty foolish pointless papers on "efficient huffman" written after Moffat/Turpin).
Most people in the implementation community also missed this paper (eg. zlib, JPEG, etc. people who make important use of huffman decodes have missed these techniques).
I missed it too. Recently we did a lot of work on Huffman decoding at RAD, and after trying many techniques and lots of brainstorming, we came up with what we thought was a brilliant idea :
Store the code in your variable bit input word left-justified in a register. The Huffman codes are numerically arranged such that for codes of any given length, leaves are lower values than branches. Then, the code for the first branch of each codelen can be left-justified in a word, and your entire Huffman decode consists of :
while ( bits >= huff_branchCodeLeftAligned[codeLen] ) codeLen++;
return ( (bits>>(WORD_SIZE-codeLen)) - baseCode[ codeLen ] );
(this returns a symbol in "canonical order" - that is most probable is 0 ; if your symbols are not in order from most to least probable, you need an additional table lookup to reorder them).
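Here is a small self-contained sketch of that decode loop plus the table build, under my own assumptions : 32-bit left-justified input word, max code length 16, codes assigned in the usual canonical way (shorter codes get numerically lower values), and 64-bit branch limits so the limit of the longest length can be 2^32. The struct and function names are mine. The build step rejects any set of lengths that isn't a complete prefix code, which is what guarantees the unchecked while loop terminates.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <vector>

struct HuffDecoder
{
    enum { kMaxLen = 16, kWordSize = 32 };
    uint64_t branchCodeLeftAligned[kMaxLen + 1]; // first non-leaf code of each length, left-justified
    uint32_t baseCode[kMaxLen + 1];              // first code of each length minus its canonical index
    std::vector<int> canonicalToSymbol;          // reorder table : canonical index -> symbol
};

// Returns false unless codeLens (0 = unused symbol) form a complete prefix code.
inline bool BuildHuffDecoder(const std::vector<int> &codeLens, HuffDecoder *d)
{
    uint32_t count[HuffDecoder::kMaxLen + 1] = { 0 };
    for (size_t s = 0; s < codeLens.size(); ++s)
    {
        if (codeLens[s] < 0 || codeLens[s] > HuffDecoder::kMaxLen) return false;
        count[codeLens[s]]++;
    }
    uint32_t code = 0, index = 0;
    d->canonicalToSymbol.clear();
    for (int len = 1; len <= HuffDecoder::kMaxLen; ++len)
    {
        code <<= 1;                          // first code of this length
        d->baseCode[len] = code - index;
        code += count[len];                  // now the first branch value at this length
        index += count[len];
        if (code > (1u << len)) return false; // over-subscribed
        d->branchCodeLeftAligned[len] = (uint64_t)code << (HuffDecoder::kWordSize - len);
        for (size_t s = 0; s < codeLens.size(); ++s)
            if (codeLens[s] == len) d->canonicalToSymbol.push_back((int)s);
    }
    return code == (1u << HuffDecoder::kMaxLen); // complete : every bit pattern decodes
}

// bits holds the next input bits left-justified; no bounds checks needed after a good build.
inline int DecodeSymbol(const HuffDecoder &d, uint32_t bits, int *codeLenOut)
{
    int codeLen = 1;
    while (bits >= d.branchCodeLeftAligned[codeLen]) codeLen++;
    *codeLenOut = codeLen;
    return d.canonicalToSymbol[(bits >> (HuffDecoder::kWordSize - codeLen)) - d.baseCode[codeLen]];
}
```

For code lengths {3,1,3,2} the codes come out as sym1 = 0, sym3 = 10, sym0 = 110, sym2 = 111, and the loop takes at most two extra steps to find the length.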
This is really incredibly fucking hot. Of course it's obvious that it can be improved in various ways - you can use a fast table to skip the first few steps, you can use a nextCodeLen table to skip blanks in the codeLen sequence, and you can use a binary search instead of linear search. For known-at-compile-time huffman trees you could even optimize the binary search for the probability distribution of the codes and generate machine code for the decoder directly.
All of those ideas are in the Moffat+Turpin paper.
I also found this : Anatomy of ROLZ data archiver , which is the only actual algorithm description I've ever found of ROLZ , since Ilia doesn't write up his work. (there's also a brief description at the Russian Wikipedia ).
Anyway, it's pretty obvious how you would do ROLZ, there are few unexpected cool things on the "Anatomy of ROLZ data archiver" page.
1. The way he keeps the lists of offsets for each context by just stepping back through the history of the file already processed is pretty cool. It means there's no actual separate [context][maxoffsets] table at all, the offsets themselves are pointers back through a linked list. It also means that you can do sliding-window trivially.
2. In the BALZnoROLZ.txt file he has Ilia Muraviev's binary probability updater :
//This is predictor of Ilya Muraviev
class TPredictor {
private:
unsigned short p1, p2;
public:
TPredictor(): p1(1 << 15), p2(1 << 15) {}
~TPredictor() {}
int P() {
return (p1 + p2);
}
void Update(int bit) {
if (bit) {
p1 += (unsigned short)(~p1) >> 3;
p2 += (unsigned short)(~p2) >> 6;
}
else {
p1 -= p1 >> 3;
p2 -= p2 >> 6;
}
}
};
First of all, let's back up a second, what is this? It's a probability update for binary arithmetic coding. A very standard way to do fast probability updates for binary arithmetic coding is to do :
#define PROB_ONE (1<<14) // or whatever
#define PROB_UPD_SHIFT (6) // or something
prob = PROB_ONE >> 1; // start at 1/2
if ( bit ) prob += (PROB_ONE - prob) >> PROB_UPD_SHIFT;
else prob -= prob >> PROB_UPD_SHIFT;
what this is doing is when you get a zero bit :
prob *= (1 - 2^-PROB_UPD_SHIFT);
that's equivalent to a normal counting probability update if you put :
n1 = prob*N
n0 = N - n1
when I get a zero bit n0++ and N++
prob = n1 / N
so update is prob := prob*N / (N+1)
or prob *= N / (N+1)
so N/(N+1) = (1 - 2^-PROB_UPD_SHIFT)
which means N = (2^PROB_UPD_SHIFT - 1)
then you keep prob and reset N; that is, this update is equivalent to pretending you have such an n0 and N and you increment them and compute the new probability, but then you don't actually store N, so the next update will have the same weight (if N increased then each update has a smaller effect than the last). This is an IIR filter that acts a bit like a moving average of the last N. The larger N is, the bigger window we are effectively using. A small N adapts very quickly.
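The equivalence is easy to check numerically. This is a small sketch in floating point (so it ignores the integer truncation of the real shift update); the function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Shift-style update on a zero bit, in floating point: prob *= (1 - 2^-shift).
double ShiftUpdateZero(double prob, int shift) {
    assert(shift > 0);
    return prob * (1.0 - std::ldexp(1.0, -shift));
}

// Shift-style update on a one bit: prob += (1 - prob) * 2^-shift.
double ShiftUpdateOne(double prob, int shift) {
    assert(shift > 0);
    return prob + (1.0 - prob) * std::ldexp(1.0, -shift);
}

// Counting update: pretend we have N observations with n1 = prob*N ones,
// add the new bit, and recompute the probability of a one.
double CountingUpdate(double prob, double N, int bit) {
    double n1 = prob * N + (bit ? 1.0 : 0.0);
    return n1 / (N + 1.0);
}
```

With shift = 6 and N = 2^6 - 1 = 63, both updates take prob = 0.3 to 0.2953125 on a zero bit and to 0.3109375 on a one bit.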
So Ilia's probability update is a 2^3-1 and 2^6-1 window size, and then averaged. That's a very simple and neat idea that never occurred to me - use two simple probability estimators, one that adapts very fast and one that adapts more slowly, and just blend them.
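Here is the same two-rate idea restated as a small standalone sketch with fixed-width types, so you can poke at how the fast and slow estimators move. The struct and member names are mine, not from the BALZ source; the shifts (3 and 6) and the 16-bit state are the ones in the predictor above.

```cpp
#include <cassert>
#include <cstdint>

struct TwoRatePredictor {
    uint16_t fast = 1 << 15;  // adapts quickly, window of about 2^3 - 1
    uint16_t slow = 1 << 15;  // adapts slowly, window of about 2^6 - 1

    // 17-bit probability that the next bit is 1 (sum of the two 16-bit estimates).
    int P() const { return fast + slow; }

    void Update(int bit) {
        assert(bit == 0 || bit == 1);
        if (bit) {
            fast += (uint16_t)(~fast) >> 3;
            slow += (uint16_t)(~slow) >> 6;
        } else {
            fast -= fast >> 3;
            slow -= slow >> 6;
        }
    }
};
```

Starting from 1/2, a single 1 bit moves the fast estimator from 32768 to 36863 while the slow one only moves to 33279, so the blend reacts immediately but settles toward the long-run rate.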
This is something well known by "practitioners of the art" but I've never seen it displayed explicitly, so here we go. We're talking about arbitrary-alphabet decoding here obviously, not binary, and static probability models mostly.
Let's start with our Huffman decoder. (a bit of review here or here or here ). For simplicity and symmetry, we will use a Huffman decoder that can handle code lengths up to 16, and we will use a table-accelerated decoder. The decoder looks like this :
// look at next 16 bits (but don't consume them)
U32 peek = BitInput_Peek(16);
// use peek to look in decode tables :
int sym = huffTable_symbol[peek];
// use symbol to get actual code length :
int bits = symbol_codeLength[sym];
// and remove that code length from the bit stream :
BitInput_Consume(bits);
this is very standard (more normally the huffTable would only accelerate the first 8-12 bits of decode, and you would then fall back to
some other method for codes longer than that). Let's expand out what Peek and Consume do exactly. For symmetry to the arithcoder I'm going to keep my bit buffer
right-aligned in a big-endian word.
int bits_bitLen = // # of bits in word
U32 bits_code = // current bits in word
BitInput_Peek(16) :
{
ASSERT( bits_bitLen >= 16 );
U32 ret = bits_code >> (bits_bitLen - 16);
return ret;
}
BitInput_Consume(bits) :
{
bits_bitLen -= bits;
bits_code &= (1 << bits_bitLen)-1;
while ( bits_bitLen < 16 )
{
bits_code <<= 8;
bits_code |= *byteStream++;
bits_bitLen += 8;
}
}
it should be obvious what these do; _Peek grabs the top 16 bits of code for you to snoop. Consume removes the top "bits" from code, and
then streams in bytes to refill the bits while we are under count. (to repeat again, this is not how you should actually implement bit
streaming, it's slower than necessary).
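For anyone who wants to step through it, here is a compilable sketch of the Peek/Consume bit input above together with a full 16-bit decode table. The names (BitInput, HuffTables, BuildTables, DecodeOne) and the use of std::vector are choices made for this example; the table is built assuming canonical codes, shortest first and in symbol order within a length. The input buffer needs a few padding bytes at the end because the refill reads ahead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitInput {
    const uint8_t* byteStream;
    uint32_t code = 0;  // current bits, right-aligned
    int bitLen = 0;     // number of valid bits in code

    explicit BitInput(const uint8_t* p) : byteStream(p) { Refill(); }

    void Refill() {
        while (bitLen < 16) {
            code = (code << 8) | *byteStream++;
            bitLen += 8;
        }
    }
    uint32_t Peek16() const {
        assert(bitLen >= 16);
        return code >> (bitLen - 16);
    }
    void Consume(int bits) {
        bitLen -= bits;
        code &= (1u << bitLen) - 1;  // drop the consumed top bits
        Refill();
    }
};

struct HuffTables {
    std::vector<uint8_t> symbolOfPeek;     // 65536 entries, indexed by Peek16()
    std::vector<uint8_t> codeLenOfSymbol;  // indexed by symbol
};

HuffTables BuildTables(const std::vector<uint8_t>& codeLens) {
    HuffTables t;
    t.symbolOfPeek.assign(1u << 16, 0);
    t.codeLenOfSymbol = codeLens;
    uint32_t next = 0;  // next unassigned slot in the 16-bit peek space
    for (int len = 1; len <= 16; len++) {
        for (size_t sym = 0; sym < codeLens.size(); sym++) {
            if (codeLens[sym] != len) continue;
            uint32_t span = 1u << (16 - len);  // peeks that start with this code
            for (uint32_t i = 0; i < span; i++) t.symbolOfPeek[next + i] = (uint8_t)sym;
            next += span;
        }
    }
    assert(next == (1u << 16));  // sketch assumes a complete code
    return t;
}

int DecodeOne(BitInput& in, const HuffTables& t) {
    int sym = t.symbolOfPeek[in.Peek16()];
    in.Consume(t.codeLenOfSymbol[sym]);
    return sym;
}
```

With code lengths {1,2,3,3} the codes are 0, 10, 110, 111, so the symbols 0,1,2,3 pack into the bytes 0x5B 0x80.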
Okay, now let's look at an Arithmetic decoder. (a bit of review here or here and here ). First let's start with the totally generic case. Arithmetic Decoding consists of getting the probability target, finding what symbol that corresponds to, then removing that symbol's probability range from the stream. This is :
AC_range = size of current arithmetic interval
AC_code = value in range specified
Arithmetic_Peek(cumulativeProbabilityTotal) :
{
r = AC_range / cumulativeProbabilityTotal;
target = AC_code / r;
return target;
}
Arithmetic_Consume(cumulativeProbabilityLow, probability, cumulativeProbabilityTotal)
{
AC_range /= cumulativeProbabilityTotal;
AC_code -= cumulativeProbabilityLow * AC_range
AC_range *= probability;
while ( AC_range < minRange )
{
AC_code <<= 8;
AC_range <<= 8;
AC_code |= *byteStream++;
}
}
Okay it's not actually obvious that this is a correct arithmetic decoder (the details are quite subtle) but it is; and in fact this is
just about the fastest arithmetic decoder in the world (the only thing you would do differently in real code is share the divide by cumulativeProbabilityTotal
so it's only done once).
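If you want to convince yourself that the decoder really works, the easiest way is to pair it with an encoder and round-trip something. The sketch below does that. The decoder follows the Peek/Consume pseudocode above; everything else is an assumption of this example: the encoder (a simple carry-propagating range encoder), the initial range of 0xFFFFFFFF, minRange = 2^24, the requirement that the probability total is at most 2^16, and the linear symbol search standing in for a real lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const uint32_t kMinRange = 1u << 24;  // assumed renormalization threshold

// Hypothetical matching encoder, only here so the decoder can be exercised.
struct ArithEncoder {
    uint64_t low = 0;  // 32-bit window, a carry shows up in bit 32
    uint32_t range = 0xFFFFFFFFu;
    std::vector<uint8_t> out;

    void ShiftOutByte() {
        out.push_back((uint8_t)(low >> 24));
        low = (low & 0x00FFFFFFu) << 8;
    }
    void Encode(uint32_t cumLow, uint32_t prob, uint32_t total) {
        range /= total;
        low += (uint64_t)cumLow * range;
        range *= prob;
        if (low >> 32) {  // push the carry into bytes already written
            for (size_t i = out.size(); i-- > 0;)
                if (++out[i] != 0) break;
            low &= 0xFFFFFFFFu;
        }
        while (range < kMinRange) { ShiftOutByte(); range <<= 8; }
    }
    void Flush() { for (int i = 0; i < 4; i++) ShiftOutByte(); }
};

struct ArithDecoder {
    const uint8_t* byteStream;
    uint32_t range = 0xFFFFFFFFu;
    uint32_t code = 0;

    explicit ArithDecoder(const uint8_t* p) : byteStream(p) {
        for (int i = 0; i < 4; i++) code = (code << 8) | *byteStream++;
    }
    uint32_t Peek(uint32_t total) const {
        uint32_t r = range / total;
        return code / r;
    }
    void Consume(uint32_t cumLow, uint32_t prob, uint32_t total) {
        range /= total;
        code -= cumLow * range;
        range *= prob;
        while (range < kMinRange) {
            code = (code << 8) | *byteStream++;
            range <<= 8;
        }
    }
};

// cum has one more entry than there are symbols; cum.back() is the total.
std::vector<int> RoundTrip(const std::vector<int>& msg, const std::vector<uint32_t>& cum) {
    const uint32_t total = cum.back();
    assert(total > 0 && total <= (1u << 16));
    ArithEncoder enc;
    for (int s : msg) enc.Encode(cum[s], cum[s + 1] - cum[s], total);
    enc.Flush();

    ArithDecoder dec(enc.out.data());
    std::vector<int> decoded;
    for (size_t i = 0; i < msg.size(); i++) {
        uint32_t target = dec.Peek(total);
        int s = 0;
        while (cum[s + 1] <= target) s++;  // linear search stands in for a table
        dec.Consume(cum[s], cum[s + 1] - cum[s], total);
        decoded.push_back(s);
    }
    return decoded;
}
```

For example, with cumulative probabilities {0,5,7,8} (symbol probabilities 5/8, 2/8, 1/8) any message over the three symbols should come back unchanged from RoundTrip.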
Now, the problem of taking the Peek target and finding what symbol that specifies is actually the slowest part, there are various solutions, Fenwick trees, Deferred Summation, etc. For now we are talking about *static* coding, so we will use a table lookup.
To decode with a table we need a table from [0,cumulativeProbabilityTotal] which can map a probability target into a symbol. So when we get a value from _Peek we look it up in a table to get the symbol, cumulativeProbabilityLow, and probability.
To speed things up, we can use cumulativeProbabilityTotal = a power of two to turn the divide into a shift. We choose cumulativeProbabilityTotal = 2^16. (the longest code we can write with our arithmetic coder then has code length -log2(1/cumulativeProbabilityTotal) = 16 bits).
So now our static table-based arithmetic decode is :
Arithmetic_Peek() :
{
r = AC_range >> 16;
target = AC_code / r;
}
int sym = arithTable_symbol[target];
int cumProbLow = cumProbTable[sym];
int cumProbHigh = cumProbTable[sym+1];
Arithmetic_Consume()
{
AC_range >>= 16;
AC_code -= cumProbLow * AC_range;
AC_range *= (cumProbHigh - cumProbLow);
while ( AC_range < minRange )
{
AC_code <<= 8;
AC_range <<= 8;
AC_code |= *byteStream++;
}
}
Okay, not bad, and we still allow arbitrary probabilities within the [0,cumulativeProbabilityTotal] , so this is more general than the Huffman decoder.
But we still have a divide which is very slow. So if we want to get rid of that, we have to constrain a bit more :
Make each symbol probability a power of 2, so (cumProbHigh - cumProbLow) is always a power of 2 (< cumulativeProbabilityTotal). We will then store the log2
of that probability range. Let's do that explicitly :
Arithmetic_Peek() :
{
r = AC_range >> 16;
target = AC_code / r;
}
int sym = arithTable_symbol[target];
int cumProbLow = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];
Arithmetic_Consume()
{
AC_range >>= 16;
AC_code -= cumProbLow * AC_range;
AC_range <<= cumProbLog2;
while ( AC_range < minRange )
{
AC_code <<= 8;
AC_range <<= 8;
AC_code |= *byteStream++;
}
}
Now the key thing is that since we only ever >> to shift AC_range down or << to shift it up, if it starts a power of 2, it stays a power of 2. So we will
replace AC_range with its log2 :
Arithmetic_Peek() :
{
r = AC_log2Range - 16;
target = AC_code >> r;
}
int sym = arithTable_symbol[target];
int cumProbLow = cumProbTable[sym];
int cumProbLog2 = log2Probability[sym];
Arithmetic_Consume()
{
AC_code -= cumProbLow << (AC_log2Range - 16);
AC_log2Range += (cumProbLog2 - 16);
while ( AC_log2Range < min_log2Range )
{
AC_code <<= 8;
AC_log2Range += 8;
AC_code |= *byteStream++;
}
}
we only need a tiny bit more now. First observe that an arithmetic symbol of log2Probability is written in (16 - log2Probability) bits, so let's call
that "codeLen". And we'll rename AC_log2Range to AC_bitlen :
Arithmetic_Peek() :
{
peek = AC_code >> (AC_bitlen - 16);
}
int sym = arithTable_symbol[peek];
int codeLen = sym_codeLen[sym];
int cumProbLow = sym_cumProbTable[sym];
Arithmetic_Consume()
{
AC_code -= cumProbLow << (AC_bitlen - 16);
AC_bitlen -= codeLen;
while ( AC_bitlen < 16 )
{
AC_code <<= 8;
AC_bitlen += 8;
AC_code |= *byteStream++;
}
}
let's compare this to our Huffman decoder (just copying down from the top of the post and reorganizing a bit) :
BitInput_Peek() :
{
peek = bits_code >> (bits_bitLen - 16);
}
// use peek to look in decode tables :
int sym = huffTable_symbol[peek];
// use symbol to get actual code length :
int codeLen = sym_codeLen[sym];
BitInput_Consume() :
{
bits_code &= (1 << (bits_bitLen - codeLen))-1;
bits_bitLen -= codeLen;
while ( bits_bitLen < 16 )
{
bits_code <<= 8;
bits_bitLen += 8;
bits_code |= *byteStream++;
}
}
you should be able to see the equivalence.
There's only a small difference left. To remove the consumed bits, the arithmetic coder does :
int cumProbLow = sym_cumProbTable[sym];
AC_code -= cumProbLow << (AC_bitlen - 16);
while the Huffman coder does :
bits_code &= (1 << (bits_bitLen - codeLen))-1;
which is obviously simpler. Note that the Huffman remove can be written as :
code = peek >> (16 - codeLen);
bits_code -= code << (bits_bitLen - codeLen);
What's happening here - peek is 16 bits long, it's a window in the next 16 bits of "bits_code". First
we make "code" which is the top "codeLen" of "peek". "code" is our actual Huffman code for this symbol.
Then we know the top bits of bits_code are equal to code, so to turn them off, rather than masking we can
subtract. The equivalent cumProbLow is code<<(16-codeLen). This is the equivalence of the Huffman code
to taking the arithmetic probability range [0,65536] and dividing it in half at each tree branch.
The arithmetic coder had to look up cumProbLow in a table because it is still actually a bit more general than the Huffman decoder. In particular our arithmetic decoder can still handle probabilities like [1,2,4,1] (with cumProbTot = 8). Because of that the cumProbLows don't hit the nice bit boundaries. If you require that your arithmetic probabilities are always sorted [1,1,2,4], then since they are power of two and sum to a power of two, each partial power of two must be present, so the cumProbLows must all hit bit boundaries like the huffman codes, and the equivalence is complete.
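The bit-boundary claim can be checked mechanically. In this sketch symbol i gets probability 2^(16 - codeLens[i]) out of 2^16, and we test whether every cumProbLow has its low (16 - codeLen) bits clear, which is exactly the condition for cumProbLow >> (16 - codeLen) to be a codeLen-bit Huffman code. The function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// cumProbLow of each symbol when symbol i has probability 2^(16 - codeLens[i]).
std::vector<uint32_t> CumProbLows(const std::vector<int>& codeLens) {
    std::vector<uint32_t> lows;
    uint32_t cum = 0;
    for (int len : codeLens) {
        assert(len >= 1 && len <= 16);
        lows.push_back(cum);
        cum += 1u << (16 - len);
    }
    assert(cum == (1u << 16));  // probabilities must sum to the total
    return lows;
}

// True if every cumProbLow is a multiple of its own symbol's probability.
bool LowsHitBitBoundaries(const std::vector<int>& codeLens) {
    std::vector<uint32_t> lows = CumProbLows(codeLens);
    for (size_t i = 0; i < codeLens.size(); i++) {
        uint32_t lowBitsMask = (1u << (16 - codeLens[i])) - 1;
        if (lows[i] & lowBitsMask) return false;
    }
    return true;
}

// The Huffman code implied by an aligned cumProbLow.
uint32_t CodeFromCumProbLow(uint32_t cumProbLow, int codeLen) {
    return cumProbLow >> (16 - codeLen);
}
```

Code lengths {1,2,3,3} (probabilities sorted high to low) and {3,3,2,1} (sorted low to high, the [1,1,2,4] case) both pass, and the first gives back the codes 0, 10, 110, 111. Lengths {3,2,1,3}, which is the [1,2,4,1] case, fail at the second symbol.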
So, you should now see clearly that a Huffman and Arithmetic coder are not completely different things. They are a continuum on the same scale. If you start with a fully general Arithmetic coder it is flexible, but slow. You then constrain it in various ways step by step, it gets faster and less general, and eventually you get to a Huffman coder. But those are not the only coders in the continuum, you also have things like "Arithmetic coder with fixed power of two probability total but non-power-of-2 symbol probabilities" which is somewhere in between in space and speed.
BTW not directly on topic, but I found this in my email and figure it should be in public :
Well, Adaptive Huffman is awful, nobody does it. So you have a few options :
Static Huffman -
very fast
code lengths must be transmitted
can use table-based decode
Arithmetic with static probabilities scaled with total = a power of 2
very fast
can use table-based decode
must transmit probabilities
decode must do a divide
Arithmetic semi-adaptive
"Deferred Summation"
doesn't transmit probabilities
Arithmetic fully adaptive
must use Fenwick tree or something like that
much slower, coder time no longer dominates
(symbol search in tree time dominates)
Arithmetic by binary decomposition
can use fast binary arithmetic coder
speed depends on how many binary events it takes to code symbols on average
It just depends on your situation so much. With something like image or
audio coding you want to do special-cased things like turn amplitudes
into log2+remainder, use a binary coder for the zero, perhaps do
zero-run coding, etc. stuff to avoid doing the fully general case of a
multisymbol large alphabet coder.
It means I can make command line test apps for regression and profiling and just run them and pass in file name arguments for testing on.
and yes I know I can use clib or do it myself or whatever, but still, WTF ?
void rrSPU_MemSet16(void * ptr,qword pattern,int count)
{
qword * RADRESTRICT p = (qword *) ptr;
char * end = ((char *)ptr + count );
while ( (char *)p < end )
{
*p++ = pattern;
}
}
(and yes I know this could be faster, this is the simple version for readability).
The interesting thing has been that taking a 16-byte pattern as input actually makes it way more useful than normal byte memset. I can now also memset shorts and words, floats, doubles, and vectors! So this is now the way I do any array assignments when a chunk of consecutive elements are the same. eg. instead of doing :
float fval = param;
float array[16];
for(int i=0;i<16;i++) array[i] = fval;
you do :
float array[16];
qword pattern = (qword) spu_splats(fval); // the pattern has to be replicated to all 4 floats
rrSPU_MemSet16(array,pattern,sizeof(array));
In fact it's been so handy that I'd like to have it on other platforms, at least up to U32.
Of course in practice, it can be quite important, particularly because we don't actually just send one huffman tree per file. All serious compressors that use huffman resend the tree every so often. For example, to compress bytes what you might do is extend your alphabet to [0..256] inclusive, where 256 means "end of block" , when you decode a 256, you either are at the end of file, or you read another huffman tree and start on the next block. (I wrote about how the encoder might make these block split decisions here ).
So how might you send a Huffman tree?
For background, you obviously do not want to actually send the codes. The Huffman code value should be implied by the symbol identity and the code length. The so-called "canonical" codes are created by assigning codes in numerical counting up order to symbols of the same length in their alphabetical order. You also don't need to send the character counts and have the decoder make its own tree, you send the tree directly in the form of code lengths.
So in order to send a canonical tree, you just have to send the code lens. Now, not all symbols in the alphabet may occur at all in the block. Those technically have a code length of "infinite" but most people store them as code length 0 which is invalid for characters that do occur. So you have to code :
which symbols occur at all
which code lengths occur
which symbols that do occur have which code length
Now I'll go over the standard ways of doing this and some ideas for the future.
The most common way is to make the code lengths into an array indexed by symbol and transmit that array. Code lengths are typically in [1,31] (or even less [1,16] , and by far most common is [4,12]), and you use 0 to indicate "symbol does not occur". So you have an array like :
{ 0 , 0 , 0 , 4 , 5 , 7 , 6, 0 , 12 , 5 , 0, 0, 0 ... }
1. Huffman the huffman tree ! This code length array is just another array of symbols to compress - you can of course just run your huffman encoder on that array. In a typical case you might have a symbol count of 256 or 512 or so, so you have to compress that many bytes, and then your "huffman of huffmans" will have a symbol count of only 16 or so, so you can then send the tree for the secondary huffman in a simpler scheme.
2. Delta from neighbor. The code lens tend to be "clumpy" , that is , they have correlation with their neighbors. The typical way to model this is to subtract each (non-zero) code len from the last (non-zero) code len, thus turning them into deltas from neighbors. You can then take these signed deltas and "fold up" the negatives to make them unsigned and then use one of the other schemes for transmitting them (such as huffman of huffmans). (actually delta from an FIR or IIR filter of previous is better).
3. Runlens for zeros. The zeros (does not occur) in particular tend to be clumpy, so most people send them with a runlen encoder.
4. Runlens of "same". LZX has a special flag to send a bunch of codelens in a row with the same value.
5. Golomb or other "variable length coding" scheme. The advantage of this over Huffman-of-huffmans is that it can be adaptive, by adjusting the golomb parameter as you go. (see for example on how to estimate golomb parameters). The other advantage is you don't have to send a tree for the tree.
6. Adaptive Arithmetic Code the tree! Of course if you can Huffman or Golomb code the tree you can arithmetic code it. This actually is not insane; the reason you're using Huffman over Arithmetic is for speed, but the Huffman will be used on 32k symbols or so, and the arithmetic coder will only be used on the 256-512 or so Huffman code lengths. I don't like this just because it brings in a bunch more code that I then have to maintain and port to all the platforms, but it is appealing because it's much easier to write an adaptive arithmetic coder that is efficient than any of these other schemes.
BTW That's a general point that I think is worth stressing : often you can come up with some kind of clever heuristic bit packing compression scheme that is close to optimal. The real win of adaptive arithmetic coding is not the slight gain in efficiency, it's the fact that it is *so* much easier to compress anything you throw at it. It's much more systematic and scientific, you have tools, you make models, you estimate probabilities and compress them. You don't have to sit around fiddling with "oh I'll combine these two symbols, then I'll write a run length, and this code will mean switch to a different coding", etc.
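As a concrete example of scheme 2 (delta from neighbor, then folding the negatives up), here is a minimal sketch. Zeros are skipped on the assumption that they are sent some other way (eg. runlens), and the starting "previous" value is a parameter because the source doesn't say what to use for the first symbol; 8 below is just a guess in the middle of the typical range. All names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fold a signed delta to unsigned : 0,-1,1,-2,2,... -> 0,1,2,3,4,...
uint32_t FoldSigned(int delta) {
    return delta >= 0 ? 2u * (uint32_t)delta : 2u * (uint32_t)(-delta) - 1u;
}

int UnfoldSigned(uint32_t folded) {
    return (folded & 1) ? -(int)((folded + 1) / 2) : (int)(folded / 2);
}

// One folded delta per non-zero code length, each relative to the previous
// non-zero code length (startLen for the first one).
std::vector<uint32_t> FoldedDeltas(const std::vector<int>& codeLens, int startLen) {
    std::vector<uint32_t> out;
    int prev = startLen;
    for (int len : codeLens) {
        assert(len >= 0);
        if (len == 0) continue;  // "does not occur" is coded separately
        out.push_back(FoldSigned(len - prev));
        prev = len;
    }
    return out;
}

// Inverse : occurs[i] says whether symbol i has a non-zero code length.
std::vector<int> RestoreCodeLens(const std::vector<bool>& occurs,
                                 const std::vector<uint32_t>& folded, int startLen) {
    std::vector<int> lens;
    int prev = startLen;
    size_t next = 0;
    for (bool present : occurs) {
        if (!present) { lens.push_back(0); continue; }
        prev += UnfoldSigned(folded[next++]);
        lens.push_back(prev);
    }
    return lens;
}
```

On the example array { 0,0,0,4,5,7,6,0,12,5,0,0,0 } with startLen 8, the non-zero lengths 4,5,7,6,12,5 become deltas -4,1,2,-1,6,-7, which fold to 7,2,4,1,12,13.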
Okay, that's all standard review stuff, now let's talk about some new ideas.
One issue that I've been facing is that coding the huffman tree in this way is not actually very nice for the decoder to be able to very quickly construct trees. (I wind up seeing the build tree time show up in my profiles, even though I only build trees 5-10 times per 256k symbols). The issue is that it's in the wrong order. To build the canonical huffman code, what you need is the symbols in order of codelen, from lowest codelen to highest, and with the symbols sorted by id within each codelen. That is, something like :
codeLen 4 : symbols 7, 33, 48
codeLen 5 : symbols 1, 6, 8, 40 , 44
codeLen 7 : symbols 3, 5, 22
...
obviously you can generate this from the list of codelens per symbol, but it requires a reshuffle which takes time.
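For reference, the reshuffle in question is basically a counting sort on code length. A minimal sketch (the function name and the use of std::vector are mine; a real decoder would use fixed arrays) :

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the symbols that occur, ordered by code length and then by symbol id.
// codeLens[sym] == 0 means the symbol does not occur.
std::vector<int> SymbolsByCodeLen(const std::vector<int>& codeLens, int maxLen) {
    // start[len] will become the output position of the first symbol of that length
    std::vector<int> start(maxLen + 2, 0);
    for (int len : codeLens) {
        assert(len >= 0 && len <= maxLen);
        if (len > 0) start[len + 1]++;
    }
    for (int len = 1; len <= maxLen; len++) start[len + 1] += start[len];

    std::vector<int> sorted(start[maxLen + 1]);
    for (size_t sym = 0; sym < codeLens.size(); sym++) {
        int len = codeLens[sym];
        if (len > 0) sorted[start[len]++] = (int)sym;
    }
    return sorted;
}
```

It's two passes over the alphabet plus one over the lengths, and the scatter pass is the part that does the random-ish writes.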
So, maybe we could send the tree directly in this way?
One approach is through counting combinations / enumeration . For each codeLen, you send the # of symbols that have that codeLen. Then you have to select the symbols which have that codelen. If there are M symbols of that codeLen and N remaining unclaimed symbols, the number of ways is N!/(M!*(N-M)!) , and the number of bits needed to send the combination index is log2 of that. Note in this scheme you should also send the positions of the "not present" codeLen=0 group, but you can skip sending the entire group that's largest. You should also send the groups in order of smallest to largest (actually in order or *complement* order, a group that's nearly full is as good as a group that's nearly empty).
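To get a feel for the cost, this small sketch computes log2 of N!/(M!*(N-M)!) as a running sum so nothing overflows. It only estimates the size of the combination index, it doesn't encode one, and the function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Bits needed to pick which M of the N remaining symbols have a given
// code length : log2( N! / (M! * (N-M)!) ).
double CombinationBits(int N, int M) {
    assert(N >= 0 && M >= 0 && M <= N);
    double bits = 0.0;
    for (int i = 0; i < M; i++)
        bits += std::log2((double)(N - i)) - std::log2((double)(i + 1));
    return bits;
}
```

For example choosing 2 of 4 has 6 possibilities, about 2.585 bits. The complement symmetry is also visible here : CombinationBits(256,250) is the same as CombinationBits(256,6), which is why a nearly full group is as cheap as a nearly empty one.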
I think this is an okay way to send huffman trees, but there are two problems : 1. it's slow to decode a combination index, and 2. it doesn't take advantage of modelling clumpiness.
Another similar approach is binary group flagging. For each codeLen, you want to specify which remaining symbols are of that codelen or not of that codelen. This is just a bunch of binary off/on flags. You could send them with a binary RLE coder, or the elegant thing would be Group Testing. Again the problem is you would have to make many passes over the stream and each time you would have to exclude already done ones.
(ADDENDUM : a better way to do this which takes more advantage of "clumpiness" is like this : first code a binary event for each symbol to indicate codelen >=1 (vs. codeLen < 1). Then, on the subset that is >= 1, code an event for is it >= 2, and so on. This is the same number of binary flags as the above method, but when the clumpiness assumption is true this will give you flags that are very well grouped together, so they will code well with a method that makes coherent binary smaller (such as runlengths)).
Note that there's another level of redundancy that's not being exploited by any of these coders. In particular, we know that the tree must be a "prefix code" , that is satisfy Kraft, (Sum 2^-L = 1). This constrains the code lengths. (this is most extreme for the case of small trees; for example with a two symbol tree the code lengths are completely specified by this; on a three symbol tree you only have one free choice - which one gets the length 1, etc).
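The Kraft constraint is cheap to check in integers if you scale everything by 2^16. A sketch, assuming code lengths up to 16 and 0 meaning "does not occur" (names invented for the example) :

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sum of 2^-L over all symbols that occur, scaled by 2^16.
uint64_t KraftSum16(const std::vector<int>& codeLens) {
    uint64_t sum = 0;
    for (int len : codeLens) {
        assert(len >= 0 && len <= 16);
        if (len > 0) sum += (uint64_t)1 << (16 - len);
    }
    return sum;
}

// A full prefix code with no wasted code space satisfies Sum 2^-L = 1.
bool IsCompleteCode(const std::vector<int>& codeLens) {
    return KraftSum16(codeLens) == ((uint64_t)1 << 16);
}
```

So for two symbols {1,1} is the only set of lengths that passes, and for three symbols the only shape is {1,2,2} in some order; something like {2,2,2} leaves a quarter of the code space unused.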
Another idea is to use MTF for the codelengths instead of delta from previous. I think that this would be better, but it's also slower.
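For clarity about what "MTF for the codelengths" means here, a minimal sketch : keep a list of all possible code lengths, send the position of each code length in that list, then move it to the front. The initial list order (0,1,2,...) and the names are assumptions of this example; the list search and shuffle per symbol is where the extra time goes.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Replace each code length by its current position in a move-to-front list.
std::vector<int> MtfRanks(const std::vector<int>& codeLens, int maxLen) {
    std::vector<int> order(maxLen + 1);
    std::iota(order.begin(), order.end(), 0);  // initial list : 0,1,2,...,maxLen
    std::vector<int> ranks;
    for (int len : codeLens) {
        assert(len >= 0 && len <= maxLen);
        int rank = (int)(std::find(order.begin(), order.end(), len) - order.begin());
        ranks.push_back(rank);
        order.erase(order.begin() + rank);  // move the value to the front
        order.insert(order.begin(), len);
    }
    return ranks;
}
```

On clumpy input like 4,4,5,4,5,5 the ranks come out as 4,0,5,1,1,0 : after the first appearance of each value, everything is a 0 or a 1.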
Finally when you're sending multiple trees in a file you should be able to get some win by making the tree relative to the previous one, but I've found this is not trivial.
I've tried about 10 different schemes for huffman tree coding, but I'm not going to have time to really solve this issue, so it will remain neglected as it always has.
NOTE :
To make an int that lives in a vector, here are some rules :
1. spu_promote() is good (or si_from_int which is the same thing). eg.
vec_int4 v = spu_promote(i,0);
2. Store is bad. eg.
vec_int4 v; v[0] = i;
3. Array initialization from a variable is bad. eg.
vec_int4 v = (vec_int4) { i }
The first one will give you a vector with i in the top word and *don't care* in the rest. The latter two will generate code that preserves the values in the bottom parts of the register. That will lead to a bunch of random cwd's and shufbs and shit like that popping up in your code in unexpected places.
Here's a helper class that you can use when you have an int on the stack which you need to force to actually be treated as a vector. When you look at your disasm and you see shufb's for no reason, toss these guys in to replace C ints.
struct RR_ALIGN_SPU rrSPU_VectorInt
{
qword m_qw;
operator int() const { return ((vec_int4)m_qw)[0]; }
void operator = (int x) { m_qw = si_from_int(x); }
};
(the main time you will need to do this by hand is for arrays, eg. int x[4]; will be shit, so use rrSPU_VectorInt x[4]);
BONUS SPU FUCKITUDE :
So I wrote a mini profiler for the SPU that tracks some times and copies it back to the PPU. It's super lightweight, and for the most part I found that when I enabled it I lost maybe 0.1% of my speed. Today I disabled profiling and expected a little speed boost and found ... my app is now slower without profiling. I'm at 140 MB/sec without profiling, 142 MB/sec with profiling. So maybe I'll ship with the profiler enabled.
This is a typical but particularly ridiculous example of what I've seen all along - random changes cause something else to happen in the code gen which can either be good or bad - and it's not by a trivial amount either.
ADDENDUM : I found out what this one was. If I replace my profile macros with GCC_SCHEDULE_BREAK macros, I get the high speed. So the profiling was helping by causing scheduler breaks (due to the asm mtfb presumably).
For example say you write something like memcpy - (not as an intrinsic but as a function). Most of the time you're okay with it just being a function, but in some little routine that is a super hot spot you want to be able to say :
for ( .. a million .. )
{
.. important stuff
__forceinline memcpy(to,from,size);
}
(and the opposite __notinline).
More generally Sean once mentioned just the idea of being able to mark up "yeah really optimize this" or
"this is done often" parts of the code, so that could suffice as well.
At the moment the only way I know how to do this is some ugly shit like :
__forceinline void inl_myfunc()
{
.. write code here ..
}
void call_myfunc()
{
inl_myfunc();
}
then clients can choose to call inl_myfunc or call_myfunc. Ugly.
C99 cleaned up the inline/extern spec so you can get compilation of non-inlined "inline" functions in only one place, but it failed
to let the client specify whether or not it should inline or not.
BTW it should be evidently clear that the standard compiler inlining heuristic of using complexity is totally wrong. Little shitlet functions that happen to call memcpy should *not* get it inlined, and my big complex LZ decoder function *should*. In fact there's just no way for the compiler to know when it's a good idea to inline or not because it doesn't have information about how often a spot in code is hit.
Restrict continues to cause me no end of annoyance; I'm working on some chunk of code that I know is all alias-free, but I look at the disasm and I see it's doing pointless loads and stores. Okay, WTF, I forgot to put restrict on something. Now I have to randomly browse around my code and type definitions and try to find the one that's missing restrict. That's fucking retarded for workflow. I should just be able to say __restrict { } over my chunk of code.
On my work machine, I currently have VC 2003,2005,2008, and 2010 installed. Which one should the CPP/H file open in?
The right answer is obviously : whichever one is currently running.
And of course there's no way to do that.
Almost every day recently I've had a moment where I am working away in VC ver. A , and I double click some CPP file, and it doesn't pop up, and I'm like "WTF", and then I hear my disk grinding, and I'm like "oh nooes!" , and then the splash box pops up announcing VC ver. B ; thanks a lot guys, I'm really glad you started a new instance of your gargantuan hog of a program so that I could view one file.
Actually if I don't have a DevStudio currently running, then I'd like CPP/H to just open in notepad. Maybe I have to write my own tool to open CPP/H files. But even setting the file associations to point at my own tool is a nightmare. Maybe I have to write my own tool to set file associations. Grumble.
(for the record : I have 2008 for Xenon, 2005 is where I do most of my work, I keep 2003 to be able to build some old code that I haven't ported to 2005 templates yet, and 2010 to check it out for the future).
2. Shufb is not mod 32. Presumably because it has those special 0 and FF selections. This is not a huge deal because you can fix it by doing AND splats(31) , but that is enough slowdown that it ruins shufb as a fast path for doing byte indexing. (people use rotqby instead because that is mod 16).
A related problem for Shufb is that there's no Byte Add. If you had byte add, and shufb was mod 32, then you could generate grabbing a 16-byte portion of two quads by adding an offset.
In order to deal with this, you have to first mod your offset down to [0,15] so that you won't overflow, then you have to see if your original offset before modding had the 16 bit set, and if so, swap the reg order you pass to shuffle. If you had a byte add and shuffle was mod 32, you wouldn't have to do any of that and it would be way faster to do grabs of arbitrary qwords from two neighboring qwords.
(also there's no fast way to generate the typical shuffle mask {010203 ...}. )
3. You can make splats a few different ways; one is to rely on the "spu_splats" builtin, which figures out how to spread the value (usually using FSMB or some such variant). But you can also write it directly using something like :
(vec_int4) { -1,-1,-1,-1 }
which the compiler will either turn into loading a constant or some code to generate it, depending on what it thinks is faster.
Some important constants are hard to generate, the most common being (vec_uchar16){0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} , so you might want to load that into a register and then pass it around if you need it much. (if you are using the stdlib memcpy at all, you can also get it from __Shuffles).
4. Something that I always want with data compression and bit manipulation is a "shift right with 1 bits coming in".
Another one is bit extract and bit insert , grab N bits from position M. Also just "mask with N bits" would always be nice, so I
don't have to do ((1<
6. memcpy seems to be treated differently than __builtin_memcpy by the compiler. They both call the same code, but sometimes calling
memcpy() doesn't inline, but __builtin_memcpy does. I don't know WTF is up with that (and yes I have -fbuiltins set and all that shit,
but maybe there's some GCC juju I'm missing). Also, it is a decent SPU-optimized memcpy, but
the compiler doesn't have a true "builtin"
knowledge of memcpy for SPU the way it does on some platforms - eg. it doesn't know that for types that are inherently 16-byte aligned
it can avoid the rollup/rolldown, it just always calls memcpy. So if you're doing a lot of memcpys of 16-byte aligned stuff you should
probably have your own.
7. This one is a fucking nightmare : I put in some "scheduling barriers" (asm volatile empty) to help me look at the disassembly for
optimization (it makes it much easier to trace through because shit isn't spread all over). After debugging a bit, I forgot to take
out the scheduling barriers and ran my speed test again -
it was faster.
Right now my code is 5-10% faster with a few scheduling barriers manually inserted (19 clocks per symbol vs 21).
That's fucking bananas, it means the compiler's
scheduler is fucking up somehow. It sucks because there's no way for me to place them in a systematic way, I have to randomly
try moving them around and see what's fastest.
8. Lots of little things are painfully important on the SPU. For example, obviously most of your functions should be inline, but
sometimes you have to make them forceinline (and it's a HUGE difference if you don't because once a val goes through the stack it
really fucks you), but then sometimes it's faster if it's not inline. Another nasty one is manual scheduling. Obviously the
compiler does a lot of scheduling for you (though it fucks up as noted previously), but in many cases it can't figure out that two
ops are actually commutative. eg. say you have something like A() which does shared++ , and B() also does shared++ , then the compiler
can't tell that the sequence {A B} could be reordered to {B A} , so you have to manually try both ways. Another case is :
This kind of shit affects every platform of course, but on a nice decent humane architecture like a Core iX/x86 it's < 1%. On SPU it's 5-10%
which is nothing to sneeze at and forces you to do this annoying shit.
9. One thing that's annoying in general about the SPU (and even to some extent the PPU) is that some code patterns can be *SO* slow that
your non-inner-loop parts of the code dominate.
On the wonderful PC/x86 , you basically can just find your hot spots and optimize those little routines, and ignore all your glue and
startup code, cuz it will be reasonably fast. Not true on the SPU. I had some setup code that built some acceleration tables that's
only called once, and then an inner loop that's called two hundred thousand times. The setup code takes 10% of the total time. On the PC
it doesn't even show up as a blip. (On the Xenon you can get this same sort of thing if your setup code happens to do signed-16-bit loads,
variable shifts, or load-hit-stores). On the SPU my setup code that was so bad was doing lots of byte reads & writes, and lots of branches.
The point is on the PC/x86 you can find 10% of your code base that you have to worry about for speed, and the other 90% you just make work.
On the SPU you have to be aware of speed on almost all of your code. Blurg that blows so bad.
10. More generally, I found it just crucial to have my SPU ELF reloader running all the time. Every time I touch the code in any way, I hit
"build" and then look over and see my speed, because my test app is just running on the PS3 all the time reloading the SPU ELF and running
the test. Any time you do the most innocuous thing - you move some code around, change some variable types - check the speed, because it
may have made a surprisingly big difference. (*)
(* = there is a disadvantage to this style of optimization, because it does lead you into local minima; that is, I wind up being a greedy
perturbation search in the code space; some times in order to find real minima you have to take a big step backwards).
11. For testing, be careful about what args you pass to system_init ; make sure you aren't sharing your SPUs with the management processes.
If you see unreliable timing, something is wrong. (5,1) seems to work okay.
I guess it's a durable material, but shoe soles (like car tires) have the property that durable and grippy are inherently
opposite. I want the softest rubber you can get on my feet. In fact I would love to get just regular shoes that have that super grippy
climbing shoe rubber that wears off in like a week. Fine, I'll replace them periodically, but at least in the mean time I will have soft
sticky contact with the ground.
When I was young I would reinvent the wheel for everything stupidly. Then I went through a maturation phase where I cut back on that and
tried to use existing solutions and be sensible. But now I'm starting to see that a certain judicious amount of "Casey Method" is a good
idea for certain programmers.
For example, if I had to write a game level editor tool in C# the straightforward way, it would probably be the most efficient way, but
that's irrelevant, because I would probably kill myself before I finished, or just quit the job, so the actual time to finish would
be infinite. On the other hand, if I invented a system to write the tool for me, and it was fun and challenging, it might take longer
in theory, and not be the most efficient/sensible/responsible/mature way to go, but it would keep me engaged enough that I would actually
do it.
Shelving with just one workspace as a way to save your work seems to work okay, but I've been using it to make temp checkins to go
between my various computers, and it's not awesome for that. I guess what I really have to do is make real branches for that, but
branches scare the bejeesus out of me.
One annoyance is that since only the top word can be used for things like loads, you often lose your SIMD. Like say you want to generate
4 addresses and do 4 loads, you can do your math on 4 channels, but then you have to copy the results out of channels 1-3 into other
registers in slot 0 (using shuf or rotqby), and this often is more expensive than just doing the math 4 separate times. (because
the instructions to transfer across lanes are slower than math).
To make use of the static branch predictor, organize branches so the likely part is first (the "if" part is likely, the "else" is unlikely). You can override
this with if (__builtin_expect(expr, 1/0)) , I use these :
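A typical pair of wrappers looks something like the sketch below. The names LIKELY and UNLIKELY and the count_valid example are illustrative, not necessarily what this code base uses; the only real API here is GCC's __builtin_expect.

```cpp
// Hypothetical macro names; __builtin_expect is the GCC/Clang builtin.
#if defined(__GNUC__)
#define LIKELY(expr)   __builtin_expect(!!(expr), 1)
#define UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#define LIKELY(expr)   (expr)
#define UNLIKELY(expr) (expr)
#endif

// Counts the non-negative entries; negative values are assumed to be rare.
int count_valid(const int* values, int count)
{
    int valid = 0;
    for (int i = 0; i < count; i++)
    {
        if (UNLIKELY(values[i] < 0))
            continue; // rare path, hinted as not taken
        valid++;
    }
    return valid;
}
```

The !! is there so that any non-zero expression compares equal to the expected value 1.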
Now, in theory you can also do dynamic branch prediction. If you can compute the predicate for your branch early enough, you can see if it will be taken or not
and either hint or not hint (obviously not using a branch, but using a cmov/select). I have not found a way to make the compiler generate this, so this seems to
be reserved for pure assembly in super high performance code.
ADDENDUM : just found an important issue with the dual pipe; instruction fetch is on odd pipe (obvious I guess, all fetch is on odd pipe).
That means when you overload the odd pipe, not only are you failing to dual issue - you are fighting with instruction fetch for time. That's
a disaster. So pretty much any time you can replace a load/store/rotq/shufb/etc. with a math op on even pipe, it's a win, even if you need
2-3 math instructions to do the same odd pipe instruction.
Anyway, the point is if you want to really get serious, you have to map out your dependency chains and look at the latency
of each step and try to minimize that, rather than just trying to reduce instructions.
More troubling is the fact that you cannot easily trick the compiler to fix it. You might think that this :
The only reliable way I have found is to go ahead and use the qword intrinsics
To be clear, what I would like is a way to say "I'm going to only work on this as an int, but please keep it in a
vector and ignore the bottom 3 components and generate the code that is fastest to give me the [0] result" , but there
doesn't seem to be a reliable way to make the compiler do that. I dunno, maybe there's a way to get this.
But more than that, work is like a mental tic that I sometimes indulge in. It's an autistic fugue. You go into this hole where all you
can think about is technical issues, and it's horrible, but it also sort of feels good. Like taking a poo, or playing with a loose tooth.
Then when I get into this state I just can't stop. I try to relax with N, but all I can think about is spu_shufb and should I be using
_align_hint ? And does DMA invalidate ll-sc reservations? And I have to go back to work.
I imagine it's a bit like having OCD. It's not like the OCD guy really wants to count the lines in the wood grain. But if he walks into
the room and doesn't count them, it just eats at his brain - "must count wood grain" - repeating over and over. It's not like working
really makes me happy; after a day of solid work I don't feel good, in fact quite the opposite, my brain feels fried from fatigue, and my
body is in great pain from sitting too much, but I can't resist it, and if I don't work I just keep thinking "work work work".
Certainly I think that the Appleophile bloggers who are enamored of "native code" are missing the big picture. It's a damn shame that so
many simple utilitarian apps are being written for specific platforms, when we have pretty decent platform-independent languages.
Obviously speed parity has not been achieved yet, though a lot of it is 1. clueless Java programmers who are doing things like SetPixel() one by one
instead of using higher level APIs, and 2. we're not actually on 8+ cores all the time yet.
The whole MSVC editor-is-my-debugger thing is such a massive win that it really hurts to step back from it to
the bad old days of separate debugger.
I did one thing that sort of helps - I run my test app in a loop, and when it finishes each test, I unload my SPU image,
wait for a key press, then reload it and repeat. This lets me rebuild my SPU ELF and just reload it and repeat. This
avoids a lot of test cycle time, but only works until you have a crash so it fails a lot.
The ability to just do stdin/stdout to/from a console from the PS3 is pretty awesome. It lets me write my test apps as
if they were just command line apps and run them from my PC.
There's a common belief that an empty volatile asm in GCC is a scheduling barrier :
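The construct usually meant is an empty asm statement, often written with a "memory" clobber. A sketch for GCC-style compilers; CompilerBarrier, publish and the two globals are made-up names for illustration:

```cpp
// GCC/Clang extended asm. The plain form is:  __asm__ volatile("");
// The form below adds a "memory" clobber, which tells the compiler that
// memory may have changed, so it cannot keep values cached in registers
// across the statement.
#define CompilerBarrier() __asm__ volatile("" ::: "memory")

static int g_data = 0;
static int g_ready = 0;

void publish(int value)
{
    g_data = value;
    CompilerBarrier(); // intent: emit the g_data store before the g_ready store
    g_ready = 1;
}
```

Neither form emits any instruction, so at best this constrains the compiler; it is not a hardware memory barrier.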
The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to
give guarantees :
gcc.gnu.org/bugzilla/show_bug.cgi?id=17884
or
gcc-patches/2004-10/msg01048.html .
In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile,
but they claim that's a documentation error.
Now they also have the
built-in "__sync" stuff .
But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.
__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also
do not document the semantics well. They say :
"In most cases, these builtins are considered a full barrier"
What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ?
( there also seem to be some bugs in some of the __sync implementations
bugzilla/show_bug.cgi?id=36793 )
For example on PS3 __sync_val_compare_and_swap appears to be :
(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all,
which pursuant to previous blog
Mystery - Does the Cell PPU need Memory Control
might actually be the correct thing).
I also stumbled on this thread where
Linus rants about GCC .
I think the example
someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you
deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually
might be wrapping signed ints in your code right now and not know about it. Say I have some code like :
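One hypothetical example of the kind of accident meant here (not necessarily the one this post had in mind) is a string hash accumulated in a plain int. The sketch below shows the risky form in a comment and a version that does the arithmetic in an unsigned type, where wrapping is well defined:

```cpp
#include <cstdint>

// Hypothetical example. Accumulating in a signed int, e.g.
//     int h = 0;  for (...) h = h * 31 + c;
// overflows for longer strings, which is undefined behavior, so an
// optimizer that assumes "signed ints never wrap" may do surprising things.
// Unsigned arithmetic wraps modulo 2^32 by definition.
uint32_t hash_string(const char* s)
{
    uint32_t h = 0;
    for (; *s; s++)
        h = h * 31u + (uint8_t)(*s);
    return h;
}
```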
See
details on signed-integer-overflow in GCC and
notes about wrapv and
wrapv vs strict-overflow
One of the reasons that I never talk to people is that they almost always bring the conversation down to a low level.
It's one of my greatest frustrations. Of course it happens in politics all the time, I want to talk about something
like how we could actually get better regulation of corporations; obviously just putting Glass-Steagall back in place
would be good, but you have to look at the underlying reason why that went away - too much political influence of
the big banks, too much importance put on GDP; but besides that you have to ask why the free market is not regulating
itself better. Why are private funds investing in hedge funds that charge such high fees and don't actually beat the
S&P 500? Maybe it's just ignorance, or maybe they're getting kickbacks. And why aren't shareholders making sure that
executives do what's in the best interest of the company? I think this is one of the most important things, there's
a big problem with corporate boards and the whole shareholder-election process that isn't being addressed; boards are
full of cronies and are failing in their oversight. When's the last time you ever heard of a board shaking up a major
company because it was being run badly? Never!
Anyway, I want to talk about something interesting, and instead I get, "but regulation is bad" , uhh , okay, that view
is fine, but how about some actual content, if you want to disagree tell me something interesting. No, "mmm I don't like
the idea of bigger government". Umm, okay, why not? how about some ideas about how it could be controlled in other ways.
No. The conversation is brought down to a boring low level.
Or, even when you're talking to smart people, they will often go into this annoying pedantic correction mode; they don't
want to let you be right about anything so they pick some little irrelevant detail to squabble about, like "urm you didn't
actually mean GDP, you meant GNP, that's a common mistake". Umm, okay, maybe that's true, if it is true you could make
your correction interesting by explaining yourself a bit, or you could just shut the fuck up because it's not germane to
the point I'm trying to make and it just drags the conversation down into semantics.
I also really don't like talking to people about a topic that they obviously don't care enough to actually learn.
When I find somebody who really knows something about a topic and can teach me, I am ecstatic, I want to pick their brain,
first of all I want to get references and do the reading so I can get up on the background material because I don't want to
waste their time going over stuff anybody could teach me. I enjoy teaching myself, but find that people who actually want
to learn are very rare. Most people just want to rant about some topic they don't actually know anything about. I'll offer
"well if you'd like to learn I can point you to.." oh no, I'll just rant without learning thank you. Okay, I'm done with this
conversation. Or you get the people who think they're an expert and because of that don't want to listen to anything you
have to say. I mean, I know perfectly well that I think I'm an expert on lots of things that I'm probably not, but I still want
to learn. If you actually have new insight that I haven't figured out, that's fantastic, please give it to me.
In other stressful news, one of the neighbors just bought their child a drum kit.
Despite my past complaints about the god damn home improvers, it's actually a very quiet neighborhood. For one thing
we aren't afflicted by that common Seattle blight of being near "musicians". (Seattle has perhaps the highest per-capita of
amateur bands in the US, which sounds good in theory but is actually fucking terrible, because they practice). It's kind of
amusing when you just go for walks around the neighborhood; in our neighborhood there's a guy who plays accordion about a block
away who's quite good actually, and a guy who jams on electric guitar really loud about two blocks away. (though it doesn't beat
where I lived in SF, where down the block from me some older guys would hang out in their garage and play really good jazz).
Giving a child drums should really be illegal. I mean, you could practice on those "Rock Band" style fake drums until you're decent; once you're
decent the sound is not so bad, but there's something in particular about an instrument being played badly that is just excruciating and hard to
tune out.
Say you need to change a two dimensional array that lives on the stack :
You can take a two dimensional array as function arg in a few reasonable ways :
2-d arrays are indexed [row][col] , which means the first index takes big steps and the second index takes small steps in memory.
(* = if your compiler is some fucking rules nazi, they are microscopically not quite identical, because array[rows][cols] doesn't actually
have to be rows*cols ints all in a row (though I'm not sure how this would ever actually not be the case in practice)).
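As a sketch, three common ways of passing a 2-d array look like this. The function names and the ROWS/COLS constants are made up for the example:

```cpp
const int ROWS = 3;
const int COLS = 4;

// 1. Fixed column count in the type; the row count is passed separately.
void fill_fixed_cols(int array[][COLS], int rows, int value)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < COLS; c++)
            array[r][c] = value;
}

// 2. Reference to the whole array; both dimensions are deduced.
template <int R, int C>
void fill_by_ref(int (&array)[R][C], int value)
{
    for (int r = 0; r < R; r++)
        for (int c = 0; c < C; c++)
            array[r][c] = value;
}

// 3. Flat pointer plus dimensions; element [r][c] lives at r*cols + c,
//    so the row index takes the big steps.
void fill_flat(int* array, int rows, int cols, int value)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            array[r * cols + c] = value;
}
```

Usage for the flat form is fill_flat(&a[0][0], ROWS, COLS, 3) on an int a[ROWS][COLS]; that is the case the footnote is hedging about, since it treats the 2-d array as one contiguous run of ints.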
First of all, let's be clear that any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls).
When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not a moron and I'm doing the dynamic dispatch for a
good reason, because I actually need to do different things on different objects.
The issue is just how you implement it.
How C++ virtual functions are implemented on most compilers :
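Roughly, and written out by hand, a single virtual function amounts to something like the following. This is a sketch of the usual scheme, not any particular compiler's layout; Shape, ShapeVTable and the Area functions are invented for the example.

```cpp
struct Shape;

// One function pointer per virtual function.
struct ShapeVTable
{
    int (*Area)(const Shape* self);
};

struct Shape
{
    const ShapeVTable* vtable; // hidden pointer stored in every object
    int w, h;
};

static int RectArea(const Shape* s) { return s->w * s->h; }
static int TriArea(const Shape* s)  { return (s->w * s->h) / 2; }

// One static table per concrete class.
static const ShapeVTable kRectVTable = { RectArea };
static const ShapeVTable kTriVTable  = { TriArea };

// obj->Area() turns into: load the vtable pointer from the object,
// load the function pointer from the table, then jump through it.
int CallArea(const Shape* s)
{
    return s->vtable->Area(s);
}
```

The call is two dependent loads followed by an indirect jump, and the target is not known until both loads finish.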
Why does this hurt ?
How can virtual calls be removed ?
No C++ compiler that I know of actually does this, because it requires knowledge of the class hierarchy in the whole program. But Java and C#
are doing this now, so hopefully we will get it in C++ soon.
When/if we get this, it is a very powerful optimization, because you can then make functions that take
concrete types and get joint-dispatch where you turn many virtual calls into just one.
An example to be clear :
This is only practical if you have only a few overrides for the function. The shitty thing about this (and many of these patterns)
is that it pushes knowledge of the derived class up to the base. One nice variant of this is :
For example, a lot of bad game engines might have something like "virtual GetPosition()" on the base object. This
at first seems like a good idea - not all objects have positions, and they might want to implement different ways of
reporting it. In fact it is a bad idea, and you should just store m_position as public data, then have the different
object types push their different position into m_position. (For example silly people implement attachments by making
GetPosition() query the parent position then add some offset; instead you should do that computation only once in your
object update and store it into the shared m_position).
Most good games have a chunk of the most common data stored directly in the base object type so it can be accessed without virtuals.
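A minimal sketch of the m_position idea, with invented class names (GameObject, Attachment, Vec3): the attachment computes its position once in its update and stores it in the shared base member, so readers never make a virtual call.

```cpp
struct Vec3 { float x, y, z; };

struct GameObject
{
    Vec3 m_position; // public shared data, read directly with no virtual call

    GameObject() : m_position{0.0f, 0.0f, 0.0f} {}
    virtual ~GameObject() {}
    virtual void Update() {} // one virtual call per object per frame
};

struct Attachment : GameObject
{
    const GameObject* m_parent;
    Vec3 m_offset;

    Attachment(const GameObject* parent, Vec3 offset)
        : m_parent(parent), m_offset(offset) {}

    // Compute parent + offset once and push it into the shared member,
    // instead of recomputing it inside a virtual GetPosition().
    void Update() override
    {
        m_position.x = m_parent->m_position.x + m_offset.x;
        m_position.y = m_parent->m_position.y + m_offset.y;
        m_position.z = m_parent->m_position.z + m_offset.z;
    }
};
```

This assumes parents are updated before their attachments each frame.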
I dunno, maybe I should just go ahead and buy a house up here. It's okay here I guess, though I don't know if I can stand the winters, or certain other
drawbacks. Plus where I want to live if I'm single vs. married with kids is very different.
I know in my head that people who are successful in life are people who just choose a certain plan and commit to it as if they were sure, even though that's
totally illogical and there's no reason to be sure of it. You have to act as if you are going to stick with something; every house you move into you should
treat as if you are going to be there for life. People who hesitate or hedge tend to get nowhere.
I'll give you a particular example to be concrete, though this is something I often want to do. In the RAD LZH stuff I have various compressors. One is a very complex optimal
parser. I want to put that in a separate file. People should be able to just include rrLZH.cpp and it will build and run fine, but the optimal parser will not be available.
If they build in rrLZHOptimal, it should automatically provide that option.
I know how to do this in C++. First rrLZH has a function pointer to the rrLZHOptimal which is statically initialized to NULL. The rrLZHOptimal has a CINIT class which
registers itself and sets that function pointer to the actual implementation.
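In sketch form, with both "files" shown in one listing, it looks something like this. The function and type names are placeholders, not the real RAD code, and the parser body is a stand-in:

```cpp
#include <cstddef>

// ---- rrLZH.cpp (always built) ----
typedef int (*OptimalParseFunc)(const char* data, int size);

// Constant-initialized to NULL, so it is already set before any
// constructor of a static object runs.
OptimalParseFunc g_optimalParse = NULL;

bool OptimalParserAvailable() { return g_optimalParse != NULL; }

// ---- rrLZHOptimal.cpp (optional; only present if you build it in) ----
static int OptimalParse(const char* data, int size)
{
    (void)data;
    return size; // stand-in for the real parser
}

// The "CINIT" object: its constructor runs during static initialization
// and installs the implementation.
struct OptimalParseRegistrar
{
    OptimalParseRegistrar() { g_optimalParse = OptimalParse; }
};
static OptimalParseRegistrar s_registrar;
```

Nothing in rrLZH.cpp references s_registrar, which is why a linker pulling objects out of a lib can decide to drop it.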
This works just fine (it's a standard C++ self-registration paradigm), but it has a few problems in practice :
1. You can run into order-of-initialization issues if you aren't careful. (this is not a problem if you are a good C++ programmer and embrace proper best practices; in that
case you will be initializing everything with singletons and so on).
2. It's not portable because of the damn linkers that don't recognize CINIT as a binding function call, so the module can get dropped if it's in a lib or whatever.
(this is the main problem ; it would have been nice if C++0x had defined a way to mark CINIT constructions as being required in the link (not by
default, but with a __noremove__ modifier or something)).
There are various tricks to address this but I don't think any of them is very nice. (*)
In general I like this pattern a lot. The more portable version of this is to have an Install() function that you have to manually call.
I *HATE* the Install function pattern. It causes a lot of friction to making a new project because you have to remember to call all the right
Installs, and you get a lot of mysterious failures where some function just doesn't work and you have to go back and install the right thing,
and you have to install them in the right order (C++ singleton installs mostly take care of order for you). etc.
(* : this is one of those issues that's very annoying to deal with as a library provider vs. an end-application developer. As an app developer you just
decide "this is how we're doing it for our product" and you have a method that works for you. As a library developer you have to worry about people not
wanting to use the method you have found that works, and how things might behave under various compilers and usage patterns. It sucks.)
ADDENDUM : the problem with the manually-calling Install() pattern is that it puts the selection of features
in the wrong & redundant place. There is one place that I want to select my modules, and that is in my build -
eg. which files I compile & link, not in the code. The problem with it being in the code is that I can't create
shared & generic startup code that just works. I wind up having to duplicate startup code to every app, which is
very very bad for maintainability. And of course you can't make a shared "Startup()" function because that would
force you to link in every module you might want to use, which is the thing you want to avoid.
For the PS3 people : what would be the ideal way for me to expose bits of code that can be run on the SPU? I'm just not sure what people are actually using and how they would like
things to be presented to them. eg. should I provide a PPU function that does all the SPU dispatching for you and do all my own SPU management? Is it better if I go through SPURS
or some such? Or should I just provide code that builds for SPU and let you do your management?
I've been running into a problem with the MSVC compiler recently where it is incorrectly merging functions that aren't actually the same.
The repro looks like this. In some header file I have a function sort of like this :
If I put "static" on StupidFunction() it fixes this and does the right thing. I have no idea what the standard says about compilation units and inlines and merging and
so on, so for all I know their behavior might be correct, but it's damn annoying.
It appears that the exact definition of inline changed in C99, and in fact .cpp and .c have very different rules about inlines (apparently you can extern an inline which
is pretty fucked up). (BTW the whole thing with C99 creating different rules that apply to .c vs .cpp is pretty annoying).
ADDENDUM : see comments + slacker.org advice about inline best practice (WTF, ridiculous) ,
and example of GCC inline rules insanity
In other random code news, I recently learned that the C preprocessor (CPP) is not what I thought.
I always thought of CPP as just a text substitution parser. Apparently that used to be the case (and still is the case
for many compilers, such as Comeau and MSVC). But at some point some new standard was introduced that makes the CPP
more tightly integrated with the C language. And of course those standards-nazis at GCC now support the standard.
The best link that summarizes it IMO is the GCC note on
CPP Traditional Mode that describes
the difference between the old and new GCC CPP behavior. Old CPP was just text-sub, New CPP is tied to C syntax,
in particular it does tokenization and is allowed to pass that tokenization directly to the compiler (which does not
need to retokenize).
I guess the point of this is to save some time in the compile, but IMO it's annoying. It means that abuse of the CPP
for random text-sub tasks might not work anymore (that's why they have "traditional mode", to support that use).
It also means you can't do some of the more creative string munging things in the CPP that I enjoy.
In particular, in every CPP except GCC, this works :
In further "GCC strictness is annoying", it's fucking annoying that they enforce the rule that only ints can be
constants. For example, lots of code bases have something like "offsetof" :
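For reference, the classic hand-rolled macro and the standard one side by side. MY_OFFSETOF and Packet are made-up names; the sketch assumes a C++11 compiler for static_assert.

```cpp
#include <cstddef>

struct Packet
{
    char   tag;
    int    length;
    double payload;
};

// The classic hand-rolled form. It goes through a pointer cast, so a strict
// compiler may refuse to treat it as an integral constant expression
// (array sizes, enum values, template arguments).
#define MY_OFFSETOF(type, member) ((size_t)&(((type*)0)->member))

// The standard macro from <cstddef> is usable where a constant is required.
static_assert(offsetof(Packet, tag) == 0, "tag is the first member");
enum { kLengthOffset = offsetof(Packet, length) };
```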
A similar problem comes up in templates. On every other compiler, a const pointer can be used as a template
value argument, because it's just the same as an int. Not on GCC! In fact because they actually implement
the standard, there's a new standard for C++0x which is going to make NULL okay, but only NULL which is also
annoying because there are places I would use arbitrary values.
(see for example
1
or 2 ).
ADDENDUM : a concrete example where I need this is my in-place hash table template. It's something like :
In general I like to use template args as a way to make the compiler generate different functions for various
constant values. It's a much cleaner way than the #define / #include method that I used above in the static/inline
problem example.
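A cut-down illustration of the idea with plain int keys. IntSet, EMPTY and SIZE are invented for the example; the real hash table is more involved. The "empty slot" marker is a template value argument, so each instantiation gets its own code with the constant baked in.

```cpp
// Hypothetical in-place set of int keys with linear probing.
// EMPTY marks a free slot and must never be inserted as a key.
template <int EMPTY, int SIZE>
struct IntSet
{
    int slots[SIZE];

    IntSet() { for (int i = 0; i < SIZE; i++) slots[i] = EMPTY; }

    static int Slot(int key, int probe)
    {
        return (int)(((unsigned)key + (unsigned)probe) % (unsigned)SIZE);
    }

    // Returns false only if the table is full.
    bool Insert(int key)
    {
        for (int probe = 0; probe < SIZE; probe++)
        {
            int i = Slot(key, probe);
            if (slots[i] == key) return true;
            if (slots[i] == EMPTY) { slots[i] = key; return true; }
        }
        return false;
    }

    bool Contains(int key) const
    {
        for (int probe = 0; probe < SIZE; probe++)
        {
            int i = Slot(key, probe);
            if (slots[i] == key) return true;
            if (slots[i] == EMPTY) return false;
        }
        return false;
    }
};
```

With int keys this works everywhere; the complaint above is about wanting the same thing when the key type is a pointer.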
It's not just that the chips are super fast and easy to use. It's that they actually encourage good
software engineering. In order to make the in-order PPC chips fast you have to abandon everything you've learned about
good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals,
every function has to be inline.
1. Complex addressing.
This is way more huge than I ever expected. There are two important subaspects here :
1.A. being able to do addition and simple muls in the addressing, eg. [eax + ecx*2]
1.B. being able to use memory locations directly in instructions instead of moving through registers.
Together these make it so that on x86 you don't have to fuck around with loading shit out
to temporaries. It makes working on variables in structs almost exactly the same speed as working on variables
in a local.
e.g.
and those run at almost the same speed !
This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are
this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!
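To make the comparison concrete, here is a sketch of the two styles. Accumulator and AddViaLocals are made-up names. Both compute the same thing; the first works on members through this->, the second is the copy-to-locals style that in-order chips push you toward.

```cpp
struct Accumulator
{
    int sum;
    int count;

    // Works directly on the members. With complex addressing and memory
    // operands, "this->sum += values[i]" needs no hand-managed temporaries.
    void Add(const int* values, int n)
    {
        for (int i = 0; i < n; i++)
        {
            sum += values[i];
            count++;
        }
    }
};

// The manual style: pull members into locals, loop, write back at the end.
void AddViaLocals(Accumulator* acc, const int* values, int n)
{
    int sum = acc->sum;
    int count = acc->count;
    for (int i = 0; i < n; i++)
    {
        sum += values[i];
        count++;
    }
    acc->sum = sum;
    acc->count = count;
}
```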
ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad,
it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in
between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the
point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about
changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it.
Instead on the PPC you have to worry about things like indexing byte arrays is faster than any other size, and if you're writing loops
and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.
2. Out of order execution.
Most people think of OOE as just making things faster and letting you be a bit lazy about code gen and stalls
and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.
In particular, the entire method of referencing things through pointers is impossible in even moderately performant code without
OOE.
The nasty thing that C++ (or any modern language really, Java ,C#, etc. are actually much much worse in this
regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate,
it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86
it actually doesn't hurt much. Why?
Typical C++ code may be something like :
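A hypothetical example of that shape (Entity, Mesh and Material are invented types): every iteration walks a chain of pointers, and each hop is a load that can miss the cache.

```cpp
struct Material { int cost; };
struct Mesh     { const Material* material; };
struct Entity   { const Mesh* mesh; };

// Each entity costs three dependent loads: entity -> mesh -> material -> cost.
// The chains for different entities are independent of each other, so an
// out-of-order core can have several of them in flight at once, while an
// in-order core waits on each load in turn.
int TotalCost(const Entity* const* entities, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += entities[i]->mesh->material->cost;
    return total;
}
```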
ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing.
But your code path is not static, and the caches have basically random delays because other threads might be using them too,
so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling
for these things. For example, to schedule as well as OOE can, you would have to do things like speculatively read ptr and *ptr
even if it might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific)
etc. Furthermore, the scheduling can only really compete when all your functions are inline; OOE sort of inlines your functions for you
since it can schedule functions across the jump. etc. etc.
ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you
when you do a switch() and also when you make virtual calls. It can only handle branch prediction for static branches, there's
no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.
Final addendum :
Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone
knows that and it's old news and boring.
What I have realized and what I'm trying to say is that these bad old chips are not only slow - much worse than that! They cause a
regression in software engineering practice back to the bad old days when you have to worry about shit like whether you pre-increment
or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. All the things we have
made progress on in the last 20 years, since I started coding on Amigas and 286's where we had to worry about this shit, we moved into
an enlightened age where algorithms were more important than micro bullshit, and now we have regressed.
At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just
to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about
the PC/x86 performance at all, just 100% write your code for the PPC.
There are lots of little places where they differ - for example on x86 you should write code to make use of complex addressing,
you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this
might hurt a lot. Similarly there are quirks about how you organize your branches or what data types you use (PPC is very sensitive
to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.
Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the
x86, because it will be like a bazillion times faster than the PPC anyway.
Some other PPC notes :
1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS
conflict detection). In particular, it can easily come up when
you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy
between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly
offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.
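A sketch of checking for the problem and nudging one array, assuming "4k aliasing" means the low 12 address bits match. Aliases4k and OffsetIfAliased are made-up helpers, and the pad size is something you would have to tune.

```cpp
#include <cstddef>
#include <cstdint>

// True if two addresses share their low 12 bits, i.e. they alias modulo 4k.
bool Aliases4k(const void* a, const void* b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 4095) == 0;
}

// Hypothetical helper: if dst lines up with src modulo 4k, skip `pad` bytes.
// The caller must have allocated at least `pad` extra bytes behind dst, and
// pad must not be a multiple of 4096 (a cache-line multiple is an obvious pick).
char* OffsetIfAliased(char* dst, const char* src, size_t pad)
{
    return Aliases4k(dst, src) ? dst + pad : dst;
}
```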
2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted
before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables.
For example, say you write a function like this :
The conclusion is the same one I came to last time :
When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything
you need as a function argument so you never have to load from globals or constants (even loading static constants seems to be compiled very
badly, you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you
never take the address of). Never call external functions.
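As a sketch of that style (SumScaled is a made-up routine, not from any real code base): everything comes in as an argument, the work happens on locals, and nothing external is called.

```cpp
// Hypothetical inner-loop routine written in the isolated style.
// scale and bias would otherwise be globals or static constants; here the
// caller passes them in so they can sit in registers for the whole loop.
int SumScaled(const unsigned char* data, int count, int scale, int bias)
{
    const unsigned char* ptr = data;
    const unsigned char* end = data + count;
    int sum = 0; // local, address never taken

    while (ptr < end)
    {
        sum += (*ptr) * scale + bias;
        ++ptr;
    }
    return sum;
}
```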
This is a surprisingly unclear topic and I'm having trouble finding any good information on it. In particular there are a few specific
questions :
1. Does either Mutex Lock or Unlock need to be Sequentially Consistent? (eg. a global sync/ordering point)
(and followup : if they don't *need* to be Seq_Cst , is there a good argument for them to be Seq_Cst anyway?)
2. Does either Lock or Unlock need to keep memory accesses from moving IN , or only keep them from moving OUT ?
(eg. can Lock just be Acquire and Unlock just be Release ?)
Okay, let's get into it a bit. BTW by "mutex" I mean "CriticalSection" or "Monitor". That is, something which serializes access to a shared
variable.
In particular, it should be clear that instructions moving *OUT* is bad. The main point of the mutex is to do :
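The basic pattern, sketched here with C++11 std::mutex as a stand-in for whatever lock type you actually use (g_mutex, g_shared and IncrementShared are names made up for the example):

```cpp
#include <mutex>

static std::mutex g_mutex;
static int g_shared = 0;

void IncrementShared()
{
    g_mutex.lock();
    // Accesses to g_shared must not move above lock() or below unlock().
    // Whether unrelated accesses may move IN from outside is the open question.
    g_shared++;
    g_mutex.unlock();
}
```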
Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this
question in a bit of detail, but it's fucking slides so it's hard to understand (god damn slides). Basically he argues that moving code
into the Mutex (Java style) is fine, *except* if you allow a "try_lock" type function, which allows you to invert the mutex; with try_lock,
then lock() must be a full barrier, but unlock() still doesn't need to be.
Joe Duffy
mentions this subject but doesn't
come to any conclusions. He does argue that
it can be confusing if they are not
full barriers . I think he's wrong about that and his example is terrible. You can always cook up very nasty examples if you
touch shared variables inside mutexes and also outside mutexes. I would like to see an example where *well written code* behaves
badly.
One argument for making them full barriers is that CriticalSection provides full barriers on Windows, so people are used to it, so it's good to give people what they
are used to. Some coders may see "Lock" and think code neither moves in nor out.
But on some platforms it does make the mutex much more expensive.
To be concrete, is this a good SpinLock ?
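A minimal lock of the kind being asked about, written with C++11 std::atomic as a stand-in for platform primitives, might look like the sketch below. This is an assumption about the shape of the question, not the exact code from this post: Lock is only Acquire, Unlock is only Release, and there is no full barrier anywhere.

```cpp
#include <atomic>

struct SpinLock
{
    std::atomic<int> m_locked;

    SpinLock() : m_locked(0) {}

    // Acquire: accesses after Lock() cannot move up above it.
    void Lock()
    {
        while (m_locked.exchange(1, std::memory_order_acquire) != 0)
        {
            // spin until the previous owner stores 0
        }
    }

    // Single attempt; returns true if the lock was taken.
    bool TryLock()
    {
        return m_locked.exchange(1, std::memory_order_acquire) == 0;
    }

    // Release: accesses before Unlock() cannot move down below it.
    void Unlock()
    {
        m_locked.store(0, std::memory_order_release);
    }
};
```

With these orderings, code outside the locked region is still allowed to move into it.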
For more concreteness -
Viva64 has a nice analysis of
Dmitriy V'jukov's implementation of the Peterson Lock . This is a specific implementation of a lock which does not have *any*
sequence point; the Lock() is Acquire_Release ordered (so loads inside can't move up and stores before it can't move in) and Unlock
is only Release ordered.
The question is - would using a minimally-ordering Lock implementation such as Dmitriy's cause problems of any kind?
Obviously Dmitriy's lock is correct in the sense of providing mutual exclusion and data race freedom, so the issue is not that; it's
more a question of whether it causes practical programming problems or severely unexpected behavior. What about interaction with file IO
or other non-simple-memory-access resources? Is there a good reason not to use such a minimally-ordering lock?
I'm having trouble finding any information about this, but I did notice a funny thing in :
Mike Acton's Multithreading Optimization Basics
Anyway, the issue on the PPU is you have two hardware threads, but only one core, and more importantly, only one cache (and only one
store queue (I think)).
The instructions are in order, all of the out-of-orderness of these
PowerPC chips comes from the cache, so since they are on the same cache, maybe there is no out-of-orderness ? Does that mean that memory accesses
act sequential on the PPU ?
Hmmm I'm not confident about this and need more information.
The nice thing about Cell being open is there is tons of information about it from IBM
but it's just a mess and very hard to find what you want.
Of note - thread switches on the Cell PPU are pretty catastrophically slow, so doing a lot of micro-threading doesn't really make much sense on that
platform anyway.
ADDENDUM : BTW I should note that even if the architecture doesn't require memory ordering (such as on x86), doing this :
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :
First of all review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions happen
in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value, and stores can be
delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one
exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control
ordering you have :
lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).
lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you
put an lwsync after every instruction you would have a nice x86-like semantics.
In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me
to make a sync point with all other cores.
sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.
Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.
isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops
use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions
to finish their loads before any future instructions start (it flushes pipelines).
isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the
article linked above they say :
"PowerPC also has the synchronization instructions isync and eieio
(which is used to control reordering to caching-inhibited memory). These
synchronization instructions should not be needed for normal
synchronization purposes."
All that "Acquire" memory semantics needs to enforce is #LoadLoad. So lwsync certainly does give you acquire because it has a #LoadLoad, but
it also does a lot more that you don't need.
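For reference, here is how Acquire and Release look when spelled with C++11 atomics (an assumption; the post predates them, and the Mailbox type is hypothetical). The commonly documented PowerPC mapping is "lwsync; store" for a Release store and "load; compare; branch; isync" for an Acquire load, which lines up with the isync-on-a-branch idea described above; what a given compiler actually emits is worth checking in the disassembly.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-slot mailbox: 'payload' is plain data, 'ready' is the flag
// that publishes it.
struct Mailbox {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Producer side: write the data, then publish with Release so the payload
// store cannot be reordered after the flag store.
inline void Publish(Mailbox& m, int value) {
    m.payload = value;
    m.ready.store(true, std::memory_order_release);
}

// Consumer side: Acquire-load the flag; the payload read cannot be hoisted
// above it. Returns false if nothing has been published yet.
inline bool TryConsume(Mailbox& m, int* out) {
    assert(out != nullptr);
    if (!m.ready.load(std::memory_order_acquire)) {
        return false;
    }
    *out = m.payload;
    return true;
}
```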
ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just
want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever),
but don't care about it being acquire or release memory ordered.
Anyway, it seems like it would be very easy to fix. You just do something to force an attacking style. One
random idea I had is you could require that 3 forwards always stay on the opponent's side of midfield. This
prevents you from drawing back everyone for defense, and means that the attackers can get a numbers advantage
whenever they want to take that risk.
No Limit Hold'em is broken. It's far too profitable and easy to play a very tight style. The fix is very
easy - antes. But almost nobody does it outside of the biggest games. (an extra blind would also work well).
Baseball is broken. Not in a game rule system way, I think it actually functions pretty well in that sense.
Rather it is broken as a spectator sport because it is just too slow and drawn out. The fix for this is
also very easy - put time limits on pitchers and batters. None of this fucking knocking the dirt off your
shoes, throwing to first, then asking for more time, oh my god.
Basketball is broken. I wrote about that before
so I won't repeat myself.
Rugby is broken. In a lot of ways actually. The rules of the scrum are very hard to enforce, so you constantly
get collapsed scrums and balls not put in straight and so on, very messy, not fun to play or watch. The best
part of the game is the free running, but it's very easy to win without a good running game at all, just by
playing well in the set pieces and kicking, which is really ugly rugby. I don't have any great ideas on how to
fix it, but it's definitely broken. Sevens is actually a much better game for the most part.
I guess it's pretty hard to change these things because they are established and have history and fans and so on,
and any time you make a change a bunch of morons complain that you're mucking things up, but a bit of tweakage could
seriously improve most sports.
ADDENDUM :
Tennis is broken. Power and serving are rewarded too much over control, which makes matches boring. The French Open is
generally the best tennis of the year to watch because it's slower. This could be easily fixed by limiting racket technology
to 1980 levels or something.
Auto racing is horribly broken. I think F1 is hopeless and boring so I won't talk about that. The Le Mans / GT series are almost
interesting, but the stupid rules just make it incomprehensible who has an advantage each year. Some manufacturer can happen to have
a car that fits well with the current rule set, and then they dominate for a few years. In many of the series, the cars are so
modified that they hardly share any parts with their street origins at all. Like currently the BMW M3's are struggling in the ALMS but
winning in the ELMS because of tiny differences in the arcane rules (something about suspension and aero that's allowed).
I think the solution is very easy : let manufacturers bring anything they want, but it has to be available for the public to buy at
some fixed price. So rather than all these classes that have all these rules (no 4WD for example in the Le Mans series, and minimum
weights and so on) - get rid of the rules and have a $100k series and a $150k series. Let manufacturers make the best car they can make
for that price, and if they want to take a loss and bring a car that's got more value than that, they can, as long as the public can buy
it. This would really let us see what a $150k M3 can do vs a $150k R8.
News sites covering sports events should have a "spoiler free" mode. They should let you view the information in chronological order (past to
present), and let you block how far ahead you want to see. eg. say I have game 3 of the NBA finals on tape, and before I watch it I want to
catch up on what happened in game 2, I should be able to mark "don't show me game 3" and go read the news.
I'm hitting that particular problem right now with the Tour de F.
Why is there no fucking decent blog of someone telling me news about the Tour? Yes, I know there are plenty of news sites like velonews or
cyclingnews or whatever, that's not what I want. I also don't want a "tour diary" from a rider. I want a smart, funny, 3rd party who is
following everything and can write about what happens and also some editorial info about the secret dramas. Where is my content?
For ages I've wanted a blog I could follow that was just a well-curated extraction of amusement. I like to see a funny photo or some hot
chicks or whatever trashy internet amusement there is, but I don't want to have to slog through the mass of crap that you're
bathed in when you go to the massive aggregator sites like milkandcookies or daily* or whatever. Like just one little high quality
nugget once a day, why the fuck do I not have that?
The other thing I've wanted forever is a science news site that's targeted at science degree graduates, but not specialists
in that exact topic. There's a big gap between popular reporting, which is just woefully low-level, often just wrong, or
completely inane (like reporting crackpot fringe loonies as if they are real science), and the full-on rigor and impenetrability
of actual research papers. There could be a middle ground, intended for intelligent scientific people, written by people
who actually understand the full depth of what they're writing about. The only place I know of to get that is in college
magazines; for example the Caltech Engineering&Science magazine that I occasionally get is actually a pretty good source
for that depth of material.
In other news, the opening of the Montlake bridge almost every day of the summer so that a few fuckers can
get their over-height sailboats through is a really ridiculous slap in the face of any kind of civic sense.
I've been on a sailboat and gone from Lake Union to Lake Washington, and it is a delight, but you can get through
just fine on a moderate size boat without raising the bridge. You have to almost intentionally get a really tall
mast just so you can fuck up the lives of thousands of people when the bridge raising causes traffic to back up
onto the 520 and leads to a massive traffic jam. It's really appalling.
One option is just to use cmpxchg16b to do loads and stores. That's atomic, but seems a bit excessive. I dunno, maybe it's fine. For loads
that's simple enough, you just do a cmpxchg16b with 0 and it gives you the value that was there. For stores it's a bit uglier because you
have to do a loop and do at least two cmps (one to load, then one to store, which will only succeed if nobody else stored since the load).
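The load-by-compare and store-by-loop pattern can be sketched like this. To keep it runnable anywhere, the sketch uses a 64-bit std::atomic rather than a real 128-bit cmpxchg16b (that substitution is an assumption made for illustration, and the function names are hypothetical); the logic is the same at either width.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Load by CAS: compare against 0 and "replace" with 0. If the value really
// was 0 the swap changes nothing; otherwise the CAS fails and hands back
// whatever was there. Either way 'expected' ends up holding the value.
inline uint64_t LoadViaCas(std::atomic<uint64_t>& a) {
    // The whole point is a hardware CAS; a lock-based fallback defeats it.
    assert(a.is_lock_free());
    uint64_t expected = 0;
    a.compare_exchange_strong(expected, 0);
    return expected;
}

// Store by CAS: one cmp to learn the current value, then loop until a cmp
// succeeds, which only happens if nobody else stored in between.
inline void StoreViaCas(std::atomic<uint64_t>& a, uint64_t value) {
    uint64_t seen = LoadViaCas(a);
    while (!a.compare_exchange_weak(seen, value)) {
        // 'seen' was refreshed with the current value; just retry.
    }
}
```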
The other option is to use the SSE 128 bit load/store. I *believe* that it is atomic (assuming no cache lines are straddled), however it is
important to note that SSE memory ops on x86/x64 are weakly ordered, unlike normal memory ops which are all strongly ordered (every x86
load is #LoadLoad and every store is #StoreStore). So, to make strongly ordered 128 bit load/store from the SSE load store you have to do
something like
(ADDENDUM : yes, I think this is correct; movdqa (a = aligned) appears to be the correct atomic way to load/store 128 bits on x86; I'm a little
worried that getting the weaker SSE memory model involved will break some of the assumptions about the x86 behavior of access ordering).
In other news, the random differences between GCC and MSVC are fucking annoying. Basically it's the GCC guys being annoying cocks; you
know MS is not going to copy your syntax, but you could copy theirs. If you
would get your heads out of your asses and stop trying to be holier than Redmond, you would realize it's better for the world if you provide
compatible declarations. Shit like
making me do __attribute__((always_inline)) instead of just supporting __forceinline is just annoying and pointless. Also, you all need to
fix up your damn stdlib to be more like MS. Extensions like vsnprintf should be named _vsnprintf (MS style, correct) (* okay maybe not).
You also can't just use #defines to map the MSVC stuff to GCC, because often the qualifiers have to go in different places, so it's a real
mess. BTW not having pragma warning disable is
pretty horrendous. And no putting it on the command line is nowhere near equivalent, you want to be able to turn them on and off for
specific bits of code where you know the warning is bogus or innocuous.
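For the simple cases where the qualifier can sit in the same position on both compilers, a macro shim does work. This is only a sketch: MY_FORCEINLINE and the two functions are hypothetical names, and it does not help with the cases where the qualifiers have to go in different places. It also shows the scoped warning control that newer GCC releases provide through `#pragma GCC diagnostic`, next to the MSVC spelling.

```cpp
#include <cassert>

// Pick the compiler-specific spelling of "really inline this".
#if defined(_MSC_VER)
  #define MY_FORCEINLINE __forceinline
#elif defined(__GNUC__)
  // GCC wants plain 'inline' alongside the attribute.
  #define MY_FORCEINLINE inline __attribute__((always_inline))
#else
  #define MY_FORCEINLINE inline
#endif

MY_FORCEINLINE int Square(int x) { return x * x; }

// Turn one warning off for one bit of code, then restore the old state.
#if defined(_MSC_VER)
  #pragma warning(push)
  #pragma warning(disable : 4100)   // unreferenced formal parameter
#elif defined(__GNUC__)
  #pragma GCC diagnostic push
  #pragma GCC diagnostic ignored "-Wunused-parameter"
#endif

inline int IgnoresArg(int unused) { return 0; }

#if defined(_MSC_VER)
  #pragma warning(pop)
#elif defined(__GNUC__)
  #pragma GCC diagnostic pop
#endif
```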
The other random problem I have is the printf format for 64 bit int (I64d) appears to be MS only. God damn it.
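One portable spelling, assuming a C99/C++11 standard library is available on every target (which was not a given on all compilers at the time), is the PRId64 macro from <cinttypes>. FormatS64 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a signed 64-bit int without hard-coding %I64d or %lld.
inline std::string FormatS64(int64_t v) {
    char buf[32];   // INT64_MIN is 20 characters plus the terminator
    const int n = std::snprintf(buf, sizeof(buf), "%" PRId64, v);
    assert(n > 0 && n < static_cast<int>(sizeof(buf)));
    (void)n;
    return std::string(buf);
}
```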
I deposited a check at Wells Fargo once that was pretty old; I had misplaced it and just found it and went and deposited it. They charged me
a $25 fee for depositing a check that was too old.
Recently I had some money taken from my First Mutual account through a fraudulent ACH. They of course reimbursed it - but WTF that is not enough.
They allowed someone to take money out of my account when the person didn't even sign my name. The name signed is "Lindsey Meadows" or
something and First Mutual just let it right on through. I should get to charge them $25 for their gross incompetence.
Anytime anyone bills you wrong, you get to wait for 30 minutes on the phone. If you are *lucky* they will fix the bill. What about fucking
compensating me you cocks? But if I make a mistake and send in my monthly payment with a slightly wrong number on the check I get a fee.
I sent a bike by UPS and they checked it out and told me the rate was $65. Two weeks later I get a notice from them that they measured it
during shipping and decided it was oversize and they were charging me an extra $70.
Just recently, UPS has completely fucked up delivery of two packages; one I sent they bounced back to me even though the address was
completely correct; I had to talk to them on the phone and send it again. They gave me no reimbursement at all (they didn't charge me for
the second shipping, but didn't reimburse the first), and of course no compensation for the trouble or the delay. Recently they completely
lost a package that was shipped to me; again it was insured and everything so I'll get it fine, but I should be able to drop a $70 charge
on them for failed delivery. Oh you screwed up delivery? Okay, that's a $50 fee you owe me. Oh you don't remember agreeing to that?
It's on page 97 of the contract you had to sign to accept my package.
My garbage pickup company will occasionally just drop a $10 oversize garbage pickup fee on me. I never put out anything outside the bin, I
have no idea what the fuck they are charging me for, maybe some neighbor sneaks shit in or it's just a mistake, but the fact that they can
just tack on fees at will without my agreement is the problem.
Of course I signed away permission for them to do that somewhere in the contract. But that is no fucking excuse. Retarded anti-humanists
will say "it's your own fault, you had the opportunity to read the contract and you chose to sign it". What? First of all, you can't be
serious that I'm supposed to study every fucking contract that's put in my face. Second of all, if I actually didn't sign the contracts
that I didn't agree with, I couldn't live anywhere since all the rental agreements are absurd, I couldn't have a phone, a credit card, a
bank, utility service, I mean are you fucking stupid? Of course I have to deal with these companies, I have no choice but to sign
abusive contracts.
I fucking hate our lack of freedom and independence.
There is no free market solution to these problems. For one thing, there is basically no significant competition in almost every service.
Even in sectors that have apparent competition, like say car insurance or banking, sure there are various people to choose from, but in fact they
are almost all identical. They all run their business the same way, and none of them is actually good to their consumers.
The only solution is strong government regulation. In particular there are two very simple consumer protection laws that I would like to see :
1. Elimination of non-voluntary fees. All charges to a consumer must be explicitly authorized (and no they cannot be preauthorized on
contingencies).
This is extremely powerful, simple, and would make a huge difference. For example it solves bank overdraft abuse. When you try to make
an overdraft, the bank has to contact you (by email and cell) and confirm that you want to accept the overdraft (and the $25 fee). If you
say no, the charge just bounces (and there is no fee to you). Obviously the same thing would apply to cell phone abuse through roaming
or overage minutes or whatever.
Now you the consumer might want to preauthorize $50 a month in fees to give you some wiggle room, but you could choose $0 preauthorized fees
if you like.
2. Make user agreements illegal. This is a little trickier because I don't think you can quite ban them completely, so you have to say
something about them being "minimal" and "transparent". Maybe you could require that the average person should be able to fully read and
understand it within 60 seconds.
Agreements that protect the service provider from lawsuits or that specify settlement through arbitration should simply be illegal.
But this line of thinking is all irrelevant because nothing that significantly reduces corporations' power to fuck us over will ever be done.
The result should be that using VC with MyNifty, you basically don't even know that your files are in source control - everything
should just autocheckout at the right time.
Go get the original NiftyPlugins at Code.Google and do the install.
Then download MyNifty.zip
Extract MyNifty.zip into the dir where the NiftyP4 DLL is installed. It should be something like :
Run VC.
Tools -> Addin Manager
Tools -> Options -> Environment -> Documents
Output pane should have a "NiftyPerforce" option; you should see a message there like "NiftyPerforce ( Info): MyNiftyP4! RELEASE"
Nifty should be set to : useSystemEnv = True, autoCheckoutOnEdit = false, autoCheckout on everything else = true.
I recommend autoAdd and autoDelete = false.
Make sure your P4 ENV settings are done, or use P4CONFIG files.
First of all, the efficiency difference vs. Powergrips or toe clip-strap pedals is pretty small. Those also lock your foot in pretty well
and let you spin. When people say "clipless pedals are a huge gain" they are comparing to platform pedals which is fucking ridiculous,
you need to compare vs. toe clips. The nice thing about toe clips is you can leave them loose around town, and then when you get out on a
long course, you just reach down and pull the strap tight and then you are locked in nice and neat.
The first problem with clipless pedals is ergonomic. Yes, they can be adjusted just right so that you have good geometry and they won't
cause any pain - but that is a big pain in the ass, and the average amateur doesn't have the perfect adjustment. The result is knee and
hip pain. The extra freedom of strap pedals lets you get a more comfortable position and avoid the pain.
The biggest problem with clipless pedals is that it turns the amateur road rider into a real dick-head. They aren't comfortable with
clipping in and out, so they go to great measures to avoid it. They'll hang on to posts at red lights, run lights and stop signs,
won't wait in a line of other cyclists, etc. They create a real hazard because they can't get their feet out easily.
Of all the dick cyclists on the road, the yuppie amateur road racer has to be the worst. They're the ones who are all wobbly and don't
stop for pedestrians. They ride in big groups and don't get out of the way for cars. They run stop signs and act like they aren't doing
anything wrong. They'll often ride way out in the road for no good reason and not get over to let cars by. Of course cyclists should
take the lane when they need to for safety reasons, but that is not the case for these turds.
It really makes me sad when I see some out of shape people who are obviously trying to get into cycling, and the fucking shop has set them
up with some harsh aluminum frame, with a way too aggressive forward position, and clipless pedals; they're obviously very uncomfortable
on their bike, and also out of control.
X. The code gen is just not very good. I'm spoiled by MSVC on the PC, not only is the code gen for the PC quite good, but any mistakes
that it makes are magically hidden by out of order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the
best possible register assignments, those just get hidden and swallowed up by out-of-order when you have a branch or memory load to hide
them.
In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like
unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the
branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.
Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back
and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate
neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own
function.
X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because
for anything bigger than a byte it has to do an actual multiply or shift to form the address. So for
example if you have a struct of several byte members, you should use SOA (several byte arrays) instead of AOS (one array of large struct).
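A sketch of the two layouts, with hypothetical type names. In the AOS form, element i sits at base + i*4, so every index needs a shift; in the SOA form each member is its own byte array and element i sits at base + i.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// AOS: one array of a 4-byte struct; indexing scales by sizeof(PixelAOS).
struct PixelAOS {
    uint8_t r, g, b, a;
};

// SOA: four separate byte arrays; indexing is a plain add.
struct PixelsSOA {
    std::vector<uint8_t> r, g, b, a;
    explicit PixelsSOA(size_t n) : r(n), g(n), b(n), a(n) {}
};

// Sum the first 'count' red values from the SOA layout.
inline unsigned SumRed(const PixelsSOA& p, size_t count) {
    assert(count <= p.r.size());
    const uint8_t* red = p.r.data();
    unsigned sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += red[i];
    }
    return sum;
}
```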
X. Inline ASM kills optimization. You think with the code gen being annoying and flaky you could win by doing some manual inline ASM,
but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no optimization mode, the same way it did on the PC
many years ago before that got fixed.
X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you
can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this).
On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no
out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to
give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.
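A sketch of doing the prefetch by hand on a linear scan. It uses GCC/Clang's __builtin_prefetch (the Xenon compiler has its own cache-touch intrinsic instead); the PREFETCH macro, the 128-byte line size, and the four-lines-ahead distance are assumptions to tune per platform, not measured values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__)
  #define PREFETCH(ptr) __builtin_prefetch(ptr)
#else
  #define PREFETCH(ptr) ((void)0)   // no-op where there is no intrinsic
#endif

inline uint32_t SumWithPrefetch(const uint32_t* data, size_t count) {
    assert(data != nullptr || count == 0);
    const size_t kElemsPerLine = 128 / sizeof(uint32_t);  // assumed line size
    const size_t kAhead = 4 * kElemsPerLine;              // assumed distance
    uint32_t sum = 0;
    for (size_t i = 0; i < count; ++i) {
        // Once per cache line, touch the line we will need a few lines from
        // now, so the address is known well before the data is used.
        if ((i % kElemsPerLine) == 0 && i + kAhead < count) {
            PREFETCH(data + i + kAhead);
        }
        sum += data[i];
    }
    return sum;
}
```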
X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz)
is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs.
(I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).
PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an
S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.
-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed
with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have
it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some
kind of scope, like :
X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really wants
to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course "register"
keyword does nothing). This is really fucking annoying. It's also an aspect of #1 - whether or not it gets registered depends on the phase
of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals, the
compiler really wants to generate a load from the static base mem, it won't cache that.
Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy.
The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't
eliminate the copy) then back into a local, which then gets registered. Ugh.
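The volatile round trip looks roughly like this. The global and function names are hypothetical, and whether the local really ends up in a register is entirely up to the compiler, so check the generated code. Note the shared global also makes this unsafe to call from more than one thread at once.

```cpp
#include <cassert>

// Hypothetical scratch global. volatile means the compiler has to perform
// the store and the load, so it cannot fold the copy away.
static volatile int g_launder;

inline int SumScaled(const int* values, int count, int scale) {
    assert(count >= 0);
    // Copy out through the volatile and back into a fresh local. The local
    // is now a value the compiler knows nothing about except that it lives
    // here, which is what coaxes it into a register for the loop.
    g_launder = scale;
    const int localScale = g_launder;

    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += values[i] * localScale;
    }
    return sum;
}
```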
You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables,
they wind up doing extra loads and adds to get the values out of static or stack mem and generate the offsets and so on.
X. Standard loop special casing. On Xenon they seem to special case the standard
X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make
this go away.
X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at
figuring out that some combined operation can be turned into one instruction. eg. something like this :
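Two illustrations of that kind of combined operation (these are stand-in examples with hypothetical function names, not necessarily the case the post had in mind): a rotate written as two shifts and an or, and a shift-then-mask field extract. With constant arguments both are candidates for a single rlwinm-style rotate-and-mask instruction, but whether a given compiler collapses them is something to confirm in the disassembly.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left written in plain C. The "& 31" keeps the right shift in range
// when n == 0, where a shift by 32 would be undefined.
inline uint32_t RotateLeft(uint32_t x, unsigned n) {
    assert(n < 32);
    return (x << n) | (x >> ((32 - n) & 31));
}

// Pull a bit field out of a word: shift down, then mask.
inline uint32_t ExtractBits(uint32_t x, unsigned shift, uint32_t mask) {
    assert(shift < 32);
    return (x >> shift) & mask;
}
```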
X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case
it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to
generate it manually. The clean solution to this is to write your own SELECT , that's like :
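A minimal sketch of a branchless integer select, assuming the usual sign-mask trick (the function names are hypothetical and this is not necessarily the SELECT the post used). For floats, PPC has the fsel instruction, which does the same job in hardware.

```cpp
#include <cassert>
#include <cstdint>

// Returns ifNegative when test < 0, otherwise ifNonNegative, with no branch.
inline int32_t SelectNegative(int32_t test, int32_t ifNegative, int32_t ifNonNegative) {
    // Arithmetic shift smears the sign bit: all ones if test < 0, else zero.
    // (Right-shifting a negative value is implementation-defined before
    // C++20, but is arithmetic on the mainstream compilers.)
    const int32_t mask = test >> 31;
    return (ifNegative & mask) | (ifNonNegative & ~mask);
}

// min(a, b) built on the select: a - b < 0 exactly when a < b.
inline int32_t BranchlessMin(int32_t a, int32_t b) {
    // The subtraction must not overflow for the sign test to be valid.
    const int64_t wide = static_cast<int64_t>(a) - static_cast<int64_t>(b);
    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return SelectNegative(static_cast<int32_t>(wide), a, b);
}
```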
Bug fixed 07/12 : don't send key up message. Also, build without CRT and the EXE size is under 4k.
There is, however, something it would be awesome for : dev kits.
If I'm a multi-platform videogame developer, I don't want to have to buy dev kits of every damn system for every person in my company.
They cost a fortune and it's a mess to administer. It would be perfectly reasonable to have a shared farm of devkits somewhere out on
the net, you can get to them whenever you want and run your game and get semi-interactive gameplay ala OnLive.
It would be vastly superior to the current situation, wherein poor developers wind up buying 2 dev kits for their 40 person team and you have
to constantly be going around asking if you can use it. Instead you would always get instant access to some dev kit in the world.
Obviously this isn't a big deal for 1st party, but for the majority of small devs who do little ports it would be awesome. It would also
rock for me because then I could work from home and instantly test my RAD stuff on any platform (and not have to ask someone at the office
to turn on the dev kit, or make sure no one is using it, or whatever).
Basically he contends that the real problem with bike-car safety is that car drivers do not take the responsibility of their powerful machine
seriously enough. Yes, of course, I agree absolutely, but SO WHAT ?
We could have a lot of discussions about the way the world should be, but it's irrelevant and non-productive to pine for things that will
never be. Yes, it would be nice if people paid attention to the road when they drove, didn't talk on cell phones, didn't drink coffee
and talk to their passenger. A car is a deadly powerful killing machine, and stupid people forget that because it's so comfy and feels
safe and is easy to drive and so on. Yes, I think most people would drive better if they had to drive something like an old open-roof
roadster where you are exposed to the elements and feel vulnerable. But you are not going to change Americans' driving habits. People
want to jump in their giant beast, mash the gas, watch TV while they're driving, and fuck you if you're a pedestrian or cyclist in their
way.
Look, drivers are fucking dangerous morons. It doesn't matter if you're on a bike or not - they are constantly running red lights, pulling
into crosswalks, not stopping for pedestrians, going the wrong way around roundabouts. Almost every single day I see some major violation of
basic traffic laws, and even beyond that there are just constant violations of basic human sense and decency. One of the ones that's really
getting my goat recently is how people around here love to blow right through a stop sign and come to stop about ten feet past it, with their
nose way out in the intersection (they do this intentionally because they want to get further out into the intersection to see to the sides;
the reasonable thing to do would be to first stop at the stop sign and then pull forward to see). So when I'm driving or biking through the intersection, all I see is somebody who
blows through a stop sign and is coming right at me (and then they stop just before hitting me).
I personally would love to see the elimination of the entire concept of the vehicular "accident". They are only rarely accidents; it's usually
somebody fucking up. The person who crashed should not only have to pay the cost of the accident, but should get a punitive legal punishment
such as license suspension or even prison time. For example the old lady who ran right through a red light and smashed my car should have
clearly had her license taken away. In cases of hitting pedestrians you should get jail time. It's almost impossible for a pedestrian to ever
be at fault, because even if they do jump out right in front of you - you should always expect them to do that when you are in an area with
pedestrians, so you should be going slow and be ready to slam on the brakes. But this is never going to happen so it's a pointless rant.
As for the issue of mandatory helmets - I don't really think it's anything to get riled up about. As a cyclist of course you should choose
to wear a helmet even if it's not a law. Obviously making it a law is political weakness. Oh shit cyclists are getting hit by cars - let's
restrict the cyclists because god knows we're not gonna restrict the cars. Well duh, of course that's how politics works. But there are
perfectly reasonable reasons to make helmets mandatory - the same reason seat belts in cars are mandatory - because it reduces the medical cost
which is shared by society.
We can rant all we want about how drivers should pay more attention, be more courteous, be reasonable and intelligent, but it just won't ever
happen.
What I would like to see is better ways for me as a cyclist to avoid cars, and me as a driver to avoid cyclists. Part of the problem is
that the people who put down the bike lanes are real fucking morons. Right in my neighborhood we have bike lanes or "sharrows"
right down the busiest arterial roads, when there are perfectly good quiet back streets that run parallel and would be much better routes
for bikes. Personally when I ride, I take the back routes that are very low traffic, but the majority of cyclists are just as retarded
as the retarded cars, and they take the road that is "bike recommended" even though it's much worse.
(BTW as a reminder, let me emphasize that the retardation equivalence of bikes and cars in no way excuses the cars from their sins; if you
have a fight between someone with a feather and someone with a knife and they are both being dangerous morons with their weapon - the feather
guy can be forgiven but the knife guy is a fucking selfish ignorant dick; you often hear self-righteous car morons go on about how the
bikes "do bad things too" ; so fucking what? so what? he's poking you with his feather, just ignore him and be careful with your damn knife).
The woods have a wonderful silence to them; the boughs are baffles, muffling
sound, making the air heavy and still. I imagine having a clearing in the woods
so a bit of light can get in. In the clearing is a japanese style pavilion,
dark thick wood braces and paper screens. It is empty of all clutter, my
private space, quiet and peaceful, where I can just think and work and be
alone.
There are actually lots of huge wooded properties for sale out not too far
from Seattle. I think the best nearby place is out in the Snoqualmie river
valley, around Duvall/Carnation/Novelty Hill. You can get 40 acres for
around $600k which is pretty stonkering. 40 acres is enough that you can
put a little building in the middle and not be able to see or hear a neighbor
at all. It also seems like a pretty good investment. It's inevitable that
the suburbs will get built out to there eventually, and then all that land
could be worth a fortune. This is why I've never understood living in traditional
suburbs; if you go just another ten miles out you get to real country where you can
have big wild property with woods and gardens and isolation, for less money!
But then I start thinking - if I'm going to live in the middle of nowhere,
why live in the middle of nowhere near Seattle? It's too far to really go
into the city on any kind of regular basis, so I may as well just live in
the countryside in CA or Spain or Argentina or somewhere with better weather.
Living in the country is really only okay if I'm married or something. If I'm
single I have to be in the city. Even if I am with the woman I love, moving out
to the country is sort of like retiring from life. It's changing gears to a very
isolated, simple life. That's very appealing to me, but I don't think it's time
for that phase of my life just yet.
Lately I have been taking lots of walks around Seattle U. It has pretty nice grounds,
with lots of little hidden gardens tucked behind or between buildings where you can stroll or sit.
I love the feeling on a college campus. You can just feel the seriousness in the air.
Even when there are lots of kids around there's a feeling of quiet and solitude; maybe it's
because the big buildings create a sort of echoing canyon that changes the sounds.
I miss having deep intellectual problems to work on that you really have to go and think about for
a long time. Even though I'm sort of doing research right now, it's engineering research, where
my time needs to be spent at the machine writing code for test cases, it's not theoretical research.
It's really a delightful thing to have a hard theoretical problem to work on. You just keep it in
the back of your mind and you chew on it for months. You try to come at it in different ways, you
search for prior art papers about it. All the time you are thinking about it, and
often the big revelation comes when you are taking a hike or something.
The Windows registry vs. INI files is surprisingly pro-registry.
First of all, using the Registry does not actually *fix* any issues with multiple instances that you can't fix pretty easily with INI files.
In particular, the contention over the config data is the easy part of the problem.
There's an inherent messy ambiguity with settings and multiple instances. If I change settings in instance 1 do I want those
to be immediately reflected in instance 2? If so, the only way to fix this is to have all instances constantly checking for new settings (or
get some notification). This problem is exactly the same with the registry or INI files. Sure the registry gives you nice safe atomic writes,
but you can implement that yourself easily enough, or you could use an off the shelf database. So that is really not much of an argument.
In fact, getting changes across instances with INI files could be done pretty neatly using a file change notification to cause reloads (I'm
not sure if Windows provides a similar watcher notification mechanism for registry changes). (the system that most people use of dumping
settings changes on program exit is equally broken with the registry or INI files).
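To make the "you can implement that yourself" point concrete, here is a minimal sketch of the usual write-to-temp-then-rename trick for an INI file. The function name and the ".tmp" suffix are made up for illustration. It assumes a platform where renaming over an existing file replaces it in one step (POSIX does this; on Windows you would likely reach for MoveFileEx with MOVEFILE_REPLACE_EXISTING or ReplaceFile instead of std::rename), and it does not try to force the data to disk.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: write the whole settings text to a sibling temp file,
// then rename it over the real file so readers see either the old contents
// or the new contents, never a half-written file.
// Assumption: std::rename replaces an existing target (true on POSIX).
bool write_settings_atomically(const std::string& path, const std::string& text) {
    const std::string tmp = path + ".tmp";  // illustrative naming convention
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << text;
        out.flush();
        if (!out) {
            std::remove(tmp.c_str());
            return false;
        }
    }  // temp file is closed here, before the rename
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}
```

Other instances can then reload on a file change notification and will never parse a torn file.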
Second, storing things like last dialog positions and dumping structs and such is not really appropriate for INI files (or really even for
the registry for that matter). The INI file is for things that the user might want to edit by hand, or copy to other machines, or whatever.
That other junk is really a logically separate type of data. It's like the NCB in MSVC, which we all know you want to just wipe out from
time to time. (in fact making it separate is nice because if I accidentally get my last dialog position off in outer space I can just delete
that data). I think the official nice Win32 way to store this data is off in AppData somewhere, but I don't love that either.
Third, the benefits of the INI are massive and understated. 1. text editing is in fact a huge benefit over the registry. It lets you see
all the options and edit them in a tool that is friendly and familiar. 2. it lets you do all the things you would do normally on files - eg. I
can easily email my INI to friends, I can save backups of settings I made for some purpose, hell I can munge the INI from batch files, I can
easily zip it up to save old versions, etc.
And this last one is by far the most important - making programs be "transportable" - that is, they rely on nothing but stuff in dirs under them -
is just a massive win. It lets me rearrange my disk, copy programs around without running installers, save versions of programs, etc.
Back in the DOS days, whenever I finished a code project, we would make tape backups (lol tape) of the code * and all the compilers used to build it *.
To do that all you had to do was include the dir containing the compiler. Five years later we could pull out projects that used some bizarro
compilers that we didn't have any more, and it would all just work because they were fully transportable. The win of that is so massive it
dominates the convenience of the registry for the developer.
Which brings us to the most important part : the convenience for the developer is not the issue here! It's what would be nicer for the user.
And there INI is just all win. If it's more work for the developer to make that work, we should do that work.
Smartness overload ( and addendum ) is purportedly a
rant against over-engineering and excessive "smartness".
But let's start there right away. Basically what he wants is to have less smartness in the development of the basic architectural systems,
but that requires *MORE* smartness every single time you use the system. For example :
In many situations overgeneralization is a handy excuse for
laziness. Managing object lifetimes is one of my pet peeves. It’s common
to use single, “universal” system for all kinds of situations instead of
spending 5 minutes and think.
He's anti-smartness but pro "thinking" each time you write some common code.
My view is that "smartness" during development is very very bad. But by that I mean requiring the client (coder) to think and make the
right decision each time they do something simple. That inevitably leads to tons of bugs. Having systems that are clear and uniform and
simple is a massive win. When I'm trying to write some leaf code, I shouldn't have to worry about basic issues, they should be taken
care of. I shouldn't have to write array code by hand every time I need an array, I should use a vector. etc.
Furthermore, he is arguing against general solutions. I don't see how you can possibly argue that having each coder cook up their own
systems for lifetime management is a good idea. Uniformity is a massive massive win. Even if you wrote some manual lifetime
control stuff that was great, when some co-worker goes into your code and tries to use things they will be lost and have problems.
What if you need to pass objects between code that use different schemes? What a mess.
Yet, folks insist on using reference counted pointers or GC everywhere.
What? Some of counters can be manipulated from multiple threads? Well,
let’s make _all_ pointers thread-safe, instead of thinking for another 5
minutes and separating special cases. It may be tempting to have a
solution that just works in every case, write it once and use
everywhere. Sadly, in many cases it means unnecessary overhead.
Yes! It is very tempting to have a solution that just works in every case! And in fact, having that severely *reduces* the need for
smartness, and severely reduces bugs. Yes, if the overhead is excessive that's a problem, but that can be dealt with without destroying
good systems.
I think what he's trying to say is something along the lines of "don't use a jackhammer to hammer a nail" or something; that you shouldn't
use some very heavy complex machinery when something simple would do the trick. Yes, of course I agree with that, but he also succumbs to
the fallacy of taking that way too far and just being anti-jackhammer in general. The problem with that is that you wind up having to
basically cook up the heavy machinery from scratch over and over again, which is much worse.
Especially with thread safety issues, I think it is very wrong-headed to suggest that coders should "think" and "not be lazy" in each
occurrence of a problem and figure out what exactly they need to thread-protect and how they can do it minimally, etc. To write thread-safe
code it is *crucial* to have basic systems and common paradigms that "just work everywhere". Now that doesn't mean that you have to make all
smart pointers threadsafe. You could easily have something like "SingleThreadSmartPointer" and "ThreadSafeSmartPointer". An even better
mechanism would be to design your threading system such that cross-thread smart pointers aren't necessary. Of course you want sensible
efficient systems, but you also want them to package up common actions for you in a guaranteed safe way.
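One way the two-pointer idea could look, as a rough sketch: a single ref-counted pointer template where the only difference between the two flavors is the counter type. Only the names SingleThreadSmartPointer and ThreadSafeSmartPointer come from the text; BasicSmartPointer, make and use_count are invented here. Note that in the thread-safe flavor only the count is atomic; sharing one pointer *instance* between threads without a lock is still not safe.

```cpp
#include <atomic>
#include <utility>

// Hypothetical sketch: the counter type is the only policy that differs.
template <typename T, typename Counter>
class BasicSmartPointer {
    struct Block {
        T value;
        Counter refs;
        template <typename... Args>
        explicit Block(Args&&... args)
            : value(std::forward<Args>(args)...), refs(1) {}
    };
    Block* block_ = nullptr;

    void release() {
        // --refs is an atomic read-modify-write when Counter is std::atomic<long>.
        if (block_ && --block_->refs == 0) delete block_;
        block_ = nullptr;
    }

public:
    BasicSmartPointer() = default;
    BasicSmartPointer(const BasicSmartPointer& other) : block_(other.block_) {
        if (block_) ++block_->refs;
    }
    BasicSmartPointer& operator=(BasicSmartPointer other) {  // copy-and-swap
        std::swap(block_, other.block_);
        return *this;
    }
    ~BasicSmartPointer() { release(); }

    template <typename... Args>
    static BasicSmartPointer make(Args&&... args) {
        BasicSmartPointer p;
        p.block_ = new Block(std::forward<Args>(args)...);
        return p;
    }

    T& operator*() const { return block_->value; }
    T* operator->() const { return &block_->value; }
    long use_count() const {
        return block_ ? static_cast<long>(block_->refs) : 0;
    }
};

// Same interface, different cost: plain increments vs. atomic increments.
template <typename T>
using SingleThreadSmartPointer = BasicSmartPointer<T, long>;
template <typename T>
using ThreadSafeSmartPointer = BasicSmartPointer<T, std::atomic<long>>;
```

The leaf code looks identical with either alias, so picking the cheap one is a one-word decision made where the object is declared, not something you re-think at every use.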
Finally, let's get to the real meat of the specific argument, which is about object lifetime management. He seems to be trashing a
bogus smart pointer system in which people are passing around smart pointers all the time, which incurs lots of overhead. This is
reminiscent of all the people who think the STL is incredibly slow, just because they are using it wrong. Nobody sensible has a smart
pointer system like that. Smart people who use the boost:: pointers will make use of a mix of pointers - scoped_ptr, shared_ptr, auto_ptr,
etc. for different lifetime management cases. Obviously the case where a single object always owns another object is trivial. You could
use auto_ptr or even just a naked pointer if you don't care about automatic cleanup. The nice thing is that if I later decide I need to
share that object, I can change it to shared_ptr, and it is easy to do so (or vice-versa). Even if something is a shared_ptr, you don't
have to pass it as a smart pointer. You can require the caller to hold a ref and then pass things as naked pointers. Obviously little
helper functions shouldn't take a smart pointer that has to inc and dec and refcount thread-safely, that's just bone headed bad usage, not
a flaw of the paradigm.
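A small sketch of that usage pattern, using std::shared_ptr as a stand-in for the boost pointers. Texture, Material and pixel_count are made-up names. The helper takes a plain reference because the caller already holds a ref for the duration of the call; only the code that actually keeps the object takes a smart pointer.

```cpp
#include <memory>
#include <string>
#include <utility>

// Illustrative types, not from any real engine.
struct Texture {
    std::string name;
    int width = 0;
    int height = 0;
};

// Leaf helper: plain reference, no refcount traffic at all.
// Contract: the caller holds a reference for the duration of the call.
int pixel_count(const Texture& t) { return t.width * t.height; }

// This one stores the texture, so it takes (and keeps) shared ownership.
struct Material {
    std::shared_ptr<Texture> diffuse;
    explicit Material(std::shared_ptr<Texture> tex) : diffuse(std::move(tex)) {}
};
```

Calling pixel_count(*tex) leaves the use count untouched; constructing a Material from tex bumps it once, which is exactly where you want to pay.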
Now, granted, by not using smart pointers everywhere you are introducing holes in the automaticity where bad coders can cause bugs. Duh.
That is what good architecture design is all about - yes if we can make everything magically work everywhere without performance overhead
we would love to, but usually we can't so we have to make a compromise. That compromise should make it very easy for the user to write
efficient and mistake-free code. See later for more details.
Object lifetime management involves work one way or another. If you use smart pointers or some more lazy type of GC, the amount of work
needed for the coder to do every time he works with shared objects is greatly reduced. This makes it easier to write leaf code and reduces bugs.
The idea of using an ID as a weak reference without a smart pointer is
basically a no-go in game development IMO. Let me explain why :
First of all, you cannot ever convert the ID to a pointer *ever* because
that object might go away while you are using the pointer. eg.
So, one solution to this is to only use ID's. This is of course what Windows
and lots of other OS'es do for most of their objects. eg. HANDLE, HWND, etc.
are actually weak reference ID's, and you can never convert those to pointers,
the function calls all take the ID and do the pointer mapping internally.
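For illustration, an ID-only interface might look something like the sketch below. This is not how Windows implements HANDLE or HWND; Registry, Handle and the generation counter are assumptions picked to show the shape of the API: callers only ever hold the ID, and every call re-validates it and does the mapping internally.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical ID: slot index plus a generation stamp to detect stale IDs.
struct Handle {
    uint32_t index;
    uint32_t generation;
};

class Registry {
    struct Slot {
        int value;
        uint32_t generation;
    };
    std::vector<Slot> slots_;

public:
    Handle create(int value) {
        slots_.push_back({value, 1});
        return {static_cast<uint32_t>(slots_.size() - 1), 1};
    }

    bool valid(Handle h) const {
        return h.index < slots_.size() &&
               slots_[h.index].generation == h.generation;
    }

    // Destroying bumps the generation, so every outstanding copy of the
    // handle stops validating. (Slot reuse is left out to keep this short.)
    void destroy(Handle h) {
        if (valid(h)) ++slots_[h.index].generation;
    }

    // The call takes the ID and does the mapping internally; the caller
    // never gets a pointer it could hold on to.
    bool get_value(Handle h, int* out) const {
        if (!valid(h)) return false;
        *out = slots_[h.index].value;
        return true;
    }
};
```

The price is a validate-and-lookup on every single call, which is the convenience and efficiency cost mentioned next.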
I believe this is not workable because we want to get actual pointers to objects
for convenience of development and also efficiency.
Let me also point out that a huge number of windows apps have bugs because of
this system. They do things like
Now, of course weak references are great, but IMO the way to make them safe
and useful is to combine them with a strong reference. Like :
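A minimal sketch of the weak-plus-strong combination, using std::weak_ptr and std::shared_ptr as stand-ins (Enemy and apply_damage are invented names, and this is not the author's own system). The weak reference is what you store; you convert it to a strong reference only for the stretch of code that actually touches the object, so the object cannot go away underneath you.

```cpp
#include <memory>

// Illustrative game object.
struct Enemy {
    int health = 100;
};

// Stores nothing strong: the target may die at any time between calls.
// Returns false if the target is already gone.
bool apply_damage(const std::weak_ptr<Enemy>& target, int amount) {
    // lock() yields a strong reference, or an empty pointer if the object
    // has been destroyed. While `strong` is alive the object cannot die.
    if (std::shared_ptr<Enemy> strong = target.lock()) {
        strong->health -= amount;
        return true;
    }
    return false;
}
```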
By using a system like this you can be both very efficient and very safe. The system I use is roughly like this :
Now, certainly lots of projects can be written without any complicated lifetime management AT ALL. In particular, many games throughout
history have gotten away with having a single phase of the world tick when all object destruction happens; that lets you know that
objects never die during the frame, which means you can use much simpler systems. I think if you are *sure* that you can use simpler
systems then you should use them - using fancy systems when you don't need them is like using a hash table to implement an array with
the index as the hash key. Duh, that's dumb. But if you *DO* need complicated lifetime management, then it is far far better to use
a properly engineered and robust system than to do ad-hoc per-usage coding of custom solutions.
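The "single phase where all destruction happens" scheme can be sketched in a few lines. World, Entity, kill and end_of_tick are illustrative names, not from any particular engine. Killing only marks the object; the actual delete happens at one known point, so raw pointers are safe to hold for the rest of the frame.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Entity {
    int id;
    bool dead;
};

class World {
    std::vector<std::unique_ptr<Entity>> entities_;

public:
    Entity* spawn(int id) {
        entities_.push_back(std::unique_ptr<Entity>(new Entity{id, false}));
        return entities_.back().get();
    }

    // Only marks the entity; the object stays valid until end_of_tick().
    void kill(Entity* e) { e->dead = true; }

    // The one place in the frame where objects actually get destroyed.
    void end_of_tick() {
        entities_.erase(
            std::remove_if(entities_.begin(), entities_.end(),
                           [](const std::unique_ptr<Entity>& e) { return e->dead; }),
            entities_.end());
    }

    std::size_t count() const { return entities_.size(); }
};
```

This only works if nothing keeps a raw pointer across the end_of_tick boundary; once that stops being true you are back to needing real lifetime management.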
Let me make another more general point : every time you have to "think" when you write code is an opportunity to get it wrong. I think
many "smart" coders overestimate their ability to write simple code correctly from scratch, so they don't write good robust architectural
systems because they know they can just write some code to handle each case. This is bad software engineering IMO.
Actually this leads me into another blog post that I've been contemplating for a while, so we'll continue there ...
I thought I would describe the heuristic algorithm. It is O(N) with no additional storage (it can work in place,
which goes nicely with Moffat's in place Huffman codelen builder ).
Here's the algorithm :
1. Build Huffman code lengths using Moffat INPLACE. You observe some of those code lengths are > maxCodeLen.
We will work only on the code lengths, and we are given the symbol counts. We are given the symbol counts
in sorted order (this was already done for INPLACE; if they were not originally sorted a simple NlogN sort will make them so).
2. Set all code lengths > max to be = maxCodeLen. We now have invalid code lengths, they are not "prefix".
That is, they do not satisfy the Kraft inequality K <= 1 for decodability.
3. Compute the Kraft number, K = Sum { 2 ^ - L_i } ; we currently have K > 1 and want to shrink it down to K = 1 by
increasing some code lengths.
4. PASS 1. Walk over the symbols in sorted order (from lowest count to highest) while K > 1. Do :
5. PASS 2. Walk over the symbols backwards (from highest to lowest count) while K < 1. Do :
6. You now have a set of codelens with K = 1 and all codeLens <= max. Fini.
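The steps above can be sketched roughly as follows. This is a reconstruction from the description, not the original code: the function names are invented, the Kraft number is kept as an integer scaled by 2^maxCodeLen (so "K = 1" means kraft == 1 << maxLen), and the inner loops are my reading of what passes 1 and 2 do. The input is assumed to already be Huffman code lengths for symbols sorted from lowest count to highest.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Kraft sum scaled by 2^maxLen so it stays exact in integers.
// Lengths above maxLen must already have been clamped.
uint64_t kraft_sum(const std::vector<int>& lens, int maxLen) {
    uint64_t k = 0;
    for (int l : lens) k += uint64_t(1) << (maxLen - l);
    return k;
}

// lens[i] is the code length of symbol i, symbols sorted by ascending count.
// Returns true if the result is a decodable code (K <= 1).
bool limit_code_lengths(std::vector<int>& lens, int maxLen) {
    if (maxLen < 1 || maxLen > 32) return false;
    const uint64_t one = uint64_t(1) << maxLen;
    if (lens.size() > one) return false;  // cannot fit that many symbols

    // Step 2: truncate the too-long codes.
    for (int& l : lens)
        if (l > maxLen) l = maxLen;

    // Step 3: compute K (scaled).
    uint64_t k = kraft_sum(lens, maxLen);

    // Pass 1: lowest counts first, lengthen codes while K > 1.
    for (std::size_t i = 0; i < lens.size() && k > one; ++i) {
        while (lens[i] < maxLen && k > one) {
            ++lens[i];
            k -= uint64_t(1) << (maxLen - lens[i]);  // entry halves
        }
    }

    // Pass 2: highest counts first, shorten codes while it still fits.
    for (std::size_t i = lens.size(); i-- > 0 && k < one;) {
        while (lens[i] > 1 && k + (uint64_t(1) << (maxLen - lens[i])) <= one) {
            k += uint64_t(1) << (maxLen - lens[i]);  // entry doubles
            --lens[i];
        }
    }
    return k <= one;
}
```

For example, lengths {5,5,4,3,2,1} (counts like 1,1,2,3,5,8) with maxLen = 3 truncate to {3,3,3,3,2,1}, pass 1 pushes that to {3,3,3,3,3,2} with K = 7/8, and pass 2 gives one bit back to end at {3,3,3,3,2,2} with K = 1.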
Okay, so what's happening here ?
There's one forward pass and one backwards pass. First we truncate the code lengths that were too long.
We are now in trouble and we need to find some other code lengths that we can make longer so that we are
prefix again. The best code to make longer is the one with the lowest symbol count. It doesn't matter
how long the current code length is, the cost of doing L += 1 is always the symbol count. So we just
walk forward from the lowest symbol count. (*).
After step 4 you have a code with K <= 1 , if it's == 1 you're done, but sometimes it is < 1 because you
bumped a lower codelen than necessary and you have a bit of space in your prefix code. To take advantage
of this you want to find the highest count symbol whose length you can decrease and still have a prefix
code.
As noted in the previous post this can be far from optimal, but in the standard case it just doesn't matter
much because these are the very rare symbols.
footnotes :
(* while it is true that the cost is independent of L, the benefit to K is not independent of L, so
adjusting shorter code lens is probably better. Instead of the minimum symbol count (C) you want to
minimize the cost per benefit, which is C * 2^L . So you'd have to maintain a priority queue (**).)
(** it's actually more complex than that (I just tried it). In this step you will often be overshooting K,
when considering overshooting you have to consider the penalty from doing len++ in the step that does the
overshoot vs. how much you can get back by doing len-- elsewhere to come back to K=1. That is, you need to merge
steps 4 and 5 such that you create a single priority queue which consists of some plain len++ ops, and also some
ops that do one len++ plus some number of other len--'s, and pick the best of those options which doesn't overshoot
K. Keep doing ops while K > 1 and you will wind up with K = 1. ).
Actually I wonder if this is a way to reconcile Huffman code building with Package-Merge ?
What would the correct priority queue op be for the (**) footnote ?
Say you're considering some op that does a len++ somewhere and overshoots K. You need to compensate with some amount of K
value to correct. Say that value you need to correct is 2^L. You can either do len-- on a code of length L, or you can
do it on two codes of length L+1. Or one of length L+1 and two of length L+2.
Yep, I see it. Construct a priority queue for each length L. In the queue are symbols of code length L, and also pairs
of items of length L+1 (an item is either a symbol or a pair). To correct K by 2^L you pick the best item from the L'th
queue.
But rather than mess with this making an initial K and then doing corrections, you can just start with all L = 0 and
K = N and then consider doing L++ on each code, that is, so you start by taking the best items from the L = 1 list. Which is just the
package-merge algorithm !
Note that seeing this equivalence relies on some properties of the package-merge algorithm that aren't obvious. When you
are pulling nodes at the final list (the L = 1 list), you can either pick a symbol; picking a symbol means its length was 0
and you are making it 1. That means that symbol was never picked before. (this is true because a coin i is never picked
in an earlier list before it is made active in the final list). Or, if you don't pick a symbol you can pick a pair from
the next L list. This corresponds to doing L++ on those code lengths. The key thing is : if a tree item has child i at
level L, then child i already occurs L times as a raw symbol. This must be true because the cost of the tree item containing
child i is > the cost of child i itself, so at all those levels child i would have been chosen before the tree item.
For example :
In the mean time I finally wrote a length-limited huffman code builder. Everyone uses the "package-merge" algorithm
(see
Wikipedia ,
or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et al. ; the
Larmore/Hirschberg paper is impenetrable).
Here's my summary :
Okay, so it all works fine, but it bothers me.
I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious, it's the obvious way to
solve that problem. I can also see that the minimization of the real value cost in "coin collector problem" can be made equivalent to
the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the
proof that the codes built in this way are prefix. But it's all very indirect and round-about.
What I can't see is a way to understand the "package-merge" algorithm directly in terms of building huffman codes. Obviously you can
see certain things that are suggestive - the making pairs of items with minimum cost is a lot like how you would build a huffman tree.
The funny thing is that the pairing here is not actually building the huffman tree - the huffman tree is never actually made; instead
we make code lengths by counting the number of times the symbol appears in the active set. Even that we can sort of understand
intuitively - if a symbol has very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has
a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. So that
sort of makes sense, but it just makes me even more unhappy that I can't quite see it.
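To make the "count how many times the symbol appears in the active set" part concrete, here is a small sketch of the textbook package-merge procedure. It is the simple memory-hungry form where every item carries the list of symbols inside it, not the space-economical version from the Moffat paper and not my builder; PmItem and package_merge_lengths are made-up names. Counts must be sorted ascending.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

struct PmItem {
    uint64_t weight;           // sum of the symbol counts inside this item
    std::vector<int> symbols;  // symbols it contains (one entry for a leaf)
};

// Returns one code length per symbol, or an empty vector if no code with
// lengths <= maxLen exists.
std::vector<int> package_merge_lengths(const std::vector<uint64_t>& counts,
                                       int maxLen) {
    const std::size_t n = counts.size();
    if (n < 2 || maxLen < 1 || maxLen > 30 ||
        n > (std::size_t(1) << maxLen))
        return {};

    std::vector<PmItem> leaves;
    for (std::size_t i = 0; i < n; ++i)
        leaves.push_back({counts[i], {static_cast<int>(i)}});

    auto lighter = [](const PmItem& a, const PmItem& b) {
        return a.weight < b.weight;
    };

    std::vector<PmItem> list = leaves;  // deepest list: just the symbols
    for (int level = 1; level < maxLen; ++level) {
        // Package: pair up adjacent items; an odd item out is dropped.
        std::vector<PmItem> packages;
        for (std::size_t i = 0; i + 1 < list.size(); i += 2) {
            PmItem p{list[i].weight + list[i + 1].weight, list[i].symbols};
            p.symbols.insert(p.symbols.end(), list[i + 1].symbols.begin(),
                             list[i + 1].symbols.end());
            packages.push_back(std::move(p));
        }
        // Merge: packages compete with a fresh copy of the symbols.
        std::vector<PmItem> merged;
        std::merge(leaves.begin(), leaves.end(), packages.begin(),
                   packages.end(), std::back_inserter(merged), lighter);
        list.swap(merged);
    }

    if (list.size() < 2 * n - 2) return {};

    // Code length = number of times a symbol occurs in the cheapest
    // 2n-2 items of the final list.
    std::vector<int> lens(n, 0);
    for (std::size_t i = 0; i < 2 * n - 2; ++i)
        for (int s : list[i].symbols) ++lens[s];
    return lens;
}
```

With counts {1,1,2,3,5,8} and maxLen = 3 this gives lengths {3,3,3,3,2,2}: the four rare symbols show up in all three lists, while the two common ones get bumped out of one list by packages of cheaper symbols.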
This is a ridiculous load of shit. The credit card companies are almost as bad. The "verified by Visa" shit is a vaguely decent start
in the right direction, but it's still just pathetically easy to get someone's credit card number and use it at will.
The insane thing is that it would be so easy to fix. I've mentioned this idea before, but the simplest one that occurs to me which would
work in the current system is to have temporary one use only account numbers. So when I want someone to do an ACH withdrawal, I ask my
bank for a temp account number which will expire or is conditional, and I give that number to the merchant. Same thing for credit card
numbers. But no. They say some bullshit about how it would be too expensive to retrofit more security into the system, while they
swim in piles of our money.
It blows my mind how many people don't carefully check over their bill from everyone. At the grocery store, from your phone company
(phone companies are such massive scamming lying crooks that the only viable option is to not have a phone contract at all IMO), from
your bank, etc.
There is one good heuristic that works 90% of the time :
If not, then it's some kind of weird synthetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable
property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such.
But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending
on whether the normal filter chose loco post-filter or not.
But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you
optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L winning by quite a lot, but the right
mode is N and it wins by 10k.
ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse;
at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which
are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way
smaller.
Okay, some actually useful notes :
There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs.
Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is
to do entropy per row, and add a penalty for non-zero. You're welcome.
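For reference, the two per-row costs could be computed something like this. It is a sketch of one reading of the text, not my actual code: the function names are invented, the penalty is a free parameter (no particular value is implied), and "entropy per row" is taken to mean the order-0 entropy of that row's filtered bytes. You would run every candidate filter on the row, cost the residual bytes, and keep the cheapest.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum of |residual| with the residual bytes read as signed values,
// plus a flat penalty for every non-zero byte.
double sum_abs_cost(const std::vector<uint8_t>& residuals,
                    double nonzeroPenalty) {
    double cost = 0.0;
    for (uint8_t r : residuals) {
        int magnitude = r < 128 ? r : 256 - r;  // |signed byte|
        cost += magnitude;
        if (r != 0) cost += nonzeroPenalty;
    }
    return cost;
}

// Order-0 entropy of the row in bits, plus the same non-zero penalty.
double entropy_cost(const std::vector<uint8_t>& residuals,
                    double nonzeroPenalty) {
    std::size_t histogram[256] = {0};
    for (uint8_t r : residuals) ++histogram[r];

    const double n = static_cast<double>(residuals.size());
    double bits = 0.0;
    for (int v = 0; v < 256; ++v) {
        if (histogram[v] == 0) continue;
        const double c = static_cast<double>(histogram[v]);
        bits -= c * std::log2(c / n);
    }
    const std::size_t nonzero = residuals.size() - histogram[0];
    return bits + nonzeroPenalty * static_cast<double>(nonzero);
}
```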
The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.
I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but
is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity,
a minimum set of filters is preferable.
BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the
optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what
matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is
that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this
problem; it requires more thought.
Whatever, here you go :
PNG wins by a little bit on FRYMIRE , SERRANO , ryg_t.aircondition.01.bmp , ryg_t.font.01.bmp . I'm going to
pretend that I don't know that because that's what sent me down this god damn pointless rabbit hole in the first
place, I discovered that PNG beat me on a few files so I had to find out why and fix myself.
Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders
out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :
1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for
end of buffer and do you have enough bits blah blah. The correct way to do that is to allocate your
decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder
run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to
do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode
to something).
2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C
very neatly using template loop inversion which I wrote about previously. For maximum speed however you
really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking
awful pointless abortion that is the Paeth filter.
3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit,
then stream out data in the final pixel format into client-ready packed plane. The zip window is only 32k
and you only need one previous row to filter, so your whole set of read-write data should be less than 64k,
and the data you stream out should be written to a separate buffer with NTQ write-combining style writes.
Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs
(X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer
is only written write combined you can decode directly into locked textures.
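To show why Paeth is the odd one out in step 2, here is the predictor as defined in the PNG spec, plus a plain scalar un-filter for one row (the function names are mine). Each output byte depends on the byte just written to its left and goes through a three-way data-dependent select, which is what makes it so much more annoying to vectorize than Sub, Up or Average.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Paeth predictor from the PNG spec: pick whichever of left / up / upLeft
// is closest to left + up - upLeft, ties broken in that order.
uint8_t paeth_predict(uint8_t left, uint8_t up, uint8_t upLeft) {
    const int p = int(left) + int(up) - int(upLeft);
    const int pa = std::abs(p - int(left));
    const int pb = std::abs(p - int(up));
    const int pc = std::abs(p - int(upLeft));
    if (pa <= pb && pa <= pc) return left;
    if (pb <= pc) return up;
    return upLeft;
}

// Undo the Paeth filter in place for one row. `prev` is the previous row,
// already un-filtered (all zeros for the first row); `bpp` is bytes per
// pixel. Bytes before the start of the row count as zero.
void unfilter_paeth_row(std::vector<uint8_t>& row,
                        const std::vector<uint8_t>& prev, std::size_t bpp) {
    for (std::size_t i = 0; i < row.size(); ++i) {
        const uint8_t left = i >= bpp ? row[i - bpp] : 0;
        const uint8_t up = prev[i];
        const uint8_t upLeft = i >= bpp ? prev[i - bpp] : 0;
        row[i] = static_cast<uint8_t>(row[i] + paeth_predict(left, up, upLeft));
    }
}
```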
My guess is that this should be in the neighborhood of 80 MB/sec.
AdvPNG is just the Zip encoder from 7zip (Igor Pavlov). It's a semi-optimal parser using a forward-pass multi-lookahead heuristic like Quantum.
Some day I need to try to read the source code to figure out what he's doing exactly, but god damnit the code is ugly and it's not documented.
Dammit Igor.
PNGOUT is a lossless "deflate" (zip) optimizer. The engine is the same as
KZip by Ken Silverman. Unfortunately Ken has not documented his algorithm at all anywhere.
Come on Ken! Nobody is gonna pay you for KZip! Just write about it!
Anyway, my guess from reading his brief description and looking at what it does is : KZip has some type of optimal parser. Who knows what
kind of optimal parser; my guess is that knowing a bit about how Ken codes it is probably not an actual Storer-Szymanski optimal parser, but
rather some kind of heuristic, perhaps like 7zip/LZMA. KZip also clearly has
some kind of Huffman split point optimizer (similar to
what I just did ). Again just guessing from the command line options it looks like his Huffman splitter is single pass and is just
based on a heuristic that detects changes in the statistics and decides to put a split there. Hmmm, I wish I'd found this months ago.
KZip appears to be the best zip optimizer in existence. Despite claims of being crazy slow I actually think
it's quite fast by my standards.
No surprise KZip is a lot smaller and faster than my optimal parser, but I do make smaller files. Ken ! Set your algorithm free!
(ADDENDUM : whoops, that was only on PNG data; for some reason it's pretty fast on image data, but it's slow as balls in
some other cases, not sure what's going on there; 7zip is a lot faster than kzip and the file sizes are usually
very close (once in a while kzip does significantly better)).
For a while I've been curious to try my RAD LZ optimizer on a Zip token stream. It would be a nice sanity check to test it against KZip, but I'm not
motivated enough to take the pain of figuring out how to write Zip tokens.
Here are the results for my PNG-alike with the modes :
The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes
with LZ parameters optimized for each mode.
One important note :
The "Normal" filters include the option to do a post-filter Loco conversion. This is different than the "loco space" option in the modes above.
Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ.
Loco built into a filter mode means, on each pixel do the filter delta, then do loco conversion. This can be integrated directly into the DPCM
pixel delta code, so it's just considered a filter type. In particular, in "loco space", then the N,W,NW neighbors are already in loco colorspace.
When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these
would be equivalent.
Okay, what do we see?
It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression
"level") helps a lot; as does increasing match len; on natural images a min match len around 8 or 10 is usually best. (or even more, I forbid a
min match len > 10 because it starts to hurt decode speed).
Well we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode.
It is often the case that the best mode after optimization is the same as the best mode before optimization, but not always. When it is not
the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would make output
of 393k instead of 360k.
Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to best if you didn't get the best.
Synthetic images are harder, they are very sensitive to the exact mode.
We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any
heuristic that can detect when 0+loco needs to be checked.
Similarly for adaptive + noloco.
Obviously there's a fundamental problem when the initial sizes are very close together, you can't really differentiate between the modes.
When the sizes are very far apart then it is a reliable guess.
Let me briefly note things I could be searching that I'm not :
Rearranging pixels in various ways, eg. RGBRGBRGB -> RRRGGGBBB , or to whole planes; interleaving lines, different scan orders, etc. etc.
LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time
on this so I'm not bothering with it.
Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.
Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.
Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it.
Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other
pixel formats that might help, blah blah.
Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks
and compress them completely independently (no LZ window across the chunk boundary).
This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.
Base PNG by default does :
So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest.
The things you can play with are :
The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings.
The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try)
with LZ, then depending on which of those is best they try a few others, and then maybe a few more, then you're done. It's like based on
which branch of the search space you walk off initially they know from testing where the optimum likely is.
"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion
you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform
can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though).
(BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for
purposes of this comparison).
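A minimal sketch of that byte-to-byte LOCO transform, assuming 8-bit RGB and modulo-256 wraparound on the differences. The struct and function names are made up for illustration; this is not pngcrush's code.

```cpp
#include <cstdint>

// Hypothetical 8-bit RGB pixel, used only for this sketch.
struct Rgb { uint8_t r, g, b; };

// Forward LOCO : (R,G,B) -> (R-G, G, B-G). The differences wrap modulo 256,
// so every channel stays in a byte (no range expansion).
Rgb loco_forward(Rgb p)
{
    return Rgb{ uint8_t(p.r - p.g), p.g, uint8_t(p.b - p.g) };
}

// Inverse : add G back, again modulo 256. This undoes the forward step exactly.
Rgb loco_inverse(Rgb p)
{
    return Rgb{ uint8_t(p.r + p.g), p.g, uint8_t(p.b + p.g) };
}
```

The wraparound is what keeps it lossless in bytes: R-G can be anywhere in -255..255, but since the decoder adds G back with the same modulo arithmetic, the original value is recovered.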
There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.
BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. The thing is, yes basic PNG is easy to beat
and it has a lot of mistakes in the design, but the basic idea is fine, they did a good job on the standard pretty quickly, but the thing
that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it, and
while base PNG is pretty bad, PNG after optimization is not bad at all.
Find a DPCM pixel prediction filter which uses only N,W,NW and does not range-expand (eg. ubytes stay in ubytes).
(eg. like PNG).
We certainly could use a larger neighborhood, we could use adaptive predictors that evaluate the neighborhood for edges/etc., we
would wind up with GAP from CALIC or something newer. We want to keep it simple so we can have a very fast decoder.
These filters :
commentary :
The big surprise is that ClampedGradPredictor (#12) is the fucking bomb. In fact it's so good that it hides the behavior of other predictors.
For example plain old Grad is never picked. Also predictor #5 (grad skewed towards average) was actually by far the best until #12 came
along.
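For reference, a sketch of what a clamped gradient predictor looks like. The filter list is not reproduced here, so the exact definition of #12 is an assumption: the plain gradient N + W - NW, clamped into the range spanned by N and W. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed definition : gradient predictor clamped to [min(N,W), max(N,W)].
uint8_t clamped_grad_predict(uint8_t n, uint8_t w, uint8_t nw)
{
    const int grad = int(n) + int(w) - int(nw);   // plain gradient, can leave 0..255
    const int lo = std::min(n, w);
    const int hi = std::max(n, w);
    return uint8_t(std::min(std::max(grad, lo), hi));
}

// DPCM filter : store (actual - prediction) modulo 256, so ubytes stay ubytes.
uint8_t filter_pixel(uint8_t actual, uint8_t n, uint8_t w, uint8_t nw)
{
    return uint8_t(actual - clamped_grad_predict(n, w, nw));
}

// Inverse : the decoder forms the same prediction and adds the residual back.
uint8_t unfilter_pixel(uint8_t residual, uint8_t n, uint8_t w, uint8_t nw)
{
    return uint8_t(residual + clamped_grad_predict(n, w, nw));
}
```

The clamp means the prediction can never overshoot both neighbors, which is the failure case of plain Grad at edges. It only needs N, W and NW, so the decoder stays simple.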
The other minor surprise is that W is actually best sometimes, and N is never best, and generally N is much worse than W. Now, it is no
surprise that W is better than N - it is a well known fact that typical images have much stronger horizontal correlation than vertical,
but I am surprised just how *big* the difference is.
More in the next post.
2. Fucking folders can't all be set to Details as far as I can figure out. Yes yes yes I know you can do "set all folders to look like this one".
But that only applies to folders that you have ever visited at least once. When you go to some random folder which you have never visited
before - BOOM you're looking at fucking icons again. I despise icons. There must be a way to set the default
options for folders, but I can't find it on web searches.
3. Fucking backspace not going up dir in Explorer really chaps my hide. I need to fix that.
I guess I'll do that after I write my own AllSnap replacement (Grrr).
When I lose my net connection from home to work, VC will still hang longer than I'd like. This appears to be coming from the
P4 command. The problem is that P4 hangs, and that Nifty waits on P4. Those issues in more detail :
P4 stalls way too long when it can't connect to the server. So far as I can tell there is no way to set this timeout variable
in P4 (??). (net searching is a bit hard because Perforce does have a timeout variable, but that is for controlling how long
client login sessions on the server last before they are reset). I'd like to set this timeout to like 1 second, currently it's
10 seconds or something. I basically either have a fast connection or not, the long timeout is never right.
Nifty stalls on P4. This is so that when you do something like "save all", it gets notification of the VC command and can check out
any necessary files before VC tries to save them. So it runs the P4 command and waits before returning to let VC do its save.
So my hack solution for the moment is to make Nifty only wait for 500 millis for the P4 command and just return if it isn't done by then.
This will then give you a "file is not writeable" popup error kind of thing and saves you from the horrible stalled out DevStudio.
BTW some notes for working on addins :
The easiest way to find the names of VC commands seems to be from Keyboard mapping. It appears that they are basically not
documented at all. If you can't get it to hook up from the Keyboard mapping command name, your next best option is to trap
*every* command and log its name, then go into VC and do the things you want to find the name for and see what comes out of
your logs. (see links below)
Disabling an addin from the Tools->Addins menu does not actually unload it (sadly). You have to close VC and
restart it to make it unload so that you can update the DLL.
The easiest way to debug an addin is just to make a .AddIn file in
"C:\Users\you\Documents\Visual Studio 2005\Addins" and point it at the DLL that you build from your AddIn project.
Then you can set up F5 in your AddIn project to run devenv. That way you debug the devenv session that loads your AddIn
and you can set breakpoints etc.
See also :
Using EnableVSIPLogging to identify menus and commands with VS 2005 + SP1 - Dr. eX's Blog - Site Home - MSDN Blogs
A while ago I decided since I'm buying all this stuff from Amazon, I should write some reviews so that other people can see what products are
good and which aren't. I mainly wrote reviews for products that had zero reviews, and mainly in cases where the product was not clearly
described or listed.
I thought nothing of it, until I went back and was reviewing some of my purchases and noticed that there was no review on some of the products
I was pretty sure I had reviewed.
Well, it turns out that Amazon censored my review. Not only did they not publish it, but they *silently* don't publish it, they don't give you
any notification or reason that it was declined, it just doesn't show up.
It took me a while emailing back and forth with support to get some answers, but it turns out most of the declines come down to a specific
policy :
Amazon does not allow reviews to address errors in the listing.
Some of the reviews I wrote were about incorrect pictures or list prices. For example a picture might show a set of several brass wool scrubbers,
but the shipped product is only one of them. I write a review to make it clear what you're getting. BOOM! Review does not show up.
Here are the current guidelines . In
particular, the troublesome ones are :
Of course I should note that this is not unusual. Yelp of course censors reviews in roughly the same way ; both Amazon and Yelp really want you
to write "stories" that attract "community" and generally make people hang out and buy product. They don't want factual information that helps
customers. I had several reviews deleted from Yelp because they failed to be "personal" enough (they were short things like "this place sucks").
Yelp is also semi-corrupt, despite claims otherwise they are in fact in the pocket of sponsors, and will delete reviews or ban people who
write too many negative reviews.
CNET and Chow are actually much worse. They will bald-facedly delete reviews and discussion threads that are
critical of sponsors.
Most of the advertising sponsored web forums, such as the 2+2 poker forums or car forums like 6speedonline are the
same way as well. They will lock or delete threads that are critical of sponsors or the site administration.
In conclusion : the internet is fucked. It is now owned by corporations who censor and edit the content to create the message they want.
You have to be very careful about what you read on these sites, because valuable negative information might have been deleted, and if
you value your own content, you should not contribute to any one of these sites.
This is obviously retarded. Either you're a spammer or you're not. Spammers should have their accounts banned and the rest of us should
be allowed to send fucking email. If you are considering getting an account with Verio, don't. (their prices are also terrible by modern
standards).
See here for more people complaining.
Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0).
I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to
figuring it out. Anyway, I stumbled upon this :
The idea is obvious once you see it. You walk in order of the original character array index,
(not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor.
At index [i+1] then, you must be able to match to a length of at least L-1 by
matching to the same neighbor, because by stepping forward one char you've just removed one from that suffix
match. For example :
And here's the C code :
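A minimal sketch of that walk, written fresh for illustration rather than taken from the original listing. It assumes you already have the suffix array (sa[r] = start index of the suffix with sorted rank r) and reports, for each rank, the match length against the previous suffix in sorted order. The function and variable names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns lcp where lcp[r] = match length between suffix sa[r] and its
// suffix-sorted neighbor sa[r-1]; lcp[0] = 0 (no previous neighbor).
std::vector<int> pair_match_lengths(const std::string& text, const std::vector<int>& sa)
{
    assert(sa.size() == text.size());
    const int n = int(text.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int r = 0; r < n; r++) rank[sa[r]] = r;   // inverse of the suffix array

    int len = 0;   // match length carried over from index i-1
    for (int i = 0; i < n; i++)   // walk in original text order, not sorted order
    {
        if (rank[i] == 0) { len = 0; continue; }
        const int j = sa[rank[i] - 1];   // sorted neighbor of suffix i
        // Start from the carried length; only the new characters get compared.
        while (i + len < n && j + len < n && text[i + len] == text[j + len]) len++;
        lcp[rank[i]] = len;
        if (len > 0) len--;   // stepping forward one char drops one from the match
    }
    return lcp;
}
```

Since len goes down by at most one per step and can never exceed N, the total number of character compares is O(N), even on the degenerate all-zero file.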
Here's the whole routine :
But the problem is here :
In particular this loop could be written without the redundant counter :
(BTW I just can't help going off topic of my own posts; oh well, here we go : I hate the fact that I have to have C-style casts in my code
all over now, largely because of fucking 64 bit. Yeah yeah I could static_cast or whatever fucking thing but that is no better and much
uglier to read. On the line with the _MIN , I know that a _MIN of two types can fit in the size of the smaller type, so it's safe,
but it's disgusting that I'm using my own semantic knowledge to validate the cast, that's very dangerous. I could use my check_value_cast here
but it is tedious sticking that everywhere you do any 32-64 bit casting or pointer-int casting.
On the output I could use ptrdiff_t or whatever you portability tards want, but that only pushes the cast to somewhere else (if I have Read
return a type ptrdiff_t then the outside caller has to turn that into S64 or whatever).
I decided it was better if my file IO routines always
just work with 64 bit file positions and sizes, even on 32 bit builds. Obviously running on 32 bits you can't actually do a larger than 4B
byte read into a single buffer, so I could use size_t or something for the size of the read here, but I hate types that change based on the build,
I much prefer to just say "file pos & size is 64 bit always".)
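A sketch of the kind of checked narrowing cast meant above. This is not the actual check_value_cast from cblib; the names value_fits and checked_cast are made up for illustration, and it assumes the usual wraparound behavior of integer conversions.

```cpp
#include <cassert>
#include <cstdint>

// True if `value` survives a round trip through type To with the same sign,
// ie. the narrowing cast loses nothing.
template <typename To, typename From>
bool value_fits(From value)
{
    const To narrowed = static_cast<To>(value);
    return static_cast<From>(narrowed) == value
        && ((narrowed < To(0)) == (value < From(0)));
}

// Narrowing cast that asserts in debug builds instead of silently truncating.
template <typename To, typename From>
To checked_cast(From value)
{
    assert(value_fits<To>(value));
    return static_cast<To>(value);
}
```

With something like this the compiler-visible cast still exists, but the "I know this fits" knowledge is checked at runtime in debug builds rather than living only in your head.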
The Porsche is sort of the same way. I don't know if they actually do this intentionally, maybe they do if they're very clever,
but the car is a bit temperamental, you have to baby it a bit, check the oil all the time, let it get up to temperature before
thrashing it, etc. Of course that "quirkiness" is really "shittiness", a modern car should just run and not need to be babied,
but that investment that you put in actually makes you feel closer to it, it makes you really fond of it.
It's well known that our attachment to children is based on the same principle. You put so much work into making a child
that you then feel very committed to it.
Of course this is really all just a classic example of one of the major flaws in human intuitive reasoning - the fallacy of
sunk cost. Just because you have previously put in lots of work to your child doesn't mean that you should continue to do so.
You should ignore sunk cost and just evaluate future EV based on the current situation. If you're a rational human you should wake up
each day and say to yourself "should I keep my kids today?".
BTW this is another rant but I do find it somewhat amusing that the most celebrated features of humanity are basically irrationality.
There's only good decision making and bad decision making. Irrational decision making is bad. Anyone who is "emotional" or "sticks to
their guns" or is "brave" or "puts their family before everything else" or whatever is a bad decision maker. That's not to say that
a rational decision maker will never do anything that might be described as "brave" or wouldn't "put their family first", but they will do
it based on considering the consequences of each choice and selecting one; anyone who does it based only on "feeling" is dangerous and
not to be trusted.
That's just about over. The state of the art in most areas of coding is getting so advanced that ad-hoc approaches simply don't
cut the mustard. The competition is using an approach that is provably optimal, or provably within X% of optimal, or provably optimal
in a time-quality tradeoff sense.
And there are so many strong solutions to problems out there now that cutting your own is more and more just foolish and hopeless.
I wouldn't say that the end of the ad-hoc programmer has come just yet, but it is coming. And that's depressing.
You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes.
The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously
this lets you adapt to the local statistics better, but costs the space to send those code lengths.
(BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes").
You want to pick the split
points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).
The issue is how do you pick the splits? For N characters the possible number of splits is 2^(N-1) (eg. if N = 3, the
ways are [012],[0][12],[01][2],[0][1][2]) so brute force is impossible.
The standard ways to solve this problem are either top down or bottom up. I always think of these in terms of Shannon-Fano tree building
(top down) or Huffman tree building (bottom up).
The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down
tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data
like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to
offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely
segregating each unique bit. I also sometimes call this the "greedy" approach.
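One step of the top-down approach can be sketched like this. It is a simplification under stated assumptions: the cost of a block is estimated by its order-0 entropy instead of real Huffman code lengths, and the cost of sending new code lengths is a flat header_bits per block. Both function names and the header_bits parameter are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Estimated bits to code data[begin,end) with its own code : entropy of the
// range plus a flat header (assumed stand-in for sending the code lengths).
double block_cost_bits(const std::string& data, std::size_t begin, std::size_t end, double header_bits)
{
    assert(begin <= end && end <= data.size());
    std::size_t counts[256] = {0};
    for (std::size_t i = begin; i < end; i++) counts[(unsigned char)data[i]]++;
    const double total = double(end - begin);
    double bits = header_bits;
    for (std::size_t c : counts)
        if (c > 0) bits += double(c) * std::log2(total / double(c));
    return bits;
}

// Tries every split position and returns the one with the lowest total cost,
// or `begin` if no single split beats leaving the range whole.
std::size_t best_single_split(const std::string& data, std::size_t begin, std::size_t end, double header_bits)
{
    double best = block_cost_bits(data, begin, end, header_bits);
    std::size_t best_pos = begin;
    for (std::size_t pos = begin + 1; pos < end; pos++)
    {
        const double cost = block_cost_bits(data, begin, pos, header_bits)
                          + block_cost_bits(data, pos, end, header_bits);
        if (cost < best) { best = cost; best_pos = pos; }
    }
    return best_pos;
}
```

The full top-down builder would recurse on each half until best_single_split reports no gain. This sketch recounts the histogram at every candidate position, so it is O(N^2) per level; a real version would update the two histograms incrementally as the split point moves.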
So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find
the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.
In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't think there is any strong
reason why that should be true.
There is a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature
(actually this is identical to category tree building in machine learning, since entropy is a pretty good way to rate the quality of category
trees, and an "overfit" penalty for making too many splits is equivalent to the code cost transmission penalty here).
One of the good tricks if using the top-down approach is do an extra "non-greedy" step. When you hit the point where no more splits are
beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get
past the local minimum, eg. the classic case is trying to build a kd-tree for this kind of data :
Anyway, because this class of problems is unsolvable, most of the techniques are just heuristic and ad-hoc. For example in many cases it
can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been
a lot of great papers in the last 5 years out of the interactive ray tracing community.
I really don't like interacting with human beings very much. I wish I could have sanctuary where I was alone in my own environment with no
fucking headaches and annoyances from outside forces.
Sometimes I think about moving out to the country so I can get a bunch of land and not have to see any neighbors,
but my god country people are fucking unbearable. I don't mind the fact that they have very simple tastes - cheap beer, country music,
muscle cars - but they're just rotten human beings, they are not generous of spirit and wide of mind. They give you dirty looks
if you're an outsider; if you tell them their joke about date rape is offensive they say "what are you a fag or somethin?". They're the
worst. Yes I'm talking about you Pomeroy.
Of course there are the country towns that have been colonized by retirees or "artists", liberal types from the city who now run a bead
boutique and give dirty passive aggressive looks to anyone who hasn't made their house "quaint" enough. They're almost as bad as the
real country folk.
N and I took a trip out to Eastern WA over the weekend and saw many striking and beautiful things. It gets bloody well hot and dry out
there so I think it was just about the last chance to do it in comfort, so I'm glad we did. Probably best to go out there in May, which
is their spring. We still got a lot of fresh spring greenery and a lot of wild flowers, but I do think we were a week or two late. We have
an amazing knack of finding surprising off-the-map beautiful things together, it's almost magical, like we will suddenly decide to take
some side road that we hadn't noticed before and suddenly we are in a strange different world of purple flowers and intense wind and rippling
alfalfa.
I really despise commuting. For a while there the novelty of the new car made it okay again, but it's back to being just awful. The past
few weeks I've been working mainly from home because I had a nice productive spurt and just wanted to go with it and knock out some code.
Commuting just puts me in such a foul mood that I lose hours of productivity afterward before I can simmer down and think again,
and then when I come home it sets me off again, such that when I get home N usually asks "what's wrong with you?" , oh I just fucking hate
everyone and I'm really depressed about how fucking shitty humanity is because I had to drive with them, that's all.
I just don't know if I can function in this world. Sometimes I think about buying a house and having a stable job and a family and all that,
but that life involves working regular hours and commuting and shit like that and I just don't think I can do it.
One problem I have with this analysis is that it assumes your weight distribution is static. In reality
the 60/40 is only when I'm seated. When I am braking downhill my weight transfers forward a lot and with
these low pressures the front tire will mush.
I've been having a lot of trouble on my blue bike because of the wheels I bought from Pricepoint. They're
good components (Shimano hubs on Mavic Open Pro rims) but I think the spokes are shitty and the lacing job
was done wrong somehow, because they keep coming loose; I tighten and true them, and then a week later
they're loose again. Contrary to the popular press, I have found that the "pre-built" wheelsets (like
Mavic Cosmos or whatever) are excellent quality and hold up great, but the "hand built" wheels that are
supposed to be so superior are only as good as the build-up, which if you buy from some cheapo online place
is probably not very good.
Anyway, because of my wheel trouble I swapped out my rear wheel for another that I have sitting around with
an 8 speed cassette on it (my bike is normally a 10 speed). I switched the downtube lever to friction shifting
(instead of index) and off you go. Friction shifting on these many-speed modern bikes is sort of interesting;
the cogs are so close together and the shift is so smooth that it almost feels like a completely analog shift.
You just move the lever and it silently slips into a very slightly different gear. You can just dial the
lever to the gear ratio you want like an analog slider. I wonder when we will have continuous transmissions
for bicycles; you could imagine just having a single cog with a ratcheting mechanism to get bigger or smaller,
or perhaps more realistically a cone gear with a belt drive like some of the early CVT's on cars (a guide holds
the belt to one part of the cone which sets the gear).
Anyway, friction shifting for 8+ gears is not awesome. On flat ground with no load it's fine, but if your
bike has any flex at all, it will cause you to change gears when you stand up and dig up a hill, the gears
are just too close together to avoid hop. I don't think you can friction shift past 6, maybe 7 gears.
IMO 8 gears was probably the perfect amount. I know some purists will say 5 was enough. Eh, not really. 5
is plenty if you are on flat terrain, sure, but for varied terrain you do want some very small gears, and also
some big ones for flats (though I agree with Dave Moulton that the very big gears most bikes come with these
days are very pointless; sure it's fun to go 30 mph on a downhill, but you could go 27 mph and it would be
almost as good). At 8 you can have enough range and also fine enough steps in the "money zone" where you
spend most of your time. Beyond 8, there's little gain from the additional gears, and you start having more
problems because the cogs are so very close together, it's more finicky about having the index adjustment
just right, if things aren't right it's easier to get hops into the wrong gear, and of course the chain has
to be thinner and weaker.
That said, I will now do some ranting.
Fucking external code and your use of unsigned ints! I've tracked down the last two (*) of my nasty bugs in
porting to 64 bit and both were because of people using unsigned ints for no good reason. (* = I think; you can
never know that you found your last bug, much like you can never prove a physical theory right).
One was in the BMP reader code that I stole from MSDN years ago after getting fed up with not being able to
load some weird formats (RLE4 or whatever). (BTW this is another rant - every god damn video game codebase I've ever
used has some broken ass home-rolled BMP reader in it that only handles the most friendly and vanilla BMP formats;
fucking don't roll your own broken ass shit when there are perfectly good fully functional ones you can take).
Anyway, because it's MS code it uses the awful DWORD and UINT and all that shit. It actually had a number
of little bugs that I fixed over the years, such as this one :
While I'm at it, this is the list of nasty not-obvious stuff to watch out for that I made :
cbloom rants 06-21-08 - 3
Part of the nastiness was that in Win32 , command line apps get args in OEM code page, but most Windows APIs expect files in ANSI code
page. (see my pages above - simply doing OEM to ANSI conversions is not the correct way to fix that) (also SetFileAPIsToOEM is a very
bad idea, don't do that).
Here is what I have figured out on Win64 so far :
1. CMD is still using 8 bit characters. Technically they will tell you that CMD is a "true unicode console". eh, sort of. It uses whatever
the current code page is set to. Many of those code pages are destructive - they do not preserve the full unicode name. This is what causes
the problem I have talked about before of the possibility of having many unicode names which show up as the same file in CMD.
2. "chcp 65001" changes your code page to UTF-8 which is a non-destructive code page. This only works with some fonts that support it (I believe Lucida
Console works if you like that). The raster fonts do not support it.
3. printf() with unicode in the MSVC clib appears to just be written wrong; it does not do the correct code page translation.
Even wprintf() passed in correct unicode file names does not do the code page translation
correctly. It appears to me they are always converting to ANSI code page. On the other hand, WriteConsoleW does appear to be doing the code
page translation correctly. (as usual the pedantic rule-morons will spout a bunch of bullshit about the fact that printf is just fine the way it is
and it doesn't do translations and just passes through binary; not true! if I give it 16 bit chars and it outputs 8 bit chars, clearly it is doing
translation and it should let me control how!)
4. CMD has a /U switch to enable unicode. This is not what you think, all it does is make the output of built-in commands unicode. Note that
your command line apps might be piped to a unicode text file. To handle this correctly you need to detect it yourself. Ugly ugly.
5. CMD display is still OEM code page by default. In the old days that was almost never changed, but nowadays more people are in fact
changing it. To be polite, your app should use GetConsoleOutputCP() , you should *NOT* use SetConsoleOutputCP from a normal command line app
because the user's font choice might not support the code page you want.
6. CMD argc/argv argument encoding is still in the cmd code page (not unicode). That is, if you run a command line app from CMD and auto-complete to select a
file with unicode name, you are passed the code page encoding of that unicode name. (eg. it will be bytewise identical to if you did
FindFirstW and then did UnicodeToOem). This means GetCommandLineW() is still useless for command line apps - you cannot get back to the original unicode
version of the command line string. It is possible for you to get started with unicode args (eg. if somebody ran you from CreateProcessW), but that is
so rare it's not really worth worrying about.
7. If I then call system() from my app with the CMD code page name, it fails. If I find the Unicode original and convert it to Ansi, it is
found. It appears that system() uses the ANSI code page (like other 8-bit file apis). ( system() just winds up being CreateProcess ). This means
that if you just take your own command line that called you and do the same thing again with system() - it might fail. There appears to be no way
to take a command line that works in CMD and run it from your app.
_wsystem() seems to behave well, so that might be the cleanest way to proceed (presuming you are already doing the
work of promoting your CMD code page arguments to proper unicode).
8. Copy-pasting from CMD consoles seems to be hopeless. You cannot copy a chunk of unicode text from Word or something and paste it into a console
and have it work (you would expect it to translate the unicode into the console's code page, but no). eg. you can't copy a unicode file name in explorer
and paste it into a console. My oh my.
9. "dir" seems to cheat. It displays chars that aren't in the OEM code page; I think they must be changing the code page to something else
to list the files then changing it back (their output seems to be the same as mine in UTF8 code page).
This is sort of okay, but also sort of fucked when you consider problem #8 : because of this dir can show file names which will then not be
found if you copy-paste them to your command line!
So far as I can tell there is no API to tell you the code page that your argc/argv was in. That's a pretty insane omission. (hmm, it might be
GetConsoleCP , though I'm not sure about that). (I'm a little unclear about when exactly GetConsoleCP and GetConsoleOutputCP can not be the
same; I think the answer is they are only different if output is piped to a file).
I haven't tracked down all the quirks yet, but at the moment my recommendation for best practices for command line apps goes like this :
Use GetConsoleCP() to find the input CP. Take your argc/argv and match any file arguments using FindFirstW to get the unicode original names.
(I strongly advise using cblib/FileUtil for this as it's the only code I've ever seen that's even close to being correct). For
arguments that aren't files, convert from the console code page to wchar.
Work internally with wchar. Use the W versions of the Win32 File APIs (not A versions). Use the _w versions of clib FILE APIs.
To printf, either just write your own printf that uses WriteConsoleW internally, or convert wide char strings to GetConsoleOutputCP() before calling into
printf.
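A rough sketch of the output half of that recommendation. The function name print_wide is made up; this is not the cblib implementation. It assumes that a successful GetConsoleMode on the stdout handle means you are on a real console (so WriteConsoleW can be used), and otherwise treats output as redirected and converts to the GetConsoleOutputCP() code page. The non-Windows branch exists only so the sketch compiles elsewhere.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cwchar>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Writes a wide string to stdout. Returns false if the write or conversion fails.
bool print_wide(const wchar_t* text)
{
    const std::size_t length = std::wcslen(text);
    if (length == 0) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0, written = 0;
    if (GetConsoleMode(out, &mode))
    {
        // Real console : let WriteConsoleW do the code page translation.
        return WriteConsoleW(out, text, (DWORD)length, &written, NULL) != 0;
    }
    // Redirected (pipe or file) : convert to the console output code page ourselves.
    const UINT cp = GetConsoleOutputCP();
    const int size = WideCharToMultiByte(cp, 0, text, (int)length, NULL, 0, NULL, NULL);
    if (size <= 0) return false;
    std::vector<char> bytes(size);
    WideCharToMultiByte(cp, 0, text, (int)length, bytes.data(), size, NULL, NULL);
    return WriteFile(out, bytes.data(), (DWORD)size, &written, NULL) != 0;
#else
    // Fallback for other platforms : convert with the C locale and write bytes.
    std::vector<char> bytes(length * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(bytes.data(), text, bytes.size());
    if (n == (std::size_t)-1) return false;
    return std::fwrite(bytes.data(), 1, n, stdout) == n;
#endif
}
```

Whether the redirected case should write the console code page, UTF-8, or UTF-16 is a judgment call; the choice here just follows the "convert to GetConsoleOutputCP()" option described above. A printf-style wrapper would format into a wide buffer first and then hand it to a routine like this.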
For more information :
Console unicode output - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
Addendum : I've updated cblib and chsh with new versions of everything that should now do all this at least semi-correctly.
BTW a semi-related rant :
WTF are you people who define these APIs not actually programmers? Why the fuck is it called "wcslen" and not "wstrlen" ? And how about
just fucking calling it strlen and using the C++ overload capability? Here are some sane ways to do things :
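As one illustration of the overload idea (a sketch with made-up names, wrapped in a namespace so it does not collide with the real clib functions):

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

namespace sketch
{
    // Same name for both character widths; the compiler picks by argument type.
    inline std::size_t strlen(const char* s)    { return std::strlen(s); }
    inline std::size_t strlen(const wchar_t* s) { return std::wcslen(s); }
}
```

Calling code then reads the same whether the string is narrow or wide, eg. sketch::strlen("abc") and sketch::strlen(L"abcd").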
Also the fact that wprintf exists and yet is horribly fucked is pretty awful. It's one of those cases where I'm tempted to do a
x64 totally changes the way exceptions on the PC are implemented.
There are two major paradigms for exceptions :
1. Push/pop tracking and unwinding. Each time you enter a function some unwind info is pushed, and when you
leave it is popped. When an exception fires it walks the current unwind stack, calling destructors and
looking for handlers. This makes executables a lot larger because it adds lots of push/pop instructions,
and it also adds a speed hit everywhere even when exceptions are never thrown because it still has to push/pop
all the unwind info. MSVC/Win32 has done it this way. ("code driven")
2. Table lookup unwinding. A table is made at compile time of how to unwind each function. When an exception
is thrown, the instruction pointer is used to find the unwind info in the table. This is obviously much
slower at throw time, but much faster when not throwing, and also doesn't bloat the code size (if you don't
count the table, which doesn't normally get loaded at all). I know Gcc has done this for a while (at least
the SN Systems version that I used a while ago). ("data driven")
All x64 code uses method #2 now. This is facilitated by the new calling convention. Part of that is that
the "frame pointer omission" optimization is forbidden - you must always push the original stack and
return address, because that is what is walked to unwind the call sequence when a throw occurs. If you
write your own ASM you have to provide prolog/epilog unwind info, which goes into the unwind table to tell
it how to unwind your stack.
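To make the "data driven" idea concrete, here is a toy sketch of the table lookup. It is not the real x64 unwind data format; the UnwindEntry layout, the unwind_info field and the function name are all invented for illustration. The only point is that nothing runs on function entry or exit, and the throw path pays for a search keyed on the instruction pointer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical table row : code range [begin,end) and an id for its unwind recipe.
struct UnwindEntry
{
    uint64_t begin;
    uint64_t end;
    int unwind_info;
};

// `table` must be sorted by begin with non-overlapping ranges.
// Returns the entry covering `ip`, or nullptr if the address is in no known function.
const UnwindEntry* find_unwind_entry(const std::vector<UnwindEntry>& table, uint64_t ip)
{
    // First entry that starts after ip; the one before it is the only candidate.
    auto after = std::upper_bound(table.begin(), table.end(), ip,
        [](uint64_t address, const UnwindEntry& e) { return address < e.begin; });
    if (after == table.begin()) return nullptr;
    const UnwindEntry& candidate = *(after - 1);
    return ip < candidate.end ? &candidate : nullptr;
}
```

An unwinder would repeat this for each return address as it walks up the call sequence, applying the recipe found at each level.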
This has been rough and perhaps slightly wrong. Read here for more :
X64 Unwind Information - FreiK's WebLog - Site Home - MSDN Blogs
BTW many people (Pierre, Jeff, etc.) think the overhead of Win32's "code driven" style of exception handling is excessive. I agree to some
extent, however, I don't think it's terribly inefficient. That is, it's close to the necessary amount of code and time that you have to
commit if you actually want to catch all errors and be able to handle them. The way we write more efficient code is simply by not catching all
errors (or not being able to continue or shut down cleanly from them). That is we decide that certain types of errors will just crash the app.
What I'm saying is if you actually wrote error checkers and recovery for every place you could have a failure, it would be comparable to the
overhead of exception handling. (ADDENDUM : nah this is just nonsense)
I'm always amazed that novelists actually finish their novels. Especially conceptual works. If it's a page-flipper story novel, then
the story actually carries you (the writer) along with it, you're excited to see what happens next just like the reader is. But when it's
just a dry game in form, I would get bored of writing it after 5 pages.
Pierre wrote a pretty good blog post a while ago (which I can't find now grrr) about how the standard maxim to
"optimize late & optimize only hot spots" can be
wrong. In particular, the general bloat of inefficiency everywhere can be a huge drag and not provide any easy targets. I agree with that
to some extent (certainly at OddWorld we had the problem of some nasty generalized speed hit from various small inefficiencies). However,
the opposite is also true - optimizing early or micro-optimizing can really make you do stupid things. I've written before about how it
can get you into local minima traps where you don't see a better algorithm because your current bad one is so tweaked. One of the worst
things you can do is super-optimize a bad algorithm. And on the flip side, it can actually be a huge boon to your coding if your low
level stuff is really slow.
I was reminded of this recently when I got
this introduction to x64 ASM with an example Huffman decoder.
Despite lots of work in ASM, this Huffman decoder is pathetically bad, it actually walks through node pointers to decode, which is just
about the worst possible way to do it. I don't mean to pick on that guy, it's example code and it's actually really nice example code,
it was super useful to me in my x64 learnings, but I've seen so many bad Huffman ASM implementations, it has to be one of the all-time great
examples of foolish premature optimization of bad algorithms. If you had a really slow bit input routine, you might be motivated to
work on the *algorithm* to avoid bit inputs as much as possible. Then you would come up with what all the smart people do, which is
to read ahead big chunks of bits and use a table to decode symbols directly. Usually the best way to optimize a slow function is to not
call it. (I don't actually have a good reference on fast Huffman, but
I did write a tiny bit earlier).
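To make the "read ahead a chunk of bits and look up the symbol in a table" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration (the names, the toy 4-bit maximum code length, canonical codes packed MSB-first); it is not the decoder from the linked ASM article, and a real one would use a longer lookahead plus a fallback for rare long codes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; a toy table-driven Huffman decoder.
struct TableEntry { uint8_t symbol; uint8_t length; };

const int kMaxCodeLen = 4; // assumed maximum code length for this toy example

// Build a (1 << kMaxCodeLen)-entry table from canonical code lengths.
// lengths[s] is the code length of symbol s (0 = symbol unused).
std::vector<TableEntry> BuildDecodeTable(const std::vector<uint8_t>& lengths)
{
    std::vector<TableEntry> table(1u << kMaxCodeLen, TableEntry{0, 0});
    uint32_t code = 0; // next canonical code, left-aligned to kMaxCodeLen bits
    for (int len = 1; len <= kMaxCodeLen; ++len) {
        for (size_t s = 0; s < lengths.size(); ++s) {
            if (lengths[s] != len) continue;
            // Every lookahead value that starts with this code maps to symbol s.
            uint32_t span = 1u << (kMaxCodeLen - len);
            assert(code + span <= table.size());
            for (uint32_t i = 0; i < span; ++i)
                table[code + i] = TableEntry{ (uint8_t)s, (uint8_t)len };
            code += span;
        }
    }
    return table;
}

// Decode numSymbols symbols: peek kMaxCodeLen bits, one table lookup per symbol.
std::vector<uint8_t> Decode(const std::vector<TableEntry>& table,
                            const std::vector<uint8_t>& bytes, size_t numSymbols)
{
    std::vector<uint8_t> out;
    uint64_t bitBuf = 0;   // unread bits live at the top of this word
    int bitCount = 0;      // number of valid bits in bitBuf
    size_t pos = 0;
    while (out.size() < numSymbols) {
        // Refill a byte at a time, many bits per trip, instead of one bit per call.
        while (bitCount <= 56 && pos < bytes.size()) {
            bitBuf |= (uint64_t)bytes[pos++] << (56 - bitCount);
            bitCount += 8;
        }
        uint32_t peek = (uint32_t)(bitBuf >> (64 - kMaxCodeLen));
        TableEntry e = table[peek];
        assert(e.length != 0); // hitting an unused slot means a corrupt stream
        out.push_back(e.symbol);
        bitBuf <<= e.length;
        bitCount -= e.length;
    }
    return out;
}
```

With code lengths {1,2,3,3} the canonical codes are 0, 10, 110, 111, so the bytes 0x5B 0x80 decode to symbols 0,1,2,3,0 with exactly one table lookup each and no tree walking.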
Of course there is also the flip side. Sometimes a slow underlying function can cause you to waste lots of time on optimization
that's not really necessary. Probably the worst culprit I see of this is allocations. I see people doing masses of work to remove
allocations and often it just doesn't make any sense. A small block alloc these days is around 20-50 clocks, often less than a divide.
There are good reasons to remove allocs (mainly if you are on a low memory system and want very predictable memory use patterns),
but speed is not one of them, and people who are using a very bad malloc back end with a malloc that is hundreds or thousands of clocks
are just giving themselves a problem that isn't real.
Browsing around the web another one struck me. I tend to be *very* careful in the way I write code (compared to my game developer compadres
anyway, I suppose I am wild and reckless compared to many of the systems devs). I used to be somewhat of a specialist in saving bad code bases,
and in that spot I really hate to even try to work with the code I'm given. The first thing I do is start tracing through runs and see what's
really happening, and then I start adding comments and asserts everywhere. Then I start wrapping up common functionality into classes that
encapsulate something and enforce a rule so that I can be *sure* that what everyone thinks is happening really is happening. When I see code
that does not *prove* to me that it's working right, I assume it's not working right.
This carefulness is multiplied many fold when it comes to threading. I am hyper careful and don't trust myself for the most basic things.
I see a lot of people write simple threading code and do it without much in the way of comments or helper functions because they are "smart"
and they can tell what's happening and don't need helpers. They just do some interlocked ops to coordinate the threads and then maybe do
some non-protected accesses when they "know" they can't have races, etc. That's awesome until you have
a mysterious problem like this . You could blame the min() for doing something weird,
or blame the compiler for not being nice enough with its volatile function, but IMO this is the type of thing that would never happen if the
code was written more carefully.
Even when I'm at my most lazy, I would write this :
Another option which I often use is to make a Shared<> template , so gAvailable would be Shared< LONG > gAvailable. Then you have to
access it with members like LoadRelaxed() or StoreRelease() , etc.
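A Shared<> wrapper along those lines could be sketched like this. It is built on C++11 std::atomic, which is an assumption on my part (code of this era would more likely sit on Interlocked ops), and apart from LoadRelaxed and StoreRelease the member names are made up. The point is only that there is no implicit conversion, so every access has to spell out what kind of load or store it is.

```cpp
#include <atomic>
#include <cassert>

// Sketch of a Shared<> wrapper: no operator T(), no operator=, so a plain
// unprotected read or write of the variable simply does not compile.
template <typename T>
class Shared
{
public:
    explicit Shared(T initial = T()) : m_value(initial) {}

    T LoadRelaxed() const { return m_value.load(std::memory_order_relaxed); }
    T LoadAcquire() const { return m_value.load(std::memory_order_acquire); }

    void StoreRelaxed(T v) { m_value.store(v, std::memory_order_relaxed); }
    void StoreRelease(T v) { m_value.store(v, std::memory_order_release); }

    // Returns true if the value was 'expected' and is now 'desired';
    // on failure 'expected' is updated to the value actually seen.
    bool CompareExchange(T& expected, T desired)
    {
        return m_value.compare_exchange_strong(expected, desired,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire);
    }

private:
    std::atomic<T> m_value; // std::atomic is non-copyable, so Shared is too
};
```

So gAvailable would be declared as Shared<long> gAvailable, and something like min(gAvailable, n) stops compiling until you write gAvailable.LoadAcquire() once into a local and take the min of that.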
I treat threading code like a loaded gun. I don't point it at anyone's face, I store it without bullets, etc. You take these precautions not
because they are absolutely necessary, but because they ensure you don't have bad surprises.
A lot of the times in rebuttal, people will show me their unsafely written code and say "well this works fine, tell me what's wrong with it".
Umm, no, you don't understand the issue at all my friend. The point is that with unsafely written code, I have to use my brain to figure out
whether it is working or not, and as we have seen, that is *very* very hard. Especially with threading and races. In fact the only way I
could do it with any level of confidence is to instrument it and run it through Relacy or one of the other automatic race detectors.
Maybe I miss managing people and getting to pick on their code, so now I'm picking on random code from around the internet. Actually that
might be fun, do a weekly code review of some random code I grab from the web ;)
The primary meaning is "belittled, insulted, strongly disapproved of". Old dictionaries will only have this
meaning. I contend that this word is basically never used in this way anymore. (the new meaning is
"obsolete / superseded by something else and further usage is discouraged" ).
"Pander" :
I was shocked to find that the primary definition of pander in most dictionaries is "Noun : Pimp" , and the
primary verb meaning is "to act as a pimp ; eg. to provide for the sexual gratification of others". I don't
know when that was the meaning but it sure isn't now. I contend that 99% of people would say the meaning of
pander is to "cater to the desires of the audience or target group, without scruples" , as in politicians
pandering to the AARP or filmmakers pandering to retarded people who want sex and action. (I want sex and
action, just not retarded sex and action).
Don't sell Win7-32 for desktops. If you go to most PC makers (Dell, Lenovo, etc.) and spec a machine, the default OS
will be Win7 - 32 bit. I'm sure lots of people don't think about that and just hit okay. That fucking sucks, they
should be encouraging everyone to standardize to 64 bit.
Give me a 32+64 combined exe. This is trivial and obvious, I don't get why I don't have it. For most apps there's
no need for a whole separate Program Files (x86) / Program Files install dir nonsense. Just let me make a .EXE
that has the x86 and the x64 exes in it, and execute the right one. That way I can just distribute my app as a
single EXE and clients get the right thing.
Let me call 32 bit DLL's from 64 bit code. As usual
people write a bunch of nonsense about why you can't do it that just isn't true. I can't see any real good reason
why I can't call 32 bit DLL's from 64 bit code. There are a variety of solutions to the memory addressing range problem. The
key thing is that the memory addresses we are passing around are *virtual* addresses, not physical, and each app can have different
virtual address page tables. You could easily make the 32-bit to 64-bit thunk act like a process switch, which swaps the page tables.
It's perfectly possible for an app with 32 bit of address space to access 64 bits of physical memory (of course this is exactly what
32 bit apps running on 64 bit windows do).
Okay, that would be pretty ugly, but there's a very simple solution - reserve the lower 4G of virtual address space in 64 bit apps
for communication with 32 bit apps. That is, make all the default calls to HeapAlloc go to > 4G , but give me a flag to alloc in the lower 4 G.
Then if I want
to call to a 32 bit DLL I have to pass it memory from the lower 4G space. Yes obviously you can't just build your old app and attach
to a 32 bit DLL and have it just work, but it would still give me a way to access it when I need it (and there are plenty of other reasons
why you wouldn't be able to just link in the 32 bit dll without thinking, eg. the sizes of many of your types have changed).
Right now if I have some old 32 bit DLL that I can't recompile for whatever reason and I need to get to it from my 64 bit app, the
recommended solution is to make a little 32 bit server app that calls the DLL for me and use interprocess communication with named memory
maps to give its data to the 64 bit app. That's okay and all but they easily could have packaged that into a much easier thing to do.
This code project article is actually one of the best I've seen on the
transition. It reminds me of one of the weird broke-ass things I've seen : some of my favorite command line apps are still old 32 bit builds.
They still work ... mostly. The problem is that they go through the fucking WOW64 directory remapping, so if you try to do anything with them
in the Windows system directories you can get *seriously* fucked up. I really don't like the WOW64 directory remap thing. It only exists to
fix broke-ass apps that manually look for stuff in "c:\program files" or whatever. I feel it could have been off by default and then turned on
only for exe's that need it. I understand their way makes most 32 bit apps work out of the box which is what most people want, so that all makes
sense and is fine, but it is nasty that my perfectly good 32 bit apps don't actually get to see the true state of the machine for no good reason.
To give an example of the most insane case : I use my own "dir" app. Before I rebuilt it, I was using the old 32 bit exe version. It literally
lies to me about what files are in a given dir because it gets the WOW64 remap. That's whack. The 32 bit exes work just fine on Win64 and the
only thing forcing me to rebuild a lot of these things is just to make them avoid the WOW64.
amd64 is a synonym for x64. IA64 is Itanium.
The actual "CL" command line to build x64 vs x86 can stay unchanged I think (not sure about this yet). For example
you still link shell32.lib , and you still define "WIN32".
Another nasty case where 64 bit bites your ass is in the old printf non-typed varargs problem. If you use cblib/safeprintf
this shouldn't be too bad. Fucking printf formatting for 64 bit ints is not standardized. One nice thing on MSVC
is you can use "%Id" for size_t sized things, so it switches between 32 and 64. For real 64 bit use "%I64d" (on MSVC anyway).
size_t really fucking annoys me because it's unsigned. When I do a strlen() I almost always want that to go into a signed int
because I'm moving counters around that might go negative. So I either have to cast or I get tons of warnings. Really fucking
annoying. So I use my own strlen32 and anyone who thinks I will call strlen on a string greater than 2 GB is
smoking some crack.
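For reference, a sketch of both ideas. The strlen32 body is my guess at what such a helper does (the name is from above, the implementation is assumed), and the formatting helpers use the C99 route ("%zu" and the PRId64 macro from <cinttypes>) rather than the MSVC-only "%Id" / "%I64d"; older MSVC runtimes don't understand "%zu", so treat that part as an assumption about your toolchain.

```cpp
#include <cassert>
#include <cinttypes>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// strlen that returns a signed int, asserting the string really fits.
inline int strlen32(const char* s)
{
    size_t len = std::strlen(s);
    assert(len <= (size_t)INT_MAX); // nobody should be handing us a 2 GB string
    return (int)len;
}

// Hypothetical helper: format a true 64-bit integer without hardcoding a
// compiler-specific length prefix.
inline int FormatInt64(char* buf, size_t bufSize, int64_t value)
{
    return std::snprintf(buf, bufSize, "%" PRId64, value);
}

// Hypothetical helper: format a size_t, which is 32 or 64 bits depending on target.
inline int FormatSize(char* buf, size_t bufSize, size_t value)
{
    return std::snprintf(buf, bufSize, "%zu", value);
}
```

The signed return means loops like for (int i = strlen32(s) - 1; i >= 0; i--) work without casts or signed/unsigned warnings.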
Here are the MSC vers :
eg. if I write some matrix multiply routine and I know I have made it memory-alias safe, I want to say "restrict" on the *code* chunk, not on certain
variables.
Same thing with "volatile" in the Java/C++0x/MSVC>=2005 sense, it's really a command for the load or the store, eg. "this store must actually be written to
memory and must be ordered against other stores" , not a descriptor of a variable.
So I'm going to try to fix some things in Nifty. It's
annoyingly difficult to debug an add-in for visual studio, so I might have to
use printf debugging. The problems I have are :
1. It randomly doesn't work sometimes. The spawn of the p4 process fails. Usually if I shut down devenv and restart it this gets fixed. Need to track this down.
"Failed to spawn process: System.ComponentModel.Win32Exception: The system cannot find the file specified"
2. When it can't connect to the p4 server it does hang devenv for too long. Would be nice to fix.
3. Checking out projects and solutions is pretty fucked. I'm sort of inclined to just make it check out all the vcproj's whenever you open a solution,
and then revert unchanged when you close. Even just a key to check out all vcproj's in the solution would be handy, though ideally that would be
triggered whenever you edit project properties. I guess if I had that I could make a macro that does "checkout all projects; open properties" and
put that on my properties keyboard shortcut.
4. The toolbar is constantly querying status if it's visible so that it can tell whether to gray out the "p4 edit" buttons or not. That's kind of annoying.
I might just make it so it doesn't gray those buttons based on whether the file is in p4 or not.
This is what I use to disconnect projects from source control without fucking everything up : (it's just text substitution to remove the source control
lines from the projects).
The basic problem is that for many of the lock-free algorithms we need to be able to do a DCAS , that is a CAS of two pointer-sized values, or a pointer
and a counter. When our pointer was 32 bits, we could use a 64 bit CAS to implement DCAS. If our pointer is 64 bits then we need a 128 bit CAS to
implement DCAS the same way. There are various solutions to this :
1. Use 128 bit CAS. x64 has cmpxchg16b now which is exactly what you need. This is obviously simple and nice. There are a few minor problems :
1.A. There are no other 128 bit atomics, eg. Exchange and Add and such are missing. These can be implemented in terms of loops of CAS, but that is
a very minor suckitude.
1.B. Early AMD64 chips do not have cmpxchg16b. You have to check for its presence with a CPUID call. If it doesn't exist you are seriously fucked.
Fortunately these chips are pretty rare, so you can just use a really evil fallback to keep working on them : either disable threading completely on
them, or simply run the 32 bit version of your app. The easiest way to do that is to have your installer check the CPUID flag and install the 32 bit
x86 version of your app instead of the 64 bit version.
1.C. All your lock-free nodes become 16 bytes instead of 8 bytes. This does things like make your minimum alloc size 16 bytes instead of 8 bytes.
This is part of the general bloating of 64 bit structs and mildly sucks.
(BTW you can see this in winnt.h as MEMORY_ALLOCATION_ALIGNMENT is 16 on Win64 and 8 on Win32).
1.D. _InterlockedCompareExchange128 only exists on newer versions of MSVC so you have to write it yourself in ASM for older versions. Urg.
So #1 is an okay solution, but what are the alternatives ?
2. Pack {Pointer,Count} into 64 bits. This is of course what Windows does for SLIST, so doing this is actually very safe. Currently pointers on
Windows are only 44 bits because of this. They will move to 48 and then 52. You can easily store a 52 bit pointer + a 16 bit count in 64 bits (the 52 bit pointer
has the bottom four bits zero so you actually have 16 bits to work with). Then you can just keep using 64 bit CAS. This has no disadvantage that I know
of other than the fact that twenty years from now you'll have to touch your code again.
3. You can implement arbitrary-sized CAS in terms of pointer CAS. The powerful standard paradigm for this type of thing is to
use pointers to data instead of data by value, so you are just swapping pointers instead of swapping values. It's very simple, when you want to change
a value, you malloc a copy of it and change the copy, and then swap in the pointer to the new version. You CAS on the pointer swap. The "malloc" can
just be taking data from a recycled buffer which uses hazard pointers to keep threads from using the same temp item at the same time.
This is a somewhat more complex way to do things conceptually, but it is very powerful and general,
and for anyone doing really serious lockfree work, a hazard pointer system is a good thing to have.
See for example "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS".
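A stripped-down sketch of option 3, with loud caveats: the names are hypothetical, it uses std::atomic, and instead of a hazard pointer system it just parks every superseded version on a retired list and frees them all in the destructor. That sidesteps the reclamation problem (and ABA, since no address is reused while the object lives) at the cost of memory growing with every update, so it shows the copy / modify / CAS-the-pointer shape and nothing more.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Some state too big for a single hardware CAS.
struct BigState { uint64_t a, b, c; };

class VersionedState
{
    struct Version { BigState data; Version* retiredNext; };

    std::atomic<Version*> m_current;
    std::atomic<Version*> m_retired;

public:
    explicit VersionedState(const BigState& init)
        : m_current(new Version{init, nullptr}), m_retired(nullptr) {}

    ~VersionedState()
    {
        delete m_current.load();
        for (Version* v = m_retired.load(); v != nullptr; ) {
            Version* next = v->retiredNext;
            delete v;
            v = next;
        }
    }

    // Readers copy out of whatever version is current when they look.
    BigState Read() const
    {
        return m_current.load(std::memory_order_acquire)->data;
    }

    // Copy the current version, modify the copy, CAS the pointer; retry on contention.
    template <typename Fn>
    void Update(Fn modify)
    {
        Version* fresh = new Version{BigState{}, nullptr};
        Version* old = m_current.load(std::memory_order_acquire);
        do {
            fresh->data = old->data;
            modify(fresh->data);
        } while (!m_current.compare_exchange_weak(old, fresh,
                     std::memory_order_acq_rel, std::memory_order_acquire));

        // Readers may still be looking at 'old', so it is retired, not freed.
        Version* head = m_retired.load(std::memory_order_relaxed);
        do {
            old->retiredNext = head;
        } while (!m_retired.compare_exchange_weak(head, old,
                     std::memory_order_release, std::memory_order_relaxed));
    }
};
```

The "malloc" here is a plain new; swapping that for a recycled buffer guarded by hazard pointers is exactly the part this sketch leaves out.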
You could also of course use a hybrid of 2 & 3. You could use a packed 64 bit pointer,count until your pointer becomes more than 52 bits,
and then switch to a pointer to extra data.
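Option 2 can be sketched as below. The assumptions are stated in the asserts: pointers fit in 52 bits and nodes are 16-byte aligned, so 48 significant bits of pointer plus a 16 bit count fit in one 64 bit word. The names are hypothetical, it uses std::atomic rather than Interlocked ops, and the pop is only safe if popped nodes are recycled rather than handed back to the OS while other threads might still be peeking at them.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct alignas(16) Node { Node* next; int value; };

const uint64_t kCountBits = 16;
const uint64_t kCountMask = (1ull << kCountBits) - 1;

// Pack {pointer,count} into 64 bits: drop the 4 zero low bits of the pointer,
// then put the count in the bottom 16 bits.
inline uint64_t Pack(Node* p, uint64_t count)
{
    uint64_t bits = (uint64_t)(uintptr_t)p;
    assert((bits & 15) == 0);  // assumption: 16-byte aligned nodes
    assert((bits >> 52) == 0); // assumption: pointers use at most 52 bits
    return ((bits >> 4) << kCountBits) | (count & kCountMask);
}

inline Node* UnpackPointer(uint64_t packed)
{
    return (Node*)(uintptr_t)((packed >> kCountBits) << 4);
}

inline uint64_t UnpackCount(uint64_t packed) { return packed & kCountMask; }

// Lock-free stack whose head is a packed {pointer,count}; the count changes on
// every push and pop, so a recycled pointer value does not fool the CAS (ABA).
class TaggedStack
{
    std::atomic<uint64_t> m_head;

public:
    TaggedStack() : m_head(0) {}

    void Push(Node* n)
    {
        uint64_t old = m_head.load(std::memory_order_relaxed);
        for (;;) {
            n->next = UnpackPointer(old);
            uint64_t desired = Pack(n, UnpackCount(old) + 1);
            if (m_head.compare_exchange_weak(old, desired,
                    std::memory_order_release, std::memory_order_relaxed))
                return;
        }
    }

    Node* Pop()
    {
        uint64_t old = m_head.load(std::memory_order_acquire);
        for (;;) {
            Node* top = UnpackPointer(old);
            if (top == nullptr) return nullptr;
            // Reads top->next; requires that node memory stays mapped (see above).
            uint64_t desired = Pack(top->next, UnpackCount(old) + 1);
            if (m_head.compare_exchange_weak(old, desired,
                    std::memory_order_acquire, std::memory_order_acquire))
                return top;
        }
    }
};
```

A 16 bit count wraps after 65536 operations, which is the usual accepted risk with this scheme: the ABA case now needs exactly that many intervening operations while one thread is stalled mid-pop.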
The workaround goes like this :
Go to "c:\program files (x86)\microsoft visual studio 8\vc\bin". Find the occurrences of ML64.exe ; copy them to ML.exe . Now you can add .ASM files
to your project. Go to the Win32 platform config and exclude them from build in Win32.
You now have .ASM files for ML64. For x86/32 - just use inline assembly. For x64, you extern from your ASM file.
Calling to x64 ASM is actually very easy, even easier than x86, and there are more volatile registers and the convention is that caller has to
do all the saving. All of this means that you as a little writer of ASM helper routines can get away with doing very little. Usually your args
are right there in {rcx,rdx,r8,r9} , and then you can use {rax,r10,r11} as temp space, so you don't even have to bother with saving space on
the stack or any of that. See
list of volatile registers
BTW the best docs are just the full AMD64 manuals .
For example here's a full working .ASM file :
And how to get to it from C :
BTW one of the great things about posting things on the net is just that it makes me check myself. That cmpxchg64 has a stupid branch,
I think this version is better :
ADDENDUM : I just found a new evil secret way I'm fucked. Unions with size mismatches appear not to even be a warning of any
kind. So for example you can silently have this in your code :
__asm__ cmpxchg8bcmpxchg16b - comp.programming.threads Google Groups
One unexpected annoyance has been that a lot of the Win32 function signatures have changed. For example LRESULT is now a pointer not
a LONG. This is a particular problem because Win32 has always made heavy use of cramming the wrong type into various places,
eg. for GetWindowLong and stuffing pointers in LPARAM's and all that kind of shit. So you wind up having tons of C-style casts when
you write Windows code. I have made good use of these guys :
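Helpers in that spirit might look like the sketch below. The names check_value_cast and same_size_bit_cast and their exact behavior are assumptions for illustration, not a listing of the real helpers. The idea is to replace raw C-style casts with ones that either assert the value survived or refuse to compile when the sizes differ, which is the check a union won't give you.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: value cast between integer types that asserts
// nothing was lost, eg. a 64 bit count going into an int.
template <typename To, typename From>
inline To check_value_cast(From from)
{
    To to = static_cast<To>(from);
    assert(static_cast<From>(to) == from);    // no bits lost in the round trip
    assert((to < To(0)) == (from < From(0))); // sign did not flip
    return to;
}

// Hypothetical helper: reinterpret the bits of one type as another, but only
// when the sizes match, eg. stuffing a pointer into a pointer-sized integer.
template <typename To, typename From>
inline To same_size_bit_cast(const From& from)
{
    static_assert(sizeof(To) == sizeof(From), "same_size_bit_cast : size mismatch");
    To to;
    std::memcpy(&to, &from, sizeof(To));
    return to;
}
```

With these, cramming a pointer into a 32 bit LONG on x64 becomes a compile error instead of a silent truncation, while going through a pointer-sized integer type still works.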
BTW this all has made me realize that the recent x86-32 monotony on PC's has been a delightful stable period for development. I had almost
forgotten that it used to be always like this. Now to do simple shit in my code, I have to detect if it's x86 or x64 , if it is x64, do I have
an MSC version that has the intrinsics I need? if not I have to write a got damn MASM file. Oh and I often have to check for Vista vs. XP to
tell if I have various kernel calls. For example :
Even ignoring the pain of the last FUCK branch which requires making a .ASM file, the fact that I had to do a bunch of version/target checks
to get the right code for the other paths is a new and evil pain.
Oh, while I'm ranting, fucking MSDN is now showing all the VS 2010 documentation by default, and they don't fucking tell you what version
things became available in.
This actually reminds me of the bad old days when I got started, when processors and instruction sets were changing rapidly. You actually
had to make different executables for 386/486 and then Pentium, and then PPro/P3/etc (not to mention the AMD chips that had their own
special shiznit). Once we got to the PPro it really settled down and we
had a wonderful monotony of well developed x86 on out-of-order machines that continued up to the new Core/Nehalem chips
(only broken by the anomalous blip of Itanium that we all ignored as it went down in flames like the Hindenburg).
Obviously we've had consoles and Mac and other platforms to deal with, but that was for real products that want portability to deal with,
I could write my own Wintel code for home and not think about any of that. Well Wintel is monoflavor no more.
The period of CISC and chips with fancy register renaming and so-on was pretty fucking awesome for software developers, because you see the same
interface for all those chips, and then behind the scenes they do magic mumbo jumbo to turn your instructions into fucking gene sequences that
multiply and create bacterium that actually execute the instructions, but it doesn't matter because the architecture interface still just
looks the same to the software developer.
It's real foolishness when people say things like climbing is
not a hard workout . I just have to roll my eyes pretty much anytime anybody talks about exercise because you just all don't get it.
*Anything* is a workout if you make it a workout. There's no inherent difficulty level of any activity, it depends how hard you do it. You
hear retards all the time saying "yoga's not a hard workout" , well maybe if you do it like a moron it's not, use some more intensity, make it
more difficult for yourself if you need more work. I've heard plenty of people tell me "biking's not a hard enough workout". Oh really? Go
faster, dumbass, or maybe try going up some hills.
My anger at the drivers around here grows and grows until it hits a boiling point where I just become depressed about how fucking stupid and
selfish you all are. It really is amazing to me that people here constantly blow through yellows, roll through stop signs, and yet take forever
to get moving when a light turns green and are busy-bodies about my speed (my speed which is almost always less than that of the SUV
that's screeching around the corner at 90% of its limit so it would be unable to make a quick correction if anything surprising happened).
What a boring topic, I apologize. At least in places like LA people are more consistently aggressive assholes; it's less hypocritical.
I've been working at home recently and it's been a great boost of productivity for me. It's so good to not have to worry about when I'm going to
try to make the commute to avoid traffic, and it's awesome to be able to go directly from morning coffee to coding, which is the most productive
instant of the day for me. There are two problems : 1. I got a nice standing desk all set up at work and I miss having it at home, I'm hurting my
body spending too much time in a chair. 2. it's a little hard because N is also home many days, and there's a bit of difficult tension when I have to say
"leave me alone I'm working".
I really don't like the way the financial meltdown narrative has been crafted by the media. One of the false narratives is the "black swan" story -
that everything was being done with mathematical models and that it was a very unlikely but high impact event that was not accounted for in the models
that caused the crisis. The other false narrative is that it was evil investment bankers at Goldman or similar that somehow
caused it all. The reality is it was caused by ignorance and greed and corruption at almost every level of society. From presidents and congress who
stripped regulation from the finance and mortgage industry, to the Fed keeping rates way too low and not monitoring banks well, to Fannie Mae et.al.
underwriting too many loans, to Countrywide et.al. intentionally issuing loans they knew were bad to make more profit, to Goldman et.al. for packaging
loans they knew were bad and selling them as safer than they really were, to the ratings agencies asleep at the wheel, to individual real estate investors
getting in way over their heads trying to make an easy buck, etc. etc.
Here's the simple test case I cooked up :
Well a few lines above that is the key. There was this :
Well, it turns out that somewhere way back in RR_ASSERT I was in a branch that caused me to have this definition for RR_ASSERT :
BTW the thing that kicked this off is that fucking VC x64 doesn't support inline assembly. ARGH YOU COCK ASS. Because of that we had long
ago written something like
Instead you would like the switch on workType to be on the outside. WorkType is a constant all the way through the code, so I can just
propagate that branch up through the loops, but there's no way to express it neatly in C.
The only real option is with templates. You make DoPerObjectWork a functor and you make LoopAndDoWork a template. The other option is to
make an outer loop dispatcher to constants. That is, make workType a template parameter instead of an integer :
This is a general pattern - use templates to turn a variable parameter into a constant and then use an outer dispatcher to turn a variable
into the right template call. But it's ugly.
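A minimal sketch of that pattern. DoPerObjectWork and LoopAndDoWork are the names used above; the WorkType values, the int "objects" and the dispatcher name are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical work types.
enum WorkType { kWorkIncrement = 0, kWorkDouble = 1, kWorkNegate = 2 };

// workType is a template constant here, so this switch folds away at compile
// time and each instantiation contains only its own case.
template <int t_workType>
inline int DoPerObjectWork(int x)
{
    switch (t_workType) {
        case kWorkIncrement: return x + 1;
        case kWorkDouble:    return x * 2;
        default:             return -x;
    }
}

// The inner loop is a template on the same constant : no branch per object.
template <int t_workType>
void LoopAndDoWork(std::vector<int>& objects)
{
    for (int& obj : objects)
        obj = DoPerObjectWork<t_workType>(obj);
}

// Outer dispatcher : turns the runtime variable into the right template call,
// so the switch runs once per loop instead of once per object.
void LoopAndDoWorkDispatch(int workType, std::vector<int>& objects)
{
    switch (workType) {
        case kWorkIncrement: LoopAndDoWork<kWorkIncrement>(objects); break;
        case kWorkDouble:    LoopAndDoWork<kWorkDouble>(objects);    break;
        case kWorkNegate:    LoopAndDoWork<kWorkNegate>(objects);    break;
        default: assert(!"unknown workType"); break;
    }
}
```

The ugliness shows up when there are several such parameters: the dispatcher has to enumerate every combination, and every function between the dispatcher and the inner loop has to become a template too.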
BTW when doing this kind of thing you often wind up with loops on constants. The compiler often can't figure out that a loop on a constant
can be unrolled. It's better to rearrange the loop on constant into branches. For example I'm often doing all this on pixels where the pixel
can have between 1 and 4 channels. Instead of this :
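A sketch of the two forms, assuming a hypothetical ScalePixel routine with the channel count as a template constant : first the loop on the constant, then the same thing rearranged into branches on the constant.

```cpp
#include <cassert>

// Loop form : relies on the compiler noticing that t_channels is a small
// constant and unrolling the loop.
template <int t_channels>
inline void ScalePixelLoop(float* pixel, float scale)
{
    for (int c = 0; c < t_channels; ++c)
        pixel[c] *= scale;
}

// Branch form : each test is on a compile-time constant, so dead branches
// drop out and what remains is straight-line code with no loop at all.
template <int t_channels>
inline void ScalePixelBranches(float* pixel, float scale)
{
    static_assert(t_channels >= 1 && t_channels <= 4, "1 to 4 channels supported");
    pixel[0] *= scale;
    if (t_channels > 1) pixel[1] *= scale;
    if (t_channels > 2) pixel[2] *= scale;
    if (t_channels > 3) pixel[3] *= scale;
}
```

Both produce the same results; the second just doesn't depend on the optimizer's unrolling heuristics.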
1. Automatically resizes based on amount of memory needed by other apps. eg. other apps can steal memory from your cache to run.
2. Automatically gives pages away to other apps or to file IO or whatever if they are touching their cache pages more often.
3. Automatically keeps the cache in memory between runs of your app (if nothing else clears it out). This is pretty immense.
Because of #3, your custom caching solution might slightly beat using the Windows cache on the first run, but on the second run it will
stomp all over you.
To do this nicely, generally the cool thing to do is make a unique file name that is the key to the data you want to cache. Write the data to
a file, then memory map it as read only to fetch it from the cache. It will now be managed by the Windows page cache and the memory map will just
hand you a page that's already in memory if it's still in cache.
The only thing that's not completely awesome about this is the reliance on the file system. It would be nice if you could do this without ever
going to the file system. eg. if the page is not in cache, I'd like Windows to call my function to fill that page rather than getting it from disk,
but so far as I know this is not possible in any easy way.
For example : say you have a bunch of compressed images as JPEG or whatever. You want to keep uncompressed caches of them in memory. The right way
is through the Windows page cache.
I contacted the developer of AllSnap to see if he would give me the code so I could fix it, but he is ignoring me. I can tell from debugging
apps when AllSnap is installed that it seems to work by injecting a DLL. This is similar to how I hacked the poker sites for GoldBullion,
so I think I could probably reproduce that. But I dunno if Win7/x64 has changed anything about function injection and the whole DLL function
pointer remap method.
BTW/FYI the standard Windows function injection method goes like this : Make a DLL that has some event handler. Run a little app that causes that event
to trip inside the app you want to hijack. Your DLL is now invoked in that app's process to handle that event. Now you are running in that process
so you can do anything you want - in particular you can find the function table to any of their DLL's, such as user32.dll, and stuff your own function
pointer into that memory. Now when the app makes normal function calls, they go through your DLL.
What I would like is a way to make this more robust. I have very strong threading primitives, I want a way to make sure that I use them!
In particular, I want to be able to mark certain structs as only touchable when a critsec is locked or whatever.
I think that a lot of this could be done with Win32 memory page protections. So far as I know there's no way to associate protections
per-thread, (eg. to make a page read/write for thread A but no-access for thread B). If I could do that it would be super sweet.
One idea is to make the page no access and then install my own exception handler that checks what thread it is, but that might be too much
overhead (and not sure if that would fail for other reasons).
The main usage is not for protected crit-sec'ed structs, that is really the easiest case to maintain because it's very obvious right there
in the code that you need to take the critsec to touch the variables. The hard case to maintain is the ad hoc "I know this is safe to touch
without protection". In particular I have a lot of code that runs like this :
The thing that this saves me from is when I'm tinkering in DoComplicatedStuff() which is some function called deep inside Phase 2
somewhere and I change it to no longer follow the memory access rule that it is supposed to be following. This is just my hate for
having rules for code correctness that are not enforced by the compiler or at least by run-time asserts.
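Short of page protections, the cheap run-time-assert version of this could be a wrapper that remembers which thread currently owns the data and asserts on every access. This is a sketch under assumptions : the name OwnedByThread and its interface are hypothetical, it uses std::thread::id and std::atomic, and it only catches the offending access when it actually runs in a build with asserts enabled.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Data that exactly one thread is allowed to touch at a time, by convention.
// The convention is checked on every access instead of just being "known".
template <typename T>
class OwnedByThread
{
public:
    OwnedByThread() : m_value(), m_owner(std::this_thread::get_id()) {}

    // Called by the current owner at a phase boundary to hand the data over.
    void SetOwner(std::thread::id newOwner)
    {
        m_owner.store(newOwner, std::memory_order_release);
    }

    bool IsOwnedByCaller() const
    {
        return m_owner.load(std::memory_order_acquire) == std::this_thread::get_id();
    }

    // All access goes through here, so code deep inside some phase that breaks
    // the ownership rule trips the assert at the point of the bad access.
    T& Get()
    {
        assert(IsOwnedByCaller());
        return m_value;
    }

private:
    T m_value;
    std::atomic<std::thread::id> m_owner;
};
```

It doesn't stop someone from holding on to the reference past a hand-off, so it is weaker than real per-thread page protection, but it turns "I know this is safe to touch" into something that gets checked.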
Minimal state. Recordable state at every transition point. This lets you bookmark your point anywhere in your work, go backwards and forwards,
save your spot and come back to it, etc.
This all goes back to the entire state being a little token that you can just grab and store off. Granted, lots of web pages fuck this up
because they use some server-side shit and they don't show you all the public state or whatever fucking dick-ass thing they do. But good
old fashioned Web gets this awesomely right.
It's actually a paradigm that I think more developers should espouse in their Win32 apps, both publicly and internally.
By "publicly" I mean you should expose it to the user - let the user drag off the current spot to a link, and let them restore.
This should be in like every app. The full state of the app should be in an edit box somewhere that I can copy/paste or drag to the
desktop. I should be able to double-click it to jump back into the app at that same point.
"Internally" I mean it's nice to make sure your state is some very simple plain C structures, so that you can just push & pop or save old
versions of the state, like :
Yeah yeah the C++ way is to give every member a stream-in/stream-out, but it's too hard to maintain that robustly all the time.
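As a sketch of the plain-struct approach (the fields and the StateStack name are invented for illustration) : keep the whole state in one trivially copyable struct, so saving a version is just a struct copy and there is no per-member streaming code to keep in sync.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical app state : plain data only, no pointers into other objects.
struct AppState
{
    int   documentId;
    int   scrollLine;
    float zoom;
};

// If someone adds a non-trivial member later, this fails to compile instead
// of silently breaking save/restore.
static_assert(std::is_trivially_copyable<AppState>::value,
              "AppState must stay plain data");

// Push & pop of old versions of the state.
class StateStack
{
public:
    void Push(const AppState& s) { m_saved.push_back(s); }

    // Returns false (and leaves *out alone) if there is nothing saved.
    bool Pop(AppState* out)
    {
        if (m_saved.empty()) return false;
        *out = m_saved.back();
        m_saved.pop_back();
        return true;
    }

private:
    std::vector<AppState> m_saved;
};
```

Because the struct is trivially copyable, the same bytes can also be written straight to a file or turned into a text token for the "drag the current spot to the desktop" case.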
This is actually related to another very important programming paradigm in general : minimize state, and avoid redundancy. Don't
store variables that are computed from other variables. Don't copy values from one place to another. Always go get them at the
original source. This is a massive bug reducer. Every time I see something like "this variable must be kept in sync with this variable"
I think "why not just get rid of one of them?".
I do have one major practical problem with Windows slow networking : my file copy and dir listing routines are ungodly slow across the net.
I know this can be done faster. TeraCopy for example is pretty fast, I would love to know what they are doing. The super brute force solution
would be to just run my own file system client/server and send packets to my own port. For example if I want to get a dir listing, I just
send one packet saying "list this dir" and the listener on the other side does it locally and then sends me back one big packet with the full
dir listing. I could run that on TCP/IP and it would be like instant. So how do I get speed like that over proper Windows networking? Or
maybe that is the way to go and I could just remote-run my listener app on any machine I want to talk to?
Kill stuck apps. WTF I know you are capable of killing stuck apps, because if I use my own "killproc" app I can kill them cleanly (or another
nice way is to attach the debugger to the app and then kill it from there). But sometimes even fucking Task Manager refuses to kill it, and why can't I just
kill it from the fucking X box. Okay maybe not the X box because that's just a GUI widget on the app, but let me fucking right-click in the
non responding Window and say "yes really fucking kill the fuck out of this piece of shit app".
I might have to write my own app that uses IsHungAppWindow and then hard-kills whatever is not responding. I could put it on a hot key and it
would save my bacon when the fucking Task Manager won't run (my god why is Task Manager still just a fucking app like everything else, which means
that it can't always get enough CPU or screen display rights; there should be a machine monitor in the ctrl-alt-del screen that is always
accessible).
It is kind of funny to me how people take something that is pure hooliganism and laughs and adrenaline (driving cars fast) and have to turn it into something
they can be anal about and practice and study and be "right" about and have "the way to do things". It always happens, I mean wine and food and such are the same
way, the people who really love something become way too obsessive about it and make it way too analytical and lose focus on the simple joy of it.
Coding for RAD is kind of fucking me up. I have lots of bits of good code that I know I've written but I don't know where they are anymore.
For example I know I wrote a bunch of careful stuff to try to sleep to framerate well but I can't find it anymore. Is it in my cblib stuff,
or is it in my RAD/Oodle stuff, or is it in my RAD/shared stuff? Urg.
Dell lappy is pretty great. I dropped it and slightly dented the case. Metal cases feel awesome and give that impression of "quality" but
in fact plastic is a pretty fucking amazing material to make things out of. Plastic does not get hot, it's super lightweight, it's very
tough, and it has this amazing property that it can take an impact, deform, and then return to its original shape. (same goes for car
interiors of course).
Part of the problem with plastic car exteriors was that the paints weren't good enough. That's no longer true, there are new amazing paints
that can make plastic cars look like metal.
Top Chef Masters is pretty good, way better than the first season of TCM, though fucking Kelly Choi is a real drag (even worse than Padma; at
least Padma is actually hot, and it's amusing when she's stoned off her ass and says everything is delicious, whereas Kelly is freakish looking with
her stick body and giant head, and makes that weird forced-smile face all the time; they both share the inability to just read a freaking cue card
smoothly). (it's a real pet peeve of mine when people think that someone is hot just for being thin; thinness is correlated to hotness, but it is
not causal!). (and of course anybody who's on TV at all will have a million weirdos who insist she's super hot).
The advantage is that the encoder can reasonably easily consider {movec,residual} coding choices jointly. This is a huge advantage over
just picking what's my best movec, okay now code the residual. Because movec affects the residual, you cannot make a good R/D decision if
you do it separately. By using block movecs, it reduces the number of options that need to be considered to a small enough set that
encoders can practically consider a few important choices and make a smart R/D decision. This is what is behind all current good video
encoders.
The disadvantage of movec-residual coding is that they are redundant and connected in a complex and difficult to handle way. We send them
independently, but really they have cross-information about each other, and that is impossible to use in the standard framework.
There are obviously edges and shapes in the image which occur in both the movecs and the residuals. eg. a moving object will have a boundary,
and really this edge should be used for both the movec and residual. In the current schemes we send a movec for the block, and then the
residuals per pixel, so we now have finer grain information in the residual that should have been used to give us finer movecs per pixel, but
it's too late now.
Let's back up to fundamentals. Assume for the moment that we are still working on an 8x8 block. We want to send that block in the current frame.
We have previous frames and previous blocks within the current frame to help us. There are (256^3)^64 = 2^1536 possible values for this block.
If we are doing lossy coding, then not all possible values for the block can be sent. I won't get into details of lossiness, so just say there
are a large number of possible values for the pixels of the block; we want to code an index to one of those values.
Each index should be sent with a different bit length based on its probability. Already we see a flaw with {movec-residual} coding - there are
tons of {movec,residual} pairs that specify the same index. Of course in a flat area lots of movecs might point to the same pixels, but even
if that is eliminated, you could go movec +1, residual +3, or movec +3, residual +1, and both ways get to +4. Redundant encoding = bit waste.
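To put a rough number on that waste, here is a toy 1-D sketch. The reference row, the search ranges and the flat (uniform) code are all made-up assumptions for illustration, and none of it comes from a real codec:

```python
import math
from collections import defaultdict


def count_codings(reference, position, movecs, residuals):
    """Map each reachable output value to the (movec, residual) pairs that produce it."""
    ways = defaultdict(list)
    for mv in movecs:
        predicted = reference[position + mv]
        for res in residuals:
            ways[predicted + res].append((mv, res))
    return dict(ways)


def wasted_bits(ways):
    """Bits lost by coding every pair separately instead of one code per distinct output.

    Assumes a flat code over pairs and over outputs, purely for illustration.
    """
    num_pairs = sum(len(pairs) for pairs in ways.values())
    return math.log2(num_pairs) - math.log2(len(ways))


# Toy 1-D "previous frame"; all values are made up.
reference = [10, 11, 13, 13, 14]
ways = count_codings(reference, position=2, movecs=range(-2, 3), residuals=range(-4, 5))

print(len(ways), "distinct outputs from", sum(len(p) for p in ways.values()), "pairs")
print("ways to reach 13:", ways[13])
print("wasted bits under a flat code: %.2f" % wasted_bits(ways))
```

Here 45 {movec,residual} pairs reach only 13 distinct pixel values, so a flat code over pairs spends about 1.8 bits more than a flat code over outputs. A real coder's probabilities are not flat, so treat this as the shape of the problem and not a measurement.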
Now, this bit waste might not be critically bad with current simple {movec,residual} schemes - but it is a major encumbrance if we start looking
at more sophisticated mocomp options. Say you want to be able to send movecs for shapes, eg. send edges and then send a movec on each side. There
are lots of possibilities here - you might just send a movec per pixel (this seems absurdly expensive, but the motion fields are very smooth so
should code well from neighbors), or you might send a polygon mesh to specify shapes. This should give you much better motion fields, and then the
information in the motion fields can be used to predict the residuals as well. But the problem is there's too much redundancy. You have greatly
expanded the number of ways to code the same output pixels.
We could consider more modest steps as well, such as sticking with block mocomp + residual, but expanding what we can do for "mocomp". For example,
you could use two motion vectors + arbitrary linear combination of the source blocks. Or you could do trapezoidal texture-mapping style mocomp. Or
mocomp with a vector and scale + rotation. None of these is very valuable, there are numerous problems : 1. too many ways to encode for the encoder
to do thorough R/D analysis of all of them, 2. too much redundancy, 3. still not using the joint information across residual & motion.
In the end the problem is that you are using a 6-d value {velocity,pixel} to specify a 3-d color. What you really want is a 3-d coordinate which
is not in pixel space, but rather is a sort of "screw" in motion/pixel space. That is, you want the adjacent coordinates in motion/pixel space to
be the ones that are closest together in the 6-d space. So for example RGB {100,0,0} and {0,200,50} might be neighbors in motion/pixel space if
they can be reached by small motion adjustments.
Okay this is turning into rambling, but another way of seeing it is like this : for each block, construct a custom basis transform. Don't send
a separate movec or anything - the axes of the basis transform select pixels by stepping in movec and also residual.
ADDENDUM : let me try to be more clear by doing a simple example. Say you are trying to code a block of pixels which only has 10 possible
values. You want to code with a standard motion then residual method. Say there are only 2 choices for motion. It is foolish to code
all 10 possible values for both motion vectors! That is, currently all video coders do something like :
The other huge fundamental deficiency is that the probability modeling of movecs and residuals is done in a very primitive way based only on
"they are usually small" assumptions. In particular, probability modeling of movecs needs to be done not just based on the vector, but on the
content of what is pointed at. I mentioned long ago there is a lot of redundancy there when you have lots of movecs pointing at the same thing.
Also, the residual coding should be aware of what was pointed to by the movec. For example if the movec pointed at a hard edge, then the
residual will likely also have a similar hard edge because it's likely we missed by a little bit, so you could use a custom transform that handles
that better. etc.
ADDENDUM 2 : there's something else very subtle going on that I haven't seen discussed much. The normal way of sending {movec,residual} is
actually over-complete. Mostly that's bad, too much over-completeness means you are just wasting bits, but actually some amount of over-completeness
here is a good thing. In particular for each frame we are sending a little bit of extra side information that is useful for *later* frames.
That is, we are sending enough information to decode the current frame to some quality level, plus some extra that is not really worth it for
just the current frame, but is worth it because it helps later frames.
The problem is that the amount of extra information we are sending is not well understood. That is, in the current {movec,residual} schemes we
are just sending extra information without being in control and making a specific decision. We should be choosing how much extra information to
send by evaluating whether it is actually helpful on future frames. Obviously the last frames of the video (or a sequence before a cut) you
shouldn't send any extra information.
In the examples above I'm showing how to reduce the overcomplete information down to a minimal set, but sometimes you might not want to do
that. As a very coarse example say the true motion at a given pixel is +3, movec=3 to get to final pixel=7 , but you can code the same result
smaller by using movec=1 - deciding whether to send the true motion or not should be done based on whether it actually helps in the future,
but more importantly the code stream could collapse {3,7} and {1,7} so there is no redundant way to code if the difference is not helpful.
This becomes more important of course if you have a more complex motion scheme, like per-pixel motion or trapezoidal motion or whatever.
The best source I know of at the moment is H265.net , but you can also
find lots of stuff just by searching for video on citeseer. (addendum : FTP to Dresden April
meeting downloads ).
H265 is just another movec + residual coder, with block modes and quadtree-like partitions. I'll write another post about some ideas
that are outside of this kind of scheme. Some quick notes on the kind of things we may see :
Super-resolution mocomp. There are some semi-realtime super-resolution filters being developed these days. Super-resolution lets you take
a series of frames and create an output that's higher fidelity than any one source. In particular given a few assumptions about the underlying
source material, it can reconstruct a good guess of the higher resolution original signal before sampling to the pixel grid. This lets you
do finer subpel mocomp. Imagine for example that you have some black and white text that is slowly translating. On any one given frame there
will be lots of gray edges due to the antialiased pixel sampling. Even if you perfectly know the subpixel location of that text on the target
frame, you have no single reference frame to mocomp from. Instead you create a super-resolution reference frame of the original signal and subpel
mocomp from that.
Partitioned block transforms. One of the minor improvements in image coding lately, which is natural to move to video coding, is PBT with
more flexible sizes. This means 8x16, 4x8, 4x32, whatever, lots of partition sizes, and having block transforms for that size of partition.
This lets the block transform match the data better. Which also leads us to -
Directional transforms and trained transforms. Another big step is not always using an X & Y oriented orthogonal DCT. You can get a big
win by doing directional transforms. In particular, you find the directions of edges and construct a transform that has its bases aligned
along those edges. This greatly reduces ringing and improves energy compaction. The problem is how do you signal the direction or the
transform data? One option is to code the direction as extra side information, but that is probably prohibitive overhead. A better option
is to look at the local pixels (you already have decoded neighbors) and run edge detection on them and find the local edge directions and
use that to make your transform bases. Even more extreme would be to do a fully custom transform construction from local pixels (and the
same neighborhood in the last frame), either using competition (select from a set of transforms based on which one would have done best on
those areas) or training (build the KLT for those areas). Custom trained bases are especially useful for "weird" images like Barb.
These techniques can also be used for ...
Intra prediction. Like residual transforms, you want directional intra prediction that runs along the edges of your block, and ideally you
don't want to send bits to flag that direction, rather figure it out from neighbors & previous frame (at least to condition your probabilities).
Aside from finding direction, neighbors could be used to vote for or train fully custom intra predictors. One of the H265 proposals is
basically GLICBAWLS applied to intra prediction - that is, train a local linear predictor by doing weighted LSQR on the neighborhood. There
are some other equally insane intra prediction proposals - basically any texture synthesis or prediction paper over the last 10 years is fair
game for insane H265 intra prediction proposals, so for example you have suggestions like Markov 2x2 block matching intra prediction which builds
a context from the local pixel neighborhood and then predicts pixels that have been seen in similar contexts in the image so far.
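A rough sketch of the "train a local linear predictor by weighted least squares" idea, in plain Python. The three-neighbor support (west, north, north-west), the training window radius, the inverse-squared-distance weights and the small ridge term are all assumptions picked to keep the example short, and are not taken from GLICBAWLS or from any H265 proposal:

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= f * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def causal_neighbors(img, r, c):
    """West, north and north-west neighbors: all decoded before (r, c) in raster order."""
    return [img[r][c - 1], img[r - 1][c], img[r - 1][c - 1]]


def train_predictor(img, r, c, radius=3, ridge=1e-6):
    """Fit weights w so that pixel ~= w . neighbors, using only pixels decoded before (r, c)."""
    n = 3
    # The ridge term keeps the normal equations solvable on flat or planar areas.
    ata = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    atb = [0.0] * n
    for tr in range(max(1, r - radius), r + 1):
        for tc in range(max(1, c - radius), min(len(img[0]), c + radius + 1)):
            if tr == r and tc >= c:
                break  # the rest of this row is not decoded yet
            weight = 1.0 / ((tr - r) ** 2 + (tc - c) ** 2)  # closer samples count more
            x = causal_neighbors(img, tr, tc)
            for i in range(n):
                atb[i] += weight * x[i] * img[tr][tc]
                for j in range(n):
                    ata[i][j] += weight * x[i] * x[j]
    return solve_linear(ata, atb)


def predict(img, r, c):
    """Predict img[r][c] from its causal neighbors with locally trained weights."""
    w = train_predictor(img, r, c)
    return sum(wi * xi for wi, xi in zip(w, causal_neighbors(img, r, c)))
```

Because the decoder can run the same training on the same already-decoded pixels, no predictor coefficients need to be sent. The cost is a least-squares solve per pixel, which is why this kind of proposal looks insane for real-time decoding.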
Unblocking filters ("loop filtering" WTF retarded name is that) are an obvious area for improvement. The biggest area for improvement is deciding
when a block edge has been created by the codec and when it is in the source data. This can actually usually be figured out if the unblocking filter
has access to not just the pixels, but how they were coded and what they were mocomped from. In particular, it can see whether the code stream
was *trying* to send a smooth curve and just couldn't because of quantization, or whether the code stream intentionally didn't send a smooth curve
(eg. it could have but chose not to).
Subpel filters. There are a lot of proposals for improved sub-pixel filters. Obviously you can use more taps to get better (sharper) frequency
response, and you can add 1/8 pel or finer. The more dramatic proposals are to go to non-separable filters, non-axis aligned filters (eg.
oriented filters), and trained/adaptive filters, either with the filter coefficients transmitted per frame or again deduced from the previous
frame. The issue is that what you have is just a pixel sampled aliased previous frame; in order to do sub-pel filtering you need to make some
assumptions about the underlying image signal; eg. what is the energy in frequencies higher than the sampling limit? Different sub-pel filters
correspond to different assumptions about the beyond-nyquist frequency content. As usual orienting filters along edges helps.
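As a concrete 1-D illustration of "more taps, sharper response", here is a sketch comparing a 2-tap bilinear half-pel sample with the 6-tap (1,-5,20,20,-5,1)/32 filter that H264 uses for luma half-samples. The edge clamping and the test row are assumptions for the example, and the oriented or adaptive filters from the proposals are not shown:

```python
def half_pel_bilinear(row, i):
    """Sample halfway between row[i] and row[i + 1] with a 2-tap average."""
    return (row[i] + row[i + 1]) / 2


SIX_TAP = (1, -5, 20, 20, -5, 1)  # taps sum to 32


def half_pel_six_tap(row, i):
    """Sample halfway between row[i] and row[i + 1] with the 6-tap filter.

    Edge pixels are replicated when the taps reach past the row (an assumption here).
    """
    total = 0
    for k, tap in enumerate(SIX_TAP):
        j = min(max(i - 2 + k, 0), len(row) - 1)  # clamp the index to the row
        total += tap * row[j]
    return min(max((total + 16) >> 5, 0), 255)  # round, divide by 32, clip to 8 bits


row = [50, 50, 50, 50, 150, 150, 150, 150]  # a soft step edge
for i in (2, 3, 4):
    print(i, half_pel_bilinear(row, i), half_pel_six_tap(row, i))
# prints:
# 2 50.0 38
# 3 100.0 100
# 4 150.0 163
```

Both filters agree in the middle of the edge, but the 6-tap filter undershoots and overshoots on either side of it. That is the sharper frequency response, and it is also an implicit guess about what the signal was doing above the sampling limit.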
Improved entropy coding. So far as I can tell there's nothing too interesting here. Current video coders (H264) use entropy coders from the 1980's
(very similar to the Q-coder stuff in JPEG-ari), and the proposals are to bring the entropy coding into the 1990's, on the level of ECECOW
or EZDCT.
If it in fact becomes a clean open-source video standard with no major patent encumbrances, it might be well integrated in Firefox,
Windows Media, etc. etc. - eg. we might actually have a video format that actually just WORKS! I don't even care if the quality/size is
really competitive. How sweet would it be if there was a format that I knew I could download and it would just play back correctly and
not give me any headaches. Right now that does not exist at all. (it's a sad fact that animated GIF is probably the most portable video
format of the moment).
Now, you might well ask - why VP8 ? To that I have no good answer. VP8 seems like a messy cock-assed standard which has nothing in
particular going for it. The entropy encoder in particular (much like H264) seems badly designed and inefficient.
The basics are completely vanilla, in that it is block based, block modes, movecs, transforms, residual coding.
In that sense it is just like MPEG1 or H265. That is a perfectly fine thing to do, and in fact it's what I've wound up doing, but you
could pull a video standard like that out of your ass in about five minutes, there's no need to license code for that. If in fact VP8 does
dodge all the existing patents then that would be a reason that it has value.
The VP8 code stream is probably pretty weak (I really don't know enough of the details to say for sure).
However, what I have learned of late is that there is massive room for the encoder to make good output video even through a weak
code stream. In fact I think a very good encoder could make good output from an MPEG2 level of code stream.
Monty at Xiph has a nice page about work on Theora. There's nothing
really cutting edge in there but it's nicely written and it's a good demonstration of the improvement you can get on a fixed standard
code stream just with encoder improvements (and really their encoder is only up to "good but still basic" and not really into the realm
of wicked-aggressive).
The only question we need to ask about the VP8 code stream is : is it flexible enough that it's possible to write a good encoder for it
over the next few years? And it seems the answer is yes. (contrast this to VP3/Theora which has a fundamentally broken code stream which
has made it very hard to write a good encoder).
ADDENDUM : this post by Greg Maxwell is pretty right on.
ADDENDUM 2 : Something major that's been missing from the web discussions and from the literature about video for a long
time is the separation of code stream from encoder. The code stream basically gives the encoder a language and framework to work in.
The things that Jason / Dark Shikari thinks are so great about x264 are almost entirely encoder-side things that could apply to almost any
code stream (eg. "psy rdo" , "AQ", "mbtree", etc.). The literature doesn't discuss this much because they are trapped in the pit of PSNR
comparisons, in which encoder side work is not that interesting. Encoder work for PSNR is not interesting because we generally know directly
how to optimize for MSE/SSD/L2 error - very simple ways like flat quantizers and DCT-space trellis quant, etc. What's more interesting
is perceptual quality optimization in the encoder. In order to achieve good perceptual optimization, what you need is a good way to
measure perceptual error (which we don't have), and the ability to try things in the code stream and see if they improve perceptual error
(hard due to non-local effects), and a code stream that is flexible enough for the encoder to make choices that create different kinds of
errors in the output. For example adding more block modes to your video coder with different types of coding is usually/often bad in a PSNR
sense because all they do is create redundancy and take away code space from the normal modes, but it can be very good in a perceptual sense
because it gives the encoder more choice.
ADDENDUM 3 : Case in point , I finally have noticed some x264 encoded videos showing up on the torrent sites. Well, about 90% of them don't
play back on my media PC right. There's some glitching problem, or the audio & video get out of sync, or the framerate is off a tiny bit, or
some shit and it's fucking annoying.
ADDENDUM 4 : I should be more clear - the most exciting thing about VP8 is that it (hopefully) provides an open
patent-free standard that can then be played with and discussed openly by the development community. Hopefully
encoders and decoders will also be open source and we will be able to talk about the techniques that go into them,
and a whole new
What this means is VC thinks you have no SCC connection at all, your files are just on your disk. You need to change the default NiftyPerforce
settings so that it checks out files for you when you edit/save etc.
Advantages of NiftyPerforce without P4SCC :
1. Much faster startup / project load, because it doesn't go and check the status of everything in the project with P4.
2. No clusterfuck when you start unconnected. This is one of the worst problems with P4SCC, for example if you want to work on some work projects
but can't VPN for some reason, P4SCC will have a total shit fit about working disconnected. With the NiftyPerforce setup you just attrib your
files and go on with your business.
3. No difficulties with changing binding/etc. This is another major disaster with P4SCC. It's rare, but if you change the P4 location of a
project or change your mappings or if you already have some files added to P4 but not the project, all these things give MSdev a complete
shit-fit. That all goes away.
Disadvantages of NiftyPerforce without P4SCC :
1. The first few keystrokes are lost. When you try to edit a checked-in file, you can just start typing and Nifty will go check it out,
but until the checkout is done your keystrokes go to never-never land. Mild suckitude. Alternatively you could let MSDev pop up the dialog
for "do you want to edit this read only file" which would make you more aware of what's going on but doesn't actually fix the issue.
2. No check marks and locks in project browser to let you know what's checked in / checked out. This is not a huge big deal, but it is a nice
sanity check to make sure things are working the way they should be. Instead you have to keep an eye on your P4Win window which is a mild
productivity hit.
One note about making the changeover : for existing projects that have P4SCC bindings, if you load them up in VC and tell VC to remove the
binding, it also will be "helpful" and go attrib all your files to make them writeable (it also will be unhelpful and not check out your
projects to make the change to not have them bound). Then NiftyPerforce won't work because your files are
already writeable.
The easiest way to do this right is to just open your vcproj's and sln's in a text editor and rip out all the binding bits manually.
I'm not sure yet whether the pros/cons are worth it. P4SCC actually is pretty nice once it's set up, though the ass-pain it gives when trying
to make it do something it doesn't want to do (like source control something that's out of the binding root) is pretty severe.
ADDENDUM :
I found the real pro & con of each way.
Pro P4SCC : You can just start editing files in VC and not worry about it. It auto-checks out files from P4
and you don't lose key presses. The most important case here is that it correctly handles files that you have
not got the latest revision of - it will pop up "edit current or sync first" in that case. The best way to use Nifty
seems to be Jim's suggestion - put checkout on Save, do not checkout on Edit, and make files read-only editable in memory.
That works great if you are a single dev but is not super awesome in an actual shared environment with heavy contention.
Pro NiftyP4 : When you're working from home over an unreliable VPN, P4SCC is just unworkable. If you lose
connection it basically hangs MSDev. This is so bad that it pretty much completely dooms P4SCC.
ARG actually I take that back a bit, NiftyP4 also hangs MSDev when you lose connection, though it's not nearly as bad.
I mentioned this before :
But I'm having second thoughts, because putting little config shitlets in my source dirs is one of the things I
hate about CVS. Granted it would be much better in this case - I would only need a handful of them in my top
level dirs, but another disadvantage is my p4bydir app would need to scan up the dir tree all the time to find
config files.
And there's a better way. The thing is, the P4 Client specs already have the information of what dirs on my
local machine go with what depot mappings. The problem is the client spec is not actually associated with a
server. What you need is a "port client user" setting. These are stored as favorites in P4Win, but there is
no authoritative list of the valid/good "port client user" setups on a machine.
So, my new idea is that I store my own config file somewhere that lists the valid "port client user" sets that
I want to consider in p4bydir. I load that and then grab all the client specs. I use the client specs to
identify what dirs to map to where, and the "port client user" settings to tell what p4 environment to set
for that dir.
I then replace the global p4.exe with my own p4bydir so that all apps (like NiftyPerforce) will automatically
talk to the right connection whenever they do a p4 on a file.
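A sketch of the lookup step, assuming the "port client user" sets have already been loaded and each one has been paired with the local root from its client spec. The connection entries, server names and function names below are hypothetical, and a real client spec can map several views, which this ignores:

```python
from pathlib import PureWindowsPath

# Hypothetical "port client user" sets, each with the local root of its client spec.
CONNECTIONS = [
    {"port": "perforce.work.example:1666", "client": "work_pc", "user": "me",
     "root": r"c:\src\work"},
    {"port": "tools.work.example:1666", "client": "work_tools", "user": "me",
     "root": r"c:\src\work\tools"},
    {"port": "localhost:1666", "client": "home_pc", "user": "me",
     "root": r"c:\src\home"},
]


def connection_for(path, connections):
    """Return the connection whose client root contains path (deepest root wins), else None."""
    target = PureWindowsPath(path)  # Windows path rules: comparisons ignore case
    best = None
    best_depth = -1
    for conn in connections:
        root = PureWindowsPath(conn["root"])
        if target == root or root in target.parents:
            depth = len(root.parts)
            if depth > best_depth:
                best, best_depth = conn, depth
    return best


def p4_environment(path, connections):
    """Environment variables a p4 wrapper could set before running the real p4 on path."""
    conn = connection_for(path, connections)
    if conn is None:
        return {}
    return {"P4PORT": conn["port"], "P4CLIENT": conn["client"], "P4USER": conn["user"]}


print(p4_environment(r"C:\src\work\game\main.cpp", CONNECTIONS))
```

The wrapper would merge that dictionary into its environment and then launch the real p4 with the original arguments. Matching on path components and not on string prefixes keeps c:\src\workshop from being mistaken for something under c:\src\work.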
Wait, this is a research question ? Gee, why would I prefer perfect black and white raster fonts to smudged
and color-fringed cleartype. I just can't imagine it! Better do some community user testing...
Oh my god, LOL, holy crap. They are obviously comparing Cleartyped anti-aliased fonts to black-and-white
rendered TrueType fonts, NOT to raster fonts. They're probably doing big fonts on a high DPI screen too.
Try it again on a 24" LCD with an 8 point font please, and compare something that has an unhinted TrueType and
an actual hand-crafted raster font. Jesus. Oh, but I must be wrong because the community survey says 94%
prefer cleartype!
Anyway, as usual the annoying thing is that in pushing their fuck-tard agenda, they refuse to acknowledge
the actual pros and cons of each method and give you the controls you really want. What I would like is
a setting to make Windows always prefer bitmap fonts when they exist, but use ClearType if it is actually
drawing anti-aliased fonts. Even then I still might not use it because I fucking hate those color fringes,
but it would be way more reasonable. Beyond that obviously you could want even more control like switching
preference for cleartype vs. bitmap per font, or turning on and off hinting per font or per app, etc. but
just some more reasonable global default would get you 90% of the way there. I would want something like
"always prefer raster font for sizes <= 14 point" or something like that.
Text editors are a simple case because you just let the user set the font and get what they want, and it
doesn't matter what size the text is because it's not laid out. PDF's and such I guess you go ahead and use TT
all the time. The web is a weird hybrid which is semi-formatted. The problem with the web is that it doesn't
tell you when formatting is important or not important. I'd like to override the basic firefox font to be
my own choice nice bitmap font *when formatting is not important* (eg. in blocks of text like I make). But if you
do that globally it hoses the layout of some pages. And then other pages will manually request fonts which are
blurry bollocks.
CodeProject has a nice font survey
with Cleartype/no-Cleartype screen caps.
GDI++ is an interesting hack to GDI32.dll to replace
the font rendering.
Entropy overload has some decent hinted
TTF fonts for programmers you can use in VS 2010.
Electronic Dissonance
has the real awesome solution : sneak raster fonts into asian fonts so that VS 2010 / WPF will use them. This is money
if you use VS 2010.
For reference, this is the ferocious beast that is terrorizing the poor mailman :
LOL. It would actually be pretty damn sweet if I could stop getting mail. Don't think the duplex neighbor would like that though.
RunOrActivate : useful with a hot key program, or from the CLI. Use RunOrActivate [program name]. If a running
process of that program exists, it will be activated and made foreground. If not, a new instance is started.
Similar to the Windows built-in "shortcut key" functionality but not horribly broken like that is.
(BTW for those that don't know, Windows "shortcut keys" have had huge bugs ever since Win 95 ; they sometimes work
great, basically doing RunOrActivate, but they use some weird mechanism which causes them to not work right with
some apps (maybe they use DDE?), they also have bizarre latency semi-randomly, usually they launch the app instantly
but occasionally they just decide to wait for 10 seconds or so).
RunOrActivate also has a bonus feature : if multiple instances of that process are running it will cycle you between
them. So for example my Win-E now starts an explorer, goes to existing one if there was one, and if there were a few
it cycles between explorers. Very nice. Also works with TCC windows and Firefox Windows. This actually solves a
long-time usability problem I've had with shortcut keys that I never thought about fixing before, so huzzah.
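The launch / activate / cycle decision can be sketched as a pure function, leaving out all the Win32 plumbing. The function name and the idea of passing window ids in a stable order are assumptions for the example. On Windows the inputs would presumably come from something like EnumWindows and GetForegroundWindow, which is not shown here:

```python
def choose_action(window_ids, foreground_id):
    """Decide what a RunOrActivate-style tool should do.

    window_ids: top-level windows of the target program, in a stable order.
    foreground_id: the window that currently has focus (it may belong to any program).
    Returns ("launch", None) or ("activate", window_id).
    """
    if not window_ids:
        return ("launch", None)  # no running instance: start one
    if foreground_id in window_ids:
        # Already on one of its windows: cycle to the next, wrapping around.
        nxt = (window_ids.index(foreground_id) + 1) % len(window_ids)
        return ("activate", window_ids[nxt])
    return ("activate", window_ids[0])  # running but not focused: bring it forward


print(choose_action([], 7))           # ('launch', None)
print(choose_action([11, 12, 13], 7))   # ('activate', 11)
print(choose_action([11, 12, 13], 13))  # ('activate', 11)
```

The cycling only behaves if the window list comes back in the same order on every key press, so the real tool needs some stable ordering (by handle value, for example) and not the current z-order, which changes every time a window is activated.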
WinMove : I've been using this forever, lets you move and resize the active window in various ways, either by manual
coordinate or with some shorthands for "left half" etc. Anyway the new bit is I just added an option for "all windows"
so that I can reproduce the Win-M minimize all behavior and Win-Shift-M restore all.
I think that gives me all Win-Key functions I actually want.
ADDENDUM : One slightly fiddly bit is the question of *which* window of a process to activate in RunOrActivate.
Windows refuses to give you any concept of the "primary" window of a process, simply sticking to the assertion
that processes can have many windows. However we all know this is bullshit because Alt-Tab picks out an
isolated set of "primary" windows to switch between. So how do you get the list of alt-tab windows? You don't.
It's "undefined", so you have to make it up somehow.
Raymond Chen describes the
algorithm used in one version of Windows.
New P4 Installs don't include P4Win , but you can just copy it from your old install and keep using it.
This is not a Win7 problem so much as a "newer MS systems" problem, but non-antialiased / non-cleartype text
rendering is getting nerfed. Old stuff that uses GDI will still render good old bitmap fonts fine, but newer
stuff that uses WPF has NO BITMAP FONT SUPPORT. That is, they are always using antialiasing, which is
totally inappropriate for small text (especially without cleartype). (For example MSVC 2010 has no bitmap font
support (* yes I know there are some workarounds for this)).
This is a huge fucking LOSE for serious developers. MS used to actually have better small text than Apple,
Apple always did way better at smooth large blurry WYSIWYG text shit. Now MS is just worse all around because
they have intentionally nerfed the thing they were winning at. I'm very disappointed because I always run
no-cleartype, no-antialias because small bitmap fonts are so much better. A human
font craftsman carefully choosing which pixels should be on or off is so much better than some fucking algorithm
trying to approximate a smooth curve in 3 pixels and instead giving me fucking blue and red fringes.
Obviously anti-aliased text is the *future* of text rendering, but that future is still pretty far away.
My 24" 1920x1200 that I like to work on is 94 dpi (a 30" 2560x1600 is 100 dpi, almost the same). My 17" lappy at 1920x1200 has some of the highest pixel
density that you can get for a reasonable price, it's pretty awesome for photos, but it's still only 133 dpi
which is shit for text (*). To actually do good looking antialiased text you need at least 200 dpi, and 300 would
be better. This is 5-10 years away for consumer price points. (In fact the lappy screen is the unfortunate
uncanny valley; the 24" at 1920x1200 is the perfect res where non-antialiased stuff is the right size on screen
and has the right amount of detail. If you just go to slightly higher dpi, like 133, then everything is too
small. If you then scale it up in software to make it the right size for the eye, you don't actually have
enough pixels to do that scale up. The problem is that until you get above 200 dpi where you can do arbitrary
scaling of GUI elements, the physical size of the pixel is important, and the 100 dpi pixel is just about perfect).
(* = shit for anti-aliased text, obviously great for raster fonts at 14 pels or so).
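For reference, the dpi figures come from dividing the diagonal pixel count by the diagonal size in inches. A quick check, assuming the 30" panel is the usual 2560x1600:

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_inches):
    """Pixel density along the diagonal, assuming square pixels."""
    return math.hypot(width_px, height_px) / diagonal_inches


for name, w, h, diag in [
    ('24" 1920x1200', 1920, 1200, 24),
    ('30" 2560x1600', 2560, 1600, 30),
    ('17" 1920x1200', 1920, 1200, 17),
]:
    print("%s : %.1f dpi" % (name, pixels_per_inch(w, h, diag)))
# prints:
# 24" 1920x1200 : 94.3 dpi
# 30" 2560x1600 : 100.6 dpi
# 17" 1920x1200 : 133.2 dpi
```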
(
ADDENDUM : Urg I keep trying to turn on Cleartype and be okay with it. No no no it's not okay. They should call it "Clear Chromatic Aberration"
or "Clearly the Developers who think this is okay are colorblind". Do they think our eyes only see luma !? WTF !? Introducing colors into
my black and white text is just such a huge visual artifact that no amount of improvement to the curve shapes can make up for that.
)
It's actually pretty sweet right now living in a world where our CPU's are nice and multi-core, but most apps are still single core. It means
I can control the load on my machine myself, which is damn nice. For example I can run 4 apps and know that they will all be pretty nice and
snappy. These days I am frequently keeping 3 copies of my video test app running various tests all the time, and since it's single core I know
I have one free core to still fuck around on the computer and it's full speed. The sad thing is that once apps actually all go multi-core
this is going to go away, because when you actually have to share cores, Windows goes to shit.
Christ why is the registry still so fucking broken? 1. If you are a developer, please please make your apps not
use the registry. Put config files in the same dir as your .exe. 2. The Registry is just a bunch of text strings,
why is it not fucking version controlled? I want a log of the changes and I want to know what app made the change
when. WTF.
The only decent way to get environment variables set is with TCC "set /S" or "set /U".
"C:\Program Files (x86)" is a huge fucking annoyance. Not only does it break my muscle memory and break a ton of batch files I had that
looked for program files, but now I have a fucking quandary every time I'm trying to hunt down a program.. err is it in x86 or not? I really
don't like that decision. I understand it's needed for if you actually have an x86 and x64 version of the same app installed, but that is
very rare, and you should have only bifurcated paths on apps that actually do have a dual install.
(also because lots of apps hard code to c:\program files , they have a horrible hack where they let 32 bit apps think they are actually in
c:\program files when they are in "C:\Program Files (x86)"). Blurg.
Some links :
Types - Vista/Win7 has borked the "File Associations" setup. You need a 3rd party app like Types now
to configure your file types (eg. to change default icons).
Shark007.net - Windows 7 Codecs - WMP12 Codecs - seem to work.
Pismo Technic Inc. - Pismo File Mount - nicest ISO mounter I've found (Daemon tools feels like it's
made out of spit and straw).
Hot Key Plus by Brian Apps - ancient app that still works and I like because it's super
simple.
Change Windows 7 Default Folder Icon - Windows 7 Forums ; presumably you
have the Preview stuff for Folders turned off, so now make the icon not so ugly.
- how to move your perforce depot Annoyingly I used a
different machine name for new lappy and thus a different clientview, so MSVC P4SCC fails to make the connection and wants to rebind every project.
The easiest way to fix this is just to not use P4SCC and kill all your bindings and just use NiftyPerforce without P4SCC.
(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way
to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another
env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have
some work code and some home code open at the same time they shit their pants.
I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something
like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).
allSnap make all windows snap - AllSnap for x64/Win7 seems to be broken, but the old 32 bit one seems
to work just fine still. (ADDENDUM : nope, old allsnap randomly crashes in Win 7, do not use)
KeyTweak Homepage - I used KeyTweak to remap my Caps Lock to Alt.
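For the replacement p4.exe idea in the Perforce note above, the "walk up the dir tree and take the first config you find" part could be sketched roughly like this. The config file name and the KEY=VALUE format are hypothetical choices for illustration, not an existing Perforce convention; P4PORT is the usual Perforce server setting, and the server name in the comment is invented.

```python
import os

CONFIG_NAME = ".p4settings"  # hypothetical file name, chosen only for this sketch

def find_config(start_dir, config_name=CONFIG_NAME):
    """Walk from start_dir up to the filesystem root.

    Returns the path of the first config file found, or None if there is none.
    """
    cur = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(cur, config_name)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(cur)
        if parent == cur:  # dirname of the root is the root: nothing left to check
            return None
        cur = parent

def load_settings(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

# Usage idea: a wrapper would call find_config(os.getcwd()), load the settings
# (e.g. P4PORT=workserver:1666), put them in the environment, then run the real p4.
```

With one such file in c:\home and another in c:\work, whichever tree you are sitting in picks the server, and the nearest file wins if they nest.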
Firefox addons :
ADDENDUM : I found the last few guys who are ticking my disk :
One that you obviously want to disable is Windows Media Player Media sharing service :
Fix wmpnetwk.exe In Windows 7 . It just constantly scans your media dirs for shit to
serve. Fuck you.
The next big culprit is the Windows Reliability stuff. Go ahead and disable the RAC scheduled task, but that's not the real problem. The nasty one is the "last alive stamp"
which windows writes once a minute by default. This is to help diagnose crashes. You could change TimeStampInterval to 30 or so to make it once every 30 minutes, but I set it
to zero to disable it. See :
And this is a decent summary/repeat of what I've said :
How to greatly reduce harddisk grinding noises in Vista .
First impressions of the M6500 : build quality is very nice, lightweight all metal like the
Latitude series. The screen is pretty great, 1920x1200 matte, pretty bright, decent contrast,
the only complaint is that it's not super viewing-angle-independent. It is a very nice bonus
that the lappy LCD res is the same res that I run my 24" external, so I can switch between
lappy LCD and external LCD and it doesn't hose my layouts (however, on the minus side of that
equation, because the lappy LCD is so pixel dense, I have to run in large fonts on it, and
switch back to small fonts for the external LCD, boo). The internal peripherals and the case
are well designed for popping things in and out. It takes two 2.5" thin lappy disks and has 4
RAM slots. The disks are easy to get in and out, and 2 of the RAMs are easy, but the other two
are under the keyboard and take a bit more work. The thing is pretty amazingly quiet, the only
auditory annoyance is that the fan oscillates up and down too much, I wish I could program the
fan to just be on all the time in low speed instead of jumping up and down. The keyboard
action is very nice, like the Latitude series, but it is the standard fucking retarded 17" lappy
thing of just using a normal 15" lappy keyboard and sticking it in the larger case. Jebus can't
you people actually make a custom keyboard for the 17" form factor that takes advantage of
the extra space to give me better layout!? Come on! Anyway, since I don't really use lappy as
a lappy, this is mostly academic.
Win 7 reinstall went very smoothly - it autodetects all the hardware well enough that you can at least
boot up. You then need to install a few things - the GPU drivers, the USB3 driver, the Touchpad
driver. That's about it. Smoothest Windows install I've ever had, by far.
I had one minor annoyance during install - turns out the lappy came with 1066 MHz RAM and I bought 1333
MHz RAM to add. Well, when you do that the lappy boots right up and doesn't complain at all, and
runs through memory check (and even the Windows heavy duty memory check) and doesn't find any errors.
But it will in fact give you random RAM failures and blue-screen you. So I had to reformat the disk
and pull the 1066 RAM and reinstall windows. Then I discovered that the lappy doesn't work with memory only
in the C/D slots and not the A/B slots. So I pulled it apart again, and then finally got going. Sigh.
Make sure you switch the BIOS to AHCI, not Intel's fucking raid thing (*before* the Windows reinstall).
Currently I am running
the MS AHCI driver which supports TRIM. Unclear whether I will ever install the Intel driver.
I put in an Intel X25-M SSD. Holy shit this thing is so good. If you are a serious computer and
user and you are not on an SSD - GET ONE RIGHT NOW.
So far I have disabled : Indexing, ReadyBoot, Prefetch, Superfetch, Scheduled Defrag,
Defender, Updater, System Restore, Page File, Hibernate, Wifi polling. That has cleaned things up nicely, but
some fucking Win7 service is still pinging my disk every 10 seconds or so and I haven't tracked
down what it is yet. (fuckers). When I am doing nothing on my computer, it should go to 0% CPU and
never ever touch my disk.
Win7 is mostly good so far. As usual there's the annoyance that MS loves to randomly rename things
and move them around. Perhaps the worst thing so far is that fucking Backspace is no longer
"go up a dir" in Explorer. Yes, I know Alt-Up does that, but it should be fucking backspace god
dammit! This was widely complained about during Beta, but like fuckers they refused to provide a
config switch to let me have my damn old backspace behavior. I found an AHK solution to map
Backspace to Alt-Up , but AHK is a fucking bloated beast of flakey crapware so I'm hoping to find
another solution to this (probably just write my own).
The Win-# hot keys are almost a good thing, except that using a number which is their position
in the task bar is fucking awful. You should let me assign my own hotkeys, that way I can use
Win-V for VisualStudio and Win-F for Firefox or whatever, so that I can actually memorize the
keypresses and be fast instead of having to count each time.
Finally, it annoys me to all fuck that MS refuses to give me the one feature that would make UAC usable :
just a button to always allow promotion of a given app. When it says "do you want to allow?" it should
be "Yes/No/Always" , not just "Yes/No". As is, the fucking Shareware "ProcessGuard" is much much better
than UAC because with ProcessGuard you can actually say "always allow this app" or "always forbid this app"
blah blah. It's so fucking obvious and such a major usability fuckup. The result is that 99% of power
users just turn off UAC, whereas if you had a "always allow for this app" I would totally leave UAC on.
I dunno, it just boggles my mind, it would be so easy to make UAC useful and functional. You just have to
understand how computing works. I want to have a host of programs installed on my computer which I have
marked trusted and let them do whatever they want. Then I want to be able to download random junk from
the internet and run them in safe mode where they are forbidden from doing various things like installing
stuff in startup or fucking with my windows dir. WTF, why do I not have this !?
Oh, I guess I'm going to try going to MSVC 2010. Since this is supposed to be my machine for the next 10 years,
I'd rather just eat the pain of getting on new stuff all at once and then hopefully not have to do it ever again.
We'll see about that...
ADDENDUM : Urg. I take it all back, Win 7 is a FUCKING ABORTION (later addendum : that might have been
a slight exaggeration). I am in a constant
hell of fucking file ownerships and user privileges and shit. Here's just a minor sample :
You Map a network drive. Everything seems fine and dandy. Now you open a command prompt
in "run as administrator" mode. Your mapped drive is not there !? WTF !? Oh, brilliant fuckers
that they are, the drive mapping is *PER USER* so the fucking administrator doesn't see it,
so you have to remap it for the administrator account.
I install Perforce. Of course Perforce Server installs as "owned by" the Administrator
account. So when I am logged in as my account if I try to do anything to those files it
says "fuck you I'm going to pop up annoying boxes".
I'm trying to copy files from my old lappy's disk to my new one. Of course now Win 7 pays
attention to the NTFS security tags and it sees those files are owned by some other
user, so I get a bunch of random "Access Denied" messages with no explanation. Of
course I can just go and do a fucking recursive "take ownership" of all those files,
but that's really just a hack fix and if I plug that disk in somewhere else I'll have to
do it again.
Jesus christ. Somebody give me an operating system where my files are just fucking files without
some security or owner bullshit and I
can run whatever I want and it just fucking works. I want Windows 95 please.
It's easy enough to disable UAC popups, but that's only the fucking tip of the
iceberg. UAC is fucking you up back and forth all the time. Win 7 UAC also does
some super nasty shit that most people don't know about, in that it remaps a bunch
of virtual directory names *per user* (and also for 32 bit vs 64 bit and other
compatibility modes), so depending on how your program is run it can see very
different things on the same box. DO NOT WANT.
If you are a serious driver and are still running stock alignment, I highly recommend it. But do your
research first.
Most people think of "alignment" as making the wheels straight to fix pulling issues. In reality alignment
is much more than that and can give you a lot of parameters to tweak to play with the way the car handles.
It's much cheaper than doing suspension mods, and often more effective. The main things you will want
adjusted are toe and camber. I posted these links before :
Wheel Alignment A Short Course
But to repeat myself a bit, most stock alignments come with pretty significant toe-in and pretty minimal camber.
What the toe-in does is make the car more stable, it acts to straighten itself. That means the front end
doesn't wander under braking or when you hit bumps. Most pussy comfort drivers complain when cars are "wiggly"
(which means they are "lively"), so manufacturers sell the cars with lots of toe-in. This makes them feel
nice and stable on the freeway, but it sucks for making crisp turns. The other issue is camber - lots of negative
camber means the wheels are tilted in; this will give you better contact patches when the car is in a lean through
a corner. If you spend most of your time going straight on freeways or just turning slowly, then the stock
alignment with minimal camber is fine, but if you do a lot of hard cornering, you will get much better grip
with lots of negative camber. I now have my car set to zero toe and -1.2 degrees camber.
Race cars will go even more extreme, sometimes even using toe *out* which makes them super lively and easy to turn-in,
and tons of negative camber.
For Porsches you can ask for the
"RoW Performance alignment" or just go to a good sport alignment shop and ask
for an "aggressive street alignment" or something like that.
The next step on a Porsche is get to the GT3 "Lower Control Arms" which let you go to greater negative camber (-2 degrees
seems good). You can hack this with "camber plates" but camber plates are like $50 and the LCA's are like $150 and labor
will be $200 or so, so just go with the better LCA solution.
Similarly with events - if someone asks you out to do something, and they are clearly enjoying it but you are not - be a
fucking good sport and try to get in the spirit and at least feign moderate happiness. Let them enjoy their time, don't
make yourself into a distraction and annoyance by griping or wandering off or whatever. When you agree to go out with
someone you implicitly agree that if they like it and you don't, you will go along with it. It's not that hard. And
just tolerating it while obviously showing your annoyance and saying "I'm just here for you, let me know when we can leave"
does not count as going along with it, it's still shitty.
Recently I've been thinking about this party I went to at the top of the condo tower on First Hill. My date at the time
made me go despite my apprehension. As I expected, it was fucking awful, just a mob of people I didn't know with nothing
to talk about, no dancing, no games, just stupid fucking boring people drinking and chit chatting about nonsense. It
was excruciating, and like the dick that I am, even though I was "being a good sport" in my mind, I made it abundantly
clear with my body language and constant wandering off that I was not happy. The funny thing is that many months later,
the horrible party is actually one of the more memorable social events that I've attended in the last few years. It gives
me something to talk about, the condo tower it was in is the tallest building around and the party was right at the top
so it's a unique experience. It's weird this need to "do something" that we have; when I haven't been out much I get this
feeling of stir craziness, that I'm wasting my life, you get depressed without knowing why exactly, then you go out to
some event and it's just awful the whole time and you can't wait to leave, and then months later it is the thing you did
that you remember, not all those days when you actually were happy and just stayed home or went for a bike ride or whatever
it is that you actually enjoy.
People who suck at things are a real problem. In theory, it doesn't actually matter whether people are good at things or
not - we put a lot of our self worth into our "skills" (I am so fucking great because I'm good at X), but in reality when
I'm hanging out with someone I could care less about their skills. What actually matters is their attitude and emotional
intelligence and sense of humor and kindness and so on, that is what actually makes you a good person to be with. But
in reality, people who suck at things are just a fucking drag. The problem is that *they* care that they suck. The result is that they
are often in a funk because they screwed something up, or they are super afraid of being judged, or they have big
insecurity problems. The result is that they bring you down because they don't want you to be so much better than them,
they force you to hide your own skills. For example, I really don't mind if someone cooks for me and makes mediocre food;
what makes it fun or not is their attitude, their conversation, their enthusiasm, in fact if they are excited about their food
that is way more important than whether it is actually good or not (the converse is people who are really awesome cooks but act
all humble and self-deprecating about it which is just annoying and unpleasant). However, that almost never works out because
they imagine you are thinking horrible things about them and their food. Similarly with playing board games or sports with
someone; when I toss a frisbee with someone, I don't care how great they are, yeah it's better if you can actually make some
throws and catches, but enthusiasm and good attitude and hustle are way more important. The problem is that people who suck get
down on themselves and get in a funk and then it's just annoying to be with them.
People who can suck and not care about it are very rare and very cool. In fact that is one of my great unattained goals for
myself.
The day starts by stripping everything loose out of your car - everything! Things typically forgotten are floor mats and the
tire kit in your trunk. Take it all out. It's good to bring a big tub or something to keep your gear in that you took out of
your car.
Then there was a one hour ground school. If you know anything about race driving (eg. if you know what a "late apex" is or
what the "traction budget" is) then this is pretty boring, but to their credit the guys speaking were actually pretty
charismatic and moved quickly and made it not excruciating.
The rest of the day consisted of six different exercise stations. You spend an hour at each station then rotate to the next.
At each station there are 10 cars or so and you take turns running the course, and a few volunteer instructors rotate through
the cars so you almost always have an instructor in car with you. The instructors were uniformly great guys & gals - they
volunteer and were very friendly and knowledgeable and had great attitudes despite us trying to make them vomit and standing
in the rain all day.
Oh yeah, it was wet, it rained pretty hard all day. That made everything very slippery. In one way it was good to get to
practice with the car in the wet, but it would have been nice to get some runs in the dry.
The sessions were :
Skidpad (donuts) : they wet the track and you go around in circles. You can really feel how you can throttle steer in
this exercise, more throttle and you circle wider, less throttle you come in. Like most of the things you learn in DS, you
should already know this in your brain, but you need to actually feel it in your car and hands for it to become intuition.
I also got to play with the understeer/oversteer characteristics of my car. When I gradually take the throttle up past the
limit of adhesion, my car goes into understeer and plows out; if I really punch the throttle hard, it will kick the rear out
and go into an oversteer spin. I was never able to kick the rear out and control it, it's hard to do in the C4S because you
have to kick the throttle so hard to get it to step out. The main mistake I was making was once I got into a spin I was
reflexively putting the clutch in and letting off the throttle abruptly; what I need to do is just ease down on the throttle
and try to catch it before bailing out.
Braking / accident avoidance : part 1 is go as fast as you can and then slam on the brakes to full stop as late as you can.
This was a trip because it is absolutely amazing how fast the car can stop when you are fully on the brakes (even in the wet!).
The goal of the exercise is to brake as late as possible and still stop before the cone. I kept braking way too soon because
my intuition says "you have to brake now!" ; by the end I started getting closer, but I really needed more time to get used
to it. Part 2 was accident avoidance; a surprise obstacle (a very brave volunteer) jumps out to one side, and you have to brake
and steer away at the same time. Again I was just not braking hard enough here at first, because my gut says you can't steer
if you're braking that hard; in fact you can!
Handling oval : run the car around in an oval to practice apexing. Get up to as much speed as possible before the turn,
hard on the brakes without much turning, look for the apex, turn hard and try to power out past the apex, making a perfect
"late apex" turn. Hard to do in practice. A few things I need to get better at on this - make sure you actually look and
stare at the apex, it will be out your side window as you brake in; make sure you brake enough before turning in, you need to
get well slow or you'll understeer and miss the apex. Basically what you're doing is prolonging your straights on both
entry and exit, which is what gives you more speed; you brake late and very hard so you enter deep into the turn, then at very low
speed you turn hard to aim back past the apex, then get on the power early and power out hard in a very gradual curve.
Slalom : weaving between cones and trying to go as fast as possible. The main thing here is smoothness and looking ahead.
This felt pretty natural to me, it is a lot like skiing. The main thing is to look way ahead and have your line planned out,
take the most gradual arc you can, don't jerk the car back and forth.
Advanced Slalom : like slalom, but cones are not in a straight line, so you have to look ahead more and use more planning.
Pretty fun. The skill is a lot like slalom, being smooth, but also a bit like Autocross in visualizing the "empty space";
that is, don't see the cones and think that you have to drive past the cones, rather see all the empty space that the cones
allow and pick your best line in that empty space - often there are straights hidden in the cones, and straights = speed.
+ bonus shifting training. I actually learned a lot from this. One thing that I was coached on all day by instructors was to
keep your hands on the wheel! I habitually get my hand over to the shifter too often, sometimes I'll anticipate I might have
to shift in a corner and I go move my hand over to the stick to prepare, but that means I'm not cornering as well, you need
to stay on the wheel, then quickly pop over to the shifter and quickly pop back. I also do some funny unnecessary extra movements,
like sometimes I move the shifter to neutral, let go for a half second, and then move it into the next gear. The main thing
was some guidance on rev matching downshifts. I know how to do it, but it's great to just have someone watch you and pay
attention to the RPM dial and the lurch of the car and tell you your mistake each time. The goal is to make a perfectly
smooth downshift so that you can't feel it at all. The main thing is the Porsche engine is real "heavy" , you need to give it
a real good kick of gas to get the revs up, and you need to do it well enough before you let the clutch out. A basic downshift
should consist of : clutch in, select neutral, increase throttle, select gear, clutch out. The detail here is crucial - you
increase throttle *before* you select gear. Another little trick I learned is that it's better to over-shoot the RPM for rev
match than to undershoot it, so err on the side of too much gas.
Autocross : obviously the exercise that puts it all together, we got to do a few runs on a tiny autocross course. This was a
fucking blast, and I plan to do some more Auto-X some day. You get to slam on the gas for the straights, hard on the brakes,
make the turns, it involves all the skills. Like the advanced slalom course, a lot of the skill is in picking a good line,
which means not following the cones - use the freedom of all that extra tarmac. Take your turns as wide and sweeping as you
can, turn soft wiggles into straights. My main mistakes were : following the direction of the cones too much, not going as
wide as I should after and before corners, coming into corners too fast - you have to really brake hard coming in, not getting
on throttle hard enough or early enough in the straights - intuitively you see a 50 foot long straight before the next turn and
your mind says "okay just coast in there for the straight" but what you need to do is floor it as late as possible and then
slam on the brakes before the turn.
I learned a lot about my driving and my car's responses, so it was a huge success, and a lot of fun. It is a very long day - I woke up
around 5 AM to get the ferry, so I was exhausted by the end, and the other drawback is that there is an awful lot of
standing around. With 10 cars on each skill, that means you only spend 1/10 of the time actually driving and the rest is
waiting. It would be so awesome to be the only one out there with just a real top instructor and tons of track time, you
could improve so much.
There were also a huge variety of people there; it was only about 50% Porsches or maybe a bit less, lots of BMWs, and a few
red herrings like a Hyundai Genesis and a Pontiac G8. There were a few real track cars that were trailered in. There were also
people there who were not real speed drivers at all, who just wanted to get more comfortable with their cars.
My car is not really a good autocross car at all, it's too big and heavy, the 4WD is a disadvantage, and really the visibility
of a convertible would help a lot. I was jealous of the people in Miatas and Elises and shit like that - a small light RWD
car would be a fucking blast for AutoX, you can get up and down in speed quickly. Also the power of my car is really a
disadvantage for me in a way - I'm not good enough to handle the car at the speeds it can do.
The ideal track car would be something really cheap so you don't have to worry about the abuse it will get,
light, only medium power so it can't go fast enough to hurt me, RWD front engine. Put sticky tires on it and do some suspension mods.
But there's no way I will ever track often enough for it to be worth having a "track car". It's another one of those things
that you would ideally share between 10 friends or something.
So I'm looking at Dells. The Dell Latitude E6410 is very similar to the Thinkpad T410 ;
Core i5 or i7 duals, nice magnesium case, 14.1" at 1440x900 matte (worse than my current lappy
which is 1400x1050 but passable). NVidia 3100M. Proper hard drive options. All around it
looks like a very nice general purpose lappy.
Dell Precision M4500 is the "mobile workstation" , equivalent to the Thinkpad W510. It's interesting
how Dell and Lenovo seem to have lined up their products as direct competing analogues; there's
some economic theory term for that. Anyway, it has options for dual or quad Core i5/i7,
NV 1800 or 880 GPU, it can do two hard disks, screen is 15.6" at 1600x900 or 1920x1080 ; seems
reasonable for me. One problem is it only has 2 RAM slots, so I'll max out at 8 GB, while the
Thinkpad W510 has four. (Correction : no it can't really do two hard disks, it can run a second
SSD in a mini slot, that's not as good as the 6500 which can do two real disks).
(the Dell M6500 is the even bigger one, but it doesn't offer a DX11 GPU either so there's really
no point in stepping up to it; the best is an ATI FirePro M7740 which is a shitty business certified
kind of thing on DX 10.1 ; it does have a proper 17" 16:10 screen in "WUXGA" (1920x1200)
and you can get it factory
built with RAID 0 2x500 GB 7200 RPM disks which is not bad. )
Dell has some retarded naming going on right now, watch out for that. eg the Latitude 6400 and 6500 are
the old model, the xx10 are the new ones. The Dell Precision 4400 , 6400, etc are the old ones, the x500
are the new ones. WTF. Also the "Precision M" series is actually part of the "Latitude E" family and
thus is compatible with the "E Series" docking solutions. But the docking changed from the 6400 to the 6410
generation so watch out for that. Yay. All these business Dell lappies do docking, though there are
many reports of problems with that. It looks like those problems may be various drivers, but come on Dell,
you need to get your shit together.
(BTW you might think the AlienWares would be even better, but no, not really. The best GPU you can get is an
ATI HD 4870 which is DX 10. The screens are glossy and they don't have docking. Boo).
A few niggles to look at :
WiMAX ? So far as I know WiMax is not really rolled out in the US yet, but it is exciting for the future.
Since I don't actually use my lappy on the go this is probably not for me. Maybe someday I'll have a WiMax router
at home and cancel my cable.
Dual Core at something like 2.4 / 2.6 GHz or Quad Core at 1.6 / 1.73 GHz ? I suspect that the dual is actually faster for
most current usage, but I'm tempted to go with the quad to give myself a better test environment for my multicore code. It
is also nice the way multicore fixes Windows broke ass multi-tasking so that you can actually have 3 apps running and they
are all fully responsive.
(BTW all you fucking app developers should not be gobbling so much CPU when your app is out of
focus god damnit, especially gobbling CPU for fucking cutesy UI update shit when you are not focus).
128 GB SSD or 500 GB 7200 RPM HD ? SSD is no doubt the win for a thin/light lappy, but for this
type of workstation lappy I'm not so sure. There are reports of data loss, performance degradation over time,
etc. that are a bit scary for something I hope to run for 5-10 years. The fast random access and not having
to worry about defrag is a pretty huge win though. One annoyance is I would have to buy my own SSD third party
and reinstall Windows.
Small addendum on SSDs : yeah they are the win. Going for a 160 GB Intel X25-M. Make sure you get a "G2" (gen2) as
Gen 1 apparently has no TRIM. Also the X25-E is *way* faster, but crazy expensive and only 32 GB. That indicates though
that SSD speeds are going to continue to shoot up and prices come down.
One of the great mysteries of living in Seattle is that it is fucking impossible to find a decent apple up here. The
apples in grocery stores here are uniformly disgusting - mealy, mushy, old, often with brown soft spots. The exact same
chain stores carry the exact same apples in California and they are fresh and crisp and delightful. Now, you snobavores
might suggest I get my apples at a local farmer's market. Sadly, the snobosphere has ruined farmers markets for apples.
It's no longer cool enough to just get a nice farmer-grown Fuji or Gala or some shit, so they no longer sell those proper
apples, it's all "heritage varietals".
Heritage apples are fucking inedible. They're astringent, dry, pasty. It's almost like taking a bite out of a raw
potato, which I just realized is probably where "pomme de terre" came from. In French class people would always be like
"apple of the earth? WTF are those Frenchies thinking, potatoes and apples are nothing alike" , but in fact old apples
from before the bitterness was bred out and the sugar was bred up were like potatoes!
1. Make good defaults. Make things that you always want automatic. With every command line option, ask if you usually want it
on or off, and make the default the thing you usually want, and then make the option to turn it off. Be aggressive about
"refactoring" over time. If you find that you have to type a million options to run your app right, something is wrong.
If you have trouble remembering your own options, something is wrong. Another detail : make options automatically turn
each other on. For example when I enable "heavy prepass mocomp search" I now also automatically enable "save motion to
disk cache" because I pretty much always want them together and I kept running mocomp search and forgetting to enable saving
it to disk.
2. Make full logging and saving the default. You never want to run your app and have it produce some weird results or
crash or something and have no record of it because you forgot to enable logging. Full logging should be on by default,
and only disabled from the command line in weird cases. All my apps now automatically write logs to c:\logs\appname.log
using argv[0] automatically. My video test app also writes a log for each run which is named with the date and time so
that I have logs of every run ever. Each log also writes out info about the build and the command line options so that I
am never left thinking "WTF run was this?". (Actually there's one thing I'm still not doing here that I really should do,
which is to record the sync state of perforce that was used to build the current EXE ; we had that working at Oddworld and
it is the fucking bomb). This is a variant of the Carmack adage that "no time spent visualizing your algorithm is ever wasted" ;
my variant is something like "no amount of logging is too much".
2.1. Make clean versions of your logs and output! Just because you are logging tons of detail, that's not an excuse to
just output a ton of shit that is impossible to parse. You might need a "detailed log" and a "summary log" ; certainly don't
just spit it all to stdout. If you find yourself having to go through the log constantly to find the little bit of info you
actually want, pull that bit out and format it right. Do computations for y
combinedDecodeTable = s_combinedDecodeTable;
the compiler will "optimize" out the copy and will do all the math to compute s_combinedDecodeTable in your inner
loop. So you have to use the "volatile trick" :
volatile UINTa v_hack;
v_hack = (UINTa) s_combinedDecodeTable;
const CombinedDecodeTable * RADRESTRICT const combinedDecodeTable = (const CombinedDecodeTable * RADRESTRICT const) v_hack;
sigh.
A()
if ( x )
{
B()
.. more ..
}
else
{
B()
.. more ..
}
and again the compiler can't figure out that the B() could be hoisted out of the branches (more normally you would probably write it with B
outside the branch, and the compiler won't bring it in). It might be faster inside the branches or outside, so
you have to try both. Branch order reorganization can also help a lot. For example :
if ( x )
if ( y )
.. xy case
else
.. xNy case
else
if ( y )
.. Nxy case
else
.. NxNy case
obviously you could instead do the if(y) first then the if(x). Again the compiler can't do this for you and it can make a big difference.
08-06-10 | Vibram
I fucking hate Vibram. I need to buy some damn hiking shoes (my hipster slipper-tennis-shoes are not ideal for backpacking), but it's
basically impossible because vibram seems to have cornered the market. Having vibram under me feels like I'm walking on wet wood, or
ice cubes, or something else unreasonably stiff and slippery.
08-06-10 | The Casey Method
The "Casey Method" of dealing with boring coding tasks is to instead create a whole new code structure to do them in. eg. need to write
some boring GUI code? Write a new system for making GUIs.
08-05-10 | P4 Shelf
The fact that I can't access the new "p4 shelve" from the old P4Win client makes it almost useless to me. P4V is ridiculously awful.
It takes like 30 seconds to start up, I don't know WTF they're doing, but I know it's not okay.
08-04-10 | Initial Learnings of SPU
int x = struct.array[7];
for(... many .. )
{
.. do stuff with x ..
}
and what I'm seeing is that the "do stuff with x" is generating a load from struct.array[7] every time, and the assignment into my temp has
been eliminated completely. On the SPU this shows up when you have something like a struct member int m_x and you do something like :
qword x = spu_promote( struct.m_x , 0 );
for(... many .. )
{
.. do stuff with x ..
}
and the compiler decides that rather than load m_x once and keep it in a register it will reload it all the time. It's like the optimizer
has a very early pass to merge variables that have just been renamed by assignment - but it is incorrectly not accounting for side effects
on the code gen. Weird.
Anyway this brings us to :
#define IF_LIKELY(expr) if ( __builtin_expect(expr,1) )
#define IF_UNLIKELY(expr) if ( __builtin_expect(expr,0) )
but Mike says it's best just to inherently arrange your code. The way this is actually implemented on the SPU is with a branch hint instruction. The instruction
sequence to the SPU will look like :
branch hint - says br is likely
.. stuff ..
branch
if branch was unlikely, there's no need for a hint, so it's only generated to tell the SPU that the branch is likely, eg. the instruction fetch needs to jump
ahead to the branch location. The branch hint needs to be 15 clocks or so before the branch. One compile option that turns out to be important is
-mhint-max-nops=2
which tells GCC to insert nops to make the branch hint far enough away from the branch. Right now 2 is optimal for me, but YMMV, it's a number you have to fiddle
with.
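One wrinkle with macros like these: __builtin_expect is documented as "expect the first argument to equal the second", so a non-boolean expression such as (flags & 4) never equals 1 and the hint may not mean what you intended. A minimal sketch that normalizes with !! first, assuming a GCC-compatible compiler (sum_flagged is a made-up function, only there to exercise the macro):

```cpp
#if defined(__GNUC__)
#define IF_LIKELY(expr)   if ( __builtin_expect(!!(expr), 1) )
#define IF_UNLIKELY(expr) if ( __builtin_expect(!!(expr), 0) )
#else
// fallback for compilers without __builtin_expect : plain if, no hint
#define IF_LIKELY(expr)   if ( expr )
#define IF_UNLIKELY(expr) if ( expr )
#endif

// hypothetical example : sum the values whose flag has any bit of mask set
int sum_flagged(const unsigned * flags, const int * vals, int count, unsigned mask)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        IF_LIKELY( flags[i] & mask )   // (flags[i] & mask) can be 4, 6, ... ; !! turns it into 0 or 1
        {
            sum += vals[i];
        }
    }
    return sum;
}
```

The hint changes code layout only, never the result, so it's safe to flip it back and forth while timing.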
// these two instructions are much faster
qword vec_item_addr = si_a( vec_fastDecodeTable , vec_peek );
qword vec_item = si_rotqby(vec_item_data,vec_item_addr);
// than this one !?
qword vec_item = si_rotqby(vec_item_data,vec_peek);
Part of the solution to that is to compile with "-mdual-nops=1" which lets the compiler insert nops to fix dual issue alignment. The =1 was the best for me,
but again YMMV. BTW one problem with the SPU is that there is code size pressure (just to fit in the 256k),
and all these nops do make the code size bigger.
4370: 42 7f ff ad ila $45,65535 # ffff
4374: 38 8d c8 33 lqx $51,$16,$55 // load quad for codelen
4378: 18 0d c8 36 a $54,$16,$55 // 54 = address of codelen
437c: 18 0d db b5 a $53,$55,$55 // 53 = peek*2
4380: 1c 03 5b 34 ai $52,$54,13 // 52 = align byte for codelen
4384: 38 83 9a b0 lqx $48,$53,$14 // load qw for table[peek]
4388: 18 03 9a b2 a $50,$53,$14 // 50 = address of symbol
438c: 3b 8d 19 af rotqby $47,$51,$52 // rotate to get byte for codelen
4390: 1c 03 99 31 ai $49,$50,14 // 49 = adjust to get word symbol
4394: 14 3f d7 ac andi $44,$47,255 // &= 0xFF to get byte 44 = codelen
4398: 3b 8c 58 2e rotqby $46,$48,$49 // rot by to get symbol
439c: 3b 0b 06 2b rotqbi $43,$12,$44 // use
43a0: 08 03 56 0d sf $13,$44,$13 // use bits
43a4: 18 2b 57 07 and $7,$46,$45 // $7 = symbol , 45 = ffff
43a8: 39 8b 15 8c rotqbybi $12,$43,$44 // use
is faster than this :
4360: 0f 60 d7 2c shli $44,$46,3 // *= 8 to make table address
4364: 38 8b 07 aa lqx $42,$15,$44 // load qw for table
4368: 18 0b 07 ab a $43,$15,$44 // 43 = address of table item
436c: 3b 8a d5 29 rotqby $41,$42,$43 // rotate qword to get two dwords at top
4370: 08 02 d4 8b sf $11,$41,$11 // use
4374: 3b 0a 45 27 rotqbi $39,$10,$41 // use
4378: 3f 81 14 87 rotqbyi $7,$41,4 // rotate for symbol
437c: 39 8a 53 8a rotqbybi $10,$39,$41 // use
int x,y;
loop {
if ( x <= y )
}
it will *sometimes* do the right thing (the right thing is to make x and y qwords and do all the int ops on the [0] element).
However, sometimes it decides that they aren't important enough and it will leave them as ints on the stack, in which case
you get loads & shuffles to fetch them.
vec_int4 x,y;
loop {
if ( x[0] <= y[0] )
}
would do the trick - you're telling the compiler that I want them to be qwords, but I only use the top int part.
Nope, that doesn't work, in fact it often makes it *much worse*. In theory, the compiler should be able to tell
that I am only ever touching x[0], so when I do
x[0] ++
it should be able to know that it can go ahead and increment the *whole* vector x with a plain old add (eg. also increment x[1],2,3),
but it doesn't always do that, and you wind up getting instructions to extract the portion you're working on.
Sigh.
qword x,y;
loop {
qword comp = si_clgt( x, y );
if ( comp[0] )
}
but the problem with that is that you now have to be very careful about what instructions you choose to generate
for your operations - you can often be slower than the plain C using intrinsics because you aren't thinking about scheduling and pairing.
Also, once you start using intrinsics, you can't use too much of the [0] form of access any more because of the above note.
08-02-10 | Work
Work is a compulsion. I've been working way too much lately, it's hurting my back and my shoulder, making me depressed. There's no
real pressure for me to do it, all the pressure comes from myself. For one thing I do feel like I need to get a lot of things done really
quick. First I have to finish this optimization / cross-platform shit I'm doing, then I want to get my threading stuff cross-platform and
tested better, then I need to get back to video and finish some things. I feel like I really need to finish this video stuff and I have
to do it fast.
08-02-10 | Java will be faster than C
I think it's quite likely that in the next 10 years, Java and C# programs will be faster than C/C++ programs. The languages are simply
better - cleaner, more well defined, more precise in their operation, and most importantly - much easier to parallelize. C/C++ is just
too hard to make parallel for most tasks, it's not worth it for the average programmer. But Java/C# are very easy.
08-02-10 | SPU Developing
Programming for the SPU might be really fun if it didn't take like 10 minutes to get the debugger loaded.
07-31-10 | GCC Scheduling Barrier
When implementing lock-free threading, you sometimes need a compiler scheduling barrier, which is weaker
than a CPU instruction scheduling barrier, or a cache temporal ordering memory barrier.
__asm__ volatile("")
but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec).
What I believe it does do is it splits the "basic blocks"
for compilation, but then after initial optimization there's another merge pass where basic blocks are
combined and they can in fact then schedule against each other.
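For reference, the usual GCC idiom for a compiler-only barrier adds a "memory" clobber to the empty asm, which tells the compiler that memory may have been touched so it can't cache or move memory accesses across it. A sketch, assuming GCC-style inline asm with a C++11 fallback; COMPILER_BARRIER, g_payload, g_ready and publish are made-up names, and none of this emits a CPU fence:

```cpp
#include <atomic>

#if defined(__GNUC__)
// empty asm with a "memory" clobber : compiler may not move or cache memory accesses across it
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
// portable C++11 alternative : a fence that only constrains the compiler
#define COMPILER_BARRIER() std::atomic_signal_fence(std::memory_order_seq_cst)
#endif

int g_payload = 0;
int g_ready = 0;

void publish(int value)
{
    g_payload = value;
    COMPILER_BARRIER(); // keeps the compiler from sinking the payload store below the flag store
    g_ready = 1;
}
```

On a weakly ordered CPU you still need a real hardware barrier (or atomics with the right memory order) on top of this; the macro only pins down what the compiler emits.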
sync
lwarx
.. loop ..
stwcx
isync
which means it is a full Seq_Cst operation like a lock xchg on x86.
That would be cool if I actually knew it always was that on all platforms, but failing to clearly
document the guaranteed memory semantics of the __sync operations makes them dangerous.
(BTW as an aside, it looks like the GCC __sync intrinsics are generating isync for Acquire
unlike Xenon which uses lwsync in that spot).
char x;
x -= y;
x += z;
What if x = -100 , y = 90 , and z = 130 ? You should have -100 - 90 + 130 = -60. But your intermediate was -190
which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and it will all do
the right thing, but under gcc that is not guaranteed.
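If you want the 8-bit wrap without depending on what the compiler does with char, one option is to do the math in an unsigned 8-bit type, where the modulo-256 conversion is defined by the standard. A sketch (wrap8_sub_add is a made-up name; y and z are taken as int because 130 doesn't fit in a signed char anyway):

```cpp
#include <cstdint>

// computes (x - y + z) with every intermediate wrapped to 8 bits, then reads it back as signed
int wrap8_sub_add(int x, int y, int z)
{
    uint8_t u = static_cast<uint8_t>(x);      // -100 becomes 156 (mod 256)
    u = static_cast<uint8_t>(u - y);          // 156 - 90 = 66 , same as -190 mod 256
    u = static_cast<uint8_t>(u + z);          // 66 + 130 = 196
    return (u >= 128) ? int(u) - 256 : int(u); // 196 reads back as -60
}
```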
07-29-10 | Bothersome
I think I've been working too much and I'm stressed out and probably shouldn't be taking it out on the internet.
07-27-10 | 2d arrays
Some little things I often forget and have to search for.
int array[7][3];
into a dynamically allocated one, and you don't want to change any code. Do you know how?
It's :
int (*array) [3] = (int (*) [3]) malloc(sizeof(int)*3*7);
It's a little bit cleaner if you use a typedef :
typedef int array_t [3];
int array[7][3];
array_t array[7];
array_t * array = (array_t *) malloc(7*sizeof(array_t));
those are all equivalent (*).
void func(int array[7][3]) { }
void func(int (*array)[3]) { }
void func(int array[][3]) { }
function arg arrays are always passed by address.
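Putting those together, a small sketch: the same function takes either the fixed array or the malloc'ed one, and the indexing code doesn't change. The names sum_rows and demo_dynamic_2d are made up for the example.

```cpp
#include <cassert>
#include <cstdlib>

typedef int array_t[3];

// works unchanged for "int a[7][3]" and for the malloc'ed version
int sum_rows(int (*array)[3], int rows)
{
    int sum = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < 3; c++)
            sum += array[r][c];
    return sum;
}

int demo_dynamic_2d()
{
    array_t * array = (array_t *) malloc(7 * sizeof(array_t));
    assert( array != NULL );
    for (int r = 0; r < 7; r++)
        for (int c = 0; c < 3; c++)
            array[r][c] = r * 3 + c;   // same indexing syntax as int array[7][3]
    int sum = sum_rows(array, 7);
    free(array);
    return sum;
}
```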
07-26-10 | Virtual Functions
Previous post on x86/PPC made me think about virtual functions.
There's a "vtable" of function pointers associated with each class type. Individual objects have a pointer to their vtable. The advantage of
this is that vtables are shared, but the disadvantage is that you get an extra cache miss. What actually happens when you make a virtual
call?
(multiple and virtual inheritance ignored here for now).
vtable pointer = obj->vtable;
func_pointer = vtable->func
jump func_pointer
vtable pointer = obj->vtable;
The load of vtable pointer may be a cache miss, but that doesn't count against us since you are working on obj anyway you have to have one cache miss there.
func_pointer = vtable->func
Then you fetch the func pointer, which is maybe a cache miss (if this type of object has not been used recently).
jump func_pointer
Then you jump to variable, which may or may not be able to use branch prediction or fuck up your CPU's pipelining.
void DoStuff(int x);
as the public interface, and then hidden lower down, something like :
virtual void v_DoStuff(int x) = 0;
void DoStuff(int x) { v_DoStuff(x); }
this gives you a single controlled call-through point which you can then make non-virtual or whatever at some
point. (though, this screws up the next issue:)
class Base { virtual int Func() { return 0; } };
class A : public Base { virtual int Func() { return 1; } };
class B : public A { int m_data; };
class C : public Base { virtual int Func() { return 2; } };
void Test(A * obj)
{
obj->Func();
}
in this case the virtual call to Func() can be replaced with the direct call A::Func() by the compiler because no child of A overrides Func.
void DoStuff(int x)
{
if ( this is Actor ) Actor::DoStuff(x);
else Base::DoStuff(x);
}
the advantage of this is that you avoid a cache miss for the vtable, and you use a normal branch which can be predicted even
on shitty chips.
void DoStuff(int x)
{
if ( base is okay ) Base::DoStuff(x);
else v_DoStuff(x);
}
so we check a bool (could be in a bitfield) and if so we use direct call, else we call the virtual.
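A compilable sketch of that last variant. The flag name m_baseIsOkay, the Actor override and the int return values are made up so there's something to check; the point is only that the qualified call Base::v_DoStuff binds statically while the unqualified one goes through the vtable.

```cpp
class Base
{
public:
    explicit Base(bool baseIsOkay) : m_baseIsOkay(baseIsOkay) { }
    virtual ~Base() { }

    // public non-virtual entry point
    int DoStuff(int x)
    {
        if ( m_baseIsOkay ) return Base::v_DoStuff(x); // qualified name = direct call, no vtable lookup
        else return v_DoStuff(x);                      // normal virtual dispatch
    }

protected:
    virtual int v_DoStuff(int x) { return x + 1; }

private:
    bool m_baseIsOkay; // set by whoever constructs the object ; could be a bitfield
};

class Actor : public Base
{
public:
    Actor() : Base(false) { }
protected:
    virtual int v_DoStuff(int x) { return x * 2; }
};
```

The flag has to be kept honest by hand: if a derived class overrides v_DoStuff but constructs with true, the override silently never runs.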
void func(obj * o)
{
o->v1();
o->v2();
o->v3();
}
that does a bunch of virtual calls. Instead make a concrete call version t_func which does no virtuals :
template < typename T >
void t_func(T * o)
{
o->T::v1();
o->T::v2();
o->T::v3();
}
void func(obj * o)
{
dispatch to actual type of obj :
t_func < typeof(o) >( o );
}
the difficulty is the dispatching from func to t_func. C++ has no mechanism to do type-dependent dispatch other than the vtable
mechanism, which is annoying when you want to write a helper function and not add it to your class definition.
There are general solutions to this (see for example in cblib the PrefDispatcher which does this by creating a separate but parallel
class hierarchy to do dispatch) but
they are all a bit ugly. A better solution for most games is to either add func() to the vtable or just to know what concrete types you
have and do manual dispatch.
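The "know your concrete types and dispatch manually" option can be sketched like this. The type tag, the two classes A and B and the int return values are all hypothetical; it only works if the tag really identifies the most-derived type.

```cpp
enum ObjType { TYPE_A, TYPE_B };

struct Obj
{
    explicit Obj(ObjType t) : type(t) { }
    virtual ~Obj() { }
    virtual int v1() = 0;
    virtual int v2() = 0;
    const ObjType type; // hypothetical type tag, set once by the concrete class
};

struct A : Obj { A() : Obj(TYPE_A) { } virtual int v1() { return 1; }  virtual int v2() { return 2; } };
struct B : Obj { B() : Obj(TYPE_B) { } virtual int v1() { return 10; } virtual int v2() { return 20; } };

// concrete version : T::v1() is a qualified call, so it binds statically (and can inline)
template < typename T >
int t_func(T * o)
{
    return o->T::v1() + o->T::v2();
}

// one branch on the tag instead of one vtable lookup per call
int func(Obj * o)
{
    switch ( o->type )
    {
    case TYPE_A: return t_func( static_cast<A *>(o) );
    case TYPE_B: return t_func( static_cast<B *>(o) );
    }
    return 0;
}
```

If somebody later derives from A and overrides v1 without adding a new tag, this quietly calls the wrong function, which is the price of doing the dispatch by hand.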
07-26-10 | Jeebus
God damn my landlord is unreasonable. I hate all you people so fucking much. Just leave me alone please. I'm really sick of living in someone
else's house. I want to be able to do whatever I want with my home.
07-26-10 | Code Issues
How do I make a .c/.cpp file that's optional? eg. if you don't build it into your project, then you just don't get the functionality in it, but if you do, then it magically
turns itself on and gives you more goodies.
StupidFunction.h :
inline int StupidFunction()
{
return SOME_POUND_DEFINE;
}
Then in two different files I have :
A.cpp :
#define SOME_POUND_DEFINE (0)
#include "StupidFunction.h"
printf("%d\n",StupidFunction());
and :
B.cpp :
#define SOME_POUND_DEFINE (1)
#include "StupidFunction.h"
printf("%d\n",StupidFunction());
and what I get is that both printfs print the same thing (random whether it's 0 or 1 depending on build order).
#define V(x) x
#define CAT(a,b) V(a)V(b)
to concatenate two strings. Note that those strings can be *anything* , unlike the "##" operator which under GCC has
very specific annoying behavior in that it must take a valid token on each side and produce a valid token as output
(one and only one!).
/* Offset of member MEMBER in a struct of type TYPE. */
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
well, that's illegal under GCC for no damn good reason at all. So you have to do :
/* Offset of member MEMBER in a struct of type TYPE. */
#ifndef __GNUC__
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#else
/* The cast to "char &" below avoids problems with user-defined
"operator &", which can appear in a POD type. */
#define offsetof(TYPE, MEMBER) \
(__offsetof__ (reinterpret_cast
damn annoying. (code stolen from here ).
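For code that doesn't need its own macro, the standard offsetof from the library header avoids the whole issue and is usable in constant expressions. A small sketch, assuming a C++11 compiler for static_assert; the Item struct and value_offset are made up:

```cpp
#include <cstddef>

struct Item
{
    int  key;
    int  value;
    char tag;
};

// the standard macro is usable in constant expressions, eg. array sizes or static_assert
static_assert( offsetof(Item, key) == 0, "first member sits at offset 0" );

std::size_t value_offset()
{
    return offsetof(Item, value);
}
```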
The problem with this code under GCC is that a "type *" cannot be used in a constant expression.
template < typename t_key,t_key empty_val,t_key deleted_val >
class hash_table
{
that is, I hash keys of type t_key and I need a value for "empty" and "deleted" special keys for the client to
set. This works great (and BTW is much faster than the STL style of hash_map for many usage patterns), but on GCC
it doesn't work if t_key is "char *" because you can't template const pointer values. My work-around for GCC is
to take those template args as ints and cast them to t_key type internally, but that fucking blows.
07-21-10 | x86
x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms.
(addendum : I don't really mean x86 the ISA, I mean x86 as short hand for the family of modern processors that run x86; in particular
P-Pro through Core i7).
x = y;
is
mov eax, ecx;
while
x = s.array[i];
is
mov eax, [eax + ecx*4 + 48h]
.. stuff ..
this->m_obj->func();
.. stuff ..
this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it
can get rearranged to :
..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
vtable->func();
.. stuff ..
And as long as you have enough stuff to do in between it's no problem.
Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's
not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer
from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).
{
int x = 0;
for( ... one million .. )
{
.. do lots of stuff using x ..
x = blah;
}
external_func(&x);
}
the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right
before the function :
{
//int x; // x = r7
r7 = 0;
for( ... one million .. )
{
.. do lots of stuff using r7 ..
r7 = blah;
}
stack_x = r7
external_func(&stack_x);
}
Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :
{
//int x; // x = r7
r7 = 0;
stack_x = r7;
for( ... one million .. )
{
.. do lots of stuff using r7 - stores to stack_x every time ! ..
r7 = blah;
stack_x = r7;
}
external_func(&stack_x);
}
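A workaround that usually helps with this kind of thing is to do the loop on a separate local whose address is never taken, and copy it into x only right before the call. This is a sketch of the idea, not something verified on the compiler in question; external_func here is a trivial stand-in and x_reg is a made-up name.

```cpp
// hypothetical stand-in for the external function that needs the address of x
void external_func(int * px)
{
    *px += 1;
}

int run(int count)
{
    int x = 0;
    int x_reg = x;             // working copy whose address is never taken
    for (int i = 0; i < count; i++)
    {
        x_reg += i;            // .. do lots of stuff using x_reg ..
    }
    x = x_reg;                 // single store back right before the call
    external_func(&x);
    return x;
}
```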
07-18-10 | Mystery : Do Mutexes need More than Acquire/Release ?
What memory order constraints do Mutexes really need to enforce ?
y = 1;
Lock(x)
load x
x ++;
store x;
Unlock(x)
y = 2;
and obviously the load should not move out the top nor should the store move out the bottom. This just means the Lock must be Acquire
and the Unlock must be Release. However, the y=1 could move inside from the top, and the y=2 could move inside from the bottom, so in
fact the y=1 assignment could be completely eliminated.
Lock()
{
while ( ! CAS( lock , 0 , 1 , memory_order_seq_cst ) )
;
}
Unlock()
{
StoreRelease( lock , 0 );
// AtomicExchange( lock, 0 , memory_order_seq_cst );
}
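The same lock written against C++11 std::atomic, as a sketch: CAS maps to compare_exchange_strong and StoreRelease maps to a store with memory_order_release. The class and method names are made up, and there's no backoff or fairness handling here at all.

```cpp
#include <atomic>

class SpinLock
{
public:
    SpinLock() : m_lock(0) { }

    bool TryLock()
    {
        int expected = 0;
        return m_lock.compare_exchange_strong(expected, 1, std::memory_order_seq_cst);
    }

    void Lock()
    {
        while ( ! TryLock() )
        {
            // spin ; a real lock would back off or yield here
        }
    }

    void Unlock()
    {
        m_lock.store(0, std::memory_order_release); // the StoreRelease variant
        // alternative : m_lock.exchange(0, std::memory_order_seq_cst);
    }

private:
    std::atomic<int> m_lock;
};
```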
One issue that Joe mentions is the issue of fairness and notifying other processors. If you use the non-fencing Unlock, then you
aren't immediately giving other spinning cores a chance to grab your lock; you sort of bias towards yourself getting the lock again
if you are in high contention. IMO this is a very nasty complex issue and is a good reason not to roll your own mutexes; the OS
has complex mechanisms to prevent live locks and starvation and all that shit.
07-18-10 | Mystery : Does the Cell PPU need Memory Control ?
Is memory ordering needed on the PPU at all ?
// Pentium
#define AtomicStoreFence() __asm { sfence }
#define AtomicLoadFence() __asm { lfence }
// PowerPC
#define AtomicStoreFence() __asm { lwsync }
#define AtomicLoadFence() __asm { lwsync }
// But on the PPU
#define AtomicStoreFence()
#define AtomicLoadFence()
Now, first of all, I should note that his Pentium defines are wrong. So that doesn't inspire a lot of confidence, but Mike is more of a Cell expert
than an x86 expert. (I've noted before that thinking sfence/lfence are needed on x86 is a common mistake; this is just another example of the fact
that "You should not try this at home!" ; even top experts get the details wrong; it's pretty sick how many random kids on gamedev.net are rolling their own lock-free
queues and such these days; just say no to lock-free drugs mmmkay). (recall sfence and lfence only have any effect on non-temporal memory such as
write-combined or SSE; normal x86 doesn't need them at all).
#define AtomicStoreFence()
#define AtomicLoadFence()
is a bad idea, because the compiler can still reorder things on you. Better to do :
#define AtomicStoreFence() _CompilerWriteBarrier()
#define AtomicLoadFence() _CompilerReadBarrier()
07-18-10 | Mystery : Why no isync for Acquire on Xenon ?
The POWER architecture docs say that to implement Acquire memory constraint, you should use "isync". The Xbox 360 claims they use "lwsync"
to enforce Acquire memory constraint. Which is right? See :
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming
07-17-10 | Broken Games
Soccer is broken. There's too little scoring and it's too easy to play a very defensive style. The issue
with low scores is not just lack of excitement (that aspect is debatable), the big problem is that in a game
that's often 1:0 or 0:0 , it greatly increases the importance of bad calls and flukes. If the scores were
more like 7-5 , then 1 slip up wouldn't matter so much. Statistically, the "best" team loses in soccer more
often than any other major sport.
07-16-10 | Content
Where the fuck is all this content that we are supposed to have in this age of the internet and vast media ?
07-13-10 | Tech Blurg
How do I atomically store or load 128 bit data on x64 ?
load :
sse load 128
lfence
store :
sfence
sse store 128
or such. I'm not completely sure that's right though and I'm having trouble finding any information on this. What I need is load_acquire_128
and store_release_128. (yes I know MSVC has intrinsics for LoadAcq_128 and StoreRel_128, but those are only for Itanium). (BTW a lot of people mistakenly
think they need to use lfence or sfence with normal code; no no, those are only for SSE and write combined memory).
07-12-10 | Corporate Inequality
One of the things that bothers me is that all the corporations I deal with basically have free rein to fuck me, and I have nothing I can do back.
If I ever do anything wrong, they charge me fees, and they can fuck up severely and I get nothing.
07-12-10 | My Nifty P4
What's new in MyNifty :
no stall for down networks
catch lots more cases that need to do checkouts (especially for the vcproj)
don't check files for being read-only before trying p4 edit (lets you fix mistakes)
don't check files for being in project before doing p4 edit (lets you edit source controlled files not in your project)
C:\Documents and Settings\charlesb\Application Data\Microsoft\MSEnvShared\Addins\
(you may want to save copies of the originals).
NiftyPerforce : enable & startup
"Allow Editing of Read only files " = yes !
07-10-10 | Clipless pedals
I think clipless pedals are fucking terrible. Yes, they are slightly more efficient, and it does feel kind of nice to be
locked in, but for the average amateur cyclist, they are a big problem and way too many people use them.
07-10-10 | PowerPC Suxors
I finally have done my first hard core optimization for PowerPC and discovered a lot of weird quirks,
so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues
that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).
__speed_critical {
.. code ..
}
and then it should warn about microcode and load hit stores and whatever else within that scope.
for(int i=0;i < count;i++) { }
kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :
for(int i=0;;)
{
i++;
if ( i == count ) break;
}
that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :
for(int i=count;i--;) { }
appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it
wants.
x = ( y >> 4 ) & 0xFF;
will get turned into one instruction. Obviously this only works for constant shifts.
#define SELECT(condition,val_if_true,val_if_false) ( (condition) ? (val_if_true) : (val_if_false) )
and replace it with Mike's bit masky version for PPC.
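Mike's exact version isn't shown here, so the following is only a generic sketch of what a bit-mask select looks like: turn the condition into an all-ones or all-zeros mask and blend the two values with and/or, no branch. select_mask is a made-up name and it's fixed to 32-bit ints to keep it short.

```cpp
#include <cstdint>

// returns a if condition is true, b otherwise, without a branch
inline int32_t select_mask(bool condition, int32_t a, int32_t b)
{
    const uint32_t mask = 0u - static_cast<uint32_t>(condition); // 0xFFFFFFFF if true, 0 if false
    const uint32_t bits = (static_cast<uint32_t>(a) & mask) | (static_cast<uint32_t>(b) & ~mask);
    return static_cast<int32_t>(bits);
}
```

Both values get evaluated every time, so this only pays off when they're cheap and the branch is unpredictable.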
07-09-10 | Backspace
#define _WIN32_WINNT 0x0501
#define WIN32_LEAN_AND_MEAN
#include
07-08-10 | Remote Dev
I think the OnLive videogames-over-the-internet thing is foolish and unrealistic.
07-07-10 | Counterpoint
Dave Moulton is usually right on the money about issues bike related, but when he rants about
bicycle helmet laws , I think
he's off the money.
07-05-10 | Country Living
Probably because I've been reading Tanizaki (wonderful) recently, and also
because my neighborhood has turned into a construction yard as all the god
damn home-improvers have kicked into high gear for the summer, I have been
fantasizing a lot recently about living out in the country.
07-04-10 | Counterpoint 1
In which I reply to other people's blogs :
07-05-10 | Counterpoint 2
In which I reply to other people's blogs :
Object * ptr = GetObject( ID ) ; // checks object is alive
// !! in other thread Object is deleted !!
ptr->stuff(); // crash !
HWND w = Find Window Handle somehow
.. do stuff on w ..
// !! w is deleted !!
.. do more stuff on w .. // !! this is no good !
I have some Windows programs that snoop other programs and run into this issue,
and I have to wind up checking IsWindow(w) all over the place to tell if the
windows whose ID I'm holding has gone away. It's a mess and very unsafe (particularly
because in early versions of windows the ID's can get reused within moderate time spans, so you
actually might get a success from IsWindow but have it be a different window!).
ObjectPtr ptr = GetObject( ID ); // checks existence of weak ref
the weak ref to pointer mapping only returns a smart pointer, which ensures
it is kept alive while you have a pointer. This is just a form of GC of course.
Object owners use smart pointers. Of course you could just have a naked pointer or something, but the performance cost of using a
smart pointer here is nil and it just makes things more uniform which is good.
Weak references resolve to smart pointers.
Function calls take naked pointers. The caller must own the object so it doesn't die during the call. Note that this almost never
requires any thought - it is true by construction, because in order to call a function on an object, you had to get that object from
somewhere. You either got it by resolving a weak pointer, or you got it from its owner.
This is highly efficient, easy to use, flexible, and almost never has problems. The only way to break it is to intentionally do
screwy things like
object * ptr = GetObject( ID ).GetPointer();
CallFunc( ptr );
which will get a smart pointer, get the naked pointer off it, and let the smart pointer die.
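In C++11 terms the scheme maps pretty directly onto std::shared_ptr and std::weak_ptr. A sketch, where the Registry class and its int IDs are made up for the example: owners hold a shared_ptr, the ID table holds weak_ptrs, lookups hand back a shared_ptr (null if the object is gone), and functions take naked pointers.

```cpp
#include <map>
#include <memory>

struct Object
{
    int hits = 0;
    void stuff() { ++hits; }
};

typedef std::shared_ptr<Object> ObjectPtr;

// hypothetical ID -> weak reference table
class Registry
{
public:
    void Register(int id, const ObjectPtr & owner) { m_table[id] = owner; }

    // resolves the weak ref ; returns a null ObjectPtr if the object has died
    ObjectPtr GetObject(int id) const
    {
        std::map<int, std::weak_ptr<Object> >::const_iterator it = m_table.find(id);
        return ( it == m_table.end() ) ? ObjectPtr() : it->second.lock();
    }

private:
    std::map<int, std::weak_ptr<Object> > m_table;
};

// function calls take naked pointers ; the caller keeps the object alive
void CallFunc(Object * o)
{
    o->stuff();
}
```

Typical use is `ObjectPtr p = reg.GetObject(id); if ( p ) CallFunc( p.get() );` so the smart pointer outlives the call.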
07-03-10 | Length-Limitted Huffman Codes Heuristic
In the last post if you look at the comments you can find some comparison of optimal length limited vs. heuristic length limited.
while ( codeLen[ s ] < max && K > 1 )
{
codeLen[ s ] ++;
// adjust K for change in codeLen
K -= 2 ^ - codeLen[ s ]
}
while ( (K + 2^-codeLen[ s ]) <= 1 )
{
// adjust K for change in codeLen
K += 2 ^ - codeLen[ s ]
codeLen[ s ] --;
}
L=3: A B
L=2: A B {AB} C
L=1: A B {AB} C {AB|C}
At the point where we select {AB} in the L=1 list, A and B must already have occurred once so their length is already 1.
So {AB} means change both their lengths from 1 to 2; this adds them to the active set on the 2 list.
07-02-10 | Length-Limitted Huffman Codes
I have something interesting to write about Huffman decoders, but that will have to wait a bit.
Explicitly what we are trying to do is solve :
Cost = Sum[ L_i * C_i ]
C_i = count of ith symbol
L_i = huffman code length
given C_i, find L_i to minimize Cost
constrained such that L_i <= L_max
and
Sum[ 2 ^ - L_i ] = 1
(Kraft prefix constraint)
This is solved by construction of the "coin collector problem"
The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out
C_i is the numismatic value of the ith coin
L_i is the number of times the collector uses a coin of type i
so Cost = Sum[ L_i * C_i ] is his total cost.
For each value C_i, the coins have face value 2^-1, 2^-2, 2^-3, ...
If the coin collector pays out total face value of (n-1) , then he creates a Kraft correct prefix code
The coin collector problem is simple & obvious ; you just want to pay out from your 2^-1 value items ;
an item is either a 2^-1 value coin, or a pair of 2^-2 value items ; pick the one with lower numismatic value
The fact that this creates a prefix code is a bit more obscure
But you can prove it by the kraft property
If you start with all lengths = 0 , then K = sum[2^-L] = N
Now add an item from the 2^-1 list
if it's a leaf, L changes from 0 to 1, so K does -= 1/2
if it's a node, then it will bring in two nodes at a lower level
equivalent to two leaves at that level, so L changes from 1 to 2 twice, so K does -= 1/2 then too
so if the last list has length (2N-2) , you get K -= 1/2 * (2N-2) , or K -= N-1 , hence K = 1 afterward
-----------------------------------
BTW you can do this in a dynamic programming sort of way where only the active front is needed; has same
run time but less storage requirements.
You start at the 2^-1 (final) list. You ask : what's the next node of this list? It's either a symbol or
made from the first two nodes of the prior list. So you get the first two nodes of the prior list.
When you select a node into the final list, that is committed, and all its children in the earlier lists
become final; they can now just do their increments onto CodeLen and be deleted.
If you select a symbol into the final list, then the nodes that you looked at earlier stick around so you
can look at them next time.
07-02-10 | Bank Fraud
In the last month I've been the target of fraud twice. It's quite ridiculous how poor the security of our
electronic banking system is. To do an ACH out of someone's bank account, all you need is an account number. !? WTF !? No password,
nothing. In theory they require a signature, but in practice they don't actually check that. Furthermore, the signature only
authorizes someone to do ACH's - it doesn't specify a limit or a specific transaction! So once you authorize someone they can keep
running more transfers after the fact.
06-21-10 | RRZ PNG-alike notes
Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few
cases that you cannot ever catch with a heuristic.
do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3
if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth
filters, and simple LZ matchers.
06-20-10 | PNG Comparo
Okay this is kind of bogus and I thought about not even posting it, but come on, you need the closure right? Everyone likes a comparo.
First of all, why this is bogus : 1. PNG just cannot compete without better LOCO support. Here I am allowing the LOCO files, but
they were not advpng/pngout optimized , and of course they're not really PNGs. 2. I have that crippling 256k chunking on my format.
I guess if I wanted to do a fair comparo I should make a version of my shit which doesn't have LOCO and also remove my 256k chunking
and compare that vs. no-loco PNG. God dammit now I have to do that.
RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco
name                            RRZ heuristic   RRZ best   png best
ryg_t.yello.01.bmp                     392963     359799     373573
ryg_t.train.03.bmp                      35195      31803      34260
ryg_t.sewers.01.bmp                    421779     420091     429031
ryg_t.font.01.bmp                           -      26911      22514
ryg_t.envi.colored03.bmp                    -      95394      97203
ryg_t.envi.colored02.bmp                    -      54662      55036
ryg_t.concrete.cracked.01.bmp               -     299963     309126
ryg_t.bricks.05.bmp                         -     370459     375964
ryg_t.bricks.02.bmp                         -     455203     465099
ryg_t.aircondition.01.bmp                   -      20522      20320
ryg_t.2d.pn02.bmp                           -      22147      24750
kodak_24.bmp                           559564     558085     572591
kodak_23.bmp                           479240     478041     483865
kodak_22.bmp                           574252     571301     580566
kodak_21.bmp                           549865     545584     547829
kodak_20.bmp                                -     429556     439993
kodak_19.bmp                                -     541424     545636
kodak_18.bmp                                -     618961     631000
kodak_17.bmp                           508672     504961     510131
kodak_16.bmp                                -     466277     481190
kodak_15.bmp                           506728     504213     516741
kodak_14.bmp                           581520     580301     590108
kodak_13.bmp                                -     677041     688072
kodak_12.bmp                                -     465297     477151
kodak_11.bmp                                -     510200     519918
kodak_10.bmp                                -     497400     500082
kodak_09.bmp                                -     491896     493958
kodak_08.bmp                           610524     610505     611451
kodak_07.bmp                           473500     473233     486421
kodak_06.bmp                                -     534037     540442
kodak_05.bmp                           624368     623341     638875
kodak_04.bmp                                -     522061     532209
kodak_03.bmp                                -     437765     464434
kodak_02.bmp                                -     500964     508297
kodak_01.bmp                           586328     582389     588034
bragzone_TULIPS.bmp                         -     565997     591881
bragzone_SERRANO.bmp                        -     103462      96932
bragzone_SAIL.bmp                      613845     609953     623437
bragzone_PEPPERS.bmp                        -     366611     376799
bragzone_MONARCH.bmp                   508096     507937     526754
bragzone_LENA.bmp                           -     467103     474251
bragzone_FRYMIRE.bmp                        -     241899     230055
bragzone_clegg.bmp                          -     444736     483056
06-20-10 | KZip
While futzing around on PNG I discovered that there is this whole community of people who are hard-core optimizing PNG's for distribution
sizes. They have these crazy tools like pngcrush/optipng/advancecomp/pngout.
06-20-10 | Searching my PNG-alike
Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco,
LZ min match len, and LZ compress "level" (level = how hard it looks for matches).
0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now
name 0 Loco Normal N+L Adaptive A+L optimized LZs 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376
06-20-10 | Some notes on PNG
Before we go further, let's have a look at PNG for reference.
Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs
Zlib strategy "FILTER" , which just means minMatchLength = 4
Filter 0..5
Zlib Strategy ; they have a few hard-coded but really it comes down to min match length
Zlib compression level (oddly highest level is not always best)
PNG scan options (progressive, interleaves, bottom up vs top down, etc.)
It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ
compress, you cannot tell from statistics.
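The "minimum sum of abs" rule for the adaptive filter can be sketched roughly like this. This is only an illustration, assuming one byte per pixel and the five standard PNG filter types (None, Sub, Up, Average, Paeth); the names ChooseRowFilter, FilterByte and PaethPredict are made up for the example and are not from any PNG library.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Paeth predictor: a = left, b = up, c = up-left.
static int PaethPredict(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = std::abs(p - a);
    int pb = std::abs(p - b);
    int pc = std::abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Residual byte for PNG filter types 0..4 (assumes 1 byte per pixel).
static uint8_t FilterByte(int filter, int x, int a, int b, int c)
{
    int pred = 0;
    switch (filter)
    {
    case 1: pred = a; break;                      // Sub
    case 2: pred = b; break;                      // Up
    case 3: pred = (a + b) >> 1; break;           // Average
    case 4: pred = PaethPredict(a, b, c); break;  // Paeth
    default: pred = 0; break;                     // None
    }
    return (uint8_t)(x - pred);
}

// Returns the filter in [0..4] whose residuals, read as signed bytes,
// have the smallest sum of absolute values. Ties go to the lower filter.
int ChooseRowFilter(const std::vector<uint8_t>& row, const std::vector<uint8_t>& prior)
{
    int best = 0;
    long bestSum = -1;
    for (int f = 0; f <= 4; f++)
    {
        long sum = 0;
        for (size_t i = 0; i < row.size(); i++)
        {
            int a = (i > 0) ? row[i - 1] : 0;
            int b = prior[i];
            int c = (i > 0) ? prior[i - 1] : 0;
            sum += std::abs((int)(int8_t)FilterByte(f, row[i], a, b, c));
        }
        if (bestSum < 0 || sum < bestSum) { bestSum = sum; best = f; }
    }
    return best;
}
```

Note that this picks a filter from residual statistics alone, which is exactly the kind of test that cannot notice when "no filter" would win after LZ.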
name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056
06-20-10 | Filters for PNG-alike
The problem :
case 0: // 0 predictor is the same as NONE
return 0;
case 1: // N
return N;
case 2: // W
return W;
case 3: // gradient // this tends to win on synthetic images
{
int pred = N + W - NW;
pred = RR_CLAMP_255(pred);
return pred;
}
case 4: // average
return (N+W)>>1;
case 5: // grad skewed towards average // this tends to win on natural images - before type 12 took over anyway
{
int pred = ( 3*N + 3*W - 2*NW + 1) >>2;
pred = RR_CLAMP_255(pred);
return pred;
}
case 6: // grad skewed even more toward average
{
int pred = ( 5*N + 5*W - 2*NW + 3) >>3;
pred = RR_CLAMP_255(pred);
return pred;
}
case 7: // grad skewed N
{
int pred = (2*N + W - NW + 1)>>1;
pred = RR_CLAMP_255(pred);
return pred;
}
case 8: // grad skewed W
{
int pred = (2*W + N - NW + 1)>>1;
pred = RR_CLAMP_255(pred);
return pred;
}
case 9: // new
{
int pred = (3*N + 2*W - NW + 1)>>2;
pred = RR_CLAMP_255(pred);
return pred;
}
case 10: // new
{
int pred = (2*N + 3*W - NW + 1)>>2;
pred = RR_CLAMP_255(pred);
return pred;
}
case 11: // new
return (N+W + 2*NW + 1)>>2;
case 12: // ClampedGradPredictor
{
int grad = N + W - NW;
int lo = RR_MIN3(N,W,NW);
int hi = RR_MAX3(N,W,NW);
return rr::clamp(grad,lo,hi);
}
case 13: // median
return Median3(N,W,NW);
case 14:
{
// semi-Paeth
// but only pick N or W
int grad = N + W - NW;
// pick closer of N or W to grad
if ( RR_ABS(grad - N) < RR_ABS(grad - W) )
return N;
else
return W;
}
perform like this :
name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
ryg_t.train.03.bmp 56347 76447 70047 80619 76187 79811 77519 79771 76979 77399 76279 80383 75739 75875 74635
ryg_t.sewers.01.bmp 604843 475991 496067 468727 466363 455359 458111 458727 469203 456399 462727 496691 452663 495931 464915
ryg_t.font.01.bmp 42195 54171 53599 56135 73355 66955 70163 60199 63815 69499 70099 79983 59931 61219 56735
ryg_t.envi.colored03.bmp 316007 147315 136255 158063 139231 154551 148419 155379 149667 148683 143415 145871 148099 144851 153311
ryg_t.envi.colored02.bmp 111763 98835 87711 111367 93123 108867 100003 106275 95927 96391 97383 96907 99275 94371 100707
ryg_t.concrete.cracked.01.bmp 493143 416307 449899 426755 403779 402635 400923 408019 420291 398215 408299 437371 409055 440371 424087
ryg_t.bricks.05.bmp 568755 534267 524243 514563 509375 493923 497639 505267 499007 501079 497419 551687 499855 547935 511595
ryg_t.bricks.02.bmp 684515 590955 577207 557155 560391 537987 545019 551251 544555 549543 544635 602115 545919 600139 557659
ryg_t.aircondition.01.bmp 41595 33215 34279 33535 33199 32695 32683 32619 33207 32503 32987 35795 31507 35171 32243
ryg_t.2d.pn02.bmp 25419 27815 28827 29499 32203 32319 32443 29679 32531 32315 32575 34183 28307 31247 27919
ryg_gemein.bmp 797 801 173 153 565 585 553 521 205 633 529 813 153 825 141
kodak_24.bmp 843281 735561 760793 756509 744021 736037 735797 735989 753709 733105 741801 785077 719185 777905 727221
kodak_23.bmp 865985 593977 616885 617105 591501 592561 588837 595949 609437 586121 595793 620657 585453 617825 601713
kodak_22.bmp 919613 732613 756249 741441 719973 712757 711245 721365 732337 709621 719157 760065 706517 763481 724053
kodak_21.bmp 797741 753057 691653 746033 713469 716065 710321 737017 713845 720525 704129 744785 701733 744633 709445
kodak_20.bmp 624901 563261 538069 571633 540981 548329 541773 559437 548549 546565 540337 559545 539289 557105 545901
kodak_19.bmp 847781 724541 728337 732717 718309 712989 712217 716593 729505 712893 715053 752173 689953 756849 698881
kodak_18.bmp 958657 808577 820253 820429 783373 783089 777377 794965 799449 779117 782541 821717 783005 821025 802569
kodak_17.bmp 782061 664829 655845 685273 651025 657225 650477 666625 666645 653117 652293 680813 644849 675045 656597
kodak_16.bmp 664401 696493 595101 673961 646981 647441 643353 675157 640833 658721 631509 684517 628837 682217 629293
kodak_15.bmp 779289 639009 660137 677649 643985 651729 644809 650525 666229 641669 650377 673545 635441 662389 645817
kodak_14.bmp 912149 800681 742901 787273 754233 754325 749641 778777 758045 760761 744001 795029 743909 791341 755425
kodak_13.bmp 1015697 932969 898929 941109 890989 900533 890221 918073 905705 897865 888113 925757 894461 922701 907029
kodak_12.bmp 690537 641517 583717 641973 613461 617881 612461 635253 617913 620833 607177 644837 598393 638889 602213
kodak_11.bmp 773937 725081 674657 714169 697533 691425 690389 708197 695109 698369 685965 738609 671365 739385 676597
kodak_10.bmp 766701 650637 648349 665309 648997 641961 641481 651741 651785 643369 642133 685733 623437 684949 629193
kodak_09.bmp 703689 640025 628161 659637 635201 637657 633881 644193 647817 636281 634105 662125 615393 662801 621873
kodak_08.bmp 1026037 862013 888589 835329 884329 841405 859717 839193 861917 856097 865405 942073 792833 950005 792201
kodak_07.bmp 725525 654925 605753 629757 633945 622693 627053 643069 621433 634965 620645 673361 597821 666629 604337
kodak_06.bmp 796077 786789 680841 750253 731117 723657 722981 753989 714849 739405 709825 772685 703417 773613 705529
kodak_05.bmp 981245 850313 836165 845001 815801 814429 810709 832337 825073 815897 811729 852385 808993 853785 828145
kodak_04.bmp 845721 672369 679245 692469 659981 663625 657621 670629 678165 658845 661957 691945 653629 688253 670325
kodak_03.bmp 656841 605525 559081 619713 581713 596313 587633 607841 600725 592761 584449 601901 579221 592745 587409
kodak_02.bmp 761633 666601 663837 677385 648117 649785 644985 661253 662077 647553 647545 682089 640501 687305 654309
kodak_01.bmp 896361 850353 811873 828397 822169 807197 810217 827621 815245 818249 807665 869829 781173 875265 788725
bragzone_TULIPS.bmp 1052229 729141 757129 688121 703317 671417 683069 683653 696193 682109 691257 756241 675637 761477 704293
bragzone_SERRANO.bmp 150898 172286 160718 193602 255462 274994 278094 212478 254902 274478 277418 285990 171918 213214 167058
bragzone_SAIL.bmp 1004581 865725 826729 834385 811121 798037 798301 819669 807221 805785 796761 862209 795581 868857 813521
bragzone_PEPPERS.bmp 712087 469243 456511 438211 442695 428627 433227 437215 436595 436059 434359 471587 426191 474595 439575
bragzone_MONARCH.bmp 907737 670825 671745 644513 638201 623613 627513 640213 640485 630941 630941 676401 624925 679553 652849
bragzone_LENA.bmp 745299 484027 513747 506631 478875 481911 477131 481819 494431 474343 483007 498519 478559 498799 489203
bragzone_FRYMIRE.bmp 342567 432867 399963 481259 608567 664895 670755 544963 612075 666279 668187 693467 433755 456899 421263
bragzone_clegg.bmp 806117 691593 511625 518489 1208409 1105433 1167561 733217 951773 1158429 1165953 1289969 502845 1286365 505093
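For reference, filter 12 (the ClampedGradPredictor from the switch above) can be pulled out as a standalone function. A minimal sketch: std::min / std::max stand in for RR_MIN3 / RR_MAX3 / rr::clamp, whose definitions are not shown here, and the name ClampedGradPredict is only for this example.

```cpp
#include <algorithm>

// Gradient predictor N + W - NW, clamped into the range spanned by the
// three neighbors so it can never overshoot all of them.
int ClampedGradPredict(int N, int W, int NW)
{
    int grad = N + W - NW;
    int lo = std::min(N, std::min(W, NW));
    int hi = std::max(N, std::max(W, NW));
    return std::max(lo, std::min(grad, hi));
}
```

On a flat or smoothly sloped neighborhood this returns the plain gradient; near an edge, where the gradient would overshoot, it falls back to one of the neighbors.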
06-20-10 | Struct annoyance
It's a very annoying fact of coding life that references through a struct are much slower than loading the
variable out to temps. So for example this is slow :
void rrMTF_Init(rrMTFData * pData)
{
for (int i = 0; i < 256; i++)
{
pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
}
}
and you have to do this by hand :
void rrMTF_Init(rrMTFData * pData)
{
U8 * pMTFimage = pData->m_MTFimage;
U8 * pMTFmap = pData->m_MTFmap;
for (int i = 0; i < 256; i++)
{
pMTFimage[i] = pMTFmap[i] = (U8) i;
}
}
Unfortunately there are plenty of cases where this actually makes a significant difference.
06-20-10 | Windows 7 Niggles
1. When you edit a file name in explorer it by default doesn't include the extension. I constantly do "F2 , ctrl-C" to grab
a file name, and then find later that I have the fucking file name without extension. .URL files are the worst, they
for some reason just absolutely refuse to show me their extension. God fucking dammit, LEAVE SHIT ALONE.
06-19-10 | NiftyP4 and Timeout
I've fiddled with the NiftyP4 code and have it almost working perfectly for my needs (thanks Jim!). One
small niggle remains :
Source Checkout - niftyplugins - Project Hosting on Google Code
Resources about Visual Studio .NET extensibility
Many Visual studio events missing from Events.SolutionEvents
HOWTO Getting Project and ProjectItem events from a Visual Studio .NET add-in.
HOWTO Capturing commands events from Visual Studio .NET add-ins
Binding to Visual Studio internal Commands from add-in
06-19-10 | Ranting Wussies
The disappointing thing about most ranters is that when you actually get them in person with the subject of their mockery, they
turn into these polite normal boring reasonable conciliatory people who see the validity of the other side's position. You see
Jon Stewart do this a lot, of course BSNYC and his ilk do it, etc. Stick to your fucking guns you woosies.
06-19-10 | Amazon censors product reviews
It's hard for me to get my hackles too up about this, but it's a topic that's worth noting over and over, so I'm making myself write about it.
# Details about availability, price, or alternate ordering/shipping
#
# Comments on other reviews visible on the page (because page visibility is subject to change without notice)
Off-topic information:
* Feedback on the seller, your shipment experience or the packaging
* Feedback about typos or inaccuracies in our catalog or product description
Granted, some amount of censorship is necessary or you would be flooded by spam and libel, but this goes rather too far.
Not being allowed to write a review when what you get is not what they said you would get is a pretty big problem.
I can also understand them not wanting reviews that say "this is available cheaper at Walmart" but all my reviews that mentioned price
were things like "listing says retail price is $100, in fact retail price is around $40". It's also simply not true that they objectively
enforce those standards; you will see plenty of positive reviews around Amazon that mention price in a positive way, things like
"what a great product for only $10" - that review is allowed through, but if it's a negative review like "product is not worth $10" it
won't be allowed through; all my reviews that mentioned price and were censored were negative reviews. For example one of my reviews that
was blocked was about a listing that showed a picture of a box full of cleaning pads and showed a list price of $40; actual shipped product
was one cleaning pad (not a box full) - review deleted.
06-19-10 | Verio spam filters outgoing mail
Verio, the host for cbloom.com , filters your outgoing emails through some spam filter and rejects them. You, the paying
customer have no control over this, you cannot disable it or add whitelists or anything. In its wisdom it decides that
emails like this are spam :
Brownies have been placed in the kitchen. Make them vanish, please.
After much emailing with Verio tech support I get these copy-paste responses :
----------------------------------------------------------------------
Outbound spam filters
----------------------------------------------------------------------
We apologize for any inconvenience you might have experienced with this issue.
In response to your concern. The outgoing mails might have some contents that is not allowed by your filters.
and when I ask if there's any way to disable or get around it :
************************************************************************
Outbound spam filters
************************************************************************
I am sorry but no. There is no way to whitelist outgoing messages, we have no support for getting around our internal, outgoing spam filters
06-19-10 | Gmail doesn't let you send mails to yourself
I always CC myself when I write an email that I think is interesting (especially with my multiple email personalities, I often CC
from one to another). I've been confounded for a while that when I use gmail, it seems to not deliver the emails to myself.
Not only that, but it gives no error message or delivery failure or anything, it just silently doesn't send them.
06-17-10 | Suffix Array Neighboring Pair Match Lens
I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match
lengths. You can then find the match length between any two indexes (i,j) as the MIN of all pair match lengths
between them. I wrote about this before when I wrote about the LZ Optimal parse strategy, but
in order to find all matches against any given suffix, you find the position of that suffix in the sort order,
then walk to neighbors and keep doing MIN() with the pair match lengths.
from :
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park
Algorithm GetHeight
input: A text A and its suffix array Pos
for i:=1 to n do
    Rank[Pos[i]] := i
od
h:=0
for i:=1 to n do
    if Rank[i] > 1 then
        j := Pos[Rank[i]-1]
        while A[i+h] = A[j+h] do
            h := h+1
        od
        Height[Rank[i]] := h
        if h>0 then h := h-1 fi
    fi
od
On "mississippi" :
At index 1 :
ississippi
matches :
issippi
with length 4.
Now step forward one.
At index 2 :
ssissippi
I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one)
so I can start my string compare at 3
static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
int n = sortWindowLen;
const U8 * A = sortWindowBase;
int h = 0;
for(int i=0; i< n ;i ++)
{
int sort_i = sortIndexInverse[i];
if ( sort_i > 0 )
{
int j = sortIndex[sort_i-1];
int h_max = sortWindowLen - RR_MAX(i,j);
while ( h < h_max && A[i+h] == A[j+h] )
{
h++;
}
sortSameLen[sort_i-1] = h;
if ( h > 0 ) h--;
}
}
}
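Here is a small self-contained version that can be run on "mississippi" to check the numbers. It is a sketch under a few assumptions: the suffix sort is a naive std::sort (fine for a demo, not for real windows), and the names BuildSuffixArray, BuildPairMatchLens and MatchLenBetween are invented for the example. MatchLenBetween shows the "MIN of all pair match lengths between them" step for two positions in the sort order.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix sort: sa[k] = start index of the k-th smallest suffix.
std::vector<int> BuildSuffixArray(const std::string& s)
{
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// pairLen[k] = match length between the suffixes at sa[k] and sa[k+1].
// Walks the text in index order and reuses h-1 from the previous suffix.
std::vector<int> BuildPairMatchLens(const std::string& s, const std::vector<int>& sa)
{
    int n = (int)s.size();
    std::vector<int> rank(n), pairLen(n > 0 ? n - 1 : 0);
    for (int k = 0; k < n; k++) rank[sa[k]] = k;
    int h = 0;
    for (int i = 0; i < n; i++)
    {
        int r = rank[i];
        if (r > 0)
        {
            int j = sa[r - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            pairLen[r - 1] = h;
            if (h > 0) h--;
        }
    }
    return pairLen;
}

// Match length between two different sort positions a and b :
// the minimum of the neighboring-pair lengths between them.
int MatchLenBetween(const std::vector<int>& pairLen, int a, int b)
{
    if (a > b) std::swap(a, b);
    return *std::min_element(pairLen.begin() + a, pairLen.begin() + b);
}
```

For "mississippi" the sort order is 10,7,4,1,0,9,8,6,3,5,2 and the pair lengths come out as 1,1,4,0,0,1,0,2,1,3; the 4 is the "ississippi" / "issippi" match from the walkthrough above.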
06-17-10 | Friday Code Review
Today I bring you a piece of my own code that had a nasty bug. This is from my buffered file reader, which I did some fiddling on a while ago
and didn't test hard enough (there are now so many ways to use my code that it's really hard to make sure I test all the paths, which is a
rant for another day).
S64 OodleFile_ReadLarge(OodleFile * oof,void * into,S64 count)
{
// use up the buffer I have :
U8 * into_ptr = (U8 *)into;
S64 avail = (S64)( oof->ptrend - oof->ptr );
S64 useBuf = RR_MIN(avail,count);
memcpy(into_ptr,oof->ptr,(size_t)useBuf);
oof->ptr += useBuf;
into_ptr += useBuf;
count -= useBuf;
rrassert( oof->ptr <= oof->ptrend );
if ( count > 0 )
{
//OodleFile_Flush(oof);
rrassert( oof->ptr == oof->ptrend );
S64 totRead = 0;
while( count > 0 )
{
S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
S32 read = oof->vfile->Read(into_ptr,curRead);
if ( read == 0 )
break; // fail !
count -= read;
into_ptr += read;
}
return useBuf + totRead;
}
else
{
return useBuf;
}
}
S64 totRead = 0;
while( count > 0 )
{
S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
S32 read = oof->vfile->Read(into_ptr,curRead);
count -= read;
into_ptr += read;
// totRead += read; // doh !!
}
return useBuf + totRead;
That's benign enough, but I think this is actually a nice simple example of a habit I am trying to get into :
never use another variable when one of the ones you have will do
while( count > 0 )
{
S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
S32 read = oof->vfile->Read(into_ptr,curRead);
count -= read;
into_ptr += read;
}
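// into_ptr started at "into" and was already stepped past the useBuf bytes,
// so the pointer difference is the full total (adding useBuf again would double count) :
return (S64)(into_ptr - (U8 *)into);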
Don't step the pointer and also count how far it steps; do one or the other, and compute one from the other. I keep finding this over
and over, and it's one of the few ways that I still consistently make bugs : if you have multiple variables that need to be in sync, it's
inevitable that they will get out of sync over time and you will have bugs.
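A stripped-down, runnable version of the same idea, as a sketch only: MockFile is a hypothetical stand-in for oof->vfile that serves bytes from memory, and ReadLarge and maxIo are names made up for the example. The only running state is the write pointer; the byte total is computed from it at the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical file stand-in: hands out bytes from a memory block.
struct MockFile
{
    const uint8_t * data;
    size_t size;
    size_t pos;

    // Returns the number of bytes actually read; 0 means end of data.
    size_t Read(void * into, size_t count)
    {
        size_t n = std::min(count, size - pos);
        memcpy(into, data + pos, n);
        pos += n;
        return n;
    }
};

// Reads up to count bytes in pieces of at most maxIo bytes.
// No separate "total read" counter to forget: the total is derived
// from how far into_ptr has moved.
size_t ReadLarge(MockFile * f, void * into, size_t count, size_t maxIo)
{
    uint8_t * into_ptr = (uint8_t *)into;
    while (count > 0)
    {
        size_t got = f->Read(into_ptr, std::min(count, maxIo));
        if (got == 0)
            break; // end of data
        count    -= got;
        into_ptr += got;
    }
    return (size_t)(into_ptr - (uint8_t *)into);
}
```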
06-17-10 | Investment
I bought a Hattori knife a little while ago. It requires a lot of babying - it has some carbon steel so you have to keep it dry,
the steel is quite soft so you have to make sure to only let it touch other soft things, you have to sharpen it with a whetstone
every so often. It's an amazing knife, but you really have to invest a lot into it. The funny thing is that the fact that it
requires so much work actually makes you more fond of it. It's like the knife needs you, it's great and that greatness comes from
the work you put into it.
06-15-10 | The end of ad-hoc computer programming
I think we are in an interesting period where the age of the "guerilla" "seat of the pants" , eg. ad-hoc, programmer is ending.
For the last 30 years, you could be a great programmer without much CS or math background if you had good algorithmic intuition.
You could figure out things like heuristics for top-down or bottom-up tree building. You could dive into problems without doing research
on the prior art and often make up something competitive. Game programmers mythologize these guys as "ninjas" or "hackers" or various other
terms.
06-15-10 | Top down / Bottom up
"Top down bottom up" sounds like a ZZ-top song, but is actually different ways of attacking tree-like problems.
Today I am thinking about how to decide where to split a chunk of data for Huffman compression.
A quick overview of the problem :
X O
O X
Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you
find a perfect tree.
06-15-10 | Neighbors
I'm really fucking sick of having neighbors. I hate when one neighbor smokes next door and the smell comes in my window. I hate when their
dog yaps. I hate when they run the lawn-mower or leaf blower. I hate the fucking hippy hypocrites who give me the passive-aggressive cold
shoulder. I hate the constant home renovation cock munchers who like to hammer a little bit each day.
06-12-10 | Some Bike Notes
I found this interesting article on
Optimizing Tire Pressure Drop (PDF) .
Now I certainly do agree that too many people in the quest for "performance" (both on cars and bikes) think
that stiff ride = fast. With that in mind they over-inflate their bike tires, and use too narrow bike tires,
and on cars they use over-size rims on tires without enough rubber. Anyway I'm not sure I believe the article,
because if I follow it, it suggests really crazy pressures :
I'm about 200 pounds. Assuming the 60/40 weight distribution is right, that's 120 on the rear and 80 pounds
on the front.
My lightspeed has 23 mm tires front and rear, so the pressures should be roughly 80 psi front and 120 psi rear.
(it's roughly 1:1 for 23 mm which is very unclear from the stupid way they draw the graph)
My city bike has 25 mm tire rear and 28 mm tire front. The pressures should be :
rear : 110 psi, front : 60 psi
I've been running something like 115 rear, 100 front on my lightspeed and 110 rear, 80 front on my city bike,
so according to this article I need to drop the front pressure even a lot more.
06-10-10 | Ranting
I really hate programmers who whine about other people's code or the compiler or the OS or whatever, always
blaming other people for their woes. 99% of the time their own code would not stand up to the standards
that they are holding others to, and lots of the time the thing they are whining about is actually because
of their own bad usage of a perfectly good system.
< dwSrcInc = ((dwWidth >> 3) + 3) & ~3;
-> dwSrcInc = (((dwWidth+7) >> 3) + 3) & ~3;
but the nasty one was the inherent type of dwSrcInc. BMP's of course can scan up or down so there's some
code that does :
if ( normal ) dwSrcInc = pitch;
else dwSrcInc = - pitch;
Looks good, right? Well, sort of. Of course dwSrcInc is a DWORD. What happens when you put a negative in
there ? Yeah, you're getting it. We then update the pointer like :
ptr += dwSrcInc;
Well, this works out the way you want if the pointer is 32 bits, because the pointer math is done like an unsigned
int add and it wraps. But when the pointer is 64 bits, you're adding not quite 4 billion to your pointer. No good.
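The arithmetic is easy to check with plain integers standing in for the pointer. A minimal sketch: StepUnsigned and StepSigned are names invented here, and a uint64_t address models the 64-bit pointer so nothing actually walks out of bounds.

```cpp
#include <cstdint>

// Models "ptr += dwSrcInc" where dwSrcInc is a 32-bit unsigned DWORD
// holding a negated pitch and the pointer is 64 bits wide.
uint64_t StepUnsigned(uint64_t addr, int32_t pitch)
{
    uint32_t dwSrcInc = (uint32_t)(-pitch); // wraps to 2^32 - pitch
    return addr + dwSrcInc;                 // zero-extended : jumps forward almost 4 GB
}

// Same step with a signed, pointer-sized increment.
uint64_t StepSigned(uint64_t addr, int32_t pitch)
{
    int64_t inc = -(int64_t)pitch;          // sign is kept
    return addr + (uint64_t)inc;            // wraps modulo 2^64 : steps back by pitch
}
```

With addr = 0x10000 and pitch = 256, the unsigned version lands at 0x10000FF00 and the signed one at 0xFF00. In real code the increment would be a ptrdiff_t or intptr_t.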
64-bit dangers :
use of C-style casts, like :
int x = (int) pointer;
use of hard-coded sizes in mallocs :
void ** ptrArray = malloc( 4 * count );
structs with pointers that are IO'd as bytes
or otherwise converted to bytes as in unions etc
use of 0xFFFFFFFF and similar
use of (1<<31) as a flag bit in pointers
struct packing changes
ptrdiff_t is signed
size_t is unsigned !!
unions of mismatched sizes do not warn !!
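For the first two items on that list, the usual replacements look something like the following. This is a sketch, with AllocPtrArray and PointerBits as made-up helper names: size the allocation from sizeof(void *) instead of a hard-coded 4, and go through uintptr_t when a pointer has to be held as an integer.

```cpp
#include <cstdint>
#include <cstdlib>

// Array of count pointers, sized correctly on both 32-bit and 64-bit targets.
void ** AllocPtrArray(size_t count)
{
    return (void **)malloc(sizeof(void *) * count);
}

// uintptr_t is wide enough to hold a pointer; int is not on 64-bit targets.
uintptr_t PointerBits(const void * p)
{
    return (uintptr_t)p;
}
```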
06-07-10 | Unicode CMD Code Page Checkup
I wrote a while ago about the horrible fuckedness of Unicode support on Windows :
cbloom rants 06-15-08 - 2
cbloom rants 06-14-08 - 1
expected : printf with wide strings (unicode) would do translation to the console code page so that characters show up right. (this should probably be
done in the stdout -> console portion)
observed : does not translate to CMD code page
expected : GetCommandLineW (or some other method) would give you the original full unicode arguments
observed : arguments are only available in CMD code page
repro : write an app that takes its own full argc/argv array, and spawns a process with those same args (use an exe name that involves troublesome characters)
expected : same app runs again
observed : if Ansi and CMD code pages differ, your own app may not be found
repro : copy-paste a file name from (unicode) Explorer into CMD
expected : unicode is translated into current CMD code page and thus usable for command line arguments
observed : does not work
repro : dir a file with strange characters in it. copy-paste the text output from dir and type "dir
windows cmd pipe not unicode even with U switch - Stack Overflow
Unicode output on Windows command line - Stack Overflow
INFO SetConsoleOutputCP Only Effective with Unicode Fonts
GetConsoleOutputCP Function (Windows)
BdP Win32ConsoleANSI
typedef wchar_t wchar; // it's not fucking char_t
// yes int, not size_t mutha fucka
int wstrlen(const wchar * str) { return (int) wcslen(str); }
int strlen(const wchar * str) { return wstrlen(str); }
// wsizeof replaces sizeof for wchar arrays
#define wsizeof(str) (sizeof(str)/sizeof(wchar))
fucking be reasonable to your users. You ask too much and make me do stupid busy work.
#define wprintf wprintf_does_not_do_what_you_want
kind of thing.
06-07-10 | Exceptions
Small note about exceptions for those who are not aware :
Uninformed - vol 4 article 1
Introduction to x64 debugging, part 3 « Nynaeve
Introduction to x64 debugging, part 2 « Nynaeve
Exception Handling (x64)
06-07-10 | Cortazar
Playing with form is not really interesting in and of itself. If you do it well, it can enhance the story, but if you do it badly it can be
a distraction, or just cheezy. I'm sure they think they're very clever when they come up with it - tell the story backwards in time, or
tell it from the point of view of the protagonist's hat, or whatever, but no it's really not very clever, anybody can come up with that shit.
The question I always ask myself, is "if this was told as just a straightforward narrative, would it be interesting?". The answer
with Cortazar is no.
06-07-10 | Coding Style
I sometimes feel like I live in my own coding world, where I have these strange ideas that nobody else agrees with.
// pull available work with a limit of 5 items per iteration
LONG work = min(gAvailable, 5);
as
LONG avail = gAvailable;
LONG work = min(avail, 5);
but this is also an example where the hated "volatile" is showing its ugliness. I hate that the "volatile" is far away on the variable
decl and not right here where I need it. Sometimes I would write :
LONG avail = *((volatile LONG *)&gAvailable);
LONG work = min(avail, 5);
but even better if I had my threading helpers I would do something like :
LONG avail = LoadShared_Relaxed(&gAvailable);
LONG work = min(avail, 5);
where I explicitly load the shared variable using "Relaxed" memory ordering semantics (actually I don't know the usage case, but this
should probably have been Acquire memory semantics). LoadShared_Relaxed is nothing but *ptr , so many people don't see the point of
having a function call there at all, but it makes it absolutely clear what is happening. It also just makes it more verbose to touch
the shared variable, which encourages you to load it out into a local temporary, which is good.
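A minimal sketch of what such helpers could look like. This swaps in C++11 std::atomic rather than a plain *ptr on a volatile LONG, so it is an illustration of the idea, not the helpers described above. PullWork is a hypothetical name.

```cpp
#include <atomic>

// Hypothetical helpers in the spirit of LoadShared_Relaxed : the memory
// ordering is spelled out at the point of the load, not on the declaration.
template <typename T>
T LoadShared_Relaxed(const std::atomic<T> * ptr)
{
    return ptr->load(std::memory_order_relaxed);
}

template <typename T>
T LoadShared_Acquire(const std::atomic<T> * ptr)
{
    return ptr->load(std::memory_order_acquire);
}

std::atomic<long> gAvailable(0);

// pull available work with a limit of 5 items per iteration
long PullWork()
{
    // exactly one read of the shared variable, into a local temporary
    const long avail = LoadShared_Relaxed(&gAvailable);
    return (avail < 5) ? avail : 5;
}
```

The local copy matters most when min is a macro that evaluates its arguments twice: the shared variable could then be read twice and give two different values.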
06-02-10 | Words that are changing meaning
"Deprecated" :
06-02-10 | 64 bit transition
The broke ass 64/32 transition is pretty annoying. Some things that MS could/should have done to make it easier :
06-02-10 | Some random Win64 junk
To build x64 on the command line, use
vcvarsall.bat amd64
Oddly the x64 compiler is still in "c:\program files (x86)" , not "c:\program files" where it's supposed to be.
Also there's this "x86_amd64" thing, I was like "WTF is that?" , it's the cross-compiler to build x64 apps from
x86 machines.
MSVC++ 9.0 _MSC_VER = 1500
MSVC++ 8.0 _MSC_VER = 1400
MSVC++ 7.1 _MSC_VER = 1310
MSVC++ 7.0 _MSC_VER = 1300
MSVC++ 6.0 _MSC_VER = 1200
MSVC++ 5.0 _MSC_VER = 1100
Some important revs : "volatile" changes meaning in ver 1400 ; most of the intrinsics show up in 1300 but more
show up in 1400. Template standard compliance changes dramatically in 1400.
06-02-10 | Directives in the right place
I'm annoyed with "volatile" and "restrict" and all that shit because I believe they put the directive in the wrong place - on the variable declaration -
when really what you are trying to do is modify certain *actions* , so the qualifier should be on the *actions* not the variables.
06-02-10 | NiftyP4 Problems
I've switched to NiftyP4, but it has some problems. On the plus side, you can download the code and it just fucking builds right out of the box! That is
fucking amazing, kudos to you, I can't remember the last time I downloaded some open source thing and just built it without any hair pulling. One thing
to be careful of is you have to uninstall the old install from the original installer, then you install your newly built one from the installer it makes
and uninstall it from that installer.
c:\bat>type KillVcScc.bat
call p4 edit *.sln
call p4 edit *.vcproj
call spawnm killScc *.sln
call spawnm KillSCC *.vcproj
call spawnm_tr tr *.sln SourceCodeControl xxx
call dele *scc
c:\bat>type killscc.bat
if "%1"=="" endbat
call p4 edit %1
call zc -o %1 %1.sav
KillPrefix.exe Scc %1.sav %1.sav2
KillPrefix.exe CanCheckoutShared %1.sav2 %1
05-29-10 | Lock Free in x64
I mentioned long ago in the low level threading articles that
some of the algorithms are a bit problematic with 64 bit pointers because we don't have large enough atomic operations.
05-29-10 | Some more x64
Okay , MASM/MSDev support for x64 is a bit fucked. MSDev has built-in support for .ASM starting in VC 2005 which does everything for you,
sets up custom build rule, etc. The problem is, it hard-codes to
ML.EXE - not ML64. Apparently they have fixed this for VC 2010 but it is basically impossible to back-fix.
(in VC 2008 the custom build rule for .ASM is in an XML file, so you can fix it yourself thusly )
public my_cmpxchg64
.CODE
align 8
my_cmpxchg64 PROC
mov rax, [rdx]
lock cmpxchg [rcx], r8
jne my_cmpxchg64_fail
mov rax, 1
ret
align 8
my_cmpxchg64_fail:
mov [rdx], rax
mov rax, 0
ret
align 8
my_cmpxchg64 ENDP
END
extern "C" int my_cmpxchg64( uint64 * val, uint64 * oldV, const uint64 newV );
align 8
my_cmpxchg64 PROC
mov rax, [rdx]
lock cmpxchg [rcx], r8
sete cl
mov [rdx], rax
movzx rax,cl
ret
my_cmpxchg64 ENDP
and you can probably do better. (for example it's probably better to just define your function as returning unsigned char and then you can avoid the
movzx and let the caller worry about that)
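For comparison, here is a sketch of the same contract without any MASM, assuming a compiler with C++11 std::atomic (which postdates the toolchains discussed here). my_cmpxchg64_portable is a hypothetical name.

```cpp
#include <atomic>
#include <cstdint>

// Same contract as the my_cmpxchg64 listing :
//  returns 1 and stores newV if *val == *oldV,
//  otherwise returns 0 and writes the value actually seen into *oldV.
int my_cmpxchg64_portable(std::atomic<std::uint64_t> * val,
                          std::uint64_t * oldV,
                          std::uint64_t newV)
{
    // compare_exchange_strong updates its first argument on failure,
    // which is the "mov [rdx], rax" step in the asm version
    return val->compare_exchange_strong(*oldV, newV) ? 1 : 0;
}
```

On x64 this would normally be expected to compile down to a lock cmpxchg as well, but that is up to the compiler.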
union Fucked
{
struct
{
void * p1;
int t;
} s;
uint64 i;
};
build in 64 bit and it's just hose city. BTW I think using unions as a datatype in general is probably bad practice.
If you need to be doing that for some fucked reason, you should just store the member as raw bits, and then
same_size_bit_cast() to convert it to the various types. In other words, the dual identity of that memory should
be a part of the imperative code, not a part of the data declaration.
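A minimal sketch of that "raw bits plus explicit cast" idea, using memcpy and a compile-time size check. The names same_size_bit_cast_m, PtrAndTag and RawBits are hypothetical, and static_assert assumes a C++11 compiler.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical memcpy-based variant. The static_assert turns a size mismatch
// (the thing the union silently accepts) into a compile error.
template <typename t_to, typename t_fm>
t_to same_size_bit_cast_m(const t_fm & from)
{
    static_assert(sizeof(t_to) == sizeof(t_fm), "bit cast requires types of the same size");
    t_to to;
    std::memcpy(&to, &from, sizeof(t_to));
    return to;
}

struct PtrAndTag
{
    void * p1;
    int    t;
};

// Raw storage sized from the struct itself, not assumed to be 64 bits.
struct RawBits
{
    unsigned char bytes[sizeof(PtrAndTag)];
};

// same_size_bit_cast_m<std::uint64_t>( PtrAndTag() ) would compile in a 32 bit
// build (8 bytes each) and fail to compile in a 64 bit build, where the struct
// is larger than 8 bytes. The union just compiles either way.
```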
05-29-10 | x64 so far
x64 linkage that's been useful so far :
_InterlockedCompareExchange Intrinsic Functions
x86-64 Tour of Intel Manuals
x64 Starting Out in 64-Bit Windows Systems with Visual C++
Writing 64-bit programs
Windows Data Alignment on IPF, x86, and x86-64
Use of __m128i as two 64 bits integers
Tricks for Porting Applications to 64-Bit Windows on AMD64
The history of calling conventions, part 5 amd64 - The Old New Thing - Site Home - MSDN Blogs
Snippets lifo.h
Predefined Macros (CC++)
Physical Address Extension - PAE Memory and Windows
nolowmem (Windows Driver Kit)
New Intrinsic Support in Visual Studio 2008 - Visual C++ Team Blog - Site Home - MSDN Blogs
Moving to Windows Vista x64
Moving to Windows Vista x64 - CodeProject
Mark Williams Blog jmp'ing around Win64 with ml64.exe and Assembly Language
Kernel Exports Added for Version 6.0
Is there a portable equivalent to DebugBreak()__debugbreak - Stack Overflow
How to Log Stack Frames with Windows x64 - Stack Overflow
BCDEdit Command-Line Options
Available Switch Options for Windows NT Boot.ini File
AMD64 Subpage
AMD64 (EM64T) architecture - CodeProject
20 issues of porting C++ code on the 64-bit platform
// same_size_value_cast converts the value between same-sized types
//	eg. it's not a bit cast
template < typename t_to, typename t_fm >
t_to same_size_value_cast( t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
// just value cast ; the result is a temporary, so return it by value :
return (t_to) from;
}
// same_size_bit_cast casts the bits in memory
// eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to & same_size_bit_cast_p( t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
// cast through char * to make aliasing work ?
char * ptr = (char *) &from;
return *( (t_to *) ptr );
}
// same_size_bit_cast casts the bits in memory
//	eg. it's not a value cast
// cast with union is better for gcc / Xenon :
template < typename t_to, typename t_fm >
t_to same_size_bit_cast_u( t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
union _bit_cast_union
{
t_fm fm;
t_to to;
};
_bit_cast_union converter = { from };
// converter is a local, so return a copy, not a reference :
return converter.to;
}
// check_value_cast just does a static_cast and makes sure you didn't wreck the value
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
t_to to = static_cast
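A minimal sketch of a checked value cast along those lines, assuming the check is a round trip back to the source type plus a sign comparison. check_value_cast_sketch is a hypothetical name and the asserts are the standard ones, not anything from cblib.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical checked cast : do the static_cast, then verify the value survived.
template <typename t_to, typename t_fm>
t_to check_value_cast_sketch(const t_fm & from)
{
    const t_to to = static_cast<t_to>(from);
    // round trip back to the source type ; a changed value means bits were lost
    assert( static_cast<t_fm>(to) == from );
    // the round trip alone does not catch a sign flip (eg. 0xFFFFFFFF -> -1)
    assert( (to < t_to(0)) == (from < t_fm(0)) );
    return to;
}
```

This is the kind of thing to put on every size_t to int conversion during a 64 bit port: the cast stays explicit and a debug build tells you when a value stops fitting.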
#if _MSC_VER > 1400
// have intrinsic
_InterlockedExchange64()
#elif _X86_NOT_X64_
// I can use inline asm
__asm { cmpxchg8b ... }
#elif OS_IS_VISTA_NO_XP
// kernel library call available
InterlockedExchange64()
#else
// X64 , not Vista (or want to be XP compatible) , older compiler without intrinsic,
// FUCK !
#error just use a newer MSVC version for X64 because I don't want to fucking write MASM routines
#endif
05-28-10 | Foolishness
Many of the food blog snobistas descend into this foolishness of making everything at home when some things are just not
a wise use of time. I mean I'm sure these
homemade chicharrones are delicious and all, but fuck that's a lot
of work when you could just walk over to the local Mexicatessen and buy a big bag that was also freshly fried in their house-made lard.
And while you're there buy some carnitas and tortillas too and have a much better meal for cheaper and less work.
05-27-10 | Weird Compiler Error
Blurg just fought one of the weirder problems I've ever seen.
void fuck()
{
#ifdef RR_ASSERT
#pragma RR_PRAGMA_MESSAGE("yes")
#else
#pragma RR_PRAGMA_MESSAGE("no")
#endif
RR_ASSERT(true);
}
And here is the compiler error :
1>.\rrSurfaceSTBI.cpp(43) : message: yes
1>.\rrSurfaceSTBI.cpp(48) : error C3861: 'RR_ASSERT': identifier not found
Eh !? Serious WTF !? I know RR_ASSERT is defined, and then it says it's not found !? WTF !?
#undef assert
#define assert RR_ASSERT
which seems like it couldn't possibly cause this, right? It's just aliasing the standard C assert() to mine. Not possibly related, right?
But when I commented out that bit the problem went away. So of course my first thought is clean-rebuild all, did I have precompiled headers
on by mistake? etc. I assume the compiler has gone mad.
#define RR_ASSERT(exp) assert(exp)
This creates a strange state for the preprocessor. RR_ASSERT is now a recursive macro. When you actually try to use it in code, the
preprocessor apparently just bails and doesn't do any text substitution. But, the name of the preprocessor symbol is still defined,
so my ifdef check still saw RR_ASSERT existing.
Evil.
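A small stand-alone reproduction of that state, with stand-in names (my_assert in the role of assert, MY_ASSERT in the role of RR_ASSERT). The C preprocessor rule at work is that a macro is never re-expanded inside its own expansion, so the name comes out the other end as a plain identifier. To make the sketch compile, it declares a real function with the same name for that identifier to land on.

```cpp
// Counts calls to the real function below.
static int g_realFunctionCalls = 0;

// An ordinary function that happens to share the macro's name. It has to be
// defined before the macro exists, or this line would be macro-expanded too.
void MY_ASSERT(bool) { ++g_realFunctionCalls; }

#define my_assert MY_ASSERT              // alias "the standard assert" to mine
#define MY_ASSERT(exp) my_assert(exp)    // ...and define mine in terms of assert

#ifdef MY_ASSERT
static const bool kMacroIsDefined = true;   // the #ifdef check still passes
#else
static const bool kMacroIsDefined = false;
#endif

void UseIt()
{
    // Expands MY_ASSERT(true) -> my_assert(true) -> MY_ASSERT(true) and stops
    // there. The compiler then sees a call to an identifier named MY_ASSERT :
    // here the function above, in the original case nothing at all, hence
    // "identifier not found".
    MY_ASSERT(true);
}
```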
#ifdef _X86
#define RR_ASSERT_BREAK() __asm int 3
#else
#define RR_ASSERT_BREAK() assert(0)
#endif
which is what caused the difficulty.
05-27-10 | Loop Branch Inversion
A major optimization paradigm I'm really missing from C++ is something I will call "loop branch inversion". The problem is
for code sharing and cleanliness you often wind up with cases where you have a lot of logic in some outer loops that find
all the things you should work on, and then in the inner loop you have to do a conditional to pick what operation to do.
eg :
LoopAndDoWork(query,worktype)
{
Make bounding area
Do Kd-tree descent ..
loop ( tree nodes )
{
bounding intersection, etc.
found an object
DoPerObjectWork(object);
}
}
The problem is that DoPerObjectWork then is some conditional, maybe something like :
DoPerObjectWork(object)
{
switch(workType)
{
...
}
}
or even worse - it's a function pointer that you call back.
template < int workType >
void t_LoopAndDoWork(query)
{
...
}
and then provide a dispatcher which does the branch outside :
LoopAndDoWork(query,worktype)
{
switch(workType)
{
case 0 : t_LoopAndDoWork<0>(query); break;
case 1 : t_LoopAndDoWork<1>(query); break;
...
}
}
this is an okay solution, but it means you have to reproduce the branch on workType in the outer loop and inner loop. This is not a speed
penalty because the inner loop is a branch on constant which goes away, it's just ugly for code maintenance purposes because they have to be
kept in sync and can be far apart in the code.
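A compilable sketch of that template-plus-dispatcher pattern. A flat array stands in for the kd-tree walk, and the two work types (WORK_SUM, WORK_COUNT_POSITIVE) are made up for illustration.

```cpp
#include <cstddef>

enum WorkType { WORK_SUM = 0, WORK_COUNT_POSITIVE = 1 };

// Per-object work. workType is a compile-time constant here, so the switch
// collapses to a single case in each instantiation.
template <int workType>
inline void t_DoPerObjectWork(int object, long & result)
{
    switch (workType)
    {
    case WORK_SUM:            result += object; break;
    case WORK_COUNT_POSITIVE: if (object > 0) result += 1; break;
    }
}

// Shared outer loop (stands in for the bounding test and tree descent).
template <int workType>
long t_LoopAndDoWork(const int * objects, std::size_t count)
{
    long result = 0;
    for (std::size_t i = 0; i < count; i++)
        t_DoPerObjectWork<workType>(objects[i], result);
    return result;
}

// Dispatcher : the one runtime branch, hoisted outside the loop.
long LoopAndDoWork(const int * objects, std::size_t count, int workType)
{
    switch (workType)
    {
    case WORK_SUM:            return t_LoopAndDoWork<WORK_SUM>(objects, count);
    case WORK_COUNT_POSITIVE: return t_LoopAndDoWork<WORK_COUNT_POSITIVE>(objects, count);
    default:                  return 0;
    }
}
```

The maintenance problem is visible even at this size: the case list appears twice, once in the dispatcher and once in the per-object function.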
for(int c=0;c<channels;c++) DoStuff(c);
DoStuff(0);
if ( channels > 1 ) DoStuff(1);
if ( channels > 2 ) DoStuff(2);
if ( channels > 3 ) DoStuff(3);
because those ifs reliably go away.
05-26-10 | Windows Page Cache
The correct way to cache things is through Windows' page cache. The advantage from doing this over using your own custom cache code is :
05-26-10 | Windows 7 Snap
My beloved "AllSnap" doesn't work on Windows 7 / x64. I can't find a replacement because fucking Windows has a feature called "Snap" now,
so you can't fucking google for it. (also searching for "Windows 7" stuff in general is a real pain because solutions and apps for the
different variants of windows don't always use the full name of the OS they are for in their page, so it's hard to search for; fucking
operating systems really need unique code names that people can use to make it possible to search for them; "Windows" is awful awful in
this regard).
05-25-10 | Thread Insurance
I just multi-threaded my video test app recently, and it was reasonably easy, but I had a few nagging bugs because of hidden ways they
were touching shared memory without protection deep inside functions. Okay, so I found them and fixed them, but I'm left with a problem -
any time I touch one of those deep functions, I could screw up the threading without realizing it. And I might not get any indication of
what I did for weeks if it's a rare race.
Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A
Phase 2 : fire up threads. They only read from A and do so without protection. They each write to unique areas B,C,D.
Phase 3 : spin down threads. Now main thread can write A and read B,C,D.
So what I would really like to do is :
Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A
-- set A memory to be read-only !
-- set B,C,D memory to be read/write only for their own thread
Phase 2 : fire up threads. They only read from A and do so without protection. They each write to unique areas B,C,D.
-- make A,B,C,D read/write only for main thread !
Phase 3 : spin down threads. Now main thread can write A and read B,C,D.
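The "set A memory to be read-only" step can be sketched with page protection: VirtualProtect on Windows, mprotect elsewhere. Page protection applies to the whole process, so this covers only the read-only part of the wish list. The per-thread read/write restriction on B,C,D is not something these calls provide. AllocPages and SetReadOnly are hypothetical helper names.

```cpp
#include <cstddef>
#if defined(_WIN32)
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Allocate whole pages so that changing protection affects only this block.
void * AllocPages(std::size_t bytes)
{
#if defined(_WIN32)
    return VirtualAlloc(NULL, bytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
#else
    void * p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
#endif
}

// Flip a page-aligned block between read-only and read/write.
// While it is read-only, a stray write from any thread faults immediately
// instead of silently racing.
bool SetReadOnly(void * p, std::size_t bytes, bool readOnly)
{
#if defined(_WIN32)
    DWORD oldProtect = 0;
    return VirtualProtect(p, bytes, readOnly ? PAGE_READONLY : PAGE_READWRITE,
                          &oldProtect) != 0;
#else
    return mprotect(p, bytes, readOnly ? PROT_READ : (PROT_READ | PROT_WRITE)) == 0;
#endif
}
```

Usage would be SetReadOnly(A, size, true) at the end of Phase 1 and SetReadOnly(A, size, false) at the start of Phase 3.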
05-25-10 | State and the Web
There's a major way that the whole iPple device thing is taking us backwards. Plain old HTML (eg. not apps) is awesome in that it actually gets
something really right :
State save(curState);
DoSpeculativeStuff(curState);
GetResults(curState);
curState = save;
this is actually one of the new things I'm doing in my Video Test framework and it has been awesomely useful.
05-23-10 | Two Windows Woes
Slow net. My god WTF is wrong with Windows networking. (I don't mean the TCP/IP stack, I mean shared computer browsing). What the fuck is
wrong with networking in general? Why are there such massive stalls? I mean for browsing my local network, how in the world can it take so
fucking long to discover the machines on my fucking LAN !? And if a machine is not there, can't you just fail in like a millisecond !? I mean
a fucking millisecond is FOREVER to send a packet of light out on some wires and get a reply back.
05-23-10 | Misc
I went driving at Pacific Raceways with the Porsche DE ("Driver's Education" , which is the euphemism they use to make it sound safe, it's actually run
your fucking souped-up old 911 around a race track at insane speeds and occasionally spin out and toss it into the bushes). Maybe I will write up some more
details on it, but it was a fucking blast, the car was *amazing*, I was exhausted, I used about a year's worth of tires and brakes. I highly recommend it.
05-21-10 | Video coding beyond H265
In the end movec-residual coding is inherently limited and inefficient. Let's review the big advantage of it and the big problem.
Code motion = [0 or 1]
Code residual = [0,1,2,3,4,5,6,7,8,9]
Or in tree form :
0 - [0,1,2,3,4,5,6,7,8,9]
*<
1 - [0,1,2,3,4,5,6,7,8,9]
Clearly this is foolish. For each movec, you only need to code the residual which encodes that resulting pixel block the smallest under that
movec. So you only need each output value to occur in one spot on the tree, eg.
0 - [0,1,2,3,4]
*<
1 - [5,6,7,8,9]
or something. That is, it's foolish to have two ways to encode the residual to reach a certain target when there were already cheaper ways to reach
that target in the movec coding portion.
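To put a number on the waste in the toy example, here is a small sketch that assumes a flat code, where every (movec, residual) pair costs the same number of bits. Real coders are entropy coded, so this is only the crude version of the argument. RedundantBits is a hypothetical name.

```cpp
#include <cmath>

// Bits spent on (movec, residual) pairs minus bits needed to pick a target,
// under the assumption that all pairs are equally likely (flat code).
double RedundantBits(int numMovecs, int numResiduals, int numTargets)
{
    const double bitsSpent  = std::log2( double(numMovecs) * double(numResiduals) );
    const double bitsNeeded = std::log2( double(numTargets) );
    return bitsSpent - bitsNeeded;
}
```

For the first tree (2 movecs, 10 residuals each, 10 distinct targets) this gives one wasted bit per block. For the second tree (2 movecs, 5 residuals each) it gives zero.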
To minimize this deficiency, most current coders like H264 will code blocks by either putting almost all the bits in the movec
and very few in the residual, or the other way (almost none in the movec and most in the residual). The loss occurs most when you have many bits
in the motion and many in the residual, something like :
0 - [0,1,2]
1 - [3,4,5,6]
2 - [7,8]
3 - [9]
05-20-10 | Some quick notes on H265
Since we're talking about VP8 I'd like to take this chance to briefly talk about some of the stuff coming in the future. H265 is being
developed now, though it's still a long ways away. Basically at this point people are throwing lots of shit at the wall
to see what sticks (and hope they get a patent in). It is interesting to see what kind of stuff we may have in
the future. Almost none of it is really a big improvement like "duh we need to have that in our current stuff",
it's mostly "do the same thing but use more CPU".
05-19-10 | Some quick notes on VP8
The VP8 release is exciting for what it might be in two years.
05-13-10 | P4 with NiftyPerforce and no P4SCC
I'm trying out P4 in MSDev with NiftyPerforce and no P4SCC.
05-12-10 | P4 By Dir
(ADDENDUM : see comments, I am dumb).
(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way
to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another
env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have
some work code and some home code open at the same time they shit their pants.
I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something
like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).
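The "walk up the dir tree and take the first config you find" lookup is only a few lines. A minimal sketch, assuming a C++17 compiler for std::filesystem. FindConfigUpward is a hypothetical helper and the config file name is whatever you choose to pass in.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper : walk from startDir toward the root and return the first
// entry named configName that exists. Returns an empty path if none is found.
fs::path FindConfigUpward(const fs::path & startDir, const std::string & configName)
{
    fs::path dir = fs::absolute(startDir);
    for (;;)
    {
        const fs::path candidate = dir / configName;
        if (fs::exists(candidate))
            return candidate;

        const fs::path parent = dir.parent_path();
        if (parent.empty() || parent == dir)   // reached the root
            return fs::path();
        dir = parent;
    }
}
```

With one config under c:\home and another under c:\work, a tool started anywhere below either directory picks up the right server settings without touching the global environment.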
05-12-10 | Cleartype
Since I ranted about Cleartype I thought I'd go into a bit more detail.
This article on Cleartype in Win7 is interesting, though also willfully retarded.
Another research question we’ve asked ourselves is why do some people
prefer bi-level rendering over ClearType? Is it due to hardware issues
or is there some other attribute that we don’t understand about visual
systems that is playing a role. This is an issue that has piqued our
curiosity for some time. Our first attempt at looking further into this
involved doing an informal and small-scale preference study in a
community center near Microsoft.
1. 35 participants.
2. Comments for bi-level rendering:
Washed out; jiggly; sketchy; if this were a printer, I’d say it needed a new cartridge; fading out – esp. the numbers, I have to squint to read this, is it my glasses or it is me?; I can’t focus on this; broken up; have to strain to read; jointed.
3. Comments for ClearType:
More defined, Looks bold (several times), looks darker, clearer (4 times), looks like it’s a better computer screen (user suggested he’d pay $500 more for the better screen on a $2000 laptop), sort of more blue, solid, much easier to read (3 times), clean, crisp, I like it, shows up better, and my favorite: from an elderly woman who was rather put out that the question wasn’t harder: this seems so obvious (said with a sneer.)
05-11-10 | Note from the Mail Man
05-11-10 | Some New Cblib Apps
Coded up some new goodies for myself today and released them in a new
cblib and chuksh .
05-09-10 | Some Win7 Shite
Perforce Server was being a pain in my ass to start up because the fucking P4S service doesn't get my P4ROOT
environment variable. Rather than try to figure out the fucking Win 7 per-user environment variable shite,
the easy solution is just to move your P4S.exe into your P4ROOT directory, that way when it sees no P4ROOT
setting it will just use current directory.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]
"EnableSuperfetch"=dword:00000000
"EnablePrefetcher"=dword:00000000
[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]
"NoWinKeys"=dword:00000001
[HKEY_CURRENT_USER\Control Panel\Desktop]
"MenuShowDelay"="10"
DownThemAll
Flashblock
PDF Download
Stop Autoplay
Adblock Plus
DownloadHelper
Shutdown Event Tracker Tools and Settings System Reliability
How to disable hard disk thrashing with Vista - Page 7
05-07-10 | New Lappy
I got my new lappy, a Dell Precision M6500 big 17" behemoth. I'm setting up Win 7 today
which is mostly going smoothly.
04-22-10 | Image Writings
I gathered my recent image compression posts to send someone, thought I'd dump them here too -
(it's also quite possible I missed some since I just found these by googling myself)
05-14-09 - Image Compression Rambling Part 2
05-15-09 - Image Compression Rambling Part 3 - HDR
05-26-09 - Some Study of DCT coefficients
05-27-09 - Optimal Local Bases
02-19-09 - PNG Sucks
05-15-09 - Trellis Quantization
01-14-10 - A small note on Trellis quantization
02-23-10 - Image Compresson - Color , ScieLab
03-03-10 - Image Compresson - Color , ScieLab - Part 2
07-06-09 - Small Image Compression Notes
07-07-09 - Small Image Compression Notes , Part 2
08-25-09 - Oodle Image Compression Looking Back
08-27-09 - Oodle Image Compression Looking Back Pictures
02-10-10 - Some little image notes
05-18-09 - Lagrange Space-Speed Optimization
01-12-10 - Lagrange Rate Control Part 1
01-12-10 - Lagrange Rate Control Part 2
01-13-10 - Lagrange Rate Control Part 3
11-18-08 - DXTC Part 1
11-18-08 - DXTC Part 2
11-20-08 - DXTC Part 3
11-21-08 - DXTC Part 4
09-08-09 - DXTC Addendum
06-17-09 - DXTC More Followup
11-20-08 - Pointless
11-21-08 - More Texture Compression Nonsense
04-21-10 | Car Alignment
I got a "performance alignment" done for my car about a week ago. I'd heard it was the #1 best thing
to do for your car if you are a serious driver (after or when you get good tires if your car doesn't come
with them), but MY GOD I was not prepared for what an incredible difference it was. It was like
"night and day" or "a whole new car" or whatever cliche you want to use to express the incredible
difference in feel. It really felt like a different car. Before, the 997 felt powerful but a bit heavy
and clumsy and mild; with the more aggressive alignment it felt way sharper, much more "bite" for turn-in,
much more grip in the corners, able to get more power down through the corner without losing grip, just
awesome. I immediately went out and did some mad driving around the city because it just felt so damn good.
Caster, Camber, Toe
04-20-10 | Life
One of the shittiest things you can do is to dump on someone's happiness. Happiness is rare and difficult, and often
involves lots of work to build up to it. When somebody finds a new hobby that is making them really happy and they
always want to tell stories about it at lunch ("a funny thing happened last weekend at the model airplane field...") you
just have to suck it up and feign interest and be a good listener. When somebody buys a new toy (like an iPad) you have
to go along with the after-purchase show-off euphoria and do the proper oohs and ahs and feign jealousy. If you
think that what they're doing or buying is actually awful and a waste of time, just shut the fuck up and keep it to yourself.
04-20-10 | Speeders Beware
Press Release :
Starting today, April 9, and extending through May 1st law enforcement agencies throughout
Washington will crack down on speeding with extra patrols on local roads, state highways
and interstate freeways.
From April 9th through May 1st, extra speeding patrols will begin throughout Washington on:
1. Fridays from 11 a.m. to 7 p.m.
2. Saturdays from noon to 8 p.m.
3. Sundays from noon to 8 p.m.
04-19-10 | PNWR Driver Skills Day
I went and did the Porsche club "Driver Skills" DS day last weekend. It's a prerequisite before you get to run on the real race
track (aka "Driver Education" or DE), and I thought it would be a good way to get used to running the car closer to the limit, since I'm not very familiar
with it yet. DS is held at Bremerton Raceway Park which is just an abandoned air strip. It's just a big tarmac and
they put a bunch of cones on it and
you do various exercises. (Porsche club also runs its autocross out there).
04-15-10 | Laptop Part 2
Well I need to hurry up and get something so I decided to just go for the Thinkpad W510. And... Lenovo is completely out of stock
on the reasonable screens ( "HD+" at 1600x900 ). Apparently they are indefinitely out of stock; they say "more than 4 weeks" but
the reality is they have lost their supplier so they are just fucked.
04-14-10 | Apples
I can't eat an apple without almost choking. There's always a moment after I've eaten about half the apple where the
back-log of unchewed bits hits critical blockage volume in my throat; I struggle to force it down with hard swallowing,
and then panic and try to find water quickly. An apple without a glass of water is almost a booby trap, if someone hands
you an apple they might be trying to kill you.
04-14-10 | Friendly Apps
A recent post by Multimedia Mike is about something that I
advocate which I think most people don't do enough of : make your app easy for yourself to use. Developers too often think
of usability as an annoyance which they have to do for end users, and they will struggle and deal with great annoyances when
they use their own apps themselves. This is ridiculous, you are your most important user. Mike mentions a few good things,
I'll repeat him a bit :