Indirect threading is...slower?
M2VM so far has been using the switched-interpreter mode of my VM generator.
As of yesterday, my profiler no longer says the GC is the bottleneck — now, it's the interpreter loop. So, logically, I've started to optimize it.
As a first step, I switched the VM generator to produce indirect-threaded code — simply looking up the address of each bytecode's code, using an array of addresses (one for each bytecode). I expected this to be faster.
It's about 10% slower on my benchmarks. (It's about the same speed at -O1, but at higher levels it gets slower and slower.)
I'll drill down with some profiling to see if I can find the cause, but this kinda blows my mind. The generated code for the switch statement was the hottest in the module, so I assumed that, by getting it out of the way, things would speed up.
Update: This is only true on the G4/G5. On an Intel Core Duo system, the indirect threaded interpreter is an instant 10% speed boost. However, I've found an issue with GCC's code generation that's ruining my indirect branch prediction rate — which may explain the issues on the G5.