Experimenting with LLVM
My oldest live projects are Mongoose and Cesta, both in intermittent development since 2003. I'm doing a bit of hacking on both right now.
In the case of Mongoose, writing my own VM (or, thus far, four of them) was fun, but it's not viable in the long term. I'd like to target the language to an existing VM, but earlier attempts with the JVM and CLR bore little fruit. (Neither VM is well-suited to dynamic languages with serious metaprogramming facilities.)
In the case of Cesta, I need a way to speedily evaluate filter expressions, and a way to "cross-breed" decoders for parallel fuzzy decoding.
Research on both topics kept leading me back to LLVM, which is many things:
- It's a compiler, with frontends for C, C++, and several other languages;
- It's a library for designing and implementing transformations on code;
- It's a framework for building the backends of new compilers;
- It's a well-defined low-level virtual machine;
- It's an embeddable code generator, suitable for use in Just-In-Time compilers.
Initially, I was scared of LLVM. I did some hacking on the GCC family of compilers a few years ago, and the code is abjectly terrifying — not to mention the deliberately obfuscated intermediate representations. However, every research lead I followed kept pointing back to LLVM. Eventually, I got up the gumption to download it and try it out.
In short: I am floored. LLVM is almost absurdly easy to work with, thanks to the best-designed C++ API I've ever seen outside of Qt. Within minutes of downloading it, I had built and tested some of my C code using llvm-gcc, their drop-in replacement for GCC.
An hour, two source examples, and 120 lines of code later, I had written a program that generates a Fibonacci function in RAM, JITs it to native x86, and executes it in-process, printing the results. (It might have taken longer if the LLVM guys hadn't provided an example that does exactly that, which I was able to learn from.)
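For the curious, the toolchain round-trip is short. These commands are illustrative (the filename is a placeholder, and exact flags vary slightly across LLVM releases):

```
llvm-gcc -emit-llvm -c fib.c -o fib.bc    # compile C to LLVM bitcode
lli fib.bc                                # run the bitcode under the JIT
llc fib.bc -o fib.s                       # or lower it to native assembly
```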
This is when I started to notice something: the JIT output is impressively fast, considering that I had not activated any of LLVM's extensive suite of optimizations. I coded up two versions of the Fibonacci sequence (naïve and hand-optimized) and ran some benchmarks.
The first is the naïve recursive version:
int fib(int x) {
    if(x <= 2) return 1;
    return fib(x - 1) + fib(x - 2);
}
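What my generator builds in memory corresponds roughly to the following LLVM IR. This is a hand-written sketch in the 2.x-era textual syntax, not my program's literal output:

```llvm
; Naïve recursive Fibonacci as LLVM IR (sketch).
define i32 @fib(i32 %x) {
entry:
  %base = icmp sle i32 %x, 2            ; x <= 2 ?
  br i1 %base, label %ret1, label %recurse
ret1:
  ret i32 1
recurse:
  %x1 = sub i32 %x, 1
  %f1 = call i32 @fib(i32 %x1)          ; fib(x - 1)
  %x2 = sub i32 %x, 2
  %f2 = call i32 @fib(i32 %x2)          ; fib(x - 2)
  %sum = add i32 %f1, %f2
  ret i32 %sum
}
```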
Compiled in a test harness, the times are as follows (suitably averaged and scrubbed of outliers):
fib(24)
- GCC 4.0.1, -O0: 1136 µs
- GCC 4.0.1, -O3: 397 µs
- LLVM JIT: 771 µs
fib(40) (note that units are now milliseconds)
- GCC 4.0.1, -O0: 1866 ms
- GCC 4.0.1, -O3: 618 ms
- LLVM JIT: 1578 ms
Generating this recursive Fibonacci function at runtime (which takes some milliseconds at startup, not included in this benchmark) produces code that runs faster than GCC at -O0. Neat.
Next, the iterative transform of the function:
int iterative_fibonacci(int x) {
    if(x <= 2) return 1;
    int a = 1, b = 1, c;
    while(x > 2) {
        c = a + b;
        b = a;
        a = c;
        x--;
    }
    return c;
}
(The code to generate this at runtime is substantially more complex, since I'm assembling the SSA intermediate representation by hand in C++.)
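In SSA form, the loop-carried variables become phi nodes, which is what makes hand-assembly fiddly. The IR looks roughly like this (again a sketch in 2.x-era syntax, not my generator's literal output):

```llvm
; Iterative Fibonacci in SSA form (sketch).
define i32 @iterative_fib(i32 %x) {
entry:
  %base = icmp sle i32 %x, 2
  br i1 %base, label %ret1, label %loop
ret1:
  ret i32 1
loop:
  ; Each loop variable merges its initial value with the
  ; value computed on the previous trip around the loop.
  %a = phi i32 [ 1, %entry ], [ %c, %loop ]
  %b = phi i32 [ 1, %entry ], [ %a, %loop ]
  %n = phi i32 [ %x, %entry ], [ %n1, %loop ]
  %c = add i32 %a, %b
  %n1 = sub i32 %n, 1
  %done = icmp sle i32 %n1, 2
  br i1 %done, label %exit, label %loop
exit:
  ret i32 %c
}
```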
This algorithm is vastly faster than the recursive version; note that the units below are now nanoseconds.
fib(24)
- GCC 4.0.1, -O0: 170 ns
- GCC 4.0.1, -O3: 30 ns
- LLVM JIT: 51 ns
fib(40)
- GCC 4.0.1, -O0: 299 ns
- GCC 4.0.1, -O3: 51 ns
- LLVM JIT: 93 ns
On this version, LLVM shows an impressive performance gain over GCC at -O0, and is within a factor of 2 of GCC's output at -O3. (Edit: As noted in the comments, running GCC at -O0 is effectively tying both of its hands behind its back. I understand this; I'm comparing to -O0 because I haven't aggressively optimized my LLVM code either, and I wanted to see a non-JIT compiler with its optimizations turned off. It's still faster than the M3VM threaded interpreter.)
To try this for yourself, here's where to find the sources:
- The LLVM naïve Fibonacci code is in the LLVM distribution at examples/Fibonacci.
- The iterative LLVM Fibonacci code.
- The recursive C Fibonacci code (for GCC).
- The iterative C Fibonacci code (for GCC).
17 Comments:
Comparing anything with gcc -O0 is completely meaningless. gcc -O0 exists solely to create debugger-friendly code, in which as many C-level constructs as possible are directly represented in the assembly. It's not meant for anything else.
By taw, at 6:37 AM
"the times are as follows (suitably averaged and scrubbed of outliers)"
What do you mean by "suitably averaged"?
By Isaac Gouy, at 7:42 AM
bdash - I'm still learning how to enable link-time optimization.
taw - gcc -O0 is meaningful to me because I've spent the last several years writing interpreters and naïve code generators. Code compiled with gcc -O0 performs rather better than any of my interpreters, so while I realize it's not meant to be impressive, beating it is an excellent step for me. (Particularly since driving LLVM has turned out to be so simple.) I included gcc -O3 as a representation of how "real" GCC usage would perform.
isaac - I mean I ran each trial 100 times (1000 for the fast iterative algorithm), dropped the highest and lowest, and averaged the rest (arithmetic mean). I'm not used to people outside my family and job reading this blog, so I've not documented my methods as thoroughly as I should. :-)
By Cliff L. Biffle, at 9:15 PM
With all respect to LLVM, which certainly looks impressive (the generated code in their try-it-out page is really interesting to read), maybe you should have a look at Neko, which sounds like a very nice platform for dynamic languages...
Great entry, btw!
By Excalibor, at 2:52 AM
Excalibor, thanks for the pointer! I can't find any documentation on the Neko VM machine model, can you point me to it? I'd love to take advantage of someone else's platform if I can, but I've read through the Neko docs and they seem pretty incomplete.
Specifically, I'd need information on how much behavior is wired into the VM, particularly with regards to function dispatch, exception handling, closures/non-local-returns, tail calls, continuations, and the object model. These have been the main sticking points with the JVM, CLR, and Parrot. (Parrot comes awfully close to what I need, but it's woefully underdocumented.)
Also, information on the JIT and GC, their performance, and their scaling in high-thread-count environments would be useful.
The main advantage I derive from using LLVM is that it doesn't mandate any of these aspects, giving me freedom to experiment. That, and it lets me easily transform to native code.
Also, I note that Neko preserves the unfortunate primitive/object distinction of Java. Is this on purpose?
Thanks again!
By Cliff L. Biffle, at 3:31 PM
(Excalibor, I should note, I'm assuming that the NekoVM specification isn't a direct one-to-one correspondence with the Neko language. If things like Neko's exception handling and type system are baked into the VM, I will not be able to make use of it, unfortunately. Among other things, my objects are not hash tables. :-)
By Cliff L. Biffle, at 3:33 PM
Cliff,
I'm afraid I won't be able to answer your questions (which are, BTW, very good questions): unfortunately, documentation is a point where most software (not just open-source) really lacks in quality and quantity.
I've just been exposed to Neko through haXe, which is a web framework to export ECMAScript, Flash or Neko, and thus I don't really know how it works under the hood... Anyway, I'm sure you'll be able to get more info contacting the author(s) directly, they seem to be pretty nice guys.
As for your objects not being hash tables, this may not be the place to ask, but may I ask why not? It's a proven model and a very useful one (eg. Self, Lua, and a long etc). Just curious.
Good luck, anyway, on your search (and yes, Parrot is woefully underdocumented, because there's a lot of technical things to do which are more interesting... I can only hope that when Parrot is closer to a final release Dan will tackle that particular battle and ask everyone to start fully documenting their interfaces, including himself!).
By Excalibor, at 2:10 AM
Excalibor,
Keeping objects as hashtables is an efficiency hit. (And the implementations of Self I'm aware of only conceptually represented their objects as hashes; the actual objects were simple frames. They might have changed this at some point, though.)
Mongoose (the language I'm doing this for) doesn't allow objects to have fields added after instantiation, so I can statically compute the field offsets and convert them into load/store instructions directly.
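Concretely, a statically known field offset turns into an address computation plus a plain load in LLVM IR, with no hash lookup at all. An illustrative sketch (the Point type and accessor are hypothetical, and the syntax is 2.x-era):

```llvm
; Hypothetical two-field object with a layout fixed at class-definition time.
%Point = type { i32, i32 }

; Reading field y is just a getelementptr and a load, because the
; offset of y within %Point is known statically.
define i32 @Point_get_y(%Point* %p) {
  %addr = getelementptr %Point* %p, i32 0, i32 1
  %y = load i32* %addr
  ret i32 %y
}
```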
By Cliff L. Biffle, at 12:47 PM
@Cliff
I'm looking for a VM to use in my projects, and I want to ask if you can compare Parrot and LLVM. I understand Parrot lacks documentation, but it certainly has many features that LLVM doesn't.
Is LLVM really so well designed, as its website suggests, that it easily wins over Parrot?
By Unknown, at 1:41 PM
So machine code compiled by LLVM (with the optimizations that are enabled by default) runs faster than machine code from gcc -O0 (no optimization at all; code compiled for ease of reading and debugging), and about 2x slower than gcc -O3 (real optimization enabled), on such a short program?
Well, LLVM - well done.
If possible, could you try it and post results for libJIT 0.1.2, and for the experimental research branch of libJIT (0.1.2+):
http://code.google.com/p/libjit-linear-scan-register-allocator/
with various optimization levels enabled (jit_function_set_optimization_level from 0 to 4), and various ABIs (jit_abi_cdecl and jit_abi_internal).
There is a tutorial for libJIT there:
http://www.gnu.org/software/dotgnu/libjit-doc/libjit_1.html
Thanks,
Kirill
By Unknown, at 1:15 PM