Cliff Hacks Things.

Saturday, March 31, 2007

Experimenting with LLVM

My oldest live projects are Mongoose and Cesta, both in intermittent development since 2003. I'm doing a bit of hacking on both right now.

In the case of Mongoose, writing my own VM (or, thus far, four of them) was fun, but it's not viable in the long term. I'd like to target the language to an existing VM, but earlier attempts with the JVM and CLR bore little fruit. (Neither VM is well-suited to dynamic languages with serious metaprogramming facilities.)

In the case of Cesta, I need a way to speedily evaluate filter expressions, and a way to "cross-breed" decoders for parallel fuzzy decoding.

Research on both topics kept leading me back to LLVM, which is many things:

  1. It's a compiler, with frontends for C, C++, and several other languages;

  2. It's a library for designing and implementing transformations on code;

  3. It's a framework for building the backends of new compilers;

  4. It's a well-defined low-level virtual machine;

  5. It's an embeddable code generator, suitable for use in Just-In-Time compilers.



Initially, I was scared of LLVM. I did some hacking on the GCC family of compilers a few years ago, and the code is abjectly terrifying — not to mention the deliberately obfuscated intermediate representations. However, every research lead I followed kept pointing back to LLVM. Eventually, I got up the gumption to download it and try it out.

In short: I am floored. LLVM is almost absurdly easy to work with, thanks to the best-designed C++ API I've ever seen outside of Qt. Within minutes of downloading it, I had built and tested some of my C code using llvm-gcc, their drop-in replacement for GCC.

An hour, two source examples, and 120 lines of code later, I had written a program that generates a Fibonacci function in RAM, JITs it to native x86, and executes it in-process, printing the results. (It would have taken longer if the LLVM guys hadn't provided an example that does exactly that for me to learn from.)
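
For flavor, here's roughly what such a program looks like. This is a sketch, not my actual harness -- the names are illustrative, and the exact API spellings shift between LLVM releases, so consult the headers for the version you have:

#include <cstdio>
#include <memory>
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/TargetSelect.h"

using namespace llvm;

int main() {
    InitializeNativeTarget();
    InitializeNativeTargetAsmPrinter();

    LLVMContext Ctx;
    auto Mod = std::make_unique<Module>("fibmodule", Ctx);
    Type *I32 = Type::getInt32Ty(Ctx);

    // Emit IR for: int fib(int x) { if (x <= 2) return 1;
    //                               return fib(x - 1) + fib(x - 2); }
    Function *F = Function::Create(FunctionType::get(I32, {I32}, false),
                                   Function::ExternalLinkage, "fib", Mod.get());
    Value *X = F->getArg(0);
    BasicBlock *Entry = BasicBlock::Create(Ctx, "entry", F);
    BasicBlock *Base  = BasicBlock::Create(Ctx, "base",  F);
    BasicBlock *Rec   = BasicBlock::Create(Ctx, "rec",   F);

    IRBuilder<> B(Entry);
    B.CreateCondBr(B.CreateICmpSLE(X, B.getInt32(2)), Base, Rec);
    B.SetInsertPoint(Base);
    B.CreateRet(B.getInt32(1));
    B.SetInsertPoint(Rec);
    Value *L = B.CreateCall(F, {B.CreateSub(X, B.getInt32(1))});
    Value *R = B.CreateCall(F, {B.CreateSub(X, B.getInt32(2))});
    B.CreateRet(B.CreateAdd(L, R));

    // Hand the module to the JIT and call the freshly generated native code.
    ExecutionEngine *EE = EngineBuilder(std::move(Mod)).create();
    int (*Fib)(int) = (int (*)(int))EE->getFunctionAddress("fib");
    printf("fib(24) = %d\n", Fib(24));
    return 0;
}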

This is when I started to notice something: the JIT is surprisingly fast, considering that I had not activated any of LLVM's impressive suite of optimizations. I coded up two versions of the Fibonacci function (naïve recursive and hand-optimized iterative) and ran some benchmarks.

The first is the naïve recursive version:

int fib(int x) {
    if (x <= 2) return 1;
    return fib(x - 1) + fib(x - 2);
}


Compiled in a test harness, the times are as follows (suitably averaged and scrubbed of outliers):
fib(24)

  • GCC 4.0.1, -O0: 1136 µs

  • GCC 4.0.1, -O3: 397 µs

  • LLVM JIT: 771 µs


fib(40) (note that units are now milliseconds)

  • GCC 4.0.1, -O0: 1866ms

  • GCC 4.0.1, -O3: 618ms

  • LLVM JIT: 1578ms



Generating this recursive Fibonacci function at runtime (which takes some milliseconds at startup, not included in this benchmark) produces code that runs faster than GCC at -O0. Neat.

Next, the iterative transform of the function:

int iterative_fibonacci(int x) {
    if (x <= 2) return 1;
    int a = 1, b = 1, c;
    while (x > 2) {
        c = a + b;
        b = a;
        a = c;
        x--;
    }
    return c;
}


(The code to generate this at runtime is substantially more complex, since I'm assembling the SSA intermediate representation by hand in C++.)
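
To give a taste of what "by hand" means: every mutable local in the C version becomes a phi node in the loop block, merging the value flowing in from the entry edge with the value from the loop's own back edge. A sketch, with the same headers and caveats as the one above (my names; check your release's headers for exact signatures):

static Function *buildIterFib(Module *M) {
    LLVMContext &Ctx = M->getContext();
    Type *I32 = Type::getInt32Ty(Ctx);
    Function *F = Function::Create(FunctionType::get(I32, {I32}, false),
                                   Function::ExternalLinkage,
                                   "iterative_fibonacci", M);
    Value *X0 = F->getArg(0);

    BasicBlock *Entry = BasicBlock::Create(Ctx, "entry", F);
    BasicBlock *Base  = BasicBlock::Create(Ctx, "base",  F);
    BasicBlock *Loop  = BasicBlock::Create(Ctx, "loop",  F);
    BasicBlock *Exit  = BasicBlock::Create(Ctx, "exit",  F);

    IRBuilder<> B(Entry);
    B.CreateCondBr(B.CreateICmpSLE(X0, B.getInt32(2)), Base, Loop);
    B.SetInsertPoint(Base);
    B.CreateRet(B.getInt32(1));

    B.SetInsertPoint(Loop);
    PHINode *A  = B.CreatePHI(I32, 2, "a");       // int a = 1; becomes c
    PHINode *Bv = B.CreatePHI(I32, 2, "b");       // int b = 1; becomes old a
    PHINode *Xv = B.CreatePHI(I32, 2, "x");       // the decrementing counter
    Value *C  = B.CreateAdd(A, Bv, "c");          // c = a + b
    Value *Xn = B.CreateSub(Xv, B.getInt32(1));   // x--
    A->addIncoming(B.getInt32(1), Entry);  A->addIncoming(C, Loop);
    Bv->addIncoming(B.getInt32(1), Entry); Bv->addIncoming(A, Loop);
    Xv->addIncoming(X0, Entry);            Xv->addIncoming(Xn, Loop);
    B.CreateCondBr(B.CreateICmpSGT(Xn, B.getInt32(2)), Loop, Exit);

    // The loop block dominates the exit block, so SSA lets us return c here.
    B.SetInsertPoint(Exit);
    B.CreateRet(C);
    return F;
}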

This algorithm is vastly faster than the previous version, so note that our units below are now nanoseconds.
fib(24)

  • GCC 4.0.1, -O0: 170 ns

  • GCC 4.0.1, -O3: 30 ns

  • LLVM JIT: 51 ns


fib(40)

  • GCC 4.0.1, -O0: 299 ns

  • GCC 4.0.1, -O3: 51 ns

  • LLVM JIT: 93 ns



On this version, LLVM shows an impressive performance gain over GCC at -O0, and is within a factor of 2 of GCC's output at -O3. (Edit: As noted in the comments, running GCC at -O0 effectively ties both of its hands behind its back. I understand this; I'm comparing against -O0 because I haven't aggressively optimized my LLVM code either, and I wanted a non-JIT compiler with its optimizations turned off as a baseline. It's still faster than the M3VM threaded interpreter.)

To try this for yourself, here's where to find the sources:


Sunday, March 18, 2007

Cocoa Bindings: more on NSArrayController

After my post on NSArrayController's performance yesterday, Scott rightly pointed out that the behavior I was seeing might be a result of how I was using the class, rather than the class's implementation.

I've boiled down the scenario into a bare-minimum project: ArrayGrowDemo.zip (47k)

I've kept the use of Bindings to the textbook case, illustrated below:


All in all, the only unusual thing I'm doing in the code (a slightly simplified version of the Cesta algorithm) is populating the array continuously from a background thread, while coalescing the KVO notifications in the foreground thread. For what it's worth, I see the same overall pattern when populating via a perform on the main run loop -- the trends just show up faster when the population isn't throttled by the performance of the UI.
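
The coalescing looks roughly like this (a sketch with my own stand-in names -- _pending, _events, and the "events" key -- not the demo's exact code). The capture thread appends to a pending buffer, and a method scheduled periodically on the main run loop flushes it, firing one pair of indexed KVO notifications per batch:

- (void)flushPendingEvents   // runs on the main thread
{
    NSArray *batch;
    @synchronized (self) {
        batch = [_pending copy];
        [_pending removeAllObjects];
    }
    if ([batch count] > 0) {
        NSRange range = NSMakeRange([_events count], [batch count]);
        NSIndexSet *indexes = [NSIndexSet indexSetWithIndexesInRange: range];

        // One willChange/didChange pair covers the whole batch, so
        // observers see a single insertion instead of thousands.
        [self willChange: NSKeyValueChangeInsertion
         valuesAtIndexes: indexes
                  forKey: @"events"];
        [_events addObjectsFromArray: batch];
        [self didChange: NSKeyValueChangeInsertion
        valuesAtIndexes: indexes
                 forKey: @"events"];
    }
    [batch release];
}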

The application, when run, will write stats in CSV format to /tmp/arraygrow.log. Run it from Xcode, because you'll need that Terminate button handy: the UI becomes totally nonresponsive very quickly. (This replicates the freeze I saw in the Cesta UI when processing a loaded gigabit link, when I was using NSArrayController.)

Here's a graph from one of my test runs, leaving the NSTableView unsorted. The blue line is the amount of time required to distribute the KVO notification that the property has changed size (i.e. that the "array" has grown); from Shark profiles, this is where the NSArrayController code tends to slow down.



(More specifically, the blue line is the time spent in didChange:valuesAtIndexes:forKey: per invocation. Its willChange: brother appears to be constant-time. The red line is the time spent in didChange: divided by the number of items actually changed, showing that — fortunately — the NSArrayController isn't linear with regard to the number of changed items...in Tiger....)

I've run a few variations on this, which you can try by editing the Nib. If you change the model key of either table column to "value.value.value.value", you can see the results of binding to a (pathologically) deeply nested property. This appears to slow things down very slightly, and boosts the time spent in the KVO register/unregister code. (I included this because I expected it to slow things down significantly; it doesn't.)

When sorting the table view on either column, each update remains effectively O(n), but the actual time taken increases substantially. (The sort operation, presumably O(n log n), is lost in the O(n) noise.)

Oh kind readers, please take a look at this source and let me know what I'm doing wrong. I would love to switch back to NSArrayController for my packet view in Cesta -- it would reduce the lines of code I have to maintain.

Edit: Updated the images to fit better.


Saturday, March 17, 2007

Cocoa Bindings: mmm, crow

I should have learned this long ago: when I make a glowing blog post about something, I'm about to learn far more about it than I ever wanted to.

I've been collecting Shark profiles of Cesta all day, as my own personal performance-fixit. I've learned two things.
1. My new decode logic is much faster than I anticipated. (Good!)
2. As a result, all of the CPU Cesta's been chewing up has been -- yes, Cocoa Bindings.

The Bindings framework is not to blame. It's the built-in Controller objects (Edit: or, as Scott points out in the comments, the way I'm using them). Specifically, NSArrayController seems to very aggressively make copies of the multi-valued property it controls. I haven't decompiled it for specifics, but I see enormous NSArrays being allocated -- and that's a class that I specifically avoid. (Granted, NSArrayController was probably not designed to be stuffed with thousands of rows per second. Copying the array makes sense for smaller datasets.)

Moreover, for some reason, it keeps invoking indexOfObject:, which for an unsorted array must examine half the elements on average -- O(n), any way you slice it. When my captures start getting large (hundreds of thousands of packets), indexOfObject: and its partner in crime, CFEqual, start creeping up in my profiles.

This is, of course, nonsense. I've been very careful to only append objects to the end of my capture -- it's a time-series data recording, after all. Unless I'm running a "Find" operation, there should be no reason to run around comparing things.

After lunch today I started running into some really odd bugs with NSTreeController, where it appears to be allocating internal NSArrays to keep track of children, and then reading right off the end of them. I know I'm handling my retains and KVO correctly, so this seems to be a flaw in NSTreeController.

So, I've stripped out all the NS*Controller objects. More trouble than they're worth for this application. I've replaced them with lightweight custom implementations that don't call indexOfObject: or crash. I now have an NSTableDataSource and NSOutlineDataSource again -- but they're both very thin, because I had already designed the data model to be presented in tables and trees.
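
For the curious, a thin table data source really is only a few lines. This sketch uses hypothetical names (CSPacketTableDataSource, eventCount, eventAtIndex:), not Cesta's actual classes:

@implementation CSPacketTableDataSource

- (int)numberOfRowsInTableView: (NSTableView *)tableView
{
    return [_capture eventCount];
}

- (id)tableView: (NSTableView *)tableView
    objectValueForTableColumn: (NSTableColumn *)column
    row: (int)row
{
    // The capture is append-only, so row indexes are stable: lookup is
    // O(1), and nothing ever needs indexOfObject:.
    return [[_capture eventAtIndex: row] valueForKey: [column identifier]];
}

@end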

In subsequent profiling, I noticed something odd. Removing NSArrayController and unbinding my packet table resulted in a 400% overall speedup, which I was expecting, but I got another boost in speed from removing NSTreeController, which I was not. It seems to create an NSArrayController-like object behind the scenes, with all its attendant overhead. (Not to mention that both NS*Controllers create callstacks of 40 or so frames, greatly boosting the use of objc_msgSend.)

Thus, my previous post (which read something to the effect of "OMG BINDINGS ARE FAST") has not proven entirely accurate. I would now rephrase that as follows:
1. Bindings, themselves, are fast. I'm still using them extensively to keep my controllers and data sources in sync.
2. NSControllers, the typical use of Bindings, don't seem to scale well to high update rates.

(Edit: After discovering someone actually reading this blog, I cleaned up a couple cases of language that may have been overly pointed.)

Cocoa Bindings: @sum is O(n), even if you're nice to it.

My last post was very positive toward Cocoa Bindings. A week later, you might say the honeymoon is over.

They still do most of what I need, don't get me wrong -- it's the performance. In profiles, whenever Cesta starts using an entire core on my MacBook Pro, Bindings is responsible for 80-90% of it. (The harder stuff, like capturing, decoding, and constraint-checking a loaded Ethernet link, accounts for 2% or less.)

I'm working on isolating each part of the problem, but here's a tidbit for today -- not from Bindings specifically, but from the Key-Value Coding system on which they're built (thanks for the correction, Scott). (Update: I Sharked my way to some of the other Bindings issues in my next post.)

The performance of @sum may surprise you. Specifically, @sum is O(n) in the total size of your array, even if you distribute incremental updates using KVO.

This will come as no surprise if you're familiar with how KVC is implemented, but for those of us (me) hoping for a free ride, it's a bit of a downer.

The backstory:

I wanted a nice label on Cesta's summary view showing the number of bytes captured. Lazy as I am, I decided to use Bindings: I created an NSTextField and bound it to the aggregate property "@sum.capturedSize" on my Capture Events NSArrayController.
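
(In code, rather than Interface Builder, that binding would look something like this; bytesField and eventsController are my stand-ins for the actual outlets:)

// Bind the label's value through the controller's arrangedObjects
// to the @sum collection operator.
[bytesField bind: @"value"
        toObject: eventsController
     withKeyPath: @"arrangedObjects.@sum.capturedSize"
         options: nil];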

Now, I put a lot of work into distributing only incremental updates when new packets are captured. My code diligently calls willChange:valuesAtIndexes:forKey: and didChange:valuesAtIndexes:forKey: with only the changed packets, so that observers don't have to reevaluate the whole array each time. Hell, I even maintain separate state for each thread so that nobody sees new packets before they've received willChange:.

In short, I expected that any observer update that doesn't rely on all the data (like @max would) would be, at worst, O(n) for the number of changes.

I was incorrect, as my debug traces show. @sum is calling capturedSize on every packet, every time.

This is one of three hotspots that cause Cesta's screen refreshes to be O(n) to the number of events captured. The fix? Drop the aggregate function and bind to a scalar property. It's nearly trivial, but I'll include the code here for completeness.

First, we add an ivar for the property, and a getter/setter pair:

@interface CSCaptureDocument : NSDocument {
    ...
    uint64_t _bytesCaptured;
}
...
- (uint64_t)bytesCaptured;
- (void)setBytesCaptured: (uint64_t)bytes;
@end


Then, we add the code for that getter and setter. (Xcode has a button to generate them.)
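
For completeness, the pair is entirely mechanical -- it follows directly from the interface above:

- (uint64_t)bytesCaptured
{
    return _bytesCaptured;
}

- (void)setBytesCaptured: (uint64_t)bytes
{
    // KVO's automatic notification wraps this setter, so the bound
    // label updates on every assignment.
    _bytesCaptured = bytes;
}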

Then, we hook into the packet delivery code:

- (void)frameSource: (CSFrameSource *)source receivedFrame: (CSFrame *)frame
{
    [_capture appendCaptureEvent: frame];
    [self setBytesCaptured: _bytesCaptured + [frame capturedSize]];
}


Then we simply re-bind the "Bytes Captured:" label to the bytesCaptured property of the document, rather than to an aggregate property on the NSArrayController. Voilà -- an instant 15% performance savings.
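
(Programmatically, with the same stand-in names as before, the new binding is just:)

// After the fix: bind straight to the document's scalar property.
// No aggregate operator, so no O(n) walk on every update.
[bytesField bind: @"value"
        toObject: captureDocument
     withKeyPath: @"bytesCaptured"
         options: nil];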

(Edit: as Scott points out in the comments, this behavior makes sense for @sum since there's nowhere to store the accumulator. I've updated some of the text above to clarify that I'm not attacking @sum here, just pointing out my own very inappropriate use of it -- just in case someone else does the same thing and goes Googling for an explanation.)


Friday, March 09, 2007

Cesta lives!

I've been putting in a lot of work on Cesta, my network analyzer, due in no small part to continual goading from clee. (He has a Mac now, and is trying to load it up with shiny Mac software.)



I hadn't touched Cesta's code in over a year, and it was getting pretty crufty even then, so I didn't bother trying to modernize it. Instead, I'm applying a marvelous technique I learned from John Brewer: TONSO, or Take Off and Nuke the Site from Orbit.

(It's the only way to be sure.)

The interesting thing is how fast it's come along. I bootstrapped the core system (with an Ethernet decoder) in an evening of work, while distracted by clee and Ben-from-Apple. Starting from a clean slate really helps you reduce a system to its essentials -- and working in Objective-C didn't hurt either. (Still one of my favorite languages of all time.)

I've rebuilt most of what the last Cesta generation could do, and killed off a number of ingrained-but-incorrect assumptions in the old domain model. We should have feature parity with Wireshark in the not-too-distant future. That's not a slam against the Wireshark devs by any means -- they're very smart; it's a tribute to how easy developing Mac software in Objective-C has become.

We're targeting Tiger-and-later with this rev, whereas the original Cesta worked all the way back to 10.2. This makes things a lot cleaner: Mac OS X really is evolving at a breakneck pace, and while all the old APIs are still supported (and much faster), there are new ways of doing everything that are so much better.

Case in point: Cocoa Bindings. Much of the code in Cesta Of Old was dedicated to shuttling pieces of data onto, and off of, the screen. I'm a user-interface geek, so it had to be blindingly fast, absolutely intuitive, and as fluid as Mac apps are expected to be -- and I paid for my high standards in code.

Now? There is no code.

Using Cocoa Bindings, I wire up my UI widgets to my underlying domain model using Interface Builder, and the runtime figures out how best to move and cache the data. The new Cesta UI is entirely Bindings-driven, leaving the whole user interface as a thin projection of the underlying objects -- from the chooser that lets you pick the network interface(s) to capture from, right up to the three-paned packet list and dissector.

If I have to, I can drop in optimizations where needed (Objective-C's category mechanism keeps it clean) -- but thus far, it hasn't been necessary. Cocoa Bindings are fast -- far faster than I expected -- and they say they'll only get faster with time. They don't break a sweat displaying hundreds of thousands of decoded packets in a table. I am impressed.

(I met the folks behind the Bindings API at a recent Cocoaheads meeting, and not only do they produce great things, they're also very friendly and tolerant of n00b questions.)

Not having to hand-code the user interface has freed me up to work on more interesting parts of the application. In the two weeks that this Cesta codebase has been under development, we've added features that I only dreamed of in the last rev. For example, Cesta can now integrate data from any number of network interfaces in a single capture, and can attach and detach interfaces on the fly. (These changes are even recorded inline, right in the capture, for future reference.)

It's also grown a general framework for finding, highlighting, and explaining problems in packets. If an IP checksum is wrong, for example, it's highlighted in the UI, and the user can see a localized explanation of the issue.

We have found one issue, however. clee is totally new to Cocoa, having just cracked open the Hillegass book, and the Cesta codebase is total Greek to him -- even though he's a very sharp programmer, fluent in half a dozen languages.

I can't blame him. The main difference between this Cesta and the last is that I now fully grok the Cocoa application/document architecture. It took me a couple years of diligent study to get everything right (because some important parts of it are not terribly well-documented -- and I didn't have the Hillegass book). To a newcomer, the control flow is almost entirely opaque. Sure, you can see that CSCaptureDocument seems to be some sort of document that holds a capture, but it's never created anywhere -- and neither are most of the other classes! How on earth do they talk to each other?

The answer, of course, is that a good chunk of the application logic is effectively hidden. It's not in the source code; it's in property list files, in the Interface Builder nibs, in interactions with closed-source frameworks. To a programmer used to programming, well, code, and accustomed to fully open-source systems, Cocoa can be pretty mysterious. To really understand a Cocoa application, particularly one that has drunk as deeply of the Kool-Aid as Cesta has, one must understand all these non-code resources -- and, in some cases, read the API docs on every dependent class, trying to figure out who calls what delegate method and why.

Ironically, the loose coupling that makes control flow hard to understand is also one of Cocoa's greatest strengths. (Though part of the problem -- possibly a large part -- is Xcode, which is firmly lodged about six years behind the IDE state-of-the-art. From a Java perspective, it feels like my 1999 work in Forte, rather than my 2007 work in Eclipse.)

To diverge even further from the alleged topic of this post, this also worries me about my applications at work. We loosely-coupled-object-systems guys are leaning pretty heavily on dependency injection and event-driven applications these days, and it may come to the point where the only way to understand the flow of control is to step through the app in a debugger. Not an ideal solution.

Anyway. Coming back around to Cesta, there's one other point of interest: we'll be open-sourcing this version, most likely under a BSD license. I've been out of real open source development for too long. It'll be good to be back.
