Cliff Hacks Things.: March 2006

Monday, March 20, 2006

Nutshell: my thoughts on static type checking

I've been giving a lot of thought to static type checking in Mongoose — for three years now, in fact.

Static type checking has a bad name these days, with programmers advocating fully dynamically-typed languages for productivity reasons, or to avoid complexity.

I agree with the gist of the argument, but I feel it's based on bad data. Here are my arguments. (I'm not making the "static type systems allow for more optimizations" argument, since Self proved that one wrong 13 years ago.)

1. The popular static type systems are the worst examples.

Ask one of these programmers for an example of a static type system that holds them back. I will virtually guarantee the answer is likely to be C, C#, C++, Java, Pascal, or Delphi.

I will also virtually guarantee the answer is not Ocaml, ML, or Haskell, three strongly-typed languages — stronger even than the C family or Java, as they lack pointer aliasing, casts, and other ways to accidentally goof up your types at runtime. And yet they don't get in the way (though the lack of operator overloading in ML is unfortunate).

There are a number of reasons for this. The ML family provides incredibly good type-inference support: the compiler looks at the operations you're performing on a value and where it came from, and deduces what type you wanted. (In most interactive environments, it will report the types it deduces, so you can see if you screwed up. Yes, the errors are yours, the compiler knows its stuff.) As a result, the ML family (including Ocaml) doesn't require you to pedantically type every variable, like the C family.

Microsoft added a sort of type-inference to C# in the latest version, but it only saves you from declaring your variable types — instead of int x you get to write var x. Woohoo. Compared to ML, this is a toy feature, and it's a really bad fit for the language, in my uneducated country-boy opinion.

2. Type declarations — or type annotations, or whatever you want to call them — are excellent documentation.

Maybe I'm just a bad programmer, but in Ruby, I find myself passing Strings when I really wanted to pass Integers all the damn time. The program usually doesn't die until runtime — but it usually does die, all because I forgot a to_i somewhere in my parsing code.

This is exacerbated by the fact that Strings, in Ruby, overload most arithmetic operators. Rather than dying as soon as I try to add or shift a String, I just get a weird String. (I do like it better than the transparent conversions in Perl, I'll say.)

I have the same issues with Smalltalk. Sometimes, it's just really nice to have documentation on what the expected type of a parameter is, what a function will return, or whether or not something can be null. A lot of Smalltalk code fakes this, by naming the parameters after their expected type — aSmallInteger, for instance. This is really just an overgrown Hungarian notation, and violates one of my basic principles of software design: values should be named for their roles, not their types.

3. Having a tool to check the accuracy of your documentation rocks.

When writing Java in an IDE, I always leave Javadoc checking enabled.

For those of you fortunate enough to have avoided Java, Javadoc-checking in an IDE verifies that your parameters are documented, their declared types are correct, and so forth — in your comments. It verifies that your comments are in sync with the source. This is wonderful, since in Java, comments duplicate the information in the method signatures. It keeps them from getting out of sync.

Type declarations on parameters and results are a form of documentation — and, in a static type system, they're checked by a tool called the compiler.

4. A sufficiently expressive static type system doesn't bind your hands.

On this point, I must return — as I so often do — to Strongtalk.

Strongtalk is a descendant of Smalltalk, a totally dynamic language. One of the real powers of Smalltalk is its genericity: a parameter can be of any type at all, and as long as it responds to the set of messages you send it, your algorithm will work. Pass it an integer, pass it a user-defined currency type, pass it a complex number — it doesn't care, and neither do you.

Strongtalk sports a static type system that preserves this power. First, it's optional: if you honestly want a parameter to accept any type of object, that's fine.

Second, it doesn't confuse "class" with "type." Sure, you can say "this parameter must be a SmallInteger" — but you can also say "this parameter must respond to this set of messages with these types" In other words, you can describe the type in the same way that Smalltalk does (and Ruby does, and Python does, and...) — but in an explicit fashion, and one that the compiler can check for you!

Strongtalk's type system is by far the most powerful and expressive that I've used. It, single-handedly, convinced me that static typing was not a dead end — but rather the path of least resistance.

Something that Strongtalk, near as I can tell, didn't do is infer the types and write the type declarations for you — but there's really nothing preventing that. This is one of the features I'm working into Mongoose, albeit slowly.

5. Explicitness is good.

I like explicit code. (Yeah, I've worked for porn sites, but no jokes, please.) I like explictness of intent, both on the part of the programmer, and on the part of third-party components.

Static type systems are only one step in that direction. Pre/postconditions and loop and class invariants are another. (Really, type systems are a specific case of this, but I feel they're best expressed separately.)

I'm in the very early stages of a contract-inference engine for Mongoose code, which will save the programmer the grunt work of writing pre/postconditions. Stay tuned.

Sunday, March 19, 2006

M2VM status report

M2VM is coming along. It's proving as flexible as I'd hoped — even changes to pretty fundamental structures, like the layout of an activation record, go off without a hitch.

Hooray for metaprogramming, I say.

Some notes:

- My strict adherance to standard C (the GNU99 dialect, anyway, which I consider a standard) has kept portability simple. The current codebase works great on all combinations of {Linux, Mac OS X} x { PPC32, PPC64, x86 }, without any architecture-specific defines or conditional compilation. Of course, it also forces me to fight GCC's decisions in some areas, but I'm working on that.

- Holy crap the Core Duos are fast.

- Basically all of my development right now is in Mongoose; the C layer is mostly finalized, for now. Even metaprogramming stuff like exception handling (which has to interact with the callstack), reflection, and dynamic loading is in Mongoose. It's fun like Forth, but with better abstractions.

- My Module abstraction seems to be good. Modules contain classes and/or other modules, and serve as the largest granule of access control and scoping. You can provide a filter when you load a Module, to prevent it from accessing, say, the reflection classes.

The exception-handling facilities are coming along. I've put together a brief (and possibly cryptic) slide stack explaining the Mongoose exception mechanisms (244k PDF), explaining the various nuances. (It supports Java-style unwind-and-abort handling, PL/1-style resumption, and a few other options.)

Tuesday, March 14, 2006

Indirect threading is...slower?

Well, that was unexpected.

M2VM so far has been using the switched-interpreter mode of my VM generator.

As of yesterday, my profiler no longer says the GC is the bottleneck — now, it's the interpreter loop. So, logically, I've started to optimize it.

As a first step, I switched the VM generator to produce indirect-threaded code — simply looking up the address of each bytecode's code, using an array of addresses (one for each bytecode). I expected this to be faster.

It's about 10% slower on my benchmarks. (It's about the same speed at -O1, but at higher levels it gets slower and slower.)

I'll drill down with some profiling to see if I can find the cause, but this kinda blows my mind. The generated code for the switch statement was the hottest in the module, so I assumed that, by getting it out of the way, things would speed up.

Hrm.

Update: This is only true on the G4/G5. On an Intel Core Duo system, the indirect threaded interpreter is an instant 10% speed boost. However, I've found an issue with GCC's code generation that's ruining my indirect branch prediction rate — which may explain the issues on the G5.

Monday, March 13, 2006

Politeness

I've been thinking more about the Politeness Principle I mentioned in the last entry, and I think it's pretty descriptive. In this post, I'll illustrate it with an extended metaphor: food service.

First, let's start with a procedural/imperative food service.

Assume for the moment that, in some sort of bizzare Zamyatin fantasy, everyone is identical. The food service folks come to your little cubicle three times a day and stuff food down your throat. You have no say in how much or what you're fed — it is literally stuffed down your throat.

This, as I think we all can agree, is the least polite level of foodservice. It's also basically how most imperative programming styles work: a single process is in control (the program) and it directly modifies data.

Everyone doesn't have to be exactly the same; there can be some variation, as long as it's evident to the program. When people arrive at work, they could receive a colored card, stuck to their forehead, indicating the type of eater they are. A red card, say, might indicate lots of food, while a blue card indicates less food. The food service folks come by, look at the card, and stuff food down your throat.

Now, say you want an amount of food in between the two extremes — or say you're fasting or not hungry that day. You stick a purple card to your forehead.

The food service folks come by, look at your card, and have no well-defined response. If you're lucky, they might get confused and move on to the next person — if you're unlucky, they might stuff all the food down your throat, or remove you from your cube and replace you with a human-shaped stack of mashed potatoes. The foodservice person might even commit ritual suicide at your feet. The behavior, either way, is undefined. In programming, this is called a "tagged type" — it has a tag indicating how it should be dealt with. As in foodservice, a program must be warned of all the possible tags in advance, or undefined (read: usually bad) things may result. Generally this is either lost data, or corrupted data (replaced with mashed potatoes) — or a program crash.

So, we define some new tags for new types of eaters. Say, green for vegetarian, black for fasting, white for no-gluten. Then we need a new color for vegetarian-heavy-eater, no-gluten-heavy-eater, etc. — every combination we want to handle. This will quickly spiral out of control, with the foodservice people needing to carry tomes explaining what to do in any given case. The foodservice job has gotten very complex, because the people are treated as dumb: they are given a tag, and that's the only communication.

At this point, some bright young foodservice luminary might suggest a revolutionary idea: "Why don't we ask the people?" This is basically the paradigm shift that Smalltalk brought to programming: don't put all the intelligence in the process, let the data itself be intelligent. But a first crack at this is almost always wrong.

The foodservice worker arrives at your cube one morning and, without even looking at your tag, says "Would you like one or two helpings of meat?" "Oh, no thanks," you say, and the foodservice worker shoves one helping down your throat. (The vegetarian next door is even less happy about this.) The question was wrong, and you gave an unexpected answer. This is usually what procedural programmers do when let loose in an object-oriented language like Smalltalk: they let the data be intelligent in all the wrong ways.

It's still not very polite. You're not given freedom of choice, and you certainly don't get to eat at your own pace. Sure, you get to answer a question now, but your eating is still being micromanaged.

As a next crack, the foodservice fellow says "Pick one of the items from this list." If one of them suits you, great, and he shoves it down your throat. If none of them are appropriate, well, too bad, you don't get to eat.

A lot of software systems are half-polite. By analogy, the fellow would roll up a buffet of eats, with some prepared dishes and some raw ingredients, perhaps with a small array of cooking utensils. He would wait patiently while you prepared a dish to perfectly suit your liking, or allow you to refuse if you weren't hungry.

And then he would shove it down your throat.

In a slightly more refined system, he would wait until you had prepared the dish, and then move your arm to the fork and guide it into your mouth for each bite. If he were really nice, he would watch for cues (bloating, vomiting) and stop feeding you, but don't count on it.

All this throat-shoving and force-feeding is starting to make you feel a wee bit violated, and that's exactly the term used in software: specifically, this is "violation of encapsulation." Your eating procedures — and, for that matter, your gastrointestinal tract — are your business, and are encapsulated away from the foodservice procedures.

This allows each eater to have variations. Some folks eat slower; some folks (the author included) tend to just tilt the plate back and let the entire meal plummet to its digestive doom. Just as a polite procedure lets you pick and choose your own ingredients, a really polite procedure would just let you freakin' eat.

So, ideally, the buffet would wait as you consumed your meal (or didn't) at your own pace, let you take another helping if required, and then bid you adieu. This would be polite.

The essence of the principle is this:
1. Don't micromanage.
2. Don't poke around in peoples' privates.
3. The easiest way to allow individual variation is to let people do things for themselves.

This can be applied to software design, as I'm doing with Mongoose, by replacing "people" with "software components." In the LinkedList I described in a previous post, I noted that the program doesn't forcefully grab some elements of the list and do actions on them; the process is closer to "Hey, element x, do this, and ask your neighbors to do it too."

As for individual variation, I dropped in an alternative Link implementation, RunLink, that can represent multiple identical elements with a single node. It required no changes to LinkedList, because LinkedList didn't care about the internals of the links — it only sent them messages and worked with their responses.

This is the real power of object-oriented design, and unfortunately it's lost on a lot of folks working in OO. I think it's that they haven't had it explained well. This blog post is off the top of my head, so it continues to be explained poorly, but you get the idea, I hope.

Sunday, March 12, 2006

Tail recursion, micromanagement, and naughty bits

On this fine Sunday afternoon, I've been assembling a LinkedList class for the Mongoose standard library (in the Collections module). It's a pretty standard doubly-linked list — a freshman-level chore that I've done in at least a dozen languages through the years. Recently, good high-level languages have saved me the work by bundling an implementation — but when you create your own language, the giants' shoulders are smaller and less comfortable.

So, let's take the task of getting a particular link within the list. (A Link is an object that knows who its neighbors are, and holds a reference to some value.) Once you can get a link, you can get its value, or navigate around it.

In C, Java, and other procedural languages, you would probably do this with a loop and at least one scratch variable. Loops and scratch variables smell funny to me.

Mongoose calls this method linkAt:, and it's implemented like this:


(-- ... in LinkedList ... --)
@method self linkAt: index [
  (_count = 0) ifTrue: [ ^nil ]. (-- no links to return? --)
  ^_head seekForwardBy: index.
].

It checks if the list contains no elements (as indicated by an instance slot named _count), and returns nil if this is so. Otherwise, it tells the head of the list (a Link named _head) to peek ahead some number of elements and return the neighbor it finds.

Of course, _head, like other Links, only knows its immediate neighbors, so it delegates.


(-- ... in Link ... --)
@method self seekForwardBy: offset [
  (offset = 0) ifTrue: [ ^self ]. (-- that's me! --)
  ^_next seekForwardBy: offset - 1.
].

It checks if the offset is zero, and if so, doesn't do any work at all: it simply returns itself. Otherwise, it pawns the work off on its next neighbor (named, appropriately, _next).

When I wrote this code (with the exception of a couple poor name choices), it just felt good, and I wanted to know why. I think it comes down to one thing:

There is no micromanagement.

In a normal LinkedList implementation (and, I admit, in the first draft of my Mongoose class), the list itself examines each Link, checking who its neighbors are. It controls the navigation; it interrogates every Link, leaving it feeling confused and violated.

This implementation does nothing of the sort: it asks the Link to find out a bit of data. Each Link can then ask its neighbors as appropriate. It comes back to one of the principles espoused by Smalltalk, which I refer to as the Politeness Principle: don't micromanage, or, alternatively, ask objects to do things; don't peek at their naughty bits and decide for them. (On that note, it could be considered the gender equality principle as well.)

As I iteratively refactored LinkedList to make it Polite, I tried to improve politeness while still using a traditional while-loop, thinking it would be more readable for most programmers.

I failed.

Which brings me to my next thought on the matter: tail calls are really the only Polite way to do iteration. There's no one object or routine managing the process; instead, each stage successively asks the next, "Do whatever you think we should with this data," and it can decide whether or not to continue.

Mongoose has optimized tail calls since M1 as its fundamental iteration primitive, but I didn't add them because they were Polite — I added them because they were less conceptually nasty than a goto, which is how most other languages iterate.

On the other hand, Smalltalk, the origin of the Politeness Principle, does not mandate tail-call optimization. In my tests on some current Smalltalks, it doesn't seem common. This blows my mind: it's only proper!

Update: I had misstated the Politeness Principle; I'm looking for a concise yet correct definition now.

Runtime type information and exception handling

The design of Mongoose includes a number of 'little experiments.'

For example, Mongoose doesn't provide a standard way to determine an object's class. This is on purpose; the class of an object is an implementation detail. You can use the reflection API to find out if you really need to know, but chances are, you don't. Instead, you can check an object for conformance to a certain protocol (a set of messages), and the object may very well change its conformance at runtime.

I've managed to implement most of the Mongoose standard library in this fashion, but I've hit a bit of a roadblock: structured exception handling.

Specifically, it's really nice to be able to define a number of exception handlers, specialized on the type of exception encountered. The traditional way to do this in languages like Java is to specialize on the exception's class.

Sure, I could do this other ways — say, check some 'exception type' parameter — but it feels like a workaround, and requires another layer of namespace to avoid collisions.

This will require some thought.

Tuesday, March 07, 2006

Mongoose syntax

As a brief followup to my last post, here's the entire Mongoose language syntax. It's not in a single paragraph for clarity, but it's still quite concise.

Literals

Strings are enclosed in double-quotes. Any double-quotes within the string are escaped with backslash.

Characters are enclosed in single-quotes. A single-quote must be escaped.

Integers can be base 10 (42), base 16 (0x2A), or base 2 (0b101010).

Symbols (used to name messages) begin with a hash mark (#) and can contain any legal message (see below).

Arrays can contain any sequence of expressions, contained in #( ... ).

A normal set of control characters can be used in strings and characters using backslash (\n, \t, \e, etc.), plus Unicode hex escapes (\u002A).

Sending Messages

All messages are sent to a receiver. There are three types of message sends: unary, binary (operator), and keyword.
receiver unaryMessage sends #unaryMessage to receiver.
receiver + argument sends #+ to receiver with an argument (where + can be any sequence of Unicode operator characters)
receiver do: this with: that sends #do:with: to receiver (where this and that are arguments)

Precedence

Unary->Binary->Keyword.
Thus,
4 squared + 2 to: 12 + 8 factorial
is equivalent to
((4 squared) + 2) to: (12 + (8 factorial))
All operators have equal precedence.
Precedence can be overridden using parentheses.

Statements

A statement takes one of three forms: normal, assignment, and return.
A normal statement is simply a message send (which can have message sends as its receiver or arguments, recursively).
An assignment statement consists of a variable, the assignment symbol :=, and a message send, like
foo := bar value.
A return statement consists of the return symbol ^, and a message send, like
^bar value.
Statements are separated by a full stop.

Declaring Variables

Identifiers

Any non-whitespace, non-punctuation Unicode character (plus the underscore) is legal in an identifier. Identifiers must not start with a digit.

Blocks

A block is a set of statements enclosed in square brackets ( [ ... ] ). Blocks may contain any mix of statements and variable declarations. A block can define arguments, which must appear in a declaration at the start of the block; arguments are indicated with colons, eg
[ | :x | x + 2]

Scoping

Scoping is entirely lexical. There are no global variables. Specifically, scopes at this point nest as follows:
class variables -> method arguments -> method locals > block arguments/locals (-> block arguments/locals...)

Reserved Words

Mongoose has three reserved words. self refers to the object whose method is currently executing. super is an alias for self that starts method lookup at the parent class. Self (capitalized, like a class) is a placeholder for the class of self, whatever it may be (in descendants and the like).

self is not a reserved word per se; it can be bound to any identifier within a method, but the compiler currently enforces self. Self (capitalized) is likely to be renamed, due to the potential for confusion with self.

I'm considering a couple additional literal forms, but that's the entire language at the moment.

For writing classes in text format, there's a declarative format that lays out methods and such, but it's extrasyntactic right now.

On Unicode.

I'm a big fan of Unicode.

Well, not Unicode specifically, but character sets that aren't annoyingly limited. Watching software at my last job fumble around with accented characters — much less Kanji — really clubbed this into my head.

Mongoose, since M1 in 2004, has supported Unicode throughout. I'm continuing this in the M2 implementation.

Mostly, it's for clarity. I consider ≠ to be a lot clearer than != or /= or .ne. or whatever other hacks other languages have used. Same with ≤ and the like. As of tonight, the current compiler and runtime support all this — Object understands the ≠ message and does the right thing, and so on. Unicode handles such characters quite well: there's a reserved block (U+2200 through U+22FF) for operators. Any of these characters are now legal Mongoose operators, from the compiler on up.

This gets a lot of grumbling from the old guard, but I'm not really open to grumbling on such matters. Any Mac user can type all these symbols directly from their keyboards, without having to enter hex codes or some nonsense. I've included message synonyms for the keyboard-impaired (!=, <=, etc.), but they're just that: synonyms for the real message.

I've been thinking, over the past few years, about ALGOL syntax. ALGOL originally defined a 'pretty' syntax — the print syntax, iirc — that used subscripts for array access and the like. This was well outside the capability of machines at the time, of course, so the syntax that lived on in its descendants used A[2] instead of A₂.

I think we're getting to the point where such print syntaxes could be made official. A List might use any subscripted expression as an element accessor; a Number might use a superscripted expression as an exponent. As long as it's easy to enter (which, right now, it's not), I think this could be a win.

But that's because I'm insane.

Mongoose is unlikely to support any such thing, since it's the dreaded "orthogonal syntax feature" I'm avoiding. I like that the entire language syntax can be described in a paragraph, including every possible way a method might be invoked.

Wednesday, March 01, 2006

Reinventing the wheel!

I never thought I'd say this, but I kinda wish I were writing in C++.

Seriously. As I write my own hashmap implementation in C, I'm wishing I had access to the STL. Fortunately, this is for the bootstrap loader in M2VM, so it only needs to index the bootstrap classes. There are 14 (see footnote), so if I take a shortcut and make lookups O(n), nothing will catch fire.

M1 avoided this whole issue by hardcoding the bootstrap classes, creating them directly from the C code. This is rather icky and makes development a pain. It's also silly in terms of M2VM, since none of these classes have any distinguished role from the VM's perspective.

Footnote: to start the userspace classloader, written in Mongoose, the M2VM needs the following classes: Object, Class, Nil, Boolean, True, False, SmallInteger, Character, String, Method, Array, ByteArray, MethodContext, and BlockContext. Character is technically optional, but I'm including it to ensure that Strings are fully functional. Object is also technically optional, as Mongoose has no distinguished root class, but it makes the class graph complete.

Unlike most Smalltalk-derived VMs, the Symbol class — which represents unique, interned Strings, for message sends and the like — is not distinguished. It's implemented entirely in userspace using a pretty standard GoF flyweight pattern.

Cliff Hacks Things.