Cliff Hacks Things.

Wednesday, November 08, 2006

More on PropellerForth

I mentioned in my last post that I hoped to optimize PropellerForth, and get it running faster than 1.7 million (small) words per second (MWPS). Well, this may not be possible. Here's my thinking:

  • Executing any word requires at least two memory reads: reading the word out of the instruction thread, and reading that word's code field.

  • Most interesting words require at least one additional memory read (such as pulling a second argument off the parameter stack, or reading an inline argument from the instruction thread). Memory writes are less common, since the top of stack is in a register.

  • A memory read can only occur once every 16 cycles.

  • At 80MHz, we get 5 million memory reads per second.

  • At 2-3 memory accesses per word, we max out at 1.67-2 million words per second.

Thus, for an indirect-threaded interpreter on the P8X32 (the current Propeller incarnation), I'm already nudging up against the theoretical maximum.

Of course, each core can only access memory every 16 cycles — so the Propeller can support eight simultaneous 1.7MWPS Forth kernels.

On a traditional architecture, the next step would be to ditch indirect threading for either direct or subroutine threading — either of which potentially eliminates a memory access from every word. However, on the Propeller this won't work:

  • Executable code can live only within the 2KB "cog-local" RAM attached to each core. Thus, all CFAs point into this region, not into shared RAM where the dictionary lives. As a result, a single jump instruction cannot target both a machine code word and a user word — so subroutine threading is out.

  • Likewise, I cannot simply jump to the CFA of a word — first it would have to be loaded into cog-local RAM. Thus, direct threading would probably incur more overhead, not less, from moving each word across address spaces.

I'm investigating other ways of cutting my memory access frequency. Stay tuned.

If the Propeller folks were to ask me for advice on their next architecture (which they haven't), I would make the following suggestions:

  • Indirect addressing of cog-local RAM. The current architecture treats local RAM as a register file, and (as with most register files) won't indirectly address it through a register. This means all task stacks have to live in shared RAM, slowing things down tremendously. (The stacks can actually live in local RAM, but it requires self-modifying code and ends up more expensive than just using the slow RAM.)

  • Make the local and shared RAM address spaces disjoint. Combined with the indirect addressing, this would allow a single instruction to read from either local or shared RAM (possibly incurring greatly increased latency). This would make direct-threaded interpreters possible.

  • Include sign-extending variants of rdword and rdbyte.

  • Orthogonality is all well and good, but I really miss auto-increment addressing. Being able to auto-increment/decrement a register after a rdlong or wrlong would cut the size of my Forth kernel by nearly 25%, and keep me from missing some hub access windows.

Monday, November 06, 2006


Regular readers know my take on the Parallax Propeller microcontroller: neat, 8-core silicon hampered by a platform-specific, very limited IDE (really more of an editor with a Load button) and a programming language I don't much care for.

In the hopes of solving both these issues in one swell foop, I've started work on an ANS Forth for the Propeller — creatively titled PropellerForth. For folks who haven't used Forth, it is simultaneously the most bewildering and liberating programming environment you'll ever use — a computational boot-to-the-head of the same degree as Lisp or Smalltalk. Like these languages, Forth is generally written entirely in itself, and the user can modify even the fundamental operations at runtime in an interactive environment.

The main difference: even modern Forths tend to fit happily in 8K of RAM. (Mine's a little larger than this, now, because I've added in some luxurious libraries.)

Currently, my system image comes in at 8,372 bytes, including all buffers and stacks for the bootstrap task. It executes 1.7 million words (small words) per second at 80Mhz — not a speed demon, but I've still got a lot of room for optimization. It was at 1.4 milion wps yesterday.

The oddities of the Propeller architecture have made this a little interesting:

  • It's an old-style, string-of-addresses indirect threaded Forth. This is necessary because the more modern styles such as direct or subroutine threading assume you can jump to instructions in the dictionary's data space; the Propeller can't.

  • The Propeller has no hardware stacks, or indirect addressing of its fast (local) RAM. Thus, the stack is emulated with slow (shared) RAM, entirely in software. This has required a lot of careful tuning.

  • The Propeller has no hardware peripherals, other than some counters. All I/O, including the serial console, is implemented by the Forth kernel.

  • Forth was designed for multiuser multitasking on single-core machines, and later expanded to multiuser multitasking on SMP machines. As an embedded system, the Propeller needs single-user multitasking on an eight-core chip — eight cores sharing a dictionary and data space. This means things like semaphores all of a sudden appear in the standard library.

Forth is intoxicating. Once the system is bootstrapped (which required a couple kilobytes of assembler, most of which I generated with some scripts — as I am wont to do), development proceeds exponentially, by defining each layer in the system in terms of the previous.

I believe that every aspiring (or experienced) computer programmer should write a Forth at some point, as a sort of rite of passage. It's challenging, but not because it's hard: because it requires some pretty fundamental paradigm shifts. Writing a Forth will bring you many Aha! moments, and watching the language define itself is quite a trip.