More on PropellerForth
I mentioned in my last post that I hoped to optimize PropellerForth, and get it running faster than 1.7 million (small) words per second (MWPS). Well, this may not be possible. Here's my thinking:
- Executing any word requires at least two memory reads: reading the word out of the instruction thread, and reading that word's code field.
- Most interesting words require at least one additional memory read (such as pulling a second argument off the parameter stack, or reading an inline argument from the instruction thread). Memory writes are less common, since the top of stack is in a register.
- A memory read can only occur once every 16 cycles.
- At 80MHz, we get 5 million memory reads per second.
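- At 2-3 memory accesses per word, we max out at 1.67-2.5 million words per second, and since most interesting words need the third read, the practical ceiling sits near the low end of that range.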
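The arithmetic in the list above is easy to check mechanically. The following is a minimal sketch, not a measurement: the constants are the figures quoted in the list, and the function name `max_words_per_second` is made up for illustration.

```python
CLOCK_HZ = 80_000_000       # clock rate quoted in the list above
CYCLES_PER_HUB_READ = 16    # one memory read per 16 cycles, per the list above


def max_words_per_second(reads_per_word):
    """Upper bound on word rate if every word costs `reads_per_word` reads."""
    reads_per_second = CLOCK_HZ / CYCLES_PER_HUB_READ
    return reads_per_second / reads_per_word


for reads in (2, 3):
    rate = max_words_per_second(reads) / 1e6
    print(f"{reads} reads/word: {rate:.2f} MWPS")
```

This bound counts only memory reads. It ignores the instructions executed between reads, so a real kernel can only come in at or below it.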
Thus, for an indirect-threaded interpreter on the P8X32 (the current Propeller incarnation), I'm already nudging up against the theoretical maximum.
Of course, that 16-cycle limit applies to each core separately: the hub serves the eight cores in turn, so the Propeller can support eight simultaneous 1.7MWPS Forth kernels.
On a traditional architecture, the next step would be to ditch indirect threading for either direct or subroutine threading — either of which potentially eliminates a memory access from every word. However, on the Propeller this won't work:
- Executable code can live only within the 2KB "cog-local" RAM attached to each core. Thus, all CFAs point into this region, not into shared RAM where the dictionary lives. As a result, a single jump instruction cannot target both a machine code word and a user word — so subroutine threading is out.
- Likewise, I cannot simply jump to the CFA of a word — first it would have to be loaded into cog-local RAM. Thus, direct threading would probably incur more overhead, not less, from moving each word across address spaces.
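To make the two address spaces concrete, here is a toy model of an indirect-threaded inner interpreter. It is a sketch under loud assumptions and not PropellerForth's actual code: the `ToyCog` class, the three-word dictionary, and the memory layout are all invented for illustration. Shared RAM is a Python list, code fields hold indexes into a table of "cog-local" routines, and only thread and code-field reads are counted (parameter stack traffic is not modelled).

```python
class ToyCog:
    """Toy model: `hub` is shared RAM, `cog` is the table of local routines."""

    def __init__(self, hub):
        self.hub = hub          # shared RAM, one cell per list element
        self.hub_reads = 0      # shared-RAM reads issued so far
        self.ip = 0             # instruction pointer into the thread
        self.stack = []         # parameter stack (not counted as hub traffic)
        self.running = True
        # Code fields in shared RAM hold indexes into this local table.
        self.cog = [self.do_lit, self.do_add, self.do_halt]

    def read_hub(self, addr):
        self.hub_reads += 1
        return self.hub[addr]

    def do_lit(self):
        # Inline argument: one extra shared-RAM read for this word.
        self.stack.append(self.read_hub(self.ip))
        self.ip += 1

    def do_add(self):
        b = self.stack.pop()
        a = self.stack.pop()
        self.stack.append(a + b)

    def do_halt(self):
        self.running = False

    def run(self, start):
        self.ip = start
        while self.running:
            word = self.read_hub(self.ip)   # read 1: next cell of the thread
            self.ip += 1
            cfa = self.read_hub(word)       # read 2: that word's code field
            self.cog[cfa]()                 # dispatch into local code


# Hypothetical layout: cells 0-2 are the code fields of LIT, +, and HALT;
# the thread "LIT 2 LIT 3 + HALT" starts at cell 3.
hub = [0, 1, 2,  0, 2, 0, 3, 1, 2]
cog = ToyCog(hub)
cog.run(start=3)
print(cog.stack, cog.hub_reads)   # prints [5] 10
```

Four words cost ten shared-RAM reads here, or 2.5 per word. The dispatch line is where the constraint bites: the code field can only name something that already lives in the local table, so the thread itself can never be executed directly.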
I'm investigating other ways of cutting my memory access frequency. Stay tuned.
If the Propeller folks were to ask me for advice on their next architecture (which they haven't), I would make the following suggestions:
- Indirect addressing of cog-local RAM. The current architecture treats local RAM as a register file, and (as with most register files) won't indirectly address it through a register. This means all task stacks have to live in shared RAM, slowing things down tremendously. (The stacks can actually live in local RAM, but it requires self-modifying code and ends up more expensive than just using the slow RAM.)
- Make the local and shared RAM address spaces disjoint. Combined with the indirect addressing, this would allow a single instruction to read from either local or shared RAM (possibly incurring greatly increased latency). This would make direct-threaded interpreters possible.
- Include sign-extending variants of rdword and rdbyte.
- Orthogonality is all well and good, but I really miss auto-increment addressing. Being able to auto-increment/decrement a register after a rdlong or wrlong would cut the size of my Forth kernel by nearly 25%, and keep me from missing some hub access windows.
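For the sign-extension request, the work a sign-extending read would save looks roughly like this. The sketch assumes, as the suggestion implies, that rdword and rdbyte deliver zero-extended values; the helper name `sign_extend` is made up for illustration, and Python stands in for the extra instructions a kernel would otherwise spend after each read.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1              # keep only the bits that were read
    return (value ^ sign_bit) - sign_bit  # flip the sign bit, then re-bias


print(sign_extend(0xFFFF, 16))   # -1
print(sign_extend(0x7FFF, 16))   # 32767
print(sign_extend(0x80, 8))      # -128
```

On real hardware this fix-up is typically a shift left followed by an arithmetic shift right, paid on every signed 16-bit or 8-bit fetch.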