Excellent GC papers
Now, my experience in formal writing is primarily in psychology, and I know that APA style dictates a certain wordiness — extensive citations of prior related work, etc. At ASU, at least, APA papers were actually prohibited from using quotations, so any referenced material had to be restated or paraphrased. So, I imagine some of the wordiness in CS papers is purely stylistic.
However, every so often I find really to-the-point papers.
My favorite example is Hölzle's 1993 paper on write barriers in Self. I came across this a couple years ago, when I was studying Self. It explains its purpose, summarizes prior work (including concise code snippets), and presents its results in just 6 pages — and the idea it's presenting is non-trivial. Marvelous. (Dr. Hölzle is not in my line of direct supervisors at work, either, so this isn't brown-nosing.)
I'm currently reading a 2001 paper by Fridtjof Siebert discussing root scanning in garbage collectors. It's also delightfully punchy, at 15 pages. I came across it during my research today, and it made me both happy and sad.
Happy, because it lays out (rather more formally) some of the techniques I'm using in M2VM's collector. (Because of M2VM's Smalltalk origins, all trace roots are in heap at well-defined points during execution.)
Sad, because I thought I was all cool for coming up with this idea. According to the paper, I'm at $StateOfTheArt - 4 years. :-)
The good news: Siebert's scheme is tailored to uniprocessor systems, where having a single active thread is acceptable — whereas my variation is explicitly designed for multiprocessor machines (like mine). I'm sure this leap has already been made in the literature, but it's nice to know I'm not entirely insane.
For any interested parties: my approach has threads check a global condition variable at certain synchronization points. If any thread would like a collection to occur, it sets the variable and waits for other threads to reach a synchronization point; each thread flushes registers to heap and decrements a countdown semaphore when they reach synchronize. Once all threads have synchronized, the collector traces the roots and marks reachable objects, and all threads resume while the collection completes.
Because each thread's context information is accessible through a reference in global scope (actually from a pthread TLS slot), finding and tracing the roots is O(n) for n threads. (The full trace phase is still linear for the number of reachable objects, as usual.)
I'm looking at a couple enhancements for parallel tracing and scavenging, and trying to find a good way to apply this to full collections (currently, it's limited to minor, new-gen collections). We'll see.

