[MLton] Register allocation in MLton
Wed, 30 Jun 2004 20:06:55 -0400 (EDT)
> My problem is it seems too simple. Andrew Appel provides an algorithm
> in his ''Tiger book'' "Modern compiler implementation in ML", which
> seems to be quite fast, based on worklists and sets represented as
> doubly linked lists to avoid quadratic blowup. Is there a more clever
> way which could be recommended?
Usually, the next place to go after Appel's book is Muchnick's book:
Steven S. Muchnick. _Advanced Compiler Design & Implementation_
Glancing through Chapt. 16, he recommends representing the interference
graph as a combination of an adjacency matrix and adjacency lists.
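For concreteness, here is a rough Python sketch (purely illustrative, not
anyone's production code) of the combined representation Muchnick suggests: a
symmetric bit matrix for O(1) interference tests plus adjacency lists for
fast iteration over a node's neighbors:

```python
# Muchnick-style interference graph: adjacency matrix + adjacency lists.
# The matrix answers "do a and b interfere?" in O(1); the lists let the
# allocator walk a node's neighbors in O(degree) rather than O(n).
class InterferenceGraph:
    def __init__(self, num_temps):
        self.matrix = [[False] * num_temps for _ in range(num_temps)]
        self.adj = [[] for _ in range(num_temps)]

    def add_edge(self, a, b):
        # ignore self-edges and duplicates so the lists stay duplicate-free
        if a != b and not self.matrix[a][b]:
            self.matrix[a][b] = self.matrix[b][a] = True
            self.adj[a].append(b)
            self.adj[b].append(a)

    def interferes(self, a, b):   # O(1) membership test via the matrix
        return self.matrix[a][b]

    def neighbors(self, a):       # O(degree) iteration via the lists
        return self.adj[a]

g = InterferenceGraph(4)
g.add_edge(0, 1)
g.add_edge(1, 2)
assert g.interferes(0, 1) and not g.interferes(0, 2)
assert g.neighbors(1) == [0, 2]
```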
> I know you can allocate smarter if you have an expression. And that
> SML/NJ allocates registers directly on the CPS transformation;
> after closure converting the CPS, they rewrite the CPS such that
> there are always strictly less than N free variables, where N is the
> number of machine registers. So when a new variable is bound, there is
> a free register to assign it to. Quite clever. Since the article was
> old, I expect this is what is done in SML/NJ 110.
You might check out
for some more recent compiler notes on SML/NJ. The whole MLRISC
infrastructure wants to do register allocation as well, so I don't know if
your comments on register allocation on CPS still stand for recent working
versions.
> For JITs there's the Linear scan algorithm which almost looks like
> cheating, but it is easy to grasp and probably rather fast.
Don't know much about these.
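For readers who are curious, the linear-scan algorithm the question refers
to (Poletto and Sarkar's) can be sketched roughly like this; this is just an
illustration of the published algorithm, not anything MLton does, and it
assumes the live intervals have already been computed:

```python
# Linear scan register allocation (Poletto & Sarkar): walk intervals in
# order of start point, keep an "active" list sorted by end point, and
# when out of registers spill the interval that ends furthest away.
def linear_scan(intervals, num_regs):
    """intervals: list of (name, start, end).
    Maps each name to a register number or the string 'spill'."""
    alloc, active = {}, []            # active: intervals currently in registers
    free = list(range(num_regs))
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        # expire intervals whose lifetime ended before this one starts
        for old in [iv for iv in active if iv[2] < start]:
            active.remove(old)
            free.append(alloc[old[0]])
        if free:
            alloc[name] = free.pop()
        elif active and active[-1][2] > end:
            # steal the register of the interval ending furthest in the future
            victim = active.pop()
            alloc[name] = alloc[victim[0]]
            alloc[victim[0]] = "spill"
        else:
            alloc[name] = "spill"
            continue
        active.append((name, start, end))
        active.sort(key=lambda iv: iv[2])   # keep active sorted by end point
    return alloc
```

The appeal is that it is one pass over the sorted intervals, with no
interference graph at all, which is why it is popular for JITs.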
> But I ran out of good ideas when you only have 6 registers available.
> You can spill and coalesce in various ways of course, but are there more
> tricks to be applied?
On SML/NJ's compiler notes page, there is a paper on X86-k32:
Thus, for the x86, the conceptual model of the architecture has been
extended with a set of memory locations that are treated as registers.
The net effect is one where temporaries are computed into memory
locations and passed as arguments to functions. The use of these memory
locations managed in this way can be as fast as using registers, where
indirectly, the register allocation is taking into account the hardware
register renaming mechanism.
> And what does MLton do?
Something that I've come to believe is surprisingly effective given its
naivete. Some notes to put MLton's native backend in perspective with
other code generators (for example, the codegen from Appel's book).
First, the MLton runtime system is shared between both the C codegen and
the native codegen (and hopefully with a bytecode codegen). What this
effectively means is that it is most convenient to decide on stack layout
_before_ any codegen takes over. In particular, we compute all the stack
frame info for each "procedure" (note, by this time, a "procedure" is far
removed from an individual SML function from the source code), including
stack size, GC masks for each frame, etc. To do so, the Machine IL
imagines an abstract machine with an infinite number of (pseudo-)registers
of every size. A liveness analysis determines, for each variable, whether
or not it is live across a point where the runtime system might take over
(for example, any garbage collection point) or a call to another
"procedure". Those that are live go on the stack, while those that aren't
live go into pseudo-registers. From this information, we know all we need
to about each stack frame. On the downside, nothing further on is allowed
to change this stack info; it is set in stone.
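To make that classification concrete, here is a toy sketch (illustrative
Python, made-up names, not MLton's actual code) of the decision the liveness
analysis drives: anything live across a point where the runtime might take
over goes on the ML stack, everything else gets a pseudo-register:

```python
# Classify each variable by whether its live range crosses a point where
# the runtime system may take over (a GC safe point or a call to another
# "procedure"). Those that do must live in a stack frame the runtime can
# see; the rest can live in pseudo-registers.
def classify(live_ranges, runtime_points):
    """live_ranges: var -> (def_point, last_use_point), as program-point ints.
    runtime_points: set of program points where the runtime may take over."""
    placement = {}
    for var, (start, end) in live_ranges.items():
        live_across_runtime = any(start < p < end for p in runtime_points)
        placement[var] = "stack" if live_across_runtime else "pseudo-register"
    return placement

# e.g. with a GC point at 5: x (live 0..9) crosses it, y (live 6..8) does not
assert classify({"x": (0, 9), "y": (6, 8)}, {5}) == {
    "x": "stack", "y": "pseudo-register"}
```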
At this point, we either generate C or generate assembly.
If we generate C, each of those pseudo-registers corresponds to a C local
variable. C is maintaining its own stack in parallel with the stack that
MLton constructed above. Now gcc tries to do register allocation, and it
tries to put as many of these pseudo-registers into physical registers as
possible; when it is forced to spill, it spills to the C stack. Hence, it
doesn't disturb the ML stack.
If we generate assembly, then we need to do a round of physical register
allocation. The first thing to ask is: "where do we spill if we run out
of registers?" Since we can't modify the stack, we need to do something
else. Turns out there is a simple solution, related to what I quoted from
the SML/NJ docs. Recall that values living in those pseudo-registers are
never live across a point where the runtime takes over or where we call
another procedure. Therefore, we can actually share pseudo-registers
between all procedures of a program. What we do in practice is allocate a
global array of memory with enough room to hold the maximum number of
pseudo-registers needed by any procedure. If we ever need to spill a
physical register that is holding a value that corresponds to a
pseudo-register, we can simply spill to this global memory array.
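Sizing that shared spill area is simple; a sketch (illustrative only, with
an assumed slot size):

```python
# One global spill area is shared by every procedure in the program:
# since pseudo-registers are never live across a call to another
# "procedure" or a runtime entry, no two procedures need the area at
# once, so it only has to be as large as the biggest single demand.
def global_spill_area_size(procedures, slot_bytes=8):
    """procedures: name -> number of pseudo-registers that procedure uses.
    slot_bytes is an assumed uniform slot size for the illustration."""
    return max(procedures.values(), default=0) * slot_bytes

assert global_spill_area_size({"f": 3, "g": 7, "h": 2}) == 56
```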
The second thing to recall is that a traditional compiler allows the
register allocator to spill to the stack. In particular, most of the code
that the register allocator is working on is talking about abstract
temporaries, which get mapped to either physical registers or get spilled
to the stack. If anything is moved to or from memory, it is already
explicit in the code.
This isn't quite the case with the native codegen. The ML stack really
corresponds to memory; in particular, each stack slot is represented by an
offset into memory from the stack pointer. However, we would really like
to avoid touching memory every time we need to access a value that happens
to live on the stack; it only got put on the stack because it needs to be
there anytime we enter the runtime. Otherwise, it should be free to live
in a register.
So, the native codegen register allocator essentially corresponds to
considering the Machine IL code with each stack-slot and pseudo-register
operand as a temporary. The only difference is that when it comes time to
spill, we know exactly where to spill to and from -- stack-slot things go
to the stack, while pseudo-reg things go to the global memory array. A
second minor complication is that the Machine IL reuses stack-slots and
pseudo-regs -- so you really need to do live range splitting.
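The splitting itself amounts to renaming at each definition, so that
unrelated values sharing one slot become separate temporaries the allocator
can color independently. A toy sketch for a straight-line block (again,
illustrative, not MLton's code):

```python
# Live range splitting by renaming: each fresh definition of a reused
# stack slot or pseudo-register starts a new versioned temporary, so
# "s" reused for two unrelated values becomes "s.0" and "s.1".
def split_ranges(instrs):
    """instrs: list of ('def' | 'use', location) in one basic block."""
    version, out = {}, []
    for op, loc in instrs:
        if op == "def":
            version[loc] = version.get(loc, -1) + 1
        out.append((op, f"{loc}.{version.get(loc, 0)}"))
    return out

# slot s holds two unrelated values; they become independent temporaries
assert split_ranges([("def", "s"), ("use", "s"), ("def", "s"), ("use", "s")]) \
    == [("def", "s.0"), ("use", "s.0"), ("def", "s.1"), ("use", "s.1")]
```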
At this point, I think a standard graph coloring register allocator could
take over. The fact that most spilling will be to the global memory array
means that the array stays in cache, so accessing in and out of that is
pretty fast. Better than spilling to different stack frames. Of course,
this only handles the integer instructions; the x86 floating point stack
is a different beast altogether. Another SML/NJ compiler note paper
suggests treating the floating point stack as registers and using fxch
instructions to get things into the right place. A further complication
is that many x86 integer instructions work perfectly well with one operand
in memory. At least from a code size perspective, it seems a waste to
move something into a register just to use it once. So, it seems like an
obvious optimization is to try to use spill-candidates directly from
memory. I'm sure there is some literature on something like this.
In any case, though, the native codegen goes for the simplest thing you
could possibly imagine. We have an abstract notion of "memory location".
A stack slot is a memory location; a pseudo-register is a memory location;
heap cells are memory locations. Now, every Machine IL instruction
corresponds to some manipulation on these memory locations. For each
basic block, we greedily assign registers to memory locations; that is, we
keep a virtual mapping from physical registers to the memory locations that
they are holding a value for. We keep further info to decide when the
register and its memory location are holding the same value. We
precompute some "hints" as to what memory locations will be used in the
near future, when a memory location becomes dead, when a memory location
is invalidated by another instruction, what special register a memory
location should be loaded into, etc. Now we just take a linear walk over
the basic block, maintaining this register mapping, emitting appropriate
instructions to load values from memory, or to move values into the right
register. We maintain certain invariants at each basic block boundary,
determining which pseudo-registers may be kept in physical registers.
Every other memory location must be flushed back to memory. A similar
technique handles the floating point stack.
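Stripped of the hints and invariants, the core of that greedy walk looks
something like the following toy (register names, the eviction choice, and
the instruction strings are all made up for illustration):

```python
# Greedy per-basic-block allocation: walk the block keeping a map from
# physical registers to the memory location each currently caches, load
# on demand, write back dirty values when stealing a register, and flush
# everything dirty at the block boundary.
def allocate_block(ops, regs):
    """ops: list of ('read' | 'write', memory_location). Returns emitted code."""
    holds = {}      # physical register -> memory location it currently caches
    dirty = {}      # registers whose value is newer than its memory location
    emitted = []

    def acquire(loc, need_load):
        for r, l in holds.items():       # reuse a register already holding loc
            if l == loc:
                return r
        free = [r for r in regs if r not in holds]
        if free:
            r = free[0]
        else:
            r = next(iter(holds))        # naive eviction: oldest mapping
            if r in dirty:               # write back before stealing it
                emitted.append(f"store {r} -> {holds[r]}")
                dirty.pop(r)
            del holds[r]
        holds[r] = loc
        if need_load:                    # a write overwrites, so no load needed
            emitted.append(f"load {loc} -> {r}")
        return r

    for op, loc in ops:
        r = acquire(loc, need_load=(op == "read"))
        emitted.append(f"{op} {r}")
        if op == "write":
            dirty[r] = True
    for r in dirty:                      # flush at the basic-block boundary
        emitted.append(f"store {r} -> {holds[r]}")
    return emitted
```

The real allocator's hints (future uses, deadness, required registers) would
drive smarter eviction choices than "oldest mapping", but the shape of the
walk is the same.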
Honestly, I'm not incredibly proud of the code. It started off as "get
something implemented by the end of the summer." At that stage, there were
no hints being computed or propagated. Everything was just a greedy
algorithm. Fortunately or unfortunately, it worked well enough. And,
like many software artifacts, it simply became easier to maintain and
improve the existing code than it was to implement something better.
While I don't advocate the technique, there are some interesting bits that
I've never seen discussed elsewhere. For example, neither Appel nor
Muchnick seem to mention that spilling can (sometimes) be done to a global
array. When the spilled temporary is not live across another function
call, then it need not live on the stack. Even if you can't "globalize"
this common spill area across separate compilation units, you should be
able to for a single unit. You pay a little bit of static memory to save
stack space for every invocation of the function. Another interesting bit
seems to be having a "home" memory location for every temporary. That
makes spilling easier, because you know exactly where to spill to and
from. Both of these ideas seem like they could be incorporated into a
graph coloring register allocator. (In fact, having written all this
stuff down, I'm less frightened by the prospect of migrating the native
codegen to a graph coloring register allocator. Not that I'm promising
anything.)
There are some nagging questions, like what could the native codegen
possibly be doing that improves over gcc? I've conjectured that the
native codegen makes better use of alias information than gcc does. (In
particular, from gcc's perspective both the ML stack and the ML heap
alias, so every stack-slot operation conflicts with every heap cell
manipulation.) I wrote up some experiments, with some surprising results.
The whole thread there has some other interesting points.
> And is there anything
> to be aware of which also applies to RISC machines and improves the
As I said before, usually register allocation is done on temporaries
(which don't correspond to physical memory) with explicit memory accesses.
The assumption is that all aliasing has already been taken into account
and been accommodated by the necessary memory movements. If, however, you
find yourself in a situation where you want to do register allocation on
memory locations, then the more accurate your may-aliasing relation, the
better off you will be.
> Reading material is also greatly appreciated.
The ACM SIGPLAN Conference on Programming Language Design and
Implementation is where the academic papers on this subject appear. You
might be interested in a paper that appeared in this year's conference;
while this is a shameless plug for the guys down the hall, it is a very
nice generalization of graph coloring to realistic architectures and helps
bring the theory a little up to speed with the practice:
A Generalized Algorithm for Graph-Coloring Register Allocation
Michael D. Smith, Norman Ramsey and Glenn Holloway
Graph-coloring register allocation is an elegant and extremely popular
optimization for modern machines. But as currently formulated, it does
not handle two characteristics commonly found in commercial
architectures. First, a single register name may appear in multiple
register classes, where a class is a set of register names that are
interchangeable in a particular role. Second, multiple register names
may be aliases for a single hardware register. We present a
generalization of graph-coloring register allocation that handles these
problematic characteristics while preserving the elegance and
practicality of traditional graph coloring. Our generalization adapts
easily to a new target machine, requiring only the sets of names in the
register classes and a map of the register aliases. It also drops
easily into a well-known graph-coloring allocator, is efficient at
compile time, and produces high-quality code.