[MLton] Optimizations done by GCC 4.0

Wed, 13 Oct 2004 11:59:54 -0400 (EDT)

Good descriptions of many of these optimizations can be found in:
  Steven S. Muchnick.  _Advanced Compiler Design & Implementation_

> * Scalar replacement of aggregates
>
>   I do not know what they mean by this.

Essentially, replace a C struct (i.e., an aggregate) that is local to a
function by an equivalent set of local variables (i.e., scalars).  Now,
constant and copy propagation that work on scalars can be applied to (what
were) the fields of the struct.  By itself, the transformation doesn't do
much, but it enables other optimizations.

MLton's various flattening passes do something similar for the
whole-program.  Within a single SSA procedure, there isn't much need for
it, since MLton's copy propagation will propagate tuple selects.

> * Value range propagation
>
>   This looked interesting. It is basicly a worklist-algorithm working
>   on a lattice propagating the probabilty of branches being taken. This
>   is to aid in trace-scheduling etc.
>
>   <url http://www.pattosoft.com.au/jason/Papers/ValueRangeProp/>
>
>   is one url on the thing. Nearly 10 years old and I have never heard
>   about it. Something MLton does?

MLton doesn't perform any branch prediction optimizations.

> * Partial redundancy elimination
>
> No clue, but a couple of guesses as to what it is.

PRE combines global common subexpression elimination and loop-invariant
code motion and also some additional code improvements.  MLton does not
perform this optimization, but it really should.  Unfortunately, the
presentations of PRE for SSA don't align well with MLton's notion of SSA;
the mismatch is between phi-functions and indexed variables (in
traditional SSA) and goto-with-arguments (in MLton's SSA).

Some notes to this effect are at:

http://www.mlton.org/pipermail/mlton/2003-March/013738.html
http://www.mlton.org/pipermail/mlton/2003-March/013739.html

Actually, those responses were to a query from Tom Murphy.  He and Brendan
McMahan at CMU went on to do a project that implemented PRE for MLton's
SSA:  http://www-2.cs.cmu.edu/~tom7/ssapre/
Unfortunately, we've never seen the code or benchmark results.

> * Load and store motion
>
> If a loop loads and stores to the same memory address and the generate
> by the load is killed by the store, then do all operation on a register
> instead and hoist the load to before the loop and press the store down
> after the loop.
>
> Seems trivial to implement - wonder if it would give anything.

The localRef optimization in MLton turns reads and writes to a ref cell (a
heap allocated object, hence a memory address) into a local variable.  In
the presence of Stephen's recent refFlattening passes, the above might
make sense.  However, note that it is not a valid optimization in the
presence of threads, when both threads might have access to the same
reference cell.

> * Autovectorization
>
> Detection and use of SIMD instructions for vectorized operations. I
> think it is a quite special optimization and the places one can apply
> it are rather limited. I do not know how much it is needed though.

The MLton SSA IL where we do most of our optimizations doesn't express
SIMD instruction like operations, so we can't really apply the above.

> * Loop interchange
>
> Interchange the way one traverses the memory so the cache plays nicely
> with you. Canonical example is matrix multiplication. I do not know if
> they do blocking optimizations too and treats the registers as another
> cache layer.

Not done in MLton.  Not necessarily semantics preserving when an effectful
function is iterated over memory.

> * Tail recursion by accumulation
>
> Contify!

Partly.  The introduceLoops optimization is what really turns a self-call
into a local loop.  However, it may be the case that only after sufficient
inlining and contification are all the calls to the function pulled into
the body, hence enabling the introduceLoops optimization.

> I was wondering if this applies to MLton at all. Most of them are
> already done, so I am wondering about the things MLton does not do, and
> if it would give anything. I think there might be a reason why they are
> not implemented (too little gain for too much analysis and too
> specialized comes to mind).

The branch prediction and autovectorization fall into the too specialized
category.

I think PRE definitely belongs in MLton; it's just been a matter of
finding the time to really address the mismatch between the traditional
SSA and MLton's SSA.  I think that Tom and Brendan made some great
progress, and I just haven't had the time to really absorb their results.

> Is there a complete description of all passes MLton does? If not, would
> there be a point in having one? I would volunteer to inspect the source
> and write down what I find as a starting point, though my english lacks
> a bit of style (or maybe the style is not recognized as english yet,
> heh). I was thinking of a short description of each pass together with
> relevant book and paper references. I am also aware that I may not have
> the in-depth knowledge needed to do this.

There is an out of date reference at:

http://www.mlton.org/pipermail/mlton/2000-July/008048.html

This is mostly right for the first few items; everything described as
being in a /cps/* file has been moved to the SSA IL, which has simplified
and combined some items.

In the SSA IL, we current do the following:

* removeUnused -- remove unused globals, functions, blocks, arguments,
   constructors, constructor arguments.
* inline -- inline functions (size-based metrics)
* contify -- combine a function with it's single, known continuation
* localFlatten -- replace a tuple with separate variables for each field
  (like scalar replacement of aggregates, but also replaces a heap
  allocation)
* constantPropagation
* useless -- useless thing elimination; removes components of tuples that
  are constants, removes arguments of functions that are constants.
* simplifyTypes -- computes a "cardinality" of each datatype, which is an
  abstraction of the number of values of the datatype.
* polyEqual -- implement polymorphic equality
* introduceLoops -- a function whose self-calls are all tail calls is
  transformed into a local loop
* loopInvariant
* localRef -- turn read and writes of a local (non-escaping) ref cell into
  a local variable.
* flatten -- flatten arguments to constructors, functions, and
  blocks.
* commonArg -- remove arguments to a block when all jumps pass the same
  value
* commonSubexp -- common subepression elimination
* commonBlock -- combine statementless blocks with equivalent transfers
* redundantTests -- simplify conditionals whose results are implied by a
  previous conditional test
* redundant -- ???
* knownCase -- duplicate and simplify case switches when the constructor
  of the scruitinee is known
* deepFlatten -- flatten into mutable fields of objects and vectors.  This
  enables more efficient representations
* refFlatten -- flatten ref cells into a containing object

In addition, there is a shrinker pass that includes a whole family of
compile-time reductions, like
   o #1(a, b) --> a
   o case C x of C y => e  --> let y = x in e
   o constant folding, copy propagation
The shrinker is run after every SSA optimization pass.

There are additional optimizations in the latter passes of the compiler:
 - choosing good representations, including packing multiple objects into
   one machine word.
 - heuristically placing limit checks
 - peephole optimizations in the native codegen