[MLton] Optimizations done by GCC 4.0

Jesper Louis Andersen jlouis@mongers.org
Wed, 13 Oct 2004 00:32:18 +0200

I was running through the changes done in GCC 4.0 today. Now they have
worked tree-SSA into their optimization pipeline, so they are beginning
to get up to par with modern compiler technology. Here are a number of
optimizations (code transformations) they mention:

<url http://gcc.gnu.org/gcc-4.0/changes.html>

* Scalar replacement of aggregates

  I do not know what they mean by this. 

* Constant propagation

  MLton already has this, and it is not hard to do in SSA form.
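A tiny C example of the kind of thing constant propagation folds (the
function names are mine, just for illustration):

```c
/* Before: the compiler sees x is the constant 3, propagates it,
 * and folds the arithmetic. */
int cp_before(void) {
    int x = 3;
    int y = x + 4;   /* becomes y = 7 */
    return y * 2;    /* becomes 14 */
}

/* After propagation and folding, the whole body is a constant. */
int cp_after(void) {
    return 14;
}
```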

* Value range propagation

  This looked interesting. It is basically a worklist algorithm working
  on a lattice, propagating the probability of branches being taken.
  This is to aid in trace scheduling etc.

  <url http://www.pattosoft.com.au/jason/Papers/ValueRangeProp/>

  is one URL on the subject. The paper is nearly 10 years old and I had
  never heard of it. Is it something MLton does?
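As I read it, the point is that knowing the range of a value lets the
compiler decide some branches statically. A contrived C sketch (function
name mine):

```c
/* x & 7 is provably in [0, 7], so a range analysis can mark the
 * branch below as always taken and delete the other arm. */
int clamp(int x) {
    int r = x & 7;   /* r is in [0, 7] */
    if (r < 8)       /* always true given the range */
        return r;
    return -1;       /* unreachable, can be deleted */
}
```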

* Partial redundancy elimination

No clue, but I have a couple of guesses as to what it is.
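My best guess, sketched in C (names mine): an expression that is
redundant on some paths but not all gets computed once on every path, so
the recomputation goes away:

```c
/* Before: a + b is computed twice when cond is true. */
int pre_before(int cond, int a, int b) {
    int t = 0;
    if (cond)
        t = a + b;      /* computed on this path only */
    return t + (a + b); /* partially redundant recomputation */
}

/* After: a + b is computed once on every path and reused. */
int pre_after(int cond, int a, int b) {
    int s = a + b;      /* inserted on the path that lacked it */
    int t = cond ? s : 0;
    return t + s;
}
```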

* Load and store motion

If a loop loads from and stores to the same memory address, and the
value produced by the load is killed by the store, then do all the
operations in a register instead: hoist the load to before the loop and
sink the store to after it.

Seems trivial to implement - wonder if it would give anything.
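The transformation as I understand it, in C (names mine):

```c
/* Before: every iteration loads from and stores to *sum. */
void lsm_before(int *sum, const int *a, int n) {
    for (int i = 0; i < n; i++)
        *sum += a[i];
}

/* After: the load is hoisted above the loop, the store is sunk
 * below it, and the loop itself works on a register. */
void lsm_after(int *sum, const int *a, int n) {
    int s = *sum;               /* hoisted load */
    for (int i = 0; i < n; i++)
        s += a[i];
    *sum = s;                   /* sunk store */
}
```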

* Strength reduction

Ah, the good old optimization. I expect MLton to do this, though there
is no pass explicitly mentioning this. (loopInvariant maybe?)
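The classic loop case, sketched in C (names mine): a multiply by an
induction variable becomes a running addition.

```c
/* Before: a multiply per iteration. */
long sr_before(int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += i * 8;      /* induction variable times constant */
    return sum;
}

/* After: the multiply is replaced by an addition each iteration. */
long sr_after(int n) {
    long sum = 0;
    long t = 0;            /* tracks i * 8 */
    for (int i = 0; i < n; i++) {
        sum += t;
        t += 8;            /* strength-reduced: add, not multiply */
    }
    return sum;
}
```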

* Dead store elimination

A store to a location that is never read afterwards is dead, so it can
be removed.
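A minimal C case (name mine):

```c
/* The first store to x is dead: it is overwritten before any read,
 * so the compiler can delete it. */
int dse_example(void) {
    int x;
    x = 1;   /* dead store, removable */
    x = 2;   /* only this store is observable */
    return x;
}
```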

* Dead and unreachable code elimination

MLton already does this a lot.

* Autovectorization

Detection and use of SIMD instructions for vectorizable operations. I
think it is quite a specialized optimization and the places where one
can apply it are rather limited. I do not know how much it matters,
though.
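For reference, the loop shape I believe it targets -- independent
element-wise operations with no cross-iteration dependence (name mine):

```c
/* A loop of this shape is the classic candidate for turning
 * scalar operations into SIMD ones: each iteration is independent,
 * so several elements can be processed per instruction. */
void vec_add(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* no dependence between iterations */
}
```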

* Loop interchange

Interchange the order in which nested loops traverse memory so the
cache plays nicely with you. The canonical example is matrix
multiplication. I do not know whether they also do blocking
optimizations and treat the registers as another cache layer.
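A small C illustration of the traversal-order point (names mine): both
functions compute the same sum, but the interchanged one walks a
row-major array in memory order.

```c
#define N 4

/* Before: column-major walk of a row-major array --
 * stride-N accesses that fight the cache once N is large. */
long sum_colmajor(int m[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];       /* stride-N access */
    return s;
}

/* After interchange: the same sum, walking memory in order. */
long sum_rowmajor(int m[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];       /* stride-1 access */
    return s;
}
```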

* Tail recursion by accumulation
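I assume this is the classic transformation: a recursion whose result
is built up after the call gets an accumulator parameter, so the call
ends up in tail position and can compile to a loop. A C sketch (names
mine):

```c
/* Before: not a tail call -- the addition happens after the
 * recursive call returns, so each call needs a stack frame. */
long sum_to(long n) {
    if (n == 0) return 0;
    return n + sum_to(n - 1);
}

/* After: the pending addition moves into an accumulator, so the
 * recursive call is in tail position and needs no stack growth. */
long sum_acc(long n, long acc) {
    if (n == 0) return acc;
    return sum_acc(n - 1, acc + n);
}
```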



I was wondering whether any of this applies to MLton. Most of these are
already done, so I am wondering about the ones MLton does not do, and
whether they would gain anything. There might be a reason why they are
not implemented (too little gain for too much analysis, and too
specialized, come to mind).

Is there a complete description of all passes MLton does? If not, would
there be a point in having one? I would volunteer to inspect the source
and write down what I find as a starting point, though my English lacks
a bit of style (or maybe the style is not recognized as English yet,
heh). I was thinking of a short description of each pass together with
relevant book and paper references. I am also aware that I may not have
the in-depth knowledge needed to do this.