[MLton-devel] common argument optimization
Matthew Fluet
MLton@mlton.org
Mon, 24 Mar 2003 18:18:20 -0500 (EST)
In trying to bring the new functorized IO routines up to par with the
old routines, I'd been looking at the SSA for some of the hot loops
and was struck by a number of instances of Goto transfers that passed
the same arguments to the same label; e.g.
l1 () = [... z1 = ? ...] Goto l3(x, y, z1)
l2 () = [... z2 = ? ...] Goto l3(x, y, z2)
l3 (a, b, c) = ...
This code can be simplified to:
l1 () = [... z1 = ? ...] Goto l3(z1)
l2 () = [... z2 = ? ...] Goto l3(z2)
l3 (c) = [a = x; b = y; ...]
which saves a number of resources: the time to set up the arguments
for the jump to l3, the space (either stack slots or pseudo-registers)
for the arguments of l3, etc. It may also expose other optimizations,
if more information is known about x or y.
I thought it would be pretty straightforward to write a quick little
optimization to clean up these examples. The simplest analysis I
could think of maintains varInfo: Var.t -> Var.t option list ref,
initialized to [].
For each variable v bound in a Statement.t or in the SSA Function.t
args, List.push (varInfo v, NONE).
For each Goto l(x1, ..., xn) where l(a1, ..., an),
List.push (varInfo ai, SOME xi) for each i.
For each block argument a used in an unknown context (e.g., arguments
of blocks used as continuations, handlers, arith success, runtime
return, or case switch labels), List.push (varInfo a, NONE).
Now, any block argument a such that varInfo a = xs, where all of the
elements of xs are equal to SOME x, can be optimized by setting a = x
at the beginning of the block and dropping the argument from Gotos.
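For concreteness, here is a minimal sketch of that analysis in SML,
using only the Basis Library and modeling variables as strings; the
names (info, bound, goto, unknown, replaceable) and the
association-list representation are illustrative, not MLton's actual
SSA interface:

(* varInfo v collects one entry per flow into v: NONE for an unknown
 * binding, SOME x when x flows into v as an actual of a Goto. *)
type var = string
val varInfo : (var * var option list ref) list ref = ref []
fun info (v : var) : var option list ref =
   case List.find (fn (v', _) => v = v') (!varInfo) of
      SOME (_, r) => r
    | NONE => let val r = ref []
              in varInfo := (v, r) :: !varInfo; r
              end
fun push (r, x) = r := x :: !r
(* v bound in a Statement.t or in the SSA Function.t args *)
fun bound v = push (info v, NONE)
(* Goto l(x1, ..., xn) where l's formals are (a1, ..., an) *)
fun goto (actuals, formals) =
   ListPair.app (fn (x, a) => push (info a, SOME x)) (actuals, formals)
(* block argument a used in an unknown context *)
fun unknown a = push (info a, NONE)
(* a can be replaced by x iff every entry in varInfo a is SOME x *)
fun replaceable (a : var) : var option =
   case !(info a) of
      SOME x :: rest =>
         if List.all (fn SOME x' => x = x' | NONE => false) rest
            then SOME x
         else NONE
    | _ => NONE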
That takes care of the example above. We can clearly do slightly
better by changing the transformation criterion to the following: any
block argument a such that varInfo a = xs, where every element of xs
is equal to SOME x *or* to SOME a, can be optimized by setting a = x
at the beginning of the block and dropping the argument from Gotos.
This optimizes a case like:
l1 () = [... z1 = ? ...] Goto l3(x, y, z1)
l2 () = [... z2 = ? ...] Goto l3(x, y, z2)
l3 (a, b, c) = [... w = ? ...] case w of true => l4 | false => l5
l4 () = [...] Goto l3(a, b, w)
l5 () = ...
where a common argument is passed to a loop (and is invariant through
the loop). Of course, the loopInvariant optimization pass would
normally introduce a local loop and essentially reduce this to the
first example, but I have seen this pattern in practice, which
suggests that some optimizations after loopInvariant do enough
simplification to introduce (new) loop-invariant arguments.
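In the sketch above, only the final check changes; a hedged version
of the relaxed criterion might read:

(* Relaxed check: a can be replaced by x iff every entry in varInfo a
 * is SOME a itself (a loop passing a along unchanged) or SOME x for
 * a single common x. *)
fun replaceable' (a : var) : var option =
   let
      val entries = !(info a)
      (* every flow must be known (no NONE entries) ... *)
      val allKnown = List.all Option.isSome entries
      (* ... and the flows other than a itself must agree on one x *)
      val others =
         List.mapPartial
            (fn SOME x => if x = a then NONE else SOME x
              | NONE => NONE)
            entries
   in
      case (allKnown, others) of
         (true, x :: rest) =>
            if List.all (fn x' => x' = x) rest then SOME x else NONE
       | _ => NONE
   end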
However, the above analysis and transformation doesn't cover the cases
where eliminating one common argument exposes the opportunity to
eliminate other common arguments. For example:
l1 () = [...] Goto l3(x)
l2 () = [...] Goto l3(x)
l3 (a) = [...] Goto l5(a)
l4 () = [...] Goto l5(x)
l5 (b) = ...
One pass of analysis and transformation would eliminate the argument
to l3 and rewrite the Goto l5(a) to Goto l5(x), thereby exposing the
opportunity to eliminate the common argument to l5.
The interdependency between the arguments to l3 and l5 suggests
performing some sort of fixed-point analysis. This analysis is
relatively simple:
maintain varInfo: Var.t -> VarLattice.t, where
VarLattice.t ~=~ Bot | Point of Var.t | Top (but as implemented by the
FlatLattice functor with a lessThan list and value ref under the
hood), initialized to Bot.
For each variable v bound in a Statement.t or in the SSA Function.t
args, VarLattice.<= (Point v, varInfo v).
For each Goto l(x1, ..., xn) where l(a1, ..., an),
VarLattice.<= (varInfo xi, varInfo ai) for each i.
For each block argument a used in an unknown context,
VarLattice.<= (Point a, varInfo a).
Now, any block argument a such that varInfo a = Point x can be
optimized by setting a = x at the beginning of the block and dropping
the argument from Gotos.
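Here is a self-contained sketch of such a lattice in SML, in the
spirit of MLton's FlatLattice functor (which keeps a lessThan list
and a value ref under the hood); the names and representation here
are assumptions for illustration, not MLton's actual interface:

datatype 'a elt = Bot | Point of 'a | Top
datatype 'a t = T of {value : 'a elt ref,
                      (* cells to raise whenever this one rises *)
                      lessThan : 'a t list ref}
fun new () : 'a t = T {value = ref Bot, lessThan = ref []}
fun value (T {value, ...}) = !value
(* raise a cell to at least e, propagating to dependent cells;
 * values only move up and we propagate only on change, so cycles
 * terminate *)
fun up (eq : 'a * 'a -> bool) (T {value, lessThan}, e : 'a elt) : unit =
   let
      val joined =
         case (!value, e) of
            (old, Bot) => old
          | (Bot, e') => e'
          | (Point x, Point y) => if eq (x, y) then Point x else Top
          | _ => Top
      val changed =
         case (!value, joined) of
            (Bot, Bot) => false
          | (Bot, _) => true
          | (Point _, Top) => true
          | _ => false
   in
      if changed
         then (value := joined
               ; List.app (fn c => up eq (c, joined)) (!lessThan))
      else ()
   end
(* the constraints above map onto this sketch as:
 *   VarLattice.<= (Point v, varInfo v)    ~~>  up eq (cell v, Point v)
 *   VarLattice.<= (varInfo x, varInfo a)  ~~>  lessEq eq (cell x, cell a) *)
fun lessEq eq (T {value, lessThan}, b) : unit =
   (lessThan := b :: !lessThan
    ; up eq (b, !value))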
Now, with the last example, we introduce the ordering constraints:
varInfo x <= varInfo a
varInfo a <= varInfo b
varInfo x <= varInfo b.
Assuming that varInfo x = Point x, then we get varInfo a = Point x and
varInfo b = Point x, and we optimize the example as desired.
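Using the lattice sketch above on this example's constraints (string
variables, structural equality), that plays out as:

val eq = (op =) : string * string -> bool
val vx : string t = new ()
val va : string t = new ()
val vb : string t = new ()
val () = up eq (vx, Point "x")   (* x is bound:  Point x <= varInfo x *)
val () = lessEq eq (vx, va)      (* Goto l3(x):  varInfo x <= varInfo a *)
val () = lessEq eq (va, vb)      (* Goto l5(a):  varInfo a <= varInfo b *)
val () = lessEq eq (vx, vb)      (* Goto l5(x):  varInfo x <= varInfo b *)
(* now value va = Point "x" and value vb = Point "x" *)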
But, that is a rather weak assumption. It's quite possible for
varInfo x = Top. For example, consider:
g1 () = [... n = 1 ...] Goto l0(n)
g2 () = [... m = 2 ...] Goto l0(m)
l0 (x) = ...
l1 () = [...] Goto l3(x)
l2 () = [...] Goto l3(x)
l3 (a) = [...] Goto l5(a)
l4 () = [...] Goto l5(x)
l5 (b) = ...
Now varInfo x = varInfo a = varInfo b = Top. What went wrong here?
When varInfo x went to Top, it got propagated all the way through to a
and b, and prevented the elimination of any common arguments. What
we'd like to do instead is, when varInfo x goes to Top, to propagate
Point x -- we have no hope of eliminating x, but if we hold x
constant, then we have a chance of eliminating arguments for which x
is passed as an actual.
Does anyone see where this is going yet? Pausing for a little
thought, I realized that I had once before tried proposing this kind
of "fix" to a fixed-point analysis -- when we were first investigating
the contification optimization in light of John Reppy's CWS paper. Of
course, that "fix" failed because it defined a non-monotonic function
and I couldn't take the fixed point. But, Stephen suggested a
dominator based approach, and we were able to show that, indeed, the
dominator analysis subsumed both the previous call based analysis and
the cont based analysis. And, a moment's reflection reveals further
parallels: when varInfo: Var.t -> Var.t option list ref, we have
something analogous to the call analysis, and when
varInfo: Var.t -> VarLattice.t, we have something analogous to the
cont analysis. Maybe there is something analogous to the dominator
approach (and therefore superior to the previous analyses).
And this turns out to be the case:
Construct the graph G as follows:
nodes(G) = {Root} U Var.t
edges(G) = {Root -> v | v bound in a Statement.t or
in the SSA Function.t args} U
{xi -> ai | Goto l(x1, ..., xn) where l(a1, ..., an)} U
{Root -> a | a is a block argument used in an unknown context}
Let idom(x) be the immediate dominator of x in G with root Root.
Now, any block argument a such that idom(a) = x <> Root can be
optimized by setting a = x at the beginning of the block and dropping
the argument from Gotos.
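Here is a sketch of the dominator side in SML, again self-contained
and illustrative rather than MLton's actual graph library. It
assumes the variables have been numbered 1 through n-1, with 0 as
Root, and succ encoding the three edge sets above; immediate
dominators are computed with the simple iterative algorithm of
Cooper, Harvey, and Kennedy (one plausible choice; MLton's own
dominator computation may differ):

(* nodes are ints; 0 is Root; succ u lists the edges out of u *)
fun dominators (n : int, succ : int -> int list) : int option Array.array =
   let
      (* postorder numbering via DFS from Root *)
      val post = Array.array (n, ~1)
      val visited = Array.array (n, false)
      val counter = ref 0
      val rpoRev = ref []
      fun dfs u =
         if Array.sub (visited, u) then ()
         else (Array.update (visited, u, true)
               ; List.app dfs (succ u)
               ; Array.update (post, u, !counter)
               ; counter := !counter + 1
               ; rpoRev := u :: !rpoRev)
      val () = dfs 0
      val rpo = !rpoRev   (* reverse postorder; Root is first *)
      (* predecessor lists, restricted to reachable nodes *)
      val preds = Array.array (n, [] : int list)
      val () =
         List.app
            (fn u =>
                List.app
                   (fn v => Array.update (preds, v, u :: Array.sub (preds, v)))
                   (succ u))
            rpo
      val idom = Array.array (n, NONE : int option)
      val () = Array.update (idom, 0, SOME 0)
      (* walk two candidate dominators up to their common ancestor *)
      fun intersect (u, v) =
         if u = v then u
         else if Array.sub (post, u) < Array.sub (post, v)
            then intersect (valOf (Array.sub (idom, u)), v)
         else intersect (u, valOf (Array.sub (idom, v)))
      fun pass () =
         List.foldl
            (fn (b, changed) =>
                if b = 0 then changed
                else
                   case List.filter (fn p => isSome (Array.sub (idom, p)))
                                    (Array.sub (preds, b)) of
                      [] => changed
                    | p :: ps =>
                         let val d = List.foldl intersect p ps
                         in if Array.sub (idom, b) = SOME d
                               then changed
                            else (Array.update (idom, b, SOME d); true)
                         end)
            false rpo
      fun fix () = if pass () then fix () else ()
   in
      fix (); idom
   end

A block argument a is then optimizable exactly when its node's entry
in idom is SOME d for some variable node d other than Root (node 0).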
Furthermore, experimental evidence suggests (and I'm fairly confident
that a formal presentation could prove) that the dominator analysis
subsumes the "call" and "cont" based analyses in this context as well
and that the dominator analysis gets "everything" in one go.
I must admit, I was rather surprised at this progression and final
result. At the outset, I never would have thought of a connection
between contification and common argument optimizations. They would
seem to be two completely different optimizations. Although, this may
not really be the case. As one of the reviewers of the ICFP paper
said:
I understand that such a form of CPS might be convenient in some
cases, but when we're talking about analysing code to detect that
some continuation is constant, I think it makes a lot more sense to
make all the continuation arguments completely explicit.
I believe that making all the continuation arguments explicit will
show that the optimization can be generalized to eliminating
constant arguments, whether continuations or not.
What I think the common argument optimization shows is that the
dominator analysis does slightly better than the reviewer puts it: we
find more than just constant continuations, we find common
continuations. And I think this is further justified by the fact that
I have observed commonArg eliminate some env_X arguments, which would
appear to correspond to determining that, while the closure being
executed isn't constant, it is at least the same as the closure being
passed elsewhere.
At first, I was curious whether or not we had missed a bigger picture
with the dominator analysis. When we wrote the contification paper, I
assumed that the dominator analysis was a specialized solution to a
specialized problem; we never suggested that it was a technique suited
to a larger class of analyses. After initially finding a connection
between contification and common argument (and thinking that the only
connection was the technique), I wondered if the dominator technique
really was applicable to a larger class of analyses. That is still a
question, but after writing up the above, I'm suspecting that the
"real story" is that the dominator analysis is a solution to the
common argument optimization, and that the contification optimization
is specializing common argument to the case of continuation arguments
(with a different transformation at the end). (Note that a
whole-program, inter-procedural common argument analysis doesn't
really make sense (in our SSA IL), because the only way of passing
values between SSA functions is as arguments. (Unless, of course, the
common argument is also a constant argument, in which case constant
propagation could lift it to a global.) The inter-procedural
contification optimization works out because there we move the
function to the argument.)
Anyway, it's still unclear to me whether or not the dominator-based
approach solves other kinds of problems. When Tom Murphy mentioned
PRE last week, I thought again about my previous investigations and
the problems I mentioned in my reply, and I'm struck by a potential
similarity: I kept having thoughts along the lines of "well, if it
turns out that the expressions bound to x and y are the same and
redundant, and we have l(x) and l(y) transfers to block l(a), then we
should propagate that redundant expression through to a; but if they
turn out different, then we can't do anything to a, though we do want
to allow it to participate in other redundant expressions" -- which is
pretty similar to the thoughts I had with the common argument
optimization.
On the downside, the optimization doesn't have a huge impact on run
time, although it does predictably save some code size. I stuck it
in the optimization sequence between localRef and flatten. I think
it makes sense to add it after introduceLoops and loopInvariant (even
though commonArg gets some things that loopInvariant gets, it doesn't
get all of them). I also think that it makes sense to add it before
commonSubexp, since identifying variables could expose more common
subexpressions. A similar thought applies to redundantTests. I'm
now having second thoughts about placing commonArg before flatten and
localFlatten3. I previously didn't put much thought into it, but it
seems to me that we could have cases where some components of a tuple
used as an argument are common, even though the whole tuple isn't.
Here are the benchmark results. I believe that the horrid compile
time for hamlet is due to the dominator computation on the graph G.
Note that, as I've defined it above (and currently implemented it),
the graph G has a node for _every_ variable in the function and an
edge from Root to every variable bound in a statement. This is
overkill, since we only care about variables used as formals and
actuals. This can easily be accommodated, and it should shrink the
size of the graph G by quite a bit. (Hopefully enough to prevent a
self-compile from thrashing in GC during commonArg.)
Finally, notice that wc-input1 got a decent improvement, which is
encouraging for me because the compiler here is using the new IO,
while the old IO's hot loop didn't exhibit opportunities for common
argument elimination.
MLton0 -- mlton -diag commonArg
MLton1 -- mlton -drop-pass commonArg
compile time (seconds)
benchmark MLton0 MLton1
barnes-hut 2.60 2.53
boyer 4.48 3.97
checksum 0.68 0.70
count-graphs 1.43 1.38
DLXSimulator 3.86 3.48
fft 1.19 1.12
fib 0.68 0.67
hamlet 148.33 46.91
imp-for 0.70 0.70
knuth-bendix 2.37 2.20
lexgen 5.48 4.90
life 1.32 1.25
logic 2.54 2.51
mandelbrot 0.73 0.68
matrix-multiply 0.73 0.73
md5 1.33 1.24
merge 0.70 0.64
mlyacc 23.03 20.00
model-elimination 29.11 19.84
mpuz 0.90 0.90
nucleic 79.47 78.79
peek 1.13 1.04
psdes-random 0.67 0.68
ratio-regions 2.14 1.99
ray 3.55 3.45
raytrace 8.76 9.43
simple 6.02 6.97
smith-normal-form 9.64 8.45
tailfib 0.79 0.77
tak 0.98 1.89
tensor 3.42 3.39
tsp 1.66 1.66
tyan 3.80 3.42
vector-concat 0.73 0.70
vector-rev 0.75 0.73
vliw 14.64 12.75
wc-input1 1.73 2.17
wc-scanStream 2.18 2.09
zebra 4.07 3.76
zern 1.23 1.12
run time (seconds)
benchmark MLton0 MLton1
barnes-hut 48.68 49.70
boyer 55.70 55.62
checksum 47.35 47.35
count-graphs 40.27 43.09
DLXSimulator 92.16 92.12
fft 36.57 37.27
fib 49.50 49.50
hamlet 51.47 53.09
imp-for 51.61 51.62
knuth-bendix 40.01 40.05
lexgen 41.70 41.87
life 52.33 54.92
logic 59.83 60.05
mandelbrot 54.16 54.16
matrix-multiply 30.90 31.07
md5 143.21 143.24
merge 85.04 83.17
mlyacc 40.16 39.97
model-elimination 72.59 72.45
mpuz 38.94 38.94
nucleic 41.46 41.48
peek 35.41 35.41
psdes-random 23.23 23.24
ratio-regions 36.98 36.98
ray 29.74 28.46
raytrace 40.89 42.60
simple 53.88 64.97
smith-normal-form 63.46 59.80
tailfib 46.35 46.19
tak 88.82 89.29
tensor 32.10 32.12
tsp 69.73 68.84
tyan 59.05 58.88
vector-concat 92.75 98.45
vector-rev 117.40 121.98
vliw 47.78 46.55
wc-input1 77.03 86.52
wc-scanStream 168.09 175.03
zebra 48.90 51.03
zern 42.55 46.03
run time ratio (MLton1 / MLton0)
benchmark MLton1
barnes-hut 1.02
boyer 1.00
checksum 1.00
count-graphs 1.07
DLXSimulator 1.00
fft 1.02
fib 1.00
hamlet 1.03
imp-for 1.00
knuth-bendix 1.00
lexgen 1.00
life 1.05
logic 1.00
mandelbrot 1.00
matrix-multiply 1.01
md5 1.00
merge 0.98
mlyacc 1.00
model-elimination 1.00
mpuz 1.00
nucleic 1.00
peek 1.00
psdes-random 1.00
ratio-regions 1.00
ray 0.96
raytrace 1.04
simple 1.21
smith-normal-form 0.94
tailfib 1.00
tak 1.01
tensor 1.00
tsp 0.99
tyan 1.00
vector-concat 1.06
vector-rev 1.04
vliw 0.97
wc-input1 1.12
wc-scanStream 1.04
zebra 1.04
zern 1.08
size (bytes)
benchmark MLton0 MLton1
barnes-hut 123,268 123,332
boyer 137,243 137,235
checksum 50,859 50,867
count-graphs 68,563 69,155
DLXSimulator 109,356 111,164
fft 58,375 59,495
fib 50,859 50,867
hamlet 1,177,316 1,237,564
imp-for 50,851 50,859
knuth-bendix 96,076 96,204
lexgen 168,425 170,633
life 70,715 70,867
logic 110,075 110,091
mandelbrot 50,971 50,979
matrix-multiply 51,427 51,435
md5 64,012 64,020
merge 52,179 52,187
mlyacc 480,393 488,841
model-elimination 602,612 610,172
mpuz 55,771 55,811
nucleic 197,739 197,683
peek 61,812 61,948
psdes-random 51,563 51,571
ratio-regions 69,259 69,267
ray 113,660 116,700
raytrace 218,153 289,097
simple 195,303 195,303
smith-normal-form 194,968 195,792
tailfib 50,659 50,667
tak 51,067 51,075
tensor 117,879 118,039
tsp 68,716 68,788
tyan 115,820 116,076
vector-concat 52,027 52,035
vector-rev 51,235 51,243
vliw 321,269 331,733
wc-input1 75,873 76,145
wc-scanStream 79,409 79,889
zebra 127,468 127,476
zern 56,726 56,798