[MLton-devel] common argument optimization
Matthew Fluet
MLton@mlton.org
Mon, 24 Mar 2003 18:18:20 -0500 (EST)
In trying to bring the new functorized IO routines up to par with the
old routines, I'd been looking at the SSA for some of the hot loops
and was struck by a number of instances of Goto transfers that passed
the same arguments to the same label; e.g.
l1 () = [... z1 = ? ...] Goto l3(x, y, z1)
l2 () = [... z2 = ? ...] Goto l3(x, y, z2)
l3 (a, b, c) = ...
This code can be simplified to:
l1 () = [... z1 = ? ...] Goto l3(z1)
l2 () = [... z2 = ? ...] Goto l3(z2)
l3 (c) = [a = x; b = y; ...]
which saves a number of resources: the time to set up the arguments
for the jump to l3, the space (either stack slots or pseudo-registers)
for the arguments of l3, etc. It may also expose other optimizations,
if more information is known about x or y.
I thought it would be pretty straightforward to write a quick little
optimization to clean up these examples. The simplest analysis I
could think of maintains varInfo: Var.t -> Var.t option list ref,
initialized to [].
For each variable v bound in a Statement.t or in the SSA Function.t
args, List.push (varInfo v, NONE).
For each Goto l(x1, ..., xn) where l(a1, ..., an),
List.push (varInfo ai, SOME xi) for each i.
For each block argument a used in an unknown context (e.g., arguments
of blocks used as continuations, handlers, arith success, runtime
return, or case switch labels), List.push (varInfo a, NONE).
Now, any block argument a such that varInfo a = xs, where all of the
elements of xs are equal to SOME x, can be optimized by setting a = x
at the beginning of the block and dropping the argument from Gotos.
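For concreteness, here is a minimal sketch of that analysis in SML,
using only the Basis Library and modeling variables as strings; the
names (info, bound, goto, unknown, replaceable) and the
association-list representation are illustrative, not MLton's actual
SSA interface:

(* varInfo v collects one entry per flow into v: NONE for an unknown
 * binding, SOME x when x flows into v as an actual of a Goto. *)
type var = string
val varInfo : (var * var option list ref) list ref = ref []
fun info (v : var) : var option list ref =
   case List.find (fn (v', _) => v = v') (!varInfo) of
      SOME (_, r) => r
    | NONE => let val r = ref []
              in varInfo := (v, r) :: !varInfo; r
              end
fun push (r, x) = r := x :: !r
(* v bound in a Statement.t or in the SSA Function.t args *)
fun bound v = push (info v, NONE)
(* Goto l(x1, ..., xn) where l's formals are (a1, ..., an) *)
fun goto (actuals, formals) =
   ListPair.app (fn (x, a) => push (info a, SOME x)) (actuals, formals)
(* block argument a used in an unknown context *)
fun unknown a = push (info a, NONE)
(* a can be replaced by x iff every entry in varInfo a is SOME x *)
fun replaceable (a : var) : var option =
   case !(info a) of
      SOME x :: rest =>
         if List.all (fn SOME x' => x = x' | NONE => false) rest
            then SOME x
         else NONE
    | _ => NONE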
That takes care of the example above. We can clearly do slightly
better by changing the transformation criterion to the following: any
block argument a such that varInfo a = xs, where every element of xs
is equal to SOME x *or* to SOME a, can be optimized by setting a = x
at the beginning of the block and dropping the argument from Gotos.
This optimizes a case like:
l1 () = [... z1 = ? ...] Goto l3(x, y, z1)
l2 () = [... z2 = ? ...] Goto l3(x, y, z2)
l3 (a, b, c) = [... w = ? ...] case w of true => l4 | false => l5
l4 () = [...] Goto l3(a, b, w)
l5 () = ...
where a common argument is passed to a loop (and is invariant through
the loop). Of course, the loopInvariant optimization pass would
normally introduce a local loop and essentially reduce this to the
first example, but I have seen this pattern in practice, which
suggests that some optimizations after loopInvariant do enough
simplification to introduce (new) loop-invariant arguments.
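In the sketch above, only the final check changes; a hedged version
of the relaxed criterion might read:

(* Relaxed check: a can be replaced by x iff every entry in varInfo a
 * is SOME a itself (a loop passing a along unchanged) or SOME x for
 * a single common x. *)
fun replaceable' (a : var) : var option =
   let
      val entries = !(info a)
      (* every flow must be known (no NONE entries) ... *)
      val allKnown = List.all Option.isSome entries
      (* ... and the flows other than a itself must agree on one x *)
      val others =
         List.mapPartial
            (fn SOME x => if x = a then NONE else SOME x
              | NONE => NONE)
            entries
   in
      case (allKnown, others) of
         (true, x :: rest) =>
            if List.all (fn x' => x' = x) rest then SOME x else NONE
       | _ => NONE
   end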
However, the above analysis and transformation doesn't cover the cases
where eliminating one common argument exposes the opportunity to
eliminate other common arguments. For example:
l1 () = [...] Goto l3(x)
l2 () = [...] Goto l3(x)
l3 (a) = [...] Goto l5(a)
l4 () = [...] Goto l5(x)
l5 (b) = ...
One pass of analysis and transformation would eliminate the argument
to l3 and rewrite the Goto l5(a) to Goto l5(x), thereby exposing the
opportunity to eliminate the common argument to l5.
The interdependency between the arguments to l3 and l5 suggests
performing some sort of fixed-point analysis. This analysis is
relatively simple:
maintain varInfo: Var.t -> VarLattice.t, where
VarLattice.t ~=~ Bot | Point of Var.t | Top (but as implemented by the
FlatLattice functor with a lessThan list and value ref under the
hood), initialized to Bot.
For each variable v bound in a Statement.t or in the SSA Function.t
args, VarLattice.<= (Point v, varInfo v).
For each Goto l(x1, ..., xn) where l(a1, ..., an),
VarLattice.<= (varInfo xi, varInfo ai) for each i.
For each block argument a used in an unknown context,
VarLattice.<= (Point a, varInfo a).
Now, any block argument a such that varInfo a = Point x can be
optimized by setting a = x at the beginning of the block and dropping
the argument from Gotos.
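Here is a self-contained sketch of such a lattice in SML, in the
spirit of MLton's FlatLattice functor (which keeps a lessThan list
and a value ref under the hood); the names and representation here
are assumptions for illustration, not MLton's actual interface:

datatype 'a elt = Bot | Point of 'a | Top
datatype 'a t = T of {value : 'a elt ref,
                      (* cells to raise whenever this one rises *)
                      lessThan : 'a t list ref}
fun new () : 'a t = T {value = ref Bot, lessThan = ref []}
fun value (T {value, ...}) = !value
(* raise a cell to at least e, propagating to dependent cells;
 * values only move up and we propagate only on change, so cycles
 * terminate *)
fun up (eq : 'a * 'a -> bool) (T {value, lessThan}, e : 'a elt) : unit =
   let
      val joined =
         case (!value, e) of
            (old, Bot) => old
          | (Bot, e') => e'
          | (Point x, Point y) => if eq (x, y) then Point x else Top
          | _ => Top
      val changed =
         case (!value, joined) of
            (Bot, Bot) => false
          | (Bot, _) => true
          | (Point _, Top) => true
          | _ => false
   in
      if changed
         then (value := joined
               ; List.app (fn c => up eq (c, joined)) (!lessThan))
      else ()
   end
(* the constraints above map onto this sketch as:
 *   VarLattice.<= (Point v, varInfo v)    ~~>  up eq (cell v, Point v)
 *   VarLattice.<= (varInfo x, varInfo a)  ~~>  lessEq eq (cell x, cell a) *)
fun lessEq eq (T {value, lessThan}, b) : unit =
   (lessThan := b :: !lessThan
    ; up eq (b, !value))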
Now, with the last example, we introduce the ordering constraints:
varInfo x <= varInfo a
varInfo a <= varInfo b
varInfo x <= varInfo b.
Assuming that varInfo x = Point x, then we get varInfo a = Point x and
varInfo b = Point x, and we optimize the example as desired.
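Using the lattice sketch above on this example's constraints (string
variables, structural equality), that plays out as:

val eq = (op =) : string * string -> bool
val vx : string t = new ()
val va : string t = new ()
val vb : string t = new ()
val () = up eq (vx, Point "x")   (* x is bound:  Point x <= varInfo x *)
val () = lessEq eq (vx, va)      (* Goto l3(x):  varInfo x <= varInfo a *)
val () = lessEq eq (va, vb)      (* Goto l5(a):  varInfo a <= varInfo b *)
val () = lessEq eq (vx, vb)      (* Goto l5(x):  varInfo x <= varInfo b *)
(* now value va = Point "x" and value vb = Point "x" *)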
But, that is a rather weak assumption. It's quite possible for
varInfo x = Top. For example, consider:
g1 () = [... n = 1 ...] Goto l0(n)
g2 () = [... m = 2 ...] Goto l0(m)
l0 (x) = ...
l1 () = [...] Goto l3(x)
l2 () = [...] Goto l3(x)
l3 (a) = [...] Goto l5(a)
l4 () = [...] Goto l5(x)
l5 (b) = ...
Now varInfo x = varInfo a = varInfo b = Top. What went wrong here?
When varInfo x went to Top, it got propagated all the way through to a
and b, and prevented the elimination of any common arguments. What
we'd like to do instead is, when varInfo x goes to Top, to propagate
Point x -- we have no hope of eliminating x, but if we hold x
constant, then we have a chance of eliminating arguments for which x
is passed as an actual.
Does anyone see where this is going yet? Pausing for a little
thought, I realized that I had once before tried proposing this kind
of "fix" to a fixed-point analysis -- when we were first investigating
the contification optimization in light of John Reppy's CWS paper. Of
course, that "fix" failed because it defined a non-monotonic function
and I couldn't take the fixed point. But, Stephen suggested a
dominator based approach, and we were able to show that, indeed, the
dominator analysis subsumed both the previous call based analysis and
the cont based analysis. And, a moment's reflection reveals further
parallels: when varInfo: Var.t -> Var.t option list ref, we have
something analogous to the call analysis, and when
varInfo: Var.t -> VarLattice.t, we have something analogous to the
cont analysis. Maybe there is something analogous to the dominator
approach (and therefore superior to the previous analyses).
And this turns out to be the case:
Construct the graph G as follows:
nodes(G) = {Root} U Var.t
edges(G) = {Root -> v | v bound in a Statement.t or
in the SSA Function.t args} U
{xi -> ai | Goto l(x1, ..., xn) where l(a1, ..., an)} U
{Root -> a | a is a block argument used in an unknown context}
Let idom(x) be the immediate dominator of x in G with root Root.
Now, any block argument a such that idom(a) = x <> Root can be
optimized by setting a = x at the beginning of the block and dropping
the argument from Gotos.
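Here is a sketch of the dominator side in SML, again self-contained
and illustrative rather than MLton's actual graph library. It
assumes the variables have been numbered 1 through n-1, with 0 as
Root, and succ encoding the three edge sets above; immediate
dominators are computed with the simple iterative algorithm of
Cooper, Harvey, and Kennedy (one plausible choice; MLton's own
dominator computation may differ):

(* nodes are ints; 0 is Root; succ u lists the edges out of u *)
fun dominators (n : int, succ : int -> int list) : int option Array.array =
   let
      (* postorder numbering via DFS from Root *)
      val post = Array.array (n, ~1)
      val visited = Array.array (n, false)
      val counter = ref 0
      val rpoRev = ref []
      fun dfs u =
         if Array.sub (visited, u) then ()
         else (Array.update (visited, u, true)
               ; List.app dfs (succ u)
               ; Array.update (post, u, !counter)
               ; counter := !counter + 1
               ; rpoRev := u :: !rpoRev)
      val () = dfs 0
      val rpo = !rpoRev   (* reverse postorder; Root is first *)
      (* predecessor lists, restricted to reachable nodes *)
      val preds = Array.array (n, [] : int list)
      val () =
         List.app
            (fn u =>
                List.app
                   (fn v => Array.update (preds, v, u :: Array.sub (preds, v)))
                   (succ u))
            rpo
      val idom = Array.array (n, NONE : int option)
      val () = Array.update (idom, 0, SOME 0)
      (* walk two candidate dominators up to their common ancestor *)
      fun intersect (u, v) =
         if u = v then u
         else if Array.sub (post, u) < Array.sub (post, v)
            then intersect (valOf (Array.sub (idom, u)), v)
         else intersect (u, valOf (Array.sub (idom, v)))
      fun pass () =
         List.foldl
            (fn (b, changed) =>
                if b = 0 then changed
                else
                   case List.filter (fn p => isSome (Array.sub (idom, p)))
                                    (Array.sub (preds, b)) of
                      [] => changed
                    | p :: ps =>
                         let val d = List.foldl intersect p ps
                         in if Array.sub (idom, b) = SOME d
                               then changed
                            else (Array.update (idom, b, SOME d); true)
                         end)
            false rpo
      fun fix () = if pass () then fix () else ()
   in
      fix (); idom
   end

A block argument a is then optimizable exactly when its node's entry
in idom is SOME d for some variable node d other than Root (node 0).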
Furthermore, experimental evidence suggests (and I'm fairly confident
that a formal presentation could prove) that the dominator analysis
subsumes the "call" and "cont" based analyses in this context as well
and that the dominator analysis gets "everything" in one go.
I must admit, I was rather surprised at this progression and final
result. At the outset, I never would have thought of a connection
between contification and common argument optimizations. They would
seem to be two completely different optimizations. Although, this may
not really be the case. As one of the reviewers of the ICFP paper
said:
I understand that such a form of CPS might be convenient in some
cases, but when we're talking about analysing code to detect that
some continuation is constant, I think it makes a lot more sense to
make all the continuation arguments completely explicit.
I believe that making all the continuation arguments explicit will
show that the optimization can be generalized to eliminating
constant arguments, whether continuations or not.
What I think the common argument optimization shows is that the
dominator analysis does slightly better than the reviewer puts it: we
find more than just constant continuations, we find common
continuations. And I think this is further justified by the fact that
I have observed commonArg eliminate some env_X arguments, which would
appear to correspond to determining that, while the closure being
executed isn't constant, it is at least the same as the closure being
passed elsewhere.
At first, I was curious whether or not we had missed a bigger picture
with the dominator analysis. When we wrote the contification paper, I
assumed that the dominator analysis was a specialized solution to a
specialized problem; we never suggested that it was a technique suited
to a larger class of analyses. After initially finding a connection
between contification and common argument (and thinking that the only
connection was the technique), I wondered if the dominator technique
really was applicable to a larger class of analyses. That is still a
question, but after writing up the above, I'm suspecting that the
"real story" is that the dominator analysis is a solution to the
common argument optimization, and that the contification optimization
is specializing common argument to the case of continuation arguments
(with a different transformation at the end). (Note that a
whole-program, inter-procedural common argument analysis doesn't
really make sense (in our SSA IL), because the only way of passing
values between SSA functions is as arguments. (Unless, of course, the
common argument is also a constant argument, in which case constant
propagation could lift it to a global.) The inter-procedural
contification optimization works out because there we move the
function to the argument.)
Anyway, it's still unclear to me whether or not the dominator-based
approach solves other kinds of problems. When Tom Murphy mentioned
PRE last week, I thought again about my previous investigations and
the problems I mentioned in my reply, and I'm struck by a potential
similarity: I kept having thoughts along the lines of "well, if it
turns out that the expressions bound to x and y are the same and
redundant, and we have l(x) and l(y) transfers to block l(a), then we
should propagate that redundant expression through to a; but if they
turn out different, then we can't do anything to a, though we do want
to allow it to participate in other redundant expressions" -- which is
pretty similar to the thoughts I had with the common argument
optimization.
On the downside, the optimization doesn't have a huge impact on run
time, although it does predictably save some code size. I stuck it
in the optimization sequence between localRef and flatten. I think
it makes sense to add it after introduceLoops and loopInvariant (even
though commonArg gets some things that loopInvariant gets, it doesn't
get all of them). I also think that it makes sense to add it before
commonSubexp, since identifying variables could expose more common
subexpressions. A similar thought applies to redundantTests. I'm
now having second thoughts about placing commonArg before flatten and
localFlatten3. I previously didn't put much thought into it, but it
seems to me that we could have cases where some components of a tuple
used as an argument are common, even though the whole tuple isn't.
Here are the benchmark results. I believe that the horrid compile
time for hamlet is due to the dominator computation on the graph G.
Note that, as I've defined it above (and currently implemented it),
the graph G has a node for _every_ variable in the function and an
edge from Root to every variable bound in a statement. This is
overkill, since we only care about variables used as formals and
actuals. This can easily be accommodated, and it should shrink the
size of the graph G by quite a bit. (Hopefully enough to prevent a
self-compile from thrashing in GC during commonArg.)
Finally, notice that wc-input1 got a decent improvement, which is
encouraging for me because the compiler here is using the new IO,
while the old IO's hot loop didn't exhibit opportunities for common
argument elimination.
MLton0 -- mlton -diag commonArg
MLton1 -- mlton -drop-pass commonArg
compile time (seconds)
benchmark MLton0 MLton1
barnes-hut 2.60 2.53
boyer 4.48 3.97
checksum 0.68 0.70
count-graphs 1.43 1.38
DLXSimulator 3.86 3.48
fft 1.19 1.12
fib 0.68 0.67
hamlet 148.33 46.91
imp-for 0.70 0.70
knuth-bendix 2.37 2.20
lexgen 5.48 4.90
life 1.32 1.25
logic 2.54 2.51
mandelbrot 0.73 0.68
matrix-multiply 0.73 0.73
md5 1.33 1.24
merge 0.70 0.64
mlyacc 23.03 20.00
model-elimination 29.11 19.84
mpuz 0.90 0.90
nucleic 79.47 78.79
peek 1.13 1.04
psdes-random 0.67 0.68
ratio-regions 2.14 1.99
ray 3.55 3.45
raytrace 8.76 9.43
simple 6.02 6.97
smith-normal-form 9.64 8.45
tailfib 0.79 0.77
tak 0.98 1.89
tensor 3.42 3.39
tsp 1.66 1.66
tyan 3.80 3.42
vector-concat 0.73 0.70
vector-rev 0.75 0.73
vliw 14.64 12.75
wc-input1 1.73 2.17
wc-scanStream 2.18 2.09
zebra 4.07 3.76
zern 1.23 1.12
run time (seconds)
benchmark MLton0 MLton1
barnes-hut 48.68 49.70
boyer 55.70 55.62
checksum 47.35 47.35
count-graphs 40.27 43.09
DLXSimulator 92.16 92.12
fft 36.57 37.27
fib 49.50 49.50
hamlet 51.47 53.09
imp-for 51.61 51.62
knuth-bendix 40.01 40.05
lexgen 41.70 41.87
life 52.33 54.92
logic 59.83 60.05
mandelbrot 54.16 54.16
matrix-multiply 30.90 31.07
md5 143.21 143.24
merge 85.04 83.17
mlyacc 40.16 39.97
model-elimination 72.59 72.45
mpuz 38.94 38.94
nucleic 41.46 41.48
peek 35.41 35.41
psdes-random 23.23 23.24
ratio-regions 36.98 36.98
ray 29.74 28.46
raytrace 40.89 42.60
simple 53.88 64.97
smith-normal-form 63.46 59.80
tailfib 46.35 46.19
tak 88.82 89.29
tensor 32.10 32.12
tsp 69.73 68.84
tyan 59.05 58.88
vector-concat 92.75 98.45
vector-rev 117.40 121.98
vliw 47.78 46.55
wc-input1 77.03 86.52
wc-scanStream 168.09 175.03
zebra 48.90 51.03
zern 42.55 46.03
run time ratio (MLton1 / MLton0)
benchmark MLton1
barnes-hut 1.02
boyer 1.00
checksum 1.00
count-graphs 1.07
DLXSimulator 1.00
fft 1.02
fib 1.00
hamlet 1.03
imp-for 1.00
knuth-bendix 1.00
lexgen 1.00
life 1.05
logic 1.00
mandelbrot 1.00
matrix-multiply 1.01
md5 1.00
merge 0.98
mlyacc 1.00
model-elimination 1.00
mpuz 1.00
nucleic 1.00
peek 1.00
psdes-random 1.00
ratio-regions 1.00
ray 0.96
raytrace 1.04
simple 1.21
smith-normal-form 0.94
tailfib 1.00
tak 1.01
tensor 1.00
tsp 0.99
tyan 1.00
vector-concat 1.06
vector-rev 1.04
vliw 0.97
wc-input1 1.12
wc-scanStream 1.04
zebra 1.04
zern 1.08
size (bytes)
benchmark MLton0 MLton1
barnes-hut 123,268 123,332
boyer 137,243 137,235
checksum 50,859 50,867
count-graphs 68,563 69,155
DLXSimulator 109,356 111,164
fft 58,375 59,495
fib 50,859 50,867
hamlet 1,177,316 1,237,564
imp-for 50,851 50,859
knuth-bendix 96,076 96,204
lexgen 168,425 170,633
life 70,715 70,867
logic 110,075 110,091
mandelbrot 50,971 50,979
matrix-multiply 51,427 51,435
md5 64,012 64,020
merge 52,179 52,187
mlyacc 480,393 488,841
model-elimination 602,612 610,172
mpuz 55,771 55,811
nucleic 197,739 197,683
peek 61,812 61,948
psdes-random 51,563 51,571
ratio-regions 69,259 69,267
ray 113,660 116,700
raytrace 218,153 289,097
simple 195,303 195,303
smith-normal-form 194,968 195,792
tailfib 50,659 50,667
tak 51,067 51,075
tensor 117,879 118,039
tsp 68,716 68,788
tyan 115,820 116,076
vector-concat 52,027 52,035
vector-rev 51,235 51,243
vliw 321,269 331,733
wc-input1 75,873 76,145
wc-scanStream 79,409 79,889
zebra 127,468 127,476
zern 56,726 56,798