MLton's closure conversion
Stephen Weeks
MLton@sourcelight.com
Tue, 10 Oct 2000 10:11:22 -0700 (PDT)
> The fact that the flattener flattened too much (21 args) is a real problem.
> Clearly there are lots of times when exactly this flattening would be good.
> I would think that the only time it was really bad was if:
> All calls unpack these arguments from other tuples (so re-partitioning it
> and passing the tuples as a tuple would have been a big win.
> All the arguments are passed around a lot.
> Was this the case?
I made some progress in understanding why MLton produces worse code on some of
the benchmarks. The problems aren't caused by the flattener. They come from
the closure converter. There are at least two problems.
1. MLton does not compile recursive curried functions efficiently.
2. MLton's closure conversion algorithm can cost when there are large closures
with only a few fields used per invocation.
I don't have solutions offhand (I haven't thought about it much) but I thought
I'd explain the problems to y'all so you could think about them too.
Consider the following SML code.
val x = ...
val y = ...
val z = ...
val w = ...
fun f a b c =
if a > 0
then f (b + c) x (f y z w)
else 17
val _ = f 13 14 15
In closure conversion, the recursive curried function f gets turned into three
functions.
------------------------------------------------------------
datatype env0 = Env0 of int * int * int * int
datatype env1 = Env1 of int * int * int * int * env0 * int
datatype env2 = Env2 of int * int * int * int * env0 * int * int
fun F0(Env0 r, a) =
let x = #1 r
y = #2 r
z = #3 r
w = #4 r
f = Env0 r
in Env1(x, y, z, w, f, a)
end
fun F1(Env1 r, b) =
let x = #1 r
y = #2 r
z = #3 r
w = #4 r
f = #5 r
a = #6 r
in Env2(x, y, z, w, f, a, b)
end
fun F2(Env2 r, c) =
let x = #1 r
y = #2 r
z = #3 r
w = #4 r
f = #5 r
a = #6 r
b = #7 r
in if a > 0
then F2(F1(F0(f, b + c), x), F2(F1(F0(f, y), z), w))
else 17
end
val r = (x, y, z, w)
val _ = F2(F1(F0(r, 13), 14), 15)
------------------------------------------------------------
After the inliner inlines F0 and F1 everywhere, and the type simplifier removes
the transparent constructors, we have the following:
------------------------------------------------------------
fun F2(r, c) =
let x = #1 r
y = #2 r
z = #3 r
w = #4 r
f = #5 r
a = #6 r
b = #7 r
in if a > 0
then F2(let val x' = #1 f
val y' = #2 f
val z' = #3 f
val w' = #4 f
in (x', y', z', w', f, b + c, x)
end,
F2(let val x' = #1 f
val y' = #2 f
val z' = #3 f
val w' = #4 f
in (x', y', z', w', f, y, z)
end, w))
else 17
end
val r = (x, y, z, w)
val _ = F2((x, y, z, w, r, 13, 14), 15)
------------------------------------------------------------
Finally, after flattening, we have the following.
------------------------------------------------------------
fun F2(x, y, z, w, f, a, b, c) =
if a > 0
then let val x' = #1 f
val y' = #2 f
val z' = #3 f
val w' = #4 f
in F2(x', y', z', w', f, b + c, x,
let val x' = #1 f
val y' = #2 f
val z' = #3 f
val w' = #4 f
in F2(x', y', z', w', f, y, z, w)
end)
end
else 17
val r = (x, y, z, w)
val _ = F2(x, y, z, w, r, 13, 14, 15)
------------------------------------------------------------
Annoyingly, the environment is carried around in two duplicate forms: the
formals x, y, z, w and the record f.
On the other hand, if f was not curried, things work out better. Here is the
source program.
------------------------------------------------------------
val x = ...
val y = ...
val z = ...
val w = ...
fun f (a, b, c) =
if a > 0
then f (b + c, x, f (y, z, w))
else 17
val _ = f (13, 14, 15)
------------------------------------------------------------
After closure conversion, we have the following.
------------------------------------------------------------
fun F (r, abc) =
let val x = #1 r
val y = #2 r
val z = #3 r
val w = #4 r
val a = #1 abc
val b = #2 abc
val c = #3 abc
in if a > 0
then F (r, (b + c, x, f (r, (y, z, w))))
else 17
end
val _ = F ((x, y, z, w), (13, 14, 15))
------------------------------------------------------------
After flattening, we have the following.
------------------------------------------------------------
fun F (r, a, b, c) =
let val x = #1 r
val y = #2 r
val z = #3 r
val w = #4 r
in if a > 0
then F (r, b + c, x, f (r, y, z, w))
else 17
end
val _ = F ((x, y, z, w), 13, 14, 15)
------------------------------------------------------------
So, with the uncurried code, the environment is passed around with no
duplication. This is better than the curried case, but still bad (and gets
worse as the closure record gets bigger, as happens in large programs) since all
of the selects are performed on every recursive call.
We decided to use a closure conversion algorithm where all the selects happen
first in order to ensure space safety. I'm thinking we need to go to a hybrid
approach, where we split the closure into two pieces, one piece consisting of
all the small types (int, bool ref, ...) for which it can't hurt to move the
selects later and another piece consisting of the possibly large types (list,
array) where the selects must be done first.
As an experiment, I converted the plclub code from curried to uncurried format.
Here are the results.
curry tuple
OCAML MLton MLton
time time time
holes 1.7 5.6 4.0
fov 1.4 2.0 2.1
intercyl 1.6 2.3 2.4
snowgoon 2.9 3.9 4.0
dice 3.8 5.4 5.7
golf 1.6 3.4 2.8
cone-fractal 3.8 4.8 4.4
large 4.3 3.4 3.0
pipe 5.4 5.1 5.1
chess 15.9 19.3 17.2
fractal 12.1 10.3 9.3
geom-mean 3.5 4.8 4.5
Of the three that were spending a lot of time in the evaluater (holes, dice,
golf), it helped on holes and golf, but there's still a ways to go. Here are
the new profile numbers. The names are better because with curried functions,
there were lots of anonymous lambdas.
holes.gml
eval_0 52.33%
trace_1 9.00%
x_1954 7.00%
**C overhead** 6.67%
golf.gml
eval_0 33.09%
trace_1 11.52%
intersect_0 9.29%
find_all_0 7.06%
dice.gml
trace_1 29.74%
intersect_0 16.42%
plane_1 12.59%
find_all_0 8.58%
Anyways, we're still pretty dead because with a recursive function, every field
is selected out of the closure on every call. For example, here is the start of
the evaluator CPS code.
fun eval_0(x_1942: list_12,
x_1943: list_16,
x_1944: list_16,
env_7: (real * real * (real * real * rounding_mode_0) * real ref * unit ref * real array * real * real * real * ((real * real) * lambdas_7 ref * real * lambdas_6 ref * unit ref * ((real * real) * real * real)) * (real * real * rounding_mode_0) * unit ref * (word * unit ref * list_0 ref) * list_0 ref * (real * real) * (real * real) * (real * real))): (list_16) =
let
val deg_0: real = #1 env_7
val rad_0: real = #2 env_7
val floor_0: (real * real * rounding_mode_0) = #3 env_7
val identity_0: real array = #6 env_7
val optimize_rec_0: real = #7 env_7
val dtr_0: real = #8 env_7
val dcos_0: real = #9 env_7
val trace_0: ((real * real) * lambdas_7 ref * real * lambdas_6 ref * unit ref * ((real * real) * real * real)) = #10 env_7
val conv_0: (real * real * rounding_mode_0) = #11 env_7
val openOut_0: (word * unit ref * list_0 ref) = #13 env_7
val closeOut_0: list_0 ref = #14 env_7
val rotatez_0: (real * real) = #15 env_7
val rotatey_0: (real * real) = #16 env_7
val rotatex_0: (real * real) = #17 env_7
It seems possible to me that this problem is the reason why we're not doing
better on the larger benchmarks such as kit, self compile.