MLton's closure conversion

Tue, 10 Oct 2000 10:11:22 -0700 (PDT)

> The fact that the flattener flattened too much (21 args) is a  real  problem.
> Clearly  there  are lots of times when exactly this flattening would be good.
> I would think that the only time it was really bad was if:
>     All calls unpack these arguments from other tuples (so re-partitioning it
>         and passing the tuples as a tuple would have been a big win.
>     All the arguments are passed around a lot.
> Was this the case?

I made some progress in understanding why MLton produces worse code on some of
the benchmarks.  The problems aren't caused by the flattener.  They come from
the closure converter.  There are at least two problems.

1. MLton does not compile recursive curried functions efficiently.
2. MLton's closure conversion algorithm can cost when there are large closures 
   with only a few fields used per invocation.

I don't have solutions offhand (I haven't thought about it much) but I thought
I'd explain the problems to y'all so you could think about them too.

Consider the following SML code.

val x = ...
val y = ...
val z = ...
val w = ...
fun f a b c =
   if a > 0
      then f (b + c) x (f y z w)
   else 17
val _ = f 13 14 15

In closure conversion, the recursive curried function f gets turned into three
functions.

------------------------------------------------------------
datatype env0 = Env0 of int * int * int * int
datatype env1 = Env1 of int * int * int * int * env0 * int
datatype env2 = Env2 of int * int * int * int * env0 * int * int
fun F0(Env0 r, a) = 
    let x = #1 r
        y = #2 r
        z = #3 r
        w = #4 r
        f = Env0 r
    in Env1(x, y, z, w, f, a)
    end	
fun F1(Env1 r, b) = 
    let x = #1 r
        y = #2 r
        z = #3 r
        w = #4 r
	f = #5 r
	a = #6 r
    in Env2(x, y, z, w, f, a, b)
    end	
fun F2(Env2 r, c) =
    let x = #1 r
        y = #2 r
        z = #3 r
        w = #4 r
	f = #5 r
	a = #6 r
	b = #7 r
    in if a > 0
        then F2(F1(F0(f, b + c), x), F2(F1(F0(f, y), z), w))
       else 17
    end
val r = (x, y, z, w)
val _ = F2(F1(F0(r, 13), 14), 15)
------------------------------------------------------------

After the inliner inlines F0 and F1 everywhere, and the type simplifier removes
the transparent constructors, we have the following:

------------------------------------------------------------
fun F2(r, c) =
    let x = #1 r
        y = #2 r
        z = #3 r
        w = #4 r
	f = #5 r
	a = #6 r
	b = #7 r
    in if a > 0
        then F2(let val x' = #1 f
			val y' = #2 f
			val z' = #3 f
			val w' = #4 f
                   in (x', y', z', w', f, b + c, x)
                   end,
                 F2(let val x' = #1 f
			val y' = #2 f
			val z' = #3 f
			val w' = #4 f
		    in (x', y', z', w', f, y, z)
                    end, w))
       else 17
    end
val r = (x, y, z, w)
val _ = F2((x, y, z, w, r, 13, 14), 15)
------------------------------------------------------------

Finally, after flattening, we have the following.

------------------------------------------------------------
fun F2(x, y, z, w, f, a, b, c) =
   if a > 0
      then let val x' = #1 f
               val y' = #2 f
               val z' = #3 f
               val w' = #4 f
           in F2(x', y', z', w', f, b + c, x,
                 let val x' = #1 f
                     val y' = #2 f
                     val z' = #3 f
                     val w' = #4 f
                 in F2(x', y', z', w', f, y, z, w)
                 end)
           end
       else 17
val r = (x, y, z, w)
val _ = F2(x, y, z, w, r, 13, 14, 15)
------------------------------------------------------------

Annoyingly, the environment is carried around in two duplicate forms: the
formals x, y, z, w and the record f.

On the other hand, if f was not curried, things work out better.  Here is the
source program.

------------------------------------------------------------
val x = ...
val y = ...
val z = ...
val w = ...
fun f (a, b, c) =
   if a > 0
      then f (b + c, x, f (y, z, w))
   else 17
val _ = f (13, 14, 15)
------------------------------------------------------------

After closure conversion, we have the following.

------------------------------------------------------------
fun F (r, abc) =
   let val x = #1 r
	val y = #2 r
	val z = #3 r
	val w = #4 r
	val a = #1 abc
	val b = #2 abc
	val c = #3 abc
   in if a > 0
         then F (r, (b + c, x, f (r, (y, z, w))))
      else 17
   end
val _ = F ((x, y, z, w), (13, 14, 15))
------------------------------------------------------------

After flattening, we have the following.

------------------------------------------------------------
fun F (r, a, b, c) =
   let val x = #1 r
	val y = #2 r
	val z = #3 r
	val w = #4 r
   in if a > 0
         then F (r, b + c, x, f (r, y, z, w))
      else 17
   end
val _ = F ((x, y, z, w), 13, 14, 15)
------------------------------------------------------------

So, with the uncurried code, the environment is passed around with no
duplication.  This is better than the curried case, but still bad (and gets
worse as the closure record gets bigger, as happens in large programs) since all 
of the selects are performed on every recursive call.

We decided to use a closure conversion algorithm where all the selects happen
first in order to ensure space safety.  I'm thinking we need to go to a hybrid
approach, where we split the closure into two pieces, one piece consisting of
all the small types (int, bool ref, ...) for which it can't hurt to move the
selects later and another piece consisting of the possibly large types (list,
array) where the selects must be done first.  

As an experiment, I converted the plclub code from curried to uncurried format.
Here are the results.

			curry  tuple
		OCAML	MLton  MLton
		time	time   time
holes		 1.7	 5.6	 4.0
fov		 1.4	 2.0	 2.1
intercyl	 1.6	 2.3	 2.4
snowgoon	 2.9	 3.9	 4.0
dice		 3.8	 5.4	 5.7
golf		 1.6	 3.4	 2.8
cone-fractal	 3.8	 4.8	 4.4
large		 4.3	 3.4	 3.0
pipe		 5.4	 5.1	 5.1
chess		15.9	19.3	17.2
fractal		12.1	10.3	 9.3

geom-mean	 3.5	 4.8	4.5

Of the three that were spending a lot of time in the evaluater (holes, dice,
golf), it helped on holes and golf, but there's still a ways to go.  Here are
the new profile numbers.  The names are better because with curried functions,
there were lots of anonymous lambdas.

holes.gml
eval_0                                  52.33%
trace_1                                  9.00%
x_1954                                   7.00%
**C overhead**                           6.67%

golf.gml
eval_0                        33.09%
trace_1                       11.52%
intersect_0                    9.29%
find_all_0                     7.06%

dice.gml
trace_1                       29.74%
intersect_0                   16.42%
plane_1                       12.59%
find_all_0                     8.58%

Anyways, we're still pretty dead because with a recursive function, every field
is selected out of the closure on every call.  For example, here is the start of 
the evaluator CPS code.

fun eval_0(x_1942: list_12,
	   x_1943: list_16,
	   x_1944: list_16,
	   env_7: (real * real * (real * real * rounding_mode_0) * real ref * unit ref * real array * real * real * real * ((real * real) * lambdas_7 ref * real * lambdas_6 ref * unit ref * ((real * real) * real * real)) * (real * real * rounding_mode_0) * unit ref * (word * unit ref * list_0 ref) * list_0 ref * (real * real) * (real * real) * (real * real))): (list_16) = 
   let
      val deg_0: real = #1 env_7
      val rad_0: real = #2 env_7
      val floor_0: (real * real * rounding_mode_0) = #3 env_7
      val identity_0: real array = #6 env_7
      val optimize_rec_0: real = #7 env_7
      val dtr_0: real = #8 env_7
      val dcos_0: real = #9 env_7
      val trace_0: ((real * real) * lambdas_7 ref * real * lambdas_6 ref * unit ref * ((real * real) * real * real)) = #10 env_7
      val conv_0: (real * real * rounding_mode_0) = #11 env_7
      val openOut_0: (word * unit ref * list_0 ref) = #13 env_7
      val closeOut_0: list_0 ref = #14 env_7
      val rotatez_0: (real * real) = #15 env_7
      val rotatey_0: (real * real) = #16 env_7
      val rotatex_0: (real * real) = #17 env_7

It seems possible to me that this problem is the reason why we're not doing
better on the larger benchmarks such as kit, self compile.