[MLton] N-Body benchmark and the refFlatten and the deepFlatten
optimizations
Matthew Fluet
fluet at tti-c.org
Mon Aug 18 14:21:36 PDT 2008
On Wed, 2 Jul 2008, Vesa Karvonen wrote:
> I'm not sure what exactly causes the optimizations to be missed. I would
> assume that refFlatten is trickier to apply and requires a more complex
> analysis. The deepFlatten page does not explain the associated analysis
> in any way (and I haven't looked at the implementation of deepFlatten). I
> find it slightly surprising that the deepFlatten optimization is not
> applied in this case. The objects are finite sized and not very large (3
> reals (+ overhead)). Shouldn't this make it safe (for space) to apply
> deepFlatten? Of course there are various trade-offs here (between
> allocation, copying, sharing), and it is impossible to make optimal
> decisions in every case.
>
> The comments on the refFlatten page led me to think that the problem (not
> applying either optimization) might be caused by the code that creates the
> system (see the benchmark code) by mapping over a list during which ref
> cells are being allocated. So, I manually eliminated the map call. This
> obviously changed the results of some analysis, eliminated the allocation
> and improved performance considerably from about 32 to about 26
> seconds. If I interpret the generated code correctly, then the
> deepFlatten optimization was now applied, but not refFlatten, which
> should give even better results by avoiding one more indirection.
>
> It would seem that improving the analyses of the refFlatten and deepFlatten
> optimizations, so that they could be applied in more cases, could have
> significant performance benefits.
A while back, Vesa and I chatted about this issue, and I thought it would
be worth recording the conclusions in the mail archive.
For Vesa's code, the interesting bit is how the 'system' value is
represented:
type body = {pos : Vec3D.t Ref.t, vel : Vec3D.t Ref.t, mass : Real.t}
val system : body list =
   let
      val f = fn {pos, vel, mass} =>
                 {pos = ref pos,
                  vel = ref (vel |* daysPerYear),
                  mass = mass * solarMass}
   in
      map f
          [{pos = {x = 0.0, y = 0.0, z = 0.0},
            vel = {x = 0.0, y = 0.0, z = 0.0},
            mass = 1.0}, ...]
   end
For the original implementation above, at the end of the SSA2 optimization
passes, the 'body list' type is represented by the SSA2 type:
list_11 = nil_7 of (unit)
        | ::_11 of ((real64 ref * real64 ref * real64 ref)
                    * ((real64 * real64 * real64) ref)
                    * real64
                    * list_11)
Thus, a cons cell is represented by an object with 4 fields: the first
field is a pointer to a 3-field object (each field of which is a mutable
real64); the second field is a pointer to a mutable pointer to a 3-field
object (each field of which is a real64); the third field is a real64;
the fourth field is a list_11.
Hence, the body record type has been 'inlined' into the specific list
type. Also, the deepFlatten pass has 'pushed' the mutability of the vel
field into the three components of a Vec3D.t. [MLton initially orders a
record type alphabetically by field name and various passes might reverse
the order of the entire tuple; so, we know that the field next to the mass
field will be the pos field, leaving the other field to be the vel field.]
Vesa noted that the line
; #pos a := pos a |+| vel a |* dt
ended up allocating a tuple, leading to a high allocation total.
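To see where that tuple comes from, here is a minimal sketch of the Vec3D
operators (the fixities and field names are my assumptions, not the
benchmark's actual code): each operator builds a fresh record, so the
right-hand side of the update above is a freshly allocated {x, y, z} value.

```sml
(* Hypothetical sketch; fixities and field names are assumptions. *)
type t = {x : real, y : real, z : real}
infix 6 |+|
infix 7 |*
fun ({x = x1, y = y1, z = z1} : t) |+| ({x = x2, y = y2, z = z2} : t) : t =
   {x = x1 + x2, y = y1 + y2, z = z1 + z2}
fun ({x, y, z} : t) |* (s : real) : t =
   {x = x * s, y = y * s, z = z * s}
```

With these definitions, the right-hand side allocates intermediate records
on every evaluation; only when the components have been flattened into the
containing object can the compiler write the three real64s in place.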
Vesa gave an alternate implementation, equivalent to the following:
val system : body list =
   let
      val f = fn {pos, vel, mass} =>
                 {pos = ref pos,
                  vel = ref (vel |* daysPerYear),
                  mass = mass * solarMass}
   in
      [f {pos = {x = 0.0, y = 0.0, z = 0.0},
          vel = {x = 0.0, y = 0.0, z = 0.0},
          mass = 1.0}, ...]
   end
For this implementation, at the end of the SSA2 optimization passes, the
'body list' type is represented by the SSA2 type:
list_10 = nil_6 of (unit)
        | ::_10 of ((real64 ref * real64 ref * real64 ref)
                    * (real64 ref * real64 ref * real64 ref)
                    * real64
                    * list_10)
Thus, a cons cell is represented by an object with 4 fields: the first
field is a pointer to a 3-field object (each field of which is a mutable
real64); the second field is a pointer to a 3-field object (each field of
which is a mutable real64); the third field is a real64; the fourth field
is a list_10.
Hence, the body record type has been 'inlined' into the specific list
type. Also, the deepFlatten pass has 'pushed' the mutability of the vel
field into the three components of a Vec3D.t and 'pushed' the mutability
of the pos field into the three components of a Vec3D.t.
With this implementation, the line
; #pos a := pos a |+| vel a |* dt
ends up not allocating anything, simply mutating each of the three
components.
Both of these SSA2 type representations are established after the
deepFlatten pass (but before the refFlatten pass). In both, the
refFlatten pass does nothing (with respect to the 'body list' type). This
raises a couple of issues, and I don't have a complete understanding of
either of them.
First, it isn't clear why, in the original implementation, the 'vel' field
was deepFlatten-ed but the 'pos' field was not. There doesn't seem to be a
significant difference in the way the two fields are initialized and used
in the program. [That this was an issue didn't occur to me earlier, Vesa
and I didn't discuss it, and I don't have anything more to say on it, other
than that it is an interesting question, and that making deepFlatten apply
to the 'pos' field is perhaps the best way of achieving the desired
allocation behavior.]
Second, it wasn't clear why, in the original implementation, the 'pos'
field wasn't subject to being refFlatten-ed (given that the 'pos' field
wasn't deepFlatten-ed). That is, we might expect the refFlatten pass to
take
list_11 = nil_7 of (unit)
        | ::_11 of ((real64 ref * real64 ref * real64 ref)
                    * ((real64 * real64 * real64) ref)
                    * real64
                    * list_11)
to
list_11 = nil_7 of (unit)
        | ::_11 of ((real64 ref * real64 ref * real64 ref)
                    * (real64 * real64 * real64) ref
                    * real64
                    * list_11)
In retrospect, this wouldn't actually change the allocation behavior of
the program significantly. It would save one indirection (and one
allocation at the allocation of the 'system' value), but it would still
require one tuple allocation for each iteration. Nonetheless, it is
interesting to note why the reference isn't flattened into the containing
data structure.
One requirement for the refFlatten optimization is that the 'ref' cell
allocation occurs in the same block as the containing object allocation.
With regards to the original code above, this means that for each ::_11
allocation, there should be a 'ref' cell allocation. Looking at the SSA2
IL code for the program into the refFlatten pass, we see two ::_11
allocations:
L_271 (x_339: list_11,
x_338: (real64 * real64 * real64),
x_337: (real64 * real64 * real64),
x_336: real64,
x_335: list_10)
x_342: ((real64 * real64 * real64) ref) = (x_337)
x_2396: real64 = #2 x_338
x_2397: real64 = #1 x_338
x_2398: real64 = #0 x_338
x_347: real64 = Real64_mul (global_171, x_2398)
x_346: real64 = Real64_mul (global_171, x_2397)
x_345: real64 = Real64_mul (global_171, x_2396)
x_343: (real64 ref * real64 ref * real64 ref) = (x_347, x_346, x_345)
x_341: real64 = Real64_mul (x_333, x_336)
x_2158: ::_11 of ((real64 ref * real64 ref * real64 ref)
* ((real64 * real64 * real64) ref)
* real64
* list_11) = ::_11 (x_343, x_342, x_341, x_339)
x_340: list_11 = x_2158: list_11
case x_335 of
nil_6 => L_273 | ::_3 => L_1326
and
L_274 (x_355: list_11,
x_354: (real64 ref * real64 ref * real64 ref),
x_353: ((real64 * real64 * real64) ref),
x_352: real64,
x_351: list_11)
x_2164: ::_11 of ((real64 ref * real64 ref * real64 ref)
* ((real64 * real64 * real64) ref)
* real64
* list_11) = ::_11 (x_354, x_353, x_352, x_355)
x_356: list_11 = x_2164: list_11
case x_351 of
nil_7 => L_276 | ::_11 => L_1327
The first ::_11 allocation is the obvious one, that includes the ref cell
allocation (x_342 = (x_337)). The second ::_11 allocation does not
include a ref cell allocation, so the 'pos' field is not susceptible to
refFlatten-ing. The second ::_11 allocation corresponds to a list
reversal that is part of the implementation of map. [A list reversal is
used so that map is implemented by two tail-recursive loops, rather than
one non-tail-recursive loop.] In general, one can't perform the
refFlatten-ing optimization in this case. Indeed, the ref cells are
copied from the old list to the new list. Unless we 'know' that the old
list is dead, we can't flatten for fear of breaking sharing. (Someone
might have a copy of the old list, and we need to see the same
references.)
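A small (hypothetical) fragment illustrates the sharing that flattening
would break:

```sml
(* Both lists contain pointers to the same ref cell; an update through
   one list must remain visible through the other.  Flattening the ref
   into each cons cell would give each list its own copy of the cell. *)
val r : int ref = ref 0
val old = [r]
val new = rev old          (* copies the pointer to r, not the cell itself *)
val () = r := 1
val observed = !(hd new)   (* observed = 1: the update is shared *)
```

If the ref were flattened into the cons cells of both lists, the update
`r := 1` would have to be reflected in two separate flattened copies,
which the compiler cannot guarantee without knowing the old list is dead.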
We can change the implementation to the following:
val system : body list =
   let
      val f = fn {pos, vel, mass} =>
                 {pos = ref pos,
                  vel = ref (vel |* daysPerYear),
                  mass = mass * solarMass}
   in
      foldr (fn (x, l) => (f x) :: l) []
            [{pos = {x = 0.0, y = 0.0, z = 0.0},
              vel = {x = 0.0, y = 0.0, z = 0.0},
              mass = 1.0}, ...]
   end
There is still a list reversal as part of the implementation of foldr, but
now the input list is reversed. (One can't take this as the
implementation of map, because it would apply the mapped function to the
elements in the wrong order.) The deepFlatten pass still yields the same
list_11 representation for the 'body list' type, but now the SSA2 program
into the refFlatten pass has exactly one ::_11 allocation (nearly
alpha-equivalent to the L_271 block above). However, the 'pos'
field is still not subject to being refFlatten-ed.
Another (syntactic) requirement of the refFlatten optimization is that the
ref cell updates occur in the same block as the selection of the ref cell
from its containing data-structure. This requirement is violated in the
SSA2 program:
L_389 (x_567: (real64 ref * real64 ref * real64 ref),
x_566: ((real64 * real64 * real64) ref),
x_565: list_11)
x_2456: (real64 * real64 * real64) = #0 x_566
x_2457: real64 = #2 x_567
x_2458: real64 = #1 x_567
x_2459: real64 = #0 x_567
x_576: real64 = Real64_mul (x_2459, global_175)
x_575: real64 = Real64_mul (x_2458, global_175)
x_573: real64 = Real64_mul (x_2457, global_175)
x_2453: real64 = #2 x_2456
x_2454: real64 = #1 x_2456
x_2455: real64 = #0 x_2456
x_571: real64 = Real64_add (x_2455, x_576)
x_570: real64 = Real64_add (x_575, x_2454)
x_569: real64 = Real64_add (x_573, x_2453)
x_568: (real64 * real64 * real64) = (x_571, x_570, x_569)
x_566 := x_568
case x_565 of
nil_7 => L_391 | ::_11 => L_1357
The 'pos' field reference is x_566, which is passed as a first-class value
into the L_389 block, rather than selected from the cons cell within the
block. The motivation for this syntactic requirement is to both make it
easy to find the containing data structure and to ensure that the
flattening of the ref cell into the containing data structure does not
violate space safety.
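At the source level, the violated pattern corresponds roughly to passing
the ref itself into a function as a first-class value (the names 'step'
and 'advance' here are hypothetical; Vec3D.t, body, and the operators are
assumed from the benchmark):

```sml
(* The update site inside 'step' is in a different block from the
   select that extracted the 'pos' ref from the cons cell, so
   refFlatten's syntactic check fails. *)
fun step (pos : Vec3D.t ref, vel : Vec3D.t ref, dt : real) =
   pos := !pos |+| !vel |* dt

fun advance (dt : real) (bodies : body list) =
   List.app (fn {pos, vel, mass} => step (pos, vel, dt)) bodies
```

Unless 'step' is inlined into the loop that selects the fields, the update
and the selection end up in different blocks, exactly as with x_566 above.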
Neither of the syntactic requirements is a semantic requirement -- there
exist programs that violate the syntactic requirements but could still be
validly optimized. For example, consider the following common idiom:
val r = {a = ref (f x), b = ref (g y)}
If the code for g is not inlined (or includes a loop), then the ref cell
allocated for the a field will be separated from the allocation of the
record, because, in A-normal form, this code looks like:
val r = let val fx = f x
val ar = ref fx
val gy = g y (* point of possible block breaking *)
val br = ref gy
val r = {a = ar, b = br}
in r end
One can restore the syntactic requirement in the above code as:
val r = let val fx = f x
val gy = g y
in {a = ref fx, b = ref gy}
end
The latter requirement is also commonly violated by programs, because, as
in the above code, components of a datatype variant are passed as
first-class values and, thus, become separated from their selection from
the containing data structure.
In any case, it seems clear that there are ways of improving the
deepFlatten and refFlatten optimizations. Not to mention the fact that
there have been other reports of deepFlatten and refFlatten requiring
inordinate amounts of memory for large input programs.