[MLton] N-Body benchmark and the refFlatten and the deepFlatten
optimizations
Matthew Fluet
fluet at tti-c.org
Mon Aug 18 14:21:36 PDT 2008
On Wed, 2 Jul 2008, Vesa Karvonen wrote:
> I'm not sure what exactly causes the optimizations to be missed. I would
> assume that refFlatten is trickier to apply and requires a more complex
> analysis. The deepFlatten page does not explain the associated analysis
> in any way (and I haven't looked at the implementation of deepFlatten). I
> find it slightly surprising that the deepFlatten optimization is not
> applied in this case. The objects are finite sized and not very large (3
> reals (+ overhead)). Shouldn't this make it safe (for space) to apply
> deepFlatten? Of course there are various trade-offs here (between
> allocation, copying, sharing), and it is impossible to make optimal
> decisions in every case.
>
> The comments on the refFlatten page led me to think that the problem (not
> applying either optimization) might be caused by the code that creates the
> system (see the benchmark code) by mapping over a list during which ref
> cells are being allocated. So, I manually eliminated the map call. This
> obviously changed the results of some analysis, eliminated the allocation
> and improved performance considerably from about 32 to about 26
> seconds. If I interpret the generated code correctly, then the
> deepFlatten optimization was now applied, but not refFlatten, which
> should give even better results by avoiding one more indirection.
>
> It would seem that improving the analyses of the refFlatten and deepFlatten
> optimizations, so that they could be applied in more cases, could have
> significant performance benefits.
A while back, Vesa and I chatted about this issue, and I thought it would
be worth recording the conclusions in the mail archive.
For Vesa's code, the interesting bit is how the 'system' value is
represented:
type body = {pos : Vec3D.t Ref.t, vel : Vec3D.t Ref.t, mass : Real.t}
val system : body list =
   let
      val f = fn {pos, vel, mass} =>
                 {pos = ref pos,
                  vel = ref (vel |* daysPerYear),
                  mass = mass * solarMass}
   in
      map f
          [{pos = {x = 0.0, y = 0.0, z = 0.0},
            vel = {x = 0.0, y = 0.0, z = 0.0},
            mass = 1.0}, ...]
   end
For the original implementation above, at the end of the SSA2 optimization
passes, the 'body list' type is represented by the SSA2 type:
list_11 = nil_7 of (unit)
        | ::_11 of ((real64 ref * real64 ref * real64 ref)
                    * ((real64 * real64 * real64) ref)
                    * real64
                    * list_11)
Thus, a cons cell is represented by an object with 4 fields: the first
field is a pointer to a 3-field object (each field of which is a mutable
real64); the second field is a pointer to a mutable pointer to a 3-field
object (each field of which is a real64); the third field is a real64;
the fourth field is a list_11.
Hence, the body record type has been 'inlined' into the specific list
type. Also, the deepFlatten pass has 'pushed' the mutability of the vel
field into the three components of a Vec3D.t. [MLton initially orders a
record type alphabetically by field name and various passes might reverse
the order of the entire tuple; so, we know that the field next to the mass
field will be the pos field, leaving the other field to be the vel field.]
Vesa noted that the line
; #pos a := pos a |+| vel a |* dt
ended up allocating a tuple, leading to a high allocation total.
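To see where that tuple comes from, here is a minimal sketch of the Vec3D
operators (the fixities and field names are my assumptions, not the
benchmark's actual code): each operator builds a fresh record, so the
right-hand side of the update above is a freshly allocated {x, y, z} value.

```sml
(* Hypothetical sketch; fixities and field names are assumptions. *)
type t = {x : real, y : real, z : real}
infix 6 |+|
infix 7 |*
fun ({x = x1, y = y1, z = z1} : t) |+| ({x = x2, y = y2, z = z2} : t) : t =
   {x = x1 + x2, y = y1 + y2, z = z1 + z2}
fun ({x, y, z} : t) |* (s : real) : t =
   {x = x * s, y = y * s, z = z * s}
```

With these definitions, the right-hand side allocates intermediate records
on every evaluation; only when the components have been flattened into the
containing object can the compiler write the three real64s in place.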
Vesa gave an alternate implementation, equivalent to the following:
val system : body list =
   let
      val f = fn {pos, vel, mass} =>
                 {pos = ref pos,
                  vel = ref (vel |* daysPerYear),
                  mass = mass * solarMass}
   in
      [f {pos = {x = 0.0, y = 0.0, z = 0.0},
          vel = {x = 0.0, y = 0.0, z = 0.0},
          mass = 1.0}, ...]
   end
For this implementation, at the end of the SSA2 optimization passes, the
'body list' type is represented by the SSA2 type:
list_10 = nil_6 of (unit)
        | ::_10 of ((real64 ref * real64 ref * real64 ref)
                    * (real64 ref * real64 ref * real64 ref)
                    * real64
                    * list_10)
Thus, a cons cell is represented by an object with 4 fields: the first
field is a pointer to a 3-field object (each field of which is a mutable
real64); the second field is a pointer to a 3-field object (each field of
which is a mutable real64); the third field is a real64; the fourth field
is a list_10.
Hence, the body record type has been 'inlined' into the specific list
type. Also, the deepFlatten pass has 'pushed' the mutability of the vel
field into the three components of a Vec3D.t and 'pushed' the mutability
of the pos field into the three components of a Vec3D.t.
With this implementation, the line
; #pos a := pos a |+| vel a |* dt
ends up not allocating anything, simply mutating each of the three
components.
Both of these SSA2 type representations are established after the
deepFlatten pass (but before the refFlatten pass). In both, the
refFlatten pass does nothing (with respect to the 'body list' type). This
raises a couple of issues, and I don't have a complete understanding of
either of them.
First, it isn't clear why, in the original implementation, the 'vel' field
was deepFlatten-ed but the 'pos' field was not. There doesn't seem to be a
significant difference in the way the two fields are initialized and used
in the program. [That this was an issue didn't occur to me earlier, Vesa
and I didn't discuss it, and I don't have anything more to say on it, other
than that it is an interesting question, and that making deepFlatten apply
to the 'pos' field is perhaps the best way of achieving the desired
allocation behavior.]
Second, it wasn't clear why, in the original implementation, the 'pos'
field wasn't subject to being refFlatten-ed (given that the 'pos' field
wasn't deepFlatten-ed). That is, we might expect the refFlatten pass to
take
list_11 = nil_7 of (unit)
        | ::_11 of ((real64 ref * real64 ref * real64 ref)
                    * ((real64 * real64 * real64) ref)
                    * real64
                    * list_11)
to
list_11 = nil_7 of (unit)
        | ::_11 of ((real64 ref * real64 ref * real64 ref)
                    * (real64 * real64 * real64) ref
                    * real64
                    * list_11)
In retrospect, this wouldn't actually change the allocation behavior of
the program significantly. It would save one indirection (and one
allocation at the allocation of the 'system' value), but it would still
require one tuple allocation for each iteration. Nonetheless, it is
interesting to note why the reference isn't flattened into the containing
data structure.
One requirement for the refFlatten optimization is that the 'ref' cell
allocation occurs in the same block as the containing object allocation.
With regards to the original code above, this means that for each ::_11
allocation, there should be a 'ref' cell allocation. Looking at the SSA2
IL code for the program into the refFlatten pass, we see two ::_11
allocations:
L_271 (x_339: list_11,
x_338: (real64 * real64 * real64),
x_337: (real64 * real64 * real64),
x_336: real64,
x_335: list_10)
x_342: ((real64 * real64 * real64) ref) = (x_337)
x_2396: real64 = #2 x_338
x_2397: real64 = #1 x_338
x_2398: real64 = #0 x_338
x_347: real64 = Real64_mul (global_171, x_2398)
x_346: real64 = Real64_mul (global_171, x_2397)
x_345: real64 = Real64_mul (global_171, x_2396)
x_343: (real64 ref * real64 ref * real64 ref) = (x_347, x_346, x_345)
x_341: real64 = Real64_mul (x_333, x_336)
x_2158: ::_11 of ((real64 ref * real64 ref * real64 ref)
* ((real64 * real64 * real64) ref)
* real64
* list_11) = ::_11 (x_343, x_342, x_341, x_339)
x_340: list_11 = x_2158: list_11
case x_335 of
nil_6 => L_273 | ::_3 => L_1326
and
L_274 (x_355: list_11,
x_354: (real64 ref * real64 ref * real64 ref),
x_353: ((real64 * real64 * real64) ref),
x_352: real64,
x_351: list_11)
x_2164: ::_11 of ((real64 ref * real64 ref * real64 ref)
* ((real64 * real64 * real64) ref)
* real64
* list_11) = ::_11 (x_354, x_353, x_352, x_355)
x_356: list_11 = x_2164: list_11
case x_351 of
nil_7 => L_276 | ::_11 => L_1327
The first ::_11 allocation is the obvious one, that includes the ref cell
allocation (x_342 = (x_337)). The second ::_11 allocation does not
include a ref cell allocation, so the 'pos' field is not susceptible to
refFlatten-ing. The second ::_11 allocation corresponds to a list
reversal that is part of the implementation of map. [A list reversal is
used so that map is implemented by two tail-recursive loops, rather than
one non-tail-recursive loop.] In general, one can't perform the
refFlatten-ing optimization in this case. Indeed, the ref cells are
copied from the old list to the new list. Unless we 'know' that the old
list is dead, we can't flatten for fear of breaking sharing. (Someone
might have a copy of the old list, and we need to see the same
references.)
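A small (hypothetical) fragment illustrates the sharing that flattening
would break:

```sml
(* Both lists contain pointers to the same ref cell; an update through
   one list must remain visible through the other.  Flattening the ref
   into each cons cell would give each list its own copy of the cell. *)
val r : int ref = ref 0
val old = [r]
val new = rev old          (* copies the pointer to r, not the cell itself *)
val () = r := 1
val observed = !(hd new)   (* observed = 1: the update is shared *)
```

If the ref were flattened into the cons cells of both lists, the update
`r := 1` would have to be reflected in two separate flattened copies,
which the compiler cannot guarantee without knowing the old list is dead.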
We can change the implementation to the following:
val system : body list =
   let
      val f = fn {pos, vel, mass} =>
                 {pos = ref pos,
                  vel = ref (vel |* daysPerYear),
                  mass = mass * solarMass}
   in
      foldr (fn (x, l) => (f x) :: l) []
            [{pos = {x = 0.0, y = 0.0, z = 0.0},
              vel = {x = 0.0, y = 0.0, z = 0.0},
              mass = 1.0}, ...]
   end
There is still a list reversal as part of the implementation of foldr, but
now the input list is reversed. (One can't take this as the
implementation of map, because it would apply the mapped function to the
elements in the wrong order.) The deepFlatten pass still yields the same
list_11 representation for the 'body list' type, but now the SSA2 program
into the refFlatten pass has exactly one ::_11 allocation (nearly
alpha-equivalent to the L_271 block above). However, the 'pos'
field is still not subject to being refFlatten-ed.
Another (syntactic) requirement of the refFlatten optimization is that the
ref cell updates occur in the same block as the selection of the ref cell
from its containing data-structure. This requirement is violated in the
SSA2 program:
L_389 (x_567: (real64 ref * real64 ref * real64 ref),
x_566: ((real64 * real64 * real64) ref),
x_565: list_11)
x_2456: (real64 * real64 * real64) = #0 x_566
x_2457: real64 = #2 x_567
x_2458: real64 = #1 x_567
x_2459: real64 = #0 x_567
x_576: real64 = Real64_mul (x_2459, global_175)
x_575: real64 = Real64_mul (x_2458, global_175)
x_573: real64 = Real64_mul (x_2457, global_175)
x_2453: real64 = #2 x_2456
x_2454: real64 = #1 x_2456
x_2455: real64 = #0 x_2456
x_571: real64 = Real64_add (x_2455, x_576)
x_570: real64 = Real64_add (x_575, x_2454)
x_569: real64 = Real64_add (x_573, x_2453)
x_568: (real64 * real64 * real64) = (x_571, x_570, x_569)
x_566 := x_568
case x_565 of
nil_7 => L_391 | ::_11 => L_1357
The 'pos' field reference is x_566, which is passed as a first-class value
into the L_389 block, rather than selected from the cons cell within the
block. The motivation for this syntactic requirement is to both make it
easy to find the containing data structure and to ensure that the
flattening of the ref cell into the containing data structure does not
violate space safety.
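At the source level, the violated pattern corresponds roughly to passing
the ref itself into a function as a first-class value (the names 'step'
and 'advance' here are hypothetical; Vec3D.t, body, and the operators are
assumed from the benchmark):

```sml
(* The update site inside 'step' is in a different block from the
   select that extracted the 'pos' ref from the cons cell, so
   refFlatten's syntactic check fails. *)
fun step (pos : Vec3D.t ref, vel : Vec3D.t ref, dt : real) =
   pos := !pos |+| !vel |* dt

fun advance (dt : real) (bodies : body list) =
   List.app (fn {pos, vel, mass} => step (pos, vel, dt)) bodies
```

Unless 'step' is inlined into the loop that selects the fields, the update
and the selection end up in different blocks, exactly as with x_566 above.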
Neither of the syntactic requirements is a semantic requirement -- there
exist programs that violate the syntactic requirements but could still be
validly optimized. For example, consider the following common idiom:
val r = {a = ref (f x), b = ref (g y)}
If the code for g is not inlined (or includes a loop), then the ref cell
allocated for the a field will be separated from the allocation of the
record, because, in A-normal form, this code looks like:
val r = let val fx = f x
val ar = ref fx
val gy = g y (* point of possible block breaking *)
val br = ref gy
val r = {a = ar, b = br}
in r end
One can restore the syntactic requirement in the above code as:
val r = let val fx = f x
val gy = g y
in {a = ref fx, b = ref gy}
end
The latter requirement is also commonly violated by programs, because, as
in the above code, components of a datatype variant are passed as
first-class values and, thus, become separated from their selection from
the containing data structure.
In any case, it seems clear that there are ways of improving the
deepFlatten and refFlatten optimizations. Not to mention the fact that
there have been other reports of deepFlatten and refFlatten requiring
inordinate amounts of memory for large input programs.