[MLton] s->alignment considered harmful

Thu Nov 1 08:00:19 PST 2007

On Tue, 2 Oct 2007, Florian Weimer wrote:
> Storing s->alignment in a variable means that the alignment code in
> util/align.h uses full integer divisions to perform the alignment.
> Consequently, any allocation-intense program mainly stresses the
> integer divider.  A simple s/s->alignment/4/ (assuming that the
> default alignment value is 8 on i386) leads to a 60% speed increase
> for examples/thread-switch.sml on a slightly dated Opteron.
>
> The best option seems to be to compile the GC multiple times, for each
> alignment choice.  Alternatively, the util/align.h could be tweaked to
> assume power-of-two alignment, or the alignment could be stored as a
> mask.
>
> I think this accounts for the slowdown that has been observed after
> the x86_64 branch merge.

It is an interesting observation, though I'm not sure that all the 
conclusions are justified.

For one, the s->alignment variable has been there since r1980 (Apr. 2003), 
so it wasn't introduced by the x86_64 branch.  It can't really be blamed 
for the slowdown in the x86_64 branch.  (Though, you are correct that the 
slowdown can probably be attributed to other forms of parametrization that 
were introduced by the x86_64 branch.)

Second, almost all allocation (with the exception of large arrays and 
(new) threads) is done by the ML program, not by the runtime.  That is, 
the mutator (which is compiled with a specific alignment) need never 
consult s->alignment for allocations.  Now, it is the case that an 
allocation intensive program will invoke the GC more often, which will 
then need to dynamically consult s->alignment.  But, the issue is really a 
GC-intense program, not an allocation-intense program.  (Though, the 
two kinds of intensities really only differ if you use runtime options to 
deviate from MLton's default heap sizing heuristics; one can allocate a 
large heap to an allocation intense (but low live data) program, which 
will cut down on the number of GCs; one can specify a small heap to an 
allocation light program, which will increase the number of GCs.)

But, 60% speed increase is pretty impressive.  On my Xeon, thread-switch 
spends about 25% of the time in GC, so making s->alignment constant seems 
to help more than just the GC time.  Nonetheless, it would be interesting 
to see the results of a constant s->alignment on the benchmark suite.

As for tweaking the implementation of util/align.h, one difficulty is that 
align has "round up" semantics, not "round down".  So, a simple mask won't 
work.  On the other hand, we do assume that the alignment is a power of 
two, so maybe there are some bit operations that would be faster than an 
integer divide.