[MLton] s->alignment considered harmful
Matthew Fluet
fluet at tti-c.org
Thu Nov 1 08:00:19 PST 2007
On Tue, 2 Oct 2007, Florian Weimer wrote:
> Storing s->alignment in a variable means that the alignment code in
> util/align.h uses full integer divisions to perform the alignment.
> Consequently, any allocation-intense program mainly stresses the
> integer divider. A simple s/s->alignment/4/ (assuming that the
> default alignment value is 8 on i386) leads to a 60% speed increase
> for examples/thread-switch.sml on a slightly dated Opteron.
>
> The best option seems to be to compile the GC multiple times, for each
> alignment choice. Alternatively, the util/align.h could be tweaked to
> assume power-of-two alignment, or the alignment could be stored as a
> mask.
>
> I think this accounts for the slowdown that has been observed after
> the x86_64 branch merge.
It is an interesting observation, though I'm not sure that all the
conclusions are justified.
For one, the s->alignment variable has been there since r1980 (Apr. 2003),
so it wasn't introduced by the x86_64 branch. It can't really be blamed
for the slowdown in the x86_64 branch. (Though, you are correct that the
slowdown can probably be attributed to other forms of parametrization that
were introduced by the x86_64 branch.)
Second, almost all allocation (with the exception of large arrays and
(new) threads) is done by the ML program, not by the runtime. That is,
the mutator (which is compiled with a specific alignment) need never
consult s->alignment for allocations. Now, it is the case that an
allocation intensive program will invoke the GC more often, which will
then need to dynamically consult s->alignment. But, the issue is really a
GC-intense program, not an allocation-intense program. (Though, the
two kinds of intensities really only differ if you use runtime options to
deviate from MLton's default heap sizing heuristics; one can allocate a
large heap to an allocation intense (but low live data) program, which
will cut down on the number of GCs; one can specify a small heap to an
allocation light program, which will increase the number of GCs.)
But, 60% speed increase is pretty impressive. On my Xeon, thread-switch
spends about 25% of the time in GC, so making s->alignment constant seems
to help more than just the GC time. Nonetheless, it would be interesting
to see the results of a constant s->alignment on the benchmark suite.
As for tweaking the implementation of util/align.h, one difficulty is that
align has "round up" semantics, not "round down". So, a simple mask won't
work. On the other hand, we do assume that the alignment is a power of
two, so maybe there are some bit operations that would be faster than an
integer divide.
More information about the MLton
mailing list