[MLton] benchmarks with -representation {normal,packed}
Stephen Weeks
MLton@mlton.org
Fri, 23 Apr 2004 12:48:09 -0700
> Where's the speedup coming from? Avoiding GCs? Less memory traffic in
> allocating heap objects?
My guess is less memory traffic.
Another possibility is a change in how case statements are
implemented for sum types like
datatype t = A | B | C of int
The old way was to test whether the bottom two bits are zero and,
depending on the result, branch either to a switch over the
value-carrying variants or to a switch over the non-value-carrying
variants.
The new way is to do a single switch on all the tag bits, either
jumping immediately to the correct non-value-carrying variant or
going on to a switch over the value-carrying variants.
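To make the difference concrete, here's a rough C sketch of the two
dispatch strategies (illustrative only; the tag assignments, header
layout, and names are made up, not MLton's actual scheme):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* datatype t = A | B | C of int, as a tagged word: the
     non-value-carrying variants A and B are small odd constants,
     while a value-carrying C is a pointer to a heap cell, so its
     bottom two bits are zero. */
  typedef uintptr_t t;
  #define A ((t)0x1)
  #define B ((t)0x3)

  t make_C(int i) {
    uintptr_t *p = malloc(2 * sizeof *p);
    p[0] = 0;              /* header word identifying variant C */
    p[1] = (uintptr_t)i;   /* the carried int */
    return (t)p;
  }

  /* Old way: test the bottom two bits first, then dispatch within
     the chosen group; every match pays for two branches. */
  void match_old(t w) {
    if ((w & 0x3) == 0)    /* pointer: value-carrying variants */
      printf("C %d\n", (int)((uintptr_t *)w)[1]);
    else                   /* constant: non-value-carrying variants */
      switch (w) {
      case A: printf("A\n"); break;
      case B: printf("B\n"); break;
      }
  }

  /* New way: one switch on all the tag bits reaches each
     non-value-carrying variant directly; only the pointer case
     (with several value-carrying variants, a switch on the header)
     needs a second dispatch. */
  void match_new(t w) {
    switch (w & 0x3) {
    case 0x1: printf("A\n"); break;
    case 0x3: printf("B\n"); break;
    case 0x0: printf("C %d\n", (int)((uintptr_t *)w)[1]); break;
    }
  }

  int main(void) {
    match_old(A);  match_old(make_C(42));
    match_new(B);  match_new(make_C(7));
    return 0;
  }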
> I could imagine a slowdown due to extra bit-fiddling to manipulate
> packed objects.
I think it's pretty unlikely. A couple of bit ops are cheaper than a
memory op.
> I would expect packed representations to pay off in programs that are
> using a lot of memory and doing frequent GCs (so, I'm interested to hear
> if there is a self-compile speedup), but I didn't think that applied to
> many of the benchmarks.
Yeah, I agree that the benchmarks aren't the best demonstration of
this. But I should at least cook one up to show what's possible :-).
Here are the time and allocation figures for a self-compile, with
MLton compiled -representation normal and -representation packed
(each compiling MLton -representation packed).
                     normal            packed   ratio
            ---------------   ---------------   -----
time        415.13 + 197.84   391.39 + 205.09   1.028
allocated    32,025,718,116    31,150,575,268   1.028
So, we get a few percent decrease in allocation and run time by
packing.
> Again, I would have thought the bit fiddling of packed representations
> would have increased code size, rather than decrease.
The bit fiddling isn't that messy, and each bit fiddle corresponds to
a memory fiddle. To take an extreme example, consider allocating a
tuple of type "bool * bool * bool * bool". In the packed world, this
type is represented as a single byte (with the high four bits zero).
So, in Rssa, building the tuple looks like:
x_864 = Word32_toWord8 (b3_1)
x_862 = Word32_toWord8 (b2_1)
x_863 = Word8_lshift (x_864, 0wx1)
x_861 = Word8_orb (x_863, x_862)
x_859 = Word32_toWord8 (b1_1)
x_860 = Word8_lshift (x_861, 0wx1)
x_858 = Word8_orb (x_860, x_859)
x_856 = Word32_toWord8 (x_416)
x_857 = Word8_lshift (x_858, 0wx1)
x_855 = Word8_orb (x_857, x_856)
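In C terms, that whole sequence is three shift/or steps on a single
byte (just a sketch of the idea, not generated code; the parameter
names follow the Rssa above, with x_416 as the fourth bool):

  #include <stdint.h>

  /* Pack four bools (each 0 or 1 in a 32-bit word) into one byte:
     b3_1 ends up in bit 3 and x_416 in bit 0, with the high four
     bits zero. All of the work stays in registers. */
  uint8_t pack4(uint32_t b3_1, uint32_t b2_1,
                uint32_t b1_1, uint32_t x_416) {
    uint8_t x = (uint8_t)b3_1;
    x = (uint8_t)((x << 1) | (uint8_t)b2_1);
    x = (uint8_t)((x << 1) | (uint8_t)b1_1);
    x = (uint8_t)((x << 1) | (uint8_t)x_416);
    return x;
  }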
On the other hand, in the unpacked world, this type is represented
as a five-word object, with one word for the header and one for each
bool. So, in Rssa, building the tuple looks like:
x_534
  = Object {header = 0x1D,
            size = 20,
            stores = ({offset = 0, value = b3_1},
                      {offset = 4, value = b2_1},
                      {offset = 8, value = b1_1},
                      {offset = 12, value = x_416})}
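Morally, the unpacked version is a heap-allocated five-word struct
(again a sketch, not MLton's real runtime layout; in the Rssa above
the store offsets are relative to the first word after the header):

  #include <stdint.h>
  #include <stdlib.h>

  /* One header word plus one full word per bool: 5 * 4 = 20 bytes
     on a 32-bit system, matching size = 20 above. */
  struct tuple4 {
    uint32_t header;            /* 0x1D in the Rssa above */
    uint32_t b3, b2, b1, b0;    /* one word per bool */
  };

  struct tuple4 *alloc4(uint32_t b3, uint32_t b2,
                        uint32_t b1, uint32_t b0) {
    struct tuple4 *p = malloc(sizeof *p);  /* heap allocation ... */
    p->header = 0x1D;                      /* ... plus five stores */
    p->b3 = b3; p->b2 = b2; p->b1 = b1; p->b0 = b0;
    return p;
  }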
If we get down to the assembly, the code size is pretty similar.
Here's the packed tuple:
movb %cl,%ah
movb %dl,%al
shlb $0x1,%ah
orb %ah,%al
movb %bl,%dh
shlb $0x1,%al
orb %al,%dh
movl %esi,%eax
movb %al,%dl
shlb $0x1,%dh
orb %dh,%dl
and the unpacked one:
movl $0x1D,(0+(0*4))(%esp)
leal (0+(4*1))(%esp),%ecx
movl %esi,(0+((4*1)+(0*1)))(%ecx)
movl %edi,(0+((8*1)+(0*1)))(%ecx)
movl %edx,(0+((0*1)+(0*1)))(%ecx)
movl (0+((28*1)+(0*1)))(%ebp),%esi
movl %esi,(0+((12*1)+(0*1)))(%ecx)
addl $20,%esp