profiling go
Stephen Weeks
MLton@sourcelight.com
Sat, 9 Jun 2001 12:35:28 -0700
> Here is the loop that computes the checksum of a vector of bytes:
>
> fun loop_55 (x_495, x_494) =
>   if x_494 = x_493
>   then if x_489 = x_495
>        then SOME_1 x_491
>        else raise BadChecksum
>   else loop_55 (Word32.+ (Word32.tolargeWord (Vector.sub (x_491, x_494)),
>                           Word32.+ (0w63,
>                                     Word32.* (0wx1234567, x_495))),
>                 x_494 + 1 (overflow => raise Overflow))
I'm a bit confused. Should the Word32.tolargeWord be Word8.toLargeWord? Also,
it looks like you have rearranged the order of things since the x_494 + 1 is
computed first by the assembly code (which would imply right-to-left
evaluation). Could you please run MLton with "-show-types true" and send the
unedited CPS code?
Assuming x_491 is a Word8.word vector, you might be able to speed stuff up by
using Pack32Little.subVec to read an entire word at a time instead of a byte.
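
To make the word-at-a-time idea concrete, here is a minimal C sketch (not MLton code; the helper names read_le32 and byte_of are hypothetical). It assembles bytes 4*i .. 4*i+3 of a buffer into one little-endian 32-bit word, which is roughly what Pack32Little.subVec does on the SML side, and then picks the individual bytes back out with shifts so the per-byte checksum can stay unchanged. It assumes the buffer length is a multiple of four.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: read the i-th 32-bit little-endian word
   from a byte buffer, i.e. bytes 4*i .. 4*i+3. */
static uint32_t read_le32(const uint8_t *buf, size_t i)
{
    const uint8_t *p = buf + 4 * i;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Byte k (0..3) of a word, lowest-addressed byte first. */
static uint8_t byte_of(uint32_t w, unsigned k)
{
    return (uint8_t)(w >> (8 * k));
}
```

Whether this actually helps depends on how cheaply the compiler turns the shifts into register operations, so treat it as something to measure rather than a guaranteed win.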
Here are my comments on the assembly.
loop_55:
(36)(%edi) == x_491
(40)(%edi) == x_493
(44)(%edi) == x_495
(48)(%edi) == x_494
        movl (48*1)(%edi),%eax          %eax = x_494
        cmpl (40*1)(%edi),%eax          if x_494 = x_493
        je L_423
        movl %eax,%ebx                  %ebx = x_494
        incl %ebx                       %ebx = x_494 + 1
        jo L_427
        movl (36*1)(%edi),%ecx          %ecx = x_491
        movb (%ecx,%eax,1),%dl          %dl = Vector.sub (x_491, x_494)
        movl %ebx,(48*1)(%edi)          x_494 = %ebx
        movzbl %dl,%eax                 %eax = Word8.toLargeWord (%dl)
        movl %eax,localuint
        movl (44*1)(%edi),%eax          %eax = x_495
        movl $0x1234567,%ebx            %ebx = 0wx1234567
        xorl %edx,%edx
        mull %ebx                       %eax = x_495 * 0wx1234567
        addl $0x63,%eax                 %eax = Word32.+ (0w63, ...)
        addl localuint,%eax             %eax = Word32.+ (Word8.toLargeWord ...)
        movl %eax,(44*1)(%edi)          x_495 = %eax
        jmp loop_55
Shouldn't the $0x63 be $0x3F?
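
The arithmetic behind that question: in SML, 0w63 is a decimal word literal (hexadecimal would be written with the 0wx prefix, as in 0wx1234567), so I would expect the immediate to be 63 = 0x3F, whereas $0x63 is 99. Spelled out in C, with made-up constant names:

```c
/* SML's 0w63 is a decimal word literal; 0wx... would be hexadecimal. */
enum {
    SML_0W63 = 63,    /* what the CPS code asks for: 63 == 0x3F */
    ASM_0X63 = 0x63   /* what the assembly shows:  0x63 == 99   */
};
```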
> I'm confused by the constant re-loading of %ecx (x_491 in the CPS code).
I'm betting it's because x_491 is a pointer and is live across a limit check,
and hence we won't let it live in a register.
> Also the storing of %eax in localuint.
I agree. I don't know why we didn't use another register.
I don't understand the "xorl %edx, %edx".
I would think we could at least keep x_494 and x_495 in registers around the
loop.
All in all, pretty bad code, mostly due to the register allocator (both in the
backend and the codegen).
I guess if this is your only hot loop, you can FFI it to C for the time being?
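
If you go the C route, something along these lines might do. This is only a sketch based on my reading of the CPS code above: the function name checksum_bytes is made up, the initial accumulator is passed in because the loop as posted doesn't show its starting value, and the MLton-side FFI declaration is left out. It applies the same recurrence as loop_55, acc' = byte + (63 + 0x1234567 * acc), in wrapping 32-bit arithmetic.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical FFI target: same recurrence as loop_55,
   acc' = byte + (63 + 0x1234567 * acc), with uint32_t
   arithmetic wrapping modulo 2^32 like Word32. */
uint32_t checksum_bytes(const uint8_t *buf, size_t len, uint32_t acc)
{
    for (size_t i = 0; i < len; i++)
        acc = (uint32_t)buf[i] + (63u + 0x1234567u * acc);
    return acc;
}
```

The SML side would then compare the returned word against the expected checksum (x_489 above) and raise BadChecksum on a mismatch, so the exception logic stays in SML.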