good in general, but bad nested loops
Stephen Weeks
MLton@sourcelight.com
Tue, 10 Jul 2001 15:02:44 -0700
I took a look at nestedloop.{gcc,mlton,ocaml}
First, here is how I compiled each of them.
gcc -O3 -fomit-frame-pointer -o nestedloop nestedloop.c
ocamlopt -S -o nestedloop -noassert -unsafe -ccopt -O3 nestedloop.ml
mlton -detect-overflow false -keep cps -keep g nestedloop.sml
Here are the running times for "nestedloop 23"
gcc .95
ocaml 1.41
mlton 2.65
These jibe pretty well with the shootout numbers.
Here is what gcc generates.
.L60:
incl %ebp
decl %edx
jne .L60
.L72:
decl %ecx
jne .L56
.L71:
decl %ebx
jne .L52
.L70:
decl %esi
jne .L48
.L69:
decl %edi
jne .L44
.L68:
decl 8(%esp)
jne .L40
So, it has changed the loop to count down from n to 0 instead of up from 0 to n.
It has also kept five of the loop variables in registers, plus the counter.
Most importantly, the hot loop, L_60, is only three instructions.
Here's the hot loop (about 85%) for mlton. This is the same as what Matthew
sent earlier.
loopF_0:
movl (220*1)(%edi),%esp
cmpl $0,%esp
je L_84
decl %esp
movl %esp,(220*1)(%edi)
incl (216*1)(%edi)
jmp loopF_0
Here's the hot loop for ocaml. The low bit in integers is a tag bit.
.L101:
cmpl $1, %ebx
je .L100
addl $-2, %ebx
addl $2, %eax
jmp .L101
So, ocaml, like gcc keeps both the loop counter and the main counter (x) in
registers.
So ocaml beats us because of register allocation, and gcc beats ocaml by
reordering the loop so that there is only one test, which saves two
instructions.
gcc is also noticing the loop nesting, which ocaml doesn't do -- in fact ocaml
leaves the nested loop as calls. But, loop nesting isn't necessary to get most
of the speed on this benchmark. If ocaml just did a little more better local
code generation, it would be very close to gcc.
We can get potentially the same as gcc, because we have the loop nest. We
just(!) need to do the right register allocation and instruction reordering. Oh
yeah, and figuring out the Overflows can't happen.