CVS Commit
Matthew Fluet
Matthew Fluet <fluet@CS.Cornell.EDU>
Wed, 26 Sep 2001 14:46:22 -0400 (EDT)
> Could you check the addresses of the 2 versions and see if the alignments
> were different?
Here are the full details. There is a base version and an optimized
version. Both versions contain the same assembly instructions; only the
order of the blocks differs. The main loop in the base version is:
loop_1:
testl %esi,%esi
jz L_30
L_20:
movl $38,%edx
xorl %ecx,%ecx
movl $1,%edi
fibP_0:
testl %edx,%edx
jz L_31
L_21:
decl %edx
jo L_40
noOverflow_0:
addl %ecx,%edi
jo L_33
noOverflow_1:
xchgl %edi,%ecx
jmp fibP_0
.p2align 2
L_33:
L_40:
movl $1,%esi
L_0:
...
.p2align 2 ###
L_31: ###
L_24: ###
cmpl $39088169,%ecx ###
jne L_37 ###
L_26: ###
decl %esi ###
jo L_40 ###
noOverflow_2: ###
jmp loop_1 ###
.p2align 2
L_37:
L_45:
movl (globalpointer+(4*4)),%esi
jmp L_0
.p2align 2
L_30:
L_18:
movl c_stackP,%esi
xchgl %esi,%esp
pushl $0
addl $12,%ebp
movl $L_46,(%ebp)
movl %ebp,(gcState+40)
movl %esi,(gcState+12)
call MLton_exit
movl (gcState+40),%ebp
movl (gcState+12),%esp
jmp *(%ebp)
The "..." stands for about 70 lines of assembly, including labels and
.p2align directives. In the optimized version, I moved the lines marked
### to immediately after the jmp fibP_0 instruction.
In the base version, I get the following:
[fluet@lennon mlton]$ /usr/bin/time ./tailfib
25.10user 0.03system 0:25.12elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (89major+10minor)pagefaults 0swaps
[fluet@lennon mlton]$ nm -n tailfib | grep "..."
08048e19 t loop_1
08048e21 t L_20
08048e2d t fibP_0
08048e35 t L_21
08048e38 t noOverflow_0
08048e3c t noOverflow_1
08048e40 t L_33
08048e40 t L_40
08048f14 t L_31
08048f1c t L_26
08048f23 t noOverflow_2
08048f28 t L_37
08048f28 t L_45
08048f34 t L_18
08048f34 t L_30
In the optimized version, I get the following:
[fluet@lennon mlton-opt]$ /usr/bin/time ./tailfib
22.95user 0.04system 0:23.05elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (89major+10minor)pagefaults 0swaps
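For reference, the two user times reported by /usr/bin/time can be turned into a relative figure. This is only a small arithmetic sketch; the variable names are made up for illustration, and a single run of each binary says nothing about run-to-run noise.

```python
# User times (seconds) copied from the two /usr/bin/time runs above.
base_user = 25.10   # base layout
opt_user = 22.95    # optimized layout

saved = base_user - opt_user        # absolute time saved
reduction = saved / base_user       # fraction of the base run time saved
speedup = base_user / opt_user      # ratio of base time to optimized time

print(f"saved {saved:.2f}s, {reduction:.1%} less time, {speedup:.3f}x")
```

The reduction comes out a little under 9% of the base run time.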
[fluet@lennon mlton-opt]$ nm -n tailfib | grep "..."
08048e19 t loop_1
08048e21 t L_20
08048e2d t fibP_0
08048e31 t L_21
08048e34 t noOverflow_0
08048e38 t noOverflow_1
08048e3c t L_31
08048e48 t L_26
08048e4b t noOverflow_2
08048e50 t L_33
08048e50 t L_40
08048f24 t L_37
08048f24 t L_45
08048f30 t L_18
08048f30 t L_30
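The two nm listings can be compared mechanically by taking the distance from each label to the next one. The sketch below copies only the hot-loop labels from the listings above; the helper name block_sizes is hypothetical, and the size of a block here includes any alignment padding that precedes the next label.

```python
# Hot-loop label addresses copied from the two `nm -n` listings above.
BASE = {
    "loop_1": 0x08048e19, "L_20": 0x08048e21, "fibP_0": 0x08048e2d,
    "L_21": 0x08048e35, "noOverflow_0": 0x08048e38,
    "noOverflow_1": 0x08048e3c, "L_33": 0x08048e40,
}
OPT = {
    "loop_1": 0x08048e19, "L_20": 0x08048e21, "fibP_0": 0x08048e2d,
    "L_21": 0x08048e31, "noOverflow_0": 0x08048e34,
    "noOverflow_1": 0x08048e38, "L_31": 0x08048e3c,
}

def block_sizes(symbols):
    """Map each label to the byte distance to the next label in address order.

    The highest-addressed label has no successor and is left out.
    """
    ordered = sorted(symbols.items(), key=lambda kv: kv[1])
    return {name: nxt - addr
            for (name, addr), (_, nxt) in zip(ordered, ordered[1:])}

base_sizes = block_sizes(BASE)
opt_sizes = block_sizes(OPT)

for name in base_sizes:
    marker = "  <-- differs" if base_sizes[name] != opt_sizes[name] else ""
    print(f"{name:14s} base={base_sizes[name]:2d}  opt={opt_sizes[name]:2d}{marker}")
```

Only the block starting at fibP_0 changes size: 8 bytes in the base version and 4 in the optimized one.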
As best I can make out, the only significant difference is that in the
base version, the jz L_31 instruction is 4 bytes longer (fibP_0 to L_21
spans 8 bytes instead of 4). Granted, this is in the hottest loop, but
is that enough to explain a roughly 9% speedup (25.10s vs. 22.95s user
time)?
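The 4-byte difference is consistent with the two standard x86 encodings of jz: a 2-byte short form with an 8-bit displacement and a 6-byte near form with a 32-bit displacement. The sketch below checks which form fits in each layout. It assumes that testl %edx,%edx occupies 2 bytes and that the assembler picks the short form whenever the target is in range; the function name jz_length is made up for illustration.

```python
TESTL_LEN = 2      # testl %edx,%edx: opcode byte + ModRM byte (assumed)
JZ_SHORT_LEN = 2   # jz rel8:  0x74, disp8
JZ_NEAR_LEN = 6    # jz rel32: 0x0F 0x84, disp32

def jz_length(jz_addr, target):
    """Length of `jz target` placed at jz_addr, assuming the assembler
    uses the short form whenever the displacement fits in a signed byte.
    The displacement is measured from the end of the instruction."""
    disp = target - (jz_addr + JZ_SHORT_LEN)
    return JZ_SHORT_LEN if -128 <= disp <= 127 else JZ_NEAR_LEN

FIBP_0 = 0x08048e2d              # same address in both listings
jz_addr = FIBP_0 + TESTL_LEN     # jz follows the testl at fibP_0

base_len = jz_length(jz_addr, 0x08048f14)   # L_31 in the base version
opt_len = jz_length(jz_addr, 0x08048e3c)    # L_31 in the optimized version

print(base_len, opt_len, base_len - opt_len)
```

Under these assumptions the predicted address of L_21 (jz_addr plus the jz length) matches both nm listings: 0x08048e35 in the base version and 0x08048e31 in the optimized one.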