CVS Commit

Wed, 26 Sep 2001 14:46:22 -0400 (EDT)

> Could you check the addresses of the 2 versions and see if the alignments
> were different?

Here are the full details.  There is a base version and an optimized
version.  Both versions have identical assembly instructions.  The main
loop is the following in the original version:

loop_1:
	testl %esi,%esi
	jz L_30
L_20:
	movl $38,%edx
	xorl %ecx,%ecx
	movl $1,%edi
fibP_0:
	testl %edx,%edx
	jz L_31
L_21:
	decl %edx
	jo L_40
noOverflow_0:
	addl %ecx,%edi
	jo L_33
noOverflow_1:
	xchgl %edi,%ecx
	jmp fibP_0
.p2align 2
L_33:
L_40:
	movl $1,%esi
L_0:
...
.p2align 2                               ###
L_31:                                    ###
L_24:                                    ###
	cmpl $39088169,%ecx              ###
	jne L_37                         ###
L_26:                                    ###
	decl %esi                        ###
	jo L_40                          ###
noOverflow_2:                            ###
	jmp loop_1                       ###
.p2align 2
L_37:
L_45:
	movl (globalpointer+(4*4)),%esi
	jmp L_0
.p2align 2
L_30:
L_18:
	movl c_stackP,%esi
	xchgl %esi,%esp
	pushl $0
	addl $12,%ebp
	movl $L_46,(%ebp)
	movl %ebp,(gcState+40)
	movl %esi,(gcState+12)
	call MLton_exit
	movl (gcState+40),%ebp
	movl (gcState+12),%esp
	jmp *(%ebp)

"..." is about 70 lines of assembly, including labels and .p2align
directives.  In the optimized version, I moved the ### lines to
immediately following the jmp fibP_0 instruction.
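
For readers who do not want to trace the registers by hand, here is a
rough Python sketch of what the listing above appears to compute.  The
names fib_tail, run and reps are made up for illustration; reps stands
in for %esi, whose initial value is not shown in the listing.  The jo
overflow checks have no Python counterpart and are omitted.

```python
def fib_tail(n, a=0, b=1):
    # n plays the role of %edx, a of %ecx, b of %edi.
    while n != 0:            # testl %edx,%edx ; jz L_31
        n -= 1               # decl %edx
        a, b = a + b, a      # addl %ecx,%edi ; xchgl %edi,%ecx
    return a                 # value compared at L_31


def run(reps):
    # reps is a hypothetical stand-in for %esi (outer loop counter).
    while reps != 0:         # testl %esi,%esi ; jz L_30
        if fib_tail(38) != 39088169:   # cmpl $39088169,%ecx ; jne L_37
            return False
        reps -= 1            # decl %esi
    return True


if __name__ == "__main__":
    print(fib_tail(38))
```

The inner while loop is the fibP_0 block, which is why moving the ###
lines next to it changes the distance that jz L_31 has to cover.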

In the base version, I get the following:

[fluet@lennon mlton]$ /usr/bin/time ./tailfib
25.10user 0.03system 0:25.12elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (89major+10minor)pagefaults 0swaps
[fluet@lennon mlton]$ nm -n tailfib | grep "..."
08048e19 t loop_1
08048e21 t L_20
08048e2d t fibP_0
08048e35 t L_21
08048e38 t noOverflow_0
08048e3c t noOverflow_1
08048e40 t L_33
08048e40 t L_40
08048f14 t L_31
08048f1c t L_26
08048f23 t noOverflow_2
08048f28 t L_37
08048f28 t L_45
08048f34 t L_18
08048f34 t L_30

In the optimized version, I get the following:

[fluet@lennon mlton-opt]$ /usr/bin/time ./tailfib
22.95user 0.04system 0:23.05elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (89major+10minor)pagefaults 0swaps
[fluet@lennon mlton-opt]$ nm -n tailfib | grep "..."
08048e19 t loop_1
08048e21 t L_20
08048e2d t fibP_0
08048e31 t L_21
08048e34 t noOverflow_0
08048e38 t noOverflow_1
08048e3c t L_31
08048e48 t L_26
08048e4b t noOverflow_2
08048e50 t L_33
08048e50 t L_40
08048f24 t L_37
08048f24 t L_45
08048f30 t L_18
08048f30 t L_30

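As best I can make out, the only significant difference is that in the
base version, the instruction corresponding to jz L_31 requires 4
additional bytes (fibP_0 to L_21 spans 8 bytes instead of 4),
presumably because L_31 is too far away for the short form of the
jump.  The address of fibP_0 itself is the same in both versions.
Granted, this is in the hottest loop, but is that enough to explain a
roughly 9% speedup (25.10s versus 22.95s user time)?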
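
The arithmetic behind that comparison can be checked with a few lines
of Python.  This is only a sketch: the addresses and timings are copied
by hand from the two transcripts above, and the "loop" sizes are
measured up to the first label after jmp fibP_0, so they may include
.p2align padding and should be read as upper bounds.

```python
# Addresses copied from the two nm listings (only the labels needed here).
BASE = {"fibP_0": 0x08048E2D, "L_21": 0x08048E35, "L_33": 0x08048E40}
OPT = {"fibP_0": 0x08048E2D, "L_21": 0x08048E31, "L_31": 0x08048E3C}

# testl + jz L_31: distance from fibP_0 to the next label, L_21.
head_base = BASE["L_21"] - BASE["fibP_0"]
head_opt = OPT["L_21"] - OPT["fibP_0"]

# Inner loop: fibP_0 up to the first label after jmp fibP_0
# (L_33 in the base layout, L_31 in the optimized layout).
loop_base = BASE["L_33"] - BASE["fibP_0"]
loop_opt = OPT["L_31"] - OPT["fibP_0"]

# Does the loop entry sit at the same address in both builds?
same_entry = BASE["fibP_0"] == OPT["fibP_0"]

# User times from the two /usr/bin/time runs.
base_user, opt_user = 25.10, 22.95
saving = (base_user - opt_user) / base_user

if __name__ == "__main__":
    print("testl+jz bytes:", head_base, "vs", head_opt)
    print("inner loop bytes (upper bound):", loop_base, "vs", loop_opt)
    print("fibP_0 at same address:", same_entry)
    print("user-time saving: {:.1%}".format(saving))
```

With these numbers the testl/jz pair is 8 bytes versus 4, the inner
loop is at most 19 bytes versus 15, and the user-time saving is about
8.6%.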