CVS Commit
Matthew Fluet
Matthew Fluet <fluet@CS.Cornell.EDU>
Wed, 26 Sep 2001 09:56:37 -0400 (EDT)
http://www.cs.cornell.edu/People/fluet/MLton/tree.tgz
/src/lib/mlton/basic/tree.{sig,sml}
Added layout function.
http://www.cs.cornell.edu/People/fluet/MLton/x86-codegen.tgz
/src/mlton/codegen/x86-codegen/x86.{sig,fun}
x86-loop-info.{sig,fun} (* new files *)
x86-live-transfers.{sig,fun}
x86-generate-transfers.fun
Abstracted the calculation and conversion of the loop nesting forest to
x86-loop-info.*. Modified x86-generate-transfers to be "smarter" about
code layout. At an Iff {condition, truee, falsee} transfer, the heuristic
is as follows:
if either the truee's block or falsee's block has already been generated,
then branch to that one, and fall thru to the other
if neither of the truee's and falsee's blocks have been generated,
then fall thru to the label with the least "loop-distance" from the
Iff's label (where loop distance from l1 to l2 is NONE if l1 and l2 are
in non-nested loops, and SOME (d(l2) - d(l1)) where d(l) is the depth of
the deepest occurence of l in the loop forest); this favors staying in
the same loop or entering nested loops over exiting loops
if both labels have the same loop distance, fall thru to the label
with more live variables
if both labels have the same number of live variables, fall thru to truee
Changed the "todo" list of blocks to be processed to a double-ended queue;
Near transfer labels (gotos, ifs, cases) are pushed; far transfers
(handlers, conts) are enqued. Nested loops tend to come out the right
way.
(* End CVS log. More elaboration below. *)
Using the loop forest "solved" the problem in vector-concat, where the
inner-loop was split:
loop_10:
cmpl %edx,%ecx
movl %ecx,(28*1)(%ebp)
je L_70
L_34:
movl (36*1)(%ebp),%ecx
xchgl %esi,%ebx
xchgl %esi,%edi
loop_5:
movl (-2*4)(%ebx),%eax
cmpl %eax,%esi
jnl L_71
L_35:
cmpl %eax,%esi
jae L_72
L_37:
movl (%ebx,%esi,4),%eax
incl %esi
jo L_86
noOverflow_2:
movl %eax,(%ecx,%edx,4)
incl %edx
jo L_86
noOverflow_3:
movl (28*1)(%ebp),%ecx
xchgl %edi,%ebx
xchgl %esi,%edi
jmp loop_10
...
.p2align
L_71:
testl $0x3,%edi
jnz L_79
L_102:
movl (%edi),%ebx
movl (4*1)(%edi),%edi
xorl %esi,%esi
jmp loop_5
Now we get:
loop_10:
cmpl %edx,%ecx
movl %ecx,(28*1)(%ebp)
je L_71
L_34:
movl (36*1)(%ebp),%ecx
xchgl %esi,%ebx
xchgl %esi,%edi
loop_5:
movl (-2*4)(%ebx),%eax
cmpl %eax,%esi
jl L_72
L_40:
testl $0x3,%edi
jnz L_73
L_94:
movl (%edi),%ebx
movl (4*1)(%edi),%edi
xorl %esi,%esi
jmp loop_5
...
.p2align 2
L_72:
L_35:
cmpl %eax,%esi
jae L_74
L_37:
movl (%ebx,%esi,4),%eax
incl %esi
jo L_84
noOverflow_2:
movl %eax,(%ecx,%edx,4)
incl %edx
jo L_84
noOverflow_3:
movl (28*1)(%ebp),%ecx
xchgl %edi,%ebx
xchgl %esi,%edi
jmp loop_10
This has the "problem" that the bottom of the loop_10 loop is far away. I
realized that this was due to the fact that I use enque to add blocks to
the layout todo list. After finishing with the loop_5 loop, the bottom of
the loop_10 loop is ready to go, but it's after everything that has been
enqued so far. So, I changed it to a double-ended queue. Near transfer
labels (gotos, ifs, cases) are pushed; far transfers (handlers, conts) are
enqued. For the most part, this works; nested loops generally come out
the right way:
loop_10:
cmpl %edx,%ecx
movl %ecx,(28*1)(%ebp)
je L_71
L_34:
movl (36*1)(%ebp),%ecx
xchgl %esi,%ebx
xchgl %esi,%edi
loop_5:
movl (-2*4)(%ebx),%eax
cmpl %eax,%esi
jl L_72
L_40:
testl $0x3,%edi
jnz L_73
L_94:
movl (%edi),%ebx
movl (4*1)(%edi),%edi
xorl %esi,%esi
jmp loop_5
.p2align 2
L_73:
L_95:
movl (globalpointer+(4*4)),%esi
jmp L_0
.p2align 2
L_72:
L_35:
cmpl %eax,%esi
jae L_74
L_37:
movl (%ebx,%esi,4),%eax
incl %esi
jo L_84
noOverflow_2:
movl %eax,(%ecx,%edx,4)
incl %edx
jo L_84
noOverflow_3:
movl (28*1)(%ebp),%ecx
xchgl %edi,%ebx
xchgl %esi,%edi
jmp loop_10
The invervening L_73 block is raising an exception; it's small here, so
it's o.k. Overall, this sped vector-concat up from 5.0s to 4.9s on the
benchmarks. On the other hand, it slowed tailfib down from 13.5s to
15.1s; I was somewhat surprised by this, but I think I've figured it out.
The assembly instructions are identical in the old and new versions. The
key loops are the following in both versions:
loop_1:
testl %esi,%esi
jz L_41
L_20:
movl $38,%edx
xorl %ecx,%ecx
movl $1,%edi
fibP_0:
testl %edx,%edx
jz L_42
L_21:
decl %edx
jo L_34
noOverflow_0:
addl %ecx,%edi
jo L_44
noOverflow_1:
xchgl %edi,%ecx
jmp fibP_0
...
.p2align 2
L_42:
L_24:
cmpl $39088169,%ecx
jne L_48
L_26:
decl %esi
jo L_34
noOverflow_2:
jmp loop_1
What's different in the two versions are the relative positions of
L_42,L_34,L_44, and L_48. In the new version, L_44 is at the head of the
queue when we finish the fibP_0 loop; unfortunately, the entire program
has been reduced to one CPS function, and raise-to-jump means that L_44
leads right into the top-level exception handler. By the time L_42 gets
to the head of the queue, we've layed down a lot of code, so we need a
large offset for jz L_42. Similar arguments for the various other lables.
I was able to recover the lost time simply by reordering some of the
blocks in the resulting assembly file. (Note, in all versions, the order
of the blocks are the same; so the static branch predictions should be the
same in all cases; obviously the dynamic branch predicitions are
identical.) So, I think I've convinced myself that it's just bad icache
usage with the new version. I think the right thing to do is make sure
that L_42 is layed down earlier; again, the loop forests should help; the
overflow labels aren't part of any loops, whereas L_42 is related to the
fibP_0 loop. I think the right thing is some sort of priority queue for
the todo list, but I'm not quite sure what the right metric is. Any
suggestions?