CVS Commit
Matthew Fluet
Matthew Fluet <fluet@CS.Cornell.EDU>
Tue, 18 Sep 2001 10:11:25 -0400 (EDT)
http://www.cs.cornell.edu/People/fluet/MLton/x86-codegen.tgz
(* src/mlton/codgen/x86-codegen/ *)
http://www.cs.cornell.edu/People/fluet/MLton/x86codegen.tgz
(* src/include/x86codegen.h *)
http://www.cs.cornell.edu/People/fluet/MLton/main.tgz
(* src/mlton/main/main.sml *)
http://www.cs.cornell.edu/People/fluet/MLton/control.tgz
(* src/mlton/control/control.{sig,sml} *)
Various changes to the x86-codgen.
x86.sig
x86.fun
x86-pseudo.sig
x86-mlton.fun
x86-allocate-registers.fun
x86-validate.fun
Added x86.Instruction.IMUL2 for the two-operand form of the imul
instruction. This form of imul does not require the destination to be in
edx:eax and the source to be in a register. This alleviates some register
pressure and suffling around multiplication instructions. The
disadvantage is that the two-operand form of the imul instruction does not
support byte multiplications, so Word8.* must use the one-operand form.
x86.sig
x86.fun
x86-pseudo.sig
x86-mlton.fun
x86-live-transfers.fun
x86-generate-transfers.fun
Extended the Entry.t datatype with CReturn. Used by
x86-live-transfers.fun to prevent floating-point values from being carried
across C-calls in floating-point registers. (C-calling convention
requires floating-point stack to be cleared.)
x86codegen.h
x86-generate-transfers.fun
Changed reserved register for gcState.stackTop from %edi to %ebp. Changed
reserved register for gcState.frontier from %esi to %esp. Motivated
primarily by a desire to keep %esp from being used as a general purpose
register -- there is unecessary shuffle code when trying to use %esp as an
index in a memory operand.
x86-live-transfers.fun
main.sml
control.sig
control.sml
Major changes.
- improved efficiency of get_distanceF by first computing distanceF' (the
distance from the start of a block to the first use of a memloc in that
block)
- implemented get_distanceB (the distance from the start of a block to
the last use or def of a memloc before that block).
- modified both get_distanceF and get_distanceB to be "smart" about far
transfers; in particular, get_distanceB of a memloc will be PosInfinity if
the memloc is guaranteed to be on the stack (e.g., at function entry, on
return from a non-tail call, on return from the runtime, etc.). Likewise,
get_distanceF of a memloc will be PosInfinity if the memloc is live across
a far transfer.
- memlocs with get_distanceB of PosInfinity are not carried into the
block in registers (i.e., don't load a value too early). This fixed the
problem at function entry of copying args from stack to registers,
copying them back on at the stack check GC, and finally copying them back
to registers on GC return. Likewise, memlocs with get_distanceF of
PosInfinity are not carried into the block in registers.
- smart computation of "sync" attribute of memlocs at block entries;
values that are not carried in registers into a block must be on the
stack; hence, later uses of the value can assume that it is synchronized.
- weight live variables in inner loops higher using the loop forest of
the call-graph (technically, it is a revised call graph; far entries
(function entry, non-tail conts, non-tail handlers, runtime returns) are
all preceeded by a root node; i.e., a block corresponding to L() = K(f ())
does not result in an edge L -> K; instead, there is an edge root -> K;
this corresponds to the fact that we don't want to count the frontier
check loop as a real loop.)
- changed the -native-live-transfer option to accept an integer, rather
than a boolean. -native-live-transfer 0 turns off carrying live values in
registers. -native-live-transfer n is used as follows:
val (useLF, useB, sync)
= case !Control.Native.liveTransfer
of 1 => (false, false, false)
| 2 => (false, false, true)
| 3 => (false, true, false)
| 4 => (false, true, true)
| 5 => (true, false, false)
| 6 => (true, false, true)
| 7 => (true, true, false)
| _ => (true, true, true)
where useLF indicates using the loop forest to weight inner loops, useB
indicates using get_distanceB, and sync corresponds to using smart
computation of sync attribute.
Benchmark Results: (I'm leaving out compile time, because it was with a G0
(i.e., SML/NJ) mlton.
MLton0 -- mlton-stable -native-live-stack true
MLton1 -- mlton -native-live-stack true -native-live-transfer 0
MLton2 -- mlton -native-live-stack true -native-live-transfer 4
MLton3 -- mlton -native-live-stack true -native-live-transfer 8
run time
benchmark MLton0 MLton1 MLton2 MLton3
barnes-hut 4.7 5.0 4.7 4.7
checksum 4.3 4.0 4.0 4.0
count-graphs 6.1 6.4 6.0 6.0
DLXSimulator 18.5 19.0 18.6 18.5
fft 11.3 11.3 11.2 11.1
fib 3.7 4.4 3.9 3.9
hamlet 9.7 10.0 9.5 9.6
knuth-bendix 7.8 7.8 7.9 7.8
lexgen 12.9 13.8 12.6 12.8
life 14.0 11.5 10.8 10.8
logic 27.4 28.8 27.6 27.8
mandelbrot 7.5 8.8 7.4 7.3
matrix-multiply 5.6 4.9 4.8 4.7
md5 4.3 4.2 4.2 4.2
merge 69.5 66.7 66.6 66.9
mlyacc 10.9 11.5 10.7 10.7
mpuz 6.3 5.3 5.6 5.6
nucleic 7.9 8.1 7.9 8.0
peek 4.5 4.6 3.5 3.5
psdes-random 5.0 4.3 4.2 4.0
ratio-regions 10.4 10.3 10.4 10.3
ray 4.5 5.1 4.5 4.5
raytrace 5.6 5.7 5.6 5.6
simple 7.3 7.1 7.0 7.3
smith-normal-form 1.2 1.2 1.2 1.2
tailfib 15.3 18.1 13.9 13.5
tak 9.8 9.8 9.6 9.6
tensor 5.1 7.0 5.3 4.8
tsp 11.2 11.7 11.4 11.3
vector-concat 4.5 7.7 3.7 4.9
vector-rev 5.0 5.1 5.0 5.0
vliw 7.1 7.5 6.8 6.9
wc-input1 3.5 3.1 3.5 3.0
wc-scanStream 3.2 4.6 2.9 2.9
zebra 3.4 3.1 3.0 3.1
zern 41.6 44.8 41.0 40.9
run time ratio
benchmark MLton1 MLton2 MLton3
barnes-hut 1.1 1.0 1.0
checksum 0.9 0.9 0.9
count-graphs 1.0 1.0 1.0
DLXSimulator 1.0 1.0 1.0
fft 1.0 1.0 1.0
fib 1.2 1.0 1.0
hamlet 1.0 1.0 1.0
knuth-bendix 1.0 1.0 1.0
lexgen 1.1 1.0 1.0
life 0.8 0.8 0.8
logic 1.1 1.0 1.0
mandelbrot 1.2 1.0 1.0
matrix-multiply 0.9 0.8 0.8
md5 1.0 1.0 1.0
merge 1.0 1.0 1.0
mlyacc 1.1 1.0 1.0
mpuz 0.9 0.9 0.9
nucleic 1.0 1.0 1.0
peek 1.0 0.8 0.8
psdes-random 0.9 0.8 0.8
ratio-regions 1.0 1.0 1.0
ray 1.1 1.0 1.0
raytrace 1.0 1.0 1.0
simple 1.0 1.0 1.0
smith-normal-form 1.0 1.0 1.0
tailfib 1.2 0.9 0.9
tak 1.0 1.0 1.0
tensor 1.4 1.0 0.9
tsp 1.0 1.0 1.0
vector-concat 1.7 0.8 1.1
vector-rev 1.0 1.0 1.0
vliw 1.1 1.0 1.0
wc-input1 0.9 1.0 0.9
wc-scanStream 1.5 0.9 0.9
zebra 0.9 0.9 0.9
zern 1.1 1.0 1.0
size
benchmark MLton0 MLton1 MLton2 MLton3
barnes-hut 61,529 62,009 61,113 61,065
checksum 38,308 38,484 38,356 38,356
count-graphs 60,268 59,708 58,940 58,876
DLXSimulator 100,540 100,252 96,492 96,588
fft 47,816 47,304 46,920 46,776
fib 38,340 38,404 38,372 38,372
hamlet 1,010,151 1,023,815 985,527 985,991
knuth-bendix 78,493 78,797 77,677 77,693
lexgen 144,004 143,268 141,204 141,188
life 56,612 56,820 56,388 56,372
logic 166,852 164,948 164,148 164,148
mandelbrot 38,340 38,388 38,372 38,372
matrix-multiply 38,756 38,884 38,804 38,804
md5 51,997 52,429 51,469 51,469
merge 39,324 39,420 39,372 39,372
mlyacc 456,940 445,756 439,308 439,292
mpuz 43,956 44,148 43,684 43,700
nucleic 78,508 78,140 78,028 77,964
peek 46,453 46,421 46,149 46,149
psdes-random 39,276 39,484 39,356 39,372
ratio-regions 62,460 60,860 61,356 61,468
ray 89,127 87,559 85,927 85,959
raytrace 196,284 200,220 190,188 189,836
simple 170,872 167,576 165,912 165,944
smith-normal-form 139,829 143,413 143,173 143,237
tailfib 38,020 38,148 38,052 38,052
tak 38,364 38,428 38,396 38,396
tensor 63,772 64,508 61,740 61,708
tsp 52,181 52,309 51,461 51,445
vector-concat 38,876 39,148 38,940 38,940
vector-rev 38,764 38,956 38,844 38,844
vliw 290,144 288,392 282,240 282,288
wc-input1 57,389 57,597 56,573 56,541
wc-scanStream 59,645 60,173 58,733 58,701
zebra 152,965 125,061 140,501 140,469
zern 43,895 44,167 43,735 43,735
No ground-breaking improvements, but nothing too bad. I looked at
vector-concat, and can't quite figure out what's wrong. All of the hot
loops are identical in MLton0 and MLton3. There's one extra bit of
shuffling going on; maybe that really is hurting.
Todo:
- Since the loop forest is being constructed, we might as well use it to
guide code-layout. For example, I noticed a pair of nested loops in
vector-concat where the inner loop wasn't all fall thrus:
loop_10:
cmpl %edx,%ecx
movl %ecx,(28*1)(%ebp)
je L_70
L_34:
movl (36*1)(%ebp),%ecx
xchgl %esi,%ebx
xchgl %esi,%edi
loop_5:
movl (-2*4)(%ebx),%eax
cmpl %eax,%esi
jnl L_71
L_35:
cmpl %eax,%esi
jae L_72
L_37:
movl (%ebx,%esi,4),%eax
incl %esi
jo L_86
noOverflow_2:
movl %eax,(%ecx,%edx,4)
incl %edx
jo L_86
noOverflow_3:
movl (28*1)(%ebp),%ecx
xchgl %edi,%ebx
xchgl %esi,%edi
jmp loop_10
...
.p2align
L_71:
testl $0x3,%edi
jnz L_79
L_102:
movl (%edi),%ebx
movl (4*1)(%edi),%edi
xorl %esi,%esi
jmp loop_5
loop_5 should be straight-line code, because it is the inner loop. This
is easy with the loop forest; when generating the transfer for the
block loop_5, L_71 will be in the same loop while L_35 won't be in the
same loop; now, just favor falling through to labels in the same loop.