CVS Commit

Tue, 18 Sep 2001 10:11:25 -0400 (EDT)

http://www.cs.cornell.edu/People/fluet/MLton/x86-codegen.tgz
(* src/mlton/codgen/x86-codegen/ *)
http://www.cs.cornell.edu/People/fluet/MLton/x86codegen.tgz
(* src/include/x86codegen.h *)
http://www.cs.cornell.edu/People/fluet/MLton/main.tgz
(* src/mlton/main/main.sml *)
http://www.cs.cornell.edu/People/fluet/MLton/control.tgz
(* src/mlton/control/control.{sig,sml} *)

Various changes to the x86-codgen.

x86.sig
x86.fun
x86-pseudo.sig
x86-mlton.fun
x86-allocate-registers.fun
x86-validate.fun

Added x86.Instruction.IMUL2 for the two-operand form of the imul
instruction.  This form of imul does not require the destination to be in
edx:eax and the source to be in a register.  This alleviates some register
pressure and suffling around multiplication instructions.  The
disadvantage is that the two-operand form of the imul instruction does not
support byte multiplications, so Word8.* must use the one-operand form. 

x86.sig
x86.fun
x86-pseudo.sig
x86-mlton.fun
x86-live-transfers.fun
x86-generate-transfers.fun

Extended the Entry.t datatype with CReturn.  Used by
x86-live-transfers.fun to prevent floating-point values from being carried
across C-calls in floating-point registers.  (C-calling convention
requires floating-point stack to be cleared.)

x86codegen.h
x86-generate-transfers.fun

Changed reserved register for gcState.stackTop from %edi to %ebp.  Changed
reserved register for gcState.frontier from %esi to %esp.  Motivated
primarily by a desire to keep %esp from being used as a general purpose
register -- there is unecessary shuffle code when trying to use %esp as an
index in a memory operand. 

x86-live-transfers.fun
main.sml
control.sig
control.sml

Major changes.  
 - improved efficiency of get_distanceF by first computing distanceF' (the
distance from the start of a block to the first use of a memloc in that
block)
 - implemented get_distanceB (the distance from the start of a block to
the last use or def of a memloc before that block).
 - modified both get_distanceF and get_distanceB to be "smart" about far
transfers; in particular, get_distanceB of a memloc will be PosInfinity if
the memloc is guaranteed to be on the stack (e.g., at function entry, on
return from a non-tail call, on return from the runtime, etc.).  Likewise,
get_distanceF of a memloc will be PosInfinity if the memloc is live across
a far transfer.
 - memlocs with get_distanceB of PosInfinity are not carried into the
block in registers (i.e., don't load a value too early).  This fixed the
problem at function entry of copying args from stack to registers,
copying them back on at the stack check GC, and finally copying them back
to registers on GC return.  Likewise, memlocs with get_distanceF of
PosInfinity are not carried into the block in registers.
 - smart computation of "sync" attribute of memlocs at block entries;
values that are not carried in registers into a block must be on the
stack; hence, later uses of the value can assume that it is synchronized.
 - weight live variables in inner loops higher using the loop forest of
the call-graph (technically, it is a revised call graph; far entries
(function entry, non-tail conts, non-tail handlers, runtime returns) are
all preceeded by a root node; i.e., a block corresponding to L() = K(f ())
does not result in an edge L -> K; instead, there is an edge root -> K;
this corresponds to the fact that we don't want to count the frontier
check loop as a real loop.)
 - changed the -native-live-transfer option to accept an integer, rather
than a boolean.  -native-live-transfer 0 turns off carrying live values in
registers.  -native-live-transfer n is used as follows:
	val (useLF, useB, sync)
	  = case !Control.Native.liveTransfer
	      of 1 => (false, false, false)
	       | 2 => (false, false, true)
	       | 3 => (false, true, false)
	       | 4 => (false, true, true)
	       | 5 => (true, false, false)
	       | 6 => (true, false, true)
	       | 7 => (true, true, false)
	       | _ => (true, true, true)
where useLF indicates using the loop forest to weight inner loops, useB
indicates using get_distanceB, and sync corresponds to using smart
computation of sync attribute.

Benchmark Results: (I'm leaving out compile time, because it was with a G0
(i.e., SML/NJ) mlton.

MLton0 -- mlton-stable -native-live-stack true
MLton1 -- mlton -native-live-stack true -native-live-transfer 0
MLton2 -- mlton -native-live-stack true -native-live-transfer 4
MLton3 -- mlton -native-live-stack true -native-live-transfer 8

run time
benchmark         MLton0 MLton1 MLton2 MLton3
barnes-hut           4.7    5.0    4.7    4.7
checksum             4.3    4.0    4.0    4.0
count-graphs         6.1    6.4    6.0    6.0
DLXSimulator        18.5   19.0   18.6   18.5
fft                 11.3   11.3   11.2   11.1
fib                  3.7    4.4    3.9    3.9
hamlet               9.7   10.0    9.5    9.6
knuth-bendix         7.8    7.8    7.9    7.8
lexgen              12.9   13.8   12.6   12.8
life                14.0   11.5   10.8   10.8
logic               27.4   28.8   27.6   27.8
mandelbrot           7.5    8.8    7.4    7.3
matrix-multiply      5.6    4.9    4.8    4.7
md5                  4.3    4.2    4.2    4.2
merge               69.5   66.7   66.6   66.9
mlyacc              10.9   11.5   10.7   10.7
mpuz                 6.3    5.3    5.6    5.6
nucleic              7.9    8.1    7.9    8.0
peek                 4.5    4.6    3.5    3.5
psdes-random         5.0    4.3    4.2    4.0
ratio-regions       10.4   10.3   10.4   10.3
ray                  4.5    5.1    4.5    4.5
raytrace             5.6    5.7    5.6    5.6
simple               7.3    7.1    7.0    7.3
smith-normal-form    1.2    1.2    1.2    1.2
tailfib             15.3   18.1   13.9   13.5
tak                  9.8    9.8    9.6    9.6
tensor               5.1    7.0    5.3    4.8
tsp                 11.2   11.7   11.4   11.3
vector-concat        4.5    7.7    3.7    4.9
vector-rev           5.0    5.1    5.0    5.0
vliw                 7.1    7.5    6.8    6.9
wc-input1            3.5    3.1    3.5    3.0
wc-scanStream        3.2    4.6    2.9    2.9
zebra                3.4    3.1    3.0    3.1
zern                41.6   44.8   41.0   40.9

run time ratio
benchmark         MLton1 MLton2 MLton3
barnes-hut           1.1    1.0    1.0
checksum             0.9    0.9    0.9
count-graphs         1.0    1.0    1.0
DLXSimulator         1.0    1.0    1.0
fft                  1.0    1.0    1.0
fib                  1.2    1.0    1.0
hamlet               1.0    1.0    1.0
knuth-bendix         1.0    1.0    1.0
lexgen               1.1    1.0    1.0
life                 0.8    0.8    0.8
logic                1.1    1.0    1.0
mandelbrot           1.2    1.0    1.0
matrix-multiply      0.9    0.8    0.8
md5                  1.0    1.0    1.0
merge                1.0    1.0    1.0
mlyacc               1.1    1.0    1.0
mpuz                 0.9    0.9    0.9
nucleic              1.0    1.0    1.0
peek                 1.0    0.8    0.8
psdes-random         0.9    0.8    0.8
ratio-regions        1.0    1.0    1.0
ray                  1.1    1.0    1.0
raytrace             1.0    1.0    1.0
simple               1.0    1.0    1.0
smith-normal-form    1.0    1.0    1.0
tailfib              1.2    0.9    0.9
tak                  1.0    1.0    1.0
tensor               1.4    1.0    0.9
tsp                  1.0    1.0    1.0
vector-concat        1.7    0.8    1.1
vector-rev           1.0    1.0    1.0
vliw                 1.1    1.0    1.0
wc-input1            0.9    1.0    0.9
wc-scanStream        1.5    0.9    0.9
zebra                0.9    0.9    0.9
zern                 1.1    1.0    1.0

size
benchmark            MLton0    MLton1  MLton2  MLton3
barnes-hut           61,529    62,009  61,113  61,065
checksum             38,308    38,484  38,356  38,356
count-graphs         60,268    59,708  58,940  58,876
DLXSimulator        100,540   100,252  96,492  96,588
fft                  47,816    47,304  46,920  46,776
fib                  38,340    38,404  38,372  38,372
hamlet            1,010,151 1,023,815 985,527 985,991
knuth-bendix         78,493    78,797  77,677  77,693
lexgen              144,004   143,268 141,204 141,188
life                 56,612    56,820  56,388  56,372
logic               166,852   164,948 164,148 164,148
mandelbrot           38,340    38,388  38,372  38,372
matrix-multiply      38,756    38,884  38,804  38,804
md5                  51,997    52,429  51,469  51,469
merge                39,324    39,420  39,372  39,372
mlyacc              456,940   445,756 439,308 439,292
mpuz                 43,956    44,148  43,684  43,700
nucleic              78,508    78,140  78,028  77,964
peek                 46,453    46,421  46,149  46,149
psdes-random         39,276    39,484  39,356  39,372
ratio-regions        62,460    60,860  61,356  61,468
ray                  89,127    87,559  85,927  85,959
raytrace            196,284   200,220 190,188 189,836
simple              170,872   167,576 165,912 165,944
smith-normal-form   139,829   143,413 143,173 143,237
tailfib              38,020    38,148  38,052  38,052
tak                  38,364    38,428  38,396  38,396
tensor               63,772    64,508  61,740  61,708
tsp                  52,181    52,309  51,461  51,445
vector-concat        38,876    39,148  38,940  38,940
vector-rev           38,764    38,956  38,844  38,844
vliw                290,144   288,392 282,240 282,288
wc-input1            57,389    57,597  56,573  56,541
wc-scanStream        59,645    60,173  58,733  58,701
zebra               152,965   125,061 140,501 140,469
zern                 43,895    44,167  43,735  43,735

No ground-breaking improvements, but nothing too bad.  I looked at
vector-concat, and can't quite figure out what's wrong.  All of the hot
loops are identical in MLton0 and MLton3.  There's one extra bit of
shuffling going on; maybe that really is hurting.

Todo:
 - Since the loop forest is being constructed, we might as well use it to
guide code-layout.  For example, I noticed a pair of nested loops in
vector-concat where the inner loop wasn't all fall thrus:

loop_10:
	cmpl %edx,%ecx
	movl %ecx,(28*1)(%ebp)
	je L_70
L_34:
	movl (36*1)(%ebp),%ecx
	xchgl %esi,%ebx
	xchgl %esi,%edi
loop_5:
	movl (-2*4)(%ebx),%eax
	cmpl %eax,%esi
	jnl L_71
L_35:
	cmpl %eax,%esi
	jae L_72
L_37:
	movl (%ebx,%esi,4),%eax
	incl %esi
	jo L_86
noOverflow_2:
	movl %eax,(%ecx,%edx,4)
	incl %edx
	jo L_86
noOverflow_3:
	movl (28*1)(%ebp),%ecx
	xchgl %edi,%ebx
	xchgl %esi,%edi
	jmp loop_10
...
.p2align
L_71:
	testl $0x3,%edi
	jnz L_79
L_102:
	movl (%edi),%ebx
	movl (4*1)(%edi),%edi
	xorl %esi,%esi
	jmp loop_5

loop_5 should be straight-line code, because it is the inner loop.  This
is easy with the loop forest; when generating the transfer for the
block loop_5, L_71 will be in the same loop while L_35 won't be in the
same loop; now, just favor falling through to labels in the same loop.