RSSA backend
Matthew Fluet
fluet@CS.Cornell.EDU
Thu, 17 Jan 2002 14:43:56 -0500 (EST)
> > 4. make the array initialization loops explicit in MACHINE; I guess this
> > would be an insertArrayInitializations after insertSignalChecks
> >
> > I think we eventually want 4, which solves all the problems at once.
>
> Agreed. I will do it right now. It should be easy.
Great. Other than that, what's currently checked in to the rssa branch
passes all the regressions, benchmarks, and a self-compile.
Benchmark results are equally as strange as the C codegen ones, although
they don't seem to agree on what got speedups.
run time ratio
benchmark MLton1
barnes-hut 1.02
checksum 0.99
count-graphs 0.80
DLXSimulator 0.88
fft 1.08
fib 1.01
hamlet 1.13
imp-for 0.79
knuth-bendix 0.98
lexgen 0.96
life 0.95
logic 1.13
mandelbrot 0.90
matrix-multiply 0.91
md5 0.94
merge 0.99
mlyacc 0.97
mpuz 1.01
nucleic 0.95
peek 2.58
psdes-random 0.90
ratio-regions 0.96
ray 1.11
raytrace 0.99
simple 0.99
smith-normal-form 1.00
tailfib 0.80
tak 0.98
tensor 0.61
tsp 1.09
tyan 1.09
vector-concat 0.41
vector-rev 0.98
vliw 0.97
wc-input1 1.15
wc-scanStream 0.61
zebra 0.77
zern 0.94
I've no real idea either; peek is ok, because knownCase in the stable
version is speeding it up to 0.3 of the running time of the SSA optimizer
in the RSSA branch. logic and wc-input1 might be worth looking at.
I think I see why there are such significant speedups, at least for the
x86 codegen. The limit check insertion and pseudo-register allocator are
doing "better" in the RSSA branch. For example, the loop in tail fib
looks like this in RSSA branch
fibP_0 {kind = Jump, live = (RI(4), RI(3), RI(2), RI(0))}:
Switch (RI(4), [(0, L_12)], Some (L_15))
and like this in the stable branch:
fibP_0 {kind = Jump, live = [SI(4), SI(8), SI(12), SI(16)]}:
Switch (SI(16), [(0, L_10)], Some (L_13))
So, the RSSA looks a lot more like -native-live-stack true; in any event,
the x86 codegen turns the first loop into:
fibP_0:
testl %edi,%edi
jz L_41
L_15:
decl %edi
jo L_10
L_34:
addl %ecx,%edx
jo L_43
L_33:
xchgl %edx,%ecx
jmp fibP_0
because it will pass pseudo-regs in registers by default,
while it turns the second into:
fibP_0:
movl (0+(16*1))(%ebp),%esi
testl %esi,%esi
jz L_27
L_13:
decl %esi
jo L_8
L_19:
movl (0+(12*1))(%ebp),%edi
movl %edi,%edx
addl (0+(8*1))(%ebp),%edx
jo L_29
L_20:
movl %esi,(0+(16*1))(%ebp)
movl %edi,(0+(8*1))(%ebp)
movl %edx,(0+(12*1))(%ebp)
jmp fibP_0
because it loads and stores stack slots.
On the down side, this means that there is a little slowdown in compile
times, because the live sets manipulated by the codegen are larger.