SSA simplify passes
Matthew Fluet
fluet@CS.Cornell.EDU
Fri, 4 Jan 2002 14:02:26 -0500 (EST)
Here's a little experiment I just ran. I added a -loop-passes
integer option which sets the number of times to run through the list of
SSA simplification passe. Here are the benchmark results:
MLton0 -- mlton -loop-passes 1
MLton1 -- mlton -loop-passes 2
compile time
benchmark MLton0 MLton1
barnes-hut 3.44 3.95
checksum 0.84 0.80
count-graphs 2.30 2.48
DLXSimulator 6.14 6.87
fft 1.71 1.86
fib 0.74 0.73
hamlet 75.12 91.49
imp-for 0.77 0.82
knuth-bendix 3.12 3.34
lexgen 7.56 9.32
life 1.71 1.88
logic 4.32 5.59
mandelbrot 0.76 0.81
matrix-multiply 0.84 0.80
md5 1.64 1.66
merge 0.80 0.80
mlyacc 29.03 37.35
mpuz 1.05 1.22
nucleic 3.79 3.88
peek 1.33 1.47
psdes-random 0.82 0.81
ratio-regions 3.34 5.62
ray 4.78 5.60
raytrace 11.55 13.29
simple 9.61 12.10
smith-normal-form 10.85 11.51
tailfib 0.72 0.73
tak 0.71 0.76
tensor 4.09 4.71
tsp 1.98 2.03
tyan 5.03 6.01
vector-concat 0.84 0.84
vector-rev 0.80 0.79
vliw 16.70 24.14
wc-input1 2.19 2.45
wc-scanStream 2.17 2.50
zebra 7.32 9.95
zern 1.43 1.46
run time
benchmark MLton0 MLton1
barnes-hut 5.44 5.00
checksum 4.06 4.04
count-graphs 6.07 5.52
DLXSimulator 22.50 22.41
fft 13.72 13.52
fib 4.24 4.27
hamlet 10.60 10.29
imp-for 10.25 11.28
knuth-bendix 8.57 8.21
lexgen 13.69 12.43
life 8.74 9.62
logic 26.92 22.69
mandelbrot 7.72 7.87
matrix-multiply 5.48 1.35
md5 2.53 0.50
merge 75.85 74.91
mlyacc 12.51 14.04
mpuz 5.69 5.59
nucleic 10.29 12.80
peek 4.05 3.74
psdes-random 4.18 3.89
ratio-regions 11.80 10.50
ray 4.98 4.80
raytrace 6.00 6.54
simple 7.52 8.34
smith-normal-form 1.26 1.26
tailfib 19.29 19.28
tak 10.79 10.79
tensor 6.88 6.87
tsp 11.11 11.12
tyan 25.17 26.22
vector-concat 7.21 7.29
vector-rev 5.90 5.76
vliw 8.08 8.34
wc-input1 2.25 1.86
wc-scanStream 4.52 4.40
zebra 2.94 2.96
zern 48.77 44.88
run time ratio
benchmark MLton1
barnes-hut 0.92
checksum 0.99
count-graphs 0.91
DLXSimulator 1.00
fft 0.99
fib 1.01
hamlet 0.97
imp-for 1.10
knuth-bendix 0.96
lexgen 0.91
life 1.10
logic 0.84
mandelbrot 1.02
matrix-multiply 0.25
md5 0.20
merge 0.99
mlyacc 1.12
mpuz 0.98
nucleic 1.24
peek 0.92
psdes-random 0.93
ratio-regions 0.89
ray 0.96
raytrace 1.09
simple 1.11
smith-normal-form 1.00
tailfib 1.00
tak 1.00
tensor 1.00
tsp 1.00
tyan 1.04
vector-concat 1.01
vector-rev 0.98
vliw 1.03
wc-input1 0.83
wc-scanStream 0.97
zebra 1.01
zern 0.92
size
benchmark MLton0 MLton1
barnes-hut 69,982 69,326
checksum 21,585 21,585
count-graphs 44,713 41,777
DLXSimulator 99,417 97,273
fft 33,069 32,613
fib 21,585 21,585
hamlet 1,498,772 1,581,652
imp-for 21,577 21,577
knuth-bendix 65,938 62,698
lexgen 157,297 173,089
life 41,481 42,729
logic 89,313 96,449
mandelbrot 21,545 21,513
matrix-multiply 22,009 21,785
md5 31,986 27,746
merge 22,793 22,809
mlyacc 578,545 639,377
mpuz 26,529 26,321
nucleic 63,769 63,081
peek 31,514 31,370
psdes-random 22,553 22,569
ratio-regions 44,537 46,057
ray 85,492 86,132
raytrace 204,769 202,833
simple 195,405 217,709
smith-normal-form 149,546 150,010
tailfib 21,257 21,257
tak 21,705 21,689
tensor 72,321 76,353
tsp 37,474 34,090
tyan 91,954 94,978
vector-concat 22,393 22,393
vector-rev 22,209 21,905
vliw 340,541 416,821
wc-input1 45,514 46,474
wc-scanStream 46,794 47,746
zebra 127,954 121,154
zern 28,684 28,484
Some interesting results:
md5 -- if you look at the .ssa of -loop-passes 1, you'll see three
functions all of the form:
fun x_148 (x_154) = L_88 ()
L_88 ()
()
which are used in about 45 non-tail calls. Before the last removeUnused
call, these functions do somthing, but their results aren't used, so they
are trimmed down to:
fun x_148 (x_154) = L_88 ()
L_88 ()
x_165 = Word32_ge (x_154, global_11)
case x_165 of
false => L_93 | true => L_94
L_94 ()
L_92 ()
L_93 ()
L_92 ()
L_92 ()
x_161 = Word32_sub (global_11, x_154)
x_163 = Word32_ge (x_161, global_11)
case x_163 of
false => L_90 | true => L_91
L_91 ()
L_89 ()
L_90 ()
L_89 ()
L_89 ()
()
which the shinker reduces to the above (this explains why removeUnused
didn't eliminate the function argument, although it is clearly unused in
the final function).
I think that just eliminating those extraneous non-tail calls sped it up
quite a bid.
matrix-multiply: As best I can make out, something (I don't know what
pass) establishes the fact that the source matrix is initialized, but
never updated -- so it replaces all the array subs (and associated
computation of subscripts) with the constant 1.0.
imp-for: This one slows down by 10%. Very strange, especially considering
that when you look at the .ssa files, which are almost identical. One
thing that is different is that there are a number of blocks whose
argument lists are reversed from what they are in -loop-passes 1.
Looking at the assembly, this is confirmed -- most of the code is
identical, where they differ are stack offset numbers and some reversals
of instructions in accesing arguents. You can see above that the sizes of
the executables are identical. Some subtle caching?
nucleic: Looking at the datatypes of the -loop-passes 2 version, my guess
is too much flattening of tuples. In particular, something like this
looks bad:
fun x_8 (x_23,
x_22,
x_21,
x_20,
x_19,
x_18,
x_17,
x_16,
x_15,
x_14,
x_13,
x_12,
x_11,
x_10,
x_9) = L_16 ()
L_16 ()
x_68 = (x_12,
x_13,
x_14,
x_15,
x_16,
x_17,
x_18,
x_19,
x_20,
x_21,
x_22,
x_23)
x_81 = #13 x_11
case x_81 of
A_0 => L_23 | _ => L_35
which is the expansion of this:
fun x_283 (x_521, x_520, x_519) = L_97 ()
L_97 ()
x_577 = #2 x_520
x_578 = #1 x_520
x_582 = #13 x_577
case x_582 of
A_0 => L_104 | _ => L_116
Those are the ones that I've looked at; I don't know if there is a point
to be made, but it's an experiment I've been meaning to try.