SSA simplify passes
Matthew Fluet
fluet@CS.Cornell.EDU
Sun, 6 Jan 2002 16:58:05 -0500 (EST)
> > Here are a couple of other random "cleanup"'s I'm trying to track down:
> ...
> > Pulling apart x_1991 just to put it back together again in env_30 seems
> > wasteful. This happens all over the place in a self-compile.
>
> Yeah, eliminating reconstitution of tuples has been on my list for a
> while. Go for it.
It was fairly easy to add to the shrinker.
One minor complication was the need to give VarInfo a ty: Type.t field in
order to verify that the reconstructed tuple has the same type as the
original tuple; i.e., we can't do anything about
L (x: (int * int * int))
val a = #1 x
val b = #2 x
val z = (a, b)
z
whereas
L (x: (int * int))
val a = #1 x
val b = #2 x
val z = (a, b)
z
becomes
L (x: (int * int))
x
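For concreteness, here is roughly the check that was added, as a
self-contained sketch; the ty, exp, and varInfo definitions below are
simplified stand-ins of my own invention, not MLton's actual Type.t,
Exp.t, or VarInfo types.

(* A tuple is eliminable iff its components are in-order selections from
 * a single variable of the same tuple type. *)
datatype ty = Int | Tuple of ty list

datatype exp =
   Select of {tuple: string, offset: int}   (* val a = #(offset+1) tuple *)
 | MakeTuple of string list                 (* val z = (a, b, ...) *)
 | Var of string

type varInfo = {def: exp, ty: ty}

(* (a0, ..., an) reconstructs x iff each ai is the i'th selection from
 * the same x, in order, and the new tuple's type equals x's type; the
 * type check is what rules out dropping components, as in the
 * (int * int * int) example above. *)
fun reconstructs (varInfo: string -> varInfo)
                 {components: string list, tupleTy: ty} : string option =
   case components of
      [] => NONE
    | a0 :: _ =>
         (case #def (varInfo a0) of
             Select {tuple = x, ...} =>
                let
                   fun check (_, []) = true
                     | check (i, a :: rest) =
                          (case #def (varInfo a) of
                              Select {tuple = x', offset} =>
                                 x' = x andalso offset = i
                                 andalso check (i + 1, rest)
                            | _ => false)
                in
                   if check (0, components)
                      andalso tupleTy = #ty (varInfo x)
                      then SOME x   (* val z = (a, b) shrinks to just x *)
                   else NONE
                end
           | _ => NONE)

The ty field mentioned above is exactly what makes the tupleTy comparison
possible.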
Initial results are o.k. The benchmark simple was one in which I noticed
tuple reconstruction going on, so I tested with that first; you'll see in
the benchmarks why I was initially disappointed and did a little more
tweaking. There seems to be some interaction with the
flattener, so, for testing purposes, I've added an option
-tuple-recon-elim {0|1|2}. With 0, no tuple reconstruction elimination is
done (i.e., the way things are currently). With 1, tuple reconstruction
elimination is turned on in the shrinker after localFlatten3
in the SSA simplification pipeline (so, it's used in the shrinker during
the final 5 simplification passes). With 2, tuple reconstruction
elimination is turned on always (including the initial shrink of the
closure converted program). Here are the benchmark results:
MLton0 -- mlton -tuple-recon-elim 0
MLton1 -- mlton -tuple-recon-elim 1
MLton2 -- mlton -tuple-recon-elim 2
compile time
benchmark MLton0 MLton1 MLton2
barnes-hut 3.29 3.29 3.27
checksum 0.76 0.78 0.79
count-graphs 2.33 2.31 2.29
DLXSimulator 6.07 6.03 6.05
fft 1.71 1.62 1.63
fib 0.73 0.69 0.71
hamlet 74.49 70.41 70.76
imp-for 0.74 0.78 0.76
knuth-bendix 2.98 2.98 2.92
lexgen 7.57 7.63 7.62
life 1.77 1.71 1.67
logic 4.01 3.99 4.15
mandelbrot 0.77 0.75 0.77
matrix-multiply 0.80 0.79 0.81
md5 1.65 1.67 1.66
merge 0.79 0.78 0.78
mlyacc 30.10 29.95 31.08
mpuz 1.03 1.07 1.06
nucleic 3.71 3.70 3.72
peek 1.30 1.34 1.31
psdes-random 0.82 0.81 0.82
ratio-regions 3.37 3.35 3.36
ray 4.78 4.81 4.83
raytrace 12.07 11.65 11.56
simple 9.34 9.36 9.77
smith-normal-form 11.08 11.08 10.54
tailfib 0.70 0.71 0.67
tak 0.74 0.72 0.72
tensor 4.15 4.16 4.13
tsp 2.03 2.02 2.02
tyan 5.10 5.08 5.00
vector-concat 0.80 0.82 0.81
vector-rev 0.78 0.78 0.78
vliw 16.90 16.97 16.96
wc-input1 2.13 2.12 2.16
wc-scanStream 2.22 2.17 2.13
zebra 7.65 7.68 7.35
zern 1.34 1.42 1.38
run time
benchmark MLton0 MLton1 MLton2
barnes-hut 5.29 5.28 5.29
checksum 3.93 4.04 4.04
count-graphs 6.14 6.15 6.16
DLXSimulator 21.59 21.60 21.56
fft 13.90 13.50 13.52
fib 4.23 4.22 4.23
hamlet 10.55 9.89 10.21
imp-for 10.21 10.21 10.21
knuth-bendix 8.40 8.40 8.15
lexgen 13.71 13.71 13.42
life 8.60 8.61 8.12
logic 26.13 26.53 27.46
mandelbrot 7.70 7.69 7.69
matrix-multiply 5.48 5.30 5.39
md5 2.51 2.51 2.51
merge 77.46 77.37 77.43
mlyacc 12.43 12.24 12.22
mpuz 5.67 5.67 5.67
nucleic 10.05 10.04 9.94
peek 4.03 4.03 4.03
psdes-random 4.16 4.16 4.16
ratio-regions 11.79 11.80 11.81
ray 4.92 4.90 4.92
raytrace 6.00 5.95 5.93
simple 7.30 7.32 8.53
smith-normal-form 1.26 1.26 1.22
tailfib 19.20 19.20 19.22
tak 10.77 10.78 10.77
tensor 6.86 6.85 6.86
tsp 11.09 11.08 11.08
tyan 25.28 22.29 17.39
vector-concat 7.24 7.33 7.11
vector-rev 6.00 6.03 6.01
vliw 8.07 8.07 8.13
wc-input1 2.23 2.23 2.23
wc-scanStream 4.50 4.44 4.44
zebra 2.83 2.82 2.85
zern 47.84 48.30 48.41
run time ratio
benchmark MLton1 MLton2
barnes-hut 1.00 1.00
checksum 1.03 1.03
count-graphs 1.00 1.00
DLXSimulator 1.00 1.00
fft 0.97 0.97
fib 1.00 1.00
hamlet 0.94 0.97
imp-for 1.00 1.00
knuth-bendix 1.00 0.97
lexgen 1.00 0.98
life 1.00 0.94
logic 1.02 1.05
mandelbrot 1.00 1.00
matrix-multiply 0.97 0.98
md5 1.00 1.00
merge 1.00 1.00
mlyacc 0.98 0.98
mpuz 1.00 1.00
nucleic 1.00 0.99
peek 1.00 1.00
psdes-random 1.00 1.00
ratio-regions 1.00 1.00
ray 1.00 1.00
raytrace 0.99 0.99
simple 1.00 1.17
smith-normal-form 1.00 0.97
tailfib 1.00 1.00
tak 1.00 1.00
tensor 1.00 1.00
tsp 1.00 1.00
tyan 0.88 0.69
vector-concat 1.01 0.98
vector-rev 1.00 1.00
vliw 1.00 1.01
wc-input1 1.00 1.00
wc-scanStream 0.99 0.99
zebra 1.00 1.01
zern 1.01 1.01
size
benchmark MLton0 MLton1 MLton2
barnes-hut 69,982 69,982 69,998
checksum 21,585 21,585 21,585
count-graphs 44,713 44,713 44,481
DLXSimulator 99,417 99,417 99,401
fft 33,069 33,069 33,069
fib 21,585 21,585 21,585
hamlet 1,498,772 1,465,892 1,484,532
imp-for 21,577 21,577 21,577
knuth-bendix 65,938 65,938 65,866
lexgen 157,297 157,297 158,049
life 41,481 41,481 41,337
logic 89,313 89,313 89,313
mandelbrot 21,545 21,545 21,545
matrix-multiply 22,009 22,009 22,009
md5 31,986 31,986 31,986
merge 22,793 22,793 22,793
mlyacc 578,545 574,065 573,265
mpuz 26,529 26,529 26,529
nucleic 63,769 63,769 63,241
peek 31,514 31,514 31,514
psdes-random 22,553 22,553 22,553
ratio-regions 44,537 44,537 44,537
ray 85,492 85,460 85,412
raytrace 204,769 204,657 203,697
simple 195,405 194,653 197,973
smith-normal-form 149,546 149,546 149,578
tailfib 21,257 21,257 21,257
tak 21,705 21,705 21,705
tensor 72,321 72,321 72,321
tsp 37,474 37,474 37,490
tyan 91,954 91,234 89,362
vector-concat 22,393 22,393 22,393
vector-rev 22,209 22,209 22,209
vliw 340,541 340,413 336,893
wc-input1 45,514 45,514 45,530
wc-scanStream 46,794 46,794 46,794
zebra 127,954 127,954 127,954
zern 28,684 28,684 28,684
Not too much effect on compile time. Note, however, that all of the
"analysis" behind tuple reconstruction elimination is always done,
regardless of the -tuple-recon-elim setting; there is just a final test
of the option that either eliminates the tuple reconstruction or leaves
it in place. So, the differences in compile time have less to do with
what the shrinker is doing and more to do with differences in the
intermediate programs.
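Concretely, the final test looks something like the following sketch
(hypothetical code again, reusing the exp datatype from the sketch
earlier in this message):

(* The analysis (reconstructs, above) always runs; the -tuple-recon-elim
 * level only decides whether the rewrite is applied. passIsLate stands
 * for "we are past localFlatten3 in the simplification pipeline". *)
fun applyElim (level: int, passIsLate: bool)
              {x: string, components: string list} : exp =
   let
      val enabled =
         case level of
            0 => false         (* 0: never eliminate *)
          | 1 => passIsLate    (* 1: only in the final 5 passes *)
          | _ => true          (* 2: always eliminate *)
   in
      if enabled
         then Var x                 (* refer to the original tuple *)
      else MakeTuple components     (* leave the reconstruction in place *)
   end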
Runtimes are interesting, as always. You can see why I had initial
doubts: simple is unaffected by the -tuple-recon-elim 1 option (which
does, indeed, produce an almost identical SSA program, minus the
redundant tuple construction) and is quite negatively affected by the
-tuple-recon-elim 2 option (which again produces an almost identical SSA
program, except that many datatypes and arguments are not flattened).
Tyan shows a surprising benefit. All of the SSA programs are pretty much
identical, minus various tuple reconstructions. Again, the
-tuple-recon-elim 2 program has fewer flattened datatypes and arguments,
but that doesn't seem to negatively impact this benchmark. Looking at
gc-summary for the three programs, it's clear that we avoided significant
allocation and excess GCs.
[fluet@localhost tyan]$ ./tyan.0 @MLton gc-summary -- > /dev/null
max semispace size(bytes): 1,552,384
max stack size(bytes): 26,752
GC time(ms): 4,090 (16.9%)
maxPause(ms): 10
number of GCs: 1,873
bytes allocated: 2,356,798,696
bytes copied: 255,447,228
max bytes live: 180,256
[fluet@localhost tyan]$ ./tyan.1 @MLton gc-summary -- > /dev/null
max semispace size(bytes): 1,564,672
max stack size(bytes): 28,544
GC time(ms): 3,260 (15.1%)
maxPause(ms): 10
number of GCs: 1,573
bytes allocated: 1,984,296,704
bytes copied: 214,353,012
max bytes live: 178,704
[fluet@localhost tyan]$ ./tyan.2 @MLton gc-summary -- > /dev/null
max semispace size(bytes): 1,748,992
max stack size(bytes): 43,520
GC time(ms): 2,430 (14.5%)
maxPause(ms): 10
number of GCs: 1,135
bytes allocated: 1,445,752,052
bytes copied: 153,468,780
max bytes live: 182,196
Almost everything else is probably just noise. logic, which has the
greatest deviation in runtime ratios of the remaining benchmarks,
compiles to identical SSA programs under each of the three options, so
that's all noise.
I also played with self-compiles. After getting to a fixed point with
the new changes, I did 6 self-compiles, with each resulting executable
used to compile the next. The options used and the name given to the
resulting executable are
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G10
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G10a
@MLton fixed-heap 500m -- -tuple-recon-elim 1 --> G11
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G11a
@MLton fixed-heap 500m -- -tuple-recon-elim 2 --> G12
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G12a
So, each of the G*a executables is identical; furthermore, the time to
produce a G*a executable gives some indication of how tuple
reconstruction elimination affects a self-compile. Here are the sizes:
text data bss dec hex filename
10301850 1622864 36132 11960846 b6820e mlton-compile.G10
10301850 1622864 36132 11960846 b6820e mlton-compile.G10a
9714698 1588288 36132 11339118 ad056e mlton-compile.G11
10301850 1622864 36132 11960846 b6820e mlton-compile.G11a
9724058 1595280 36996 11356334 ad48ae mlton-compile.G12
10301850 1622864 36132 11960846 b6820e mlton-compile.G12a
So, we did save some executable size by eliminating tuple reconstruction.
On the other hand, the compile times don't make much sense:
711.39user 18.06system 12:20.66elapsed 98%CPU -- to compile G10
747.99user 48.58system 14:04.73elapsed 94%CPU -- to compile G10a
690.29user 19.91system 12:32.34elapsed 94%CPU -- to compile G11
725.07user 1100.11system 32:42.66elapsed 92%CPU -- to compile G11a
753.95user 366.83system 19:36.90elapsed 95%CPU -- to compile G12
693.51user 356.16system 18:28.15elapsed 94%CPU -- to compile G12a
At a minimum, the times to compile G10 and G10a should be almost the same.
The excess system time to compile G11a might suggest that G11 is doing
excess allocation, but they were all run with the same fixed heap. Still,
there is some excess system time in the last two compiles. I wasn't using
the machine for anything else at the time, and the compiles were run at
10am after the machine had been on overnight, so I don't think there were
any backed-up cron jobs. They were all run in a row, so maybe I just
really foiled the virtual memory system (kernel 2.4.8 -- I know I should
move up at least past 2.4.10 or so because there are known issues with
the VM).
Rerunning all of the compiles with -v2 (because I wanted to see if I could
pinpoint some pass that was slow because of tuple reconstruction
elimination), I get the following:
MLton finished in 527.27 + 289.79 (35% GC)
746.58user 70.64system 14:24.24elapsed 94%CPU
MLton finished in 533.43 + 290.50 (35% GC)
753.75user 70.35system 14:09.91elapsed 96%CPU
MLton finished in 503.85 + 277.33 (36% GC)
724.30user 57.03system 13:40.35elapsed 95%CPU
MLton finished in 475.20 + 212.58 (31% GC)
631.01user 56.96system 12:10.05elapsed 94%CPU
MLton finished in 582.62 + 338.43 (37% GC)
739.66user 181.55system 16:20.46elapsed 93%CPU
MLton finished in 838.95 + 393.44 (32% GC)
690.70user 541.89system 22:24.08elapsed 91%CPU
which doesn't explain much, and contradicts the previous results in some
ways, but is more believable, I think. It suggests that compiling with
-tuple-recon-elim 1 improves compile time a little and, furthermore, that
the resulting compiler is a bit faster. On the other hand, compiling with
-tuple-recon-elim 2 hurts compile time a little, and the resulting
compiler is really slow -- presumably because of the lack of flattening.
Anyway, I'm hopeful that, after looking at the flattener, we can get to a
point where tuple reconstruction elimination is permanently left on in
the shrinker.