SSA simplify passes
Matthew Fluet
fluet@CS.Cornell.EDU
Sun, 6 Jan 2002 16:58:05 -0500 (EST)
> > Here are a couple of other random "cleanup"'s I'm trying to track down:
> ...
> > Pulling apart x_1991 just to put it back together again in env_30 seems
> > wasteful. This happens all over the place in a self-compile.
>
> Yeah, eliminating reconstitution of tuples has been on my list for a
> while. Go for it.
It was fairly easy to add to the shrinker.
One minor complication was the need to give VarInfo a ty: Type.t field in
order to verify that the reconstructed tuple has the same type as the
original tuple; i.e., we can't do anything about
L (x: (int * int * int))
val a = #1 x
val b = #2 x
val z = (a, b)
z
whereas
L (x: (int * int))
val a = #1 x
val b = #2 x
val z = (a, b)
z
becomes
L (x: (int * int))
x
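For concreteness, here is roughly the check that was added, as a
self-contained sketch; the ty, exp, and varInfo definitions below are
simplified stand-ins of my own invention, not MLton's actual Type.t,
Exp.t, or VarInfo types.

(* A tuple is eliminable iff its components are in-order selections from
 * a single variable of the same tuple type. *)
datatype ty = Int | Tuple of ty list

datatype exp =
   Select of {tuple: string, offset: int}   (* val a = #(offset+1) tuple *)
 | MakeTuple of string list                 (* val z = (a, b, ...) *)
 | Var of string

type varInfo = {def: exp, ty: ty}

(* (a0, ..., an) reconstructs x iff each ai is the i'th selection from
 * the same x, in order, and the new tuple's type equals x's type; the
 * type check is what rules out dropping components, as in the
 * (int * int * int) example above. *)
fun reconstructs (varInfo: string -> varInfo)
                 {components: string list, tupleTy: ty} : string option =
   case components of
      [] => NONE
    | a0 :: _ =>
         (case #def (varInfo a0) of
             Select {tuple = x, ...} =>
                let
                   fun check (_, []) = true
                     | check (i, a :: rest) =
                          (case #def (varInfo a) of
                              Select {tuple = x', offset} =>
                                 x' = x andalso offset = i
                                 andalso check (i + 1, rest)
                            | _ => false)
                in
                   if check (0, components)
                      andalso tupleTy = #ty (varInfo x)
                      then SOME x   (* val z = (a, b) shrinks to just x *)
                   else NONE
                end
           | _ => NONE)

The ty field mentioned above is exactly what makes the tupleTy comparison
possible.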
Initial results are o.k. The benchmark simple was one in which I noticed
tuple reconstruction going on, so I tested with that first; you'll see in
the benchmarks why I was initially disappointed and did a little more
tweaking. There seems to be some interaction with the
flattener, so, for testing purposes, I've added an option
-tuple-recon-elim {0|1|2}. With 0, no tuple reconstruction elimination is
done (i.e., the way things are currently). With 1, tuple reconstruction
elimination is turned on in the shrinker after localFlatten3
in the SSA simplification pipeline (so, it's used in the shrinker during
the final 5 simplification passes). With 2, tuple reconstruction
elimination is turned on always (including the initial shrink of the
closure converted program). Here are the benchmark results:
MLton0 -- mlton -tuple-recon-elim 0
MLton1 -- mlton -tuple-recon-elim 1
MLton2 -- mlton -tuple-recon-elim 2
compile time
benchmark MLton0 MLton1 MLton2
barnes-hut 3.29 3.29 3.27
checksum 0.76 0.78 0.79
count-graphs 2.33 2.31 2.29
DLXSimulator 6.07 6.03 6.05
fft 1.71 1.62 1.63
fib 0.73 0.69 0.71
hamlet 74.49 70.41 70.76
imp-for 0.74 0.78 0.76
knuth-bendix 2.98 2.98 2.92
lexgen 7.57 7.63 7.62
life 1.77 1.71 1.67
logic 4.01 3.99 4.15
mandelbrot 0.77 0.75 0.77
matrix-multiply 0.80 0.79 0.81
md5 1.65 1.67 1.66
merge 0.79 0.78 0.78
mlyacc 30.10 29.95 31.08
mpuz 1.03 1.07 1.06
nucleic 3.71 3.70 3.72
peek 1.30 1.34 1.31
psdes-random 0.82 0.81 0.82
ratio-regions 3.37 3.35 3.36
ray 4.78 4.81 4.83
raytrace 12.07 11.65 11.56
simple 9.34 9.36 9.77
smith-normal-form 11.08 11.08 10.54
tailfib 0.70 0.71 0.67
tak 0.74 0.72 0.72
tensor 4.15 4.16 4.13
tsp 2.03 2.02 2.02
tyan 5.10 5.08 5.00
vector-concat 0.80 0.82 0.81
vector-rev 0.78 0.78 0.78
vliw 16.90 16.97 16.96
wc-input1 2.13 2.12 2.16
wc-scanStream 2.22 2.17 2.13
zebra 7.65 7.68 7.35
zern 1.34 1.42 1.38
run time
benchmark MLton0 MLton1 MLton2
barnes-hut 5.29 5.28 5.29
checksum 3.93 4.04 4.04
count-graphs 6.14 6.15 6.16
DLXSimulator 21.59 21.60 21.56
fft 13.90 13.50 13.52
fib 4.23 4.22 4.23
hamlet 10.55 9.89 10.21
imp-for 10.21 10.21 10.21
knuth-bendix 8.40 8.40 8.15
lexgen 13.71 13.71 13.42
life 8.60 8.61 8.12
logic 26.13 26.53 27.46
mandelbrot 7.70 7.69 7.69
matrix-multiply 5.48 5.30 5.39
md5 2.51 2.51 2.51
merge 77.46 77.37 77.43
mlyacc 12.43 12.24 12.22
mpuz 5.67 5.67 5.67
nucleic 10.05 10.04 9.94
peek 4.03 4.03 4.03
psdes-random 4.16 4.16 4.16
ratio-regions 11.79 11.80 11.81
ray 4.92 4.90 4.92
raytrace 6.00 5.95 5.93
simple 7.30 7.32 8.53
smith-normal-form 1.26 1.26 1.22
tailfib 19.20 19.20 19.22
tak 10.77 10.78 10.77
tensor 6.86 6.85 6.86
tsp 11.09 11.08 11.08
tyan 25.28 22.29 17.39
vector-concat 7.24 7.33 7.11
vector-rev 6.00 6.03 6.01
vliw 8.07 8.07 8.13
wc-input1 2.23 2.23 2.23
wc-scanStream 4.50 4.44 4.44
zebra 2.83 2.82 2.85
zern 47.84 48.30 48.41
run time ratio
benchmark MLton1 MLton2
barnes-hut 1.00 1.00
checksum 1.03 1.03
count-graphs 1.00 1.00
DLXSimulator 1.00 1.00
fft 0.97 0.97
fib 1.00 1.00
hamlet 0.94 0.97
imp-for 1.00 1.00
knuth-bendix 1.00 0.97
lexgen 1.00 0.98
life 1.00 0.94
logic 1.02 1.05
mandelbrot 1.00 1.00
matrix-multiply 0.97 0.98
md5 1.00 1.00
merge 1.00 1.00
mlyacc 0.98 0.98
mpuz 1.00 1.00
nucleic 1.00 0.99
peek 1.00 1.00
psdes-random 1.00 1.00
ratio-regions 1.00 1.00
ray 1.00 1.00
raytrace 0.99 0.99
simple 1.00 1.17
smith-normal-form 1.00 0.97
tailfib 1.00 1.00
tak 1.00 1.00
tensor 1.00 1.00
tsp 1.00 1.00
tyan 0.88 0.69
vector-concat 1.01 0.98
vector-rev 1.00 1.00
vliw 1.00 1.01
wc-input1 1.00 1.00
wc-scanStream 0.99 0.99
zebra 1.00 1.01
zern 1.01 1.01
size
benchmark MLton0 MLton1 MLton2
barnes-hut 69,982 69,982 69,998
checksum 21,585 21,585 21,585
count-graphs 44,713 44,713 44,481
DLXSimulator 99,417 99,417 99,401
fft 33,069 33,069 33,069
fib 21,585 21,585 21,585
hamlet 1,498,772 1,465,892 1,484,532
imp-for 21,577 21,577 21,577
knuth-bendix 65,938 65,938 65,866
lexgen 157,297 157,297 158,049
life 41,481 41,481 41,337
logic 89,313 89,313 89,313
mandelbrot 21,545 21,545 21,545
matrix-multiply 22,009 22,009 22,009
md5 31,986 31,986 31,986
merge 22,793 22,793 22,793
mlyacc 578,545 574,065 573,265
mpuz 26,529 26,529 26,529
nucleic 63,769 63,769 63,241
peek 31,514 31,514 31,514
psdes-random 22,553 22,553 22,553
ratio-regions 44,537 44,537 44,537
ray 85,492 85,460 85,412
raytrace 204,769 204,657 203,697
simple 195,405 194,653 197,973
smith-normal-form 149,546 149,546 149,578
tailfib 21,257 21,257 21,257
tak 21,705 21,705 21,705
tensor 72,321 72,321 72,321
tsp 37,474 37,474 37,490
tyan 91,954 91,234 89,362
vector-concat 22,393 22,393 22,393
vector-rev 22,209 22,209 22,209
vliw 340,541 340,413 336,893
wc-input1 45,514 45,514 45,530
wc-scanStream 46,794 46,794 46,794
zebra 127,954 127,954 127,954
zern 28,684 28,684 28,684
Not too much effect on compile time. Note, however, that all of the
"analysis" behind tuple reconstruction elimination is always done,
regardless of the -tuple-recon-elim setting; there is just a final test
of the option that either eliminates the tuple reconstruction or leaves
it in place. So, the differences in compile time have less to do with
what the shrinker is doing and more to do with differences in the
intermediate programs.
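Concretely, the final test looks something like the following sketch
(hypothetical code again, reusing the exp datatype from the sketch
earlier in this message):

(* The analysis (reconstructs, above) always runs; the -tuple-recon-elim
 * level only decides whether the rewrite is applied. passIsLate stands
 * for "we are past localFlatten3 in the simplification pipeline". *)
fun applyElim (level: int, passIsLate: bool)
              {x: string, components: string list} : exp =
   let
      val enabled =
         case level of
            0 => false         (* 0: never eliminate *)
          | 1 => passIsLate    (* 1: only in the final 5 passes *)
          | _ => true          (* 2: always eliminate *)
   in
      if enabled
         then Var x                 (* refer to the original tuple *)
      else MakeTuple components     (* leave the reconstruction in place *)
   end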
Runtimes are interesting, as always. You can see why I had initial
doubts: simple is unaffected by the -tuple-recon-elim 1 option (which
does, indeed, produce an almost identical SSA program, minus the
redundant tuple construction) and is quite negatively affected by the
-tuple-recon-elim 2 option (which again produces an almost identical SSA
program, except that many datatypes and arguments are not flattened).
Tyan shows a surprising benefit. All of the SSA programs are pretty much
identical, minus various tuple reconstructions. Again, the
-tuple-recon-elim 2 program has fewer flattened datatypes and arguments,
but that doesn't seem to negatively impact this benchmark. Looking at
gc-summary for the three programs, it's clear that we avoided significant
allocation and excess GCs.
[fluet@localhost tyan]$ ./tyan.0 @MLton gc-summary -- > /dev/null
max semispace size(bytes): 1,552,384
max stack size(bytes): 26,752
GC time(ms): 4,090 (16.9%)
maxPause(ms): 10
number of GCs: 1,873
bytes allocated: 2,356,798,696
bytes copied: 255,447,228
max bytes live: 180,256
[fluet@localhost tyan]$ ./tyan.1 @MLton gc-summary -- > /dev/null
max semispace size(bytes): 1,564,672
max stack size(bytes): 28,544
GC time(ms): 3,260 (15.1%)
maxPause(ms): 10
number of GCs: 1,573
bytes allocated: 1,984,296,704
bytes copied: 214,353,012
max bytes live: 178,704
[fluet@localhost tyan]$ ./tyan.2 @MLton gc-summary -- > /dev/null
max semispace size(bytes): 1,748,992
max stack size(bytes): 43,520
GC time(ms): 2,430 (14.5%)
maxPause(ms): 10
number of GCs: 1,135
bytes allocated: 1,445,752,052
bytes copied: 153,468,780
max bytes live: 182,196
Almost everything else is probably just noise. logic, which has the
greatest deviation in runtime ratios of the remaining benchmarks,
compiles to identical SSA programs under each of the three options, so
that's all noise.
I also played with self-compiles. After getting to a fixed point with
the new changes, I did 6 self-compiles, with each resulting executable
used to compile the next. The options used and the name given to the
resulting executable are
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G10
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G10a
@MLton fixed-heap 500m -- -tuple-recon-elim 1 --> G11
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G11a
@MLton fixed-heap 500m -- -tuple-recon-elim 2 --> G12
@MLton fixed-heap 500m -- -tuple-recon-elim 0 --> G12a
So, each of the G*a executables is identical; furthermore, the time to
produce a G*a executable gives some indication of how tuple
reconstruction elimination affects a self-compile. Here are the sizes:
text data bss dec hex filename
10301850 1622864 36132 11960846 b6820e mlton-compile.G10
10301850 1622864 36132 11960846 b6820e mlton-compile.G10a
9714698 1588288 36132 11339118 ad056e mlton-compile.G11
10301850 1622864 36132 11960846 b6820e mlton-compile.G11a
9724058 1595280 36996 11356334 ad48ae mlton-compile.G12
10301850 1622864 36132 11960846 b6820e mlton-compile.G12a
So, we did save some executable size by eliminating tuple reconstruction.
On the other hand, the compile times don't make much sense:
711.39user 18.06system 12:20.66elapsed 98%CPU -- to compile G10
747.99user 48.58system 14:04.73elapsed 94%CPU -- to compile G10a
690.29user 19.91system 12:32.34elapsed 94%CPU -- to compile G11
725.07user 1100.11system 32:42.66elapsed 92%CPU -- to compile G11a
753.95user 366.83system 19:36.90elapsed 95%CPU -- to compile G12
693.51user 356.16system 18:28.15elapsed 94%CPU -- to compile G12a
At a minimum, the times to compile G10 and G10a should be almost the same.
The excess system time to compile G11a might suggest that G11 is doing
excess allocation, but they were all run with the same fixed heap. Still,
there is some excess system time in the last two compiles. I wasn't using
the machine for anything else at the time, and the compiles were run at
10am after the machine had been on overnight, so I don't think there were
any backed-up cron jobs. They were all run in a row, so maybe I just
really foiled the virtual memory system (kernel 2.4.8 -- I know I should
move up at least past 2.4.10 or so because there are known issues with
the VM).
Rerunning all of the compiles with -v2 (because I wanted to see if I could
pinpoint some pass that was slow because of tuple reconstruction
elimination), I get the following:
MLton finished in 527.27 + 289.79 (35% GC)
746.58user 70.64system 14:24.24elapsed 94%CPU
MLton finished in 533.43 + 290.50 (35% GC)
753.75user 70.35system 14:09.91elapsed 96%CPU
MLton finished in 503.85 + 277.33 (36% GC)
724.30user 57.03system 13:40.35elapsed 95%CPU
MLton finished in 475.20 + 212.58 (31% GC)
631.01user 56.96system 12:10.05elapsed 94%CPU
MLton finished in 582.62 + 338.43 (37% GC)
739.66user 181.55system 16:20.46elapsed 93%CPU
MLton finished in 838.95 + 393.44 (32% GC)
690.70user 541.89system 22:24.08elapsed 91%CPU
which doesn't explain much, and contradicts the previous results in some
ways, but is more believable, I think. It suggests that compiling with
-tuple-recon-elim 1 improves compile time a little and, furthermore, that
the resulting compiler is a bit faster. On the other hand, compiling with
-tuple-recon-elim 2 hurts compile time a little, and the resulting
compiler is really slow -- presumably because of the lack of flattening.
Anyway, I'm hopeful that, after looking at the flattener, we can get to a
point where tuple reconstruction elimination is permanently left on in
the shrinker.