limit check insertion

Tue, 23 Oct 2001 22:25:53 -0700

It was an incredibly simple change to force limit checks to only be
coalesced within basic blocks -- a couple of lines in limit-check.fun
did the trick.  I've checked in the change, and made this the default.
It passes all the regressions and a self-compile.  It also fixes the
bug with vliw, but introduces a bug in simple, and has no effect on
the prodcons and mutex bugs.
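
To make the distinction concrete, here is a toy sketch in Python.  It
is not the code in limit-check.fun: the block representation and both
function names are hypothetical, and it only models a straight-line
chain of blocks.  It shows why coalescing only within a block leaves
more checks in the generated code than coalescing across blocks.

```python
# Toy model only: a block is (name, list of allocation sizes in bytes).
# This is NOT MLton's limit-check.fun; all names here are hypothetical.

def per_block_checks(blocks):
    """One limit check per allocating block, covering that block's bytes."""
    return [(name, sum(allocs)) for name, allocs in blocks if allocs]

def coalesced_checks(blocks):
    """For a straight-line chain, one check at the head covers all bytes."""
    total = sum(sum(allocs) for _, allocs in blocks)
    return [(blocks[0][0], total)] if total else []

chain = [("L1", [8, 16]), ("L2", []), ("L3", [24]), ("L4", [8])]
print(per_block_checks(chain))   # [('L1', 24), ('L3', 24), ('L4', 8)]
print(coalesced_checks(chain))   # [('L1', 56)]
```

In this toy chain the per-block scheme emits three checks where the
cross-block scheme emits one.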

Unfortunately, it increases code sizes (about 50% for the self-compile
and about 60% for logic) and running times (up to 30% or so) almost
across the board.  The one run-time exception is matrix-multiply,
which gets faster, possibly because a limit check is no longer lifted
into a loop.  The stats are below.

For a self-compile, the numbers are bad.  The mlton executable is
currently about 7.5M, but when compiled with -limit-check-per-block
true, it blows up to 11.4M.  Here are the times for three self-compiles:

Compiling new MLton using an old one.
   Compile SML starting
   Compile SML finished in 269.67 + 221.90 (45% GC)
   Compile C starting
   Compile C finished in 10.47 + 0.0 (0.0% GC)
   Assemble starting
   Assemble finished in 32.36 + 0.0 (0.0% GC)
   Link starting
   Link finished in 2.72 + 0.0 (0.0% GC)
MLton finished in 315.31 + 221.91 (41% GC)
size mlton-compile
   text	   data	    bss	    dec	    hex	filename
6447545	1045724	  35940	7529209	 72e2f9	mlton-compile

Compiling new MLton using itself (built above) and -limit-check-per-block true
   Compile SML starting
   Compile SML finished in 311.38 + 256.69 (45% GC)
   Compile C starting
   Compile C finished in 15.60 + 0.0 (0.0% GC)
   Assemble starting
   Assemble finished in 45.44 + 0.0 (0.0% GC)
   Link starting
   Link finished in 5.09 + 0.0 (0.0% GC)
MLton finished in 377.63 + 256.69 (40% GC)
size mlton-compile
   text	   data	    bss	    dec	    hex	filename
9995209	1381500	  33764	11410473	 ae1c29	mlton-compile

Compiling new MLton using itself (built above) and -limit-check-per-block true
   Compile SML starting
   Compile SML finished in 351.27 + 266.54 (43% GC)
   Compile C starting
   Compile C finished in 15.95 + 0.0 (0.0% GC)
   Assemble starting
   Assemble finished in 45.72 + 0.0 (0.0% GC)
   Link starting
   Link finished in 5.34 + 0.0 (0.0% GC)
MLton finished in 418.39 + 266.55 (39% GC)
size mlton-compile
   text	   data	    bss	    dec	    hex	filename
9995209	1381500	  33764	11410473	 ae1c29	mlton-compile

So the total self-compile times (in seconds, including GC) were, in
order: 537, 634, 685.  We take a big hit (about 100s) for creating a
bigger executable, and a pretty bad hit (about 50s) for running with
extra limit checks.
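
The arithmetic behind those totals can be checked with a few lines of
Python.  This sketch assumes the two numbers on each "MLton finished
in" line are non-GC time and GC time, which is consistent with the
reported GC percentages.

```python
# (non-GC seconds, GC seconds) from the three "MLton finished in" lines.
runs = [(315.31, 221.91), (377.63, 256.69), (418.39, 266.55)]

# Total wall time per self-compile, rounded to whole seconds.
totals = [round(mutator + gc) for mutator, gc in runs]
print(totals)  # [537, 634, 685]

# Run 1 -> run 2: same style of compiler, but it emits the bigger executable.
bigger_executable_cost = totals[1] - totals[0]
# Run 2 -> run 3: same output, but the compiler itself has the extra checks.
extra_checks_cost = totals[2] - totals[1]
print(bigger_executable_cost, extra_checks_cost)  # 97 51
```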

It seems pretty clear that we will need a limit check insertion
algorithm that coalesces across blocks.

Feel free to look into the remaining bugs with simple, prodcons, and
mutex.  I'll start in on 'em tomorrow.

--------------------------------------------------------------------------------

MLton0 -- mlton -limit-check-per-block false
MLton1 -- mlton -limit-check-per-block true

compile time (seconds)
benchmark         MLton0 MLton1
barnes-hut           2.5    2.5
checksum             0.7    0.7
count-graphs         1.8    1.9
DLXSimulator         4.2    4.3
fft                  1.3    1.3
fib                  0.6    0.6
hamlet              49.5   54.3
knuth-bendix         2.4    2.5
lexgen               5.4    5.7
life                 1.4    1.5
logic                6.3    8.4
mandelbrot           0.7    0.7
matrix-multiply      0.7    0.7
md5                  1.3    1.3
merge                0.7    0.7
mlyacc              22.7   24.9
mpuz                 0.9    0.9
nucleic              3.2    3.1
peek                 1.1    1.1
psdes-random         0.7    0.7
ratio-regions        2.6    2.7
ray                  3.5    3.7
raytrace             9.3    9.4
simple               6.6    7.2
smith-normal-form    7.7    7.6
tailfib              0.6    0.6
tak                  0.6    0.6
tensor               3.0    3.1
tsp                  1.6    1.6
tyan                 3.9    4.1
vector-concat        0.7    0.7
vector-rev           0.7    0.7
vliw                12.4   12.9
wc-input1            1.7    1.7
wc-scanStream        1.8    1.8
zebra                9.5   10.6
zern                 1.1    1.1

run time (seconds; * = benchmark failed because of a bug)
benchmark         MLton0 MLton1
barnes-hut           5.0    5.7
checksum             3.7    4.1
count-graphs         5.6    6.6
DLXSimulator        15.0   15.0
fft                  7.5    8.2
fib                  4.6    5.0
hamlet               9.0   10.2
knuth-bendix         8.5    9.0
lexgen              12.2   13.7
life                11.0   12.1
logic               26.6   27.1
mandelbrot           9.4    9.6
matrix-multiply      6.7    5.3
md5                  0.6    0.8
merge               39.5   45.9
mlyacc              10.5   11.2
mpuz                 6.4    7.1
nucleic              8.5    9.1
peek                 4.8    5.8
psdes-random         4.6    4.6
ratio-regions        9.1    9.8
ray                  4.9    5.5
raytrace             5.8    6.7
simple               6.7      *
smith-normal-form    1.1    1.1
tailfib             22.1   22.5
tak                 10.5   11.1
tensor               9.7    9.9
tsp                 12.0   13.8
tyan                19.8   20.9
vector-concat        8.0    9.6
vector-rev           3.1    4.2
vliw                   *    7.0
wc-input1            2.8    3.6
wc-scanStream        4.3    4.8
zebra                2.7    3.0
zern                38.4   41.0

run time ratio (MLton1 / MLton0)
benchmark         MLton1
barnes-hut           1.1
checksum             1.1
count-graphs         1.2
DLXSimulator         1.0
fft                  1.1
fib                  1.1
hamlet               1.1
knuth-bendix         1.1
lexgen               1.1
life                 1.1
logic                1.0
mandelbrot           1.0
matrix-multiply      0.8
md5                  1.3
merge                1.2
mlyacc               1.1
mpuz                 1.1
nucleic              1.1
peek                 1.2
psdes-random         1.0
ratio-regions        1.1
ray                  1.1
raytrace             1.1
smith-normal-form    1.0
tailfib              1.0
tak                  1.1
tensor               1.0
tsp                  1.2
tyan                 1.1
vector-concat        1.2
vector-rev           1.3
vliw                ~1.0
wc-input1            1.3
wc-scanStream        1.1
zebra                1.1
zern                 1.1
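
For reference, a ratio in this table is the MLton1 run time divided by
the MLton0 run time.  A minimal Python sketch, using a handful of rows
copied from the run time table (the helper name is hypothetical).
Ratios recomputed from the rounded times shown can differ from the
table in the last digit.

```python
# (MLton0, MLton1) run times copied from the run time table; None marks "*".
run_times = {
    "matrix-multiply": (6.7, 5.3),
    "md5": (0.6, 0.8),
    "merge": (39.5, 45.9),
    "peek": (4.8, 5.8),
    "simple": (6.7, None),
}

def ratio(t0, t1):
    """MLton1 / MLton0, or None when either run failed."""
    if t0 is None or t1 is None:
        return None
    return t1 / t0

for name, (t0, t1) in run_times.items():
    r = ratio(t0, t1)
    print(name, "n/a" if r is None else f"{r:.1f}")
```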

size (bytes)
benchmark            MLton0    MLton1
barnes-hut           62,201    64,505
checksum             21,101    21,157
count-graphs         42,837    46,429
DLXSimulator         85,221    95,061
fft                  30,337    31,513
fib                  21,093    21,093
hamlet            1,059,152 1,344,512
knuth-bendix         64,390    68,430
lexgen              134,493   148,733
life                 40,373    43,141
logic               159,373   258,365
mandelbrot           21,061    21,117
matrix-multiply      21,485    21,533
md5                  29,438    30,518
merge                22,237    22,325
mlyacc              449,501   540,109
mpuz                 27,365    28,149
nucleic              61,365    63,293
peek                 29,502    31,070
psdes-random         22,149    22,181
ratio-regions        44,117    47,989
ray                  72,128    80,640
raytrace            174,157   214,533
simple              158,953   185,225
smith-normal-form   145,165   147,429
tailfib              20,765    20,789
tak                  21,133    21,165
tensor               65,228    68,780
tsp                  35,670    37,286
tyan                 83,950    92,238
vector-concat        21,773    21,837
vector-rev           21,605    21,645
vliw                290,329   329,641
wc-input1            41,454    44,470
wc-scanStream        44,078    47,382
zebra               122,190   136,758
zern                 27,288    27,584
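
To see which programs grow the most, the size columns can be turned
into percentage growth.  A small Python sketch over a few rows from the
table above, plus the mlton-compile totals (the "dec" column of the
size output earlier in this message):

```python
# (MLton0 size, MLton1 size), copied from the tables in this message.
sizes = {
    "mlton-compile": (7_529_209, 11_410_473),
    "hamlet": (1_059_152, 1_344_512),
    "logic": (159_373, 258_365),
    "mlyacc": (449_501, 540_109),
    "raytrace": (174_157, 214_533),
    "fib": (21_093, 21_093),
}

# Percentage growth when going from MLton0 to MLton1.
growth = {name: 100.0 * (s1 - s0) / s0 for name, (s0, s1) in sizes.items()}

for name, pct in sorted(growth.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {pct:5.1f}%")
```

On these rows, logic grows the most (a little over 62%), mlton-compile
grows by a little over 51%, and fib does not change at all.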