[MLton] Re: A MLton / C-- Experiment

Mon, 21 Mar 2005 16:34:28 -0500 (EST)

> * Conclusions *
> 
> Despite being unable to achieve the goal of developing a C-- codegen
> sufficient for compiling and correctly running the whole of the MLton
> benchmark suite, I do not believe that we are that far off.  

After receiving a few helpful comments, I was able to push my MLton /
C-- experiment through to completion.

> I don't know how tenable it is to go back to 32-bit file positions
> to complete the experiment

Stephen Weeks pointed out that it was a 4 line change to the MLton
sources to go back to 32-bit file positions.

> Returns in the Machine IL could become either return or cut to
> transfers in C--.

A useful discussion with John Dias revealed that because there are two
stacks (the ML stack and the C-- stack), there is really no reason
that the "return thingy" on the ML stack need actually correspond to
the return address or continuation.  The runtime only uses it to look
up a frame index, while the codegen is free to interpret it anyway
that is conveinent, subject to being able to efficiently translate it
to a frame index.  I was duplicating information by having the return
continuation both on the C-- stack and on the ML stack.  So, I could
just put the frame index as the "return thingy" on the ML stack (in
which case, the runtime function to translate would be the identity),
and use normal C-- return.  The upside of this translation is that it
yields more efficient C-- code; the downside of this translation is
that it doesn't exercise any of the C-- run-time API.

+ Regressions +

MLton's regression suite includes 234 small SML programs.  

Of these, 24 use features incompatible with the C-- codegen (callcc,
finalization, threads, signals, worlds, etc.).

Of the remaining 210 programs, 15 failed to compile correctly:
 11 fail in the linker, with "undefined reference to .Lexit_lNN",
    which corresponds to a reported qc-- bug
  1 fails in the C-- codegen, with an unsupported 64-bit operation;
    this is an issue with MLton, as one pass which is supposed to replace 
    primitives not supported by a codegen with corresponding C-calls is 
    not replacing overflowing arithmetic primitives not supported by the 
    codegen.
  1 fails in qc--, with "operator %sx reached postexpander",
    which corresponds to a reported qc-- bug
  1 fails in qc--, with "Asked for temporary in space `general-purpose
    temporaries' with unsupported width 64"; this regression test
    exercises 64-bit word operations by calls to C, but 64-bit values are 
    shuffled in C--
  1 fails in qc--, with "Asked for temporary in space `general-purpose
    temporaries' with unsupported width 64"; this regression test
    exercises 64-bit real operations either by C-- operators or calls
    to C.
In the last failing regression test, all the 64-bit values manipulated
by the C-- program are intended to be of register kind "float", but in
many basic blocks, such values are just copied from one place to
another without any explicit floating-point operation applied.

Of the 195 programs which were successfully compiled, _all_ but 2
executed correctly, and those two segfaulted due to overflowing the
system/C-- stack.  Both, adopted from the MLKit regression suite, have
some non-tail-recursive folds over lists (presumably to improve region
inference), which are called with very large lists (on the order of
50K elements).  This was a very pleasant outcome, since I could
optimistically assume that a successfully compiled program
corresponded to a correctly executing program.

+ Benchmarks +

MLton's benchmark suite includes 43 SML programs ranging from 15
lines (tak.sml) to 22900 lines (hamlet.sml), though the majority of
programs fall below 1000 lines.  You can view the benchmarks at:
  http://www.mlton.org/Performance

Of the 43 programs, 4 failed to compile, all in the linker with
"undefined reference to .Lexit_lNN".  One additional program failed to
compile when the 'Backend.x86.improve=Optimize.improve' (peephole
optimizer) option was passed to qc--.

The benchmark results are appended below.  I am quite impressed with
the run-time results, particularly since the entire experiment is
heavily biased against the C-- codegen.  I implemented the simplest
possible C-- codegen that I could think of, which meant translating
the Machine IL essentially verbatim to C--.  For example, a simple
equality comparision and transfer in the Machine IL is translated to
the rather verbose:

	RW32_2 = %zx32(%bit(%eq(RW32_1, RW32_0)));
	switch RW32_2 {
		case (0x0 :: bits32) : { goto L_8045; }
		case (0x1 :: bits32) : { goto L_8042; }
	}

A better, though more complicated translation, might be:

	if %eq(RW32_1, RW32_0) {
		RW32_2 = 0x1 :: bits32;
		goto L_8045;
	} else {
		RW32_2 = 0x0 :: bits32;
		goto L_8042;
	}

Similarly, I made no attempt to choose an intelligent block layout;
every basic block in the Machine IL becomes a labelled sequence of
statements ending with a transfer, often a goto.

Trends:

Generally, we see the executables produced by the C-- codegen running
1.4X to 6.0X longer than their native codegen counterparts.
Encouragingly, we see a sizeable improvement by the C-- peephole
optimizer across all benchmarks.

Executables produced by the C-- codegen are significantly larger than
executables produced by either the native or C codegens.  In the worst
case (mlyacc), the native codegen yielded a 505K executable while the
C-- codegen yielded a 12M executable.

Compile times with the C-- codegen are slower than with either the
native or C codegens.  I should point out that I am currently
producing one .cmm file for each Machine IL procedure in the program.
For the mlyacc benchmark, I produce 115 separate .cmm files, ranging
in size from 217 lines to 5889 lines and one at 49315 lines, with a
majority of files under 1000 lines.  Separate files means separate
calls to qc--.opt and more object files for the linker.  In the early
days of the native codegen, we discovered that the non-linearity in
the assembler made producing a single .s file for the whole program
prohibitively expensive, and separate .s files for each Machine IL
procedure was a waste of time with multiple calls to the assembler.
We now use a simple scheme where we put approximately 10000 assembly
lines per file, which appears to work well.  I suspect something
similar could be done in the C-- codegen.

A special note should be made on the compile time of boyer with the
qc-- peephole optimizier and the failure of smith-normal-form to
compile with the qc-- peephole optimizer.  Both stem from an artifact
of MLton's aggressive, whole-program globalization of values.  One of
the Machine IL procedures is an "initGlobals" function which does
nothing but perform a slew of heap allocations.  When the program has
lots of constant, global data, this procedure can get very large.  For
boyer, it yields a single basic block with 5000 statements, all of
which are either moves or increments of the frontier pointer.  For
smith-normal-form, the initGlobals function has a single basic block
with 13000 statements.  Such basic blocks seem to be very bad for
peephole optimizers, either because of their size or because of the
large number of moves, which appears tempting to optimize.  In any
case, when run on smith-normal-form, the qc-- peephole optimizer fails
due to memory exhaustion.

I will note that we experienced exactly the same effect in the native
codegen.  We special case the initGlobals procedure and suppress all
of the native codegen optimizations on it.  Again, I suspect something
similar could be done in the C-- codegen.  Alternatively, perhaps
MLton should consider means by which this initial global heap image
could be computed at compile time and simply block-copied into the
heap at run time.

Finally, I'm reproducing some comments made by John Dias when he saw
the benchmark results for the first benchmark:

> > run time ratio
> > benchmark  MLton1 MLton2 MLton3
> > barnes-hut   1.09   2.49   2.35
> 
> Given our nearly complete lack of optimization, I'm not very discouraged
> by these numbers -- even the peephole optimizer underperforms (according
> to Norman). And as I said, I have made a number of unimprovements to the
> register allocator, so I dare say these performance numbers can be
> improved easily when I find the time -- hopefully soon.
>
> > size
> > benchmark   MLton0  MLton1  MLton2  MLton3
> > barnes-hut 100,754 107,373 463,036 385,356
>
> I blame a combination of the register allocator and the incredibly verbose
> encoding of the run-time data. I hope one of Norman's summer students will
> tackle the problem of compressing that data.
> 
> > compile time
> > benchmark  MLton0 MLton1 MLton2 MLton3
> > barnes-hut   8.09  13.37  51.37  76.66
> 
> Well, I can't say that MLton2 surprises me, but the disparity b/w 2 and 3
> is interesting. I'm willing to bet that the new cfg improves that.

*******************************************************************************
Full benchmark results
*******************************************************************************

MLton0 -- mlton -codegen native
MLton1 -- mlton -codegen c
MLton2 -- mlton -codegen cmm 
MLton3 -- mlton -codegen cmm -cmmc-opt 'Backend.x86.improve=Optimize.improve'

run time ratio (MLtonN / MLton0)
benchmark         MLton0 MLton1 MLton2 MLton3
barnes-hut          1.00   1.09   2.49   2.35
boyer               1.00   1.18   2.21   1.66
checksum            1.00   0.86   3.52   3.07
count-graphs        1.00   1.39   4.54   3.43
DLXSimulator        1.00   1.05   1.39   1.24
fft                 1.00   1.03   1.43   1.48
fib                 1.00   1.32   3.03   2.22
flat-array          1.00   1.33   3.36   3.01
hamlet              1.00   2.00      *      *
imp-for             1.00   1.44   6.37   5.18
knuth-bendix        1.00   1.42   3.80   2.91
lexgen              1.00   1.99   3.75   3.10
life                1.00   1.31   5.21   3.90
logic               1.00   1.32   2.64   1.92
mandelbrot          1.00   1.12   4.13   2.19
matrix-multiply     1.00   1.06   3.89   3.99
md5                 1.00   1.37   3.20   2.23
merge               1.00   1.00   5.67   4.03
mlyacc              1.00   1.30   2.36   1.87
model-elimination   1.00   1.31      *      *
mpuz                1.00   1.45   4.69   3.88
nucleic             1.00   1.04      *      *
output1             1.00   1.97   4.39   3.34
peek                1.00   1.76   5.75   5.75
psdes-random        1.00   1.07   3.60   3.24
ratio-regions       1.00   1.43   3.54   2.31
ray                 1.00   1.19   3.85   3.07
raytrace            1.00   1.45   4.56   3.33
simple              1.00   1.48   2.67   2.10
smith-normal-form   1.00   1.00   1.02      *
tailfib             1.00   2.42   5.72   5.02
tak                 1.00   1.06   2.93   1.95
tensor              1.00   1.71   5.97   4.52
tsp                 1.00   1.10   1.96   1.70
tyan                1.00   1.09   2.42   1.87
vector-concat       1.00   1.35   3.38   3.18
vector-rev          1.00   1.21   2.47   2.06
vliw                1.00   1.40   2.48   1.88
wc-input1           1.00   1.73   4.49   3.80
wc-scanStream       1.00   1.26   3.72   3.32
zebra               1.00   1.21      *      *
zern                1.00   1.19   3.23   2.79

size (bytes)
benchmark            MLton0    MLton1     MLton2     MLton3
barnes-hut          100,754   107,373    463,036    385,356
boyer               138,545   150,465  1,115,965  1,024,773
checksum             52,385    52,481     72,965     67,117
count-graphs         66,305    73,529    179,701    142,221
DLXSimulator        129,789   139,521    755,205    632,533
fft                  64,920    73,388    138,902    116,766
fib                  47,877    52,565     69,009     62,889
flat-array           47,901    52,601     69,337     63,433
hamlet            1,243,296 1,349,004          *          *
imp-for              47,733    52,377     67,845     62,245
knuth-bendix        105,473   121,377    382,709    297,605
lexgen              196,938   226,322  1,793,578  1,594,434
life                 65,229    68,901    169,369    129,841
logic               106,737   102,413    804,201    687,993
mandelbrot           47,829    52,481     69,037     62,597
matrix-multiply      49,336    54,032     74,352     66,776
md5                  73,737    77,749    214,169    184,553
merge                49,457    53,989     77,305     68,521
mlyacc              505,514   585,718 12,679,994 12,001,970
model-elimination   631,975   722,627          *          *
mpuz                 50,477    57,525     82,533     71,525
nucleic             199,432   154,299          *          *
output1              76,499    81,343    235,227    202,075
peek                 72,461    78,761    201,465    172,441
psdes-random         48,557    53,325     71,241     64,665
ratio-regions        73,573    82,161    267,733    204,309
ray                 180,686   205,713  1,108,940    934,476
raytrace            264,355   335,726  1,779,525  1,393,165
simple              217,549   262,784  3,219,123  2,907,939
smith-normal-form   177,821   198,817    535,625          *
tailfib              47,573    52,313     67,353     61,761
tak                  47,957    52,537     69,237     62,917
tensor               93,792   109,004    380,008    311,896
tsp                  78,297    85,363    232,439    190,615
tyan                131,145   149,661    617,845    479,757
vector-concat        49,157    54,065     76,461     68,069
vector-rev           48,385    53,037     72,325     65,421
vliw                390,037   466,881  3,688,265  3,129,945
wc-input1           101,125   114,441    464,293    399,525
wc-scanStream       109,265   114,997    567,753    492,185
zebra               120,817   129,013          *          *
zern                 90,866    98,054    184,768    160,776

compile time (seconds)
benchmark         MLton0 MLton1 MLton2  MLton3
barnes-hut          8.09  13.37  51.37   76.66
boyer               8.92  28.17  76.85 1454.22
checksum            5.36   5.70   9.33   10.36
count-graphs        6.21   8.31  27.39   33.72
DLXSimulator        9.27  18.08  79.53  119.66
fft                 5.95   6.98  19.29   24.02
fib                 5.33   5.65   9.57   10.78
flat-array          5.40   5.71   9.74   10.82
hamlet             61.73 197.87      *       *
imp-for             5.37   5.68   9.45   10.72
knuth-bendix        7.41  13.92  53.58   81.78
lexgen             10.92  25.18 125.07  194.11
life                6.02   7.94  24.96   35.42
logic               7.70  13.31  68.79  118.09
mandelbrot          5.38   5.86   9.64   10.85
matrix-multiply     5.47   5.92  11.05   12.25
md5                 6.31   8.78  25.11   35.79
merge               5.42   5.82  10.99   12.41
mlyacc             27.46  77.11 460.78  994.41
model-elimination  29.46  87.49      *       *
mpuz                5.48   6.12  12.22   14.26
nucleic            14.13  40.88      *       *
output1             6.33   8.73  26.23   35.68
peek                6.12   8.25  23.12   31.20
psdes-random        5.40   5.85   9.91   11.21
ratio-regions       6.83   9.83  37.48   47.80
ray                 9.76  20.62 106.26  168.60
raytrace           14.86  38.15 215.49  352.70
simple             11.59  28.05 160.74  299.37
smith-normal-form   9.96  81.82  88.11       *
tailfib             5.48   5.70   9.41   10.29
tak                 5.45   5.71   9.67   10.73
tensor              8.37  14.10  49.52   71.47
tsp                 6.58   9.59  31.93   43.69
tyan                8.98  18.80  81.27  115.25
vector-concat       5.46   5.84  10.79   12.24
vector-rev          5.39   5.67  10.10   11.32
vliw               20.05  52.51 338.71  522.70
wc-input1           7.28  13.05  45.98   62.67
wc-scanStream       7.44  12.80  53.54   75.82
zebra               8.56  16.03      *       *
zern                6.16   7.58  21.97   27.29

run time (seconds)
benchmark         MLton0 MLton1 MLton2 MLton3
barnes-hut         50.58  54.91 126.05 118.82
boyer              53.95  63.73 119.33  89.50
checksum           97.32  83.63 342.52 299.25
count-graphs       41.25  57.47 187.26 141.43
DLXSimulator       90.62  94.79 125.64 112.38
fft                36.73  37.81  52.38  54.23
fib                70.24  92.81 213.14 155.90
flat-array         25.01  33.15  84.13  75.35
hamlet             52.67 105.32      *      *
imp-for            46.78  67.14 297.92 242.17
knuth-bendix       39.88  56.56 151.34 116.02
lexgen             44.28  88.17 165.94 137.09
life               14.39  18.85  74.96  56.19
logic              55.03  72.50 145.46 105.67
mandelbrot         85.85  95.95 354.25 187.75
matrix-multiply     7.33   7.79  28.49  29.24
md5                53.24  72.94 170.48 118.97
merge              88.91  88.64 503.71 358.35
mlyacc             41.16  53.45  97.04  77.07
model-elimination  79.50 104.41      *      *
mpuz               41.79  60.80 195.86 162.05
nucleic            45.73  47.46      *      *
output1            14.39  28.35  63.12  48.06
peek               35.41  62.39 203.63 203.64
psdes-random       41.51  44.49 149.40 134.46
ratio-regions      52.91  75.45 187.50 121.98
ray                30.59  36.31 117.91  94.00
raytrace           42.48  61.52 193.87 141.55
simple             59.76  88.65 159.75 125.33
smith-normal-form  36.99  36.92  37.62      *
tailfib            43.81 105.86 250.68 219.75
tak                27.59  29.15  80.87  53.73
tensor             58.94 100.85 351.81 266.47
tsp                60.15  66.08 117.92 102.32
tyan               57.15  62.23 138.20 106.95
vector-concat      91.26 123.30 308.81 289.80
vector-rev        113.87 137.61 281.14 234.94
vliw               51.20  71.48 126.77  96.04
wc-input1          38.85  67.34 174.33 147.81
wc-scanStream      33.46  42.03 124.51 111.18
zebra              40.33  48.65      *      *
zern               41.26  49.05 133.13 115.24