[MLton] Re: A MLton / C-- Experiment
Matthew Fluet
fluet@cs.cornell.edu
Mon, 21 Mar 2005 16:34:28 -0500 (EST)
> * Conclusions *
>
> Despite being unable to achieve the goal of developing a C-- codegen
> sufficient for compiling and correctly running the whole of the MLton
> benchmark suite, I do not believe that we are that far off.
After receiving a few helpful comments, I was able to push my MLton /
C-- experiment through to completion.
> I don't know how tenable it is to go back to 32-bit file positions
> to complete the experiment
Stephen Weeks pointed out that it was a 4 line change to the MLton
sources to go back to 32-bit file positions.
> Returns in the Machine IL could become either return or cut to
> transfers in C--.
A useful discussion with John Dias revealed that because there are two
stacks (the ML stack and the C-- stack), there is really no reason
that the "return thingy" on the ML stack need actually correspond to
the return address or continuation. The runtime only uses it to look
up a frame index, while the codegen is free to interpret it anyway
that is conveinent, subject to being able to efficiently translate it
to a frame index. I was duplicating information by having the return
continuation both on the C-- stack and on the ML stack. So, I could
just put the frame index as the "return thingy" on the ML stack (in
which case, the runtime function to translate would be the identity),
and use normal C-- return. The upside of this translation is that it
yields more efficient C-- code; the downside of this translation is
that it doesn't exercise any of the C-- run-time API.
+ Regressions +
MLton's regression suite includes 234 small SML programs.
Of these, 24 use features incompatible with the C-- codegen (callcc,
finalization, threads, signals, worlds, etc.).
Of the remaining 210 programs, 15 failed to compile correctly:
11 fail in the linker, with "undefined reference to .Lexit_lNN",
which corresponds to a reported qc-- bug
1 fails in the C-- codegen, with an unsupported 64-bit operation;
this is an issue with MLton, as one pass which is supposed to replace
primitives not supported by a codegen with corresponding C-calls is
not replacing overflowing arithmetic primitives not supported by the
codegen.
1 fails in qc--, with "operator %sx reached postexpander",
which corresponds to a reported qc-- bug
1 fails in qc--, with "Asked for temporary in space `general-purpose
temporaries' with unsupported width 64"; this regression test
exercises 64-bit word operations by calls to C, but 64-bit values are
shuffled in C--
1 fails in qc--, with "Asked for temporary in space `general-purpose
temporaries' with unsupported width 64"; this regression test
exercises 64-bit real operations either by C-- operators or calls
to C.
In the last failing regression test, all the 64-bit values manipulated
by the C-- program are intended to be of register kind "float", but in
many basic blocks, such values are just copied from one place to
another without any explicit floating-point operation applied.
Of the 195 programs which were successfully compiled, _all_ but 2
executed correctly, and those two segfaulted due to overflowing the
system/C-- stack. Both, adopted from the MLKit regression suite, have
some non-tail-recursive folds over lists (presumably to improve region
inference), which are called with very large lists (on the order of
50K elements). This was a very pleasant outcome, since I could
optimistically assume that a successfully compiled program
corresponded to a correctly executing program.
+ Benchmarks +
MLton's benchmark suite includes 43 SML programs ranging from 15
lines (tak.sml) to 22900 lines (hamlet.sml), though the majority of
programs fall below 1000 lines. You can view the benchmarks at:
http://www.mlton.org/Performance
Of the 43 programs, 4 failed to compile, all in the linker with
"undefined reference to .Lexit_lNN". One additional program failed to
compile when the 'Backend.x86.improve=Optimize.improve' (peephole
optimizer) option was passed to qc--.
The benchmark results are appended below. I am quite impressed with
the run-time results, particularly since the entire experiment is
heavily biased against the C-- codegen. I implemented the simplest
possible C-- codegen that I could think of, which meant translating
the Machine IL essentially verbatim to C--. For example, a simple
equality comparision and transfer in the Machine IL is translated to
the rather verbose:
RW32_2 = %zx32(%bit(%eq(RW32_1, RW32_0)));
switch RW32_2 {
case (0x0 :: bits32) : { goto L_8045; }
case (0x1 :: bits32) : { goto L_8042; }
}
A better, though more complicated translation, might be:
if %eq(RW32_1, RW32_0) {
RW32_2 = 0x1 :: bits32;
goto L_8045;
} else {
RW32_2 = 0x0 :: bits32;
goto L_8042;
}
Similarly, I made no attempt to choose an intelligent block layout;
every basic block in the Machine IL becomes a labelled sequence of
statements ending with a transfer, often a goto.
Trends:
Generally, we see the executables produced by the C-- codegen running
1.4X to 6.0X longer than their native codegen counterparts.
Encouragingly, we see a sizeable improvement by the C-- peephole
optimizer across all benchmarks.
Executables produced by the C-- codegen are significantly larger than
executables produced by either the native or C codegens. In the worst
case (mlyacc), the native codegen yielded a 505K executable while the
C-- codegen yielded a 12M executable.
Compile times with the C-- codegen are slower than with either the
native or C codegens. I should point out that I am currently
producing one .cmm file for each Machine IL procedure in the program.
For the mlyacc benchmark, I produce 115 separate .cmm files, ranging
in size from 217 lines to 5889 lines and one at 49315 lines, with a
majority of files under 1000 lines. Separate files means separate
calls to qc--.opt and more object files for the linker. In the early
days of the native codegen, we discovered that the non-linearity in
the assembler made producing a single .s file for the whole program
prohibitively expensive, and separate .s files for each Machine IL
procedure was a waste of time with multiple calls to the assembler.
We now use a simple scheme where we put approximately 10000 assembly
lines per file, which appears to work well. I suspect something
similar could be done in the C-- codegen.
A special note should be made on the compile time of boyer with the
qc-- peephole optimizier and the failure of smith-normal-form to
compile with the qc-- peephole optimizer. Both stem from an artifact
of MLton's aggressive, whole-program globalization of values. One of
the Machine IL procedures is an "initGlobals" function which does
nothing but perform a slew of heap allocations. When the program has
lots of constant, global data, this procedure can get very large. For
boyer, it yields a single basic block with 5000 statements, all of
which are either moves or increments of the frontier pointer. For
smith-normal-form, the initGlobals function has a single basic block
with 13000 statements. Such basic blocks seem to be very bad for
peephole optimizers, either because of their size or because of the
large number of moves, which appears tempting to optimize. In any
case, when run on smith-normal-form, the qc-- peephole optimizer fails
due to memory exhaustion.
I will note that we experienced exactly the same effect in the native
codegen. We special case the initGlobals procedure and suppress all
of the native codegen optimizations on it. Again, I suspect something
similar could be done in the C-- codegen. Alternatively, perhaps
MLton should consider means by which this initial global heap image
could be computed at compile time and simply block-copied into the
heap at run time.
Finally, I'm reproducing some comments made by John Dias when he saw
the benchmark results for the first benchmark:
> > run time ratio
> > benchmark MLton1 MLton2 MLton3
> > barnes-hut 1.09 2.49 2.35
>
> Given our nearly complete lack of optimization, I'm not very discouraged
> by these numbers -- even the peephole optimizer underperforms (according
> to Norman). And as I said, I have made a number of unimprovements to the
> register allocator, so I dare say these performance numbers can be
> improved easily when I find the time -- hopefully soon.
>
> > size
> > benchmark MLton0 MLton1 MLton2 MLton3
> > barnes-hut 100,754 107,373 463,036 385,356
>
> I blame a combination of the register allocator and the incredibly verbose
> encoding of the run-time data. I hope one of Norman's summer students will
> tackle the problem of compressing that data.
>
> > compile time
> > benchmark MLton0 MLton1 MLton2 MLton3
> > barnes-hut 8.09 13.37 51.37 76.66
>
> Well, I can't say that MLton2 surprises me, but the disparity b/w 2 and 3
> is interesting. I'm willing to bet that the new cfg improves that.
*******************************************************************************
Full benchmark results
*******************************************************************************
MLton0 -- mlton -codegen native
MLton1 -- mlton -codegen c
MLton2 -- mlton -codegen cmm
MLton3 -- mlton -codegen cmm -cmmc-opt 'Backend.x86.improve=Optimize.improve'
run time ratio (MLtonN / MLton0)
benchmark MLton0 MLton1 MLton2 MLton3
barnes-hut 1.00 1.09 2.49 2.35
boyer 1.00 1.18 2.21 1.66
checksum 1.00 0.86 3.52 3.07
count-graphs 1.00 1.39 4.54 3.43
DLXSimulator 1.00 1.05 1.39 1.24
fft 1.00 1.03 1.43 1.48
fib 1.00 1.32 3.03 2.22
flat-array 1.00 1.33 3.36 3.01
hamlet 1.00 2.00 * *
imp-for 1.00 1.44 6.37 5.18
knuth-bendix 1.00 1.42 3.80 2.91
lexgen 1.00 1.99 3.75 3.10
life 1.00 1.31 5.21 3.90
logic 1.00 1.32 2.64 1.92
mandelbrot 1.00 1.12 4.13 2.19
matrix-multiply 1.00 1.06 3.89 3.99
md5 1.00 1.37 3.20 2.23
merge 1.00 1.00 5.67 4.03
mlyacc 1.00 1.30 2.36 1.87
model-elimination 1.00 1.31 * *
mpuz 1.00 1.45 4.69 3.88
nucleic 1.00 1.04 * *
output1 1.00 1.97 4.39 3.34
peek 1.00 1.76 5.75 5.75
psdes-random 1.00 1.07 3.60 3.24
ratio-regions 1.00 1.43 3.54 2.31
ray 1.00 1.19 3.85 3.07
raytrace 1.00 1.45 4.56 3.33
simple 1.00 1.48 2.67 2.10
smith-normal-form 1.00 1.00 1.02 *
tailfib 1.00 2.42 5.72 5.02
tak 1.00 1.06 2.93 1.95
tensor 1.00 1.71 5.97 4.52
tsp 1.00 1.10 1.96 1.70
tyan 1.00 1.09 2.42 1.87
vector-concat 1.00 1.35 3.38 3.18
vector-rev 1.00 1.21 2.47 2.06
vliw 1.00 1.40 2.48 1.88
wc-input1 1.00 1.73 4.49 3.80
wc-scanStream 1.00 1.26 3.72 3.32
zebra 1.00 1.21 * *
zern 1.00 1.19 3.23 2.79
size (bytes)
benchmark MLton0 MLton1 MLton2 MLton3
barnes-hut 100,754 107,373 463,036 385,356
boyer 138,545 150,465 1,115,965 1,024,773
checksum 52,385 52,481 72,965 67,117
count-graphs 66,305 73,529 179,701 142,221
DLXSimulator 129,789 139,521 755,205 632,533
fft 64,920 73,388 138,902 116,766
fib 47,877 52,565 69,009 62,889
flat-array 47,901 52,601 69,337 63,433
hamlet 1,243,296 1,349,004 * *
imp-for 47,733 52,377 67,845 62,245
knuth-bendix 105,473 121,377 382,709 297,605
lexgen 196,938 226,322 1,793,578 1,594,434
life 65,229 68,901 169,369 129,841
logic 106,737 102,413 804,201 687,993
mandelbrot 47,829 52,481 69,037 62,597
matrix-multiply 49,336 54,032 74,352 66,776
md5 73,737 77,749 214,169 184,553
merge 49,457 53,989 77,305 68,521
mlyacc 505,514 585,718 12,679,994 12,001,970
model-elimination 631,975 722,627 * *
mpuz 50,477 57,525 82,533 71,525
nucleic 199,432 154,299 * *
output1 76,499 81,343 235,227 202,075
peek 72,461 78,761 201,465 172,441
psdes-random 48,557 53,325 71,241 64,665
ratio-regions 73,573 82,161 267,733 204,309
ray 180,686 205,713 1,108,940 934,476
raytrace 264,355 335,726 1,779,525 1,393,165
simple 217,549 262,784 3,219,123 2,907,939
smith-normal-form 177,821 198,817 535,625 *
tailfib 47,573 52,313 67,353 61,761
tak 47,957 52,537 69,237 62,917
tensor 93,792 109,004 380,008 311,896
tsp 78,297 85,363 232,439 190,615
tyan 131,145 149,661 617,845 479,757
vector-concat 49,157 54,065 76,461 68,069
vector-rev 48,385 53,037 72,325 65,421
vliw 390,037 466,881 3,688,265 3,129,945
wc-input1 101,125 114,441 464,293 399,525
wc-scanStream 109,265 114,997 567,753 492,185
zebra 120,817 129,013 * *
zern 90,866 98,054 184,768 160,776
compile time (seconds)
benchmark MLton0 MLton1 MLton2 MLton3
barnes-hut 8.09 13.37 51.37 76.66
boyer 8.92 28.17 76.85 1454.22
checksum 5.36 5.70 9.33 10.36
count-graphs 6.21 8.31 27.39 33.72
DLXSimulator 9.27 18.08 79.53 119.66
fft 5.95 6.98 19.29 24.02
fib 5.33 5.65 9.57 10.78
flat-array 5.40 5.71 9.74 10.82
hamlet 61.73 197.87 * *
imp-for 5.37 5.68 9.45 10.72
knuth-bendix 7.41 13.92 53.58 81.78
lexgen 10.92 25.18 125.07 194.11
life 6.02 7.94 24.96 35.42
logic 7.70 13.31 68.79 118.09
mandelbrot 5.38 5.86 9.64 10.85
matrix-multiply 5.47 5.92 11.05 12.25
md5 6.31 8.78 25.11 35.79
merge 5.42 5.82 10.99 12.41
mlyacc 27.46 77.11 460.78 994.41
model-elimination 29.46 87.49 * *
mpuz 5.48 6.12 12.22 14.26
nucleic 14.13 40.88 * *
output1 6.33 8.73 26.23 35.68
peek 6.12 8.25 23.12 31.20
psdes-random 5.40 5.85 9.91 11.21
ratio-regions 6.83 9.83 37.48 47.80
ray 9.76 20.62 106.26 168.60
raytrace 14.86 38.15 215.49 352.70
simple 11.59 28.05 160.74 299.37
smith-normal-form 9.96 81.82 88.11 *
tailfib 5.48 5.70 9.41 10.29
tak 5.45 5.71 9.67 10.73
tensor 8.37 14.10 49.52 71.47
tsp 6.58 9.59 31.93 43.69
tyan 8.98 18.80 81.27 115.25
vector-concat 5.46 5.84 10.79 12.24
vector-rev 5.39 5.67 10.10 11.32
vliw 20.05 52.51 338.71 522.70
wc-input1 7.28 13.05 45.98 62.67
wc-scanStream 7.44 12.80 53.54 75.82
zebra 8.56 16.03 * *
zern 6.16 7.58 21.97 27.29
run time (seconds)
benchmark MLton0 MLton1 MLton2 MLton3
barnes-hut 50.58 54.91 126.05 118.82
boyer 53.95 63.73 119.33 89.50
checksum 97.32 83.63 342.52 299.25
count-graphs 41.25 57.47 187.26 141.43
DLXSimulator 90.62 94.79 125.64 112.38
fft 36.73 37.81 52.38 54.23
fib 70.24 92.81 213.14 155.90
flat-array 25.01 33.15 84.13 75.35
hamlet 52.67 105.32 * *
imp-for 46.78 67.14 297.92 242.17
knuth-bendix 39.88 56.56 151.34 116.02
lexgen 44.28 88.17 165.94 137.09
life 14.39 18.85 74.96 56.19
logic 55.03 72.50 145.46 105.67
mandelbrot 85.85 95.95 354.25 187.75
matrix-multiply 7.33 7.79 28.49 29.24
md5 53.24 72.94 170.48 118.97
merge 88.91 88.64 503.71 358.35
mlyacc 41.16 53.45 97.04 77.07
model-elimination 79.50 104.41 * *
mpuz 41.79 60.80 195.86 162.05
nucleic 45.73 47.46 * *
output1 14.39 28.35 63.12 48.06
peek 35.41 62.39 203.63 203.64
psdes-random 41.51 44.49 149.40 134.46
ratio-regions 52.91 75.45 187.50 121.98
ray 30.59 36.31 117.91 94.00
raytrace 42.48 61.52 193.87 141.55
simple 59.76 88.65 159.75 125.33
smith-normal-form 36.99 36.92 37.62 *
tailfib 43.81 105.86 250.68 219.75
tak 27.59 29.15 80.87 53.73
tensor 58.94 100.85 351.81 266.47
tsp 60.15 66.08 117.92 102.32
tyan 57.15 62.23 138.20 106.95
vector-concat 91.26 123.30 308.81 289.80
vector-rev 113.87 137.61 281.14 234.94
vliw 51.20 71.48 126.77 96.04
wc-input1 38.85 67.34 174.33 147.81
wc-scanStream 33.46 42.03 124.51 111.18
zebra 40.33 48.65 * *
zern 41.26 49.05 133.13 115.24