[MLton] latest benchmarks

Wed Jun 20 09:15:04 PDT 2007

I've merged the x86_64 branch into trunk.  Since the previous 
announcement of the experimental release, there were only two minor bugs 
reported:
  1) Bug with -align 8 on x86_64
  2) Inconsistent behavior with -const 'MLton.detectOverflow false'
These have both been fixed, and I'm pretty happy with the state of the 
x86_64 port.

I ran the benchmark suite to compare the last public release to the 
current trunk.  It is a bit of an apples-to-oranges comparison, since I 
ran the benchmarks on an AMD Opteron (64-bit) system.  So, the 20051205 
compiler (and its resulting executables) are running in 32-bit mode, 
while the trunk compiler (and its resulting executables) are running in 
64-bit mode.

[BTW, it would be nice if someone could run a corresponding benchmark 
suite on a 32-bit system, for a more apples-to-apples comparison.]

You can see all of the results at:
http://mlton.org/cgi-bin/viewsvn.cgi/*checkout*/mlton/trunk/doc/x86_64-port-notes/bench-20070619.txt?rev=5659

Some of the highlights:

* Benchmarks were run on a single-core, dual-processor AMD Opteron 
2.0GHz, 8GB memory, Fedora Core 6 machine (with gcc version 4.1.1 and 
Linux version 2.6.20 (x86_64)).

* Compile time and code size are up across the board on trunk vs 
20051205.  I suspect that part of the code size increase can be 
attributed to the comparison of 32-bit executables to 64-bit 
executables.  Any 64-bit operation requires an additional 8-bit 
instruction prefix (as do 32-bit ops that touch the extended register 
set).  Compile time is probably partly explained by the bigger Basis 
Library implementation (increasing elaboration time and carrying more 
code through early optimizations), and partly by the fact that the trunk 
compiler is executing a little slower than the 20051205 compiler.
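
As a rough illustration of the prefix cost mentioned above, here is a 
small C sketch holding hand-assembled encodings of three 
register-to-register adds.  The byte values are standard x86-64 
encodings written out by hand (the extra byte is the REX prefix); they 
are not MLton output, and the array and function names are made up for 
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-assembled encodings (AT&T syntax in the comments).  All names
 * here are hypothetical and exist only for this sketch. */

/* addl %ebx, %eax : 32-bit op on the legacy registers, no prefix */
const uint8_t add32_legacy[] = { 0x01, 0xD8 };

/* addq %rbx, %rax : 64-bit op, needs REX.W (0x48) */
const uint8_t add64_legacy[] = { 0x48, 0x01, 0xD8 };

/* addl %r9d, %r8d : 32-bit op on extended registers,
 * needs REX.R + REX.B (0x45) */
const uint8_t add32_extended[] = { 0x45, 0x01, 0xC8 };

/* In 64-bit mode, the bytes 0x40..0x4F are REX prefixes. */
int is_rex_prefix(uint8_t b) {
    return (b & 0xF0) == 0x40;
}

/* Number of leading REX bytes (0 or 1 for these simple encodings). */
size_t rex_bytes(const uint8_t *insn, size_t len) {
    assert(insn != NULL);
    return (len > 0 && is_rex_prefix(insn[0])) ? 1 : 0;
}
```

Here the 64-bit add and the extended-register add are each one byte 
longer than the plain 32-bit add.  How much of the measured code size 
increase comes from this is not something the sketch can say.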

* Recent versions of gcc are doing fairly well with the C code.  (Note 
that using -codegen c with 20051205 uses the version of gcc on the host 
machine.)  Indeed, the flat-array.sml benchmark needs to be revised, as 
gcc recognizes that the inner loop is pure (Overflow exceptions are 
handled within the loop) and that its result is unused.  The SSA{,2} 
optimizer should also discover that the loop may be optimized away, but 
that is another issue.
GCC also does fairly well on the checksum benchmark with 20051205, 
though it does horribly on the checksum benchmark with trunk.
I suspect that the latter behavior is due to the fact that on x86_64, 
sequences (arrays/vectors) are indexed by 64-bit integers in the 
primitive operations (sub, update, etc.), but indexed by 32-bit integers 
in the user code (Array.sub, Array.update, etc., since Int.int 
corresponds to Int32.int).  Hence, there are quite a few 64/32 
conversions going on.
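
To make the 64/32 conversion point concrete, here is a hand-written C 
sketch (not actual MLton output) of a checksum-style loop.  The first 
version keeps a 32-bit index, as user code sees it, and widens it on 
every subscript; the second keeps the index at 64 bits throughout.  The 
function names and the simple additive checksum are assumptions made 
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-level view: the index is a 32-bit Int, widened on every access
 * because the primitive subscript takes a 64-bit index. */
uint32_t sum_index32(const uint8_t *a, int32_t n) {
    assert(a != NULL && n >= 0);
    uint32_t sum = 0;
    for (int32_t i = 0; i < n; i++) {
        int64_t wide = (int64_t)i;   /* sign-extend 32 -> 64 */
        sum += a[wide];
    }
    return sum;
}

/* Same loop with the index held at 64 bits: no conversion needed. */
uint32_t sum_index64(const uint8_t *a, int64_t n) {
    assert(a != NULL && n >= 0);
    uint32_t sum = 0;
    for (int64_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

gcc may well fold the widening away in a loop this simple; the sketch 
only shows where the conversions sit, not how costly they are in the 
generated code.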

* I note that with both native and C codegens, and with both 20051205 
and trunk, -align 8 often has a positive impact on runtime and rarely 
has a significant negative impact.  This might be due to the Opteron 
memory system.  Aligned reads probably help most on Real64-intensive 
benchmarks.
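
One possible mechanism behind the alignment effect, sketched in C: an 
8-byte value that is only 4-byte aligned can straddle two cache lines, 
while an 8-byte-aligned one cannot.  The 64-byte line size is an 
assumption, the helper names are hypothetical, and nothing here was 
measured on the benchmark machine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache line size in bytes (an assumption, not measured). */
enum { CACHE_LINE = 64 };

/* Nonzero if reading `size` bytes at `addr` touches two cache lines. */
int crosses_cache_line(uint64_t addr, uint64_t size) {
    assert(size > 0 && size <= CACHE_LINE);
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Round addr up to a multiple of align (align must be a power of two). */
uint64_t align_up(uint64_t addr, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}
```

For example, an 8-byte read at address 60 (4-byte aligned) spans two 
lines, whereas after align_up(60, 8) the same read starts at 64 and 
stays within one.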

* The amd64 codegen is doing all right compared to the x86 codegen.  I 
see at most a factor of 2 slowdown, and a few speedups.  Again, I'm not 
sure what real conclusions can be drawn.  Some slowdowns are going to be 
due to the changes to the runtime and Basis Library since 20051205; to 
isolate those, I need a comparison of 20051205 to trunk on a 32-bit 
system.  Some slowdowns are probably going to be due to the sequence 
indexing discussed above.