Some results

Matthew Fluet
Fri, 21 Jul 2000 12:56:53 -0400 (EDT)

Here's where I stand with the x86 backend:
1. Very very close to a fully working integer backend.
   I can pass every regression test that does not use floating point with
   the exception of the signals.sml test.  Likewise, everything from the
   examples directory (including save-world.sml) works, with the exception
   of the threads and signals files.  I'm somewhat at a loss for why
   threads and signals aren't working when everything else seems to be
   working.  I get a couple of different errors when I try running these
   programs (they all compile fine).  The signals.sml file actually runs
   fine, but I only get output for the sending thread; i.e., the output is 
      sending 1
      sending 2
      sending 3
   when I should see the replies from the other thread to the signals.
   The files thread1.sml and thread2.sml both segfault, apparently due to
   some corruption of the gcState.currentThread variable.  One of them
   fails in an assert when the expression s->currentThread->stack->used
   dereferences an invalid memory address; the other fails when the
   instruction pointer is set to 0x0, probably because some
   currentThread is pointing to a piece of stack with 0x0 where a return
   address was expected.  Maybe someone else has an idea why I'm getting
   these errors.  There are very few thread primitives that get compiled
   to native assembly and, as best I can tell, they are working correctly.

2. The (sort of) bad news.  Performance isn't really where we would like
   it to be.  The results of all of the integer benchmarks are included
   below, but here's a typical entry:

testing count-graphs
        0:30.49 real,   29.61 user,     0.89 sys   <-- compile time
        0:19.10 real,   17.98 user,     1.09 sys   <-- execution time
mlton -global-regs:
        0:31.02 real,   30.14 user,     0.89 sys
        0:43.54 real,   42.49 user,     0.89 sys
        0:29.87 real,   28.60 user,     1.23 sys
        0:27.55 real,   26.51 user,     1.06 sys

Let's start with compile time.  (Obviously, this is using an SML/NJ
version of mlton.)  Right now, we're not seeing that much of
an improvement by avoiding a serious gcc call.  However, I think that
there is a real improvement there; it's just being hidden by other
factors.  The way I have the native backend signature, it looks a lot like
the C-backend signature.  This means that the output routine is invoked
with just one file output location.  Since I'm dumping assembly for the
cps functions and C for some data declarations and the main function, some
extra work needs to be done.  The x86mlton executable is really a script
which executes as follows:

        mlton -native -o file.c file.sml;
        sed ???? file.c > file_c.c
        mlton -S -O2 -o file_c.s file_c.c;
        as -o file_c.o file_c.s;
        sed ???? file.c > file_asm.s
        as -o file_asm.o file_asm.s;
        mlton -o file file_c.o file_asm.o;

So, you can see that for a native compile, we're really doing three
invocations of mlton (two of which make calls to gcc), plus two
invocations of the assembler.  I suspect that adds up.  Looking at the
compile time for some of the larger benchmarks, there is some decent
improvement.  I think we'll see even better performance when we can make
the assembler call from within mlton.  Also, I think I'm going to produce
an exclusively assembly file -- looking at what the C-stub is doing, it's
mostly data declarations and the main function, all of which should turn
into pretty mindless sequences of assembly code.

Runtime performance isn't really that impressive right now.  However,
there are a number of different factors at work, and I think I can explain
some of them.

First, let me talk about mlton -global-regs.  This is still a C-codegen,
but with a different treatment of pseudo-registers.  Using mlton, the
pseudo-registers are declared as a bunch of local variables to a chunk
function.  Using mlton -global-regs, the pseudo-registers are declared as
a global array (taking the maximum over the number of pseudo-registers of
each type for each chunk).  This version is closer to what x86mlton is
actually executing, because changes to pseudo-registers are always
committed to a memory location.  You'll notice that x86mlton is a bit
faster than mlton -global-regs.
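
To make the distinction concrete, here is a minimal sketch in C of the two
treatments of pseudo-registers.  The function names, the register count,
and the arithmetic are made up for illustration; none of this is actual
codegen output.

```c
/* Hypothetical "mlton" style: pseudo-registers are locals of the chunk
   function, so the C compiler is free to keep them in machine registers. */
int chunk_local_regs(int x) {
    int r0, r1;
    r0 = x + 1;
    r1 = r0 * 2;
    return r1;
}

/* Hypothetical "mlton -global-regs" style: pseudo-registers live in one
   global array, sized to the maximum any chunk needs.  Because the array
   is visible outside the function, the compiler generally has to store
   the final value of each element back to memory. */
enum { MAX_INT_REGS = 2 };
int globalRI[MAX_INT_REGS];

int chunk_global_regs(int x) {
    globalRI[0] = x + 1;
    globalRI[1] = globalRI[0] * 2;
    return globalRI[1];
}
```

Both functions compute the same value; the only difference is where the
intermediate results are required to live.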

What does this tell us?  I think that it indicates that gcc is able to do
a fair amount of liveness analysis and take advantage of the fact that
many pseudo-registers are only live for very short durations of time.  For
example, a common idiom that appears in the C-codegen files is something
like the following:

        RI(0) = CI(SP(20));
        SI(24) = Int_add(SI(12), RI(0));
        RI(1) = Int_gt(SI(24), 4096);
        BZ(RI(1), L_153);

Now, it's fairly probable that RI(0) and RI(1) are dead after the branch.
If gcc figures that out, then it saves a few memory transfers.  It may
even be able to optimize the branch into a single gt test instead of a gt
test, a set, and then a zero test.  Likewise, with

        RI(3) = Int_add(SI(56), 1);
        SI(56) = RI(3);

we can probably completely eliminate the use of RI(3).
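
As a rough sketch of what that elimination amounts to, the C below pairs
each idiom with the form a liveness-aware compiler could reduce it to.
The SI and RI macros here are simplified stand-ins defined just for this
example (plain arrays, 4-byte slots), and the function names are
hypothetical; the real codegen macros may well differ.

```c
/* Hypothetical stand-ins for the codegen macros. */
int stack_slots[64];                       /* stack frame, 4-byte slots  */
int int_regs[4];                           /* integer pseudo-registers   */
#define SI(off) (stack_slots[(off) / 4])   /* stack int at byte offset   */
#define RI(n)   (int_regs[(n)])            /* integer pseudo-register n  */

/* As emitted: the sum travels through pseudo-register RI(3). */
void incr_via_reg(void) {
    RI(3) = SI(56) + 1;
    SI(56) = RI(3);
}

/* If RI(3) is dead afterwards, the pseudo-register can be dropped. */
void incr_direct(void) {
    SI(56) = SI(56) + 1;
}

/* As emitted: a gt test, a set of RI(1), then a zero test. */
int branch_via_reg(void) {
    RI(1) = SI(24) > 4096;
    if (!RI(1)) return 0;                  /* stands in for BZ(RI(1), L_153) */
    return 1;
}

/* If RI(1) is dead afterwards, a single gt test suffices. */
int branch_direct(void) {
    return SI(24) > 4096;
}
```

The two versions of each idiom agree on every stack slot; they differ only
in whether the pseudo-register is ever written.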

On the other hand, from the point of view of the current incarnation of
the x86 backend register allocator, every change to a pseudo-register
needs to be committed to memory.  This probably leads to more memory
traffic than is strictly necessary.

Anyway, I think that the comparisons between mlton -global-regs and
x86mlton indicate that there is some non-trivial improvement to be gained
from avoiding trampolining and using a native backend.  And, there are a
few simple optimizations that can be applied to the assembly to gain some
more performance.  For example, a statement like

        BZ(RI(1), L_153);

is translated into a test followed by a conditional jump followed by an
unconditional jump.  It won't be difficult to modify the output routines
so that the unconditional jump can be eliminated when the code for the
target label of the unconditional jump can be output at that position.
Extending this a little bit, if the unconditional jump is the only jump to
that target label, then we can combine the two blocks and let the register
allocator work on the larger block.  Some simple instruction
scheduling to move loads up in the instruction stream would also help to
keep the processor pipeline full.
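
The jump elimination could look roughly like the following C sketch.  It
only covers the simplest case, where the target label already happens to
be the next thing output, and the Insn representation and function name
are invented for this example; they are not the backend's actual data
structures.

```c
#include <string.h>

/* Hypothetical, minimal instruction representation. */
typedef enum { OP_LABEL, OP_JMP, OP_JCC, OP_OTHER } Op;
typedef struct {
    Op op;
    const char *label;   /* label name for OP_LABEL/OP_JMP/OP_JCC, else NULL */
} Insn;

/* Drop every unconditional jump whose target label is the very next
   instruction.  Compacts the array in place and returns the new length. */
size_t drop_fallthrough_jumps(Insn *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        int redundant = code[i].op == OP_JMP
            && i + 1 < n
            && code[i + 1].op == OP_LABEL
            && strcmp(code[i].label, code[i + 1].label) == 0;
        if (!redundant)
            code[out++] = code[i];
    }
    return out;
}
```

Merging the two blocks when the removed jump was the only jump to that
label would be a separate step; it needs a count of the references to each
label, which this sketch does not track.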

skipping barnes-hut
testing checksum
	0:07.29 real,	6.65 user,	0.66 sys
	0:11.75 real,	11.51 user,	0.18 sys
mlton -global-regs:
	0:07.33 real,	6.85 user,	0.55 sys
	0:31.35 real,	31.15 user,	0.17 sys
	0:08.12 real,	7.06 user,	1.12 sys
	0:28.27 real,	28.07 user,	0.18 sys
testing count-graphs
	0:30.49 real,	29.61 user,	0.89 sys
	0:19.10 real,	17.98 user,	1.09 sys
mlton -global-regs:
	0:31.02 real,	30.14 user,	0.89 sys
	0:43.54 real,	42.49 user,	0.89 sys
	0:29.87 real,	28.60 user,	1.23 sys
	0:27.55 real,	26.51 user,	1.06 sys
skipping fft
testing fib
	0:05.33 real,	4.81 user,	0.50 sys
	0:22.07 real,	21.92 user,	0.02 sys
mlton -global-regs:
	0:05.27 real,	4.72 user,	0.56 sys
	0:32.59 real,	32.57 user,	0.05 sys
	0:06.04 real,	4.85 user,	1.21 sys
	0:22.18 real,	22.19 user,	0.00 sys
testing knuth-bendix
	0:42.35 real,	40.99 user,	1.17 sys
	0:37.09 real,	36.24 user,	0.81 sys
mlton -global-regs:
	0:42.99 real,	39.97 user,	2.76 sys
	0:43.79 real,	41.71 user,	1.88 sys
	0:39.23 real,	37.74 user,	1.40 sys
	0:42.87 real,	41.83 user,	0.86 sys
skipping lexgen
testing life
	0:19.03 real,	18.17 user,	0.70 sys
	1:40.87 real,	99.28 user,	1.07 sys
mlton -global-regs:
	0:19.32 real,	18.66 user,	0.61 sys
	2:59.22 real,	177.69 user,	0.99 sys
	0:18.87 real,	17.78 user,	1.20 sys
	2:18.62 real,	137.04 user,	1.00 sys
testing logic
	1:36.40 real,	94.68 user,	1.63 sys
	1:32.00 real,	90.70 user,	1.07 sys
mlton -global-regs:
	2:12.81 real,	130.58 user,	1.91 sys
	1:47.64 real,	106.00 user,	1.16 sys
	1:30.97 real,	87.90 user,	2.91 sys
	1:29.79 real,	88.36 user,	1.09 sys
skipping mandelbrot
skipping matrix-multiply
testing mlyacc
	17:14.67 real,	1020.19 user,	9.92 sys
	0:42.36 real,	40.02 user,	1.54 sys
mlton -global-regs:
	18:20.07 real,	1081.74 user,	10.51 sys
MLton bug: toplevel handler not installed.        <-- no idea why
Please send a bug report to
Command exited with non-zero status 2
	0:00.01 real,	0.00 user,	0.00 sys
	16:25.57 real,	969.95 user,	13.38 sys
	0:47.22 real,	45.45 user,	1.46 sys
testing mpuz
	0:12.63 real,	11.69 user,	0.62 sys
	1:16.82 real,	76.13 user,	0.11 sys
mlton -global-regs:
	0:12.44 real,	11.57 user,	0.68 sys
	2:26.23 real,	144.97 user,	0.10 sys
	0:13.02 real,	11.72 user,	1.07 sys
	1:54.28 real,	113.54 user,	0.04 sys
skipping nucleic
testing ratio-regions
	0:56.39 real,	54.04 user,	1.26 sys
	0:42.14 real,	40.69 user,	1.20 sys
mlton -global-regs:
	0:56.94 real,	55.13 user,	1.27 sys
	1:51.91 real,	108.88 user,	1.10 sys
	0:56.24 real,	53.18 user,	2.00 sys
	1:07.63 real,	62.44 user,	3.15 sys
skipping ray
skipping simple
testing smith-normal-form                         <-- forgot to add
mlton:                                                  val _ = Main.doit ()
	0:25.72 real,	22.83 user,	1.50 sys
	0:00.00 real,	0.00 user,	0.00 sys
mlton -global-regs:
	0:25.14 real,	23.08 user,	1.11 sys
	0:00.00 real,	0.01 user,	0.00 sys
	0:25.54 real,	23.53 user,	1.58 sys
	0:00.00 real,	0.00 user,	0.00 sys
testing tak
	0:06.31 real,	5.13 user,	0.53 sys
	0:47.39 real,	46.40 user,	0.06 sys
mlton -global-regs:
	0:06.03 real,	5.21 user,	0.61 sys
	1:06.74 real,	65.03 user,	0.12 sys
	0:06.44 real,	5.40 user,	1.07 sys
	0:53.59 real,	53.08 user,	0.04 sys
skipping tensor
skipping tsp
skipping vliw
testing wc
	0:23.63 real,	22.54 user,	0.94 sys
	0:25.35 real,	22.70 user,	2.38 sys
mlton -global-regs:
	0:24.14 real,	23.18 user,	0.80 sys
	0:47.30 real,	43.91 user,	2.66 sys
	0:23.92 real,	22.29 user,	1.29 sys
	0:42.72 real,	40.07 user,	2.08 sys
skipping zern