benchmarks
Matthew Fluet
fluet@research.nj.nec.com
Tue, 15 Aug 2000 11:26:56 -0400 (EDT)
Here are the results from the benchmarks I ran this morning using an
x86-G1 mlton. (Hence, for compile times, both the c-codegen and the
x86-codegen are being run from the same executable; so, given the results
in the running times, I suspect that the compile times are uniformly
greater than compile times would be with a c-G1 mlton, but the differences
in compile times should reflect the tradeoffs between using gcc and the
x86-codegen. Also, the x86-codegen is being run with ChunkPerFunc but no
splitting of the assembly file.)
Compile Times: user+sys
benchmark c-codegen x86-codegen
checksum 3.15 3.15
count-graphs 9.60 7.34
fib 2.81 2.88
knuth-bendix 15.70 10.63
life 7.30 5.75
logic 47.36 30.60
mlyacc 443.73 387.54
mpuz 4.65 4.03
ratio-regions 17.32 13.03
smith-normal-form 221.73 94.04
tak 2.88 2.90
wc 7.50 6.03
This seems to confirm that we can do better than gcc, particularly for
large programs. I don't quite know what's up with smith-normal-form,
especially considering that the x86-codegen's executable is just as fast.
I looked at mlyacc, and it produces a 117,000 line assembly file, so I
don't think that separate assembly would help much at this level (although
I haven't done any serious testing of different sized assembly files).
However, I noted that during the x86-codegen phase of mlyacc, I was using
around 87% of memory; so I think I'm going to investigate
translating->output a chunk at a time, rather than doing each phase on the
whole program. I don't think I'll try for a G2 self-compile until I
investigate that.
Running Times: user+sys
benchmark c-codegen x86-codegen x86/c
checksum 11.58 13.20 1.14
count-graphs 18.87 21.20 1.12
fib 21.26 17.28 0.81
knuth-bendix 37.32 39.27 1.05
life 103.25 116.52 1.13
logic 91.48 77.36 0.79
mlyacc 41.10 34.70 0.84
mpuz 76.60 90.85 1.19
ratio-regions 41.40 39.79 0.96
smith-normal-form 4.04 4.02 1.00
tak 48.46 37.86 0.78
wc 24.72 36.34 1.47
We're closing the gap here. Here's what's currently going on in the
simplification phase: eliminating blocks which only jump to another label,
collapsing if-s whose branches are the same label (yes, this really does
occur after eliminating jumps to jumps), combining block A with block B
when A's goto B is the only transfer to B, peepholing (multiplication and
division by powers of 2, addition and subtraction of +/-1, collapse
cmp-setcc-test-jcc to cmp-jcc, eliminate simple copies of the form
RI(0) = SI(4);
SI(24) = RI(0),
eliminate self moves of the form RI(0) = RI(0) (which are created by the
elimination of self copies)), and rewriting transfers to allow some jumps
to fall through to the next instruction.
There are two other simplifications that I would like to try:
(1) currently, before a transfer, all pseudo-regs are flushed. For
ifs and switches, it might be better to only flush the pseudo-regs that
are live down the corresponding branch (and lift up the pseudo-regs that
are live down all branches).
(2) As a follow-up to (1), knowing which pseudo-regs are live into the
next block, I could coerce the register allocator to ensure that those
pseudo-regs are in some set of registers and then the next block would
assume that those pseudo-regs are held in the corresponding registers.
But, this is going to take a little analysis to choose which pseudo-regs
should be carried across the jump. This would probably result in a little
more register-register shuffling at the end of a block, that would
probably be better than a store-jump-load.
Adding (1) shouldn't be difficult at all. Adding (2) will be a little bit
more difficult, probably not by the end of this week.
Finally, there are some other tweaks I'd like to try to avoid some memory
references that I'm seeing. I don't expect anything spectacular, but
every little bit might help.