[MLton] cvs commit: rewrote x86.Block.compress to run in linear
time
Stephen Weeks
MLton@mlton.org
Thu, 1 Jul 2004 20:31:22 -0700
> > * localRef is taking too long. I still don't know why. It's not the
> > multi subpass. Any ideas? In any case, that's the only remaining
> > glaring problem in the pre codegen.
>
> You might trace the SSA- restore pass with a Control.traceBatch.
I put a trace around restore and here's what I got
localRef starting
multi starting
multi finished in 0.61 + 5.93 (91% GC)
restore starting
restore finished in 0.91 + 0.00 (0% GC)
restore starting
restore finished in 0.00 + 0.00 (0% GC)
restore starting
restore finished in 0.00 + 0.00 (0% GC)
localRef finished in 99.32 + 56.54 (36% GC)
So, it looks there are only three functions with local refs, and the
restores don't hurt.
> Looking at the -verbose 3 you sent out earlier, the biggest problem seems
> to be in copyPropagate. I glanced at that code and it does have a
> quadratic running time in the size of the basic block.
> You can turn that off with -native-copy-prop false.
>
> -native-optimize 0 will also cut down the toLiveness time in
> allocateRegisters. Essentially, that is responsible for computing the
> hints to the register allocator.
Here's a comparison of a compile with -native-optimize {1,0}.
-native-optimize
1 0
---------- ----------
compile time (s) 4836 4054
code size (bytes) 25,519,250 27,904,878
run time (s) 273 303
Overall, there was about a 20% improvement in compile time, 10% larger
code, and 10% slower code.
Here are some details from the x86 codegen.
-native-optimize 1
outputAssembly starting
translateChunk totals 39.02 + 10.62 (21% GC)
simplify totals 1584.76 + 92.90 (6% GC)
completeLiveInfo totals 215.69 + 3.44 (2% GC)
toLivenessBlock totals 128.13 + 0.95 (1% GC)
moveHoist totals 227.31 + 28.56 (11% GC)
peepholeLivenessBlock totals 249.96 + 27.58 (10% GC)
copyPropagate totals 744.30 + 30.50 (4% GC)
generateTransfers totals 115.97 + 4.83 (4% GC)
allocateRegisters totals 922.38 + 40.59 (4% GC)
toLiveness totals 652.69 + 3.31 (1% GC)
Assembly.allocateRegisters totals 264.26 + 36.12 (12% GC)
outputAssembly finished in 3636.43 + 152.73 (4% GC)
-native-optimize 0
outputAssembly starting
translateChunk totals 39.17 + 9.61 (20% GC)
simplify totals 127.05 + 2.73 (2% GC)
completeLiveInfo totals 126.81 + 2.73 (2% GC)
toLivenessBlock totals 0.00 + 0.00 (0% GC)
moveHoist totals 0.00 + 0.00 (0% GC)
peepholeLivenessBlock totals 0.00 + 0.00 (0% GC)
copyPropagate totals 0.00 + 0.00 (0% GC)
peepholeLivenessBlock_minor totals 0.00 + 0.00 (0% GC)
generateTransfers totals 160.16 + 4.87 (3% GC)
allocateRegisters totals 1777.27 + 39.32 (2% GC)
toLiveness totals 1315.40 + 1.59 (0% GC)
Assembly.allocateRegisters totals 454.60 + 35.82 (7% GC)
outputAssembly finished in 3061.40 + 60.55 (2% GC)
So, it looks like that with -native-optimize 0 there's a lot of
savings to the simplifier, much of it is lost in allocate registers.
If we could improve the quadratic-ness of copyPropagate, we could
probably get most of that time back. Any thoughts on the difficulty
of that?