[MLton] cvs commit: rewrote x86.Block.compress to run in linear time

Thu, 1 Jul 2004 16:08:56 -0700

> I take it from the lack of victory cries that this wasn't the dragon
> we needed to slay.

Nonsense.  Merely an indication that I occasionally eat.  :-)

While I was having lunch, I left a compile running.  It finished, and
here are some timings:

MLton starting
   Compile SML starting
      pre codegen starting
	 closureConvertSimplify starting
	    localRef starting
	       multi starting
	       multi finished in 0.61 + 5.93 (91% GC)
	    localRef finished in 99.32 + 56.54 (36% GC)
	 closureConvertSimplify finished in 199.37 + 119.92 (38% GC)
	 backend finished in 112.45 + 41.49 (27% GC)
      pre codegen finished in 481.06 + 321.58 (40% GC)
      x86 code gen starting
	 outputAssembly starting
	    translateChunk totals 39.02 + 10.62 (21% GC)
	    simplify totals 1584.76 + 92.90 (6% GC)
	    generateTransfers totals 115.97 + 4.83 (4% GC)
	    allocateRegisters totals 922.38 + 40.59 (4% GC)
	 outputAssembly finished in 3636.43 + 152.73 (4% GC)
      x86 code gen finished in 3637.45 + 152.77 (4% GC)
   Compile SML finished in 4118.61 + 474.35 (10% GC)
   Compile C and Assemble starting
   Compile C and Assemble finished in 51.02 + 0.00 (0% GC)
   Link starting
   Link finished in 191.49 + 0.00 (0% GC)
MLton finished in 4361.35 + 474.49 (10% GC)
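
For anyone wanting to turn these lines into totals: each one reads as
"run time + GC time" in seconds, and the GC percentages are consistent
with that reading (474.49 is about 10% of 4361.35 + 474.49).  Here is a
minimal sketch in Python that adds the two numbers up.  The function
name and the regular expression are my own, not part of MLton.

```python
import re

# Matches lines such as "MLton finished in 4361.35 + 474.49 (10% GC)"
# and "simplify totals 1584.76 + 92.90 (6% GC)".
TIMING = re.compile(r"\s*(.+?) (?:finished in|totals) ([\d.]+) \+ ([\d.]+)")

def parse_timing(line):
    """Return (pass name, run seconds, GC seconds), or None if no match."""
    m = TIMING.match(line)
    if m is None:
        return None
    return m.group(1), float(m.group(2)), float(m.group(3))

name, run, gc = parse_timing("MLton finished in 4361.35 + 474.49 (10% GC)")
total = run + gc
print(f"{name}: {total / 3600:.2f} hours, {100 * gc / total:.0f}% GC")
```

Lines ending in "starting" carry no numbers, so parse_timing returns
None for them.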

So, we're down from 8.6 to 1.3 hours.  That quadratic compress was
certainly most of the problem.  The remaining problems are

* localRef is taking too long.  I still don't know why.  It's not the
  multi subpass.  Any ideas?  In any case, that's the only remaining
  glaring problem in the pre codegen.

* There are a couple of really large .S files (30M and 40M).  And
  simplify and allocateRegisters together take over half of the
  total time.  I'll try a compile with -native-optimize 0 to see
  what happens.  Perhaps
  another possibility would be for the codegen to automatically treat
  native-optimize as zero when compiling procedures that are too large
  (as we do now for the globals).  Other possibilities would be to do
  this only for main, or for procedures that are only called once
  (which we can certainly prove for main).
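
To make the second idea concrete, here is a rough sketch in Python (not
the codegen's actual code) of picking the optimize level per procedure.
Everything in it is hypothetical: the Procedure record, the SIZE_LIMIT
value, and the only_called_once switch are assumptions for
illustration, and I'm assuming the called-once restriction would be
combined with the size check.

```python
from dataclasses import dataclass

# Hypothetical threshold; no real cutoff has been chosen.
SIZE_LIMIT = 100_000

@dataclass
class Procedure:
    name: str
    size: int        # some measure of size, e.g. number of statements
    call_count: int  # number of calls known statically

def optimize_level(proc, requested, only_called_once=False):
    """Return 0 for procedures that are too large, else the requested level."""
    too_large = proc.size > SIZE_LIMIT
    if only_called_once:
        # Stricter variant: only give up on large procedures called once.
        too_large = too_large and proc.call_count <= 1
    return 0 if too_large else requested

main = Procedure("main", 500_000, 1)
helper = Procedure("helper", 200, 40)
print(optimize_level(main, 1), optimize_level(helper, 1))
```

With only_called_once=True, a large procedure that is called from
several places would still get the requested level.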

Hopefully fixing those will get us down to around a half hour.  Also,
remember this compile was run with -verbose 3, which causes some
slowdown to compute all the IL sizes.  So we can probably get another
10% from switching that off.