Some numbers on SML/NJ & MLton runs and also a few questions
shivers@cc.gatech.edu
Mon, 6 Oct 2003 18:00:32 -0400
Wow, MLton is really fast.
Here is what I have done. I have 14 different sorting algorithms -- twelve
are purely functional list sorts, one sorts lists destructively (dlm), and
one sorts vectors (heapsort -- hp). All the onm<i> algos are small variants
of one another; I'll say more about that later.
I have a test harness that creates 10,000-element test lists:
- 10k/1 means a 10k-elt list, a random permutation of 1..10,000.
- 10k/10< means (1) start with a 10k/1 list, then (2) step through
  it in random steps of size 1..10, sorting each step-segment into
  increasing order. Some sort algos can exploit existing order in
  the data, so this gives them a chance to do their thing. (A sketch
  of these generators follows the list.)
- 10k/k< is like 10k/10<, but the sizes of the contiguous runs are
  chosen from the range 1..1,000, not 1..10 -- the average run length
  is 500 elts, much larger. The order-exploiting "natural" sorts
  should do even better.
- 10k/10x is like 10k/10<, except that half of the runs are ordered
  increasing, and half are ordered decreasing. Some algos can exploit
  decreasing as well as increasing runs of data, so this lets them
  show off.
- 10k/kx (in the tables below) is the same mixed-direction treatment
  with run lengths drawn from 1..1,000, i.e., 10k/k< with half the
  runs decreasing.
- 10k< is simply the completely ordered list [1,...,10,000].
- 10k> is simply the completely reversed list [10,000,...,1].
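Schematically, the generators look like this. This is a sketch, not the
actual harness code; the names (permutation, sortRuns, randInt) are made
up, and it leans on the SML/NJ Util library's Random and ListMergeSort:

    (* Sketch only.  Assumes the SML/NJ Util library. *)
    local
      val gen = Random.rand (13, 42)
      fun randInt (lo, hi) = Random.randRange (lo, hi) gen
    in
      (* 10k/1: random permutation of [1 .. n], by Fisher-Yates
         shuffle on a temporary array. *)
      fun permutation n =
        let
          val a = Array.tabulate (n, fn i => i + 1)
          fun swap (i, j) =
            let val t = Array.sub (a, i)
            in Array.update (a, i, Array.sub (a, j));
               Array.update (a, j, t)
            end
          fun shuf 0 = ()
            | shuf i = (swap (i, randInt (0, i)); shuf (i - 1))
        in
          shuf (n - 1); Array.foldr (op ::) [] a
        end

      (* Chop xs into chunks of random size 1..maxRun and sort
         each chunk into increasing order. *)
      fun sortRuns maxRun xs =
        let
          fun take (0, xs) = ([], xs)
            | take (_, []) = ([], [])
            | take (n, x :: r) =
                let val (run, rest) = take (n - 1, r)
                in (x :: run, rest) end
          fun loop [] = []
            | loop xs =
                let val (run, rest) = take (randInt (1, maxRun), xs)
                in ListMergeSort.sort (op >) run @ loop rest end
        in
          loop xs
        end
    end

    val tenK_10lt = sortRuns 10   (permutation 10000)   (* 10k/10< *)
    val tenK_klt  = sortRuns 1000 (permutation 10000)   (* 10k/k<  *)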
The MLton runs are compiled with
mlton -static true -verbose 2 -detect-overflow false -safe false \
-inline 100000 weeks.cm
(But I don't think any of these fancy options really buy much.)
Below are the SML/NJ & MLton times. Each random-list generator produces 500
lists; each algo sorts each list three times. I did not include gc time
(which is arguably bogus, since some algos cons more than others, and
excluding gc hides that part of their cost). Times are in tenths of a
second, except for the side-effecting algos dlm (destructive lists) and hp
(vector heapsort), which are in tenths of a millisecond (moral: purity costs
you three orders of magnitude. Ouch). I ran the tests on my x86 Linux
notebook.
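The timing loop is essentially the following. This is a sketch in terms of
the Basis Timer structure, with a made-up timeSort; the real harness
differs:

    (* Sketch of the timing loop: 3 sorts per list, gc time
       subtracted out via Timer.checkGCTime. *)
    fun timeSort (sort : int list -> int list) (lists : int list list) =
      let
        val tm = Timer.startCPUTimer ()
        fun thrice l = (sort l; sort l; sort l; ())
        val () = List.app thrice lists
        val {usr, sys = _} = Timer.checkCPUTimer tm
      in
        Time.- (usr, Timer.checkGCTime tm)   (* non-gc user time *)
      end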
Result: MLton is way faster than SML/NJ; just compare entries in the two
matrices:
SML/NJ times, units of 100 msec (dlm & hp: units of 100 usec)

      10k/1  10k/10<  10k/10x  10k/k<  10k/kx  10k<  10k>
buh    1580     1616     1539    1312    1308   806  1267
bum    1342     1389     1296    1111    1026   595   533
nm     1510     1425     1479     834    1270    83   889
snm    1425     1423     1335     826     779    74    38
om     1372     1363     1310    1102    1025   688   426
onm    1281     1217     1241     763     935   175   467
onm6   1287     1225     1237     768     933   156   508
onm4    971      894      932     497     664    94   494
onm3    963      893      925     494     646    85   473
onm2   1438     1232     1202     775     721   179    55
onm5   1029      892      857     497     464    88    51
sam    1362     1240     1331     815    1089   186   480
dlm    3200     3100     1300    1700     900     0     0
hp     3000     3600     1500    1800     400   100     0
MLton times, units of 100 msec (dlm & hp: units of 100 usec)

      10k/1  10k/10<  10k/10x  10k/k<  10k/kx  10k<  10k>
buh     806      791      802     627     708   454   725
bum     728      718      718     585     577   417   409
nm      744      664      738     359     595    92   429
snm     712      670      646     354     337    96    57
onm     686      607      652     322     454    75   301
onm6    691      620      655     324     454    73   348
onm4    960      855      908     429     614    97   588
onm3    959      850      908     428     612   102   596
onm2    719      621      622     333     338    78    60
onm5    971      851      836     429     412    95    47
sam     759      640      744     334     598    80   481
dlm     400      400      100     100     100     0   300
hp      100        0        0     400     100   200   100
One really interesting relationship in the data is the interplay
between the various onm algos and the two compilers.
- Onm uses a recursive merge. If you merge two 1,000-elt lists, this merge
  can push up to 2,000 proc frames -- and this call/return pushing & popping
  is an inner-loop activity. It similarly uses a recursive algorithm to pick
  off the prefixes of the data that form contiguous increasing runs.
- Onm4, on the other hand, uses an iterative algorithm for both the merge &
  getrun operations, one that constructs its answer backwards and then
  reverses the result. Thus instead of allocating O(n) stack frames, it
  allocates the same number of temporary cons cells. Cons cells are probably
  smaller than stack frames, but they come from the heap and are not eagerly
  deallocated. On the other hand, the code can then execute in a tight loop
  -- there is no control dependency on the stack, in the sense of needing to
  pull a return pc off the stack. (Both merge styles are sketched below.)
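To make the contrast concrete, here are minimal versions of the two merge
styles. These are sketches, not the actual onm/onm4 code (which also does
run extraction and handles decreasing runs):

    (* onm-style recursive merge: one pending frame per element
       consumed, so merging two 1,000-elt lists can leave ~2,000
       frames outstanding. *)
    fun mergeRec ([], ys) = ys
      | mergeRec (xs, []) = xs
      | mergeRec (xs as x :: xs', ys as y :: ys') =
          if x <= y then x :: mergeRec (xs', ys)
          else y :: mergeRec (xs, ys')

    (* onm4-style iterative merge: accumulate the answer backwards
       (O(n) temporary conses, constant stack), then reverse it
       onto whichever input is left over. *)
    fun mergeIter (xs, ys) =
      let
        fun loop ([], ys, acc) = List.revAppend (acc, ys)
          | loop (xs, [], acc) = List.revAppend (acc, xs)
          | loop (xs as x :: xs', ys as y :: ys', acc) =
              if x <= y then loop (xs', ys, x :: acc)
              else loop (xs, ys', y :: acc)
      in
        loop (xs, ys, [])
      end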
Now, what's interesting is that which technique is speediest depends on the
compiler. SML/NJ likes the iterative cons-and-reverse algo, while MLton
likes the stack algo, and the differences are significant:
      10k/1  10k/10<  10k/10x  10k/k<  10k/kx  10k<  10k>
SML/NJ
onm    1281     1217     1241     763     935   175   467
onm4    971      894      932     497     664    94   494
MLton
onm     686      607      652     322     454    75   301
onm4    960      855      908     429     614    97   588
I guess the issue is that SML/NJ's implementation of the "stack" algo
doesn't actually use a stack -- it allocates the proc frames in the heap.
Total heap allocation is *greater* than if we iteratively allocated
temporary conses (as each proc frame is probably bigger than a 2-word
cons cell), *plus* you have to load target addresses in from memory when
you return up through the "stack."
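One way to see it: after CPS conversion, the recursive merge above behaves
roughly like this explicitly continuation-passing version, in which every
pending "frame" is a closure built in the heap (an illustration of the
idea, not SML/NJ's actual representation):

    fun mergeCPS ([], ys, k) = k ys
      | mergeCPS (xs, [], k) = k xs
      | mergeCPS (xs as x :: xs', ys as y :: ys', k) =
          if x <= y
          then mergeCPS (xs', ys, fn zs => k (x :: zs))
          else mergeCPS (xs, ys', fn zs => k (y :: zs))

    (* mergeCPS (xs, ys, fn zs => zs) yields the merged list.
       Each (fn zs => ...) closure captures k plus one element,
       so it is at least as big as the cons cell the iterative
       merge would have allocated instead -- hence the greater
       total heap traffic. *)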
I am wondering how good MLton's heap allocation is. SML/NJ can
alloc&initialise an n-word block in about n+2 insns. How about
MLton?
I think I sort of understand why the iterative algo does better in SML/NJ.
But I *don't* understand how MLton makes the recursive algo work so well.
As I said in an earlier msg, I carefully coded some of the onm variants to
do well in a setting where there are plenty of registers and the compiler
is good about using them. This is obviously not going to happen on an x86.
I have tried to get a compile on a Sparc, but it bombs out trying to
compile with a -mno-epilogue flag. The gcc on my Sparc box has no such flag
(but it does have an -mno-prologue-epilogue flag). Any hints?
I also wonder a few things about the Sparc/C implementation (the usual
compiling-through-C issues):
- Do ML proc frames get allocated on the C call stack, or are
  they allocated in a different, explicitly managed chunk of memory?
- How efficient is heap allocation?
- How is general tail recursion managed?
-Olin