Some numbers on SML/NJ & MLton runs and also a few questions
shivers@cc.gatech.edu
Mon, 6 Oct 2003 18:00:32 -0400
Wow, MLton is really fast.
Here is what I have done. I have 14 different sorting algorithms -- twelve
are purely functional list sorts, one sorts lists destructively (dlm), and
one sorts vectors (heapsort -- hp). All the onm<i> algos are small variants
of one another; I'll say more about that later.
I have a test harness that creates 10,000-element test lists:
- 10k/1 means a 10k-elt list, a random permutation of 1..10,000.
- 10k/10< means (1) start with a 10k/1 list, then (2) step through
  it in random steps of size 1..10, sorting each step-segment into
  increasing order. Some sort algos can exploit existing order in
  the data, so this gives them a chance to do their thing. (A sketch
  of these generators follows the list.)
- 10k/k< is like 10k/10<, but the sizes of the contiguous runs are
  chosen from the range 1..1,000, not 1..10 -- the average run length
  is 500 elts, much larger. The order-exploiting "natural" sorts
  should do even better.
- 10k/10x is like 10k/10<, except that half of the runs are ordered
  increasing, and half are ordered decreasing. Some algos can exploit
  decreasing as well as increasing runs of data, so this lets them
  show off.
- 10k/kx (in the tables below) is the same mixed-direction treatment
  with run lengths drawn from 1..1,000, i.e., 10k/k< with half the
  runs decreasing.
- 10k< is simply the completely ordered list [1,...,10,000].
- 10k> is simply the completely reversed list [10,000,...,1].
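Schematically, the generators look like this. This is a sketch, not the
actual harness code; the names (permutation, sortRuns, randInt) are made
up, and it leans on the SML/NJ Util library's Random and ListMergeSort:

    (* Sketch only.  Assumes the SML/NJ Util library. *)
    local
      val gen = Random.rand (13, 42)
      fun randInt (lo, hi) = Random.randRange (lo, hi) gen
    in
      (* 10k/1: random permutation of [1 .. n], by Fisher-Yates
         shuffle on a temporary array. *)
      fun permutation n =
        let
          val a = Array.tabulate (n, fn i => i + 1)
          fun swap (i, j) =
            let val t = Array.sub (a, i)
            in Array.update (a, i, Array.sub (a, j));
               Array.update (a, j, t)
            end
          fun shuf 0 = ()
            | shuf i = (swap (i, randInt (0, i)); shuf (i - 1))
        in
          shuf (n - 1); Array.foldr (op ::) [] a
        end

      (* Chop xs into chunks of random size 1..maxRun and sort
         each chunk into increasing order. *)
      fun sortRuns maxRun xs =
        let
          fun take (0, xs) = ([], xs)
            | take (_, []) = ([], [])
            | take (n, x :: r) =
                let val (run, rest) = take (n - 1, r)
                in (x :: run, rest) end
          fun loop [] = []
            | loop xs =
                let val (run, rest) = take (randInt (1, maxRun), xs)
                in ListMergeSort.sort (op >) run @ loop rest end
        in
          loop xs
        end
    end

    val tenK_10lt = sortRuns 10   (permutation 10000)   (* 10k/10< *)
    val tenK_klt  = sortRuns 1000 (permutation 10000)   (* 10k/k<  *)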
The MLton runs are compiled with
mlton -static true -verbose 2 -detect-overflow false -safe false \
-inline 100000 weeks.cm
(But I don't think any of these fancy options really buy much.)
Below are the SML/NJ & MLton times. Each random-list generator produces 500
lists; each algo sorts each list three times. I did not include gc time
(which is arguably bogus, since some algos cons more than others, and
excluding gc hides that part of their cost). Times are in tenths of a
second, except for the side-effecting algos dlm (destructive lists) and hp
(vector heapsort), which are in tenths of a millisecond (moral: purity costs
you three orders of magnitude. Ouch). I ran the tests on my x86 Linux
notebook.
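The timing loop is essentially the following. This is a sketch in terms of
the Basis Timer structure, with a made-up timeSort; the real harness
differs:

    (* Sketch of the timing loop: 3 sorts per list, gc time
       subtracted out via Timer.checkGCTime. *)
    fun timeSort (sort : int list -> int list) (lists : int list list) =
      let
        val tm = Timer.startCPUTimer ()
        fun thrice l = (sort l; sort l; sort l; ())
        val () = List.app thrice lists
        val {usr, sys = _} = Timer.checkCPUTimer tm
      in
        Time.- (usr, Timer.checkGCTime tm)   (* non-gc user time *)
      end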
Result: MLton is way faster than SML/NJ; just compare entries in the two
matrices:
SML/NJ times, units of 100 msec (dlm & hp: units of 100 usec)

      10k/1  10k/10<  10k/10x  10k/k<  10k/kx  10k<  10k>
buh    1580     1616     1539    1312    1308   806  1267
bum    1342     1389     1296    1111    1026   595   533
nm     1510     1425     1479     834    1270    83   889
snm    1425     1423     1335     826     779    74    38
om     1372     1363     1310    1102    1025   688   426
onm    1281     1217     1241     763     935   175   467
onm6   1287     1225     1237     768     933   156   508
onm4    971      894      932     497     664    94   494
onm3    963      893      925     494     646    85   473
onm2   1438     1232     1202     775     721   179    55
onm5   1029      892      857     497     464    88    51
sam    1362     1240     1331     815    1089   186   480
dlm    3200     3100     1300    1700     900     0     0
hp     3000     3600     1500    1800     400   100     0
MLton times, units of 100 msec (dlm & hp: units of 100 usec)

      10k/1  10k/10<  10k/10x  10k/k<  10k/kx  10k<  10k>
buh     806      791      802     627     708   454   725
bum     728      718      718     585     577   417   409
nm      744      664      738     359     595    92   429
snm     712      670      646     354     337    96    57
onm     686      607      652     322     454    75   301
onm6    691      620      655     324     454    73   348
onm4    960      855      908     429     614    97   588
onm3    959      850      908     428     612   102   596
onm2    719      621      622     333     338    78    60
onm5    971      851      836     429     412    95    47
sam     759      640      744     334     598    80   481
dlm     400      400      100     100     100     0   300
hp      100        0        0     400     100   200   100
One really interesting relationship in the data is the interplay
between the various onm algos and the two compilers.
- Onm uses a recursive merge. If you merge two 1,000-elt lists, this merge
  can push up to 2,000 proc frames -- and this call/return pushing & popping
  is an inner-loop activity. It similarly uses a recursive algorithm to pick
  off the prefixes of the data that form contiguous increasing runs.
- Onm4, on the other hand, uses an iterative algorithm for both the merge &
  getrun operations, one that constructs its answer backwards and then
  reverses the result. Thus instead of allocating O(n) stack frames, it
  allocates the same number of temporary cons cells. Cons cells are probably
  smaller than stack frames, but they come from the heap and are not eagerly
  deallocated. On the other hand, the code can then execute in a tight loop
  -- there is no control dependency on the stack, in the sense of needing to
  pull a return pc off the stack. (Both merge styles are sketched below.)
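To make the contrast concrete, here are minimal versions of the two merge
styles. These are sketches, not the actual onm/onm4 code (which also does
run extraction and handles decreasing runs):

    (* onm-style recursive merge: one pending frame per element
       consumed, so merging two 1,000-elt lists can leave ~2,000
       frames outstanding. *)
    fun mergeRec ([], ys) = ys
      | mergeRec (xs, []) = xs
      | mergeRec (xs as x :: xs', ys as y :: ys') =
          if x <= y then x :: mergeRec (xs', ys)
          else y :: mergeRec (xs, ys')

    (* onm4-style iterative merge: accumulate the answer backwards
       (O(n) temporary conses, constant stack), then reverse it
       onto whichever input is left over. *)
    fun mergeIter (xs, ys) =
      let
        fun loop ([], ys, acc) = List.revAppend (acc, ys)
          | loop (xs, [], acc) = List.revAppend (acc, xs)
          | loop (xs as x :: xs', ys as y :: ys', acc) =
              if x <= y then loop (xs', ys, x :: acc)
              else loop (xs, ys', y :: acc)
      in
        loop (xs, ys, [])
      end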
Now, what's interesting is that which technique is speediest depends on the
compiler. SML/NJ likes the iterative cons-and-reverse algo, while MLton
likes the stack algo, and the differences are significant:
      10k/1  10k/10<  10k/10x  10k/k<  10k/kx  10k<  10k>
SML/NJ
onm    1281     1217     1241     763     935   175   467
onm4    971      894      932     497     664    94   494
MLton
onm     686      607      652     322     454    75   301
onm4    960      855      908     429     614    97   588
I guess the issue is that SML/NJ's implementation of the "stack" algo
doesn't actually use a stack -- it allocates the proc frames in the heap.
Total heap allocation is *greater* than if we iteratively allocated
temporary conses (as each proc frame is probably bigger than a 2-word
cons cell), *plus* you have to load target addresses in from memory when
you return up through the "stack."
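One way to see it: after CPS conversion, the recursive merge above behaves
roughly like this explicitly continuation-passing version, in which every
pending "frame" is a closure built in the heap (an illustration of the
idea, not SML/NJ's actual representation):

    fun mergeCPS ([], ys, k) = k ys
      | mergeCPS (xs, [], k) = k xs
      | mergeCPS (xs as x :: xs', ys as y :: ys', k) =
          if x <= y
          then mergeCPS (xs', ys, fn zs => k (x :: zs))
          else mergeCPS (xs, ys', fn zs => k (y :: zs))

    (* mergeCPS (xs, ys, fn zs => zs) yields the merged list.
       Each (fn zs => ...) closure captures k plus one element,
       so it is at least as big as the cons cell the iterative
       merge would have allocated instead -- hence the greater
       total heap traffic. *)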
I am wondering how good MLton's heap allocation is. SML/NJ can
alloc&initialise an n-word block in about n+2 insns. How about
MLton?
I think I sort of understand why the iterative algo does better in SML/NJ.
But I *don't* understand how MLton makes the recursive algo work so well.
As I said in an earlier msg, I carefully coded some of the onm variants to
do well in a setting where there are plenty of registers and the compiler
is good about using them. This is obviously not going to happen on an x86.
I have tried to get a compile on a Sparc, but it bombs out trying to
compile with a -mno-epilogue flag. The gcc on my Sparc box has no such flag
(but it does have an -mno-prologue-epilogue flag). Any hints?
I also wonder a few things about the Sparc/C implementation (the usual
compiling-through-C issues):
- Do ML proc frames get allocated on the C call stack, or are
  they allocated in a different, explicitly managed chunk of memory?
- How efficient is heap allocation?
- How is general tail recursion managed?
-Olin