[MLton-devel] Fwd: Re: pretty damn good
Stephen Weeks
MLton@mlton.org
Mon, 4 Nov 2002 15:20:16 -0800
> > I downloaded mlton to my 350MHz PII linux box, finally figured out how
> > to run the nucleic benchmark, and got the following timings:
...
> > [lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch
> > 16.939u 2.218s 0:19.30 99.1% 0+0k 0+0io 108pf+0w
> >
> > The time for Gambit-C on the same benchmark is
...
> > [lucier@dsl-207-066 gambit]$ time ./nucleic -:m10000
> > (time (run-bench name count run ok?))
> > 2478 ms real time
> > 2464 ms cpu time (2400 user, 64 system)
> > 38 collections accounting for 101 ms real time (98 user, 0 system)
> > 392602904 bytes allocated
> > 2568 minor faults
> > 22 major faults
> > 2.451u 0.074s 0:03.52 71.5% 0+0k 0+0io 535pf+0w
...
> > If I read your ML code correctly, it runs the loop 200 times;
Yes, although we've recently updated our benchmarks so that they run
longer, since we do the runs on faster machines. In our CVS,
nucleic.sml now loops 1500 times.
> > the gambit
> > code runs it 10 times, so mlton's version is taking (16.939+2.218)/200=
> > .0957850000 seconds, while gambit's version is taking (2.451+0.074)/10=
> > .2525000000 seconds.
Neat!
Something else you might find interesting is the gc-summary runtime
switch, which prints the following data for nucleic:
% ./nucleic @MLton gc-summary --
GC type        time ms   number            bytes        bytes/sec
-------------  -------  -------  ---------------  ---------------
copying            700    4,400      140,266,188      200,380,274
mark-compact         0        0                0                0
minor                0        0                0                0
total GC time: 910 ms (17.0%)
max pause: 10 ms
total allocated: 981,597,368 bytes
max live: 57,380 bytes
max semispace: 466,944 bytes
max stack size: 1,888 bytes
marked cards: 0
minor scanned: 0 bytes
minor skipped: 0 bytes
So, the MLton executable allocates 981597368 / 200 = 4,907,987 bytes per
loop, while the Gambit executable allocates 392602904 / 10 =
39,260,290 bytes per loop.
Someone should really check that they are computing the same thing
before we conclude too much. :-)
> I've been playing with the C code generated by MLton and various
> compiler optimizations. This is about the best I can get at the
> moment:
>
> [lucier@dsl-207-066 mlton-20020923]$ gcc -I/usr/lib/mlton/self/include -O1 \
>     -fomit-frame-pointer -fschedule-insns2 -fno-strict-aliasing \
>     -fno-math-errno nucleic.batch.c -o nucleic.batch.2 \
>     -L/usr/lib/mlton/self -lmlton -lm /usr/lib/libgmp.a -O2
> [lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch.2
> 17.730u 2.326s 0:20.45 98.0% 0+0k 0+0io 108pf+0w
>
>
> So it seems that you're suffering a 6% penalty on this benchmark for
> going through C. That's not so bad if the C back end could be made
> more portable.
To make sure I understand: the 6% comes from comparing the runtime of
the nucleic.batch executable generated with -native true, which comes to
19.162 seconds (averaging the two times you sent), with the runtime of
the nucleic.batch executable generated with -native false and then hand
tweaked and compiled as above. If so, I get the following Lisp for the
ratio:
(/ (+ 17.73 2.326) (/ (+ (+ 16.939 2.218) (+ 16.837 2.33)) 2.0))
This comes to 1.046654837699614, so I don't quite understand where the
6% comes from; I see closer to 4.7%.
Although in the case of nucleic the C and native backends are fairly
close, in many other cases they are not. The last time I posted about
this was over a year ago on comp.lang.ml
http://groups.google.com/groups?q=insubject:sml+insubject:to+insubject:c&hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&selm=9lb1oi%24cao%241%40cantaloupe.srv.cs.cmu.edu&rnum=3
I suspect that the runtime ratios (C / native) have gotten larger
since then, since we have continued to improve the native codegen and
have left the C codegen untouched.
> You may (or may not) get a bit more performance by using gcc's
> computed goto's for returns rather than going through the dispatch
> table on the chunk switch.
I second what Henry said. This was way too buggy when we tried it.
> You also don't always go through a trampoline, only for intermodule
> calls; we must have been talking at cross purposes about
> trampolines.
Right. We only trampoline when we have to get from one C function to
another, which experiments long ago showed to be pretty rare. Also,
the backend goes to some effort to put blocks with control-flow edges
to each other in the same C function. I vaguely remember getting the
idea for this from a Feeley paper (maybe about Gambit?).
> I'd like to see how this thing does on other benchmarks; how *do* you
> run the benchmarks with various options?
In the compiler sources, there is a subdirectory called benchmark. The
Makefile there builds an executable called "benchmark", which
benchmarks MLton using combinations of flags specified on the command
line. See the "test" target and the BFLAGS variable for examples.
As a simple example, you can compare the C and native backends with
benchmark -mlton "mlton -native {false,true}"
If you do this, please send the results to MLton. I'd be interested
to see the latest ratios.
_______________________________________________
MLton-devel mailing list
MLton-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlton-devel