[MLton-devel] Fwd: Re: pretty damn good
Stephen Weeks
MLton@mlton.org
Mon, 4 Nov 2002 15:20:16 -0800
> > I downloaded mlton to my 350MHz PII linux box, finally figured out how
> > to run the nucleic benchmark, and got the following timings:
...
> > [lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch
> > 16.939u 2.218s 0:19.30 99.1% 0+0k 0+0io 108pf+0w
> >
> > The time for Gambit-C on the same benchmark is
...
> > [lucier@dsl-207-066 gambit]$ time ./nucleic -:m10000
> > (time (run-bench name count run ok?))
> > 2478 ms real time
> > 2464 ms cpu time (2400 user, 64 system)
> > 38 collections accounting for 101 ms real time (98 user, 0 system)
> > 392602904 bytes allocated
> > 2568 minor faults
> > 22 major faults
> > 2.451u 0.074s 0:03.52 71.5% 0+0k 0+0io 535pf+0w
...
> > If I read your ML code correctly, it runs the loop 200 times;
Yes, although we've recently updated our benchmarks so that they run
longer, since we do the runs on faster machines. In our CVS,
nucleic.sml now loops 1500 times.
> > the gambit
> > code runs it 10 times, so mlton's version is taking (16.939+2.218)/200=
> > .0957850000 seconds, while gambit's version is taking (2.451+0.074)/10=
> > .2525000000 seconds.
Neat!
Something else you might find interesting is the gc-summary runtime
switch, which prints the following data for nucleic:
% ./nucleic @MLton gc-summary --
GC type        time ms   number            bytes        bytes/sec
-------------  -------  -------  ---------------  ---------------
copying            700    4,400      140,266,188      200,380,274
mark-compact         0        0                0                0
minor                0        0                0                0
total GC time: 910 ms (17.0%)
max pause: 10 ms
total allocated: 981,597,368 bytes
max live: 57,380 bytes
max semispace: 466,944 bytes
max stack size: 1,888 bytes
marked cards: 0
minor scanned: 0 bytes
minor skipped: 0 bytes
So, the MLton executable allocates 981597368 / 200 = 4,907,987 bytes per
loop, while the Gambit executable allocates 392602904 / 10 =
39,260,290 bytes per loop.
Someone should really check that they are computing the same thing
before we conclude too much. :-)
> I've been playing with the C code generated by MLton and various
> compiler optimizations. This is about the best I can get at the
> moment:
>
> [lucier@dsl-207-066 mlton-20020923]$ gcc -I/usr/lib/mlton/self/include -O1 \
>     -fomit-frame-pointer -fschedule-insns2 -fno-strict-aliasing \
>     -fno-math-errno nucleic.batch.c -o nucleic.batch.2 \
>     -L/usr/lib/mlton/self -lmlton -lm /usr/lib/libgmp.a -O2
> [lucier@dsl-207-066 mlton-20020923]$ time ./nucleic.batch.2
> 17.730u 2.326s 0:20.45 98.0% 0+0k 0+0io 108pf+0w
>
>
> So it seems that you're suffering a 6% penalty on this benchmark for
> going through C. That's not so bad if the C back end could be made
> more portable.
To make sure I understand: the 6% comes from comparing the runtime of
the nucleic.batch executable generated with -native true, which comes to
19.162 seconds (averaging the two times you sent), with the runtime of
the nucleic.batch executable generated with -native false and then hand
tweaked and compiled as above. If so, I get the following Lisp for the
ratio:
(/ (+ 17.73 2.326) (/ (+ (+ 16.939 2.218) (+ 16.837 2.33)) 2.0))
This comes to 1.046654837699614, so I don't quite understand where the
6% comes from; I see closer to 4.7%.
Although in the case of nucleic the C and native backends are fairly
close, in many other cases they are not. The last time I posted about
this was over a year ago on comp.lang.ml
http://groups.google.com/groups?q=insubject:sml+insubject:to+insubject:c&hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&selm=9lb1oi%24cao%241%40cantaloupe.srv.cs.cmu.edu&rnum=3
I suspect that the runtime ratios (C / native) have gotten larger
since then, since we have continued to improve the native codegen and
have left the C codegen untouched.
> You may (or may not) get a bit more performance by using gcc's
> computed goto's for returns rather than going through the dispatch
> table on the chunk switch.
I second what Henry said. This was way too buggy when we tried it.
> You also don't always go through a trampoline, only for intermodule
> calls; we must have been talking at cross purposes about
> trampolines.
Right. We only trampoline when we have to get from one C function to
another, which experiments long ago showed to be pretty rare. Also,
the backend goes to some effort to put blocks with control-flow edges
to each other in the same C function. I vaguely remember getting the
idea for this from a Feeley paper (maybe about Gambit?).
> I'd like to see how this thing does on other benchmarks; how *do* you
> run the benchmarks with various options?
In the compiler sources, there is a subdirectory called benchmark. The
Makefile there builds an executable called "benchmark", which
benchmarks MLton using combinations of flags specified on the command
line. See the "test" target and the BFLAGS variable for examples.
As a simple example, you can compare the C and native backends with
benchmark -mlton "mlton -native {false,true}"
If you do this, please send the results to MLton. I'd be interested
to see the latest ratios.
_______________________________________________
MLton-devel mailing list
MLton-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlton-devel