[MLton-user] more optimization questions

Wed, 21 Dec 2005 11:39:47 -0800

> > by a C call to a function that just calls fabs.  The win in going from
> > (2) to (3) is in eliminating the C wrapper around fabs.  If anyone
> > wants to repeat my experiment, I did (3) by adding a line to
> > lib/mlton/include/c-chunk.h:
> >
> >   #define Real64_abs fabs
...
> Just FYI for the list.  The above change to the c-chunk.h file requires:
> 
> val abs = _import "fabs": real -> real;
> 
> to be added to the .sml file.

I don't think that's necessary for the approach I described, which has
two advantages: it does not require modifying the input SML program,
and it performs better.  To be clearer, here's what I did.  I started
with a vanilla install of MLton 20051202.  I then did two things.

  * Eliminated the definition of abs from line 91 of
    lib/mlton/sml/basis/real/real.fun
  * Added a line to lib/mlton/include/c-chunk.h:
    #define Real64_abs fabs

The first step causes MLton to use its primitive notion of abs, so the
C codegen emits a call to Real64_abs, which is a C wrapper around fabs
defined in libmlton.a.  The second step replaces the call to
Real64_abs with a direct call to fabs, for which gcc emits the fabs
instruction (on x86 anyway).
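
To make the effect of the second step concrete, here is a minimal,
self-contained C sketch.  The Real64_abs function below is only a
stand-in for the wrapper in libmlton.a (its real body is not shown in
this message), and wrapper_calls, call_before_define,
call_after_define, and demo_substitution are names made up for
illustration.

```c
#include <assert.h>
#include <math.h>

/* Counts calls that go through the stand-in wrapper (illustration only). */
int wrapper_calls = 0;

/* Hypothetical stand-in for the out-of-line wrapper in libmlton.a. */
double Real64_abs(double x) {
    wrapper_calls++;
    return fabs(x);
}

/* Compiled before the #define: this is a real call to the wrapper. */
double call_before_define(double x) {
    return Real64_abs(x);
}

/* The line added to c-chunk.h.  From here on, the preprocessor rewrites
   every use of the name Real64_abs into fabs. */
#define Real64_abs fabs

/* Same source text as above, but now it is a direct call to fabs. */
double call_after_define(double x) {
    return Real64_abs(x);
}

/* Shows that the second call site no longer reaches the wrapper. */
void demo_substitution(void) {
    assert(call_before_define(-1.5) == 1.5);
    assert(wrapper_calls == 1);
    assert(call_after_define(-2.5) == 2.5);
    assert(wrapper_calls == 1);  /* unchanged: the wrapper was bypassed */
}
```

Once the call names fabs directly, the C compiler is free to expand it
inline instead of calling out of line into the library.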

If you add the line

   val abs = _import "fabs": real -> real;

this tells MLton to treat abs as an FFI call, not a primitive.  MLton
therefore knows less about it and will not generate code that is as
good.  In particular, you will see the following in the generated C.

	S(Word32, 72) = 142;
	Push (76);
	FlushFrontier();
	FlushStackTop();
	CReturnR64 = fabs (R64_19);
	CacheFrontier();
	CacheStackTop();
L_1363:
	Push (-76);
	R64_0 = CReturnR64;

This is not as good as what you will get with the approach I
described, namely the following two lines of generated C (the C
preprocessor then replaces Real64_abs with fabs, per the #define).

	CReturnR64 = Real64_abs (R64_21);
	R64_1 = CReturnR64;
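
As a rough illustration of where the extra cost of the FFI version
comes from, here is a simplified C model.  It assumes that the Flush*
and Cache* macros write locally cached runtime state back to memory
before the call and reload it afterwards, which is what their names
suggest.  struct RuntimeState, its fields, and all function names are
hypothetical and are not MLton's actual definitions.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical model of the runtime state; not MLton's real layout. */
struct RuntimeState {
    char *frontier;   /* heap allocation pointer */
    char *stack_top;  /* top of the ML stack */
    int write_backs;  /* counts stores to this struct, for illustration */
};

/* Modeled on the FFI call: the callee is opaque, so the cached copies
   are written back before the call and reloaded after it. */
double abs_as_ffi(struct RuntimeState *s, double x) {
    char *frontier = s->frontier;    /* locally cached copies */
    char *stack_top = s->stack_top;
    double result;

    s->frontier = frontier;          /* plays the role of FlushFrontier */
    s->write_backs++;
    s->stack_top = stack_top;        /* plays the role of FlushStackTop */
    s->write_backs++;

    result = fabs(x);

    frontier = s->frontier;          /* plays the role of CacheFrontier */
    stack_top = s->stack_top;        /* plays the role of CacheStackTop */
    (void)frontier;
    (void)stack_top;
    return result;
}

/* Modeled on the primitive: the compiler knows the call cannot touch
   the runtime state, so no write-back or reload is needed. */
double abs_as_primitive(struct RuntimeState *s, double x) {
    (void)s;
    return fabs(x);
}

/* Both versions compute the same value; only the bookkeeping differs. */
void demo_call_overhead(void) {
    struct RuntimeState s = {0, 0, 0};
    assert(abs_as_ffi(&s, -3.0) == 3.0);
    assert(s.write_backs == 2);
    assert(abs_as_primitive(&s, -3.0) == 3.0);
    assert(s.write_backs == 2);      /* the primitive added none */
}
```

In a hot loop the bookkeeping around the call can easily cost more
than the fabs itself, which is the point of keeping abs a primitive.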

This difference may explain why I saw a better speedup than you did.

> Also, it seems like in 2 years there is a relatively good (?) chance  
> that the fabs behavior has changed.  Maybe the "proper" abs code is  
> no longer required in the compiler.

We run on too many platforms with too many versions to check this, and
we like to support older platforms too.  Also, I suspect users would
spend more time tracking down correctness bugs if we made the change
than they would tracking down performance bugs if we didn't.  So, I
don't think it's a good idea to eliminate the wrapper.  I do think we
will add something like

  structure FastReal: REAL

that will allow users to get at primitive versions of the Real
functions without the correctness wrappers.  So they can get C-style
speed and C-style correctness :-) if they want.  But they can do it
selectively and lazily, only after profiling shows it would help a
particular hot loop in their programs.