# [MLton] Number theory and MLton C FFI

Fri, 22 Oct 2004 00:29:42 +0200

```On Thu, Oct 21, 2004 at 10:16:34AM -0700, Brent Fulgham wrote:
> > No offence to sml, but these operations simply must
> > be performed in assembler-style C in order to be
> > efficient.
>
> I'm interested in the kinds of coding constructs
> that can't be efficiently performed in SML both
> to (1) help identify problems with the MLton compiler,
> and (2) to devise fiendish tests for the Shootout
> (http://shootout.alioth.debian.org).

Well, I'm not using C as a programming language...
Rather it is the target of a compiler written in ocaml which outputs C. :)
The C functions are straight-line affairs that look like:

a = b + c;
f = (b * c) + q;
m = a + q;
r = m*m*f;
...

The output code is also scheduled using domain-specific knowledge that
allows it to minimize cache misses. Furthermore, the compiler does
problem-specific mathematical simplifications. (this part is based off
of work by the fftw3 guys)

The +, * etc are also done like:

// heavily abridged:

typedef uint32_t Z;
typedef uint64_t ZZ;

#define M_BITS 31
#define R_BITS (sizeof(Z)*8)
static const Z M_MASK = (1U << M_BITS) - 1;

static inline Z ADD(Z x, Z y) {
Z z = x + y;
return (z & M_MASK) + (z >> M_BITS);
}

static inline Z MUL(Z x, Z y) {
#if 0
ZZ z = (ZZ)x * (ZZ)y;
#else
/* Use the edge of the register to seperate the high and low */
ZZ z = (ZZ)x * (ZZ)(y << (R_BITS-M_BITS));
return ADD(z >> R_BITS, (Z)z >> (R_BITS-M_BITS));
/*   noop ---^ (top)      ^---- bottom half */
#endif
}

Also, I have the same things using SSE2 intrinsics.

I theory, I suppose I could target SML instead of C.
However, I don't think you can get those above methods to compile to nearly
as nice code as

movl    %edi, %eax
shrl    \$31, %edi
andl    \$2147483647, %eax

MUL:
lea       (%ecx,%ecx), %edx                             #17.34
mull      %edx                                          #17.34
shrl      \$1, %eax                                      #18.41
movl      %edx, %eax                                    #18.16
andl      \$2147483647, %eax                             #18.16
shrl      \$31, %edx                                     #18.16

For my purposes, C is just semi-portable assembler.

> In general, I find that SML performs quite well in
> the tests, but perhaps we are not testing certain
> kinds of real-world problems properly.
>
> Can you provide any (small) examples of what you have
> observed?

Well, I tend to make local changes and then measure the performance in
aggregate. So, extracting small snippets of code doesn't really reflect
why some things are faster than other things. Cache matters a lot, as
does pipelining.

However, one _big_ problem I have is the inability of SML to capture a
product in a larger integer. So Word32*Word32 _SHOULD_ give Word64...
but it doesn't. This realllly sucks.

Real hardware does this in one instruction.
Promoting them to Word64s and then multiplying does not do that.

> This is basically how the MLton runtime library
> implements its lowest-level primitives.  But will
> the C FFI overhead reduce the benefit of this
> approach (assuming you can work around the tuple
> issues you mention)?

The methods we are talking about here take a LOT longer than a method call.
They are methods like:
perform any FFT over this polynomial (like 32MB)
Multiply a[i] by c^i for all a[i] in the polynomial
Multiply a[i]*b[i] and return the resulting vector
Compute the polynomial with roots at these 3M points.
... you get the idea. =)

PS. Your fibonacci.sml file is unfair compared to ocaml:
(* -*- mode: sml -*-
* \$Id: fibo.mlton.html,v 1.5 2004/07/03 07:11:33 bfulgham Exp \$
* http://shootout.alioth.debian.org/
* Revised by Wesley W. Terpstra
*)

fun fib n =
let fun helper x f0 f1 =
if x = 0 then f1 else
helper (x - 1) f1 (f0 + f1)
in helper n 0 1 end ;

val n = valOf (Int.fromString (hd (CommandLine.arguments () @ ["1"]))) ;
print (Int.toString (fib n) ^ "\n") ;;

Actually, a lot of your tests compare ocaml and mlton unfairly...

--
Wesley W. Terpstra

```