[MLton-user] minor 32bit vs. 64bit differences in floating-point calculations with large numbers

Thu Nov 11 06:47:11 PST 2010

On Fri, Oct 15, 2010 at 12:41 PM, Wesley W. Terpstra <wesley at terpstra.ca> wrote:
> On Fri, Oct 15, 2010 at 3:12 PM, David Hansel <hansel at reactive-systems.com>
> wrote:
>>
>> Thanks for that information.  Just one more question
>> (more out of curiosity than anything else):  is there
>> a technical reason that 64-bit MLton does NOT use the
>> FPU?
>
> Well, I think Matthew considered the SSE2 instructions superior to the x87
> instructions. SSE2 is required on all 64-bit machines.

There are multiple reasons to use the SSE2 instructions on amd64:
 * they use indexed XMM registers, rather than the x87 register stack
 * the amd64 calling conventions require "float" and "double" to be
passed and returned via XMM registers
 * SSE instructions support proper IEEE 754 semantics

The downside is that you lose some floating-point operations; in
particular, all of the trigonometric operations call out to libm,
rather than being implemented inline with an assembly instruction.
Nothing prevents one from mixing SSE2 and x87 instructions, but that
doesn't seem to be done in practice.  Indeed, the last time I traced
the instruction sequence for a sin operation in gdb, libm used an SSE2
instruction loop (think Taylor expansion), rather than the x87
instruction.
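
To give a feel for what "think Taylor expansion" means, here is a
minimal sketch in C.  It is not libm's actual algorithm (a real libm
does range reduction and uses carefully chosen polynomial
coefficients); the name taylor_sin and the restriction to [-1, 1] are
just assumptions for illustration.  The point is only that the loop
needs nothing beyond add, multiply and divide on doubles, all of which
have scalar SSE2 instructions.

```c
#include <assert.h>

/* Hypothetical helper, NOT libm's implementation: approximate sin(x)
   with a truncated Taylor series.  No range reduction is done, so the
   input is restricted to [-1, 1]. */
double taylor_sin(double x)
{
    assert(x >= -1.0 && x <= 1.0);

    double term = x;  /* current term: (-1)^k * x^(2k+1) / (2k+1)! */
    double sum = x;   /* running sum, starting with the k = 0 term */

    for (int k = 1; k <= 10; k++) {
        /* next term = previous term * -x^2 / ((2k) * (2k+1)) */
        term *= -(x * x) / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}
```

For inputs in [-1, 1] the eleven terms used here leave a truncation
error far below double precision, so the result should agree with
libm's sin to within a few units in the last place.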

> You'd have to ask Matthew for the details, but my understanding is that
> registers are easier to work with than a register stack. You can use
> traditional register allocation algorithms. Since the FPU has 8 slots in the
> stack and SSE2 on 64-bit has 16, you also get double the "registers" plus
> "random access". Plus, you can perform your calculations in 32-bit or 64-bit
> math in every step; not needing to worry about the extra precision.

Indeed, all good points.
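
To make the extra-precision point concrete, here is a small C sketch
(not MLton code) of how the same expression can give different answers
with large numbers.  It assumes that "long double" is the x87 80-bit
extended format, which holds for GCC on x86 Linux but not on every
platform; the function names are made up for illustration.

```c
#include <assert.h>
#include <float.h>

/* Every intermediate is rounded to a 64-bit double, as scalar SSE2
   arithmetic does.  The volatile store forces that rounding even if
   the compiler happens to target the x87. */
double scale_roundtrip_double(double x)
{
    assert(DBL_MAX_EXP == 1024);         /* assumes IEEE 754 binary64 */
    volatile double product = x * 10.0;  /* +inf for x = 1e308 */
    return product / 10.0;
}

/* Intermediates are kept in long double.  ASSUMPTION: long double has
   a wider exponent range than double (e.g. the x87 80-bit format), so
   the product does not overflow before the division. */
double scale_roundtrip_extended(double x)
{
    long double product = (long double)x * 10.0L;
    return (double)(product / 10.0L);
}

/* 1 if long double really has a wider exponent range than double. */
int has_extended_range(void)
{
    return LDBL_MAX_EXP > DBL_MAX_EXP;
}
```

With x = 1e308, the first function overflows to infinity, while the
second returns 1e308 on platforms where has_extended_range() is true.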

> Personally, I'd love to see 128-bit floating point using SSE2 registers;
> then the x87 would be completely obsolete.

The amd64 calling convention passes 128-bit floating-point values
(__float128) in XMM registers, but none of the SSE2 instructions are
full 128-bit floating-point operations.

>> Or is this just a case of "nobody has implemented
>> that yet"?
>
> Well, the amd64 and x86 codegens are very similar. You could probably mostly
> cut-and-paste the x86 FPU instructions out of the x86 codegen with very
> little trouble.

Yeah, that would probably be relatively easy.  Especially if you
dropped the XMM/SSE2 support.

> A perhaps better question might be: how hard would it be to port the SSE2
> math to i386. ;) Most modern 32-bit processors also have SSE2. Then you
> could have -ieee-fp sse2 on i386 to get the same effect as -ieee-fp true,
> but even faster than -ieee-fp false (but requiring SSE2 support, of course).

Again, about as easy as going in the other direction.  Especially if
you dropped the x87 support.  Supporting them in tandem wouldn't be as
easy.  Looks like any processor from 2004 or later almost certainly
supports SSE2.