On Sat, Oct 10, 2009 at 10:27 PM, Wesley W. Terpstra <span dir="ltr">&lt;<a href="mailto:wesley@terpstra.ca">wesley@terpstra.ca</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

 <div class="gmail_quote">I&#39;ve tried compiling with -align 8 and then it works... I&#39;m not sure this is a solution, though; it may have just masked the problem. <br></div></blockquote><div><br>Found the smoking gun! Debian builds gmp with -O3 whereas I used -O2 for MinGW32. If you look at the assembler output of mpz/mul_exp.c with the two options you will notice a difference... the introduction of a &#39;movdqa&#39; instruction, which is an SSE2 instruction that expects 16-byte alignment.<br>

<br>From what I&#39;ve read, an array of 64-bit words should be 64-bit aligned. MLton IntInfs are such arrays and must thus be 8-byte aligned. They aren&#39;t.<br><br>Here&#39;s the problematic vectorized assembler from gcc with -O3 (I&#39;ve marked the problem code):<br>

<br>.LVL16:<br>        andl    $15, %eax<br>        shrq    $3, %rax<br>^^^^^^^^^^^ This ignores the possibility that the array is only 4-byte aligned. It only looks at its 8-byte alignment before it moves on to doing 16-byte aligned moves.<br>

        cmpq    %r12, %rax<br>        cmova   %r12, %rax<br>        testq   %rax, %rax<br>        je      .L10<br>.LBB2:<br>        cmpq    %rax, %r12<br>        movq    $0, (%r14)<br>        leaq    8(%r14), %rdi<br>        leaq    -1(%r12), %rsi<br>

        je      .L8<br>.L10:<br>        movq    %r12, %rbx<br>        subq    %rax, %rbx<br>        movq    %rbx, %rcx<br>        shrq    %rcx<br>        movq    %rcx, %r9<br>        addq    %r9, %r9<br>        je      .L16<br>

        pxor    %xmm0, %xmm0<br>        leaq    (%r14,%rax,8), %r8<br>        xorl    %edx, %edx<br>        .p2align 4,,10<br>        .p2align 3<br>.L12:<br>        .loc 1 64 0<br>        movq    %rdx, %rax<br>        addq    $1, %rdx<br>

        salq    $4, %rax<br>        cmpq    %rcx, %rdx<br>        movdqa  %xmm0, (%r8,%rax)<br>^^^^^^^^^^^^^^^^^^^^^^^^^ At this point the memory MUST be 16-byte aligned, but it isn&#39;t if the input is only 4-byte aligned: 4 + 8 -&gt; 12 != 0 mod 16. This causes our segfault.<br>

        jb      .L12<br>        subq    %r9, %rsi<br>        cmpq    %r9, %rbx<br>        leaq    (%rdi,%r9,8), %rdi<br>        je      .L8<br><br>What&#39;s the plan going forward? align(AMD64) == 8?<br><br></div></div>