[MLton] Shortening dependency chains in the x86 codegen

Sun Jan 7 08:57:30 PST 2007

Take a look at the following code for 32-bit endianness swapping generated
by MLton on x86:

                                 ;; cc W->R comment
        movl %edx,%esi           ;; 1       Copy (edx->esi)
        andl $0xFFFF,%esi        ;; 2  esi  Modify (esi)
        movl %esi,%edi           ;; 3  esi  Copy (esi->edi)
        xorl %edx,%edi           ;; 4  edi  Modify (edi)
        shll $0x10,%esi          ;; 4
        shrl $0x10,%edi          ;; 5  edi
        xorl %edi,%esi           ;; 6  edi
        movl %esi,%edi           ;; 7  esi  Copy (esi->edi)
        andl $0xFF00FF,%edi      ;; 8  edi  Modify (edi)
        xorl %edi,%esi           ;; 9  edi
        shll $0x8,%edi           ;; 9
        shrl $0x8,%esi           ;; 10 esi
        xorl %edi,%esi           ;; 11 esi

The pattern pointed out by the comments is

 First  movl regA,regB
 then   op   ????,regB
 and continue to use both regA and regB.

This is problematic, because there is a "dependency chain" on regB.  Copying
regA to regB must be executed before the operation on regB and the subsequent
operations reading regB are also delayed.

It would be better to transform the pattern to

 First  movl regA,regB
 then   op   ????,regA
 and continue to use both regA and regB, where the roles of regA and regB
 have been swapped.

IOW, the idea is to systematically, after a register-to-register copy,
swap the roles of the registers, so that the target of the copy isn't the
target (or source) of the subsequent op.

The previous code would be transformed to

                                 ;; cc W->R
        movl %edx,%esi           ;; 1
        andl $0xFFFF,%edx        ;; 1
        movl %edx,%edi           ;; 2  edx
        xorl %esi,%edx           ;; 2
        shll $0x10,%edi          ;; 3  edi
        shrl $0x10,%edx          ;; 3
        xorl %edx,%edi           ;; 4  edx
        movl %edi,%edx           ;; 5  edi
        andl $0xFF00FF,%edi      ;; 5
        xorl %edi,%edx           ;; 6  edi
        shll $0x8,%edi           ;; 6
        shrl $0x8,%edx           ;; 7  edx
        xorl %edi,%edx           ;; 8  edx

The dependency chains of the transformed code are now shorter than of the
original code, which means that it is easier for a CPU to execute the
intructions in parallel.  The numbers in the comments indicate the clock
cycle at which the instruction completes on an idealized processor where
consecutive instructions without W->R dependencies can be executed in
parallel and every instruction takes just one cycle.

Any estimate on how difficult something like this would be to implement?

-Vesa Karvonen