[MLton] Shortening dependency chains in the x86 codegen
Vesa Karvonen
vesa.karvonen at cs.helsinki.fi
Sun Jan 7 08:57:30 PST 2007
Take a look at the following code for 32-bit endianness swapping generated
by MLton on x86:
;; cc W->R comment
movl %edx,%esi ;; 1 Copy (edx->esi)
andl $0xFFFF,%esi ;; 2 esi Modify (esi)
movl %esi,%edi ;; 3 esi Copy (esi->edi)
xorl %edx,%edi ;; 4 edi Modify (edi)
shll $0x10,%esi ;; 4
shrl $0x10,%edi ;; 5 edi
xorl %edi,%esi ;; 6 edi
movl %esi,%edi ;; 7 esi Copy (esi->edi)
andl $0xFF00FF,%edi ;; 8 edi Modify (edi)
xorl %edi,%esi ;; 9 edi
shll $0x8,%edi ;; 9
shrl $0x8,%esi ;; 10 esi
xorl %edi,%esi ;; 11 esi
The pattern pointed out by the comments is
First movl regA,regB
then op ????,regB
and continue to use both regA and regB.
This is problematic, because there is a "dependency chain" on regB. Copying
regA to regB must be executed before the operation on regB and the subsequent
operations reading regB are also delayed.
It would be better to transform the pattern to
First movl regA,regB
then op ????,regA
and continue to use both regA and regB, where the roles of regA and regB
have been swapped.
IOW, the idea is to systematically, after a register-to-register copy,
swap the roles of the registers, so that the target of the copy isn't the
target (or source) of the subsequent op.
The previous code would be transformed to
;; cc W->R
movl %edx,%esi ;; 1
andl $0xFFFF,%edx ;; 1
movl %edx,%edi ;; 2 edx
xorl %esi,%edx ;; 2
shll $0x10,%edi ;; 3 edi
shrl $0x10,%edx ;; 3
xorl %edx,%edi ;; 4 edx
movl %edi,%edx ;; 5 edi
andl $0xFF00FF,%edi ;; 5
xorl %edi,%edx ;; 6 edi
shll $0x8,%edi ;; 6
shrl $0x8,%edx ;; 7 edx
xorl %edi,%edx ;; 8 edx
The dependency chains of the transformed code are now shorter than of the
original code, which means that it is easier for a CPU to execute the
intructions in parallel. The numbers in the comments indicate the clock
cycle at which the instruction completes on an idealized processor where
consecutive instructions without W->R dependencies can be executed in
parallel and every instruction takes just one cycle.
Any estimate on how difficult something like this would be to implement?
-Vesa Karvonen
More information about the MLton
mailing list