[MLton] The PackWord performance regression on the 64-bit branch

Tue Nov 28 12:51:01 PST 2006

>> On the 64-bit branch MLton, the generated machine code (from the previous
>> SML code) is
>> 
>> 	sall $2,%esi
>> 	addl (globalWord32+((3*4)+(0*1))),%esi
>> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
>> 	cmpl $0x0,%edi
>> 	jl L_1255
>> L_841:
>> 	incl %esi
>> 	shll $0x8,%edi
>> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edx
>> 	cmpl $0x0,%edx
>> 	jl L_1254
>> L_840:
>> 	orl %edi,%edx
>> 	incl %esi
>> 	shll $0x8,%edx
>> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
>> 	cmpl $0x0,%edi
>> 	jl L_1253
>> L_839:
>> 	orl %edi,%edx
>> 	incl %esi
>> 	shll $0x8,%edx
>> 	movzbl (0+((0x0*1)+(0*1)))(%esi),%edi
>> 	cmpl $0x0,%edi
>> 	jl L_1252
>> L_838:
>> 	orl %edx,%edi
>> 	movl %edi,(0+((64*1)+(0*1)))(%ebp)
>> 	movl $0x1,(0+((60*1)+(0*1)))(%ebp)
>> 	addl $60,%ebp
>> 	movl $L_837,(0+(-1*4))(%ebp)
>> 
>> and it looks like a bug in the optimizer.
>
> Indeed; I would expect you to get code that is just like that in trunk.

I checked in a modification to the x86_64 branch basis library 
implementation that avoids this redundant test.  (In particular, we do not 
need to check for Overflow when Word<N>.wordSize is less than 
Int.precision.)

However, it would be nice to extend the optimizer to pick up these cases.
The redundantTests optimization is the right place to start, but it isn't 
really doing complicated test propagation and test implication.

Essentially, redundantTests looks for Word_equal and Word_lt comparisons 
that are dominated by tests that determine the outcome of the comparison, 
in which case the comparison can be eliminated.  The limitations of 
redundantTests are two-fold:
  1) Only explicit comparisons (i.e., Word_equal and Word_lt) are used to
     insert facts into the dominator tree
  2) A fact in the dominator tree may eliminate a comparison iff the
     comparison (or its negation) is syntactically identical to a fact.

In the above situation, we have SSA code that looks like:

...
global_3: word32 = 0x0
...
     x_212: word8 = Pointer_getWord8 (x_213, global_3)
     x_211: word32 = WordU8_toWord32 (x_212)
     x_210: bool = WordS32_lt (x_211, global_3)
     case x_210 of
       true => L_84 | false => L_83
...

The zero-extension (WordU8_toWord32) could introduce the facts:
   0 <=U x_211, 0 <=S x_211, x_211 <=U 255, x_211 <=S 255
   (and, possibly, x_211 <U 256, x_211 <S 256)

This would suffice to eliminate the WordS32_lt test, since in this 
situation the corresponding negated fact (0 <=S x_211) is syntactically 
equal to one of the facts introduced above.

However, it would be nice to eliminate tests which are not syntactically 
equal, but are logically implied; that is, consider transitive 
relationships and other implications (like x <=S 10 implies x <=S 20).

Similar in spirit to the zero- and sign-extension facts, we could also 
introduce facts at Word_quot and Word_rem statements.