SSA simplify passes
Matthew Fluet
fluet@CS.Cornell.EDU
Wed, 9 Jan 2002 09:45:38 -0500 (EST)
> I could probably hack the register allocator to do these types of
> memory-memory moves through a general purpose register. With register
> renaming, I shouldn't be slowed down by using the same register for both
> components of the float (right?).
This wasn't that hard to add, although the results are mixed. The
original benchmarks I ran used one version, but I realized I could be a
little more aggressive when doing this optimization, so I'm rerunning
numbers.
Note, there is a little bit of trickery with the way I'm handling
aliasing. Consider the following:
fldL (0+(56*1))(%ebp)
fstpL (0+(80*1))(%ebp)
Easy enough to rewrite to:
movl (0+(56*1))(%ebp),%eax
movl %eax,(0+(80*1))(%ebp)
movl ((0+(56*1))+4)(%ebp),%eax
movl %eax,((0+(80*1))+4)(%ebp)
But, trying to rewrite
fldL (0+(56*1))(%ebp)
fstpL (0+(60*1))(%ebp)
to
movl (0+(56*1))(%ebp),%eax
movl %eax,(0+(60*1))(%ebp)
movl ((0+(56*1))+4)(%ebp),%eax
movl %eax,((0+(60*1))+4)(%ebp)
is incorrect, because the first movl overwrites the second word of the source
float before it has been read. We need to perform the moves in the other
order, copying the high word first.
(There is an orthogonal issue concerning the fact that the destination
address is not mod 8 aligned; I thought we had fixed all that, but maybe
not.)
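To make the ordering rule concrete, here is a minimal C sketch (hypothetical,
not the generated code or the allocator itself): the word-by-word move has to
run high word first whenever the destination overlaps the source from above,
exactly like a word-granularity memmove.

#include <stdint.h>

/* Copy an 8-byte float as two 32-bit word moves, the way the rewritten
   mem-mem move does.  If dst overlaps src from above (the (56)/(60) case,
   where dst == src + one word), moving the low word first would clobber
   the high word of the source before it is read. */
void move_double_as_words(uint32_t *dst, const uint32_t *src) {
  if (dst > src && dst < src + 2) {
    dst[1] = src[1];   /* high word first: safe when dst sits above src */
    dst[0] = src[0];
  } else {
    dst[0] = src[0];   /* low word first: safe when dst is below src or disjoint */
    dst[1] = src[1];
  }
}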
So, the real problem is one of aliasing/overlapping. By extending my
mayAlias predicate to give an ordering, I can choose the right way to
perform the moves. This works fine for stack slots, because we can always
determine which address is smaller (we're just comparing the offsets).
Unfortunately, with the heap this isn't usually possible; two arbitrary
array elements could always alias, and we can't tell which is the smaller
address. Currently, if we're trying to do a mem-mem floating point move
between two locations that mayAlias in an indeterminate order, I just punt
and move it through the floating point registers. So, for example, the
copy of a f.p. array into another f.p. array will always bounce through
the f.p. registers. I think we can avoid this, because heap locations
have the added property of never partially overlapping, even if two locations
mayAlias: two heap doubles are either the exact same location or completely
disjoint, and in either case the word-by-word moves are safe in whichever
order we emit them. But, currently, I haven't added that check.
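Roughly, the decision the allocator has to make looks like the following C
sketch (hypothetical types and names, not the actual SML): stack slots can be
ordered by their frame offsets, and anything else currently punts to the
floating-point registers.

#include <stdio.h>

/* Hypothetical sketch of mayAlias extended to an ordering; the real
   predicate is SML and works on the allocator's own operand types. */
typedef enum { STACK_SLOT, HEAP } kind_t;
typedef struct { kind_t kind; int offset; /* byte offset for stack slots */ } operand_t;

typedef enum { LOW_WORD_FIRST, HIGH_WORD_FIRST, PUNT_THROUGH_FP } move_order_t;

/* Stack slots: the frame offsets are known, so we can tell which address is
   smaller and pick a safe word order.  Heap operands: two arbitrary array
   elements could alias and we can't order them, so bounce the move through
   the floating-point registers.  (Using the fact that heap locations never
   partially overlap would let either word order through here; that check
   isn't implemented.) */
move_order_t choose_move_order(operand_t src, operand_t dst) {
  if (src.kind == STACK_SLOT && dst.kind == STACK_SLOT)
    return (dst.offset > src.offset) ? HIGH_WORD_FIRST : LOW_WORD_FIRST;
  return PUNT_THROUGH_FP;
}

int main(void) {
  operand_t src = { STACK_SLOT, 56 }, dst = { STACK_SLOT, 60 };
  /* Prints 1 (HIGH_WORD_FIRST): the overlapping (56)/(60) case above. */
  printf("%d\n", choose_move_order(src, dst));
  return 0;
}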
The mandelbrot benchmark had a 10% slowdown with the "fast" f.p. move.
The difference between the two versions in the assembly is just two lines
(original version on the left, new version on the right):
L_17: L_17:
fildl (0+(44*1))(%ebp) fildl (0+(44*1))(%ebp)
faddL (0+(8*1))(%ebp) faddL (0+(8*1))(%ebp)
fmulL (globaldouble+(2*8)) fmulL (globaldouble+(2*8))
fstL (0+(48*1))(%ebp) fstL (0+(48*1))(%ebp)
movl $0,(0+(72*1))(%ebp) movl $0,(0+(72*1))(%ebp)
fstpL (0+(64*1))(%ebp) fstpL (0+(64*1))(%ebp)
fldL (0+(32*1))(%ebp) | movl (0+(32*1))(%ebp),%esi
fstpL (0+(56*1))(%ebp) | movl %esi,(0+(56*1))(%ebp)
> movl ((0+(32*1))+4)(%ebp),%esi
> movl %esi,((0+(56*1))+4)(%ebp)
.p2align 4,,7 .p2align 4,,7
loop3_0: loop3_0:
movl (0+(72*1))(%ebp),%esi movl (0+(72*1))(%ebp),%esi
cmpl $2048,%esi cmpl $2048,%esi
jnl L_11 jnl L_11
L_16: L_16:
fldL (0+(64*1))(%ebp) fldL (0+(64*1))(%ebp)
fld %st fld %st
fmul %st, %st fmul %st, %st
fldL (0+(56*1))(%ebp) fldL (0+(56*1))(%ebp)
fld %st fld %st
fmul %st, %st fmul %st, %st
fld %st(2) fld %st(2)
fadd %st(1), %st fadd %st(1), %st
fldL (globaldouble+(3*8)) fldL (globaldouble+(3*8))
fxch %st(1) fxch %st(1)
fcompp fcompp
fnstsw %ax fnstsw %ax
testw $0x4500,%ax testw $0x4500,%ax
jz L_39 jz L_39
But, the original (left) version has:
[fluet@lennon mandelbrot]$ mlprof -x -d 2 mandelbrot.false mlmon.false.out
13.98 seconds of CPU time
main_0 100.00% (13.98s)
L_16 59.01% (8.25s)
L_16 100.00% (8.25s)
L_17 17.53% (2.45s)
L_17 100.00% (2.45s)
and the new (right) version has:
[fluet@lennon mandelbrot]$ mlprof -x -d 2 mandelbrot.true mlmon.true.out
15.85 seconds of CPU time
main_0 100.00% (15.85s)
L_16 64.10% (10.16s)
L_16 100.00% (10.16s)
L_17 16.28% (2.58s)
L_17 100.00% (2.58s)
There is a little slowdown in L_17, which is where the change is, but that
doesn't seem significant; the difference in L_16, however, is striking. What I
think is going on is that the pipelining between the integer unit and the
floating-point unit is sufficiently disjoint that when performing the
fldL (0+(56*1))(%ebp) in L_16, in the original version the value can be
fetched from on chip (because it was recently written by the
floating-point unit at fstpL (0+(56*1))(%ebp)), but in the new version,
the value can't be fetched from on chip because it's tied up in the
integer unit.