SSA
Matthew Fluet
Matthew Fluet <fluet@CS.Cornell.EDU>
Mon, 22 Oct 2001 12:30:16 -0400 (EDT)
> > > We could (?) always put the exnStack element in a place relative to the
> > > stack, say in the word immediately below stackBottom. (Although that sort
> > > of messes with getting stacks mod 8 aligned.) The GC_thread struct
> > > seems a little overly abstract -- it doesn't seem as though two threads
> > > with different exnStacks could point to the same execution stack?
> >
> > Yes, I think you're right. We could flatten the GC_stack data
> > structure into the GC_thread data structure. That would be nice,
> > would save a little space, simplify the runtime a little, and enable
> > tracing the exception stack.
>
> I like tracing the exception stack; it seems a little more natural than
> the stub continuation label that fakes the liveness of both the cont and
> handler. I think I'll take a crack at this over the weekend.
O.k. Here are the results of my experiments with the runtime.
First, we can't flatten GC_stack into GC_thread. The reason is that
SML code and objects only refer to GC_threads, not GC_stacks. Now,
when the current thread runs out of stack space, it copies its stack
data to a new, larger stack and updates the stack field of the current
thread. This ensures that all ML objects referring to the current
thread "see" the new stack. My first attempt at combining all the
GC_stack and GC_thread data works o.k. for single-threaded apps, and even
for some multi-threaded apps, but mutex.sml and thread2.sml were both
failing.
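To make the constraint concrete, here is a rough C sketch of the
indirection involved (struct and field names are approximate, not the
literal gc.h declarations, and copyFrames is a made-up helper):

typedef struct GC_stack {
        unsigned reserved;   /* bytes allocated for frames */
        unsigned used;       /* bytes of frames currently in use */
        /* ... frame data follows ... */
} *GC_stack;

typedef struct GC_thread {
        GC_stack stack;      /* ML objects point at the thread, never the stack */
} *GC_thread;

static void copyFrames (GC_stack from, GC_stack to);  /* hypothetical */

/* Growing the stack: copy the frames into a bigger GC_stack and swap the
 * pointer.  Every ML object that refers to the thread sees the new stack
 * through the unchanged GC_thread.  Flattening the stack data into
 * GC_thread would mean the thread object itself has to move, and every ML
 * reference to it would have to be found and fixed up. */
static void growStack (GC_thread t, GC_stack bigger) {
        copyFrames (t->stack, bigger);
        t->stack = bigger;
}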
So, I just moved the exnStack field from the GC_thread struct to the
GC_stack struct. This slightly penalizes setting handlers and raising,
because access to the exnStack field goes through one extra level of
indirection.
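In other words, installing a handler now pays one extra load to get from
the thread to its stack; something like this sketch (its own made-up
struct and field names, not the generated code):

struct stack  { unsigned exnStack; /* offset of the top handler frame */ };
struct thread { struct stack *stack; };

static void setHandler (struct thread *t, unsigned newHandlerOffset) {
        /* old layout:  t->exnStack = newHandlerOffset;   (one access)  */
        t->stack->exnStack = newHandlerOffset;         /* (two accesses) */
}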
That all worked out o.k., so I took a crack at setting frame layouts for
handlers and walking the handler stack as part of GC.
Setting everything up wasn't too difficult. The changes to the backend were
pretty minimal, although if we decide to go this route, I'll go through
and clean up some more. The C codegen needed almost no changes. The x86
codegen just needed to save the frame index with a handler label.
The runtime required a little more work. I disabled the "Frame layouts"
portion of the invariant check in gc.c. This was checking that all of the
live slot offsets in a frame were less than the frame size. The issue
here is what we mean by frame size. I decided that I wanted it to mean
"the amount to subtract from top to get to the frame for which these
offsets make sense." For continuations, this is the intuitive notion of
frame size -- we subtract the whole frame. For handlers, this isn't quite
the same; we actually subtract the offset of the slot for the code
pointer. In general, we allocate the slot for the code pointer before all
the slots that are live in a handler, so you can get situations where some
of the live offsets are larger than the "size" of the frame.
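Roughly, the walk I have in mind looks like this (a sketch with made-up
type and field names, not the actual gc.c code):

typedef unsigned char *pointer;

typedef struct {
        unsigned frameSize;     /* "amount to subtract from top" */
        unsigned numOffsets;
        unsigned *offsets;      /* live slots, relative to the frame base */
} frameLayout;

/* Walk the live slots of one frame.  For a continuation frame every
 * offsets[i] < frameSize, which is what the old "Frame layouts" invariant
 * check asserted.  For a handler frame, frameSize only reaches down to
 * the handler's code-pointer slot, so some offsets[i] can be >= frameSize
 * even though frame + offsets[i] is still a perfectly good live slot --
 * hence disabling that part of the check. */
static void foreachPointerInFrame (pointer top, frameLayout *layout,
                                   void (*f) (pointer *p)) {
        pointer frame = top - layout->frameSize;
        unsigned i;
        for (i = 0; i < layout->numOffsets; ++i)
                f ((pointer*) (frame + layout->offsets[i]));
}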
Actually changing GC_foreachPointerInObject to walk the handler stack was
not difficult. However, forward implicitly assumes that we only ever
visit a pointer in an object once. When we walk the continuation stack
and then walk the handler stack, we're likely to hit some pointers more
than once. As a quick hack, I made forward check whether the pointer
has already been forwarded to toSpace; if so, it just returns. This could
slow down the GC, but I don't see it happening in practice (see benchmarks
below). I think we can make GC_foreachPointerInObject a little more
complicated and have it walk the continuation and handler stacks at the
same time, and only hit each pointer once. That should only force stacks
to pay a little more in GC time.
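The quick hack amounts to something like the following sketch (the real
forward in gc.c obviously does more, and the to-space bounds live in the
GC state rather than in globals like these):

typedef unsigned char *pointer;

static pointer toSpace;         /* sketch: start of to-space */
static unsigned toSpaceSize;    /* sketch: its size in bytes */

/* Forward one pointer slot.  The early return is the hack: if the slot
 * already points into to-space, an earlier pass (the continuation walk)
 * already forwarded it, so the handler walk can skip it instead of
 * copying the object a second time. */
static void forward (pointer *pp) {
        pointer p = *pp;
        if (toSpace <= p && p < toSpace + toSpaceSize)
                return;         /* already forwarded on a previous visit */
        /* ... copy the object to to-space, install the forwarding
         * pointer, and update *pp as usual ... */
}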
Anyways, here are the results of the benchmarks:
MLton0 -- mlton-cvs
MLton1 -- mlton-newer
compile time (seconds)
benchmark MLton0 MLton1
barnes-hut 3.1 2.8
checksum 0.6 0.6
count-graphs 1.9 2.0
DLXSimulator 4.8 5.0
fft 1.3 1.3
fib 0.6 0.6
hamlet 56.2 56.4
knuth-bendix 2.6 2.6
lexgen 6.2 6.3
life 1.5 1.5
logic 6.9 6.8
mandelbrot 0.7 0.6
matrix-multiply 0.7 0.7
md5 1.5 1.4
merge 0.7 0.6
mlyacc 26.1 25.9
mpuz 1.0 1.0
nucleic 3.6 3.6
peek 1.1 1.1
psdes-random 0.7 0.6
ratio-regions 3.0 3.0
ray 3.8 3.9
raytrace 10.3 10.3
simple 7.6 7.5
smith-normal-form 9.3 9.3
tailfib 0.6 0.6
tak 0.6 0.6
tensor 3.5 3.5
tsp 1.8 1.8
tyan 4.4 4.5
vector-concat 0.7 0.7
vector-rev 0.7 0.6
vliw 14.5 14.5
wc-input1 1.8 1.8
wc-scanStream 1.9 1.9
zebra 14.1 14.2
zern 1.1 1.1
run time (seconds)
benchmark MLton0 MLton1
barnes-hut 4.9 4.9
checksum 4.1 4.1
count-graphs 5.8 6.1
DLXSimulator 20.2 21.0
fft 10.8 12.6
fib 3.9 4.0
hamlet 10.0 10.2
knuth-bendix 8.2 8.4
lexgen 12.9 12.8
life 9.9 10.2
logic 32.2 32.3
mandelbrot 8.6 15.3
matrix-multiply 6.8 6.7
md5 0.6 0.6
merge 70.3 67.8
mlyacc 11.8 11.8
mpuz 5.6 5.5
nucleic 8.7 8.7
peek 4.3 4.4
psdes-random 4.2 4.2
ratio-regions 10.9 10.9
ray 4.6 4.9
raytrace 5.7 5.7
simple 7.4 7.4
smith-normal-form 1.2 1.2
tailfib 20.3 19.1
tak 9.8 9.5
tensor 7.6 7.2
tsp 11.5 11.5
tyan 23.6 23.6
vector-concat 7.7 7.3
vector-rev 5.7 5.7
vliw 7.6 *
wc-input1 2.6 2.6
wc-scanStream 4.4 4.4
zebra 2.8 2.8
zern 47.1 45.4
run time ratio (MLton1 / MLton0)
benchmark MLton1
barnes-hut 1.0
checksum 1.0
count-graphs 1.0
DLXSimulator 1.0
fft 1.2
fib 1.0
hamlet 1.0
knuth-bendix 1.0
lexgen 1.0
life 1.0
logic 1.0
mandelbrot 1.8
matrix-multiply 1.0
md5 1.0
merge 1.0
mlyacc 1.0
mpuz 1.0
nucleic 1.0
peek 1.0
psdes-random 1.0
ratio-regions 1.0
ray 1.1
raytrace 1.0
simple 1.0
smith-normal-form 1.0
tailfib 0.9
tak 1.0
tensor 1.0
tsp 1.0
tyan 1.0
vector-concat 1.0
vector-rev 1.0
wc-input1 1.0
wc-scanStream 1.0
zebra 1.0
zern 1.0
size (bytes)
benchmark MLton0 MLton1
barnes-hut 60,594 61,018
checksum 38,365 38,733
count-graphs 57,917 58,469
DLXSimulator 95,749 96,357
fft 46,953 47,353
fib 38,349 38,749
hamlet 970,824 978,280
knuth-bendix 77,590 78,638
lexgen 140,717 141,565
life 56,269 56,677
logic 166,389 166,829
mandelbrot 38,325 38,709
matrix-multiply 38,773 39,141
md5 46,014 46,470
merge 39,357 39,741
mlyacc 432,621 434,221
mpuz 44,133 44,549
nucleic 77,973 78,357
peek 46,118 46,598
psdes-random 39,365 39,733
ratio-regions 59,661 60,109
ray 84,592 85,304
raytrace 178,237 179,213
simple 165,345 165,969
smith-normal-form 143,886 144,326
tailfib 38,069 38,453
tak 38,397 38,781
tensor 63,021 63,533
tsp 51,310 51,710
tyan 94,958 95,766
vector-concat 39,013 39,397
vector-rev 38,837 39,221
vliw 281,705 283,145
wc-input1 56,918 57,502
wc-scanStream 59,278 59,846
zebra 133,550 134,198
zern 43,960 44,360
The easy comments: the fact that we have frame layouts for both
continuations and handlers increases the static data of the executables.
For the most part, the GC hacks aren't noticeable in the running times.
mandelbrot has me completely stumped. The x86 assembly produced by the
x86 codegen is identical in both the old and new compilers. Running with
gc-summary reveals that neither program ever invokes the GC, so the
runtime changes can't explain it. Alignment of the tight loops? The
change in the runtime did affect the addresses of the produced assembly.
I tried aligning loop headers at mod 4 and mod 8 boundaries, and recovered
about a second -- nothing compared to the 6.5 seconds we lost. Profiling
shows that the same loops are hot in both, just that one takes a lot
longer.
[fluet@lennon mandelbrot]$ mlprof -d 1 mandelbrot.cvs mlmon.cvs.out
18.42 seconds of CPU time
main_0 100.00%
L_30 48.21%
L_26 31.27%
loop3_1 3.91%
L_45 3.37%
L_44 3.09%
L_27 3.04%
L_31 2.82%
L_46 2.06%
loop2_1 1.14%
L_32 1.09%
[fluet@lennon mandelbrot]$ mlprof -d 1 mandelbrot.newer mlmon.newer.out
26.47 seconds of CPU time
main_0 100.00%
L_30 56.18%
loop3_1 17.34%
L_26 10.31%
L_46 4.00%
loop2_1 3.93%
L_44 2.72%
L_31 2.34%
L_45 1.21%
L_27 1.10%
L_32 0.87%
Here are the top 3 blocks in .ssa and .s:
L_26 ()
x_25 = Real_fromInt (x_20)
x_24 = Real_add (x_26, x_25)
x_10 = Real_mul (x_24, global_15)
loop3_1 (x_6, x_10, global_8)
loop3_1 (x_13, x_14, x_2)
x_17 = Int_lt (x_2, global_12)
case x_17 of
false => L_34 | true => L_30
L_30 ()
x_12 = Real_mul (x_14, x_14)
x_11 = Real_mul (x_13, x_13)
x_16 = Real_add (x_11, x_12)
x_15 = Real_gt (x_16, global_14)
case x_15 of
false => L_32 | true => L_31
L_26:
fildl (0+(32*1))(%ebp)
faddL (0+(8*1))(%ebp)
fmulL (globaldouble+(1*8))
fstL (0+(40*1))(%ebp)
movl $0,(0+(64*1))(%ebp)
fstpL (0+(56*1))(%ebp)
fldL (0+(24*1))(%ebp)
fstpL (0+(48*1))(%ebp)
loop3_1:
movl (0+(64*1))(%ebp),%esi
cmpl $2048,%esi
jnl L_27
L_30:
fldL (0+(56*1))(%ebp)
fld %st
fmul %st, %st
fldL (0+(48*1))(%ebp)
fld %st
fmul %st, %st
fld %st(2)
fadd %st(1), %st
fldL (globaldouble+(0*8))
fxch %st(1)
fcompp
fnstsw %ax
testw $0x4500,%ax
jz L_63
Like I said, they are identical in both versions.
[fluet@lennon mandelbrot]$ nm -n mandelbrot.cvs | grep "..."
0804908c t loop1_1
080490b1 t loop2_1
080490c0 t L_26
080490df t loop3_1
080490ee t L_30
[fluet@lennon mandelbrot]$ nm -n mandelbrot.newer | grep "..."
08049094 t loop1_1
080490b9 t loop2_1
080490c8 t L_26
080490e7 t loop3_1
080490f6 t L_30
Alignment doesn't seem significantly better in the first. Long shot
guess: alignment of individual floating point instructions? I can't find
any support for this in the Intel documentation, but I'm at a loss.
The failure of vliw under the new compiler is not directly related to the
changes I made. Using the currently checked in sources, I get the
following:
[fluet@lennon vliw]$ ./vliw @MLton gc-summary fixed-heap 600k --
Segmentation fault
And compiling -g, I get:
[fluet@lennon vliw]$ ./vliw @MLton gc-summary fixed-heap 600k --
gc.c 583: assert(0 == s->fromSize or (s->frontier <= s->limit + LIMIT_SLOP
and s->limit == s->base + s->fromSize - LIMIT_SLOP)) failed.
Aborted
Commenting out that assert and looking at where the seg fault occurs, I
think I've got a guess. One SSA function is the following:
fun member_1 (x_6660, x_6659) = L_4441()
L_4441 ()
loop_385 (x_6659, x_6660)
loop_385 (x_6677, x_6664)
x_6678 = #2 x_6677
x_6663 = #1 x_6677
case x_6678 of
::_2 => L_4454 | nil_1 => L_4443
L_4454 (x_6662, x_6673)
case x_6663 of
Env_2 => L_4453 | Env_15 => L_4452
L_4453 ()
x_6676 = String_equal (x_6673, x_6664)
L_4445 (x_6676)
L_4452 (x_6672, x_6670)
L_4451 (explode_0 (x_6664, x_6672))
L_4451 (x_6675)
L_4450 (pref_0 (x_6675))
L_4450 (x_6674)
L_4449 (implode_0 (x_6674, x_6670))
L_4449 (x_6668)
L_4448 (explode_0 (x_6673, x_6672))
L_4448 (x_6671)
L_4447 (pref_0 (x_6671))
L_4447 (x_6669)
L_4446 (implode_0 (x_6669, x_6670))
L_4446 (x_6666)
x_6667 = String_equal (x_6668, x_6666)
L_4445 (x_6667)
L_4445 (x_6665)
case x_6665 of
false => L_4444 | true => L_4442
L_4444 ()
x_6661 = (x_6663, x_6662)
loop_385 (x_6661, x_6664)
L_4443 ()
global_12
L_4442 ()
global_83
There is an allocation in L_4444 (which seems like it could be avoided
using some sort of local flattening, but that doesn't really matter).
Looking at the corresponding .toMOut, there is no limit check with
bytes > 0 in this code. So, I think the seg faults are coming from just
running the frontier past the end of the heap.
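For reference, the difference amounts to something like this sketch
(names hypothetical; it just illustrates why the missing check turns into
a seg fault on a 600k fixed heap):

typedef unsigned char *pointer;

struct heap {
        pointer frontier;       /* next free byte */
        pointer limit;          /* end of allocatable space */
};

static void invokeGC (struct heap *h, unsigned bytes);  /* hypothetical */

/* What the loop around L_4444 is effectively doing: bump the frontier on
 * every iteration with no comparison against the limit, so it eventually
 * walks off the end of the heap. */
static pointer allocateUnchecked (struct heap *h, unsigned bytes) {
        pointer res = h->frontier;
        h->frontier += bytes;
        return res;
}

/* What a limit check with bytes > 0 would buy us: control returns to the
 * GC before the frontier can cross the limit. */
static pointer allocateChecked (struct heap *h, unsigned bytes) {
        if (h->frontier + bytes > h->limit)
                invokeGC (h, bytes);
        return allocateUnchecked (h, bytes);
}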
Anyways, I just found this, so I haven't taken a look at what needs to be
done to fix it.