non-tail returns
Matthew Fluet
Matthew Fluet <fluet@CS.Cornell.EDU>
Fri, 26 Oct 2001 10:50:30 -0400 (EDT)
I tried out Steve's proposal for faster non-tail returns. On a return,
the callee puts the return values in slots below the return address. This
also changes the code for a non-tail call (in the sense that we calculate
a different stack frame size) and for cont stubs (where we need the right
stack frame size and copy from the new locations). That's all o.k. There
is a little more trickery in getting the stack allocation to give the
continuation formals the right slots so that they line up with where a
non-tail callee is putting them. Anyways, here are the benchmark results:
MLton0 -- mlton -limit-check-per-block false -new-return false -native-live-stack false
MLton1 -- mlton -limit-check-per-block false -new-return false -native-live-stack true
MLton2 -- mlton -limit-check-per-block false -new-return true -native-live-stack false
MLton3 -- mlton -limit-check-per-block false -new-return true -native-live-stack true
MLton4 -- mlton -limit-check-per-block true -new-return false -native-live-stack false
MLton5 -- mlton -limit-check-per-block true -new-return false -native-live-stack true
MLton6 -- mlton -limit-check-per-block true -new-return true -native-live-stack false
MLton7 -- mlton -limit-check-per-block true -new-return true -native-live-stack true
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut 2.3 2.5 2.3 2.5 2.4 2.5 2.3 2.5
checksum 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
count-graphs 1.7 1.9 1.6 1.9 1.7 2.0 1.7 2.0
DLXSimulator 4.0 4.9 4.0 4.9 4.0 5.0 4.0 5.0
fft 1.2 1.4 1.2 1.3 1.2 1.4 1.2 1.4
fib 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
hamlet 45.8 * 44.3 * 48.0 * 47.6 *
knuth-bendix 2.2 2.4 2.2 2.4 2.2 2.5 2.2 2.5
lexgen 5.0 6.1 5.0 6.0 5.3 6.4 5.2 6.4
life 1.3 1.4 1.3 1.4 1.3 1.5 1.3 1.5
logic 5.7 6.0 5.7 6.0 7.6 8.8 7.5 8.8
mandelbrot 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
matrix-multiply 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
md5 1.2 1.3 1.2 1.3 1.2 1.3 1.2 1.3
merge 0.6 0.6 0.6 0.6 0.6 0.7 0.6 0.6
mlyacc 21.3 * 20.9 * 23.8 * 23.7 *
mpuz 0.8 0.9 0.8 0.9 0.9 0.9 0.9 0.9
nucleic 2.9 3.0 2.9 3.0 2.9 3.0 2.9 3.0
peek 1.0 1.1 1.0 1.0 1.0 1.1 1.0 1.1
psdes-random 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7
ratio-regions 2.4 3.2 2.5 3.2 2.4 3.3 2.5 3.3
ray 3.2 3.9 3.1 3.9 3.3 4.1 3.3 4.0
raytrace 8.3 * 8.3 * 8.4 * 8.5 *
simple 6.2 7.7 6.2 7.7 6.7 8.5 6.5 8.4
smith-normal-form 7.4 7.9 7.5 7.9 7.5 8.0 7.5 7.9
tailfib 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
tak 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
tensor 2.8 3.2 2.8 3.2 2.8 3.3 2.8 3.2
tsp 1.5 1.6 1.5 1.6 1.4 1.6 1.5 1.6
tyan 3.6 4.0 3.6 4.0 3.7 4.3 3.7 4.3
vector-concat 0.6 0.6 0.6 0.7 0.6 0.6 0.6 0.6
vector-rev 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
vliw 11.7 13.7 11.7 13.5 12.2 14.5 12.2 14.3
wc-input1 1.5 1.7 1.5 1.7 1.6 1.8 1.6 1.8
wc-scanStream 1.6 1.9 1.6 1.8 1.7 1.9 1.7 1.9
zebra 8.4 12.6 8.5 12.7 9.6 15.6 9.5 15.6
zern 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut 4.0 3.8 4.2 4.1 4.2 3.8 4.6 4.0
checksum 3.1 3.0 3.1 3.0 3.1 3.0 3.1 3.0
count-graphs 4.9 4.9 5.0 5.0 5.0 4.9 5.0 4.9
DLXSimulator 14.8 14.4 15.4 15.0 15.8 14.8 16.0 15.2
fft 8.7 8.6 8.7 8.7 8.6 8.6 8.6 8.5
fib 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4
hamlet 7.9 * 8.0 * 8.2 * 8.3 *
knuth-bendix 6.6 6.3 6.5 6.2 6.6 6.2 6.4 6.1
lexgen 10.2 9.6 10.4 9.8 10.3 9.7 10.4 9.5
life 8.1 7.7 8.2 7.7 7.8 7.0 7.8 7.0
logic 25.4 24.1 25.4 24.1 25.9 25.3 25.8 25.4
mandelbrot 6.9 5.9 6.9 5.9 6.9 5.9 6.9 5.9
matrix-multiply 5.2 2.2 5.1 2.3 5.2 2.2 5.1 2.3
md5 0.5 0.4 0.5 0.4 0.4 0.3 0.4 0.3
merge 48.7 48.0 54.5 53.9 49.5 48.4 54.1 53.8
mlyacc 9.3 * 9.4 * 9.4 * 9.5 *
mpuz 4.6 4.3 4.6 4.3 4.7 4.3 4.7 4.3
nucleic 6.9 6.7 7.1 6.9 7.4 7.1 7.1 7.0
peek 3.6 2.6 3.6 2.6 3.6 2.4 3.6 2.4
psdes-random 3.4 3.3 3.4 3.3 3.4 3.3 3.4 3.3
ratio-regions 8.4 8.5 8.4 8.5 8.4 8.5 8.4 8.5
ray 3.7 3.7 3.8 3.8 3.7 3.7 3.8 3.7
raytrace 4.6 * 4.6 * 4.7 * 4.8 *
simple 6.0 5.9 6.2 6.1 6.0 5.8 6.2 6.0
smith-normal-form 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
tailfib 16.3 10.6 16.3 10.6 16.3 10.6 16.3 10.7
tak 7.8 7.6 7.8 7.9 7.8 7.6 7.8 7.9
tensor 7.1 3.7 7.1 3.7 7.0 4.1 7.0 4.1
tsp 9.1 9.1 9.1 9.1 8.9 8.9 9.0 8.9
tyan 19.5 18.1 19.5 18.0 19.6 17.7 19.6 17.7
vector-concat 6.0 3.4 6.2 3.4 5.8 3.4 6.1 3.4
vector-rev 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2
vliw * * 6.2 5.5 6.3 5.7 6.2 5.6
wc-input1 2.2 2.5 2.2 2.5 2.1 2.6 2.1 2.6
wc-scanStream 3.4 2.5 3.4 2.5 3.1 2.4 3.1 2.4
zebra 2.3 2.2 2.3 2.2 2.3 2.2 2.3 2.2
zern 34.3 31.4 34.4 31.4 34.2 31.2 34.3 31.1
run time ratio
benchmark MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut 1.0 1.0 1.0 1.1 0.9 1.2 1.0
checksum 1.0 1.0 1.0 1.0 1.0 1.0 1.0
count-graphs 1.0 1.0 1.0 1.0 1.0 1.0 1.0
DLXSimulator 1.0 1.0 1.0 1.1 1.0 1.1 1.0
fft 1.0 1.0 1.0 1.0 1.0 1.0 1.0
fib 1.0 1.0 1.0 1.0 1.0 1.0 1.0
hamlet * 1.0 * 1.0 * 1.0 *
knuth-bendix 1.0 1.0 0.9 1.0 0.9 1.0 0.9
lexgen 0.9 1.0 1.0 1.0 1.0 1.0 0.9
life 0.9 1.0 1.0 1.0 0.9 1.0 0.9
logic 0.9 1.0 0.9 1.0 1.0 1.0 1.0
mandelbrot 0.9 1.0 0.9 1.0 0.9 1.0 0.9
matrix-multiply 0.4 1.0 0.4 1.0 0.4 1.0 0.4
md5 0.8 1.0 0.8 1.0 0.8 1.0 0.8
merge 1.0 1.1 1.1 1.0 1.0 1.1 1.1
mlyacc * 1.0 * 1.0 * 1.0 *
mpuz 0.9 1.0 0.9 1.0 0.9 1.0 0.9
nucleic 1.0 1.0 1.0 1.1 1.0 1.0 1.0
peek 0.7 1.0 0.7 1.0 0.7 1.0 0.7
psdes-random 1.0 1.0 1.0 1.0 1.0 1.0 1.0
ratio-regions 1.0 1.0 1.0 1.0 1.0 1.0 1.0
ray 1.0 1.0 1.0 1.0 1.0 1.0 1.0
raytrace * 1.0 * 1.0 * 1.0 *
simple 1.0 1.0 1.0 1.0 1.0 1.0 1.0
smith-normal-form 1.0 1.0 1.0 1.0 1.0 1.0 1.0
tailfib 0.7 1.0 0.7 1.0 0.7 1.0 0.7
tak 1.0 1.0 1.0 1.0 1.0 1.0 1.0
tensor 0.5 1.0 0.5 1.0 0.6 1.0 0.6
tsp 1.0 1.0 1.0 1.0 1.0 1.0 1.0
tyan 0.9 1.0 0.9 1.0 0.9 1.0 0.9
vector-concat 0.6 1.0 0.6 1.0 0.6 1.0 0.6
vector-rev 1.0 1.0 1.0 1.0 1.0 1.0 1.0
vliw * ~1.0 ~1.0 ~1.0 ~1.0 ~1.0 ~1.0
wc-input1 1.1 1.0 1.1 0.9 1.2 0.9 1.2
wc-scanStream 0.7 1.0 0.7 0.9 0.7 0.9 0.7
zebra 1.0 1.0 1.0 1.0 1.0 1.0 1.0
zern 0.9 1.0 0.9 1.0 0.9 1.0 0.9
size
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut 62,928 63,472 62,800 63,280 64,800 65,504 64,592 65,232
checksum 21,688 21,640 21,688 21,640 21,688 21,640 21,688 21,640
count-graphs 43,424 44,368 43,416 44,424 45,464 46,440 45,440 46,432
DLXSimulator 85,856 86,976 85,840 86,928 89,440 90,512 89,312 90,272
fft 30,908 31,180 30,908 31,180 31,284 31,524 31,300 31,524
fib 21,680 21,648 21,680 21,648 21,680 21,648 21,680 21,648
hamlet 1,059,227 * 1,055,579 * 1,286,547 * 1,280,371 *
knuth-bendix 64,993 65,105 64,945 65,057 67,025 67,089 67,177 67,481
lexgen 135,160 135,928 134,584 135,320 147,112 148,392 146,584 147,688
life 40,928 41,200 40,888 41,144 43,440 43,152 43,384 43,160
logic 159,976 159,096 158,312 157,432 254,312 250,184 252,456 248,360
mandelbrot 21,648 21,648 21,648 21,648 21,648 21,648 21,648 21,648
matrix-multiply 22,056 22,072 22,056 22,072 22,056 22,072 22,056 22,072
md5 30,073 29,785 30,073 29,785 30,537 30,201 30,537 30,201
merge 22,824 22,792 22,824 22,792 22,880 22,768 22,880 22,800
mlyacc 450,088 * 448,024 * 527,304 * 525,000 *
mpuz 27,952 27,664 27,952 27,664 28,376 28,024 28,376 28,024
nucleic 61,904 61,856 62,008 61,960 64,040 64,536 62,992 63,184
peek 30,137 30,169 30,137 30,169 31,353 31,177 31,353 31,177
psdes-random 22,736 22,720 22,736 22,720 22,736 22,720 22,736 22,720
ratio-regions 44,704 48,736 44,688 48,720 45,712 49,536 45,720 49,544
ray 72,795 75,211 72,715 75,211 79,067 81,707 78,763 81,339
raytrace 174,680 * 174,664 * 198,248 * 197,896 *
simple 159,556 161,332 158,036 158,084 182,324 186,244 180,772 182,932
smith-normal-form 144,180 145,396 144,172 145,404 145,196 146,156 145,196 146,172
tailfib 21,352 21,304 21,352 21,304 21,352 21,304 21,352 21,304
tak 21,704 21,672 21,696 21,648 21,704 21,672 21,696 21,648
tensor 65,971 64,483 65,971 64,483 67,747 66,611 67,763 66,627
tsp 36,289 36,257 36,305 36,193 36,161 36,609 36,177 36,417
tyan 84,585 85,305 84,441 85,225 90,105 91,113 89,865 91,177
vector-concat 22,360 22,200 22,360 22,200 22,360 22,200 22,360 22,200
vector-rev 22,192 22,128 22,192 22,128 22,184 22,088 22,184 22,088
vliw 290,980 292,796 290,340 291,820 317,100 320,420 316,828 319,596
wc-input1 42,137 42,041 42,129 42,033 44,497 44,369 44,497 44,369
wc-scanStream 44,761 44,329 44,753 44,321 47,209 47,001 47,209 47,001
zebra 122,841 145,929 122,801 145,889 130,145 147,089 130,145 147,089
zern 27,875 27,603 27,875 27,603 27,915 27,707 27,899 27,691
(Ignore the compilation failures in hamlet, mlyacc, and raytrace; I
tracked down the x86-codegen register allocation bug and fixed it.)
Unfortunately, the new non-tail returns do almost nothing; and where they
are doing something, they seem to hurt. Not quite sure why; in theory,
the change in return convention should improve some things.
Here are a couple of notes. Nothing forces all of the continuation
formals to be stack allocated. So we might actually be introducing extra
shuffle in continuation stubs, because the parallel move from actuals
returned to the formals is either the identity or there is a lot of
interference if not all of the formals are stack allocated.
On the other hand, for a return, there is no interference when moving the
return values into their return location below the top of the stack.
I had seen a 10% speed-up on fib two nights ago when I first go this
working; but that was with the limit-check insertion that put limit checks
pretty much everywhere.
I'm running benchmarks with another flag that indicates to the register
allocator that it should stack allocate all continuation formals. We'll
see how that turns out; of course, that now means running 16 versions of
each benchmark, so it'll probably take all day.