non-tail returns

Fri, 26 Oct 2001 10:50:30 -0400 (EDT)

I tried out Steve's proposal for faster non-tail returns.  On a return,
the callee puts the return values in slots below the return address.  This
also changes the code for a non-tail call (in the sense that we calculate
a different stack frame size) and for cont stubs (where we need the right
stack frame size and copy from the new locations).  That's all o.k.  There
is a little more trickery in getting the stack allocation to give the
continuation formals the right slots so that they line up with where a
non-tail callee is putting them.  Anyways, here are the benchmark results:

MLton0 -- mlton -limit-check-per-block false -new-return false -native-live-stack false
MLton1 -- mlton -limit-check-per-block false -new-return false -native-live-stack true
MLton2 -- mlton -limit-check-per-block false -new-return true -native-live-stack false
MLton3 -- mlton -limit-check-per-block false -new-return true -native-live-stack true
MLton4 -- mlton -limit-check-per-block true -new-return false -native-live-stack false
MLton5 -- mlton -limit-check-per-block true -new-return false -native-live-stack true
MLton6 -- mlton -limit-check-per-block true -new-return true -native-live-stack false
MLton7 -- mlton -limit-check-per-block true -new-return true -native-live-stack true
compile time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut           2.3    2.5    2.3    2.5    2.4    2.5    2.3    2.5
checksum             0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
count-graphs         1.7    1.9    1.6    1.9    1.7    2.0    1.7    2.0
DLXSimulator         4.0    4.9    4.0    4.9    4.0    5.0    4.0    5.0
fft                  1.2    1.4    1.2    1.3    1.2    1.4    1.2    1.4
fib                  0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
hamlet              45.8      *   44.3      *   48.0      *   47.6      *
knuth-bendix         2.2    2.4    2.2    2.4    2.2    2.5    2.2    2.5
lexgen               5.0    6.1    5.0    6.0    5.3    6.4    5.2    6.4
life                 1.3    1.4    1.3    1.4    1.3    1.5    1.3    1.5
logic                5.7    6.0    5.7    6.0    7.6    8.8    7.5    8.8
mandelbrot           0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
matrix-multiply      0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
md5                  1.2    1.3    1.2    1.3    1.2    1.3    1.2    1.3
merge                0.6    0.6    0.6    0.6    0.6    0.7    0.6    0.6
mlyacc              21.3      *   20.9      *   23.8      *   23.7      *
mpuz                 0.8    0.9    0.8    0.9    0.9    0.9    0.9    0.9
nucleic              2.9    3.0    2.9    3.0    2.9    3.0    2.9    3.0
peek                 1.0    1.1    1.0    1.0    1.0    1.1    1.0    1.1
psdes-random         0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.7
ratio-regions        2.4    3.2    2.5    3.2    2.4    3.3    2.5    3.3
ray                  3.2    3.9    3.1    3.9    3.3    4.1    3.3    4.0
raytrace             8.3      *    8.3      *    8.4      *    8.5      *
simple               6.2    7.7    6.2    7.7    6.7    8.5    6.5    8.4
smith-normal-form    7.4    7.9    7.5    7.9    7.5    8.0    7.5    7.9
tailfib              0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
tak                  0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
tensor               2.8    3.2    2.8    3.2    2.8    3.3    2.8    3.2
tsp                  1.5    1.6    1.5    1.6    1.4    1.6    1.5    1.6
tyan                 3.6    4.0    3.6    4.0    3.7    4.3    3.7    4.3
vector-concat        0.6    0.6    0.6    0.7    0.6    0.6    0.6    0.6
vector-rev           0.6    0.6    0.6    0.6    0.6    0.6    0.6    0.6
vliw                11.7   13.7   11.7   13.5   12.2   14.5   12.2   14.3
wc-input1            1.5    1.7    1.5    1.7    1.6    1.8    1.6    1.8
wc-scanStream        1.6    1.9    1.6    1.8    1.7    1.9    1.7    1.9
zebra                8.4   12.6    8.5   12.7    9.6   15.6    9.5   15.6
zern                 1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
run time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut           4.0    3.8    4.2    4.1    4.2    3.8    4.6    4.0
checksum             3.1    3.0    3.1    3.0    3.1    3.0    3.1    3.0
count-graphs         4.9    4.9    5.0    5.0    5.0    4.9    5.0    4.9
DLXSimulator        14.8   14.4   15.4   15.0   15.8   14.8   16.0   15.2
fft                  8.7    8.6    8.7    8.7    8.6    8.6    8.6    8.5
fib                  3.4    3.4    3.4    3.4    3.4    3.4    3.4    3.4
hamlet               7.9      *    8.0      *    8.2      *    8.3      *
knuth-bendix         6.6    6.3    6.5    6.2    6.6    6.2    6.4    6.1
lexgen              10.2    9.6   10.4    9.8   10.3    9.7   10.4    9.5
life                 8.1    7.7    8.2    7.7    7.8    7.0    7.8    7.0
logic               25.4   24.1   25.4   24.1   25.9   25.3   25.8   25.4
mandelbrot           6.9    5.9    6.9    5.9    6.9    5.9    6.9    5.9
matrix-multiply      5.2    2.2    5.1    2.3    5.2    2.2    5.1    2.3
md5                  0.5    0.4    0.5    0.4    0.4    0.3    0.4    0.3
merge               48.7   48.0   54.5   53.9   49.5   48.4   54.1   53.8
mlyacc               9.3      *    9.4      *    9.4      *    9.5      *
mpuz                 4.6    4.3    4.6    4.3    4.7    4.3    4.7    4.3
nucleic              6.9    6.7    7.1    6.9    7.4    7.1    7.1    7.0
peek                 3.6    2.6    3.6    2.6    3.6    2.4    3.6    2.4
psdes-random         3.4    3.3    3.4    3.3    3.4    3.3    3.4    3.3
ratio-regions        8.4    8.5    8.4    8.5    8.4    8.5    8.4    8.5
ray                  3.7    3.7    3.8    3.8    3.7    3.7    3.8    3.7
raytrace             4.6      *    4.6      *    4.7      *    4.8      *
simple               6.0    5.9    6.2    6.1    6.0    5.8    6.2    6.0
smith-normal-form    0.9    0.9    0.9    0.9    0.9    0.9    0.9    0.9
tailfib             16.3   10.6   16.3   10.6   16.3   10.6   16.3   10.7
tak                  7.8    7.6    7.8    7.9    7.8    7.6    7.8    7.9
tensor               7.1    3.7    7.1    3.7    7.0    4.1    7.0    4.1
tsp                  9.1    9.1    9.1    9.1    8.9    8.9    9.0    8.9
tyan                19.5   18.1   19.5   18.0   19.6   17.7   19.6   17.7
vector-concat        6.0    3.4    6.2    3.4    5.8    3.4    6.1    3.4
vector-rev           4.2    4.2    4.2    4.2    4.2    4.2    4.2    4.2
vliw                   *      *    6.2    5.5    6.3    5.7    6.2    5.6
wc-input1            2.2    2.5    2.2    2.5    2.1    2.6    2.1    2.6
wc-scanStream        3.4    2.5    3.4    2.5    3.1    2.4    3.1    2.4
zebra                2.3    2.2    2.3    2.2    2.3    2.2    2.3    2.2
zern                34.3   31.4   34.4   31.4   34.2   31.2   34.3   31.1
run time ratio
benchmark         MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7
barnes-hut           1.0    1.0    1.0    1.1    0.9    1.2    1.0
checksum             1.0    1.0    1.0    1.0    1.0    1.0    1.0
count-graphs         1.0    1.0    1.0    1.0    1.0    1.0    1.0
DLXSimulator         1.0    1.0    1.0    1.1    1.0    1.1    1.0
fft                  1.0    1.0    1.0    1.0    1.0    1.0    1.0
fib                  1.0    1.0    1.0    1.0    1.0    1.0    1.0
hamlet                 *    1.0      *    1.0      *    1.0      *
knuth-bendix         1.0    1.0    0.9    1.0    0.9    1.0    0.9
lexgen               0.9    1.0    1.0    1.0    1.0    1.0    0.9
life                 0.9    1.0    1.0    1.0    0.9    1.0    0.9
logic                0.9    1.0    0.9    1.0    1.0    1.0    1.0
mandelbrot           0.9    1.0    0.9    1.0    0.9    1.0    0.9
matrix-multiply      0.4    1.0    0.4    1.0    0.4    1.0    0.4
md5                  0.8    1.0    0.8    1.0    0.8    1.0    0.8
merge                1.0    1.1    1.1    1.0    1.0    1.1    1.1
mlyacc                 *    1.0      *    1.0      *    1.0      *
mpuz                 0.9    1.0    0.9    1.0    0.9    1.0    0.9
nucleic              1.0    1.0    1.0    1.1    1.0    1.0    1.0
peek                 0.7    1.0    0.7    1.0    0.7    1.0    0.7
psdes-random         1.0    1.0    1.0    1.0    1.0    1.0    1.0
ratio-regions        1.0    1.0    1.0    1.0    1.0    1.0    1.0
ray                  1.0    1.0    1.0    1.0    1.0    1.0    1.0
raytrace               *    1.0      *    1.0      *    1.0      *
simple               1.0    1.0    1.0    1.0    1.0    1.0    1.0
smith-normal-form    1.0    1.0    1.0    1.0    1.0    1.0    1.0
tailfib              0.7    1.0    0.7    1.0    0.7    1.0    0.7
tak                  1.0    1.0    1.0    1.0    1.0    1.0    1.0
tensor               0.5    1.0    0.5    1.0    0.6    1.0    0.6
tsp                  1.0    1.0    1.0    1.0    1.0    1.0    1.0
tyan                 0.9    1.0    0.9    1.0    0.9    1.0    0.9
vector-concat        0.6    1.0    0.6    1.0    0.6    1.0    0.6
vector-rev           1.0    1.0    1.0    1.0    1.0    1.0    1.0
vliw                   *   ~1.0   ~1.0   ~1.0   ~1.0   ~1.0   ~1.0
wc-input1            1.1    1.0    1.1    0.9    1.2    0.9    1.2
wc-scanStream        0.7    1.0    0.7    0.9    0.7    0.9    0.7
zebra                1.0    1.0    1.0    1.0    1.0    1.0    1.0
zern                 0.9    1.0    0.9    1.0    0.9    1.0    0.9
size
benchmark            MLton0  MLton1    MLton2  MLton3    MLton4  MLton5    MLton6  MLton7
barnes-hut           62,928  63,472    62,800  63,280    64,800  65,504    64,592  65,232
checksum             21,688  21,640    21,688  21,640    21,688  21,640    21,688  21,640
count-graphs         43,424  44,368    43,416  44,424    45,464  46,440    45,440  46,432
DLXSimulator         85,856  86,976    85,840  86,928    89,440  90,512    89,312  90,272
fft                  30,908  31,180    30,908  31,180    31,284  31,524    31,300  31,524
fib                  21,680  21,648    21,680  21,648    21,680  21,648    21,680  21,648
hamlet            1,059,227       * 1,055,579       * 1,286,547       * 1,280,371       *
knuth-bendix         64,993  65,105    64,945  65,057    67,025  67,089    67,177  67,481
lexgen              135,160 135,928   134,584 135,320   147,112 148,392   146,584 147,688
life                 40,928  41,200    40,888  41,144    43,440  43,152    43,384  43,160
logic               159,976 159,096   158,312 157,432   254,312 250,184   252,456 248,360
mandelbrot           21,648  21,648    21,648  21,648    21,648  21,648    21,648  21,648
matrix-multiply      22,056  22,072    22,056  22,072    22,056  22,072    22,056  22,072
md5                  30,073  29,785    30,073  29,785    30,537  30,201    30,537  30,201
merge                22,824  22,792    22,824  22,792    22,880  22,768    22,880  22,800
mlyacc              450,088       *   448,024       *   527,304       *   525,000       *
mpuz                 27,952  27,664    27,952  27,664    28,376  28,024    28,376  28,024
nucleic              61,904  61,856    62,008  61,960    64,040  64,536    62,992  63,184
peek                 30,137  30,169    30,137  30,169    31,353  31,177    31,353  31,177
psdes-random         22,736  22,720    22,736  22,720    22,736  22,720    22,736  22,720
ratio-regions        44,704  48,736    44,688  48,720    45,712  49,536    45,720  49,544
ray                  72,795  75,211    72,715  75,211    79,067  81,707    78,763  81,339
raytrace            174,680       *   174,664       *   198,248       *   197,896       *
simple              159,556 161,332   158,036 158,084   182,324 186,244   180,772 182,932
smith-normal-form   144,180 145,396   144,172 145,404   145,196 146,156   145,196 146,172
tailfib              21,352  21,304    21,352  21,304    21,352  21,304    21,352  21,304
tak                  21,704  21,672    21,696  21,648    21,704  21,672    21,696  21,648
tensor               65,971  64,483    65,971  64,483    67,747  66,611    67,763  66,627
tsp                  36,289  36,257    36,305  36,193    36,161  36,609    36,177  36,417
tyan                 84,585  85,305    84,441  85,225    90,105  91,113    89,865  91,177
vector-concat        22,360  22,200    22,360  22,200    22,360  22,200    22,360  22,200
vector-rev           22,192  22,128    22,192  22,128    22,184  22,088    22,184  22,088
vliw                290,980 292,796   290,340 291,820   317,100 320,420   316,828 319,596
wc-input1            42,137  42,041    42,129  42,033    44,497  44,369    44,497  44,369
wc-scanStream        44,761  44,329    44,753  44,321    47,209  47,001    47,209  47,001
zebra               122,841 145,929   122,801 145,889   130,145 147,089   130,145 147,089
zern                 27,875  27,603    27,875  27,603    27,915  27,707    27,899  27,691

(Ignore the compilation failures in hamlet, mlyacc, and raytrace; I
tracked down the x86-codegen register allocation bug and fixed it.)

Unfortunately, the new non-tail returns do almost nothing; and where they
are doing something, they seem to hurt.  Not quite sure why; in theory,
the change in return convention should improve some things.

Here are a couple of notes.  Nothing forces all of the continuation
formals to be stack allocated.  So we might actually be introducing extra
shuffle in continuation stubs, because the parallel move from actuals
returned to the formals is either the identity or there is a lot of
interference if not all of the formals are stack allocated.

On the other hand, for a return, there is no interference when moving the
return values into their return location below the top of the stack.

I had seen a 10% speed-up on fib two nights ago when I first go this
working; but that was with the limit-check insertion that put limit checks
pretty much everywhere.

I'm running benchmarks with another flag that indicates to the register
allocator that it should stack allocate all continuation formals.  We'll
see how that turns out; of course, that now means running 16 versions of
each benchmark, so it'll probably take all day.