[MLton] Question on profile.fun
Matthew Fluet
fluet@cs.cornell.edu
Sat, 4 Jun 2005 10:12:12 -0400 (EDT)
> > I tried the following experiment. Added a flag
> > -profile-dummy {false|true}
> > which instructs profile.fun(line 447) to, in addition to any other
> > profiling work, insert code to increment a dummy field in the gcState at
> > _every_ Profile statement in the RSSA IL program. While this isn't
> > exactly the work required to modify time profiling as I described
> > previously, I figure that it is about on par with that work.
>
> Did you really mean "increment a dummy field in the gcState"? I would
> think that is more costly than "move of a constant integer to a known
> slot in the gc state", which is what you suggested earlier.
Yes; I deliberately chose an operation that would likely be a little more
costly than what would be needed.
> > though, -profile mark does insert the time profiling labels into the
> > Machine IL, and I'm fairly certain that cutting up blocks at the
> > profiling labels is interfering with codegen optimizations)
> ...
> > There is another experiment to be done where labels are not
> > inserted into the Machine IL code.
>
> To be fair to the code-insertion approach, it would be worth doing
> such an experiment with -profile mark and have the backend profile
> pass drop the profiling stuff altogether. That would separate how
> much slowdown is due to missed SSA optimization and how much is due to
> missed codegen optimization. It would also provide a bound on how
> well one could do with the code-insertion approach to time profiling.
Here's a new experiment:
MLton0 -- mlton -profile no
MLton1 -- mlton -profile drop
MLton2 -- mlton -profile drop -profile-dummy true
MLton3 -- mlton -profile label
MLton4 -- mlton -profile label -profile-dummy true
MLton5 -- mlton -profile time
MLton6 -- mlton -profile time -profile-dummy true
I changed the name of -profile mark to -profile label. The new option
-profile drop causes the implementProfiling pass to implement nothing.
Hence, the RSSA IL program out of implementProfiling with -profile drop
is the same as the program into the pass without any profiling statements
(or labels); with -profile-dummy true, the only addition is the dummy
increments. So, I think
MLton0 vs. MLton1 -- slowdown due to missed SSA optimization
MLton1 vs. MLton3 -- slowdown due to missed codegen optimization
MLton0 vs. MLton2 -- bound on code-insertion approach to time profiling
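To make the attribution concrete, each of these summary ratios is just a quotient of absolute run times from the table at the end. A small sketch, using the checksum row of the run-time table as the example input:

```python
# Recompute the three summary ratios from absolute run times (seconds),
# taken from the checksum row of the run-time table below.
times = {"MLton0": 109.64, "MLton1": 178.60, "MLton2": 185.90, "MLton3": 181.71}

def ratio(a, b):
    return round(times[a] / times[b], 2)

# Slowdown due to missed SSA optimization (MLton1 vs. MLton0):
ssa = ratio("MLton1", "MLton0")      # 1.63
# Slowdown due to missed codegen optimization (MLton3 vs. MLton1):
codegen = ratio("MLton3", "MLton1")  # 1.02
# Bound on the code-insertion approach (MLton2 vs. MLton0):
bound = ratio("MLton2", "MLton0")    # 1.70
```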
> Inserting the profile labels and causing the codegen to miss
> optimizations is clearly interference, and it would be nice to avoid.
> We'll know how much it costs once we do the experiment to drop
> profiling annotations after the SSA optimizer is done.
The experiment described above and presented below should answer this
question.
> We don't want the codegen to misattribute time to the wrong source
> function and we would like to not inhibit its optimizations. Perhaps
> we could avoid the interference with a different approach to profiling
> in the codegen. Instead of annotating each basic block, we could
> annotate each instruction in the Machine IL with a sourceseq. Then,
> we would require the codegen to preserve that annotation throughout.
> There are clearly some issues where the codegen combines instructions,
> etc. But it could probably do a good job, and would have complete
> freedom to move code, combine or split basic blocks, etc. We could
> emit a table that maps pc -> sourceseq, similar to our current table,
> except probably not as compact, since the current table maps pc ranges
> to sourceseq (since a label applies to an entire block).
>
> I don't see why that won't work, although it requires some significant
> hacking on the codegen.
It has the downside of making native codegen profiling different from C
codegen profiling. Anyway, since we'll shortly see that the profiling
labels don't affect the codegen optimizations appreciably, I don't think
this matters.
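For reference, the per-instruction proposal above would need a pc -> sourceseq map, while the current per-block table maps pc ranges; the range lookup amounts to a binary search over sorted range starts. A minimal sketch, with a hypothetical table layout (illustrative, not MLton's actual on-disk format):

```python
import bisect

# Hypothetical pc-range table: sorted range starts, each covering the
# addresses up to the next start (a label applies to an entire block).
range_starts = [0x1000, 0x1040, 0x10a0, 0x1100]
sourceseqs   = [3, 7, 2, 9]

def sourceseq_at(pc):
    # Find the rightmost range start <= pc; its sourceseq applies.
    i = bisect.bisect_right(range_starts, pc) - 1
    return sourceseqs[i]
```

A per-instruction pc -> sourceseq table would replace each range with a single address, which is why it would likely be less compact.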
So here are the relevant results from the new experiment:
# MLton1/MLton0 <=
-- ----------------
7 1.0
28 1.1
36 1.2
37 1.3
40 1.4
40 1.5
outliers 1.63 checksum, 7.0 wc-scanStream
So, this seems to suggest that the slowdown due to missed SSA
optimizations is fairly low, though it is the cause of the insane behavior
of wc-scanStream. Knowing that, it is probably worth adding a TODO to
investigate.
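The rows in these summaries are cumulative counts: each line gives the number of benchmarks whose ratio is at or below that threshold, with anything above the last threshold listed as an outlier. A sketch of the derivation, using a small illustrative subset of ratios rather than the full suite:

```python
# Cumulative-count histogram, as in the summaries above.  The ratios
# here are a small illustrative subset, not the full benchmark suite.
ratios = [0.97, 1.00, 1.04, 1.08, 1.36, 1.63, 7.07]
thresholds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

# Each row: number of benchmarks with ratio <= threshold.
histogram = {t: sum(1 for r in ratios if r <= t) for t in thresholds}
# Everything beyond the last threshold is reported as an outlier.
outliers = [r for r in ratios if r > thresholds[-1]]
```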
# MLton3/MLton1 <=
-- ----------------
15 1.0
40 1.1
42 1.2
So, the labels have virtually no effect on codegen optimizations. The two
"outliers" are 1.15 life, 1.19 zebra.
# MLton2/MLton0 <=
-- ----------------
2 1.0
15 1.1
25 1.2
27 1.3
31 1.4
33 1.5
outliers 1.5 tailfib, 1.6 psdes-random, 1.6 md5, 1.7 checksum,
1.7 zebra, 1.7 life, 2.0 peek, 2.4 imp-for,
7.8 wc-scanStream
So, I admit that the bound on code insertion for time profiling isn't all
that good. Again, we might argue that this is an upper bound, since the
inserted code is more expensive than a single move. I'll see about
queueing up that experiment.
# MLton5/MLton0 <=
-- ----------------
15 1.0
29 1.1
32 1.2
37 1.3
38 1.4
40 1.5
outliers 1.6 checksum, 7.6 wc-scanStream
And, furthermore, time profiling with labels yields a much better cost
ratio.
Finally, another argument against the code insertion approach is that
these benchmarks are run without profiling the Basis Library. Since
labels are inserted into every RSSA IL block, even when profiling the
Basis Library the cost seen above shouldn't change; with the code
insertion approach, however, there would be more transitions in the
profiling graph, corresponding to more work done by the inserted code.
So, I'm convinced that the code insertion technique is too intrusive for
time profiling.
Here's the complete data:
MLton0 -- mlton -profile no
MLton1 -- mlton -profile drop
MLton2 -- mlton -profile drop -profile-dummy true
MLton3 -- mlton -profile label
MLton4 -- mlton -profile label -profile-dummy true
MLton5 -- mlton -profile time
MLton6 -- mlton -profile time -profile-dummy true
run time ratio
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 1.00 1.04 1.08 1.03 1.11 0.98 1.00
boyer 1.00 1.05 1.09 1.04 1.10 1.02 1.05
checksum 1.00 1.63 1.70 1.66 1.68 1.60 1.67
count-graphs 1.00 1.06 1.19 1.05 1.17 1.04 1.30
DLXSimulator 1.00 1.05 1.02 1.07 1.04 1.08 1.09
fft 1.00 1.00 1.01 0.99 1.03 1.00 1.04
fib 1.00 1.36 1.44 1.39 1.47 1.40 1.49
flat-array 1.00 1.11 1.17 1.12 1.12 1.08 1.17
hamlet 1.00 1.16 1.21 1.18 1.21 1.21 1.16
imp-for 1.00 1.04 2.40 1.02 2.41 1.00 2.35
knuth-bendix 1.00 1.19 1.30 1.22 1.29 1.22 1.30
lexgen 1.00 1.04 1.05 1.06 1.12 1.02 1.11
life 1.00 1.14 1.74 1.31 1.81 1.29 1.83
logic 1.00 1.09 1.08 1.09 1.19 1.12 1.17
mandelbrot 1.00 1.04 1.13 1.05 1.11 0.66 0.76
matrix-multiply 1.00 1.02 1.17 1.00 1.15 0.92 1.26
md5 1.00 1.36 1.62 1.44 1.82 1.46 1.76
merge 1.00 1.04 1.06 1.04 1.07 1.01 1.08
mlyacc 1.00 1.02 1.05 1.07 1.07 1.07 1.04
model-elimination 1.00 1.03 1.06 1.05 1.05 1.09 1.04
mpuz 1.00 1.04 1.37 1.05 1.36 1.05 1.33
nucleic 1.00 0.97 1.00 1.02 1.03 0.98 1.02
output1 1.00 0.97 1.17 0.96 1.16 0.97 1.15
peek 1.00 1.24 1.99 1.24 2.01 1.24 1.98
psdes-random 1.00 1.10 1.60 1.10 1.59 1.08 1.51
ratio-regions 1.00 1.19 1.31 1.21 1.38 1.19 1.28
ray 1.00 1.10 1.14 1.08 1.14 1.09 1.14
raytrace 1.00 1.04 1.08 1.07 1.08 1.08 1.14
simple 1.00 0.96 1.11 1.02 1.10 0.90 1.00
smith-normal-form 1.00 1.04 1.04 1.04 1.00 1.01 1.01
tailfib 1.00 0.95 1.52 0.96 1.50 0.93 1.50
tak 1.00 1.36 1.37 1.44 1.46 1.41 1.46
tensor 1.00 0.82 1.49 0.83 1.48 0.82 1.47
tsp 1.00 1.01 1.06 1.04 1.06 1.04 1.02
tyan 1.00 1.11 1.14 1.16 1.15 1.09 1.18
vector-concat 1.00 1.01 1.00 1.01 1.02 0.98 0.96
vector-rev 1.00 1.04 1.11 1.00 1.07 1.03 1.11
vliw 1.00 1.07 1.06 1.09 1.10 1.05 1.11
wc-input1 1.00 1.13 1.31 1.16 1.38 1.17 1.31
wc-scanStream 1.00 7.07 7.76 7.13 7.74 7.61 7.48
zebra 1.00 1.11 1.72 1.32 1.62 1.28 1.59
zern 1.00 1.00 1.15 1.05 1.09 1.01 1.15
size
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 99,700 120,058 122,938 145,314 148,194 149,084 151,976
boyer 135,375 162,147 171,379 206,323 215,699 213,565 222,925
checksum 50,095 52,627 52,835 55,755 55,979 62,725 62,949
count-graphs 63,135 76,263 78,911 93,783 96,423 102,129 104,657
DLXSimulator 126,067 165,179 172,091 216,299 222,707 221,613 228,021
fft 61,358 67,718 68,934 75,982 77,150 84,232 85,384
fib 44,691 47,047 47,207 50,143 50,351 57,129 57,337
flat-array 44,715 46,999 47,159 50,143 50,303 57,113 57,273
hamlet 1,246,854 1,913,294 2,061,582 2,699,934 2,858,062 2,704,854 2,862,966
imp-for 44,547 47,711 48,159 52,551 53,031 59,457 59,937
knuth-bendix 105,907 133,055 137,823 165,595 170,571 172,517 177,637
lexgen 199,332 270,180 283,412 358,500 372,220 361,020 374,724
life 62,059 71,295 72,823 85,407 86,879 91,801 93,305
logic 103,567 133,503 138,303 172,679 177,751 179,857 184,913
mandelbrot 44,643 47,055 47,295 50,215 50,439 57,153 57,393
matrix-multiply 46,294 49,750 50,166 54,414 54,846 61,200 61,632
md5 74,531 84,991 85,647 97,431 98,151 103,953 104,721
merge 46,271 49,279 49,599 53,255 53,575 60,161 60,497
mlyacc 501,140 696,352 733,048 933,496 971,240 935,984 973,728
model-elimination 631,901 884,693 933,909 1,236,233 1,286,281 1,241,211 1,291,339
mpuz 47,307 53,543 54,695 60,927 62,015 69,345 70,289
nucleic 196,246 208,546 209,810 221,650 223,026 228,420 229,812
output1 77,373 84,953 85,561 97,689 98,377 99,465 100,169
peek 73,483 81,495 82,231 92,551 93,335 97,649 98,321
psdes-random 45,355 48,407 48,727 52,255 52,591 59,177 59,513
ratio-regions 70,387 102,063 108,191 126,567 132,791 135,497 141,769
ray 178,284 226,868 231,620 291,556 296,644 298,444 303,532
raytrace 260,497 326,209 342,833 440,625 457,457 441,497 458,473
simple 219,103 302,895 317,871 379,215 394,463 383,857 399,249
smith-normal-form 178,867 196,427 199,259 220,459 223,275 225,973 228,837
tailfib 44,387 46,679 46,855 49,727 49,919 56,681 56,873
tak 44,771 46,943 47,071 49,959 50,119 56,993 57,153
tensor 94,850 117,762 123,322 156,514 161,466 161,940 166,900
tsp 79,059 90,295 91,895 105,999 107,679 112,377 114,121
tyan 132,123 175,379 182,851 232,427 240,331 238,141 246,101
vector-concat 45,971 48,375 48,503 52,047 52,191 59,001 59,145
vector-rev 45,199 47,863 48,039 51,311 51,503 58,313 58,505
vliw 387,187 668,207 706,111 916,471 955,039 919,071 957,623
wc-input1 99,071 115,083 116,139 138,827 140,027 140,963 142,163
wc-scanStream 106,215 123,483 124,619 151,215 152,495 153,415 154,695
zebra 121,515 157,363 174,371 264,243 277,699 270,589 284,125
zern 85,796 94,308 95,460 106,444 107,724 115,222 116,534
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 9.15 10.92 11.15 10.87 11.43 11.58 11.96
boyer 9.94 11.04 12.43 11.82 13.21 12.06 11.93
checksum 7.53 7.15 7.00 7.77 7.77 7.35 7.48
count-graphs 7.91 8.41 8.80 8.62 8.81 9.31 9.52
DLXSimulator 10.78 12.97 12.71 13.43 12.97 14.12 13.42
fft 7.97 7.73 8.45 8.57 8.51 8.84 8.24
fib 7.45 7.69 7.61 6.93 7.06 7.88 6.95
flat-array 7.44 7.67 7.49 7.73 6.67 8.00 7.35
hamlet 70.92 86.31 93.04 101.31 105.00 100.84 107.82
imp-for 6.89 7.70 7.65 7.62 6.98 6.80 6.75
knuth-bendix 9.74 10.03 10.81 10.94 10.84 10.80 10.39
lexgen 13.58 15.15 15.23 15.40 17.45 16.89 14.92
life 8.13 8.00 8.52 8.26 8.48 8.65 8.16
logic 9.66 10.10 10.72 10.38 11.37 9.68 10.92
mandelbrot 7.42 7.31 6.83 7.62 6.86 7.84 7.23
matrix-multiply 7.62 7.88 7.25 7.98 6.87 7.10 8.21
md5 8.61 7.62 9.10 7.80 7.89 9.40 8.71
merge 7.58 7.46 7.66 7.78 7.73 7.69 7.23
mlyacc 29.55 36.79 39.08 40.88 41.68 39.53 43.13
model-elimination 32.96 41.64 42.02 45.17 48.33 47.00 48.00
mpuz 7.21 7.40 7.68 7.79 8.03 8.27 7.07
nucleic 16.39 16.08 14.65 17.13 16.38 16.92 17.05
output1 7.94 8.88 8.98 8.21 7.93 9.10 8.82
peek 8.40 8.93 8.91 8.63 9.04 7.72 9.07
psdes-random 7.15 7.46 7.62 7.61 7.78 8.06 7.65
ratio-regions 8.91 9.88 10.08 9.94 10.42 10.50 10.15
ray 12.42 14.21 13.99 14.59 15.30 13.20 14.75
raytrace 17.66 19.58 20.51 21.35 21.60 19.70 20.88
simple 13.02 15.28 16.07 15.51 17.99 17.40 16.91
smith-normal-form 11.85 12.66 13.81 13.37 12.89 14.02 13.89
tailfib 7.03 6.99 7.39 7.76 7.51 7.50 6.87
tak 7.48 7.34 7.72 7.24 7.73 7.95 7.00
tensor 10.82 11.80 11.20 12.43 10.95 12.32 12.20
tsp 9.00 9.61 9.43 9.81 8.35 9.87 8.89
tyan 11.02 12.19 12.08 13.77 14.17 13.52 12.50
vector-concat 7.61 7.48 7.35 7.40 7.50 7.57 7.18
vector-rev 6.82 6.97 7.71 6.61 7.66 8.06 7.71
vliw 23.60 32.21 32.51 35.16 36.15 34.72 36.15
wc-input1 9.45 9.59 9.72 10.01 10.34 10.66 10.75
wc-scanStream 9.46 9.25 10.72 10.70 10.40 9.75 10.76
zebra 11.22 11.91 12.38 12.29 13.91 13.37 13.76
zern 7.65 8.85 7.50 9.08 8.09 9.26 8.63
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 58.28 60.76 63.02 59.88 64.67 57.13 58.33
boyer 63.62 66.60 69.57 66.17 69.72 64.97 66.82
checksum 109.64 178.60 185.90 181.71 184.22 175.48 182.80
count-graphs 48.10 51.05 57.30 50.51 56.42 50.12 62.30
DLXSimulator 101.19 106.11 102.82 108.59 104.82 109.50 110.39
fft 39.87 39.95 40.27 39.38 41.16 40.05 41.31
fib 79.10 107.22 114.01 109.75 116.04 110.39 118.08
flat-array 29.44 32.80 34.46 33.05 33.00 31.86 34.46
hamlet 58.48 67.73 70.72 69.05 70.49 70.84 67.86
imp-for 51.31 53.30 122.94 52.24 123.57 51.27 120.71
knuth-bendix 44.89 53.43 58.42 54.72 57.97 54.73 58.40
lexgen 50.62 52.60 53.16 53.42 56.73 51.72 56.25
life 16.54 18.77 28.74 21.72 29.86 21.27 30.25
logic 60.57 65.73 65.13 66.22 72.22 67.79 70.81
mandelbrot 93.38 97.49 105.28 97.76 103.28 62.07 71.43
matrix-multiply 8.29 8.41 9.70 8.32 9.52 7.67 10.43
md5 61.49 83.55 99.73 88.69 112.14 89.53 108.31
merge 95.39 99.47 101.25 98.78 101.82 96.20 103.46
mlyacc 47.16 48.17 49.60 50.55 50.70 50.43 48.98
model-elimination 94.79 98.02 100.49 99.09 99.82 103.40 98.83
mpuz 45.98 47.66 63.07 48.48 62.36 48.42 61.16
nucleic 51.82 50.19 51.89 52.90 53.17 50.99 53.10
output1 17.39 16.95 20.36 16.65 20.14 16.84 20.02
peek 39.99 49.43 79.72 49.70 80.20 49.63 79.10
psdes-random 44.28 48.78 70.70 48.71 70.43 47.72 66.69
ratio-regions 59.10 70.50 77.65 71.45 81.47 70.07 75.60
ray 33.72 37.08 38.35 36.31 38.47 36.59 38.57
raytrace 48.41 50.38 52.23 51.67 52.42 52.35 55.12
simple 71.23 68.30 79.10 72.55 78.35 64.12 71.17
smith-normal-form 41.19 42.84 43.02 42.99 41.34 41.76 41.60
tailfib 49.78 47.43 75.59 47.55 74.48 46.55 74.83
tak 31.22 42.47 42.82 44.91 45.51 43.93 45.52
tensor 67.00 55.27 100.14 55.49 99.03 55.10 98.52
tsp 72.60 73.00 77.02 75.44 77.13 75.27 74.17
tyan 64.98 71.94 74.03 75.22 74.78 70.59 76.77
vector-concat 105.55 106.09 105.51 106.63 107.69 103.12 101.55
vector-rev 128.94 133.73 142.84 128.60 137.88 133.25 143.30
vliw 63.06 67.47 67.05 68.47 69.36 65.96 70.29
wc-input1 44.98 50.97 58.86 52.11 62.26 52.41 58.80
wc-scanStream 38.30 270.89 297.11 273.18 296.50 291.51 286.35
zebra 45.81 50.84 78.70 60.67 74.36 58.51 72.78
zern 45.90 46.10 52.77 48.16 49.88 46.52 52.74