[MLton] Question on profile.fun
Matthew Fluet
fluet@cs.cornell.edu
Thu, 2 Jun 2005 20:45:55 -0400 (EDT)
> > You believe that the move of a constant integer to a known slot in
> > the gc state at transitions in the profile graph is too intrusive?
>
> Yes. The point is that it happens all the time, not just at (SSA)
> nontail calls, and not just at (SSA) basic block entries.
> Furthermore, to implement this portably within MLton, the right place
> to put it is at the Machine IL, which means it will interfere with
> codegen optimizations too. I bet it'll hurt more than 50% on some
> benchmarks. That's a lot of skew. I'm already annoyed by the skew
> that we get with -profile time as it is, 20-30% on some benchmarks
> IIRC, although it would be worth rerunning to see where we are today.
>
> In any case, you're welcome to try the experiment. It would be
> interesting to know.
I tried the following experiment. I added a flag
-profile-dummy {false|true}
which instructs profile.fun (line 447) to, in addition to any other
profiling work, insert code that increments a dummy field in the gcState
at _every_ Profile statement in the RSSA IL program. While this isn't
exactly the work required to modify time profiling as I described
previously, I figure it is about on par with that work. (Possibly more,
since the -profile-include/exclude flags may not require every Profile
statement to actually modify the current state.)
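For concreteness, here is a toy sketch of the shape of that
transformation. The datatype and names below are hypothetical stand-ins,
not MLton's actual RSSA representation, and the real pass in profile.fun
works over RSSA blocks rather than a bare statement list:

  (* Hypothetical miniature of a statement sequence; only the Profile
     constructor matters here. *)
  datatype stmt =
     Profile of string   (* a profiling enter/leave marker *)
   | BumpDummy           (* increment the dummy field in the gcState *)
   | Other of string     (* any other statement *)

  (* -profile-dummy true: follow every Profile statement with a bump of
     the dummy gcState field. *)
  fun instrument (stmts: stmt list): stmt list =
     List.concat
     (List.map (fn (s as Profile _) => [s, BumpDummy] | s => [s]) stmts)

  val example =
     instrument [Other "x := y", Profile "enter f", Other "call g"]
  (* = [Other "x := y", Profile "enter f", BumpDummy, Other "call g"] *)

The cost being measured below is exactly that one extra statement per
Profile point.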
Stephen is correct that the difference between
-profile mark -profile-dummy true
and
-profile mark -profile-dummy false
can exceed 50% of the running time of the unprofiled program.
Though, of the 42 benchmark programs:
25 incur less than 10%
30 incur less than 20%
33 incur less than 30%
36 incur less than 40%
38 incur less than 50%
and the last four are at 56%, 65%, 75%, and 133%.
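For reference, since MLton0 is normalized to 1.00 in the run time ratio
table at the end of this message, each percentage above is just the
difference between two columns of that table. As a small sketch (the
function and names are mine, not part of the benchmark scripts), using
imp-for, the worst case:

  (* Overhead of -profile-dummy true over -profile-dummy false, as a
     percentage of the unprofiled (MLton0 = 1.00) run time. *)
  fun overheadPct {without: real, withDummy: real}: real =
     (withDummy - without) * 100.0

  (* imp-for: MLton1 = 0.99, MLton2 = 2.32 *)
  val impFor = overheadPct {without = 0.99, withDummy = 2.32}  (* ~133% *)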
I went ahead and ran the benchmarks with the options
mlton {-profile no,-profile {mark,time,alloc,count} {,-profile-dummy true}}
to measure the current impact of profiling. Note that this leaves the
other profiling flags (-profile-{c,branch,exclude,include,stack,raise}) at
their defaults, and most of those options would simply incur additional
cost.
I note that the difference between
-profile mark -profile-dummy false
and
-profile no
(a measure of the cost of carrying profiling data through the ILs and
optimizations, without incurring the cost of actually doing anything at
runtime; though -profile mark does insert the time profiling labels into
the Machine IL, and I'm fairly certain that cutting up blocks at those
labels interferes with codegen optimizations) follows much the same
pattern as above:
27 incur less than 10%
34 incur less than 20%
36 incur less than 30%
38 incur less than 40%
40 incur less than 50%
and the last two are at 65% and 603%.
The difference between
-profile time -profile-dummy false
and
-profile no
(the measure Stephen was interested in) also follows essentially the same
pattern:
25 incur less than 10%
31 incur less than 20%
36 incur less than 30%
38 incur less than 40%
40 incur less than 50%
and the last two are at 65% and 613%.
So, the additional impact of actually doing time profiling, beyond the
impact of carrying profiling data through the program, is virtually nil:
only one benchmark incurs an additional cost of more than 10%, and that
one is at just 24%.
Anyway, here is the complete data.
MLton0 -- mlton -profile no
MLton1 -- mlton -profile mark
MLton2 -- mlton -profile mark -profile-dummy true
MLton3 -- mlton -profile time
MLton4 -- mlton -profile time -profile-dummy true
MLton5 -- mlton -profile alloc
MLton6 -- mlton -profile alloc -profile-dummy true
MLton7 -- mlton -profile count
MLton8 -- mlton -profile count -profile-dummy true
run time ratio
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
barnes-hut 1.00 1.05 1.09 0.99 1.02 1.32 1.35 2.32 2.34
boyer 1.00 1.09 1.11 1.04 1.07 1.61 1.63 2.86 2.86
checksum 1.00 1.65 1.72 1.65 1.80 1.73 1.72 8.52 8.87
count-graphs 1.00 1.03 1.17 1.04 1.16 2.16 2.28 10.46 10.67
DLXSimulator 1.00 1.00 1.00 1.05 1.05 1.83 1.84 1.04 1.04
fft 1.00 0.97 1.00 0.96 0.99 0.99 0.97 1.18 1.19
fib 1.00 1.39 1.50 1.39 1.49 1.36 1.44 5.32 5.20
flat-array 1.00 1.10 1.16 1.10 1.17 1.10 1.16 8.03 8.07
hamlet 1.00 1.11 1.16 1.13 1.17 1.59 1.66 2.77 2.88
imp-for 1.00 0.99 2.32 0.99 2.32 1.00 2.32 47.49 48.98
knuth-bendix 1.00 1.24 1.33 1.25 1.34 1.45 1.56 10.22 10.38
lexgen 1.00 1.07 1.15 1.06 1.13 1.28 1.31 2.52 2.53
life 1.00 1.31 1.79 1.35 1.85 1.94 2.54 21.97 22.58
logic 1.00 1.08 1.11 1.11 1.14 1.40 1.44 3.33 3.28
mandelbrot 1.00 1.04 1.11 0.68 0.77 0.68 0.78 5.49 5.78
matrix-multiply 1.00 0.92 1.05 0.83 1.16 0.83 1.16 6.85 6.86
md5 1.00 1.45 1.82 1.45 1.83 1.51 1.63 17.45 18.09
merge 1.00 1.01 1.03 0.99 1.05 1.32 1.33 1.38 1.43
mlyacc 1.00 1.04 1.08 1.05 1.08 1.73 1.79 2.43 2.46
model-elimination 1.00 1.08 1.12 1.09 1.13 1.61 1.66 3.07 3.11
mpuz 1.00 1.02 1.33 1.02 1.33 1.00 1.33 24.04 22.47
nucleic 1.00 0.99 1.02 0.99 1.02 1.27 1.30 1.77 1.80
output1 1.00 0.97 1.18 0.97 1.18 0.97 1.18 6.73 6.84
peek 1.00 1.25 2.00 1.25 2.00 1.25 2.00 81.34 82.03
psdes-random 1.00 1.10 1.59 1.10 1.62 1.10 1.63 25.47 26.09
ratio-regions 1.00 1.16 1.30 1.16 1.32 1.21 1.35 8.51 8.67
ray 1.00 1.07 1.12 1.05 1.11 1.14 1.16 4.97 5.00
raytrace 1.00 1.05 1.12 1.06 1.13 1.13 1.19 7.99 8.10
simple 1.00 1.02 1.15 0.94 1.07 1.44 1.47 3.90 3.96
smith-normal-form 1.00 1.00 1.00 1.01 1.01 1.01 1.01 1.02 1.03
tailfib 1.00 0.96 1.52 0.96 1.52 0.96 1.52 20.50 20.79
tak 1.00 1.44 1.47 1.43 1.47 1.36 1.37 4.45 4.50
tensor 1.00 0.84 1.49 0.84 1.49 0.84 1.49 64.94 66.26
tsp 1.00 1.03 1.06 1.27 1.04 1.02 1.05 4.06 4.24
tyan 1.00 1.08 1.12 1.09 1.13 1.84 1.88 3.98 4.07
vector-concat 1.00 1.00 0.99 1.01 1.02 1.15 1.00 1.01 1.01
vector-rev 1.00 0.96 1.03 0.98 1.04 0.97 1.04 3.92 4.01
vliw 1.00 1.14 1.22 1.22 1.20 1.82 1.84 3.30 3.31
wc-input1 1.00 1.20 1.42 1.18 1.40 1.16 1.38 10.20 10.74
wc-scanStream 1.00 7.03 7.29 7.13 7.30 15.56 15.90 15.40 15.38
zebra 1.00 1.19 1.55 1.27 1.70 2.81 3.30 23.19 25.67
zern 1.00 1.01 1.10 1.01 1.10 1.00 1.10 7.12 7.36
size
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
barnes-hut 99,700 145,314 148,194 149,084 151,976 131,076 134,116 140,700 143,468
boyer 135,375 206,323 215,699 213,565 222,925 182,861 192,061 187,989 198,709
checksum 50,095 55,755 55,979 62,725 62,949 58,397 58,589 56,557 56,701
count-graphs 63,135 93,783 96,423 102,129 104,657 89,393 92,041 89,713 91,905
DLXSimulator 126,067 216,299 222,707 221,613 228,021 182,613 189,605 211,717 219,317
fft 61,358 75,982 77,150 84,232 85,384 75,280 76,480 74,496 75,568
fib 44,691 50,143 50,351 57,129 57,337 52,801 52,945 50,177 50,209
flat-array 44,715 50,143 50,303 57,113 57,273 52,801 52,945 50,513 50,577
hamlet 1,246,854 2,699,934 2,858,062 2,704,854 2,862,966 2,354,358 2,520,822 2,586,806 2,740,006
imp-for 44,547 52,551 53,031 59,457 59,937 53,481 53,913 55,249 55,601
knuth-bendix 105,907 165,595 170,571 172,517 177,637 149,637 154,357 178,669 184,941
lexgen 199,332 358,500 372,220 361,020 374,724 308,524 321,588 314,940 329,492
life 62,059 85,407 86,879 91,801 93,305 82,073 83,809 87,185 89,089
logic 103,567 172,679 177,751 179,857 184,913 160,729 165,817 167,105 171,825
mandelbrot 44,643 50,215 50,439 57,153 57,393 52,793 53,033 51,065 51,209
matrix-multiply 46,294 54,414 54,846 61,200 61,632 55,368 55,784 55,880 56,264
md5 74,531 97,431 98,151 103,953 104,721 92,505 93,225 89,385 89,833
merge 46,271 53,255 53,575 60,161 60,497 55,593 55,897 52,657 52,849
mlyacc 501,140 933,496 971,240 935,984 973,728 806,744 842,640 941,984 995,088
model-elimination 631,901 1,236,233 1,286,281 1,241,211 1,291,339 1,058,871 1,109,323 1,164,299 1,218,123
mpuz 47,307 60,927 62,015 69,345 70,289 61,561 62,537 61,641 62,329
nucleic 196,246 221,650 223,026 228,420 229,812 220,484 222,004 233,108 235,060
output1 77,373 97,689 98,377 99,465 100,169 91,721 92,345 83,769 83,977
peek 73,483 92,551 93,335 97,649 98,321 90,481 91,057 86,009 86,329
psdes-random 45,355 52,255 52,591 59,177 59,513 54,129 54,449 53,489 53,793
ratio-regions 70,387 126,567 132,791 135,497 141,769 110,721 116,897 139,945 146,425
ray 178,284 291,556 296,644 298,444 303,532 256,732 261,340 265,868 273,868
raytrace 260,497 440,625 457,457 441,497 458,473 373,169 390,033 421,853 441,661
simple 219,103 379,215 394,463 383,857 399,249 359,777 375,569 402,561 420,609
smith-normal-form 178,867 220,459 223,275 225,973 228,837 206,117 208,965 209,509 212,005
tailfib 44,387 49,727 49,919 56,681 56,873 52,433 52,593 50,369 50,449
tak 44,771 49,959 50,119 56,993 57,153 52,713 52,825 50,209 50,241
tensor 94,850 156,514 161,466 161,940 166,900 129,652 135,228 198,764 205,452
tsp 79,059 105,999 107,679 112,377 114,121 98,385 100,097 104,225 106,305
tyan 132,123 232,427 240,331 238,141 246,101 199,773 207,853 209,421 217,549
vector-concat 45,971 52,047 52,191 59,001 59,145 54,529 54,641 51,777 51,841
vector-rev 45,199 51,311 51,503 58,313 58,505 53,633 53,809 51,713 51,873
vliw 387,187 916,471 955,039 919,071 957,623 750,415 787,719 834,191 871,487
wc-input1 99,071 138,827 140,027 140,963 142,163 127,219 128,371 111,667 111,859
wc-scanStream 106,215 151,215 152,495 153,415 154,695 138,771 140,003 119,379 119,667
zebra 121,515 264,243 277,699 270,589 284,125 187,853 204,925 318,441 337,465
zern 85,796 106,444 107,724 115,222 116,534 102,734 103,934 112,046 113,262
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
barnes-hut 8.87 9.69 9.81 9.78 10.01 9.54 9.61 9.87 9.86
boyer 9.51 10.65 10.95 10.98 11.17 10.53 10.71 10.62 10.94
checksum 6.28 6.40 6.45 6.59 6.64 6.57 6.55 6.58 6.62
count-graphs 7.11 7.66 7.68 7.77 7.88 7.63 7.70 7.73 7.79
DLXSimulator 9.91 11.58 11.89 11.68 11.84 11.14 11.32 11.70 11.87
fft 6.78 7.08 7.14 7.30 7.35 7.23 7.26 7.39 7.41
fib 6.23 6.43 6.37 6.58 6.55 6.50 6.54 6.56 6.61
flat-array 6.24 6.40 6.47 6.59 6.61 6.58 6.56 6.59 6.73
hamlet 62.93 86.85 92.24 86.99 91.89 81.14 85.29 82.68 86.08
imp-for 6.35 6.50 6.53 6.67 6.70 6.60 6.60 6.73 6.72
knuth-bendix 8.20 9.16 9.30 9.38 9.51 9.15 9.22 9.49 9.59
lexgen 11.63 14.18 14.48 14.28 14.61 13.58 13.96 13.73 14.12
life 6.90 7.23 7.25 7.38 7.50 8.71 7.33 7.42 7.42
logic 8.36 9.41 9.66 9.66 9.95 9.38 9.51 9.45 9.47
mandelbrot 6.29 6.51 6.54 6.72 6.64 6.57 6.59 6.66 6.64
matrix-multiply 7.99 6.58 6.58 6.78 6.73 6.80 6.77 6.81 6.76
md5 7.02 7.69 7.68 7.82 7.88 7.71 7.71 7.77 7.74
merge 6.33 6.51 6.46 6.63 6.67 6.70 6.63 6.72 6.68
mlyacc 27.13 34.82 35.98 34.84 36.07 32.96 34.16 33.36 34.76
model-elimination 29.42 40.24 41.53 40.32 41.72 37.63 38.88 36.46 38.03
mpuz 6.38 6.65 6.74 6.88 6.94 6.79 6.82 6.85 6.85
nucleic 13.81 14.09 14.19 14.54 14.64 14.36 14.30 14.58 14.61
output1 7.01 7.58 7.54 7.57 7.52 7.50 7.47 7.56 7.52
peek 6.93 7.45 7.50 7.57 7.55 7.50 7.54 7.65 7.66
psdes-random 6.25 6.53 6.52 6.71 6.76 6.64 6.63 6.69 6.71
ratio-regions 7.50 8.45 8.68 8.72 8.84 8.38 8.64 8.86 9.04
ray 10.32 12.57 12.73 12.78 12.95 12.17 12.35 12.62 12.82
raytrace 14.75 17.91 18.40 17.84 18.35 16.88 17.43 17.53 17.96
simple 12.22 14.37 14.95 14.48 14.97 13.98 14.65 14.68 15.27
smith-normal-form 10.51 11.57 11.60 11.59 11.73 11.41 11.47 11.70 11.79
tailfib 6.21 6.40 6.40 6.59 6.52 6.51 6.55 6.56 6.57
tak 6.18 6.38 6.41 6.57 6.58 6.53 6.52 6.58 6.56
tensor 8.87 10.31 10.48 10.44 10.60 10.07 10.12 10.62 10.93
tsp 7.36 8.08 8.14 9.71 8.27 8.04 8.14 8.36 8.37
tyan 9.72 11.63 11.89 11.83 12.06 11.24 11.50 11.54 11.77
vector-concat 6.24 6.47 6.53 6.67 6.70 7.20 6.62 6.66 6.68
vector-rev 6.21 7.34 6.39 6.68 6.64 6.62 6.61 6.69 6.65
vliw 20.30 30.19 31.32 38.42 31.35 28.04 29.16 27.53 28.74
wc-input1 7.85 8.81 8.82 8.85 8.89 8.62 8.69 8.74 8.69
wc-scanStream 8.17 9.14 9.25 9.22 9.23 9.02 9.03 9.03 9.04
zebra 9.29 11.24 11.64 11.39 11.82 10.78 11.07 12.38 12.76
zern 6.98 7.41 7.48 7.85 7.74 7.53 7.60 7.75 7.75
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
barnes-hut 50.57 53.24 55.01 49.92 51.37 66.92 68.46 117.11 118.22
boyer 53.95 58.84 60.10 56.34 57.66 86.89 88.07 154.22 154.15
checksum 97.36 160.35 167.09 160.89 175.53 168.39 167.05 829.31 863.90
count-graphs 41.38 42.76 48.26 42.97 48.08 89.45 94.32 432.76 441.54
DLXSimulator 90.58 90.61 90.76 95.37 95.55 165.44 166.24 94.01 93.97
fft 37.67 36.37 37.59 36.20 37.11 37.13 36.63 44.44 44.88
fib 70.29 97.56 105.30 97.52 104.97 95.66 100.98 373.84 365.49
flat-array 25.06 27.48 29.16 27.56 29.21 27.48 29.15 201.32 202.35
hamlet 53.08 59.16 61.40 59.72 62.08 84.56 88.18 146.84 153.01
imp-for 46.82 46.20 108.57 46.22 108.70 46.80 108.54 2223.65 2293.43
knuth-bendix 38.63 48.01 51.52 48.29 51.77 56.18 60.43 394.89 400.87
lexgen 44.42 47.73 51.02 46.95 50.10 56.65 58.40 111.86 112.36
life 14.42 18.94 25.81 19.45 26.63 27.93 36.60 316.88 325.70
logic 56.07 60.66 62.17 62.06 63.66 78.24 80.65 186.49 183.77
mandelbrot 82.46 85.93 91.28 55.98 63.40 55.98 63.96 452.52 476.85
matrix-multiply 8.20 7.52 8.60 6.83 9.54 6.84 9.49 56.16 56.30
md5 53.22 77.27 97.11 77.33 97.21 80.18 86.76 928.61 962.71
merge 87.86 88.44 90.62 87.31 92.63 115.78 116.82 121.43 125.70
mlyacc 41.03 42.77 44.30 43.08 44.45 71.10 73.40 99.64 101.11
model-elimination 79.62 85.98 88.96 86.45 90.02 127.97 132.24 244.67 247.38
mpuz 41.98 42.84 55.78 42.76 55.91 41.85 55.83 1008.87 943.37
nucleic 45.47 45.16 46.35 45.19 46.48 57.80 58.99 80.38 81.66
output1 15.24 14.83 17.95 14.85 17.95 14.86 17.93 102.54 104.18
peek 35.45 44.30 70.83 44.33 70.94 44.33 70.90 2882.97 2907.41
psdes-random 38.88 42.87 61.79 42.89 63.15 42.87 63.42 990.41 1014.38
ratio-regions 53.00 61.39 69.05 61.52 69.72 64.16 71.37 450.98 459.61
ray 30.73 32.84 34.31 32.37 34.23 35.13 35.70 152.67 153.65
raytrace 42.53 44.77 47.54 45.07 47.95 47.94 50.51 339.61 344.47
simple 60.43 61.72 69.39 56.99 64.63 87.20 88.59 235.90 239.03
smith-normal-form 36.94 36.97 37.09 37.13 37.16 37.16 37.19 37.83 37.88
tailfib 43.73 41.78 66.40 41.80 66.42 41.79 66.39 896.74 909.36
tak 27.60 39.72 40.43 39.38 40.43 37.51 37.90 122.74 124.18
tensor 58.97 49.31 88.00 49.35 88.07 49.28 88.05 3829.67 3907.20
tsp 64.40 66.41 68.30 81.56 67.17 65.55 67.37 261.32 272.98
tyan 58.11 62.71 64.83 63.19 65.45 106.68 109.43 231.13 236.41
vector-concat 92.48 92.40 91.95 93.55 94.28 106.68 92.91 92.99 93.77
vector-rev 119.16 114.60 122.84 116.38 124.34 115.85 124.12 466.79 478.11
vliw 51.63 59.10 62.99 62.90 61.76 93.84 94.80 170.32 170.97
wc-input1 38.72 46.45 55.15 45.73 54.38 44.84 53.34 394.99 415.97
wc-scanStream 34.30 241.25 250.02 244.38 250.43 533.79 545.33 528.20 527.37
zebra 40.69 48.31 63.23 51.66 69.29 114.39 134.11 943.36 1044.25
zern 41.11 41.53 45.34 41.52 45.22 41.31 45.17 292.87 302.44