[MLton] Question on profile.fun
Matthew Fluet
fluet@cs.cornell.edu
Sat, 4 Jun 2005 10:12:12 -0400 (EDT)
> > I tried the following experiment. Added a flag
> > -profile-dummy {false|true}
> > which instructs profile.fun(line 447) to, in addition to any other
> > profiling work, insert code to increment a dummy field in the gcState at
> > _every_ Profile statement in the RSSA IL program. While this isn't
> > exactly the work required to modify time profiling as I described
> > previously, I figure that it is about on par with that work.
>
> Did you really mean "increment a dummy field in the gcState"? I would
> think that is more costly than "move of a constant integer to a known
> slot in the gc state", which is what you suggested earlier.
Yes; I deliberately chose an operation that would likely be a little more
costly than what would be needed.
> > though, -profile mark does insert the time profiling labels into the
> > Machine IL, and I'm fairly certain that cutting up blocks at the
> > profiling labels is interfering with codegen optimizations)
> ...
> > There is another experiment to be done where labels are not
> > inserted into the Machine IL code.
>
> To be fair to the code-insertion approach, it would be worth doing
> such an experiment with -profile mark and have the backend profile
> pass drop the profiling stuff altogether. That would separate how
> much slowdown is due to missed SSA optimization and how much is due to
> missed codegen optimization. It would also provide a bound on how
> well one could do with the code-insertion approach to time profiling.
Here's a new experiment:
MLton0 -- mlton -profile no
MLton1 -- mlton -profile drop
MLton2 -- mlton -profile drop -profile-dummy true
MLton3 -- mlton -profile label
MLton4 -- mlton -profile label -profile-dummy true
MLton5 -- mlton -profile time
MLton6 -- mlton -profile time -profile-dummy true
I changed the name of -profile mark to -profile label. The new option
-profile drop causes the implementProfiling pass to implement nothing.
Hence, the RSSA IL program out of implementProfiling with -profile drop
is the same as the program into the pass without any profiling statements
(or labels); with -profile-dummy true, the only addition is the dummy
increments. So, I think
MLton0 vs. MLton1 -- slowdown due to missed SSA optimization
MLton1 vs. MLton3 -- slowdown due to missed codegen optimization
MLton0 vs. MLton2 -- bound on code-insertion approach to time profiling
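To make the attribution concrete, each of these summary ratios is just a quotient of absolute run times from the table at the end. A small sketch, using the checksum row of the run-time table as the example input:

```python
# Recompute the three summary ratios from absolute run times (seconds),
# taken from the checksum row of the run-time table below.
times = {"MLton0": 109.64, "MLton1": 178.60, "MLton2": 185.90, "MLton3": 181.71}

def ratio(a, b):
    return round(times[a] / times[b], 2)

# Slowdown due to missed SSA optimization (MLton1 vs. MLton0):
ssa = ratio("MLton1", "MLton0")      # 1.63
# Slowdown due to missed codegen optimization (MLton3 vs. MLton1):
codegen = ratio("MLton3", "MLton1")  # 1.02
# Bound on the code-insertion approach (MLton2 vs. MLton0):
bound = ratio("MLton2", "MLton0")    # 1.70
```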
> Inserting the profile labels and causing the codegen to miss
> optimizations is clearly interference, and it would be nice to avoid.
> We'll know how much it costs once we do the experiment to drop
> profiling annotations after the SSA optimizer is done.
The experiment described above and presented below should answer this
question.
> We don't want the codegen to misattribute time to the wrong source
> function and we would like to not inhibit its optimizations. Perhaps
> we could avoid the interference with a different approach to profiling
> in the codegen. Instead of annotating each basic block, we could
> annotate each instruction in the Machine IL with a sourceseq. Then,
> we would require the codegen to preserve that annotation throughout.
> There are clearly some issues where the codegen combines instructions,
> etc. But it could probably do a good job, and would have complete
> freedom to move code, combine or split basic blocks, etc. We could
> emit a table that maps pc -> sourceseq, similar to our current table,
> except probably not as compact, since the current table maps pc ranges
> to sourceseq (since a label applies to an entire block).
>
> I don't see why that won't work, although it requires some significant
> hacking on the codegen.
It has the downside of making native codegen profiling different from C
codegen profiling. Anyway, since we'll shortly see that the profiling
labels don't affect the codegen optimizations appreciably, I don't think
this matters.
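For reference, the per-instruction proposal above would need a pc -> sourceseq map, while the current per-block table maps pc ranges; the range lookup amounts to a binary search over sorted range starts. A minimal sketch, with a hypothetical table layout (illustrative, not MLton's actual on-disk format):

```python
import bisect

# Hypothetical pc-range table: sorted range starts, each covering the
# addresses up to the next start (a label applies to an entire block).
range_starts = [0x1000, 0x1040, 0x10a0, 0x1100]
sourceseqs   = [3, 7, 2, 9]

def sourceseq_at(pc):
    # Find the rightmost range start <= pc; its sourceseq applies.
    i = bisect.bisect_right(range_starts, pc) - 1
    return sourceseqs[i]
```

A per-instruction pc -> sourceseq table would replace each range with a single address, which is why it would likely be less compact.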
So here are the relevant results from the new experiment:
# MLton1/MLton0 <=
-- ----------------
7 1.0
28 1.1
36 1.2
37 1.3
40 1.4
40 1.5
outliers 1.63 checksum, 7.0 wc-scanStream
So, this seems to suggest that the slowdown due to missed SSA
optimizations is fairly low, though it is the cause of the insane behavior
of wc-scanStream. Knowing that, it is probably worth adding a TODO to
investigate.
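The rows in these summaries are cumulative counts: each line gives the number of benchmarks whose ratio is at or below that threshold, with anything above the last threshold listed as an outlier. A sketch of the derivation, using a small illustrative subset of ratios rather than the full suite:

```python
# Cumulative-count histogram, as in the summaries above.  The ratios
# here are a small illustrative subset, not the full benchmark suite.
ratios = [0.97, 1.00, 1.04, 1.08, 1.36, 1.63, 7.07]
thresholds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

# Each row: number of benchmarks with ratio <= threshold.
histogram = {t: sum(1 for r in ratios if r <= t) for t in thresholds}
# Everything beyond the last threshold is reported as an outlier.
outliers = [r for r in ratios if r > thresholds[-1]]
```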
# MLton3/MLton1 <=
-- ----------------
15 1.0
40 1.1
42 1.2
So, the labels have virtually no effect on codegen optimizations. The two
"outliers" are 1.15 life, 1.19 zebra.
# MLton2/MLton0 <=
-- ----------------
2 1.0
15 1.1
25 1.2
27 1.3
31 1.4
33 1.5
outliers 1.5 tailfib, 1.6 psdes-random, 1.6 md5, 1.7 checksum,
1.7 zebra, 1.7 life, 2.0 peek, 2.4 imp-for,
7.8 wc-scanStream
So, I admit that the bound on code insertion for time profiling isn't all
that good. Again, we might argue that this is an upper bound, since the
inserted code is more expensive than a single move. I'll see about
queueing up that experiment.
# MLton5/MLton0 <=
-- ----------------
15 1.0
29 1.1
32 1.2
37 1.3
38 1.4
40 1.5
outliers 1.6 checksum, 7.6 wc-scanStream
And, furthermore, time profiling with labels yields a much better cost
ratio.
Finally, another argument against the code insertion approach is that
these benchmarks are run without profiling the Basis Library. Since
labels are inserted into every RSSA IL block, even when profiling the
Basis Library the cost seen above shouldn't change; with the code
insertion approach, however, there would be more transitions in the
profiling graph, corresponding to more work done by the inserted code.
So, I'm convinced that the code insertion technique is too intrusive for
time profiling.
Here's the complete data:
MLton0 -- mlton -profile no
MLton1 -- mlton -profile drop
MLton2 -- mlton -profile drop -profile-dummy true
MLton3 -- mlton -profile label
MLton4 -- mlton -profile label -profile-dummy true
MLton5 -- mlton -profile time
MLton6 -- mlton -profile time -profile-dummy true
run time ratio
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 1.00 1.04 1.08 1.03 1.11 0.98 1.00
boyer 1.00 1.05 1.09 1.04 1.10 1.02 1.05
checksum 1.00 1.63 1.70 1.66 1.68 1.60 1.67
count-graphs 1.00 1.06 1.19 1.05 1.17 1.04 1.30
DLXSimulator 1.00 1.05 1.02 1.07 1.04 1.08 1.09
fft 1.00 1.00 1.01 0.99 1.03 1.00 1.04
fib 1.00 1.36 1.44 1.39 1.47 1.40 1.49
flat-array 1.00 1.11 1.17 1.12 1.12 1.08 1.17
hamlet 1.00 1.16 1.21 1.18 1.21 1.21 1.16
imp-for 1.00 1.04 2.40 1.02 2.41 1.00 2.35
knuth-bendix 1.00 1.19 1.30 1.22 1.29 1.22 1.30
lexgen 1.00 1.04 1.05 1.06 1.12 1.02 1.11
life 1.00 1.14 1.74 1.31 1.81 1.29 1.83
logic 1.00 1.09 1.08 1.09 1.19 1.12 1.17
mandelbrot 1.00 1.04 1.13 1.05 1.11 0.66 0.76
matrix-multiply 1.00 1.02 1.17 1.00 1.15 0.92 1.26
md5 1.00 1.36 1.62 1.44 1.82 1.46 1.76
merge 1.00 1.04 1.06 1.04 1.07 1.01 1.08
mlyacc 1.00 1.02 1.05 1.07 1.07 1.07 1.04
model-elimination 1.00 1.03 1.06 1.05 1.05 1.09 1.04
mpuz 1.00 1.04 1.37 1.05 1.36 1.05 1.33
nucleic 1.00 0.97 1.00 1.02 1.03 0.98 1.02
output1 1.00 0.97 1.17 0.96 1.16 0.97 1.15
peek 1.00 1.24 1.99 1.24 2.01 1.24 1.98
psdes-random 1.00 1.10 1.60 1.10 1.59 1.08 1.51
ratio-regions 1.00 1.19 1.31 1.21 1.38 1.19 1.28
ray 1.00 1.10 1.14 1.08 1.14 1.09 1.14
raytrace 1.00 1.04 1.08 1.07 1.08 1.08 1.14
simple 1.00 0.96 1.11 1.02 1.10 0.90 1.00
smith-normal-form 1.00 1.04 1.04 1.04 1.00 1.01 1.01
tailfib 1.00 0.95 1.52 0.96 1.50 0.93 1.50
tak 1.00 1.36 1.37 1.44 1.46 1.41 1.46
tensor 1.00 0.82 1.49 0.83 1.48 0.82 1.47
tsp 1.00 1.01 1.06 1.04 1.06 1.04 1.02
tyan 1.00 1.11 1.14 1.16 1.15 1.09 1.18
vector-concat 1.00 1.01 1.00 1.01 1.02 0.98 0.96
vector-rev 1.00 1.04 1.11 1.00 1.07 1.03 1.11
vliw 1.00 1.07 1.06 1.09 1.10 1.05 1.11
wc-input1 1.00 1.13 1.31 1.16 1.38 1.17 1.31
wc-scanStream 1.00 7.07 7.76 7.13 7.74 7.61 7.48
zebra 1.00 1.11 1.72 1.32 1.62 1.28 1.59
zern 1.00 1.00 1.15 1.05 1.09 1.01 1.15
size
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 99,700 120,058 122,938 145,314 148,194 149,084 151,976
boyer 135,375 162,147 171,379 206,323 215,699 213,565 222,925
checksum 50,095 52,627 52,835 55,755 55,979 62,725 62,949
count-graphs 63,135 76,263 78,911 93,783 96,423 102,129 104,657
DLXSimulator 126,067 165,179 172,091 216,299 222,707 221,613 228,021
fft 61,358 67,718 68,934 75,982 77,150 84,232 85,384
fib 44,691 47,047 47,207 50,143 50,351 57,129 57,337
flat-array 44,715 46,999 47,159 50,143 50,303 57,113 57,273
hamlet 1,246,854 1,913,294 2,061,582 2,699,934 2,858,062 2,704,854 2,862,966
imp-for 44,547 47,711 48,159 52,551 53,031 59,457 59,937
knuth-bendix 105,907 133,055 137,823 165,595 170,571 172,517 177,637
lexgen 199,332 270,180 283,412 358,500 372,220 361,020 374,724
life 62,059 71,295 72,823 85,407 86,879 91,801 93,305
logic 103,567 133,503 138,303 172,679 177,751 179,857 184,913
mandelbrot 44,643 47,055 47,295 50,215 50,439 57,153 57,393
matrix-multiply 46,294 49,750 50,166 54,414 54,846 61,200 61,632
md5 74,531 84,991 85,647 97,431 98,151 103,953 104,721
merge 46,271 49,279 49,599 53,255 53,575 60,161 60,497
mlyacc 501,140 696,352 733,048 933,496 971,240 935,984 973,728
model-elimination 631,901 884,693 933,909 1,236,233 1,286,281 1,241,211 1,291,339
mpuz 47,307 53,543 54,695 60,927 62,015 69,345 70,289
nucleic 196,246 208,546 209,810 221,650 223,026 228,420 229,812
output1 77,373 84,953 85,561 97,689 98,377 99,465 100,169
peek 73,483 81,495 82,231 92,551 93,335 97,649 98,321
psdes-random 45,355 48,407 48,727 52,255 52,591 59,177 59,513
ratio-regions 70,387 102,063 108,191 126,567 132,791 135,497 141,769
ray 178,284 226,868 231,620 291,556 296,644 298,444 303,532
raytrace 260,497 326,209 342,833 440,625 457,457 441,497 458,473
simple 219,103 302,895 317,871 379,215 394,463 383,857 399,249
smith-normal-form 178,867 196,427 199,259 220,459 223,275 225,973 228,837
tailfib 44,387 46,679 46,855 49,727 49,919 56,681 56,873
tak 44,771 46,943 47,071 49,959 50,119 56,993 57,153
tensor 94,850 117,762 123,322 156,514 161,466 161,940 166,900
tsp 79,059 90,295 91,895 105,999 107,679 112,377 114,121
tyan 132,123 175,379 182,851 232,427 240,331 238,141 246,101
vector-concat 45,971 48,375 48,503 52,047 52,191 59,001 59,145
vector-rev 45,199 47,863 48,039 51,311 51,503 58,313 58,505
vliw 387,187 668,207 706,111 916,471 955,039 919,071 957,623
wc-input1 99,071 115,083 116,139 138,827 140,027 140,963 142,163
wc-scanStream 106,215 123,483 124,619 151,215 152,495 153,415 154,695
zebra 121,515 157,363 174,371 264,243 277,699 270,589 284,125
zern 85,796 94,308 95,460 106,444 107,724 115,222 116,534
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 9.15 10.92 11.15 10.87 11.43 11.58 11.96
boyer 9.94 11.04 12.43 11.82 13.21 12.06 11.93
checksum 7.53 7.15 7.00 7.77 7.77 7.35 7.48
count-graphs 7.91 8.41 8.80 8.62 8.81 9.31 9.52
DLXSimulator 10.78 12.97 12.71 13.43 12.97 14.12 13.42
fft 7.97 7.73 8.45 8.57 8.51 8.84 8.24
fib 7.45 7.69 7.61 6.93 7.06 7.88 6.95
flat-array 7.44 7.67 7.49 7.73 6.67 8.00 7.35
hamlet 70.92 86.31 93.04 101.31 105.00 100.84 107.82
imp-for 6.89 7.70 7.65 7.62 6.98 6.80 6.75
knuth-bendix 9.74 10.03 10.81 10.94 10.84 10.80 10.39
lexgen 13.58 15.15 15.23 15.40 17.45 16.89 14.92
life 8.13 8.00 8.52 8.26 8.48 8.65 8.16
logic 9.66 10.10 10.72 10.38 11.37 9.68 10.92
mandelbrot 7.42 7.31 6.83 7.62 6.86 7.84 7.23
matrix-multiply 7.62 7.88 7.25 7.98 6.87 7.10 8.21
md5 8.61 7.62 9.10 7.80 7.89 9.40 8.71
merge 7.58 7.46 7.66 7.78 7.73 7.69 7.23
mlyacc 29.55 36.79 39.08 40.88 41.68 39.53 43.13
model-elimination 32.96 41.64 42.02 45.17 48.33 47.00 48.00
mpuz 7.21 7.40 7.68 7.79 8.03 8.27 7.07
nucleic 16.39 16.08 14.65 17.13 16.38 16.92 17.05
output1 7.94 8.88 8.98 8.21 7.93 9.10 8.82
peek 8.40 8.93 8.91 8.63 9.04 7.72 9.07
psdes-random 7.15 7.46 7.62 7.61 7.78 8.06 7.65
ratio-regions 8.91 9.88 10.08 9.94 10.42 10.50 10.15
ray 12.42 14.21 13.99 14.59 15.30 13.20 14.75
raytrace 17.66 19.58 20.51 21.35 21.60 19.70 20.88
simple 13.02 15.28 16.07 15.51 17.99 17.40 16.91
smith-normal-form 11.85 12.66 13.81 13.37 12.89 14.02 13.89
tailfib 7.03 6.99 7.39 7.76 7.51 7.50 6.87
tak 7.48 7.34 7.72 7.24 7.73 7.95 7.00
tensor 10.82 11.80 11.20 12.43 10.95 12.32 12.20
tsp 9.00 9.61 9.43 9.81 8.35 9.87 8.89
tyan 11.02 12.19 12.08 13.77 14.17 13.52 12.50
vector-concat 7.61 7.48 7.35 7.40 7.50 7.57 7.18
vector-rev 6.82 6.97 7.71 6.61 7.66 8.06 7.71
vliw 23.60 32.21 32.51 35.16 36.15 34.72 36.15
wc-input1 9.45 9.59 9.72 10.01 10.34 10.66 10.75
wc-scanStream 9.46 9.25 10.72 10.70 10.40 9.75 10.76
zebra 11.22 11.91 12.38 12.29 13.91 13.37 13.76
zern 7.65 8.85 7.50 9.08 8.09 9.26 8.63
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6
barnes-hut 58.28 60.76 63.02 59.88 64.67 57.13 58.33
boyer 63.62 66.60 69.57 66.17 69.72 64.97 66.82
checksum 109.64 178.60 185.90 181.71 184.22 175.48 182.80
count-graphs 48.10 51.05 57.30 50.51 56.42 50.12 62.30
DLXSimulator 101.19 106.11 102.82 108.59 104.82 109.50 110.39
fft 39.87 39.95 40.27 39.38 41.16 40.05 41.31
fib 79.10 107.22 114.01 109.75 116.04 110.39 118.08
flat-array 29.44 32.80 34.46 33.05 33.00 31.86 34.46
hamlet 58.48 67.73 70.72 69.05 70.49 70.84 67.86
imp-for 51.31 53.30 122.94 52.24 123.57 51.27 120.71
knuth-bendix 44.89 53.43 58.42 54.72 57.97 54.73 58.40
lexgen 50.62 52.60 53.16 53.42 56.73 51.72 56.25
life 16.54 18.77 28.74 21.72 29.86 21.27 30.25
logic 60.57 65.73 65.13 66.22 72.22 67.79 70.81
mandelbrot 93.38 97.49 105.28 97.76 103.28 62.07 71.43
matrix-multiply 8.29 8.41 9.70 8.32 9.52 7.67 10.43
md5 61.49 83.55 99.73 88.69 112.14 89.53 108.31
merge 95.39 99.47 101.25 98.78 101.82 96.20 103.46
mlyacc 47.16 48.17 49.60 50.55 50.70 50.43 48.98
model-elimination 94.79 98.02 100.49 99.09 99.82 103.40 98.83
mpuz 45.98 47.66 63.07 48.48 62.36 48.42 61.16
nucleic 51.82 50.19 51.89 52.90 53.17 50.99 53.10
output1 17.39 16.95 20.36 16.65 20.14 16.84 20.02
peek 39.99 49.43 79.72 49.70 80.20 49.63 79.10
psdes-random 44.28 48.78 70.70 48.71 70.43 47.72 66.69
ratio-regions 59.10 70.50 77.65 71.45 81.47 70.07 75.60
ray 33.72 37.08 38.35 36.31 38.47 36.59 38.57
raytrace 48.41 50.38 52.23 51.67 52.42 52.35 55.12
simple 71.23 68.30 79.10 72.55 78.35 64.12 71.17
smith-normal-form 41.19 42.84 43.02 42.99 41.34 41.76 41.60
tailfib 49.78 47.43 75.59 47.55 74.48 46.55 74.83
tak 31.22 42.47 42.82 44.91 45.51 43.93 45.52
tensor 67.00 55.27 100.14 55.49 99.03 55.10 98.52
tsp 72.60 73.00 77.02 75.44 77.13 75.27 74.17
tyan 64.98 71.94 74.03 75.22 74.78 70.59 76.77
vector-concat 105.55 106.09 105.51 106.63 107.69 103.12 101.55
vector-rev 128.94 133.73 142.84 128.60 137.88 133.25 143.30
vliw 63.06 67.47 67.05 68.47 69.36 65.96 70.29
wc-input1 44.98 50.97 58.86 52.11 62.26 52.41 58.80
wc-scanStream 38.30 270.89 297.11 273.18 296.50 291.51 286.35
zebra 45.81 50.84 78.70 60.67 74.36 58.51 72.78
zern 45.90 46.10 52.77 48.16 49.88 46.52 52.74