[MLton] Question on profile.fun

Fri, 3 Jun 2005 17:23:07 -0700

> I tried the following experiment.  Added a flag  
>      -profile-dummy {false|true}  
> which instructs  profile.fun(line 447)  to, in addition to any other 
> profiling work, insert code to increment a dummy field in the gcState at 
> _every_  Profile statement in the RSSA IL program.  While this isn't 
> exactly the work required to modify time profiling as I described 
> previously, I figure that it is about on par with that work.

Did you really mean "increment a dummy field in the gcState"?  I would
think that is more costly than "move of a constant integer to a known
slot in the gc state", which is what you suggested earlier.

> the difference between
>      -profile mark -profile-dummy true
> and
>      -profile mark -profile-dummy false
> can incur more than 50% (of the running time of the unprofiled program).
> Though, of the 42 benchmark programs:
>   25 incur less than 10%
>   30 incur less than 20%
>   33 incur less than 30%
>   36 incur less than 40%
>   38 incur less than 50%
>   and the last four are at 56%, 65%, 75%, and 133%.

I am surprised that there were only four outliers.  The outliers, in
order, are tailfib, peek, tensor, and imp-for, all of which are
extremely tight loops (tensor perhaps a slight exception).  Here is a
repeat of the run-time ratio table for the outliers

run time ratio
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
imp-for             1.00   0.99   2.32   0.99   2.32   1.00   2.32  47.49  48.98
peek                1.00   1.25   2.00   1.25   2.00   1.25   2.00  81.34  82.03
tailfib             1.00   0.96   1.52   0.96   1.52   0.96   1.52  20.50  20.79
tensor              1.00   0.84   1.49   0.84   1.49   0.84   1.49  64.94  66.26

We see that the four outliers all have the following equalities.

	MLton1 = MLton3 = MLton5    	all with -profile-dummy false
	MLton2 = MLton4 = MLton6	all with -profile-dummy true

So, the runtime effect of time profiling is nil, and there is likely
no significant allocation, causing the runtime effect of allocation
profiling to be nil.  The following ratios are equal for each outlier.

 	MLton2/MLton1 = MLton4/MLton3 = MLton6/MLton5

We also see that MLton1 is <=1 for three of them and is only 1.25 for
peek.  So, inserting profiling info is affecting neither the SSA
optimizer nor the codegen significantly (with a slight exception for
peek).  That is, the slowdown we are seeing is solely due to the
insertion of the assembler to increment the profile counter.

> So, the impact of actually doing time profiling over the impact of adding 
> profiling data to the program is virtually nil: only one incurs a cost of 
> more than 10%, and it is only 24%.

As expected.  The ratio of MLton3 (-profile time) to MLton1 (-profile
mark), is always between 0.9 and 1.1, with the exception of mandelbrot
(0.65) and tsp (1.23).

> though, -profile mark does insert the time profiling labels into the
> Machine IL, and I'm fairly certain that cutting up blocks at the
> profiling labels is interfering with codegen optimizations)
...
> There is another experiment to be done where labels are not 
> inserted into the Machine IL code.

To be fair to the code-insertion approach, it would be worth doing
such an experiment with -profile mark and have the backend profile
pass drop the profiling stuff altogether.  That would separate how
much slowdown is due to missed SSA optimization and how much is due to
missed codegen optimization.  It would also provide a bound on how
well one could do with the code-insertion approach to time profiling.

My guess is that most of the lossage is missed SSA optimization.  It
would be especially interesting to confirm that wc-scanStream stills
show the same insane slowdown.

I don't really see how such an experiment will say much about the four
outliers, since, as I pointed out above, the profiling info doesn't
appear to adversely affect the SSA optimizer or codegen.  The only way
I see to improve the outliers is by doing a store of a constant rather
than an increment.

> > If you put in a 'mov' at the profiling entry point _instead_ of a label,
> > that may incur entirely different penalties than a label and a mov...?
> 
> Different, but I doubt entirely different.  If you compare
>   -profile alloc -profile-dummy false
> with
>   -profile alloc -profile-dummy true
> you see the same overheads.  And alloc profiling doesn't need extra labels 
> (though, it does add other code, which may interfere, but probably not 
> much).

-profile alloc inserts C calls to bump the counter, since it doesn't
matter so much if it interferes with time.  Also, it inserts the calls
at different places than for time profiling, since it only needs to
insert a call if the profile label is about to change and something
has been allocated.  So, I'm not sure how much we can glean from the
fact that the MLton6/MLton5 ratios follow roughly the same pattern.
Here are the counts for the ratio of MLton6 (-profile alloc
-profile-dummy false) to MLton5 (-profile alloc -profile-dummy true).

	#	MLton6/MLton5 <
	--	---------------
	29 	1.1
	33	1.2	
	34	1.3
	36	1.4
	37	1.5

	outliers: 1.6 (tailfib), 1.6 (peek), 1.8 (tensor), 2.3 (imp-for)

Again, I suspect the outliers are there because there is no
allocation, so the ratio is the same as in the comparison of MLton1
(-profile-mark -profile-dummy true) to MLton2 (-profile-mark
-profile-dummy false).

> So, inserting profiling labels effectively means that the
> instruction stream is interrupted at every profile label.  Smaller
> basic blocks means that there is less opportunity for optimization.
...
> But, depending on your view, maybe this is what you want.  The fact
> that profile labels prevents basic blocks from one source function
> from "mixing" with the basic blocks from another source function
> means that optimizations can't move computation from one block to
> another (and, hence, from being charged to the wrong source
> function).  I don't know how this fits into Stephen's "principle of
> not interfering with what you're measuring".

Inserting the profile labels and causing the codegen to miss
optimizations is clearly interference, and it would be nice to avoid.
We'll know how much it costs once we do the experiment to drop
profiling annotations after the SSA optimizer is done.

We don't want the codegen to misattribute time to the wrong source
function and we would like to not inhibit its optimizations.  Perhaps
we could avoid the interference with a different approach to profiling
in the codegen.  Instead of annotating each basic block, we could
annotate each instruction in the Machine IL with a sourceseq.  Then,
we would require the codegen to preserve that annotation throughout.
There are clearly some issues where the codegen combines instructions,
etc.  But it could probably do a good job, and would have complete
freedom to move code, combine or split basic blocks, etc.  We could
emit a table that maps pc -> sourceseq, similar to our current table,
except probably not as compact since the current table maps pc ranges
to sourceseq, since a label applies to an entire block).

I don't see why that won't work, although it requires some significant
hacking on the codegen.  The runtime cost should be the same as now,
since we already expand the compact table to full code size so that we
can do constant-time lookup in the signal handler.

But, this approach has nothing to offer us if the experiment shows
that the codegen isn't hurt much by the current label-per-block
approach.