[MLton-devel] cvs commit: source-level profiling

Stephen Weeks MLton@mlton.org
Fri, 3 Jan 2003 11:15:37 -0800

> Maybe this is too much to ask (in the sense of too much data to
> accumulate), but would it be possible to instead of bumping a counter for
> each source function, bump a counter for each "stack abstraction" for lack
> of a better term.

If we think of stacks as sequences of sources, Stack = Source*, and of
a run of the program as generating a multiset of stacks, M(Stack) =
Stack -> int, I like to think of what we're doing as defining a
function from M(Stack) to #MStack, an abstraction of multisets of
stacks.  This of course includes abstractions like you want of the
form #MStack = #Stack -> int.  But it also includes my abstraction,
#MStack = Source -> int.
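
The two abstractions can be modeled concretely.  Below is an
illustrative Python sketch (not MLton code; the stacks, names, and
tick counts are made up) contrasting my per-source abstraction with
the per-set-of-sources one:

```python
from collections import Counter

# A Stack is a tuple of source-function names, bottom to top.  A run
# of the program yields a multiset of stacks: M(Stack) = Stack -> int,
# represented here as a Counter mapping each stack to its tick count.
run = Counter({
    ("main", "foldr", "f"): 5,
    ("main", "foldr", "f", "foldr"): 3,   # f itself invokes foldr
    ("main", "fib", "fib"): 2,
})

def per_source(m):
    """#MStack = Source -> int: charge each source one tick per clock
    tick, regardless of how often it appears on the stack."""
    out = Counter()
    for stack, ticks in m.items():
        for src in set(stack):
            out[src] += ticks
    return out

def per_stack_set(m):
    """#MStack = #Stack -> int with #Stack = P(Source): charge the
    set of sources appearing on the stack."""
    out = Counter()
    for stack, ticks in m.items():
        out[frozenset(stack)] += ticks
    return out
```

Note that per_stack_set merges the first two stacks, since both have
source set {main, foldr, f}: the inner and outer foldr collapse to the
same abstract stack, which is exactly the foldr indistinguishability
discussed below.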

> In the best of all worlds, we could just bump a counter for each of
> these source stacks, dump that in the mlmon.out file and do all the
> processing offline -- recovering the current information as a
> special case, but also getting more detailed dependencies.  But,
> clearly the number of possible source stacks is unbounded, so it is
> probably unreasonable to keep all of them.

I agree that taking #MStack = M(Stack) is probably not doable with
reasonable performance, at least for time profiling.  For space
profiling, since we have the call graph, I could imagine doing some
preprocessing to make it easy for enters and leaves to maintain an
M(Stack) as a call tree.  But it might take a lot of space.
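
Here is a minimal sketch (illustrative Python, with hypothetical
names) of maintaining M(Stack) as a call tree: each node corresponds
to one stack, namely the path from the root, enter/leave move a
cursor, and each clock tick bumps the count at the current node:

```python
# The tree as a whole represents M(Stack) restricted to the stacks
# that actually occurred during the run.
class CallTree:
    def __init__(self, source=None, parent=None):
        self.source = source
        self.parent = parent
        self.children = {}   # source -> CallTree
        self.ticks = 0       # ticks charged to this exact stack

def enter(node, source):
    """On a call, descend to (or create) the child for this source."""
    child = node.children.get(source)
    if child is None:
        child = node.children[source] = CallTree(source, node)
    return child

def leave(node):
    """On a return, ascend to the caller's node."""
    return node.parent
```

The space cost is one node per distinct stack that ever occurs, which
is the "lot of space" worry above.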

> On the other hand, Steve made a reasonable simplification by only
> charging a source function one tick regardless of the number of
> times it appears on the stack.  (Note this is exactly what we want
> for a function like fib; on the other hand, for something like
> foldr, where the folded function might also invoke foldr, it's less
> clear that that is "best".  If you are trying to investigate the
> folded function, you won't be able to distinguish between the inner
> and the outer foldr.)

It's not quite so bad.  Because of polymorphic and polyvariant
cloning, the call-graph information is pretty precise.  I will
hopefully check in some code today so that mlprof can display the data
in the call graph and we can start to get a feel for how
understandable it is.

> Adopting that simplification, there are at most 2^n possible source
> stacks, where n is the number of source functions.  And, now that I
> think about it more, even that is way too much data.  In practice,
> I'm sure that it is much less.  And for any given program, given the
> sourceSeqs, frameSources and SSA call graph one could determine all
> possible sets of source functions that could appear on the stack.

I suspect that taking #MStack = #Stack -> int and #Stack = P(Source)
is not too much slower than #MStack = Source -> int.  We can have the
enter/leave functions maintain the current P(Source) and the clock
tick bump a counter.  What we need are hash-consed sets with a fast
way of adding or removing a Source, which should be doable.
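
As an illustrative sketch (Python, hypothetical names), hash-consing
can make adding and removing a Source a single memoized lookup: each
distinct set of sources gets an integer id, so the clock tick only
bumps a counter indexed by the current id.  (With the
one-tick-per-source simplification, a leave must not drop a source
that still occurs deeper on the stack; a real implementation would
also track per-source occurrence counts, omitted here.)

```python
class HashConsedSets:
    """Hash-consed sets of sources: each distinct set gets a unique
    integer id, and add/remove are memoized per (id, source), so after
    warm-up every enter/leave is one dictionary lookup."""
    def __init__(self):
        self._interned = {frozenset(): 0}   # set -> id; id 0 is empty
        self._sets = [frozenset()]          # id -> set
        self._add = {}                      # (id, source) -> id
        self._rem = {}                      # (id, source) -> id

    def _intern(self, s):
        i = self._interned.get(s)
        if i is None:
            i = len(self._sets)
            self._interned[s] = i
            self._sets.append(s)
        return i

    def add(self, i, src):
        j = self._add.get((i, src))
        if j is None:
            j = self._add[(i, src)] = self._intern(self._sets[i] | {src})
        return j

    def remove(self, i, src):
        j = self._rem.get((i, src))
        if j is None:
            j = self._rem[(i, src)] = self._intern(self._sets[i] - {src})
        return j
```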

Anyways, now that -profile stack inserts the enters/leaves, it is easy
enough to play around with abstractions without even modifying the
compiler.  Some other abstractions that might be useful would be to
take #Stack = Source^m, and keep track of the bottom m frames on the
stack or of the top m frames on the stack.  Or possibly even #Stack =
Source^m x Source^n, keeping track of both.  I could especially
imagine Source^m x Source being useful, as it would tell us at a high
level where we are (the bottom m frames) and at a detailed level which
function we are in (the top frame).
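
These truncating abstractions are easy to state concretely.  An
illustrative Python sketch (hypothetical names; a stack is a tuple of
sources, bottom to top):

```python
def bottom_top(stack, m, n):
    """#Stack = Source^m x Source^n: keep the bottom m frames (where
    we are at a high level) and the top n frames (which function we
    are in).  When the stack is shorter than m + n, the two halves
    simply overlap."""
    bottom = tuple(stack[:m])
    top = tuple(stack[-n:]) if n > 0 else ()
    return (bottom, top)
```

For example, with m = 2 and n = 1, the stack (main, a, b, foldr, f)
abstracts to ((main, a), (f,)).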

Another important thing to keep in mind is how the data will be
visualized.  One reason that I chose the abstraction I did is that it
is in the same form as the current data and is easy to visualize as a
table of functions sorted by count or by putting data in the call
graph.

> The only difficulty I see, and maybe this occurred to Steve given his
> next message, is that on a raise/handle, we would need to walk the
> stack to determine the functions that were implicitly left.

Actually, we don't need to do this.  The profile insertion code
inserts handlers (line 1448 in ssa-tree.fun) at nontail calls so that
this stack walking happens automatically.  This is one of the causes
of the slowdown due to profiling.  But, it was also necessary, because
with inlining/contification, some of the raises to those handlers
might get turned into jumps, and it's essential to make the local call
stack balance.

> The interrupt handler should have cost proportional to the size of
> the local stack; likewise, the cost for a normal call/return should
> be proportional to the size of the local stack; and, in practice, I
> expect local stacks are fairly small.

Agreed on all counts.

> On the other hand, in order to realize a speedup over the current
> system, it needs to be the case that we aren't building up and
> taking down SSA stack frames faster than interrupts are being
> handled.  (Else, we pay for adjusting the local stacks of those
> frames which wouldn't be touched by the stack walk.)  But, I would
> imagine that that is the case.

Actually, it's probably not the case.  We are doing lots of
unnecessary local stack adjustment with time profiling, which is why
this new approach was slower than the full stack walking in many
cases, especially when there are lots of SSA nontail calls to small
functions.
