[MLton-devel] cvs commit: source-level profiling

Matthew Fluet fluet@CS.Cornell.EDU
Thu, 2 Jan 2003 22:23:15 -0500 (EST)

> The goal of stack profiling (aka cumulative profiling) is to at each
> point of interest (clock tick or allocation) bump a counter for each
> source function that is on the stack (once per function).

I wonder if this is really the best profiling data that we can deliver.
In particular, it doesn't seem to give the nesting of times that I would
like to see.  For example, when profiling a self compile, it will be very
difficult to understand where time in the basis library corresponds to
time in functions lower on the stack (in the source program sense).

Maybe this is too much to ask (in the sense of too much data to
accumulate), but would it be possible, instead of bumping a counter for
each source function, to bump a counter for each "stack abstraction",
for lack of a better term?

In more detail, taking time profiling as an example, at each interrupt,
the ML stack and program counter uniquely determine a source stack: for
each SSA continuation on the ML stack, the frameSources: int vector
identifies the source stack in the SSA function, and the program counter
likewise can be mapped to an index into sourceSeqs: int vector vector.
Appending all these int vectors together, we get the stack of executing
functions as they appear in the source program.  In the best of all
worlds, we could just bump a counter for each of these source stacks, dump
that in the mlmon.out file and do all the processing offline -- recovering
the current information as a special case, but also getting more detailed
dependencies.  But, clearly the number of possible source stacks is
unbounded, so it is probably unreasonable to keep all of them.  On the
other hand, Steve made a reasonable simplification by only charging a
source function one tick regardless of the number of times it appears on
the stack.  (Note this is exactly what we want for a function like fib; on
the other hand, for something like foldr, where the folded function might
also invoke foldr, it's less clear that that is "best".  If you are trying
to investigate the folded function, you won't be able to distinguish
between the inner and the outer foldr.)  Adopting that simplification,
there are at most 2^n possible source stacks, where n is the number of
source functions.  And, now that I think about it more, even that is way
too much data.  In practice, I'm sure that it is much less.  And for any
given program, given the sourceSeqs, frameSources and SSA call graph one
could determine all possible sets of source functions that could appear on
the stack.
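To make the bookkeeping concrete, here is a small sketch of the
accounting described above, written in OCaml rather than SML so it runs
standalone.  The names frameSources and sourceSeqs follow the message,
but the concrete representations (arrays of ints, a hash table keyed on
the set of functions) are assumptions for illustration, not MLton's
actual data structures:

```ocaml
module IntSet = Set.Make (Int)

(* source_seqs.(i) is the sequence of source functions denoted by
   index i (sourceSeqs in the message). *)
let source_seqs : int array array =
  [| [| 0 |]; [| 0; 1 |]; [| 0; 1; 2 |] |]

(* frame_sources.(f) maps an SSA frame f to its sourceSeqs index
   (frameSources in the message). *)
let frame_sources : int array = [| 0; 1 |]

(* Append the per-frame source sequences, then the sequence for the
   program counter, giving the stack of executing source functions. *)
let source_stack ~(ml_stack : int list) ~(pc_index : int) : int list =
  let of_frame f = Array.to_list source_seqs.(frame_sources.(f)) in
  List.concat_map of_frame ml_stack @ Array.to_list source_seqs.(pc_index)

(* Steve's simplification: charge each function at most one tick, i.e.
   key the counter on the *set* of source functions on the stack. *)
let counts : (IntSet.t, int ref) Hashtbl.t = Hashtbl.create 17

let tick ~ml_stack ~pc_index =
  let key = IntSet.of_list (source_stack ~ml_stack ~pc_index) in
  match Hashtbl.find_opt counts key with
  | Some r -> incr r
  | None -> Hashtbl.add counts key (ref 1)

let () =
  tick ~ml_stack:[0; 1] ~pc_index:2;
  tick ~ml_stack:[0; 1] ~pc_index:2
```

Dumping the counts table to mlmon.out would then let all further
processing happen offline, with the current per-function totals
recoverable by summing over every set containing a given function.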

Anyways, that was an idea, although I'm less convinced now that it can be
realized in an efficient manner.

> The new idea that I had last night that I think may make the overhead
> acceptable, and is at least worth trying, is to only use your
> suggestion when making an SSA nontail call, not for all enters and
> leaves (i.e. not for source functions that are inlined or contified).

That seems quite reasonable and something similar was occurring to me
when thinking about the above, when I really wanted to snapshot the
stack, without needing to walk the whole stack at each tick.  The only
difficulty I see, and maybe this occurred to Steve given his next
message, is that on
a raise/handle, we would need to walk the stack to determine the functions
that were implicitly left.  On a normal call/return, the frameSources
should tell us exactly which functions need to be remembered and cleared
for the duration of the call, so constant work can be done to adjust for
them.  But, since a handler can undo a dynamic portion of the stack, a
dynamic walk will be necessary.  If I'm reading Steve's description right,
this would seem to give a big speedup.  The interrupt handler should have
cost proportional to the size of the local stack; likewise, the cost for a
normal call/return should be proportional to the size of the local stack;
and, in practice, I expect local stacks are fairly small.  Certainly, they
are smaller than the size of the whole stack.  On the other hand, in order
to realize a speedup over the current system, it needs to be the case that
we aren't building up and taking down SSA stack frames faster than
interrupts are being handled.  (Else, we pay for adjusting the local
stacks of those frames which wouldn't be touched by the stack walk.)
But, I would imagine that that is the case.
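The nontail-call scheme above can be sketched as follows; the
frame/handler model here is an assumption for illustration, not MLton's
actual runtime representation.  The point is that enters and leaves cost
constant work per source function named in a frame, while a raise must
walk the discarded frames:

```ocaml
let num_sources = 4

(* on_stack.(f) = how many live SSA frames mention source function f. *)
let on_stack = Array.make num_sources 0

(* ticks.(f) = profile count; a tick charges each source function that
   is on the stack once, regardless of multiplicity. *)
let ticks = Array.make num_sources 0

(* A frame remembers the source functions its frameSources entry names,
   so a normal return can undo the enter without a stack walk. *)
type frame = { sources : int list }

let stack : frame list ref = ref []

let nontail_call sources =
  List.iter (fun f -> on_stack.(f) <- on_stack.(f) + 1) sources;
  stack := { sources } :: !stack

let return () =
  match !stack with
  | [] -> ()
  | { sources } :: rest ->
      List.iter (fun f -> on_stack.(f) <- on_stack.(f) - 1) sources;
      stack := rest

(* A raise can discard a dynamic portion of the stack, so here we must
   walk the popped frames to find the functions implicitly left. *)
let raise_to depth =
  while List.length !stack > depth do return () done

let tick () =
  Array.iteri
    (fun f n -> if n > 0 then ticks.(f) <- ticks.(f) + 1)
    on_stack

let () =
  nontail_call [0];
  nontail_call [1; 2];
  tick ();        (* charges sources 0, 1, 2 *)
  raise_to 1;     (* the walk pops the [1; 2] frame *)
  tick ()         (* charges only source 0 *)
```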
