assembly/profiling question

Matthew Fluet
Fri, 24 Aug 2001 09:53:53 -0700 (PDT)

Just so we all have the same context, here's a fuller description of what
I'd like to add.

Add a Profiling structure to the MLton structure:

structure Profiling : PROFILING
signature PROFILING =
       (* a compile-time constant, like MLton.debug *)
       val profiling: bool

       (* reset all the bin counters to 0 *)
       val reset: unit -> unit
       (* write out all of the bins to a mlmon.out format file,
        * with the given file name
       val write: string -> unit

Append (prepend?) to the input source something like

val _ = if MLton.Profiling.profiling
          then OS.Process.atExit (fn () => MLton.Profiling.write "mlmon.out")
          else ()

(Yes, this has slightly different semantics than the current profiling, I
think.  The mlmon.out file won't have the timing info for GC_done.  Maybe
the right thing is to let the final write be handled as it is, by
registering atExit with C, rather than with ML.)

The whole point being that I can now profile sub-portions of the program.
For example, I think it would be great to modify Control.trace to
automatically profile each top-level pass.

To support all this, we need to change prof.c and export a few more
functions.  But I don't think that is so bad.

Then I was thinking about different things I could do with the bins and
mlprof.  The advantage of mlprof is that it is already set up to process
and display information by CPS/SSA functions, CPS/SSA blocks, and x86
blocks.  So, by accumulating allocation counts into bins, I automatically
can see which CPS functions are doing the majority of allocation, and even
figure out which blocks in the function are doing the most allocation.
(Probably obvious, but the key here is that I get to see dynamic counts;
it's easy to look at a block and see that it's allocating 10 tuples, but I
don't know whether that block is executed once or 10 million times.)
Similarly, we could get dynamic function counts, by incrementing a bin on
entry to each CPS function.  Likewise, dynamic block counts.  The only
issue with all this is that with only one set of bins, we can only profile
one thing at a time, and it will be a compile-time choice, not a runtime
choice.  (But, that seems to be o.k.  If I'm profiling time, I don't want
the results polluted by additional checks by the program checking to see
if it's currently gathering allocation stats.)

> I don't understand what he wants to do.  You can't use
> 	addl	X, (bin_symbol + (. - _start))
> (ignoring linker problems) because the bis are going to be word sized, so
> you have to scale.  Besides that, it would require an array whose size is the
> same as the size of the program times 4.  (The profiler uses this much.)

Yes, I forgot the scaling factor.  But that just scales the expression
above by a constant value.  I still claim that if the bin array is
statically allocated (e.g., in bss), then I could compute the absolute
address of the bin to increment during the final link.  I don't see the
big issue, with the linker.  I can do
addl X, (external_symbol)
in some assembly file, and the assembler just leaves a 32bit field ready
to be filled in later.  The only difference between this and the above is
that the above is a calculated 32bit value, rather than just dropping in
the address of external_symbol.

(Killer MLton app:  write a linker that does "what I want." ;)

> If you are willing to do that, then it is only one extra indirection.  You
> store the base address of the array minus _start in some cell, and then
> the store is to the contents of that word plus `.'.

That's fine, and probably what I'll settle on.  The only issue will be
convicing the codegen that references to memory that use the symbol "."
are always different (and presumably may-aliasing).  (The issue is that it
looks at memory references as syntactic constructs, and implicitly assumes
that label/symbol immediates are constant.)

> > in memory can just be what ever and be traversed when the file is written out.
> > A linked list is just fine.  You follow the links into a new array, sort them
> > and write it out.  What is the problem?
> No problem.  What you say is true enough.  It's just a question of the
> complexity of the code introduced into the code generator and the performance
> overhead.  If we can come up with a scheme with only an extra instruction per
> allocation, that would be ideal.  Maybe what you suggested above works.

The linked list would work o.k., but as Steve said, there is the
complexity (and location) of the code introduced.  Remember, everything
that is added by the codegen is added as assembly.  So, checking the link
pointer, setting the link pointer, updating the start of bins, and
incrementing the counter looks nice in C, but is messier in x86.  There is
also the fact that introducing the test for the link pointer inserts new
blocks and join points to the program.  A slightly similar issue with
making a C function call at each allocation (and I agree with Steve here
that the cost of that would be prohibitive.)  Likewise, I either insert
the additional code during the translation from the Machine IL (in which
case the extra code goes through all of the intervening passes) or I do it
sometime later, where I might have to be more careful (e.g., if I insert
after register allocation, I've got to step carefully around available