assembly/profiling question

Henry Cejtin henry@sourcelight.com
Sat, 25 Aug 2001 00:13:40 -0500


I  finally  had a chance to read through your (Matthew) mail on the Profiling
structure.  Note that at the very least  we  need  separate  flags  for  heap
profiling  and  for  time profiling since you can't do both correctly.  Since
the heap profiling requires different code, I agree  that  it  has  to  be  a
compile-time   option.    The   time   profiling   can,   if   you   use  the
`__attribute__((weak))' trick used in gc.c, be just a link-time option, which
is kind of nice, but not a big deal.
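
Just so we are talking about the same thing, one flavor of that trick looks
roughly like the following (the function name is invented, not whatever gc.c
actually uses):

    /* gc.c (runtime side): a weak default that does nothing. */
    void MLton_profileTimeInc (void) __attribute__ ((weak));
    void MLton_profileTimeInc (void) { }

    /* profile-time.c, linked in only when time profiling is wanted: this
     * strong definition wins over the weak default at link time. */
    void MLton_profileTimeInc (void) {
            /* bump the bin for the interrupted code address */
    }

The runtime can then call MLton_profileTimeInc unconditionally from the
profiling interrupt handler, and whether the call does anything is decided
purely by what gets linked in.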

Next, I agree that having the same format file for heap and time profiling is
rather convenient, but I suggest that the translation can be done in the code
to write the file.  Actually, it would probably be good to make the time
profiling data be sparse, so just go with an all sparse  form  of  the  data.
I.e.,  the output consists of some header (different from the current one for
compatibility) and then a sequence of bins, each of which contains an address
and  a non-zero count.  I would probably vote for the bin addresses NOT being
sorted in the file, and require the mlprof program to sort  them.   With  the
sparse  bins,  it  would  probably  be  ok  to always make them 64 bits.  The
argument is that with a 100 Hz clock rate, even if the program runs for one
hour  of  CPU time you have a maximum of 360,000 bins, so at 12 bytes per bin
(4 bytes for the address and 8 for the counter), this is only 4.3  meg.   The
sparse representation with 64 bit counters costs 3 times the space per non-zero
bin (12 bytes versus the 4 a dense array of 32 bit counters would use), so it
is smaller whenever at most 1/3 of the bins are non-zero.  This is
quite  likely  since  lots  of bins are zero because of instructions that are
more than one byte long.
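
To be concrete, writing out one of those bins is just this (the function name
is made up; the only fixed points are the 4 byte address and the 8 byte count,
written unsorted):

    #include <stdint.h>
    #include <stdio.h>

    /* One sparse bin record: a 4 byte code address followed by a 64 bit
     * count, 12 bytes total.  Records go out unsorted; mlprof sorts by
     * address when it reads the file back in. */
    static void writeBinRecord (FILE *out, uint32_t address,
                                uint64_t count) {
            fwrite (&address, sizeof address, 1, out);
            fwrite (&count, sizeof count, 1, out);
    }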

The only funny extra thing about time profiling is the `unknown'  count  (the
number  of  times  that  an  interrupt  happened  when  you  were in a shared
library).

Hm, I was thinking that the only code  added  for  heap  profiling  would  be
something  which  loaded  one  register  with  the  size of the block, loaded
another with the address of the bin and then did a jsr to  a  fixed  function
(written  by  hand).   Now that I think about this I see that this won't work
because you don't have a stack.  Still, that doesn't really matter since  the
following would work:

    One register gets the size of the block being allocated.
    One register gets the address of the bin.
    One register gets the address to return to.

and  then  you jump to a fixed location.  This location (in libmlton.a) would
have the hand-written code to do the linking and incrementing of the bin.   I
guess  that  the  code  generator would have to know to emit this sequence of
loads and the jump (probably easy) and also that the registers are  going  to
get used (could be hard).

Actually,  here  is  what I would do: there would be a fixed area of size 3*4
bytes.  Now the code at each allocation would be

    Store register 1 in the first word of the fixed area.
    Load register 1 with the size of the block being allocated.
    Store register 2 in the second word of the fixed area.
    Load register 2 with the address of the bin.
    Store register 3 in the third word of the fixed area.
    Load register 3 with the address following the jump instruction.
    Jump to the fixed code.

Note that now this code has NO  effect  that  the  code  generator  needs  to
understand.  The only connection between it and the code generator is getting
the size of the block.
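
Just so it is clear what the fixed code has to do, here is the guts of it in C
(all names invented; the real thing is a few lines of hand-written assembly
around this so it can restore the three registers and jump back through the
saved return address).  The bin layout matches what I describe below: a next
pointer, the code address, and a 64 bit counter.

    #include <stdint.h>

    struct Bin {
            struct Bin *next;       /* links bins that have been hit */
            uint32_t address;       /* allocation point in the code stream */
            uint64_t count;         /* I am assuming bytes allocated here */
    };

    static struct Bin *hitBins = NULL;

    /* Called (conceptually) with the two values the emitted code put in
     * registers: the bin for this allocation point and the block size. */
    static void incBin (struct Bin *bin, uint32_t size) {
            if (bin->count == 0) {          /* first hit: link it in */
                    bin->next = hitBins;
                    hitBins = bin;
            }
            bin->count += size;
    }

(Whether the counter accumulates bytes or just counts allocations, and whether
the linking happens on first hit or up front, are details; the point is that
none of this is visible to the code generator.)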

As to the time overhead, I claim it will be essentially squat.  I.e., I would
expect  even  in  a program allocating lots of very small things that running
with heap profiling on would probably be 1/4 the speed of running with it
off  at worst.  Is that too much?  (I admit, the above is just a gut feeling,
but the point is that you are doing a bit more  than  a  dozen  extra  memory
references, and the base line is at least 2-3 words allocated.)

Now to make the format of the file invariant between heap and time profiling,
you translate in the code which writes out the file.  In the  heap  case  you
don't do much: just walk the bins and write them out.  (Note, in memory the
bin includes a next pointer, an address (of the allocation point in the  code
stream)  and  a 64 bit counter, so 4 words.  The write function doesn't write
out the next field.)
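
So, reusing the Bin struct from the sketch above, the heap-case writer is
essentially just this (again, the name is invented):

    #include <stdio.h>

    /* Walk the bins that got hit and write each as a 12 byte record: the
     * 4 byte address followed by the 64 bit count.  The next field exists
     * only for this in-memory walk and never reaches the file. */
    static void writeHeapProfile (FILE *out, struct Bin *bins) {
            struct Bin *b;

            for (b = bins; b != NULL; b = b->next) {
                    fwrite (&b->address, sizeof b->address, 1, out);
                    fwrite (&b->count, sizeof b->count, 1, out);
            }
    }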

In the time profiling case, you just walk  through  the  bins  (which  are  a
contiguous  array) and write out the corresponding location and count, padded
to 64 bits, for each count which is  non-zero.   Note,  the  reason  for  the
difference  in  memory  layout  is  that the time profiling must minimize the
impact on CPU time (since that is what we are measuring) and  also  we  don't
know which bins will get used when we start, so we just allocate an array
covering all of them.  Also, the bins only have to be 32 bits, since at 100 Hz
a 32 bit counter won't overflow for over a year of CPU time.
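
A matching sketch for the time case, assuming one 32 bit counter per code byte
starting at some base address (the names, and how base and length get passed
in, are invented):

    #include <stdint.h>
    #include <stdio.h>

    /* counters[i] holds the ticks charged to code address base + i.  Only
     * non-zero counters are written, each widened to 64 bits so the record
     * format is the same as in the heap case. */
    static void writeTimeProfile (FILE *out, uint32_t base,
                                  uint32_t *counters, size_t n) {
            size_t i;

            for (i = 0; i < n; i++) {
                    if (counters[i] != 0) {
                            uint32_t address = base + (uint32_t)i;
                            uint64_t count = counters[i];

                            fwrite (&address, sizeof address, 1, out);
                            fwrite (&count, sizeof count, 1, out);
                    }
            }
    }

The `unknown' count could go out as one more record or sit in the header,
whichever is easier for mlprof.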