assembly/profiling question
Henry Cejtin
henry@sourcelight.com
Sat, 25 Aug 2001 00:13:40 -0500
I finally had a chance to read through your (Matthew) mail on the Profiling
structure. Note that at the very least we need separate flags for heap
profiling and for time profiling since you can't do both correctly. Since
the heap profiling requires different code, I agree that it has to be a
compile-time option. The time profiling can, if you use the
`__attribute__((weak))' trick used in gc.c, be just a link-time option, which
is kind of nice, but not a big deal.
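Roughly, the weak-symbol trick looks like this (just a C sketch; the names here are made up, not the ones actually in gc.c):

    #include <stddef.h>

    /* If the time-profiling object file is not linked in, the weak symbol
     * has address NULL and the handler does nothing, so no compile-time
     * flag is needed.  Hypothetical names only. */
    extern void timeProfileTick (void *pc) __attribute__ ((weak));

    void handleProfTick (void *pc) {
            if (timeProfileTick != NULL)    /* present only if linked in */
                    timeProfileTick (pc);
    }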
Next, I agree that having the same format file for heap and time profiling is
rather convenient, but I suggest that the translation can be done in the code
that writes the file. Actually, it would probably be good to make the time
profiling data sparse as well, so just go with an all-sparse form of the data.
I.e., the output consists of some header (different from the current one for
compatibility) and then a sequence of bins, each of which contains an address
and a non-zero count. I would probably vote for the bin addresses NOT being
sorted in the file, and require the mlprof program to sort them. With the
sparse bins, it would probably be ok to always make the counts 64 bits. The
argument is that with a 100 Hz clock rate, even if the program runs for one
hour of CPU time you get at most 360,000 ticks, hence at most 360,000 non-zero
bins, so at 12 bytes per bin (4 bytes for the address and 8 for the counter)
this is only about 4.3 meg. The sparse representation with 64-bit counters
costs 3 times the space per non-zero bin (12 bytes versus the 4 bytes of a
dense 32-bit counter), so it is smaller whenever at most 1/3 of the bins are
non-zero. This is quite likely, since lots of bins are zero because of
instructions that are more than one byte long.
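To be concrete, one record of the sparse file would be just an address plus a
64-bit count, something like the following (the struct and field names are
only illustrative, and the two fields would be written separately, 4 bytes
then 8 bytes, so no compiler padding ends up in the file):

    #include <stdint.h>

    /* One sparse bin as it appears in the file: 12 bytes total. */
    struct sparseBin {
            uint32_t address;   /* code address of the bin */
            uint64_t count;     /* non-zero count, always 64 bits */
    };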
The only funny extra thing about time profiling is the `unknown' count (the
number of times that an interrupt happened when you were in a shared
library).
Hm, I was thinking that the only code added for heap profiling would be
something which loaded one register with the size of the block, loaded
another with the address of the bin and then did a jsr to a fixed function
(written by hand). Now that I think about this I see that this won't work
because you don't have a stack. Still, that doesn't really matter since the
following would work:
One register gets the size of the block being allocated.
One register gets the address of the bin.
One register gets the address to return to.
and then you jump to a fixed location. This location (in libmlton.a) would
have the hand-written code to do the linking and incrementing of the bin. I
guess that the code generator would have to know to emit this sequence of
loads and the jump (probably easy) and also that the registers are going to
get used (could be hard).
Actually, here is what I would do: there would be a fixed area of size 3*4
bytes. Now the code at each allocation would be:
Store register 1 in the first word of the fixed area.
Load register 1 with the size of the block being allocated.
Store register 2 in the second word of the fixed area.
Load register 2 with the address of the bin.
Store register 3 in the third word of the fixed area.
Load register 3 with the address following the jump instruction.
jump to the fixed code.
Note that now this code has NO effect that the code generator needs to
understand. The only connection between it and the code generator is getting
the size of the block.
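Here is, roughly, what the hand-written fixed code would have to do, written
out in C just to show the bookkeeping. In reality it is assembly in
libmlton.a that would also reload the three saved registers from the
3*4-byte area and jump back to the saved return address; all the names here
are made up, and I'm assuming the counter accumulates bytes allocated:

    #include <stdint.h>
    #include <stddef.h>

    /* In-memory heap bin: next pointer, code address, 64-bit counter --
     * 4 words on a 32-bit machine.  Field names are made up. */
    struct heapBin {
            struct heapBin *next;   /* links the bins that have been hit */
            uint32_t address;       /* allocation point in the code stream */
            uint64_t count;         /* assumed to accumulate bytes allocated */
    };

    static struct heapBin sentinel;               /* marks the end of the list */
    static struct heapBin *usedBins = &sentinel;  /* head of the list of hit bins */

    void heapProfileInc (struct heapBin *bin, uint32_t size) {
            if (bin->next == NULL) {        /* first hit: link the bin in */
                    bin->next = usedBins;
                    usedBins = bin;
            }
            bin->count += size;
    }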
As to the time overhead, I claim it will be essentially squat. I.e., I would
expect that even in a program allocating lots of very small things, running
with heap profiling on would at worst be about 1/4 the speed of running with
it off. Is that too much? (I admit, the above is just a gut feeling, but the
point is that you are doing a bit more than a dozen extra memory references,
and the baseline is at least 2-3 words allocated.)
Now to make the format of the file invariant between heap and time profiling,
you translate in the code which writes out the file. In the heap case you
don't do much: just walk the bins and write them out. (Note, in memory the
bin includes a next pointer, an address (of the allocation point in the code
stream), and a 64-bit counter, so 4 words. The write function doesn't write
out the next field.)
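So the heap-case write function is just something like this (again a sketch,
using the made-up heapBin/usedBins from the sketch above; only the address
and the count go to the file):

    #include <stdio.h>

    void writeHeapProfile (FILE *out) {
            struct heapBin *bin;
            for (bin = usedBins; bin != &sentinel; bin = bin->next) {
                    fwrite (&bin->address, sizeof bin->address, 1, out);
                    fwrite (&bin->count, sizeof bin->count, 1, out);
            }
    }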
In the time profiling case, you just walk through the bins (which are a
contiguous array) and write out the corresponding location and count, padded
to 64 bits, for each count that is non-zero. Note, the reason for the
difference in memory layout is that the time profiling must minimize the
impact on CPU time (since that is what we are measuring), and also we don't
know which bins might get used when we start, so we just allocate an array
for all of them. Also, these bins only have to be 32 bits (since at 100 Hz
that won't overflow for over a year of CPU time).
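The time-profiling writer would then be something like this (sketch only;
bins[] is the contiguous array of 32-bit counters, one per code address
starting at some textStart, and the names are made up):

    #include <stdio.h>
    #include <stdint.h>

    void writeTimeProfile (FILE *out, uint32_t *bins, uint32_t numBins,
                           uint32_t textStart) {
            uint32_t i;
            for (i = 0; i < numBins; ++i)
                    if (bins[i] != 0) {
                            uint32_t address = textStart + i;  /* code address for bin i */
                            uint64_t count = bins[i];          /* pad 32-bit count to 64 bits */
                            fwrite (&address, sizeof address, 1, out);
                            fwrite (&count, sizeof count, 1, out);
                    }
    }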