[MLton-devel] heap profiling
Stephen Weeks
MLton@mlton.org
Wed, 11 Sep 2002 19:22:27 -0700
Alain writes:
> I was wondering if we could not benefit from some form of heap
> profiling. We have some apps that really do allocate a lot, with
> GC time in some cases higher than 66%. Do you have ideas on this
topic?
We discussed heap profiling in August 2001 and came up with what we
think is a reasonable approach, but never implemented anything. Our
idea was to do something similar to mlprof that would provide dynamic
bytes allocated counts at the SSA (then called CPS) function and basic
block level. We would keep a counter for each allocation point in the
program and bump that counter by the number of bytes allocated each
time the allocation point was reached. There are some messy details
dealing with how the data is stored and implementing 64 bit
operations, but having read through our discussions again it all looks
reasonable to me.
For the record, here is my summary of our previous mail on the topic.
We had the following design choices to make
* How to store the data while running
- linked list of counters, one per point (linked in when the point
is reached)
- array with as many elements as there are bytes in the text
  segment (just like mlprof)
- parallel arrays with one element per allocation point:
  one array stores the program counter for each allocation point
  one array stores the bytes allocated for each allocation point
* How to store the data for analysis
- dense (like mlprof currently does)
- sparse
* Whether to use 32 or 64 bit bins
* How to increment the counter
- C call at each allocation
- inline code
* At what compiler pass to insert the increments
Here is what we decided.
* How to store the data while running
- for mlprof, keep as it is now, with an array with 32 bit bins, one
bin for every byte in the program.
- for mlprof-space, use parallel arrays with one element per
allocation point. The program counter array uses 32 bit bins and
the bytes allocated array uses 64 bit bins.
* How to store the data for analysis
- for both mlprof and mlprof-space, use a sparse, unsorted
representation with 32 bits for each address and 64 bits for each
count. For mlprof, this will be an improvement over the current
dense representation as long as less than 1/3 of the bins are
nonzero, which seems likely.
* How to increment the counter
- Use a few instructions inline.
  One disadvantage of this is that we have to add a little compiler
  support for 64 bit add -- but we may want this anyway. The other
  disadvantage is the compile-time and run-time cost of the inline code.
* At what compiler pass to insert the increments
- At the time Matthew proposed adding code to the x86-codegen.
However, since then the C codegen has continued to survive, and we
have seen the simplification that comes from moving stuff out of
the codegens into the backend. So, I propose to put it in the
backend.
One thing we didn't consider a year ago was how the approach would
interact with the selective profiling infrastructure that we have
since added. I think the parallel arrays work very nicely with
allowing the user program to switch which set of counters is being
used. We simply need to allocate one array of 64 bit bins for each
unit of profiling data.
So, it all seems pretty reasonable to me. I have a pretty good feel
for the difficulty of everything except for adding the 64 bit support
to the native codegen. Matthew, could you comment on that? If that's
too hard, we could go with a C call per allocation. Maybe the
performance hit will be acceptable.
Alain, I'm not sure if you've played around enough with mlprof to know
whether this kind of heap profiling information would be helpful. But
if you have, let us know what you think.
For ease of reference, here is all the previous mail.
--------------------------------------------------------------------------------
From: Matthew Fluet <mfluet@intertrust.com>
Date: Thu, 23 Aug 2001 10:14:24 -0700 (PDT)
Subject: assembly/profiling question
I've been thinking about extending profiling to gather allocation stats.
Essentially, I'd like to reuse the existing profiling infrastructure
(mlprof, mlmon.out file structure, even the bin datastructure and file
output in prof.c, if possible). Ideally, I'd like to just change every
addl X,gcState.frontier
to
addl X,gcState.frontier
addl X,(appropriate bin)
Now, when you ran mlprof, you'd see the % of total program allocation done
in each CPS function and in each CPS block.
It seems to me that by the time the entire program is linked, then
the address of the appropriate bin can be known -- it's just
(bin_addr + (current_addr - text_sec_start_addr))
so, I want to write
addl X,(bin_symbol + (. - _start))
(Yes, this is slightly different than the way prof.c is set up right now.
I was thinking of linking in an assembly directive that allocated space
for the bins in the bss section.)
But, I can't even get (. - _start) to assemble; the _start symbol is
unknown in the current file. Am I just stuck?
I know I can do it with a few additional instructions to compute the
address, but the goal had been to not need any additional registers.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 12:49:10 -0500
Subject: Re: assembly/profiling question
The problem is (almost certainly) that you can't use an external symbol in a
negative context. I.e., all uses of externals must be external+constant.
If you want to profile allocation (which you can't do if you are also
profiling run time since it would skew the times) then wouldn't it be easier
to just add one extra word allocated which had the program counter? I'm
thinking that the GC would first do a scan of the old space and write out the
data. Maybe this is more trouble.
Ah, how about this technique (a bit more expensive run-time wise): every
allocation point (or function, or what ever granularity you want) allocates a
counter which also has a link field. Then allocation (or entering the
allocation region) consists of:
    if link is NULL:
        link = start of list of used bins
        start of used bins = my counter
The result is that on exit you have a linked list of all the counters that
were used.
I seem to recall that some scheme like this is what gprof uses for its
function call counts.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Thu, 23 Aug 2001 19:33:18 -0700
Subject: Re: assembly/profiling question
Matthew and I talked a bit about profiling, and it really would help if
space-mlprof uses the same infrastructure as time-mlprof. So, he's going to
stick with a variant of the scheme he mentioned,
> addl X,gcState.frontier
> to
> addl X,gcState.frontier
> addl X,(appropriate bin)
except that since it's not possible to get the assembler to stick in the
constant we want, there will be several instructions inserted that actually
compute the address at runtime. This won't affect allocation behaviour of the
program at all, so we should be fine.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 21:46:19 -0500
Subject: Re: assembly/profiling question
I don't understand what he wants to do. You can't use
addl X, (bin_symbol + (. - _start))
(ignoring linker problems) because the bins are going to be word sized, so
you have to scale. Besides that, it would require an array whose size is the
same as the size of the program times 4. (The profiler uses this much.)
If you are willing to do that, then it is only one extra indirection. You
store the base address of the array minus _start in some cell, and then
the store is to the contents of that word plus `.'.
I can understand the convenience of having the output file be in the same
format (what ever that exactly means) as the mlprof.out file, but the data
in memory can just be what ever and be traversed when the file is written out.
A linked list is just fine. You follow the links into a new array, sort them
and write it out. What is the problem?
Note, either way the code being space-prof'd is going to be slower, so you
can just call a C function at each allocation point. This function can now
compute where you are (from the return address) and just needs to get the
size in some register.
I must be missing something since all of these seem really simple to me.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Thu, 23 Aug 2001 19:59:47 -0700
Subject: Re: assembly/profiling question
> Besides that, it would require an array whose size is the
> same as the size of the program times 4. (The profiler uses this much.)
We want that for space profiling too. In fact, 2^32 could really be a
limitation in self compiles.
> If you are willing to do that, then it is only one extra indirection. You
> store the base address of the array minus _start in some cell, and then
> the store is to the contents of that word plus `.'.
Sounds excellent. I await Matthew's approval.
> I can understand the convenience of having the output file be in the same
> format (what ever that exactly means) as the mlprof.out file, but the data
> in memory can just be what ever and be traversed when the file is written out.
> A linked list is just fine. You follow the links into a new array, sort them
> and write it out. What is the problem?
No problem. What you say is true enough. It's just a question of the
complexity of the code introduced into the code generator and the performance
overhead. If we can come up with a scheme with only an extra instruction per
allocation, that would be ideal. Maybe what you suggested above works.
> Note, either way the code being space-prof'd is going to be slower, so you
> can just call a C function at each allocation point. This function can now
> compute where you are (from the return address) and just needs to get the
> size in some register.
I think the slowdown for this would be unacceptable.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 22:12:43 -0500
Subject: Re: assembly/profiling question
I certainly didn't mean to imply that I thought that the slow down would be
too large to make the space profiling usable. It is just that you have to
remember never to do time profiling on the same run.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Thu, 23 Aug 2001 22:20:35 -0500
Subject: space profiling
I just realized that the space profiling is definitely going to have a real
problem with overflowing the bins. If I have a hot loop allocating lots of
junk, it isn't hard for one location to be responsible for more than 2^32
bytes of allocation. This seems like another argument for going for the
procedure call, and having the linked list use 64 bits worth of counter (plus
32 bits for the link). This way you are safe from wrapping.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Fri, 24 Aug 2001 09:52:59 -0700
Subject: space profiling
> I just realized that the space profiling is definitely going to have a real
> problem with overflowing the bins. If I have a hot loop allocating lots of
> junk, it isn't hard for one location to be responsible for more than 2^32
> bytes of allocation.
I agree.
> This seems like another argument for going for the
> procedure call, and having the linked list use 64 bits worth of counter (plus
> 32 bits for the link). This way you are safe from wrapping.
Couldn't we just go to 64 bit bins?
--------------------------------------------------------------------------------
From: Matthew Fluet <mfluet@intertrust.com>
Date: Fri, 24 Aug 2001 09:53:53 -0700 (PDT)
Subject: Re: assembly/profiling question
Just so we all have the same context, here's a fuller description of what
I'd like to add.
Add a Profiling structure to the MLton structure:
structure Profiling : PROFILING
signature PROFILING =
sig
(* a compile-time constant, like MLton.debug *)
val profiling: bool
(* reset all the bin counters to 0 *)
val reset: unit -> unit
(* write out all of the bins to a mlmon.out format file,
* with the given file name
*)
val write: string -> unit
end
Append (prepend?) to the input source something like
val _ = if MLton.Profiling.profiling
then OS.Process.atExit (fn () => MLton.Profiling.write "mlmon.out")
else ()
(Yes, this has slightly different semantics than the current profiling, I
think. The mlmon.out file won't have the timing info for GC_done. Maybe
the right thing is to let the final write be handled as it is, by
registering atExit with C, rather than with ML.)
The whole point being that I can now profile sub-portions of the program.
For example, I think it would be great to modify Control.trace to
automatically profile each top-level pass.
To support all this, we need to change prof.c and export a few more
functions. But I don't think that is so bad.
Then I was thinking about different things I could do with the bins and
mlprof. The advantage of mlprof is that it is already set up to process
and display information by CPS/SSA functions, CPS/SSA blocks, and x86
blocks. So, by accumulating allocation counts into bins, I automatically
can see which CPS functions are doing the majority of allocation, and even
figure out which blocks in the function are doing the most allocation.
(Probably obvious, but the key here is that I get to see dynamic counts;
it's easy to look at a block and see that it's allocating 10 tuples, but I
don't know whether that block is executed once or 10 million times.)
Similarly, we could get dynamic function counts, by incrementing a bin on
entry to each CPS function. Likewise, dynamic block counts. The only
issue with all this is that with only one set of bins, we can only profile
one thing at a time, and it will be a compile-time choice, not a runtime
choice. (But, that seems to be o.k. If I'm profiling time, I don't want
the results polluted by additional checks by the program to see
if it's currently gathering allocation stats.)
> I don't understand what he wants to do. You can't use
> addl X, (bin_symbol + (. - _start))
> (ignoring linker problems) because the bins are going to be word sized, so
> you have to scale. Besides that, it would require an array whose size is the
> same as the size of the program times 4. (The profiler uses this much.)
Yes, I forgot the scaling factor. But that just scales the expression
above by a constant value. I still claim that if the bin array is
statically allocated (e.g., in bss), then I could compute the absolute
address of the bin to increment during the final link. I don't see the
big issue, with the linker. I can do
addl X, (external_symbol)
in some assembly file, and the assembler just leaves a 32-bit field ready
to be filled in later. The only difference between this and the above is
that the above is a calculated 32-bit value, rather than just dropping in
the address of external_symbol.
(Killer MLton app: write a linker that does "what I want." ;)
> If you are willing to do that, then it is only one extra indirection. You
> store the base address of the array minus _start in some cell, and then
> the store is to the contents of that word plus `.'.
That's fine, and probably what I'll settle on. The only issue will be
convincing the codegen that references to memory that use the symbol "."
are always different (and presumably may-aliasing). (The issue is that it
looks at memory references as syntactic constructs, and implicitly assumes
that label/symbol immediates are constant.)
> > in memory can just be what ever and be traversed when the file is written out.
> > A linked list is just fine. You follow the links into a new array, sort them
> > and write it out. What is the problem?
>
> No problem. What you say is true enough. It's just a question of the
> complexity of the code introduced into the code generator and the performance
> overhead. If we can come up with a scheme with only an extra instruction per
> allocation, that would be ideal. Maybe what you suggested above works.
The linked list would work o.k., but as Steve said, there is the
complexity (and location) of the code introduced. Remember, everything
that is added by the codegen is added as assembly. So, checking the link
pointer, setting the link pointer, updating the start of bins, and
incrementing the counter looks nice in C, but is messier in x86. There is
also the fact that introducing the test for the link pointer inserts new
blocks and join points to the program. A similar issue arises with
making a C function call at each allocation (and I agree with Steve here
that the cost of that would be prohibitive). Likewise, I either insert
the additional code during the translation from the Machine IL (in which
case the extra code goes through all of the intervening passes) or I do it
sometime later, where I might have to be more careful (e.g., if I insert
after register allocation, I've got to step carefully around available
registers).
--------------------------------------------------------------------------------
From: Matthew Fluet <mfluet@intertrust.com>
Date: Fri, 24 Aug 2001 09:57:26 -0700 (PDT)
Subject: Re: space profiling
> I just realized that the space profiling is definitely going to have a real
> problem with overflowing the bins. If I have a hot loop allocating lots of
> junk, it isn't hard for one location to be responsible for more than 2^32
> bytes of allocation. This seems like another argument for going for the
> procedure call, and having the linked list use 64 bits worth of counter (plus
> 32 bits for the link). This way you are safe from wrapping.
Yeah, the overflow might be an issue. This would require additional
changes to mlprof.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Fri, 24 Aug 2001 12:55:00 -0500
Subject: Re: space profiling
You could just use 64-bit bins, but then the space requirements double again
(8 bytes of bin for every byte of object code) and now the generated code at
each allocation point is a 32-bit add to a 64-bit counter. That seems to push
harder for calling a procedure which can do it and do the linking to save
space.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Fri, 24 Aug 2001 11:01:50 -0700
Subject: Re: space profiling
> You could just use 64-bit bins, but then the space requirements double again
> (8 bytes of bin for every byte of object code) and now the generated code at
> each allocation point is a 32-bit add to a 64-bit counter. That seems to push
> harder for calling a procedure which can do it and do the linking to save
> space.
The largest program we know of is MLton, whose text segment is currently just
under 6M. I can live with 48M (since it's not in the MLton heap), although I
agree that's pushing it.
Another possibility would be to coalesce the bins. How bad would it be to
cut the number of bins in half?
--------------------------------------------------------------------------------
From: Matthew Fluet <mfluet@intertrust.com>
Date: Fri, 24 Aug 2001 11:14:55 -0700 (PDT)
Subject: Re: space profiling
> > You could just use 64-bit bins, but then the space requirements double again
> > (8 bytes of bin for every byte of object code) and now the generated code at
> > each allocation point is a 32-bit add to a 64-bit counter. That seems to push
> > harder for calling a procedure which can do it and do the linkng to save
> > space.
>
> The largest program we know of is MLton, whose text segment is currently just
> under 6M. I can live with 48M (since it's not in the MLton heap), although I
> agree that's pushing it.
But, that 48M competes with the heap for space. (Yes, we don't need to
look at it during a GC, but if we allocate the space before GC_init, it's
that much less for the heap; if we allocate the space after GC_init, we
might fail because the heap might be taking up too much. Or maybe it
doesn't matter -- we'll just end up swapping in and out sections of the
bins.)
> Another possibility would be to coalesce the bins. How bad would it be to
> cut the number of bins in half?
It makes the computation of the right bin just a little more complicated.
It might also give "incorrect" results, where time profile ticks for
blocks get shifted forward or backwards. Probably o.k. for space
profiling, because there would be enough "buffer" space (i.e., setting
everything up for the allocation and the tick) between the start of a
block and the occurrence of the tick, and likewise between the tick and the
end of the block.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Fri, 24 Aug 2001 11:58:42 -0700
Subject: Re: space profiling
> It makes the computation of the right bin just a little more complicated.
> It might also give "incorrect" results, where time profile ticks for
> blocks get shifted forward or backwards. Probably o.k. for space
> profiling, because there would be enough "buffer" space (i.e., setting
> everything up for the allocation and the tick) between the start of a
> block and the occurrence of the tick, and likewise between the tick and the
> end of the block.
That's what I was thinking. 32 bit bins for time and 64 bit bins for space.
It's a minor change to mlprof (once we get Int64 added to MLton) to make it
handle both formats.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Sat, 25 Aug 2001 00:13:40 -0500
Subject: Re: assembly/profiling question
I finally had a chance to read through your (Matthew) mail on the Profiling
structure. Note that at the very least we need separate flags for heap
profiling and for time profiling since you can't do both correctly. Since
the heap profiling requires different code, I agree that it has to be a
compile-time option. The time profiling can, if you use the
`__attribute__((weak))' trick used in gc.c, be just a link-time option, which
is kind of nice, but not a big deal.
Next, I agree that having the same format file for heap and time profiling is
rather convenient, but I suggest that the translation can be done in the code
to write the file. Actually, it would probably be good to make the time
profiling data be sparse, so just go with an all sparse form of the data.
I.e., the output consists of some header (different from the current one for
compatibility) and then a sequence of bins, each of which contains an address
and a non-zero count. I would probably vote for the bin addresses NOT being
sorted in the file, and require the mlprof program to sort them. With the
sparse bins, it would probably be ok to always make them 64 bits. The
argument is that with a 100 Hz clock rate even if the program runs for one
hour of CPU time you have a maximum of 360,000 bins, so at 12 bytes per bin
(4 bytes for the address and 8 for the counter), this is only 4.3 meg. The
sparse representation with 64 bit counters costs 3 times the space per non-
zero bin, so it is smaller if the bins are at most 1/3 non-zero. This is
quite likely since lots of bins are zero because of instructions that are
more than one byte long.
The only funny extra thing about time profiling is the `unknown' count (the
number of times that an interrupt happened when you were in a shared
library).
Hm, I was thinking that the only code added for heap profiling would be
something which loaded one register with the size of the block, loaded
another with the address of the bin and then did a jsr to a fixed function
(written by hand). Now that I think about this I see that this won't work
because you don't have a stack. Still, that doesn't really matter since the
following would work:
One register gets the size of the block being allocated.
One register gets the address of the bin.
One register gets the address to return to.
and then you jump to a fixed location. This location (in libmlton.a) would
have the hand-written code to do the linking and incrementing of the bin. I
guess that the code generator would have to know to emit this sequence of
loads and the jump (probably easy) and also that the registers are going to
get used (could be hard).
Actually, here is what I would do: there would be a fixed area of size 3*4
bytes. Now the code at each allocation would be
Store register 1 in the first word of the fixed area.
Load register 1 with the size of the block being allocated.
Store register 2 in the second word of the fixed area.
Load register 2 with the address of the bin.
Store register 3 in the third word of the fixed area.
Load register 3 with the address following the jump instruction.
jump to the fixed code.
Note that now this code has NO effect that the code generator needs to
understand. The only connection between it and the code generator is getting
the size of the block.
As to the time overhead, I claim it will be essentially squat. I.e., I would
expect even in a program allocating lots of very small things that running
with heap profiling on would probably be 1/4 the speed of running with it
off at worst. Is that too much? (I admit, the above is just a gut feeling,
but the point is that you are doing a bit more than a dozen extra memory
references, and the base line is at least 2-3 words allocated.)
Now to make the format of the file invariant between heap and time profiling,
you translate in the code which writes out the file. In the heap case you
don't do much: just walk the bins and write them out. (Note, in memory the
bin includes a next pointer, an address (of the allocation point in the code
stream) and a 64 bit counter, so 4 words. The write function doesn't write
out the next field.)
In the time profiling case, you just walk through the bins (which are a
contiguous array) and write out the corresponding location and count, padded
to 64 bits, for each count which is non-zero. Note, the reason for the
difference in memory layout is that the time profiling must minimize the
impact on CPU time (since that is what we are measuring) and also we don't
know what bins are possible to use when we start, so we just allocate an
array for all of them. Also the bins only have to be 32 bits (since that
won't overflow for over a year of CPU time).
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Mon, 27 Aug 2001 16:06:47 -0500
Subject: space profiling
No comment on my suggestion for heap profiling?
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Mon, 27 Aug 2001 15:32:56 -0700
Subject: space profiling
> No comment on my suggestion for heap profiling?
I just took a look. I am trying to sort through the many issues in my head.
Here are some thoughts.
* I like your idea for inserting instructions so that the code generator doesn't
have to know anything about how profiling works or how whatever statements get
inserted muck with registers. I think that is the way to go and should be
applicable to any scheme, if needed.
* I like the idea of separating the constraints on in-memory representation of
profiling information from the constraints on file format.
* I like the idea of using the same sparse, unsorted, file format both for space
and time.
* I agree that for time profiling, we should stick with an array of 32 bit bins,
one bin per byte of code.
* I think that for space profiling, we should use an array of 96 bit bins, one
bin *per allocation point*, not per byte of executable. The bin stores the 32
bit code address and the 64 bit counter. Each allocation point is assigned a
unique integer. With this approach, we might even be able to use Matthew's
idea for one increment instruction to do the bump (the table and the unique
integer are compile/link time constants -- the only unknown at runtime is
the allocation amount, and then only in the case of arrays). The number of
allocation points, even for a self-compile, is small enough to make this
approach fine. I just measured and there are 55,651 allocations and 1,922
array allocations in a self compile.
* I am not willing to give up a run time factor of four for space profiling, and
probably not even a factor of 2. I would like to continue brainstorming until
we find an approach that we are all happy with that has hopefully no more than
a 10% or 20% hit. Maybe what I suggested above meets this?
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Mon, 27 Aug 2001 18:02:29 -0500
Subject: Re: space profiling
It sounds like you like all of my idea except for the run-time cost. I haven't
actually produced a test of it yet (I will later today), but my claim was that
I can't believe that it would EVER be more than a factor of 4 times slower.
Note that if the program isn't allocating much, then the overhead is irrelevant,
and if it is, then the temporary area and allocation bins are all going to
be in L1 cache, which will make this very fast. I think that it will really
not hurt much at all.
I'll do some tests today hacking up some assembler code to see how much of a
burden it is.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Mon, 27 Aug 2001 16:10:52 -0700
Subject: Re: space profiling
> It sounds like you like all of my idea except for the run-time cost. I haven't
> actually produced a test of it yet (I will later today),
CVS is *much* more important.
> but my claim was that
> I can't believe that it would EVER be more than a factor of 4 times slower.
> Note that if the program isn't allocating much, then the overhead is irrelevant,
> and if it is, then the temporary area and allocation bins are all going to
> be in L1 cache, which will make this very fast. I think that it will really
> not hurt much at all.
> I'll do some tests today hacking up some assembler code to see how much of a
> burden it is.
How about my idea of an array with a bin per allocation point? It seems to have
both low time and space overhead.
Also, I don't completely understand your linked list approach. How do you get
from a code address to the appropriate bucket? Is a pointer stored in the code?
Or is it like my approach with a per code point spot know at compile time? If
so, then the only difference between our approaches is that mine allocates all
the bins statically in an array, while yours allocates them dynamically. As
I've argued, the most we're gonna see is 60,000 bins, so why not allocate them
statically.
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Mon, 27 Aug 2001 18:49:36 -0500
Subject: Re: space profiling
One problem (perhaps not a big deal) with statically allocating the bins is
that you have to know the number of allocation points. Also if you are going
to avoid writing out 0's (which will probably make the mlprof file much
smaller if you do) then you still have to scan the bin to see the size. I
agree, the difference isn't much either way.
If the bins are statically allocated, then the code at an allocation point is
going to look like:
    save flags somewhere
    addl <allocation size>, <low 32 bits of counter>
    adcl $0, <high 32 bits of counter>
    restore flags from somewhere
If the bins are dynamically allocated, then the code at an allocation point
is going to be:
movl %esp, <save area>
movl <save area - 4>, %esp
pushl <allocation size>
pushl $<my bin address>
jsr <magic routine>
You can't store the bin after the jsr because it is in code space which is
not writable.
And now the magic routine can do what ever is needed, including restoring
everything (not trivial to get the stack back and then do the return, but
doable).
Ok, I agree, the fixed bins is a win.
Did you agree or disagree that for timing the bins should still be 96 bits on
output? I think they should just so that the code in mlprof doesn't worry
about more than one kind of bin. Going to the sparse variety will probably
save more space than this will cost us.
CVS moves another slot higher on the magic list.
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Mon, 27 Aug 2001 16:56:08 -0700
Subject: Re: space profiling
> One problem (perhaps not a big deal) with statically allocating the bins is
> that you have to know the number of allocation points.
No problem. I think it is easy for Matthew to recognize frontier increments.
> If the bins are statically allocated, then the code at an allocation point is
> going to look like:
>
> save flags somewhere
> addl <allocation size>, <low 32 bits of counter>
> adcl $0, <high 32 bits of counter>
> restore flags from somewhere
Yuck, but it makes sense.
> Ok, I agree, the fixed bins is a win.
Excellent.
> Did you agree or disagree that for timing the bins should still be 96 bits on
> output? I think they should just so that the code in mlprof doesn't worry
> about more than one kind of bin. Going to the sparse variety will probably
> save more space than this will cost us.
I agree. One format for both time and space, sparse, 96 bit.
--------------------------------------------------------------------------------
From: Matthew Fluet <fluet@CS.Cornell.EDU>
Date: Tue, 28 Aug 2001 09:16:22 -0400 (EDT)
Subject: Re: space profiling
> > One problem (perhaps not a big deal) with statically allocating the bins is
> > that you have to know the number of allocation points.
>
> No problem. I think it is easy for Matthew to recognize frontier increments.
Yes, that's no problem. I now believe that we can do space profiling with
no impact on register allocation and without extraneous help from the
linker (at the expense of an insane number of symbols in the resulting
executable). The trick is that the register allocation will build up a
global data section, to be emitted as a final .S file.
My proposal is to do the following.
We recognize when register allocation is processing an instruction of the
form
addl <size of allocation>,<frontier location>
This should be allocated as
addl <reg or immediate>,<reg>
Now, we gensym two labels: LabelPC and LabelBin
Now, we emit the allocated instruction as
LabelPC:
addl <reg or immediate>,<reg>
addl <reg or immediate>,$(LabelBin+4)
adcl $0,$(LabelBin+8)
And we add to the global data section
LabelBin:
.long $LabelPC, 0, 0
Once we've finished register allocation for the whole program, the global data
section will look like:
LabelBinA:
.long $LabelPCA, 0, 0
LabelBinB:
.long $LabelPCB, 0, 0
...
LabelBinZ:
.long $LabelPCZ, 0, 0
We choose some common label known to the output routine and emit the
global data section as:
.data
LabelSpaceProfBins:
.long <number of bins>
LabelBinA:
.long $LabelPCA, 0, 0
LabelBinB:
.long $LabelPCB, 0, 0
...
LabelBinZ:
.long $LabelPCZ, 0, 0
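In C terms, the emitted data section amounts to a count followed by one {pc, lo, hi} triple per allocation point, each triple matching a `.long $LabelPC, 0, 0` line above (struct names are hypothetical):

```c
#include <stdint.h>

/* Each ".long $LabelPC, 0, 0" entry is a 12-byte triple: the code
   address of the allocation point plus the two 32-bit counter halves. */
struct bin { uint32_t pc, lo, hi; };

/* Model of LabelSpaceProfBins: the bin count followed by the triples. */
struct space_prof_bins {
  uint32_t num_bins;
  struct bin bins[];    /* flexible array member */
};
```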
Now, on emission of space profiling info, the output routine should just
need to copy the right number of bins from LabelSpaceProfBins. (Or, I
guess, just emit the non-zero bins, to use the sparse format.)
> > If the bins are statically allocated, then the code at an allocation point is
> > going to look like:
> >
> > save flags somewhere
> > addl <allocation size>, <low 32 bits of counter>
> > adcl $0, <high 32 bits of counter>
> > restore flags from somewhere
So, this is similar to what I'm proposing above. I don't quite know what
you mean by "save flags somewhere". The previous instruction should be
the frontier increment, which will set the condition flags in some way,
but they are never used. If you meant save registers somewhere, then the
proposal above eliminates the need.
> > Ok, I agree, the fixed bins is a win.
The only disadvantage is that the size of the executable will be much
larger, but if Steve is saying that there are only 60,000 allocation
points, then that shouldn't be a problem. We might get some slow link
time as the linker attempts to resolve all 120,000 symbols.
O.k. New proposal (although this puts more burden on the output routine).
Instead of gensyming new symbols, we just keep a counter of the number of
allocation points seen. Each bin is now a constant offset from the start
of the bins. We still need symbols to store the PC somewhere. With this
method, I'd suggest two static arrays. One of just bins and one of just
PC addresses. (The one of just bins could be put in .bss section, which
saves a little space in the final executable.) The output routine needs
to walk both arrays in tandem, but that shouldn't be a big problem.
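A sketch of such an output routine, walking the PC array and bin array in tandem and emitting one record per nonzero bin in the sparse format (all names are hypothetical, for illustration only):

```c
#include <stdint.h>
#include <stdio.h>

/* Walk the two parallel arrays together and write one sparse record
   (32-bit PC plus 64-bit count) per nonzero bin; returns the number
   of records written. */
static size_t write_sparse(FILE *out, const uint32_t *pcs,
                           const uint64_t *bins, size_t n) {
  size_t written = 0;
  for (size_t i = 0; i < n; i++) {
    if (bins[i] == 0)
      continue;               /* sparse: skip empty bins */
    fwrite(&pcs[i], sizeof pcs[i], 1, out);
    fwrite(&bins[i], sizeof bins[i], 1, out);
    written++;
  }
  return written;
}
```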
> > Did you agree or disagree that for timing the bins should still be 96 bits on
> > output? I think they should just so that the code in mlprof doesn't worry
> > about more than one kind of bin. Going to the sparse variety will probably
> > save more space than this will cost us.
>
> I agree. One format for both time and space, sparse, 96 bit.
Sounds good, although that means we need 64-bit integer support. I don't
think that will be too difficult. I looked at what gcc does for long long
operations, and it's pretty straightforward. Probably the biggest issue
will be redoing the C calling convention to support returns of 64 bit
values in %eax and %edx.
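For reference, on 32-bit x86 gcc lowers a `long long` add to an add/adc pair and returns the 64-bit result in the %edx:%eax pair; a tiny illustrative example of the kind of operation involved:

```c
#include <stdint.h>

/* Compiled by gcc for 32-bit x86, this lowers to an addl/adcl pair
   and returns its result split across %edx:%eax. */
static uint64_t add64(uint64_t a, uint64_t b) {
  return a + b;
}
```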
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Tue, 28 Aug 2001 09:32:22 -0700
Subject: Re: space profiling
> Yes, that's no problem. I now believe that we can do space profiling with
> no impact on register allocation and without extraneous help from the
> linker (at the expense of an insane number of symbols in the resulting
> executable).
...
> O.k. New proposal (although this puts more burden on the output routine).
> Instead of gensyming new symbols, we just keep a counter of the number of
> allocation points seen. Each bin is now a constant offset from the start
> of the bins. We still need symbols to store the PC somewhere. With this
> method, I'd suggest two static arrays. One of just bins and one of just
> PC addresses. (The one of just bins could be put in .bss section, which
> saves a little space in the final executable.) The output routine needs
> to walk both arrays in tandem, but that shouldn't be a big problem.
I prefer this to the insane number of symbols approach.
Does it make sense to move any of this into MachineOutput?
--------------------------------------------------------------------------------
From: Henry Cejtin <henry@sourcelight.com>
Date: Tue, 28 Aug 2001 13:57:08 -0500
Subject: Re: space profiling
In my mail, the reason for the `save flags somewhere' and `restore flags
from somewhere' was the idea that the entire rest of the code generator
could be completely unaware that anything was going on. If this is no big
deal to the code generator, then it looks like the two parallel arrays are
the way to go.
As to the 64 bit integer support, for now just using IntInf's should be fine.
It isn't like mlprof does much. As to the code in the program being
profiled, I would still probably vote (weakly) for doing this in C. With the
global nature of MLton, any ML code could cause changes in other things.
--------------------------------------------------------------------------------
From: Matthew Fluet <fluet@CS.Cornell.EDU>
Date: Wed, 29 Aug 2001 11:57:56 -0400 (EDT)
Subject: Re: space profiling
> As to the 64 bit integer support, for now just using IntInf's should be fine.
> It isn't like mlprof does much.
True. But, we'll want them eventually, particularly for Position.int's in
files.
> As to the code in the program being
> profiled, I would still probably vote (weakly) for doing this in C. With the
> global nature of MLton, any ML code could cause changes in other things.
Vote for doing what in C?
--------------------------------------------------------------------------------
From: Stephen Weeks <sweeks@intertrust.com>
Date: Wed, 29 Aug 2001 09:25:30 -0700
Subject: Re: space profiling
> > As to the code in the program being
> > profiled, I would still probably vote (weakly) for doing this in C. With the
> > global nature of MLton, any ML code could cause changes in other things.
>
> Vote for doing what in C?
I believe he means the atExit stuff. And it is a good point that we don't want
profiling to interfere with optimization of the program -- the easiest way to do
that is to keep it out of ML.
--------------------------------------------------------------------------------
_______________________________________________
MLton-devel mailing list
MLton-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlton-devel