profBug
Matthew Fluet
mfluet@intertrust.com
Mon, 4 Jun 2001 08:31:09 -0700 (PDT)
> I agree that if you don't specify -s it can get it wrong, but the problem is
> that if you clump all the static routines together in `<static>', it really
> isn't telling you anything useful.
True. Hopefully most profiles won't be dominated by C code, so this might
be a minor issue. And, I suspect that when C code is a hot spot, it will
likely be the GC, which is easily recognized by the labels.
> It's interesting that you prefer writing the stuff bottom-up. I have always
> really liked top down. Way way back I remember being pained by the bottom-up
> order in Pascal. I can see why ML uses bottom-up, since everything is a
> `let*', but it still hurts. I use structures to fix some of it, but lots of
> times I would just like something like Haskell's `where'. Oh well.
I agree that top down is the way to design something, but I've just gotten
into the habit of putting aux functions before their use. Probably biased
by ML, but even in C, I would need to prototype functions, so I would just
put the function bodies there too.
> One thing I'm still not completely clear on with the profiling. What is the
> notion of the -d stuff. Is the default (`-d 0') supposed to be at the CPS
> function level and everything above that only mapping to the assembler code?
> Have we abandoned trying to relate the profiling to the CPS stuff at all? It
> seems that currently the profiling stuff only relates to the assembly code we
> generate. (Of course this could just be my out-of-date version of MLton.)
> What exactly are the higher -d values supposed to show? (I know it is more
> detail, but I'm trying to ask if there is any concept of what more I get.)
Here's the info I wrote up about the -d option. This is from an old
e-mail, so the output format is a little different, but otherwise it's the
same. The remaining option, -t n, is a threshold option: it suppresses
the display of labels that account for less than n% of the time, which is
good for cutting out all of the really minor labels.
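To make the -t behavior concrete, here is a small sketch in Python (the real mlprof is written in SML; the label names and tick counts below are invented for illustration):

```python
# Sketch of the -t n threshold: suppress labels whose share of the total
# profiling time is below n percent. Labels and tick counts are made up.

def apply_threshold(ticks, n):
    """Return (label, percent) pairs for labels with at least n% of the
    total ticks, sorted by descending percentage."""
    total = sum(ticks.values())
    kept = [(label, 100.0 * count / total)
            for label, count in ticks.items()
            if 100.0 * count / total >= n]
    return sorted(kept, key=lambda pair: -pair[1])

# With -t 5, the tiny x_733 entry is dropped from the report.
ticks = {"x_52": 630, "x_178": 130, "main_0": 80, "x_733": 2}
for label, pct in apply_threshold(ticks, 5.0):
    print("%-8s %6.2f%%" % (label, pct))
```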
From fluet@hoho.cs.cornell.edu Mon Jun 4 08:18:46 2001
Date: Thu, 22 Mar 2001 18:33:51 -0500 (EST)
From: Matthew Fluet <fluet@hoho.cs.cornell.edu>
To: MLton@sourcelight.com
Subject: mlprof output
Here's what I currently have working for the mlprof output. (I generated
the gmon.out file by compiling logic with -p -native false, so the counts
don't mean anything relative to the executable compiled with -native true,
but at least it produced enough data to get some results.)
[fluet@lennon mlprof-native]$ ./mlprof
wrong number of args
Usage: mlprof [-d{0|1|2}] [-s] a.out-file gmon.out-file
The -s option is the same as before: it determines whether or not static
functions are included. The -d option determines the detail (or depth) of
the profiling for the generated native code. The default is 0, which
yields the following:
[fluet@lennon mlprof-native]$ ./mlprof logic gmon.out
60.61 seconds of CPU time
x_52 63.09%
x_178 13.03%
main_0 7.99%
x_781 7.03%
init_dummy (C @ 0x8048C18) 2.69%
saveGlobals (C) 2.19%
x_973 1.55%
x_56 0.41%
x_603 0.35%
x_585 0.33%
x_715 0.25%
x_558 0.23%
oc_4 0.21%
x_540 0.18%
x_521 0.15%
x_97 0.08%
x_688 0.07%
x_83 0.05%
x_149 0.03%
x_670 0.03%
x_651 0.03%
x_733 0.02%
Everything not marked (C) corresponds to information derived from the
static labels. For depth 0, these correspond to top-level CPS functions.
There is a slight tradeoff here, I think, with the old version of mlprof:
because we default to ChunkPerFunc for the native backend, this
coarsest-grain profiling can correspond to very large chunks of code. To
get more detailed information, we crank up the detail option. Here's a
snippet of -d 1:
60.61 seconds of CPU time
x_52 63.09%
L_2330,L_889 13.02% 8.22%
L_2324,L_885 7.95% 5.02%
L_1349,L_2813 7.95% 5.02%
L_1937 6.69% 4.22%
L_1359,L_2823 4.68% 2.95%
... (a little less than 200 more)
x_178 13.03%
L_115,L_1542 43.80% 5.71%
L_113,L_1539 21.01% 2.74%
L_105,L_1530 5.19% 0.68%
L_1518,L_95 3.92% 0.51%
L_100,L_1524 3.16% 0.41%
...
main_0 7.99%
L_1463,L_33 48.14% 3.84%
main_0 25.21% 2.01%
L_33 12.60% 1.01%
L_13 7.64% 0.61%
L_12 3.72% 0.30%
L_10 1.65% 0.13%
L_60 0.41% 0.03%
loop_8 0.21% 0.02%
L_38 0.21% 0.02%
L_0 0.21% 0.02%
...
With detail level 1, we start incorporating labels that were generated by
backend.fun. Depth 0 labels correspond to CPS functions (which are passed
to blocks in the machine IL in the profileName field); depth 1 labels
correspond to labels on blocks in the machine IL. A lot of these labels
were inherited directly from CPS IL continuation declarations, so you
should be able to trace back to the .cps file and correlate somewhat.
Combined labels, like L_2330,L_889, correspond to two blocks that were
combined by one of the native-codegen simplification passes. Labels that
don't appear in the CPS IL were introduced by backend.fun in the
translation to the machine IL. The two columns give percentages relative
to the containing depth 0 function and to the whole program.
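As a sketch of how those two columns relate (a Python illustration with invented tick counts, not mlprof's actual code), each depth 1 block's ticks are reported both as a share of its enclosing depth 0 function and as a share of the whole program:

```python
# Sketch of the depth-1 percentage columns: each block label gets its
# share of the enclosing depth-0 function and of the whole program.
# Tick counts are invented for illustration.

def depth1_columns(profile):
    """profile maps a depth-0 function name to {block label: ticks}.
    Returns {func: {label: (pct_of_func, pct_of_program)}}."""
    program_total = sum(sum(blocks.values()) for blocks in profile.values())
    out = {}
    for func, blocks in profile.items():
        func_total = sum(blocks.values())
        out[func] = {
            label: (100.0 * t / func_total, 100.0 * t / program_total)
            for label, t in blocks.items()
        }
    return out

# A block taking 75% of its function may still be a small slice of the
# whole run, which is exactly the pattern in the x_178 snippet above.
cols = depth1_columns({"x_178": {"L_115,L_1542": 30, "L_113,L_1539": 10},
                       "x_52": {"L_1937": 60}})
print(cols["x_178"]["L_115,L_1542"])
```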
Finally, detail level 2 incorporates every label introduced by the native
codegen.
60.61 seconds of CPU time
x_52 63.09%
L_2330,L_889 13.02% 8.22%
L_889,skipGC_125 100.00% 13.02% 8.22%
L_2324,L_885 7.95% 5.02%
L_885,skipGC_123 100.00% 7.95% 5.02%
L_1349,L_2813 7.95% 5.02%
L_1349,L_2813 100.00% 7.95% 5.02%
L_1937 6.69% 4.22%
L_1938 50.00% 3.35% 2.11%
doGC_96 50.00% 3.35% 2.11%
L_1359,L_2823 4.68% 2.95%
L_1359,L_2823 100.00% 4.68% 2.95%
I don't know if this will be particularly useful, but it was easy enough
to compute. We see more blocks that were combined by simplification
passes. Labels like skipGC, doGC, and statementLimitCheckLoop are all
introduced by the native codegen, so they don't correspond to any real
code in the CPS IL. On the other hand, they do give some insight into the
interaction with the runtime; for example, time spent in a skipGC block
will generally correspond to some allocation being done after performing a
GC check. Of course, we'll need to look at real gmon.out files for any of
this to be valid.
I looked through a -d 2 profile without filtering out blocks with no
ticks, and depth 2 is almost always just one block corresponding to the
depth 1 block; when it isn't, it's because the block was broken up by a GC
check, an array allocation loop, or an overflow check.
Let me know if this gives all the information we would like. It should be
trivial to modify the gmon reader to work with whatever Henry cooks up.
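For reference, a minimal sketch of where a gmon reader would start, assuming the GNU gprof file layout (a 4-byte "gmon" magic cookie, a 32-bit version word, 12 spare bytes, then tagged records); this is only a guess at a starting point, not mlprof's reader:

```python
# Minimal gmon.out header check, assuming the GNU gprof layout:
# 4-byte magic "gmon", 32-bit version, 12 spare bytes, then tagged
# records. Byte order is the host's; "<i" assumes little-endian x86.
import struct

def read_gmon_header(data):
    """Validate the header of a gmon.out byte string and return the
    version number; raise ValueError on a bad magic cookie."""
    if len(data) < 20 or data[:4] != b"gmon":
        raise ValueError("not a GNU gmon.out file")
    (version,) = struct.unpack("<i", data[4:8])
    return version

# A fabricated header: magic, version 1, 12 spare bytes.
fake = b"gmon" + struct.pack("<i", 1) + b"\x00" * 12
print(read_gmon_header(fake))  # 1
```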