mlprof output
Matthew Fluet
Matthew Fluet <fluet@CS.Cornell.EDU>
Thu, 22 Mar 2001 18:33:51 -0500 (EST)
Here's what I currently have working for the mlprof output. (I generated
the gmon.out file by compiling logic with -p -native false, so the counts
don't mean anything relative to the executable compiled with -native true,
but at least it produced enough data to get some results).
[fluet@lennon mlprof-native]$ ./mlprof
wrong number of args
Usage: mlprof [-d{0|1|2}] [-s] a.out-file gmon.out-file
The -s option is the same as before: it determines whether or not static
functions are included. The -d option determines the detail (or depth) of
the profiling for the generated native code. The default is 0, which yields
the following:
[fluet@lennon mlprof-native]$ ./mlprof logic gmon.out
60.61 seconds of CPU time
x_52 63.09%
x_178 13.03%
main_0 7.99%
x_781 7.03%
init_dummy (C @ 0x8048C18) 2.69%
saveGlobals (C) 2.19%
x_973 1.55%
x_56 0.41%
x_603 0.35%
x_585 0.33%
x_715 0.25%
x_558 0.23%
oc_4 0.21%
x_540 0.18%
x_521 0.15%
x_97 0.08%
x_688 0.07%
x_83 0.05%
x_149 0.03%
x_670 0.03%
x_651 0.03%
x_733 0.02%
Everything not marked (C) corresponds to information derived from the
static labels. For depth 0, these correspond to top-level CPS functions.
There is a slight tradeoff here, I think, with the old version of mlprof;
because we default to ChunkPerFunc for the native backend, this
coarsest-grain profiling can correspond to very large chunks of code. To
get more detailed information, we crank up the detail option. Here's a
snippet of -d 1:
60.61 seconds of CPU time
x_52 63.09%
L_2330,L_889 13.02% 8.22%
L_2324,L_885 7.95% 5.02%
L_1349,L_2813 7.95% 5.02%
L_1937 6.69% 4.22%
L_1359,L_2823 4.68% 2.95%
... (a little less than 200 more)
x_178 13.03%
L_115,L_1542 43.80% 5.71%
L_113,L_1539 21.01% 2.74%
L_105,L_1530 5.19% 0.68%
L_1518,L_95 3.92% 0.51%
L_100,L_1524 3.16% 0.41%
...
main_0 7.99%
L_1463,L_33 48.14% 3.84%
main_0 25.21% 2.01%
L_33 12.60% 1.01%
L_13 7.64% 0.61%
L_12 3.72% 0.30%
L_10 1.65% 0.13%
L_60 0.41% 0.03%
loop_8 0.21% 0.02%
L_38 0.21% 0.02%
L_0 0.21% 0.02%
...
With detail level 1, we start incorporating labels that were generated by
backend.fun. Depth 0 labels correspond to CPS functions (whose names are
passed to blocks in the machine IL via the profileName field); depth 1
labels correspond to labels on blocks in the machine IL. A lot of these
labels were inherited directly from CPS IL continuation declarations, so
you should be able to trace back to the .cps file and correlate them
somewhat. Combined labels, like L_2330,L_889, correspond to two blocks
that were merged by one of the native-codegen simplification passes.
Labels that don't appear in the CPS IL were introduced by backend.fun in
the translation to the machine IL. The two columns give percentages
relative to the containing depth 0 function and to the whole program.
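As a sanity check on how the two columns relate (my own arithmetic, not
anything mlprof prints): a depth 1 block's whole-program share should be
roughly the product of its share of the enclosing function and that
function's share of the program. For the L_2330,L_889 line under x_52:

```python
# Check of the column arithmetic, using the numbers printed above.
# Differences in the last digit come from rounding of the displayed
# percentages.
within_function = 13.02 / 100.0   # L_2330,L_889's share of x_52 (column 1)
function_share  = 63.09 / 100.0   # x_52's share of the whole program
whole_program   = within_function * function_share * 100.0
print("%.2f%%" % whole_program)   # close to the reported 8.22% (column 2)
```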
Finally, detail level 2 incorporates every label introduced by the native
codegen.
60.61 seconds of CPU time
x_52 63.09%
L_2330,L_889 13.02% 8.22%
L_889,skipGC_125 100.00% 13.02% 8.22%
L_2324,L_885 7.95% 5.02%
L_885,skipGC_123 100.00% 7.95% 5.02%
L_1349,L_2813 7.95% 5.02%
L_1349,L_2813 100.00% 7.95% 5.02%
L_1937 6.69% 4.22%
L_1938 50.00% 3.35% 2.11%
doGC_96 50.00% 3.35% 2.11%
L_1359,L_2823 4.68% 2.95%
L_1359,L_2823 100.00% 4.68% 2.95%
I don't know if this will be particularly useful, but it was easy enough
to compute. (The three columns give percentages relative to the
containing depth 1 block, the containing depth 0 function, and the whole
program.) We see more blocks that were combined by simplification passes.
Labels like skipGC, doGC, and statementLimitCheckLoop are all introduced
by the native codegen, so they don't correspond to any real code in the
CPS IL. On the other hand, they do give some insight into the interaction
with the runtime; for example, time spent in a skipGC block will
generally correspond to some allocation being done after performing a GC
check. Of course, we'll need to look at real gmon.out files for any of
this to be valid.
I looked through a -d 2 profile without filtering out blocks with no
ticks; at depth 2 there is almost always just one block corresponding to
the depth 1 block, and when there isn't, it's because the block was
broken up by a GC check, an array-allocation loop, or an overflow check.
Let me know if this gives all the information we would like. It should be
trivial to modify the gmon reader to work with whatever Henry cooks up.
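For reference, here's a rough sketch of what such a reader might look
like, assuming the GNU gprof gmon.out layout (a 4-byte "gmon" cookie, a
32-bit version, 12 spare bytes, then tagged records, with tag 0 being the
PC-sampling histogram); the field widths assume a 32-bit little-endian
target, and this is my own sketch rather than mlprof's actual reader.

```python
import struct

def read_gmon_histogram(path):
    """Read the histogram record of a GNU-style gmon.out file.

    Assumes the layout documented for gprof: a "gmon" cookie, a 32-bit
    version, 12 spare bytes, then tagged records (tag 0 = histogram).
    Field widths assume a 32-bit little-endian target.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"gmon":
            raise ValueError("not a gmon.out file")
        (version,) = struct.unpack("<i", f.read(4))
        f.read(12)                      # spare bytes in the header
        if f.read(1) != b"\x00":
            raise ValueError("expected a histogram record first")
        low_pc, high_pc, hist_size, prof_rate = struct.unpack(
            "<IIii", f.read(16))
        f.read(15)                      # dimension label, e.g. "seconds"
        f.read(1)                       # dimension abbreviation, e.g. "s"
        bins = struct.unpack("<%dH" % hist_size, f.read(2 * hist_size))
        return low_pc, high_pc, prof_rate, bins
```

Each 16-bit bin counts ticks attributed to one slice of the address range
[low_pc, high_pc); mapping bins back to the static labels in the
executable is then just an address lookup.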