[MLton] commit r4241: First cut at fixing -codegen c -profile time

Matthew Fluet fluet@mlton.org
Thu, 17 Nov 2005 19:06:17 -0800

A different approach to time profiling: maintain the current source
position via program operations.  This has the advantage that it
significantly more portable than trying to inject just the right
information into C code in a manner that isn't corrupted by gcc

The high points:
 * added  volatile uint curSourceSeqsIndex;  to struct GC_state;
   PROFILE_TIME_LABEL, which distinguishes the two methods of time
   profiling (the former uses a the GC_state field to compute the
   sourceSeqsIndex, the later uses assembler labels).
 * added  -profile time-field  and  -profile time-label  as expert
   options; -profile time  remains and is set according to the
   codegen; you can use the expert options to override the codegen
   default (use time-label with the C-codegen at your peril, use
   time-field with the native codegen for benchmarking purposes).
 * made _minimal_ changes to profile.fun; essentially, if we would
   have put in a profile label, instead put in an explict move of 
   the sourceSeqsIndex to the GC_state field.

Note that this sets curSourceSeqsIndex more often than necessary.
Profile labels were inserted for code coverage properties; namely, at
the beginning of every basic block -- whether or not control could
have flowed from a block with a different souceSeqsIndex.  Inspecting
the generated code, it is clear that we could insert fewer sets.

Here is a comparison of the results of using different combinations of
codegen and profiling methods:

[fluet@localhost native]$ mlton -codegen native -profile time hamlet.sml
[fluet@localhost native]$ ./hamlet > /dev/null
[fluet@localhost native]$ mlprof -thresh 2.0 hamlet mlmon.out
25.98 seconds of CPU time (4.54 seconds GC)
         function            cur
--------------------------- -----
<gc>                        14.9%
Sequence.Slice.collate.loop 12.7%
BinaryMapFn.find.mem        11.0%
Integer.scan                 8.6%
BinaryMapFn.insert           8.0%
Lab.compare                  5.2%
Sequence.make2               4.4%
BinaryMapFn.T'               3.9%
Integer.scan.negate          3.0%
EvalCore.evalExp             2.6%
Integer.scan.num             2.2%
BinaryMapFn.foldli.fold      2.1%
Integer.scan.finishNum       2.1%

[fluet@localhost native]$ mlton -codegen native -profile time-field hamlet.sml
[fluet@localhost native]$ ./hamlet > /dev/null
[fluet@localhost native]$ mlprof -thresh 2.0 hamlet mlmon.out
29.31 seconds of CPU time (4.42 seconds GC)
         function            cur
--------------------------- -----
Sequence.Slice.collate.loop 18.4%
<gc>                        13.1%
Integer.scan                 9.3%
BinaryMapFn.insert           6.2%
BinaryMapFn.find.mem         6.0%
Sequence.make2               4.3%
Integer.scan.negate          3.8%
Lab.compare                  3.6%
BinaryMapFn.T'               3.5%
wrapOverflow.fn              2.8%
Integer.scan.finishNum       2.7%
BinaryMapFn.N                2.6%
EvalCore.evalExp             2.5%
Integer.scan.num             2.1%
EvalCore.evalAtExp           2.0%

[fluet@localhost c]$ mlton -codegen c -profile time hamlet.sml 
[fluet@localhost c]$ ./hamlet > /dev/null
[fluet@localhost c]$ mlprof -thresh 2.0 hamlet mlmon.out
60.78 seconds of CPU time (4.72 seconds GC)
         function            cur
--------------------------- -----
Sequence.Slice.collate.loop 18.7%
BinaryMapFn.find.mem        11.0%
BinaryMapFn.insert          10.2%
<gc>                         7.2%
Integer.scan.negate          6.0%
Integer.scan                 6.0%
Lab.compare                  4.7%
Sequence.make2               3.8%
EvalCore.evalExp             3.5%
EvalCore.evalAtPat           3.3%
BinaryMapFn.T'               2.6%
EvalCore.evalPatRow          2.4%
Integer.scan.finishNum       2.4%
wrapOverflow.fn              2.1%

There is a slight slowdown of using the field method over using the
label method, but nothing like the slowdown of going to the C codegen.
Regardless of the running time of the program, you get essentially the
same profiling results (with the exception that <gc> time proportional
to total running time).


U   mlton/trunk/include/c-chunk.h
U   mlton/trunk/mlton/backend/backend.fun
U   mlton/trunk/mlton/backend/machine.fun
U   mlton/trunk/mlton/backend/profile.fun
U   mlton/trunk/mlton/backend/rep-type.fun
U   mlton/trunk/mlton/backend/runtime.fun
U   mlton/trunk/mlton/backend/runtime.sig
U   mlton/trunk/mlton/codegen/c-codegen/c-codegen.fun
U   mlton/trunk/mlton/control/control-flags.sig
U   mlton/trunk/mlton/control/control-flags.sml
U   mlton/trunk/mlton/main/compile.fun
U   mlton/trunk/mlton/main/lookup-constant.fun
U   mlton/trunk/mlton/main/main.fun
U   mlton/trunk/runtime/gc.c
U   mlton/trunk/runtime/gc.h