before first PROFILE line
Henry Cejtin
henry@clairv.com
Fri, 9 Jul 1999 00:05:06 -0500
I have just done a careful (by hand) investigation of the profiling result on
ratio regions, which reported a lot of time in `before first PROFILE line' of
Chunks. I deduce that the output of mlprof is completely correct, and we
really are spending that time there.
To be more specific, in my run, mlprof claimed that 5.93% of the time was
spent before the first PROFILE line of Chunk2. Since there were a total of
2361 ticks (23.61 CPU seconds) in the run, that means 140 ticks.
The code at the start of Chunk2 is:
0804d710 <Chunk2>:
Chunk2():
/tmp/ratio.c:6772
804d710: 83 ec 58 subl $0x58,%esp
804d713: 55 pushl %ebp
804d714: 57 pushl %edi
804d715: 56 pushl %esi
804d716: 53 pushl %ebx
804d717: 8b 1d 4c 21 05 08 movl 0x805214c,%ebx
804d71d: 8b 35 28 21 05 08 movl 0x8052128,%esi
/tmp/ratio.c:6814
804d723: 83 3d c8 21 05 08 40 cmpl $0x40,0x80521c8
The first PROFILE messages of Chunk2 is line 6813:
/* PROFILE: MLTON_chunkSwitch (magic) */
and so the first line of the chunkSwitch should be line 6814, and as we can
see in the above code, this is exactly correct (the compare to $0x40 (or 64)
is indeed the check for the first case).
Summing all this up, we see that Chunk2 starts at 0x804D710 and that the code
for MLTON_chunkSwitch starts at 0x804D723.
Ok, now here is the information from the gmon.out file for the start of
Chunk2:
address number of ticks
[0x804d710, 0x804d714) 76
[0x804d714, 0x804d718) 45
[0x804d71c, 0x804d720) 7
[0x804d720, 0x804d724) 12
Note, I charge all of a bin to the center of the bin, so all these bins get
charged to the before first PROFILE area, and the only doubt is that the last
bin might not really all be correct (some part of those 12 ticks should have
been charged to MLTON_chunkSwitch, but we don't know how much).
Thus the real correct answer is that between
76 + 45 + 7 = 128
and 140 tick (or between 5.42% and 5.93%) really was used by Chunk2 before it
got to the chunkSwitch. Note, the only thing that this code is doing is
allocating some space on the C stack (a subtract from a register), saving
some registers on the stack, and loading some globals into registers. The
last two actions are the code generated for
char *stackTop = gcState.stackTop;
pointer frontier = gcState.frontier;
Now this could only happen if we are entering calling the C function Chunk2 a
lot. In the 23.61 CPU seconds that ratio runs for, Chunk2 is called
29,291,557 times. The time charged to this code is
and so each call is taking
1.4 seconds / 29291557 = 47.8 nanoseconds = 19 cycles
on my 400 MHz machine. 7 instructions, 8 memory references (6 writes and 2
reads), each of 32 bits. Definitely not too bad.
My summary: mlprof is working perfectly.