[MLton-user] Problem with Timer.checkCPUTimer on Sparc
shivers@cc.gatech.edu
Sat, 8 Nov 2003 18:00:54 -0500
> Note that gc time is 170 usec & usr time is 140 usec. But that is
> impossible, since usr time *includes* gc time, according to the SML doc
> at
>
> http://www.standardml.org/Basis/timer.html
...
> So it seems like my code took negative time. ??? Is this a bug?
> Did I make a mistake?
> The Basis spec doesn't make it clear to me whether system time spent
> during garbage collection should be allocated to usr or sys. Right
> now, MLton gets the usr and sys values from a call to getrusage. This
> has the effect of allocating system time spent during garbage
> collection to sys. To allocate the system time to usr would require
> extra bookkeeping to keep track of the system time spent in gc and to
> subtract it off from sys and add it to usr. That seems odd to me.
> I'll ask around to see what other SML implementors think.
The spec seems clear: "Note that implementations should include any time spent
in the garbage collector in the usr time component."
But I'll editorialise further: the spec is broken either way. You
have two axes for time: usr/sys & gc/non-gc. Four bins in the cross
product. If you fold any two of these bins together, you are going to
screw up some measurement. The API needs to provide all four.
Further, as you noticed yourself, there is an atomicity problem, in that you
get CPU time and its related GC time with two different calls, but a GC could
occur in-between. We can't language-lawyer our way out of that one; the API
is busted.
So, with the current MLton implementation, usr + sys should always be
greater than or equal to gc, but usr alone might not be.
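To make the bookkeeping concrete, the subtraction I'm doing by hand below
looks roughly like this (a sketch against the Timer interface at the URL
above; nonGCTime is a made-up name):

    (* Non-gc cpu time under the current accounting: usr + sys >= gc, but
       usr alone may come out smaller than gc.  Also, a GC can fire between
       the two check calls -- the atomicity problem above. *)
    fun nonGCTime (tm : Timer.cpu_timer) =
        let val {usr, sys} = Timer.checkCPUTimer tm
            val gc         = Timer.checkGCTime tm
        in  Time.- (Time.+ (usr, sys), gc)
        end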
Here is what I am measuring: I have three implementations of list-merge. I
create a pair of random, ordered int lists, whose sum length is 10,000 elts.
Then for each merge function, I turn on the timer, merge the lists, and turn
off the timer. I do this for three such randomly generated merge-pairs. I can
pretty tightly bound the amount of alloc that's happening here:
- impl 1: recursive. Needs up to 10,000 stack frames & up to 10,000 conses.
- impl 2: iterate&reverse. Builds the answer backwards in a loop,
  then reverses it. Needs 0 stack frames & up to 20,000 conses.
- impl 3: As in 2, but the inner loop has 7 vars, not 3, and the loop is
  "rotated" to ensure that it's operating on non-empty lists. Should run
  faster if there are enough callee-saves regs to hold all 7 iter vars
  (or if the < compare gets inlined and there are enough general-purpose
  regs). Allocates as in impl 2.
(The 3 functions appear at the end of this msg.) That gives you 9 merges: 3
impls * 3 merge-pairs. Below are the nine timings, straight out of
checkCPUTimer & checkGCTime. Total heap alloc for program: 3 * (10k + 20k +
20k) = 150k conses.
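Each timed run below is essentially the following (a sketch; timeOneMerge is
a made-up name, and printing the sexp row is left out):

    (* Start a cpu timer, do one merge, then read both clocks. *)
    fun timeOneMerge merge (xs : int list, ys : int list) =
        let val tm = Timer.startCPUTimer ()
            val _  = merge Int.< (xs, ys)
            val {usr, sys} = Timer.checkCPUTimer tm
        in  {usr = usr, sys = sys, gc = Timer.checkGCTime tm}
        end

e.g. timeOneMerge merge1 (xs1, ys1), with xs1 & ys1 one of the random
merge-pairs, gives one row of the table below.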
Run on a Sparc, compiled with
-detect-overflow true -safe true -inline 100000
so I get whatever the default GC machinery is.
;  Measurements (usec)                      non-gc(msec)  gc(msec)   impl & merge-pair
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
((usr 0) (sys 0) (gc 0)) 0 0 recursive 1
((usr 150000) (sys 20000) (gc 160000)) 10 160 iterate1 1
((usr 130000) (sys 10000) (gc 140000)) 0 140 iterate2 1
((usr 0) (sys 0) (gc 0)) 0 0 recursive 2
((usr 100000) (sys 70000) (gc 150000)) 20 150 iterate1 2
((usr 110000) (sys 30000) (gc 130000)) 10 130 iterate2 2
((usr 0) (sys 0) (gc 0)) 0 0 recursive 3
((usr 140000) (sys 20000) (gc 140000)) 20 140 iterate1 3
((usr 100000) (sys 40000) (gc 110000)) 30 110 iterate2 3
These numbers are weird in *multiple* ways, as I'll outline below.
- First, the measurements in the sexp in the left columns are
      #usr(checkCPUTimer tm)    (usr time -- gc & non-gc)
      #sys(checkCPUTimer tm)    (sys time -- gc & non-gc)
      checkGCTime tm            (gc time -- usr & sys)
in usec. The *meaning* of these things is, as indicated to the side,
your older semantics (as opposed to your brand-new fixup).
The measurements in the columns on the right are in msec & hand-generated
from the left columns: non-gc = (usr+sys-gc).
- It's kind of suspicious that the usec-precise times are always in
10msec units. 1/100th of a second is pretty coarse!
- It's kind of suspicious that gc *dominates* these measurements --
these runs are spending 10x as much time in the gc as in the computing
(look at the right columns, where I did some subtraction by hand).
Huh?
- It's kind of suspicious that the gc spends so much time in the OS. It's just
wandering around doing reads & writes to memory. (The *program* is
*definitely* spending *zero* time in the OS during the timed portions --
it's just merging two integer lists. Unless paging is a lot more
cpu-intensive than I'd previously thought?)
- Clearly, the recursive impl is the big winner on the SPARC. But it's such a
huge winner it's suspicious.
- Is this slow? 20 msec = 1/50th of a second to merge 10,000 integers?
That's one list element every 2 usec, on a 450MHz SPARC,
i.e., one list element every 900 cycles. Yeah, that seems a
little slow. The inner loop on the iterate1 impl is:
    - 2 nil tests
    - 2 car loads; 2 cdr loads
    - an inlined (yeah, MLton!) fixnum <
    - a cons
    - jump to top of loop
  plus the cost of a revAppend, which is likely
    - nil test
    - car & cdr load
    - cons
    - jump
(I've appended the actual code to this msg; it's function merge2.)
That's about 25 insn per list element? So we're talking ~ 36 cycles/insn,
back-of-the-envelope? That seems real high.
More data: I did more runs, this time with 1000 different 100k-elt list-pairs.
I would turn the timer on, do the merge 10 times, and then turn it off. I
collected usr+sys-gc, in accordance w/your claim that there is some gc time in
both the usr & sys component, so I measured "non-gc sys+usr" time. (How I
generate a list pair: I count from 100k down to 1; for each int, I flip a
pseudo-random coin and put the int in list xs or list ys. When done, I have
two ordered lists, xs & ys, whose sum length is 100k-elts.)
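In code, the generator is roughly this (a sketch; randomPair is a made-up
name, and the coin is a throwaway word-arithmetic LCG just to keep the
sketch self-contained -- any pseudo-random source would do):

    (* Count from n down to 1; flip a coin for each int and cons it onto xs
       or ys.  Both lists come out ordered, with sum length n. *)
    fun randomPair n =
        let fun next w = w * 0w1103515245 + 0w12345
            fun coin w = Word.andb (Word.>> (w, 0w16), 0w1) = 0w0
            fun loop (0, _, xs, ys) = (xs, ys)
              | loop (i, w, xs, ys) =
                if coin w
                then loop (i - 1, next w, i :: xs, ys)
                else loop (i - 1, next w, xs, i :: ys)
        in  loop (n, 0w42, [], [])
        end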
Here are the numbers, for SML/NJ & MLton, on a linux x86 & a Solaris Sparc:
Merging numbers: ((listlen 100000) (nreps 10) (nlists 1000))
Measuring usr+sys non-gc time in msec
x86 (mongkok -- 1.2GHz)
MLton (total process run time: usr=6m1.6s / sys=0.37s)
(rec 58530)
(iter 140880)
(iter2 131230)
SML/NJ
(rec 482710)
(iter 264830)
(iter2 266200)
SPARC (forge -- 450MHz)
MLton
(rec 33990) ; 15.3 cycles/elt & 20 insns/elt => 0.76 cycles/insn
(iter 253530) ; 114.1 cycles/elt & 25 insns/elt => 4.56 cycles/insn
(iter2 250540) ; 112.7 cycles/elt & 20 insns/elt => 5.64 cycles/insn
SML/NJ
(rec 920900)
(iter 627720)
(iter2 608730)
Check the cycles/insn counts I calculated out to the right for the Sparc MLton
run. For the iter function, cycles/insn looks in the ballpark for what is
probably a memory-bound loop with bad locality. However, the value for the
recursive function -- less than 1 cycle/insn -- is WAY too low. Something
funny is going on there -- maybe the compiler is getting the insn count down
in some way I don't understand (see my calculations below), or the timing
accounting is still off. Weird.
Other notes:
- In MLton, recursive merge crushes iterate&reverse,
  but with SML/NJ, it's the other way 'round.
- MLton trounces SML/NJ in all instances, by *integer* factors.
-Olin
===============================================================================
Here's how I calculate cycles/insn
-------------------------------------------------------------------------------
(define (cycles/insn time-sec insns/elt listlen nreps nlists cycles/sec)
  (let* ((cycles/elt (cycles/elt time-sec listlen nreps nlists cycles/sec)))
    (/ cycles/elt insns/elt)))

(define (cycles/elt time-sec listlen nreps nlists cycles/sec)
  (let* ((elts   (* listlen nlists nreps))    ; total elts processed
         (cycles (* time-sec cycles/sec)))    ; total cycles elapsed
    (exact->inexact (/ cycles elts))))
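;; E.g., the MLton/SPARC (rec 33990) row from above (33.99 sec total):
;;   (cycles/elt  33.99 100000 10 1000 450e6)      => ~15.3 cycles/elt
;;   (cycles/insn 33.99 20 100000 10 1000 450e6)   => ~0.76 cycles/insn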
Back-of-the-envelope estimates for insns/elt
-------------------------------------------------------------------------------
(assuming loop iter vars all live in regs, w/no stack traffic across < call.)
recur: ~ 20 insns
    2 nil-tests                   (2 insns)
    2 car loads & 2 cdr loads     (4 insns)
    1 fixnum <                    (1 insn)
    1 recursive call              (4 insns)
    1 cons                        (5 insns)
    1 return                      (2 insns)

iter: ~ 25 insns
    2 nil-tests                   (2 insns)   [iter]
    2 car loads & 2 cdr loads     (4 insns)
    1 fixnum <                    (1 insn)
    1 cons                        (5 insns)
    1 jump                        (1 insn)
    1 nil test                    (1 insn)    [revApp]
    1 car load & 1 cdr load       (2 insns)
    1 cons                        (5 insns)
    1 jump                        (1 insn)

iter2: ~ 20 insns
    1 fixnum <                    (1 insn)    [iter]
    1 nil test                    (1 insn)
    1 car & cdr load              (2 insns)
    1 cons                        (5 insns)
    1 jump                        (1 insn)
    1 nil test                    (1 insn)    [revApp]
    1 car load & 1 cdr load       (2 insns)
    1 cons                        (5 insns)
    1 jump                        (1 insn)
The actual merge functions
-------------------------------------------------------------------------------
fun merge1 op< (xs,ys) =                          (* Recursive *)
    let fun m([], ys) = ys
          | m(xs, []) = xs
          | m(xs as (x :: xt), ys as (y :: yt)) =
            if y < x then y :: m(xs,yt) else x :: m(xt,ys)
    in m(xs,ys)
    end

fun merge2 op< (xs,ys) =                          (* Iterate/reverse *)
    let fun m([], ys, racc) = List.revAppend(racc,ys)
          | m(xs, [], racc) = List.revAppend(racc,xs)
          | m(xs as (x :: xt), ys as (y :: yt), racc) =
            if y < x then m(xs, yt, y :: racc)
            else m(xt, ys, x :: racc)
    in m(xs,ys,[])
    end

(* Iterate/reverse; better if we have plenty of registers. *)
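(* Invariant in the inner loop below: xs = x::xt and ys = y::yt, so the loop
   always has both heads in hand and never re-tests its lists for nil at the
   top. *)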
fun merge3 op< (xs,[]) = xs
  | merge3 op< ([],ys) = ys
  | merge3 op< (xs as (x::xrest), ys as (y::yrest)) =
    let val revapp = List.revAppend
        fun m(xs,x,xt, ys,y,yt, racc) =
            if y < x
            then case yt of [] => revapp(racc, y::xs)
                          | y2::ys => m(xs,x,xt, yt,y2,ys, y::racc)
            else case xt of [] => revapp(racc, x::ys)
                          | x2::xs => m(xt,x2,xs, ys,y,yt, x::racc)
    in m(xs,x,xrest, ys,y,yrest, [])
    end