Time bug is kernel bug, not MLton
Henry Cejtin
henry@sourcelight.com
Sat, 14 Jul 2001 00:01:01 -0500
Yay: the Time bug is definitely a kernel bug bug, not a MLton bug. The hint
was they mail from Matthew on the .99. I have C code that will cause the bug
to show up. It only seems to happen on an SMP machine, but if I run 3 copies
of the program, one usually gets the axe within a few minutes. The bug seems
to be a race between storing the seconds part and the microseconds part in
the struct rusage. It reads the counter kept in the process struct (which
keeps times in `ticks', (1/100 of a second)), extracts the seconds parts and
stores that in the tv_sec field, then re-reads the counter, extracts the
microseconds part and stores that in the tv_usec field. The result is that
if the other CPU increments the counter between the two reads, and that
increment causes it to wrap, then you get the old seconds and the new
microseconds, with the latter being 0.
Mind you, there is a comment in the code indicating why this can't happen,
but I suspect that although it used to be true, with the greater multi-
threading in the kernel, it no longer is correct.
I don't see any really good way around this. The result is that you will
occasionally get times that are 1 second less than they should be. This
should only be possible for the USAGE_SELF and only for the system time.
Despite this, I am still confused about the behavior of the MLton test I
wrote for this. Here is the code:
fun showTime t =
let val ts = Time.toSeconds t
val t' = Time.- (t, Time.fromSeconds ts)
val tus = Time.toMicroseconds t'
in print (LargeInt.toString ts ^ " seconds + "
^ LargeInt.toString tus)
end
fun tryDelta (msg, old, new) =
(Time.- (new, old);
())
handle _ => (print ("Binfo for " ^ msg ^ ", old = ");
showTime old;
print ", new = ";
showTime new;
print "\n";
OS.Process.exit OS.Process.failure)
fun loop { self = { utime = su, stime = ss },
children = { utime = cu, stime = cs },
gc = { utime = gu, stime = gs } } =
let val new as { self = { utime = su', stime = ss' },
children = { utime = cu', stime = cs' },
gc = { utime = gu', stime = gs' } } =
MLton.Rusage.rusage ()
in tryDelta ("self user", su, su');
tryDelta ("self sys", ss, ss');
tryDelta ("child user", cu, cu');
tryDelta ("cuild sys", cs, cs');
tryDelta ("gc user", gu, gu');
tryDelta ("gc sys", gs, gs');
loop new
end
val _ = loop (MLton.Rusage.rusage ())
Now as expected, this program died several times as follows:
syzygy> htime zulu
Binfo for self sys, old = 10 seconds + 990000, new = 10 seconds + 0
Command exited with non-zero status 1
real 0:54.08
user 25.52
sys 11.00
=======================================================================
syzygy> htime zulu
Binfo for self sys, old = 105 seconds + 990000, new = 105 seconds + 0
Command exited with non-zero status 1
real 5:51.07
user 228.01
sys 106.00
Note, the system time was 11 seconds in the first case, and the old self
system time reported by getrusage() was 10.99 and the new was 10.00. Same
for the second case.
The thing that is confusing me is that one time it died as follows:
syzygy> htime zulu
unhandled exception Time
Command exited with non-zero status 1
real 0:51.70
user 23.91
sys 11.00
The system time is 11.00, so I'm sure it was that which killed it, but who
raised the Time exception in an unhandled context? The whole way I wrote
showTime was to make this impossible (and to make sure I got the full
accuracy of the time without any floating point strangeness). I must be
missing something easy here, but I have looked at it for a while and have no
idea. Any notions?