Time bug is kernel bug, not MLton

Sat, 14 Jul 2001 00:01:01 -0500

Yay:  the Time bug is definitely a kernel bug bug, not a MLton bug.  The hint
was they mail from Matthew on the .99.  I have C code that will cause the bug
to show up.  It only seems to happen on an SMP machine, but if I run 3 copies
of the program, one usually gets the axe within a few minutes.  The bug seems
to  be  a  race between storing the seconds part and the microseconds part in
the struct rusage.  It reads the counter kept in the  process  struct  (which
keeps  times in `ticks', (1/100 of a second)), extracts the seconds parts and
stores that in the tv_sec field, then  re-reads  the  counter,  extracts  the
microseconds  part  and stores that in the tv_usec field.  The result is that
if the other CPU increments the counter  between  the  two  reads,  and  that
increment  causes  it  to  wrap,  then  you  get  the old seconds and the new
microseconds, with the latter being 0.

Mind you, there is a comment in the code indicating why  this  can't  happen,
but  I  suspect  that  although  it  used to be true, with the greater multi-
threading in the kernel, it no longer is correct.

I don't see any really good way around this.  The result  is  that  you  will
occasionally  get  times  that  are  1 second less than they should be.  This
should only be possible for the USAGE_SELF and only for the system time.

Despite this, I am still confused about the behavior  of  the  MLton  test  I
wrote for this.  Here is the code:

	fun showTime t =
	       let val ts = Time.toSeconds t
		   val t' = Time.- (t, Time.fromSeconds ts)
		   val tus = Time.toMicroseconds t'
	       in print (LargeInt.toString ts ^ " seconds + "
			    ^ LargeInt.toString tus)
	       end

	fun tryDelta (msg, old, new) =
	       (Time.- (new, old);
		())
		  handle _ => (print ("Binfo for " ^ msg ^ ", old = ");
			       showTime old;
			       print ", new = ";
			       showTime new;
			       print "\n";
			       OS.Process.exit OS.Process.failure)

	fun loop { self = { utime = su, stime = ss },
		   children = { utime = cu, stime = cs },
		   gc = { utime = gu, stime = gs } } =
	       let val new as { self = { utime = su', stime = ss' },
				children = { utime = cu', stime = cs' },
				gc = { utime = gu', stime = gs' } } =
			  MLton.Rusage.rusage ()
	       in tryDelta ("self user", su, su');
		  tryDelta ("self sys", ss, ss');
		  tryDelta ("child user", cu, cu');
		  tryDelta ("cuild sys", cs, cs');
		  tryDelta ("gc user", gu, gu');
		  tryDelta ("gc sys", gs, gs');
		  loop new
	       end

	val _ = loop (MLton.Rusage.rusage ())

Now as expected, this program died several times as follows:

    syzygy> htime zulu
    Binfo for self sys, old = 10 seconds + 990000, new = 10 seconds + 0
    Command exited with non-zero status 1

    real    0:54.08
    user    25.52
    sys     11.00
    =======================================================================
    syzygy> htime zulu
    Binfo for self sys, old = 105 seconds + 990000, new = 105 seconds + 0
    Command exited with non-zero status 1

    real    5:51.07
    user    228.01
    sys     106.00

Note,  the  system  time  was  11 seconds in the first case, and the old self
system time reported by getrusage() was 10.99 and the new  was  10.00.   Same
for the second case.

The thing that is confusing me is that one time it died as follows:

    syzygy> htime zulu
    unhandled exception Time
    Command exited with non-zero status 1

    real    0:51.70
    user    23.91
    sys     11.00

The  system  time  is 11.00, so I'm sure it was that which killed it, but who
raised the Time exception in an unhandled context?  The  whole  way  I  wrote
showTime  was  to  make  this  impossible  (and  to  make sure I got the full
accuracy of the time without any floating  point  strangeness).   I  must  be
missing  something easy here, but I have looked at it for a while and have no
idea.  Any notions?