[MLton] cvs commit: sped up output1 a lot

Jesper Louis Andersen jlouis@mongers.org
Sat, 3 Jan 2004 00:11:10 +0100

Quoting Stephen Weeks (sweeks@sweeks.com):

> > Also I'm curious how the normal getchar is so fast under NetBSD.  Do
> > they not use glibc?
> I have no idea.  Hopefully Jesper can tell us.

putc() is fast because the unlocked and locked versions are the same
for non-reentrant programs (which implies non-pthread). This explains
why the 2 versions don't differ. If you need threads, you link
against a separate, reentrant libc, and then the speed goes down to
about what you see with Linux.

I am not sure if this is POSIXly correct though. My intuition says no.
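
To make the locked/unlocked distinction concrete, here is a minimal
sketch using the POSIX calls flockfile(), funlockfile() and
putc_unlocked(). The helper names write_locked and write_unlocked are
made up for illustration, and whether the two really differ in speed
depends on the libc in use, as described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Locked variant: on a thread-aware libc every putc() call may take
   and release the stream lock. */
static size_t write_locked(FILE *fp, int c, size_t n)
{
    size_t written = 0;
    assert(fp != NULL);
    for (size_t i = 0; i < n; i++)
        if (putc(c, fp) != EOF)
            written++;
    return written;
}

/* Unlocked variant: take the stream lock once, then use the
   per-character fast path without further locking. */
static size_t write_unlocked(FILE *fp, int c, size_t n)
{
    size_t written = 0;
    assert(fp != NULL);
    flockfile(fp);
    for (size_t i = 0; i < n; i++)
        if (putc_unlocked(c, fp) != EOF)
            written++;
    funlockfile(fp);
    return written;
}
```

Both helpers produce the same bytes on the stream; only the locking
cost per character is different.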

Furthermore it is an inlined macro:


NetBSD and FreeBSD do not use glibc for anything except when emulating
Linux. They have their own counterparts to glibc, which explains why
they differ.

As to the speed of the machines, there is not really much to look at.
An Athlon 2000 XP is a bit faster than a 1.6 GHz P4.

Then why is the C program beating us by so much? A trace of the
system calls suggests that we make one per 4096 bytes we write,
whereas the C-based code makes one per 64K bytes it writes. Now,
system calls are expensive...

for i in 1000 10000 100000 1000000; do
    ktrace /tmp $i;
    kdump | grep RET | grep write;
done

...suggests that they use a 64K buffer internally in their code. A
test run of mine shows 0.00 seconds of sys time for the C version,
whereas the MLton version uses 0.13 seconds of sys time. This is so
little that it does not matter.
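
As a rough check on those numbers, the sketch below counts how many
write calls a given buffer size implies, assuming one call per full
buffer plus one for any remainder. It also shows how a C program can
request a buffer size with the standard setvbuf() call. The names
writes_needed and use_buffer_size are hypothetical helpers, not
anything from MLton or the BSD libc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of write calls needed to push `total` bytes through a
   buffer of `bufsize` bytes (rounded up). */
static size_t writes_needed(size_t total, size_t bufsize)
{
    assert(bufsize > 0);
    return (total + bufsize - 1) / bufsize;
}

/* Ask stdio for a fully buffered stream of the given size. Must be
   called before any other I/O on fp. Returns 0 on success. */
static int use_buffer_size(FILE *fp, size_t bufsize)
{
    assert(fp != NULL);
    return setvbuf(fp, NULL, _IOFBF, bufsize);
}
```

For 1000000 bytes this gives 245 calls with a 4096-byte buffer and 16
calls with a 64K buffer, so the difference in call count is large even
though the measured sys time is small.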


Basically the same code applies here, for the locked version that is.
The locked version does what you would expect: it locks and unlocks
around the putc() call.

They are also using a buffer size of 4K, like we do, so the trace
of system calls follows our pattern.


The problem _has_ to be in the code we execute inside MLton. There must
be something in that loop that could be optimized. Beware that the
unlocked version of the C code is defined by the macro:

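 static inline int __sputc(int _c, FILE *_p) {
	/* Fast path: room left in the buffer, store the byte directly. */
	if (--_p->_w >= 0
	    || (_p->_w >= _p->_lbfsize && (char)_c != '\n'))
		return (*_p->_p++ = _c);
	else
		/* Slow path: buffer full (or newline when line buffered). */
		return (__swbuf(_c, _p));
 }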

So it will get inlined straight into the C code when -O2 is enabled.
I tried a run with -O2 -fno-inline, and then the speed drops to
26.48 secs on the FreeBSD machine, which is a bit slower than the
MLton version.
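
To show why inlining matters so much here, below is a minimal sketch
of the same fast-path/slow-path split on a made-up buffer type. The
names mini_buf, mini_putc and mini_flush_and_put are hypothetical and
are not the BSD FILE structure. The slow path only resets the buffer
and counts flushes where a real one would call write(2).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a stdio-style write buffer. */
struct mini_buf {
    unsigned char data[4096];
    size_t used;     /* bytes currently buffered */
    size_t flushes;  /* how many times the slow path ran */
};

/* Slow path, kept out of line in the spirit of __swbuf(): empty the
   buffer, then store the byte. */
static int mini_flush_and_put(struct mini_buf *b, int c)
{
    b->flushes++;
    b->used = 0;
    b->data[b->used++] = (unsigned char)c;
    return (unsigned char)c;
}

/* Fast path: a compare and a store, small enough for the compiler to
   inline at each call site when optimizing. */
static inline int mini_putc(struct mini_buf *b, int c)
{
    assert(b != NULL);
    if (b->used < sizeof b->data)
        return b->data[b->used++] = (unsigned char)c;
    return mini_flush_and_put(b, c);
}
```

With the fast path inlined, the common case costs no function call at
all; with -fno-inline every character pays for a call, which is
consistent with the slowdown measured above.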


If we want to get closer to C here, we should look at the code MLton
generates internally at that point.