[MLton] OT Threading (was: Multicore CPU's and MLton)

Wesley W. Terpstra wesley@terpstra.ca
Mon, 4 Jul 2005 20:38:20 +0200

I think this thread has by now drifted quite far off topic.
If you want to discuss problems with st, please talk to its authors.

On Tue, Jul 05, 2005 at 01:51:42AM +1000, John Skaller wrote:
> State Threads is fundamentally flawed, as you can see by these lines
> from the description:
> "The execution state is saved on the thread's stack (a contiguous chunk
> of memory) which is transparently managed by the C environment."
> "The context switch overhead is a cost of doing _setjmp()/_longjmp() (no
> system calls are involved)."
> It is well known this doesn't work. 

I disagree... I have used it (and ported it). It works fine.

There are some problems with newer glibc versions, which store, for
system-level threads, a nice pointer to (locked) global state at the bottom
of the stack. However, if one uses libst, one generally wants to turn off
all of the system-level thread code anyways, since it is quite slow.
(I do this.)

As for the MLton st work-alike (which I have), this is irrelevant anyways,
since I use MLtonThread.switch instead of setjmp. I should profile which is
actually faster at some point...

> For a start it cannot work in C++ due to exceptions,

First, st is for servers written in C. 
Second, what you say is incorrect.

Each thread runs on its own valid stack; st never jumps within a single
stack. All you need is a catch-all handler at the bottom of each stack.
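
A minimal sketch of the "each stack is a valid stack" point, assuming a
glibc-style <ucontext.h>. It uses makecontext/swapcontext rather than the
_setjmp/_longjmp pair quoted above, so it illustrates the idea and is not
st's actual code. The names worker, run_worker and STACK_BYTES are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_BYTES (64 * 1024)  /* assumed size, ample for this demo */

static ucontext_t main_ctx, worker_ctx;
static int steps;                /* records how far the worker has run */

/* Runs entirely on its own stack; yielding leaves that stack intact. */
static void worker(void)
{
    steps = 1;
    swapcontext(&worker_ctx, &main_ctx);  /* yield back to the main stack */
    steps = 2;
    /* returning from here resumes uc_link, i.e. main_ctx */
}

/* Returns 10 * (steps after the first yield) + (steps after the return). */
int run_worker(void)
{
    void *stack = malloc(STACK_BYTES);
    assert(stack != NULL);

    steps = 0;
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = STACK_BYTES;
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);  /* run worker until it yields */
    int after_yield = steps;
    swapcontext(&main_ctx, &worker_ctx);  /* resume worker until it returns */
    int after_return = steps;

    free(stack);
    return after_yield * 10 + after_return;
}
```

Every switch moves between two complete stacks, so no frame is ever
resumed after its own stack has been unwound underneath it.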

> and it is very unlikely to work in conjunction with any system
> using an exact garbage collector

Which neither C nor C++ has in most applications.

MLton also places its own stack in a non-standard location.
I'm not familiar with garbage collector theory; is MLton's collector exact?

> however the main problem is the brain dead model of a linear stack used on
> Unix systems.  There simply isn't enough address space to allocate enough
> stacks, even on a 64 bit machine!

Applications I have written with libst have been tested on 32-bit with about
40,000 concurrent sessions. They worked fine: just set the stack size.
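
The arithmetic behind "just set the stack size" can be sketched as below.
The 16 KiB per-thread stack and the 3 GiB user address space budget are
assumptions chosen for illustration; the message does not state the sizes
actually used. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Total address space, in bytes, for `sessions` stacks of `stack_bytes`. */
uint64_t stacks_total_bytes(uint64_t sessions, uint64_t stack_bytes)
{
    return sessions * stack_bytes;
}

/* Nonzero if that total fits within an assumed address space budget. */
int stacks_fit(uint64_t sessions, uint64_t stack_bytes, uint64_t budget_bytes)
{
    assert(stack_bytes > 0);
    return stacks_total_bytes(sessions, stack_bytes) <= budget_bytes;
}
```

For 40,000 sessions, 8 MiB stacks would need 335,544,320,000 bytes (about
312 GiB), while 16 KiB stacks need 655,360,000 bytes (625 MiB), which fits
comfortably in a 32-bit process under these assumptions.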

If you allocate 8M of anything per request, you are asking for trouble. 
This is a problem for both system and process level threads (in C) anyways.

> (And there are other problems, for example there is no ISO Standard way to
> allocate address space --you certainly can't use malloc, that is required
> to actually allocate memory)

ISO C99 hardly specifies anything; for real applications you need more.
Fortunately, POSIX does specify such a method: mmap.
BTW, malloc works just fine for this too---st uses it if mmap is missing.
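
A hedged sketch of the mmap-with-malloc-fallback approach described above.
The helper names stack_alloc and stack_free are hypothetical and this is
not st's code. It assumes a platform that provides MAP_ANONYMOUS; on
Linux, such a mapping typically reserves address space and backs pages
with memory only when they are first touched.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Obtain `bytes` of stack space, preferring mmap over malloc.
   *used_mmap records which allocator must release the block later. */
void *stack_alloc(size_t bytes, int *used_mmap)
{
    assert(bytes > 0);
#ifdef MAP_ANONYMOUS
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
        *used_mmap = 1;
        return p;
    }
#endif
    *used_mmap = 0;              /* fallback: plain malloc */
    return malloc(bytes);
}

/* Release a block obtained from stack_alloc. */
void stack_free(void *p, size_t bytes, int used_mmap)
{
    if (used_mmap)
        munmap(p, bytes);
    else
        free(p);
}
```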

Again, for the MLton version, this is a non-point, as the entire thing
operates normally in the garbage-collected address space set up by MLton.

> Felix automatically control inverts blocking reads to yields.

That is what st does too (though it tries a non-blocking read first)...
It looks like normal threads, but is an event-driven state machine (EDSM),
so there is no lock contention.
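
The "try a non-blocking read first, yield only if it would block" pattern
can be sketched as follows. This is an illustration under assumptions, not
st's implementation: poll() stands in for handing control back to the
scheduler, and read_or_yield is a hypothetical name. The descriptor is
assumed to already be in O_NONBLOCK mode.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Read from a non-blocking fd. The fast path is a plain read(); only when
   that reports EAGAIN does the caller wait. *yields counts the waits. */
ssize_t read_or_yield(int fd, void *buf, size_t len, int *yields)
{
    assert(yields != NULL);
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                     /* data, or 0 at end of file */
        if (errno == EINTR)
            continue;                     /* interrupted: just retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                    /* real error */

        /* Would block: a thread library would switch to another thread
           here; this sketch simply waits for readability. */
        struct pollfd p = { .fd = fd, .events = POLLIN };
        ++*yields;
        if (poll(&p, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
}
```

When data is already buffered, the call returns without ever reaching the
wait, which is the case the non-blocking first attempt is meant to speed up.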

> unlike some other systems there is no generic scheduler .. 

That is interesting.
I could imagine some uses for this with heterogeneous workloads.

> What the compiler does, fundamentally, is the control inversion.

I don't see why one needs a special programming language for this.
SML, C, Java, Python, ... are flexible enough to do this themselves.

> See 'Stackless Python'...

I am familiar with Twisted, which makes use of this.

However, under Python, concerns about the overhead due to locking for
system-level threads play second fiddle to the concerns over the
performance lost due to using Python in the first place. I think the
debate about system/process level threading has no place here.

On Tue, Jul 05, 2005 at 01:56:42AM +1000, John Skaller wrote:
> I think you're missing the fundamental problem: sockets and 
> kernel TCP/IP stacks. 
> The only way to write a fast web server requires a raw socket
> and a significant part of the TCP/IP stack will have to be 
> provided.

You are talking about the hack MS uses to respond immediately to a web page
request without the TCP handshake? I won't dignify that with a response.

I know some projects do this, but I have never seen compelling evidence that
it was necessary (except for the evil MS hack, which I hate).

As for reimplementing the TCP/IP stack, why not improve the kernel stack? 
I doubt any small development team could reimplement it better.

If your concern is the copy overhead from kernel buffers to userland, there
is 'zero-copy' support in the Linux kernel. If your concern is the context
switch to userland, you are going to have this anyways unless you write your
own network driver for the card and hook interrupts yourself.

Wesley W. Terpstra