[MLton] Multicore CPU's and MLton

Wesley W. Terpstra wesley@terpstra.ca
Mon, 4 Jul 2005 15:26:56 +0200

On Mon, Jul 04, 2005 at 07:28:49AM -0400, Ray Racine wrote:
> On Mon, 2005-07-04 at 12:35 +0200, Wesley W. Terpstra wrote:
> > The best way to implement scalable web services is not with threads.
> > What you will find much more scaleable is to use an EDSM inside
> > process-level concurrency. 
> There has been some of this happening in the Java world over the last
> few years.  In Java this all really started with from the SEDA project.

Yeah, I've used SEDA.

There are a lot of problems with it, but the main problem is that it uses
multiple worker threads to process the event queue. This throws out one of
the main scalability points of an EDSM in the first place: no contention.

SEDA code still has to lock common data whenever it is accessed.
->  all common process resources are still contention points.

I wouldn't recommend SEDA after having compared it on the same project
against libst (for C) and twisted (for python). The libst version was
fastest, but the twisted version was simplest (comparable speed to SEDA).
Admittedly, there were other factors here (for example the language!)...

> Those projects and others are why in my original post I said "(even that
> is changing)" when I was referring to the 1-1 correspondence between
> request and native thread.


> I still believe as we move forward over the next few years any popular
> language must support native threading. In fact this is one of the main
> reasons I have moved from SML/MLton to Scala/Java.

Threads can be useful, no question. However, that usefulness is limited to
the class of applications where shared memory helps rather than hinders.

Generally, as evidenced by locking (which is another synchronization method
above total ordering on the memory), shared memory doesn't really capture
parallelism well. I refer you to the classic paper, 'Understanding the
limitations of casually and totally ordered communication' for details.

BTW, these issues are extremely old. Practice is far behind theory here.

On Mon, Jul 04, 2005 at 08:11:00AM -0400, Ray Racine wrote:
> On Mon, 2005-07-04 at 12:43 +0200, Wesley W. Terpstra wrote:
> > On Sun, Jul 03, 2005 at 05:36:49PM -0400, Ray Racine wrote:
> > > - For web (and concurrent multiuser) based development native thread
> > > concurrency is essential for scalable applications. IMHO.
> > 
> > I completely disagree with this point.
> I am hypothesizing on where I think your point of disagreement is ...
> correct me if I'm off base...
> I believe you may be disagreeing because you (very correctly) believe
> designing an HTTP server (or any socket/thread library) using a 1-1
> correspondence between threads (native or otherwise) and a user request
> is not the superior proformant model.  I absolutely agree with you on
> this.  That model is changing pretty much everywhere in all camps.  For
> example those URL's I sent earlier.

Yep. That's my contention.

> I believe native threads (and multi-processors, multi-core machines)
> move you from measuring scalability from a hardware unit of a core to a
> hardware unit of a entire physical node (the whole machine itself, all
> cores, all CPUs) which I claim is desirable. 

I agree.

> Here I am use "hardware unit" as in "System scalability is the ability
> of an application to sustain its performance per hardware unit (such as
> a CPU) with the increasing number of these units" at http://state-
> threads.sourceforge.net/docs/st.html
> So yes I believe a system where the hardware unit is the CPU for a fair
> thread system single process model will be out scaled by a system which
> is capable of treating the hardware unit as the entire physical node.
> Eventually you are going to have 1+ processes and must deal with inter
> process communication and all its accompanying complexities.
> A native thread model avoids the overhead of eventually dealing with
> inter-process communication.  In one system I can leverage SMP threading
> API's to accomplish the equivalent via inter process IPC APIs to
> accomplish and I believe in a cleaner manner.

This is where I start to disagree. There is lots of inter-process
communication between threads with shared memory. 
You don't even see all of it, and what you do see is: locking.

> For example, imagine I have an 8 CPU with 4 cores/cpu system = 32 cores.
> Model #1
> Create ~32 processes each process "locked" to a core.  Each process uses
> a SM fair thread model.  To coordinate the whole system use IPC
> mechanisms such as pipes or shared memory or ...  may be a serialization
> model atop of that for communication / load balancing.

The nice thing here is that IPC is used directly by the programmer and
neatly captures the actual synchronization concerns and no others. Also, 
no process needs to block because those synchronization points become
events for the EDSM just like any other.

> Model #2
> Same as above.  32 fair thread / SM process.  However instead I create
> each "process" as a native thread so the system is now one process with
> 32 native threads.  Each native thread "locked" to a core is leveraging
> your SM library to each maximum throughput. Now the issues of process
> communication become native thread concerns.  In this case I can use a
> CML mailbox where I would need an IPC pipe, shared mutex, socket +
> serialization, ...

You must still lock the file descriptor table for each new connection.
Memory management (new/free) also becomes a point of lock-contention.
Probably other things I don't know about become bottlenecks here too.

Furthermore, when you have a threaded application, you always have to lock
global data structures. This incurs overhead every time you use them. The
fact that you have to do locking at all suggests that the 'shared memory' 
of threads is not a sufficient synchronization medium.

So, to summarize:
1- threads have contention over many programmer invisible things
2- threads must lock many things that aren't strictly points of contention
3- threads use of shared memory fails to capture the actual synchronization
   properties of the application (meaning programmers must still use locks)

> What will you do to leverage a new IBM Cell processor which is composed
> of a Power processor core working with 8 cell processor cores?  CPUs of
> the future will look more like the cell.  How will you leverage these
> cores?

I propose model #1 is preferrable in most situations.

Again, I am talking mostly about network driven applications here.
Threads do make sense for some parallel algorithms.

Wesley W. Terpstra