[MLton] More on Parallel Runtime

Sat Oct 20 12:40:43 PDT 2007

Eric McCorkle wrote:
> I mentioned a few months ago that I was working on a parallel runtime 
> system which I intended for use with MLton and possibly other compilers 
> for CML-like languages.  Here is what I have so far:
> 
> I've built an M:N threading system based entirely on lock-free data 
> structures.  User-level threads (termed just "threads") are mapped onto 
> real threads (termed "executors") by the scheduler system, which can be 
> configured to use one of two algorithms.  Each executor keeps its own 
> scheduler, and all schedulers use a lock-free fifo structure for 
> work-sharing.  I originally intended for the system to use an 
> activation-style API like FreeBSD's KSE interface, but it can also use 
> pthreads (in fact, I have only implemented the pthread backend at the 
> moment).
> 
> The runtime and the program communicate by means of a per-thread mailbox 
> structure.  The actual program is expected to make use of safe-points to 
> periodically check its mailbox, possibly calling into the runtime.  This 
> has two very beneficial effects, I think.  The first is that there are 
> no involuntary context-switches.  The second is that it is possible to 
> prevent the scheduler from interrupting a given block (necessary for 
> implementing some locks) simply by omitting safe-points.
> 
> My plans from here are to implement the Cheng-Blelloch collector (with 
> some lock-free improvements).  To do this, I need to know exactly how to 
> traverse the memory graph.  It also may be advantageous to try 
> allocating stack frames individually from the heap, as this would reduce 
> the overhead of thread creation to a tiny amount, and would reduce the 
> collector's pause as well.  I'll also need to modify MLton to tie into 
> the runtime system at some point (bits like calling into the runtime, as 
> well as implementing the CML structures).
> 
> I looked briefly, and it seems MLton generates type signatures to allow 
> the collector to trace the memory graph.  What I need to know is a) is 
> this correct? b) where is the format of this specified?  Also, what 
> would it take to implement the write-barrier that the Cheng-Blelloch 
> algorithm requires?  Lastly, is there anything else I should consider?
> 

Hopefully, I may have some code that might be useful.
I've also been working on multi-core support form MLton but began by 
reworking the runtime to support multiple pthreads with plans to add a 
M:N threading system. Specifically, I've focused on adding pthread 
support to the runtime with as few changes to MLton as possible.

My current solution uses separate heaps for each pthread, the 
mark-compact algorithm without card-marking, and a global lock for 
collection and atomic operations.

Notes:
- The "executor" model seemed necessary because I found no way in the 
FFI to spawn pthreads with different (unit -> unit) functions.

- A global lock for GC_collect and atomic operations 
(Thread_atomicBegin/End). In the interest of changing as little of 
MLton's runtime as possible, global locks seemed a quick-and-dirty solution.

- C-codegen: to look up per-pthread data with minimal changes to the 
compiler (I didn't want to change the x86 codegen until I knew it worked).

- Per-pthread heaps: to not require locks when allocating heap space.

- Split GC_state structure into global data (args, globals, summary 
info) and pthread-specific data (heap, frontier, stackTop, etc).

- Thread_atomicBegin/End are now function calls that acquire the global 
lock. This was because I needed the frontier/stack information to be 
when a GC_collect occurred in another pthread.

I have tested this code with a few programs that spawn a simple 
"executor" on multiple pthreads. Currently I'm trying to get CML working 
using this model and have run into a little snag getting signal handling 
to work (CML uses preemptive threads). Well-defined safe-points would be 
an elegant way to overcome this.

I'd be happy to send you what I have. A home in MLton's SVN seems like 
the best place to work from.

To answer your questions:
a) Correct. The runtime uses GC_state's frameLayouts and objectTypes to 
traverse the heap.
b) They are generated in the mlton/backend/backend.fun (frameLayouts and 
frameOffsets) and in mlton/backend/ssa-to-rssa.fun (objectTypes)

-- 
Philip Schatz
CS Graduate Student
Purdue University