I'm confused about why calling an _export-ed function introduces another thread of control. Is this done just because the exported ML function might need more C stack? My point is that, from the point of view of optimization, I would think that ML calling C calling ML would still obey all the single-threaded constraints.