[MLton] regions

Fri, 1 Oct 2004 12:08:29 -0700

Thanks for the detailed explanation Matthew. 

I'll expand a little on my worries about space safety and other
performance issues when using a region system to implement SML.

First, I don't even know if there is any pure region system that is
safe for space.  My guess is that there's not, and here's why.  Region
inference decides when data can be allocated in a stack-like manner.
Data for which the inference can not determine the lifetime is moved
into a global region whose lifetime is the entire program.  There are
two ways in which region inference can break down:

* When the inference is too conservative, that is, when the data is
  used in a stack-like manner but the region inference can't figure
  it out.

* When data is not used in a stack-like manner.  In this case no
  amount of region inference will help.

This global region is a source of space leaks.  That is, no matter
what region system you use, there are some programs such that the
global region must exist, and its size will grow to an unbounded
multiple of the live data size.  To me, this means that one *must*
have a GC to achieve space safety.

> They have also worked on combining garbage collection with
> region-based memory management.
>
> Unfortunately, it was my impression that rather than getting the
> best of both worlds, you ended up with compromising both systems:
>
> http://www.it-c.dk/research/mlkit/papers.html

Relevant to space safety, the main thing I took away from these papers
is supporting evidence for space leaks in the global region.  They
show a number of benchmarks where the memory usage of the program
running with just regions is a large multiple (2, 10, 50, even 150!) 
of the program running with regions + GC.  They also show a large SML
program, the ML Kit, built with regions + GC and self compiling but do
not show the corresponding example built with just regions.  My guess
is that the reason is that the version without GC simply runs out
memory.

Matthew, I'm wondering what you meant by compromising both systems.  I
understand that they had to weaken the region analysis in order to
allow the pointer tracing in a GC.  But going the other direction, it
seems possible to me that a combined region + GC system could at least
be safe for space, although it isn't clear to me if their combination
achieves this.

> So, one might ask whether we might do the same thing -- i.e., provide a
> MLton.Regions structure with explicit region based memory management
> operations, so that the programmer could use them when appropriate.  I've
> recently thought about this question:
> 
> http://www.cs.cornell.edu/People/fluet/rgn-monad/index.html
> 
> Unfortunately, I think my current conclusion is that the SML type system
> is just a little too weak to really support this option.

I'm curious if you think it would be worthwhile if the typing issue
were addressed.  That is, would the performance really be better than
our generational GC?

> Final thoughts -- consider why one would choose region based memory
> management.  One common arguement is that the region operations can
> all be done in (approximately) constant time; therefore, you
> eliminate GC pause times.

I've heard this argument too, as well as other arguments that regions
improve overall performance.  I'm not convinced by either.  Are there
others?

As to the performance of region-based system, the ML Kit papers do
give some numbers to show their system with just regions does better
than either a system with just GC or a combined system.  But I'm not
really interested in a pure region system because of the lack of space
safety.  And the other performance numbers are less convincing -- they
do some comparisons with an old version of SML/NJ (instead of a new
version or MLton) and they compare to a simple two-space collector
without generations.  I've not seen a comparison with a more serious
collector.

As to pause times, my argument about space safety comes into play
again.  If one wants space safety then one must have a GC, which
brings problems about pause times back.  The only place where regions
offer an improvement is on programs specially written for regions
where the data has a stack-like behavior, the analysis can prove it,
and the programmer can be convinced that there are no space leaks.
Maybe the Cyclone stuff does this -- I haven't read it.