[MLton-user] 'a ref for C programmers: a primer
Wesley W. Terpstra
terpstra at gkec.tu-darmstadt.de
Mon Feb 12 10:02:02 PST 2007
As a former C programmer, I've always been confused by 'a ref in SML.
I wanted to state how I've come to think about them for anyone else
who is confused.
Please let me know if there's a mistake anywhere!
Normal variable bindings are by reference, or pointers to the value
(in principle).
For example,
> val x = SomeVector
> val y = x
will share space, both point to the same object in the heap. This
applies (in principle) to every object, even integers. Every object
lives on the heap. Bindings are just pointers to these objects.
Whenever you access a value by a name, you are implicitly
dereferencing the pointer to the object pointed to.
However, MLton is clever enough to sometimes recognize that it's
cheaper to just copy a variable, rather than store it on the heap.
This is subject to some safety criteria that ensure the program
cannot distinguish the behaviour from the 'every binding is a
pointer' interpretation above. Another criteria is that the object is
not 'too big', which would waste memory.
With an 'a vector, MLton may choose to pack the object directly into
the vector. For example, int vector will not include binding pointers
for each cell. If the contained object is too big, or too
complicated, MLton may still use a pointer in the vector if it chooses.
When you write
> val x = ref 0
> val y = x
this creates a special mutable pointer on the heap. The binding 'x'
points to the mutable pointer. The mutable pointer points to the
integer 0. Thus, when you copy x to y, both binding x and binding y
point to the same mutable pointer. When updating a ref, you first
follow the binding's pointer to the mutable pointer, then modify the
mutable pointer to point at something else. This is why 'y := 5'
updates the value of '!x'.
Again, MLton is clever about these pointers and can optimize them
away subject to safety criteria. In the above example, it could
recognize that the integer 0 could simply be stored where the mutable
pointer should be. This is analogous to how it can decide to
eliminate the binding pointer for non-ref values. However, in this
case it is the mutable pointer saved, not the binding pointer. If the
ref'd object is too complicated, it may not be able to apply this
optimization.
MLton may also be able to save the binding pointer. However, this
savings will never place the mutable pointer on the stack, as a
reference from the heap to the mutable pointer could outlive the
current stack frame. So, when you have a variable like 'x' above, you
are pretty much guaranteed that there is at least one level of
indirection going on.
When an 'a ref is contained in a larger structure, MLton may be
clever enough to embed the mutable pointer directly into the
structure. It can only do this if the structure lives on the heap.
For example, if you have a datatype or tuple that includes a mutable
pointer, MLton might choose to put the mutable pointer directly
inside the object. MLton never copies an object with an embedded
mutable pointer, as this would be violate the 'every binding is a
pointer' interpretation. Unfortunately, if you ever need to make
another binding to these embedded refs, this optimization cannot be
applied due to MLton's GC (no pointers to the inside of an object).
'a array is similar, except that MLton can safely embed all the
references in the containing vector. This is safe, because there do
not exist functions to make an 'a ref from the cell of an array. This
is not the case with 'a ref vector. Here, the user might copy a ref
binding out of the vector, which forbids MLton from embedding the
reference into the vector, due to the GC.
In fact, MLton is so clever that it can often apply both
optimizations at the same time. For a int array, MLton knows it can
use the array cells (the mutable pointers) to just store the int in
place of the mutable pointer. Thus, int array works as you would like
it to with the FFI. However, from a conceptual point of view, there
are still the two levels of pointers. Similarly, with a datatype
containing an int ref, MLton may just place the int directly into the
datatype. Modifying the ref cell modifies the value in the datatype.
This object will never be copied and the user cannot not make new
bindings to the embedded ref (or this optimization would not happen).
So, there's a reason it's confusing: it is complicated! Conceptually,
the 'every binding is a pointer' and 'a ref introduces a mutable
pointer' are correct. However, in practice, both of these pointers
can often be eliminated completely by MLton.
More information about the MLton-user
mailing list