[MLton] caching elaboration

Sun, 15 Aug 2004 10:36:36 -0700

Here's a simple proposal for caching the elaboration of mlb files to
speed up type error messages.  It addresses what I think is by far the
most common usage pattern of MLton, which is to repeatedly call MLton
with the same command-line switches on the same input mlb, while
tweaking one or more files that the mlb (indirectly) points to.

A run of MLton depends on certain facts about the "environment",
namely:

* the current working directory
* the command-line arguments
* the contents of certain files

If those environment facts do not change, then the run of MLton will
not change.  

At a finer-grained level, a run of MLton probes the environment facts
in some sequence.  If the probes are the same for some prefix of the
sequence, then the behavior of MLton will be the same on that prefix.

So, my idea is for MLton, as it elaborates, to record the facts about
the environment that it depends on, and then to save the world at some
point during elaboration.  In addition to writing out the world it
will write out in an associated file the facts (call them
"assumptions") under which that world is valid.  Then, a subsequent
run of MLton can simply check the assumptions, and, if they are true,
then resume with the saved world.  As long as we capture all the
assumptions and check them accurately, then this should be correct.
There may be more than three kinds of assumptions I listed above, but
I can't think of any right now.

The other decisions to be made are when MLton should save a world and
how many worlds should be saved.  For this, I propose the simplest
thing I can think of that catches common usage.  At any point in time,
for a user, MLton should have at most one saved world.  When MLton
runs, it checks the assumptions for that world and resumes if they are
valid.  If MLton is unable to resume, it deletes the saved world and
starts from scratch.  Whenever MLton encounters an elaboration
failure, it writes out a little bit of information so that on the next
run, it will save the world after the last successfully elaborated file
up to the point of failure (the save can be avoided if that point
already corresponds exactly to the saved world).

The intended effect is that when one is working on a file, the
elaboration of all files up to that point is cached, and subsequent
runs immediately start with elaboration of that file.

Of course, there are more complicated schemes involving multiple
cached worlds, checkpointing every n files, rolling back, etc.  But I
think this would make a nice 90% solution and is easy to implement.

As to how to record the assumptions, it looks easy to me.  The current
working directory is just an absolute path.  The command-line
arguments is a list of strings.  For the contents of files, we could
keep a map from absolute path to a hash of the contents (or we could
even store the contents themselves -- it's not that much disk space).
BTW, this is one reason why I think it is cleanest to view the
filesystem as a map from absolute path to file contents, abstracting
from any hard-link or soft-link issues.