[MLton] cvs commit: world no longer contains a preprocessed basis library

Matthew Fluet fluet@cs.cornell.edu
Fri, 12 Dec 2003 09:48:27 -0500 (EST)


> > Here's a "working" version of a new cmcat.
> ...
> > But, in any event, I don't think that the lib/mlton sources are
> > semantically correct from CM's point of view.  At the very least,
> > CM.make "lib/mlton/heap/sources.cm" doesn't work.
>
> I commented out the heap stuff, which is very old and obviously
> unused, from sources.cm and checked that change in.  With that, I was
> able to use your cmcat to produce a file list that MLton was able to
> type check.  But, it did have a bunch of extra files that the old
> cmcat was able to drop.

I think I've figured out what's going on with the new cmcat.  The issue is
that  CM.make  will trigger compilation of .cm files that are included in
the top-level .cm file, but this compilation will only force the
{str,sig,fct} bindings necessary to close off the top-level .cm file.
This explains why CM.make "lib/mlton/sources.cm" would succeed, while
CM.make "lib/mlton/heap/sources.cm" would fail.
CM.make "lib/mlton/sources.cm" only tries to compile the bindings exported
by "lib/mlton/sources.cm", which includes  BinaryHeap  but not
{{Eager,Lazy}Binomial,Fibonacci}Heap.  Hence, when CM.make recurses on
"lib/mlton/heap/sources.cm", it does so only asking for BinaryHeap, which
compiles without problems.  Whereas, CM.make "lib/mlton/heap/sources.cm"
will force {Binary,{Eager,Lazy}Binomial,Fibonacci}Heap, which fails to
compile.

CM.Graph.graph, on the other hand, does not recurse on .cm files included
in the top-level .cm file.  Instead, it gives an import dependency.  At
the same time, though, CM.Graph.graph will only contain the .{sig,sml,fun}
files and .cm imports necessary to yield the exports.  So, for example, if
I have:

sources.cm:  Group
                structure A
             is
                a.sml (* defines A, no other depenencies *)
                b.sml (* defines B *)
                c/c.cm

Then CM.Graph.graph "sources.cm" will only include a.sml and will not have
a compile node for b.sml, nor an import dependency on c/c.cm.


Hence, the reason that the new cmcat is giving more files for
"mlton/sources.cm" than the old cmcat is that
CM.Graph.graph "mlton/sources.cm" yields an import dependency on
"lib/mlton/sources.cm", so the new cmcat recurses.  However,
"mlton/sources.cm" doesn't actually depend on every export of
"lib/mlton/sources.cm", so the old cmcat would leave those out.  The new
cmcat just pulls in "lib/mlton/sources.cm" and everything necessary to
build all the exports.

In order to recover the "tighter" behavior of the old cmcat, we would need
to generate the graph for the necessary imports, not all the exports.  I
don't see a very easy way of doing it.  It is easy enough to scan a graph
to determine what symbols are imported from each imported .cm file.

The first thing that comes to mind is create a temporary .cm file with
only the necessary imports.  So, if we had

AB/sources.cm:  Group
                   structure A
                   structure B
                is
                   XYZ/sources.cm
                   AB/a.sml (* uses structure X *)
                   AB/b.sml (* uses structure Y *)

XYZ/sources.cm: Group
                  structure X
                  structure Y
                  structure Z
                is
                  XYZ/x.sml
                  XYZ/y.sml
                  XYZ/z.sml

then cmcat "A/sources.cm" would generate an import dependency on
"XYZ/sources.cm" with the symbols "structure X" and "structure Y".  We
would then like to greate the temporary file:

temp.cm: Group
           structure X
           structure Y
         is
           XYZ/sources.cm

but CM.Graph.graph "temp.cm" will just generate another import dependency
on "XYZ/sources.cm".

We could try copying "XYZ/sources.cm" to generate

temp.cm: Group
           structure X
           structure Y
         is
           XYZ/x.sml
           XYZ/y.sml

but I'm afraid that that would quickly become complicated, trying to
support the whole CM description language.  Not to mention the multiple
paths issue which is described below.


The other thing we could do would be to simplify the graph produced by
CM.Graph.graph.  Knowing the needed imports, we could mark the nodes that
define those imports, do a reverse DFS to mark the necessary files.
Finally, we would determine which of the import depedencies remain and
which symbols from those imports are required, and then recurse.

It is more complicated than it sounds.  CM.Graph.graph essentially gives a
list-of-edges graph representation, so we need to map it over into a
"real" graph inorder to do any real processing.  It's further complicated
by the fact that the same .cm import might be imported via multiple paths:

sources.cm: Group
              structure Main
            is
              A/sources.cm
              B/sources.cm
              main.sml (* uses structure A, structure B *)

A/sources.cm: Group
                structure A
              is
                XYZ/sources.cm
                A/a.sml (* uses structure X *)

B/sources.cm: Group
                structure B
              is
                XYZ/sources.cm
                B/b.sml (* uses structure Y *)

XYZ/sources.cm: Group
                  structure X
                  structure Y
                  structure Z
                is
                  XYZ/x.sml
                  XYZ/y.sml
                  XYZ/z.sml

In this case, we want structure X and Y from XYZ/sources.cm, but we don't
know that until we scan both A/sources.cm and B/sources.cm.  So, it seems
that we really need to follow all the imports (even following those that
might later be determined not necessary), generate the topological sort,
and then start working backwards so that we are sure to hit all the
imports before trying to simplify a graph.


Or, we just let MLton's removeUnused passes eliminate the dead code.  The
two downsides of that are that (1) we might end up with extra code, (2) as
we saw before, while CM.make "sources.cm" might succeed, the corresponding
cmcat "sources.cm" might produce a file list with errors (unbound
identifiers or type errors, corresponding to code that CM never compiled).