[MLton] Scaling issue during refFlatten when building the type dependencies graph

Mon Sep 8 19:12:49 PDT 2008

On Mon, 8 Sep 2008, Nicolas Bertolotti wrote:
> I recently faced a scaling issue during the refFlatten. The issue occurs 
> during the construction of a graph which uses a very large amount of RAM 
> because the edges are stored as lists and the redundant edges are not 
> automatically filtered during the construction of the graph itself.

I believe that the DirectedGraph module is meant to support graphs with 
multiple edges between nodes.  While representing the set of edges by a 
data structure other than a list might be more efficient, I don't believe 
that it would be appropriate the automatically combine edges between the 
same nodes.

> I have been able to workaround it by modifying the ref-flatten.fun file and replace the following construct:
>                      | Datatype tc =>
>                                    (ignore o Graph.addEdge)
>                                     (graph, {from = n, to = tyconNode tc})
> with:
>                      | Datatype tc =>
>                            (* NBe: Avoid redundant edges in the graph. *)
>                            let
>                                val m = tyconNode tc
>                            in
>                                if (Graph.Node.hasEdge {from = n, to = m}) then
>                                    ()
>                                else
>                                    ((ignore o Graph.addEdge)
>                                     (graph, {from = n, to = m}))
>                            end
>
> This change drastically reduced the memory usage (from millions of edges 
> to one thousand edges) and solved the problem.

Wow!  That is quite a difference (and suggests a large number of 
datatype/variants in the input program).

> I don't expect this fix to be included as is in the compiler as there 
> are probably better ways to handle this (use an other data type to store 
> the edges, use a different algorithm to build the graph ...).

Actually, I think this is the best solution to the immediate problem.  In 
any case, even with a different datatype for edges, there is no need for 
the RefFlatten graph to include duplicate edges.  It is true that the 
Node.hasEdge requires a linear search through the list of edges; that 
could be improved to constant time with another property.  But, I'm not 
sure that you would see a difference in practice; even with your large 
problem, the list of edges is now the "short" list of non-redundant edges 
and for any given node it can grow to at most the number of datatypes in 
the IL program (and will typically be much, much shorter in practice).