[MLton] Scaling issue during refFlatten when building the type
dependencies graph
Matthew Fluet
fluet at tti-c.org
Mon Sep 8 19:12:49 PDT 2008
On Mon, 8 Sep 2008, Nicolas Bertolotti wrote:
> I recently faced a scaling issue during the refFlatten. The issue occurs
> during the construction of a graph which uses a very large amount of RAM
> because the edges are stored as lists and the redundant edges are not
> automatically filtered during the construction of the graph itself.
I believe that the DirectedGraph module is meant to support graphs with
multiple edges between nodes. While representing the set of edges by a
data structure other than a list might be more efficient, I don't believe
that it would be appropriate the automatically combine edges between the
same nodes.
> I have been able to workaround it by modifying the ref-flatten.fun file and replace the following construct:
> | Datatype tc =>
> (ignore o Graph.addEdge)
> (graph, {from = n, to = tyconNode tc})
> with:
> | Datatype tc =>
> (* NBe: Avoid redundant edges in the graph. *)
> let
> val m = tyconNode tc
> in
> if (Graph.Node.hasEdge {from = n, to = m}) then
> ()
> else
> ((ignore o Graph.addEdge)
> (graph, {from = n, to = m}))
> end
>
> This change drastically reduced the memory usage (from millions of edges
> to one thousand edges) and solved the problem.
Wow! That is quite a difference (and suggests a large number of
datatype/variants in the input program).
> I don't expect this fix to be included as is in the compiler as there
> are probably better ways to handle this (use an other data type to store
> the edges, use a different algorithm to build the graph ...).
Actually, I think this is the best solution to the immediate problem. In
any case, even with a different datatype for edges, there is no need for
the RefFlatten graph to include duplicate edges. It is true that the
Node.hasEdge requires a linear search through the list of edges; that
could be improved to constant time with another property. But, I'm not
sure that you would see a difference in practice; even with your large
problem, the list of edges is now the "short" list of non-redundant edges
and for any given node it can grow to at most the number of datatypes in
the IL program (and will typically be much, much shorter in practice).
More information about the MLton
mailing list