[MLton] statically ensuring regular-language membership with fold
Stephen Weeks
MLton@mlton.org
Sun, 16 Oct 2005 07:35:10 -0700
This note describes a construction that, given a regular language R
over alphabet {a, b, ...}, defines SML values "start" and "a", "b",
... such that the SML expression
(start s1 s2 ... sn $): unit
is well typed if and only if the sequence <s1, s2, ..., sn> is in R,
where each si is in {a, b, ...}. ("$" is our usual variable-arguments
terminator)
It is perhaps easier to understand the construction by example, so
here it is for the regular language (ab)*. There is a three-state
DFA that accepts (ab)*. Here is the transition table.
a b
--- ---
X | Y | Z |
Y | Y | Y |
Z | X | Y |
--- ---
Z is the start state and is also the only accepting state.
The construction is built on top of the fold utility.
----------------------------------------------------------------------
fun $ (a, f) = f a
fun pass x f = f x
structure Fold =
struct
type ('a, 'b, 'c, 'd) step = 'a * ('b -> 'c) -> 'd
type ('a, 'b, 'c, 'd) t = ('a, 'b, 'c, 'd) step -> 'd
type ('a1, 'a2, 'b, 'c, 'd) step0 =
('a1, 'b, 'c, ('a2, 'b, 'c, 'd) t) step
val fold = pass
fun step0 h (a1, f) = fold (h a1, f)
end
----------------------------------------------------------------------
Below is an expression of the DFA in an SML signature, where "start" is
a folder (of type Fold.t) and each symbol is a stepper (of type
Fold.step0). Furthermore, each symbol directly expresses its column
in the transition table in its type. Each step keeps track of the
state and each state has embedded within it whether it is accepting or
rejecting. The finisher function of the fold requires the final state
to be accepting.
----------------------------------------------------------------------
signature AB =
sig
type accept
type reject
type ('a, 'b, 'c, 'd, 'e) state
type ('a, 'b, 'c, 'd, 'e) trans
type ('a, 'b, 'c) X = (reject, 'a, 'b, 'c, 'a) trans
type ('a, 'b, 'c) Y = (reject, 'a, 'b, 'c, 'b) trans
type ('a, 'b, 'c) Z = (accept, 'a, 'b, 'c, 'c) trans
type ('t1, 't2, 't3, 'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
= (('f1, 't1, 't2, 't3, ('f2, 'a, 'b, 'c, 'd) trans) state,
('f2, 'a, 'b, 'c, 'd) state,
(accept, 'e, 'f, 'g, 'h) state,
unit,
'z) Fold.step0
val start: ((accept, 'b, 'c, 'd, 'd) state,
(accept, 'e, 'f, 'g, 'h) state,
unit,
'z) Fold.t
val a: (('a, 'b, 'c) Y,
('a, 'b, 'c) Y,
('a, 'b, 'c) X,
'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
val b: (('a, 'b, 'c) Z,
('a, 'b, 'c) Y,
('a, 'b, 'c) Y,
'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
end
----------------------------------------------------------------------
It is easy to check that that the above signature works as intended.
In the following functor, the uncommented examples type check and the
commented examples fail to typecheck.
----------------------------------------------------------------------
functor Test (S: AB): sig end =
struct
open S
val () = start $
(* val () = start a $ *)
(* val () = start b $ *)
val () = start a b $
(* val () = start a b a $ *)
val () = start a b a b $
end
----------------------------------------------------------------------
There are a number of ways to implement AB. Several of them are
mostly phantom and share some common definitions.
----------------------------------------------------------------------
functor GenAB (S: sig
type ('a, 'b, 'c, 'd, 'e) state
type ('a, 'b, 'c, 'd, 'e) trans
end) =
struct
open S
type accept = unit
type reject = unit
type ('a, 'b, 'c) X = (reject, 'a, 'b, 'c, 'a) trans
type ('a, 'b, 'c) Y = (reject, 'a, 'b, 'c, 'b) trans
type ('a, 'b, 'c) Z = (accept, 'a, 'b, 'c, 'c) trans
type ('t1, 't2, 't3, 'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
= (('f1, 't1, 't2, 't3, ('f2, 'a, 'b, 'c, 'd) trans) state,
('f2, 'a, 'b, 'c, 'd) state,
(accept, 'e, 'f, 'g, 'h) state,
unit,
'z) Fold.step0
end
----------------------------------------------------------------------
Here is a pure phantom implementation.
----------------------------------------------------------------------
structure AB:> AB =
struct
structure S = GenAB (type ('a, 'b, 'c, 'd, 'e) state = unit
type ('a, 'b, 'c, 'd, 'e) trans = unit)
open S
fun start $ = Fold.fold ((), ignore) $
fun symbol $ = Fold.step0 ignore $
val a = symbol
val b = symbol
end
----------------------------------------------------------------------
Here is an implementation that dynamically keeps track of the state.
----------------------------------------------------------------------
structure AB:> AB =
struct
datatype state = X | Y | Z
structure S = GenAB (type ('a, 'b, 'c, 'd, 'e) state = state
type ('a, 'b, 'c, 'd, 'e) trans = unit)
open S
fun start $ = Fold.fold (Z, ignore) $
val symbol = Fold.step0
fun a $ = symbol (fn X => Y | Y => Y | Z => X) $
fun b $ = symbol (fn X => Z | Y => Y | Z => Y) $
end
----------------------------------------------------------------------
Here is an implementation that dynamically keeps track of the
transitions, and prints them out at the end.
----------------------------------------------------------------------
structure AB:> AB =
struct
structure S = GenAB (type ('a, 'b, 'c, 'd, 'e) state = char list
type ('a, 'b, 'c, 'd, 'e) trans = unit)
open S
fun start $ =
Fold.fold ([], fn cs => print (implode (rev (#"\n" :: cs)))) $
fun symbol c = Fold.step0 (fn cs => c :: cs)
fun a $ = symbol #"a" $
fun b $ = symbol #"b" $
end
----------------------------------------------------------------------
Calling our Test functor on the above AB yields
----------------------------------------------------------------------
ab
abab
----------------------------------------------------------------------
There is also a completely non-phantom implementation. I'm not sure
if it has any practical use, but it does have one property that none
of the other implementations do -- one can use it without the
signature constraint and it still enforces membership in the regular
language. All of the other implementations, because of their phantom
aspects, require the opaque signature constraint in order to function
properly.
I also mention this implementation for historical reasons, since it is
the one I built first, and used to help me derive the AB signature.
----------------------------------------------------------------------
structure AB: AB =
struct
datatype accept = Accept
datatype reject = Reject
datatype ('a, 'b, 'c) r = R of {X: 'a, Y: 'b, Z: 'c}
type ('a, 'b, 'c, 'd, 'e) state = 'a * (('b, 'c, 'd) r -> 'e)
type ('a, 'b, 'c) states =
((reject, 'a, 'b, 'c, 'a) state,
(reject, 'a, 'b, 'c, 'b) state,
(accept, 'a, 'b, 'c, 'c) state) r
type ('a, 'b, 'c, 'd, 'e) trans =
('b, 'c, 'd) states -> ('a, 'b, 'c, 'd, 'e) state
type ('a, 'b, 'c) X = (reject, 'a, 'b, 'c, 'a) trans
type ('a, 'b, 'c) Y = (reject, 'a, 'b, 'c, 'b) trans
type ('a, 'b, 'c) Z = (accept, 'a, 'b, 'c, 'c) trans
type ('t1, 't2, 't3, 'f1, 'f2, 'a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'z) symbol
= (('f1, 't1, 't2, 't3, ('f2, 'a, 'b, 'c, 'd) trans) state,
('f2, 'a, 'b, 'c, 'd) state,
(accept, 'e, 'f, 'g, 'h) state,
unit,
'z) Fold.step0
fun state (f, sel): ('a, 'b, 'c, 'd, 'e) state =
(f, fn R r => sel r)
fun states (): ('a, 'b, 'c) states =
R {X = state (Reject, #X),
Y = state (Reject, #Y),
Z = state (Accept, #Z)}
fun run state = Fold.fold (state, fn (Accept, _) => ())
fun trans sel (R states) = sel states
val X: ('a, 'b, 'c) X = fn $ => trans #X $
val Y: ('a, 'b, 'c) Y = fn $ => trans #Y $
val Z: ('a, 'b, 'c) Z = fn $ => trans #Z $
fun start $ = run (Z (states ())) $
fun symbol r = Fold.step0 (fn (_, sel) => sel (R r) (states ()))
fun a $ = symbol {X = Y, Y = Y, Z = X} $
fun b $ = symbol {X = Z, Y = Y, Z = Y} $
end
----------------------------------------------------------------------
That's it for the example construction. Hopefully the general
construction is now clear.
Suppose we have a DFA with n states, S1, ..., Sn and m symbols u1,
... um. Let d(Si, uj) be the state transitioned to from state Si on
symbol uj. Define Fi to be "accept" if Si is accepting and to be
"reject" if Si is rejecting. Let "s" be the index of the start state.
The signature corresponding to the DFA is the following.
type accept
type reject
type ('a, 'a1, ..., 'an, 'e) state
type ('a, 'a1, ..., 'an, 'e) trans
type ('a1, ..., 'an) S1 = (F1, 'a1, ..., 'an, 'a1) trans
...
type ('a1, ..., 'an) Sn = (Fn, 'a1, ..., 'an, 'an) trans
type ('t1, ..., 'tn, 'f1, 'f2, 'a1, ..., 'an, 'b, 'c1, ..., 'cn, 'd, 'z)
symbol
= (('f1, 't1, ..., 'tn, ('f2, 'a1, ..., 'an, 'b) trans) state,
('f2, 'a1, ..., 'an, 'b) state,
(accept, 'c1, ..., 'cn, 'd) state,
unit, 'z) Fold.step0
val start: ((Fs, 'a1, ..., 'an, 'as) state,
(accept, 'b1, ..., 'bn, 'c) state,
unit, 'z) Fold.t
val u1: (('a1, ..., 'an) d(S1,u1),
...
('a1, ..., 'an) d(Sn,u1),
'f1, 'f2, 'a1, ..., 'an, 'b, 'c1, ..., 'cn, 'd, 'z) symbol
...
val um: (('a1, ..., 'an) d(S1,um),
...
('a1, ..., 'an) d(Sn,um),
'f1, 'f2, 'a1, ..., 'an, 'b, 'c1, ..., 'cn, 'd, 'z) symbol
The size of the signature is proportional to the size of the DFA
(number of states times number of symbols). An interesting question
that I do not know the answer to is how big the inferred type
derivations are. I think because of the way fold works, they will be
quadratic in the number of input symbols, times the size of the DFA.
It wouldn't be too hard write some code that generates the signature
(and one of the implementations) for any DFA.
There are some immediate practical applications of this construction.
It allows one to easily conclude that a number of conditions for
optional arguments can be enforced statically, simply because they can
be expressed as a regular language. For example, one can easily
guarantee that each optional argument occurs no more than once.
Furthermore, once that guarantee is in place, any other constraint can
be expressed, since all finite languages are regular. So, for
example, one could require the optional arguments to occur in a
particular order. Or, impose a pre-order. Or, one could require that
the arguments occur as a prefix of some total order.
For many regular languages, there is a simpler signature that enforces
language membership than the construction above. For example, for
(ab)*, here is a simpler signature.
----------------------------------------------------------------------
signature AB' =
sig
type x
type z
val start: (z, z, unit, 'w) Fold.t
val a: (z, x, z, unit, 'w) Fold.step0
val b: (x, z, z, unit, 'w) Fold.step0
end
----------------------------------------------------------------------
This kind of simplification is possible only because of the very
simple nature of the DFA. In addition to AB' being simpler than AB,
AB' also has the benefit of better error reporting. With AB', one
gets a type error messages as soon as one attempts to enter the dead
state, while with AB one must wait until the $ terminator to get the
error. I think it's not too hard to modify the general construction
to meet this property, though.
It is also possible to ensure membership in some non-regular
languages. For example, here is a signature that ensures membership
in a^n b^n.
----------------------------------------------------------------------
signature ANBN =
sig
type A
type B
type Z
type 'a S
val start: ( Z * A, Z * 'c, unit, 'y) Fold.t
val a: ( 'n * A, 'n S * A, 'w, 'x, 'y) Fold.step0
val b: ('n S * 'c, 'n * B, 'w, 'x, 'y) Fold.step0
end
----------------------------------------------------------------------
As ANBN makes clear, one can implement a type-level stack and do
pushes and pops within the steppers. By merging this idea with DFA
construction above, I suspect it is possible to implement any pushdown
automaton (PDA), and hence to statically ensure membership in any
deterministic context-free language. I'm not sure about
nondeterministic PDA's, but perhaps SML's type inference can be made
to do the guessing needed for nondeterminism.
Finally, it is possible to ensure membership in some non-context-free
languages. For example, here is a signature that ensures membership
in a^n b^n c^n.
----------------------------------------------------------------------
signature ANBNCN =
sig
type A
type B
type C
type Z
type 'a S
val start: ( Z * Z * A, Z * Z * 'd, unit, 'y) Fold.t
val a: ( 'n * 'n * A, 'n S * 'n S * A, 'w, 'x, 'y) Fold.step0
val b: ('n S * 'm * 'd, 'n * 'm * B, 'w, 'x, 'y) Fold.step0
val c: ( Z * 'm S * 'd, Z * 'm * C, 'w, 'x, 'y) Fold.step0
end
----------------------------------------------------------------------