[MLton] WideChar

Stephen Weeks MLton@mlton.org
Mon, 13 Dec 2004 10:06:17 -0800


> A foolish user might write:
> case enc of
>     UTF8 => ...
>   | UTF16 => ...
>   | X strEnc => ...
> 
> The user writing a case statement over the datatype will be uncommon, but I
> wouldn't put it outside the realm of possibility. IMO, we shouldn't break
> user code by adding new encodings; even bad user code.

OK, I understand.  The problem is that by adding a new encoding as a
variant, we have changed the representation of an encoding from 
X "foo" to FOO.  Code that used to handle the foo encoding as an
extension (whether it actually ever used the string "foo" or not) is
now broken.  There are two ways in which such code is broken,
corresponding to whether X is used as a constructor or as a destructor
via a pattern.  We can fix the constructor problem by making the
type abstract.

signature ENCODING =
   sig
      type t

      datatype dest =
         UTF8
       | UTF16
       | X of string

      val dest: t -> dest
      val fromString: string -> t
      val utf8: t
      val utf16: t
   end

We guarantee, e.g., that UTF8 = dest (fromString "utf8").  This also
addresses your concerns about getting an encoding from a string at
runtime.  Now, if we add a new variant, the signature looks like

signature ENCODING =
   sig
      type t

      datatype dest =
         Foo
       | UTF8
       | UTF16
       | X of string

      val dest: t -> dest
      val foo: t
      val fromString: string -> t
      val utf8: t
      val utf16: t
   end

Code that used Encoding.fromString "foo" will still work, because the
Encoding implementation will produce the Foo variant.  However, code
that matched on X "foo" in a pattern will break: dest now returns Foo
for that encoding, so the X "foo" case is never taken.
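
To make the breakage concrete, here is a minimal sketch.  The structure
body is only an assumption for illustration (t is simply dest, hidden
by opaque ascription), and isFooOld is a made-up stand-in for old
client code.

```sml
signature ENCODING =
   sig
      type t

      datatype dest =
         Foo
       | UTF8
       | UTF16
       | X of string

      val dest: t -> dest
      val foo: t
      val fromString: string -> t
      val utf8: t
      val utf16: t
   end

(* Hypothetical implementation: t is just dest, hidden by :>. *)
structure Encoding :> ENCODING =
   struct
      datatype dest =
         Foo
       | UTF8
       | UTF16
       | X of string

      type t = dest

      fun dest e = e
      val foo = Foo
      val utf8 = UTF8
      val utf16 = UTF16

      (* Known names map to their variants; anything else is an extension. *)
      fun fromString s =
         case s of
            "foo" => Foo
          | "utf8" => UTF8
          | "utf16" => UTF16
          | _ => X s
   end

(* Old client code, written when "foo" was still an extension. *)
fun isFooOld (e: Encoding.t): bool =
   case Encoding.dest e of
      Encoding.X "foo" => true
    | _ => false

(* true: fromString produces the new variant. *)
val viaFromString = Encoding.dest (Encoding.fromString "foo") = Encoding.Foo

(* false: the old X "foo" pattern no longer matches. *)
val oldPatternStillMatches = isFooOld (Encoding.fromString "foo")
```

Note that the old code still compiles without complaint; it just stops
taking the branch it used to take.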

> Or is there a way to make a datatype partially opaque so that the user is
> forced by the exhaustive pattern match checking to add a _ => pattern?

There isn't really a way to do this.  The "exn" type in SML has this
property, because it is an extensible type, but no other datatype
does.  The only other types that require a _ pattern are the constant
types (IntInf.int, string, ...) with an infinite number of values.
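
For comparison, a small sketch of the exn behavior.  The exception
names are invented for the example.

```sml
(* exn is extensible: new constructors can be declared anywhere. *)
exception Utf8Error
exception Utf16Error

(* Without the final wildcard the compiler warns about a nonexhaustive
   match, because no finite set of constructors covers exn. *)
fun describe (e: exn): string =
   case e of
      Utf8Error => "bad UTF-8"
    | Utf16Error => "bad UTF-16"
    | _ => "unknown"

(* Declared after describe; it is still handled, via the wildcard. *)
exception FooError

val described = describe FooError
```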

The only solution I see to the destructor problem is to hide the
representation of the type completely.

signature ENCODING =
   sig
      type t

      val foo: t
      val fromString: string -> t
      val utf8: t
      val utf16: t
   end

With this, we have lost the ability to do case dispatches, but maybe
that's not a big deal.  Furthermore, we can recover this by adding
equals and toString functions.

signature ENCODING =
   sig
      type t

      val equals: t * t -> bool
      val foo: t
      val fromString: string -> t
      val toString: t -> string
      val utf8: t
      val utf16: t
   end

Now, for known encodings, we get type-system support guaranteeing
agreement on the encoding name.  Adding a new known encoding lets new
code get that support, while old code keeps using the string.  We can compare
encodings for equality, so we can write nested if-then-else to
dispatch on them, still supported by the type system.  If one really
wants, one could do a case expression

  case toString e of
      "UTF8" => ...
    | "UTF16" => ...
    | ...

but then one loses compile-time checking of encoding name.  One could
do a more robust, but possibly slower, case expression by defining a
list of cases

  val cases = [(utf8, v1), (utf16, v2), ...]

and then using List.peek to extract the appropriate case.  We could
even add support for this to ENCODING if we want.

  val casee: t * (t * 'a) list -> 'a

Overall, I don't see any drawbacks of this approach over exposing
Encoding.t as string.  Let me know if you still see some.  BTW,
underneath, Encoding.t is probably datatype t = T of string, but we've
bought ourselves some type-system support by hiding this.
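
Here is a rough sketch of what that might look like, not a worked-out
design.  Assumptions on my part: casee takes the encoding to dispatch
on as an explicit argument and returns an option rather than raising
when nothing matches; it uses List.find from the Basis Library where
the text above says List.peek; foo is left out for brevity; and
bytesPerUnit is a made-up client function.

```sml
signature ENCODING =
   sig
      type t

      val casee: t * (t * 'a) list -> 'a option
      val equals: t * t -> bool
      val fromString: string -> t
      val toString: t -> string
      val utf8: t
      val utf16: t
   end

structure Encoding :> ENCODING =
   struct
      (* The representation is a string, but clients cannot see that. *)
      datatype t = T of string

      fun fromString s = T s
      fun toString (T s) = s
      fun equals (T s, T s') = (s = s')

      val utf8 = T "utf8"
      val utf16 = T "utf16"

      (* Return the value paired with the first encoding equal to e. *)
      fun casee (e, cases) =
         case List.find (fn (e', _) => equals (e, e')) cases of
            NONE => NONE
          | SOME (_, v) => SOME v
   end

(* Dispatch with nested if-then-else; the names utf8 and utf16 are
   checked at compile time.  The 0 for unknown encodings is arbitrary. *)
fun bytesPerUnit (e: Encoding.t): int =
   if Encoding.equals (e, Encoding.utf8) then 1
   else if Encoding.equals (e, Encoding.utf16) then 2
   else 0

(* Dispatch through a list of cases instead. *)
val viaCasee =
   Encoding.casee (Encoding.fromString "utf16",
                   [(Encoding.utf8, 1), (Encoding.utf16, 2)])
```

Because of the opaque ascription, client code cannot write Encoding.T
"utf8" or compare encodings with =; it has to go through fromString
and equals.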