[MLton] WideChar

Stephen Weeks MLton@mlton.org
Wed, 15 Dec 2004 23:41:07 -0800

> What I don't like about your ENCODING signature is that it is a growing
> interface. One of the rules I've had beaten into me over years of C++
> development is that abstract base classes must never add new methods.
> OTOH, the nature of encodings being added IS that the interface grows.

Yes, but I don't see why this is worse than using strings.  With
ENCODING, when the interface grows (i.e. new encodings become
universally known) the documented interface grows.  With strings, the
interface still grows (certain strings corresponding to the new
encodings are now treated specially), but it isn't documented in the

> Sticking with your api and combining the suggestions I've heard:
> signature UNICODE_ENCODING =
>   sig
>     eqtype encoding (* why define equals? just use an eqtype *)

Using eqtype is bad for a few reasons.  First, it forces the object to
be implemented in a certain way (for example, the object cannot
contain any functions).  Second, it doesn't allow the programmer to
define his own notion of equality, or to change the notion if the
object's implementation changes.  Third, the MLton convention is for
structures to have an equals function :-).  This convention allows
different structures that implement equality functions in different
ways to be used in the same way by clients (either programmers or
functors).  So, stick with the equals function, even if it is
implemented by = underneath.

Also, the MLton convention is to call the single type defined by a
signature t.  This avoids repeated names, and also makes it easier to
use different structures in the same way (again, either by programmers
or functors).
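
To make the last two conventions concrete, here is a minimal sketch.
The names EQUALS, Codec, Endian and Member are hypothetical and exist
only for illustration; they are not part of MLton's library.  Codec.t
contains a function, so it could not be an eqtype, while Endian.equals
is just = underneath.  A single functor serves both.

```sml
(* Hypothetical signature: one type named t plus an equals function. *)
signature EQUALS =
   sig
      type t
      val equals: t * t -> bool
   end

(* t holds a function, so it cannot be an eqtype.
   Equality is defined on the name alone (an illustrative choice). *)
structure Codec =
   struct
      datatype t = T of {name: string, transform: char -> char}

      fun make (name, transform) = T {name = name, transform = transform}
      fun equals (T {name = n1, ...}, T {name = n2, ...}) = n1 = n2
   end

(* Here equals is simply = underneath. *)
structure Endian =
   struct
      datatype t = Big | Little

      fun equals (x: t, y: t) = x = y
   end

(* Written once against EQUALS, usable with either structure. *)
functor Member (S: EQUALS) =
   struct
      fun member (x: S.t, ys: S.t list): bool =
         List.exists (fn y => S.equals (x, y)) ys
   end

structure CodecMember = Member (Codec)
structure EndianMember = Member (Endian)
```

If Codec later changes its representation or its notion of equality,
clients of equals and of Member do not need to change.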

Another MLton convention is to use tupling, not currying.  Currying is
reserved for when the partial application does interesting

I didn't see the need for separate signatures, or for putting the
encode and decode functions in their own structure.  I also think the
fromChar and toChar functions belong in the (MLton extension of the)
WideString module.

Applying all these changes gives the following.

signature UNICODE =
   sig
      structure Encoding:
         sig
            type t

            exception Unsupported

            val equals: t * t -> bool
            val fromName: string -> t (* may raise Unsupported *)
            val toName: t -> string
            val utf16be: t
            val utf16le: t (* little-endian *)
            val utf32be: t
            val utf32le: t
            val utf8: t
         end

      exception Bad
      (* raises Bad if the stream is broken *)
      val decode: (Encoding.t * (Word8.word, 'a) reader
                   -> (WideChar.char, 'a) reader)
      val encode: Encoding.t * WideString.string -> Word8Vector.vector
   end
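
As a rough sketch of how the Encoding part might be implemented (this
is not existing MLton code), the structure below uses a datatype
underneath and hides it with an opaque signature, so clients can
compare encodings only through equals.  The name strings such as
"UTF-16BE" and the case-insensitive lookup in fromName are my
assumptions.

```sml
structure Encoding:>
   sig
      type t

      exception Unsupported

      val equals: t * t -> bool
      val fromName: string -> t (* may raise Unsupported *)
      val toName: t -> string
      val utf16be: t
      val utf16le: t
      val utf32be: t
      val utf32le: t
      val utf8: t
   end =
   struct
      datatype t = Utf16be | Utf16le | Utf32be | Utf32le | Utf8

      exception Unsupported

      (* Implemented by = underneath, but clients never see that. *)
      fun equals (e: t, e': t) = e = e'

      val utf16be = Utf16be
      val utf16le = Utf16le
      val utf32be = Utf32be
      val utf32le = Utf32le
      val utf8 = Utf8

      (* Assumed canonical names; any fixed spelling would do. *)
      fun toName e =
         case e of
            Utf16be => "UTF-16BE"
          | Utf16le => "UTF-16LE"
          | Utf32be => "UTF-32BE"
          | Utf32le => "UTF-32LE"
          | Utf8 => "UTF-8"

      val all = [Utf16be, Utf16le, Utf32be, Utf32le, Utf8]

      (* Case-insensitive lookup (an assumption, not a requirement). *)
      fun fromName s =
         let
            val s = String.map Char.toUpper s
         in
            case List.find (fn e => toName e = s) all of
               SOME e => e
             | NONE => raise Unsupported
         end
   end
```

Because t is abstract, adding a new encoding later means adding a new
value to the signature, which is exactly the documented growth
discussed above.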

Sorry to throw all these MLton conventions at you at once.  We haven't
yet gotten around to writing much of the style guide.

> In glibc and libiconv, a conversion can be loaded dynamically from a .so.
> It seems that I can't use the FFI on a function pointer, so dlsym is out.

Sure you can, thanks to some fairly recent efforts.  See


> I lean towards a register function in UNICODE_ENCODING like: 
> val register: string -> {
> 	encoder: WideString -> Word8Vector.vector,
> 	decoder: (Word8.word, 'a) reader -> (WideChar.char, 'a) reader)
> 	} -> unit

Unfortunately, this won't work due to SML's absence of first class
polymorphism (see http://mlton.org/FirstClassPolymorphism).  The 'a is
quantified at the outside of register's type, so each call passes a
decoder for one particular 'a, not a polymorphic decoder function like
you might think.
You need to eliminate the polymorphism before passing the decoder in.
Yet another good argument for using streams instead of readers.  You
might do something like the following.

      val register:
         string
         * {decode: Word8.word Stream.t -> WideChar.char Stream.t,
            encode: WideString.string -> Word8Vector.vector}
         -> unit

Where Stream: STREAM and

signature STREAM =
   sig
      type 'a t

      val dest: 'a t -> ('a t * 'a) option
      val new: (unit -> ('a t * 'a) option) -> 'a t
   end
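
To show that this hangs together, here is a small sketch of one
possible Stream and a registry built on it.  Everything beyond STREAM
itself is hypothetical: fromList, toList, decodeLatin1, the codec type,
the list-based registry and lookup are illustrative names, and plain
char and string stand in for WideChar.char and WideString.string so
that the sketch does not depend on an implementation providing them.

```sml
signature STREAM =
   sig
      type 'a t

      val dest: 'a t -> ('a t * 'a) option
      val new: (unit -> ('a t * 'a) option) -> 'a t
   end

(* One simple representation: a stream is a suspended step. *)
structure Stream:> STREAM =
   struct
      datatype 'a t = T of unit -> ('a t * 'a) option

      fun dest (T f) = f ()
      fun new f = T f
   end

fun fromList (xs: 'a list): 'a Stream.t =
   Stream.new (fn () =>
               case xs of
                  [] => NONE
                | x :: xs' => SOME (fromList xs', x))

fun toList (s: 'a Stream.t): 'a list =
   case Stream.dest s of
      NONE => []
    | SOME (s', x) => x :: toList s'

(* A monomorphic decoder: each byte becomes the char with that code. *)
fun decodeLatin1 (s: Word8.word Stream.t): char Stream.t =
   Stream.new (fn () =>
               case Stream.dest s of
                  NONE => NONE
                | SOME (s', b) => SOME (decodeLatin1 s', Byte.byteToChar b))

(* No type variables remain, so codecs can be stored in a ref cell. *)
type codec = {decode: Word8.word Stream.t -> char Stream.t,
              encode: string -> Word8Vector.vector}

val registry: (string * codec) list ref = ref []

fun register (name: string, c: codec): unit =
   registry := (name, c) :: !registry

fun lookup (name: string): codec option =
   Option.map (fn (_, c) => c)
              (List.find (fn (n, _) => n = name) (!registry))

val () = register ("latin1", {decode = decodeLatin1,
                              encode = Byte.stringToBytes})
```

The point of the sketch is only that the decode field has no free type
variable, which is what lets register store it for later use.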

But this might be going a bit far afield.  Let's see what others