[MLton] WideChar

Wesley W. Terpstra terpstra@gkec.tu-darmstadt.de
Thu, 16 Dec 2004 05:49:29 +0100


On Mon, Dec 13, 2004 at 10:06:17AM -0800, Stephen Weeks wrote:
> signature ENCODING =
>    sig
>       type t
> 
>       val equals: t * t -> bool
>       val foo: t
>       val fromString: string -> t
>       val toString: t -> string
>       val utf8: t
>       val utf16: t
>    end

I don't really like this.
It just seems over-complicated. =(

> Now, we get the type-system support of known encodings guaranteeing
> agreement on encoding name.

OTOH, I suppose this is a benefit.
It is nice to have support for a hard-coded decoder checked at compile-time.

What I don't like about your ENCODING signature is that it is a growing
interface. One of the rules I've had beaten into me over years of C++
development is that abstract base classes must never add new methods.
OTOH, the nature of encodings being added IS that the interface grows.

Sticking with your api and combining the suggestions I've heard:

signature UNICODE_ENCODING =
  sig
    eqtype encoding (* why define equals? just use an eqtype *)
    exception Unsupported
    exception Bad
    
    val fromName: string -> encoding (* may raise Unsupported *)
    val toName: encoding -> string
    
    val utf8:    encoding
    val utf16le: encoding (* little-endian *)
    val utf16be: encoding
    val utf32le: encoding
    val utf32be: encoding
  end

signature UNICODE_DECODER =
  sig
    type encoding
    (* raises Encoding.Bad if the stream is broken *)
    val get: encoding -> (Word8.word, 'a) reader -> (WideChar.char, 'a) reader
    
    (* returns NONE if the input WideChar is outside ISO-8859-1 *)
    val toChar: (WideChar.char, 'a) reader -> (Char.char, 'a) reader
    (* always succeeds *)
    val fromChar: (Char.char, 'a) reader -> (WideChar.char, 'a) reader
  end

signature UNICODE_ENCODER =
  sig
    type encoding
    val get: encoding -> WideString.string -> Word8Vector.vector
    
    (* raises Chr if a character would overflow *)
    val toChar: WideString.string -> string
    (* always succeeds *)
    val fromChar: string -> WideString.string
  end

signature UNICODE =
  sig
    structure Encoding : UNICODE_ENCODING
    structure Decoder : UNICODE_DECODER where type encoding = Encoding.encoding
    structure Encoder : UNICODE_ENCODER where type encoding = Encoding.encoding
  end
structure Unicode : UNICODE (* defined at top-level by unicode.mlb *)

> [ keep it portable ]

Well, I intend to assume that WideChar exists and can hold at least 21 bits.
Aside from that I won't make any other assumptions about the SML compiler.

-----------------------

One missing piece is support for user-extensions.

In glibc and libiconv, a conversion can be loaded dynamically from a .so.
It seems that I can't use the FFI on a function pointer, so dlsym is out.

I lean towards a register function in UNICODE_ENCODING like: 

val register: string -> {
	encoder: WideString.string -> Word8Vector.vector,
	decoder: (Word8.word, 'a) reader -> (WideChar.char, 'a) reader
	} -> unit

Another detail: some of these encoders and decoders will need to keep some
state. For example, an encoder might emit a header at the start of its
output, and the matching decoder would have to read that header first.

I'm not sure how best to do this since a lazy scanner might 'rescan' the
header repeatedly.
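
One possible way around this, as a rough sketch only: carry the decoder
state inside the stream value instead of in a ref, so that rescanning from
an old stream position just recomputes the same answer. Everything named
here (endian, state, start, get16, listReader) is hypothetical, int stands
in for WideChar.char, surrogate pairs are not combined, and the big-endian
default without a BOM is my assumption. Note that the stream type becomes
'a state rather than 'a, so this does not match UNICODE_DECODER as written.

```sml
(* Hypothetical sketch: header state travels with the stream, no refs. *)
type ('a, 'b) reader = ('a, 'b) StringCvt.reader

datatype endian = Big | Little
(* NONE = header not looked at yet; SOME e = header already handled. *)
type 'a state = endian option * 'a

fun start s : 'a state = (NONE, s)

(* Simple byte reader over a list, for demonstration only. *)
fun listReader [] = NONE
  | listReader (x :: xs) = SOME (x, xs)

(* Decode 16-bit code units; int stands in for WideChar.char. *)
fun get16 (rdr : (Word8.word, 'a) reader) : (int, 'a state) reader =
  let
    (* Read two bytes, or NONE if fewer than two remain. *)
    fun pair s =
      case rdr s of
        NONE => NONE
      | SOME (a, s1) =>
          (case rdr s1 of
             NONE => NONE
           | SOME (b, s2) => SOME (Word8.toInt a, Word8.toInt b, s2))
    fun code (Big, s) =
          Option.map (fn (a, b, s') => (a * 256 + b, s')) (pair s)
      | code (Little, s) =
          Option.map (fn (a, b, s') => (b * 256 + a, s')) (pair s)
    fun step (SOME e, s) =
          Option.map (fn (c, s') => (c, (SOME e, s'))) (code (e, s))
      | step (NONE, s) =
          (case pair s of
             SOME (0xFE, 0xFF, s') => step (SOME Big, s')
           | SOME (0xFF, 0xFE, s') => step (SOME Little, s')
           | _ => step (SOME Big, s))  (* no BOM: assume big-endian *)
  in
    step
  end

val bytes : Word8.word list = [0wxFF, 0wxFE, 0wx41, 0wx00, 0wx42, 0wx00]
val rd = get16 listReader
val first = rd (start bytes)
```

Calling rd on (start bytes) twice reparses the header both times but yields
the same character, and continuing from the returned state never touches
the header again.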

Matthew observed:
> I'm simply noting the software engineering issue of recovering from bad
> encodings.  Because the conversion is applied lazily, you need to write
> recovery code for every use of the stream.  This recovery can either be
> explicit (checking for NONE and/or handling an exception at the use) or
> implicit (handling an exception around the whole use of the stream).

I agree that the user has to wrap their reader function in a test for the
Unicode.Encoding.Bad exception. However, I don't see any other way to deal
with this. It's much like an IO error, IMO.
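
To make that concrete, here is a minimal sketch of such a wrapper. It uses
stand-ins rather than the proposed Unicode structure: a local exception Bad
in place of Unicode.Encoding.Bad, int in place of WideChar.char, and a toy
7-bit decoder. The names asciiGet, listReader, result and drain are all
made up for illustration.

```sml
(* Stand-ins only; none of this is the proposed Unicode structure. *)
exception Bad
type ('a, 'b) reader = ('a, 'b) StringCvt.reader

(* Toy lazy decoder: passes 7-bit bytes through, raises Bad otherwise. *)
fun asciiGet (rdr : (Word8.word, 'a) reader) : (int, 'a) reader =
  fn s =>
    case rdr s of
      NONE => NONE
    | SOME (b, s') =>
        if b < 0wx80 then SOME (Word8.toInt b, s') else raise Bad

(* Simple byte reader over a list, for demonstration only. *)
fun listReader [] = NONE
  | listReader (x :: xs) = SOME (x, xs)

(* Ok = stream ended cleanly; Broken = Bad was raised part-way through. *)
datatype 'a result = Ok of 'a | Broken of 'a

(* Consume the whole stream, testing for Bad at every read. *)
fun drain rdr s =
  let
    fun loop (acc, s) =
      case (SOME (rdr s) handle Bad => NONE) of
        NONE => Broken (rev acc)
      | SOME NONE => Ok (rev acc)
      | SOME (SOME (c, s')) => loop (c :: acc, s')
  in
    loop ([], s)
  end

val goodBytes : Word8.word list = [0wx48, 0wx69]
val badBytes : Word8.word list = [0wx48, 0wxFF, 0wx69]
val good = drain (asciiGet listReader) goodBytes
val bad = drain (asciiGet listReader) badBytes
```

Because the decoder is lazy, the exception only shows up at the read that
hits the bad byte, which is why the handler sits inside the loop rather
than around the construction of the reader.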

-- 
Wesley W. Terpstra <wesley@terpstra.ca>