[MLton] CharacterEncoding.Scheme

Sun Feb 11 15:48:42 PST 2007

To remind everyone where we were two years ago (!!) let me quote the  
interface for encoding/decoding Stephen posted:
> signature UNICODE =
> sig
> structure Encoding:
>      sig
>      type t
>
>      exception Unsupported
>
>      val equals: t * t -> bool
>      val fromName: string -> t (* may raise Unsupported *)
>      val toName: t -> string
>      val utf16be: t
>      val utf16le: t (* little-endian *)
>      val utf32be: t
>      val utf32le: t
>      val utf8: t
>      end
>
> exception Bad
> (* raises Bad if the stream is broken *)
> val decode: (Encoding.t * (Word8.word, 'a) reader
>          -> (WideChar.char, 'a) reader)
> val encode: Encoding.t * WideString.string -> Word8Vector.vector
> end

I'm starting to write this now, in a new i18n.mlb that is part of the
basis library but has to be included explicitly. The signature
and structure will be called CHARACTER_ENCODING and CharacterEncoding
(with substructure Scheme) respectively. After this, I'll think about
adding support for marking strings as translatable à la gettext, and
locale-specific character classes under CharacterClass. I'm not yet
sure how best to do gettext in SML without having __FILE__ and
__LINE__ macros à la C.

However, first the CharacterEncoding structure!

The problem I see with our old interface is that it's impossible to
implement on top of iconv. It's no problem to write these parsers in
SML, but it would also be nice to benefit from the huge array of
character encoding schemes already supported by iconv. The catch is
that iconv maintains hidden state as it converts a string. For example,
an encoding could have a character '^U' that makes all subsequent
characters uppercase, and '^L' for lowercase. A functional SML reader
takes a stream state and returns the rest of the stream, so a caller
can keep an earlier state and 'rewind' to it; an iconv-based converter
cannot.
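
To make the rewinding point concrete, here is a minimal sketch of the
toy '^U'/'^L' encoding written as a functional reader. It is only an
illustration: the names shift, decode, read, drain and skip are made up
for this example, I am assuming '^U' and '^L' mean the control
characters, and the resulting stream type is shift * 'a rather than the
plain 'a of the quoted signature, because the shift mode has to travel
with the stream position.

```sml
(* Toy shift-state encoding: #"\^U" switches to upper case,
   #"\^L" switches back to lower case. Hypothetical names throughout. *)
datatype shift = Upper | Lower

(* The decoder pairs the shift mode with the underlying stream, so
   every position a caller holds on to remembers its own mode. *)
fun decode (getc : (char, 'a) StringCvt.reader)
   : (char, shift * 'a) StringCvt.reader =
   let
      fun next (mode, s) =
         case getc s of
            NONE => NONE
          | SOME (#"\^U", s') => next (Upper, s')
          | SOME (#"\^L", s') => next (Lower, s')
          | SOME (c, s') =>
               let
                  val c' = case mode of
                              Upper => Char.toUpper c
                            | Lower => Char.toLower c
               in
                  SOME (c', (mode, s'))
               end
   in
      next
   end

val read = decode Substring.getc
val start = (Lower, Substring.full "ab\^Ucd\^Lef")

(* Read everything from a given position. *)
fun drain st =
   case read st of
      NONE => []
    | SOME (c, st') => c :: drain st'

(* Advance by n decoded characters and return the new position. *)
fun skip (0, st) = st
  | skip (n, st) =
       case read st of
          NONE => st
        | SOME (_, st') => skip (n - 1, st')

val whole = implode (drain start)   (* "abCDef" *)
val mid = skip (3, start)           (* after "abC", mode is Upper *)
val rest1 = implode (drain mid)     (* "Def" *)
val rest2 = implode (drain mid)     (* "Def" again: mid is just a value *)

val () = print (whole ^ " " ^ rest1 ^ " " ^ rest2 ^ "\n")
```

Reading from mid twice gives the same answer because the mode is part
of the value. A converter that keeps the mode in a hidden mutable
descriptor, as iconv does, has nothing comparable to hand back to the
caller.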

I think it's clear that the SML interface is better, but I'm not sure  
how to reconcile this with the most popular free software  
implementation we could reuse.