[MLton] CharacterEncoding.Scheme
Wesley W. Terpstra
wesley at terpstra.ca
Sun Feb 11 15:48:42 PST 2007
To remind everyone where we were two years ago (!!), let me quote the
interface for encoding/decoding that Stephen posted:
> signature UNICODE =
>    sig
>       structure Encoding:
>          sig
>             type t
>
>             exception Unsupported
>
>             val equals: t * t -> bool
>             val fromName: string -> t (* may raise Unsupported *)
>             val toName: t -> string
>             val utf16be: t
>             val utf16le: t (* little-endian *)
>             val utf32be: t
>             val utf32le: t
>             val utf8: t
>          end
>
>       exception Bad
>       (* raises Bad if the stream is broken *)
>       val decode: (Encoding.t * (Word8.word, 'a) reader
>                    -> (WideChar.char, 'a) reader)
>       val encode: Encoding.t * WideString.string -> Word8Vector.vector
>    end
I'm starting to write this now, in a new i18n.mlb that is part of the
basis library but has to be included explicitly. The signature
and structure will be called CHARACTER_ENCODING and CharacterEncoding
(with substructure Scheme) respectively. After this, I'll think about
adding support for marking strings as translatable à la gettext, and
locale-specific character classes under CharacterClass. I'm not yet
sure how best to do gettext in SML without having __FILE__ and
__LINE__ macros as in C.
However, first the CharacterEncoding structure!
The problem I see with our old interface is that it's impossible to
implement on top of iconv. It's no problem to write these parsers in
SML, but it would also be nice to benefit from the huge array of
character encoding schemes iconv already supports. The trouble is that
iconv maintains hidden state as it parses strings. For example, an
encoding could have a character '^U' that makes all subsequent
characters uppercase, and '^L' for lowercase. A functional SML parser
can 'rewind' to an earlier stream state, which an iconv-based
converter cannot.
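To make the rewind issue concrete, here is a minimal sketch of the '^U'/'^L' toy encoding, written two ways. Everything in it (the names mode, apply, decodeFun, decodeHidden, and the encoding itself) is hypothetical and only for illustration. It works on plain chars with StringCvt.reader rather than Word8/WideChar, and the ref cell merely stands in for the kind of hidden state an iconv descriptor carries; no iconv binding is involved.

```sml
(* Toy encoding, assumed for illustration only:
   ^U switches to uppercase, ^L switches to lowercase. *)
datatype mode = Upper | Lower

fun apply Upper c = Char.toUpper c
  | apply Lower c = Char.toLower c

(* Functional decoder: the shift mode is part of the stream state,
   so any saved state can be resumed later with the same result. *)
fun decodeFun (rd : (char, 's) StringCvt.reader)
    : (char, mode * 's) StringCvt.reader =
   let
      fun next (m, s) =
         case rd s of
            NONE => NONE
          | SOME (#"\^U", s') => next (Upper, s')
          | SOME (#"\^L", s') => next (Lower, s')
          | SOME (c, s') => SOME (apply m c, (m, s'))
   in
      next
   end

(* Stand-in for an iconv-style decoder: the mode lives in a hidden ref,
   so the stream state alone no longer determines the next character. *)
fun decodeHidden (rd : (char, 's) StringCvt.reader)
    : (char, 's) StringCvt.reader =
   let
      val mode = ref Lower
      fun next s =
         case rd s of
            NONE => NONE
          | SOME (#"\^U", s') => (mode := Upper; next s')
          | SOME (#"\^L", s') => (mode := Lower; next s')
          | SOME (c, s') => SOME (apply (!mode) c, s')
   in
      next
   end

val input = Substring.full "a\^Ub"

(* Functional version: re-reading from the start state is repeatable. *)
val rdF = decodeFun Substring.getc
val f0 = (Lower, input)
val SOME (fa, f1) = rdF f0      (* fa = #"a" *)
val SOME (fb, _) = rdF f1       (* fb = #"b" uppercased, i.e. #"B" *)
val SOME (fa', _) = rdF f0      (* rewound: still #"a" *)

(* Hidden-state version: rewinding the stream does not rewind the mode. *)
val rdH = decodeHidden Substring.getc
val SOME (ha, h1) = rdH input   (* ha = #"a" *)
val SOME (hb, _) = rdH h1       (* hb = #"B"; the hidden mode is now Upper *)
val SOME (ha', _) = rdH input   (* rewound stream, stale mode: #"A" *)
```

Note that decodeFun only stays rewindable by widening the stream state to mode * 's, which is not the shape of the quoted decode signature (it keeps 'a unchanged), while decodeHidden has the quoted shape but breaks as soon as a caller resumes from an earlier state.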
I think it's clear that the SML interface is better, but I'm not sure
how to reconcile this with the most popular free software
implementation we could reuse.