[MLton] CharacterEncoding.Scheme
Wesley W. Terpstra
wesley at terpstra.ca
Mon Feb 12 08:07:46 PST 2007
On Feb 12, 2007, at 12:48 AM, Wesley W. Terpstra wrote:
> The problem I see with our old interface is that it's impossible to
> write using iconv. It's no problem to write these parsers in SML,
> but it would also be nice to benefit from the huge array of
> character encoding schemes already supported by iconv. The problem
> is that iconv maintains hidden state as it parses strings. For
> example, there could be a character '^U' that made all subsequent
> characters uppercase, and '^L' for lowercase. The SML functional
> parser is able to 'rewind' its state, which an iconv based
> converter cannot.
>
> I think it's clear that the SML interface is better, but I'm not
> sure how to reconcile this with the most popular free software
> implementation we could reuse.
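To make the 'rewind' point concrete: with a StringCvt-style functional
reader, backtracking just means reusing an earlier stream value, since
scanning never updates any hidden state. A toy example:

   (* Scan one integer and then, optionally, a second one.  If the
      second scan fails we simply keep using stream', the value we
      already hold; nothing has to be undone. *)
   fun scanOneOrTwo reader stream =
      case Int.scan StringCvt.DEC reader stream of
         NONE => NONE
       | SOME (a, stream') =>
            (case Int.scan StringCvt.DEC reader stream' of
                NONE => SOME ((a, NONE), stream')
              | SOME (b, stream'') => SOME ((a, SOME b), stream''))

An iconv-based converter cannot offer this, because iconv(3) mutates
the shift state stored inside its conversion descriptor.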
Thinking about it more, I believe we only need to provide two things:
1. Good UTF-8 / UTF-16 encoders that integrate well with SML.
2. Access to the system's encoder -- along with all the ugliness that
entails.
Towards #1, I think we want the usual scanner adaptation used in the
basis. To make the scanners convenient to use, we can add two helper
functions: one that converts broken characters to a given replacement
character (like #"?") and one that raises an exception.
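For example (using the names from the signature below), pulling
WideChars out of a byte reader with '?' substituted for corrupt
sequences would look something like:

   (* Sketch only: build a WideChar reader from a byte reader,
      replacing corrupt sequences with '?'. *)
   fun wideReader (byteReader: (Word8.word, 'a) StringCvt.reader) =
      let
         val chars = UTF.decode (UTF.decode8, byteReader)
      in
         UTF.safeScan (WideChar.chr (Char.ord #"?"), byteReader, chars)
      end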
Towards #2, we should just provide a facility to query the operating
system for the character encoding forms it supports, and to invoke
them. It's not possible to build a stateless SML scanner on top of
these, so we don't try.
Here's my proposed signature:
> signature CODING =
>    sig
>       (* There is never a problem with character repertoire here *)
>       structure UTF:
>          sig
>             type decoder
>             type encoder
>
>             val decode8: decoder
>             val encode8: encoder
>
>             val decode16: decoder
>             val encode16: encoder
>
>             (* Endian-specific UTF-16 *)
>             val decode16le: decoder
>             val encode16le: encoder
>             val decode16be: decoder
>             val encode16be: encoder
>
>             val decode: decoder * (Word8.word, 'a) reader
>                         -> (WideChar.char, 'a) reader
>             val encode: encoder * WideSubstring.substring
>                         -> Word8Vector.vector
>
>             (* Adapt the scanner to raise an exception on error *)
>             val throwScan: (Word8.word, 'a) reader * (WideChar.char, 'a) reader
>                            -> (WideChar.char, 'a) reader
>             (* Replace corrupt input characters with the provided character *)
>             val safeScan: WideChar.char * (Word8.word, 'a) reader
>                           * (WideChar.char, 'a) reader
>                           -> (WideChar.char, 'a) reader
>          end
>
>       (* Decoders provided by the system may only partially decode/encode *)
>       structure System:
>          sig
>             type decoder
>             type encoder
>
>             val getDecoder: string -> decoder option
>             val getEncoder: string -> encoder option
>
>             val encode: encoder * WideSubstring.substring
>                         -> WideSubstring.substring * Word8Vector.vector
>             val decode: decoder * Word8VectorSlice.slice
>                         -> Word8VectorSlice.slice * WideString.string
>          end
>    end
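For instance, decoding a buffer with the System structure might look
like this (the name "ISO-8859-1" is just an example; what's available
depends on the system's iconv, and the returned slice carries any
trailing bytes that couldn't be decoded yet):

   fun decodeLatin1 (bytes: Word8Vector.vector) =
      case System.getDecoder "ISO-8859-1" of
         NONE => raise Fail "no ISO-8859-1 decoder on this system"
       | SOME d =>
            let
               val (rest, s) = System.decode (d, Word8VectorSlice.full bytes)
            in
               (* With a stateless encoding like Latin-1, 'rest' will be
                  empty; a multibyte encoding could leave a partial
                  character behind. *)
               (s, rest)
            end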
Comments?
Should WidePrimIO really do 32-bit reads/writes? I'm tempted to make
it operate on UTF-8, even though the Basis says nothing about this.
Would that violate the standard in any way?
What's the easiest way to get a Word8.word scanner for stdIn?
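One way that seems to work is to take the functional stream underneath
TextIO.stdIn and convert each char with Byte.charToByte (assuming 8-bit
chars and no newline translation in the way), but maybe there's
something nicer:

   local
      structure S = TextIO.StreamIO
   in
      (* bytes : (Word8.word, S.instream) StringCvt.reader *)
      fun bytes stream =
         case S.input1 stream of
            NONE => NONE
          | SOME (c, stream') => SOME (Byte.charToByte c, stream')

      val stdInStream = TextIO.getInstream TextIO.stdIn
   end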
How tightly coupled to MLton may I make this? Can I add an 'iconv'
binding to the runtime? It will need a C wrapper, because it's not
possible to take a pointer to a vector at an offset within the SML FFI.
(Also, iconv is sometimes a macro.)
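For reference, the SML side could then be a thin _import of that
wrapper; something like the sketch below, where the C names
(MLton_iconv_open and friends) are entirely made up, the strings would
have to be null-terminated by hand, and the wrapper would do the
pointer arithmetic for the vector+offset itself:

   (* Hypothetical binding: these C functions don't exist yet, they are
      what the wrapper would export.  Passing the vector/array plus an
      integer offset lets the C side compute the pointer, since the SML
      FFI can't.  Keeping iconv_t behind MLton.Pointer.t also hides the
      fact that iconv is sometimes a macro. *)
   val cIconvOpen =
      _import "MLton_iconv_open": string * string -> MLton.Pointer.t;
   val cIconvClose =
      _import "MLton_iconv_close": MLton.Pointer.t -> unit;
   val cIconv =
      _import "MLton_iconv":
         MLton.Pointer.t                    (* conversion descriptor *)
         * Word8Vector.vector * int * int   (* input, offset, length *)
         * Word8Array.array * int * int     (* output, offset, length *)
         -> int;                            (* bytes produced, or ~1 *)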