[MLton] CharacterEncoding.Scheme
Wesley W. Terpstra
wesley at terpstra.ca
Mon Feb 12 08:07:46 PST 2007
On Feb 12, 2007, at 12:48 AM, Wesley W. Terpstra wrote:
> The problem I see with our old interface is that it's impossible to
> write using iconv. It's no problem to write these parsers in SML,
> but it would also be nice to benefit from the huge array of
> character encoding schemes already supported by iconv. The problem
> is that iconv maintains hidden state as it parses strings. For
> example, there could be a character '^U' that made all subsequent
> characters uppercase, and '^L' for lowercase. The SML functional
> parser is able to 'rewind' its state, which an iconv based
> converter cannot.
>
> I think it's clear that the SML interface is better, but I'm not
> sure how to reconcile this with the most popular free software
> implementation we could reuse.
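To make the 'rewind' point concrete: with a StringCvt-style functional
reader, backtracking just means reusing an earlier stream value, since
scanning never updates any hidden state. A toy example:

   (* Scan one integer and then, optionally, a second one.  If the
      second scan fails we simply keep using stream', the value we
      already hold; nothing has to be undone. *)
   fun scanOneOrTwo reader stream =
      case Int.scan StringCvt.DEC reader stream of
         NONE => NONE
       | SOME (a, stream') =>
            (case Int.scan StringCvt.DEC reader stream' of
                NONE => SOME ((a, NONE), stream')
              | SOME (b, stream'') => SOME ((a, SOME b), stream''))

An iconv-based converter cannot offer this, because iconv(3) mutates
the shift state stored inside its conversion descriptor.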
Thinking about it more, I believe we only need to provide two things:
1. Good UTF-8 / UTF-16 encoders that integrate well with SML.
2. Access to the system's encoder -- along with all the ugliness that
entails.
Towards #1, I think we want the usual scanner adaptation used in the
basis. To make the scanners convenient to use, we can add two helper
functions: one that converts broken characters to a given replacement
character (like #"?") and one that raises an exception.
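For example (using the names from the signature below), pulling
WideChars out of a byte reader with '?' substituted for corrupt
sequences would look something like:

   (* Sketch only: build a WideChar reader from a byte reader,
      replacing corrupt sequences with '?'. *)
   fun wideReader (byteReader: (Word8.word, 'a) StringCvt.reader) =
      let
         val chars = UTF.decode (UTF.decode8, byteReader)
      in
         UTF.safeScan (WideChar.chr (Char.ord #"?"), byteReader, chars)
      end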
Towards #2, we should just provide a facility to query the operating
system for the character encoding forms it supports, and to invoke
them. It's not possible to build a stateless SML scanner on top of
these, so we don't try.
Here's my proposed signature:
> signature CODING =
>    sig
>       (* There is never a problem with character repertoire here *)
>       structure UTF:
>          sig
>             type decoder
>             type encoder
>
>             val decode8: decoder
>             val encode8: encoder
>
>             val decode16: decoder
>             val encode16: encoder
>
>             (* Endian-specific UTF-16 *)
>             val decode16le: decoder
>             val encode16le: encoder
>             val decode16be: decoder
>             val encode16be: encoder
>
>             val decode: decoder * (Word8.word, 'a) reader
>                         -> (WideChar.char, 'a) reader
>             val encode: encoder * WideSubstring.substring
>                         -> Word8Vector.vector
>
>             (* Adapt the scanner to raise an exception on error *)
>             val throwScan: (Word8.word, 'a) reader * (WideChar.char, 'a) reader
>                            -> (WideChar.char, 'a) reader
>             (* Replace corrupt input characters with the provided character *)
>             val safeScan: WideChar.char * (Word8.word, 'a) reader
>                           * (WideChar.char, 'a) reader
>                           -> (WideChar.char, 'a) reader
>          end
>
>       (* Decoders provided by the system may only partially decode/encode *)
>       structure System:
>          sig
>             type decoder
>             type encoder
>
>             val getDecoder: string -> decoder option
>             val getEncoder: string -> encoder option
>
>             val encode: encoder * WideSubstring.substring
>                         -> WideSubstring.substring * Word8Vector.vector
>             val decode: decoder * Word8VectorSlice.slice
>                         -> Word8VectorSlice.slice * WideString.string
>          end
>    end
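For instance, decoding a buffer with the System structure might look
like this (the name "ISO-8859-1" is just an example; what's available
depends on the system's iconv, and the returned slice carries any
trailing bytes that couldn't be decoded yet):

   fun decodeLatin1 (bytes: Word8Vector.vector) =
      case System.getDecoder "ISO-8859-1" of
         NONE => raise Fail "no ISO-8859-1 decoder on this system"
       | SOME d =>
            let
               val (rest, s) = System.decode (d, Word8VectorSlice.full bytes)
            in
               (* With a stateless encoding like Latin-1, 'rest' will be
                  empty; a multibyte encoding could leave a partial
                  character behind. *)
               (s, rest)
            end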
Comments?
Should WidePrimIO really do 32-bit reads/writes? I'm tempted to make
it operate on UTF-8, even though the Basis says nothing about this.
Would that violate the standard in any way?
What's the easiest way to get a Word8.word scanner for stdIn?
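One way that seems to work is to take the functional stream underneath
TextIO.stdIn and convert each char with Byte.charToByte (assuming 8-bit
chars and no newline translation in the way), but maybe there's
something nicer:

   local
      structure S = TextIO.StreamIO
   in
      (* bytes : (Word8.word, S.instream) StringCvt.reader *)
      fun bytes stream =
         case S.input1 stream of
            NONE => NONE
          | SOME (c, stream') => SOME (Byte.charToByte c, stream')

      val stdInStream = TextIO.getInstream TextIO.stdIn
   end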
How tightly coupled to MLton may I make this? Can I add an 'iconv'
binding to the runtime? It will need a C wrapper, because it's not
possible to take a pointer to a vector at an offset within the SML FFI.
(Also, iconv is sometimes a macro.)
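For reference, the SML side could then be a thin _import of that
wrapper; something like the sketch below, where the C names
(MLton_iconv_open and friends) are entirely made up, the strings would
have to be null-terminated by hand, and the wrapper would do the
pointer arithmetic for the vector+offset itself:

   (* Hypothetical binding: these C functions don't exist yet, they are
      what the wrapper would export.  Passing the vector/array plus an
      integer offset lets the C side compute the pointer, since the SML
      FFI can't.  Keeping iconv_t behind MLton.Pointer.t also hides the
      fact that iconv is sometimes a macro. *)
   val cIconvOpen =
      _import "MLton_iconv_open": string * string -> MLton.Pointer.t;
   val cIconvClose =
      _import "MLton_iconv_close": MLton.Pointer.t -> unit;
   val cIconv =
      _import "MLton_iconv":
         MLton.Pointer.t                    (* conversion descriptor *)
         * Word8Vector.vector * int * int   (* input, offset, length *)
         * Word8Array.array * int * int     (* output, offset, length *)
         -> int;                            (* bytes produced, or ~1 *)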