[MLton] Unicode... again

Fri Feb 9 10:12:45 PST 2007

On Feb 9, 2007, at 5:56 PM, skaller wrote:
>>> Tradeoff between flexibility and safety here..
>>
>> I don't see this point.
>
> The point is people want to work with other charsets and
> encodings whether you like it or not, and for that purpose
> you can choose whether to let them use the string type,
> or force them to use a raw encoding like (Word8Vector?)

That's what I plan to do. If it's in a String, it is supposed to hold
only characters with code point < 256. If you want to work with some
encoded text, well, that's a blob. Blobs are Word8Vector.vector. We
will be providing a function similar to iconv that allows incremental
conversion of WideChar <-> Word8. If I recall correctly, we were
planning to model it on the ('a, 'b) reader types already used in SML.
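
To make the reader idea concrete, here is a rough sketch of what a
reader-based conversion could look like. It is not the planned API:
the names latin1 and drain are made up for illustration, and Latin-1
is used only because every byte maps to exactly one code point < 256.

```sml
(* Same shape as StringCvt.reader in the Basis Library:
   pull one 'a out of a stream state 'b, or NONE at end of input. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

(* Hypothetical decoder: lift a byte reader to a char reader.
   Latin-1 is assumed, so each byte becomes exactly one char. *)
fun latin1 (getByte : (Word8.word, 'b) reader) : (char, 'b) reader =
  fn s =>
    case getByte s of
      NONE => NONE
    | SOME (b, s') => SOME (Byte.byteToChar b, s')

(* Helper (hypothetical): run a reader to exhaustion, collecting items. *)
fun drain get s acc =
  case get s of
    NONE => rev acc
  | SOME (x, s') => drain get s' (x :: acc)

(* Word8VectorSlice.getItem is already a byte reader over a blob. *)
val blob = Byte.stringToBytes "caf\233"
val getChar = latin1 Word8VectorSlice.getItem
val text = implode (drain getChar (Word8VectorSlice.full blob) [])
```

The decoder never sees the whole blob at once; it only asks the
underlying reader for the next byte, which is what makes the
conversion incremental.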

> However if you want to read strings
> from files in UTF-8 format (at run time i mean)
> you can get errors, and if you want
> to do it FAST you cannot detect them: validating the input
> isn't really an option (it will cause a whole extra pass
> on the file which is WAY too expensive).

I don't see how this is any different from parsing any input in
general. You can parse it incrementally, and when an error occurs
your parser reports it. By using the SML ('a, 'b) reader
abstraction, you can just hook a UTF-8 (or whatever) decoder up to a
BinIO stream and feed that in turn to your parser. MLton will
inline all the abstraction anyway. :-)
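
A minimal sketch of that composition, under some loud assumptions:
utf8, collect and Malformed are hypothetical names, code points are
plain ints standing in for WideChar, and overlong forms and
surrogates are not rejected. An in-memory Word8Vector stands in for
the file; BinIO.StreamIO.input1 has the same reader shape and could
be passed to utf8 instead.

```sml
exception Malformed of string

(* Hypothetical UTF-8 decoder as a reader transformer:
   byte reader in, code point reader out. *)
fun utf8 (getByte : 'b -> (Word8.word * 'b) option) : 'b -> (int * 'b) option =
  let
    (* Consume n continuation bytes (10xxxxxx), 6 payload bits each. *)
    fun cont (0, acc, s) = SOME (acc, s)
      | cont (n, acc, s) =
          case getByte s of
            NONE => raise Malformed "truncated sequence"
          | SOME (b, s') =>
              if Word8.andb (b, 0wxC0) = 0wx80
              then cont (n - 1, acc * 64 + Word8.toInt (Word8.andb (b, 0wx3F)), s')
              else raise Malformed "bad continuation byte"
  in
    fn s =>
      case getByte s of
        NONE => NONE
      | SOME (b, s') =>
          if b < 0wx80 then SOME (Word8.toInt b, s')
          else if Word8.andb (b, 0wxE0) = 0wxC0
            then cont (1, Word8.toInt (Word8.andb (b, 0wx1F)), s')
          else if Word8.andb (b, 0wxF0) = 0wxE0
            then cont (2, Word8.toInt (Word8.andb (b, 0wx0F)), s')
          else if Word8.andb (b, 0wxF8) = 0wxF0
            then cont (3, Word8.toInt (Word8.andb (b, 0wx07)), s')
          else raise Malformed "bad lead byte"
  end

(* Toy "parser": collect code points until end of input. *)
fun collect get s =
  let
    fun loop (s, acc) =
      case get s of
        NONE => rev acc
      | SOME (c, s') => loop (s', c :: acc)
  in
    loop (s, [])
  end

(* "hi", U+00E9, U+20AC encoded as UTF-8. *)
val bytes = Word8Vector.fromList [0wx68, 0wx69, 0wxC3, 0wxA9, 0wxE2, 0wx82, 0wxAC]
val points = collect (utf8 Word8VectorSlice.getItem) (Word8VectorSlice.full bytes)

(* Input that ends in the middle of a two-byte sequence. *)
val bad = Word8Vector.fromList [0wx41, 0wxC3]
val result =
  (collect (utf8 Word8VectorSlice.getItem) (Word8VectorSlice.full bad); "ok")
  handle Malformed msg => msg
```

Here points comes out as [104, 105, 233, 8364], and result is
"truncated sequence": the error surfaces at the point where the
parser pulls the bad byte, with no separate validation pass over the
input.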

As an aside: I've made the width of WideChar (16 or 32 bits) a
compiler flag, similar to how the default int type is chosen.