[MLton] WideChar

Wesley W. Terpstra terpstra@gkec.tu-darmstadt.de
Mon, 13 Dec 2004 01:09:59 +0100


On Sat, Dec 11, 2004 at 05:23:24PM -0500, Matthew Fluet wrote:
>  1.1. The trivial converter that maps a Char.char to its corresponding
>       WideChar.char (i.e., the ASCII embedding).  This will certainly be
>       useful for applications/libraries that use WideChar by default, but
>       a user wants to feed it a stream of Char.char.
>  1.2. The decoding converter, that may aggregate input elements into a
>       WideChar.char.
>
> I would argue that what one really wants is a Word8.word reader to a
> WideChar.char reader.  A Word8.word is an "uninterpreted" 8-bit value.  When
> one wants to recover a 1.2 style decoding converter from a Char.char
> reader, it should first be sent through the Byte structure, which
> explicitly relinquishes the bit interpretation.  (For the most part, this
> will be a nop for MLton.)

I completely agree with all your points above.

So, a user who wants a specific encoding should use BinIO and attach that to
a Unicode decoder; it makes perfect sense. It also neatly sidesteps problems
with end-of-line conversions. If a person has some two-byte-wide encoding
and the byte pair CR LF appears, it would be a disaster if on Windows it
were changed to LF.

This is an issue that didn't even occur to me until you pointed out the two
cases inside #1. So, I think you're right and that in fact we _must_ do it
your way.
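
As a rough sketch of the Word8.word reader to character reader idea (nothing
here exists in MLton or the Basis Library; the names decodeUtf8, listRd and
drain are made up, and code points are plain words because I am assuming no
WideChar structure is available):

```sml
(* Hypothetical 1.2 style converter: turn a byte reader into a code point
   reader for UTF-8.  Simplified: overlong forms and surrogates are not
   rejected, and NONE means either end of input or malformed input. *)
fun decodeUtf8 (getByte : (Word8.word, 's) StringCvt.reader)
    : (word, 's) StringCvt.reader =
  let
    fun toW b = Word.fromInt (Word8.toInt b)
    (* Read n continuation bytes (10xxxxxx), taking 6 payload bits each. *)
    fun cont (0, acc, s) = SOME (acc, s)
      | cont (n, acc, s) =
          case getByte s of
            NONE => NONE
          | SOME (b, s') =>
              let
                val w = toW b
              in
                if Word.andb (w, 0wxC0) = 0wx80
                then cont (n - 1,
                           Word.orb (Word.<< (acc, 0w6), Word.andb (w, 0wx3F)),
                           s')
                else NONE
              end
  in
    fn s =>
      case getByte s of
        NONE => NONE
      | SOME (b, s') =>
          let
            val w = toW b
          in
            if w < 0wx80 then SOME (w, s')
            else if Word.andb (w, 0wxE0) = 0wxC0
              then cont (1, Word.andb (w, 0wx1F), s')
            else if Word.andb (w, 0wxF0) = 0wxE0
              then cont (2, Word.andb (w, 0wx0F), s')
            else if Word.andb (w, 0wxF8) = 0wxF0
              then cont (3, Word.andb (w, 0wx07), s')
            else NONE
          end
  end

(* A byte list stands in for a BinIO stream in this sketch. *)
fun listRd [] = NONE
  | listRd (x :: xs) = SOME (x, xs)

(* Pull items from a reader until it returns NONE. *)
fun drain rd s =
  case rd s of
    NONE => []
  | SOME (x, s') => x :: drain rd s'

(* "A", U+00E9, then CR LF; the CR and LF bytes reach the decoder intact. *)
val cps = drain (decodeUtf8 listRd) [0wx41, 0wxC3, 0wxA9, 0wx0D, 0wx0A]
```

Because the decoder only ever sees raw bytes, no text-mode translation can
get in between the file and the decoder.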

> Now, it is for the 1.2 style converter that one might want NONE or
> exception-raising semantics.  But, a 1.1 style converter would never raise
> an exception.

Again, completely on the money. :)
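
To make the contrast concrete, a 1.1 style converter could look roughly like
the following. This is only a sketch: embed and listRd are hypothetical
names, and the wide character is again modelled as a plain word since I am
assuming no WideChar structure to target.

```sml
(* Hypothetical 1.1 style converter: widen each Char.char to its ordinal.
   It consumes exactly one input element per output element, so the only
   NONE it can produce is the end of the underlying stream. *)
fun embed (getc : (char, 's) StringCvt.reader) : (word, 's) StringCvt.reader =
  fn s => Option.map (fn (c, s') => (Word.fromInt (Char.ord c), s')) (getc s)

(* A char list stands in for the input stream. *)
fun listRd [] = NONE
  | listRd (x :: xs) = SOME (x, xs)

val r = embed listRd (String.explode "A\r\n")
```

There is simply no input on which this can fail, so neither NONE-on-error
nor an exception is ever needed for it.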

> The problem with both NONE and exceptions for 1.2 style converters is that
> the invalidity of the input stream is not discovered until sufficient
> input is read; i.e., not at the point where the conversion is applied.

I read your later comments to Stephen, but didn't really understand them.

Sure, the conversion is applied later, when you actually input from the
stream, but why is that a problem? Getting a 'NONE' means some conversion
somewhere in the conversion chain failed, not necessarily the very last one.

I don't think users want us to magically check that an input stream is valid
UTF-8 (or whatever) before we start converting. We should just convert as
the stream is read.

Are you suggesting that a user might want to notice that the stream isn't
UTF-8 and then try rescanning as ISO-8859-1 from the very beginning of the
file? One should never need to guess the encoding; either it's right or the
input stream is broken. At most you might want to be able to try and skip
past the broken section.
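
Skipping past a broken section could be sketched like this for UTF-8. The
name resync is made up, and the policy (drop the offending byte, then drop
any continuation bytes that follow it) is just one plausible choice, not
something anyone has agreed on:

```sml
(* Hypothetical recovery step: given the stream state at which decoding
   failed, return a state positioned at the next byte that could start a
   character (or at end of input). *)
fun resync (getByte : (Word8.word, 's) StringCvt.reader) (s : 's) : 's =
  let
    (* UTF-8 continuation bytes have the form 10xxxxxx. *)
    fun isCont b = Word8.andb (b, 0wxC0) = 0wx80
    fun skipConts s =
      case getByte s of
        SOME (b, s') => if isCont b then skipConts s' else s
      | NONE => s
  in
    case getByte s of
      NONE => s
    | SOME (_, s') => skipConts s'   (* always drop the offending byte *)
  end

(* A byte list stands in for the input stream. *)
fun listRd [] = NONE
  | listRd (x :: xs) = SOME (x, xs)

(* 0xFF can never appear in UTF-8; it and the stray continuation bytes
   after it are dropped, leaving the stream at "AB". *)
val rest = resync listRd [0wxFF, 0wx80, 0wx80, 0wx41, 0wx42]
```

The caller would invoke this when the decoder reports a failure and then
carry on decoding from the returned state.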

-- 
Wesley W. Terpstra <wesley@terpstra.ca>