[MLton] WideChar

Matthew Fluet fluet@cs.cornell.edu
Sat, 11 Dec 2004 17:23:24 -0500 (EST)


> On Fri, Dec 10, 2004 at 12:24:48PM -0800, Stephen Weeks wrote:
> > * When converting a char reader to a widechar reader, it is sometimes
> >   useful to raise an exception on encountering a widechar and
> >   sometimes useful to return NONE.  We should provide both types of
> >   converters.
>
> I thought about this a bit more after Henry's email.
> There are two converters and I think we were getting confused by them:
>
> 1. convert from an input source/'char' reader to WideChar reader by decoding
>    - in this case i think you probably want an exception since it means that
>      the input is corrupt
>    - conceivably we also want it to stop with a NONE here too, not sure
> 2. convert a WideChar reader back to a char reader for the purposes of using
>    Int.scan, Date.scan, etc. -- not encoding, simply converting types
>    - here you always want NONE; those scanners don't accept the string if
>      there are non-ASCII chars in the stream
>
> The case where I was talking about returning NONE was #2.
> The case where Henry talked about exceptions was #1.

I agree with Wesley that there are these two different sorts of
converters.  Also, converting a WideChar.char reader into a Char.char
reader (i.e., #2) should return NONE, as it corresponds to trying to scan
an ASCII character, which may or may not be at the head of the stream.
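
A minimal sketch of converter #2, under some assumptions: WideChar is an
optional Basis structure, so a code point held in an int stands in for
WideChar.char here, and the names wchar, narrow and listRdr are made up
for illustration.  The reader type has the same shape as StringCvt.reader.

```sml
(* Stand-in for WideChar.char: a code point as an int (an assumption). *)
type wchar = int

(* Same shape as StringCvt.reader. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

(* Converter #2: view a wide reader as a Char.char reader.
   Yields NONE when the next element is not ASCII. *)
fun narrow (rdr : (wchar, 's) reader) : (char, 's) reader =
   fn s =>
      case rdr s of
         NONE => NONE
       | SOME (w, s') =>
            if 0 <= w andalso w < 128
               then SOME (Char.chr w, s')
            else NONE

(* A reader over lists, used only to drive the example. *)
fun listRdr [] = NONE
  | listRdr (x :: xs) = SOME (x, xs)

(* "42" followed by a non-ASCII code point (955): the scan stops there. *)
val r1 = Int.scan StringCvt.DEC (narrow listRdr) [52, 50, 955]
(* r1 = SOME (42, [955]) *)

(* Non-ASCII code point at the head: nothing to scan. *)
val r2 = Int.scan StringCvt.DEC (narrow listRdr) [955, 52]
(* r2 = NONE *)
```

With NONE, the existing scanners simply see the end of the ASCII prefix,
and the caller still gets back the unconsumed wide stream.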

However, I'm less convinced that there is a place for a non-trivial
converter from a Char.char reader to a WideChar.char reader.  In fact,
here is another place where there are really two converters:
 1.1. The trivial converter that maps a Char.char to its corresponding
      WideChar.char (i.e., the ASCII embedding).  This will certainly be
      useful for applications/libraries that use WideChar by default,
      when a user wants to feed them a stream of Char.char.
 1.2. The decoding converter, that may aggregate input elements into a
      WideChar.char.

Note that if the Basis Library provided a LargeChar structure that
corresponded to the largest supported character type, then #1.1 is almost
certainly what one wants for a Char.char reader to LargeChar.char reader.
So, I think that is also a likely case for a Char.char reader to a
WideChar.char reader.
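
As a rough sketch, the 1.1 converter could look like the following.  The
names wchar and widen are hypothetical, and an int code point again
stands in for WideChar.char.  Note that Char.ord maps every Char.char,
not only the ASCII ones, to its code.

```sml
(* Stand-in for WideChar.char: a code point as an int (an assumption). *)
type wchar = int

(* Same shape as StringCvt.reader. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

(* Converter 1.1: one Char.char in, one wide character out.
   It never raises, and yields NONE only when the underlying
   reader is exhausted. *)
fun widen (rdr : (char, 's) reader) : (wchar, 's) reader =
   fn s => Option.map (fn (c, s') => (Char.ord c, s')) (rdr s)

(* Reading from a substring: the first element is the code of #"h". *)
val r = widen Substring.getc (Substring.full "hi")
```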

I would actually argue that one rarely wants the 1.2 converter on
Char.char.  By convention, a Char.char is an "interpreted" 8-bit value.
I would argue that what one really wants is a converter from a Word8.word
reader to a WideChar.char reader.  A Word8.word is an "uninterpreted"
8-bit value.  When one wants to recover a 1.2 style decoding converter
from a Char.char reader, the reader should first be sent through the Byte
structure, which explicitly relinquishes the bit interpretation.  (For
the most part, this will be a no-op for MLton.)
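
To make the Word8.word route concrete, here is a sketch under stated
assumptions.  The encoding is a toy one (two bytes per character,
big-endian) chosen only to show aggregation; it is not a proposal for
the real decoder.  The names wchar, bytes, decode2 and Malformed are
hypothetical.  Byte.charToByte is the Basis function that drops the
character interpretation.

```sml
(* Stand-in for WideChar.char: a code point as an int (an assumption). *)
type wchar = int

(* Same shape as StringCvt.reader. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

exception Malformed

(* Char.char reader -> Word8.word reader, via the Byte structure. *)
fun bytes (rdr : (char, 's) reader) : (Word8.word, 's) reader =
   fn s => Option.map (fn (c, s') => (Byte.charToByte c, s')) (rdr s)

(* 1.2-style decoder for the toy encoding: aggregates two input
   elements into one wide character. *)
fun decode2 (rdr : (Word8.word, 's) reader) : (wchar, 's) reader =
   fn s =>
      case rdr s of
         NONE => NONE
       | SOME (hi, s') =>
            (case rdr s' of
                NONE => raise Malformed   (* truncated pair *)
              | SOME (lo, s'') =>
                   SOME (Word8.toInt hi * 256 + Word8.toInt lo, s''))

(* Bytes 3 and 187 decode to code point 3 * 256 + 187 = 955. *)
val r = decode2 (bytes Substring.getc) (Substring.full "\003\187")
```

Here the truncated case raises; returning NONE instead would be a
one-line change, which is exactly the choice under discussion.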

Now, it is for the 1.2 style converter that one might want NONE or
exception-raising semantics.  A 1.1 style converter, by contrast, would
never raise an exception.

The problem with both NONE and exceptions for 1.2 style converters is that
the invalidity of the input stream is not discovered until sufficient
input is read; i.e., not at the point where the conversion is applied.
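
A small illustration of that point, using a toy two-bytes-per-character
decoder (decode2, Malformed and listRdr are hypothetical names, and an
int stands in for WideChar.char).  Applying the converter succeeds; the
bad input only shows up once reading reaches it.

```sml
(* Same shape as StringCvt.reader. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

exception Malformed

(* Toy 1.2-style decoder: pairs of bytes, big-endian. *)
fun decode2 (rdr : (Word8.word, 's) reader) : (int, 's) reader =
   fn s =>
      case rdr s of
         NONE => NONE
       | SOME (hi, s') =>
            (case rdr s' of
                NONE => raise Malformed
              | SOME (lo, s'') =>
                   SOME (Word8.toInt hi * 256 + Word8.toInt lo, s''))

fun listRdr [] = NONE
  | listRdr (x :: xs) = SOME (x, xs)

(* Three bytes: one complete pair, then a dangling byte. *)
val input = [0w3, 0w187, 0w65] : Word8.word list

(* Applying the converter reads nothing, so it cannot fail here. *)
val wide = decode2 listRdr

(* The first read succeeds and yields 955. *)
val (first, rest) =
   case wide input of
      SOME p => p
    | NONE => raise Malformed

(* Only the second read discovers the truncated pair. *)
val second = (wide rest; "ok") handle Malformed => "malformed"
(* second = "malformed" *)
```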