[MLton] WideChar?

Stephen Weeks MLton@mlton.org
Wed, 8 Dec 2004 13:27:10 -0800


> What I would see as the 'ideal' SML solution would be:
> 
> There are Char : CHAR, UCS2 : CHAR, and WideChar = UCS4 : CHAR structures.
> You would use a LargeChar to pick the largest available Char type.
> UCS2String, LargeString, should also exist.

Seems fine to me.

> The values #"a" and "dfgdfsg" should be polymorphic just like 5.

Yes.  BTW, the SML term in this situation is "overloaded", not
"polymorphic".

> It should be made explicit that Char is ISO-8859-1 (so Char.ord is Unicode).

In what way is this not explicit now?  Is this just a documentation
issue or is there some code that needs to change?

> It should be made explicit that SML source code files are by default UTF-8 
>  (to permit Unicode characters inside strings).

It seems to me like this is not allowed by the Definition.  Here's the
relevant part of Section 2.2, which discusses constants.

  We assume an underlying alphabet of N characters (N >= 256),
  numbered 0 to N - 1, which agrees with the ASCII character set on
  the characters numbered 0 to 127.  The interval [0, N-1] is called
  the ordinal range of the alphabet.  A string constant is a sequence,
  between quotes ("), of zero or more printable characters (i.e.,
  numbered 33-126), spaces, or escape sequences.  Each escape sequence
  starts with the escape character \, and stands for a character
  sequence.  The escape sequences are:
  ...
    \uxxxx The single character with number xxxx (4 hexadecimal digits
           denoting an integer in the ordinal range of the alphabet)

So, there is already a mechanism that permits Unicode characters
inside strings (and hence chars).  All of the SML compilers except for
SML/NJ support \u escape sequences, although no SML compiler supports
characters with ordinal greater than 255 (so that \u is not very
useful yet).  I suppose there is some wiggle room in the phrase
"printable characters", which is specifically defined relative to
ASCII but could be interpreted to vary depending on the underlying
alphabet.

In any case, it seems like the most portable route to go at first
would be to keep string constants as is, and require \u escape
sequences for Unicode characters.  Once we support characters beyond
255, we're not portable to other SML implementations anyway, so I don't
object to making character and string constants UTF-8.
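
To make the escape mechanism concrete, here is a minimal sketch.  It
assumes a compiler that accepts \u escapes and has 8-bit chars (so only
ordinals up to 255 are usable); the value names are made up for
illustration.

```sml
(* The source text itself stays within printable ASCII; the one
   non-ASCII character (e-acute, ordinal 233 = 0xE9) is written only
   through an escape sequence. *)
val viaHex     = "caf\u00E9"   (* \uxxxx: four hexadecimal digits *)
val viaDecimal = "caf\233"     (* \ddd: three decimal digits *)

(* Both escapes denote the same single character. *)
val sameString = (viaHex = viaDecimal)
val lastOrd    = Char.ord (String.sub (viaHex, 3))
```

Because nothing outside ASCII appears in the file, the encoding of the
source file does not matter for constants written this way.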

I am not convinced that it is a good idea to allow UTF-8 encoding
elsewhere (e.g. in variable names).  That seems like it will kill
portability with other SML implementations, as well as make it
more difficult to share code between people of different languages.
Also keep in mind that this will require a substantial upgrade to
ML-Lex to support Unicode (not a bad thing, just work).

> There is an ICONV signature including at least:
>   type string
>   type char
>   exception UnknownCharset of string
> 
>   val decode: string -> (char, 'a) reader -> (LargeChar, 'a) reader
>   val encode: LargeChar -> string
> 
>   val registerDecoder: string -> ((char, 'a) reader -> (LargeChar, 'a) reader) -> unit
>   val registerEncoder: string -> (LargeChar -> string) -> unit
> 
> Two structures for representing the source string type:
>   IConv, IConvUCS2 (string = UCS2String)
> 
> encode and decode take a string naming the source charset.

I'd rather provide a datatype (UTF8 | UTF16 | ...).  For
extensibility, you can have a variant "X of string".
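
Roughly what I have in mind, as a sketch only.  The structure name
Charset, the toString function, and the large_char type are
placeholders I made up here, not a settled interface.

```sml
structure Charset =
struct
  (* Known charsets get their own constructor; X is the escape hatch
     for anything registered later under a name. *)
  datatype t = UTF8 | UTF16 | X of string

  fun toString UTF8 = "UTF-8"
    | toString UTF16 = "UTF-16"
    | toString (X name) = name
end

(* The proposed signature with the charset argument typed as
   Charset.t instead of string.  large_char stands in for the
   proposed LargeChar type. *)
signature ICONV =
sig
  type string
  type char
  type large_char
  exception UnknownCharset of Charset.t

  val decode : Charset.t -> (char, 'a) StringCvt.reader
               -> (large_char, 'a) StringCvt.reader
  val encode : Charset.t -> large_char -> string
end
```

With a datatype, a misspelled built-in charset is a compile-time error,
and only the X variant can fail at run time.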

> Maybe add toLarge and fromLarge in CHAR and STRING signatures like INT.

We would do this in MLton.Char and MLton.String.  But yes.

> The whole CHAR.is* family should never have been dumped inside CHAR.
> The note that these are locale dependent under WideChar makes it even worse.
> The idea of a per-process locale is a C screw-up that even C++ fixes.
> The best way to handle this in SML is not clear (no OOP - hrm).

I don't know enough to comment here.  Here are some options:

 * have a ref cell holding the locale and use fluid-let  
   (I guess this is the C screwup)
 * make the locale an argument to each function that needs it.
   build nice wrappers to hide it as much as possible
 * make the locale a functor argument, and create a structure per locale
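
The last two options might look something like the following sketch.
Everything here (LOCALE, countUpperIn, CharClass, the two example
locales) is hypothetical and uses plain 8-bit Char only to keep the
example self-contained.

```sml
(* A locale reduced to the one predicate this example needs. *)
signature LOCALE =
sig
  val name : string
  val isUpper : char -> bool
end

structure AsciiLocale : LOCALE =
struct
  val name = "ASCII"
  fun isUpper c = #"A" <= c andalso c <= #"Z"
end

structure Latin1Locale : LOCALE =
struct
  val name = "ISO-8859-1"
  (* Latin-1 upper-case letters: A-Z plus 192-222, except 215,
     which is the multiplication sign. *)
  fun isUpper c =
    let val n = Char.ord c
    in AsciiLocale.isUpper c
       orelse (192 <= n andalso n <= 222 andalso n <> 215)
    end
end

(* Option 2: the locale is an explicit argument. *)
type locale = {name : string, isUpper : char -> bool}

fun countUpperIn (loc : locale) (s : string) : int =
  CharVector.foldl (fn (c, n) => if #isUpper loc c then n + 1 else n) 0 s

(* Option 3: the locale is a functor argument, one structure per locale.
   The wrapper hides the explicit argument from option 2. *)
functor CharClass (L : LOCALE) =
struct
  fun countUpper s = countUpperIn {name = L.name, isUpper = L.isUpper} s
end

structure AsciiClass  = CharClass (AsciiLocale)
structure Latin1Class = CharClass (Latin1Locale)

(* "\201" is E-acute (upper case) in ISO-8859-1. *)
val sample = "\201cole Normale"
val asciiCount  = AsciiClass.countUpper sample
val latin1Count = Latin1Class.countUpper sample
```

Neither version has any global state, so two locales can be used side by
side in the same program.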

> What is the idea behind WideTextPrimIO?

The StreamIO and ImperativeIO functors use the *PrimIO structures to
build higher-level I/O modules.  I assume the idea here is to be able
to build a WideTextIO module similar to TextIO.  I wouldn't read too
much into this -- as Matthew said, none of the basis library designers
felt competent to handle internationalization issues.

> I don't know how to deal with all of the scan functions that expect a
> Char.char. Suggestions? I think they should work with WideChar too.

There might be a more composable way that doesn't require us to add
all these new functions.  What if we added a function that converts a
LargeChar reader into a Char reader?  It would raise an exception
given an out-of-range char.  Is there a reason why functions like
Int.scan would need to handle wide chars?
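
A sketch of that conversion function.  Since there is no LargeChar yet,
this stand-in uses int code points as the wide character type; the
names narrow and OutOfRange are invented for the example.

```sml
exception OutOfRange of int

(* Turn a reader of code points into a reader of Char.char, raising
   OutOfRange on any code point that does not fit in Char. *)
fun narrow (rd : (int, 'a) StringCvt.reader) : (char, 'a) StringCvt.reader =
  fn s =>
    case rd s of
      NONE => NONE
    | SOME (cp, s') =>
        if 0 <= cp andalso cp <= Char.maxOrd
        then SOME (Char.chr cp, s')
        else raise OutOfRange cp

(* The existing scan functions then work unchanged.  Here the "wide"
   stream is a list of code points (52 and 50 are the digits 4 and 2),
   read with List.getItem. *)
val scanned = Int.scan StringCvt.DEC (narrow List.getItem) [52, 50]
```

The existing scan functions compose with the narrowed reader as they
are, so none of them would need a wide variant.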

> How does one make official changes to the SML Basis Library anyways?

It doesn't seem possible to me.  There is an email list

	sml-basis-discuss@mailman.cs.uchicago.edu
	http://mailman.cs.uchicago.edu/mailman/listinfo/sml-basis-discuss

This list is read by many of the SML implementors, including John
Reppy.  This list is usually not very active, but occasionally sees
bursts of activity.  You're welcome to send mail there to see what
people think.  However, there is no clear process for getting a change
made and no structure for incorporating the opinions of the many SML
implementors in the decision making process.

My suggestion is to use the list to solicit feedback on your design,
if you think you need more beyond what's available on MLton@mlton.org.
Then, we (the MLton list) will move forward and implement what's
needed in MLton.  What we provide will be clearly marked as an
extension, probably by putting it in the MLton structure.

> If I implemented WideChar in MLton, I would at present ignore the comment
> that WideChar.is* should be localized; that's just wrong.

Fine by me.  Simply add an appropriate note after the module listed at
http://mlton.org/BasisLibrary.

> I would need help making #"g" and "Dgfsg" polymorphic,

This is not hard.  We can help out.


BTW, I assume since you've ported fxp that you had a look at 

	http://mlton.org/References#Neumann99Thesis

which talks about how fxp handles Unicode.