[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Tue, 29 Nov 2005 19:10:49 -0600

I think that this proposal is too heavyweight for its usefulness.
The Basis design assumes that there is an implementation of the TEXT
signature for each char/string/substring type, so you'll have all the
arrays, vectors, slices, etc. for each type.  Furthermore, there need
to be conversion functions between types and perhaps multiple versions
of TextIO.

A different strategy (one that we considered at one point in the Basis
design, but then abandoned for reasons that I cannot remember) is to
separate the notion of character classification from representation.
For example, one could have two types of char (Char.char and a wide
character type) but multiple classification modules (e.g., Ascii,
ISO8859_1, ...) that
provide interpretations of these types.  Functions like isAlpha would
be part of these classification modules.
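
A rough sketch of what such classification modules might look like,
with the representation type left as Char.char. The signature name
CHAR_CLASS and the choice of functions are assumptions made for
illustration only, not part of the Basis, and the Latin-1 ranges are
deliberately simplified.

```sml
(* Hypothetical signature: classification only, no new representation. *)
signature CHAR_CLASS =
sig
  type char
  val isUpper : char -> bool
  val isAlpha : char -> bool
end

(* 7-bit ASCII reading of Char.char: codes above 7F are in no class. *)
structure Ascii :> CHAR_CLASS where type char = Char.char =
struct
  type char = Char.char
  fun isUpper c = #"A" <= c andalso c <= #"Z"
  fun isLower c = #"a" <= c andalso c <= #"z"
  fun isAlpha c = isUpper c orelse isLower c
end

(* Latin-1 reading of the same type.  Simplified: ignores the letters
   at AA, B5 and BA, and excludes the signs at D7 and F7. *)
structure ISO8859_1 :> CHAR_CLASS where type char = Char.char =
struct
  type char = Char.char
  fun inRange (lo, hi) c =
    let val n = Char.ord c in lo <= n andalso n <= hi end
  fun isUpper c =
    inRange (0x41, 0x5A) c orelse inRange (0xC0, 0xD6) c
    orelse inRange (0xD8, 0xDE) c
  fun isLower c =
    inRange (0x61, 0x7A) c orelse inRange (0xDF, 0xF6) c
    orelse inRange (0xF8, 0xFF) c
  fun isAlpha c = isUpper c orelse isLower c
end

(* The same value is classified differently by each module. *)
val eAcute = Char.chr 0xE9
val asciiSaysAlpha  = Ascii.isAlpha eAcute
val latin1SaysAlpha = ISO8859_1.isAlpha eAcute
```

Both structures share one representation, so no extra string, vector,
or slice structures are needed; only the interpretation differs.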

	- John

On Nov 29, 2005, at 4:36 PM, Stephen Weeks wrote:

>> Keeping with the mindset that a structure matching CHAR is in fact a
>> character set, not just a bag of integers, how about this:
>> Char (8 bit, high ascii 'undefined') <-- required (raises Chr for
>> values beyond FF)
>> Ascii (7 bit) <-- required (raises Chr for values beyond 7F)
>> Iso8859_1 (8 bit) <-- optional (raises Chr for values beyond FF)
>> Ucs2 (16 bit) <-- optional (raises Chr for surrogates and values
>> beyond FFFF)
>> WideChar (must be Unicode) <-- optional (raises Chr for surrogates
>> and values beyond 10FFFF)
> I like this proposal.
> As to whether \U escapes should accept 6 or 8 hex digits, I lean
> towards 8 because it seems possible that in the future we will need
> more than 6 digits, and I wouldn't want to break old code or to
> support 6 and 8 simultaneously.  Also, we have \u for the common case
> of 4 digits.  Finally, with source files allowed to be UTF-8, \U
> escapes should be pretty rare.
>> If we are banning values beyond 10FFFF, then perhaps we should also
>> ban values between D800-DFFF which may not appear in a conforming
>> UTF-32 string.
> Yes, that makes sense if we are really thinking of WideChar as
> Unicode.
>> One question is whether or not the Ucs2/Iso8859_1/Ascii structures
>> should have all of the extra structures that go with them
>> (Ucs2String, Ucs2Vector, Ucs2Substring, ...).
> One way to go would be to export functors that let people build these
> if they really want them.
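
One way the range rules and the functor suggestion quoted above could
fit together is sketched here. The functor name CheckedChar, its
maxOrd parameter, and the int representation are all assumptions made
for illustration; a real implementation would also have to supply the
rest of the CHAR signature.

```sml
(* Hypothetical functor: builds a character type whose chr rejects
   surrogates and any code point above the given bound. *)
functor CheckedChar (val maxOrd : int) :>
sig
  eqtype char
  val chr : int -> char   (* raises Chr on invalid code points *)
  val ord : char -> int
end =
struct
  type char = int
  fun isSurrogate n = 0xD800 <= n andalso n <= 0xDFFF
  fun chr n =
    if n < 0 orelse n > maxOrd orelse isSurrogate n
    then raise Chr
    else n
  fun ord c = c
end

(* Instances matching the bounds in the quoted proposal. *)
structure Ucs2 = CheckedChar (val maxOrd = 0xFFFF)
structure Wide = CheckedChar (val maxOrd = 0x10FFFF)
```

With this shape, Ucs2.chr 0x10000 raises Chr while Wide.chr 0x10000
succeeds, and both reject D800 through DFFF.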