[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Wesley W. Terpstra wesley@terpstra.ca
Wed, 30 Nov 2005 13:49:03 +0100


On Nov 30, 2005, at 2:10 AM, John Reppy wrote:
> I think that this proposal is too heavy weight for its usefulness.

I agree that it's pretty heavyweight.
However, at least in MLton, creating the structures isn't a big deal.

> The Basis design assumes that there is an implementation of the TEXT
> signature for each char/string/substring type, so you'll have all the
> arrays, vectors, slices, etc. for each type.

What if we just said that only Char and WideChar had the structures at
the toplevel? All the others would only provide Ucs2Text, AsciiText, ...
That has very little namespace pollution, yet provides everything
desired. From my experience with MLton's Char, most/all of these
structures can be cookie-cutter stamped out of a functor, so it's not
much trouble to implement.
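
As a rough illustration of the cookie-cutter idea, here is a minimal,
hypothetical functor (CHAR_MIN and MkTextOps are made-up names, and a
real TEXT result would also carry strings, substrings, arrays, vectors,
and slices):

    signature CHAR_MIN =
    sig
      type char
      val ord    : char -> int
      val chr    : int -> char
      val maxOrd : int
    end

    functor MkTextOps (C : CHAR_MIN) =
    struct
      (* every code point of the charset, in order *)
      fun all () = List.tabulate (C.maxOrd + 1, C.chr)
      (* does this integer denote a code point of the charset? *)
      fun isValid i = 0 <= i andalso i <= C.maxOrd
    end

Instantiating it once per charset (AsciiText, Ucs2Text, ...) is the
kind of stamping I have in mind.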

Re: namespace pollution, MLton has Int{1-64} and Word{1-64}.
Now *that* is heavy weight! :-)

> Furthermore, there need to be conversion functions between types ...

Well, a simple 'toWide' and 'fromWide' would take care of that.
(Analogous to the promotion to/from LargeInt)
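
Sketched as a signature extension (nothing below is in the Basis
today, and it only makes sense where WideChar exists):

    signature CHAR_WITH_WIDE =
    sig
      include CHAR
      val toWide   : char -> WideChar.char
      val fromWide : WideChar.char -> char  (* raises Chr if unmappable *)
    end

This mirrors Int.toLarge/Int.fromLarge: widening is total, narrowing
is partial.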

However, there are a couple problems here:

WideChar does not exist on many platforms. Is it possible to have these
elements of the CHAR signature marked as required iff WideChar exists?

What will Char.toWide do? As I already mentioned, high ASCII (128-255)
is undefined. What does it map to in a WideChar?! I still think defining
high ASCII to be *something* is better than nothing.
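
For concreteness, if Char were pinned to ISO-8859-1 (my preference,
below), the pair is trivial, because ISO-8859-1 occupies exactly
U+0000..U+00FF (assuming a WideChar whose values are Unicode code
points):

    (* widening never fails: Latin-1 is the first 256 code points *)
    fun toWide (c : Char.char) : WideChar.char =
      WideChar.chr (Char.ord c)

    (* narrowing raises Chr for anything above 255 *)
    fun fromWide (w : WideChar.char) : Char.char =
      Char.chr (WideChar.ord w)

With high ASCII left undefined, there is no such canonical answer.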

> and perhaps multiple versions of TextIO.

I don't think this is desirable.

Instead, you should use BinIO and compose it with a charset decoder.
An implementation will only have a few charset representations in main
memory, and certainly no variable-width ones. If you use a general
charset decoder for reading, then you can support all charsets with the
same code.

> A different strategy (one that we considered at one point in the Basis
> design, but then abandoned for reasons that I cannot remember), is to
> separate the notion of character classification from representation.

I think I know why this wasn't done:

1. If you write a string in SML, 'val x = "asfasf"', then this string
must contain the code points which correspond to the symbol with shape
'a', then 's', ... When you have a single storage type with multiple
charsets, this is ambiguous. I.e.: is #"€" 0xA4 or 0x80? Depends on
your charset!

2. Simply taking a string which was previously considered an ISO-8859-1
string and declaring that it is now an ISO-8859-15 string would be
typesafe, yet buggy. If you used phantom types like 'charset char, you
might be able to avoid the worst (see the sketch after this list).

3. Maybe not then, but now: backwards compatibility.
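
Here is the kind of phantom-type discipline I mean for point 2; the
charset tags and the 'text' wrapper are invented for illustration:

    datatype latin1 = LATIN1   (* ISO-8859-1 tag, never constructed *)
    datatype latin9 = LATIN9   (* ISO-8859-15 tag *)

    datatype 'charset text = Text of string

    (* the only way to change charset without decoding: explicit and
       greppable, rather than silently typechecking *)
    fun unsafeRecast (Text s : 'a text) : 'b text = Text s

    val currency : latin1 text = Text "\164"            (* 0xA4 = '¤' *)
    val euro     : latin9 text = unsafeRecast currency  (* 0xA4 = '€' *)

The recast is still buggy in the way point 2 describes, but at least
the type system forces you to say so out loud.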

Finally, you would still need at least three representations (1-, 2-,
and 4-byte). My proposal had five, which isn't terribly worse, and
saves on the classification structures. If we say Char = ISO-8859-1,
then there are only three structures in my proposal too (Char, Ucs2,
WideChar).

I keep coming back to arguing for Char being ISO-8859-1. It makes the
problem of conversion between WideChar and Char so much cleaner...