[MLton] Unicode / WideChar

skaller skaller@users.sourceforge.net
Mon, 21 Nov 2005 23:42:11 +1100


On Mon, 2005-11-21 at 13:06 +0100, Wesley W. Terpstra wrote:
> I wanted to allay several concerns I've heard.
> 
> Sort order with UTF-16 will not be a problem in MLton.
> You will have three choices for storing strings:
> String1 (String), String2, String4 

UCS 8, 16, and 32 -- I think this is correct.

> Each of them will include a '<' which just orders by value
> (code point). 

Yup.
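
For anyone wondering where the sort-order worry comes from, here is a
small Python sketch (my own illustration, nothing to do with MLton's
code; the helper name utf16_units is made up). Comparing by code point
and comparing by UTF-16 code unit disagree as soon as a surrogate pair
meets a character in the range U+E000..U+FFFF:

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

a = "\uff5e"       # U+FF5E: a single UTF-16 unit, 0xFF5E
b = "\U00010000"   # U+10000: a surrogate pair, 0xD800 0xDC00

# Python compares str values code point by code point.
by_code_point = a < b                            # 0xFF5E < 0x10000
# Comparing the raw 16-bit units gives the opposite answer.
by_utf16_unit = utf16_units(a) < utf16_units(b)  # 0xFF5E vs 0xD800
```

A '<' defined on the stored values is consistent with itself in each
string type; the two orders only differ if you treat a 16-bit string
as encoded UTF-16 rather than as a sequence of values.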

> This will be perfectly fine because MLton will
> do the right thing for little/big-endian automatically 

Endianness is surely irrelevant. All three are strings
of unsigned integers. The endianness is invisible
to the user.
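
To illustrate what I mean, a Python sketch (the values are arbitrary
and chosen by me): byte order shows up only when the integers are
serialised to bytes, never in the integers themselves.

```python
import struct

values = [0x0041, 0x20AC, 0xD800]  # arbitrary 16-bit unsigned integers

# Serialise the same integers with both byte orders.
little = struct.pack("<3H", *values)
big = struct.pack(">3H", *values)

# Reading each buffer back with its own byte order recovers the same values.
decoded_little = list(struct.unpack("<3H", little))
decoded_big = list(struct.unpack(">3H", big))
```

The two byte buffers differ, but both decode to the same list, and
that list is all the user of the string type ever sees.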

> I may
> or may not blacklist the surrogate code points. Probably
> not, actually, so that people can abuse Char2 as an Int16
> if they feel the need.

I agree. Do not blacklist. But my reasoning is the same,
only generalised: these data structures store integers.
There is no relation to Unicode in particular.


> Also, Char{1,2,4}.{<,toUpper,isAlpha,...} are all locale
> *IN*-dependent. 

These are Unicode-specific functions. By definition
they're locale-independent.

I think some care should be taken to separate the
Unicode functions from the String data structures.

The reason is: you could write these functions for
a different character set. There is a whole swag
of archaic 8-bit character sets, for example.
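
A Python sketch of the point (the function is_alpha and its signature
are hypothetical, not a proposal for the MLton API): the same stored
integer classifies differently depending on which code set you
interpret it in, so the classification belongs to the code set and not
to the string type.

```python
import unicodedata

def is_alpha(code, charset):
    """Classify an 8-bit code under the given charset (hypothetical helper)."""
    ch = bytes([code]).decode(charset)
    # Unicode general categories starting with "L" are letters.
    return unicodedata.category(ch).startswith("L")

# The integer 0xD7 is a multiplication sign in Latin-1
# but a Cyrillic capital letter in Windows-1251.
in_latin1 = is_alpha(0xD7, "latin-1")
in_cp1251 = is_alpha(0xD7, "cp1251")
```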

I would actually argue that Char? is wrong.
They're not chars, they're integers, and they
are not associated with any particular code set.

This breaks abstraction .. you can even multiply
two characters ... 

Unfortunately, the proper abstraction is elusive
and will simply make it too hard to do anything:
for example, to decode say BIG5 and convert to Unicode,
you will have to jump through hoops .. and won't be able
to do it at all without a BIG5_char abstraction.
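
By contrast, when everything is plain integers the conversion is just
a table lookup from one integer to another. A Python sketch using its
stock "big5" codec (my example bytes; this says nothing about how
MLton would do it):

```python
# Two BIG5 characters, two bytes each.
raw = bytes([0xA4, 0xA4, 0xA4, 0xE5])

# Decode the BIG5 bytes into Unicode text.
text = raw.decode("big5")

# The result is just a sequence of integer code points.
code_points = [ord(c) for c in text]
```

Here code_points comes out as [0x4E2D, 0x6587], with no BIG5_char
type anywhere in sight.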

In the end, a comment made to me by Bill Plauger really
makes sense: he said something like "C handles character
set issues better than any other language .. simply
because it doesn't".

BTW: I am curious about the Unicode database implementation:
the database is BIG. How are you going to represent this
efficiently? (e.g., the case mapping function)
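
For what it's worth, the usual trick I know of is a two-stage table:
split the code space into fixed-size blocks, store each distinct block
once, and index into them. A Python sketch for the BMP only, using
Python's own str.upper as the data source; the names, the block size
and the restriction to single-code-point mappings are all my
assumptions, and I have no idea whether MLton will do anything like
this:

```python
BLOCK = 256  # code points per block (assumed size)

def simple_upper_delta(cp):
    """Offset from cp to its uppercase form, or 0 if there is no 1:1 mapping."""
    if 0xD800 <= cp <= 0xDFFF:       # surrogates have no case mapping
        return 0
    up = chr(cp).upper()
    return ord(up) - cp if len(up) == 1 else 0

def build_two_stage(deltas):
    """Deduplicate fixed-size blocks; return (index, blocks)."""
    index, blocks, seen = [], [], {}
    for start in range(0, len(deltas), BLOCK):
        block = tuple(deltas[start:start + BLOCK])
        if block not in seen:        # store each distinct block once
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

deltas = [simple_upper_delta(cp) for cp in range(0x10000)]
index, blocks = build_two_stage(deltas)

def to_upper(cp):
    """Look up the uppercase of a BMP code point through the two stages."""
    return cp + blocks[index[cp >> 8]][cp & 0xFF]
```

Most blocks have no case mappings at all and collapse into a single
shared all-zero block, which is where the saving comes from.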

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net