[MLton] Unicode / WideChar

Wesley W. Terpstra wesley@terpstra.ca
Mon, 21 Nov 2005 13:06:59 +0100


I wanted to allay several concerns I've heard.

Sort order with UTF-16 will not be a problem in MLton.
You will have three choices for storing strings:
String1 (String), String2, String4 (WideString).
(The String{1,2,4} are only available in i18n.mlb)

Each of them will include a '<' which just orders by value
(code point). This will be perfectly fine because MLton will
do the right thing for little/big-endian automatically (the
collate function on vectors uses the element comparison,
which compares values in the correct endianness). Also, the
String2 is UCS-2, not UTF-16. That means if you try to put
in a character above 65535 (0xFFFF), you get exception Chr. I may
or may not blacklist the surrogate code points. Probably
not, actually, so that people can abuse Char2 as an Int16
if they feel the need.
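
A minimal sketch of the two points above, using only the standard
Basis. Char2 and String2 do not exist yet, so a word vector stands in
for a String2, and chr2 and compare2 are hypothetical names made up
for this illustration, not the planned API.

```sml
(* Hypothetical stand-in for Char2.chr: accept only values that fit in
   16 bits and raise Chr otherwise, as described for String2. *)
fun chr2 (i : int) : word =
  if i < 0 orelse i > 0xFFFF then raise Chr else Word.fromInt i

(* Order two vectors of code units element by element. Word.compare
   works on values, so byte order in memory never enters into it. *)
val compare2 : word vector * word vector -> order =
  Vector.collate Word.compare

val a = Vector.fromList (map chr2 [0x0041, 0x00E9])
val b = Vector.fromList (map chr2 [0x0041, 0x4E2D])

(* 0x00E9 is smaller than 0x4E2D, so a sorts before b. *)
val ordering = compare2 (a, b)

(* 0x10000 does not fit in UCS-2, so chr2 raises Chr. *)
val overflow = (chr2 0x10000; false) handle Chr => true
```

Because every UCS-2 character is a single 16-bit value, comparing
element values is the same as comparing code points. No surrogate
pairs are involved that could make the two orders disagree.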

Also, Char{1,2,4}.{<,toUpper,isAlpha,...} are all locale
*IN*-dependent. I hope this addresses Henry's concern
about program behaviour mysteriously changing. Unicode
includes a 'default' upper-casing, and I would say < with
code point order is the 'default' sort. If you want locale
specific sort order, case changes, etc., you will need to use
the i18n.mlb which will provide locale-specific versions.
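
For a rough idea of what the locale-independent defaults mean in
practice, here is a sketch with today's Char and String from the
standard Basis. It assumes the wider Char{2,4} versions will behave
the same way on these inputs.

```sml
(* Default upper-casing: the result does not depend on the locale. *)
val upper = String.map Char.toUpper "mlton"

(* Code point order: Z is 0x5A and a is 0x61, so "Zebra" sorts before
   "apple". A locale-aware dictionary sort would typically reverse
   this. *)
val zebraFirst = String.< ("Zebra", "apple")
val cmp = String.compare ("Zebra", "apple")
```

Here upper is "MLTON", zebraFirst is true and cmp is LESS, whatever
the locale settings of the machine running the program.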

As for worrying about the is* methods, I've gained some
perspective over the last day and think I know what to do
now. (I will have isDigit include non-decimal digits, and the
set of letters will be larger than upper plus lower case.)