[MLton] Unicode / WideChar

Mon, 21 Nov 2005 12:56:47 +0100

>> UTF-16 is the
>> replacement, and sorting that representation lexicographically
>> (potentially after byte-swapping) does not result in the codepoint
>> order!
>
> Here is the algorithm (from ISO/IEC JTC1/SC2/WG2 N 1035):
>
> UCS                 UTF-16
> x =  0000 0000..    x;
>      0000 FFFD
>
> x =  0001 0000..    y; z;
>      0010 FFFF
>                     where
>                     y = ((x - 0001 0000) / 400) + D800
>                     z = ((x - 0001 0000) % 400) + DC00
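
The quoted table can be sketched in a few lines of Python. This is
only an illustration, not code from N 1035: the name utf16_units is
made up here, 400 is read as hexadecimal, and rejecting surrogate
code points and treating everything up to FFFF as a single unit are
assumptions of the sketch.

```python
def utf16_units(x):
    """Return the UTF-16 code units for code point x as a list of ints."""
    # Assumption of this sketch: reject values outside the UCS range
    # and the surrogate code points themselves.
    if not 0 <= x <= 0x10FFFF or 0xD800 <= x <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % x)
    if x <= 0xFFFF:
        # First row of the table: the code point is its own code unit.
        return [x]
    # Second row: split the 20-bit offset into two 10-bit halves.
    offset = x - 0x10000
    y = offset // 0x400 + 0xD800   # high (leading) surrogate
    z = offset % 0x400 + 0xDC00    # low (trailing) surrogate
    return [y, z]
```

For example, utf16_units(0x10100) gives [0xD800, 0xDD00], while
utf16_units(0xFEFF) gives [0xFEFF].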

U+FEFF is mapped to 0xFEFF, but U+10100 is mapped to 0xD800 0xDD00,
which is lexicographically less than 0xFEFF. So the larger code point
sorts first when UTF-16 code units are compared.
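
A small Python check of this, as a sketch: it relies on Python 3
comparing str values by code point, and on the built-in "utf-16-be"
codec, whose big-endian bytes compare in the same order as the 16-bit
code units. The variable names are made up for the example.

```python
# The two characters from the example above.
strings = ["\ufeff", "\U00010100"]

# Python 3 compares str objects code point by code point.
by_code_point = sorted(strings)

# Big-endian UTF-16 bytes compare like the underlying 16-bit units.
by_utf16 = sorted(strings, key=lambda s: s.encode("utf-16-be"))

# The two orders disagree: U+10100 comes last by code point,
# but first by UTF-16 code units (0xD800 < 0xFEFF).
print([hex(ord(s)) for s in by_code_point])
print([hex(ord(s)) for s in by_utf16])
```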

The abomination that results from this discrepancy is called CESU-8.
(The nice thing about Unicode is that there are so many encodings to
choose from.)
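
For illustration, here is a rough Python sketch of what CESU-8 does:
each UTF-16 code unit, surrogates included, is encoded separately with
the UTF-8 bit layout, so byte order matches UTF-16 order instead of
code point order. The function name cesu8 is made up, and the sketch
leans on Python's "surrogatepass" error handler to encode lone
surrogates.

```python
def cesu8(s):
    """Encode s by UTF-8-encoding each UTF-16 code unit on its own."""
    units = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # surrogatepass lets the UTF-8 codec emit the 3-byte form
        # for a lone surrogate instead of raising an error.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

low, high = "\ufeff", "\U00010100"

# Real UTF-8 preserves code point order ...
utf8_keeps_order = low.encode("utf-8") < high.encode("utf-8")

# ... while CESU-8 inherits the UTF-16 order and puts U+10100 first.
cesu8_flips_order = cesu8(high) < cesu8(low)
```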