[MLton] Unicode / WideChar
Florian Weimer
fw@deneb.enyo.de
Mon, 21 Nov 2005 12:56:47 +0100
>> UTF-16 is the
>> replacement, and sorting that representation lexicographically
>> (potentially after byte-swapping) does not result in the codepoint
>> order!
>
> Here is the algorithm (from ISO/IEC JTC1/SC2/WG2 N 1035):
>
>     UCS                          UTF-16
>     x = 0000 0000..0000 FFFD1    x;
>     x = 0001 0000..0010 FFFF     y; z;
>     where
>         y = ((x - 0001 0000) / 400) + D800
>         z = ((x - 0001 0000) % 400) + DC00

U+FEFF is mapped to 0xFEFF, but U+10100 is mapped to 0xD800 0xDD00,
which is lexicographically less than 0xFEFF, even though U+10100 is
the greater code point.
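
To make the arithmetic concrete, here is a minimal Python sketch of the
quoted mapping. The helper name to_utf16 is made up for illustration,
and the sketch does no validation (it does not reject surrogate code
points or values above 10FFFF).

```python
def to_utf16(x):
    """Return the UTF-16 code units for code point x (hypothetical helper)."""
    if x < 0x10000:
        # BMP code points map to a single code unit, unchanged.
        return [x]
    # Supplementary code points map to a high/low surrogate pair.
    offset = x - 0x10000
    return [0xD800 + offset // 0x400, 0xDC00 + offset % 0x400]


a = to_utf16(0xFEFF)    # one code unit: 0xFEFF
b = to_utf16(0x10100)   # two code units: 0xD800 0xDD00

# Code point order and code unit order disagree for this pair.
codepoint_order = 0xFEFF < 0x10100   # U+FEFF sorts first by code point
utf16_order = a < b                  # list comparison is lexicographic
```

Here codepoint_order is true while utf16_order is false, because the
comparison of the code unit sequences is decided by 0xFEFF versus
0xD800.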
The abomination that results from this discrepancy is called CESU-8.
(The nice thing about Unicode is that there are so many encodings to
choose from.)
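
For illustration, a rough Python sketch of what CESU-8 does, assuming
the usual description: each UTF-16 code unit, surrogates included, is
encoded separately as if it were a code point in UTF-8. The function
name cesu8 is hypothetical; Python has no built-in CESU-8 codec, so the
sketch goes through the "surrogatepass" error handler.

```python
def cesu8(s):
    """Encode s by UTF-8-encoding each UTF-16 code unit separately (sketch)."""
    units = s.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets a lone surrogate through as a 3-byte sequence.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bom = "\ufeff"
sup = "\U00010100"

# Plain UTF-8 preserves code point order under byte-wise comparison.
utf8_keeps_codepoint_order = bom.encode("utf-8") < sup.encode("utf-8")

# CESU-8 instead reproduces the UTF-16 code unit order.
cesu8_keeps_utf16_order = cesu8(sup) < cesu8(bom)
```

Both flags come out true: byte-wise sorting of UTF-8 puts U+FEFF before
U+10100, while byte-wise sorting of the CESU-8 form puts U+10100 first,
just as UTF-16 does.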