# [MLton] Unicode / WideChar

**Florian Weimer**
fw@deneb.enyo.de

*Mon, 21 Nov 2005 12:56:47 +0100*

>> UTF-16 is the
>> replacement, and sorting that representation lexicographically
>> (potentially after byte-swapping) does not result in the codepoint
>> order!
>
> Here is the algorithm (from ISO/IEC JTC1/SC2/WG2 N 1035):
>
> UCS                            UTF-16
> x = 0000 0000 .. 0000 FFFD1    x;
> x = 0001 0000 .. 0010 FFFF     y; z;
> where
>   y = ((x - 0001 0000) / 400) + D800
>   z = ((x - 0001 0000) % 400) + DC00

U+0FEFF is mapped to 0xFEFF, but U+10100 is mapped to 0xD800 0xDD00,
which is lexicographically less than 0xFEFF.
The abomination that results from this discrepancy is called CESU-8.
(The nice thing about Unicode is that there are so many encodings to
choose from.)
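
For illustration, here is a minimal Python sketch of the quoted mapping and of the ordering discrepancy it causes. The helper names `utf16_units` and `cesu8_bytes` are made up for this example and come from neither N 1035 nor MLton. The constants are the hexadecimal values from the quoted formula, and the CESU-8 helper assumes the usual definition (each UTF-16 code unit encoded as if it were a code point in UTF-8).

```python
import struct
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Map one code point to UTF-16 code units, per the quoted formula."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not encodable code points")
    if cp < 0x10000:
        return [cp]                       # one unit: x
    offset = cp - 0x10000
    return [0xD800 + offset // 0x400,     # y (high surrogate)
            0xDC00 + offset % 0x400]      # z (low surrogate)


def cesu8_bytes(cp: int) -> bytes:
    """Assumed CESU-8 definition: UTF-8-style encoding of each UTF-16 unit."""
    return b"".join(chr(u).encode("utf-8", "surrogatepass")
                    for u in utf16_units(cp))


a, b = 0xFEFF, 0x10100

# Code point order: U+FEFF sorts before U+10100.
assert a < b

# UTF-16 code unit order is the other way round
# (Python compares lists lexicographically).
assert utf16_units(a) == [0xFEFF]
assert utf16_units(b) == [0xD800, 0xDD00]
assert utf16_units(b) < utf16_units(a)

# Cross-check the formula against Python's own UTF-16 codec.
assert utf16_units(b) == list(struct.unpack(">2H", chr(b).encode("utf-16-be")))

# UTF-8 byte order agrees with code point order; CESU-8 agrees with UTF-16.
assert chr(a).encode("utf-8") < chr(b).encode("utf-8")
assert cesu8_bytes(b) < cesu8_bytes(a)
```

The last two assertions show the trade-off: sorting UTF-8 bytes reproduces code point order, while CESU-8 bytes sort the way UTF-16 code units do.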