[MLton] Unicode / WideChar
skaller
skaller@users.sourceforge.net
Mon, 21 Nov 2005 22:49:20 +1100
On Mon, 2005-11-21 at 09:19 +0100, Florian Weimer wrote:
> UTF-16 is the
> replacement, and sorting that representation lexicographically
> (potentially after byte-swapping) does not result in the codepoint
> order!
Here is the algorithm (from ISO/IEC JTC1/SC2/WG2 N 1035):
  UCS                           UTF-16
  x = 0000 0000 .. 0000 FFFD1   x;
  x = 0001 0000 .. 0010 FFFF    y; z;
where
y = ((x - 0001 0000) / 400) + D800
z = ((x - 0001 0000) % 400) + DC00
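
The mapping above can be sketched in a few lines of Python. This is only an illustration of the formulas as quoted, not code from the WG2 document: the function name utf16_units is made up here, and rejecting surrogate code points as input is an assumption of this sketch, since the table does not say how to treat them.

```python
def utf16_units(x: int) -> list[int]:
    """Map a code point x to its UTF-16 code units, following the table.

    The name and the error handling are assumptions of this sketch.
    """
    if x < 0 or x > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= x <= 0xDFFF:
        # Assumption: surrogate values are not valid code points to encode.
        raise ValueError("surrogate values cannot be encoded")
    if x < 0x10000:
        # First row of the table: the code point maps to itself.
        return [x]
    # Second row: split the 20-bit offset into a HI and a LO unit.
    offset = x - 0x10000
    y = offset // 0x400 + 0xD800
    z = offset % 0x400 + 0xDC00
    return [y, z]
```

For instance, utf16_units(0x10000) gives the pair D800, DC00, the smallest pair the formulas can produce.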
Please show an example where order is not preserved!
If a < b, then f(a) < f(b), compared lexicographically.
I think this is completely obvious. A UCS-2 character is mapped
to itself, so order is preserved. A larger code point is mapped
to a pair HI, LO, with a_HI <= b_HI, and
if a_HI = b_HI then a_LO < b_LO.
Therefore order is preserved, since HI is compared
before LO. In particular, integer division is monotonically
non-decreasing, which proves a_HI <= b_HI, and clearly in the equal case
the remainders after division by 0x400 preserve order.
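
The HI/LO part of this argument can be checked by brute force. The sketch below is a hypothetical helper, not part of any library, and it covers only code points in 0001 0000 .. 0010 FFFF, that is, one pair compared against another pair. It does not compare a single code unit with the HI unit of a pair.

```python
def surrogate_pair(x: int) -> tuple[int, int]:
    """Return (HI, LO) for a code point in 0x10000..0x10FFFF."""
    offset = x - 0x10000
    return (offset // 0x400 + 0xD800, offset % 0x400 + 0xDC00)


def pairs_preserve_order(start: int = 0x10000, stop: int = 0x10FFFF) -> bool:
    """Check that consecutive code points give strictly increasing pairs.

    Python compares tuples lexicographically, so HI is compared before LO.
    Checking neighbours is enough: strict increase is transitive.
    """
    prev = surrogate_pair(start)
    for x in range(start + 1, stop + 1):
        cur = surrogate_pair(x)
        if not prev < cur:
            return False
        prev = cur
    return True
```

Calling pairs_preserve_order() with the defaults walks the whole range, about a million code points.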
--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net