[MLton] Unicode / WideChar

skaller skaller@users.sourceforge.net
Mon, 21 Nov 2005 22:49:20 +1100


On Mon, 2005-11-21 at 09:19 +0100, Florian Weimer wrote:

> UTF-16 is the
> replacement, and sorting that representation lexicographically
> (potentially after byte-swapping) does not result in the codepoint
> order!

Here is the algorithm (from ISO/IEC JTC1/SC2/WG2 N 1035):

UCS                 UTF-16
x =  0000 0000..    x;
     0000 FFFD

x =  0001 0000..    y; z;
     0010 FFFF
                    where
                    y = ((x - 0001 0000) / 400) + D800
                    z = ((x - 0001 0000) % 400) + DC00
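
As a rough sketch, the table above can be written out in Python. The function name utf16_encode is made up for illustration, all constants are hexadecimal as in the table, and the sketch assumes any value below 0x10000 passes through as a single unit (it does not reject surrogate values or FFFE/FFFF):

```python
def utf16_encode(x):
    """Return the list of UTF-16 code units for code point x.

    Hypothetical helper following the table above; it does not
    validate that x is an assigned or even a legal character.
    """
    if 0 <= x < 0x10000:
        # Single 16-bit unit: the code point maps to itself.
        return [x]
    if 0x10000 <= x <= 0x10FFFF:
        offset = x - 0x10000
        y = offset // 0x400 + 0xD800  # high (leading) unit
        z = offset % 0x400 + 0xDC00   # low (trailing) unit
        return [y, z]
    raise ValueError("code point out of range: %#x" % x)
```

For example, utf16_encode(0x10000) gives [0xD800, 0xDC00] and utf16_encode(0x10FFFF) gives [0xDBFF, 0xDFFF].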

Please show an example where order is not preserved!

If a < b, then f(a) < f(b), compared lexicographically.
I think this is completely obvious. A UCS-2 character is mapped
to itself, so order is preserved. A larger code point is mapped
to a pair HI, LO, with a_HI <= b_HI, and

	if (a_HI = b_HI) then a_LO < b_LO

and therefore order is preserved, since HI is compared
before LO. In particular, / is non-strictly monotonically
increasing, which proves a_HI <= b_HI, and clearly in the equal
case the remainders after division by 0x400 preserve order.
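
The pair-versus-pair step of this argument can be checked mechanically. The sketch below is only an illustration with made-up names (surrogate_pair, pairs_preserve_order), and it assumes every input is at or above 0x10000. It does not compare a single unit against a pair, which is a separate case:

```python
def surrogate_pair(x):
    """Map a code point x >= 0x10000 to its (HI, LO) pair."""
    offset = x - 0x10000
    return (offset // 0x400 + 0xD800, offset % 0x400 + 0xDC00)


def pairs_preserve_order(points):
    """Check that a < b implies pair(a) < pair(b) for the given points.

    Python compares tuples lexicographically, so HI is compared
    before LO, matching the argument in the text.
    """
    ordered = sorted(set(points))
    encoded = [surrogate_pair(x) for x in ordered]
    return all(a < b for a, b in zip(encoded, encoded[1:]))
```

Calling pairs_preserve_order(range(0x10000, 0x110000)) walks the whole range that is encoded as pairs.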


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net