[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Wed, 30 Nov 2005 08:30:05 -0600


After thinking about this issue some more, I really think that the  
question of representation
(how many bits) and interpretation (how do the bits correspond to  
glyphs) should be kept
independent.  This approach loses some of the advantages of type
safety, but it will be much more
practical (and compatible with existing practice).  The question of  
how to interpret a bit
pattern comes up in three situations:

	1) when we want to classify the character (e.g., isAlpha).

	2) when we want to lexically compare two characters (locale may  
matter in this case).

	3) when we want to display the character.

Cases 1 & 2 define a view of a representation; it is not unreasonable
to imagine an implementation
supporting multiple views of a representation (e.g., both ASCII and  
ISO-8859-1 views of 8-bit
characters).  In practice, I doubt that applications will mix views,  
so distinguishing them
by type is not likely to buy you much.  The third issue is where  
things get hairy,
but there isn't going to be a one-size-fits-all solution.  For
example, I can render Unicode
directly on MacOS X, but in X windows I need to select a font and get  
the encoding right.
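The view-dependence of cases 1 & 2 can be sketched concretely (Python
is used here purely for illustration, since the point is
language-independent; the function names are hypothetical): the same
8-bit code value 0xE9 classifies differently under an ASCII view and
an ISO-8859-1 view.

```python
# Sketch: the same 8-bit code value classified under two views.
# 0xE9 is out of range for ASCII, but under ISO-8859-1 it is
# LATIN SMALL LETTER E WITH ACUTE ('e' with an acute accent).

def is_alpha_ascii(b: int) -> bool:
    """ASCII view: only A-Z and a-z count as letters."""
    return (0x41 <= b <= 0x5A) or (0x61 <= b <= 0x7A)

def is_alpha_latin1(b: int) -> bool:
    """ISO-8859-1 view: decode the byte as Latin-1, then classify."""
    return bytes([b]).decode("latin-1").isalpha()

print(is_alpha_ascii(0xE9))   # False: not a letter in the ASCII view
print(is_alpha_latin1(0xE9))  # True: a letter in the Latin-1 view
```

Both views apply to the same representation (an 8-bit char); only the
classification function changes.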

There are two kinds of conversions between Char.char and  
WideChar.char.  One is to just embed
Char.char into WideChar.char (i.e., add high-order zero bits).  The  
second is to interpret
multibyte sequences in 8-bit strings as single wide characters.  In  
the second case, there
are multiple encoding schemes, although UTF-8 may be the only one to
care about.  I think that
both kinds of conversions should be supported, but the second kind
must be defined independently
of the home structures for the types.
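The difference between the two conversions can be sketched as follows
(again in Python for illustration; the function names are
hypothetical stand-ins for whatever the Basis operations would be):
embedding maps each 8-bit char to a wide char with the same numeric
value, while UTF-8 decoding may consume several bytes per wide
character.

```python
# Sketch of the two kinds of Char.char -> WideChar.char conversion.

def embed(s: bytes) -> list:
    """First kind: embed each 8-bit char by zero-extension.
    Each byte becomes one wide character with the same code value."""
    return list(s)  # high-order bits of each code value are zero

def decode_utf8(s: bytes) -> list:
    """Second kind: interpret multibyte UTF-8 sequences in the 8-bit
    string as single wide characters."""
    return [ord(c) for c in s.decode("utf-8")]

data = b"\xc3\xa9"        # UTF-8 encoding of U+00E9
print(embed(data))        # two wide chars: [0xC3, 0xA9]
print(decode_utf8(data))  # one wide char:  [0xE9]
```

The same two bytes yield two wide characters under the embedding but a
single wide character under UTF-8 decoding, which is why the second
conversion needs its own, encoding-aware interface.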

	- John