[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support
John Reppy
jhr@cs.uchicago.edu
Wed, 30 Nov 2005 08:30:05 -0600
After thinking about this issue some more, I really think that the
question of representation
(how many bits) and interpretation (how do the bits correspond to
glyphs) should be kept
independent. This approach loses some of the advantages of type safety, but
it will be much more
practical (and compatible with existing practice). The question of
how to interpret a bit
pattern comes up in three situations:
1) when we want to classify the character (e.g., isAlpha).
2) when we want to lexically compare two characters (locale may
matter in this case).
3) when we want to display the character.
Situations 1 & 2 define a view of a representation; it is not unreasonable to
imagine an implementation
supporting multiple views of a representation (e.g., both ASCII and
ISO-8859-1 views of 8-bit
characters). In practice, I doubt that applications will mix views,
so distinguishing them
by type is not likely to buy you much. The third issue is where
things get hairy,
but there isn't going to be a one-size-fits-all solution. For
example, I can render Unicode
directly on Mac OS X, but under X11 I need to select a font and get
the encoding right.
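To make the "view" idea concrete, here is a hedged sketch (the signature
and structure names are mine, not a Basis proposal): a view bundles
classification (situation 1) and comparison (situation 2) over one shared
underlying representation, so two views of 8-bit characters can coexist.

```sml
(* Hypothetical sketch: a "view" packages classification and
 * comparison over an abstract character representation. *)
signature CHAR_VIEW =
  sig
    type char                          (* the shared representation *)
    val isAlpha : char -> bool
    val compare : char * char -> order
  end

(* Two views of the same 8-bit representation (Char.char). *)
structure AsciiView : CHAR_VIEW =
  struct
    type char = Char.char
    fun isAlpha c =
          let val n = Char.ord c
          in (n >= 65 andalso n <= 90)      (* A-Z *)
             orelse (n >= 97 andalso n <= 122)  (* a-z *)
          end
    val compare = Char.compare
  end

structure Latin1View : CHAR_VIEW =
  struct
    type char = Char.char
    (* ISO-8859-1 adds letters in the 0xC0-0xFF range, except the
     * multiplication and division signs at 0xD7 and 0xF7. *)
    fun isAlpha c =
          let val n = Char.ord c
          in AsciiView.isAlpha c
             orelse (n >= 0xC0 andalso n <> 0xD7 andalso n <> 0xF7)
          end
    val compare = Char.compare
  end
```

Since both structures match CHAR_VIEW at the same representation type, a
program can select a view without any change of representation.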
There are two kinds of conversions between Char.char and
WideChar.char. One is to just embed
Char.char into WideChar.char (i.e., add high-order zero bits). The
second is to interpret
multibyte sequences in 8-bit strings as single wide characters. In
the second case, there
are multiple encoding schemes, although UTF-8 may be the only one to
care about. I think that
both kinds of conversions should be supported, but the second kind
must be defined independently of the home structures for the types.
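The two conversions can be sketched as follows (assuming a WideChar.char
wide enough for Unicode code points, as in MLton; the function names are
illustrative, not proposed Basis names, and the decoder does no checking
of malformed, overlong, or surrogate sequences):

```sml
(* 1. Embedding: zero-extend each 8-bit character. *)
fun embed (c : Char.char) : WideChar.char =
      WideChar.chr (Char.ord c)

(* 2. Decoding: interpret an 8-bit string as a UTF-8 encoded
 * sequence of wide characters.  Minimal decoder, no validation. *)
fun decodeUtf8 (s : string) : WideChar.char list =
  let
    fun byte i = Char.ord (String.sub (s, i))
    fun go i =
      if i >= size s then []
      else let val b = byte i
           in if b < 0x80                  (* 1-byte (ASCII) *)
                then WideChar.chr b :: go (i+1)
              else if b < 0xE0             (* 2-byte sequence *)
                then WideChar.chr
                       ((b - 0xC0) * 0x40 + (byte (i+1) - 0x80))
                     :: go (i+2)
              else if b < 0xF0             (* 3-byte sequence *)
                then WideChar.chr
                       (((b - 0xE0) * 0x40 + (byte (i+1) - 0x80)) * 0x40
                        + (byte (i+2) - 0x80))
                     :: go (i+3)
              else                         (* 4-byte sequence *)
                WideChar.chr
                  ((((b - 0xF0) * 0x40 + (byte (i+1) - 0x80)) * 0x40
                    + (byte (i+2) - 0x80)) * 0x40
                   + (byte (i+3) - 0x80))
                :: go (i+4)
           end
  in
    go 0
  end
```

Note that embed and decodeUtf8 disagree on any string containing bytes
above 0x7F, which is exactly why the two conversions must be kept apart:
embed (Char.chr 0xC3) yields one wide character, while decodeUtf8 treats
0xC3 as the start of a multibyte sequence.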
- John