[MLton] WideChar
Adam Goode
adam@evdebs.org
Fri, 10 Dec 2004 20:54:43 -0500
On Sat, 2004-12-11 at 02:05 +0100, Wesley W. Terpstra wrote:
> On Fri, Dec 10, 2004 at 07:40:28PM -0500, Adam Goode wrote:
> > Right now, toUpper and toLower return the character unchanged if it
> > doesn't have a corresponding mapping. Shouldn't this just be the
> > behavior for the WideChar functions?
>
> I suppose that makes sense.
> So, you would leave ß as ß when converting toUpper in WideChar?
>
Well, we have the "SpecialCasing.txt" file too. :)
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
This covers mappings which are not 1-1 character-wise, or which are
locale-dependent. toUpper(ß) evaluates to SS, which is no longer a single
WideChar, but a two-character string! Might we raise an exception here,
since the character DOES have an uppercase mapping, just not a
single-character one?
Here is the es-zed case from SpecialCasing.txt:
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
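To make the exception idea concrete, here is a minimal sketch. It uses the
ordinary 8-bit Char (read as Latin-1) as a stand-in for WideChar, and the
names MultiCharMapping and toUpperStrict are made up for illustration; they
are not in the Basis Library or MLton. Only the es-zed entry is handled.

```sml
(* Hypothetical: raised when the upper-case mapping is longer than one
   character; the payload is the full mapping. *)
exception MultiCharMapping of string

(* Hypothetical strict variant of toUpper. Char stands in for WideChar. *)
fun toUpperStrict (c : char) : char =
   if Char.ord c = 0xDF              (* U+00DF LATIN SMALL LETTER SHARP S *)
      then raise MultiCharMapping "SS"
   else Char.toUpper c               (* unchanged if there is no mapping *)
```

A caller who cares could handle MultiCharMapping and splice in the payload;
everyone else gets a loud failure instead of a silently unchanged ß.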
> That is nasty.
> So, why not put these things in the locale and provide the Unicode
> defined classes and conversions in the WideChar.is* .toUpper/Lower?
>
> So, you get Unicode behaviour in WideChar+Char, but if you want
> locale-correct conversions for German or Turkish, you use:
>
> signature LOCALE =
> sig
> ...
> val toLower: t -> char -> string
> val isUpper: t -> char -> bool
> end
>
Seems like this is the only thing to do. Unicode encourages programmers
to write all their character conversion functions to work with whole
strings to avoid these problems with non-simple cases. Does that make
sense here or is it overkill?
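A rough sketch of what the whole-string approach might look like, again with
8-bit Char (read as Latin-1) standing in for WideChar. The names
specialUpper, upperChar and toUpperString are hypothetical, and the table
holds only the es-zed entry from SpecialCasing.txt.

```sml
(* Hypothetical lookup of multi-character upper-case mappings. *)
fun specialUpper (c : char) : string option =
   if Char.ord c = 0xDF then SOME "SS" else NONE

(* Map one character to its (possibly longer) upper-case string. *)
fun upperChar (c : char) : string =
   case specialUpper c of
      SOME s => s
    | NONE => String.str (Char.toUpper c)

(* Whole-string conversion: the result may be longer than the input. *)
val toUpperString : string -> string = String.translate upperChar

(* "\223" is the decimal escape for code point 0xDF (es-zed). *)
val example = toUpperString "stra\223e"   (* "STRASSE" *)
```

Since the per-character function returns a string, the 1-to-many cases need
no exception at all; the cost is that there is no char -> char toUpper that
is correct for every input.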
Adam