[MLton] WideChar

Adam Goode adam@evdebs.org
Fri, 10 Dec 2004 20:54:43 -0500

On Sat, 2004-12-11 at 02:05 +0100, Wesley W. Terpstra wrote:
> On Fri, Dec 10, 2004 at 07:40:28PM -0500, Adam Goode wrote: 
> > Right now, toUpper and toLower return the character unchanged if it
> > doesn't have a corresponding mapping. Shouldn't this just be the
> > behavior for the WideChar functions?
> I suppose that makes sense.
> So, you would leave ß as ß when converting toUpper in WideChar?

Well, we have the "SpecialCasing.txt" file too. :)

This file covers mappings that are not 1-1 character-wise, as well as
locale-dependent ones. toUpper(ß) evaluates to SS, which is no longer a
single WideChar, but something else! Might we throw an exception here,
since the character DOES have an uppercase mapping, just not a single
character?

Here is the es-zed case from SpecialCasing.txt:

# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
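
To make the field layout concrete, here is a rough sketch of reading one
such entry in SML. The names hexList and parseLine are made up for
illustration, code points are plain ints rather than any real WideChar
type, and the optional condition list is ignored.

```sml
(* Parse space-separated hex code points, e.g. "0053 0073" -> [0x53, 0x73].
   Tokens that do not scan as hex are silently dropped in this sketch. *)
fun hexList (s : string) : int list =
  List.mapPartial (StringCvt.scanString (Int.scan StringCvt.HEX))
                  (String.tokens Char.isSpace s)

(* Drop the trailing "# comment", split on ';', and read the four fixed
   fields. Any further field (a condition list) is ignored here. *)
fun parseLine (line : string) =
  let
    val body = hd (String.fields (fn c => c = #"#") line)
  in
    case String.fields (fn c => c = #";") body of
        code :: lower :: title :: upper :: _ =>
          (case hexList code of
               [c] => SOME {code = c, lower = hexList lower,
                            title = hexList title, upper = hexList upper}
             | _ => NONE)
      | _ => NONE
  end
```

Comment-only lines have an empty body before the '#', so parseLine
returns NONE for them.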

> That is nasty.
> So, why not put these things in the locale and provide the Unicode
> defined classes and conversions in the WideChar.is* .toUpper/Lower?
> So, you get Unicode behaviour in WideChar+Char, but if you want
> locale-correct conversions for German or Turkish, you use:
> signature LOCALE =
>   sig
>     ...
>     val toLower: t -> char -> string
>     val isUpper: t -> char -> bool
>   end

Seems like this is the only thing to do. Unicode encourages programmers
to write all their character conversion functions to work with whole
strings to avoid these problems with non-simple cases. Does that make
sense here or is it overkill?
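
For comparison, a minimal sketch of the two options side by side: a
char-level function that raises on a multi-character mapping, and a
string-level function that can grow the result. Everything here is
hypothetical: wchar and wstring are int-based stand-ins for a real
WideChar, only es-zed is in the special table, and the simple mapping
covers ASCII only.

```sml
(* Stand-ins for WideChar: a code point is an int, a string is a list. *)
type wchar = int
type wstring = wchar list

exception MultiCharMapping of wchar

(* Multi-character uppercase mappings; only es-zed is listed here. *)
fun specialUpper (c : wchar) : wstring option =
  if c = 0xDF then SOME [0x53, 0x53] else NONE

(* Simple 1-1 mapping, ASCII a-z only in this sketch; any other code
   point is returned unchanged. *)
fun simpleUpper (c : wchar) : wchar =
  if c >= 0x61 andalso c <= 0x7A then c - 0x20 else c

(* Char-level: raise when the uppercase form is not a single character. *)
fun toUpperChar (c : wchar) : wchar =
  case specialUpper c of
      SOME _ => raise MultiCharMapping c
    | NONE => simpleUpper c

(* String-level: the result may be longer than the input. *)
fun toUpperString (s : wstring) : wstring =
  List.concat (map (fn c => case specialUpper c of
                                SOME cs => cs
                              | NONE => [simpleUpper c]) s)
```

With this, toUpperString maps the code points for "sß" to those for
"SSS", while toUpperChar 0xDF raises MultiCharMapping.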