[MLton] WideChar

Fri, 10 Dec 2004 20:54:43 -0500

On Sat, 2004-12-11 at 02:05 +0100, Wesley W. Terpstra wrote:
> On Fri, Dec 10, 2004 at 07:40:28PM -0500, Adam Goode wrote: 
> > Right now, toUpper and toLower return the character unchanged if it
> > doesn't have a corresponding mapping. Shouldn't this just be the
> > behavior for the WideChar functions?
> 
> I suppose that makes sense.
> So, you would leave ß as ß when converting toUpper in WideChar?
> 

Well, we have the "SpecialCasing.txt" file too. :)
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

This covers mappings that are not 1-1 character-wise, or that are
locale-dependent. toUpper(ß) evaluates to SS, which is no longer a
single WideChar but a two-character string! Might we throw an exception
here, since the character DOES have an uppercase mapping, just not a
single-character one?

Here is the es-zed case from SpecialCasing.txt:

# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
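
A rough sketch of the exception idea, assuming code points are plain
ints rather than a real WideChar type. The exception name
NoSingleCharMapping, the one-entry table and the ASCII-only simpleUpper
are all made up for illustration; none of this is Basis Library API:

```sml
(* Hypothetical: raised when the full uppercase mapping is longer than
   one code point. Carries the full mapping so callers can recover. *)
exception NoSingleCharMapping of int list

(* Full uppercase mappings taken from SpecialCasing.txt.
   Only the es-zed entry (00DF -> 0053 0053) is listed here. *)
val specialUpper : (int * int list) list = [(0xDF, [0x53, 0x53])]

(* Simple 1-1 mapping; ASCII only, for brevity. *)
fun simpleUpper (c : int) : int =
  if 0x61 <= c andalso c <= 0x7A then c - 0x20 else c

(* Returns the character unchanged when it has no mapping at all, but
   raises when the mapping exists and is not a single character. *)
fun toUpper (c : int) : int =
  case List.find (fn (k, _) => k = c) specialUpper of
    SOME (_, [u]) => u
  | SOME (_, us) => raise NoSingleCharMapping us
  | NONE => simpleUpper c
```

With this, toUpper 0x61 gives 0x41, while toUpper 0xDF raises
NoSingleCharMapping [0x53, 0x53].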

> That is nasty.
> So, why not put these things in the locale and provide the Unicode
> defined classes and conversions in the WideChar.is* .toUpper/Lower?
> 
> So, you get Unicode behaviour in WideChar+Char, but if you want
> locale-correct conversions for German or Turkish, you use:
> 
> signature LOCALE =
>   sig
>     ...
>     val toLower: t -> char -> string
>     val isUpper: t -> char -> bool
>   end
> 

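To make the quoted idea concrete, here is a minimal sketch of a
locale-parameterised toLower. It is not the LOCALE signature above (the
elided parts stay elided); LOCALE_SKETCH, LocaleSketch, default and
turkish are hypothetical names, and code points are ints with results
as code point lists:

```sml
(* Hypothetical stand-in for a locale signature. *)
signature LOCALE_SKETCH =
  sig
    type t
    val default : t
    val turkish : t
    val toLower : t -> int -> int list
  end

structure LocaleSketch :> LOCALE_SKETCH =
  struct
    datatype t = Default | Turkish
    val default = Default
    val turkish = Turkish

    (* Simple 1-1 mapping; ASCII only, for brevity. *)
    fun simpleLower c =
      if 0x41 <= c andalso c <= 0x5A then c + 0x20 else c

    (* Turkish lowers capital I (U+0049) to dotless i (U+0131);
       every other case falls back to the simple mapping. *)
    fun toLower Turkish 0x49 = [0x131]
      | toLower _ c = [simpleLower c]
  end
```

So LocaleSketch.toLower LocaleSketch.turkish 0x49 gives [0x131], and
the same call with LocaleSketch.default gives [0x69].
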
Seems like this is the only thing to do. Unicode encourages programmers
to write all their character conversion functions to work with whole
strings to avoid these problems with non-simple cases. Does that make
sense here or is it overkill?
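
For comparison, a whole-string version might look roughly like this.
Again only a sketch: a "string" is a list of int code points, and
specialUpper, simpleUpper, upperOf and toUpperString are illustrative
names, not an existing API:

```sml
(* Full uppercase mappings from SpecialCasing.txt; es-zed only. *)
val specialUpper : (int * int list) list = [(0xDF, [0x53, 0x53])]

(* Simple 1-1 mapping; ASCII only, for brevity. *)
fun simpleUpper (c : int) : int =
  if 0x61 <= c andalso c <= 0x7A then c - 0x20 else c

(* Uppercase of one code point, as a possibly longer sequence. *)
fun upperOf (c : int) : int list =
  case List.find (fn (k, _) => k = c) specialUpper of
    SOME (_, us) => us
  | NONE => [simpleUpper c]

(* Working on whole strings means the result may grow, so no
   exception is needed for the non-simple cases. *)
fun toUpperString (cs : int list) : int list =
  List.concat (map upperOf cs)
```

Here "straße" (six code points) comes out as "STRASSE" (seven).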

Adam