[MLton] Unicode... again

Mon Feb 12 11:54:34 PST 2007

On Feb 12, 2007, at 8:20 PM, Matthew Fluet wrote:
>> - For the time being I choose to ignore the basis' claim that "in  
>> WideChar, the functions toLower, toLower, isAlpha,..., isUpper  
>> and, in general, the definition of a ``letter'' are locale- 
>> dependent" and raise an Unimplemented exception for these methods.  
>> I think the standard is dreadfully misguided in assuming a global  
>> locale, and I defer what to do here till later as it is what  
>> blocked my progress last time. (IMO these functions have only  
>> questionable use, anyway)
>
> Not to dismiss any of the thought and work already done, but I'm curious
> why another 'obvious' interpretation of WideChar hasn't been  
> explored. That is, why don't we take WideChar as an (admittedly  
> brain-dead) wrapping of functions defined in <wchar.h>.  These  
> descriptions of these functions seem to match the Basis Library  
> descriptions, in that they have a notion of the current locale.   
> Admittedly, WideChar wouldn't provide access to changing the locale  
> (the setlocale function), but this would seem consistent with other  
> portions of the SML Basis Library that provides just a thin veneer  
> over corresponding POSIX functions.

We could do that. My definitions of the is* methods are placeholders.
I consider these methods worse than useless; I'd rather they  
simply didn't exist. Since they do exist, mapping them to iswalpha,  
iswalnum, etc. might be ok... as long as these are portably  
available. Still, I'd rather try to use the locale-independent
character classes specified by Unicode. However, this is where I got
stuck last time, so I decided to skip it for now; in the grand
scheme of things, these methods are rather unimportant.

I can't find my old post, so I'll briefly summarize why I wish they  
didn't exist:
1. Hidden dependencies are bad.
2. A program needs multiple locales simultaneously if it interacts  
with multiple users.
3. A program may need to switch the 'global' locale (e.g., a login
program). How does this affect cached look-up tables?
4. A program may need WideChar, yet not be internationalized. Locale  
dependence might introduce bugs.
5. Typically, a locale also influences number and date parsing /  
output. A programmer not expecting this will have problems.
6. Even C++ threw out the C interface and provided its own, with  
explicit dependency on the locale.
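
To make points 1 and 6 concrete, here is a minimal C++ sketch of the two
styles. The function names is_alpha_global and is_alpha_in are
hypothetical and exist only for this illustration; the library calls
(std::iswalpha from <cwctype>, the two-argument std::isalpha from
<locale>) are standard.

```cpp
#include <cwctype>
#include <locale>

// C style: the result depends on hidden global state, namely whatever
// locale the most recent setlocale() call installed.
// (Hypothetical wrapper name, for illustration only.)
bool is_alpha_global(wchar_t c) {
    return std::iswalpha(static_cast<std::wint_t>(c)) != 0;
}

// C++ <locale> style: the locale is an explicit argument, so the
// dependency is visible at every call site and several locales can be
// used side by side. (Hypothetical wrapper name, for illustration only.)
bool is_alpha_in(wchar_t c, const std::locale& loc) {
    return std::isalpha(c, loc);
}
```

A caller that serves several users could hold one std::locale per user
and pass it to is_alpha_in, whereas is_alpha_global can only ever answer
for the single process-wide locale.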

Generally, there are several steps to localizing a program. First, it  
has to support the character set. Then, you have to isolate all of
the locale-dependent issues (literal strings, date/number formats,
text direction, etc) and predicate these on a locale. This is  
internationalization. Finally, you localize it by translating text,  
providing date/number formats, etc. Each step needs testing. Just  
flipping a global switch and praying is not a great idea.

If this were a perfect world, I'd make WideChar.is* use the Unicode  
locale-independent mappings always. Then we would have a Locale :  
LOCALE with a CharClass substructure that provided localized versions  
of these. An internationalized program would use these methods for
user interaction, but use the WideChar classes for non-user purposes.
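
The split described above might look roughly like this. It is written in
C++ rather than SML, and only as a rough sketch under stated
assumptions: the namespace wide_char, the class Locale and its nested
CharClass are hypothetical names mirroring the proposal, and the letter
table is a tiny hand-picked excerpt, not real Unicode data.

```cpp
#include <locale>

// Locale-independent classification, standing in for WideChar.isAlpha.
// ASSUMPTION: this range table is a small illustrative excerpt; a real
// implementation would be generated from the Unicode Character Database.
namespace wide_char {
struct Range { char32_t lo, hi; };
constexpr Range kLetters[] = {
    {0x0041, 0x005A}, {0x0061, 0x007A},                    // A-Z, a-z
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF},  // Latin-1 letters
    {0x0391, 0x03A1}, {0x03A3, 0x03A9},                    // Greek capitals
    {0x03B1, 0x03C9},                                      // Greek small letters
};
inline bool is_alpha(char32_t c) {
    for (const Range& r : kLetters)
        if (c >= r.lo && c <= r.hi) return true;
    return false;
}
}  // namespace wide_char

// Hypothetical counterpart of "Locale : LOCALE" with a CharClass
// substructure: the locale is a value the program passes around.
class Locale {
public:
    class CharClass {
    public:
        explicit CharClass(const std::locale& loc) : loc_(loc) {}
        // Localized classification, delegated to the wrapped locale.
        bool is_alpha(wchar_t c) const { return std::isalpha(c, loc_); }
    private:
        std::locale loc_;
    };
    explicit Locale(const std::locale& loc) : char_class(loc) {}
    CharClass char_class;
};
```

With this shape, code that talks to a user would go through a Locale
value's char_class, while non-user code would call wide_char::is_alpha
and get the same answer regardless of any locale setting.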