[MLton] Unicode... again
Wesley W. Terpstra
terpstra at gkec.tu-darmstadt.de
Mon Feb 12 11:54:34 PST 2007
On Feb 12, 2007, at 8:20 PM, Matthew Fluet wrote:
>> - For the time being I choose to ignore the basis' claim that "in
>> WideChar, the functions toLower, toLower, isAlpha,..., isUpper
>> and, in general, the definition of a ``letter'' are locale-
>> dependent" and raise an Unimplemented exception for these methods.
>> I think the standard is dreadfully misguided in assuming a global
>> locale, and I defer what to do here till later as it is what
>> blocked my progress last time. (IMO these functions have only
>> questionable use, anyway)
> Not to dismiss any of the thought and work already done, but I'm
> wondering why another 'obvious' interpretation of WideChar hasn't been
> explored. That is, why don't we take WideChar as an (admittedly
> brain-dead) wrapping of functions defined in <wchar.h>? The
> descriptions of these functions seem to match the Basis Library
> descriptions, in that they have a notion of the current locale.
> Admittedly, WideChar wouldn't provide access to changing the locale
> (the setlocale function), but this would seem consistent with other
> portions of the SML Basis Library that provide just a thin veneer
> over corresponding POSIX functions.
We could do that. My definitions of the is* methods are place-
holders. I consider these methods worse than useless; I'd rather they
simply didn't exist. Since they do exist, mapping them to iswalpha,
iswalnum, etc. might be ok... as long as these are portably
available. Still, I'd rather try to use the locale independent
character classes specified by Unicode. However, this is where I got
stuck last time, so I decided to skip it for now, as in the grand
scheme, these methods are rather unimportant.
I can't find my old post, so I'll briefly summarize why I wish they
didn't exist:
1. Hidden dependencies are bad
2. A program needs multiple locales simultaneously if it interacts
with multiple users.
3. A program may need to switch the 'global' locale (eg: a login
program). How does this affect cached look-up tables?
4. A program may need WideChar, yet not be internationalized. Locale
dependence might introduce bugs.
5. Typically, a locale also influences number and date parsing /
output. A programmer not expecting this will have problems.
6. Even C++ threw out the C interface and provided its own, with
explicit dependency on the locale.
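
Points 1 and 3 can be made concrete with the C interface under discussion. The following is only a sketch: is_alpha_under is a hypothetical helper, not part of any library, and it deliberately switches the process-wide LC_CTYPE locale to show that the result of iswalpha depends on state the caller never passes in. Any lookup table cached from an earlier iswalpha call would silently go stale after such a switch.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical helper: classify c with iswalpha while the global
 * LC_CTYPE locale is temporarily set to locale_name.
 * Returns 1 or 0 for the classification, or -1 if the locale is not
 * available or memory cannot be allocated. */
int is_alpha_under(const char *locale_name, wint_t c)
{
    /* setlocale(category, NULL) only queries the current setting; the
     * returned string may be overwritten by later calls, so copy it. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int result = -1;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        /* The answer depends on the global locale, not on any argument
         * of iswalpha itself: this is the hidden dependency. */
        result = iswalpha(c) != 0;
        setlocale(LC_CTYPE, saved);  /* restore the previous locale */
    }
    free(saved);
    return result;
}
```

Calling is_alpha_under("C", L'a') and then the same function with another installed locale name can give different answers for non-ASCII characters, depending on the platform's locale data. The helper is also not thread-safe, since setlocale affects the whole process.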
Generally, there are several steps to localizing a program. First, it
has to support the character set. Then, you have to isolate all of
the locale-dependent issues (literal strings, date/number formats,
text direction, etc) and predicate these on a locale. This is
internationalization. Finally, you localize it by translating text,
providing date/number formats, etc. Each step needs testing. Just
flipping a global switch and praying is not a great idea.
If this were a perfect world, I'd make WideChar.is* use the Unicode
locale-independent mappings always. Then we would have a Locale :
LOCALE with a CharClass substructure that provided localized versions
of these. An internationalized program would use these methods for
user-interaction, but use the WideChar classes for non-user purposes.
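
A rough sketch of that split, written in C rather than SML so it can sit next to the <wchar.h> discussion. Everything here is an assumption made for illustration: the names codepoint, char_locale, simple_to_upper and locale_to_upper are hypothetical, the mapping covers only a tiny slice of Unicode, and case mapping stands in for the character classes. The point is only that the locale is an explicit argument, while the default function never consults global state.

```c
#include <stdint.h>

typedef uint32_t codepoint;

/* Locale-independent mapping (tiny subset of Unicode simple case
 * mappings, for illustration only). */
codepoint simple_to_upper(codepoint c)
{
    if (c >= 'a' && c <= 'z')
        return c - ('a' - 'A');              /* Basic Latin */
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return c - 0x20;                     /* Latin-1, skipping U+00F7 */
    if (c == 0x0131)
        return 'I';                          /* dotless i */
    return c;
}

/* A locale is an ordinary value that carries its own mappings. */
typedef struct char_locale {
    const char *name;
    codepoint (*to_upper)(codepoint);
} char_locale;

/* Turkish uppercases 'i' to U+0130 (capital I with dot above). */
static codepoint turkish_to_upper(codepoint c)
{
    if (c == 'i')
        return 0x0130;
    return simple_to_upper(c);
}

const char_locale locale_root = { "root", simple_to_upper };
const char_locale locale_tr   = { "tr", turkish_to_upper };

/* Localized variant: the dependency on the locale is visible in the
 * signature instead of hiding in a global. */
codepoint locale_to_upper(const char_locale *loc, codepoint c)
{
    return loc->to_upper(c);
}
```

With this shape, code serving two users can hold two char_locale values at once, and non-user code (parsers, protocol handling) calls simple_to_upper and gets the same answer on every machine.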