[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support
Wesley W. Terpstra
wesley@terpstra.ca
Tue, 29 Nov 2005 16:04:11 +0100
I've re-ordered parts of your email to address things with some
linearity. :-)
On Nov 27, 2005, at 10:24 PM, Dave Berry wrote:
> If I understand your proposal correctly, you are suggesting that we
> make WideChar always be Unicode, make the existing WideChar use the
> default categorisation of Unicode, and add a new module for locale-
> dependent operations.
That's exactly what I am proposing.
While we're at it, there needs to be a charset encoder/decoder
included too. Locale-specific date/time/number/currency formatting
would also be good, as would something like gettext.
> Perhaps it would make sense to have an 8-bit equivalent of the
> locale-dependent module as well? Then programmers could explicitly
> support ISO-8859-1 (and -2, -3, etc.)
I think this would add little. You can decode your ISO-8859-x (or
whatever) input into Unicode, and then work with it there. This is
more flexible anyway, since your code will then work with any
character set.
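For ISO-8859-1 in particular, the decoding step is trivial, since its
256 code points coincide with the first 256 Unicode code points. A
minimal sketch, assuming WideChar is Unicode (the function name is
mine):

(* Decode an ISO-8859-1 string into a WideString: Latin-1 code points
   are identical to Unicode code points 0-255, so the conversion is
   just ord followed by chr. *)
fun latin1ToWide (s : string) : WideString.string =
   WideString.implode (map (WideChar.chr o Char.ord) (String.explode s))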
> I'm not familiar with isNumber, but it looks a reasonable
> suggestion to support it. Which characters are included in
> isNumber but not isDigit?
In ISO-8859-1, there are the ¼ and ½ characters (0xBC and 0xBD). Also,
other languages have characters for numbers that aren't necessarily
decimal, and thus can't be called digits (base 10).
> I think we can remove the requirement that isAlpha = isLower +
> isUpper for WideChar. I assume the rationale for this is that some
> languages don't have the concept of case?
Yes.
> I believe that the reason that chr and ord deal in ints is purely
> for backwards compatibility. So I guess that having chr raise an
> exception for values > 10FFFF would work OK, when WideChar == Unicode.
That's what I will do then.
If we are banning values beyond 10FFFF, then perhaps we should also
ban values in the range D800-DFFF, which may not appear in a
conforming UTF-32 string.
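A sketch of the check I have in mind (isUnicodeScalar is just my name
for it, not a proposal for the Basis):

(* A valid Unicode scalar value is in 0..0x10FFFF and is not a
   surrogate (0xD800..0xDFFF); chr could raise Chr for anything else. *)
fun isUnicodeScalar (i : int) : bool =
   i >= 0 andalso i <= 0x10FFFF
   andalso not (i >= 0xD800 andalso i <= 0xDFFF)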
I also think the restriction that WideChar be a multiple of 8 bits
should be removed. It serves no real purpose, AFAICT, so why limit an
implementer's choices?
> If we allow source files that are encoded in UTF-8, what effect
> would this have on portability to compilers that don't use
> Unicode? Or, to put this another way, what would be the minimum
> amount of support that an implementation would have to provide for
> UTF-8, and how much work would it be to implement?
Compilers without Unicode support already do the right thing: they
complain if given high ASCII. UTF-8 includes ASCII as a subset, so
any file that uses only ASCII will work with both kinds of compiler.
If you have high ASCII, it means you have included Unicode values in
your strings, and therefore the SML source file requires Unicode. If
the compiler doesn't support Unicode, that is grounds for an error.
So, as far as I can see, the minimum work to implement this is
changing the error message from something like 'high ascii forbidden'
to 'this compiler doesn't support Unicode'.
If you do want to add Unicode support, then you already have a working
WideChar/WideString. Decoding UTF-8 into a WideChar is about 10-20
lines, so that's not much additional effort either. The real work is
getting MLlex to support such a large character set. However, that's
only needed for Unicode-enabled SML compilers.
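To give a feel for the 10-20 lines claim, here is a rough sketch of
such a decoder. It returns code points as ints (ready for
WideChar.chr), raises an exception on malformed input, and doesn't
bother rejecting overlong encodings or surrogates:

(* Rough sketch: decode a UTF-8 encoded string into a list of Unicode
   code points.  Malformed bytes raise Chr (a truncated final sequence
   will raise Subscript); overlong forms are not rejected. *)
fun decodeUtf8 (s : string) : int list =
   let
      fun byte i = Char.ord (String.sub (s, i))
      (* continuation byte 10xxxxxx: return its low six bits *)
      fun cont i =
         let val b = byte i
         in if b div 64 = 2 then b mod 64 else raise Chr end
      fun go i =
         if i >= size s then []
         else
            let val b = byte i
            in
               if b < 0x80 then b :: go (i + 1)
               else if b < 0xC0 then raise Chr  (* stray continuation byte *)
               else if b < 0xE0 then
                  (b mod 32) * 64 + cont (i + 1) :: go (i + 2)
               else if b < 0xF0 then
                  ((b mod 16) * 64 + cont (i + 1)) * 64 + cont (i + 2)
                     :: go (i + 3)
               else
                  (((b mod 8) * 64 + cont (i + 1)) * 64 + cont (i + 2)) * 64
                     + cont (i + 3) :: go (i + 4)
            end
   in
      go 0
   end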
> Your first question is about the character set of the Char
> structure. The idea behind this structure is that it should be the
> locale-independent 7-bit ASCII characters, with the other 128
> characters having no special semantics - analogous to the "C" locale.
The problem is that Char.is* assigns semantics to the high ASCII
characters. At least for me, the distinction between a simple Word8
and a Char is that a Char carries a character set with it. Observe
the difference in their interfaces. The fact that all those methods
are defined for 'char' means that you *have* assigned semantic
meaning to high ASCII: you've defined that every high ASCII character
is neither a control character nor printable.
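To make that concrete: é is 0xE9 in ISO-8859-1 and clearly a printable
letter there, yet under that reading both of these come back false:

(* Under the 'neither control nor printable' reading of high
   characters, both classifications come back false for 0xE9. *)
val printable = Char.isPrint (Char.chr 0xE9)   (* false under that reading *)
val control   = Char.isCntrl (Char.chr 0xE9)   (* false under that reading *)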
> It may be pragmatic to specify Char to be ISO-8859-1, to match
> Unicode (and HTML). However, I'm against it because it gives
> people a misplaced expectation that it significantly addresses the
> internationalisation/localisation question. E.g. I think your
> statement that ISO-8859-1 covers most of the "major" European
> languages is culturally biased.
I concede the point that ISO-8859-1 is inadequate for Europe. :-)
My primary motivation for specifying that Char = ISO-8859-1 was that
I wanted the character set to be a subset of Unicode. I thought that
this would be the path of 'least surprise' for a programmer migrating
his code from Char to WideChar. However, changing this definition
could break existing SML programs, which expect isAlpha to return
false above 7F.
> Underlying your whole post is the assumption that WideChar
> characters must be using Unicode. This is not an assumption that
> the Basis makes - it allows for other wide character sets. The
> WideChar structure was modelled on the C wchar_t type, which in
> turn was designed to support a character-set independent approach
> to handling international characters, as opposed to the universal
> character set approach of Unicode. I don't know whether C still
> takes this approach or whether it's the best one to take, but it
> may explain why the structure is specified as it is.
Ok. This makes sense, and clears up a lot of the background reasoning
for me.
If I were to predict the future, I would say that Unicode is the
ASCII of tomorrow.
The basis already grants privileged status to ASCII, so it should for
Unicode too.
That said, I agree that it is useful to allow extra structures that
match the CHAR interface. The Russians might like their SML
implementations to include a KOI8R structure. However, after lifting
the 'multiple of 8 bits' restriction, I'd like to impose a different
one: 'must be fixed width'. This means that UTF-8 and UTF-16 may not
match the CHAR signature.
With respect to WideChar, it intuitively appeared to me that this was
trying to be like LargeInt, i.e. something which could contain all
integers, or in this case, all characters. This is exactly what
Unicode tries to be: a superset of all character sets. Therefore, I
would argue that specifying that WideChar MUST be Unicode is a
perfectly natural thing to do.
> I'd rather keep Char as 7-bit ASCII.
Then why not raise Chr if you try to put in a value above 0x7F?
> There's nothing preventing any implementation from implementing
> other structures that match CHAR - they just won't be portable if
> they rely on compiler magic. I'd have thought we could consider a
> Char16 structure if enough people are interested.
Keeping with the mindset that a structure matching CHAR is in fact a
character set, not just a bag of integers, how about this:
Char (8-bit, high ASCII 'undefined') <-- required (raises Chr for values beyond FF)
Ascii (7-bit) <-- required (raises Chr for values beyond 7F)
Iso8859_1 (8-bit) <-- optional (raises Chr for values beyond FF)
Ucs2 (16-bit) <-- optional (raises Chr for surrogates and values beyond FFFF)
WideChar (must be Unicode) <-- optional (raises Chr for surrogates and values beyond 10FFFF)
... plus any number of locale-specific charsets the implementor likes.
We have nice subset behaviour for Ascii, Iso8859_1, Ucs2, WideChar.
Char works as it always has, and is explicitly NOT a subset of the
others, though it agrees for all values which have isAscii = true.
This fact should be documented in bright flashing red.
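With that subset behaviour, moving between the structures is just ord
followed by chr; a hypothetical sketch (Ucs2 being the optional
structure proposed above):

(* Widening always succeeds; narrowing raises Chr whenever the code
   point does not fit the target character set. *)
fun ucs2ToWide (c : Ucs2.char) : WideChar.char = WideChar.chr (Ucs2.ord c)
fun wideToUcs2 (c : WideChar.char) : Ucs2.char = Ucs2.chr (WideChar.ord c)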
One question is whether or not the Ucs2/Iso8859_1/Ascii structures
should have all of the extra structures that go with them
(Ucs2String, Ucs2Vector, Ucs2Substring, ...).
> You are right that the basis does not specify locale parameters or
> how to set global locales. It does use a global model - the
> perceived advantage being that the same code could be run in
> different locales just by changing the environment, rather than
> changing the code.
I think that this is a very bad thing to do.
First, it is rarely as simple as just changing an environment
variable unless the application developer has put effort into
internationalization. Simply using WideChar does not indicate that
the program has been internationalized; I can think of several non-
internationalized applications which would need to use WideChar. If a
program has not been carefully internationalized, it is quite
possible that changing the environment locale will render the
software inoperable or introduce mysterious bugs (imagine if bash
looked at the locale variable when running shell scripts; how many
would break if the number formatting changed?). If a programmer is
using a method that depends on the environment, I think this should
be made very clear in the interface to help prevent such problems.
Second, only single-user applications have a single locale; if the
application is a server, then it needs to be able to operate in a
different locale for each user. Furthermore, even single-user
applications may need to operate with multiple locales. For example,
a login program (like gdm) needs to allow a user to select his
language/locale as part of the login procedure.
> Setting the locale was left for either an extension to the Basis or
> for the environment to specify.
Allowing the global locale to be changed is an even more frightening
prospect. Doesn't this deeply conflict with the design principles of
a functional programming language? It would predicate large
components of the software on what amounts to a mutable global
variable. Suppose I memoized some function of mine that internally
uses the is* methods: how would a global locale switch affect it?
Hidden dependencies are simply bad.
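To make the memoisation worry concrete, here is a hypothetical sketch;
the locale-sensitivity of WideChar.isSpace is exactly what the global
model would allow:

(* A memoised, locale-sensitive classifier.  If some other part of the
   program later switches the global locale, every cached answer here
   is silently stale. *)
local
   val cache : (WideChar.char * bool) list ref = ref []
in
   fun memoIsSpace c =
      case List.find (fn (c', _) => c' = c) (!cache) of
         SOME (_, b) => b
       | NONE =>
            let val b = WideChar.isSpace c  (* locale-dependent under the global model *)
            in cache := (c, b) :: !cache; b end
end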
I see no problem with having an immutable startup locale which is
specified by the environment. This is similar in some respects to a
command-line argument. However, I would argue that nothing should be
predicated on it unless specifically instructed by the programmer,
e.g.:
signature LOCALE =
sig
   type locale
   val initialLocale : locale
   val lookupLocale : string -> locale
   structure CharCategories : sig
      val isSpace : locale * WideChar.char -> bool
      ...
   end
end
This addresses all of my concerns, and still provides the
functionality you would have gotten from a C-style global locale.
BTW, you will notice this is the same approach taken by C++, which
also recognized the problem with a hidden global locale and chose to
throw out the C scheme. Of course, rather than a pair of arguments,
it has locale objects, but this amounts to more or less the same
thing.
> Your suggestions on parsing and serialisation seem reasonable to me.
You understand why I listed it as an incompatible change?
What I suggested means that 0xDA, previously converted to '\xDA', is
now converted to '\xC3\x9A'.
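The two bytes fall straight out of the UTF-8 rule for code points in
0x80-0x7FF (110xxxxx 10xxxxxx); a little sketch of just that case:

(* UTF-8 encoding of a code point in 0x80..0x7FF; for 0xDA this
   yields the bytes 0xC3 0x9A. *)
fun encode2 (cp : int) : string =
   String.implode [Char.chr (0xC0 + cp div 64), Char.chr (0x80 + cp mod 64)]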
Actually, I just did some poking around, and found this:
#include <stdio.h>
#include <wchar.h>

int main() {
    wchar_t x = L'\U12345678';
    printf("%x\n", (int)x);
    return 0;
}
So, forget the bit about toCString being a problem. C99 adds \u and
\U. For consistency, I suppose SML should accept the eight-digit form
\U12345678 rather than a six-digit \U123456, even though the first
two digits must always be zero, since the value has to be less than
0x110000.
Anyway, my new proposal for CHAR no longer has any points that
would break compatibility with existing SML programs. That's a pretty
big improvement from just one email round. Keep the comments coming,
please. :-)