[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support
Dave Berry
dave@berrybental.me.uk
Sun, 27 Nov 2005 21:24:26 +0000
Hi Wesley,
It's good to see someone working on Unicode & SML. IMO, this area of the
Basis is likely to need some tweaks (at least) as we gain more practical
experience.
Your first question is about the character set of the Char structure. The
idea behind this structure is that it should be the locale-independent
7-bit ASCII characters, with the other 128 characters having no special
semantics - analogous to the "C" locale. For other character sets, you
need to use WideChar. This was largely a pragmatic decision, so that we
could rely on one locale-independent character set that was easy to
implement, while still providing locale support for those who needed it.
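To make the analogy concrete (a Python illustration, not SML): Python's bytes methods classify only 7-bit ASCII, much like the Basis Char structure, while str methods follow Unicode rules, as a Unicode WideChar would.

```python
# bytes classification is ASCII-only: byte 0xE9 has no letter semantics,
# analogous to Char treating values above 127 as meaningless.
ascii_alpha = b"A".isalpha()      # ASCII letter
high_byte   = b"\xe9".isalpha()   # 0xE9: not a 7-bit ASCII letter
# str classification follows Unicode: the same code point is a letter.
wide_alpha  = "\u00e9".isalpha()  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(ascii_alpha, high_byte, wide_alpha)
```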
You are right that the basis does not specify locale parameters or how to
set global locales. It does use a global model; the perceived advantage
being that the same code could be run in different locales just by changing
the environment, rather than changing the code. Setting the locale was
left for either an extension to the Basis or for the environment to specify.
You are also right that (WideChar.isX o WideChar.chr o Char.ord) !=
Char.isX, but only if (a) the character set used for WideChar is not a
superset of 7-bit ASCII, or (b) the character tested is > chr(127), which
is outwith the defined range of meaningful values for Char. If you are
dealing with ISO-8859-1 (say) then Char is by definition inadequate.
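The divergence above chr(127) can be sketched as follows (a Python analogue under the stated assumption that WideChar uses Unicode classification; the helper names are mine, not from the Basis):

```python
def ascii_is_alpha(code: int) -> bool:
    # Char.isAlpha analogue: only the 7-bit ASCII letters count.
    return 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A

def unicode_is_alpha(code: int) -> bool:
    # WideChar.isAlpha analogue, assuming Unicode classification.
    return chr(code).isalpha()

# The two agree on the 7-bit range when WideChar's character set is a
# superset of ASCII...
agree_ascii = all(ascii_is_alpha(c) == unicode_is_alpha(c) for c in range(128))
# ...but diverge above chr(127): U+00B5 MICRO SIGN is a Unicode letter.
print(agree_ascii, ascii_is_alpha(0xB5), unicode_is_alpha(0xB5))  # True False True
```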
Underlying your whole post is the assumption that WideChar characters must
be using Unicode. This is not an assumption that the Basis makes - it
allows for other wide character sets. The WideChar structure was modelled
on the C wchar_t type, which in turn was designed to support a
character-set independent approach to handling international characters, as
opposed to the universal character set approach of Unicode. I don't know
whether C still takes this approach or whether it's the best one to take,
but it may explain why the structure is specified as it is.
If I understand your proposal correctly, you are suggesting that we make
WideChar always be Unicode, have its existing classification functions use
the default categorisation of Unicode, and add a new module for locale-dependent
operations. That seems a plausible approach. It really needs someone to
implement it and try it in anger.
Perhaps it would make sense to have an 8-bit equivalent of the
locale-dependent module as well? Then programmers could explicitly support
ISO-8859-1 (and -2, -3, etc.).
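The need for per-charset modules shows up already at the decoding level (a Python illustration): the same 8-bit value denotes different characters in different ISO-8859 parts, so classification results would differ too.

```python
# Byte 0xB3 is SUPERSCRIPT THREE in ISO-8859-1 but LATIN SMALL LETTER L
# WITH STROKE in ISO-8859-2 -- a digit-like symbol in one charset, a
# Polish letter in the other.
byte = b"\xb3"
latin1 = byte.decode("iso-8859-1")  # '³' (U+00B3)
latin2 = byte.decode("iso-8859-2")  # 'ł' (U+0142)
print(latin1, latin1.isalpha(), latin2, latin2.isalpha())
```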
I'm not familiar with isNumber, but it looks like a reasonable suggestion to
support it. Which characters are included in isNumber but not isDigit?
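For what it's worth, Unicode's general categories suggest an answer (a Python illustration of the gap an isNumber-but-not-isDigit test would expose): decimal digits are category Nd, while fractions and letter-numerals fall under No and Nl.

```python
import unicodedata

vulgar_half   = "\u00BD"  # VULGAR FRACTION ONE HALF, category No
roman_twelve  = "\u216B"  # ROMAN NUMERAL TWELVE, category Nl
decimal_seven = "7"       # category Nd

# The first two are numeric but not digits; only Nd counts as a digit.
for ch in (vulgar_half, roman_twelve, decimal_seven):
    print(ch, unicodedata.category(ch), ch.isnumeric(), ch.isdigit())
```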
I think we can remove the requirement that isAlpha = isLower + isUpper for
WideChar. I assume the rationale for this is that some languages don't
have the concept of case?
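That rationale is easy to demonstrate (again in Python, under Unicode classification): letters in caseless scripts are alphabetic while being neither lowercase nor uppercase.

```python
# Han and Devanagari letters (category Lo, "Letter, other") have no case,
# so isAlpha cannot equal isLower-or-isUpper for them.
han  = "\u4E2D"  # Han ideograph
deva = "\u0915"  # Devanagari letter KA
for ch in (han, deva):
    print(ch, ch.isalpha(), ch.islower(), ch.isupper())
```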
It may be pragmatic to specify Char to be ISO-8859-1, to match Unicode (and
HTML). However, I'm against it because it gives people a misplaced
expectation that it significantly addresses the
internationalisation/localisation question. E.g. I think your statement
that ISO-8859-1 covers most of the "major" European languages is culturally
biased. Even if we define "major" as the main official languages of states
in the European Union, several are not covered (e.g. Polish, Czech, Greek,
Slovak, Maltese, Latvian, Lithuanian, ...). I think it's worth noting that
Poland is one of the larger EU states. (And as I live in Scotland, I'll
mention the Celtic languages of Gaelic and Welsh, while conceding that
these are spoken by small populations). I'd rather keep Char as 7-bit ASCII.
Moving on to your section 2, I believe that the reason that chr and ord
deal in ints is purely for backwards compatibility. So I guess that having
chr raise an exception for values > 0x10FFFF would work OK when WideChar ==
Unicode.
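As a point of comparison (a Python parallel, not a Basis specification), Python's chr enforces exactly this ceiling: 0x10FFFF is the highest Unicode scalar value it accepts, and anything above it raises an exception.

```python
# The highest valid code point round-trips through chr/ord...
assert ord(chr(0x10FFFF)) == 0x10FFFF
# ...while anything above the Unicode range is rejected.
try:
    chr(0x110000)
except ValueError as e:
    print("rejected:", e)
```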
There's nothing preventing any implementation from implementing other
structures that match CHAR - they just won't be portable if they rely on
compiler magic. I'd have thought we could consider a Char16 structure if
enough people are interested.
Your suggestions on parsing and serialisation seem reasonable to me.
If we allow source files that are encoded in UTF-8, what effect would this
have on portability to compilers that don't use Unicode? Or, to put this
another way, what would be the minimum amount of support that an
implementation would have to provide for UTF-8, and how much work would it
be to implement?
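For a sense of scale (a Python sketch of the serialisation side): UTF-8 is a variable-width encoding, so even minimal support means correctly handling one- to four-byte sequences.

```python
# One sample per UTF-8 sequence length: ASCII, Latin-1 supplement,
# a Han ideograph, and a character outside the Basic Multilingual Plane.
samples = ["A", "\u00E9", "\u4E2D", "\U0001F600"]  # 1, 2, 3, 4 bytes
for s in samples:
    encoded = s.encode("utf-8")
    assert encoded.decode("utf-8") == s  # round trip
    print(f"U+{ord(s):06X} -> {len(encoded)} byte(s): {encoded.hex()}")
```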
Thank you for taking the time to write up your thoughts. I hope my reply
has helped to explain the rationale for the current design.
Best wishes,
Dave.