[MLton] WideChar
Stephen Weeks
MLton@mlton.org
Fri, 10 Dec 2004 12:24:48 -0800
Some thoughts on the WideChar stuff. Some of this is covered in
others' email and some is simply summarizing and stating my current
position.
* To make it clear that the char type is ISO-8859-1, I added a note to
http://mlton.org/BasisLibrary.
* The behavior of the Char and String functions does not depend on
locale.
* Writing string constants using only \u escapes makes it practically
impossible to write non-English programs. We will support UTF-8
encoded strings.
* \u is not sufficient to get all Unicode, hence there is an
unacceptable omission in the Definition -- we should also allow
\Uxxxxxxxx. We should not drop support for \u, since the Definition
requires it, and also it is by far the more common case.
* Restricting variable names to printable ASCII is painful for
non-English-speaking programmers. We should move toward making the
base alphabet that MLton accepts Unicode, and make the default
encoding of programs be UTF-8.
* It is a mistake to argue for extensions to the Definition based on
the fact that they only hurt portability away from MLton, not to
MLton. There are other valid reasons for extensions, but this is
not one of them. This is exactly the argument that SML/NJ has used
many times, and it has harmed the SML community by fragmenting it.
* There are standard table compression techniques (multi-level tables
with sharing) that can make a Unicode ML-Lex feasible.
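A minimal sketch of the multi-level idea, assuming a two-level table with 256-entry blocks over integer code points. The names build, lookup, and blockSize are hypothetical and not part of ML-Lex. Identical blocks are stored once and referenced from the index, which is where the compression comes from when large ranges of code points share a classification.

```sml
(* Two-level table: index maps a block number to a shared block. *)
val blockSize = 256

type table = {index: int vector, blocks: int vector vector}

(* Build a table covering code points 0 .. numBlocks * blockSize - 1
 * from the function f, storing each distinct block only once. *)
fun build (numBlocks: int, f: int -> int): table =
   let
      fun mkBlock b =
         Vector.tabulate (blockSize, fn i => f (b * blockSize + i))
      (* Position of blk among the blocks stored so far, if present. *)
      fun find (_, [], _) = NONE
        | find (blk, b :: bs, i) =
             if blk = b then SOME i else find (blk, bs, i + 1)
      fun loop (b, blocks, idx) =
         if b = numBlocks
            then {index = Vector.fromList (rev idx),
                  blocks = Vector.fromList blocks}
         else
            let
               val blk = mkBlock b
            in
               case find (blk, blocks, 0) of
                  SOME i => loop (b + 1, blocks, i :: idx)
                | NONE => loop (b + 1, blocks @ [blk], length blocks :: idx)
            end
   in
      loop (0, [], [])
   end

(* Two vector lookups per query. *)
fun lookup ({index, blocks}: table, c: int): int =
   Vector.sub (Vector.sub (blocks, Vector.sub (index, c div blockSize)),
               c mod blockSize)

(* Example: a classifier that is 1 on ASCII upper-case letters only. *)
fun isUpperAscii c = if 0x41 <= c andalso c <= 0x5A then 1 else 0
val t = build (16, isUpperAscii)
```

In this example only the first block differs from the rest, so 16 logical blocks are stored as 2 physical ones. The linear search in find is for brevity; a real generator would presumably hash blocks.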
* When converting a char reader to a widechar reader, it is sometimes
useful to raise an exception on encountering a widechar and
sometimes useful to return NONE. We should provide both types of
converters.
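The two kinds of converter might look as follows. This is only a sketch under stated assumptions: wide characters are modelled as word code points rather than a WideChar type, the conversion is taken to adapt a reader over wide code points so that it yields plain chars, and the names NotNarrow, narrowExn, and narrowOpt are hypothetical.

```sml
(* Raised by narrowExn when a code point does not fit in a char. *)
exception NotNarrow of word

(* Adapt a code-point reader to a char reader, raising an exception
 * on any code point above 255. *)
fun narrowExn (rd: (word, 's) StringCvt.reader): (char, 's) StringCvt.reader =
   fn s =>
   case rd s of
      NONE => NONE
    | SOME (w, s') =>
         if w <= 0wxFF
            then SOME (Char.chr (Word.toInt w), s')
         else raise NotNarrow w

(* The same adaptation, but returning NONE instead of raising. *)
fun narrowOpt (rd: (word, 's) StringCvt.reader): (char, 's) StringCvt.reader =
   fn s =>
   case rd s of
      NONE => NONE
    | SOME (w, s') =>
         if w <= 0wxFF
            then SOME (Char.chr (Word.toInt w), s')
         else NONE
```

The NONE variant composes with existing scanners, which already treat NONE as "no parse"; the exception variant distinguishes end of input from an unrepresentable character.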
* In the basis library, char is defined as int8 rather than word8 so
the FFI works. In C, plain char is commonly signed (its signedness
is implementation-defined), and signed char may have a different
calling convention than unsigned char. If we defined
char as Word8.word, then in order to import/export a function that
deals with chars, one would have to use int8 and coerce on the SML
side. That seems like a major pain.
* Using a datatype for encoding names is preferable to using strings.
Using datatypes does not introduce any problems with adding new
encodings. For example, with strings, pattern matches will look
like
    case enc of
       "UTF8" => ...
     | "UTF16" => ...
     | "my-encoding" => ...
     | _ => error "unknown encoding"
With a datatype, the same match would look like
    case enc of
       UTF8 => ...
     | UTF16 => ...
     | X "my-encoding" => ...
     | _ => error "unknown encoding"
In either case, adding a new encoding (either as an extension or a
special variant) causes no problems. And with the datatype of known
encodings, one gets the benefit in the common case of type-checker
supported agreement of encoding name.
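A minimal sketch of such a datatype, using the constructor names from the match above. The type name encoding and the helpers toString and fromString are hypothetical, and the X constructor is assumed to carry the name of any encoding that is not built in.

```sml
(* Known encodings get their own constructors; X carries any other name. *)
datatype encoding = UTF8 | UTF16 | X of string

fun toString UTF8 = "UTF8"
  | toString UTF16 = "UTF16"
  | toString (X s) = s

(* Names that are not built in fall through to X, so new encodings
 * can be added without changing the datatype. *)
fun fromString "UTF8" = UTF8
  | fromString "UTF16" = UTF16
  | fromString s = X s
```

With this definition a misspelled constructor such as UTF9 is rejected at compile time, whereas the misspelled string "UTF9" would only fail at run time.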
* We should put the new localization stuff in an MLB library, not the
MLton structure.
* I don't see what OOP buys in handling locales (i.e., I don't see any
use for inheritance, dynamic dispatch, or the like). We can simply
have functions that depend on the locale.
    signature LOCALE =
       sig
          type t
          val make: ??? -> t
          val isAlpha: t * LargeChar.t -> bool
          val isPrint: t * LargeChar.t -> bool
       end
You could make this look more OOP by using a record of member
methods, as below.
    signature LOCALE_FACTORY =
       sig
          val make: ??? -> {isAlpha: LargeChar.t -> bool,
                            isPrint: LargeChar.t -> bool}
       end
Since these signatures are completely equivalent (one can write a
functor mapping between them) I'd go for the first approach, as it
is more idiomatic SML.
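
The equivalence can be sketched with a pair of functors. To make the sketch compile, two things are assumed that the signatures above leave open: the unspecified argument of make becomes an abstract type arg, and char stands in for LargeChar.t. The functor names ToFactory and FromFactory are hypothetical.

```sml
signature LOCALE =
   sig
      type arg   (* assumption: stands in for the unspecified make argument *)
      type t
      val make: arg -> t
      val isAlpha: t * char -> bool
      val isPrint: t * char -> bool
   end

signature LOCALE_FACTORY =
   sig
      type arg
      val make: arg -> {isAlpha: char -> bool, isPrint: char -> bool}
   end

(* Functional style to record-of-methods style. *)
functor ToFactory (L: LOCALE): LOCALE_FACTORY =
   struct
      type arg = L.arg
      fun make a =
         let
            val l = L.make a
         in
            {isAlpha = fn c => L.isAlpha (l, c),
             isPrint = fn c => L.isPrint (l, c)}
         end
   end

(* Record-of-methods style back to functional style. *)
functor FromFactory (F: LOCALE_FACTORY): LOCALE =
   struct
      type arg = F.arg
      type t = {isAlpha: char -> bool, isPrint: char -> bool}
      val make = F.make
      fun isAlpha (l: t, c) = #isAlpha l c
      fun isPrint (l: t, c) = #isPrint l c
   end
```

Neither direction needs anything beyond closures and records, which is the sense in which the two signatures carry the same information.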