[MLton] Unicode / WideChar
Henry Cejtin
henry.cejtin@sbcglobal.net
Mon, 21 Nov 2005 19:08:16 -0600
I definitely disagree. At the very least, there is equality on characters.
Similarly, I would argue order. You are quite correct that there is more
than one order, but that isn't relevant. Even a type with an arbitrary order
(and hence equality) is a VERY useful item.
The fact that the definition of these other operations requires, in the end,
a (perhaps implicit) listing of cases, doesn't change at all the fact that
the operation is well defined on the abstract type. To be concrete, the fact
that I think that toUpper applied to `a' should yield `A' is precisely
because, in my youth, I was given this case (along with a bunch of others).
It REALLY IS, in the end, a table of rules.
Correspondingly, translating from char to unicode code point (an int) is
merely a (very useful) shortcut: it uses the table of unicode code points
as a way of saving time and memory.
Your argument about toUpper for Chinese makes much sense. There is, as far
as I know, no natural definition for upper/lower case in Chinese characters.
Hence we are reduced to the usual problem: what do we do when we have a
definition that naturally is a partial function? We can either make it total
(by `arbitrarily' defining it in the other case) or we can leave it partial
by raising an exception in the cases when its argument doesn't apply.
(Actually, theoretically, there is also a 3rd possibility: to not terminate
on the `undefined' cases.) The definition in the CHAR signature says that
for such things, it returns its argument unchanged. I don't see that as an
unreasonable choice, but I don't feel strongly about it.
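To make the first two choices concrete, here is a rough sketch in Python
(not SML). The tiny case table and both function names are made up for
illustration; a real table would list every cased character.

```python
# Hypothetical case table: an explicit listing of cases, nothing more.
UPPER = {"a": "A", "b": "B", "c": "C"}


def to_upper_total(c: str) -> str:
    """Total version: a character with no entry is returned unchanged."""
    return UPPER.get(c, c)


def to_upper_partial(c: str) -> str:
    """Partial version: raise when no upper-case form is defined."""
    try:
        return UPPER[c]
    except KeyError:
        raise ValueError(f"no upper-case form defined for {c!r}") from None
```

With the total version a Chinese character simply comes back as itself,
which is the behavior the CHAR signature describes; the partial version
forces the caller to handle the undefined case.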
Even if it were true that the operations/properties that would be ok for a
char type were a subset of those for integers, that is NOT (to me) at all an
argument against char being a different type. I want to know when I have
something that should be treated as a char and when it should be treated as
an integer. The fact, as you say, that one wants to use chars as indices
into an array simply argues for one of two approaches: either one can do what
Pascal did: allow non-integer types as array subscripts (lots of problems if
the size of an array is not a compile time constant), or else stick with
positive-integer-subscript arrays and require a conversion function. That is
fine.
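As a small illustration of the second approach, here is a sketch in Python
(again, not SML) that keeps an integer-subscripted array and converts
explicitly at the boundary. The function name and the restriction to
`a'..`z' are assumptions made only for this example.

```python
def letter_counts(text: str) -> list[int]:
    """Count occurrences of 'a'..'z' using an integer-indexed array."""
    counts = [0] * 26
    for c in text:
        if "a" <= c <= "z":
            # Explicit char -> int conversion; the array itself only
            # accepts integer subscripts.
            counts[ord(c) - ord("a")] += 1
    return counts


def most_common_letter(text: str) -> str:
    """Return the most frequent lower-case letter (first one on ties)."""
    counts = letter_counts(text)
    # Explicit int -> char conversion on the way back out.
    return chr(ord("a") + counts.index(max(counts)))
```

The ord and chr calls mark exactly where a value stops being treated as a
character and starts being treated as an integer, and vice versa.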
You are correct that this requires inserting ord/chr conversions. You view
that as redundant. I think of chars and integers as definitely different, so
I view leaving them out as a pun.
Your argument about isalpha being a mistake may very well be true. I
definitely don't have enough experience to say if it is. Regardless, that
is, again in my very strong opinion, no argument at all for viewing a
char and its code point as the same type. That seems like exactly the same
error as thinking of a list and a stack as the same thing just because I can
use one to implement the other.
I agree that code points are just integers. The point is that unicode
is, among other things, an association between integers and characters.
The other important thing it is, determined by the above, is a set of
characters.
Perhaps we will just have to agree to disagree, but I don't buy your argument
at all.