[MLton] Unicode / WideChar

Henry Cejtin henry.cejtin@sbcglobal.net
Mon, 21 Nov 2005 19:08:16 -0600


I  definitely  disagree.  At the very least, there is equality on characters.
Similarly, I would argue order.  You are quite correct  that  there  is  more
than one order, but that isn't relevant.  Even a type with an arbitrary order
(and hence equality) is a VERY useful item.
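
As a rough illustration of why even an arbitrary order is useful, here is a
small sketch of a set of characters kept in a binary search tree.  The names
tree, insert and member are made up for this example; the only thing the code
asks of char is Char.compare from the Basis Library, and any other total
order would serve equally well.

```sml
(* Hypothetical example type: a binary search tree of characters. *)
datatype tree = Leaf | Node of tree * char * tree

(* Insert a character; relies only on some total order on char. *)
fun insert (c, Leaf) = Node (Leaf, c, Leaf)
  | insert (c, t as Node (l, x, r)) =
      (case Char.compare (c, x) of
         LESS => Node (insert (c, l), x, r)
       | GREATER => Node (l, x, insert (c, r))
       | EQUAL => t)

(* Membership test, again using nothing but the order. *)
fun member (_, Leaf) = false
  | member (c, Node (l, x, r)) =
      (case Char.compare (c, x) of
         LESS => member (c, l)
       | GREATER => member (c, r)
       | EQUAL => true)

val set = List.foldl insert Leaf (String.explode "unicode")
```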

The fact that the definition of these other operations requires, in the  end,
a  (perhaps  implicit)  listing of cases doesn't change at all the fact that
the operation is well defined on the abstract type.  To be concrete, the fact
that  I  think  that  toUpper  applied  to  `a'  should yield `A' is precisely
because, in my youth I was given this case (along with a  bunch  of  others).
It REALLY IS, in the end, a table of rules.

Correspondingly,  making  use,  for  instance,  of  translation  from char to
unicode-code-point (an int), is merely a (very useful) short  cut  for  using
the table-of-unicode-code-points as a way of saving time and memory.
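
To make the table-versus-shortcut point concrete, here is a minimal sketch
restricted to ASCII letters.  The function names toUpperTable and toUpperOrd
are hypothetical, and the table lists only three cases where a real one would
list them all.  Char.ord and Char.chr are the Basis Library conversions.

```sml
(* Upper-casing as an explicit table of rules (only three rows shown). *)
fun toUpperTable #"a" = #"A"
  | toUpperTable #"b" = #"B"
  | toUpperTable #"c" = #"C"
  | toUpperTable c = c          (* every other char maps to itself *)

(* The same rule for ASCII lower-case letters, computed via code points.
   This is only a compact encoding of the table above. *)
fun toUpperOrd c =
  if #"a" <= c andalso c <= #"z"
  then Char.chr (Char.ord c - Char.ord #"a" + Char.ord #"A")
  else c

(* The two definitions agree wherever the table has been written out. *)
val agree =
  List.all (fn c => toUpperTable c = toUpperOrd c) [#"a", #"b", #"c", #"1"]
```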

Your  argument  about toUpper for Chinese makes much sense.  There is, as far
as I know, no natural definition for upper/lower case in Chinese  characters.
Hence  we  are  reduced  to  the  usual problem: what do we do when we have a
definition that naturally is a partial function?  We can either make it total
(by  `arbitrarily'  defining it in the other case) or we can leave it partial
by raising an exception  in  the  cases  when  its  argument  doesn't  apply.
(Actually,  theoretically,  there is also a 3rd possibility: to not terminate
on the `undefined' cases.)  The definition in the CHAR  signature  says  that
for  such  things, it returns its argument unchanged.  I don't see that as an
unreasonable choice, but I don't feel strongly about it.
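
The first two choices can be sketched as follows.  The exception NoCase and
the names upperTotal, upperPartial and upperOpt are invented for this
example, and the test for "has a case" is approximated with Char.isAlpha.
The option-returning variant is an extra of my own, not something the CHAR
signature provides.

```sml
(* Hypothetical exception for the "leave it partial" design. *)
exception NoCase of char

(* Total: return the argument unchanged when no case applies.
   This is the behaviour of Char.toUpper in the Basis Library. *)
val upperTotal : char -> char = Char.toUpper

(* Partial: raise when the character has no upper/lower case. *)
fun upperPartial c =
  if Char.isAlpha c then Char.toUpper c else raise (NoCase c)

(* A variant that makes the partiality visible in the result type. *)
fun upperOpt c =
  if Char.isAlpha c then SOME (Char.toUpper c) else NONE
```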

Even if it were true that the operations/properties that would be  ok  for  a
char type were a subset of those for integers, that is NOT (to me) at all an
argument against char being a different type.  I want to  know  when  I  have
something  that  should be treated as a char and when it should be treated as
an integer.  The fact, as you say, that one wants to  use  chars  as  indices
into an array simply argues for one of two approaches: either one can do what
Pascal did: allow non-integer types as array subscripts (lots of problems  if
the  size  of  an  array  is not a compile time constant), or else stick with
positive-integer-subscript arrays and require a conversion function.  That is
fine.
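
A small sketch of the second approach, assuming the ordinary 8-bit char type:
a per-character counter kept in an integer-indexed array, with Char.ord as
the explicit conversion.  The name histogram is made up for the example.

```sml
(* Count occurrences of each character in a string.
   Char.maxOrd is the largest code of the char type, so the array
   has one slot per possible character. *)
fun histogram (s : string) : int Array.array =
  let
    val counts = Array.array (Char.maxOrd + 1, 0)
    fun bump c =
      let val i = Char.ord c          (* explicit char -> int conversion *)
      in Array.update (counts, i, Array.sub (counts, i) + 1) end
  in
    List.app bump (String.explode s);
    counts
  end

val h = histogram "banana"
val nA = Array.sub (h, Char.ord #"a")   (* look up by converting again *)
```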

You  are  correct that this requires inserting ord/chr conversions.  You view
that as redundant.  I think of chars and integers as definitely different, so
I view leaving them out as a pun.

Your  argument  about  isalpha  being  a  mistake  may  very well be true.  I
definitely don't have enough experience to say if it  is.   Regardless,  that
is,  again  in  my  very  strong  opinion, no argument at all for viewing a
char and its code point as the same type.  That seems like exactly  the  same
error  as thinking of a list and a stack as the same thing just because I can
use one to implement the other.
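
The list/stack distinction can be sketched with an opaque signature.  The
signature STACK and structure Stack below are invented for this example; the
point is only that the implementing type is hidden from clients.

```sml
signature STACK =
sig
  type 'a stack
  val empty : 'a stack
  val push : 'a * 'a stack -> 'a stack
  val pop : 'a stack -> ('a * 'a stack) option
end

(* Opaque ascription (:>) hides the fact that a stack is a list. *)
structure Stack :> STACK =
struct
  type 'a stack = 'a list
  val empty = []
  fun push (x, s) = x :: s
  fun pop [] = NONE
    | pop (x :: s) = SOME (x, s)
end

val s = Stack.push (2, Stack.push (1, Stack.empty))
val top = case Stack.pop s of SOME (x, _) => x | NONE => 0

(* Something like List.length s is rejected by the type checker,
   even though a list is what sits underneath. *)
```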

I  agree  that  code points are just integers.  The point is that Unicode is,
among other things, an association between  integers  and  characters.
The other important thing it is,  determined  by  the  above,  is  a  set  of
characters.

Perhaps we will just have to agree to disagree, but I don't buy your argument
at all.