[MLton] Unicode / WideChar

skaller skaller@users.sourceforge.net
Tue, 22 Nov 2005 13:58:21 +1100


On Mon, 2005-11-21 at 19:08 -0600, Henry Cejtin wrote:
> I  definitely  disagree.  At the very least, there is equality on characters.
> Similarly, I would argue order.  You are quite correct  that  there  is  more
> than one order, but that isn't relevant.  Even a type with an arbitrary order
> (and hence equality) is a VERY useful item.

No dispute.

> The fact that the definition of these other operations requires, in the  end,
> a  (perhaps  implicit) listing of cases, doesn't change at all the facts that
> the operation is well defined on the abstract type.  To be concrete, the fact
> that  I  think  that  toUpper  applied to a `a' should yield `A' is precisely
> because, in my youth I was given this case (along with a  bunch  of  others).
> It REALLY IS, in the end, a table of rules.

Yes, but you were taught English, or French .. Casing does make
sense for most European languages. As I mentioned, there was
a reasonable argument for abstraction as a technique for handling
the collection of Euro charsets like Latin-1 .. etc.

The problem is, your personal upbringing doesn't generalise
to the whole world very nicely. Other scripts simply don't
work like that. Some have 3 cases, some none. Some use
other techniques.

For example, I learned English .. and I think German
and French are WEIRD! They have really weird things
called accents. Spanish is even worse!

So, with my background, I'm appalled ASCII has
ridiculous characters like ~ and ^ in it .. except of
course, as a programmer, I bemoan the fact that I have
bitwise complement and exclusive-or -- but no symbol
for set membership etc .. :))

> Correspondingly,  making  use,  for  instance,  of  translation  from char to
> unicode-code-point (an int), is merely a (very useful) short  cut  for  using
> the table-of-unicode-code-points as a way of saving time and memory.

Yes. This is basically my point. In practice, to do any serious
work with i18n charsets, encodings, etc, you have to take
this shortcut almost ALL the time. Which defeats the purpose
of the abstraction in the first place ;(
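
To make that concrete (an illustrative sketch of my own, with
invented module and function names -- not Wesley's interface):
with an abstract char, even trivial classification work
degenerates into convert-to-int, compute, convert-back.

	(* Hypothetical: ABSTRACT_CHAR stands for any abstract char
	   type with ord/chr style escape hatches to the code point. *)
	module type ABSTRACT_CHAR = sig
	  type t
	  val code : t -> int          (* "ord" *)
	  val of_code : int -> t       (* "chr" *)
	end

	module Classify (C : ABSTRACT_CHAR) = struct
	  (* Serious i18n work ends up looking like this: escape to
	     the code point, work on ints, convert back. *)
	  let is_basic_latin c = C.code c <= 0x7F
	  let to_ascii_upper c =
	    let n = C.code c in
	    if n >= 0x61 && n <= 0x7A then C.of_code (n - 32) else c
	end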

> Your  argument  about toUpper for Chinese makes much sense.  There is, as far
> as I know, no natural definition for upper/lower case in Chinese  characters.
> Hence  we  are  reduced  to  the  usual problem: what do we do when we have a
> definition that naturally is a partial function? 

I will take a slightly different viewpoint. What to do with
partial functions is a technical issue .. there are several
solutions, and I may or may not feel strongly about them
depending on circumstances.

But that isn't my viewpoint here: in the natural context,
there is no such problem. The case mapping for, say,
English, is partial, and is extended to a total one by
convention to solve that. But for Chinese, there is no such
mapping in the first place.
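
To spell that out (a throwaway sketch of mine, treating code
points as plain ints; the function names are made up):

	(* The English mapping is genuinely partial ... *)
	let to_upper_partial n =
	  if n >= 0x61 && n <= 0x7A then Some (n - 32) else None

	(* ... and only the convention "identity everywhere else"
	   makes it total. For Chinese there is no partial mapping
	   to extend in the first place. *)
	let to_upper_total n =
	  match to_upper_partial n with
	  | Some m -> m
	  | None -> n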

So, in Unicode, which attempts to be the universal
code set -- including both Chinese and English -- it may
make sense to use the Unicode Standard's definition.

I'm not arguing about that -- what I'm arguing is that
the toUpper() function is NOT a property of an abstract
character type, such as, for example, one that
could be instantiated to "Chinese Character".

The Unicode solution is basically what it says:
a Unification. That is, it is an algebraic sum of
a large set of common (and not so common) character sets.

It has certain properties, such as toUpper(), by specification.
But it is a universal type NOT an abstraction!
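
If you wanted to model that view directly (a sketch of mine,
with made-up constructor names -- nothing to do with how
Unicode is actually encoded), it would be a tagged union, and
operations like toUpper() get defined branch by branch, by fiat:

	type unified_char =
	  | Latin1 of int        (* has a case mapping *)
	  | Greek  of int        (* has a different one *)
	  | Han    of int        (* has none at all *)
	  | Other  of int

	(* to_upper exists only because the specification says so,
	   one branch of the sum at a time: *)
	let to_upper = function
	  | Latin1 n when n >= 0x61 && n <= 0x7A -> Latin1 (n - 32)
	  | c -> c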

It is common  practice in OO to have
some base class and derive from this abstraction many
concrete types. This can work if the abstraction
is truly ABSTRACT and not a sum. The problem, however,
is that often the type is actually a sum, and every
new instantiation requires adding more and more
methods with defaults to the base, as a way of handling
the new instance.
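
In signature terms (my own caricature, not anyone's real API)
the rot looks like this: the "abstract" base keeps acquiring
operations that only make sense for some of its instances,
each with a default for all the others.

	module type CHAR_BASE = sig
	  type t
	  val equal   : t -> t -> bool
	  val compare : t -> t -> int
	  (* added when the Latin scripts arrived; "default" is
	     the identity *)
	  val to_upper : t -> t
	  (* added for the three-case scripts; identity elsewhere *)
	  val to_title : t -> t
	  (* and so on, one defaulted operation per new instance *)
	end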

A good example is any Taxonomy. The top of a taxonomical
tree, such as Living Thing, is NOT an abstraction at all.
A taxonomy is ACTUALLY a hierarchical decomposition by
partitioning .. in other words it is a tree of sums.
There is nothing remotely abstract in it.

Unicode is a mix. It is a *specific* charset (not an
abstraction) that tries to both sum and abstract all
character sets. But the type of a unicode character
is quite concrete.

The right way to do this (IMHO), in Ocaml, would be:

	sig
		type char = int
		...
	end

that is, you provide an alias rather than an abstract type.
This allows one to extend the concrete module, or define other
modules with abstract signatures with different representations.

You can still write algorithms which only require the abstraction
(minus the 'int') and you can still write extensions that
*require*  the concrete type. So the concrete module is both
Open (for extension) and Closed (as a module).
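
Roughly like this (an invented sketch; the signature and
functor names are mine):

	(* The concrete module advertises the representation ... *)
	module type CODEPOINT = sig
	  type char = int
	  val to_upper : char -> char
	end

	(* ... yet generic algorithms can still be written against
	   the weaker signature with the '= int' removed ... *)
	module type ABSTRACT = sig
	  type char
	  val to_upper : char -> char
	end

	module Upcase (C : ABSTRACT) = struct
	  let upcase_all = List.map C.to_upper
	end

	(* ... while extensions that *require* the representation
	   simply use it: *)
	module Extend (C : CODEPOINT) = struct
	  let is_ascii (c : C.char) = c >= 0 && c <= 0x7F
	end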

In particular you can note that for ML languages,
pattern matching would work on a tuple of chars as ints --
it cannot work if 'char' is abstract! (In Ocaml you can do:

	match xx with | (x, y) when p x && q y -> ...

to work around that .. but it is uglier .. :)
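
For instance (hypothetical code points, my own sketch): with
the concrete alias you match on literals directly, while an
abstract char forces the bind-and-guard style above.

	(* Concrete: chars are ints, literal patterns just work. *)
	let is_crlf = function
	  | (0x0D, 0x0A) -> true
	  | _ -> false

	(* Abstract: no literals of type C.t exist, so you can only
	   bind and guard, as in the workaround above. *)
	module CrLf (C : sig type t val code : t -> int end) = struct
	  let is_crlf = function
	    | (x, y) when C.code x = 0x0D && C.code y = 0x0A -> true
	    | _ -> false
	end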

> Even if it were true that the operations/properties that would be  ok  for  a
> char type where a subset of those for integers, that is NOT (to me) at all an
> argument against char being a different type.  

It is *an* argument -- in the informal sense of the word argument :)

Whether it is a convincing argument depends on practice,
that is, it is partly statistical, based on experience, etc.

> I want to  know  when  I  have
> something  that  should be treated as a char and when it should be treated as
> an integer.  

Yes. I understand that. This is the usual argument for
abstraction. Abstraction is very powerful .. so this is
generally a powerful argument.

However the distinction between abstraction and representation
is not mathematically driven, but driven by usage: sometimes
abstraction buys lots of safety, representation independence,
etc .. and it is valuable.

And sometimes the abstraction buys very little, and causes
the usual problems (such as continuously having to apply
functions to construct/destruct it).

So I hope you'd agree that in some cases things are best abstract,
and in others not.

I am simply arguing that characters are a case
where the abstraction doesn't buy enough to bother.

This doesn't negate the advantages abstraction would have,
it just balances them against the disadvantages.

> You  are  correct that this requires inserting ord/chr conversions.  You view
> that as redundant.  

No. It isn't redundant -- it is useful when the need to do
it is low. I feel we don't disagree on any principles here!

I am just saying -- there is a compromise, and in
the case of characters there is very little advantage
in abstraction.

> I think of chars and integers as definitely different, so
> I view leaving them out as a pun.

Your intellectual view is not very significant. I share
your view -- but I've been heavily exposed to the real
world of i18n on Standards committees, and listened to
a host of arguments on both sides -- I'm not giving
you my view, but my opinion of the consensus of
other people's views, which are rooted in lots of
real-world implementation experience.

Of course this experience IS biased towards using
the existing C model and the formation of the C++
model, so the viewpoint should not be accepted
without challenge and analysis --- but the outcome
is based on pragmatics, not on a single person's
opinion of how things should be -- how
things actually work out in practice is often not
what we desire.

> Your  argument  about  isalpha  being  a  mistake  may  very well be true.  I
> definitely don't have enough experience to say if it  is.   Regardless,  that
> is,  again  in  my  very strong opinion, no argument at all against viewing a
> char and its code point as the same type.

Yes, but see above. You will have to agree that the distinction,
in general, between an implemented abstraction and a representation
is somewhat arbitrarily made, depending on how useful it is.

After all, the notion of a 'type' is just a hack: there is
no such thing really. It is simply a convenience to help
make code easier to manage -- a tool to be used when
appropriate.

Given that, we have no dispute on the utility of abstraction;
the dispute is simply about whether, in this particular case,
abstraction is warranted.

Like you, I used to think it was. But I discovered

(a) my view was not shared by nuts-and-bolts people,
including, for example, Bill Plauger; and

(b) my own attempts to design an abstraction have failed
totally.

I have to conclude the nuts-and-bolts people might just
be right on this one.

>   That seems like exactly  the  same
> error  as thinking of a list and a stack as the same thing just because I can
> use one to implement the other.

It could well be an error. When I was younger I was all so gung-ho
on abstraction. Now that I am older and wiser I am even thinking
of writing a book "Against Abstraction" :)

Functional languages are HEAVILY biased against abstraction.
Algebraic types -- sums and products -- are not abstract.
OO is HEAVILY biased towards abstraction -- every class is an
abstract type.

We all know FP is much better than OO, right? Why? Because it
doesn't force you to make every single type abstract.
[Not the only reason of course! And not saying
you cannot or should not provide abstractions in FPLs ..
it's just that algebraic types are so convenient .. :]
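
Just to pin the terms down (an example of mine, not from the
thread): a sum and a product, both with their structure in
the open, both immediately pattern-matchable.

	type case = Upper | Lower | Uncased              (* a sum *)
	type classified = { point : int; case : case }   (* a product *)

	let describe c =
	  match c.case with
	  | Upper   -> "upper"
	  | Lower   -> "lower"
	  | Uncased -> "uncased"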

> I agree that code points are just integers.  The point is that  what  unicode
> is  is,  among  other things, an association between integers and characters.
> The other important thing it is,  determined  by  the  above,  is  a  set  of
> characters.

Actually, this isn't so --- the Unicode and ISO 10646 Standards clearly
dissociate themselves from any notion of character. To the point
of FORMAL complaints from the i18n people about programming languages
like C and C++ daring to use the name 'char' for a data type which,
quite specifically, is not one.

By formal complaint I mean a written DEMAND from ISO that
the C and C++ committees account for using terminology that
the i18n standards BAN in ISO documents. Of course the
response was "Sorry, historical error which can't be
fixed for reasons of compatibility".

That's why we both keep saying 'code point' -- they're NOT
characters. Which character is Control A? :) Is tilde really
a character?

You are correct, there is an *intended interpretation* of the
formalism to relate to an 'abstract' notion of a character,
but the relationship is loose and deliberately left unstated
because it cannot be formalised.

Anyhow .. you will see for yourself, as Wesley tries to actually
implement the interface, what issues arise.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net