[MLton] Unicode / WideChar
skaller
skaller@users.sourceforge.net
Tue, 22 Nov 2005 13:58:21 +1100
On Mon, 2005-11-21 at 19:08 -0600, Henry Cejtin wrote:
> I definitely disagree. At the very least, there is equality on characters.
> Similarly, I would argue order. You are quite correct that there is more
> than one order, but that isn't relevant. Even a type with an arbitrary order
> (and hence equality) is a VERY useful item.
No dispute.
> The fact that the definition of these other operations requires, in the end,
> a (perhaps implicit) listing of cases, doesn't change at all the facts that
> the operation is well defined on the abstract type. To be concrete, the fact
> that I think that toUpper applied to a `a' should yield `A' is precisely
> because, in my youth I was given this case (along with a bunch of others).
> It REALLY IS, in the end, a table of rules.
Yes, but you were taught English, or French .. Casing does make
sense for most European languages. As I mentioned, there was
a reasonable argument for abstraction as a technique for handling
the collection of Euro charsets like Latin-1 .. etc.
The problem is, your personal upbringing doesn't generalise
to the whole world very nicely. Other scripts simply don't
work like that. Some have 3 cases, some none. Some use
other techniques.
For example, I learned English .. and I think German
and French are WEIRD! They have really weird things
called accents. Spanish is even worse!
So, with my background, I'm appalled that ASCII has
ridiculous characters like ~ and ^ in it .. except of
course, as a programmer, I bemoan the fact that I have
bitwise complement and exclusive-or -- but no symbol
for set membership etc .. :))
> Correspondingly, making use, for instance, of translation from char to
> unicode-code-point (an int), is merely a (very useful) short cut for using
> the table-of-unicode-code-points as a way of saving time and memory.
Yes. This is basically my point. In practice, to do any serious
work with i18n charsets, encodings, etc., you have to take
this shortcut almost ALL the time -- which defeats the purpose
of the abstraction in the first place ;(
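For concreteness, here is a minimal OCaml sketch (mine, not from the thread) of what the shortcut looks like in code: every arithmetic step on a char has to round-trip through the code point via Char.code/Char.chr.

```ocaml
(* Shifting a character by one: convert out of the abstraction,
   compute on the code point, convert back in. *)
let next_char c =
  Char.chr ((Char.code c + 1) mod 256)
```

When nearly every operation looks like this, the abstract type is little more than a speed bump around the integer it wraps.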
> Your argument about toUpper for Chinese makes much sense. There is, as far
> as I know, no natural definition for upper/lower case in Chinese characters.
> Hence we are reduced to the usual problem: what do we do when we have a
> definition that naturally is a partial function?
I will take a slightly different viewpoint. What to do with
partial functions is a technical issue .. there are several
solutions, and I may or may not feel strongly, depending on
circumstances.
But that isn't my viewpoint here: in the natural context,
there is no such problem. The case mapping for, say,
English, is partial, and is extended to a total function by
convention to solve that. But for Chinese, there is no such
mapping in the first place.
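As a sketch (assuming ASCII English letters only -- nothing from the thread), the conventional extension looks like this in OCaml: the mapping is defined on 'a'..'z' and made total by sending everything else to itself.

```ocaml
(* Partial case mapping on 'a'..'z', extended to a total function
   by the conventional identity on everything else. *)
let to_upper c =
  if c >= 'a' && c <= 'z'
  then Char.chr (Char.code c - 32)   (* 'a'..'z' -> 'A'..'Z' *)
  else c                             (* identity: the extension *)
```

The identity extension is a *convention* layered on top of a partial mapping; for a script with no case mapping at all, there is nothing to extend.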
So, in Unicode, which attempts to be the universal
code set -- including both Chinese and English -- it may
make sense to use the Unicode Standard's definition.
I'm not arguing about that -- what I'm arguing is that
the toUpper() function is NOT a property of an abstract
character type, such as, for example, one that
could be instantiated to "Chinese Character".
The Unicode solution is basically what it says:
a Unification. That is, it is an algebraic sum of
a large set of common (and not so common) character sets.
It has certain properties, such as toUpper(), by specification.
But it is a universal type, NOT an abstraction!
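A hypothetical OCaml sketch of the point (the type and constructor names are mine): modelled honestly as a sum, each summand carries its own operations, and a case mapping simply does not exist for some of them.

```ocaml
(* "Character" as the algebraic sum it really is. *)
type character =
  | Latin of char     (* has case: toUpper/toLower make sense *)
  | Han of int        (* Chinese: no notion of case at all *)
  | Other of int      (* the rest of the unification *)

(* Case mapping only on the summand where it exists;
   elsewhere, identity by convention. *)
let to_upper = function
  | Latin c when c >= 'a' && c <= 'z' ->
      Latin (Char.chr (Char.code c - 32))
  | ch -> ch
```

Note that toUpper here is a property of the Latin summand, not of the sum -- exactly the distinction between a unification and an abstraction.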
It is common practice in OO to have
some base class and to derive many concrete types
from this abstraction. This can work if the abstraction
is truly ABSTRACT and not a sum. The problem, however,
is that often the type is actually a sum, and every
new instantiation requires adding more and more
methods with defaults to the base, as a way of handling
the new instance.
A good example is any Taxonomy. The top of a taxonomical
tree, such as Living Thing, is NOT an abstraction at all.
A taxonomy is ACTUALLY a hierarchical decomposition by
partitioning .. in other words it is a tree of sums.
There is nothing remotely abstract in it.
Unicode is a mix. It is a *specific* charset (not an
abstraction) that tries to both sum and abstract all
character sets. But the type of a unicode character
is quite concrete.
The right way to do this (IMHO), in OCaml, would be:

    sig
      type char = int
      ...

that is, you provide an alias rather than an abstract type.
This allows one to extend the concrete module, or define other
modules with abstract signatures with different representations.
You can still write algorithms which only require the abstraction
(minus the 'int') and you can still write extensions that
*require* the concrete type. So the concrete module is both
Open (for extension) and Closed (as a module).
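Here is a sketch of the transparent-alias idea (the module and function names are mine, not from any real library): because the type is published as int, the module stays open for extension while still being usable through its signature alone.

```ocaml
module UChar = struct
  type t = int                  (* a code point, concretely *)
  let compare = compare         (* order, hence equality, for free *)
  let of_int (cp : int) : t = cp
  let to_int (cp : t) : int = cp
end

(* An algorithm that needs only the "abstract" part: *)
let equal a b = UChar.compare a b = 0

(* An extension that *requires* the concrete representation: *)
let is_ascii (c : UChar.t) = c < 0x80
```

Had `t` been abstract, is_ascii could not be written outside the module without begging for yet another accessor to be added to it.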
In particular, you can note that for ML languages,
pattern matching works on a tuple of chars as ints --
it cannot work if 'char' is abstract! (In OCaml you can do:

    match xx with | (x, y) when p x && q y -> ...

to work around that .. but it is uglier .. :)
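To illustrate (a sketch of mine, using CR/LF as the code points): with char = int, literal patterns match a tuple directly, while an abstract type forces the guard form.

```ocaml
(* Concrete: literal integer patterns work on a tuple of code points. *)
let is_crlf = function
  | (0x0D, 0x0A) -> true
  | _ -> false

(* Abstract: one is reduced to guards, using whatever equality
   (eq) the abstraction exports. *)
let is_crlf_abstract eq cr lf (x, y) = eq x cr && eq y lf
```

The first form is checked for exhaustiveness by the compiler over literal patterns; the second pushes everything into opaque guard expressions.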
> Even if it were true that the operations/properties that would be ok for a
> char type where a subset of those for integers, that is NOT (to me) at all an
> argument against char being a different type.
It is *an* argument -- in the nonformal sense of the word 'argument' :)
Whether it is a convincing argument depends on practice;
that is, it is partly statistical, based on experience, etc.
> I want to know when I have
> something that should be treated as a char and when it should be treated as
> an integer.
Yes. I understand that. This is the usual argument for
abstraction. Abstraction is very powerful .. so this is
generally a powerful argument.
However the distinction between abstraction and representation
is not mathematically driven, but driven by usage: sometimes
abstraction buys lots of safety, representation independence,
etc .. and it is valuable.
And sometimes the abstraction buys very little, and causes
the usual problems (such as continuously having to apply
functions to construct/destruct it).
So I hope you'd agree that in some cases things are best abstract,
and in others not.
I am simply arguing that for characters it is a case
where the abstraction doesn't buy enough to bother.
This doesn't negate the advantages abstraction would have,
it just balances them against the disadvantages.
> You are correct that this requires inserting ord/chr conversions. You view
> that as redundant.
No. It isn't redundant -- the abstraction is useful when the need
to do the conversions is low. I feel we don't disagree on any
principles here!
I am just saying -- there is a compromise, and in
the case of characters there is very little advantage
in abstraction.
> I think of chars and integers as definitely different, so
> I view leaving them out as a pun.
Your intellectual view is not very significant. I share
your view -- but I've been heavily exposed to the real
world of i18n on Standards committees, and listened to
a host of arguments on both sides -- I'm not giving
you my view, but my opinion of the consensus of
other peoples views, which are rooted in lots of
real world implementation experience.
Of course this experience IS biased towards the
existing C model and the formation of the C++
model, so the viewpoint should not be accepted
without challenge and analysis -- but the outcome
is based on pragmatics, not a single opinion of
one person's view of how things should be -- how
they actually work out in practice is often not
what we desire.
> Your argument about isalpha being a mistake may very well be true. I
> definitely don't have enough experience to say if it is. Regardless, that
> is, again in my very strong opinion, no argument at all against viewing a
> char and its code point as the same type.
Yes, but see above. You will have to agree that the distinction,
in general, between an implemented abstraction and a representation
is somewhat arbitrary, depending on how useful it is.
After all, the notion of a 'type' is just a hack: there is
no such thing really. It is simply a convenience to help
make code easier to manage -- a tool to be used when
appropriate.
Given that, we have no dispute on the utility of abstraction,
the dispute is simply about whether, in this particular case,
abstraction is warranted.
Like you, I used to think it was. But I discovered
(a) my view was not shared by nuts and bolts people,
including for example Bill Plauger.
(b) my own attempts to design an abstraction have failed
totally
I have to conclude the nuts and bolts people might just
be right on this one.
> That seems like exactly the same
> error as thinking of a list and a stack as the same thing just because I can
> use one to implement the other.
It could well be an error. When I was younger I was all so gung-ho
on abstraction. Now I am older and wiser I am even thinking
of writing a book "Against Abstraction" :)
Functional languages are HEAVILY biased against abstraction.
Algebraic types -- sums and products -- are not abstract.
OO is HEAVILY biased towards abstraction -- every class is an
abstract type.
We all know FP is much better than OO. Why? Because it
doesn't force you to make every single type abstract.
[Not the only reason of course! And not saying
you cannot and should not provide abstractions in FPLs ..
its just that algebraic types are so convenient .. :]
> I agree that code points are just integers. The point is that what unicode
> is is, among other things, an association between integers and characters.
> The other important thing it is, determined by the above, is a set of
> characters.
Actually, this isn't so -- the Unicode and ISO 10646
Standards clearly dissociate themselves from any notion
of character. To the point of FORMAL complaints from the
i18n people about programming languages like C and C++
daring to use the name 'char' for a data type which
quite specifically is no such thing.
By formal complaint I mean a written DEMAND from ISO that
C and C++ account for using terminology that is BANNED
in ISO documents by i18n standards. Of course the
response was "Sorry, historical error which can't be
fixed for reasons of compatibility".
That's why we both keep saying 'code point' -- they're NOT
characters. Which character is Control A? :) Is tilde really
a character?
You are correct, there is an *intended interpretation* of the
formalism to relate to an 'abstract' notion of a character,
but the relationship is loose and deliberately left unstated
because it cannot be formalised.
Anyhow .. you will see for yourself, as Wesley tries to actually
implement the interface, what issues arise.
--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net