[MLton] Unicode / WideChar

skaller skaller@users.sourceforge.net
Tue, 22 Nov 2005 11:14:59 +1100

On Mon, 2005-11-21 at 10:40 -0600, Henry Cejtin wrote:
> Wow: I must say I COMPLETELY disagree.  The notion of distinguishing between
> a character and an encoding of the character seems to me to be incredibly
> useful and desirable.  

Try to define the interface. You will find you cannot. 

There is an embedding and its inverse:

	int -> char -> string

that is, an integer embeds into a char (with ord as the
inverse), and a char embeds into a string.

Now what else? You will find there isn't a SINGLE operation
on chars that makes any sense. They're totally useless.

BTW: I am abusing the word 'char' here, and using it
incorrectly in the usual way of programmers -- of course
I really mean 'code point'.

Now, you may think there are some operations on chars.
For example comparison. Wrong. Bzzz. Comparison is
an external function, not a property, for example:

	code_point_compare: A < B ... a < b
	librarians_compare: A < a < B < b

So, how would you define them? Well, you CANNOT.
At least not without getting the INTEGER encoding.

Now, let's try the ML way: we'll use a module (a category).
At first this makes sense:

	signature STRING = sig
		type char
		type string

		val eq: char -> char -> bool
		val lt: char -> char -> bool
		val ord: char -> int
		val chr: int -> char
		val upper: char -> char
		val isupper: char -> bool
	end

Unfortunately, this only looks like it makes sense!

WTF does 'upper' mean in Chinese?? Where are the
properties required to render Arabic? What the heck
is an Umlaut in German?

The problem is -- the only 'common' elements of the
module, the ones that characterise the notion of a char --
are a large subset of the properties of a (C) integer.

To actually *use* chars, you need data structures like
arrays indexed by them, bitmaps indexed by them,
etc etc -- and all of those data structures are
defined exclusively for *integers* -- to make
them work for chars, you will either have to
manually rewrite all the functions using 'Ord/Chr',
or you will have to make a functor to abstract the
data structure -- and such functors don't exist,
because these data structures all rely on peculiarities
of small integer representations. For example to lookup
an array you have to use the formula

	start_address + word_size * index

and there is no possibility here other than 'index'
being an integral type.

The fact is, the notion of 'character' is so heterogeneous,
with every charset requiring a weird set of special properties
unique to it, that there simply isn't anything to abstract.

The BEST you can do is what C does. NOTHING. There is a
literal for chars (eg 'a'). It is unspecified what
code point that is. The collation order is unspecified.
Everything is unspecified. This total lack of specification
and total failure to provide any abstraction turns out
to be the best thing you can do in practice.

The only real requirement on char in C is that it include
the code points required by C itself as a subset
-- so that it is possible to write a C compiler in C,
and print diagnostics.

Every language that tries to do more than this is worse
at manipulating characters than C, and provides no more
safety -- we have to conclude that a type abstraction
(as opposed to an alias) is basically an impediment.

[Actually, C provides too much -- isalpha macros
etc should NEVER have been allowed as part of the
Standard -- the originators were too Eurocentric
and we're stuck with this cultural bias now]

The conclusion is based on real experience -- not
on a naive belief in a non-existent abstraction.

I too shared your view that char
*ought* to be abstracted. But the plain facts of
the matter are it is worse than pointless, it is
actually an impediment.

> This seems even more obvious with unicode, where there
> are multiple encodings (UTF-8, UTF-16, etc.). 

No, you're not understanding. UTF-8 and UTF-16 are NOT
encodings of Unicode. They're mappings between streams
of bytes and streams of integers (UTF-8, UTF-16le,
UTF-16be, UTF-16 concrete) or streams of words and integers 
(UTF-16 abstraction).

Unicode itself is just a set of code points -- INTEGERS --
and a set of functions which provide properties, relations,
and mappings, plus an intended interpretation which includes
some glyphs (the glyphs are not characters and they're not
code points -- they're a 2D visual pattern).

So the very mention of 'the UTF-8 encoding of Unicode' is
just talking about an ordinary function on integers,
NOT some abstract notion of characters. If you define
the UTF-8 function on characters, you're artificially
restricting it to one use and forcing me to
go and recode it if I need it for another purpose.
[You have to agree streams of integers are common
in programming .. :]

Please look at the standards documents: U+23AF is an
integer. That's a Unicode code point. There's nothing
abstract about it.

> The fact that there are
> functions which convert between chars and some encoding, which you can think
> of as casts, or you can think of as actual conversions, is no obstacle.

It is a serious impediment, because it breaks higher
order functions completely.

All of these functions MUST work with some kind of encoding.
The usual choice is to use integers since it is considered
universal. So to write, for example:

	capitalise: string -> string

which capitalises the first letter of each word in a string,
you really MUST implement it like:

	capitalise (isAlpha, toUpper, input_string)

because isAlpha, toUpper are code set dependent:
to make the function capitalise 'higher order' so it works
on any code set, you have to parameterise it with the
properties of the instance code set.

By making it work with an abstraction -- as in the first
signature -- there is no choice but to include it
inside the module defining String -- since only there
are the required functions isAlpha etc. implicitly
available.

The problem is, whilst this works, it breaks the
Open/Closed principle of modules. There is an 'infinite'
set of useful functions like this. But since you do
not know the encoding of 'Char' outside the module,
the properties you need for each one cannot be
defined. The ONLY way to define them is to 
convert to integer using Ord/Chr, and then parameterise
the HOFs using these functions.

And that makes the HOFs defined *inside* the String
module useless -- because they don't accept these
external functions; they rely on the properties
in the module instead. So you have to duplicate them.

In the end .. the String module (like the above one)
proves to provide NO benefits and acts as a serious barrier 
to getting any work done. The only way to make it useful is
to remodel it as a functor parameterised by a char,
so that a string is a polymorphic container type
supporting indexing and concatenation.

In fact, that too turns out to be useless :)
since Array already does the same job.

I hope that inside all this long-windedness you get
some sense of what I mean -- the notion of Char
as an abstraction is counterproductive. 

John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net