[MLton] Unicode / WideChar

skaller skaller@users.sourceforge.net
Tue, 22 Nov 2005 02:54:31 +1100


On Mon, 2005-11-21 at 14:08 +0100, Wesley W. Terpstra wrote:

> > I would actually argue, that Char? is wrong.
> > They're not chars, they're integers, and they
> > are not associated with any particular code set.
> 
> I disagree; a Char2 differs from Int16 in exactly
> the fact that a Char2 has something to do with
> a particular code set. That's why CHAR has all
> those isAlpha, isAlnum, etc., methods.

But you aren't disagreeing :)

I know this is what happens .. my argument is that
it is a design error. 

C does not make this mistake, and neither does C++.
C has no character type at all (it's just an integer).

C++ does: it has a set of rules about what types
can be characters and what extra information must be
provided, and then allows you to have strings
or streams of any of those types.

Both solutions are polymorphic, so they support
Unicode as well as (almost) all other character sets.
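
Roughly, in SML terms, the C++ approach amounts to something
like this functor -- a sketch only, the names are mine, not
anything from an existing library:

    (* String operations parameterised over whatever character
       type (and code-set operations) the client supplies, in
       the spirit of C++'s basic_string<CharT>. *)
    functor MkString (type ch
                      val toString : ch -> string) =
    struct
        type str = ch list
        val concat : str * str -> str = op @
        fun render (s : str) = String.concat (List.map toString s)
    end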

I'm not against special support for Unicode .. I'm
suggesting it should be provided *entirely* functionally,
by using a universal representation (integers) to represent
characters.

If you don't do this, you're just forcing clients
to do the conversion manually using Ord and Chr
(or whatever equivalents MLton provides), and writing
their own replacements for things like 'isAlpha' --
just to get the string handling (concatenation,
substrings, etc). Only, some of it will not work:
if they forget that a 'Capitalise()' HOF is bound
to the native case mapper, not their own, for example.
So the type system will not prevent such errors,
because Ord and Chr are casts that let you escape it.
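
As a rough sketch of the boilerplate I mean -- hypothetical
code, assuming the wide text just arrives as plain ints:

    (* The client converts by hand with chr/ord to reach the
       Char-based string machinery, and rolls their own
       classification for their own code set.  Nothing ties
       the two together, so mixing them up is not a type error. *)
    fun toNativeChar (cp : int) : char =
        Char.chr cp                (* raises Chr when out of range *)

    fun myIsAlpha (cp : int) : bool =
        (cp >= 0x41 andalso cp <= 0x5A)    (* 'A'..'Z' in *some* code set *)
        orelse (cp >= 0x61 andalso cp <= 0x7A)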

That is .. Char is isomorphic to some integer anyhow,
using Ord/Chr as the isomorphism .. you might as well
bite the bullet and just use Int16, Int32, etc. and save
the user a lot of trouble. The 'abstraction' to Char
doesn't buy you anything, and it just gets in the way.

Normally, I'd be in favour of abstraction .. and
parameterisation if possible to get additional reuse
whilst preserving type safety. But in this case, I believe
it just isn't worth it.

In theory, I think a character is something with an embedding
into strings, a total ordering, addition and subtraction of
integers, and subtraction of characters (and perhaps a few
more operations I've forgotten). So any Char type should have
these operations.
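
Written down as an SML signature, that minimal abstraction
would be something like this (a sketch, with my own names):

    signature CHARACTER =
    sig
        type char
        val toString : char -> string         (* embedding into strings *)
        val compare  : char * char -> order   (* total order *)
        val add      : char * int -> char     (* char + int *)
        val sub      : char * char -> int     (* char - char *)
    end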

The problem is that *in practice* real charsets are also
amenable to other operations, including bitshifts and
bitwise logic. Thus the encoding is often confused
with the abstraction because it is useful to do so.
The 'raw' abstraction I mentioned above just isn't
very useful. It is a pleasure to write

	i = i * 10 + ch - '0'

in C to convert a string to an integer. It is not nearly
as clear to write:

	i = i * 10 + Ord(ch) - Ord('0')

and of course BOTH are wrong for EBCDIC :)
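
For comparison, roughly the same loop in SML via the Basis
ord -- just a sketch, and no less tied to the encoding:

    (* Fold over the characters, relying on the digit characters
       being contiguous in the underlying code set -- the same
       encoding assumption as the C version. *)
    fun atoi (s : string) : int =
        List.foldl (fn (ch, i) => i * 10 + (Char.ord ch - Char.ord #"0"))
                   0 (String.explode s)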

This is somewhat related, for example, to the idea of
doing 'dimensional analysis' on physical data.
We can encode that in the type system .. so you can't
add metres and seconds together.

The problem is that most 'real world' calculations
are not dimensionally sound .. for the simple reason
they're approximations and fudges. To retain correctness,
you have to litter your code with conversions from
metres to float and back .. in the end it buys you
more trouble than it is worth.
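
The trade-off is easy to see even in a toy SML version of the
idea (an illustration only):

    (* Distinct types stop you adding metres to seconds ... *)
    datatype metres  = M of real
    datatype seconds = S of real

    fun addLengths (M a, M b) = M (a + b)
    (* addLengths (M 1.0, S 2.0) is a type error, as intended.
       But every approximate formula that mixes units now has
       to unwrap to real and wrap back up by hand. *)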

BTW: same problem in Felix. It has 3 char types and 3
string types, only they're 'char', 'wchar_t' and 'unicode'
(which are 8 bits, who the heck knows, and 32 bits respectively).
I too have a constraint (not from the ML Standard, but from the
C++ Standard).
I'm just commenting that in the end, I think this is a 
design error.

[C++ actually does have a serious design error: there is no
integer type guaranteed to be the right size for UCS-4]

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net