[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Wesley W. Terpstra wesley@terpstra.ca
Wed, 30 Nov 2005 17:50:13 +0100


On Nov 30, 2005, at 3:10 PM, John Reppy wrote:
> Having a character type without a corresponding string/substring type
> seems weird.  Once you have string/substring, then you effectively
> have the vector and slice structures too, so why not add arrays and
> array slices to get the complete set?

As Matthew already said, I think we should have those structures, but
perhaps not at the top-level namespace. If it's little work for an
implementation to provide them, and they don't pollute the namespace,
then I see no problem with having them. It's also not a particularly
confusing concept... probably less confusing than splitting
representation and meaning into orthogonal concepts.

>> What will Char.toWide do? As I already mentioned, high ascii
>> (128-255) is undefined. What does it map to in a WideChar?! I still
>> think defining high ascii to be *something* is better than nothing.
>
> I think that it depends on how one views the Char.char type.  In my
> view, it is an enumeration of 256 values.  There are a collection of
> predicates that classify these values and there is a standard string
> representation that corresponds to the SML notion of character/string
> literals.  The value #"\128" is perfectly well defined, it just
> doesn't happen to have a tight binding to a particular glyph.

Right, but when you convert it to Unicode you are binding it to a glyph.
So, which glyph do you bind it to? Or do you raise Chr?
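
To make the choice concrete, here is a small sketch in Python (not SML;
Python is used only because its standard codecs make the values easy to
check). It shows three possible answers for the value 0x80: treat Char
as ISO-8859-1, treat it as Windows code page 1252, or refuse, which is
analogous to raising Chr. The name to_wide is hypothetical and is not a
Basis Library function.

```python
# Sketch: three candidate meanings for a "Char.toWide"-style conversion.
# to_wide is a hypothetical helper, not part of any SML Basis API.

def to_wide(byte_value, charset):
    """Return the Unicode code point for one 8-bit value, or None if undefined."""
    try:
        return ord(bytes([byte_value]).decode(charset))
    except UnicodeDecodeError:
        return None  # analogous to raising Chr

latin1_result = to_wide(0x80, "latin-1")  # identity mapping: U+0080, a control code
cp1252_result = to_wide(0x80, "cp1252")   # U+20AC, the Euro sign
ascii_result = to_wide(0x80, "ascii")     # None: high values are undefined in ASCII
```

Only the ISO-8859-1 reading keeps the conversion a plain widening of the
ordinal, which is part of why it makes the Char/WideChar conversion
simple.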

>> Instead, you should use BinaryIO and compose it with a charset
>> decoder. An implementation will only have a few charset
>> representations in main memory and certainly no variable width ones.
>> If you use a general charset decoder for reading, then you can
>> support all charsets with the same code.
>
> For converting between data on disk/wire/etc., filters are the way to
> go (TextIO already has this property for newline conversion), but
> there is the issue of OS interfaces; for example, pathnames.

I'm not sure I understand your point here...
Do you mean that some system/kernel calls will need a particular
charset? As far as I know the only kernel with that feature is the
Windows kernel, where it can take UCS2 strings as well as ASCII.
(Another reason UCS2 is needed).

For filenames on UNIX, I suppose you might want to write out UTF-8
strings. That's not a big problem, though, since the same structure
which can wrap the BinIO readers also converts WideString.string to
Word8Vector.vector with the charset you specify.
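
As a rough illustration of that composition, sketched in Python rather
than SML: a binary reader is wrapped with a charset decoder chosen at
run time, and the same charset name is used to turn a wide string into
bytes for a filename. The helper names read_text and encode_filename
are made up for this example, and io.BytesIO stands in for a BinIO
reader.

```python
import codecs
import io

def read_text(binary_stream, charset):
    """Wrap a binary stream with a decoder for the named charset and read it all."""
    reader = codecs.getreader(charset)(binary_stream)
    return reader.read()

def encode_filename(name, charset="utf-8"):
    """Convert an in-memory (wide) string to the bytes handed to the OS."""
    return name.encode(charset)

# The same reading code handles a fixed-width and a variable-width charset.
from_latin1 = read_text(io.BytesIO(b"caf\xe9"), "latin-1")
from_utf8 = read_text(io.BytesIO(b"caf\xc3\xa9"), "utf-8")
filename_bytes = encode_filename("caf\u00e9.txt")
```

Both reads yield the same in-memory string, so nothing after the decoder
needs to know which charset was on the wire.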

I don't really see how any of this relates to the usefulness of TextIO,
though. You wouldn't have used TextIO to create filenames anyways,
would you?

>> 1. If you write a string in SML 'val x = "asfasf"', then this string
>> must contain the code points which correspond to the symbol with
>> shape 'a', then 's', ... When you have a single storage type, with
>> multiple charsets, then this is ambiguous. ie: Is #"€" 0xA4 or 0x80?
>> Depends on your charset!
>
> This was not the reason.  This problem is more of an editor problem
> and one of the reasons that I'm not a big fan of extending the source
> token set of SML beyond ASCII.

I think I have explained myself badly; the problem I was trying to
describe has nothing to do with the editor. You are giving an SML
compiler an input file; that input file is in some character set the
compiler understands. The compiler knows that #"€" is the Euro sign,
and the charset it was written in by the editor is irrelevant at this
point, because the compiler already decoded the file into its internal
representation.

Rather, the problem comes in after the compiler does type inference.
The compiler has this character and it says, "Ok! This is going to be a
Char8.char which has an unspecified charset". Now it has to think: what
will the binary value be that I write into the output program's text
segment? The compiler, as per your suggestion, doesn't know the charset
of Char8, because you left it unspecified. Now it must decide what on
earth to do with a Euro sign. Should it use 0xA4 for an ISO-8859-15
type of Char8 or 0x80 for a Windows extended ASCII (code page 1252)
Char8? The compiler knows that you want a Euro sign, because that's
what you wrote in the input file, but because Char8 does not include a
concept of charset, it is unable to decide what binary value this turns
into.
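
The ambiguity is easy to reproduce. A sketch in Python (used here only
for its built-in codecs): the same decoded Euro sign becomes a different
byte depending on which 8-bit charset is assumed, and has no byte at all
in ISO-8859-1. This assumes ISO-8859-15 is the charset that puts the
Euro sign at 0xA4 and code page 1252 is the Windows charset that puts it
at 0x80. The name literal_byte is hypothetical.

```python
EURO = "\u20ac"  # the Euro sign, already decoded from the source file

def literal_byte(ch, charset):
    """Byte a compiler would emit for ch if Char8 meant charset, else None."""
    try:
        encoded = ch.encode(charset)
    except UnicodeEncodeError:
        return None  # the charset has no slot for this character
    return encoded[0] if len(encoded) == 1 else None

in_latin9 = literal_byte(EURO, "iso8859_15")  # 0xA4
in_cp1252 = literal_byte(EURO, "cp1252")      # 0x80
in_latin1 = literal_byte(EURO, "latin-1")     # None: ISO-8859-1 has no Euro sign
```

Without a charset argument there is no way to write literal_byte at all,
which is the position the compiler is in when Char8 leaves the charset
unspecified.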

This problem also appears for normal ASCII.
Take the character #"c". What should the compiler do with it?

It doesn't know your Char8 is going to be KOI8-R, so it would probably
just use ASCII, and that means that when you later use the character as
if it were KOI8-R it would be some completely random glyph, when what
you clearly meant was the Russian letter 'c' (which sounds like 's').

Does that make things more clear?
This makes the fact that Char is ASCII extremely important.
Otherwise, the compiler would have no way of transforming string
literals (which have been decoded/parsed already) into values in the
heap.

>> Finally, you would still need at least three representations (1,2,4
>> byte). My proposal had five, which isn't terribly worse, and saves
>> on the classification structures. If we say Char=ISO-8859-1, then
>> there are only three structures in my proposal too. (Char, Ucs2,
>> WideChar)
>>
>> I keep coming back to arguing for Char being ISO-8859-1. It makes the
>> problem of conversion between WideChar and Char so much cleaner...
>
> Why not just have 8-bit Char.char and 32-bit WideChar.char?

Nearly all of Unicode in common use fits into the first 16 bits (the
Basic Multilingual Plane). As a matter of practicality, many people use
a 16-bit representation for in-memory Unicode.
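
To put "nearly all" in concrete terms, a small Python sketch: characters
within the first 16 bits take one 16-bit unit, while anything above
U+FFFF needs two units (a surrogate pair) in UTF-16 and cannot be stored
in a single UCS-2 cell. The helper name utf16_units is made up for this
example.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to store one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

euro_units = utf16_units("\u20ac")      # Euro sign, inside the first 16 bits
cyrillic_units = utf16_units("\u0441")  # Cyrillic small letter es
clef_units = utf16_units("\U0001d11e")  # musical G clef, outside the first 16 bits
```

A 32-bit WideChar is still needed for the last case, which is why a
16-bit type would sit alongside it rather than replace it.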