[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

Wesley W. Terpstra wesley@terpstra.ca
Wed, 30 Nov 2005 21:33:12 +0100


On Nov 30, 2005, at 6:22 PM, John Reppy wrote:
> I don't see converting to WideChar as "converting to Unicode".

I argued earlier that WideChar should be defined as Unicode.
My reasoning was that Char is already defined as ASCII (low 7 bit).
Unicode is the ASCII of the future. Unicode tries to be a superset
of all charsets (and in this respect is much like LargeInt).

Therefore, it seems natural that Unicode fills the role of WideChar,
and as a practical matter, it seems the right thing to do for the
future utility of the basis.
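
As a small illustration of why this is natural: ASCII occupies the
first 128 code points of Unicode, so if WideChar is Unicode, promoting
a Char is just a code-point copy (a minimal sketch, assuming the
optional WideChar structure is present; 'toWide' is a name I made up,
not a Basis function):

     (* Minimal sketch, assuming WideChar.char holds Unicode code
        points.  ASCII is code points 0-127 of Unicode, so the
        promotion is just ord followed by chr. *)
     fun toWide (c : Char.char) : WideChar.char =
         WideChar.chr (Char.ord c)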

>> I think I have explained myself badly; the problem I was trying to
>> describe has nothing to do with the editor. You are giving an SML
>> compiler an input file, and that input file is in some character set
>> the compiler understands. The compiler knows that #"€" is the Euro
>> sign, and the charset in which it was written in the editor is
>> irrelevant at this point, because the compiler has already decoded
>> the file into its internal representation.
>>
>> Rather, the problem comes in after the compiler does type inference.
>> The compiler has this character and it says, "Ok! This is going to
>> be a Char8.char which has an unspecified charset". Now it has to
>> think: what will the binary value be that I write into the output
>> program's text segment? The compiler, as per your suggestion,
>> doesn't know the charset of Char8, because you left it unspecified.
>> Now it must decide what on earth to do with a Euro sign. Should it
>> use 0xA4 for an ISO-8859-15 type of Char8 or 0x80 for a Windows
>> extended ASCII Char8? The compiler knows that you want a Euro sign,
>> because that's what you wrote in the input file, but because Char8
>> does not include a concept of charset, it is unable to decide what
>> binary value this turns into.
>>
>> This problem also appears for normal ASCII.
>> Take the character #"c". What should the compiler do with it?
>>
>> It doesn't know your Char8 is going to be KOI8R, so it would
>> probably just use ASCII, and that means that when you later use the
>> character as if it were KOI8R you would get some completely random
>> glyph, when what you clearly meant was the Russian letter 'c' (which
>> sounds like 's').
>>
>> Does that make things more clear?
>> This makes the fact that Char is ASCII extremely important.
>> Otherwise, the compiler would have no way of transforming string
>> literals (which have been decoded/parsed already) into values in the
>> heap.
>
> I think you are drawing the wrong conclusion.  Instead of saying that
> Char.char is ASCII, you should say that SML programs are interpreted
> as being encoded in the ASCII character set (I think that the
> definition actually states this assumption, but I don't have my copy
> handy to check).

I think you still haven't understood what I'm trying to say.

What I was talking about has absolutely nothing to do with SML programs
being written in ASCII. They could be written in some totally bizarre
charset, and exactly the same thing would be true.

I am talking about what happens after the file has already been parsed.
The fact is simply that a string like "Hello world" defines a sequence
of symbol shapes. What encoding you used for the SML file is
irrelevant. That sequence of symbol shapes has a meaning to us humans;
we would be most upset if print "Hello world\n" did not output the
appropriate-looking characters on our terminal.

When you store that string in the heap of the running program, the
compiler needs to make a choice about how those symbols will be put
into main memory. Unless Char8 has a charset, the compiler has no way
of deciding how to put those symbols into main memory. If it picks some
arbitrary character set for the purposes of storing it in RAM, that
won't help me when I try to use the string as though it were a
<insert-charset-i-combine-with-Char8-here>.

For this reason, Char, by its nature, MUST be bound to a specific
charset.
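
To make that choice concrete, here is a rough sketch of the decision
the compiler faces for the Euro sign; the datatype and function are
made up for illustration, and the byte values are simply the Euro's
position in each charset:

     (* Hypothetical sketch: which byte should the compiler emit for
        the symbol the programmer wrote (the Euro sign)?  Without
        knowing the charset of Char8 there is no single right answer. *)
     datatype charset = ISO_8859_15 | WINDOWS_1252
     fun encodeEuro (cs : charset) : Word8.word =
         case cs of
             ISO_8859_15  => 0wxA4   (* Euro in ISO-8859-15 (Latin-9) *)
           | WINDOWS_1252 => 0wx80   (* Euro in Windows-1252          *)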

If you disagree, walk me through how
     val () = ascii_print "Hello world\n"
     val () = koi8r_print "привет\n"
is going to work.

I care about these steps:
1. parsing the SML file (for the example, let's say the above SML is
   UTF-8)
2. what the compiler decides the types of the strings are (are they
   the same?)
3. what the contents of the String8s are that get stored in the output
   executable
4. how you ensure that the string will be in the charset 'print'
   expects (i.e. ascii_print interprets the charset-less Char8 as
   ASCII, koi8r_print as KOI8R)

I contend that without assigning some charset to Char8, you will fail
to perform step 3. If you pick some arbitrary encoding, you can't
satisfy both prints in step 4. Right now, the basis says Char is ASCII,
so the first print works. If you specify no charset for Char, you can't
even convert it to ASCII, because you don't know what the memory
holding your string means.
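
To spell out the step-4 conflict, here is a rough sketch of the types
the two printers would need (String8 and both printers are
hypothetical, as in the example above). Whatever byte sequence the
compiler chose in step 3, only one of the two interpretations of it
can be the one I meant:

     (* Hypothetical sketch: both printers consume the same
        charset-less string type, but interpret its bytes differently. *)
     signature PRINTERS = sig
         val ascii_print : String8.string -> unit  (* reads bytes as ASCII *)
         val koi8r_print : String8.string -> unit  (* reads bytes as KOI8R *)
     end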