[MLton] Unicode... again

Fri Feb 9 07:00:49 PST 2007

On Feb 9, 2007, at 2:50 PM, skaller wrote:
> The problem with this is that it also leaves all the solutions
> being non-portable wrt 'characters' although calculations with
> integers is fairly deterministic.

I don't really agree here. We are making the choice that characters  
are opaque. When you use the \x \u and \U syntax, you are specifying  
a Unicode code point. That's consistent with the definition of SML  
without WideChar, and makes good sense for SML with WideChar.

SML prevents you from doing arithmetic with characters. Therefore,  
the CEF used by MLton is completely arbitrary. For efficiency  
reasons, we will use the Unicode code point as an integer, but we  
need not do so. The 'CHAR.ord' and 'CHAR.chr' functions are defined  
to map characters <-> code points according to Unicode. Therefore,  
the operations done in the integer form are likewise well defined.

> The problem I try to look at is this one: what does
>
> 	"\x88"
>
> mean?
>
> * In a byte string: one byte, hex code 88.
> * In a UCS4 string a 32 bit word, value hex 88
> * In a UTF-8 string, the two byte encoding of hex 88.
>
> Note none of these meaning has anything to do with character
> sets or Unicode, but depends only on the encoding (CEF?).

MLton would have to use the correct CEF for the string literal, based  
on the inferred type. It has to do this even for #"x", because that  
might be a single byte, two bytes, or four bytes. If we added a UTF-8  
type string, your "\x88" example would be interpreted as "\u0088" and  
converted into the appropriate two-byte encoding under UTF-8.
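To make the three readings of "\x88" concrete, here is a minimal Python sketch (Python stands in for the compiler's back end purely for illustration; the variable names are made up and nothing here is MLton code). It assumes the literal always denotes code point 0x88 and only the storage form varies with the inferred type:

```python
import struct

# The literal "\x88" denotes one code point, regardless of string type.
code_point = 0x88

# 8-bit string: a single byte holding the code point directly.
as_byte_string = bytes([code_point])

# 32-bit wide string: one four-byte word in native (machine) byte order.
as_wide_string = struct.pack("=I", code_point)

# Hypothetical UTF-8 string type: the two-byte UTF-8 encoding of U+0088.
as_utf8_string = chr(code_point).encode("utf-8")

print(as_byte_string.hex(), as_wide_string.hex(), as_utf8_string.hex())
```

The meaning (the code point) is the same in all three cases; only the bytes that end up in the binary differ.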

I'm now leaning towards not adding UTF-8, as I think it's  
unnecessary, and it would be nasty to have no corresponding Char  
type. (The basis quite heavily assumes there is a Char type specific  
to each String type, and that these String types are the same as the  
monomorphic vector over that type.)

> BTW: the use of \x88 here just means 'code point hex 88'.
> You MIGHT chose instead that \x88 is byte 88 even in UTF-8,

I think this would be a mistake. If you wanted to (for some odd  
reason) write your string literal as UTF-8 escaped with SML \x  
escapes, then you should put that into a Word8Vector via Byte.  
There are no type problems then, as it would be a CharVector input.
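As a rough analogy only (Python, not SML; the names are illustrative), hand-escaped UTF-8 can be written as an 8-bit string and then reinterpreted as raw bytes, one byte per character, which is roughly the role Byte plays here:

```python
# An 8-bit string literal whose two characters happen to be the
# hand-written UTF-8 encoding of U+0088.
escaped = "\xc2\x88"

# One byte per character, no transcoding (latin-1 maps 0..255 to itself).
raw_bytes = escaped.encode("latin-1")

# Only when the bytes are explicitly decoded as UTF-8 do they become
# the single code point 0x88.
decoded = raw_bytes.decode("utf-8")
```

The escapes stay plain 8-bit values until something explicitly treats the byte vector as UTF-8.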

> suggesting these three kinds of string MUST be distinct types

This has to be the case anyway, as not all Char implementations have  
the same width.

> On the other hand consider
> 	"A'" -- with an accent of some kind, ONE character
>
> This is really hard for my brain. What this means cannot be  
> portable as such.

I don't agree. Suppose the source code was written as:
   val x = "пришет"
The source file had some CES (which I've argued we should just make  
UTF-8). This is parsed at compile-time into Unicode characters  
(possibly with this ml-ulex). After being parsed, MLton knows the  
sequence of Unicode code points in that string literal. When MLton  
needs to write this into the text segment of the binary, it would do  
so depending on the inferred string type. If the inferred string type  
was simply String.string, you should get a compile-time error to the  
effect that this string is "too big" for the type. If the type is  
WideString.string, then each character will be written as a four-byte value in  
machine endian order. If there were a UTF-8 type, it would be written  
as UTF-8 CEF.
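The pipeline just described can be sketched in Python as follows. This is an illustration under stated assumptions, not MLton's implementation: `emit_literal` is a hypothetical helper, the type names are passed as plain strings, and "UTF8String" is the hypothetical UTF-8 type discussed in this thread.

```python
import struct

def emit_literal(source_bytes, string_type):
    """Return the bytes to store for a literal, given its inferred type."""
    # Step 1: decode the UTF-8 source text into Unicode code points.
    code_points = [ord(ch) for ch in source_bytes.decode("utf-8")]

    if string_type == "String":
        # 8-bit characters: reject anything that does not fit in one byte.
        for cp in code_points:
            if cp > 0xFF:
                raise ValueError(
                    f"code point U+{cp:04X} is too big for String.string")
        return bytes(code_points)

    if string_type == "WideString":
        # One four-byte word per code point, in machine byte order.
        return b"".join(struct.pack("=I", cp) for cp in code_points)

    if string_type == "UTF8String":
        # Re-encode the code points as UTF-8.
        return "".join(chr(cp) for cp in code_points).encode("utf-8")

    raise ValueError(f"unknown string type: {string_type}")

literal = "пришет".encode("utf-8")       # bytes as found in the source file
wide = emit_literal(literal, "WideString")
utf8 = emit_literal(literal, "UTF8String")
```

Calling `emit_literal(literal, "String")` raises an error here, which corresponds to the compile-time "too big" diagnostic.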

> Fact is .. I'd really like to find an answer to the question
> myself. My language Felix only provides two types at the moment:
>
> 	"...." // 8 bit string
> 	u".. " // 32 bit string
>
> and
>
> 	"\x88" --> byte x88, even if it is invalid utf-8
> 	"\u0088" --> UTF-8 encoding of code point x88
> 	u"\u0088" --> UCS4 encoding of code point x88
> 	u"\x88" --> GAK I HAVE NO IDEA .. probably should be illegal?

I think the choices you've listed are all consistent with how I  
intend for this to work. The u"\x88" should be code point 0x88.

> The downside of this scheme is that the same string can be used for
>
> * 8 bit code points
> * UTF-8 encoding
>
> at the same time, which is not only inconsistent logically,
> it is also unsound in that you can generate a string you thought
> was UTF-8, but which contains an invalid UTF-8 sequence.
>
> If that happens due to I/O that might be acceptable but it should
> never happen as a result of the compiler transcoding a literal.
>
> Tradeoff between flexibility and safety here..

I don't see this point. There is exactly one type inferred for each  
string. MLton will never write incorrect UTF-8 to the text segment as  
it would do so from a sequence of code points. If your input source  
file had invalid UTF-8, then it would be a parse error. Even if the  
source file had UTF-8 format and used a UTF-8 string literal, MLton  
would decode the source file into WideString, decide that the literal  
is used as a UTF8String, and then re-encode that WideString back to  
UTF-8 in the program's text segment.
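A small Python sketch of that last point, assuming a strict decoder (`compile_utf8_literal` is a made-up name, and `SyntaxError` stands in for a compiler parse error): because the output is always re-encoded from decoded code points, malformed input is rejected up front and never copied through.

```python
def compile_utf8_literal(source_bytes):
    """Decode source bytes strictly, then re-encode the code points."""
    try:
        # Strict decoding: any malformed sequence is rejected here.
        text = source_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        raise SyntaxError(f"invalid UTF-8 in source file: {err}") from err
    # The output is produced from code points, never copied from the input.
    return text.encode("utf-8")

well_formed = "пришет".encode("utf-8")
round_tripped = compile_utf8_literal(well_formed)
```

Passing a lone continuation byte such as `b"\x88"` fails at the decode step instead of reaching the output.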