[MLton] Unicode... again

Fri Feb 9 08:56:26 PST 2007

On Fri, 2007-02-09 at 16:00 +0100, Wesley W. Terpstra wrote:
> On Feb 9, 2007, at 2:50 PM, skaller wrote:
> > The problem with this is that it also leaves all the solutions
> > being non-portable wrt 'characters' although calculations with
> > integers is fairly deterministic.
> 
> I don't really agree here. We are making the choice that characters  
> are opaque. When you use the \x \u and \U syntax, you are specifying  
> a Unicode code point. That's consistent with the definition of SML  
> without WideChar, and makes good sense for SML with WideChar.

Excuse me but I think you misunderstood: I'm referring to
C in that comment, not SML or MLton.

> SML prevents you from doing arithmetic with characters.

Clearly ord/chr functions make this claim false, and C
is just the same -- C can't do arithmetic on chars either
(they're promoted to int .. which is precisely the ord
function).
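
A tiny illustration of the ord/chr point, sketched in Python rather than SML or C purely for brevity (the helper name next_char is made up for this example):

```python
def next_char(c: str) -> str:
    """Return the character whose code point is one greater than c's."""
    # No arithmetic happens on the character itself: ord maps it to an
    # integer, the addition is done on integers, and chr maps back.
    return chr(ord(c) + 1)

print(next_char("a"))  # b
```

So the "arithmetic" is always integer arithmetic bracketed by two conversions, whichever language you write it in.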

> I'm now leaning towards not adding UTF-8, as I think it's  
> unnecessary, and nasty that there would be no corresponding Char  
> type. (The basis quite heavily assumes there is a Char type specific  
> to each String type, and that these String types are the same as the  
> monomorphic vector over that type)

Quite apart from what the basis library assumes .. Felix does
the same as you suggest: there's no UTF-8 string type.

However 0x88 in an 8 bit string is not the UTF-8 encoding
of code point x88, it is actually byte 0x88, as in C.

That is, it is indeed code point x88, expressed in
UCS-1 encoding. Whereas \u0088 is the UTF-8 encoding
of that same code point.
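
To make the distinction concrete, here is a sketch in Python (not Felix or SML). It assumes Python's bytes type can stand in for the 8 bit string and that Latin-1 can stand in for "UCS-1"; both are assumptions of the illustration, not claims about either compiler:

```python
# Assumption for illustration: Python bytes plays the role of the 8 bit
# string, and Python str plays the role of a sequence of code points.
raw = b"\x88"                        # exactly one byte, 0x88, as in C
encoded = "\u0088".encode("utf-8")   # UTF-8 encoding of code point U+0088

print(raw.hex())      # 88
print(encoded.hex())  # c288

# Read as a one-byte-per-code-point encoding (Latin-1 here, standing in
# for "UCS-1"), the raw byte is indeed code point 0x88.
print(raw.decode("latin-1") == "\u0088")  # True
```

The two literals name the same code point but produce different byte sequences, which is why the escape syntax matters in an 8 bit string.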

Now, there is a REASON for this .. in particular NOT
abstracting the string type to opaque characters:
it allows you to write algorithms which manipulate
many kinds of encodings of anything (not just human text!).

Strings are used heavily as byte streams in most languages.

Yes, I understand Word8Vector is available .. but that won't
easily mix with strings (eg literals). Conversion is not
an option for large texts. It isn't clear if this is an
issue in practice however (since large texts aren't
going to play MLton or Unicode games, but use special
app-specific encodings anyhow .. eg a text editor isn't
going to edit a 10 Meg file as a string, it will use
a rope or something).

> > On the other hand consider
> > 	"A'" -- with an accent of some kind, ONE character
> >
> > This is really hard for my brain. What this means cannot be  
> > portable as such.
> 
> I don't agree. If the source code was written as:
>    val x = "пришет"
> The source file had some CES (which I've argued we should just make  
> UTF-8). This is parsed at compile-time into Unicode characters  
> (possibly with this ml-ulex). 

How? I mean -- how can you tell the encoding?
[UTF-8 default plus command line switch .. or .. force UTF-8
and make the client comply -- I actually think enforcement
is better but then I'm not Japanese ..]

> > Fact is .. I'd really like to find an answer to the question
> > myself. My language Felix only provides two types at the moment:
> >
> > 	"...." // 8 bit string
> > 	u".. " // 32 bit string
> >
> > and
> >
> > 	"\x88" --> byte x88, even if it is invalid utf-8
> > 	"\u0088" --> UTF-8 encoding of code point x88
> > 	u"\u0088" --> UCS4 encoding of code point x88
> > 	u"\x88" --> GAK I HAVE NO IDEA .. probably should be illegal?
> 
> I think the choices you've listed are all consistent with how I  
> intend for this to work.

Whew!

>  The u"\x88" should be code point 0x88.

Ok, since I had no idea, your idea is better than mine :)

> > Tradeoff between flexibility and safety here..
> 
> I don't see this point. 

The point is people want to work with other charsets and
encodings whether you like it or not, and for that purpose
you can choose whether to let them use the string type,
or force them to use a raw encoding like (Word8Vector?).

I'm happy with enforcement though: I think the choice
is hard: pros and cons either way.

The lesson from the past is don't try to be too abstract,
because i18n stuff is too complex and defies simplistic
abstractions .. meaning you have to do any serious work
with representations anyhow.

> There is exactly one type inferred for each  
> string. MLton will never write incorrect UTF-8 to the text segment as  
> it would do so from a sequence of code points. If your input source  
> file had invalid UTF-8, then it would be a parse error. Even if the  
> source file had UTF-8 format and used a UTF-8 string literal, MLton  
> would decode the source file into WideString, decide that the literal  
> is used as a UTF8String, and then re-encode that WideString back to  
> UTF-8 in the program's text segment.

Right. This makes sense. However if you want to read strings
from files in UTF-8 format (at run time I mean) you can get
errors, and if you want to do it FAST you cannot detect them:
validating the input isn't really an option (it will cause a
whole extra pass on the file which is WAY too expensive).
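
The trade-off can be sketched like this (Python for illustration only; read_unchecked and read_validated are hypothetical names, and how expensive the check really is will depend on the implementation):

```python
def read_unchecked(data: bytes) -> bytes:
    """Keep the input as raw bytes: no validation, so malformed UTF-8
    passes through unnoticed."""
    return data

def read_validated(data: bytes) -> str:
    """Decode as UTF-8: every byte is inspected, and malformed input
    raises UnicodeDecodeError."""
    return data.decode("utf-8")

good = "пришет".encode("utf-8")
bad = b"ok\x88"   # 0x88 is a lone continuation byte, not valid UTF-8

print(read_validated(good))        # пришет
print(read_unchecked(bad).hex())   # 6f6b88
```

The unchecked path is fast precisely because it never looks at the bytes; the validated path catches the error but has to touch all of the input to do so.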

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net