[MLton] Unicode... again

Fri Feb 9 13:05:58 PST 2007

> Once again I find myself needing Unicode in MLton. 

Just to orient the discussion of "what to implement where": you find 
yourself needing to process Unicode files with an SML program compiled 
by MLton /or/ you find yourself needing to have Unicode strings in an 
SML program compiled by MLton.

The former doesn't require any changes to the compiler (not that they 
wouldn't be welcome).

> - CharX differs from IntX in that a CharX contains a character. This 
> sounds obvious, but it caused considerable debate earlier. I hope that 
> given the above definition of character, things are clear. A character 
> corresponds to our concept of the letter 'a', irrespective of the font. 
> A character is NOT a number. It is not even a code point.

I don't recall the details of the earlier debate, but while expecting 
CharX to differ from IntX sounds good, it doesn't give much insight into 
the representation.  In particular the 'X' would almost certainly seem 
to imply a fixed-width word/integer.

> - The CharX.ord method "returns the (non-negative) integer code of the 
> character c." should be interpreted as meaning "returns the 
> (non-negative) integer CODE POINT of the character c in UNICODE." There 
> is no serious competition to Unicode, and as its character repertoire is 
> open, there never will be.

That seems reasonable.

> - This interpretation of CharX.ord means that Char contains exactly 
> those characters in the repertoire of ISO-8859-1. The SML standard says 
> a char contains the 'extended ASCII 8-bit character set'. This should be 
> interpreted as contains 'characters in the repertoire of ISO-8859-1'. 
> The original authors were simply unaware that there does not EXIST an 
> extended ASCII 8-bit character set. We take the extension to 
> specifically be iso-8859-1. The result of isAlpha/etc remain unchanged.

That seems fine and consistent with the Basis Library.

> - The inclusion of maxOrd in the CHAR signature unfortunately forces our 
> hand at which character repertoires we can support. Specifically, it 
> forces us to use character encoding forms that are prefixes of the full 
> Unicode. We therefore leave Char8 above as I described, and take Char16 
> (name debate below) as the BMP (Basic Multilingual Plane). Even though 
> several code points in the BMP remain unassigned to characters 
> (especially those for UTF-16 that will always be unassigned), we choose 
> not to raise 'Chr' if an unassigned character is requested. We stick to 
> the rule of raising Chr if chr i has i > maxOrd. Thus if empty code 
> points are later filled, our programs remain compatible without 
> recompilation.

I presume you also intend to extend this to Char32, which would 
correspond to full Unicode, but include (many?) unassigned code points. 
Again, 'Chr' wouldn't be raised.
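
To make the proposed rule concrete, here is a rough model of it, written 
in Python rather than SML purely for illustration. The names chr_n, 
MAX_ORD, and the Chr class are hypothetical, and the Char32 bound of 
0x10FFFF is my assumption of what "full Unicode" would mean for maxOrd. 
The model represents a character by its code point only for brevity.

```python
# Hypothetical model of the proposed CharN.chr rule; not MLton code.
MAX_ORD = {
    8: 0xFF,        # Char8: ISO-8859-1 repertoire
    16: 0xFFFF,     # Char16: Basic Multilingual Plane
    32: 0x10FFFF,   # Char32: full Unicode code space (assumed bound)
}


class Chr(Exception):
    """Stands in for the Basis Library's Chr exception."""


def chr_n(width, i):
    """Accept code point i for the given width, or raise Chr.

    Only the range 0..maxOrd is checked. Whether i is currently
    assigned to a character is deliberately ignored.
    """
    if i < 0 or i > MAX_ORD[width]:
        raise Chr(f"{i:#x} is outside Char{width}")
    return i
```

Under this model an unassigned or surrogate code point such as 0xD800 is 
accepted by the 16-bit and 32-bit variants, and only an out-of-range 
value raises Chr.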

> - For the time being I choose to ignore the basis' claim that "in 
> WideChar, the functions toLower, toUpper, isAlpha,..., isUpper and, in 
> general, the definition of a ``letter'' are locale-dependent" and raise 
> an Unimplemented exception for these methods. I think the standard is 
> dreadfully misguided in assuming a global locale, and I defer what to do 
> here till later as it is what blocked my progress last time. (IMO these 
> functions have only questionable use, anyway)

I think that is reasonable.

> - The input character encoding scheme (CES) of an SML source file is 
> UTF-8. At present, the CES allows only 7-bit ASCII. Because compilers 
> give a parse error on 'high ASCII', we can choose anything for the last 
> bit we want. Choosing UTF-8 makes sense so that we can include Unicode 
> strings inside string literal definitions, yet remain 100% backward 
> compatible.
> 
> - Strings can include Unicode via \uXXXX for a code point (in hex) from 
> the BMP (Basic Multilingual Plane) or \UXXXXXXXX for a code point in 
> general (MLton already supports both). Furthermore, supposing the input 
> source SML file contains Unicode, these characters are similarly allowed 
> in a string. If a character is too big to fit into the inferred char 
> type, a compile-time error results.

As I understand the implementation of the latter in MLton, any string 
that has \uXXXX will be inferred to have type String16.string = 
Char16Vector.vector and any string that has \UXXXXXXXX will be inferred 
to have type String32.string = Char32Vector.vector.  (Inference might 
also force the type to a higher StringN.string type.)

That would seem to lend more support for Char16 as the BMP and Char32 as 
full Unicode.
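
As a sketch of the escape-driven inference described above (again in 
Python, only as a model of my understanding, not MLton's actual 
elaborator), the width of a literal could be picked from the escape 
forms it contains. The function name inferred_width is hypothetical, and 
the sketch ignores escaped backslashes and any later widening by type 
inference.

```python
import re

# Matches \UXXXXXXXX (8 hex digits) or \uXXXX (4 hex digits).
ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")


def inferred_width(literal_source):
    """Return 8, 16, or 32 for the body of a string literal as written."""
    width = 8  # plain literals stay String.string
    for match in ESCAPE.finditer(literal_source):
        if match.group(1) is not None:
            width = max(width, 32)  # \U form implies String32.string
        else:
            width = max(width, 16)  # \u form implies String16.string
    return width
```

Note that this keys on the escape syntax, not on the value: a literal 
containing \u0041 would still come out as 16 here, which matches how I 
read the current behaviour.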

> - To be absolutely clear: Char is NOT UTF-8. There is no variable length 
> encoding in any of the CharX methods. Similarly, String is not UTF-8 and 
> StringX.length remains constant time. When we convert a CharX/StringX to 
> UTF-8 the output type is Word8Vector.vector---a sequence of octets. The 
> same applies for other encodings. Dual byte encodings like UTF-16 
> correspond to Word16Vector.vector. Endian issues are left up to how the 
> Word16 is input/output by the program later.

Sure, I don't see CharX or StringX as any encoding.  Whether UTF-8 
encoding bytes are transparently Word8Vector.vector or have their own 
abstract type seems like an open choice.  But, it seems like
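
To illustrate the separation between fixed-width characters and an 
encoding, here is a minimal sketch of a UTF-8 encoder that takes a 
sequence of code points and yields octets, the role Word8Vector.vector 
plays in the proposal. It is in Python for illustration only; the name 
utf8_encode is hypothetical, and rejecting surrogate code points is my 
assumption, since the thread does not say how the encoder should treat 
them.

```python
def utf8_encode(code_points):
    """Encode a sequence of Unicode code points as UTF-8 octets."""
    out = bytearray()
    for cp in code_points:
        # Assumption: surrogates and out-of-range values are rejected.
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"cannot encode {cp:#x}")
        if cp < 0x80:                      # 1 octet
            out.append(cp)
        elif cp < 0x800:                   # 2 octets
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:                 # 3 octets
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:                              # 4 octets
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

The input keeps constant-time length and indexing because each element 
is one code point; only the output is variable-length.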

> The main debatable point I keep coming back to is the character encoding 
> form (CEF) of WideChar in memory. (I hope) we agree with my earlier 
> point that the basis and spirit of SML require a fixed-width Char. 

Yes, that seems pretty much implied by the Basis.

> As for naming the structures, Char and WideChar are dictated by the 
> standard. 

Well, WideChar is an optional part of the Basis and I don't know of any 
implementations that support it, so we're not missing much by just 
ignoring it.  Furthermore, if the CHAR functions don't seem to apply 
outside the ASCII subset, then trying to force something into WideChar 
probably isn't necessary.

On the other hand, since no one is using WideChar, whether or not there 
is an implementation that mostly raises exceptions probably won't be 
noticed.

> Agreed? Can I just whip this up and check it in? ;-)

I believe that there is still a unicode branch in the repository.  I 
would recommend that you merge changes from trunk into that branch and 
continue development there.  That gives people a chance to see 
development and suggest changes before we merge them into trunk.