[MLton] Unicode... again
Wesley W. Terpstra
terpstra at gkec.tu-darmstadt.de
Thu Feb 8 07:35:04 PST 2007
Once again I find myself needing Unicode in MLton. I failed to
implement this last time; I bit off more than I could chew in one
bite. I propose instead to start with a minimal implementation that
captures the most useful elements. Specifically, I would like to
leave the is* methods of WideChar undefined (i.e., they all raise an
Unimplemented exception for the time being).
I think the bare minimum is the ability to convert UTF-8 to and
from Char and WideChar. This alone would probably deliver most of
the benefit for the least cost, and it is a much less ambitious
goal. I can't speak for Henry, but I assume this would also benefit
mGTK as well as SQLite3 (both expect their input strings in UTF-8).
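Concretely, the minimal interface I have in mind would look
something like this (a sketch only; the name Utf8 and the exact
types are hypothetical):

    signature UTF8 =
       sig
          (* decode an octet sequence into wide characters;
             fails on malformed UTF-8 *)
          val decode : Word8Vector.vector -> WideString.string
          (* encode wide characters as an octet sequence *)
          val encode : WideString.string -> Word8Vector.vector
       end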
There was some debate about how this should be approached in SML.
I'll lay out what I think is obvious to a person in possession of the
relevant facts and then the design point that remains open. First I
want to quote this (from Wikipedia), as I think this was a major
source of confusion in the past:
> Unicode and its parallel standard, ISO 10646 Universal Character
> Set, which together constitute the most modern character encoding
> [...] separated the ideas of what characters are available, their
> numbering, how those numbers are encoded as a series of "code
> units" (limited-size numbers), and finally how those units are
> encoded as a stream of octets (bytes). The idea behind this
> decomposition is to establish a universal set of characters that
> can be encoded in a variety of ways. To correctly describe this
> model needs more precise terms than "character set" and "character
> encoding". The terms used in the modern model follow:
> A character repertoire is the full set of abstract characters that
> a system supports. The repertoire may be closed, that is no
> additions are allowed without creating a new standard (as is the
> case with ASCII and most of the ISO-8859 series), or it may be
> open, allowing additions (as is the case with Unicode and to a
> limited extent the Windows code pages). [...]
> A coded character set specifies how to represent a repertoire of
> characters using a number of non-negative integer codes called code
> points. For example, in a given repertoire, a character
> representing the capital letter "A" in the Latin alphabet might be
> assigned to the integer 65, the character for "B" to 66, and so on.
> A complete set of characters and corresponding integers is a coded
> character set. Multiple coded character sets may share the same
> repertoire; for example ISO-8859-1 and IBM code pages 037 and 500
> all cover the same repertoire but map them to different codes. In a
> coded character set, each code point only represents one character.
>
> A character encoding form (CEF) specifies the conversion of a coded
> character set's integer codes into a set of limited-size integer
> code values that facilitate storage in a system that represents
> numbers in binary form using a fixed number of bits (e.g.,
> virtually any computer system). For example, a system that stores
> numeric information in 16-bit units would only be able to directly
> represent integers from 0 to 65,535 in each unit, but larger
> integers could be represented if more than one 16-bit unit could be
> used. This is what a CEF accommodates: it defines a way of mapping
> a single code point from a range of, say, 0 to 1.4 million, to a
> series of one or more code values from a range of, say, 0 to 65,535.
>
> The simplest CEF system is simply to choose large enough units that
> the values from the coded character set can be encoded directly
> (one code point to one code value). This works well for coded
> character sets that fit in 8 bits (as most legacy non-CJK encodings
> do) and reasonably well for coded character sets that fit in 16
> bits (such as early versions of Unicode). However, as the size of
> the coded character set increases (e.g. modern Unicode requires at
> least 21 bits/character), this becomes less and less efficient, and
> it is difficult to adapt existing systems to use larger code
> values. Therefore, most systems working with later versions of
> Unicode use either UTF-8, which maps Unicode code points to
> variable-length sequences of octets, or UTF-16, which maps Unicode
> code points to variable-length sequences of 16-bit words.
>
> Finally, a character encoding scheme (CES) specifies how the fixed-
> size integer codes should be mapped into an octet sequence suitable
> for saving on an octet-based file system or transmitting over an
> octet-based network. With Unicode, a simple character encoding
> scheme is used in most cases, simply specifying if the bytes for
> each integer should be in big-endian or little-endian order (even
> this isn't needed with UTF-8). However, there are also compound
> character encoding schemes, which use escape sequences to switch
> between several simple schemes (such as ISO 2022), and compressing
> schemes, which try to minimise the number of bytes used per code
> unit (such as SCSU, BOCU, and Punycode).
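To make the decomposition concrete, consider the character é (LATIN
SMALL LETTER E WITH ACUTE) at each layer of the model:

    character (repertoire):    é
    code point (coded set):    233 (U+00E9)
    UTF-8 code values (CEF):   0xC3 0xA9
    UTF-16 code value (CEF):   0x00E9
    CES octets:                C3 A9 for UTF-8; 00 E9 (big-endian)
                               or E9 00 (little-endian) for UTF-16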
I hope that with the above definitions there will be no further
debate on these points:
- CharX differs from IntX in that a CharX contains a character. This
sounds obvious, but it caused considerable debate earlier. I hope
that given the above definition of character, things are clear. A
character corresponds to our concept of the letter 'a', irrespective
of the font. A character is NOT a number. It is not even a code point.
- The CharX.ord method "returns the (non-negative) integer code of
the character c." should be interpreted as meaning "returns the (non-
negative) integer CODE POINT of the character c in UNICODE." There is
no serious competition to Unicode, and as its character repertoire is
open, there never will be.
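In code, this reading means (a sketch; the WideChar literal uses
the \uXXXX escape discussed below):

    val a     = Char.ord #"A"           (* 65, code point U+0041 *)
    val alpha = WideChar.ord #"\u03B1"  (* 945, code point U+03B1,
                                           GREEK SMALL LETTER ALPHA *)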
- This interpretation of CharX.ord means that Char contains exactly
those characters in the repertoire of ISO-8859-1. The SML standard
says a char contains the 'extended ASCII 8-bit character set'. This
should be read as 'the characters in the repertoire of ISO-8859-1':
there does not EXIST a single extended ASCII 8-bit character set,
something the original authors were simply unaware of. We take the
extension to be, specifically, ISO-8859-1. The results of isAlpha,
etc. remain unchanged.
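Conveniently, ISO-8859-1 assigns exactly the first 256 Unicode code
points, so Char.ord already agrees with the Unicode reading:

    val eAcute = Char.ord #"\233"   (* 233: 'é' in ISO-8859-1 and
                                       U+00E9 in Unicode *)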
- The inclusion of maxOrd in the CHAR signature unfortunately forces
our hand as to which character repertoires we can support.
Specifically, it forces us to use character encoding forms that are
prefixes of full Unicode. We therefore leave the 8-bit Char as I
described above, and take Char16 (name debate below) as the BMP
(Basic Multilingual Plane). Even though several code points in the
BMP remain unassigned to characters (especially the surrogates
reserved for UTF-16, which will always be unassigned), we choose not
to raise Chr when an unassigned code point is requested. We stick to
the rule of raising Chr if chr i has i > maxOrd. Thus if empty code
points are later filled, our programs remain compatible without
recompilation.
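A sketch of the rule, using the proposed Char16 (so maxOrd =
65535):

    val _ = Char16.chr 0xD800   (* a surrogate: unassigned forever,
                                   yet chr succeeds *)
    val _ = (ignore (Char16.chr 0x10000); raise Fail "impossible")
            handle Chr => ()    (* 0x10000 > maxOrd, so Chr is raised *)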
- For the time being I choose to ignore the basis' claim that "in
WideChar, the functions toLower, toUpper, isAlpha, ..., isUpper and,
in general, the definition of a ``letter'' are locale-dependent" and
raise an Unimplemented exception for these methods. I think the
standard is dreadfully misguided in assuming a global locale, and I
defer deciding what to do here till later, as this is what blocked
my progress last time. (IMO these functions are of questionable use
anyway.)
- The input character encoding scheme (CES) of an SML source file is
UTF-8. At present, the CES allows only 7-bit ASCII. Because
compilers give a parse error on 'high ASCII', we are free to choose
whatever meaning we want for the high bit. Choosing UTF-8 makes
sense so that we can include Unicode strings inside string literal
definitions, yet remain 100% backward compatible.
- Strings can include Unicode via \uXXXX for a code point (in hex)
from the BMP (Basic Multilingual Plane) or \UXXXXXXXX for a code
point in general (MLton already supports both). Furthermore,
supposing the input source SML file contains Unicode, these
characters are similarly allowed in a string. If a character is too
big to fit into the inferred char type, a compile-time error results.
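For example (using the proposed wide string type; MLton already
accepts both escape forms):

    val bmp    : WideString.string = "\u03B1\u03B2"  (* two BMP characters *)
    val beyond : WideString.string = "\U0001D11E"    (* U+1D11E, beyond the BMP *)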
- To be absolutely clear: Char is NOT UTF-8. There is no variable-
length encoding in any of the CharX methods. Similarly, String is
not UTF-8, and StringX.length remains constant time. When we convert
a CharX/StringX to UTF-8, the output type is
Word8Vector.vector---a sequence of octets. The same applies for
other encodings: 16-bit encodings like UTF-16 correspond to
Word16Vector.vector. Endianness is left up to how the Word16 is
input/output by the program later.
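For instance, reusing the hypothetical Utf8 sketch from earlier:

    val s : WideString.string = "\u03B1\u03B2"
    val two   = WideString.size s   (* 2 characters, constant time *)
    val bytes = Utf8.encode s       (* 4 octets: CE B1 CE B2 *)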
----
The main debatable point I keep coming back to is the character
encoding form (CEF) of WideChar in memory. I hope we agree with my
earlier point that the basis and the spirit of SML require a
fixed-width Char. That means UTF-16 (which can use two Word16s
together to encode a single Unicode code point) is not acceptable
for WideChar. Long ago
I argued that WideChar should be like LargeInt---able to hold all
Unicode characters. I know that this would require 32 bits per
character since 21 bits is not a convenient size. Taking a long-term
point of view, I don't think this cost is unbearable. It greatly
simplifies development, and you can always use Char16. Furthermore,
as Unicode has an open character repertoire, this gives MLton
considerable room for future Unicode extensions.
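To make the difference concrete (a sketch): with a 32-bit WideChar,
chr accepts any Unicode code point, while a 16-bit BMP type must
reject those beyond it:

    val clef = WideChar.chr 0x1D11E   (* MUSICAL SYMBOL G CLEF: fine
                                         if WideChar covers all of
                                         Unicode; raises Chr if
                                         WideChar is only the BMP *)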
As for naming the structures, Char and WideChar are dictated by the
standard. If WideChar is like LargeInt, then it would be desirable
to have a middle ground. I hesitate to call it UCS2Char, as UCS-2 is
a character encoding form; on the other hand, since maxOrd forced
our hand, it accurately names both the character repertoire and the
memory representation. The other option would be BMPChar, which I
find more accurate, as it describes the character repertoire that
can be contained without restricting our encoding form.
The alternative would be to take WideChar as the BMP, and define
LargeChar as the equivalent of LargeInt---a 32-bit type able to hold
all characters. I have no real opinion on whether Char/BMPChar/
WideChar or Char/WideChar/LargeChar is better.
If we agree with all my bullet points and can reach a consensus on
whether WideChar is 16 or 32 bits, then the actual implementation of
all the above is trivial. Once the structures exist in the basis, I
would turn my attention to a new structure for encoding/decoding
CharX to/from a Word{8,16}Vector.vector. This would then easily
allow Unicode string literals: we don't need to modify lex/yacc,
just extend the lexer to allow high ASCII in string literals. Then
we decode the UTF-8 inside MLton's frontend, not in yacc. The lexer
converts \uXXXX to UTF-8; a sketch of that conversion follows.
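Here is a minimal sketch of that step: encoding one code point as
UTF-8 octets. The function name is hypothetical, and it assumes its
argument is already a valid code point (at most 0x10FFFF):

    fun utf8Bytes (cp : word) : Word8.word list =
       let
          fun w8 x = Word8.fromInt (Word.toInt x)
       in
          if cp < 0wx80 then                 (* 1 octet: 0xxxxxxx *)
             [w8 cp]
          else if cp < 0wx800 then           (* 2 octets: 110xxxxx 10xxxxxx *)
             [w8 (0wxC0 + cp div 0wx40),
              w8 (0wx80 + cp mod 0wx40)]
          else if cp < 0wx10000 then         (* 3 octets *)
             [w8 (0wxE0 + cp div 0wx1000),
              w8 (0wx80 + (cp div 0wx40) mod 0wx40),
              w8 (0wx80 + cp mod 0wx40)]
          else                               (* 4 octets *)
             [w8 (0wxF0 + cp div 0wx40000),
              w8 (0wx80 + (cp div 0wx1000) mod 0wx40),
              w8 (0wx80 + (cp div 0wx40) mod 0wx40),
              w8 (0wx80 + cp mod 0wx40)]
       end

    (* e.g. utf8Bytes 0wx3B1 = [0wxCE, 0wxB1], the UTF-8 for GREEK
       SMALL LETTER ALPHA *)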
Agreed? Can I just whip this up and check it in? ;-)