[MLton] Unicode... again
Wesley W. Terpstra
terpstra at gkec.tu-darmstadt.de
Thu Feb 8 07:35:04 PST 2007
Once again I find myself needing Unicode in MLton. I failed to
implement this last time; I bit off more than I could chew in one
bite. I propose instead to start with a minimal implementation that
captures the most useful elements. Specifically, I would like to
leave the is* methods of WideChar undefined (i.e., they all raise an
Unimplemented exception for the time being).
I think the bare minimum is the ability to convert UTF-8 to and
from Char and WideChar. This alone would probably deliver most of
the benefit for the least cost, and it is a much less ambitious
goal. I can't speak for Henry, but I assume this would also benefit
mGTK as well as SQLite3 (both expect their input strings in UTF-8).
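Concretely, the minimal interface I have in mind would look
something like this (a sketch only; the name Utf8 and the exact
types are hypothetical):

    signature UTF8 =
       sig
          (* decode an octet sequence into wide characters;
             fails on malformed UTF-8 *)
          val decode : Word8Vector.vector -> WideString.string
          (* encode wide characters as an octet sequence *)
          val encode : WideString.string -> Word8Vector.vector
       end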
There was some debate about how this should be approached in SML.
I'll lay out what I think is obvious to a person in possession of the
relevant facts and then the design point that remains open. First I
want to quote this (from Wikipedia), as I think this was a major
source of confusion in the past:
> Unicode and its parallel standard, ISO 10646 Universal Character
> Set, which together constitute the most modern character encoding
> [...] separated the ideas of what characters are available, their
> numbering, how those numbers are encoded as a series of "code
> units" (limited-size numbers), and finally how those units are
> encoded as a stream of octets (bytes). The idea behind this
> decomposition is to establish a universal set of characters that
> can be encoded in a variety of ways. To correctly describe this
> model needs more precise terms than "character set" and "character
> encoding". The terms used in the modern model follow:
> A character repertoire is the full set of abstract characters that
> a system supports. The repertoire may be closed, that is no
> additions are allowed without creating a new standard (as is the
> case with ASCII and most of the ISO-8859 series), or it may be
> open, allowing additions (as is the case with Unicode and to a
> limited extent the Windows code pages). [...]
> A coded character set specifies how to represent a repertoire of
> characters using a number of non-negative integer codes called code
> points. For example, in a given repertoire, a character
> representing the capital letter "A" in the Latin alphabet might be
> assigned to the integer 65, the character for "B" to 66, and so on.
> A complete set of characters and corresponding integers is a coded
> character set. Multiple coded character sets may share the same
> repertoire; for example ISO-8859-1 and IBM code pages 037 and 500
> all cover the same repertoire but map them to different codes. In a
> coded character set, each code point only represents one character.
>
> A character encoding form (CEF) specifies the conversion of a coded
> character set's integer codes into a set of limited-size integer
> code values that facilitate storage in a system that represents
> numbers in binary form using a fixed number of bits (e.g.,
> virtually any computer system). For example, a system that stores
> numeric information in 16-bit units would only be able to directly
> represent integers from 0 to 65,535 in each unit, but larger
> integers could be represented if more than one 16-bit unit could be
> used. This is what a CEF accommodates: it defines a way of mapping
> a single code point from a range of, say, 0 to 1.4 million, to a
> series of one or more code values from a range of, say, 0 to 65,535.
>
> The simplest CEF system is simply to choose large enough units that
> the values from the coded character set can be encoded directly
> (one code point to one code value). This works well for coded
> character sets that fit in 8 bits (as most legacy non-CJK encodings
> do) and reasonably well for coded character sets that fit in 16
> bits (such as early versions of Unicode). However, as the size of
> the coded character set increases (e.g. modern Unicode requires at
> least 21 bits/character), this becomes less and less efficient, and
> it is difficult to adapt existing systems to use larger code
> values. Therefore, most systems working with later versions of
> Unicode use either UTF-8, which maps Unicode code points to
> variable-length sequences of octets, or UTF-16, which maps Unicode
> code points to variable-length sequences of 16-bit words.
>
> Finally, a character encoding scheme (CES) specifies how the fixed-
> size integer codes should be mapped into an octet sequence suitable
> for saving on an octet-based file system or transmitting over an
> octet-based network. With Unicode, a simple character encoding
> scheme is used in most cases, simply specifying if the bytes for
> each integer should be in big-endian or little-endian order (even
> this isn't needed with UTF-8). However, there are also compound
> character encoding schemes, which use escape sequences to switch
> between several simple schemes (such as ISO 2022), and compressing
> schemes, which try to minimise the number of bytes used per code
> unit (such as SCSU, BOCU, and Punycode).
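To make the decomposition concrete, consider the character é (LATIN
SMALL LETTER E WITH ACUTE) at each layer of the model:

    character (repertoire):    é
    code point (coded set):    233 (U+00E9)
    UTF-8 code values (CEF):   0xC3 0xA9
    UTF-16 code value (CEF):   0x00E9
    CES octets:                C3 A9 for UTF-8; 00 E9 (big-endian)
                               or E9 00 (little-endian) for UTF-16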
I hope that with the above definitions there will be no further
debate on these points:
- CharX differs from IntX in that a CharX contains a character. This
sounds obvious, but it caused considerable debate earlier. I hope
that given the above definition of character, things are clear. A
character corresponds to our concept of the letter 'a', irrespective
of the font. A character is NOT a number. It is not even a code point.
- The CharX.ord method "returns the (non-negative) integer code of
the character c." should be interpreted as meaning "returns the (non-
negative) integer CODE POINT of the character c in UNICODE." There is
no serious competition to Unicode, and as its character repertoire is
open, there never will be.
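In code, this reading means (a sketch; the WideChar literal uses
the \uXXXX escape discussed below):

    val a     = Char.ord #"A"           (* 65, code point U+0041 *)
    val alpha = WideChar.ord #"\u03B1"  (* 945, code point U+03B1,
                                           GREEK SMALL LETTER ALPHA *)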
- This interpretation of CharX.ord means that Char contains exactly
those characters in the repertoire of ISO-8859-1. The SML standard
says a char contains the 'extended ASCII 8-bit character set'. This
should be read as 'the characters in the repertoire of ISO-8859-1':
there does not EXIST a single extended ASCII 8-bit character set,
something the original authors were simply unaware of. We take the
extension to be, specifically, ISO-8859-1. The results of isAlpha,
etc. remain unchanged.
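Conveniently, ISO-8859-1 assigns exactly the first 256 Unicode code
points, so Char.ord already agrees with the Unicode reading:

    val eAcute = Char.ord #"\233"   (* 233: 'é' in ISO-8859-1 and
                                       U+00E9 in Unicode *)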
- The inclusion of maxOrd in the CHAR signature unfortunately forces
our hand as to which character repertoires we can support.
Specifically, it forces us to use character encoding forms that are
prefixes of full Unicode. We therefore leave the 8-bit Char as I
described above, and take Char16 (name debate below) as the BMP
(Basic Multilingual Plane). Even though several code points in the
BMP remain unassigned to characters (especially the surrogates
reserved for UTF-16, which will always be unassigned), we choose not
to raise Chr when an unassigned code point is requested. We stick to
the rule of raising Chr if chr i has i > maxOrd. Thus if empty code
points are later filled, our programs remain compatible without
recompilation.
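A sketch of the rule, using the proposed Char16 (so maxOrd =
65535):

    val _ = Char16.chr 0xD800   (* a surrogate: unassigned forever,
                                   yet chr succeeds *)
    val _ = (ignore (Char16.chr 0x10000); raise Fail "impossible")
            handle Chr => ()    (* 0x10000 > maxOrd, so Chr is raised *)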
- For the time being I choose to ignore the basis' claim that "in
WideChar, the functions toLower, toUpper, isAlpha, ..., isUpper and,
in general, the definition of a ``letter'' are locale-dependent" and
raise an Unimplemented exception for these methods. I think the
standard is dreadfully misguided in assuming a global locale, and I
defer deciding what to do here till later, as this is what blocked
my progress last time. (IMO these functions are of questionable use
anyway.)
- The input character encoding scheme (CES) of an SML source file is
UTF-8. At present, the CES allows only 7-bit ASCII. Because
compilers give a parse error on 'high ASCII', we are free to choose
whatever meaning we want for the high bit. Choosing UTF-8 makes
sense so that we can include Unicode strings inside string literal
definitions, yet remain 100% backward compatible.
- Strings can include Unicode via \uXXXX for a code point (in hex)
from the BMP (Basic Multilingual Plane) or \UXXXXXXXX for a code
point in general (MLton already supports both). Furthermore,
supposing the input source SML file contains Unicode, these
characters are similarly allowed in a string. If a character is too
big to fit into the inferred char type, a compile-time error results.
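For example (using the proposed wide string type; MLton already
accepts both escape forms):

    val bmp    : WideString.string = "\u03B1\u03B2"  (* two BMP characters *)
    val beyond : WideString.string = "\U0001D11E"    (* U+1D11E, beyond the BMP *)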
- To be absolutely clear: Char is NOT UTF-8. There is no variable-
length encoding in any of the CharX methods. Similarly, String is
not UTF-8, and StringX.length remains constant time. When we convert
a CharX/StringX to UTF-8, the output type is
Word8Vector.vector---a sequence of octets. The same applies for
other encodings: 16-bit encodings like UTF-16 correspond to
Word16Vector.vector. Endianness is left up to how the Word16 is
input/output by the program later.
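For instance, reusing the hypothetical Utf8 sketch from earlier:

    val s : WideString.string = "\u03B1\u03B2"
    val two   = WideString.size s   (* 2 characters, constant time *)
    val bytes = Utf8.encode s       (* 4 octets: CE B1 CE B2 *)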
----
The main debatable point I keep coming back to is the character
encoding form (CEF) of WideChar in memory. I hope we agree with my
earlier point that the basis and the spirit of SML require a
fixed-width Char. That means UTF-16 (which can use two Word16s
together to encode a single Unicode code point) is not acceptable
for WideChar. Long ago
I argued that WideChar should be like LargeInt---able to hold all
Unicode characters. I know that this would require 32 bits per
character since 21 bits is not a convenient size. Taking a long-term
point of view, I don't think this cost is unbearable. It greatly
simplifies development, and you can always use Char16. Furthermore,
as Unicode has an open character repertoire, this gives MLton
considerable room for future Unicode extensions.
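To make the difference concrete (a sketch): with a 32-bit WideChar,
chr accepts any Unicode code point, while a 16-bit BMP type must
reject those beyond it:

    val clef = WideChar.chr 0x1D11E   (* MUSICAL SYMBOL G CLEF: fine
                                         if WideChar covers all of
                                         Unicode; raises Chr if
                                         WideChar is only the BMP *)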
As for naming the structures, Char and WideChar are dictated by the
standard. If WideChar is like LargeInt, then it would be desirable
to have a middle ground. I hesitate to call it UCS2Char, as UCS-2 is
a character encoding form; on the other hand, since maxOrd forced
our hand, it accurately names both the character repertoire and the
memory representation. The other option would be BMPChar, which I
find more accurate, as it describes the character repertoire that
can be contained without restricting our encoding form.
The alternative would be to take WideChar as the BMP, and define
LargeChar as the equivalent of LargeInt---a 32-bit type able to hold
all characters. I have no real opinion on whether Char/BMPChar/
WideChar or Char/WideChar/LargeChar is better.
If we agree with all my bullet points and can reach a consensus on
whether WideChar is 16 or 32 bits, then the actual implementation of
all the above is trivial. Once the structures exist in the basis, I
would turn my attention to a new structure for encoding/decoding
CharX to/from a Word{8,16}Vector.vector. This would then easily
allow Unicode string literals: we don't need to modify lex/yacc,
just extend the lexer to allow high ASCII in string literals. Then
we decode the UTF-8 inside MLton's frontend, not in yacc. The lexer
converts \uXXXX to UTF-8; a sketch of that conversion follows.
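Here is a minimal sketch of that step: encoding one code point as
UTF-8 octets. The function name is hypothetical, and it assumes its
argument is already a valid code point (at most 0x10FFFF):

    fun utf8Bytes (cp : word) : Word8.word list =
       let
          fun w8 x = Word8.fromInt (Word.toInt x)
       in
          if cp < 0wx80 then                 (* 1 octet: 0xxxxxxx *)
             [w8 cp]
          else if cp < 0wx800 then           (* 2 octets: 110xxxxx 10xxxxxx *)
             [w8 (0wxC0 + cp div 0wx40),
              w8 (0wx80 + cp mod 0wx40)]
          else if cp < 0wx10000 then         (* 3 octets *)
             [w8 (0wxE0 + cp div 0wx1000),
              w8 (0wx80 + (cp div 0wx40) mod 0wx40),
              w8 (0wx80 + cp mod 0wx40)]
          else                               (* 4 octets *)
             [w8 (0wxF0 + cp div 0wx40000),
              w8 (0wx80 + (cp div 0wx1000) mod 0wx40),
              w8 (0wx80 + (cp div 0wx40) mod 0wx40),
              w8 (0wx80 + cp mod 0wx40)]
       end

    (* e.g. utf8Bytes 0wx3B1 = [0wxCE, 0wxB1], the UTF-8 for GREEK
       SMALL LETTER ALPHA *)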
Agreed? Can I just whip this up and check it in? ;-)