Support in The Definition of Standard ML
There is no real support for Unicode in the Definition; there are only a few throw-away sentences along the lines of "the characters with numbers 0 to 127 coincide with the ASCII character set."
Support in The Standard ML Basis Library
Neither is there real support for Unicode in the Basis
Library. The general consensus (which includes the opinions of the
editors of the Basis Library) is that the WideChar
and WideString
structures are insufficient for the purposes of Unicode. There is no
LargeChar
structure, which in itself is a deficiency, since a
programmer can not program against the largest supported character
size.
Current Support in MLton
MLton, as a minor extension over the Definition, supports UTF-8 byte sequences in text constants. This feature enables "UTF-8 convenience" (but not comprehensive Unicode support); in particular, it allows one to copy text from a browser and paste it into a string constant in an editor and, furthermore, if the string is printed to a terminal, then will (typically) appear as the original text. See the extended text constants feature of Successor ML for more details.
MLton, also as a minor extension over the Definition, supports
\Uxxxxxxxx
numeric escapes in text constants and has preliminary
internal support for 16- and 32-bit characters and strings.
MLton provides WideChar
and WideString
structures, corresponding
to 32-bit characters and strings, respectively.
Questions and Discussions
There are periodic flurries of questions and discussion about Unicode in MLton/SML. In December 2004, there was a discussion that led to some seemingly sound design decisions. The discussion started at:
There is a good summary of points at:
In November 2005, there was a followup discussion and the beginning of some coding.
Also see
The fxp XML parser has some support for dealing with Unicode documents.