Unicode - MLton Standard ML Compiler (SML Compiler)

The current release of MLton does not support Unicode. We are working on adding support for:

WideChar structure.
UTF-8 encoded source files.

There is no real support for Unicode in the Definition of Standard ML; there are only a few throw-away sentences along the lines of "ASCII must be a subset of the character set in programs".

Neither is there real support for Unicode in the Standard ML Basis Library. The general consensus (which includes the opinions of the editors of the Basis Library) is that the WideChar structure is insufficient for the purposes of Unicode. There is no LargeChar structure, which in itself is a deficiency, since a programmer cannot program against the largest supported character size.

MLton has some preliminary support for 16-bit and 32-bit characters and strings. It is even possible to include arbitrary Unicode characters in 32-bit strings using a \Uxxxxxxxx escape sequence. (This longer escape sequence is a minor extension of the Definition, which only allows \uxxxx.) This is by no means completely satisfactory in terms of support for Unicode, but it is what is currently available.
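
To make the \Uxxxxxxxx form concrete, here is a minimal sketch of what such an escape denotes: eight hexadecimal digits naming one Unicode code point, which can then be encoded, for example, as UTF-8. This is not MLton's lexer or its internal string representation. The names parseEscape, utf8, and escapeToUtf8 are hypothetical helpers introduced for illustration, and the sketch assumes the optional Word32 structure of the Basis Library is available (MLton provides it).

```sml
(* Hypothetical helper: read the text of a \Uxxxxxxxx escape (a backslash,
   a capital U, and exactly eight hex digits) and return the code point.
   Returns NONE for malformed input, surrogates, or values above U+10FFFF. *)
fun parseEscape (s : string) : Word32.word option =
  if String.size s <> 10 orelse not (String.isPrefix "\\U" s) then NONE
  else
    let
      val digits = String.extract (s, 2, NONE)
    in
      if not (List.all Char.isHexDigit (String.explode digits)) then NONE
      else
        case StringCvt.scanString (Word32.scan StringCvt.HEX) digits of
          NONE => NONE
        | SOME cp =>
            if cp > 0wx10FFFF orelse (cp >= 0wxD800 andalso cp <= 0wxDFFF)
            then NONE
            else SOME cp
    end

(* Hypothetical helper: encode a valid code point as a UTF-8 byte string. *)
fun utf8 (cp : Word32.word) : string =
  let
    fun byte (w : Word32.word) = Char.chr (Word32.toInt w)
    (* Leading byte: length prefix combined with the top bits of cp. *)
    fun lead (prefix, shift) =
      byte (Word32.orb (prefix, Word32.>> (cp, shift)))
    (* Continuation byte: 10xxxxxx carrying six bits of cp. *)
    fun cont shift =
      byte (Word32.orb (0wx80, Word32.andb (Word32.>> (cp, shift), 0wx3F)))
  in
    String.implode
      (if cp < 0wx80 then [byte cp]
       else if cp < 0wx800 then [lead (0wxC0, 0w6), cont 0w0]
       else if cp < 0wx10000 then [lead (0wxE0, 0w12), cont 0w6, cont 0w0]
       else [lead (0wxF0, 0w18), cont 0w12, cont 0w6, cont 0w0])
  end

(* Combine the two steps; NONE if the escape text is not valid. *)
fun escapeToUtf8 (s : string) : string option =
  Option.map utf8 (parseEscape s)
```

For instance, escapeToUtf8 applied to the ten characters \U0001D11E (written "\\U0001D11E" as an SML string literal) yields the four bytes F0 9D 84 9E. That code point lies above U+FFFF, so the four-digit \uxxxx form cannot express it.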

There are periodic flurries of questions and discussion about Unicode in MLton/SML. In December 2004, there was a discussion that led to some seemingly sound design decisions. The discussion started at:

http://mlton.org/pipermail/mlton/2004-December/026396.html

There is a good summary of points at:

http://mlton.org/pipermail/mlton/2004-December/026440.html

In November 2005, there was a follow-up discussion and the beginning of some coding:

http://mlton.org/pipermail/mlton/2005-November/028300.html

We are optimistic that support will appear in the next MLton release.

Also see

The fxp XML parser has some support for dealing with Unicode documents.