[MLton] Unicode... again
Wesley W. Terpstra
terpstra at gkec.tu-darmstadt.de
Fri Feb 9 13:51:25 PST 2007
On Feb 9, 2007, at 10:05 PM, Matthew Fluet wrote:
>> Once again I find myself needing Unicode in MLton.
> Just to orient the discussion of "what to implement where": you
> find yourself needing to process Unicode files with an SML program
> compiled by MLton /or/ you find yourself needing to have Unicode
> strings in an SML program compiled by MLton.
> The former doesn't require any changes to the compiler (not that
> they wouldn't be welcome).
TBH, I don't see much of this distinction. If I'm processing Unicode
files with an SML program compiled by MLton, I will almost certainly
need WideChar from MLton, which does not exist. I could certainly
fake it with a Word32, but the basis tells me to use another type.
>> - CharX differs from IntX in that a CharX contains a character.
>> This sounds obvious, but it caused considerable debate earlier. I
>> hope that given the above definition of character, things are
>> clear. A character corresponds to our concept of the letter 'a',
>> irrespective of the font. A character is NOT a number. It is not
>> even a code point.
> I don't recall the details of the earlier debate, but while
> expecting CharX to differ from IntX sounds good, it doesn't give
> much insight into the representation. In particular the 'X' would
> almost certainly seem to imply a fixed-width word/integer.
At the moment, there is only Char and WideChar in what I've been
writing. I never meant to actually call them Char8/16/32 this time
around. I think you are completely correct that it would otherwise
imply how the character is stored. The representation is now
controlled the same way int width is controlled. I've also
generalized this for Char (though not all the way to adding the
command-line option, as that would break Byte).
>> - For the time being I choose to ignore the basis' claim that "in
>> WideChar, the functions toLower, toLower, isAlpha,..., isUpper
>> and, in general, the definition of a ``letter'' are locale-
>> dependent" and raise an Unimplemented exception for these methods.
>> I think the standard is dreadfully misguided in assuming a global
>> locale, and I defer what to do here till later as it is what
>> blocked my progress last time. (IMO these functions have only
>> questionable use, anyway)
> I think that is reasonable.
Actually, since I've functorized the Char implementation in the
basis, it's presently following the exact same rules for WideChar as
well. Locale-specific methods should be in another structure IMO. One
that is parameterized by the locale.
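
To make the "parameterized by the locale" idea concrete, here is a rough sketch of one possible shape, not what is in my changeset. Everything named here (LOCALE, LocaleChar, AsciiLocale) is hypothetical, and characters are stood in for by Word32.word since WideChar does not exist yet.

```sml
(* Hypothetical signature: what a locale would have to supply. *)
signature LOCALE =
sig
  val isAlpha : Word32.word -> bool
  val toLower : Word32.word -> Word32.word
  val toUpper : Word32.word -> Word32.word
end

(* Locale-dependent operations live in a functor over the locale,
   instead of in CHAR where they would need a global locale. *)
functor LocaleChar (L : LOCALE) =
struct
  val isAlpha = L.isAlpha
  val toLower = L.toLower
  val toUpper = L.toUpper
  (* A letter is upper case if lowering it changes it, and vice versa. *)
  fun isUpper c = isAlpha c andalso toLower c <> c
  fun isLower c = isAlpha c andalso toUpper c <> c
end

(* Toy locale that only knows the ASCII letters; an assumption for
   illustration, not a real locale definition. *)
structure AsciiLocale : LOCALE =
struct
  fun inRange (lo, hi) (c : Word32.word) = lo <= c andalso c <= hi
  fun isAlpha (c : Word32.word) =
    inRange (0wx41, 0wx5A) c orelse inRange (0wx61, 0wx7A) c
  fun toLower (c : Word32.word) =
    if inRange (0wx41, 0wx5A) c then c + 0wx20 else c
  fun toUpper (c : Word32.word) =
    if inRange (0wx61, 0wx7A) c then c - 0wx20 else c
end

structure AsciiChar = LocaleChar (AsciiLocale)
```

With this layout a program picks its locale by choosing which structure to apply the functor to, and two locales can coexist in one program.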
> As I understand the implementation of the latter in MLton, any
> string that has \uXXXX will be inferred to have type
> String16.string = Char16Vector.vector and any string that has
> \UXXXXXXXX will be inferred to have type String32.string =
> Char32Vector.vector. (Inference might also force the type to a
> higher StringN.string type.)
That's exactly what I expected. :-)
> That would seem to lend more support for Char16 as BMP and Char32
> as full unicode.
skaller has changed my mind since I last wrote this. Providing a
Char16 (under any name) encourages people to use it. Just providing a
WideChar (=Char32) is probably better. If people need a more memory
efficient representation, they can convert WideChar/WideString into a
Word8Vector.vector that is UTF-8 encoded.
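
A minimal sketch of that conversion, assuming code points arrive as a Word32.word list (a stand-in for WideString until WideChar exists). The names encodeCodePoint and encodeUtf8 are made up for illustration, and the sketch does not reject surrogates or values above 0x10FFFF.

```sml
(* Encode one code point as 1 to 4 UTF-8 bytes.
   Assumes cp is a valid Unicode scalar value; no validation is done. *)
fun encodeCodePoint (cp : Word32.word) : Word8.word list =
  let
    fun byte w = Word8.fromLarge (Word32.toLarge w)
    (* Continuation byte 10xxxxxx carrying six bits starting at `shift`. *)
    fun cont shift =
      byte (Word32.orb (0wx80, Word32.andb (Word32.>> (cp, shift), 0wx3F)))
    (* Leading byte: length marker plus the remaining high bits. *)
    fun lead (mark, shift) = byte (Word32.orb (mark, Word32.>> (cp, shift)))
  in
    if cp < 0wx80 then [byte cp]
    else if cp < 0wx800 then [lead (0wxC0, 0w6), cont 0w0]
    else if cp < 0wx10000 then [lead (0wxE0, 0w12), cont 0w6, cont 0w0]
    else [lead (0wxF0, 0w18), cont 0w12, cont 0w6, cont 0w0]
  end

(* Encode a whole sequence of code points into a byte vector. *)
fun encodeUtf8 (cps : Word32.word list) : Word8Vector.vector =
  Word8Vector.fromList (List.concat (List.map encodeCodePoint cps))

(* U+0041, U+00E9, U+20AC encode to 41, C3 A9, E2 82 AC. *)
val example = encodeUtf8 [0wx41, 0wxE9, 0wx20AC]
```

ASCII text costs one byte per character this way, which is where the memory saving over a 32-bit representation comes from.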
> I don't see CharX or StringX as any encoding.
Actually, with hindsight, they DO have an encoding. The succ/pred
methods in CHAR imply that we encode them as code points.
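
The existing 8-bit Char already shows the effect. This small illustration uses only the standard Char structure: succ and pred step through ord values, so the type carries a numeric ordering whether we like it or not.

```sml
(* succ moves to the next ord value, so #"a" is followed by #"b". *)
val next = Char.succ #"a"
val stepsByOne = Char.ord next = Char.ord #"a" + 1

(* pred undoes succ. *)
val back = Char.pred next

(* The ordering is bounded: succ on the last character raises Chr. *)
val overflow = (Char.succ Char.maxChar; false) handle Chr => true
```

A WideChar matching the same CHAR signature would need the same kind of ordering, and the code point is the obvious candidate.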
>> Agreed? Can I just whip this up and check it in? ;-)
> I believe that there is still a unicode branch in the repository.
> I would recommend that you merge changes from trunk into that
> branch and continue development there.
> That gives people a chance to see development and suggest changes
> before we merge them into trunk.
The branch hit a dead-end. The new 64-bit changes also obsoleted it.
My new changeset started fresh off the current trunk, and is almost
complete. I could make a new branch with them or send a patch to the
> The latest version of SML/NJ (ver 110.62) includes
> signature UTF8
> structure UTF8 : UTF8
I'll take a look at this to see about UTF-8 conversion, once we have
WideChar in svn.
PS. I need a heap sort under MLton's licence. Anyone have a bug
tested (and short) implementation?