[MLton] Unicode and WideChar support
Stephen Weeks
sweeks@sweeks.com
Tue, 29 Nov 2005 14:36:37 -0800
> Keeping with the mindset that a structure matching CHAR is in fact a
> character set, not just a bag of integers, how about this:
>
> Char (8 bit, high ascii 'undefined') <-- required (raises Chr for values beyond FF)
> Ascii (7 bit) <-- required (raises Chr for values beyond 7F)
>
> Iso8859_1 (8 bit) <-- optional (raises Chr for values beyond FF)
> Ucs2 (16 bit) <-- optional (raises Chr for surrogates and values beyond FFFF)
> WideChar (must be Unicode) <-- optional (raises Chr for surrogates and values beyond 10FFFF)
I like this proposal.
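
To make the ranges concrete, here is a rough model of the checks each
structure's chr would perform. It is written in Python purely for
illustration and is not MLton code; the PROPOSED table, the Chr class
and checked_chr are all hypothetical names that only mirror the list
quoted above.

```python
# Hypothetical model of the proposed structures, not MLton code.
# Each entry maps a structure name to
# (largest valid ordinal, whether surrogates D800-DFFF are rejected).
PROPOSED = {
    "Char": (0xFF, False),
    "Ascii": (0x7F, False),
    "Iso8859_1": (0xFF, False),
    "Ucs2": (0xFFFF, True),
    "WideChar": (0x10FFFF, True),
}


class Chr(Exception):
    """Stands in for the Basis Library's Chr exception."""


def checked_chr(structure, ordinal):
    """Return ordinal if it is valid for the named structure, else raise Chr."""
    max_ord, reject_surrogates = PROPOSED[structure]
    if ordinal < 0 or ordinal > max_ord:
        raise Chr(f"{structure}: {ordinal:#x} is out of range")
    if reject_surrogates and 0xD800 <= ordinal <= 0xDFFF:
        raise Chr(f"{structure}: {ordinal:#x} is a surrogate")
    return ordinal
```

Under this reading, checked_chr("Ascii", 0x80) and
checked_chr("WideChar", 0xD800) both raise Chr, while
checked_chr("Char", 0x80) succeeds even though the proposal leaves the
meaning of that value undefined.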
As to whether \U escapes should accept 6 or 8 hex digits, I lean
towards 8 because it seems possible that in the future we will need
more than 6 digits, and I wouldn't want to break old code or to
support 6 and 8 simultaneously. Also, we have \u for the common case
of 4 digits. Finally, with source files allowed to be UTF-8, \U
escapes should be pretty rare.
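
As a sketch of what fixed-width escapes look like to a lexer, here is
a small decoder, again in Python and only for illustration. It assumes
\u takes exactly 4 hex digits and \U exactly 8, and that escapes naming
a surrogate or a value beyond 10FFFF are rejected; decode_escapes is a
hypothetical name, and nothing here describes what MLton does.

```python
import re

# Assumed syntax: \u is followed by exactly 4 hex digits, \U by exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text):
    # Replace each 4-digit or 8-digit escape with the code point it names.
    def replace(match):
        code_point = int(match.group(1) or match.group(2), 16)
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"invalid code point {code_point:#x}")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)
```

With fixed widths the lexer never has to guess where an escape ends.
A \U followed by fewer than 8 hex digits simply does not match and is
left untouched by this sketch, where a real lexer would presumably
report an error.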
> If we are banning values beyond 10FFFF, then perhaps we should also
> ban values between D800-DFFF which may not appear in a conforming
> UTF-32 string.
Yes, that makes sense if we are really thinking of WideChar as
Unicode.
> One question is whether or not the Ucs2/Iso8859_1/Ascii structures
> should have all of the extra structures that go with them
> (Ucs2String, Ucs2Vector, Ucs2Substring, ...).
One way to go would be to export functors that let people build these
if they really want them.