[MLton] Unicode and WideChar support

Stephen Weeks sweeks@sweeks.com
Tue, 29 Nov 2005 14:36:37 -0800


> Keeping with the mindset that a structure matching CHAR is in fact a
> character set, not just a bag of integers, how about this:
>
> Char (8 bit, high ascii 'undefined') <-- required (raises Chr for values beyond FF)
> Ascii (7 bit) <-- required (raises Chr for values beyond 7F)
>
> Iso8859_1 (8 bit) <-- optional (raises Chr for values beyond FF)
> Ucs2 (16 bit) <-- optional (raises Chr for surrogates and values beyond FFFF)
> WideChar (must be Unicode) <-- optional (raises Chr for surrogates and values beyond 10FFFF)

I like this proposal.
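
To make the ranges in the quoted proposal concrete, here is a rough sketch of
the checks each structure's chr would perform. It is written in Python rather
than SML purely for illustration. The names LIMITS, Chr, and chr_checked are
made up for this sketch and are not part of the Basis Library or MLton, and
Char is treated simply as an 8-bit range.

```python
class Chr(Exception):
    """Stand-in for the SML Basis exception Chr (illustrative only)."""


# Hypothetical table: structure name -> (largest code, reject surrogates?)
LIMITS = {
    "Char": (0xFF, False),
    "Ascii": (0x7F, False),
    "Iso8859_1": (0xFF, False),
    "Ucs2": (0xFFFF, True),
    "WideChar": (0x10FFFF, True),
}


def chr_checked(structure, code):
    """Return code if it is valid for the named structure, else raise Chr."""
    max_code, reject_surrogates = LIMITS[structure]
    if code < 0 or code > max_code:
        raise Chr(f"{code:#x} is out of range for {structure}")
    # D800-DFFF are the UTF-16 surrogate code points.
    if reject_surrogates and 0xD800 <= code <= 0xDFFF:
        raise Chr(f"{code:#x} is a surrogate, not allowed in {structure}")
    return code
```

Under these assumptions, chr_checked("Ascii", 0x80) raises Chr, while
chr_checked("Iso8859_1", 0x80) returns 0x80.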

As to whether \U escapes should accept 6 or 8 hex digits, I lean
towards 8 because it seems possible that in the future we will need
more than 6 digits, and I wouldn't want to break old code or to
support 6 and 8 simultaneously.  Also, we have \u for the common case
of 4 digits.  Finally, with source files allowed to be UTF-8, \U
escapes should be pretty rare.
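
As an illustration of the fixed-width option, the sketch below decodes \u
with exactly 4 hex digits and \U with exactly 8. It is Python, not MLton's
lexer, and decode_escapes is a hypothetical helper. It also assumes the
decoded value must be a Unicode scalar value, and it ignores other escapes
(including an escaped backslash) for brevity.

```python
import re

# \u takes exactly 4 hex digits; \U takes exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes in text with their characters."""

    def replace(match):
        digits = match.group(1) or match.group(2)
        code = int(digits, 16)
        # Assumption: reject surrogates and anything beyond 10FFFF.
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            raise ValueError(f"invalid code point {code:#x}")
        return chr(code)

    return _ESCAPE.sub(replace, text)
```

With a fixed width of 8, an escape such as \U0001F600 needs two leading
zeros today, but the syntax would not have to change if code points beyond
6 hex digits were ever assigned.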

> If we are banning values beyond 10FFFF, then perhaps we should also
> ban values between D800-DFFF which may not appear in a conforming
> UTF-32 string.

Yes, that makes sense if we are really thinking of WideChar as
Unicode.

> One question is whether or not the Ucs2/Iso8859_1/Ascii structures
> should have all of the extra structures that go with them
> (Ucs2String, Ucs2Vector, Ucs2Substring, ...).

One way to go would be to export functors that let people build these
if they really want them.