[MLton] Unicode... again

Michael Norrish michael.norrish at nicta.com.au
Thu Feb 8 14:59:27 PST 2007


Wesley W. Terpstra wrote:

 > - To be absolutely clear: Char is NOT UTF-8. There is no variable length
 > encoding in any of the CharX methods. [...]

 > [...] Long ago I argued that WideChar should be like LargeInt---able to
 > hold all Unicode characters. I know that this would require 32 bits
 > per character since 21 bits is not a convenient size. Taking a
 > long-term point of view, I don't think this cost is unbearable.

 > As for naming the structures, Char and WideChar are dictated by the
 > standard. If WideChar is like LargeInt, then it would be desirable to
 > have a middle ground. I hesitate to call it UCS2Char as this is a
 > character encoding form [...]

I think I'm in total agreement with your vision.  Pragmatically, I
wonder how important you think providing the 16-bit character type is.
It seems a kind of optional extra for people who want space-efficient
storage of BMP text.  Or do you imagine the vast majority of people will
want to use just the BMP, and will therefore resent wasting 16 bits per
char?  (It certainly does seem as if there won't be much use of
characters outside the BMP, but who can tell?)
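
To make the size trade-off concrete, here is a minimal Python sketch of the
arithmetic behind the 16-bit versus 32-bit question. It is only an
illustration: the helper names (bits_needed, storage_bytes, fits_in_bmp) are
made up for this example and correspond to nothing in MLton or the Basis
Library.

```python
# Illustrative arithmetic only; these helpers are hypothetical and are not
# part of MLton or the SML Basis Library.

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point
MAX_BMP = 0xFFFF           # last code point of the Basic Multilingual Plane


def bits_needed(value: int) -> int:
    """Number of bits required to represent `value` as an unsigned integer."""
    return max(1, value.bit_length())


def storage_bytes(text: str, bits_per_char: int) -> int:
    """Bytes used by a fixed-width representation of `text`.

    Python's len() counts code points, so this is one slot per character.
    """
    return len(text) * bits_per_char // 8


def fits_in_bmp(text: str) -> bool:
    """True if every character of `text` lies inside the BMP."""
    return all(ord(ch) <= MAX_BMP for ch in text)


if __name__ == "__main__":
    print(bits_needed(MAX_CODE_POINT))  # width needed for all of Unicode
    print(bits_needed(MAX_BMP))         # width needed for the BMP alone
    sample = "na\u00efve"
    print(fits_in_bmp(sample), storage_bytes(sample, 16), storage_bytes(sample, 32))
```

For text that stays inside the BMP, the 32-bit representation costs exactly
twice the storage of the 16-bit one. That doubling is the "wasted 16 bits per
char" in question.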

 > If we agree with all my bullet points and can reach a consensus on
 > whether WideChar is 16/32, then the actual implementation of all the
 > above is trivial. Once the structures exist in the basis, I would turn
 > my attention to a new structure for encoding/decoding CharX to/from a
 > Word{8,16}Vector.vector. This would then easily allow Unicode string
 > literals: we don't need to modify lex/yacc, just extend the lexer to
 > allow high ascii in string literals. Then we decode the UTF-8 inside
 > MLton's frontend, not in yacc. The lexer converts \uXXXX to UTF-8.
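
The pipeline described in the quoted paragraph can be sketched roughly as
follows. This is a Python illustration under stated assumptions, not MLton's
lexer or frontend code: the function names are hypothetical, the raw literal
is modelled as a byte string, and escaped backslashes (such as a literal
backslash followed by "u") are deliberately not handled.

```python
import re

# Hypothetical sketch of "lexer rewrites \uXXXX to UTF-8, frontend decodes
# UTF-8". None of these names come from MLton.


def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Matches a backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def lex_string_literal(raw: bytes) -> bytes:
    """Lexer step: rewrite \\uXXXX escapes to UTF-8, pass other bytes through.

    Bytes >= 0x80 already present in the source are left untouched, on the
    assumption that the source file is itself UTF-8.
    """
    def replace(match):
        code_point = int(match.group(1).decode("ascii"), 16)
        return encode_utf8(code_point)

    return _U_ESCAPE.sub(replace, raw)


def decode_utf8(data: bytes) -> list:
    """Frontend step: turn UTF-8 bytes into a list of code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


if __name__ == "__main__":
    literal = lex_string_literal(rb"caf\u00e9")
    print(literal, decode_utf8(literal))
```

The appeal of this split is that the lexer only ever emits bytes. Decoding
into fixed-width characters happens in a single place afterwards, regardless
of whether the character came from an escape or from raw bytes in the source.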

 > Agreed? Can I just whip this up and check it in? ;-)

Go for it!

Michael.


