[MLton-user] SML unicode support
Henry Cejtin
henry@sourcelight.com
Wed, 5 Jan 2005 13:18:09 -0600
There is no way to casually handle Unicode characters (UTF-8 or otherwise) in C.
The encodings UTF-8 and UTF-16 do not store one character in 8 or 16 bits.
That would clearly not be possible because there are more than 256 and even
more than 65,536 Unicode characters. UTF-8 and UTF-16 are ways of encoding
characters as SEQUENCES of 8-bit bytes or 16-bit chunks. Not all
characters take the same number of bytes/chunks: 1 to 4 bytes in UTF-8,
1 or 2 chunks in UTF-16. UTF-32 makes all characters the same size
(32 bits, or 4 bytes), but it is rarely used externally (in files)
because of the large waste of space.
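
To make the variable width concrete, here is a minimal sketch in C of
encoding a single code point. The names utf8_encode and utf16_units are
made up for this illustration and are not part of any library; the code
assumes the caller supplies a 4-byte output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into
   out[0..3]. Returns the number of bytes written (1 to 4), or 0 if cp
   is not a valid scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)  /* surrogates are not characters */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: number of 16-bit chunks UTF-16 needs for cp.
   Returns 1 or 2, or 0 if cp is not a valid scalar value. */
size_t utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 1;
    if (cp <= 0x10FFFF) return 2;          /* needs a surrogate pair */
    return 0;
}
```

So 'A' (U+0041) takes one byte in UTF-8, while U+20AC (the euro sign)
takes three, and anything above U+FFFF takes four bytes in UTF-8 and two
chunks in UTF-16.
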
The expectation is that files will be in UTF-8 or UTF-16 and that, on
reading, they will be converted to something more convenient. (Note, if you
keep a string in memory as UTF-8, then you can't go to the N-th character
without walking through all the previous characters to see how long they are.)
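
A rough sketch in C of both options: walking a UTF-8 buffer to find the
N-th character, and converting it once to an array of 32-bit values that
can be indexed directly. The function names are invented for this example,
and the code assumes the input is well-formed UTF-8; it checks lead bytes
and truncation but does not validate continuation bytes or reject overlong
forms.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the UTF-8 sequence starting with lead byte b,
   or 0 if b cannot start a sequence. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Byte offset of the n-th character (0-based) in s[0..len-1].
   This has to step over every earlier character, so it costs O(n).
   Returns len if the string is too short or malformed. */
size_t utf8_offset_of(const unsigned char *s, size_t len, size_t n)
{
    size_t i = 0;
    while (i < len && n > 0) {
        size_t k = utf8_seq_len(s[i]);
        if (k == 0 || k > len - i) return len;
        i += k;
        n--;
    }
    return i;
}

/* Decode s[0..len-1] into out[0..cap-1], one 32-bit value per character.
   Stops at the first bad lead byte, at a truncated sequence, or when out
   is full. Returns the number of characters written. */
size_t utf8_to_utf32(const unsigned char *s, size_t len,
                     uint32_t *out, size_t cap)
{
    /* Payload bits carried by the lead byte, indexed by sequence length. */
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    size_t i = 0, count = 0;
    while (i < len && count < cap) {
        size_t k = utf8_seq_len(s[i]);
        if (k == 0 || k > len - i) break;
        uint32_t cp = s[i] & lead_mask[k];
        for (size_t j = 1; j < k; j++)
            cp = (cp << 6) | (s[i + j] & 0x3F);  /* 6 bits per continuation */
        out[count++] = cp;
        i += k;
    }
    return count;
}
```

After the conversion, the N-th character is just out[N], at the price of
four bytes per character in memory.
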