Wesley W. Terpstra
Thu, 9 Dec 2004 00:56:34 +0100
On Wed, Dec 08, 2004 at 01:27:10PM -0800, Stephen Weeks wrote:
> > The values #"a" and "dfgdfsg" should be polymorphic just like 5.
> Yes. BTW, the SML term in this situation is "overloaded", not
One more thing on this point: this means that depending on which function
the value gets passed to, it might be expanded, e.g.:
"foo" -> 'foo' when passed as a String
"foo" -> '\^@f\^@o\^@o' when passed as a UCS2String
(in terms of one possible binary representation in memory)
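As a sketch of what this overloading could look like in source code (UCS2String and literal support for it are hypothetical names, not part of the basis library):

```sml
(* Hypothetical sketch: the same overloaded literal resolved to two
   different types by context.  UCS2String is an assumed name. *)
val narrow : string = "foo"
   (* one possible in-memory representation: 66 6F 6F *)
val wide : UCS2String.string = "foo"
   (* one possible in-memory representation: 00 66 00 6F 00 6F *)
```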
I think the equality issue is sidestepped since the parameters are different
types and thus can't be compared by =, but I am still an SML newbie. :)
> > It should be made explicit that Char is ISO-8859-1 (so Char.ord is Unicode).
> In what way is this not explicit now? Is this just a documentation
> issue or is there some code that needs to change?
It already is ISO-8859-1, yeah, but it should be documented and specified
that it is. Otherwise, 'Char.ord' makes no sense for chars 0x80-0xff.
Furthermore, CHAR.ord should be declared to output the Unicode code point.
OTOH, if we can't change http://www.standardml.org/Basis/char.html, there's
not much we can do to fix the documentation. :(
> We assume an underlying alphabet of N characters (N >= 256),
> numbered 0 to N - 1, which agrees with the ASCII character set on
> the characters numbered 0 to 127.
Unicode agrees with ISO-8859-1 for the first 256 characters, so it
definitely agrees with ASCII. N >= 256 seems to anticipate Unicode.
> The interval [0, N-1] is called
> the ordinal range of the alphabet. A string constant is a sequence,
> between quotes ("), of zero or more printable characters (i.e.,
> numbered 33-126), spaces, or escape sequences.
Well, it all depends on how you read this. ;)
Printable characters certainly include the Russian alphabet on my keyboard.
It all depends on your locale. The comment about numbered 33-126 might be a
clarification of which ASCII characters are printable.
> I suppose there is some wiggle room in the phrase
> "printable characters", which is specifically defined relevant to
> ASCII but could be interpreted to vary depending on the underlying
That's the wiggle-room I'm jumping on. =)
> \uxxxx The single character with number xxxx (4 hexadecimal digits
> denoting an integer in the ordinal range of the alphabet)
Yes, I know about "\uxxxx" -- it's completely inadequate.
For starters, Unicode 4.0 doesn't even fit in 16 bits.
Furthermore, you can't expect non-English programmers who are writing source
code to write strings like "\u041f\u0440\u0438\u0432\u0435\u0442" -- Привет
(which means 'hi' in Russian). That's a 6-letter word; on my Russian
keyboard, I typed it directly into this UTF-8-enabled editor in four
seconds. Looking up the escape codes took me about two minutes.
Russian programmers writing software to be used by Russians and maintained
by Russians need to write Russian text strings. I think that's clear.
If you don't allow the strings to include characters used in foreign
languages, most foreign programmers will simply not use SML. It's bad
enough that most programming languages restrict identifiers so as to
exclude characters from other languages.
> In any case, it seems like the most portable route to go at first
> would be to keep string constants as is, and require \u escape
> sequences for Unicode characters. Once we support characters beyond
> 255, we're not portable to other SML implementations anyway, so I don't
> object to making character and string constants UTF-8.
Ultimately, the standard you quoted got one thing very right: the base
alphabet is left unspecified. That means you can specify for MLton that the
base alphabet is Unicode. The _encoding_ of that base alphabet is something
that you can be much more flexible about.
If we add Unicode+iconv support to MLton's basis library, then MLton itself
should be able to read the various encoded forms of the base alphabet. What
I was proposing was that by default, you 'guess' that files are in UTF-8. If
a German programmer wants, he could run 'mlton -c iso-8859-1 foo.sml'
to have MLton decode his äöüß without needing a UTF-8 editor.
> I am not convinced that it is a good idea to allow UTF-8 encoding
> elsewhere (e.g. in variable names). That seems like it will kill
> portability with other SML implementations
I wasn't proposing to extend the definition of identifiers.
However, the entire source file should be encoded in exactly one way.
You can still throw parse-errors at users who use Scheiße for identifiers.
I didn't propose this mostly because this is a step most programming
languages do not take. Notable exception: anything XML-based. Now that
you've raised the issue though...
There's the MS-principle: embrace-and-extend. ;-)
There's nothing that forces programmers to put non-ascii-printable
characters into identifier names. If they do so, they've made a choice
not to be portable. It doesn't hurt portability TO MLton, only FROM it.
> as well as make it
> more difficult to share code between people of different languages.
I completely disagree here.
Some projects are written by people who don't speak English.
Being forced to use a latin alphabet for variable names can make it very
difficult for the non-English-speaking programmer to remember what his
variable names are supposed to mean. Certainly the SML keywords are English,
but there are only a few of those and the basis library can be translated.
Forcing English on people to 'standardize' code is short-sighted, imo.
> Also keep in mind that this will require a substantial upgrade to
> ML-Lex to support Unicode (not a bad thing, just work).
I know; I looked at writing an XML parser using ML-Lex before I ported fxp.
One idea about how to go: parameterize the functor by a TEXT type.
However, those lookup tables probably won't work for Unicode...
Anyways, parsing the source-code as Unicode is certainly the very last step.
> > val decode: string -> (char, 'a) reader -> (LargeChar, 'a) reader
> > val encode: [edit: string ->] LargeChar -> string
> I'd rather provide a datatype (UTF8 | UTF16 | ...). For
> extensibility, you can have a variant "X of string".
I'm a bit concerned that adding more encodings to the list might break
compatibility with programs that pattern match against this type.
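To make the concern concrete, here is a sketch (the constructor list beyond UTF8/UTF16 is my assumption):

```sml
datatype encoding = UTF8 | UTF16 | X of string

(* A client written against today's type: *)
fun describe UTF8  = "UTF-8"
  | describe UTF16 = "UTF-16"
  | describe (X s) = "other: " ^ s

(* If a later release adds a constructor, say UTF32, this match
   becomes nonexhaustive and every such client needs editing.
   The 'X of string' escape hatch avoids that, but only if new
   encodings arrive as X "utf-32" rather than as new constructors. *)
```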
> > The whole CHAR.is* family should never have been dumped inside CHAR.
> > The best way to handle this in SML is not clear (no OOP - hrm).
> * have a ref cell holding the locale and use fluid-let
> (I guess this is the C screwup)
A single program may need to interface with several locales at once.
So yes, that's the C screwup. Threading makes this worse.
> * make the locale an argument to each function that needs it.
> build nice wrappers to hide it as much as possible
That might be workable.
> * make the locale a functor argument, and create a structure per locale
Unfortunately, the locale information is not known at compile time.
If there was OOP in SML, a good solution would be to provide a locale
'factory' that produces locale objects. The member methods of these
objects would include all the locale-specific methods.
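Without OOP, the closest SML idiom is probably a record of functions built at run time; a sketch (all names here are my assumptions):

```sml
(* A 'locale object' as a record of locale-specific operations. *)
type locale =
   {isAlpha : WideChar.char -> bool,
    isSpace : WideChar.char -> bool,
    toUpper : WideChar.char -> WideChar.char}

(* The 'factory': consult the environment (e.g. $LANG) at run time
   and build the record.  Stubbed here, since the real lookup is
   exactly the open design question. *)
fun getLocale (name : string) : locale =
   raise Fail ("locale lookup not implemented: " ^ name)
```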
Henry Cejtin wrote:
> Can't we just dump locales entirely and have narrow chars be ASCII (or
> ISO- Latin-1) and wide chars be straight unicode with all external stuff
> in UTF-8?
A locale includes conventions for formatting numbers, dates, etc.
Locales also should automatically localize the text strings to the user's
natural language. Locales are orthogonal to character sets. I plan on making
the character set inside MLton 'straight unicode' as you suggest.
The encoding/decoding of Unicode into a character set is also orthogonal to
the locale (although a locale is probably only usable with a few non-Unicode
character sets).
This is why I am annoyed by the CHAR.is* methods; the authors of this
signature confused a locale with a character set. That's also why I
advocate interpreting these methods as 'locale-neutral'---or English. =)
The date/number stuff is way beyond my knowledge of i18n.
> > What is the idea behind WideTextPrimIO?
> I assume the idea here is to be able to build a WideTextIO module
> similar to TextIO.
... yeah, but what would it do?
If you're reading from an external source, you need to perform
encoding/decoding of the characters from/into Unicode. Is it just
supposed to be for reading native UCS2 or what?
I am planning on ignoring this unless anyone has an objection.
> > I don't know how to deal with all of the scan functions that expect a
> > Char.char. Suggestions? I think they should work with WideChar too.
> There might be a more composable way that doesn't require us to add
> all these new functions. What if we added a function that converts a
> LargeChar reader into a Char reader? It would raise an exception
> given an out-of-range char.
Since all of the .scan functions we're talking about can't scan from
non-ascii anyways, this might be appropriate. How about returning NONE
instead of raising an exception? It seems more like a parsing failure...
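A sketch of that narrowing conversion, using the basis StringCvt.reader type and returning NONE for out-of-range characters (WideChar here stands in for whatever the large character type ends up being):

```sml
(* type ('a, 'b) reader = 'b -> ('a * 'b) option   (from StringCvt) *)
fun narrow (rdr : (WideChar.char, 'a) StringCvt.reader)
           : (Char.char, 'a) StringCvt.reader =
   fn strm =>
      case rdr strm of
         NONE => NONE
       | SOME (wc, strm') =>
            if WideChar.ord wc <= Char.maxOrd
               then SOME (Char.chr (WideChar.ord wc), strm')
            else NONE  (* out of range: a parse failure, not an exception *)
```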
> Is there a reason why functions like
> Int.scan would need to handle wide chars?
Ideally, reading numbers should be locale dependent.
Still, leaving Int.scan to be English-only seems reasonable.
Actually, maybe we should say CHAR.is* is English-only as well.
If you want localized versions of CHAR.is* and Int.scan and Date.scan and
whatever, then you need to use a special yet-to-be-defined localization
library.
> > How does one make official changes to the SML Basis Library anyways?
> It doesn't seem possible to me. There is an email list
> You're welcome to send mail there to see what people think.
> My suggestion is to use the list to solicit feedback on your design,
> if you think you need more beyond what's available on MLton@mlton.org.
I'm mostly unclear on the issues to do with localization of Int.scan,
CHAR.is*, how the iconv interface should look, and these things. So..
1. I'll put together something for the stuff I am more clear about:
WideChar. Then write some of the more common encoding/decoding methods
in SML and put together a prototype iconv API.
2a. At that point bounce the iconv API off the more general SML list so that
people can actually touch+feel what it is I am talking about. Convince
people that WidePrimIO is a bad idea and what we want are (char, 'a) reader
-> (WideChar, 'a) reader conversions for decoding.
3. Think about (WideChar, 'a) reader -> (Char, 'a) reader for narrowing.
(Maybe call this CHAR.scanWide or something) --> then Int.scan works.
2b. Parallel to the iconv discussion, try and figure out how to make
3b. Add support to the MLton front-end to support Unicode in strings.
Obviously, these items can be refined as I go.
So, here are the questions I need answered before I start step 1. Should I
call them Char2 and Char4 (akin to Int2, Int3, ...)? Then maybe WideChar is
what we mean by LargeChar; WideChar is the largest.
I take it that, since Int<N> : INTEGER is in the basis library and
Char<N> : CHAR is not, you will want Char2 and Char4 in MLton?
Do I need MLton compiler support for the base type?
I notice in misc/primitive.sml that:
structure Char =
   struct
      type t = char
      type char = t
   end
int8, int16, real32, word64, etc. all appear to be magical top-level
things too. For now I will just make 'type t = word16/32'. I am guessing
these magical types exist to make literals like #"a" agree with Char.char?
I also see in ./mlton/atoms/c-type.fun that 'char = int8'. Why not word8?
> BTW, I assume since you've ported fxp that you had a look at
> which talks about how fxp handles Unicode.
I have now.
PS. Where can I read the standard you quoted? I have been looking for the
SML definition for months, but haven't found it---only books on Amazon.
Wesley W. Terpstra <email@example.com>