[MLton] WideChar

Wesley W. Terpstra terpstra@gkec.tu-darmstadt.de
Sat, 11 Dec 2004 00:30:49 +0100


On Fri, Dec 10, 2004 at 12:24:48PM -0800, Stephen Weeks wrote:
> * When converting a char reader to a widechar reader, it is sometimes
>   useful to raise an exception on encountering a widechar and
>   sometimes useful to return NONE.  We should provide both types of
>   converters. 

I thought about this a bit more after Henry's email.
There are two converters and I think we were getting confused by them:

1. convert from an input source/'char' reader to WideChar reader by decoding
   - in this case I think you probably want an exception since it means that
     the input is corrupt
   - conceivably we also want it to stop with a NONE here too, not sure
2. convert a WideChar reader back to a char reader for the purposes of using
   Int.scan, Date.scan, etc. -- not encoding, simply converting types
   - here you always want NONE; those scanners don't accept the string if
     there are non-ASCII chars in the stream

The case where I was talking about returning NONE was #2.
The case where Henry talked about exceptions was #1.

Whether a NONE for case #1 also makes sense I don't know.
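
To make the two cases concrete, here is a rough sketch of both converters
over StringCvt.reader. Nothing here is a proposed API: wchar is just an int
code point standing in for WideChar, and the names decodeUtf8, narrow,
Malformed and listReader are made up for illustration. The decoder handles
only well-formed UTF-8 and does not reject overlong forms.

```sml
(* Assumption: a wide character is modelled as its int code point. *)
type wchar = int

exception Malformed

(* Case #1: decode UTF-8 from a char reader. Corrupt input raises
   Malformed; a clean end of input yields NONE. *)
fun decodeUtf8 (rd : (char, 's) StringCvt.reader)
    : (wchar, 's) StringCvt.reader =
  let
    (* Read n continuation bytes (10xxxxxx), folding 6 bits each into acc. *)
    fun cont (acc, 0, s) = SOME (acc, s)
      | cont (acc, n, s) =
          case rd s of
              NONE => raise Malformed
            | SOME (c, s') =>
                let val b = Char.ord c
                in
                  if b div 64 = 2
                  then cont (acc * 64 + b mod 64, n - 1, s')
                  else raise Malformed
                end
  in
    fn s =>
      case rd s of
          NONE => NONE
        | SOME (c, s') =>
            let val b = Char.ord c
            in
              if b < 128 then SOME (b, s')
              else if b div 32 = 6 then cont (b mod 32, 1, s')
              else if b div 16 = 14 then cont (b mod 16, 2, s')
              else if b div 8 = 30 then cont (b mod 8, 3, s')
              else raise Malformed
            end
  end

(* Case #2: view a wide reader as a char reader. No encoding happens;
   a non-ASCII character simply ends the stream with NONE. *)
fun narrow (rd : (wchar, 's) StringCvt.reader)
    : (char, 's) StringCvt.reader =
  fn s =>
    case rd s of
        NONE => NONE
      | SOME (w, s') =>
          if 0 <= w andalso w < 128 then SOME (Char.chr w, s') else NONE

(* A toy wide reader over a list of code points. *)
fun listReader [] = NONE
  | listReader (x :: xs) = SOME (x, xs)

(* Int.scan reads "42" and stops at the non-ASCII code point. *)
val scanned = Int.scan StringCvt.DEC (narrow listReader) [0x34, 0x32, 0x3042]
```

With narrow, the existing scanners see the non-ASCII character as end of
input, so the unread wide stream is still available to the caller.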

> * Using a datatype for encoding names is preferable to using strings.
>   Using datatypes does not introduce any  problems with adding new
>   encodings. 

A foolish user might write:
	case enc of
	    UTF8 => ...
	  | UTF16 => ...
	  | X strEnc => ...

The user writing a case statement over the datatype will be uncommon, but I
wouldn't put it outside the realm of possibility. IMO, we shouldn't break
user code by adding new encodings; even bad user code.

Or is there a way to make a datatype partially opaque so that the user is
forced by the exhaustive pattern match checking to add a _ => pattern?
If so, then I am in complete agreement with your suggestion, with one
caveat: 

=> Anything in the datatype must also have a matching string version.

Otherwise, when a user wants to pick a locale dynamically he has to special
case all of the datatype'd encodings. It is very often the case that you have
the encoding as a string acquired at runtime.
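
One way this could look, as a sketch only: hide the datatype behind an
opaque signature so the well-known encodings are plain values rather than
constructors. User code then cannot write a case over them at all, and every
encoding has a string name. The names ENCODING, Encoding, fromName and name
are invented for this example, and the name matching is deliberately naive.

```sml
signature ENCODING =
sig
  eqtype t
  val utf8 : t
  val utf16 : t
  val fromName : string -> t   (* any string is accepted *)
  val name : t -> string       (* every encoding has a string form *)
end

structure Encoding :> ENCODING =
struct
  (* The constructors are hidden by the opaque ascription above. *)
  datatype t = UTF8 | UTF16 | X of string

  val utf8 = UTF8
  val utf16 = UTF16

  (* Naive canonical form: upper-case the name. *)
  fun canon s = String.map Char.toUpper s

  fun fromName s =
    case canon s of
        "UTF-8" => UTF8
      | "UTF-16" => UTF16
      | other => X other

  fun name UTF8 = "UTF-8"
    | name UTF16 = "UTF-16"
    | name (X s) = s
end
```

Adding a constructor later changes nothing for users, since they can only
compare values with = or go through name. The cost is that they lose pattern
matching entirely, which is stronger than forcing a _ => pattern.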

The most typical uses of this stuff are going to be:
	IConv.decoder UTF8 (TextIO.input1 TextIO.stdIn)
and
	val enc : string = ... something I got from some config option,
			       an XML header, the user, etc
	val file = IConv.decoder (X enc) (TextIO.openIn file)

As for the user adding new encodings, there are two solutions I see:

1. The IConv : ICONV has a register function to add new encodings
2. We require the user to make a new MyIConv : ICONV which handles his
   special encodings and defers to IConv for the rest.

#2 is nice in that it keeps the API simple. On the other hand it violates
the principle of abstraction; users have to know that there is an extended
version somewhere else in their project that they should use instead.
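
Roughly, the two options might look like this. It is only a sketch: the
decoder type (string to a list of code points), the Unknown exception and
the "ROT1" toy encoding are placeholders, not the real ICONV signature.

```sml
signature ICONV =
sig
  type decoder = string -> int list   (* placeholder decoder type *)
  exception Unknown of string
  val register : string * decoder -> unit
  val decoder : string -> decoder
end

(* Option 1: one structure with a mutable table and a register function. *)
structure IConv :> ICONV =
struct
  type decoder = string -> int list
  exception Unknown of string

  val table : (string * decoder) list ref =
    ref [("ASCII", fn s => map Char.ord (String.explode s))]

  (* Newer registrations shadow older ones with the same name. *)
  fun register entry = table := entry :: !table

  fun decoder name =
    case List.find (fn (n, _) => n = name) (!table) of
        SOME (_, d) => d
      | NONE => raise Unknown name
end

(* Option 2: a user structure that handles its own encodings and
   defers to IConv for everything else. *)
structure MyIConv =
struct
  fun decoder "ROT1" =
        (fn s => map (fn c => Char.ord c + 1) (String.explode s))
    | decoder name = IConv.decoder name
end
```

With option 1 every caller of IConv.decoder sees a registered encoding; with
option 2 only code that knows to call MyIConv.decoder does.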

> * I don't see what OOP buys in handling locales (i.e. I don't see any
>   use for inheritance, dynamic dispatch, or otherwise).  We can simply
>   have functions that depend on the locale.

Let me make you aware of some of the issues before we make up our minds. 
I think this point needs the most discussion; character sets and conversion
methods are all pretty easy to understand.

"en_EN" is very much like "en_US", both are of 'supertype' "en"
"pt_BR" vs "pt" is a much better example, but not as well-known ;)

The _XX part is not standardized and users can invent their own.
For example, maybe I decide that I speak en_BC for the west part of Canada.
I would probably share much in common with en_CA, but maybe I consider an
extra character printable. 

Applications should _automatically_ and without recompilation recognize this
if I set my locale to en_BC and configure my system to have the locale.
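
A sketch of the kind of runtime lookup this implies: the locale name is
split at run time and missing definitions fall back from "en_BC" to "en" to
a default. The entry type, the "C" default and the db list are assumptions
for the example; a real system would load the database from its installed
locales.

```sml
(* A locale entry may or may not define its own printable test.
   Characters are modelled as int code points. *)
type entry = string * (int -> bool) option

(* "en_BC" falls back to "en", then to the default "C" locale. *)
fun parents name =
  case String.fields (fn c => c = #"_") name of
      [lang, _] => [name, lang, "C"]
    | _ => [name, "C"]

(* Use the first locale in the fallback chain that defines isPrint. *)
fun isPrint (db : entry list) name : int -> bool =
  let
    fun find [] = (fn _ => false)
      | find (n :: ns) =
          case List.find (fn (m, _) => m = n) db of
              SOME (_, SOME f) => f
            | _ => find ns
  in
    find (parents name)
  end

fun ascii c = 32 <= c andalso c < 127

(* Example database: en_BC treats one extra character as printable. *)
val db : entry list =
  [("C", SOME ascii),
   ("en_BC", SOME (fn c => c = 0x2603 orelse ascii c))]
```

Here isPrint db "en_BC" accepts the extra character, while "en_CA" is not in
the database and falls through to the "C" definition.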

In Japanese there are three kinds of printable characters.
It's been a long time since I studied it, but I think they were Hiragana,
Katakana, and the Chinese Kanji. (spelling may be wrong for all of these)

It may make sense for that locale to include more 'member methods' which
tell these character classes apart. Furthermore, Japanese has no concept of
uppercase / lowercase. Should these methods exist in this case at all?

Another pair of reasonable sounding requirements are: 
	toLower = toLower o toUpper
	CHAR.toLower: char -> char
The SML Basis Library makes both of these assumptions.

However, in German neither can be true, because ß uppercases to the two
characters SS:
	toLower o toUpper: Straße -> STRASSE -> strasse
	toLower: Straße -> straße
The word Hass != Haß (the second does not exist)
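
The German case can be checked with a small sketch. Strings are lists of int
code points here, and upper1 / lower1 are toy mappings covering only ASCII
letters and ß (0xDF); they are not meant as a real case-mapping table.

```sml
(* Upper-casing one character may yield several: ß becomes SS. *)
fun upper1 c =
  if c = 0xDF then [0x53, 0x53]
  else if 0x61 <= c andalso c <= 0x7A then [c - 32]
  else [c]

fun lower1 c = if 0x41 <= c andalso c <= 0x5A then c + 32 else c

(* So toUpper has to work on strings, not on single characters. *)
fun toUpper cs = List.concat (map upper1 cs)
fun toLower cs = map lower1 cs

val strasse = [0x53, 0x74, 0x72, 0x61, 0xDF, 0x65]   (* Straße *)

val direct = toLower strasse               (* straße,  6 characters *)
val viaUpper = toLower (toUpper strasse)   (* strasse, 7 characters *)
```

The two results differ, so toLower = toLower o toUpper fails, and the
upper-casing step cannot have type char -> char.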

Copying other languages like C++ may help.

-- 
Wesley W. Terpstra <wesley@terpstra.ca>