[MLton] Unicode / WideChar

Wesley W. Terpstra terpstra@gkec.informatik.tu-darmstadt.de
Sun, 20 Nov 2005 23:16:22 +0100

On Nov 20, 2005, at 10:37 PM, Henry Cejtin wrote:
> ... One thing that I would REALLY like to be true would be
> for none of these to depend on locale.

That's what I'm aiming for.
The documents I got from unicode.org are locale-independent.
I suspect this is the main reason for disagreement with C++.

> For digits, I suspect that you really just want ASCII 0-9, but again
> I am not at all certain.

Well... I don't know about that.

The Arabic people might take offense when we cut out their
(original) version of 0-9. ;-)

Besides, you want existing number parsing code to work.
That will never happen. Number parsing code needs to be
locale-specific. Same with dates and times.

If you want only the English digits, then that would have to
be a locale-specific isdigit, I think. That will be outside of the
basis-2002.mlb, though, and in an i18n.mlb.

Right now I plan to include just WideChar/... in the basis.mlb.
Then there will be another i18n.mlb which includes
locale-dependent date/time/number stuff, Char{1,2,4}, gettext,
and charset conversion (as discussed previously on this list).

My immediate goal is the charset conversion and Char{1,2,4}
components, because that's what's needed to get MLton to
support UTF-8 input files. After that, gettext support, so we can
localize our software, more or less. Date/time/number parsing
is way beyond me, though. When/if I get there, I intend to look
around for a library that does it for us.

> I suspect though that [tables] would still be the fastest method.
> This is based on the fact that at least for English, almost all
> characters are ASCII, which means that only 128 bytes has to
> be in the cache to get a VERY good hit rate.

That's true.
However, it's really not reasonable to spend that much space
on the table when the actual information content is about 5-20
ranges of integers.
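
A minimal sketch of the range idea, written in C for concreteness and not
taken from the MLton sources. The names cp_range, digit_ranges, in_ranges
and is_digit_cp are hypothetical, and the table holds only an illustrative
subset of the Unicode decimal-digit ranges; a real table would presumably
be generated from the unicode.org data files. The ASCII check up front is
an assumption about how one might keep the common English case cheap
without a per-character table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
typedef struct { uint32_t lo, hi; } cp_range;

/* Illustrative subset of decimal-digit ranges, sorted by lo.
   This is NOT the complete Unicode set. */
static const cp_range digit_ranges[] = {
    {0x0030, 0x0039}, /* ASCII 0-9 */
    {0x0660, 0x0669}, /* Arabic-Indic digits */
    {0x06F0, 0x06F9}, /* Extended Arabic-Indic digits */
    {0x0966, 0x096F}, /* Devanagari digits */
    {0x09E6, 0x09EF}, /* Bengali digits */
};

/* Binary search over the half-open index interval [lo, hi). */
static bool in_ranges(const cp_range *r, size_t n, uint32_t cp) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < r[mid].lo)
            hi = mid;          /* cp lies before this range */
        else if (cp > r[mid].hi)
            lo = mid + 1;      /* cp lies after this range */
        else
            return true;       /* r[mid].lo <= cp <= r[mid].hi */
    }
    return false;
}

/* Hypothetical locale-independent digit test. */
static bool is_digit_cp(uint32_t cp) {
    if (cp < 0x80)             /* ASCII fast path: two comparisons, no table */
        return cp >= '0' && cp <= '9';
    return in_ranges(digit_ranges,
                     sizeof digit_ranges / sizeof digit_ranges[0], cp);
}
```

With 5-20 ranges the search takes at most a handful of comparisons, and
the whole table fits in a cache line or two.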