[MLton] WideChar?

Matthew Fluet fluet@cs.cornell.edu
Thu, 9 Dec 2004 09:54:28 -0500 (EST)


> On Wed, Dec 08, 2004 at 01:27:10PM -0800, Stephen Weeks wrote:
> > > The values #"a" and "dfgdfsg" should be polymorphic just like 5.
> > Yes.  BTW, the SML term in this situation is "overloaded", not
> > "polymorphic".
>
> Right.
>
> One more thing on this point: this means that depending on what function
> this gets passed to, the value might get expanded. eg:
> "foo" -> 'foo' when passed as a String
> "foo" -> '\^@f\^@o\^@o' when passed as a UCS2String
> (in terms of one possible binary representation in memory)

This exact same thing happens with overload resolution of numeric
constants (the constant '4' can be resolved to any of Int32.int,
Int64.int, IntInf.int, etc., all with different memory representations).
The infrastructure for handling this stuff is in place.
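
As a small sketch of what that looks like for numbers (Int32, Int64, and
IntInf are optional Basis structures that MLton provides; the binding
names are made up for illustration):

```sml
(* The same literal 4, resolved by the type annotation at each binding. *)
val a : Int32.int  = 4
val b : Int64.int  = 4
val c : IntInf.int = 4

(* With no constraint at all, the literal falls back to the default int. *)
val d = 4
```

An overloaded string or char constant would presumably be resolved the
same way, with the annotation (or context) choosing the representation.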

> I think the equality issue is sidestepped since the parameters are different
> types and thus can't be compared by =, but I am still an SML newbie. :)

If you are asking: "Might I ever be required to check the equality of
string (constants) that have different memory representations?", then you
are correct that the answer is 'no' -- because polymorphic equality
requires comparing values of the same type.  Now, it may be that the use
of equality is what resolves a constant to a particular type.
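
A sketch of that last point, with hypothetical function names.  The
numeric case works today; the string case only shows the shape, since
"foo" has a single candidate type until wide strings exist:

```sml
(* n is an Int64.int, so = forces the literal 4 to be an Int64.int too. *)
fun isFour (n : Int64.int) : bool = (n = 4)

(* Likewise, s : string is what fixes the type of the literal "foo". *)
fun isFoo (s : string) : bool = (s = "foo")
```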

> > > What is the idea behind WideTextPrimIO?
> > I assume the idea here is to be able to build a WideTextIO module
> > similar to TextIO.
>
> ... yeah, but what would it do?
> If you're reading from an external source, you need to perform
> encoding/decoding of the characters from/into Unicode. Is it just
> supposed to be for reading native UCS2 or what?
>
> I am planning on ignoring this unless anyone has an objection.

It really depends on how you intend to implement things.  The PrimIO
functions are what talk to the OS, getting you bags of bytes (from files,
sockets, etc.).  If you can use the OS to decode, then you may be able to
build up from WideTextPrimIO.  If you can't use the OS to decode (or want
to decode yourself), then you can probably just build up from BinIO: you
get an uninterpreted stream of bytes, which you decode yourself.
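
A minimal sketch of the BinIO route, assuming the input is big-endian
UCS2 with no byte-order mark.  The function names decodeUCS2BE and
readUCS2BE are hypothetical, and code units are returned as plain words
because no wide char type exists yet:

```sml
(* Decode big-endian UCS2: each pair of bytes forms one 16-bit code unit.
   A trailing odd byte, if any, is silently ignored. *)
fun decodeUCS2BE (bytes : Word8Vector.vector) : word list =
   let
      val n = Word8Vector.length bytes div 2
      fun codeUnit i =
         let
            val hi = Word.fromInt (Word8.toInt (Word8Vector.sub (bytes, 2 * i)))
            val lo = Word.fromInt (Word8.toInt (Word8Vector.sub (bytes, 2 * i + 1)))
         in
            Word.orb (Word.<< (hi, 0w8), lo)
         end
   in
      List.tabulate (n, codeUnit)
   end

(* Read a whole file as uninterpreted bytes, then decode it ourselves. *)
fun readUCS2BE (path : string) : word list =
   let
      val ins = BinIO.openIn path
      val bytes = BinIO.inputAll ins
      val () = BinIO.closeIn ins
   in
      decodeUCS2BE bytes
   end
```

Fed the bytes 00 66 00 6F 00 6F, decodeUCS2BE gives back the code units
for "foo".  Other encodings would slot in as alternative decoders over
the same byte vector.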

> Actually, maybe we should say CHAR.is* is English-only as well.
> If you want localized version of CHAR.is* and Int.scan and Date.scan and
> whatever, then you need to use a special yet-to-be-defined localization
> interface.

That is probably fair.  It is certainly the easiest place to start.

> > > How does one make official changes to the SML Basis Library anyways?
> > It doesn't seem possible to me.  There is an email list

Back in Nov. 2002, after the first major revision to the online spec since
1999, I tried prodding John into clarifying the "fluidity" of the spec.
In the short run, he said, getting the spec published was the priority.
However:

  In the long run, the specification is not meant to be a "static"
  document (unlike the Definition of SML).  We expect and hope that it
  will evolve and grow.  To that end, there should be a steering committee
  with representatives from the major players to maintain the
  specification. Changes to specification come in several forms:

    1) correction to broken features.

    2) new APIs

    3) additional operations to existing APIs

    4) incompatible changes to existing APIs

    5) deletion of deprecated features/APIs.

  For the sake of stability and backward compatibility, 1, 2, and 3 should
  be the most common form of change (and I hope that 1 never happens).
  Any new feature or API should be justified by some experience.

          - John

It would seem that we are entering the "long run" phase.  Few of the
implementations besides MLton seem to care about keeping up to date with
the spec, though.  So, I suspect that an implementation in MLton is the
right starting point.

> So, the questions I need answers to before I start step 1. Should I call
> them Char2 and Char4 (akin to Int2, Int3, ...)? Then maybe WideChar is
> what we mean by LargeChar; WideChar is the largest.
>
> I take it that since Int<N> : INTEGER is in basis library and Char<N> : CHAR
> is not that you will want Char2 and Char4 in MLton?

It really doesn't matter; names can easily be changed.

My suggestion would be to start a new MLB library, and build everything
you want there.  The MLton structure is getting awfully full and is a bit
of a hodgepodge of true extensions (e.g., Thread), useful additions to the
Basis Library (e.g., Word, IntInf), etc.

To the degree that it is possible, build against the Basis Library.  You
should certainly be able to write down all the signatures that you might
want without needing to touch any part of the compiler.
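
For instance, a signature along these lines needs nothing from the
compiler.  This is only a sketch: WIDE_CHAR is a hypothetical name, and
the members are a small subset borrowed from the Basis CHAR signature:

```sml
signature WIDE_CHAR =
   sig
      eqtype char
      val minChar : char
      val maxChar : char
      val maxOrd : int
      val ord : char -> int
      val chr : int -> char      (* raises Chr when out of range *)
      val compare : char * char -> order
   end
```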

> Do I need MLton compiler support for the base type?

Minimal.

> I notice in misc/primitive.sml that:
>
> structure Char =
>    struct
>       type t = char
>       type char = t
>    end
>
> int8, int16, real32, word64, etc all appear to be some magical top-level
> things too. For now I will just make 'type t = word16/32'. I am guessing
> these magical types exist to make literals like #"a" agree with Char.char?

Essentially.  You should be able to get very far with type t = word16/32,
because you won't have literals.
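
Roughly, and only as a sketch (Char2 is one of the names floated above;
the helper functions are made up, and Word16 is an optional Basis
structure that MLton provides):

```sml
structure Char2 =
   struct
      type t = Word16.word
      type char = t

      val maxOrd = 0xFFFF
      val ord : t -> int = Word16.toInt
      fun chr (i : int) : t =
         if 0 <= i andalso i <= maxOrd
            then Word16.fromInt i
         else raise Chr

      (* No literals yet, so wide values are built from ordinary ones. *)
      val fromChar : Char.char -> t = chr o Char.ord
      fun fromString (s : string) : t vector =
         Vector.tabulate (String.size s, fn i => fromChar (String.sub (s, i)))
   end
```

So instead of writing a wide literal you would write something like
Char2.fromString "foo" until the compiler knows about the type.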

> I also see in ./mlton/atoms/c-type.fun that 'char = int8'. Why not word8?

The C calling convention on the PowerPC (and possibly other architectures)
requires passing a signed 8-bit value as a sign-extended 32-bit value.
(Because this is the convention, the caller may then perform a 32-bit
operation on the argument without first extracting the low-order bits.)
So, unfortunately, we need to remember at the C interface whether things
are signed or unsigned.  Elsewhere in the compiler, we map char to word8.
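
To see why the distinction matters, here is the same 8-bit pattern
widened both ways.  This is just an illustration in SML using the Basis
Word conversions, not what the compiler does internally:

```sml
val byte : Word8.word = 0wxFF

(* Unsigned view (word8): the upper 24 bits are zero. *)
val zeroExt : Word32.word = Word32.fromLarge (Word8.toLarge byte)

(* Signed view (int8): the sign bit is copied into the upper 24 bits. *)
val signExt : Word32.word = Word32.fromLarge (Word8.toLargeX byte)
```

Here zeroExt is 0wxFF while signExt is 0wxFFFFFFFF, so a callee that does
a 32-bit operation on the argument sees two different values.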

Although, now that I write this, it isn't clear why char is mapped to a
signed, 8-bit value.

> PS. Where can I read the standard you quoted? I have been looking for the
> SML definition for months, but haven't found it---only books on Amazon.

There is no online copy of the Definition.