[MLton] Unicode / WideChar

Wesley W. Terpstra wesley@terpstra.ca
Wed, 23 Nov 2005 17:39:14 +0100


On Nov 23, 2005, at 4:08 AM, Stephen Weeks wrote:
>> Enough data to quickly reproduce the perfect hash will be kept in a
>> generated .sml, along with a table which describes the named unicode
>> chars. This is all expanded at startup.
>>
>> I estimate the memory required to be somewhere around 100-400K for
>> all the tables involved. Every MLton-built binary would also
>> increase in size by about 30K.
>
> Why would this happen?

The increase in binary size comes from storing the Unicode mapping.
I've been working on compressing the Unicode database, and have now
gotten all the information needed down to under 6k.

On the RAM side of things, I've got it down to 128K of memory: a
table of 32768 Word32 entries (32768 * 4 bytes = 128K), indexed by
hash(c) = ((c >> 14) * 16837 + c) & 0x7FFF
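
In SML, assuming code points are carried as Word32.word, the hash
might look like this (a sketch only; the generated code may differ):

fun hash (c : Word32.word) : int =
   (* multiply the high bits in, then mask down to a 15-bit index *)
   Word32.toInt
      (Word32.andb (Word32.>> (c, 0w14) * 0w16837 + c, 0wx7FFF))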

The table's Word32 entries include all the information needed for
the is* and to* methods, although the to* methods each need an
auxiliary 92-element lookup array.

Each Word32 contains:
   bits [0, 21)  = the Unicode code point in this bucket
   bits [21, 28) = the uppercase delta index
   bits [28, 30) = control | letter | numeral | punctuation
   bits [30, 32) = uppercase | lowercase | whitespace | other
... yes! a perfect fit! :-)
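
For reference, here is one way the fields might be pulled apart in
SML; the mask and helper names (low21, deltaIndex, category) and the
2-bit category codes are mine, not from the actual patch:

val low21 : Word32.word = 0wx1FFFFF                         (* bits [0, 21)  *)
fun deltaIndex v = Word32.andb (Word32.>> (v, 0w21), 0wx7F) (* bits [21, 28) *)
fun category v = Word32.andb (Word32.>> (v, 0w28), 0wx3)    (* bits [28, 30) *)
fun caseBits v = Word32.andb (Word32.>> (v, 0w30), 0wx3)    (* bits [30, 32) *)

(* one 2-bit code per category; this particular assignment is arbitrary *)
val control     : Word32.word = 0wx0
val letter      : Word32.word = 0wx1
val numeral     : Word32.word = 0wx2
val punctuation : Word32.word = 0wx3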

fun fetch (c : Word32.word) : Word32.word option =
   let
      val x = hash c
      val v = Vector.sub (table, x)
   in
      if c = Word32.andb (v, low21) then SOME v
      else
         (* only 8 unicode chars fail that test; catch them here: *)
         let
            val v = Vector.sub (table, x + 1)
         in
            if c = Word32.andb (v, low21) then SOME v else NONE
         end
   end

fun isAlpha c =
   case fetch c of
      NONE => false
    | SOME v => category v = letter

fun toUpper c =
   case fetch c of
      NONE => c
      (* udelta is the 92-entry uppercase delta array; negative
         deltas are stored modulo 2^32 so the addition wraps *)
    | SOME v => c + Vector.sub (udelta, Word32.toInt (deltaIndex v))
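
Presumably toLower is symmetric, against its own 92-entry array
(call it ldelta); whether it can share the same 7-bit index is a
guess on my part, since the layout above only names an uppercase
delta index:

fun toLower c =
   case fetch c of
      NONE => c
    | SOME v => c + Vector.sub (ldelta, Word32.toInt (deltaIndex v))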

There are 8 collisions in this map, so those characters, and any
character not in Unicode, require two probes. OTOH, that's the
uncommon case, so it being twice as slow is probably ok.

It is possible to eliminate all the collisions in roughly the same
memory if I allow a modulus that's not a power of two, but then the
nice & in the hash turns into a nasty mod.
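
That variant would just swap the final mask for a modulus; the
multiplier and modulus below are placeholders, not values I have
actually searched for:

fun hash' (c : Word32.word) : int =
   Word32.toInt ((Word32.>> (c, 0w14) * 0w16837 + c) mod 0w32749)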

> Would it be possible to set things up so that
> the startup memory use and time, as well as the code-size increase,
> would only happen if the WideChar functions were actually used?

Because I functorized the Char stuff into CharFn, and that uses the
common Unicode lookup table, Char itself will pull this in. I
suppose I could special-case hack Char1...

I think that at 6k, the code-size increase is not a big deal.
The 128K of RAM and the time to load it are a bit much, maybe..?

I'm still searching for a more compact hash function, but I doubt I
will get under 100K (25,000 entries * 4 bytes) with no collisions.