[MLton] Unicode / WideChar

Wed, 23 Nov 2005 01:51:58 +0100

On Nov 21, 2005, at 2:08 PM, Wesley W. Terpstra wrote:
>> BTW: I am curious about the Unicode database implementation:
>> the database is BIG. How are you going to represent this
>> efficiently? (Eg,  case mapping function)
>
> Case mappings I haven't done yet. :-)
>
> Property mapping is easier because Unicode nicely
> grouped similar property code points together.

So, I've been thinking about this some more.
Looking at Unicode 4.1.0:

code points = 1114112
named characters = 16351
letters = 10014
uppercase = 1296
lowercase = 1608
numerals = 695
controls = 230
having uppercase mapping = 906
having lowercase mapping = 893

Looking at these numbers, you see that there is actually very
little content to most of these tables. To exploit this, I propose
to use an entirely different approach than I've heard mentioned
so far that will still be small O(1) time and space proportional to
the number of entries with content.

For all the named characters (entries exist in UnicodeData.txt)
we provide a map from codepoint to the following type 'info':

datatype class = CONTROL | LETTER | NUMERAL | PUNCTUATION
datatype detail = WHITESPACE | UPPERCASE | LOWERCASE | OTHER
type info = WideChar.char * class * detail

For all the characters with upper case mapping we keep
type info = WideChar.char * WideChar.char
Ditto for the lower case mapping.

Now, for the map, I propose that we use a minimal perfect
hash. This way, we can in a few operations convert a code
point to the hash table locatoin. If the first WideChar component
does not match the char we were looking up, then the char is
not named (for is* methods) or has no case change (for to*).
The first case means return false, the second means identity.
If we do have a hit, well those tables give us exactly the value
we should be returning.

If we use a minimal perfect hash, then we need only 16351
entries in the map (and proportional space for the minimal
perfect hash) to cover all the is* methods. For the case maps
we will have two more tables of size ~1000 entries. Lookups
will be nice and speedy, and everyone goes home happy.

Enough data to quickly reproduce the perfect hash will be
kept in a generated .sml, along with a table which describes
the named unicode chars. This is all expanded at startup.

I estimate the memory required to be somewhere around
100-400K for the all the tables involved. Every MLton built
binary would also increase in size by about 30K.

Anyone have objections to this proposal?
Anyone have pointers to existing SML perfect hash code?
Anyone have pointers to perfect hashes that are particularily
performant in the operations/memory trade-off?