Fri, 10 Dec 2004 16:22:19 -0600
With regards to your point:
When converting a char reader to a widechar reader, it is sometimes
useful to raise an exception on encountering a widechar and sometimes
useful to return NONE. We should provide both types of converters.
The point I was making in my comment was that not all byte sequences are
legal UTF-8. In that case, the problem isn't encountering a widechar; it is
encountering bytes which make no sense.
Note, none of this is really about chars, but about encodings; that is why
the exception makes sense (to me). It arises exactly because the function
from bytes to Unicode defined by UTF-8 is only a partial function.
With regards to the multi-level table compression for Lex, long ago in
building up a DFA string matcher I needed to store things compactly without
giving up speed. I just used the following hack: divide the character space
into equivalence classes, where two characters are equivalent iff every
state has the same transition on both. Then you just do one extra lookup
(character to equivalence class) followed by the usual stuff. That first
lookup is the one where there is lots of sharing.
I never saw a case where this performed poorly.
With regard to word8 vs. int8, isn't it a problem either way? I.e., does the
FFI support unsigned char? If so then that should be word8 while char should
be int8 (except on some machines (MIPS?) where chars default to unsigned and
it is signed char that would be int8).
With regards to locale dependency, isn't one of the huge points of Unicode
exactly that isAlpha and isPrint do NOT depend on things like locale?
I have been bitten MANY times by code that depends on the locale, because it
pretty much ONLY makes sense when the output is to a human or the input is
from a human. When things are between programs it is a disaster if the two
programs don't agree.