[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support
John Reppy
jhr@cs.uchicago.edu
Tue, 29 Nov 2005 10:51:10 -0600
The lexer doesn't generate strings. The input is assumed to be 8-bit
characters (i.e., type char) and one can specify 7-bit, 8-bit, and
UTF-8 interpretations of the character stream (ML-lex only supports
7-bit and 8-bit).
- John
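
[Editorial illustration, not part of the original message.] The three
interpretations of an 8-bit input stream mentioned above can be sketched
roughly as follows. This is Python rather than SML, and the function name
decode_symbols and the mode strings are hypothetical; it only shows the
idea and is not how the tool itself is implemented.

```python
from typing import List


def decode_symbols(data: bytes, mode: str) -> List[int]:
    """Turn raw 8-bit input into the symbols a lexer would match on.

    `mode` is a hypothetical switch: "7bit", "8bit", or "utf8".
    """
    if mode == "7bit":
        # Only 0..127 are legal symbols; anything else is an input error.
        for b in data:
            if b > 0x7F:
                raise ValueError(f"byte {b:#x} is outside the 7-bit range")
        return list(data)
    if mode == "8bit":
        # Every byte is its own symbol (0..255).
        return list(data)
    if mode == "utf8":
        # A multi-byte sequence collapses into a single code point.
        return [ord(ch) for ch in data.decode("utf-8")]
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, the same two input bytes form two symbols in
8-bit mode but a single symbol in UTF-8 mode, and are rejected in 7-bit
mode.
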
On Nov 29, 2005, at 10:30 AM, Geoffrey Alan Washburn wrote:
> Aaron Turon wrote:
>> I have been working with John Reppy on a (largely) backwards-
>> compatible replacement for ML-lex. The new tool is based on
>> Brzozowski's notion of regular expression derivatives[1], making
>> it easy to support boolean operations on REs such as intersection
>> and negation. Code generation is not finalized, but will most
>> likely be control-flow-based (one function per state, with tail
>> calls) rather than table-based. We have designed the tool to
>> support Unicode. I hope to have an initial version out for testing
>> some time next month -- please feel free to send mail with
>> suggestions or requests.
> This would be great. In the past, to handle some ad-hoc uses of
> UTF-8 in my parsers, I've had to build a custom version of ml-lex
> with CharSetSize >129.
>
> Though given that there isn't yet an agreed-upon Basis module
> for Unicode, what does your lexer generate in terms of strings?
>
> -- [Geoff Washburn|geoffw@cis.upenn.edu|http://www.cis.upenn.edu/~geoffw/]
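
[Editorial illustration, not part of the original thread.] The quoted
remark that derivatives make intersection and negation easy can be seen
in a minimal sketch of Brzozowski-style matching. It is written in Python
rather than SML, all class and function names are hypothetical, and it
does no term simplification, so it shows only the derivative rules, not
the state construction a real lexer generator would need.

```python
from dataclasses import dataclass


class Re:
    """Base class for regular-expression terms."""


@dataclass(frozen=True)
class Empty(Re):   # matches no string at all
    pass

@dataclass(frozen=True)
class Eps(Re):     # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Re):     # matches the single character c
    c: str

@dataclass(frozen=True)
class Cat(Re):     # a followed by b
    a: Re
    b: Re

@dataclass(frozen=True)
class Alt(Re):     # a or b
    a: Re
    b: Re

@dataclass(frozen=True)
class And(Re):     # intersection: both a and b
    a: Re
    b: Re

@dataclass(frozen=True)
class Not(Re):     # negation: everything a does not match
    a: Re

@dataclass(frozen=True)
class Star(Re):    # zero or more repetitions of a
    a: Re


def nullable(r: Re) -> bool:
    """True if r matches the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    if isinstance(r, Not):
        return not nullable(r.a)
    raise TypeError(r)


def deriv(r: Re, c: str) -> Re:
    """The derivative of r with respect to c: what may follow after c."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Cat):
        left = Cat(deriv(r.a, c), r.b)
        return Alt(left, deriv(r.b, c)) if nullable(r.a) else left
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, And):
        # Intersection needs no special machinery: derive both sides.
        return And(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Not):
        # Negation simply commutes with the derivative.
        return Not(deriv(r.a, c))
    if isinstance(r, Star):
        return Cat(deriv(r.a, c), r)
    raise TypeError(r)


def matches(r: Re, s: str) -> bool:
    """Match by taking one derivative per input character."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

For instance, And(Star(Alt(Chr("a"), Chr("b"))), Not(Cat(Chr("a"), Chr("b"))))
describes every string over a and b except "ab"; the And and Not cases
each add only one line to deriv.
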