[MLton] Re: [Sml-basis-discuss] Unicode and WideChar support

John Reppy jhr@cs.uchicago.edu
Tue, 29 Nov 2005 11:08:47 -0600


I think that we'll have

	val yytext : unit -> substring

where UTF-8 is used to encode Unicode characters.  We use substrings to
avoid unnecessary copying, and a function so that substring creation is
lazy (our assumption is that compilers are better at eliminating unused
local functions than unused calls to external functions that happen to
be pure).
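
As a rough illustration of the idea (not the code the new lexer
generator actually emits), here is a minimal sketch in Standard ML.
Only the name yytext and its type come from this message; the structure
LexSketch and the functions mkYYText, getCodePoint, and codePoints are
hypothetical names made up for the example.  It assumes the matched text
lives in a string buffer and that an action wants Unicode code points
out of the UTF-8 bytes.

```sml
(* Hypothetical sketch; none of these names are the lexer generator's API. *)
structure LexSketch =
struct
  (* Build a lazy yytext for the match buf[start .. start+len-1].
     No substring is created until an action applies the function. *)
  fun mkYYText (buf : string, start : int, len : int) : unit -> substring =
    fn () => Substring.substring (buf, start, len)

  (* Decode one UTF-8 code point from the front of a substring.
     Returns NONE at the end of input or on a malformed sequence.
     Overlong encodings are not rejected in this sketch. *)
  fun getCodePoint (ss : substring) : (word * substring) option =
    case Substring.getc ss of
      NONE => NONE
    | SOME (c, rest) =>
        let
          val b = Word.fromInt (Char.ord c)
          (* Consume n continuation bytes of the form 10xxxxxx. *)
          fun cont (0, acc, ss) = SOME (acc, ss)
            | cont (n, acc, ss) =
                (case Substring.getc ss of
                   NONE => NONE
                 | SOME (c', ss') =>
                     let
                       val b' = Word.fromInt (Char.ord c')
                     in
                       if Word.andb (b', 0wxC0) = 0wx80
                       then cont (n - 1,
                                  Word.orb (Word.<< (acc, 0w6),
                                            Word.andb (b', 0wx3F)),
                                  ss')
                       else NONE
                     end)
        in
          if b < 0wx80 then SOME (b, rest)
          else if Word.andb (b, 0wxE0) = 0wxC0
            then cont (1, Word.andb (b, 0wx1F), rest)
          else if Word.andb (b, 0wxF0) = 0wxE0
            then cont (2, Word.andb (b, 0wx0F), rest)
          else if Word.andb (b, 0wxF8) = 0wxF0
            then cont (3, Word.andb (b, 0wx07), rest)
          else NONE
        end

  (* All code points in a substring; stops at the first malformed byte. *)
  fun codePoints (ss : substring) : word list =
    case getCodePoint ss of
      NONE => []
    | SOME (w, rest) => w :: codePoints rest
end
```

An action that never calls yytext pays nothing for it, and one that does
gets a view into the buffer rather than a copy; for example,
LexSketch.codePoints (yytext ()) walks the matched bytes in place.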

Note that Unicode support is not part of ml-lex compatibility mode.

	- John

On Nov 29, 2005, at 10:56 AM, Geoffrey Alan Washburn wrote:

> John Reppy wrote:
>> The lexer doesn't generate strings.  The input is assumed to be 8-bit
>> characters (i.e., type char) and one can specify 7-bit, 8-bit, and
>> UTF-8 interpretations of the character stream (ML-lex only supports
>> 7-bit and 8-bit).
>     Okay, maybe I need to rephrase my question as: If you tell it you
> want to use UTF-8 for the input stream, what type does yytext (or the
> equivalent) have?  Is it just string, possibly containing sequences of
> high-bit characters?
> -- [Geoff Washburn|geoffw@cis.upenn.edu|http://www.cis.upenn.edu/~geoffw/]