[MLton] WideChar
Wesley W. Terpstra
terpstra@gkec.tu-darmstadt.de
Thu, 16 Dec 2004 05:49:29 +0100
On Mon, Dec 13, 2004 at 10:06:17AM -0800, Stephen Weeks wrote:
> signature ENCODING =
>   sig
>     type t
>
>     val equals: t * t -> bool
>     val foo: t
>     val fromString: string -> t
>     val toString: t -> string
>     val utf8: t
>     val utf16: t
>   end
I don't really like this.
It just seems over-complicated. =(
> Now, we get the type-system support of known encodings guaranteeing
> agreement on encoding name.
OTOH, I suppose this is a benefit.
It is nice to have support for a hard-coded decoder checked at compile-time.
What I don't like about your ENCODING signature is that it is a growing
interface. One of the rules I've had beaten into me over years of C++
development is that abstract base classes must never add new methods.
OTOH, the nature of encodings being added IS that the interface grows.
Sticking with your API and combining the suggestions I've heard:
signature UNICODE_ENCODING =
  sig
    eqtype encoding (* why define equals? just use an eqtype *)
    exception Unsupported
    exception Bad
    val fromName: string -> encoding (* may raise Unsupported *)
    val toName: encoding -> string
    val utf8: encoding
    val utf16le: encoding (* little-endian *)
    val utf16be: encoding (* big-endian *)
    val utf32le: encoding
    val utf32be: encoding
  end
signature UNICODE_DECODER =
  sig
    type encoding
    (* raises Encoding.Bad if the stream is broken *)
    val get: encoding -> (Word8.word, 'a) reader -> (WideChar.char, 'a) reader
    (* returns NONE if the input WideChar is outside ISO-8859-1 *)
    val toChar: (WideChar.char, 'a) reader -> (Char.char, 'a) reader
    (* always succeeds *)
    val fromChar: (Char.char, 'a) reader -> (WideChar.char, 'a) reader
  end
signature UNICODE_ENCODER =
  sig
    type encoding
    val get: encoding -> WideString.string -> Word8Vector.vector
    (* raises Chr if a character would overflow Char.char *)
    val toChar: WideString.string -> string
    (* always succeeds *)
    val fromChar: string -> WideString.string
  end
signature UNICODE =
  sig
    structure Encoding : UNICODE_ENCODING
    structure Decoder : UNICODE_DECODER where type encoding = Encoding.encoding
    structure Encoder : UNICODE_ENCODER where type encoding = Encoding.encoding
  end
structure Unicode : UNICODE (* defined at top-level by unicode.mlb *)
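To make the shape of Decoder.get and Encoder.get concrete, here is a minimal UTF-8 sketch, not the proposed implementation. It assumes plain int code points in place of WideChar.char and an int list in place of WideString.string, since WideChar may not be available. The names utf8Get, utf8Bytes, utf8Encode and listRd are hypothetical, and the decoder does not reject overlong forms or surrogates.

```sml
exception Bad

(* Decoder: turn a byte reader into a code-point reader.
   Assumption: int stands in for WideChar.char. *)
fun utf8Get (rd : (Word8.word, 'a) StringCvt.reader)
    : (int, 'a) StringCvt.reader =
  fn s =>
    let
      (* read n continuation bytes (10xxxxxx), folding 6 bits each into acc *)
      fun cont (0, acc, s) = SOME (acc, s)
        | cont (n, acc, s) =
            case rd s of
              NONE => raise Bad                 (* truncated sequence *)
            | SOME (b, s') =>
                let val i = Word8.toInt b
                in
                  if i div 64 = 2               (* top two bits are 10 *)
                  then cont (n - 1, acc * 64 + i mod 64, s')
                  else raise Bad
                end
    in
      case rd s of
        NONE => NONE
      | SOME (b, s') =>
          let val i = Word8.toInt b
          in
            if i < 0x80 then SOME (i, s')
            else if i < 0xC0 then raise Bad     (* stray continuation byte *)
            else if i < 0xE0 then cont (1, i mod 32, s')
            else if i < 0xF0 then cont (2, i mod 16, s')
            else if i < 0xF8 then cont (3, i mod 8, s')
            else raise Bad
          end
    end

(* Encoder: one code point to its UTF-8 bytes. *)
fun utf8Bytes (c : int) : Word8.word list =
  let
    val w = Word8.fromInt
    (* continuation byte holding the 6 bits of c at weight d *)
    fun tail d = w (0x80 + (c div d) mod 64)
  in
    if c < 0 then raise Bad
    else if c < 0x80 then [w c]
    else if c < 0x800 then [w (0xC0 + c div 64), tail 1]
    else if c < 0x10000 then [w (0xE0 + c div 4096), tail 64, tail 1]
    else if c < 0x110000
      then [w (0xF0 + c div 262144), tail 4096, tail 64, tail 1]
    else raise Bad
  end

(* Assumption: int list stands in for WideString.string. *)
fun utf8Encode (cs : int list) : Word8Vector.vector =
  Word8Vector.fromList (List.concat (map utf8Bytes cs))

(* A list is the simplest stream to read from. *)
fun listRd [] = NONE
  | listRd (x :: xs) = SOME (x, xs)
```

Because the decoder is a reader transformer, the caller pulls one code point at a time, e.g. utf8Get listRd applied to a Word8.word list, and nothing is decoded until it is asked for.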
> [ keep it portable ]
Well, I intend to assume that WideChar exists and can hold at least 21 bits.
Aside from that I won't make any other assumptions about the SML compiler.
-----------------------
One missing piece is support for user-extensions.
In glibc and libiconv, a conversion can be loaded dynamically from a .so.
It seems that I can't use the FFI on a function pointer, so dlsym is out.
I lean towards a register function in UNICODE_ENCODING like:
val register: string -> {
    encoder: WideString.string -> Word8Vector.vector,
    decoder: (Word8.word, 'a) reader -> (WideChar.char, 'a) reader
  } -> unit
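A rough sketch of how such a registry might look, under some loudly stated assumptions: the codec record, registry and lookup are hypothetical names, int code points stand in for WideChar.char, and the decoder works on a whole Word8Vector.vector rather than on a reader. That last simplification is deliberate, because a record stored in a ref cell cannot remain polymorphic in 'a, so a reader-based decoder would need a fixed stream type or some other workaround.

```sml
exception Unsupported

(* Hypothetical codec record: whole-buffer functions, with int code
   points standing in for WideChar.char. *)
type codec = { encoder : int list -> Word8Vector.vector,
               decoder : Word8Vector.vector -> int list }

val registry : (string * codec) list ref = ref []

(* Later registrations shadow earlier ones with the same name. *)
fun register (name : string) (c : codec) : unit =
  registry := (name, c) :: !registry

fun lookup (name : string) : codec =
  case List.find (fn (n, _) => n = name) (!registry) of
    SOME (_, c) => c
  | NONE => raise Unsupported

(* Example: ISO-8859-1 maps each byte to the code point of the same value. *)
val () = register "ISO-8859-1"
  { encoder = fn cs =>
      Word8Vector.fromList
        (map (fn c => if c < 0 orelse c > 255 then raise Chr
                      else Word8.fromInt c) cs),
    decoder = fn v =>
      Word8Vector.foldr (fn (b, acc) => Word8.toInt b :: acc) [] v }
```

With something like this, fromName could consult the registry before raising Unsupported.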
Another detail: some of these encoders and decoders will need to keep some
state. For example, an encoder might emit a header at the start of its first
Word8Vector.vector output, and the matching decoder would have to read that
header before decoding anything else.
I'm not sure how best to do this since a lazy scanner might 'rescan' the
header repeatedly.
Matthew observed:
> I'm simply noting the software engineering issue of recovering from bad
> encodings. Because the conversion is applied lazily, you need to write
> recovery code for every use of the stream. This recovery can either be
> explicit (checking for NONE and/or handling an exception at the use) or
> implicit (handling an exception around the whole use of the stream).
I agree that the user has to wrap their reader function in a test for the
Unicode.Encoding.Bad exception. However, I don't see any other way to deal
with this. It's much like an IO error, IMO.
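To illustrate the implicit style of recovery, here is a small sketch with one handler around the whole use of the stream. Everything in it is hypothetical: asciiGet is a toy decoder that only exists to raise Bad, int code points stand in for WideChar.char, and drain and decodeAll are illustrative names.

```sml
exception Bad

(* Toy decoder for illustration only: passes ASCII bytes through as int
   code points and raises Bad on anything else. *)
fun asciiGet rd s =
  case rd s of
    NONE => NONE
  | SOME (b, s') =>
      let val i = Word8.toInt b
      in if i < 0x80 then SOME (i, s') else raise Bad end

fun listRd [] = NONE
  | listRd (x :: xs) = SOME (x, xs)

(* Drain a reader into a list; a Bad raised lazily surfaces here,
   at the point of use, not where the decoder was constructed. *)
fun drain rd s =
  case rd s of
    NONE => []
  | SOME (x, s') => x :: drain rd s'

(* Implicit recovery: one handler around the whole use of the stream. *)
fun decodeAll bytes =
  SOME (drain (asciiGet listRd) bytes) handle Bad => NONE
```

The explicit alternative would put the same handle clause around each individual call to the reader, which is more code but lets the caller resume after the bad byte.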
--
Wesley W. Terpstra <wesley@terpstra.ca>