fast I/O
Henry Cejtin
henry@sourcelight.com
Tue, 19 Sep 2000 01:23:01 -0500
I have thought some more about the general problem of mutable vs. immutable
I/O stuff, and here is the problem as I see it: First note that reading
almost anything requires at least one character of look ahead. Consider
reading an integer: you don't know it is done until you look one character
too far.
This fact does not, of course, require immutable objects marking positions in
I/O streams of course. It doesn't technically require any support at all.
Unfortunately, without any support from the underlying I/O system, you will
be forced to hold the looked ahead character in a variable and to pass it
around. What's worse, some times you won't have it, so you will have to read
it just to pass in.
Rejecting that approach, we have to add some support in the I/O system. What
Pascal did, which was quite elegant, but kind of a pain to use, was to have
the notion of 1 character look ahead built in by providing the equivalent of
a `peek' function. This returned the next character, but did NOT advance.
The reason why this is nice is because all combinations of calls are legal
(see the C solution for a system where this is not the case), but it is a
pain to use. Consider using this the code to read in a (weakly) positive
integer (no white space skipping):
fun getuint (stream): int =
let fun loop ac =
let val ch = peek stream
in if isDigit ch
then (advance stream;
loop (10*ac + ord ch - ord #"0"))
else ac
val ch = peek stream
in if isDigit ch
then loop (ord ch - ord #"0")
else raise ???
end
Note the need to advance the stream. 2 calls per character, and even if
advance returns the next character, it is a pain to use.
The approach used in C is to provide ungetc instead of peek. This is quite
ugly because: it isn't legal to call unget twice in a row without an
intervening getc call, and also what happens if you unget something other
than the last character read? What does ungetting EOF mean? Ignoring that,
it is easy to use and a bit more compact than peek.
You could, of course, generalize unget so that you could unget as many
characters in a row as you wanted (and then you would get them in the reverse
order). This is all quite consistent and nice, but note it isn't quite as
modular as the functional approach. The reason is that anything that does a
read which may have to be pushed back has to know that so that it will save
the characters. With the functional approach, no one has to know except for
the routine that is going to decide to go back. The down side of this is
that only the GC will know that some save set of characters can be gotten rid
of now. The code generator is almost always going to have to generate the
saved state even though the code typically only wants 1 character of look
ahead.
I think that my conclusion is that the arbitrary-unget is the way to go
because per-character overhead is just too high with anything else. I don't
see any way to make this work with the `reader' concept though. Wait, you
could do this by having the `state' be the last character looked at (maybe).
Any way, I think that the mod you sent me, although correct, is going to be
either too slow, or else is going to force me to carry around the one
character of look ahead in my code. Both of these alternatives seem very
very bad.