FFT benchmark
Henry Cejtin
henry@sourcelight.com
Sun, 6 Aug 2000 17:57:01 -0500
I finally sat down and looked at the hot loop you sent me from the FFT
benchmark. One inefficiency is the repeated references to the same stack
(MLton stack) location. I don't know if the C compiler is going to be clever
enough to figure out that it can cache the value stored in the MLton stack
location in a register or in a C stack location. Given that it doesn't do
any pointer alias analysis, it certainly is going to have to reload them over
any store into the MLton stack. Even a store into any C-local variable whose
address has ben taken will force this, but you probably never do that.
An example is the reload of SP(256) that will have to happen after the store
XD(SP(256), SI(308)) = RD(14);
and the reload of SI(308) after the store
XD(SP(256), RI(8)) = RD(20);
(although the latter is probably pretty cheap).
Also, although the C compiler should be doing it, not the repeated re-loading
of things from the MLton stack. As an example, at the beginning of the code
look at
RD(9) = XD(SP(256), SI(308));
RD(10) = XD(SP(256), RI(9));
RD(11) = Real_sub(RD(9), RD(10));
RD(12) = XD(SP(256), SI(308));
RD(13) = XD(SP(256), RI(9));
RD(14) = Real_add(RD(12), RD(13));
I would think that doing this kind of common subexpression elimination (just
of the result of loads) would be a good idea any way. Especially for the
extra-indirection stuff (all the loads above).
I tried to look at the generated C code, but nothing I ran on any of the
versions of the FFT benchmark produced similar C code. What did you compile
and with what version of MLton? Could you send me the source you used?
Speaking of FFT's, the FFT code in the benchmark really seems insane. For
one thing, the natural characterization is to go from a complex array to a
complex array, but from what it looks like, they use 2 parallel arrays of
double's instead (probably because they wanted to avoid the boxing overhead
of an array of pairs of doubles). Also the result (assuming it is supposed
to leave the result in the input arrays) ceratinly isn't the FFT.