x86 performance
Henry Cejtin
henry@sourcelight.com
Wed, 9 Aug 2000 01:28:37 -0500
In the even/odd code, the main differences I see are:

The C compiler uses leal as cheap 3-address arithmetic while the x86
version uses a move followed by an add constant. Note, the C
compiler way is 1 instruction, and it is only 3 bytes long. The x86
version is 2 instructions and 5 bytes long. Also C code is
absolutely filled with loads and stores at short offsets from a
register (either because the register is a pointer to a struct or the
stack) so I am sure that this addressing mode is very fast. This
could be a big difference.

The C compiler uses decl to decrement a register while the x86 does a
subtract of a constant. The C way takes 1 byte, the x86 way takes 3
bytes. Again, I could believe that this makes a difference, although
probably not a lot.

The C compiler kept a value in a register while the x86 had to store it
into memory and then reload it. Also the x86 code used an absolute
location for both the store and the load. I.e., 0 instructions and 0
bytes vs. 2 instructions and 12 bytes. (Yes, storing and loading
%esi to/from an absolute location is 6 bytes.) This is probably very
costly.

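
To make the byte counts above concrete, here is a small tally of the 32-bit encodings involved. This is only a sketch: the register choices (%eax, %ebx) and the zeroed absolute address are my own illustrative assumptions, not taken from the actual even/odd output, and the 3-byte leal figure assumes the constant fits in an 8-bit displacement.

```python
# Hand-assembled 32-bit x86 encodings. Registers and the address are
# illustrative assumptions, not the actual compiler output.
ABS_ADDR = bytes(4)  # placeholder for a 32-bit absolute address

C_STYLE = {
    "leal 1(%eax),%ebx": bytes([0x8D, 0x58, 0x01]),  # 3-address add, disp8
    "decl %eax": bytes([0x48]),                      # 1-byte decrement
    # value stays in a register: no store/reload at all
}

X86_STYLE = {
    "movl %eax,%ebx": bytes([0x89, 0xC3]),
    "addl $1,%ebx": bytes([0x83, 0xC3, 0x01]),
    "subl $1,%eax": bytes([0x83, 0xE8, 0x01]),
    "movl %esi,ADDR": bytes([0x89, 0x35]) + ABS_ADDR,  # store, 6 bytes
    "movl ADDR,%esi": bytes([0x8B, 0x35]) + ABS_ADDR,  # reload, 6 bytes
}


def total_bytes(table):
    """Sum the encoded length of every instruction in the table."""
    return sum(len(code) for code in table.values())


if __name__ == "__main__":
    for name, table in (("C style", C_STYLE), ("x86 style", X86_STYLE)):
        for asm, code in table.items():
            print(f"{name:10} {asm:20} {len(code)} bytes")
        print(f"{name:10} total: {total_bytes(table)} bytes")
```

Under these assumptions the three sequences come to 4 bytes in the C style against 20 bytes in the x86 style.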
You didn't say what kind of machine you were running on, but let's suppose it
was a 400 MHz P6. Thus the C code is taking
16.99 * 400 * 10^6 / (750 * 10^6) = 9 cycles
while the x86 code is taking
22.65 * 400 * 10^6 / (750 * 10^6) = 12 cycles
so the difference is 3 cycles. I would say very believable.
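
The arithmetic above, spelled out as a sketch. I am reading 16.99 and 22.65 as run times in seconds and 750 * 10^6 as the number of iterations; both readings are assumptions from the shape of the formula, and the 400 MHz clock is the guess made above. The function name is just for illustration.

```python
CLOCK_HZ = 400e6    # assumed 400 MHz P6
ITERATIONS = 750e6  # assumed iteration count (the 750 * 10^6 term)


def cycles_per_iteration(seconds, clock_hz=CLOCK_HZ, iterations=ITERATIONS):
    """Total cycles spent (seconds * clock rate) divided by iterations."""
    return seconds * clock_hz / iterations


if __name__ == "__main__":
    c_code = cycles_per_iteration(16.99)
    x86_code = cycles_per_iteration(22.65)
    print(f"C code:     {c_code:.2f} cycles")
    print(f"x86 code:   {x86_code:.2f} cycles")
    print(f"difference: {x86_code - c_code:.2f} cycles")
```

A different clock rate scales all three numbers by the same factor, so the ratio between the two versions does not depend on the 400 MHz guess.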
Note, converting to the C-style code (at least in this case) is trivial for a
peep-hole optimizer.
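
A minimal sketch of what such a peephole pass could look like, working on AT&T-syntax text lines. The function name and the regexes are hypothetical, not from any existing code generator. It only handles the first two differences (move + add constant into leal, subtract of 1 into decl), and it assumes the condition flags set by addl/subl are not used afterwards: leal sets no flags and decl leaves the carry flag alone, so a real pass would have to check that.

```python
import re

# Hypothetical patterns for instructions written as "op src,dst".
MOV_RE = re.compile(r"movl (%\w+),(%\w+)")
ADD_RE = re.compile(r"addl \$(-?\d+),(%\w+)")
SUB1_RE = re.compile(r"subl \$1,(%\w+)")


def peephole(instrs):
    """Rewrite mov+add-constant pairs to leal and 'subl $1' to decl.

    Assumes the flags produced by the rewritten instructions are dead.
    """
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else ""
        mov = MOV_RE.fullmatch(cur)
        add = ADD_RE.fullmatch(nxt)
        # movl src,dst ; addl $k,dst  ->  leal k(src),dst
        if mov and add and add.group(2) == mov.group(2):
            out.append(f"leal {add.group(1)}({mov.group(1)}),{mov.group(2)}")
            i += 2
            continue
        # subl $1,reg  ->  decl reg
        sub = SUB1_RE.fullmatch(cur)
        if sub:
            out.append(f"decl {sub.group(1)}")
            i += 1
            continue
        out.append(cur)
        i += 1
    return out


if __name__ == "__main__":
    before = ["movl %eax,%ebx", "addl $1,%ebx", "subl $1,%ecx", "ret"]
    for line in peephole(before):
        print(line)
```

The store/reload of %esi is left alone here; whether a purely local pass can safely remove it depends on what else touches that location.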
The funniness with the jumps, I theorize, costs zero because they are
predicted correctly.