x86 performance

Henry Cejtin henry@sourcelight.com
Wed, 9 Aug 2000 01:28:37 -0500


In the even/odd code, the main differences I see are:

    The  C  compiler  uses  leal  as cheap 3-address arithmetic while the x86
        version uses a move  followed  by  an  add  constant.   Note,  the  C
        compiler  way is 1 instruction, and it is only 3 bytes long.  The x86
        version is  2  instructions  and  5  bytes  long.   Also  C  code  is
        absolutely  filled  with  loads  and  stores  at short offsets from a
        register (either because the register is a pointer to a struct or the
        stack)  so  I  am  sure that this addressing mode is very fast.  This
        could be a big difference.

    The C compiler uses decl to decrement a register while  the  x86  does  a
        subtract  of a constant.  The C way takes 1 byte, the x86 way takes 3
        bytes.  Again, I could believe that this makes a difference, although
        probably not a lot.

    The C compiler kept a value in a register while the x86 had to store it
        into memory and then reload it.  Also the x86 code used an absolute
        location for both the store and the load.  I.e., 0 instructions and 0
        bytes vs. 2 instructions and 12 bytes.  (Yes, storing %esi to an
        absolute location is 6 bytes, and so is loading it back.)  This is
        probably very costly.
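
To make the byte counts in the three points above concrete, here is a small
sketch that tallies IA-32 encodings for each pair of sequences.  The registers
(%eax, %ebx, %esi), the constant 1, and the absolute address 0x08049000 are
assumptions picked for illustration; the actual even/odd code is not quoted in
this message, so only the sizes are meant to match.

```python
# Hypothetical instruction choices: the registers, the constant 1, and the
# address 0x08049000 are illustrative assumptions, not the real even/odd code.
# Encodings are for 32-bit mode (decl %eax is a 1-byte opcode only there).

COMPARISONS = [
    (
        "3-address add",
        [("leal 1(%eax),%ebx", bytes.fromhex("8d5801"))],
        [("movl %eax,%ebx", bytes.fromhex("89c3")),
         ("addl $1,%ebx", bytes.fromhex("83c301"))],
    ),
    (
        "decrement",
        [("decl %eax", bytes.fromhex("48"))],
        [("subl $1,%eax", bytes.fromhex("83e801"))],
    ),
    (
        "value kept in register vs. store and reload",
        [],  # the C compiler needs no instructions here
        [("movl %esi,0x08049000", bytes.fromhex("893500900408")),
         ("movl 0x08049000,%esi", bytes.fromhex("8b3500900408"))],
    ),
]


def total_bytes(sequence):
    """Sum the encoded length of every instruction in a sequence."""
    return sum(len(encoding) for _, encoding in sequence)


def summarize(comparisons):
    """Return (label, C insns, C bytes, x86 insns, x86 bytes) per comparison."""
    return [
        (label, len(c_seq), total_bytes(c_seq), len(x86_seq), total_bytes(x86_seq))
        for label, c_seq, x86_seq in comparisons
    ]


if __name__ == "__main__":
    for label, c_n, c_b, x_n, x_b in summarize(COMPARISONS):
        print(f"{label}: C {c_n} insn / {c_b} bytes, x86 {x_n} insn / {x_b} bytes")
```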

You didn't say what kind of machine you were running on, but let's suppose it
was a 400 MHz P6.  Thus the C code is taking
    16.99 * 400 * 10^6 / (750 * 10^6) = about 9 cycles
while the x86 code is taking
    22.65 * 400 * 10^6 / (750 * 10^6) = about 12 cycles
so the difference is about 3 cycles.  I would say that is very believable.
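
The arithmetic above can be rechecked with a few lines.  This is only a sketch:
the 400 MHz clock is the supposition made above, and reading 16.99 and 22.65 as
run times in seconds and 750 * 10^6 as the iteration count is an assumption,
since the message does not spell out the units.

```python
# Assumptions: a 400 MHz clock (supposed in the message), times in seconds,
# and 750 * 10^6 taken to be the number of iterations.
CLOCK_HZ = 400e6
ITERATIONS = 750e6


def cycles_per_iteration(seconds):
    """Convert a total run time into clock cycles spent per iteration."""
    return seconds * CLOCK_HZ / ITERATIONS


c_cycles = cycles_per_iteration(16.99)    # time quoted for the C code
x86_cycles = cycles_per_iteration(22.65)  # time quoted for the x86 code

if __name__ == "__main__":
    print(f"C:    {c_cycles:.2f} cycles")
    print(f"x86:  {x86_cycles:.2f} cycles")
    print(f"diff: {x86_cycles - c_cycles:.2f} cycles")
```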

Note, converting to the C-style code (at least in this case) is trivial for a
peep-hole optimizer.
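
As a rough illustration of how small such a peephole pass could be, here is a
sketch over a toy instruction list.  The tuple representation and the function
name `peephole` are hypothetical and are not taken from the actual code
generator.  It also ignores condition codes: leal sets no flags and decl leaves
the carry flag alone, so a real pass would first have to check that the flags
are dead after the rewritten instructions.

```python
# Toy representation (hypothetical): each instruction is a tuple in AT&T
# operand order, e.g. ("movl", "%eax", "%ebx") or ("subl", "$1", "%ecx").


def is_reg(operand):
    return operand.startswith("%")


def is_imm(operand):
    return operand.startswith("$")


def peephole(instrs):
    """Rewrite mov+add-constant into leal, and subtract-of-1 into decl.

    Assumes the condition codes are not used afterwards; this sketch does
    not check that.
    """
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None

        # movl %a,%b ; addl $k,%b   ->   leal k(%a),%b
        if (nxt is not None
                and cur[0] == "movl" and is_reg(cur[1]) and is_reg(cur[2])
                and cur[1] != cur[2]
                and nxt[0] == "addl" and is_imm(nxt[1]) and nxt[2] == cur[2]):
            out.append(("leal", f"{nxt[1][1:]}({cur[1]})", cur[2]))
            i += 2
            continue

        # subl $1,%r   ->   decl %r
        if cur[0] == "subl" and cur[1] == "$1" and is_reg(cur[2]):
            out.append(("decl", cur[2]))
            i += 1
            continue

        out.append(cur)
        i += 1
    return out


if __name__ == "__main__":
    before = [
        ("movl", "%eax", "%ebx"),
        ("addl", "$1", "%ebx"),
        ("subl", "$1", "%ecx"),
    ]
    for insn in peephole(before):
        print(insn)
```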

The funniness with the jumps, I theorize, costs zero because they are
predicted correctly.