One thing I forgot to mention: if you compile nestedloop.gcc without -fomit-frame-pointer, it doesn't allocate the counter in a register. This slows it down from 0.95 to 2.41, so most of the win comes from register allocation. Clearly our stack slots must be hurting us everywhere.
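
For anyone who wants to look at this themselves, here is a rough sketch of the kind of loop nest involved. It is not the actual nestedloop source; the function name count_nested and the six-deep structure are assumptions made for illustration. The only point is that there is a single counter, x, whose home (register or stack slot) dominates the inner loop.

```c
/* Hypothetical stand-in for the benchmark kernel, not the real source:
   six nested loops, each running n times, all bumping one counter. */
long count_nested(int n)
{
    long x = 0; /* the counter whose placement matters */

    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            for (int c = 0; c < n; c++)
                for (int d = 0; d < n; d++)
                    for (int e = 0; e < n; e++)
                        for (int f = 0; f < n; f++)
                            x++; /* a load/add/store if x lives on the stack */

    return x; /* n to the sixth power */
}
```

Compiling this with gcc -S, once with -fomit-frame-pointer and once without, and comparing the two assembly listings should show where x ends up. What you see will depend on the gcc version, the target, and the optimization level, and a newer compiler may well fold the whole nest away.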