One thing that surprised me was that it wasn't only loop nests that were slow: matrix multiply was slow too. I didn't look into it, but it was scary that matrix multiply was slow in addition to the nested loops.