Stack optimization

Started by Ryan, August 29, 2012, 03:32:06 AM

hutch--

Where you got stung the most with a 3 GHz Prescott core was the pipeline length. One hiccup caused a stall, and with the long pipeline it was a long stall. The trick with Prescott cores is to stay away from the old DOS-era smarties and use the preferred instruction set as much as possible, as those instructions tend to pair much better and reduce the number of stalls. The reason Intel gave up on the PIV series was that the logic had run out of puff: a very long pipeline with the clock frequency wound up was reasonably effective with perfectly pairing instructions, but a disaster for older-style, more complex code that did not pair well.

The next-generation Core series processors, in both dual-core and quad-core versions, ate them alive in terms of throughput because they had a much shorter pipeline and were free of hyperthreading. The i7 series recombined these capacities to be faster again.

Tedd

Quote from: Ryan on August 29, 2012, 03:32:06 AM
Also, I've noticed in my disassembled code that when the stack pointer is moved, it uses an add instruction with a negative number instead of just subtracting a number without the sign.  Is there an advantage to using add instead of sub?
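The only difference I've ever noticed is in the encoding for a value of 128. A sign-extended 8-bit immediate covers -128 to +127, so add esp, -128 fits in the short form while sub esp, 128 needs a full 32-bit immediate. So it was either: use sub, but check for 128 and then use add; or just use add without any need to check. So, the reason is laziness.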
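
To make the 128 case concrete, here is a rough sketch in C of how an assembler might pick between the short and long immediate forms for add/sub on esp in 32-bit mode. The function name encode_esp_adjust is made up for illustration; the opcode bytes (83 /0 ib and 81 /0 id for add, 83 /5 ib and 81 /5 id for sub) are the standard IA-32 encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `add esp, imm` (is_sub == 0) or
 * `sub esp, imm` (is_sub != 0) for 32-bit mode into buf (at least 6 bytes).
 * Returns the number of bytes written. */
size_t encode_esp_adjust(int is_sub, int32_t imm, uint8_t *buf)
{
    /* ModRM byte: mod=11, reg=/0 (add) or /5 (sub), rm=100 (esp). */
    uint8_t modrm = is_sub ? 0xEC : 0xC4;

    if (imm >= -128 && imm <= 127) {
        /* Short form: opcode 83 with a sign-extended 8-bit immediate. */
        buf[0] = 0x83;
        buf[1] = modrm;
        buf[2] = (uint8_t)imm;
        return 3;
    }

    /* Long form: opcode 81 with a little-endian 32-bit immediate. */
    buf[0] = 0x81;
    buf[1] = modrm;
    for (int i = 0; i < 4; i++)
        buf[2 + i] = (uint8_t)((uint32_t)imm >> (8 * i));
    return 6;
}
```

With this sketch, sub esp, 128 comes out as 81 EC 80 00 00 00 (6 bytes), while add esp, -128 comes out as 83 C4 80 (3 bytes). For every other small stack adjustment the two forms are the same length, which is why always emitting add with a negated value saves the special-case check.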


As for instruction encoding size, it doesn't make a lot of difference once the instructions have been decoded, but it can make some difference in loops. If you can shrink a loop enough that it fits entirely in cache, you should get some improvement; trimming a few bytes here and there won't change much unless you're already on the edge of a desired size. Alignment can help for similar reasons: it ensures the loop starts on a boundary, so you don't waste half a cache line on the instructions before the loop.
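
As a back-of-the-envelope illustration of the alignment point, this small C sketch counts how many cache lines a block of code touches given its start address and size. The function name lines_touched and the 64-byte line size used in the examples are assumptions for illustration; the actual line size depends on the processor.

```c
#include <stdint.h>

/* Hypothetical helper: number of cache lines touched by `size` bytes of
 * code starting at address `start`, for a cache line of `line` bytes. */
unsigned lines_touched(uintptr_t start, unsigned size, unsigned line)
{
    if (size == 0)
        return 0;

    uintptr_t first = start / line;              /* line holding the first byte */
    uintptr_t last  = (start + size - 1) / line; /* line holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

Assuming 64-byte lines, a 64-byte loop starting at 0x1000 touches one line, while the same loop starting at 0x1030 straddles two. In MASM the padding is done with the ALIGN directive placed before the loop label.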
Potato2