The MASM Forum

General => The Campus => Topic started by: Ryan on August 29, 2012, 03:32:06 AM

Title: Stack optimization
Post by: Ryan on August 29, 2012, 03:32:06 AM
Dave mentioned in the TEXTEQU topic that there is an advantage to doing things a little differently.  I'm just curious what those things are.

Also, I've noticed in my disassembled code that when the stack pointer is moved, it uses an add instruction with a negative number instead of just subtracting a number without the sign.  Is there an advantage to using add instead of sub?
Title: Re: Stack optimization
Post by: dedndave on August 29, 2012, 03:41:18 AM
well - Jochen pointed out that the EBP register gets special treatment with regard to offsets
i am talking about how the opcodes have been assigned by intel, here

with other registers, like EBX, you can use [EBX] and it will use a short opcode
if you use [EBX+4], it uses a longer opcode that has the offset

EBP has a special form, in that it allows the use of byte-sized offset operands
if you use [EBP], the assembler actually generates [EBP+00], because there is no intel opcode for [EBP]
the minimal opcode uses a byte offset

now - that gives you a slight edge when you use smaller offsets with EBP
the range of the byte offsets is -128 to +127
so - if we keep our used offsets in that range, we get a code size reduction
this can be important in a loop that uses many different stack frame references

you might do something like this
        sub     esp,128        ;reserve space above EBP for 32 dwords
        mov     ebp,esp        ;establish the stack frame base pointer in the middle of local data
        sub     esp,128        ;reserve space below EBP for 32 dwords

now, you can access 64 dwords in the stack frame using the smaller byte offset opcode   :biggrin:
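For anyone who wants to sanity-check those sizes, here is a rough Python sketch that hand-assembles `mov eax,[reg+disp]` from the IA-32 ModRM table (illustrative only - the helper `mov_eax_mem` is made up for this post, not a real assembler):

```python
def mov_eax_mem(reg: str, disp: int) -> bytes:
    """Encode 'mov eax, [reg+disp]' (opcode 8B) for EBX or EBP.

    ModRM mod field: 00 = no displacement, 01 = disp8, 10 = disp32.
    The mod=00 slot for EBP (rm=101) is taken by a disp32-only form,
    so [EBP] has to be emitted as [EBP+0] with a one-byte displacement.
    """
    rm = {"ebx": 3, "ebp": 5}[reg]
    if disp == 0 and reg != "ebp":
        return bytes([0x8B, 0x00 | rm])               # mod=00: 2 bytes
    if -128 <= disp <= 127:
        return bytes([0x8B, 0x40 | rm, disp & 0xFF])  # mod=01: 3 bytes
    # mod=10: 4-byte displacement, 6 bytes total
    return bytes([0x8B, 0x80 | rm]) + (disp & 0xFFFFFFFF).to_bytes(4, "little")

# [ebx] is 2 bytes, [ebp] costs 3 even with no offset, and any offset
# outside -128..+127 jumps to 6 bytes.
```

So every stack-frame reference that stays inside the -128..+127 window saves 3 bytes over the disp32 form.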
Title: Re: Stack optimization
Post by: Ryan on August 29, 2012, 03:49:09 AM
Does ESP have the same optimization, or is it just EBP?
Title: Re: Stack optimization
Post by: dedndave on August 29, 2012, 03:51:53 AM
i might mention.....

this can give you an edge when you are trying to keep the code distance short across a loop
TopOfLoop:

;do stuff

        sub     ecx,1
        jnz     TopOfLoop


if the code length inside the loop exceeds 128 bytes, the JNZ will be the NEAR variety, having a dword offset
if the code length is shorter, however, you get the SHORT version of the JNZ
short jumps execute significantly faster than near jumps
not to mention - the shorter code has a better chance of fitting into the prefetch cache
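To put numbers on that: SHORT JNZ is `75 rel8` (2 bytes) and NEAR JNZ is `0F 85 rel32` (6 bytes in 32-bit code). A toy encoder, assuming the relative distance is measured from the end of the JNZ, as on x86 (`jnz_to_top` is a made-up helper for illustration):

```python
def jnz_to_top(body_len: int) -> bytes:
    """Encode a backward JNZ to the top of a loop, where body_len is the
    number of code bytes between the loop top and this JNZ.
    SHORT form: 75 rel8 (2 bytes); NEAR form: 0F 85 rel32 (6 bytes)."""
    rel_short = -(body_len + 2)        # rel counts from the end of the JNZ itself
    if -128 <= rel_short <= 127:
        return bytes([0x75, rel_short & 0xFF])
    rel_near = -(body_len + 6)
    return bytes([0x0F, 0x85]) + (rel_near & 0xFFFFFFFF).to_bytes(4, "little")
```

With this accounting the cutover lands at a 126-byte body, since the 2-byte JNZ itself is part of the backward distance.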

.....
i don't believe ESP gets the same set of opcodes as EBP
you'd have to refer to the intel manual on that one
but - know that intel has optimized use of EBP on the stack because it is used so often
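For the record, if I'm reading the opcode map right, ESP does get the same disp8/disp32 ModRM forms - but rm=100 (ESP) is repurposed as the SIB escape, so every [ESP+disp] reference carries an extra SIB byte. Hand-assembled bytes (worth double-checking against the Intel manual):

```python
# mov eax,[esp+4] -> 8B 44 24 04 : opcode, ModRM, SIB (24h = base ESP), disp8
mov_eax_esp_4  = bytes([0x8B, 0x44, 0x24, 0x04])
# mov eax,[ebp-4] -> 8B 45 FC    : opcode, ModRM, disp8 -- no SIB needed
mov_eax_ebp_m4 = bytes([0x8B, 0x45, 0xFC])
# Same disp8 window for both registers, but ESP pays one byte more per access.
```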
Title: Re: Stack optimization
Post by: Ryan on August 29, 2012, 03:55:56 AM
So what's the advantage of using an add with a large number (negative) versus a small number with sub?
Title: Re: Stack optimization
Post by: dedndave on August 29, 2012, 03:59:51 AM
i am not sure there always is one - lol
again - you have to look at how the opcode space is used
seemingly large negative values can often be small signed offsets
        add     esp,-8      ;intel has a short signed opcode variation for values between -128 and +127

i see this variation in compiler-generated code quite often
however, the advantage is probably trivial
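A sketch of the two encodings an assembler can pick from for `add esp,imm` - `83 /0 ib` (sign-extended imm8) versus `81 /0 id` (imm32); the `add_esp` helper is just for illustration:

```python
def add_esp(imm: int) -> bytes:
    """Encode 'add esp, imm' the way an assembler would: the 3-byte
    sign-extended imm8 form (83 C4 ib) when imm fits in -128..+127,
    otherwise the 6-byte imm32 form (81 C4 id)."""
    if -128 <= imm <= 127:
        return bytes([0x83, 0xC4, imm & 0xFF])
    return bytes([0x81, 0xC4]) + (imm & 0xFFFFFFFF).to_bytes(4, "little")

# add esp,-8 comes out as 83 C4 F8 -- three bytes, half the imm32 form.
```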
Title: Re: Stack optimization
Post by: Ryan on August 29, 2012, 04:03:18 AM
I don't have the MASM stuff here at work.  Norton didn't like it, and I didn't feel like fighting it, but a simple proc with a local variable assembles with an add.  I'll make an example when I get home.
Title: Re: Stack optimization
Post by: Ryan on August 29, 2012, 09:09:40 AM
I've attached a zip of two files.  One is the source, the other is the disassembly generated from the option in QEditor.

I declared a local variable within a proc.  MASM generates an add instruction on esp.  It results in 4 bytes.  I also added my own manipulation of esp with a sub, which is also 4 bytes.

Is there an advantage to using an add instruction instead of sub?  I'm wondering what the reasoning is why MASM would generate an add.
Title: Re: Stack optimization
Post by: qWord on August 29, 2012, 09:59:18 AM
Quote from: Ryan on August 29, 2012, 09:09:40 AM
Is there an advantage to using an add instruction instead of sub?  I'm wondering what the reasoning is why MASM would generate an add.
Probably it has historical reasons: maybe SUB was slower than ADD on earlier architectures.
I don't think that it makes any difference on current machines.
Title: Re: Stack optimization
Post by: dedndave on August 29, 2012, 10:45:38 AM
i think i just reasoned out the answer   :P

as i mentioned earlier, there is a byte form where the range is -128 to +127
        add     esp,-128  ;signed byte form used
        sub     esp,128   ;dword form used


i think the dword form is 3 bytes longer
the advantage applies to that one specific value, only
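In raw bytes, that asymmetry looks like this (hand-assembled; the imm8 is sign-extended, so -128 fits but +128 does not):

```python
add_esp_m128 = bytes([0x83, 0xC4, 0x80])                    # add esp,-128 : 83 /0 ib
sub_esp_128  = bytes([0x81, 0xEC, 0x80, 0x00, 0x00, 0x00])  # sub esp,128  : 81 /5 id
# Same net effect on ESP (flags aside), but the ADD form saves 3 bytes
# for this one value.
```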
Title: Re: Stack optimization
Post by: qWord on August 29, 2012, 11:15:03 AM
Quote from: dedndave on August 29, 2012, 10:45:38 AM
i think the dword form is 3 bytes longer
the advantage applies to that one specific value, only
well, both instructions, SUB and ADD, allow a 3-byte encoding: OP reg32, signed imm8
EDIT: ok - i see your argument: 128  :P
Title: Re: Stack optimization
Post by: Ryan on August 29, 2012, 11:45:42 AM
00401000 55                     push    ebp
00401001 8BEC                   mov     ebp,esp
00401003 83C4FC                 add     esp,0FFFFFFFCh
00401006 C745FC01000000         mov     dword ptr [ebp-4],1
0040100D 83EC04                 sub     esp,4
00401010 C9                     leave
00401011 C3                     ret

I said in a previous post that add and sub in this case take 4 bytes.  I must have miscounted; it takes 3 bytes.  The clock cycles appear to be the same.

Interesting that ENTER takes 14 clock cycles.  That is more than the 3 instructions that ENTER replaces, combined.  Size is different though.  ENTER takes 4 bytes, while the above disassembly takes 6.
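For comparison, hand-assembled byte counts for the two prologue styles (ENTER is `C8 iw ib` per the Intel opcode map):

```python
enter_4_0 = bytes([0xC8, 0x04, 0x00, 0x00])  # enter 4,0    (4 bytes)
prologue  = bytes([0x55,                     # push ebp     (1 byte)
                   0x8B, 0xEC,               # mov ebp,esp  (2 bytes)
                   0x83, 0xEC, 0x04])        # sub esp,4    (3 bytes)
# ENTER is 2 bytes smaller but, per the timings above, several times slower.
```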
Title: Re: Stack optimization
Post by: dedndave on August 29, 2012, 12:28:17 PM
many compilers do not use ENTER, Ryan - because it is slower
probably most modern compilers don't
they do use LEAVE, however - and so does MASM


qWord - the 128 thing - hey, i think i earned a gold star for the day - lol
Title: Re: Stack optimization
Post by: hutch-- on August 29, 2012, 01:05:27 PM
The assumption that the instruction "byte length" matters has been out of date since the i486; almost exclusively, later processors munch instructions at about the same rate no matter whether they are the long or short encoding. The short encodings exist primarily for backwards-compatible old code. It may sound good to try and apply DOS-era optimisations of the shortest encodings, and you may alter the amount of free space in the tail-end PE section with 512-byte alignment, but I have yet to see any speed advantage from shorter instruction encodings; there are too many other factors affecting speed than simple byte length.

Lower instruction counts often show up as faster and reduced memory operations usually show up as faster, much of the rest is fantasy.  :biggrin:
Title: Re: Stack optimization
Post by: dedndave on August 29, 2012, 01:25:12 PM
i don't disagree entirely
if you can reduce the instruction count (particularly inside a loop), you can get a big advantage

however, if you have reduced the instruction count all you can...
and the loop ends with a NEAR branch back to the top...
the top of the loop wants to be aligned - and even then it is slower

if you can pinch a byte here and there - enough to make it a SHORT branch...
the top of the loop no longer cares if it is aligned - and the loop is simply faster
i have seen a number of cases where this is true

perhaps the effect isn't as pronounced on newer cores
but - it makes a difference on my P4
Title: Re: Stack optimization
Post by: hutch-- on August 29, 2012, 06:58:51 PM
The place where you got stung the most with a 3 gig Prescott core was the pipeline length. One hiccup and you got hit badly with a stall, and with the long pipeline it was a long stall. The trick with Prescott cores is to stay away from the old DOS era smarties and use the preferred instruction set as much as possible, as those instructions tended to pair much better and reduce the number of stalls. The reason why Intel gave up on the PIV series was that the logic had run out of puff; a very long pipeline with the clock frequency wound up was reasonably effective with perfectly pairing instructions, but a disaster for older-style, more complex code that did not pair well.

The next generation Core series processors, in both dual cores and quads, ate them alive in terms of throughput because they had a much shorter pipeline and were free of hyperthreading. The i7 series recombined the capacities to be faster again.
Title: Re: Stack optimization
Post by: Tedd on August 30, 2012, 01:53:02 AM
Quote from: Ryan on August 29, 2012, 03:32:06 AM
Also, I've noticed in my disassembled code that when the stack pointer is moved, it uses an add instruction with a negative number instead of just subtracting a number without the sign.  Is there an advantage to using add instead of sub?
The only difference I've ever noticed is in the encoding for a value of 128. So it was either: use sub, but check for 128 and then use add; or just use add without any need to check. So, the reason is laziness :badgrin:


As for instruction size encoding, it doesn't make a lot of difference once the instructions have been decoded; however, it can make some difference in loops. If you can reduce a loop enough that it fits entirely in cache, then you should get some improvement, but reducing a few instructions here and there won't change much unless you're already on the edge of a desired size. Alignment can help for similar reasons - it ensures you start on a boundary and don't waste half a cache-line on the instructions before the loop.
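Tedd's "laziness" theory in code form - a sketch of an assembler that always negates and emits ADD, so 128 needs no special case (`adjust_esp` is a made-up illustration, not MASM's actual logic):

```python
def adjust_esp(delta: int) -> bytes:
    """Shrink the stack by 'delta' bytes by always emitting ADD with the
    negated immediate: sign-extended imm8 (83 C4 ib) when it fits,
    else imm32 (81 C4 id).  delta == 128 gets the short form for free,
    because -128 fits in a signed byte while +128 does not."""
    imm = -delta
    if -128 <= imm <= 127:
        return bytes([0x83, 0xC4, imm & 0xFF])
    return bytes([0x81, 0xC4]) + (imm & 0xFFFFFFFF).to_bytes(4, "little")
```

A SUB-emitting version would need an explicit `if delta == 128` branch to match these sizes; this one never checks.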