Stack optimization

Ryan · August 29, 2012, 03:32:06 AM

Dave mentioned in the TEXTEQU topic that there is an advantage to doing things a little differently. I'm just curious what those things are.

Also, I've noticed in my disassembled code that when the stack pointer is moved, it uses an add instruction with a negative number instead of just subtracting a number without the sign. Is there an advantage to using add instead of sub?

dedndave · August 29, 2012, 03:41:18 AM

well - Jochen pointed out that the EBP register gets special treatment with regard to offsets
i am talking about how the opcodes have been assigned by intel, here

with other registers, like EBX, you can use [EBX] and it will use a short opcode
if you use [EBX+4], it uses a longer opcode that has the offset

EBP is a special case, in that its shortest form already carries a byte-sized offset operand
if you use [EBP], the assembler actually generates [EBP+00], because there is no intel encoding for plain [EBP]
so the minimal encoding uses a byte offset

now - that gives you a slight edge when you use smaller offsets with EBP
the range of the byte offsets is -128 to +127
so - if we keep our used offsets in that range, we get a code size reduction
this can be important in a loop that uses many different stack frame references
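
A sketch of the encodings being described, for anyone who wants to check them in a listing file. The register choices and the byte values in the comments are worked out by hand from the Intel ModRM rules rather than taken from the posts, so treat them as something to verify against your own assembler output. The fragment is only meant to be assembled, not run.

```asm
        .386
        .model  flat, stdcall
        .code
start:
        mov     eax,[ebx]       ; 8B 03              2 bytes: no offset byte needed
        mov     eax,[ebx+4]     ; 8B 43 04           3 bytes: byte offset
        mov     eax,[ebp]       ; 8B 45 00           3 bytes: assembled as [ebp+0]
        mov     eax,[ebp+4]     ; 8B 45 04           3 bytes: byte offset
        mov     eax,[ebp-128]   ; 8B 45 80           3 bytes: lowest byte offset
        mov     eax,[ebp+128]   ; 8B 85 80 00 00 00  6 bytes: dword offset
        end     start
```

Once an offset leaves the -128 to +127 range, each reference costs 3 extra bytes.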

you might do something like this

        sub     esp,128        ;reserve space above EBP for 32 dwords
        mov     ebp,esp        ;establish the stack frame base pointer in the middle of local data
        sub     esp,128        ;reserve space below EBP for 32 dwords

now, you can access 64 dwords in the stack frame using the smaller byte offset opcode
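
A minimal sketch of how such a centered frame might look inside a complete procedure. The procedure name, the two stores, and the LEA-based cleanup are assumptions added for illustration. The OPTION lines are there only so MASM does not add a prologue or epilogue of its own.

```asm
        .386
        .model  flat, stdcall
        option  prologue:none
        option  epilogue:none
        .code
CenteredFrame proc                      ; hypothetical name
        push    ebp                     ; save caller's frame pointer
        sub     esp,128                 ; upper half: 32 dwords
        mov     ebp,esp                 ; EBP sits in the middle of the locals
        sub     esp,128                 ; lower half: 32 dwords
        mov     dword ptr [ebp-128],0   ; C7 45 80 ...  lowest dword, byte offset
        mov     dword ptr [ebp+124],0   ; C7 45 7C ...  highest dword, byte offset
        lea     esp,[ebp+128]           ; release both halves; ESP now points at saved EBP
        pop     ebp
        ret
CenteredFrame endp
        end
```

Note that LEAVE cannot be used for the cleanup here, because EBP no longer points at the saved EBP slot.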

Ryan · August 29, 2012, 03:49:09 AM

Does ESP have the same optimization, or is it just EBP?

dedndave · August 29, 2012, 03:51:53 AM

i might mention.....

this can give you an edge when you are trying to keep the code distance short across a loop

TopOfLoop:

;do stuff

        sub     ecx,1
        jnz     TopOfLoop

if the code length inside the loop exceeds 128 bytes, the JNZ will be the NEAR variety, having a dword offset
if the code length is shorter, however, you get the SHORT version of the JNZ
short jumps execute significantly faster than near jumps
not to mention - the shorter code has a better chance of fitting into the prefetch cache
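
To see where the switch from SHORT to NEAR happens, a sketch like the following can be assembled and inspected in a listing. The NOP filler stands in for a real loop body, the label names are made up, and the byte values in the comments are hand-computed, so check them against the listing. The displacement is measured from the end of the JNZ, which is why the limit lands at 123 bytes of filler here rather than at 128.

```asm
        .386
        .model  flat, stdcall
        .code
ShortLoop:
        db      123 dup (90h)   ; NOP filler standing in for a 123-byte loop body
        sub     ecx,1           ; 83 E9 01
        jnz     ShortLoop       ; 75 80              2 bytes: rel8 = -128, just fits
NearLoop:
        db      124 dup (90h)   ; one more byte of filler
        sub     ecx,1           ; 83 E9 01
        jnz     NearLoop        ; 0F 85 7B FF FF FF  6 bytes: rel32 = -133
        end
```

One extra byte in the body makes the branch 4 bytes longer.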

.....
i don't believe ESP gets the same set of opcodes as EBP
you'd have to refer to the intel manual on that one
but - know that intel has optimized use of EBP on the stack because it is used so often
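
For reference, the Intel encoding tables do give ESP byte-sized offsets as well, but every ESP-based address needs an extra SIB byte, so it always comes out one byte longer than the EBP equivalent. A small sketch, with hand-computed bytes in the comments that are worth verifying in a listing:

```asm
        .386
        .model  flat, stdcall
        .code
start:
        mov     eax,[ebp+4]     ; 8B 45 04              3 bytes
        mov     eax,[esp]       ; 8B 04 24              3 bytes: SIB byte, no offset
        mov     eax,[esp+4]     ; 8B 44 24 04           4 bytes: SIB byte + byte offset
        mov     eax,[esp+128]   ; 8B 84 24 80 00 00 00  7 bytes: SIB byte + dword offset
        end     start
```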

Ryan · August 29, 2012, 03:55:56 AM

So what's the advantage of using an add with a large (negative) number versus a sub with a small number?

dedndave · August 29, 2012, 03:59:51 AM

i am not sure there always is one - lol
again - you have to look at how the opcode space is used
seemingly large negative values can often be small signed offsets

add esp,-8 ;intel has a short signed opcode variation for values between -128 and +127

i see this variation in compiler-generated code quite often
however, the advantage is probably trivial
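
A quick side-by-side, as a sketch to assemble and compare. The bytes in the comments are hand-computed. A disassembler that prints the operand unsigned would show the first line as add esp,0FFFFFFF8h, which is where the "large number" comes from.

```asm
        .386
        .model  flat, stdcall
        .code
start:
        add     esp,-8          ; 83 C4 F8   3 bytes: signed byte operand
        sub     esp,8           ; 83 EC 08   3 bytes: signed byte operand
        end     start
```

For a value like 8 the two forms are the same size.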

Ryan · August 29, 2012, 04:03:18 AM

I don't have the MASM stuff here at work. Norton didn't like it, and I didn't feel like fighting it, but a simple proc with a local variable assembles with an add. I'll make an example when I get home.

Ryan · August 29, 2012, 09:09:40 AM

I've attached a zip of two files. One is the source, the other is the disassembly generated from the option in QEditor.

I declared a local variable within a proc. MASM generates an add instruction on esp. It results in 4 bytes. I also added my own manipulation of esp with a sub, which is also 4 bytes.

Is there an advantage to using an add instruction instead of sub? I'm wondering what the reasoning is why MASM would generate an add.

qWord · August 29, 2012, 09:59:18 AM

Quote from: Ryan on August 29, 2012, 09:09:40 AM
Is there an advantage to using an add instruction instead of sub? I'm wondering what the reasoning is why MASM would generate an add.

Probably it has historical reasons: maybe SUB was slower than ADD on earlier architectures.
I don't think that it makes any difference on current machines.

dedndave · August 29, 2012, 10:45:38 AM

i think i just reasoned out the answer :P

as i mentioned earlier, there is a byte form where the range is -128 to +127

        add     esp,-128  ;signed byte form used
        sub     esp,128   ;dword form used

i think the dword form is 2 bytes longer
the advantage applies to that one specific value, only
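
Working the encodings out by hand (a sketch to verify in a listing, not measured output), the two forms come to 3 bytes and 6 bytes. The asymmetry exists because -128 fits in a signed byte while +128 does not. One step inside the range, both forms are the same size again.

```asm
        .386
        .model  flat, stdcall
        .code
start:
        add     esp,-128        ; 83 C4 80           3 bytes: -128 fits a signed byte
        sub     esp,128         ; 81 EC 80 00 00 00  6 bytes: +128 does not
        add     esp,-127        ; 83 C4 81           3 bytes
        sub     esp,127         ; 83 EC 7F           3 bytes
        end     start
```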

qWord · August 29, 2012, 11:15:03 AM

Quote from: dedndave on August 29, 2012, 10:45:38 AM
i think the dword form is 2 bytes longer
the advantage applies to that one specific value, only

well, both instructions, SUB and ADD, allow a 3-byte encoding: OP reg32,signed imm8
EDIT: ok - i see your argument: 128 :P

Ryan · August 29, 2012, 11:45:42 AM

00401000  55                push    ebp
00401001  8BEC              mov     ebp,esp
00401003  83C4FC            add     esp,0FFFFFFFCh
00401006  C745FC01000000    mov     dword ptr [ebp-4],1
0040100D  83EC04            sub     esp,4
00401010  C9                leave
00401011  C3                ret

I said in a previous post that add and sub in this case take 4 bytes. I must have miscounted; it takes 3 bytes. The clock cycles appear to be the same.
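
For readers without the attachment, source along these lines would be expected to produce a disassembly of that shape. This is a hypothetical reconstruction, not the attached file: the procedure and variable names are invented.

```asm
        .386
        .model  flat, stdcall
        .code
TestProc proc                   ; hypothetical name
        LOCAL   dwValue:DWORD   ; hypothetical local; MASM emits the prologue for it
        mov     dwValue,1       ; assembles as mov dword ptr [ebp-4],1
        sub     esp,4           ; hand-written stack adjustment
        ret                     ; MASM expands this to leave / ret
TestProc endp
        end
```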

Interesting that ENTER takes 14 clock cycles. That is more than the 3 instructions that do the work of an ENTER take combined. Size is different though. ENTER takes 4 bytes, while the three-instruction prologue in the above disassembly takes 6.
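
The size comparison can be laid out as follows. This is a sketch for assembling and inspecting only, with hand-computed bytes in the comments. The frame size of 4 matches the single dword local in the disassembly above.

```asm
        .386
        .model  flat, stdcall
        .code
start:
        enter   4,0             ; C8 04 00 00   4 bytes

        push    ebp             ; 55            1 byte
        mov     ebp,esp         ; 8B EC         2 bytes
        add     esp,-4          ; 83 C4 FC      3 bytes
        end     start
```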

dedndave · August 29, 2012, 12:28:17 PM

many compilers do not use ENTER, Ryan - because it is slower
probably most modern compilers don't
they do use LEAVE, however - and so does MASM
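
For comparison, LEAVE does the same work as the two-instruction epilogue below, in one byte instead of three. The bytes in the comments are hand-computed and the fragment is only meant to be assembled and inspected.

```asm
        .386
        .model  flat, stdcall
        .code
start:
        leave                   ; C9        1 byte

        mov     esp,ebp         ; 8B E5     2 bytes
        pop     ebp             ; 5D        1 byte
        end     start
```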

qWord - the 128 thing - hey, i think i earned a gold star for the day - lol

hutch-- · August 29, 2012, 01:05:27 PM

The assumption that instruction "byte length" matters has been out of date since the i486. Almost without exception, later processors munch instructions at about the same rate whether they use the long or the short encoding. The short encodings exist primarily for backwards compatibility with old code. It may sound good to apply DOS-era optimisations of the shortest encodings, and you may alter the amount of free space at the tail end of a PE section with 512-byte alignment, but I have yet to see any speed advantage from shorter instruction encodings. There are too many other factors affecting speed besides simple byte length.

Lower instruction counts often show up as faster and reduced memory operations usually show up as faster; much of the rest is fantasy.

dedndave · August 29, 2012, 01:25:12 PM

i don't disagree entirely
if you can reduce the instruction count (particularly inside a loop), you can get a big advantage

however, if you have reduced the instruction count all you can...
and the loop ends with a NEAR branch back to the top...
the top of the loop wants to be aligned - and even then it is slower

if you can pinch a byte here and there - enough to make it a SHORT branch...
the top of the loop no longer cares if it is aligned - and the loop is simply faster
i have seen a number of cases where this is true

perhaps the effect isn't as pronounced on newer cores
but - it makes a difference on my P4
