Stack optimization

Started by Ryan, August 29, 2012, 03:32:06 AM


Ryan

Dave mentioned in the TEXTEQU topic that there is an advantage to doing things a little differently.  I'm just curious what those things are.

Also, I've noticed in my disassembled code that when the stack pointer is moved, it uses an add instruction with a negative number instead of just subtracting a number without the sign.  Is there an advantage to using add instead of sub?

dedndave

well - Jochen pointed out that the EBP register gets special treatment with regard to offsets
i am talking about how the opcodes have been assigned by intel, here

with other registers, like EBX, you can use [EBX] and it gets the shortest encoding, with no offset at all
if you use [EBX+4], it uses a longer encoding that carries the offset

EBP is the special case: there is no intel encoding for a plain [EBP]
if you use [EBP], the assembler actually generates [EBP+00]
so the minimal EBP encoding already carries a byte-sized offset

now - that gives you a slight edge when you use smaller offsets with EBP
the range of the byte offsets is -128 to +127
so - if we keep our used offsets in that range, we get a code size reduction
this can be important in a loop that uses many different stack frame references
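
A rough sketch of how these forms assemble, assuming MASM 6 or later with a .386 flat target. The byte values in the comments are the encodings I would expect to see in a listing (for example from ml /c /coff /Fl); the loads themselves are only there to be encoded and just read stack memory.

```asm
        .386
        .model  flat, stdcall
        option  casemap:none
        .code
start:
        push    ebp
        mov     ebp,esp
        mov     ebx,esp                 ; give EBX a readable address too
        mov     eax,[ebx]               ; 8B 03               2 bytes, no offset byte
        mov     eax,[ebx+4]             ; 8B 43 04            3 bytes, byte offset
        mov     eax,[ebp]               ; 8B 45 00            3 bytes, assembled as [ebp+0]
        mov     eax,[ebp+4]             ; 8B 45 04            3 bytes
        mov     eax,[ebp-128]           ; 8B 45 80            3 bytes, lowest byte offset
        mov     eax,[ebp+124]           ; 8B 45 7C            3 bytes
        mov     eax,[ebp+128]           ; 8B 85 80 00 00 00   6 bytes, dword offset
        pop     ebp
        ret
        end     start
```

So a reference through EBP costs the same 3 bytes anywhere from -128 to +127, and jumps to 6 bytes one step outside that range.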

you might do something like this
        sub     esp,128        ;reserve space above EBP for 32 dwords
        mov     ebp,esp        ;establish the stack frame base pointer in the middle of local data
        sub     esp,128        ;reserve space below EBP for 32 dwords

now, you can access 64 dwords in the stack frame using the smaller byte offset opcode   :biggrin:
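
A minimal sketch of a complete routine built around that layout. SplitFrame is a hypothetical name, and saving the caller's EBP plus the cleanup sequence are my additions, not part of the three lines above. A plain label is used instead of PROC/LOCAL so MASM does not generate a prologue of its own.

```asm
        .386
        .model  flat, stdcall
        option  casemap:none
        .code

SplitFrame:                             ; hypothetical routine name
        push    ebp                     ; caller's EBP ends up at [ebp+128]
        sub     esp,128                 ; 32 dwords above the new EBP
        mov     ebp,esp                 ; EBP points into the middle of the locals
        sub     esp,128                 ; 32 dwords below it
        mov     dword ptr [ebp-128],1   ; lowest local, byte offset 80h
        mov     dword ptr [ebp+124],2   ; highest local, byte offset 7Ch
        mov     eax,[ebp-128]
        add     eax,[ebp+124]           ; eax = 3
        lea     esp,[ebp+128]           ; back to the saved EBP
        pop     ebp
        ret

start:
        call    SplitFrame
        ret
        end     start
```

LEAVE cannot be used for the cleanup here, because EBP no longer points at the saved EBP slot. The trade-off is that the saved EBP, the return address, and any stack arguments sit at [ebp+128] and above, so they need the dword offset form.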

Ryan

Does ESP have the same optimization, or is it just EBP?

dedndave

i might mention.....

this can give you an edge when you are trying to keep the code distance short across a loop
TopOfLoop:

;do stuff

        sub     ecx,1
        jnz     TopOfLoop


if the code length inside the loop exceeds 128 bytes, the JNZ will be the NEAR variety, having a dword offset
if the code length is shorter, however, you get the SHORT version of the JNZ
short jumps execute significantly faster than near jumps
not to mention - the shorter code has a better chance of fitting into the prefetch cache
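
A small sketch of where the switch happens, assuming MASM 6 or later with a .386 target, where the assembler picks the jump size by itself. The NOP padding is a stand-in for a real loop body, and the label names are made up for the example. The byte offset is counted from the end of the JNZ, so the SUB and the JNZ itself are part of the 128 bytes.

```asm
        .386
        .model  flat, stdcall
        option  casemap:none
        .code
start:
        mov     ecx,3
ShortLoop:
        db      123 dup (90h)           ; 123 NOPs stand in for the loop body
        sub     ecx,1                   ; 83 E9 01
        jnz     ShortLoop               ; 75 80               2 bytes, rel8 = -128 (the limit)

        mov     ecx,3
NearLoop:
        db      124 dup (90h)           ; one byte more in the body
        sub     ecx,1
        jnz     NearLoop                ; 0F 85 7B FF FF FF   6 bytes, rel32 = -133
        ret
        end     start
```

One extra byte in the body costs four more bytes in the branch.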

.....
i don't believe ESP gets the same set of opcodes as EBP
you'd have to refer to the intel manual on that one
but - know that intel has optimized use of EBP on the stack because it is used so often
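
For comparison, a sketch of the same kind of loads through ESP, assuming MASM 6 or later with a .386 target. As far as I know, ESP also has byte-offset forms, but using ESP as a base register always needs an extra SIB byte (24h here), so each reference comes out one byte longer than its EBP counterpart.

```asm
        .386
        .model  flat, stdcall
        option  casemap:none
        .code
start:
        push    ebp
        mov     ebp,esp
        mov     eax,[ebp+4]             ; 8B 45 04                3 bytes
        mov     eax,[esp+4]             ; 8B 44 24 04             4 bytes, 24h is the SIB byte
        mov     eax,[esp]               ; 8B 04 24                3 bytes, SIB but no offset
        mov     eax,[esp+128]           ; 8B 84 24 80 00 00 00    7 bytes, dword offset
        pop     ebp
        ret
        end     start
```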

Ryan

So what's the advantage of using an add with a large (negative) number versus a sub with a small number?

dedndave

i am not sure there always is one - lol
again - you have to look at how the opcode space is used
seemingly large negative values can often be small signed offsets
        add     esp,-8      ;intel has a short signed opcode variation for values between -128 and +127

i see this variation in compiler-generated code quite often
however, the advantage is probably trivial

Ryan

I don't have the MASM stuff here at work.  Norton didn't like it, and I didn't feel like fighting it, but a simple proc with a local variable assembles with an add.  I'll make an example when I get home.

Ryan

I've attached a zip of two files.  One is the source, the other is the disassembly generated from the option in QEditor.

I declared a local variable within a proc.  MASM generates an add instruction on esp.  It results in 4 bytes.  I also added my own manipulation of esp with a sub, which is also 4 bytes.

Is there an advantage to using an add instruction instead of sub?  I'm wondering what the reasoning is why MASM would generate an add.

qWord

Quote from: Ryan on August 29, 2012, 09:09:40 AM
Is there an advantage to using an add instruction instead of sub?  I'm wondering what the reasoning is why MASM would generate an add.

Probably it has historical reasons: maybe SUB was slower than ADD on earlier architectures.
I don't think that it makes any difference on current machines.
MREAL macros - when you need floating point arithmetic while assembling!

dedndave

i think i just reasoned out the answer   :P

as i mentioned earlier, there is a byte form where the range is -128 to +127
        add     esp,-128  ;signed byte form used
        sub     esp,128   ;dword form used


i think the dword form is 2 bytes longer
the advantage applies to that one specific value, only
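
A short sketch of the four cases, assuming MASM 6 or later with a .386 target, which picks the sign-extended byte form whenever the immediate fits. By my count the dword form comes out three bytes longer for that one value (6 bytes against 3). The final add only rebalances the stack so the snippet can return.

```asm
        .386
        .model  flat, stdcall
        option  casemap:none
        .code
start:
        add     esp,-8                  ; 83 C4 F8            3 bytes
        sub     esp,8                   ; 83 EC 08            3 bytes
        add     esp,-128                ; 83 C4 80            3 bytes, -128 fits a signed byte
        sub     esp,128                 ; 81 EC 80 00 00 00   6 bytes, +128 does not
        add     esp,272                 ; give back 8+8+128+128 bytes before returning
        ret
        end     start
```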

qWord

Quote from: dedndave on August 29, 2012, 10:45:38 AM
i think the dword form is 2 bytes longer
the advantage applies to that one specific value, only

well, both instructions, SUB and ADD, allow 3-Byte encoding: OP reg32,signed imm8
EDIT: ok - i see your argument: 128  :P

Ryan

00401000 55                     push    ebp
00401001 8BEC                   mov     ebp,esp
00401003 83C4FC                 add     esp,0FFFFFFFCh
00401006 C745FC01000000         mov     dword ptr [ebp-4],1
0040100D 83EC04                 sub     esp,4
00401010 C9                     leave
00401011 C3                     ret

I said in a previous post that add and sub in this case take 4 bytes.  I must have miscounted; each takes 3 bytes.  The clock cycles appear to be the same.

Interesting that ENTER takes 14 clock cycles.  That is more than the 3 instructions that replace an ENTER take together.  Size is different though.  ENTER takes 4 bytes, while the first three instructions of the above disassembly take 6.
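
The two prologues side by side, as a sketch assuming MASM 6 or later with a .386 target. WithEnter and ByHand are made-up names, and plain labels are used so MASM adds no prologue of its own. Only the byte counts are shown; cycle counts depend on the processor and are left out.

```asm
        .386
        .model  flat, stdcall
        option  casemap:none
        .code

WithEnter:                              ; hypothetical name
        enter   4,0                     ; C8 04 00 00    4 bytes
        mov     dword ptr [ebp-4],1
        leave                           ; C9
        ret

ByHand:                                 ; hypothetical name
        push    ebp                     ; 55
        mov     ebp,esp                 ; 8B EC
        add     esp,-4                  ; 83 C4 FC       6 bytes for the three
        mov     dword ptr [ebp-4],1
        leave
        ret

start:
        call    WithEnter
        call    ByHand
        ret
        end     start
```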

dedndave

many compilers do not use ENTER, Ryan - because it is slower
probably most modern compilers don't
they do use LEAVE, however - and so does MASM


qWord - the 128 thing - hey, i think i earned a gold star for the day - lol

hutch--

The assumption that instruction "byte length" matters has been out of date since the i486. Almost without exception, later processors munch instructions at about the same rate whether they use the long or the short encoding. The short encodings exist primarily for backward compatibility with old code. It may sound good to apply DOS-era optimisations and pick the shortest encodings, and you may alter the amount of free space at the tail end of a PE section with 512-byte alignment, but I have yet to see any speed advantage from shorter instruction encodings. Too many other factors affect speed besides simple byte length.

Lower instruction counts often show up as faster and reduced memory operations usually show up as faster, much of the rest is fantasy.  :biggrin:

dedndave

i don't disagree entirely
if you can reduce the instruction count (particularly inside a loop), you can get a big advantage

however, if you have reduced the instruction count all you can...
and the loop ends with a NEAR branch back to the top...
the top of the loop wants to be aligned - and even then it is slower

if you can pinch a byte here and there - enough to make it a SHORT branch...
the top of the loop no longer cares if it is aligned - and the loop is simply faster
i have seen a number of cases where this is true

perhaps the effect isn't as pronounced on newer cores
but - it makes a difference on my P4