RBP vs RSP stack frames

Started by johnsa, March 24, 2017, 11:27:55 PM

johnsa


timer_begin 1, HIGH_PRIORITY_CLASS
mov r10,100000000
ltest:
invoke wobble
dec r10
jnz short ltest

timer_end



wobble PROC FRAME USES rsi rdi rbx r10 r11 r12 rdx
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4

mov eax,ebx
ret

wobble ENDP


Both RSP-based and RBP-based stack frames execute the test loop in 295ms, with no measurable difference.
Size-wise, the RBP version does produce smaller code:



55                   push        rbp 
48 8B EC             mov         rbp,rsp 
56                   push        rsi 
57                   push        rdi 
53                   push        rbx 
41 52                push        r10 
41 53                push        r11 
41 54                push        r12 
52                   push        rdx 
48 83 EC 38          sub         rsp,38h 
...
48 83 C4 38          add         rsp,38h 
5A                   pop         rdx 
41 5C                pop         r12 
41 5B                pop         r11 
41 5A                pop         r10 
5B                   pop         rbx 
5F                   pop         rdi 
5E                   pop         rsi 
5D                   pop         rbp 
C3                   ret 


= 34 bytes.

vs.



48 89 74 24 08       mov         qword ptr [rsp+8],rsi 
48 89 7C 24 10       mov         qword ptr [rsp+10h],rdi 
48 89 5C 24 18       mov         qword ptr [rsp+18h],rbx 
4C 89 54 24 20       mov         qword ptr [rsp+20h],r10 
41 53                push        r11 
41 54                push        r12 
52                   push        rdx 
48 83 EC 60          sub         rsp,60h 
...
48 83 C4 60          add         rsp,60h 
5A                   pop         rdx 
41 5C                pop         r12 
41 5B                pop         r11 
48 8B 74 24 08       mov         rsi,qword ptr [rsp+8] 
48 8B 7C 24 10       mov         rdi,qword ptr [rsp+10h] 
48 8B 5C 24 18       mov         rbx,qword ptr [rsp+18h] 
4C 8B 54 24 20       mov         r10,qword ptr [rsp+20h] 
C3                   ret 


= 59 bytes

When the procedure saves fewer registers in its USES clause:


wobble PROC FRAME USES rsi
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4

mov eax,ebx
ret

wobble ENDP



wobble:
55                   push        rbp 
48 8B EC             mov         rbp,rsp 
56                   push        rsi 
48 83 EC 38          sub         rsp,38h 
8B C3                mov         eax,ebx 
48 83 C4 38          add         rsp,38h 
5E                   pop         rsi 
5D                   pop         rbp 
C3                   ret 

= 18 bytes

wobble:
56                   push        rsi 
48 83 EC 30          sub         rsp,30h 
8B C3                mov         eax,ebx 
48 83 C4 30          add         rsp,30h 
5E                   pop         rsi 
C3                   ret 

= 13 bytes


And performance-wise we now have:

228ms (RSP) vs 234ms (RBP)

Almost too close to call, but it's there.. and with a 5-byte saving.


johnsa

Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a long list in USES.

That said, the RSP option does still reduce the total amount of allocated stack, which WILL improve caching.

aw27

Quote from: johnsa on March 24, 2017, 11:34:49 PM
Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a long list in USES.

Totally wrong. Please LOOK at this case:

option casemap:none
option frame:auto
OPTION STACKBASE:RSP
option win64:11

getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
   LOCAL myVar1 : qword
   LOCAL myVar2 : qword
   mov rax, 1
   add rax, val1
   mov myVar1, rax
   add rax, val2
   mov myVar2, rax
   INVOKE sub1, dest, rdx, myVar1, myVar2
   ret
getSum endp

disassembles to:
getSum:
000000013F5B181B  mov         qword ptr [rsp+8],rcx 
000000013F5B1820  mov         qword ptr [rsp+18h],r8 
000000013F5B1825  mov         qword ptr [rsp+20h],r9 
000000013F5B182A  sub         rsp,38h 
000000013F5B182E  mov         rax,1 
000000013F5B1835  add         rax,qword ptr [rsp+50h] 
000000013F5B183A  mov         qword ptr [rsp+20h],rax 
000000013F5B183F  add         rax,qword ptr [rsp+58h] 
000000013F5B1844  mov         qword ptr [rsp+28h],rax 
000000013F5B1849  mov         rcx,qword ptr [rsp+40h] 
000000013F5B184E  mov         r8,qword ptr [rsp+20h] 
000000013F5B1853  mov         r9,qword ptr [rsp+28h] 
000000013F5B1858  call        000000013F5B1800 
000000013F5B185D  add         rsp,38h 
000000013F5B1861  ret
TOTAL : 70 bytes

Now with:
option casemap:none
option frame:auto
option win64:2

getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
   LOCAL myVar1 : qword
   LOCAL myVar2 : qword
        mov dest, rcx
        mov val1, r8
        mov val2, r9
   mov rax, 1
   add rax, val1
   mov myVar1, rax
   add rax, val2
   mov myVar2, rax
   INVOKE sub1, dest, rdx, myVar1, myVar2
   ret
getSum endp

disassembles to:
getSum:
000000013F5C1814  push        rbp 
000000013F5C1815  mov         rbp,rsp 
000000013F5C1818  sub         rsp,30h 
000000013F5C181C  mov         qword ptr [rbp+10h],rcx 
000000013F5C1820  mov         qword ptr [rbp+20h],r8 
000000013F5C1824  mov         qword ptr [rbp+28h],r9 
000000013F5C1828  mov         rax,1 
000000013F5C182F  add         rax,qword ptr [rbp+20h] 
000000013F5C1833  mov         qword ptr [rbp-8],rax 
000000013F5C1837  add         rax,qword ptr [rbp+28h] 
000000013F5C183B  mov         qword ptr [rbp-10h],rax 
000000013F5C183F  mov         rcx,qword ptr [rbp+10h] 
000000013F5C1843  mov         r8,qword ptr [rbp-8] 
000000013F5C1847  mov         r9,qword ptr [rbp-10h] 
000000013F5C184B  call        000000013F5C1800 
000000013F5C1850  add         rsp,30h 
000000013F5C1854  pop         rbp 
000000013F5C1855  ret 
TOTAL: 65 bytes.

The difference is "only" 5 more bytes for the OPTION STACKBASE:RSP alternative.
With RSP as the base, moves between memory and registers use longer and slower instructions.
There is another problem: you cannot dynamically allocate memory on the stack while OPTION STACKBASE:RSP is in use.
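
A minimal sketch of the underlying problem, in plain ml64-style syntax with no prologue macros (rspDrift is a hypothetical name, and a constant stands in for the run-time size to keep it short). The assembler fixes every RSP-relative displacement at assembly time, so once RSP moves inside the body those displacements point somewhere else:

```asm
.code
rspDrift PROC
    sub  rsp, 38h                    ; static frame, same size as in the listing above
    mov  qword ptr [rsp+20h], 1      ; stands in for myVar1

    sub  rsp, 40h                    ; "dynamic" allocation (constant only for the sketch)
    mov  rax, qword ptr [rsp+20h]    ; still encoded as +20h: reads the new block, not myVar1
    mov  rax, qword ptr [rsp+60h]    ; myVar1 now lives at 20h + 40h
    add  rsp, 40h                    ; release the block

    add  rsp, 38h
    ret
rspDrift ENDP
END
```

With a size that is only known at run time there is no constant to add, which is why an RSP-based frame needs either a fixed RSP or a second register to anchor the locals.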

johnsa

MOV m,r
latency : 2
rcp. throughput : 1

PUSH r
latency : 3
rcp. throughput : 1

So even if it used pushes, which are shorter, they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical throughput-wise.


johnsa

That said..

Yes, for functions that allocate stack dynamically you'd probably want to use the RBP version. But local dynamic stack allocations are considered bad practice in a lot of cases anyway: just allocate from the heap instead, or, if you have a known size or a limited set of sizes, create a LOCAL large enough to act as a buffer.
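
A minimal sketch of the fixed-buffer alternative, assuming ml64-style syntax and the assembler's default prologue for LOCAL. MAX_BYTES, useScratch and the register usage are hypothetical, and the generated prologue depends on the assembler and the options in effect.

```asm
MAX_BYTES EQU 256                    ; hypothetical upper bound on the request size

.code
; rcx = bytes needed; returns eax = 1 if the fixed buffer could be used, 0 otherwise
useScratch PROC
    LOCAL scratch[MAX_BYTES]:BYTE    ; worst-case buffer, reserved once by the prologue
    xor  eax, eax
    cmp  rcx, MAX_BYTES
    ja   tooBig                      ; larger requests must fall back to the heap
    lea  rdx, scratch                ; rdx -> start of the buffer
    mov  byte ptr [rdx], 0           ; placeholder for real work on rcx bytes
    mov  eax, 1
tooBig:
    ret
useScratch ENDP
END
```

The trade-off is that every call pays for the worst case in stack space, whatever size it actually needs.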

johnsa

RBP-encoded addresses are 1 byte shorter, which is stupid.. x86 encoding..
So I guess if you're using a lot of local/parameter references, that could add up after a while.
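
A minimal sketch of where the extra byte comes from, assuming ml64-style syntax. encDemo is a hypothetical name, and the proc is only meant to be assembled and inspected in a listing, not called. In the ModRM byte, the r/m value that would select RSP as a base is reserved to mean "a SIB byte follows", so every RSP-relative operand carries one more byte than the RBP-relative form:

```asm
.code
encDemo PROC
    mov  rax, qword ptr [rbp-8]      ; 48 8B 45 F8      REX.W, opcode, ModRM, disp8       = 4 bytes
    mov  rax, qword ptr [rsp+20h]    ; 48 8B 44 24 20   REX.W, opcode, ModRM, SIB, disp8  = 5 bytes
    ret
encDemo ENDP
END
```

The same 4-versus-5 pattern shows up in the address steps of the two getSum listings above.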

aw27

Quote from: johnsa on March 25, 2017, 01:21:39 AM
MOV m,r
latency : 2
rcp. throughput : 1

PUSH r
latency : 3
rcp. throughput : 1

So even if it used pushes, which are shorter, they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical throughput-wise.

I don't think your figures mean anything without knowing the processor, and in particular they mean nothing at all for modern processors. Agner Fog has a lot of literature about that.

aw27

Quote from: johnsa on March 25, 2017, 01:27:35 AM
That said..

Yes, for functions that allocate stack dynamically you'd probably want to use the RBP version. But local dynamic stack allocations are considered bad practice in a lot of cases anyway: just allocate from the heap instead, or, if you have a known size or a limited set of sizes, create a LOCAL large enough to act as a buffer.

Come on, it is not bad practice at all. Where are you hearing such things?
Should I allocate a dynamic array from the heap? Why, when I can get rid of it just by leaving the function? (This is a perfect, hassle-free garbage collection mechanism.)

aw27

Quote from: johnsa on March 25, 2017, 01:48:57 AM
RBP-encoded addresses are 1 byte shorter, which is stupid.. x86 encoding..
So I guess if you're using a lot of local/parameter references, that could add up after a while.

Any function of real size always has a lot of local storage.

johnsa

I got those timings from Agner, applicable to most modern processors re: push vs mov.

But yeah this is healthy debate.. I'm seeing some merit in RBP and it's giving me ideas how we can achieve both and simplify all these options out.

So here is what I propose:

1) We make the FRAME attribute on the PROC redundant (specify it, don't specify it.. it doesn't matter).
    The PROC will be set up the same way in either case, which avoids a lot of problems where FRAME has been left off and the stack then isn't aligned properly.

2) We keep STACKBASE:RSP and STACKBASE:RBP as is.. we've removed STACKBASE:ESP (that was a silly idea anyway).

3) Based on 2, we make option win64 completely redundant: the "smart" logic we've created around RSP/WIN64:11 gets applied to RBP too, so that really all you need to choose is.. am I using RSP or RBP?

That decision could be wrapped around specific PROCs where required, along with OPTION PROLOGUE:NONE for when you absolutely require a raw procedure. I'm not sure that would really be useful, though, given that the smart logic produces the same output when the code inside the proc is right.

johnsa


Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc, which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca() .. but it's awful..
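
A minimal sketch of the static-RSP scheme described here, in plain ml64-style syntax with no prologue macros. callee5 and caller are hypothetical names, and the 28h frame is just what this example needs: 32 bytes of shadow space plus 8 for a fifth argument, which also brings RSP back to 16-byte alignment.

```asm
.code
; Hypothetical callee: returns its fifth argument.
callee5 PROC
    mov  rax, qword ptr [rsp+28h]    ; return address (8) + shadow space (20h), then arg 5
    ret
callee5 ENDP

caller PROC
    sub  rsp, 28h                    ; once: 20h shadow space + 8 for a fifth argument

    mov  ecx, 1
    mov  edx, 2
    mov  r8d, 3
    mov  r9d, 4
    mov  qword ptr [rsp+20h], 5      ; arg 5 goes just above the shadow space
    call callee5

    mov  ecx, 1                      ; volatile registers are reloaded for the next call
    mov  edx, 2
    mov  r8d, 3
    mov  r9d, 4
    mov  qword ptr [rsp+20h], 6      ; second call reuses the same slots, RSP never moves
    call callee5

    add  rsp, 28h
    ret
caller ENDP
END
```

Because every call writes its arguments at fixed offsets from RSP, anything else placed at the bottom of the frame is at risk, which is the crux of the dynamic-allocation question.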

coder

PUSH, POP, ENTER, LEAVE and even RET are high-level / complex instructions, with hidden / extended microcode. Less microcode means less power consumption. In 64-bit and mobile computing, optimizing for power consumption is just as important as optimizing for speed and size. The FASTCALL convention was designed in an effort to reduce the use of these high-level instructions. The costliest among them is POP.

For every PUSH, these are the extra steps taken at the machine level:

if (StackAddressSize == 32) { // a 32-bit PUSH, as an example
    if (OperandSize == 32) {
        ESP = ESP - 4;
        SS:ESP = Source; // push doubleword
    }
    else { // OperandSize == 16
        ESP = ESP - 2;
        SS:ESP = Source; // push word
    }
}
else { // StackAddressSize == 16
    if (OperandSize == 16) {
        SP = SP - 2;
        SS:SP = Source; // push word
    }
    else { // OperandSize == 32
        SP = SP - 4;
        SS:SP = Source; // push doubleword
    }
}


In plain RISC-style instructions, the 64-bit equivalent can be done simply by:

sub rsp,8
mov [rsp],something


Hence we get some peculiar-looking FASTCALL / register-based conventions like the above.
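
For reference, a sketch of that PUSH equivalence with its encodings, assuming ml64-style syntax (pushDemo is a hypothetical name). Two differences are worth keeping in mind: the two-instruction form takes 8 bytes against 1, and SUB modifies the flags while PUSH leaves them alone.

```asm
.code
pushDemo PROC
    push rax                         ; 50             1 byte
    pop  rax                         ; 58             undo, keeps the stack balanced

    sub  rsp, 8                      ; 48 83 EC 08    4 bytes, changes the flags
    mov  qword ptr [rsp], rax        ; 48 89 04 24    4 bytes
    add  rsp, 8                      ; undo
    ret
pushDemo ENDP
END
```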

Just my 2 cents




aw27

Quote from: johnsa on March 25, 2017, 02:09:02 AM

Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc, which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca() .. but it's awful..

It is not widespread because people have been taught not to mess much with the stack pointer.
Anyway, allocating N bytes of memory is just sub rsp, N. Releasing it is add rsp, N.
Actually, I have an example (of sorts) here: https://www.codeproject.com/Articles/1123638/MASM-Stack-Memory-Alignment



johnsa

I used it for my OO macros once in x86, to store local objects etc., but I'm trying to think how this will work in x64.

So let's assume RBP is set up so we can refer to locals and arguments, that's fine. Now we'd have the SUB RSP,N to reserve space for all the calls, so RSP is at the bottom of that block of memory.
The invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters.
So if you SUB rsp,Y somewhere in the proc, invokes would overwrite your dynamic stack allocation?
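
One way to avoid that clash, sketched here under assumptions rather than taken from any poster's code: keep the outgoing-argument area at the very bottom of the frame and hand out the dynamic block just above it. dynBlock is a hypothetical name, rcx is assumed to hold a small byte count (large requests would need a stack probe on Windows), and the sketch carries no unwind information.

```asm
.code
; rcx = bytes wanted (hypothetical argument, assumed well under one page)
dynBlock PROC
    push rbp
    mov  rbp, rsp                    ; locals and arguments stay addressable through RBP
    sub  rsp, 20h                    ; static part: shadow space for callees

    add  rcx, 15
    and  rcx, -16                    ; round up so RSP stays 16-byte aligned
    sub  rsp, rcx                    ; dynamic part: the whole frame grows downward
    lea  rax, [rsp+20h]              ; block = [rax, rax+rcx), above the new shadow space

    ; A call made here with up to four arguments writes only to
    ; [rsp] .. [rsp+1Fh], so it cannot touch the block RAX points to.

    mov  rsp, rbp                    ; releases the static and dynamic parts together
    pop  rbp
    ret
dynBlock ENDP
END
```

With more than four arguments per call, the reserved area and the +20h offset grow accordingly.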

coder

sweet Mary mother of Jesus!  :icon_eek: