The MASM Forum

64 bit assembler => UASM Assembler Development => Topic started by: johnsa on March 24, 2017, 11:27:55 PM

Title: RBP vs RSP stack frames
Post by: johnsa on March 24, 2017, 11:27:55 PM

timer_begin 1, HIGH_PRIORITY_CLASS
mov r10,100000000
ltest:
invoke wobble
dec r10
jnz short ltest

timer_end



wobble PROC FRAME USES rsi rdi rbx r10 r11 r12 rdx
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4

mov eax,ebx
ret

wobble ENDP


Both the RSP-based and RBP-based stack frames execute the test loop in 295ms, with no difference.
Size-wise, the RBP version does produce smaller code:



55                   push        rbp 
48 8B EC             mov         rbp,rsp 
56                   push        rsi 
57                   push        rdi 
53                   push        rbx 
41 52                push        r10 
41 53                push        r11 
41 54                push        r12 
52                   push        rdx 
48 83 EC 38          sub         rsp,38h 
...
48 83 C4 38          add         rsp,38h 
5A                   pop         rdx 
41 5C                pop         r12 
41 5B                pop         r11 
41 5A                pop         r10 
5B                   pop         rbx 
5F                   pop         rdi 
5E                   pop         rsi 
5D                   pop         rbp 
C3                   ret 


= 34 bytes.

vs.



48 89 74 24 08       mov         qword ptr [rsp+8],rsi 
48 89 7C 24 10       mov         qword ptr [rsp+10h],rdi 
48 89 5C 24 18       mov         qword ptr [rsp+18h],rbx 
4C 89 54 24 20       mov         qword ptr [rsp+20h],r10 
41 53                push        r11 
41 54                push        r12 
52                   push        rdx 
48 83 EC 60          sub         rsp,60h 
...
48 83 C4 60          add         rsp,60h 
5A                   pop         rdx 
41 5C                pop         r12 
41 5B                pop         r11 
48 8B 74 24 08       mov         rsi,qword ptr [rsp+8] 
48 8B 7C 24 10       mov         rdi,qword ptr [rsp+10h] 
48 8B 5C 24 18       mov         rbx,qword ptr [rsp+18h] 
4C 8B 54 24 20       mov         r10,qword ptr [rsp+20h] 
C3                   ret 


= 59 bytes

When the procedure has a shorter USES list:


wobble PROC FRAME USES rsi
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4

mov eax,ebx
ret

wobble ENDP



wobble:
55                   push        rbp 
48 8B EC             mov         rbp,rsp 
56                   push        rsi 
48 83 EC 38          sub         rsp,38h 
8B C3                mov         eax,ebx 
48 83 C4 38          add         rsp,38h 
5E                   pop         rsi 
5D                   pop         rbp 
C3                   ret 

= 18 bytes

wobble:
56                   push        rsi 
48 83 EC 30          sub         rsp,30h 
8B C3                mov         eax,ebx 
48 83 C4 30          add         rsp,30h 
5E                   pop         rsi 
C3                   ret 

= 13 bytes


and we now have performance wise:

228ms (RSP) vs 234ms (RBP)

almost too close to call, but it's there.. and with a 5 byte saving.

Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 24, 2017, 11:34:49 PM
Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a large list in USES.

That said however, the RSP option does still reduce the total amount of allocated stack, which WILL improve caching.
Title: Re: RBP vs RSP stack frames
Post by: aw27 on March 25, 2017, 12:37:55 AM
Quote from: johnsa on March 24, 2017, 11:34:49 PM
Basically , the only thing that would make an RSP based prologue bigger than the RBP is a large list in USES.

Totally wrong. Please LOOK at this case:

option casemap:none
option frame:auto
OPTION STACKBASE:RSP
option win64:11

getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
   LOCAL myVar1 : qword
   LOCAL myVar2 : qword
   mov rax, 1
   add rax, val1
   mov myVar1, rax
   add rax, val2
   mov myVar2, rax
   INVOKE sub1, dest, rdx, myVar1, myVar2
   ret
getSum endp

decompiles to:
getSum:
000000013F5B181B  mov         qword ptr [rsp+8],rcx 
000000013F5B1820  mov         qword ptr [rsp+18h],r8 
000000013F5B1825  mov         qword ptr [rsp+20h],r9 
000000013F5B182A  sub         rsp,38h 
000000013F5B182E  mov         rax,1 
000000013F5B1835  add         rax,qword ptr [rsp+50h] 
000000013F5B183A  mov         qword ptr [rsp+20h],rax 
000000013F5B183F  add         rax,qword ptr [rsp+58h] 
000000013F5B1844  mov         qword ptr [rsp+28h],rax 
000000013F5B1849  mov         rcx,qword ptr [rsp+40h] 
000000013F5B184E  mov         r8,qword ptr [rsp+20h] 
000000013F5B1853  mov         r9,qword ptr [rsp+28h] 
000000013F5B1858  call        000000013F5B1800 
000000013F5B185D  add         rsp,38h 
000000013F5B1861  ret
TOTAL : 70 bytes

Now with:
option casemap:none
option frame:auto
option win64:2

getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
   LOCAL myVar1 : qword
   LOCAL myVar2 : qword
        mov dest, rcx
        mov val1, r8
        mov val2, r9
   mov rax, 1
   add rax, val1
   mov myVar1, rax
   add rax, val2
   mov myVar2, rax
   INVOKE sub1, dest, rdx, myVar1, myVar2
   ret
getSum endp

decompiles to:
getSum:
000000013F5C1814  push        rbp 
000000013F5C1815  mov         rbp,rsp 
000000013F5C1818  sub         rsp,30h 
000000013F5C181C  mov         qword ptr [rbp+10h],rcx 
000000013F5C1820  mov         qword ptr [rbp+20h],r8 
000000013F5C1824  mov         qword ptr [rbp+28h],r9 
000000013F5C1828  mov         rax,1 
000000013F5C182F  add         rax,qword ptr [rbp+20h] 
000000013F5C1833  mov         qword ptr [rbp-8],rax 
000000013F5C1837  add         rax,qword ptr [rbp+28h] 
000000013F5C183B  mov         qword ptr [rbp-10h],rax 
000000013F5C183F  mov         rcx,qword ptr [rbp+10h] 
000000013F5C1843  mov         r8,qword ptr [rbp-8] 
000000013F5C1847  mov         r9,qword ptr [rbp-10h] 
000000013F5C184B  call        000000013F5C1800 
000000013F5C1850  add         rsp,30h 
000000013F5C1854  pop         rbp 
000000013F5C1855  ret 
TOTAL: 65 bytes.

The difference is "only" 5 more bytes for the OPTION STACKBASE:RSP alternative.
Moves from memory to register and vice versa use longer and slower instructions.
There is another problem: you cannot dynamically allocate memory on the stack with OPTION STACKBASE:RSP in use.
Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 01:21:39 AM
MOV m,r
latency : 2
rcp. throughput : 1

PUSH r
latency : 3
rcp. throughput : 1

So even if it used pushes (which are shorter), they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical (throughput-wise).

Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 01:27:35 AM
That said..

Yes, for functions that allocate stack dynamically you'd probably want to use the RBP version, but local dynamic stack allocations are considered bad practice in a lot of cases anyway. Just allocate from the heap instead, or, if you have a known size or a limited set of sizes, create a LOCAL big enough to act as a buffer.
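A minimal sketch of the fixed-size LOCAL idea, assuming UASM/JWasm-style `option frame:auto` as used elsewhere in this thread. The names `MAX_BYTES` and `useBuffer` and the 256-byte bound are hypothetical, chosen only for illustration:

```asm
option frame:auto                    ; let the assembler generate prologue/epilogue

MAX_BYTES EQU 256                    ; assumed worst-case size (hypothetical)

.code
useBuffer PROC FRAME
    LOCAL buffer[MAX_BYTES]:BYTE     ; reserved once by the generated prologue
    lea   rax, buffer                ; rax -> start of the scratch buffer
    mov   byte ptr [rax], 0          ; use it like any other memory
    xor   eax, eax                   ; return 0; the buffer dies with the frame
    ret
useBuffer ENDP
```

The trade-off is that every call pays for the worst-case size, whether it needs it or not.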
Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 01:48:57 AM
RBP-relative addresses encode 1 byte shorter than RSP-relative ones (an RSP base always needs a SIB byte), which is stupid .. x86 .. encoding..
So I guess if you're using a lot of local/parameter references that could add up after a while.
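To make the one-byte difference concrete, here is the same store written against both base registers. The bytes in the comments are the standard x86-64 encodings for these two instructions; the operands are picked only for illustration:

```asm
mov qword ptr [rbp+10h], rcx    ; 48 89 4D 10     4 bytes: REX.W, opcode, ModRM, disp8
mov qword ptr [rsp+10h], rcx    ; 48 89 4C 24 10  5 bytes: REX.W, opcode, ModRM, SIB (24h), disp8
```

The same pattern is visible in the listings above, where each `[rsp+...]` access is one byte longer than its `[rbp+...]` counterpart.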
Title: Re: RBP vs RSP stack frames
Post by: aw27 on March 25, 2017, 01:50:38 AM
Quote from: johnsa on March 25, 2017, 01:21:39 AM
MOV m,r
latency : 2
rcp. throughput : 1

PUSH r
latency : 3
rcp. throughput : 1

So even if it used pushes (which are shorter.. they have worse latency).
But in your example both RSP mode and RBP mode store the parameters via MOV which is identical (throughput wise ).
I don't think your calculations have any meaning without knowing the processor, and in particular they have no meaning at all for modern processors. Agner Fog has a lot of literature about that.
Title: Re: RBP vs RSP stack frames
Post by: aw27 on March 25, 2017, 01:55:17 AM
Quote from: johnsa on March 25, 2017, 01:27:35 AM
That said..

Yes, for dynamic stack allocating functions you'd probably want to use RBP version, but local dynamic stack allocations are considered bad practice anyway in a lot of cases, just allocate from the heap instead, or if you have a known size or limited set of sizes you could create a LOCAL to account for that to act as a buffer.
Come on, it is not bad practice at all. Where are you hearing such things?
Shall I allocate a dynamic array from the heap? Why, if I can just get rid of it by leaving the function (this is a perfect hassle free garbage collection mechanism).
Title: Re: RBP vs RSP stack frames
Post by: aw27 on March 25, 2017, 02:00:23 AM
Quote from: johnsa on March 25, 2017, 01:48:57 AM
RBP encoded addresses are 1 byte shorter, which is stupid .. x86.. encoding..
So i guess if you're using a lot of local/parameter references that could add up after a while.

Any sizeable function always has a lot of local storage.
Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 02:05:26 AM
I got those timings from Agner, applicable to most modern processors re: push vs mov.

But yeah this is healthy debate.. I'm seeing some merit in RBP and it's giving me ideas how we can achieve both and simplify all these options out.

so here is what I propose:

1) We make the FRAME attribute on the PROC redundant (specify it or don't, it doesn't matter).
    The PROC will be set up the same way in either case, which avoids a lot of problems where FRAME has been left off and the stack then isn't aligned properly.

2) We keep STACKBASE:RSP and STACKBASE:RBP as is. We've removed STACKBASE:ESP (that was a silly idea anyway).

3) Based on 2, we make option win64 completely redundant: the "smart" logic we've created around RSP/WIN64:11 is applied to RBP too, so that all you really need to choose is whether you're using RSP or RBP.

That decision could be wrapped around specific PROCs where required, along with OPTION PROLOGUE:NONE for when you absolutely require a raw procedure (though I'm not sure that would really be useful, given that the smart logic produces the same output when the code inside the proc is right).
Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 02:09:02 AM

Do you have an ASM-based example of how you'd handle dynamically allocating on the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc, which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca() .. but it's awful..
Title: Re: RBP vs RSP stack frames
Post by: coder on March 25, 2017, 02:22:34 AM
PUSH, POP, ENTER, LEAVE and even RET are high-level / complex instructions, with hidden / extended microcode. Less microcode means less power consumption. In 64-bit and mobile computing, optimizing for power consumption is just as important as optimizing for speed and size. The FASTCALL convention was designed in an effort to reduce the use of these high-level instructions. The costliest among them is POP.

For every PUSH, these are the extra steps taken at the machine level

if(StackAddressSize == 32) { //this is a 32-bit PUSH, for an example
    if(OperandSize == 32) {
        ESP = ESP - 4;
        SS:ESP = Source; //push doubleword
    }
    else { //OperandSize == 16
        ESP = ESP - 2;
        SS:ESP = Source; //push word
    }
}
else { //StackAddressSize == 16
    if(OperandSize == 16) {
        SP = SP - 2;
        SS:SP = Source; //push word
    }
    else { //OperandSize == 32
        SP = SP - 4;
        SS:SP = Source; //push doubleword
    }
}


Which in plain RISC-style instructions, can be done simply by

sub rsp,8
mov [rsp],something


Hence we get some peculiar-looking FASTCALL / register-based conventions like the above.

Just my 2 cents



Title: Re: RBP vs RSP stack frames
Post by: aw27 on March 25, 2017, 02:38:35 AM
Quote from: johnsa on March 25, 2017, 02:09:02 AM

Do you have an ASM based example of how you'd handle dynamically allocating the stack ?
I'm just trying to get my head around how that would work with a static rsp (IE: it's decremented once to reserve space for all calls inside the proc.. which is what we do with win64:11 and what C compiler does too..
I had a quick look at the disasm from alloca () .. but it's awful..

It is not widespread because people have been taught not to mess much with the stack pointer.
Anyway, allocating N bytes of memory is just sub rsp, N. Releasing it is add rsp, N.
Actually, I have an example (or sort of) here: https://www.codeproject.com/Articles/1123638/MASM-Stack-Memory-Alignment


Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 02:51:57 AM
I used it for my OO macros once in x86, to store local objects etc.. but I'm trying to think how this will work in x64

So let's assume RBP is set up so we can refer to locals and arguments, that's fine.. now we'd have the SUB RSP,N to reserve space for all the calls, so RSP is at the bottom of that block of memory.
the invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters..
So if you SUB rsp,Y somewhere in the proc .. invokes would overwrite your dynamic stack allocation ?
Title: Re: RBP vs RSP stack frames
Post by: coder on March 25, 2017, 02:54:43 AM
sweet Mary mother of Jesus!  :icon_eek:
Title: Re: RBP vs RSP stack frames
Post by: aw27 on March 25, 2017, 03:11:24 AM
Quote from: johnsa on March 25, 2017, 02:51:57 AM
the invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters..
So if you SUB rsp,Y somewhere in the proc .. invokes would overwrite your dynamic stack allocation ?
No, you are simply "rebasing" the stack pointer. If the function is not a leaf, before calling a subroutine you will have to:
1) subtract the usual 32 bytes
2) align the stack.
On return:
add the usual 32 bytes + bytes used for stack alignment if any.
After that you will be as you were before the call :)
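
The steps above can be sketched roughly as follows, with a hand-written RBP prologue and assuming the assembler adds no prologue of its own. `WithStackBuffer` and `UseBuffer` are hypothetical names; the sketch also ignores stack probing (needed once an allocation can exceed a page) and unwind data:

```asm
EXTERN UseBuffer:PROC        ; placeholder: any Win64 callee taking a pointer in rcx

.code
WithStackBuffer PROC         ; rcx = number of bytes wanted on the stack
    push rbp
    mov  rbp, rsp            ; rbp anchors the frame, so rsp is free to move
    lea  rax, [rcx+15]
    and  rax, -16            ; round the request up to a multiple of 16
    sub  rsp, rax            ; the dynamic allocation; rsp now points at the buffer
    mov  rcx, rsp            ; first argument = buffer address
    sub  rsp, 32             ; fresh shadow space below the buffer for the callee
    call UseBuffer
    add  rsp, 32             ; drop the shadow space again
    mov  rsp, rbp            ; release the buffer, whatever its size was
    pop  rbp
    ret
WithStackBuffer ENDP
END
```

Because the request is rounded up to a multiple of 16 and `push rbp` has already realigned the stack, no separate alignment step is needed before the call in this particular sketch.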
Title: Re: RBP vs RSP stack frames
Post by: hutch-- on March 25, 2017, 03:15:26 AM
This is what I get with 64 bit MASM using a custom prologue/epilogue. The entry/exit code is small, and for high level code it's easily fast enough. For low level code you don't use a stack frame.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL a1    :QWORD
    LOCAL a2    :QWORD
    LOCAL a3    :QWORD
    LOCAL a4    :QWORD

    mov a1, 1
    mov a2, 2
    mov a3, 3
    mov a4, 4

    xor rcx, rcx
    call ExitProcess

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

comment * +++++++++++++++++++++++++++

segment .text
enter 0x80, 0x0
sub rsp, 0x80
mov qword ptr [rbp-0x68], 0x1
mov qword ptr [rbp-0x70], 2
mov qword ptr [rbp-0x78], 3
mov qword ptr [rbp-0x80], 4
xor rcx, rcx
call qword ptr [ExitProcess]
leave
ret
* +++++++++++++++++++++++++++++++++++


The disassembly in detail.

.text:0000000140001000 C8800000                   enter 0x80, 0x0
.text:0000000140001004 4881EC80000000             sub rsp, 0x80
.text:000000014000100b 48C7459801000000           mov qword ptr [rbp-0x68], 0x1
.text:0000000140001013 48C7459002000000           mov qword ptr [rbp-0x70], 2
.text:000000014000101b 48C7458803000000           mov qword ptr [rbp-0x78], 3
.text:0000000140001023 48C7458004000000           mov qword ptr [rbp-0x80], 4
.text:000000014000102b 4833C9                     xor rcx, rcx
.text:000000014000102e FF1560100000               call qword ptr [ExitProcess]
.text:0000000140001034 C9                         leave
.text:0000000140001035 C3                         ret
Title: Re: RBP vs RSP stack frames
Post by: jj2007 on March 25, 2017, 03:21:02 AM
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM based example of how you'd handle dynamically allocating the stack ?

StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255)
Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 03:26:36 AM
Interesting, as I've always avoided ENTER/LEAVE because I had assumed they weren't that quick..
http://stackoverflow.com/questions/5959890/enter-vs-push-ebp-mov-ebp-esp-sub-esp-imm-and-leave-vs-mov-esp-ebp
Title: Re: RBP vs RSP stack frames
Post by: johnsa on March 25, 2017, 03:28:42 AM
Quote from: jj2007 on March 25, 2017, 03:21:02 AM
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM based example of how you'd handle dynamically allocating the stack ?

StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255)

That looks interesting! and I guess thats x64 as well as x86 ?
Title: Re: RBP vs RSP stack frames
Post by: hutch-- on March 25, 2017, 03:50:51 AM
There has never been a problem with LEAVE, and it was the normal cleanup in 32 bit MASM, where ENTER was known to be slow. With 64 bit instructions generally being larger than the 32 bit versions, using ENTER does not seem to be a problem, as any high level code is some powers slower than direct mnemonic code. With pure mnemonic code you would go for not using a stack frame, as your total call overhead is a simple CALL/RET.
Title: Re: RBP vs RSP stack frames
Post by: jj2007 on March 25, 2017, 04:16:41 AM
Quote from: johnsa on March 25, 2017, 03:28:42 AM
That looks interesting! and I guess thats x64 as well as x86 ?

No, StackBuffer() is 32-bit only so far.

Re enter+leave:
Quote from: jj2007 on July 27, 2016, 12:56:50 AM
Quote from: TWell on July 26, 2016, 12:00:30 AM
EDIT: How about testing in x64 enter/leave and rsp sub/add ?

Saw your edit only now, sorry. If I remember well, we tested that for 32-bit code in the Lab; enter was slow, leave was fast.

P.S.: Made a few tests, and for a naked procedure, enter is about 15% slower than push rbp + mov rbp, rsp

Which means a cycle or so. As Hutch wrote above, if it's really speed critical, you would use only registers + CALL + RET.

And if you want it really fast, i.e. the extra cycle for enter slows your algo down, then your design is wrong. Short procedures in speed critical loops are nonsense, drop the call and the ret and use a macro, or "inline" it by hand.
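
For reference, a sketch of what the two instructions stand for, using the 80h frame size from the disassembly above. The bytes in the comments are standard encodings, and `mov rsp, rbp` has more than one valid encoding:

```asm
; "enter 80h, 0" (C8 80 00 00, 4 bytes) does the work of:
    push rbp                 ; 55
    mov  rbp, rsp            ; 48 8B EC
    sub  rsp, 80h            ; 48 81 EC 80 00 00 00

; "leave" (C9, 1 byte) does the work of:
    mov  rsp, rbp            ; 48 8B E5
    pop  rbp                 ; 5D
```

So ENTER wins on size (4 bytes against 11 here) while the explicit sequence is the one that tested faster, and LEAVE is both shorter and, per the tests quoted above, fast.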