timer_begin 1, HIGH_PRIORITY_CLASS
mov r10,100000000
ltest:
invoke wobble
dec r10
jnz short ltest
timer_end
wobble PROC FRAME USES rsi rdi rbx r10 r11 r12 rdx
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4
mov eax,ebx
ret
wobble ENDP
Both RSP- and RBP-based stack frames execute the test loop in 295ms, with no difference.
Size-wise, the RBP version does produce smaller code:
55 push rbp
48 8B EC mov rbp,rsp
56 push rsi
57 push rdi
53 push rbx
41 52 push r10
41 53 push r11
41 54 push r12
52 push rdx
48 83 EC 38 sub rsp,38h
...
48 83 C4 38 add rsp,38h
5A pop rdx
41 5C pop r12
41 5B pop r11
41 5A pop r10
5B pop rbx
5F pop rdi
5E pop rsi
5D pop rbp
C3 ret
= 34 bytes.
vs.
48 89 74 24 08 mov qword ptr [rsp+8],rsi
48 89 7C 24 10 mov qword ptr [rsp+10h],rdi
48 89 5C 24 18 mov qword ptr [rsp+18h],rbx
4C 89 54 24 20 mov qword ptr [rsp+20h],r10
41 53 push r11
41 54 push r12
52 push rdx
48 83 EC 60 sub rsp,60h
...
48 83 C4 60 add rsp,60h
5A pop rdx
41 5C pop r12
41 5B pop r11
48 8B 74 24 08 mov rsi,qword ptr [rsp+8]
48 8B 7C 24 10 mov rdi,qword ptr [rsp+10h]
48 8B 5C 24 18 mov rbx,qword ptr [rsp+18h]
4C 8B 54 24 20 mov r10,qword ptr [rsp+20h]
C3 ret
= 59 bytes
When the procedure makes less use of the USES clause:
wobble PROC FRAME USES rsi
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4
mov eax,ebx
ret
wobble ENDP
wobble:
55 push rbp
48 8B EC mov rbp,rsp
56 push rsi
48 83 EC 38 sub rsp,38h
8B C3 mov eax,ebx
48 83 C4 38 add rsp,38h
5E pop rsi
5D pop rbp
C3 ret
= 18 bytes
wobble:
56 push rsi
48 83 EC 30 sub rsp,30h
8B C3 mov eax,ebx
48 83 C4 30 add rsp,30h
5E pop rsi
C3 ret
= 13 bytes
and performance-wise we now have:
228ms (RSP) vs 234ms (RBP)
almost too close to call, but it's there.. and with a 5 byte saving.
Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a large list in USES.
That said, however, the RSP option does still reduce the total amount of allocated stack, which WILL improve caching.
Quote from: johnsa on March 24, 2017, 11:34:49 PM
Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a large list in USES.
Totally wrong. Please LOOK at this case:
option casemap:none
option frame:auto
OPTION STACKBASE:RSP
option win64:11
getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
LOCAL myVar1 : qword
LOCAL myVar2 : qword
mov rax, 1
add rax, val1
mov myVar1, rax
add rax, val2
mov myVar2, rax
INVOKE sub1, dest, rdx, myVar1, myVar2
ret
getSum endp
decompiles to:
getSum:
000000013F5B181B mov qword ptr [rsp+8],rcx
000000013F5B1820 mov qword ptr [rsp+18h],r8
000000013F5B1825 mov qword ptr [rsp+20h],r9
000000013F5B182A sub rsp,38h
000000013F5B182E mov rax,1
000000013F5B1835 add rax,qword ptr [rsp+50h]
000000013F5B183A mov qword ptr [rsp+20h],rax
000000013F5B183F add rax,qword ptr [rsp+58h]
000000013F5B1844 mov qword ptr [rsp+28h],rax
000000013F5B1849 mov rcx,qword ptr [rsp+40h]
000000013F5B184E mov r8,qword ptr [rsp+20h]
000000013F5B1853 mov r9,qword ptr [rsp+28h]
000000013F5B1858 call 000000013F5B1800
000000013F5B185D add rsp,38h
000000013F5B1861 ret
TOTAL : 70 bytes
Now with:
option casemap:none
option frame:auto
option win64:2
getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
LOCAL myVar1 : qword
LOCAL myVar2 : qword
mov dest, rcx
mov val1, r8
mov val2, r9
mov rax, 1
add rax, val1
mov myVar1, rax
add rax, val2
mov myVar2, rax
INVOKE sub1, dest, rdx, myVar1, myVar2
ret
getSum endp
decompiles to:
getSum:
000000013F5C1814 push rbp
000000013F5C1815 mov rbp,rsp
000000013F5C1818 sub rsp,30h
000000013F5C181C mov qword ptr [rbp+10h],rcx
000000013F5C1820 mov qword ptr [rbp+20h],r8
000000013F5C1824 mov qword ptr [rbp+28h],r9
000000013F5C1828 mov rax,1
000000013F5C182F add rax,qword ptr [rbp+20h]
000000013F5C1833 mov qword ptr [rbp-8],rax
000000013F5C1837 add rax,qword ptr [rbp+28h]
000000013F5C183B mov qword ptr [rbp-10h],rax
000000013F5C183F mov rcx,qword ptr [rbp+10h]
000000013F5C1843 mov r8,qword ptr [rbp-8]
000000013F5C1847 mov r9,qword ptr [rbp-10h]
000000013F5C184B call 000000013F5C1800
000000013F5C1850 add rsp,30h
000000013F5C1854 pop rbp
000000013F5C1855 ret
TOTAL: 65 bytes.
The difference is "only" 5 more bytes for the OPTION STACKBASE:RSP alternative.
Moves from memory to register and vice versa use longer and slower instructions.
There is another problem: you cannot dynamically allocate memory on the stack with OPTION STACKBASE:RSP in use.
MOV m,r
latency : 2
rcp. throughput : 1
PUSH r
latency : 3
rcp. throughput : 1
So even if it used pushes (which are shorter), they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical (throughput-wise).
That said..
Yes, for dynamically stack-allocating functions you'd probably want the RBP version, but dynamic local stack allocations are considered bad practice in a lot of cases anyway; just allocate from the heap instead, or if you have a known size or a limited set of sizes, you could create a LOCAL sized to account for that and use it as a buffer.
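The heap alternative suggested here could be sketched in C (a hypothetical illustration, not code from this thread; the function name and sizes are made up):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: instead of a dynamic stack allocation
   (sub rsp, N), take the scratch buffer from the heap and
   release it explicitly before leaving the function. */
size_t sum_bytes(const unsigned char *src, size_t n)
{
    unsigned char *buf = malloc(n);   /* heap instead of stack */
    if (buf == NULL)
        return 0;
    memcpy(buf, src, n);              /* use the buffer as scratch space */
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    free(buf);                        /* explicit release, unlike stack memory */
    return total;
}
```

The trade-off is the explicit free() call, which stack allocation avoids entirely.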
RBP-encoded addresses are 1 byte shorter, which is stupid.. x86.. encoding..
So I guess if you're using a lot of local/parameter references, that could add up after a while.
Quote from: johnsa on March 25, 2017, 01:21:39 AM
MOV m,r
latency : 2
rcp. throughput : 1
PUSH r
latency : 3
rcp. throughput : 1
So even if it used pushes (which are shorter), they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical (throughput-wise).
I don't think your figures have any meaning without knowing the processor, and in particular they have little meaning for modern processors in general. Agner Fog has a lot of literature about that.
Quote from: johnsa on March 25, 2017, 01:27:35 AM
That said..
Yes, for dynamically stack-allocating functions you'd probably want the RBP version, but dynamic local stack allocations are considered bad practice in a lot of cases anyway; just allocate from the heap instead, or if you have a known size or a limited set of sizes, you could create a LOCAL sized to account for that and use it as a buffer.
Come on, it is not bad practice at all. Where are you hearing such things?
Shall I allocate a dynamic array from the heap? Why, if I can just get rid of it by leaving the function? (This is a perfect, hassle-free garbage-collection mechanism.)
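The stack-as-garbage-collector idea can be illustrated in C (a hedged sketch using a C99 variable-length array; the names are hypothetical):

```c
#include <stddef.h>

/* Hedged sketch: a C99 variable-length array is a dynamic stack
   allocation that is released automatically on function exit,
   i.e. the "hassle-free garbage collection" described above. */
size_t count_even(const int *src, size_t n)
{
    int parity[n];                /* dynamic stack allocation (VLA) */
    size_t evens = 0;
    for (size_t i = 0; i < n; i++) {
        parity[i] = src[i] & 1;   /* scratch copy of each parity bit */
        if (parity[i] == 0)
            evens++;
    }
    return evens;                 /* parity[] vanishes here, no free() needed */
}
```

Under the hood a compiler implements the VLA exactly as discussed here: by adjusting the stack pointer at runtime, which is why it needs an RBP-style frame to address the other locals.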
Quote from: johnsa on March 25, 2017, 01:48:57 AM
RBP-encoded addresses are 1 byte shorter, which is stupid.. x86.. encoding..
So I guess if you're using a lot of local/parameter references, that could add up after a while.
Any sizeable function always has a lot of local storage.
I got those timings from Agner; they're applicable to most modern processors re: push vs mov.
But yeah, this is healthy debate.. I'm seeing some merit in RBP, and it's giving me ideas on how we can achieve both and simplify all these options.
so here is what I propose:
1) We make the FRAME attribute on the PROC redundant (specify it, don't specify it.. it doesn't matter).
The PROC will be set up the same way in either case, and this will avoid a lot of problems where FRAME has been left off and then the stack isn't aligned properly.
2) We keep STACKBASE:RSP, STACKBASE:RBP as is.. we've removed STACKBASE:ESP (that was a silly idea anyway).
3) Based on 2 we make option win64 completely redundant, this "smart" logic we've created around RSP/WIN64:11 we apply to RBP too so that really.. all you need to choose, is .. am I using RSP or RBP?
and that decision could be wrapped around specific PROCs where required, along with OPTION PROLOGUE:NONE for when you absolutely require a raw procedure (though I'm not sure that would really be useful, given that the smart logic produces the same output when the code inside the proc is right).
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc.. which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca().. but it's awful..
PUSH, POP, ENTER, LEAVE and even RET are high-level/complex instructions with hidden/extended microcode. Fewer micro-ops mean less power consumption. In 64-bit and mobile computing, optimizing for power consumption is just as important as optimizing for speed and size. The FASTCALL convention was designed in an effort to reduce the use of these high-level instructions. The costliest among them is POP.
For every PUSH, these are the extra steps taken at the machine level:
if (StackAddressSize == 32) { // this is a 32-bit PUSH, for example
    if (OperandSize == 32) {
        ESP = ESP - 4;
        SS:ESP = Source; // push doubleword
    }
    else { // OperandSize == 16
        ESP = ESP - 2;
        SS:ESP = Source; // push word
    }
}
else { // StackAddressSize == 16
    if (OperandSize == 16) {
        SP = SP - 2;
        SS:SP = Source; // push word
    }
    else { // OperandSize == 32
        SP = SP - 4;
        SS:SP = Source; // push doubleword
    }
}
Which in plain RISC-style instructions can be done simply with:
sub rsp,8
mov [rsp],something
Hence we get some peculiar-looking FASTCALL / register-based conventions like the above.
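The sub-then-store decomposition above can be modelled in plain C (a toy simulation for illustration only; the Stack type and helper names are invented):

```c
#include <stdint.h>
#include <string.h>

/* Toy model of the expansion above: a 64-bit PUSH is just
   "subtract from the stack pointer, then store at the new top",
   and POP is the reverse. The Stack type is invented for illustration. */
typedef struct {
    uint8_t mem[64];   /* simulated stack memory */
    size_t  sp;        /* simulated stack pointer, grows downward */
} Stack;

void push64(Stack *s, uint64_t value)
{
    s->sp -= 8;                          /* sub rsp, 8 */
    memcpy(s->mem + s->sp, &value, 8);   /* mov [rsp], value */
}

uint64_t pop64(Stack *s)
{
    uint64_t value;
    memcpy(&value, s->mem + s->sp, 8);   /* mov value, [rsp] */
    s->sp += 8;                          /* add rsp, 8 */
    return value;
}
```

The point being that the single PUSH opcode bundles both the pointer adjustment and the store, which is what the microcode decomposition above spells out.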
Just my 2 cents
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc.. which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca().. but it's awful..
It is not widespread because people have been educated not to mess much with the stack pointer.
Anyway, allocating N bytes of memory is just sub rsp, N. Releasing it is add rsp, N.
Actually, I have an example (or sort of) here: https://www.codeproject.com/Articles/1123638/MASM-Stack-Memory-Alignment
I used it for my OO macros once in x86, to store local objects etc.. but I'm trying to think how this will work in x64.
So let's assume RBP is set up so we can refer to locals and arguments; that's fine.. now we'd have the SUB RSP,N to reserve space for all the calls, so RSP is at the bottom of that block of memory.
The invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters..
So if you SUB rsp,Y somewhere in the proc.. wouldn't invokes overwrite your dynamic stack allocation?
sweet Mary mother of Jesus! :icon_eek:
Quote from: johnsa on March 25, 2017, 02:51:57 AM
The invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters..
So if you SUB rsp,Y somewhere in the proc.. wouldn't invokes overwrite your dynamic stack allocation?
No, you are simply "rebasing" the stack pointer. If the function is not a leaf, before calling a subroutine you will have to:
1) subtract the usual 32 bytes
2) align the stack.
On return:
add the usual 32 bytes, plus any bytes used for stack alignment.
After that you will be as you were before the call :)
This is what I get with 64-bit MASM using a custom prologue/epilogue. The entry/exit code is small, and for high level code it's easily fast enough. For low level code you don't use a stack frame.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
LOCAL a1 :QWORD
LOCAL a2 :QWORD
LOCAL a3 :QWORD
LOCAL a4 :QWORD
mov a1, 1
mov a2, 2
mov a3, 3
mov a4, 4
xor rcx, rcx
call ExitProcess
ret
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
comment * +++++++++++++++++++++++++++
segment .text
enter 0x80, 0x0
sub rsp, 0x80
mov qword ptr [rbp-0x68], 0x1
mov qword ptr [rbp-0x70], 2
mov qword ptr [rbp-0x78], 3
mov qword ptr [rbp-0x80], 4
xor rcx, rcx
call qword ptr [ExitProcess]
leave
ret
* +++++++++++++++++++++++++++++++++++
The disassembly in detail.
.text:0000000140001000 C8800000 enter 0x80, 0x0
.text:0000000140001004 4881EC80000000 sub rsp, 0x80
.text:000000014000100b 48C7459801000000 mov qword ptr [rbp-0x68], 0x1
.text:0000000140001013 48C7459002000000 mov qword ptr [rbp-0x70], 2
.text:000000014000101b 48C7458803000000 mov qword ptr [rbp-0x78], 3
.text:0000000140001023 48C7458004000000 mov qword ptr [rbp-0x80], 4
.text:000000014000102b 4833C9 xor rcx, rcx
.text:000000014000102e FF1560100000 call qword ptr [ExitProcess]
.text:0000000140001034 C9 leave
.text:0000000140001035 C3 ret
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255)
Interesting, as I've always avoided ENTER/LEAVE, having assumed they weren't that quick..
http://stackoverflow.com/questions/5959890/enter-vs-push-ebp-mov-ebp-esp-sub-esp-imm-and-leave-vs-mov-esp-ebp
Quote from: jj2007 on March 25, 2017, 03:21:02 AM
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255)
That looks interesting! And I guess that's x64 as well as x86?
There has never been a problem with LEAVE, and it was the normal cleanup in 32-bit MASM, where ENTER was known to be slow. With the size of 64-bit instructions generally being larger than the 32-bit versions, using ENTER does not seem to be a problem, as any high level code is some powers slower than direct mnemonic code anyway. With pure mnemonic code you would go for not using a stack frame, as your total call overhead is then a simple CALL/RET.
Quote from: johnsa on March 25, 2017, 03:28:42 AM
That looks interesting! And I guess that's x64 as well as x86?
No, StackBuffer() is 32-bit only so far.
Re enter+leave:
Quote from: jj2007 on July 27, 2016, 12:56:50 AM
Quote from: TWell on July 26, 2016, 12:00:30 AM
EDIT: How about testing in x64 enter/leave and rsp sub/add?
Saw your edit only now, sorry. If I remember well, we tested that for 32-bit code in the Lab; enter was slow, leave was fast.
P.S.: Made a few tests, and for a naked procedure, enter is about 15% slower than push rbp + mov rbp, rsp
Which means a cycle or so. As Hutch wrote above, if it's really speed critical, you would use only registers + CALL + RET.
And if you want it really fast, i.e. if the extra cycle for enter slows your algo down, then your design is wrong. Short procedures in speed-critical loops are nonsense: drop the call and the ret and use a macro, or "inline" it by hand.
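That advice can be sketched in C (a hypothetical example; the WOBBLE operation and names are made up):

```c
/* Sketch of the "drop the call and ret" advice: a tiny operation in a
   hot loop written as a macro and as a static inline function, so the
   work expands in place with no CALL/RET overhead. WOBBLE is invented. */
#define WOBBLE(x) ((x) * 3 + 1)

static inline int wobble_inline(int x)
{
    return x * 3 + 1;   /* same operation; the compiler inlines it */
}

int hot_loop(int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += WOBBLE(i);   /* expands in place, no call overhead */
    return acc;
}
```

Both forms give the loop body directly to the optimizer, which is the C-level equivalent of hand-inlining a short procedure in assembly.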