timer_begin 1, HIGH_PRIORITY_CLASS
mov r10,100000000
ltest:
invoke wobble
dec r10
jnz short ltest
timer_end
wobble PROC FRAME USES rsi rdi rbx r10 r11 r12 rdx
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4
mov eax,ebx
ret
wobble ENDP
Both RSP- and RBP-based stack frames execute the test loop in 295ms, with no difference.
Size-wise, the RBP version does produce smaller code:
55 push rbp
48 8B EC mov rbp,rsp
56 push rsi
57 push rdi
53 push rbx
41 52 push r10
41 53 push r11
41 54 push r12
52 push rdx
48 83 EC 38 sub rsp,38h
...
48 83 C4 38 add rsp,38h
5A pop rdx
41 5C pop r12
41 5B pop r11
41 5A pop r10
5B pop rbx
5F pop rdi
5E pop rsi
5D pop rbp
C3 ret
= 34 bytes.
vs.
48 89 74 24 08 mov qword ptr [rsp+8],rsi
48 89 7C 24 10 mov qword ptr [rsp+10h],rdi
48 89 5C 24 18 mov qword ptr [rsp+18h],rbx
4C 89 54 24 20 mov qword ptr [rsp+20h],r10
41 53 push r11
41 54 push r12
52 push rdx
48 83 EC 60 sub rsp,60h
...
48 83 C4 60 add rsp,60h
5A pop rdx
41 5C pop r12
41 5B pop r11
48 8B 74 24 08 mov rsi,qword ptr [rsp+8]
48 8B 7C 24 10 mov rdi,qword ptr [rsp+10h]
48 8B 5C 24 18 mov rbx,qword ptr [rsp+18h]
4C 8B 54 24 20 mov r10,qword ptr [rsp+20h]
C3 ret
= 59 bytes
When the procedure makes less use of the USES clause:
wobble PROC FRAME USES rsi
LOCAL surfacePtr:QWORD
LOCAL b0y:REAL4
LOCAL b1y:REAL4
LOCAL b2y:REAL4
LOCAL b3y:REAL4
LOCAL d0,d1,d2,d3,cosa,sina:REAL4
mov eax,ebx
ret
wobble ENDP
wobble:
55 push rbp
48 8B EC mov rbp,rsp
56 push rsi
48 83 EC 38 sub rsp,38h
8B C3 mov eax,ebx
48 83 C4 38 add rsp,38h
5E pop rsi
5D pop rbp
C3 ret
= 18 bytes
wobble:
56 push rsi
48 83 EC 30 sub rsp,30h
8B C3 mov eax,ebx
48 83 C4 30 add rsp,30h
5E pop rsi
C3 ret
= 13 bytes
and performance-wise we now have:
228ms (RSP) vs 234ms (RBP)
almost too close to call, but it's there.. and with a 5 byte saving.
Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a large list in USES.
That said, however, the RSP option does still reduce the total amount of allocated stack, which WILL improve caching.
Quote from: johnsa on March 24, 2017, 11:34:49 PM
Basically, the only thing that would make an RSP-based prologue bigger than the RBP one is a large list in USES.
Totally wrong. Please LOOK at this case:
option casemap:none
option frame:auto
OPTION STACKBASE:RSP
option win64:11
getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
LOCAL myVar1 : qword
LOCAL myVar2 : qword
mov rax, 1
add rax, val1
mov myVar1, rax
add rax, val2
mov myVar2, rax
INVOKE sub1, dest, rdx, myVar1, myVar2
ret
getSum endp
decompiles to:
getSum:
000000013F5B181B mov qword ptr [rsp+8],rcx
000000013F5B1820 mov qword ptr [rsp+18h],r8
000000013F5B1825 mov qword ptr [rsp+20h],r9
000000013F5B182A sub rsp,38h
000000013F5B182E mov rax,1
000000013F5B1835 add rax,qword ptr [rsp+50h]
000000013F5B183A mov qword ptr [rsp+20h],rax
000000013F5B183F add rax,qword ptr [rsp+58h]
000000013F5B1844 mov qword ptr [rsp+28h],rax
000000013F5B1849 mov rcx,qword ptr [rsp+40h]
000000013F5B184E mov r8,qword ptr [rsp+20h]
000000013F5B1853 mov r9,qword ptr [rsp+28h]
000000013F5B1858 call 000000013F5B1800
000000013F5B185D add rsp,38h
000000013F5B1861 ret
TOTAL : 70 bytes
Now with:
option casemap:none
option frame:auto
option win64:2
getSum proc public FRAME dest:ptr, src:ptr, val1 : qword, val2:qword
LOCAL myVar1 : qword
LOCAL myVar2 : qword
mov dest, rcx
mov val1, r8
mov val2, r9
mov rax, 1
add rax, val1
mov myVar1, rax
add rax, val2
mov myVar2, rax
INVOKE sub1, dest, rdx, myVar1, myVar2
ret
getSum endp
decompiles to:
getSum:
000000013F5C1814 push rbp
000000013F5C1815 mov rbp,rsp
000000013F5C1818 sub rsp,30h
000000013F5C181C mov qword ptr [rbp+10h],rcx
000000013F5C1820 mov qword ptr [rbp+20h],r8
000000013F5C1824 mov qword ptr [rbp+28h],r9
000000013F5C1828 mov rax,1
000000013F5C182F add rax,qword ptr [rbp+20h]
000000013F5C1833 mov qword ptr [rbp-8],rax
000000013F5C1837 add rax,qword ptr [rbp+28h]
000000013F5C183B mov qword ptr [rbp-10h],rax
000000013F5C183F mov rcx,qword ptr [rbp+10h]
000000013F5C1843 mov r8,qword ptr [rbp-8]
000000013F5C1847 mov r9,qword ptr [rbp-10h]
000000013F5C184B call 000000013F5C1800
000000013F5C1850 add rsp,30h
000000013F5C1854 pop rbp
000000013F5C1855 ret
TOTAL: 65 bytes.
The difference is "only" 5 more bytes for the OPTION STACKBASE:RSP alternative.
Moves from memory to register and vice versa use longer and slower instructions.
There is another problem: you cannot dynamically allocate memory on the stack with OPTION STACKBASE:RSP in use.
MOV m,r
latency : 2
rcp. throughput : 1
PUSH r
latency : 3
rcp. throughput : 1
So even if it used pushes (which are shorter), they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical (throughput-wise).
That said..
Yes, for dynamically stack-allocating functions you'd probably want the RBP version, but dynamic local stack allocations are considered bad practice in a lot of cases anyway; just allocate from the heap instead, or if you have a known size or a limited set of sizes, you could create a LOCAL sized to account for that and use it as a buffer.
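The heap alternative suggested here could be sketched in C (a hypothetical illustration, not code from this thread; the function name and sizes are made up):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: instead of a dynamic stack allocation
   (sub rsp, N), take the scratch buffer from the heap and
   release it explicitly before leaving the function. */
size_t sum_bytes(const unsigned char *src, size_t n)
{
    unsigned char *buf = malloc(n);   /* heap instead of stack */
    if (buf == NULL)
        return 0;
    memcpy(buf, src, n);              /* use the buffer as scratch space */
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    free(buf);                        /* explicit release, unlike stack memory */
    return total;
}
```

The trade-off is the explicit free() call, which stack allocation avoids entirely.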
RBP-encoded addresses are 1 byte shorter, which is stupid.. x86.. encoding..
So I guess if you're using a lot of local/parameter references, that could add up after a while.
Quote from: johnsa on March 25, 2017, 01:21:39 AM
MOV m,r
latency : 2
rcp. throughput : 1
PUSH r
latency : 3
rcp. throughput : 1
So even if it used pushes (which are shorter), they have worse latency.
But in your example both RSP mode and RBP mode store the parameters via MOV, which is identical (throughput-wise).
I don't think your figures have any meaning without knowing the processor, and in particular they have little meaning for modern processors in general. Agner Fog has a lot of literature about that.
Quote from: johnsa on March 25, 2017, 01:27:35 AM
That said..
Yes, for dynamically stack-allocating functions you'd probably want the RBP version, but dynamic local stack allocations are considered bad practice in a lot of cases anyway; just allocate from the heap instead, or if you have a known size or a limited set of sizes, you could create a LOCAL sized to account for that and use it as a buffer.
Come on, it is not bad practice at all. Where are you hearing such things?
Shall I allocate a dynamic array from the heap? Why, if I can just get rid of it by leaving the function? (This is a perfect, hassle-free garbage-collection mechanism.)
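The stack-as-garbage-collector idea can be illustrated in C (a hedged sketch using a C99 variable-length array; the names are hypothetical):

```c
#include <stddef.h>

/* Hedged sketch: a C99 variable-length array is a dynamic stack
   allocation that is released automatically on function exit,
   i.e. the "hassle-free garbage collection" described above. */
size_t count_even(const int *src, size_t n)
{
    int parity[n];                /* dynamic stack allocation (VLA) */
    size_t evens = 0;
    for (size_t i = 0; i < n; i++) {
        parity[i] = src[i] & 1;   /* scratch copy of each parity bit */
        if (parity[i] == 0)
            evens++;
    }
    return evens;                 /* parity[] vanishes here, no free() needed */
}
```

Under the hood a compiler implements the VLA exactly as discussed here: by adjusting the stack pointer at runtime, which is why it needs an RBP-style frame to address the other locals.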
Quote from: johnsa on March 25, 2017, 01:48:57 AM
RBP-encoded addresses are 1 byte shorter, which is stupid.. x86.. encoding..
So I guess if you're using a lot of local/parameter references, that could add up after a while.
Any sizeable function always has a lot of local storage.
I got those timings from Agner; they're applicable to most modern processors re: push vs mov.
But yeah, this is healthy debate.. I'm seeing some merit in RBP, and it's giving me ideas on how we can achieve both and simplify all these options.
so here is what I propose:
1) We make the FRAME attribute on the PROC redundant (specify it, don't specify it.. it doesn't matter).
The PROC will be set up the same way in either case, and this will avoid a lot of problems where FRAME has been left off and then the stack isn't aligned properly.
2) We keep STACKBASE:RSP, STACKBASE:RBP as is.. we've removed STACKBASE:ESP (that was a silly idea anyway).
3) Based on 2 we make option win64 completely redundant, this "smart" logic we've created around RSP/WIN64:11 we apply to RBP too so that really.. all you need to choose, is .. am I using RSP or RBP?
and that decision could be wrapped around specific PROCs where required, along with OPTION PROLOGUE:NONE for when you absolutely require a raw procedure (though I'm not sure that would really be useful, given that the smart logic produces the same output when the code inside the proc is right).
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc.. which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca().. but it's awful..
PUSH, POP, ENTER, LEAVE and even RET are high-level/complex instructions with hidden/extended microcode. Fewer micro-ops mean less power consumption. In 64-bit and mobile computing, optimizing for power consumption is just as important as optimizing for speed and size. The FASTCALL convention was designed in an effort to reduce the use of these high-level instructions. The costliest among them is POP.
For every PUSH, these are the extra steps taken at the machine level:
if (StackAddressSize == 32) { // this is a 32-bit PUSH, for example
    if (OperandSize == 32) {
        ESP = ESP - 4;
        SS:ESP = Source; // push doubleword
    }
    else { // OperandSize == 16
        ESP = ESP - 2;
        SS:ESP = Source; // push word
    }
}
else { // StackAddressSize == 16
    if (OperandSize == 16) {
        SP = SP - 2;
        SS:SP = Source; // push word
    }
    else { // OperandSize == 32
        SP = SP - 4;
        SS:SP = Source; // push doubleword
    }
}
Which in plain RISC-style instructions can be done simply with:
sub rsp,8
mov [rsp],something
Hence we get some peculiar-looking FASTCALL / register-based conventions like the above.
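The sub-then-store decomposition above can be modelled in plain C (a toy simulation for illustration only; the Stack type and helper names are invented):

```c
#include <stdint.h>
#include <string.h>

/* Toy model of the expansion above: a 64-bit PUSH is just
   "subtract from the stack pointer, then store at the new top",
   and POP is the reverse. The Stack type is invented for illustration. */
typedef struct {
    uint8_t mem[64];   /* simulated stack memory */
    size_t  sp;        /* simulated stack pointer, grows downward */
} Stack;

void push64(Stack *s, uint64_t value)
{
    s->sp -= 8;                          /* sub rsp, 8 */
    memcpy(s->mem + s->sp, &value, 8);   /* mov [rsp], value */
}

uint64_t pop64(Stack *s)
{
    uint64_t value;
    memcpy(&value, s->mem + s->sp, 8);   /* mov value, [rsp] */
    s->sp += 8;                          /* add rsp, 8 */
    return value;
}
```

The point being that the single PUSH opcode bundles both the pointer adjustment and the store, which is what the microcode decomposition above spells out.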
Just my 2 cents
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
I'm just trying to get my head around how that would work with a static rsp (i.e. it's decremented once to reserve space for all calls inside the proc.. which is what we do with win64:11 and what the C compiler does too).
I had a quick look at the disasm from alloca().. but it's awful..
It is not widespread because people have been educated not to mess much with the stack pointer.
Anyway, allocating N bytes of memory is just sub rsp, N. Releasing it is add rsp, N.
Actually, I have an example (or sort of) here: https://www.codeproject.com/Articles/1123638/MASM-Stack-Memory-Alignment
I used it for my OO macros once in x86, to store local objects etc.. but I'm trying to think how this will work in x64.
So let's assume RBP is set up so we can refer to locals and arguments; that's fine.. now we'd have the SUB RSP,N to reserve space for all the calls, so RSP is at the bottom of that block of memory.
The invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters..
So if you SUB rsp,Y somewhere in the proc.. wouldn't invokes overwrite your dynamic stack allocation?
sweet Mary mother of Jesus! :icon_eek:
Quote from: johnsa on March 25, 2017, 02:51:57 AM
The invokes will assume they can use [RSP+0] -> [RSP+x] to fill in the parameters..
So if you SUB rsp,Y somewhere in the proc.. wouldn't invokes overwrite your dynamic stack allocation?
No, you are simply "rebasing" the stack pointer. If the function is not a leaf, before calling a subroutine you will have to:
1) subtract the usual 32 bytes
2) align the stack.
On return:
add the usual 32 bytes, plus any bytes used for stack alignment.
After that you will be as you were before the call :)
This is what I get with 64-bit MASM using a custom prologue/epilogue. The entry/exit code is small, and for high level code it's easily fast enough. For low level code you don't use a stack frame.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
LOCAL a1 :QWORD
LOCAL a2 :QWORD
LOCAL a3 :QWORD
LOCAL a4 :QWORD
mov a1, 1
mov a2, 2
mov a3, 3
mov a4, 4
xor rcx, rcx
call ExitProcess
ret
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
comment * +++++++++++++++++++++++++++
segment .text
enter 0x80, 0x0
sub rsp, 0x80
mov qword ptr [rbp-0x68], 0x1
mov qword ptr [rbp-0x70], 2
mov qword ptr [rbp-0x78], 3
mov qword ptr [rbp-0x80], 4
xor rcx, rcx
call qword ptr [ExitProcess]
leave
ret
* +++++++++++++++++++++++++++++++++++
The disassembly in detail.
.text:0000000140001000 C8800000 enter 0x80, 0x0
.text:0000000140001004 4881EC80000000 sub rsp, 0x80
.text:000000014000100b 48C7459801000000 mov qword ptr [rbp-0x68], 0x1
.text:0000000140001013 48C7459002000000 mov qword ptr [rbp-0x70], 2
.text:000000014000101b 48C7458803000000 mov qword ptr [rbp-0x78], 3
.text:0000000140001023 48C7458004000000 mov qword ptr [rbp-0x80], 4
.text:000000014000102b 4833C9 xor rcx, rcx
.text:000000014000102e FF1560100000 call qword ptr [ExitProcess]
.text:0000000140001034 C9 leave
.text:0000000140001035 C3 ret
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255)
Interesting, as I've always avoided ENTER/LEAVE, having assumed they weren't that quick..
http://stackoverflow.com/questions/5959890/enter-vs-push-ebp-mov-ebp-esp-sub-esp-imm-and-leave-vs-mov-esp-ebp
Quote from: jj2007 on March 25, 2017, 03:21:02 AM
Quote from: johnsa on March 25, 2017, 02:09:02 AM
Do you have an ASM-based example of how you'd handle dynamically allocating the stack?
StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255)
That looks interesting! And I guess that's x64 as well as x86?
There has never been a problem with LEAVE, and it was the normal cleanup in 32-bit MASM, where ENTER was known to be slow. With the size of 64-bit instructions generally being larger than the 32-bit versions, using ENTER does not seem to be a problem, as any high level code is some powers slower than direct mnemonic code anyway. With pure mnemonic code you would go for not using a stack frame, as your total call overhead is then a simple CALL/RET.
Quote from: johnsa on March 25, 2017, 03:28:42 AM
That looks interesting! And I guess that's x64 as well as x86?
No, StackBuffer() is 32-bit only so far.
Re enter+leave:
Quote from: jj2007 on July 27, 2016, 12:56:50 AM
Quote from: TWell on July 26, 2016, 12:00:30 AM
EDIT: How about testing in x64 enter/leave and rsp sub/add?
Saw your edit only now, sorry. If I remember well, we tested that for 32-bit code in the Lab; enter was slow, leave was fast.
P.S.: Made a few tests, and for a naked procedure, enter is about 15% slower than push rbp + mov rbp, rsp
Which means a cycle or so. As Hutch wrote above, if it's really speed critical, you would use only registers + CALL + RET.
And if you want it really fast, i.e. if the extra cycle for enter slows your algo down, then your design is wrong. Short procedures in speed-critical loops are nonsense: drop the call and the ret and use a macro, or "inline" it by hand.
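That advice can be sketched in C (a hypothetical example; the WOBBLE operation and names are made up):

```c
/* Sketch of the "drop the call and ret" advice: a tiny operation in a
   hot loop written as a macro and as a static inline function, so the
   work expands in place with no CALL/RET overhead. WOBBLE is invented. */
#define WOBBLE(x) ((x) * 3 + 1)

static inline int wobble_inline(int x)
{
    return x * 3 + 1;   /* same operation; the compiler inlines it */
}

int hot_loop(int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += WOBBLE(i);   /* expands in place, no call overhead */
    return acc;
}
```

Both forms give the loop body directly to the optimizer, which is the C-level equivalent of hand-inlining a short procedure in assembly.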