Stackframes

Started by JK, April 05, 2021, 09:46:30 PM

JK

I fear this is going to be a somewhat lengthy story, so i split it into digestible parts ...


I would like to have what i would call a "clean" stackframe in 32 and 64 bit, where "clean" means:
1.) E/RBP based stackframe
2.) argument and locals work by name
3.) top of locals is at or has a constant (the same in every procedure) offset to E/RBP
4.) order of locals is not changed, i.e locals appear on the stack in the same order as they were defined in code
5.) no need to keep the stack balanced before leaving (epilogue restores E/RSP automatically)


Why?

1.) E/RSP is "free" for me to use, e.g. create temp storage on the stack on the fly
2.) makes coding so much easier, everything is relative to E/RBP and the assembler
     calculates the displacement for me
3.) i can rely on the fact, that my locals are at a distinct place relative to E/RBP,
     which allows e.g. for easily setting all or certain parts of them to zero or similar
4.) same as 3.)
5.) a mov E/RBP, E/RSP, mov E/RSP, E/RBP pair reliably restores the stack before leaving,
     meaning i can do all kinds of "things" in my procedure with the stack pointer
     (exception: in 64 bit it must be 16-byte aligned before calls) and i´m not obliged to keep
     track and clean up.

I know playing with the stack must follow rules, i cannot do literally everything, but as long as i play by the rules, i can do everything (even in 64 bit).
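A minimal sketch of point 5, assuming UASM/JWasm-style syntax assembled with -win64 and automatic prologue generation switched off. The names Demo and Helper are made up for illustration, and the code is untested. It shows that RSP may wander inside the body as long as it is re-aligned before a call, because the epilogue restores it from RBP:

```asm
option prologue:none
option epilogue:none

.code

Helper proc                      ; hypothetical callee, stands in for any external procedure
    ret
Helper endp

Demo proc
    push rbp                     ; entry RSP is 8 mod 16, so RSP is 16-byte aligned after this
    mov  rbp, rsp                ; RBP is the fixed anchor from here on
    push rcx                     ; ad-hoc temp storage: RSP is now off by 8
    sub  rsp, 10h                ; 16 more bytes of scratch space, still off by 8
    and  rsp, -16                ; force 16-byte alignment, whatever happened above
    sub  rsp, 20h                ; shadow space for the callee
    call Helper
    mov  rsp, rbp                ; drops scratch space and shadow space in one go
    pop  rbp
    ret
Demo endp

end
```

Nothing between the prologue and the epilogue has to be undone by hand. The only rule observed is the alignment fix-up right before the call.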


Basically such a stackframe would look like this:

; bottom of stack
;------------------ <- R/ESP points here  |  sub R/ESP, # of local bytes (+ alignment in 64 bit)
; local ...                               |
;------------------                       |
; local 2                                 |
;------------------                       |
; local 1 (DWORD)                         |                    -> local 1 = EBP-4 / RBP-4
;------------------ <- R/EBP points here  |  mov R/EBP, R/ESP  -> arg 1   = EBP+8 / RBP+10h
; old R/EBP                               |
;------------------                       |  push R/EBP
; return address                          |
;------------------                       |  call...
; arg 1                                   |
;------------------                       |
; arg 2                                   |
;------------------                       |
; arg 3                                   |
;------------------                       |
; arg 4                                   |
;------------------                       |
; arg 5                                   |
;------------------                       |
; arg 6                                   |
;------------------                       |
; ...                                     |
;------------------
; top of stack


The basic layout could be the same in 32 and 64 bit. In 64 bit the space for arg 1 to arg 4 must always be there, even if there are no arguments. This space (shadow space) can be used by the called procedure to save arg 1 to arg 4, which in 64 bit are passed by register.

One thing is still missing: registers to be saved by the callee must be placed on the stack somewhere.
There are 3 places, which fit:
- before "old R/EBP"
- after "old R/EBP"
- after locals
In every case it is possible to automatically restore E/RSP to a correct value before leaving
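As a rough illustration of the "after locals" variant in 64 bit, here is a hand-written prologue/epilogue pair. It assumes UASM/JWasm-style syntax with -win64. LOCAL_BYTES and MyProc are hypothetical names, and the sketch is untested:

```asm
option prologue:none
option epilogue:none

LOCAL_BYTES equ 20h                  ; hypothetical size of the local block, multiple of 16

.code

MyProc proc
    push rbp                         ; RSP becomes 16-byte aligned
    mov  rbp, rsp                    ; [rbp+10h] = home slot of arg 1, [rbp-4] = first DWORD local
    sub  rsp, LOCAL_BYTES            ; locals occupy [rbp-LOCAL_BYTES] .. [rbp-1]
    push rbx                         ; callee-saved registers go below the locals
    push rsi                         ; an even number of pushes keeps RSP aligned

    ; ... body: RSP may be moved freely here ...

    lea  rsp, [rbp-LOCAL_BYTES-10h]  ; point at the saved registers, wherever RSP was
    pop  rsi
    pop  rbx
    mov  rsp, rbp                    ; discard the locals
    pop  rbp
    ret
MyProc endp

end
```

The epilogue never depends on the current value of RSP. Both restore points are computed from RBP, which is why the other two placements of the saved registers work just as well.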


Question: is there any reason why this wouldn´t work? I know that this approach is not optimized, but i want it stable and reliable in first place. If i want it better, faster, smaller, whatever ...  i can always do so by hand or using special option settings as available


to be continued ...



JK

#1
As usual, i started testing with UASM. Unfortunately i cannot get exactly what i want (see preceding post) using the available options. So i tried to write my own PROLOGUE/EPILOGUE. I´m a novice at this, so i might have made mistakes, but i think there is an error in UASM too.


I used the attached code (i left comments of my thoughts and results, which hopefully makes it easier to understand what i did and tried). To make things easier, i didn´t specify registers to save ("uses ...").


Now it´s getting complicated and i hope i can explain my findings in a comprehensible way


Looking at the generated code for "testit proc", i see that arguments are not referenced correctly (the RBP displacement is off). This could be a failure in my PROLOGUE/EPILOGUE code, but it isn´t off in a consistent manner either. Comparing this with what is generated without my custom PROLOGUE/EPILOGUE, i can see the displacements are different. But the difference between displacements is different as well, which should not be the case, even if the basic error is in my code.

with my custom PROLOGUE/EPILOGUE (RBP consistently points to top of locals, order of locals is kept):

000000000124101B | 48:894C24 08             | mov qword ptr ss:[rsp+8],rcx            |
0000000001241020 | 66:0FD64C24 10           | movq qword ptr ss:[rsp+10],xmm1         |
0000000001241026 | 66:0FD65424 18           | movq qword ptr ss:[rsp+18],xmm2         |
000000000124102C | 4C:894C24 20             | mov qword ptr ss:[rsp+20],r9            |
0000000001241031 | 55                       | push rbp                                |
0000000001241032 | 48:8BEC                  | mov rbp,rsp                             |
0000000001241035 | 48:83EC 20               | sub rsp,20                              |
0000000001241039 | 8B45 40                  | mov eax,dword ptr ss:[rbp+40]           | x  +40 arg5
000000000124103C | 8B45 20                  | mov eax,dword ptr ss:[rbp+20]           | d4 +20 arg2: diff = 3 -> 3 x 8h = 18h -> not ok, is 20h
000000000124103F | 8B45 10                  | mov eax,dword ptr ss:[rbp+10]           |
0000000001241042 | 48:83EC 20               | sub rsp,20                              |
0000000001241046 | 48:8D4D E8               | lea rcx,qword ptr ss:[rbp-18]           |
000000000124104A | E8 B1FFFFFF              | call uasm_test_64.1241000               |
000000000124104F | 48:83C4 20               | add rsp,20                              |
0000000001241053 | 48:8D45 40               | lea rax,qword ptr ss:[rbp+40]           |
0000000001241057 | 48:8D55 20               | lea rdx,qword ptr ss:[rbp+20]           |
000000000124105B | 48:8D75 28               | lea rsi,qword ptr ss:[rbp+28]           |
000000000124105F | 48:8D7D 30               | lea rdi,qword ptr ss:[rbp+30]           |
0000000001241063 | 48:8D45 FC               | lea rax,qword ptr ss:[rbp-4]            |
0000000001241067 | 48:8D5D E8               | lea rbx,qword ptr ss:[rbp-18]           |
000000000124106B | 48:8D4D E4               | lea rcx,qword ptr ss:[rbp-1C]           |
000000000124106F | 48:8BE5                  | mov rsp,rbp                             |
0000000001241072 | 5D                       | pop rbp                                 |
0000000001241073 | C3                       | ret                                     |


without a custom PROLOGUE/EPILOGUE (caveat here is: locals are changed in order and locals don´t start at a constant offset from RBP, offset is different for each procedure):

00000000013C1024 | 48:894C24 08             | mov qword ptr ss:[rsp+8],rcx            |
00000000013C1029 | F3:0F114C24 10           | movss dword ptr ss:[rsp+10],xmm1        |
00000000013C102F | F2:0F115424 18           | movsd qword ptr ss:[rsp+18],xmm2        |
00000000013C1035 | 4C:894C24 20             | mov qword ptr ss:[rsp+20],r9            |
00000000013C103A | 48:55                    | push rbp                                |
00000000013C103C | 48:83EC 20               | sub rsp,20                              |
00000000013C1040 | 48:8D6C24 10             | lea rbp,qword ptr ss:[rsp+10]           |
00000000013C1045 | 8B45 40                  | mov eax,dword ptr ss:[rbp+40]           | x  +40, arg5                                     
00000000013C1048 | 8B45 28                  | mov eax,dword ptr ss:[rbp+28]           | d4 +28, arg2 diff = 3 -> 3 x 8h = 18h -> ok       
00000000013C104B | 8B45 20                  | mov eax,dword ptr ss:[rbp+20]           |
00000000013C104E | 48:83EC 20               | sub rsp,20                              |
00000000013C1052 | 48:8D4D 00               | lea rcx,qword ptr ss:[rbp]              |
00000000013C1056 | E8 A5FFFFFF              | call uasm_test_64.13C1000               |
00000000013C105B | 48:83C4 20               | add rsp,20                              |
00000000013C105F | 48:8D45 40               | lea rax,qword ptr ss:[rbp+40]           |
00000000013C1063 | 48:8D55 28               | lea rdx,qword ptr ss:[rbp+28]           |
00000000013C1067 | 48:8D75 30               | lea rsi,qword ptr ss:[rbp+30]           |
00000000013C106B | 48:8D7D 38               | lea rdi,qword ptr ss:[rbp+38]           |
00000000013C106F | 48:8D45 FC               | lea rax,qword ptr ss:[rbp-4]            |
00000000013C1073 | 48:8D5D 00               | lea rbx,qword ptr ss:[rbp]              |
00000000013C1077 | 48:8D4D F8               | lea rcx,qword ptr ss:[rbp-8]            |
00000000013C107B | 48:8D65 10               | lea rsp,qword ptr ss:[rbp+10]           |
00000000013C107F | 5D                       | pop rbp                                 |
00000000013C1080 | C3                       | ret                                     |


So while the offset in the RBP displacement for arguments could be caused by an error in my PROLOGUE code, the distance between arguments should be the same in both cases, but it isn´t!


testit proc (default PROLOGUE, OPTION STACKBASE RBP, OPTION WIN64:5):
RBP = RSP + 10h after PROLOGUE -> arguments are resolved correctly

argument  x:  RBP+40h, (arg # 5)
argument d4:  RBP+28h, (arg # 2) diff = 3 -> 3 x 8h = 18h -> ok (expected)


testit proc (custom PROLOGUE, OPTION STACKBASE RBP, OPTION WIN64:5):
RBP = RSP + 20h after PROLOGUE -> arguments are 10h off

argument  x:  RBP+40h, (arg # 5)
argument d4:  RBP+20h, (arg # 2) diff = 3 -> 3 x 8h = 18h -> in fact it is 20h, which is wrong


Additional question: the value returned by the PROLOGUE doesn´t seem to have any effect on the generated code,
e.g. returning different numbers or <0> doesn´t change anything. OTOH returning nothing throws an error. So what is this return value for?


JK

johnsa

I think there are a number of issues here...

1. if you use stackbase:RSP mode, then that gives you RBP free for use (if you really need an extra register).
2. the automatic prologue/epi. does a lot more than just deal with params and locals, there is alignment to consider as well as pre-allocation for the largest contained invoke within the proc
.. ie if an invoke needs to reserve 128 bytes of stack for its params, the prologue in the parent proc already handles this, which I don't think you can replicate via a custom prologue macro.
3. The locals are re-arranged to ensure that they can be packed and aligned efficiently.
4. The home-space slots are only filled if the parameter is actually used (a minor optimisation to avoid copying the reg params if they're unused, or used directly via register)

The zeroing of locals as a batch is an issue. I removed option zerolocals as it wasn't fully implemented, and it's not optimal as you frequently don't need to zero all of them, but only specific ones.
I was considering adding a new directive, LOCALZ ie. which would produce the code to zero just the specific local be it primitive/struct or array etc.

Assuming you use stackbase RSP and had LOCALZ directive, that should cover all your requirements?


JK

Thanks for your reply!

Quote: 1. if you use stackbase:RSP mode, then that gives you RBP free for use (if you really need an extra register).

I want RSP to be free in a sense so that i can do pushes and pops in my procedures. It´s not about gaining an extra register, there are more than enough in 64 bit.

Quote: 2. the automatic prologue/epi. does a lot more than just deal with params and locals, there is alignment to consider as well as pre-allocation for the largest contained invoke within the proc
.. ie if an invoke needs to reserve 128 bytes of stack for its params, the prologue in the parent proc already handles this, which I don't think you can replicate via a custom prologue macro.

I have seen that, but this kind of stackframe layout (with pre-allocation for the largest contained invoke within the proc) makes it impossible to do pushes and pops inside the current procedure, because depending on the space needed the pushed data might be overwritten by the next stackframe (64 bit). This is why i would like to have an RBP based stackframe (which does sub RSP, offset before the call and add RSP, offset after the call (essentially what WIN64:1 to 7 does, but without the "downsides" (my personal view) i mentioned)). The price is less optimized code, but i´m willing to pay it.


Quote: 3. The locals are re-arranged to ensure that they can be packed and aligned efficiently
Yes, but this way you must fill each local you want to zero one by one, which is far from efficient. By keeping the order, space is wasted, that´s true, but OTOH in 64 bit there is more than enough space - much more than we ever had in 32 bit.


Quote: 4. The home-space slots are only filled if the parameter is actually used (a minor optimisation to avoid copying the reg params if they're unused, or used directly via register)
This is a quite useful feature IMHO, because it happens automatically. In my code i must do it by hand "<rxxr>", but it´s still possible.


There is no need to always zero all locals, but sometimes it would make things easier. My idea was writing a macro (zerolocals) taking none, one or two parameters.
- if no parameter is given, all locals are set to zero
- if one parameter is given, all locals from start to (and including) this local are set to zero
- if two parameters are given, all locals starting with the first given local up to the second  are set to zero

By arranging my locals in an appropriate order i can ensure that this can be done very effectively. I can optimize this process inside my macro (mov a few bytes vs. rep stosb), i can even repeat this inside my procedure with different parameters for different locals in different places.
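The "all locals" case could look roughly like the following sketch. It assumes UASM/JWasm-style syntax with -win64 and a frame where the local block starts at a known offset from RBP. LOCAL_BYTES and ZeroDemo are hypothetical, the code is untested, and a real macro would compute address and count from the local names:

```asm
option prologue:none
option epilogue:none

LOCAL_BYTES equ 40h                ; hypothetical size of the local block, multiple of 16

.code

ZeroDemo proc
    push rbp
    mov  rbp, rsp
    sub  rsp, LOCAL_BYTES          ; locals occupy [rbp-LOCAL_BYTES] .. [rbp-1]
    push rdi                       ; RDI is callee-saved in the Win64 ABI

    lea  rdi, [rbp-LOCAL_BYTES]    ; lowest address of the local block
    xor  eax, eax                  ; fill value 0
    mov  ecx, LOCAL_BYTES          ; byte count
    rep  stosb                     ; relies on DF = 0, which the ABI expects on entry

    mov  rdi, [rbp-LOCAL_BYTES-8]  ; reload the saved RDI from its known slot
    mov  rsp, rbp
    pop  rbp
    ret
ZeroDemo endp

end
```

Zeroing only a range of locals changes just the start address and the count. This is where keeping the declared order pays off, because a contiguous range of names maps to a contiguous range of bytes.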


Please don´t get me wrong, i do not expect you to code this for me, just because i want to have my way! I´m willing to do it myself, but currently i cannot, because:
- possibly something goes wrong inside UASM implementing a custom PROLOGUE/EPILOGUE, see my example above
- i´m doing something wrong, so would you please help doing it right
- both


This is not an easy matter, i know and it gives room for discussions, but please hang on - thanks,


JK

johnsa

Ok, well I'd strongly advise against doing pushes and pops in 64bit, or any sort of manual stack manipulation, it's likely to be error prone and give you hard to find bugs and I can't think of a good reason to do it.
That said, feel free to do whatever you want :) The point of assembler is not to enforce controls, so in that spirit I'll check out the issue with the custom prologue and see if I can help there.

option zerolocals may still be useful too, perhaps we have both LOCALZ and the OPTION, although these days I'm tending to avoid adding more directive complexity than required. Happy to hear votes on the subject as to which is preferred.

I will have a look at the re-ordering of locals again too.

jj2007

Quote from: johnsa on April 06, 2021, 01:51:20 AM
Ok, well I'd strongly advise against doing pushes and pops in 64bit, or any sort of manual stack manipulation, it's likely to be error prone and give you hard to find bugs and I can't think of a good reason to do it.

The "good reason" is that saving regs to global variables bloats the exe enormously. The only reason against pushing is that it can't be done (really, it's forbidden!) if there is any call in the proc.

The bloat argument applies also to locals if their total size exceeds 128 bytes, as discussed in ZeroLocals.

The same logic applies to the ordering of locals: if you put the "big" locals first, such as buffer[128]:byte, then all other variables use the long encodings, which are 3 bytes longer: bloat. Hutch has a different opinion...
Quote from: hutch-- on April 01, 2021, 10:10:16 PM
Better bloat than broken. If you are dealing with instructions that need alignment you have no choice.
... but I won't change my coding style: size matters, because the code cache is limited. That's also an argument to use rbp instead of rsp for the frame: all [rsp+x] instructions are one byte longer than [rbp+x].

Since I care for compatibility between UAsm, AsmC and MASM, I will not suggest to introduce an "align 16 once you encounter the first variable that needs it" :cool:
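For illustration, the encodings behind both size claims. The bytes follow the standard x86-64 ModRM/SIB rules, and the first line also appears in the listings earlier in this thread:

```asm
    mov eax, dword ptr [rbp+40h]   ; 8B 45 40           3 bytes: disp8, no SIB byte
    mov eax, dword ptr [rsp+40h]   ; 8B 44 24 40        4 bytes: RSP as base needs a SIB byte
    mov eax, dword ptr [rbp-80h]   ; 8B 45 80           3 bytes: -128 still fits into disp8
    mov eax, dword ptr [rbp-84h]   ; 8B 85 7C FF FF FF  6 bytes: disp32, i.e. 3 bytes longer
```

Everything within -128..+127 of RBP gets the short form, so the locals closest to RBP are the cheapest to address.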

JK

Quote: Happy to hear votes on the subject as to which is preferred

If LOCALZ is meant to work like LOCAL but to simultaneously zero the listed variables, i would prefer it over an OPTION:ZEROLOCALS, because it gives more control.


What i would prefer most, is having E/RBP pointing to a fixed location inside the stack frame (preferably to the top of locals, but by all means having a constant offset to the top of locals and to the procedure´s arguments) after the PROLOGUE with E/RBP based stackframes. This makes debugging so much easier, if you cannot have symbols.

Currently RBP´s offset to the top of locals (i think i remember ESP being stable in this respect) is different in different procedures. This makes debugging harder than necessary, because for each and every procedure, i must look where locals and arguments start in relation to RBP. It would be so much easier, if i simply could rely on:
- locals start at RBP - some fixed offset
- arguments start at RBP + some other fixed offset

This is exactly what my proposed stack frame layout ensures. I think you had a hard time developing WIN64:15, a lot of calculations must be done for optimizing out RBP entirely - and you made it work. It is great to have such an option! I´m not against optimization options, but (at least for me) in the development phase, it´s a nightmare debugging such code.

In general my coding plan is: first make it work, keep it simple, don´t make it overly complicated, be sure you can debug it. And if it works, make it better, faster, smaller, whatever ...


Thanks for your help!


JK 

jj2007

Quote: It would be so much easier, if i simply could rely on:
- locals start at RBP - some fixed offset
- arguments start at RBP + some other fixed offset

That's what you get with (for example) the JBasic prolog macro: [rbp-4] is the first local, [rbp+10h] the first argument.

nidud

#8
deleted

JK

@jj,

Quote: The "good reason" is that saving regs to global variables bloats the exe enormously
i agree on this and the next paragraphs, but i disagree on this
Quote: The only reason against pushing is that it can't be done (really, it's forbidden!) if there is any call in the proc.
Maybe i´m wrong, but IMHO it depends on how you build a 64 bit stack frame.

- i agree, a RSP based stack frame will obviously not work.
- but a RBP based stack frame, which makes enough room (sub RSP, ...) for the shadow space and arguments (either push arg 5 and higher, or sub RSP, ... + mov [RSP+...], ...) before the call and corrects the stack afterwards (add RSP, ...) doesn´t have these restrictions to my understanding.

You must make RSP align 16, before a call, because you cannot know, if the called (external) procedure uses locals, which actually need alignment. If such a procedure is called with wrong RSP alignment, the alignment of these locals will now become wrong as well. So depending on the procedure you might get away with RSP align 8 before a call or not. But if the stack is built like i just described, you will always get away, if you make sure RSP is aligned 16 before a call, regardless how many pushes and pops there were in between (of course you must not pop more than you pushed)


JK


daydreamer

Quote from: jj2007 on April 06, 2021, 02:30:56 AM
Quote from: johnsa on April 06, 2021, 01:51:20 AM
Ok, well I'd strongly advise against doing pushes and pops in 64bit, or any sort of manual stack manipulation, it's likely to be error prone and give you hard to find bugs and I can't think of a good reason to do it.

The "good reason" is that saving regs to global variables bloats the exe enormously. The only reason against pushing is that it can't be done (really, it's forbidden!) if there is any call in the proc.

The bloat argument applies also to locals if their total size exceeds 128 bytes, as discussed in ZeroLocals.

The same logic applies to the ordering of locals: if you put the "big" locals first, such as buffer[128]:byte, then all other variables use the long encodings, which are 3 bytes longer: bloat. Hutch has a different opinion...
Quote from: hutch-- on April 01, 2021, 10:10:16 PM
Better bloat than broken. If you are dealing with instructions that need alignment you have no choice.
... but I won't change my coding style: size matters, because the code cache is limited. That's also an argument to use rbp instead of rsp for the frame: all [rsp+x] instructions are one byte longer than [rbp+x].

Since I care for compatibility between UAsm, AsmC and MASM, I will not suggest to introduce an "align 16 once you encounter the first variable that needs it" :cool:
But pushes and pops in 32 bit are one-byte instructions, unlike most other instructions. Not only size matters, speed does too, indirectly.
With the movsb/stosb snippet for copying or zeroing a local array, you can directly reuse rdi as a pointer afterwards.



 

jj2007

Quote from: nidud on April 06, 2021, 03:06:35 AM
You may get the size of this using the @ReservedStack variable.

Ups... error A2006:undefined symbol : @ReservedStack

LiaoMi

Quote from: jj2007 on April 06, 2021, 06:03:39 AM
Quote from: nidud on April 06, 2021, 03:06:35 AM
You may get the size of this using the @ReservedStack variable.

Ups... error A2006:undefined symbol : @ReservedStack

https://www.japheth.de/JWasm/Manual.html

3.9 Directive OPTION WIN64
Directive OPTION WIN64 allows to set parameters for the Win64 output format if this format (see -win64 cmdline option) is selected. For other output formats, this option has no effect. The syntax for the directive is:
OPTION WIN64: switches
accepted values for switches are:
Store Register Arguments [ bit 0 ]:
- 0: the "home locations" (also sometimes called "shadow space") of the first 4 register parameters are uninitialized. This is the default setting.
- 1: register contents of the PROC's first 4 parameters (RCX, RDX, R8 and R9 ) will be copied to the "home locations" within a PROC's prologue.
INVOKE Stack Space Reservation [bit 1]:
- 0: for each INVOKE the stack is adjusted to reserve space for the parameters required for the call. After the call, the space is released again. This is the default setting.
- 1: the maximum stack space required by all INVOKEs inside a procedure is computed by the assembler and reserved once on the procedure's entry. It's released when the procedure is exited. If INVOKEs are to be used outside of procedures, the stack space has to be reserved manually!
Note: an assembly time variable, @ReservedStack, is created internally when this option is set. It will reflect the value computed by the assembler. It should also be mentioned that when this option is on, and a procedure contains no INVOKEs at all, then nevertheless the minimal amount of 4*8 bytes is reserved on the stack.
Warning: You should have understood exactly what this option does BEFORE you're using it. Using PUSH/POP instruction pairs to "save" values across an INVOKE is VERBOTEN if this option is on.

https://github.com/Terraspace/UASM/blob/master/procJWasm.c
/* v2.11: use @ReservedStack only if option win64:2 is set */

JK

Quote: 1: the maximum stack space required by all INVOKEs inside a procedure is computed by the assembler and reserved once on the procedure's entry. It's released when the procedure is exited.
Thanks LiaoMi for clarifying it - this is how i interpret this option as well. Using this option saves you some otherwise necessary sub/add RSP,... , but requires RSP to remain unchanged inside a procedure (after the PROLOGUE, before the EPILOGUE). This is one way of managing stack frames in 64 bit, but it is not the only way to do it and it´s not the way i want it.

Quote: Warning: You should have understood exactly what this option does BEFORE you're using it. Using PUSH/POP instruction pairs to "save" values across an INVOKE is VERBOTEN if this option is on
i absolutely agree, this is why i used WIN64:5 (bit 1 = 0, meaning this option is off) in my code example and this is why i´m trying to implement my custom PROLOGUE and EPILOGUE. Among other things i want to be able to change RSP inside a procedure (i know i must keep 16-byte alignment before calls).


Basically you can build your (RBP based) 64 bit stack frames just like you do it in 32 bit:
- push arguments (+ shadow space in 64 bit for argument 1 to 4),
- the call pushes the return address
      BTW. this is what INVOKE does for me until here anyway. At this point the assembler could precalculate the space needed for locals of this procedure
     and for the highest number of arguments of all INVOKEs in this procedure and set RSP accordingly (including 16-byte alignment), but it isn´t obliged to do so.
     Advantage: no further stack adjustment needed for all INVOKEs in this procedure
     Disadvantage: RSP MUST NOT be changed, no further local allocations or push/pop possible

- save registers to shadow space (optional)
- push RBP
- mov RBP, RSP
- make room for locals (sub RSP, ...)
- push non-volatile registers (this could be done before or after "push RBP" as well - at any rate correct alignment of RSP must be ensured as a result).

Now RSP is at the bottom of all data which must not be overwritten. It can be freely used for whatever i want (as long as 16-byte alignment is ensured before calls). According to what i read about the 64 bit ABI, such a layout is not forbidden. No one forces you to build a stack frame in a way that you must not change RSP inside procedures. This is a decision taken for optimisation reasons, but it´s not a must IMHO.
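The per-call reservation described above can be sketched as follows for a call with five integer arguments. It assumes UASM/JWasm-style syntax with -win64. Five and Caller are hypothetical names and the code is untested:

```asm
option prologue:none
option epilogue:none

.code

Five proc                          ; hypothetical callee taking 5 integer arguments
    push rbp
    mov  rbp, rsp
    mov  eax, dword ptr [rbp+30h]  ; arg 5: 8 (old RBP) + 8 (return address) + 20h (shadow space)
    pop  rbp
    ret
Five endp

Caller proc
    push rbp
    mov  rbp, rsp                  ; RSP is 16-byte aligned here
    sub  rsp, 30h                  ; 20h shadow space + 8 for arg 5 + 8 padding
    mov  ecx, 1                    ; arg 1
    mov  edx, 2                    ; arg 2
    mov  r8d, 3                    ; arg 3
    mov  r9d, 4                    ; arg 4
    mov  qword ptr [rsp+20h], 5    ; arg 5 sits directly above the shadow space
    call Five
    add  rsp, 30h                  ; release the call area again
    mov  rsp, rbp
    pop  rbp
    ret
Caller endp

end
```

The call area exists only around the call, so pushes made earlier in the body stay above it and cannot be overwritten by the callee. The price is one sub/add pair per call.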


JK 

nidud

#14
deleted