I fear this is going to be a somewhat lengthy story, so i split it into digestible parts ...
I would like to have what i would call a "clean" stackframe in 32 and 64 bit, where "clean" means:
1.) E/RBP based stackframe
2.) arguments and locals work by name
3.) top of locals is at or has a constant (the same in every procedure) offset to E/RBP
4.) order of locals is not changed, i.e locals appear on the stack in the same order as they were defined in code
5.) no need to keep the stack balanced before leaving (epilogue restores E/RSP automatically)
Why?
1.) E/RSP is "free" for me to use, e.g. create temp storage on the stack on the fly
2.) makes coding so much easier, everything is relative to E/RBP and the assembler
calculates the displacement for me
3.) i can rely on the fact, that my locals are at a distinct place relative to E/RBP,
which allows e.g. for easily setting all or certain parts of them to zero or similar
4.) same as 3.)
5.) a mov E/RBP, E/RSP, mov E/RSP, E/RBP pair reliably restores the stack before leaving,
meaning i can do all kinds of "things" in my procedure with the stack pointer
(exception: in 64 bit it must be 16-byte aligned before calls) and i´m not obliged to keep
track and clean up.
I know playing with the stack must follow rules, i cannot do literally everything, but as long as i play by the rules, i can do everything (even in 64 bit).
Basically such a stackframe would look like this:
; bottom of stack
;------------------ <- R/ESP points here | sub R/ESP, # of local bytes (+ alignment in 64 bit)
; local ... |
;------------------ |
; local 2 |
;------------------ |
; local 1 (DWORD) | -> local 1 = EBP-4 / RBP-4
;------------------ <- R/EBP points here | mov R/EBP, R/ESP -> arg 1 = EBP + 8 / RBP + 10h
; old R/EBP |
;------------------ | push R/EBP
; return address |
;------------------ | call...
; arg 1 |
;------------------ |
; arg 2 |
;------------------ |
; arg 3 |
;------------------ |
; arg 4 |
;------------------ |
; arg 5 |
;------------------ |
; arg 6 |
;------------------ |
; ... |
;------------------
; top of stack
The basic layout could be the same in 32 and 64 bit. In 64 bit the space for arg 1 to arg 4 must always be there, even if there are no arguments. This space (shadow space) can be used by the called procedure to save arg 1 to arg 4, which in 64 bit are passed by register.
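The layout above boils down to simple offset arithmetic, which can be sketched as follows. This is only a model under stated assumptions: a plain "push E/RBP, mov E/RBP, E/RSP" frame, locals kept in declaration order without padding, and the helper names `arg_offset` and `local_offsets` are made up for illustration.

```python
# Sketch: displacements relative to the frame pointer for a classic
# "push xBP / mov xBP, xSP" frame. Helper names are hypothetical.

def arg_offset(n, bits=64):
    """Offset of argument n (1-based) from E/RBP."""
    slot = bits // 8                 # 4 bytes in 32 bit, 8 bytes in 64 bit
    # [xBP] holds the old xBP and [xBP + slot] the return address,
    # so argument 1 sits two slots above the frame pointer.
    return 2 * slot + (n - 1) * slot

def local_offsets(sizes):
    """Negative offsets of locals kept in declaration order (no padding)."""
    offsets, pos = [], 0
    for size in sizes:
        pos -= size
        offsets.append(pos)
    return offsets

print(hex(arg_offset(1, 32)), hex(arg_offset(1, 64)))   # 0x8 0x10
print(local_offsets([4, 4, 8, 8, 8]))                    # [-4, -8, -16, -24, -32]
```

Because both functions depend only on the argument index and the declared sizes, the results are the same in every procedure, which is the property asked for in point 3.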
One thing is still missing: registers to be saved by the callee must be placed on the stack somewhere
There are 3 places, which fit:
- before "old R/EBP"
- after "old R/EBP"
- after locals
In every case it is possible to automatically restore E/RSP to a correct value before leaving
Question: is there any reason why this wouldn´t work? I know that this approach is not optimized, but i want it stable and reliable in the first place. If i want it better, faster, smaller, whatever ... i can always do so by hand or by using special option settings where available
to be continued ...
As most of the time, i started testing with UASM. Unfortunately i cannot get exactly what i want (see preceding post) using the available options. So i tried to write my own PROLOGUE/EPILOGUE. I´m a novice at this, so i might have made mistakes, but i think there is an error in UASM too.
I used the attached code (i left comments of my thoughts and results, which hopefully makes it easier to understand what i did and tried). To make things easier, i didn´t specify registers to save ("uses ...").
Now it´s getting complicated and i hope i can explain my findings in a comprehensible way
Looking at the generated code for "testit proc", i see that arguments are not referenced correctly (RBP displacement is off). This could be a failure in my PROLOGUE/EPILOGUE code, but it isn´t off in a consistent manner either. Comparing this with what is generated without my custom PROLOGUE/EPILOGUE, i can see the displacements are different. But the difference between displacements is different as well, which should not be the case, even if the basic error is in my code.
with my custom PROLOGUE/EPILOGUE:(RBP consistently points to top of locals, order of locals is kept):
000000000124101B | 48:894C24 08 | mov qword ptr ss:[rsp+8],rcx |
0000000001241020 | 66:0FD64C24 10 | movq qword ptr ss:[rsp+10],xmm1 |
0000000001241026 | 66:0FD65424 18 | movq qword ptr ss:[rsp+18],xmm2 |
000000000124102C | 4C:894C24 20 | mov qword ptr ss:[rsp+20],r9 |
0000000001241031 | 55 | push rbp |
0000000001241032 | 48:8BEC | mov rbp,rsp |
0000000001241035 | 48:83EC 20 | sub rsp,20 |
0000000001241039 | 8B45 40 | mov eax,dword ptr ss:[rbp+40] | x +40 arg5
000000000124103C | 8B45 20 | mov eax,dword ptr ss:[rbp+20] | d4 +20 arg2: diff = 3 -> 3 x 8h = 18h -> not ok, is 20h
000000000124103F | 8B45 10 | mov eax,dword ptr ss:[rbp+10] |
0000000001241042 | 48:83EC 20 | sub rsp,20 |
0000000001241046 | 48:8D4D E8 | lea rcx,qword ptr ss:[rbp-18] |
000000000124104A | E8 B1FFFFFF | call uasm_test_64.1241000 |
000000000124104F | 48:83C4 20 | add rsp,20 |
0000000001241053 | 48:8D45 40 | lea rax,qword ptr ss:[rbp+40] |
0000000001241057 | 48:8D55 20 | lea rdx,qword ptr ss:[rbp+20] |
000000000124105B | 48:8D75 28 | lea rsi,qword ptr ss:[rbp+28] |
000000000124105F | 48:8D7D 30 | lea rdi,qword ptr ss:[rbp+30] |
0000000001241063 | 48:8D45 FC | lea rax,qword ptr ss:[rbp-4] |
0000000001241067 | 48:8D5D E8 | lea rbx,qword ptr ss:[rbp-18] |
000000000124106B | 48:8D4D E4 | lea rcx,qword ptr ss:[rbp-1C] |
000000000124106F | 48:8BE5 | mov rsp,rbp |
0000000001241072 | 5D | pop rbp |
0000000001241073 | C3 | ret |
without a custom PROLOGUE/EPILOGUE (caveat here is: locals are changed in order and locals don´t start at a constant offset from RBP, offset is different for each procedure):
00000000013C1024 | 48:894C24 08 | mov qword ptr ss:[rsp+8],rcx |
00000000013C1029 | F3:0F114C24 10 | movss dword ptr ss:[rsp+10],xmm1 |
00000000013C102F | F2:0F115424 18 | movsd qword ptr ss:[rsp+18],xmm2 |
00000000013C1035 | 4C:894C24 20 | mov qword ptr ss:[rsp+20],r9 |
00000000013C103A | 48:55 | push rbp |
00000000013C103C | 48:83EC 20 | sub rsp,20 |
00000000013C1040 | 48:8D6C24 10 | lea rbp,qword ptr ss:[rsp+10] |
00000000013C1045 | 8B45 40 | mov eax,dword ptr ss:[rbp+40] | x +40, arg5
00000000013C1048 | 8B45 28 | mov eax,dword ptr ss:[rbp+28] | d4 +28, arg2 diff = 3 -> 3 x 8h = 18h -> ok
00000000013C104B | 8B45 20 | mov eax,dword ptr ss:[rbp+20] |
00000000013C104E | 48:83EC 20 | sub rsp,20 |
00000000013C1052 | 48:8D4D 00 | lea rcx,qword ptr ss:[rbp] |
00000000013C1056 | E8 A5FFFFFF | call uasm_test_64.13C1000 |
00000000013C105B | 48:83C4 20 | add rsp,20 |
00000000013C105F | 48:8D45 40 | lea rax,qword ptr ss:[rbp+40] |
00000000013C1063 | 48:8D55 28 | lea rdx,qword ptr ss:[rbp+28] |
00000000013C1067 | 48:8D75 30 | lea rsi,qword ptr ss:[rbp+30] |
00000000013C106B | 48:8D7D 38 | lea rdi,qword ptr ss:[rbp+38] |
00000000013C106F | 48:8D45 FC | lea rax,qword ptr ss:[rbp-4] |
00000000013C1073 | 48:8D5D 00 | lea rbx,qword ptr ss:[rbp] |
00000000013C1077 | 48:8D4D F8 | lea rcx,qword ptr ss:[rbp-8] |
00000000013C107B | 48:8D65 10 | lea rsp,qword ptr ss:[rbp+10] |
00000000013C107F | 5D | pop rbp |
00000000013C1080 | C3 | ret |
So while the offset in RBP displacement for arguments could be caused by an error in my PROLOGUE code, the distance between arguments should be the same in both cases, but it isn´t!
testit proc (default PROLOGUE, OPTION STACKBASE RBP, OPTION WIN64:5):
RBP = RSP + 10h after PROLOGUE -> arguments are resolved correctly
argument x: RBP+40, (arg # 5)
argument d4: RBP+28, (arg # 2) diff = 3 -> 3 x 8h = 18h -> ok (expected)
testit proc (custom PROLOGUE, OPTION STACKBASE RBP, OPTION WIN64:5):
RBP = RSP + 20h after PROLOGUE -> arguments are 10h off
argument x : RBP+40, (arg # 5)
argument d4: RBP+20, (arg # 2) diff = 3 -> 3 x 8h = 18h -> in fact it is 20h, which is wrong
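The comparison can be written down as plain arithmetic. The sketch below only uses the displacements visible in the two listings (x is argument 5, d4 is argument 2) and assumes every argument occupies one 8-byte slot; the names `SLOT`, `expected_distance` and `observed` are illustrative.

```python
SLOT = 8  # assumption: every Win64 argument occupies one 8-byte stack slot

def expected_distance(arg_a, arg_b):
    """Distance in bytes between two argument slots (1-based indices)."""
    return abs(arg_a - arg_b) * SLOT

# RBP displacements read from the two listings
observed = {
    "default prologue": {"x": 0x40, "d4": 0x28},
    "custom prologue":  {"x": 0x40, "d4": 0x20},
}

for name, disp in observed.items():
    distance = disp["x"] - disp["d4"]
    verdict = "ok" if distance == expected_distance(5, 2) else "mismatch"
    print(f"{name}: {distance:#x} vs expected {expected_distance(5, 2):#x} -> {verdict}")
```

A constant shift of RBP would change both displacements by the same amount and leave the distance untouched, so only the distance check says something about how the arguments themselves are resolved.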
additional Question: the value returned by the PROLOGUE doesn´t seem to have any effect on the code generated,
e.g. returning different numbers or <0> doesn´t change anything. OTOH returning nothing throws an error. So what is this return value for?
JK
I think there are a number of issues here...
1. if you use stackbase:RSP mode, then that gives you RBP free for use (if you really need an extra register).
2. the automatic prologue/epi. does a lot more than just deal with params and locals, there is alignment to consider as well as pre-allocation for the largest contained invoke within the proc
.. ie if an invoke needs to reserve 128 bytes of stack for its params, the prologue in the parent proc already handles this which I don't think you can replicate via a custom prologue macro.
3. The locals are re-arranged to ensure that they can be packed and aligned efficiently.
4. The home-space slots are only filled if the parameter is actually used (a minor optimisation to avoid copying the reg params if they're unused, or used directly via register)
The zero'ing of locals as a batch is an issue, I removed option zerolocals as it wasn't fully implemented and it's not optimal as you frequently don't need to zero all of them, but only specific ones.
I was considering adding a new directive, LOCALZ ie. which would produce the code to zero just the specific local be it primitive/struct or array etc.
Assuming you use stackbase RSP and had LOCALZ directive, that should cover all your requirements?
Thanks for your reply!
Quote: 1. if you use stackbase:RSP mode, then that gives you RBP free for use (if you really need an extra register).
I want RSP to be free in a sense so that i can do pushes and pops in my procedures. It´s not about gaining an extra register, there are more than enough in 64 bit.
Quote: 2. the automatic prologue/epi. does a lot more than just deal with params and locals, there is alignment to consider as well as pre-allocation for the largest contained invoke within the proc
.. ie if an invoke needs to reserve 128 bytes of stack for its params, the prologue in the parent proc already handles this which I don't think you can replicate via a custom prologue macro.
I have seen that, but this kind of stackframe layout (with pre-allocation for the largest contained invoke within the proc) makes it impossible to do pushes and pops inside the current procedure, because depending on the space needed the pushed data might be overwritten by the next stackframe (64 bit). This is why i would like to have an RBP based stackframe (which does sub RSP, offset before the call and add RSP, offset after the call (essentially what WIN64:1 to 7 does, but without the "downsides" (my personal view) i mentioned)). The price is less optimized code, but i´m willing to pay it.
Quote: 3. The locals are re-arranged to ensure that they can be packed and aligned efficiently
Yes, but this way you must fill each local you want to zero one by one, which is far from efficient. By keeping the order, space is wasted, that´s true, but OTOH in 64 bit there is more than enough space - much more than we ever had in 32 bit.
Quote: 4. The home-space slots are only filled if the parameter is actually used (a minor optimisation to avoid copying the reg params if they're unused, or used directly via register)
This is a quite useful feature IMHO, because it happens automatically. In my code i must do it by hand "<rxxr>", but it´s still possible.
There is no need to always zero all locals, but sometimes it would make things easier. My idea was writing a macro (zerolocals) taking none, one or two parameters.
- if no parameter is given, all locals are set to zero
- if one parameter is given, all locals from start to (and including) this local are set to zero
- if two parameters are given, all locals starting with the first given local up to the second are set to zero
By arranging my locals in an appropriate order i can ensure that this can be done very effectively. I can optimize this process inside my macro (mov a few bytes vs. rep stosb), i can even repeat this inside my procedure with different parameters for different locals in different places.
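The three parameter forms can be modelled as a range computation over the ordered locals. This is a rough sketch, not an implementation of the macro: it assumes locals stay in declaration order with the first local directly below RBP, no padding, and that "up to the second" includes the second local. `zero_range` and the example locals are hypothetical.

```python
def zero_range(locals_, first=None, last=None):
    """locals_: list of (name, size) in declaration order.
    Returns (lowest offset from RBP, number of bytes to zero)."""
    names = [name for name, _ in locals_]
    if first is None:                  # no parameter: all locals
        lo, hi = 0, len(locals_) - 1
    elif last is None:                 # one parameter: start .. that local
        lo, hi = 0, names.index(first)
    else:                              # two parameters: first .. last (inclusive)
        lo, hi = names.index(first), names.index(last)
    top = -sum(size for _, size in locals_[:lo])        # upper edge of the range
    count = sum(size for _, size in locals_[lo:hi + 1])
    return top - count, count

# hypothetical locals: a:dword, b:dword, c:qword, d[16]:byte
frame = [("a", 4), ("b", 4), ("c", 8), ("d", 16)]
print(zero_range(frame))              # (-32, 32)
print(zero_range(frame, "b"))         # (-8, 8)
print(zero_range(frame, "b", "c"))    # (-16, 12)
```

The returned pair is what a macro would need to choose between a few mov instructions and a rep stosb: a start address relative to RBP and a byte count. This only works as one contiguous block because the order of locals is kept.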
Please don´t get me wrong, i do not expect you to code this for me, just because i want to have my way! I´m willing to do it myself, but currently i cannot, because:
- possibly something goes wrong inside UASM implementing a custom PROLOGUE/EPILOGUE, see my example above
- i´m doing something wrong, so would you please help doing it right
- both of it
This is not an easy matter, i know and it gives room for discussions, but please hang on - thanks,
JK
Ok, well I'd strongly advise against doing pushes and pops in 64bit, or any sort of manual stack manipulation, it's likely to be error prone and give you hard to find bugs and I can't think of a good reason to do it.
That said, feel free to do whatever you want :) The point of assembler is not to enforce controls, so in that spirit I'll check out the issue with the custom prologue and see if I can help there.
option zerolocals may still be useful too, perhaps we have both LOCALZ and the OPTION, although these days I'm tending to avoid adding more directive complexity than required. Happy to hear votes on the subject as to which is preferred.
I will have a look at the re-ordering of locals again too.
Quote from: johnsa on April 06, 2021, 01:51:20 AM
Ok, well I'd strongly advise against doing pushes and pops in 64bit, or any sort of manual stack manipulation, it's likely to be error prone and give you hard to find bugs and I can't think of a good reason to do it.
The "good reason" is that saving regs to global variables bloats the exe enormously. The only reason against pushing is that it can't be done (really, it's forbidden!) if there is any call in the proc.
The bloat argument applies also to locals if their total size exceeds 128 bytes, as discussed in ZeroLocals (http://masm32.com/board/index.php?topic=9258.msg101786#msg101786).
The same logic applies to the ordering of locals: if you put the "big" locals first, such as buffer[128]:byte, then all other variables use the long encodings, which are 3 bytes longer: bloat. Hutch has a different opinion...
Quote from: hutch-- on April 01, 2021, 10:10:16 PM
Better bloat than broken. If you are dealing with instructions that need alignment you have no choice.
... but I won't change my coding style: size matters, because the code cache is limited. That's also an argument to use rbp instead of rsp for the frame: all [rsp+x] instructions are one byte longer than [rbp+x].
Since I care for compatibility between UAsm, AsmC and MASM, I will not suggest to introduce an "align 16 once you encounter the first variable that needs it" :cool:
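The size argument follows from how x86 memory operands are encoded. A minimal sketch, counting only the ModRM, SIB and displacement bytes of a simple [base+disp] operand (opcode and prefixes excluded); the function name `addr_bytes` is made up for this illustration.

```python
def addr_bytes(base, disp):
    """Bytes for ModRM + optional SIB + displacement of [base + disp]."""
    size = 1                                   # ModRM byte
    if base in ("esp", "rsp", "r12"):
        size += 1                              # these bases always need a SIB byte
    if disp == 0 and base not in ("ebp", "rbp", "r13"):
        return size                            # mod=00: no displacement at all
    if -128 <= disp <= 127:
        return size + 1                        # disp8
    return size + 4                            # disp32

print(addr_bytes("rbp", -4))      # 2
print(addr_bytes("rsp", 8))       # 3
print(addr_bytes("rbp", -0x90))   # 5
```

A local further than 128 bytes below rbp switches from disp8 to disp32, which is the 3-byte penalty, and an rsp base adds one SIB byte compared to rbp for the same non-zero displacement.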
Quote: Happy to hear votes on the subject as to which is preferred
If LOCALZ is meant to work like LOCAL but to simultaneously zero the listed variables, i would prefer it over an OPTION:ZEROLOCALS, because it gives more control.
What i would prefer most, is having E/RBP pointing to a fixed location inside the stack frame (preferably to the top of locals, but by all means having a constant offset to the top of locals and to the procedure´s arguments) after the PROLOGUE with E/RBP based stackframes. This makes debugging so much easier, if you cannot have symbols.
Currently RBP´s offset to the top of locals (i think i remember ESP being stable in this respect) is different in different procedures. This makes debugging harder than necessary, because for each and every procedure, i must look where locals and arguments start in relation to RBP. It would be so much easier, if i simply could rely on:
- locals start at RBP - some fixed offset
- arguments start at RBP + some other fixed offset
This is exactly what my proposed stack frame layout ensures. I think you had a hard time developing WIN64:15, a lot of calculations must be done for optimizing out RBP at all - and you made it work. It is great to have such an option! I´m not against optimization options, but (at least for me) in the development phase, it´s a nightmare debugging such code.
In general my coding plan is: first make it work, keep it simple, don´t make it overly complicated, be sure you can debug it. And if it works, make it better, faster, smaller, whatever ...
Thanks for your help!
JK
Quote: It would be so much easier, if i simply could rely on:
- locals start at RBP - some fixed offset
- arguments start at RBP + some other fixed offset
That's what you get with (for example) the JBasic prolog macro: [rbp-4] is the first local, [rbp+10h] the first argument.
@jj,
Quote: The "good reason" is that saving regs to global variables bloats the exe enormously
i agree on this and the next paragraphs, but i disagree on this
Quote: The only reason against pushing is that it can't be done (really, it's forbidden!) if there is any call in the proc.
Maybe i´m wrong, but IMHO it depends on how you build a 64 bit stack frame.
- i agree, a RSP based stack frame will obviously not work.
- but a RBP based stack frame, which makes enough room (sub RSP, ...) for the shadow space and arguments (either push arg 5 and higher, or sub RSP, ... + mov [RSP+...], ...) before the call and corrects the stack afterwards (add RSP, ...) doesn´t have these restrictions to my understanding.
You must make RSP align 16, before a call, because you cannot know, if the called (external) procedure uses locals, which actually need alignment. If such a procedure is called with wrong RSP alignment, the alignment of these locals will now become wrong as well. So depending on the procedure you might get away with RSP align 8 before a call or not. But if the stack is built like i just described, you will always get away, if you make sure RSP is aligned 16 before a call, regardless how many pushes and pops there were in between (of course you must not pop more than you pushed)
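The amount to subtract before a call, given any current RSP, can be sketched like this. It is a model only, assuming the Win64 convention (4 home slots always reserved, 8 bytes per argument slot, RSP a multiple of 16 at the call instruction); `call_reserve` is a hypothetical helper name.

```python
def call_reserve(rsp, nargs):
    """Bytes to subtract from RSP right before a Win64 call so that
    shadow space and stack arguments fit and RSP is 16-byte aligned."""
    need = max(4, nargs) * 8           # the 4 home slots are always reserved
    new_rsp = (rsp - need) & ~0xF      # round down to a multiple of 16
    return rsp - new_rsp

print(hex(call_reserve(0x1000, 1)))    # 0x20  (even number of pushes so far)
print(hex(call_reserve(0x0FF8, 1)))    # 0x28  (odd number of pushes so far)
print(hex(call_reserve(0x1000, 5)))    # 0x30  (5 arguments, padded to 16)
```

An assembler cannot know RSP at assembly time if pushes and pops are allowed in between, so at run time the same effect would need an explicit alignment step (or the programmer keeping count of the pushes); the model just shows that a correct value exists for every push count.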
JK
Quote from: jj2007 on April 06, 2021, 02:30:56 AM
Quote from: johnsa on April 06, 2021, 01:51:20 AM
Ok, well I'd strongly advise against doing pushes and pops in 64bit, or any sort of manual stack manipulation, it's likely to be error prone and give you hard to find bugs and I can't think of a good reason to do it.
The "good reason" is that saving regs to global variables bloats the exe enormously. The only reason against pushing is that it can't be done (really, it's forbidden!) if there is any call in the proc.
The bloat argument applies also to locals if their total size exceeds 128 bytes, as discussed in ZeroLocals (http://masm32.com/board/index.php?topic=9258.msg101786#msg101786).
The same logic applies to the ordering of locals: if you put the "big" locals first, such as buffer[128]:byte, then all other variables use the long encodings, which are 3 bytes longer: bloat. Hutch has a different opinion...
Quote from: hutch-- on April 01, 2021, 10:10:16 PM
Better bloat than broken. If you are dealing with instructions that need alignment you have no choice.
... but I won't change my coding style: size matters, because the code cache is limited. That's also an argument to use rbp instead of rsp for the frame: all [rsp+x] instructions are one byte longer than [rbp+x].
Since I care for compatibility between UAsm, AsmC and MASM, I will not suggest to introduce an "align 16 once you encounter the first variable that needs it" :cool:
But pushes and pops in 32 bit are one-byte instructions, unlike most other instructions; it is not only size that matters, speed benefits indirectly as well.
With the movsb/stosb snippet for copying or zeroing a local array, you can directly reuse rdi as a pointer afterwards.
Quote from: nidud on April 06, 2021, 03:06:35 AM
You may get the size of this using the @ReservedStack variable.
Ups... error A2006:undefined symbol : @ReservedStack
Quote from: jj2007 on April 06, 2021, 06:03:39 AM
Quote from: nidud on April 06, 2021, 03:06:35 AM
You may get the size of this using the @ReservedStack variable.
Ups... error A2006:undefined symbol : @ReservedStack
https://www.japheth.de/JWasm/Manual.html
3.9 Directive OPTION WIN64
Directive OPTION WIN64 allows to set parameters for the Win64 output format if this format (see -win64 cmdline option) is selected. For other output formats, this option has no effect. The syntax for the directive is:
OPTION WIN64: switches
accepted values for switches are:
Store Register Arguments [ bit 0 ]:
- 0: the "home locations" (also sometimes called "shadow space") of the first 4 register parameters are uninitialized. This is the default setting.
- 1: register contents of the PROC's first 4 parameters (RCX, RDX, R8 and R9 ) will be copied to the "home locations" within a PROC's prologue.
INVOKE Stack Space Reservation [bit 1]:
- 0: for each INVOKE the stack is adjusted to reserve space for the parameters required for the call. After the call, the space is released again. This is the default setting.
- 1: the maximum stack space required by all INVOKEs inside a procedure is computed by the assembler and reserved once on the procedure's entry. It's released when the procedure is exited. If INVOKEs are to be used outside of procedures, the stack space has to be reserved manually!
Note: an assembly time variable, @ReservedStack, is created internally when this option is set. It will reflect the value computed by the assembler. It should also be mentioned that when this option is on, and a procedure contains no INVOKEs at all, then nevertheless the minimal amount of 4*8 bytes is reserved on the stack.
Warning: You should have understood exactly what this option does BEFORE you're using it. Using PUSH/POP instruction pairs to "save" values across an INVOKE is VERBOTEN if this option is on.
https://github.com/Terraspace/UASM/blob/master/procJWasm.c
/* v2.11: use @ReservedStack only if option win64:2 is set */
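As a rough model of what the manual text describes for bit 1 (not the assembler's actual code), the reserved size is the largest requirement of any INVOKE in the procedure, with a floor of 4*8 bytes. The function name `reserved_stack` is hypothetical, and any extra rounding the assembler may apply for alignment is deliberately left out.

```python
def reserved_stack(invoke_arg_counts):
    """Model of the once-per-procedure reservation: the largest INVOKE
    decides, but never less than the 4 home slots (4*8 bytes)."""
    slots = max([4] + list(invoke_arg_counts))
    return slots * 8

print(reserved_stack([]))          # 32  (no INVOKE at all)
print(reserved_stack([1, 6]))      # 48  (largest INVOKE has 6 arguments)
print(reserved_stack([16]))        # 128
```

Since this block sits at the bottom of the frame for the whole procedure, anything pushed afterwards lands where the next callee expects its home space, which is why the manual forbids PUSH/POP pairs across an INVOKE in this mode.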
Quote: 1: the maximum stack space required by all INVOKEs inside a procedure is computed by the assembler and reserved once on the procedure's entry. It's released when the procedure is exited.
Thanks LiaoMi for clarifying it - this is how i interpret this option as well. Using this option saves you some otherwise necessary sub/add RSP,... , but requires RSP to remain unchanged inside a procedure (after the PROLOGUE, before the EPILOGUE). This is one way of managing stack frames in 64 bit, but it is not the only way to do it and it´s not the way i want it.
Quote: Warning: You should have understood exactly what this option does BEFORE you're using it. Using PUSH/POP instruction pairs to "save" values across an INVOKE is VERBOTEN if this option is on
i absolutely agree, this is why i used WIN64:5 (bit 1 = 0, meaning this option is off) in my code example and this is why i´m trying to implement my custom PROLOGUE and EPILOGUE. Among other things i want to be able to change RSP inside a procedure (i know i must keep 16-byte alignment before calls).
Basically you can build your (RBP based) 64 bit stack frames just like you do it in 32 bit:
- push arguments (+ shadow space in 64 bit for argument 1 to 4),
- the call pushes the return address
BTW. this is, what INVOKE does for me until here anyway. At this point the assembler could precalculate the space needed for locals of this procedure
and for the highest number of arguments of all INVOKEs in this procedure and set RSP accordingly (including 16-byte alignment), but it isn´t obliged to do so.
Advantage: no further stack adjustment needed for all INVOKEs in this procedure
Disadvantage: RSP MUST NOT be changed, no further local allocations or push/pop possible
- save registers to shadow space (optional)
- push RBP
- mov RBP, RSP
- make room for locals (sub RSP, ...)
- push non-volatile registers (this could be done before or after "push RBP" as well - at any rate correct alignment of RSP must be ensured as a result).
Now RSP is at the bottom of all data, which must not be overwritten. It can be freely used for whatever i want (as long as 16-byte alignment is ensured before calls). According to what i read about the 64 bit ABI, such a layout is not forbidden. No one forces you to build a stack frame in a way that you must not change RSP inside procedures, this is a decision taken for optimisation reasons, but it´s not a must IMHO.
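The steps listed above can be traced with a small stack pointer simulation. This is a sketch under assumptions: RSP is 16-byte aligned at the call instruction, so it is 8 mod 16 on entry, the non-volatile registers are pushed after the locals, and the name `frame_layout` and the entry address are invented for the example.

```python
def frame_layout(local_bytes, saved_regs, entry_rsp=0x1008):
    """Trace RSP through an RBP-based prologue; entry_rsp is RSP right
    after the call pushed the return address (8 mod 16 by assumption)."""
    rsp = entry_rsp
    rsp -= 8                       # push rbp
    rbp = rsp                      # mov rbp, rsp
    rsp -= local_bytes             # sub rsp, <bytes of locals>
    rsp -= 8 * saved_regs          # push non-volatile registers
    pad = rsp % 16                 # bytes still missing for 16-byte alignment
    rsp -= pad
    return {"rbp": rbp, "rsp": rsp, "pad": pad, "frame": rbp - rsp}

layout = frame_layout(local_bytes=0x10, saved_regs=3)
print({k: hex(v) for k, v in layout.items()})
# {'rbp': '0x1000', 'rsp': '0xfd0', 'pad': '0x8', 'frame': '0x30'}
```

With 10h bytes of locals and three saved registers, RSP ends up 8 bytes short of 16-byte alignment, so either the prologue or each call site has to add the padding. RBP is unaffected by that choice, which is what keeps arguments and locals at fixed displacements.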
JK
Quote from: JK on April 06, 2021, 08:29:34 AM
Now RSP is at the bottom of all data, which must not be overwritten. It can be freely used for whatever i want (as long as 16-byte alignment is ensured before calls). According to what i read about the 64 bit ABI, such a layout is not forbidden. No one forces you to build a stack frame in a way that you must not change RSP inside procedures, this is a decision taken for optimisation reasons, but it´s not a must IMHO.
Be careful. Use the FillShadowSpace macro (http://masm32.com/board/index.php?topic=9270.msg101791#msg101791) to see what happens to your shadow space if you call one of the rare WinAPI functions (Sleep, for example) that actually use it :cool:
FillShadowSpace
int 3
push rsi
push rdi
jinvoke Sleep, 100
pop rdi
pop rsi
John has given the same warning that I have: DON'T alter the stack with PUSH/POP instructions. If you really have to preserve data in that manner, use a LOCAL and MOV the data to it in the normal manner,
mov reg, data
mov localname, reg
Unless you enjoy p*ssing around trying to find why that app will not start, don't make a mess of the stack.
A couple of basic things here, creative genius and personal preference certainly have their place but that place is not how the mechanics of the operating system work. Get the mechanics of the OS as they are designed to work THEN apply creative genius and personal preference and you will get what you are after.
Quote from: hutch-- on April 06, 2021, 01:08:31 PM
Unless you enjoy p*ssing around trying to find why that app will not start, don't make a mess of the stack
Practically all apps will start, unfortunately. The problem will bite you much later, as most WinAPI calls don't use the shadow space. Until now, I found only Sleep() being a shadow space user.
This is Win7-64. I wonder how the situation is on Win10, built with more recent compilers... any evidence from the UAsm/AsmC/ML64 developers?
With about 4 years of practice, I have yet to see misaligned procedures do anything else than not start. No doubt the OS loader will do something but the notion of "start" is the app appearing on the screen and with misaligned code via stack blunders, you try and run it and nothing happens and you don't get told anything either.
Effectively the OS loader spits the dummy and the app will not run.
This is on Win 10 64 bit which I have been using for the last 5 years.
Quote from: hutch-- on April 06, 2021, 08:47:59 PM
I have yet to see misaligned procedures do anything else than not start.
Most developers grasp quickly the concept of align 16. The real issue are the subtle problems that may arise when you push+pop pairwise. It looks good because the stack remains aligned for use with xmm regs, and most of the time nothing bad happens, but...
Still learning the hard way...
This is part of the assembly that I have to code in order to research platform SIMD performance benefits in UASM.
Not only do I have to align the stack before a call to a procedure, I also have to reserve stack space for all the default arguments, and only the default arguments, of the platform convention: 4*8+1*8 on Windows, 6*8+1*8 on Linux.
Sometimes using sp has benefits.
Sometimes abusing sp causes segmentation faults.
This runs perfectly...
; Constructor
procstart _uasm_CPUFeatures_Init, callconv, void, < >, < >, infolevel:dword
ifdef __x32__
ifdef __windows__
mov __uasm_dt_CPUFeatures_infolevel, dp0()
xor dp0(), dp0()
endif
ifdef __unix__
mov __uasm_dt_CPUFeatures_infolevel, infolevel
;mov [dp0()+4], null
endif
endif ;__x32__
ifdef __x64__
mov __uasm_dt_CPUFeatures_infolevel, dp0()
xor dp0(), dp0()
endif ;__x64__
ifdef __x32__
push ebx
; detect if cpuid instruction supported by microprocessor:
pushfd
pop eax
btc eax, 21 ; check if cpuid bit can toggle
push eax
popfd
pushfd
pop ebx
xor ebx, eax
bt ebx, 21
jc CPUInitNoID ; cpuid not supported
xor eax, eax ; 0
; /* %eax=00H, %ecx %ebx */
mov __uasm_dt_CPUFeatures_CPUID, true
cpuid ; get number of cpuid functions
test eax, eax
jnz CPUInitIdentificable ; function 1 not supported
CPUInitNoID:
.if (__uasm_dt_CPUFeatures_infolevel >= 1) ;infolevel >= 1
push edi
; processor has no CPUID
mov dword ptr [edi], '8038' ; Write text '80386 or 80486'
mov dword ptr [edi+4], '6 or'
mov dword ptr [edi+8], ' 804'
mov dword ptr [edi+12], '86' ; End with 0
mov __uasm_dt_CPUFeatures_ProcessorName, edi ; Pointer to result
pop edi
.endif
pop ebx
jmp CPUInitEND
endif ;__x32__
ifdef __x32__
pop ebx
CPUInitIdentificable:
push ebp
mov ebp, esp
sub esp, 16 ; 3*4=12+4 Align 8
;mov [esp], esp
mov [esp], ebx
mov [esp+4], esi
mov [esp+8], edi
;push esp
;push ebp
;push ebx
;push esi
;push edi
endif ;__x32__
ifdef __x64__
push rbp
mov rbp, rsp
ifdef __windows__
sub rsp, 64 ; 7*8=56+8 Align 16
else
sub rsp, 48 ; 5*8=40+8 Align 16
endif
;mov [rsp], rsp
mov [rsp], rbx
ifdef __windows__
mov [rsp+8], rsi
mov [rsp+16], rdi
mov [rsp+24], r11
mov [rsp+32], r12
mov [rsp+40], r14
mov [rsp+48], r15
else
mov [rsp+8], r11
mov [rsp+16], r12
mov [rsp+24], r14
mov [rsp+32], r15
endif
;push rbx
;ifdef __windows__
;push rsi
;push rdi
;endif
;push rsp
;push rbp
;push r11
;push r12
;push r14
;push r15
endif ;__x64__
;.........................blablablablablas------------------
;.........................blablablablablas------------------
;.........................blablablablablas------------------
;.........................blablablablablas------------------
;.........................blablablablablas------------------
not_supported:
ifdef __x32__
mov edi, [esp+8]
mov esi, [esp+4]
mov ebx, [esp]
;mov esp, [esp]
add esp, 16
mov esp, ebp
pop ebp
;pop edi
;pop esi
;pop ebx
;pop ebp
;pop esp
endif ;__x32__
ifdef __x64__
ifdef __windows__
mov r15, [rsp+48]
mov r14, [rsp+40]
mov r12, [rsp+32]
mov r11, [rsp+24]
mov rdi, [rsp+16]
mov rsi, [rsp+8]
else
mov r15, [rsp+32]
mov r14, [rsp+24]
mov r12, [rsp+16]
mov r11, [rsp+8]
endif
mov rbx, [rsp]
;mov rsp, [rsp]
ifdef __windows__
add rsp, 64
else
add rsp, 48
endif
mov rsp, rbp
pop rbp
;pop r15
;pop r14
;pop r12
;pop r11
;ifdef __windows__
;pop rdi
;pop rsi
;endif
;pop rbx
;pop rbp
;pop rsp
endif ;__x64__
CPUInitEND: ; finished
ret
procend
public main
main proc (dword) argc:dword, argv:ptr ptr byte, envp:ptr ptr byte
; space for 4 arguments + 16byte aligned stack
sub rsp, 28h
call _uasm_CPUFeatures_Init
call _uasm_CPUFeatures_ProcessorName
mov rp0(), rret()
call printf
mov rp0(), cstr(stringwith, " With caches sizes:"," L1= ","%I64d"," bytes, L2= ", "%I64d"," bytes, L3= ","%I64d"," bytes.")
mov rp1(), __uasm_dt_CPUFeatures_DataCacheSizeL1
mov rp2(), __uasm_dt_CPUFeatures_DataCacheSizeL2
mov rp3(), __uasm_dt_CPUFeatures_DataCacheSizeL3
call printf
mov rp0(), cstr(datawith, 10,"With:",10,0)
call printf
call _uasm_CPUFeatures_Fin
xor eax, eax
xor ecx, ecx
call exit
add rsp, 28h
ret
main endp
@jj (re post #15)
i give another example, consider this code:
;option stackbase:RBP
;option win64:7
include windows.inc
includelib kernel32.lib
includelib user32.lib ;DrawTextEx
.code
testit proc uses rbx rsi rdi, x:dword, f4:real4, z:qword, f8:real4, n:qword
;*************************************************************************************
; proc
;*************************************************************************************
local tx :qword ;-8h size 8
LOCAL ty :qword ;-10h size 8
;total size of locals = 10h
nop
int 3
lea rax, x ;address of first argument
lea rax, tx ;address of first local
mov rax, 0ABCDEFh
push rax
push rax
push rax
push rax
push rax
push rax
invoke Sleep, 10
; invoke DrawTextEx, 0, 0, 0, 0, 0, 0 ;sub rsp,38 -> sub rsp,48 (6 arguments)
pop RAX
pop RAX
pop RAX
pop RAX
pop RAX
pop RAX
ret
testit endp
;*************************************************************************************
start proc uses rbx rsi rdi ;r15
;*************************************************************************************
; main proc
;*************************************************************************************
local x :dword ;-4h size 4
local f4 :real4 ;-8h size 4
local z :Qword ;-10h size 8
local f8 :REAL8 ;-18h size 8
local n :qword ;-20h size 8
local r :RECT ;-30h size 10h
nop ;procedure code starts here
lea rax, x ;address of first local
nop ;invoke starts here
invoke testit, x, 1.4, z, f8, n ;5 arguments
invoke ExitProcess, 0
ret
start endp
end start
Please assemble it once as it is and once with the options set (remove the comment markers from the first two lines). The resulting code differs fundamentally in how the stack pointer moves.
Version 1:
;*************************************************************************************
; without any options set -> RBP points to top of locals, 1. arg is RBP + 10h
;*************************************************************************************
;00000000012C1000 | 55 | push rbp |
;00000000012C1001 | 48:8BEC | mov rbp,rsp |
;00000000012C1004 | 48:83C4 F0 | add rsp,FFFFFFFFFFFFFFF0 | 10h for locals
;00000000012C1008 | 53 | push rbx |
;00000000012C1009 | 56 | push rsi |
;00000000012C100A | 57 | push rdi |
;00000000012C100B | 90 | nop |
;00000000012C100C | CC | int3 |
;00000000012C100D | 48:8D45 10 | lea rax,qword ptr ss:[rbp+10] | 1. argument
;00000000012C1011 | 48:8D45 F8 | lea rax,qword ptr ss:[rbp-8] | 1. local
;00000000012C1015 | 48:C7C0 EFCDAB00 | mov rax,ABCDEF |
;00000000012C101C | 50 | push rax |
;00000000012C101D | 50 | push rax |
;00000000012C101E | 50 | push rax |
;00000000012C101F | 50 | push rax |
;00000000012C1020 | 50 | push rax |
;00000000012C1021 | 50 | push rax |
;00000000012C1022 | 48:83EC 20 | sub rsp,20 | make room for 4 arguments
;00000000012C1026 | B9 0A000000 | mov ecx,A | arg 1
;00000000012C102B | FF15 CF0F0000 | call qword ptr ds:[<&Sleep>] |
;00000000012C1031 | 48:83C4 20 | add rsp,20 | correct stack
;00000000012C1035 | 58 | pop rax |
;00000000012C1036 | 58 | pop rax |
;00000000012C1037 | 58 | pop rax |
;00000000012C1038 | 58 | pop rax |
;00000000012C1039 | 58 | pop rax |
;00000000012C103A | 58 | pop rax |
;00000000012C103B | 5F | pop rdi |
;00000000012C103C | 5E | pop rsi |
;00000000012C103D | 5B | pop rbx |
;00000000012C103E | C9 | leave |
;00000000012C103F | C3 | ret |
;00000000012C1040 | 55 | push rbp |
;00000000012C1041 | 48:8BEC | mov rbp,rsp |
;00000000012C1044 | 48:83C4 D0 | add rsp,FFFFFFFFFFFFFFD0 |
;00000000012C1048 | 53 | push rbx |
;00000000012C1049 | 56 | push rsi |
;00000000012C104A | 57 | push rdi |
;00000000012C104B | 90 | nop |
;00000000012C104C | 48:8D45 FC | lea rax,qword ptr ss:[rbp-4] | 1. local (DWORD)
;00000000012C1050 | 90 | nop | invoke starts here
;00000000012C1051 | 48:83EC 30 | sub rsp,30 | make room for 5 arguments
;00000000012C1055 | 8B4D FC | mov ecx,dword ptr ss:[rbp-4] | arg 1
;00000000012C1058 | B8 3333B33F | mov eax,3FB33333 |
;00000000012C105D | 66:0F6EC8 | movd xmm1,eax | arg 2
;00000000012C1061 | 4C:8B45 F0 | mov r8,qword ptr ss:[rbp-10] | arg 3
;00000000012C1065 | 66:0F6E5D E8 | movd xmm3,dword ptr ss:[rbp-18] | arg 4
;00000000012C106A | 48:8B45 E0 | mov rax,qword ptr ss:[rbp-20] |
;00000000012C106E | 48:894424 20 | mov qword ptr ss:[rsp+20],rax | arg 5
;00000000012C1073 | E8 88FFFFFF | call uasm_stack_test_64.12C1000 |
;00000000012C1078 | 48:83C4 30 | add rsp,30 | correct stack
;00000000012C107C | 48:83EC 20 | sub rsp,20 | make room for 4 arguments
;00000000012C1080 | 33C9 | xor ecx,ecx |
;00000000012C1082 | FF15 800F0000 | call qword ptr ds:[<&RtlExitUserProcess |
;00000000012C1088 | 48:83C4 20 | add rsp,20 |
;00000000012C108C | 5F | pop rdi |
;00000000012C108D | 5E | pop rsi |
;00000000012C108E | 5B | pop rbx |
;00000000012C108F | C9 | leave |
;00000000012C1090 | C3 | ret |
Version 2:
;*************************************************************************************
; option stackbase:rbp + option win64:7
;*************************************************************************************
;00000000013A1000 | 894C24 08 | mov dword ptr ss:[rsp+8],ecx | copy arg 1 to shadow space
;00000000013A1004 | 48:55 | push rbp |
;00000000013A1006 | 53 | push rbx |
;00000000013A1007 | 56 | push rsi |
;00000000013A1008 | 57 | push rdi |
;00000000013A1009 | 48:83EC 38 | sub rsp,38 | make room for locals (10h)
; | + next call (20h)
; | + 8 bytes stack alignment = 38h
;00000000013A100D | 48:8D6C24 30 | lea rbp,qword ptr ss:[rsp+30] |
;00000000013A1012 | 90 | nop |
;00000000013A1013 | CC | int3 |
;00000000013A1014 | 48:8D45 30 | lea rax,qword ptr ss:[rbp+30] | 1. argument
;00000000013A1018 | 48:8D45 F8 | lea rax,qword ptr ss:[rbp-8] | 1. local
;00000000013A101C | 48:C7C0 EFCDAB00 | mov rax,ABCDEF |
;00000000013A1023 | 50 | push rax |
;00000000013A1024 | 50 | push rax |
;00000000013A1025 | 50 | push rax |
;00000000013A1026 | 50 | push rax |
;00000000013A1027 | 50 | push rax |
;00000000013A1028 | 50 | push rax |
;00000000013A1029 | B9 0A000000 | mov ecx,A | arg 1
;00000000013A102E | FF15 CC0F0000 | call qword ptr ds:[<&Sleep>] |
;00000000013A1034 | 58 | pop rax |
;00000000013A1035 | 58 | pop rax |
;00000000013A1036 | 58 | pop rax |
;00000000013A1037 | 58 | pop rax |
;00000000013A1038 | 58 | pop rax |
;00000000013A1039 | 58 | pop rax |
;00000000013A103A | 48:8D65 08 | lea rsp,qword ptr ss:[rbp+8] |
;00000000013A103E | 5F | pop rdi |
;00000000013A103F | 5E | pop rsi |
;00000000013A1040 | 5B | pop rbx |
;00000000013A1041 | 5D | pop rbp |
;00000000013A1042 | C3 | ret |
;00000000013A1043 | 48:55 | push rbp |
;00000000013A1045 | 53 | push rbx |
;00000000013A1046 | 56 | push rsi |
;00000000013A1047 | 57 | push rdi |
;00000000013A1048 | 48:83EC 68 | sub rsp,68 | make room for locals + next call
;00000000013A104C | 48:8D6C24 50 | lea rbp,qword ptr ss:[rsp+50] |
;00000000013A1051 | 90 | nop |
;00000000013A1052 | 48:8D45 E4 | lea rax,qword ptr ss:[rbp-1C] | 1. local (DWORD)
;00000000013A1056 | 90 | nop | invoke starts here
;00000000013A1057 | 8B4D E4 | mov ecx,dword ptr ss:[rbp-1C] | arg 1
;00000000013A105A | B8 3333B33F | mov eax,3FB33333 |
;00000000013A105F | 66:0F6EC8 | movd xmm1,eax | arg 2
;00000000013A1063 | 4C:8B45 F8 | mov r8,qword ptr ss:[rbp-8] | arg 3
;00000000013A1067 | 66:0F6E5D F0 | movd xmm3,dword ptr ss:[rbp-10] | arg 4
;00000000013A106C | 48:8B45 E8 | mov rax,qword ptr ss:[rbp-18] |
;00000000013A1070 | 48:894424 20 | mov qword ptr ss:[rsp+20],rax | arg 5
;00000000013A1075 | E8 86FFFFFF | call uasm_stack_test_64.13A1000 |
;00000000013A107A | 33C9 | xor ecx,ecx |
;00000000013A107C | FF15 860F0000 | call qword ptr ds:[<&RtlExitUserProcess |
;00000000013A1082 | 48:8D65 18 | lea rsp,qword ptr ss:[rbp+18] |
;00000000013A1086 | 5F | pop rdi |
;00000000013A1087 | 5E | pop rsi |
;00000000013A1088 | 5B | pop rbx |
;00000000013A1089 | 5D | pop rbp |
;00000000013A108A | C3 | ret |
When stepping through the code you will see that in the first version all pushes in "testit" are left as they are, while in the second version four of them (the ones in Sleep's shadow space) are overwritten with zero!
The first version makes room for the arguments of a call right before each call and restores the stack pointer to where it was right after the call. Therefore the called procedure can do whatever it pleases with its arguments and its shadow space without affecting what has been pushed before.
The second version sets RSP once (and for all) at procedure entry to a value which is suitable for all calls in this procedure. That is, it pre-calculates the largest space needed for the arguments of any call in the procedure, adds the space required for the locals of this procedure and finally aligns RSP to 16 bytes (see what happens to "sub rsp,38" if you uncomment the DrawTextEx line: it turns into "sub rsp,48"). This saves all the "sub RSP, ... / add RSP, ..." pairs enclosing calls in the first version. It is more efficient in this respect!
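The numbers in the version 2 listing for "testit" can be reproduced with a little assembly-time arithmetic. This is only a sketch of the calculation, not taken from UASM's source; all symbol names are made up for illustration, and it assumes that the reservation for calls never drops below the four-slot shadow space.

```asm
; Frame size of "testit" with option stackbase:rbp + option win64:7.
; Hypothetical names; pure assembly-time arithmetic, emits no code.
LOCALS    = 10h                     ; tx + ty
MAXARGS   = 4                       ; Sleep takes 1 argument, but 4 slots are assumed as minimum
                                    ; (6 once the DrawTextEx line is enabled)
CALLSPACE = MAXARGS * 8             ; 20h (30h for 6 arguments)
PUSHED    = 4 * 8                   ; rbp, rbx, rsi, rdi
RETADDR   = 8                       ; RSP is 16n + 8 at procedure entry
RAW       = LOCALS + CALLSPACE      ; 30h (40h)
PAD       = (16 - ((RETADDR + PUSHED + RAW) and 15)) and 15   ; 8 (8)
FRAME     = RAW + PAD               ; 38h (48h) -> "sub rsp, FRAME"
end
```

With MAXARGS = 6 the same formula gives 48h, which matches the comment on the DrawTextEx line in the source.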
But with this approach pushes before calls are not possible, because the called procedure might overwrite what has been pushed (your example).
So my claim is: with stack pointer handling as in version 1 pushes and pops aren't a problem; with version 2 pushes and pops will definitely cause problems, as you (and hutch and others) pointed out.
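If pushes are needed around a call while the assembler does not adjust RSP per call, one way out is to do by hand what version 1 does automatically: reserve fresh shadow space below the pushed values and use a plain call instead of invoke. A minimal sketch, assuming a hand-written frame (the label name is made up) and that Sleep is resolved through kernel32.lib:

```asm
includelib kernel32.lib
extern Sleep:proc

.code
push_then_call:                 ; hypothetical label, hand-written frame
    push rbp
    mov  rbp, rsp               ; RSP is 16-byte aligned from here on
    mov  rax, 0ABCDEFh
    push rax
    push rax                    ; an even number of pushes keeps the alignment
    sub  rsp, 20h               ; shadow space for Sleep, below the pushes
    mov  ecx, 10                ; arg 1
    call Sleep
    add  rsp, 20h               ; drop the shadow space again
    pop  rax                    ; the pushed values were outside Sleep's shadow space
    pop  rax
    mov  rsp, rbp               ; restore RSP from RBP
    pop  rbp
    ret
end
```

The price is exactly the "sub RSP / add RSP" pair that version 2 was meant to save, but only around the calls that really have pushed data above them.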
JK
Quote from: JK on April 07, 2021, 12:10:25 AM
please compile it once like it is
Sorry, I can't compile it. My environment variables are not set to any path, so "include windows.inc" will not work. Besides, there are three or four competing 64-bit SDKs around (plus my own), and the info on how to use them is scattered all over the place. Afaik none of them has an installer comparable to the Masm32 SDK, so I watch with awe what all of you are doing, but if I assemble 64-bit code it's with JBasic only...
Quote from: jj2007 on March 30, 2021, 02:08:44 AM
Attached the installer of the JBasic library
I am still fascinated at how you guys are trapped with STDCALL from Win32 in 64 bit. push/call notation belongs to a bygone era. If you need to save and restore registers, just create a LOCAL and MOV the register into the LOCAL.
; pseudo code
LOCAL .rax :QWORD
; .....
mov .rax, rax
; on exit
mov rax, .rax
@jj,
Sorry, I cannot supply an installer! This is what I use in a batch file:
Assembler: UASM V2.52: (...\uasm64.exe /c -win64 -Zp8 /win64 /D_WIN64 /Cp /W2 /I ...)
Linker: MS's link.exe V 14.20.27508.1: (...\LINK.EXE /LARGEADDRESSAWARE:NO /SUBSYSTEM:CONSOLE /RELEASE /VERSION:4.0 /MACHINE:X64 /LIBPATH: ...)
Include files: http://www.terraspace.co.uk/WinInc209.zip
Lib files: MASM64 SDK
JK
MASM64 SDK: http://www.masm32.com/download/install64.zip
Quote from: JK on April 07, 2021, 02:10:53 AM
MASM64 SDK: http://www.masm32.com/download/install64.zip
Over two years old, so it can't be the current version. Do you think the \Masm32\install64\m64lib stuff will still work?
Come on, jj ...
I wouldn't link with /LARGEADDRESSAWARE:NO; that shouldn't be necessary for a 64-bit image. You're basically telling the OS that the 64-bit exe can't deal with addresses above 2 GB, which will come back and bite you later if you want to do anything with large memory allocations, file mapping etc.
I see JKs idea, he wants to be able to do whatever he feels like inside his proc, if that means pushing/popping so be it.
I personally don't see the point, I just want to get stuff done and care not about the stack frame or alignment etc.. most of those nano level optimisations will have no positive benefit (but this is just me). Even with the clever prologue/epilogue optimisations an assembler is still no match for a compiler that can auto-inline, tail-call eliminations etc. If that level of optimisation is a concern then in some regards C is actually a better general purpose bet. (I hate to say it as a die-hard assembler fanatic).
If you really want to abuse yourself, you could forego using PROC at all and go oldskool, or make a custom invoke/proc macro.
IE... XPROC myFunction, arg1, DWORD, arg2, QWORD and then generate a basic prologue with no automatic reservation.
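A minimal sketch of the "oldskool" variant, without PROC and therefore without any automatically generated prologue, epilogue or stack reservation. All names (myFunction, arg1, loc1, loc2) are hypothetical; the text equates merely stand in for what a custom XPROC-style macro could generate, and the offsets assume the usual push rbp / mov rbp, rsp frame.

```asm
; Hand-written frame: nothing here is generated by the assembler.
arg1 textequ <qword ptr [rbp+10h]>  ; caller's shadow slot for RCX
loc1 textequ <qword ptr [rbp-8]>    ; first local
loc2 textequ <qword ptr [rbp-10h]>  ; second local

.code
myFunction:
    push rbp
    mov  rbp, rsp
    sub  rsp, 10h               ; two QWORD locals, RSP stays 16-byte aligned
    mov  arg1, rcx              ; spill the register argument so it has a name
    mov  rax, arg1
    mov  loc1, rax
    mov  loc2, 0
    push rax                    ; RSP is free for temporary storage...
    mov  rsp, rbp               ; ...because the epilogue restores it from RBP
    pop  rbp
    ret
end
```

Note that a function written this way carries no unwind data, so it is only suitable where exceptions never have to be unwound through it.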
Just a thought :)
:biggrin:
> Do you think the \Masm32\install64\m64lib stuff will still work?
You must have confused this with the risky junk you keep posting with manually tweaked stack frames and pushed and pops. That library, even being out of date still built super reliable executables that still run perfectly.
Also note that the masm64 library is not redistributable. It is copyright freeware, not open sauce.
I extend to the Watcom derivatives the level of support I get from them, nothing.
Quote from: hutch-- on April 07, 2021, 03:54:09 PM
the risky junk you keep posting with manually tweaked stack frames and pushed and pops.
You are not talking to me, right? :biggrin:
:biggrin:
Would I tell a lie ? After your years long crusade against 64 bit and MASM in particular with dodgy code and unreliable technical data, why would I lie about it ? :tongue:
Any examples for "dodgy code and unreliable technical data"? I'm curious.
:biggrin:
Yeah, look at your last few months of postings about twiddling the stack with push and pop. :tongue:
Quote from: hutch-- on April 08, 2021, 08:34:54 AM
:biggrin:
Yeah, look at your last few months of postings about twiddling the stack with push and pop. :tongue:
A forum search for "push pop" and "jj2007" does not find posts where I argued for pushing & popping in 64-bit code. Probably you are confusing me with another member :cool:
:biggrin:
You may need different search criteria.