In 32 bit I can zero all locals quite easily whenever I use an EBP-based stack frame. Everything between ESP and EBP is "local" right after the prologue. Moreover, in 32 bit the locals are located on the stack in the same order in which they are defined in code.
Running the same code in 64 bit, I see that locals don't always start at RBP, and I cannot rely on the defined order of locals being the same on the stack. E.g. in code I have 3 locals: a DWORD, a structure and another DWORD; on the stack I find 2 DWORDs one right after the other, followed by the structure.
The basic idea in 64 bit for the following code:
local x :dword
local y :rect
local z: dword
was:
cld
xor rax, rax
lea rcx, x
add rcx, sizeof(x)
lea rdi, z
sub rcx, rdi
rep stosb
(get the address of the first local, add its size, get the address of the last local to zero out, fill everything in between with zero)
But obviously this doesn't work! What else could I do? I want to reliably zero out all or certain portions of my locals in a procedure. Of course I could explicitly zero out each local (mov rax, 0 -> mov ..., rax), but there must be a much more elegant and faster way. How?
Thanks
JK
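One way to sidestep the ordering problem, sketched here under assumptions: wrap the locals that should be cleared in a single structure. The assembler may move whole locals around, but it does not reorder the fields inside one structure, so a single rep stosb over SIZEOF covers them all. MYLOCALS, MyProc and loc are made-up names; the fragment assumes an include file that defines RECT, an assembler-generated stack frame, and the Win64 convention that the direction flag is clear on entry.

```asm
MYLOCALS STRUCT
    x   DWORD ?
    rc  RECT  <>
    z   DWORD ?
MYLOCALS ENDS

MyProc PROC
    LOCAL loc:MYLOCALS          ; a single local, so there is nothing to reorder

    mov   r11, rdi              ; rdi is nonvolatile in Win64, park it in volatile r11
    lea   rdi, loc              ; start of the block to clear
    mov   ecx, SIZEOF MYLOCALS  ; byte count (writing ecx zero-extends into rcx)
    xor   eax, eax              ; al = 0 is the fill byte
    rep   stosb
    mov   rdi, r11              ; restore rdi

    mov   loc.x, 1              ; fields are addressed as loc.field
    ret
MyProc ENDP
```

The same idea should carry over to 32 bit with edi/ecx and a push/pop of edi around the fill. The price is the loc. prefix on every access.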
Quote from: JK on March 27, 2021, 05:18:13 AM
in 64 bit, I see that locals don't always start at RBP, and I cannot rely on the defined order of locals being the same on the stack. E.g. in code I have 3 locals: a DWORD, a structure and another DWORD; on the stack I find 2 DWORDs one right after the other, followed by the structure.
With which framework/assembler does this happen? I use my own PROLOGUE and EPILOGUE, which assign LOCALs exactly where the programmer wants them to be, and "ZeroLocals" is built into the PROLOGUE macro:
SayHi proc <cb> arg:SIZE_P ; <cb> indicates ok for usage as a callback function; the arg is a pointer
    Local v1, v2, v3, v4, rc:RECT
    Print Str$(" \nLocals v1...v4: %i %i %i %i", v1, v2, v3, v4)
    jinvoke MessageBox, 0, arg, Chr$("Hi"), MB_OK or MB_SETFOREGROUND
    ret
SayHi endp
0000000140001002 | 55 | push rbp |
0000000140001003 | 48 C7 C5 E0 FF FF FF | mov rbp,FFFFFFFFFFFFFFE0 | clear locals
000000014000100A | 83 24 2C 00 | and dword ptr ss:[rsp+rbp],0 |
000000014000100E | 48 83 C5 04 | add rbp,4 |
0000000140001012 | 78 F6 | js 14000100A |
0000000140001014 | 48 8B EC | mov rbp,rsp |
0000000140001017 | 48 89 4D 10 | mov qword ptr ss:[rbp+10],rcx | load arguments (option <cb>)
000000014000101B | 48 89 55 18 | mov qword ptr ss:[rbp+18],rdx |
000000014000101F | 4C 89 45 20 | mov qword ptr ss:[rbp+20],r8 |
0000000140001023 | 4C 89 4D 28 | mov qword ptr ss:[rbp+28],r9 |
Thanks JJ, I'm using UASM with option win64:15 and option stackbase:rbp. But maybe I should run my own PROLOGUE and EPILOGUE like you do. I suspect that some of UASM's "optimizations" make a procedure look different depending on the number and type of parameters, the number and type of locals, whether it's a leaf procedure or not, and maybe some other things. This makes it difficult to find a generic approach.
Could you please expand a bit on how you do it (PROLOGUE and EPILOGUE)?
I'm very much interested in not always zeroing all locals, but from case to case only some of them. Something like this:
local a ...
local b ...
local c ...
zero_until_here (or zero_from_here)
local x ...
local y
This would require a custom PROLOGUE and an additional macro for setting a list of locals to zero. If I knew the start or end address of the locals block and could rely on locals appearing on the stack in the order in which they were defined in code, I could specify one local as a start/end point for a loop setting everything in between to zero. Maybe better like this:
local a ...
local b ...
local c ...
local x ...
local y
zero_until(c)
...
would zero a,b and c, and:
local a ...
local b ...
local c ...
local x ...
local y
zero_until(y)
...
would zero out all locals.
How can this be done, preferably for both 32 and 64 bit?
JK
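For the 32-bit case, the zero_until idea could look roughly like the macro below. This is only a sketch: ZeroUntil and the local names are made up (note that a plain c cannot be used as a name, since C is a reserved word in MASM), and it assumes the default EBP frame, locals laid out in declaration order growing downward from EBP, and a clear direction flag. It does not hold for an assembler that reorders locals.

```asm
; Hypothetical helper: clears every local from the frame pointer down to
; and including lastvar. Clobbers eax and ecx.
ZeroUntil MACRO lastvar
    push  edi               ; preserve edi; locals are addressed via ebp, so the push is harmless
    lea   edi, lastvar      ; lowest address to clear (locals grow downward from ebp)
    mov   ecx, ebp
    sub   ecx, edi          ; ecx = number of bytes between lastvar and ebp
    xor   eax, eax          ; fill byte 0
    rep   stosb
    pop   edi
ENDM

MyProc PROC
    LOCAL locA:DWORD
    LOCAL locB:RECT
    LOCAL locC:DWORD
    LOCAL locX:DWORD
    LOCAL locY:DWORD

    ZeroUntil locC          ; zeroes locA, locB and locC only
    ; ZeroUntil locY        ; would zero all five locals
    ret
MyProc ENDP
```

Because the byte count is derived from ebp at run time, the macro needs no bookkeeping of sizes; any padding between locals is simply zeroed as well.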
Quote from: JK on March 27, 2021, 08:00:07 AM
I'm very much interested in from case to case not zeroing all locals all the time, but sometimes only some of it
Known problem: you have a dozen local dwords and structures, plus an 8 kB buffer that doesn't need zeroing.
In JBasic (the dual 64/32 bit demo that comes with MasmBasic), with e.g.
useClv=120 you can limit the ClearLocals to 120 bytes
How I do it, that is a long story. The file JBasic.inc comes with the MasmBasic package (http://masm32.com/board/index.php?topic=94.0), but explaining the PROLOGUE macro would require more time than I currently have, sorry...
Thanks anyway - I found JBasic.inc, now I can study how you do it!
One more question - is this your generic PROLOGUE and EPILOGUE, or are there more variations to be found in MasmBasic?
JK
It's the generic one, set in j@start
Btw there was a long PROC and prolog/epilog (http://masm32.com/board/index.php?topic=6717.0) thread three years ago. Sooner or later you will stumble over "undocumented MASM features" when testing your stuff with Watcom vs ML64 assemblers.
While the guts of UASM are not my concern, if you need to be able to set local values in your prologue, Bob Zale did it in PB, so if you disassemble a PB executable, you will see his technique. It means you must be able to set dynamic code to do this after the entry to the proc.
If I had a reason to, I think it can be done in MASM so if UASM has properly duplicated the MASM pre-processor, it should be able to insert dynamic code after the procedure entry.
Quote from: hutch-- on March 27, 2021, 11:02:50 PM
I think it can be done in MASM
Indeed, it can be done. I added the "clear the locals" feature to JBasic (my dual 64-/32-bit library) half a year ago.
if useClv lt localbytes and useClv ge 4 ; with e.g. useClv=120, you can limit the ClearLocals to 120 bytes
    lea rbp, [rsp-useClv]
else
    lea rbp, [rsp-localbytes-locBytesOff] ; ML64 and AsmC address locals differently, and create a 16-byte unused zone near the stack pointer
endif
ClearLoc: and dword ptr [rbp], 0
    add rbp, 4
    cmp rbp, rsp
    js ClearLoc
@hutch
I know how PB does it; a disassembly reveals this. Basically I know and understand what I must do. But I hoped to get away without a custom PROLOGUE/EPILOGUE. UASM does a good job optimizing code and stack space usage, but the downside is that, at least in 64 bit, I cannot reliably calculate the stack position of locals from RBP and RSP because of these optimizations. I cannot even rely on locals appearing in the order in which they are defined. UASM puts larger locals at the start.
@JJ
In the meantime I had a chance to study your code and I understand what you are doing. As said above, I hoped to be able to use what's built into UASM and to add some code clearing the locals I want to be cleared. But maybe the only way to achieve what I want is doing it the hard way by writing my own custom PROLOGUE/EPILOGUE. I will be able to learn from your code and extract what I need!
One more question: the return value of a custom PROLOGUE tells the assembler where the locals are located in relation to EBP/RBP, right?
JK
Here is another attempt to initialize the local variables. This can be combined with the custom PROLOGUE \ EPILOGUE method :
include \masm32\include\masm32rt.inc

INITLOC MACRO registers
    VarSize=0
    StackPos=0
    IFNB <registers>
        StackPos=4*registers
    ENDIF
ENDM

LOCALX MACRO _name,_type
    VarSize=VarSize + SIZEOF(_type)
    LOCAL _name : _type
ENDM

ENDLOC MACRO
    lea eax,[esp+StackPos] ; skip the registers pushed by USES, they sit below the locals
    invoke memfill,eax,VarSize,0
ENDM

.code

start:
    call main
    invoke ExitProcess,0

main PROC USES esi edi ebx
    INITLOC 3 ; number of preserved registers (esi, edi, ebx)
    LOCALX rc,RECT
    LOCALX x,DWORD
    LOCALX y,DWORD
    ENDLOC
    lea eax,rc
    invoke crt_printf,\
        CTXT("y = %d , x = %d , &rc = %X"),\
        y,x,eax
    ret
main ENDP

END start
Hi,
UASM now handles primitives first then structs/arrays, so do not expect them to be stored in the same order they are declared.
Quote from: JK on March 28, 2021, 08:04:11 AM
One more question: the return value of a custom PROLOGUE tells the assembler where the locals are located in relation to EBP/RBP, right?
I'm afraid it doesn't, that would make the task easier.
The JBasic library installer has moved here (http://masm32.com/board/index.php?topic=9266.0).
I attach a shortened version of the JBasic library. Extract all files to \Masm32\MasmBasic\Res, then drag an asm file over Buildme.bat
(it needs \Masm32\bin\UAsm64.exe, all the rest should be available).
Hello,
Here is an example for Masm 64-bit :
include \masm32\include64\masm64rt.inc

.data
string1 db 'y = %d , x = %d , &rc = %X',0

INITLOC MACRO
    VarSize=0
ENDM

LOCALX MACRO _name,_type
    VarSize=VarSize + SIZEOF(_type)
    LastVar TEXTEQU <_name>
    LOCAL _name : _type
ENDM

ENDLOC MACRO
    ; % echo LastVar
    lea rcx,LastVar
    invoke vc_memset,rcx,0,VarSize
ENDM

.code

start PROC
    call main
    invoke ExitProcess,0
start ENDP

main PROC
    INITLOC
    LOCALX .rsi,QWORD
    LOCALX .rdi,QWORD
    LOCALX .rbx,QWORD
    LOCALX rc,RECT
    LOCALX x,QWORD
    LOCALX y,QWORD
    ENDLOC
    mov .rsi,rsi
    mov .rdi,rdi
    mov .rbx,rbx
    lea r9,rc
    invoke vc_printf,\
        ADDR string1,\
        y,x,r9
    mov rsi,.rsi
    mov rdi,.rdi
    mov rbx,.rbx
    ret
main ENDP

END
Another version :
include \masm32\include\masm32rt.inc

INITLOC MACRO
    VarSize=0
ENDM

@p MACRO _name,_type
    VarSize=VarSize + SIZEOF(_type)
    LastVar TEXTEQU <_name>
    EXITM <_name : _type >
ENDM

ENDLOC MACRO
    lea eax,LastVar
    invoke memfill,eax,VarSize,0
ENDM

.code

start:
    call main
    invoke ExitProcess,0

main PROC USES esi edi ebx
    INITLOC
    LOCAL @p(rc,RECT)
    LOCAL @p(x,DWORD)
    LOCAL @p(y,DWORD)
    ENDLOC
    invoke crt_printf,\
        CTXT("y = %d , x = %d"),\
        y,x
    ret
main ENDP

END start
Thanks Vortex,
in my initial post I supplied code which essentially does the same as your macros do. All of this works only if the order of locals isn't changed by the assembler. My first try was UASM, which does change (optimize?) the order of locals on the stack. In the meantime I played a bit with UASM options and with ml/ml64 and found that this is not always the case. So avoiding certain options basically solves my problem.
I read about rep stosb vs memfill and made my own timing tests. To my surprise rep stosb, at least on my (fairly old) machine, can not only keep up with memfill but is slightly faster in the size range you would expect for the total size of locals.
JK
Hi JK,
The instruction pair REP MOVSB/STOSB is still very competitive as it has special case circuitry as long as you use REP with them.
Now with both ML and ML64 they explicitly maintain the written order of your locals, and one of the ways to keep everything aligned properly is to start with the biggest data types and place the rest in descending order. This way everything is correctly aligned.
Quote from: hutch-- on April 01, 2021, 11:28:05 AM
one of the ways to keep everything aligned properly is to start with the biggest data types first and place the rest in descending order. This way everything is correctly aligned.
That's correct, but it creates bloat:
Local buffer[124]:BYTE, v1, v2, v3, v4, v5, v6, v7, rc:RECT
int 3
mov eax, v1 ; still short
mov ecx, v2 ; bloated long instructions for v2...rc, 3 bytes more for each mov, inc, add etc
lea rdx, buffer ; short instruction
00000001400011D7 | CC | int3 |
00000001400011D8 | 8B 45 80 | mov eax,dword ptr ss:[rbp-80] |
00000001400011DB | 8B 8D 7C FF FF FF | mov ecx,dword ptr ss:[rbp-84] |
00000001400011E1 | 48 8D 55 84 | lea rdx,qword ptr ss:[rbp-7C] |
:biggrin:
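The encoding difference comes from the x86 displacement field: an offset from rbp in the range -128..+127 fits a one-byte displacement, anything farther away needs four bytes. A sketch in the same dialect, assuming a prologue that places locals in declaration order starting right below rbp (as the listing above suggests): declaring the large buffer last keeps the frequently used small variables within the short range, at the cost of one long lea for the buffer. The comments state what one would expect under that assumption, not measured output.

```asm
Local v1, v2, v3, v4, v5, v6, v7, rc:RECT, buffer[124]:BYTE
mov eax, v1         ; expected: short form, v1 sits right below rbp
mov ecx, v7         ; expected: still short, 7 dwords = 28 bytes below rbp
lea rdx, buffer     ; expected: long form, buffer starts more than 128 bytes below rbp
```

Dwords first, then the RECT (made of dwords), then the byte buffer also keeps every item on its natural boundary in this particular case.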
Better bloat than broken. If you are dealing with instructions that need alignment you have no choice. You may in some circumstances get misaligned data to work but why waste the performance when you just need to do it correctly in the first place.
With x64 and Win64, you will always have more memory than older OS versions and you need to use it in a compatible way with the OS specs, fiddling a few bytes here and there simply does not matter.
In order, align the proc if it needs to be larger than the default 64 bit, then add any locals in descending order of size. MASM can do it, so with any of the Watcom derivatives, look for an option that will do this for you.
Is it possible to use rep movsd for a LOCAL array, then xchg esp,edi and start popping data from the array in a loop, and xchg esp back afterwards?
@hutch
About data alignment: why must data be aligned in 64 bit? Is it a matter of performance, or is it a matter of crash or not?
Regarding locals: is it sufficient to properly align only the first local in a procedure, or must every local be aligned according to its size?
This is still confusing to me; the latter seems to be the case when stepping through 64 bit code with a debugger. Even if I don't care about sorting locals according to their size, the assembler seems to align them for me. I see only addresses ending with 0 or 8 (except for byte, word and dword locals, which are aligned at 1, 2 or 4 byte boundaries).
So what is the advantage of sorting locals by size, did I miss something? I want to make it stable in the first place; I don't want to over-optimize things and risk hard-to-find bugs.
Thanks
JK
:biggrin:
It's simple enough: Win64 is not tolerant, misalign data and code and the app may not start. Usually the default alignment for a procedure is 64 bit, so if you put 64 bit locals first, then 32 bit locals, then 16 bit, then 8 bit in descending order, they are all aligned correctly. Doing this avoids hard to track problems in 64 bit.
Now if you need to use 128 or 256 bit data types, you need to align the procedure to the largest data type, then put any others in descending order. In MASM there are facilities to create larger alignments, so you will have to look through the UASM capacity to see if you can do the same.
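As a hedged illustration of the alignment point: ordinary integer loads and stores on x64 tolerate a misaligned address, but aligned SSE instructions such as movaps raise an exception unless the address is a multiple of 16. Where the assembler's placement of a local cannot be relied on, one option is to over-allocate and round the pointer up by hand. AlignedDemo and raw are made-up names, and the fragment assumes an assembler-generated stack frame.

```asm
AlignedDemo PROC
    LOCAL raw[32+15]:BYTE               ; 32 usable bytes plus 15 bytes of slack

    lea    rax, raw
    add    rax, 15
    and    rax, -16                     ; round up to the next multiple of 16
    xorps  xmm0, xmm0                   ; xmm0 = 0 (volatile register in Win64)
    movaps xmmword ptr [rax], xmm0      ; safe: rax is 16-byte aligned
    movaps xmmword ptr [rax+16], xmm0   ; second 16-byte slot, still inside raw
    ret
AlignedDemo ENDP
```

The rounded pointer is at most 15 bytes past the start of raw, so the 32 bytes written always stay inside the 47-byte buffer.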
Thanks hutch,
I'm not an exclusive UASM user. UASM seemed easier to start with in 64 bit because of some already built-in features, but I'm also looking for ways to do things with ml/ml64. Ml/ml64 has been around for a long time and I think it will be in the future. I'm not sure if clones like UASM, ASMC and others will be alive then as well.
Currently I try to learn new things in the 64 bit assembler world, filling knowledge gaps at all knowledge levels. Therefore I may ask for very basic things and next time I may ask for very special stuff.
Quote from: hutch-- on April 01, 2021, 10:10:16 PM
:biggrin:
Better bloat than broken. If you are dealing with instructions that need alignment you have no choice. You may in some circumstances get misaligned data to work but why waste the performance when you just need to do it correctly in the first place.
As an old friend of mine used to say: "Bottom line is REAL MEN[tm] write their loop code in Intel mnemonics, stack frames by hand, not everyone wants a compiler writer to hold their hot little hand." :biggrin:
:biggrin:
Well, there is nothing wrong with writing your loop code in Intel mnemonics, particularly if you want it fast, in 32 bit it was easy to write manual stack frames or lack of one and you have to be careful about letting compiler writers hold your hot little hand, they may lead you down the garden path to a visual garbage generator.
Walk the straight and narrow and you will learn to write missiles that make the script kiddies sob into their chardonnay. (perhaps lemonade) :tongue:
Quote from: hutch-- on April 02, 2021, 05:30:46 AM
It's simple enough: Win64 is not tolerant, misalign data and code and the app may not start. Usually the default alignment for a procedure is 64 bit, so if you put 64 bit locals first, then 32 bit locals, then 16 bit, then 8 bit in descending order, they are all aligned correctly. Doing this avoids hard to track problems in 64 bit
Example
Local a,b,c,D:dword
Local array[256]:dword
;copy data to local array
Mov saveesp,esp
So in which order do I put the array and the other locals, to start popping 256 dwords in a loop?
Quote from: daydreamer on April 03, 2021, 06:03:32 PM
pop 256 dwords in a loop?
Can you post complete code for that, please?
Magnus,
Quote
to start pop 256 dwords in a loop
In Win64 you don't pop anything. What you have said did not make sense.
Converting the LOCVAR macro to Poasm is easy :
LOCVAR MACRO _name,_type
    VarSize=VarSize + SIZEOF(_type)
    LastVar TEXTEQU _name ; Masm equivalent = LastVar TEXTEQU <_name>
    LOCAL _name : _type
ENDM
The ENDLOC macro remains the same.