sub rsp,(32*8 + 8 MOD 32)
A = the desired alignment (16 or 32)
8 MOD 16 = 8 MOD 32 = 8
==> SUB RSP,(A*B + 8 MOD A) = SUB RSP,(A*B + 8 )
if A = 16:
B=4 (dword): SUB RSP,(16*4 + 8) ==> SUB RSP,72
B=8 (qword): SUB RSP,(16*8 + 8) ==> SUB RSP,136
SUB RSP,((localbytes +8) and -10h)
361 ms for sub
860 ms for push
A bit difficult to test the performance of function prologs/epilogs without doing just that.
I think in this case we have to conclude the stack is static. The macro, as you pointed out, wouldn't work otherwise.
Well, my point was just to show the advantages and disadvantages of using RSP or RBP as a base. A few assumptions are made about this, but yes, the former is usually faster.
That is the same reason we can't apply the ALIGN directive to it in LOCALS: its runtime nature.
The x64 ABI requires that the stack itself is aligned on entry. Therefore assembly time alignment is possible.
I am extremely poor at macros, though. But if this can be done, it would be a beautiful addition and feature.
;
; Build: jwasm/hjwasm/asmc -pe test.asm
;
.x64
.model flat, fastcall
option win64:7
option stackbase:RBP
option dllimport:<msvcrt>
printf proto :ptr byte, :vararg
exit proto :qword
option dllimport:none
.data
format db "l1: %d",10
db "l2: %d",10
db "l3: %d",10
db "l4: %d",10
db "l5: %d",10,0
.code
Alignment proc uses rsi rdi rbx
local l1 : byte
local l2 : xmmword
local l3 : byte
local l4 : ymmword
local l5 : byte
GetAlig macro reg, l
lea reg,l
bsf rcx,reg
mov reg,1
shl reg,cl
endm
GetAlig rsi,l1
GetAlig r8, l2
GetAlig r9, l3
GetAlig r10,l4
GetAlig r11,l5
invoke printf, addr format, rsi, r8, r9, r10, r11
ret
Alignment endp
main proc
invoke Alignment
invoke exit,0
main endp
end main
Come to think of it this should probably also apply to 32-bit. Maybe 16-bit as well.
option alignlocals:[on|off]
32 and 64 would still be correct: you're checking the alignment with bsf, so something being aligned to 64 doesn't preclude it being aligned to 16. That's just by virtue of the fact that repeatedly aligning to 16 will eventually land on an address aligned to a larger power of 2. I'm not bothering to align ymmword locals to 32, as the OS doesn't give you a 32-byte-aligned stack on entry, or at least it's not guaranteed. C/C++ compilers use movups to refer to ymm locals, so I'll do the same for now (until we get guaranteed 32-byte entry alignment). You can manually compensate on entry into an asm app, but if you were making a library and using it from an HLL without that, assuming ymm locals are aligned will probably not work?
Here (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/299644) is an interesting explanation of the AVX alignment requirements.