The MASM Forum

64 bit assembler => 64 bit assembler. Conceptual Issues => Topic started by: coder on April 04, 2017, 04:31:09 AM

Title: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 04, 2017, 04:31:09 AM
This technique assumes that the stack environment of the codebase defaults to 16-byte alignment.

One advantage is that this can be used to completely eliminate the costly function prologs and epilogs, for good.
Another advantage is that this can also deal with unconventional alignment, such as 32-byte.

It has the same effect as AND RSP,-16 or AND RSP,-32, minus the stack frame.

To align the stack to the desired alignment, the equation is

SUB RSP,(A*B + 8 MOD A)

and restore it almost the same way using

ADD RSP,(A*B + 8 MOD A)

where:
A = the desired alignment (16 or 32)
B = the number of 'notches' (A-byte units) of local space to allocate
Default codebase alignment = 16

Example: to create 32*8 bytes of local space aligned to a 32-byte boundary (in a codebase with 16-byte alignment), use

sub rsp,(32*8 + 8 MOD 32)

If you are not interested in allocating local space, just use B=0 and still get the stack aligned according to A.
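
A quick way to sanity-check the formula is to simulate it. The sketch below uses Python rather than assembly, purely for the arithmetic. It assumes the usual Win64 situation where RSP is 16-byte aligned before the CALL, so RSP mod 16 = 8 on entry. The entry addresses and the helper names (`frame_size`, `rsp_after_sub`) are made up for illustration.

```python
def frame_size(a, b):
    # Same grouping as the MASM expression: (A*B) + (8 MOD A)
    return a * b + 8 % a

def rsp_after_sub(entry_rsp, a, b):
    # Effect of SUB RSP,(A*B + 8 MOD A) on a given entry RSP
    return entry_rsp - frame_size(a, b)

# Two hypothetical entry values, both with RSP mod 16 == 8.
# They differ in RSP mod 32 (24 and 8), which the 16-byte rule leaves open.
ENTRY_VALUES = (0xFF8, 0x1008)

for a in (16, 32):
    for entry in ENTRY_VALUES:
        rsp = rsp_after_sub(entry, a, 8)
        print(f"A={a:2} entry={entry:#x} rsp={rsp:#x} rsp mod A={rsp % a}")
```

With these inputs the A=16 case ends up 16-byte aligned for both entry values, while the A=32 case is 32-byte aligned only when the entry RSP happens to be 8 mod 32.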

I haven't run enough tests on it, so I don't know how it would work in all situations. Corrections and improvements are welcome.

Thanks.
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: mabdelouahab on April 04, 2017, 07:03:13 AM
Quote
A = the desired alignment (16 or 32)

8 MOD 16 = 8 MOD 32 = 8
    ==> SUB RSP,(A*B + 8 MOD A) = SUB RSP,(A*B + 8)
    if A = 16:
    B=4 (dword): SUB RSP,(16*4 + 8) ==> SUB RSP,72
    B=8 (dword): SUB RSP,(16*8 + 8) ==> SUB RSP,136  :eusa_naughty:

SUB RSP,((localbytes + 8) and -10h)
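
For reference, both expressions can be evaluated exactly as written. This is only an arithmetic sketch in Python; the function names `formula` and `masked` and the sample byte counts are hypothetical.

```python
def formula(a, b):
    # (A*B) + (8 MOD A), as in the first post
    return a * b + 8 % a

def masked(localbytes):
    # (localbytes + 8) and -10h
    return (localbytes + 8) & -0x10

print(formula(16, 4), formula(16, 8))   # the two SUB RSP amounts worked out above

for n in (20, 64, 128):                 # hypothetical local sizes in bytes
    print(n, masked(n), masked(n) % 16)
```

The masked expression always yields a multiple of 16, so subtracting it leaves RSP mod 16 unchanged; whether that is the wanted outcome depends on what RSP mod 16 is at that point, which the post does not state. Note also that the mask rounds down, so a request of 20 bytes evaluates to 16.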
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 04, 2017, 07:49:08 AM
Quote from: mabdelouahab on April 04, 2017, 07:03:13 AM
Quote
A = the desired alignment (16 or 32)

8 MOD 16 = 8 MOD 32 = 8
    ==> SUB RSP,(A*B + 8 MOD A) = SUB RSP,(A*B + 8)
    if A = 16:
    B=4 (dword): SUB RSP,(16*4 + 8) ==> SUB RSP,72
    B=8 (dword): SUB RSP,(16*8 + 8) ==> SUB RSP,136  :eusa_naughty:

SUB RSP,((localbytes + 8) and -10h)

Yes, that has the same effect as AND RSP,-16: you get an extra 8 bytes and an aligned stack at the same time. Only this time, no stack frame is needed.

16*8 + 8 = 128 + 8 = 136. Exactly the same effect as the AND RSP,-16 / SUB RSP,16*8 sequence. Stack aligned. No stack frame :)
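
The comparison with the AND / SUB sequence can be checked numerically. The Python sketch below assumes RSP mod 16 = 8 on entry; the entry addresses and the names `via_formula` and `via_and_sub` are hypothetical.

```python
def via_formula(entry_rsp, a, b):
    return entry_rsp - (a * b + 8 % a)      # SUB RSP,(A*B + 8 MOD A)

def via_and_sub(entry_rsp, a, b):
    return (entry_rsp & -a) - a * b         # AND RSP,-A / SUB RSP,A*B

for a in (16, 32):
    for entry in (0xFF8, 0x1008):           # hypothetical, both 8 mod 16
        same = via_formula(entry, a, 8) == via_and_sub(entry, a, 8)
        print(f"A={a} entry={entry:#x} same result: {same}")
```

For A=16 the two sequences land on the same address for both entry values; for A=32 they agree only when the entry RSP is 8 mod 32.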

Thanks for testing it.


Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 04, 2017, 09:30:56 PM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 05, 2017, 01:11:03 AM
Hi Nidud. Thanks for testing it out.

Common sense says that when you're testing a function, you should at least test its performance by actually calling it, or something with a similar structure. I ran a quick test, slightly different from yours, using GetTickCount64 with 100 million loops:

361 ms for sub
860 ms for push


That's a gain of almost 60 percent (about 58 percent less time for sub).
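
The size of the gain follows directly from the two timings. A small Python sketch of the arithmetic, using only the numbers quoted above:

```python
sub_ms, push_ms = 361, 860                        # timings quoted above

reduction = (push_ms - sub_ms) / push_ms * 100    # percent less time
speedup = push_ms / sub_ms                        # how many times faster

print(f"{reduction:.1f}% less time, {speedup:.2f}x faster")
```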

I know the formula I suggested above is slightly off for 16-byte alignment, but it's working just fine for 32-byte alignment. Hope someone can offer a little correction.

Thanks and nice to know you.
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 05, 2017, 01:16:24 AM
Additional info: This works only in a strict Win64 ABI environment, where stack alignment is well observed across calls. This won't work on Linux.


Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 05, 2017, 02:27:39 AM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 05, 2017, 02:57:27 AM
Hi nidud. Thanks for the reply and counter tests.

I was actually trying to eliminate other variables and focus solely on the performance of function prologs/epilogs vs plain fastcall, arranged as-is. If we were to include other variables such as code alignment and arrangement, it would be a completely different discussion, no longer related to stack alignment, which is a dynamic entity of a program and not static like code alignment.

But I appreciate your replies, though.

Thanks.


 
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 05, 2017, 04:51:34 AM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 08, 2017, 07:49:05 AM
Quote from: nidud on April 05, 2017, 04:51:34 AM
A bit difficult to test the performance of function prologs/epilogs without doing just that..
I think in this case we have to conclude the stack is static. The macro, as you pointed out, wouldn't work otherwise.

Well, my point was just to show the advantage / disadvantage of using RSP or RBP as a base. There are a few assumptions made about this, but yes, the former is usually faster.

I don't think the stack is static, though. It's a runtime load. The same reason why we can't apply ALIGN directive against it in LOCALS due to its runtime nature, unlike applying it against code, which is done / calculated at compile time (static).

Yeah, it doesn't quite work. How I wish someone with better math could 'repair' it. It would save us lots of stacking problems :D
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: jj2007 on April 08, 2017, 08:25:29 AM
Quote from: coder on April 08, 2017, 07:49:05 AM
The same reason why we can't apply ALIGN directive against it in LOCALS due to its runtime nature

The x64 ABI requires that the stack itself is aligned on entry. Therefore assembly time alignment is possible.
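
Because the ABI fixes RSP mod 16 on entry, the offset of each local can be picked when the code is assembled. The Python sketch below illustrates the idea for alignments up to 16. `layout_locals`, the sample locals and the entry addresses are hypothetical, and shadow space and saved registers are ignored.

```python
def layout_locals(local_vars):
    """Assign RSP-relative offsets to (name, size, align) tuples, align <= 16."""
    offsets = {}
    off = 0
    for name, size, align in local_vars:
        off = (off + align - 1) & -align    # round up to this local's alignment
        offsets[name] = off
        off += size
    body = (off + 15) & -16                 # round the locals area up to 16
    return offsets, body + 8                # +8 offsets the pushed return address

offsets, frame = layout_locals([("l1", 1, 1), ("l2", 16, 16), ("l3", 1, 1)])

for entry in (0xFF8, 0x1008):               # hypothetical, both 8 mod 16
    rsp = entry - frame                     # SUB RSP,frame
    print(f"entry={entry:#x} l2 at {rsp + offsets['l2']:#x}")
```

Both entry values put `l2` on a 16-byte boundary without any run-time AND, since the frame size already accounts for the 8 bytes of the return address.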
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 08, 2017, 10:10:47 AM
Quote from: jj2007 on April 08, 2017, 08:25:29 AM
Quote from: coder on April 08, 2017, 07:49:05 AM
The same reason why we can't apply ALIGN directive against it in LOCALS due to its runtime nature

The x64 ABI requires that the stack itself is aligned on entry. Therefore assembly time alignment is possible.

Yeah.

Here's an idea. It would be a nice addition if you could extend the "LOCAL" directive to something like

    local    align 16 <var>
    local    <var> align 32

I am extremely poor at macro though. But if this can be done, it will be a beautiful addition and feature.
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: jj2007 on April 08, 2017, 10:34:00 AM
Quote from: coder on April 08, 2017, 10:10:47 AM
I am extremely poor at macro though. But if this can be done, it will be a beautiful addition and feature.

The PROLOG macro does not allow access to individual variables. But you can find a workaround here (http://masm32.com/board/index.php?topic=6103.msg64817#msg64817).

And afaik, the HJWasm team have implemented local alignment now.
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 08, 2017, 10:18:48 PM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: johnsa on April 10, 2017, 05:23:34 AM


;
; Build: jwasm/hjwasm/asmc -pe test.asm
;
.x64
.model flat, fastcall
option win64:7
option stackbase:RBP

option dllimport:<msvcrt>
printf proto :ptr byte, :vararg
exit proto :qword
option dllimport:none

.data

format db "l1: %d",10
db "l2: %d",10
db "l3: %d",10
db "l4: %d",10
db "l5: %d",10,0

.code

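Alignment proc uses rsi rdi rbx

    ; Locals of mixed size, to see how each one ends up aligned.
    local l1 : byte
    local l2 : xmmword
    local l3 : byte
    local l4 : ymmword
    local l5 : byte

    ; GetAlig: leave in reg the lowest set bit of the address of l,
    ; i.e. the largest power of two that divides the address.
    ; Clobbers rcx.
    GetAlig macro reg, l
        lea reg, l
        bsf rcx, reg
        mov reg, 1
        shl reg, cl
    endm

    GetAlig rsi, l1
    GetAlig r8,  l2
    GetAlig r9,  l3
    GetAlig r10, l4
    GetAlig r11, l5

    invoke printf, addr format, rsi, r8, r9, r10, r11

    ret
Alignment endp

main proc

    invoke Alignment
    invoke exit, 0

main endp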

end main



Using win64:7, stackbase:rbp
l1: 1
l2: 16
l3: 1
l4: 64
l5: 1

Using win64:11, stackbase:rsp
l1: 16
l2: 64
l3: 16
l4: 16
l5: 16
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: johnsa on April 10, 2017, 06:37:48 AM
I've made a change to hjwasm (for the next update) to set stackbase:rbp as the default when nothing is specified. It's technically required, but shouldn't accidentally cause things to go off if omitted.
I've also put an option in to optimise the amount of stack used with win64:15 / stackbase:rsp but in all cases things will be aligned as planned.

Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 10, 2017, 06:52:02 AM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 10, 2017, 07:04:14 AM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: johnsa on April 10, 2017, 08:26:14 AM
Quote from: nidud on April 10, 2017, 07:04:14 AM
Come to think of it this should probably also apply to 32-bit. Maybe 16-bit as well.

option alignlocals:[on|off]

32 and 64 would still be correct. You're checking the alignment with bsf, so something being aligned to 64 doesn't preclude 16; that's just by virtue of the fact that cyclic aligns to 16 will eventually land on a higher power of 2. I'm not bothering to align to 32 for ymmword, as the OS doesn't give you a 32-byte aligned stack on entry, or at least it's not guaranteed. C/C++ use movups to refer to ymm locals, so I'll do the same for now (until we get guaranteed 32-byte entry alignment). You can manually compensate on entry into an asm app, but if you were making a library and using it from an HLL without that, assuming ymm locals are aligned will probably not work?
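
The point about bsf can be illustrated with a few made-up addresses. The Python sketch below computes the same value as the lea / bsf / shl sequence in the test program for a nonzero address; `reported_alignment` is a hypothetical name.

```python
def reported_alignment(addr):
    # Lowest set bit of a nonzero address: the largest power of two dividing it
    return addr & -addr

for addr in (0x1001, 0x1010, 0x1020, 0x1040):   # hypothetical addresses
    print(f"{addr:#x} -> {reported_alignment(addr)}")
```

An address such as 0x1040 reports 64 even if only 16-byte alignment was requested, because any address aligned to 64 is also aligned to 16 and 32.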
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: nidud on April 10, 2017, 09:04:25 AM
deleted
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: habran on April 10, 2017, 09:48:41 AM
here (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/299644) is an interesting explanation about AVX alignment requirement.
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 10, 2017, 11:21:03 AM
Quote from: johnsa on April 10, 2017, 08:26:14 AM
Quote from: nidud on April 10, 2017, 07:04:14 AM
Come to think of it this should probably also apply to 32-bit. Maybe 16-bit as well.

option alignlocals:[on|off]

32 and 64 would still be correct. You're checking the alignment with bsf, so something being aligned to 64 doesn't preclude 16; that's just by virtue of the fact that cyclic aligns to 16 will eventually land on a higher power of 2. I'm not bothering to align to 32 for ymmword, as the OS doesn't give you a 32-byte aligned stack on entry, or at least it's not guaranteed. C/C++ use movups to refer to ymm locals, so I'll do the same for now (until we get guaranteed 32-byte entry alignment). You can manually compensate on entry into an asm app, but if you were making a library and using it from an HLL without that, assuming ymm locals are aligned will probably not work?

Why bother, bro? AVX doesn't take alignment too seriously anyway. HJWASM should just stick to the ABI requirements as specified, for now. Since M$ never stated that their ABI is closed-ended, it is highly likely that they will, in the future, introduce 32-byte or even 64-byte extensions to the current ABI. Then you can put the finishing touch on HJWASM once and for all.
Title: Re: Stack Alignment Without Stack Frame in 64-bit ABI
Post by: coder on April 10, 2017, 11:39:58 AM
Quote from: habran on April 10, 2017, 09:48:41 AM
here (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/299644) is an interesting explanation about AVX alignment requirement.

My thoughts exactly!