Author Topic: Stack Alignment Without Stack Frame in 64-bit ABI  (Read 995 times)

coder

« on: April 04, 2017, 04:31:09 AM »
This technique assumes that the stack environment of the codebase defaults to 16-byte alignment.
 
One advantage is that this can be used to completely eliminate the costly function prologs and epilogs, for good.
Another advantage is that this can also deal with unconventional alignment, such as 32-byte.

It has the same effect as AND RSP,-16 or AND RSP,-32, minus the stack frame.

To align the stack to the desired boundary, the formula is

SUB RSP,(A*B + 8 MOD A)


and restore it almost the same way using

ADD RSP,(A*B + 8 MOD A)

where:
A = the desired alignment (16 or 32)
B = the number of 'notches' of local space to allocate, each A bytes wide
Default codebase alignment = 16

Example: to create 32*8 bytes of local space aligned to a 32-byte boundary (in a codebase with 16-byte stack alignment), use it like so

Code: [Select]
sub rsp,(32*8 + 8 MOD 32)
If you are not interested in allocating local space, just use B=0 and still get the stack aligned according to A.
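
As a quick check of the arithmetic, the Python sketch below models the formula. It rests on two assumptions that the post does not spell out: that MASM evaluates `8 MOD A` before the addition (MOD binds tighter than `+`), and that RSP sits 8 bytes below a 16-byte boundary on function entry, which is how the Win64 ABI leaves it after CALL pushes the return address. The function names and the sample addresses are made up for illustration.

```python
def frame_size(a, b):
    """Evaluate A*B + 8 MOD A the way MASM parses it: A*B + (8 MOD A)."""
    return a * b + (8 % a)

def rsp_after_sub(entry_rsp, a, b):
    """Model SUB RSP,(A*B + 8 MOD A) applied to the RSP seen on entry."""
    return entry_rsp - frame_size(a, b)

# Hypothetical entry values: both are 8 bytes below a 16-byte boundary,
# but only the second one is also 8 bytes below a 32-byte boundary.
ENTRY_A = 0x7FF8   # 0x7FF8 % 32 == 24
ENTRY_B = 0x7FE8   # 0x7FE8 % 32 == 8

for entry in (ENTRY_A, ENTRY_B):
    for a in (16, 32):
        rsp = rsp_after_sub(entry, a, 8)
        print(f"entry={entry:#x} A={a} size={frame_size(a, 8)} "
              f"rsp={rsp:#x} rsp%{a}={rsp % a}")
```

In this model the A=16 case lands on a 16-byte boundary for both entry values. The A=32 case lands on a 32-byte boundary only when the entry RSP happens to be 8 below a 32-byte boundary, which a 16-byte alignment guarantee alone does not ensure.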

I haven't run enough tests on it, so I don't know how it would work in all situations. Corrections and improvements are welcome.

Thanks.

mabdelouahab

« Reply #1 on: April 04, 2017, 07:03:13 AM »
Quote
A = the desired alignment (16 or 32)
8 MOD 16 = 8 MOD 32 = 8
    ==> SUB RSP,(A*B + 8 MOD A) = SUB RSP,(A*B + 8  )
    if A = 16:
    B=4 (dword):SUB RSP,(16*4+ 8  ) ==> SUB RSP,72
    B=8 (dword):SUB RSP,(16*8+ 8  ) ==> SUB RSP,136  :eusa_naughty:   
 

SUB RSP,((localbytes +8) and -10h)
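
The masked expression can be evaluated the same way. This is a minimal Python sketch; `localbytes` comes from the line above, while the helper name and the sample sizes are invented for illustration.

```python
def masked_size(localbytes):
    """(localbytes + 8) AND -10h: add 8, then clear the low four bits."""
    return (localbytes + 8) & -0x10

# A few sample local sizes (illustrative values only).
for localbytes in (32, 40, 64, 72, 128):
    size = masked_size(localbytes)
    print(f"localbytes={localbytes:3} -> SUB RSP,{size} (size % 16 = {size % 16})")
```

The result is always a multiple of 16, so the SUB leaves the residue of RSP modulo 16 unchanged. Under that reading, it keeps an already aligned RSP aligned (for example after a PUSH RBP), but it does not by itself correct an RSP that is 8 off a 16-byte boundary.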

coder

« Reply #2 on: April 04, 2017, 07:49:08 AM »
Quote
A = the desired alignment (16 or 32)
8 MOD 16 = 8 MOD 32 = 8
    ==> SUB RSP,(A*B + 8 MOD A) = SUB RSP,(A*B + 8  )
    if A = 16:
    B=4 (dword):SUB RSP,(16*4+ 8  ) ==> SUB RSP,72
    B=8 (dword):SUB RSP,(16*8+ 8  ) ==> SUB RSP,136  :eusa_naughty:   
 

SUB RSP,((localbytes +8) and -10h)

Yes, that has the same effect as AND RSP,-16: you get an extra 8 bytes and an aligned stack at the same time. Only this time, no stack frame is needed.

16*8 + 8 = 128 + 8 = 136. Exactly the effect of an AND RSP,-16 / SUB RSP,16*8 sequence. Stack aligned. No stack frame :)
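
The claimed equivalence can be checked numerically. The Python sketch below compares the two sequences, assuming RSP is 8 bytes below a 16-byte boundary on entry; the helper names and the sample addresses are hypothetical.

```python
def and_then_sub(rsp, align, size):
    """AND RSP,-align followed by SUB RSP,size."""
    return (rsp & -align) - size

def single_sub(rsp, size):
    """A lone SUB RSP,size with no AND."""
    return rsp - size

# Hypothetical entry addresses, each 8 below a 16-byte boundary.
aligned_entries = [0x7FF8, 0x7FE8, 0x12345678]
for rsp in aligned_entries:
    print(hex(and_then_sub(rsp, 16, 16 * 8)), hex(single_sub(rsp, 16 * 8 + 8)))

# If the entry assumption does not hold, the two sequences diverge.
odd_entry = 0x7FF0
print(hex(and_then_sub(odd_entry, 16, 16 * 8)), hex(single_sub(odd_entry, 16 * 8 + 8)))
```

The two sequences agree only because of the entry assumption: the lone SUB has no way to compensate if RSP arrives with a different residue.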

Thanks for testing it.



nidud

    • https://github.com/nidud/asmc
« Reply #3 on: April 04, 2017, 09:30:56 PM »
simple test case:

Code: [Select]
enter 32,0
movaps [rbp-16],xmm0
movaps [rbp-32],xmm1
leave

Code: [Select]
sub rsp,16*2 + 8
movaps [rsp+0],xmm0
movaps [rsp+16],xmm1
add rsp,16*2 + 8

Code: [Select]
push rbp
mov rbp,rsp
sub rsp,32
movaps [rbp-16],xmm0
movaps [rbp-32],xmm1
leave

Code: [Select]
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4.2)
----------------------------------------------
-- test(1)
    38787 cycles, rep(3000), code( 14) 0.asm: enter/leave
    14545 cycles, rep(3000), code( 18) 1.asm: sub rsp
    12832 cycles, rep(3000), code( 18) 2.asm: push rbp
-- test(2)
    38926 cycles, rep(3000), code( 14) 0.asm: enter/leave
    15945 cycles, rep(3000), code( 18) 1.asm: sub rsp
    17186 cycles, rep(3000), code( 18) 2.asm: push rbp
-- test(3)
    39026 cycles, rep(3000), code( 14) 0.asm: enter/leave
    15001 cycles, rep(3000), code( 18) 1.asm: sub rsp
    14988 cycles, rep(3000), code( 18) 2.asm: push rbp

total [1 .. 3], 1++
    45006 cycles 2.asm: push rbp
    45491 cycles 1.asm: sub rsp
   116739 cycles 0.asm: enter/leave
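
The totals can be re-derived from the per-test cycle counts in the listing. The Python sketch below only re-adds those numbers and expresses each total relative to the fastest variant; the dictionary layout is just an illustration.

```python
# Cycle counts copied from test(1), test(2) and test(3) in the listing.
cycles = {
    "enter/leave": [38787, 38926, 39026],
    "sub rsp": [14545, 15945, 15001],
    "push rbp": [12832, 17186, 14988],
}

totals = {name: sum(runs) for name, runs in cycles.items()}
fastest = min(totals.values())

for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{total:>7} cycles  {name:<12} {total / fastest:.2f}x")
```

On these figures the two hand-written prologs are within about one percent of each other overall, and the per-test ordering between them changes from run to run.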

coder

« Reply #4 on: April 05, 2017, 01:11:03 AM »
Hi Nidud. Thanks for testing it out.

Common sense says that when you're testing a function, you should at least test its performance by actually making a call to it, or to something similar in structure. I ran a quick test, slightly different from yours, using GetTickCount64 with 100 million loop iterations

Code: [Select]
361 ms for sub
860 ms for push

Roughly a 50 to 60 percent gain.
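
For reference, the gain can be computed from the two timings. This is a minimal Python sketch that uses only the 361 ms and 860 ms figures above; the variable names are illustrative.

```python
sub_ms = 361    # timing reported for the SUB RSP variant
push_ms = 860   # timing reported for the PUSH RBP variant

time_saved = (push_ms - sub_ms) / push_ms   # fraction of run time saved
speedup = push_ms / sub_ms                  # how many times faster

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```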

I know the formula I suggested above is slightly off for 16-byte alignment, but it's working just fine for 32-byte alignment. Hope someone can offer a little correction.

Thanks and nice to know you.

coder

« Reply #5 on: April 05, 2017, 01:16:24 AM »
Additional info: This works only in a strict Win64 ABI environment, where stack alignment is properly observed across calls. It won't work on Linux.

 

nidud

« Reply #6 on: April 05, 2017, 02:27:39 AM »
Hi coder,

Quote
Common sense says that when you're testing a function, you should at least test its performance by actually making a call to it, or to something similar in structure.

 :biggrin:

The test calls each algo (3000 << 10) times, and they are all loaded at the same offset to ensure equal alignment for each test.

Quote
I ran a quick test, slightly different than yours

Indeed.

You start by creating separate loops for each algo, so they will most likely end up at different offsets. The same goes for the actual functions you're testing: none of them are aligned. Here's a modified version:

Code: [Select]
;ml64 /c test2.asm
;gcc -m64 test2.obj -o test2.exe
externdef printf:proc
externdef GetTickCount64:proc

MAXLOOP equ 100000000

.data
fmt1 db 'sub : %d ms',0ah,0
fmt2 db 'push: %d ms',0ah,0


.code
main proc
sub rsp,40
;===============================
;test1 group
;===============================
cpuid
mov rdi,MAXLOOP
call GetTickCount64
mov rbx,rax ;start tick (the return value is in RAX)
F1: call t2;t1
sub rdi,1
jnz F1
call GetTickCount64
sub rax,rbx ;elapsed ms
mov rdx,rax ;second printf argument
mov rcx,offset fmt2;fmt1
call printf
;===============================
;test2 group
;===============================
cpuid
mov rdi,MAXLOOP
call GetTickCount64
mov rbx,rax ;start tick (the return value is in RAX)
F2: call t1;t2
sub rdi,1
jnz F2
call GetTickCount64
sub rax,rbx ;elapsed ms
mov rdx,rax ;second printf argument
mov rcx,offset fmt1;fmt2
call printf
add rsp,40
ret
main endp

;==================================
align 16
t3 proc
sub rsp,32*8 + 8 mod 32
movdqa [rsp],xmm2
movdqa [rsp+16],xmm3
add rsp,32*8 + 8 mod 32
ret
t3 endp
align 16
t4 proc
push rbp
mov rbp,rsp
sub rsp,32*8
movdqa [rbp-16],xmm0
movdqa [rbp-32],xmm1
leave
ret
t4 endp
align 16
t2 proc
push rbp
mov rbp,rsp
sub rsp,32*8
movdqa [rbp-16],xmm0
movdqa [rbp-32],xmm1
call t3
leave
ret
t2 endp
align 16
t1 proc
sub rsp,32*8 + 8 mod 32
movdqa [rsp],xmm0
movdqa [rsp+16],xmm1
call t4
add rsp,32*8 + 8 mod 32
ret
t1 endp

;==================================
END

Now the results are a bit different:
Code: [Select]
push: 327 ms
sub : 359 ms
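
To see what the stack looks like in each call chain of the listing, the Python sketch below replays the stack operations of main, t1 to t4 and reports each MOVDQA address modulo 16. It assumes RSP is 8 bytes below a 16-byte boundary when main is entered; the step encoding, the function name and the entry address are invented for illustration.

```python
FRAME = 32 * 8 + 8 % 32   # 264, as MASM evaluates 32*8 + 8 mod 32

def movdqa_residues(steps, rsp):
    """Replay stack operations and return each MOVDQA address modulo 16."""
    rbp = 0
    residues = []
    for op, arg in steps:
        if op in ("call", "push"):
            rsp -= 8                      # return address or pushed register
        elif op == "sub rsp":
            rsp -= arg
        elif op == "mov rbp,rsp":
            rbp = rsp
        elif op == "movdqa [rsp+n]":
            residues.append((rsp + arg) % 16)
        elif op == "movdqa [rbp+n]":
            residues.append((rbp + arg) % 16)
        else:
            raise ValueError(f"unknown step: {op}")
    return residues

MAIN = [("sub rsp", 40), ("call", None)]
T1 = [("sub rsp", FRAME), ("movdqa [rsp+n]", 0), ("movdqa [rsp+n]", 16)]
T2 = [("push", None), ("mov rbp,rsp", None), ("sub rsp", 32 * 8),
      ("movdqa [rbp+n]", -16), ("movdqa [rbp+n]", -32)]
T3 = T1   # same stack shape as t1, without the nested call
T4 = T2   # same stack shape as t2, without the nested call

ENTRY_RSP = 0x7FF8   # hypothetical RSP on entry to main

print(movdqa_residues(MAIN + T1 + [("call", None)] + T4, ENTRY_RSP))
print(movdqa_residues(MAIN + T2 + [("call", None)] + T3, ENTRY_RSP))
```

In this model every MOVDQA operand in both chains is 16-byte aligned, so the two variants differ in how they set up the frame rather than in whether the stores are aligned.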

coder

« Reply #7 on: April 05, 2017, 02:57:27 AM »
Hi nidud. Thanks for the reply and the counter-tests.

I was actually trying to eliminate other variables and focus solely on the performance of function prologs/epilogs vs plain fastcall, arranged as-is. If we were to include other variables such as code alignment and arrangement, it would be a completely different discussion, no longer related to stack alignment, which is a dynamic property of a program rather than a static one like code alignment.

But I appreciate your replies, though.

Thanks.


 

nidud

« Reply #8 on: April 05, 2017, 04:51:34 AM »
Quote
I was actually trying to eliminate other variables and focus solely on the performance of function prologs/epilogs vs plain fastcall, arranged as-is. If we were to include other variables such as code alignment and arrangement,

A bit difficult to test the performance of function prologs/epilogs without doing just that..

Quote
it would be a completely different discussion, no longer related to stack alignment, which is a dynamic property of a program rather than a static one like code alignment.

 :biggrin:

I think in this case we have to conclude the stack is static. The macro, as you pointed out, wouldn't work otherwise.

Well, my point was just to show the advantage / disadvantage of using RSP or RBP as a base. There are a few assumptions made about this, but yes, the former is usually faster.

coder

« Reply #9 on: April 08, 2017, 07:49:05 AM »
Quote
A bit difficult to test the performance of function prologs/epilogs without doing just that..
I think in this case we have to conclude the stack is static. The macro, as you pointed out, wouldn't work otherwise.

Well, my point was just to show the advantage / disadvantage of using RSP or RBP as a base. There are a few assumptions made about this, but yes, the former is usually faster.

I don't think the stack is static, though. It's set up at runtime. It's the same reason why we can't apply the ALIGN directive against it in LOCALS, due to its runtime nature, unlike applying it against code, which is done / calculated at compile time (static).

Yeah, it doesn't quite work. How I wish someone with better math could 'repair' it. It would save us lots of stack problems :D

jj2007

« Reply #10 on: April 08, 2017, 08:25:29 AM »
Quote
The same reason why we can't apply ALIGN directive against it in LOCALS due to its runtime nature

The x64 ABI requires that the stack itself is aligned on entry. Therefore assembly time alignment is possible.
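
One way to picture assembly-time alignment is as a layout pass over the locals. The Python sketch below assumes a conventional PUSH RBP / MOV RBP,RSP prolog, so that RBP is a multiple of 16 once the ABI entry alignment holds. The function name, the local names and the sample RBP value are hypothetical and do not describe how any particular assembler does it.

```python
def layout_locals(locals_):
    """Assign RBP-relative offsets so each local meets its own alignment."""
    offsets = {}
    cursor = 0  # bytes used below RBP so far
    for name, size, align in locals_:
        cursor += size
        cursor = (cursor + align - 1) // align * align  # round up to the alignment
        offsets[name] = -cursor
    return offsets

# (name, size in bytes, required alignment); alignments up to 16 only.
sample = [("flag", 1, 1), ("vec", 16, 16), ("count", 4, 4)]
offsets = layout_locals(sample)
print(offsets)

RBP = 0x7FE0  # hypothetical frame pointer, a multiple of 16
for name, _size, align in sample:
    print(name, hex(RBP + offsets[name]), (RBP + offsets[name]) % align)
```

Because every offset is a multiple of the local's alignment and RBP is a multiple of 16, each address is aligned without any runtime AND. Alignments above 16 are not covered by this reasoning, since the entry guarantee only speaks to 16 bytes.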

coder

« Reply #11 on: April 08, 2017, 10:10:47 AM »
Quote
The same reason why we can't apply ALIGN directive against it in LOCALS due to its runtime nature

The x64 ABI requires that the stack itself is aligned on entry. Therefore assembly time alignment is possible.

Yeah.

Here's an idea. It would be a nice addition if you could extend the "LOCAL" directive to something like

    local    align 16 <var>
    local    <var> align 32

I am extremely poor at macro though. But if this can be done, it will be a beautiful addition and feature.

jj2007

« Reply #12 on: April 08, 2017, 10:34:00 AM »
Quote
I am extremely poor at macro though. But if this can be done, it will be a beautiful addition and feature.

The PROLOG macro does not allow access to individual variables. But you can find a workaround here.

And afaik, the HJWasm team have implemented local alignment now.

nidud

« Reply #13 on: April 08, 2017, 10:18:48 PM »
Quote
I am extremely poor at macro though. But if this can be done, it will be a beautiful addition and feature.
And afaik, the HJWasm team have implemented local alignment now.

 :biggrin:

Alignment of locals was introduced in JWASM a long time ago.

Test case:
Code: [Select]
;
; Build: jwasm/hjwasm/asmc -pe test.asm
;
.x64
.model flat, fastcall

option dllimport:<msvcrt>
printf proto :ptr byte, :vararg
exit proto :qword
option dllimport:none

.data

format db "l1: %d",10
db "l2: %d",10
db "l3: %d",10
db "l4: %d",10
db "l5: %d",10,0

.code

option win64:7

Alignment proc uses rsi rdi rbx

  local l1: byte,
l2: xmmword,
l3: byte,
l4: ymmword,
l5: byte

GetAlig macro reg, l
lea reg,l
bsf rcx,reg
mov reg,1
shl reg,cl
endm

GetAlig rsi,l1
GetAlig r8, l2
GetAlig r9, l3
GetAlig r10,l4
GetAlig r11,l5

invoke printf, addr format, rsi, r8, r9, r10, r11

ret
Alignment endp

main proc

invoke Alignment
invoke exit,0

main endp

end main

I assume(d) JWASM only aligned 16..

JWASM xmm/xmm
Code: [Select]
l1: 1
l2: 16
l3: 1
l4: 16
l5: 1

JWASM xmm/ymm
Code: [Select]
l1: 1
l2: 16
l3: 1
l4: 64
l5: 1

ASMC xmm/xmm
Code: [Select]
l1: 1
l2: 16
l3: 1
l4: 16
l5: 1

ASMC xmm/ymm
Code: [Select]
l1: 1
l2: 16
l3: 1
l4: 64
l5: 1

HJWASM xmm/xmm
Code: [Select]
l1: 1
l2: 8
l3: 1
l4: 8
l5: 1

HJWASM xmm/ymm
Code: [Select]
l1: 1
l2: 8
l3: 1
l4: 8
l5: 1

It appears that HJWASM is not compatible with JWASM.

Adding option win64:11:

HJWASM xmm/xmm
Code: [Select]
l1: 16
l2: 32
l3: 16
l4: 256
l5: 16

HJWASM xmm/ymm
Code: [Select]
l1: 16
l2: 64
l3: 16
l4: 16
l5: 16
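
For reading these listings, it helps to restate what the GetAlig macro computes: BSF finds the index of the lowest set bit of the address and `1 SHL cl` rebuilds that bit, so each printed value is the largest power of two that divides the address. This is a minimal Python equivalent; the function name and the sample addresses are made up for illustration.

```python
def get_alig(address):
    """Mirror GetAlig: isolate the lowest set bit of an address."""
    if address == 0:
        # BSF leaves its destination undefined when the source is zero.
        raise ValueError("address must be non-zero")
    bit_index = (address & -address).bit_length() - 1   # what BSF would return
    return 1 << bit_index

# Hypothetical stack addresses.
for address in (0x7FEF, 0x7FF0, 0x7FE0, 0x7FC0, 0x7F00):
    print(f"{address:#x} -> {get_alig(address)}")
```

A value can therefore come out larger than the alignment the assembler aimed for (a 16-byte aligned slot may happen to sit on a 64- or 256-byte boundary), while a value smaller than the type's size, such as 16 for a ymmword, shows that the local did not get its natural alignment in that run.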

johnsa

« Reply #14 on: April 10, 2017, 05:23:34 AM »
Code: [Select]

;
; Build: jwasm/hjwasm/asmc -pe test.asm
;
.x64
.model flat, fastcall
option win64:7
option stackbase:RBP

option dllimport:<msvcrt>
printf proto :ptr byte, :vararg
exit proto :qword
option dllimport:none

.data

format db "l1: %d",10
db "l2: %d",10
db "l3: %d",10
db "l4: %d",10
db "l5: %d",10,0

.code

Alignment proc uses rsi rdi rbx

  local l1 : byte
local l2 : xmmword
local l3 : byte
local l4 : ymmword
local l5 : byte

GetAlig macro reg, l
lea reg,l
bsf rcx,reg
mov reg,1
shl reg,cl
endm

GetAlig rsi,l1
GetAlig r8, l2
GetAlig r9, l3
GetAlig r10,l4
GetAlig r11,l5

invoke printf, addr format, rsi, r8, r9, r10, r11

ret
Alignment endp

main proc

invoke Alignment
invoke exit,0

main endp

end main


Using win64:7, stackbase:rbp
l1: 1
l2: 16
l3: 1
l4: 64
l5: 1

Win64:11, stackbase:rsp
l1: 16
l2: 64
l3: 16
l4: 16
l5: 16