How would y'all write this assembly?

lemonjumps · September 07, 2024, 01:31:23 AM

Hi, so, I have this code.

What it does is call another function using the x86-64 Microsoft calling convention.

Its input parameters are a list of variables aligned to 8 bytes, in rdx (it's assumed that the variables are already formatted correctly for the CPU).
Then there's r8, which holds a list of types for special cases like float and double, also aligned to 8 bytes.
rcx has the function that should be called, and r9 has the count of arguments.

I've chosen to store the values destined for rcx, rdx, r8 and r9 in r11 to r14; r10 is the array offset, and r15 stores the size of the additional stack space where the remaining values are passed.

What I'd like to ask is: how could this be better, and what should I improve?

This is literally the first actual x64 assembly I ever wrote. (TwT)
I've also noticed that the variables I'm pushing to the stack are technically backwards, so I'll have to fix that too. I think it could be written so that I reserve the stack space first and then write into it backwards.

Code:

pinADcallWIN:
    push rbp
    mov rbp, rsp
    
    xor r15, r15
    xor r10, r10

    dec r9d
    jz __winCall

    mov rax, qword ptr [r8 + r10]
    cmp rax, 1
    je _rcxSWfloat
    cmp rax, 2
    je _rcxSWdouble
    mov r11, qword ptr [rdx + r10]
    jmp _rcxSWend
_rcxSWfloat:
    movss xmm0, dword ptr [rdx + r10]
    jmp _rcxSWend
_rcxSWdouble:
    movsd xmm0, qword ptr [rdx + r10]
_rcxSWend:
    add r10, 8

    dec r9d
    jz __winCall

    mov rax, qword ptr [r8 + r10]
    cmp rax, 1
    je _rdxSWfloat
    cmp rax, 2
    je _rdxSWdouble
    mov r12, qword ptr [rdx + r10]
    jmp _rdxSWend
_rdxSWfloat:
    movss xmm1, dword ptr [rdx + r10]
    jmp _rdxSWend
_rdxSWdouble:
    movsd xmm1, qword ptr [rdx + r10]
_rdxSWend:
    add r10, 8

    dec r9d
    jz __winCall

    mov rax, qword ptr [r8 + r10]
    cmp rax, 1
    je _r8SWfloat
    cmp rax, 2
    je _r8SWdouble
    mov r13, qword ptr [rdx + r10]
    jmp _r8SWend
_r8SWfloat:
    movss xmm2, dword ptr [rdx + r10]
    jmp _r8SWend
_r8SWdouble:
    movsd xmm2, qword ptr [rdx + r10]
_r8SWend:
    add r10, 8

    dec r9d
    jz __winCall

    mov rax, qword ptr [r8 + r10]
    cmp rax, 1
    je _r9SWfloat
    cmp rax, 2
    je _r9SWdouble
    mov r14, qword ptr [rdx + r10]
    jmp _r9SWend
_r9SWfloat:
    movss xmm3, dword ptr [rdx + r10]
    jmp _r9SWend
_r9SWdouble:
    movsd xmm3, qword ptr [rdx + r10]
_r9SWend:
    add r10, 8

    cmp r9d, 0
    je __winCall

__winCallLoop:
    push qword ptr [rdx + r10]
    add r15, 8
    add r10, 8
    dec r9d
    jnz __winCallLoop
__winCall:
    mov rax, rcx
    mov rcx, r11
    mov rdx, r12
    mov r8, r13
    mov r9, r14

    mov qword ptr [rbp + 16], r15

    sub rsp, 32

    call rax

    mov r15, qword ptr [rbp + 16]
    add rsp, r15
    add rsp, 32
    
    pop rbp
    ret
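
A minimal sketch of the "reserve stack first, then write into it" idea mentioned above. The register choices here are assumptions for illustration and do not come from the original code: r10 holds the argument count, r11 points at the 8-byte argument array, and rbp has already been set up as the frame pointer.

```asm
; Assumptions (hypothetical, for illustration only):
;   r10 = total argument count, r11 = pointer to the 8-byte argument array,
;   rbp = frame pointer already saved and set (push rbp / mov rbp, rsp).
    mov   rax, r10
    sub   rax, 4                ; number of arguments that do not fit in registers
    jg    have_stack_args
    xor   eax, eax              ; four or fewer: nothing beyond the shadow space
have_stack_args:
    lea   rax, [rax*8 + 32]     ; stack arguments + 32 bytes of shadow space
    sub   rsp, rax              ; reserve everything in one step
    and   rsp, -16              ; round down so RSP is 16-byte aligned at the call

    mov   eax, 4                ; zero-based index of the first stack argument
copy_next:
    cmp   rax, r10
    jae   copy_done
    mov   rdx, [r11 + rax*8]    ; read argument number rax + 1
    mov   [rsp + rax*8], rdx    ; argument 5 lands at [rsp+32], argument 6 at [rsp+40], ...
    inc   rax
    jmp   copy_next
copy_done:
    ; ... load rcx, rdx, r8, r9 (and xmm0-xmm3), then call the target ...
    mov   rsp, rbp              ; drop the whole area; no byte count to remember
    pop   rbp
```

Because the slot for a zero-based argument index i is 32 + (i - 4) * 8 = i * 8, the copy can run forwards and nothing ends up reversed. Restoring RSP from rbp also means the size of the extra area never has to be stored anywhere.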

Vortex · September 07, 2024, 03:46:18 AM

Hi lemonjumps,

Code:

xor r15, r15

Your function does not call other functions or API functions, but it's important to note that:

Quote:
The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile. They must be saved and restored by a function that uses them.

https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170

NoCforMe · September 07, 2024, 06:26:19 AM

Let me second that emotion; very important point here. The ABI expects certain registers to be un-trashed, so be sure to save and restore them in your code if you need to use them. In x64:

RBX
RBP
RDI
RSI
RSP
R12
R13
R14
R15
XMM6-XMM15

are what I call "sacred" registers. Don't be sacrilegious!

(All others, RAX, RCX, etc., are volatile and can be trashed.)
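
A minimal sketch of what "save and restore" can look like in practice. The routine name and the work it does are hypothetical; only the push/pop pattern matters.

```asm
.code
; Hypothetical routine that wants r12 and r15 as scratch registers.
; It makes no calls itself, so stack alignment is not a concern here.
UsesSacredRegs proc
    push  r12                   ; save the caller's r12
    push  r15                   ; save the caller's r15

    xor   r15, r15              ; both are now free to use
    mov   r12, rcx              ; e.g. keep the first argument around
    lea   rax, [r12 + r15]      ; placeholder work; result is returned in rax

    pop   r15                   ; restore in reverse order of the pushes
    pop   r12
    ret
UsesSacredRegs endp
end
```

A routine that gets by with only volatile registers (RAX, RCX, RDX, R8-R11) can skip the saves entirely.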

lemonjumps · September 07, 2024, 07:55:23 AM

Quote from: Vortex on September 07, 2024, 03:46:18 AM
The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile. They must be saved and restored by a function that uses them.

Oooh OK, I didn't know that. And yeah, I'm actually calling RAX at the end :D
I'll be refactoring it tonight, so I'll look into it.

NoCforMe · September 07, 2024, 08:06:59 AM

Quote from: lemonjumps on September 07, 2024, 07:55:23 AM
I didn't know that. and yeah I'm actually calling RAX at the end

That's fine, so long as you know what's in that register when you issue the call ...

Just so you know, the flipside of having to save and restore non-volatile registers is to assume that all the volatile ones (RAX, RCX, etc.) contain garbage when entering your code (unless, of course, you yourself put something in them before making the call, which is perfectly legitimate).

lemonjumps · September 07, 2024, 10:12:44 PM

OK, so I'm working on optimizing my code, and I realized two things.

1. The list of variables has basically the same format as the stack, so just treating it as one shouldn't be a problem.
2. Since the contents of volatile registers only matter when they carry a parameter, I can just write each value into both the XMM register and the matching normal register, and the called function will pick whichever it wants.

The code is still lengthy, since I have versions for 1, 2, 3 and 4+ parameters, as that looks to be the simplest and fastest solution :D
Also, I wonder whether it's faster to use both pop and movsd, or to have a cmp with pointer math.

here's what my code looks like now OwO

Code:

pinADcallWIN:
    push rbp
    push r11 
    push r12 
    push r13
    push r14
    push r15
    mov rbp, rsp

    cmp r9, 1
    je _winCall1
    cmp r9, 2
    je _winCall2
    cmp r9, 3
    je _winCall3
    
    jmp _winCall4p

_winCall1:
    mov rsp, rdx

    movsd xmm0, qword ptr [rsp]
    pop r11

    mov rsp, rbp
    jmp __winCall

_winCall2:
    mov rsp, rdx

    movsd xmm0, qword ptr [rsp]
    pop r11
    movsd xmm1, qword ptr [rsp]
    pop r12

    mov rsp, rbp
    jmp __winCall

_winCall3:
    mov rsp, rdx

    movsd xmm0, qword ptr [rsp]
    pop r11
    movsd xmm1, qword ptr [rsp]
    pop r12
    movsd xmm2, qword ptr [rsp]
    pop r13

    mov rsp, rbp
    jmp __winCall

_winCall4p:
    mov rsp, rdx

    movsd xmm0, qword ptr [rsp]
    pop r11
    movsd xmm1, qword ptr [rsp]
    pop r12
    movsd xmm2, qword ptr [rsp]
    pop r13
    movsd xmm3, qword ptr [rsp]
    pop r14

    mov rsp, rbp

    sub r9d, 4
    jz __winCall

    add rdx, 24
    imul r9, 8
    mov r15, r9

__winCallLoop:
    push qword ptr [rdx + r9]
    sub r9d, 8
    jnz __winCallLoop

__winCall:
    mov rax, rcx
    mov rcx, r11
    mov rdx, r12
    mov r8, r13
    mov r9, r14

    sub rsp, 32

    call rax

    add rsp, r15
    add rsp, 32

    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop rbp
    ret
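
A hedged sketch of the "write into both XMM and normal registers" idea that reads the array directly instead of pointing RSP at it, so RSP stays on the real stack the whole time. The setup is an assumption for illustration: r10 points at the argument array (rather than rdx, which gets overwritten along the way), and the array has at least four readable slots even when fewer arguments are in use.

```asm
; Assumption (hypothetical): r10 = pointer to the 8-byte argument array,
; with at least four readable 8-byte slots.
    mov   rcx,  qword ptr [r10]         ; argument 1 as an integer/pointer
    movsd xmm0, qword ptr [r10]         ; same 64 bits, in case the callee expects a float/double
    mov   rdx,  qword ptr [r10 + 8]     ; argument 2
    movsd xmm1, qword ptr [r10 + 8]
    mov   r8,   qword ptr [r10 + 16]    ; argument 3
    movsd xmm2, qword ptr [r10 + 16]
    mov   r9,   qword ptr [r10 + 24]    ; argument 4
    movsd xmm3, qword ptr [r10 + 24]
```

Loading all four slots unconditionally removes the need for separate 1, 2 and 3 parameter paths, since a callee simply ignores registers it has no parameter for. Having the value in both register files is also what the Microsoft convention asks for when floating-point values are passed to variadic functions.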

satpro · December 25, 2024, 01:39:19 AM

I would like to suggest a different way of looking at the x64 stack. I would also like to confess that this was the most difficult concept I have ever had to learn in assembly language, bar none. In fact, it kept me from getting past CreateWindow without using INVOKE, and it stalled my DirectX learning for over a year. Even thinking about those dark days gives me the chills.

To see what I mean, all you have to do is look in a debugger at actual, emitted code, especially around the INVOKEs. Even in assembly it will make your eyes bug out. The overhead! I use GoAsm, which is an efficient assembler output-wise, but even its INVOKE or COMINVOKE is costly, stretching easily into dozens of extra opcode bytes for any meaty Win32 call.

I looked at code for a long time, and at some point I realized most Win32 programs spend much of their life on continual stack manipulation. It's awful, to be honest. Let's just say it was a mind-bender when it finally hit me.

It does not need to be that way.

We have been taught (more or less) to look at the x86/64 Stack as some "pile of plates" that you are seemingly external to and then access by using pushes and pulls, pushing RSP, RBP, etc. And the absolute BIGGEE of all BIGGEES? We ALL put our parameters on the Stack in reverse order. Why?

What if you were to view the Stack as a room you are standing in and never leave, but just add to and remove from while within? It's a different point of view that leads to simplicity. And it works for both x86 and x64.

This is what I do; maybe it will click for you, too. It starts with aligning the Stack once and only once, and not for each INVOKE (which is a macro I no longer use) and this is why:

...

So, at 'Start' or in any callback (e.g. WndProc or TimerProc) you ALWAYS have a stack that is aligned to 8 but not to 16 (RSP ends with 8, not 0). Always, and your calls to Win32 must look the same way to Windows -- RSP is off by 8 on entry. Usually the largest-parameter Win32 call you will encounter in 'Winmain' is CreateWindow with 10 or 11 params. Keep this in mind. We are going to re-use the Stack and need to know the largest call parameter-wise. And then I keep 'Winmain' as monolithic as possible -- a tree-stump looking for branches. Every subroutine branches and returns to the stump.

The FIRST thing you will do in any program is align the stack to an address ending with a 0: a multiple of 16. You do that with a simple SUB RSP, X8h. Now the Stack has a 0h-looking RSP. You calculate how large the 'X0h' part needs to be; it is a multiple of 16. Then, at the very end (or not, in the case of ExitProcess), you will write a matching ADD RSP, X8h. The Stack will NOT matter at the end of your program.

You can add to that SUB RSP, though:

You have a shadow space of 32 bytes.
You have a need for up to a dozen parameters.
You will be using the non-volatile registers at some point, and maybe a LOT of them if you are using SSE and so on.
You may need local space.
COM calls add an additional first parameter to the mix, the interface pptr.

You are going to make space for all of it at once and only once. Change that first line to: SUB RSP, 88h (which is: 128+8, or 136 decimal).

Now there is room for everything (and then some) and you never again need to worry about what you push or pull or any of that stack stuff. Now, watch this:

Code:

; Win32 call with 7 params
; ------------------------
mov rcx, p1                 ; p1: goes in RCX; if it's a ptr, use "LEA RCX, someptr" instead of MOV
mov rdx, p2                 ; p2: goes in RDX
mov r8, p3                  ; p3: goes in R8
mov r9, p4                  ; p4: goes in R9
mov [rsp+20h], p5           ; p5: goes to RSP+20h; params 5 and up go right after the 20h-byte shadow space
mov [rsp+28h], p6           ; p6: goes to RSP+28h
lea rax, someptr            ; if a param is a ptr, put it in a register first
mov [rsp+30h], rax          ; p7: instead of "MOV [RSP+30h], OFFSET someptr" (use LEA instead of MOV)
call Win32

And that's it. When you come back from the Win32 call the Stack is good and you are still standing in the "room" you made; it's the very first thing you did.
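
For a concrete instance of the same pattern, here is a small sketch in MASM (ml64) syntax rather than GoAsm. It calls CreateFileA, which happens to take seven parameters. The file name and the build line are assumptions, and the returned handle is deliberately ignored to keep the example short.

```asm
; Build (assumption): ml64 sketch.asm /link /subsystem:console /entry:main kernel32.lib
extrn CreateFileA:proc
extrn ExitProcess:proc

.data
fileName db "test.txt", 0       ; hypothetical file name

.code
main proc
    sub   rsp, 88h              ; one-time frame: shadow space + stack params + alignment

    lea   rcx, fileName         ; p1: lpFileName
    mov   edx, 80000000h        ; p2: dwDesiredAccess = GENERIC_READ
    mov   r8d, 1                ; p3: dwShareMode = FILE_SHARE_READ
    xor   r9d, r9d              ; p4: lpSecurityAttributes = NULL
    mov   qword ptr [rsp+20h], 3    ; p5: dwCreationDisposition = OPEN_EXISTING
    mov   qword ptr [rsp+28h], 80h  ; p6: dwFlagsAndAttributes = FILE_ATTRIBUTE_NORMAL
    mov   qword ptr [rsp+30h], 0    ; p7: hTemplateFile = NULL
    call  CreateFileA           ; handle (or INVALID_HANDLE_VALUE) comes back in rax

    xor   ecx, ecx              ; exit code 0
    call  ExitProcess           ; does not return, so no matching add rsp is needed
main endp
end
```

The second call reuses the same frame without touching RSP, which is the whole point of reserving it once.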

Your stack is forever aligned. Think how easy and structured your own leaf subroutines will be, already knowing that when you get there, YOUR stack is off by 8, too! You can calculate what you need, if any, and repeat the process on a smaller scale, maybe only saving a couple of registers like RSI, RDI, and RBX. In any leaf make sure your SUB RSP, XXh opcode ends with an 8h and is large enough to handle your needs for the entire subroutine. Here is a subroutine that saves three registers but has no Win32 calls:

Code:

align 16
MySub:

sub rsp, 18h                ; align with the 8h + space for 3 non-volatile registers only
mov [rsp+00h], rsi      ; save rsi  (you don't actually write the "+00h" part, just the "MOV [RSP], RSI")
mov [rsp+8h], rdi        ; save rdi
mov [rsp+10h], rbx     ; save rbx

; -----  Enter  -----

; -------------------------------------------------
; here is where you might use those registers
; -------------------------------------------------


; -----  Exit  -----

mov rsi, [rsp]             ; restore rsi  (a reverse of the entry, or prolog)
mov rdi, [rsp+8h]       ; restore rdi
mov rbx, [rsp+10h]     ; restore rbx
add rsp, 18h               ; restore the stack

ret

Count what you need. Space has to be allotted in multiples of 16, plus an 8h (8h, 18h, 28h, ... 108h, etc.).
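
As a worked count, here is a sketch of a subroutine that saves the same three registers but also makes a Win32 call, so it needs shadow space as well: 20h + 18h = 38h, which is 30h plus the 8h. It is written in MASM (ml64) syntax, and the routine name and the choice of GetTickCount are just assumptions for illustration.

```asm
extrn GetTickCount:proc         ; any Win32 function would do; this one takes no parameters

.code
MySubWithCall proc
    sub   rsp, 38h              ; 20h shadow space + 18h for three saved registers
    mov   [rsp+20h], rsi        ; saves sit above the shadow space, which belongs to the callee
    mov   [rsp+28h], rdi
    mov   [rsp+30h], rbx

    call  GetTickCount          ; RSP ends in 0 here: entry ...8 minus 38h
    mov   ebx, eax              ; placeholder use of a saved register

    mov   rsi, [rsp+20h]        ; restore in any order; these are plain moves
    mov   rdi, [rsp+28h]
    mov   rbx, [rsp+30h]
    add   rsp, 38h
    ret
MySubWithCall endp
end
```

Keeping the saved registers at [rsp+20h] and up matters: the callee is free to overwrite the first 20h bytes.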

Here is where I learned about the '8h' alignment part. If I had this call in a subroutine somewhere it would crash. GoAsm's ComInvoke did not account for it, and it always crashed... until I fixed what I was doing. Not to mention, everything in that macro is a push or pull.

Code:

HRESULT Blt(
  LPRECT lpDestRect,
  LPDIRECTDRAWSURFACE7 lpDDSrcSurface,
  LPRECT lpSrcRect,
  DWORD dwFlags,
  LPDDBLTFX lpDDBltFx
);

; a COM call with 5 params
; ----------------------
    mov rcx, [pp_Front]                 ; *this
    xor rdx, rdx                        ; p1: LPRECT lpDestRect
    mov r8, [pp_Back]                   ; p2: LPDIRECTDRAWSURFACE7 lpDDSrcSurface
    xor r9, r9                          ; p3: LPRECT lpSrcRect
    mov D[rsp+20h], DDBLT_WAIT          ; p4: DWORD dwFlags
    mov [rsp+28h], rdx                  ; p5: LPDDBLTFX lpDDBltFx

    mov rax, [rcx]                      ; resolve *this
    add rax, IDirectDrawSurface7.Blt    ; offset to Blt ptr
    call [rax]                          ; call the method
    test eax, eax                       ; HRESULT is 32 bits: check for the successful "0"
    jnz << Error                        ; or handle an error

You don't ever have to push params in reverse order. This is what I think of as bare-metal programming, or very close to it. And it beats regular assembly language techniques hands-down in a race. The trick is re-using the stack frame and foregoing macros you cannot see the innards of.

I just hope it was understandable.

Merry Christmas,
Bert

_japheth · December 27, 2024, 06:02:15 PM

Quote from: satpro on December 25, 2024, 01:39:19 AM
What if you were to view the Stack as a room you are standing in and never leave, but just add to and remove from while within? It's a different point of view that leads to simplicity. And it works for both x86 and x64.

It works for both modes; probably 15 years ago I first saw it in code generated by gcc for 32-bit: instead of a lot of pushes, the ESP register was simply used as a base register to write the arguments to:

Code:

    sub esp, 48
    mov dword [esp+4], 1
    mov dword [esp+8], 2
    call Win32func@8
    sub esp, 2*4
    mov dword [esp+4], 1
    mov dword [esp+8], 2
    mov dword [esp+12], 3
    call Win32func@12
    sub esp, 3*4
    add esp, 48
    ret

In theory, it's supposed to be faster, even in 32-bit, at least since the Pentium CPU.

However, my little benchmark (attached) shows that it not only bloats the code significantly (by a factor of 3), but is also about 10% slower. Not to mention the ugliness of the code ...

Note that the source of the benchmark is small, but the generated binaries are 7 MB/21 MB, which might be a little problem for some versions of Masm... The reason for those large binaries is to exceed the CPU's L2 cache.

daydreamer · December 28, 2024, 07:23:58 AM

Stack space: makes me think of a local 2D array.
What about using [esp+ebx*4], with the low byte of bx = x and the high byte of bx = y?
You might need to increase the stack space if you want a big room.
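
A small 32-bit sketch of that indexing idea, under the assumption that bl holds x and bh holds y, which makes ebx = y*256 + x and every row 256 dwords wide. The row count and the stored value are arbitrary.

```asm
; 32-bit sketch. Assumption: bl = x (column), bh = y (row), so ebx = y*256 + x
; and each row is 256 dwords (1024 bytes) wide.
ROWS      equ 4
ROOM_SIZE equ ROWS * 256 * 4    ; 4096 bytes of stack for the whole array

    push  ebx                   ; ebx is callee-saved in the 32-bit conventions too
    sub   esp, ROOM_SIZE        ; make the "room"
    xor   ebx, ebx              ; clear the upper 16 bits of the index
    mov   bl, 5                 ; x = 5
    mov   bh, 2                 ; y = 2
    mov   dword ptr [esp+ebx*4], 1234   ; array[y][x] = 1234, at byte offset (2*256+5)*4
    mov   eax, [esp+ebx*4]      ; read it back
    add   esp, ROOM_SIZE        ; leave the room
    pop   ebx
```

The price of the byte-pair index is the fixed 1024-byte row stride, so even a narrow array costs a kilobyte per row. Rooms much larger than a page also need the stack touched page by page on Windows, so the guard page is not skipped.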

jj2007 · December 28, 2024, 10:27:44 PM

The source has a problem, it seems:

Quote:
C:\Masm32\Members\Japheth\Stack>Make.bat
Assembling: STDCALL.ASM
STDCALL.OBJ : fatal error LNK1136: invalid or corrupt file
Assembling: STKCALL.ASM
STKCALL.OBJ : fatal error LNK1136: invalid or corrupt file

Same with UAsm64:
C:\Masm32\Members\Japheth\Stack>Make.bat
STDCALL.ASM: 106 lines, 2 passes, 14651 ms, 0 warnings, 0 errors
STDCALL.OBJ : fatal error LNK1136: invalid or corrupt file
STKCALL.ASM: 110 lines, 2 passes, 16028 ms, 0 warnings, 0 errors
STKCALL.OBJ : fatal error LNK1136: invalid or corrupt file

Quote from: _japheth on December 27, 2024, 06:02:15 PM
instead of a lot of pushes, there was just the ESP register used as base register to write the arguments to

X64 works very much that way. I doubt that it's significantly faster than the pushes - see attachment. In any case, instead of [esp+x] one should go for [ebp+x], as it's one byte shorter. Same for [edi+x] as used below.

Code:

AMD Athlon Gold 3150U with Radeon Graphics      (SSE4)

806     cycles for 100 * push & pop
783     cycles for 100 * move [esp]
789     cycles for 100 * move [edi]

787     cycles for 100 * push & pop
797     cycles for 100 * move [esp]
796     cycles for 100 * move [edi]

790     cycles for 100 * push & pop
797     cycles for 100 * move [esp]
791     cycles for 100 * move [edi]

790     cycles for 100 * push & pop
798     cycles for 100 * move [esp]
786     cycles for 100 * move [edi]

788     cycles for 100 * push & pop
802     cycles for 100 * move [esp]
789     cycles for 100 * move [edi]

20      bytes for push & pop
52      bytes for move [esp]
46      bytes for move [edi]
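
The byte counts at the end of the listing come down to instruction encoding: with ESP as the base register, the ModRM byte cannot name the base directly, so every access carries an extra SIB byte. A small sketch with the machine code written out as comments (the displacement 4 and the immediate 1 are arbitrary example values, not taken from the benchmark source):

```asm
; 32-bit encodings, shown as comments (hex bytes, then length)
    push  1                         ; 6A 01                      2 bytes
    mov   dword ptr [esp+4], 1      ; C7 44 24 04 01 00 00 00    8 bytes (ModRM + SIB)
    mov   dword ptr [ebp+4], 1      ; C7 45 04 01 00 00 00       7 bytes (ModRM only)
    mov   dword ptr [edi+4], 1      ; C7 47 04 01 00 00 00       7 bytes (ModRM only)
```

push stays far smaller because its destination address is implied by the instruction itself.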

_japheth · December 29, 2024, 12:20:38 AM

Quote from: jj2007 on December 28, 2024, 10:27:44 PM
The source has a problem, it seems:
Quote:
C:\Masm32\Members\Japheth\Stack>Make.bat
Assembling: STDCALL.ASM
STDCALL.OBJ : fatal error LNK1136: invalid or corrupt file
Assembling: STKCALL.ASM
STKCALL.OBJ : fatal error LNK1136: invalid or corrupt file

It's not an assembler problem, but a linker one. As mentioned in Make.bat, I had to make Masm create the object module in OMF format; if -coff was used, it was unable to finish the assembly process. So I guess you're using a linker that simply doesn't know the OMF format anymore.

jj2007 · December 30, 2024, 03:15:51 AM

Quote from: _japheth on December 29, 2024, 12:20:38 AM
So I guess you're using a linker that simply doesn't know the OMF format anymore

1. Microsoft (R) Incremental Linker Version 14.29.30154.0
2. Pelles Linker, Version 8.00.2

The MASM Forum
