Int128 in assembler

Started by bigbadbob, June 13, 2018, 02:04:43 PM

aw27

@nidud

Quote
Registers:
    mov rax,rcx ; in:  rdx:rcx, r8:r9
    add rax,r8  ; out: rdx:rax
    adc rdx,r9
Pointers:
    mov r9,[rcx]        ; in:  [rcx], [rdx]
    mov r10,[rcx+8]     ; out: [r8]
    add r9,[rdx]
    adc r10,[rdx+8]
    mov [r8],r9
    mov [r8+8],r10

You are mixing apples with oranges in a frustrated attempt to come up with something true.
I am talking about what is wrong with pointers when calling functions, and you start a compulsive addition manipulation inside a function.
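For readers following the quoted snippets, here is a minimal C sketch of the two variants being compared: 128-bit addition with the operands passed by value and with the operands passed by pointer. The `u128` struct and both function names are hypothetical and appear in no library discussed here. How a compiler actually passes and returns such a struct depends on the ABI, so this illustrates the arithmetic and not the register assignment.

```c
#include <stdint.h>

/* Hypothetical mirror of a 128-bit value held as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* By value: corresponds to the "Registers" variant (add, then adc). */
u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    /* (r.lo < a.lo) is 1 exactly when the low add wrapped, i.e. the carry. */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}

/* By pointer: corresponds to the "Pointers" variant (load, add/adc, store). */
void add128_ptr(const u128 *a, const u128 *b, u128 *out)
{
    uint64_t lo = a->lo + b->lo;
    out->hi = a->hi + b->hi + (lo < a->lo);
    out->lo = lo;
}
```

Both functions compute the same result; the only difference is where the operands live when the call is made, which is the point under dispute.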

Quote
The Linux implementation of the Quadmath actually use both. There may be some advantages in doing that but I failed to see any:
If we need a high precision (you define the precision you want) math library we should use MPIR (GMP fork), not bloated limited precision math DLLs like the quadmath. MPIR is in large part written in ASM. I have already posted how to use MPIR from ASM.
Tell me, what is quadmath good for? Can we use it for a large number factorization for example?

Quote
This is simply not true so I think it's safe to just write this off as pure ignorance on your part.
Is this an argument? Why don't you produce another compulsive code demo on this one?

Quote
You see, your assertion that it's impossible to write assembler code which is faster and more compact than optimized C++ is simply not true (I assume that was the conclusion in the article you wrote).
I never said it was impossible, but am waiting patiently for someone to beat the compiler on the same routines. This will be more interesting than going there and downvoting an article, as some people do once in a while, that has deserved the prize of article of the month.


nidud

#46
deleted

bigbadbob

I found something annoying about using FASTCALL (Win64 ABI) when using PROC arguments.


Test1 PROC arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD,arg5:QWORD,arg6:QWORD
   ; save shadow space
   mov [rbp+16], rcx
   mov [rbp+24], rdx
   mov [rbp+32], r8
   mov [rbp+40], r9

   ; ready to add the arguments up
   mov rax, arg1
   add rax, arg2
   add rax, arg3
   add rax, arg4
   add rax, arg5
   add rax, arg6
   ret 
Test1 ENDP 


And this one:

Test2 PROC arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD,arg5:QWORD,arg6:QWORD
   ; save shadow space
   mov arg1, rcx
   mov arg2, rdx
   mov arg3, r8
   mov arg4, r9

   ; ready to add the arguments up
   mov rax, arg1
   add rax, arg2
   add rax, arg3
   add rax, arg4
   add rax, arg5
   add rax, arg6
   ret 
Test2 ENDP 


I made up an example of storing the shadow space. 
Both of those work the same.
I wrote this based on what I thought Hutch said.  I would think that always using the name of the argument is better.  Though I know it might be different if the arguments were not all QWORD.

I have this example, though...

Test3 PROC arg1:WORD,arg2:WORD,arg3:WORD,arg4:WORD,arg5:WORD,arg6:WORD
   ; save shadow space
   mov QWORD PTR arg1, rcx
   mov QWORD PTR arg2, rdx
   mov QWORD PTR arg3, r8
   mov QWORD PTR arg4, r9

   ; ready to add the arguments up
   mov ax, arg1
   add ax, arg2
   add ax, arg3
   add ax, arg4
   add ax, arg5
   add ax, arg6
   ret 
Test3 ENDP 


What is your opinion on this?  I recast it to QWORD so that the actual type of the argument does not matter.
My issue is that if I added "USES RBX", wouldn't the stack layout change?  I think that using the named arguments is better.
If a programmer is going to bother using the automatic stack manipulation for named arguments, then the programmer
should never manually mention the stack locations, because they might change if the procedure is rewritten and a new
register is added to the list in USES.

Now I would also like to reveal something...
I've used MASM32 before and I know all about PROC, PROTO, invoke, etc.  (It might have been 10-15 years ago).
But I've read the MASM64 help and noticed that ML64 does not have invoke, but someone made a macro.

I also was wondering why someone did not yet write:
invokefast which would not waste time with storing the first 4 arguments on the stack.
The callee should use the stack space if needed, but the caller shouldn't.
It should be a black box.  The caller says here is a shadow space, I will not fill it in.  Use it if you want, but I'll ignore it.

The callee if it uses a whole bunch of registers might use the shadow space.  But the caller should not even know.
What I'm asking about is a macro that follows the FASTCALL convention exactly.

So I would like to tell you that I expected the named arguments to be on the stack, that is why I did not use this notation.
I started coding after I read to use the registers, so that is what I've been doing.

Time to rewrite my UInt128Mul... I'll be using the shadow space correctly now.

hutch--

Bob,

The calling techniques in the MASM64 stuff so far do procedure calls in a couple of ways. There is a general "procedure_call" type macro that is called through a number of wrappers, "invoke" included, and a pure register call macro which will accept up to 4 arguments. Both conform to the Win 64 ABI and handle both ends of the market.

The invoke style macro writes the first 4 args to shadow space as well as the registers and the rest directly to the correct stack locations. It also supports quoted text. The direct register call macros only write up to the first 4 registers so you can do both and in a very efficient manner.

ML64 is an unconfigured assembler and needs to use the pre-processor to configure it. Whereas with ML.EXE in Win 32 it was easy enough to write pure mnemonic code, the Win 64 ABI is a lot more complex than stack-based Win 32, and while it can be done purely manually, it's not for the faint of heart.

With stackframe support, you can handle any of the normal high level API and procedure calls but for pure algorithms you write procedures with no stack frame and call them using up to the first 4 registers.

bigbadbob

Quote from: hutch-- on June 17, 2018, 02:30:26 PM
The invoke style macro writes the first 4 args to shadow space as well as the registers and the rest directly to the correct stack locations. It also supports quoted text. The direct register call macros only write up to the first 4 registers so you can do both and in a very efficient manner.

I read the macro source code and I don't think that invoke writes to the shadow space, it actually appears to work as I expected.

Please note that I installed MASM32 and MASM64 on my current computer in 2017.


aw27

@nidud

Quote
Well, lets take this from the start

Sure, we can retry as many times as you need, hopefully you will understand in the end.

Quote
GCC extends this to 64-bit.

    mov rax,rcx ; in:  rdx:rcx, r8:r9
    add rax,r8  ; out: rdx:rax
    adc rdx,r9
    ret

Ah, so this is the part you find really cool.
The problem is that nothing useful can be done with the return value in RDX:RAX.
You will still need to call functions with pointer arguments, or are you going to pass arguments in rdx:rax or maybe also in r11:r10, r13:r12? You don't clarify this part (ah, from __int128 foo(__int128 a, __int128 b ) it appears that you are thinking about some 128-bit registers that don't exist yet on this planet).
If neither is true you will need to save RDX:RAX to memory, like it or not. This means that RDX:RAX is only a useless carrier and consumer of CPU cycles of the return value from callee to caller.



hutch--

Bob,

This works fine, the macro that "invoke" calls definitely writes to shadow space.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    invoke testproc,150,300,450,600

    waitkey

    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

testproc proc arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD

  ; clear the 4 registers

    xor rcx, rcx
    xor rdx, rdx
    xor r8, r8
    xor r9, r9

  ; display the values from shadow space

    conout str$(arg1),lf
    conout str$(arg2),lf
    conout str$(arg3),lf
    conout str$(arg4),lf

    ret

testproc endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

jj2007

Quote from: bigbadbob on June 17, 2018, 03:40:23 PM
I read the macro source code and I don't think that invoke writes to the shadow space

Macros are powerful, you can ask them explicitly to write to shadow space; with jinvoke, just add a <cb> after proc:

include \Masm32\MasmBasic\Res\JBasic.inc        ; see 64-bit assembly with RichMasm
.code
testproc proc <cb> arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD
    xor rcx, rcx                        ; clear the 4 registers
    xor rdx, rdx
    xor r8, r8
    xor r9, r9
    PrintLine Str$("a1: %i\na2: %i\na3: %i\na4: %i", arg1, arg2, arg3, arg4)  ; display the values from shadow space
    ret
testproc endp

Init           ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
  jinvoke testproc, 100, 200, 300, 400
  Inkey Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
EndOfCode


Output:
100
200
300
400
This code was assembled with ml64 in 64-bit format

Builds & runs also with UAsm, of course. The <cb> behaviour is needed for Windows callback functions, such as WndProc. Without the <cb>, the macro saves the four instructions needed to write to shadow space, but the callee must know what to do with the four registers 8)

nidud

#53
deleted

bigbadbob

Let's move further ABI/FASTCALL discussion to http://masm32.com/board/index.php?topic=7222.0

nidud, I noticed that this code does not properly multiply the generic case of OWORD multiplied by OWORD.  You said that it was inline so it probably works for a special case.
Quote
These functions are used for the REAL16 (quadmath) implementation so most of them are inline. The mul function goes something like this:


        .if !rdx && !r11
            mul     r10
            xor     r10,r10
        .else
            mul     r10
            mov     rbx,rdx
            mov     rdi,rax
            mov     rax,rcx
            mul     r11
            mov     r11,rdx
            xchg    r10,rax
            mov     rdx,rcx
            mul     rdx
            add     rbx,rax
            adc     r10,rdx
            adc     r11,0


There should be 4 multiplies in the else part.
rdx:rax * r11:r10 = rax * r10 (QWORD0 and QWORD1) + rdx * r10 (QWORD1 and QWORD2) + rax * r11 (QWORD1 and QWORD2) + rdx * r11 (QWORD2 and QWORD3)
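The four partial products above can be checked against a portable C sketch. The `u128`/`u256` structs and the `mul64`, `add_at` and `mul128` helpers are hypothetical names introduced only for this illustration; `mul64` stands in for the 64x64 to 128-bit MUL instruction and is built from 32-bit halves so that no compiler extension is assumed. Partial products are accumulated by order, lo*lo first, so the two QWORD1/QWORD2 cross terms appear in a different order than in the quoted assembly.

```c
#include <stdint.h>

/* Hypothetical mirrors of a 128-bit operand and a 256-bit product. */
typedef struct { uint64_t lo, hi; } u128;
typedef struct { uint64_t q[4]; } u256;   /* q[0] is the least significant QWORD */

/* 64x64 -> 128 multiply from 32-bit halves, standing in for MUL. */
static u128 mul64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Middle column: three values below 2^32, so no 64-bit overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    u128 r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}

/* Add v into QWORD i of r and ripple any carry upward (the ADC chain). */
static void add_at(u256 *r, int i, uint64_t v)
{
    while (i < 4) {
        uint64_t s = r->q[i] + v;
        v = (s < v);          /* 1 if the add wrapped, else 0 */
        r->q[i] = s;
        if (!v)
            break;
        i++;
    }
}

/* 128x128 -> 256: four partial products, each landing in two QWORDs. */
u256 mul128(u128 a, u128 b)
{
    u256 r = { { 0, 0, 0, 0 } };
    u128 p;

    p = mul64(a.lo, b.lo); add_at(&r, 0, p.lo); add_at(&r, 1, p.hi); /* QWORD0, QWORD1 */
    p = mul64(a.hi, b.lo); add_at(&r, 1, p.lo); add_at(&r, 2, p.hi); /* QWORD1, QWORD2 */
    p = mul64(a.lo, b.hi); add_at(&r, 1, p.lo); add_at(&r, 2, p.hi); /* QWORD1, QWORD2 */
    p = mul64(a.hi, b.hi); add_at(&r, 2, p.lo); add_at(&r, 3, p.hi); /* QWORD2, QWORD3 */
    return r;
}
```

A useful stress case is squaring the all-ones operand, since it forces carries out of QWORD1 and QWORD2; the expected product is 2^256 - 2^129 + 1.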


nidud

#55
deleted

nidud

#56
deleted

bigbadbob

Quote from: nidud on June 18, 2018, 09:10:02 AM
:biggrin:

There actually is 4 multiplies in there so it's more or less the same code as above using different regs.

I accidentally did not scroll and then quoted incorrectly.  You are correct, the first post was correct.
:t

hutch--

Guys,

I moved this topic as it is way too complicated for learners.

bigbadbob

Hi nidud,

Here is my version, very commented.

_text SEGMENT 

public UInt128Mul

UInt128 STRUCT
    loQWORD QWORD ?
    hiQWORD QWORD ?
UInt128 ENDS

UInt256 STRUCT
    myQWORD0 QWORD ?
    myQWORD1 QWORD ?
    myQWORD2 QWORD ?
    myQWORD3 QWORD ?
UInt256 ENDS

; UInt128Mul
; ---------
; RCX - QWORD - PTR to UInt256 (result)
; RDX - QWORD - PTR to UInt128 (input1)
; R8  - QWORD - PTR to UInt128 (input2)
; R9  - volatile, not used for parameter passing
; ---------
; RAX volatile
; R10 volatile
; R11 volatile
;----------
; C Header - variant 1
; UInt256 UInt128Mul(UInt128* const input1, UInt128* const input2);
; C Header - variant 2
; void UInt128Mul(UInt256* result, UInt128* const input1, UInt128* const input2);
;----------
; source remains unchanged
; performs *RCX = *RDX * *R8: 128-bit operands, 256-bit result
UInt128Mul PROC result:PTR UInt256, input1:PTR UInt128, input2:PTR UInt128
   ; no need to save shadow space yet
   cmp (UInt128 PTR [rdx]).hiQWORD, 0
   jne long_math
   cmp (UInt128 PTR [r8]).hiQWORD, 0
   jne long_math

   mov rax, (UInt128 PTR [rdx]).loQWORD
   mov r10, (UInt128 PTR [r8]).loQWORD
   mul r10
   mov (UInt256 PTR [rcx]).myQWORD0, rax
   mov (UInt256 PTR [rcx]).myQWORD1, rdx
   mov (UInt256 PTR [rcx]).myQWORD2, 0
   mov (UInt256 PTR [rcx]).myQWORD3, 0
   mov rax, rcx
   ret

long_math:
   ; save shadow space
   mov result, rcx ; this saved shadow parameter is actually used
   mov input1, rdx ; saved but not used
   mov input2, r8  ; saved but not used

   push r12
   push r13
   push r14
   push r15
   xor r14, r14 ; make r14 and r15 zero because they will start as carries
   xor r15, r15
   
   mov rax, (UInt128 PTR [rdx]).loQWORD ; rdx is a pointer to input1
   mov r10, (UInt128 PTR [r8]).loQWORD  ; r8 is a pointer to input2
   mov rcx, (UInt128 PTR [rdx]).hiQWORD
   mov r11, (UInt128 PTR [r8]).hiQWORD
   mov r8, rax
   mul r10         ; input1.loQWORD * input2.loQWORD ==> rdx : rax
   mov r12, rax    ; this is the result.myQWORD0 (final result)
   mov r13, rdx    ; this is the result.myQWORD1 (temp)
   mov rax, r8   
   mul r11         ; input1.loQWORD * input2.hiQWORD ==> rdx : rax
   add r13, rax    ; update result.myQWORD1 (still temp) by adding
   adc r14, rdx    ; result.myQWORD2 (temp) add with carry
   mov rax, rcx
   mul r10         ; input1.hiQWORD * input2.loQWORD ==> rdx : rax
   add r13, rax    ; update result.myQWORD1 (final result) by adding
   adc r14, rdx    ; update result.myQWORD2 (temp) by adding with carry
   adc r15, 0      ; begin using result.myQWORD3 (temp) in case of carry
   mov rax, rcx
   mul r11         ; input1.hiQWORD * input2.hiQWORD ==> rdx : rax
   add r14, rax    ; update result.myQWORD2 (final result) by adding
   adc r15, rdx    ; update result.myQWORD3 (final result) by adding with carry
   mov rax, result ; load result pointer into rax to begin storing in memory
   mov (UInt256 PTR [rax]).myQWORD0, r12
   mov (UInt256 PTR [rax]).myQWORD1, r13
   mov (UInt256 PTR [rax]).myQWORD2, r14
   mov (UInt256 PTR [rax]).myQWORD3, r15
   pop r15
   pop r14
   pop r13
   pop r12
   ret
UInt128Mul ENDP 
_text ENDS 
END