I'm a C# programmer, but have always understood assembly. I just don't code in it very often.
This is my first at least semi-working 64-bit dll written in assembler.
I actually called it from C# using PInvoke to test it, so at a minimum it appears to comply with the ABI and should work from assembler, C/C++, and .NET using PInvoke.
Not really sure if it is fully compliant, but it did not crash C#.
Sample Add method:
_text SEGMENT
public Int128Add
; Int128Add
; ---------
; RCX - QWORD - PTR to Int128 (input1)
; RDX - QWORD - PTR to Int128 (input2)
; R8 - QWORD - PTR to Int128 (result)
; R9 - unused
; ---------
; RAX volatile
; R10 volatile
; R11 volatile
;----------
; C Header
; void Int128Add(_m128* const input1, _m128* const input2, _m128* result )
;----------
; source remains unchanged
; performs *R8 = *RCX + *RDX in 128 bit mode
Int128Add PROC FRAME
push rbp
.pushreg rbp
sub rsp, 010h
.allocstack 010h
mov rbp, rsp
.setframe rbp, 0
.endprolog
mov rax, QWORD PTR [rcx]
mov r10, QWORD PTR [rcx+8]
add rax, QWORD PTR [rdx]
adc r10, QWORD PTR [rdx+8]
mov QWORD PTR [r8], rax
mov QWORD PTR [r8+8], r10
; epilog
add rsp, 010h
pop rbp
ret
Int128Add ENDP
_text ENDS
END
And that method is working, but I think that I allocated too much stack.
I think that add and subtract are the same for Int128 (signed) and UInt128 (unsigned).
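On the signed versus unsigned point: in two's complement the bits produced by add/adc (or sub/sbb) do not depend on how the operands are interpreted, only the overflow test differs. A minimal C sketch of that, assuming a hypothetical two-limb struct named u128 that is not part of the DLL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-limb layout matching the asm: low QWORD first. */
typedef struct { uint64_t lo; uint64_t hi; } u128;

/* Same work as add/adc: add the low limbs, carry into the high limbs. */
static u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low limb */
    return r;
}

/* Same work as sub/sbb: subtract the low limbs, borrow from the high limbs. */
static u128 sub128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);   /* borrow out of the low limb */
    return r;
}

/* All ones is -1 when read as signed and 2^128 - 1 when read as unsigned.
   Adding 2 gives the bit pattern for 1 under both readings. */
static void add128_demo(void)
{
    u128 all_ones = { UINT64_MAX, UINT64_MAX };
    u128 two = { 2, 0 };
    u128 r = add128(all_ones, two);
    assert(r.lo == 1 && r.hi == 0);
}
```

The signed and unsigned versions would only need to differ if the routine also reported overflow (carry flag for unsigned, overflow flag for signed).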
Also untested, multiply:
_text SEGMENT
public UInt128Mul
; UInt128Mul
; ---------
; RCX - QWORD - PTR to Int128 (input1)
; RDX - QWORD - PTR to Int128 (input2)
; R8 - QWORD - PTR to Int256 (result)
; R9 - unused
; ---------
; RAX volatile
; R10 volatile
; R11 volatile
;----------
; C Header
; void UInt128Mul(_int128* const input1, _int128* const input2, _int256* result )
;----------
; source remains unchanged
; performs *R8 = *RCX * *RDX in 128 bit mode resulting in 256 bits
UInt128Mul PROC FRAME
push rbp
.pushreg rbp
push rdx
.pushreg rdx
sub rsp, 050h
.allocstack 050h
mov rbp, rsp
.setframe rbp, 0
.endprolog
; _int256 temp1 = 20h bytes rbp, rbp+8, rbp+16, rbp+24
; _int256 temp2 = 20h bytes rbp+32, rbp+40, rbp+48, rbp+56
; _int128 input2 shadow = 16 bytes (10h) rbp+64, rbp+72
; temp1 = temp2 = 0
xor rax, rax
mov QWORD PTR [rbp], rax
mov QWORD PTR [rbp+8], rax
mov QWORD PTR [rbp+16], rax
mov QWORD PTR [rbp+24], rax
mov QWORD PTR [rbp+32], rax
mov QWORD PTR [rbp+40], rax
mov QWORD PTR [rbp+48], rax
mov QWORD PTR [rbp+56], rax
; input2 shadow = *input2
mov rax, QWORD PTR [rdx]
mov QWORD PTR [rbp+64], rax
mov rax, QWORD PTR [rdx+8]
mov QWORD PTR [rbp+72], rax
; temp1 = lo1 * lo2 (bits 0..127)
mov rax, QWORD PTR [rcx]
mov rdx, QWORD PTR [rbp+64]
mul rdx
mov QWORD PTR [rbp], rax
mov QWORD PTR [rbp+8], rdx
; temp2 = (hi1 * lo2) shifted left 64 bits, then temp1 += temp2
mov rax, QWORD PTR [rcx+8]
mov rdx, QWORD PTR [rbp+64]
mul rdx
mov QWORD PTR [rbp+40], rax
mov QWORD PTR [rbp+48], rdx
call add_temp
; temp2 = (lo1 * hi2) shifted left 64 bits, then temp1 += temp2
mov rax, QWORD PTR [rcx]
mov rdx, QWORD PTR [rbp+72]
mul rdx
mov QWORD PTR [rbp+40], rax
mov QWORD PTR [rbp+48], rdx
call add_temp
; temp2 = (hi1 * hi2) shifted left 128 bits, then temp1 += temp2
mov rax, QWORD PTR [rcx+8]
mov rdx, QWORD PTR [rbp+72]
mul rdx
mov QWORD PTR [rbp+48], rax
mov QWORD PTR [rbp+56], rdx
xor rax, rax
mov QWORD PTR [rbp+32], rax
mov QWORD PTR [rbp+40], rax
call add_temp
mov rax, QWORD PTR [rbp]
mov rdx, QWORD PTR [rbp+8]
mov r10, QWORD PTR [rbp+16]
mov r11, QWORD PTR [rbp+24]
mov QWORD PTR [r8], rax
mov QWORD PTR [r8+8], rdx
mov QWORD PTR [r8+16], r10
mov QWORD PTR [r8+24], r11
; epilog
add rsp, 050h
pop rdx
pop rbp
ret
add_temp:
mov rax, QWORD PTR [rbp]
add rax, QWORD PTR [rbp+32]
mov QWORD PTR [rbp], rax
mov rax, QWORD PTR [rbp+8]
adc rax, QWORD PTR [rbp+40]
mov QWORD PTR [rbp+8], rax
mov rax, QWORD PTR [rbp+16]
adc rax, QWORD PTR [rbp+48]
mov QWORD PTR [rbp+16], rax
mov rax, QWORD PTR [rbp+24]
adc rax, QWORD PTR [rbp+56]
mov QWORD PTR [rbp+24], rax
ret
UInt128Mul ENDP
_text ENDS
END
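Since the multiply is untested, a reference model is handy for comparing results. The sketch below is plain schoolbook multiplication in C, where the partial product of limb i and limb j is added in at limb i+j. The types u128 and u256 and the functions mul64 and mul128 are hypothetical names for this illustration, not part of the DLL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limb layouts, least significant QWORD first. */
typedef struct { uint64_t w[2]; } u128;
typedef struct { uint64_t w[4]; } u256;

/* 64 x 64 -> 128 bit product (the job of one `mul` instruction),
   done portably with 32-bit halves. */
static void mul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* r = a * b. The partial product a[i] * b[j] is added at limb i + j,
   which is the same as shifting it left by 64 * (i + j) bits. */
static u256 mul128(const u128 *a, const u128 *b)
{
    u256 r = { { 0, 0, 0, 0 } };
    for (int i = 0; i < 2; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 2; j++) {
            uint64_t lo, hi;
            mul64(a->w[i], b->w[j], &lo, &hi);
            uint64_t t = r.w[i + j] + lo;
            hi += (t < lo);        /* carry from adding the low half */
            t += carry;
            hi += (t < carry);     /* carry from adding the running carry */
            r.w[i + j] = t;
            carry = hi;
        }
        r.w[i + 2] = carry;
    }
    return r;
}

/* 2^64 * 2^64 = 2^128, so only limb 2 is set. */
static void mul128_demo(void)
{
    u128 x = { { 0, 1 } };
    u256 p = mul128(&x, &x);
    assert(p.w[0] == 0 && p.w[1] == 0 && p.w[2] == 1 && p.w[3] == 0);
}
```

A useful extreme case is (2^128 - 1) squared, which is 2^256 - 2^129 + 1, so the expected limbs from low to high are 1, 0, 0xFFFFFFFFFFFFFFFE and 0xFFFFFFFFFFFFFFFF.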
I'm also looking for a good divide 128 algorithm.
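One simple candidate is restoring (shift-and-subtract) division, which produces one quotient bit per step and needs only shifts, compares and subtracts. It is slow compared with algorithms built on the hardware div instruction, but easy to verify and to port to assembly. A minimal C sketch, assuming hypothetical names u128 and divmod128:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-limb layout, low QWORD first. */
typedef struct { uint64_t lo; uint64_t hi; } u128;

static int ge128(u128 a, u128 b)
{
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

static u128 sub128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}

/* Unsigned n / d. Returns 0 on success, 1 if the divisor is zero. */
static int divmod128(u128 n, u128 d, u128 *q, u128 *r)
{
    u128 quo = { 0, 0 }, rem = { 0, 0 };
    assert(q != 0 && r != 0);
    if (d.lo == 0 && d.hi == 0)
        return 1;
    for (int i = 127; i >= 0; i--) {
        /* rem = (rem << 1) | bit i of n; remember the bit shifted out. */
        uint64_t out = rem.hi >> 63;
        uint64_t bit = (i >= 64 ? n.hi >> (i - 64) : n.lo >> i) & 1;
        rem.hi = (rem.hi << 1) | (rem.lo >> 63);
        rem.lo = (rem.lo << 1) | bit;
        /* If a bit was shifted out, the true remainder exceeds 2^128 > d,
           and the wrapped subtraction still yields the right value. */
        if (out || ge128(rem, d)) {
            rem = sub128(rem, d);
            if (i >= 64)
                quo.hi |= (uint64_t)1 << (i - 64);
            else
                quo.lo |= (uint64_t)1 << i;
        }
    }
    *q = quo;
    *r = rem;
    return 0;
}
```

For example, dividing 2^64 (lo = 0, hi = 1) by 10 should give quotient 1844674407370955161 and remainder 6.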
There is nothing we can call a _int128 data type (as far as I know), it is likely a structure you forgot to define.
In ASM we have owords but the ABI does not consider them a "returnable" data type.
Your coding style is bad, nobody really uses exception frames in ASM without knowing exactly what they are doing. This is not your case, you are not even sure you need 16 bytes of stack.
I did not look at your 2nd function.
Bob,
If you want to pass and return a 128 bit sized piece of data, you would normally pass the address of that data. You could of course use one or more SSE registers both in and out, but AW is correct here: the Win64 ABI is designed so that a normal argument can only carry a value of up to 64 bits.
Hi, bigbadbob!
Take a look here (http://x86asm.net/articles/working-with-big-numbers-using-x86-instructions/index.html)
Thank you for your responses.
My goal is to learn enough that I can properly explain/teach it to someone else. There should be enough comments that I could learn the ABI all over again if I forget.
I've never written assembly language code for pay. I would consider myself a beginner, but I learn quickly.
Just so you know, I'm writing a DLL on GitHub.
https://github.com/robertkolski/BigBadInt128/blob/master/src/Int128Add.asm
As far as I know, nobody looks at my project. I thought that joining this forum would be a way to learn.
Now back to the assembly:
I learned somewhere that if you change the stack you should have an exception FRAME in case a memory access such as mov r10, [rcx] fails. For instance, if someone passed rcx = 0 to your function, the system would use the FRAME information to unwind. Later I read that you only need to set up a frame if you actually use the stack within the procedure. I was not sure if I could call this function a leaf function, but now I think that I can.
Please see the revised Add function (I use it as a method)
_text SEGMENT
public Int128Add
; Int128Add
; -------------------------------------
; Signed and unsigned add.
; RCX - PTR to OWORD (input1)
; RDX - PTR to OWORD (input2)
; R8 - PTR to OWORD (result)
; R9 - unused
; -------------------------------------
; RAX volatile - but not used
; R10 volatile
; R11 volatile
;--------------------------------------
; C Header - pseudocode prototype
; void Int128Add(_int128* input1, _int128* input2, _int128* result )
; assume that I have a C compiler that supports _int128 or it is a struct
;--------------------------------------
; C# types (without full implementation here):
; public struct Int128
; {
; private Int64 loQWORD;
; private Int64 hiQWORD;
; [DllImport("BigBadInt128.dll")]
; private static extern void Int128Add(IntPtr addend1, IntPtr addend2, IntPtr result);
; // public methods not shown
; }
; public struct UInt128
; {
; private UInt64 loQWORD;
; private UInt64 hiQWORD;
; [DllImport("BigBadInt128.dll")]
; private static extern void Int128Add(IntPtr addend1, IntPtr addend2, IntPtr result);
; // public methods not shown
; }
;--------------------------------------
; input1 and input2 remain unchanged
; the contents of the OWORD result is modified
;--------------------------------------
; Don't need FRAME and PROLOG because
; this is a leaf function
; it does not need any stack space
; -------------------------------------
Int128Add PROC
mov r10, QWORD PTR [rcx]
mov r11, QWORD PTR [rcx+8]
add r10, QWORD PTR [rdx]
adc r11, QWORD PTR [rdx+8]
mov QWORD PTR [r8], r10
mov QWORD PTR [r8+8], r11
ret
Int128Add ENDP
_text ENDS
END
Quote
I would consider myself a beginner, but I learn quickly
You don't need to put it that way, people see immediately who you are.
In addition you are a C# programmer which is not a good starting point.
ASM programmers usually do not spend time with FRAME, particularly if they are beginners, for a few reasons I could detail here but will not. However, for an introduction on the subject you can read this article 64-bit Structured Exception Handling (SEH) in ASM (https://www.codeproject.com/Articles/1212332/bit-Structured-Exception-Handling-SEH-in-ASM)
@nidud
Quote
or in this case RDX:RAX.
Is this the Windows ABI? I don't think so.
Quote
I'm a C# programmer
He was very clear at that.
It still sounds like there are 2 choices, either use 64 bit pointers to the 128 bit variable OR pass the data in SSE registers. I guess you could pass that data in AVX registers as well. As far as I know VS does not support registers directly so the choices collapse down to passing 64 bit pointers to the 128 bit data like the ABI for Win 64 supports.
According to the Windows ABI he has to use pointers to the variable.
The Windows ABI is a convention to be used by all programs, written in any programming language. We know that it is possible to use special calling conventions, or derivatives when we know beforehand with what our ASM module will be linked to.
This is really not the case here, C# does not allow special calling conventions.
Quote
None of the functions presented by Bob use any return type other than void and all arguments are passed as pointers
We know why he did that, because he does not know yet how to do it otherwise.
Quote from: nidud on June 15, 2018, 04:48:24 AM
So it's not because he follows the Windows ABI but because he does not know yet how to do it otherwise.
Nah, it is a little "secret" of the Windows ABI how to do it otherwise. :biggrin:
The PInvoke feature in C# has to conform to the 64 bit ABI when running in 64 bit mode. The reason is that it is used to call the Windows API or any other native DLL. A DLL that contains 64 bit assembly or standard C/C++ is native. In 32 bit mode I think that it uses STDCALL.
I'm able to tell a .Net DLL to be compiled to work only for x64. The default is 'Any CPU' which is independent of architecture.
I do not know if PInvoke will accept an XMM0 return value. I'm aware that RAX is the 64 bit return value register, but cannot be used since I would have a 128 bit return value.
Not all my functions are void, but I did not show this one yet.
_text SEGMENT
public UInt128Parse
; UInt128Parse
; ---------
; RCX - QWORD - PTR to String (input)
; RDX - QWORD - PTR to Int128 (result)
; R8 - unused
; R9 - unused
; ---------
; RAX volatile
; R10 volatile
; R11 volatile
;----------
; C Header
; DWORD UInt128Parse(wchar* lpwszString, _int128* result )
;----------
; input lpwszString remains unchanged
; result the pointer's contents are updated
; ---------
; returns 0 success
; returns 1 overflow
; returns 2 invalid format
; ---------
;
UInt128Parse PROC FRAME
push rbp
.pushreg rbp
push rdx
.pushreg rdx
push rbx
.pushreg rbx
push r12
.pushreg r12
sub rsp, 030h
.allocstack 030h
mov rbp, rsp
.setframe rbp, 0
.endprolog
mov r12, rdx
xor r11, r11 ; keep r11 zero
mov QWORD PTR [rbp+8], r11
mov QWORD PTR [rbp+16], r11
mov QWORD PTR [rbp+24], r11
xor r10, r10 ; move to the beginning of the string
jmp start_loop
keep_looping:
mov rbx, 10
mov rax, [rbp+8]
mul rbx
mov QWORD PTR [rbp+32], rdx
mov QWORD PTR [rbp+8], rax
mov rax, QWORD PTR [rbp+16]
mul rbx
add rax, QWORD PTR [rbp+32]
adc rdx, r11 ; add with carry and zero
mov QWORD PTR [rbp+16], rax
mov QWORD PTR [rbp+24], rdx
cmp rdx, r11
jne overflow
start_loop:
mov dx, WORD PTR [rcx+r10] ; r10 is the string offset
cmp dx, '0'
jb invalid_character
cmp dx, '9'
ja invalid_character
sub dx, '0'
xor rax, rax
mov ax, dx
add QWORD PTR [rbp+8], rax
adc QWORD PTR [rbp+16], r11 ; add with carry and zero
adc QWORD PTR [rbp+24], r11 ; add with carry and zero
cmp QWORD PTR [rbp+24], r11 ; compare to zero
jne overflow
add r10, 2
mov dx, WORD PTR [rcx+r10]
cmp dx, 0
je done
jmp keep_looping
done:
mov rax, QWORD PTR [rbp+8]
mov rdx, QWORD PTR [rbp+16]
mov QWORD PTR [r12], rax
mov QWORD PTR [r12+8], rdx
xor eax, eax
jmp method_exit
overflow:
mov eax, 1
jmp method_exit
invalid_character:
mov eax, 2
method_exit:
; epilog
add rsp, 030h
pop r12
pop rbx
pop rdx
pop rbp
ret
UInt128Parse ENDP
_text ENDS
END
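The parse loop boils down to value = value * 10 + digit with a check for anything carried past bit 127. A C model with the same return codes (0 success, 1 overflow, 2 invalid format) can be used to cross-check the assembly. This is a sketch, not a line-by-line translation: the names u128, mul10_add, parse_u128 and widen_ascii are hypothetical, and the string is assumed to be zero-terminated UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-limb layout, low QWORD first. */
typedef struct { uint64_t lo; uint64_t hi; } u128;

/* v = v * 10 + digit. Returns 1 if the result does not fit in 128 bits,
   in which case v is left unchanged. */
static int mul10_add(u128 *v, unsigned digit)
{
    /* 32-bit pieces so that piece * 10 + carry always fits in 64 bits. */
    uint32_t limb[4] = {
        (uint32_t)v->lo, (uint32_t)(v->lo >> 32),
        (uint32_t)v->hi, (uint32_t)(v->hi >> 32)
    };
    uint64_t carry = digit;
    assert(digit <= 9);
    for (int i = 0; i < 4; i++) {
        uint64_t t = (uint64_t)limb[i] * 10u + carry;
        limb[i] = (uint32_t)t;
        carry = t >> 32;
    }
    if (carry != 0)
        return 1;
    v->lo = ((uint64_t)limb[1] << 32) | limb[0];
    v->hi = ((uint64_t)limb[3] << 32) | limb[2];
    return 0;
}

/* Returns 0 on success, 1 on overflow, 2 on invalid format.
   *result is written only on success. */
static int parse_u128(const uint16_t *s, u128 *result)
{
    u128 v = { 0, 0 };
    if (*s == 0)
        return 2;                       /* empty string has no digits */
    for (; *s != 0; s++) {
        if (*s < '0' || *s > '9')
            return 2;
        if (mul10_add(&v, (unsigned)(*s - '0')))
            return 1;
    }
    *result = v;
    return 0;
}

/* Test helper: widen an ASCII string into a UTF-16 buffer (cap >= 1). */
static void widen_ascii(const char *src, uint16_t *dst, size_t cap)
{
    size_t i = 0;
    while (i + 1 < cap && src[i] != '\0') {
        dst[i] = (uint16_t)(unsigned char)src[i];
        i++;
    }
    dst[i] = 0;
}
```

The largest value that should parse is 340282366920938463463374607431768211455 (2^128 - 1). Adding one to the last digit should return 1.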
Bob,
If the return value is a problem, what about passing the address of a buffer in the arguments that is any size you like, write the results to that buffer in your assembler proc then back in your calling language just read the buffer ? This is pretty standard stuff and Windows API functions use it regularly.
My latest method signature in C# is this:
[DllImport("BigBadInt128.dll")]
private static extern void Int128Add(ref Int128 addend1, ref Int128 addend2, out Int128 result);
The first 2 parameters are "ref" Int128, so that means pointer to my struct.
The last parameter is "out" Int128. It is still a pointer, but it means that it is only a result, not an input. I don't need a buffer; my struct is 128 bits (2 QWORDs in size).
[StructLayout(LayoutKind.Sequential)]
public struct Int128
{
private Int64 loQWORD;
private Int64 hiQWORD;
}
As to Hutch's comment about the buffer, I was only responding to people saying that I don't know any other way.
By the way I knew the whole time that even the first add function worked. I ran it with C# already. I was just wondering if I did it right.
I was not sure if I had to have a FRAME pointer even if I don't use the stack. The reason is I read some article, but I don't remember where it is. Someone said always make a FRAME for exception unwinding.
I might eventually write a C++ program and call my DLL from that also. I just don't usually write code in C++. Not that I never have. I think about 15 years ago I wrote some COM in C++. But it has been so long ago that I most likely won't remember all of the details.
So I keep reading that we allocate shadow space of 32 bytes just in case we erase our registers.
What are the locations if I want to use that space?
mov [rsp+8], rcx - ??? - I think that I saw this somewhere.
mov [rsp+16], rdx
mov [rsp+24], r8
mov [rsp+32], r9
I'm sorry if I'm wrong I'm only guessing.
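For what it is worth, the arithmetic behind those offsets can be written down. On entry to the callee, [rsp] holds the return address pushed by call, so the four home slots start 8 bytes above it. A small C sketch of that calculation, using hypothetical helper names:

```c
#include <assert.h>

/* Byte offset of the home (shadow) slot for register argument `index`
   (0 = RCX, 1 = RDX, 2 = R8, 3 = R9), relative to RSP on entry to the
   callee. [rsp+0] holds the return address pushed by `call`. */
static int home_offset_at_entry(int index)
{
    assert(index >= 0 && index < 4);
    return 8 + 8 * index;
}

/* The same slot relative to RBP after `push rbp` / `mov rbp, rsp`:
   the saved RBP sits below the return address and adds 8 more bytes. */
static int home_offset_from_rbp(int index)
{
    return home_offset_at_entry(index) + 8;
}
```

So at entry the slots are [rsp+8], [rsp+16], [rsp+24] and [rsp+32], which matches the guess above. Any push or sub rsp in the prolog moves them further away from RSP by the same amount.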
Bob,
Here is the reference on the calling convention that I use for the MASM64 SDK. There is a mountain of bullshit about how it works across the internet, I did this one the hard way, write, test, verify and it works correctly.
The Win 64 Calling Convention, How Does It Work ?
Win 64 effectively only has one form of calling convention and it is used on all of the Windows API functions and while it is more complicated than the STDCALL and C calling conventions in Win 32, it is also more flexible in that each argument passed to another function can be specified in any of 4 different data sizes, BYTE WORD DWORD and QWORD being respectively 8 bit, 16 bit, 32 bit and 64 bit.
Whereas Win32 only used the stack with STDCALL and the C calling convention, Win 64 uses a combination of integer registers and stack locations to pass the number of arguments required for different procedures. In the specification of the Win 64 calling convention, the stack pointer (RSP) must remain 16 byte aligned which is done for performance reasons with larger data types and a number of instructions that need aligned memory to proceed.
Calling a procedure
The first four (4) arguments are written to the RCX RDX R8 and R9 in any of the 4 data sizes supported by the calling convention and any following arguments are written to a stack relative location in memory without changing the stack pointer (RSP). Many procedures have 4 or less arguments and obtain the advantage of lower calling overhead by receiving the 4 or less arguments directly in the four specified registers.
When there are more than four arguments, arguments 5 and upwards are written to the stack and here there is another consideration that will become obvious at the receiving end of a procedure call, the first four locations on the stack are left empty so that the 4 registers can be stored at those locations if necessary. The first four stack addresses are [rsp], [rsp+8], [rsp+16] and [rsp+24] which are left empty. Argument 5 and upwards are written to the RSP relative address [rsp+32] and upwards with an increase in displacement of 8 bytes for each argument.
A typical procedure call with 6 arguments will look like this.
mov rcx, arg1
mov rdx, arg2
mov r8, arg3
mov r9, arg4
mov QWORD PTR [rsp+32], arg5
mov QWORD PTR [rsp+40], arg6
call FunctionName
It is worth noting that with the stack arguments, if they are either a register or an immediate operand they are written directly to the RSP relative stack address. If the argument is a memory operand, either LOCAL or GLOBAL, it will be written to a register first then the register is written to the stack address as x86-64 processors do not support direct memory to memory copy.
It will look like this.
mov rax, arg5
mov QWORD PTR [rsp+32], rax
mov rax, arg6
mov QWORD PTR [rsp+40], rax
The Procedure Being Called
Depending on the number of arguments being passed to the procedure that is being called, a simple procedure that does not call other procedures (a leaf procedure) does not need to create a stack frame and can use the 4 or less registers in the design of the procedure along with other available registers. When a procedure receives 5 or more arguments and requires LOCAL variables it usually requires a stack frame which makes the arguments passed on the stack RBP relative.
When a procedure with a stack frame is called, 8 bytes are stored on the stack for the return address and another 8 bytes are used when creating the stack frame. This shifts the location of the first 4 empty arguments up by 16 bytes so that the first empty stack location is located at address [rbp+16]. The four registers that hold the first four arguments are volatile registers which means they can be overwritten by any following mnemonics so in a normal high level procedure that will call multiple procedures, the correct solution is to copy the four registers into the four stack locations. The four empty locations are generally referred to as shadow space.
mov [rbp+16], rcx
mov [rbp+24], rdx
mov [rbp+32], r8
mov [rbp+40], r9
There is good reason to write the four registers to the RBP relative addresses rather than to the variable names as the addresses are fixed at 64 bit and you don't have to bother with any different data sizes. A modern compiler will automate this process and an assembler needs to preserve the 4 register arguments in the (shadow space) so they are not overwritten.
With a language that specifies an argument list at a procedure's entry with an example something like this,
MyFunction proc arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD,arg5:QWORD,arg6:QWORD
Once the four registers have been copied to the four RBP relative addresses (shadow space), you can use the argument names in the argument list in the normal manner when writing the procedure.
The notation used in the above examples is in the format of the 64 bit Microsoft assembler, ML64.EXE and has been developed and tested successfully in Windows 10 Professional. Compatibility testing has also been successfully performed on Win7 64 bit Ultimate and Win 8/8.1 64 bit.
I wrote a few programs in C#, possibly more complicated than any you have ever done.
I have even written an article about mixing C# and ASM in a single executable.
So, I know very well what C# is all about.
Quote
I was only responding to people saying that I don't know any other way
You have not. Everybody and their cat know that integer values up to 64-bit are returned in eax/rax. We were talking about the reason you used void functions when you had a return value.
@Hutch,
There is the part about float (real4) parameters and returning floats that you did not mention.
Quote
You have not. Everybody and their cat know that integer values up to 64-bit are returned in eax/rax. We were talking about the reason you used void functions when you had a return value.
My return value was 128 bit, so I received a passed in pointer for most functions. I probably misused RAX. The function was "void" meaning you can ignore RAX, though I don't know if that is a bad practice. I thought that RAX is volatile if the function is void. If you think that is a bad practice I won't use RAX for a void.
I used EAX to return a DWORD in my parse routine. I used that as a success and failure code. 0 is for success. Otherwise in C# I throw an exception. Like OverflowException and FormatException.
I was also responding to Hutch. I'm not sure if he called assembly from C#. Not saying he did or did not. Just I could not gauge it based on how he responded. Sorry, too many people on the forum. AW, I'm not trying to call myself smarter than you so there was really no reason to belittle me as a defense mechanism and say that your C# is more complicated than mine. Maybe it is, but I really thought that stating that at this time was uncalled for.
Maybe I am talking Chinese without knowing.
Let me try again.
You are doing this:
void Int128Add(_m128* const input1, _m128* const input2, _m128* result )
while what you really want is this (or some variation on the same line):
_m128 result = Int128Add(_m128* const input1, _m128* const input)
but you don't know yet how to do it.
Jose,
> There is the part about float (real4) parameters and returning floats that you did not mention.
You are correct here and as well I did not address a number of other data types that can be returned within a 64 bit data size but in some contexts a returned register does the job, if you return the fp0 floating point register it can handle 32, 64 and 80 bit data where with a 64 bit return value you will only do 32 and 64 bit.
Doesn't the ABI tell you? If you need a result that doesn't fit into RAX (integer) or XMM0 (float) you pass the address of the return type in RCX and bump the other args.
https://docs.microsoft.com/en-au/cpp/build/return-values-cpp
RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right.
XMM0, 1, 2, and 3 are used for floating point arguments.
Additional arguments are pushed on the stack left to right.
Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
It is the caller's responsibility to allocate 32 bytes of "shadow space" (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
It is the caller's responsibility to clean the stack after the call.
Integer return values (similar to x86) are returned in RAX if 64 bits or less.
Floating point return values are returned in XMM0.
Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
It is all there, or almost all.
For example, if you use XMM0 to pass a float you cannot use RCX to pass a value; if you use XMM1 you can't use RDX, etc.
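Put differently, the first four argument slots are positional: slot n uses either the n-th integer register or the n-th XMM register, never both. A small C sketch of that rule, with hypothetical helper names made up for this illustration:

```c
#include <assert.h>
#include <string.h>

/* Where the Win64 convention puts argument number `index` (0-based).
   Slots are positional: a floating point argument in slot 1 goes in XMM1
   and RDX is simply not used for that call. */
static const char *win64_arg_location(int index, int is_float)
{
    static const char *const int_regs[4] = { "RCX", "RDX", "R8", "R9" };
    static const char *const fp_regs[4]  = { "XMM0", "XMM1", "XMM2", "XMM3" };
    if (index < 0)
        return "invalid";
    if (index >= 4)
        return "stack";
    return is_float ? fp_regs[index] : int_regs[index];
}

/* Offset from RSP, in the caller just before `call`, of stack argument
   `index` (4 or higher). The first 32 bytes are the shadow space. */
static int win64_stack_arg_offset(int index)
{
    assert(index >= 4);
    return 32 + 8 * (index - 4);
}
```

For a call such as f(int a, double b, int c), this gives RCX for a, XMM1 for b and R8 for c.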
Bob,
> I'm not sure if he called assembly from C#.
You can be sure I have never called assembler from C# as I never have and never will use it. What I was suggesting was that if you simply pass a pointer to memory, a structure or a variable to an assembler procedure, you can write whatever result you like to that address and at the caller end you will have the result you produced in the assembler procedure.
In other words:
"Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller."
It is all here: https://software.intel.com/en-us/articles/introduction-to-x64-assembly
This is all you need to know to become a great 64-bit ASM programmer.
Quote from: hutch-- on June 15, 2018, 03:52:50 PM
Bob,
> I'm not sure if he called assembly from C#.
You can be sure I have never called assembler from C# as I never have and never will use it. What I was suggesting was that if you simply pass a pointer to memory, a structure or a variable to an assembler procedure, you can write whatever result you like to that address and at the caller end you will have the result you produced in the assembler procedure.
Sorry I thought you meant BYTE or CHAR buffer. I was using pointers the whole time.
Quote from: AW on June 15, 2018, 03:59:30 PM
In other words:
"Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller."
It is all here: https://software.intel.com/en-us/articles/introduction-to-x64-assembly
This is all you need to know to become a great 64-bit ASM programmer.
Thank you AW. I read that and forgot. Not sure if it was the Intel page. So RCX is my pointer.
So I did this in C#:
[StructLayout(LayoutKind.Sequential)]
public struct Int128
{
private Int64 loQWORD;
private Int64 hiQWORD;
[DllImport("BigBadInt128.dll")]
private static extern Int128 Int128Add(ref Int128 addend1, ref Int128 addend2);
public static Int128 operator+ (Int128 addend1, Int128 addend2)
{
return Int128Add(ref addend1, ref addend2);
}
}
And this in 64-bit assembly:
Int128Add PROC
mov r10, QWORD PTR [rdx]
mov r11, QWORD PTR [rdx+8]
add r10, QWORD PTR [r8]
adc r11, QWORD PTR [r8+8]
mov QWORD PTR [rcx], r10
mov QWORD PTR [rcx+8], r11
mov rax, rcx
ret
Int128Add ENDP
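The routine above can be modelled in C to show what the hidden pointer does. The sketch assumes a hypothetical 16-byte struct named Int128 with unsigned limbs. The "lowered" function takes the return space as an explicit first argument and hands the same pointer back, which is roughly what a Win64 compiler is expected to generate for the by-value form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte struct, low QWORD first. */
typedef struct { uint64_t lo; uint64_t hi; } Int128;

/* Lowered form: hidden pointer to the caller's return space comes first
   (RCX), the real arguments shift to RDX and R8, and the pointer is
   handed back as the return value (RAX). */
static Int128 *Int128Add_lowered(Int128 *ret, const Int128 *a, const Int128 *b)
{
    uint64_t lo = a->lo + b->lo;
    assert(ret != 0);
    ret->hi = a->hi + b->hi + (lo < a->lo);   /* carry from the low limb */
    ret->lo = lo;
    return ret;
}

/* Source-level form: the struct is returned by value and the compiler
   supplies the return space behind the scenes. */
static Int128 Int128Add_byvalue(const Int128 *a, const Int128 *b)
{
    Int128 r;
    Int128Add_lowered(&r, a, b);
    return r;
}
```

Both forms produce the same 16 bytes. The only difference is who names the return space.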
Come on nidud, you should know this :redface:
typedef struct
{
__int64 num1;
__int64 num2;
}_m128t;
int main()
{
_m128t in1 = { 1,1 };
_m128t in2 = { 2,2 };
_m128t myNum = Int128Add(&in1, &in2);
return 0;
}
_m128t myNum = Int128Add(&in1, &in2);
000000013F124D50 lea r8,[in2]
000000013F124D54 lea rdx,[in1]
000000013F124D58 lea rcx,[rbp+188h]
000000013F124D5F call Int128Add (013F121375h)
Look at how RCX is used, lol.
I am talking about Windows ABI since the beginning of this thread and you are talking about a feature of the GCC compiler (a more sophisticated compiler according to you :icon_eek:)
I know you will end up winning any discussion due to fatigue of the opponent.
I will keep only this for the record:
Quote
Unless he uses a more sophisticated compiler this will not be possible given the maximum returned integer value is 64-bit.
:bgrin:
Quote from: nidud on June 16, 2018, 01:58:34 AM
As for your usual (your all idiots because you don't know what I just learn five minutes ago from google) babble, that's just entertainment.
You are a bad loser, you should recognize the nonsense you have been saying. This kind of ignorance is not acceptable from someone that is developing an assembler supposed to be compliant with the Windows ABI. Or is it compliant with the GCC ABI?
Bob is saying that he is a C# developer since the first message and you don't stop push selling your GCC ideas.
I would like to learn the Windows ABI way of doing it. I'll be using C# from Windows. I might also use it from C++, but it will be the Visual Studio compiler.
Bob,
Don't take any notice of the "kiddies", its just a form of sport. :P
Kiddies,
Behave yourself ! ;)
Quote
What I've been saying is that using VS your limited to 64-bit size arguments and return values so you have to use pointers.
What's wrong with using pointers? We all know that XMM registers are bad at doing integer operations, and they don't do any 128-bit operation! They are just carriers :badgrin:, so you will have to offload their content to make something useful. People usually forget that! It is the same with the VectorCall convention, people forget that the XMM registers have to be loaded and this takes CPU cycles. Don't embark on buzzwords, experiment and test by yourself.
Quote
bla, bla bla, ...
No comments. You insist that C# has 128-bit data types. It has NOT.
Quote
As already mention, C# is not bound to a specific ABI. It's created to be used on different computer platforms without being rewritten for specific architectures.
Welcome to planet Earth, please land now that there is no fog. Things here are quite different. :biggrin:
:biggrin:
> What I've been saying is that using VS your limited to 64-bit size arguments and return values so you have to use pointers.
VS does have a technique for writing 128 and 256 bit data types, it's called MASM. That is why Microsoft supply MASM in both the old 32 bit version and the 64 bit version. Now you can be sure that neither will run on a Motorola MAC, MIPS, PDP8 or Lunix but both can produce binaries for the OS they are supplied for, Windows. :P
@nidud
Quote
Registers:
mov rax,rcx ; in: rdx:rcx, r8:r9
add rax,r8 ; out: rdx:rax
adc rdx,r9
Pointers:
mov r9,[rcx] ; in: [rcx], [rdx]
mov r10,[rcx+8] ; out: [r8]
add r9,[rdx]
adc r10,[rdx+8]
mov [r8],r9
mov [r8+8],r10
You are mixing apples with oranges in a frustrated attempt to come up with something true.
I am talking about what is wrong with pointers when calling functions, and you start a compulsive addition manipulation inside a function.
Quote
The Linux implementation of the Quadmath actually use both. There may be some advantages in doing that but I failed to see any:
If we need a high precision (you define the precision you want) math library we should use MPIR (GMP fork), not bloated limited precision math DLLs like the quadmath. MPIR is in large part written in ASM. I have already posted how to use MPIR from ASM.
Tell me, what is quadmath good for? Can we use it for a large number factorization for example?
Quote
This is simply not true so I think it's safe to just write this off as pure ignorance on your part.
Is this an argument? Why don't you produce another compulsive code demo on this one?
Quote
You see, your assertion that it's impossible to write assembler code which is faster and more compact than optimized C++ is simply not true (I assume that was the conclusion in the article you wrote).
I never said it was impossible, but am waiting patiently for someone to beat the compiler on the same routines. This will be more interesting than going there and downvoting an article, as some people do once in a while, that has deserved the prize of article of the month.
I found something annoying about using FASTCALL (Win64 ABI) when using PROC arguments.
Test1 PROC arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD,arg5:QWORD,arg6:QWORD
; save shadow space
mov [rbp+16], rcx
mov [rbp+24], rdx
mov [rbp+32], r8
mov [rbp+40], r9
; ready to add the arguments up
mov rax, arg1
add rax, arg2
add rax, arg3
add rax, arg4
add rax, arg5
add rax, arg6
ret
Test1 ENDP
And this one:
Test2 PROC arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD,arg5:QWORD,arg6:QWORD
; save shadow space
mov arg1, rcx
mov arg2, rdx
mov arg3, r8
mov arg4, r9
; ready to add the arguments up
mov rax, arg1
add rax, arg2
add rax, arg3
add rax, arg4
add rax, arg5
add rax, arg6
ret
Test2 ENDP
I made up an example of storing the shadow space.
Both of those work the same.
I wrote this based on what I thought Hutch said. I would think that always using the name of the argument is better. Though I know it might be different if the arguments were not all QWORD.
I have this example, though...
Test3 PROC arg1:WORD,arg2:WORD,arg3:WORD,arg4:WORD,arg5:WORD,arg6:WORD
; save shadow space
mov QWORD PTR arg1, rcx
mov QWORD PTR arg2, rdx
mov QWORD PTR arg3, r8
mov QWORD PTR arg4, r9
; ready to add the arguments up
mov ax, arg1
add ax, arg2
add ax, arg3
add ax, arg4
add ax, arg5
add ax, arg6
ret
Test3 ENDP
What is your opinion on this? I recast it to QWORD so that the actual type of the argument does not matter.
My issue is that if I added "USES RBX", would the stack change? I think that using the named arguments is better.
If a programmer is going to bother using the automatic stack manipulation for named arguments, then the programmer should never manually mention the stack locations, because they might change if the procedure is rewritten and a new register is added to the list in USES.
Now I would also like to reveal something...
I've used MASM32 before and I know all about PROC, PROTO, invoke, etc. (It might have been 10-15 years ago).
But I've read the MASM64 help and noticed that ML64 does not have invoke, but someone made a macro.
I also was wondering why someone did not yet write:
invokefast which would not waste time with storing the first 4 arguments on the stack.
The callee should use the stack space if needed, but the caller shouldn't.
It should be a black box. The caller says here is a shadow space, I will not fill it in. Use it if you want, but I'll ignore it.
The callee if it uses a whole bunch of registers might use the shadow space. But the caller should not even know.
What I'm asking about is a macro that follows the FASTCALL convention exactly.
So I would like to tell you that I expected the named arguments to be on the stack, that is why I did not use this notation.
I started coding after I read to use the registers, so that is what I've been doing.
Time to rewrite my UInt128Mul... I'll be using the shadow space correctly now.
Bob,
The calling techniques in the MASM64 stuff so far do procedure calls in a couple of ways. There is a general "procedure_call" type macro that is called with a number of wrappers, "invoke" included, and a pure register call macro which will accept up to 4 arguments. Both conform to the Win 64 ABI and handle both ends of the market.
The invoke style macro writes the first 4 args to shadow space as well as the registers and the rest directly to the correct stack locations. It also supports quoted text. The direct register call macros only write up to the first 4 registers so you can do both and in a very efficient manner.
ML64 is an unconfigured assembler and needs to use the pre-processor to configure it. Whereas with ML.EXE in Win 32 it was easy enough to write pure mnemonic code, the Win 64 ABI is a lot more complex than stack based Win 32, and while it can be done purely manually, it's not for the faint of heart.
With stackframe support, you can handle any of the normal high level API and procedure calls but for pure algorithms you write procedures with no stack frame and call them using up to the first 4 registers.
Quote from: hutch-- on June 17, 2018, 02:30:26 PM
The invoke style macro writes the first 4 args to shadow space as well as the registers and the rest directly to the correct stack locations. It also supports quoted text. The direct register call macros only write up to the first 4 registers so you can do both and in a very efficient manner.
I read the macro source code and I don't think that invoke writes to the shadow space, it actually appears to work as I expected.
Please note that I installed MASM32 and MASM64 on my current computer in 2017.
@nidud
Quote
Well, lets take this from the start
Sure, we can retry as many times as you need, hopefully you will understand in the end.
Quote
GCC extends this to 64-bit.
mov rax,rcx ; in: rdx:rcx, r8:r9
add rax,r8 ; out: rdx:rax
adc rdx,r9
ret
Ah, so this is the part you find really cool.
The problem is that nothing useful can be done with the return value in RDX:RAX.
You will still need to call functions with pointer arguments. Or are you going to pass arguments in rdx:rax, or maybe also in r11:r10 and r13:r12? You don't clarify this part (ah, from __int128 foo(__int128 a, __int128 b) it appears that you are thinking about some 128-bit registers that don't exist yet on this planet).
If neither is true, you will need to save RDX:RAX to memory, like it or not. This means that RDX:RAX is only a useless carrier of the return value from callee to caller, and a consumer of CPU cycles.
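For comparison, the two ways of handing back a 128-bit sum argued about here can be sketched in C. This is only an illustration: the type `u128` and the functions `add128_val`, `add128_ptr` and `add128_selfcheck` are made-up names, not part of any post above. As far as the published conventions go, a 16-byte struct returned by value goes through a hidden pointer under the Windows x64 ABI, while System V x86-64 returns it in RDX:RAX, so the by-value form only maps to the register pair on the latter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit type: two 64-bit halves, low half first. */
typedef struct { uint64_t lo, hi; } u128;

/* By-value variant: the sum comes back as a struct. */
static u128 add128_val(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* (r.lo < a.lo) is the carry out of the low half */
    return r;
}

/* Pointer variant: same arithmetic, result written through an out pointer.
   The low half is stored last so that result may alias a or b. */
static void add128_ptr(const u128 *a, const u128 *b, u128 *result)
{
    uint64_t lo = a->lo + b->lo;
    result->hi = a->hi + b->hi + (lo < a->lo);
    result->lo = lo;
}

/* Both variants must agree, including when the low halves carry. */
static void add128_selfcheck(void)
{
    u128 a = { UINT64_MAX, 1 }, b = { 1, 2 }, p;
    u128 v = add128_val(a, b);
    add128_ptr(&a, &b, &p);
    assert(v.lo == 0 && v.hi == 4);
    assert(p.lo == v.lo && p.hi == v.hi);
}
```

Calling `add128_selfcheck()` from a small `main` is enough to confirm that both forms compute the same value; which one is cheaper depends on what the caller does with the result next.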
Bob,
This works fine, the macro that "invoke" calls definitely writes to shadow space.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
invoke testproc,150,300,450,600
waitkey
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
testproc proc arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD
; clear the 4 registers
xor rcx, rcx
xor rdx, rdx
xor r8, r8
xor r9, r9
; display the values from shadow space
conout str$(arg1),lf
conout str$(arg2),lf
conout str$(arg3),lf
conout str$(arg4),lf
ret
testproc endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Quote from: bigbadbob on June 17, 2018, 03:40:23 PM
I read the macro source code and I don't think that invoke writes to the shadow space
Macros are powerful, you can ask them explicitly to write to shadow space; with jinvoke, just add a <cb> after proc:
include \Masm32\MasmBasic\Res\JBasic.inc ; see 64-bit assembly with RichMasm (http://masm32.com/board/index.php?topic=5314.msg59884#msg59884)
.code
testproc proc <cb> arg1:QWORD,arg2:QWORD,arg3:QWORD,arg4:QWORD
xor rcx, rcx ; clear the 4 registers
xor rdx, rdx
xor r8, r8
xor r9, r9
PrintLine Str$("a1: %i\na2: %i\na3: %i\na4: %i", arg1, arg2, arg3, arg4) ; display the values from shadow space
ret
testproc endp
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
jinvoke testproc, 100, 200, 300, 400
Inkey Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
EndOfCode
Output:
100
200
300
400
This code was assembled with ml64 in 64-bit format
Builds & runs also with UAsm, of course. The <cb> behaviour is needed for Windows callback functions, such as WndProc. Without the <cb>, the macro saves the four instructions needed to write to shadow space,
but the callee must know what to do with the four registers 8)
deleted
Let's move further ABI/FASTCALL discussion to http://masm32.com/board/index.php?topic=7222.0
nidud, I noticed that this code does not properly multiply the generic case of OWORD multiplied by OWORD. You said that it was inline, so it probably works for a special case.
Quote
These functions are used for the REAL16 (quadmath) implementation so most of them are inline. The mul function goes something like this:
.if !rdx && !r11
mul r10
xor r10,r10
.else
mul r10
mov rbx,rdx
mov rdi,rax
mov rax,rcx
mul r11
mov r11,rdx
xchg r10,rax
mov rdx,rcx
mul rdx
add rbx,rax
adc r10,rdx
adc r11,0
There should be 4 multiplies in the else part.
rdx:rax * r11:r10 =
rax * r10 (QWORD0 and QWORD1)
+ rdx * r10 (QWORD1 and QWORD2)
+ rax * r11 (QWORD1 and QWORD2)
+ rdx * r11 (QWORD2 and QWORD3)
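The four partial products can be sketched in portable C, which also makes a handy reference to test an assembly version against. Everything here is illustrative: `u128`, `u256`, `mul64`, `add_at`, `mul128` and `mul128_selfcheck` are invented names, and the 64x64 multiply is built from 32-bit halves only so that the sketch does not depend on a compiler-specific 128-bit type.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical, low QWORD first */
typedef struct { uint64_t q[4]; } u256;     /* q[0] = QWORD0 ... q[3] = QWORD3 */

/* 64 x 64 -> 128 bit multiply built from four 32 x 32 -> 64 bit products. */
static u128 mul64(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* sum of three values below 2^32 each, so it cannot overflow 64 bits */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    u128 r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

/* Add v into QWORD i of r and ripple any carry into the higher QWORDs. */
static void add_at(u256 *r, int i, uint64_t v)
{
    for (; i < 4; i++) {
        r->q[i] += v;
        if (r->q[i] >= v)
            return;        /* no wraparound, so no carry to propagate */
        v = 1;             /* carry one into the next QWORD */
    }
}

/* Schoolbook 128 x 128 -> 256: four partial products, each added at its offset. */
static u256 mul128(u128 a, u128 b)
{
    u256 r = { { 0, 0, 0, 0 } };
    u128 p;
    p = mul64(a.lo, b.lo); add_at(&r, 0, p.lo); add_at(&r, 1, p.hi);
    p = mul64(a.hi, b.lo); add_at(&r, 1, p.lo); add_at(&r, 2, p.hi);
    p = mul64(a.lo, b.hi); add_at(&r, 1, p.lo); add_at(&r, 2, p.hi);
    p = mul64(a.hi, b.hi); add_at(&r, 2, p.lo); add_at(&r, 3, p.hi);
    return r;
}

/* (2^128 - 1)^2 = 2^256 - 2^129 + 1 exercises every carry path. */
static void mul128_selfcheck(void)
{
    u128 m = { UINT64_MAX, UINT64_MAX };
    u256 r = mul128(m, m);
    assert(r.q[0] == 1 && r.q[1] == 0);
    assert(r.q[2] == UINT64_MAX - 1 && r.q[3] == UINT64_MAX);
}
```

The all-ones input is a useful test vector for any implementation, since a missing carry anywhere shows up in the top two QWORDs.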
deleted
deleted
Quote from: nidud on June 18, 2018, 09:10:02 AM
:biggrin:
There actually is 4 multiplies in there so it's more or less the same code as above using different regs.
I accidentally did not scroll and then quoted incorrectly. You are correct, the first post was correct.
:t
Guys,
I moved this topic as it is way too complicated for learners.
Hi nidud,
Here is my version, heavily commented.
_text SEGMENT
public UInt128Mul
UInt128 STRUCT
loQWORD QWORD ?
hiQWORD QWORD ?
UInt128 ENDS
UInt256 STRUCT
myQWORD0 QWORD ?
myQWORD1 QWORD ?
myQWORD2 QWORD ?
myQWORD3 QWORD ?
UInt256 ENDS
; UInt128Mul
; ---------
; RCX - QWORD - PTR to UInt256 (result)
; RDX - QWORD - PTR to UInt128 (input1)
; R8 - QWORD - PTR to UInt128 (input2)
; R9 - volatile, not used for parameter passing
; ---------
; RAX volatile
; R10 volatile
; R11 volatile
;----------
; C Header - variant 1
; UInt256 UInt128Mul(UInt128* const input1, UInt128* const input2);
; C Header - variant 2
; void UInt128Mul(UInt256* result, UInt128* const input1, UInt128* const input2);
;----------
; source remains unchanged
; performs *RCX = *RDX * *R8 in 128 bit mode resulting in 256 bits
UInt128Mul PROC result:PTR UInt256, input1:PTR UInt128, input2:PTR UInt128
; no need to save shadow space yet
cmp (UInt128 PTR [rdx]).hiQWORD, 0
jne long_math
cmp (UInt128 PTR [r8]).hiQWORD, 0
jne long_math
mov rax, (UInt128 PTR [rdx]).loQWORD
mov r10, (UInt128 PTR [r8]).loQWORD
mul r10
mov (UInt256 PTR [rcx]).myQWORD0, rax
mov (UInt256 PTR [rcx]).myQWORD1, rdx
mov (UInt256 PTR [rcx]).myQWORD2, 0
mov (UInt256 PTR [rcx]).myQWORD3, 0
mov rax, rcx
ret
long_math:
; save shadow space
mov result, rcx ; this saved shadow parameter is actually used
mov input1, rdx ; saved but not used
mov input2, r8 ; saved but not used
push r12
push r13
push r14
push r15
xor r14, r14 ; make r14 and r15 zero because they will start as carries
xor r15, r15
mov rax, (UInt128 PTR [rdx]).loQWORD ; rdx is a pointer to input1
mov r10, (UInt128 PTR [r8]).loQWORD ; r8 is a pointer to input2
mov rcx, (UInt128 PTR [rdx]).hiQWORD
mov r11, (UInt128 PTR [r8]).hiQWORD
mov r8, rax
mul r10 ; input1.loQWORD * input2.loQWORD ==> rdx : rax
mov r12, rax ; this is the result.myQWORD0 (final result)
mov r13, rdx ; this is the result.myQWORD1 (temp)
mov rax, r8
mul r11 ; input1.loQWORD * input2.hiQWORD ==> rdx : rax
add r13, rax ; update result.myQWORD1 (still temp) by adding
adc r14, rdx ; result.myQWORD2 (temp) add with carry
mov rax, rcx
mul r10 ; input1.hiQWORD * input2.loQWORD ==> rdx : rax
add r13, rax ; update result.myQWORD1 (final result) by adding
adc r14, rdx ; update result.myQWORD2 (temp) by adding with carry
adc r15, 0 ; begin using result.myQWORD3 (temp) in case of carry
mov rax, rcx
mul r11 ; input1.hiQWORD * input2.hiQWORD ==> rdx : rax
add r14, rax ; update result.myQWORD2 (final result) by adding
adc r15, rdx ; update result.myQWORD3 (final result) by adding with carry
mov rax, result ; load result pointer into rax to begin storing in memory
mov (UInt256 PTR [rax]).myQWORD0, r12
mov (UInt256 PTR [rax]).myQWORD1, r13
mov (UInt256 PTR [rax]).myQWORD2, r14
mov (UInt256 PTR [rax]).myQWORD3, r15
pop r15
pop r14
pop r13
pop r12
ret
UInt128Mul ENDP
_text ENDS
END
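A minimal sketch of what the C side of "C Header - variant 2" could look like, assuming the struct layouts mirror the UInt128 and UInt256 STRUCT definitions in the listing. The `check_layout` helper is hypothetical and only confirms that the C compiler lays the fields out at the offsets the assembly code expects; the prototype is declared here but implemented in the assembly module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C mirrors of the MASM STRUCTs; field names copied from the listing. */
typedef struct { uint64_t loQWORD, hiQWORD; } UInt128;
typedef struct { uint64_t myQWORD0, myQWORD1, myQWORD2, myQWORD3; } UInt256;

/* Variant 2 prototype: result pointer first (RCX), then the two inputs (RDX, R8).
   Declared only; the body lives in the assembly module. */
void UInt128Mul(UInt256 *result, const UInt128 *input1, const UInt128 *input2);

/* Hypothetical helper: verify the layout the assembly code relies on. */
static void check_layout(void)
{
    assert(sizeof(UInt128) == 16);
    assert(sizeof(UInt256) == 32);
    assert(offsetof(UInt128, hiQWORD) == 8);
    assert(offsetof(UInt256, myQWORD1) == 8);
    assert(offsetof(UInt256, myQWORD2) == 16);
    assert(offsetof(UInt256, myQWORD3) == 24);
}
```

With the object file from the assembler linked in, a caller would fill two UInt128 values, pass their addresses together with the address of a UInt256, and read the product from the four QWORD fields.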
Quote from: nidud on June 18, 2018, 03:34:20 AM
Nevertheless you can't do the RDX:RAX thing in VS (if that was your plan) as already explained, so this is strictly assembler.
I know, it is something that can only be explored in assembler, not within either the Windows ABI or the System V ABI.
However, I cannot visualize a good way to explore it in an ASM-only application. Don't be afraid to post a solution if you have one.
deleted
It is not a routine that proves what you want. You need to make a function that calls that routine and then calls the same or a similar routine in order to produce something useful that can be printed - in other words make a f*g application, stop bluffing. When you try that you will see you are wasting CPU cycles with those maneuvers.
What fascinates me with this discussion is the level of fud involved. You have in Windows a published ABI and it handles from BYTE to QWORD; above that (SSE, AVX) you use pointers to larger data sizes OR you directly load SSE or AVX registers and then call the procedure. Now I have no doubt that you can do things in different ways if you have code at the receiving end that will handle it, but it's hard to beat a single pointer when the alternative is to use weird techniques and then reassemble the pieces back into a usable data type.
I have no doubt it's character building and may even be amusing, but it's not an improvement over the published ABI. Now an alternative is to create a structure if you have to pass a variety of different sizes to the same proc. Ensure the data in the struct is aligned correctly, biggest first dropping down to the smaller sizes, then pass a single structure pointer and you have probably hit the big time in terms of the most efficient technique to call a procedure with variable sized arguments.
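The "big first" ordering can be illustrated with a short C sketch. The struct and function names (`ArgsOrdered`, `ArgsMixed`, `sum_args`, `check_packing`) are made up for the example, and the exact size of the badly ordered struct depends on the compiler's alignment rules, so the code only checks that ordering by descending size leaves no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Members ordered from largest to smallest: every field is naturally
   aligned and no padding is needed (8 + 4 + 2 + 1 + 1 = 16 bytes). */
typedef struct {
    uint64_t big;
    uint32_t medium;
    uint16_t small;
    uint8_t  tiny;
    uint8_t  tiny2;
} ArgsOrdered;

/* Same members in a careless order: the compiler typically inserts
   padding before each wider field to keep it aligned. */
typedef struct {
    uint8_t  tiny;
    uint64_t big;
    uint16_t small;
    uint32_t medium;
    uint8_t  tiny2;
} ArgsMixed;

/* One pointer carries all the differently sized arguments. */
static uint64_t sum_args(const ArgsOrdered *a)
{
    return a->big + a->medium + a->small + a->tiny + a->tiny2;
}

/* Hypothetical helper: confirm the ordered layout is the compact one. */
static void check_packing(void)
{
    assert(sizeof(ArgsOrdered) == 16);
    assert(sizeof(ArgsOrdered) <= sizeof(ArgsMixed));
    assert(offsetof(ArgsOrdered, medium) == 8);
    assert(offsetof(ArgsOrdered, small) == 12);
}
```

On the assembly side the callee would receive the single pointer in RCX and read each field at its fixed offset.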
Quote from: hutch-- on June 19, 2018, 02:56:00 PM
What fascinates me with this discussion is the level of fud involved,
Nothing forbids us from inventing our own ABI and using it inside our ASM-only application. Practically the only restriction is to keep aligned what needs to be aligned, to prevent an exception. But in almost every case there is little to no advantage in inventing a new ABI (this includes VectorCall, which can be advantageous only with specially tailor-made routines).
Large data will need to be passed with pointers, like it or not. And return values will continue to be limited by the size of registers, or returned through pointers. Now comes @nidud saying that we can return values in 2 registers instead of one (he got the idea from the 32-bit way of returning a 64-bit value). It looks like an appealing idea if we ignore how we are going to deal with the data in 2 registers on the calling end.
@nidud says "easy, see that we can multiply and come up with a 256 bit value in 4 registers!" (I am sure we can also come up with a 512 bit value in 8 registers).
What else can we do out of that mess of data spread across multiple registers?
If we need to call a function we will have to move data around to new registers (which may end up being a "musical chairs" game) or save it in memory (something we were trying to avoid at all costs in the first place).
deleted
Quote from: nidud on June 20, 2018, 12:25:12 AM
Look, nobody has disputed the fact that you need a pointer to return a value larger than 64-bit in Windows 64 unless you use a vector to do so. In the case of __int128 and larger integer values (the subject of this thread) I recommended, based on experience, not to do so and use pointers instead.
In the case of __int128, returning the value in xmm0 would be a natural choice.
Quote from: jj2007 on June 20, 2018, 02:28:51 AM
In the case of __int128, returning the value in xmm0 would be a natural choice.
Complete nonsense. :badgrin:
Quote
It was the big software corporations and chip manufacturers who invented things like vectorcall and System V to improve performance. You think they were all wrong, idiots even?
Sorry, I believed you were the socialist guy in here. I never thought that about big corporations, they are the best thing in this World :t
deleted
Quote from: AW on June 20, 2018, 02:43:55 AM
Quote from: jj2007 on June 20, 2018, 02:28:51 AM
In the case of __int128, returning the value in xmm0 would be a natural choice.
Complete nonsense. :badgrin:
Your remark is about as competent as saying that returning DWORD values in eax is "complete nonsense".
(ok ok, I know I shouldn't feed the troll, but he looks soooo hungry... poor beast :shock:)
deleted
Quote from: nidud on June 20, 2018, 03:10:56 AM
Quote from: AW on June 20, 2018, 02:53:17 AM
I never thought that about big corporations, they are the best thing in this World :t
:biggrin:
So who's the idiot then?
I guess you already got the confirmation from the mirror:
"Mirror, mirror, on the wall,
Who in this land is the most idiot of all?"
:t
Quote from: jj2007 on June 20, 2018, 03:31:08 AM
Quote from: AW on June 20, 2018, 02:43:55 AM
Quote from: jj2007 on June 20, 2018, 02:28:51 AM
In the case of __int128, returning the value in xmm0 would be a natural choice.
Complete nonsense. :badgrin:
Your remark is about as competent as saying that returning DWORD values in eax is "complete nonsense".
(ok ok, I know I shouldn't feed the troll, but he looks soooo hungry... poor beast :shock:)
LOL, you are becoming less and less intelligent every day. :badgrin:
:biggrin:
Maybe we should have moved this discussion to Romper Room. :P
(puts on flameproof asbestos suit)
For simple add128 and sub128, wouldn't it be easier to use macros than to worry about calling conventions?
The calling convention plus call/ret would add unnecessary overhead to what should be a fast add/sub.
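The same idea carries over to a C caller: a macro or an inline function expands at the call site, so there is no call/ret and no argument marshalling. A minimal sketch, with `u128`, `SUB128`, `sub128` and `sub128_selfcheck` as hypothetical names chosen for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit type: two 64-bit halves, low half first. */
typedef struct { uint64_t lo, hi; } u128;

/* Macro form: expands in place. The arguments are evaluated more than
   once, so pass plain variables only. r may be the same variable as a or b,
   because the low half is stored last. */
#define SUB128(r, a, b) do {                              \
        uint64_t lo_ = (a).lo - (b).lo;                   \
        (r).hi = (a).hi - (b).hi - ((a).lo < (b).lo);     \
        (r).lo = lo_;                                     \
    } while (0)

/* Inline function form: same arithmetic, type checked, arguments evaluated once. */
static inline u128 sub128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);   /* (a.lo < b.lo) is the borrow from the low half */
    return r;
}

/* 2^64 - 1 must borrow from the high half. */
static void sub128_selfcheck(void)
{
    u128 a = { 0, 1 }, b = { 1, 0 }, m;
    u128 f = sub128(a, b);
    SUB128(m, a, b);
    assert(f.lo == UINT64_MAX && f.hi == 0);
    assert(m.lo == f.lo && m.hi == f.hi);
}
```

Whether the compiler actually inlines `sub128` is its own decision; the macro form is the only one that guarantees expansion at the call site.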
deleted