This pops up sometimes with random number generation: How to multiply two QWORDs?
Microsoft offers a bunch of dedicated functions with strange names such as ...
UnsignedMultiply128
Multiply128
__mul128
... but I got none of them working, the compiler always complains about intrinsics not supported etc, the usual C/C++ mess.
But I managed to put up a little 36 bytes routine that does the job (attached, pure MASM32). Problem: It's slow. There is a much faster alternative:
fild a64
fild b64
fmul
fistp resq
Great, over 4x faster, but if the result exceeds a QWORD, it spits out meaningless results and becomes very, very slow (due to the FPU's error handling).
So I spent some time looking for a SIMD solution, and I found PMULUDQ. Looks great but it's DWORDs, not QWORDs :(
Any bright ideas...?
Hi JJ
I found this old macro
; Macro: qdmul
; Purpose: Multiply a QWORD by a DWORD.
; Arguments: ecx::edx::eax = edx::eax * ebx.
qdmul macro
push eax
mov eax, edx
mul ebx
mov ecx, edx
pop edx
xchg eax, edx
push edx
mul ebx
pop ebx
add edx, ebx
adc ecx, 0
endm
You can use it, splitting the multiplication in 4 steps:
1. multiply be the high 32 bit using the macro
2. shift the result by 32 ecx::edx::eax --> edi::ecx::edx::eax
3. multiply be the low 32 bit using the macro ecx::edx::eax
4. add the result of step 2 & 3
Biterider
Thanks, will have a look. It doesn't look any simpler, though.
Don't know if its fast but the results are correct. LATER : Cleaned up.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
LOCAL quad1 :QWORD
LOCAL quad2 :QWORD
mov quad1, 100
mov quad2, 100
cvtsi2sd xmm0, quad1
cvtsi2sd xmm1, quad2
mulsd xmm0, xmm1
cvtsd2si r11, xmm0
conout str$(r11),lf
waitkey
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value
That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(
deleted
Quote from: jj2007 on September 16, 2018, 02:41:25 AM
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value
That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(
I never used it. So, it seems that it converts a QWORD
integer to
Floating-Point format (QWORD also) if you are right Jochen :biggrin: . These definitions are very ... cool !!!
LATER: i saw all CVT??? instructions for SSE2 and all converts from 32 to 64 or 64 to 32 .
It s nice ! I think Hutch has a privileged information and he doesnt show us ;)
Quote from: nidud on September 16, 2018, 03:36:20 AM
You probably have to convert each DWORD to REAL8
Thanks, but as the title says, it's
QWORDs, not DWORDs.
:biggrin:
> I think Hutch has a privileged information and he doesnt show us.
I have I have, its the Intel instruction manual. What I did when I started on the scalar double instructions was do a search through the entire manual for anything that had an SD ending, avoided the AVX versions and made a list of what looked like it was useful for scalar maths. Its worth having a look at the packed double versions as well as there are more primitives to work with.
What you may find is that the 64 bit mulsd instruction runs out of puff at the top of the 64 bit range.
LATER : From memory there are techniques to determine overflow and underflow which you may need to have in place if you are dealing with number up near or over the range limit of 64 bit.
This
should do the job:
MultQQ3 proc pQ1, pQ2
mov eax, [esp+4] ; pQ1
mov edx, [esp+8] ; pQ2
movlps xmm0, QWORD ptr [eax]
movlps xmm1, QWORD ptr [edx]
pclmulqdq xmm0, xmm1, 0
retn 2*4
MultQQ3 endp
Intel: (https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf)
QuotePCLMULQDQ Instruction Definition
PCLMULQDQ instruction performs carry-less multiplication of two 64-bit quadwords
which are selected from the first and the second operands according to the immediate
byte value.
Instruction format: PCLMULQDQ xmm1, xmm2/m128, imm8
Description: Carry-less multiplication of one quadword (8 bytes) of xmm1
by one quadword (8 bytes) of xmm2/m128, returning a double quadword
(16 bytes). The immediate byte is used for determining which quadwords of
xmm1 and xmm2/m128 should be used.
The instruction has an own 'chapter' in the Oracle manual. (https://docs.oracle.com/cd/E36784_01/html/E36859/gnygo.html)
Problem: It doesn't work :(
Testcase attached, plain Masm32. It calculates 4000000000*2222222222, no problem for the FPU but pclmulqdq does something else. More precisely, it does carry-less multiplication (https://en.wikipedia.org/wiki/Carry-less_product#Example).
Note there are variants like pclmullqlqdq xmm0, xmm1 (https://en.wikipedia.org/wiki/CLMUL_instruction_set#New_instructions) ("Multiply the low halves of the two registers") which do not require the immediate operand; they are just
pclmulqdq with the immediate bits embedded.
Carry-less means that the carry is discarded. I mentioned this instruction here (http://masm32.com/board/index.php?topic=7395.0). So you will never achieve what you want.
You need a Big Integer library or play with floating points to obtain an approximation.
A little convoluted way but it works.
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800
Thanks for all the suggestions :t
I have put together a little testbed, and here are the timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
1190 cycles for 100 * MultQQ
2115 cycles for 100 * Multiply64by64 (Rui)
2171 cycles for 100 * u64_mul (Rui)
1213 cycles for 100 * PCLMULQDQ
2840 cycles for 100 * MultPclmulqdq2
4358 cycles for 100 * doMul (aw27)
913 cycles for 100 * MultQQ
2118 cycles for 100 * Multiply64by64 (Rui)
2182 cycles for 100 * u64_mul (Rui)
1212 cycles for 100 * PCLMULQDQ
2843 cycles for 100 * MultPclmulqdq2
4357 cycles for 100 * doMul (aw27)
917 cycles for 100 * MultQQ
2116 cycles for 100 * Multiply64by64 (Rui)
2172 cycles for 100 * u64_mul (Rui)
1213 cycles for 100 * PCLMULQDQ
2855 cycles for 100 * MultPclmulqdq2
4351 cycles for 100 * doMul (aw27)
915 cycles for 100 * MultQQ
2113 cycles for 100 * Multiply64by64 (Rui)
2167 cycles for 100 * u64_mul (Rui)
1214 cycles for 100 * PCLMULQDQ
2841 cycles for 100 * MultPclmulqdq2
4363 cycles for 100 * doMul (aw27)
62 bytes for MultQQ
108 bytes for Multiply64by64 (Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
45 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 10801413638766757892
DestQ 10801413638766757892
DestQ 11111111111234566980
Note that some of the slower algos produce a 128-bit result.
JJ,
As always, you are a scam artist. Where are you testing the multiplication of 2 QWORDS?
Are these QWORDS?
SrcQ1 dq 1234567890
SrcQ2 dq 9000000082
Yes, as much as these are QWORDS:
SrcQ1 dq 1
SrcQ2 dq 2
Did you ever went to school? If so, did you try to deceive your teachers with tricks like these and succeed?
deleted
Thanks, Nidud. Your muld is practically my MulQQ, only that I pass the source QWORDs as pointers while you pass them as QWORDs. I had tested that before, and it was a little bit slower. Results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
911 cycles for 100 * MultQQ
2116 cycles for 100 * Multiply64by64 (Rui)
2183 cycles for 100 * u64_mul (Rui)
1213 cycles for 100 * PCLMULQDQ
2833 cycles for 100 * MultPclmulqdq2
4358 cycles for 100 * doMul (aw27)
965 cycles for 100 * muld (Nidud)
3894 cycles for 100 * mulq_sse (Nidud)
917 cycles for 100 * MultQQ
2116 cycles for 100 * Multiply64by64 (Rui)
2175 cycles for 100 * u64_mul (Rui)
1215 cycles for 100 * PCLMULQDQ
2840 cycles for 100 * MultPclmulqdq2
4355 cycles for 100 * doMul (aw27)
966 cycles for 100 * muld (Nidud)
3893 cycles for 100 * mulq_sse (Nidud)
915 cycles for 100 * MultQQ
2127 cycles for 100 * Multiply64by64 (Rui)
2170 cycles for 100 * u64_mul (Rui)
1213 cycles for 100 * PCLMULQDQ
2841 cycles for 100 * MultPclmulqdq2
4362 cycles for 100 * doMul (aw27)
968 cycles for 100 * muld (Nidud)
3899 cycles for 100 * mulq_sse (Nidud)
915 cycles for 100 * MultQQ
2123 cycles for 100 * Multiply64by64 (Rui)
2168 cycles for 100 * u64_mul (Rui)
1214 cycles for 100 * PCLMULQDQ
2833 cycles for 100 * MultPclmulqdq2
4354 cycles for 100 * doMul (aw27)
965 cycles for 100 * muld (Nidud)
3913 cycles for 100 * mulq_sse (Nidud)
62 bytes for MultQQ
108 bytes for Multiply64by64 (Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
45 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
68 bytes for muld (Nidud)
112 bytes for mulq_sse (Nidud)
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 10801413638766757892 < PCLMULQDQ carry-less, OK for random number but not suitable as "normal" mult
DestQ 10801413638766757892
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 506253686751116096 <<< little problem here
The FPU version is a lot slower, and chokes if the result goes beyond the QWORD range.
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)
878 cycles for 100 * MultQQ
1695 cycles for 100 * Multiply64by64 (Rui)
1850 cycles for 100 * u64_mul (Rui)
515 cycles for 100 * PCLMULQDQ
2320 cycles for 100 * MultPclmulqdq2
3976 cycles for 100 * doMul (aw27)
879 cycles for 100 * muld (Nidud)
3668 cycles for 100 * mulq_sse (Nidud)
868 cycles for 100 * MultQQ
1859 cycles for 100 * Multiply64by64 (Rui)
2035 cycles for 100 * u64_mul (Rui)
546 cycles for 100 * PCLMULQDQ
2134 cycles for 100 * MultPclmulqdq2
4049 cycles for 100 * doMul (aw27)
956 cycles for 100 * muld (Nidud)
3637 cycles for 100 * mulq_sse (Nidud)
922 cycles for 100 * MultQQ
1729 cycles for 100 * Multiply64by64 (Rui)
1849 cycles for 100 * u64_mul (Rui)
537 cycles for 100 * PCLMULQDQ
2103 cycles for 100 * MultPclmulqdq2
4215 cycles for 100 * doMul (aw27)
885 cycles for 100 * muld (Nidud)
3635 cycles for 100 * mulq_sse (Nidud)
884 cycles for 100 * MultQQ
1901 cycles for 100 * Multiply64by64 (Rui)
1853 cycles for 100 * u64_mul (Rui)
531 cycles for 100 * PCLMULQDQ
2127 cycles for 100 * MultPclmulqdq2
3960 cycles for 100 * doMul (aw27)
981 cycles for 100 * muld (Nidud)
3643 cycles for 100 * mulq_sse (Nidud)
62 bytes for MultQQ
108 bytes for Multiply64by64 (Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
45 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
68 bytes for muld (Nidud)
112 bytes for mulq_sse (Nidud)
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 10801413638766757892
DestQ 10801413638766757892
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 506253686751116096
--- ok ---
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
1159 cycles for 100 * MultQQ
2008 cycles for 100 * Multiply64by64 (Rui)
2190 cycles for 100 * u64_mul (Rui)
665 cycles for 100 * PCLMULQDQ
2458 cycles for 100 * MultPclmulqdq2
4529 cycles for 100 * doMul (aw27)
1006 cycles for 100 * muld (Nidud)
4222 cycles for 100 * mulq_sse (Nidud)
994 cycles for 100 * MultQQ
1975 cycles for 100 * Multiply64by64 (Rui)
2115 cycles for 100 * u64_mul (Rui)
722 cycles for 100 * PCLMULQDQ
2478 cycles for 100 * MultPclmulqdq2
4550 cycles for 100 * doMul (aw27)
1004 cycles for 100 * muld (Nidud)
4209 cycles for 100 * mulq_sse (Nidud)
1000 cycles for 100 * MultQQ
1971 cycles for 100 * Multiply64by64 (Rui)
2109 cycles for 100 * u64_mul (Rui)
601 cycles for 100 * PCLMULQDQ
2639 cycles for 100 * MultPclmulqdq2
4493 cycles for 100 * doMul (aw27)
1003 cycles for 100 * muld (Nidud)
4229 cycles for 100 * mulq_sse (Nidud)
998 cycles for 100 * MultQQ
1972 cycles for 100 * Multiply64by64 (Rui)
2098 cycles for 100 * u64_mul (Rui)
598 cycles for 100 * PCLMULQDQ
2443 cycles for 100 * MultPclmulqdq2
5134 cycles for 100 * doMul (aw27)
1005 cycles for 100 * muld (Nidud)
4219 cycles for 100 * mulq_sse (Nidud)
62 bytes for MultQQ
108 bytes for Multiply64by64 (Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
45 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
68 bytes for muld (Nidud)
112 bytes for mulq_sse (Nidud)
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 10801413638766757892
DestQ 10801413638766757892
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 506253686751116096
--- ok ---
deleted
Hi Jocehn
works fine :t
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v3 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v3 ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP MultQQ ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP doMul ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP muld ---***
**************** E N D ****************
@Nidud: Yes, the new SSE version does not produce the expected result, so I added _mul128 instead. I also added a print of the high qword for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
914 cycles for 100 * MultQQ
2111 cycles for 100 * Multiply64by64 (Rui)
2159 cycles for 100 * u64_mul (Rui)
1207 cycles for 100 * PCLMULQDQ
2832 cycles for 100 * MultPclmulqdq2
4267 cycles for 100 * doMul (aw27)
962 cycles for 100 * muld (Nidud)
2306 cycles for 100 * _mul128 (Nidud)
910 cycles for 100 * MultQQ
2115 cycles for 100 * Multiply64by64 (Rui)
2163 cycles for 100 * u64_mul (Rui)
1209 cycles for 100 * PCLMULQDQ
2829 cycles for 100 * MultPclmulqdq2
4266 cycles for 100 * doMul (aw27)
962 cycles for 100 * muld (Nidud)
2300 cycles for 100 * _mul128 (Nidud)
910 cycles for 100 * MultQQ
2120 cycles for 100 * Multiply64by64 (Rui)
2166 cycles for 100 * u64_mul (Rui)
1211 cycles for 100 * PCLMULQDQ
2833 cycles for 100 * MultPclmulqdq2
4272 cycles for 100 * doMul (aw27)
964 cycles for 100 * muld (Nidud)
2296 cycles for 100 * _mul128 (Nidud)
926 cycles for 100 * MultQQ
2118 cycles for 100 * Multiply64by64 (Rui)
2169 cycles for 100 * u64_mul (Rui)
1207 cycles for 100 * PCLMULQDQ
2833 cycles for 100 * MultPclmulqdq2
4273 cycles for 100 * doMul (aw27)
962 cycles for 100 * muld (Nidud)
2287 cycles for 100 * _mul128 (Nidud)
62 bytes for MultQQ
108 bytes for Multiply64by64 (Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
52 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
68 bytes for muld (Nidud)
146 bytes for _mul128 (Nidud)
MultQQ 6760860027809745732
Multiply64by64 (Rui) 6760860027809745732 - high QWORD: 1728378107
u64_mul (Rui) 6760860027809745732 - high QWORD: 1728378107
PCLMULQDQ 7817399311675693060
MultPclmulqdq2 7817399311675693060
doMul (aw27) 6760860027809745732 - high QWORD: 1728378107
muld (Nidud) 6760860027809745732
_mul128 (Nidud) 6760860027809745732
This is a True Masm (TM) 64-bit version of mult64to128, results are 128-bits as expected not crippled to 64-bits, even using floating point, as I have seen so far, or using incorrect carryless functions.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
64-bit True Masm (TM), 128-bit Results
978 cycles for 100 * mult64to128 (AW)
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800
:badgrin:
>>> ...for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
versions 3 ( _v3 ) should be much faster... ( than quick! )
This «u64_mul (Rui)» was not written by me, i got it... dont remember who wrote it.
Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
versions 3 ( _v3 ) should be much faster... ( than quick! )
Then post it here. And please, not hidden in an archive with a dozen files.
Quote from: jj2007 on September 19, 2018, 03:59:03 AM
Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
versions 3 ( _v3 ) should be much faster... ( than quick! )
Then post it here. And please, not hidden in an archive with a dozen files.
Is in my
reply #19 .inc fileJochen, please, replace (i replaced that file in the reply #19)
ret
Multiply64_64_v3 endp
by
ret 20
Multiply64_64_v3 endp
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.
Quote from: jj2007 on September 19, 2018, 06:06:59 AM
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.
i replaced just now, file MulQQbyQQ.zip, inside you have Multiply64by64
.inc and inside you have that procedure.
There are 6:
Multiply64by64_v1 <<<--- uses pointers
Multiply64by64_v2 Multiply64by64_v3Multiply64_64_v1 <<<--- uses value ( qwords )Multiply64_64_v2Multiply64_64_v3
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.
deleted
Quote from: jj2007 on September 19, 2018, 06:42:17 AM
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.
Jochen, Hutch will say to me why i dont zip it and post !
Remember that we need to extract the algos you post from archives.
Is there any other way when we have a lot of files ?
If we zip is because we zip; if we dont zip is because we dont zip. :biggrin:
IdxX0 equ 0
IdxX1 equ 4
IdxY0 equ 0
IdxY1 equ 4
IdxZ0 equ 0
IdxZ1 equ 4
IdxZ2 equ 8
IdxZ3 equ 12
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Multiply64by64_v3 proc pX64:DWORD, pY64:DWORD, pZ:DWORD
push ebx
push esi
mov ebx, [esp+12] ;pX64
mov esi, [esp+16] ;pY64
mov ecx, [esp+20] ;pZ
; ----------------------
; IdxX0*IdxY0
; ----------------------
mov eax, [ebx+IdxX0]
mul dword ptr [esi+IdxY0]
mov [ecx+IdxZ0], eax
mov [ecx+IdxZ1], edx
; ----------------------
; IdxX0*IdxY1
; ----------------------
mov eax, [ebx+IdxX0]
mul dword ptr [esi+IdxY1]
add eax, [ecx+IdxZ1]
mov [ecx+IdxZ1], eax
;
adc edx, 0
mov [ecx+IdxZ2], edx
; ----------------------
; IdxX1*IdxY0
; ----------------------
mov eax, [ebx+IdxX1]
mul dword ptr [esi+IdxY0]
add eax, [ecx+IdxZ1]
mov [ecx+IdxZ1], eax
;
adc edx, [ecx+IdxZ2]
mov [ecx+IdxZ2], edx
; ----------------------
; IdxX1*IdxY1
; ----------------------
mov eax, [ebx+IdxX1]
mul dword ptr [esi+IdxY1]
add eax, [ecx+IdxZ2]
mov [ecx+IdxZ2], eax
;
adc edx, 0
mov [ecx+IdxZ3], edx
pop esi
pop ebx
ret 12
Multiply64by64_v3 endp
;««««««««««««««««««««««««««««««««««««««««««««««««««««
Multiply64_64_v3 proc X:QWORD, Y:QWORD, pZ:DWORD
; -------------
; pointers
; -------------
mov ecx, [esp+20] ;pZ
; ----------------------
; IdxX0*IdxY0
; ----------------------
mov eax, dword ptr [esp+4] ; [X+IdxX0]
mul dword ptr [esp+12] ; [Y+IdxY0]
mov [ecx+IdxZ0], eax
mov [ecx+IdxZ1], edx
; ----------------------
; IdxX0*IdxY1
; ----------------------
mov eax, dword ptr [esp+4] ; [X+IdxX0]
mul dword ptr [esp+16] ; [Y+IdxY1]
add eax, [ecx+IdxZ1]
mov [ecx+IdxZ1], eax
;
adc edx, 0
mov [ecx+IdxZ2], edx
; ----------------------
; IdxX1*IdxY0
; ----------------------
mov eax, dword ptr [esp+8] ; [X+IdxX1]
mul dword ptr [esp+12] ; [Y+IdxY0]
add eax, [ecx+IdxZ1]
mov [ecx+IdxZ1], eax
;
adc edx, [ecx+IdxZ2]
mov [ecx+IdxZ2], edx
; ----------------------
; IdxX1*IdxY1
; ----------------------
mov eax, dword ptr [esp+8] ; [X+IdxX1]
mul dword ptr [esp+16] ; [Y+IdxY1]
add eax, [ecx+IdxZ2]
mov [ecx+IdxZ2], eax
;
adc edx, 0
mov [ecx+IdxZ3], edx
ret 20
Multiply64_64_v3 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Good job, Rui :t
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
1192 cycles for 100 * MultQQ
1843 cycles for 100 * Multiply64_64_v3v(Rui)
2168 cycles for 100 * u64_mul (Rui)
1215 cycles for 100 * PCLMULQDQ
2830 cycles for 100 * MultPclmulqdq2
4281 cycles for 100 * doMul (aw27)
963 cycles for 100 * muld (Nidud)
2304 cycles for 100 * _mul128 (Nidud)
913 cycles for 100 * MultQQ
1846 cycles for 100 * Multiply64_64_v3v(Rui)
2163 cycles for 100 * u64_mul (Rui)
1208 cycles for 100 * PCLMULQDQ
2827 cycles for 100 * MultPclmulqdq2
4270 cycles for 100 * doMul (aw27)
964 cycles for 100 * muld (Nidud)
2301 cycles for 100 * _mul128 (Nidud)
912 cycles for 100 * MultQQ
1840 cycles for 100 * Multiply64_64_v3v(Rui)
2166 cycles for 100 * u64_mul (Rui)
1210 cycles for 100 * PCLMULQDQ
2833 cycles for 100 * MultPclmulqdq2
4288 cycles for 100 * doMul (aw27)
968 cycles for 100 * muld (Nidud)
2297 cycles for 100 * _mul128 (Nidud)
909 cycles for 100 * MultQQ
1850 cycles for 100 * Multiply64_64_v3v(Rui)
2173 cycles for 100 * u64_mul (Rui)
1208 cycles for 100 * PCLMULQDQ
2831 cycles for 100 * MultPclmulqdq2
4326 cycles for 100 * doMul (aw27)
957 cycles for 100 * muld (Nidud)
2316 cycles for 100 * _mul128 (Nidud)
62 bytes for MultQQ
194 bytes for Multiply64_64_v3v(Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
52 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
68 bytes for muld (Nidud)
146 bytes for _mul128 (Nidud)
MultQQ 6760860027809745732
Multiply64_64_v3v(Rui) 6760860027809745732 - high QWORD: 1728378107
u64_mul (Rui) 6760860027809745732 - high QWORD: 1728378107
PCLMULQDQ 7817399311675693060
MultPclmulqdq2 7817399311675693060
doMul (aw27) 6760860027809745732 - high QWORD: 1728378107
muld (Nidud) 6760860027809745732
_mul128 (Nidud) 6760860027809745732
Hello my strong friend, :t
Quote from: nidud on September 19, 2018, 07:04:13 AM
mov rax,0x1122334455667788
mov rcx,0x99aabbccddeeff00
mul rcx
printf("%#I64x%I64x\n", rdx, rax)
Amazing serendipity after countless tries and fails! Congratulations! :t
deleted
Hi nidud,
you are using the old code. This is the last code i posted tested in reply #30
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Multiply64by64_v3 proc pX64:DWORD, pY64:DWORD, pZ:DWORD
push ebx
push esi
mov ebx, [esp+12] ;pX64
mov esi, [esp+16] ;pY64
mov ecx, [esp+20] ;pZ
; ----------------------
; IdxX0*IdxY0
; ----------------------
mov eax, [ebx+IdxX0]
mul dword ptr [esi+IdxY0]
mov [ecx+IdxZ0], eax
mov [ecx+IdxZ1], edx
; ----------------------
; IdxX0*IdxY1
; ----------------------
mov eax, [ebx+IdxX0]
mul dword ptr [esi+IdxY1]
add eax, [ecx+IdxZ1]
mov [ecx+IdxZ1], eax
;
adc edx, 0
mov [ecx+IdxZ2], edx
; ----------------------
; IdxX1*IdxY0
; ----------------------
mov eax, [ebx+IdxX1]
mul dword ptr [esi+IdxY0]
add eax, [ecx+IdxZ1]
mov [ecx+IdxZ1], eax
;
adc edx, [ecx+IdxZ2]
mov [ecx+IdxZ2], edx
; ----------------------
; IdxX1*IdxY1
; ----------------------
mov eax, [ebx+IdxX1]
mul dword ptr [esi+IdxY1]
add eax, [ecx+IdxZ2]
mov [ecx+IdxZ2], eax
;
adc edx, 0
mov [ecx+IdxZ3], edx
pop esi
pop ebx
ret 12
Multiply64by64_v3 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
deleted
Back into action :badgrin:
This 32-bit version (*) is the fastest of those that work (i.e. those that produce 128-bit results) and within those that work is the only one whose author is known (the other have been copied and pasted from some unknown excavated ditch).
.486
.model flat, stdcall
includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword
includelib \masm32\lib\msvcrt.lib
printf proto C :ptr, :vararg
twoQwords struc
q1l dword 0
q1h dword 0
q2l dword 0
q2h dword 0
twoQwords ends
.data
qword1 qword 0F122334455667788h
qword2 qword 0F9AABBCCDDEEFF00h
result twoQwords <>
valuePrint db "val1=0x%llx",10,"val2=0x%llx",10,"val1*val2=0x%llx%llx",10,0
.code
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
firstL EQU [esp+12]
firstH EQU [esp+16]
secL EQU [esp+20]
secH EQU [esp+24]
res EQU [esp+28]
multifast64x64_128 proc arg1:qword, arg2 :qword, arg3:ptr
push ebx
push esi
mov eax, firstL
mul dword ptr secL
mov ecx, res
mov (twoQwords PTR [ecx]).q1l, eax
mov ebx, edx
mov eax, firstH
mul dword ptr secL
add eax, ebx
adc edx,0
mov ebx, eax
mov esi, edx
mov eax, firstL
mul dword ptr secH
add eax, ebx
mov (twoQwords PTR [ecx]).q1h, eax
adc edx, 0
mov ebx, edx
mov eax, firstH
mul dword ptr secH
add eax, esi
adc edx,0
add eax, ebx
adc edx,0
mov (twoQwords PTR [ecx]).q2l, eax
mov (twoQwords PTR [ecx]).q2h, edx
pop esi
pop ebx
ret 20
multifast64x64_128 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
main proc
invoke multifast64x64_128, qword1, qword2, addr result
lea esi, result
invoke printf, addr valuePrint, qword1, qword2, qword ptr [esi+8], qword ptr [esi]
invoke ExitProcess,0
main endp
end
val1=0xf122334455667788
val2=0xf9aabbccddeeff00
val1*val2=0xeb2b15787630c963479983e499807800
(*) Guaranteed True Masm (TM) code, no BASIC or C imitations.
Quote from: AW on September 20, 2018, 04:39:00 PMThis 32-bit version (*) is the fastest of those that work (i.e. those that produce 128-bit results)
I know one shouldn't feed the troll but for others who follow this thread a reminder of the original post:
Quote from: jj2007 on September 15, 2018, 11:11:11 PM
This pops up sometimes with random number generation: How to multiply two QWORDs?
There might be an exotic case where a 128-bit integer is useful, but almost nobody needs that, therefore the algos that produce QWORD output are perfectly valid for this purpose (and they work also for multiplying a
dq 18446744073709551615, except PCLMULQDQ).
Even the "distorted" carry-less PCLMULQDQ is valid for random number generation. And of course,
mul rcx also produces a qword, but we are talking 32-bit code here.
Quote
There might be an exotic case where a 128-bit integer is useful, but almost nobody needs that
Everything you can't manage is exotic. One thing I am sure, nobody needs the crapware you drop here.
Quote
And of course, mul rcx also produces a qword,
It does not (do you need a link to learn about mul?), that's why I congratulated nidud. Sort of Colombo egg, he found it by serendipity after many unsuccessful tries.
Dear José,
In the interest of everybody, please take your pills regularly :icon14:
Quote from: AW on September 18, 2018, 09:39:05 PM
As always, you are a scam artist.
Quote from: AW on September 19, 2018, 03:39:16 PMAmazing serendipity after countless tries and fails! Congratulations! :t
Quote from: AW on September 20, 2018, 05:28:59 PM
Everything you can't manage is exotic. One thing I am sure, nobody needs the crapware you drop here.
Here is a quick play with mulx. This is 64 bit code but apparently it also works in 32 bit but only with 32 bit variables.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
USING r12,r13,r14,r15
SaveRegs
mov r12, 0
mov r13, 3
mov rdx, 2000000000000000000 ; 19 digits
mulx rdx,r12,r13 ; mul rdx x r13
conout " r12 = ",str$(r12),lf
conout " r13 = ",str$(r13),lf
conout " rdx = ",str$(rdx),lf
waitkey
RestoreRegs
.exit
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
; Description
;
; Performs an unsigned multiplication of the implicit source operand (EDX/RDX)
; and the specified source operand (the third operand) and stores the low half
; of the result in the second destination (second operand), the high half
; of the result in the first destination operand (first operand), without reading
; or writing the arithmetic flags.
;
; This enables efficient programming where the software can interleave add with
; carry operations and multiplications.
;
; If the first and second operand are identical, it will contain the high half
; of the multiplication result.
;
; This instruction is not supported in real mode and virtual-8086 mode. The operand
; size is always 32 bits if not in 64-bit mode. In 64-bit mode operand size 64
; requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An
; attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.
Quote from: hutch-- on September 20, 2018, 06:20:51 PM
Here is a quick play with mulx. This is 64 bit code but apparently it also works in 32 bit but only with 32 bit variables.
Looks interesting, and
mulx eax, ebx, ecx builds fine but causes an illegal instruction exception on my trusty old Core i5 :(
Quote from: jj2007 on September 20, 2018, 07:37:04 PM
Looks interesting,
You mean its exotic, let me correct:
There might be an exotic case where a mulx is useful, but almost nobody needs that, therefore the algos that use mul are perfectly valid for this purpose
Using the new mulx instructions on the above 32-bit code we can save at least 4 instructions giving a new total 26 instructions.
It is necessary the .xmm directive with this instruction, something a bit exotic would say someone from this forum.
.686p
.xmm
.model flat, stdcall
includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword
includelib \masm32\lib\msvcrt.lib
printf proto C :ptr, :vararg
twoQwords struc
q1l dword 0
q1h dword 0
q2l dword 0
q2h dword 0
twoQwords ends
.data
qword1 qword 0F122334455667788h
qword2 qword 0F9AABBCCDDEEFF00h
result twoQwords <>
valuePrint db "val1=0x%llx",10,"val2=0x%llx",10,"val1*val2=0x%llx%llx",10,0
.code
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
firstL EQU [esp+12]
firstH EQU [esp+16]
secL EQU [esp+20]
secH EQU [esp+24]
res EQU [esp+28]
multifast64x64_128 proc arg1:qword, arg2 :qword, arg3:ptr
push ebx
push esi
mov edx, firstL
mulx ebx, eax, dword ptr secL
mov ecx, res
mov (twoQwords PTR [ecx]).q1l, eax
mov edx, firstH
mulx esi, eax, dword ptr secL
add ebx, eax
mov edx, firstL
mulx edx, eax, dword ptr secH
add eax, ebx
mov (twoQwords PTR [ecx]).q1h, eax
adc edx, 0
mov ebx, edx
mov edx, firstH
mulx edx, eax, dword ptr secH
add eax, esi
adc edx,0
add eax, ebx
adc edx,0
mov (twoQwords PTR [ecx]).q2l, eax
mov (twoQwords PTR [ecx]).q2h, edx
pop esi
pop ebx
ret 20
multifast64x64_128 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
main proc
invoke multifast64x64_128, qword1, qword2, addr result
lea esi, result
invoke printf, addr valuePrint, qword1, qword2, qword ptr [esi+8], qword ptr [esi]
invoke ExitProcess,0
main endp
end
Some less young (but trusty) computers will refuse to run the mulx instruction.