Print Page - Multiply two QWORDs

Title: Multiply two QWORDs
Post by: jj2007 on September 15, 2018, 11:11:11 PM

This pops up sometimes with random number generation: How to multiply two QWORDs?

Microsoft offers a bunch of dedicated functions with strange names such as ...

Code Select

UnsignedMultiply128
Multiply128
__mul128

... but I got none of them working, the compiler always complains about intrinsics not supported etc, the usual C/C++ mess.

But I managed to put up a little 36 bytes routine that does the job (attached, pure MASM32). Problem: It's slow. There is a much faster alternative:

Code Select

	fild a64
	fild b64
	fmul
	fistp resq

Great, over 4x faster, but if the result exceeds a QWORD, it spits out meaningless results and becomes very, very slow (due to the FPU's error handling).

So I spent some time looking for a SIMD solution, and I found PMULUDQ. Looks great but it's DWORDs, not QWORDs :(

Any bright ideas...?

Title: Re: Multiply two QWORDs
Post by: Biterider on September 16, 2018, 12:44:57 AM

Hi JJ
I found this old macro

Code Select


; Macro:      qdmul
; Purpose:    Multiply a QWORD by a DWORD.
; Arguments:  ecx::edx::eax = edx::eax * ebx.

qdmul macro
    push eax
    mov eax, edx
    mul ebx
    mov ecx, edx
    pop edx
    xchg eax, edx
    push edx
    mul ebx
    pop ebx
    add edx, ebx
    adc ecx, 0
endm

You can use it, splitting the multiplication in 4 steps:
1. multiply be the high 32 bit using the macro
2. shift the result by 32 ecx::edx::eax --> edi::ecx::edx::eax
3. multiply be the low 32 bit using the macro ecx::edx::eax
4. add the result of step 2 & 3

Biterider

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 16, 2018, 01:48:51 AM

Thanks, will have a look. It doesn't look any simpler, though.

Title: Re: Multiply two QWORDs
Post by: hutch-- on September 16, 2018, 02:19:03 AM

Don't know if its fast but the results are correct. LATER : Cleaned up.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include64\masm64rt.inc

.code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

LOCAL quad1 :QWORD
LOCAL quad2 :QWORD

mov quad1, 100
mov quad2, 100

cvtsi2sd xmm0, quad1
cvtsi2sd xmm1, quad2
mulsd xmm0, xmm1
cvtsd2si r11, xmm0

conout str$(r11),lf

waitkey
.exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 16, 2018, 02:41:25 AM

Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(

Title: Re: Multiply two QWORDs
Post by: nidud on September 16, 2018, 03:36:20 AM

deleted

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 16, 2018, 04:48:40 AM

Quote from: jj2007 on September 16, 2018, 02:41:25 AM
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(

I never used it. So, it seems that it converts a QWORD integer to Floating-Point format (QWORD also) if you are right Jochen :biggrin: . These definitions are very ... cool !!!

LATER: i saw all CVT??? instructions for SSE2 and all converts from 32 to 64 or 64 to 32 .
It s nice ! I think Hutch has a privileged information and he doesnt show us ;)

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 16, 2018, 11:00:10 AM

Quote from: nidud on September 16, 2018, 03:36:20 AM
You probably have to convert each DWORD to REAL8

Thanks, but as the title says, it's QWORDs, not DWORDs.

Title: Re: Multiply two QWORDs
Post by: hutch-- on September 16, 2018, 11:10:02 AM

:biggrin:

> I think Hutch has a privileged information and he doesnt show us.

I have I have, its the Intel instruction manual. What I did when I started on the scalar double instructions was do a search through the entire manual for anything that had an SD ending, avoided the AVX versions and made a list of what looked like it was useful for scalar maths. Its worth having a look at the packed double versions as well as there are more primitives to work with.

What you may find is that the 64 bit mulsd instruction runs out of puff at the top of the 64 bit range.

LATER : From memory there are techniques to determine overflow and underflow which you may need to have in place if you are dealing with number up near or over the range limit of 64 bit.

Title: pclmulqdq
Post by: jj2007 on September 16, 2018, 11:40:02 AM

This should do the job:

Code Select

MultQQ3 proc pQ1, pQ2
  mov eax, [esp+4]	; pQ1
  mov edx, [esp+8]	; pQ2
  movlps xmm0, QWORD ptr [eax]
  movlps xmm1, QWORD ptr [edx]
  pclmulqdq xmm0, xmm1, 0
  retn 2*4
MultQQ3 endp

Intel: (https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf)

QuotePCLMULQDQ Instruction Definition
PCLMULQDQ instruction performs carry-less multiplication of two 64-bit quadwords
which are selected from the first and the second operands according to the immediate
byte value.
Instruction format: PCLMULQDQ xmm1, xmm2/m128, imm8
Description: Carry-less multiplication of one quadword (8 bytes) of xmm1
by one quadword (8 bytes) of xmm2/m128, returning a double quadword
(16 bytes). The immediate byte is used for determining which quadwords of
xmm1 and xmm2/m128 should be used.

The instruction has an own 'chapter' in the Oracle manual. (https://docs.oracle.com/cd/E36784_01/html/E36859/gnygo.html)

Problem: It doesn't work :(

Testcase attached, plain Masm32. It calculates 4000000000*2222222222, no problem for the FPU but pclmulqdq does something else. More precisely, it does carry-less multiplication (https://en.wikipedia.org/wiki/Carry-less_product#Example).

Note there are variants like pclmullqlqdq xmm0, xmm1 (https://en.wikipedia.org/wiki/CLMUL_instruction_set#New_instructions) ("Multiply the low halves of the two registers") which do not require the immediate operand; they are just pclmulqdq with the immediate bits embedded.

Title: Re: Multiply two QWORDs
Post by: aw27 on September 17, 2018, 02:11:02 AM

Carry-less means that the carry is discarded. I mentioned this instruction here (http://masm32.com/board/index.php?topic=7395.0). So you will never achieve what you want.
You need a Big Integer library or play with floating points to obtain an approximation.

Title: Re: Multiply two QWORDs
Post by: aw27 on September 18, 2018, 02:24:31 AM

A little convoluted way but it works.

Code Select


val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 18, 2018, 05:48:46 PM

Thanks for all the suggestions :t

I have put together a little testbed, and here are the timings:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1190    cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2171    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)

913     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2182    cycles for 100 * u64_mul (Rui)
1212    cycles for 100 * PCLMULQDQ
2843    cycles for 100 * MultPclmulqdq2
4357    cycles for 100 * doMul (aw27)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2172    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2855    cycles for 100 * MultPclmulqdq2
4351    cycles for 100 * doMul (aw27)

915     cycles for 100 * MultQQ
2113    cycles for 100 * Multiply64by64 (Rui)
2167    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4363    cycles for 100 * doMul (aw27)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980

Note that some of the slower algos produce a 128-bit result.

Title: Re: Multiply two QWORDs
Post by: aw27 on September 18, 2018, 09:39:05 PM

JJ,

As always, you are a scam artist. Where are you testing the multiplication of 2 QWORDS?
Are these QWORDS?
SrcQ1   dq 1234567890
SrcQ2   dq 9000000082

Yes, as much as these are QWORDS:
SrcQ1   dq 1
SrcQ2   dq 2

Did you ever went to school? If so, did you try to deceive your teachers with tricks like these and succeed?

Title: Re: Multiply two QWORDs
Post by: nidud on September 18, 2018, 09:56:13 PM

deleted

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 18, 2018, 10:49:36 PM

Thanks, Nidud. Your muld is practically my MulQQ, only that I pass the source QWORDs as pointers while you pass them as QWORDs. I had tested that before, and it was a little bit slower. Results:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

911     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2183    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3894    cycles for 100 * mulq_sse (Nidud)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2175    cycles for 100 * u64_mul (Rui)
1215    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4355    cycles for 100 * doMul (aw27)
966     cycles for 100 * muld (Nidud)
3893    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2127    cycles for 100 * Multiply64by64 (Rui)
2170    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4362    cycles for 100 * doMul (aw27)
968     cycles for 100 * muld (Nidud)
3899    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2123    cycles for 100 * Multiply64by64 (Rui)
2168    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4354    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3913    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892 < PCLMULQDQ carry-less, OK for random number but not suitable as "normal" mult
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096   <<< little problem here

The FPU version is a lot slower, and chokes if the result goes beyond the QWORD range.

Title: Re: Multiply two QWORDs
Post by: LiaoMi on September 19, 2018, 12:08:16 AM

Code Select

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

878     cycles for 100 * MultQQ
1695    cycles for 100 * Multiply64by64 (Rui)
1850    cycles for 100 * u64_mul (Rui)
515     cycles for 100 * PCLMULQDQ
2320    cycles for 100 * MultPclmulqdq2
3976    cycles for 100 * doMul (aw27)
879     cycles for 100 * muld (Nidud)
3668    cycles for 100 * mulq_sse (Nidud)

868     cycles for 100 * MultQQ
1859    cycles for 100 * Multiply64by64 (Rui)
2035    cycles for 100 * u64_mul (Rui)
546     cycles for 100 * PCLMULQDQ
2134    cycles for 100 * MultPclmulqdq2
4049    cycles for 100 * doMul (aw27)
956     cycles for 100 * muld (Nidud)
3637    cycles for 100 * mulq_sse (Nidud)

922     cycles for 100 * MultQQ
1729    cycles for 100 * Multiply64by64 (Rui)
1849    cycles for 100 * u64_mul (Rui)
537     cycles for 100 * PCLMULQDQ
2103    cycles for 100 * MultPclmulqdq2
4215    cycles for 100 * doMul (aw27)
885     cycles for 100 * muld (Nidud)
3635    cycles for 100 * mulq_sse (Nidud)

884     cycles for 100 * MultQQ
1901    cycles for 100 * Multiply64by64 (Rui)
1853    cycles for 100 * u64_mul (Rui)
531     cycles for 100 * PCLMULQDQ
2127    cycles for 100 * MultPclmulqdq2
3960    cycles for 100 * doMul (aw27)
981     cycles for 100 * muld (Nidud)
3643    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---

Title: Re: Multiply two QWORDs
Post by: hutch-- on September 19, 2018, 12:10:57 AM

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

1159 cycles for 100 * MultQQ
2008 cycles for 100 * Multiply64by64 (Rui)
2190 cycles for 100 * u64_mul (Rui)
665 cycles for 100 * PCLMULQDQ
2458 cycles for 100 * MultPclmulqdq2
4529 cycles for 100 * doMul (aw27)
1006 cycles for 100 * muld (Nidud)
4222 cycles for 100 * mulq_sse (Nidud)

994 cycles for 100 * MultQQ
1975 cycles for 100 * Multiply64by64 (Rui)
2115 cycles for 100 * u64_mul (Rui)
722 cycles for 100 * PCLMULQDQ
2478 cycles for 100 * MultPclmulqdq2
4550 cycles for 100 * doMul (aw27)
1004 cycles for 100 * muld (Nidud)
4209 cycles for 100 * mulq_sse (Nidud)

1000 cycles for 100 * MultQQ
1971 cycles for 100 * Multiply64by64 (Rui)
2109 cycles for 100 * u64_mul (Rui)
601 cycles for 100 * PCLMULQDQ
2639 cycles for 100 * MultPclmulqdq2
4493 cycles for 100 * doMul (aw27)
1003 cycles for 100 * muld (Nidud)
4229 cycles for 100 * mulq_sse (Nidud)

998 cycles for 100 * MultQQ
1972 cycles for 100 * Multiply64by64 (Rui)
2098 cycles for 100 * u64_mul (Rui)
598 cycles for 100 * PCLMULQDQ
2443 cycles for 100 * MultPclmulqdq2
5134 cycles for 100 * doMul (aw27)
1005 cycles for 100 * muld (Nidud)
4219 cycles for 100 * mulq_sse (Nidud)

62 bytes for MultQQ
108 bytes for Multiply64by64 (Rui)
126 bytes for u64_mul (Rui)
46 bytes for PCLMULQDQ
45 bytes for MultPclmulqdq2
253 bytes for doMul (aw27)
68 bytes for muld (Nidud)
112 bytes for mulq_sse (Nidud)

DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 10801413638766757892
DestQ 10801413638766757892
DestQ 11111111111234566980
DestQ 11111111111234566980
DestQ 506253686751116096

--- ok ---

Title: Re: Multiply two QWORDs
Post by: nidud on September 19, 2018, 12:41:19 AM

deleted

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 02:12:16 AM

Hi Jocehn
works fine :t
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v3 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v3 ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP MultQQ ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP doMul ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP muld ---***
**************** E N D ****************

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 03:02:12 AM

@Nidud: Yes, the new SSE version does not produce the expected result, so I added _mul128 instead. I also added a print of the high qword for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

914     cycles for 100 * MultQQ
2111    cycles for 100 * Multiply64by64 (Rui)
2159    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2832    cycles for 100 * MultPclmulqdq2
4267    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2306    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2163    cycles for 100 * u64_mul (Rui)
1209    cycles for 100 * PCLMULQDQ
2829    cycles for 100 * MultPclmulqdq2
4266    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2300    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2120    cycles for 100 * Multiply64by64 (Rui)
2166    cycles for 100 * u64_mul (Rui)
1211    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4272    cycles for 100 * doMul (aw27)
964     cycles for 100 * muld (Nidud)
2296    cycles for 100 * _mul128 (Nidud)

926     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2169    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4273    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2287    cycles for 100 * _mul128 (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
52      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
146     bytes for _mul128 (Nidud)

MultQQ                 6760860027809745732
Multiply64by64 (Rui)   6760860027809745732  - high QWORD: 1728378107
u64_mul (Rui)          6760860027809745732  - high QWORD: 1728378107
PCLMULQDQ              7817399311675693060
MultPclmulqdq2         7817399311675693060
doMul (aw27)           6760860027809745732  - high QWORD: 1728378107
muld (Nidud)           6760860027809745732
_mul128 (Nidud)        6760860027809745732

Title: Re: Multiply two QWORDs
Post by: aw27 on September 19, 2018, 03:26:07 AM

This is a True Masm (TM) 64-bit version of mult64to128, results are 128-bits as expected not crippled to 64-bits, even using floating point, as I have seen so far, or using incorrect carryless functions.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

64-bit True Masm (TM), 128-bit Results
978 cycles for 100 * mult64to128 (AW)
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800

:badgrin:

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 03:27:33 AM

>>> ...for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
versions 3 ( _v3 ) should be much faster... ( than quick! )

This «u64_mul (Rui)» was not written by me, i got it... dont remember who wrote it.

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 03:59:03 AM

Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 05:02:06 AM

Quote from: jj2007 on September 19, 2018, 03:59:03 AM
Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

Is in my reply #19 .inc file

Jochen, please, replace (i replaced that file in the reply #19)

ret
Multiply64_64_v3 endp

by

ret 20
Multiply64_64_v3 endp

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 06:06:59 AM

There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 06:12:11 AM

Quote from: jj2007 on September 19, 2018, 06:06:59 AM
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.

i replaced just now, file MulQQbyQQ.zip, inside you have Multiply64by64.inc and inside you have that procedure.

There are 6:
Multiply64by64_v1 <<<--- uses pointers
Multiply64by64_v2
Multiply64by64_v3

Multiply64_64_v1 <<<--- uses value ( qwords )
Multiply64_64_v2
Multiply64_64_v3

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 06:42:17 AM

Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.

Title: Re: Multiply two QWORDs
Post by: nidud on September 19, 2018, 07:04:13 AM

deleted

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 07:42:38 AM

Quote from: jj2007 on September 19, 2018, 06:42:17 AM
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.

Jochen, Hutch will say to me why i dont zip it and post !
Remember that we need to extract the algos you post from archives.
Is there any other way when we have a lot of files ?
If we zip is because we zip; if we dont zip is because we dont zip. :biggrin:

Code Select


IdxX0   equ 0
IdxX1   equ 4

IdxY0   equ 0
IdxY1   equ 4

IdxZ0   equ 0
IdxZ1   equ 4
IdxZ2   equ 8
IdxZ3   equ 12

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
Multiply64by64_v3       proc     pX64:DWORD, pY64:DWORD, pZ:DWORD
                        push     ebx
                     push     esi

                     mov      ebx, [esp+12]           ;pX64
                     mov      esi, [esp+16]           ;pY64
                     mov      ecx, [esp+20]           ;pZ

                  ; ----------------------
                  ;    IdxX0*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY0]
                     mov      [ecx+IdxZ0], eax
                     mov      [ecx+IdxZ1], edx

                  ; ----------------------
                  ;   IdxX0*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ2], edx

                      ; ----------------------
                  ;   IdxX1*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY0]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], edx

                  ; ----------------------
                  ;   IdxX1*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ3], edx

                     pop      esi
                     pop      ebx
                     ret      12
Multiply64by64_v3       endp
;««««««««««««««««««««««««««««««««««««««««««««««««««««
Multiply64_64_v3        proc    X:QWORD, Y:QWORD, pZ:DWORD
                        ; -------------
                        ;   pointers
                        ; -------------
                        mov     ecx, [esp+20]           ;pZ
                    
                        ; ----------------------
                        ;     IdxX0*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                        mov     [ecx+IdxZ0], eax
                        mov     [ecx+IdxZ1], edx

                        ; ----------------------
                        ;     IdxX0*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]
                    
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
                        adc     edx, 0
                        mov     [ecx+IdxZ2], edx
                    
                        ; ----------------------
                        ;     IdxX1*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                    
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
                        adc     edx, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], edx
                    
                        ; ----------------------
                        ;     IdxX1*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]
                    
                        add     eax, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], eax
                        ;
                        adc     edx, 0
                        mov     [ecx+IdxZ3], edx

                        ret     20
Multiply64_64_v3        endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 07:55:07 AM

Good job, Rui :t

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1192    cycles for 100 * MultQQ
1843    cycles for 100 * Multiply64_64_v3v(Rui)
2168    cycles for 100 * u64_mul (Rui)
1215    cycles for 100 * PCLMULQDQ
2830    cycles for 100 * MultPclmulqdq2
4281    cycles for 100 * doMul (aw27)
963     cycles for 100 * muld (Nidud)
2304    cycles for 100 * _mul128 (Nidud)

913     cycles for 100 * MultQQ
1846    cycles for 100 * Multiply64_64_v3v(Rui)
2163    cycles for 100 * u64_mul (Rui)
1208    cycles for 100 * PCLMULQDQ
2827    cycles for 100 * MultPclmulqdq2
4270    cycles for 100 * doMul (aw27)
964     cycles for 100 * muld (Nidud)
2301    cycles for 100 * _mul128 (Nidud)

912     cycles for 100 * MultQQ
1840    cycles for 100 * Multiply64_64_v3v(Rui)
2166    cycles for 100 * u64_mul (Rui)
1210    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4288    cycles for 100 * doMul (aw27)
968     cycles for 100 * muld (Nidud)
2297    cycles for 100 * _mul128 (Nidud)

909     cycles for 100 * MultQQ
1850    cycles for 100 * Multiply64_64_v3v(Rui)
2173    cycles for 100 * u64_mul (Rui)
1208    cycles for 100 * PCLMULQDQ
2831    cycles for 100 * MultPclmulqdq2
4326    cycles for 100 * doMul (aw27)
957     cycles for 100 * muld (Nidud)
2316    cycles for 100 * _mul128 (Nidud)

62      bytes for MultQQ
194     bytes for Multiply64_64_v3v(Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
52      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
146     bytes for _mul128 (Nidud)

MultQQ                 6760860027809745732
Multiply64_64_v3v(Rui) 6760860027809745732  - high QWORD: 1728378107
u64_mul (Rui)          6760860027809745732  - high QWORD: 1728378107
PCLMULQDQ              7817399311675693060
MultPclmulqdq2         7817399311675693060
doMul (aw27)           6760860027809745732  - high QWORD: 1728378107
muld (Nidud)           6760860027809745732
_mul128 (Nidud)        6760860027809745732

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 08:18:14 AM

Hello my strong friend, :t

Title: Re: Multiply two QWORDs
Post by: aw27 on September 19, 2018, 03:39:16 PM

Quote from: nidud on September 19, 2018, 07:04:13 AM

mov rax,0x1122334455667788
mov rcx,0x99aabbccddeeff00
mul rcx
printf("%#I64x%I64x\n", rdx, rax)

Amazing serendipity after countless tries and fails! Congratulations! :t

Title: Re: Multiply two QWORDs
Post by: nidud on September 20, 2018, 06:57:40 AM

deleted

Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 20, 2018, 07:22:07 AM

Hi nidud,
you are using the old code. This is the last code i posted tested in reply #30

Code Select


OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
Multiply64by64_v3       proc     pX64:DWORD, pY64:DWORD, pZ:DWORD
                        push     ebx
                     push     esi

                     mov      ebx, [esp+12]           ;pX64
                     mov      esi, [esp+16]           ;pY64
                     mov      ecx, [esp+20]           ;pZ

                  ; ----------------------
                  ;    IdxX0*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY0]
                     mov      [ecx+IdxZ0], eax
                     mov      [ecx+IdxZ1], edx

                  ; ----------------------
                  ;   IdxX0*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ2], edx

                      ; ----------------------
                  ;   IdxX1*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY0]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], edx

                  ; ----------------------
                  ;   IdxX1*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ3], edx

                     pop      esi
                     pop      ebx
                     ret      12
Multiply64by64_v3       endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef

Title: Re: Multiply two QWORDs
Post by: nidud on September 20, 2018, 07:48:49 AM

deleted

Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 04:39:00 PM

Back into action :badgrin:
This 32-bit version (*) is the fastest of those that work (i.e. those that produce 128-bit results) and within those that work is the only one whose author is known (the other have been copied and pasted from some unknown excavated ditch).

Code Select


.486
.model flat, stdcall

includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword
includelib \masm32\lib\msvcrt.lib
printf proto C :ptr, :vararg

twoQwords struc
	q1l dword 0
	q1h dword 0
	q2l dword 0
	q2h dword 0
twoQwords ends

.data
qword1 qword 0F122334455667788h
qword2 qword 0F9AABBCCDDEEFF00h
result twoQwords <>
valuePrint db "val1=0x%llx",10,"val2=0x%llx",10,"val1*val2=0x%llx%llx",10,0

.code

OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
firstL EQU [esp+12]
firstH EQU [esp+16]
secL EQU [esp+20]
secH EQU [esp+24]
res EQU [esp+28]
multifast64x64_128 proc arg1:qword, arg2 :qword, arg3:ptr
		push ebx
		push esi

		mov eax, firstL
		mul dword ptr secL
		mov ecx, res
		mov (twoQwords PTR [ecx]).q1l, eax
		mov ebx, edx
		
		mov eax, firstH
		mul dword ptr secL
		add eax, ebx
		adc	edx,0
		mov ebx, eax
		mov esi, edx
		
		mov eax, firstL
		mul dword ptr secH
		add eax, ebx
		
		mov (twoQwords PTR [ecx]).q1h, eax
		adc edx, 0 
		mov ebx, edx
		
		mov eax, firstH
		mul dword ptr secH
		add eax, esi
		adc edx,0
		add eax, ebx
		adc edx,0
		
		mov (twoQwords PTR [ecx]).q2l, eax
		mov (twoQwords PTR [ecx]).q2h, edx

		pop esi
		pop ebx
		ret 20
multifast64x64_128 endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef 

main proc
	invoke multifast64x64_128, qword1, qword2, addr result
	lea esi, result
	invoke printf, addr valuePrint, qword1, qword2, qword ptr [esi+8], qword ptr [esi]	
	
	invoke ExitProcess,0
main endp

end

val1=0xf122334455667788
val2=0xf9aabbccddeeff00
val1*val2=0xeb2b15787630c963479983e499807800

(*) Guaranteed True Masm _(TM) code, no BASIC or C imitations.

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 20, 2018, 05:14:22 PM

Quote from: AW on September 20, 2018, 04:39:00 PMThis 32-bit version (*) is the fastest of those that work (i.e. those that produce 128-bit results)

I know one shouldn't feed the troll but for others who follow this thread a reminder of the original post:

Quote from: jj2007 on September 15, 2018, 11:11:11 PM
This pops up sometimes with random number generation: How to multiply two QWORDs?

There might be an exotic case where a 128-bit integer is useful, but almost nobody needs that, therefore the algos that produce QWORD output are perfectly valid for this purpose (and they work also for multiplying a dq 18446744073709551615, except PCLMULQDQ).

Even the "distorted" carry-less PCLMULQDQ is valid for random number generation. And of course, mul rcx also produces a qword, but we are talking 32-bit code here.

Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 05:28:59 PM

Quote
There might be an exotic case where a 128-bit integer is useful, but almost nobody needs that

Everything you can't manage is exotic. One thing I am sure, nobody needs the crapware you drop here.

Quote
And of course, mul rcx also produces a qword,

It does not (do you need a link to learn about mul?), that's why I congratulated nidud. Sort of Colombo egg, he found it by serendipity after many unsuccessful tries.

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 20, 2018, 06:14:51 PM

Dear José,

In the interest of everybody, please take your pills regularly :icon14:

Quote from: AW on September 18, 2018, 09:39:05 PM
As always, you are a scam artist.

Quote from: AW on September 19, 2018, 03:39:16 PMAmazing serendipity after countless tries and fails! Congratulations! :t

Quote from: AW on September 20, 2018, 05:28:59 PM
Everything you can't manage is exotic. One thing I am sure, nobody needs the crapware you drop here.

Title: Re: Multiply two QWORDs
Post by: hutch-- on September 20, 2018, 06:20:51 PM

Here is a quick play with mulx. This is 64 bit code but apparently it also works in 32 bit but only with 32 bit variables.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include64\masm64rt.inc

.code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

USING r12,r13,r14,r15

SaveRegs

mov r12, 0
mov r13, 3
mov rdx, 2000000000000000000 ; 19 digits

mulx rdx,r12,r13 ; mul rdx x r13

conout " r12 = ",str$(r12),lf
conout " r13 = ",str$(r13),lf
conout " rdx = ",str$(rdx),lf

waitkey
RestoreRegs
.exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end

; Description
;
; Performs an unsigned multiplication of the implicit source operand (EDX/RDX)
; and the specified source operand (the third operand) and stores the low half
; of the result in the second destination (second operand), the high half
; of the result in the first destination operand (first operand), without reading
; or writing the arithmetic flags.
;
; This enables efficient programming where the software can interleave add with
; carry operations and multiplications.
;
; If the first and second operand are identical, it will contain the high half
; of the multiplication result.
;
; This instruction is not supported in real mode and virtual-8086 mode. The operand
; size is always 32 bits if not in 64-bit mode. In 64-bit mode operand size 64
; requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An
; attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.

Title: Re: Multiply two QWORDs
Post by: jj2007 on September 20, 2018, 07:37:04 PM

Quote from: hutch-- on September 20, 2018, 06:20:51 PM
Here is a quick play with mulx. This is 64 bit code but apparently it also works in 32 bit but only with 32 bit variables.

Looks interesting, and mulx eax, ebx, ecx builds fine but causes an illegal instruction exception on my trusty old Core i5 :(

Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 07:52:09 PM

Quote from: jj2007 on September 20, 2018, 07:37:04 PM
Looks interesting,

You mean its exotic, let me correct:
There might be an exotic case where a mulx is useful, but almost nobody needs that, therefore the algos that use mul are perfectly valid for this purpose

Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 09:00:39 PM

Using the new mulx instructions on the above 32-bit code we can save at least 4 instructions giving a new total 26 instructions.
It is necessary the .xmm directive with this instruction, something a bit exotic would say someone from this forum.

Code Select


.686p
.xmm
.model flat, stdcall

includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword
includelib \masm32\lib\msvcrt.lib
printf proto C :ptr, :vararg

twoQwords struc
	q1l dword 0
	q1h dword 0
	q2l dword 0
	q2h dword 0
twoQwords ends

.data
qword1 qword 0F122334455667788h
qword2 qword 0F9AABBCCDDEEFF00h
result twoQwords <>
valuePrint db "val1=0x%llx",10,"val2=0x%llx",10,"val1*val2=0x%llx%llx",10,0

.code

OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
firstL EQU [esp+12]
firstH EQU [esp+16]
secL EQU [esp+20]
secH EQU [esp+24]
res EQU [esp+28]
multifast64x64_128 proc arg1:qword, arg2 :qword, arg3:ptr
		push ebx
		push esi

		mov edx, firstL
		mulx ebx, eax, dword ptr secL
		mov ecx, res
		mov (twoQwords PTR [ecx]).q1l, eax

		mov edx, firstH
		mulx esi, eax, dword ptr secL
		add ebx, eax

		mov edx, firstL
		mulx edx, eax, dword ptr secH 
		add eax, ebx	
		mov (twoQwords PTR [ecx]).q1h, eax	
		adc edx, 0
		mov ebx, edx	

		mov edx, firstH
		mulx edx, eax, dword ptr secH 	
		add eax, esi
		adc edx,0
		add eax, ebx
		adc edx,0
		
		mov (twoQwords PTR [ecx]).q2l, eax
		mov (twoQwords PTR [ecx]).q2h, edx

		pop esi
		pop ebx
		ret 20
multifast64x64_128 endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef 

main proc
	invoke multifast64x64_128, qword1, qword2, addr result
	lea esi, result
	invoke printf, addr valuePrint, qword1, qword2, qword ptr [esi+8], qword ptr [esi]	
	
	invoke ExitProcess,0
main endp

end

Some less young (but trusty) computers will refuse to run the mulx instruction.

The MASM Forum

General => The Workshop => Topic started by: jj2007 on September 15, 2018, 11:11:11 PM