The MASM Forum

General => The Workshop => Topic started by: jj2007 on September 15, 2018, 11:11:11 PM

Title: Multiply two QWORDs
Post by: jj2007 on September 15, 2018, 11:11:11 PM
This pops up sometimes with random number generation: How to multiply two QWORDs?

Microsoft offers a bunch of dedicated functions with strange names such as ...
UnsignedMultiply128
Multiply128
__mul128

... but I got none of them working, the compiler always complains about intrinsics not supported etc, the usual C/C++ mess.

But I managed to put up a little 36 bytes routine that does the job (attached, pure MASM32). Problem: It's slow. There is a much faster alternative:
fild a64
fild b64
fmul
fistp resq


Great, over 4x faster, but if the result exceeds a QWORD, it spits out meaningless results and becomes very, very slow (due to the FPU's error handling).

So I spent some time looking for a SIMD solution, and I found PMULUDQ. Looks great but it's DWORDs, not QWORDs :(

Any bright ideas...?
Title: Re: Multiply two QWORDs
Post by: Biterider on September 16, 2018, 12:44:57 AM
Hi JJ
I found this old macro

; Macro:      qdmul
; Purpose:    Multiply a QWORD by a DWORD.
; Arguments:  ecx::edx::eax = edx::eax * ebx.

qdmul macro
    push eax
    mov eax, edx
    mul ebx
    mov ecx, edx
    pop edx
    xchg eax, edx
    push edx
    mul ebx
    pop ebx
    add edx, ebx
    adc ecx, 0
endm


You can use it, splitting the multiplication in 4 steps:
1. multiply be the high 32 bit using the macro
2. shift the result by 32 ecx::edx::eax --> edi::ecx::edx::eax
3. multiply be the low 32 bit using the macro ecx::edx::eax
4. add the result of step 2 & 3

Biterider
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 16, 2018, 01:48:51 AM
Thanks, will have a look. It doesn't look any simpler, though.
Title: Re: Multiply two QWORDs
Post by: hutch-- on September 16, 2018, 02:19:03 AM
Don't know if its fast but the results are correct. LATER : Cleaned up.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL quad1 :QWORD
    LOCAL quad2 :QWORD

    mov quad1, 100
    mov quad2, 100

    cvtsi2sd xmm0, quad1
    cvtsi2sd xmm1, quad2
    mulsd xmm0, xmm1
    cvtsd2si r11, xmm0

    conout str$(r11),lf

    waitkey
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 16, 2018, 02:41:25 AM
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(
Title: Re: Multiply two QWORDs
Post by: nidud on September 16, 2018, 03:36:20 AM
deleted
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 16, 2018, 04:48:40 AM
Quote from: jj2007 on September 16, 2018, 02:41:25 AM
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(
I never used it. So, it seems that it converts a QWORD integer to Floating-Point format (QWORD also) if you are right Jochen  :biggrin: . These definitions are very ... cool !!!

LATER: i saw all CVT??? instructions for SSE2 and all converts from 32 to 64 or 64 to 32 .
           It s nice ! I think Hutch has a privileged information and he doesnt show us  ;)
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 16, 2018, 11:00:10 AM
Quote from: nidud on September 16, 2018, 03:36:20 AM
You probably have to convert each DWORD to REAL8

Thanks, but as the title says, it's QWORDs, not DWORDs.
Title: Re: Multiply two QWORDs
Post by: hutch-- on September 16, 2018, 11:10:02 AM
 :biggrin:

> I think Hutch has a privileged information and he doesnt show us.

I have I have, its the Intel instruction manual. What I did when I started on the scalar double instructions was do a search through the entire manual for anything that had an SD ending, avoided the AVX versions and made a list of what looked like it was useful for scalar maths. Its worth having a look at the packed double versions as well as there are more primitives to work with.

What you may find is that the 64 bit mulsd instruction runs out of puff at the top of the 64 bit range.

LATER : From memory there are techniques to determine overflow and underflow which you may need to have in place if you are dealing with number up near or over the range limit of 64 bit.
Title: pclmulqdq
Post by: jj2007 on September 16, 2018, 11:40:02 AM
This should do the job:
MultQQ3 proc pQ1, pQ2
  mov eax, [esp+4] ; pQ1
  mov edx, [esp+8] ; pQ2
  movlps xmm0, QWORD ptr [eax]
  movlps xmm1, QWORD ptr [edx]
  pclmulqdq xmm0, xmm1, 0
  retn 2*4
MultQQ3 endp


Intel: (https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf)
QuotePCLMULQDQ Instruction Definition
PCLMULQDQ instruction performs carry-less multiplication of two 64-bit quadwords
which are selected from the first and the second operands according to the immediate
byte value.
Instruction format: PCLMULQDQ xmm1, xmm2/m128, imm8
Description: Carry-less multiplication of one quadword (8 bytes) of xmm1
by one quadword (8 bytes) of xmm2/m128, returning a double quadword
(16 bytes). The immediate byte is used for determining which quadwords of
xmm1 and xmm2/m128 should be used.

The instruction has an own 'chapter' in the Oracle manual. (https://docs.oracle.com/cd/E36784_01/html/E36859/gnygo.html)

Problem: It doesn't work :(

Testcase attached, plain Masm32. It calculates 4000000000*2222222222, no problem for the FPU but pclmulqdq does something else. More precisely, it does carry-less multiplication (https://en.wikipedia.org/wiki/Carry-less_product#Example).

Note there are variants like pclmullqlqdq xmm0, xmm1 (https://en.wikipedia.org/wiki/CLMUL_instruction_set#New_instructions) ("Multiply the low halves of the two registers") which do not require the immediate operand; they are just pclmulqdq with the immediate bits embedded.
Title: Re: Multiply two QWORDs
Post by: aw27 on September 17, 2018, 02:11:02 AM
Carry-less means that the carry is discarded. I mentioned this instruction here (http://masm32.com/board/index.php?topic=7395.0). So you will never achieve what you want.
You need a Big Integer library or play with floating points to obtain an approximation.
Title: Re: Multiply two QWORDs
Post by: aw27 on September 18, 2018, 02:24:31 AM
A little convoluted way but it works.


val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 18, 2018, 05:48:46 PM
Thanks for all the suggestions :t

I have put together a little testbed, and here are the timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1190    cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2171    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)

913     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2182    cycles for 100 * u64_mul (Rui)
1212    cycles for 100 * PCLMULQDQ
2843    cycles for 100 * MultPclmulqdq2
4357    cycles for 100 * doMul (aw27)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2172    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2855    cycles for 100 * MultPclmulqdq2
4351    cycles for 100 * doMul (aw27)

915     cycles for 100 * MultQQ
2113    cycles for 100 * Multiply64by64 (Rui)
2167    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4363    cycles for 100 * doMul (aw27)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980


Note that some of the slower algos produce a 128-bit result.
Title: Re: Multiply two QWORDs
Post by: aw27 on September 18, 2018, 09:39:05 PM
JJ,

As always, you are a scam artist. Where are you testing the multiplication of 2 QWORDS?
Are these QWORDS?
SrcQ1   dq 1234567890
SrcQ2   dq 9000000082

Yes, as much as these are QWORDS:
SrcQ1   dq 1
SrcQ2   dq 2

Did you ever went to school? If so, did you try to deceive your teachers with tricks like these and succeed?
Title: Re: Multiply two QWORDs
Post by: nidud on September 18, 2018, 09:56:13 PM
deleted
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 18, 2018, 10:49:36 PM
Thanks, Nidud. Your muld is practically my MulQQ, only that I pass the source QWORDs as pointers while you pass them as QWORDs. I had tested that before, and it was a little bit slower. Results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

911     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2183    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3894    cycles for 100 * mulq_sse (Nidud)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2175    cycles for 100 * u64_mul (Rui)
1215    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4355    cycles for 100 * doMul (aw27)
966     cycles for 100 * muld (Nidud)
3893    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2127    cycles for 100 * Multiply64by64 (Rui)
2170    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4362    cycles for 100 * doMul (aw27)
968     cycles for 100 * muld (Nidud)
3899    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2123    cycles for 100 * Multiply64by64 (Rui)
2168    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4354    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3913    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892 < PCLMULQDQ carry-less, OK for random number but not suitable as "normal" mult
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096   <<< little problem here


The FPU version is a lot slower, and chokes if the result goes beyond the QWORD range.
Title: Re: Multiply two QWORDs
Post by: LiaoMi on September 19, 2018, 12:08:16 AM
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

878     cycles for 100 * MultQQ
1695    cycles for 100 * Multiply64by64 (Rui)
1850    cycles for 100 * u64_mul (Rui)
515     cycles for 100 * PCLMULQDQ
2320    cycles for 100 * MultPclmulqdq2
3976    cycles for 100 * doMul (aw27)
879     cycles for 100 * muld (Nidud)
3668    cycles for 100 * mulq_sse (Nidud)

868     cycles for 100 * MultQQ
1859    cycles for 100 * Multiply64by64 (Rui)
2035    cycles for 100 * u64_mul (Rui)
546     cycles for 100 * PCLMULQDQ
2134    cycles for 100 * MultPclmulqdq2
4049    cycles for 100 * doMul (aw27)
956     cycles for 100 * muld (Nidud)
3637    cycles for 100 * mulq_sse (Nidud)

922     cycles for 100 * MultQQ
1729    cycles for 100 * Multiply64by64 (Rui)
1849    cycles for 100 * u64_mul (Rui)
537     cycles for 100 * PCLMULQDQ
2103    cycles for 100 * MultPclmulqdq2
4215    cycles for 100 * doMul (aw27)
885     cycles for 100 * muld (Nidud)
3635    cycles for 100 * mulq_sse (Nidud)

884     cycles for 100 * MultQQ
1901    cycles for 100 * Multiply64by64 (Rui)
1853    cycles for 100 * u64_mul (Rui)
531     cycles for 100 * PCLMULQDQ
2127    cycles for 100 * MultPclmulqdq2
3960    cycles for 100 * doMul (aw27)
981     cycles for 100 * muld (Nidud)
3643    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---
Title: Re: Multiply two QWORDs
Post by: hutch-- on September 19, 2018, 12:10:57 AM

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

1159    cycles for 100 * MultQQ
2008    cycles for 100 * Multiply64by64 (Rui)
2190    cycles for 100 * u64_mul (Rui)
665     cycles for 100 * PCLMULQDQ
2458    cycles for 100 * MultPclmulqdq2
4529    cycles for 100 * doMul (aw27)
1006    cycles for 100 * muld (Nidud)
4222    cycles for 100 * mulq_sse (Nidud)

994     cycles for 100 * MultQQ
1975    cycles for 100 * Multiply64by64 (Rui)
2115    cycles for 100 * u64_mul (Rui)
722     cycles for 100 * PCLMULQDQ
2478    cycles for 100 * MultPclmulqdq2
4550    cycles for 100 * doMul (aw27)
1004    cycles for 100 * muld (Nidud)
4209    cycles for 100 * mulq_sse (Nidud)

1000    cycles for 100 * MultQQ
1971    cycles for 100 * Multiply64by64 (Rui)
2109    cycles for 100 * u64_mul (Rui)
601     cycles for 100 * PCLMULQDQ
2639    cycles for 100 * MultPclmulqdq2
4493    cycles for 100 * doMul (aw27)
1003    cycles for 100 * muld (Nidud)
4229    cycles for 100 * mulq_sse (Nidud)

998     cycles for 100 * MultQQ
1972    cycles for 100 * Multiply64by64 (Rui)
2098    cycles for 100 * u64_mul (Rui)
598     cycles for 100 * PCLMULQDQ
2443    cycles for 100 * MultPclmulqdq2
5134    cycles for 100 * doMul (aw27)
1005    cycles for 100 * muld (Nidud)
4219    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---
Title: Re: Multiply two QWORDs
Post by: nidud on September 19, 2018, 12:41:19 AM
deleted
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 02:12:16 AM
Hi Jocehn
              works fine :t
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v3 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP  Multiply64_64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v3 ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP MultQQ ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP doMul ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP muld ---***
**************** E N D ****************
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 03:02:12 AM
@Nidud: Yes, the new SSE version does not produce the expected result, so I added _mul128 instead. I also added a print of the high qword for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

914     cycles for 100 * MultQQ
2111    cycles for 100 * Multiply64by64 (Rui)
2159    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2832    cycles for 100 * MultPclmulqdq2
4267    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2306    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2163    cycles for 100 * u64_mul (Rui)
1209    cycles for 100 * PCLMULQDQ
2829    cycles for 100 * MultPclmulqdq2
4266    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2300    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2120    cycles for 100 * Multiply64by64 (Rui)
2166    cycles for 100 * u64_mul (Rui)
1211    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4272    cycles for 100 * doMul (aw27)
964     cycles for 100 * muld (Nidud)
2296    cycles for 100 * _mul128 (Nidud)

926     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2169    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4273    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2287    cycles for 100 * _mul128 (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
52      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
146     bytes for _mul128 (Nidud)

MultQQ                 6760860027809745732
Multiply64by64 (Rui)   6760860027809745732  - high QWORD: 1728378107
u64_mul (Rui)          6760860027809745732  - high QWORD: 1728378107
PCLMULQDQ              7817399311675693060
MultPclmulqdq2         7817399311675693060
doMul (aw27)           6760860027809745732  - high QWORD: 1728378107
muld (Nidud)           6760860027809745732
_mul128 (Nidud)        6760860027809745732
Title: Re: Multiply two QWORDs
Post by: aw27 on September 19, 2018, 03:26:07 AM
This is a True Masm (TM) 64-bit version of mult64to128, results are 128-bits as expected not crippled to 64-bits, even using floating point, as I have seen so far, or using incorrect carryless functions.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

64-bit True Masm (TM), 128-bit Results
978 cycles for 100 * mult64to128 (AW)
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800

:badgrin:
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 03:27:33 AM
>>> ...for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
           versions 3 ( _v3 ) should be much faster... ( than quick! )


          This «u64_mul (Rui)» was not written by me, i got it... dont remember who wrote it.
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 03:59:03 AM
Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
           versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 05:02:06 AM
Quote from: jj2007 on September 19, 2018, 03:59:03 AM
Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
           versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

Is in my reply #19 .inc file

Jochen, please, replace (i replaced that file in the reply #19)

                                  ret
Multiply64_64_v3     endp

by

                                 ret     20
Multiply64_64_v3     endp
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 06:06:59 AM
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 06:12:11 AM
Quote from: jj2007 on September 19, 2018, 06:06:59 AM
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.
i replaced just now, file MulQQbyQQ.zip, inside you have Multiply64by64.inc and inside you have that procedure.

There are 6:
                       Multiply64by64_v1   <<<--- uses pointers
                       Multiply64by64_v2
                       Multiply64by64_v3

Multiply64_64_v1   <<<--- uses value ( qwords )
Multiply64_64_v2
Multiply64_64_v3
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 06:42:17 AM
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.
Title: Re: Multiply two QWORDs
Post by: nidud on September 19, 2018, 07:04:13 AM
deleted
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 07:42:38 AM
Quote from: jj2007 on September 19, 2018, 06:42:17 AM
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.
Jochen, Hutch will say to me why i dont zip it and post !
Remember  that we need to extract the algos you post from archives.
Is there any other way when we have a lot of files ?
If we zip is because we zip; if we dont zip is because we dont zip.  :biggrin:


IdxX0   equ 0
IdxX1   equ 4

IdxY0   equ 0
IdxY1   equ 4

IdxZ0   equ 0
IdxZ1   equ 4
IdxZ2   equ 8
IdxZ3   equ 12

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Multiply64by64_v3       proc     pX64:DWORD, pY64:DWORD, pZ:DWORD
                        push     ebx
                     push     esi

                     mov      ebx, [esp+12]           ;pX64
                     mov      esi, [esp+16]           ;pY64
                     mov      ecx, [esp+20]           ;pZ

                  ; ----------------------
                  ;    IdxX0*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY0]
                     mov      [ecx+IdxZ0], eax
                     mov      [ecx+IdxZ1], edx

                  ; ----------------------
                  ;   IdxX0*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ2], edx

                      ; ----------------------
                  ;   IdxX1*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY0]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], edx

                  ; ----------------------
                  ;   IdxX1*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ3], edx

                     pop      esi
                     pop      ebx
                     ret      12
Multiply64by64_v3       endp
;««««««««««««««««««««««««««««««««««««««««««««««««««««
Multiply64_64_v3        proc    X:QWORD, Y:QWORD, pZ:DWORD
                        ; -------------
                        ;   pointers
                        ; -------------
                        mov     ecx, [esp+20]           ;pZ
                   
                        ; ----------------------
                        ;     IdxX0*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                        mov     [ecx+IdxZ0], eax
                        mov     [ecx+IdxZ1], edx

                        ; ----------------------
                        ;     IdxX0*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]
                   
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
                        adc     edx, 0
                        mov     [ecx+IdxZ2], edx
                   
                        ; ----------------------
                        ;     IdxX1*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                   
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
                        adc     edx, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], edx
                   
                        ; ----------------------
                        ;     IdxX1*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]
                   
                        add     eax, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], eax
                        ;
                        adc     edx, 0
                        mov     [ecx+IdxZ3], edx

                        ret     20
Multiply64_64_v3        endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 19, 2018, 07:55:07 AM
Good job, Rui :t
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1192    cycles for 100 * MultQQ
1843    cycles for 100 * Multiply64_64_v3v(Rui)
2168    cycles for 100 * u64_mul (Rui)
1215    cycles for 100 * PCLMULQDQ
2830    cycles for 100 * MultPclmulqdq2
4281    cycles for 100 * doMul (aw27)
963     cycles for 100 * muld (Nidud)
2304    cycles for 100 * _mul128 (Nidud)

913     cycles for 100 * MultQQ
1846    cycles for 100 * Multiply64_64_v3v(Rui)
2163    cycles for 100 * u64_mul (Rui)
1208    cycles for 100 * PCLMULQDQ
2827    cycles for 100 * MultPclmulqdq2
4270    cycles for 100 * doMul (aw27)
964     cycles for 100 * muld (Nidud)
2301    cycles for 100 * _mul128 (Nidud)

912     cycles for 100 * MultQQ
1840    cycles for 100 * Multiply64_64_v3v(Rui)
2166    cycles for 100 * u64_mul (Rui)
1210    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4288    cycles for 100 * doMul (aw27)
968     cycles for 100 * muld (Nidud)
2297    cycles for 100 * _mul128 (Nidud)

909     cycles for 100 * MultQQ
1850    cycles for 100 * Multiply64_64_v3v(Rui)
2173    cycles for 100 * u64_mul (Rui)
1208    cycles for 100 * PCLMULQDQ
2831    cycles for 100 * MultPclmulqdq2
4326    cycles for 100 * doMul (aw27)
957     cycles for 100 * muld (Nidud)
2316    cycles for 100 * _mul128 (Nidud)

62      bytes for MultQQ
194     bytes for Multiply64_64_v3v(Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
52      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
146     bytes for _mul128 (Nidud)

MultQQ                 6760860027809745732
Multiply64_64_v3v(Rui) 6760860027809745732  - high QWORD: 1728378107
u64_mul (Rui)          6760860027809745732  - high QWORD: 1728378107
PCLMULQDQ              7817399311675693060
MultPclmulqdq2         7817399311675693060
doMul (aw27)           6760860027809745732  - high QWORD: 1728378107
muld (Nidud)           6760860027809745732
_mul128 (Nidud)        6760860027809745732
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 19, 2018, 08:18:14 AM
Hello my strong friend, :t
Title: Re: Multiply two QWORDs
Post by: aw27 on September 19, 2018, 03:39:16 PM
Quote from: nidud on September 19, 2018, 07:04:13 AM

    mov rax,0x1122334455667788
    mov rcx,0x99aabbccddeeff00
    mul rcx
    printf("%#I64x%I64x\n", rdx, rax)


Amazing serendipity after countless tries and fails! Congratulations!  :t
Title: Re: Multiply two QWORDs
Post by: nidud on September 20, 2018, 06:57:40 AM
deleted
Title: Re: Multiply two QWORDs
Post by: RuiLoureiro on September 20, 2018, 07:22:07 AM
Hi nidud,
            you are using the old code. This is the last code i posted tested in reply #30

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Multiply64by64_v3       proc     pX64:DWORD, pY64:DWORD, pZ:DWORD
                        push     ebx
                     push     esi

                     mov      ebx, [esp+12]           ;pX64
                     mov      esi, [esp+16]           ;pY64
                     mov      ecx, [esp+20]           ;pZ

                  ; ----------------------
                  ;    IdxX0*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY0]
                     mov      [ecx+IdxZ0], eax
                     mov      [ecx+IdxZ1], edx

                  ; ----------------------
                  ;   IdxX0*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ2], edx

                      ; ----------------------
                  ;   IdxX1*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY0]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], edx

                  ; ----------------------
                  ;   IdxX1*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ3], edx

                     pop      esi
                     pop      ebx
                     ret      12
Multiply64by64_v3       endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Title: Re: Multiply two QWORDs
Post by: nidud on September 20, 2018, 07:48:49 AM
deleted
Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 04:39:00 PM
Back into action  :badgrin:
This 32-bit version (*) is the fastest of those that work (i.e. those that produce 128-bit results) and within those that work is the only one whose author is known (the other have been copied and pasted from some unknown excavated ditch).


.486
.model flat, stdcall

includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword
includelib \masm32\lib\msvcrt.lib
printf proto C :ptr, :vararg

twoQwords struc
q1l dword 0
q1h dword 0
q2l dword 0
q2h dword 0
twoQwords ends

.data
qword1 qword 0F122334455667788h
qword2 qword 0F9AABBCCDDEEFF00h
result twoQwords <>
valuePrint db "val1=0x%llx",10,"val2=0x%llx",10,"val1*val2=0x%llx%llx",10,0

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
firstL EQU [esp+12]
firstH EQU [esp+16]
secL EQU [esp+20]
secH EQU [esp+24]
res EQU [esp+28]
multifast64x64_128 proc arg1:qword, arg2 :qword, arg3:ptr
push ebx
push esi

mov eax, firstL
mul dword ptr secL
mov ecx, res
mov (twoQwords PTR [ecx]).q1l, eax
mov ebx, edx

mov eax, firstH
mul dword ptr secL
add eax, ebx
adc edx,0
mov ebx, eax
mov esi, edx

mov eax, firstL
mul dword ptr secH
add eax, ebx

mov (twoQwords PTR [ecx]).q1h, eax
adc edx, 0
mov ebx, edx

mov eax, firstH
mul dword ptr secH
add eax, esi
adc edx,0
add eax, ebx
adc edx,0

mov (twoQwords PTR [ecx]).q2l, eax
mov (twoQwords PTR [ecx]).q2h, edx

pop esi
pop ebx
ret 20
multifast64x64_128 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

main proc
invoke multifast64x64_128, qword1, qword2, addr result
lea esi, result
invoke printf, addr valuePrint, qword1, qword2, qword ptr [esi+8], qword ptr [esi]

invoke ExitProcess,0
main endp

end


val1=0xf122334455667788
val2=0xf9aabbccddeeff00
val1*val2=0xeb2b15787630c963479983e499807800


(*) Guaranteed True Masm (TM) code, no BASIC or C imitations.
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 20, 2018, 05:14:22 PM
Quote from: AW on September 20, 2018, 04:39:00 PMThis 32-bit version (*) is the fastest of those that work (i.e. those that produce 128-bit results)

I know one shouldn't feed the troll but for others who follow this thread a reminder of the original post:

Quote from: jj2007 on September 15, 2018, 11:11:11 PM
This pops up sometimes with random number generation: How to multiply two QWORDs?

There might be an exotic case where a 128-bit integer is useful, but almost nobody needs that, therefore the algos that produce QWORD output are perfectly valid for this purpose (and they work also for multiplying a dq 18446744073709551615, except PCLMULQDQ).

Even the "distorted" carry-less PCLMULQDQ is valid for random number generation. And of course, mul rcx also produces a qword, but we are talking 32-bit code here.
Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 05:28:59 PM
Quote
There might be an exotic case where a 128-bit integer is useful, but almost nobody needs that
Everything you can't manage is exotic. One thing I am sure, nobody needs the crapware you drop here.

Quote
And of course, mul rcx also produces a qword,
It does not (do you need a link to learn about mul?), that's why I congratulated nidud. Sort of Colombo egg, he found it by serendipity after many unsuccessful tries.
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 20, 2018, 06:14:51 PM
Dear José,

In the interest of everybody, please take your pills regularly :icon14:

Quote from: AW on September 18, 2018, 09:39:05 PM
As always, you are a scam artist.

Quote from: AW on September 19, 2018, 03:39:16 PMAmazing serendipity after countless tries and fails! Congratulations!  :t

Quote from: AW on September 20, 2018, 05:28:59 PM
Everything you can't manage is exotic. One thing I am sure, nobody needs the crapware you drop here.
Title: Re: Multiply two QWORDs
Post by: hutch-- on September 20, 2018, 06:20:51 PM
Here is a quick play with mulx. This is 64 bit code but apparently it also works in 32 bit but only with 32 bit variables.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    USING r12,r13,r14,r15

    SaveRegs

    mov r12, 0
    mov r13, 3
    mov rdx, 2000000000000000000    ; 19 digits

    mulx rdx,r12,r13        ; mul rdx x r13

    conout "  r12 = ",str$(r12),lf
    conout "  r13 = ",str$(r13),lf
    conout "  rdx = ",str$(rdx),lf

    waitkey
    RestoreRegs
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

; Description
;
; Performs an unsigned multiplication of the implicit source operand (EDX/RDX)
; and the specified source operand (the third operand) and stores the low half
; of the result in the second destination (second operand), the high half
; of the result in the first destination operand (first operand), without reading
; or writing the arithmetic flags.
;
; This enables efficient programming where the software can interleave add with
; carry operations and multiplications.

; If the first and second operand are identical, it will contain the high half
; of the multiplication result.
;
; This instruction is not supported in real mode and virtual-8086 mode. The operand
; size is always 32 bits if not in 64-bit mode. In 64-bit mode operand size 64
; requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An
; attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.
Title: Re: Multiply two QWORDs
Post by: jj2007 on September 20, 2018, 07:37:04 PM
Quote from: hutch-- on September 20, 2018, 06:20:51 PM
Here is a quick play with mulx. This is 64 bit code but apparently it also works in 32 bit but only with 32 bit variables.
Looks interesting, and mulx eax, ebx, ecx builds fine but causes an illegal instruction exception on my trusty old Core i5 :(
Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 07:52:09 PM
Quote from: jj2007 on September 20, 2018, 07:37:04 PM
Looks interesting,
You mean its exotic, let me correct:
There might be an exotic case where a mulx is useful, but almost nobody needs that, therefore the algos that use mul are perfectly valid for this purpose
Title: Re: Multiply two QWORDs
Post by: aw27 on September 20, 2018, 09:00:39 PM
Using the new mulx instructions on the above 32-bit code we can save at least 4 instructions giving a new total 26 instructions.
It is necessary the .xmm directive with this instruction, something a bit exotic would say someone from this forum.


.686p
.xmm
.model flat, stdcall

includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword
includelib \masm32\lib\msvcrt.lib
printf proto C :ptr, :vararg

twoQwords struc
q1l dword 0
q1h dword 0
q2l dword 0
q2h dword 0
twoQwords ends

.data
qword1 qword 0F122334455667788h
qword2 qword 0F9AABBCCDDEEFF00h
result twoQwords <>
valuePrint db "val1=0x%llx",10,"val2=0x%llx",10,"val1*val2=0x%llx%llx",10,0

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
firstL EQU [esp+12]
firstH EQU [esp+16]
secL EQU [esp+20]
secH EQU [esp+24]
res EQU [esp+28]
multifast64x64_128 proc arg1:qword, arg2 :qword, arg3:ptr
push ebx
push esi

mov edx, firstL
mulx ebx, eax, dword ptr secL
mov ecx, res
mov (twoQwords PTR [ecx]).q1l, eax

mov edx, firstH
mulx esi, eax, dword ptr secL
add ebx, eax

mov edx, firstL
mulx edx, eax, dword ptr secH
add eax, ebx
mov (twoQwords PTR [ecx]).q1h, eax
adc edx, 0
mov ebx, edx

mov edx, firstH
mulx edx, eax, dword ptr secH
add eax, esi
adc edx,0
add eax, ebx
adc edx,0

mov (twoQwords PTR [ecx]).q2l, eax
mov (twoQwords PTR [ecx]).q2h, edx

pop esi
pop ebx
ret 20
multifast64x64_128 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

main proc
invoke multifast64x64_128, qword1, qword2, addr result
lea esi, result
invoke printf, addr valuePrint, qword1, qword2, qword ptr [esi+8], qword ptr [esi]

invoke ExitProcess,0
main endp

end


Some less young (but trusty) computers will refuse to run the mulx instruction.