Multiply two QWORDs

Started by jj2007, September 15, 2018, 11:11:11 PM


jj2007

Thanks, Nidud. Your muld is practically my MultQQ, except that I pass the source QWORDs as pointers while you pass them by value. I had tested that before, and it was a little bit slower. Results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

911     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2183    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3894    cycles for 100 * mulq_sse (Nidud)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2175    cycles for 100 * u64_mul (Rui)
1215    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4355    cycles for 100 * doMul (aw27)
966     cycles for 100 * muld (Nidud)
3893    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2127    cycles for 100 * Multiply64by64 (Rui)
2170    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4362    cycles for 100 * doMul (aw27)
968     cycles for 100 * muld (Nidud)
3899    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2123    cycles for 100 * Multiply64by64 (Rui)
2168    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4354    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3913    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892 < PCLMULQDQ carry-less, OK for random number but not suitable as "normal" mult
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096   <<< little problem here


The FPU version is a lot slower, and chokes if the result goes beyond the QWORD range.
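
To illustrate why the two PCLMULQDQ rows differ from the others: a carry-less multiply combines the shifted partial products with XOR instead of addition, so no carries propagate. The following is a minimal C sketch of that idea, not the code that was timed above; `clmul_lo64` is a hypothetical helper name, and it only models the low 64 bits of the result in portable C rather than calling the instruction.

```c
#include <stdint.h>

/* Low 64 bits of a carry-less product (hypothetical helper, portable
   stand-in for the low qword that PCLMULQDQ would return).
   Each set bit of b contributes a shifted copy of a, combined with XOR
   rather than ADD, so carries never propagate between bit positions. */
static uint64_t clmul_lo64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1)
            r ^= a << i;
    }
    return r;
}
```

For example, the ordinary product 3 * 3 is 9, while clmul_lo64(3, 3) is 5: binary 11 XOR 110 is 101, and the carry that would turn it into 1001 is dropped.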

LiaoMi

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

878     cycles for 100 * MultQQ
1695    cycles for 100 * Multiply64by64 (Rui)
1850    cycles for 100 * u64_mul (Rui)
515     cycles for 100 * PCLMULQDQ
2320    cycles for 100 * MultPclmulqdq2
3976    cycles for 100 * doMul (aw27)
879     cycles for 100 * muld (Nidud)
3668    cycles for 100 * mulq_sse (Nidud)

868     cycles for 100 * MultQQ
1859    cycles for 100 * Multiply64by64 (Rui)
2035    cycles for 100 * u64_mul (Rui)
546     cycles for 100 * PCLMULQDQ
2134    cycles for 100 * MultPclmulqdq2
4049    cycles for 100 * doMul (aw27)
956     cycles for 100 * muld (Nidud)
3637    cycles for 100 * mulq_sse (Nidud)

922     cycles for 100 * MultQQ
1729    cycles for 100 * Multiply64by64 (Rui)
1849    cycles for 100 * u64_mul (Rui)
537     cycles for 100 * PCLMULQDQ
2103    cycles for 100 * MultPclmulqdq2
4215    cycles for 100 * doMul (aw27)
885     cycles for 100 * muld (Nidud)
3635    cycles for 100 * mulq_sse (Nidud)

884     cycles for 100 * MultQQ
1901    cycles for 100 * Multiply64by64 (Rui)
1853    cycles for 100 * u64_mul (Rui)
531     cycles for 100 * PCLMULQDQ
2127    cycles for 100 * MultPclmulqdq2
3960    cycles for 100 * doMul (aw27)
981     cycles for 100 * muld (Nidud)
3643    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---

hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

1159    cycles for 100 * MultQQ
2008    cycles for 100 * Multiply64by64 (Rui)
2190    cycles for 100 * u64_mul (Rui)
665     cycles for 100 * PCLMULQDQ
2458    cycles for 100 * MultPclmulqdq2
4529    cycles for 100 * doMul (aw27)
1006    cycles for 100 * muld (Nidud)
4222    cycles for 100 * mulq_sse (Nidud)

994     cycles for 100 * MultQQ
1975    cycles for 100 * Multiply64by64 (Rui)
2115    cycles for 100 * u64_mul (Rui)
722     cycles for 100 * PCLMULQDQ
2478    cycles for 100 * MultPclmulqdq2
4550    cycles for 100 * doMul (aw27)
1004    cycles for 100 * muld (Nidud)
4209    cycles for 100 * mulq_sse (Nidud)

1000    cycles for 100 * MultQQ
1971    cycles for 100 * Multiply64by64 (Rui)
2109    cycles for 100 * u64_mul (Rui)
601     cycles for 100 * PCLMULQDQ
2639    cycles for 100 * MultPclmulqdq2
4493    cycles for 100 * doMul (aw27)
1003    cycles for 100 * muld (Nidud)
4229    cycles for 100 * mulq_sse (Nidud)

998     cycles for 100 * MultQQ
1972    cycles for 100 * Multiply64by64 (Rui)
2098    cycles for 100 * u64_mul (Rui)
598     cycles for 100 * PCLMULQDQ
2443    cycles for 100 * MultPclmulqdq2
5134    cycles for 100 * doMul (aw27)
1005    cycles for 100 * muld (Nidud)
4219    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---

nidud

#18
deleted

RuiLoureiro

#19
Hi Jochen
              works fine :t
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v3 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP  Multiply64_64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v3 ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP MultQQ ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP doMul ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP muld ---***
**************** E N D ****************

jj2007

@Nidud: Yes, the new SSE version does not produce the expected result, so I added _mul128 instead. I also added a print of the high qword for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

914     cycles for 100 * MultQQ
2111    cycles for 100 * Multiply64by64 (Rui)
2159    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2832    cycles for 100 * MultPclmulqdq2
4267    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2306    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2163    cycles for 100 * u64_mul (Rui)
1209    cycles for 100 * PCLMULQDQ
2829    cycles for 100 * MultPclmulqdq2
4266    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2300    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2120    cycles for 100 * Multiply64by64 (Rui)
2166    cycles for 100 * u64_mul (Rui)
1211    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4272    cycles for 100 * doMul (aw27)
964     cycles for 100 * muld (Nidud)
2296    cycles for 100 * _mul128 (Nidud)

926     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2169    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4273    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2287    cycles for 100 * _mul128 (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
52      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
146     bytes for _mul128 (Nidud)

MultQQ                 6760860027809745732
Multiply64by64 (Rui)   6760860027809745732  - high QWORD: 1728378107
u64_mul (Rui)          6760860027809745732  - high QWORD: 1728378107
PCLMULQDQ              7817399311675693060
MultPclmulqdq2         7817399311675693060
doMul (aw27)           6760860027809745732  - high QWORD: 1728378107
muld (Nidud)           6760860027809745732
_mul128 (Nidud)        6760860027809745732

aw27

This is a True Masm (TM) 64-bit version of mult64to128. The results are 128 bits wide, as expected: not crippled to 64 bits (even when using floating point), as in everything I have seen so far, and not computed with incorrect carry-less functions.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

64-bit True Masm (TM), 128-bit Results
978 cycles for 100 * mult64to128 (AW)
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800


:badgrin:
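
For comparison, the same full-width product can be sketched in C. This is not the mult64to128 routine from this post; it assumes a compiler that provides `unsigned __int128` (GCC and Clang do on 64-bit targets), and the type `u128_parts` and the function `mul_full_u64` are hypothetical names chosen for the example.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit result as two qwords. */
typedef struct {
    uint64_t lo;   /* bits 0..63   */
    uint64_t hi;   /* bits 64..127 */
} u128_parts;

/* Full 64 x 64 -> 128-bit unsigned product.
   Assumption: the compiler supports unsigned __int128. */
static u128_parts mul_full_u64(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    u128_parts r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}
```

On x64 a compiler will typically turn this into a single MUL, which leaves the low qword in RAX and the high qword in RDX.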

RuiLoureiro

>>> ...for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
           versions 3 ( _v3 ) should be much faster... ( than quick! )


          This «u64_mul (Rui)» was not written by me. I got it... I don't remember who wrote it.

jj2007

Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
           versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

RuiLoureiro

#24
Quote from: jj2007 on September 19, 2018, 03:59:03 AM
Quote from: RuiLoureiro on September 19, 2018, 03:27:33 AM
           versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

It is in the .inc file in my reply #19.

Jochen, please replace (I have replaced that file in reply #19)

                        ret
Multiply64_64_v3        endp

by

                        ret     20
Multiply64_64_v3        endp

jj2007

There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.

RuiLoureiro

Quote from: jj2007 on September 19, 2018, 06:06:59 AM
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.
I replaced it just now: the file MulQQbyQQ.zip contains Multiply64by64.inc, and that procedure is inside it.

There are 6:
Multiply64by64_v1   <<<--- uses pointers
Multiply64by64_v2
Multiply64by64_v3

Multiply64_64_v1   <<<--- uses values ( qwords )
Multiply64_64_v2
Multiply64_64_v3

jj2007

Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.

nidud

#28
deleted

RuiLoureiro

Quote from: jj2007 on September 19, 2018, 06:42:17 AM
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.
Jochen, Hutch will ask me why I don't zip it and post it!
Remember that we need to extract the algos you post from archives, too.
Is there any other way when we have a lot of files?
If we zip, it is because we zip; if we don't zip, it is because we don't zip.  :biggrin:


IdxX0   equ 0
IdxX1   equ 4

IdxY0   equ 0
IdxY1   equ 4

IdxZ0   equ 0
IdxZ1   equ 4
IdxZ2   equ 8
IdxZ3   equ 12

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Multiply64by64_v3       proc     pX64:DWORD, pY64:DWORD, pZ:DWORD
                     push     ebx
                     push     esi

                     mov      ebx, [esp+12]           ;pX64
                     mov      esi, [esp+16]           ;pY64
                     mov      ecx, [esp+20]           ;pZ

                  ; ----------------------
                  ;    IdxX0*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY0]
                     mov      [ecx+IdxZ0], eax
                     mov      [ecx+IdxZ1], edx

                  ; ----------------------
                  ;   IdxX0*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ2], edx

                  ; ----------------------
                  ;   IdxX1*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY0]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], edx
                  ;  this sum can carry out of Z2: keep the carry in Z3
                     mov      dword ptr [ecx+IdxZ3], 0   ; mov leaves CF intact
                     adc      dword ptr [ecx+IdxZ3], 0

                  ; ----------------------
                  ;   IdxX1*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], eax
                  ;
                     adc      edx, [ecx+IdxZ3]           ; add the saved carry
                     mov      [ecx+IdxZ3], edx

                     pop      esi
                     pop      ebx
                     ret      12
Multiply64by64_v3       endp
;««««««««««««««««««««««««««««««««««««««««««««««««««««
Multiply64_64_v3        proc    X:QWORD, Y:QWORD, pZ:DWORD
                        ; -------------
                        ;   pointers
                        ; -------------
                        mov     ecx, [esp+20]           ;pZ
                   
                        ; ----------------------
                        ;     IdxX0*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                        mov     [ecx+IdxZ0], eax
                        mov     [ecx+IdxZ1], edx

                        ; ----------------------
                        ;     IdxX0*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]
                   
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
                        adc     edx, 0
                        mov     [ecx+IdxZ2], edx
                   
                        ; ----------------------
                        ;     IdxX1*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                   
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
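                        adc     edx, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], edx
                        ; this sum can carry out of Z2: keep the carry in Z3
                        mov     dword ptr [ecx+IdxZ3], 0    ; mov leaves CF intact
                        adc     dword ptr [ecx+IdxZ3], 0

                        ; ----------------------
                        ;     IdxX1*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]

                        add     eax, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], eax
                        ;
                        adc     edx, [ecx+IdxZ3]            ; add the saved carry
                        mov     [ecx+IdxZ3], edx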

                        ret     20
Multiply64_64_v3        endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
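
The four partial products used by the _v3 procedures can be modelled in C to check results on any machine. This is only a sketch under assumptions, not a translation that was timed: `mul64_limbs` is a hypothetical name, and the output array follows the IdxZ0..IdxZ3 layout (four DWORDs, least significant first). Note that the sums for the two middle limbs can each carry into the next limb, so the model carries the upper half of every intermediate sum forward.

```c
#include <stdint.h>

/* Schoolbook 64 x 64 -> 128-bit multiply on 32-bit limbs.
   z[0..3] receive the product, least significant DWORD first
   (same layout as IdxZ0..IdxZ3). Hypothetical helper for checking. */
static void mul64_limbs(uint64_t x, uint64_t y, uint32_t z[4])
{
    uint32_t x0 = (uint32_t)x, x1 = (uint32_t)(x >> 32);
    uint32_t y0 = (uint32_t)y, y1 = (uint32_t)(y >> 32);

    uint64_t p00 = (uint64_t)x0 * y0;   /* IdxX0*IdxY0 */
    uint64_t p01 = (uint64_t)x0 * y1;   /* IdxX0*IdxY1 */
    uint64_t p10 = (uint64_t)x1 * y0;   /* IdxX1*IdxY0 */
    uint64_t p11 = (uint64_t)x1 * y1;   /* IdxX1*IdxY1 */

    z[0] = (uint32_t)p00;

    /* limb 1: high half of p00 plus the low halves of both cross products */
    uint64_t t = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    z[1] = (uint32_t)t;

    /* limb 2: carry out of limb 1, high halves of the cross products,
       low half of p11 */
    t = (t >> 32) + (p01 >> 32) + (p10 >> 32) + (uint32_t)p11;
    z[2] = (uint32_t)t;

    /* limb 3: carry out of limb 2 plus the high half of p11 */
    z[3] = (uint32_t)((t >> 32) + (p11 >> 32));
}
```

A useful test case is both operands set to 0FFFFFFFFFFFFFFFFh: the limbs should come out as 00000001, 00000000, FFFFFFFE, FFFFFFFF (Z0 to Z3), which exercises the carry out of the third partial product.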