Author Topic: Multiply two QWORDs  (Read 902 times)

jj2007

  • Member
  • *****
  • Posts: 8822
  • Assembler is fun ;-)
    • MasmBasic
Re: Multiply two QWORDs
« Reply #15 on: September 18, 2018, 10:49:36 PM »
Thanks, Nidud. Your muld is practically my MulQQ, only that I pass the source QWORDs as pointers while you pass them as QWORDs. I had tested that before, and it was a little bit slower. Results:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

911     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2183    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3894    cycles for 100 * mulq_sse (Nidud)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2175    cycles for 100 * u64_mul (Rui)
1215    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4355    cycles for 100 * doMul (aw27)
966     cycles for 100 * muld (Nidud)
3893    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2127    cycles for 100 * Multiply64by64 (Rui)
2170    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4362    cycles for 100 * doMul (aw27)
968     cycles for 100 * muld (Nidud)
3899    cycles for 100 * mulq_sse (Nidud)

915     cycles for 100 * MultQQ
2123    cycles for 100 * Multiply64by64 (Rui)
2168    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4354    cycles for 100 * doMul (aw27)
965     cycles for 100 * muld (Nidud)
3913    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892 < PCLMULQDQ carry-less, OK for random number but not suitable as "normal" mult
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096   <<< little problem here

The FPU version is a lot slower, and chokes if the result goes beyond the QWORD range.
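To make the carry-less remark concrete, here is a minimal C sketch (not the benchmarked code; `clmul_lo64` is a hypothetical helper name) of what the low 64 bits of a PCLMULQDQ-style product look like. Partial products are combined with XOR instead of ADD, so the result only matches an ordinary multiplication when no column would have produced a carry.

```c
#include <stdint.h>

/* Low 64 bits of a carry-less product: shifted copies of a are XORed, never added. */
static uint64_t clmul_lo64(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1)
            r ^= a << i;   /* XOR drops every carry an ADD would have propagated */
    }
    return r;
}
```

For example, `clmul_lo64(3, 3)` gives 5 while 3 * 3 is 9. `clmul_lo64(2, 7)` gives 14, the same as the normal product, because no carries occur in that case.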

LiaoMi

  • Member
  • ***
  • Posts: 323
Re: Multiply two QWORDs
« Reply #16 on: September 19, 2018, 12:08:16 AM »
Code: [Select]
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

878     cycles for 100 * MultQQ
1695    cycles for 100 * Multiply64by64 (Rui)
1850    cycles for 100 * u64_mul (Rui)
515     cycles for 100 * PCLMULQDQ
2320    cycles for 100 * MultPclmulqdq2
3976    cycles for 100 * doMul (aw27)
879     cycles for 100 * muld (Nidud)
3668    cycles for 100 * mulq_sse (Nidud)

868     cycles for 100 * MultQQ
1859    cycles for 100 * Multiply64by64 (Rui)
2035    cycles for 100 * u64_mul (Rui)
546     cycles for 100 * PCLMULQDQ
2134    cycles for 100 * MultPclmulqdq2
4049    cycles for 100 * doMul (aw27)
956     cycles for 100 * muld (Nidud)
3637    cycles for 100 * mulq_sse (Nidud)

922     cycles for 100 * MultQQ
1729    cycles for 100 * Multiply64by64 (Rui)
1849    cycles for 100 * u64_mul (Rui)
537     cycles for 100 * PCLMULQDQ
2103    cycles for 100 * MultPclmulqdq2
4215    cycles for 100 * doMul (aw27)
885     cycles for 100 * muld (Nidud)
3635    cycles for 100 * mulq_sse (Nidud)

884     cycles for 100 * MultQQ
1901    cycles for 100 * Multiply64by64 (Rui)
1853    cycles for 100 * u64_mul (Rui)
531     cycles for 100 * PCLMULQDQ
2127    cycles for 100 * MultPclmulqdq2
3960    cycles for 100 * doMul (aw27)
981     cycles for 100 * muld (Nidud)
3643    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 5894
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Multiply two QWORDs
« Reply #17 on: September 19, 2018, 12:10:57 AM »

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

1159    cycles for 100 * MultQQ
2008    cycles for 100 * Multiply64by64 (Rui)
2190    cycles for 100 * u64_mul (Rui)
665     cycles for 100 * PCLMULQDQ
2458    cycles for 100 * MultPclmulqdq2
4529    cycles for 100 * doMul (aw27)
1006    cycles for 100 * muld (Nidud)
4222    cycles for 100 * mulq_sse (Nidud)

994     cycles for 100 * MultQQ
1975    cycles for 100 * Multiply64by64 (Rui)
2115    cycles for 100 * u64_mul (Rui)
722     cycles for 100 * PCLMULQDQ
2478    cycles for 100 * MultPclmulqdq2
4550    cycles for 100 * doMul (aw27)
1004    cycles for 100 * muld (Nidud)
4209    cycles for 100 * mulq_sse (Nidud)

1000    cycles for 100 * MultQQ
1971    cycles for 100 * Multiply64by64 (Rui)
2109    cycles for 100 * u64_mul (Rui)
601     cycles for 100 * PCLMULQDQ
2639    cycles for 100 * MultPclmulqdq2
4493    cycles for 100 * doMul (aw27)
1003    cycles for 100 * muld (Nidud)
4229    cycles for 100 * mulq_sse (Nidud)

998     cycles for 100 * MultQQ
1972    cycles for 100 * Multiply64by64 (Rui)
2098    cycles for 100 * u64_mul (Rui)
598     cycles for 100 * PCLMULQDQ
2443    cycles for 100 * MultPclmulqdq2
5134    cycles for 100 * doMul (aw27)
1005    cycles for 100 * muld (Nidud)
4219    cycles for 100 * mulq_sse (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
112     bytes for mulq_sse (Nidud)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   506253686751116096

--- ok ---
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

nidud

  • Member
  • *****
  • Posts: 1614
    • https://github.com/nidud/asmc
Re: Multiply two QWORDs
« Reply #18 on: September 19, 2018, 12:41:19 AM »
I made some changes to the SSE version.

  mulpd xmm1, xmm2
  cvtsd2si ecx, xmm1      ; a0 * b1
  add edx, ecx
  movhlps xmm1, xmm1
  cvtsd2si ecx, xmm1      ; a1 * b0
  add edx, ecx

However, it will still fail, and it is really slow compared to the conventional method.
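A plausible reason why a double-based version cannot be exact (stated as an assumption, not verified against the full mulq_sse source): a double carries a 53-bit significand, while a 32x32 partial product needs up to 64 bits. A small C sketch, with `mul32_via_double` and `mul32_exact` as hypothetical helper names:

```c
#include <stdint.h>

/* Multiply two 32-bit values through double precision, as a mulpd-based routine would. */
static uint64_t mul32_via_double(uint32_t a, uint32_t b) {
    volatile double p = (double)a * (double)b;  /* volatile: force rounding to a 64-bit double */
    return (uint64_t)p;
}

/* Exact 32x32 -> 64 integer product for comparison. */
static uint64_t mul32_exact(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```

With a = b = 0xFFFFFFFF the exact product is 0xFFFFFFFE00000001, but the double route returns 0xFFFFFFFE00000000: the low bit is rounded away.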
Code: [Select]
_mul128 proc Multiplier:qword, Multiplicand:qword, Highproduct:ptr

    mov eax,dword ptr Multiplier
    mov edx,dword ptr Multiplier[4]
    mov ecx,dword ptr Multiplicand[4]

    .if !edx && !ecx

        .if Highproduct

            mov ecx,Highproduct
            mov [ecx],edx
            mov [ecx+4],edx
        .endif
        mul dword ptr Multiplicand
    .else

        push    ebx
        push    esi
        push    edi
        push    ebp
        push    eax
        push    edx
        push    edx
        mov     ebx,dword ptr Multiplicand
        mul     ebx
        mov     esi,edx
        mov     edi,eax
        pop     eax
        mul     ecx
        mov     ebp,edx
        xchg    ebx,eax
        pop     edx
        mul     edx
        add     esi,eax
        adc     ebx,edx
        adc     ebp,0
        pop     eax
        mul     ecx
        add     esi,eax
        adc     ebx,edx
        adc     ebp,0
        mov     ecx,ebp
        mov     edx,esi
        mov     eax,edi
        pop     ebp
        mov     edi,Highproduct

        .if edi

            mov [edi],ebx
            mov [edi+4],ecx
        .endif
        pop     edi
        pop     esi
        pop     ebx

    .endif
    ret

_mul128 endp
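
For readers who want to check _mul128 against a reference, the four-partial-product scheme it implements can be sketched in portable C. This is only a model under hypothetical naming (`mul64x64_128` does not exist anywhere in the thread), not a translation of the register allocation above:

```c
#include <stdint.h>

/* 64x64 -> 128 built from four 32x32 -> 64 partial products. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;   /* weight 2^0  */
    uint64_t p01 = a0 * b1;   /* weight 2^32 */
    uint64_t p10 = a1 * b0;   /* weight 2^32 */
    uint64_t p11 = a1 * b1;   /* weight 2^64 */

    /* Middle column: three 32-bit terms, so the sum stays below 3 * 2^32. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Squaring 0xFFFFFFFFFFFFFFFF is a handy edge case: the expected result is high = 0xFFFFFFFFFFFFFFFE, low = 1.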

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Multiply two QWORDs
« Reply #19 on: September 19, 2018, 02:12:16 AM »
Hi Jochen
              works fine :t
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64by64_v3 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v1 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP  Multiply64_64_v2 ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP Multiply64_64_v3 ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP MultQQ ---***
6FA84000
5B98FAA3
C2C48146
00000783
*** STOP doMul ---***
6FA84000
5B98FAA3
00000000
00000000
*** STOP muld ---***
**************** E N D ****************
« Last Edit: September 19, 2018, 06:06:33 AM by RuiLoureiro »

jj2007

Re: Multiply two QWORDs
« Reply #20 on: September 19, 2018, 03:02:12 AM »
@Nidud: Yes, the new SSE version does not produce the expected result, so I added _mul128 instead. I also added a print of the high qword for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

914     cycles for 100 * MultQQ
2111    cycles for 100 * Multiply64by64 (Rui)
2159    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2832    cycles for 100 * MultPclmulqdq2
4267    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2306    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2163    cycles for 100 * u64_mul (Rui)
1209    cycles for 100 * PCLMULQDQ
2829    cycles for 100 * MultPclmulqdq2
4266    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2300    cycles for 100 * _mul128 (Nidud)

910     cycles for 100 * MultQQ
2120    cycles for 100 * Multiply64by64 (Rui)
2166    cycles for 100 * u64_mul (Rui)
1211    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4272    cycles for 100 * doMul (aw27)
964     cycles for 100 * muld (Nidud)
2296    cycles for 100 * _mul128 (Nidud)

926     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2169    cycles for 100 * u64_mul (Rui)
1207    cycles for 100 * PCLMULQDQ
2833    cycles for 100 * MultPclmulqdq2
4273    cycles for 100 * doMul (aw27)
962     cycles for 100 * muld (Nidud)
2287    cycles for 100 * _mul128 (Nidud)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
52      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)
68      bytes for muld (Nidud)
146     bytes for _mul128 (Nidud)

MultQQ                 6760860027809745732
Multiply64by64 (Rui)   6760860027809745732  - high QWORD: 1728378107
u64_mul (Rui)          6760860027809745732  - high QWORD: 1728378107
PCLMULQDQ              7817399311675693060
MultPclmulqdq2         7817399311675693060
doMul (aw27)           6760860027809745732  - high QWORD: 1728378107
muld (Nidud)           6760860027809745732
_mul128 (Nidud)        6760860027809745732

AW

  • Member
  • *****
  • Posts: 1561
  • Let's Make ASM Great Again!
Re: Multiply two QWORDs
« Reply #21 on: September 19, 2018, 03:26:07 AM »
This is a True Masm (TM) 64-bit version of mult64to128. Results are 128 bits wide, as expected: not crippled to 64 bits (even with floating point in the mix), as I have seen so far, and not produced by incorrect carry-less functions.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

64-bit True Masm (TM), 128-bit Results
978 cycles for 100 * mult64to128 (AW)
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800


 :badgrin:

RuiLoureiro

Re: Multiply two QWORDs
« Reply #22 on: September 19, 2018, 03:27:33 AM »
>>> ...for those algos that do produce one, of which Rui's Multiply64by64 is the fastest on my CPU:
           versions 3 ( _v3 ) should be much faster... ( than quick! )


          This «u64_mul (Rui)» was not written by me. I got it... I don't remember who wrote it.

jj2007

Re: Multiply two QWORDs
« Reply #23 on: September 19, 2018, 03:59:03 AM »
           versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

RuiLoureiro

Re: Multiply two QWORDs
« Reply #24 on: September 19, 2018, 05:02:06 AM »
           versions 3 ( _v3 ) should be much faster... ( than quick! )

Then post it here. And please, not hidden in an archive with a dozen files.

It is in the .inc file in my reply #19.

Jochen, please replace (I have already replaced that file in reply #19)

                                  ret
Multiply64_64_v3     endp

by

                                 ret     20
Multiply64_64_v3     endp
« Last Edit: September 19, 2018, 06:08:32 AM by RuiLoureiro »

jj2007

Re: Multiply two QWORDs
« Reply #25 on: September 19, 2018, 06:06:59 AM »
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.

RuiLoureiro

Re: Multiply two QWORDs
« Reply #26 on: September 19, 2018, 06:12:11 AM »
There is no Multiply64_64_v3 in the inc file. Post it here. Use the code tags.
I replaced it just now: the file MulQQbyQQ.zip contains Multiply64by64.inc, and that procedure is inside it.

There are 6:

Multiply64by64_v1   <<<--- uses pointers
Multiply64by64_v2
Multiply64by64_v3

Multiply64_64_v1   <<<--- uses value ( qwords )
Multiply64_64_v2
Multiply64_64_v3

jj2007

Re: Multiply two QWORDs
« Reply #27 on: September 19, 2018, 06:42:17 AM »
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.

nidud

Re: Multiply two QWORDs
« Reply #28 on: September 19, 2018, 07:04:13 AM »
This is a True Masm (TM) 64-bit version of mult64to128. Results are 128 bits wide, as expected: not crippled to 64 bits (even with floating point in the mix), as I have seen so far, and not produced by incorrect carry-less functions.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

64-bit True Masm (TM), 128-bit Results
978 cycles for 100 * mult64to128 (AW)
val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800


 :badgrin:

Code: [Select]
mult64to128 proc first:qword, sec:qword, res:ptr

mov r15, rdx
mov eax, ecx
mov r14, rax
mov r10d, edx
mul r10
mov r13d, eax
shr rax, 32

shr rcx, 32
mov rbx, rcx
xchg rax, rcx
mul r10
add rax, rcx
mov r11d, eax
shr rax, 32
mov r12d, eax

shr r15, 32
mov eax, r14d
mul r15
add rax, r11
mov r9, rax
shr rax, 32
xchg rax, r15
mul rbx
add rax, r12
add rax, r15

mov (twoQwords PTR [r8]).q1, rax
shl r9, 32
add r9, r13
mov (twoQwords PTR [r8]).q2, r9

ret
mult64to128 endp

 :biggrin:

A bit skeptical towards these True Masm (TM) guys, so let's see if it actually works.

    mov rax,0x1122334455667788
    mov rcx,0x99aabbccddeeff00
    mul rcx
    printf("%#I64x%I64x\n", rdx, rax)

0xa48ddeb93f93d70479983e499807800

Yep, that works: congrats  :t
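
One caveat when printing a 128-bit result as two halves: the low half needs zero padding to 16 hex digits, otherwise leading zeros in it silently vanish. The check above happens to be safe because its low QWORD has all 16 digits. A hedged C sketch (`format_u128_hex` is a made-up helper; it uses the standard `PRIx64` macros rather than the MSVC-specific `I64` prefix):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Write hi:lo as one hex number into buf; returns the length snprintf reports. */
static int format_u128_hex(char *buf, size_t n, uint64_t hi, uint64_t lo) {
    if (hi == 0)
        return snprintf(buf, n, "0x%" PRIx64, lo);
    /* The low half is always padded to 16 digits so its leading zeros survive. */
    return snprintf(buf, n, "0x%" PRIx64 "%016" PRIx64, hi, lo);
}
```

With hi = 1 and lo = 0 this yields 0x10000000000000000, where an unpadded format would print 0x10.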

RuiLoureiro

Re: Multiply two QWORDs
« Reply #29 on: September 19, 2018, 07:42:38 AM »
Rui,
have a look at how Nidud posts his algo. I do not want to extract algos from archives.
Jochen, Hutch will ask me why I don't zip it and post!
Remember that we need to extract the algos you post from archives.
Is there any other way when we have a lot of files?
If we zip, it's because we zip; if we don't zip, it's because we don't zip.  :biggrin:

Code: [Select]
IdxX0   equ 0
IdxX1   equ 4

IdxY0   equ 0
IdxY1   equ 4

IdxZ0   equ 0
IdxZ1   equ 4
IdxZ2   equ 8
IdxZ3   equ 12

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Multiply64by64_v3       proc     pX64:DWORD, pY64:DWORD, pZ:DWORD
                     push     ebx
                     push     esi

                     mov      ebx, [esp+12]           ;pX64
                     mov      esi, [esp+16]           ;pY64
                     mov      ecx, [esp+20]           ;pZ

                  ; ----------------------
                  ;    IdxX0*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY0]
                     mov      [ecx+IdxZ0], eax
                     mov      [ecx+IdxZ1], edx

                  ; ----------------------
                  ;   IdxX0*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX0]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
                     adc      edx, 0
                     mov      [ecx+IdxZ2], edx

                  ; ----------------------
                  ;   IdxX1*IdxY0
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY0]

                     add      eax, [ecx+IdxZ1]
                     mov      [ecx+IdxZ1], eax
                  ;
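                     adc      edx, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], edx
                  ; this adc can carry out of Z2; park that carry in Z3 (mov leaves CF intact)
                     mov      edx, 0
                     adc      edx, 0
                     mov      [ecx+IdxZ3], edx

                  ; ----------------------
                  ;   IdxX1*IdxY1
                  ; ----------------------
                     mov      eax, [ebx+IdxX1]
                     mul      dword ptr [esi+IdxY1]

                     add      eax, [ecx+IdxZ2]
                     mov      [ecx+IdxZ2], eax
                  ;
                     adc      edx, [ecx+IdxZ3]        ; high dword plus the parked carry
                     mov      [ecx+IdxZ3], edx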

                     pop      esi
                     pop      ebx
                     ret      12
Multiply64by64_v3       endp
;««««««««««««««««««««««««««««««««««««««««««««««««««««
Multiply64_64_v3        proc    X:QWORD, Y:QWORD, pZ:DWORD
                        ; -------------
                        ;   pointers
                        ; -------------
                        mov     ecx, [esp+20]           ;pZ
                   
                        ; ----------------------
                        ;     IdxX0*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                        mov     [ecx+IdxZ0], eax
                        mov     [ecx+IdxZ1], edx

                        ; ----------------------
                        ;     IdxX0*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+4]          ; [X+IdxX0]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]
                   
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
                        adc     edx, 0
                        mov     [ecx+IdxZ2], edx
                   
                        ; ----------------------
                        ;     IdxX1*IdxY0
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+12]          ; [Y+IdxY0]
                   
                        add     eax, [ecx+IdxZ1]
                        mov     [ecx+IdxZ1], eax
                        ;
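                        adc     edx, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], edx
                        ; this adc can carry out of Z2; park that carry in Z3 (mov leaves CF intact)
                        mov     edx, 0
                        adc     edx, 0
                        mov     [ecx+IdxZ3], edx

                        ; ----------------------
                        ;     IdxX1*IdxY1
                        ; ----------------------
                        mov     eax, dword ptr [esp+8]          ; [X+IdxX1]
                        mul     dword ptr [esp+16]          ; [Y+IdxY1]

                        add     eax, [ecx+IdxZ2]
                        mov     [ecx+IdxZ2], eax
                        ;
                        adc     edx, [ecx+IdxZ3]        ; high dword plus the parked carry
                        mov     [ecx+IdxZ3], edx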

                        ret     20
Multiply64_64_v3        endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
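
A quick way to validate routines like these is a C model of the same accumulate-into-Z idea, run on an edge case such as 0xFFFFFFFFFFFFFFFF squared. In that case, adding into the Z2 dword itself produces a carry that has to reach Z3. The sketch below is only a model under assumed names (`add_at` and `mul64_to_dwords` are hypothetical), not Rui's code:

```c
#include <stdint.h>

/* Add a 64-bit partial product into z[] at dword index i and ripple the carry upward. */
static void add_at(uint32_t z[4], int i, uint64_t p) {
    uint32_t part[2] = { (uint32_t)p, (uint32_t)(p >> 32) };
    uint64_t carry = 0;
    for (int k = 0; i + k < 4; k++) {
        uint64_t s = (uint64_t)z[i + k] + (k < 2 ? part[k] : 0u) + carry;
        z[i + k] = (uint32_t)s;   /* keep the low 32 bits in this column */
        carry = s >> 32;          /* pass anything above on to the next dword */
    }
}

/* x[0] and y[0] are the low dwords; z[0..3] receives the product, low dword first. */
static void mul64_to_dwords(const uint32_t x[2], const uint32_t y[2], uint32_t z[4]) {
    z[0] = z[1] = z[2] = z[3] = 0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            add_at(z, i + j, (uint64_t)x[i] * y[j]);
}
```

For x = y = { 0xFFFFFFFF, 0xFFFFFFFF } the expected dwords are Z0 = 00000001, Z1 = 00000000, Z2 = FFFFFFFE, Z3 = FFFFFFFF. A routine that drops the carry out of Z2 ends up with Z3 = FFFFFFFE instead.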