News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Multiply two QWORDs

Started by jj2007, September 15, 2018, 11:11:11 PM

Previous topic - Next topic

jj2007

This pops up sometimes with random number generation: How to multiply two QWORDs?

Microsoft offers a bunch of dedicated functions with strange names such as ...
UnsignedMultiply128
Multiply128
__mul128

... but I got none of them working, the compiler always complains about intrinsics not supported etc, the usual C/C++ mess.

But I managed to put up a little 36 bytes routine that does the job (attached, pure MASM32). Problem: It's slow. There is a much faster alternative:
fild a64
fild b64
fmul
fistp resq


Great, over 4x faster, but if the result exceeds a QWORD, it spits out meaningless results and becomes very, very slow (due to the FPU's error handling).

So I spent some time looking for a SIMD solution, and I found PMULUDQ. Looks great but it's DWORDs, not QWORDs :(

Any bright ideas...?

Biterider

Hi JJ
I found this old macro

; Macro:      qdmul
; Purpose:    Multiply a QWORD by a DWORD.
; Arguments:  ecx::edx::eax = edx::eax * ebx.

qdmul macro
    push eax
    mov eax, edx
    mul ebx
    mov ecx, edx
    pop edx
    xchg eax, edx
    push edx
    mul ebx
    pop ebx
    add edx, ebx
    adc ecx, 0
endm


You can use it, splitting the multiplication in 4 steps:
1. multiply be the high 32 bit using the macro
2. shift the result by 32 ecx::edx::eax --> edi::ecx::edx::eax
3. multiply be the low 32 bit using the macro ecx::edx::eax
4. add the result of step 2 & 3

Biterider

jj2007

Thanks, will have a look. It doesn't look any simpler, though.

hutch--

Don't know if its fast but the results are correct. LATER : Cleaned up.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL quad1 :QWORD
    LOCAL quad2 :QWORD

    mov quad1, 100
    mov quad2, 100

    cvtsi2sd xmm0, quad1
    cvtsi2sd xmm1, quad2
    mulsd xmm0, xmm1
    cvtsd2si r11, xmm0

    conout str$(r11),lf

    waitkey
    .exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

jj2007

Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(

nidud

#5
deleted

RuiLoureiro

#6
Quote from: jj2007 on September 16, 2018, 02:41:25 AM
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(
I never used it. So, it seems that it converts a QWORD integer to Floating-Point format (QWORD also) if you are right Jochen  :biggrin: . These definitions are very ... cool !!!

LATER: i saw all CVT??? instructions for SSE2 and all converts from 32 to 64 or 64 to 32 .
           It s nice ! I think Hutch has a privileged information and he doesnt show us  ;)

jj2007

Quote from: nidud on September 16, 2018, 03:36:20 AM
You probably have to convert each DWORD to REAL8

Thanks, but as the title says, it's QWORDs, not DWORDs.

hutch--

 :biggrin:

> I think Hutch has a privileged information and he doesnt show us.

I have I have, its the Intel instruction manual. What I did when I started on the scalar double instructions was do a search through the entire manual for anything that had an SD ending, avoided the AVX versions and made a list of what looked like it was useful for scalar maths. Its worth having a look at the packed double versions as well as there are more primitives to work with.

What you may find is that the 64 bit mulsd instruction runs out of puff at the top of the 64 bit range.

LATER : From memory there are techniques to determine overflow and underflow which you may need to have in place if you are dealing with number up near or over the range limit of 64 bit.

jj2007

This should do the job:
MultQQ3 proc pQ1, pQ2
  mov eax, [esp+4] ; pQ1
  mov edx, [esp+8] ; pQ2
  movlps xmm0, QWORD ptr [eax]
  movlps xmm1, QWORD ptr [edx]
  pclmulqdq xmm0, xmm1, 0
  retn 2*4
MultQQ3 endp


Intel:
QuotePCLMULQDQ Instruction Definition
PCLMULQDQ instruction performs carry-less multiplication of two 64-bit quadwords
which are selected from the first and the second operands according to the immediate
byte value.
Instruction format: PCLMULQDQ xmm1, xmm2/m128, imm8
Description: Carry-less multiplication of one quadword (8 bytes) of xmm1
by one quadword (8 bytes) of xmm2/m128, returning a double quadword
(16 bytes). The immediate byte is used for determining which quadwords of
xmm1 and xmm2/m128 should be used.

The instruction has an own 'chapter' in the Oracle manual.

Problem: It doesn't work :(

Testcase attached, plain Masm32. It calculates 4000000000*2222222222, no problem for the FPU but pclmulqdq does something else. More precisely, it does carry-less multiplication.

Note there are variants like pclmullqlqdq xmm0, xmm1 ("Multiply the low halves of the two registers") which do not require the immediate operand; they are just pclmulqdq with the immediate bits embedded.

aw27

Carry-less means that the carry is discarded. I mentioned this instruction here. So you will never achieve what you want.
You need a Big Integer library or play with floating points to obtain an approximation.

aw27

A little convoluted way but it works.


val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800

jj2007

Thanks for all the suggestions :t

I have put together a little testbed, and here are the timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1190    cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2171    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)

913     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2182    cycles for 100 * u64_mul (Rui)
1212    cycles for 100 * PCLMULQDQ
2843    cycles for 100 * MultPclmulqdq2
4357    cycles for 100 * doMul (aw27)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2172    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2855    cycles for 100 * MultPclmulqdq2
4351    cycles for 100 * doMul (aw27)

915     cycles for 100 * MultQQ
2113    cycles for 100 * Multiply64by64 (Rui)
2167    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4363    cycles for 100 * doMul (aw27)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980


Note that some of the slower algos produce a 128-bit result.

aw27

JJ,

As always, you are a scam artist. Where are you testing the multiplication of 2 QWORDS?
Are these QWORDS?
SrcQ1   dq 1234567890
SrcQ2   dq 9000000082

Yes, as much as these are QWORDS:
SrcQ1   dq 1
SrcQ2   dq 2

Did you ever went to school? If so, did you try to deceive your teachers with tricks like these and succeed?

nidud

#14
deleted