Multiply two QWORDs

jj2007 · September 15, 2018, 11:11:11 PM

This pops up sometimes with random number generation: How to multiply two QWORDs?

Microsoft offers a bunch of dedicated functions with strange names such as ...

Code Select

UnsignedMultiply128
Multiply128
__mul128

... but I got none of them working, the compiler always complains about intrinsics not supported etc, the usual C/C++ mess.

But I managed to put up a little 36 bytes routine that does the job (attached, pure MASM32). Problem: It's slow. There is a much faster alternative:

Code Select

	fild a64
	fild b64
	fmul
	fistp resq

Great, over 4x faster, but if the result exceeds a QWORD, it spits out meaningless results and becomes very, very slow (due to the FPU's error handling).

So I spent some time looking for a SIMD solution, and I found PMULUDQ. Looks great but it's DWORDs, not QWORDs :(

Any bright ideas...?

Biterider · September 16, 2018, 12:44:57 AM

Hi JJ
I found this old macro

Code Select


; Macro:      qdmul
; Purpose:    Multiply a QWORD by a DWORD.
; Arguments:  ecx::edx::eax = edx::eax * ebx.

qdmul macro
    push eax
    mov eax, edx
    mul ebx
    mov ecx, edx
    pop edx
    xchg eax, edx
    push edx
    mul ebx
    pop ebx
    add edx, ebx
    adc ecx, 0
endm

You can use it, splitting the multiplication in 4 steps:
1. multiply be the high 32 bit using the macro
2. shift the result by 32 ecx::edx::eax --> edi::ecx::edx::eax
3. multiply be the low 32 bit using the macro ecx::edx::eax
4. add the result of step 2 & 3

Biterider

jj2007 · September 16, 2018, 01:48:51 AM

Thanks, will have a look. It doesn't look any simpler, though.

hutch-- · September 16, 2018, 02:19:03 AM

Don't know if its fast but the results are correct. LATER : Cleaned up.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include64\masm64rt.inc

.code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

LOCAL quad1 :QWORD
LOCAL quad2 :QWORD

mov quad1, 100
mov quad2, 100

cvtsi2sd xmm0, quad1
cvtsi2sd xmm1, quad2
mulsd xmm0, xmm1
cvtsd2si r11, xmm0

conout str$(r11),lf

waitkey
.exit

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end

jj2007 · September 16, 2018, 02:41:25 AM

Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(

nidud · September 16, 2018, 03:36:20 AM

deleted

RuiLoureiro · September 16, 2018, 04:48:40 AM

Quote from: jj2007 on September 16, 2018, 02:41:25 AM
Interesting: CVTSI2SD--Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

That is the definition, but in your code it loads a QWORD. And in 32-bit land, it refuses to load more than a DWORD. Something is wrong with the definition :(

I never used it. So, it seems that it converts a QWORD integer to Floating-Point format (QWORD also) if you are right Jochen

. These definitions are very ... cool !!!

LATER: i saw all CVT??? instructions for SSE2 and all converts from 32 to 64 or 64 to 32 .
It s nice ! I think Hutch has a privileged information and he doesnt show us ;)

jj2007 · September 16, 2018, 11:00:10 AM

Quote from: nidud on September 16, 2018, 03:36:20 AM
You probably have to convert each DWORD to REAL8

Thanks, but as the title says, it's QWORDs, not DWORDs.

hutch-- · September 16, 2018, 11:10:02 AM

> I think Hutch has a privileged information and he doesnt show us.

I have I have, its the Intel instruction manual. What I did when I started on the scalar double instructions was do a search through the entire manual for anything that had an SD ending, avoided the AVX versions and made a list of what looked like it was useful for scalar maths. Its worth having a look at the packed double versions as well as there are more primitives to work with.

What you may find is that the 64 bit mulsd instruction runs out of puff at the top of the 64 bit range.

LATER : From memory there are techniques to determine overflow and underflow which you may need to have in place if you are dealing with number up near or over the range limit of 64 bit.

jj2007 · September 16, 2018, 11:40:02 AM

This should do the job:

Code Select

MultQQ3 proc pQ1, pQ2
  mov eax, [esp+4]	; pQ1
  mov edx, [esp+8]	; pQ2
  movlps xmm0, QWORD ptr [eax]
  movlps xmm1, QWORD ptr [edx]
  pclmulqdq xmm0, xmm1, 0
  retn 2*4
MultQQ3 endp

Intel:

QuotePCLMULQDQ Instruction Definition
PCLMULQDQ instruction performs carry-less multiplication of two 64-bit quadwords
which are selected from the first and the second operands according to the immediate
byte value.
Instruction format: PCLMULQDQ xmm1, xmm2/m128, imm8
Description: Carry-less multiplication of one quadword (8 bytes) of xmm1
by one quadword (8 bytes) of xmm2/m128, returning a double quadword
(16 bytes). The immediate byte is used for determining which quadwords of
xmm1 and xmm2/m128 should be used.

The instruction has an own 'chapter' in the Oracle manual.

Problem: It doesn't work :(

Testcase attached, plain Masm32. It calculates 4000000000*2222222222, no problem for the FPU but pclmulqdq does something else. More precisely, it does carry-less multiplication.

Note there are variants like pclmullqlqdq xmm0, xmm1 ("Multiply the low halves of the two registers") which do not require the immediate operand; they are just pclmulqdq with the immediate bits embedded.

aw27 · September 17, 2018, 02:11:02 AM

Carry-less means that the carry is discarded. I mentioned this instruction here. So you will never achieve what you want.
You need a Big Integer library or play with floating points to obtain an approximation.

aw27 · September 18, 2018, 02:24:31 AM

A little convoluted way but it works.

Code Select


val1=0x1122334455667788
val2=0x99aabbccddeeff00
val1*val2=0xa48ddeb93f93d70479983e499807800

jj2007 · September 18, 2018, 05:48:46 PM

Thanks for all the suggestions :t

I have put together a little testbed, and here are the timings:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1190    cycles for 100 * MultQQ
2115    cycles for 100 * Multiply64by64 (Rui)
2171    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2840    cycles for 100 * MultPclmulqdq2
4358    cycles for 100 * doMul (aw27)

913     cycles for 100 * MultQQ
2118    cycles for 100 * Multiply64by64 (Rui)
2182    cycles for 100 * u64_mul (Rui)
1212    cycles for 100 * PCLMULQDQ
2843    cycles for 100 * MultPclmulqdq2
4357    cycles for 100 * doMul (aw27)

917     cycles for 100 * MultQQ
2116    cycles for 100 * Multiply64by64 (Rui)
2172    cycles for 100 * u64_mul (Rui)
1213    cycles for 100 * PCLMULQDQ
2855    cycles for 100 * MultPclmulqdq2
4351    cycles for 100 * doMul (aw27)

915     cycles for 100 * MultQQ
2113    cycles for 100 * Multiply64by64 (Rui)
2167    cycles for 100 * u64_mul (Rui)
1214    cycles for 100 * PCLMULQDQ
2841    cycles for 100 * MultPclmulqdq2
4363    cycles for 100 * doMul (aw27)

62      bytes for MultQQ
108     bytes for Multiply64by64 (Rui)
126     bytes for u64_mul (Rui)
46      bytes for PCLMULQDQ
45      bytes for MultPclmulqdq2
253     bytes for doMul (aw27)

DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   11111111111234566980
DestQ   10801413638766757892
DestQ   10801413638766757892
DestQ   11111111111234566980

Note that some of the slower algos produce a 128-bit result.

aw27 · September 18, 2018, 09:39:05 PM

JJ,

As always, you are a scam artist. Where are you testing the multiplication of 2 QWORDS?
Are these QWORDS?
SrcQ1   dq 1234567890
SrcQ2   dq 9000000082

Yes, as much as these are QWORDS:
SrcQ1   dq 1
SrcQ2   dq 2

Did you ever went to school? If so, did you try to deceive your teachers with tricks like these and succeed?

nidud · September 18, 2018, 09:56:13 PM

deleted

The MASM Forum

News:

Multiply two QWORDs

jj2007

Biterider

jj2007

hutch--

jj2007

nidud

RuiLoureiro

jj2007

hutch--

jj2007

aw27

aw27

jj2007

aw27

nidud