I had hoped to divide by 10 with SIMD instructions, but somehow it fails. Do our math experts have an explanation?
include \masm32\include\masm32rt.inc
.686p
.xmm
.code
MyQ dq 123456789
MyQD dd 123456789
Magic dq 0CCCCCCCCCCCCCCCDh
MagicD dd 0CCCCCCCDh
start:
mov eax, MyQD
mov ecx, MagicD
mul ecx ; multiply DWORDs, result is 64 bits
shr edx, 3
print str$(edx), " result magic D", 13, 10 ; 12345678
movlps xmm0, MyQ
movlps xmm1, Magic
pclmullqlqdq xmm0, xmm1 ; multiply QWORDs, result is 128 bits
movhlps xmm0, xmm0 ; move high QWORD to lo QWORD
psrlq xmm0, 3 ; see shr edx, 3
movd eax, xmm0
inkey str$(eax), " result magic Q", 13, 10 ; garbage
exit
end start
pclmullqlqdq
It's "carry-less" multiplication. It's a useless instruction for integer arithmetic.
You need mulx for 64-bit x 64-bit ==> 128-bit, or a complicated series of AVX2 instructions using shifts, or 64-bit double manipulation, or an AVX-512 multiply.
Hi
Yes, pclmullqlqdq is not the instruction you are looking for. It performs a polynomial multiplication over GF(2), so partial products are combined with XOR and no carries propagate.
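To make the difference concrete, here is a minimal C sketch of a carry-less multiply next to an ordinary one. The helper name clmul32 is made up for illustration; it is a bit-by-bit model of the idea, not the instruction itself.

```c
#include <stdint.h>

/* Hypothetical helper: carry-less multiply of two 32-bit values.
   Each shifted partial product is combined with XOR instead of ADD,
   so no carry ever moves from one bit position to the next. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u)
            result ^= (uint64_t)a << i;
    }
    return result;
}

/* Ordinary integer multiply, for comparison. */
static uint64_t mul32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

With these, mul32(3, 3) is 9, but clmul32(3, 3) is 5: as polynomials over GF(2), (x+1)*(x+1) = x^2+1, because the two middle terms cancel. A magic-number division relies on the carries, which is why the result in the first post is garbage.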
As mentioned by InfiniteLoop, for this implementation you need (64bit) x (64bit) = high64(128bit).
This article may help: https://stackoverflow.com/questions/28868367/getting-the-high-part-of-64-bit-integer-multiplication/50958815#50958815
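As a rough sketch of the idea (not taken from the linked answer), the high 64 bits can be assembled from four 32 x 32 -> 64 partial products, each of which corresponds to a plain 32-bit mul. The names mulhi_u64 and div10_u64 are assumptions for illustration; the magic constant and the shift by 3 are the ones from the first post.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128-bit product a * b,
   built from four 32 x 32 -> 64 partial products. */
static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Middle column: carry out of lo_lo, low half of hi_lo, all of lo_hi.
       The sum is at most 2^64 - 1, so it cannot overflow. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Unsigned divide by 10: high half of n * Magic, then shift right by 3
   (the 64-bit counterpart of "mul ecx" followed by "shr edx, 3"). */
static uint64_t div10_u64(uint64_t n)
{
    return mulhi_u64(n, 0xCCCCCCCCCCCCCCCDULL) >> 3;
}
```

For the value from the first post, div10_u64(123456789) gives 12345678, matching the DWORD version.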
Interesting to note that C intrinsics fall back on a simple mul instruction for a 128-bit (RDX::RAX) multiplication.
Biterider
Thanks, InfiniteLoop and Biterider :thumbsup:
Quote from: Biterider on May 02, 2022, 03:52:16 AM
Interesting to note that C intrinsics fall back on a simple mul instruction for a 128-bit (RDX::RAX) multiplication
I know, this works fine, but I need it for my 32-bit library. Bad luck :sad:
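For a 32-bit target, one possible alternative to the magic multiply is schoolbook long division in 32-bit halves. The sketch below is only a C model of that idea, with the made-up name div10_by_halves. In 32-bit assembly, each of the two steps would correspond to one div by 10, with the remainder of the first step left in edx for the second.

```c
#include <stdint.h>

/* Hypothetical helper: divide a 64-bit value by 10 in two steps. */
static uint64_t div10_by_halves(uint64_t n)
{
    uint32_t hi = (uint32_t)(n >> 32);
    uint32_t lo = (uint32_t)n;

    /* Step 1: divide the high DWORD; keep quotient and remainder. */
    uint32_t q_hi = hi / 10u;
    uint32_t r    = hi % 10u;

    /* Step 2: divide remainder:lo. Since r < 10, the quotient is
       below 2^32, so it fits in a DWORD. */
    uint32_t q_lo = (uint32_t)((((uint64_t)r << 32) | lo) / 10u);

    return ((uint64_t)q_hi << 32) | q_lo;
}
```

This is probably slower than a multiply-based approach because div is an expensive instruction, but it needs no 128-bit product.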