The MASM Forum

General => The Workshop => Topic started by: jj2007 on May 01, 2022, 10:55:56 PM

Title: Division by multiplying with a magic number
Post by: jj2007 on May 01, 2022, 10:55:56 PM
I had hoped to divide by 10 with SIMD instructions, but somehow it fails. Do our math experts have an explanation?

Code: [Select]
include \masm32\include\masm32rt.inc
.686p
.xmm
.code
MyQ dq 123456789
MyQD dd 123456789
Magic dq 0CCCCCCCCCCCCCCCDh
MagicD dd 0CCCCCCCDh

start:
  mov eax, MyQD
  mov ecx, MagicD
  mul ecx ; multiply DWORDs, result is 64 bits
  shr edx, 3
  print str$(edx), " result magic D", 13, 10 ; 12345678
  movlps xmm0, MyQ
  movlps xmm1, Magic
  pclmullqlqdq xmm0, xmm1 ; multiply QWORDs, result is 128 bits
  movhlps xmm0, xmm0 ; move high QWORD to lo QWORD
  psrlq xmm0, 3 ; see shr edx, 3
  movd eax, xmm0
  inkey str$(eax), " result magic Q", 13, 10  ; garbage
  exit
end start
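
For readers following along, here is a minimal C sketch of what the working 32-bit path above computes. The function name div10_u32 is made up for illustration; the constant is the MagicD value from the listing. Taking EDX after mul and then shifting by 3 is the same as shifting the full 64-bit product right by 35.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned division by 10 via the reciprocal
   0xCCCCCCCD = ceil(2^35 / 10). */
uint32_t div10_u32(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu; /* 32 x 32 -> 64 bits, like mul ecx */
    return (uint32_t)(product >> 35);             /* high dword (>> 32), then shr 3 */
}
```

Calling div10_u32(123456789) gives 12345678, matching the "result magic D" line.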
Title: Re: Division by multiplying with a magic number
Post by: InfiniteLoop on May 02, 2022, 02:54:58 AM
pclmullqlqdq

It's "carry-less" multiplication. It's a useless instruction.
You need mulx for 64-bit x 64-bit ==> 128-bit or a complicated series of AVX2 using shifts or 64-bit double manipulation or AVX512 multiply.
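
To illustrate why a carry-less product cannot stand in for an ordinary one, here is a small software sketch in C. It is not how the hardware instruction is implemented, and it works on 32-bit inputs for brevity (the real instruction takes 64-bit operands); clmul32 is a made-up name. The partial products are combined with XOR instead of ADD, so no carries propagate.

```c
#include <stdint.h>

/* Hypothetical software model of carry-less multiplication:
   shift-and-XOR instead of shift-and-add. */
uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u)
            result ^= (uint64_t)a << i; /* XOR drops the carries an ADD would produce */
    }
    return result;
}
```

For example, clmul32(3, 3) is 5 (binary 11 XOR 110 = 101), while the ordinary product is 9. The magic-number trick relies on the ordinary product, which is why the SIMD path prints garbage.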
Title: Re: Division by multiplying with a magic number
Post by: Biterider on May 02, 2022, 03:52:16 AM
Hi
Yes, pclmullqlqdq is not the instruction you are looking for. It performs a polynomial multiplication over GF(2).
As mentioned by InfiniteLoop, for this implementation you need (64bit) x (64bit) = high64(128bit).

This article may help: https://stackoverflow.com/questions/28868367/getting-the-high-part-of-64-bit-integer-multiplication/50958815#50958815

Interesting to note that C intrinsics fall back on a simple mul instruction for a 128-bit (RDX::RAX) multiplication.

Biterider
Title: Re: Division by multiplying with a magic number
Post by: jj2007 on May 02, 2022, 06:14:58 AM
Thanks, InfiniteLoop and Biterider :thumbsup:

"Interesting to note that C intrinsics fall back on a simple mul instruction for a 128-bit (RDX::RAX) multiplication"

I know, this works fine, but I need it for my 32-bit library. Bad luck :sad:
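
One possible route for a 32-bit target, sketched in C rather than MASM: build the high 64 bits of the 64 x 64 product from four 32 x 32 -> 64 partial products, which is the kind of multiply a plain 32-bit mul provides. The names mulhi64 and div10_u64 are made up for illustration; the constant is the Magic value from the first post. This is only a sketch of the idea, not a tuned library routine.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of a * b using only 32 x 32 -> 64 products. */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo; /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi; /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo; /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi; /* contributes to bits 64..127 */

    /* Sum of three values below 2^32 each, so it cannot overflow 64 bits;
       its upper half is the carry into bit 64. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Hypothetical helper: unsigned 64-bit division by 10,
   using 0xCCCCCCCCCCCCCCCD = ceil(2^67 / 10). */
uint64_t div10_u64(uint64_t n)
{
    return mulhi64(n, 0xCCCCCCCCCCCCCCCDull) >> 3;
}
```

With this, div10_u64(123456789) gives 12345678, the same result as the DWORD version.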