The MASM Forum

General => The Workshop => Topic started by: Leo on January 23, 2014, 05:05:26 AM

Title: Packed integer multiplication
Post by: Leo on January 23, 2014, 05:05:26 AM
I can't find an instruction to perform a very simple task that I need.
I have 4 integers in xmm0 and another 4 in xmm1. They are all positive. I need to
multiply the two registers and get the 4 resulting 32bit values. I found a way of doing it using PMADDWD. The problem is that it works only if all the input numbers are at most 15 bits long. Is there a solution for larger numbers?

Thanks
Title: Re: Packed integer multiplication
Post by: dedndave on January 23, 2014, 05:17:39 AM
looks like you have a few options....

SSE4 may have the instruction you want, but isn't supported by all processors that are in common use
(PMULDQ or PMULLD)

SSE2....
you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle
or, you can shuffle, then use PMULUDQ to do 2 at a time (32-bit to 64-bit result)

doesn't look like SSE3 offers much help
Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 05:20:02 AM
Like this?

include \masm32\MasmBasic\MasmBasic.inc        ; download (http://masm32.com/board/index.php?topic=94.0)

.data
MyO0        OWORD 0FFFFFF00FFFFFF00FFFFFF00FFFFFF00h
MyO1        OWORD 0F1F1F100F1F1F100F1F1F100F1F1F100h

        Init
        movups xmm0, MyO0
        movups xmm1, MyO1
        pmuludq xmm0, xmm1
        Inkey Hex$(xmm0)
        Exit
end start


Output: F1F1F00E 0E0F0000 F1F1F00E 0E0F0000

P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.
Title: Re: Packed integer multiplication
Post by: dedndave on January 23, 2014, 05:21:41 AM
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results
Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 05:35:16 AM
Quote from: dedndave on January 23, 2014, 05:21:41 AM
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results
That should work with the code above...

QuoteMultiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1

Bad luck :(

Depending on the application, a simple mulps might do the job, but it means a dword to single conversion.
Title: Re: Packed integer multiplication
Post by: dedndave on January 23, 2014, 07:28:33 AM
Quoteyou can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle

that may be the solution
Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 12:28:53 PM
Post it, and I'll add it to the timings :greensml:

include \masm32\MasmBasic\MasmBasic.inc        ; download (http://masm32.com/board/index.php?topic=94.0)
.data
MyO0        dd 10001, 10002, 10003, 10004
MyQ1        dd 10000, 20000, 30000, 40000

.data?
dest0        dd ?
dest1        dd ?
dest2        dd ?
dest3        dd ?

  Init
  movups xmm0, OWORD PTR MyO0
  movups xmm1, OWORD PTR MyQ1
  ; Convert Packed Signed Doubleword Integers to Packed Single-Precision Floating-Point Values
  cvtdq2ps xmm0, xmm0
  cvtdq2ps xmm1, xmm1
  ; Packed Single-Precision Floating-Point Multiply
  mulps xmm0, xmm1
  ; Convert Packed Single-Precision Floating-Point Values to Packed Doubleword Integers
  cvtps2dq xmm0, xmm0
  movups oword ptr dest0, xmm0
  Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n\n", dest3)
  movups xmm0, OWORD PTR MyO0
  movups xmm1, OWORD PTR MyQ1
  ; Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, and add the adjacent doubleword results
  pmaddwd xmm0, xmm1
  movups oword ptr dest0, xmm0
  Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n", dest3)
  Exit
end start

Output:
D0=     100010000
D1=     200040000
D2=     300089984
D3=     400160000

D0=     100010000
D1=     200040000
D2=     300090000
D3=     -255462144


Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1144    cycles for 100 * cvtdq2ps & mulps
823     cycles for 100 * pmaddwd
1625    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords

1136    cycles for 100 * cvtdq2ps & mulps
816     cycles for 100 * pmaddwd
1608    cycles for 100 * pmuludq to dwords
1583    cycles for 100 * pmuludq to qwords

1135    cycles for 100 * cvtdq2ps & mulps
816     cycles for 100 * pmaddwd
1607    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords


EDIT: Added pmuludq to the timings (source attached). With pmuludq, results can exceed 32 bits, in case that is needed. It is even a bit faster than pmuludq to dwords, see timings above.

@Leo: Could you please tell us more about your requirements, e.g. allowed ranges, required precision, etc? Thanks.

And btw, welcome to the forum :icon14:
Title: Re: Packed integer multiplication
Post by: MichaelW on January 23, 2014, 05:04:46 PM
Why use movups when you can easily align your data?
Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 09:37:40 PM
Because the OP might have unaligned data, and because the difference is small (2%).
Title: Re: Packed integer multiplication
Post by: hool on January 29, 2014, 03:24:50 AM
        movdqu  xmm0, [rsp]             ;   a3      a2      a1      a0
        movdqu  xmm1, [rsp+16]          ;   b3      b2      b1      b0 
        pshufd  xmm2, xmm0, 110001b     ;   _       a3      _       a1
        pshufd  xmm3, xmm1, 110001b     ;   _       b3      _       b1
        pmuludq xmm0, xmm1              ;   _     a2*b2     _     a0*b0   (only low 32bit considered)
        pmuludq xmm3, xmm2              ;   _     a3*b3     _     a1*b1
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0   
Title: Re: Packed integer multiplication
Post by: Gunther on January 29, 2014, 04:01:32 AM
Hi hool,

good catch. Will work.  :t

Gunther
Title: Re: Packed integer multiplication
Post by: jj2007 on January 29, 2014, 07:28:37 AM
Hi hool & qWord,

It works:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1140    cycles for 100 * cvtdq2ps & mulps
819     cycles for 100 * pmaddwd
1613    cycles for 100 * pmuludq to dwords
1580    cycles for 100 * pmuludq to qwords
2083    cycles for 100 * pshufd & pmuludq (hool)
2053    cycles for 100 * pshufd & pmuludq (qWord)

1145    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1615    cycles for 100 * pmuludq to dwords
1591    cycles for 100 * pmuludq to qwords
2100    cycles for 100 * pshufd & pmuludq (hool)
2058    cycles for 100 * pshufd & pmuludq (qWord)

1142    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1609    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords
2084    cycles for 100 * pshufd & pmuludq (hool)
2049    cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)
Title: Re: Packed integer multiplication
Post by: qWord on January 29, 2014, 07:37:32 AM
Quote from: hool on January 29, 2014, 03:24:50 AM
;...
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0   

shifts are "slow" and not necessary:
movdqu\a xmm0,pdw_a
movdqu\a xmm1,pdw_b
pshufd xmm2,xmm0,11110101y
pshufd xmm3,xmm1,11110101y
pmuludq xmm0,xmm1
pmuludq xmm2,xmm3
pshufd xmm0,xmm0,1000y
pshufd xmm2,xmm2,1000y
punpckldq xmm0,xmm2
movdqu\a pdw_c,xmm0
Title: Re: Packed integer multiplication
Post by: jj2007 on January 29, 2014, 07:51:01 AM
OK, both hool's and qWord's algo are integrated, see above. Time for some timings ;-)
Title: Re: Packed integer multiplication
Post by: dedndave on January 29, 2014, 09:42:12 AM
haven't seen the original poster since first post   :P
Title: Re: Packed integer multiplication
Post by: dedndave on January 29, 2014, 09:44:42 AM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

1468    cycles for 100 * cvtdq2ps & mulps
1455    cycles for 100 * pmaddwd
2004    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
1990    cycles for 100 * pshufd & pmuludq (hool)
1903    cycles for 100 * pshufd & pmuludq (qWord)

1494    cycles for 100 * cvtdq2ps & mulps
1475    cycles for 100 * pmaddwd
2031    cycles for 100 * pmuludq to dwords
2507    cycles for 100 * pmuludq to qwords
1998    cycles for 100 * pshufd & pmuludq (hool)
1901    cycles for 100 * pshufd & pmuludq (qWord)

1457    cycles for 100 * cvtdq2ps & mulps
1457    cycles for 100 * pmaddwd
2047    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
2000    cycles for 100 * pshufd & pmuludq (hool)
1936    cycles for 100 * pshufd & pmuludq (qWord)
Title: Re: Packed integer multiplication
Post by: hool on January 29, 2014, 11:02:50 AM
Its probably worth mentioning that all these versions have different areas of use.

cvtdq2ps/cvtps2dq for example will cause precision loss at some point (failes to convert 0x1000001 back and forth)
pmuludq to (d/q)words relies on memory to produce result
and my version keeps stuff in registers (for better or worse)   

AMD FX(tm)-8320 Eight-Core Processor            (SSE4)

370     cycles for 100 * cvtdq2ps & mulps
409     cycles for 100 * pmaddwd
623     cycles for 100 * pmuludq to dwords
490     cycles for 100 * pmuludq to qwords
703     cycles for 100 * pshufd & pmuludq (hool)
613     cycles for 100 * pshufd & pmuludq (qWord)

450     cycles for 100 * cvtdq2ps & mulps
312     cycles for 100 * pmaddwd
609     cycles for 100 * pmuludq to dwords
511     cycles for 100 * pmuludq to qwords
567     cycles for 100 * pshufd & pmuludq (hool)
529     cycles for 100 * pshufd & pmuludq (qWord)

468     cycles for 100 * cvtdq2ps & mulps
357     cycles for 100 * pmaddwd
659     cycles for 100 * pmuludq to dwords
540     cycles for 100 * pmuludq to qwords
563     cycles for 100 * pshufd & pmuludq (hool)
581     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)


You may notice some irregularities in numbers that because I have some heavy programs running alongside. Its very real-life example how CPU reorders instructions as it pleases.
Title: Re: Packed integer multiplication
Post by: FORTRANS on January 30, 2014, 12:42:35 AM

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

1170 cycles for 100 * cvtdq2ps & mulps
728 cycles for 100 * pmaddwd
1756 cycles for 100 * pmuludq to dwords
1682 cycles for 100 * pmuludq to qwords
2141 cycles for 100 * pshufd & pmuludq (hool)
2105 cycles for 100 * pshufd & pmuludq (qWord)

1166 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1749 cycles for 100 * pmuludq to dwords
1650 cycles for 100 * pmuludq to qwords
2144 cycles for 100 * pshufd & pmuludq (hool)
2099 cycles for 100 * pshufd & pmuludq (qWord)

1164 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1762 cycles for 100 * pmuludq to dwords
1646 cycles for 100 * pmuludq to qwords
2139 cycles for 100 * pshufd & pmuludq (hool)
2098 cycles for 100 * pshufd & pmuludq (qWord)

36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)

400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)

--- ok ---
Title: Re: Packed integer multiplication
Post by: jj2007 on January 30, 2014, 01:31:49 AM
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1347    cycles for 100 * cvtdq2ps & mulps
613     cycles for 100 * pmaddwd
1256    cycles for 100 * pmuludq to dwords
1020    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1274    cycles for 100 * pshufd & pmuludq (qWord)

1347    cycles for 100 * cvtdq2ps & mulps
614     cycles for 100 * pmaddwd
1247    cycles for 100 * pmuludq to dwords
1019    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1270    cycles for 100 * pshufd & pmuludq (qWord)
Title: Re: Packed integer multiplication
Post by: Gunther on January 30, 2014, 04:15:26 AM

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

281     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
401     cycles for 100 * pmuludq to dwords
306     cycles for 100 * pmuludq to qwords
390     cycles for 100 * pshufd & pmuludq (hool)
317     cycles for 100 * pshufd & pmuludq (qWord)

277     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
404     cycles for 100 * pmuludq to dwords
305     cycles for 100 * pmuludq to qwords
398     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

286     cycles for 100 * cvtdq2ps & mulps
221     cycles for 100 * pmaddwd
407     cycles for 100 * pmuludq to dwords
310     cycles for 100 * pmuludq to qwords
389     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

--- ok ---


Gunther
Title: Re: Packed integer multiplication
Post by: Leo on February 19, 2014, 02:06:03 AM
Thanks a lot everybody!
I saw the first response, checked the newer instructions that came since SSE2 and decided to use SSE 4.1 both since it had pmulld, which did just what I wanted and since it had some other instructions, which helped me as well. The fallback SSE2 version uses pmaddwd with a check before the algorithm that gives an error message if the data is too large. That's sufficient for what I wanted. Thank a lot everybody for the responses!
Title: Re: Packed integer multiplication
Post by: MichaelW on February 19, 2014, 07:26:40 AM
Quote from: jj2007 on January 23, 2014, 05:20:02 AM
P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.

Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.
Title: Re: Packed integer multiplication
Post by: Gunther on February 19, 2014, 09:02:57 AM
Michael,

Quote from: MichaelW on February 19, 2014, 07:26:40 AM
Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.

that sounds a bit brain damaged. Is it another MS mess?

Gunther
Title: Re: Packed integer multiplication
Post by: jj2007 on January 29, 2016, 06:41:58 AM
Warming up an old thread :badgrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
480     cycles for 100 * cvtdq2ps & mulps
14975   cycles for 100 * mulps
1482    cycles for 100 * pmuludq to dwords
99      cycles for 100 * pmuludq to qwords
592     cycles for 100 * pshufd & pmuludq (hool)
541     cycles for 100 * pshufd & pmuludq (qWord)

479     cycles for 100 * cvtdq2ps & mulps
14966   cycles for 100 * mulps
908     cycles for 100 * pmuludq to dwords
617     cycles for 100 * pmuludq to qwords
1244    cycles for 100 * pshufd & pmuludq (hool)
544     cycles for 100 * pshufd & pmuludq (qWord)

478     cycles for 100 * cvtdq2ps & mulps
14956   cycles for 100 * mulps
910     cycles for 100 * pmuludq to dwords
72      cycles for 100 * pmuludq to qwords
593     cycles for 100 * pshufd & pmuludq (hool)
540     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
26      bytes for mulps
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

4001600 = eax cvtdq2ps & mulps
4001600 = eax mulps
4001600 = eax pmuludq to dwords
4001600 = eax pmuludq to qwords
4001600 = eax pshufd & pmuludq (hool)
4001600 = eax pshufd & pmuludq (qWord)


The reason is that I discovered you can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW, at least on my i5. Could be a stall or whatever. Second restriction is that the result should be below 16Mio. So it is not really a good choice, unless you want to scale rectangles and you are sure that none of the four dwords is negative 8)
Title: Re: Packed integer multiplication
Post by: qWord on January 29, 2016, 06:56:10 AM
Quote from: jj2007 on January 29, 2016, 06:41:58 AMyou can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW
not surprisingly  when working with denormalized  (positive) numbers as operand - even thought there is an (commonly masked) exception for such operands.