Print Page - Packed integer multiplication

Title: Packed integer multiplication
Post by: Leo on January 23, 2014, 05:05:26 AM

I can't find an instruction to perform a very simple task that I need.
I have 4 integers in xmm0 and another 4 in xmm1. They are all positive. I need to
multiply the two registers and get the 4 resulting 32bit values. I found a way of doing it using PMADDWD. The problem is that it works only if all the input numbers are at most 15 bits long. Is there a solution for larger numbers?

Thanks

Title: Re: Packed integer multiplication
Post by: dedndave on January 23, 2014, 05:17:39 AM

looks like you have a few options....

SSE4 may have the instruction you want, but isn't supported by all processors that are in common use
(PMULDQ or PMULLD)

SSE2....
you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle
or, you can shuffle, then use PMULUDQ to do 2 at a time (32-bit to 64-bit result)

doesn't look like SSE3 offers much help

Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 05:20:02 AM

Like this?

include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)

.data
MyO0 OWORD 0FFFFFF00FFFFFF00FFFFFF00FFFFFF00h
MyO1 OWORD 0F1F1F100F1F1F100F1F1F100F1F1F100h

Init
movups xmm0, MyO0
movups xmm1, MyO1
pmuludq xmm0, xmm1
Inkey Hex$(xmm0)
Exit
end start

Output: F1F1F00E 0E0F0000 F1F1F00E 0E0F0000

P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.

Title: Re: Packed integer multiplication
Post by: dedndave on January 23, 2014, 05:21:41 AM

sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results

Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 05:35:16 AM

Quote from: dedndave on January 23, 2014, 05:21:41 AM
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results

~~That should work with the code above...~~

QuoteMultiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1

Bad luck :(

Depending on the application, a simple mulps might do the job, but it means a dword to single conversion.

Title: Re: Packed integer multiplication
Post by: dedndave on January 23, 2014, 07:28:33 AM

Quoteyou can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle

that may be the solution

Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 12:28:53 PM

Post it, and I'll add it to the timings :greensml:

include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
.data
MyO0 dd 10001, 10002, 10003, 10004
MyQ1 dd 10000, 20000, 30000, 40000

.data?
dest0 dd ?
dest1 dd ?
dest2 dd ?
dest3 dd ?

Init
movups xmm0, OWORD PTR MyO0
movups xmm1, OWORD PTR MyQ1
; Convert Packed Signed Doubleword Integers to Packed Single-Precision Floating-Point Values
cvtdq2ps xmm0, xmm0
cvtdq2ps xmm1, xmm1
; Packed Single-Precision Floating-Point Multiply
mulps xmm0, xmm1
; Convert Packed Single-Precision Floating-Point Values to Packed Doubleword Integers
cvtps2dq xmm0, xmm0
movups oword ptr dest0, xmm0
Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n\n", dest3)
movups xmm0, OWORD PTR MyO0
movups xmm1, OWORD PTR MyQ1
; Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, and add the adjacent doubleword results
pmaddwd xmm0, xmm1
movups oword ptr dest0, xmm0
Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n", dest3)
Exit
end start

Output:
D0= 100010000
D1= 200040000
D2= 300089984
D3= 400160000

D0= 100010000
D1= 200040000
D2= 300090000
D3= -255462144

Timings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)

1144 cycles for 100 * cvtdq2ps & mulps
823 cycles for 100 * pmaddwd
1625 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords

1136 cycles for 100 * cvtdq2ps & mulps
816 cycles for 100 * pmaddwd
1608 cycles for 100 * pmuludq to dwords
1583 cycles for 100 * pmuludq to qwords

1135 cycles for 100 * cvtdq2ps & mulps
816 cycles for 100 * pmaddwd
1607 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords

36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords

400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords

EDIT: Added pmuludq to the timings (source attached). With pmuludq, results can exceed 32 bits, in case that is needed. It is even a bit faster than pmuludq to dwords, see timings above.

@Leo: Could you please tell us more about your requirements, e.g. allowed ranges, required precision, etc? Thanks.

And btw, welcome to the forum :icon14:

Title: Re: Packed integer multiplication
Post by: MichaelW on January 23, 2014, 05:04:46 PM

Why use movups when you can easily align your data?

Title: Re: Packed integer multiplication
Post by: jj2007 on January 23, 2014, 09:37:40 PM

Because the OP might have unaligned data, and because the difference is small (2%).

Title: Re: Packed integer multiplication
Post by: hool on January 29, 2014, 03:24:50 AM

Code Select

        movdqu  xmm0, [rsp]             ;   a3      a2      a1      a0
        movdqu  xmm1, [rsp+16]          ;   b3      b2      b1      b0  
        pshufd  xmm2, xmm0, 110001b     ;   _       a3      _       a1
        pshufd  xmm3, xmm1, 110001b     ;   _       b3      _       b1
        pmuludq xmm0, xmm1              ;   _     a2*b2     _     a0*b0   (only low 32bit considered)
        pmuludq xmm3, xmm2              ;   _     a3*b3     _     a1*b1
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0

Title: Re: Packed integer multiplication
Post by: Gunther on January 29, 2014, 04:01:32 AM

Hi hool,

good catch. Will work. :t

Gunther

Title: Re: Packed integer multiplication
Post by: jj2007 on January 29, 2014, 07:28:37 AM

Hi hool & qWord,

It works:

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1140    cycles for 100 * cvtdq2ps & mulps
819     cycles for 100 * pmaddwd
1613    cycles for 100 * pmuludq to dwords
1580    cycles for 100 * pmuludq to qwords
2083    cycles for 100 * pshufd & pmuludq (hool)
2053    cycles for 100 * pshufd & pmuludq (qWord)

1145    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1615    cycles for 100 * pmuludq to dwords
1591    cycles for 100 * pmuludq to qwords
2100    cycles for 100 * pshufd & pmuludq (hool)
2058    cycles for 100 * pshufd & pmuludq (qWord)

1142    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1609    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords
2084    cycles for 100 * pshufd & pmuludq (hool)
2049    cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

Title: Re: Packed integer multiplication
Post by: qWord on January 29, 2014, 07:37:32 AM

Quote from: hool on January 29, 2014, 03:24:50 AM
Code Select Expand
;... psllq xmm0, 32 ; a2*b2 0 a0*b0 0 psllq xmm3, 32 ; a3*b3 0 a1*b1 0 pshufd xmm0, xmm0, 110001b ; 0 a2*b2 0 a0*b0 por xmm0, xmm3 ; a3*b3 a2*b2 a1*b1 a0*b0

shifts are "slow" and not necessary:

Code Select

	movdqu\a xmm0,pdw_a
	movdqu\a xmm1,pdw_b
	pshufd xmm2,xmm0,11110101y
	pshufd xmm3,xmm1,11110101y
	pmuludq xmm0,xmm1
	pmuludq xmm2,xmm3
	pshufd xmm0,xmm0,1000y
	pshufd xmm2,xmm2,1000y
	punpckldq xmm0,xmm2
	movdqu\a pdw_c,xmm0

Title: Re: Packed integer multiplication
Post by: jj2007 on January 29, 2014, 07:51:01 AM

OK, both hool's and qWord's algo are integrated, see above. Time for some timings ;-)

Title: Re: Packed integer multiplication
Post by: dedndave on January 29, 2014, 09:42:12 AM

haven't seen the original poster since first post :P

Title: Re: Packed integer multiplication
Post by: dedndave on January 29, 2014, 09:44:42 AM

prescott w/htt

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

1468    cycles for 100 * cvtdq2ps & mulps
1455    cycles for 100 * pmaddwd
2004    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
1990    cycles for 100 * pshufd & pmuludq (hool)
1903    cycles for 100 * pshufd & pmuludq (qWord)

1494    cycles for 100 * cvtdq2ps & mulps
1475    cycles for 100 * pmaddwd
2031    cycles for 100 * pmuludq to dwords
2507    cycles for 100 * pmuludq to qwords
1998    cycles for 100 * pshufd & pmuludq (hool)
1901    cycles for 100 * pshufd & pmuludq (qWord)

1457    cycles for 100 * cvtdq2ps & mulps
1457    cycles for 100 * pmaddwd
2047    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
2000    cycles for 100 * pshufd & pmuludq (hool)
1936    cycles for 100 * pshufd & pmuludq (qWord)

Title: Re: Packed integer multiplication
Post by: hool on January 29, 2014, 11:02:50 AM

Its probably worth mentioning that all these versions have different areas of use.

cvtdq2ps/cvtps2dq for example will cause precision loss at some point (failes to convert 0x1000001 back and forth)
pmuludq to (d/q)words relies on memory to produce result
and my version keeps stuff in registers (for better or worse)

Code Select

AMD FX(tm)-8320 Eight-Core Processor            (SSE4)

370     cycles for 100 * cvtdq2ps & mulps
409     cycles for 100 * pmaddwd
623     cycles for 100 * pmuludq to dwords
490     cycles for 100 * pmuludq to qwords
703     cycles for 100 * pshufd & pmuludq (hool)
613     cycles for 100 * pshufd & pmuludq (qWord)

450     cycles for 100 * cvtdq2ps & mulps
312     cycles for 100 * pmaddwd
609     cycles for 100 * pmuludq to dwords
511     cycles for 100 * pmuludq to qwords
567     cycles for 100 * pshufd & pmuludq (hool)
529     cycles for 100 * pshufd & pmuludq (qWord)

468     cycles for 100 * cvtdq2ps & mulps
357     cycles for 100 * pmaddwd
659     cycles for 100 * pmuludq to dwords
540     cycles for 100 * pmuludq to qwords
563     cycles for 100 * pshufd & pmuludq (hool)
581     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

You may notice some irregularities in numbers that because I have some heavy programs running alongside. Its very real-life example how CPU reorders instructions as it pleases.

Title: Re: Packed integer multiplication
Post by: FORTRANS on January 30, 2014, 12:42:35 AM

Code Select


Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

1170	cycles for 100 * cvtdq2ps & mulps
728	cycles for 100 * pmaddwd
1756	cycles for 100 * pmuludq to dwords
1682	cycles for 100 * pmuludq to qwords
2141	cycles for 100 * pshufd & pmuludq (hool)
2105	cycles for 100 * pshufd & pmuludq (qWord)

1166	cycles for 100 * cvtdq2ps & mulps
727	cycles for 100 * pmaddwd
1749	cycles for 100 * pmuludq to dwords
1650	cycles for 100 * pmuludq to qwords
2144	cycles for 100 * pshufd & pmuludq (hool)
2099	cycles for 100 * pshufd & pmuludq (qWord)

1164	cycles for 100 * cvtdq2ps & mulps
727	cycles for 100 * pmaddwd
1762	cycles for 100 * pmuludq to dwords
1646	cycles for 100 * pmuludq to qwords
2139	cycles for 100 * pshufd & pmuludq (hool)
2098	cycles for 100 * pshufd & pmuludq (qWord)

36	bytes for cvtdq2ps & mulps
27	bytes for pmaddwd
67	bytes for pmuludq to dwords
51	bytes for pmuludq to qwords
62	bytes for pshufd & pmuludq (hool)
57	bytes for pshufd & pmuludq (qWord)

400160000	= eax cvtdq2ps & mulps
-255462144	= eax pmaddwd
400160000	= eax pmuludq to dwords
400160000	= eax pmuludq to qwords
400160000	= eax pshufd & pmuludq (hool)
400160000	= eax pshufd & pmuludq (qWord)

--- ok ---

Title: Re: Packed integer multiplication
Post by: jj2007 on January 30, 2014, 01:31:49 AM

Code Select

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1347    cycles for 100 * cvtdq2ps & mulps
613     cycles for 100 * pmaddwd
1256    cycles for 100 * pmuludq to dwords
1020    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1274    cycles for 100 * pshufd & pmuludq (qWord)

1347    cycles for 100 * cvtdq2ps & mulps
614     cycles for 100 * pmaddwd
1247    cycles for 100 * pmuludq to dwords
1019    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1270    cycles for 100 * pshufd & pmuludq (qWord)

Title: Re: Packed integer multiplication
Post by: Gunther on January 30, 2014, 04:15:26 AM

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

281     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
401     cycles for 100 * pmuludq to dwords
306     cycles for 100 * pmuludq to qwords
390     cycles for 100 * pshufd & pmuludq (hool)
317     cycles for 100 * pshufd & pmuludq (qWord)

277     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
404     cycles for 100 * pmuludq to dwords
305     cycles for 100 * pmuludq to qwords
398     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

286     cycles for 100 * cvtdq2ps & mulps
221     cycles for 100 * pmaddwd
407     cycles for 100 * pmuludq to dwords
310     cycles for 100 * pmuludq to qwords
389     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

--- ok ---

Gunther

Title: Re: Packed integer multiplication
Post by: Leo on February 19, 2014, 02:06:03 AM

Thanks a lot everybody!
I saw the first response, checked the newer instructions that came since SSE2 and decided to use SSE 4.1 both since it had pmulld, which did just what I wanted and since it had some other instructions, which helped me as well. The fallback SSE2 version uses pmaddwd with a check before the algorithm that gives an error message if the data is too large. That's sufficient for what I wanted. Thank a lot everybody for the responses!

Title: Re: Packed integer multiplication
Post by: MichaelW on February 19, 2014, 07:26:40 AM

Quote from: jj2007 on January 23, 2014, 05:20:02 AM
P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.

Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.

Title: Re: Packed integer multiplication
Post by: Gunther on February 19, 2014, 09:02:57 AM

Michael,

Quote from: MichaelW on February 19, 2014, 07:26:40 AM
Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.

that sounds a bit brain damaged. Is it another MS mess?

Gunther

Title: Re: Packed integer multiplication
Post by: jj2007 on January 29, 2016, 06:41:58 AM

Warming up an old thread :badgrin:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
480     cycles for 100 * cvtdq2ps & mulps
14975   cycles for 100 * mulps
1482    cycles for 100 * pmuludq to dwords
99      cycles for 100 * pmuludq to qwords
592     cycles for 100 * pshufd & pmuludq (hool)
541     cycles for 100 * pshufd & pmuludq (qWord)

479     cycles for 100 * cvtdq2ps & mulps
14966   cycles for 100 * mulps
908     cycles for 100 * pmuludq to dwords
617     cycles for 100 * pmuludq to qwords
1244    cycles for 100 * pshufd & pmuludq (hool)
544     cycles for 100 * pshufd & pmuludq (qWord)

478     cycles for 100 * cvtdq2ps & mulps
14956   cycles for 100 * mulps
910     cycles for 100 * pmuludq to dwords
72      cycles for 100 * pmuludq to qwords
593     cycles for 100 * pshufd & pmuludq (hool)
540     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
26      bytes for mulps
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

4001600 = eax cvtdq2ps & mulps
4001600 = eax mulps
4001600 = eax pmuludq to dwords
4001600 = eax pmuludq to qwords
4001600 = eax pshufd & pmuludq (hool)
4001600 = eax pshufd & pmuludq (qWord)

The reason is that I discovered you can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW, at least on my i5. Could be a stall or whatever. Second restriction is that the result should be below 16Mio. So it is not really a good choice, unless you want to scale rectangles and you are sure that none of the four dwords is negative 8)

Title: Re: Packed integer multiplication
Post by: qWord on January 29, 2016, 06:56:10 AM

Quote from: jj2007 on January 29, 2016, 06:41:58 AMyou can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW

not surprisingly when working with denormalized (positive) numbers as operand - even thought there is an (commonly masked) exception for such operands.

The MASM Forum

General => The Workshop => Topic started by: Leo on January 23, 2014, 05:05:26 AM