Packed integer multiplication

Leo · January 23, 2014, 05:05:26 AM

I can't find an instruction for a very simple task.
I have 4 integers in xmm0 and another 4 in xmm1, all positive. I need to multiply the two registers element by element and get the 4 resulting 32-bit values. I found a way of doing it using PMADDWD, but it works only if all the input numbers are at most 15 bits long. Is there a solution for larger numbers?

Thanks
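
For reference, the 15-bit limit comes from the way PMADDWD treats each 32-bit lane: as two signed 16-bit words, computing lo*lo + hi*hi. With the high words zero, this is a plain multiply only while both low words are below 32768. A minimal sketch in C with SSE2 intrinsics, assuming an x86 compiler that provides <emmintrin.h>; the helper name mul4_pmaddwd is hypothetical:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available) */
#include <stdint.h>

/* Hypothetical helper: "multiply" four dword lanes with PMADDWD. */
void mul4_pmaddwd(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PMADDWD: every dword is read as two SIGNED 16-bit words;
       the result per lane is lo(a)*lo(b) + hi(a)*hi(b). */
    __m128i prod = _mm_madd_epi16(va, vb);
    _mm_storeu_si128((__m128i *)out, prod);
}
```

With a = {10001, 10002, 10003, 10004} and b = {10000, 20000, 30000, 40000}, the first three lanes come out right, but the last one yields -255462144, because 40000 is read as the signed word -25536.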

dedndave · January 23, 2014, 05:17:39 AM

looks like you have a few options....

SSE4.1 has the instruction you want, but isn't supported by all processors that are in common use
(PMULLD for 4 low 32-bit results, or PMULDQ for 2 signed 64-bit results)

SSE2....
you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle
or, you can shuffle, then use PMULUDQ to do 2 at a time (32-bit to 64-bit result)

doesn't look like SSE3 offers much help
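
The PMULLW plus high-word route can be sketched as follows for the 16-bit unsigned case, in C with SSE2 intrinsics. Assumptions: each input sits in a 32-bit lane with its upper word zero, and PMULHUW (_mm_mulhi_epu16), the unsigned counterpart of PMULHW, is used so that inputs with bit 15 set still work. A shift and an OR stand in for the shuffle here, and the helper name mul4_u16_to_u32 is made up for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available) */
#include <stdint.h>

/* Hypothetical helper: four unsigned 16-bit values (one per dword lane,
   upper word zero) multiplied into four full 32-bit products. */
void mul4_u16_to_u32(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i lo = _mm_mullo_epi16(va, vb);  /* PMULLW:  low 16 bits of each word product  */
    __m128i hi = _mm_mulhi_epu16(va, vb);  /* PMULHUW: high 16 bits, unsigned            */
    hi = _mm_slli_epi32(hi, 16);           /* PSLLD:   move the high word into bits 31..16 */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(lo, hi));  /* POR: combine both halves */
}
```

For example, 40000 * 10004 gives 400160000 this way, and 65535 * 65535 gives 4294836225.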

jj2007 · January 23, 2014, 05:20:02 AM

Like this?

include \masm32\MasmBasic\MasmBasic.inc ; download

.data
MyO0 OWORD 0FFFFFF00FFFFFF00FFFFFF00FFFFFF00h
MyO1 OWORD 0F1F1F100F1F1F100F1F1F100F1F1F100h

Init
movups xmm0, MyO0
movups xmm1, MyO1
pmuludq xmm0, xmm1
Inkey Hex$(xmm0)
Exit
end start

Output: F1F1F00E 0E0F0000 F1F1F00E 0E0F0000

P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.
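
What the output above shows is that PMULUDQ reads only dwords 0 and 2 of each register and writes two 64-bit products. A small C sketch of the same behaviour with the _mm_mul_epu32 intrinsic (the helper name mul2_pmuludq is hypothetical; <emmintrin.h> is assumed to be available):

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available) */
#include <stdint.h>

/* Hypothetical helper: PMULUDQ on two registers of four dwords each.
   Only lanes 0 and 2 are multiplied; lanes 1 and 3 are ignored. */
void mul2_pmuludq(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i prod = _mm_mul_epu32(va, vb);   /* out[0] = a[0]*b[0], out[1] = a[2]*b[2] */
    _mm_storeu_si128((__m128i *)out, prod);
}
```

For inputs {3, 99, 5, 99} and {7, 99, 11, 99} this stores 21 and 55; the 99s in the odd lanes never take part.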

dedndave · January 23, 2014, 05:21:41 AM

sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results

jj2007 · January 23, 2014, 05:35:16 AM

Quote from: dedndave on January 23, 2014, 05:21:41 AM
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results

~~That should work with the code above...~~

Quote: Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1

Bad luck :(

Depending on the application, a simple mulps might do the job, but it means a dword to single conversion.
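
A sketch of the mulps route in C with SSE2 intrinsics, assuming <emmintrin.h> is available; the helper name mul4_mulps is hypothetical. The caveat is precision: a single has a 24-bit significand, so products above 2^24 may get rounded.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available) */
#include <stdint.h>

/* Hypothetical helper: multiply four signed dwords via single-precision floats. */
void mul4_mulps(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128 fa = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)a));  /* CVTDQ2PS */
    __m128 fb = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)b));  /* CVTDQ2PS */
    __m128 prod = _mm_mul_ps(fa, fb);                                  /* MULPS    */
    /* CVTPS2DQ: back to integers, using the current rounding mode. */
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(prod));
}
```

For example, 10003 * 30000 = 300090000 is not representable as a single and comes back as 300089984, while 10004 * 40000 = 400160000 happens to be exact.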

dedndave · January 23, 2014, 07:28:33 AM

Quote: you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle

that may be the solution

jj2007 · January 23, 2014, 12:28:53 PM

Post it, and I'll add it to the timings

include \masm32\MasmBasic\MasmBasic.inc ; download
.data
MyO0 dd 10001, 10002, 10003, 10004
MyQ1 dd 10000, 20000, 30000, 40000

.data?
dest0 dd ?
dest1 dd ?
dest2 dd ?
dest3 dd ?

Init
movups xmm0, OWORD PTR MyO0
movups xmm1, OWORD PTR MyQ1
; Convert Packed Signed Doubleword Integers to Packed Single-Precision Floating-Point Values
cvtdq2ps xmm0, xmm0
cvtdq2ps xmm1, xmm1
; Packed Single-Precision Floating-Point Multiply
mulps xmm0, xmm1
; Convert Packed Single-Precision Floating-Point Values to Packed Doubleword Integers
cvtps2dq xmm0, xmm0
movups oword ptr dest0, xmm0
Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n\n", dest3)
movups xmm0, OWORD PTR MyO0
movups xmm1, OWORD PTR MyQ1
; Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, and add the adjacent doubleword results
pmaddwd xmm0, xmm1
movups oword ptr dest0, xmm0
Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n", dest3)
Exit
end start

Output (first block: cvtdq2ps & mulps, second block: pmaddwd):
D0= 100010000
D1= 200040000
D2= 300089984
D3= 400160000

D0= 100010000
D1= 200040000
D2= 300090000
D3= -255462144

Timings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)

1144 cycles for 100 * cvtdq2ps & mulps
823 cycles for 100 * pmaddwd
1625 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords

1136 cycles for 100 * cvtdq2ps & mulps
816 cycles for 100 * pmaddwd
1608 cycles for 100 * pmuludq to dwords
1583 cycles for 100 * pmuludq to qwords

1135 cycles for 100 * cvtdq2ps & mulps
816 cycles for 100 * pmaddwd
1607 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords

36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords

400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords

EDIT: Added pmuludq to the timings (source attached). With pmuludq to qwords, results can exceed 32 bits, in case that is needed. It is even a bit faster than pmuludq to dwords, see timings above.

@Leo: Could you please tell us more about your requirements, e.g. allowed ranges, required precision, etc? Thanks.

And btw, welcome to the forum

MichaelW · January 23, 2014, 05:04:46 PM

Why use movups when you can easily align your data?

jj2007 · January 23, 2014, 09:37:40 PM

Because the OP might have unaligned data, and because the difference is small (2%).

hool · January 29, 2014, 03:24:50 AM

        movdqu  xmm0, [rsp]             ;   a3      a2      a1      a0
        movdqu  xmm1, [rsp+16]          ;   b3      b2      b1      b0  
        pshufd  xmm2, xmm0, 110001b     ;   _       a3      _       a1
        pshufd  xmm3, xmm1, 110001b     ;   _       b3      _       b1
        pmuludq xmm0, xmm1              ;   _     a2*b2     _     a0*b0   (only low 32bit considered)
        pmuludq xmm3, xmm2              ;   _     a3*b3     _     a1*b1
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0
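
The same sequence written with SSE2 intrinsics in C, as a sketch for trying it outside MASM. The helper name mul4_shift_or is hypothetical and <emmintrin.h> is assumed to be available; each intrinsic is annotated with the instruction it stands for, although a compiler is free to pick different ones:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available) */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of four 32x32 products, shift-and-OR variant. */
void mul4_shift_or(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* a3 a2 a1 a0 */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);   /* b3 b2 b1 b0 */
    __m128i a_odd = _mm_shuffle_epi32(va, 0x31);        /* pshufd 110001b: _ a3 _ a1 */
    __m128i b_odd = _mm_shuffle_epi32(vb, 0x31);        /* pshufd 110001b: _ b3 _ b1 */
    __m128i even = _mm_mul_epu32(va, vb);               /* pmuludq: a2*b2, a0*b0 as qwords */
    __m128i odd  = _mm_mul_epu32(b_odd, a_odd);         /* pmuludq: a3*b3, a1*b1 as qwords */
    even = _mm_slli_epi64(even, 32);                    /* psllq: low halves into dwords 3, 1 */
    odd  = _mm_slli_epi64(odd, 32);                     /* psllq: low halves into dwords 3, 1 */
    even = _mm_shuffle_epi32(even, 0x31);               /* pshufd: back into dwords 2, 0 */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(even, odd));  /* por */
}
```

Each output lane holds the product modulo 2^32, so for inputs that fit the result range (such as 10004 * 40000 = 400160000) it is the exact product.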

Gunther · January 29, 2014, 04:01:32 AM

Hi hool,

good catch. Will work.

Gunther

jj2007 · January 29, 2014, 07:28:37 AM

Hi hool & qWord,

It works:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1140    cycles for 100 * cvtdq2ps & mulps
819     cycles for 100 * pmaddwd
1613    cycles for 100 * pmuludq to dwords
1580    cycles for 100 * pmuludq to qwords
2083    cycles for 100 * pshufd & pmuludq (hool)
2053    cycles for 100 * pshufd & pmuludq (qWord)

1145    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1615    cycles for 100 * pmuludq to dwords
1591    cycles for 100 * pmuludq to qwords
2100    cycles for 100 * pshufd & pmuludq (hool)
2058    cycles for 100 * pshufd & pmuludq (qWord)

1142    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1609    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords
2084    cycles for 100 * pshufd & pmuludq (hool)
2049    cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

qWord · January 29, 2014, 07:37:32 AM

Quote from: hool on January 29, 2014, 03:24:50 AM
        ;...
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0

shifts are "slow" and not necessary:

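	; movdqu\a = movdqu or movdqa, depending on alignment
	movdqu\a xmm0,pdw_a             ;   a3      a2      a1      a0
	movdqu\a xmm1,pdw_b             ;   b3      b2      b1      b0
	pshufd xmm2,xmm0,11110101y      ;   a3      a3      a1      a1
	pshufd xmm3,xmm1,11110101y      ;   b3      b3      b1      b1
	pmuludq xmm0,xmm1               ;   a2*b2 (qword)   a0*b0 (qword)
	pmuludq xmm2,xmm3               ;   a3*b3 (qword)   a1*b1 (qword)
	pshufd xmm0,xmm0,1000y          ;   _       _     a2*b2   a0*b0   (low dwords)
	pshufd xmm2,xmm2,1000y          ;   _       _     a3*b3   a1*b1   (low dwords)
	punpckldq xmm0,xmm2             ; a3*b3   a2*b2   a1*b1   a0*b0
	movdqu\a pdw_c,xmm0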
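
As a sketch, the shift-free variant maps onto SSE2 intrinsics in C like this. The helper name mul4_shuffle_unpack is hypothetical, <emmintrin.h> is assumed to be available, and unaligned loads and stores are used throughout:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available) */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of four 32x32 products, shuffle-and-unpack variant. */
void mul4_shuffle_unpack(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* a3 a2 a1 a0 */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);   /* b3 b2 b1 b0 */
    __m128i a_odd = _mm_shuffle_epi32(va, 0xF5);        /* pshufd 11110101y: a3 a3 a1 a1 */
    __m128i b_odd = _mm_shuffle_epi32(vb, 0xF5);        /* pshufd 11110101y: b3 b3 b1 b1 */
    __m128i even = _mm_mul_epu32(va, vb);               /* pmuludq: a2*b2, a0*b0 as qwords */
    __m128i odd  = _mm_mul_epu32(a_odd, b_odd);         /* pmuludq: a3*b3, a1*b1 as qwords */
    even = _mm_shuffle_epi32(even, 0x08);               /* pshufd 1000y: low dwords into lanes 1, 0 */
    odd  = _mm_shuffle_epi32(odd, 0x08);                /* pshufd 1000y: low dwords into lanes 1, 0 */
    /* punpckldq: interleave the low two dwords of even and odd. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi32(even, odd));
}
```

As with the shift variant, each lane is the product modulo 2^32.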

jj2007 · January 29, 2014, 07:51:01 AM

OK, both hool's and qWord's algo are integrated, see above. Time for some timings ;-)

dedndave · January 29, 2014, 09:42:12 AM

haven't seen the original poster since first post :P

The MASM Forum
