Packed integer multiplication

Started by Leo, January 23, 2014, 05:05:26 AM

Leo

I can't find an instruction to perform a very simple task that I need.
I have 4 integers in xmm0 and another 4 in xmm1. They are all positive. I need to multiply the two registers and get the 4 resulting 32-bit values. I found a way of doing it using PMADDWD. The problem is that it works only if all the input numbers are at most 15 bits long. Is there a solution for larger numbers?

Thanks
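
For reference, the 15-bit limit follows from how PMADDWD reads its operands. The sketch below is a scalar C model of one 32-bit lane, not code from the thread; `sign16` and `pmaddwd_lane` are hypothetical names. Each dword is split into two signed 16-bit words, so an input with bit 15 set is read as a negative number, and anything above bit 15 lands in the second word.

```c
#include <stdint.h>

/* Interpret the low 16 bits of x as a signed word, as PMADDWD does. */
static int32_t sign16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Scalar model of one dword lane of PMADDWD:
   low*low + high*high, with all four words treated as signed.
   The single overflow corner case (all four words equal to 8000h)
   is not modelled here. */
int32_t pmaddwd_lane(uint32_t a, uint32_t b)
{
    return sign16(a) * sign16(b) + sign16(a >> 16) * sign16(b >> 16);
}
```

With both inputs at most 15 bits wide, the high words are zero and the low words are non-negative, so the lane holds the plain product. An input such as 40000 already has bit 15 set and is read as -25536.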

dedndave

looks like you have a few options....

SSE4.1 may have the instruction you want (PMULDQ or PMULLD), but it isn't supported by all processors that are in common use

SSE2....
you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle
or, you can shuffle, then use PMULUDQ to do 2 at a time (32-bit to 64-bit result)

doesn't look like SSE3 offers much help
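
A rough scalar sketch of the first SSE2 option, for one lane with 16-bit inputs. This is an illustration in C rather than the poster's code, and the function names are hypothetical. It assumes unsigned inputs, which corresponds to PMULHUW; PMULHW treats the words as signed and would only match for values below 8000h.

```c
#include <stdint.h>

/* Low 16 bits of the 16x16 product: what PMULLW keeps per lane. */
uint16_t pmullw_lane(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) & 0xFFFFu);
}

/* High 16 bits of the unsigned 16x16 product: what PMULHUW keeps per lane. */
uint16_t pmulhuw_lane(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Recombine both halves into the 32-bit result. Across a whole register
   this is the job of the interleaving step (for example PUNPCKLWD and
   PUNPCKHWD on the two partial results). */
uint32_t mul16x16_lane(uint16_t a, uint16_t b)
{
    return ((uint32_t)pmulhuw_lane(a, b) << 16) | pmullw_lane(a, b);
}
```

This covers inputs up to 16 bits only. Full 32-bit inputs need the PMULUDQ route or SSE4.1 PMULLD.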

jj2007

Like this?

include \masm32\MasmBasic\MasmBasic.inc

.data
MyO0        OWORD 0FFFFFF00FFFFFF00FFFFFF00FFFFFF00h
MyO1        OWORD 0F1F1F100F1F1F100F1F1F100F1F1F100h

        Init
        movups xmm0, MyO0
        movups xmm1, MyO1
        pmuludq xmm0, xmm1
        Inkey Hex$(xmm0)
        Exit
end start


Output: F1F1F00E 0E0F0000 F1F1F00E 0E0F0000

P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.

dedndave

sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results

jj2007

Quote from: dedndave on January 23, 2014, 05:21:41 AM
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results

That should work with the code above...

Quote (PMULUDQ description):
Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1

Bad luck :(

Depending on the application, a simple mulps might do the job, but it means a dword to single conversion.
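
To make the quoted PMULUDQ behaviour concrete, here is a small scalar model in C. It is a sketch with hypothetical names, with index 0 as the lowest dword. Only dwords 0 and 2 of each operand take part, and each product occupies a full quadword, so one instruction yields two results rather than four.

```c
#include <stdint.h>

/* Scalar model of PMULUDQ: a[] and b[] are the four dwords of each
   register (index 0 = lowest). Dwords 1 and 3 are ignored. */
void pmuludq_model(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];   /* low quadword  */
    out[1] = (uint64_t)a[2] * b[2];   /* high quadword */
}

/* Operands from the OWORD example earlier in the thread:
   every dword is FFFFFF00h in one register and F1F1F100h in the other. */
uint64_t pmuludq_demo_low_qword(void)
{
    const uint32_t a[4] = { 0xFFFFFF00u, 0xFFFFFF00u, 0xFFFFFF00u, 0xFFFFFF00u };
    const uint32_t b[4] = { 0xF1F1F100u, 0xF1F1F100u, 0xF1F1F100u, 0xF1F1F100u };
    uint64_t out[2];
    pmuludq_model(a, b, out);
    return out[0];
}
```

The low quadword comes out as F1F1F00E0E0F0000h, which matches the two identical halves of the output shown for that example.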

dedndave

Quote:
you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle

that may be the solution

jj2007

Post it, and I'll add it to the timings

include \masm32\MasmBasic\MasmBasic.inc
.data
MyO0        dd 10001, 10002, 10003, 10004
MyQ1        dd 10000, 20000, 30000, 40000

.data?
dest0        dd ?
dest1        dd ?
dest2        dd ?
dest3        dd ?

  Init
  movups xmm0, OWORD PTR MyO0
  movups xmm1, OWORD PTR MyQ1
  ; Convert Packed Signed Doubleword Integers to Packed Single-Precision Floating-Point Values
  cvtdq2ps xmm0, xmm0
  cvtdq2ps xmm1, xmm1
  ; Packed Single-Precision Floating-Point Multiply
  mulps xmm0, xmm1
  ; Convert Packed Single-Precision Floating-Point Values to Packed Doubleword Integers
  cvtps2dq xmm0, xmm0
  movups oword ptr dest0, xmm0
  Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n\n", dest3)
  movups xmm0, OWORD PTR MyO0
  movups xmm1, OWORD PTR MyQ1
  ; Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, and add the adjacent doubleword results
  pmaddwd xmm0, xmm1
  movups oword ptr dest0, xmm0
  Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n", dest3)
  Exit
end start

Output:
D0=     100010000
D1=     200040000
D2=     300089984
D3=     400160000

D0=     100010000
D1=     200040000
D2=     300090000
D3=     -255462144
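
The D2 value in the first block (300089984 instead of 300090000) is a single-precision rounding effect: a float carries 24 significant bits, so products above 2^24 are rounded. A scalar C sketch of the cvtdq2ps / mulps / cvtps2dq path for one lane follows; the function name is hypothetical, and products of 2^31 or more are not modelled.

```c
#include <stdint.h>

/* One lane of: cvtdq2ps, mulps, cvtps2dq.
   The volatile floats force each intermediate to be stored as a true
   single-precision value, whatever precision the compiler evaluates in.
   The product of two integers held in floats is always an integer,
   so the final truncating cast loses nothing further. */
int32_t mul_via_float(int32_t a, int32_t b)
{
    volatile float fa = (float)a;
    volatile float fb = (float)b;
    volatile float prod = fa * fb;    /* rounded to 24 significant bits */
    return (int32_t)prod;
}
```

For 10003 * 30000 the exact product lies halfway between two representable floats (spacing 32 in that range) and rounds to 300089984. The other three products happen to be exactly representable.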


Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1144    cycles for 100 * cvtdq2ps & mulps
823     cycles for 100 * pmaddwd
1625    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords

1136    cycles for 100 * cvtdq2ps & mulps
816     cycles for 100 * pmaddwd
1608    cycles for 100 * pmuludq to dwords
1583    cycles for 100 * pmuludq to qwords

1135    cycles for 100 * cvtdq2ps & mulps
816     cycles for 100 * pmaddwd
1607    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords


EDIT: Added pmuludq to the timings (source attached). With the "pmuludq to qwords" variant, results can exceed 32 bits, in case that is needed. It is even a bit faster than pmuludq to dwords, see timings above.

@Leo: Could you please tell us more about your requirements, e.g. allowed ranges, required precision, etc? Thanks.

And btw, welcome to the forum

MichaelW

Why use movups when you can easily align your data?

jj2007

Because the OP might have unaligned data, and because the difference is small (2%).

hool

        movdqu  xmm0, [rsp]             ;   a3      a2      a1      a0
        movdqu  xmm1, [rsp+16]          ;   b3      b2      b1      b0 
        pshufd  xmm2, xmm0, 110001b     ;   _       a3      _       a1
        pshufd  xmm3, xmm1, 110001b     ;   _       b3      _       b1
        pmuludq xmm0, xmm1              ;   _     a2*b2     _     a0*b0   (only low 32bit considered)
        pmuludq xmm3, xmm2              ;   _     a3*b3     _     a1*b1
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0   
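
The sequence above can be checked with a small scalar model in C. This is a sketch, not a translation produced by the poster: `xmm_t` and the helper names are hypothetical, `d[0]` is the lowest dword, and the immediate 110001b is written as 0x31.

```c
#include <stdint.h>

typedef struct { uint32_t d[4]; } xmm_t;      /* d[0] = lowest dword */

/* PSHUFD: result dword i is source dword selected by bits 2i+1..2i of imm. */
static xmm_t pshufd(xmm_t s, unsigned imm)
{
    xmm_t r;
    for (int i = 0; i < 4; ++i) r.d[i] = s.d[(imm >> (2 * i)) & 3u];
    return r;
}

/* PMULUDQ: multiply dwords 0 and 2, store two 64-bit products. */
static xmm_t pmuludq(xmm_t a, xmm_t b)
{
    xmm_t r;
    for (int q = 0; q < 2; ++q) {
        uint64_t p = (uint64_t)a.d[2 * q] * b.d[2 * q];
        r.d[2 * q]     = (uint32_t)p;
        r.d[2 * q + 1] = (uint32_t)(p >> 32);
    }
    return r;
}

/* PSLLQ by 32: the low dword of each quadword moves to the high dword. */
static xmm_t psllq32(xmm_t a)
{
    xmm_t r = {{ 0, a.d[0], 0, a.d[2] }};
    return r;
}

static xmm_t por(xmm_t a, xmm_t b)
{
    xmm_t r;
    for (int i = 0; i < 4; ++i) r.d[i] = a.d[i] | b.d[i];
    return r;
}

/* Same instruction order as the listing above. */
xmm_t mul4_shift_or(xmm_t a, xmm_t b)
{
    xmm_t x0 = a, x1 = b;
    xmm_t x2 = pshufd(x0, 0x31);      /* _  a3 _  a1 */
    xmm_t x3 = pshufd(x1, 0x31);      /* _  b3 _  b1 */
    x0 = pmuludq(x0, x1);             /* a2*b2, a0*b0 as qwords */
    x3 = pmuludq(x3, x2);             /* a3*b3, a1*b1 as qwords */
    x0 = psllq32(x0);
    x3 = psllq32(x3);
    x0 = pshufd(x0, 0x31);            /* even products back to dwords 0 and 2 */
    return por(x0, x3);
}

/* Test data from earlier in the thread; returns result dword `lane`. */
uint32_t mul4_shift_or_demo(int lane)
{
    xmm_t a = {{ 10001, 10002, 10003, 10004 }};
    xmm_t b = {{ 10000, 20000, 30000, 40000 }};
    return mul4_shift_or(a, b).d[lane];
}
```

Each result dword is the low 32 bits of the corresponding product, so the values wrap modulo 2^32 when a product does not fit.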

Gunther

Hi hool,

good catch. Will work.  :t


jj2007

Hi hool & qWord,

It works:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

1140    cycles for 100 * cvtdq2ps & mulps
819     cycles for 100 * pmaddwd
1613    cycles for 100 * pmuludq to dwords
1580    cycles for 100 * pmuludq to qwords
2083    cycles for 100 * pshufd & pmuludq (hool)
2053    cycles for 100 * pshufd & pmuludq (qWord)

1145    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1615    cycles for 100 * pmuludq to dwords
1591    cycles for 100 * pmuludq to qwords
2100    cycles for 100 * pshufd & pmuludq (hool)
2058    cycles for 100 * pshufd & pmuludq (qWord)

1142    cycles for 100 * cvtdq2ps & mulps
821     cycles for 100 * pmaddwd
1609    cycles for 100 * pmuludq to dwords
1581    cycles for 100 * pmuludq to qwords
2084    cycles for 100 * pshufd & pmuludq (hool)
2049    cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

qWord

Quote from: hool on January 29, 2014, 03:24:50 AM
;...
        psllq   xmm0, 32                ; a2*b2     0     a0*b0     0
        psllq   xmm3, 32                ; a3*b3     0     a1*b1     0
        pshufd  xmm0, xmm0, 110001b     ;   0     a2*b2     0     a0*b0
        por     xmm0, xmm3              ; a3*b3   a2*b2   a1*b1   a0*b0   

shifts are "slow" and not necessary:
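movdqu\a xmm0,pdw_a            ; movdqu or movdqa:  a3    a2    a1    a0
movdqu\a xmm1,pdw_b            ;                    b3    b2    b1    b0
pshufd xmm2,xmm0,11110101y     ;                    a3    a3    a1    a1
pshufd xmm3,xmm1,11110101y     ;                    b3    b3    b1    b1
pmuludq xmm0,xmm1              ; 64-bit products:      a2*b2       a0*b0
pmuludq xmm2,xmm3              ; 64-bit products:      a3*b3       a1*b1
pshufd xmm0,xmm0,1000y         ; low two dwords:    lo(a2*b2) lo(a0*b0)
pshufd xmm2,xmm2,1000y         ; low two dwords:    lo(a3*b3) lo(a1*b1)
punpckldq xmm0,xmm2            ; a3*b3 a2*b2 a1*b1 a0*b0 (low 32 bits each)
movdqu\a pdw_c,xmm0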
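
For comparison, a scalar C model of this shuffle-and-unpack variant. It is a sketch with hypothetical names, `d[0]` is the lowest dword, and the immediates 11110101y and 1000y are written as 0xF5 and 0x08.

```c
#include <stdint.h>

typedef struct { uint32_t d[4]; } xmm_t;      /* d[0] = lowest dword */

/* PSHUFD: result dword i is source dword selected by bits 2i+1..2i of imm. */
static xmm_t pshufd(xmm_t s, unsigned imm)
{
    xmm_t r;
    for (int i = 0; i < 4; ++i) r.d[i] = s.d[(imm >> (2 * i)) & 3u];
    return r;
}

/* PMULUDQ: multiply dwords 0 and 2, store two 64-bit products. */
static xmm_t pmuludq(xmm_t a, xmm_t b)
{
    xmm_t r;
    for (int q = 0; q < 2; ++q) {
        uint64_t p = (uint64_t)a.d[2 * q] * b.d[2 * q];
        r.d[2 * q]     = (uint32_t)p;
        r.d[2 * q + 1] = (uint32_t)(p >> 32);
    }
    return r;
}

/* PUNPCKLDQ: interleave the two low dwords of each operand. */
static xmm_t punpckldq(xmm_t a, xmm_t b)
{
    xmm_t r = {{ a.d[0], b.d[0], a.d[1], b.d[1] }};
    return r;
}

/* Same instruction order as the listing above. */
xmm_t mul4_unpack(xmm_t a, xmm_t b)
{
    xmm_t x0 = a, x1 = b;
    xmm_t x2 = pshufd(x0, 0xF5);      /* a3 a3 a1 a1 */
    xmm_t x3 = pshufd(x1, 0xF5);      /* b3 b3 b1 b1 */
    x0 = pmuludq(x0, x1);             /* a2*b2, a0*b0 as qwords */
    x2 = pmuludq(x2, x3);             /* a3*b3, a1*b1 as qwords */
    x0 = pshufd(x0, 0x08);            /* dwords 0,1 = lo(a0*b0), lo(a2*b2) */
    x2 = pshufd(x2, 0x08);            /* dwords 0,1 = lo(a1*b1), lo(a3*b3) */
    return punpckldq(x0, x2);
}

/* Test data from earlier in the thread; returns result dword `lane`. */
uint32_t mul4_unpack_demo(int lane)
{
    xmm_t a = {{ 10001, 10002, 10003, 10004 }};
    xmm_t b = {{ 10000, 20000, 30000, 40000 }};
    return mul4_unpack(a, b).d[lane];
}
```

The two shifts and the OR are replaced by two shuffles and one unpack, which is the point being made about avoiding shifts.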

jj2007

OK, both hool's and qWord's algo are integrated, see above. Time for some timings ;-)

dedndave

haven't seen the original poster since first post   :P