I can't find an instruction to perform a very simple task that I need.
I have 4 integers in xmm0 and another 4 in xmm1. They are all positive. I need to
multiply the two registers and get the 4 resulting 32bit values. I found a way of doing it using PMADDWD. The problem is that it works only if all the input numbers are at most 15 bits long. Is there a solution for larger numbers?
Thanks
looks like you have a few options....
SSE4 may have the instruction you want, but isn't supported by all processors that are in common use
(PMULDQ or PMULLD)
SSE2....
you can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle
or, you can shuffle, then use PMULUDQ to do 2 at a time (32-bit to 64-bit result)
doesn't look like SSE3 offers much help
Like this?
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
.data
MyO0 OWORD 0FFFFFF00FFFFFF00FFFFFF00FFFFFF00h
MyO1 OWORD 0F1F1F100F1F1F100F1F1F100F1F1F100h
Init
movups xmm0, MyO0
movups xmm1, MyO1
pmuludq xmm0, xmm1
Inkey Hex$(xmm0)
Exit
end start
Output: F1F1F00E 0E0F0000 F1F1F00E 0E0F0000
P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results
Quote from: dedndave on January 23, 2014, 05:21:41 AM
sounds like he has (4) 16-bit unsigned integers, and wants (4) 32-bit results
That should work with the code above...QuoteMultiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1
Bad luck :(
Depending on the application, a simple
mulps might do the job, but it means a dword to single conversion.
Quoteyou can use PMULHW to get the high word result and PMULLW to get the low word result, then shuffle
that may be the solution
Post it, and I'll add it to the timings :greensml:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
.data
MyO0 dd 10001, 10002, 10003, 10004
MyQ1 dd 10000, 20000, 30000, 40000
.data?
dest0 dd ?
dest1 dd ?
dest2 dd ?
dest3 dd ?
Init
movups xmm0, OWORD PTR MyO0
movups xmm1, OWORD PTR MyQ1
; Convert Packed Signed Doubleword Integers to Packed Single-Precision Floating-Point Values
cvtdq2ps xmm0, xmm0
cvtdq2ps xmm1, xmm1
; Packed Single-Precision Floating-Point Multiply
mulps xmm0, xmm1
; Convert Packed Single-Precision Floating-Point Values to Packed Doubleword Integers
cvtps2dq xmm0, xmm0
movups oword ptr dest0, xmm0
Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n\n", dest3)
movups xmm0, OWORD PTR MyO0
movups xmm1, OWORD PTR MyQ1
; Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, and add the adjacent doubleword results
pmaddwd xmm0, xmm1
movups oword ptr dest0, xmm0
Print Str$("D0=\t%i", dest0), Str$("\nD1=\t%i", dest1), Str$("\nD2=\t%i", dest2), Str$("\nD3=\t%i\n", dest3)
Exit
end start
Output:
D0= 100010000
D1= 200040000
D2= 300089984
D3= 400160000
D0= 100010000
D1= 200040000
D2= 300090000
D3= -255462144
Timings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1144 cycles for 100 * cvtdq2ps & mulps
823 cycles for 100 * pmaddwd
1625 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords
1136 cycles for 100 * cvtdq2ps & mulps
816 cycles for 100 * pmaddwd
1608 cycles for 100 * pmuludq to dwords
1583 cycles for 100 * pmuludq to qwords
1135 cycles for 100 * cvtdq2ps & mulps
816 cycles for 100 * pmaddwd
1607 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords
36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
EDIT: Added pmuludq to the timings (source attached). With pmuludq, results can exceed 32 bits, in case that is needed. It is even a bit faster than pmuludq to dwords, see timings above.
@Leo: Could you please tell us more about your requirements, e.g. allowed ranges, required precision, etc? Thanks.
And btw, welcome to the forum :icon14:
Why use movups when you can easily align your data?
Because the OP might have unaligned data, and because the difference is small (2%).
movdqu xmm0, [rsp] ; a3 a2 a1 a0
movdqu xmm1, [rsp+16] ; b3 b2 b1 b0
pshufd xmm2, xmm0, 110001b ; _ a3 _ a1
pshufd xmm3, xmm1, 110001b ; _ b3 _ b1
pmuludq xmm0, xmm1 ; _ a2*b2 _ a0*b0 (only low 32bit considered)
pmuludq xmm3, xmm2 ; _ a3*b3 _ a1*b1
psllq xmm0, 32 ; a2*b2 0 a0*b0 0
psllq xmm3, 32 ; a3*b3 0 a1*b1 0
pshufd xmm0, xmm0, 110001b ; 0 a2*b2 0 a0*b0
por xmm0, xmm3 ; a3*b3 a2*b2 a1*b1 a0*b0
Hi hool,
good catch. Will work. :t
Gunther
Hi hool & qWord,
It works:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1140 cycles for 100 * cvtdq2ps & mulps
819 cycles for 100 * pmaddwd
1613 cycles for 100 * pmuludq to dwords
1580 cycles for 100 * pmuludq to qwords
2083 cycles for 100 * pshufd & pmuludq (hool)
2053 cycles for 100 * pshufd & pmuludq (qWord)
1145 cycles for 100 * cvtdq2ps & mulps
821 cycles for 100 * pmaddwd
1615 cycles for 100 * pmuludq to dwords
1591 cycles for 100 * pmuludq to qwords
2100 cycles for 100 * pshufd & pmuludq (hool)
2058 cycles for 100 * pshufd & pmuludq (qWord)
1142 cycles for 100 * cvtdq2ps & mulps
821 cycles for 100 * pmaddwd
1609 cycles for 100 * pmuludq to dwords
1581 cycles for 100 * pmuludq to qwords
2084 cycles for 100 * pshufd & pmuludq (hool)
2049 cycles for 100 * pshufd & pmuludq (qWord)
36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)
400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)
Quote from: hool on January 29, 2014, 03:24:50 AM
;...
psllq xmm0, 32 ; a2*b2 0 a0*b0 0
psllq xmm3, 32 ; a3*b3 0 a1*b1 0
pshufd xmm0, xmm0, 110001b ; 0 a2*b2 0 a0*b0
por xmm0, xmm3 ; a3*b3 a2*b2 a1*b1 a0*b0
shifts are "slow" and not necessary:
movdqu\a xmm0,pdw_a
movdqu\a xmm1,pdw_b
pshufd xmm2,xmm0,11110101y
pshufd xmm3,xmm1,11110101y
pmuludq xmm0,xmm1
pmuludq xmm2,xmm3
pshufd xmm0,xmm0,1000y
pshufd xmm2,xmm2,1000y
punpckldq xmm0,xmm2
movdqu\a pdw_c,xmm0
OK, both hool's and qWord's algo are integrated, see above. Time for some timings ;-)
haven't seen the original poster since first post :P
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
1468 cycles for 100 * cvtdq2ps & mulps
1455 cycles for 100 * pmaddwd
2004 cycles for 100 * pmuludq to dwords
2471 cycles for 100 * pmuludq to qwords
1990 cycles for 100 * pshufd & pmuludq (hool)
1903 cycles for 100 * pshufd & pmuludq (qWord)
1494 cycles for 100 * cvtdq2ps & mulps
1475 cycles for 100 * pmaddwd
2031 cycles for 100 * pmuludq to dwords
2507 cycles for 100 * pmuludq to qwords
1998 cycles for 100 * pshufd & pmuludq (hool)
1901 cycles for 100 * pshufd & pmuludq (qWord)
1457 cycles for 100 * cvtdq2ps & mulps
1457 cycles for 100 * pmaddwd
2047 cycles for 100 * pmuludq to dwords
2471 cycles for 100 * pmuludq to qwords
2000 cycles for 100 * pshufd & pmuludq (hool)
1936 cycles for 100 * pshufd & pmuludq (qWord)
Its probably worth mentioning that all these versions have different areas of use.
cvtdq2ps/cvtps2dq for example will cause precision loss at some point (failes to convert 0x1000001 back and forth)
pmuludq to (d/q)words relies on memory to produce result
and my version keeps stuff in registers (for better or worse)
AMD FX(tm)-8320 Eight-Core Processor (SSE4)
370 cycles for 100 * cvtdq2ps & mulps
409 cycles for 100 * pmaddwd
623 cycles for 100 * pmuludq to dwords
490 cycles for 100 * pmuludq to qwords
703 cycles for 100 * pshufd & pmuludq (hool)
613 cycles for 100 * pshufd & pmuludq (qWord)
450 cycles for 100 * cvtdq2ps & mulps
312 cycles for 100 * pmaddwd
609 cycles for 100 * pmuludq to dwords
511 cycles for 100 * pmuludq to qwords
567 cycles for 100 * pshufd & pmuludq (hool)
529 cycles for 100 * pshufd & pmuludq (qWord)
468 cycles for 100 * cvtdq2ps & mulps
357 cycles for 100 * pmaddwd
659 cycles for 100 * pmuludq to dwords
540 cycles for 100 * pmuludq to qwords
563 cycles for 100 * pshufd & pmuludq (hool)
581 cycles for 100 * pshufd & pmuludq (qWord)
36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)
400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)
You may notice some irregularities in numbers that because I have some heavy programs running alongside. Its very real-life example how CPU reorders instructions as it pleases.
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
1170 cycles for 100 * cvtdq2ps & mulps
728 cycles for 100 * pmaddwd
1756 cycles for 100 * pmuludq to dwords
1682 cycles for 100 * pmuludq to qwords
2141 cycles for 100 * pshufd & pmuludq (hool)
2105 cycles for 100 * pshufd & pmuludq (qWord)
1166 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1749 cycles for 100 * pmuludq to dwords
1650 cycles for 100 * pmuludq to qwords
2144 cycles for 100 * pshufd & pmuludq (hool)
2099 cycles for 100 * pshufd & pmuludq (qWord)
1164 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1762 cycles for 100 * pmuludq to dwords
1646 cycles for 100 * pmuludq to qwords
2139 cycles for 100 * pshufd & pmuludq (hool)
2098 cycles for 100 * pshufd & pmuludq (qWord)
36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)
400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)
--- ok ---
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
1347 cycles for 100 * cvtdq2ps & mulps
613 cycles for 100 * pmaddwd
1256 cycles for 100 * pmuludq to dwords
1020 cycles for 100 * pmuludq to qwords
1269 cycles for 100 * pshufd & pmuludq (hool)
1274 cycles for 100 * pshufd & pmuludq (qWord)
1347 cycles for 100 * cvtdq2ps & mulps
614 cycles for 100 * pmaddwd
1247 cycles for 100 * pmuludq to dwords
1019 cycles for 100 * pmuludq to qwords
1269 cycles for 100 * pshufd & pmuludq (hool)
1270 cycles for 100 * pshufd & pmuludq (qWord)
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
281 cycles for 100 * cvtdq2ps & mulps
220 cycles for 100 * pmaddwd
401 cycles for 100 * pmuludq to dwords
306 cycles for 100 * pmuludq to qwords
390 cycles for 100 * pshufd & pmuludq (hool)
317 cycles for 100 * pshufd & pmuludq (qWord)
277 cycles for 100 * cvtdq2ps & mulps
220 cycles for 100 * pmaddwd
404 cycles for 100 * pmuludq to dwords
305 cycles for 100 * pmuludq to qwords
398 cycles for 100 * pshufd & pmuludq (hool)
318 cycles for 100 * pshufd & pmuludq (qWord)
286 cycles for 100 * cvtdq2ps & mulps
221 cycles for 100 * pmaddwd
407 cycles for 100 * pmuludq to dwords
310 cycles for 100 * pmuludq to qwords
389 cycles for 100 * pshufd & pmuludq (hool)
318 cycles for 100 * pshufd & pmuludq (qWord)
36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)
400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)
--- ok ---
Gunther
Thanks a lot everybody!
I saw the first response, checked the newer instructions that came since SSE2 and decided to use SSE 4.1 both since it had pmulld, which did just what I wanted and since it had some other instructions, which helped me as well. The fallback SSE2 version uses pmaddwd with a check before the algorithm that gives an error message if the data is too large. That's sufficient for what I wanted. Thank a lot everybody for the responses!
Quote from: jj2007 on January 23, 2014, 05:20:02 AM
P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.
Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.
Michael,
Quote from: MichaelW on February 19, 2014, 07:26:40 AM
Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.
that sounds a bit brain damaged. Is it another MS mess?
Gunther
Warming up an old thread :badgrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
480 cycles for 100 * cvtdq2ps & mulps
14975 cycles for 100 * mulps
1482 cycles for 100 * pmuludq to dwords
99 cycles for 100 * pmuludq to qwords
592 cycles for 100 * pshufd & pmuludq (hool)
541 cycles for 100 * pshufd & pmuludq (qWord)
479 cycles for 100 * cvtdq2ps & mulps
14966 cycles for 100 * mulps
908 cycles for 100 * pmuludq to dwords
617 cycles for 100 * pmuludq to qwords
1244 cycles for 100 * pshufd & pmuludq (hool)
544 cycles for 100 * pshufd & pmuludq (qWord)
478 cycles for 100 * cvtdq2ps & mulps
14956 cycles for 100 * mulps
910 cycles for 100 * pmuludq to dwords
72 cycles for 100 * pmuludq to qwords
593 cycles for 100 * pshufd & pmuludq (hool)
540 cycles for 100 * pshufd & pmuludq (qWord)
36 bytes for cvtdq2ps & mulps
26 bytes for mulps
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)
4001600 = eax cvtdq2ps & mulps
4001600 = eax mulps
4001600 = eax pmuludq to dwords
4001600 = eax pmuludq to qwords
4001600 = eax pshufd & pmuludq (hool)
4001600 = eax pshufd & pmuludq (qWord)
The reason is that I discovered you can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW, at least on my i5. Could be a stall or whatever. Second restriction is that the result should be below 16Mio. So it is not really a good choice, unless you want to scale rectangles and you are sure that none of the four dwords is negative 8)
Quote from: jj2007 on January 29, 2016, 06:41:58 AMyou can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW
not surprisingly when working with denormalized (positive) numbers as operand - even thought there is an (commonly masked) exception for such operands.