Packed integer multiplication

Started by Leo, January 23, 2014, 05:05:26 AM


dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

1468    cycles for 100 * cvtdq2ps & mulps
1455    cycles for 100 * pmaddwd
2004    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
1990    cycles for 100 * pshufd & pmuludq (hool)
1903    cycles for 100 * pshufd & pmuludq (qWord)

1494    cycles for 100 * cvtdq2ps & mulps
1475    cycles for 100 * pmaddwd
2031    cycles for 100 * pmuludq to dwords
2507    cycles for 100 * pmuludq to qwords
1998    cycles for 100 * pshufd & pmuludq (hool)
1901    cycles for 100 * pshufd & pmuludq (qWord)

1457    cycles for 100 * cvtdq2ps & mulps
1457    cycles for 100 * pmaddwd
2047    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
2000    cycles for 100 * pshufd & pmuludq (hool)
1936    cycles for 100 * pshufd & pmuludq (qWord)

hool

It's probably worth mentioning that all these versions have different areas of use.

cvtdq2ps/cvtps2dq, for example, loses precision at some point (it fails to convert 0x1000001 back and forth);
pmuludq to (d/q)words relies on memory to produce the result;
and my version keeps everything in registers (for better or worse).
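The precision remark about cvtdq2ps/cvtps2dq can be checked without an assembler. The following is a minimal Python sketch that models one lane of the int -> float32 -> int round trip; the helper name is hypothetical and this is only a model of the conversion, not the code that was timed in this thread.

```python
import struct

def int_to_float32_and_back(n: int) -> int:
    """Model one lane of cvtdq2ps followed by cvtps2dq (hypothetical helper)."""
    # Packing as "<f" rounds the value to the nearest IEEE single, as cvtdq2ps does.
    as_float32 = struct.unpack("<f", struct.pack("<f", float(n)))[0]
    # A single has a 24-bit significand, so values this large are already integers.
    return int(as_float32)

print(hex(int_to_float32_and_back(0x1000000)))  # 2**24 survives the round trip
print(hex(int_to_float32_and_back(0x1000001)))  # 2**24 + 1 does not
```

A single-precision float has a 24-bit significand, so 0x1000001 (2^24 + 1) is the first positive integer that cannot be represented exactly.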

AMD FX(tm)-8320 Eight-Core Processor            (SSE4)

370     cycles for 100 * cvtdq2ps & mulps
409     cycles for 100 * pmaddwd
623     cycles for 100 * pmuludq to dwords
490     cycles for 100 * pmuludq to qwords
703     cycles for 100 * pshufd & pmuludq (hool)
613     cycles for 100 * pshufd & pmuludq (qWord)

450     cycles for 100 * cvtdq2ps & mulps
312     cycles for 100 * pmaddwd
609     cycles for 100 * pmuludq to dwords
511     cycles for 100 * pmuludq to qwords
567     cycles for 100 * pshufd & pmuludq (hool)
529     cycles for 100 * pshufd & pmuludq (qWord)

468     cycles for 100 * cvtdq2ps & mulps
357     cycles for 100 * pmaddwd
659     cycles for 100 * pmuludq to dwords
540     cycles for 100 * pmuludq to qwords
563     cycles for 100 * pshufd & pmuludq (hool)
581     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)
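The negative pmaddwd result is what one would expect when a 16-bit half of an input has its top bit set, because pmaddwd treats each word as signed. The test inputs are not shown in this thread; the sketch below assumes 40000 and 10004, which is only an inference from the two printed results (40000 * 10004 = 400160000). It is a Python model of one dword lane, with hypothetical helper names.

```python
def to_signed16(w: int) -> int:
    """Interpret the low 16 bits of w as a signed word."""
    w &= 0xFFFF
    return w - 0x10000 if w & 0x8000 else w

def pmaddwd_dword(a: int, b: int) -> int:
    """Model pmaddwd on one dword lane: lo*lo + hi*hi with signed 16-bit words."""
    total = (to_signed16(a) * to_signed16(b)
             + to_signed16(a >> 16) * to_signed16(b >> 16))
    total &= 0xFFFFFFFF                      # result is a 32-bit lane
    return total - 0x100000000 if total & 0x80000000 else total

# Assumed inputs: 40000 = 0x9C40 is read as the signed word -25536.
print(pmaddwd_dword(40000, 10004))   # -255462144
print(pmaddwd_dword(20000, 20008))   # 400160000, both factors fit in 15 bits
```

Under that assumption the difference between the two results is exactly 65536 * 10004, which is the signed/unsigned gap of the low word of 40000.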


You may notice some irregularities in the numbers; that is because I have some heavy programs running alongside. It's a very real-life example of how the CPU reorders instructions as it pleases.

FORTRANS


Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

1170 cycles for 100 * cvtdq2ps & mulps
728 cycles for 100 * pmaddwd
1756 cycles for 100 * pmuludq to dwords
1682 cycles for 100 * pmuludq to qwords
2141 cycles for 100 * pshufd & pmuludq (hool)
2105 cycles for 100 * pshufd & pmuludq (qWord)

1166 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1749 cycles for 100 * pmuludq to dwords
1650 cycles for 100 * pmuludq to qwords
2144 cycles for 100 * pshufd & pmuludq (hool)
2099 cycles for 100 * pshufd & pmuludq (qWord)

1164 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1762 cycles for 100 * pmuludq to dwords
1646 cycles for 100 * pmuludq to qwords
2139 cycles for 100 * pshufd & pmuludq (hool)
2098 cycles for 100 * pshufd & pmuludq (qWord)

36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)

400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)

--- ok ---

jj2007

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1347    cycles for 100 * cvtdq2ps & mulps
613     cycles for 100 * pmaddwd
1256    cycles for 100 * pmuludq to dwords
1020    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1274    cycles for 100 * pshufd & pmuludq (qWord)

1347    cycles for 100 * cvtdq2ps & mulps
614     cycles for 100 * pmaddwd
1247    cycles for 100 * pmuludq to dwords
1019    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1270    cycles for 100 * pshufd & pmuludq (qWord)

Gunther


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

281     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
401     cycles for 100 * pmuludq to dwords
306     cycles for 100 * pmuludq to qwords
390     cycles for 100 * pshufd & pmuludq (hool)
317     cycles for 100 * pshufd & pmuludq (qWord)

277     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
404     cycles for 100 * pmuludq to dwords
305     cycles for 100 * pmuludq to qwords
398     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

286     cycles for 100 * cvtdq2ps & mulps
221     cycles for 100 * pmaddwd
407     cycles for 100 * pmuludq to dwords
310     cycles for 100 * pmuludq to qwords
389     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

--- ok ---


Gunther
You have to know the facts before you can distort them.

Leo

Thanks a lot everybody!
I saw the first response, checked the newer instructions added since SSE2, and decided to use SSE4.1, both because it has pmulld, which does just what I wanted, and because it has some other instructions that helped me as well. The fallback SSE2 version uses pmaddwd, with a check before the algorithm that gives an error message if the data is too large. That's sufficient for what I wanted. Thanks a lot, everybody, for the responses!
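The dispatch described above could look roughly like the following Python model. This is a sketch under assumptions: the function names, the `have_sse41` flag and the exact bound (0..32767, so that every value fits in the positive range of a signed word) are hypothetical, since the post does not give the actual check.

```python
def pmulld_lane(a: int, b: int) -> int:
    """Model pmulld on one lane: low 32 bits of the product, read as signed."""
    low = (a * b) & 0xFFFFFFFF
    return low - 0x100000000 if low & 0x80000000 else low

def fits_pmaddwd(values) -> bool:
    """Assumed range check: every value fits in the positive half of a signed word."""
    return all(0 <= v <= 0x7FFF for v in values)

def multiply_packed(xs, ys, have_sse41: bool):
    if have_sse41:
        return [pmulld_lane(x, y) for x, y in zip(xs, ys)]
    if not (fits_pmaddwd(xs) and fits_pmaddwd(ys)):
        raise ValueError("data too large for the pmaddwd fallback")
    # With both high words zero, pmaddwd yields exactly lo*lo per dword lane.
    return [x * y for x, y in zip(xs, ys)]

print(multiply_packed([40000, 3], [10004, 7], have_sse41=True))
print(multiply_packed([20000, 3], [20008, 7], have_sse41=False))
```

Checking once before the loop keeps the fallback path as short as the pmaddwd timings above suggest, at the cost of rejecting inputs that pmulld would handle.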

MichaelW

Quote from: jj2007 on January 23, 2014, 05:20:02 AM
P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.

Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.
Well Microsoft, here's another nice mess you've gotten us into.

Gunther

Michael,

Quote from: MichaelW on February 19, 2014, 07:26:40 AM
Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.

That sounds a bit brain-damaged. Is it another MS mess?


jj2007

Warming up an old thread :badgrin:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
480     cycles for 100 * cvtdq2ps & mulps
14975   cycles for 100 * mulps
1482    cycles for 100 * pmuludq to dwords
99      cycles for 100 * pmuludq to qwords
592     cycles for 100 * pshufd & pmuludq (hool)
541     cycles for 100 * pshufd & pmuludq (qWord)

479     cycles for 100 * cvtdq2ps & mulps
14966   cycles for 100 * mulps
908     cycles for 100 * pmuludq to dwords
617     cycles for 100 * pmuludq to qwords
1244    cycles for 100 * pshufd & pmuludq (hool)
544     cycles for 100 * pshufd & pmuludq (qWord)

478     cycles for 100 * cvtdq2ps & mulps
14956   cycles for 100 * mulps
910     cycles for 100 * pmuludq to dwords
72      cycles for 100 * pmuludq to qwords
593     cycles for 100 * pshufd & pmuludq (hool)
540     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
26      bytes for mulps
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

4001600 = eax cvtdq2ps & mulps
4001600 = eax mulps
4001600 = eax pmuludq to dwords
4001600 = eax pmuludq to qwords
4001600 = eax pshufd & pmuludq (hool)
4001600 = eax pshufd & pmuludq (qWord)


The reason is that I discovered you can use mulps for packed integer multiplication, with certain limitations. First, it is SLOOOOOOOOOW, at least on my i5; could be a stall or whatever. Second, the result should be below 16 million (2^24). So it is not really a good choice, unless you want to scale rectangles and you are sure that none of the four dwords is negative 8)
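One plausible reading of this trick (the actual code is not shown here, so this is an assumption) is that the integer bit patterns are fed to mulps unchanged and multiplied by an ordinary float scale factor. For bit patterns below 2^24 the float value is exactly bits * 2^-149, so the scaled result has the scaled integer as its bit pattern. A Python model of one lane, with hypothetical helper names:

```python
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def float32_to_bits(x: float) -> int:
    """Round x to an IEEE single and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def mulps_lane_on_int_bits(n: int, scale: float) -> int:
    """Model mulps with an integer bit pattern as one operand (assumed usage)."""
    scale32 = bits_to_float32(float32_to_bits(scale))  # mulps sees a single
    return float32_to_bits(bits_to_float32(n) * scale32)

print(mulps_lane_on_int_bits(1000, 2.5))            # 2500
print(hex(mulps_lane_on_int_bits(0x1000001, 2.0)))  # not 0x2000002
print(bits_to_float32(-1))                          # nan
```

Once a pattern reaches 2^24 the exponent field is 2 or more and the spacing between neighbouring floats doubles, which matches the 16 million limit; negative dwords have the sign bit and usually a large exponent field set, and small negative values such as -1 are NaN patterns.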

qWord

Quote from: jj2007 on January 29, 2016, 06:41:58 AM
you can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW

Not surprising when working with denormalized (positive) numbers as operands, even though there is a (commonly masked) exception for such operands.
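To see why small positive integers count as denormals when mulps reads them, one can inspect the exponent field of the bit pattern. The following Python sketch uses a hypothetical helper and relies only on the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits):

```python
def is_denormal_float32(bits: int) -> bool:
    """True if the 32-bit pattern encodes a denormal (subnormal) single."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0 and mantissa != 0

for value in (1000, 0x7FFFFF, 0x800000):
    print(hex(value), is_denormal_float32(value))
```

Every integer from 1 to 2^23 - 1 has a zero exponent field and is therefore a denormal operand, which many CPUs handle through a much slower path than normal floats.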
MREAL macros - when you need floating point arithmetic while assembling!