Topic: Packed integer multiplication

dedndave

  • Member
  • *****
  • Posts: 8806
  • Still using Abacus 2.0
    • DednDave
Re: Packed integer multiplication
« Reply #15 on: January 29, 2014, 09:44:42 AM »
Prescott with Hyper-Threading:
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

1468    cycles for 100 * cvtdq2ps & mulps
1455    cycles for 100 * pmaddwd
2004    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
1990    cycles for 100 * pshufd & pmuludq (hool)
1903    cycles for 100 * pshufd & pmuludq (qWord)

1494    cycles for 100 * cvtdq2ps & mulps
1475    cycles for 100 * pmaddwd
2031    cycles for 100 * pmuludq to dwords
2507    cycles for 100 * pmuludq to qwords
1998    cycles for 100 * pshufd & pmuludq (hool)
1901    cycles for 100 * pshufd & pmuludq (qWord)

1457    cycles for 100 * cvtdq2ps & mulps
1457    cycles for 100 * pmaddwd
2047    cycles for 100 * pmuludq to dwords
2471    cycles for 100 * pmuludq to qwords
2000    cycles for 100 * pshufd & pmuludq (hool)
1936    cycles for 100 * pshufd & pmuludq (qWord)

hool

  • Guest
Re: Packed integer multiplication
« Reply #16 on: January 29, 2014, 11:02:50 AM »
It's probably worth mentioning that all these versions have different areas of use:

cvtdq2ps/cvtps2dq, for example, loses precision at some point (it fails to convert 0x1000001 back and forth)
pmuludq to (d/q)words relies on memory to produce the result
and my version keeps everything in registers (for better or worse)
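
The precision limit mentioned for the cvtdq2ps/cvtps2dq route comes from single precision carrying only 24 significant bits. A minimal scalar sketch in C (not code from the thread; the function names are made up for illustration) models one lane of the convert-there-and-back sequence:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models one lane of cvtdq2ps followed by cvtps2dq.
   Only meaningful for |x| well below 2^31, where the conversion back to
   int32_t cannot overflow. */
static int32_t roundtrip_through_float(int32_t x)
{
    volatile float f = (float)x;  /* volatile forces rounding to single precision */
    return (int32_t)f;
}

/* Usage example: the asserts document the expected behaviour. */
void roundtrip_examples(void)
{
    assert(roundtrip_through_float(0x1000000) == 0x1000000); /* 2^24 is exact */
    assert(roundtrip_through_float(0x1000001) == 0x1000000); /* 2^24 + 1 gets rounded */
    assert(roundtrip_through_float(400160000) == 400160000); /* needs fewer than 24 bits */
}
```

Values above 2^24 can still survive the round trip when their low bits are zero, which would explain why the float version prints 400160000 in the listings.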

Code:
AMD FX(tm)-8320 Eight-Core Processor            (SSE4)

370     cycles for 100 * cvtdq2ps & mulps
409     cycles for 100 * pmaddwd
623     cycles for 100 * pmuludq to dwords
490     cycles for 100 * pmuludq to qwords
703     cycles for 100 * pshufd & pmuludq (hool)
613     cycles for 100 * pshufd & pmuludq (qWord)

450     cycles for 100 * cvtdq2ps & mulps
312     cycles for 100 * pmaddwd
609     cycles for 100 * pmuludq to dwords
511     cycles for 100 * pmuludq to qwords
567     cycles for 100 * pshufd & pmuludq (hool)
529     cycles for 100 * pshufd & pmuludq (qWord)

468     cycles for 100 * cvtdq2ps & mulps
357     cycles for 100 * pmaddwd
659     cycles for 100 * pmuludq to dwords
540     cycles for 100 * pmuludq to qwords
563     cycles for 100 * pshufd & pmuludq (hool)
581     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

You may notice some irregularities in the numbers; that's because I have some heavy programs running alongside. It's a very real-life example of how the CPU reorders instructions as it pleases.

FORTRANS

  • Member
  • *****
  • Posts: 1014
Re: Packed integer multiplication
« Reply #17 on: January 30, 2014, 12:42:35 AM »
Code:
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

1170 cycles for 100 * cvtdq2ps & mulps
728 cycles for 100 * pmaddwd
1756 cycles for 100 * pmuludq to dwords
1682 cycles for 100 * pmuludq to qwords
2141 cycles for 100 * pshufd & pmuludq (hool)
2105 cycles for 100 * pshufd & pmuludq (qWord)

1166 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1749 cycles for 100 * pmuludq to dwords
1650 cycles for 100 * pmuludq to qwords
2144 cycles for 100 * pshufd & pmuludq (hool)
2099 cycles for 100 * pshufd & pmuludq (qWord)

1164 cycles for 100 * cvtdq2ps & mulps
727 cycles for 100 * pmaddwd
1762 cycles for 100 * pmuludq to dwords
1646 cycles for 100 * pmuludq to qwords
2139 cycles for 100 * pshufd & pmuludq (hool)
2098 cycles for 100 * pshufd & pmuludq (qWord)

36 bytes for cvtdq2ps & mulps
27 bytes for pmaddwd
67 bytes for pmuludq to dwords
51 bytes for pmuludq to qwords
62 bytes for pshufd & pmuludq (hool)
57 bytes for pshufd & pmuludq (qWord)

400160000 = eax cvtdq2ps & mulps
-255462144 = eax pmaddwd
400160000 = eax pmuludq to dwords
400160000 = eax pmuludq to qwords
400160000 = eax pshufd & pmuludq (hool)
400160000 = eax pshufd & pmuludq (qWord)

--- ok ---

jj2007

  • Member
  • *****
  • Posts: 8619
  • Assembler is fun ;-)
    • MasmBasic
Re: Packed integer multiplication
« Reply #18 on: January 30, 2014, 01:31:49 AM »
Code:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

1347    cycles for 100 * cvtdq2ps & mulps
613     cycles for 100 * pmaddwd
1256    cycles for 100 * pmuludq to dwords
1020    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1274    cycles for 100 * pshufd & pmuludq (qWord)

1347    cycles for 100 * cvtdq2ps & mulps
614     cycles for 100 * pmaddwd
1247    cycles for 100 * pmuludq to dwords
1019    cycles for 100 * pmuludq to qwords
1269    cycles for 100 * pshufd & pmuludq (hool)
1270    cycles for 100 * pshufd & pmuludq (qWord)

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Packed integer multiplication
« Reply #19 on: January 30, 2014, 04:15:26 AM »
Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

281     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
401     cycles for 100 * pmuludq to dwords
306     cycles for 100 * pmuludq to qwords
390     cycles for 100 * pshufd & pmuludq (hool)
317     cycles for 100 * pshufd & pmuludq (qWord)

277     cycles for 100 * cvtdq2ps & mulps
220     cycles for 100 * pmaddwd
404     cycles for 100 * pmuludq to dwords
305     cycles for 100 * pmuludq to qwords
398     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

286     cycles for 100 * cvtdq2ps & mulps
221     cycles for 100 * pmaddwd
407     cycles for 100 * pmuludq to dwords
310     cycles for 100 * pmuludq to qwords
389     cycles for 100 * pshufd & pmuludq (hool)
318     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
27      bytes for pmaddwd
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

400160000       = eax cvtdq2ps & mulps
-255462144      = eax pmaddwd
400160000       = eax pmuludq to dwords
400160000       = eax pmuludq to qwords
400160000       = eax pshufd & pmuludq (hool)
400160000       = eax pshufd & pmuludq (qWord)

--- ok ---

Gunther
Get your facts first, and then you can distort them.

Leo

  • Guest
Re: Packed integer multiplication
« Reply #20 on: February 19, 2014, 02:06:03 AM »
Thanks a lot, everybody!
I saw the first response, checked the newer instructions added since SSE2, and decided to use SSE4.1, both because it has pmulld, which does just what I wanted, and because it has some other instructions that helped me as well. The fallback SSE2 version uses pmaddwd, with a check before the algorithm that gives an error message if the data is too large. That's sufficient for what I wanted. Thanks a lot, everybody, for the responses!
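
The range check in the SSE2 fallback matters because pmaddwd treats each dword as two signed 16-bit words, multiplies them pairwise and adds the two products. A scalar C model of one lane follows. It is a sketch, not the code actually used: the function names and the exact bound are assumptions, and the operands 40000 and 10004 were picked only because they reproduce both values printed in the benchmark output (400160000 and -255462144); the test program's real inputs are not shown in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low 16 bits of v as a signed word, as pmaddwd does. */
static int32_t signed_word(uint32_t v)
{
    int32_t w = (int32_t)(v & 0xFFFFu);
    return w >= 0x8000 ? w - 0x10000 : w;
}

/* Hypothetical scalar model of one 32-bit lane of pmaddwd:
   lo(a)*lo(b) + hi(a)*hi(b), with all four words signed.
   The single wrapping case (all four words equal to 0x8000) is not
   modelled and must not be passed in. */
static int32_t pmaddwd_lane(uint32_t a, uint32_t b)
{
    return signed_word(a) * signed_word(b)
         + signed_word(a >> 16) * signed_word(b >> 16);
}

/* Assumed precondition for using pmaddwd as a plain dword multiply:
   both operands in 0..32767, so the high words are zero and the low
   words are non-negative. */
static int fits_pmaddwd(int32_t v)
{
    return v >= 0 && v <= 32767;
}

/* Usage example: the asserts document the expected behaviour. */
void pmaddwd_examples(void)
{
    assert(pmaddwd_lane(20000, 20008) == 400160000);   /* in range: exact */
    assert(pmaddwd_lane(40000, 10004) == -255462144);  /* 40000 is read as -25536 */
    assert(fits_pmaddwd(20000) && !fits_pmaddwd(40000));
}
```

With both operands inside the checked range the product is at most 32767 * 32767, which always fits in a signed dword, so a check of this kind is enough to make the fallback safe.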

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1209
Re: Packed integer multiplication
« Reply #21 on: February 19, 2014, 07:26:40 AM »
Quote:
P.S.: Won't work with old assemblers, but ML 8.0+ and JWasm understand OWORD.

Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.
Well Microsoft, here’s another nice mess you’ve gotten us into.

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Packed integer multiplication
« Reply #22 on: February 19, 2014, 09:02:57 AM »
Michael,

Quote from: MichaelW
Even 6.14 recognizes OWORD and knows that it has a size of 16 bytes, but cannot handle an initializer larger than 64 bits.

that sounds a bit brain damaged. Is it another MS mess?

Gunther
Get your facts first, and then you can distort them.

jj2007

  • Member
  • *****
  • Posts: 8619
  • Assembler is fun ;-)
    • MasmBasic
Re: Packed integer multiplication
« Reply #23 on: January 29, 2016, 06:41:58 AM »
Warming up an old thread :badgrin:

Code:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
480     cycles for 100 * cvtdq2ps & mulps
14975   cycles for 100 * mulps
1482    cycles for 100 * pmuludq to dwords
99      cycles for 100 * pmuludq to qwords
592     cycles for 100 * pshufd & pmuludq (hool)
541     cycles for 100 * pshufd & pmuludq (qWord)

479     cycles for 100 * cvtdq2ps & mulps
14966   cycles for 100 * mulps
908     cycles for 100 * pmuludq to dwords
617     cycles for 100 * pmuludq to qwords
1244    cycles for 100 * pshufd & pmuludq (hool)
544     cycles for 100 * pshufd & pmuludq (qWord)

478     cycles for 100 * cvtdq2ps & mulps
14956   cycles for 100 * mulps
910     cycles for 100 * pmuludq to dwords
72      cycles for 100 * pmuludq to qwords
593     cycles for 100 * pshufd & pmuludq (hool)
540     cycles for 100 * pshufd & pmuludq (qWord)

36      bytes for cvtdq2ps & mulps
26      bytes for mulps
67      bytes for pmuludq to dwords
51      bytes for pmuludq to qwords
62      bytes for pshufd & pmuludq (hool)
57      bytes for pshufd & pmuludq (qWord)

4001600 = eax cvtdq2ps & mulps
4001600 = eax mulps
4001600 = eax pmuludq to dwords
4001600 = eax pmuludq to qwords
4001600 = eax pshufd & pmuludq (hool)
4001600 = eax pshufd & pmuludq (qWord)

The reason for reviving it is that I discovered you can use mulps for packed integer multiplication, with certain limitations. The first: SLOOOOOOOOOW, at least on my i5. Could be a stall or whatever. The second restriction is that the result should be below 16 million (presumably 2^24 = 16,777,216). So it is not really a good choice, unless you want to scale rectangles and you are sure that none of the four dwords is negative 8)

qWord

  • Member
  • *****
  • Posts: 1473
  • The base type of a type is the type itself
    • SmplMath macros
Re: Packed integer multiplication
« Reply #24 on: January 29, 2016, 06:56:10 AM »
Quote from: jj2007
you can use mulps for packed integer multiplication - with certain limitations, such as: SLOOOOOOOOOW
Not surprising when working with denormalized (positive) numbers as operands, even though there is a (commonly masked) exception for such operands.
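
One way to see why the mulps trick works, and where the 16 million limit comes from: a small non-negative integer n, read as a single-precision bit pattern, is the denormal n * 2^-149. Multiplying it by an ordinary float factor k gives (n*k) * 2^-149, whose bit pattern is again the integer n*k as long as the result stays below 2^24. The C sketch below models one lane. It assumes one operand is a normal float and the other an integer bit pattern, which is a guess at how the test was set up (the thread does not show the code), and it assumes denormals are not flushed to zero (no FTZ/DAZ, no -ffast-math). The function names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one mulps lane: reinterpret the integer n as a
   float bit pattern, multiply by a real float, reinterpret the result. */
static uint32_t mul_via_float_bits(uint32_t n, float factor)
{
    float f;
    uint32_t bits;
    memcpy(&f, &n, sizeof f);        /* reinterpret the bits, no conversion */
    f *= factor;                     /* denormal arithmetic happens here */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Usage example: the asserts document the expected behaviour. */
void mulps_trick_examples(void)
{
    assert(mul_via_float_bits(400, 10004.0f) == 4001600u);   /* denormal result */
    assert(mul_via_float_bits(3000, 4000.0f) == 12000000u);  /* between 2^23 and 2^24 */
    assert(mul_via_float_bits(5000, 4000.0f) != 20000000u);  /* beyond 2^24: pattern breaks */
}
```

A set sign bit would make the operand a negative float rather than a negative integer, which fits the remark that none of the four dwords may be negative.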
MREAL macros - when you need floating point arithmetic while assembling!