Author Topic: SSE1 replacement for PCMPEQD  (Read 3655 times)

Antariy

  • Member
  • ****
  • Posts: 541
SSE1 replacement for PCMPEQD
« on: August 15, 2013, 11:15:18 PM »
Here is the description and example usage for the "PCMPDQD replacement" (used, for an instance, here).

I used CMPPS SSE1 floating-point instruction in an "EQUAL" mode (alias is CMPEQPS) in comparsion of an integers, but to do that, neends to do it a bit indirectly - not with the "raw" integers.

Since it's FP instruction, it isn't designed to compare any arbitrary data, like integers are, because FP numbers have a defined format. But for CMPEQPS there are these conditions: if the number is NaN (i.e. arbitrary data, or a "garbage" from a FP-format point of view) - it doesn't signal an invalid operation, and if number is NaN - it returns a FALSE as a result of comparsion (not-equal). This is documented behaviour, so we may rely on it.

Next though: what if compare arbitrary data with zero? A proper FP zero is equal to the integer zero. So, the results of a comparsion will be proper, because if arbitrary data was a binary zero - comparsion will return TRUE, if the data was not zero - with no difference, has had it some proper FP format or it was a NaN - it will return FALSE; just like it should do.

One more thing: to compare with a zero there obviously should be a shortcuts in a CPU design, because there is no need to "parse" the format via the circuit, it's more like a question of checking line states - if one number. is zero, and all its bits (lines) are zero, then if any other line (bit) of other number is set, the circuit may break further processing and return a shortcut results "not equal". So, we can really guess that comparsion of any arbitrary data with zero using CMPEQPS is not only safe (because it's based on a documentation) but also fast (i.e. comparsion of some "garbage" with zero will not cause a delays because of improper format of that "garbage" - arbitrary data).

So, we can use CMPEQPS to compare integers indirectly: we just need to XOR these integers with XORPS, and then compare the result with a zero.

But here we need to take in account that proper FP number has one difference from integer number: it may has a negative zero value. I.e., for a FPU/SSE the float numbers 80000000h and 0 are equal, but for the integer number or a bit stream comparsion it is not so. So, we need to avoid false results when CMPEQPS will return TRUE as a result of a comparsion for 80000000h and 0.


Here is the construct that meets the needs stated above:

ifndef USE_SSE2
      xorps xmm1p,xmm0p
   ifndef ASCII_7BIT_ONLY
      movaps xmmtemp,xmm1p
   endif
      cmpps xmm1p,oword ptr AxPCMPEQD_zero_ow,0
   ifndef ASCII_7BIT_ONLY
      andps xmmtemp,xmm1p
      xorps xmm1p,xmmtemp
   endif
else
      pcmpeqd xmm1p,xmm0p
endif


Here:
xmm1p - one XMM reg;
xmm0p - other XMM reg;
xmmtemp - third XMM reg;
If flag USE_SSE2 is not defined - used the PCMPEQD SSE1-workaround code;
If the data is known to be in the format where the highest bit of the bytes is always clear, then code may be smaller/faster - by definition of ASCII_7BIT_ONLY - then the workaround consists only from two instructions, and correction for negative zero possibility (80000000h) is not included (because it's known that there may not be 80h bytes).

The name of the ASCII_7BIT_ONLY flag is that because I used this construct in a string-comparsion code, so if the data known to be in a range of a standard ASCII (7 bit), the flag may be defined so.

In full form, with -0 correction, this trick works for comparsion of any data: numbers, just data stream etc.

The code attached is an example of a MemCmp using this construct.

In the test a two 16 MB buffers are compared. First it allocates memory, and commits it + fills with a "garbage", then runs SSE1 MemCmp, then GPR MemCmp, then SSE1 "number-crunching" (a loop working with small memory location than may fit in cache), then SSE2 MemCmp, then SSE2 "number-crunching" - this sequence is chosen to make sure tests on non-SSE2 capable machines. Then it repeats with ASCII_7BIT_ONLY defined.


It's very interesting, how this will behave on a different SSE1+ capable machines.

Timings:
Code: [Select]
Commiting the memory
Start test
41346696 cycles for AxPCMPEQD SSE1 MemCmp
43194864 cycles for REPE CMPSD MemCmp
423 cycles for AxPCMPEQD SSE1 Number crunching
38094976 cycles for AxPCMPEQD SSE2
269 cycles for AxPCMPEQD SSE2 Number crunching

Now with ASCII_7BIT_ONLY defined

38243504 cycles for AxPCMPEQD SSE1 MemCmp
43280048 cycles for REPE CMPSD MemCmp
269 cycles for AxPCMPEQD SSE1 Number crunching
38459352 cycles for AxPCMPEQD SSE2
271 cycles for AxPCMPEQD SSE2 Number crunching
Press any key to continue ...


In a MemCmp-like functions the SSE1 work-around (5/2 instructions instead of one SSE2 instruction) seems to be pretty usable (since the RAM throughput is a limiter anyway). With ASCII_7BIT_ONLY flag defined (2 instructions mode) it maybe even faster than PCMPEQD on my machine :biggrin:

jj2007

  • Member
  • *****
  • Posts: 7738
  • Assembler is fun ;-)
    • MasmBasic
Re: SSE1 replacement for PCMPEQD
« Reply #1 on: August 16, 2013, 01:37:02 AM »
Celeron M:
Commiting the memory
Start test
18072096 cycles for AxPCMPEQD SSE1 MemCmp
21147972 cycles for REPE CMPSD MemCmp
154 cycles for AxPCMPEQD SSE1 Number crunching
17282292 cycles for AxPCMPEQD SSE2
77 cycles for AxPCMPEQD SSE2 Number crunching

Now with ASCII_7BIT_ONLY defined

17383536 cycles for AxPCMPEQD SSE1 MemCmp
21200976 cycles for REPE CMPSD MemCmp
77 cycles for AxPCMPEQD SSE1 Number crunching
17466456 cycles for AxPCMPEQD SSE2
77 cycles for AxPCMPEQD SSE2 Number crunching


Very cute, Alex :t

Gunther

  • Member
  • *****
  • Posts: 3517
  • Forgive your enemies, but never forget their names
Re: SSE1 replacement for PCMPEQD
« Reply #2 on: August 16, 2013, 02:28:42 AM »
Hi Alex,

rock solid.  :t

Code: [Select]
Commiting the memory
Start test
8328179 cycles for AxPCMPEQD SSE1 MemCmp
9438014 cycles for REPE CMPSD MemCmp
67 cycles for AxPCMPEQD SSE1 Number crunching
8675932 cycles for AxPCMPEQD SSE2
40 cycles for AxPCMPEQD SSE2 Number crunching

Now with ASCII_7BIT_ONLY defined

7499720 cycles for AxPCMPEQD SSE1 MemCmp
9610910 cycles for REPE CMPSD MemCmp
44 cycles for AxPCMPEQD SSE1 Number crunching
6954688 cycles for AxPCMPEQD SSE2
39 cycles for AxPCMPEQD SSE2 Number crunching
Press any key to continue ...

Gunther
Get your facts first, and then you can distort them.

Antariy

  • Member
  • ****
  • Posts: 541
Re: SSE1 replacement for PCMPEQD
« Reply #3 on: August 16, 2013, 07:04:12 PM »
Very cute, Alex :t

Hi Alex,

rock solid.  :t

Thank you very much, Jochen and Gunther! :biggrin:

Siekmanski

  • Member
  • *****
  • Posts: 1144
Re: SSE1 replacement for PCMPEQD
« Reply #4 on: August 16, 2013, 08:11:09 PM »
 :t
Code: [Select]
Commiting the memory
Start test
16722522 cycles for AxPCMPEQD SSE1 MemCmp
26295687 cycles for REPE CMPSD MemCmp
76 cycles for AxPCMPEQD SSE1 Number crunching
14614443 cycles for AxPCMPEQD SSE2
46 cycles for AxPCMPEQD SSE2 Number crunching

Now with ASCII_7BIT_ONLY defined

14655033 cycles for AxPCMPEQD SSE1 MemCmp
26378181 cycles for REPE CMPSD MemCmp
46 cycles for AxPCMPEQD SSE1 Number crunching
14842827 cycles for AxPCMPEQD SSE2
46 cycles for AxPCMPEQD SSE2 Number crunching
Press any key to continue ...

Antariy

  • Member
  • ****
  • Posts: 541
Re: SSE1 replacement for PCMPEQD
« Reply #5 on: August 16, 2013, 09:23:33 PM »
Thank you, Marinus :biggrin: