Here is the description and example usage for the "PCMPDQD replacement" (used, for an instance,

here).

I used CMPPS SSE1 floating-point instruction in an "EQUAL" mode (alias is CMPEQPS) in comparsion of an

*integers*, but to do that, neends to do it a bit indirectly - not with the "raw" integers.

Since it's FP instruction, it isn't designed to compare any arbitrary data, like integers are, because FP numbers have a defined format. But for CMPEQPS there are these conditions: if the number is NaN (i.e. arbitrary data, or a "garbage" from a FP-format point of view) - it doesn't signal an invalid operation, and if number is NaN - it returns a FALSE as a result of comparsion (not-equal). This is documented behaviour, so we may rely on it.

Next though: what if compare arbitrary data with zero? A proper FP zero is equal to the integer zero. So, the results of a comparsion will be proper, because if arbitrary data was a binary zero - comparsion will return TRUE, if the data was not zero - with no difference, has had it some proper FP format or it was a NaN - it will return FALSE; just like it should do.

One more thing: to compare with a zero there obviously should be a shortcuts in a CPU design, because there is no need to "parse" the format via the circuit, it's more like a question of checking line states - if one number. is zero, and all its bits (lines) are zero, then if any other line (bit) of other number is set, the circuit may break further processing and return a shortcut results "not equal". So, we can really guess that comparsion of any arbitrary data with zero using CMPEQPS is not only safe (because it's based on a documentation) but also fast (i.e. comparsion of some "garbage" with zero will not cause a delays because of improper format of that "garbage" - arbitrary data).

So, we can use CMPEQPS to compare integers indirectly: we just need to XOR these integers with XORPS, and then compare the result with a zero.

But here we need to take in account that proper FP number has one difference from integer number: it may has a

*negative zero value*. I.e., for a FPU/SSE the float numbers 80000000h and 0 are equal, but for the integer number or a bit stream comparsion it is not so. So, we need to avoid false results when CMPEQPS will return TRUE as a result of a comparsion for 80000000h and 0.

Here is the construct that meets the needs stated above:

`ifndef USE_SSE2`

xorps xmm1p,xmm0p

ifndef ASCII_7BIT_ONLY

movaps xmmtemp,xmm1p

endif

cmpps xmm1p,oword ptr AxPCMPEQD_zero_ow,0

ifndef ASCII_7BIT_ONLY

andps xmmtemp,xmm1p

xorps xmm1p,xmmtemp

endif

else

pcmpeqd xmm1p,xmm0p

endifHere:

xmm1p - one XMM reg;

xmm0p - other XMM reg;

xmmtemp - third XMM reg;

If flag USE_SSE2 is not defined - used the PCMPEQD SSE1-workaround code;

If the data is known to be in the format where the highest bit of the bytes is always clear, then code may be smaller/faster - by definition of ASCII_7BIT_ONLY - then the workaround consists only from two instructions, and correction for negative zero possibility (80000000h) is not included (because it's known that there may not be 80h bytes).

The name of the ASCII_7BIT_ONLY flag is that because I used this construct in a string-comparsion code, so if the data known to be in a range of a standard ASCII (7 bit), the flag may be defined so.

In full form, with -0 correction, this trick works for comparsion of any data: numbers, just data stream etc.

The code attached is an example of a MemCmp using this construct.

In the test a two 16 MB buffers are compared. First it allocates memory, and commits it + fills with a "garbage", then runs SSE1 MemCmp, then GPR MemCmp, then SSE1 "number-crunching" (a loop working with small memory location than may fit in cache), then SSE2 MemCmp, then SSE2 "number-crunching" - this sequence is chosen to make sure tests on non-SSE2 capable machines. Then it repeats with ASCII_7BIT_ONLY defined.

It's very interesting, how this will behave on a different SSE1+ capable machines.

Timings:

`Commiting the memory`

Start test

41346696 cycles for AxPCMPEQD SSE1 MemCmp

43194864 cycles for REPE CMPSD MemCmp

423 cycles for AxPCMPEQD SSE1 Number crunching

38094976 cycles for AxPCMPEQD SSE2

269 cycles for AxPCMPEQD SSE2 Number crunching

Now with ASCII_7BIT_ONLY defined

38243504 cycles for AxPCMPEQD SSE1 MemCmp

43280048 cycles for REPE CMPSD MemCmp

269 cycles for AxPCMPEQD SSE1 Number crunching

38459352 cycles for AxPCMPEQD SSE2

271 cycles for AxPCMPEQD SSE2 Number crunching

Press any key to continue ...

In a MemCmp-like functions the SSE1 work-around (5/2 instructions instead of one SSE2 instruction) seems to be pretty usable (since the RAM throughput is a limiter anyway). With ASCII_7BIT_ONLY flag defined (2 instructions mode) it maybe even faster than PCMPEQD on my machine