The MASM Forum
General => The Laboratory => Topic started by: Antariy on August 15, 2013, 11:15:18 PM

Here is the description and example usage for the "PCMPDQD replacement" (used, for an instance, here (http://)).
I used CMPPS SSE1 floatingpoint instruction in an "EQUAL" mode (alias is CMPEQPS) in comparsion of an integers, but to do that, neends to do it a bit indirectly  not with the "raw" integers.
Since it's FP instruction, it isn't designed to compare any arbitrary data, like integers are, because FP numbers have a defined format. But for CMPEQPS there are these conditions: if the number is NaN (i.e. arbitrary data, or a "garbage" from a FPformat point of view)  it doesn't signal an invalid operation, and if number is NaN  it returns a FALSE as a result of comparsion (notequal). This is documented behaviour, so we may rely on it.
Next though: what if compare arbitrary data with zero? A proper FP zero is equal to the integer zero. So, the results of a comparsion will be proper, because if arbitrary data was a binary zero  comparsion will return TRUE, if the data was not zero  with no difference, has had it some proper FP format or it was a NaN  it will return FALSE; just like it should do.
One more thing: to compare with a zero there obviously should be a shortcuts in a CPU design, because there is no need to "parse" the format via the circuit, it's more like a question of checking line states  if one number. is zero, and all its bits (lines) are zero, then if any other line (bit) of other number is set, the circuit may break further processing and return a shortcut results "not equal". So, we can really guess that comparsion of any arbitrary data with zero using CMPEQPS is not only safe (because it's based on a documentation) but also fast (i.e. comparsion of some "garbage" with zero will not cause a delays because of improper format of that "garbage"  arbitrary data).
So, we can use CMPEQPS to compare integers indirectly: we just need to XOR these integers with XORPS, and then compare the result with a zero.
But here we need to take in account that proper FP number has one difference from integer number: it may has a negative zero value. I.e., for a FPU/SSE the float numbers 80000000h and 0 are equal, but for the integer number or a bit stream comparsion it is not so. So, we need to avoid false results when CMPEQPS will return TRUE as a result of a comparsion for 80000000h and 0.
Here is the construct that meets the needs stated above:
ifndef USE_SSE2
xorps xmm1p,xmm0p
ifndef ASCII_7BIT_ONLY
movaps xmmtemp,xmm1p
endif
cmpps xmm1p,oword ptr AxPCMPEQD_zero_ow,0
ifndef ASCII_7BIT_ONLY
andps xmmtemp,xmm1p
xorps xmm1p,xmmtemp
endif
else
pcmpeqd xmm1p,xmm0p
endif
Here:
xmm1p  one XMM reg;
xmm0p  other XMM reg;
xmmtemp  third XMM reg;
If flag USE_SSE2 is not defined  used the PCMPEQD SSE1workaround code;
If the data is known to be in the format where the highest bit of the bytes is always clear, then code may be smaller/faster  by definition of ASCII_7BIT_ONLY  then the workaround consists only from two instructions, and correction for negative zero possibility (80000000h) is not included (because it's known that there may not be 80h bytes).
The name of the ASCII_7BIT_ONLY flag is that because I used this construct in a stringcomparsion code, so if the data known to be in a range of a standard ASCII (7 bit), the flag may be defined so.
In full form, with 0 correction, this trick works for comparsion of any data: numbers, just data stream etc.
The code attached is an example of a MemCmp using this construct.
In the test a two 16 MB buffers are compared. First it allocates memory, and commits it + fills with a "garbage", then runs SSE1 MemCmp, then GPR MemCmp, then SSE1 "numbercrunching" (a loop working with small memory location than may fit in cache), then SSE2 MemCmp, then SSE2 "numbercrunching"  this sequence is chosen to make sure tests on nonSSE2 capable machines. Then it repeats with ASCII_7BIT_ONLY defined.
It's very interesting, how this will behave on a different SSE1+ capable machines.
Timings:
Commiting the memory
Start test
41346696 cycles for AxPCMPEQD SSE1 MemCmp
43194864 cycles for REPE CMPSD MemCmp
423 cycles for AxPCMPEQD SSE1 Number crunching
38094976 cycles for AxPCMPEQD SSE2
269 cycles for AxPCMPEQD SSE2 Number crunching
Now with ASCII_7BIT_ONLY defined
38243504 cycles for AxPCMPEQD SSE1 MemCmp
43280048 cycles for REPE CMPSD MemCmp
269 cycles for AxPCMPEQD SSE1 Number crunching
38459352 cycles for AxPCMPEQD SSE2
271 cycles for AxPCMPEQD SSE2 Number crunching
Press any key to continue ...
In a MemCmplike functions the SSE1 workaround (5/2 instructions instead of one SSE2 instruction) seems to be pretty usable (since the RAM throughput is a limiter anyway). With ASCII_7BIT_ONLY flag defined (2 instructions mode) it maybe even faster than PCMPEQD on my machine :biggrin:

Celeron M:
Commiting the memory
Start test
18072096 cycles for AxPCMPEQD SSE1 MemCmp
21147972 cycles for REPE CMPSD MemCmp
154 cycles for AxPCMPEQD SSE1 Number crunching
17282292 cycles for AxPCMPEQD SSE2
77 cycles for AxPCMPEQD SSE2 Number crunching
Now with ASCII_7BIT_ONLY defined
17383536 cycles for AxPCMPEQD SSE1 MemCmp
21200976 cycles for REPE CMPSD MemCmp
77 cycles for AxPCMPEQD SSE1 Number crunching
17466456 cycles for AxPCMPEQD SSE2
77 cycles for AxPCMPEQD SSE2 Number crunching
Very cute, Alex :t

Hi Alex,
rock solid. :t
Commiting the memory
Start test
8328179 cycles for AxPCMPEQD SSE1 MemCmp
9438014 cycles for REPE CMPSD MemCmp
67 cycles for AxPCMPEQD SSE1 Number crunching
8675932 cycles for AxPCMPEQD SSE2
40 cycles for AxPCMPEQD SSE2 Number crunching
Now with ASCII_7BIT_ONLY defined
7499720 cycles for AxPCMPEQD SSE1 MemCmp
9610910 cycles for REPE CMPSD MemCmp
44 cycles for AxPCMPEQD SSE1 Number crunching
6954688 cycles for AxPCMPEQD SSE2
39 cycles for AxPCMPEQD SSE2 Number crunching
Press any key to continue ...
Gunther

Very cute, Alex :t
Hi Alex,
rock solid. :t
Thank you very much, Jochen and Gunther! :biggrin:

:t
Commiting the memory
Start test
16722522 cycles for AxPCMPEQD SSE1 MemCmp
26295687 cycles for REPE CMPSD MemCmp
76 cycles for AxPCMPEQD SSE1 Number crunching
14614443 cycles for AxPCMPEQD SSE2
46 cycles for AxPCMPEQD SSE2 Number crunching
Now with ASCII_7BIT_ONLY defined
14655033 cycles for AxPCMPEQD SSE1 MemCmp
26378181 cycles for REPE CMPSD MemCmp
46 cycles for AxPCMPEQD SSE1 Number crunching
14842827 cycles for AxPCMPEQD SSE2
46 cycles for AxPCMPEQD SSE2 Number crunching
Press any key to continue ...

Thank you, Marinus :biggrin: