Comparing 128-bit numbers aka OWORDs

Antariy · August 14, 2013, 12:57:03 PM

Jochen, if you limit code only to ZF and CF flags to check, how would you check if negative number is greater than zero, for an instance? That's why I preferred the standard flags setting in my version, and it seems more logical to use comparsion macro-(or call)-"instruction" in a way as usually do with CMP.

Antariy · August 14, 2013, 02:34:34 PM

Here is the test showing that all the value-comparsion-condition conditional jumps are working with my code. Request for a test

dedndave · August 14, 2013, 06:50:53 PM

yes - signed branches use the OF flag, as well
sign, carry, overflow - the one they give you that noone cares about is parity :P
aux carry is rarely used, too - mainly for BCD math, i think

i was thinking of implementing a function that could compare integers of any size

Code Select

INVOKE ArbCmp,nNumberOfBytes,lpFirst,lpSecond

it could compare the high-order bytes as bytes until alignment on the second operand is found
after that (if still equal), you could use an aligned method until inequality is found
don't want to use JECXZ because it's slow

jj2007 · August 14, 2013, 10:22:11 PM

Quote from: Antariy on August 14, 2013, 02:34:34 PM
Here is the test showing that all the value-comparsion-condition conditional jumps are working with my code. Request for a test

Hi Alex,
Here it is ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
18742 cycles for 1000 * Ocmp (JJ)
88733 cycles for 1000 * cmp128b (loop)
62612 cycles for 1000 * AxCMP128bit

18750 cycles for 1000 * Ocmp (JJ)
88605 cycles for 1000 * cmp128b (loop)
63719 cycles for 1000 * AxCMP128bit

And Ocmp passes all your tests now, i.e. it behaves exactly like a cmp eax, edx.

P.S.: The extra speed comes from the bswap instruction.

Antariy · August 15, 2013, 04:05:40 AM

Hi Jochen, here are results.

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2255/1000 cycles

25281   cycles for 1000 * Ocmp (JJ)
104235  cycles for 1000 * cmp128b (loop)
34615   cycles for 1000 * AxCMP128bit

25063   cycles for 1000 * Ocmp (JJ)
103227  cycles for 1000 * cmp128b (loop)
34196   cycles for 1000 * AxCMP128bit

28798   cycles for 1000 * Ocmp (JJ)
101244  cycles for 1000 * cmp128b (loop)
34112   cycles for 1000 * AxCMP128bit


--- ok ---

:t

jj2007 · August 15, 2013, 04:45:37 AM

Thanks to everybody, especially Alex :t

I have posted a "CMP defeats intuition" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error

).

Note the FPU behaves differently.

Gunther · August 15, 2013, 05:39:44 AM

Jochen,

the new timings for you:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 2141/1000 cycles

1959    cycles for 1000 * Ocmp (JJ)
60752   cycles for 1000 * cmp128b (loop)
6356    cycles for 1000 * AxCMP128bit

2096    cycles for 1000 * Ocmp (JJ)
60573   cycles for 1000 * cmp128b (loop)
6196    cycles for 1000 * AxCMP128bit

1918    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
6058    cycles for 1000 * AxCMP128bit

--- ok ---

Gunther

Antariy · August 15, 2013, 06:10:19 PM

Quote from: jj2007 on August 15, 2013, 04:45:37 AM
Thanks to everybody, especially Alex :t

:icon_redface: You're welcome, Jochen :t

Quote from: jj2007 on August 15, 2013, 04:45:37 AM
I have posted a "CMP defeats intuition" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error ).

Note the FPU behaves differently.

The most annoying thing in that is that there is not a full set of instructions to work with the flags selectively, like CLC/STC/CMC/CLD/STD/CLI/STI. For example - how to simply (un)set OF flag? Is there any other, simpler way than I used (i.e., via RCR - it is the only instruction I know of that may change OF flag and preserve ZF and allows to restore CF to its previous state (before RCR))?

dedndave · August 18, 2013, 01:53:45 AM

if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

but, you could come up with a set of operands/operations to generate the flag conditions, as desired

Code Select

mov al,7Fh
mov ah,88h
sub al,ah

so, at the end of your code, you could have

Code Select

SetFlags:
sub al,88h

and branch to that location with different values in AL
something like that :P

jj2007 · August 18, 2013, 02:12:44 AM

The solution for setting the flags in the Ocmp & Qcmp macros was actually inspired by Alex' insistence that the macro should behave exactly like a cmp reg32, reg32. So the trick is to take two OWORDs, for example:

oSmall OWORD 88000000000000000000000000000100h ; 88=NEGATIVE
oBig OWORD 88000000000000000000000000000300h

.. to scan them with pcmpeqb & bsr for the first different byte, and then "construct" two reg32:
eax = 88000001
edx = 88000003

Then, a simple cmp eax, edx sets the flags. Simple and fast...

dedndave · August 18, 2013, 02:15:39 AM

that is a great solution :t

jj2007 · August 18, 2013, 02:20:39 AM

Quote from: dedndave on August 18, 2013, 02:15:39 AM
that is a great solution :t

Thanks ;-)
By the way, if you put the pcmpeqb part into a loop, it should be easy to implement the arbitrary length algo you proposed. If the length is below OWORD, you will have to clear the last bits after bsr.

dedndave · August 18, 2013, 02:29:59 AM

i like the flag-setting solution
i am not as fond of using BSR :P

jj2007 · August 18, 2013, 03:01:41 AM

Then cmps is your candidate, Dave ;-)

qWord · August 18, 2013, 04:14:31 AM

just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction:

Code Select

cmp128 macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@end
	lea esi,ow0
	lea edi,ow1
	mov eax,[esi+0*DWORD]
	mov ecx,[esi+1*DWORD]
	mov edx,[esi+2*DWORD]
	mov ebx,[esi+3*DWORD]
	sub eax,[edi+0*DWORD]
	jnz @NE1
	sbb ecx,[edi+1*DWORD]
	jnz @NE2
	sbb edx,[edi+2*DWORD]
	jnz @NE3
	sbb ebx,[edi+3*DWORD]
	;/* equal */
	jmp @end

@NE1:	sbb ecx,[edi+1*DWORD]
@NE2:	sbb edx,[edi+2*DWORD]
@NE3:	sbb ebx,[edi+3*DWORD]
	;/* LT or GT */
	jnz @end
	or ebx,2    ; MOV may be better, because it breaks the dependency chain...
	cmp ebx,1	; new flags: A GT B
@end:
endm

The MASM Forum

News:

Comparing 128-bit numbers aka OWORDs

Antariy

Antariy

dedndave

jj2007

Antariy

jj2007

Gunther

Antariy

dedndave

jj2007

dedndave

jj2007

dedndave

jj2007

qWord