News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

Antariy

Jochen, if you limit code only to ZF and CF flags to check, how would you check if negative number is greater than zero, for an instance? That's why I preferred the standard flags setting in my version, and it seems more logical to use comparsion macro-(or call)-"instruction" in a way as usually do with CMP.

Antariy

Here is the test showing that all the value-comparsion-condition conditional jumps are working with my code. Request for a test :biggrin:

dedndave

yes - signed branches use the OF flag, as well
sign, carry, overflow - the one they give you that noone cares about is parity   :P
aux carry is rarely used, too - mainly for BCD math, i think

i was thinking of implementing a function that could compare integers of any size
INVOKE  ArbCmp,nNumberOfBytes,lpFirst,lpSecond

it could compare the high-order bytes as bytes until alignment on the second operand is found
after that (if still equal), you could use an aligned method until inequality is found
don't want to use JECXZ because it's slow

jj2007

Quote from: Antariy on August 14, 2013, 02:34:34 PM
Here is the test showing that all the value-comparsion-condition conditional jumps are working with my code. Request for a test :biggrin:

Hi Alex,
Here it is ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
18742   cycles for 1000 * Ocmp (JJ)
88733   cycles for 1000 * cmp128b (loop)
62612   cycles for 1000 * AxCMP128bit

18750   cycles for 1000 * Ocmp (JJ)
88605   cycles for 1000 * cmp128b (loop)
63719   cycles for 1000 * AxCMP128bit


And Ocmp passes all your tests now, i.e. it behaves exactly like a cmp eax, edx.

P.S.: The extra speed comes from the bswap instruction.

Antariy

Hi Jochen, here are results.


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2255/1000 cycles

25281   cycles for 1000 * Ocmp (JJ)
104235  cycles for 1000 * cmp128b (loop)
34615   cycles for 1000 * AxCMP128bit

25063   cycles for 1000 * Ocmp (JJ)
103227  cycles for 1000 * cmp128b (loop)
34196   cycles for 1000 * AxCMP128bit

28798   cycles for 1000 * Ocmp (JJ)
101244  cycles for 1000 * cmp128b (loop)
34112   cycles for 1000 * AxCMP128bit


--- ok ---



:t

jj2007

Thanks to everybody, especially Alex :t

I have posted a "CMP defeats intuition" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error :biggrin:).

Note the FPU behaves differently.

Gunther

Jochen,

the new timings for you:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 2141/1000 cycles

1959    cycles for 1000 * Ocmp (JJ)
60752   cycles for 1000 * cmp128b (loop)
6356    cycles for 1000 * AxCMP128bit

2096    cycles for 1000 * Ocmp (JJ)
60573   cycles for 1000 * cmp128b (loop)
6196    cycles for 1000 * AxCMP128bit

1918    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
6058    cycles for 1000 * AxCMP128bit

--- ok ---


Gunther
You have to know the facts before you can distort them.

Antariy

Quote from: jj2007 on August 15, 2013, 04:45:37 AM
Thanks to everybody, especially Alex :t

:icon_redface: You're welcome, Jochen :t

Quote from: jj2007 on August 15, 2013, 04:45:37 AM
I have posted a "CMP defeats intuition" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error :biggrin:).

Note the FPU behaves differently.

The most annoying thing in that is that there is not a full set of instructions to work with the flags selectively, like CLC/STC/CMC/CLD/STD/CLI/STI. For example - how to simply (un)set OF flag? Is there any other, simpler way than I used (i.e., via RCR - it is the only instruction I know of that may change OF flag and preserve ZF and allows to restore CF to its previous state (before RCR))?

dedndave

if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

but, you could come up with a set of operands/operations to generate the flag conditions, as desired

mov al,7Fh
mov ah,88h
sub al,ah


so, at the end of your code, you could have
SetFlags:
sub al,88h


and branch to that location with different values in AL
something like that   :P

jj2007

The solution for setting the flags in the Ocmp & Qcmp macros was actually inspired by Alex' insistence that the macro should behave exactly like a cmp reg32, reg32. So the trick is to take two OWORDs, for example:

  oSmall   OWORD 88000000000000000000000000000100h   ; 88=NEGATIVE
  oBig   OWORD 88000000000000000000000000000300h

.. to scan them with pcmpeqb & bsr for the first different byte, and then "construct" two reg32:
  eax = 88000001
  edx = 88000003

Then, a simple cmp eax, edx sets the flags. Simple and fast...

dedndave


jj2007

Quote from: dedndave on August 18, 2013, 02:15:39 AM
that is a great solution   :t
Thanks ;-)
By the way, if you put the pcmpeqb part into a loop, it should be easy to implement the arbitrary length algo you proposed. If the length is below OWORD, you will have to clear the last bits after bsr.

dedndave

i like the flag-setting solution
i am not as fond of using BSR   :P

jj2007


qWord

just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction:
cmp128 macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@end
lea esi,ow0
lea edi,ow1
mov eax,[esi+0*DWORD]
mov ecx,[esi+1*DWORD]
mov edx,[esi+2*DWORD]
mov ebx,[esi+3*DWORD]
sub eax,[edi+0*DWORD]
jnz @NE1
sbb ecx,[edi+1*DWORD]
jnz @NE2
sbb edx,[edi+2*DWORD]
jnz @NE3
sbb ebx,[edi+3*DWORD]
;/* equal */
jmp @end

@NE1: sbb ecx,[edi+1*DWORD]
@NE2: sbb edx,[edi+2*DWORD]
@NE3: sbb ebx,[edi+3*DWORD]
;/* LT or GT */
jnz @end
or ebx,2    ; MOV may be better, because it breaks the dependency chain...
cmp ebx,1 ; new flags: A GT B
@end:
endm
MREAL macros - when you need floating point arithmetic while assembling!