Comparing 128-bit numbers aka OWORDs

Gunther · August 18, 2013, 05:24:44 AM

Hi qWord,

I think you're right. Very elegant solution. :t

Gunther

jj2007 · August 18, 2013, 08:27:45 AM

Quote from: qWord on August 18, 2013, 04:14:31 AM
just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

14167 cycles for 1000 * Ocmp (JJ)
73673 cycles for 1000 * cmp128b (loop)
9501 cycles for 1000 * cmp128 qWord
112854 cycles for 1000 * AxCMP128bit
:t

Gunther · August 18, 2013, 08:31:35 AM

The Timings:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 1648/1000 cycles

2375    cycles for 1000 * Ocmp (JJ)
60100   cycles for 1000 * cmp128b (loop)
4581    cycles for 1000 * cmp128 qWord
6580    cycles for 1000 * AxCMP128bit

2378    cycles for 1000 * Ocmp (JJ)
59763   cycles for 1000 * cmp128b (loop)
10781   cycles for 1000 * cmp128 qWord
6894    cycles for 1000 * AxCMP128bit

2345    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
4545    cycles for 1000 * cmp128 qWord
6595    cycles for 1000 * AxCMP128bit

--- ok ---

Well done, qWord. :t

Gunther

nidud · August 18, 2013, 09:11:57 AM

deleted

dedndave · August 18, 2013, 10:24:50 AM

i don't see how qWord's code is any faster than...

Code Select

OwordA OWORD ?
OwordB OWORD ?

        mov     eax,dword ptr OwordA[12]
        mov     edx,dword ptr OwordA[8]
        cmp     eax,dword ptr OwordB[12]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[8]
        mov     eax,dword ptr OwordA[4]
        jnz     FlagsSet

        cmp     eax,dword ptr OwordB[4]
        mov     edx,dword ptr OwordA[0]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[0]

FlagsSet:

;flags are set as though you had executed CMP OwordA,OwordB

in particular, register indirect addressing is a little slower than direct addressing
and, my code only pre-loads 1 dword, not all 4
in addition to all that, i do not rely on the carry flag being forwarded for SBB

qWord · August 18, 2013, 10:58:43 AM

dave,
nidud,
AFAICS your code does not do the same as mine. For the following test numbers,

Code Select

	A OWORD 0ffffffffffffffffffffffff00000001h
	B OWORD 0ffffffffffffffffffffffffffffffffh

your code returns that A is greater than B (signed), which is wrong.

jj2007 · August 18, 2013, 11:29:31 AM

Interesting :t

decimal, QWORD size (can't display more than that with deb, sorry ;))
qqA -4294967295
qqB -1

qA LESSER qB ; Ocmp MasmBasic
qA lesser qB ; cmp128 qWord
qB GREATER qA
qB greater qA

dedndave · August 18, 2013, 12:20:15 PM

good catch, qWord - lol
i have been using something like that for years
always for unsigned comparisons, though :P

still, if you combine my method and your method, i think you get faster CORRECT code

Antariy · August 18, 2013, 06:41:51 PM

Quote from: dedndave on August 18, 2013, 01:53:45 AM
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

You're right, and that's why I used RCR and not pushf/popf

Results for latest zip in the thread:

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2188/1000 cycles

24104   cycles for 1000 * Ocmp (JJ)
100352  cycles for 1000 * cmp128b (loop)
37171   cycles for 1000 * cmp128 qWord
33162   cycles for 1000 * AxCMP128bit

23649   cycles for 1000 * Ocmp (JJ)
100419  cycles for 1000 * cmp128b (loop)
37734   cycles for 1000 * cmp128 qWord
33313   cycles for 1000 * AxCMP128bit

24096   cycles for 1000 * Ocmp (JJ)
99257   cycles for 1000 * cmp128b (loop)
37482   cycles for 1000 * cmp128 qWord
33053   cycles for 1000 * AxCMP128bit


--- ok ---

Antariy · August 18, 2013, 08:05:10 PM

Having shamelessly stolen Jochen's implementation

and a bit simplified it, there is a "new" algo:

JJAxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1
   movups xmm0,[ow0]
   movzx eax,word ptr [ow0+14]
   pcmpeqb xmm0,[ow1]
   movzx edx,word ptr [ow1+14]
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   bsr ecx,ecx
   jz @l1
   cmp ecx,15
   jz @l1
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   @l1:
   cmp ax,dx
ENDM

The idea is the same - construct compared element from MSB and make a higher unmached byte as a LSB, but in a word-sized reg. If the only highest bytes differs, then it goes shorter way and doesn't update LSB.
And now this is the absolutely equal to the CMP instruction 128-bit-emulation - because if regs are equal, it anyway makes a final CMP, so flags are set as they should be, and right after comparsion we may use any Jcc instruction, not forced to use JZ first like in earlier algos (because we do short cut after PMOVMSKB/BSR, but there only ZF is set - other flags are undefined, but this is not the same behaviour as CMP does for equal regs - it should set them properly and do not leave "undefined"). And even SF is set the same as CMP does

Jochen, :t

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2287/1000 cycles

23052   cycles for 1000 * Ocmp (JJ)
100429  cycles for 1000 * cmp128b (loop)
36612   cycles for 1000 * cmp128 qWord
17043   cycles for 1000 * JJAxCMP128bit

23306   cycles for 1000 * Ocmp (JJ)
102801  cycles for 1000 * cmp128b (loop)
37323   cycles for 1000 * cmp128 qWord
16847   cycles for 1000 * JJAxCMP128bit

24068   cycles for 1000 * Ocmp (JJ)
96693   cycles for 1000 * cmp128b (loop)
36452   cycles for 1000 * cmp128 qWord
17062   cycles for 1000 * JJAxCMP128bit


--- ok ---

Antariy · August 18, 2013, 08:12:44 PM

Hi, nidud :t

Quote from: nidud on August 18, 2013, 09:11:57 AM
maybe I'm missing something here, but will not this also work:
Code Select Expand
cmp128f macro ow0,ow1 LOCAL @end mov eax,dword ptr ow0[12] cmp eax,dword ptr ow1[12] jne @end mov eax,dword ptr ow0[8] cmp eax,dword ptr ow1[8] jne @end mov eax,dword ptr ow0[4] cmp eax,dword ptr ow1[4] jne @end mov eax,dword ptr ow0 cmp eax,dword ptr ow1 @end: endm

The problem is that we need to fully "emulate" CMP behaviour, thus the construc should set all flags as CMP does. With such a straightforward code it will return proper result only for unsigned numbers (JA/JB/JZ), for signed - the result is unpredictable, because if not-highest-order DWORDs are different, they may be signed/unsigned (it's not predictable), but the entire OWORD which consists from them may have other signed/unsigned state. So, we forced to include the highest order element (DWORD or BYTE) to the comparsion to get right result.

Gunther · August 18, 2013, 08:18:22 PM

Hi Alex,

the timings:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3662/1000 cycles

362     cycles for 1000 * Ocmp (JJ)
57227   cycles for 1000 * cmp128b (loop)
2590    cycles for 1000 * cmp128 qWord
5042    cycles for 1000 * JJAxCMP128bit

412     cycles for 1000 * Ocmp (JJ)
57178   cycles for 1000 * cmp128b (loop)
2540    cycles for 1000 * cmp128 qWord
5050    cycles for 1000 * JJAxCMP128bit

353     cycles for 1000 * Ocmp (JJ)
57250   cycles for 1000 * cmp128b (loop)
2587    cycles for 1000 * cmp128 qWord
5045    cycles for 1000 * JJAxCMP128bit

--- ok ---

Good job. :t

Gunther

dedndave · August 18, 2013, 08:40:17 PM

very nice Alex :t

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2075/1000 cycles

23383   cycles for 1000 * Ocmp (JJ)
97294   cycles for 1000 * cmp128b (loop)
36753   cycles for 1000 * cmp128 qWord
17587   cycles for 1000 * JJAxCMP128bit

23394   cycles for 1000 * Ocmp (JJ)
97825   cycles for 1000 * cmp128b (loop)
36873   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23387   cycles for 1000 * Ocmp (JJ)
98043   cycles for 1000 * cmp128b (loop)
36745   cycles for 1000 * cmp128 qWord
17146   cycles for 1000 * JJAxCMP128bit

QuoteHaving shamelessly stolen Jochen's implementation...

the algo has even improved your English :lol:

Antariy · August 18, 2013, 10:01:52 PM

Thank you, Gunther and Dave!

Quote from: dedndave on August 18, 2013, 08:40:17 PM
very nice Alex :t

But Gunther's CPU likes Jochen's algo much better!

0.3 cycles for one comparsion! Anyway, it's faster there, much faster. Very interesting difference.

Quote from: dedndave on August 18, 2013, 08:40:17 PM
QuoteHaving shamelessly stolen Jochen's implementation...
the algo has even improved your English :lol:

Is this proper sentence? To be honest, very frequently, being writing something in a real-time, I'm very-very unsure that I'm writing properly :redface:

jj2007 · August 18, 2013, 10:16:21 PM

Quote from: Antariy on August 18, 2013, 10:01:52 PM
But Gunther's CPU likes Jochen's algo much better! 0.3 cycles for one comparsion! Anyway, it's faster there, much faster. Very interesting difference.

Something is wrong there - 0.3 cycles is impossible...

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1778/1000 cycles

14132 cycles for 1000 * Ocmp (JJ)
13074 cycles for 1000 * Ocmp2 (JJ)
73846 cycles for 1000 * cmp128b (loop)
9513 cycles for 1000 * cmp128 qWord
9141 cycles for 1000 * JJAxCMP128bit

Ocmp2 uses your xor dx, 0ffffh trick.

The MASM Forum

News:

Comparing 128-bit numbers aka OWORDs

Gunther

jj2007

Gunther

nidud

dedndave

qWord

jj2007

dedndave

Antariy

Antariy

Antariy

Gunther

dedndave

Antariy

jj2007