News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

Gunther

Hi qWord,

I think you're right. Very elegant solution.  :t

Gunther
You have to know the facts before you can distort them.

jj2007

Quote from: qWord on August 18, 2013, 04:14:31 AM
just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

14167   cycles for 1000 * Ocmp (JJ)
73673   cycles for 1000 * cmp128b (loop)
9501    cycles for 1000 * cmp128 qWord
112854  cycles for 1000 * AxCMP128bit

:t

Gunther

The Timings:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 1648/1000 cycles

2375    cycles for 1000 * Ocmp (JJ)
60100   cycles for 1000 * cmp128b (loop)
4581    cycles for 1000 * cmp128 qWord
6580    cycles for 1000 * AxCMP128bit

2378    cycles for 1000 * Ocmp (JJ)
59763   cycles for 1000 * cmp128b (loop)
10781   cycles for 1000 * cmp128 qWord
6894    cycles for 1000 * AxCMP128bit

2345    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
4545    cycles for 1000 * cmp128 qWord
6595    cycles for 1000 * AxCMP128bit

--- ok ---


Well done, qWord.  :t

Gunther
You have to know the facts before you can distort them.

nidud

#48
deleted

dedndave

i don't see how qWord's code is any faster than...

OwordA OWORD ?
OwordB OWORD ?

        mov     eax,dword ptr OwordA[12]
        mov     edx,dword ptr OwordA[8]
        cmp     eax,dword ptr OwordB[12]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[8]
        mov     eax,dword ptr OwordA[4]
        jnz     FlagsSet

        cmp     eax,dword ptr OwordB[4]
        mov     edx,dword ptr OwordA[0]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[0]

FlagsSet:

;flags are set as though you had executed CMP OwordA,OwordB


in particular, register indirect addressing is a little slower than direct addressing
and, my code only pre-loads 1 dword, not all 4
in addition to all that, i do not rely on the carry flag being forwarded for SBB

qWord

dave,
nidud,
AFAICS your code does not do the same as mine. For the following test numbers,

A OWORD 0ffffffffffffffffffffffff00000001h
B OWORD 0ffffffffffffffffffffffffffffffffh

your code returns that A is greater than B (signed), which is wrong.
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Interesting :t

decimal, QWORD size (can't display more than that with deb, sorry ;))
qqA             -4294967295
qqB             -1

qA LESSER qB    ; Ocmp MasmBasic
qA lesser qB    ; cmp128 qWord
qB GREATER qA
qB greater qA

dedndave

good catch, qWord - lol
i have been using something like that for years
always for unsigned comparisons, though   :P

still, if you combine my method and your method, i think you get faster CORRECT code   :biggrin:

Antariy

Quote from: dedndave on August 18, 2013, 01:53:45 AM
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

You're right, and that's why I used RCR and not pushf/popf :biggrin:

Results for latest zip in the thread:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2188/1000 cycles

24104   cycles for 1000 * Ocmp (JJ)
100352  cycles for 1000 * cmp128b (loop)
37171   cycles for 1000 * cmp128 qWord
33162   cycles for 1000 * AxCMP128bit

23649   cycles for 1000 * Ocmp (JJ)
100419  cycles for 1000 * cmp128b (loop)
37734   cycles for 1000 * cmp128 qWord
33313   cycles for 1000 * AxCMP128bit

24096   cycles for 1000 * Ocmp (JJ)
99257   cycles for 1000 * cmp128b (loop)
37482   cycles for 1000 * cmp128 qWord
33053   cycles for 1000 * AxCMP128bit


--- ok ---

Antariy

Having shamelessly stolen Jochen's implementation :biggrin: and a bit simplified it, there is a "new" algo:


JJAxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1
   movups xmm0,[ow0]
   movzx eax,word ptr [ow0+14]
   pcmpeqb xmm0,[ow1]
   movzx edx,word ptr [ow1+14]
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   bsr ecx,ecx
   jz @l1
   cmp ecx,15
   jz @l1
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]   
   @l1:
   cmp ax,dx
ENDM


The idea is the same - construct compared element from MSB and make a higher unmached byte as a LSB, but in a word-sized reg. If the only highest bytes differs, then it goes shorter way and doesn't update LSB.
And now this is the absolutely equal to the CMP instruction 128-bit-emulation - because if regs are equal, it anyway makes a final CMP, so flags are set as they should be, and right after comparsion we may use any Jcc instruction, not forced to use JZ first like in earlier algos (because we do short cut after PMOVMSKB/BSR, but there only ZF is set - other flags are undefined, but this is not the same behaviour as CMP does for equal regs - it should set them properly and do not leave "undefined"). And even SF is set the same as CMP does :biggrin: Jochen, :t


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2287/1000 cycles

23052   cycles for 1000 * Ocmp (JJ)
100429  cycles for 1000 * cmp128b (loop)
36612   cycles for 1000 * cmp128 qWord
17043   cycles for 1000 * JJAxCMP128bit

23306   cycles for 1000 * Ocmp (JJ)
102801  cycles for 1000 * cmp128b (loop)
37323   cycles for 1000 * cmp128 qWord
16847   cycles for 1000 * JJAxCMP128bit

24068   cycles for 1000 * Ocmp (JJ)
96693   cycles for 1000 * cmp128b (loop)
36452   cycles for 1000 * cmp128 qWord
17062   cycles for 1000 * JJAxCMP128bit


--- ok ---

Antariy

Hi, nidud :t

Quote from: nidud on August 18, 2013, 09:11:57 AM
maybe I'm missing something here, but will not this also work:

cmp128f macro ow0,ow1
LOCAL @end
mov eax,dword ptr ow0[12]
cmp eax,dword ptr ow1[12]
jne @end
mov eax,dword ptr ow0[8]
cmp eax,dword ptr ow1[8]
jne @end
mov eax,dword ptr ow0[4]
cmp eax,dword ptr ow1[4]
jne @end
mov eax,dword ptr ow0
cmp eax,dword ptr ow1
@end:
endm


The problem is that we need to fully "emulate" CMP behaviour, thus the construc should set all flags as CMP does. With such a straightforward code it will return proper result only for unsigned numbers (JA/JB/JZ), for signed - the result is unpredictable, because if not-highest-order DWORDs are different, they may be signed/unsigned (it's not predictable), but the entire OWORD which consists from them may have other signed/unsigned state. So, we forced to include the highest order element (DWORD or BYTE) to the comparsion to get right result.

Gunther

Hi Alex,

the timings:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3662/1000 cycles

362     cycles for 1000 * Ocmp (JJ)
57227   cycles for 1000 * cmp128b (loop)
2590    cycles for 1000 * cmp128 qWord
5042    cycles for 1000 * JJAxCMP128bit

412     cycles for 1000 * Ocmp (JJ)
57178   cycles for 1000 * cmp128b (loop)
2540    cycles for 1000 * cmp128 qWord
5050    cycles for 1000 * JJAxCMP128bit

353     cycles for 1000 * Ocmp (JJ)
57250   cycles for 1000 * cmp128b (loop)
2587    cycles for 1000 * cmp128 qWord
5045    cycles for 1000 * JJAxCMP128bit

--- ok ---


Good job.  :t

Gunther
You have to know the facts before you can distort them.

dedndave

very nice Alex   :t
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2075/1000 cycles

23383   cycles for 1000 * Ocmp (JJ)
97294   cycles for 1000 * cmp128b (loop)
36753   cycles for 1000 * cmp128 qWord
17587   cycles for 1000 * JJAxCMP128bit

23394   cycles for 1000 * Ocmp (JJ)
97825   cycles for 1000 * cmp128b (loop)
36873   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23387   cycles for 1000 * Ocmp (JJ)
98043   cycles for 1000 * cmp128b (loop)
36745   cycles for 1000 * cmp128 qWord
17146   cycles for 1000 * JJAxCMP128bit


QuoteHaving shamelessly stolen Jochen's implementation...
the algo has even improved your English   :lol:

Antariy

Thank you, Gunther and Dave! :biggrin:

Quote from: dedndave on August 18, 2013, 08:40:17 PM
very nice Alex   :t

But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparsion! Anyway, it's faster there, much faster. Very interesting difference.

Quote from: dedndave on August 18, 2013, 08:40:17 PM
QuoteHaving shamelessly stolen Jochen's implementation...
the algo has even improved your English   :lol:

Is this proper sentence? To be honest, very frequently, being writing something in a real-time, I'm very-very unsure that I'm writing properly :redface:

jj2007

Quote from: Antariy on August 18, 2013, 10:01:52 PM
But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparsion! Anyway, it's faster there, much faster. Very interesting difference.

Something is wrong there - 0.3 cycles is impossible...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1778/1000 cycles

14132   cycles for 1000 * Ocmp (JJ)
13074   cycles for 1000 * Ocmp2 (JJ)
73846   cycles for 1000 * cmp128b (loop)
9513    cycles for 1000 * cmp128 qWord
9141    cycles for 1000 * JJAxCMP128bit


Ocmp2 uses your xor dx, 0ffffh trick.