Author Topic: Comparing 128-bit numbers aka OWORDs  (Read 123403 times)

Gunther

  • Member
  • *****
  • Posts: 3722
  • Forgive your enemies, but never forget their names
Re: Comparing 128-bit numbers aka OWORDs
« Reply #45 on: August 18, 2013, 05:24:44 AM »
Hi qWord,

I think you're right. Very elegant solution.  :t

Gunther
Get your facts first, and then you can distort them.

jj2007

  • Member
  • *****
  • Posts: 11450
  • Assembler is fun ;-)
    • MasmBasic
Re: Comparing 128-bit numbers aka OWORDs
« Reply #46 on: August 18, 2013, 08:27:45 AM »
just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

14167   cycles for 1000 * Ocmp (JJ)
73673   cycles for 1000 * cmp128b (loop)
9501    cycles for 1000 * cmp128 qWord
112854  cycles for 1000 * AxCMP128bit

 :t

Gunther

  • Member
  • *****
  • Posts: 3722
  • Forgive your enemies, but never forget their names
Re: Comparing 128-bit numbers aka OWORDs
« Reply #47 on: August 18, 2013, 08:31:35 AM »
The Timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 1648/1000 cycles

2375    cycles for 1000 * Ocmp (JJ)
60100   cycles for 1000 * cmp128b (loop)
4581    cycles for 1000 * cmp128 qWord
6580    cycles for 1000 * AxCMP128bit

2378    cycles for 1000 * Ocmp (JJ)
59763   cycles for 1000 * cmp128b (loop)
10781   cycles for 1000 * cmp128 qWord
6894    cycles for 1000 * AxCMP128bit

2345    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
4545    cycles for 1000 * cmp128 qWord
6595    cycles for 1000 * AxCMP128bit

--- ok ---

Well done, qWord.  :t

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 2147
    • https://github.com/nidud/asmc
Re: Comparing 128-bit numbers aka OWORDs
« Reply #48 on: August 18, 2013, 09:11:57 AM »
maybe I'm missing something here, but will not this also work:
Code: [Select]
cmp128f macro ow0,ow1
LOCAL @end
mov eax,dword ptr ow0[12]
cmp eax,dword ptr ow1[12]
jne @end
mov eax,dword ptr ow0[8]
cmp eax,dword ptr ow1[8]
jne @end
mov eax,dword ptr ow0[4]
cmp eax,dword ptr ow1[4]
jne @end
mov eax,dword ptr ow0
cmp eax,dword ptr ow1
@end:
endm

result:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 2013/1000 cycles

9249 cycles for 1000 * Ocmp (JJ)
84785 cycles for 1000 * cmp128b (loop)
7066 cycles for 1000 * cmp128 qWord
57891 cycles for 1000 * AxCMP128bit
2036 cycles for 1000 * cmp128 nidud

9249 cycles for 1000 * Ocmp (JJ)
84805 cycles for 1000 * cmp128b (loop)
7055 cycles for 1000 * cmp128 qWord
57614 cycles for 1000 * AxCMP128bit
2030 cycles for 1000 * cmp128 nidud

9298 cycles for 1000 * Ocmp (JJ)
84779 cycles for 1000 * cmp128b (loop)
7060 cycles for 1000 * cmp128 qWord
57623 cycles for 1000 * AxCMP128bit
2030 cycles for 1000 * cmp128 nidud

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Comparing 128-bit numbers aka OWORDs
« Reply #49 on: August 18, 2013, 10:24:50 AM »
i don't see how qWord's code is any faster than...

Code: [Select]
OwordA OWORD ?
OwordB OWORD ?

        mov     eax,dword ptr OwordA[12]
        mov     edx,dword ptr OwordA[8]
        cmp     eax,dword ptr OwordB[12]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[8]
        mov     eax,dword ptr OwordA[4]
        jnz     FlagsSet

        cmp     eax,dword ptr OwordB[4]
        mov     edx,dword ptr OwordA[0]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[0]

FlagsSet:

;flags are set as though you had executed CMP OwordA,OwordB

in particular, register indirect addressing is a little slower than direct addressing
and, my code only pre-loads 1 dword, not all 4
in addition to all that, i do not rely on the carry flag being forwarded for SBB

qWord

  • Member
  • *****
  • Posts: 1476
  • The base type of a type is the type itself
    • SmplMath macros
Re: Comparing 128-bit numbers aka OWORDs
« Reply #50 on: August 18, 2013, 10:58:43 AM »
dave,
nidud,
AFAICS your code does not do the same as mine. For the following test numbers,

Code: [Select]
A OWORD 0ffffffffffffffffffffffff00000001h
B OWORD 0ffffffffffffffffffffffffffffffffh
your code returns that A is greater than B (signed), which is wrong.
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

  • Member
  • *****
  • Posts: 11450
  • Assembler is fun ;-)
    • MasmBasic
Re: Comparing 128-bit numbers aka OWORDs
« Reply #51 on: August 18, 2013, 11:29:31 AM »
Interesting :t

decimal, QWORD size (can't display more than that with deb, sorry ;))
qqA             -4294967295
qqB             -1

qA LESSER qB    ; Ocmp MasmBasic
qA lesser qB    ; cmp128 qWord
qB GREATER qA
qB greater qA

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Comparing 128-bit numbers aka OWORDs
« Reply #52 on: August 18, 2013, 12:20:15 PM »
good catch, qWord - lol
i have been using something like that for years
always for unsigned comparisons, though   :P

still, if you combine my method and your method, i think you get faster CORRECT code   :biggrin:

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #53 on: August 18, 2013, 06:41:51 PM »
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

You're right, and that's why I used RCR and not pushf/popf :biggrin:

Results for latest zip in the thread:

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2188/1000 cycles

24104   cycles for 1000 * Ocmp (JJ)
100352  cycles for 1000 * cmp128b (loop)
37171   cycles for 1000 * cmp128 qWord
33162   cycles for 1000 * AxCMP128bit

23649   cycles for 1000 * Ocmp (JJ)
100419  cycles for 1000 * cmp128b (loop)
37734   cycles for 1000 * cmp128 qWord
33313   cycles for 1000 * AxCMP128bit

24096   cycles for 1000 * Ocmp (JJ)
99257   cycles for 1000 * cmp128b (loop)
37482   cycles for 1000 * cmp128 qWord
33053   cycles for 1000 * AxCMP128bit


--- ok ---

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #54 on: August 18, 2013, 08:05:10 PM »
Having shamelessly stolen Jochen's implementation :biggrin: and a bit simplified it, there is a "new" algo:


JJAxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1
   movups xmm0,[ow0]
   movzx eax,word ptr [ow0+14]
   pcmpeqb xmm0,[ow1]
   movzx edx,word ptr [ow1+14]
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   bsr ecx,ecx
   jz @l1
   cmp ecx,15
   jz @l1
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]   
   @l1:
   cmp ax,dx
ENDM


The idea is the same - construct compared element from MSB and make a higher unmached byte as a LSB, but in a word-sized reg. If the only highest bytes differs, then it goes shorter way and doesn't update LSB.
And now this is the absolutely equal to the CMP instruction 128-bit-emulation - because if regs are equal, it anyway makes a final CMP, so flags are set as they should be, and right after comparsion we may use any Jcc instruction, not forced to use JZ first like in earlier algos (because we do short cut after PMOVMSKB/BSR, but there only ZF is set - other flags are undefined, but this is not the same behaviour as CMP does for equal regs - it should set them properly and do not leave "undefined"). And even SF is set the same as CMP does :biggrin: Jochen, :t

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2287/1000 cycles

23052   cycles for 1000 * Ocmp (JJ)
100429  cycles for 1000 * cmp128b (loop)
36612   cycles for 1000 * cmp128 qWord
17043   cycles for 1000 * JJAxCMP128bit

23306   cycles for 1000 * Ocmp (JJ)
102801  cycles for 1000 * cmp128b (loop)
37323   cycles for 1000 * cmp128 qWord
16847   cycles for 1000 * JJAxCMP128bit

24068   cycles for 1000 * Ocmp (JJ)
96693   cycles for 1000 * cmp128b (loop)
36452   cycles for 1000 * cmp128 qWord
17062   cycles for 1000 * JJAxCMP128bit


--- ok ---

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #55 on: August 18, 2013, 08:12:44 PM »
Hi, nidud :t

maybe I'm missing something here, but will not this also work:
Code: [Select]
cmp128f macro ow0,ow1
LOCAL @end
mov eax,dword ptr ow0[12]
cmp eax,dword ptr ow1[12]
jne @end
mov eax,dword ptr ow0[8]
cmp eax,dword ptr ow1[8]
jne @end
mov eax,dword ptr ow0[4]
cmp eax,dword ptr ow1[4]
jne @end
mov eax,dword ptr ow0
cmp eax,dword ptr ow1
@end:
endm

The problem is that we need to fully "emulate" CMP behaviour, thus the construc should set all flags as CMP does. With such a straightforward code it will return proper result only for unsigned numbers (JA/JB/JZ), for signed - the result is unpredictable, because if not-highest-order DWORDs are different, they may be signed/unsigned (it's not predictable), but the entire OWORD which consists from them may have other signed/unsigned state. So, we forced to include the highest order element (DWORD or BYTE) to the comparsion to get right result.

Gunther

  • Member
  • *****
  • Posts: 3722
  • Forgive your enemies, but never forget their names
Re: Comparing 128-bit numbers aka OWORDs
« Reply #56 on: August 18, 2013, 08:18:22 PM »
Hi Alex,

the timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3662/1000 cycles

362     cycles for 1000 * Ocmp (JJ)
57227   cycles for 1000 * cmp128b (loop)
2590    cycles for 1000 * cmp128 qWord
5042    cycles for 1000 * JJAxCMP128bit

412     cycles for 1000 * Ocmp (JJ)
57178   cycles for 1000 * cmp128b (loop)
2540    cycles for 1000 * cmp128 qWord
5050    cycles for 1000 * JJAxCMP128bit

353     cycles for 1000 * Ocmp (JJ)
57250   cycles for 1000 * cmp128b (loop)
2587    cycles for 1000 * cmp128 qWord
5045    cycles for 1000 * JJAxCMP128bit

--- ok ---

Good job.  :t

Gunther
Get your facts first, and then you can distort them.

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Comparing 128-bit numbers aka OWORDs
« Reply #57 on: August 18, 2013, 08:40:17 PM »
very nice Alex   :t
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2075/1000 cycles

23383   cycles for 1000 * Ocmp (JJ)
97294   cycles for 1000 * cmp128b (loop)
36753   cycles for 1000 * cmp128 qWord
17587   cycles for 1000 * JJAxCMP128bit

23394   cycles for 1000 * Ocmp (JJ)
97825   cycles for 1000 * cmp128b (loop)
36873   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23387   cycles for 1000 * Ocmp (JJ)
98043   cycles for 1000 * cmp128b (loop)
36745   cycles for 1000 * cmp128 qWord
17146   cycles for 1000 * JJAxCMP128bit

Quote
Having shamelessly stolen Jochen's implementation...
the algo has even improved your English   :lol:

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #58 on: August 18, 2013, 10:01:52 PM »
Thank you, Gunther and Dave! :biggrin:

very nice Alex   :t

But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparsion! Anyway, it's faster there, much faster. Very interesting difference.

Quote
Having shamelessly stolen Jochen's implementation...
the algo has even improved your English   :lol:

Is this proper sentence? To be honest, very frequently, being writing something in a real-time, I'm very-very unsure that I'm writing properly :redface:

jj2007

  • Member
  • *****
  • Posts: 11450
  • Assembler is fun ;-)
    • MasmBasic
Re: Comparing 128-bit numbers aka OWORDs
« Reply #59 on: August 18, 2013, 10:16:21 PM »
But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparsion! Anyway, it's faster there, much faster. Very interesting difference.

Something is wrong there - 0.3 cycles is impossible...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1778/1000 cycles

14132   cycles for 1000 * Ocmp (JJ)
13074   cycles for 1000 * Ocmp2 (JJ)
73846   cycles for 1000 * cmp128b (loop)
9513    cycles for 1000 * cmp128 qWord
9141    cycles for 1000 * JJAxCMP128bit


Ocmp2 uses your xor dx, 0ffffh trick.