Author Topic: Comparing 128-bit numbers aka OWORDs  (Read 118983 times)

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #60 on: August 18, 2013, 11:12:40 PM »
Something is wrong there - 0.3 cycles is impossible...

It may be some inconsistence in timings, like it often happens, but probably it shows that Gunther's CPU model runs your version faster.

Timings for the new archive:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2261/1000 cycles

26992   cycles for 1000 * Ocmp (JJ)
23136   cycles for 1000 * Ocmp2 (JJ)
108573  cycles for 1000 * cmp128b (loop)
40769   cycles for 1000 * cmp128 qWord
18240   cycles for 1000 * JJAxCMP128bit

24862   cycles for 1000 * Ocmp (JJ)
22679   cycles for 1000 * Ocmp2 (JJ)
107366  cycles for 1000 * cmp128b (loop)
38917   cycles for 1000 * cmp128 qWord
18459   cycles for 1000 * JJAxCMP128bit

24960   cycles for 1000 * Ocmp (JJ)
22641   cycles for 1000 * Ocmp2 (JJ)
107176  cycles for 1000 * cmp128b (loop)
38921   cycles for 1000 * cmp128 qWord
18207   cycles for 1000 * JJAxCMP128bit


--- ok ---


BTW: in my variation of code here:

   xor ecx,0FFFFh
   bsr ecx,ecx
   jz @l1


we may have couple of cycles less when numbers are equal and jz @l1 is moved before BSR (just a note, it will not improve timings in current testbed).

nidud

  • Member
  • *****
  • Posts: 2049
    • https://github.com/nidud/asmc
Re: Comparing 128-bit numbers aka OWORDs
« Reply #61 on: August 18, 2013, 11:52:46 PM »
your code returns that A is greater than B (signed), which is wrong.

:biggrin:

I've also been using that method, but (hopefully) only for unsigned values.

You may argue that the test is slightly rigged, so in this regard the fastest code will be: :lol:
Code: [Select]
mov ax,word ptr ow1
cmp ax,word ptr ow0

Well, here is the latest test results:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 2008/1000 cycles

9208 cycles for 1000 * Ocmp (JJ)
90229 cycles for 1000 * cmp128b (loop)
7016 cycles for 1000 * cmp128 qWord
9545 cycles for 1000 * JJAxCMP128bit

9206 cycles for 1000 * Ocmp (JJ)
94243 cycles for 1000 * cmp128b (loop)
7008 cycles for 1000 * cmp128 qWord
9564 cycles for 1000 * JJAxCMP128bit

9205 cycles for 1000 * Ocmp (JJ)
89038 cycles for 1000 * cmp128b (loop)
7011 cycles for 1000 * cmp128 qWord
9532 cycles for 1000 * JJAxCMP128bit

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #62 on: August 19, 2013, 08:56:10 AM »
AMD, as usual, prefers GPR code :biggrin:

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #63 on: August 19, 2013, 12:30:34 PM »
BTW, in qWord's code, are really the commented instructions required? They seem to be superfluous.


   push esi
   push edi
   push ebx
   lea esi,ow0
   lea edi,ow1
   mov eax,[esi+0*DWORD]
   mov ecx,[esi+1*DWORD]
   mov edx,[esi+2*DWORD]
   mov ebx,[esi+3*DWORD]
   sub eax,[edi+0*DWORD]
   ;jnz @NE1
   sbb ecx,[edi+1*DWORD]
   ;jnz @NE2
   sbb edx,[edi+2*DWORD]
   ;jnz @NE3
   sbb ebx,[edi+3*DWORD]
   ;/* equal */
   ;jmp @end

;@NE1:   sbb ecx,[edi+1*DWORD]
;@NE2:   sbb edx,[edi+2*DWORD]
;@NE3:   sbb ebx,[edi+3*DWORD]
   ;/* LT or GT */
;   jnz @end
;   or ebx,2    ; MOV may be better, because it breaks the dependency chain...
;   cmp ebx,1   ; new flags: A GT B
@end:
   pop ebx
   pop edi
   pop esi


Every jump goes to the continuation of the same sequence of instructions.
CMP behaves just like SUB, so, probably it's simpler and faster just to follow straight 128 bit integer substraction with SUB/SBB/SBB/SBB.

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #64 on: August 19, 2013, 12:46:43 PM »
The code:

   mov eax,dword ptr [ow0]
   sub eax,dword ptr [ow1]
   mov eax,dword ptr [ow0+4]
   sbb eax,dword ptr [ow1+4]
   mov eax,dword ptr [ow0+8]
   sbb eax,dword ptr [ow1+8]
   mov eax,dword ptr [ow0+12]
   sbb eax,dword ptr [ow1+12]

seems to be pretty fast on my machine, though, still a lot slower than SSE-powered version.

nidud

  • Member
  • *****
  • Posts: 2049
    • https://github.com/nidud/asmc
Re: Comparing 128-bit numbers aka OWORDs
« Reply #65 on: August 19, 2013, 01:26:17 PM »
I don't think that will work correctly
this may work
Code: [Select]
cmp128 macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@end
mov eax,dword ptr ow0
sub eax,dword ptr ow1
jnz @NE1
mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4]
jnz @NE2
mov eax,dword ptr ow0[8]
sbb eax,dword ptr ow1[8]
jnz @NE3
mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jmp @end
@NE1: mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4]
@NE2: mov eax,dword ptr ow0[8]
sbb eax,dword ptr ow1[8]
@NE3: mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jnz @end
mov eax,2
cmp eax,1
@end:
endm

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Comparing 128-bit numbers aka OWORDs
« Reply #66 on: August 19, 2013, 01:29:03 PM »
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB
notice that it executes about the same number of instructions, either way

my thinking is that if the high-order dwords do not match, you shouldn't have to do any more   :P
depending on the application (of course), that could be a majority of the time
so, i still think there is room for improvement
at the moment, i have one more graphing routine to write, so i haven't spent any time on it

nidud

  • Member
  • *****
  • Posts: 2049
    • https://github.com/nidud/asmc
Re: Comparing 128-bit numbers aka OWORDs
« Reply #67 on: August 19, 2013, 01:35:21 PM »
here is a test including qword's number
Code: [Select]
sSmall OWORD 0fffffffffffffffffffffffff00000001h
sBig OWORD 0fffffffffffffffffffffffffffffffffh
...
cmp128 sSmall, sBig
jnl error
cmp128 sBig, sSmall
jl error
cmp128 oSmall, oBig ; movups xmm0, oword ptr oSmall
jnl error
jnb error
cmp128 oBig, oSmall
jle error
jbe error
cmp128 oMedium, oMedium
jne error
cmp128 oSmallF, oBigF
jge error
jae error
cmp128 oBigF, oSmallF
jle error
jbe error
cmp128 oMedF, oMedF
jne error

jj2007

  • Member
  • *****
  • Posts: 11159
  • Assembler is fun ;-)
    • MasmBasic
Re: Comparing 128-bit numbers aka OWORDs
« Reply #68 on: August 19, 2013, 06:16:01 PM »
Hi nidud,
If your algo yields correct results, you should sell it ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2999/1000 cycles

18751   cycles for 1000 * Ocmp (JJ)
18730   cycles for 1000 * Ocmp2 (JJ)
3035    cycles for 1000 * cmp128n (nidud)
8218    cycles for 1000 * cmp128 qWord
22797   cycles for 1000 * JJAxCMP128bit


EDIT: See reply #74 for corrected version
« Last Edit: August 19, 2013, 11:57:21 PM by jj2007 »

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Comparing 128-bit numbers aka OWORDs
« Reply #69 on: August 19, 2013, 07:01:34 PM »
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2113/1000 cycles

23370   cycles for 1000 * Ocmp (JJ)
21821   cycles for 1000 * Ocmp2 (JJ)
13010   cycles for 1000 * cmp128n (nidud)
37300   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23393   cycles for 1000 * Ocmp (JJ)
21550   cycles for 1000 * Ocmp2 (JJ)
12906   cycles for 1000 * cmp128n (nidud)
36926   cycles for 1000 * cmp128 qWord
17266   cycles for 1000 * JJAxCMP128bit

23680   cycles for 1000 * Ocmp (JJ)
21558   cycles for 1000 * Ocmp2 (JJ)
13570   cycles for 1000 * cmp128n (nidud)
36904   cycles for 1000 * cmp128 qWord
17157   cycles for 1000 * JJAxCMP128bit

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Comparing 128-bit numbers aka OWORDs
« Reply #70 on: August 19, 2013, 07:28:30 PM »
nice idea, nidud   :t

still plenty of room for improvement, i think
using EAX all the way through has to be slowing it down
although, as a macro, it does make it more flexible

one little item....
Code: [Select]
mov eax,2
cmp eax,1
8 bytes - clears OF, ZF, CF, and SF

can be replaced with
Code: [Select]
or  al,12 bytes - clears OF, ZF, CF, and SF

Antariy

  • Member
  • ****
  • Posts: 551
Re: Comparing 128-bit numbers aka OWORDs
« Reply #71 on: August 19, 2013, 07:52:35 PM »
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB

Ah, yes, you're right, Dave :redface:


Jochen, to get correct test running, you should change this macro:


align 16
TestC_s:
; useC=0      ; uncomment to exclude TestC
NameC equ cmp128n (nidud)   ; assign a descriptive name here
TestC proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   showresults=0
   cmp128n offset oSmall, offset oBig
   cmp128n offset oBig, offset oSmall
   cmp128n offset oMedium, offset oMedium
   
   cmp128n offset oSmallF, offset oBigF
   cmp128n offset oBigF, offset oSmallF
   cmp128n offset oMedF, offset oMedF
   sub ebx, 6
   ; dec ebx - we test 6x above
  .Until Sign?
  ret
TestC endp
TestC_endp:



Remove every "offset" statement before numbers.
If you would have a look into the disassembly, you'll find that the expanded macro actually compares not the numbers but the offsets (BTW, that's very nasty "feature" of a macroses).

qWord

  • Member
  • *****
  • Posts: 1476
  • The base type of a type is the type itself
    • SmplMath macros
Re: Comparing 128-bit numbers aka OWORDs
« Reply #72 on: August 19, 2013, 09:04:04 PM »
What about making the test a bit more realistic by adding some dependencies? e.g. if( Condition ) then do (operation A) else do (operation B)
 :t
MREAL macros - when you need floating point arithmetic while assembling!

nidud

  • Member
  • *****
  • Posts: 2049
    • https://github.com/nidud/asmc
Re: Comparing 128-bit numbers aka OWORDs
« Reply #73 on: August 19, 2013, 09:49:08 PM »
using EAX all the way through has to be slowing it down

I tried to preload edx and ecx, but that actually slows it down
It seems that alignment of the code (?) is more important

Quote
one little item....
Code: [Select]
mov eax,2
cmp eax,1
8 bytes - clears OF, ZF, CF, and SF

can be replaced with
Code: [Select]
or  al,12 bytes - clears OF, ZF, CF, and SF

Code: [Select]
mov eax,2
cmp eax,1
2030 cycles for 1000 * cmp128

or al,1
1517 cycles for 1000 * cmp128

sub eax,eax
inc eax
1360 cycles for 1000 * cmp128

jj2007

  • Member
  • *****
  • Posts: 11159
  • Assembler is fun ;-)
    • MasmBasic
Re: Comparing 128-bit numbers aka OWORDs
« Reply #74 on: August 19, 2013, 11:54:21 PM »
Jochen, to get correct test running, you should change this macro:

That's right :t

By the way: Why sbb all over the place?

   jnz @NE1
   mov eax,dword ptr ow0[4]
   .if Carry?
      INT 3
   .endif
   sbb eax,dword ptr ow1[4]


AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 3008/1000 cycles

18753   cycles for 1000 * Ocmp (JJ)
18749   cycles for 1000 * Ocmp2 (JJ)
2688 1) cycles for 1000 * cmp128n (nidud)
8210    cycles for 1000 * cmp128 qWord
22749   cycles for 1000 * JJAxCMP128bit


1) You will be fined for violating speed limits 8)