News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

Antariy

Quote from: jj2007 on August 18, 2013, 10:16:21 PM
Something is wrong there - 0.3 cycles is impossible...

It may be some inconsistence in timings, like it often happens, but probably it shows that Gunther's CPU model runs your version faster.

Timings for the new archive:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2261/1000 cycles

26992   cycles for 1000 * Ocmp (JJ)
23136   cycles for 1000 * Ocmp2 (JJ)
108573  cycles for 1000 * cmp128b (loop)
40769   cycles for 1000 * cmp128 qWord
18240   cycles for 1000 * JJAxCMP128bit

24862   cycles for 1000 * Ocmp (JJ)
22679   cycles for 1000 * Ocmp2 (JJ)
107366  cycles for 1000 * cmp128b (loop)
38917   cycles for 1000 * cmp128 qWord
18459   cycles for 1000 * JJAxCMP128bit

24960   cycles for 1000 * Ocmp (JJ)
22641   cycles for 1000 * Ocmp2 (JJ)
107176  cycles for 1000 * cmp128b (loop)
38921   cycles for 1000 * cmp128 qWord
18207   cycles for 1000 * JJAxCMP128bit


--- ok ---



BTW: in my variation of code here:

   xor ecx,0FFFFh
   bsr ecx,ecx
   jz @l1


we may have couple of cycles less when numbers are equal and jz @l1 is moved before BSR (just a note, it will not improve timings in current testbed).

nidud

#61
deleted

Antariy


Antariy

BTW, in qWord's code, are really the commented instructions required? They seem to be superfluous.


   push esi
   push edi
   push ebx
   lea esi,ow0
   lea edi,ow1
   mov eax,[esi+0*DWORD]
   mov ecx,[esi+1*DWORD]
   mov edx,[esi+2*DWORD]
   mov ebx,[esi+3*DWORD]
   sub eax,[edi+0*DWORD]
   ;jnz @NE1
   sbb ecx,[edi+1*DWORD]
   ;jnz @NE2
   sbb edx,[edi+2*DWORD]
   ;jnz @NE3
   sbb ebx,[edi+3*DWORD]
   ;/* equal */
   ;jmp @end

;@NE1:   sbb ecx,[edi+1*DWORD]
;@NE2:   sbb edx,[edi+2*DWORD]
;@NE3:   sbb ebx,[edi+3*DWORD]
   ;/* LT or GT */
;   jnz @end
;   or ebx,2    ; MOV may be better, because it breaks the dependency chain...
;   cmp ebx,1   ; new flags: A GT B
@end:
   pop ebx
   pop edi
   pop esi


Every jump goes to the continuation of the same sequence of instructions.
CMP behaves just like SUB, so, probably it's simpler and faster just to follow straight 128 bit integer substraction with SUB/SBB/SBB/SBB.

Antariy

The code:

   mov eax,dword ptr [ow0]
   sub eax,dword ptr [ow1]
   mov eax,dword ptr [ow0+4]
   sbb eax,dword ptr [ow1+4]
   mov eax,dword ptr [ow0+8]
   sbb eax,dword ptr [ow1+8]
   mov eax,dword ptr [ow0+12]
   sbb eax,dword ptr [ow1+12]

seems to be pretty fast on my machine, though, still a lot slower than SSE-powered version.

nidud

#65
deleted

dedndave

i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB
notice that it executes about the same number of instructions, either way

my thinking is that if the high-order dwords do not match, you shouldn't have to do any more   :P
depending on the application (of course), that could be a majority of the time
so, i still think there is room for improvement
at the moment, i have one more graphing routine to write, so i haven't spent any time on it

nidud

#67
deleted

jj2007

#68
Hi nidud,
If your algo yields correct results, you should sell it ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2999/1000 cycles

18751   cycles for 1000 * Ocmp (JJ)
18730   cycles for 1000 * Ocmp2 (JJ)
3035    cycles for 1000 * cmp128n (nidud)
8218    cycles for 1000 * cmp128 qWord
22797   cycles for 1000 * JJAxCMP128bit


EDIT: See reply #74 for corrected version

dedndave

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2113/1000 cycles

23370   cycles for 1000 * Ocmp (JJ)
21821   cycles for 1000 * Ocmp2 (JJ)
13010   cycles for 1000 * cmp128n (nidud)
37300   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23393   cycles for 1000 * Ocmp (JJ)
21550   cycles for 1000 * Ocmp2 (JJ)
12906   cycles for 1000 * cmp128n (nidud)
36926   cycles for 1000 * cmp128 qWord
17266   cycles for 1000 * JJAxCMP128bit

23680   cycles for 1000 * Ocmp (JJ)
21558   cycles for 1000 * Ocmp2 (JJ)
13570   cycles for 1000 * cmp128n (nidud)
36904   cycles for 1000 * cmp128 qWord
17157   cycles for 1000 * JJAxCMP128bit

dedndave

nice idea, nidud   :t

still plenty of room for improvement, i think
using EAX all the way through has to be slowing it down
although, as a macro, it does make it more flexible

one little item....
mov eax,2
cmp eax,1

8 bytes - clears OF, ZF, CF, and SF

can be replaced with
or  al,1
2 bytes - clears OF, ZF, CF, and SF

Antariy

Quote from: dedndave on August 19, 2013, 01:29:03 PM
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB

Ah, yes, you're right, Dave :redface:


Jochen, to get correct test running, you should change this macro:


align 16
TestC_s:
; useC=0      ; uncomment to exclude TestC
NameC equ cmp128n (nidud)   ; assign a descriptive name here
TestC proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   showresults=0
   cmp128n offset oSmall, offset oBig
   cmp128n offset oBig, offset oSmall
   cmp128n offset oMedium, offset oMedium
   
   cmp128n offset oSmallF, offset oBigF
   cmp128n offset oBigF, offset oSmallF
   cmp128n offset oMedF, offset oMedF
   sub ebx, 6
   ; dec ebx - we test 6x above
  .Until Sign?
  ret
TestC endp
TestC_endp:



Remove every "offset" statement before numbers.
If you would have a look into the disassembly, you'll find that the expanded macro actually compares not the numbers but the offsets (BTW, that's very nasty "feature" of a macroses).

qWord

What about making the test a bit more realistic by adding some dependencies? e.g. if( Condition ) then do (operation A) else do (operation B)
:t
MREAL macros - when you need floating point arithmetic while assembling!

nidud

#73
deleted

jj2007

Quote from: Antariy on August 19, 2013, 07:52:35 PM
Jochen, to get correct test running, you should change this macro:

That's right :t

By the way: Why sbb all over the place?

   jnz @NE1
   mov eax,dword ptr ow0[4]
   .if Carry?
      INT 3
   .endif
   sbb eax,dword ptr ow1[4]


AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 3008/1000 cycles

18753   cycles for 1000 * Ocmp (JJ)
18749   cycles for 1000 * Ocmp2 (JJ)
2688 1) cycles for 1000 * cmp128n (nidud)
8210    cycles for 1000 * cmp128 qWord
22749   cycles for 1000 * JJAxCMP128bit


1) You will be fined for violating speed limits 8)