Comparing 128-bit numbers aka OWORDs

Antariy · August 18, 2013, 11:12:40 PM

Quote from: jj2007 on August 18, 2013, 10:16:21 PM
Something is wrong there - 0.3 cycles is impossible...

It may be some inconsistence in timings, like it often happens, but probably it shows that Gunther's CPU model runs your version faster.

Timings for the new archive:

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2261/1000 cycles

26992   cycles for 1000 * Ocmp (JJ)
23136   cycles for 1000 * Ocmp2 (JJ)
108573  cycles for 1000 * cmp128b (loop)
40769   cycles for 1000 * cmp128 qWord
18240   cycles for 1000 * JJAxCMP128bit

24862   cycles for 1000 * Ocmp (JJ)
22679   cycles for 1000 * Ocmp2 (JJ)
107366  cycles for 1000 * cmp128b (loop)
38917   cycles for 1000 * cmp128 qWord
18459   cycles for 1000 * JJAxCMP128bit

24960   cycles for 1000 * Ocmp (JJ)
22641   cycles for 1000 * Ocmp2 (JJ)
107176  cycles for 1000 * cmp128b (loop)
38921   cycles for 1000 * cmp128 qWord
18207   cycles for 1000 * JJAxCMP128bit


--- ok ---

BTW: in my variation of code here:

xor ecx,0FFFFh
bsr ecx,ecx
jz @l1

we may have couple of cycles less when numbers are equal and jz @l1 is moved before BSR (just a note, it will not improve timings in current testbed).

nidud · August 18, 2013, 11:52:46 PM

deleted

Antariy · August 19, 2013, 08:56:10 AM

AMD, as usual, prefers GPR code

Antariy · August 19, 2013, 12:30:34 PM

BTW, in qWord's code, are really the commented instructions required? They seem to be superfluous.

   push esi
   push edi
   push ebx
   lea esi,ow0
   lea edi,ow1
   mov eax,[esi+0*DWORD]
   mov ecx,[esi+1*DWORD]
   mov edx,[esi+2*DWORD]
   mov ebx,[esi+3*DWORD]
   sub eax,[edi+0*DWORD]
   ;jnz @NE1
   sbb ecx,[edi+1*DWORD]
   ;jnz @NE2
   sbb edx,[edi+2*DWORD]
   ;jnz @NE3
   sbb ebx,[edi+3*DWORD]
   ;/* equal */
   ;jmp @end

;@NE1:   sbb ecx,[edi+1*DWORD]
;@NE2:   sbb edx,[edi+2*DWORD]
;@NE3:   sbb ebx,[edi+3*DWORD]
   ;/* LT or GT */
;   jnz @end
;   or ebx,2 ; MOV may be better, because it breaks the dependency chain...
;   cmp ebx,1   ; new flags: A GT B
@end:
   pop ebx
   pop edi
   pop esi

Every jump goes to the continuation of the same sequence of instructions.
CMP behaves just like SUB, so, probably it's simpler and faster just to follow straight 128 bit integer substraction with SUB/SBB/SBB/SBB.

Antariy · August 19, 2013, 12:46:43 PM

The code:

   mov eax,dword ptr [ow0]
   sub eax,dword ptr [ow1]
   mov eax,dword ptr [ow0+4]
   sbb eax,dword ptr [ow1+4]
   mov eax,dword ptr [ow0+8]
   sbb eax,dword ptr [ow1+8]
   mov eax,dword ptr [ow0+12]
   sbb eax,dword ptr [ow1+12]

seems to be pretty fast on my machine, though, still a lot slower than SSE-powered version.

nidud · August 19, 2013, 01:26:17 PM

deleted

dedndave · August 19, 2013, 01:29:03 PM

i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB
notice that it executes about the same number of instructions, either way

my thinking is that if the high-order dwords do not match, you shouldn't have to do any more :P
depending on the application (of course), that could be a majority of the time
so, i still think there is room for improvement
at the moment, i have one more graphing routine to write, so i haven't spent any time on it

nidud · August 19, 2013, 01:35:21 PM

deleted

jj2007 · August 19, 2013, 06:16:01 PM

Hi nidud,
If your algo yields correct results, you should sell it ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2999/1000 cycles

18751 cycles for 1000 * Ocmp (JJ)
18730 cycles for 1000 * Ocmp2 (JJ)
3035 cycles for 1000 * cmp128n (nidud)
8218 cycles for 1000 * cmp128 qWord
22797 cycles for 1000 * JJAxCMP128bit

EDIT: See reply #74 for corrected version

dedndave · August 19, 2013, 07:01:34 PM

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2113/1000 cycles

23370   cycles for 1000 * Ocmp (JJ)
21821   cycles for 1000 * Ocmp2 (JJ)
13010   cycles for 1000 * cmp128n (nidud)
37300   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23393   cycles for 1000 * Ocmp (JJ)
21550   cycles for 1000 * Ocmp2 (JJ)
12906   cycles for 1000 * cmp128n (nidud)
36926   cycles for 1000 * cmp128 qWord
17266   cycles for 1000 * JJAxCMP128bit

23680   cycles for 1000 * Ocmp (JJ)
21558   cycles for 1000 * Ocmp2 (JJ)
13570   cycles for 1000 * cmp128n (nidud)
36904   cycles for 1000 * cmp128 qWord
17157   cycles for 1000 * JJAxCMP128bit

dedndave · August 19, 2013, 07:28:30 PM

nice idea, nidud :t

still plenty of room for improvement, i think
using EAX all the way through has to be slowing it down
although, as a macro, it does make it more flexible

one little item....

Code Select

	mov eax,2
	cmp eax,1

8 bytes - clears OF, ZF, CF, and SF

can be replaced with

Code Select

or al,1
2 bytes - clears OF, ZF, CF, and SF

Antariy · August 19, 2013, 07:52:35 PM

Quote from: dedndave on August 19, 2013, 01:29:03 PM
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB

Ah, yes, you're right, Dave :redface:

Jochen, to get correct test running, you should change this macro:

align 16
TestC_s:
; useC=0      ; uncomment to exclude TestC
NameC equ cmp128n (nidud)   ; assign a descriptive name here
TestC proc
mov ebx, AlgoLoops-1   ; loop e.g. 100x
align 4
.Repeat
   showresults=0
   cmp128n offset oSmall, offset oBig
   cmp128n offset oBig, offset oSmall
   cmp128n offset oMedium, offset oMedium

   cmp128n offset oSmallF, offset oBigF
   cmp128n offset oBigF, offset oSmallF
   cmp128n offset oMedF, offset oMedF
   sub ebx, 6
   ; dec ebx - we test 6x above
.Until Sign?
ret
TestC endp
TestC_endp:

Remove every "offset" statement before numbers.
If you would have a look into the disassembly, you'll find that the expanded macro actually compares not the numbers but the offsets (BTW, that's very nasty "feature" of a macroses).

qWord · August 19, 2013, 09:04:04 PM

What about making the test a bit more realistic by adding some dependencies? e.g. if( Condition ) then do (operation A) else do (operation B)
:t

nidud · August 19, 2013, 09:49:08 PM

deleted

jj2007 · August 19, 2013, 11:54:21 PM

Quote from: Antariy on August 19, 2013, 07:52:35 PM
Jochen, to get correct test running, you should change this macro:

That's right :t

By the way: Why sbb all over the place?

   jnz @NE1
   mov eax,dword ptr ow0[4]
   .if Carry?
      INT 3
   .endif
   sbb eax,dword ptr ow1[4]

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 3008/1000 cycles

18753 cycles for 1000 * Ocmp (JJ)
18749 cycles for 1000 * Ocmp2 (JJ)
2688 1) cycles for 1000 * cmp128n (nidud)
8210 cycles for 1000 * cmp128 qWord
22749 cycles for 1000 * JJAxCMP128bit

1) You will be fined for violating speed limits 8)

The MASM Forum

News:

Comparing 128-bit numbers aka OWORDs

Antariy

nidud

Antariy

Antariy

Antariy

nidud

dedndave

nidud

jj2007

dedndave

dedndave

Antariy

qWord

nidud

jj2007