News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

dedndave

your idea is better anyways, nidud
it just occured to me that OR AL,1 does not necessarily clear the ZF if the pre-existing value has bit 7 set
    sub     eax,eax
    inc     eax

i like it   :P

dedndave

something in there doesn't like my P4   :P

Cmp128TimingsNidudB
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2078/1000 cycles

23549   cycles for 1000 * Ocmp (JJ)
21190   cycles for 1000 * Ocmp2 (JJ)
31788   cycles for 1000 * cmp128n (nidud)
36806   cycles for 1000 * cmp128 qWord
17161   cycles for 1000 * JJAxCMP128bit

23934   cycles for 1000 * Ocmp (JJ)
21858   cycles for 1000 * Ocmp2 (JJ)
31505   cycles for 1000 * cmp128n (nidud)
37015   cycles for 1000 * cmp128 qWord
17209   cycles for 1000 * JJAxCMP128bit

24145   cycles for 1000 * Ocmp (JJ)
21179   cycles for 1000 * Ocmp2 (JJ)
31363   cycles for 1000 * cmp128n (nidud)
36887   cycles for 1000 * cmp128 qWord
17515   cycles for 1000 * JJAxCMP128bit

dedndave

i still think you can stop comparing when you find a high-order mismatch
and - no need to ripple the CF all the way from low-order to high-order if they are all equal

something like this - i need to do some testing
Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands

    mov     eax,dword ptr OwA[12]
    mov     edx,dword ptr OwA[8]
    sub     eax,dword ptr OwB[12]
    .if ZERO?
        cmp     edx,dword ptr OwB[8]
        mov     ecx,dword ptr OwA[4]
        .if ZERO?
            cmp     ecx,dword ptr OwB[4]
            mov     edx,dword ptr OwA[0]
            .if ZERO?
                cmp     edx,dword ptr OwB[0]
                .if !ZERO?
                    sbb     eax,0
                .endif
            .endif
        .endif
    .endif

ENDM

nidud

#78
deleted

jj2007

   jnz @NE1
   ; so here we are inside a ZERO branch, and there is definitely NO CARRY
   mov eax,dword ptr ow0[4]
   sbb eax,dword ptr ow0[4]  ; therefore sbb behaves exactly like sub

Antariy

New attempt:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2198/1000 cycles

24957   cycles for 1000 * Ocmp (JJ)
22338   cycles for 1000 * Ocmp2 (JJ)
32907   cycles for 1000 * cmp128n (nidud)
39508   cycles for 1000 * cmp128 qWord
6635    cycles for 1000 * AxCMP128bit

24604   cycles for 1000 * Ocmp (JJ)
22414   cycles for 1000 * Ocmp2 (JJ)
32845   cycles for 1000 * cmp128n (nidud)
38513   cycles for 1000 * cmp128 qWord
6367    cycles for 1000 * AxCMP128bit

24608   cycles for 1000 * Ocmp (JJ)
22294   cycles for 1000 * Ocmp2 (JJ)
34059   cycles for 1000 * cmp128n (nidud)
38513   cycles for 1000 * cmp128 qWord
6388    cycles for 1000 * AxCMP128bit


--- ok ---


Did not look over your latest archive yet, Jochen, got it just now.

Can you please add my code to newest testbed?

Antariy

Ooops, entangled the order :greensml:

Here is my code:


AxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2, @l3, @l1_1, @l2_1, @l3_1, @l0


mov eax,dword ptr [ow0+12]
cmp eax,dword ptr [ow1+12]
jnz @l0

mov eax,dword ptr [ow0+8]
cmp eax,dword ptr [ow1+8]
jnz @l3

mov eax,dword ptr [ow0+4]
cmp eax,dword ptr [ow1+4]
jnz @l2

mov eax,dword ptr [ow0]
cmp eax,dword ptr [ow1]
jz @l0


@l1:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2]
mov cx,word ptr [ow1+2]
cmp dx,word ptr [ow1+2]
jnz @l1_1
mov cx,word ptr [ow1]
mov dx,ax
@l1_1:
cmp edx,ecx
jmp @l0

@l2:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2+4]
mov cx,word ptr [ow1+2+4]
cmp dx,word ptr [ow1+2+4]
jnz @l2_1
mov cx,word ptr [ow1+4]
mov dx,ax
@l2_1:
cmp edx,ecx
jmp @l0

@l3:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2+8]
mov cx,word ptr [ow1+2+8]
cmp dx,word ptr [ow1+2+8]
jnz @l3_1
mov cx,word ptr [ow1+8]
mov dx,ax
@l3_1:
cmp edx,ecx


@l0:

ENDM


Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2280/1000 cycles

25091   cycles for 1000 * Ocmp (JJ)
23674   cycles for 1000 * Ocmp2 (JJ)
33530   cycles for 1000 * cmp128n (nidud)
39507   cycles for 1000 * cmp128 qWord
12115   cycles for 1000 * AxCMP128bit

25075   cycles for 1000 * Ocmp (JJ)
22759   cycles for 1000 * Ocmp2 (JJ)
34451   cycles for 1000 * cmp128n (nidud)
39244   cycles for 1000 * cmp128 qWord
10880   cycles for 1000 * AxCMP128bit

25092   cycles for 1000 * Ocmp (JJ)
22990   cycles for 1000 * Ocmp2 (JJ)
32880   cycles for 1000 * cmp128n (nidud)
38631   cycles for 1000 * cmp128 qWord
10777   cycles for 1000 * AxCMP128bit


--- ok ---

nidud

#82
deleted


dedndave

nidud - SUB would be faster because it does not have to wait for the carry condition to be set

Alex - odds are that the high order compare would find a mismatch
of all the possible 128-bit integers, only 1 in every 4294967296 have a specific high dword value

of all combinations of two 128-bit values,
only 1 combination in every 42949672962 (roughly) has matching high-order dwords

as you said, this is very application dependant, but i like playing those odds   :P

the timing tests probably aren't set up to test a wide range of values

Antariy

Quote from: dedndave on August 20, 2013, 02:20:59 AM
Alex - odds are that the high order compare would find a mismatch
...
as you said, this is very application dependant, but i like playing those odds   :P

No-no, Dave, I agree with you, just not get it from first time (in the first pages of the thread me too "voted" for checking the highest order elements), it's late here, sorry :greensml:

nidud

#86
deleted

dedndave

lol nidud
you'd have to select specific values to compare to validate that theory

the values used in the timing test don't cover a very comprehensive range of comparisons

dedndave

results from Alex's latest code

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2134/1000 cycles

23563   cycles for 1000 * Ocmp (JJ)
21133   cycles for 1000 * Ocmp2 (JJ)
31419   cycles for 1000 * cmp128n (nidud)
37262   cycles for 1000 * cmp128 qWord
10163   cycles for 1000 * AxCMP128bit

23335   cycles for 1000 * Ocmp (JJ)
21130   cycles for 1000 * Ocmp2 (JJ)
32047   cycles for 1000 * cmp128n (nidud)
37288   cycles for 1000 * cmp128 qWord
10208   cycles for 1000 * AxCMP128bit

23770   cycles for 1000 * Ocmp (JJ)
21134   cycles for 1000 * Ocmp2 (JJ)
31739   cycles for 1000 * cmp128n (nidud)
37136   cycles for 1000 * cmp128 qWord
10172   cycles for 1000 * AxCMP128bit

Gunther

My results from Alex's latest test:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++++++3 of 20 tests valid, loop overhead is approx. 1678/1000 cycles

2364    cycles for 1000 * Ocmp (JJ)
2351    cycles for 1000 * Ocmp2 (JJ)
2177    cycles for 1000 * cmp128n (nidud)
10741   cycles for 1000 * cmp128 qWord
3756    cycles for 1000 * AxCMP128bit

2340    cycles for 1000 * Ocmp (JJ)
2346    cycles for 1000 * Ocmp2 (JJ)
2188    cycles for 1000 * cmp128n (nidud)
4684    cycles for 1000 * cmp128 qWord
3749    cycles for 1000 * AxCMP128bit

2420    cycles for 1000 * Ocmp (JJ)
2431    cycles for 1000 * Ocmp2 (JJ)
7600    cycles for 1000 * cmp128n (nidud)
4545    cycles for 1000 * cmp128 qWord
9999    cycles for 1000 * AxCMP128bit

--- ok ---


Gunther
You have to know the facts before you can distort them.