The MASM Forum

General => The Laboratory => Topic started by: jj2007 on August 12, 2013, 08:25:24 PM

Title: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 12, 2013, 08:25:24 PM
Following the "What is the fastest way (performance wise) to compare two 128 bit integers" thread (http://masm32.com/board/index.php?topic=2213.0) in the Campus, here is a first attempt to time comparisons of 128-bit unsigned integers.

Fifteen cycles is quite a lot, so if you have a better algo, please post it... you can test it first at line 90 of the attached source.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
15580   cycles for 1000 * cmp128 (2 globals)
88587   cycles for 1000 * cmp128b (loop)
27107   cycles for 1000 * cmp128p (calls proc)
28611   cycles for 1000 * two pointers


P.S.: Googling yields almost nothing; apparently there are not many applications for this :(
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 12, 2013, 08:45:03 PM
Que?

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1824/1000 cycles

3385    cycles for 1000 * cmp128
??      cycles for 1000 * cmp128 xx

3489    cycles for 1000 * cmp128
??      cycles for 1000 * cmp128 xx

3335    cycles for 1000 * cmp128
??      cycles for 1000 * cmp128 xx

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 12, 2013, 10:23:54 PM
John, your CPU doesn't respect the speed limits, as usual :eusa_naughty:

OK, version B attached on top. It features a loop-based macro:

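cmp128b MACRO ow0, ow1   ; both operands must be memory variables
  push esi
  push edi
  mov esi, offset ow0
  mov edi, offset ow1
  mov ecx, 16            ; byte counter, scanning from the most significant byte down
  .Repeat
   dec ecx
   .if Sign?             ; ecx went below zero: all 16 bytes matched
      inc ecx            ; back to zero, which also sets the Zero flag
      .Break
   .endif
   movzx eax, byte ptr [esi+ecx]
   movzx edx, byte ptr [edi+ecx]
   cmp eax, edx          ; unsigned byte compare sets the flags
  .Until !Zero?          ; stop at the first byte that differs
  pop edi                ; pop leaves the flags untouched
  pop esi
ENDM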
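
For readers who want a reference to check results against, the byte loop above can be modelled in C. This is only a minimal sketch, not the macro itself: `cmp128_bytes` is a hypothetical name, the operands are assumed to be 16-byte little-endian buffers (as an OWORD sits in x86 memory), and the result is a three-way integer instead of CPU flags.

```c
#include <stdint.h>

/* Unsigned compare of two 128-bit values stored little-endian
   (byte 0 least significant, byte 15 most significant).
   Returns -1 if a < b, 0 if equal, 1 if a > b. */
int cmp128_bytes(const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 15; i >= 0; i--) {        /* most significant byte first */
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;   /* the first difference decides */
    }
    return 0;                              /* all 16 bytes matched */
}
```

Like the macro, this walks at most 16 bytes, which is probably why the loop version is the slowest entry in the timings.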

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 12, 2013, 10:36:36 PM
>John, your CPU doesn't respect the speed limits, as usual :eusa_naughty:
Hah! It's you being disrespectful to my CPU.



Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1694/1000 cycles

3158    cycles for 1000 * cmp128
58915   cycles for 1000 * cmp128b

3162    cycles for 1000 * cmp128
58929   cycles for 1000 * cmp128b

3124    cycles for 1000 * cmp128
59191   cycles for 1000 * cmp128b

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 12, 2013, 11:38:25 PM
One more, adding a generic one which expects two pointers:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2998/1000 cycles

15567   cycles for 1000 * cmp128 (2 globals)
88584   cycles for 1000 * cmp128b (loop)
27141   cycles for 1000 * cmp128p (calls proc)
29387   cycles for 1000 * two pointers

15562   cycles for 1000 * cmp128 (2 globals)
88587   cycles for 1000 * cmp128b (loop)
27100   cycles for 1000 * cmp128p (calls proc)
29431   cycles for 1000 * two pointers

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

9116    cycles for 1000 * cmp128 (2 globals)
73639   cycles for 1000 * cmp128b (loop)
17461   cycles for 1000 * cmp128p (calls proc)
24831   cycles for 1000 * two pointers


Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 13, 2013, 03:16:45 AM
Jochen,

your timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3047/1000 cycles

1636    cycles for 1000 * cmp128 (2 globals)
56016   cycles for 1000 * cmp128b (loop)
4410    cycles for 1000 * cmp128p (calls proc)
2688    cycles for 1000 * two pointers

1696    cycles for 1000 * cmp128 (2 globals)
55860   cycles for 1000 * cmp128b (loop)
10677   cycles for 1000 * cmp128p (calls proc)
2658    cycles for 1000 * two pointers

1612    cycles for 1000 * cmp128 (2 globals)
55951   cycles for 1000 * cmp128b (loop)
10790   cycles for 1000 * cmp128p (calls proc)
2770    cycles for 1000 * two pointers

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: hutch-- on August 13, 2013, 05:26:23 AM
Hmmmm,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
loop overhead is approx. 1028/1000 cycles

6340    cycles for 1000 * cmp128 (2 globals)
66042   cycles for 1000 * cmp128b (loop)
15203   cycles for 1000 * cmp128p (calls proc)
28236   cycles for 1000 * two pointers

6394    cycles for 1000 * cmp128 (2 globals)
66052   cycles for 1000 * cmp128b (loop)
15315   cycles for 1000 * cmp128p (calls proc)
28234   cycles for 1000 * two pointers

6340    cycles for 1000 * cmp128 (2 globals)
66007   cycles for 1000 * cmp128b (loop)
14605   cycles for 1000 * cmp128p (calls proc)
28232   cycles for 1000 * two pointers


--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: RuiLoureiro on August 13, 2013, 08:06:27 AM
Hi
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
loop overhead is approx. 2450/1000 cycles

18005   cycles for 1000 * cmp128 (2 globals)
101446  cycles for 1000 * cmp128b (loop)
34947   cycles for 1000 * cmp128p (calls proc)
27351   cycles for 1000 * two pointers

17983   cycles for 1000 * cmp128 (2 globals)
103720  cycles for 1000 * cmp128b (loop)
35346   cycles for 1000 * cmp128p (calls proc)
27835   cycles for 1000 * two pointers

18020   cycles for 1000 * cmp128 (2 globals)
97862   cycles for 1000 * cmp128b (loop)
35671   cycles for 1000 * cmp128p (calls proc)
27585   cycles for 1000 * two pointers


--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 13, 2013, 10:12:03 AM
Thanks to all of you :t

If I find the time, I'll have to look at the validity of results. Perhaps an extra check of the msb is necessary.

ifidni @Environ(oAssembler), <mlv615>
  oSmall   qWORD 00000000000000001h, 0800000000000000h   ; 7f00000000000000???
   db 03h
  oMedium   qWORD 00000000000000002h, 0800000000000000h
   db 02h
  oBig   qWORD 00000000000000003h, 0800000000000000h
   db 01h
  oSmallF   qWORD 00000000000007f00h, 0800000000000000h
  oMedF   qWORD 00000000000008000h, 0800000000000000h
  oBigF   qWORD 0000000000000ff00h, 0800000000000000h
else
  oSmall   OWORD 080000000000000000000000000000001h
   db 03h
  oMedium   OWORD 080000000000000000000000000000002h
   db 02h
  oBig   OWORD 080000000000000000000000000000003h
   db 01h
  oSmallF   OWORD 080000000000000000000000000007f00h
  oMedF   OWORD 080000000000000000000000000008000h
  oBigF   OWORD 08000000000000000000000000000ff00h
endif

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 02:11:59 AM
Hi Jochen :t

Here is my code, added to the testbed, with timings.


This is a tricky replacement for PCMPEQD that makes the code SSE1-compatible. The archive contains both versions; the "default" is the SSE1-capable one.


    ifdef USE_SSE1
        xorps xmm1,xmm0
        movaps xmm2,xmm1
        cmpps xmm1,oword ptr zero128b,0
        andps xmm2,xmm1
        xorps xmm1,xmm2
    else
        pcmpeqd xmm0,xmm1
    endif


The tricky part of the packed comparison is this question: is the non-matching DWORD the highest-order DWORD, or not? Signed and unsigned comparisons read different flags, but the two agree for numbers that lie in the positive range of the signed type.
The code corrects the flags where required.


        movmskps eax,xmm1
        xor al,0fH
        bsr eax,eax
        jz @l1
        xor ecx,ecx
        mov edx,dword ptr [ow0+eax*4]
        cmp eax,3               ; if this was not highest-order dword
        mov eax,dword ptr [ow1+eax*4]       
        setne cl
                             
        cmp edx,eax             ; OF = SF if EDX (signed)> EAX, CF = 0 if EDX (unsigned)> EAX
        jecxz @l1
       
        ; we need to preserve ZF and CF flags
        ; but make OF != SF if CF = 1

        jns @F                  ; (if sign flag is set, then we do want to clear OF flag)
        mov ecx,80000000h       ; then mark: highest bit will become #30 bit,
        @@:                     ; CF will become #31 bit...

        setc cl                 ; ...lowest bit will become CF

        rcr ecx,1               ; restore CF; OF = 0 if the two highest bits of the result are equal, else OF = 1
        @l1:


I've replaced the cmp128b macro with my code to make it simpler to add to the testbed.

It would probably be better to avoid JECXZ by using a jump table, but in this testbed it was simpler to use JECXZ than to create a table for every macro expansion.
The most robust usage is to test for equality first and only then for signed/unsigned greater-or-less-than, because of the shortcut after BSR: if execution takes that path, only the ZF flag is guaranteed to reflect the result.
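
The flag correction described above can be modelled in C. The sketch below is an assumption-laden model, not a translation of the macro: `Flags`, `cmp32` and `cmp128_dwords` are hypothetical names, and the equal case returns clean flags, whereas the assembly only guarantees ZF on that path. The idea it illustrates: flags come from the highest differing DWORD, and for any DWORD other than the top one OF is rewritten so that "signed less" (SF != OF) agrees with "unsigned below" (CF).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cf, zf, sf, of; } Flags;

/* Flags of a 32-bit CMP a, b. */
Flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;
    f.cf = a < b;                              /* unsigned borrow */
    f.zf = (r == 0);
    f.sf = (r >> 31) != 0;
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;   /* signed overflow */
    return f;
}

/* a[0] is the least significant dword, a[3] the most significant. */
Flags cmp128_dwords(const uint32_t a[4], const uint32_t b[4])
{
    int i = 3;
    while (i > 0 && a[i] == b[i])
        i--;                        /* highest differing dword: the job of BSR */
    Flags f = cmp32(a[i], b[i]);
    if (i != 3)
        f.of = (f.cf != f.sf);      /* make (SF != OF) equal CF, like the RCR step */
    return f;
}
```

Only the top DWORD carries the sign, so lower DWORDs have to be ordered as unsigned values even when the caller follows up with JG/JL.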

Code: [Select]

SSE2:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2231/1000 cycles

18239   cycles for 1000 * cmp128
20661   cycles for 1000 * cmp128b

18399   cycles for 1000 * cmp128
20907   cycles for 1000 * cmp128b

18365   cycles for 1000 * cmp128
20838   cycles for 1000 * cmp128b


--- ok ---


SSE1:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2189/1000 cycles

18274   cycles for 1000 * cmp128
22574   cycles for 1000 * cmp128b

18327   cycles for 1000 * cmp128
22582   cycles for 1000 * cmp128b

18273   cycles for 1000 * cmp128
22632   cycles for 1000 * cmp128b


--- ok ---

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 14, 2013, 03:32:37 AM
Alex,

you've made good points. Thank you.  :t

Here are the timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3341/1000 cycles

1416    cycles for 1000 * cmp128
3673    cycles for 1000 * cmp128b

1352    cycles for 1000 * cmp128
3597    cycles for 1000 * cmp128b

1405    cycles for 1000 * cmp128
3670    cycles for 1000 * cmp128b

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 05:09:20 AM
> Hi Jochen :t
>
> Here is my code, added to the testbed, with timings.

Thanks a lot, Alex :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1767/1000 cycles

9121    cycles for 1000 * cmp128
93407   cycles for 1000 * cmp128b


I am still busy with the macro, trying to make it return correct values. If you want to test yours, here is the core of the testbed, in qword version.

.data
qSmall        qWORD 7f00000000000001h
qMedium        qWORD 7f00000000000002h
qBig        qWORD 7f00000000000003h
qSmallN        qWORD 8000000000000001h
qMedN        qWORD 8000000000000002h
qBigN        qWORD 8000000000000003h
qSmallF        qWORD 0800000000007f00h
qMedF        qWORD 0800000000008000h
qBigF        qWORD 080000000000ff00h
...
   print chr$(13, 10, "Qcmp", 13, 10)
   Print Str$("qSp=%i\n", qSmall)
   Print Str$("qBp=%i\n", qBig)
   Print Str$("qSn=%i\n", qSmallN)
   Print Str$("qBn=%i\n\n", qBigN)
   print chr$(13, 10, "positive (lesser, greater)", 13, 10)
   Qcmp qSmall, qBig
   Qcmp qBig, qSmall
   print chr$(13, 10, "negative (lesser, greater)", 13, 10)
   Qcmp qSmallN, qBigN
   Qcmp qBigN, qSmallN
   print chr$(13, 10, "pos, neg (greater, greater)", 13, 10)
   jt=1
   Qcmp qSmall, qBigN
   Qcmp qBig, qSmallN
   print chr$(13, 10, "neg, pos (lesser, lesser)", 13, 10)
   Qcmp qSmallN, qBig
   Qcmp qBigN, qSmall
Title: Re: a bit bored
Post by: qWord on August 14, 2013, 06:07:51 AM
jj, please a little more effectiveness for the macros...
Code: [Select]
useA=1
;...
CodeSize MACRO algo, overhead:=<15>    ; default overhead is mov ecx, 99+mov eax, ecx+loop
  pushad
  mov eax, offset &algo&_endp    ; OPT_Errline 0
  sub eax, offset &algo&_s
  sub eax, overhead
    if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
        % print str$(eax), 9, @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">), 13, 10
    else
        % print str$(eax), 9, "bytes for other", 13, 10
    endif
  popad
ENDM

AlgoName$ MACRO algo
  if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
    EXITM @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">)
  else
    EXITM <"other">
  endif
ENDM
;...
start:
    push 1
    call ShowCpu    ; print brand string and SSE level
    invoke SetProcessAffinityMask, -1, 1    ; restrict to one core
   
    Calibrate
    ??cntr = 1
    REPEAT ShowLoops+2
        IF ??cntr LT ShowLoops+1
            SpinUp
        ENDIF
        FORC char,<ABCDEFGHIJKLMNOPQRSTUVWXYZ>
            IFDEF use&char&
                IF use&char&
                    IF ??cntr LT ShowLoops+1
                        invoke Sleep, SleepMs
                        counter_begin TimerLoops, HIGH_PRIORITY_CLASS
                           call Test&char&
                        counter_end
                        ShowCycles Test&char&
                    ELSEIF (??cntr EQ ShowLoops+1) AND ShowSize NE 0
                        CodeSize Test&char&
                    ELSEIF (??cntr EQ ShowLoops+2) AND ShowResult NE 0
                        CodeResult Test&char&
                    ENDIF
                ENDIF       
            ENDIF
        ENDM
        IF (??cntr LT ShowLoops+1) OR ((??cntr EQ ShowLoops+1) AND ShowSize NE 0) OR ((??cntr EQ ShowLoops+2) AND ShowResult NE 0)
            print chr$(13, 10)
        ENDIF
        ??cntr = ??cntr + 1
    ENDM
    inkey chr$(13, 10, "--- ok ---", 13)
    exit
   :icon_cool:
Title: Re: a bit bored
Post by: jj2007 on August 14, 2013, 06:48:56 AM
> jj, please a little more effectiveness for the macros...

qWord,
Thanks for this demo, guru of the macro universe :biggrin:
Right now I am too busy solving QWORD (note the uppercase) mysteries, will test it asap ;-)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 08:58:17 AM
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9118    cycles for 1000 * cmp128 (2 globals, buggy)
73747   cycles for 1000 * cmp128b (loop)
35023   cycles for 1000 * OcmpJJ
94027   cycles for 1000 * OcmpAlex


@Alex: Your macro doesn't pass the full test yet... see Cmp128.asc (open with \Masm32\RichMasm\RichMasm.exe, hit F6 to build; to navigate, click on bookmarks on the right, or select a word and hit F3)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:10:12 AM
Jochen, can you give an example of a number that fails with it?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 09:20:21 AM
Hi Alex,
Search cmp128.asc for "if Alex" to see the source of this output (N means negative OWORD):

---------- ALEX ----------
positive (lesser, greater)
oSmall  smaller oBig
oBig    bigger  oSmall

negative (lesser, greater)
oSmallN smaller oBigN
oBigN   bigger  oSmallN

pos, neg (greater, greater)
oSmall  smaller oBigN
oBig    smaller oSmallN

neg, pos (lesser, lesser)
oSmallN bigger  oBig
oBigN   bigger  oSmall

pos, neg (equal, equal)
oMedium EQUALS  oMedium
oSmallN EQUALS  oSmallN

Or did you do unsigned comparisons??

P.S.: Ocmp works currently only for JWasm and recent ML.exe, not for ML 6.15 :(
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:21:49 AM
Jochen, I'm sure it is correct. Maybe the problem is that you're using BSF to find the non-matching element, but you should use BSR, because you want the highest-order non-matching element. So if you're comparing the results of my code against yours and treating yours as the reference, that may be why my macro looks incorrect.

Try your macro with these numbers, for instance: 80000000800000000000000000000100 and 80000000000000000000000000000300.
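
To see why the scan direction matters for these two numbers, here is a small C sketch. All function names are hypothetical, and the bit scans are written as plain loops rather than BSF/BSR. It builds the mask of differing bytes and looks at the lowest and the highest set bit.

```c
#include <stdint.h>

/* Store four dwords (most significant first) as a little-endian 16-byte value. */
void make_oword(uint8_t out[16], uint32_t d3, uint32_t d2, uint32_t d1, uint32_t d0)
{
    uint32_t d[4] = { d0, d1, d2, d3 };
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(d[i / 4] >> (8 * (i % 4)));
}

/* Bit i is set when byte i differs (the inverted pcmpeqb/pmovmskb mask). */
unsigned diff_mask(const uint8_t a[16], const uint8_t b[16])
{
    unsigned m = 0;
    for (int i = 0; i < 16; i++)
        if (a[i] != b[i])
            m |= 1u << i;
    return m;
}

/* Index of the lowest set bit, as BSF reports it; m must be nonzero. */
int lowest_bit(unsigned m)
{
    int i = 0;
    while ((m & 1u) == 0) { m >>= 1; i++; }
    return i;
}

/* Index of the highest set bit, as BSR reports it; m must be nonzero. */
int highest_bit(unsigned m)
{
    int i = 0;
    while (m >>= 1)
        i++;
    return i;
}
```

For the pair above, the lowest differing byte is byte 1 (01h vs 03h), which suggests the first number is smaller, while the highest differing byte is byte 11 (80h vs 00h), which correctly shows it is larger.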
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:26:40 AM
> Or did you do unsigned comparisons??

No, I just made the usual CMP "emulator" for 128 bits :biggrin: It behaves just the same as CMP does, i.e. it works for signed and unsigned numbers alike; the "signed-ness" is selected by the usual Jcc instructions (JG/JA/JL/JB).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 09:27:56 AM
Alex, thanks, will check tomorrow (too tired now, it's 1:30 AM).

But oSmallN bigger oBig looks wrong for signed comparisons - a negative number is always lower than a positive one, right?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:30:25 AM
> P.S.: Ocmp works currently only for JWasm and recent ML.exe, not for ML 6.15 :(

Yes, I used ML10 to build, but the previous, smaller testbed was buildable with ML10 only, too - strange, ML6.15 gives an internal assembler error.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:32:10 AM
> Alex, thanks, will check tomorrow (too tired now, it's 1:30 AM).
>
> But oSmallN bigger oBig looks wrong for signed comparisons - a negative number is always lower than a positive one, right?

oSmallN? I see oSmallF in the source only? Can you drop the numbers?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:50:31 AM
> Alex, thanks, will check tomorrow (too tired now, it's 1:30 AM).
>
> But oSmallN bigger oBig looks wrong for signed comparisons - a negative number is always lower than a positive one, right?

Jochen, check the oBig definition for non-ML6.15: you've defined it as a negative number (too many zeroes, probably) :t

Again, I checked the code very carefully, it should not be wrong.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 09:52:32 AM
> oSmallN? I see oSmallF in the source only? Can you drop the numbers?

They are in the testbed, cmp128.asc (not the *timings.asm).

oSmallN   OWORD 88000000000000000000000000000001h
oBigN   OWORD 88000000000000000000000000000003h

And you were absolutely right about bsr instead of bsf :t

> check the number oBig definition
Will do so, thanxalot, Alex :icon14:
New version of the testbed attached. There is still the problem with your negative numbers.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 09:56:10 AM
Also check how you test the results; my code does it just like CMP:

CMP First128BitNumber,Second128BitNumber
JZ -> if first equal to second
JA/JG -> unsigned/signed jump if first is above/greater than second
JB/JL -> unsigned/signed jump if first is below/less than second
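
As a reminder of what those jumps actually test, here is a C model of the flags after a 32-bit CMP and of the five conditions listed above. The struct and function names are hypothetical; the conditions themselves are the standard x86 ones.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cf, zf, sf, of; } Flags;

/* Flags as set by a 32-bit CMP a, b (a - b, result discarded). */
Flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;
    f.cf = a < b;                              /* unsigned borrow */
    f.zf = (r == 0);
    f.sf = (r >> 31) != 0;
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;   /* signed overflow */
    return f;
}

bool jz(Flags f) { return f.zf; }                   /* equal            */
bool ja(Flags f) { return !f.cf && !f.zf; }         /* unsigned above   */
bool jb(Flags f) { return f.cf; }                   /* unsigned below   */
bool jg(Flags f) { return !f.zf && f.sf == f.of; }  /* signed greater   */
bool jl(Flags f) { return f.sf != f.of; }           /* signed less      */
```

With `cmp32(0x88000001, 0x7F000003)` the same flags make `ja` and `jl` true at once: the first operand is above as an unsigned value and less as a signed one, so which answer you get depends purely on the jump you pick.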
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 10:03:02 AM
See "if showresults" - for technical reasons (cmc), I use Carry? and Zero?, which should be sufficient for signed comparisons.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 10:13:29 AM
> oSmallN? I see oSmallF in the source only? Can you drop the numbers?
>
> They are in the testbed, cmp128.asc (not the *timings.asm).

Thanks, found it after some time :redface:

> check the number oBig definition
> Will do so, thanxalot, Alex :icon14:

I've got completely tangled up in the source already :greensml:

oSmallN   OWORD 88000000000000000000000000000001h
oMedN   OWORD 88000000000000000000000000000002h
oBigN   OWORD 88000000000000000000000000000003h

"Big negative number" - what did you mean? "Big" - how? This definition oBigN is greater than oSmallN, like -1 greater than -2.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 10:16:34 AM
> See if showresults - for technical reasons (cmc), I use Carry? and Zero?, should be sufficient for signed comparisons.

But my code sets all flags just like CMP does, so, to check its correctness you should use a standard technique with JA/JB for unsigned numbers, and JG/JL for signed.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 10:36:46 AM
"Big negative number" - what did you mean? "Big" - how? This definition oBigN is greater than oSmallN, like -1 greater than -2.

Here are the corresponding qwords, which Str$() can handle:

qSmallPos=      8574853690513424385
qBigPos=        8574853690513424387

positive (lesser, greater)
qSmall lesser qBig
qBig greater qSmall


qSmallNeg=     -8646911284551352319
qBigNeg=       -8646911284551352317

negative (lesser, greater)
qSmallN lesser qBigN
qBigN greater qSmallN


OWORDs should behave identically. As you may have seen already, I use MbcmpO to decide which size to handle.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 10:46:22 AM
Yes, I meant this - oBigN is greater ("bigger" :biggrin:) than oSmallN.

But, still, to check my code you need to use appropriate Jcc instructions, because it follows the standard CPU manner in flags setting. I checked it before posting with different kinds of numbers - it works.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 12:57:03 PM
Jochen, if you limit the check to the ZF and CF flags only, how would you check whether a negative number is greater than zero, for instance? That's why I preferred the standard flag setting in my version, and it seems more logical to use a comparison macro (or call) "instruction" the same way you usually use CMP.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 14, 2013, 02:34:34 PM
Here (http://masm32.com/board/index.php?topic=2232.msg23009#msg23009) is the test showing that all the value-comparison conditional jumps work with my code. Request for a test :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 14, 2013, 06:50:53 PM
yes - signed branches use the OF flag, as well
sign, carry, overflow - the one they give you that no one cares about is parity   :P
aux carry is rarely used, too - mainly for BCD math, i think

i was thinking of implementing a function that could compare integers of any size
Code: [Select]
INVOKE  ArbCmp,nNumberOfBytes,lpFirst,lpSecond
it could compare the high-order bytes as bytes until alignment on the second operand is found
after that (if still equal), you could use an aligned method until inequality is found
don't want to use JECXZ because it's slow
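
A rough C sketch of such an arbitrary-length compare is below. It rests on several assumptions: `arb_cmp` and `load_le32` are hypothetical names, the comparison is unsigned, the numbers are little-endian, and "alignment" is simplified to "the remaining length is a multiple of 4" rather than a check on the operand address.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a dword from four little-endian bytes. */
uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Unsigned compare of two n-byte little-endian integers.
   Returns -1 if a < b, 0 if equal, 1 if a > b. */
int arb_cmp(size_t n, const uint8_t *a, const uint8_t *b)
{
    /* Peel off high-order bytes until the rest is a multiple of 4. */
    while (n % 4 != 0) {
        n--;
        if (a[n] != b[n])
            return a[n] < b[n] ? -1 : 1;
    }
    /* Then walk down in dword-sized steps until a difference shows up. */
    while (n != 0) {
        n -= 4;
        uint32_t x = load_le32(a + n);
        uint32_t y = load_le32(b + n);
        if (x != y)
            return x < y ? -1 : 1;
    }
    return 0;
}
```

A signed variant would only need to treat the very first (highest) byte as signed; everything below it is ordered as unsigned.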
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 14, 2013, 10:22:11 PM
> Here (http://masm32.com/board/index.php?topic=2232.msg23009#msg23009) is the test showing that all the value-comparison conditional jumps work with my code. Request for a test :biggrin:

Hi Alex,
Here it is ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
18742   cycles for 1000 * Ocmp (JJ)
88733   cycles for 1000 * cmp128b (loop)
62612   cycles for 1000 * AxCMP128bit

18750   cycles for 1000 * Ocmp (JJ)
88605   cycles for 1000 * cmp128b (loop)
63719   cycles for 1000 * AxCMP128bit


And Ocmp passes all your tests now, i.e. it behaves exactly like a cmp eax, edx.

P.S.: The extra speed comes from the bswap instruction.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 15, 2013, 04:05:40 AM
Hi Jochen, here are results.

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2255/1000 cycles

25281   cycles for 1000 * Ocmp (JJ)
104235  cycles for 1000 * cmp128b (loop)
34615   cycles for 1000 * AxCMP128bit

25063   cycles for 1000 * Ocmp (JJ)
103227  cycles for 1000 * cmp128b (loop)
34196   cycles for 1000 * AxCMP128bit

28798   cycles for 1000 * Ocmp (JJ)
101244  cycles for 1000 * cmp128b (loop)
34112   cycles for 1000 * AxCMP128bit


--- ok ---


:t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 15, 2013, 04:45:37 AM
Thanks to everybody, especially Alex :t

I have posted a "CMP defeats intuition (http://masm32.com/board/index.php?topic=2235.0)" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error :biggrin:).

Note the FPU behaves differently.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 15, 2013, 05:39:44 AM
Jochen,

the new timings for you:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 2141/1000 cycles

1959    cycles for 1000 * Ocmp (JJ)
60752   cycles for 1000 * cmp128b (loop)
6356    cycles for 1000 * AxCMP128bit

2096    cycles for 1000 * Ocmp (JJ)
60573   cycles for 1000 * cmp128b (loop)
6196    cycles for 1000 * AxCMP128bit

1918    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
6058    cycles for 1000 * AxCMP128bit

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 15, 2013, 06:10:19 PM
> Thanks to everybody, especially Alex :t

:icon_redface: You're welcome, Jochen :t

> I have posted a "CMP defeats intuition (http://masm32.com/board/index.php?topic=2235.0)" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error :biggrin:).
>
> Note the FPU behaves differently.

The most annoying thing about this is that there is no full set of instructions for working with the flags selectively, like CLC/STC/CMC/CLD/STD/CLI/STI. For example, how do you simply set or clear the OF flag? Is there a simpler way than the one I used, i.e. via RCR? It is the only instruction I know of that can change OF while preserving ZF and letting CF be restored to its previous state (before RCR).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 18, 2013, 01:53:45 AM
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

but, you could come up with a set of operands/operations to generate the flag conditions, as desired

Code: [Select]
mov al,7Fh
mov ah,88h
sub al,ah

so, at the end of your code, you could have
Code: [Select]
SetFlags:
sub al,88h

and branch to that location with different values in AL
something like that   :P
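
To explore which flag combinations a `sub al,88h` can actually produce, here is a C model of an 8-bit SUB plus a brute-force search over AL. `Flags8`, `sub8` and `find_al` are hypothetical names; the flag rules are the standard ones for SUB.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cf, zf, sf, of; } Flags8;

/* Flags as set by an 8-bit SUB a, b. */
Flags8 sub8(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a - b);
    Flags8 f;
    f.cf = a < b;                                  /* unsigned borrow */
    f.zf = (r == 0);
    f.sf = (r & 0x80) != 0;
    f.of = (((a ^ b) & (a ^ r)) & 0x80) != 0;      /* signed overflow */
    return f;
}

/* Search for an AL value such that SUB AL, 88h gives the wanted
   "unsigned below" (CF) and "signed less" (SF != OF) results with ZF clear.
   Returns the first such value, or -1 if no AL value produces that mix. */
int find_al(bool want_below, bool want_less)
{
    for (int al = 0; al < 256; al++) {
        Flags8 f = sub8((uint8_t)al, 0x88);
        if (!f.zf && f.cf == want_below && (f.sf != f.of) == want_less)
            return al;
    }
    return -1;
}
```

Under this model the 7Fh/88h pair from the post yields CF, SF and OF all set, i.e. "below" and "greater" together. The search also suggests that with the operand fixed at 88h the "above but signed-less" mix is not reachable, so a full scheme would presumably need a second subtrahend.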
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 18, 2013, 02:12:44 AM
The solution for setting the flags in the Ocmp & Qcmp macros was actually inspired by Alex' insistence that the macro should behave exactly like a cmp reg32, reg32. So the trick is to take two OWORDs, for example:

  oSmall   OWORD 88000000000000000000000000000100h   ; 88=NEGATIVE
  oBig   OWORD 88000000000000000000000000000300h

.. to scan them with pcmpeqb & bsr for the first different byte, and then "construct" two reg32:
  eax = 88000001
  edx = 88000003

Then, a simple cmp eax, edx sets the flags. Simple and fast...
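
The construction can be sketched in C as follows. This is a model under assumptions, not the Ocmp macro: the names are hypothetical, the byte scan is a plain loop instead of pcmpeqb and bsr, and the low byte is left at zero when byte 15 itself is the deciding byte. The stand-ins are meant to reproduce the Jcc outcomes of a 128-bit compare, not necessarily every individual flag.

```c
#include <stdint.h>

/* Index of the most significant differing byte, or -1 if the values are equal. */
int top_diff(const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 15; i >= 0; i--)
        if (a[i] != b[i])
            return i;
    return -1;
}

/* 32-bit stand-in: byte 15 (which carries the sign) on top,
   the deciding byte at the bottom. */
uint32_t stand_in(const uint8_t v[16], int i)
{
    uint32_t r = (uint32_t)v[15] << 24;
    if (i >= 0 && i < 15)
        r |= v[i];
    return r;
}

int three_way(uint32_t x, uint32_t y) { return (x > y) - (x < y); }

/* What JA/JB would conclude after cmp eax, edx. */
int ocmp_unsigned(const uint8_t a[16], const uint8_t b[16])
{
    int i = top_diff(a, b);
    return three_way(stand_in(a, i), stand_in(b, i));
}

/* What JG/JL would conclude: flipping the top bit maps signed order
   onto unsigned order. */
int ocmp_signed(const uint8_t a[16], const uint8_t b[16])
{
    int i = top_diff(a, b);
    return three_way(stand_in(a, i) ^ 0x80000000u,
                     stand_in(b, i) ^ 0x80000000u);
}
```

It works because the sign lives only in byte 15: if the top bytes match, both numbers have the same sign and the deciding byte orders them identically for signed and unsigned jumps.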
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 18, 2013, 02:15:39 AM
that is a great solution   :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 18, 2013, 02:20:39 AM
> that is a great solution   :t
Thanks ;-)
By the way, if you put the pcmpeqb part into a loop, it should be easy to implement the arbitrary length algo you proposed. If the length is below OWORD, you will have to clear the last bits after bsr.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 18, 2013, 02:29:59 AM
i like the flag-setting solution
i am not as fond of using BSR   :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 18, 2013, 03:01:41 AM
Then cmps is your candidate, Dave ;-)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: qWord on August 18, 2013, 04:14:31 AM
just as an addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction:
Code: [Select]
cmp128 macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@end
lea esi,ow0
lea edi,ow1
mov eax,[esi+0*DWORD]
mov ecx,[esi+1*DWORD]
mov edx,[esi+2*DWORD]
mov ebx,[esi+3*DWORD]
sub eax,[edi+0*DWORD]
jnz @NE1
sbb ecx,[edi+1*DWORD]
jnz @NE2
sbb edx,[edi+2*DWORD]
jnz @NE3
sbb ebx,[edi+3*DWORD]
;/* equal */
jmp @end

@NE1: sbb ecx,[edi+1*DWORD]
@NE2: sbb edx,[edi+2*DWORD]
@NE3: sbb ebx,[edi+3*DWORD]
;/* LT or GT */
jnz @end
or ebx,2    ; MOV may be better, because it breaks the dependency chain...
cmp ebx,1 ; new flags: A GT B
@end:
endm
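
The arithmetic behind the SUB/SBB chain can be modelled in C. The sketch below is not a translation of the macro: `Flags` and `cmp128_sbb` are hypothetical names, and ZF is taken from all four result dwords directly instead of being patched after the last SBB. CF, SF and OF come from the top dword, as they would after the final SBB.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cf, zf, sf, of; } Flags;

/* a[0] is the least significant dword, a[3] the most significant. */
Flags cmp128_sbb(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t borrow = 0;   /* carry flag passed from one step to the next */
    uint32_t any = 0;      /* OR of all result dwords, for ZF */
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        /* SUB for i == 0 (borrow is 0), SBB for the rest */
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        r = (uint32_t)d;
        borrow = (uint32_t)(d >> 63);   /* 1 if the subtraction wrapped */
        any |= r;
    }
    Flags f;
    f.cf = borrow != 0;
    f.zf = (any == 0);
    f.sf = (r >> 31) != 0;
    f.of = (((a[3] ^ b[3]) & (a[3] ^ r)) >> 31) != 0;
    return f;
}
```

A model like this can serve as a reference when checking a macro against corner cases, for example operands whose top dwords are 00000000h and FFFFFFFFh with a borrow arriving from below.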
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 18, 2013, 05:24:44 AM
Hi qWord,

I think you're right. Very elegant solution.  :t

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 18, 2013, 08:27:45 AM
> just as an addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

14167   cycles for 1000 * Ocmp (JJ)
73673   cycles for 1000 * cmp128b (loop)
9501    cycles for 1000 * cmp128 qWord
112854  cycles for 1000 * AxCMP128bit

 :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 18, 2013, 08:31:35 AM
The Timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 1648/1000 cycles

2375    cycles for 1000 * Ocmp (JJ)
60100   cycles for 1000 * cmp128b (loop)
4581    cycles for 1000 * cmp128 qWord
6580    cycles for 1000 * AxCMP128bit

2378    cycles for 1000 * Ocmp (JJ)
59763   cycles for 1000 * cmp128b (loop)
10781   cycles for 1000 * cmp128 qWord
6894    cycles for 1000 * AxCMP128bit

2345    cycles for 1000 * Ocmp (JJ)
59383   cycles for 1000 * cmp128b (loop)
4545    cycles for 1000 * cmp128 qWord
6595    cycles for 1000 * AxCMP128bit

--- ok ---

Well done, qWord.  :t

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 18, 2013, 09:11:57 AM
maybe I'm missing something here, but will not this also work:
Code: [Select]
cmp128f macro ow0,ow1
LOCAL @end
mov eax,dword ptr ow0[12]
cmp eax,dword ptr ow1[12]
jne @end
mov eax,dword ptr ow0[8]
cmp eax,dword ptr ow1[8]
jne @end
mov eax,dword ptr ow0[4]
cmp eax,dword ptr ow1[4]
jne @end
mov eax,dword ptr ow0
cmp eax,dword ptr ow1
@end:
endm

result:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 2013/1000 cycles

9249 cycles for 1000 * Ocmp (JJ)
84785 cycles for 1000 * cmp128b (loop)
7066 cycles for 1000 * cmp128 qWord
57891 cycles for 1000 * AxCMP128bit
2036 cycles for 1000 * cmp128 nidud

9249 cycles for 1000 * Ocmp (JJ)
84805 cycles for 1000 * cmp128b (loop)
7055 cycles for 1000 * cmp128 qWord
57614 cycles for 1000 * AxCMP128bit
2030 cycles for 1000 * cmp128 nidud

9298 cycles for 1000 * Ocmp (JJ)
84779 cycles for 1000 * cmp128b (loop)
7060 cycles for 1000 * cmp128 qWord
57623 cycles for 1000 * AxCMP128bit
2030 cycles for 1000 * cmp128 nidud
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 18, 2013, 10:24:50 AM
i don't see how qWord's code is any faster than...

Code: [Select]
OwordA OWORD ?
OwordB OWORD ?

        mov     eax,dword ptr OwordA[12]
        mov     edx,dword ptr OwordA[8]
        cmp     eax,dword ptr OwordB[12]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[8]
        mov     eax,dword ptr OwordA[4]
        jnz     FlagsSet

        cmp     eax,dword ptr OwordB[4]
        mov     edx,dword ptr OwordA[0]
        jnz     FlagsSet

        cmp     edx,dword ptr OwordB[0]

FlagsSet:

;flags are set as though you had executed CMP OwordA,OwordB

in particular, register indirect addressing is a little slower than direct addressing
and, my code only pre-loads 1 dword, not all 4
in addition to all that, i do not rely on the carry flag being forwarded for SBB
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: qWord on August 18, 2013, 10:58:43 AM
dave,
nidud,
AFAICS your code does not do the same as mine. For the following test numbers,

Code: [Select]
A OWORD 0ffffffffffffffffffffffff00000001h
B OWORD 0ffffffffffffffffffffffffffffffffh
your code returns that A is greater than B (signed), which is wrong.
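
The failure can be reproduced with a small C model of the top-down scheme. Both function names are hypothetical, and the second function is only one possible fix (signed order for the top dword, unsigned order for the rest), not code posted in this thread.

```c
#include <stdint.h>

/* Top-down compare where the flags of whichever dword pair differs first
   are read with signed jumps (JG/JL). XOR with 80000000h maps signed
   order onto unsigned order. a[3] is the most significant dword. */
int topdown_signed_naive(const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 3; i >= 0; i--) {
        if (a[i] != b[i]) {
            uint32_t x = a[i] ^ 0x80000000u;
            uint32_t y = b[i] ^ 0x80000000u;
            return x < y ? -1 : 1;
        }
    }
    return 0;
}

/* Same walk, but only the top dword is treated as signed. */
int topdown_signed_fixed(const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 3; i >= 0; i--) {
        if (a[i] != b[i]) {
            uint32_t bias = (i == 3) ? 0x80000000u : 0u;
            uint32_t x = a[i] ^ bias;
            uint32_t y = b[i] ^ bias;
            return x < y ? -1 : 1;
        }
    }
    return 0;
}
```

With the A and B above, only the lowest dwords differ (00000001h vs FFFFFFFFh). Read as signed, 1 is greater than -1, so the naive version reports A > B; read as unsigned, 1 is below FFFFFFFFh, which matches the true signed order of the two OWORDs.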
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 18, 2013, 11:29:31 AM
Interesting :t

decimal, QWORD size (can't display more than that with deb, sorry ;))
qqA             -4294967295
qqB             -1

qA LESSER qB    ; Ocmp MasmBasic
qA lesser qB    ; cmp128 qWord
qB GREATER qA
qB greater qA
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 18, 2013, 12:20:15 PM
good catch, qWord - lol
i have been using something like that for years
always for unsigned comparisons, though   :P

still, if you combine my method and your method, i think you get faster CORRECT code   :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 18, 2013, 06:41:51 PM
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel

You're right, and that's why I used RCR and not pushf/popf :biggrin:

Results for latest zip in the thread:

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2188/1000 cycles

24104   cycles for 1000 * Ocmp (JJ)
100352  cycles for 1000 * cmp128b (loop)
37171   cycles for 1000 * cmp128 qWord
33162   cycles for 1000 * AxCMP128bit

23649   cycles for 1000 * Ocmp (JJ)
100419  cycles for 1000 * cmp128b (loop)
37734   cycles for 1000 * cmp128 qWord
33313   cycles for 1000 * AxCMP128bit

24096   cycles for 1000 * Ocmp (JJ)
99257   cycles for 1000 * cmp128b (loop)
37482   cycles for 1000 * cmp128 qWord
33053   cycles for 1000 * AxCMP128bit


--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 18, 2013, 08:05:10 PM
Having shamelessly stolen Jochen's implementation :biggrin: and simplified it a bit, here is a "new" algo:


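JJAxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1
   movups xmm0,[ow0]            ; load all 16 bytes of ow0
   movzx eax,word ptr [ow0+14]  ; ax = highest word of ow0
   pcmpeqb xmm0,[ow1]           ; 0FFh in every byte position where ow0 and ow1 match
   movzx edx,word ptr [ow1+14]  ; dx = highest word of ow1
   pmovmskb ecx,xmm0            ; one bit per byte: 1 = equal
   xor ecx,0FFFFh               ; invert: 1 = different
   bsr ecx,ecx                  ; ecx = index of highest differing byte, ZF set if none
   jz @l1                       ; all bytes equal: compare the top words as they are
   cmp ecx,15
   jz @l1                       ; top byte differs: the top words decide
   mov al,byte ptr [ow0+ecx]    ; otherwise put the highest differing bytes
   mov dl,byte ptr [ow1+ecx]    ; into the low bytes of ax and dx
   @l1:
   cmp ax,dx                    ; final CMP sets the flags for any Jcc
ENDM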


The idea is the same: construct the compared element from the MSB and use the highest unmatched byte as the LSB, but in a word-sized reg. If only the highest bytes differ, it takes the shorter path and doesn't update the LSB.
And now this is an exact emulation of a 128-bit CMP instruction: even if the regs are equal, it still does a final CMP, so the flags are set as they should be, and right after the comparison we may use any Jcc instruction instead of being forced to use JZ first as in the earlier algos. (Those took a shortcut after PMOVMSKB/BSR, where only ZF is set and the other flags are undefined; that is not how CMP behaves for equal regs, since CMP sets all flags properly and leaves none "undefined".) And even SF is set the same as CMP does :biggrin: Jochen, :t
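
The construction described above can be modelled outside of MASM. The following is a sketch in Python under stated assumptions: `construct_cmp`, `cmp16` and `agrees` are hypothetical names, the byte scan stands in for PCMPEQB/PMOVMSKB/BSR, and "correct" here means that the JE, JB and JL conditions match a true 128-bit comparison:

```python
import random

def cmp16(a, b):
    """Flags a 16-bit CMP a, b would leave behind."""
    r = (a - b) & 0xFFFF
    return {"CF": a < b, "ZF": r == 0, "SF": bool(r >> 15),
            "OF": bool(((a ^ b) & (a ^ r)) >> 15)}

def construct_cmp(a, b):
    """Top byte stays in the high half, highest differing byte goes into the low half."""
    ab = a.to_bytes(16, "little")
    bb = b.to_bytes(16, "little")
    ax = ab[14] | (ab[15] << 8)              # movzx eax, word ptr [ow0+14]
    dx = bb[14] | (bb[15] << 8)              # movzx edx, word ptr [ow1+14]
    diff = [i for i in range(16) if ab[i] != bb[i]]
    if diff and max(diff) != 15:             # bsr result, skipping the two shortcuts
        i = max(diff)
        ax = (ax & 0xFF00) | ab[i]           # mov al, byte ptr [ow0+ecx]
        dx = (dx & 0xFF00) | bb[i]           # mov dl, byte ptr [ow1+ecx]
    return cmp16(ax, dx)

def signed128(x):
    return x - (1 << 128) if x >> 127 else x

def agrees(a, b):
    """True if the JE, JB and JL conditions match a real 128-bit compare."""
    f = construct_cmp(a, b)
    return (f["ZF"] == (a == b)
            and f["CF"] == (a < b)
            and (f["SF"] != f["OF"]) == (signed128(a) < signed128(b)))

rng = random.Random(2013)
pairs = []
for _ in range(2000):
    a = rng.getrandbits(128)
    b = a ^ (rng.getrandbits(8) << (8 * rng.randrange(16)))  # differs in at most one byte
    pairs.append((a, b))
    pairs.append((a, rng.getrandbits(128)))                  # unrelated value

all_agree = all(agrees(a, b) for a, b in pairs)
print("model agrees with a 128-bit compare:", all_agree)
```

The near-identical pairs matter for a test like this, because purely random OWORDs almost always differ in the top byte and never reach the byte-replacement path.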

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2287/1000 cycles

23052   cycles for 1000 * Ocmp (JJ)
100429  cycles for 1000 * cmp128b (loop)
36612   cycles for 1000 * cmp128 qWord
17043   cycles for 1000 * JJAxCMP128bit

23306   cycles for 1000 * Ocmp (JJ)
102801  cycles for 1000 * cmp128b (loop)
37323   cycles for 1000 * cmp128 qWord
16847   cycles for 1000 * JJAxCMP128bit

24068   cycles for 1000 * Ocmp (JJ)
96693   cycles for 1000 * cmp128b (loop)
36452   cycles for 1000 * cmp128 qWord
17062   cycles for 1000 * JJAxCMP128bit


--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 18, 2013, 08:12:44 PM
Hi, nidud :t

maybe I'm missing something here, but will not this also work:
Code: [Select]
cmp128f macro ow0,ow1
LOCAL @end
mov eax,dword ptr ow0[12]
cmp eax,dword ptr ow1[12]
jne @end
mov eax,dword ptr ow0[8]
cmp eax,dword ptr ow1[8]
jne @end
mov eax,dword ptr ow0[4]
cmp eax,dword ptr ow1[4]
jne @end
mov eax,dword ptr ow0
cmp eax,dword ptr ow1
@end:
endm

The problem is that we need to fully "emulate" CMP behaviour, so the construct should set all flags as CMP does. Such straightforward code returns the proper result only for unsigned numbers (JA/JB/JZ). For signed numbers the result is unpredictable: if the DWORDs that differ are not the highest-order ones, their own sign bits are arbitrary, while the OWORD that contains them may have a different sign. So we are forced to include the highest-order element (DWORD or BYTE) in the comparison to get the right result.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 18, 2013, 08:18:22 PM
Hi Alex,

the timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3662/1000 cycles

362     cycles for 1000 * Ocmp (JJ)
57227   cycles for 1000 * cmp128b (loop)
2590    cycles for 1000 * cmp128 qWord
5042    cycles for 1000 * JJAxCMP128bit

412     cycles for 1000 * Ocmp (JJ)
57178   cycles for 1000 * cmp128b (loop)
2540    cycles for 1000 * cmp128 qWord
5050    cycles for 1000 * JJAxCMP128bit

353     cycles for 1000 * Ocmp (JJ)
57250   cycles for 1000 * cmp128b (loop)
2587    cycles for 1000 * cmp128 qWord
5045    cycles for 1000 * JJAxCMP128bit

--- ok ---

Good job.  :t

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 18, 2013, 08:40:17 PM
very nice Alex   :t
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2075/1000 cycles

23383   cycles for 1000 * Ocmp (JJ)
97294   cycles for 1000 * cmp128b (loop)
36753   cycles for 1000 * cmp128 qWord
17587   cycles for 1000 * JJAxCMP128bit

23394   cycles for 1000 * Ocmp (JJ)
97825   cycles for 1000 * cmp128b (loop)
36873   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23387   cycles for 1000 * Ocmp (JJ)
98043   cycles for 1000 * cmp128b (loop)
36745   cycles for 1000 * cmp128 qWord
17146   cycles for 1000 * JJAxCMP128bit

Quote
Having shamelessly stolen Jochen's implementation...
the algo has even improved your English   :lol:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 18, 2013, 10:01:52 PM
Thank you, Gunther and Dave! :biggrin:

very nice Alex   :t

But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparison! Anyway, it's faster there, much faster. Very interesting difference.

Quote
Having shamelessly stolen Jochen's implementation...
the algo has even improved your English   :lol:

Is this proper sentence? To be honest, very frequently, being writing something in a real-time, I'm very-very unsure that I'm writing properly :redface:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 18, 2013, 10:16:21 PM
But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparison! Anyway, it's faster there, much faster. Very interesting difference.

Something is wrong there - 0.3 cycles is impossible...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1778/1000 cycles

14132   cycles for 1000 * Ocmp (JJ)
13074   cycles for 1000 * Ocmp2 (JJ)
73846   cycles for 1000 * cmp128b (loop)
9513    cycles for 1000 * cmp128 qWord
9141    cycles for 1000 * JJAxCMP128bit


Ocmp2 uses your xor dx, 0ffffh trick.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 18, 2013, 11:12:40 PM
Something is wrong there - 0.3 cycles is impossible...

It may be some inconsistency in the timings, as often happens, but it probably shows that Gunther's CPU model runs your version faster.

Timings for the new archive:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2261/1000 cycles

26992   cycles for 1000 * Ocmp (JJ)
23136   cycles for 1000 * Ocmp2 (JJ)
108573  cycles for 1000 * cmp128b (loop)
40769   cycles for 1000 * cmp128 qWord
18240   cycles for 1000 * JJAxCMP128bit

24862   cycles for 1000 * Ocmp (JJ)
22679   cycles for 1000 * Ocmp2 (JJ)
107366  cycles for 1000 * cmp128b (loop)
38917   cycles for 1000 * cmp128 qWord
18459   cycles for 1000 * JJAxCMP128bit

24960   cycles for 1000 * Ocmp (JJ)
22641   cycles for 1000 * Ocmp2 (JJ)
107176  cycles for 1000 * cmp128b (loop)
38921   cycles for 1000 * cmp128 qWord
18207   cycles for 1000 * JJAxCMP128bit


--- ok ---


BTW: in my variation of code here:

   xor ecx,0FFFFh
   bsr ecx,ecx
   jz @l1


we may save a couple of cycles when the numbers are equal if jz @l1 is moved before BSR (just a note, it will not improve timings in the current testbed).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 18, 2013, 11:52:46 PM
your code returns that A is greater than B (signed), which is wrong.

:biggrin:

I've also been using that method, but (hopefully) only for unsigned values.

You may argue that the test is slightly rigged, so in this regard the fastest code will be: :lol:
Code: [Select]
mov ax,word ptr ow1
cmp ax,word ptr ow0

Well, here is the latest test results:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 2008/1000 cycles

9208 cycles for 1000 * Ocmp (JJ)
90229 cycles for 1000 * cmp128b (loop)
7016 cycles for 1000 * cmp128 qWord
9545 cycles for 1000 * JJAxCMP128bit

9206 cycles for 1000 * Ocmp (JJ)
94243 cycles for 1000 * cmp128b (loop)
7008 cycles for 1000 * cmp128 qWord
9564 cycles for 1000 * JJAxCMP128bit

9205 cycles for 1000 * Ocmp (JJ)
89038 cycles for 1000 * cmp128b (loop)
7011 cycles for 1000 * cmp128 qWord
9532 cycles for 1000 * JJAxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 19, 2013, 08:56:10 AM
AMD, as usual, prefers GPR code :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 19, 2013, 12:30:34 PM
BTW, in qWord's code, are the commented-out instructions really required? They seem to be superfluous.


   push esi
   push edi
   push ebx
   lea esi,ow0
   lea edi,ow1
   mov eax,[esi+0*DWORD]
   mov ecx,[esi+1*DWORD]
   mov edx,[esi+2*DWORD]
   mov ebx,[esi+3*DWORD]
   sub eax,[edi+0*DWORD]
   ;jnz @NE1
   sbb ecx,[edi+1*DWORD]
   ;jnz @NE2
   sbb edx,[edi+2*DWORD]
   ;jnz @NE3
   sbb ebx,[edi+3*DWORD]
   ;/* equal */
   ;jmp @end

;@NE1:   sbb ecx,[edi+1*DWORD]
;@NE2:   sbb edx,[edi+2*DWORD]
;@NE3:   sbb ebx,[edi+3*DWORD]
   ;/* LT or GT */
;   jnz @end
;   or ebx,2    ; MOV may be better, because it breaks the dependency chain...
;   cmp ebx,1   ; new flags: A GT B
@end:
   pop ebx
   pop edi
   pop esi


Every jump goes to the continuation of the same sequence of instructions.
CMP behaves just like SUB, so it's probably simpler and faster just to do a straight 128-bit integer subtraction with SUB/SBB/SBB/SBB.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 19, 2013, 12:46:43 PM
The code:

   mov eax,dword ptr [ow0]
   sub eax,dword ptr [ow1]
   mov eax,dword ptr [ow0+4]
   sbb eax,dword ptr [ow1+4]
   mov eax,dword ptr [ow0+8]
   sbb eax,dword ptr [ow1+8]
   mov eax,dword ptr [ow0+12]
   sbb eax,dword ptr [ow1+12]

seems to be pretty fast on my machine, though still a lot slower than the SSE-powered version.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 19, 2013, 01:26:17 PM
I don't think that will work correctly
this may work
Code: [Select]
cmp128 macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@end
mov eax,dword ptr ow0
sub eax,dword ptr ow1
jnz @NE1
mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4]
jnz @NE2
mov eax,dword ptr ow0[8]
sbb eax,dword ptr ow1[8]
jnz @NE3
mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jmp @end
@NE1: mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4]
@NE2: mov eax,dword ptr ow0[8]
sbb eax,dword ptr ow1[8]
@NE3: mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jnz @end
mov eax,2
cmp eax,1
@end:
endm
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 19, 2013, 01:29:03 PM
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB
notice that it executes about the same number of instructions, either way

my thinking is that if the high-order dwords do not match, you shouldn't have to do any more   :P
depending on the application (of course), that could be a majority of the time
so, i still think there is room for improvement
at the moment, i have one more graphing routine to write, so i haven't spent any time on it
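
The ZF point can be checked with a small model. This is a Python sketch (not MASM) of a plain SUB/SBB/SBB/SBB chain from low to high dword; `sub_sbb_chain` is a hypothetical helper, and the operands are chosen so that only the second dword differs:

```python
MASK32 = 0xFFFFFFFF

def sub_sbb_chain(a, b):
    """SUB on the low dword, then three SBBs; return the flags after the last SBB."""
    borrow = 0
    for shift in (0, 32, 64, 96):
        x = (a >> shift) & MASK32
        y = (b >> shift) & MASK32
        full = x - y - borrow          # SUB for the first dword, SBB afterwards
        r = full & MASK32
        borrow = int(full < 0)
    # Only the values from the last (highest) dword are still visible here.
    return {"CF": bool(borrow), "ZF": r == 0, "SF": bool(r >> 31),
            "OF": bool(((x ^ y) & (x ^ r)) >> 31)}

A = 1 << 32      # dword 1 is 1, everything else 0
B = 0

flags = sub_sbb_chain(A, B)
ja_taken = not flags["CF"] and not flags["ZF"]

print(flags)
print("A > B is", A > B, "but JA would be taken:", ja_taken)
```

CF, SF and OF come out as for a true 128-bit subtraction, but ZF only says that the highest dword of the difference is zero, so JA/JE and friends can go wrong.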
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 19, 2013, 01:35:21 PM
here is a test including qword's number
Code: [Select]
sSmall OWORD 0fffffffffffffffffffffffff00000001h
sBig OWORD 0fffffffffffffffffffffffffffffffffh
...
cmp128 sSmall, sBig
jnl error
cmp128 sBig, sSmall
jl error
cmp128 oSmall, oBig ; movups xmm0, oword ptr oSmall
jnl error
jnb error
cmp128 oBig, oSmall
jle error
jbe error
cmp128 oMedium, oMedium
jne error
cmp128 oSmallF, oBigF
jge error
jae error
cmp128 oBigF, oSmallF
jle error
jbe error
cmp128 oMedF, oMedF
jne error
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 19, 2013, 06:16:01 PM
Hi nidud,
If your algo yields correct results, you should sell it ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2999/1000 cycles

18751   cycles for 1000 * Ocmp (JJ)
18730   cycles for 1000 * Ocmp2 (JJ)
3035    cycles for 1000 * cmp128n (nidud)
8218    cycles for 1000 * cmp128 qWord
22797   cycles for 1000 * JJAxCMP128bit


EDIT: See reply #74 for corrected version
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 19, 2013, 07:01:34 PM
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2113/1000 cycles

23370   cycles for 1000 * Ocmp (JJ)
21821   cycles for 1000 * Ocmp2 (JJ)
13010   cycles for 1000 * cmp128n (nidud)
37300   cycles for 1000 * cmp128 qWord
17150   cycles for 1000 * JJAxCMP128bit

23393   cycles for 1000 * Ocmp (JJ)
21550   cycles for 1000 * Ocmp2 (JJ)
12906   cycles for 1000 * cmp128n (nidud)
36926   cycles for 1000 * cmp128 qWord
17266   cycles for 1000 * JJAxCMP128bit

23680   cycles for 1000 * Ocmp (JJ)
21558   cycles for 1000 * Ocmp2 (JJ)
13570   cycles for 1000 * cmp128n (nidud)
36904   cycles for 1000 * cmp128 qWord
17157   cycles for 1000 * JJAxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 19, 2013, 07:28:30 PM
nice idea, nidud   :t

still plenty of room for improvement, i think
using EAX all the way through has to be slowing it down
although, as a macro, it does make it more flexible

one little item....
Code: [Select]
mov eax,2
cmp eax,1
8 bytes - clears OF, ZF, CF, and SF

can be replaced with
Code: [Select]
or  al,1
2 bytes - clears OF, ZF, CF, and SF
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 19, 2013, 07:52:35 PM
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB

Ah, yes, you're right, Dave :redface:


Jochen, to get correct test running, you should change this macro:


align 16
TestC_s:
; useC=0      ; uncomment to exclude TestC
NameC equ cmp128n (nidud)   ; assign a descriptive name here
TestC proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   showresults=0
   cmp128n offset oSmall, offset oBig
   cmp128n offset oBig, offset oSmall
   cmp128n offset oMedium, offset oMedium
   
   cmp128n offset oSmallF, offset oBigF
   cmp128n offset oBigF, offset oSmallF
   cmp128n offset oMedF, offset oMedF
   sub ebx, 6
   ; dec ebx - we test 6x above
  .Until Sign?
  ret
TestC endp
TestC_endp:



Remove every "offset" statement before the numbers.
If you have a look at the disassembly, you'll find that the expanded macro actually compares not the numbers but the offsets (BTW, that's a very nasty "feature" of macros).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: qWord on August 19, 2013, 09:04:04 PM
What about making the test a bit more realistic by adding some dependencies? e.g. if( Condition ) then do (operation A) else do (operation B)
 :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 19, 2013, 09:49:08 PM
using EAX all the way through has to be slowing it down

I tried to preload edx and ecx, but that actually slows it down
It seems that alignment of the code (?) is more important

Quote
one little item....
Code: [Select]
mov eax,2
cmp eax,1
8 bytes - clears OF, ZF, CF, and SF

can be replaced with
Code: [Select]
or  al,1
2 bytes - clears OF, ZF, CF, and SF

Code: [Select]
mov eax,2
cmp eax,1
2030 cycles for 1000 * cmp128

or al,1
1517 cycles for 1000 * cmp128

sub eax,eax
inc eax
1360 cycles for 1000 * cmp128
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 19, 2013, 11:54:21 PM
Jochen, to get correct test running, you should change this macro:

That's right :t

By the way: Why sbb all over the place?

   jnz @NE1
   mov eax,dword ptr ow0[4]
   .if Carry?
      INT 3
   .endif
   sbb eax,dword ptr ow1[4]


AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 3008/1000 cycles

18753   cycles for 1000 * Ocmp (JJ)
18749   cycles for 1000 * Ocmp2 (JJ)
2688 1) cycles for 1000 * cmp128n (nidud)
8210    cycles for 1000 * cmp128 qWord
22749   cycles for 1000 * JJAxCMP128bit


1) You will be fined for violating speed limits 8)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 12:15:31 AM
your idea is better anyways, nidud
it just occurred to me that OR AL,1 does not necessarily clear the SF if the pre-existing value has bit 7 set
Code: [Select]
    sub     eax,eax
    inc     eax
i like it   :P
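
For reference, here is a rough Python model (not MASM) of the flags the three candidate sequences leave behind. The function names are made up for this sketch, and it assumes the documented behaviour that OR clears CF and OF and that INC does not touch CF. In this model, OR AL,1 always clears ZF but leaves SF set whenever bit 7 of AL was set:

```python
MASK32 = 0xFFFFFFFF

def flags_mov2_cmp1():
    """mov eax,2 / cmp eax,1"""
    a, b = 2, 1
    r = (a - b) & MASK32
    return {"CF": a < b, "ZF": r == 0, "SF": bool(r >> 31),
            "OF": bool(((a ^ b) & (a ^ r)) >> 31)}

def flags_or_al_1(eax):
    """or al,1 : CF and OF are cleared, SF and ZF follow the 8-bit result."""
    al = (eax & 0xFF) | 1
    return {"CF": False, "ZF": al == 0, "SF": bool(al & 0x80), "OF": False}

def flags_sub_inc(eax):
    """sub eax,eax / inc eax : INC leaves the CF cleared by SUB untouched."""
    eax = (eax - eax) & MASK32     # always 0, no borrow
    r = (eax + 1) & MASK32         # always 1
    return {"CF": False, "ZF": r == 0, "SF": bool(r >> 31), "OF": r == 0x80000000}

for eax in (0x00000000, 0x00000080, 0xFFFFFFFF):
    print(f"eax={eax:08X}",
          "or al,1 ->", flags_or_al_1(eax),
          "| sub/inc ->", flags_sub_inc(eax))
print("mov/cmp ->", flags_mov2_cmp1())
```

So the two-instruction variants give the same "A greater than B" flag state for any previous EAX, while OR AL,1 depends on what was in AL.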
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 12:18:46 AM
something in there doesn't like my P4   :P

Cmp128TimingsNidudB
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2078/1000 cycles

23549   cycles for 1000 * Ocmp (JJ)
21190   cycles for 1000 * Ocmp2 (JJ)
31788   cycles for 1000 * cmp128n (nidud)
36806   cycles for 1000 * cmp128 qWord
17161   cycles for 1000 * JJAxCMP128bit

23934   cycles for 1000 * Ocmp (JJ)
21858   cycles for 1000 * Ocmp2 (JJ)
31505   cycles for 1000 * cmp128n (nidud)
37015   cycles for 1000 * cmp128 qWord
17209   cycles for 1000 * JJAxCMP128bit

24145   cycles for 1000 * Ocmp (JJ)
21179   cycles for 1000 * Ocmp2 (JJ)
31363   cycles for 1000 * cmp128n (nidud)
36887   cycles for 1000 * cmp128 qWord
17515   cycles for 1000 * JJAxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 01:06:42 AM
i still think you can stop comparing when you find a high-order mismatch
and - no need to ripple the CF all the way from low-order to high-order if they are all equal

something like this - i need to do some testing
Code: [Select]
Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands

    mov     eax,dword ptr OwA[12]
    mov     edx,dword ptr OwA[8]
    sub     eax,dword ptr OwB[12]
    .if ZERO?
        cmp     edx,dword ptr OwB[8]
        mov     ecx,dword ptr OwA[4]
        .if ZERO?
            cmp     ecx,dword ptr OwB[4]
            mov     edx,dword ptr OwA[0]
            .if ZERO?
                cmp     edx,dword ptr OwB[0]
                .if !ZERO?
                    sbb     eax,0
                .endif
            .endif
        .endif
    .endif

ENDM
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 01:16:02 AM
By the way: Why sbb all over the place?

   jnz @NE1
   mov eax,dword ptr ow0[4]
   .if Carry?
      INT 3
   .endif
   sbb eax,dword ptr ow1[4]


   jnz @NE1
   mov eax,dword ptr ow0[4]
   sub eax,dword ptr ow1[4]

   .if Carry?
      dec eax
      INT 3
   .endif

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 2014/1000 cycles

9249   cycles for 1000 * Ocmp (JJ)
9036   cycles for 1000 * Ocmp2 (JJ)
2368   cycles for 1000 * cmp128n (nidud)
7067   cycles for 1000 * cmp128 qWord
9577   cycles for 1000 * JJAxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 20, 2013, 01:23:16 AM
   jnz @NE1
   ; so here we are inside a ZERO branch, and there is definitely NO CARRY
   mov eax,dword ptr ow0[4]
   sbb eax,dword ptr ow1[4]  ; therefore sbb behaves exactly like sub
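
The "no carry in the zero branch" argument can be checked exhaustively on a scaled-down model. The sketch below is Python, uses 8-bit registers as a stand-in for dwords, and the names `sub8` and `sbb8` are hypothetical:

```python
def sub8(a, b):
    """8-bit SUB: return (result, carry flag)."""
    return (a - b) & 0xFF, a < b

def sbb8(a, b, cf):
    """8-bit SBB: subtract b and the incoming carry."""
    full = a - b - int(cf)
    return full & 0xFF, full < 0

# Collect the carry values that can occur when SUB produced zero.
carry_after_zero = set()
for a in range(256):
    for b in range(256):
        result, cf = sub8(a, b)
        if result == 0:
            carry_after_zero.add(cf)

# With carry clear, SBB and SUB give identical results and flags.
same_without_carry = all(sbb8(x, y, False) == sub8(x, y)
                         for x in range(256) for y in range(256))

print("carry values seen after a zero result:", carry_after_zero)
print("sbb == sub when carry is clear:", same_without_carry)
```

A zero result means both operands were equal, so there is never a borrow, and the SBB that follows on the fall-through path has nothing to subtract besides its operand.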
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 01:53:18 AM
New attempt:

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2198/1000 cycles

24957   cycles for 1000 * Ocmp (JJ)
22338   cycles for 1000 * Ocmp2 (JJ)
32907   cycles for 1000 * cmp128n (nidud)
39508   cycles for 1000 * cmp128 qWord
6635    cycles for 1000 * AxCMP128bit

24604   cycles for 1000 * Ocmp (JJ)
22414   cycles for 1000 * Ocmp2 (JJ)
32845   cycles for 1000 * cmp128n (nidud)
38513   cycles for 1000 * cmp128 qWord
6367    cycles for 1000 * AxCMP128bit

24608   cycles for 1000 * Ocmp (JJ)
22294   cycles for 1000 * Ocmp2 (JJ)
34059   cycles for 1000 * cmp128n (nidud)
38513   cycles for 1000 * cmp128 qWord
6388    cycles for 1000 * AxCMP128bit


--- ok ---

Did not look over your latest archive yet, Jochen, got it just now.

Can you please add my code to newest testbed?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 02:01:20 AM
Ooops, entangled the order :greensml:

Here is my code:

Code: [Select]
AxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2, @l3, @l1_1, @l2_1, @l3_1, @l0


mov eax,dword ptr [ow0+12]
cmp eax,dword ptr [ow1+12]
jnz @l0

mov eax,dword ptr [ow0+8]
cmp eax,dword ptr [ow1+8]
jnz @l3

mov eax,dword ptr [ow0+4]
cmp eax,dword ptr [ow1+4]
jnz @l2

mov eax,dword ptr [ow0]
cmp eax,dword ptr [ow1]
jz @l0


@l1:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2]
mov cx,word ptr [ow1+2]
cmp dx,word ptr [ow1+2]
jnz @l1_1
mov cx,word ptr [ow1]
mov dx,ax
@l1_1:
cmp edx,ecx
jmp @l0

@l2:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2+4]
mov cx,word ptr [ow1+2+4]
cmp dx,word ptr [ow1+2+4]
jnz @l2_1
mov cx,word ptr [ow1+4]
mov dx,ax
@l2_1:
cmp edx,ecx
jmp @l0

@l3:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2+8]
mov cx,word ptr [ow1+2+8]
cmp dx,word ptr [ow1+2+8]
jnz @l3_1
mov cx,word ptr [ow1+8]
mov dx,ax
@l3_1:
cmp edx,ecx


@l0:

ENDM

Timings:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2280/1000 cycles

25091   cycles for 1000 * Ocmp (JJ)
23674   cycles for 1000 * Ocmp2 (JJ)
33530   cycles for 1000 * cmp128n (nidud)
39507   cycles for 1000 * cmp128 qWord
12115   cycles for 1000 * AxCMP128bit

25075   cycles for 1000 * Ocmp (JJ)
22759   cycles for 1000 * Ocmp2 (JJ)
34451   cycles for 1000 * cmp128n (nidud)
39244   cycles for 1000 * cmp128 qWord
10880   cycles for 1000 * AxCMP128bit

25092   cycles for 1000 * Ocmp (JJ)
22990   cycles for 1000 * Ocmp2 (JJ)
32880   cycles for 1000 * cmp128n (nidud)
38631   cycles for 1000 * cmp128 qWord
10777   cycles for 1000 * AxCMP128bit


--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 02:03:08 AM
   jnz @NE1
   ; so here we are inside a ZERO branch, and there is definitely NO CARRY
   mov eax,dword ptr ow0[4]
   sbb eax,dword ptr ow1[4]  ; therefore sbb behaves exactly like sub
hmm :icon_redface:
yes, you may use sub instead

EDIT: JJ  :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 02:19:44 AM
Archive with the right code.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 02:20:59 AM
nidud - SUB would be faster because it does not have to wait for the carry condition to be set

Alex - odds are that the high order compare would find a mismatch
of all the possible 128-bit integers, only 1 in every 4294967296 have a specific high dword value

of all combinations of two 128-bit values,
only 1 combination in every 4294967296 has matching high-order dwords
(1 in every 4294967296^2 if both high dwords also have to equal one specific value)

as you said, this is very application dependent, but i like playing those odds   :P

the timing tests probably aren't set up to test a wide range of values
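
A quick way to sanity-check these odds is to count on a scaled-down model. The sketch below is Python rather than MASM and assumes uniformly random operands, with 4-bit "high" and "low" parts standing in for the high dword and the remaining 96 bits (both widths are illustrative choices). Counting suggests the high parts match in 1 of every 2^k pairs for a k-bit high part, which scales to about 1 in 4294967296 for dwords; the squared figure applies when both high parts must equal one particular value:

```python
from fractions import Fraction

HIGH_BITS = 4   # stand-in for the 32-bit high dword
LOW_BITS = 4    # stand-in for the remaining 96 bits

values = range(1 << (HIGH_BITS + LOW_BITS))
total = matching = specific = 0
for a in values:
    for b in values:
        total += 1
        high_a, high_b = a >> LOW_BITS, b >> LOW_BITS
        matching += high_a == high_b        # high parts equal to each other
        specific += high_a == high_b == 0   # both equal to one chosen value

odds_matching = Fraction(matching, total)
odds_specific = Fraction(specific, total)
print("high parts match:", odds_matching)
print("both high parts are a given value:", odds_specific)
```

Either way, with random data the first compare almost always decides, which is the point being made; real application data may of course be far from random.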
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 02:25:33 AM
Alex - odds are that the high order compare would find a mismatch
...
as you said, this is very application dependant, but i like playing those odds   :P

No-no, Dave, I agree with you, I just didn't get it the first time (in the first pages of the thread I too "voted" for checking the highest-order elements). It's late here, sorry :greensml:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 02:31:00 AM
Quote
1365   cycles for 1000 * cmp128 nidud
1363   cycles for 1000 * cmp128 Dave

 :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 02:36:58 AM
lol nidud
you'd have to select specific values to compare to validate that theory

the values used in the timing test don't cover a very comprehensive range of comparisons
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 02:40:26 AM
results from Alex's latest code

prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2134/1000 cycles

23563   cycles for 1000 * Ocmp (JJ)
21133   cycles for 1000 * Ocmp2 (JJ)
31419   cycles for 1000 * cmp128n (nidud)
37262   cycles for 1000 * cmp128 qWord
10163   cycles for 1000 * AxCMP128bit

23335   cycles for 1000 * Ocmp (JJ)
21130   cycles for 1000 * Ocmp2 (JJ)
32047   cycles for 1000 * cmp128n (nidud)
37288   cycles for 1000 * cmp128 qWord
10208   cycles for 1000 * AxCMP128bit

23770   cycles for 1000 * Ocmp (JJ)
21134   cycles for 1000 * Ocmp2 (JJ)
31739   cycles for 1000 * cmp128n (nidud)
37136   cycles for 1000 * cmp128 qWord
10172   cycles for 1000 * AxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 20, 2013, 02:43:45 AM
My results from Alex's latest test:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++++++3 of 20 tests valid, loop overhead is approx. 1678/1000 cycles

2364    cycles for 1000 * Ocmp (JJ)
2351    cycles for 1000 * Ocmp2 (JJ)
2177    cycles for 1000 * cmp128n (nidud)
10741   cycles for 1000 * cmp128 qWord
3756    cycles for 1000 * AxCMP128bit

2340    cycles for 1000 * Ocmp (JJ)
2346    cycles for 1000 * Ocmp2 (JJ)
2188    cycles for 1000 * cmp128n (nidud)
4684    cycles for 1000 * cmp128 qWord
3749    cycles for 1000 * AxCMP128bit

2420    cycles for 1000 * Ocmp (JJ)
2431    cycles for 1000 * Ocmp2 (JJ)
7600    cycles for 1000 * cmp128n (nidud)
4545    cycles for 1000 * cmp128 qWord
9999    cycles for 1000 * AxCMP128bit

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 02:53:55 AM
lol nidud
you'd have to select specific values to compare to validate that theory

the values used in the timing test don't cover a very comprehensive range of comparisons

sorry  :lol:

Quote
8285   cycles for 1000 * Ocmp (JJ)
83787   cycles for 1000 * cmp128b (loop)
5642   cycles for 1000 * cmp128 qWord
8582   cycles for 1000 * JJAxCMP128bit
1367   cycles for 1000 * cmp128 nidud
1538   cycles for 1000 * cmp128 Dave
4051   cycles for 1000 * cmp128 Antariy
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 03:08:00 AM
 :biggrin:  now, you're just picking on me - lol
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 03:45:07 AM
Well, I added your code to the test by Alex, and used a different PC:
Quote
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
loop overhead is approx. 2101/1000 cycles

8640    cycles for 1000 * Ocmp (JJ)
7715    cycles for 1000 * Ocmp2 (JJ)
5945    cycles for 1000 * cmp128n (nidud)
8970    cycles for 1000 * cmp128 qWord
15807   cycles for 1000 * AxCMP128bit
5942    cycles for 1000 * cmp128 Dave
so there you go  :lol:

However, there seems to be something wrong (sometimes), as also in the test by Gunther
Quote
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
+++++++++11 of 20 tests valid, loop overhead is approx. 2024/1000 cycles

The AMD works fine though
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 3016/1000 cycles

8246   cycles for 1000 * Ocmp (JJ)
8037   cycles for 1000 * Ocmp2 (JJ)
1361   cycles for 1000 * cmp128n (nidud)
6056   cycles for 1000 * cmp128 qWord
4050   cycles for 1000 * AxCMP128bit
1531   cycles for 1000 * cmp128 Dave
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 20, 2013, 04:56:39 AM
Ooops, entangled the order :greensml:

Attached :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1766/1000 cycles

12963   cycles for 1000 * Ocmp2 (JJ)
6266    cycles for 1000 * Cmp128Dave
6270    cycles for 1000 * cmp128n (nidud)
9508    cycles for 1000 * cmp128 qWord
10295   cycles for 1000 * AxCMP128bit


Alex, there's something wrong, we have the slowest algos!! :dazzled:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 05:44:59 AM
After a little afternoon nap I suddenly realised why the timing was equal in the test I did. I renamed the Dave-macro to test if it worked and forgot to rename the proc. So I ended up comparing my macro to my macro, hence the equal timings.

So, sorry Dave, but I'm picking on you again  :lol:

Your code fails (I think), and is equal to this (I think):
Code: [Select]
mov eax,dword ptr ow0[12]
sub eax,dword ptr ow1[12]
jnz @end
mov edx,eax
mov eax,dword ptr ow0[8]
sub eax,dword ptr ow1[8]
jnz @end
mov eax,dword ptr ow0[4]
sub eax,dword ptr ow1[4]
jnz @end
mov eax,dword ptr ow0
sub eax,dword ptr ow1
jz @end
sbb edx,0

However, I think this may work
Code: [Select]
mov eax,dword ptr ow0[12]
sub eax,dword ptr ow1[12]
jnz @end
mov eax,dword ptr ow0[8]
sub eax,dword ptr ow1[8]
jnz @end
mov eax,dword ptr ow0[4]
sub eax,dword ptr ow1[4]
jnz @end
mov eax,dword ptr ow0
sub eax,dword ptr ow1
jz @end
mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jnz @end
sub eax,eax
inc eax
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 08:17:15 AM
Alex, there's something wrong, we have the slowest algos!! :dazzled:

Code: [Select]
cmp128 1, -1
jle error
jnb error
fail: cmp128 qWord
fail: cmp128n (nidud)
fail: cmp128 Dave

Alex version works

Quote
4050   cycles for 1000 * AxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 08:23:53 AM
Ooops, entangled the order :greensml:

Attached :t


Alex, there's something wrong, we have the slowest algos!! :dazzled:

Thank you, Jochen :t

Strange and interesting thing: these not-too-complicated algos seem to be very CPU-dependent.

Timings for your latest archive:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2217/1000 cycles

22772   cycles for 1000 * Ocmp2 (JJ)
7344    cycles for 1000 * Cmp128Dave
33425   cycles for 1000 * cmp128n (nidud)
39887   cycles for 1000 * cmp128 qWord
10968   cycles for 1000 * AxCMP128bit

22775   cycles for 1000 * Ocmp2 (JJ)
7318    cycles for 1000 * Cmp128Dave
34449   cycles for 1000 * cmp128n (nidud)
39424   cycles for 1000 * cmp128 qWord
11218   cycles for 1000 * AxCMP128bit

22814   cycles for 1000 * Ocmp2 (JJ)
7333    cycles for 1000 * Cmp128Dave
33557   cycles for 1000 * cmp128n (nidud)
40793   cycles for 1000 * cmp128 qWord
10702   cycles for 1000 * AxCMP128bit


--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 20, 2013, 08:38:43 AM
Code: [Select]
cmp128 1, -1
jle error
jnb error
fail: cmp128 qWord
fail: cmp128n (nidud)
fail: cmp128 Dave

Alex version works

Are you sure?
oPlusOne GREATER oMinusOne (jj)
oPlusOne greater oMinusOne (nidud)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 08:40:04 AM
So, sorry Dave, but I'm picking on you again  :lol:

Your code fails (I think), and is equal to this (I think):

Yes, nidud is right here. Dave's code is more or less equal to the failing one I suggested a couple of pages ago (which consisted of just SUB/SBB/SBB/SBB): if number 1 is bigger than number 2, it also returns with ZF set, because of the last SUB (in Dave's version it's the last SBB).
(I.e. FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF is GREATER than FFFFFFFFFFFFFFFFFFFFFFFF00000001, but also is EQUAL :biggrin: - JGE/JAE/JE/JZ will have false jump.)
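
The case in parentheses can be traced with a plain SUB/SBB/SBB/SBB chain. The following is only a minimal sketch, assuming the MASM32 SDK (masm32rt.inc with its print, inkey and exit macros); the labels oBig, oSmall, WrongEqual and Finish are hypothetical names chosen for this illustration, and the chain is the generic one described above, not any poster's exact macro.

```asm
include \masm32\include\masm32rt.inc

.data
; dwords are stored lowest first (little endian)
oBig    dd 0FFFFFFFFh, 0FFFFFFFFh, 0FFFFFFFFh, 0FFFFFFFFh   ; FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF
oSmall  dd 00000001h,  0FFFFFFFFh, 0FFFFFFFFh, 0FFFFFFFFh   ; FFFFFFFF_FFFFFFFF_FFFFFFFF_00000001

.code
start:
    mov  eax, dword ptr oBig[0]
    sub  eax, dword ptr oSmall[0]   ; FFFFFFFFh - 1 = FFFFFFFEh: no borrow, ZF clear
    mov  eax, dword ptr oBig[4]     ; MOV does not change the flags
    sbb  eax, dword ptr oSmall[4]   ; equal dwords, no borrow: result 0, ZF set
    mov  eax, dword ptr oBig[8]
    sbb  eax, dword ptr oSmall[8]
    mov  eax, dword ptr oBig[12]
    sbb  eax, dword ptr oSmall[12]  ; last SBB yields 0: CF clear, ZF set
    je   WrongEqual                 ; taken, although the two OWORDs differ
    print "not equal", 13, 10
    jmp  Finish
WrongEqual:
    print "reported EQUAL (ZF set) although oBig > oSmall", 13, 10
Finish:
    inkey
    exit
end start
```

ZF after the chain only describes the result of the last SBB, i.e. the top dword, which is why the low-order difference gets lost.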
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 08:46:44 AM
Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands

still - my code may fail - and so may some others
we need a good validation routine before we worry about timing   :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 09:03:07 AM
still - my code may fail - and so may some others

I think you might just "trace" the code "in mind" with different conditions to get a general idea of whether it works (as you did when I suggested the brute-force SUB/SBB/SBB/SBB). Not many different numbers are needed, actually; the ones chosen by Jochen are already enough. You can see that, for instance, with your code, when the numbers oBigF and oSmallF are compared (it returns GREATER and ZERO set).

But you're right here:
we need a good validation routine before we worry about timing   :P

We have such a validation here (http://masm32.com/board/index.php?topic=2232.msg23009#msg23009) :P
Though it's boring to parse the results by looking at a bunch of reported passed jump conditions :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 09:13:46 AM
My code is a bit bloated, but I think I checked it pretty thoroughly. And you can see that in general the idea is the same as in Jochen's SSE-powered algo: it just constructs the final comparison number from the highest-order element (a WORD, which becomes the high-order WORD of the constructed DWORD) and the highest-order differing element (a WORD, which becomes the low-order WORD of the constructed DWORD). So, even if it's a bit bloated, it should work properly without doubt, provided it's implemented properly (no typos etc.).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 10:03:15 AM
test i will use to fix my code
by the way, 3200 tests are made (40 are repeats) - my current code only fails 2440 of them   :lol:

for the moment - back to work on my graph code
i will play with this some more later this week

Code: [Select]
;###############################################################################################

        .XCREF
        .NoList
        INCLUDE    \Masm32\Include\Masm32rt.inc
        .686p
        .MMX
        .XMM
        .List

;###############################################################################################

C128Dave PROTO :LPVOID,:LPVOID
TestCmp  PROTO :LPVOID,:LPVOID

;###############################################################################################

Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands

    mov     eax,dword ptr OwA[12]
    mov     edx,dword ptr OwA[8]
    sub     eax,dword ptr OwB[12]
    .if ZERO?
        cmp     edx,dword ptr OwB[8]
        mov     ecx,dword ptr OwA[4]
        .if ZERO?
            cmp     ecx,dword ptr OwB[4]
            mov     edx,dword ptr OwA[0]
            .if ZERO?
                cmp     edx,dword ptr OwB[0]
                .if !ZERO?
                    sbb     eax,eax
                .endif
            .endif
        .endif
    .endif

ENDM

;###############################################################################################

        .DATA

;each line is one set of comparison values: a DWORD and an OWORD
;the flag result after comparing the DWORD's from any 2 lines should
;be the same as after comparing the OWORD's on the same 2 lines
;each pair of lines is compared both ways (CMP a,b and CMP b,a) for a total of 3200 tests

TestVal dd 0,0,0,0,0
        dd 1,1,0,0,0
        dd 100h,0,1,0,0
        dd 10000h,0,0,1,0
        dd 1000000h,0,0,0,1

        dd 40000000h,0,0,0,40000000h
        dd 40000001h,1,0,0,40000000h
        dd 40000100h,0,1,0,40000000h
        dd 40010000h,0,0,1,40000000h
        dd 41000000h,0,0,0,40000001h

        dd 80000000h,0,0,0,80000000h
        dd 80000001h,1,0,0,80000000h
        dd 80000100h,0,1,0,80000000h
        dd 80010000h,0,0,1,80000000h
        dd 81000000h,0,0,0,80000001h

        dd 0C0000000h,0,0,0,0C0000000h
        dd 0C0000001h,1,0,0,0C0000000h
        dd 0C0000100h,0,1,0,0C0000000h
        dd 0C0010000h,0,0,1,0C0000000h
        dd 0C1000000h,0,0,0,0C0000001h

        dd 3FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 3FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 3FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
        dd 3FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,3FFFFFFFh
        dd 3EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFEh

        dd 7FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 7FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 7FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
        dd 7FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,7FFFFFFFh
        dd 7EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFEh

        dd 0BFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0BFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0BFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
        dd 0BFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0BFFFFFFFh
        dd 0BEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFEh

        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh
        dd 0FEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh

;***********************************************************************************************

;        .DATA?

;###############################################################################################

        .CODE

;***********************************************************************************************

_main   PROC

        mov     esi,offset TestVal
        mov     ebx,40
        mov     edi,esi

loop00: push    ebx
        push    edi
        mov     ebx,40

loop01: INVOKE  TestCmp,esi,edi
        INVOKE  TestCmp,edi,esi
        dec     ebx
        lea     edi,[edi+20]
        jnz     loop01

        pop     edi
        pop     ebx
        add     esi,20
        dec     ebx
        jnz     loop00

        print   chr$(13,10)
        inkey
        INVOKE  ExitProcess,0

_main   ENDP

;***********************************************************************************************

C128Dave PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID

    mov     esi,lpOp1
    mov     edi,lpOp2
    mov     eax,[esi+12]
    mov     edx,[esi+8]
    sub     eax,[edi+12]
    .if ZERO?
        cmp     edx,[edi+8]
        mov     ecx,[esi+4]
        .if ZERO?
            cmp     ecx,[edi+4]
            mov     edx,[esi]
            .if ZERO?
                cmp     edx,[edi]
                .if !ZERO?
                    sbb     eax,eax
                .endif
            .endif
        .endif
    .endif
    ret

C128Dave ENDP

;***********************************************************************************************

TestCmp PROC USES EBX ESI EDI lpOp1:LPVOID,lpOp2:LPVOID

;OF = bit 11
;SF = bit 7
;ZF = bit 6
;CF = bit 0

    mov     esi,lpOp1
    mov     edi,lpOp2
    mov     eax,[esi]
    cmp     eax,[edi]
    push    ebp
    pushfd
    add     esi,4
    add     edi,4
 ;   INVOKE  C128Dave,esi,edi
    pushfd
    pop     ebx               ;EBX = OWORD compare result flags
    pop     ebp               ;EBP = DWORD compare result flags
    and     ebx,8C1h
    and     ebp,8C1h          ;OF SF ZF CF only
    .if ebx!=ebp
        print   chr$('cmp ')
        mov     eax,[esi+12]
        print   uhex$(eax),'_'
        mov     eax,[esi+8]
        print   uhex$(eax),'_'
        mov     eax,[esi+4]
        print   uhex$(eax),'_'
        mov     eax,[esi]
        print   uhex$(eax),' , '
        mov     eax,[edi+12]
        print   uhex$(eax),'_'
        mov     eax,[edi+8]
        print   uhex$(eax),'_'
        mov     eax,[edi+4]
        print   uhex$(eax),'_'
        mov     eax,[edi]
        print   uhex$(eax),13,10,'was: '
        .if ebx&800h
            print   chr$('OV ')
        .else
            print   chr$('NV ')
        .endif
        .if ebx&80h
            print   chr$('NG ')
        .else
            print   chr$('PL ')
        .endif
        .if ebx&40h
            print   chr$('ZR ')
        .else
            print   chr$('NZ ')
        .endif
        .if ebx&1
            print   chr$('CY')
        .else
            print   chr$('NC')
        .endif
        print   chr$(' should be: ')
        .if ebp&800h
            print   chr$('OV ')
        .else
            print   chr$('NV ')
        .endif
        .if ebp&80h
            print   chr$('NG ')
        .else
            print   chr$('PL ')
        .endif
        .if ebp&40h
            print   chr$('ZR ')
        .else
            print   chr$('NZ ')
        .endif
        .if ebp&1
            print   chr$('CY')
        .else
            print   chr$('NC')
        .endif
        print   chr$(13,10)
    .endif
    pop     ebp
    ret

TestCmp ENDP

;###############################################################################################

        END     _main
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 10:14:57 AM
the test program only displays on failure
here is an example of one fail:
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 40000000_00000001_00000000_00000000
was: NV PL NZ NC should be: NV NG NZ CY
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 10:17:03 AM
for the moment - back to work on my graph code
i will play with this some more later this week

:t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 20, 2013, 11:44:41 AM

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1756/1000 cycles

2362    cycles for 1000 * Ocmp2 (JJ)
2627    cycles for 1000 * Cmp128Dave
2313    cycles for 1000 * cmp128n (nidud)
4672    cycles for 1000 * cmp128 qWord
3885    cycles for 1000 * AxCMP128bit

2418    cycles for 1000 * Ocmp2 (JJ)
2635    cycles for 1000 * Cmp128Dave
2310    cycles for 1000 * cmp128n (nidud)
4628    cycles for 1000 * cmp128 qWord
3833    cycles for 1000 * AxCMP128bit

2412    cycles for 1000 * Ocmp2 (JJ)
2637    cycles for 1000 * Cmp128Dave
2394    cycles for 1000 * cmp128n (nidud)
4647    cycles for 1000 * cmp128 qWord
3866    cycles for 1000 * AxCMP128bit
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 12:08:01 PM
Thank you, John! :t

Can I ask you to run one more test, of the program in this (http://masm32.com/board/index.php?topic=2222.msg23288#msg23288) post? We had strange results there: I think they do show that one algo is faster than another, but the numbers were just astonishing. It would be interesting to see whether this behaviour shows up on your CPU, too.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 20, 2013, 12:39:40 PM
 :t

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1677/1000 cycles

2537    cycles for 1000 * Ocmp (JJ)
61306   cycles for 1000 * cmp128b (loop)
4740    cycles for 1000 * cmp128 qWord
2132    cycles for 1000 * JJAxCMP128bit

2441    cycles for 1000 * Ocmp (JJ)
61320   cycles for 1000 * cmp128b (loop)
4759    cycles for 1000 * cmp128 qWord
2009    cycles for 1000 * JJAxCMP128bit

2419    cycles for 1000 * Ocmp (JJ)
61279   cycles for 1000 * cmp128b (loop)
4796    cycles for 1000 * cmp128 qWord
2052    cycles for 1000 * JJAxCMP128bit

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 12:57:59 PM
Incredible! :biggrin: The results are completely different! In this thread we have widely varying timings for every kind of algo.
Thank you very much, John! :t

I thought that my JJAxCMP128bit tweak was a lot slower than Jochen's original version, so I replaced it with GPR code, but it seems that on any Intel CPU younger than the desktop P4 models, SSE code will be faster than GPR code (not sure about this for AMD). The tweak probably needs to be put back into the testbed.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 02:25:20 PM
Here is one that passes Dave's test
Code: [Select]
cmp128n macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@OV,@end
mov eax,dword ptr ow0
sub eax,dword ptr ow1
jnz @NE1
mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4]
jnz @NE2
mov eax,dword ptr ow0[8]
sbb eax,dword ptr ow1[8]
jnz @NE3
mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jmp @end
@NE1: mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4]
@NE2: mov eax,dword ptr ow0[8]
sbb eax,dword ptr ow1[8]
@NE3: mov eax,dword ptr ow0[12]
sbb eax,dword ptr ow1[12]
jo @OV
jnz @end
inc eax
jmp @end
@OV: jc @end
mov eax,80000000h
sub eax,7FFFFFFFh
@end:
endm

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 3014/1000 cycles

8295   cycles for 1000 * Ocmp (JJ)
83800   cycles for 1000 * cmp128b (loop)
*5565   cycles for 1000 * cmp128 qWord
8583   cycles for 1000 * JJAxCMP128bit
2037   cycles for 1000 * cmp128 nidud
*1531   cycles for 1000 * cmp128 Dave
*4188   cycles for 1000 * cmp128 Antariy

* these currently fail
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 06:16:56 PM
Hi nidud, can you please post entire testbed source + binary? This will help a lot to check what is going on.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 20, 2013, 06:34:40 PM
We are still testing unsigned? And the end result is ja/je/jb?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 07:19:58 PM
We are still testing unsigned? And the end result is ja/je/jb?

No, theoretically every algo is designed to support any number, i.e. to fully emulate the flag-setting behaviour of the usual CMP instruction, so how to treat the numbers is the programmer's decision: JA or JG, JB or JL, etc.
Something is strange here with the GPR code. If nidud posts the testbed it will help, but right now I cannot say what is wrong (and I'm not sure that something is wrong; we probably need a comprehensive checking testbed, otherwise it's very error-prone to check all things manually).
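
The JA-or-JG point can be illustrated with an ordinary 32-bit CMP on the pair 1 and -1, the same pair used in the "cmp128 1, -1" check earlier in the thread. This is only a minimal sketch, assuming the MASM32 SDK (masm32rt.inc with its print, inkey and exit macros); the labels Failed and Finish are hypothetical.

```asm
include \masm32\include\masm32rt.inc

.code
start:
    mov  eax, 1
    mov  edx, -1            ; FFFFFFFFh
    cmp  eax, edx           ; 1 - FFFFFFFFh = 2 with borrow: CF=1, ZF=0, SF=0, OF=0
    jnb  Failed             ; unsigned view: 1 is below FFFFFFFFh, so JNB must not jump
    jle  Failed             ; signed view: 1 is greater than -1, so JLE must not jump
    print "one CMP, both views correct", 13, 10
    jmp  Finish
Failed:
    print "unexpected jump", 13, 10
Finish:
    inkey
    exit
end start
```

A 128-bit compare that really emulates CMP should leave the same four flags for the OWORDs 1 and -1, so that the caller can pick either family of conditional jumps.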
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 20, 2013, 07:47:01 PM
I can't see how comparing dwords can propagate every flag for all four.

Quote from: Intel manual
temp ← SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction *)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 08:24:01 PM
nidud, if the source is still a bit untidy, then please post a binary, or the numbers which make the algos fail.

I can't see how comparing dwords can propagate every flag for all four.

Quote from: Intel manual
temp ← SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction *)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.

We don't set or combine flags for every DWORD - the trick is that we emulate a CMP instruction for a 128-bit integer, so the flags should be set according to the comparison of the two 128-bit numbers as a whole, not as "arrays" of DWORDs (just like CMP 128_bit_number_1, 128_bit_number_2 followed by Jcc).
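
One slow but easy-to-check way to get whole-number flags out of DWORD operations is sketched below. It is not one of the algos posted in this thread: the macro name Cmp128Ref and its parameters are hypothetical, and it assumes that pA and pB are registers other than EAX, ECX, EDX, EBX and ESP holding the addresses of the two OWORDs. A full-width SUB/SBB chain already leaves CF, SF and OF as a 128-bit subtraction would; only ZF has to be corrected, because it reflects the top dword alone.

```asm
; Hypothetical reference macro (an illustration, not from the thread).
; Leaves CF, ZF, SF and OF as CMP would for two 128-bit operands.
; Clobbers EAX, ECX, EDX; preserves EBX.
Cmp128Ref MACRO pA:REQ, pB:REQ
    LOCAL ZfOk
    push ebx
    mov  eax, [pA]
    mov  ecx, [pA+4]
    mov  edx, [pA+8]
    mov  ebx, [pA+12]
    sub  eax, [pB]              ; lowest dword first, the borrow ripples upward
    sbb  ecx, [pB+4]
    sbb  edx, [pB+8]
    sbb  ebx, [pB+12]           ; CF, SF, OF now match the 128-bit subtraction
    pushfd                      ; save them; ZF so far reflects the top dword only
    or   eax, ecx
    or   eax, edx
    or   eax, ebx               ; zero only if all 128 result bits are zero
    jz   ZfOk                   ; truly equal: the saved ZF is already set
    and  dword ptr [esp], NOT 40h   ; otherwise clear ZF (bit 6) in the saved flags
ZfOk:
    popfd                       ; restore CF, SF, OF with the corrected ZF
    pop  ebx                    ; POP does not change the flags
ENDM
```

Usage would be along the lines of "Cmp128Ref esi, edi" followed by JA/JB for unsigned or JG/JL for signed operands. PUSHFD/POPFD make it unsuitable for timing contests; it is meant only as something to compare faster algos against.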
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 09:39:49 PM
i cleaned up the validation test code - and removed repeat tests

earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol

with the changes, my current algo fails 112 of 3160 tests   :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 10:03:13 PM
OK, here is the correctness test. It is not in the form of a macro for testing every algo, but it shows the idea.

First compare the OWORDs with the selected algo, then convert and save the flags in an easily comparable format with FlagsToEAX. Then compare the same OWORDs with the "Etalone" macro, which produces the proper result by emulating the flag setting and returns it in the same format as FlagsToEAX. Finally, compare what "Etalone" returned with what was saved from the tested algo's flags. If there's a difference, the test reports it.

Here is the code isolated from the source:

Added to the source:
Code: [Select]
; EAX = BITS: ... CF ZF SF OF
FlagsToEAX MACRO
pushfd
xor eax,eax
pop edx
bt edx,0
rcl eax,1
bt edx,6
rcl eax,1
bt edx,7
rcl eax,1
bt edx,11
rcl eax,1
ENDM

Etalone MACRO ow0, ow1
LOCAL @l1, @l2, @l3, @l0

push ebx

mov eax,dword ptr [ow0+12]
mov edx,dword ptr [ow1+12]
cmp eax,edx
jnz @l1 ; just save flags

mov ecx,dword ptr [ow0+8]
mov ebx,dword ptr [ow1+8]
cmp ecx,ebx
jnz @l2

mov ecx,dword ptr [ow0+4]
mov ebx,dword ptr [ow1+4]
cmp ecx,ebx
jnz @l2

mov ecx,dword ptr [ow0]
mov ebx,dword ptr [ow1]
cmp ecx,ebx
jz @l1



@l2:
push 0
ja @l3 ; if it's above - the number is bigger because this isn't MSD

mov byte ptr [esp+3],1 ; CF set, below than (unsigned)

test eax,eax ; if numbers are signed then set required flags
jns @l3
mov word ptr [esp],0001h ; SF and OF are not equal, so it means less than (signed)
@l3:
pop edx
shr edx,1
rcl eax,1
push 3
pop ecx
@@:
shr edx,8
rcl eax,1
loop @B
jmp @l0


@l1:
FlagsToEAX

@l0:

if 0
push eax
mov ebx,eax
test ebx,1
jz @F
print "OF "
@@:
test ebx,2
jz @F
print "SF "
@@:
test ebx,4
jz @F
print "ZF "
@@:
test ebx,8
jz @F
print "CF "
@@:

print chr$(13,10)
pop eax
endif

pop ebx

ENDM
   

The piece below is the check itself; it may be added at the start of the program, here:

start:   push 1
   call ShowCpu   ; print brand string and SSE level
   invoke SetProcessAffinityMask, -1, 1   ; restrict to one core
   Calibrate


Code: [Select]
AxCMP128bit numberOne,numberTwo
FlagsToEAX
push eax
Etalone numberOne, numberTwo
pop ecx
xor eax,ecx
jz @l1 ; test OK
and eax,3 ; the layout of the SF and OF flags may differ, but if they are unequal to each other
jz @l1 ; in both the first and the second comparison, then the result is proper,
cmp eax,3 ; since signed less-than is OF != SF, no matter which of the two flags is (un)set
jz @l1
mov edx,[esp]
print str$(edx)," - Test failed: "
print uhex$(dword ptr [esi+12])
print "_"
print uhex$(dword ptr [esi+8])
print "_"
print uhex$(dword ptr [esi+4])
print "_"
print uhex$(dword ptr [esi])
print "  "
print uhex$(dword ptr [edi+12])
print "_"
print uhex$(dword ptr [edi+8])
print "_"
print uhex$(dword ptr [edi+4])
print "_"
print uhex$(dword ptr [edi]),13,10
@l1:

One may ask why "Etalone" is declared to be the truly correct code; well, this is a chicken-and-egg question :biggrin: It is declared so because it follows the CMP/SUB behaviour in flag setting: it just emulates it, not with the real flags but in a variable, which is then compared with the converted flag state produced by the tested code. It would take longer to describe why it is stated to work properly than to check its algo with the help of the Intel docs.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 10:07:37 PM
i cleaned up the validation test code - and removed repeat tests

earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol

with the changes, my current algo fails 112 of 3160 tests   :biggrin:

Hi Dave, I haven't seen your code yet, will check it :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 10:10:01 PM
Can I use your data set with my checking method above? I haven't prepared the data patterns yet, but checking with just random data shows that the algo works (though crafted data like yours is better than just random).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 10:23:27 PM
of course you may, Alex

the data is organized as sets of: (1) dword and (1) oword

comparing the owords from any two lines should yield the
same flags as comparing the dwords from the same 2 lines

so - the dwords are "control" values and the owords are "test" values
3160 tests are required to make all compares
(40 x 40 x 2, less 40 repeats)

in my test code, i examine OF, SF, ZF, and CF - they should match
no messing around with JL, JGE, etc
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 10:32:55 PM
Alex,
I just used the test I posted here (http://masm32.com/board/index.php?topic=2222.msg23338#msg23338), adding qWord’s number

I then added Small=1 for the cmp(1,-1) which fails

Then I added this to Dave’s test
Code: [Select]

C128nidud PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
sub eax,[edi]
jnz @NE1
mov eax,[esi+4]
sub eax,[edi+4]
jnz @NE2
mov eax,[esi+8]
sub eax,[edi+8]
jnz @NE3
mov eax,[esi+12]
sbb eax,[edi+12]
jmp @end
@NE1: mov eax,[esi+4]
sbb eax,[edi+4]
@NE2: mov eax,[esi+8]
sbb eax,[edi+8]
@NE3:   mov eax,[esi+12]
sbb eax,[edi+12]
jo @OV
jnz @end
inc eax
@end: ret
@OV: jc @end
mov eax,80000000h
sub eax,7FFFFFFFh
jmp @end
C128nidud endp
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 10:52:13 PM
With the help of Dave's test number patterns, here is the checking testbed :t

Currently only my and Jochen's algos pass the check.

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2111/1000 cycles

#######################################################
Testing algo: Cmp128Dave [esi],[edi]
1970169159 - Test failed: 00000000_00000000_00000000_00000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_00000000_00000000  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_00000000_00000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_00000000_00000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000001_00000001_00000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000001_00000001_00000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000001_00000001_00000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000100_00000000_00000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000100_00000000_00000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000100_00000000_00000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000001_00000000_00000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000001_00000000_00000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000001_00000000_00000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000000_00000000_01000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_00000000_01000000  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_00000000_01000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_00000000_01000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000000_40000000_00000001  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_40000000_00000001  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_40000000_00000001  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_40000000_00000001  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000001_00000000_00000000_40010000  00000001_00000000
_C0000100_C0000000
1970169159 - Test failed: 00000000_00000000_41000000_40000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_41000000_40000000  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_41000000_40000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_41000000_40000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000001
_00000001_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000100
_00000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000001
_00000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000001
_00000000_80000100
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000  00000000_00000000
_00000001_C0000001
1970169159 - Test failed: 00000000_00000001_00000000_80000100  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000001_00000000_80000100  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000001_00000000_80000100  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_00000000
_00000001_C0000001
1970169159 - Test failed: 00000000_00000000_80010000_80000000  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000001
_00000001_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000100
_00000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000001
_00000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000001
_00000000_80000100
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001  00000000_00000000
_00000001_C0000001
1970169159 - Test failed: C0000000_80000001_00000000_00000000  C0000000_00000000
_00000000_00000000
1970169159 - Test failed: C0000000_00000000_00000000_00000000  C0000000_80000001
_00000000_00000000
1970169159 - Test failed: 00000000_00000000_00000001_C0000001  00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_00000001_C0000001  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_00000001_C0000001  00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_00000001_C0000001  00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000001_00000000_C0000100_C0000000  00000001_00000000
_00000000_40010000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000001
_00000001_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000100
_00000000_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000001
_00000000_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000001
_00000000_80000100
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000  00000000_00000000
_00000001_C0000001
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3FFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3FFFFFFF  FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3FFFFFFF  FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_3FFFFFFE_3FFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_3FFFFFFE_3FFFFFFF  FFFFFFFF_FFFFFFFE
_FFFFFFFF_BFFFFEFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFF_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFE
_3FFFFFFE_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFF_3EFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_BFFFFFFF
_7FFFFFFE_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFE
_FFFFFFFF_BFFFFEFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_BFFEFFFF_BFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_BEFFFFFF
_BFFFFFFF_FFFFFFFE
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_BFFFFFFE
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFE
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF  FFFFFFFF_FFFEFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3EFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3EFFFFFF  FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3EFFFFFF  FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE  FFFFFFFF_FFFFFFFF
_FFFFFFFF_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE  FFFFFFFF_FFFFFFFF
_FFFFFFFF_3EFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE  FFFFFFFF_FFFFFFFF
_BFFEFFFF_BFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE  FFFFFFFF_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE  FFFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFE
1970169159 - Test failed: FFFFFFFE_7FFFFFFE_7FFFFFFF_FFFFFFFF  FFFFFFFE_FFFFFFFF
_FFFFFFFF_7FFEFFFF
1970169159 - Test failed: FFFFFFFE_7FFFFFFE_7FFFFFFF_FFFFFFFF  FFFFFFFE_FFFFFFFF
_FFFFFEFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFE_7FFFFFFE_7FFFFFFF_FFFFFFFF  FFFFFFFE_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFE_FFFFFFFF_FFFFFFFF_7FFEFFFF  FFFFFFFE_7FFFFFFE
_7FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFF_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFF_3EFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF  FFFFFFFF_FFFFFFFF
_BFFEFFFF_BFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF  FFFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFE
1970169159 - Test failed: FFFFFFFF_BFFFFFFF_7FFFFFFE_FFFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_FFFFFFFF_BFFFFEFF  FFFFFFFF_FFFFFFFE
_3FFFFFFE_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_FFFFFFFF_BFFFFEFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_BFFEFFFF_BFFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_BFFEFFFF_BFFFFFFF  FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_BFFEFFFF_BFFFFFFF  FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_BEFFFFFF_BFFFFFFF_FFFFFFFE  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_BFFFFFFE_FFFFFFFF_FFFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF  FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFFFE  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFFFE  FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFFFE  FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFE_FFFFFFFF_FFFFFEFF_FFFFFFFF  FFFFFFFE_7FFFFFFE
_7FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFEFFFF_FFFFFFFF_FFFFFFFF  FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFE_FFFFFFFF_FFFFFFFF_FFFFFFFF  FFFFFFFE_7FFFFFFE
_7FFFFFFF_FFFFFFFF
Test done


#######################################################
Testing algo: cmp128n [esi],[edi]
1970169159 - Test failed: 80000000_00000000_00000000_00000001  7FFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFF
Test done


#######################################################
Testing algo: cmp128q [esi],[edi]
1970169159 - Test failed: 80000000_00000000_00000000_00000001  7FFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFF
Test done


#######################################################
Testing algo: Ocmp2 [esi],[edi]
Test done


#######################################################
Testing algo: AxCMP128bit [esi],[edi]
Test done

Dave, thank you very much for the test data :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 20, 2013, 10:57:16 PM
Alex,
I just used the test I posted here (http://masm32.com/board/index.php?topic=2222.msg23338#msg23338), adding qWord’s number

I then added Small=1 for the cmp(1,-1), which fails

Then I added this to Dave’s test
Code: [Select]

C128nidud PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
sub eax,[edi]
jnz @NE1
mov eax,[esi+4]
sub eax,[edi+4]
jnz @NE2
mov eax,[esi+8]
sub eax,[edi+8]
jnz @NE3
mov eax,[esi+12]
sbb eax,[edi+12]
jmp @end
@NE1: mov eax,[esi+4]
sbb eax,[edi+4]
@NE2: mov eax,[esi+8]
sbb eax,[edi+8]
@NE3:   mov eax,[esi+12]
sbb eax,[edi+12]
jo @OV
jnz @end
inc eax
@end: ret
@OV: jc @end
mov eax,80000000h
sub eax,7FFFFFFFh
jmp @end
C128nidud endp


Hmm... I checked it with qWord's number, too - it worked. Maybe I did not get something in your method?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 20, 2013, 11:37:36 PM
I only tested 4 of the algos, and of these yours was the only one that worked. However, when I copy and paste your algo into Dave's test, it fails there (I may be wrong). Here is the proc I used:

Code: [Select]
C128Alex PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi+12]
cmp eax,[edi+12]
jnz @l0
mov eax,[esi+8]
cmp eax,[edi+8]
jnz @l3
mov eax,[esi+4]
cmp eax,[edi+4]
jnz @l2
mov eax,[esi]
cmp eax,[edi]
jz @l0
@l1:
mov edx,[esi+12]
mov ecx,[edi+12]
mov dx,[esi+2]
mov cx,[edi+2]
cmp dx,[edi+2]
jnz @l1_1
mov cx,[edi]
mov dx,ax
@l1_1:
cmp edx,ecx
jmp @l0
@l2:
mov edx,[esi+12]
mov ecx,[edi+12]
mov dx,[esi+2+4]
mov cx,[edi+2+4]
cmp dx,[edi+2+4]
jnz @l2_1
mov cx,[edi+4]
mov dx,ax
@l2_1:
cmp edx,ecx
jmp @l0
@l3:
mov edx,[esi+12]
mov ecx,[edi+12]
mov dx,[esi+2+8]
mov cx,[edi+2+8]
cmp dx,[edi+2+8]
jnz @l3_1
mov cx,[edi+8]
mov dx,ax
@l3_1:
cmp edx,ecx
@l0:
ret
C128Alex endp

Quote
we need a good validation routine before we worry about timing   :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 20, 2013, 11:42:54 PM
nidud's latest code passes my test
are you saying it doesn't pass yours Alex ?

on my algo, this fail....
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY
tells me that my theory about early exit on high-order compare is not a good theory - lol

it appears that you DO have to ripple from low to high
so, we are looking at something like nidud's code

i added a few blank lines for readability....
Code: [Select]
;***********************************************************************************************

C128nidud PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID

mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
sub eax,[edi]
jnz @NE1

mov eax,[esi+4]
sub eax,[edi+4]
jnz @NE2

mov eax,[esi+8]
sub eax,[edi+8]
jnz @NE3

mov eax,[esi+12]
sbb eax,[edi+12]
jmp @end

@NE1: mov eax,[esi+4]
sbb eax,[edi+4]

@NE2: mov eax,[esi+8]
sbb eax,[edi+8]

@NE3:   mov eax,[esi+12]
sbb eax,[edi+12]
jo @OV

jnz @end

inc eax

@end: ret

@OV: jc @end

mov eax,80000000h
sub eax,7FFFFFFFh
jmp @end

C128nidud ENDP

;***********************************************************************************************

i like the algo, except it seems a little messy at the end   :P
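
For reference, the bare "ripple from low to high" is just one CMP followed by three SBBs. A minimal sketch, assuming ESI and EDI already point at the two OWORDs as in the proc above (the macro name RippleCmp128 is made up for this example):

```asm
RippleCmp128 MACRO
    mov     eax,[esi]
    cmp     eax,[edi]           ; lowest dword: sets the first borrow in CF
    mov     eax,[esi+4]         ; MOV does not change any flags
    sbb     eax,[edi+4]         ; subtract with the borrow from below
    mov     eax,[esi+8]
    sbb     eax,[edi+8]
    mov     eax,[esi+12]
    sbb     eax,[edi+12]        ; CF, SF and OF now match a 128-bit CMP
                                ; ZF only reflects the top dword of the difference
ENDM
```

After this sequence JB/JAE and JL/JGE can be used directly. JE/JNE and the branches that also read ZF (JA, JBE, JG, JLE) cannot, because ZF is set whenever the top dword of the difference is zero, even if a lower dword differs. That is the case the tail of the proc has to deal with.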
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 21, 2013, 12:17:52 AM
 :biggrin:

The first algo was not that far off (32), excluding [sub eax,eax], but I ended up with this:
Code: [Select]
cmp 80000000_00000000_00000000_00000000 , 7FFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF
was: NV PL NZ NC should be: OV PL NZ NC

The test could also be used for the timing since the current one is somewhat limited with regards to variation of numbers.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 12:34:19 AM
Here is the new one. My testing method works differently from Dave's and does not need an additional DWORD to check the results (it relies on an "Etalone" reference proc, which is now complete), so I used that extra DWORD in the tests, too: by varying the offset into the table and the increment size, this DWORD becomes part of an OWORD in a second pass.

You may play with this, too - comments in the lines:

   add edi,16;+offst
   dec ebx
   jnz @l2
   add esi,16;+offst
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 12:40:41 AM
nidud's latest code passes my test
are you saying it doesn't pass yours Alex ?

on my algo, this fail....
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY
tells me that my theory about early exit on high-order compare is not a good theory - lol

I probably used his old code. ATM I'm working on a testing method and did not notice that it had been updated :biggrin: The goal was to get a working testing method; that's OK now, so I'll add the new code.

BTW, Dave, is this your latest code?

Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands

    mov     eax,dword ptr OwA[12]
    mov     edx,dword ptr OwA[8]
    sub     eax,dword ptr OwB[12]
    .if ZERO?
        cmp     edx,dword ptr OwB[8]
        mov     ecx,dword ptr OwA[4]
        .if ZERO?
            cmp     ecx,dword ptr OwB[4]
            mov     edx,dword ptr OwA[0]
            .if ZERO?
                cmp     edx,dword ptr OwB[0]
                .if !ZERO?
                    sbb     eax,eax
                .endif
            .endif
        .endif
    .endif

ENDM
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 12:48:50 AM
well - that is the latest that doesn't work   :lol:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 01:16:01 AM
C128nidud doesn't fail.

One more update: for completeness it now makes more passes over the number stream, so some numbers repeat.

I think no one has looked at my testbed, so no one knows how neat and flexible it is :P Testing results + timings in one testbed.

Note again: this testing method does not require manual data construction. You may even run it over random data, because it compares every result against a reference ("milestone") proc. But Dave's data is much, much better than random data, because it is hand-crafted.
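
A reference proc of this kind does not have to be fast, only obviously correct. A rough sketch of what such a "milestone" could look like; this is not the Etalone proc from the attachment, and the name RefCmp128 and its parameter names are assumptions for illustration:

```asm
; Hypothetical reference proc, not the "Etalone" from the attachment.
; In:  lpOp1, lpOp2 -> two OWORDs stored as four little-endian dwords
; Out: CF, ZF, SF, OF as a single 128-bit CMP would leave them
RefCmp128 PROC USES esi edi ebx lpOp1:DWORD,lpOp2:DWORD

    mov     esi,lpOp1
    mov     edi,lpOp2

    ; pass 1: ebx <> 0 if any of the three low dwords differ
    mov     ebx,[esi]
    xor     ebx,[edi]
    mov     eax,[esi+4]
    xor     eax,[edi+4]
    or      ebx,eax
    mov     eax,[esi+8]
    xor     eax,[edi+8]
    or      ebx,eax

    ; pass 2: ripple the borrow from low to high (MOV leaves the flags alone)
    mov     eax,[esi]
    cmp     eax,[edi]
    mov     eax,[esi+4]
    sbb     eax,[edi+4]
    mov     eax,[esi+8]
    sbb     eax,[edi+8]
    mov     eax,[esi+12]
    sbb     eax,[edi+12]        ; CF, SF, OF are final; ZF covers the top dword only
    jnz     done                ; top dword of the difference is non-zero: ZF=0 is right

    mov     ecx,ebx
    jecxz   done                ; low dwords are equal too: ZF=1 is right

    lahf                        ; AH = copy of SF ZF AF PF CF (EAX is 0 here)
    lea     eax,[eax-4000h]     ; clear the ZF copy in AH without touching the flags
    sahf                        ; write back SF, ZF=0, AF, PF, CF; OF is not affected

done:
    ret

RefCmp128 ENDP
```

The LAHF/LEA/SAHF tail is only reached when the top dword of the difference is zero but one of the lower dwords differs, so EAX is 0 before LAHF and subtracting 4000h clears exactly the ZF copy in AH.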

Timings in the bottom of listing:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++++16 of 20 tests valid, loop overhead is approx. 2433/1000 cycles

22910   cycles for 1000 * Ocmp (JJ)
21054   cycles for 1000 * Ocmp2 (JJ)
38189   cycles for 1000 * cmp128n (nidud)
36171   cycles for 1000 * cmp128 qWord
9852    cycles for 1000 * AxCMP128bit

22886   cycles for 1000 * Ocmp (JJ)
20740   cycles for 1000 * Ocmp2 (JJ)
38777   cycles for 1000 * cmp128n (nidud)
36293   cycles for 1000 * cmp128 qWord
9835    cycles for 1000 * AxCMP128bit

23170   cycles for 1000 * Ocmp (JJ)
20988   cycles for 1000 * Ocmp2 (JJ)
37485   cycles for 1000 * cmp128n (nidud)
36515   cycles for 1000 * cmp128 qWord
9851    cycles for 1000 * AxCMP128bit


--- ok ---


The message exceeded 20000 chars, so I removed all the data except timings. But when you run it it first displays correctness testing results for every algo.

Can I ask for timings, please?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 01:16:47 AM
this code works but as i recall, SAHF is a slow instruction
Code: [Select]
C128Dave PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID

    mov     esi,lpOp1
    mov     edi,lpOp2

    mov     eax,[esi]
    cmp     eax,[edi]
    jnz     c1

    mov     eax,[esi+4]
    cmp     eax,[edi+4]
    jnz     c2

    mov     eax,[esi+8]
    cmp     eax,[edi+8]
    jnz     c3

    mov     eax,[esi+12]
    cmp     eax,[edi+12]
    jmp short cz

c1: mov     eax,[esi+4]
    sbb     eax,[edi+4]

c2: mov     eax,[esi+8]
    sbb     eax,[edi+8]

c3: mov     eax,[esi+12]
    sbb     eax,[edi+12]
    jnz     cz

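    lahf                        ; AH = copy of SF ZF AF PF CF (EAX is 0 here)
    lea     eax,[eax-4000h]     ; clear the ZF bit in AH without touching the flags
    sahf                        ; write the flags back with ZF=0, OF is untouched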

cz: ret

C128Dave ENDP
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 01:19:08 AM
this code works, but i think SAHF is slow

Well, just add it in the testbed I posted above :P :biggrin: I got bored with this stuff :P :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 01:20:21 AM
this is the kind of stuff i enjoy
i should be working on something else - lol
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 01:23:20 AM
To check your code, just insert it into the testbed and add this at the start:

   CheckIt <invoke C128Dave,esi,edi>


And, yes, it passes the check :t Though adding it to the timings part is up to you :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 01:55:44 AM
here is a nice algo
i think this one is a winner   :t

it can be modified to make a macro that direct addresses
and preloading registers may also provide some improvement
but this gives you the basic concept.....
Code: [Select]
C128Dave PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID

    mov     esi,lpOp1
    mov     edi,lpOp2
    xor     edx,edx
    mov     eax,[esi]
    cmp     eax,[edi]
    .if !ZERO?
        inc     edx
    .endif
    mov     eax,[esi+4]
    sbb     eax,[edi+4]
    .if !ZERO?
        inc     edx
    .endif
    mov     eax,[esi+8]
    sbb     eax,[edi+8]
    .if !ZERO?
        inc     edx
    .endif
    mov     eax,[esi+12]
    mov     ecx,[edi+12]
    sbb     al,cl
    .if !ZERO?
        inc     edx
    .endif
    .if CARRY?
        mov     cl,dl
        mov     al,dh
    .else
        mov     al,dl
        mov     cl,dh
    .endif
    cmp     eax,ecx
    ret

C128Dave ENDP

the idea is to gather the info for the 3 low-order DWORD compares
then, when we get to the last one, compare the low bytes (with borrow)
then, replace the low bytes with the cumulated results for a single DWORD CMP instruction

notice that INC does not affect the CF
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 21, 2013, 02:32:04 AM
test_correct.zip:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

18963   cycles for 1000 * Ocmp (JJ)
18407   cycles for 1000 * Ocmp2 (JJ)
15392   cycles for 1000 * cmp128n (nidud)
8043   cycles for 1000 * cmp128 qWord
5544   cycles for 1000 * AxCMP128bit

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
14256   cycles for 1000 * Ocmp (JJ)
13015   cycles for 1000 * Ocmp2 (JJ)
15534   cycles for 1000 * cmp128n (nidud)
9508   cycles for 1000 * cmp128 qWord
10279   cycles for 1000 * AxCMP128bit


Ocmp2 passed all tests.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 21, 2013, 02:55:05 AM
Here is a time test for the algos that passed the test

Quote

AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
2873466   cycles for C128Dave
3549577   cycles for C128Dave2
3107078   cycles for C128Nidud
---------------------------------------------------------
2839056   cycles for C128Dave
3563048   cycles for C128Dave2
3113448   cycles for C128Nidud
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 03:51:13 AM
here it is in macro form - should be a bit faster   :P
Code: [Select]
Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands
;Example: Cmp128Dave offset Oword1,offset Oword2

    mov     eax,dword ptr OwA[0]
    xor     ecx,ecx
    cmp     eax,dword ptr OwB[0]
    mov     edx,dword ptr OwA[4]
    .if !ZERO?
        inc     ecx
    .endif
    sbb     edx,dword ptr OwB[4]
    mov     eax,dword ptr OwA[8]
    .if !ZERO?
        inc     ecx
    .endif
    sbb     eax,dword ptr OwB[8]
    mov     edx,dword ptr OwB[12]
    mov     eax,dword ptr OwA[12]
    .if !ZERO?
        inc     ecx
    .endif
    sbb     al,dl
    .if !ZERO?
        inc     ecx
    .endif
    .if CARRY?
        mov     dl,cl
        mov     al,ch
    .else
        mov     al,cl
        mov     dl,ch
    .endif
    cmp     eax,edx

ENDM
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: hutch-- on August 21, 2013, 04:22:55 AM
i don't know if this is even vaguely useful as I have not been following this topic in any real detail but this may be useful to someone, a .486 compatible unsigned QWORD comparison algo.

Code: [Select]
IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    cmpqword PROTO :DWORD,:DWORD

    .data
      value1 QWORD 0000000000000000h
      value2 QWORD 0000000000000001h
      value3 QWORD 0000000000000002h

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    invoke cmpqword,ADDR value1,ADDR value2     ; value1 < value2, returns -1
    print str$(eax),13,10

    invoke cmpqword,ADDR value3,ADDR value3     ; value3 = value3, returns 0
    print str$(eax),13,10

    invoke cmpqword,ADDR value2,ADDR value1     ; value2 > value1, returns 1
    print str$(eax),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

cmpqword proc pquad1:DWORD,pquad2:DWORD

  ; ----------------------
  ; unsigned QWORD compare
  ; ----------------------
    mov eax, [esp+4]
    mov edx, [esp+8]

    mov ecx, [eax+4]
    cmp ecx, [edx+4]    ; high DWORD 1st
    ja greater
    jb lessthan

    mov ecx, [eax]
    cmp ecx, [edx]      ; low DWORD next
    ja greater
    jb lessthan

    xor eax, eax        ; return 0 on equal
    ret 8

  lessthan:
    mov eax, -1         ; return -1 on less than
    ret 8

  greater:
    mov eax, 1          ; return 1 on greater
    ret 8

cmpqword endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
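
The same high-to-low idea carries over to unsigned OWORDs by looping over the four DWORDs. A sketch with the same return convention (-1, 0, 1); the name cmpoword and its parameter names are hypothetical, and it uses the default prologue and epilogue instead of OPTION PROLOGUE:NONE:

```asm
cmpoword proc uses esi edi pOword1:DWORD,pOword2:DWORD

  ; ----------------------
  ; unsigned OWORD compare
  ; ----------------------
    mov esi, pOword1
    mov edi, pOword2
    mov ecx, 12             ; byte offset of the high DWORD

  nextdword:
    mov eax, [esi+ecx]
    cmp eax, [edi+ecx]      ; the most significant DWORD that differs decides
    ja greater
    jb lessthan
    sub ecx, 4
    jns nextdword           ; offsets 12, 8, 4, 0

    xor eax, eax            ; return 0 on equal
    ret

  lessthan:
    mov eax, -1             ; return -1 on less than
    ret

  greater:
    mov eax, 1              ; return 1 on greater
    ret

cmpoword endp
```

Like the QWORD version, this returns a value in EAX rather than a flag state, and it is unsigned only, so it does not replace the flag-returning algos discussed in the rest of the thread.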
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 21, 2013, 04:58:15 AM
Think you nailed it with the first one  :biggrin:

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1672485   cycles for Cmp128Dave
2633931   cycles for Cmp128Dave2
2066655   cycles for Cmp128Nidud
---------------------------------------------------------
1672149   cycles for Cmp128Dave
2621202   cycles for Cmp128Dave2
2065264   cycles for Cmp128Nidud
---------------------------------------------------------

Intel doesn't seem to like it  :icon_eek:
Quote
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)
---------------------------------------------------------
1723582 cycles for Cmp128Dave
7285714 cycles for Cmp128Dave2
2616409 cycles for Cmp128Nidud
---------------------------------------------------------
1746902 cycles for Cmp128Dave
7266284 cycles for Cmp128Dave2
2607623 cycles for Cmp128Nidud
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 06:13:55 AM
here is that one in macro form
i measure 11 cycles on my P4, which is pretty good

the values that are tested may not be an honest reflection of SAHF usage
i won't throw that other macro away, just yet   :P

Code: [Select]
Cmp128Dave MACRO OwA:REQ,OwB:REQ

;OwA and OwB are pointers to memory operands
;Example: Cmp128Dave offset Oword1,offset Oword2

    LOCAL   c1,c2,c3,c4

    mov     eax,dword ptr OwA[0]
    cmp     eax,dword ptr OwB[0]
    jnz     c1

    mov     eax,dword ptr OwA[4]
    cmp     eax,dword ptr OwB[4]
    jnz     c2

    mov     eax,dword ptr OwA[8]
    cmp     eax,dword ptr OwB[8]
    jnz     c3

    mov     eax,dword ptr OwA[12]
    cmp     eax,dword ptr OwB[12]
    jmp short c4

c1: mov     eax,dword ptr OwA[4]
    sbb     eax,dword ptr OwB[4]

c2: mov     eax,dword ptr OwA[8]
    sbb     eax,dword ptr OwB[8]

c3: mov     eax,dword ptr OwA[12]
    sbb     eax,dword ptr OwB[12]
    jnz     c4

    lahf
    lea     eax,[eax-4000h]
    sahf

c4:

ENDM
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 06:16:10 AM
not sure what the scaling factor is for cycles, but here's your last test   :P
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5663779 cycles for Cmp128Dave
8939593 cycles for Cmp128Dave2
8382453 cycles for Cmp128Nidud
---------------------------------------------------------
6032710 cycles for Cmp128Dave
9247293 cycles for Cmp128Dave2
7744632 cycles for Cmp128Nidud
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 06:46:07 AM
wow - looking at the values again, it would seem that they all take the SAHF path
i am surprised that code does so well   :redface:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 21, 2013, 07:09:22 AM
This is the old test used to compare 16 byte strings using xmm (Frank)

So, maybe the data should be align 16..
Code: [Select]
TestVal dd 0,0,0,0,0
...
Code: [Select]
align 16
TestVal dd 0,?,?,?
dd 0,0,0,0
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 21, 2013, 07:17:18 AM
Quote
SAHF - Store AH Register into FLAGS

   Usage:  SAHF
   Modifies flags: AF CF PF SF ZF

   Transfers bits 0-7 of AH into the Flags Register.  This includes
   AF, CF, PF, SF and ZF.

                           Clocks                   Size
        Operands     808x   286   386   486        Bytes

        none          4      2     3     2           1
jxx is 3
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 07:36:16 AM
ahhh - good point on the validation test data alignment
we can pad that with "empty" dwords to make it align
Alex's code doesn't use a control value, so that's another way to go

as for that timing chart.....

yes - it was a fast instruction in days of old
however, many instructions that explicitly manipulate flags seem to run slow under NT
CMC, STC, CLC are exceptions to that rule - they are ok

but CLD, STD, POPFD seem to be really slow under NT
i figured SAHF would be also
i think it's related to the protected OS thing
it has to verify that the flag change is allowed with current privileges before continuing
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 21, 2013, 07:48:44 AM
I've tinkered with another one, it's fast but it doesn't pass all tests :(
@Alex: Could you modify CheckIt so that it produces less output? Such as: #tests failed?

ocjj=0
oqDeb=0
OcmpJJ MACRO ow0, ow1
LOCAL z0, z1
  ocjj=ocjj+1
  z0 CATSTR <ocJJ0>, %ocjj
  z1 CATSTR <ocJJ1>, %ocjj
   mov eax, dword ptr ow0[12]
   cmp eax, dword ptr ow1[12]
   jne z0   ; no test byte ptr
   mov eax, dword ptr ow0[8]
   mov edx, dword ptr ow1[8]
   cmp eax, edx
   jne z1
   mov eax, dword ptr ow0[4]
   mov edx, dword ptr ow1[4]
   cmp eax, edx
   jne z1
   mov eax, dword ptr ow0[0]
   mov edx, dword ptr ow1[0]
z1:
   test byte ptr ow1[15], 80h
   usedeb=01
   .if Sign?
      cmp eax, edx
      .if ! (Carry? && Sign?)
         xchg eax, edx      ; qSmallN
      .endif
   .endif
   cmp eax, edx
z0:
  if oqDeb
  .if Zero?
   print "&ow0", " equals  &ow1", 13, 10
  .elseif !Sign?
   print "&ow0", " greater &ow1", 13, 10
  .else
   print "&ow0", " lesser &ow1", 13, 10
  .endif
   print chr$(13, 10)
  endif
ENDM


Good night from Europe ;-)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 21, 2013, 06:34:32 PM
I updated the validation test data with 16 byte alignment
and made some "necessary" changes to the "mess at the end"  :P

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1692304   cycles for Cmp128Dave
2661742   cycles for Cmp128Dave2
1572492   cycles for Cmp128Nidud
---------------------------------------------------------
1698327   cycles for Cmp128Dave
2688844   cycles for Cmp128Dave2
1565310   cycles for Cmp128Nidud
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 07:30:50 PM
test_correct.zip:

Ocmp2 passed all tests.

AxCMP128bit too! :biggrin:

@Alex: Could you modify CheckIt so that it produces less output? Such as: #tests failed?

Here it is. It now prints the offsets of the numbers rather than the numbers themselves, so with the binary at hand you know where to look, and the output is a lot smaller. At the same place an int 3 is executed if the program is running under a debugger, followed by a jump that repeats the failed test, so you can trace it or step over the jump (pun intended).
Is this suitable?

Also simplified Etalone a bit - more straightforward now.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 21, 2013, 09:05:05 PM
Dave, your checking method requires the flags returned by the compare of the control DWORD and the flags returned by the OWORD compare to match exactly. But that is not quite the right criterion, because SF and OF can be laid out differently and still be correct: according to the documentation, the signed jumps only check whether OF and SF are equal or not, and nothing says which layout a compare must produce (I think this may even be hardware-dependent). For example, if one compare returns OF=1 and SF=0 and another returns OF=0 and SF=1, the two results are equivalent, because JL/JLE (and their aliases JNGE/JNG) will jump in both cases.

My checking code is aware of this, but yours is not; that's why my comparing code doesn't pass your check. But it works, and works properly, because the exact SF/OF layout is not fixed by the standard.

You could do the same in your test. After this:

    and     ebx,8C1h
    and     ebp,8C1h          ;OF SF ZF CF only

make the check this way:

   xor ebp,ebx
   .if ebp!=0 && ebp!=100010000000Y ; if OF and SF were "swapped" then XOR will make both bits set



An update: checking is stricter, and I added Jochen's new experimental algo from the post above.

Jochen, I was busy with the testbed, but the idea of that algo looks interesting.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 09:19:08 PM
Alex
if you find a mismatch in any of those 4 flags, i can find a corresponding branch that will not work properly
conversely, if the 4 flags match, all the conditional branches will work correctly (except parity)

there is also the parity flag, but we haven't tested for it because it's so rarely used
the parity flag only applies to the low-order byte   ::)
that actually makes it somewhat useless except for serial comm
i could add that to my test very easily, but all our OWORD algos would probably fail - lol

in addition to all those flags, there is an auxiliary carry flag
however, it is not used for any branches
it's really more or less a "CPU internal" flag

Code: [Select]
****************************************************************
Equality Branches (Used for Signed or Unsigned Comparisons)
----------------------------------------------------------------
Instruction  Description               Condition       Aliases
----------------------------------------------------------------
JZ           Jump if equal             ZF=1            JE
JNZ          Jump if not equal         ZF=0            JNE
****************************************************************

Code: [Select]
****************************************************************
Unsigned Branches
----------------------------------------------------------------
Instruction  Description               Condition       Aliases
----------------------------------------------------------------
JA           Jump if above             CF=0 and ZF=0   JNBE
JAE          Jump if above or equal    CF=0            JNC JNB
JB           Jump if below             CF=1            JC JNAE
JBE          Jump if below or equal    CF=1 or ZF=1    JNA
****************************************************************

Code: [Select]
****************************************************************
Signed Branches
----------------------------------------------------------------
Instruction  Description               Condition       Aliases
----------------------------------------------------------------
JG           Jump if greater           SF=OF and ZF=0  JNLE
JGE          Jump if greater or equal  SF=OF           JNL
JL           Jump if less              SF<>OF          JNGE
JLE          Jump if less or equal     SF<>OF or ZF=1  JNG
JO           Jump if overflow          OF=1
JNO          Jump if no overflow       OF=0
JS           Jump if sign              SF=1
JNS          Jump if no sign           SF=0
****************************************************************

you are right, though - the code could use XOR
Code: [Select]
    mov eax,ebx
    xor eax,ebp
    test eax,8C1h
    jnz fail
i don't really see the advantage, though
also, by keeping the flags in EBX and EBP intact, i can use them for the failure report   :P
Code: [Select]
    xor ebx,ebp
that would destroy one set of flags for the report
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 09:35:42 PM
cmp128tm
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5534615 cycles for Cmp128Dave
9100335 cycles for Cmp128Dave2
5937448 cycles for Cmp128Nidud
---------------------------------------------------------
5646211 cycles for Cmp128Dave
9094478 cycles for Cmp128Dave2
5587906 cycles for Cmp128Nidud
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 21, 2013, 09:45:07 PM
HI Dave,

   A nit to pick.

Code: [Select]
JS           Jump if sign              SF=1
JNS          Jump if no sign           SF=1

Should be;

Code: [Select]
JS           Jump if sign              SF=1
JNS          Jump if no sign           SF=0

   As an aside, this thread has me looking at my fixed point arithmetic
program again.  Such a comparison might be useful.

Cheers,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 21, 2013, 09:50:31 PM
thanks Steve   :t

corrected   :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 22, 2013, 12:53:50 AM
Alex
if you find a mismatch in any of those 4 flags, i can find a corresponding branch that will not work properly
conversely, if the 4 flags match, all the conditional branches will work correctly (except parity)

there is also the parity flag, but we haven't tested for it because it's so rarely used
the parity flag only applies to the low-order byte   ::)
that actually makes it somewhat useless except for serial comm
i could add that to my test very easily, but all our OWORD algos would probably fail - lol

in addition to all those flags, there is an auxiliary carry flag
however, it is not used for any branches
it's really more or less a "CPU internal" flag

Code: [Select]
****************************************************************
Equality Branches (Used for Signed or Unsigned Comparisons)
----------------------------------------------------------------
Instruction  Description               Condition       Aliases
----------------------------------------------------------------
JZ           Jump if equal             ZF=1            JE
JNZ          Jump if not equal         ZF=0            JNE
****************************************************************

Code: [Select]
****************************************************************
Unsigned Branches
----------------------------------------------------------------
Instruction  Description               Condition       Aliases
----------------------------------------------------------------
JA           Jump if above             CF=0 and ZF=0   JNBE
JAE          Jump if above or equal    CF=0            JNC JNB
JB           Jump if below             CF=1            JC JNAE
JBE          Jump if below or equal    CF=1 or ZF=1    JNA
****************************************************************

Code: [Select]
****************************************************************
Signed Branches
----------------------------------------------------------------
Instruction  Description               Condition       Aliases
----------------------------------------------------------------
JG           Jump if greater           SF=OF and ZF=0  JNLE
JGE          Jump if greater or equal  SF=OF           JNL
JL           Jump if less              SF<>OF          JNGE
JLE          Jump if less or equal     SF<>OF or ZF=1  JNG
JO           Jump if overflow          OF=1
JNO          Jump if no overflow       OF=0
JS           Jump if sign              SF=1
JNS          Jump if no sign           SF=0
****************************************************************

you are right, though - the code could use XOR
Code: [Select]
    mov eax,ebx
    xor eax,ebp
    test eax,8C1h
    jnz fail
i don't really see the advantage, though
also, by keeping the flags in EBX and EBP intact, i can use them for the failure report   :P
Code: [Select]
    xor ebx,ebp
that would destroy one set of flags for the report


Dave, there's no need for huge tables. Just look at the logic.

The signed jumps check SF and OF to decide whether a number is greater or less (I did not mention ZF because I was talking about SF and OF; it was implied, and I thought I should not have to make millions of reservations). And if you look at the tables you posted, you'll find that for the signed jumps only one condition matters as far as SF and OF are concerned (we are talking about them now, not about ZF): the mutual state of the SF and OF flags. There is no "agreement" on HOW exactly these flags should be set.

If number 1 is greater than number 2, then OF and SF must be equal to each other, i.e.:

OF = 1, SF = 1 : JG will jump, JB will NOT jump, JGE will jump, JBE will NOT jump
OF = 0, SF = 0 : JG the same as above


If number 1 is less than number 2, then OF and SF must NOT be equal to each other, i.e.:

OF = 0, SF = 1 : JG will NOT jump, JB will jump, JGE will NOT jump, JBE will jump
OF = 1, SF = 0 : JG the same as above


You don't get what I'm trying to suggest: not to ignore ZF and CF, but to make the code aware that SF and OF may have more than one valid set of states, while still following the docs and the CPU design.

The way I suggested works like this: if the XOR of the two flag sets is zero, then they are equal and the test passed. But if it is not zero, you should check whether the mutual state of the OF and SF flags is the same in both flag sets. I.e., if in one flag set OF was 1 and SF was 0, then in the second flag set it may be OF=0 and SF=1, and this is still the RIGHT RESULT, because that is the CPU spec.

OF=1, SF=0  ==  OF=0, SF=1

OF=1, SF=1  ==  OF=0, SF=0

This is the spec; you may read it again and check it in any way you want. The CPU will, for instance, take JG for SF=1 + OF=1 exactly as for SF=0 + OF=0.


So, if two flag sets have the same mutual state of OF and SF, and the flags themselves are equal, for example SF=1 and OF=0 in both flag sets, then after the XOR you'll get zero (if all other flags are equal); if the state of OF and SF is mutually equal, but swapped, for example SF=0 and OF=1, then after the XOR you'll get both bits set (and the other flag bits clear, if they were equal). So checking, after the XOR, for zero or for both SF and OF being set to 1 is the proper way to go to emulate the CPU's behaviour.


    xor ebp,ebx
    .if ebp!=0 && ebp!=100010000000Y


This code meets those requirements.
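
As a rough cross-check of the reasoning above, here is a minimal Python sketch. It is not part of the thread's MASM sources, and the names `COMPARE_JUMPS`, `FLAG_JUMPS` and `jumps_agree` are made up for illustration. It walks through all 16 combinations of CF, ZF, SF and OF and checks which jumps behave identically when SF and OF are both inverted, using the documented Jcc conditions.

```python
from itertools import product

# Documented Jcc conditions; every flag is 0 or 1.
COMPARE_JUMPS = {
    "JE":  lambda cf, zf, sf, of: zf == 1,
    "JA":  lambda cf, zf, sf, of: cf == 0 and zf == 0,
    "JAE": lambda cf, zf, sf, of: cf == 0,
    "JB":  lambda cf, zf, sf, of: cf == 1,
    "JBE": lambda cf, zf, sf, of: cf == 1 or zf == 1,
    "JG":  lambda cf, zf, sf, of: zf == 0 and sf == of,
    "JGE": lambda cf, zf, sf, of: sf == of,
    "JL":  lambda cf, zf, sf, of: sf != of,
    "JLE": lambda cf, zf, sf, of: zf == 1 or sf != of,
}

FLAG_JUMPS = {
    "JS":  lambda cf, zf, sf, of: sf == 1,
    "JNS": lambda cf, zf, sf, of: sf == 0,
    "JO":  lambda cf, zf, sf, of: of == 1,
    "JNO": lambda cf, zf, sf, of: of == 0,
}


def jumps_agree(jumps):
    """True if every jump in `jumps` gives the same answer for all flag
    combinations after SF and OF are both inverted."""
    for cf, zf, sf, of in product((0, 1), repeat=4):
        for cond in jumps.values():
            if cond(cf, zf, sf, of) != cond(cf, zf, sf ^ 1, of ^ 1):
                return False
    return True


print("compare jumps unaffected:", jumps_agree(COMPARE_JUMPS))
print("JS/JNS/JO/JNO unaffected:", jumps_agree(FLAG_JUMPS))
```

The first check holds and the second does not: the value-comparison jumps only see whether SF equals OF, while JS/JNS/JO/JNO look at each flag on its own.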


If you still don't agree, then please choose any two numbers you want from the dump of "failed" comparisons for my code, and we will make a test: compare them, and then execute every possible conditional jump for number comparison. You will not find such numbers, because the CPU's behaviour does not depend on how your checking method assumes it behaves.

Besides this, Jochen's idea, which is used in his algo and in my algo, is just perfect. It's indisputable.

Quote
i could add that to my test very easily, but all our OWORD algos would probably fail - lol

We are making comparison code, not a full CPU emulation :lol:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 22, 2013, 03:01:01 AM
Quote
OF = 1, SF = 1 : JG will jump, JB will NOT jump, JGE will jump, JBE will NOT jump

I mistyped here, I meant JLE instead of JBE and JL instead of JB ::) (well, this is not a usual "typo" but rather hurry + tiredness from this dispute)

Quote
if the state of OF and SF is mutually equal, but swapped, for example SF=0 and OF=1

In full form it should be "... for example SF=0 and OF=1 in one flags set, and SF=1 and OF=0 in the other flags set ..."




Well, as an example, I inserted my algo into your testing code and grabbed the very first "failed" pair of test numbers:

cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY



After your code compares the "checking DWORD", it has the flags CF=1, ZF=0, SF=0, OF=0.
After my comparison code runs for these numbers, it returns the flags CF=1, ZF=0, SF=1, OF=1.

This is a proper result. CF and ZF are the same, and SF=OF.

Moreover, if you trace the code, you'll find that your code first checks the "checking DWORD" by loading the first DWORD into EAX (it is zero) and then comparing it with 80000100. And there the SF and OF flags are 0.
My code loads the high-order DWORD of the first OWORD (it's zero), then compares it with the highest-order DWORD of the second OWORD, which is 80000000. Right after this it exits from the algo. And the SF and OF flags are 1.

The internal CPU logic decided to set SF and OF to 0 when the second number was 80000100, and to set SF and OF to 1 when the second number was 80000000. We cannot say why it does that, since we don't know the CPU's exact circuit, but it doesn't matter anyway: they defined that only the mutual state of the SF and OF flags is important, not HOW exactly they should be set. In this case they may both be set to 1 or to 0, and both cases will be right.



Quote
Besides this, Jochen's idea, which is used in his algo and in my algo, is just perfect. It's indisputable.

Probably one may say that my checking algo for the correctness of comparing two numbers is perfect :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 22, 2013, 04:37:16 AM
Alex,
I asked:
Quote
maybe I'm missing something here, but will not this also work:
You say:
Quote
The problem is that we need to fully "emulate" CMP behaviour, thus the construct should set all flags as CMP does.
This means:
Code: [Select]
and ebx,8C5h
and ebp,8C5h   ;OF SF ZF PF CF only

JP Jump if Parity PF=1
JPE Jump if Parity Even PF=1
JPO Jump if Parity Odd PF=0
JNP Jump if No Parity PF=0

However, it appears that none of us is capable of doing that  :lol:
Dave's test includes these signed jumps, but skips the PF test
Code: [Select]
JNO Jump if Not Overflow (signed) OF=0
JNS Jump if Not Signed (signed) SF=0
JO Jump if Overflow (signed) OF=1
JS Jump if Signed (signed) SF=1

We have already compromised the quest to fully "emulate" CMP,
so the Equal | Above/Greater | Below/Lower seems reasonable to me

Code: [Select]
    and     ebx,8C1h
    and     ebp,8C1h   ;OF SF ZF CF only
    mov     eax,ebp
    xor     eax,ebx
    .if     eax && eax!=100010000000B
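
For reference, the constants in the snippet above follow from the EFLAGS bit positions (CF = bit 0, PF = bit 2, ZF = bit 6, SF = bit 7, OF = bit 11). The Python sketch below is only a model of the masked XOR test; the helper name `flags_equivalent` is hypothetical and this is not code from the test program.

```python
# EFLAGS bit positions
CF = 1 << 0
PF = 1 << 2
ZF = 1 << 6
SF = 1 << 7
OF = 1 << 11

MASK = OF | SF | ZF | CF        # 8C1h: the four flags under test
MASK_WITH_PF = MASK | PF        # 8C5h: the same plus parity
SWAP = OF | SF                  # 880h = 100010000000b


def flags_equivalent(a, b):
    """Relaxed check: the masked flag sets are identical, or they differ
    in exactly OF and SF (both inverted), which keeps SF == OF or SF != OF."""
    diff = (a ^ b) & MASK
    return diff == 0 or diff == SWAP


# "OV NG NZ CY" against "NV PL NZ CY": accepted by the relaxed check
print(flags_equivalent(OF | SF | CF, CF))
# only SF differs: rejected
print(flags_equivalent(SF | CF, CF))
```

Since the mask leaves PF out, two flag sets that differ only in parity also count as equivalent in this model.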
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 05:03:58 AM
JP/JPO and JNP/JPE are practically never used
and in the rare cases when they are used, they are operating on byte data, not OWORD's

JO and JNO are used reasonably often in signed math, so are JS and JNS
if those flags have to be correct, i don't see what the argument is about - lol

you have to emulate the exact behaviour in OF, SF, ZF, and CF

Alex is missing an important point
it's not enough that SF=OF or SF<>OF
because YOU DON'T KNOW WHICH Jxx INSTRUCTION WILL BE USED
you have to set the flags so that any of the ones listed above will work
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 22, 2013, 06:16:02 AM
 :biggrin:

you know (at least most of the time) what you're comparing
if you don't need OF and SF you may simplify the test  :P

Quote
JO and JNO are used reasonably often
as in one of the Cmp128 macros  :lol:

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 06:16:45 AM
i updated my version of the validation test program   :P

compare data is 16 aligned
no more control dword's - instead, i use a known-good routine to set the flags
at the end, it reports the fail count

EDIT: i also simplified usage:
Code: [Select]
_main   PROC

    INVOKE  AllTests,C128nidud

    inkey
    exit

_main   ENDP
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 22, 2013, 07:36:56 AM
Ok, here is a new one that passes the xor test  :P

Code: [Select]
Cmp128NidudSEE macro A:REQ,B:REQ
movups xmm0,A[0]
movups xmm1,B[0]
movups xmm2,xmm0 ; save dest
pcmpgtb xmm0,xmm1
pmovmskb eax,xmm0
pcmpgtb xmm1,xmm2
pmovmskb ecx,xmm1
mov edx,[esi+12]
and edx,80000000h
or eax,edx
mov edx,[edi+12]
and edx,80000000h
or ecx,edx
cmp eax,ecx
endm

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1757533   cycles for Cmp128Dave
2690764   cycles for Cmp128Dave2
1564372   cycles for Cmp128Nidud
1062856   cycles for Cmp128NidudSEE
---------------------------------------------------------
1752965   cycles for Cmp128Dave
2660209   cycles for Cmp128Dave2
1622358   cycles for Cmp128Nidud
1064902   cycles for Cmp128NidudSEE
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 07:41:47 AM
it doesn't seem to mean much that it passed the "xor test"   :badgrin:

Code: [Select]
mov edx,[esi+12]
and edx,80000000h
or eax,edx
mov edx,[edi+12]

ESI and EDI are never preserved or initialized - oops
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 07:46:31 AM
i set it up in my validation code and got 800 failures out of 3160 tests
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 22, 2013, 08:30:17 AM
Quote
it doesn't seem to mean much that it passed the "xor test"   :badgrin:
but this is much more fun  :lol:

Quote
ESI and EDI are never preserved or initialized - oops
work in progress
args to the macro are ESI and EDI, so that works  :P

here is some xor fun  :biggrin:
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1749181   cycles for Cmp128Dave
2669723   cycles for Cmp128Dave2
1578032   cycles for Cmp128Nidud
1061967   cycles for Cmp128NidudSEE (xor)
965562   cycles for Cmp128Axel (xor)
625467   cycles for Cmp128DaveU (unsigned)
802335   cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
1755748   cycles for Cmp128Dave
2647837   cycles for Cmp128Dave2
1565729   cycles for Cmp128Nidud
1063489   cycles for Cmp128NidudSEE (xor)
966057   cycles for Cmp128Axel (xor)
633028   cycles for Cmp128DaveU (unsigned)
770824   cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 09:55:12 AM
Cmp128tm2
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
6121006 cycles for Cmp128Dave
9888967 cycles for Cmp128Dave2
6120274 cycles for Cmp128Nidud
3453129 cycles for Cmp128NidudSEE (xor)
1127987 cycles for Cmp128Axel (xor)
1121311 cycles for Cmp128DaveU (unsigned)
841821  cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
6297867 cycles for Cmp128Dave
9968284 cycles for Cmp128Dave2
6103287 cycles for Cmp128Nidud
3369442 cycles for Cmp128NidudSEE (xor)
1309918 cycles for Cmp128Axel (xor)
1181227 cycles for Cmp128DaveU (unsigned)
959974  cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 12:10:23 PM
eliminated most of the repeat tests (doh)
now, 1600 tests are performed
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 22, 2013, 02:57:35 PM
Quote
i set it up in my validation code and got 800 failures out of 3160 tests

That is around 25% of the values, which means that 100% of the OF/SF flags are flipped. The CMPGT test is signed, and the sign from the value is added to the DWORD tested, which is needed for the unsigned compare.

The DWORD is 25% of the OWORD, so it's possible to shave the errors by 75% by populating the upper 75% with the high DWORD minus 25%.

This means that the 16 bits from the test must be reduced to 4 bits, which is 25% of 16.

Code: [Select]
movups xmm0,A[0]
movups xmm1,B[0]
movups xmm2,xmm0 ; save dest
pcmpgtd xmm0,xmm1 ; cmp(A,B)
movmskps eax,xmm0 ; set 4 bit to AX
pcmpgtd xmm1,xmm2 ; cmp(B,A)
movmskps ecx,xmm1 ; set 4 bit to CX
mov edx,A[12]
mov dl,al ; dl is 25% of edx
mov eax,B[12]
mov al,cl
sub edx,eax

In theory this version should then have 200 failures which is 25% of 800

However, this is not an ideal solution, since the OF/SF functions will now fail 25% of the time, compared to 100% before, which was more predictable since you could then use JO instead of JNO.

I think the solution is to use the first version, and find a clever way to flip the flags, maybe something like this:
Code: [Select]
cmp edx,eax
jxx @end
not edx
not eax
sub edx,eax

I will do some testing  :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 22, 2013, 04:36:58 PM
Quote
Alex is missing an important point
it's not enough that SF=OF or SF<>OF
because YOU DON'T KNOW WHICH Jxx INSTRUCTION WILL BE USED
you have to set the flags so that any of the ones listed above will work

No, it is you who is missing the point.

My comparison code will work for every kind of signed/unsigned number-comparison jump. My code sets CF and ZF properly, but it may set SF and OF differently from what YOUR CHECKING METHOD EXPECTS. But with the CPU IT WILL WORK PROPERLY, because the CPU IS NOT YOUR CHECKING CODE.

Your checking code is incomplete, and thus it is the checking code itself that fails. You may argue about this further, but it is a senseless dispute, since you totally don't get what I'm trying to say.

You may try to find any numbers for which my code followed by JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE fails, but you will not find such numbers.

My code is working, and working properly. You may even bet $100000000000 on the contrary, but you will never win.
My checking code is the only code in the thread that properly checks the flags returned for a number comparison followed by JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE.

Specially for those who still don't get my point from the previous posts, where I omitted the CF and ZF flags (whose state was assumed to be equal in both flag sets).
These flag sets:

CF=1, ZF=0, OF=1, SF=1 ARE THE SAME AS CF=1, ZF=0, OF=0, SF=0

CF=1, ZF=0, OF=0, SF=1 ARE THE SAME AS CF=1, ZF=0, OF=1, SF=0

And these:

CF=0, ZF=1, OF=1, SF=1 ARE THE SAME AS CF=0, ZF=1, OF=0, SF=0

CF=1, ZF=0, OF=0, SF=1 ARE THE SAME AS CF=1, ZF=0, OF=1, SF=0

And so on. Pay special attention to the MUTUAL STATE (I have said these words a lot in this thread :greensml:) of the OF and SF flags in these tables, then read the docs, then look at the table again, and then read the docs.

This is the CPU's behaviour. But your code isn't aware of it, though it only takes two instructions to make the code aware, and I showed those instructions.



And then, if that's still not enough, well, try to write code which uses my code to compare any numbers you choose, then executes each of these instructions: JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE, and fails. This is a challenge, yeah. One may decline this challenge with words like "it's a waste of time, I'm not obliged to do that", of course. But that still will not prove one's point.


I proved my point in this thread theoretically and practically; if the descriptions "are not enough" for my opponents, that's their own problem. When opponents do not want to understand or check what was said to them, it is not a proper dispute. It's not much work to grab any "failed" numbers, compare them with my code, and then try each of the only 9 conditional jumps we are all writing working code for.

No one who reads the documentation will be able to prove the contrary theoretically. Anyone may try in practice to find numbers for which my code followed by JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE fails, but no one will find such numbers.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 22, 2013, 07:33:19 PM
my friend Alex   :biggrin:

rather than sorting all that out, let's look at some cases that fail one test, but pass the other

here is one that passes your test, but fails mine

cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY

if we subtracted 00000000_00000000_00000000_00000000 - 80000000_00000000_00000000_00000001
the result would be 7FFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF (sign = 0)

you can make a simple test:
Code: [Select]
mov eax,0
cmp eax,80000001h
js is_neg
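
The arithmetic above can be checked with a small model of CMP at an arbitrary operand width. This is a Python sketch, assuming the usual definition of the flags for a subtraction; the function name `cmp_flags` is invented here. It reproduces the full 128-bit result and also shows what a 32-bit CMP of only the two high dwords reports.

```python
def cmp_flags(a, b, bits):
    """Model `cmp a, b` for operands that are `bits` wide.
    Returns (a - b modulo 2**bits, dict of CF/ZF/SF/OF)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    diff = (a - b) & mask
    flags = {
        "CF": int(a < b),                 # unsigned borrow
        "ZF": int(diff == 0),
        "SF": int(bool(diff & sign)),     # top bit of the result
        # signed overflow: operand signs differ and the result's sign
        # differs from the sign of the first operand
        "OF": int(bool((a ^ b) & (a ^ diff) & sign)),
    }
    return diff, flags


A = 0
B = 0x80000000_00000000_00000000_00000001

full_diff, full = cmp_flags(A, B, 128)       # true 128-bit compare
_, high = cmp_flags(A >> 96, B >> 96, 32)    # high dwords only: 0 vs 80000000h
_, simple = cmp_flags(0, 0x80000001, 32)     # the simple test above

print(hex(full_diff), full)
print(high)
print(simple)
```

Under these assumptions the full-width compare yields NV PL NZ CY, while a compare that stops at the high dwords yields OV NG NZ CY. In both cases SF equals OF, so only JS/JNS/JO/JNO can tell them apart.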

if i execute a JG, JGE, JL, or JLE instruction after comparing these, your code will work as it should
however, if i execute a JS, JNS, JO, or JNO instruction, it will not

i ran your code through my test and got 48 failures
attached is the text file.....
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 22, 2013, 09:29:51 PM
If you guys are working on 4 dwords, why not work on 4 bytes?
The logic is the same and we can easily test for errors. Forget speed, get the basics going.

Once you figure it out it should be easy to do n-bit comparisons.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 22, 2013, 10:01:34 PM
Quote
if i execute a JG, JGE, JL, or JLE instruction after comparing these, your code will work as it should
however, if i execute a JS, JNS, JO, or JNO instruction, it will not

Dave, but the point was to make comparison code, i.e. just to compare numbers: which one is greater, less, above, below, or whether they are equal. There was nothing about the J(N)S and J(N)O jumps, only the compare jumps. Thus all my posts were based on this "value compare" point only.

The target was to make code which works properly for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE, but not J(N)O, and this was probably stated in the first pages of the thread. In my thread about "128bit CMP" in the Lab I made this point as well, and said that JS will not work properly either, but one may use test byte ptr [oword+15],80h and then JS (code about ~15 bytes long and 2 instructions) instead of using such bloated algos as we are writing. It's a question of usability, I get that.
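
The `test byte ptr [oword+15],80h` trick relies on the little-endian layout: byte 15 of the OWORD holds the most significant bits, so its top bit is the sign. A minimal Python sketch of the same check, with a made-up helper name:

```python
def is_negative_oword(value):
    """Sign of a 128-bit value, read the way `test byte ptr [oword+15],80h`
    reads it: the top bit of the last byte in little-endian memory order."""
    raw = (value & ((1 << 128) - 1)).to_bytes(16, "little")
    return bool(raw[15] & 0x80)


print(is_negative_oword(0x80000000_00000000_00000000_00000001))
print(is_negative_oword(0x7FFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF))
```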

So all my posts were based on the point that we are all working on algos which satisfy these jumps: JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE. My checking algo is based on this, and my comparison algo, too.

Since it's very unlikely that anyone would want to compare two 128-bit numbers and then execute a J(N)O jump.

Here is one more variation, partially based on Jochen's experimental algo posted on page 10.

It passes my check and doesn't pass Dave's check, but anyone can see that the logic is so simple that it just cannot fail for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE. So we were probably working on checking algos with different specifications of the target :lol:

Code: [Select]
option prologue:none
option epilogue:none
AxCMP128bitProc2 proc n1,n2
mov eax,[esp+4]
mov edx,[esp+8]
xor ecx,ecx
mov [esp+4],esi
mov [esp+8],edi


mov esi,[eax+12]
test dword ptr [eax+12],80000000h
mov edi,[eax+8]

sets cl

cmp esi,[edx+12]
jnz @l0

cmp edi,[edx+8]
mov esi,[eax+4]
jnz @l1

cmp esi,[edx+4]
mov edi,[eax]
jnz @l1

cmp edi,[edx]
jnz @l1

mov esi,[esp+4]
mov edi,[esp+8]
ret 8

@l1:

ja @F
mov esi,[predata1+ecx*8]
cmp esi,[predata1+4+ecx*8]
mov esi,[esp+4]
mov edi,[esp+8]
ret 8

@@:
mov esi,[predata2+ecx*8]
cmp esi,[predata2+4+ecx*8]

@l0:
mov esi,[esp+4]
mov edi,[esp+8]
ret 8

align 4
predata1 label dword
dd 1,2,-2,-1

predata2 label dword
dd 2,1,-1,-2

AxCMP128bitProc2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Though this code is slower on my machine than my macro code based on Jochen's first idea.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 22, 2013, 10:04:37 PM
And, yes, I'm glad that you understand me and agree that it will work for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE, and that we sorted out why we were unable to understand each other's points.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 22, 2013, 10:12:13 PM
I think qWord's code was probably based on the comparison-only idea, too.


So, in short, for your checking code: if, after testing my code, you see that the state of OF and SF is just reversed relative to the "should be", then it's OK, because it's "by design" :biggrin:

I.e., here in your test if you see something like

was: OV NG NZ CY should be: NV PL NZ CY


OF=1, SF=1 where it should be OF=0, SF=0, with ZF and CF the same: that's OK for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE. It's not OK for JS or JO, but I did not write the comparison code to support JS and JO.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 22, 2013, 11:36:26 PM
this version can be used for JO and JS
Code: [Select]
movups xmm0,A[0]
movups xmm1,B[0]
movups xmm2,xmm0 ; save dest
pcmpgtd xmm0,xmm1 ; cmp(A,B)
movmskps eax,xmm0 ; set 4 bit to AX
pcmpgtd xmm1,xmm2 ; cmp(B,A)
movmskps ecx,xmm1 ; set 4 bit to CX
mov edx,A[12]
mov dl,al
mov eax,B[12]
mov al,cl
if 0
and edx,8000000Fh ; these two changes the result
and eax,8000000Fh ; from 25% to 100% failure (JS/JO)
endif
sub edx,eax
that is, in reverse order  :P

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1743759   cycles for Cmp128Dave
2669666   cycles for Cmp128Dave2
1560726   cycles for Cmp128Nidud
956658   cycles for Cmp128NidudSEE (xor)
783467   cycles for Cmp128Axel (xor)
769200   cycles for Cmp128DaveU (unsigned)
781580   cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
1752016   cycles for Cmp128Dave
2660982   cycles for Cmp128Dave2
1561043   cycles for Cmp128Nidud
957958   cycles for Cmp128NidudSEE (xor)
791439   cycles for Cmp128Axel (xor)
766533   cycles for Cmp128DaveU (unsigned)
782690   cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 12:33:12 AM
a simpler version
Code: [Select]
movups xmm0,A[0]
movups xmm1,B[0]
movups xmm2,xmm0 ; save dest
pcmpgtd xmm0,xmm1 ; cmp(A,B)
movmskps eax,xmm0 ; set 4 bit to AX
pcmpgtd xmm1,xmm2 ; cmp(B,A)
movmskps ecx,xmm1 ; set 4 bit to CX
mov ah,A[15]
mov ch,B[15]
sub ax,cx
and  another version
Code: [Select]
movups xmm0,A[0]
pcmpgtd xmm0,B[0]
movmskps eax,xmm0
movups xmm0,B[0]
pcmpgtd xmm0,A[0]
movmskps ecx,xmm0
mov ah,A[15]
mov ch,B[15]
sub ax,cx

Code: [Select]
1756809 cycles for Cmp128Dave
2634501 cycles for Cmp128Dave2
1558370 cycles for Cmp128Nidud
785653 cycles for Cmp128NidudSEE (xor)
966232 cycles for Cmp128Axel (xor)
618094 cycles for Cmp128DaveU (unsigned)
768168 cycles for Cmp128NidudU (unsigned)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 23, 2013, 12:37:25 AM
Quote
783467   cycles for Cmp128Axel (xor)

*Suspiciously looking at the proc name and then at the source*

LOL

The algo may not only reverse the state of the flags but also reverse the order of the letters in its name (if it was named after its author) :biggrin:
Very funny :t

results:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
5881815 cycles for Cmp128Dave
9914640 cycles for Cmp128Dave2
5894460 cycles for Cmp128Nidud
2936083 cycles for Cmp128NidudSEE (xor)
2444879 cycles for Cmp128NidudSEEU (unsigned)
994634  cycles for Cmp128Axel (xor)
941599  cycles for Cmp128DaveU (unsigned)
962220  cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
5905129 cycles for Cmp128Dave
9810117 cycles for Cmp128Dave2
5910750 cycles for Cmp128Nidud
2766487 cycles for Cmp128NidudSEE (xor)
2443439 cycles for Cmp128NidudSEEU (unsigned)
969855  cycles for Cmp128Axel (xor)
933817  cycles for Cmp128DaveU (unsigned)
858812  cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------

--- ok ---




Results for the attached archive - added Jochen's code and my earlier tweak of his code.
Also renamed some labels in the source; Axel did not object :biggrin:
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
5843868 cycles for Cmp128Dave
9540057 cycles for Cmp128Dave2
5861747 cycles for Cmp128Nidud
2684675 cycles for Cmp128NidudSEE (xor)
2440403 cycles for Cmp128NidudSEEU (unsigned)
946568  cycles for Cmp128Alex (xor)
950729  cycles for Cmp128DaveU (unsigned)
988245  cycles for Cmp128NidudU (unsigned)
3537702 cycles for JJAxCMP128bit (SSE)
5887886 cycles for Ocmp2 - JJ's (SSE)
---------------------------------------------------------
5847877 cycles for Cmp128Dave
9582836 cycles for Cmp128Dave2
5915661 cycles for Cmp128Nidud
2731934 cycles for Cmp128NidudSEE (xor)
2439047 cycles for Cmp128NidudSEEU (unsigned)
1040627 cycles for Cmp128Alex (xor)
956229  cycles for Cmp128DaveU (unsigned)
950693  cycles for Cmp128NidudU (unsigned)
3533400 cycles for JJAxCMP128bit (SSE)
5874002 cycles for Ocmp2 - JJ's (SSE)
---------------------------------------------------------

--- ok ---



nidud, I inserted your Cmp128NidudSEEU algo into my testbed; it doesn't pass the check for many numbers, but here is the first one:

00000000 00000000 00000000 00000000 and 00000000 80000000 40000001 00000000

It reports that the first number is above and greater than the second (CF=0, SF=0, OF=0, ZF=0), so JA/JG/JAE/JGE will jump.

SSE comparison instructions are signed-only; that's probably the reason (i.e., they treat the OWORD as just a set of signed DWORDs).
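
To illustrate the point about signed lanes, here is a small Python model of `pcmpgtd` followed by `movmskps`. It is a sketch under assumptions: the failing pair is taken to be printed highest dword first, the helper names are invented, and how Cmp128NidudSEEU actually combines the two masks is not shown in this post, so the model only prints the lane masks themselves.

```python
def to_signed32(x):
    """Reinterpret a dword as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def lanes(value):
    """Split a 128-bit integer into four dwords, lowest lane first."""
    return [(value >> (32 * i)) & 0xFFFFFFFF for i in range(4)]


def pcmpgtd_mask(a, b):
    """Model pcmpgtd + movmskps: bit i is set when lane i of `a` is
    greater than lane i of `b`, with both lanes read as SIGNED dwords."""
    mask = 0
    for i, (x, y) in enumerate(zip(lanes(a), lanes(b))):
        if to_signed32(x) > to_signed32(y):
            mask |= 1 << i
    return mask


A = 0
B = 0x00000000_80000000_40000001_00000000

gt_ab = pcmpgtd_mask(A, B)   # lane 2: 0 > 80000000h when read as signed
gt_ba = pcmpgtd_mask(B, A)   # lane 1: 40000001h > 0

print(f"{gt_ab:04b} {gt_ba:04b}  unsigned A < B: {A < B}")
```

Lane 2 of B is 80000000h, which is negative as a signed dword, so the "A greater than B" mask gets a higher bit than the "B greater than A" mask even though A is below B as an unsigned number. Any scheme that ranks the two masks directly would therefore report A as the larger value for this pair.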


Here is the checking code of my testbed. The testbed got too entangled and contains too much old/non-working code, so it's probably better to post the checker separately.

Code: [Select]

; EAX = BITS: ... CF ZF SF OF
FlagsToEAX MACRO
pushfd
xor eax,eax
pop edx
bt edx,0
rcl eax,1
bt edx,6
rcl eax,1
bt edx,7
rcl eax,1
bt edx,11
rcl eax,1
ENDM

Etalone MACRO ow0, ow1
LOCAL @l1, @l2, @l3, @l0, @l4

push ebx

mov eax,dword ptr [ow0+12]
mov edx,dword ptr [ow1+12]
cmp eax,edx
jnz @l1 ; just save flags

mov ecx,dword ptr [ow0+8]
mov ebx,dword ptr [ow1+8]
cmp ecx,ebx
jnz @l2

mov ecx,dword ptr [ow0+4]
mov ebx,dword ptr [ow1+4]
cmp ecx,ebx
jnz @l2

mov ecx,dword ptr [ow0]
mov ebx,dword ptr [ow1]
cmp ecx,ebx
jz @l1



@l2:
mov eax,0
ja @l0 ; if it's above - the number is bigger because this isn't MSD

;mov byte ptr [esp+3],1 ; CF set, below than (unsigned)

mov eax,1001Y ; CF and OF set, below than and less than

; setting of a sign flag has no meaning in this case, it's superfluous

jmp @l0


@l1:
FlagsToEAX

@l0:

if 0
push eax
mov ebx,eax
test ebx,1
jz @F
print "OF "
@@:
test ebx,2
jz @F
print "SF "
@@:
test ebx,4
jz @F
print "ZF "
@@:
test ebx,8
jz @F
print "CF "
@@:

print chr$(13,10)
pop eax
endif

pop ebx

ENDM


TestVal dd 0,0,0,0,0
        dd 1,1,0,0,0
        dd 100h,0,1,0,0
        dd 10000h,0,0,1,0
        dd 1000000h,0,0,0,1

        dd 40000000h,0,0,0,40000000h
        dd 40000001h,1,0,0,40000000h
        dd 40000100h,0,1,0,40000000h
        dd 40010000h,0,0,1,40000000h
        dd 41000000h,0,0,0,40000001h

        dd 80000000h,0,0,0,80000000h
        dd 80000001h,1,0,0,80000000h
        dd 80000100h,0,1,0,80000000h
        dd 80010000h,0,0,1,80000000h
        dd 81000000h,0,0,0,80000001h

        dd 0C0000000h,0,0,0,0C0000000h
        dd 0C0000001h,1,0,0,0C0000000h
        dd 0C0000100h,0,1,0,0C0000000h
        dd 0C0010000h,0,0,1,0C0000000h
        dd 0C1000000h,0,0,0,0C0000001h

        dd 3FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 3FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 3FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
        dd 3FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,3FFFFFFFh
        dd 3EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFEh

        dd 7FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 7FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 7FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
        dd 7FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,7FFFFFFFh
        dd 7EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFEh

        dd 0BFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0BFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0BFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
        dd 0BFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0BFFFFFFFh
        dd 0BEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFEh

        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh
        dd 0FEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh
       
        dd 0FFFF0001H,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        ArraySize EQU $-TestVal
        ;dd 123456h
       
       
CheckIt MACRO howToInvoke:REQ
CheckIt2 <howToInvoke>, 0, 0
CheckIt2 <howToInvoke>, 4, 0
CheckIt2 <howToInvoke>, 0, 4
CheckIt2 <howToInvoke>, 4, 4
ENDM

CheckIt2 MACRO howToInvoke, offst1:=<0>, offst2:=<0>
LOCAL @l3, @l2, @l1
;########################## CHECK

push esi
push edi
push ebx

print "#######################################################",13,10
print "Testing algo: ",@CatStr(<!">,<howToInvoke>,<!">)," offst1: ",@CatStr(<!">,<offst1>,<!">)," offst2: ",@CatStr(<!">,<offst2>,<!">),13,10
mov esi,offset TestVal+offst1

@l3:
mov ebx,ArraySize/16
mov edi,offset TestVal+offst1

@l2:
howToInvoke
FlagsToEAX
push eax
Etalone esi, edi
pop ecx
xor eax,ecx
jz @l1 ; test OK
cmp eax,3 ; SF and OF may each differ between the two flag sets, but if both differ,
jz @l1 ; the SF/OF relation is unchanged, so it's a proper result
@@: ; since signed "less than" is OF != SF, no matter which of the two flags is set
;int 3
if 0
print str$(ebx)," - Test failed: "
print uhex$(dword ptr [esi+12])
print "_"
print uhex$(dword ptr [esi+8])
print "_"
print uhex$(dword ptr [esi+4])
print "_"
print uhex$(dword ptr [esi])
print "  "
print uhex$(dword ptr [edi+12])
print "_"
print uhex$(dword ptr [edi+8])
print "_"
print uhex$(dword ptr [edi+4])
print "_"
print uhex$(dword ptr [edi]),13,10
else
print uhex$(esi)," "
print uhex$(edi),"   "
invoke IsDebuggerPresent
test eax,eax
jz @F
int 3
jmp @l2 ; repeat test to see it under the debugger, or skip this instruction
@@:


endif

@l1:

add edi,16+offst2
dec ebx
jnz @l2
add esi,16+offst2
cmp esi,offset TestVal+ArraySize
jb @l3

pop ebx
pop edi
pop esi

print "Test done",13,10,13,10,13,10

;##########################
ENDM


(this is the code from the prog posted at the bottom of page 10)
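
For readers who want to follow the logic of the Etalone macro and the relaxed check without an assembler, here is a Python model. It is only a sketch under assumptions: the function names are invented, flags are packed as CF ZF SF OF in bits 3..0 like FlagsToEAX does, and each OWORD is a list of four dwords, lowest first. It checks that the jumps derived from the reference flags agree with plain integer comparisons.

```python
def pack(cf, zf, sf, of):
    """Pack flags the way FlagsToEAX does: bit 3 = CF, 2 = ZF, 1 = SF, 0 = OF."""
    return (cf << 3) | (zf << 2) | (sf << 1) | of


def cmp32(a, b):
    """Packed flags of a 32-bit `cmp a, b`."""
    diff = (a - b) & 0xFFFFFFFF
    cf = int(a < b)
    zf = int(diff == 0)
    sf = diff >> 31
    of = ((a ^ b) & (a ^ diff)) >> 31
    return pack(cf, zf, sf, of)


def etalone(a, b):
    """Reference flags for two OWORDs given as [d0, d1, d2, d3]."""
    if a[3] != b[3]:
        return cmp32(a[3], b[3])        # high dwords decide: keep the real flags
    for i in (2, 1, 0):
        if a[i] != b[i]:
            # lower dwords are plain unsigned digits:
            # above -> no flags, below -> 1001b (CF and OF)
            return 0b0000 if a[i] > b[i] else 0b1001
    return cmp32(a[0], b[0])            # everything equal: only ZF is set


def check_passes(got, expected):
    """The relaxed test: identical, or SF and OF both inverted."""
    return (got ^ expected) in (0, 3)


def to_int(d):
    return d[0] | d[1] << 32 | d[2] << 64 | d[3] << 96


def signed(v):
    return v - (1 << 128) if v >> 127 else v


def consistent(a, b):
    """Do JB/JE/JA/JL/JG derived from the reference flags match the numbers?"""
    f = etalone(a, b)
    cf, zf, sf, of = f >> 3 & 1, f >> 2 & 1, f >> 1 & 1, f & 1
    x, y = to_int(a), to_int(b)
    return ((cf == 1) == (x < y)
            and (zf == 1) == (x == y)
            and (cf == 0 and zf == 0) == (x > y)
            and (sf != of) == (signed(x) < signed(y))
            and (zf == 0 and sf == of) == (signed(x) > signed(y)))


SAMPLES = [
    [0, 0, 0, 0],
    [1, 0, 0, 0x80000000],
    [0, 0, 1, 0],
    [0xFFFFFFFE, 0xFFFFFFFF, 0xFFFFFFFF, 0x7FFFFFFF],
    [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF],
]

print(all(consistent(a, b) for a in SAMPLES for b in SAMPLES))
```

The sample list is deliberately tiny; swapping in the full TestVal table would be the natural next step.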

To test your algos, just insert them into the source, insert the code above, and then use this:

   CheckIt <Cmp128NidudSEEU [esi],[edi]>
   
   CheckIt <Cmp128NidudSEE [esi],[edi]>


To check algos that are procs rather than macros, use this construct:


CheckIt <invoke AlgoName,esi,edi>


(esi and edi must be specified as the params)

It will print the offsets of the failed numbers and, if you are running the prog under a debugger, will break and then re-run the failed comparison, so you can trace things.

Did it pass Dave's check? :eek:

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 01:21:06 AM
Quote
*Suspiciously looking at the proc name and then at the source*
sorry about that  :lol:

Quote
nidud, I inserted your Cmp128NidudSEEU algo into my testbed, it doesn't pass the check
Quote
Did it pass Dave's check? :eek:

Cmp128NidudSEEU is now removed  :lol:

I inserted JJ's macro in the test
Code: [Select]
1761503 cycles for Cmp128Dave
2665107 cycles for Cmp128Dave2
1565382 cycles for Cmp128Nidud
789270 cycles for Cmp128NidudSEE (xor)
966233 cycles for Cmp128Alex (xor)
692925 cycles for Cmp128JJ (xor)
763265 cycles for Cmp128DaveU (unsigned)
774416 cycles for Cmp128NidudU (unsigned)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 01:47:33 AM
SSE comparison instructions are signed-only; that's probably the reason (i.e., they treat the OWORD just as a set of signed DWORDs).

Yes, that is what’s debated here (http://masm32.com/board/index.php?topic=2222.msg23536#msg23536)
Since the SEE test is signed, and the result we are testing is unsigned, we have to add the sign to the test. The second test will therefore be inverted, so all the OF/SF flags will be flipped as explained here (http://masm32.com/board/index.php?topic=2222.msg23551#msg23551)

If a value is added to the second test, as done here (http://masm32.com/board/index.php?topic=2222.msg23554#msg23554), the state of the OF/SF flags is random.
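
One common way to get an unsigned ordering out of a signed-only compare is to flip the sign bit of both operands first. The Python sketch below models a single DWORD lane; it is an illustration of that general trick, not a transcription of any macro in this thread, and the helper names are hypothetical.

```python
def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed value."""
    return x - (1 << 32) if x & 0x80000000 else x

def signed_gt(x, y):
    """Model of one PCMPGTD lane: signed 32-bit greater-than."""
    return to_signed32(x) > to_signed32(y)

def unsigned_gt_via_signed(x, y):
    """Flip the sign bit of both operands, then compare signed."""
    return signed_gt(x ^ 0x80000000, y ^ 0x80000000)

# 0FFFFFFFFh is -1 when read as signed, so a plain signed compare gets this wrong
print(signed_gt(0xFFFFFFFF, 1), unsigned_gt_via_signed(0xFFFFFFFF, 1))
```

The flip maps 0 to the most negative signed value and 0FFFFFFFFh to the most positive one, so the signed order of the flipped values equals the unsigned order of the originals.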
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 23, 2013, 01:47:42 AM
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
-----------------------------------------------
1996511 cycles for Cmp128Dave
3028413 cycles for Cmp128Dave2
1932345 cycles for Cmp128Nidud
1350691 cycles for Cmp128NidudSEE (xor)
1028161 cycles for Cmp128Alex (xor)
848824  cycles for Cmp128JJ (xor)
838612  cycles for Cmp128DaveU (unsigned)
838743  cycles for Cmp128NidudU (unsigned)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
------------------------------------------------------
1897321 cycles for Cmp128Dave
6395118 cycles for Cmp128Dave2
1833315 cycles for Cmp128Nidud
1987302 cycles for Cmp128NidudSEE (xor)
921104  cycles for Cmp128Alex (xor)
737550  cycles for Cmp128JJ (xor)
873656  cycles for Cmp128DaveU (unsigned)
714518  cycles for Cmp128NidudU (unsigned)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 05:19:52 AM
The xor test passes unsigned macros:
Cmp128NidudU
Cmp128DaveU

Quote
CheckIt <Cmp128NidudSEE [esi],[edi]>

I looked at the code, and the TestVal (http://masm32.com/board/index.php?topic=2222.msg23472#msg23472) data is not aligned to 16 bytes

I aligned the table by removing the test DWORD in front of the OWORD
I then changed the CheckIt macro:
Code: [Select]
CheckIt MACRO howToInvoke:REQ
CheckIt2 <howToInvoke>, 0, 0
CheckIt2 <howToInvoke>, 16, 0
CheckIt2 <howToInvoke>, 0, 16
CheckIt2 <howToInvoke>, 16, 16
ENDM

Now the Cmp128NidudSEE macro passes, and:
  Cmp128NidudSEEU
  Cmp128NidudU
  Cmp128DaveU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 23, 2013, 06:46:49 AM
my latest version has 16-aligned data, as well as a few other improvements

http://masm32.com/board/index.php?topic=2222.msg23534#msg23534 (http://masm32.com/board/index.php?topic=2222.msg23534#msg23534)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 07:39:10 AM
I have used that one, but I'm using the xor for the SEE macro.

I added a number to the table, the one given by qWord, and then the test fails:
Code: [Select]
cmp 00000001_00000000_00000000_00000000 , 00000001_FFFFFFFF_FFFFFFFF_FFFFFFFF
was: NV PL NZ NC should be: NV NG NZ CY
cmp 00000001_00000000_00000000_00000000 , 00000001_FFFFFFFF_FFFFFFFF_FFFFFFFF
was: NV PL NZ NC should be: NV NG NZ CY
cmp 00000001_FFFFFFFF_FFFFFFFF_FFFFFFFF , 00000001_00000000_00000000_00000000
was: NV NG NZ CY should be: NV PL NZ NC
cmp 00000001_FFFFFFFF_FFFFFFFF_FFFFFFFF , 00000001_00000000_00000000_00000000
was: NV NG NZ CY should be: NV PL NZ NC

4 Failures
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 23, 2013, 01:10:21 PM
interesting that qWord's "magic" number (that's a pun, really) is needed
that means that my set of data values is insufficient to perform a comprehensive test
it also means that some of our "thought-to-be-good" algos may not be

i may have to look at it a little closer to see what other values are needed
i thought i had it covered with the walking 1's and 0's
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 23, 2013, 06:20:23 PM
*Suspiciously looking at the proc name and then at the source*
sorry about that  :lol:

Do not worry about that :biggrin:

The xor test passes unsigned macros:
Cmp128NidudU
Cmp128DaveU

Quote
CheckIt <Cmp128NidudSEE [esi],[edi]>

I looked at the code, and the TestVal (http://masm32.com/board/index.php?topic=2222.msg23472#msg23472) data is not aligned to 16 bytes

I aligned the table by removing the test DWORD in front of the OWORD
I then changed the CheckIt macro:
Code: [Select]
CheckIt MACRO howToInvoke:REQ
CheckIt2 <howToInvoke>, 0, 0
CheckIt2 <howToInvoke>, 16, 0
CheckIt2 <howToInvoke>, 0, 16
CheckIt2 <howToInvoke>, 16, 16
ENDM

Now the Cmp128NidudSEE macro passes, and:
  Cmp128NidudSEEU
  Cmp128NidudU
  Cmp128DaveU

Yes, it is not aligned, but there is a reason for that: our algos should work with unaligned data too, even if they are SSE (that's why we use MOVUPS/MOVUPD), though JJAxCMP128bit is not currently aware of unaligned data; that is a matter of one more instruction. Also, I change the offset by 4 not only to vary the numbers but also to make the "check DWORD" part of the numbers: it ends up in a different position within them on each pass, so we actually have, roughly speaking, four times as many test numbers as Dave's original set of OWORDs.


interesting that qWord's "magic" number (that's a pun, really) is needed
that means that my set of data values is insufficient to perform a comprehensive test
it also means that some of our "thought-to-be-good" algos may not be

i may have to look at it a little closer to see what other values are needed
i thought i had it covered with the walking 1's and 0's

In my testbed I get a great many errors for the algos. Some of the failing numbers are repeated across passes; others did not appear when I first used your test data in a single pass, using the OWORDs just as they were prepared (i.e. skipping the DWORD and checking only the OWORDs). After I added the offset variations and started using the additional DWORD as part of the numbers, many new failing numbers were revealed.
So I think this is probably the way to go: we can craft the data as OWORDs but then walk through it in steps of a DWORD, or even a byte; this will increase the chance of detecting errors.
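
The idea of walking an OWORD table in smaller steps can be sketched as follows. This is a Python illustration under stated assumptions: `operands` is a hypothetical helper, the table is taken to be a flat list of little-endian DWORDs, and the two-entry table is only a toy example.

```python
import struct

def operands(dwords, step=4):
    """Slide a 16-byte window over the table and read each position as one OWORD."""
    raw = struct.pack("<%dI" % len(dwords), *dwords)
    out = []
    for off in range(0, len(raw) - 16 + 1, step):
        out.append(int.from_bytes(raw[off:off + 16], "little"))
    return out

table = [0, 0, 0, 0,   1, 0, 0, 0]        # two OWORDs: 0 and 1
print(len(operands(table, 16)), len(operands(table, 4)), len(operands(table, 1)))
```

With a step of 16 the two prepared OWORDs come back unchanged; a step of 4 or 1 yields additional values in which the set bit lands in other DWORD or byte positions.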
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 08:52:01 PM
Here is a new evaluation test based on the first test by Dave

Not sure if this is correct, but this is the result:
Code: [Select]
;
;                     JB/JA   JL/JG   JO/JS
; ------------------------------------------
; Cmp128Dave           [x]     [x]     [x]
; Cmp128Dave2          [x]     [x]     [x]
; Cmp128Nidud          [x]     [x]     [x]
; Cmp128Alex           [x]     [x]     [ ]
; Cmp128JJ             [ ]     [ ]     [ ]
; Cmp128DaveU          [x]     [ ]     [ ]
; Cmp128NidudU         [x]     [ ]     [ ]
; Cmp128JJSEE          [ ]     [ ]     [ ]
; Cmp128NidudSEE       [ ]     [ ]     [ ]
; Cmp128JJAlexSEE      [ ]     [ ]     [ ]
;
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 23, 2013, 10:10:37 PM
Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
---------------------------------------------------------
1986385   cycles for Cmp128Dave
6307221   cycles for Cmp128Dave2
1832520   cycles for Cmp128Nidud
3162995   cycles for Cmp128NidudSEE (xor)
970403   cycles for Cmp128Alex (xor)
806938   cycles for Cmp128JJ (xor)
866598   cycles for Cmp128DaveU (unsigned)
754468   cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
1980527   cycles for Cmp128Dave
6309359   cycles for Cmp128Dave2
1853104   cycles for Cmp128Nidud
3159145   cycles for Cmp128NidudSEE (xor)
964008   cycles for Cmp128Alex (xor)
807636   cycles for Cmp128JJ (xor)
867367   cycles for Cmp128DaveU (unsigned)
749661   cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 10:18:40 PM
hmm, I seem to have a software problem  :biggrin:

testing a SEE macro
Code: [Select]
ml 438 Failures
jwasm 146 Failures
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 23, 2013, 10:20:37 PM
i like that method, nidud   :t
i was thinking of doing something like that, and adding parity - lol
but i don't have time to mess with it, right now

i did create a new set of values
but, i haven't had time to validate the standard flags proc

Code: [Select]
TestVal dd 0,0,0,0
        dd 1,0,0,0
        dd 0,1,0,0
        dd 0,0,1,0
        dd 0,0,0,1

        dd 7FFFFFFFh,0,0,0
        dd 0FFFFFFFFh,0,0,0
        dd 0FFFFFFFFh,1,0,0
        dd 0FFFFFFFFh,7FFFFFFFh,0,0
        dd 0FFFFFFFFh,0FFFFFFFFh,0,0
        dd 0FFFFFFFFh,0FFFFFFFFh,1,0
        dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,0
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,1

        dd 0,0,0,40000000h
        dd 1,0,0,40000000h
        dd 0,1,0,40000000h
        dd 0,0,1,40000000h
        dd 0,0,0,40000001h

        dd 7FFFFFFFh,0,0,40000000h
        dd 0FFFFFFFFh,0,0,40000000h
        dd 0FFFFFFFFh,1,0,40000000h
        dd 0FFFFFFFFh,7FFFFFFFh,0,40000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0,40000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,1,40000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,40000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,40000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,40000001h

        dd 0,0,0,80000000h
        dd 1,0,0,80000000h
        dd 0,1,0,80000000h
        dd 0,0,1,80000000h
        dd 0,0,0,80000001h

        dd 7FFFFFFFh,0,0,80000000h
        dd 0FFFFFFFFh,0,0,80000000h
        dd 0FFFFFFFFh,1,0,80000000h
        dd 0FFFFFFFFh,7FFFFFFFh,0,80000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0,80000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,1,80000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,80000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,80000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,80000001h

        dd 0,0,0,0C0000000h
        dd 1,0,0,0C0000000h
        dd 0,1,0,0C0000000h
        dd 0,0,1,0C0000000h
        dd 0,0,0,0C0000001h

        dd 7FFFFFFFh,0,0,0C0000000h
        dd 0FFFFFFFFh,0,0,0C0000000h
        dd 0FFFFFFFFh,1,0,0C0000000h
        dd 0FFFFFFFFh,7FFFFFFFh,0,0C0000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0,0C0000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,1,0C0000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,0C0000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0C0000000h
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0C0000001h

        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,3FFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFEh

        dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 0,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
        dd 0,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
        dd 0,80000000h,0FFFFFFFFh,3FFFFFFFh
        dd 0,0,0FFFFFFFFh,3FFFFFFFh
        dd 0,0,0FFFFFFFEh,3FFFFFFFh
        dd 0,0,80000000h,3FFFFFFFh
        dd 0,0,0,3FFFFFFFh
        dd 0,0,0,3FFFFFFEh

        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,7FFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFEh

        dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 0,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
        dd 0,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
        dd 0,80000000h,0FFFFFFFFh,7FFFFFFFh
        dd 0,0,0FFFFFFFFh,7FFFFFFFh
        dd 0,0,0FFFFFFFEh,7FFFFFFFh
        dd 0,0,80000000h,7FFFFFFFh
        dd 0,0,0,7FFFFFFFh
        dd 0,0,0,7FFFFFFEh

        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0BFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFEh

        dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
        dd 0,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
        dd 0,80000000h,0FFFFFFFFh,0BFFFFFFFh
        dd 0,0,0FFFFFFFFh,0BFFFFFFFh
        dd 0,0,0FFFFFFFEh,0BFFFFFFFh
        dd 0,0,80000000h,0BFFFFFFFh
        dd 0,0,0,0BFFFFFFFh
        dd 0,0,0,0BFFFFFFEh

        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh
        dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh

        dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
        dd 0,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
        dd 0,80000000h,0FFFFFFFFh,0FFFFFFFFh
        dd 0,0,0FFFFFFFFh,0FFFFFFFFh
        dd 0,0,0FFFFFFFEh,0FFFFFFFFh
        dd 0,0,80000000h,0FFFFFFFFh
        dd 0,0,0,0FFFFFFFFh
        dd 0,0,0,0FFFFFFFEh
TestVal_end LABEL BYTE
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 23, 2013, 11:10:14 PM
new software
Code: [Select]
jwasm Cmp128Eval.asm
jwlink libpath \masm32\lib file Cmp128Eval.obj
numbers used
Code: [Select]
dd 000000000h,?,?,?, 0, 0, 0, 0
dd 000000001h,?,?,?, 1, 0, 0, 0
dd 0000001FFh,?,?,?,-1, 1, 0, 0
dd 000000100h,?,?,?, 0, 1, 0, 0
dd 00001FF01h,?,?,?, 1,-1, 1, 0
dd 00001FFFFh,?,?,?,-1,-1, 1, 0
dd 000010000h,?,?,?, 0, 0, 1, 0
dd 000010001h,?,?,?, 1, 0, 1, 0
dd 001FF01FFh,?,?,?,-1, 1,-1, 1
dd 001FF0100h,?,?,?, 0, 1,-1, 1
dd 001FFFF01h,?,?,?, 1,-1,-1, 1
dd 001FFFFFFh,?,?,?,-1,-1,-1, 1
dd 001000000h,?,?,?, 0, 0, 0, 1
dd 001000001h,?,?,?, 1, 0, 0, 1
dd 0010001FFh,?,?,?,-1, 1, 0, 1
dd 001000100h,?,?,?, 0, 1, 0, 1
dd 0FF01FF01h,?,?,?, 1,-1, 1,-1
dd 0FF01FFFFh,?,?,?,-1,-1, 1,-1
dd 0FF010000h,?,?,?, 0, 0, 1,-1
dd 0FF010001h,?,?,?, 1, 0, 1,-1
dd 0FFFF01FFh,?,?,?,-1, 1,-1,-1
dd 0FFFF0100h,?,?,?, 0, 1,-1,-1
dd 0FFFFFF01h,?,?,?, 1,-1,-1,-1
dd 0FFFFFFFFh,?,?,?,-1,-1,-1,-1
dd 0FEFFFFFFh,?,?,?,-1,-1,-1,-2
dd 0FFFEFFFFh,?,?,?,-1,-1,-2,-1
dd 0FFFFFEFFh,?,?,?,-1,-2,-1,-1
dd 0FFFFFFFEh,?,?,?,-2,-1,-1,-1
dd 040000000h,?,?,?, 0,0,0,40000000h
dd 040000001h,?,?,?, 1,0,0,40000000h
dd 040000100h,?,?,?, 0,1,0,40000000h
dd 040010000h,?,?,?, 0,0,1,40000000h
dd 041000000h,?,?,?, 0,0,0,40000001h
dd 080000000h,?,?,?, 0,0,0,80000000h
dd 080000001h,?,?,?, 1,0,0,80000000h
dd 080000100h,?,?,?, 0,1,0,80000000h
dd 080010000h,?,?,?, 0,0,1,80000000h
dd 081000000h,?,?,?, 0,0,0,80000001h
dd 0C0000000h,?,?,?, 0,0,0,0C0000000h
dd 0C0000001h,?,?,?, 1,0,0,0C0000000h
dd 0C0000100h,?,?,?, 0,1,0,0C0000000h
dd 0C0010000h,?,?,?, 0,0,1,0C0000000h
dd 0C1000000h,?,?,?, 0,0,0,0C0000001h
new results
Code: [Select]
;
;                     JB/JA   JL/JG   JO/JS
; ------------------------------------------
; Cmp128Dave           [x]     [x]     [x]
; Cmp128Dave2          [x]     [x]     [x]
; Cmp128Nidud          [x]     [x]     [x]
; Cmp128Alex           [x]     [x]     [ ]
; Cmp128JJ             [ ]     [ ]     [ ]
; Cmp128DaveU          [x]     [ ]     [ ]
; Cmp128NidudU         [x]     [ ]     [ ]
; Cmp128JJSEE          [x]     [x]     [ ]
; Cmp128NidudSEE       [ ]     [ ]     [ ]
; Cmp128JJAlexSEE      [x]     [x]     [ ]
;
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 24, 2013, 01:05:37 AM
I haven't followed this as intensely as I should, sorry. Now I ran your test, and picked arbitrarily one of the "failed" values, and I don't quite understand why you consider that a failure:

include \masm32\MasmBasic\MasmBasic.inc        ; download (http://masm32.com/board/index.php?topic=94.0)
ox0 oword 0FFFFFFFFFFFFFFFF00000001FFFFFFFFh
ox1 oword 0FFFFFFFF00000001FFFFFFFF00000001h
qx0 qword 0FFFFFFFF0001FFFFh
qx1 qword 0FFFF0001FFFF0001h
dx0 dd    0FFFF01FFh
dx1 dd    0FF01FF01h

  Init
  Ocmp ox0, ox1
  movups xmm0, ox0
  movups xmm1, ox1
  deb 4, "OWORD size", x:xmm0, x:xmm1, flags
  Qcmp qx0, qx1
  deb 4, "QWORD size", qx0, qx1, x:qx0, x:qx1, flags
  mov eax, dx0
  cmp eax, dx1
  deb 4, "DWORD size", dx0, dx1, x:dx0, x:dx1, flags
  Inkey CrLf$, "was: NO NS NZ CY should be: NO NS NZ NC"
  Exit
end start

Output:
OWORD size
x:xmm0          FFFFFFFF FFFFFFFF 00000001 FFFFFFFF
x:xmm1          FFFFFFFF 00000001 FFFFFFFF 00000001
flags:          czso

QWORD size
qx0             -4294836225
qx1             -281466386841599
x:qx0           FFFFFFFF 0001FFFF
x:qx1           FFFF0001 FFFF0001
flags:          czso

DWORD size
dx0             -65025
dx1             -16646399
x:dx0           FFFF01FF
x:dx1           FF01FF01
flags:          czso  <<<<<<<<< lowercase means "not set"

was: NO NS NZ CY should be: NO NS NZ NC


Or do I misunderstand something? Apologies if that is the case...
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 02:25:20 AM
you seem to have a software problem :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 02:55:17 AM
I tried to fix the SEE macro by using PMAXUB

Quote
Compares 16 pairs of 8-bit unsigned integer values.

The first source operand is an XMM register. The second source
operand is either another XMM register or a 128-bit memory
location. The first source operand is also the destination.

Code: [Select]
movups xmm0,[esi]
pmaxub xmm0,[edi]
pmovmskb eax,xmm0
movups xmm0,[edi]
pmaxub xmm0,[esi]
pmovmskb ecx,xmm0

but the result is always the same
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000000_00000001
AX:CX 00000000  was: NO NS ZR NC should be: NO NS NZ CY
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000001_FFFFFFFF
AX:CX 000F000F  was: NO NS ZR NC should be: NO NS NZ CY
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 24, 2013, 03:28:05 AM
DWORD size
Code: [Select]
dx0             -65025
dx1             -16646399
x:dx0           FFFF01FF
x:dx1           FF01FF01
flags:          czso  <<<<<<<<< lowercase means "not set"

was: NO NS NZ CY should be: NO NS NZ NC

the carry flag was set by the algorithm and should not have been
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 24, 2013, 03:39:27 AM
flags:          czso  <<<<<<<<< lowercase means "not set"

was: NO NS NZ CY should be: NO NS NZ NC

the carry flag was set by the algorithm and should not have been

Well, not by my algo... ::)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 04:05:01 AM
The result of a full test for Cmp128JJ
Code: [Select]
cmp FFFFFFFF_FFFFFFFF_00000001_FFFFFFFF , FFFFFFFF_00000001_FFFFFFFF_00000001
AX:DX 0001FFFF  was: NO NS NZ CY should be: NO NS NZ NC

216 Failures
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 24, 2013, 04:51:13 AM
The result of a full test for Cmp128JJ
Code: [Select]
cmp FFFFFFFF_FFFFFFFF_00000001_FFFFFFFF , FFFFFFFF_00000001_FFFFFFFF_00000001
AX:DX 0001FFFF  was: NO NS NZ CY should be: NO NS NZ NC

216 Failures

That one was marked as "tinkering with", you can take it out. I was talking about the MasmBasic algo (Cmp128JJSEE - what is SEE? SSE?) which, AFAIK, sets zero and carry exactly like a cmp eax, edx; it does produce different results for SF/OF, but in a manner that does not alter the jl/jg jumps (which require SF != OF and SF == OF, respectively). Which means 0 failures, right?

Besides, as shown above, your test occasionally produces wrong results. The deb macro's czso means "none of the four are set", your algo says the carry was set. Olly and deb say carry is clear.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 05:05:15 AM
I tried to fix the SEE macro by using PMAXUB

Quote
Compares 16 pairs of 8-bit unsigned integer values.

The first source operand is an XMM register. The second source
operand is either another XMM register or a 128-bit memory
location. The first source operand is also the destination.

Code: [Select]
movups xmm0,[esi]
pmaxub xmm0,[edi]
pmovmskb eax,xmm0
movups xmm0,[edi]
pmaxub xmm0,[esi]
pmovmskb ecx,xmm0

but the result is always the same
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000000_00000001
AX:CX 00000000  was: NO NS ZR NC should be: NO NS NZ CY
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000001_FFFFFFFF
AX:CX 000F000F  was: NO NS ZR NC should be: NO NS NZ CY
Quote
and writes the numerically greater value into the corresponding
location of the destination  :biggrin:
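
Why the two PMAXUB sequences always give the same mask can be modelled in a few lines. This Python sketch emulates only the byte-wise behaviour of the two instructions; the helper names mirror the mnemonics, and `oword` is a hypothetical convenience function.

```python
def pmaxub(a, b):
    """Byte-wise unsigned maximum of two 16-byte values."""
    return bytes(max(x, y) for x, y in zip(a, b))

def pmovmskb(v):
    """Collect the top bit of each byte into a 16-bit mask (bit i comes from byte i)."""
    mask = 0
    for i, byte in enumerate(v):
        mask |= (byte >> 7) << i
    return mask

def oword(n):
    """128-bit integer to 16 little-endian bytes."""
    return n.to_bytes(16, "little")

a = oword(0)
b = oword(0x00000001_FFFFFFFF)
print(hex(pmovmskb(pmaxub(a, b))), hex(pmovmskb(pmaxub(b, a))))
```

Since max(x, y) equals max(y, x) for every byte, swapping the operands cannot change the mask, so the pair of masks carries no ordering information.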
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 05:15:19 AM
The result of a full test for Cmp128JJ
Code: [Select]
cmp FFFFFFFF_FFFFFFFF_00000001_FFFFFFFF , FFFFFFFF_00000001_FFFFFFFF_00000001
AX:DX 0001FFFF  was: NO NS NZ CY should be: NO NS NZ NC

216 Failures

That one was marked as "tinkering with", you can take it out. I was talking about the MasmBasic algo (Cmp128JJSEE - what is SEE? SSE?) which, AFAIK, sets zero and carry exactly like a cmp eax, edx; it does produce different results for SF/OF, but in a manner that does not alter the jl/jg jumps (which require SF != OF and SF == OF, respectively). Which means 0 failures, right?

Besides, as shown above, your test occasionally produces wrong results. The deb macro's czso means "none of the four are set", your algo says the carry was set. Olly and deb say carry is clear.

The result of an "exclude JO/JS" test for Cmp128JJSEE (sic)
Quote
0 Failures

The result of a full test
Quote
150 Failures

the above error is not one of the 150

hmm, I seem to have a software problem  :biggrin:

testing a SEE macro
Code: [Select]
ml 438 Failures
jwasm 146 Failures
new software
Code: [Select]
jwasm Cmp128Eval.asm
jwlink libpath \masm32\lib file Cmp128Eval.obj
could be the makeit.bat file  :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 05:31:31 AM
Quote
That one was marked as "tinkering with"

So is the Cmp128NidudSEE (sic) algo,
and it doesn't seem to get any better   :lol:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 24, 2013, 05:47:05 AM
The result of a full test
Quote
150 Failures

Yes, and all 150 produce correct jumps because SF & OF are swapped. So Ocmp and Qcmp (http://masm32.com/board/index.php?topic=94.msg23071#msg23071) work correctly - no failures. I guess the same holds true for Alex' version, although I haven't had time to test it.
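
The claim that swapped SF and OF still give correct jumps can be checked by brute force over all flag combinations. The Python sketch below encodes the standard x86 conditions for the six jumps named in the result tables; the function names and the (CF, ZF, SF, OF) tuple order are assumptions of the sketch.

```python
from itertools import product

# Standard x86 jump conditions, each taking a (CF, ZF, SF, OF) tuple
def jb(f): return f[0]                       # CF = 1
def ja(f): return not f[0] and not f[1]      # CF = 0 and ZF = 0
def jl(f): return f[2] != f[3]               # SF != OF
def jg(f): return not f[1] and f[2] == f[3]  # ZF = 0 and SF == OF
def jo(f): return f[3]                       # OF = 1
def js(f): return f[2]                       # SF = 1

def swap_sf_of(f):
    cf, zf, sf, of = f
    return (cf, zf, of, sf)

all_flags = list(product([False, True], repeat=4))
unchanged = all(j(f) == j(swap_sf_of(f)) for f in all_flags for j in (jb, ja, jl, jg))
broken = [f for f in all_flags if jo(f) != jo(swap_sf_of(f))]
print(unchanged, len(broken))
```

JB, JA, JL and JG are unaffected by the swap, while JO and JS change whenever SF and OF differ, which matches an algo passing the JB/JA and JL/JG columns but not JO/JS.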
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 24, 2013, 08:51:25 AM
Can I please have timings for the attached prog?

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
5934340 cycles for Cmp128Dave
9986238 cycles for Cmp128Dave2
6086064 cycles for Cmp128Nidud
2989932 cycles for Cmp128NidudSEE (xor)
2520135 cycles for Cmp128NidudSEEU (unsigned)
933977  cycles for Cmp128Alex (xor)
991343  cycles for Cmp128DaveU (unsigned)
926699  cycles for Cmp128NidudU (unsigned)
3613971 cycles for JJAxCMP128bit (SSE)
5933804 cycles for Ocmp2 - JJ's (SSE)
2965817 cycles for AxCMP128bitProc3
3446805 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
6019773 cycles for Cmp128Dave
9803217 cycles for Cmp128Dave2
6065554 cycles for Cmp128Nidud
2770437 cycles for Cmp128NidudSEE (xor)
2689950 cycles for Cmp128NidudSEEU (unsigned)
945193  cycles for Cmp128Alex (xor)
950871  cycles for Cmp128DaveU (unsigned)
923201  cycles for Cmp128NidudU (unsigned)
3674979 cycles for JJAxCMP128bit (SSE)
5935439 cycles for Ocmp2 - JJ's (SSE)
2902697 cycles for AxCMP128bitProc3
3456920 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 24, 2013, 10:18:50 AM
Alex CMP128bit
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5817682 cycles for Cmp128Dave
9943469 cycles for Cmp128Dave2
6048901 cycles for Cmp128Nidud
2573234 cycles for Cmp128NidudSEE (xor)
2341353 cycles for Cmp128NidudSEEU (unsigned)
1022981 cycles for Cmp128Alex (xor)
919308  cycles for Cmp128DaveU (unsigned)
844475  cycles for Cmp128NidudU (unsigned)
3445214 cycles for JJAxCMP128bit (SSE)
5638061 cycles for Ocmp2 - JJ's (SSE)
2800548 cycles for AxCMP128bitProc3
3475108 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
5776971 cycles for Cmp128Dave
9150470 cycles for Cmp128Dave2
5760939 cycles for Cmp128Nidud
2597086 cycles for Cmp128NidudSEE (xor)
2326359 cycles for Cmp128NidudSEEU (unsigned)
1154350 cycles for Cmp128Alex (xor)
880936  cycles for Cmp128DaveU (unsigned)
896082  cycles for Cmp128NidudU (unsigned)
3442676 cycles for JJAxCMP128bit (SSE)
5767909 cycles for Ocmp2 - JJ's (SSE)
2825598 cycles for AxCMP128bitProc3
3242168 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 10:22:09 AM
Code: [Select]
;                     JB/JA   JL/JG   JO/JS
; ------------------------------------------
; Cmp128Dave           [x]     [x]     [x]
; Cmp128Dave2          [x]     [x]     [x]
; Cmp128Nidud          [x]     [x]     [x]
; Cmp128NidudSSE       [x]     [x]     [x]
; Cmp128Alex           [x]     [x]     [ ]
; Cmp128JJSSE          [x]     [x]     [ ]
; Cmp128JJAlexSSE      [x]     [x]     [ ]
; AxCMP128bitProc3     [x]     [x]     [ ]
; AxCMP128bitProc3c    [x]     [x]     [ ]
; Cmp128DaveU          [x]     [ ]     [ ]
; Cmp128NidudU         [x]     [ ]     [ ]
;

Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1705190 cycles for Cmp128Dave
2648436 cycles for Cmp128Dave2
1630559 cycles for Cmp128Nidud
972669 cycles for Cmp128Alex (xor)
628103 cycles for Cmp128DaveU (unsigned)
769965 cycles for Cmp128NidudU (unsigned)
2548697 cycles for Cmp128JJSSE (xor)
1854976 cycles for Cmp128JJAlexSSE (xor)
1353558 cycles for Cmp128NidudSSE
2582152 cycles for AxCMP128bitProc3
2754507 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
1695637 cycles for Cmp128Dave
2672783 cycles for Cmp128Dave2
1621580 cycles for Cmp128Nidud
992659 cycles for Cmp128Alex (xor)
625231 cycles for Cmp128DaveU (unsigned)
769738 cycles for Cmp128NidudU (unsigned)
2548079 cycles for Cmp128JJSSE (xor)
1856750 cycles for Cmp128JJAlexSSE (xor)
1354046 cycles for Cmp128NidudSSE
2587125 cycles for AxCMP128bitProc3
2760540 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 24, 2013, 10:27:29 AM
nidud cmp128
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5657695 cycles for Cmp128Dave
8870013 cycles for Cmp128Dave2
5700804 cycles for Cmp128Nidud
1115699 cycles for Cmp128Alex (xor)
895891  cycles for Cmp128DaveU (unsigned)
1045671 cycles for Cmp128NidudU (unsigned)
5584412 cycles for Cmp128JJSSE (xor)
3425749 cycles for Cmp128JJAlexSSE (xor)
6307931 cycles for Cmp128NidudSSE
2788370 cycles for AxCMP128bitProc3
3278983 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
5626412 cycles for Cmp128Dave
8946119 cycles for Cmp128Dave2
5902719 cycles for Cmp128Nidud
1007638 cycles for Cmp128Alex (xor)
911179  cycles for Cmp128DaveU (unsigned)
977962  cycles for Cmp128NidudU (unsigned)
5649649 cycles for Cmp128JJSSE (xor)
3396764 cycles for Cmp128JJAlexSSE (xor)
6278291 cycles for Cmp128NidudSSE
2784183 cycles for AxCMP128bitProc3
3333723 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 10:45:31 AM
Alex,
I converted the functions to macros, it's a bit faster  :biggrin:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
---------------------------------------------------------
1690006 cycles for Cmp128Dave
2675950 cycles for Cmp128Dave2
1614107 cycles for Cmp128Nidud
959759 cycles for Cmp128Alex (xor)
620231 cycles for Cmp128DaveU (unsigned)
765410 cycles for Cmp128NidudU (unsigned)
2536611 cycles for Cmp128JJSSE (xor)
1871924 cycles for Cmp128JJAlexSSE (xor)
1346793 cycles for Cmp128NidudSSE
1351217 cycles for AxCMP128bitProc3
1359923 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
1700039 cycles for Cmp128Dave
2642895 cycles for Cmp128Dave2
1617344 cycles for Cmp128Nidud
960480 cycles for Cmp128Alex (xor)
629764 cycles for Cmp128DaveU (unsigned)
770936 cycles for Cmp128NidudU (unsigned)
2536453 cycles for Cmp128JJSSE (xor)
1854277 cycles for Cmp128JJAlexSSE (xor)
1352593 cycles for Cmp128NidudSSE
1349484 cycles for AxCMP128bitProc3
1359979 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 24, 2013, 10:55:11 AM
The strange thing is that there are some "rumours", such as "CMOV is preferable to jump + mov" or "string instructions with a REP(E) prefix are the fastest possible", but in tests these rumours often do not hold up :icon_eek:

nidud's cmp128.zip
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
6909610 cycles for Cmp128Dave
9987059 cycles for Cmp128Dave2
5967230 cycles for Cmp128Nidud
948941  cycles for Cmp128Alex (xor)
969717  cycles for Cmp128DaveU (unsigned)
1015841 cycles for Cmp128NidudU (unsigned)
5981220 cycles for Cmp128JJSSE (xor)
3506260 cycles for Cmp128JJAlexSSE (xor)
6612580 cycles for Cmp128NidudSSE
2884301 cycles for AxCMP128bitProc3
3538700 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
6021956 cycles for Cmp128Dave
9902589 cycles for Cmp128Dave2
5931336 cycles for Cmp128Nidud
982830  cycles for Cmp128Alex (xor)
928048  cycles for Cmp128DaveU (unsigned)
933436  cycles for Cmp128NidudU (unsigned)
5800983 cycles for Cmp128JJSSE (xor)
3515863 cycles for Cmp128JJAlexSSE (xor)
6532618 cycles for Cmp128NidudSSE
2878085 cycles for AxCMP128bitProc3
3399463 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 24, 2013, 11:06:21 AM
Alex,
I converted the functions to macros, it's a bit faster  :biggrin:

Thanks :biggrin:

It looks like the prologue and epilogue take much of the time.

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
6111257 cycles for Cmp128Dave
9430653 cycles for Cmp128Dave2
6087648 cycles for Cmp128Nidud
977496  cycles for Cmp128Alex (xor)
912537  cycles for Cmp128DaveU (unsigned)
897094  cycles for Cmp128NidudU (unsigned)
5813163 cycles for Cmp128JJSSE (xor)
3502859 cycles for Cmp128JJAlexSSE (xor)
6542713 cycles for Cmp128NidudSSE
2027954 cycles for AxCMP128bitProc3
2026933 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
5771320 cycles for Cmp128Dave
9534868 cycles for Cmp128Dave2
5882258 cycles for Cmp128Nidud
999464  cycles for Cmp128Alex (xor)
911686  cycles for Cmp128DaveU (unsigned)
970983  cycles for Cmp128NidudU (unsigned)
5801006 cycles for Cmp128JJSSE (xor)
3547188 cycles for Cmp128JJAlexSSE (xor)
6496130 cycles for Cmp128NidudSSE
2012868 cycles for AxCMP128bitProc3
2010059 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 24, 2013, 11:10:37 AM
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
1114751 cycles for Cmp128Dave
1631462 cycles for Cmp128Dave2
1012058 cycles for Cmp128Nidud
585622  cycles for Cmp128Alex (xor)
517017  cycles for Cmp128DaveU (unsigned)
477794  cycles for Cmp128NidudU (unsigned)
1214935 cycles for Cmp128JJSSE (xor)
1304208 cycles for Cmp128JJAlexSSE (xor)
1434055 cycles for Cmp128NidudSSE
1324566 cycles for AxCMP128bitProc3
1357700 cycles for AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 24, 2013, 08:32:17 PM
Hi nidud,

timings:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 2360/1000 cycles

1771    cycles for 1000 * Ocmp (JJ)
1757    cycles for 1000 * Ocmp2 (JJ)
1445    cycles for 1000 * cmp128n (nidud)
3829    cycles for 1000 * cmp128 qWord
3146    cycles for 1000 * AxCMP128bit

1843    cycles for 1000 * Ocmp (JJ)
1846    cycles for 1000 * Ocmp2 (JJ)
1612    cycles for 1000 * cmp128n (nidud)
3992    cycles for 1000 * cmp128 qWord
3144    cycles for 1000 * AxCMP128bit

1709    cycles for 1000 * Ocmp (JJ)
1746    cycles for 1000 * Ocmp2 (JJ)
1506    cycles for 1000 * cmp128n (nidud)
3848    cycles for 1000 * cmp128 qWord
3270    cycles for 1000 * AxCMP128bit

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: six_L on August 24, 2013, 10:47:39 PM
Quote
Intel(R) Core(TM) i3 CPU       M 370  @ 2.40GHz (SSE4)
---------------------------------------------------------
2639323   cycles for Cmp128Dave
8595376   cycles for Cmp128Dave2
3084573   cycles for Cmp128Nidud
1718384   cycles for Cmp128Alex (xor)
742712   cycles for Cmp128DaveU (unsigned)
621263   cycles for Cmp128NidudU (unsigned)
2229787   cycles for Cmp128JJSSE (xor)
988796   cycles for Cmp128JJAlexSSE (xor)
1853176   cycles for Cmp128NidudSSE
1302968   cycles for AxCMP128bitProc3
1286231   cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
1768975   cycles for Cmp128Dave
7299673   cycles for Cmp128Dave2
2650490   cycles for Cmp128Nidud
1824726   cycles for Cmp128Alex (xor)
1732493   cycles for Cmp128DaveU (unsigned)
1461457   cycles for Cmp128NidudU (unsigned)
3480289   cycles for Cmp128JJSSE (xor)
1887690   cycles for Cmp128JJAlexSSE (xor)
2749835   cycles for Cmp128NidudSSE
2177752   cycles for AxCMP128bitProc3
2126370   cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------

--- ok ---

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 25, 2013, 12:01:08 AM
New and updated time test

The old test only used the evaluation numbers, so all compares were 16-byte aligned. The macros should be able to handle unaligned data, so I adapted Alex's idea of using the same data but advancing the offset by 4 on each iteration to get more varied compares.

As a result of this, Alex's and my SSE macros failed  :lol:

Changes made to Cmp128JJAlexSSE:
Code: [Select]
movups xmm0,[ow0]
movups xmm1,[ow1] ; ++
movzx eax,word ptr [ow0+14]
;pcmpeqb xmm0,[ow1] ; this failed on unaligned data
pcmpeqb xmm0,xmm1
The same change is made to Cmp128NidudSSE:
Code: [Select]
movups xmm0,A[0]
movups xmm1,B[0]
pcmpeqd xmm0,xmm1

New results:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
784821 cycles [x][x][x] - Cmp128Dave
1208193 cycles [x][x][x] - Cmp128Dave2
792137 cycles [x][x][x] - Cmp128Nidud
722428 cycles [x][x][x] - Cmp128NidudSSE
618894 cycles [x][x][ ] - Cmp128Alex
1088523 cycles [x][x][ ] - Cmp128JJSSE
1210964 cycles [x][x][ ] - Cmp128JJAlexSSE
693965 cycles [x][x][ ] - AxCMP128bitProc3
685678 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
531623 cycles [x][ ][ ] - Cmp128DaveU
532299 cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 25, 2013, 02:12:23 AM
here is a simple one using a LUT (unsigned)

Code: [Select]
omask db 12,12,12,12,12,12,12,12,8,8,8,8,4,4,0,12
Code: [Select]
movups xmm0,A[0]
movups xmm1,B[0]
pcmpeqd xmm0,xmm1
movmskps eax,xmm0
mov al,omask[eax]
mov edx,A[eax]
sub edx,B[eax]
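
For readers who find the table hard to read, here is a rough portable C model of the same idea. It is only a sketch: `eq_mask` stands in for pcmpeqd + movmskps, `cmp128_lut` is a made-up name, and the values are assumed to be stored as four little-endian dwords. Each table entry is the byte offset of the most significant dword that differs (12 when everything is equal, so the final compare reports "equal").

```c
#include <assert.h>
#include <stdint.h>

/* Same table as omask in the post: index = 4-bit equality mask
   (bit i set when dword i of A equals dword i of B),
   value = byte offset of the most significant dword that differs. */
static const uint8_t omask[16] = {12,12,12,12,12,12,12,12,8,8,8,8,4,4,0,12};

/* Portable stand-in for pcmpeqd + movmskps (assumption of this sketch). */
static unsigned eq_mask(const uint32_t a[4], const uint32_t b[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++)
        if (a[i] == b[i])
            mask |= 1u << i;
    return mask;
}

/* Unsigned compare of two little-endian 128-bit values:
   returns -1 if a < b, 0 if equal, 1 if a > b. */
int cmp128_lut(const uint32_t a[4], const uint32_t b[4])
{
    unsigned mask = eq_mask(a, b);
    assert(mask < 16);                 /* only four mask bits can be set */
    unsigned i = omask[mask] / 4u;     /* byte offset -> dword index */
    return (a[i] > b[i]) - (a[i] < b[i]);
}
```

In the assembly version the final `sub edx,B[eax]` leaves CF and ZF set as for an unsigned compare; the C model returns the same ordering as a number instead.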
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 25, 2013, 02:24:11 AM
Hi nidud,

the new timings:

Code: [Select]

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
698804  cycles [x][x][x] - Cmp128Dave
1161997 cycles [x][x][x] - Cmp128Dave2
614967  cycles [x][x][x] - Cmp128Nidud
715645  cycles [x][x][x] - Cmp128NidudSSE
461614  cycles [x][x][ ] - Cmp128Alex
378413  cycles [x][x][ ] - Cmp128JJSSE
331370  cycles [x][x][ ] - Cmp128JJAlexSSE
467938  cycles [x][x][ ] - AxCMP128bitProc3
436255  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
437147  cycles [x][ ][ ] - Cmp128DaveU
463277  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 25, 2013, 05:56:47 PM
As a result of this, Alex's and my SSE macros failed  :lol:

Changes made to Cmp128JJAlexSSE:
Code: [Select]
movups xmm0,[ow0]
movups xmm1,[ow1] ; ++
movzx eax,word ptr [ow0+14]
;pcmpeqb xmm0,[ow1] ; this failed on unaligned data
pcmpeqb xmm0,xmm1

Yes, I noticed that it doesn't handle unaligned data. Your solution is right :t

Here are the timings:

Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2652796 cycles [x][x][x] - Cmp128Dave
3952276 cycles [x][x][x] - Cmp128Dave2
2639764 cycles [x][x][x] - Cmp128Nidud
3069710 cycles [x][x][x] - Cmp128NidudSSE
944781  cycles [x][x][ ] - Cmp128Alex
1913987 cycles [x][x][ ] - Cmp128JJSSE
2623148 cycles [x][x][ ] - Cmp128JJAlexSSE
1324491 cycles [x][x][ ] - AxCMP128bitProc3
1279045 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
726832  cycles [x][ ][ ] - Cmp128DaveU
738206  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---


It's interesting how differently algos perform on different processors.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on August 25, 2013, 06:57:30 PM
Hi Alex,

It's interesting how differently algos perform on different processors.

yes, it seems that things become more and more hardware dependent. The only way to overcome that is to use different code paths.

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 25, 2013, 07:18:41 PM
Quote
It's interesting how differently algos perform on different processors.

Yes, and in the last test-round I ended up with this conclusion:
Quote
We either all use the same CPU or stick to the basic then  :lol:

I have no way of testing this, but I assume that this will also work in 64-bit:
Code: [Select]
mov rax,A[0]
sub rax,B[0]
jnz @2
mov rax,A[8]
sbb rax,B[8]
jmp @end
@1: jc @end
mov al,80h
sub al,7Fh
jmp @end
@2: mov rax,A[8]
sbb rax,B[8]
jnz @end
jo @1
inc rax

Well, we have some workable solutions for all three emulations now, and the evaluation code also seems to work, so that is good.
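
As far as the column headers suggest, the three emulations ([JB/JA], [JL/JG], [JO/JS]) come down to which flags match what a real CMP would leave behind. As a reference to check a macro against, here is a small C model of a 128-bit SUB/SBB/SBB/SBB chain. It is a sketch only: the names `flags128`, `cmp128_flags`, `below128` and `less128` are made up for it, and the operands are assumed to be four little-endian dwords.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int cf, zf, sf, of; } flags128;

/* Model of CMP on two little-endian 128-bit values held as four dwords:
   subtract limb by limb with borrow, like SUB followed by three SBBs. */
flags128 cmp128_flags(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t diff[4];
    unsigned borrow = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        diff[i] = (uint32_t)d;
        borrow = (unsigned)((d >> 32) & 1);   /* borrow out of this limb */
    }
    flags128 f;
    f.cf = (int)borrow;
    f.zf = (diff[0] | diff[1] | diff[2] | diff[3]) == 0;
    f.sf = (int)(diff[3] >> 31);
    /* signed overflow: operand signs differ and the result sign differs from a */
    f.of = (int)(((a[3] ^ b[3]) & (a[3] ^ diff[3])) >> 31);
    assert(!(f.zf && f.cf));                  /* equal values never borrow */
    return f;
}

int below128(flags128 f) { return f.cf; }           /* condition tested by JB */
int less128(flags128 f)  { return f.sf != f.of; }   /* condition tested by JL */
```

A macro that only gets CF and ZF right covers JB/JA; JL/JG additionally needs SF and OF to agree with this model, and JO/JS needs each of them to be right on its own.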
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 25, 2013, 08:07:28 PM
A brand new Cmp128JJAlexSSE!

Don't miss it on your displays right now!

Now fully compliant with Dave's Testing Method™ (JO/JS works as expected).

Even with 3 new tastes modifications!


:greensml:


Timings welcome :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 25, 2013, 08:15:21 PM
Hi Gunther :biggrin:

It's interesting how differently algos perform on different processors.

yes, it seems that things become more and more hardware dependent. The only way to overcome that is to use different code paths.

With the current number of different CPU models, that would be a bunch of code :biggrin:
You're perfectly right: it's the only way to get every clock out of every machine.



Ah, I forgot to post the timings in the previous post:
Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2958901 cycles [x][x][x] - Cmp128Dave
4334913 cycles [x][x][x] - Cmp128Dave2
2957840 cycles [x][x][x] - Cmp128Nidud
3402571 cycles [x][x][x] - Cmp128NidudSSE
1034917 cycles [x][x][ ] - Cmp128Alex
2118138 cycles [x][x][ ] - Cmp128JJSSE
1762373 cycles [x][x][x] - Cmp128JJAlexSSE_1
1726287 cycles [x][x][x] - Cmp128JJAlexSSE_2
1739010 cycles [x][x][x] - Cmp128JJAlexSSE_3
1464577 cycles [x][x][ ] - AxCMP128bitProc3
1372694 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
779180  cycles [x][ ][ ] - Cmp128DaveU
798269  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 25, 2013, 08:23:20 PM
A brand new Cmp128JJAlexSSE!

It seems to like my Celeron - best among the "good" algos :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
968781  cycles {x}{x}{x} - Cmp128Dave
2629540 cycles {x}{x}{x} - Cmp128Dave2
938714  cycles {x}{x}{x} - Cmp128Nidud
1039057 cycles {x}{x}{x} - Cmp128NidudSSE
706010  cycles {x}{x}{ } - Cmp128Alex
1131248 cycles {x}{x}{ } - Cmp128JJSSE
834193  cycles {x}{x}{x} - Cmp128JJAlexSSE_1
947852  cycles {x}{x}{x} - Cmp128JJAlexSSE_2
948452  cycles {x}{x}{x} - Cmp128JJAlexSSE_3
881549  cycles {x}{x}{ } - AxCMP128bitProc3
890835  cycles {x}{x}{ } - AxCMP128bitProc3c (cmov)
610504  cycles {x}{ }{ } - Cmp128DaveU
599043  cycles {x}{ }{ } - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 25, 2013, 08:35:25 PM
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
785950 cycles [x][x][x] - Cmp128Dave
1205202 cycles [x][x][x] - Cmp128Dave2
794444 cycles [x][x][x] - Cmp128Nidud
718819 cycles [x][x][x] - Cmp128NidudSSE
618550 cycles [x][x][ ] - Cmp128Alex
1091460 cycles [x][x][ ] - Cmp128JJSSE
1069602 cycles [x][x][x] - Cmp128JJAlexSSE_1
1070532 cycles [x][x][x] - Cmp128JJAlexSSE_2
1071273 cycles [x][x][x] - Cmp128JJAlexSSE_3
686309 cycles [x][x][ ] - AxCMP128bitProc3
684448 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
530841 cycles [x][ ][ ] - Cmp128DaveU
532318 cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 25, 2013, 08:41:24 PM
Oops, too many digits in the numbers, so I ended up evaluating them "by width" :greensml: "By width" the selected timings were wider, so I thought it was much slower... :greensml: :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 25, 2013, 08:45:08 PM
1069602 cycles [x][x][x] - Cmp128JJAlexSSE_1
1070532 cycles [x][x][x] - Cmp128JJAlexSSE_2
1071273 cycles [x][x][x] - Cmp128JJAlexSSE_3
But here they are all close.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 25, 2013, 09:12:33 PM
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2598846 cycles [x][x][x] - Cmp128Dave
3786288 cycles [x][x][x] - Cmp128Dave2
2616598 cycles [x][x][x] - Cmp128Nidud
3025310 cycles [x][x][x] - Cmp128NidudSSE
914405  cycles [x][x][ ] - Cmp128Alex
1906276 cycles [x][x][ ] - Cmp128JJSSE
1588020 cycles [x][x][x] - Cmp128JJAlexSSE_1
1562841 cycles [x][x][x] - Cmp128JJAlexSSE_2
1558993 cycles [x][x][x] - Cmp128JJAlexSSE_3
1326437 cycles [x][x][ ] - AxCMP128bitProc3
1254462 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
692441  cycles [x][ ][ ] - Cmp128DaveU
713309  cycles [x][ ][ ] - Cmp128NidudU
Code: [Select]
------------------------------------------------------
2615758 cycles [x][x][x] - Cmp128Dave
3829660 cycles [x][x][x] - Cmp128Dave2
2621750 cycles [x][x][x] - Cmp128Nidud
3031078 cycles [x][x][x] - Cmp128NidudSSE
908794  cycles [x][x][ ] - Cmp128Alex
1892463 cycles [x][x][ ] - Cmp128JJSSE
1591916 cycles [x][x][x] - Cmp128JJAlexSSE_1
1557071 cycles [x][x][x] - Cmp128JJAlexSSE_2
1559415 cycles [x][x][x] - Cmp128JJAlexSSE_3
1313596 cycles [x][x][ ] - AxCMP128bitProc3
1267780 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
711284  cycles [x][ ][ ] - Cmp128DaveU
741151  cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 25, 2013, 10:13:49 PM
Apparently a lot of random things play into the speed test.
I changed from movups to movdqu in Cmp128NidudSSE and Cmp128JJAlexSSE_3:
Code: [Select]
783489 cycles [x][x][x] - Cmp128Dave
1207791 cycles [x][x][x] - Cmp128Dave2
792539 cycles [x][x][x] - Cmp128Nidud
702561 cycles [x][x][x] - Cmp128NidudSSE
619145 cycles [x][x][ ] - Cmp128Alex
1088566 cycles [x][x][ ] - Cmp128JJSSE
1073680 cycles [x][x][x] - Cmp128JJAlexSSE_1
1068435 cycles [x][x][x] - Cmp128JJAlexSSE_2
1078837 cycles [x][x][x] - Cmp128JJAlexSSE_3
686730 cycles [x][x][ ] - AxCMP128bitProc3
684243 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
534373 cycles [x][ ][ ] - Cmp128DaveU
532885 cycles [x][ ][ ] - Cmp128NidudU

I then moved one of the tests to the top:
Code: [Select]
689506 cycles [x][x][x] - Cmp128NidudSSE
785092 cycles [x][x][x] - Cmp128Dave
1205640 cycles [x][x][x] - Cmp128Dave2
792837 cycles [x][x][x] - Cmp128Nidud
620102 cycles [x][x][ ] - Cmp128Alex
1090523 cycles [x][x][ ] - Cmp128JJSSE
1069606 cycles [x][x][x] - Cmp128JJAlexSSE_1
1070590 cycles [x][x][x] - Cmp128JJAlexSSE_2
1074581 cycles [x][x][x] - Cmp128JJAlexSSE_3
688797 cycles [x][x][ ] - AxCMP128bitProc3
684210 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
531873 cycles [x][ ][ ] - Cmp128DaveU
532100 cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 25, 2013, 10:20:43 PM
tests that use a little more time return more repeatable results
if i am timing code, i try to make each pass last about 0.5 seconds
that seems to give repeatable numbers
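
One way to apply this is to let the test calibrate its own repeat count. A minimal sketch in C, assuming CPU time from clock() is an acceptable stand-in for the cycle counter used in this thread; `work` and `calibrate_reps` are hypothetical names, and the target (0.5 seconds in the advice above) is passed in as a parameter.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload standing in for one compare macro. */
static volatile unsigned sink;
static void work(void) { sink += 1; }

/* Double the repeat count until one pass takes at least target_seconds
   of CPU time; returns the repeat count that reached the target
   (or the cap, if the target is never reached). */
unsigned long calibrate_reps(double target_seconds)
{
    assert(target_seconds > 0.0);
    unsigned long reps = 1000;
    for (;;) {
        clock_t start = clock();
        for (unsigned long i = 0; i < reps; i++)
            work();
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed >= target_seconds || reps > (1UL << 30))
            return reps;
        reps *= 2;
    }
}
```

Calling `calibrate_reps(0.5)` once per algorithm and then timing that many repetitions gives every algorithm a pass of comparable length.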
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 25, 2013, 11:51:25 PM
Hi,

   From Reply #216.

Code: [Select]
pre-P4 (SSE1)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1067625 cycles [x][x][x] - Cmp128Dave
2571737 cycles [x][x][x] - Cmp128Dave2
998428 cycles [x][x][x] - Cmp128Nidud
1083846 cycles [x][x][x] - Cmp128NidudSSE
847793 cycles [x][x][ ] - Cmp128Alex
1788551 cycles [x][x][ ] - Cmp128JJSSE
1215146 cycles [x][x][x] - Cmp128JJAlexSSE_1
1623996 cycles [x][x][x] - Cmp128JJAlexSSE_2
1570182 cycles [x][x][x] - Cmp128JJAlexSSE_3
1114476 cycles [x][x][ ] - AxCMP128bitProc3
1174133 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
608508 cycles [x][ ][ ] - Cmp128DaveU
612287 cycles [x][ ][ ] - Cmp128NidudU

--- ok --- 
Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1040860 cycles [x][x][x] - Cmp128Dave
2541208 cycles [x][x][x] - Cmp128Dave2
940427 cycles [x][x][x] - Cmp128Nidud
1046690 cycles [x][x][x] - Cmp128NidudSSE
834253 cycles [x][x][ ] - Cmp128Alex
1849858 cycles [x][x][ ] - Cmp128JJSSE
1453007 cycles [x][x][x] - Cmp128JJAlexSSE_1
1703155 cycles [x][x][x] - Cmp128JJAlexSSE_2
1713931 cycles [x][x][x] - Cmp128JJAlexSSE_3
963145 cycles [x][x][ ] - AxCMP128bitProc3
1004886 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652720 cycles [x][ ][ ] - Cmp128DaveU
646938 cycles [x][ ][ ] - Cmp128NidudU

--- ok --- 
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
979702 cycles [x][x][x] - Cmp128Dave
2660548 cycles [x][x][x] - Cmp128Dave2
948481 cycles [x][x][x] - Cmp128Nidud
1056326 cycles [x][x][x] - Cmp128NidudSSE
754229 cycles [x][x][ ] - Cmp128Alex
1145531 cycles [x][x][ ] - Cmp128JJSSE
852507 cycles [x][x][x] - Cmp128JJAlexSSE_1
960256 cycles [x][x][x] - Cmp128JJAlexSSE_2
959330 cycles [x][x][x] - Cmp128JJAlexSSE_3
891707 cycles [x][x][ ] - AxCMP128bitProc3
899999 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
615768 cycles [x][ ][ ] - Cmp128DaveU
606497 cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

Cheers,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Siekmanski on August 26, 2013, 12:05:00 AM
Code: [Select]
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
973837  cycles [x][x][x] - Cmp128Dave
3064246 cycles [x][x][x] - Cmp128Dave2
924278  cycles [x][x][x] - Cmp128Nidud
1063306 cycles [x][x][x] - Cmp128NidudSSE
688245  cycles [x][x][ ] - Cmp128Alex
1082474 cycles [x][x][ ] - Cmp128JJSSE
801400  cycles [x][x][x] - Cmp128JJAlexSSE_1
898730  cycles [x][x][x] - Cmp128JJAlexSSE_2
902646  cycles [x][x][x] - Cmp128JJAlexSSE_3
896815  cycles [x][x][ ] - AxCMP128bitProc3
929492  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
632298  cycles [x][ ][ ] - Cmp128DaveU
627533  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 26, 2013, 06:23:41 AM
Hi,

   Using Dave's original 40 DWORD AND OWORD pairs of numbers,
and some of his logic, I wrote some code for my fixed point
program.  Claims it passes the tests.  Yippee!  Had me going in
circles for a while.

Cheers,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 26, 2013, 07:04:16 AM
the first test is strange:
Code: [Select]
pre-P4 (SSE1)
most of the code used in the macros is SSE2
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 26, 2013, 07:16:10 AM
the first test is strange:
Code: [Select]
pre-P4 (SSE1)
most of the code used in the macros is SSE2

Hi nidud,

   Yeah, I would think it should not run.  But?

Cheers,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: MichaelW on August 26, 2013, 07:54:28 AM
I have seen SSE2 code that would run on my P3 without triggering an exception, but which would produce incorrect results.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 26, 2013, 08:49:32 AM
Yes, the SSE2 instructions used in the code are PCMPEQB and MOVAPS/MOVAPD. PCMPEQB uses the same opcode as the MMX PCMPEQB but with a 66h prefix, which the PIII doesn't recognize, so it treats it as an MMX instruction (which is why the SSE results are incorrect). MOVAPS works on the PIII, and MOVAPD has the MOVAPS opcode with a 66h prefix, so it works, too.
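
The encodings make this easy to see. The sketch below lists the register-to-register forms as byte arrays (ModRM byte C1 selects registers 0 and 1) and checks that each SSE2 form is just the older one with 66h in front. `differs_only_by_66` is a hypothetical helper; how a given PIII actually treats the prefix is as described in the post, not something this code can show.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encodings for register-to-register forms (ModRM C1 = reg 0, rm 1). */
static const unsigned char pcmpeqb_mmx[]  = {0x0F, 0x74, 0xC1};        /* pcmpeqb mm0, mm1   */
static const unsigned char pcmpeqb_sse2[] = {0x66, 0x0F, 0x74, 0xC1};  /* pcmpeqb xmm0, xmm1 */
static const unsigned char movaps_sse[]   = {0x0F, 0x28, 0xC1};        /* movaps xmm0, xmm1  */
static const unsigned char movapd_sse2[]  = {0x66, 0x0F, 0x28, 0xC1};  /* movapd xmm0, xmm1  */

/* Returns 1 when `with` is exactly the 66h prefix followed by `without`. */
int differs_only_by_66(const unsigned char *with, size_t with_len,
                       const unsigned char *without, size_t without_len)
{
    assert(with != NULL && without != NULL);
    return with_len == without_len + 1
        && with[0] == 0x66
        && memcmp(with + 1, without, without_len) == 0;
}
```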
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 26, 2013, 11:35:51 AM
this is starting to get a bit obsessive  :lol:

the new conventional algo:
Code: [Select]
mov eax,A[0]
sub eax,B[0]
jnz @0
mov eax,A[4]
sub eax,B[4]
jnz @1
mov eax,A[8]
sub eax,B[8]
jnz @2
mov eax,A[12]
sub eax,B[12]
jmp @end
@0: mov eax,A[4]
sbb eax,B[4]
@1: mov eax,A[8]
sbb eax,B[8]
@2: mov eax,A[12]
sbb eax,B[12]
mov eax,1
bsf eax,eax
the new SSE2 algo:
Code: [Select]
movdqu xmm0,A[0]
movdqu xmm1,B[0]
pcmpeqd xmm0,xmm1
movmskps eax,xmm0
sub al,1111B
jnz @0
mov eax,A[12]
sub eax,B[12]
jmp @end
@0: mov eax,A[0]
sub eax,B[0]
mov eax,A[4]
sbb eax,B[4]
mov eax,A[8]
sbb eax,B[8]
mov eax,A[12]
sbb eax,B[12]
mov eax,1
bsf eax,eax

and the time:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
619669 cycles [x][x][x] - Cmp128Nidud
526969 cycles [x][x][x] - Cmp128NidudSSE
783632 cycles [x][x][x] - Cmp128Dave
1207680 cycles [x][x][x] - Cmp128Dave2
1069604 cycles [x][x][x] - Cmp128JJAlexSSE_1
1070598 cycles [x][x][x] - Cmp128JJAlexSSE_2
1074710 cycles [x][x][x] - Cmp128JJAlexSSE_3
615174 cycles [x][x][ ] - Cmp128Alex
1087909 cycles [x][x][ ] - Cmp128JJSSE
705998 cycles [x][x][ ] - AxCMP128bitProc3
704783 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
450940 cycles [x][ ][ ] - Cmp128DaveU
441532 cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 26, 2013, 12:38:23 PM
this is starting to get a bit obsessive  :lol:

Yes :biggrin:

Code: [Select]
mov eax,1
bsf eax,eax

Does this work? :icon_eek:


Here you can simplify a bit:
Code: [Select]
movmskps eax,xmm0
sub al,1111B
jnz @0

Instead of jnz @0, use jz to the exit of the macro: the zero (equal) result has already been handled correctly.


Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2302893 cycles [x][x][x] - Cmp128Nidud
2441613 cycles [x][x][x] - Cmp128NidudSSE
2871717 cycles [x][x][x] - Cmp128Dave
4208738 cycles [x][x][x] - Cmp128Dave2
1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1
1695861 cycles [x][x][x] - Cmp128JJAlexSSE_2
1946274 cycles [x][x][x] - Cmp128JJAlexSSE_3
985137  cycles [x][x][ ] - Cmp128Alex
2063049 cycles [x][x][ ] - Cmp128JJSSE
1411323 cycles [x][x][ ] - AxCMP128bitProc3
1324774 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
756882  cycles [x][ ][ ] - Cmp128DaveU
784458  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 26, 2013, 01:57:34 PM
Feed the obsession...
Code: [Select]
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
---------------------------------------------------
305387  cycles [x][x][x] - Cmp128Nidud
308974  cycles [x][x][x] - Cmp128NidudSSE
617569  cycles [x][x][x] - Cmp128Dave
1184205 cycles [x][x][x] - Cmp128Dave2
273918  cycles [x][x][x] - Cmp128JJAlexSSE_1
319743  cycles [x][x][x] - Cmp128JJAlexSSE_2
319190  cycles [x][x][x] - Cmp128JJAlexSSE_3
452218  cycles [x][x][ ] - Cmp128Alex
323382  cycles [x][x][ ] - Cmp128JJSSE
417314  cycles [x][x][ ] - AxCMP128bitProc3
395354  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
341747  cycles [x][ ][ ] - Cmp128DaveU
348616  cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 26, 2013, 05:14:49 PM
Feed the obsession...

Me too :biggrin:
Code: [Select]
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
844     kCycles [x][x][x] - Cmp128Dave
1299    kCycles [x][x][x] - Cmp128Dave2
846     kCycles [x][x][x] - Cmp128Nidud
922     kCycles [x][x][x] - Cmp128NidudSSE
644     kCycles [x][x][ ] - Cmp128Alex
1557    kCycles [x][x][ ] - Cmp128JJSSE
1471    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1465    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1465    kCycles [x][x][x] - Cmp128JJAlexSSE_3
802     kCycles [x][x][ ] - AxCMP128bitProc3
772     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
543     kCycles [x][ ][ ] - Cmp128DaveU
543     kCycles [x][ ][ ] - Cmp128NidudU

P.S.: Added a sar eax, 10, and changed test_end "kCycles (x)(x)(x) - Cmp128Dave2"
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Siekmanski on August 26, 2013, 05:55:31 PM
Code: [Select]
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
700812  cycles [x][x][x] - Cmp128Nidud
729986  cycles [x][x][x] - Cmp128NidudSSE
971877  cycles [x][x][x] - Cmp128Dave
3026250 cycles [x][x][x] - Cmp128Dave2
782064  cycles [x][x][x] - Cmp128JJAlexSSE_1
890599  cycles [x][x][x] - Cmp128JJAlexSSE_2
926681  cycles [x][x][x] - Cmp128JJAlexSSE_3
682186  cycles [x][x][ ] - Cmp128Alex
1067566 cycles [x][x][ ] - Cmp128JJSSE
882899  cycles [x][x][ ] - AxCMP128bitProc3
888908  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
592661  cycles [x][ ][ ] - Cmp128DaveU
570588  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 26, 2013, 08:05:09 PM
Yum-yum!

New macro added: a brute-force rework of the original GPR macro to make it work just like CMP (it passes Dave's check).

Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2184021 cycles [x][x][x] - Cmp128Nidud
2313648 cycles [x][x][x] - Cmp128NidudSSE
2767063 cycles [x][x][x] - Cmp128Dave
4086277 cycles [x][x][x] - Cmp128Dave2
1672157 cycles [x][x][x] - Cmp128JJAlexSSE_1
1644385 cycles [x][x][x] - Cmp128JJAlexSSE_2
1889066 cycles [x][x][x] - Cmp128JJAlexSSE_3
980736  cycles [x][x][ ] - Cmp128Alex
1851407 cycles [x][x][x] - Cmp128Alex_2
1899452 cycles [x][x][x] - Cmp128Alex_3
2048700 cycles [x][x][ ] - Cmp128JJSSE
1388635 cycles [x][x][ ] - AxCMP128bitProc3
1311284 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
756260  cycles [x][ ][ ] - Cmp128DaveU
775831  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

I think AMD will probably like it better than Intel.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 26, 2013, 08:08:59 PM
the first test is strange:
Code: [Select]
pre-P4 (SSE1)
most of the code used in the macros is SSE2


But half of the current code is GPR, too: there is Dave's, your and my code that doesn't use SSE at all :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 12:29:46 AM
No timings?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 27, 2013, 01:12:12 AM
Code: [Select]
mov eax,1
bsf eax,eax

Does this work? :icon_eek:

The trick is to clear the zero flag without changing any of the other flags. In Dave’s code this is done like this:
Code: [Select]
    jnz     c4
    lahf
    lea     eax,[eax-4000h]
    sahf

BSF is one of the few opcodes that only modifies ZF, but it is a bit slow.
Code: [Select]
Clocks
BSF 6-42
BSR 6-103
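
To see why the LAHF/LEA/SAHF sequence only touches ZF: LAHF copies SF, ZF, AF, PF and CF into AH, where ZF is bit 6, i.e. bit 14 of EAX (4000h). LEA does its arithmetic without writing any flags, and since ZF is known to be set at that point (the jnz fell through), subtracting 4000h clears exactly that bit without a borrow. A small C model of the register contents, with `clear_zf_like_lea` as a made-up name:

```c
#include <assert.h>
#include <stdint.h>

/* Flag positions inside AH after LAHF. */
enum { AH_CF = 0x01, AH_PF = 0x04, AH_AF = 0x10, AH_ZF = 0x40, AH_SF = 0x80 };

/* eax_after_lahf: EAX with the flag byte in AH (bits 8..15), as LAHF leaves it.
   Precondition: ZF is set (the sequence is only reached when JNZ falls through).
   Returns EAX as it would look just before SAHF. */
uint32_t clear_zf_like_lea(uint32_t eax_after_lahf)
{
    assert(eax_after_lahf & ((uint32_t)AH_ZF << 8));  /* ZF must be set */
    return eax_after_lahf - 0x4000u;                  /* lea eax,[eax-4000h] */
}
```

SAHF then writes AH back, so SF, AF, PF and CF return unchanged and only ZF comes back cleared. SAHF does not write OF, so OF is not touched by this sequence either.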

Quote
Here you can simplify a bit:
Code: [Select]
movmskps eax,xmm0
sub al,1111B
jnz @0

Instead of jnz @0, use jz to the exit of the macro: the zero (equal) result has already been handled correctly.

ah, thanks  :t
Code: [Select]
movdqu xmm0,A[0]
movdqu xmm1,B[0]
pcmpeqd xmm0,xmm1
movmskps eax,xmm0
sub eax,1111B
jz @end
mov eax,A[0]
sub eax,B[0]
mov eax,A[4]
sbb eax,B[4]
mov eax,A[8]
sbb eax,B[8]
mov eax,A[12]
sbb eax,B[12]
mov eax,1
bsf eax,eax

timings:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
617117 cycles [x][x][x] - Cmp128Nidud
524352 cycles [x][x][x] - Cmp128NidudSSE
783415 cycles [x][x][x] - Cmp128Dave
1206666 cycles [x][x][x] - Cmp128Dave2
1069902 cycles [x][x][x] - Cmp128JJAlexSSE_1
1070353 cycles [x][x][x] - Cmp128JJAlexSSE_2
1074687 cycles [x][x][x] - Cmp128JJAlexSSE_3
613814 cycles [x][x][ ] - Cmp128Alex
973664 cycles [x][x][x] - Cmp128Alex_2
970668 cycles [x][x][x] - Cmp128Alex_3
1090601 cycles [x][x][ ] - Cmp128JJSSE
704123 cycles [x][x][ ] - AxCMP128bitProc3
712922 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
466620 cycles [x][ ][ ] - Cmp128DaveU
440970 cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 27, 2013, 08:48:04 AM
i did create a new set of values
but, i haven't had time to validate the standard flags proc

Hi,

   In Reply #187 Dave had an array of test values.  I just tested
my algorithm against those, and passed.  I created the check
values as he had in his earlier validation program as that was what
I based mine on.  Would that still be of use to anyone else?  I know
you want fast algorithms, and mine is most probably slow.  (And it
is 16-bit to run it on an 80186.)  Anyone interested in it?

Regards,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 12:32:27 PM
Code: [Select]
mov eax,1
bsf eax,eax

Does this work? :icon_eek:

The trick is to clear the zero flag without changing any of the other flags. In Dave’s code this is done like this:
Code: [Select]
    jnz     c4
    lahf
    lea     eax,[eax-4000h]
    sahf

BSF is one of the few opcodes that only modifies ZF, but it is a bit slow.
Code: [Select]
Clocks
BSF 6-42
BSR 6-103

The problem is that BSF defines only the ZF flag; the other flags are "undefined" after the instruction. For flags this means their state is completely unpredictable, and on my CPU (and maybe (!) on every Intel) they are all zeroed, except ZF. In short, BSF cannot be used robustly for this purpose (if some CPU leaves the flags untouched, another CPU may trash them). Check it on your CPU: does BSF trash the other flags? Did it pass Dave's check? If so, your CPU doesn't change the other flags with BSF; otherwise it should not pass the check.



Can I ask everyone for more timings for this archive? http://masm32.com/board/index.php?topic=2222.msg23743#msg23743
It would be interesting to see how worthwhile the rework of the old code is.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 01:20:42 PM
Anyone interested in it?

Yes, Steve, of course! I'm interested :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 27, 2013, 02:18:51 PM
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2044430 cycles [x][x][x] - Cmp128Nidud
2232933 cycles [x][x][x] - Cmp128NidudSSE
2631658 cycles [x][x][x] - Cmp128Dave
3862003 cycles [x][x][x] - Cmp128Dave2
1601513 cycles [x][x][x] - Cmp128JJAlexSSE_1
1559401 cycles [x][x][x] - Cmp128JJAlexSSE_2
1791892 cycles [x][x][x] - Cmp128JJAlexSSE_3
935826  cycles [x][x][ ] - Cmp128Alex
1729147 cycles [x][x][x] - Cmp128Alex_2
1779960 cycles [x][x][x] - Cmp128Alex_3
1913773 cycles [x][x][ ] - Cmp128JJSSE
1302324 cycles [x][x][ ] - AxCMP128bitProc3
1253729 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
701808  cycles [x][ ][ ] - Cmp128DaveU
752020  cycles [x][ ][ ] - Cmp128NidudU
Code: [Select]
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2033683 cycles [x][x][x] - Cmp128Nidud
2247710 cycles [x][x][x] - Cmp128NidudSSE
2628652 cycles [x][x][x] - Cmp128Dave
3813015 cycles [x][x][x] - Cmp128Dave2
1629220 cycles [x][x][x] - Cmp128JJAlexSSE_1
1591177 cycles [x][x][x] - Cmp128JJAlexSSE_2
1794286 cycles [x][x][x] - Cmp128JJAlexSSE_3
936215  cycles [x][x][ ] - Cmp128Alex
1725124 cycles [x][x][x] - Cmp128Alex_2
1782223 cycles [x][x][x] - Cmp128Alex_3
1900926 cycles [x][x][ ] - Cmp128JJSSE
1331104 cycles [x][x][ ] - AxCMP128bitProc3
1260544 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
696199  cycles [x][ ][ ] - Cmp128DaveU
734917  cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Siekmanski on August 27, 2013, 03:25:08 PM
Code: [Select]
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
700120  cycles [x][x][x] - Cmp128Nidud
730067  cycles [x][x][x] - Cmp128NidudSSE
972859  cycles [x][x][x] - Cmp128Dave
3028178 cycles [x][x][x] - Cmp128Dave2
784587  cycles [x][x][x] - Cmp128JJAlexSSE_1
890498  cycles [x][x][x] - Cmp128JJAlexSSE_2
928216  cycles [x][x][x] - Cmp128JJAlexSSE_3
680786  cycles [x][x][ ] - Cmp128Alex
1108150 cycles [x][x][x] - Cmp128Alex_2
1114646 cycles [x][x][x] - Cmp128Alex_3
1069239 cycles [x][x][ ] - Cmp128JJSSE
871461  cycles [x][x][ ] - AxCMP128bitProc3
889968  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
592212  cycles [x][ ][ ] - Cmp128DaveU
570113  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 04:56:43 PM
Thank you very much, Dave and Marinus! :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on August 27, 2013, 05:07:12 PM
Here you go Alex
Code: [Select]
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
---------------------------------------------------
332168  cycles [x][x][x] - Cmp128Nidud
339908  cycles [x][x][x] - Cmp128NidudSSE
671045  cycles [x][x][x] - Cmp128Dave
1346792 cycles [x][x][x] - Cmp128Dave2
299201  cycles [x][x][x] - Cmp128JJAlexSSE_1
386241  cycles [x][x][x] - Cmp128JJAlexSSE_2
398699  cycles [x][x][x] - Cmp128JJAlexSSE_3
508040  cycles [x][x][ ] - Cmp128Alex
610825  cycles [x][x][x] - Cmp128Alex_2
608946  cycles [x][x][x] - Cmp128Alex_3
378717  cycles [x][x][ ] - Cmp128JJSSE
467459  cycles [x][x][ ] - AxCMP128bitProc3
417007  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
360011  cycles [x][ ][ ] - Cmp128DaveU
383098  cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 27, 2013, 05:22:34 PM
Quote
The problem is that BSF sets only the ZF flag; the other flags are "undefined" after the instruction. For the flags this means that their state is unpredictable, and on my CPU (and maybe (!) on every Intel) they are all (except ZF) zeroed. In short, BSF cannot be used robustly for this purpose: if on some CPU the flags aren't touched, on another CPU they may be trashed. Check it on your CPU: does BSF trash the other flags? Did it pass Dave's check? If so, your CPU doesn't change the other flags with BSF; otherwise it should not pass the check.

Unless this is specifically stated in the Intel manual, that can't be the case. It would be the same as saying that "on some CPUs the opcode INC sometimes clears the CF flag", which is not the case.

From the Intel manual:
Code: [Select]
AAD - Ascii Adjust for Division

Usage:  AAD
Modifies flags: SF ZF PF (AF,CF,OF undefined)

BSF - Bit Scan Forward (386+)

Usage:  BSF     dest,src
Modifies flags: ZF

DIV - Divide

Usage:  DIV     src
Modifies flags: (AF,CF,OF,PF,SF,ZF undefined)

I may be wrong in this claim, and if this is the case, the attached test will fail on some (yours?) CPUs.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 07:23:23 PM
Here you go Alex

Thank you very much, John! :biggrin:

Quote
The problem is that BSF sets only the ZF flag; the other flags are "undefined" after the instruction. For the flags this means that their state is unpredictable, and on my CPU (and maybe (!) on every Intel) they are all (except ZF) zeroed. In short, BSF cannot be used robustly for this purpose: if on some CPU the flags aren't touched, on another CPU they may be trashed. Check it on your CPU: does BSF trash the other flags? Did it pass Dave's check? If so, your CPU doesn't change the other flags with BSF; otherwise it should not pass the check.

Unless this is specifically stated in the Intel manual, that can't be the case. It would be the same as saying that "on some CPUs the opcode INC sometimes clears the CF flag", which is not the case.

I may be wrong in this claim, and if this is the case, the attached test will fail on some (yours?) CPUs.


It's stated in Intel's manual. Maybe you're using some portable plain-text (shortened) version of it, but in the full version the other flags are stated as "undefined".

Interestingly enough, it seems that your AMD doesn't trash the other flags.

(results truncated since too long)
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000000_00000001
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ CY
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000001_FFFFFFFF
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ CY

AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ NC
cmp C0000001_00000000_00000000_00000000 , 40000001_00000000_00000000_00000000
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ NC

1365 Failures

Press any key to continue ...

Can anyone here with an AMD or an Intel CPU run the test attached to the post above?
(Maybe we found the fastest "CPUID" functionality for the IsIntelOrAMD routine :biggrin:)
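
On the BSF/BSR flags: since only ZF is documented after these instructions (and the destination is undefined for a zero source), a robust routine has to do its own zero test and take nothing from the other flags. A minimal C sketch of that idea follows; `highest_set_bit` and its out-parameter are hypothetical names for illustration only and are not part of the testbed.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical portable stand-in for BSR.
   Returns false for x == 0, the case where BSR sets ZF and leaves its
   destination undefined; otherwise stores the index of the highest set bit. */
bool highest_set_bit(uint32_t x, unsigned *index)
{
    if (x == 0)
        return false;           /* caller must not read *index */
    unsigned i = 0;
    while (x >>= 1)             /* count how many times x can be halved */
        ++i;
    *index = i;
    return true;
}
```

The boolean return value plays the role of ZF; everything else the caller needs is recomputed explicitly instead of being read from flags the instruction does not define.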
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 27, 2013, 09:00:40 PM
with nidud's Cmp128Eval program, i get 1365 failures

here is a little test program....
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC

judging from the parity flag, it looks like it explicitly sets some of the flags, other than the ZF
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 09:43:41 PM
Yes, I have the same results

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC

Press any key to continue ...

Quote
judging from the parity flag, it looks like it explicitly sets some of the flags, other than the ZF
Quote
for my CPU (and maybe (!) for every Intel), they are all (except ZF) zeroed

They are all zeroed, and parity seems to be set properly for the result.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 27, 2013, 09:47:36 PM
Hi,

Anyone interested in it?

Yes, Steve, of course! I'm interested :t

   Okay, here it is.  16-bit, but would be easy to convert to 32-bits.

Code: [Select]
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Compare two large numbers.  Bigger than can fit into a register to be
; compared with the CMP instruction.  This Algorithm is based (loosely)
; on a discussion between dedndave and jj2007 of the MASM Forum.  With
; commentary from others.  See Comparing 128-bit numbers aka OWORDs, in
; The Laboratory.  Note, that the source and destination are subtracted
; differently between CMPS and CMP.  And that does not matter here as I
; only test for equality, where order doesn't matter.  The final result
; is from a CMP.
; SRN, 22/25 August 2013.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;  INPUT:  (E)SI pointing to an OWORD number.
;          (E)DI pointing to an OWORD number.
; OUTPUT:  Flags set from comparison.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CMPSVal PROC
        PUSH    SI      ; Dave is using these as counters, so preserve.
        PUSH    DI

        ADD     SI,15   ; Point to last (high) byte of OWORD.
        ADD     DI,15

        MOV     AH,[DI] ; Put OWORD high bytes into AH and DH.
        MOV     DH,[SI]

        MOV     CX,16
        STD             ; Go from high to low order bytes.
   REPE CMPSB           ; Do the comparison.

        CMP     CX,15   ; Fixed it.  Almost.
        JNE     CV_1

   REPE CMPSB
CV_1:
        CLD

        MOV     AL,[DI+1] ; Put lower order byte into AL and DL.
        MOV     DL,[SI+1]
        CMP     AX,DX     ; Return flags.

        POP     DI
        POP     SI

        RET

CMPSVal ENDP

Enjoy,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Siekmanski on August 27, 2013, 09:50:29 PM
Code: [Select]
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 10:08:50 PM
No, it will not work as IsIntelOrAMD check :biggrin:
Thank you, Marinus :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 10:16:33 PM
Hi Steve! Thank you :t

   Okay, here it is.  16-bit, but would be easy to convert to 32-bits.

Here is the algo that uses this idea, too :t
It uses the SSE2 instruction PCMPEQB to find non-matching bytes, instead of CMPSB, but the rest of the logic is the same.


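Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]          ; load both OWORDs (unaligned loads)
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1        ; 0FFh in every byte lane where the operands match
   pmovmskb ecx,xmm0          ; one mask bit per byte: 1 = equal
   xor ecx,0FFFFh             ; invert: 1 = byte differs
   jz @l2                     ; all 16 bytes equal: leave with ZF set, CF and OF clear
   and ecx,7FFFh              ; drop byte 15, it always goes into AH/DH
   bsr ecx,ecx                ; highest differing byte in 0..14 (destination undefined if the source is 0)
   mov ah,byte ptr [ow0+15]   ; high bytes decide sign and overflow
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]  ; highest differing lower bytes decide the borrow
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx                  ; flags as for a full 128-bit CMP
   @l2:
ENDM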


(this version is faster than the one included earlier in the testbed)
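
For anyone who wants to check the logic without an assembler: below is a rough C model of what the macro computes, assuming little-endian OWORDs (byte 15 is the most significant). The names `Flags`, `cmp16` and `cmp128_model` are made up for this sketch. It builds the same two 16-bit values (high byte plus highest differing lower byte) and derives CF/ZF/SF/OF from a 16-bit subtraction; when only byte 15 differs, the model simply picks index 0.

```c
#include <stdint.h>

/* The four CMP flags discussed in the thread (hypothetical helper type). */
typedef struct { int cf, zf, sf, of; } Flags;

/* Flags of a 16-bit "cmp a, b", i.e. of a - b. */
Flags cmp16(uint16_t a, uint16_t b)
{
    uint16_t r = (uint16_t)(a - b);
    Flags f;
    f.cf = a < b;                              /* unsigned borrow */
    f.zf = (r == 0);
    f.sf = (r >> 15) & 1;
    f.of = (((a ^ b) & (a ^ r)) >> 15) & 1;    /* signed overflow of a - b */
    return f;
}

/* Model of the macro: x and y are little-endian 16-byte numbers. */
Flags cmp128_model(const uint8_t x[16], const uint8_t y[16])
{
    unsigned diff = 0;                         /* pcmpeqb + pmovmskb + xor: bit i set if byte i differs */
    for (int i = 0; i < 16; ++i)
        if (x[i] != y[i])
            diff |= 1u << i;

    if (diff == 0) {                           /* the jz @l2 path */
        Flags eq = { 0, 1, 0, 0 };
        return eq;
    }

    int idx = 0;                               /* and 7FFFh + bsr: highest differing byte in 0..14 */
    for (int i = 14; i >= 0; --i)
        if (diff & (1u << i)) { idx = i; break; }

    uint16_t ax = (uint16_t)((x[15] << 8) | x[idx]);
    uint16_t dx = (uint16_t)((y[15] << 8) | y[idx]);
    return cmp16(ax, dx);                      /* cmp ax,dx */
}
```

The reason this can work: the highest differing byte among bytes 0..14 decides whether the low 120 bits borrow from byte 15, so the 16-bit subtraction sees the same top byte and the same borrow as the full 128-bit one.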
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 27, 2013, 10:20:43 PM
Code: [Select]

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
720551 cycles [x][x][x] - Cmp128Nidud
760062 cycles [x][x][x] - Cmp128NidudSSE
1047587 cycles [x][x][x] - Cmp128Dave
2544535 cycles [x][x][x] - Cmp128Dave2
1452129 cycles [x][x][x] - Cmp128JJAlexSSE_1
1704401 cycles [x][x][x] - Cmp128JJAlexSSE_2
1711250 cycles [x][x][x] - Cmp128JJAlexSSE_3
832572 cycles [x][x][ ] - Cmp128Alex
1228790 cycles [x][x][x] - Cmp128Alex_2
1249508 cycles [x][x][x] - Cmp128Alex_3
1849514 cycles [x][x][ ] - Cmp128JJSSE
959424 cycles [x][x][ ] - AxCMP128bitProc3
1008975 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652808 cycles [x][ ][ ] - Cmp128DaveU
649697 cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 27, 2013, 10:25:40 PM

Here is the algo that uses this idea, too :t
It uses SSE2 instruction PCMPEQB to find non-matching bytes, instead of CMPSB, but other logic is the same.

Hi Alex,

   Interesting.  Nice to see the algorithm reworked.

Thanks,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 27, 2013, 10:30:40 PM
   Interesting.  Nice to see the algorithm reworked.

Yes, it's always interesting to see different implementations of an idea :biggrin:

Thank you for the test, Steve! :t
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 28, 2013, 01:36:24 AM
I like the "Intel 8086 Family Architecture" document because it includes timings for the opcodes, and I assumed backward compatibility across all instruction set architectures based on the Intel 8086 CPU.

Quote
Many additions and extensions have been added to the x86 instruction set over the years, almost consistently with full backward compatibility

Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC
Code: [Select]
Intel(R) Celeron(R) D CPU 3.20GHz (SSE3)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC
Code: [Select]
Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (SSE4)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Almost consistently:   :P
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PE NC
            BSR 1: NV PL NZ NA PE NC
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,

   First some more results.

Code: [Select]
pre-P4 (SSE1)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

pre-P4
    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

Press any key to continue ...

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

   Second, I tried to visualize the problems with setting the flags
correctly.  So I wrote a program to plot the flags for a normal
byte comparison and then for a truncated byte representing a
partial data set.  It did not help me in any particular way.  But
here it is anyway.  Mode 12H graphics.

Regards,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 28, 2013, 09:19:12 AM
I like the "Intel 8086 Family Architecture" document because it includes timings for the opcodes, and I assumed backward compatibility across all instruction set architectures based on the Intel 8086 CPU.

Quote
Many additions and extensions have been added to the x86 instruction set over the years, almost consistently with full backward compatibility

It's not your error :t (BTW, in Intel's 80386 reference the information is the same as in your reference)

As for clocks: after the PMMX, and especially after the P6 family was released, the old clock counts may be very outdated, all the more so for more or less modern CPUs (where timings are unpredictable not only between manufacturers, but even between different models of one microarchitecture).


Hi,

   First some more results.

Steve, is the third result for your PMMX?

The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 28, 2013, 10:26:34 PM
Hi,

   First some more results.


Steve, is the third result for your PMMX?


   Yes, the P-MMX with Windows 98.  A pity that it
does not follow Intel's rules.

Quote
The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?

   Maybe.  I was trying to visualize what information from
the high nybble / byte / double, in a byte / word / OWORD
would tell me about the complete result.  So I took the row
and column counters, both bytes, and plotted the results of
a compare for each of the four flags we were looking at.

Code: [Select]
        MOV     AH,[RowCount]
        MOV     AL,[ColCount]
        CMP     AL,AH
        MOV     [ucPixel],0
        MOV     BX,[SaveBX]
        MOV     SI,[TestTable+BX]
        PUSHF
        POP     DI
        TEST    SI,DI
        JZ      Start5
        MOV     [ucPixel],15
Start5:
        CALL SetPixel10

   If the flag is set, the pixel is white, otherwise black for the first
four plots or graphs.  I then do the same for a truncated case.

Code: [Select]
        MOV     AH,[RowCount]
        MOV     AL,[ColCount]
        AND     AH,0F0H
        AND     AL,0F0H
        CMP     AL,AH
        MOV     [ucPixel],0
        MOV     BX,[SaveBX]
        MOV     SI,[TestTable+BX]
        PUSHF
        POP     DI
        TEST    SI,DI
        JZ      Start8
        MOV     [ucPixel],15
Start8:
        CALL SetPixel10

   I was hoping that it would show a short-cut or some such.
All it showed was that for the majority of cases there is no such
short-cut that I could possibly see.  You have to do a bit better.
(I think, at least that is what I took away from this.)

   For the color plot, I am using the Mode 12H planar graphics
mode.  Sixteen colors with four planes.  So I assigned a flag to
its own plane.  So the colors show what flags are set by the
comparison.  I probably should change the palette to make the
results be clearer.  But I saw what I needed to, and so did not
bother.  I could update that if you think it would help.  (And you
see a good and proper mapping of the four flags to three primary
colors.)

   The only notable fact that I saw was that if the Zero Flag is set,
none of the other flags being considered is set.  So you can take an early
exit from the algorithm if the CMPS result is zero.  I did not bother
with mine.*  (Though I, or someone, should time both versions.)

    Given the other colors, any combination of the other three flags
is possible.**  And again, no obvious simplification was apparent to
me.

   I hope that explains most if not all of your question.

Regards,

Steve N.

Edit:

*  Zero is set, on average, once out of 256 times for the byte
comparison.  And it just gets worse as the size increases.

SRN

Edit:

**  Said that wrong.  The colors show which flags can be set
together, as shown by the labels.  Not all the possibilities occur.

SRN
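
For readers without a Mode 12H display, the experiment described above can be approximated in C. This is only a sketch under my own assumptions: the flag packing (CF=1, ZF=2, SF=4, OF=8) and all function names are invented here and are not taken from PlotFlag. It compares every (col,row) byte pair, once in full and once truncated to the high nybble, and counts how often a chosen flag agrees.

```c
#include <stdint.h>

/* Flags of an 8-bit "cmp a, b", packed as CF=1, ZF=2, SF=4, OF=8
   (an arbitrary packing for this sketch, not the EFLAGS layout). */
unsigned cmp8_flags(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a - b);
    unsigned f = 0;
    if (a < b)                       f |= 1;   /* carry (borrow) */
    if (r == 0)                      f |= 2;   /* zero */
    if (r & 0x80)                    f |= 4;   /* sign */
    if (((a ^ b) & (a ^ r)) & 0x80)  f |= 8;   /* overflow */
    return f;
}

/* Number of (col,row) pairs where the flags selected by mask are identical
   for the full compare and for the compare of the high nybbles only. */
unsigned truncated_matches(unsigned mask)
{
    unsigned n = 0;
    for (unsigned row = 0; row < 256; ++row)
        for (unsigned col = 0; col < 256; ++col) {
            unsigned full = cmp8_flags((uint8_t)col, (uint8_t)row);
            unsigned cut  = cmp8_flags((uint8_t)(col & 0xF0), (uint8_t)(row & 0xF0));
            if ((full & mask) == (cut & mask))
                ++n;
        }
    return n;
}

/* Number of pairs whose full compare sets ZF. */
unsigned zero_pairs(void)
{
    unsigned n = 0;
    for (unsigned row = 0; row < 256; ++row)
        for (unsigned col = 0; col < 256; ++col)
            if (cmp8_flags((uint8_t)col, (uint8_t)row) & 2)
                ++n;
    return n;
}

/* Returns 1 if, whenever ZF is set, none of CF/SF/OF is set. */
int zero_excludes_others(void)
{
    for (unsigned row = 0; row < 256; ++row)
        for (unsigned col = 0; col < 256; ++col) {
            unsigned f = cmp8_flags((uint8_t)col, (uint8_t)row);
            if ((f & 2) && f != 2)
                return 0;
        }
    return 1;
}
```

`zero_pairs()` corresponds to the "once out of 256 times" footnote, and `truncated_matches(mask)` gives a number for what the plots show visually: how much of the full result survives when only the high nybble is looked at.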
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 28, 2013, 10:28:14 PM
that is really strange
if it isn't going to execute the instruction correctly, at least it could BSOD or something   :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 29, 2013, 12:33:05 AM
ok, the BSF thing didn’t work, so it's one step back  :P
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
784468 cycles [x][x][x] - Cmp128Nidud
701224 cycles [x][x][x] - Cmp128NidudSSE
872713 cycles [x][x][x] - Cmp128Dave
1224045 cycles [x][x][x] - Cmp128Dave2
1068347 cycles [x][x][x] - Cmp128JJAlexSSE_1
1072113 cycles [x][x][x] - Cmp128JJAlexSSE_2
1074183 cycles [x][x][x] - Cmp128JJAlexSSE_3
515851 cycles [x][x][ ] - Cmp128Alex
947269 cycles [x][x][x] - Cmp128Alex_2
952060 cycles [x][x][x] - Cmp128Alex_3
1089283 cycles [x][x][ ] - Cmp128JJSSE
685943 cycles [x][x][ ] - AxCMP128bitProc3
690706 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
530917 cycles [x][ ][ ] - Cmp128DaveU
531786 cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 29, 2013, 02:52:20 AM
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2610027 cycles [x][x][x] - Cmp128Nidud
2886072 cycles [x][x][x] - Cmp128NidudSSE
2616611 cycles [x][x][x] - Cmp128Dave
3812015 cycles [x][x][x] - Cmp128Dave2
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1
1574406 cycles [x][x][x] - Cmp128JJAlexSSE_2
1786771 cycles [x][x][x] - Cmp128JJAlexSSE_3
953366  cycles [x][x][ ] - Cmp128Alex
1705162 cycles [x][x][x] - Cmp128Alex_2
1775305 cycles [x][x][x] - Cmp128Alex_3
1896256 cycles [x][x][ ] - Cmp128JJSSE
1310303 cycles [x][x][ ] - AxCMP128bitProc3
1255525 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
697520  cycles [x][ ][ ] - Cmp128DaveU
717197  cycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 29, 2013, 03:43:56 AM
Code: [Select]
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
987     kCycles [x][x][x] - Cmp128Nidud
1066    kCycles [x][x][x] - Cmp128NidudSSE
956     kCycles [x][x][x] - Cmp128Dave
2577    kCycles [x][x][x] - Cmp128Dave2
820     kCycles [x][x][x] - Cmp128JJAlexSSE_1 <<<<<<<<<<
929     kCycles [x][x][x] - Cmp128JJAlexSSE_2
950     kCycles [x][x][x] - Cmp128JJAlexSSE_3
688     kCycles [x][x][ ] - Cmp128Alex
1066    kCycles [x][x][x] - Cmp128Alex_2
1112    kCycles [x][x][x] - Cmp128Alex_3
1109    kCycles [x][x][ ] - Cmp128JJSSE
862     kCycles [x][x][ ] - AxCMP128bitProc3
870     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
598     kCycles [x][ ][ ] - Cmp128DaveU
586     kCycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 29, 2013, 04:44:55 AM
Hmm, maybe the CPU's which "cheat" with the BSF/BSR flags are faster
Alex/JJ use BSR

My test: "no cheat"
Code: [Select]
701224 cycles [x][x][x] - Cmp128NidudSSE
1068347 cycles [x][x][x] - Cmp128JJAlexSSE_1
Siekmanski: "no cheat"
Code: [Select]
730067  cycles [x][x][x] - Cmp128NidudSSE
784587  cycles [x][x][x] - Cmp128JJAlexSSE_1

Alex: "cheat"
Code: [Select]
2441613 cycles [x][x][x] - Cmp128NidudSSE
1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1
Dave: "cheat"
Code: [Select]
2886072 cycles [x][x][x] - Cmp128NidudSSE
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 30, 2013, 06:09:11 AM
Hi,

   I made the PlotFlag program with an improved color scheme.
And I fixed an error in the labeling of the combined flags.  Oops.
I updated Reply #260 with the fixed program.  The Overflow,
Sign, and Carry flags are assigned to blue, green, and red so
that their combinations should make sense.

Regards,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 30, 2013, 12:22:20 PM
Hi Steve! Thanks for your explanation :biggrin: Yes, now it explains things: earlier I did not actually get how the graphs correspond to the flags; after your more detailed explanation it became clear.

ok, the BSF thing didn’t work, so it's one step back  :P

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2799747 cycles [x][x][x] - Cmp128Nidud
3028917 cycles [x][x][x] - Cmp128NidudSSE
2740311 cycles [x][x][x] - Cmp128Dave
4045304 cycles [x][x][x] - Cmp128Dave2
1664740 cycles [x][x][x] - Cmp128JJAlexSSE_1
1633226 cycles [x][x][x] - Cmp128JJAlexSSE_2
1878213 cycles [x][x][x] - Cmp128JJAlexSSE_3
973406  cycles [x][x][ ] - Cmp128Alex
1835600 cycles [x][x][x] - Cmp128Alex_2
1878425 cycles [x][x][x] - Cmp128Alex_3
1968746 cycles [x][x][ ] - Cmp128JJSSE
1374576 cycles [x][x][ ] - AxCMP128bitProc3
1290053 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
739767  cycles [x][ ][ ] - Cmp128DaveU
744369  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

BTW, in my testbed I replaced a couple of macros but did not post them (the results above are not for the updated procs, though):

Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM


and


Cmp128JJAlexSSE_2 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   movzx eax,byte ptr [ow0+14]
   movzx edx,byte ptr [ow1+14]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM


Hmm, maybe the CPU's which "cheat" with the BSF/BSR flags are faster
Alex/JJ use BSR


Well, actually it should not be so, because, as we see, the desktop PIV models (my and Dave's Prescotts) do not just trash all the flags but set them in a logically consistent way (zero all flags "unused after the instruction", set the parity flag according to the result, though it need not even bother with that, and set the zero flag as defined) instead of leaving them unchanged. So it should be even slower on our CPUs :biggrin:

But the same PIV models are also much slower than more modern CPUs with some other instructions, and those are the real reason: SSE instructions, the SBB instruction, LAHF/SAHF.

Cmp128JJAlexSSE_1 uses code in its GPR part, without SUBs/SBBs, that is far faster on the PIV.
The fully GPR Cmp128Alex_2 is faster than Cmp128Nidud for the same reason.

On modern CPUs your GPR code "should be" faster than my GPR code, for instance. But for SSE code the timings in the thread vary a lot; probably, on very modern Intel CPU models my SSE code "should be" faster.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 30, 2013, 10:11:42 PM
Jochen's latest attachment...

prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2557    kCycles [x][x][x] - Cmp128Nidud
2815    kCycles [x][x][x] - Cmp128NidudSSE
2517    kCycles [x][x][x] - Cmp128Dave
3715    kCycles [x][x][x] - Cmp128Dave2
1557    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1521    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1753    kCycles [x][x][x] - Cmp128JJAlexSSE_3
914     kCycles [x][x][ ] - Cmp128Alex
1698    kCycles [x][x][x] - Cmp128Alex_2
1726    kCycles [x][x][x] - Cmp128Alex_3
1850    kCycles [x][x][ ] - Cmp128JJSSE
1298    kCycles [x][x][ ] - AxCMP128bitProc3
1228    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
696     kCycles [x][ ][ ] - Cmp128DaveU
690     kCycles [x][ ][ ] - Cmp128NidudU
Code: [Select]
------------------------------------------------------
2576    kCycles [x][x][x] - Cmp128Nidud
2804    kCycles [x][x][x] - Cmp128NidudSSE
2543    kCycles [x][x][x] - Cmp128Dave
3723    kCycles [x][x][x] - Cmp128Dave2
1557    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1518    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1853    kCycles [x][x][x] - Cmp128JJAlexSSE_3
918     kCycles [x][x][ ] - Cmp128Alex
1697    kCycles [x][x][x] - Cmp128Alex_2
1733    kCycles [x][x][x] - Cmp128Alex_3
1848    kCycles [x][x][ ] - Cmp128JJSSE
1280    kCycles [x][x][ ] - AxCMP128bitProc3
1211    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
677     kCycles [x][ ][ ] - Cmp128DaveU
707     kCycles [x][ ][ ] - Cmp128NidudU
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 31, 2013, 12:04:27 AM
Hi,

   From Reply # 266.

Code: [Select]
Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1012 kCycles [x][x][x] - Cmp128Nidud
1153 kCycles [x][x][x] - Cmp128NidudSSE
1049 kCycles [x][x][x] - Cmp128Dave
2520 kCycles [x][x][x] - Cmp128Dave2
1441 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1682 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1686 kCycles [x][x][x] - Cmp128JJAlexSSE_3
824 kCycles [x][x][ ] - Cmp128Alex
1200 kCycles [x][x][x] - Cmp128Alex_2
1261 kCycles [x][x][x] - Cmp128Alex_3
1823 kCycles [x][x][ ] - Cmp128JJSSE
951 kCycles [x][x][ ] - AxCMP128bitProc3
987 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
640 kCycles [x][ ][ ] - Cmp128DaveU
633 kCycles [x][ ][ ] - Cmp128NidudU

--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
986 kCycles [x][x][x] - Cmp128Nidud
1134 kCycles [x][x][x] - Cmp128NidudSSE
1024 kCycles [x][x][x] - Cmp128Dave
2491 kCycles [x][x][x] - Cmp128Dave2
1423 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1668 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1674 kCycles [x][x][x] - Cmp128JJAlexSSE_3
817 kCycles [x][x][ ] - Cmp128Alex
1190 kCycles [x][x][x] - Cmp128Alex_2
1249 kCycles [x][x][x] - Cmp128Alex_3
1813 kCycles [x][x][ ] - Cmp128JJSSE
941 kCycles [x][x][ ] - AxCMP128bitProc3
986 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
639 kCycles [x][ ][ ] - Cmp128DaveU
630 kCycles [x][ ][ ] - Cmp128NidudU

--- ok ---

Regards,

Steve
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 31, 2013, 03:24:36 AM
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
765 kCycles [x][x][x] - Cmp128Nidud
685 kCycles [x][x][x] - Cmp128NidudSSE
765 kCycles [x][x][x] - Cmp128Dave
1177 kCycles [x][x][x] - Cmp128Dave2
1045 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1047 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1048 kCycles [x][x][x] - Cmp128JJAlexSSE_3
605 kCycles [x][x][ ] - Cmp128Alex
950 kCycles [x][x][x] - Cmp128Alex_2
948 kCycles [x][x][x] - Cmp128Alex_3
1064 kCycles [x][x][ ] - Cmp128JJSSE
688 kCycles [x][x][ ] - AxCMP128bitProc3
697 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
440 kCycles [x][ ][ ] - Cmp128DaveU
430 kCycles [x][ ][ ] - Cmp128NidudU

Quote
Well, actually it should not be so, because, as we see, the desktop PIV models (my and Dave's Prescotts) do not just trash all the flags but set them in a logically consistent way (zero all flags "unused after the instruction", set the parity flag according to the result, though it need not even bother with that, and set the zero flag as defined) instead of leaving them unchanged. So it should be even slower on our CPUs :biggrin:

Well, if that is correct the following test will prove your point
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
383145 cycles for CmpLEA
382951 cycles for CmpADD
384098 cycles for CmpINC
384502 cycles for CmpBSF
378250 cycles for CmpCLX
------------------------------------------------------
383944 cycles for CmpLEA
387003 cycles for CmpADD
383393 cycles for CmpINC
383522 cycles for CmpBSF
378291 cycles for CmpCLX
------------------------------------------------------
385948 cycles for CmpLEA
384310 cycles for CmpADD
383979 cycles for CmpINC
384283 cycles for CmpBSF
378046 cycles for CmpCLX

The BSF test should then be faster on my CPU  :P
but I suspect the following:

LEA will be faster than ADD on Dave's CPU
INC preserves CF, so this will be slower on Dave's CPU
BSF will be much faster on your and Dave's CPUs

I also think that manipulating/testing flags is slower on your and Dave's CPUs, which is why JXX becomes slower (Cmp128Nidud/Cmp128Dave).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on August 31, 2013, 04:12:20 AM
Jochen's latest attachment...

The only change was a shr eax, 10 to make the timings more readable...
BTW, it would be nice if the test for the [x][x][x] could be integrated with the timings. At present, there is only a hand-made static string... where are the authors of the magic test?

CmpFlag.zip:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
------------------------------------------------------
672764  cycles for CmpLEA
465288  cycles for CmpADD
464995  cycles for CmpINC
212725  cycles for CmpBSF
2981445 cycles for CmpCLX
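
A side note on the `shr eax, 10` scaling mentioned above: a right shift by 10 divides by 1024, not by 1000, so the "kCycles" columns read roughly 2% lower than the raw cycle counts divided by 1000 would. A tiny C sketch of the same arithmetic (`to_kcycles` is a name invented for this note):

```c
#include <stdint.h>

/* Same arithmetic as "shr eax, 10": divide by 1024, rounding down. */
uint32_t to_kcycles(uint32_t cycles)
{
    return cycles >> 10;
}
```

That is worth keeping in mind when comparing a kCycles table against one of the earlier raw-cycle tables for the same CPU.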
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 31, 2013, 05:29:25 AM
Quote
LEA will be faster than ADD on Dave's CPU
INC preserves CF, so this will be slower on Dave's CPU
BSF will be much faster on your and Dave's CPUs

today must be opposite day   :P
there are certain things that P4's are just not good at
i like developing on a P4, though - if it's fast on my machine, it'll be fast on every one else's   :lol:

not sure what the CmpCLX test is, but my CPU hates it
a good chance it is not doing what you want it to

Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
574802  cycles for CmpLEA
601257  cycles for CmpADD
912774  cycles for CmpINC
474878  cycles for CmpBSF
27755044        cycles for CmpCLX
------------------------------------------------------
624710  cycles for CmpLEA
609602  cycles for CmpADD
915751  cycles for CmpINC
494000  cycles for CmpBSF
27825419        cycles for CmpCLX
------------------------------------------------------
589298  cycles for CmpLEA
591765  cycles for CmpADD
909765  cycles for CmpINC
468357  cycles for CmpBSF
27813103        cycles for CmpCLX
------------------------------------------------------
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on August 31, 2013, 06:07:49 AM
Quote
today must be opposite day   :P

I assumed the following:
Quote
not sure what the CmpCLX test is
it manipulates the flags using STC/CLC/STD/CLD/CMC

Quote
if it's fast on my machine, it'll be fast on every one else's   :lol:
that will be the point: we all assume that to be so  :lol:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on August 31, 2013, 06:13:29 AM
Code: [Select]
pre-P4 (SSE1)
------------------------------------------------------
690329 cycles for CmpLEA
703742 cycles for CmpADD
709925 cycles for CmpINC
215679 cycles for CmpBSF
2708852 cycles for CmpCLX
------------------------------------------------------
688496 cycles for CmpLEA
704680 cycles for CmpADD
707613 cycles for CmpINC
215581 cycles for CmpBSF
2709056 cycles for CmpCLX
------------------------------------------------------
688429 cycles for CmpLEA
704797 cycles for CmpADD
707767 cycles for CmpINC
215511 cycles for CmpBSF
2707745 cycles for CmpCLX
------------------------------------------------------

--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
------------------------------------------------------
712592 cycles for CmpLEA
712084 cycles for CmpADD
711109 cycles for CmpINC
207906 cycles for CmpBSF
2633382 cycles for CmpCLX
------------------------------------------------------
710075 cycles for CmpLEA
711340 cycles for CmpADD
712251 cycles for CmpINC
207880 cycles for CmpBSF
2633572 cycles for CmpCLX
------------------------------------------------------
711945 cycles for CmpLEA
712603 cycles for CmpADD
710777 cycles for CmpINC
209128 cycles for CmpBSF
2631655 cycles for CmpCLX
------------------------------------------------------

--- ok ---

pre-P4
------------------------------------------------------
1386583 cycles for CmpLEA
734379 cycles for CmpADD
733891 cycles for CmpINC
1341089 cycles for CmpBSF
1865749 cycles for CmpCLX
------------------------------------------------------
1386134 cycles for CmpLEA
735003 cycles for CmpADD
734641 cycles for CmpINC
1341132 cycles for CmpBSF
1867097 cycles for CmpCLX
------------------------------------------------------
1382758 cycles for CmpLEA
736389 cycles for CmpADD
734488 cycles for CmpINC
1341860 cycles for CmpBSF
1867206 cycles for CmpCLX
------------------------------------------------------

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
713980 cycles for CmpLEA
716320 cycles for CmpADD
716456 cycles for CmpINC
214883 cycles for CmpBSF
3013235 cycles for CmpCLX
------------------------------------------------------
714376 cycles for CmpLEA
715769 cycles for CmpADD
716068 cycles for CmpINC
214918 cycles for CmpBSF
3013878 cycles for CmpCLX
------------------------------------------------------
714034 cycles for CmpLEA
716052 cycles for CmpADD
716264 cycles for CmpINC
214754 cycles for CmpBSF
3012735 cycles for CmpCLX
------------------------------------------------------

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on August 31, 2013, 07:31:48 AM
Quote
not sure what the CmpCLX test is
it manipulates the flags using STC/CLC/STD/CLD/CMC

ahhh.....
CLD and STD are slow as hell on P4's
and not all that fast on many other processors
they seem to be reasonable on your AMD
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on August 31, 2013, 09:57:28 PM
Quote
Well, actually it should not be so, because, as we see, desktop PIV models (mine and Dave's Prescotts) do not just trash all the flags but set them logically correctly (zero all "unused after instruction" flags, set the parity flag according to the result, though it should not even bother with it, and set the zero flag as defined) instead of leaving them unchanged, so it should be even slower on our CPUs :biggrin:

Well, if that is correct the following test will prove your point
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
383145 cycles for CmpLEA
382951 cycles for CmpADD
384098 cycles for CmpINC
384502 cycles for CmpBSF
378250 cycles for CmpCLX
------------------------------------------------------
383944 cycles for CmpLEA
387003 cycles for CmpADD
383393 cycles for CmpINC
383522 cycles for CmpBSF
378291 cycles for CmpCLX
------------------------------------------------------
385948 cycles for CmpLEA
384310 cycles for CmpADD
383979 cycles for CmpINC
384283 cycles for CmpBSF
378046 cycles for CmpCLX

The BSF test should then be faster on my CPU  :P

Well, it proved it (on other machines, too)

Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
------------------------------------------------------
490345  cycles for CmpLEA
524195  cycles for CmpADD
768829  cycles for CmpINC
397793  cycles for CmpBSF
28201301        cycles for CmpCLX
------------------------------------------------------
501654  cycles for CmpLEA
512675  cycles for CmpADD
758029  cycles for CmpINC
410060  cycles for CmpBSF
28179301        cycles for CmpCLX
------------------------------------------------------
489078  cycles for CmpLEA
521448  cycles for CmpADD
773169  cycles for CmpINC
404085  cycles for CmpBSF
28120852        cycles for CmpCLX
------------------------------------------------------

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on September 01, 2013, 03:01:49 AM
Is there a good statistic which could define the "average CPU" used globally?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 01, 2013, 05:14:37 AM
Hi,
I have shortened the MasmBasic Qcmp (and Ocmp) macro a little bit - and get zero failures now :biggrin:

Please include in Cmp128Eval and the timings.

include oqCmp.asm

align 16

        test_start
        lea esi,ow_table
        .repeat
            lea edi,ow_table
            .repeat
                Qcmp [esi], [edi]
                add edi,4
            .until edi >= offset eo_table
            add esi,4
        .until esi >= offset eo_table
        test_end "cycles (x)(x)(x) - MasmBasic Qcmp"

P.S.: Timings attached. I excluded one very slow algo and those which fail in the first two categories.

Code: [Select]
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945     kCycles [x][x][x] - Cmp128Dave
911     kCycles [x][x][x] - Cmp128Nidud
1013    kCycles [x][x][x] - Cmp128NidudSSE
684     kCycles [x][x][ ] - Cmp128Alex
1127    kCycles [x][x][x] - MasmBasic Ocmp  <<<<< OWORD
989     kCycles [x][x][x] - MasmBasic Qcmp
814     kCycles [x][x][x] - Cmp128JJAlexSSE_1
925     kCycles [x][x][x] - Cmp128JJAlexSSE_2
926     kCycles [x][x][x] - Cmp128JJAlexSSE_3
859     kCycles [x][x][ ] - AxCMP128bitProc3
868     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

And timings for an i5 - with the Qcmp and Alex1 three times each (the timings are not very stable on the i5, and the two algos are most interesting for me):
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
746     kCycles [x][x][x] - Cmp128Dave
600     kCycles [x][x][x] - Cmp128Nidud
714     kCycles [x][x][x] - Cmp128NidudSSE
494     kCycles [x][x][ ] - Cmp128Alex
429     kCycles [x][x][x] - MasmBasic Qcmp
388     kCycles [x][x][x] - MasmBasic Qcmp
429     kCycles [x][x][x] - MasmBasic Qcmp
428     kCycles [x][x][x] - Cmp128JJAlexSSE_1
401     kCycles [x][x][x] - Cmp128JJAlexSSE_1
428     kCycles [x][x][x] - Cmp128JJAlexSSE_1
437     kCycles [x][x][x] - Cmp128JJAlexSSE_2
427     kCycles [x][x][x] - Cmp128JJAlexSSE_3
530     kCycles [x][x][ ] - AxCMP128bitProc3
501     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on September 02, 2013, 05:40:14 AM
Hi,

   I posted a routine in Reply #252.  This quote is from Reply #262.

   The only notable fact that I saw was that if the Zero Flag is set,
none of the other flags being considered are set.  So you can take an
early exit from the algorithm if the CMPS result is zero.  I did not
bother with mine.*  (Though I, or someone, should time both versions.)

   Well, I timed it with and without the test for zero and an early
exit.  The one with the extra test was slower.  Tested with Dave's
112 test values.

Regards,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Siekmanski on September 02, 2013, 05:48:08 AM
from Reply #280,

Code: [Select]
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
942     kCycles [x][x][x] - Cmp128Dave
891     kCycles [x][x][x] - Cmp128Nidud
1023    kCycles [x][x][x] - Cmp128NidudSSE
673     kCycles [x][x][ ] - Cmp128Alex
828     kCycles [x][x][x] - MasmBasic Qcmp
766     kCycles [x][x][x] - Cmp128JJAlexSSE_1
869     kCycles [x][x][x] - Cmp128JJAlexSSE_2
876     kCycles [x][x][x] - Cmp128JJAlexSSE_3
867     kCycles [x][x][ ] - AxCMP128bitProc3
895     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on September 02, 2013, 06:24:05 AM
full test:
Code: [Select]
cmp 00000000_00000000_00000000_00000000 , 00000000_00000001_FFFFFFFF_00000001
 was: NO NS NZ CY should be: NO SF NZ CY
...

1318 Failures
unsigned:
Code: [Select]
898 Failures
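
For anyone who wants to check a failure line like the one above by hand, here is a minimal reference sketch in C. It is not code from the thread: the names `u128`, `flags128` and `cmp128_ref` are made up for illustration, and it assumes the operands are printed most significant dword first. It subtracts limb by limb with borrow and derives the four flags the test reports (OF, SF, ZF, CF).

```c
#include <stdint.h>

/* Illustrative model only: a 128-bit value as four 32-bit limbs,
   limb[0] least significant (little-endian OWORD layout). */
typedef struct { uint32_t limb[4]; } u128;

/* The four flags the test prints, as a 128-bit CMP would set them. */
typedef struct { int of, sf, zf, cf; } flags128;

static flags128 cmp128_ref(const u128 *a, const u128 *b)
{
    uint32_t diff[4];
    uint32_t borrow = 0, any = 0;
    flags128 f;

    /* a - b, limb by limb, propagating the borrow like SUB/SBB/SBB/SBB */
    for (int i = 0; i < 4; i++) {
        uint64_t t = (uint64_t)a->limb[i] - b->limb[i] - borrow;
        diff[i] = (uint32_t)t;
        borrow = (uint32_t)(t >> 63);   /* 1 if this limb wrapped around */
        any |= diff[i];
    }

    int sa = (int)(a->limb[3] >> 31);   /* sign bit of a */
    int sb = (int)(b->limb[3] >> 31);   /* sign bit of b */
    f.cf = (int)borrow;                 /* CY: unsigned a < b */
    f.zf = (any == 0);                  /* ZR: a == b */
    f.sf = (int)(diff[3] >> 31);        /* SF: top bit of the difference */
    f.of = (sa != sb) && (f.sf != sa);  /* OV: signed overflow of a - b */
    return f;
}
```

With a = 0 and b = 00000000_00000001_FFFFFFFF_00000001, this model yields OF=0, SF=1, ZF=0, CF=1, which agrees with the "should be: NO SF NZ CY" column.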
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 02, 2013, 06:58:41 AM
You can substantially reduce the number of failures (from 1318 to zero) if you use Ocmp ("O" like "O sole mio") instead of Qcmp  ;)

And yes, it's my fault because I erroneously used Qcmp in the timings. I was qonfused ::)

New version attached, with minor changes to Ocmp:
Code: [Select]
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945     kCycles [x][x][x] - Cmp128Dave
911     kCycles [x][x][x] - Cmp128Nidud
1013    kCycles [x][x][x] - Cmp128NidudSSE
684     kCycles [x][x][ ] - Cmp128Alex
1010    kCycles [x][x][x] - MasmBasic Ocmp
815     kCycles [x][x][x] - Cmp128JJAlexSSE_1
925     kCycles [x][x][x] - Cmp128JJAlexSSE_2
925     kCycles [x][x][x] - Cmp128JJAlexSSE_3
870     kCycles [x][x][ ] - AxCMP128bitProc3
867     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on September 02, 2013, 08:34:08 AM
Quote
And yes, it's my fault because I erroneously used Qcmp in the timings. I was qonfused ::)

well, it's difficult to read your "code", but I think it will be like this:
Code: [Select]
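movups xmm0,A[0]        ; load both 128-bit operands
movups xmm1,B[0]
movaps xmm2,xmm0        ; copy for pcmpeqb
pcmpeqb xmm2,xmm1       ; equal bytes become 0FFh
pmovmskb edx,xmm2       ; dx: bit n set where byte n of A equals byte n of B
not dx                  ; dx: bit n set where the bytes differ
and dh,07Fh             ; ignore byte 15, it is loaded separately
bsr edx,edx             ; index of the highest differing byte
push ecx
movzx ecx,byte ptr A[15]
bswap ecx               ; A[15] moves to the top byte of ecx
mov cl,A[edx]           ; low byte = differing byte of A
movzx edx,byte ptr B[edx]
bswap edx
mov dl,B[15]
bswap edx               ; edx: B[15] on top, differing byte of B at the bottom
cmp ecx,edx
pop ecx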

which could be reduced to this:
Code: [Select]
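movups xmm0,A[0]
movups xmm1,B[0]
pcmpeqb xmm0,xmm1       ; xmm0 may be overwritten, no copy needed
pmovmskb eax,xmm0
not eax
and eax,07FFFh          ; differing bytes 0..14 only
bsr eax,eax             ; index of the highest differing byte
mov dh,A[15]            ; high byte of each key = most significant byte
mov dl,A[eax]
mov al,B[eax]
mov ah,B[15]
cmp dx,ax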
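
The idea behind the reduced version can be sketched in portable C. This is a model, not the thread's code: `eq_mask16`, `bsr_model` and `make_keys16` are hypothetical names, and the byte loops stand in for PCMPEQB/PMOVMSKB and BSR. Each operand is collapsed into a 16-bit key whose high byte is byte 15 and whose low byte is the highest differing byte among bytes 0..14, so one 16-bit compare orders the two 128-bit values.

```c
#include <stdint.h>

/* Model of PCMPEQB + PMOVMSKB: bit i is set when byte i of a and b is equal. */
static unsigned eq_mask16(const uint8_t a[16], const uint8_t b[16])
{
    unsigned m = 0;
    for (int i = 0; i < 16; i++)
        if (a[i] == b[i])
            m |= 1u << i;
    return m;
}

/* Model of BSR: index of the highest set bit. Returns 0 for x == 0;
   that is a choice of this model, since the real instruction leaves
   its destination undefined for a zero source. */
static unsigned bsr_model(unsigned x)
{
    unsigned idx = 0;
    while (x >>= 1)
        idx++;
    return idx;
}

/* Builds the two 16-bit keys that "cmp dx,ax" would compare. */
static void make_keys16(const uint8_t a[16], const uint8_t b[16],
                        uint16_t *ka, uint16_t *kb)
{
    unsigned diff = ~eq_mask16(a, b) & 0x7FFFu;  /* not eax / and eax,07FFFh */
    unsigned idx  = bsr_model(diff);             /* bsr eax,eax */
    *ka = (uint16_t)((a[15] << 8) | a[idx]);     /* mov dh,A[15] / mov dl,A[eax] */
    *kb = (uint16_t)((b[15] << 8) | b[idx]);     /* mov ah,B[15] / mov al,B[eax] */
}
```

Comparing `ka` and `kb` as unsigned values corresponds to JB/JA, and comparing them as `int16_t` corresponds to JL/JG. When bytes 0..14 are all equal, the model picks index 0, the low key bytes match, and byte 15 alone decides.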
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: KeepingRealBusy on September 02, 2013, 08:52:39 AM

Here is my contribution (from Reply #284)

Code: [Select]

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1358    kCycles [x][x][x] - Cmp128Dave
1206    kCycles [x][x][x] - Cmp128Nidud
975     kCycles [x][x][x] - Cmp128NidudSSE
1171    kCycles [x][x][ ] - Cmp128Alex
1766    kCycles [x][x][x] - MasmBasic Ocmp
1424    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1040    kCycles [x][x][x] - Cmp128JJAlexSSE_2
721     kCycles [x][x][x] - Cmp128JJAlexSSE_3
535     kCycles [x][x][ ] - AxCMP128bitProc3
519     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Dave AKA KRB
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 02, 2013, 12:35:42 PM
Jochen, did you time the version of a macro I posted a couple of pages above?
Here it is:

Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM


For me it is faster than the original "_1" macro. You can also try changing the loads like this:

   movzx eax,word ptr [ow0+14]
   movzx edx,word ptr [ow1+14]


but for me it is slower than the version above it.

Timings for it (there is your old macro - my testbed is a bit outdated)
Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2189320 cycles [x][x][x] - Cmp128Nidud
2295837 cycles [x][x][x] - Cmp128NidudSSE
2773387 cycles [x][x][x] - Cmp128Dave
4033478 cycles [x][x][x] - Cmp128Dave2
1597228 cycles [x][x][x] - Cmp128JJAlexSSE_1
1622741 cycles [x][x][x] - Cmp128JJAlexSSE_2
1905774 cycles [x][x][x] - Cmp128JJAlexSSE_3
993931  cycles [x][x][ ] - Cmp128Alex
1859714 cycles [x][x][x] - Cmp128Alex_2
1901902 cycles [x][x][x] - Cmp128Alex_3
1994856 cycles [x][x][ ] - Cmp128JJSSE
1346269 cycles [x][x][ ] - AxCMP128bitProc3
1311894 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
741050  cycles [x][ ][ ] - Cmp128DaveU
770599  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---


Timings for Cmp128_timingsOQ
Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2696    kCycles [x][x][x] - Cmp128Dave
2713    kCycles [x][x][x] - Cmp128Nidud
3125    kCycles [x][x][x] - Cmp128NidudSSE
945     kCycles [x][x][ ] - Cmp128Alex
1932    kCycles [x][x][x] - MasmBasic Ocmp
1485    kCycles [x][x][x] - MasmBasic Qcmp
1639    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1604    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1595    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1360    kCycles [x][x][ ] - AxCMP128bitProc3
1274    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---


Timings for Cmp128_timingsO
Code: [Select]

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2856    kCycles [x][x][x] - Cmp128Dave
2752    kCycles [x][x][x] - Cmp128Nidud
3128    kCycles [x][x][x] - Cmp128NidudSSE
956     kCycles [x][x][ ] - Cmp128Alex
1928    kCycles [x][x][x] - MasmBasic Ocmp
1641    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1601    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1592    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1361    kCycles [x][x][ ] - AxCMP128bitProc3
1272    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 02, 2013, 12:47:09 PM
Hi Dave :t


Quote
Here is my contribution (from Reply #284)

Code: [Select]

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1358    kCycles [x][x][x] - Cmp128Dave
1206    kCycles [x][x][x] - Cmp128Nidud
975     kCycles [x][x][x] - Cmp128NidudSSE
1171    kCycles [x][x][ ] - Cmp128Alex
1766    kCycles [x][x][x] - MasmBasic Ocmp
1424    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1040    kCycles [x][x][x] - Cmp128JJAlexSSE_2
721     kCycles [x][x][x] - Cmp128JJAlexSSE_3
535     kCycles [x][x][ ] - AxCMP128bitProc3
519     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Dave AKA KRB

Incredible difference between the algos that use full-sized and half-sized regs. Your AMD seems to work very well with "partial" regs, contrary to Intel CPUs, which are bad with them.
Cmp128JJAlexSSE_3 differs from Cmp128JJAlexSSE_1
only in this:

   xor cx,0FFFFh
   jz @l2
   and cx,7FFFh
   bsr cx,cx

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 02, 2013, 04:28:55 PM
Quote
Jochen, did you time the version of a macro I posted a couple of pages above?

Here it comes:
Code: [Select]
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945     kCycles [x][x][x] - Cmp128Dave
916     kCycles [x][x][x] - Cmp128Nidud
1017    kCycles [x][x][x] - Cmp128NidudSSE
689     kCycles [x][x][ ] - Cmp128Alex
1013    kCycles [x][x][x] - MasmBasic Ocmp
815     kCycles [x][x][x] - Cmp128JJAlexSSE_1
854     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
925     kCycles [x][x][x] - Cmp128JJAlexSSE_2
926     kCycles [x][x][x] - Cmp128JJAlexSSE_3
858     kCycles [x][x][ ] - AxCMP128bitProc3
870     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
653     kCycles [x][x][x] - Cmp128Dave
608     kCycles [x][x][x] - Cmp128Nidud
806     kCycles [x][x][x] - Cmp128NidudSSE
434     kCycles [x][x][ ] - Cmp128Alex
386     kCycles [x][x][x] - MasmBasic Ocmp
315     kCycles [x][x][x] - Cmp128JJAlexSSE_1
366     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
355     kCycles [x][x][x] - Cmp128JJAlexSSE_2
316     kCycles [x][x][x] - Cmp128JJAlexSSE_3
455     kCycles [x][x][ ] - AxCMP128bitProc3
439     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Quote
well, it's difficult to read your "code", but I think...
You should learn Masm, it's a fascinating language :t

(and I'm afraid your interpretation is not correct - you might launch Olly to see what it really does).
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: sinsi on September 02, 2013, 06:06:46 PM
jj's latest
Code: [Select]
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
695     kCycles [x][x][x] - Cmp128Dave
564     kCycles [x][x][x] - Cmp128Nidud
652     kCycles [x][x][x] - Cmp128NidudSSE
396     kCycles [x][x][ ] - Cmp128Alex
316     kCycles [x][x][x] - MasmBasic Ocmp
268     kCycles [x][x][x] - Cmp128JJAlexSSE_1
321     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
312     kCycles [x][x][x] - Cmp128JJAlexSSE_2
271     kCycles [x][x][x] - Cmp128JJAlexSSE_3
403     kCycles [x][x][ ] - AxCMP128bitProc3
378     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
748     kCycles [x][x][x] - Cmp128Dave
615     kCycles [x][x][x] - Cmp128Nidud
714     kCycles [x][x][x] - Cmp128NidudSSE
433     kCycles [x][x][ ] - Cmp128Alex
348     kCycles [x][x][x] - MasmBasic Ocmp
296     kCycles [x][x][x] - Cmp128JJAlexSSE_1
353     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
344     kCycles [x][x][x] - Cmp128JJAlexSSE_2
298     kCycles [x][x][x] - Cmp128JJAlexSSE_3
442     kCycles [x][x][ ] - AxCMP128bitProc3
416     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on September 02, 2013, 07:01:01 PM
your code is hard to read, Jochen - lol
i dread if i have to add a routine   :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 02, 2013, 08:07:38 PM
Quote
your code is hard to read, Jochen - lol

Come on, it's ultra simple...
  pmovmskb edx, xt2   ; show in dx where xt0 differs to xt1
  if MbcmpO eq QWORD
   not dl
   and edx, 07fh
  else          ; don't duplicate MSB
   if 1
      xor edx, -1
      and edx, 07fffh
   else
      not dx
      and dh, 07fh
   endif
  endif
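
The two branches above differ only in how many bytes take part. A small C sketch of that idea follows, with `diff_mask` as a made-up name and the operand size passed as a parameter instead of being chosen at assembly time. The most significant byte is masked out of the "differs" bits, presumably because that byte is loaded separately as the top of the compare key (the "don't duplicate MSB" comment).

```c
#include <stdint.h>

/* eq:     PMOVMSKB-style mask, bit i set when byte i of both operands is equal.
   nbytes: operand size in bytes, 8 for a QWORD or 16 for an OWORD.
   Returns the bits of the differing bytes, excluding the most significant byte. */
static uint32_t diff_mask(uint32_t eq, unsigned nbytes)
{
    uint32_t keep = (1u << (nbytes - 1)) - 1u;  /* 8 -> 0x7F, 16 -> 0x7FFF */
    return ~eq & keep;
}
```

For example, `diff_mask(0xFFEF, 16)` gives `0x10` (only byte 4 differs), and `diff_mask(0x7FFF, 16)` gives 0 because a difference in byte 15 alone is left to the key's top byte.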
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on September 02, 2013, 10:57:58 PM
I guess there are different views about writing code
but one could consider (at least I do) the golden rule:
Quote
Keep it simple stupid!

At least if you write code for other people, which is often the case, as in forums and projects involving other people and so on. A comment should explain what the code actually does, not how it does it, which means that the code basically should explain itself.

Quote
You should learn Masm, it's a fascinating language :t
Well, if you follow these simple rules you may write your own compiler, as I have done in collaboration with other people using a common language. That simplifies the process since everybody is able to read and understand what other people do.

Quote
I'm afraid your interpretation is not correct
How do you know?

Quote
you might launch Olly to see what it really does
Don't you think that this is a bit too much to ask, or at least a bit complicated, to use a debugger to see what it actually does?
Code: [Select]
CPU Disasm
Address   Hex dump          Command                                  Comments
0040111C  |.  0F1006        MOVUPS XMM0,DQWORD PTR DS:[ESI]          ; FLOAT 0.0, 0.0, 0.0, 0.0
0040111F  |.  0F100F        MOVUPS XMM1,DQWORD PTR DS:[EDI]
00401122  |.  0F28D0        MOVAPS XMM2,XMM0
00401125  |.  660F74D1      PCMPEQB XMM2,XMM1
00401129  |.  660FD7D2      PMOVMSKB EDX,XMM2
0040112D  |.  66:F7D2       NOT DX
00401130  |.  80E6 7F       AND DH,7F
00401133  |.  0FBDD2        BSR EDX,EDX
00401136  |.  51            PUSH ECX
00401137  |.  0FB64E 0F     MOVZX ECX,BYTE PTR DS:[ESI+0F]
0040113B  |.  0FC9          BSWAP ECX
0040113D  |.  8A0C32        MOV CL,BYTE PTR DS:[ESI+EDX]
00401140  |.  0FB6143A      MOVZX EDX,BYTE PTR DS:[EDI+EDX]
00401144  |.  0FCA          BSWAP EDX
00401146  |.  8A57 0F       MOV DL,BYTE PTR DS:[EDI+0F]
00401149  |.  0FCA          BSWAP EDX
0040114B  |.  3BCA          CMP ECX,EDX
0040114D  |.  59            POP ECX
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 02, 2013, 11:27:09 PM
Quote
well, it's difficult to read your "code"
Quote
I guess there are different views about writing code
Yes, certainly. But I would never call your code "code", or refer to you as a "coder" instead of a coder. It requires a certain level of arrogance to dismiss somebody else's code as "code".

Quote
Quote
I'm afraid your interpretation is not correct
How do you know?

Quote
you might launch Olly to see what it really does
Don't you think that this is a bit too much to ask, or at least a bit complicated, to use a debugger to see what it actually does?

Normally, I would not ask, but since you had difficulties de-coding my macro, I thought Olly would be a reliable way to check. What you show above, by the way, is old code - the version of oqCmp.asm that I posted 15 hours ago already contained:
   if 1
      xor edx, -1
      and edx, 07fffh
   else
      not dx
      and dh, 07fh
   endif

The if 1 is conditional assembly and means "use this branch, not the other one".
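
For readers more at home in C, the preprocessor offers a close analogue of this kind of conditional assembly. The sketch below is only an illustration: `variant_32bit`, `variant_16bit` and `diff_bits` are hypothetical names. It models both branches and checks that they agree whenever the upper 16 bits of the input are zero, as they are after PMOVMSKB.

```c
#include <stdint.h>

/* Branch taken with "if 1": xor edx,-1 / and edx,07fffh */
static uint32_t variant_32bit(uint32_t edx)
{
    return (edx ^ 0xFFFFFFFFu) & 0x7FFFu;
}

/* The "else" branch: not dx / and dh,07fh (16-bit and 8-bit operations) */
static uint32_t variant_16bit(uint32_t edx)
{
    uint32_t dx = ~edx & 0xFFFFu;     /* not dx: only the low 16 bits flip */
    dx &= 0x7FFFu;                    /* and dh,07fh: clear bit 15 */
    return (edx & 0xFFFF0000u) | dx;  /* upper half of edx stays untouched */
}

/* Like "if 1 ... else ... endif": only one branch reaches the final program. */
#if 1
#define diff_bits variant_32bit
#else
#define diff_bits variant_16bit
#endif
```

Changing `#if 1` to `#if 0` selects the other variant without touching the callers, which is what makes such switches handy while timing alternatives.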

Congrats, by the way - on the AMD your code is faster than mine:
Code: [Select]
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
843     kCycles [x][x][x] - Cmp128Dave
847     kCycles [x][x][x] - Cmp128Nidud
917     kCycles [x][x][x] - Cmp128NidudSSE
643     kCycles [x][x][ ] - Cmp128Alex
1578    kCycles [x][x][x] - MasmBasic Ocmp
1469    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1531    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1466    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1466    kCycles [x][x][x] - Cmp128JJAlexSSE_3
803     kCycles [x][x][ ] - AxCMP128bitProc3
771     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on September 03, 2013, 12:49:43 AM
From Reply #289.
Code: [Select]

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1022 kCycles [x][x][x] - Cmp128Dave
917 kCycles [x][x][x] - Cmp128Nidud
1022 kCycles [x][x][x] - Cmp128NidudSSE
817 kCycles [x][x][ ] - Cmp128Alex
1561 kCycles [x][x][x] - MasmBasic Ocmp
1422 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1471 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1668 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1677 kCycles [x][x][x] - Cmp128JJAlexSSE_3
937 kCycles [x][x][ ] - AxCMP128bitProc3
985 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on September 03, 2013, 12:54:43 AM
Jochen,
it's just the text format
we each have our own style and it can be hard to get used to someone else's   :P
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on September 03, 2013, 02:02:26 AM
The timings for Jochen's latest version:

Code: [Select]

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
736     kCycles [x][x][x] - Cmp128Dave
629     kCycles [x][x][x] - Cmp128Nidud
696     kCycles [x][x][x] - Cmp128NidudSSE
442     kCycles [x][x][ ] - Cmp128Alex
367     kCycles [x][x][x] - MasmBasic Ocmp
321     kCycles [x][x][x] - Cmp128JJAlexSSE_1
371     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
364     kCycles [x][x][x] - Cmp128JJAlexSSE_2
352     kCycles [x][x][x] - Cmp128JJAlexSSE_3
447     kCycles [x][x][ ] - AxCMP128bitProc3
427     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 03, 2013, 02:18:54 AM
Thanxalot :icon14:

Attached one more, inter alia with a modification of the test_start macro:

test_start macro useit:=<1>
usethismacro=useit
  if usethismacro
   push 50000000
   .Repeat
      dec dword ptr [esp]   ; heat up the CPU
   .Until Sign?
   add esp, 4
   invoke Sleep, 0
   counter_begin 1000, HIGH_PRIORITY_CLASS
  endif
endm


On some machines, timings were very volatile; the small mod above seems to help.
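
The warm-up part of the macro boils down to a busy loop that runs before the measurement starts. Below is a rough C model of just that loop; `warm_up` is a made-up name, and the Sleep call and counter_begin are deliberately left out. It assumes a non-negative start value.

```c
#include <stdint.h>

/* Counts down from start until the counter turns negative, like
   "dec dword ptr [esp]" repeated ".Until Sign?". Returns the number of
   decrements performed. volatile keeps the compiler from deleting the loop.
   Assumption: start >= 0. */
static uint32_t warm_up(int32_t start)
{
    volatile int32_t counter = start;
    uint32_t iterations = 0;
    do {
        counter = counter - 1;   /* one memory read-modify-write per pass */
        iterations++;
    } while (counter >= 0);      /* stop once the sign bit is set */
    return iterations;
}
```

Because the loop ends only when the counter reaches -1, a start value of n performs n + 1 decrements, so the 50000000 pushed by the macro gives 50000001 passes.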

Code: [Select]
Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
986     kCycles [x][x][x] - Cmp128Dave
946     kCycles [x][x][x] - Cmp128Nidud
818     kCycles [x][x][x] - Cmp128NidudSSE
575     kCycles [x][x][ ] - Cmp128Alex
564     kCycles [x][x][x] - MasmBasic Ocmp.1
517     kCycles [x][x][x] - MasmBasic Ocmp.0
549     kCycles [x][x][x] - MasmBasic Ocmp.1
513     kCycles [x][x][x] - MasmBasic Ocmp.0
472     kCycles [x][x][x] - Cmp128JJAlexSSE_1
476     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
618     kCycles [x][x][x] - Cmp128JJAlexSSE_2
614     kCycles [x][x][x] - Cmp128JJAlexSSE_3
747     kCycles [x][x][ ] - AxCMP128bitProc3
772     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
-----------------------------------------------------
843     kCycles [x][x][x] - Cmp128Dave
844     kCycles [x][x][x] - Cmp128Nidud
919     kCycles [x][x][x] - Cmp128NidudSSE
641     kCycles [x][x][ ] - Cmp128Alex
1588    kCycles [x][x][x] - MasmBasic Ocmp.1
1584    kCycles [x][x][x] - MasmBasic Ocmp.0
1586    kCycles [x][x][x] - MasmBasic Ocmp.1
1578    kCycles [x][x][x] - MasmBasic Ocmp.0
1467    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1532    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1471    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1468    kCycles [x][x][x] - Cmp128JJAlexSSE_3
801     kCycles [x][x][ ] - AxCMP128bitProc3
771     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on September 03, 2013, 03:13:23 AM
Jochen,

the new timings. I hope that helps:

Code: [Select]

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
660     kCycles [x][x][x] - Cmp128Dave
538     kCycles [x][x][x] - Cmp128Nidud
603     kCycles [x][x][x] - Cmp128NidudSSE
371     kCycles [x][x][ ] - Cmp128Alex
316     kCycles [x][x][x] - MasmBasic Ocmp.1
307     kCycles [x][x][x] - MasmBasic Ocmp.0
314     kCycles [x][x][x] - MasmBasic Ocmp.1
306     kCycles [x][x][x] - MasmBasic Ocmp.0
259     kCycles [x][x][x] - Cmp128JJAlexSSE_1
308     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
302     kCycles [x][x][x] - Cmp128JJAlexSSE_2
263     kCycles [x][x][x] - Cmp128JJAlexSSE_3
391     kCycles [x][x][ ] - AxCMP128bitProc3
363     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Gunther
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 03, 2013, 03:44:45 AM
Thanks, Gunther. And here is the Celeron M:
Code: [Select]
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
955     kCycles [x][x][x] - Cmp128Dave
923     kCycles [x][x][x] - Cmp128Nidud
1012    kCycles [x][x][x] - Cmp128NidudSSE
676     kCycles [x][x][ ] - Cmp128Alex
1081    kCycles [x][x][x] - MasmBasic Ocmp.1
1010    kCycles [x][x][x] - MasmBasic Ocmp.0
1080    kCycles [x][x][x] - MasmBasic Ocmp.1
1011    kCycles [x][x][x] - MasmBasic Ocmp.0
814     kCycles [x][x][x] - Cmp128JJAlexSSE_1
853     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
926     kCycles [x][x][x] - Cmp128JJAlexSSE_2
927     kCycles [x][x][x] - Cmp128JJAlexSSE_3
869     kCycles [x][x][ ] - AxCMP128bitProc3
867     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: nidud on September 03, 2013, 03:49:40 AM
Oh, here we go again  :lol:

You're an excellent "coder" JJ, but sometimes your "code" is a bit intelligent for a simple mind like me, so I tried to simplify it a bit

You are a bit sensitive, methinks  :P

On reflection, from my simplified view, I now see what Alex was doing here (http://masm32.com/board/index.php?topic=2222.msg23791#msg23791)

I added my modified version to the test:
Code: [Select]
movups xmm0,A[0]
movups xmm1,B[0]
mov edx,B[12]
mov eax,A[12]
  pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
not ecx
and ecx,07FFFh
bsr ecx,ecx
mov dl,B[ecx]
mov al,A[ecx]
cmp eax,edx
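
Why a 32-bit key still gives the right answer may not be obvious, so here is a small C model of the sequence above (the name `make_keys32` is made up, and a byte loop replaces the SSE and BSR steps). The key is the dword at offset 12 with its lowest byte replaced by the highest differing byte among bytes 0..14. Bytes 13 and 14 ride along in the key. That is harmless: if they are equal they do not influence the result, and if one of them differs, the difference sits at or above the replaced byte and decides the compare correctly.

```c
#include <stdint.h>

static void make_keys32(const uint8_t a[16], const uint8_t b[16],
                        uint32_t *ka, uint32_t *kb)
{
    /* highest differing byte among 0..14 (pcmpeqb/pmovmskb/not/and/bsr);
       index 0 is used when none differs, a choice of this model */
    unsigned idx = 0;
    for (int i = 14; i >= 0; i--) {
        if (a[i] != b[i]) {
            idx = (unsigned)i;
            break;
        }
    }

    /* mov eax,A[12] / mov edx,B[12]: little-endian dword, byte 15 on top */
    uint32_t da = a[12] | (uint32_t)a[13] << 8 | (uint32_t)a[14] << 16 | (uint32_t)a[15] << 24;
    uint32_t db = b[12] | (uint32_t)b[13] << 8 | (uint32_t)b[14] << 16 | (uint32_t)b[15] << 24;

    *ka = (da & 0xFFFFFF00u) | a[idx];   /* mov al,A[ecx] */
    *kb = (db & 0xFFFFFF00u) | b[idx];   /* mov dl,B[ecx] */
}
```

As with the 16-bit variant, comparing the keys as unsigned values models JB/JA and comparing them as `int32_t` models JL/JG.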

result:
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
781     kCycles [x][x][x] - Cmp128Dave
663     kCycles [x][x][x] - Cmp128Nidud
693     kCycles [x][x][x] - Cmp128NidudSSE
428     kCycles [x][x][ ] - Cmp128Alex
389     kCycles [x][x][x] - MasmBasic Ocmp.1
409     kCycles [x][x][x] - MasmBasic Ocmp.0
406     kCycles [x][x][x] - MasmBasic Ocmp.1
379     kCycles [x][x][x] - MasmBasic Ocmp.0
315     kCycles [x][x][x] - Cmp128JJAlexSSE_1
366     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
287     kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
355     kCycles [x][x][x] - Cmp128JJAlexSSE_2
315     kCycles [x][x][x] - Cmp128JJAlexSSE_3
463     kCycles [x][x][ ] - AxCMP128bitProc3
437     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

AMD Athlon(tm) II X2 245 Processor (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
848     kCycles [x][x][x] - Cmp128Dave
777     kCycles [x][x][x] - Cmp128Nidud
692     kCycles [x][x][x] - Cmp128NidudSSE
599     kCycles [x][x][ ] - Cmp128Alex
1153    kCycles [x][x][x] - MasmBasic Ocmp.1
1085    kCycles [x][x][x] - MasmBasic Ocmp.0
1153    kCycles [x][x][x] - MasmBasic Ocmp.1
1086    kCycles [x][x][x] - MasmBasic Ocmp.0
1044    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1042    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1029    kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1048    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1048    kCycles [x][x][x] - Cmp128JJAlexSSE_3
695     kCycles [x][x][ ] - AxCMP128bitProc3
676     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 03, 2013, 07:43:59 AM
Jochen's latest archive:
Code:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2713    kCycles [x][x][x] - Cmp128Dave
2716    kCycles [x][x][x] - Cmp128Nidud
3065    kCycles [x][x][x] - Cmp128NidudSSE
918     kCycles [x][x][ ] - Cmp128Alex
2044    kCycles [x][x][x] - MasmBasic Ocmp.1
1900    kCycles [x][x][x] - MasmBasic Ocmp.0
2046    kCycles [x][x][x] - MasmBasic Ocmp.1
1890    kCycles [x][x][x] - MasmBasic Ocmp.0
1605    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1574    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1614    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1571    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1363    kCycles [x][x][ ] - AxCMP128bitProc3
1256    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

nidud's latest archive:
Code:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2708    kCycles [x][x][x] - Cmp128Dave
2749    kCycles [x][x][x] - Cmp128Nidud
3076    kCycles [x][x][x] - Cmp128NidudSSE
909     kCycles [x][x][ ] - Cmp128Alex
2044    kCycles [x][x][x] - MasmBasic Ocmp.1
1895    kCycles [x][x][x] - MasmBasic Ocmp.0
2046    kCycles [x][x][x] - MasmBasic Ocmp.1
1906    kCycles [x][x][x] - MasmBasic Ocmp.0
1604    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1543    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1351    kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1576    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1578    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1340    kCycles [x][x][ ] - AxCMP128bitProc3
1267    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---



:biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 03, 2013, 08:18:12 AM
Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
957     kCycles [x][x][x] - Cmp128Dave
926     kCycles [x][x][x] - Cmp128Nidud
1016    kCycles [x][x][x] - Cmp128NidudSSE
680     kCycles [x][x][ ] - Cmp128Alex
1039    kCycles [x][x][x] - MasmBasic Ocmp.1
1039    kCycles [x][x][x] - MasmBasic Ocmp.0
818     kCycles [x][x][x] - Cmp128JJAlexSSE_1
857     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
785     kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
832     kCycles [x][x][x] - Cmp128AxelNidudJJ_A
853     kCycles [x][x][x] - Cmp128AxelNidudJJ_B
930     kCycles [x][x][x] - Cmp128JJAlexSSE_2
930     kCycles [x][x][x] - Cmp128JJAlexSSE_3
866     kCycles [x][x][ ] - AxCMP128bitProc3
871     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

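Cmp128AxelNidudJJ MACRO A:REQ, B:REQ
   movups xmm0, A[0]      ; load both OWORDs (unaligned)
   movups xmm1, B[0]
   push ecx               ; do not trash ecx
;  mov eax, A[12]
;  mov edx, B[12]
   pcmpeqb xmm0, xmm1     ; 0FFh in every byte where A and B match
   pmovmskb ecx, xmm0     ; one bit per byte: 1 = equal
   if ANJ_A
      xor ecx, -1         ; 1 = different
      and ecx, 07FFFh     ; ignore byte 15, it is covered by eax/edx
      or ecx, 1           ; make sure there is no zero input (http://masm32.com/board/index.php?topic=2312.0)
      bsr ecx, ecx        ; index of the highest differing byte
      mov eax, A[12]      ; top DWORDs
      mov edx, B[12]
      mov dl, B[ecx]      ; low byte replaced by the deciding byte
      mov al, A[ecx]
      cmp eax, edx        ; flags for the jxx that follows
   else
      xor ecx, 0FFFFh     ; 1 = different; Zero? if all 16 bytes match
      .if !Zero?          ; make sure there is no zero input
         and ecx, 07FFFh  ; note: still zero if only byte 15 differs
         bsr ecx, ecx
         mov eax, A[12]
         mov edx, B[12]
         mov dl, B[ecx]
         mov al, A[ecx]
         cmp eax, edx
      .endif
   endif
   pop ecx
ENDM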
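
The "no zero input" guard in the ANJ_A branch can be modelled in a few lines of C. This is only a sketch: bsr_model and top_diff_index are hypothetical names, and bsr_model simply refuses a zero source because the real instruction leaves its destination undefined in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for bsr: index of the highest set bit. The source must be nonzero. */
unsigned bsr_model(uint32_t x)
{
    assert(x != 0);
    unsigned i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* eq_mask is the pmovmskb result: bit i set when byte i of A equals byte i of B. */
unsigned top_diff_index(uint32_t eq_mask)
{
    uint32_t diff = ~eq_mask;   /* xor ecx, -1     : 1 = different        */
    diff &= 0x7FFFu;            /* and ecx, 07FFFh : drop byte 15         */
    diff |= 1u;                 /* or  ecx, 1      : never feed bsr a zero */
    return bsr_model(diff);
}
```

Forcing bit 0 is harmless: if it was not set already, byte 0 is equal in A and B, so copying it into al and dl leaves the decision to the upper three bytes of the top DWORD.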
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Siekmanski on September 03, 2013, 08:26:59 AM
Hi Jochen,

Reply #298:

Code:
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
963     kCycles [x][x][x] - Cmp128Dave
901     kCycles [x][x][x] - Cmp128Nidud
1000    kCycles [x][x][x] - Cmp128NidudSSE
662     kCycles [x][x][ ] - Cmp128Alex
991     kCycles [x][x][x] - MasmBasic Ocmp.1
941     kCycles [x][x][x] - MasmBasic Ocmp.0
991     kCycles [x][x][x] - MasmBasic Ocmp.1
941     kCycles [x][x][x] - MasmBasic Ocmp.0
765     kCycles [x][x][x] - Cmp128JJAlexSSE_1
826     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
870     kCycles [x][x][x] - Cmp128JJAlexSSE_2
873     kCycles [x][x][x] - Cmp128JJAlexSSE_3
862     kCycles [x][x][ ] - AxCMP128bitProc3
888     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Hi nidud,

Reply #301:

Code:
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
961     kCycles [x][x][x] - Cmp128Dave
903     kCycles [x][x][x] - Cmp128Nidud
999     kCycles [x][x][x] - Cmp128NidudSSE
663     kCycles [x][x][ ] - Cmp128Alex
990     kCycles [x][x][x] - MasmBasic Ocmp.1
942     kCycles [x][x][x] - MasmBasic Ocmp.0
990     kCycles [x][x][x] - MasmBasic Ocmp.1
942     kCycles [x][x][x] - MasmBasic Ocmp.0
763     kCycles [x][x][x] - Cmp128JJAlexSSE_1
827     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
718     kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
871     kCycles [x][x][x] - Cmp128JJAlexSSE_2
874     kCycles [x][x][x] - Cmp128JJAlexSSE_3
855     kCycles [x][x][ ] - AxCMP128bitProc3
887     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Jochen,  Reply #303:

Code:
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
964     kCycles [x][x][x] - Cmp128Dave
901     kCycles [x][x][x] - Cmp128Nidud
1000    kCycles [x][x][x] - Cmp128NidudSSE
662     kCycles [x][x][ ] - Cmp128Alex
970     kCycles [x][x][x] - MasmBasic Ocmp.1
968     kCycles [x][x][x] - MasmBasic Ocmp.0
764     kCycles [x][x][x] - Cmp128JJAlexSSE_1
825     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
720     kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
717     kCycles [x][x][x] - Cmp128AxelNidudJJ
870     kCycles [x][x][x] - Cmp128JJAlexSSE_2
872     kCycles [x][x][x] - Cmp128JJAlexSSE_3
862     kCycles [x][x][ ] - AxCMP128bitProc3
886     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 03, 2013, 08:30:53 AM
Thanks, Marinus - I like it :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: KeepingRealBusy on September 03, 2013, 08:48:02 AM
JJ's latest

Code:

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1448    kCycles [x][x][x] - Cmp128Dave
1146    kCycles [x][x][x] - Cmp128Nidud
846     kCycles [x][x][x] - Cmp128NidudSSE
496     kCycles [x][x][ ] - Cmp128Alex
2043    kCycles [x][x][x] - MasmBasic Ocmp.1
2153    kCycles [x][x][x] - MasmBasic Ocmp.0
1849    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1996    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1799    kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1884    kCycles [x][x][x] - Cmp128AxelNidudJJ
1956    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1994    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1344    kCycles [x][x][ ] - AxCMP128bitProc3
1263    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Dave.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 03, 2013, 08:54:39 AM
Dave & Marinus,

Thanks, but I am afraid the difference between Cmp128JJAlexSSE_1new1 and Cmp128AlexNidudJJ just reflected the volatility of the timings - I changed the description but forgot the macro call itself :redface:

Scroll back three posts to get the good version... sorry ;-)
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: KeepingRealBusy on September 03, 2013, 09:06:41 AM
From 303 - I thought something had happened - timings were way high with the newest version.

Code:

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1391    kCycles [x][x][x] - Cmp128Dave
1065    kCycles [x][x][x] - Cmp128Nidud
884     kCycles [x][x][x] - Cmp128NidudSSE
685     kCycles [x][x][ ] - Cmp128Alex
949     kCycles [x][x][x] - MasmBasic Ocmp.1
953     kCycles [x][x][x] - MasmBasic Ocmp.0
714     kCycles [x][x][x] - Cmp128JJAlexSSE_1
703     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
700     kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
740     kCycles [x][x][x] - Cmp128AxelNidudJJ_A
739     kCycles [x][x][x] - Cmp128AxelNidudJJ_B
710     kCycles [x][x][x] - Cmp128JJAlexSSE_2
697     kCycles [x][x][x] - Cmp128JJAlexSSE_3
516     kCycles [x][x][ ] - AxCMP128bitProc3
1331    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Dave.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 03, 2013, 10:43:04 AM
Code:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2713    kCycles [x][x][x] - Cmp128Dave
2729    kCycles [x][x][x] - Cmp128Nidud
3105    kCycles [x][x][x] - Cmp128NidudSSE
921     kCycles [x][x][ ] - Cmp128Alex
1979    kCycles [x][x][x] - MasmBasic Ocmp.1
1979    kCycles [x][x][x] - MasmBasic Ocmp.0
1607    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1540    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1354    kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1549    kCycles [x][x][x] - Cmp128AxelNidudJJ_A
1589    kCycles [x][x][x] - Cmp128AxelNidudJJ_B
1574    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1577    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1333    kCycles [x][x][ ] - AxCMP128bitProc3
1251    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 03, 2013, 10:54:24 AM
Cmp128AxelNidudJJ MACRO A:REQ, B:REQ

:biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: japheth on September 03, 2013, 05:22:18 PM
You are a bit sensitive me think  :P

Looks like it ... it is probably a standard feature of hyperactive forum members ... I could contribute an experience or two of my own there.  :icon_mrgreen:

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: jj2007 on September 03, 2013, 11:03:31 PM
Cmp128AxelNidudJJ MACRO A:REQ, B:REQ

Oops - my apologies, Alex :redface:

Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on September 03, 2013, 11:36:20 PM
Hi,

   Well, I put versions of a compare routine using CMPSB and
CMPSW into the timing suite.  If I did it correctly, someone at
Intel really hates string instructions.  And going from bytes to
words only helped ~5 - 15%.  Which means I probably need to
check for gross errors.  Oh well, maybe small code size counts
for something.

Cheers,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on September 04, 2013, 12:04:31 AM
i am not sure that the string method would pass all the tests, Steve
at least, not without some extra support code   :P
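
One possible reading of the "extra support code" remark, sketched in portable C rather than with CMPSB: a byte-by-byte scan from the most significant byte down gives the unsigned (JB/JA) ordering directly, while the signed (JL/JG) ordering needs one extra check on the sign bits. The function names are hypothetical, byte 0 is assumed to be the least significant byte, and this is not the code that was timed.

```c
#include <stdint.h>

/* Unsigned order: scan from byte 15 (most significant) down to byte 0. */
int cmp128_bytes_unsigned(const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 15; i >= 0; i--)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

/* Signed order: if the sign bits differ, the negative value is smaller;
   otherwise two's complement order matches the unsigned order. */
int cmp128_bytes_signed(const uint8_t a[16], const uint8_t b[16])
{
    int sa = a[15] >> 7, sb = b[15] >> 7;
    if (sa != sb)
        return sa ? -1 : 1;
    return cmp128_bytes_unsigned(a, b);
}
```

A plain byte compare leaves flags that describe only the two bytes that differed, which is why the signed jumps need the extra step when that byte is not the top one.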

RE: "Axel"

Axel is a good name - let's call him that, from now on   :lol:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: FORTRANS on September 04, 2013, 01:51:59 AM
Hi Dave,

   Okay.  You may be right.  The idea for the algorithm came from
Reply #39.  I used your code in Reply #165 to get the algorithm
to verify the compare algorithm.  See Replies #241, #252, and
#255 for test results, code, and a comment.  I skimmed some
of this thread again, but may have missed something.  Did you
change your validation program?

Regards,

Steve N.
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 04, 2013, 11:49:53 AM
Cmp128AxelNidudJJ MACRO A:REQ, B:REQ

Oops - my apologies, Alex :redface:

Don't worry :biggrin:

RE: "Axel"

Axel is a good name - let's call him that, from now on   :lol:


MOV EAX, "Dave"
BSWAP EAX


Your name is 32 bit, too, Dave :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on September 05, 2013, 05:27:59 PM
i have used the value 0DABEDABEh before to XOR some files - lol
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 05, 2013, 07:50:19 PM
i have used the value 0DABEDABEh before to XOR some files - lol

:biggrin:

BTW, the Russian letter "V" is written like "В". Дэйв - that's how your name is spelled in Russian :biggrin:

But what does the "dedn" prefix before the name mean? Is it some idiom?
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: dedndave on September 05, 2013, 10:41:01 PM
i used to live in a very small town, out in the desert
i lived at the end of an old, historic, mining road called Plomosa Road
i was at the base of a small mountain, so the road ended at my place   :P
up the hill a little was an old abandoned mine

(http://bousechamber.org/images/Bousehill2.jpg)

when you live in a small town, everyone makes up nicknames for you
we had "One-Door Fred" and "Bartender John", etc
one of the guys was an old guy we called "Dirty Ernie"
he used to grab at girls when he'd been drinking - lol

they called me "Dead-End Dave"   :biggrin:
that's where dedndave came from
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Antariy on September 06, 2013, 12:04:43 AM
Thank you, Dave, now it is clear :biggrin:
Title: Re: Comparing 128-bit numbers aka OWORDs
Post by: Gunther on September 06, 2013, 02:36:55 AM
Hi Dave,

they called me "Dead-End Dave"   :biggrin:
that's where dedndave came from

This was an overdue clarification.  :t

Gunther