News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

jj2007

Following the "What is the fastest way (performance wise) to compare two 128 bit integers" thread in the Campus, here a first attempt to time comparisons of 128-bit unsigned integers.

Fifteen cycles is quite a lot, so if you have a better algo, please post it... you can test it before in line 90 of the attached source.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
15580   cycles for 1000 * cmp128 (2 globals)
88587   cycles for 1000 * cmp128b (loop)
27107   cycles for 1000 * cmp128p (calls proc)
28611   cycles for 1000 * two pointers


P.S.: Googling yields almost nothing, apparently there are not many applications for this :(

sinsi

Que?

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1824/1000 cycles

3385    cycles for 1000 * cmp128
??      cycles for 1000 * cmp128 xx

3489    cycles for 1000 * cmp128
??      cycles for 1000 * cmp128 xx

3335    cycles for 1000 * cmp128
??      cycles for 1000 * cmp128 xx

🍺🍺🍺

jj2007

John, your CPU doesn't respect the speed limits, as usual :eusa_naughty:

OK, version B attached on top. It features a loop based macro:

cmp128b MACRO ow0, ow1   ; both operands must be memory variables
  push esi
  push edi
  mov esi, offset ow0
  mov edi, offset ow1
  mov ecx, 16
  .Repeat
   dec ecx
   .if Sign?
      inc ecx
      .Break
   .endif
   movzx eax, byte ptr [esi+ecx]
   movzx edx, byte ptr [edi+ecx]
   cmp eax, edx
  .Until !Zero?
  pop edi
  pop esi
ENDM


sinsi

>John, your CPU doesn't respect the speed limits, as usual :eusa_naughty:
Hah! It's you being disrespectful to my CPU.



Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1694/1000 cycles

3158    cycles for 1000 * cmp128
58915   cycles for 1000 * cmp128b

3162    cycles for 1000 * cmp128
58929   cycles for 1000 * cmp128b

3124    cycles for 1000 * cmp128
59191   cycles for 1000 * cmp128b

🍺🍺🍺

jj2007

#4
One more, adding a generic one which expects two pointers:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2998/1000 cycles

15567   cycles for 1000 * cmp128 (2 globals)
88584   cycles for 1000 * cmp128b (loop)
27141   cycles for 1000 * cmp128p (calls proc)
29387   cycles for 1000 * two pointers

15562   cycles for 1000 * cmp128 (2 globals)
88587   cycles for 1000 * cmp128b (loop)
27100   cycles for 1000 * cmp128p (calls proc)
29431   cycles for 1000 * two pointers

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

9116    cycles for 1000 * cmp128 (2 globals)
73639   cycles for 1000 * cmp128b (loop)
17461   cycles for 1000 * cmp128p (calls proc)
24831   cycles for 1000 * two pointers



Gunther

Jochen,

your timings:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3047/1000 cycles

1636    cycles for 1000 * cmp128 (2 globals)
56016   cycles for 1000 * cmp128b (loop)
4410    cycles for 1000 * cmp128p (calls proc)
2688    cycles for 1000 * two pointers

1696    cycles for 1000 * cmp128 (2 globals)
55860   cycles for 1000 * cmp128b (loop)
10677   cycles for 1000 * cmp128p (calls proc)
2658    cycles for 1000 * two pointers

1612    cycles for 1000 * cmp128 (2 globals)
55951   cycles for 1000 * cmp128b (loop)
10790   cycles for 1000 * cmp128p (calls proc)
2770    cycles for 1000 * two pointers

--- ok ---


Gunther
You have to know the facts before you can distort them.

hutch--

Hmmmm,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
loop overhead is approx. 1028/1000 cycles

6340    cycles for 1000 * cmp128 (2 globals)
66042   cycles for 1000 * cmp128b (loop)
15203   cycles for 1000 * cmp128p (calls proc)
28236   cycles for 1000 * two pointers

6394    cycles for 1000 * cmp128 (2 globals)
66052   cycles for 1000 * cmp128b (loop)
15315   cycles for 1000 * cmp128p (calls proc)
28234   cycles for 1000 * two pointers

6340    cycles for 1000 * cmp128 (2 globals)
66007   cycles for 1000 * cmp128b (loop)
14605   cycles for 1000 * cmp128p (calls proc)
28232   cycles for 1000 * two pointers


--- ok ---

RuiLoureiro

Hi
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
loop overhead is approx. 2450/1000 cycles

18005   cycles for 1000 * cmp128 (2 globals)
101446  cycles for 1000 * cmp128b (loop)
34947   cycles for 1000 * cmp128p (calls proc)
27351   cycles for 1000 * two pointers

17983   cycles for 1000 * cmp128 (2 globals)
103720  cycles for 1000 * cmp128b (loop)
35346   cycles for 1000 * cmp128p (calls proc)
27835   cycles for 1000 * two pointers

18020   cycles for 1000 * cmp128 (2 globals)
97862   cycles for 1000 * cmp128b (loop)
35671   cycles for 1000 * cmp128p (calls proc)
27585   cycles for 1000 * two pointers


--- ok ---

jj2007

Thanks to all of you :t

If I find the time, I'll have to look at the validity of results. Perhaps an extra check of the msb is necessary.

ifidni @Environ(oAssembler), <mlv615>
  oSmall   qWORD 00000000000000001h, 0800000000000000h   ; 7f00000000000000???
   db 03h
  oMedium   qWORD 00000000000000002h, 0800000000000000h
   db 02h
  oBig   qWORD 00000000000000003h, 0800000000000000h
   db 01h
  oSmallF   qWORD 00000000000007f00h, 0800000000000000h
  oMedF   qWORD 00000000000008000h, 0800000000000000h
  oBigF   qWORD 0000000000000ff00h, 0800000000000000h
else
  oSmall   OWORD 080000000000000000000000000000001h
   db 03h
  oMedium   OWORD 080000000000000000000000000000002h
   db 02h
  oBig   OWORD 080000000000000000000000000000003h
   db 01h
  oSmallF   OWORD 080000000000000000000000000007f00h
  oMedF   OWORD 080000000000000000000000000008000h
  oBigF   OWORD 08000000000000000000000000000ff00h
endif


Antariy

Hi Jochen :t

Here are my code added and timings.


This is a tricky replacement of PCMPEQD to make the SSE1-compatible code. Archive contains both versions, the "default" is SSE1-capable.


    ifdef USE_SSE1
        xorps xmm1,xmm0
        movaps xmm2,xmm1
        cmpps xmm1,oword ptr zero128b,0
        andps xmm2,xmm1
        xorps xmm1,xmm2
    else
        pcmpeqd xmm0,xmm1
    endif


The tricky thing with packed comparsion is the question: is the not matched DWORD the highest DWORD, or not? The signed/unsigned comparsion sets different flags, but they relate to one other for the number that is in positive range of signed.
The code made flags correction if required.


        movmskps eax,xmm1
        xor al,0fH
        bsr eax,eax
        jz @l1
        xor ecx,ecx
        mov edx,dword ptr [ow0+eax*4]
        cmp eax,3               ; if this was not highest-order dword
        mov eax,dword ptr [ow1+eax*4]       
        setne cl
                             
        cmp edx,eax             ; OF = SF if EDX (signed)> ECX, CF = 0 if EDX (unsigned)> ECX
        jecxz @l1
       
        ; we need to preserve ZF and CF flags
        ; but make OF != SF if CF = 1

        jns @F                  ; (if sign flag is set, then we do want to clear OF flag)
        mov ecx,80000000h       ; then mark: highest bit will become #30 bit,
        @@:                     ; CF will become #31 bit...

        setc cl                 ; ...lowest bit will become CF

        rcr ecx,1               ; restore CF, if two highest bits set then OF = 0 else OF = 1
        @l1:


I've replaced cmp128b macro with my code to simplify addition to the testbed.

It will probably better to avoid JECXZ with using of a jumptable, but in this testbed it was simpler to use JECXZ than to create table for every macro expansion.
Most robust way of usage is to compare for equality first, then for signed/unsigned great-or-less than (because of short cut after BSR - if it goes that path - the only ZF flag is guaranteed to be set exact to results).



SSE2:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2231/1000 cycles

18239   cycles for 1000 * cmp128
20661   cycles for 1000 * cmp128b

18399   cycles for 1000 * cmp128
20907   cycles for 1000 * cmp128b

18365   cycles for 1000 * cmp128
20838   cycles for 1000 * cmp128b


--- ok ---


SSE1:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2189/1000 cycles

18274   cycles for 1000 * cmp128
22574   cycles for 1000 * cmp128b

18327   cycles for 1000 * cmp128
22582   cycles for 1000 * cmp128b

18273   cycles for 1000 * cmp128
22632   cycles for 1000 * cmp128b


--- ok ---


Gunther

Alex,

you've made good points. Thank you.  :t

Here are the timings:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3341/1000 cycles

1416    cycles for 1000 * cmp128
3673    cycles for 1000 * cmp128b

1352    cycles for 1000 * cmp128
3597    cycles for 1000 * cmp128b

1405    cycles for 1000 * cmp128
3670    cycles for 1000 * cmp128b

--- ok ---


Gunther
You have to know the facts before you can distort them.

jj2007

Quote from: Antariy on August 14, 2013, 02:11:59 AM
Hi Jochen :t

Here are my code added and timings.

Thanks a lot, Alex :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 1767/1000 cycles

9121    cycles for 1000 * cmp128
93407   cycles for 1000 * cmp128b


I am still busy with the macro, trying to make it return correct values. If you want to test yours, here is the core of the testbed, in qword version.

.data
qSmall        qWORD 7f00000000000001h
qMedium        qWORD 7f00000000000002h
qBig        qWORD 7f00000000000003h
qSmallN        qWORD 8000000000000001h
qMedN        qWORD 8000000000000002h
qBigN        qWORD 8000000000000003h
qSmallF        qWORD 0800000000007f00h
qMedF        qWORD 0800000000008000h
qBigF        qWORD 080000000000ff00h
...
   print chr$(13, 10, "Qcmp", 13, 10)
   Print Str$("qSp=%i\n", qSmall)
   Print Str$("qBp=%i\n", qBig)
   Print Str$("qSn=%i\n", qSmallN)
   Print Str$("qBn=%i\n\n", qBigN)
   print chr$(13, 10, "positive (lesser, greater)", 13, 10)
   Qcmp qSmall, qBig
   Qcmp qBig, qSmall
   print chr$(13, 10, "negative (lesser, greater)", 13, 10)
   Qcmp qSmallN, qBigN
   Qcmp qBigN, qSmallN
   print chr$(13, 10, "pos, neg (greater, greater)", 13, 10)
   jt=1
   Qcmp qSmall, qBigN
   Qcmp qBig, qSmallN
   print chr$(13, 10, "neg, pos (lesser, lesser)", 13, 10)
   Qcmp qSmallN, qBig
   Qcmp qBigN, qSmall

qWord

jj, please a little more effectiveness for the macros...

useA=1
;...
CodeSize MACRO algo, overhead:=<15>    ; default overhead is mov ecx, 99+mov eax, ecx+loop
  pushad
  mov eax, offset &algo&_endp    ; OPT_Errline 0
  sub eax, offset &algo&_s
  sub eax, overhead
    if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
        % print str$(eax), 9, @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">), 13, 10
    else
        % print str$(eax), 9, "bytes for other", 13, 10
    endif
  popad
ENDM

AlgoName$ MACRO algo
  if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
    EXITM @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">)
  else
    EXITM <"other">
  endif
ENDM
;...
start:
    push 1
    call ShowCpu    ; print brand string and SSE level
    invoke SetProcessAffinityMask, -1, 1    ; restrict to one core
   
    Calibrate
    ??cntr = 1
    REPEAT ShowLoops+2
        IF ??cntr LT ShowLoops+1
            SpinUp
        ENDIF
        FORC char,<ABCDEFGHIJKLMNOPQRSTUVWXYZ>
            IFDEF use&char&
                IF use&char&
                    IF ??cntr LT ShowLoops+1
                        invoke Sleep, SleepMs
                        counter_begin TimerLoops, HIGH_PRIORITY_CLASS
                           call Test&char&
                        counter_end
                        ShowCycles Test&char&
                    ELSEIF (??cntr EQ ShowLoops+1) AND ShowSize NE 0
                        CodeSize Test&char&
                    ELSEIF (??cntr EQ ShowLoops+2) AND ShowResult NE 0
                        CodeResult Test&char&
                    ENDIF
                ENDIF       
            ENDIF
        ENDM
        IF (??cntr LT ShowLoops+1) OR ((??cntr EQ ShowLoops+1) AND ShowSize NE 0) OR ((??cntr EQ ShowLoops+2) AND ShowResult NE 0)
            print chr$(13, 10)
        ENDIF
        ??cntr = ??cntr + 1
    ENDM
    inkey chr$(13, 10, "--- ok ---", 13)
    exit
   :icon_cool:
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Quote from: qWord on August 14, 2013, 06:07:51 AM
jj, please a little more effectiveness for the macros...

qWord,
Thanks for this demo, guru of the macro universe :biggrin:
Right now I am too busy solving QWORD (note the uppercase) mysteries, will test it asap ;-)

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9118    cycles for 1000 * cmp128 (2 globals, buggy)
73747   cycles for 1000 * cmp128b (loop)
35023   cycles for 1000 * OcmpJJ
94027   cycles for 1000 * OcmpAlex


@Alex: Your macro doesn't pass the full test yet... see Cmp128.asc (open with \Masm32\RichMasm\RichMasm.exe, hit F6 to build; to navigate, click on bookmarks on the right, or select a word and hit F3)