Comparing 128-bit numbers aka OWORDs

jj2007 · August 12, 2013, 08:25:24 PM

Following the "What is the fastest way (performance wise) to compare two 128 bit integers" thread in the Campus, here a first attempt to time comparisons of 128-bit unsigned integers.

Fifteen cycles is quite a lot, so if you have a better algo, please post it... you can test it before in line 90 of the attached source.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
15580 cycles for 1000 * cmp128 (2 globals)
88587 cycles for 1000 * cmp128b (loop)
27107 cycles for 1000 * cmp128p (calls proc)
28611 cycles for 1000 * two pointers

P.S.: Googling yields almost nothing, apparently there are not many applications for this :(

sinsi · August 12, 2013, 08:45:03 PM

Que?

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1824/1000 cycles

3385 cycles for 1000 * cmp128
?? cycles for 1000 * cmp128 xx

3489 cycles for 1000 * cmp128
?? cycles for 1000 * cmp128 xx

3335 cycles for 1000 * cmp128
?? cycles for 1000 * cmp128 xx

jj2007 · August 12, 2013, 10:23:54 PM

John, your CPU doesn't respect the speed limits, as usual

OK, version B attached on top. It features a loop based macro:

cmp128b MACRO ow0, ow1   ; both operands must be memory variables
push esi
push edi
mov esi, offset ow0
mov edi, offset ow1
mov ecx, 16
.Repeat
   dec ecx
   .if Sign?
      inc ecx
      .Break
   .endif
   movzx eax, byte ptr [esi+ecx]
   movzx edx, byte ptr [edi+ecx]
   cmp eax, edx
.Until !Zero?
pop edi
pop esi
ENDM

sinsi · August 12, 2013, 10:36:36 PM

>John, your CPU doesn't respect the speed limits, as usual

Hah! It's you being disrespectful to my CPU.

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1694/1000 cycles

3158 cycles for 1000 * cmp128
58915 cycles for 1000 * cmp128b

3162 cycles for 1000 * cmp128
58929 cycles for 1000 * cmp128b

3124 cycles for 1000 * cmp128
59191 cycles for 1000 * cmp128b

jj2007 · August 12, 2013, 11:38:25 PM

One more, adding a generic one which expects two pointers:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2998/1000 cycles

15567 cycles for 1000 * cmp128 (2 globals)
88584 cycles for 1000 * cmp128b (loop)
27141 cycles for 1000 * cmp128p (calls proc)
29387 cycles for 1000 * two pointers

15562 cycles for 1000 * cmp128 (2 globals)
88587 cycles for 1000 * cmp128b (loop)
27100 cycles for 1000 * cmp128p (calls proc)
29431 cycles for 1000 * two pointers

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles

9116 cycles for 1000 * cmp128 (2 globals)
73639 cycles for 1000 * cmp128b (loop)
17461 cycles for 1000 * cmp128p (calls proc)
24831 cycles for 1000 * two pointers

Gunther · August 13, 2013, 03:16:45 AM

Jochen,

your timings:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3047/1000 cycles

1636    cycles for 1000 * cmp128 (2 globals)
56016   cycles for 1000 * cmp128b (loop)
4410    cycles for 1000 * cmp128p (calls proc)
2688    cycles for 1000 * two pointers

1696    cycles for 1000 * cmp128 (2 globals)
55860   cycles for 1000 * cmp128b (loop)
10677   cycles for 1000 * cmp128p (calls proc)
2658    cycles for 1000 * two pointers

1612    cycles for 1000 * cmp128 (2 globals)
55951   cycles for 1000 * cmp128b (loop)
10790   cycles for 1000 * cmp128p (calls proc)
2770    cycles for 1000 * two pointers

--- ok ---

Gunther

hutch-- · August 13, 2013, 05:26:23 AM

Hmmmm,

Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 1028/1000 cycles

6340 cycles for 1000 * cmp128 (2 globals)
66042 cycles for 1000 * cmp128b (loop)
15203 cycles for 1000 * cmp128p (calls proc)
28236 cycles for 1000 * two pointers

6394 cycles for 1000 * cmp128 (2 globals)
66052 cycles for 1000 * cmp128b (loop)
15315 cycles for 1000 * cmp128p (calls proc)
28234 cycles for 1000 * two pointers

6340 cycles for 1000 * cmp128 (2 globals)
66007 cycles for 1000 * cmp128b (loop)
14605 cycles for 1000 * cmp128p (calls proc)
28232 cycles for 1000 * two pointers

--- ok ---

RuiLoureiro · August 13, 2013, 08:06:27 AM

Hi
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
loop overhead is approx. 2450/1000 cycles

18005 cycles for 1000 * cmp128 (2 globals)
101446 cycles for 1000 * cmp128b (loop)
34947 cycles for 1000 * cmp128p (calls proc)
27351 cycles for 1000 * two pointers

17983 cycles for 1000 * cmp128 (2 globals)
103720 cycles for 1000 * cmp128b (loop)
35346 cycles for 1000 * cmp128p (calls proc)
27835 cycles for 1000 * two pointers

18020 cycles for 1000 * cmp128 (2 globals)
97862 cycles for 1000 * cmp128b (loop)
35671 cycles for 1000 * cmp128p (calls proc)
27585 cycles for 1000 * two pointers

--- ok ---

jj2007 · August 13, 2013, 10:12:03 AM

Thanks to all of you :t

If I find the time, I'll have to look at the validity of results. Perhaps an extra check of the msb is necessary.

ifidni @Environ(oAssembler), <mlv615>
oSmall   qWORD 00000000000000001h, 0800000000000000h   ; 7f00000000000000???
   db 03h
oMedium   qWORD 00000000000000002h, 0800000000000000h
   db 02h
oBig   qWORD 00000000000000003h, 0800000000000000h
   db 01h
oSmallF   qWORD 00000000000007f00h, 0800000000000000h
oMedF   qWORD 00000000000008000h, 0800000000000000h
oBigF   qWORD 0000000000000ff00h, 0800000000000000h
else
oSmall   OWORD 080000000000000000000000000000001h
   db 03h
oMedium   OWORD 080000000000000000000000000000002h
   db 02h
oBig   OWORD 080000000000000000000000000000003h
   db 01h
oSmallF   OWORD 080000000000000000000000000007f00h
oMedF   OWORD 080000000000000000000000000008000h
oBigF   OWORD 08000000000000000000000000000ff00h
endif

Antariy · August 14, 2013, 02:11:59 AM

Hi Jochen :t

Here are my code added and timings.

This is a tricky replacement of PCMPEQD to make the SSE1-compatible code. Archive contains both versions, the "default" is SSE1-capable.

ifdef USE_SSE1
xorps xmm1,xmm0
movaps xmm2,xmm1
cmpps xmm1,oword ptr zero128b,0
andps xmm2,xmm1
xorps xmm1,xmm2
else
pcmpeqd xmm0,xmm1
endif

The tricky thing with packed comparsion is the question: is the not matched DWORD the highest DWORD, or not? The signed/unsigned comparsion sets different flags, but they relate to one other for the number that is in positive range of signed.
The code made flags correction if required.

movmskps eax,xmm1
xor al,0fH
bsr eax,eax
jz @l1
xor ecx,ecx
mov edx,dword ptr [ow0+eax*4]
cmp eax,3 ; if this was not highest-order dword
mov eax,dword ptr [ow1+eax*4]
setne cl

cmp edx,eax ; OF = SF if EDX (signed)> ECX, CF = 0 if EDX (unsigned)> ECX
jecxz @l1

; we need to preserve ZF and CF flags
; but make OF != SF if CF = 1

jns @F ; (if sign flag is set, then we do want to clear OF flag)
mov ecx,80000000h ; then mark: highest bit will become #30 bit,
@@: ; CF will become #31 bit...

setc cl ; ...lowest bit will become CF

rcr ecx,1 ; restore CF, if two highest bits set then OF = 0 else OF = 1
@l1:

I've replaced cmp128b macro with my code to simplify addition to the testbed.

It will probably better to avoid JECXZ with using of a jumptable, but in this testbed it was simpler to use JECXZ than to create table for every macro expansion.
Most robust way of usage is to compare for equality first, then for signed/unsigned great-or-less than (because of short cut after BSR - if it goes that path - the only ZF flag is guaranteed to be set exact to results).

Code Select



SSE2:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2231/1000 cycles

18239   cycles for 1000 * cmp128
20661   cycles for 1000 * cmp128b

18399   cycles for 1000 * cmp128
20907   cycles for 1000 * cmp128b

18365   cycles for 1000 * cmp128
20838   cycles for 1000 * cmp128b


--- ok ---


SSE1:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2189/1000 cycles

18274   cycles for 1000 * cmp128
22574   cycles for 1000 * cmp128b

18327   cycles for 1000 * cmp128
22582   cycles for 1000 * cmp128b

18273   cycles for 1000 * cmp128
22632   cycles for 1000 * cmp128b


--- ok ---

Gunther · August 14, 2013, 03:32:37 AM

Alex,

you've made good points. Thank you. :t

Here are the timings:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3341/1000 cycles

1416    cycles for 1000 * cmp128
3673    cycles for 1000 * cmp128b

1352    cycles for 1000 * cmp128
3597    cycles for 1000 * cmp128b

1405    cycles for 1000 * cmp128
3670    cycles for 1000 * cmp128b

--- ok ---

Gunther

jj2007 · August 14, 2013, 05:09:20 AM

Quote from: Antariy on August 14, 2013, 02:11:59 AM
Hi Jochen :t

Here are my code added and timings.

Thanks a lot, Alex :t

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1767/1000 cycles

9121 cycles for 1000 * cmp128
93407 cycles for 1000 * cmp128b

I am still busy with the macro, trying to make it return correct values. If you want to test yours, here is the core of the testbed, in qword version.

.data
qSmall qWORD 7f00000000000001h
qMedium qWORD 7f00000000000002h
qBig qWORD 7f00000000000003h
qSmallN qWORD 8000000000000001h
qMedN qWORD 8000000000000002h
qBigN qWORD 8000000000000003h
qSmallF qWORD 0800000000007f00h
qMedF qWORD 0800000000008000h
qBigF qWORD 080000000000ff00h
...
   print chr$(13, 10, "Qcmp", 13, 10)
   Print Str$("qSp=%i\n", qSmall)
   Print Str$("qBp=%i\n", qBig)
   Print Str$("qSn=%i\n", qSmallN)
   Print Str$("qBn=%i\n\n", qBigN)
   print chr$(13, 10, "positive (lesser, greater)", 13, 10)
   Qcmp qSmall, qBig
   Qcmp qBig, qSmall
   print chr$(13, 10, "negative (lesser, greater)", 13, 10)
   Qcmp qSmallN, qBigN
   Qcmp qBigN, qSmallN
   print chr$(13, 10, "pos, neg (greater, greater)", 13, 10)
   jt=1
   Qcmp qSmall, qBigN
   Qcmp qBig, qSmallN
   print chr$(13, 10, "neg, pos (lesser, lesser)", 13, 10)
   Qcmp qSmallN, qBig
   Qcmp qBigN, qSmall

qWord · August 14, 2013, 06:07:51 AM

jj, please a little more effectiveness for the macros...

Code Select


useA=1
;...
CodeSize MACRO algo, overhead:=<15>    ; default overhead is mov ecx, 99+mov eax, ecx+loop
  pushad
  mov eax, offset &algo&_endp    ; OPT_Errline 0
  sub eax, offset &algo&_s
  sub eax, overhead
    if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
        % print str$(eax), 9, @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">), 13, 10
    else
        % print str$(eax), 9, "bytes for other", 13, 10
    endif
  popad
ENDM

AlgoName$ MACRO algo
  if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
    EXITM @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">)
  else
    EXITM <"other">
  endif
ENDM
;...
start:
    push 1
    call ShowCpu    ; print brand string and SSE level
    invoke SetProcessAffinityMask, -1, 1    ; restrict to one core
    
    Calibrate
    ??cntr = 1
    REPEAT ShowLoops+2
        IF ??cntr LT ShowLoops+1 
            SpinUp
        ENDIF
        FORC char,<ABCDEFGHIJKLMNOPQRSTUVWXYZ>
            IFDEF use&char&
                IF use&char&
                    IF ??cntr LT ShowLoops+1
                        invoke Sleep, SleepMs
                        counter_begin TimerLoops, HIGH_PRIORITY_CLASS
                           call Test&char&
                        counter_end
                        ShowCycles Test&char&
                    ELSEIF (??cntr EQ ShowLoops+1) AND ShowSize NE 0
                        CodeSize Test&char&
                    ELSEIF (??cntr EQ ShowLoops+2) AND ShowResult NE 0
                        CodeResult Test&char&
                    ENDIF
                ENDIF        
            ENDIF
        ENDM
        IF (??cntr LT ShowLoops+1) OR ((??cntr EQ ShowLoops+1) AND ShowSize NE 0) OR ((??cntr EQ ShowLoops+2) AND ShowResult NE 0)
            print chr$(13, 10)
        ENDIF
        ??cntr = ??cntr + 1
    ENDM
    inkey chr$(13, 10, "--- ok ---", 13)
    exit

:icon_cool:

jj2007 · August 14, 2013, 06:48:56 AM

Quote from: qWord on August 14, 2013, 06:07:51 AM
jj, please a little more effectiveness for the macros...

qWord,
Thanks for this demo, guru of the macro universe

Right now I am too busy solving QWORD (note the uppercase) mysteries, will test it asap ;-)

jj2007 · August 14, 2013, 08:58:17 AM

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
9118 cycles for 1000 * cmp128 (2 globals, buggy)
73747 cycles for 1000 * cmp128b (loop)
35023 cycles for 1000 * OcmpJJ
94027 cycles for 1000 * OcmpAlex

@Alex: Your macro doesn't pass the full test yet... see Cmp128.asc (open with \Masm32\RichMasm\RichMasm.exe, hit F6 to build; to navigate, click on bookmarks on the right, or select a word and hit F3)

The MASM Forum

News: