Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

dedndave

Jochen's latest attachment...

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2557    kCycles [x][x][x] - Cmp128Nidud
2815    kCycles [x][x][x] - Cmp128NidudSSE
2517    kCycles [x][x][x] - Cmp128Dave
3715    kCycles [x][x][x] - Cmp128Dave2
1557    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1521    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1753    kCycles [x][x][x] - Cmp128JJAlexSSE_3
914     kCycles [x][x][ ] - Cmp128Alex
1698    kCycles [x][x][x] - Cmp128Alex_2
1726    kCycles [x][x][x] - Cmp128Alex_3
1850    kCycles [x][x][ ] - Cmp128JJSSE
1298    kCycles [x][x][ ] - AxCMP128bitProc3
1228    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
696     kCycles [x][ ][ ] - Cmp128DaveU
690     kCycles [x][ ][ ] - Cmp128NidudU

------------------------------------------------------
2576    kCycles [x][x][x] - Cmp128Nidud
2804    kCycles [x][x][x] - Cmp128NidudSSE
2543    kCycles [x][x][x] - Cmp128Dave
3723    kCycles [x][x][x] - Cmp128Dave2
1557    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1518    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1853    kCycles [x][x][x] - Cmp128JJAlexSSE_3
918     kCycles [x][x][ ] - Cmp128Alex
1697    kCycles [x][x][x] - Cmp128Alex_2
1733    kCycles [x][x][x] - Cmp128Alex_3
1848    kCycles [x][x][ ] - Cmp128JJSSE
1280    kCycles [x][x][ ] - AxCMP128bitProc3
1211    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
677     kCycles [x][ ][ ] - Cmp128DaveU
707     kCycles [x][ ][ ] - Cmp128NidudU

FORTRANS

Hi,

   From Reply # 266.


Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1012 kCycles [x][x][x] - Cmp128Nidud
1153 kCycles [x][x][x] - Cmp128NidudSSE
1049 kCycles [x][x][x] - Cmp128Dave
2520 kCycles [x][x][x] - Cmp128Dave2
1441 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1682 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1686 kCycles [x][x][x] - Cmp128JJAlexSSE_3
824 kCycles [x][x][ ] - Cmp128Alex
1200 kCycles [x][x][x] - Cmp128Alex_2
1261 kCycles [x][x][x] - Cmp128Alex_3
1823 kCycles [x][x][ ] - Cmp128JJSSE
951 kCycles [x][x][ ] - AxCMP128bitProc3
987 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
640 kCycles [x][ ][ ] - Cmp128DaveU
633 kCycles [x][ ][ ] - Cmp128NidudU

--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
986 kCycles [x][x][x] - Cmp128Nidud
1134 kCycles [x][x][x] - Cmp128NidudSSE
1024 kCycles [x][x][x] - Cmp128Dave
2491 kCycles [x][x][x] - Cmp128Dave2
1423 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1668 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1674 kCycles [x][x][x] - Cmp128JJAlexSSE_3
817 kCycles [x][x][ ] - Cmp128Alex
1190 kCycles [x][x][x] - Cmp128Alex_2
1249 kCycles [x][x][x] - Cmp128Alex_3
1813 kCycles [x][x][ ] - Cmp128JJSSE
941 kCycles [x][x][ ] - AxCMP128bitProc3
986 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
639 kCycles [x][ ][ ] - Cmp128DaveU
630 kCycles [x][ ][ ] - Cmp128NidudU

--- ok ---


Regards,

Steve

nidud

#272
deleted

jj2007

Quote from: dedndave on August 30, 2013, 10:11:42 PM
Jochen's latest attachment...

The only change was a shr eax, 10 (i.e. a division by 1024) to show kCycles and make the timings more readable...
BTW, it would be nice if the test that produces the [x][x][x] marks could be integrated with the timings. At present, there is a hand-made static string only... where are the authors of the magic test?
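
For reference, the three columns in the timing tables presumably say whether JB/JA (which read CF), JL/JG (which read SF and OF together) and JO/JS (which read OF and SF individually) behave after the routine as they would after a real CMP. Below is a minimal sketch in C, not the thread's MASM and not the test any poster actually used, of how reference flags for a 128-bit compare could be computed. It assumes an OWORD is held as two 64-bit halves; the names u128, cmp_flags and cmp128_flags are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical OWORD layout: low qword first, as in little-endian memory. */
typedef struct { uint64_t lo, hi; } u128;

/* The four flags a CMP leaves behind that the conditional jumps read. */
typedef struct { bool cf, zf, sf, of; } cmp_flags;

/* Emulate "cmp a, b" on 128-bit operands, i.e. compute a - b and
   derive the flags from the operands and the 128-bit difference. */
cmp_flags cmp128_flags(u128 a, u128 b)
{
    uint64_t lo     = a.lo - b.lo;          /* low half of the difference  */
    uint64_t borrow = a.lo < b.lo;          /* borrow out of the low half  */
    uint64_t hi     = a.hi - b.hi - borrow; /* high half of the difference */

    cmp_flags f;
    /* CF: unsigned a < b (a borrow out of bit 127). */
    f.cf = (a.hi < b.hi) || (a.hi == b.hi && a.lo < b.lo);
    /* ZF: the whole difference is zero. */
    f.zf = (lo | hi) == 0;
    /* SF: bit 127 of the difference. */
    f.sf = (hi >> 63) != 0;
    /* OF: operand signs differ and the result sign differs from a's sign. */
    f.of = (((a.hi ^ b.hi) & (a.hi ^ hi)) >> 63) != 0;
    return f;
}
```

A checker could run each candidate routine, capture the flags it leaves (for example with LAHF and SETO, or PUSHFD), and compare them against these values to fill in the three marks automatically instead of using a static string.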

CmpFlag.zip:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
------------------------------------------------------
672764  cycles for CmpLEA
465288  cycles for CmpADD
464995  cycles for CmpINC
212725  cycles for CmpBSF
2981445 cycles for CmpCLX

dedndave

Quote
LEA will be faster than ADD on Dave's CPU
INC preserves CF, so this will be slower on Dave's CPU
BSF will be much faster on yours and Dave's CPU

today must be opposite day   :P
there are certain things that P4's are just not good at
i like developing on a P4, though - if it's fast on my machine, it'll be fast on everyone else's   :lol:

not sure what the CmpCLX test is, but my CPU hates it
a good chance it is not doing what you want it to

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
574802  cycles for CmpLEA
601257  cycles for CmpADD
912774  cycles for CmpINC
474878  cycles for CmpBSF
27755044        cycles for CmpCLX
------------------------------------------------------
624710  cycles for CmpLEA
609602  cycles for CmpADD
915751  cycles for CmpINC
494000  cycles for CmpBSF
27825419        cycles for CmpCLX
------------------------------------------------------
589298  cycles for CmpLEA
591765  cycles for CmpADD
909765  cycles for CmpINC
468357  cycles for CmpBSF
27813103        cycles for CmpCLX
------------------------------------------------------

nidud

#275
deleted

FORTRANS


pre-P4 (SSE1)
------------------------------------------------------
690329 cycles for CmpLEA
703742 cycles for CmpADD
709925 cycles for CmpINC
215679 cycles for CmpBSF
2708852 cycles for CmpCLX
------------------------------------------------------
688496 cycles for CmpLEA
704680 cycles for CmpADD
707613 cycles for CmpINC
215581 cycles for CmpBSF
2709056 cycles for CmpCLX
------------------------------------------------------
688429 cycles for CmpLEA
704797 cycles for CmpADD
707767 cycles for CmpINC
215511 cycles for CmpBSF
2707745 cycles for CmpCLX
------------------------------------------------------

--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
------------------------------------------------------
712592 cycles for CmpLEA
712084 cycles for CmpADD
711109 cycles for CmpINC
207906 cycles for CmpBSF
2633382 cycles for CmpCLX
------------------------------------------------------
710075 cycles for CmpLEA
711340 cycles for CmpADD
712251 cycles for CmpINC
207880 cycles for CmpBSF
2633572 cycles for CmpCLX
------------------------------------------------------
711945 cycles for CmpLEA
712603 cycles for CmpADD
710777 cycles for CmpINC
209128 cycles for CmpBSF
2631655 cycles for CmpCLX
------------------------------------------------------

--- ok ---

pre-P4
------------------------------------------------------
1386583 cycles for CmpLEA
734379 cycles for CmpADD
733891 cycles for CmpINC
1341089 cycles for CmpBSF
1865749 cycles for CmpCLX
------------------------------------------------------
1386134 cycles for CmpLEA
735003 cycles for CmpADD
734641 cycles for CmpINC
1341132 cycles for CmpBSF
1867097 cycles for CmpCLX
------------------------------------------------------
1382758 cycles for CmpLEA
736389 cycles for CmpADD
734488 cycles for CmpINC
1341860 cycles for CmpBSF
1867206 cycles for CmpCLX
------------------------------------------------------

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
713980 cycles for CmpLEA
716320 cycles for CmpADD
716456 cycles for CmpINC
214883 cycles for CmpBSF
3013235 cycles for CmpCLX
------------------------------------------------------
714376 cycles for CmpLEA
715769 cycles for CmpADD
716068 cycles for CmpINC
214918 cycles for CmpBSF
3013878 cycles for CmpCLX
------------------------------------------------------
714034 cycles for CmpLEA
716052 cycles for CmpADD
716264 cycles for CmpINC
214754 cycles for CmpBSF
3012735 cycles for CmpCLX
------------------------------------------------------

--- ok ---

dedndave

Quote from: nidud on August 31, 2013, 06:07:49 AM
Quote
not sure what the CmpCLX test is

it manipulates the flags using STC/CLC/STD/CLD/CMC

ahhh.....
CLD and STD are slow as hell on P4's
and not all that fast on many other processors
they seem to be reasonable on your AMD

Antariy

Quote from: nidud on August 31, 2013, 03:24:36 AM
Quote
Well, actually it should not be so, because, as we see, desktop PIV models (my Prescott and Dave's) do not just trash all the flags but set them logically: they zero all flags that are "unused after the instruction", set the parity flag according to the result (though the CPU need not even bother with that), and set the zero flag as defined, instead of leaving them unchanged. So it should be even slower on our CPUs :biggrin:

Well, if that is correct the following test will prove your point

AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
383145 cycles for CmpLEA
382951 cycles for CmpADD
384098 cycles for CmpINC
384502 cycles for CmpBSF
378250 cycles for CmpCLX
------------------------------------------------------
383944 cycles for CmpLEA
387003 cycles for CmpADD
383393 cycles for CmpINC
383522 cycles for CmpBSF
378291 cycles for CmpCLX
------------------------------------------------------
385948 cycles for CmpLEA
384310 cycles for CmpADD
383979 cycles for CmpINC
384283 cycles for CmpBSF
378046 cycles for CmpCLX


The BSF test should then be faster on my CPU  :P

Well, it did prove it (on other machines, too)


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
------------------------------------------------------
490345  cycles for CmpLEA
524195  cycles for CmpADD
768829  cycles for CmpINC
397793  cycles for CmpBSF
28201301        cycles for CmpCLX
------------------------------------------------------
501654  cycles for CmpLEA
512675  cycles for CmpADD
758029  cycles for CmpINC
410060  cycles for CmpBSF
28179301        cycles for CmpCLX
------------------------------------------------------
489078  cycles for CmpLEA
521448  cycles for CmpADD
773169  cycles for CmpINC
404085  cycles for CmpBSF
28120852        cycles for CmpCLX
------------------------------------------------------

--- ok ---


nidud

#279
deleted

jj2007

#280
Hi,
I have shortened the MasmBasic Qcmp (and Ocmp) macro a little bit - and get zero failures now :biggrin:

Please include it in Cmp128Eval and the timings.

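include oqCmp.asm

align 16

        test_start
        lea esi,ow_table                ; outer operand: start of the test table
        .repeat
            lea edi,ow_table            ; inner operand: restart at the table start
            .repeat
                Qcmp [esi], [edi]       ; compare the operands at esi and edi
                add edi,4               ; step the inner pointer by one DWORD
            .until edi >= offset eo_table
            add esi,4                   ; step the outer pointer by one DWORD
        .until esi >= offset eo_table   ; eo_table marks the end of the table
        test_end "cycles (x)(x)(x) - MasmBasic Qcmp"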

P.S.: Timings attached. I excluded one very slow algo and those which fail in the first two categories.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945     kCycles [x][x][x] - Cmp128Dave
911     kCycles [x][x][x] - Cmp128Nidud
1013    kCycles [x][x][x] - Cmp128NidudSSE
684     kCycles [x][x][ ] - Cmp128Alex
1127    kCycles [x][x][x] - MasmBasic Ocmp  <<<<< OWORD
989     kCycles [x][x][x] - MasmBasic Qcmp
814     kCycles [x][x][x] - Cmp128JJAlexSSE_1
925     kCycles [x][x][x] - Cmp128JJAlexSSE_2
926     kCycles [x][x][x] - Cmp128JJAlexSSE_3
859     kCycles [x][x][ ] - AxCMP128bitProc3
868     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)


And timings for an i5 - with Qcmp and Cmp128JJAlexSSE_1 run three times each (the timings are not very stable on the i5, and these two algos are the most interesting for me):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
746     kCycles [x][x][x] - Cmp128Dave
600     kCycles [x][x][x] - Cmp128Nidud
714     kCycles [x][x][x] - Cmp128NidudSSE
494     kCycles [x][x][ ] - Cmp128Alex
429     kCycles [x][x][x] - MasmBasic Qcmp
388     kCycles [x][x][x] - MasmBasic Qcmp
429     kCycles [x][x][x] - MasmBasic Qcmp
428     kCycles [x][x][x] - Cmp128JJAlexSSE_1
401     kCycles [x][x][x] - Cmp128JJAlexSSE_1
428     kCycles [x][x][x] - Cmp128JJAlexSSE_1
437     kCycles [x][x][x] - Cmp128JJAlexSSE_2
427     kCycles [x][x][x] - Cmp128JJAlexSSE_3
530     kCycles [x][x][ ] - AxCMP128bitProc3
501     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

FORTRANS

Hi,

   I posted a routine in Reply # 252.  This quote is from Reply #262.

Quote from: FORTRANS on August 28, 2013, 10:26:34 PM
   The only notable fact that I saw was if the Zero Flag is set, no
others being considered are set as well.  So you can take an early
exit from the algorithm if the CMPS result is zero.  I did not bother
with mine.*  (Though I, or someone, should time both versions.)

   Well, I timed it with and without the test for zero and an early
exit.  The one with the extra test was slower.  Tested with Dave's
112 test values.
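
The two variants being timed can be sketched roughly as follows. This is a C illustration only, not the routine from Reply #252; the names oword, cmp_flags, cmp_full and cmp_early are hypothetical. It assumes the OWORD is processed as four dwords with a borrow chain, and that the early exit simply returns "ZF set, everything else clear" when the operands are equal.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical OWORD as four dwords, d[0] least significant. */
typedef struct { uint32_t d[4]; } oword;
typedef struct { bool cf, zf, sf, of; } cmp_flags;

/* Variant 1: always run the full subtract-with-borrow chain. */
cmp_flags cmp_full(const oword *a, const oword *b)
{
    uint32_t r[4];
    uint32_t borrow = 0;
    for (int i = 0; i < 4; i++) {
        /* 64-bit arithmetic so a negative result shows up in the top bit. */
        uint64_t t = (uint64_t)a->d[i] - b->d[i] - borrow;
        r[i]   = (uint32_t)t;
        borrow = (uint32_t)(t >> 63);
    }
    cmp_flags f;
    f.cf = borrow != 0;                          /* borrow out of the top dword */
    f.zf = (r[0] | r[1] | r[2] | r[3]) == 0;     /* all result dwords are zero  */
    f.sf = (r[3] >> 31) != 0;                    /* sign of the 128-bit result  */
    f.of = (((a->d[3] ^ b->d[3]) & (a->d[3] ^ r[3])) >> 31) != 0;
    return f;
}

/* Variant 2: extra equality test with an early exit.
   When the operands are equal, ZF is set and CF, SF and OF are all clear. */
cmp_flags cmp_early(const oword *a, const oword *b)
{
    if (memcmp(a->d, b->d, sizeof a->d) == 0) {
        cmp_flags eq = { false, true, false, false };
        return eq;
    }
    return cmp_full(a, b);
}
```

Both variants return the same flags for every input. The early exit only pays off when equal operands are common in the test data, since every unequal pair pays for the extra test on top of the full chain.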

Regards,

Steve N.

Siekmanski

from Reply #280,


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
942     kCycles [x][x][x] - Cmp128Dave
891     kCycles [x][x][x] - Cmp128Nidud
1023    kCycles [x][x][x] - Cmp128NidudSSE
673     kCycles [x][x][ ] - Cmp128Alex
828     kCycles [x][x][x] - MasmBasic Qcmp
766     kCycles [x][x][x] - Cmp128JJAlexSSE_1
869     kCycles [x][x][x] - Cmp128JJAlexSSE_2
876     kCycles [x][x][x] - Cmp128JJAlexSSE_3
867     kCycles [x][x][ ] - AxCMP128bitProc3
895     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---
Creative coders use backward thinking techniques as a strategy.

nidud

#283
deleted

jj2007

#284
You can substantially reduce the number of failures (from 1318 to zero) if you use Ocmp ("O" like "O sole mio") instead of Qcmp  ;)

And yes, it's my fault because I erroneously used Qcmp in the timings. I was qonfused ::)
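
To illustrate why a QWORD compare fails on OWORD test data, here is a small C sketch. It assumes, as the names suggest, that Qcmp compares 64-bit operands and Ocmp 128-bit operands; it does not reproduce the MasmBasic macros or the 1318 failures, and u128, cmp_qword, cmp_oword and count_disagreements are made-up names.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical OWORD layout: low qword first. */
typedef struct { uint64_t lo, hi; } u128;

/* Signed compare that looks only at the low 64 bits, as a QWORD compare
   applied to an OWORD address would. Returns -1, 0 or 1.
   (The unsigned-to-signed casts rely on two's complement conversion.) */
int cmp_qword(u128 a, u128 b)
{
    int64_t x = (int64_t)a.lo, y = (int64_t)b.lo;
    return (x > y) - (x < y);
}

/* Signed compare of the full 128 bits: the high half decides as a signed
   value, the low half breaks ties as an unsigned value. */
int cmp_oword(u128 a, u128 b)
{
    int64_t ah = (int64_t)a.hi, bh = (int64_t)b.hi;
    if (ah != bh)
        return (ah > bh) - (ah < bh);
    return (a.lo > b.lo) - (a.lo < b.lo);
}

/* Count the ordered pairs of a table on which the two compares disagree. */
size_t count_disagreements(const u128 *t, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (cmp_qword(t[i], t[j]) != cmp_oword(t[i], t[j]))
                bad++;
    return bad;
}
```

Any pair whose high qwords differ can trip the 64-bit version, so a table built to exercise the upper half of the OWORD will produce many disagreements.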

New version attached, with minor changes to Ocmp:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945     kCycles [x][x][x] - Cmp128Dave
911     kCycles [x][x][x] - Cmp128Nidud
1013    kCycles [x][x][x] - Cmp128NidudSSE
684     kCycles [x][x][ ] - Cmp128Alex
1010    kCycles [x][x][x] - MasmBasic Ocmp
815     kCycles [x][x][x] - Cmp128JJAlexSSE_1
925     kCycles [x][x][x] - Cmp128JJAlexSSE_2
925     kCycles [x][x][x] - Cmp128JJAlexSSE_3
870     kCycles [x][x][ ] - AxCMP128bitProc3
867     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)