Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

FORTRANS

Hi,

   From Reply #216.

pre-P4 (SSE1)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1067625 cycles [x][x][x] - Cmp128Dave
2571737 cycles [x][x][x] - Cmp128Dave2
998428 cycles [x][x][x] - Cmp128Nidud
1083846 cycles [x][x][x] - Cmp128NidudSSE
847793 cycles [x][x][ ] - Cmp128Alex
1788551 cycles [x][x][ ] - Cmp128JJSSE
1215146 cycles [x][x][x] - Cmp128JJAlexSSE_1
1623996 cycles [x][x][x] - Cmp128JJAlexSSE_2
1570182 cycles [x][x][x] - Cmp128JJAlexSSE_3
1114476 cycles [x][x][ ] - AxCMP128bitProc3
1174133 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
608508 cycles [x][ ][ ] - Cmp128DaveU
612287 cycles [x][ ][ ] - Cmp128NidudU

--- ok --- 
Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1040860 cycles [x][x][x] - Cmp128Dave
2541208 cycles [x][x][x] - Cmp128Dave2
940427 cycles [x][x][x] - Cmp128Nidud
1046690 cycles [x][x][x] - Cmp128NidudSSE
834253 cycles [x][x][ ] - Cmp128Alex
1849858 cycles [x][x][ ] - Cmp128JJSSE
1453007 cycles [x][x][x] - Cmp128JJAlexSSE_1
1703155 cycles [x][x][x] - Cmp128JJAlexSSE_2
1713931 cycles [x][x][x] - Cmp128JJAlexSSE_3
963145 cycles [x][x][ ] - AxCMP128bitProc3
1004886 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652720 cycles [x][ ][ ] - Cmp128DaveU
646938 cycles [x][ ][ ] - Cmp128NidudU

--- ok --- 
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
979702 cycles [x][x][x] - Cmp128Dave
2660548 cycles [x][x][x] - Cmp128Dave2
948481 cycles [x][x][x] - Cmp128Nidud
1056326 cycles [x][x][x] - Cmp128NidudSSE
754229 cycles [x][x][ ] - Cmp128Alex
1145531 cycles [x][x][ ] - Cmp128JJSSE
852507 cycles [x][x][x] - Cmp128JJAlexSSE_1
960256 cycles [x][x][x] - Cmp128JJAlexSSE_2
959330 cycles [x][x][x] - Cmp128JJAlexSSE_3
891707 cycles [x][x][ ] - AxCMP128bitProc3
899999 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
615768 cycles [x][ ][ ] - Cmp128DaveU
606497 cycles [x][ ][ ] - Cmp128NidudU

--- ok ---


Cheers,

Steve N.

Siekmanski


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
973837  cycles [x][x][x] - Cmp128Dave
3064246 cycles [x][x][x] - Cmp128Dave2
924278  cycles [x][x][x] - Cmp128Nidud
1063306 cycles [x][x][x] - Cmp128NidudSSE
688245  cycles [x][x][ ] - Cmp128Alex
1082474 cycles [x][x][ ] - Cmp128JJSSE
801400  cycles [x][x][x] - Cmp128JJAlexSSE_1
898730  cycles [x][x][x] - Cmp128JJAlexSSE_2
902646  cycles [x][x][x] - Cmp128JJAlexSSE_3
896815  cycles [x][x][ ] - AxCMP128bitProc3
929492  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
632298  cycles [x][ ][ ] - Cmp128DaveU
627533  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---
Creative coders use backward thinking techniques as a strategy.

FORTRANS

Hi,

   Using Dave's original 40 DWORD AND OWORD pairs of numbers,
and some of his logic, I wrote some code for my fixed point
program.  Claims it passes the tests.  Yippee!  Had me going in
circles for a while.

Cheers,

Steve N.

nidud

#228
deleted

FORTRANS

Quote from: nidud on August 26, 2013, 07:04:16 AM
the first test is strange:
pre-P4 (SSE1)


most of the code used in the macros are SSE2

Hi nidud,

   Yeah, I would think it should not run.  But?

Cheers,

Steve N.

MichaelW

I have seen SSE2 code that would run on my P3 without triggering an exception, but which would produce incorrect results.
Well Microsoft, here's another nice mess you've gotten us into.

Antariy

Yes, the SSE2 instructions used in the code are PCMPEQB and MOVAPD. PCMPEQB uses the same opcode as the MMX PCMPEQB but with a 66h prefix, which the PIII doesn't recognize, so it treats the instruction as MMX (and the SSE results are incorrect). MOVAPS works on the PIII, and MOVAPD has the opcode of MOVAPS with a 66h prefix, so it works, too.
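
To make the encoding relationship concrete, here is a small C sketch that lists the opcode bytes (without ModRM) and checks that each SSE2 form is the older form with 66h in front. The byte values are the documented encodings; the helper name `is_66h_prefixed_form` is made up for this illustration, and the claim that a PIII silently ignores the prefix is the observation from the post above, not something this code proves.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode bytes only (no ModRM byte), as documented in the instruction set references. */
static const uint8_t PCMPEQB_MMX[]  = { 0x0F, 0x74 };        /* pcmpeqb mm, mm/m64    */
static const uint8_t PCMPEQB_SSE2[] = { 0x66, 0x0F, 0x74 };  /* pcmpeqb xmm, xmm/m128 */
static const uint8_t MOVAPS_SSE1[]  = { 0x0F, 0x28 };        /* movaps xmm, xmm/m128  */
static const uint8_t MOVAPD_SSE2[]  = { 0x66, 0x0F, 0x28 };  /* movapd xmm, xmm/m128  */

/* Hypothetical helper: returns 1 if `sse2` is exactly `base` preceded by a 66h prefix. */
static int is_66h_prefixed_form(const uint8_t *sse2, size_t sse2_len,
                                const uint8_t *base, size_t base_len)
{
    return sse2_len == base_len + 1
        && sse2[0] == 0x66
        && memcmp(sse2 + 1, base, base_len) == 0;
}
```

If the prefix is dropped, PCMPEQB turns into an operation on the 64-bit MMX registers, while MOVAPD turns into MOVAPS, which moves the same 128 bits; that would fit one instruction giving wrong results and the other appearing to work.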

nidud

#232
deleted

Antariy

Quote from: nidud on August 26, 2013, 11:35:51 AM
this is starting to get a bit obsessive  :lol:

Yes :biggrin:

Quote from: nidud on August 26, 2013, 11:35:51 AM

mov eax,1
bsf eax,eax


Does this work? :icon_eek:
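
For reference, a rough C model of what BSF does with these operands. The quoted post is deleted, so what the pair was meant to achieve in the macro is not known; this only models the documented behaviour: the destination receives the index of the lowest set bit, ZF is set only when the source is zero, and the other arithmetic flags are undefined. The names `bsf_result` and `bsf32` are invented for the sketch.

```c
#include <stdint.h>

/* Hypothetical model of BSF: the result index plus the one flag BSF defines. */
typedef struct {
    uint32_t dest;  /* index of the lowest set bit (meaningful only when zf == 0) */
    int      zf;    /* 1 if the source was zero, else 0 */
} bsf_result;

static bsf_result bsf32(uint32_t src, uint32_t old_dest)
{
    bsf_result r = { old_dest, 1 };
    if (src == 0)
        return r;   /* destination is undefined on Intel; modelled here as unchanged */
    r.zf = 0;
    r.dest = 0;
    while ((src & 1u) == 0) {   /* scan upward from bit 0 */
        src >>= 1;
        r.dest++;
    }
    return r;
}
```

With `mov eax,1` followed by `bsf eax,eax`, the model gives eax = 0 and ZF clear.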


Here you can simplify a bit:
Quote from: nidud on August 26, 2013, 11:35:51 AM

movmskps eax,xmm0
sub al,1111B
jnz @0


Instead of jnz @0, use jz to the exit of the macro: at that point the zero (equal) result has already been processed correctly.
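
A sketch in C of why the zero case can leave immediately. It assumes that xmm0 holds a per-dword equality mask at this point (all ones in each dword where the operands match); the deleted post is not available, so that is an assumption. MOVMSKPS then yields one bit per dword, and subtracting 1111B gives zero exactly when all four dwords matched. The function names are made up for the illustration.

```c
#include <stdint.h>

/* Model of MOVMSKPS applied to a per-dword equality mask:
   bit i is set when dword i of a equals dword i of b. */
static unsigned equal_mask(const uint32_t a[4], const uint32_t b[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++)
        if (a[i] == b[i])
            mask |= 1u << i;
    return mask;
}

/* Model of "sub al,1111B": returns the resulting ZF (1 when all four dwords matched). */
static int zf_after_sub(unsigned mask)
{
    return (uint8_t)(mask - 0xFu) == 0;
}
```

When the mask is 1111B the subtraction leaves ZF set and no borrow, which is the flag state CMP gives for equal operands, so nothing further needs to be computed on that path.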




Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2302893 cycles [x][x][x] - Cmp128Nidud
2441613 cycles [x][x][x] - Cmp128NidudSSE
2871717 cycles [x][x][x] - Cmp128Dave
4208738 cycles [x][x][x] - Cmp128Dave2
1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1
1695861 cycles [x][x][x] - Cmp128JJAlexSSE_2
1946274 cycles [x][x][x] - Cmp128JJAlexSSE_3
985137  cycles [x][x][ ] - Cmp128Alex
2063049 cycles [x][x][ ] - Cmp128JJSSE
1411323 cycles [x][x][ ] - AxCMP128bitProc3
1324774 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
756882  cycles [x][ ][ ] - Cmp128DaveU
784458  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

sinsi

Feed the obsession...

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
---------------------------------------------------
305387  cycles [x][x][x] - Cmp128Nidud
308974  cycles [x][x][x] - Cmp128NidudSSE
617569  cycles [x][x][x] - Cmp128Dave
1184205 cycles [x][x][x] - Cmp128Dave2
273918  cycles [x][x][x] - Cmp128JJAlexSSE_1
319743  cycles [x][x][x] - Cmp128JJAlexSSE_2
319190  cycles [x][x][x] - Cmp128JJAlexSSE_3
452218  cycles [x][x][ ] - Cmp128Alex
323382  cycles [x][x][ ] - Cmp128JJSSE
417314  cycles [x][x][ ] - AxCMP128bitProc3
395354  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
341747  cycles [x][ ][ ] - Cmp128DaveU
348616  cycles [x][ ][ ] - Cmp128NidudU


jj2007

Quote from: sinsi on August 26, 2013, 01:57:34 PM
Feed the obsession...

Me too :biggrin:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
844     kCycles [x][x][x] - Cmp128Dave
1299    kCycles [x][x][x] - Cmp128Dave2
846     kCycles [x][x][x] - Cmp128Nidud
922     kCycles [x][x][x] - Cmp128NidudSSE
644     kCycles [x][x][ ] - Cmp128Alex
1557    kCycles [x][x][ ] - Cmp128JJSSE
1471    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1465    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1465    kCycles [x][x][x] - Cmp128JJAlexSSE_3
802     kCycles [x][x][ ] - AxCMP128bitProc3
772     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
543     kCycles [x][ ][ ] - Cmp128DaveU
543     kCycles [x][ ][ ] - Cmp128NidudU


P.S.: Added a sar eax, 10 (divides by 1024) and changed the test_end string to "kCycles (x)(x)(x) - Cmp128Dave2"

Siekmanski


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
700812  cycles [x][x][x] - Cmp128Nidud
729986  cycles [x][x][x] - Cmp128NidudSSE
971877  cycles [x][x][x] - Cmp128Dave
3026250 cycles [x][x][x] - Cmp128Dave2
782064  cycles [x][x][x] - Cmp128JJAlexSSE_1
890599  cycles [x][x][x] - Cmp128JJAlexSSE_2
926681  cycles [x][x][x] - Cmp128JJAlexSSE_3
682186  cycles [x][x][ ] - Cmp128Alex
1067566 cycles [x][x][ ] - Cmp128JJSSE
882899  cycles [x][x][ ] - AxCMP128bitProc3
888908  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
592661  cycles [x][ ][ ] - Cmp128DaveU
570588  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

Antariy

Yum-yum!

New macro added: a brute-force rework of the original GPR macro, changed so that it works just like CMP (it passes Dave's check).



Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2184021 cycles [x][x][x] - Cmp128Nidud
2313648 cycles [x][x][x] - Cmp128NidudSSE
2767063 cycles [x][x][x] - Cmp128Dave
4086277 cycles [x][x][x] - Cmp128Dave2
1672157 cycles [x][x][x] - Cmp128JJAlexSSE_1
1644385 cycles [x][x][x] - Cmp128JJAlexSSE_2
1889066 cycles [x][x][x] - Cmp128JJAlexSSE_3
980736  cycles [x][x][ ] - Cmp128Alex
1851407 cycles [x][x][x] - Cmp128Alex_2
1899452 cycles [x][x][x] - Cmp128Alex_3
2048700 cycles [x][x][ ] - Cmp128JJSSE
1388635 cycles [x][x][ ] - AxCMP128bitProc3
1311284 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
756260  cycles [x][ ][ ] - Cmp128DaveU
775831  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---


I think AMD will probably like it better than Intel.
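
As a reference for what "works just like CMP" means, here is a minimal C model of a 128-bit compare that produces CF, ZF, SF and OF the way a real CMP would on a wider register. It is not any of the macros from this thread; the names `oword`, `cmp_flags` and `cmp128` are invented, and reading the [JB/JA] [JL/JG] [JO/JS] columns as "these jumps behave correctly after the macro" is an interpretation of the test output.

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit halves (hypothetical type name). */
typedef struct { uint64_t lo, hi; } oword;

/* The flags that the three jump families depend on. */
typedef struct { int cf, zf, sf, of; } cmp_flags;

/* Flags that "cmp a, b" would produce if CMP operated on 128 bits. */
static cmp_flags cmp128(oword a, oword b)
{
    cmp_flags f;
    uint64_t lo = a.lo - b.lo;
    uint64_t borrow = a.lo < b.lo;          /* borrow out of the low half */
    uint64_t hi = a.hi - b.hi - borrow;

    /* CF: borrow out of the top bit, i.e. a < b as unsigned numbers. */
    f.cf = (a.hi < b.hi) || (a.hi == b.hi && a.lo < b.lo);
    f.zf = (lo | hi) == 0;
    f.sf = (int)(hi >> 63);
    /* OF: operand signs differ and the result's sign differs from a's sign. */
    f.of = (int)(((a.hi ^ b.hi) & (a.hi ^ hi)) >> 63);
    return f;
}

static int jb(cmp_flags f) { return f.cf; }                  /* unsigned a < b */
static int ja(cmp_flags f) { return !f.cf && !f.zf; }        /* unsigned a > b */
static int jl(cmp_flags f) { return f.sf != f.of; }          /* signed a < b   */
static int jg(cmp_flags f) { return !f.zf && f.sf == f.of; } /* signed a > b   */
```

A routine marked [x][ ][ ] in the tables only needs CF and ZF to be right; getting the signed and overflow columns right additionally requires SF and OF to match, which is the extra work the slower variants pay for.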

Antariy

Quote from: nidud on August 26, 2013, 07:04:16 AM
the first test is strange:
pre-P4 (SSE1)


most of the code used in the macros are SSE2


But half of the current code is GPR, too: Dave's, yours and mine don't use SSE at all :biggrin: