Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

nidud

#210
deleted

nidud

#211
deleted

Gunther

Hi nidud,

the new timings:



Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
698804  cycles [x][x][x] - Cmp128Dave
1161997 cycles [x][x][x] - Cmp128Dave2
614967  cycles [x][x][x] - Cmp128Nidud
715645  cycles [x][x][x] - Cmp128NidudSSE
461614  cycles [x][x][ ] - Cmp128Alex
378413  cycles [x][x][ ] - Cmp128JJSSE
331370  cycles [x][x][ ] - Cmp128JJAlexSSE
467938  cycles [x][x][ ] - AxCMP128bitProc3
436255  cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
437147  cycles [x][ ][ ] - Cmp128DaveU
463277  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---


Gunther
You have to know the facts before you can distort them.

Antariy

Quote from: nidud on August 25, 2013, 12:01:08 AM
As a result of this, Alex's and my SSE macros failed  :lol:

Changes made to Cmp128JJAlexSSE:

movups xmm0,[ow0]
movups xmm1,[ow1] ; ++
movzx eax,word ptr [ow0+14]
;pcmpeqb xmm0,[ow1] ; this failed on unaligned data
pcmpeqb xmm0,xmm1


Yes, I noticed that it doesn't handle unaligned data. Your solution is right :t
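For readers following along in C: the fix quoted above loads both operands with unaligned moves before comparing, because the legacy SSE form of pcmpeqb with a memory operand faults when the address is not 16-byte aligned. A minimal sketch of the same idea with SSE2 intrinsics follows. The name oword_equal_mask is made up for this illustration and is not part of any routine in the thread; the scalar fallback is only there so the sketch builds on non-SSE2 targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define OWORD_HAVE_SSE2 1
#endif

/* Hypothetical helper: returns a 16-bit mask in which bit i is set when
   byte i of a equals byte i of b. Neither pointer has to be aligned. */
int oword_equal_mask(const void *a, const void *b)
{
    assert(a != NULL && b != NULL);
#ifdef OWORD_HAVE_SSE2
    /* Unaligned loads for BOTH operands, then a register-to-register
       byte compare, mirroring the movups/movups/pcmpeqb sequence. */
    __m128i x = _mm_loadu_si128((const __m128i *)a);
    __m128i y = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y));
#else
    /* Portable fallback with the same result. */
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    int mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= (p[i] == q[i]) << i;
    return mask;
#endif
}
```

A mask of 0xFFFF means the two OWORDs are equal; otherwise the highest clear bit marks the most significant byte that differs, which is where an ordering decision would be made.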

Here are the timings:



Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2652796 cycles [x][x][x] - Cmp128Dave
3952276 cycles [x][x][x] - Cmp128Dave2
2639764 cycles [x][x][x] - Cmp128Nidud
3069710 cycles [x][x][x] - Cmp128NidudSSE
944781  cycles [x][x][ ] - Cmp128Alex
1913987 cycles [x][x][ ] - Cmp128JJSSE
2623148 cycles [x][x][ ] - Cmp128JJAlexSSE
1324491 cycles [x][x][ ] - AxCMP128bitProc3
1279045 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
726832  cycles [x][ ][ ] - Cmp128DaveU
738206  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---



It's interesting how differently algos perform on different processors.

Gunther

Hi Alex,

Quote from: Antariy on August 25, 2013, 05:56:47 PM
It's interesting how differently algos perform on different processors.

Yes, it seems that things are becoming more and more hardware-dependent. The only way to overcome that is to use different code paths.
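To make the "different code paths" idea concrete, here is a rough C sketch that times two candidate routines once at startup and keeps whichever was faster on the machine at hand. This is only one possible approach, chosen because the timings in this thread differ between CPUs that report the same SSE level; selecting by CPUID feature flags is the more common alternative. Everything here (oword, cmp128u_branch, cmp128u_arith, pick_cmp128) is a hypothetical name invented for the example, and both paths are plain unsigned compares, not the routines benchmarked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef struct { uint64_t lo, hi; } oword;   /* low half first, as in memory on x86 */
typedef int (*cmp128_fn)(const oword *a, const oword *b);

/* Path A: compare the high halves first, with explicit branches. */
static int cmp128u_branch(const oword *a, const oword *b)
{
    if (a->hi != b->hi) return a->hi < b->hi ? -1 : 1;
    if (a->lo != b->lo) return a->lo < b->lo ? -1 : 1;
    return 0;
}

/* Path B: no explicit branches in the source; the compiler still
   decides which instructions are actually emitted. */
static int cmp128u_arith(const oword *a, const oword *b)
{
    int hi = (a->hi > b->hi) - (a->hi < b->hi);
    int lo = (a->lo > b->lo) - (a->lo < b->lo);
    int s = 2 * hi + lo;              /* the high half dominates the sign */
    return (s > 0) - (s < 0);
}

/* Processor time used by `passes` sweeps over neighbouring pairs. */
static clock_t time_path(cmp128_fn fn, const oword *data, size_t n, int passes)
{
    volatile int sink = 0;            /* keeps the calls from being dropped */
    clock_t t0 = clock();
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i + 1 < n; ++i)
            sink += fn(&data[i], &data[i + 1]);
    (void)sink;
    return clock() - t0;
}

/* Times every candidate once and returns the fastest one. */
cmp128_fn pick_cmp128(void)
{
    static const cmp128_fn paths[] = { cmp128u_branch, cmp128u_arith };
    oword data[256];
    uint64_t x = 88172645463325252ULL;          /* xorshift64 seed */
    for (size_t i = 0; i < 256; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17; data[i].lo = x;
        x ^= x << 13; x ^= x >> 7; x ^= x << 17; data[i].hi = x & 3; /* frequent ties in hi */
    }
    cmp128_fn best = paths[0];
    clock_t best_t = time_path(paths[0], data, 256, 2000);
    for (size_t k = 1; k < sizeof paths / sizeof paths[0]; ++k) {
        clock_t t = time_path(paths[k], data, 256, 2000);
        if (t < best_t) { best_t = t; best = paths[k]; }
    }
    return best;
}
```

A program would call pick_cmp128() once, store the returned pointer, and route all later compares through it. The cost is a short delay at startup and a selection that can change between runs when the candidates are close.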

Gunther
You have to know the facts before you can distort them.

nidud

#215
deleted

Antariy

A brand new Cmp128JJAlexSSE!

Don't miss it on your displays right now!

Now fully compliant with Dave's Testing Method™ (JO/JS works as expected).

Now in 3 new flavours (modifications)!


:greensml:


Timings welcome :t

Antariy

Hi Gunther :biggrin:

Quote from: Gunther on August 25, 2013, 06:57:30 PM
Quote from: Antariy on August 25, 2013, 05:56:47 PM
It's interesting how differently algos perform on different processors.

Yes, it seems that things are becoming more and more hardware-dependent. The only way to overcome that is to use different code paths.

With the current number of different CPU models, that would be a lot of code :biggrin:
You're perfectly right, to get every clock from every machine it's the only way.



Ah, I forgot to post the timings in my previous post:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2958901 cycles [x][x][x] - Cmp128Dave
4334913 cycles [x][x][x] - Cmp128Dave2
2957840 cycles [x][x][x] - Cmp128Nidud
3402571 cycles [x][x][x] - Cmp128NidudSSE
1034917 cycles [x][x][ ] - Cmp128Alex
2118138 cycles [x][x][ ] - Cmp128JJSSE
1762373 cycles [x][x][x] - Cmp128JJAlexSSE_1
1726287 cycles [x][x][x] - Cmp128JJAlexSSE_2
1739010 cycles [x][x][x] - Cmp128JJAlexSSE_3
1464577 cycles [x][x][ ] - AxCMP128bitProc3
1372694 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
779180  cycles [x][ ][ ] - Cmp128DaveU
798269  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

jj2007

Quote from: Antariy on August 25, 2013, 08:07:28 PM
A brand new Cmp128JJAlexSSE!

It seems to like my Celeron - best among the "good" algos :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
968781  cycles {x}{x}{x} - Cmp128Dave
2629540 cycles {x}{x}{x} - Cmp128Dave2
938714  cycles {x}{x}{x} - Cmp128Nidud
1039057 cycles {x}{x}{x} - Cmp128NidudSSE
706010  cycles {x}{x}{ } - Cmp128Alex
1131248 cycles {x}{x}{ } - Cmp128JJSSE
834193  cycles {x}{x}{x} - Cmp128JJAlexSSE_1
947852  cycles {x}{x}{x} - Cmp128JJAlexSSE_2
948452  cycles {x}{x}{x} - Cmp128JJAlexSSE_3
881549  cycles {x}{x}{ } - AxCMP128bitProc3
890835  cycles {x}{x}{ } - AxCMP128bitProc3c (cmov)
610504  cycles {x}{ }{ } - Cmp128DaveU
599043  cycles {x}{ }{ } - Cmp128NidudU

nidud

#219
deleted

Antariy

Oops, too many digits in the numbers, so I ended up evaluating them "by width" :greensml: The selected timings looked wider, so I thought they were much slower... :greensml: :biggrin:

Antariy

Quote from: nidud on August 25, 2013, 08:35:25 PM
1069602 cycles - Cmp128JJAlexSSE_1
1070532 cycles - Cmp128JJAlexSSE_2
1071273 cycles - Cmp128JJAlexSSE_3

But here they are all close.

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2598846 cycles [x][x][x] - Cmp128Dave
3786288 cycles [x][x][x] - Cmp128Dave2
2616598 cycles [x][x][x] - Cmp128Nidud
3025310 cycles [x][x][x] - Cmp128NidudSSE
914405  cycles [x][x][ ] - Cmp128Alex
1906276 cycles [x][x][ ] - Cmp128JJSSE
1588020 cycles [x][x][x] - Cmp128JJAlexSSE_1
1562841 cycles [x][x][x] - Cmp128JJAlexSSE_2
1558993 cycles [x][x][x] - Cmp128JJAlexSSE_3
1326437 cycles [x][x][ ] - AxCMP128bitProc3
1254462 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
692441  cycles [x][ ][ ] - Cmp128DaveU
713309  cycles [x][ ][ ] - Cmp128NidudU

------------------------------------------------------
2615758 cycles [x][x][x] - Cmp128Dave
3829660 cycles [x][x][x] - Cmp128Dave2
2621750 cycles [x][x][x] - Cmp128Nidud
3031078 cycles [x][x][x] - Cmp128NidudSSE
908794  cycles [x][x][ ] - Cmp128Alex
1892463 cycles [x][x][ ] - Cmp128JJSSE
1591916 cycles [x][x][x] - Cmp128JJAlexSSE_1
1557071 cycles [x][x][x] - Cmp128JJAlexSSE_2
1559415 cycles [x][x][x] - Cmp128JJAlexSSE_3
1313596 cycles [x][x][ ] - AxCMP128bitProc3
1267780 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
711284  cycles [x][ ][ ] - Cmp128DaveU
741151  cycles [x][ ][ ] - Cmp128NidudU

nidud

#223
deleted

dedndave

tests that use a little more time return more repeatable results
if i am timing code, i try to make each pass last about 0.5 seconds
that seems to give repeatable numbers
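a rough sketch of that rule of thumb in C, for anyone who wants to try it outside the asm test bed: double the repeat count until a pass is long enough to measure, then scale it to the target duration. the names (work_fn, sample_work, seconds_for, calibrate_reps) are invented for this example, and clock() reports processor time rather than the cycle counts shown in the tables above, so treat it as an illustration of the calibration step only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <time.h>

typedef void (*work_fn)(void);

static volatile unsigned sink;        /* keeps the sample work from being optimized away */

/* Hypothetical stand-in for the code under test. */
static void sample_work(void)
{
    for (unsigned i = 0; i < 100; ++i)
        sink += i;
}

/* Processor time, in seconds, used by `reps` calls of `work`. */
static double seconds_for(work_fn work, unsigned long reps)
{
    clock_t t0 = clock();
    for (unsigned long i = 0; i < reps; ++i)
        work();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Returns a repeat count for which one pass should last roughly
   target_seconds on this machine. */
unsigned long calibrate_reps(work_fn work, double target_seconds)
{
    assert(work != NULL && target_seconds > 0.0);
    unsigned long reps = 1;
    double t = seconds_for(work, reps);
    /* Double until the pass is long enough to measure reliably. */
    while (t < target_seconds / 4.0 && reps <= ULONG_MAX / 2) {
        reps *= 2;
        t = seconds_for(work, reps);
    }
    if (t <= 0.0)
        return reps;                  /* nothing measurable: keep the cap */
    /* Scale linearly from the measured pass to the target duration. */
    double scaled = (double)reps * (target_seconds / t);
    if (scaled >= (double)ULONG_MAX)
        return ULONG_MAX;
    return scaled < 1.0 ? 1UL : (unsigned long)scaled;
}
```

calling calibrate_reps(sample_work, 0.5) once before the real measurement gives a count for passes of about half a second; the result is only approximate because the calibration run is itself subject to the same noise.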