Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM


nidud

#285
deleted

KeepingRealBusy


Here is my contribution (from reply  284)



AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1358    kCycles [x][x][x] - Cmp128Dave
1206    kCycles [x][x][x] - Cmp128Nidud
975     kCycles [x][x][x] - Cmp128NidudSSE
1171    kCycles [x][x][ ] - Cmp128Alex
1766    kCycles [x][x][x] - MasmBasic Ocmp
1424    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1040    kCycles [x][x][x] - Cmp128JJAlexSSE_2
721     kCycles [x][x][x] - Cmp128JJAlexSSE_3
535     kCycles [x][x][ ] - AxCMP128bitProc3
519     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Dave AKA KRB

Antariy

Jochen, did you time the version of the macro I posted a couple of pages back?
Here it is:

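Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1           ; 0FFh in every byte position where both are equal
   pmovmskb ecx,xmm0             ; one bit per byte: 1 = equal
   xor ecx,0FFFFh                ; invert: 1 = bytes differ
   jz @l2                        ; all 16 bytes equal: ZF set, CF and OF clear
   and ecx,7FFFh                 ; ignore byte 15, it is loaded unconditionally below
   bsr ecx,ecx                   ; index of the highest differing byte below byte 15
   mov ah,byte ptr [ow0+15]      ; top (sign) bytes go into the high halves
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]     ; highest differing lower bytes go into the low halves
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx                     ; 16-bit compare sets the flags for the jump that follows
   @l2:
ENDM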


For me it is faster than the original "_1" macro. You can also try loading the top byte with a word load instead of the byte load:

   movzx eax,word ptr [ow0+14]
   movzx edx,word ptr [ow1+14]


but for me that is slower than the version above.
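
The macro above can be exercised with a small test program. This is only a minimal sketch: it assumes the MASM32 SDK is installed at \masm32, and the names valA, valB and all labels are made up for illustration. Note that the macro destroys eax, ecx, edx, xmm0 and xmm1.

```masm
include \masm32\include\masm32rt.inc   ; MASM32 SDK runtime: print, exit
.686
.xmm

Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ   ; copied unchanged from the post
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM

.data
; hypothetical test values, stored little-endian (lowest dword first)
valA dd 00000001h, 0, 0, 80000000h   ; top bit set: huge unsigned, negative signed
valB dd 00000002h, 0, 0, 00000000h   ; small positive number

.code
start:
   mov esi, offset valA
   mov edi, offset valB

   Cmp128JJAlexSSE_1 esi, edi        ; flags now reflect valA compared with valB
   ja  UnsignedAbove                 ; unsigned test
   print "unsigned: valA <= valB", 13, 10
   jmp SignedTest
UnsignedAbove:
   print "unsigned: valA > valB", 13, 10

SignedTest:
   Cmp128JJAlexSSE_1 esi, edi        ; print changed the flags, so compare again
   jl  SignedLess                    ; signed test
   print "signed: valA >= valB", 13, 10
   jmp Done
SignedLess:
   print "signed: valA < valB", 13, 10
Done:
   exit
end start
```

Tracing these two values by hand, only bytes 0 and 15 differ, so the macro ends up comparing ax = 8001h with dx = 0002h: above as unsigned, less as signed.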

Timings for the macro above (your old macro is still in the list; my testbed is a bit outdated):


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2189320 cycles [x][x][x] - Cmp128Nidud
2295837 cycles [x][x][x] - Cmp128NidudSSE
2773387 cycles [x][x][x] - Cmp128Dave
4033478 cycles [x][x][x] - Cmp128Dave2
1597228 cycles [x][x][x] - Cmp128JJAlexSSE_1
1622741 cycles [x][x][x] - Cmp128JJAlexSSE_2
1905774 cycles [x][x][x] - Cmp128JJAlexSSE_3
993931  cycles [x][x][ ] - Cmp128Alex
1859714 cycles [x][x][x] - Cmp128Alex_2
1901902 cycles [x][x][x] - Cmp128Alex_3
1994856 cycles [x][x][ ] - Cmp128JJSSE
1346269 cycles [x][x][ ] - AxCMP128bitProc3
1311894 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
741050  cycles [x][ ][ ] - Cmp128DaveU
770599  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---



Timings for Cmp128_timingsOQ


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2696    kCycles [x][x][x] - Cmp128Dave
2713    kCycles [x][x][x] - Cmp128Nidud
3125    kCycles [x][x][x] - Cmp128NidudSSE
945     kCycles [x][x][ ] - Cmp128Alex
1932    kCycles [x][x][x] - MasmBasic Ocmp
1485    kCycles [x][x][x] - MasmBasic Qcmp
1639    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1604    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1595    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1360    kCycles [x][x][ ] - AxCMP128bitProc3
1274    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---



Timings for Cmp128_timingsO


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2856    kCycles [x][x][x] - Cmp128Dave
2752    kCycles [x][x][x] - Cmp128Nidud
3128    kCycles [x][x][x] - Cmp128NidudSSE
956     kCycles [x][x][ ] - Cmp128Alex
1928    kCycles [x][x][x] - MasmBasic Ocmp
1641    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1601    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1592    kCycles [x][x][x] - Cmp128JJAlexSSE_3
1361    kCycles [x][x][ ] - AxCMP128bitProc3
1272    kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---


Antariy

Hi Dave :t

Quote from: KeepingRealBusy on September 02, 2013, 08:52:39 AM

Here is my contribution (from reply  284)



AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1358    kCycles [x][x][x] - Cmp128Dave
1206    kCycles [x][x][x] - Cmp128Nidud
975     kCycles [x][x][x] - Cmp128NidudSSE
1171    kCycles [x][x][ ] - Cmp128Alex
1766    kCycles [x][x][x] - MasmBasic Ocmp
1424    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1040    kCycles [x][x][x] - Cmp128JJAlexSSE_2
721     kCycles [x][x][x] - Cmp128JJAlexSSE_3
535     kCycles [x][x][ ] - AxCMP128bitProc3
519     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

Dave AKA KRB


Incredible difference between the algos that use full-size and half-size regs. Your AMD seems to work very well with "partial" regs, contrary to the Intels, which are bad with them.
Cmp128JJAlexSSE_3 differs from Cmp128JJAlexSSE_1 only in this:

   xor cx,0FFFFh
   jz @l2
   and cx,7FFFh
   bsr cx,cx


jj2007

Quote from: Antariy on September 02, 2013, 12:35:42 PM
Jochen, did you time the version of the macro I posted a couple of pages back?

Here it comes:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945     kCycles [x][x][x] - Cmp128Dave
916     kCycles [x][x][x] - Cmp128Nidud
1017    kCycles [x][x][x] - Cmp128NidudSSE
689     kCycles [x][x][ ] - Cmp128Alex
1013    kCycles [x][x][x] - MasmBasic Ocmp
815     kCycles [x][x][x] - Cmp128JJAlexSSE_1
854     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
925     kCycles [x][x][x] - Cmp128JJAlexSSE_2
926     kCycles [x][x][x] - Cmp128JJAlexSSE_3
858     kCycles [x][x][ ] - AxCMP128bitProc3
870     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
653     kCycles [x][x][x] - Cmp128Dave
608     kCycles [x][x][x] - Cmp128Nidud
806     kCycles [x][x][x] - Cmp128NidudSSE
434     kCycles [x][x][ ] - Cmp128Alex
386     kCycles [x][x][x] - MasmBasic Ocmp
315     kCycles [x][x][x] - Cmp128JJAlexSSE_1
366     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
355     kCycles [x][x][x] - Cmp128JJAlexSSE_2
316     kCycles [x][x][x] - Cmp128JJAlexSSE_3
455     kCycles [x][x][ ] - AxCMP128bitProc3
439     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)


Quote from: nidud on September 02, 2013, 08:34:08 AM
well, it's difficult to read your "code", but I think...
You should learn Masm, it's a fascinating language :t

(and I'm afraid your interpretation is not correct - you might launch Olly to see what it really does).

sinsi

jj's latest

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
695     kCycles [x][x][x] - Cmp128Dave
564     kCycles [x][x][x] - Cmp128Nidud
652     kCycles [x][x][x] - Cmp128NidudSSE
396     kCycles [x][x][ ] - Cmp128Alex
316     kCycles [x][x][x] - MasmBasic Ocmp
268     kCycles [x][x][x] - Cmp128JJAlexSSE_1
321     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
312     kCycles [x][x][x] - Cmp128JJAlexSSE_2
271     kCycles [x][x][x] - Cmp128JJAlexSSE_3
403     kCycles [x][x][ ] - AxCMP128bitProc3
378     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
748     kCycles [x][x][x] - Cmp128Dave
615     kCycles [x][x][x] - Cmp128Nidud
714     kCycles [x][x][x] - Cmp128NidudSSE
433     kCycles [x][x][ ] - Cmp128Alex
348     kCycles [x][x][x] - MasmBasic Ocmp
296     kCycles [x][x][x] - Cmp128JJAlexSSE_1
353     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
344     kCycles [x][x][x] - Cmp128JJAlexSSE_2
298     kCycles [x][x][x] - Cmp128JJAlexSSE_3
442     kCycles [x][x][ ] - AxCMP128bitProc3
416     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

dedndave

your code is hard to read, Jochen - lol
i dread if i have to add a routine   :P

jj2007

Quote from: dedndave on September 02, 2013, 07:01:01 PM
your code is hard to read, Jochen - lol

Come on, it's ultra simple...
  pmovmskb edx, xt2   ; show in dx where xt0 differs to xt1
  if MbcmpO eq QWORD
   not dl
   and edx, 07fh
  else          ; don't duplicate MSB
   if 1
      xor edx, -1
      and edx, 07fffh
   else
      not dx
      and dh, 07fh
   endif
  endif

nidud

#293
deleted

jj2007

Quote from: nidud on September 02, 2013, 08:34:08 AM
well, it's difficult to read your "code"
Quote from: nidud on September 02, 2013, 10:57:58 PM
I guess there is different views about writing code
Yes, certainly. But I would never call your code "code", or refer to you as a "coder" instead of a coder. It requires a certain level of arrogance to dismiss somebody else's code as "code".

Quote
Quote: I'm afraid your interpretation is not correct
How do you know?

Quote: you might launch Olly to see what it really does
Don't you think that this is a bit too much to ask, or at least a bit complicated, to use a debugger to see what it actually does?

Normally, I would not ask, but since you had difficulties de-coding my macro, I thought Olly would be a reliable way to check. What you show above, by the way, is old code - the version of oqCmp.asm that I posted 15 hours ago already contained:
   if 1
      xor edx, -1
      and edx, 07fffh
   else
      not dx
      and dh, 07fh
   endif

The if 1 is conditional assembly and means "use this branch, not the other one".
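
To make the point about conditional assembly concrete, here is a minimal sketch that replaces the bare 1 with a named switch. It assumes the MASM32 SDK at \masm32; UseFullRegs and the sample mask value are hypothetical and not taken from oqCmp.asm.

```masm
include \masm32\include\masm32rt.inc   ; MASM32 SDK runtime: print, str$, exit
.686

UseFullRegs = 1          ; hypothetical switch: 1 = 32-bit branch, 0 = 16/8-bit branch

.code
start:
   mov edx, 0FFFEh       ; sample pmovmskb result: all bytes equal except byte 0

   if UseFullRegs        ; decided at assembly time, only one branch is emitted
      xor edx, -1        ; invert all 32 bits
      and edx, 07fffh    ; keep bits 0..14, drop the MSB byte and the upper half
   else
      not dx             ; invert the low 16 bits only
      and dh, 07fh       ; drop bit 15, the MSB byte
   endif

   print str$(edx), 13, 10   ; both branches leave the same value in edx
   exit
end start
```

Setting UseFullRegs to 0 assembles the other branch without touching the rest of the source, which makes it easy to time both variants.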

Congrats, by the way - on the AMD your code is faster than mine:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
843     kCycles [x][x][x] - Cmp128Dave
847     kCycles [x][x][x] - Cmp128Nidud
917     kCycles [x][x][x] - Cmp128NidudSSE
643     kCycles [x][x][ ] - Cmp128Alex
1578    kCycles [x][x][x] - MasmBasic Ocmp
1469    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1531    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1466    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1466    kCycles [x][x][x] - Cmp128JJAlexSSE_3
803     kCycles [x][x][ ] - AxCMP128bitProc3
771     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

FORTRANS

From Reply #289.


Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1022    kCycles [x][x][x] - Cmp128Dave
917     kCycles [x][x][x] - Cmp128Nidud
1022    kCycles [x][x][x] - Cmp128NidudSSE
817     kCycles [x][x][ ] - Cmp128Alex
1561    kCycles [x][x][x] - MasmBasic Ocmp
1422    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1471    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1668    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1677    kCycles [x][x][x] - Cmp128JJAlexSSE_3
937     kCycles [x][x][ ] - AxCMP128bitProc3
985     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

dedndave

Jochen,
it's just the text format
we each have our own style and it can be hard to get used to someone else's   :P

Gunther

The timings for Jochen's latest version:



Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
736     kCycles [x][x][x] - Cmp128Dave
629     kCycles [x][x][x] - Cmp128Nidud
696     kCycles [x][x][x] - Cmp128NidudSSE
442     kCycles [x][x][ ] - Cmp128Alex
367     kCycles [x][x][x] - MasmBasic Ocmp
321     kCycles [x][x][x] - Cmp128JJAlexSSE_1
371     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
364     kCycles [x][x][x] - Cmp128JJAlexSSE_2
352     kCycles [x][x][x] - Cmp128JJAlexSSE_3
447     kCycles [x][x][ ] - AxCMP128bitProc3
427     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---


Gunther
You have to know the facts before you can distort them.

jj2007

Thanxalot :icon14:

Attached one more, inter alia with a modification of the test_start macro:

test_start macro useit:=<1>
usethismacro=useit
  if usethismacro
   push 50000000
   .Repeat
      dec dword ptr [esp]   ; heat up the CPU
   .Until Sign?
   add esp, 4
   invoke Sleep, 0
   counter_begin 1000, HIGH_PRIORITY_CLASS
  endif
endm


On some machines, timings were very volatile; the small mod above seems to help.

Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
986     kCycles [x][x][x] - Cmp128Dave
946     kCycles [x][x][x] - Cmp128Nidud
818     kCycles [x][x][x] - Cmp128NidudSSE
575     kCycles [x][x][ ] - Cmp128Alex
564     kCycles [x][x][x] - MasmBasic Ocmp.1
517     kCycles [x][x][x] - MasmBasic Ocmp.0
549     kCycles [x][x][x] - MasmBasic Ocmp.1
513     kCycles [x][x][x] - MasmBasic Ocmp.0
472     kCycles [x][x][x] - Cmp128JJAlexSSE_1
476     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
618     kCycles [x][x][x] - Cmp128JJAlexSSE_2
614     kCycles [x][x][x] - Cmp128JJAlexSSE_3
747     kCycles [x][x][ ] - AxCMP128bitProc3
772     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
-----------------------------------------------------
843     kCycles [x][x][x] - Cmp128Dave
844     kCycles [x][x][x] - Cmp128Nidud
919     kCycles [x][x][x] - Cmp128NidudSSE
641     kCycles [x][x][ ] - Cmp128Alex
1588    kCycles [x][x][x] - MasmBasic Ocmp.1
1584    kCycles [x][x][x] - MasmBasic Ocmp.0
1586    kCycles [x][x][x] - MasmBasic Ocmp.1
1578    kCycles [x][x][x] - MasmBasic Ocmp.0
1467    kCycles [x][x][x] - Cmp128JJAlexSSE_1
1532    kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1471    kCycles [x][x][x] - Cmp128JJAlexSSE_2
1468    kCycles [x][x][x] - Cmp128JJAlexSSE_3
801     kCycles [x][x][ ] - AxCMP128bitProc3
771     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

Gunther

Jochen,

the new timings. I hope that helps:



Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
660     kCycles [x][x][x] - Cmp128Dave
538     kCycles [x][x][x] - Cmp128Nidud
603     kCycles [x][x][x] - Cmp128NidudSSE
371     kCycles [x][x][ ] - Cmp128Alex
316     kCycles [x][x][x] - MasmBasic Ocmp.1
307     kCycles [x][x][x] - MasmBasic Ocmp.0
314     kCycles [x][x][x] - MasmBasic Ocmp.1
306     kCycles [x][x][x] - MasmBasic Ocmp.0
259     kCycles [x][x][x] - Cmp128JJAlexSSE_1
308     kCycles [x][x][x] - Cmp128JJAlexSSE_1new
302     kCycles [x][x][x] - Cmp128JJAlexSSE_2
263     kCycles [x][x][x] - Cmp128JJAlexSSE_3
391     kCycles [x][x][ ] - AxCMP128bitProc3
363     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)

--- ok ---

