The fastest way to fill a dword array with string values

Started by frktons, December 09, 2012, 02:49:23 AM


dedndave

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 250/100 cycles

2038    cycles for 100 * FA Dave
2590    cycles for 100 * FA Sinsi
1188    cycles for 100 * FA Jochen unaligned
1813    cycles for 100 * FA Yves

2079    cycles for 100 * FA Dave
2593    cycles for 100 * FA Sinsi
1181    cycles for 100 * FA Jochen unaligned
1795    cycles for 100 * FA Yves

2061    cycles for 100 * FA Dave
2587    cycles for 100 * FA Sinsi
1216    cycles for 100 * FA Jochen unaligned
1791    cycles for 100 * FA Yves

TouEnMasm

Quote
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
loop overhead is approx. 255/100 cycles

2057    cycles for 100 * FA Dave
2581    cycles for 100 * FA Sinsi
1210    cycles for 100 * FA Jochen unaligned
1858    cycles for 100 * FA Yves

2060    cycles for 100 * FA Dave
2550    cycles for 100 * FA Sinsi
1214    cycles for 100 * FA Jochen unaligned
1855    cycles for 100 * FA Yves

2056    cycles for 100 * FA Dave
2553    cycles for 100 * FA Sinsi
1213    cycles for 100 * FA Jochen unaligned
1857    cycles for 100 * FA Yves

230     bytes for FA Dave
116     bytes for FA Sinsi
149     bytes for FA Jochen unaligned
141     bytes for FA Yves

4208520 = eax FA Dave
4208520 = eax FA Sinsi
4208520 = eax FA Jochen unaligned
4208520 = eax FA Yves

--- ok ---
Fa is a musical note to play with CL

frktons

According to these tests we reached the bottom line, almost.
SSE code is always faster than any other, and Dave won $50
for being the first to post his code  :t

As I foresaw:

Quote from: frktons on December 09, 2012, 10:07:32 AM
...
Not bad. I think we
can arrive at 0.7 cycles per dword string, but it has yet to be
demonstrated.  :lol:


My test with Jochen's testpad:

Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
loop overhead is approx. 187/100 cycles

1870    cycles for 100 * FA Dave
648     cycles for 100 * FA Jochen
2442    cycles for 100 * FA Sinsi
791     cycles for 100 * FA Jochen unaligned
2009    cycles for 100 * FA Yves

1886    cycles for 100 * FA Dave
648     cycles for 100 * FA Jochen
2418    cycles for 100 * FA Sinsi
791     cycles for 100 * FA Jochen unaligned
2012    cycles for 100 * FA Yves

1871    cycles for 100 * FA Dave
648     cycles for 100 * FA Jochen
2430    cycles for 100 * FA Sinsi
790     cycles for 100 * FA Jochen unaligned
2010    cycles for 100 * FA Yves

230     bytes for FA Dave
281     bytes for FA Jochen
116     bytes for FA Sinsi
141     bytes for FA Jochen unaligned
141     bytes for FA Yves

4208864 = eax FA Dave
4208864 = eax FA Jochen
4208864 = eax FA Sinsi
4208864 = eax FA Jochen unaligned
4208864 = eax FA Yves

--- ok ---

And the prize for Dave:
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

frktons

I've divided my code into 3 steps, from the slowest to the fastest.
Here is the slowest, which uses only GPRs:
Quote
-------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-------------------------------------------------------
1914    cycles for Dedndave code
1983    cycles for Frktons I Step
-------------------------------------------------------
1914    cycles for Dedndave code
1973    cycles for Frktons I Step
-------------------------------------------------------
1944    cycles for Dedndave code
1972    cycles for Frktons I Step
-------------------------------------------------------
1913    cycles for Dedndave code
1976    cycles for Frktons I Step
-------------------------------------------------------

I measured it against Dave's code that uses the same
kind of registers.

Gunther

Here are my results:


------------------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
------------------------------------------------------------------------
1577 cycles for Dedndave code
2461 cycles for Frktons I Step
------------------------------------------------------------------------
1048 cycles for Dedndave code
2492 cycles for Frktons I Step
------------------------------------------------------------------------
1058 cycles for Dedndave code
2497 cycles for Frktons I Step
------------------------------------------------------------------------
1048 cycles for Dedndave code
2495 cycles for Frktons I Step
------------------------------------------------------------------------

--- ok ---


Gunther
You have to know the facts before you can distort them.

frktons

An incredible difference between my Core 2 Duo and the i7.
The i7 really likes adding and subtracting.  :icon_eek:

Probably Jochen's solution is hard to beat for the time being.  :eusa_clap:

frktons

It looks like with GPRs it is quite difficult to go below
1900 CPU cycles, at least on my machine:

Quote
-----------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
-----------------------------------------------
1914    cycles for Dedndave code - 5 GPRs
2574    cycles for Frktons I Step / 2 GPRs
1889    cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
1940    cycles for Dedndave code - 5 GPRs
2561    cycles for Frktons I Step / 2 GPRs
1886    cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
1913    cycles for Dedndave code - 5 GPRs
2561    cycles for Frktons I Step / 2 GPRs
1900    cycles for Frktons I Step / 4 GPRs
-----------------------------------------------
1913    cycles for Dedndave code - 5 GPRs
2561    cycles for Frktons I Step / 2 GPRs
1887    cycles for Frktons I Step / 4 GPRs
-----------------------------------------------

frktons

Only when using MMX registers does the performance start to rock.
Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1926    cycles for Dedndave code - 5 GPRs
1857    cycles for Frktons I Step / 2 GPRs
1896    cycles for Frktons I Step / 4 GPRs
1320    cycles for Frktons II Step / 5 MMX
830     cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1919    cycles for Dedndave code - 5 GPRs
1897    cycles for Frktons I Step / 2 GPRs
1895    cycles for Frktons I Step / 4 GPRs
1321    cycles for Frktons II Step / 5 MMX
831     cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1916    cycles for Dedndave code - 5 GPRs
1898    cycles for Frktons I Step / 2 GPRs
1897    cycles for Frktons I Step / 4 GPRs
1319    cycles for Frktons II Step / 5 MMX
830     cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1915    cycles for Dedndave code - 5 GPRs
1892    cycles for Frktons I Step / 2 GPRs
1894    cycles for Frktons I Step / 4 GPRs
1327    cycles for Frktons II Step / 5 MMX
830     cycles for Jochen / 5 XMM
------------------------------------------------------------------------

--- ok ---

And there are still many things to optimize.

sinsi


dedndave

STATUS_ILLEGAL_INSTRUCTION = 0xc000001d

he is probably using SSSE3 instructions in "Frktons II Step"   :P

      lea esi, Tens
      lea edi, MyArray
      movq mm5, Mask2Double
      movq mm6, ONE_TWO
      movq mm7, TWO_TWO

    align 4

    @@:

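      mov eax, [esi]          ; load the next source dword from Tens
      movd mm0, eax           ; move it into the low dword of mm0

      pshufb mm0, mm5         ; SSSE3 byte shuffle: faults on CPUs without SSSE3
      paddd mm0, mm6          ; add ONE_TWO
      movq  mm1, mm0          ; each following register = previous one + TWO_TWO
      paddd mm1, mm7
      movq  mm2, mm1
      paddd mm2, mm7
      movq  mm3, mm2
      paddd mm3, mm7
      movq  mm4, mm3
      paddd mm4, mm7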

      movq [edi], mm0
      movq [edi + 8], mm1
      movq [edi + 16], mm2
      movq [edi + 24], mm3
      movq [edi + 32], mm4

      mov eax, [esi + 4]
      movd mm0, eax

      pshufb mm0, mm5
      paddd mm0, mm6
      movq  mm1, mm0
      paddd mm1, mm7
      movq  mm2, mm1
      paddd mm2, mm7
      movq  mm3, mm2
      paddd mm3, mm7
      movq  mm4, mm3
      paddd mm4, mm7

      movq [edi + 40], mm0
      movq [edi + 48], mm1
      movq [edi + 56], mm2
      movq [edi + 64], mm3
      movq [edi + 72], mm4
      add esi, 8
      add edi, 80

      cmp esi, PtrTens

      jl @B

frktons

Quote from: sinsi on December 10, 2012, 10:01:46 PM
Exception code: 0xc000001d
Fault offset: 0x00001619


Does your CPU support PSHUFB? It is an SSSE3 instruction, so Dave is correct.
You can duplicate the low dword into the high dword of an MMX register
without it, with something like:


first_dw dd 0     ; put here the same value of second_dw
second_dw dd 0 ; put here the same value of first_dw

....

lea eax, first_dw
movq mm0, [eax]


N.B. Not tested
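
For CPUs without SSSE3, one possible substitute for the PSHUFB step is the plain MMX instruction PUNPCKLDQ, which copies the low dword of a register into its high dword when source and destination are the same register. This is only a sketch, not code from the thread, and it assumes that Mask2Double (whose value is not shown here) does nothing more than duplicate the low dword. It has not been timed against the other versions.

```asm
      mov eax, [esi]          ; same load as in the posted loop
      movd mm0, eax           ; mm0 = 00000000:value (high dword cleared)
      punpckldq mm0, mm0      ; mm0 = value:value, MMX only, no SSSE3 needed
      paddd mm0, mm6          ; continue with the additions as before
```

Unlike the two-dword variable approach, this keeps the duplication in registers, so no extra store to memory is needed before the MOVQ.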


dedndave

by using SSSE3 (or even SSE3), you exclude a lot of the older CPUs that are still in use

really - this is a program init function
it seems kinda silly to exclude a CPU unless SSSE3 is to be used throughout the rest of the program
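
If both kinds of CPU are to be supported, the init function could test for SSSE3 at run time and fall back to an MMX-only path. The fragment below is a minimal sketch of that check: CPUID leaf 1 reports SSSE3 in bit 9 of ECX. The labels UseMmxPath and the two code paths are hypothetical placeholders, not names from the thread.

```asm
      push ebx                ; CPUID overwrites eax, ebx, ecx and edx
      mov eax, 1              ; leaf 1: processor feature flags
      cpuid
      pop ebx
      test ecx, 200h          ; ECX bit 9 set = SSSE3 available
      jz UseMmxPath           ; hypothetical label: version without PSHUFB
      ; ... SSSE3 version using PSHUFB goes here ...
```

The check costs a few hundred cycles at most and runs once, so it does not affect the timings of the fill loop itself.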

frktons

Quote from: dedndave on December 10, 2012, 10:33:52 PM
by using SSSE3 (or even SSE3), you exclude a lot of the older CPUs that are still in use

really - this is a program init function
it seems kinda silly to exclude a CPU unless SSSE3 is to be used throughout the rest of the program

The owners of old CPUs should read my previous post.  :lol:

dedndave

you are going to tell me i need a new one ? - lol
i am very happy with the one i have
until microsoft or adobe or some other a-hole forces me to be unhappy

frktons

Quote from: dedndave on December 10, 2012, 10:38:38 PM
you are going to tell me i need a new one ? - lol
i am very happy with the one i have
until microsoft or adobe or some other a-hole forces me to be unhappy

Not really. I'm happy too with mine, a dual core P-IV, and a Core Duo.
I was talking about asm coders. They know how to change 1 single
instruction to suit their needs.
By the way, I'm only testing some instructions, nothing is going to production
as always with me. [MASM for FUN only].