The fastest way to fill a dword array with string values

Started by frktons, December 09, 2012, 02:49:23 AM

frktons

Quote from: jj2007 on December 12, 2012, 06:53:03 PM
Frktons I Step / 2 GPRs and Frktons I Step / 4 GPRs are remarkably fast but you should have a look at the output.

I didn't check the output, so it is possible that some instructions are not correct.
I was testing for speed first; now I'm going to check the code for correctness
and optimize it for size. Give me a few more days, I'm quite slow indeed.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

lol
it has to work first, otherwise you are comparing apples with oranges in the timing tests   :t
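
A minimal sketch of that "check first, time second" idea, in C rather than MASM. Nothing here comes from the thread's attachments: the array size `COUNT`, the function names, and the assumed task (dword i receives the four ASCII digits of i) are all hypothetical stand-ins for whatever the real routines do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COUNT 1000u /* hypothetical array size, not taken from the thread */

/* Reference fill: slow but easy to verify by eye.
   Assumed task: dword i receives the four ASCII digits of i ("0000", "0001", ...). */
static void fill_reference(uint32_t *dst, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        char s[4];
        s[0] = (char)('0' + (i / 1000) % 10);
        s[1] = (char)('0' + (i / 100) % 10);
        s[2] = (char)('0' + (i / 10) % 10);
        s[3] = (char)('0' + i % 10);
        memcpy(&dst[i], s, sizeof s);
    }
}

/* Candidate fill: stands in for the routine being timed. It keeps the
   counter as ASCII digits and increments it with carry, avoiding divisions. */
static void fill_candidate(uint32_t *dst, unsigned n)
{
    char d[4] = { '0', '0', '0', '0' };
    for (unsigned i = 0; i < n; i++) {
        memcpy(&dst[i], d, sizeof d);
        for (int k = 3; k >= 0; k--) {
            if (d[k] != '9') { d[k]++; break; }
            d[k] = '0'; /* digit wrapped, carry into the next one */
        }
    }
}

/* Returns 1 when both routines produce byte-identical arrays, else 0. */
static int fills_match(void)
{
    static uint32_t expected[COUNT], actual[COUNT];
    fill_reference(expected, COUNT);
    fill_candidate(actual, COUNT);
    return memcmp(expected, actual, sizeof expected) == 0;
}

/* Call once before the timing loop: a mismatch aborts instead of
   producing a meaningless cycle count. */
static void verify_before_timing(void)
{
    assert(fills_match());
}
```

Comparing against a plain reference with `memcmp` catches off-by-one slips in the fast version before any cycle counts are recorded.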

frktons

Quote from: dedndave on December 12, 2012, 11:25:42 PM
lol
it has to work first, otherwise you are comparing apples with oranges in the timing tests   :t

Here we are, tested and posted. The performance doesn't change;
there were some typing errors, and in a few places I added one instead of two.
These errors didn't affect performance, but they did affect the results.  :P

I included a PROC to display the content of the filled array, on demand.  :t

dedndave

prescott w/htt
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz

Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2246    cycles for Dedndave code - 5 GPRs
1964    cycles for Frktons I Step / 2 GPRs
1999    cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2451    cycles for Frktons II Step / 5 MMX without SSSE3
1122    cycles for Frktons III Step / XMM/MMX with SSE2
1167    cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2240    cycles for Dedndave code - 5 GPRs
1939    cycles for Frktons I Step / 2 GPRs
2015    cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
2588    cycles for Frktons II Step / 5 MMX without SSSE3
1115    cycles for Frktons III Step / XMM/MMX with SSE2
1167    cycles for Jochen / 5 XMM
------------------------------------------------------------------------

frktons

Two out of three goals are accomplished. Now for the last, but not least,
optimization: code size, plus some small improvements if needed.  ;)

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz

Instructions: MMX, SSE1, SSE2, SSE3
--------------------------------------------------------
1740    cycles for Dedndave code - 5 GPRs
1348    cycles for Frktons I Step / 2 GPRs
1262    cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
943     cycles for Frktons II Step / 5 MMX without SSSE3
941     cycles for Frktons III Step / XMM/MMX with SSE2
728     cycles for Jochen / 5 XMM
--------------------------------------------------------
1742    cycles for Dedndave code - 5 GPRs
1350    cycles for Frktons I Step / 2 GPRs
1262    cycles for Frktons I Step / 4 GPRs
Frktons II Step requires a PC with SSSE3
944     cycles for Frktons II Step / 5 MMX without SSSE3
937     cycles for Frktons III Step / XMM/MMX with SSE2
728     cycles for Jochen / 5 XMM

frktons

On my home PC the III step is a bit faster than Jochen's code,
and on my office PC it is even faster.
Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
-------------------------------------------------------------
813     cycles for Frktons III Step / XMM/MMX with SSE2
835     cycles for Jochen / 5 XMM

Quote
Intel(R) Pentium(R) CPU G6950  @ 2.80GHz
-------------------------------------------------------------
...
405     cycles for Frktons III Step / XMM/MMX with SSE2
644     cycles for Jochen / 5 XMM

I'm actually studying a faster/smaller solution because, as Jochen said:

Quote from: jj2007 on December 09, 2012, 08:25:09 PM

P.S.: If you don't agree with the suffix "FINAL", write a faster algo :bgrin:
I don't agree with the suffix, I'm not at the FINAL stage so far.   :lol:

jj2007

Quote from: frktons on December 14, 2012, 09:58:46 AM
On my home PC the III step is a bit faster than Jochen's code,
and on my office PC it is even faster.
...
I'm actually studying a faster/smaller solution because, as Jochen said:

Quote from: jj2007 on December 09, 2012, 08:25:09 PM

P.S.: If you don't agree with the suffix "FINAL", write a faster algo :bgrin:
I don't agree with the suffix, I'm not at the FINAL stage so far.   :lol:

So the incentive worked :greensml: :t


frktons

This could be my final test, unless you have suggestions
to improve the performance of the latest code:

Quote
------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
------------------------------------------------------------------------
1915    cycles for Dedndave code - 5 GPRs
1890    cycles for Frktons I Step / 2 GPRs
1964    cycles for Frktons I Step / 4 GPRs with LEA
1114    cycles for Frktons II Step / 5 MMX with SSSE3
1199    cycles for Frktons II Step / 5 MMX without SSSE3
811     cycles for Frktons III Step / XMM/MMX with SSE2
630     cycles for Frktons III Step / XMM with SSE2 - enhanced
706     cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1915    cycles for Dedndave code - 5 GPRs
1896    cycles for Frktons I Step / 2 GPRs
1978    cycles for Frktons I Step / 4 GPRs with LEA
1110    cycles for Frktons II Step / 5 MMX with SSSE3
1199    cycles for Frktons II Step / 5 MMX without SSSE3
813     cycles for Frktons III Step / XMM/MMX with SSE2
628     cycles for Frktons III Step / XMM with SSE2 - enhanced
704     cycles for Jochen / 5 XMM
------------------------------------------------------------------------

--- ok ---

dedndave

prescott w/htt
Quote
------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz

Instructions: MMX, SSE1, SSE2, SSE3
------------------------------------------------------------------------
2267    cycles for Dedndave code - 5 GPRs
1957    cycles for Frktons I Step / 2 GPRs
2035    cycles for Frktons I Step / 4 GPRs with LEA
Frktons II Step requires a PC with SSSE3
2459    cycles for Frktons II Step / 5 MMX without SSSE3
1126    cycles for Frktons III Step / XMM/MMX with SSE2
1197    cycles for Frktons III Step / XMM with SSE2 - enhanced
1159    cycles for Jochen / 5 XMM
------------------------------------------------------------------------
2282    cycles for Dedndave code - 5 GPRs
1967    cycles for Frktons I Step / 2 GPRs
2031    cycles for Frktons I Step / 4 GPRs with LEA
Frktons II Step requires a PC with SSSE3
2483    cycles for Frktons II Step / 5 MMX without SSSE3
1126    cycles for Frktons III Step / XMM/MMX with SSE2
1185    cycles for Frktons III Step / XMM with SSE2 - enhanced
1158    cycles for Jochen / 5 XMM
------------------------------------------------------------------------

six_L

Quote
------------------------------------------------------------------------
Intel(R) Core(TM) i3 CPU       M 370  @ 2.40GHz

Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
------------------------------------------------------------------------
1335   cycles for Dedndave code - 5 GPRs
1855   cycles for Frktons I Step / 2 GPRs
1009   cycles for Frktons I Step / 4 GPRs with LEA
587   cycles for Frktons II Step / 5 MMX with SSSE3
643   cycles for Frktons II Step / 5 MMX without SSSE3
389   cycles for Frktons III Step / XMM/MMX with SSE2
344   cycles for Frktons III Step / XMM with SSE2 - enhanced
460   cycles for Jochen / 5 XMM
------------------------------------------------------------------------
1278   cycles for Dedndave code - 5 GPRs
1025   cycles for Frktons I Step / 2 GPRs
1015   cycles for Frktons I Step / 4 GPRs with LEA
585   cycles for Frktons II Step / 5 MMX with SSSE3
633   cycles for Frktons II Step / 5 MMX without SSSE3
400   cycles for Frktons III Step / XMM/MMX with SSE2
345   cycles for Frktons III Step / XMM with SSE2 - enhanced
437   cycles for Jochen / 5 XMM
------------------------------------------------------------------------

--- ok ---

Say you, Say me, Say the codes together for ever.

frktons

As usual, different CPUs = different performance  :P

Considering I'm targeting Core Duo upwards, I can be satisfied
that I reached less than 0.7 cycles for each dword string.  :biggrin:

Farabi

http://farabidatacenter.url.ph/MySoftware/
My 3D Game Engine Demo.

Contact me at Whatsapp: 6283818314165

frktons

Quote from: Farabi on December 23, 2012, 09:24:58 AM
ahhh nice algo

Thanks, my friend. Applying unrolling and SSE2 code
produced this fast algo. :t
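
For anyone wondering what "unrolling and SSE2" can look like in practice, here is a rough sketch in C with SSE2 intrinsics. It is not the code from this thread: the assumed task (dword i holds the four ASCII digits of i), the size `N_DWORDS`, and both function names are hypothetical. The idea shown is to keep 20 consecutive entries in five XMM registers (20 is the least common multiple of 4 dwords per register and 10 values of the last digit), store them all, then bump only the tens digit with a byte-wise add.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics, x86/x86-64 only */
#include <stdint.h>
#include <string.h>

#define N_DWORDS 1000u /* hypothetical size; this sketch needs a multiple of 100 */

/* Scalar fill. Assumed task: dword i receives the four ASCII digits of i. */
static void fill_scalar(uint32_t *dst, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        char s[4];
        s[0] = (char)('0' + (i / 1000) % 10);
        s[1] = (char)('0' + (i / 100) % 10);
        s[2] = (char)('0' + (i / 10) % 10);
        s[3] = (char)('0' + i % 10);
        memcpy(&dst[i], s, sizeof s);
    }
}

/* Unrolled SSE2 fill: five 16-byte stores (20 dwords) per inner iteration. */
static void fill_sse2_unrolled(uint32_t *dst)
{
    uint32_t seed[20];
    fill_scalar(seed, 20); /* "0000" .. "0019" */

    __m128i v0 = _mm_loadu_si128((const __m128i *)&seed[0]);
    __m128i v1 = _mm_loadu_si128((const __m128i *)&seed[4]);
    __m128i v2 = _mm_loadu_si128((const __m128i *)&seed[8]);
    __m128i v3 = _mm_loadu_si128((const __m128i *)&seed[12]);
    __m128i v4 = _mm_loadu_si128((const __m128i *)&seed[16]);

    /* Byte 2 of each dword is the tens digit: advance it by 2 per 20 entries. */
    const __m128i tens_plus_2 = _mm_set1_epi32(0x00020000);
    /* After 100 entries: tens digit -10 (0xF6 as a byte), hundreds digit +1. */
    const __m128i next_hundred = _mm_set1_epi32(0x00F60100);

    __m128i *out = (__m128i *)dst;
    for (unsigned h = 0; h < N_DWORDS / 100; h++) {
        for (unsigned k = 0; k < 5; k++) {
            _mm_storeu_si128(out + 0, v0);
            _mm_storeu_si128(out + 1, v1);
            _mm_storeu_si128(out + 2, v2);
            _mm_storeu_si128(out + 3, v3);
            _mm_storeu_si128(out + 4, v4);
            out += 5;
            v0 = _mm_add_epi8(v0, tens_plus_2);
            v1 = _mm_add_epi8(v1, tens_plus_2);
            v2 = _mm_add_epi8(v2, tens_plus_2);
            v3 = _mm_add_epi8(v3, tens_plus_2);
            v4 = _mm_add_epi8(v4, tens_plus_2);
        }
        v0 = _mm_add_epi8(v0, next_hundred);
        v1 = _mm_add_epi8(v1, next_hundred);
        v2 = _mm_add_epi8(v2, next_hundred);
        v3 = _mm_add_epi8(v3, next_hundred);
        v4 = _mm_add_epi8(v4, next_hundred);
    }
}
```

The sketch uses unaligned loads and stores for simplicity; whether aligned stores or a different unroll factor would be faster depends on the CPU, as the timings in this thread suggest.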