Author Topic: Optimizing some code  (Read 17612 times)

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Optimizing some code
« on: June 10, 2014, 06:54:45 PM »

Hi all,
        I found this 4 procedures to get string length
   
   strlen32            <- Author: Agner Fog
   strlen32M         <- is strlen32 modified by RuiLoureiro
   GetStringLenX  <-         RuiLoureiro
   szLen                <-         MASM

    where the best one is just this:  ;)
Code: [Select]
GetStringLenX       proc        pStr:DWORD
                    mov         edx, [esp+4]    ;pStr
                    xor         eax, eax
                    jz          short @F                   
       _begin0:     add         eax, 1                   
            @@:     movzx       ecx, byte ptr [edx+eax]
                    or          ecx, ecx
                    jnz         short _begin0
                    ret         4
GetStringLenX       endp
        I did some tests and we get the following results
        for all 4 cases:
        (You should draw your own conclusions)
       
        Could you run TestString32.exe and post the results
        here ?
        Thanks


        note I:  Thanks Dave for that one.
        note II: Jochen, it seems that it is not for you
                     because...
CASE 1
------------------------------------------------
        the addresses are aligned
------------------------------------------------
Quote
...
 -------------- START ----------------
...
 ***** Time table *****
9   milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
34  milliseconds, szLen         - _string02- length=32
35  milliseconds, GetStringLenX - _string02- length=32


37  milliseconds, strlen32M     - _string03- length=64
39  milliseconds, strlen32      - _string03- length=64
53  milliseconds, szLen         - _string03- length=64
57  milliseconds, GetStringLenX - _string03- length=64
 ********** END **********
CASE 2
---------------------------------------------------
 the addresses are not aligned by 1 byte
---------------------------------------------------
Quote
...
 -------------- START ----------------
...
 ***** Time table *****


10  milliseconds, strlen32M     - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16
15  milliseconds, strlen32      - _string02- length=32
15  milliseconds, strlen32M     - _string02- length=32


16  milliseconds, strlen32      - _string01- length=16
25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenX - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
56  milliseconds, GetStringLenX - _string03- length=64
 ********** END **********
CASE 3
-----------------------------------------------------
 the addresses are not aligned by 2 bytes
-----------------------------------------------------
Quote
...
 -------------- START ----------------
...
 ***** Time table *****
10  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
12  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenX - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
65  milliseconds, GetStringLenX - _string03- length=64
 ********** END **********
CASE 4
----------------------------------------------------
the addresses are not aligned by 3 bytes
----------------------------------------------------
Quote
...
 -------------- START ----------------
...
 ***** Time table *****


10  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32      - _string02- length=32
16  milliseconds, strlen32M     - _string02- length=32


25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
39  milliseconds, GetStringLenX - _string02- length=32
56  milliseconds, szLen         - _string03- length=64
56  milliseconds, GetStringLenX - _string03- length=64
 ********** END **********
FOR 128 BYTES
Quote
...
 -------------- START ----------------
...
 ***** Time table *****
9   milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
13  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenX - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
56  milliseconds, GetStringLenX - _string03- length=64


58  milliseconds, strlen32M     - _string04- length=128
67  milliseconds, strlen32      - _string04- length=128
92  milliseconds, szLen         - _string04- length=128
98  milliseconds, GetStringLenX - _string04- length=128
 ********** END **********

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7642
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Optimizing some code
« Reply #1 on: June 10, 2014, 08:05:24 PM »
Hi Rui,

I moved the thread so it would be seen by the algo folks.

Here is my timings on my 3 gig Core2 quad.

Code: [Select]
*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
 -------------- START ----------------
7 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
12 milliseconds, GetStringLenX, _string01- length=16
7 milliseconds, szLen, _string01- length=16
10 milliseconds, strlen32, _string02 - length=32
9 milliseconds, strlen32M, _string02 - length=32
22 milliseconds, GetStringLenX, _string02- length=32
12 milliseconds, szLen, _string02- length=32
18 milliseconds, strlen32, _string03 - length=64
18 milliseconds, strlen32M, _string03 - length=64
42 milliseconds, GetStringLenX, _string03- length=64
23 milliseconds, szLen, _string03- length=64
32 milliseconds, strlen32, _string04 - length=128
33 milliseconds, strlen32M, _string04 - length=128
64 milliseconds, GetStringLenX, _string04- length=128
44 milliseconds, szLen, _string04- length=128
 *** Press any key to get the time table ***

 ***** Time table *****

6  milliseconds, strlen32M     - _string01- length=16
7  milliseconds, szLen         - _string01- length=16
7  milliseconds, strlen32      - _string01- length=16
9  milliseconds, strlen32M     - _string02- length=32
10  milliseconds, strlen32      - _string02- length=32
12  milliseconds, szLen         - _string02- length=32
12  milliseconds, GetStringLenX - _string01- length=16
18  milliseconds, strlen32      - _string03- length=64
18  milliseconds, strlen32M     - _string03- length=64
22  milliseconds, GetStringLenX - _string02- length=32
23  milliseconds, szLen         - _string03- length=64
32  milliseconds, strlen32      - _string04- length=128
33  milliseconds, strlen32M     - _string04- length=128
42  milliseconds, GetStringLenX - _string03- length=64
44  milliseconds, szLen         - _string04- length=128
64  milliseconds, GetStringLenX - _string04- length=128
 ********** END **********
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

cpu2

  • Regular Member
  • *
  • Posts: 28
Re: Optimizing some code
« Reply #2 on: June 10, 2014, 09:41:51 PM »
I have an implementation on x64 with SSE2 extensions, maybe interested.

Translate to x86 is very easy.

Regards.

jj2007

  • Member
  • *****
  • Posts: 10676
  • Assembler is fun ;-)
    • MasmBasic
Re: Optimizing some code
« Reply #3 on: June 10, 2014, 09:46:52 PM »
        note II: Jochen, it seems that it is not for you
                     because...

 ::) :(

Code: [Select]
AMD Athlon(tm) Dual Core Processor 4450B (MMX, SSE, SSE2, SSE3)
 -------------- START ----------------
14 milliseconds, strlen32, _string01- length=16
11 milliseconds, strlen32M, _string01- length=16
24 milliseconds, GetStringLenX, _string01- length=16
10 milliseconds, szLen, _string01- length=16
19 milliseconds, strlen32, _string02 - length=32
16 milliseconds, strlen32M, _string02 - length=32
37 milliseconds, GetStringLenX, _string02- length=32
17 milliseconds, szLen, _string02- length=32
34 milliseconds, strlen32, _string03 - length=64
33 milliseconds, strlen32M, _string03 - length=64
66 milliseconds, GetStringLenX, _string03- length=64
37 milliseconds, szLen, _string03- length=64
56 milliseconds, strlen32, _string04 - length=128
55 milliseconds, strlen32M, _string04 - length=128
122 milliseconds, GetStringLenX, _string04- length=128
66 milliseconds, szLen, _string04- length=128
 *** Press any key to get the time table ***

 ***** Time table *****

10  milliseconds, szLen         - _string01- length=16
11  milliseconds, strlen32M     - _string01- length=16
14  milliseconds, strlen32      - _string01- length=16
16  milliseconds, strlen32M     - _string02- length=32
17  milliseconds, szLen         - _string02- length=32
19  milliseconds, strlen32      - _string02- length=32
24  milliseconds, GetStringLenX - _string01- length=16
33  milliseconds, strlen32M     - _string03- length=64
34  milliseconds, strlen32      - _string03- length=64
37  milliseconds, szLen         - _string03- length=64
37  milliseconds, GetStringLenX - _string02- length=32
55  milliseconds, strlen32M     - _string04- length=128
56  milliseconds, strlen32      - _string04- length=128
66  milliseconds, szLen         - _string04- length=128
66  milliseconds, GetStringLenX - _string03- length=64
122  milliseconds, GetStringLenX - _string04- length=128
 ********** END **********

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #4 on: June 10, 2014, 11:01:42 PM »
if you want to see the other algorithms perform, try longer strings
many of them work well on strings of, say, 1000 bytes

that seems a little impractical to me, but i can see cases where it might be useful
generally, we think of display strings, which are shorter
but, fully qualified path names can be much longer
and, dealing with text files, you might want to access sentances, paragraphs, or even sections

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #5 on: June 11, 2014, 01:42:35 AM »
 :biggrin:
Hi,
    Thank you Hutch, Jochen and Dave

Cpu2,
           for now i don't want to use MMX, SSE. Thanks.

    For me, it is useful for lengths of 30 bytes (+/-)
    So, 64 is good!
   
    Now, the best one is just this
   (for length=0 it is very very fast !!!  ;) )
Code: [Select]
GetStringLenY       proc        pStr:DWORD
                    mov         edx, [esp+4]                    ;pStr
                    mov         eax, -1                   
            @@:     add         eax, 1                   
                    movzx       ecx, byte ptr [edx+eax]
                    or          ecx, ecx
                    jnz         short @B                               
                    ret         4
GetStringLenY       endp
        Please, could you run TestString16A.exe and TestString64.exe
        and post the results here ? Only the «Time table».
        Thanks

        note: your computers are faster. I do optimization based
                  on my.
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
 ***** Time table *****

 9  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, GetStringLenY - _string01- length=16
13  milliseconds, szLen         - _string01- length=16

15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
35  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenY - _string02- length=32

37  milliseconds, strlen32M     - _string03- length=64
40  milliseconds, strlen32      - _string03- length=64
54  milliseconds, szLen         - _string03- length=64
58  milliseconds, strlen32M     - _string04- length=128

58  milliseconds, GetStringLenY - _string03- length=64
58  milliseconds, strlen32      - _string04- length=128
93  milliseconds, szLen         - _string04- length=128
100  milliseconds, GetStringLenY - _string04- length=128
 ********** END **********

Code: [Select]
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
 ***** Time table *****

 96 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
 99 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
122 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
124 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
127 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
127 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
157 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
169 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
179 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
 ********** END 2 **********

jj2007

  • Member
  • *****
  • Posts: 10676
  • Assembler is fun ;-)
    • MasmBasic
Re: Optimizing some code
« Reply #6 on: June 11, 2014, 02:12:04 AM »
    Now, the best one is just this
   (for length=0 it is very very fast !!!  ;) )

For any other length, it could be a bit faster ;-)

No sources??

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #7 on: June 11, 2014, 03:11:54 AM »
Rui,

results from TestString16A.exe:
Code: [Select]
STRINGS:
a bcd efg hij klm nop
ab cdefg hijk lmnopA

abc de fghijkl mn op
abc defg hi jk lm nop

abc de fghijkl mn op A
abc defg hi jk lm nop

X is less than Y
 ShowResultXY
X is EQUAL Y
 ShowResultXY
X is greater than Y
 ShowResultXY
X is less than Y
 ShowResultXY
X is EQUAL Y
 ShowResultXY
X is greater than Y
 ShowResultXY
X is less than Y
 ShowResultXY
X is EQUAL Y
 ShowResultXY
X is greater than Y
 ShowResultXY
64 milliseconds, CompareStringXYS, _string01X, _string01Y
42 milliseconds, CompareStringXYS, _string02X, _string02Y
45 milliseconds, CompareStringXYS, _string03X, _string03Y
28 milliseconds, CompareStringXYT, _string01X, _string01Y
28 milliseconds, CompareStringXYT, _string02X, _string02Y
28 milliseconds, CompareStringXYT, _string03X, _string03Y
48 milliseconds, CompareStringXYBS, _string01X, _string01Y
50 milliseconds, CompareStringXYBS, _string02X, _string02Y
52 milliseconds, CompareStringXYBS, _string03X, _string03Y
 *** Press any key to get the time table ***

------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
 ***** Time table *****

28 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
28 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
28 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
42 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
45 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
48 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
50 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
52 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
64 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
 ********** END 2 **********

The results from TestString64.exe:
Code: [Select]
*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
 -------------- START ----------------
9 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
13 milliseconds, GetStringLenY, _string01- length=16
5 milliseconds, szLen, _string01- length=16
7 milliseconds, strlen32, _string02 - length=32
5 milliseconds, strlen32M, _string02 - length=32
24 milliseconds, GetStringLenY, _string02- length=32
8 milliseconds, szLen, _string02- length=32
13 milliseconds, strlen32, _string03 - length=64
10 milliseconds, strlen32M, _string03 - length=64
35 milliseconds, GetStringLenY, _string03- length=64
16 milliseconds, szLen, _string03- length=64
28 milliseconds, strlen32, _string04 - length=128
26 milliseconds, strlen32M, _string04 - length=128
55 milliseconds, GetStringLenY, _string04- length=128
42 milliseconds, szLen, _string04- length=128
 *** Press any key to get the time table ***

------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
 ***** Time table *****

5  milliseconds, strlen32M     - _string02- length=32
5  milliseconds, szLen         - _string01- length=16
6  milliseconds, strlen32M     - _string01- length=16
7  milliseconds, strlen32      - _string02- length=32
8  milliseconds, szLen         - _string02- length=32
9  milliseconds, strlen32      - _string01- length=16
10  milliseconds, strlen32M     - _string03- length=64
13  milliseconds, strlen32      - _string03- length=64
13  milliseconds, GetStringLenY - _string01- length=16
16  milliseconds, szLen         - _string03- length=64
24  milliseconds, GetStringLenY - _string02- length=32
26  milliseconds, strlen32M     - _string04- length=128
28  milliseconds, strlen32      - _string04- length=128
35  milliseconds, GetStringLenY - _string03- length=64
42  milliseconds, szLen         - _string04- length=128
55  milliseconds, GetStringLenY - _string04- length=128
 ********** END **********

Gunther
Get your facts first, and then you can distort them.

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #8 on: June 11, 2014, 03:23:27 AM »
    Now, the best one is just this
   (for length=0 it is very very fast !!!  ;) )

For any other length, it could be a bit faster ;-)

No sources??
more sources Jochen ? topic "Sorting strings" you have it.
No problems about this code, i post all things !
How do you improve it a bit ?

qWord

  • Member
  • *****
  • Posts: 1476
  • The base type of a type is the type itself
    • SmplMath macros
Re: Optimizing some code
« Reply #9 on: June 11, 2014, 03:43:18 AM »
check the old forum - methods to get the string length has been discussed to death.
MREAL macros - when you need floating point arithmetic while assembling!

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #10 on: June 11, 2014, 04:52:35 AM »
qWord,

More than to get the best code or the fastest
code, i like to write the way i think, following
my own logic. Meanwhile, i try to compare the
logic of some faster procedures with the way as i do.
And i have my conclusions.
In this case, i read strlen32 written by Agner Fog
(i dont need to use strlen) and i wrote a modified
version and tested it in the way you know.
I posted it because it is an optimized version
from an optimized version from Agner Fog.
About sorting strings, i want to add it in
the next version of the linked list project.
In my own projects i don't use null terminated strings.
Of course, the calculator doesn't use it.
As we see, when the length is not more than some
bytes i use any version optimized or not.
To me, complex methods to get the string length is
to put in the dustbin.

EDIT: i didn't do the things like that i saw in the old forum.

Gunther,
            thanks !
« Last Edit: June 16, 2014, 07:40:10 AM by RuiLoureiro »

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #11 on: June 11, 2014, 05:53:46 AM »
Jochen,
            where are you ? Are you sleeping ?
             where is your answer ?

habran

  • Member
  • *****
  • Posts: 1225
    • uasm
Re: Optimizing some code
« Reply #12 on: June 11, 2014, 06:28:59 AM »
Try this :biggrin:
Code: [Select]
GetStringLenX PROC pStr:DWORD
    mov eax,pStr
    .while (BYTE PTR[eax])
      inc eax
    .endw
    sub eax,pStr
    ret
GetStringLenX ENDP

Cod-Father

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #13 on: June 11, 2014, 07:25:20 AM »
Your suggestion: GetStringLenZ is worse.
The difference of addresses is worse.
As we can see below, to use szLen
or GetStringLenY makes no difference
im my system, up to length=32 or 64.

Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
 ***** Time table *****

 9  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, GetStringLenY - _string01- length=16
13  milliseconds, szLen         - _string01- length=16

15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
35  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenY - _string02- length=32

37  milliseconds, strlen32M     - _string03- length=64
40  milliseconds, strlen32      - _string03- length=64
54  milliseconds, szLen         - _string03- length=64
58  milliseconds, strlen32M     - _string04- length=128

58  milliseconds, GetStringLenY - _string03- length=64
58  milliseconds, strlen32      - _string04- length=128
93  milliseconds, szLen         - _string04- length=128
100  milliseconds, GetStringLenY - _string04- length=128
 ********** END **********


Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
 ***** Time table *****

 9  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
14  milliseconds, strlen32M     - _string02- length=32
14  milliseconds, szLen         - _string01- length=16
15  milliseconds, strlen32      - _string02- length=32
18 milliseconds, GetStringLenZ - _string01- length=16
34  milliseconds, szLen         - _string02- length=32
37  milliseconds, strlen32M     - _string03- length=64
38  milliseconds, strlen32      - _string03- length=64
40  milliseconds, GetStringLenZ - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
58  milliseconds, strlen32      - _string04- length=128
58  milliseconds, strlen32M     - _string04- length=128
68  milliseconds, GetStringLenZ - _string03- length=64
96  milliseconds, szLen         - _string04- length=128
123  milliseconds, GetStringLenZ - _string04- length=128
 ********** END **********

jj2007

  • Member
  • *****
  • Posts: 10676
  • Assembler is fun ;-)
    • MasmBasic
Re: Optimizing some code
« Reply #14 on: June 11, 2014, 08:16:59 AM »
Jochen,
            where are you ? Are you sleeping ?
             where is your answer ?

Yes, I was sleeping, and here is my answer (for a 100 byte string):

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
23657   cycles for 100 * Rui
3942    cycles for 100 * MB
13843   cycles for 100 * Masm32
14124   cycles for 100 * CRT
23757   cycles for 100 * Habran


As qWord wrote above, we have tested it already.