The MASM Forum

General => The Laboratory => Topic started by: RuiLoureiro on June 10, 2014, 06:54:45 PM

Title: Optimizing some code
Post by: RuiLoureiro on June 10, 2014, 06:54:45 PM

Hi all,
        I found these 4 procedures to get the string length:
   
   strlen32            <- Author: Agner Fog
   strlen32M         <- is strlen32 modified by RuiLoureiro
   GetStringLenX  <-         RuiLoureiro
   szLen                <-         MASM

    where the best one is just this:  ;)

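OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
GetStringLenX       proc        pStr:DWORD
                    mov         edx, [esp+4]    ;pStr
                    xor         eax, eax        ;length = 0 (also sets ZF)
                    jz          short @F        ;always taken: skip the first add
       _begin0:     add         eax, 1
            @@:     movzx       ecx, byte ptr [edx+eax]
                    or          ecx, ecx        ;zero terminator ?
                    jnz         short _begin0
                    ret         4
GetStringLenX       endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef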

        I did some tests and we get the following results
        for all 4 cases:
        (You should draw your own conclusions)
       
        Could you run TestString32.exe and post the results
        here ?
        Thanks


        note I:  Thanks Dave for that one.
        note II: Jochen, it seems that it is not for you
                     because...
CASE 1
------------------------------------------------
         the addresses are aligned
------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****
9   milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
34  milliseconds, szLen         - _string02- length=32
35  milliseconds, GetStringLenX - _string02- length=32


37  milliseconds, strlen32M     - _string03- length=64
39  milliseconds, strlen32      - _string03- length=64
53  milliseconds, szLen         - _string03- length=64
57  milliseconds, GetStringLenX - _string03- length=64
********** END **********
CASE 2
---------------------------------------------------
the addresses are misaligned by 1 byte
---------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****


10  milliseconds, strlen32M     - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16
15  milliseconds, strlen32      - _string02- length=32
15  milliseconds, strlen32M     - _string02- length=32


16  milliseconds, strlen32      - _string01- length=16
25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenX - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
56  milliseconds, GetStringLenX - _string03- length=64
********** END **********
CASE 3
-----------------------------------------------------
the addresses are misaligned by 2 bytes
-----------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****
10  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
12  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenX - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
65  milliseconds, GetStringLenX - _string03- length=64
********** END **********
CASE 4
----------------------------------------------------
the addresses are misaligned by 3 bytes
----------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****


10  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32      - _string02- length=32
16  milliseconds, strlen32M     - _string02- length=32


25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
39  milliseconds, GetStringLenX - _string02- length=32
56  milliseconds, szLen         - _string03- length=64
56  milliseconds, GetStringLenX - _string03- length=64
********** END **********
FOR 128 BYTES
Quote
...
-------------- START ----------------
...
***** Time table *****
9   milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
13  milliseconds, szLen         - _string01- length=16
13  milliseconds, GetStringLenX - _string01- length=16


15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
25  milliseconds, strlen32M     - _string03- length=64
26  milliseconds, strlen32      - _string03- length=64


34  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenX - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
56  milliseconds, GetStringLenX - _string03- length=64


58  milliseconds, strlen32M     - _string04- length=128
67  milliseconds, strlen32      - _string04- length=128
92  milliseconds, szLen         - _string04- length=128
98  milliseconds, GetStringLenX - _string04- length=128
********** END **********
Title: Re: Optimizing some code
Post by: hutch-- on June 10, 2014, 08:05:24 PM
Hi Rui,

I moved the thread so it would be seen by the algo folks.

Here are my timings on my 3 gig Core2 quad.


*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
-------------- START ----------------
7 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
12 milliseconds, GetStringLenX, _string01- length=16
7 milliseconds, szLen, _string01- length=16
10 milliseconds, strlen32, _string02 - length=32
9 milliseconds, strlen32M, _string02 - length=32
22 milliseconds, GetStringLenX, _string02- length=32
12 milliseconds, szLen, _string02- length=32
18 milliseconds, strlen32, _string03 - length=64
18 milliseconds, strlen32M, _string03 - length=64
42 milliseconds, GetStringLenX, _string03- length=64
23 milliseconds, szLen, _string03- length=64
32 milliseconds, strlen32, _string04 - length=128
33 milliseconds, strlen32M, _string04 - length=128
64 milliseconds, GetStringLenX, _string04- length=128
44 milliseconds, szLen, _string04- length=128
*** Press any key to get the time table ***

***** Time table *****

6  milliseconds, strlen32M     - _string01- length=16
7  milliseconds, szLen         - _string01- length=16
7  milliseconds, strlen32      - _string01- length=16
9  milliseconds, strlen32M     - _string02- length=32
10  milliseconds, strlen32      - _string02- length=32
12  milliseconds, szLen         - _string02- length=32
12  milliseconds, GetStringLenX - _string01- length=16
18  milliseconds, strlen32      - _string03- length=64
18  milliseconds, strlen32M     - _string03- length=64
22  milliseconds, GetStringLenX - _string02- length=32
23  milliseconds, szLen         - _string03- length=64
32  milliseconds, strlen32      - _string04- length=128
33  milliseconds, strlen32M     - _string04- length=128
42  milliseconds, GetStringLenX - _string03- length=64
44  milliseconds, szLen         - _string04- length=128
64  milliseconds, GetStringLenX - _string04- length=128
********** END **********
Title: Re: Optimizing some code
Post by: cpu2 on June 10, 2014, 09:41:51 PM
I have an implementation for x64 with SSE2 extensions, if you are interested.

Translating it to x86 is very easy.

Regards.
Title: Re: Optimizing some code
Post by: jj2007 on June 10, 2014, 09:46:52 PM
Quote from: RuiLoureiro on June 10, 2014, 06:54:45 PM
        note II: Jochen, it seems that it is not for you
                     because...

::) :(

AMD Athlon(tm) Dual Core Processor 4450B (MMX, SSE, SSE2, SSE3)
-------------- START ----------------
14 milliseconds, strlen32, _string01- length=16
11 milliseconds, strlen32M, _string01- length=16
24 milliseconds, GetStringLenX, _string01- length=16
10 milliseconds, szLen, _string01- length=16
19 milliseconds, strlen32, _string02 - length=32
16 milliseconds, strlen32M, _string02 - length=32
37 milliseconds, GetStringLenX, _string02- length=32
17 milliseconds, szLen, _string02- length=32
34 milliseconds, strlen32, _string03 - length=64
33 milliseconds, strlen32M, _string03 - length=64
66 milliseconds, GetStringLenX, _string03- length=64
37 milliseconds, szLen, _string03- length=64
56 milliseconds, strlen32, _string04 - length=128
55 milliseconds, strlen32M, _string04 - length=128
122 milliseconds, GetStringLenX, _string04- length=128
66 milliseconds, szLen, _string04- length=128
*** Press any key to get the time table ***

***** Time table *****

10  milliseconds, szLen         - _string01- length=16
11  milliseconds, strlen32M     - _string01- length=16
14  milliseconds, strlen32      - _string01- length=16
16  milliseconds, strlen32M     - _string02- length=32
17  milliseconds, szLen         - _string02- length=32
19  milliseconds, strlen32      - _string02- length=32
24  milliseconds, GetStringLenX - _string01- length=16
33  milliseconds, strlen32M     - _string03- length=64
34  milliseconds, strlen32      - _string03- length=64
37  milliseconds, szLen         - _string03- length=64
37  milliseconds, GetStringLenX - _string02- length=32
55  milliseconds, strlen32M     - _string04- length=128
56  milliseconds, strlen32      - _string04- length=128
66  milliseconds, szLen         - _string04- length=128
66  milliseconds, GetStringLenX - _string03- length=64
122  milliseconds, GetStringLenX - _string04- length=128
********** END **********
Title: Re: Optimizing some code
Post by: dedndave on June 10, 2014, 11:01:42 PM
if you want to see the other algorithms perform, try longer strings
many of them work well on strings of, say, 1000 bytes

that seems a little impractical to me, but i can see cases where it might be useful
generally, we think of display strings, which are shorter
but, fully qualified path names can be much longer
and, dealing with text files, you might want to access sentences, paragraphs, or even sections
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 01:42:35 AM
 :biggrin:
Hi,
    Thank you Hutch, Jochen and Dave

Cpu2,
           for now i don't want to use MMX, SSE. Thanks.

    For me, it is useful for lengths of 30 bytes (+/-)
    So, 64 is good!
   
    Now, the best one is just this
   (for length=0 it is very very fast !!!  ;) )

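OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
GetStringLenY       proc        pStr:DWORD
                    mov         edx, [esp+4]                    ;pStr
                    mov         eax, -1                         ;index starts at -1, first add makes it 0
            @@:     add         eax, 1
                    movzx       ecx, byte ptr [edx+eax]
                    or          ecx, ecx                        ;zero terminator ?
                    jnz         short @B
                    ret         4
GetStringLenY       endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef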

        Please, could you run TestString16A.exe and TestString64.exe
        and post the results here ? Only the «Time table».
        Thanks

        note: your computers are faster. I do my optimization based
                  on mine.
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

9  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, GetStringLenY - _string01- length=16
13  milliseconds, szLen         - _string01- length=16

15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
35  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenY - _string02- length=32

37  milliseconds, strlen32M     - _string03- length=64
40  milliseconds, strlen32      - _string03- length=64
54  milliseconds, szLen         - _string03- length=64
58  milliseconds, strlen32M     - _string04- length=128

58  milliseconds, GetStringLenY - _string03- length=64
58  milliseconds, strlen32      - _string04- length=128
93  milliseconds, szLen         - _string04- length=128
100  milliseconds, GetStringLenY - _string04- length=128
********** END **********


-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

96 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
99 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
122 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
124 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
127 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
127 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
157 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
169 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
179 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
********** END 2 **********
Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 02:12:04 AM
Quote from: RuiLoureiro on June 11, 2014, 01:42:35 AM
    Now, the best one is just this
   (for length=0 it is very very fast !!!  ;) )

For any other length, it could be a bit faster ;-)

No sources??
Title: Re: Optimizing some code
Post by: Gunther on June 11, 2014, 03:11:54 AM
Rui,

results from TestString16A.exe:

STRINGS:
a bcd efg hij klm nop
ab cdefg hijk lmnopA

abc de fghijkl mn op
abc defg hi jk lm nop

abc de fghijkl mn op A
abc defg hi jk lm nop

X is less than Y
ShowResultXY
X is EQUAL Y
ShowResultXY
X is greater than Y
ShowResultXY
X is less than Y
ShowResultXY
X is EQUAL Y
ShowResultXY
X is greater than Y
ShowResultXY
X is less than Y
ShowResultXY
X is EQUAL Y
ShowResultXY
X is greater than Y
ShowResultXY
64 milliseconds, CompareStringXYS, _string01X, _string01Y
42 milliseconds, CompareStringXYS, _string02X, _string02Y
45 milliseconds, CompareStringXYS, _string03X, _string03Y
28 milliseconds, CompareStringXYT, _string01X, _string01Y
28 milliseconds, CompareStringXYT, _string02X, _string02Y
28 milliseconds, CompareStringXYT, _string03X, _string03Y
48 milliseconds, CompareStringXYBS, _string01X, _string01Y
50 milliseconds, CompareStringXYBS, _string02X, _string02Y
52 milliseconds, CompareStringXYBS, _string03X, _string03Y
*** Press any key to get the time table ***

------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
***** Time table *****

28 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
28 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
28 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
42 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
45 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
48 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
50 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
52 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
64 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
********** END 2 **********


The results from TestString64.exe:

*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
-------------- START ----------------
9 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
13 milliseconds, GetStringLenY, _string01- length=16
5 milliseconds, szLen, _string01- length=16
7 milliseconds, strlen32, _string02 - length=32
5 milliseconds, strlen32M, _string02 - length=32
24 milliseconds, GetStringLenY, _string02- length=32
8 milliseconds, szLen, _string02- length=32
13 milliseconds, strlen32, _string03 - length=64
10 milliseconds, strlen32M, _string03 - length=64
35 milliseconds, GetStringLenY, _string03- length=64
16 milliseconds, szLen, _string03- length=64
28 milliseconds, strlen32, _string04 - length=128
26 milliseconds, strlen32M, _string04 - length=128
55 milliseconds, GetStringLenY, _string04- length=128
42 milliseconds, szLen, _string04- length=128
*** Press any key to get the time table ***

------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
***** Time table *****

5  milliseconds, strlen32M     - _string02- length=32
5  milliseconds, szLen         - _string01- length=16
6  milliseconds, strlen32M     - _string01- length=16
7  milliseconds, strlen32      - _string02- length=32
8  milliseconds, szLen         - _string02- length=32
9  milliseconds, strlen32      - _string01- length=16
10  milliseconds, strlen32M     - _string03- length=64
13  milliseconds, strlen32      - _string03- length=64
13  milliseconds, GetStringLenY - _string01- length=16
16  milliseconds, szLen         - _string03- length=64
24  milliseconds, GetStringLenY - _string02- length=32
26  milliseconds, strlen32M     - _string04- length=128
28  milliseconds, strlen32      - _string04- length=128
35  milliseconds, GetStringLenY - _string03- length=64
42  milliseconds, szLen         - _string04- length=128
55  milliseconds, GetStringLenY - _string04- length=128
********** END **********


Gunther
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 03:23:27 AM
Quote from: jj2007 on June 11, 2014, 02:12:04 AM
Quote from: RuiLoureiro on June 11, 2014, 01:42:35 AM
    Now, the best one is just this
   (for length=0 it is very very fast !!!  ;) )

For any other length, it could be a bit faster ;-)

No sources??
More sources, Jochen? You have them in the topic "Sorting strings".
No problem with this code, i post everything!
How would you improve it a bit?
Title: Re: Optimizing some code
Post by: qWord on June 11, 2014, 03:43:18 AM
check the old forum - methods to get the string length have been discussed to death.
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 04:52:35 AM
qWord,

More than getting the best or the fastest
code, i like to write the way i think, following
my own logic. Meanwhile, i try to compare the
logic of some faster procedures with the way i do it.
And i draw my own conclusions.
In this case, i read strlen32 written by Agner Fog
(i don't need to use strlen) and i wrote a modified
version and tested it in the way you know.
I posted it because it is an optimized version
of an already optimized version by Agner Fog.
About sorting strings, i want to add it to
the next version of the linked list project.
In my own projects i don't use null-terminated strings.
Of course, the calculator doesn't use them.
As we see, when the length is no more than a few
bytes i can use any version, optimized or not.
To me, complex methods to get the string length
belong in the dustbin.

EDIT: i didn't do the things like that i saw in the old forum.

Gunther,
            thanks !
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 05:53:46 AM
Jochen,
            where are you ? Are you sleeping ?
             where is your answer ?
Title: Re: Optimizing some code
Post by: habran on June 11, 2014, 06:28:59 AM
Try this :biggrin:

GetStringLenX PROC pStr:DWORD
    mov eax,pStr
    .while (BYTE PTR[eax])
      inc eax
    .endw
    sub eax,pStr
    ret
GetStringLenX ENDP

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 07:25:20 AM
Your suggestion, GetStringLenZ, is worse.
Taking the difference of addresses is slower.
As we can see below, using szLen
or GetStringLenY makes no difference
on my system, up to length=32 or 64.

Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

9  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
12  milliseconds, GetStringLenY - _string01- length=16
13  milliseconds, szLen         - _string01- length=16

15  milliseconds, strlen32M     - _string02- length=32
15  milliseconds, strlen32      - _string02- length=32
35  milliseconds, szLen         - _string02- length=32
36  milliseconds, GetStringLenY - _string02- length=32

37  milliseconds, strlen32M     - _string03- length=64
40  milliseconds, strlen32      - _string03- length=64
54  milliseconds, szLen         - _string03- length=64
58  milliseconds, strlen32M     - _string04- length=128

58  milliseconds, GetStringLenY - _string03- length=64
58  milliseconds, strlen32      - _string04- length=128
93  milliseconds, szLen         - _string04- length=128
100  milliseconds, GetStringLenY - _string04- length=128
********** END **********


Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

9  milliseconds, strlen32M     - _string01- length=16
10  milliseconds, strlen32      - _string01- length=16
14  milliseconds, strlen32M     - _string02- length=32
14  milliseconds, szLen         - _string01- length=16
15  milliseconds, strlen32      - _string02- length=32
18  milliseconds, GetStringLenZ - _string01- length=16
34  milliseconds, szLen         - _string02- length=32
37  milliseconds, strlen32M     - _string03- length=64
38  milliseconds, strlen32      - _string03- length=64
40  milliseconds, GetStringLenZ - _string02- length=32
53  milliseconds, szLen         - _string03- length=64
58  milliseconds, strlen32      - _string04- length=128
58  milliseconds, strlen32M     - _string04- length=128
68  milliseconds, GetStringLenZ - _string03- length=64
96  milliseconds, szLen         - _string04- length=128
123  milliseconds, GetStringLenZ - _string04- length=128
********** END **********
Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 08:16:59 AM
Quote from: RuiLoureiro on June 11, 2014, 05:53:46 AM
Jochen,
            where are you ? Are you sleeping ?
             where is your answer ?

Yes, I was sleeping, and here is my answer (for a 100 byte string):

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
23657   cycles for 100 * Rui
3942    cycles for 100 * MB
13843   cycles for 100 * Masm32
14124   cycles for 100 * CRT
23757   cycles for 100 * Habran


As qWord wrote above, we have tested it already.
Title: Re: Optimizing some code
Post by: dedndave on June 11, 2014, 11:46:09 AM
the routines we have analyzed before have typically been optimized for longer strings
for shorter strings, say 50 characters or less, those routines have too much overhead

so, i tried to keep overhead to a minimum
and - access data as 4-aligned dwords
it hasn't been tested, but maybe it will give you some ideas....
        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

ShortLen PROC _lpStr:LPSTR

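        mov     edx,[esp+4]
        xor     eax,eax                 ;clear eax on every path (only AL is set later)
        test    dl,3
        mov     ecx,[edx]
        jz      start_dwords3

        inc     edx
        test    cl,cl
        jz      all_done

        test    dl,3
        jz      start_dwords2           ;edx is aligned and not yet read, so no add edx,4

        inc     eax
        inc     edx
        test    ch,ch
        jz      all_done

        test    dl,3
        jz      start_dwords2           ;same here: read the dword at edx next

        inc     eax
        inc     edx
        test    ecx,0FF0000h
        jz      all_done

        jmp short start_dwords2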

start_dwords1:
        add     edx,4

start_dwords2:
        mov     ecx,[edx]

start_dwords3:
        test    cl,cl
        mov     al,0
        jz      end_dwords

        test    ch,ch
        mov     al,1
        jz      end_dwords

        test    ecx,0FF0000h
        mov     al,2
        jz      end_dwords

        test    ecx,0FF000000h
        mov     al,3
        jnz     start_dwords1

end_dwords:
        sub     eax,[esp+4]
        add     eax,edx

all_done:
        ret     4

ShortLen ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef
Title: Re: Optimizing some code
Post by: hutch-- on June 11, 2014, 01:11:27 PM
This very simple algo is still one of my favourites, mainly because it's small, simple and flexible enough to inline into the middle of a larger, more complex algo. The aligned DWORD read versions will always be faster on longer linear reads, but often that is of no real gain in a more complex set of tasks.


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

slen proc ptxt:DWORD

    mov eax, [esp+4]
    sub eax, 1

  lbl0:
    add eax, 1
    cmp BYTE PTR [eax], 0
    jne lbl0

    sub eax, [esp+4]

    ret 4

slen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
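
For the inlining case mentioned above, here is a rough sketch of the same byte scanner written as a macro. The macro name SLEN_INLINE and its parameter are made up for illustration and are not part of the MASM32 library. It assumes the string address is already in a 32-bit register other than eax.

```asm
; Hypothetical macro form of the byte scanner, meant for inlining.
; ptrReg : 32-bit register (not eax) holding the address of a
;          zero-terminated string. It is left unchanged.
; Result : eax = string length. Flags are modified.
SLEN_INLINE MACRO ptrReg:REQ
    LOCAL next_byte                 ; unique label for every expansion
    mov     eax, ptrReg
    sub     eax, 1                  ; start one byte before the string
  next_byte:
    add     eax, 1
    cmp     BYTE PTR [eax], 0       ; terminator reached ?
    jne     next_byte
    sub     eax, ptrReg             ; end address - start address = length
ENDM

; usage, assuming esi points to a zero-terminated string:
;       SLEN_INLINE esi
```

Because there is no call, no ret and no stack access, the only costs are the loop itself and the use of eax.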
Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 05:33:57 PM
Well, well... so here are some brand new algos, tested on 30 and 100 byte strings 8)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

11619   cycles for 100 * Rui
3512    cycles for 100 * MB
4734    cycles for 100 * Masm32
4036    cycles for 100 * CRT
11969   cycles for 100 * Habran
4930    cycles for 100 * ShortLen (Dave)
11413   cycles for 100 * slen (Hutch)

11615   cycles for 100 * Rui
3511    cycles for 100 * MB
5915    cycles for 100 * Masm32
4034    cycles for 100 * CRT
11841   cycles for 100 * Habran
4932    cycles for 100 * ShortLen (Dave)
11412   cycles for 100 * slen (Hutch)

32727   cycles for 100 * Rui
5932    cycles for 100 * MB
14423   cycles for 100 * Masm32
13117   cycles for 100 * CRT
33202   cycles for 100 * Habran
15347   cycles for 100 * ShortLen (Dave)
32430   cycles for 100 * slen (Hutch)

32743   cycles for 100 * Rui
5515    cycles for 100 * MB
14442   cycles for 100 * Masm32
13123   cycles for 100 * CRT
33367   cycles for 100 * Habran
15324   cycles for 100 * ShortLen (Dave)
32421   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)
Title: Re: Optimizing some code
Post by: Gunther on June 11, 2014, 07:11:20 PM
The string length topic was beaten to death. Anyway, here it is again:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

8547    cycles for 100 * Rui
1234    cycles for 100 * MB
4607    cycles for 100 * Masm32
2826    cycles for 100 * CRT
6069    cycles for 100 * Habran
3855    cycles for 100 * ShortLen (Dave)
8492    cycles for 100 * slen (Hutch)

8494    cycles for 100 * Rui
1233    cycles for 100 * MB
4609    cycles for 100 * Masm32
2812    cycles for 100 * CRT
6042    cycles for 100 * Habran
5088    cycles for 100 * ShortLen (Dave)
8486    cycles for 100 * slen (Hutch)

18892   cycles for 100 * Rui
2545    cycles for 100 * MB
9483    cycles for 100 * Masm32
8950    cycles for 100 * CRT
20342   cycles for 100 * Habran
12571   cycles for 100 * ShortLen (Dave)
19062   cycles for 100 * slen (Hutch)

18737   cycles for 100 * Rui
2545    cycles for 100 * MB
9462    cycles for 100 * Masm32
7726    cycles for 100 * CRT
20296   cycles for 100 * Habran
11327   cycles for 100 * ShortLen (Dave)
19168   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---


Gunther
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 08:44:13 PM
Quote
Hutch:
This very simple algo is still one of my favourites, mainly because its small,
simple and flexible enough...

In DOS we used junk like repne scasb.
But i have put that kind of instruction in the dustbin.
Now i avoid moving the addresses.
I like to use the index (=length, size,...) and
a minimum number of registers.
In any case, for me, strlen32 or strlen32M is good.

Remember that, in many applications, we need to
remove input spaces or find CR/LF and only then
get the length. So that is more important than
a routine that only gets the length. For me, it is.
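
A minimal sketch of that idea, assuming a hypothetical routine named GetLineLen (it is not in the posted test programs): the same index loop as GetStringLenY, but it also stops at CR (13) or LF (10), so eax comes back as the length of the first line.

```asm
.386
.model flat, stdcall
option casemap:none

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
; GetLineLen (hypothetical name)
; in : pStr = address of a zero-terminated string
; out: eax  = number of bytes before the first CR, LF or zero byte
GetLineLen          proc        pStr:DWORD
                    mov         edx, [esp+4]            ;pStr
                    mov         eax, -1                 ;index, first add makes it 0
            @@:     add         eax, 1
                    movzx       ecx, byte ptr [edx+eax]
                    cmp         ecx, 13                 ;CR ?
                    je          short line_end
                    cmp         ecx, 10                 ;LF ?
                    je          short line_end
                    or          ecx, ecx                ;zero terminator ?
                    jnz         short @B
        line_end:   ret         4
GetLineLen          endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

end
```

Only the index in eax moves; edx keeps the start address, so the caller can use edx+eax directly as the end of the line.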

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

12473   cycles for 100 * Rui
3762    cycles for 100 * MB
12026   cycles for 100 * Masm32
5466    cycles for 100 * CRT
14959   cycles for 100 * Habran
11102   cycles for 100 * ShortLen (Dave)
12101   cycles for 100 * slen (Hutch)

12692   cycles for 100 * Rui
3699    cycles for 100 * MB
11994   cycles for 100 * Masm32
5172    cycles for 100 * CRT
15321   cycles for 100 * Habran
11575   cycles for 100 * ShortLen (Dave)
12105   cycles for 100 * slen (Hutch)

28829   cycles for 100 * Rui
7346    cycles for 100 * MB
26394   cycles for 100 * Masm32
15833   cycles for 100 * CRT
35733   cycles for 100 * Habran
27929   cycles for 100 * ShortLen (Dave)
27561   cycles for 100 * slen (Hutch)

27941   cycles for 100 * Rui
7318    cycles for 100 * MB
26319   cycles for 100 * Masm32
15981   cycles for 100 * CRT
36253   cycles for 100 * Habran
24471   cycles for 100 * ShortLen (Dave)
27507   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---
Title: Re: Optimizing some code
Post by: nidud on June 11, 2014, 08:48:19 PM
deleted
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 08:59:22 PM
Jochen,
         Sorry, i think that i misunderstood your question.
         HERE are strlen32, strlen32M


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32    proc        pBuf:DWORD
            push        ebx           
            mov         ecx, [esp+8]           ; get pointer to string
            mov         eax, ecx               ; copy pointer
            and         ecx, 3                 ; lower 2 bits of address, check alignment
            jz          L2                     ; string is aligned by 4. Go to loop           
            and         eax, -4                ; align pointer by 4
            mov         ebx, [eax]             ; read from nearest preceding boundary
            shl         ecx, 3                 ; mul by 8 = displacement in bits
            mov         edx, -1
            shl         edx, cl                ; make byte mask
            not         edx                    ; mask = 0FFH for false bytes
            or          ebx, edx               ; mask out false bytes
            ; check first four bytes for zero
            ;-----------------------------------
            lea         ecx, [ebx-01010101H]   ; subtract 1 from each byte
            not         ebx                    ; invert all bytes
            and         ecx, ebx               ; and these two
            and         ecx, 80808080H         ; test all sign bits
            jnz         L3                     ; zero-byte found       
            ; Main loop, read 4 bytes aligned
            ;-----------------------------------
    L1:     add         eax, 4                 ; increment pointer by 4
    L2:     mov         ebx, [eax]             ; read 4 bytes of string
            lea         ecx, [ebx-01010101H]   ; subtract 1 from each byte
            not         ebx                    ; invert all bytes
            and         ecx, ebx               ; and these two
            and         ecx, 80808080H         ; test all sign bits
            jz          L1                     ; no zero bytes, continue loop       
    L3:     bsf         ecx, ecx               ; find right-most 1-bit
            shr         ecx, 3                 ; divide by 8 = byte index           
            sub         eax, [esp+8]           ; subtract start address
            add         eax, ecx               ; add byte index to get the length
            pop         ebx
            ret         4
strlen32    endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;******************************************************************************
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32M       proc        pBuf:DWORD           
                mov         ecx, [esp+4]           ; get pointer to string
                mov         eax, ecx               ; copy pointer
                and         ecx, 3                 ; lower 2 bits of address, check alignment
                jz          L2                     ; string is aligned by 4. Go to loop           
                and         eax, -4                ; align pointer by 4               
                shl         ecx, 3                 ; mul by 8 = displacement in bits
                mov         edx, -1
                shl         edx, cl                ; make byte mask
                not         edx                    ; mask = 0FFH for false bytes
                or          edx, [eax]             ; read from nearest preceding boundary         
                ; check first four bytes for zero
                ;-----------------------------------
                lea         ecx, [edx-01010101H]   ; subtract 1 from each byte
                not         edx                    ; invert all bytes
                and         ecx, edx               ; and these two
                and         ecx, 80808080H         ; test all sign bits
                jnz         L3                     ; zero-byte found       
                ; Main loop, read 4 bytes aligned
                ;-----------------------------------
        L1:     add         eax, 4                 ; increment pointer by 4
        L2:     mov         edx, [eax]             ; read 4 bytes of string
                lea         ecx, [edx-01010101H]   ; subtract 1 from each byte
                not         edx                    ; invert all bytes
                and         ecx, edx               ; and these two
                and         ecx, 80808080H         ; test all sign bits
                jz          L1                     ; no zero bytes, continue loop       
        L3:     bsf         ecx, ecx               ; find right-most 1-bit
                shr         ecx, 3                 ; divide by 8 = byte index           
                sub         eax, [esp+4]           ; subtract start address
                add         eax, ecx               ; add byte index to get the length
                ret         4
strlen32M       endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
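For readers who want to check the bit trick outside the assembler, here is a minimal C sketch of what both procs do. It is not Agner Fog's code, just an illustration under stated assumptions: the names zero_byte_mask, load_le32 and strlen32_sketch are made up here, the dword is assembled byte by byte so the result matches a little-endian load, and a plain loop stands in for bsf. Like the asm, it reads whole aligned dwords, so up to 3 bytes before the string and up to 3 after the terminator get touched. In asm that is safe because an aligned dword never crosses a page; in strict C it is only valid when the string sits inside a larger buffer with that slack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero iff at least one byte of w is 00h. The lowest set flag bit
   (bit 7, 15, 23 or 31) belongs to the first zero byte; flags above it
   may be false positives caused by the borrow. */
static uint32_t zero_byte_mask(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Build the dword the way a little-endian "mov reg, [mem]" would see it. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical C rendering of the strlen32 / strlen32M logic.
   Assumption: the aligned dwords covering the string are readable. */
size_t strlen32_sketch(const char *s)
{
    assert(s != NULL);
    const unsigned char *start = (const unsigned char *)s;
    unsigned mis = (unsigned)((uintptr_t)start & 3u);   /* and ecx, 3  */
    const unsigned char *p = start - mis;               /* and eax, -4 */
    uint32_t w = load_le32(p);

    if (mis != 0)
        w |= ~(0xFFFFFFFFu << (8u * mis));   /* bytes before the string become 0FFh */

    uint32_t m = zero_byte_mask(w);
    while (m == 0) {                         /* main loop, one aligned dword per step */
        p += 4;
        m = zero_byte_mask(load_le32(p));
    }

    unsigned idx = 0;                        /* portable stand-in for bsf + shr 3 */
    while ((m & 0x80u) == 0) {
        m >>= 8;
        idx++;
    }
    return (size_t)((p + idx) - start);
}
```

The OR with the byte mask is the important step for unaligned strings: whatever lies in front of the string in the first dword, even a 00h, is forced to 0FFh and can no longer be mistaken for the terminator.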
Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 09:15:57 PM
Quote from: RuiLoureiro on June 11, 2014, 08:59:22 PM
Jochen,
         Sorry, I think I misunderstood your question.
         Here are strlen32 and strlen32M:

OK, integrated in attached testbed. They are fast indeed.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

11620   cycles for 100 * Rui
3512    cycles for 100 * MB
4338    cycles for 100 * strlen32
4755    cycles for 100 * strlen32M
11842   cycles for 100 * Habran
4932    cycles for 100 * ShortLen (Dave)
11417   cycles for 100 * slen (Hutch)

11628   cycles for 100 * Rui
3514    cycles for 100 * MB
4331    cycles for 100 * strlen32
4736    cycles for 100 * strlen32M
11842   cycles for 100 * Habran
4934    cycles for 100 * ShortLen (Dave)
11425   cycles for 100 * slen (Hutch)

32747   cycles for 100 * Rui
5516    cycles for 100 * MB
11223   cycles for 100 * strlen32
13623   cycles for 100 * strlen32M
33197   cycles for 100 * Habran
15323   cycles for 100 * ShortLen (Dave)
32441   cycles for 100 * slen (Hutch)

32974   cycles for 100 * Rui
5515    cycles for 100 * MB
11256   cycles for 100 * strlen32
13629   cycles for 100 * strlen32M
33159   cycles for 100 * Habran
15327   cycles for 100 * ShortLen (Dave)
32436   cycles for 100 * slen (Hutch)

Quote from: nidud on June 11, 2014, 08:48:19 PM
Dave's function fails (3, 700)?:

Change
ShortLen PROC _lpStr:LPSTR
to
ShortLen PROC ; _lpStr:LPSTR
Title: Re: Optimizing some code
Post by: nidud on June 11, 2014, 09:26:44 PM
deleted
Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 09:32:34 PM
Quote from: nidud on June 11, 2014, 09:26:44 PM
fast  :t
...
561   cycles - JJ

Too slow for my taste, but changing
   void len(string)
to
   void Len(string)
helps a lot ;-)
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 09:35:28 PM
You need to write a simple algo to sort it, Jochen.
Where is your algo ? I bet it uses rep scasb !

This table is sorted.
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

3704    cycles for 100 * MB
5093    cycles for 100 * strlen32M
5349    cycles for 100 * strlen32

11087   cycles for 100 * ShortLen (Dave)
12093   cycles for 100 * slen (Hutch)
12549   cycles for 100 * Rui
15481   cycles for 100 * Habran

3742   cycles for 100 * MB
5332   cycles for 100 * strlen32
5462   cycles for 100 * strlen32M

11163   cycles for 100 * ShortLen (Dave)
12060   cycles for 100 * slen (Hutch)
12775   cycles for 100 * Rui
14858   cycles for 100 * Habran

7334   cycles for 100 * MB
17334   cycles for 100 * strlen32M
17613   cycles for 100 * strlen32

27930   cycles for 100 * slen (Hutch)
28026   cycles for 100 * ShortLen (Dave)
28658   cycles for 100 * Rui
43521   cycles for 100 * Habran

7344   cycles for 100 * MB
17316   cycles for 100 * strlen32M
17779   cycles for 100 * strlen32

24022   cycles for 100 * ShortLen (Dave)
27805   cycles for 100 * slen (Hutch)
28074   cycles for 100 * Rui
38015   cycles for 100 * Habran

100     = eax Rui
100     = eax MB
100     = eax strlen32
100     = eax strlen32M
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 09:39:49 PM
 :biggrin:
We have optimizations and ... optimizations !
For all tastes ! (we see it not only here)

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
2516    cycles - std
1128    cycles - Rui
1297    cycles - habran
1048    cycles - Dave
1443    cycles - Hutch
1200    cycles - JJ
839     cycles - Rui32
841     cycles - Rui32M

13747   cycles - std
7607    cycles - Rui
9441    cycles - habran
6250    cycles - Dave
7470    cycles - Hutch
7079    cycles - JJ
4154    cycles - Rui32
4137    cycles - Rui32M

--- ok ---
Title: Re: Optimizing some code
Post by: nidud on June 11, 2014, 10:00:55 PM
deleted
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 11:07:08 PM

Quote
yep, that was better
What ? Did you see it ? Where is it ? Show us.
I have never seen or read any proc written by Jochen.
We should compare only comparable things.


This is what we may see:
for my taste this is junk
Quote
align 16
TestA_s:
NameA equ Rui   ; assign a descriptive name here
TestA proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   push offset somestring
   call GetStringLenY   ; Rui
   dec ebx
  .Until Sign?
  ret
TestA endp
TestA_endp:
; ---------------------------------------------------
align 16
TestB_s:
NameB equ MB   ; assign a descriptive name here
TestB proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   void Len(offset somestring)
   dec ebx
  .Until Sign?
  ret
TestB endp
TestB_endp:
align 16
TestC_s:
; ----------------------------------------------------
; useC=0      ; uncomment to exclude TestC
NameC equ <strlen32>   ; assign a descriptive name here
TestC proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   invoke strlen32, offset somestring
   dec ebx
  .Until Sign?
  ret
TestC endp
TestC_endp:
Title: Re: Optimizing some code
Post by: dedndave on June 11, 2014, 11:09:02 PM
Quote from: nidud on June 11, 2014, 08:48:19 PM
Dave's function fails (3, 700)?:

0000000A != 3 - STRLEN error
000003BC != 700 - STRLEN error

i don't understand what that means
i'll fix it if you like
Title: Re: Optimizing some code
Post by: dedndave on June 11, 2014, 11:14:33 PM
Quote from: RuiLoureiro on June 11, 2014, 11:07:08 PM

Quote
yep, that was better
What ? Did you see it ? Where is it ? Show us.
We should compare only comparable things.

the test depends on what you're after
most strings are not aligned
well - BSTR's are - and strings that are inside structures probably are
otherwise... you have to devise a test that tests all alignments

just as an example, i attached a test to look at

1) select a single core, and wait 750 ms to bind before testing
2) select loop counts that yield ~0.5 seconds per pass
3) all alignments are tested
(notice that each string is differently aligned)
4) 16 strings are tested - the overall result is divided by 16 and rounded to nearest
5) you get fewer outliers if you open a console window and type the program name than if you click on it
6) the test should show the processor (this one does not)   :P
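
A rough C sketch of points 3 and 4 from the list above, for anyone who wants to try the idea without the attachment. It is not the attached test, only an illustration with assumed names (avg_rounded, count_correct_alignments, ticks_per_alignment, and the constants are all made up here). The string is copied to 16 consecutive start addresses, which covers every alignment mod 16 no matter where the pool itself lands, and the summed result is divided by 16 and rounded to nearest. The untimed warm-up call is an extra of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define NUM_ALIGN 16   /* 16 consecutive start addresses cover every residue mod 16 */
#define PAD       16   /* slack on both sides for routines that read whole words */
#define POOL_SIZE 512

typedef size_t (*strlen_fn)(const char *);

/* total / n rounded to nearest (halves round up). */
unsigned long avg_rounded(unsigned long total, unsigned long n)
{
    assert(n > 0);
    return (total + n / 2) / n;
}

/* Zero the pool and copy text so that it starts at pool + PAD + off. */
static char *place(char *pool, size_t off, const char *text, size_t len)
{
    memset(pool, 0, POOL_SIZE);
    memcpy(pool + PAD + off, text, len);
    return pool + PAD + off;
}

/* How many of the 16 alignments return the right length (16 = all good). */
int count_correct_alignments(strlen_fn fn, const char *text)
{
    char pool[POOL_SIZE];
    size_t len = strlen(text);
    int ok = 0;
    assert(len + NUM_ALIGN + 2 * PAD <= POOL_SIZE);
    for (size_t off = 0; off < NUM_ALIGN; off++)
        if (fn(place(pool, off, text, len)) == len)
            ok++;
    return ok;
}

/* Average clock() ticks per alignment, `loops` calls at each one. */
unsigned long ticks_per_alignment(strlen_fn fn, const char *text, unsigned long loops)
{
    char pool[POOL_SIZE];
    size_t len = strlen(text);
    volatile size_t sink = 0;      /* keeps the results "used" */
    unsigned long total = 0;
    assert(len + NUM_ALIGN + 2 * PAD <= POOL_SIZE);
    for (size_t off = 0; off < NUM_ALIGN; off++) {
        const char *s = place(pool, off, text, len);
        sink += fn(s);             /* untimed warm-up call */
        clock_t t0 = clock();
        for (unsigned long i = 0; i < loops; i++)
            sink += fn(s);
        total += (unsigned long)(clock() - t0);
    }
    return avg_rounded(total, NUM_ALIGN);
}
```

Something like count_correct_alignments(strlen, "some text") should return 16 for a correct routine. Note that clock() reports processor time, not cycles, and an optimizing compiler may hoist a call it knows to be pure out of the timing loop, so treat the numbers as a rough guide. Points 1, 2 and 5 (single core, ~0.5 second passes, running from a console) are not covered here.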
Title: Re: Optimizing some code
Post by: Gunther on June 11, 2014, 11:33:41 PM
Results for Jochen:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

9731    cycles for 100 * Rui
1146    cycles for 100 * MB
3871    cycles for 100 * strlen32
2730    cycles for 100 * strlen32M
8707    cycles for 100 * Habran
5088    cycles for 100 * ShortLen (Dave)
5891    cycles for 100 * slen (Hutch)

8489    cycles for 100 * Rui
2387    cycles for 100 * MB
3879    cycles for 100 * strlen32
2722    cycles for 100 * strlen32M
8698    cycles for 100 * Habran
5101    cycles for 100 * ShortLen (Dave)
5866    cycles for 100 * slen (Hutch)

18823   cycles for 100 * Rui
2029    cycles for 100 * MB
7357    cycles for 100 * strlen32
7455    cycles for 100 * strlen32M
20309   cycles for 100 * Habran
12587   cycles for 100 * ShortLen (Dave)
17441   cycles for 100 * slen (Hutch)

20010   cycles for 100 * Rui
2033    cycles for 100 * MB
7387    cycles for 100 * strlen32
7442    cycles for 100 * strlen32M
19070   cycles for 100 * Habran
12663   cycles for 100 * ShortLen (Dave)
18717   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax strlen32
100     = eax strlen32M
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---


Results for Rui:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
920     cycles - std
775     cycles - Rui
562     cycles - habran
378     cycles - Dave
544     cycles - Hutch
406     cycles - JJ
332     cycles - Rui32
246     cycles - Rui32M
209     cycles - JJ2

14480   cycles - std
10105   cycles - Rui
14099   cycles - habran
8649    cycles - Dave
9680    cycles - Hutch
7488    cycles - JJ
4513    cycles - Rui32
4634    cycles - Rui32M
1162    cycles - JJ2

--- ok ---


Gunther
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 11:35:46 PM
Dave,
          he must show the proc.
          It must be clear; otherwise I give him 0.
          As in school, he must explain all bits.
          This is my rule:
          I don't accept any results unless
          you show your exercise.
Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 12:14:01 AM
Rui,

Quote from: RuiLoureiro on June 11, 2014, 11:35:46 PM
Dave,
          must show the proc.
          it must be clear. otherwise i give him 0.
          if in the school, he must explain all bits.
          This is my rule:
          i don't accept any results unless
          you show your exercise.
but slen.zip contains the source. What's your point?

Gunther
Title: Re: Optimizing some code
Post by: nidud on June 12, 2014, 12:23:04 AM
deleted
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 12, 2014, 12:40:12 AM
Gunther,
              what are you talking about ?
              Could you post here the proc
              that i am talking about ?
Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 12:42:41 AM
Rui,

no offense. But the zip archive under post #30 contains the source. Or do you mean another procedure?

Gunther
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 12, 2014, 12:46:41 AM
No, I am not talking about it.
Do you know this:
This is the place to post assembler algorithms and code design for discussion, optimisation and any other...
You may do any tests you want with the code
I posted. I want to do the same. That's the point.
Read my reply 10.

nidud,
          it is not correct to call them "Rui32" and "Rui32M";
          they should be called AgnerFog.

:biggrin: :biggrin:
EDIT: Gunther,
                      Could I have a discussion with you about "gambuzinos" ?
("gambuzino" is a creature that no one has ever seen and no one has ever caught. Sometimes we tell someone: go hunt "gambuzinos".)
Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 04:57:59 AM
Rui,

Quote from: RuiLoureiro on June 12, 2014, 12:46:41 AM
EDIT: Gunther,
                      Could I have a discussion with you about "gambuzinos" ?
("gambuzino" is a creature that no one has ever seen and no one has ever caught. Sometimes we tell someone: go hunt "gambuzinos".)

I've read your post #10. I think that I'm talking about algorithms, I'm posting test results (not only in your thread), I'm not talking about gambuzinos, Yetis and other impossibilities.

But anyway, it's your thread. My apologies, I won't post in your threads in the future.

Gunther
Title: Re: Optimizing some code
Post by: RuiLoureiro on June 12, 2014, 05:53:36 AM
Hi
What did you do wrong, Gunther ?
I never saw anything wrong. It's clear.
My apologies.
Title: Re: Optimizing some code
Post by: dedndave on June 12, 2014, 09:36:12 AM
fixed my routine - and added ShowCpu   :P
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
134 135 135 135 134
141 141 141 140 141
Title: Re: Optimizing some code
Post by: FORTRANS on June 12, 2014, 11:23:18 PM
Hi Dave,

   Here are some results.

Pre-Pentium4 (SSE1)
106 106 106 106 106
104 104 104 104 104
Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
103 104 104 104 103
111 111 111 111 111
Press any key to continue ...


HTH,

Steve N.

Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 11:45:57 PM
Dave,

results from slen2.exe:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
55 51 52 51 51
77 77 77 77 77
Press any key to continue ...


Gunther
Title: Re: Optimizing some code
Post by: nidud on June 15, 2014, 01:28:58 AM
deleted
Title: Re: Optimizing some code
Post by: dedndave on June 15, 2014, 01:33:34 AM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
79459   cycles - 0: standard (scasb)
7603    cycles - 1: AgnerFog
7785    cycles - 2: AgnerFog (unaligned)
9866    cycles - 3: Dave

16232   cycles - 0: standard (scasb)
7615    cycles - 1: AgnerFog
9145    cycles - 2: AgnerFog (unaligned)
10294   cycles - 3: Dave

16081   cycles - 0: standard (scasb)
7759    cycles - 1: AgnerFog
7763    cycles - 2: AgnerFog (unaligned)
10762   cycles - 3: Dave


you might want to increase the loop counts for better stability
Title: Re: Optimizing some code
Post by: Gunther on June 15, 2014, 02:18:58 AM
Hi nidud,

results for strlen4:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
18212   cycles - 0: standard (scasb)
1871    cycles - 1: AgnerFog
1962    cycles - 2: AgnerFog (unaligned)
2789    cycles - 3: Dave

11414   cycles - 0: standard (scasb)
4372    cycles - 1: AgnerFog
4257    cycles - 2: AgnerFog (unaligned)
6608    cycles - 3: Dave

17074   cycles - 0: standard (scasb)
4358    cycles - 1: AgnerFog
4204    cycles - 2: AgnerFog (unaligned)
6618    cycles - 3: Dave

--- ok ---


Gunther
Title: Re: Optimizing some code
Post by: FORTRANS on June 15, 2014, 05:09:47 AM
Hi,

   The first time it is run, it is slow on the first test.

Regards,

Steve N.

First run.

pre-P4 (SSE1)
------------------------------------------------------
315620  cycles - 0: standard (scasb)
5001    cycles - 1: AgnerFog
4967    cycles - 2: AgnerFog (unaligned)
6741    cycles - 3: Dave

11665   cycles - 0: standard (scasb)
4997    cycles - 1: AgnerFog
4975    cycles - 2: AgnerFog (unaligned)
6764    cycles - 3: Dave

11684   cycles - 0: standard (scasb)
4993    cycles - 1: AgnerFog
4967    cycles - 2: AgnerFog (unaligned)
6780    cycles - 3: Dave

--- ok ---

Second run

pre-P4 (SSE1)
------------------------------------------------------
11813   cycles - 0: standard (scasb)
4992    cycles - 1: AgnerFog
4968    cycles - 2: AgnerFog (unaligned)
6755    cycles - 3: Dave

11679   cycles - 0: standard (scasb)
4985    cycles - 1: AgnerFog
4966    cycles - 2: AgnerFog (unaligned)
6768    cycles - 3: Dave

11696   cycles - 0: standard (scasb)
4993    cycles - 1: AgnerFog
4977    cycles - 2: AgnerFog (unaligned)
6758    cycles - 3: Dave

--- ok ---


Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
296692 cycles - 0: standard (scasb)
4193 cycles - 1: AgnerFog
4078 cycles - 2: AgnerFog (unaligned)
6334 cycles - 3: Dave

11671 cycles - 0: standard (scasb)
4148 cycles - 1: AgnerFog
4064 cycles - 2: AgnerFog (unaligned)
6268 cycles - 3: Dave

11675 cycles - 0: standard (scasb)
4209 cycles - 1: AgnerFog
4092 cycles - 2: AgnerFog (unaligned)
6357 cycles - 3: Dave

--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
11809 cycles - 0: standard (scasb)
4187 cycles - 1: AgnerFog
4093 cycles - 2: AgnerFog (unaligned)
6219 cycles - 3: Dave

11793 cycles - 0: standard (scasb)
4147 cycles - 1: AgnerFog
4074 cycles - 2: AgnerFog (unaligned)
6100 cycles - 3: Dave

11810 cycles - 0: standard (scasb)
4211 cycles - 1: AgnerFog
4092 cycles - 2: AgnerFog (unaligned)
6225 cycles - 3: Dave

--- ok ---
Title: Re: Optimizing some code
Post by: LarryC on June 15, 2014, 05:43:47 AM

Intel(R) Core(TM) i7 CPU         960  @ 3.20GHz (SSE4)
------------------------------------------------------
10025   cycles - 0: standard (scasb)
6427    cycles - 1: AgnerFog
6444    cycles - 2: AgnerFog (unaligned)
11482   cycles - 3: Dave

15319   cycles - 0: standard (scasb)
7873    cycles - 1: AgnerFog
6176    cycles - 2: AgnerFog (unaligned)
11058   cycles - 3: Dave

14978   cycles - 0: standard (scasb)
7816    cycles - 1: AgnerFog
6251    cycles - 2: AgnerFog (unaligned)
10995   cycles - 3: Dave

--- ok ---
Title: Re: Optimizing some code
Post by: nidud on June 15, 2014, 11:29:32 PM
deleted
Title: Re: Optimizing some code
Post by: Gunther on June 15, 2014, 11:34:59 PM
Hi nidud,

strlen5:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
22200   cycles - 0: standard (scasb)
10776   cycles - 3: Dave
10271   cycles - 5: MB - len()
7120    cycles - 1: AgnerFog
7264    cycles - 2: AgnerFog (unaligned)
3007    cycles - 6: MB - Len() SSE
2280    cycles - 4: unaligned SSE2

21590   cycles - 0: standard (scasb)
10616   cycles - 3: Dave
10136   cycles - 5: MB - len()
7059    cycles - 1: AgnerFog
17323   cycles - 2: AgnerFog (unaligned)
7226    cycles - 6: MB - Len() SSE
5413    cycles - 4: unaligned SSE2

52253   cycles - 0: standard (scasb)
25722   cycles - 3: Dave
24451   cycles - 5: MB - len()
17339   cycles - 1: AgnerFog
17349   cycles - 2: AgnerFog (unaligned)
7205    cycles - 6: MB - Len() SSE
6116    cycles - 4: unaligned SSE2

--- ok ---


Gunther
Title: Re: Optimizing some code
Post by: nidud on June 16, 2014, 01:29:53 AM
deleted
Title: Re: Optimizing some code
Post by: dedndave on June 16, 2014, 02:38:24 AM
strlen5
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
86037   cycles - 0: standard (scasb)
31180   cycles - 3: Dave
33575   cycles - 5: MB - len()
23079   cycles - 1: AgnerFog
25595   cycles - 2: AgnerFog (unaligned)
21374   cycles - 6: MB - Len() SSE
18166   cycles - 4: unaligned SSE2

49577   cycles - 0: standard (scasb)
31080   cycles - 3: Dave
32727   cycles - 5: MB - len()
23139   cycles - 1: AgnerFog
25405   cycles - 2: AgnerFog (unaligned)
21643   cycles - 6: MB - Len() SSE
18152   cycles - 4: unaligned SSE2

49638   cycles - 0: standard (scasb)
31000   cycles - 3: Dave
32762   cycles - 5: MB - len()
23151   cycles - 1: AgnerFog
31292   cycles - 2: AgnerFog (unaligned)
21172   cycles - 6: MB - Len() SSE
18204   cycles - 4: unaligned SSE2