
Title: Optimizing some code
Post by: RuiLoureiro on June 10, 2014, 06:54:45 PM

Hi all,
I found these 4 procedures to get the string length:

strlen32 <- Author: Agner Fog
strlen32M <- is strlen32 modified by RuiLoureiro
GetStringLenX <- RuiLoureiro
szLen <- MASM

where the best one is just this: ;)

Code Select


GetStringLenX       proc        pStr:DWORD
                    mov         edx, [esp+4]    ;pStr
                    xor         eax, eax
                    jz          short @F                    
       _begin0:     add         eax, 1                    
            @@:     movzx       ecx, byte ptr [edx+eax]
                    or          ecx, ecx
                    jnz         short _begin0
                    ret         4
GetStringLenX       endp
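
For readers more at home in C, the indexed loop above corresponds roughly to the sketch below. The name `get_string_len_x` is made up for illustration and is not part of the test program; the comments map each line to the instruction it stands for.

```c
#include <stddef.h>

/* Hypothetical C rendering of the indexed byte loop: the base pointer
   stays fixed and only the index (which is also the result) moves. */
size_t get_string_len_x(const char *str)
{
    size_t i = 0;               /* xor eax, eax */
    while (str[i] != '\0')      /* movzx ecx, byte ptr [edx+eax] / or ecx, ecx */
        i++;                    /* add eax, 1 */
    return i;
}
```

A compiler may well generate different instructions from this, so it only shows the logic, not the timing.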

I did some tests and got the following results
for all 4 cases:
(You should draw your own conclusions)

Could you run TestString32.exe and post the results
here?
Thanks

note I: Thanks Dave for that one.
note II: Jochen, it seems that it is not for you
because...
CASE 1
------------------------------------------------
the addresses are aligned
------------------------------------------------

Quote
...
-------------- START ----------------
...
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16

15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
34 milliseconds, szLen - _string02- length=32
35 milliseconds, GetStringLenX - _string02- length=32

37 milliseconds, strlen32M - _string03- length=64
39 milliseconds, strlen32 - _string03- length=64
53 milliseconds, szLen - _string03- length=64
57 milliseconds, GetStringLenX - _string03- length=64
********** END **********

CASE 2
---------------------------------------------------
the addresses are misaligned by 1 byte
---------------------------------------------------

Quote
...
-------------- START ----------------
...
***** Time table *****

10 milliseconds, strlen32M - _string01- length=16
12 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16
15 milliseconds, strlen32 - _string02- length=32
15 milliseconds, strlen32M - _string02- length=32

16 milliseconds, strlen32 - _string01- length=16
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64

34 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenX - _string02- length=32
53 milliseconds, szLen - _string03- length=64
56 milliseconds, GetStringLenX - _string03- length=64
********** END **********

CASE 3
-----------------------------------------------------
the addresses are misaligned by 2 bytes
-----------------------------------------------------

Quote
...
-------------- START ----------------
...
***** Time table *****
10 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, szLen - _string01- length=16
12 milliseconds, GetStringLenX - _string01- length=16

15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64

34 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenX - _string02- length=32
53 milliseconds, szLen - _string03- length=64
65 milliseconds, GetStringLenX - _string03- length=64
********** END **********

CASE 4
----------------------------------------------------
the addresses are misaligned by 3 bytes
----------------------------------------------------

Quote
...
-------------- START ----------------
...
***** Time table *****

10 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16

15 milliseconds, strlen32 - _string02- length=32
16 milliseconds, strlen32M - _string02- length=32

25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64

34 milliseconds, szLen - _string02- length=32
39 milliseconds, GetStringLenX - _string02- length=32
56 milliseconds, szLen - _string03- length=64
56 milliseconds, GetStringLenX - _string03- length=64
********** END **********

FOR 128 BYTES

Quote
...
-------------- START ----------------
...
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
13 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16

15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64

34 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenX - _string02- length=32
53 milliseconds, szLen - _string03- length=64
56 milliseconds, GetStringLenX - _string03- length=64

58 milliseconds, strlen32M - _string04- length=128
67 milliseconds, strlen32 - _string04- length=128
92 milliseconds, szLen - _string04- length=128
98 milliseconds, GetStringLenX - _string04- length=128
********** END **********

Title: Re: Optimizing some code
Post by: hutch-- on June 10, 2014, 08:05:24 PM

Hi Rui,

I moved the thread so it would be seen by the algo folks.

Here are my timings on my 3 GHz Core2 Quad.

Code Select


*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
 -------------- START ----------------
7 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
12 milliseconds, GetStringLenX, _string01- length=16
7 milliseconds, szLen, _string01- length=16
10 milliseconds, strlen32, _string02 - length=32
9 milliseconds, strlen32M, _string02 - length=32
22 milliseconds, GetStringLenX, _string02- length=32
12 milliseconds, szLen, _string02- length=32
18 milliseconds, strlen32, _string03 - length=64
18 milliseconds, strlen32M, _string03 - length=64
42 milliseconds, GetStringLenX, _string03- length=64
23 milliseconds, szLen, _string03- length=64
32 milliseconds, strlen32, _string04 - length=128
33 milliseconds, strlen32M, _string04 - length=128
64 milliseconds, GetStringLenX, _string04- length=128
44 milliseconds, szLen, _string04- length=128
 *** Press any key to get the time table ***

 ***** Time table *****

6  milliseconds, strlen32M     - _string01- length=16
7  milliseconds, szLen         - _string01- length=16
7  milliseconds, strlen32      - _string01- length=16
9  milliseconds, strlen32M     - _string02- length=32
10  milliseconds, strlen32      - _string02- length=32
12  milliseconds, szLen         - _string02- length=32
12  milliseconds, GetStringLenX - _string01- length=16
18  milliseconds, strlen32      - _string03- length=64
18  milliseconds, strlen32M     - _string03- length=64
22  milliseconds, GetStringLenX - _string02- length=32
23  milliseconds, szLen         - _string03- length=64
32  milliseconds, strlen32      - _string04- length=128
33  milliseconds, strlen32M     - _string04- length=128
42  milliseconds, GetStringLenX - _string03- length=64
44  milliseconds, szLen         - _string04- length=128
64  milliseconds, GetStringLenX - _string04- length=128
 ********** END **********

Title: Re: Optimizing some code
Post by: cpu2 on June 10, 2014, 09:41:51 PM

I have an implementation for x64 using SSE2 extensions, if you are interested.

Translating it to x86 is very easy.

Regards.

Title: Re: Optimizing some code
Post by: jj2007 on June 10, 2014, 09:46:52 PM

Quote from: RuiLoureiro on June 10, 2014, 06:54:45 PM
note II: Jochen, it seems that it is not for you
because...

::) :(

Code Select

AMD Athlon(tm) Dual Core Processor 4450B (MMX, SSE, SSE2, SSE3)
 -------------- START ----------------
14 milliseconds, strlen32, _string01- length=16
11 milliseconds, strlen32M, _string01- length=16
24 milliseconds, GetStringLenX, _string01- length=16
10 milliseconds, szLen, _string01- length=16
19 milliseconds, strlen32, _string02 - length=32
16 milliseconds, strlen32M, _string02 - length=32
37 milliseconds, GetStringLenX, _string02- length=32
17 milliseconds, szLen, _string02- length=32
34 milliseconds, strlen32, _string03 - length=64
33 milliseconds, strlen32M, _string03 - length=64
66 milliseconds, GetStringLenX, _string03- length=64
37 milliseconds, szLen, _string03- length=64
56 milliseconds, strlen32, _string04 - length=128
55 milliseconds, strlen32M, _string04 - length=128
122 milliseconds, GetStringLenX, _string04- length=128
66 milliseconds, szLen, _string04- length=128
 *** Press any key to get the time table ***

 ***** Time table *****

10  milliseconds, szLen         - _string01- length=16
11  milliseconds, strlen32M     - _string01- length=16
14  milliseconds, strlen32      - _string01- length=16
16  milliseconds, strlen32M     - _string02- length=32
17  milliseconds, szLen         - _string02- length=32
19  milliseconds, strlen32      - _string02- length=32
24  milliseconds, GetStringLenX - _string01- length=16
33  milliseconds, strlen32M     - _string03- length=64
34  milliseconds, strlen32      - _string03- length=64
37  milliseconds, szLen         - _string03- length=64
37  milliseconds, GetStringLenX - _string02- length=32
55  milliseconds, strlen32M     - _string04- length=128
56  milliseconds, strlen32      - _string04- length=128
66  milliseconds, szLen         - _string04- length=128
66  milliseconds, GetStringLenX - _string03- length=64
122  milliseconds, GetStringLenX - _string04- length=128
 ********** END **********

Title: Re: Optimizing some code
Post by: dedndave on June 10, 2014, 11:01:42 PM

if you want to see the other algorithms perform, try longer strings
many of them work well on strings of, say, 1000 bytes

that seems a little impractical to me, but i can see cases where it might be useful
generally, we think of display strings, which are shorter
but, fully qualified path names can be much longer
and, dealing with text files, you might want to access sentences, paragraphs, or even sections

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 01:42:35 AM

:biggrin:
Hi,
Thank you Hutch, Jochen and Dave

Cpu2,
for now i don't want to use MMX, SSE. Thanks.

For me, it is useful for lengths of 30 bytes (+/-)
So, 64 is good!

Now, the best one is just this
(for length=0 it is very very fast !!! ;) )

Code Select


GetStringLenY       proc        pStr:DWORD
                    mov         edx, [esp+4]                    ;pStr
                    mov         eax, -1                    
            @@:     add         eax, 1                    
                    movzx       ecx, byte ptr [edx+eax]
                    or          ecx, ecx
                    jnz         short @B                                
                    ret         4
GetStringLenY       endp

Please, could you run TestString16A.exe and TestString64.exe
and post the results here ? Only the «Time table».
Thanks

note: your computers are faster. I do optimization based
on mine.

Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, GetStringLenY - _string01- length=16
13 milliseconds, szLen - _string01- length=16

15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
35 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenY - _string02- length=32

37 milliseconds, strlen32M - _string03- length=64
40 milliseconds, strlen32 - _string03- length=64
54 milliseconds, szLen - _string03- length=64
58 milliseconds, strlen32M - _string04- length=128

58 milliseconds, GetStringLenY - _string03- length=64
58 milliseconds, strlen32 - _string04- length=128
93 milliseconds, szLen - _string04- length=128
100 milliseconds, GetStringLenY - _string04- length=128
********** END **********

Code Select


-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
 ***** Time table *****

 96 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
 99 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
122 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
124 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
127 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
127 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
157 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
169 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
179 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
 ********** END 2 **********

Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 02:12:04 AM

Quote from: RuiLoureiro on June 11, 2014, 01:42:35 AM
Now, the best one is just this
(for length=0 it is very very fast !!! ;) )

For any other length, it could be a bit faster ;-)

No sources??

Title: Re: Optimizing some code
Post by: Gunther on June 11, 2014, 03:11:54 AM

Rui,

results from TestString16A.exe:

Code Select


STRINGS:
a bcd efg hij klm nop
ab cdefg hijk lmnopA

abc de fghijkl mn op
abc defg hi jk lm nop

abc de fghijkl mn op A
abc defg hi jk lm nop

X is less than Y
 ShowResultXY
X is EQUAL Y
 ShowResultXY
X is greater than Y
 ShowResultXY
X is less than Y
 ShowResultXY
X is EQUAL Y
 ShowResultXY
X is greater than Y
 ShowResultXY
X is less than Y
 ShowResultXY
X is EQUAL Y
 ShowResultXY
X is greater than Y
 ShowResultXY
64 milliseconds, CompareStringXYS, _string01X, _string01Y
42 milliseconds, CompareStringXYS, _string02X, _string02Y
45 milliseconds, CompareStringXYS, _string03X, _string03Y
28 milliseconds, CompareStringXYT, _string01X, _string01Y
28 milliseconds, CompareStringXYT, _string02X, _string02Y
28 milliseconds, CompareStringXYT, _string03X, _string03Y
48 milliseconds, CompareStringXYBS, _string01X, _string01Y
50 milliseconds, CompareStringXYBS, _string02X, _string02Y
52 milliseconds, CompareStringXYBS, _string03X, _string03Y
 *** Press any key to get the time table ***

------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
 ***** Time table *****

28 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
28 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
28 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
42 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
45 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
48 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
50 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
52 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
64 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
 ********** END 2 **********

The results from TestString64.exe:

Code Select


*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
 -------------- START ----------------
9 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
13 milliseconds, GetStringLenY, _string01- length=16
5 milliseconds, szLen, _string01- length=16
7 milliseconds, strlen32, _string02 - length=32
5 milliseconds, strlen32M, _string02 - length=32
24 milliseconds, GetStringLenY, _string02- length=32
8 milliseconds, szLen, _string02- length=32
13 milliseconds, strlen32, _string03 - length=64
10 milliseconds, strlen32M, _string03 - length=64
35 milliseconds, GetStringLenY, _string03- length=64
16 milliseconds, szLen, _string03- length=64
28 milliseconds, strlen32, _string04 - length=128
26 milliseconds, strlen32M, _string04 - length=128
55 milliseconds, GetStringLenY, _string04- length=128
42 milliseconds, szLen, _string04- length=128
 *** Press any key to get the time table ***

------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
 ***** Time table *****

5  milliseconds, strlen32M     - _string02- length=32
5  milliseconds, szLen         - _string01- length=16
6  milliseconds, strlen32M     - _string01- length=16
7  milliseconds, strlen32      - _string02- length=32
8  milliseconds, szLen         - _string02- length=32
9  milliseconds, strlen32      - _string01- length=16
10  milliseconds, strlen32M     - _string03- length=64
13  milliseconds, strlen32      - _string03- length=64
13  milliseconds, GetStringLenY - _string01- length=16
16  milliseconds, szLen         - _string03- length=64
24  milliseconds, GetStringLenY - _string02- length=32
26  milliseconds, strlen32M     - _string04- length=128
28  milliseconds, strlen32      - _string04- length=128
35  milliseconds, GetStringLenY - _string03- length=64
42  milliseconds, szLen         - _string04- length=128
55  milliseconds, GetStringLenY - _string04- length=128
 ********** END **********

Gunther

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 03:23:27 AM

Quote from: jj2007 on June 11, 2014, 02:12:04 AM
Quote from: RuiLoureiro on June 11, 2014, 01:42:35 AM
Now, the best one is just this
(for length=0 it is very very fast !!! ;) )

For any other length, it could be a bit faster ;-)

No sources??

More sources, Jochen? You have them in the topic "Sorting strings".
No problem with this code, i post everything!
How would you improve it a bit?

Title: Re: Optimizing some code
Post by: qWord on June 11, 2014, 03:43:18 AM

check the old forum - methods to get the string length have been discussed to death.

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 04:52:35 AM

qWord,

More than getting the best or the fastest
code, i like to write the way i think, following
my own logic. Meanwhile, i try to compare the
logic of some faster procedures with the way i do it.
And i draw my own conclusions.
In this case, i read strlen32 written by Agner Fog
(i don't need to use strlen) and i wrote a modified
version and tested it in the way you know.
I posted it because it is an optimized version
of an already optimized version by Agner Fog.
About sorting strings, i want to add it to
the next version of the linked list project.
In my own projects i don't use null-terminated strings.
Of course, the calculator doesn't use them.
As we see, when the length is no more than a few
bytes i can use any version, optimized or not.
To me, complex methods to get the string length
belong in the dustbin.

EDIT: i didn't do the things like that i saw in the old forum.

Gunther,
thanks !

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 05:53:46 AM

Jochen,
where are you ? Are you sleeping ?
where is your answer ?

Title: Re: Optimizing some code
Post by: habran on June 11, 2014, 06:28:59 AM

Try this :biggrin:

Code Select


GetStringLenX PROC pStr:DWORD
    mov eax,pStr
    .while (BYTE PTR[eax])
      inc eax
    .endw
    sub eax,pStr
    ret 
GetStringLenX ENDP

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 07:25:20 AM

Your suggestion: GetStringLenZ is worse.
Taking the difference of addresses is worse.
As we can see below, using szLen
or GetStringLenY makes no difference
on my system, up to length=32 or 64.

Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, GetStringLenY - _string01- length=16
13 milliseconds, szLen - _string01- length=16

15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
35 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenY - _string02- length=32

37 milliseconds, strlen32M - _string03- length=64
40 milliseconds, strlen32 - _string03- length=64
54 milliseconds, szLen - _string03- length=64
58 milliseconds, strlen32M - _string04- length=128

58 milliseconds, GetStringLenY - _string03- length=64
58 milliseconds, strlen32 - _string04- length=128
93 milliseconds, szLen - _string04- length=128
100 milliseconds, GetStringLenY - _string04- length=128
********** END **********

Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
14 milliseconds, strlen32M - _string02- length=32
14 milliseconds, szLen - _string01- length=16
15 milliseconds, strlen32 - _string02- length=32
18 milliseconds, GetStringLenZ - _string01- length=16
34 milliseconds, szLen - _string02- length=32
37 milliseconds, strlen32M - _string03- length=64
38 milliseconds, strlen32 - _string03- length=64
40 milliseconds, GetStringLenZ - _string02- length=32
53 milliseconds, szLen - _string03- length=64
58 milliseconds, strlen32 - _string04- length=128
58 milliseconds, strlen32M - _string04- length=128
68 milliseconds, GetStringLenZ - _string03- length=64
96 milliseconds, szLen - _string04- length=128
123 milliseconds, GetStringLenZ - _string04- length=128
********** END **********

Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 08:16:59 AM

Quote from: RuiLoureiro on June 11, 2014, 05:53:46 AM
Jochen,
where are you ? Are you sleeping ?
where is your answer ?

Yes, I was sleeping, and here is my answer (for a 100 byte string):

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
23657 cycles for 100 * Rui
3942 cycles for 100 * MB
13843 cycles for 100 * Masm32
14124 cycles for 100 * CRT
23757 cycles for 100 * Habran

As qWord wrote above, we have tested it already.

Title: Re: Optimizing some code
Post by: dedndave on June 11, 2014, 11:46:09 AM

the routines we have analyzed before have typically been optimized for longer strings
for shorter strings, say 50 characters or less, those routines have too much overhead

so, i tried to keep overhead to a minimum
and - access data as 4-aligned dwords
it hasn't been tested, but maybe it will give you some ideas....

Code Select

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

ShortLen PROC _lpStr:LPSTR

        mov     edx,[esp+4]
        xor     eax,eax                 ;clear eax here so the aligned path also starts from 0
        test    dl,3
        mov     ecx,[edx]
        jz      start_dwords3

        inc     edx
        test    cl,cl
        jz      all_done

        test    dl,3
        jz      start_dwords2           ;edx already points at the next unchecked dword

        inc     eax
        inc     edx
        test    ch,ch
        jz      all_done

        test    dl,3
        jz      start_dwords2           ;edx already points at the next unchecked dword

        inc     eax
        inc     edx
        test    ecx,0FF0000h
        jz      all_done

        jmp short start_dwords2

start_dwords1:
        add     edx,4

start_dwords2:
        mov     ecx,[edx]

start_dwords3:
        test    cl,cl
        mov     al,0
        jz      end_dwords

        test    ch,ch
        mov     al,1
        jz      end_dwords

        test    ecx,0FF0000h
        mov     al,2
        jz      end_dwords

        test    ecx,0FF000000h
        mov     al,3
        jnz     start_dwords1

end_dwords:
        sub     eax,[esp+4]
        add     eax,edx

all_done:
        ret     4

ShortLen ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef
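
The idea behind the routine above (single bytes until the pointer is 4-aligned, then one aligned dword per pass) can be sketched in C as follows. This is not a translation of the posted code, only an illustration of the same approach; `short_len` is a hypothetical name, and the sketch assumes the buffer is readable up to the next 4-byte boundary after the terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: walk single bytes until the pointer is 4-byte aligned,
   then examine one aligned 32-bit word per iteration.
   Assumption: the buffer is readable up to the next 4-byte boundary
   after the terminator (fine for aligned hardware reads, but formally
   outside what the C standard promises). */
size_t short_len(const char *str)
{
    const char *p = str;

    /* Leading bytes, at most three of them. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - str);
        p++;
    }

    /* Aligned words: test the four bytes one after another. */
    for (;;) {
        unsigned char b[4];
        memcpy(b, p, 4);                 /* one aligned 4-byte read */
        if (b[0] == 0) return (size_t)(p - str);
        if (b[1] == 0) return (size_t)(p - str) + 1;
        if (b[2] == 0) return (size_t)(p - str) + 2;
        if (b[3] == 0) return (size_t)(p - str) + 3;
        p += 4;
    }
}
```

The per-word work is still four byte tests, so the gain over a plain byte loop comes mainly from fewer memory reads and fewer loop branches.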

Title: Re: Optimizing some code
Post by: hutch-- on June 11, 2014, 01:11:27 PM

This very simple algo is still one of my favourites, mainly because it's small, simple and flexible enough to inline it into the middle of a larger more complex algo. The aligned DWORD read versions will always be faster on longer linear reads but often it's of no real gain in a more complex set of tasks.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

slen proc ptxt:DWORD

mov eax, [esp+4]
sub eax, 1

lbl0:
add eax, 1
cmp BYTE PTR [eax], 0
jne lbl0

sub eax, [esp+4]

ret 4

slen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
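
As a rough C picture of the pointer-walking approach above (advance a copy of the pointer, then subtract the start address), something like the following would do. The name `slen_ptr` is hypothetical and the instruction comments are only an approximate mapping.

```c
#include <stddef.h>

/* Hypothetical C rendering of the pointer-walking loop: advance a copy of
   the pointer to the terminator, then subtract the start address. */
size_t slen_ptr(const char *text)
{
    const char *p = text;       /* mov eax, [esp+4] */
    while (*p != '\0')          /* cmp BYTE PTR [eax], 0 */
        p++;                    /* add eax, 1 */
    return (size_t)(p - text);  /* sub eax, [esp+4] */
}
```

Because the loop keeps the current address rather than an index, it drops easily into a larger routine that needs the end-of-string pointer anyway.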

Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 05:33:57 PM

Well, well... so here are some brand-new algos, tested on 30 and 100 byte strings 8)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

11619 cycles for 100 * Rui
3512 cycles for 100 * MB
4734 cycles for 100 * Masm32
4036 cycles for 100 * CRT
11969 cycles for 100 * Habran
4930 cycles for 100 * ShortLen (Dave)
11413 cycles for 100 * slen (Hutch)

11615 cycles for 100 * Rui
3511 cycles for 100 * MB
5915 cycles for 100 * Masm32
4034 cycles for 100 * CRT
11841 cycles for 100 * Habran
4932 cycles for 100 * ShortLen (Dave)
11412 cycles for 100 * slen (Hutch)

32727 cycles for 100 * Rui
5932 cycles for 100 * MB
14423 cycles for 100 * Masm32
13117 cycles for 100 * CRT
33202 cycles for 100 * Habran
15347 cycles for 100 * ShortLen (Dave)
32430 cycles for 100 * slen (Hutch)

32743 cycles for 100 * Rui
5515 cycles for 100 * MB
14442 cycles for 100 * Masm32
13123 cycles for 100 * CRT
33367 cycles for 100 * Habran
15324 cycles for 100 * ShortLen (Dave)
32421 cycles for 100 * slen (Hutch)

100 = eax Rui
100 = eax MB
100 = eax Masm32
100 = eax CRT
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)

Title: Re: Optimizing some code
Post by: Gunther on June 11, 2014, 07:11:20 PM

The string length topic was beaten to death. Anyway, here it is again:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

8547    cycles for 100 * Rui
1234    cycles for 100 * MB
4607    cycles for 100 * Masm32
2826    cycles for 100 * CRT
6069    cycles for 100 * Habran
3855    cycles for 100 * ShortLen (Dave)
8492    cycles for 100 * slen (Hutch)

8494    cycles for 100 * Rui
1233    cycles for 100 * MB
4609    cycles for 100 * Masm32
2812    cycles for 100 * CRT
6042    cycles for 100 * Habran
5088    cycles for 100 * ShortLen (Dave)
8486    cycles for 100 * slen (Hutch)

18892   cycles for 100 * Rui
2545    cycles for 100 * MB
9483    cycles for 100 * Masm32
8950    cycles for 100 * CRT
20342   cycles for 100 * Habran
12571   cycles for 100 * ShortLen (Dave)
19062   cycles for 100 * slen (Hutch)

18737   cycles for 100 * Rui
2545    cycles for 100 * MB
9462    cycles for 100 * Masm32
7726    cycles for 100 * CRT
20296   cycles for 100 * Habran
11327   cycles for 100 * ShortLen (Dave)
19168   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---

Gunther

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 08:44:13 PM

Quote
Hutch:
This very simple algo is still one of my favourites, mainly because its small,
simple and flexible enough...

In DOS we used junk like rep scasb.
But i have put that kind of instruction in the dustbin.
Now i avoid moving the addresses.
I like to use the index (=length, size,...) and
a minimum number of registers.
In any case, for me, strlen32 or strlen32M is good.

Remember that, in many applications, we need to
remove input spaces or find CR/LF, and then we
get the length. So that is more important than
a procedure that only gets the length. For me, it is.
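
To make the CR/LF point concrete, here is a minimal C sketch of a scan that stops at CR, LF or the terminator and so yields the line length in the same pass. `line_len` is a made-up helper, not something from the test programs.

```c
#include <stddef.h>

/* Hypothetical helper: length of the first line in a zero-terminated
   buffer, stopping at CR, LF or the terminator, whichever comes first. */
size_t line_len(const char *text)
{
    size_t i = 0;
    while (text[i] != '\0' && text[i] != '\r' && text[i] != '\n')
        i++;
    return i;
}
```

With three comparisons per byte instead of one, a word-at-a-time trick tuned only for the zero byte does not carry over directly.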

Code Select


Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

12473   cycles for 100 * Rui
3762    cycles for 100 * MB
12026   cycles for 100 * Masm32
5466    cycles for 100 * CRT
14959   cycles for 100 * Habran
11102   cycles for 100 * ShortLen (Dave)
12101   cycles for 100 * slen (Hutch)

12692   cycles for 100 * Rui
3699    cycles for 100 * MB
11994   cycles for 100 * Masm32
5172    cycles for 100 * CRT
15321   cycles for 100 * Habran
11575   cycles for 100 * ShortLen (Dave)
12105   cycles for 100 * slen (Hutch)

28829   cycles for 100 * Rui
7346    cycles for 100 * MB
26394   cycles for 100 * Masm32
15833   cycles for 100 * CRT
35733   cycles for 100 * Habran
27929   cycles for 100 * ShortLen (Dave)
27561   cycles for 100 * slen (Hutch)

27941   cycles for 100 * Rui
7318    cycles for 100 * MB
26319   cycles for 100 * Masm32
15981   cycles for 100 * CRT
36253   cycles for 100 * Habran
24471   cycles for 100 * ShortLen (Dave)
27507   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax Masm32
100     = eax CRT
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---

Title: Re: Optimizing some code
Post by: nidud on June 11, 2014, 08:48:19 PM

deleted

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 08:59:22 PM

Jochen,
Sorry, i think i misunderstood your question.
Here are strlen32 and strlen32M:

Code Select


OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
strlen32    proc        pBuf:DWORD
            push        ebx            
            mov         ecx, [esp+8]           ; get pointer to string
            mov         eax, ecx               ; copy pointer
            and         ecx, 3                 ; lower 2 bits of address, check alignment
            jz          L2                     ; string is aligned by 4. Go to loop            
            and         eax, -4                ; align pointer by 4
            mov         ebx, [eax]             ; read from nearest preceding boundary
            shl         ecx, 3                 ; mul by 8 = displacement in bits
            mov         edx, -1
            shl         edx, cl                ; make byte mask
            not         edx                    ; mask = 0FFH for false bytes
            or          ebx, edx               ; mask out false bytes
            ; check first four bytes for zero
            ;-----------------------------------
            lea         ecx, [ebx-01010101H]   ; subtract 1 from each byte
            not         ebx                    ; invert all bytes
            and         ecx, ebx               ; and these two
            and         ecx, 80808080H         ; test all sign bits
            jnz         L3                     ; zero-byte found        
            ; Main loop, read 4 bytes aligned
            ;-----------------------------------
    L1:     add         eax, 4                 ; increment pointer by 4
    L2:     mov         ebx, [eax]             ; read 4 bytes of string
            lea         ecx, [ebx-01010101H]   ; subtract 1 from each byte
            not         ebx                    ; invert all bytes
            and         ecx, ebx               ; and these two
            and         ecx, 80808080H         ; test all sign bits
            jz          L1                     ; no zero bytes, continue loop        
    L3:     bsf         ecx, ecx               ; find right-most 1-bit
            shr         ecx, 3                 ; divide by 8 = byte index            
            sub         eax, [esp+8]           ; subtract start address
            add         eax, ecx               ; add index to byte            
            pop         ebx
            ret         4
strlen32    endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef 
;******************************************************************************
OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
strlen32M       proc        pBuf:DWORD            
                mov         ecx, [esp+4]           ; get pointer to string
                mov         eax, ecx               ; copy pointer
                and         ecx, 3                 ; lower 2 bits of address, check alignment
                jz          L2                     ; string is aligned by 4. Go to loop            
                and         eax, -4                ; align pointer by 4                
                shl         ecx, 3                 ; mul by 8 = displacement in bits
                mov         edx, -1
                shl         edx, cl                ; make byte mask
                not         edx                    ; mask = 0FFH for false bytes
                or          edx, [eax]             ; read from nearest preceding boundary         
                ; check first four bytes for zero
                ;-----------------------------------
                lea         ecx, [edx-01010101H]   ; subtract 1 from each byte
                not         edx                    ; invert all bytes
                and         ecx, edx               ; and these two
                and         ecx, 80808080H         ; test all sign bits
                jnz         L3                     ; zero-byte found        
                ; Main loop, read 4 bytes aligned
                ;-----------------------------------
        L1:     add         eax, 4                 ; increment pointer by 4
        L2:     mov         edx, [eax]             ; read 4 bytes of string
                lea         ecx, [edx-01010101H]   ; subtract 1 from each byte
                not         edx                    ; invert all bytes
                and         ecx, edx               ; and these two
                and         ecx, 80808080H         ; test all sign bits
                jz          L1                     ; no zero bytes, continue loop        
        L3:     bsf         ecx, ecx               ; find right-most 1-bit
                shr         ecx, 3                 ; divide by 8 = byte index            
                sub         eax, [esp+4]           ; subtract start address
                add         eax, ecx               ; add index to byte            
                ret         4
strlen32M       endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef
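
For readers who do not follow the assembly, the idea shared by strlen32 and strlen32M can be restated in C roughly as below. This is an illustrative sketch, not a translation checked against Agner Fog's source. The names zero_byte_mask, first_marked_byte and strlen_words are made up for this example, and a little-endian CPU is assumed. Like the asm, it reads whole aligned 32-bit words, so it can touch bytes just before the string and just after the terminator inside the same aligned word. Strict C does not allow such reads in general, so only try it on strings that sit inside a larger buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if some byte of w is 0x00.
   Same test as the lea / not / and / and sequence in the asm. */
static uint32_t zero_byte_mask(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Byte index (0..3) of the lowest marked sign bit,
   i.e. what bsf followed by shr 3 computes. Requires m != 0. */
static unsigned first_marked_byte(uint32_t m)
{
    unsigned i = 0;
    assert(m != 0);
    while ((m & 0x80u) == 0) {
        m >>= 8;
        ++i;
    }
    return i;
}

/* Word-at-a-time string length (hypothetical helper, little-endian only). */
static size_t strlen_words(const char *s)
{
    unsigned misalign;
    const unsigned char *p;
    uint32_t w, m;

    assert(s != NULL);
    misalign = (unsigned)((uintptr_t)s & 3u);   /* lower 2 bits of the address */
    p = (const unsigned char *)s - misalign;    /* nearest preceding boundary  */

    memcpy(&w, p, sizeof w);                    /* read 4 aligned bytes        */
    w |= ~(0xFFFFFFFFu << (misalign * 8));      /* false bytes become 0FFh     */

    while ((m = zero_byte_mask(w)) == 0) {      /* main loop, 4 bytes per step */
        p += 4;
        memcpy(&w, p, sizeof w);
    }
    /* address of the zero byte minus the start address */
    return (size_t)((uintptr_t)p + first_marked_byte(m) - (uintptr_t)s);
}
```

The mask step is what lets the first read start at the preceding 4-byte boundary without mistaking earlier bytes for the terminator. The test can mark extra bytes above a real zero byte because of the borrow, but the lowest marked byte is always the true one, which is why only the right-most 1-bit is used.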

Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 09:15:57 PM

Quote from: RuiLoureiro on June 11, 2014, 08:59:22 PM
Jochen,
Sorry, I think that I misunderstood your question.
HERE are strlen32, strlen32M

OK, integrated in attached testbed. They are fast indeed.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

11620 cycles for 100 * Rui
3512 cycles for 100 * MB
4338 cycles for 100 * strlen32
4755 cycles for 100 * strlen32M
11842 cycles for 100 * Habran
4932 cycles for 100 * ShortLen (Dave)
11417 cycles for 100 * slen (Hutch)

11628 cycles for 100 * Rui
3514 cycles for 100 * MB
4331 cycles for 100 * strlen32
4736 cycles for 100 * strlen32M
11842 cycles for 100 * Habran
4934 cycles for 100 * ShortLen (Dave)
11425 cycles for 100 * slen (Hutch)

32747 cycles for 100 * Rui
5516 cycles for 100 * MB
11223 cycles for 100 * strlen32
13623 cycles for 100 * strlen32M
33197 cycles for 100 * Habran
15323 cycles for 100 * ShortLen (Dave)
32441 cycles for 100 * slen (Hutch)

32974 cycles for 100 * Rui
5515 cycles for 100 * MB
11256 cycles for 100 * strlen32
13629 cycles for 100 * strlen32M
33159 cycles for 100 * Habran
15327 cycles for 100 * ShortLen (Dave)
32436 cycles for 100 * slen (Hutch)

Quote from: nidud on June 11, 2014, 08:48:19 PM
Dave's function fails (3, 700)?:

Change
ShortLen PROC _lpStr:LPSTR
to
ShortLen PROC ; _lpStr:LPSTR

Title: Re: Optimizing some code
Post by: nidud on June 11, 2014, 09:26:44 PM

deleted

Title: Re: Optimizing some code
Post by: jj2007 on June 11, 2014, 09:32:34 PM

Quote from: nidud on June 11, 2014, 09:26:44 PM
fast :t
...
561 cycles - JJ

Too slow for my taste, but changing
void len(string)
to
void Len(string)
helps a lot ;-)

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 09:35:28 PM

You need to write a simple algo to sort it, Jochen.
Where is your algo ? I bet it uses rep scasb !

This table is sorted.

Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

3704 cycles for 100 * MB
5093 cycles for 100 * strlen32M
5349 cycles for 100 * strlen32

11087 cycles for 100 * ShortLen (Dave)
12093 cycles for 100 * slen (Hutch)
12549 cycles for 100 * Rui
15481 cycles for 100 * Habran

3742 cycles for 100 * MB
5332 cycles for 100 * strlen32
5462 cycles for 100 * strlen32M

11163 cycles for 100 * ShortLen (Dave)
12060 cycles for 100 * slen (Hutch)
12775 cycles for 100 * Rui
14858 cycles for 100 * Habran

7334 cycles for 100 * MB
17334 cycles for 100 * strlen32M
17613 cycles for 100 * strlen32

27930 cycles for 100 * slen (Hutch)
28026 cycles for 100 * ShortLen (Dave)
28658 cycles for 100 * Rui
43521 cycles for 100 * Habran

7344 cycles for 100 * MB
17316 cycles for 100 * strlen32M
17779 cycles for 100 * strlen32

24022 cycles for 100 * ShortLen (Dave)
27805 cycles for 100 * slen (Hutch)
28074 cycles for 100 * Rui
38015 cycles for 100 * Habran

100 = eax Rui
100 = eax MB
100 = eax strlen32
100 = eax strlen32M
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)

--- ok ---

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 09:39:49 PM

:biggrin:
We have optimizations and ... optimizations !
For all tastes ! (We see it not only here.)

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long - <100,200,300,400,500,700,1000>
------------------------------------------------------
2516 cycles - std
1128 cycles - Rui
1297 cycles - habran
1048 cycles - Dave
1443 cycles - Hutch
1200 cycles - JJ
839 cycles - Rui32
841 cycles - Rui32M

13747 cycles - std
7607 cycles - Rui
9441 cycles - habran
6250 cycles - Dave
7470 cycles - Hutch
7079 cycles - JJ
4154 cycles - Rui32
4137 cycles - Rui32M

--- ok ---

Title: Re: Optimizing some code
Post by: nidud on June 11, 2014, 10:00:55 PM

deleted

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 11:07:08 PM

Quote
yep, that was better

What ? Did you see it ? Where is it ? Show us.
I have never seen or read any proc written by Jochen.
We should compare only comparable things.

This is what we may see:
for my taste this is junk

Quote
align 16
TestA_s:
NameA equ Rui ; assign a descriptive name here
TestA proc
    mov ebx, AlgoLoops-1 ; loop e.g. 100x
    align 4
    .Repeat
        push offset somestring
        call GetStringLenY ; Rui
        dec ebx
    .Until Sign?
    ret
TestA endp
TestA_endp:
; ---------------------------------------------------
align 16
TestB_s:
NameB equ MB ; assign a descriptive name here
TestB proc
    mov ebx, AlgoLoops-1 ; loop e.g. 100x
    align 4
    .Repeat
        void Len(offset somestring)
        dec ebx
    .Until Sign?
    ret
TestB endp
TestB_endp:
align 16
TestC_s:
; ----------------------------------------------------
; useC=0 ; uncomment to exclude TestC
NameC equ <strlen32> ; assign a descriptive name here
TestC proc
    mov ebx, AlgoLoops-1 ; loop e.g. 100x
    align 4
    .Repeat
        invoke strlen32, offset somestring
        dec ebx
    .Until Sign?
    ret
TestC endp
TestC_endp:

Title: Re: Optimizing some code
Post by: dedndave on June 11, 2014, 11:09:02 PM

Quote from: nidud on June 11, 2014, 08:48:19 PM
Dave's function fails (3, 700)?:
Code Select
0000000A != 3 - STRLEN error
000003BC != 700 - STRLEN error

i don't understand what that means
i'll fix it if you like

Title: Re: Optimizing some code
Post by: dedndave on June 11, 2014, 11:14:33 PM

Quote from: RuiLoureiro on June 11, 2014, 11:07:08 PM

Quote
yep, that was better
What ? Did you see it ? Where is it ? Show us.
We should compare only comparable things.

the test depends on what you're after
most strings are not aligned
well - BSTR's are - and strings that are inside structures probably are
otherwise... you have to devise a test that tests all alignments

just as an example, i attached a test to look at

1) select a single core, and wait 750 ms to bind before testing
2) select loop counts that yield ~0.5 seconds per pass
3) all alignments are tested
(notice that each string is differently aligned)
4) 16 strings are tested - the overall result is divided by 16 and rounded to nearest
5) you get fewer outliers if you open a console window and type the program name than if you click on it
6) the test should show the processor (this one does not) :P
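
Points 3 and 4 above could look roughly like this in C. It is only a sketch under assumptions: the names time_all_alignments, average_rounded and ALIGNMENTS are invented here, clock() stands in for whatever timer the attached test really uses, and the core binding and 750 ms wait from point 1 are left out. It also reuses one string content at 16 different offsets rather than 16 separate strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define ALIGNMENTS 16   /* one test string per offset 0..15 */
#define MAX_LEN    1024 /* longest string the pool can hold */

typedef size_t (*strlen_fn)(const char *);

/* Divide the overall result by 16, rounded to nearest. */
static unsigned long average_rounded(unsigned long total)
{
    return (total + ALIGNMENTS / 2) / ALIGNMENTS;
}

/* Time fn on a string of length len placed at every offset 0..15
   from a 16-byte boundary. Returns the average clock() ticks per
   alignment. loops is the repeat count per alignment. */
static unsigned long time_all_alignments(strlen_fn fn, size_t len,
                                         unsigned long loops)
{
    static char pool[MAX_LEN + 2 * ALIGNMENTS];
    /* first 16-byte boundary inside the pool */
    char *base = pool
        + ((ALIGNMENTS - ((uintptr_t)pool % ALIGNMENTS)) % ALIGNMENTS);
    unsigned long total = 0;
    volatile size_t sink = 0;   /* keeps the calls from being dropped */

    assert(fn != NULL && len <= MAX_LEN);
    for (unsigned off = 0; off < ALIGNMENTS; ++off) {
        char *s = base + off;   /* each string is differently aligned */
        memset(s, 'a', len);
        s[len] = '\0';
        assert(fn(s) == len);   /* check the result before timing it */

        clock_t t0 = clock();
        for (unsigned long i = 0; i < loops; ++i)
            sink += fn(s);
        total += (unsigned long)(clock() - t0);
    }
    (void)sink;
    return average_rounded(total);
}
```

A call such as time_all_alignments(strlen, 100, 1000000UL) would then give one number per routine. The loop count would need tuning per machine to reach the ~0.5 seconds per pass mentioned in point 2.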

Title: Re: Optimizing some code
Post by: Gunther on June 11, 2014, 11:33:41 PM

Results for Jochen:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

9731    cycles for 100 * Rui
1146    cycles for 100 * MB
3871    cycles for 100 * strlen32
2730    cycles for 100 * strlen32M
8707    cycles for 100 * Habran
5088    cycles for 100 * ShortLen (Dave)
5891    cycles for 100 * slen (Hutch)

8489    cycles for 100 * Rui
2387    cycles for 100 * MB
3879    cycles for 100 * strlen32
2722    cycles for 100 * strlen32M
8698    cycles for 100 * Habran
5101    cycles for 100 * ShortLen (Dave)
5866    cycles for 100 * slen (Hutch)

18823   cycles for 100 * Rui
2029    cycles for 100 * MB
7357    cycles for 100 * strlen32
7455    cycles for 100 * strlen32M
20309   cycles for 100 * Habran
12587   cycles for 100 * ShortLen (Dave)
17441   cycles for 100 * slen (Hutch)

20010   cycles for 100 * Rui
2033    cycles for 100 * MB
7387    cycles for 100 * strlen32
7442    cycles for 100 * strlen32M
19070   cycles for 100 * Habran
12663   cycles for 100 * ShortLen (Dave)
18717   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax strlen32
100     = eax strlen32M
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---

Results for Rui:

Code Select



Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
920     cycles - std
775     cycles - Rui
562     cycles - habran
378     cycles - Dave
544     cycles - Hutch
406     cycles - JJ
332     cycles - Rui32
246     cycles - Rui32M
209     cycles - JJ2

14480   cycles - std
10105   cycles - Rui
14099   cycles - habran
8649    cycles - Dave
9680    cycles - Hutch
7488    cycles - JJ
4513    cycles - Rui32
4634    cycles - Rui32M
1162    cycles - JJ2

--- ok ---

Gunther

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 11, 2014, 11:35:46 PM

Dave,
he must show the proc.
It must be clear, otherwise I give him 0.
As in school, he must explain every bit.
This is my rule:
I don't accept any results unless
you show your work.

Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 12:14:01 AM

Rui,

Quote from: RuiLoureiro on June 11, 2014, 11:35:46 PM
Dave,
he must show the proc.
It must be clear, otherwise I give him 0.
As in school, he must explain every bit.
This is my rule:
I don't accept any results unless
you show your work.

but slen.zip contains the source. What's your point?

Gunther

Title: Re: Optimizing some code
Post by: nidud on June 12, 2014, 12:23:04 AM

deleted

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 12, 2014, 12:40:12 AM

Gunther,
what are you talking about ?
Could you post here the proc
that i am talking about ?

Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 12:42:41 AM

Rui,

no offense. But the zip archive under post #30 contains the source. Or do you mean another procedure?

Gunther

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 12, 2014, 12:46:41 AM

No, I am not talking about it.
Do you know this:
This is the place to post assembler algorithms and code design for discussion, optimisation and any other...
You may do any tests you want to do with the code
i posted. I want to do the same. That's the point.
Read my reply 10.

nidud,
it is not correct to call them "Rui32" and "Rui32M";
they should be called AgnerFog.

:biggrin: :biggrin:
EDIT: Gunther,
Could I have a discussion with you about "gambuzinos" ?
(A "gambuzino" is a creature that no one has ever seen and no one has ever caught. Sometimes we tell someone: go and hunt "gambuzinos".)

Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 04:57:59 AM

Rui,

Quote from: RuiLoureiro on June 12, 2014, 12:46:41 AM
EDIT: Gunther,
Could I have a discussion with you about "gambuzinos" ?
(A "gambuzino" is a creature that no one has ever seen and no one has ever caught. Sometimes we tell someone: go and hunt "gambuzinos".)

I've read your post #10. I think that I'm talking about algorithms, I'm posting test results (not only in your thread), I'm not talking about gambuzinos, Yetis and other impossibilities.

But anyway, it's your thread. My apologies, I won't post in your threads in the future.

Gunther

Title: Re: Optimizing some code
Post by: RuiLoureiro on June 12, 2014, 05:53:36 AM

Hi
What did you do wrong, Gunther ?
I never saw anything wrong. It's clear.
My apologies.

Title: Re: Optimizing some code
Post by: dedndave on June 12, 2014, 09:36:12 AM

fixed my routine - and added ShowCpu :P

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
134 135 135 135 134
141 141 141 140 141

Title: Re: Optimizing some code
Post by: FORTRANS on June 12, 2014, 11:23:18 PM

Hi Dave,

Here are some results.

Code Select

Pre-Pentium4 (SSE1)
106 106 106 106 106
104 104 104 104 104
Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
103 104 104 104 103
111 111 111 111 111
Press any key to continue ...

HTH,

Steve N.

Title: Re: Optimizing some code
Post by: Gunther on June 12, 2014, 11:45:57 PM

Dave,

results from slen2.exe:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
55 51 52 51 51
77 77 77 77 77
Press any key to continue ...

Gunther

Title: Re: Optimizing some code
Post by: nidud on June 15, 2014, 01:28:58 AM

deleted

Title: Re: Optimizing some code
Post by: dedndave on June 15, 2014, 01:33:34 AM

prescott w/htt

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
79459   cycles - 0: standard (scasb)
7603    cycles - 1: AgnerFog
7785    cycles - 2: AgnerFog (unaligned)
9866    cycles - 3: Dave

16232   cycles - 0: standard (scasb)
7615    cycles - 1: AgnerFog
9145    cycles - 2: AgnerFog (unaligned)
10294   cycles - 3: Dave

16081   cycles - 0: standard (scasb)
7759    cycles - 1: AgnerFog
7763    cycles - 2: AgnerFog (unaligned)
10762   cycles - 3: Dave

you might want to increase the loop counts for better stability

Title: Re: Optimizing some code
Post by: Gunther on June 15, 2014, 02:18:58 AM

Hi nidud,

results for strlen4:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
18212   cycles - 0: standard (scasb)
1871    cycles - 1: AgnerFog
1962    cycles - 2: AgnerFog (unaligned)
2789    cycles - 3: Dave

11414   cycles - 0: standard (scasb)
4372    cycles - 1: AgnerFog
4257    cycles - 2: AgnerFog (unaligned)
6608    cycles - 3: Dave

17074   cycles - 0: standard (scasb)
4358    cycles - 1: AgnerFog
4204    cycles - 2: AgnerFog (unaligned)
6618    cycles - 3: Dave

--- ok ---

Gunther

Title: Re: Optimizing some code
Post by: FORTRANS on June 15, 2014, 05:09:47 AM

Hi,

The first time it is run, it is slow on the first test.

Regards,

Steve N.

Code Select

First run.

pre-P4 (SSE1)
------------------------------------------------------
315620  cycles - 0: standard (scasb)
5001    cycles - 1: AgnerFog
4967    cycles - 2: AgnerFog (unaligned)
6741    cycles - 3: Dave

11665   cycles - 0: standard (scasb)
4997    cycles - 1: AgnerFog
4975    cycles - 2: AgnerFog (unaligned)
6764    cycles - 3: Dave

11684   cycles - 0: standard (scasb)
4993    cycles - 1: AgnerFog
4967    cycles - 2: AgnerFog (unaligned)
6780    cycles - 3: Dave

--- ok ---

Second run

pre-P4 (SSE1)
------------------------------------------------------
11813   cycles - 0: standard (scasb)
4992    cycles - 1: AgnerFog
4968    cycles - 2: AgnerFog (unaligned)
6755    cycles - 3: Dave

11679   cycles - 0: standard (scasb)
4985    cycles - 1: AgnerFog
4966    cycles - 2: AgnerFog (unaligned)
6768    cycles - 3: Dave

11696   cycles - 0: standard (scasb)
4993    cycles - 1: AgnerFog
4977    cycles - 2: AgnerFog (unaligned)
6758    cycles - 3: Dave

--- ok ---

 
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
296692	cycles - 0: standard (scasb)
4193	cycles - 1: AgnerFog
4078	cycles - 2: AgnerFog (unaligned)
6334	cycles - 3: Dave
 
11671	cycles - 0: standard (scasb)
4148	cycles - 1: AgnerFog
4064	cycles - 2: AgnerFog (unaligned)
6268	cycles - 3: Dave
 
11675	cycles - 0: standard (scasb)
4209	cycles - 1: AgnerFog
4092	cycles - 2: AgnerFog (unaligned)
6357	cycles - 3: Dave
 
--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
11809	cycles - 0: standard (scasb)
4187	cycles - 1: AgnerFog
4093	cycles - 2: AgnerFog (unaligned)
6219	cycles - 3: Dave
 
11793	cycles - 0: standard (scasb)
4147	cycles - 1: AgnerFog
4074	cycles - 2: AgnerFog (unaligned)
6100	cycles - 3: Dave
 
11810	cycles - 0: standard (scasb)
4211	cycles - 1: AgnerFog
4092	cycles - 2: AgnerFog (unaligned)
6225	cycles - 3: Dave
 
--- ok ---
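
The very slow first scasb line in each first-run table (315620 and 296692 cycles, against about 11700 afterwards) is the kind of outlier a timing loop can sidestep. The thread does not say what causes it, so the C sketch below only shows one common precaution: run one untimed warm-up pass, then keep the fastest of several timed passes. The names timed_pass and best_of are hypothetical, and clock() is a coarse stand-in for the cycle counter the real testbed uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef size_t (*strlen_fn)(const char *);

/* One pass: call fn on s `loops` times, return elapsed clock() ticks. */
static clock_t timed_pass(strlen_fn fn, const char *s, unsigned long loops)
{
    volatile size_t sink = 0;   /* keeps the calls from being dropped */
    clock_t t0 = clock();
    for (unsigned long i = 0; i < loops; ++i)
        sink += fn(s);
    (void)sink;
    return clock() - t0;
}

/* Run one untimed warm-up pass, then `passes` timed passes,
   and return the fastest timed pass. */
static clock_t best_of(strlen_fn fn, const char *s,
                       unsigned long loops, unsigned passes)
{
    clock_t best;

    assert(fn != NULL && s != NULL && passes > 0);
    (void)timed_pass(fn, s, loops);         /* warm-up, result discarded */
    best = timed_pass(fn, s, loops);
    for (unsigned p = 1; p < passes; ++p) {
        clock_t t = timed_pass(fn, s, loops);
        if (t < best)
            best = t;
    }
    return best;
}
```

Taking the minimum rather than the mean is a design choice: interruptions can only make a pass slower, so the fastest pass is usually the least disturbed one. Larger loop counts make each pass long enough for clock() to resolve.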

Title: Re: Optimizing some code
Post by: LarryC on June 15, 2014, 05:43:47 AM

Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz (SSE4)
------------------------------------------------------
10025 cycles - 0: standard (scasb)
6427 cycles - 1: AgnerFog
6444 cycles - 2: AgnerFog (unaligned)
11482 cycles - 3: Dave

15319 cycles - 0: standard (scasb)
7873 cycles - 1: AgnerFog
6176 cycles - 2: AgnerFog (unaligned)
11058 cycles - 3: Dave

14978 cycles - 0: standard (scasb)
7816 cycles - 1: AgnerFog
6251 cycles - 2: AgnerFog (unaligned)
10995 cycles - 3: Dave

--- ok ---

Title: Re: Optimizing some code
Post by: nidud on June 15, 2014, 11:29:32 PM

deleted

Title: Re: Optimizing some code
Post by: Gunther on June 15, 2014, 11:34:59 PM

Hi nidud,

strlen5:

Code Select


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
22200   cycles - 0: standard (scasb)
10776   cycles - 3: Dave
10271   cycles - 5: MB - len()
7120    cycles - 1: AgnerFog
7264    cycles - 2: AgnerFog (unaligned)
3007    cycles - 6: MB - Len() SSE
2280    cycles - 4: unaligned SSE2

21590   cycles - 0: standard (scasb)
10616   cycles - 3: Dave
10136   cycles - 5: MB - len()
7059    cycles - 1: AgnerFog
17323   cycles - 2: AgnerFog (unaligned)
7226    cycles - 6: MB - Len() SSE
5413    cycles - 4: unaligned SSE2

52253   cycles - 0: standard (scasb)
25722   cycles - 3: Dave
24451   cycles - 5: MB - len()
17339   cycles - 1: AgnerFog
17349   cycles - 2: AgnerFog (unaligned)
7205    cycles - 6: MB - Len() SSE
6116    cycles - 4: unaligned SSE2

--- ok ---

Gunther

Title: Re: Optimizing some code
Post by: nidud on June 16, 2014, 01:29:53 AM

deleted

Title: Re: Optimizing some code
Post by: dedndave on June 16, 2014, 02:38:24 AM

strlen5
prescott w/htt

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
86037   cycles - 0: standard (scasb)
31180   cycles - 3: Dave
33575   cycles - 5: MB - len()
23079   cycles - 1: AgnerFog
25595   cycles - 2: AgnerFog (unaligned)
21374   cycles - 6: MB - Len() SSE
18166   cycles - 4: unaligned SSE2

49577   cycles - 0: standard (scasb)
31080   cycles - 3: Dave
32727   cycles - 5: MB - len()
23139   cycles - 1: AgnerFog
25405   cycles - 2: AgnerFog (unaligned)
21643   cycles - 6: MB - Len() SSE
18152   cycles - 4: unaligned SSE2

49638   cycles - 0: standard (scasb)
31000   cycles - 3: Dave
32762   cycles - 5: MB - len()
23151   cycles - 1: AgnerFog
31292   cycles - 2: AgnerFog (unaligned)
21172   cycles - 6: MB - Len() SSE
18204   cycles - 4: unaligned SSE2

The MASM Forum

General => The Laboratory => Topic started by: RuiLoureiro on June 10, 2014, 06:54:45 PM