The MASM Forum

General => The Workshop => Topic started by: TouEnMasm on November 02, 2014, 07:24:13 PM

Title: Improvement of the strlen function
Post by: TouEnMasm on November 02, 2014, 07:24:13 PM
The strlen don't align the data adress and lose time when this one is unalign.
Here data misaligned
Quote
   .data
   hInstance dd 0
   decale BYTE 0           ;break the alignment
   longphrase    db "BCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl",0


AlignedStrLen proc item:DWORD
    mov     eax,item            ; get pointer to string
    lea     edx, [eax+3]            ; pointer+3 used in the end
    push    ebx
    push    edi
    mov     ebx, 80808080h
    ;align eax
    mov ecx,eax
    and ecx,3 ; ecx= x (0 to 3)
    neg ecx ;- x
    add ecx,sizeof dword    ;4-x dans ecx
   ; jz aligned
    ;first compare
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, ecx                ; increment pointer and align on dword
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebx
    jz  aligned   
    mov eax,item ;reponse       
    add eax,4
    jmp reponse
  aligned:     
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF 6bff4241
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 on cherche 13, le byte D devient FF
    ;upper,method to search a byte
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebx
    jz     aligned   
reponse:
    test    ecx, 00008080h          ; test first two bytes
    jnz     @F
    shr     ecx, 16                 ; not in the first 2 bytes
    add     eax, 2
  @@:
    shl     cl, 1                   ; use carry flag to avoid branch
    sbb     eax, edx                ; compute length
    pop     edi
    pop     ebx
    ret   
AlignedStrLen endp

There is a very little penalty on short string  but a real win of time on others (about 20%)


Title: Re: Improvement of the strlen function
Post by: jj2007 on November 02, 2014, 08:00:47 PM
Looks really competitive :t What is it supposed to return?

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1447    cycles for 10 * CRT strlen
989     cycles for 10 * Masm32 StrLen
262     cycles for 10 * AlignedStrLen
385     cycles for 10 * MasmBasic Len()

1416    cycles for 10 * CRT strlen
987     cycles for 10 * Masm32 StrLen
262     cycles for 10 * AlignedStrLen
386     cycles for 10 * MasmBasic Len()

1422    cycles for 10 * CRT strlen
984     cycles for 10 * Masm32 StrLen
267     cycles for 10 * AlignedStrLen
385     cycles for 10 * MasmBasic Len()

100     = eax CRT strlen
100     = eax Masm32 StrLen
7       = eax AlignedStrLen
100     = eax MasmBasic Len()


P.S.: Results look different, i.e. eax=100, if you activate this line (#104 in the attached source):
; jz     aligned             ;useless if align ecx == 4
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 02, 2014, 08:10:54 PM

She return the lenght of the chain.
But she can also return  the length of a text line terminated by 0d,0a,just change:
Quote
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
by
Quote
   ; lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
And you have a linelength function where alignment is very useful

Title: Re: Improvement of the strlen function
Post by: qWord on November 02, 2014, 08:39:40 PM
"She" does return the correct string length only if:
- the address is aligned
- the address is not aligned and the string length is >= 4.
Title: Re: Improvement of the strlen function
Post by: Gunther on November 03, 2014, 02:09:21 AM
Yves and Jochen,

timings for you:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

792     cycles for 10 * CRT strlen
607     cycles for 10 * Masm32 StrLen
130     cycles for 10 * AlignedStrLen
250     cycles for 10 * MasmBasic Len()

776     cycles for 10 * CRT strlen
581     cycles for 10 * Masm32 StrLen
318     cycles for 10 * AlignedStrLen
609     cycles for 10 * MasmBasic Len()

774     cycles for 10 * CRT strlen
597     cycles for 10 * Masm32 StrLen
315     cycles for 10 * AlignedStrLen
610     cycles for 10 * MasmBasic Len()

100     = eax CRT strlen
100     = eax Masm32 StrLen
7       = eax AlignedStrLen
100     = eax MasmBasic Len()

--- ok ---

Gunther
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 03, 2014, 03:55:14 AM
Quote
"She" does return the correct string length only if:
- the address is aligned
- the address is not aligned and the string length is >= 4.
corrected
Quote
    ;first compare
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, ecx                ; increment pointer and align on dword
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebx
    jz  aligned                                      ;-------------------------- see first post for complet source code   
    mov eax,item ;reponse                   ;------------------
    add eax,4                                      ;---------------------
    jmp reponse                                   ;-----------------------
  aligned:     
Timing must not be modified

Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 03:56:44 AM
deleted
Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 04:15:46 AM
deleted
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 03, 2014, 04:27:26 AM
 To nidud
You can't align loaded data with the align directive.
The only soluce is to mov the pointer untill he reach an adress multiple of the choosen alignment.
using the corrected version
Get the full code in the first post
The EBP register is replace by EBX,your crash come from here


Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 04:44:32 AM
deleted
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 03, 2014, 04:53:45 AM
I don't know at what you are playing,my code is not like that:
Quote
if I fix your code:
Code: [Select]
    mov eax,[esp+4+12] ;reponse          ;------------------
be cool showing your code in mnemonic asm,as i do.

Windbg is a good player,he show
Quote
vitesse_strlen!AlignedStrLen+0x3:
0040109b 8b4508          mov     eax,dword ptr [ebp+8] ss:0023:0012ffc0={vitesse_strlen!longphrase (00404005)}
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 03, 2014, 05:05:09 AM
Added Nidud's version. They all work fine for a 100 byte string - there is a test for "Hel" in line 413 for those who are interested  ;)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1180    cycles for 10 * CRT strlen
1077    cycles for 10 * Masm32 StrLen
1147    cycles for 10 * AlignedStrLen
405     cycles for 10 * MasmBasic Len()
1167    cycles for 10 * strlenNidud

1180    cycles for 10 * CRT strlen
1075    cycles for 10 * Masm32 StrLen
1147    cycles for 10 * AlignedStrLen
405     cycles for 10 * MasmBasic Len()
1174    cycles for 10 * strlenNidud

1180    cycles for 10 * CRT strlen
1077    cycles for 10 * Masm32 StrLen
1147    cycles for 10 * AlignedStrLen
406     cycles for 10 * MasmBasic Len()
1166    cycles for 10 * strlenNidud

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax AlignedStrLen
100     = eax MasmBasic Len()
100     = eax strlenNidud


P.S.: This is still Yves' first version - #2 crashes:
0040106E   ³.  8B45 08        mov eax, [ebp+8]
ebp is 80808080h

btw does anybody have a CPU showing the 20% improvement?
Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 06:40:36 AM
deleted
Title: Re: Improvement of the strlen function
Post by: Gunther on November 03, 2014, 06:45:10 AM
Jochen your timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

741     cycles for 10 * CRT strlen
627     cycles for 10 * Masm32 StrLen
1344    cycles for 10 * AlignedStrLen
217     cycles for 10 * MasmBasic Len()
730     cycles for 10 * strlenNidud

742     cycles for 10 * CRT strlen
617     cycles for 10 * Masm32 StrLen
1350    cycles for 10 * AlignedStrLen
215     cycles for 10 * MasmBasic Len()
729     cycles for 10 * strlenNidud

125     cycles for 10 * CRT strlen
248     cycles for 10 * Masm32 StrLen
309     cycles for 10 * AlignedStrLen
265     cycles for 10 * MasmBasic Len()
230     cycles for 10 * strlenNidud

123     cycles for 10 * CRT strlen
103     cycles for 10 * Masm32 StrLen
126     cycles for 10 * AlignedStrLen
109     cycles for 10 * MasmBasic Len()
94      cycles for 10 * strlenNidud

4       = eax CRT strlen
4       = eax Masm32 StrLen
4       = eax AlignedStrLen
4       = eax MasmBasic Len()
4       = eax strlenNidud

--- ok ---

Nidud,

your timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
380986    cycles - 1000 (  0) 0: crt_strlen
336201    cycles - 1000 ( 71) 1: aligned 00000020
345548    cycles - 1000 ( 87) 2: aligned 00000030
343273    cycles - 1000 ( 43) 3: unaligned
87093     cycles - 1000 ( 41) 4: SSE

377742    cycles - 1000 (  0) 0: crt_strlen
335817    cycles - 1000 ( 71) 1: aligned 00000020
347243    cycles - 1000 ( 87) 2: aligned 00000030
341638    cycles - 1000 ( 43) 3: unaligned
87740     cycles - 1000 ( 41) 4: SSE

379494    cycles - 1000 (  0) 0: crt_strlen
333183    cycles - 1000 ( 71) 1: aligned 00000020
346480    cycles - 1000 ( 87) 2: aligned 00000030
342334    cycles - 1000 ( 43) 3: unaligned
87537     cycles - 1000 ( 41) 4: SSE
--- ok ---

Gunther
Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 07:47:17 AM
deleted
Title: Re: Improvement of the strlen function
Post by: dedndave on November 03, 2014, 07:49:03 AM
from what i've seen, a lot depends on the length of the strings you are testing
if the strings are short (and they usually are), then the overhead of aligning compares does not pay off

think about it - how often do you want to find the length of a string with more than 100 characters or so?

if you are working with longer strings, this routine will do well
it is adapted from code that was published by Randy Hyde
i made so many changes that i didn't put his name in it - lol

the longer the test strings, the better this will compare with other algos

;***********************************************************************************************

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

        ALIGN   16

SzLen   PROC    lpszStr:LPSTR

;SzLen - Get Length of Zero-Terminated String
;DednDave - Dave Sheldon - 8-2010

        mov     edx,[esp+4]
        mov     ecx,7F7F7F7Fh
        test    dl,3
        push    ebx
        jz      SzLen2

        mov     ebx,[edx]
        xor     eax,eax
        or      bl,bl
        jz      SzLen0

        inc     eax
        or      bh,bh
        jz      SzLen0

        test    dl,2
        jnz     SzLen1

        inc     eax
        test    ebx,0FF0000h
        jnz     SzLen1

SzLen0: pop     ebx
        ret     4

SzLen1: or      dl,3
        inc     edx

SzLen2: push    esi
        push    edi
        mov     esi,01010101h
        mov     edi,80808080h

SzLen3: mov     eax,edx
        mov     ebx,[eax]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+4]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+8]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+12]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+16]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+20]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+24]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+28]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+32]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jz      SzLen3

SzLen4: mov     ebx,[edx-4]
        lea     eax,[edx-4]
        or      bl,bl
        jz      SzLen5

        inc     eax
        or      bh,bh
        jz      SzLen5

        inc     eax
        test    ebx,0FF0000h
        jz      SzLen5

        inc     eax
        test    ebx,0FF000000h
        jnz     SzLen3

SzLen5: pop     edi
        pop     esi
        sub     eax,[esp+8]
        pop     ebx
        ret     4

SzLen   ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;***********************************************************************************************
Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 08:03:10 AM
deleted
Title: Re: Improvement of the strlen function
Post by: nidud on November 03, 2014, 08:39:47 AM
deleted
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 03, 2014, 09:37:42 AM
Quote from: dedndave on November 03, 2014, 07:49:03 AM
from what i've seen, a lot depends on the length of the strings you are testing
if the strings are short (and they usually are), then the overhead of aligning compares does not pay off

Check the average line length of e.g. the Bible, or The Bible of C Coders, i.e. the header files - see attachment #2, for my VS installation it's 33 bytes  ;-)

Attached a test with different lengths. Your code is fast, Dave. For the 10-byte strings, for certain misalignments the winner is Masm32 StrLen, though - greets to Hutch :t


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

100 bytes:
1180    cycles for 10 * CRT strlen
1069    cycles for 10 * Masm32 StrLen
1148    cycles for 10 * AlignedStrLen
393     cycles for 10 * MasmBasic Len()
1167    cycles for 10 * strlenNidud
956     cycles for 10 * SzLen (Dave)

70 bytes:
901     cycles for 10 * CRT strlen
743     cycles for 10 * Masm32 StrLen
964     cycles for 10 * AlignedStrLen
342     cycles for 10 * MasmBasic Len()
1158    cycles for 10 * strlenNidud
704     cycles for 10 * SzLen (Dave)

40 bytes:
616     cycles for 10 * CRT strlen
509     cycles for 10 * Masm32 StrLen
766     cycles for 10 * AlignedStrLen
282     cycles for 10 * MasmBasic Len()
856     cycles for 10 * strlenNidud
546     cycles for 10 * SzLen (Dave)

10 bytes:
254     cycles for 10 * CRT strlen
313     cycles for 10 * Masm32 StrLen
328     cycles for 10 * AlignedStrLen
245     cycles for 10 * MasmBasic Len()
262     cycles for 10 * strlenNidud
289     cycles for 10 * SzLen (Dave)

10      = eax CRT strlen
10      = eax Masm32 StrLen
10      = eax AlignedStrLen
10      = eax MasmBasic Len()
10      = eax strlenNidud
10      = eax SzLen (Dave)


P.S.: The *.asc in attachment #2 is in RTF format, for use with \Masm32\MasmBasic\RichMasm.exe

This line loads an array of filenames - change if your header files are in a different location:
GetFiles ExpandEnv$("%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\Include\*.h")
Title: Re: Improvement of the strlen function
Post by: Gunther on November 04, 2014, 04:57:47 AM
Jochen,

new timings from StrLenYves3.zip:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

100 bytes:
744     cycles for 10 * CRT strlen
614     cycles for 10 * Masm32 StrLen
875     cycles for 10 * AlignedStrLen
204     cycles for 10 * MasmBasic Len()
733     cycles for 10 * strlenNidud
504     cycles for 10 * SzLen (Dave)

70 bytes:
550     cycles for 10 * CRT strlen
425     cycles for 10 * Masm32 StrLen
571     cycles for 10 * AlignedStrLen
162     cycles for 10 * MasmBasic Len()
1179    cycles for 10 * strlenNidud
518     cycles for 10 * SzLen (Dave)

40 bytes:
847     cycles for 10 * CRT strlen
283     cycles for 10 * Masm32 StrLen
371     cycles for 10 * AlignedStrLen
168     cycles for 10 * MasmBasic Len()
344     cycles for 10 * strlenNidud
275     cycles for 10 * SzLen (Dave)

10 bytes:
149     cycles for 10 * CRT strlen
331     cycles for 10 * Masm32 StrLen
405     cycles for 10 * AlignedStrLen
285     cycles for 10 * MasmBasic Len()
331     cycles for 10 * strlenNidud
363     cycles for 10 * SzLen (Dave)

10      = eax CRT strlen
10      = eax Masm32 StrLen
10      = eax AlignedStrLen
10      = eax MasmBasic Len()
10      = eax strlenNidud
10      = eax SzLen (Dave)

--- ok ---

Gunther
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 04, 2014, 11:07:15 AM
Danke, Gunther :t
Title: Re: Improvement of the strlen function
Post by: Gunther on November 05, 2014, 04:44:22 AM
Jochen,

Quote from: jj2007 on November 04, 2014, 11:07:15 AM
Danke, Gunther :t

it's my pleasure.

Gunther
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 05, 2014, 05:30:25 AM

It seems there is some problems with the measurement:
Quote
70 bytes:
550     cycles for 10 * CRT strlen
425     cycles for 10 * Masm32 StrLen ;********
571     cycles for 10 * AlignedStrLen
162     cycles for 10 * MasmBasic Len()
1179    cycles for 10 * strlenNidud   ;**************
518     cycles for 10 * SzLen (Dave)
40 bytes:
847     cycles for 10 * CRT strlen
283     cycles for 10 * Masm32 StrLen ;**********
371     cycles for 10 * AlignedStrLen
168     cycles for 10 * MasmBasic Len()
344     cycles for 10 * strlenNidud  ;***********
275     cycles for 10 * SzLen (Dave)
The win of time given by the alignment must be just linear with the lenght:
Quote
100 bytes:
MasmStrLen :176 micro/S,strlenNidud :124 micro/S
70 bytes:
MasmStrLen :80 micro/S,strlenNidud :78 micro/S
40 bytes:
MasmStrLen :57 micro/S,strlenNidud :45 micro/S
10 bytes:
MasmStrLen :28 micro/S,strlenNidud :28 micro/S



Title: Re: Improvement of the strlen function
Post by: nidud on November 05, 2014, 05:49:35 AM
deleted
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 05, 2014, 06:32:35 AM
Quote from: ToutEnMasm on November 05, 2014, 05:30:25 AM
It seems there is some problems with the measurement

Sorry for that. Which CPU? Some produce a certain variance, although it looks pretty stable on mine, i.e. Celeron and i5.
Here is a long version that repeats each size twice. As you can see, that is still your version A which fails for len<4 - if you can post a complete new version, I'll include it.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

800 bytes:
5310    cycles for 10 * CRT strlen
5560    cycles for 10 * Masm32 StrLen
4821    cycles for 10 * AlignedStrLen
1348    cycles for 10 * MasmBasic Len()
4798    cycles for 10 * strlenNidud
3710    cycles for 10 * SzLen (Dave)

5191    cycles for 10 * CRT strlen
5148    cycles for 10 * Masm32 StrLen
4805    cycles for 10 * AlignedStrLen
1352    cycles for 10 * MasmBasic Len()
4802    cycles for 10 * strlenNidud
3712    cycles for 10 * SzLen (Dave)

534 bytes:
3704    cycles for 10 * CRT strlen
3289    cycles for 10 * Masm32 StrLen
3479    cycles for 10 * AlignedStrLen
987     cycles for 10 * MasmBasic Len()
3399    cycles for 10 * strlenNidud
2543    cycles for 10 * SzLen (Dave)

3689    cycles for 10 * CRT strlen
3294    cycles for 10 * Masm32 StrLen
3479    cycles for 10 * AlignedStrLen
989     cycles for 10 * MasmBasic Len()
3401    cycles for 10 * strlenNidud
2538    cycles for 10 * SzLen (Dave)

268 bytes:
2150    cycles for 10 * CRT strlen
1701    cycles for 10 * Masm32 StrLen
2033    cycles for 10 * AlignedStrLen
469     cycles for 10 * MasmBasic Len()
1994    cycles for 10 * strlenNidud
1379    cycles for 10 * SzLen (Dave)

2149    cycles for 10 * CRT strlen
1700    cycles for 10 * Masm32 StrLen
2033    cycles for 10 * AlignedStrLen
469     cycles for 10 * MasmBasic Len()
1992    cycles for 10 * strlenNidud
1375    cycles for 10 * SzLen (Dave)

2 bytes:
130     cycles for 10 * CRT strlen
117     cycles for 10 * Masm32 StrLen
185     cycles for 10 * AlignedStrLen
136     cycles for 10 * MasmBasic Len()
113     cycles for 10 * strlenNidud
147     cycles for 10 * SzLen (Dave)

128     cycles for 10 * CRT strlen
117     cycles for 10 * Masm32 StrLen
187     cycles for 10 * AlignedStrLen
131     cycles for 10 * MasmBasic Len()
116     cycles for 10 * strlenNidud
146     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)
Title: Re: Improvement of the strlen function
Post by: Siekmanski on November 05, 2014, 07:06:49 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

800 bytes:
4903    cycles for 10 * CRT strlen
4959    cycles for 10 * Masm32 StrLen
4658    cycles for 10 * AlignedStrLen
1309    cycles for 10 * MasmBasic Len()
4575    cycles for 10 * strlenNidud
3376    cycles for 10 * SzLen (Dave)

4922    cycles for 10 * CRT strlen
4952    cycles for 10 * Masm32 StrLen
4656    cycles for 10 * AlignedStrLen
1304    cycles for 10 * MasmBasic Len()
4572    cycles for 10 * strlenNidud
3400    cycles for 10 * SzLen (Dave)

534 bytes:
3342    cycles for 10 * CRT strlen
3449    cycles for 10 * Masm32 StrLen
3301    cycles for 10 * AlignedStrLen
933     cycles for 10 * MasmBasic Len()
3248    cycles for 10 * strlenNidud
2356    cycles for 10 * SzLen (Dave)

3364    cycles for 10 * CRT strlen
3433    cycles for 10 * Masm32 StrLen
3307    cycles for 10 * AlignedStrLen
927     cycles for 10 * MasmBasic Len()
3233    cycles for 10 * strlenNidud
2366    cycles for 10 * SzLen (Dave)

268 bytes:
1921    cycles for 10 * CRT strlen
1613    cycles for 10 * Masm32 StrLen
1965    cycles for 10 * AlignedStrLen
444     cycles for 10 * MasmBasic Len()
1903    cycles for 10 * strlenNidud
1234    cycles for 10 * SzLen (Dave)

1920    cycles for 10 * CRT strlen
1625    cycles for 10 * Masm32 StrLen
1960    cycles for 10 * AlignedStrLen
448     cycles for 10 * MasmBasic Len()
1909    cycles for 10 * strlenNidud
1226    cycles for 10 * SzLen (Dave)

2 bytes:
140     cycles for 10 * CRT strlen
115     cycles for 10 * Masm32 StrLen
182     cycles for 10 * AlignedStrLen
123     cycles for 10 * MasmBasic Len()
112     cycles for 10 * strlenNidud
135     cycles for 10 * SzLen (Dave)

219     cycles for 10 * CRT strlen
109     cycles for 10 * Masm32 StrLen
183     cycles for 10 * AlignedStrLen
118     cycles for 10 * MasmBasic Len()
111     cycles for 10 * strlenNidud
132     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

--- ok ---
Title: Re: Improvement of the strlen function
Post by: Gunther on November 05, 2014, 09:05:03 AM
Results from the new laptop for version 4:
Quote
Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz (SSE4)

800 bytes:
3922    cycles for 10 * CRT strlen
3970    cycles for 10 * Masm32 StrLen
4432    cycles for 10 * AlignedStrLen
1024    cycles for 10 * MasmBasic Len()
4626    cycles for 10 * strlenNidud
3684    cycles for 10 * SzLen (Dave)

4734    cycles for 10 * CRT strlen
4776    cycles for 10 * Masm32 StrLen
4849    cycles for 10 * AlignedStrLen
1032    cycles for 10 * MasmBasic Len()
4630    cycles for 10 * strlenNidud
3678    cycles for 10 * SzLen (Dave)

534 bytes:
3606    cycles for 10 * CRT strlen
3523    cycles for 10 * Masm32 StrLen
3858    cycles for 10 * AlignedStrLen
1677    cycles for 10 * MasmBasic Len()
2661    cycles for 10 * strlenNidud
2048    cycles for 10 * SzLen (Dave)

2783    cycles for 10 * CRT strlen
3481    cycles for 10 * Masm32 StrLen
3612    cycles for 10 * AlignedStrLen
1542    cycles for 10 * MasmBasic Len()
2671    cycles for 10 * strlenNidud
2428    cycles for 10 * SzLen (Dave)

268 bytes:
2414    cycles for 10 * CRT strlen
1333    cycles for 10 * Masm32 StrLen
1613    cycles for 10 * AlignedStrLen
955     cycles for 10 * MasmBasic Len()
2017    cycles for 10 * strlenNidud
1075    cycles for 10 * SzLen (Dave)

1600    cycles for 10 * CRT strlen
1311    cycles for 10 * Masm32 StrLen
2447    cycles for 10 * AlignedStrLen
379     cycles for 10 * MasmBasic Len()
2323    cycles for 10 * strlenNidud
1067    cycles for 10 * SzLen (Dave)

2 bytes:
111     cycles for 10 * CRT strlen
300     cycles for 10 * Masm32 StrLen
518     cycles for 10 * AlignedStrLen
335     cycles for 10 * MasmBasic Len()
345     cycles for 10 * strlenNidud
356     cycles for 10 * SzLen (Dave)

314     cycles for 10 * CRT strlen
301     cycles for 10 * Masm32 StrLen
516     cycles for 10 * AlignedStrLen
348     cycles for 10 * MasmBasic Len()
278     cycles for 10 * strlenNidud
350     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

--- ok ---

Gunther
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 05, 2014, 11:50:05 AM
Thanks, Gunther & Marinus. Have you noticed the stable i7 results vs unstable i5 results of Gunther? I have no explanation for that :(
Title: Re: Improvement of the strlen function
Post by: Siekmanski on November 05, 2014, 12:17:44 PM
Maybe Guthers computer is overclocked ?
I changed the CPU clock speed of my i7 computer.
Instead of overclocking i have lowered my CPU speed a bit to get a very quiet computer.
Title: Re: Improvement of the strlen function
Post by: dedndave on November 05, 2014, 03:18:49 PM
the loop counts are very low
the tests zip by too fast - lol
from past experience, a loop count that yields a 0.5 second pass results in fairly stable numbers
Title: Re: Improvement of the strlen function
Post by: dedndave on November 05, 2014, 03:21:29 PM
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

800 bytes:
8240    cycles for 10 * CRT strlen
8984    cycles for 10 * Masm32 StrLen
7501    cycles for 10 * AlignedStrLen
4380    cycles for 10 * MasmBasic Len()
7496    cycles for 10 * strlenNidud
6072    cycles for 10 * SzLen (Dave)

8281    cycles for 10 * CRT strlen
9109    cycles for 10 * Masm32 StrLen
7348    cycles for 10 * AlignedStrLen
4403    cycles for 10 * MasmBasic Len()
7625    cycles for 10 * strlenNidud
6119    cycles for 10 * SzLen (Dave)

534 bytes:
5692    cycles for 10 * CRT strlen
6445    cycles for 10 * Masm32 StrLen
5286    cycles for 10 * AlignedStrLen
3138    cycles for 10 * MasmBasic Len()
5433    cycles for 10 * strlenNidud
4224    cycles for 10 * SzLen (Dave)

5675    cycles for 10 * CRT strlen
6526    cycles for 10 * Masm32 StrLen
5159    cycles for 10 * AlignedStrLen
3163    cycles for 10 * MasmBasic Len()
5430    cycles for 10 * strlenNidud
4195    cycles for 10 * SzLen (Dave)

268 bytes:
3212    cycles for 10 * CRT strlen
3716    cycles for 10 * Masm32 StrLen
2953    cycles for 10 * AlignedStrLen
1573    cycles for 10 * MasmBasic Len()
3052    cycles for 10 * strlenNidud
2492    cycles for 10 * SzLen (Dave)

3309    cycles for 10 * CRT strlen
3580    cycles for 10 * Masm32 StrLen
2981    cycles for 10 * AlignedStrLen
1574    cycles for 10 * MasmBasic Len()
3061    cycles for 10 * strlenNidud
2580    cycles for 10 * SzLen (Dave)

2 bytes:
302     cycles for 10 * CRT strlen
281     cycles for 10 * Masm32 StrLen
320     cycles for 10 * AlignedStrLen
351     cycles for 10 * MasmBasic Len()
246     cycles for 10 * strlenNidud
339     cycles for 10 * SzLen (Dave)

299     cycles for 10 * CRT strlen
264     cycles for 10 * Masm32 StrLen
321     cycles for 10 * AlignedStrLen
351     cycles for 10 * MasmBasic Len()
245     cycles for 10 * strlenNidud
340     cycles for 10 * SzLen (Dave)
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 05, 2014, 08:13:29 PM
Pretty stable timings, especially for a P4, and yours is clearly the best non-SSE algo :t
Title: Re: Improvement of the strlen function
Post by: Gunther on November 05, 2014, 11:29:59 PM
Hi Marinus,

Quote from: Siekmanski on November 05, 2014, 12:17:44 PM
Maybe Guthers computer is overclocked ?

no, it isn't. It's an Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz 2.30GHz. That's what the System says.

Gunther
Title: Re: Improvement of the strlen function
Post by: FORTRANS on November 06, 2014, 12:26:13 AM
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

800 bytes:
4423 cycles for 10 * CRT strlen
4437 cycles for 10 * Masm32 StrLen
4513 cycles for 10 * AlignedStrLen
1381 cycles for 10 * MasmBasic Len()
4670 cycles for 10 * strlenNidud
3220 cycles for 10 * SzLen (Dave)

4424 cycles for 10 * CRT strlen
4440 cycles for 10 * Masm32 StrLen
4521 cycles for 10 * AlignedStrLen
1646 cycles for 10 * MasmBasic Len()
4313 cycles for 10 * strlenNidud
3584 cycles for 10 * SzLen (Dave)

534 bytes:
3351 cycles for 10 * CRT strlen
3429 cycles for 10 * Masm32 StrLen
3507 cycles for 10 * AlignedStrLen
950 cycles for 10 * MasmBasic Len()
2988 cycles for 10 * strlenNidud
2583 cycles for 10 * SzLen (Dave)

3115 cycles for 10 * CRT strlen
3067 cycles for 10 * Masm32 StrLen
3153 cycles for 10 * AlignedStrLen
939 cycles for 10 * MasmBasic Len()
2996 cycles for 10 * strlenNidud
2587 cycles for 10 * SzLen (Dave)

268 bytes:
1806 cycles for 10 * CRT strlen
1818 cycles for 10 * Masm32 StrLen
2177 cycles for 10 * AlignedStrLen
703 cycles for 10 * MasmBasic Len()
1829 cycles for 10 * strlenNidud
1411 cycles for 10 * SzLen (Dave)

2178 cycles for 10 * CRT strlen
1816 cycles for 10 * Masm32 StrLen
2164 cycles for 10 * AlignedStrLen
699 cycles for 10 * MasmBasic Len()
1835 cycles for 10 * strlenNidud
1190 cycles for 10 * SzLen (Dave)

2 bytes:
109 cycles for 10 * CRT strlen
104 cycles for 10 * Masm32 StrLen
379 cycles for 10 * AlignedStrLen
247 cycles for 10 * MasmBasic Len()
205 cycles for 10 * strlenNidud
284 cycles for 10 * SzLen (Dave)

232 cycles for 10 * CRT strlen
223 cycles for 10 * Masm32 StrLen
180 cycles for 10 * AlignedStrLen
246 cycles for 10 * MasmBasic Len()
203 cycles for 10 * strlenNidud
259 cycles for 10 * SzLen (Dave)

2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)

--- ok --- Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

800 bytes:
7784 cycles for 10 * CRT strlen
7381 cycles for 10 * Masm32 StrLen
6988 cycles for 10 * AlignedStrLen
2023 cycles for 10 * MasmBasic Len()
7103 cycles for 10 * strlenNidud
6031 cycles for 10 * SzLen (Dave)

7759 cycles for 10 * CRT strlen
7381 cycles for 10 * Masm32 StrLen
6958 cycles for 10 * AlignedStrLen
2020 cycles for 10 * MasmBasic Len()
7104 cycles for 10 * strlenNidud
6023 cycles for 10 * SzLen (Dave)

534 bytes:
5412 cycles for 10 * CRT strlen
5033 cycles for 10 * Masm32 StrLen
4861 cycles for 10 * AlignedStrLen
1412 cycles for 10 * MasmBasic Len()
4844 cycles for 10 * strlenNidud
4062 cycles for 10 * SzLen (Dave)

5430 cycles for 10 * CRT strlen
4975 cycles for 10 * Masm32 StrLen
4833 cycles for 10 * AlignedStrLen
1446 cycles for 10 * MasmBasic Len()
4847 cycles for 10 * strlenNidud
4053 cycles for 10 * SzLen (Dave)

268 bytes:
2908 cycles for 10 * CRT strlen
2619 cycles for 10 * Masm32 StrLen
2640 cycles for 10 * AlignedStrLen
852 cycles for 10 * MasmBasic Len()
2635 cycles for 10 * strlenNidud
2181 cycles for 10 * SzLen (Dave)

2906 cycles for 10 * CRT strlen
2595 cycles for 10 * Masm32 StrLen
2673 cycles for 10 * AlignedStrLen
853 cycles for 10 * MasmBasic Len()
2633 cycles for 10 * strlenNidud
2178 cycles for 10 * SzLen (Dave)

2 bytes:
181 cycles for 10 * CRT strlen
135 cycles for 10 * Masm32 StrLen
213 cycles for 10 * AlignedStrLen
183 cycles for 10 * MasmBasic Len()
129 cycles for 10 * strlenNidud
220 cycles for 10 * SzLen (Dave)

181 cycles for 10 * CRT strlen
135 cycles for 10 * Masm32 StrLen
213 cycles for 10 * AlignedStrLen
184 cycles for 10 * MasmBasic Len()
129 cycles for 10 * strlenNidud
220 cycles for 10 * SzLen (Dave)

2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)

--- ok ---
Title: Re: Improvement of the strlen function
Post by: Gunther on November 06, 2014, 05:43:43 AM
Here's the result from the desktop computer:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

800 bytes:
4500    cycles for 10 * CRT strlen
4552    cycles for 10 * Masm32 StrLen
4264    cycles for 10 * AlignedStrLen
1202    cycles for 10 * MasmBasic Len()
4191    cycles for 10 * strlenNidud
3114    cycles for 10 * SzLen (Dave)

4511    cycles for 10 * CRT strlen
4594    cycles for 10 * Masm32 StrLen
4268    cycles for 10 * AlignedStrLen
1193    cycles for 10 * MasmBasic Len()
4214    cycles for 10 * strlenNidud
3117    cycles for 10 * SzLen (Dave)

534 bytes:
3249    cycles for 10 * CRT strlen
3216    cycles for 10 * Masm32 StrLen
3033    cycles for 10 * AlignedStrLen
858     cycles for 10 * MasmBasic Len()
2981    cycles for 10 * strlenNidud
2163    cycles for 10 * SzLen (Dave)

3251    cycles for 10 * CRT strlen
3777    cycles for 10 * Masm32 StrLen
3031    cycles for 10 * AlignedStrLen
877     cycles for 10 * MasmBasic Len()
2987    cycles for 10 * strlenNidud
2171    cycles for 10 * SzLen (Dave)

268 bytes:
1846    cycles for 10 * CRT strlen
1496    cycles for 10 * Masm32 StrLen
1815    cycles for 10 * AlignedStrLen
400     cycles for 10 * MasmBasic Len()
1750    cycles for 10 * strlenNidud
1709    cycles for 10 * SzLen (Dave)

1837    cycles for 10 * CRT strlen
1488    cycles for 10 * Masm32 StrLen
1809    cycles for 10 * AlignedStrLen
406     cycles for 10 * MasmBasic Len()
1755    cycles for 10 * strlenNidud
1124    cycles for 10 * SzLen (Dave)

2 bytes:
106     cycles for 10 * CRT strlen
247     cycles for 10 * Masm32 StrLen
404     cycles for 10 * AlignedStrLen
268     cycles for 10 * MasmBasic Len()
247     cycles for 10 * strlenNidud
296     cycles for 10 * SzLen (Dave)

167     cycles for 10 * CRT strlen
99      cycles for 10 * Masm32 StrLen
408     cycles for 10 * AlignedStrLen
271     cycles for 10 * MasmBasic Len()
248     cycles for 10 * strlenNidud
293     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

--- ok ---

Gunther
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 06, 2014, 06:51:56 AM

The REPEAT3 added to the align 4 give better result.
here is strlenNidud at wich ,I have added a REPEAT3 macro
I hope ,I have not added errors


align 4
strlenNidudR3 proc string:DWORD
mov ecx, string
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
not eax
and eax, edx
and eax, 80808080h
jnz done
and ecx, -4 ;align 4 ;0FFFFFFFCh ;c=1100b
lup:
REPEAT 3
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
not eax
and eax, edx
and eax, 80808080h
;jz lup
jnz done
ENDM
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
not eax
and eax, edx
and eax, 80808080h
jz lup
done:
bsf eax, eax
shr eax, 3
sub ecx, string
lea eax, [ecx+eax-4]
ret
strlenNidudR3 endp
Title: Re: Improvement of the strlen function
Post by: hutch-- on November 06, 2014, 06:54:48 AM
I wonder at using such small test strings when anything will read a 1k string faster than you need. Even a sloppy old SCAS algo will do that job fast enough. If you are going to test string length algos, it should be on much larger test strings, well into the megabytes and if the comparisons are to be even vaguely relevant to how they get used in code, you need real time testing, not cache thrashing on small strings.
Title: Re: Improvement of the strlen function
Post by: nidud on November 06, 2014, 07:27:27 AM
deleted
Title: Re: Improvement of the strlen function
Post by: dedndave on November 06, 2014, 12:45:48 PM
Hutch - how often do we actually use it that way?
i would guess most calls to StrLen are made for console out strings of 80 characters or less
maybe a pathname - they can be long, but usually less than 100 characters

i can't think of many applications where i wanted to get the length of a long string
maybe that's because i write the code to keep track of the length   ;)
Title: Re: Improvement of the strlen function
Post by: hutch-- on November 06, 2014, 02:47:31 PM
Dave,

For strings of that length I generally use the following style of code.


    mov eax, [esp+4]                ; get 1st arg off stack
    sub eax, 1
  lbl1:
    add eax, 1
    cmp BYTE PTR [eax], 0           ; loop through address testing for ascii 0
    jne lbl1                        ; loop back if not found

    sub eax, [esp+4]                ; sub original address from EAX to get length

    ret 4                           ; balance the stack by 4 bytes


When the timing is less than a millisecond, what is the point of chasing speed that you cannot even perceive ?
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 06, 2014, 07:12:16 PM
Quote from: dedndave on November 06, 2014, 12:45:48 PM
i would guess most calls to StrLen are made for console out strings of 80 characters or less
...
i can't think of many applications where i wanted to get the length of a long string

Indeed, for really long strings you would know the length. Typically, you need a fast strlen algo when processing huge amounts of strings in textfiles, see my post about 33 bytes average length of C header files (http://masm32.com/board/index.php?topic=3736.msg39494#msg39494). Besides, the speed of len() affects all kinds of other functions - for example, Len() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1144) is being used in about half of the ca. 200 MasmBasic functions. That is why Len() should be fast: to process millions of small strings. JWasm, for example, could benefit from a faster len() function.

But of course, we can argue along the philosophical line "why should anybody nowadays indulge in something as obsolete as assembler programming when Microsoft offers 'sufficiently' fast and safe strlen functions such as StringCbLengthA (http://masm32.com/board/index.php?topic=2650.msg28162#msg28162)" ;-)
Title: Re: Improvement of the strlen function
Post by: TouEnMasm on November 06, 2014, 07:48:06 PM

It's a good exercise on how optimise code.
For real fast functions,there is need of SSE intructions and a set of functions using them.
 
Title: Re: Improvement of the strlen function
Post by: hutch-- on November 06, 2014, 09:25:42 PM
Which is fine except they don't run on old machines. I would prefer to be working with SSE 4.1/2 but not enough machines use it yet.
Title: Re: Improvement of the strlen function
Post by: jj2007 on November 06, 2014, 11:20:55 PM
Quote from: hutch-- on November 06, 2014, 09:25:42 PMWhich is fine except they don't run on old machines. I would prefer to be working with SSE 4.1/2 but not enough machines use it yet.

Right. SSE 4.1/2 has some goodies indeed, but in most cases SSE2 does the job, and it works from P4 (introduced in 2001) onwards.
Title: Re: Improvement of the strlen function
Post by: Gunther on November 07, 2014, 05:09:33 AM
Quote from: jj2007 on November 06, 2014, 11:20:55 PM
Right. SSE 4.1/2 has some goodies indeed, but in most cases SSE2 does the job, and it works from P4 (introduced in 2001) onwards.

That's a reasonable point of view.  :t

Gunther