News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Improvement of the strlen function

Started by TouEnMasm, November 02, 2014, 07:24:13 PM

Previous topic - Next topic

dedndave

from what i've seen, a lot depends on the length of the strings you are testing
if the strings are short (and they usually are), then the overhead of aligning compares does not pay off

think about it - how often do you want to find the length of a string with more than 100 characters or so?

if you are working with longer strings, this routine will do well
it is adapted from code that was published by Randy Hyde
i made so many changes that i didn't put his name in it - lol

the longer the test strings, the better this will compare with other algos

;***********************************************************************************************

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

        ALIGN   16

SzLen   PROC    lpszStr:LPSTR

;SzLen - Get Length of Zero-Terminated String
;DednDave - Dave Sheldon - 8-2010

        mov     edx,[esp+4]
        mov     ecx,7F7F7F7Fh
        test    dl,3
        push    ebx
        jz      SzLen2

        mov     ebx,[edx]
        xor     eax,eax
        or      bl,bl
        jz      SzLen0

        inc     eax
        or      bh,bh
        jz      SzLen0

        test    dl,2
        jnz     SzLen1

        inc     eax
        test    ebx,0FF0000h
        jnz     SzLen1

SzLen0: pop     ebx
        ret     4

SzLen1: or      dl,3
        inc     edx

SzLen2: push    esi
        push    edi
        mov     esi,01010101h
        mov     edi,80808080h

SzLen3: mov     eax,edx
        mov     ebx,[eax]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+4]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+8]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+12]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+16]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+20]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+24]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+28]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+32]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jz      SzLen3

SzLen4: mov     ebx,[edx-4]
        lea     eax,[edx-4]
        or      bl,bl
        jz      SzLen5

        inc     eax
        or      bh,bh
        jz      SzLen5

        inc     eax
        test    ebx,0FF0000h
        jz      SzLen5

        inc     eax
        test    ebx,0FF000000h
        jnz     SzLen3

SzLen5: pop     edi
        pop     esi
        sub     eax,[esp+8]
        pop     ebx
        ret     4

SzLen   ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;***********************************************************************************************

nidud

#16
deleted

nidud

#17
deleted

jj2007

Quote from: dedndave on November 03, 2014, 07:49:03 AM
from what i've seen, a lot depends on the length of the strings you are testing
if the strings are short (and they usually are), then the overhead of aligning compares does not pay off

Check the average line length of e.g. the Bible, or The Bible of C Coders, i.e. the header files - see attachment #2, for my VS installation it's 33 bytes  ;-)

Attached a test with different lengths. Your code is fast, Dave. For the 10-byte strings, for certain misalignments the winner is Masm32 StrLen, though - greets to Hutch :t


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

100 bytes:
1180    cycles for 10 * CRT strlen
1069    cycles for 10 * Masm32 StrLen
1148    cycles for 10 * AlignedStrLen
393     cycles for 10 * MasmBasic Len()
1167    cycles for 10 * strlenNidud
956     cycles for 10 * SzLen (Dave)

70 bytes:
901     cycles for 10 * CRT strlen
743     cycles for 10 * Masm32 StrLen
964     cycles for 10 * AlignedStrLen
342     cycles for 10 * MasmBasic Len()
1158    cycles for 10 * strlenNidud
704     cycles for 10 * SzLen (Dave)

40 bytes:
616     cycles for 10 * CRT strlen
509     cycles for 10 * Masm32 StrLen
766     cycles for 10 * AlignedStrLen
282     cycles for 10 * MasmBasic Len()
856     cycles for 10 * strlenNidud
546     cycles for 10 * SzLen (Dave)

10 bytes:
254     cycles for 10 * CRT strlen
313     cycles for 10 * Masm32 StrLen
328     cycles for 10 * AlignedStrLen
245     cycles for 10 * MasmBasic Len()
262     cycles for 10 * strlenNidud
289     cycles for 10 * SzLen (Dave)

10      = eax CRT strlen
10      = eax Masm32 StrLen
10      = eax AlignedStrLen
10      = eax MasmBasic Len()
10      = eax strlenNidud
10      = eax SzLen (Dave)


P.S.: The *.asc in attachment #2 is in RTF format, for use with \Masm32\MasmBasic\RichMasm.exe

This line loads an array of filenames - change if your header files are in a different location:
GetFiles ExpandEnv$("%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\Include\*.h")

Gunther

Jochen,

new timings from StrLenYves3.zip:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

100 bytes:
744     cycles for 10 * CRT strlen
614     cycles for 10 * Masm32 StrLen
875     cycles for 10 * AlignedStrLen
204     cycles for 10 * MasmBasic Len()
733     cycles for 10 * strlenNidud
504     cycles for 10 * SzLen (Dave)

70 bytes:
550     cycles for 10 * CRT strlen
425     cycles for 10 * Masm32 StrLen
571     cycles for 10 * AlignedStrLen
162     cycles for 10 * MasmBasic Len()
1179    cycles for 10 * strlenNidud
518     cycles for 10 * SzLen (Dave)

40 bytes:
847     cycles for 10 * CRT strlen
283     cycles for 10 * Masm32 StrLen
371     cycles for 10 * AlignedStrLen
168     cycles for 10 * MasmBasic Len()
344     cycles for 10 * strlenNidud
275     cycles for 10 * SzLen (Dave)

10 bytes:
149     cycles for 10 * CRT strlen
331     cycles for 10 * Masm32 StrLen
405     cycles for 10 * AlignedStrLen
285     cycles for 10 * MasmBasic Len()
331     cycles for 10 * strlenNidud
363     cycles for 10 * SzLen (Dave)

10      = eax CRT strlen
10      = eax Masm32 StrLen
10      = eax AlignedStrLen
10      = eax MasmBasic Len()
10      = eax strlenNidud
10      = eax SzLen (Dave)

--- ok ---

Gunther
You have to know the facts before you can distort them.

jj2007


Gunther

You have to know the facts before you can distort them.

TouEnMasm


It seems there is some problems with the measurement:
Quote
70 bytes:
550     cycles for 10 * CRT strlen
425     cycles for 10 * Masm32 StrLen ;********
571     cycles for 10 * AlignedStrLen
162     cycles for 10 * MasmBasic Len()
1179    cycles for 10 * strlenNidud   ;**************
518     cycles for 10 * SzLen (Dave)
40 bytes:
847     cycles for 10 * CRT strlen
283     cycles for 10 * Masm32 StrLen ;**********
371     cycles for 10 * AlignedStrLen
168     cycles for 10 * MasmBasic Len()
344     cycles for 10 * strlenNidud  ;***********
275     cycles for 10 * SzLen (Dave)
The win of time given by the alignment must be just linear with the lenght:
Quote
100 bytes:
MasmStrLen :176 micro/S,strlenNidud :124 micro/S
70 bytes:
MasmStrLen :80 micro/S,strlenNidud :78 micro/S
40 bytes:
MasmStrLen :57 micro/S,strlenNidud :45 micro/S
10 bytes:
MasmStrLen :28 micro/S,strlenNidud :28 micro/S



Fa is a musical note to play with CL

nidud

#23
deleted

jj2007

Quote from: ToutEnMasm on November 05, 2014, 05:30:25 AM
It seems there is some problems with the measurement

Sorry for that. Which CPU? Some produce a certain variance, although it looks pretty stable on mine, i.e. Celeron and i5.
Here is a long version that repeats each size twice. As you can see, that is still your version A which fails for len<4 - if you can post a complete new version, I'll include it.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

800 bytes:
5310    cycles for 10 * CRT strlen
5560    cycles for 10 * Masm32 StrLen
4821    cycles for 10 * AlignedStrLen
1348    cycles for 10 * MasmBasic Len()
4798    cycles for 10 * strlenNidud
3710    cycles for 10 * SzLen (Dave)

5191    cycles for 10 * CRT strlen
5148    cycles for 10 * Masm32 StrLen
4805    cycles for 10 * AlignedStrLen
1352    cycles for 10 * MasmBasic Len()
4802    cycles for 10 * strlenNidud
3712    cycles for 10 * SzLen (Dave)

534 bytes:
3704    cycles for 10 * CRT strlen
3289    cycles for 10 * Masm32 StrLen
3479    cycles for 10 * AlignedStrLen
987     cycles for 10 * MasmBasic Len()
3399    cycles for 10 * strlenNidud
2543    cycles for 10 * SzLen (Dave)

3689    cycles for 10 * CRT strlen
3294    cycles for 10 * Masm32 StrLen
3479    cycles for 10 * AlignedStrLen
989     cycles for 10 * MasmBasic Len()
3401    cycles for 10 * strlenNidud
2538    cycles for 10 * SzLen (Dave)

268 bytes:
2150    cycles for 10 * CRT strlen
1701    cycles for 10 * Masm32 StrLen
2033    cycles for 10 * AlignedStrLen
469     cycles for 10 * MasmBasic Len()
1994    cycles for 10 * strlenNidud
1379    cycles for 10 * SzLen (Dave)

2149    cycles for 10 * CRT strlen
1700    cycles for 10 * Masm32 StrLen
2033    cycles for 10 * AlignedStrLen
469     cycles for 10 * MasmBasic Len()
1992    cycles for 10 * strlenNidud
1375    cycles for 10 * SzLen (Dave)

2 bytes:
130     cycles for 10 * CRT strlen
117     cycles for 10 * Masm32 StrLen
185     cycles for 10 * AlignedStrLen
136     cycles for 10 * MasmBasic Len()
113     cycles for 10 * strlenNidud
147     cycles for 10 * SzLen (Dave)

128     cycles for 10 * CRT strlen
117     cycles for 10 * Masm32 StrLen
187     cycles for 10 * AlignedStrLen
131     cycles for 10 * MasmBasic Len()
116     cycles for 10 * strlenNidud
146     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

800 bytes:
4903    cycles for 10 * CRT strlen
4959    cycles for 10 * Masm32 StrLen
4658    cycles for 10 * AlignedStrLen
1309    cycles for 10 * MasmBasic Len()
4575    cycles for 10 * strlenNidud
3376    cycles for 10 * SzLen (Dave)

4922    cycles for 10 * CRT strlen
4952    cycles for 10 * Masm32 StrLen
4656    cycles for 10 * AlignedStrLen
1304    cycles for 10 * MasmBasic Len()
4572    cycles for 10 * strlenNidud
3400    cycles for 10 * SzLen (Dave)

534 bytes:
3342    cycles for 10 * CRT strlen
3449    cycles for 10 * Masm32 StrLen
3301    cycles for 10 * AlignedStrLen
933     cycles for 10 * MasmBasic Len()
3248    cycles for 10 * strlenNidud
2356    cycles for 10 * SzLen (Dave)

3364    cycles for 10 * CRT strlen
3433    cycles for 10 * Masm32 StrLen
3307    cycles for 10 * AlignedStrLen
927     cycles for 10 * MasmBasic Len()
3233    cycles for 10 * strlenNidud
2366    cycles for 10 * SzLen (Dave)

268 bytes:
1921    cycles for 10 * CRT strlen
1613    cycles for 10 * Masm32 StrLen
1965    cycles for 10 * AlignedStrLen
444     cycles for 10 * MasmBasic Len()
1903    cycles for 10 * strlenNidud
1234    cycles for 10 * SzLen (Dave)

1920    cycles for 10 * CRT strlen
1625    cycles for 10 * Masm32 StrLen
1960    cycles for 10 * AlignedStrLen
448     cycles for 10 * MasmBasic Len()
1909    cycles for 10 * strlenNidud
1226    cycles for 10 * SzLen (Dave)

2 bytes:
140     cycles for 10 * CRT strlen
115     cycles for 10 * Masm32 StrLen
182     cycles for 10 * AlignedStrLen
123     cycles for 10 * MasmBasic Len()
112     cycles for 10 * strlenNidud
135     cycles for 10 * SzLen (Dave)

219     cycles for 10 * CRT strlen
109     cycles for 10 * Masm32 StrLen
183     cycles for 10 * AlignedStrLen
118     cycles for 10 * MasmBasic Len()
111     cycles for 10 * strlenNidud
132     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

--- ok ---
Creative coders use backward thinking techniques as a strategy.

Gunther

Results from the new laptop for version 4:
Quote
Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz (SSE4)

800 bytes:
3922    cycles for 10 * CRT strlen
3970    cycles for 10 * Masm32 StrLen
4432    cycles for 10 * AlignedStrLen
1024    cycles for 10 * MasmBasic Len()
4626    cycles for 10 * strlenNidud
3684    cycles for 10 * SzLen (Dave)

4734    cycles for 10 * CRT strlen
4776    cycles for 10 * Masm32 StrLen
4849    cycles for 10 * AlignedStrLen
1032    cycles for 10 * MasmBasic Len()
4630    cycles for 10 * strlenNidud
3678    cycles for 10 * SzLen (Dave)

534 bytes:
3606    cycles for 10 * CRT strlen
3523    cycles for 10 * Masm32 StrLen
3858    cycles for 10 * AlignedStrLen
1677    cycles for 10 * MasmBasic Len()
2661    cycles for 10 * strlenNidud
2048    cycles for 10 * SzLen (Dave)

2783    cycles for 10 * CRT strlen
3481    cycles for 10 * Masm32 StrLen
3612    cycles for 10 * AlignedStrLen
1542    cycles for 10 * MasmBasic Len()
2671    cycles for 10 * strlenNidud
2428    cycles for 10 * SzLen (Dave)

268 bytes:
2414    cycles for 10 * CRT strlen
1333    cycles for 10 * Masm32 StrLen
1613    cycles for 10 * AlignedStrLen
955     cycles for 10 * MasmBasic Len()
2017    cycles for 10 * strlenNidud
1075    cycles for 10 * SzLen (Dave)

1600    cycles for 10 * CRT strlen
1311    cycles for 10 * Masm32 StrLen
2447    cycles for 10 * AlignedStrLen
379     cycles for 10 * MasmBasic Len()
2323    cycles for 10 * strlenNidud
1067    cycles for 10 * SzLen (Dave)

2 bytes:
111     cycles for 10 * CRT strlen
300     cycles for 10 * Masm32 StrLen
518     cycles for 10 * AlignedStrLen
335     cycles for 10 * MasmBasic Len()
345     cycles for 10 * strlenNidud
356     cycles for 10 * SzLen (Dave)

314     cycles for 10 * CRT strlen
301     cycles for 10 * Masm32 StrLen
516     cycles for 10 * AlignedStrLen
348     cycles for 10 * MasmBasic Len()
278     cycles for 10 * strlenNidud
350     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

--- ok ---

Gunther
You have to know the facts before you can distort them.

jj2007

Thanks, Gunther & Marinus. Have you noticed the stable i7 results vs unstable i5 results of Gunther? I have no explanation for that :(

Siekmanski

Maybe Guthers computer is overclocked ?
I changed the CPU clock speed of my i7 computer.
Instead of overclocking i have lowered my CPU speed a bit to get a very quiet computer.
Creative coders use backward thinking techniques as a strategy.

dedndave

the loop counts are very low
the tests zip by too fast - lol
from past experience, a loop count that yields a 0.5 second pass results in fairly stable numbers