News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Improvement of the strlen function

Started by TouEnMasm, November 02, 2014, 07:24:13 PM

Previous topic - Next topic

TouEnMasm

The strlen don't align the data adress and lose time when this one is unalign.
Here data misaligned
Quote
   .data
   hInstance dd 0
   decale BYTE 0           ;break the alignment
   longphrase    db "BCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl",0


AlignedStrLen proc item:DWORD
    mov     eax,item            ; get pointer to string
    lea     edx, [eax+3]            ; pointer+3 used in the end
    push    ebx
    push    edi
    mov     ebx, 80808080h
    ;align eax
    mov ecx,eax
    and ecx,3 ; ecx= x (0 to 3)
    neg ecx ;- x
    add ecx,sizeof dword    ;4-x dans ecx
   ; jz aligned
    ;first compare
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, ecx                ; increment pointer and align on dword
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebx
    jz  aligned   
    mov eax,item ;reponse       
    add eax,4
    jmp reponse
  aligned:     
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF 6bff4241
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 on cherche 13, le byte D devient FF
    ;upper,method to search a byte
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebx
    jz     aligned   
reponse:
    test    ecx, 00008080h          ; test first two bytes
    jnz     @F
    shr     ecx, 16                 ; not in the first 2 bytes
    add     eax, 2
  @@:
    shl     cl, 1                   ; use carry flag to avoid branch
    sbb     eax, edx                ; compute length
    pop     edi
    pop     ebx
    ret   
AlignedStrLen endp

There is a very little penalty on short string  but a real win of time on others (about 20%)


Fa is a musical note to play with CL

jj2007

Looks really competitive :t What is it supposed to return?

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1447    cycles for 10 * CRT strlen
989     cycles for 10 * Masm32 StrLen
262     cycles for 10 * AlignedStrLen
385     cycles for 10 * MasmBasic Len()

1416    cycles for 10 * CRT strlen
987     cycles for 10 * Masm32 StrLen
262     cycles for 10 * AlignedStrLen
386     cycles for 10 * MasmBasic Len()

1422    cycles for 10 * CRT strlen
984     cycles for 10 * Masm32 StrLen
267     cycles for 10 * AlignedStrLen
385     cycles for 10 * MasmBasic Len()

100     = eax CRT strlen
100     = eax Masm32 StrLen
7       = eax AlignedStrLen
100     = eax MasmBasic Len()


P.S.: Results look different, i.e. eax=100, if you activate this line (#104 in the attached source):
; jz     aligned             ;useless if align ecx == 4

TouEnMasm


She return the lenght of the chain.
But she can also return  the length of a text line terminated by 0d,0a,just change:
Quote
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
by
Quote
   ; lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
And you have a linelength function where alignment is very useful

Fa is a musical note to play with CL

qWord

"She" does return the correct string length only if:
- the address is aligned
- the address is not aligned and the string length is >= 4.
MREAL macros - when you need floating point arithmetic while assembling!

Gunther

Yves and Jochen,

timings for you:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

792     cycles for 10 * CRT strlen
607     cycles for 10 * Masm32 StrLen
130     cycles for 10 * AlignedStrLen
250     cycles for 10 * MasmBasic Len()

776     cycles for 10 * CRT strlen
581     cycles for 10 * Masm32 StrLen
318     cycles for 10 * AlignedStrLen
609     cycles for 10 * MasmBasic Len()

774     cycles for 10 * CRT strlen
597     cycles for 10 * Masm32 StrLen
315     cycles for 10 * AlignedStrLen
610     cycles for 10 * MasmBasic Len()

100     = eax CRT strlen
100     = eax Masm32 StrLen
7       = eax AlignedStrLen
100     = eax MasmBasic Len()

--- ok ---

Gunther
You have to know the facts before you can distort them.

TouEnMasm

Quote
"She" does return the correct string length only if:
- the address is aligned
- the address is not aligned and the string length is >= 4.
corrected
Quote
    ;first compare
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, ecx                ; increment pointer and align on dword
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte 0 FF
    ;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebx
    jz  aligned                                      ;-------------------------- see first post for complet source code   
    mov eax,item ;reponse                   ;------------------
    add eax,4                                      ;---------------------
    jmp reponse                                   ;-----------------------
  aligned:     
Timing must not be modified

Fa is a musical note to play with CL

nidud

#6
deleted

nidud

#7
deleted

TouEnMasm

 To nidud
You can't align loaded data with the align directive.
The only soluce is to mov the pointer untill he reach an adress multiple of the choosen alignment.
using the corrected version
Get the full code in the first post
The EBP register is replace by EBX,your crash come from here


Fa is a musical note to play with CL

nidud

#9
deleted

TouEnMasm

I don't know at what you are playing,my code is not like that:
Quote
if I fix your code:
Code: [Select]
    mov eax,[esp+4+12] ;reponse          ;------------------
be cool showing your code in mnemonic asm,as i do.

Windbg is a good player,he show
Quote
vitesse_strlen!AlignedStrLen+0x3:
0040109b 8b4508          mov     eax,dword ptr [ebp+8] ss:0023:0012ffc0={vitesse_strlen!longphrase (00404005)}
Fa is a musical note to play with CL

jj2007

Added Nidud's version. They all work fine for a 100 byte string - there is a test for "Hel" in line 413 for those who are interested  ;)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1180    cycles for 10 * CRT strlen
1077    cycles for 10 * Masm32 StrLen
1147    cycles for 10 * AlignedStrLen
405     cycles for 10 * MasmBasic Len()
1167    cycles for 10 * strlenNidud

1180    cycles for 10 * CRT strlen
1075    cycles for 10 * Masm32 StrLen
1147    cycles for 10 * AlignedStrLen
405     cycles for 10 * MasmBasic Len()
1174    cycles for 10 * strlenNidud

1180    cycles for 10 * CRT strlen
1077    cycles for 10 * Masm32 StrLen
1147    cycles for 10 * AlignedStrLen
406     cycles for 10 * MasmBasic Len()
1166    cycles for 10 * strlenNidud

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax AlignedStrLen
100     = eax MasmBasic Len()
100     = eax strlenNidud


P.S.: This is still Yves' first version - #2 crashes:
0040106E   ³.  8B45 08        mov eax, [ebp+8]
ebp is 80808080h

btw does anybody have a CPU showing the 20% improvement?

nidud

#12
deleted

Gunther

Jochen your timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

741     cycles for 10 * CRT strlen
627     cycles for 10 * Masm32 StrLen
1344    cycles for 10 * AlignedStrLen
217     cycles for 10 * MasmBasic Len()
730     cycles for 10 * strlenNidud

742     cycles for 10 * CRT strlen
617     cycles for 10 * Masm32 StrLen
1350    cycles for 10 * AlignedStrLen
215     cycles for 10 * MasmBasic Len()
729     cycles for 10 * strlenNidud

125     cycles for 10 * CRT strlen
248     cycles for 10 * Masm32 StrLen
309     cycles for 10 * AlignedStrLen
265     cycles for 10 * MasmBasic Len()
230     cycles for 10 * strlenNidud

123     cycles for 10 * CRT strlen
103     cycles for 10 * Masm32 StrLen
126     cycles for 10 * AlignedStrLen
109     cycles for 10 * MasmBasic Len()
94      cycles for 10 * strlenNidud

4       = eax CRT strlen
4       = eax Masm32 StrLen
4       = eax AlignedStrLen
4       = eax MasmBasic Len()
4       = eax strlenNidud

--- ok ---

Nidud,

your timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
380986    cycles - 1000 (  0) 0: crt_strlen
336201    cycles - 1000 ( 71) 1: aligned 00000020
345548    cycles - 1000 ( 87) 2: aligned 00000030
343273    cycles - 1000 ( 43) 3: unaligned
87093     cycles - 1000 ( 41) 4: SSE

377742    cycles - 1000 (  0) 0: crt_strlen
335817    cycles - 1000 ( 71) 1: aligned 00000020
347243    cycles - 1000 ( 87) 2: aligned 00000030
341638    cycles - 1000 ( 43) 3: unaligned
87740     cycles - 1000 ( 41) 4: SSE

379494    cycles - 1000 (  0) 0: crt_strlen
333183    cycles - 1000 ( 71) 1: aligned 00000020
346480    cycles - 1000 ( 87) 2: aligned 00000030
342334    cycles - 1000 ( 43) 3: unaligned
87537     cycles - 1000 ( 41) 4: SSE
--- ok ---

Gunther
You have to know the facts before you can distort them.

nidud

#14
deleted