The strlen don't align the data adress and lose time when this one is unalign.
Here data misaligned
Quote
.data
hInstance dd 0
decale BYTE 0 ;break the alignment
longphrase db "BCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl",0
AlignedStrLen proc item:DWORD
mov eax,item ; get pointer to string
lea edx, [eax+3] ; pointer+3 used in the end
push ebx
push edi
mov ebx, 80808080h
;align eax
mov ecx,eax
and ecx,3 ; ecx= x (0 to 3)
neg ecx ;- x
add ecx,sizeof dword ;4-x dans ecx
; jz aligned
;first compare
mov edi, [eax] ; read first 4 bytes
add eax, ecx ; increment pointer and align on dword
lea ecx, [edi-01010101h] ; subtract 1 from each byte 0 FF
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jz aligned
mov eax,item ;reponse
add eax,4
jmp reponse
aligned:
mov edi, [eax] ; read first 4 bytes
add eax, 4 ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte 0 FF 6bff4241
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 on cherche 13, le byte D devient FF
;upper,method to search a byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jz aligned
reponse:
test ecx, 00008080h ; test first two bytes
jnz @F
shr ecx, 16 ; not in the first 2 bytes
add eax, 2
@@:
shl cl, 1 ; use carry flag to avoid branch
sbb eax, edx ; compute length
pop edi
pop ebx
ret
AlignedStrLen endp
There is a very little penalty on short string but a real win of time on others (about 20%)
Looks really competitive :t What is it supposed to return?
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1447 cycles for 10 * CRT strlen
989 cycles for 10 * Masm32 StrLen
262 cycles for 10 * AlignedStrLen
385 cycles for 10 * MasmBasic Len()
1416 cycles for 10 * CRT strlen
987 cycles for 10 * Masm32 StrLen
262 cycles for 10 * AlignedStrLen
386 cycles for 10 * MasmBasic Len()
1422 cycles for 10 * CRT strlen
984 cycles for 10 * Masm32 StrLen
267 cycles for 10 * AlignedStrLen
385 cycles for 10 * MasmBasic Len()
100 = eax CRT strlen
100 = eax Masm32 StrLen
7 = eax AlignedStrLen
100 = eax MasmBasic Len()
P.S.: Results look different, i.e. eax=100, if you activate this line (#104 in the attached source):
; jz aligned ;useless if align ecx == 4
She return the lenght of the chain.
But she can also return the length of a text line terminated by 0d,0a,just change:
Quote
lea ecx, [edi-01010101h] ; subtract 1 from each byte 0 FF
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
by
Quote
; lea ecx, [edi-01010101h] ; subtract 1 from each byte 0 FF
lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
And you have a linelength function where alignment is very useful
"She" does return the correct string length only if:
- the address is aligned
- the address is not aligned and the string length is >= 4.
Yves and Jochen,
timings for you:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
792 cycles for 10 * CRT strlen
607 cycles for 10 * Masm32 StrLen
130 cycles for 10 * AlignedStrLen
250 cycles for 10 * MasmBasic Len()
776 cycles for 10 * CRT strlen
581 cycles for 10 * Masm32 StrLen
318 cycles for 10 * AlignedStrLen
609 cycles for 10 * MasmBasic Len()
774 cycles for 10 * CRT strlen
597 cycles for 10 * Masm32 StrLen
315 cycles for 10 * AlignedStrLen
610 cycles for 10 * MasmBasic Len()
100 = eax CRT strlen
100 = eax Masm32 StrLen
7 = eax AlignedStrLen
100 = eax MasmBasic Len()
--- ok ---
Gunther
Quote
"She" does return the correct string length only if:
- the address is aligned
- the address is not aligned and the string length is >= 4.
corrected
Quote
;first compare
mov edi, [eax] ; read first 4 bytes
add eax, ecx ; increment pointer and align on dword
lea ecx, [edi-01010101h] ; subtract 1 from each byte 0 FF
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jz aligned ;-------------------------- see first post for complet source code
mov eax,item ;reponse ;------------------
add eax,4 ;---------------------
jmp reponse ;-----------------------
aligned:
Timing must not be modified
deleted
deleted
To nidud
You can't align loaded data with the align directive.
The only soluce is to mov the pointer untill he reach an adress multiple of the choosen alignment.
using the corrected version
Get the full code in the first post
The EBP register is replace by EBX,your crash come from here
deleted
I don't know at what you are playing,my code is not like that:
Quote
if I fix your code:
Code: [Select]
mov eax,[esp+4+12] ;reponse ;------------------
be cool showing your code in mnemonic asm,as i do.
Windbg is a good player,he show
Quote
vitesse_strlen!AlignedStrLen+0x3:
0040109b 8b4508 mov eax,dword ptr [ebp+8] ss:0023:0012ffc0={vitesse_strlen!longphrase (00404005)}
Added Nidud's version. They all work fine for a 100 byte string - there is a test for "Hel" in line 413 for those who are interested ;)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1180 cycles for 10 * CRT strlen
1077 cycles for 10 * Masm32 StrLen
1147 cycles for 10 * AlignedStrLen
405 cycles for 10 * MasmBasic Len()
1167 cycles for 10 * strlenNidud
1180 cycles for 10 * CRT strlen
1075 cycles for 10 * Masm32 StrLen
1147 cycles for 10 * AlignedStrLen
405 cycles for 10 * MasmBasic Len()
1174 cycles for 10 * strlenNidud
1180 cycles for 10 * CRT strlen
1077 cycles for 10 * Masm32 StrLen
1147 cycles for 10 * AlignedStrLen
406 cycles for 10 * MasmBasic Len()
1166 cycles for 10 * strlenNidud
100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax AlignedStrLen
100 = eax MasmBasic Len()
100 = eax strlenNidud
P.S.: This is still Yves' first version - #2 crashes:
0040106E ³. 8B45 08 mov eax, [ebp+8]
ebp is 80808080h
btw does anybody have a CPU showing the 20% improvement?
deleted
Jochen your timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
741 cycles for 10 * CRT strlen
627 cycles for 10 * Masm32 StrLen
1344 cycles for 10 * AlignedStrLen
217 cycles for 10 * MasmBasic Len()
730 cycles for 10 * strlenNidud
742 cycles for 10 * CRT strlen
617 cycles for 10 * Masm32 StrLen
1350 cycles for 10 * AlignedStrLen
215 cycles for 10 * MasmBasic Len()
729 cycles for 10 * strlenNidud
125 cycles for 10 * CRT strlen
248 cycles for 10 * Masm32 StrLen
309 cycles for 10 * AlignedStrLen
265 cycles for 10 * MasmBasic Len()
230 cycles for 10 * strlenNidud
123 cycles for 10 * CRT strlen
103 cycles for 10 * Masm32 StrLen
126 cycles for 10 * AlignedStrLen
109 cycles for 10 * MasmBasic Len()
94 cycles for 10 * strlenNidud
4 = eax CRT strlen
4 = eax Masm32 StrLen
4 = eax AlignedStrLen
4 = eax MasmBasic Len()
4 = eax strlenNidud
--- ok ---
Nidud,
your timings:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
380986 cycles - 1000 ( 0) 0: crt_strlen
336201 cycles - 1000 ( 71) 1: aligned 00000020
345548 cycles - 1000 ( 87) 2: aligned 00000030
343273 cycles - 1000 ( 43) 3: unaligned
87093 cycles - 1000 ( 41) 4: SSE
377742 cycles - 1000 ( 0) 0: crt_strlen
335817 cycles - 1000 ( 71) 1: aligned 00000020
347243 cycles - 1000 ( 87) 2: aligned 00000030
341638 cycles - 1000 ( 43) 3: unaligned
87740 cycles - 1000 ( 41) 4: SSE
379494 cycles - 1000 ( 0) 0: crt_strlen
333183 cycles - 1000 ( 71) 1: aligned 00000020
346480 cycles - 1000 ( 87) 2: aligned 00000030
342334 cycles - 1000 ( 43) 3: unaligned
87537 cycles - 1000 ( 41) 4: SSE
--- ok ---
Gunther
deleted
from what i've seen, a lot depends on the length of the strings you are testing
if the strings are short (and they usually are), then the overhead of aligning compares does not pay off
think about it - how often do you want to find the length of a string with more than 100 characters or so?
if you are working with longer strings, this routine will do well
it is adapted from code that was published by Randy Hyde
i made so many changes that i didn't put his name in it - lol
the longer the test strings, the better this will compare with other algos
;***********************************************************************************************
OPTION PROLOGUE:None
OPTION EPILOGUE:None
ALIGN 16
SzLen PROC lpszStr:LPSTR
;SzLen - Get Length of Zero-Terminated String
;DednDave - Dave Sheldon - 8-2010
mov edx,[esp+4]
mov ecx,7F7F7F7Fh
test dl,3
push ebx
jz SzLen2
mov ebx,[edx]
xor eax,eax
or bl,bl
jz SzLen0
inc eax
or bh,bh
jz SzLen0
test dl,2
jnz SzLen1
inc eax
test ebx,0FF0000h
jnz SzLen1
SzLen0: pop ebx
ret 4
SzLen1: or dl,3
inc edx
SzLen2: push esi
push edi
mov esi,01010101h
mov edi,80808080h
SzLen3: mov eax,edx
mov ebx,[eax]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+4]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+8]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+12]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+16]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+20]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+24]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+28]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+32]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jz SzLen3
SzLen4: mov ebx,[edx-4]
lea eax,[edx-4]
or bl,bl
jz SzLen5
inc eax
or bh,bh
jz SzLen5
inc eax
test ebx,0FF0000h
jz SzLen5
inc eax
test ebx,0FF000000h
jnz SzLen3
SzLen5: pop edi
pop esi
sub eax,[esp+8]
pop ebx
ret 4
SzLen ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;***********************************************************************************************
deleted
deleted
Quote from: dedndave on November 03, 2014, 07:49:03 AM
from what i've seen, a lot depends on the length of the strings you are testing
if the strings are short (and they usually are), then the overhead of aligning compares does not pay off
Check the average line length of e.g. the Bible, or The Bible of C Coders, i.e. the header files - see attachment #2, for my VS installation it's 33 bytes ;-)
Attached a test with different lengths. Your code is fast, Dave. For the 10-byte strings, for certain misalignments the winner is Masm32 StrLen, though - greets to Hutch :t
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
100 bytes:
1180 cycles for 10 * CRT strlen
1069 cycles for 10 * Masm32 StrLen
1148 cycles for 10 * AlignedStrLen
393 cycles for 10 * MasmBasic Len()
1167 cycles for 10 * strlenNidud
956 cycles for 10 * SzLen (Dave)
70 bytes:
901 cycles for 10 * CRT strlen
743 cycles for 10 * Masm32 StrLen
964 cycles for 10 * AlignedStrLen
342 cycles for 10 * MasmBasic Len()
1158 cycles for 10 * strlenNidud
704 cycles for 10 * SzLen (Dave)
40 bytes:
616 cycles for 10 * CRT strlen
509 cycles for 10 * Masm32 StrLen
766 cycles for 10 * AlignedStrLen
282 cycles for 10 * MasmBasic Len()
856 cycles for 10 * strlenNidud
546 cycles for 10 * SzLen (Dave)
10 bytes:
254 cycles for 10 * CRT strlen
313 cycles for 10 * Masm32 StrLen
328 cycles for 10 * AlignedStrLen
245 cycles for 10 * MasmBasic Len()
262 cycles for 10 * strlenNidud
289 cycles for 10 * SzLen (Dave)
10 = eax CRT strlen
10 = eax Masm32 StrLen
10 = eax AlignedStrLen
10 = eax MasmBasic Len()
10 = eax strlenNidud
10 = eax SzLen (Dave)
P.S.: The *.asc in attachment #2 is in RTF format, for use with \Masm32\MasmBasic\RichMasm.exe
This line loads an array of filenames - change if your header files are in a different location:
GetFiles ExpandEnv$("%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\Include\*.h")
Jochen,
new timings from StrLenYves3.zip:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
100 bytes:
744 cycles for 10 * CRT strlen
614 cycles for 10 * Masm32 StrLen
875 cycles for 10 * AlignedStrLen
204 cycles for 10 * MasmBasic Len()
733 cycles for 10 * strlenNidud
504 cycles for 10 * SzLen (Dave)
70 bytes:
550 cycles for 10 * CRT strlen
425 cycles for 10 * Masm32 StrLen
571 cycles for 10 * AlignedStrLen
162 cycles for 10 * MasmBasic Len()
1179 cycles for 10 * strlenNidud
518 cycles for 10 * SzLen (Dave)
40 bytes:
847 cycles for 10 * CRT strlen
283 cycles for 10 * Masm32 StrLen
371 cycles for 10 * AlignedStrLen
168 cycles for 10 * MasmBasic Len()
344 cycles for 10 * strlenNidud
275 cycles for 10 * SzLen (Dave)
10 bytes:
149 cycles for 10 * CRT strlen
331 cycles for 10 * Masm32 StrLen
405 cycles for 10 * AlignedStrLen
285 cycles for 10 * MasmBasic Len()
331 cycles for 10 * strlenNidud
363 cycles for 10 * SzLen (Dave)
10 = eax CRT strlen
10 = eax Masm32 StrLen
10 = eax AlignedStrLen
10 = eax MasmBasic Len()
10 = eax strlenNidud
10 = eax SzLen (Dave)
--- ok ---
Gunther
Danke, Gunther :t
It seems there is some problems with the measurement:
Quote
70 bytes:
550 cycles for 10 * CRT strlen
425 cycles for 10 * Masm32 StrLen ;********
571 cycles for 10 * AlignedStrLen
162 cycles for 10 * MasmBasic Len()
1179 cycles for 10 * strlenNidud ;**************
518 cycles for 10 * SzLen (Dave)
40 bytes:
847 cycles for 10 * CRT strlen
283 cycles for 10 * Masm32 StrLen ;**********
371 cycles for 10 * AlignedStrLen
168 cycles for 10 * MasmBasic Len()
344 cycles for 10 * strlenNidud ;***********
275 cycles for 10 * SzLen (Dave)
The win of time given by the alignment must be just linear with the lenght:
Quote
100 bytes:
MasmStrLen :176 micro/S,strlenNidud :124 micro/S
70 bytes:
MasmStrLen :80 micro/S,strlenNidud :78 micro/S
40 bytes:
MasmStrLen :57 micro/S,strlenNidud :45 micro/S
10 bytes:
MasmStrLen :28 micro/S,strlenNidud :28 micro/S
deleted
Quote from: ToutEnMasm on November 05, 2014, 05:30:25 AM
It seems there is some problems with the measurement
Sorry for that. Which CPU? Some produce a certain variance, although it looks pretty stable on mine, i.e. Celeron and i5.
Here is a long version that repeats each size twice. As you can see, that is still your version A which fails for len<4 - if you can post a complete new version, I'll include it.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
800 bytes:
5310 cycles for 10 * CRT strlen
5560 cycles for 10 * Masm32 StrLen
4821 cycles for 10 * AlignedStrLen
1348 cycles for 10 * MasmBasic Len()
4798 cycles for 10 * strlenNidud
3710 cycles for 10 * SzLen (Dave)
5191 cycles for 10 * CRT strlen
5148 cycles for 10 * Masm32 StrLen
4805 cycles for 10 * AlignedStrLen
1352 cycles for 10 * MasmBasic Len()
4802 cycles for 10 * strlenNidud
3712 cycles for 10 * SzLen (Dave)
534 bytes:
3704 cycles for 10 * CRT strlen
3289 cycles for 10 * Masm32 StrLen
3479 cycles for 10 * AlignedStrLen
987 cycles for 10 * MasmBasic Len()
3399 cycles for 10 * strlenNidud
2543 cycles for 10 * SzLen (Dave)
3689 cycles for 10 * CRT strlen
3294 cycles for 10 * Masm32 StrLen
3479 cycles for 10 * AlignedStrLen
989 cycles for 10 * MasmBasic Len()
3401 cycles for 10 * strlenNidud
2538 cycles for 10 * SzLen (Dave)
268 bytes:
2150 cycles for 10 * CRT strlen
1701 cycles for 10 * Masm32 StrLen
2033 cycles for 10 * AlignedStrLen
469 cycles for 10 * MasmBasic Len()
1994 cycles for 10 * strlenNidud
1379 cycles for 10 * SzLen (Dave)
2149 cycles for 10 * CRT strlen
1700 cycles for 10 * Masm32 StrLen
2033 cycles for 10 * AlignedStrLen
469 cycles for 10 * MasmBasic Len()
1992 cycles for 10 * strlenNidud
1375 cycles for 10 * SzLen (Dave)
2 bytes:
130 cycles for 10 * CRT strlen
117 cycles for 10 * Masm32 StrLen
185 cycles for 10 * AlignedStrLen
136 cycles for 10 * MasmBasic Len()
113 cycles for 10 * strlenNidud
147 cycles for 10 * SzLen (Dave)
128 cycles for 10 * CRT strlen
117 cycles for 10 * Masm32 StrLen
187 cycles for 10 * AlignedStrLen
131 cycles for 10 * MasmBasic Len()
116 cycles for 10 * strlenNidud
146 cycles for 10 * SzLen (Dave)
2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
800 bytes:
4903 cycles for 10 * CRT strlen
4959 cycles for 10 * Masm32 StrLen
4658 cycles for 10 * AlignedStrLen
1309 cycles for 10 * MasmBasic Len()
4575 cycles for 10 * strlenNidud
3376 cycles for 10 * SzLen (Dave)
4922 cycles for 10 * CRT strlen
4952 cycles for 10 * Masm32 StrLen
4656 cycles for 10 * AlignedStrLen
1304 cycles for 10 * MasmBasic Len()
4572 cycles for 10 * strlenNidud
3400 cycles for 10 * SzLen (Dave)
534 bytes:
3342 cycles for 10 * CRT strlen
3449 cycles for 10 * Masm32 StrLen
3301 cycles for 10 * AlignedStrLen
933 cycles for 10 * MasmBasic Len()
3248 cycles for 10 * strlenNidud
2356 cycles for 10 * SzLen (Dave)
3364 cycles for 10 * CRT strlen
3433 cycles for 10 * Masm32 StrLen
3307 cycles for 10 * AlignedStrLen
927 cycles for 10 * MasmBasic Len()
3233 cycles for 10 * strlenNidud
2366 cycles for 10 * SzLen (Dave)
268 bytes:
1921 cycles for 10 * CRT strlen
1613 cycles for 10 * Masm32 StrLen
1965 cycles for 10 * AlignedStrLen
444 cycles for 10 * MasmBasic Len()
1903 cycles for 10 * strlenNidud
1234 cycles for 10 * SzLen (Dave)
1920 cycles for 10 * CRT strlen
1625 cycles for 10 * Masm32 StrLen
1960 cycles for 10 * AlignedStrLen
448 cycles for 10 * MasmBasic Len()
1909 cycles for 10 * strlenNidud
1226 cycles for 10 * SzLen (Dave)
2 bytes:
140 cycles for 10 * CRT strlen
115 cycles for 10 * Masm32 StrLen
182 cycles for 10 * AlignedStrLen
123 cycles for 10 * MasmBasic Len()
112 cycles for 10 * strlenNidud
135 cycles for 10 * SzLen (Dave)
219 cycles for 10 * CRT strlen
109 cycles for 10 * Masm32 StrLen
183 cycles for 10 * AlignedStrLen
118 cycles for 10 * MasmBasic Len()
111 cycles for 10 * strlenNidud
132 cycles for 10 * SzLen (Dave)
2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)
--- ok ---
Results from the new laptop for version 4:
Quote
Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz (SSE4)
800 bytes:
3922 cycles for 10 * CRT strlen
3970 cycles for 10 * Masm32 StrLen
4432 cycles for 10 * AlignedStrLen
1024 cycles for 10 * MasmBasic Len()
4626 cycles for 10 * strlenNidud
3684 cycles for 10 * SzLen (Dave)
4734 cycles for 10 * CRT strlen
4776 cycles for 10 * Masm32 StrLen
4849 cycles for 10 * AlignedStrLen
1032 cycles for 10 * MasmBasic Len()
4630 cycles for 10 * strlenNidud
3678 cycles for 10 * SzLen (Dave)
534 bytes:
3606 cycles for 10 * CRT strlen
3523 cycles for 10 * Masm32 StrLen
3858 cycles for 10 * AlignedStrLen
1677 cycles for 10 * MasmBasic Len()
2661 cycles for 10 * strlenNidud
2048 cycles for 10 * SzLen (Dave)
2783 cycles for 10 * CRT strlen
3481 cycles for 10 * Masm32 StrLen
3612 cycles for 10 * AlignedStrLen
1542 cycles for 10 * MasmBasic Len()
2671 cycles for 10 * strlenNidud
2428 cycles for 10 * SzLen (Dave)
268 bytes:
2414 cycles for 10 * CRT strlen
1333 cycles for 10 * Masm32 StrLen
1613 cycles for 10 * AlignedStrLen
955 cycles for 10 * MasmBasic Len()
2017 cycles for 10 * strlenNidud
1075 cycles for 10 * SzLen (Dave)
1600 cycles for 10 * CRT strlen
1311 cycles for 10 * Masm32 StrLen
2447 cycles for 10 * AlignedStrLen
379 cycles for 10 * MasmBasic Len()
2323 cycles for 10 * strlenNidud
1067 cycles for 10 * SzLen (Dave)
2 bytes:
111 cycles for 10 * CRT strlen
300 cycles for 10 * Masm32 StrLen
518 cycles for 10 * AlignedStrLen
335 cycles for 10 * MasmBasic Len()
345 cycles for 10 * strlenNidud
356 cycles for 10 * SzLen (Dave)
314 cycles for 10 * CRT strlen
301 cycles for 10 * Masm32 StrLen
516 cycles for 10 * AlignedStrLen
348 cycles for 10 * MasmBasic Len()
278 cycles for 10 * strlenNidud
350 cycles for 10 * SzLen (Dave)
2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)
--- ok ---
Gunther
Thanks, Gunther & Marinus. Have you noticed the stable i7 results vs unstable i5 results of Gunther? I have no explanation for that :(
Maybe Guthers computer is overclocked ?
I changed the CPU clock speed of my i7 computer.
Instead of overclocking i have lowered my CPU speed a bit to get a very quiet computer.
the loop counts are very low
the tests zip by too fast - lol
from past experience, a loop count that yields a 0.5 second pass results in fairly stable numbers
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
800 bytes:
8240 cycles for 10 * CRT strlen
8984 cycles for 10 * Masm32 StrLen
7501 cycles for 10 * AlignedStrLen
4380 cycles for 10 * MasmBasic Len()
7496 cycles for 10 * strlenNidud
6072 cycles for 10 * SzLen (Dave)
8281 cycles for 10 * CRT strlen
9109 cycles for 10 * Masm32 StrLen
7348 cycles for 10 * AlignedStrLen
4403 cycles for 10 * MasmBasic Len()
7625 cycles for 10 * strlenNidud
6119 cycles for 10 * SzLen (Dave)
534 bytes:
5692 cycles for 10 * CRT strlen
6445 cycles for 10 * Masm32 StrLen
5286 cycles for 10 * AlignedStrLen
3138 cycles for 10 * MasmBasic Len()
5433 cycles for 10 * strlenNidud
4224 cycles for 10 * SzLen (Dave)
5675 cycles for 10 * CRT strlen
6526 cycles for 10 * Masm32 StrLen
5159 cycles for 10 * AlignedStrLen
3163 cycles for 10 * MasmBasic Len()
5430 cycles for 10 * strlenNidud
4195 cycles for 10 * SzLen (Dave)
268 bytes:
3212 cycles for 10 * CRT strlen
3716 cycles for 10 * Masm32 StrLen
2953 cycles for 10 * AlignedStrLen
1573 cycles for 10 * MasmBasic Len()
3052 cycles for 10 * strlenNidud
2492 cycles for 10 * SzLen (Dave)
3309 cycles for 10 * CRT strlen
3580 cycles for 10 * Masm32 StrLen
2981 cycles for 10 * AlignedStrLen
1574 cycles for 10 * MasmBasic Len()
3061 cycles for 10 * strlenNidud
2580 cycles for 10 * SzLen (Dave)
2 bytes:
302 cycles for 10 * CRT strlen
281 cycles for 10 * Masm32 StrLen
320 cycles for 10 * AlignedStrLen
351 cycles for 10 * MasmBasic Len()
246 cycles for 10 * strlenNidud
339 cycles for 10 * SzLen (Dave)
299 cycles for 10 * CRT strlen
264 cycles for 10 * Masm32 StrLen
321 cycles for 10 * AlignedStrLen
351 cycles for 10 * MasmBasic Len()
245 cycles for 10 * strlenNidud
340 cycles for 10 * SzLen (Dave)
Pretty stable timings, especially for a P4, and yours is clearly the best non-SSE algo :t
Hi Marinus,
Quote from: Siekmanski on November 05, 2014, 12:17:44 PM
Maybe Guthers computer is overclocked ?
no, it isn't. It's an Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz 2.30GHz. That's what the System says.
Gunther
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
800 bytes:
4423 cycles for 10 * CRT strlen
4437 cycles for 10 * Masm32 StrLen
4513 cycles for 10 * AlignedStrLen
1381 cycles for 10 * MasmBasic Len()
4670 cycles for 10 * strlenNidud
3220 cycles for 10 * SzLen (Dave)
4424 cycles for 10 * CRT strlen
4440 cycles for 10 * Masm32 StrLen
4521 cycles for 10 * AlignedStrLen
1646 cycles for 10 * MasmBasic Len()
4313 cycles for 10 * strlenNidud
3584 cycles for 10 * SzLen (Dave)
534 bytes:
3351 cycles for 10 * CRT strlen
3429 cycles for 10 * Masm32 StrLen
3507 cycles for 10 * AlignedStrLen
950 cycles for 10 * MasmBasic Len()
2988 cycles for 10 * strlenNidud
2583 cycles for 10 * SzLen (Dave)
3115 cycles for 10 * CRT strlen
3067 cycles for 10 * Masm32 StrLen
3153 cycles for 10 * AlignedStrLen
939 cycles for 10 * MasmBasic Len()
2996 cycles for 10 * strlenNidud
2587 cycles for 10 * SzLen (Dave)
268 bytes:
1806 cycles for 10 * CRT strlen
1818 cycles for 10 * Masm32 StrLen
2177 cycles for 10 * AlignedStrLen
703 cycles for 10 * MasmBasic Len()
1829 cycles for 10 * strlenNidud
1411 cycles for 10 * SzLen (Dave)
2178 cycles for 10 * CRT strlen
1816 cycles for 10 * Masm32 StrLen
2164 cycles for 10 * AlignedStrLen
699 cycles for 10 * MasmBasic Len()
1835 cycles for 10 * strlenNidud
1190 cycles for 10 * SzLen (Dave)
2 bytes:
109 cycles for 10 * CRT strlen
104 cycles for 10 * Masm32 StrLen
379 cycles for 10 * AlignedStrLen
247 cycles for 10 * MasmBasic Len()
205 cycles for 10 * strlenNidud
284 cycles for 10 * SzLen (Dave)
232 cycles for 10 * CRT strlen
223 cycles for 10 * Masm32 StrLen
180 cycles for 10 * AlignedStrLen
246 cycles for 10 * MasmBasic Len()
203 cycles for 10 * strlenNidud
259 cycles for 10 * SzLen (Dave)
2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)
--- ok --- Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
800 bytes:
7784 cycles for 10 * CRT strlen
7381 cycles for 10 * Masm32 StrLen
6988 cycles for 10 * AlignedStrLen
2023 cycles for 10 * MasmBasic Len()
7103 cycles for 10 * strlenNidud
6031 cycles for 10 * SzLen (Dave)
7759 cycles for 10 * CRT strlen
7381 cycles for 10 * Masm32 StrLen
6958 cycles for 10 * AlignedStrLen
2020 cycles for 10 * MasmBasic Len()
7104 cycles for 10 * strlenNidud
6023 cycles for 10 * SzLen (Dave)
534 bytes:
5412 cycles for 10 * CRT strlen
5033 cycles for 10 * Masm32 StrLen
4861 cycles for 10 * AlignedStrLen
1412 cycles for 10 * MasmBasic Len()
4844 cycles for 10 * strlenNidud
4062 cycles for 10 * SzLen (Dave)
5430 cycles for 10 * CRT strlen
4975 cycles for 10 * Masm32 StrLen
4833 cycles for 10 * AlignedStrLen
1446 cycles for 10 * MasmBasic Len()
4847 cycles for 10 * strlenNidud
4053 cycles for 10 * SzLen (Dave)
268 bytes:
2908 cycles for 10 * CRT strlen
2619 cycles for 10 * Masm32 StrLen
2640 cycles for 10 * AlignedStrLen
852 cycles for 10 * MasmBasic Len()
2635 cycles for 10 * strlenNidud
2181 cycles for 10 * SzLen (Dave)
2906 cycles for 10 * CRT strlen
2595 cycles for 10 * Masm32 StrLen
2673 cycles for 10 * AlignedStrLen
853 cycles for 10 * MasmBasic Len()
2633 cycles for 10 * strlenNidud
2178 cycles for 10 * SzLen (Dave)
2 bytes:
181 cycles for 10 * CRT strlen
135 cycles for 10 * Masm32 StrLen
213 cycles for 10 * AlignedStrLen
183 cycles for 10 * MasmBasic Len()
129 cycles for 10 * strlenNidud
220 cycles for 10 * SzLen (Dave)
181 cycles for 10 * CRT strlen
135 cycles for 10 * Masm32 StrLen
213 cycles for 10 * AlignedStrLen
184 cycles for 10 * MasmBasic Len()
129 cycles for 10 * strlenNidud
220 cycles for 10 * SzLen (Dave)
2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)
--- ok ---
Here's the result from the desktop computer:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
800 bytes:
4500 cycles for 10 * CRT strlen
4552 cycles for 10 * Masm32 StrLen
4264 cycles for 10 * AlignedStrLen
1202 cycles for 10 * MasmBasic Len()
4191 cycles for 10 * strlenNidud
3114 cycles for 10 * SzLen (Dave)
4511 cycles for 10 * CRT strlen
4594 cycles for 10 * Masm32 StrLen
4268 cycles for 10 * AlignedStrLen
1193 cycles for 10 * MasmBasic Len()
4214 cycles for 10 * strlenNidud
3117 cycles for 10 * SzLen (Dave)
534 bytes:
3249 cycles for 10 * CRT strlen
3216 cycles for 10 * Masm32 StrLen
3033 cycles for 10 * AlignedStrLen
858 cycles for 10 * MasmBasic Len()
2981 cycles for 10 * strlenNidud
2163 cycles for 10 * SzLen (Dave)
3251 cycles for 10 * CRT strlen
3777 cycles for 10 * Masm32 StrLen
3031 cycles for 10 * AlignedStrLen
877 cycles for 10 * MasmBasic Len()
2987 cycles for 10 * strlenNidud
2171 cycles for 10 * SzLen (Dave)
268 bytes:
1846 cycles for 10 * CRT strlen
1496 cycles for 10 * Masm32 StrLen
1815 cycles for 10 * AlignedStrLen
400 cycles for 10 * MasmBasic Len()
1750 cycles for 10 * strlenNidud
1709 cycles for 10 * SzLen (Dave)
1837 cycles for 10 * CRT strlen
1488 cycles for 10 * Masm32 StrLen
1809 cycles for 10 * AlignedStrLen
406 cycles for 10 * MasmBasic Len()
1755 cycles for 10 * strlenNidud
1124 cycles for 10 * SzLen (Dave)
2 bytes:
106 cycles for 10 * CRT strlen
247 cycles for 10 * Masm32 StrLen
404 cycles for 10 * AlignedStrLen
268 cycles for 10 * MasmBasic Len()
247 cycles for 10 * strlenNidud
296 cycles for 10 * SzLen (Dave)
167 cycles for 10 * CRT strlen
99 cycles for 10 * Masm32 StrLen
408 cycles for 10 * AlignedStrLen
271 cycles for 10 * MasmBasic Len()
248 cycles for 10 * strlenNidud
293 cycles for 10 * SzLen (Dave)
2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)
--- ok ---
Gunther
The REPEAT3 added to the align 4 give better result.
here is strlenNidud at wich ,I have added a REPEAT3 macro
I hope ,I have not added errors
align 4
strlenNidudR3 proc string:DWORD
mov ecx, string
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
not eax
and eax, edx
and eax, 80808080h
jnz done
and ecx, -4 ;align 4 ;0FFFFFFFCh ;c=1100b
lup:
REPEAT 3
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
not eax
and eax, edx
and eax, 80808080h
;jz lup
jnz done
ENDM
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea ecx, [edi+0F1F1F1F1h] ;F1 = FF - 0D - 1 search D(13), D devient FF
not eax
and eax, edx
and eax, 80808080h
jz lup
done:
bsf eax, eax
shr eax, 3
sub ecx, string
lea eax, [ecx+eax-4]
ret
strlenNidudR3 endp
I wonder at using such small test strings when anything will read a 1k string faster than you need. Even a sloppy old SCAS algo will do that job fast enough. If you are going to test string length algos, it should be on much larger test strings, well into the megabytes and if the comparisons are to be even vaguely relevant to how they get used in code, you need real time testing, not cache thrashing on small strings.
deleted
Hutch - how often do we actually use it that way?
i would guess most calls to StrLen are made for console out strings of 80 characters or less
maybe a pathname - they can be long, but usually less than 100 characters
i can't think of many applications where i wanted to get the length of a long string
maybe that's because i write the code to keep track of the length ;)
Dave,
For strings of that length I generally use the following style of code.
mov eax, [esp+4] ; get 1st arg off stack
sub eax, 1
lbl1:
add eax, 1
cmp BYTE PTR [eax], 0 ; loop through address testing for ascii 0
jne lbl1 ; loop back if not found
sub eax, [esp+4] ; sub original address from EAX to get length
ret 4 ; balance the stack by 4 bytes
When the timing is less than a millisecond, what is the point of chasing speed that you cannot even perceive ?
Quote from: dedndave on November 06, 2014, 12:45:48 PM
i would guess most calls to StrLen are made for console out strings of 80 characters or less
...
i can't think of many applications where i wanted to get the length of a long string
Indeed, for really long strings you would know the length. Typically, you need a fast strlen algo when processing huge amounts of strings in textfiles, see my post about 33 bytes average length of C header files (http://masm32.com/board/index.php?topic=3736.msg39494#msg39494). Besides, the speed of len() affects all kinds of other functions - for example, Len() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1144) is being used in about half of the ca. 200 MasmBasic functions.
That is why Len() should be fast: to process millions of small strings. JWasm, for example, could benefit from a faster len() function.
But of course, we can argue along the philosophical line "why should anybody nowadays indulge in something as obsolete as assembler programming when Microsoft offers 'sufficiently' fast and safe strlen functions such as StringCbLengthA (http://masm32.com/board/index.php?topic=2650.msg28162#msg28162)" ;-)
It's a good exercise on how optimise code.
For real fast functions,there is need of SSE intructions and a set of functions using them.
Which is fine except they don't run on old machines. I would prefer to be working with SSE 4.1/2 but not enough machines use it yet.
Quote from: hutch-- on November 06, 2014, 09:25:42 PMWhich is fine except they don't run on old machines. I would prefer to be working with SSE 4.1/2 but not enough machines use it yet.
Right. SSE 4.1/2 has some goodies indeed, but in most cases SSE2 does the job, and it works from P4 (introduced in 2001) onwards.
Quote from: jj2007 on November 06, 2014, 11:20:55 PM
Right. SSE 4.1/2 has some goodies indeed, but in most cases SSE2 does the job, and it works from P4 (introduced in 2001) onwards.
That's a reasonable point of view. :t
Gunther