Hi all,
I found these 4 procedures to get the string length:
strlen32 <- Author: Agner Fog
strlen32M <- is strlen32 modified by RuiLoureiro
GetStringLenX <- RuiLoureiro
szLen <- MASM
where the best one is just this: ;)
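OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
GetStringLenX proc pStr:DWORD
mov edx, [esp+4] ; pStr (no stack frame, so the argument is at esp+4)
xor eax, eax ; index = 0, also sets ZF
jz short @F ; always taken: skip the first increment
_begin0: add eax, 1 ; next index
@@: movzx ecx, byte ptr [edx+eax] ; load the current byte
or ecx, ecx ; is it the terminator ?
jnz short _begin0 ; no: keep scanning
ret 4 ; eax = length
GetStringLenX endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef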
I did some tests and got the following results
for all 4 cases (you should draw your own conclusions).
Could you run TestString32.exe and post the results here ?
Thanks
note I: Thanks Dave for that one.
note II: Jochen, it seems that it is not for you
because...
CASE 1 ------------------------------------------------
the addresses are aligned
------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16
15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
34 milliseconds, szLen - _string02- length=32
35 milliseconds, GetStringLenX - _string02- length=32
37 milliseconds, strlen32M - _string03- length=64
39 milliseconds, strlen32 - _string03- length=64
53 milliseconds, szLen - _string03- length=64
57 milliseconds, GetStringLenX - _string03- length=64
********** END **********
CASE 2 ---------------------------------------------------
the addresses are misaligned by 1 byte
---------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****
10 milliseconds, strlen32M - _string01- length=16
12 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16
15 milliseconds, strlen32 - _string02- length=32
15 milliseconds, strlen32M - _string02- length=32
16 milliseconds, strlen32 - _string01- length=16
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64
34 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenX - _string02- length=32
53 milliseconds, szLen - _string03- length=64
56 milliseconds, GetStringLenX - _string03- length=64
********** END **********
CASE 3 -----------------------------------------------------
the addresses are misaligned by 2 bytes
-----------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****
10 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, szLen - _string01- length=16
12 milliseconds, GetStringLenX - _string01- length=16
15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64
34 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenX - _string02- length=32
53 milliseconds, szLen - _string03- length=64
65 milliseconds, GetStringLenX - _string03- length=64
********** END **********
CASE 4 ----------------------------------------------------
the addresses are misaligned by 3 bytes
----------------------------------------------------
Quote
...
-------------- START ----------------
...
***** Time table *****
10 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16
15 milliseconds, strlen32 - _string02- length=32
16 milliseconds, strlen32M - _string02- length=32
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64
34 milliseconds, szLen - _string02- length=32
39 milliseconds, GetStringLenX - _string02- length=32
56 milliseconds, szLen - _string03- length=64
56 milliseconds, GetStringLenX - _string03- length=64
********** END **********
FOR 128 BYTES
Quote
...
-------------- START ----------------
...
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
13 milliseconds, szLen - _string01- length=16
13 milliseconds, GetStringLenX - _string01- length=16
15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
25 milliseconds, strlen32M - _string03- length=64
26 milliseconds, strlen32 - _string03- length=64
34 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenX - _string02- length=32
53 milliseconds, szLen - _string03- length=64
56 milliseconds, GetStringLenX - _string03- length=64
58 milliseconds, strlen32M - _string04- length=128
67 milliseconds, strlen32 - _string04- length=128
92 milliseconds, szLen - _string04- length=128
98 milliseconds, GetStringLenX - _string04- length=128
********** END **********
Hi Rui,
I moved the thread so it would be seen by the algo folks.
Here are my timings on my 3 gig Core2 quad.
*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
-------------- START ----------------
7 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
12 milliseconds, GetStringLenX, _string01- length=16
7 milliseconds, szLen, _string01- length=16
10 milliseconds, strlen32, _string02 - length=32
9 milliseconds, strlen32M, _string02 - length=32
22 milliseconds, GetStringLenX, _string02- length=32
12 milliseconds, szLen, _string02- length=32
18 milliseconds, strlen32, _string03 - length=64
18 milliseconds, strlen32M, _string03 - length=64
42 milliseconds, GetStringLenX, _string03- length=64
23 milliseconds, szLen, _string03- length=64
32 milliseconds, strlen32, _string04 - length=128
33 milliseconds, strlen32M, _string04 - length=128
64 milliseconds, GetStringLenX, _string04- length=128
44 milliseconds, szLen, _string04- length=128
*** Press any key to get the time table ***
***** Time table *****
6 milliseconds, strlen32M - _string01- length=16
7 milliseconds, szLen - _string01- length=16
7 milliseconds, strlen32 - _string01- length=16
9 milliseconds, strlen32M - _string02- length=32
10 milliseconds, strlen32 - _string02- length=32
12 milliseconds, szLen - _string02- length=32
12 milliseconds, GetStringLenX - _string01- length=16
18 milliseconds, strlen32 - _string03- length=64
18 milliseconds, strlen32M - _string03- length=64
22 milliseconds, GetStringLenX - _string02- length=32
23 milliseconds, szLen - _string03- length=64
32 milliseconds, strlen32 - _string04- length=128
33 milliseconds, strlen32M - _string04- length=128
42 milliseconds, GetStringLenX - _string03- length=64
44 milliseconds, szLen - _string04- length=128
64 milliseconds, GetStringLenX - _string04- length=128
********** END **********
I have an implementation for x64 with SSE2 extensions, if you are interested.
Translating it to x86 is very easy.
Regards.
Quote from: RuiLoureiro on June 10, 2014, 06:54:45 PM
note II: Jochen, it seems that it is not for you
because...
::) :(
AMD Athlon(tm) Dual Core Processor 4450B (MMX, SSE, SSE2, SSE3)
-------------- START ----------------
14 milliseconds, strlen32, _string01- length=16
11 milliseconds, strlen32M, _string01- length=16
24 milliseconds, GetStringLenX, _string01- length=16
10 milliseconds, szLen, _string01- length=16
19 milliseconds, strlen32, _string02 - length=32
16 milliseconds, strlen32M, _string02 - length=32
37 milliseconds, GetStringLenX, _string02- length=32
17 milliseconds, szLen, _string02- length=32
34 milliseconds, strlen32, _string03 - length=64
33 milliseconds, strlen32M, _string03 - length=64
66 milliseconds, GetStringLenX, _string03- length=64
37 milliseconds, szLen, _string03- length=64
56 milliseconds, strlen32, _string04 - length=128
55 milliseconds, strlen32M, _string04 - length=128
122 milliseconds, GetStringLenX, _string04- length=128
66 milliseconds, szLen, _string04- length=128
*** Press any key to get the time table ***
***** Time table *****
10 milliseconds, szLen - _string01- length=16
11 milliseconds, strlen32M - _string01- length=16
14 milliseconds, strlen32 - _string01- length=16
16 milliseconds, strlen32M - _string02- length=32
17 milliseconds, szLen - _string02- length=32
19 milliseconds, strlen32 - _string02- length=32
24 milliseconds, GetStringLenX - _string01- length=16
33 milliseconds, strlen32M - _string03- length=64
34 milliseconds, strlen32 - _string03- length=64
37 milliseconds, szLen - _string03- length=64
37 milliseconds, GetStringLenX - _string02- length=32
55 milliseconds, strlen32M - _string04- length=128
56 milliseconds, strlen32 - _string04- length=128
66 milliseconds, szLen - _string04- length=128
66 milliseconds, GetStringLenX - _string03- length=64
122 milliseconds, GetStringLenX - _string04- length=128
********** END **********
if you want to see the other algorithms perform, try longer strings
many of them work well on strings of, say, 1000 bytes
that seems a little impractical to me, but i can see cases where it might be useful
generally, we think of display strings, which are shorter
but, fully qualified path names can be much longer
and, dealing with text files, you might want to access sentences, paragraphs, or even sections
:biggrin:
Hi,
Thank you Hutch, Jochen and Dave
Cpu2,
for now i don't want to use MMX, SSE. Thanks.
For me, it is useful for lengths of 30 bytes (+/-)
So, 64 is good!
Now, the best one is just this
(for length=0 it is very very fast !!! ;) )
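OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
GetStringLenY proc pStr:DWORD
mov edx, [esp+4] ; pStr (no stack frame, so the argument is at esp+4)
mov eax, -1 ; index starts at -1, the loop increments first
@@: add eax, 1 ; next index
movzx ecx, byte ptr [edx+eax] ; load the current byte
or ecx, ecx ; is it the terminator ?
jnz short @B ; no: keep scanning
ret 4 ; eax = length
GetStringLenY endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef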
Please,
could you run TestString16A.exe and TestString64.exe and
post the results here ? Only the «Time table».
Thanks
note: your computers are faster. I optimize based
on mine.
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, GetStringLenY - _string01- length=16
13 milliseconds, szLen - _string01- length=16
15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
35 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenY - _string02- length=32
37 milliseconds, strlen32M - _string03- length=64
40 milliseconds, strlen32 - _string03- length=64
54 milliseconds, szLen - _string03- length=64
58 milliseconds, strlen32M - _string04- length=128
58 milliseconds, GetStringLenY - _string03- length=64
58 milliseconds, strlen32 - _string04- length=128
93 milliseconds, szLen - _string04- length=128
100 milliseconds, GetStringLenY - _string04- length=128
********** END **********
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****
96 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
99 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
122 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
124 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
127 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
127 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
157 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
169 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
179 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
********** END 2 **********
Quote from: RuiLoureiro on June 11, 2014, 01:42:35 AM
Now, the best one is just this
(for length=0 it is very very fast !!! ;) )
For any other length, it could be a bit faster ;-)
No sources??
Rui,
results from TestString16A.exe:
STRINGS:
a bcd efg hij klm nop
ab cdefg hijk lmnopA
abc de fghijkl mn op
abc defg hi jk lm nop
abc de fghijkl mn op A
abc defg hi jk lm nop
X is less than Y
ShowResultXY
X is EQUAL Y
ShowResultXY
X is greater than Y
ShowResultXY
X is less than Y
ShowResultXY
X is EQUAL Y
ShowResultXY
X is greater than Y
ShowResultXY
X is less than Y
ShowResultXY
X is EQUAL Y
ShowResultXY
X is greater than Y
ShowResultXY
64 milliseconds, CompareStringXYS, _string01X, _string01Y
42 milliseconds, CompareStringXYS, _string02X, _string02Y
45 milliseconds, CompareStringXYS, _string03X, _string03Y
28 milliseconds, CompareStringXYT, _string01X, _string01Y
28 milliseconds, CompareStringXYT, _string02X, _string02Y
28 milliseconds, CompareStringXYT, _string03X, _string03Y
48 milliseconds, CompareStringXYBS, _string01X, _string01Y
50 milliseconds, CompareStringXYBS, _string02X, _string02Y
52 milliseconds, CompareStringXYBS, _string03X, _string03Y
*** Press any key to get the time table ***
------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
***** Time table *****
28 milliseconds, CompareStringXYT -_string03X GREATER _string03Y-16 bytes
28 milliseconds, CompareStringXYT -_string02X EQUAL _string02Y-16 bytes
28 milliseconds, CompareStringXYT -_string01X LESS _string01Y-16 bytes
42 milliseconds, CompareStringXYS -_string02X EQUAL _string02Y-16 bytes
45 milliseconds, CompareStringXYS -_string03X GREATER _string03Y-16 bytes
48 milliseconds, CompareStringXYBS -_string01X LESS _string01Y-16 bytes
50 milliseconds, CompareStringXYBS -_string02X EQUAL _string02Y-16 bytes
52 milliseconds, CompareStringXYBS -_string03X GREATER _string03Y-16 bytes
64 milliseconds, CompareStringXYS -_string01X LESS _string01Y-16 bytes
********** END 2 **********
The results from TestString64.exe:
*** string01 ***16
16
16
16
*** string02 ***32
32
32
32
*** string03 ***64
64
64
64
*** string04 ***128
128
128
128
-------------- START ----------------
9 milliseconds, strlen32, _string01- length=16
6 milliseconds, strlen32M, _string01- length=16
13 milliseconds, GetStringLenY, _string01- length=16
5 milliseconds, szLen, _string01- length=16
7 milliseconds, strlen32, _string02 - length=32
5 milliseconds, strlen32M, _string02 - length=32
24 milliseconds, GetStringLenY, _string02- length=32
8 milliseconds, szLen, _string02- length=32
13 milliseconds, strlen32, _string03 - length=64
10 milliseconds, strlen32M, _string03 - length=64
35 milliseconds, GetStringLenY, _string03- length=64
16 milliseconds, szLen, _string03- length=64
28 milliseconds, strlen32, _string04 - length=128
26 milliseconds, strlen32M, _string04 - length=128
55 milliseconds, GetStringLenY, _string04- length=128
42 milliseconds, szLen, _string04- length=128
*** Press any key to get the time table ***
------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------
***** Time table *****
5 milliseconds, strlen32M - _string02- length=32
5 milliseconds, szLen - _string01- length=16
6 milliseconds, strlen32M - _string01- length=16
7 milliseconds, strlen32 - _string02- length=32
8 milliseconds, szLen - _string02- length=32
9 milliseconds, strlen32 - _string01- length=16
10 milliseconds, strlen32M - _string03- length=64
13 milliseconds, strlen32 - _string03- length=64
13 milliseconds, GetStringLenY - _string01- length=16
16 milliseconds, szLen - _string03- length=64
24 milliseconds, GetStringLenY - _string02- length=32
26 milliseconds, strlen32M - _string04- length=128
28 milliseconds, strlen32 - _string04- length=128
35 milliseconds, GetStringLenY - _string03- length=64
42 milliseconds, szLen - _string04- length=128
55 milliseconds, GetStringLenY - _string04- length=128
********** END **********
Gunther
Quote from: jj2007 on June 11, 2014, 02:12:04 AM
Quote from: RuiLoureiro on June 11, 2014, 01:42:35 AM
Now, the best one is just this
(for length=0 it is very very fast !!! ;) )
For any other length, it could be a bit faster ;-)
No sources??
More sources, Jochen ? You have them in the topic "Sorting strings".
No problem with this code, i post everything !
How would you improve it a bit ?
check the old forum - methods to get the string length have been discussed to death.
qWord,
More than getting the best or the fastest
code, i like to write the way i think, following
my own logic. Meanwhile, i try to compare the
logic of some faster procedures with the way i do it.
And i have my conclusions.
In this case, i read strlen32 written by Agner Fog
(i don't need to use strlen) and i wrote a modified
version and tested it in the way you know.
I posted it because it is an optimized version
of an optimized version from Agner Fog.
About sorting strings, i want to add it in
the next version of the linked list project.
In my own projects i don't use null terminated strings.
Of course, the calculator doesn't use them.
As we see, when the length is not more than a few
bytes i can use any version, optimized or not.
To me, complex methods to get the string length
belong in the dustbin.
EDIT: i didn't do things the way i saw them done in the old forum.
Gunther,
thanks !
Jochen,
where are you ? Are you sleeping ?
where is your answer ?
Try this :biggrin:
GetStringLenX PROC pStr:DWORD
mov eax,pStr
.while (BYTE PTR[eax])
inc eax
.endw
sub eax,pStr
ret
GetStringLenX ENDP
Your suggestion, GetStringLenZ, is worse.
Taking the difference of addresses is worse.
As we can see below, using szLen
or GetStringLenY makes no difference
on my system,
up to length=32 or 64.
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
12 milliseconds, GetStringLenY - _string01- length=16
13 milliseconds, szLen - _string01- length=16
15 milliseconds, strlen32M - _string02- length=32
15 milliseconds, strlen32 - _string02- length=32
35 milliseconds, szLen - _string02- length=32
36 milliseconds, GetStringLenY - _string02- length=32
37 milliseconds, strlen32M - _string03- length=64
40 milliseconds, strlen32 - _string03- length=64
54 milliseconds, szLen - _string03- length=64
58 milliseconds, strlen32M - _string04- length=128
58 milliseconds, GetStringLenY - _string03- length=64
58 milliseconds, strlen32 - _string04- length=128
93 milliseconds, szLen - _string04- length=128
100 milliseconds, GetStringLenY - _string04- length=128
********** END **********
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****
9 milliseconds, strlen32M - _string01- length=16
10 milliseconds, strlen32 - _string01- length=16
14 milliseconds, strlen32M - _string02- length=32
14 milliseconds, szLen - _string01- length=16
15 milliseconds, strlen32 - _string02- length=32
18 milliseconds, GetStringLenZ - _string01- length=16
34 milliseconds, szLen - _string02- length=32
37 milliseconds, strlen32M - _string03- length=64
38 milliseconds, strlen32 - _string03- length=64
40 milliseconds, GetStringLenZ - _string02- length=32
53 milliseconds, szLen - _string03- length=64
58 milliseconds, strlen32 - _string04- length=128
58 milliseconds, strlen32M - _string04- length=128
68 milliseconds, GetStringLenZ - _string03- length=64
96 milliseconds, szLen - _string04- length=128
123 milliseconds, GetStringLenZ - _string04- length=128
********** END **********
Quote from: RuiLoureiro on June 11, 2014, 05:53:46 AM
Jochen,
where are you ? Are you sleeping ?
where is your answer ?
Yes, I was sleeping, and here is my answer (for a 100 byte string):
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
23657 cycles for 100 * Rui
3942 cycles for 100 * MB
13843 cycles for 100 * Masm32
14124 cycles for 100 * CRT
23757 cycles for 100 * Habran
As qWord wrote above,
we have tested it already.
the routines we have analyzed before have typically been optimized for longer strings
for shorter strings, say 50 characters or less, those routines have too much overhead
so, i tried to keep overhead to a minimum
and - access data as 4-aligned dwords
it hasn't been tested, but maybe it will give you some ideas....
OPTION PROLOGUE:None
OPTION EPILOGUE:None
ShortLen PROC _lpStr:LPSTR
mov edx,[esp+4]
test dl,3
mov ecx,[edx]
jz start_dwords3
xor eax,eax
inc edx
test cl,cl
jz all_done
test dl,3
jz start_dwords1
inc eax
inc edx
test ch,ch
jz all_done
test dl,3
jz start_dwords1
inc eax
inc edx
test ecx,0FF0000h
jz all_done
jmp short start_dwords2
start_dwords1:
add edx,4
start_dwords2:
mov ecx,[edx]
start_dwords3:
test cl,cl
mov al,0
jz end_dwords
test ch,ch
mov al,1
jz end_dwords
test ecx,0FF0000h
mov al,2
jz end_dwords
test ecx,0FF000000h
mov al,3
jnz start_dwords1
end_dwords:
sub eax,[esp+4]
add eax,edx
all_done:
ret 4
ShortLen ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
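The idea described above (single byte steps until the address is 4-aligned, then one aligned dword per pass with each byte tested in turn) can be sketched in C as a reference to compare an untested routine against. This is not a translation of ShortLen: the name short_len_sketch is made up for illustration, little-endian byte order is assumed, and the dword load may read up to 3 bytes past the terminator, which stays inside the same aligned dword on x86 but is outside what ISO C formally guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a reference sketch, not a translation of ShortLen. */
size_t short_len_sketch(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Byte steps until p is 4-aligned (at most 3 iterations). */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* One aligned dword per pass, tested byte by byte (little-endian). */
    for (;;) {
        uint32_t v;
        memcpy(&v, p, sizeof v);                 /* aligned 4-byte load */
        if ((v & 0x000000FFu) == 0) return (size_t)(p - s);
        if ((v & 0x0000FF00u) == 0) return (size_t)(p - s) + 1;
        if ((v & 0x00FF0000u) == 0) return (size_t)(p - s) + 2;
        if ((v & 0xFF000000u) == 0) return (size_t)(p - s) + 3;
        p += 4;                                  /* no zero byte, next dword */
    }
}
```

Note that after the byte steps the pointer already sits on the first aligned dword, so the dword loop has to load from it before advancing by 4.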
This very simple algo is still one of my favourites, mainly because it's small, simple and flexible enough to inline it into the middle of a larger more complex algo. The aligned DWORD read versions will always be faster on longer linear reads but often it's of no real gain in a more complex set of tasks.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
slen proc ptxt:DWORD
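mov eax, [esp+4] ; eax = string address
sub eax, 1 ; back up one byte, the loop increments first
lbl0:
add eax, 1 ; advance to the next byte
cmp BYTE PTR [eax], 0 ; terminator ?
jne lbl0 ; no: keep scanning
sub eax, [esp+4] ; length = end address - start address
ret 4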
slen endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
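For comparison, the two byte-at-a-time styles that appear in this thread can be written side by side in C. The index form corresponds to GetStringLenX/GetStringLenY (fixed base address, the counter is the length), the pointer form to slen and the .while version (moving pointer, length = end minus start). The function names are made up for this sketch, and it makes no claim about which form is faster; a compiler may well emit the same loop for both.

```c
#include <assert.h>
#include <stddef.h>

/* Index form: the base pointer stays fixed and the index is the length. */
size_t len_by_index(const char *s)
{
    assert(s != NULL);
    size_t i = 0;
    while (s[i] != '\0')
        ++i;
    return i;
}

/* Pointer form: the pointer moves and the length is end minus start. */
size_t len_by_pointer(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```

Both forms touch one byte per pass, so any difference between them comes from the loop overhead only, not from the amount of memory read.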
Well, well... so here are some brand-new algos, tested on 30 and 100 byte strings 8)
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
11619 cycles for 100 * Rui
3512 cycles for 100 * MB
4734 cycles for 100 * Masm32
4036 cycles for 100 * CRT
11969 cycles for 100 * Habran
4930 cycles for 100 * ShortLen (Dave)
11413 cycles for 100 * slen (Hutch)
11615 cycles for 100 * Rui
3511 cycles for 100 * MB
5915 cycles for 100 * Masm32
4034 cycles for 100 * CRT
11841 cycles for 100 * Habran
4932 cycles for 100 * ShortLen (Dave)
11412 cycles for 100 * slen (Hutch)
32727 cycles for 100 * Rui
5932 cycles for 100 * MB
14423 cycles for 100 * Masm32
13117 cycles for 100 * CRT
33202 cycles for 100 * Habran
15347 cycles for 100 * ShortLen (Dave)
32430 cycles for 100 * slen (Hutch)
32743 cycles for 100 * Rui
5515 cycles for 100 * MB
14442 cycles for 100 * Masm32
13123 cycles for 100 * CRT
33367 cycles for 100 * Habran
15324 cycles for 100 * ShortLen (Dave)
32421 cycles for 100 * slen (Hutch)
100 = eax Rui
100 = eax MB
100 = eax Masm32
100 = eax CRT
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)
The string length topic was beaten to death. Anyway, here it is again:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
8547 cycles for 100 * Rui
1234 cycles for 100 * MB
4607 cycles for 100 * Masm32
2826 cycles for 100 * CRT
6069 cycles for 100 * Habran
3855 cycles for 100 * ShortLen (Dave)
8492 cycles for 100 * slen (Hutch)
8494 cycles for 100 * Rui
1233 cycles for 100 * MB
4609 cycles for 100 * Masm32
2812 cycles for 100 * CRT
6042 cycles for 100 * Habran
5088 cycles for 100 * ShortLen (Dave)
8486 cycles for 100 * slen (Hutch)
18892 cycles for 100 * Rui
2545 cycles for 100 * MB
9483 cycles for 100 * Masm32
8950 cycles for 100 * CRT
20342 cycles for 100 * Habran
12571 cycles for 100 * ShortLen (Dave)
19062 cycles for 100 * slen (Hutch)
18737 cycles for 100 * Rui
2545 cycles for 100 * MB
9462 cycles for 100 * Masm32
7726 cycles for 100 * CRT
20296 cycles for 100 * Habran
11327 cycles for 100 * ShortLen (Dave)
19168 cycles for 100 * slen (Hutch)
100 = eax Rui
100 = eax MB
100 = eax Masm32
100 = eax CRT
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)
--- ok ---
Gunther
Quote
Hutch:
This very simple algo is still one of my favourites, mainly because its small,
simple and flexible enough...
In DOS we used junk like rep scasb.
But i have put that kind of instruction in the
dustbin. Now i avoid moving the addresses.
I like to use the index (=length, size,...) and
a minimum number of registers.
In any case, for me, strlen32 or strlen32M is good.
Remember that, in many applications,
we need to remove input spaces or to find CR/LF and then we get the length. So this is more important than
just getting the length. For me, it is.
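A minimal C sketch of that combined job, under assumptions of my own: only leading spaces are skipped, and the line ends at the first CR, LF or terminating zero. The name trimmed_line_len and its parameters are made up for illustration and are not taken from any code in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: skip leading spaces, then count up to CR, LF or NUL.
   *start receives the address of the first non-space byte. */
size_t trimmed_line_len(const char *s, const char **start)
{
    assert(s != NULL && start != NULL);
    while (*s == ' ')                 /* remove input spaces */
        ++s;
    *start = s;

    size_t n = 0;                     /* index = length, base stays fixed */
    while (s[n] != '\0' && s[n] != '\r' && s[n] != '\n')
        ++n;
    return n;
}
```

Here the scan has to look at every byte for three different values anyway, so the length comes out of the same loop at no extra cost.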
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
12473 cycles for 100 * Rui
3762 cycles for 100 * MB
12026 cycles for 100 * Masm32
5466 cycles for 100 * CRT
14959 cycles for 100 * Habran
11102 cycles for 100 * ShortLen (Dave)
12101 cycles for 100 * slen (Hutch)
12692 cycles for 100 * Rui
3699 cycles for 100 * MB
11994 cycles for 100 * Masm32
5172 cycles for 100 * CRT
15321 cycles for 100 * Habran
11575 cycles for 100 * ShortLen (Dave)
12105 cycles for 100 * slen (Hutch)
28829 cycles for 100 * Rui
7346 cycles for 100 * MB
26394 cycles for 100 * Masm32
15833 cycles for 100 * CRT
35733 cycles for 100 * Habran
27929 cycles for 100 * ShortLen (Dave)
27561 cycles for 100 * slen (Hutch)
27941 cycles for 100 * Rui
7318 cycles for 100 * MB
26319 cycles for 100 * Masm32
15981 cycles for 100 * CRT
36253 cycles for 100 * Habran
24471 cycles for 100 * ShortLen (Dave)
27507 cycles for 100 * slen (Hutch)
100 = eax Rui
100 = eax MB
100 = eax Masm32
100 = eax CRT
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)
--- ok ---
Jochen,
Sorry, i think that i misunderstood your question.
HERE are strlen32, strlen32M
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32 proc pBuf:DWORD
push ebx
mov ecx, [esp+8] ; get pointer to string
mov eax, ecx ; copy pointer
and ecx, 3 ; lower 2 bits of address, check alignment
jz L2 ; string is aligned by 4. Go to loop
and eax, -4 ; align pointer by 4
mov ebx, [eax] ; read from nearest preceding boundary
shl ecx, 3 ; mul by 8 = displacement in bits
mov edx, -1
shl edx, cl ; make byte mask
not edx ; mask = 0FFH for false bytes
or ebx, edx ; mask out false bytes
; check first four bytes for zero
;-----------------------------------
lea ecx, [ebx-01010101H] ; subtract 1 from each byte
not ebx ; invert all bytes
and ecx, ebx ; and these two
and ecx, 80808080H ; test all sign bits
jnz L3 ; zero-byte found
; Main loop, read 4 bytes aligned
;-----------------------------------
L1: add eax, 4 ; increment pointer by 4
L2: mov ebx, [eax] ; read 4 bytes of string
lea ecx, [ebx-01010101H] ; subtract 1 from each byte
not ebx ; invert all bytes
and ecx, ebx ; and these two
and ecx, 80808080H ; test all sign bits
jz L1 ; no zero bytes, continue loop
L3: bsf ecx, ecx ; find right-most 1-bit
shr ecx, 3 ; divide by 8 = byte index
sub eax, [esp+8] ; subtract start address
add eax, ecx ; add index to byte
pop ebx
ret 4
strlen32 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;******************************************************************************
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32M proc pBuf:DWORD
mov ecx, [esp+4] ; get pointer to string
mov eax, ecx ; copy pointer
and ecx, 3 ; lower 2 bits of address, check alignment
jz L2 ; string is aligned by 4. Go to loop
and eax, -4 ; align pointer by 4
shl ecx, 3 ; mul by 8 = displacement in bits
mov edx, -1
shl edx, cl ; make byte mask
not edx ; mask = 0FFH for false bytes
or edx, [eax] ; read from nearest preceding boundary
; check first four bytes for zero
;-----------------------------------
lea ecx, [edx-01010101H] ; subtract 1 from each byte
not edx ; invert all bytes
and ecx, edx ; and these two
and ecx, 80808080H ; test all sign bits
jnz L3 ; zero-byte found
; Main loop, read 4 bytes aligned
;-----------------------------------
L1: add eax, 4 ; increment pointer by 4
L2: mov edx, [eax] ; read 4 bytes of string
lea ecx, [edx-01010101H] ; subtract 1 from each byte
not edx ; invert all bytes
and ecx, edx ; and these two
and ecx, 80808080H ; test all sign bits
jz L1 ; no zero bytes, continue loop
L3: bsf ecx, ecx ; find right-most 1-bit
shr ecx, 3 ; divide by 8 = byte index
sub eax, [esp+4] ; subtract start address
add eax, ecx ; add index to byte
ret 4
strlen32M endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
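The core of both listings is the test that finds a zero byte inside a dword without looking at the bytes one by one. A C sketch of the three steps (the lea/not/and/and sequence, the bsf + shr 3 step, and the mask that neutralizes the bytes in front of an unaligned string) may make the comments easier to follow. The function names are made up for this sketch, and the bit loop is only a portable stand-in for bsf.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if and only if at least one byte of v is zero
   (lea ecx,[v-01010101H] / not v / and ecx,v / and ecx,80808080H). */
uint32_t zero_byte_mask(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Index 0..3 (little-endian order) of the first zero byte.
   mask must be a nonzero result of zero_byte_mask. */
unsigned first_zero_index(uint32_t mask)
{
    assert(mask != 0);
    unsigned bit = 0;
    while (((mask >> bit) & 1u) == 0)   /* portable stand-in for bsf */
        ++bit;
    return bit >> 3;                    /* bit position / 8 = byte index */
}

/* Force the k bytes in front of the string (k = address AND 3) to 0FFh,
   so that they can never look like a terminator. */
uint32_t mask_leading_bytes(uint32_t v, unsigned k)
{
    assert(k < 4);
    return v | ~(0xFFFFFFFFu << (8 * k));
}
```

One detail worth knowing: bytes above the first zero byte can be flagged falsely (a 01h byte right after a 00h byte, because of the borrow), but the lowest set bit always belongs to the first real zero byte, which is why taking the right-most 1-bit is safe.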
Quote from: RuiLoureiro on June 11, 2014, 08:59:22 PM
Jochen,
Sorry, i think that i misunderstood your question.
HERE are strlen32, strlen32M
OK, integrated in attached testbed. They are fast indeed.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
11620 cycles for 100 * Rui
3512 cycles for 100 * MB
4338 cycles for 100 * strlen32
4755 cycles for 100 * strlen32M
11842 cycles for 100 * Habran
4932 cycles for 100 * ShortLen (Dave)
11417 cycles for 100 * slen (Hutch)
11628 cycles for 100 * Rui
3514 cycles for 100 * MB
4331 cycles for 100 * strlen32
4736 cycles for 100 * strlen32M
11842 cycles for 100 * Habran
4934 cycles for 100 * ShortLen (Dave)
11425 cycles for 100 * slen (Hutch)
32747 cycles for 100 * Rui
5516 cycles for 100 * MB
11223 cycles for 100 * strlen32
13623 cycles for 100 * strlen32M
33197 cycles for 100 * Habran
15323 cycles for 100 * ShortLen (Dave)
32441 cycles for 100 * slen (Hutch)
32974 cycles for 100 * Rui
5515 cycles for 100 * MB
11256 cycles for 100 * strlen32
13629 cycles for 100 * strlen32M
33159 cycles for 100 * Habran
15327 cycles for 100 * ShortLen (Dave)
32436 cycles for 100 * slen (Hutch)
Quote from: nidud on June 11, 2014, 08:48:19 PM
Dave's function fails (3, 700)?:
Change
ShortLen PROC _lpStr:LPSTR
to
ShortLen PROC
; _lpStr:LPSTR
Quote from: nidud on June 11, 2014, 09:26:44 PM
fast :t
...
561 cycles - JJ
Too slow for my taste, but changing
void len(string)
to
void Len(string)
helps a lot ;-)
You need to write a simple algo to sort it, Jochen.
Where is your algo ? I bet it uses rep scasb !
This table is sorted.
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3704 cycles for 100 * MB
5093 cycles for 100 * strlen32M
5349 cycles for 100 * strlen32
11087 cycles for 100 * ShortLen (Dave)
12093 cycles for 100 * slen (Hutch)
12549 cycles for 100 * Rui
15481 cycles for 100 * Habran
3742 cycles for 100 * MB
5332 cycles for 100 * strlen32
5462 cycles for 100 * strlen32M
11163 cycles for 100 * ShortLen (Dave)
12060 cycles for 100 * slen (Hutch)
12775 cycles for 100 * Rui
14858 cycles for 100 * Habran
7334 cycles for 100 * MB
17334 cycles for 100 * strlen32M
17613 cycles for 100 * strlen32
27930 cycles for 100 * slen (Hutch)
28026 cycles for 100 * ShortLen (Dave)
28658 cycles for 100 * Rui
43521 cycles for 100 * Habran
7344 cycles for 100 * MB
17316 cycles for 100 * strlen32M
17779 cycles for 100 * strlen32
24022 cycles for 100 * ShortLen (Dave)
27805 cycles for 100 * slen (Hutch)
28074 cycles for 100 * Rui
38015 cycles for 100 * Habran
100 = eax Rui
100 = eax MB
100 = eax strlen32
100 = eax strlen32M
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)
--- ok ---
:biggrin:
we have optimizations and ... optimizations !
for all tastes ! (we see it not only here)
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long - <100,200,300,400,500,700,1000>
------------------------------------------------------
2516 cycles - std
1128 cycles - Rui
1297 cycles - habran
1048 cycles - Dave
1443 cycles - Hutch
1200 cycles - JJ
839 cycles - Rui32
841 cycles - Rui32M
13747 cycles - std
7607 cycles - Rui
9441 cycles - habran
6250 cycles - Dave
7470 cycles - Hutch
7079 cycles - JJ
4154 cycles - Rui32
4137 cycles - Rui32M
--- ok ---
Quote
yep, that was better
What ? Did you see it ? Where is it ? Show us. I never saw, i never read any proc written by Jochen.
We should compare only comparable things. This is what we may see:
for my taste this is junk
Quote
align 16
TestA_s:
NameA equ Rui ; assign a descriptive name here
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
push offset somestring
call GetStringLenY ; Rui
dec ebx
.Until Sign?
ret
TestA endp
TestA_endp:
; ---------------------------------------------------
align 16
TestB_s:
NameB equ MB ; assign a descriptive name here
TestB proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
void Len(offset somestring)
dec ebx
.Until Sign?
ret
TestB endp
TestB_endp:
align 16
TestC_s:
; ----------------------------------------------------
; useC=0 ; uncomment to exclude TestC
NameC equ <strlen32> ; assign a descriptive name here
TestC proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
invoke strlen32, offset somestring
dec ebx
.Until Sign?
ret
TestC endp
TestC_endp:
Quote from: nidud on June 11, 2014, 08:48:19 PM
Dave's function fails (3, 700)?:
0000000A != 3 - STRLEN error
000003BC != 700 - STRLEN error
i don't understand what that means
i'll fix it if you like
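For anyone else puzzled by those two lines: they look like the output of a check that compares the value a routine returns (printed in hex) with the expected length, so "0000000A != 3" would mean the routine returned 10 for a 3-byte string. Below is a minimal C sketch of such a check, assuming that reading of the format is right. The names check_lengths, byte_len and always_ten and the MAX_LEN limit are hypothetical, made up for illustration; this is not nidud's test code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>

#define MAX_LEN 1000  /* longest string in the test list (assumed) */

/* Reference routine: plain byte-at-a-time scan. */
static size_t byte_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Deliberately wrong routine, only here to show the failure path. */
static size_t always_ten(const char *s)
{
    (void)s;
    return 10;
}

/* Calls fn on strings of the given lengths. Prints one line per mismatch
   as "returned (hex) != expected (decimal)" and returns the mismatch count. */
static int check_lengths(size_t (*fn)(const char *),
                         const size_t *lens, size_t count)
{
    static char buf[MAX_LEN + 1];
    int errors = 0;

    for (size_t i = 0; i < count; i++) {
        assert(lens[i] <= MAX_LEN);
        memset(buf, 'x', lens[i]);
        buf[lens[i]] = '\0';

        size_t got = fn(buf);
        if (got != lens[i]) {
            printf("%08lX != %lu - STRLEN error\n",
                   (unsigned long)got, (unsigned long)lens[i]);
            errors++;
        }
    }
    return errors;
}
```

Calling check_lengths(always_ten, ...) with the lengths 3 and 10 reports one mismatch, for length 3, while byte_len passes every length in the short and long lists.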
Quote from: RuiLoureiro on June 11, 2014, 11:07:08 PM
Quote
yep, that was better
What? Did you see it? Where is it? Show us.
We should compare only comparable things.
the test depends on what you're after
most strings are not aligned
well - BSTR's are - and strings that are inside structures probably are
otherwise... you have to devise a test that tests all alignments
just as an example, i attached a test to look at
1) select a single core, and wait 750 mS to bind before testing
2) select loop counts that yield ~0.5 seconds per pass
3) all alignments are tested
(notice that each string is differently aligned)
4) 16 strings are tested - the overall result is divided by 16 and rounded to nearest
5) you get fewer outliers if you open a console window and type the program name than if you click on it
6) the test should show the processor (this one does not) :P
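Points 3 and 4 can be sketched in C roughly as follows. This is not the attached test, only an illustration under assumptions: the names place_strings and avg_round_nearest, the 16-byte base alignment and the fill character are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define NSTR 16   /* one string per alignment 0..15 */

/* Lays out NSTR copies of a len-byte string inside buf so that copy i
   starts at an address equal to i modulo 16. buf must hold at least
   16 + NSTR * (len + 1 + 16) bytes. */
static void place_strings(char *buf, size_t bufsize, size_t len, char *out[NSTR])
{
    assert(bufsize >= 16 + NSTR * (len + 1 + 16));

    /* round buf up to the next 16-byte boundary */
    char *p = buf + ((16 - ((uintptr_t)buf & 15)) & 15);

    for (int i = 0; i < NSTR; i++) {
        /* advance p until its low 4 bits equal i */
        p += ((unsigned)i - (unsigned)((uintptr_t)p & 15)) & 15;
        memset(p, 'x', len);
        p[len] = '\0';
        out[i] = p;
        p += len + 1;
    }
}

/* Divides the overall result by n and rounds to nearest (point 4). */
static unsigned long avg_round_nearest(unsigned long total, unsigned long n)
{
    return (total + n / 2) / n;
}
```

With 16 strings, a summed result of 2200 averages to 138 (137.5 rounded up), while 2199 averages to 137.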
Results for Jochen:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
9731 cycles for 100 * Rui
1146 cycles for 100 * MB
3871 cycles for 100 * strlen32
2730 cycles for 100 * strlen32M
8707 cycles for 100 * Habran
5088 cycles for 100 * ShortLen (Dave)
5891 cycles for 100 * slen (Hutch)
8489 cycles for 100 * Rui
2387 cycles for 100 * MB
3879 cycles for 100 * strlen32
2722 cycles for 100 * strlen32M
8698 cycles for 100 * Habran
5101 cycles for 100 * ShortLen (Dave)
5866 cycles for 100 * slen (Hutch)
18823 cycles for 100 * Rui
2029 cycles for 100 * MB
7357 cycles for 100 * strlen32
7455 cycles for 100 * strlen32M
20309 cycles for 100 * Habran
12587 cycles for 100 * ShortLen (Dave)
17441 cycles for 100 * slen (Hutch)
20010 cycles for 100 * Rui
2033 cycles for 100 * MB
7387 cycles for 100 * strlen32
7442 cycles for 100 * strlen32M
19070 cycles for 100 * Habran
12663 cycles for 100 * ShortLen (Dave)
18717 cycles for 100 * slen (Hutch)
100 = eax Rui
100 = eax MB
100 = eax strlen32
100 = eax strlen32M
100 = eax Habran
100 = eax ShortLen (Dave)
100 = eax slen (Hutch)
--- ok ---
Results for Rui:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long - <100,200,300,400,500,700,1000>
------------------------------------------------------
920 cycles - std
775 cycles - Rui
562 cycles - habran
378 cycles - Dave
544 cycles - Hutch
406 cycles - JJ
332 cycles - Rui32
246 cycles - Rui32M
209 cycles - JJ2
14480 cycles - std
10105 cycles - Rui
14099 cycles - habran
8649 cycles - Dave
9680 cycles - Hutch
7488 cycles - JJ
4513 cycles - Rui32
4634 cycles - Rui32M
1162 cycles - JJ2
--- ok ---
Gunther
Dave,
he must show the proc.
It must be clear, otherwise I give him 0.
If this were school, he would have to explain every bit.
This is my rule:
I don't accept any results unless
you show your work.
Rui,
Quote from: RuiLoureiro on June 11, 2014, 11:35:46 PM
Dave,
he must show the proc.
It must be clear, otherwise I give him 0.
If this were school, he would have to explain every bit.
This is my rule:
I don't accept any results unless
you show your work.
but slen.zip contains the source. What's your point?
Gunther
Gunther,
what are you talking about?
Could you post here the proc
that I am talking about?
Rui,
no offense. But the zip archive under post #30 contains the source. Or do you mean another procedure?
Gunther
No, I am not talking about that.
Do you know this:
This is the place to post assembler algorithms and code design for discussion, optimisation and any other...
You may do any tests you want with the code
I posted. I want to do the same. That's the point.
Read my reply #10.
nidud,
it is not correct to call them "Rui32" and "Rui32M";
call them AgnerFog.
:biggrin: :biggrin:
EDIT: Gunther,
Could I have a discussion with you about "gambuzinos"?
(A "gambuzino" is a creature that no one has ever seen and no one has ever caught. Sometimes we tell someone: go hunt "gambuzinos".)
Rui,
Quote from: RuiLoureiro on June 12, 2014, 12:46:41 AM
EDIT: Gunther,
Could I have a discussion with you about "gambuzinos"?
(A "gambuzino" is a creature that no one has ever seen and no one has ever caught. Sometimes we tell someone: go hunt "gambuzinos".)
I've read your post #10. I think that I'm talking about algorithms and posting test results (not only in your thread); I'm not talking about gambuzinos, Yetis and other impossibilities.
But anyway, it's your thread. My apologies; I won't post in your threads in the future.
Gunther
Hi,
What did you do wrong, Gunther?
I never saw anything wrong. It's clear.
My apologies.
fixed my routine - and added ShowCpu :P
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
134 135 135 135 134
141 141 141 140 141
Hi Dave,
Here are some results.
Pre-Pentium4 (SSE1)
106 106 106 106 106
104 104 104 104 104
Press any key to continue ...
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
103 104 104 104 103
111 111 111 111 111
Press any key to continue ...
HTH,
Steve N.
Dave,
results from slen2.exe:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
55 51 52 51 51
77 77 77 77 77
Press any key to continue ...
Gunther
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
79459 cycles - 0: standard (scasb)
7603 cycles - 1: AgnerFog
7785 cycles - 2: AgnerFog (unaligned)
9866 cycles - 3: Dave
16232 cycles - 0: standard (scasb)
7615 cycles - 1: AgnerFog
9145 cycles - 2: AgnerFog (unaligned)
10294 cycles - 3: Dave
16081 cycles - 0: standard (scasb)
7759 cycles - 1: AgnerFog
7763 cycles - 2: AgnerFog (unaligned)
10762 cycles - 3: Dave
you might want to increase the loop counts for better stability
Hi nidud,
results for strlen4:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
18212 cycles - 0: standard (scasb)
1871 cycles - 1: AgnerFog
1962 cycles - 2: AgnerFog (unaligned)
2789 cycles - 3: Dave
11414 cycles - 0: standard (scasb)
4372 cycles - 1: AgnerFog
4257 cycles - 2: AgnerFog (unaligned)
6608 cycles - 3: Dave
17074 cycles - 0: standard (scasb)
4358 cycles - 1: AgnerFog
4204 cycles - 2: AgnerFog (unaligned)
6618 cycles - 3: Dave
--- ok ---
Gunther
Hi,
The first time it is run, it is slow on the first test.
Regards,
Steve N.
First run.
pre-P4 (SSE1)
------------------------------------------------------
315620 cycles - 0: standard (scasb)
5001 cycles - 1: AgnerFog
4967 cycles - 2: AgnerFog (unaligned)
6741 cycles - 3: Dave
11665 cycles - 0: standard (scasb)
4997 cycles - 1: AgnerFog
4975 cycles - 2: AgnerFog (unaligned)
6764 cycles - 3: Dave
11684 cycles - 0: standard (scasb)
4993 cycles - 1: AgnerFog
4967 cycles - 2: AgnerFog (unaligned)
6780 cycles - 3: Dave
--- ok ---
Second run
pre-P4 (SSE1)
------------------------------------------------------
11813 cycles - 0: standard (scasb)
4992 cycles - 1: AgnerFog
4968 cycles - 2: AgnerFog (unaligned)
6755 cycles - 3: Dave
11679 cycles - 0: standard (scasb)
4985 cycles - 1: AgnerFog
4966 cycles - 2: AgnerFog (unaligned)
6768 cycles - 3: Dave
11696 cycles - 0: standard (scasb)
4993 cycles - 1: AgnerFog
4977 cycles - 2: AgnerFog (unaligned)
6758 cycles - 3: Dave
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
296692 cycles - 0: standard (scasb)
4193 cycles - 1: AgnerFog
4078 cycles - 2: AgnerFog (unaligned)
6334 cycles - 3: Dave
11671 cycles - 0: standard (scasb)
4148 cycles - 1: AgnerFog
4064 cycles - 2: AgnerFog (unaligned)
6268 cycles - 3: Dave
11675 cycles - 0: standard (scasb)
4209 cycles - 1: AgnerFog
4092 cycles - 2: AgnerFog (unaligned)
6357 cycles - 3: Dave
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
11809 cycles - 0: standard (scasb)
4187 cycles - 1: AgnerFog
4093 cycles - 2: AgnerFog (unaligned)
6219 cycles - 3: Dave
11793 cycles - 0: standard (scasb)
4147 cycles - 1: AgnerFog
4074 cycles - 2: AgnerFog (unaligned)
6100 cycles - 3: Dave
11810 cycles - 0: standard (scasb)
4211 cycles - 1: AgnerFog
4092 cycles - 2: AgnerFog (unaligned)
6225 cycles - 3: Dave
--- ok ---
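The first-run figures above (315620 and 296692 cycles for scasb, against roughly 11700 on every later pass) look like a cold start, where the code and the string are probably not in cache yet. One common way to keep such an outlier out of the table is to run an untimed warm-up pass first, or simply to report the minimum over several passes. A small C sketch of the latter follows; min_cycles is a hypothetical helper and not part of the posted test.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smallest of n timing samples; n must be at least 1.
   Taking the minimum drops a slow first (cold) pass automatically. */
static unsigned long min_cycles(const unsigned long *samples, size_t n)
{
    assert(n >= 1);
    unsigned long best = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}
```

Fed the three pre-P4 scasb figures from the first run (315620, 11665, 11684), it returns 11665, which is in line with the second run.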
Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz (SSE4)
------------------------------------------------------
10025 cycles - 0: standard (scasb)
6427 cycles - 1: AgnerFog
6444 cycles - 2: AgnerFog (unaligned)
11482 cycles - 3: Dave
15319 cycles - 0: standard (scasb)
7873 cycles - 1: AgnerFog
6176 cycles - 2: AgnerFog (unaligned)
11058 cycles - 3: Dave
14978 cycles - 0: standard (scasb)
7816 cycles - 1: AgnerFog
6251 cycles - 2: AgnerFog (unaligned)
10995 cycles - 3: Dave
--- ok ---
Hi nidud,
strlen5:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
22200 cycles - 0: standard (scasb)
10776 cycles - 3: Dave
10271 cycles - 5: MB - len()
7120 cycles - 1: AgnerFog
7264 cycles - 2: AgnerFog (unaligned)
3007 cycles - 6: MB - Len() SSE
2280 cycles - 4: unaligned SSE2
21590 cycles - 0: standard (scasb)
10616 cycles - 3: Dave
10136 cycles - 5: MB - len()
7059 cycles - 1: AgnerFog
17323 cycles - 2: AgnerFog (unaligned)
7226 cycles - 6: MB - Len() SSE
5413 cycles - 4: unaligned SSE2
52253 cycles - 0: standard (scasb)
25722 cycles - 3: Dave
24451 cycles - 5: MB - len()
17339 cycles - 1: AgnerFog
17349 cycles - 2: AgnerFog (unaligned)
7205 cycles - 6: MB - Len() SSE
6116 cycles - 4: unaligned SSE2
--- ok ---
Gunther
strlen5
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
86037 cycles - 0: standard (scasb)
31180 cycles - 3: Dave
33575 cycles - 5: MB - len()
23079 cycles - 1: AgnerFog
25595 cycles - 2: AgnerFog (unaligned)
21374 cycles - 6: MB - Len() SSE
18166 cycles - 4: unaligned SSE2
49577 cycles - 0: standard (scasb)
31080 cycles - 3: Dave
32727 cycles - 5: MB - len()
23139 cycles - 1: AgnerFog
25405 cycles - 2: AgnerFog (unaligned)
21643 cycles - 6: MB - Len() SSE
18152 cycles - 4: unaligned SSE2
49638 cycles - 0: standard (scasb)
31000 cycles - 3: Dave
32762 cycles - 5: MB - len()
23151 cycles - 1: AgnerFog
31292 cycles - 2: AgnerFog (unaligned)
21172 cycles - 6: MB - Len() SSE
18204 cycles - 4: unaligned SSE2