News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Improvement of the strlen function

Started by TouEnMasm, November 02, 2014, 07:24:13 PM

Previous topic - Next topic

dedndave

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

800 bytes:
8240    cycles for 10 * CRT strlen
8984    cycles for 10 * Masm32 StrLen
7501    cycles for 10 * AlignedStrLen
4380    cycles for 10 * MasmBasic Len()
7496    cycles for 10 * strlenNidud
6072    cycles for 10 * SzLen (Dave)

8281    cycles for 10 * CRT strlen
9109    cycles for 10 * Masm32 StrLen
7348    cycles for 10 * AlignedStrLen
4403    cycles for 10 * MasmBasic Len()
7625    cycles for 10 * strlenNidud
6119    cycles for 10 * SzLen (Dave)

534 bytes:
5692    cycles for 10 * CRT strlen
6445    cycles for 10 * Masm32 StrLen
5286    cycles for 10 * AlignedStrLen
3138    cycles for 10 * MasmBasic Len()
5433    cycles for 10 * strlenNidud
4224    cycles for 10 * SzLen (Dave)

5675    cycles for 10 * CRT strlen
6526    cycles for 10 * Masm32 StrLen
5159    cycles for 10 * AlignedStrLen
3163    cycles for 10 * MasmBasic Len()
5430    cycles for 10 * strlenNidud
4195    cycles for 10 * SzLen (Dave)

268 bytes:
3212    cycles for 10 * CRT strlen
3716    cycles for 10 * Masm32 StrLen
2953    cycles for 10 * AlignedStrLen
1573    cycles for 10 * MasmBasic Len()
3052    cycles for 10 * strlenNidud
2492    cycles for 10 * SzLen (Dave)

3309    cycles for 10 * CRT strlen
3580    cycles for 10 * Masm32 StrLen
2981    cycles for 10 * AlignedStrLen
1574    cycles for 10 * MasmBasic Len()
3061    cycles for 10 * strlenNidud
2580    cycles for 10 * SzLen (Dave)

2 bytes:
302     cycles for 10 * CRT strlen
281     cycles for 10 * Masm32 StrLen
320     cycles for 10 * AlignedStrLen
351     cycles for 10 * MasmBasic Len()
246     cycles for 10 * strlenNidud
339     cycles for 10 * SzLen (Dave)

299     cycles for 10 * CRT strlen
264     cycles for 10 * Masm32 StrLen
321     cycles for 10 * AlignedStrLen
351     cycles for 10 * MasmBasic Len()
245     cycles for 10 * strlenNidud
340     cycles for 10 * SzLen (Dave)

jj2007

Pretty stable timings, especially for a P4, and yours is clearly the best non-SSE algo :t

Gunther

Hi Marinus,

Quote from: Siekmanski on November 05, 2014, 12:17:44 PM
Maybe Guthers computer is overclocked ?

no, it isn't. It's an Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz 2.30GHz. That's what the System says.

Gunther
You have to know the facts before you can distort them.

FORTRANS

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

800 bytes:
4423 cycles for 10 * CRT strlen
4437 cycles for 10 * Masm32 StrLen
4513 cycles for 10 * AlignedStrLen
1381 cycles for 10 * MasmBasic Len()
4670 cycles for 10 * strlenNidud
3220 cycles for 10 * SzLen (Dave)

4424 cycles for 10 * CRT strlen
4440 cycles for 10 * Masm32 StrLen
4521 cycles for 10 * AlignedStrLen
1646 cycles for 10 * MasmBasic Len()
4313 cycles for 10 * strlenNidud
3584 cycles for 10 * SzLen (Dave)

534 bytes:
3351 cycles for 10 * CRT strlen
3429 cycles for 10 * Masm32 StrLen
3507 cycles for 10 * AlignedStrLen
950 cycles for 10 * MasmBasic Len()
2988 cycles for 10 * strlenNidud
2583 cycles for 10 * SzLen (Dave)

3115 cycles for 10 * CRT strlen
3067 cycles for 10 * Masm32 StrLen
3153 cycles for 10 * AlignedStrLen
939 cycles for 10 * MasmBasic Len()
2996 cycles for 10 * strlenNidud
2587 cycles for 10 * SzLen (Dave)

268 bytes:
1806 cycles for 10 * CRT strlen
1818 cycles for 10 * Masm32 StrLen
2177 cycles for 10 * AlignedStrLen
703 cycles for 10 * MasmBasic Len()
1829 cycles for 10 * strlenNidud
1411 cycles for 10 * SzLen (Dave)

2178 cycles for 10 * CRT strlen
1816 cycles for 10 * Masm32 StrLen
2164 cycles for 10 * AlignedStrLen
699 cycles for 10 * MasmBasic Len()
1835 cycles for 10 * strlenNidud
1190 cycles for 10 * SzLen (Dave)

2 bytes:
109 cycles for 10 * CRT strlen
104 cycles for 10 * Masm32 StrLen
379 cycles for 10 * AlignedStrLen
247 cycles for 10 * MasmBasic Len()
205 cycles for 10 * strlenNidud
284 cycles for 10 * SzLen (Dave)

232 cycles for 10 * CRT strlen
223 cycles for 10 * Masm32 StrLen
180 cycles for 10 * AlignedStrLen
246 cycles for 10 * MasmBasic Len()
203 cycles for 10 * strlenNidud
259 cycles for 10 * SzLen (Dave)

2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)

--- ok --- Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

800 bytes:
7784 cycles for 10 * CRT strlen
7381 cycles for 10 * Masm32 StrLen
6988 cycles for 10 * AlignedStrLen
2023 cycles for 10 * MasmBasic Len()
7103 cycles for 10 * strlenNidud
6031 cycles for 10 * SzLen (Dave)

7759 cycles for 10 * CRT strlen
7381 cycles for 10 * Masm32 StrLen
6958 cycles for 10 * AlignedStrLen
2020 cycles for 10 * MasmBasic Len()
7104 cycles for 10 * strlenNidud
6023 cycles for 10 * SzLen (Dave)

534 bytes:
5412 cycles for 10 * CRT strlen
5033 cycles for 10 * Masm32 StrLen
4861 cycles for 10 * AlignedStrLen
1412 cycles for 10 * MasmBasic Len()
4844 cycles for 10 * strlenNidud
4062 cycles for 10 * SzLen (Dave)

5430 cycles for 10 * CRT strlen
4975 cycles for 10 * Masm32 StrLen
4833 cycles for 10 * AlignedStrLen
1446 cycles for 10 * MasmBasic Len()
4847 cycles for 10 * strlenNidud
4053 cycles for 10 * SzLen (Dave)

268 bytes:
2908 cycles for 10 * CRT strlen
2619 cycles for 10 * Masm32 StrLen
2640 cycles for 10 * AlignedStrLen
852 cycles for 10 * MasmBasic Len()
2635 cycles for 10 * strlenNidud
2181 cycles for 10 * SzLen (Dave)

2906 cycles for 10 * CRT strlen
2595 cycles for 10 * Masm32 StrLen
2673 cycles for 10 * AlignedStrLen
853 cycles for 10 * MasmBasic Len()
2633 cycles for 10 * strlenNidud
2178 cycles for 10 * SzLen (Dave)

2 bytes:
181 cycles for 10 * CRT strlen
135 cycles for 10 * Masm32 StrLen
213 cycles for 10 * AlignedStrLen
183 cycles for 10 * MasmBasic Len()
129 cycles for 10 * strlenNidud
220 cycles for 10 * SzLen (Dave)

181 cycles for 10 * CRT strlen
135 cycles for 10 * Masm32 StrLen
213 cycles for 10 * AlignedStrLen
184 cycles for 10 * MasmBasic Len()
129 cycles for 10 * strlenNidud
220 cycles for 10 * SzLen (Dave)

2 = eax CRT strlen
2 = eax Masm32 StrLen
-1 = eax AlignedStrLen
2 = eax MasmBasic Len()
2 = eax strlenNidud
2 = eax SzLen (Dave)

--- ok ---

Gunther

Here's the result from the desktop computer:
Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

800 bytes:
4500    cycles for 10 * CRT strlen
4552    cycles for 10 * Masm32 StrLen
4264    cycles for 10 * AlignedStrLen
1202    cycles for 10 * MasmBasic Len()
4191    cycles for 10 * strlenNidud
3114    cycles for 10 * SzLen (Dave)

4511    cycles for 10 * CRT strlen
4594    cycles for 10 * Masm32 StrLen
4268    cycles for 10 * AlignedStrLen
1193    cycles for 10 * MasmBasic Len()
4214    cycles for 10 * strlenNidud
3117    cycles for 10 * SzLen (Dave)

534 bytes:
3249    cycles for 10 * CRT strlen
3216    cycles for 10 * Masm32 StrLen
3033    cycles for 10 * AlignedStrLen
858     cycles for 10 * MasmBasic Len()
2981    cycles for 10 * strlenNidud
2163    cycles for 10 * SzLen (Dave)

3251    cycles for 10 * CRT strlen
3777    cycles for 10 * Masm32 StrLen
3031    cycles for 10 * AlignedStrLen
877     cycles for 10 * MasmBasic Len()
2987    cycles for 10 * strlenNidud
2171    cycles for 10 * SzLen (Dave)

268 bytes:
1846    cycles for 10 * CRT strlen
1496    cycles for 10 * Masm32 StrLen
1815    cycles for 10 * AlignedStrLen
400     cycles for 10 * MasmBasic Len()
1750    cycles for 10 * strlenNidud
1709    cycles for 10 * SzLen (Dave)

1837    cycles for 10 * CRT strlen
1488    cycles for 10 * Masm32 StrLen
1809    cycles for 10 * AlignedStrLen
406     cycles for 10 * MasmBasic Len()
1755    cycles for 10 * strlenNidud
1124    cycles for 10 * SzLen (Dave)

2 bytes:
106     cycles for 10 * CRT strlen
247     cycles for 10 * Masm32 StrLen
404     cycles for 10 * AlignedStrLen
268     cycles for 10 * MasmBasic Len()
247     cycles for 10 * strlenNidud
296     cycles for 10 * SzLen (Dave)

167     cycles for 10 * CRT strlen
99      cycles for 10 * Masm32 StrLen
408     cycles for 10 * AlignedStrLen
271     cycles for 10 * MasmBasic Len()
248     cycles for 10 * strlenNidud
293     cycles for 10 * SzLen (Dave)

2       = eax CRT strlen
2       = eax Masm32 StrLen
-1      = eax AlignedStrLen
2       = eax MasmBasic Len()
2       = eax strlenNidud
2       = eax SzLen (Dave)

--- ok ---

Gunther
You have to know the facts before you can distort them.

TouEnMasm


The REPEAT3 added to the align 4 give better result.
here is strlenNidud at wich ,I have added a REPEAT3 macro
I hope ,I have not added errors


align 4
strlenNidudR3 proc string:DWORD
mov ecx, string
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
not eax
and eax, edx
and eax, 80808080h
jnz done
and ecx, -4 ;align 4 ;0FFFFFFFCh ;c=1100b
lup:
REPEAT 3
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
not eax
and eax, edx
and eax, 80808080h
;jz lup
jnz done
ENDM
mov eax, [ecx]
add ecx, 4
lea edx, [eax-01010101h]
;lea     ecx, [edi+0F1F1F1F1h]   ;F1 = FF - 0D - 1 search D(13),  D devient FF
not eax
and eax, edx
and eax, 80808080h
jz lup
done:
bsf eax, eax
shr eax, 3
sub ecx, string
lea eax, [ecx+eax-4]
ret
strlenNidudR3 endp
Fa is a musical note to play with CL

hutch--

I wonder at using such small test strings when anything will read a 1k string faster than you need. Even a sloppy old SCAS algo will do that job fast enough. If you are going to test string length algos, it should be on much larger test strings, well into the megabytes and if the comparisons are to be even vaguely relevant to how they get used in code, you need real time testing, not cache thrashing on small strings.

nidud

#37
deleted

dedndave

Hutch - how often do we actually use it that way?
i would guess most calls to StrLen are made for console out strings of 80 characters or less
maybe a pathname - they can be long, but usually less than 100 characters

i can't think of many applications where i wanted to get the length of a long string
maybe that's because i write the code to keep track of the length   ;)

hutch--

Dave,

For strings of that length I generally use the following style of code.


    mov eax, [esp+4]                ; get 1st arg off stack
    sub eax, 1
  lbl1:
    add eax, 1
    cmp BYTE PTR [eax], 0           ; loop through address testing for ascii 0
    jne lbl1                        ; loop back if not found

    sub eax, [esp+4]                ; sub original address from EAX to get length

    ret 4                           ; balance the stack by 4 bytes


When the timing is less than a millisecond, what is the point of chasing speed that you cannot even perceive ?

jj2007

#40
Quote from: dedndave on November 06, 2014, 12:45:48 PM
i would guess most calls to StrLen are made for console out strings of 80 characters or less
...
i can't think of many applications where i wanted to get the length of a long string

Indeed, for really long strings you would know the length. Typically, you need a fast strlen algo when processing huge amounts of strings in textfiles, see my post about 33 bytes average length of C header files. Besides, the speed of len() affects all kinds of other functions - for example, Len() is being used in about half of the ca. 200 MasmBasic functions. That is why Len() should be fast: to process millions of small strings. JWasm, for example, could benefit from a faster len() function.

But of course, we can argue along the philosophical line "why should anybody nowadays indulge in something as obsolete as assembler programming when Microsoft offers 'sufficiently' fast and safe strlen functions such as StringCbLengthA" ;-)

TouEnMasm


It's a good exercise on how optimise code.
For real fast functions,there is need of SSE intructions and a set of functions using them.
 
Fa is a musical note to play with CL

hutch--

Which is fine except they don't run on old machines. I would prefer to be working with SSE 4.1/2 but not enough machines use it yet.

jj2007

Quote from: hutch-- on November 06, 2014, 09:25:42 PMWhich is fine except they don't run on old machines. I would prefer to be working with SSE 4.1/2 but not enough machines use it yet.

Right. SSE 4.1/2 has some goodies indeed, but in most cases SSE2 does the job, and it works from P4 (introduced in 2001) onwards.

Gunther

Quote from: jj2007 on November 06, 2014, 11:20:55 PM
Right. SSE 4.1/2 has some goodies indeed, but in most cases SSE2 does the job, and it works from P4 (introduced in 2001) onwards.

That's a reasonable point of view.  :t

Gunther
You have to know the facts before you can distort them.